![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/PySpark/7.PySpark_Clustering.ipynb)

# **PySpark Tutorial-7 Clustering**

# **Overview**

This notebook clusters the iris dataset with k-means using PySpark.

### **Clustering**

*Clustering is an unsupervised learning technique: you work with data that has no target attribute or dependent variable, and the algorithm groups similar observations together.*

[Article: K-means clustering using PySpark on big data](https://towardsdatascience.com/k-means-clustering-using-pyspark-on-big-data-6214beacdc8b#:~:text=K%2Dmeans%20is%20one%20of,The%20KMeans%20function%20from%20pyspark.)

[Spark documentation: clustering](https://spark.apache.org/docs/latest/ml-clustering.html)
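
To make the idea concrete, here is a minimal pure-Python sketch of the k-means loop (Lloyd's algorithm) on a handful of one-dimensional points. It is an illustration only and not how Spark's `KMeans` is implemented; the function name `kmeans_1d`, the toy points, and the starting centers are assumptions chosen for this example.

```python
def kmeans_1d(points, centers, iterations=10):
    """Toy k-means on 1-D points; `centers` holds the initial guesses."""
    centers = list(centers)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its points
        # (an empty cluster keeps its previous center).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters


# Hypothetical petal-length-like values forming two obvious groups.
points = [1.3, 1.4, 1.5, 1.7, 4.5, 4.7, 5.0, 5.1]
centers, clusters = kmeans_1d(points, centers=[1.0, 2.0])
print(centers)
print(clusters)
```

No labels are supplied anywhere: the groups emerge only from the distances between points, which is what makes the method unsupervised.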

###  **Install Spark**



In [None]:
!pip install pyspark

### **Import Libraries and Read the File**

In [None]:
from pyspark.sql import SparkSession

In [None]:
spark = SparkSession.builder.appName('cluster').getOrCreate()

In [None]:
from pyspark.ml.clustering import KMeans

In [None]:
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/PySpark/data/iris.csv

In [None]:
dataset = spark.read.csv("iris.csv", header=True, inferSchema=True)

In [None]:
dataset.show()

+------------+-----------+------------+-----------+-------+
|sepal_length|sepal_width|petal_length|petal_width|species|
+------------+-----------+------------+-----------+-------+
|         5.1|        3.5|         1.4|        0.2| setosa|
|         4.9|        3.0|         1.4|        0.2| setosa|
|         4.7|        3.2|         1.3|        0.2| setosa|
|         4.6|        3.1|         1.5|        0.2| setosa|
|         5.0|        3.6|         1.4|        0.2| setosa|
|         5.4|        3.9|         1.7|        0.4| setosa|
|         4.6|        3.4|         1.4|        0.3| setosa|
|         5.0|        3.4|         1.5|        0.2| setosa|
|         4.4|        2.9|         1.4|        0.2| setosa|
|         4.9|        3.1|         1.5|        0.1| setosa|
|         5.4|        3.7|         1.5|        0.2| setosa|
|         4.8|        3.4|         1.6|        0.2| setosa|
|         4.8|        3.0|         1.4|        0.1| setosa|
|         4.3|        3.0|         1.1| 

In [None]:
dataset.describe().show()

+-------+------------------+-------------------+------------------+------------------+---------+
|summary|      sepal_length|        sepal_width|      petal_length|       petal_width|  species|
+-------+------------------+-------------------+------------------+------------------+---------+
|  count|               150|                150|               150|               150|      150|
|   mean| 5.843333333333335|  3.057333333333334|3.7580000000000027| 1.199333333333334|     null|
| stddev|0.8280661279778637|0.43586628493669793|1.7652982332594662|0.7622376689603467|     null|
|    min|               4.3|                2.0|               1.0|               0.1|   setosa|
|    max|               7.9|                4.4|               6.9|               2.5|virginica|
+-------+------------------+-------------------+------------------+------------------+---------+



In [None]:
dataset2 = dataset.select("sepal_length", "sepal_width", "petal_length", "petal_width")

In [None]:
dataset2.show()

+------------+-----------+------------+-----------+
|sepal_length|sepal_width|petal_length|petal_width|
+------------+-----------+------------+-----------+
|         5.1|        3.5|         1.4|        0.2|
|         4.9|        3.0|         1.4|        0.2|
|         4.7|        3.2|         1.3|        0.2|
|         4.6|        3.1|         1.5|        0.2|
|         5.0|        3.6|         1.4|        0.2|
|         5.4|        3.9|         1.7|        0.4|
|         4.6|        3.4|         1.4|        0.3|
|         5.0|        3.4|         1.5|        0.2|
|         4.4|        2.9|         1.4|        0.2|
|         4.9|        3.1|         1.5|        0.1|
|         5.4|        3.7|         1.5|        0.2|
|         4.8|        3.4|         1.6|        0.2|
|         4.8|        3.0|         1.4|        0.1|
|         4.3|        3.0|         1.1|        0.1|
|         5.8|        4.0|         1.2|        0.2|
|         5.7|        4.4|         1.5|        0.4|
|         5.

### **Import Libraries and Run Clustering**

In [None]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

In [None]:
dataset2.columns

['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

In [None]:
vec_assembler = VectorAssembler(inputCols=dataset2.columns, outputCol='features')

In [None]:
final_data = vec_assembler.transform(dataset2)

In [None]:
from pyspark.ml.feature import StandardScaler

In [None]:
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures", withStd=True, withMean=False)

In [None]:
# Compute summary statistics by fitting the StandardScaler
scalerModel = scaler.fit(final_data)

In [None]:
# Scale each feature to unit standard deviation (no centering, since withMean=False).
final_data = scalerModel.transform(final_data)
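
As a rough illustration of what scaling with `withStd=True, withMean=False` does to a single column, the pure-Python sketch below divides each value by the column's sample standard deviation without subtracting the mean. The helper name `scale_to_unit_std` is an assumption made for this example, and the five values are the first `sepal_length` entries from the table shown earlier.

```python
from statistics import stdev


def scale_to_unit_std(column):
    """Divide every value by the column's sample standard deviation."""
    s = stdev(column)  # sample (n - 1) standard deviation
    return [x / s for x in column]


# First five sepal_length values from the dataset preview.
sepal_length = [5.1, 4.9, 4.7, 4.6, 5.0]
scaled = scale_to_unit_std(sepal_length)

print(scaled)
print(round(stdev(scaled), 6))  # 1.0
```

Because the mean is not subtracted, the scaled values stay positive and are not centered on zero; only their spread changes. This matters for k-means, which relies on distances and would otherwise be dominated by the features with the largest variance.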

In [None]:
# Train a k-means model with three clusters on the scaled features.
kmeans = KMeans(featuresCol='scaledFeatures', k=3)
model = kmeans.fit(final_data)

In [None]:
# Print the cluster centers (expressed in scaled feature units).
centers = model.clusterCenters()
print("Cluster Centers: ")
for center in centers:
    print(center)

Cluster Centers: 
[6.8887588  6.04493327 2.38782168 1.74828502]
[6.05788156 7.91761264 0.83006151 0.32128819]
[8.08674985 7.02050171 3.06927278 2.5427526 ]
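
The centers above are in scaled units, which is why they do not look like iris measurements. Since the scaler only divided each feature by its standard deviation, a center can be mapped back to the original units by multiplying by that standard deviation. The sketch below does this in plain Python for the second printed center, using the `stddev` row of `dataset.describe()`; the variable names are assumptions for this example.

```python
# Standard deviations copied from the `stddev` row of dataset.describe().
stddev = [0.8280661279778637, 0.43586628493669793,
          1.7652982332594662, 0.7622376689603467]

# Second cluster center as printed above, in scaled units.
scaled_center = [6.05788156, 7.91761264, 0.83006151, 0.32128819]

# With withMean=False the scaler only divides by the standard deviation,
# so multiplying by it undoes the scaling.
original_center = [c * s for c, s in zip(scaled_center, stddev)]

print([round(v, 2) for v in original_center])  # [5.02, 3.45, 1.47, 0.24]
```

The result is a flower with short petals, roughly 1.5 long and 0.2 wide, which resembles the setosa rows at the top of the dataset.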


In [None]:
model.transform(final_data).select('prediction').show()

+----------+
|prediction|
+----------+
|         1|
|         1|
|         1|
|         1|
|         1|
|         1|
|         1|
|         1|
|         1|
|         1|
|         1|
|         1|
|         1|
|         1|
|         1|
|         1|
|         1|
|         1|
|         1|
|         1|
+----------+
only showing top 20 rows



In [None]:
model.transform(final_data).groupBy('prediction').count().show()

+----------+-----+
|prediction|count|
+----------+-----+
|         1|   49|
|         2|   55|
|         0|   46|
+----------+-----+
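
Cluster sizes alone do not show how well the clusters line up with the known species. One way to check is a contingency table of species against predicted cluster. The plain-Python sketch below builds such a table with `collections.Counter`; the `(species, prediction)` pairs are hypothetical and are not results from this notebook. In PySpark, `DataFrame.crosstab` on a DataFrame holding both columns should give a comparable table.

```python
from collections import Counter

# Hypothetical (species, prediction) pairs for illustration only.
pairs = [
    ("setosa", 1), ("setosa", 1), ("setosa", 1),
    ("versicolor", 0), ("versicolor", 0), ("versicolor", 2),
    ("virginica", 2), ("virginica", 2), ("virginica", 0),
]

# Count how often each species lands in each cluster.
table = Counter(pairs)

for species in sorted({s for s, _ in pairs}):
    row = [table[(species, cluster)] for cluster in range(3)]
    print(f"{species:<10} {row}")
```

A species concentrated in a single column means the cluster matches it well; counts spread over several columns indicate overlap between species.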



In [None]:
# A join without a key is a cross join, so rerun the fitted stages on `dataset` to keep each row's own prediction.
result = model.transform(scalerModel.transform(vec_assembler.transform(dataset)))

In [None]:
result.select(*dataset.columns, 'prediction').show()

+------------+-----------+------------+-----------+-------+----------+
|sepal_length|sepal_width|petal_length|petal_width|species|prediction|
+------------+-----------+------------+-----------+-------+----------+
|         5.1|        3.5|         1.4|        0.2| setosa|         1|
|         4.9|        3.0|         1.4|        0.2| setosa|         1|
|         4.7|        3.2|         1.3|        0.2| setosa|         1|
|         4.6|        3.1|         1.5|        0.2| setosa|         1|
|         5.0|        3.6|         1.4|        0.2| setosa|         1|
|         5.4|        3.9|         1.7|        0.4| setosa|         1|
|         4.6|        3.4|         1.4|        0.3| setosa|         1|
|         5.0|        3.4|         1.5|        0.2| setosa|         1|
|         4.4|        2.9|         1.4|        0.2| setosa|         1|
|         4.9|        3.1|         1.5|        0.1| setosa|         1|
|         5.4|        3.7|         1.5|        0.2| setosa|         1|
|     
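
A join called without a key column pairs every row of one DataFrame with every row of the other. Joining the 150 flowers with the 150 predictions this way would give 22,500 rows, with each flower repeated once per prediction. The plain-Python sketch below contrasts that Cartesian pairing with a row-aligned pairing; the rows and labels are hypothetical and kept tiny for readability.

```python
from itertools import product

rows = [(5.1, "setosa"), (4.9, "setosa"), (7.0, "versicolor")]  # hypothetical rows
predictions = [1, 1, 0]                                          # hypothetical labels

# Keyless join: every row is paired with every prediction (3 x 3 = 9 pairs).
cross = [row + (pred,) for row, pred in product(rows, predictions)]

# Row-aligned pairing: each row keeps exactly one prediction (3 pairs).
aligned = [row + (pred,) for row, pred in zip(rows, predictions)]

print(len(cross), len(aligned))  # 9 3
print(cross[:3])  # the first row repeated with each prediction
```

In Spark the aligned result is most easily obtained by keeping the original columns in the DataFrame passed to `model.transform`, so that each prediction is computed on the row it belongs to.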