## Introduction

**K-means clustering** is one of the algorithms used in **unsupervised** machine learning, so before moving on to K-means itself, let's get some background on unsupervised learning. In this setting we don't have **predefined labels**, unlike the **supervised** setting, so instead of **predicting a target** we form **clusters/groups** so that the data gets **segmented** according to the features that are fed to the model.

## What Approach Does K-Means Clustering Follow?

As discussed, this algorithm is part of unsupervised learning, so instead of making predictions we segment the data into a chosen number of clusters. K-means follows a few mathematical steps that are worth discussing:

**Step 1:** Select the number of clusters, K. There are a few ways to choose an **appropriate number of clusters**, such as the **elbow method** and **domain** knowledge.

**Step 2:** Initialize K centroids, for example by **randomly picking K points** from the dataset.

**Step 3:** Assign each data point to its **closest centroid**, then move each centroid to the mean of the points assigned to it. Repeating this until the assignments stop changing eventually forms the **clusters/groups**.
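The steps above can be sketched in plain Python. This is only a minimal illustration, not what Spark runs internally: the function name `kmeans`, the random initialization via `random.sample`, and the six hard-coded points (taken from the sample dataset used later in this article) are assumptions made for the example.

```python
import random

# The six 3-dimensional points from the sample dataset used in this article.
POINTS = [(v, v, v) for v in (0.0, 0.1, 0.2, 9.0, 9.1, 9.2)]

def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, seed=1, max_iter=20):
    """Illustrative K-means loop (hypothetical helper, not Spark's implementation)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # Step 2: pick K initial centroids at random
    assignment = None
    for _ in range(max_iter):
        # Step 3a: assign every point to its closest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # nothing moved, so the clusters are stable
        assignment = new_assignment
        # Step 3b: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignment

centroids, assignment = kmeans(POINTS, k=2)
print(centroids)
print(assignment)
```

On this dataset the loop settles on one centroid near (0.1, 0.1, 0.1) and one near (9.1, 9.1, 9.1); which of the two gets index 0 depends on the random initialization.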

Okay! Enough theory. Now let's get our hands dirty by implementing K-means clustering on Spark's official clustering [dataset](https://github.com/apache/spark/blob/master/data/mllib/sample_kmeans_data.txt). **Though this dataset is very small and peculiar, it is still enough to explain each concept of K-means clustering using PySpark's MLlib.**

In [18]:
!pip install pyspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


## Initiating the Spark Object

This is the first thing we have to do in any Spark project: **create a SparkSession**, the entry point to PySpark, so that we can access all of its utilities and functions required to implement the **K-means clustering algorithm.**

In [19]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('intro_cluster').getOrCreate()
spark

**Inference:** First, the SparkSession class is imported from pyspark's sql module. Then we create a session named **intro_cluster** by calling **appName()** and **getOrCreate()** on the **builder** attribute (a builder object, not a function).

In [20]:
from pyspark.ml.clustering import KMeans

**Inference:** In PySpark, classification algorithms such as the **decision tree classifier, GBT classifier, random forest classifier and logistic regression** are imported from the **classification** module; in the same way, we import the **KMeans** class from the **clustering** module.

## Reading the dataset

As mentioned earlier, this dataset is provided by **Spark in its official GitHub repository**. It is quite **small**, so one should not expect real-world results, but it surely helps to **understand how to implement K-means clustering in practice**.

In [21]:
dataset_kmeans = spark.read.format("libsvm").load("sample_kmeans_data.txt")
dataset_kmeans.show()

+-----+--------------------+
|label|            features|
+-----+--------------------+
|  0.0|           (3,[],[])|
|  1.0|(3,[0,1,2],[0.1,0...|
|  2.0|(3,[0,1,2],[0.2,0...|
|  3.0|(3,[0,1,2],[9.0,9...|
|  4.0|(3,[0,1,2],[9.1,9...|
|  5.0|(3,[0,1,2],[9.2,9...|
+-----+--------------------+



**Inference:** Notice that we read the dataset with the **libsvm** data source because the file is already stored in LIBSVM format, which makes it easy to load; no conversion is needed. We then look at the dataset using the **show()** function.
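To make the LIBSVM layout concrete, here is a small sketch of how one text line of the form `label index:value index:value ...` maps to the `(size, [indices], [values])` notation in the output above. The helper `parse_libsvm_line` and the two sample lines are illustrative assumptions, not Spark's reader; the one detail taken from the format itself is that LIBSVM indices are 1-based while Spark's vector indices are 0-based.

```python
def parse_libsvm_line(line, num_features):
    """Parse 'label idx:val ...' into (label, (size, indices, values)).

    Hypothetical helper for illustration only.
    """
    parts = line.split()
    label = float(parts[0])
    indices, values = [], []
    for item in parts[1:]:
        index, value = item.split(":")
        indices.append(int(index) - 1)  # LIBSVM is 1-based, Spark vectors are 0-based
        values.append(float(value))
    return label, (num_features, indices, values)

# Assumed example lines in LIBSVM format.
row = parse_libsvm_line("1 1:0.1 2:0.1 3:0.1", num_features=3)
empty = parse_libsvm_line("0", num_features=3)  # a label with no stored features
print(row)    # (1.0, (3, [0, 1, 2], [0.1, 0.1, 0.1]))
print(empty)  # (0.0, (3, [], []))
```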

In [22]:
dataset_kmeans.head(5)

[Row(label=0.0, features=SparseVector(3, {})),
 Row(label=1.0, features=SparseVector(3, {0: 0.1, 1: 0.1, 2: 0.1})),
 Row(label=2.0, features=SparseVector(3, {0: 0.2, 1: 0.2, 2: 0.2})),
 Row(label=3.0, features=SparseVector(3, {0: 9.0, 1: 9.0, 2: 9.0})),
 Row(label=4.0, features=SparseVector(3, {0: 9.1, 1: 9.1, 2: 9.1}))]

**Inference:** Another way to take a sneak peek at the dataset is the **head()** function, which returns the first n rows (here 5) as a list of Row objects, each carrying the column names with their corresponding values. Notice that the features are returned as a **SparseVector**.
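A sparse vector stores only its size plus the entries that are present, and every missing index counts as zero. The following sketch expands that representation into a plain list; `to_dense` is a hypothetical helper written for this illustration, not part of PySpark.

```python
def to_dense(size, entries):
    """Expand a sparse {index: value} mapping into a dense list of length `size`."""
    return [entries.get(i, 0.0) for i in range(size)]

print(to_dense(3, {}))                        # [0.0, 0.0, 0.0]
print(to_dense(3, {0: 0.1, 1: 0.1, 2: 0.1}))  # [0.1, 0.1, 0.1]
```

This is why the first row, `SparseVector(3, {})`, is simply the point at the origin.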

In [23]:
dataset_kmeans.printSchema()

root
 |-- label: double (nullable = true)
 |-- features: vector (nullable = true)



**Inference:** To see what the schema of the dataset looks like, use the **printSchema()** function. Here it returns two columns: **label** and **features**, the latter in **vector format**.

## Implementation

As we know, K-means does not predict labels; instead it forms clusters to segment the data in the best possible way. Here comes the hard part: **there is no 100% optimal way to find the best number of clusters**. One either needs solid domain knowledge or a heuristic such as the **elbow method**. For easier problems we can also give **trial and error** a shot, and that's what we are going to do here:

1. **When clusters = 2** i.e. when the dataset is separated into two groups.
2. **When clusters = 3** i.e. when the dataset is separated into three groups.
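For comparison, the elbow method mentioned above looks at how the within-cluster sum of squared distances shrinks as K grows and picks the K where the curve bends. The sketch below computes that cost by hand for three groupings of the six sample points. The groupings are assumptions for illustration: K = 1 puts everything together, K = 2 is the split this walkthrough obtains, and K = 3 is inferred from the three centroids it prints. `within_cluster_ss` is a hypothetical helper.

```python
POINTS = [(v, v, v) for v in (0.0, 0.1, 0.2, 9.0, 9.1, 9.2)]

# Cluster index per point for each K (assumed groupings, see the text above).
GROUPINGS = {
    1: [0, 0, 0, 0, 0, 0],
    2: [1, 1, 1, 0, 0, 0],
    3: [1, 1, 2, 0, 0, 0],
}

def within_cluster_ss(points, labels):
    """Sum of squared distances from each point to its own cluster mean."""
    total = 0.0
    for label in set(labels):
        members = [p for p, l in zip(points, labels) if l == label]
        center = [sum(col) / len(members) for col in zip(*members)]
        total += sum(
            sum((x - c) ** 2 for x, c in zip(p, center)) for p in members
        )
    return total

wss = {k: within_cluster_ss(POINTS, labels) for k, labels in GROUPINGS.items()}
for k, cost in wss.items():
    print(k, round(cost, 3))  # roughly 364.62, 0.12 and 0.075
```

The cost collapses between K = 1 and K = 2 and barely moves afterwards, so the "elbow" sits at K = 2. Note that the cost keeps decreasing as K grows, which is why the bend matters rather than the minimum.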

**When clusters = 2**

In [24]:
kmeans_2_cluster = KMeans().setK(2).setSeed(1)

**Inference:** The code above sets the number of clusters with the **setK()** method. Note that we also use **setSeed()**: KMeans initializes its centroids **randomly**, so fixing the seed means every run starts from the **same random initialization** and produces reproducible results.
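The effect of a fixed seed can be shown with Python's own `random` module. This is an analogy only: it does not use Spark's random number generator or its initialization strategy, and `pick_initial_rows` is a hypothetical helper.

```python
import random

ROWS = [0, 1, 2, 3, 4, 5]  # row labels of the six sample points

def pick_initial_rows(k, seed=None):
    """Pick k rows at random; a fixed seed makes the pick repeatable."""
    rng = random.Random(seed)
    return rng.sample(ROWS, k)

first = pick_initial_rows(2, seed=1)
second = pick_initial_rows(2, seed=1)
print(first == second)  # True: same seed, same picks
```

Without a seed, two calls may start from different rows, and with K-means that can lead to different cluster numbering or even different clusters.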

In [25]:
first_model = kmeans_2_cluster.fit(dataset_kmeans)

**Inference:** Now we are **fitting the KMeans model**, i.e. training it on the available dataset. Note that we feed the **complete data for training**: since we have no ground-truth labels to test against, there is **no point in splitting the dataset**.

In [26]:
predictions_first_model = first_model.transform(dataset_kmeans)
predictions_first_model.show()

+-----+--------------------+----------+
|label|            features|prediction|
+-----+--------------------+----------+
|  0.0|           (3,[],[])|         1|
|  1.0|(3,[0,1,2],[0.1,0...|         1|
|  2.0|(3,[0,1,2],[0.2,0...|         1|
|  3.0|(3,[0,1,2],[9.0,9...|         0|
|  4.0|(3,[0,1,2],[9.1,9...|         0|
|  5.0|(3,[0,1,2],[9.2,9...|         0|
+-----+--------------------+----------+



**Inference:** Now it's time to make some predictions (here, to assign each row to a cluster based on the input data), and for that we use the **transform** method on the whole dataset. In the output we can see that **there are two groups: one is tagged as 0 and the other as 1**.

In [27]:
from pyspark.ml.evaluation import ClusteringEvaluator
evaluator = ClusteringEvaluator()

**Inference:** Building the model is one thing; testing and evaluating it is even more important, as it tells us whether the model is good to go or needs changes. Hence, **we use the ClusteringEvaluator object to evaluate the K-means result.**

In [28]:
silhouette_2_clusters = evaluator.evaluate(predictions_first_model)
print("Silhouette evaluation results = " + str(silhouette_2_clusters))

Silhouette evaluation results = 0.9997530305375207


**Inference:** The silhouette score is a statistical measure of the **consistency within clusters of data**. Its coefficient ranges over **[-1, 1]**: the closer it is to 1, the better each data point fits its own cluster compared with the nearest neighboring cluster.
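To see where a number like this comes from, the silhouette can be computed by hand. The sketch below is a plain-Python illustration, not Spark's implementation: it assumes squared Euclidean distance (the default distance measure of `ClusteringEvaluator`), hard-codes the six points and the prediction column shown above, and uses hypothetical helper names.

```python
POINTS = [(v, v, v) for v in (0.0, 0.1, 0.2, 9.0, 9.1, 9.2)]
PREDICTIONS = [1, 1, 1, 0, 0, 0]  # prediction column from the K = 2 model

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_distance(point, others):
    return sum(squared_distance(point, o) for o in others) / len(others)

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    scores = []
    for i, (p, l) in enumerate(zip(points, labels)):
        # a: mean distance to the other members of the point's own cluster
        own = [q for j, (q, m) in enumerate(zip(points, labels)) if m == l and j != i]
        if not own:
            scores.append(0.0)  # a single-point cluster scores 0 by convention
            continue
        a = mean_distance(p, own)
        # b: mean distance to the nearest other cluster
        b = min(mean_distance(p, members) for m, members in clusters.items() if m != l)
        scores.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return sum(scores) / len(scores)

score = silhouette_score(POINTS, PREDICTIONS)
print(score)  # about 0.99975
```

Each point is a tiny distance from its own group and a huge distance from the other one, so every individual coefficient is close to 1.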

In [29]:
two_centroid = first_model.clusterCenters()
print("Center of clusters: ")
for c in two_centroid:
    print(c)

Center of clusters: 
[9.1 9.1 9.1]
[0.1 0.1 0.1]


**Inference:** Knowing the **center/centroid of each cluster is very important, as it tells us how separable the clusters are from each other**. In the output one can see two centroids, because we chose the number of clusters K as **2**.
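A centroid is just the per-coordinate mean of the points assigned to a cluster, which can be checked by hand. The sketch below recomputes both centers from the six points and the prediction column shown earlier, then measures how far apart they are. The hard-coded data and the helper name `centroid` are assumptions made for this illustration.

```python
import math

POINTS = [(v, v, v) for v in (0.0, 0.1, 0.2, 9.0, 9.1, 9.2)]
PREDICTIONS = [1, 1, 1, 0, 0, 0]  # prediction column from the K = 2 model

def centroid(members):
    """Per-coordinate mean of a list of points."""
    return [sum(col) / len(members) for col in zip(*members)]

centers = {}
for label in sorted(set(PREDICTIONS)):
    members = [p for p, l in zip(POINTS, PREDICTIONS) if l == label]
    centers[label] = centroid(members)

# Euclidean distance between the two centers as a rough separability measure.
separation = math.dist(centers[0], centers[1])
print(centers)     # cluster 0 near [9.1, 9.1, 9.1], cluster 1 near [0.1, 0.1, 0.1]
print(separation)  # about 15.59
```

A separation of roughly 15.6 is very large compared with the spread inside each cluster (the points sit within 0.1 of their center along every axis), which matches the high silhouette score.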

**When clusters = 3**

In [30]:
kmeans_3_clusters = KMeans().setK(3).setSeed(1)

second_model = kmeans_3_clusters.fit(dataset_kmeans)

predictions_second_model = second_model.transform(dataset_kmeans)

**Inference:** Now it's time to check how our model performs with three clusters. The process is almost the same: we **set the K value to 3**, i.e. **3 groups**, then **fit/train on the complete data** (for the reason already discussed), and finally the **transform method** assigns each row to a cluster.

In [31]:
from pyspark.ml.evaluation import ClusteringEvaluator
evaluator = ClusteringEvaluator()

silhouette_3_clusters = evaluator.evaluate(predictions_second_model)
print("Silhouette evaluation results = " + str(silhouette_3_clusters))

Silhouette evaluation results = 0.6248737134600261


**Inference:** If we compare this silhouette score with the one for 2 clusters, **we can conclude that we should go with 2 clusters, as it gives the better result.**
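The drop can be traced point by point. The worked arithmetic below is a sketch under two assumptions: the three clusters are {0.0, 0.1}, {0.2} and {9.0, 9.1, 9.2} (inferred from the centroids this model prints, since the assignments themselves are not shown), and distances are squared Euclidean, the evaluator's default. Here $a$ is a point's mean distance to the rest of its own cluster and $b$ its mean distance to the nearest other cluster.

$$
\begin{aligned}
s(0.0) &= 1 - \frac{a}{b} = 1 - \frac{3(0.1)^2}{3(0.2)^2} = 1 - \frac{0.03}{0.12} = 0.75 \\
s(0.1) &= 0 \quad (a = b = 0.03) \\
s(0.2) &= 0 \quad (\text{single-point cluster}) \\
s(9.0) + s(9.1) + s(9.2) &\approx 2.9992 \\
\bar{s} &\approx \frac{0.75 + 0 + 0 + 2.9992}{6} \approx 0.6249
\end{aligned}
$$

Splitting the three nearby points into two clusters leaves one point exactly between them and another alone, and those two zeros pull the average down.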

In [32]:
three_centroid = second_model.clusterCenters()
print("Center of clusters: ")
for c in three_centroid:
    print(c)

Center of clusters: 
[9.1 9.1 9.1]
[0.05 0.05 0.05]
[0.2 0.2 0.2]


**Inference:** The sole purpose of building the model with **K = 3** is to compare it with the K = 2 model and **choose the best possible K value**. Here we can see the three centroid values.

## Conclusion

We are in the endgame guys :) In this section we will go through everything we did so far in this article in terms of practical implementation. **From an introduction to the K-means algorithm and its way of operating, we went on to compare two different cases and chose the best one.**

1. First we discussed how the **K-Means algorithm** works and then set up a PySpark environment so that we could implement it and get hands-on experience.

2. Then we read the **official dataset** from the PySpark documentation example, **analyzed its schema** and got a basic understanding of it.

3. Finally we built the K-means model for two cases (**2 and 3 clusters**) and, after seeing the evaluation results, concluded that for this particular data **2 clusters performed better**.