# <p style='text-align: center;'> K-Means Clustering Algorithm </p>

## K-Means Clustering Algorithm:
- It is an iterative algorithm that divides an unlabeled dataset into k clusters such that each data point belongs to exactly one group of points with similar properties.


- K-Means Clustering is an unsupervised learning algorithm that groups an unlabeled dataset into clusters. Here K is the number of clusters to create, chosen in advance: if K=2 there will be two clusters, if K=3 there will be three clusters, and so on.


- It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this algorithm is to minimize the sum of squared distances between the data points and the centroids of their corresponding clusters.


- It lets us cluster the data into different groups and is a convenient way to discover the categories present in an unlabeled dataset on its own, without the need for any labeled training data.


- K-Means is a partitioning clustering algorithm: it divides the data into non-overlapping subsets (clusters) without any cluster-internal structure.


- Intra-cluster (within-cluster) distances are minimized; as a consequence, inter-cluster (between-cluster) separation is maximized.


- The algorithm takes the unlabeled dataset as input, divides it into k clusters, and repeats the process until the cluster assignments stop changing. The value of k must be chosen before the algorithm runs.
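
The objective described in the list above is commonly written as shown below. This is the standard squared-Euclidean formulation; the symbols are notation introduced here for illustration: $C_j$ is the set of points in cluster $j$, $\mu_j$ is its centroid, and $x_i$ is a data point.

$$
J = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2
$$

K-Means searches for the assignment of points to clusters and the centroid positions that make $J$ as small as possible.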


The k-means clustering algorithm mainly performs two tasks:

- Determines the best positions for the K center points (centroids) by an iterative process.


- Assigns each data point to its closest centroid. The data points nearest to a particular centroid form a cluster.


Hence each cluster contains data points with some commonalities and is separated from the other clusters.


The diagram below illustrates how the K-Means clustering algorithm works:

![image.png](attachment:image.png)

## How does the K-Means Algorithm Work?

<b>The working of the K-Means algorithm is explained in the steps below:</b>

- **Step-1:** Select the number K to decide the number of clusters.
    

- **Step-2:** Select K random points as the initial centroids. (They need not be points from the input dataset.)
    

- **Step-3:** Assign each data point to its closest centroid, which forms the predefined K clusters.
    
    
- **Step-4:** Compute the mean of the data points in each cluster and place that cluster's new centroid there.
    

- **Step-5:** Repeat the third step, that is, reassign each data point to the new closest centroid.
    

- **Step-6:** If any reassignment occurs, then go to step-4 else go to FINISH.
    

- **Step-7:** The model is ready.
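
The steps above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function name `kmeans`, its parameters (`max_iter`, `seed`), and the toy data are assumptions introduced here, and the initial centroids are drawn from the data points, which is one of the options Step-2 allows.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain K-Means on an array of shape (n_samples, n_features)."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Steps 3 and 5: assign every point to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 6: stop once no point changes cluster.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 4: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids, labels

# Toy data: two well-separated blobs (made up for illustration).
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])
centroids, labels = kmeans(X, k=2)
print(centroids)
```

On data this cleanly separated, the two returned centroids should sit near the centers of the two blobs; on real data the result can depend on the starting centroids.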

<b>Let's understand the above steps by considering the visual plots:</b>
    
Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two variables is given below:
    
![image.png](attachment:image.png)
    
    
Let's take the number of clusters to be K=2, so we will try to group the data points into two different clusters.
    
    
We need to choose K random points as centroids to form the clusters. These can be points from the dataset or any other points. Here we select the two points below as centroids, which are not part of our dataset. Consider the image below:
    
![image-2.png](attachment:image-2.png)
    
    
Now we will assign each data point of the scatter plot to its closest centroid. We compute this using the Euclidean distance between two points. Equivalently, we can draw the perpendicular bisector of the segment joining the two centroids: every point falls on the side of its nearest centroid. Consider the image below:
    
![image-3.png](attachment:image-3.png)
    
    
From the above image, it is clear that the points to the left of the line are nearer to K1, the blue centroid, and the points to the right of the line are nearer to the yellow centroid. Let's color them blue and yellow for clear visualization.
    
![image-4.png](attachment:image-4.png)    
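
The assignment step can be illustrated numerically. The snippet below is a small sketch with made-up coordinates (they are not read from the plots above); it computes the Euclidean distance from each point to both centroids and picks the nearer one.

```python
import numpy as np

# Hypothetical coordinates, chosen only for illustration.
points = np.array([[1.0, 2.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0]])
centroids = np.array([[0.0, 0.0],     # K1 (blue)
                      [10.0, 10.0]])  # K2 (yellow)

# Euclidean distance from every point to every centroid: shape (4, 2).
dists = np.sqrt(((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2))

# Index of the nearest centroid for each point.
labels = dists.argmin(axis=1)
print(labels)  # [0 0 1 1]
```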
    
    
Since we need to find the closest clusters, we repeat the process with new centroids. To choose the new centroids, we compute the center of gravity (mean) of the data points in each cluster and place the new centroids there, as below:
    
![image-5.png](attachment:image-5.png)
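
As a small sketch of this update step, the center of gravity of a cluster is the per-coordinate mean of its points. The coordinates and labels below are made up for illustration.

```python
import numpy as np

# Hypothetical points and their current cluster labels.
points = np.array([[1.0, 2.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0]])
labels = np.array([0, 0, 1, 1])

# New centroid of each cluster = mean of the points assigned to it.
new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(2)])
print(new_centroids)
```

Here the first centroid moves to (1.5, 1.5) and the second to (8.5, 8.5).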
    
    
Next, we will reassign each data point to the new centroids. For this, we repeat the same process of drawing the dividing line between the centroids. The line will look like the image below:
    
![image-6.png](attachment:image-6.png)
    
    
From the above image, we can see that one yellow point is on the left side of the line and two blue points are to the right of the line. So, these three points will be assigned to new centroids.
    
![image-7.png](attachment:image-7.png)
    
    
As reassignment has taken place, we go back to step-4, which is finding new centroids.
    

We repeat the process by finding the center of gravity of the data points in each cluster, so the new centroids will be as shown in the image below:
    
![image-8.png](attachment:image-8.png)   
    
    
With the new centroids, we again draw the dividing line and reassign the data points. The image will be:
    
![image-9.png](attachment:image-9.png)    
    
    
We can see in the above image that there are no dissimilar data points on either side of the line, which means our model is formed. Consider the image below:
    
![image-10.png](attachment:image-10.png)
    
    
As our model is ready, we can now remove the assumed centroids, and the two final clusters will be as shown in the image below:

![image-11.png](attachment:image-11.png)

## How to choose the value of "K number of clusters" in K-means Clustering?
The performance of the K-Means clustering algorithm depends on the quality of the clusters it forms, and choosing the optimal number of clusters is a hard task. To choose it, we will use the **"Elbow Method"**.

### Elbow Method:
The Elbow method is one of the most popular ways to find the optimal number of clusters. It is based on the WCSS value. WCSS stands for Within-Cluster Sum of Squares, which measures the total variation within the clusters. The formula to calculate the WCSS (for 3 clusters) is given below:

![image.png](attachment:image.png)


In the above formula of WCSS,

$\sum_{P_i \in \text{Cluster1}} \text{distance}(P_i, C_1)^2$: the sum of the squared distances between each data point in Cluster1 and its centroid $C_1$; the other two terms are defined the same way for the other two clusters.

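To measure the distance between data points and centroids, standard K-Means uses the Euclidean distance, because the mean is the point that minimizes the sum of squared Euclidean distances. Other measures, such as the Manhattan distance, lead to variants of the algorithm.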
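
A minimal sketch of the WCSS computation, assuming Euclidean distance. The function name `wcss` and the small arrays are illustrative assumptions, not part of any library.

```python
import numpy as np

def wcss(X, labels, centroids):
    """Sum of squared Euclidean distances from each point to its own centroid."""
    diffs = X - centroids[labels]  # centroids[labels] picks each point's centroid
    return float((diffs ** 2).sum())

# Hypothetical data: two clusters of two points each.
X = np.array([[1.0, 2.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[1.5, 1.5], [8.5, 8.5]])
print(wcss(X, labels, centroids))  # 2.0
```

Each point is 0.5 away from its centroid along both axes, so it contributes 0.25 + 0.25 = 0.5, and the four points together give 2.0.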


<b>To find the optimal number of clusters, the elbow method follows the steps below:</b>

- It runs K-Means clustering on the given dataset for different values of K (for example, from 1 to 10).


- For each value of K, it calculates the WCSS value.


- It plots a curve of the calculated WCSS values against the number of clusters K.


- The point where the curve bends sharply, like the elbow of an arm, is considered the best value of K.
    
    
Since the graph shows a sharp bend that looks like an elbow, this is known as the elbow method. The graph for the elbow method looks like the image below:
    
![image-2.png](attachment:image-2.png)
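
The elbow procedure can be sketched with scikit-learn, assuming it is installed. `KMeans` exposes the WCSS of a fitted model as its `inertia_` attribute. The toy data (three made-up blobs) and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: three made-up blobs, so the bend is expected near K=3.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(center, 0.4, size=(60, 2))
               for center in ([0, 0], [5, 5], [0, 5])])

wcss_values = []
for k in range(1, 11):
    model = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0).fit(X)
    wcss_values.append(model.inertia_)  # inertia_ is scikit-learn's name for WCSS

for k, value in zip(range(1, 11), wcss_values):
    print(k, round(value, 1))
```

Plotting `wcss_values` against K (for example with matplotlib) gives the elbow curve; on this toy data the WCSS should fall steeply up to K=3 and flatten afterwards.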
    
    
**Note:** We can choose as many clusters as there are data points. In that case every point is its own cluster, the WCSS becomes zero, and that is the endpoint of the plot.
    
    
**Note:** As the number of clusters increases, the WCSS decreases, and vice versa.

## The Random Initialization Trap:
One major drawback of K-Means Clustering is the random initialization of centroids. The clusters that form depend strongly on the initial positions of the centroids, so a different random initialization can produce completely different, and sometimes poor, clusters. The solution is **K-Means++**, an algorithm used to initialize K-Means.
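
The effect of initialization can be seen on a tiny, hand-made example. The sketch below runs the same K-Means iterations from two different starting positions on the four corners of a 4-by-1 rectangle; the helper name `lloyd` and the data are assumptions for illustration.

```python
import numpy as np

def lloyd(X, centroids, n_iter=20):
    """Run K-Means from the given starting centroids; return (centroids, WCSS)."""
    centroids = np.asarray(centroids, dtype=float).copy()
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(len(centroids)):
            if np.any(labels == j):  # skip empty clusters
                centroids[j] = X[labels == j].mean(axis=0)
    final = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return centroids, float((final.min(axis=1) ** 2).sum())

# Four corners of a wide rectangle: the natural clusters are left and right.
X = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]])

_, bad_wcss = lloyd(X, [[2.0, 0.0], [2.0, 1.0]])   # start between the pairs
_, good_wcss = lloyd(X, [[0.0, 0.5], [4.0, 0.5]])  # start near each pair
print(bad_wcss, good_wcss)  # 16.0 1.0
```

The first start gets stuck grouping the bottom points together and the top points together, which is a stable but much worse solution than the left/right split found from the second start.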




## K-Means++
K-Means++ specifies a procedure to initialize the cluster centers before moving forward with the standard K-Means clustering algorithm.
 
 
Using the K-Means++ algorithm, we improve the step where we randomly pick the cluster centroids. With K-Means++ initialization we are more likely to find a solution that is competitive with the optimal K-Means solution.


<b>The steps to initialize the centroids using K-Means++ are:</b>
    
1. The first cluster center is chosen uniformly at random from the data points we want to cluster. This is similar to what we do in K-Means, but instead of randomly picking all the centroids, we pick just one centroid here.


2. Next, we compute the distance $D(x)$ of each data point $x$ from the nearest cluster center that has already been chosen.


3. Then, we choose the new cluster center from the data points, with the probability of picking $x$ proportional to $D(x)^2$.


4. We then repeat steps 2 and 3 until k cluster centers have been chosen.
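
The four steps above can be sketched in NumPy as follows. The function name `kmeans_pp_init`, the `seed` parameter, and the toy data are assumptions introduced for illustration; the sketch also assumes the data contains at least k distinct points.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Pick k initial cluster centers from the rows of X, K-Means++ style."""
    rng = np.random.default_rng(seed)
    # Step 1: first center chosen uniformly at random from the data points.
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Step 2: squared distance of each point to its nearest chosen center.
        diffs = X[:, None, :] - np.array(centers)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        # Step 3: sample the next center with probability proportional to D(x)^2.
        next_idx = rng.choice(len(X), p=d2 / d2.sum())
        centers.append(X[next_idx])
    # Step 4 is the loop itself: repeat until k centers have been chosen.
    return np.array(centers)

# Toy data: three made-up blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(30, 2)) for c in ([0, 0], [4, 4], [0, 4])])
centers = kmeans_pp_init(X, k=3)
print(centers)
```

Because an already chosen point has $D(x) = 0$, it cannot be picked again, and far-away points are favored without being guaranteed.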
    
    
<b>Example:</b>
    
Let’s take an example to understand this more clearly. Let’s say we have the following points, and we want to make 3 clusters here:
    
![image.png](attachment:image.png)
    
    
Now, the first step is to randomly pick a data point as a cluster centroid:
    
![image-2.png](attachment:image-2.png)    
    
    
Let’s say we pick the green point as the initial centroid. Now, we will calculate the distance $D(x)$ of each data point from this centroid:
    
![image-3.png](attachment:image-3.png)
    
    
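The next centroid is sampled with probability proportional to the squared distance $D(x)^2$, so points far from the current centroid are the most likely picks. For illustration, we take the point whose squared distance from the current centroid is the largest: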
    
![image-4.png](attachment:image-4.png)
    
    
In this case, the red point will be selected as the next centroid. Now, to select the last centroid, we will take the distance of each point from its closest centroid, and the point having the largest squared distance will be selected as the next centroid:
    
![image-5.png](attachment:image-5.png)
    
    
We will select the last centroid as:
    
![image-6.png](attachment:image-6.png)
    
    
We can continue with the K-Means algorithm after initializing the centroids. Using K-Means++ to initialize the centroids tends to improve the clusters. Although it is computationally more costly than random initialization, the subsequent K-Means run often converges more rapidly.

    
One question remains: how many clusters should we make? In other words, what is the optimum number of clusters to use when performing K-Means?


Choosing the optimal number of clusters is a hard task. One way to approach it is the **"Elbow Method"** discussed above.