# Clustering Intuition

Clustering is an **unsupervised machine learning technique**.
It can be defined as grouping unlabelled data. So far, we have been working with **supervised algorithms such as regression and classification**.

In supervised learning, we already have training data together with the answers, stored in a label column of the dataset. We supply both to the model and ask it to learn from these labels, so that when it is given new observations it can predict their values.

In unsupervised learning, we don't have labels or answers, and the **model has to think for itself**. For instance, if we give it some images of fruits without any labels and ask it to group them into categories, the model has no idea what the groups are. It can only see that there are similarities and differences in the data, and from these it forms the groups.

In a nutshell, in supervised learning we give the model the answers to train on, whereas in unsupervised learning we have no answers to supply.


**Let's look at an example in a business sense.**

As usual, we have X and Y axes. On the x-axis we have **annual income** and on the y-axis the **spending score** of a store's customers. How they spend, how much they spend, when and on what are all combined into the spending score. We don't have any pre-existing categories, classes or groups of customers. **We want to create those groups**.

That's where we apply **clustering**. Clustering shows us the possible groups, and from there we can dive deeper into the data to understand why these groups might be emerging, in customer, business and spending terms. We can then work out how best to serve these customers, which offers or promotions to give one group versus another, and how best to use this information for the business.


# K-Means Clustering

K-Means clustering is a very simple yet powerful algorithm.

As usual, we have data on a scatter plot, and the goal of k-means clustering is to identify the clusters in it.
We don't have any training data, classes or categories in advance. We just have this data and we want to create clusters.

**How does this work** - The **first step** is to decide how many clusters we want. How to make that decision is covered later.

Say we want 2 clusters. For each cluster, we randomly place a **centroid** on the scatter plot. So **2 clusters means 2 centroids**, and we can place them wherever we like. One is a **blue centroid** and the other a **red centroid**, just to tell them apart.

Next, k-means assigns **each data point** to the **closest centroid.** In this case that is easy to do by **drawing the equidistant line** between the two centroids. Anything above the line is assigned to the **blue centroid** and anything below to the **red centroid**.
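The assignment step can be sketched in a few lines of NumPy. This is only an illustration: the points and the two starting centroids are made-up values, and index 0 plays the role of the blue centroid while index 1 plays the red one.

```python
import numpy as np

# Toy 2-D data: three points near the top, three near the bottom (made-up values).
points = np.array([[1.0, 8.0], [2.0, 9.0], [3.0, 8.5],
                   [1.5, 1.0], [2.5, 2.0], [3.5, 1.5]])

# Two arbitrary starting centroids: index 0 is "blue", index 1 is "red".
centroids = np.array([[2.0, 6.0], [2.0, 3.0]])

# Distance from every point to every centroid, shape (n_points, n_centroids).
distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)

# Each point gets the index of its nearest centroid.
labels = distances.argmin(axis=1)
print(labels)
```

Taking the nearest centroid for every point is equivalent to checking which side of the equidistant line the point falls on, and it also works when there are more than two centroids.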

The next step is to **calculate the center of mass** (or **center of gravity**) of each preliminary cluster we have identified. Of course, the **centroid itself is not included in this calculation**.

For the **blue cluster**, we take all the **x-coordinates and average them**, then take all the **y-coordinates and average them**. This gives us the **position of the center of mass**.

The same is done for the **red cluster** by taking the **average of the x and y coordinates**.
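As a small sketch of this averaging, assuming the same kind of made-up points and a hand-written assignment (0 for blue, 1 for red), the center of mass of each cluster is just the column-wise mean of its members:

```python
import numpy as np

# Made-up points and a hand-written assignment: 0 = "blue", 1 = "red".
points = np.array([[1.0, 8.0], [2.0, 9.0], [3.0, 8.5],
                   [1.5, 1.0], [2.5, 2.0], [3.5, 1.5]])
labels = np.array([0, 0, 0, 1, 1, 1])

# mean(axis=0) averages the x-coordinates and the y-coordinates separately.
blue_center = points[labels == 0].mean(axis=0)
red_center = points[labels == 1].mean(axis=0)
print(blue_center, red_center)
```

Only the data points enter the mean; the old centroid positions play no part in it.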

![Screenshot 2024-05-29 104521.png](attachment:a6a678c6-a612-498d-8f7a-50645e6f4e47.png)

Then we **move the centroids to those positions**. Once they have moved, we repeat the process: we reassign the data points to the closest centroid, again using the equidistant line. In the picture below, some **data point colors** no longer match the side of the new **equidistant line** they fall on.

We need to recolor those data points according to the new equidistant line. **Change the color of data points**

![Screenshot 2024-05-29 105929.png](attachment:1409d168-d684-4dc7-94d2-08cea1cfea5c.png)

Again we calculate the center of mass and move the centroids accordingly, then repeat the process.

We repeat this until nothing changes: even after recalculating the centers of mass and moving the centroids, all the blue points stay above the line and all the red points stay below it. At that point k-means clustering has reached its end. Below are our final centroids.
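Putting the steps together, the whole loop might look like the sketch below. The function name `k_means`, the `max_iter` safety limit and the toy data are assumptions made for this illustration; it is not how scikit-learn implements the algorithm internally.

```python
import numpy as np

def k_means(points, centroids, max_iter=100):
    """Assign points, recompute centers of mass, stop when the centroids stop moving."""
    centroids = np.asarray(centroids, dtype=float)
    for _ in range(max_iter):
        # Assignment step: nearest centroid for every point.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: mean of each cluster; an empty cluster keeps its old centroid.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(len(centroids))
        ])
        # Convergence: no centroid moved, so no assignment would change either.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Made-up data and deliberately poor starting centroids (both near the top group).
points = np.array([[1.0, 8.0], [2.0, 9.0], [3.0, 8.5],
                   [1.5, 1.0], [2.5, 2.0], [3.5, 1.5]])
labels, centroids = k_means(points, [[1.0, 9.5], [3.0, 7.0]])
print(labels)
print(centroids)
```

Even though both starting centroids sit near the top group, the repeated assign-and-move cycle pulls one of them down to the bottom group on this small example.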

![Screenshot 2024-05-29 110535.png](attachment:39be873c-8b10-4984-bf07-ca29f364dc1d.png)

Now we have our **2 clusters**, and we can go ahead and interpret what they mean in business and domain-knowledge terms.

![Screenshot 2024-05-29 110319.png](attachment:ed073d16-5a37-4341-8dd4-041a0ab9a4b8.png)

# The Elbow Method

Here we have 2-dimensional data points with X and Y coordinates. Note that k-means clustering is not limited to 2 dimensions; it works with multidimensional data as well.

We already know how k-means clustering works. The first step, however, is deciding how many clusters to select. The **elbow method** is one approach that helps us make this decision.

The elbow method is a pretty good way to find the optimal number of clusters.

It requires us to look at a quantity known as **WCSS - Within Cluster Sum of Squares**.

WCSS is the sum of the squared distances between each data point and the centroid of its cluster.

![Screenshot 2024-05-30 081728.png](attachment:a75995fd-27da-4f54-8ee0-c94e24fd7fba.png)
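Written out for a general number of clusters, the definition can be expressed as follows. The notation is an assumption made here for illustration: $K$ is the number of clusters, $C_j$ is the set of points in cluster $j$, $c_j$ is its centroid and $P_i$ is a data point.

$$
\text{WCSS} = \sum_{j=1}^{K} \; \sum_{P_i \in C_j} \operatorname{distance}(P_i, c_j)^2
$$

The inner sum adds up the squared distances inside one cluster, and the outer sum adds the totals of all clusters together.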

For instance, in the diagram below, if we have one cluster, we measure the distance between each point and that centroid, square it and add the results up.

If we have two clusters, we do this for the red points (measure the distances, square them and add them up), do the same for the blue points, and then add the red and blue totals together.

The same thing for 3 clusters and so on.

**Important point to understand** - To calculate **WCSS**, the clusters need to **already exist**. So every time, we first have to run the k-means algorithm and then calculate the WCSS. It's a **bit backwards**: we don't do the elbow method first to find the optimal number of clusters and then run k-means. **We run k-means many times and find the WCSS for every single setup**, whether it's 1, 2, 3, 4, 5 or more clusters. Then we apply the elbow method.

Secondly, the **more clusters we have, the smaller the WCSS**. We can see this in the diagram: with a single cluster the distances are large, so their squares are large too, and hence the WCSS is large.

With more clusters, the WCSS is smaller. We could keep increasing the number of clusters up to the maximum, which equals the number of data points we have. Then the WCSS is zero, because each data point is its own centroid and every distance is zero.
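These points can be checked numerically with a short sketch. The helper `wcss` and the data are made up for this illustration, and the three labelings are written by hand rather than produced by running k-means, purely to keep the example short.

```python
import numpy as np

def wcss(points, labels):
    """Sum of squared distances from each point to the centroid of its own cluster."""
    total = 0.0
    for j in np.unique(labels):
        members = points[labels == j]
        centroid = members.mean(axis=0)
        total += ((members - centroid) ** 2).sum()
    return total

points = np.array([[1.0, 8.0], [2.0, 9.0], [3.0, 8.5],
                   [1.5, 1.0], [2.5, 2.0], [3.5, 1.5]])

one_cluster = np.zeros(len(points), dtype=int)   # everything in a single cluster
two_clusters = np.array([0, 0, 0, 1, 1, 1])      # top group and bottom group
n_clusters = np.arange(len(points))              # every point is its own cluster

for name, labels in [("k=1", one_cluster), ("k=2", two_clusters), ("k=n", n_clusters)]:
    print(name, round(wcss(points, labels), 3))
```

The single-cluster value is the largest, the two-cluster value is much smaller, and with one cluster per point the WCSS is exactly zero.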

Below is the chart we get if we plot this. The x-axis has the number of clusters and the y-axis has the WCSS. As we can see, the WCSS drops all the way down to zero. We look for **where the kink or elbow is**, the point after which the WCSS stops falling sharply; in this chart it is at 100 on the WCSS axis. The number of clusters at that point is the **optimal number of clusters**, and that is how many clusters we create before proceeding with further data analysis.

![Screenshot 2024-05-30 082256.png](attachment:cc185265-0739-46e9-bb6d-26ba4a945dfc.png)

# K-Means++

Let's understand how K-Means++ works and why it's used.

We have the dataset below and, say, we want to apply k-means. For argument's sake, let's say the centroids are initialized as in the diagram below. This is **random initialization**. Once we apply k-means and all the steps take place, we end up with three clusters.

Now we apply k-means to the same dataset again, but because the centroids are initialized at random, let's say they are placed differently this time, as in the diagram below.

We get a different set of three clusters.

This is not good. The results differ for the same dataset even though we ran the same k-means model. They differ only because the centroids are **initialized randomly**. This is called the **random initialization trap**.

![Screenshot 2024-05-30 094138.png](attachment:fe13565e-5a32-4839-8486-c960858f4a2c.png)

**Why this is a bad thing** - We want an algorithm like k-means to give the same result every time it is run on the same data, regardless of the random initialization. The model tells us something about the business or the problem at hand, so the insight shouldn't depend on random initialization.
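The trap can be reproduced on a tiny example. In the sketch below, the four points and the two sets of starting centroids are made up, and the starting centroids are chosen by hand to stand in for two different random draws. Passing an array as `init` with `n_init=1` asks scikit-learn's `KMeans` to start from exactly those positions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Four points at the corners of a wide rectangle (made-up data).
points = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]])

# Two hand-picked starting positions standing in for two "random" draws.
init_a = np.array([[0.0, 0.5], [4.0, 0.5]])  # one centroid on each short side
init_b = np.array([[2.0, 0.0], [2.0, 1.0]])  # one centroid on each long side

# n_init=1: run once, starting from the supplied centroids.
run_a = KMeans(n_clusters=2, init=init_a, n_init=1).fit(points)
run_b = KMeans(n_clusters=2, init=init_b, n_init=1).fit(points)

print(run_a.labels_, run_a.inertia_)
print(run_b.labels_, run_b.inertia_)
```

The first run groups the points into left and right pairs, while the second gets stuck on a bottom and top split with a much larger WCSS, even though the data and the algorithm are identical.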

So what does K-means++ do? **It adds some extra steps at the beginning to initialize the centroids in a certain way**.

Step1: The first centroid is chosen at random from the data points, as usual.

Step2: For each remaining data point, compute the distance to the nearest of the already selected centroids.

Step3: Use a **weighted random selection** to pick the next centroid from the remaining data points, where the weight of each point is the square of its distance from step 2. This means points far from the existing centroids are the most likely to be chosen, but the farthest point is not picked automatically. For the second centroid, the distance considered is simply the distance to the initial centroid.

![Screenshot 2024-05-30 094458.png](attachment:39ef082b-d4aa-4700-96ba-4a4183e1ca82.png)

Step4: Repeat step 2 and step 3 until all the centroids have been selected.

![Screenshot 2024-05-30 094620.png](attachment:09108019-7c6f-4ad4-86cb-64801bc346bc.png)

Step5: Proceed with K-means clustering.

![Screenshot 2024-05-30 094656.png](attachment:b13dd3eb-2271-4830-93d4-3f1b5eadccb6.png)

One last thing to note: k-means++ **doesn't guarantee that there will be no issues with initialization, because it is still done at random. But since the centroids are initialized with a weighted random selection, the chances of this happening are much lower.**
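The initialization steps can be sketched in NumPy as below. The function name `kmeans_pp_init`, the `seed` parameter and the data are assumptions made for this illustration, the sketch assumes the data points are distinct, and scikit-learn's own implementation differs in its details.

```python
import numpy as np

def kmeans_pp_init(points, k, seed=0):
    """Pick k starting centroids from the data with distance-weighted sampling."""
    rng = np.random.default_rng(seed)
    # Step 1: the first centroid is a data point chosen uniformly at random.
    centroids = [points[rng.integers(len(points))]]
    while len(centroids) < k:
        # Step 2: squared distance from every point to its nearest chosen centroid.
        diffs = points[:, None, :] - np.array(centroids)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        # Step 3: weighted random selection, probability proportional to d2.
        # Already chosen points have d2 == 0, so they cannot be picked again.
        next_index = rng.choice(len(points), p=d2 / d2.sum())
        centroids.append(points[next_index])
    # Step 4 is the while loop; step 5 would run ordinary k-means from these centroids.
    return np.array(centroids)

# Three loose groups of made-up points.
points = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
                   [8.0, 8.0], [8.5, 9.0], [9.0, 8.5],
                   [1.0, 9.0], [1.5, 8.0], [2.0, 8.5]])
start = kmeans_pp_init(points, k=3, seed=0)
print(start)
```

Because the selection is still random, two starting centroids can occasionally land in the same group, which is why the improvement is a matter of probability rather than a guarantee.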

# Importing libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Importing dataset

In [None]:
df = pd.read_csv('/kaggle/input/malls-customers/Mall_Customers.csv')

"""Since in clustering technique, we don't have any training data or classes or categories or dependent variable column,
we don't create a variable 'y' for the dependent variable vector. Our main aim is to create that dependent variable vector ourselves.

Although we could use all the columns for X (the matrix of features), for teaching purposes and to visualize the clusters
we take only the last two columns (Annual Income and Spending Score). Note that the columns are not given as a range here.
Instead we pass the indexes of the last 2 columns as a list inside the square brackets. This is another way to select
columns by index with iloc (integer location)."""

X = df.iloc[:,[3,4]].values
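To see what this selection does without loading the CSV, here is a small sketch on a stand-in table. The column names and the three rows are assumptions chosen to resemble the mall customers data, not values read from the real file.

```python
import pandas as pd

# Tiny stand-in for the customers table; names and values are assumed.
toy = pd.DataFrame({
    "CustomerID": [1, 2, 3],
    "Gender": ["Male", "Female", "Female"],
    "Age": [19, 21, 20],
    "Annual Income (k$)": [15, 15, 16],
    "Spending Score (1-100)": [39, 81, 6],
})

# ':' keeps every row; the list [3, 4] picks exactly those two column positions.
X_toy = toy.iloc[:, [3, 4]].values
print(X_toy.shape)
print(X_toy)
```

The result is a plain NumPy array with one row per customer and two columns, which is the shape KMeans expects for its input.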

# Using elbow method to find the optimal number of clusters

In [None]:
"""We need to find the number of clusters that will be passed as an argument in K-Means algorithm. The way we implement the
elbow method is by running the k-means algorithm with several different numbers of clusters.

First, let's import the KMeans class from the cluster module of the sklearn library.

The K-Means algorithm runs in a for loop with 10 different numbers of clusters (from 1 to 10). Each time we run it, we
compute the WCSS (within cluster sum of squares). The WCSS goes on the y-axis of the graph and the number of clusters
(1 to 10) on the x-axis.

We create an empty list called wcss. Inside the for loop we append to it, one by one, the WCSS value for each number of
clusters.

Since we need 10 values, we use range(1, 11), because the upper bound is excluded in Python's range().
Unlike in other algorithms, the KMeans object is created inside the for loop, so we get 10 different objects (kmeans),
one for each number of clusters.

Parameters of KMeans :
    n_clusters - The number of clusters to use. Since i in the for loop takes the value of the number of clusters,
    we pass i to n_clusters.
    init - The initialization method. From the intuition section, k-means++ is the method that helps avoid the random
    initialization trap, so we pass 'k-means++'.
    random_state - A fixed seed so that the results are reproducible. Any integer works; 42 is a popular choice.

Then we run the algorithm, i.e. we train K-Means with fit() for 'i' clusters.

inertia_ : an attribute of the fitted kmeans object that gives us exactly the WCSS value.

After this we plot the graph to identify the kink, i.e. the optimal number of clusters. We pass range(1, 11) as the
x-coordinates and wcss as the y-coordinates."""

from sklearn.cluster import KMeans
wcss = []
for i in range(1,11):
    kmeans = KMeans(n_clusters=i,init='k-means++',random_state=42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
plt.plot(range(1,11),wcss)
plt.title('Elbow method to find optimal no.of clusters')
plt.xlabel('No. of clusters')
plt.ylabel('WCSS')
plt.show()


# Training the K-Means model on the dataset

In [None]:
"""Our main aim is to create the dependent variable vector which is the principle of clustering.

From the elbow plot we found that the number of clusters we want for training our K-Means algorithm is 5. Our next aim is
to group the customers in our dataset into these 5 clusters based on their similarities and differences, and to look for
patterns using these 5 clusters.

We reuse the same kmeans object creation from the previous step and replace 'i' with 5.

We need to build the dependent variable that segments the customers in our dataset into these 5 clusters. Its values are
the cluster indexes 0 to 4, which stand for clusters 1 to 5. Each cluster is a certain group of customers from the dataset.

Fortunately, fit_predict() in the K-Means API not only trains the K-Means model on the dataset but also returns exactly
the dependent variable we want to create, with 5 different values. We call it y_kmeans."""

kmeans = KMeans(n_clusters=5,init='k-means++',random_state=42)
y_kmeans = kmeans.fit_predict(X)

In [None]:
"""If we print the y_kmeans we have a list of values.

These are the cluster indexes of the customers, in dataset order starting from the first customer.
For example, if the first value is 2, it refers to index 2, which means the first customer belongs to the 3rd cluster. In
the same way, a next value of 3 means the second customer belongs to the 4th cluster, and so on.

We need to compare the dataset with these values to better understand which customer falls under which
cluster."""

print(y_kmeans)
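A quick way to read such an array is to count how many customers fall in each cluster and to list the rows of one cluster. The labels below are hypothetical values for eight customers, not the output of the model trained above.

```python
import numpy as np

# Hypothetical cluster indexes for eight customers (not real model output).
labels = np.array([2, 3, 2, 0, 4, 1, 0, 2])

# Row positions of the customers assigned to cluster index 2 (the 3rd cluster).
rows_in_cluster_2 = np.where(labels == 2)[0]
print(rows_in_cluster_2)

# Number of customers in each cluster index, 0 through 4.
counts = np.bincount(labels, minlength=5)
print(counts)
```

The same two calls can be applied to `y_kmeans` to check the size of each customer segment before plotting.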

# Visualizing the cluster

**We will visualize the result with 5 scatter plots, one for each cluster, from cluster1 to cluster5.**

We will explain cluster1 and cluster2. The remaining clusters (3, 4 and 5) follow the same pattern.

Let's first look at the scatter() call for cluster1. scatter() requires the x-coordinates and y-coordinates as mandatory arguments.


**1.Cluster1 x-coordinate:** we need to specify the rows and columns of the dataset that we are plotting. These are given inside the square brackets.

**rows :** the rows are all the customers who fall under cluster1, which has index zero. We select them with 'y_kmeans==0'. This condition gives us all the customers in cluster1 (cluster index zero).

**columns:** X has only two columns, Annual Income (index zero) and Spending Score (index one). For the x-coordinate we use Annual Income, so the value is zero.

**Cluster1 y-coordinate:** in the same way as for the x-coordinate, we specify the rows and columns inside the square brackets.

**rows :** the same condition, 'y_kmeans==0', again selects all the customers in cluster1 (cluster index zero).

**columns:** for the y-coordinate we use Spending Score, whose column index is one. So here the value is one.



   
**2.Cluster2 x-coordinate:** we need to specify the rows and columns of the dataset that we are plotting. These are given inside the square brackets.

**rows :** the rows are all the customers who fall under cluster2, which has index one. We select them with 'y_kmeans==1'. This condition gives us all the customers in cluster2 (cluster index one).

**columns:** X has only two columns, Annual Income (index zero) and Spending Score (index one). For the x-coordinate we use Annual Income, so the value is zero.

**Cluster2 y-coordinate:** in the same way as for the x-coordinate, we specify the rows and columns inside the square brackets.

**rows :** the same condition, 'y_kmeans==1', again selects all the customers in cluster2 (cluster index one).

**columns:** for the y-coordinate we use Spending Score, whose column index is one. So here the value is one.
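The row and column selection described above can be tried on a tiny array first. The four customers and their cluster indexes below are made up for this sketch.

```python
import numpy as np

# Four made-up customers: column 0 = annual income, column 1 = spending score.
X_demo = np.array([[15, 39], [16, 81], [70, 40], [72, 42]])
labels_demo = np.array([0, 1, 0, 1])  # hypothetical cluster index for each row

# Boolean mask that is True for the rows belonging to cluster index 0.
mask = labels_demo == 0

# Rows picked by the mask, column 0 for x and column 1 for y.
x_values = X_demo[mask, 0]
y_values = X_demo[mask, 1]
print(x_values, y_values)
```

Writing `X[y_kmeans == 0, 0]` does the same thing in a single expression: the comparison builds the row mask and the trailing 0 picks the column.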

In [None]:
"""Here s means size, c means color"""

plt.scatter(X[y_kmeans == 0, 0], X[y_kmeans == 0, 1], s = 100, c = 'red', label = 'Cluster 1')
plt.scatter(X[y_kmeans == 1, 0], X[y_kmeans == 1, 1], s = 100, c = 'blue', label = 'Cluster 2')
plt.scatter(X[y_kmeans == 2, 0], X[y_kmeans == 2, 1], s = 100, c = 'green', label = 'Cluster 3')
plt.scatter(X[y_kmeans == 3, 0], X[y_kmeans == 3, 1], s = 100, c = 'cyan', label = 'Cluster 4')
plt.scatter(X[y_kmeans == 4, 0], X[y_kmeans == 4, 1], s = 100, c = 'magenta', label = 'Cluster 5')

"""We will plot centroids for each of these 5 clusters.

cluster_centers_ is an attribute of the kmeans object. It is a 2D array in which the rows correspond to the different
centroids and the columns correspond to their coordinates.

So in scatter(), the coordinates of the centroids are rows and columns of cluster_centers_ for both
x and y.

':' is specified for the rows because we need all of them. For the column we specify '0', because the first column
(index 0) holds the x-coordinates of the centroids.

The y-coordinate is the same, except that '0' becomes '1', because the second column of the cluster_centers_ array
holds the y-coordinates of the centroids.

s stands for size. A larger size clearly highlights the centroids among all the observation points in each cluster."""

plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 300, c = 'yellow', label = 'Centroids')
plt.title('Clusters of customers')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.show()
