# Clustering Intuition

Clustering is an **unsupervised machine learning technique**.
It can be defined as grouping unlabelled data. So far, we have been working with **supervised algorithms such as regression and classification**.

In supervised learning, we already have training data together with the answers: the dataset contains a label column. We supply both to the model and ask it to learn from these labels, so that when it is given new observations it can predict their values.

In unsupervised learning, we don't have labels or answers, and the **model has to think for itself**. For instance, if we give it images of fruits without any labels and ask it to group them into categories, the model has no prior idea of what the groups are. It can only see that some data points are similar and others are different, and it uses those similarities and differences to form groups.

In a nutshell, in supervised learning the model gets to train on data that includes the answers, whereas in unsupervised learning we have no answers to supply to the model.

![Screenshot 2024-05-28 232316.png](attachment:a95a2b2f-f74a-48d9-a2bb-e130c61ab8c1.png)

**Let's look at an example in a business sense.**

In the diagram below, the x-axis shows the **annual income** and the y-axis shows the **spending score** of a store's customers. The spending score combines how customers spend, how much they spend, when, and on what. If we plot the customers, it might look like the diagram below. We don't have any pre-existing categories, classes, or groups of customers. **We want to create those groups**.

That's where we would apply **clustering**. Clustering shows us the possible groups, and from there we can dive deeper into the data to understand why these groups emerge, in customer terms, in business terms, and in spending terms. We can then work out how to best serve these customers, which offers or promotions to give one group versus another, and how to best use this information for the business.

![Screenshot 2024-05-28 232447.png](attachment:9f4850db-dbc8-4046-b1a0-9588fd2e8269.png)


# K-Means Clustering

K-Means clustering is a simple yet powerful algorithm.

As usual, we have data in a scatter plot, and the goal of k-means clustering is to identify the clusters in it.
We don't have any training data, classes, or categories in advance. We just have this data and we want to create clusters.

![Screenshot 2024-05-29 094530.png](attachment:c7357c1d-fe4c-4359-a980-8ef2d74e14e7.png)

**How does this work** - The **first step** is to decide how many clusters we want. How to make that decision will be covered later.

Suppose we want 2 clusters. For each cluster, we randomly place a **centroid** on the scatter plot. So **2 clusters means 2 centroids**, and we can place them wherever we like. One is a **blue centroid** and the other is a **red centroid**, just to tell them apart.

Next, k-means assigns **each data point** to the **closest centroid.** In this case, that is easy to see by **drawing the equidistant line**, the line on which every point is equally far from both centroids. Anything above the line is assigned to the **blue centroid** and anything below it is assigned to the **red centroid**.
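The assignment step can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation: the function name `assign_to_nearest`, the sample points, and the centroid positions are all made up for the example, with index 0 standing in for the blue centroid and index 1 for the red one.

```python
import math

def assign_to_nearest(points, centroids):
    """Return, for each point, the index of its closest centroid."""
    labels = []
    for p in points:
        # Euclidean distance from this point to every centroid.
        distances = [math.dist(p, c) for c in centroids]
        labels.append(distances.index(min(distances)))
    return labels

# Hypothetical data: three points in the upper left, two in the lower right.
points = [(1.0, 5.0), (2.0, 6.0), (1.5, 5.5), (6.0, 1.0), (7.0, 2.0)]
centroids = [(2.0, 4.0), (5.0, 2.0)]  # index 0 = "blue", index 1 = "red"

labels = assign_to_nearest(points, centroids)
print(labels)  # [0, 0, 0, 1, 1]
```

Comparing distances directly gives the same result as checking which side of the equidistant line a point falls on, and it also works when there are more than two centroids.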

The next step is to **calculate the center of mass** (or **center of gravity**) of each of the preliminary clusters we have identified. Of course, the **centroid itself is not included in this calculation**.

For the **blue cluster**, we take all the **x-coordinates and average them**, then take all the **y-coordinates and average them**. This gives us the **position of the center of mass**.

The same is done for the **red cluster** by taking the **average of its x and y coordinates**.
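As a small sketch of this averaging step, assuming each cluster is simply a list of `(x, y)` tuples (the helper name `center_of_mass` and the sample points are hypothetical):

```python
def center_of_mass(cluster_points):
    """Average the x-coordinates and the y-coordinates separately."""
    n = len(cluster_points)
    mean_x = sum(x for x, _ in cluster_points) / n
    mean_y = sum(y for _, y in cluster_points) / n
    return (mean_x, mean_y)

# Hypothetical preliminary clusters (the centroids themselves are not included).
blue_points = [(1.0, 5.0), (2.0, 6.0), (3.0, 7.0)]
red_points = [(6.0, 1.0), (8.0, 3.0)]

new_blue = center_of_mass(blue_points)
new_red = center_of_mass(red_points)
print(new_blue, new_red)  # (2.0, 6.0) (7.0, 2.0)
```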

![Screenshot 2024-05-29 104521.png](attachment:a6a678c6-a612-498d-8f7a-50645e6f4e47.png)

Then we **move the centroids to those positions**. Once they have moved, we repeat the process: we reassign the data points to the closest centroid, which again means drawing the equidistant line. In the picture below, some **data point colors** no longer match the side of the new **equidistant line** they fall on.

![Screenshot 2024-05-29 105711.png](attachment:c2078c54-4218-4bab-b455-4c6aceaea8f0.png)

We need to recolor those data points according to the side of the equidistant line they are on. **Change the color of the data points.**

![Screenshot 2024-05-29 105929.png](attachment:1409d168-d684-4dc7-94d2-08cea1cfea5c.png)

Again, we calculate the center of mass of each cluster, move the centroids accordingly, and repeat the process.

We keep repeating until nothing changes: after recalculating the centers of mass and moving the centroids, all blue points are still above the line and all red points are still below it. At that point k-means clustering has reached its end. Below are our final centroids.

![Screenshot 2024-05-29 110535.png](attachment:39be873c-8b10-4984-bf07-ca29f364dc1d.png)

Now we have our **2 clusters**, and we can go ahead and interpret what they mean in a business sense and a domain knowledge sense.

![Screenshot 2024-05-29 110319.png](attachment:ed073d16-5a37-4341-8dd4-041a0ab9a4b8.png)
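Putting the steps of this section together, the whole loop might look like the sketch below. It is a plain-Python illustration under some assumptions of my own: the function name `k_means`, the `max_iter` safety limit, the rule of leaving a centroid in place if its cluster ends up empty, and the sample data are not part of the walkthrough above.

```python
import math

def k_means(points, centroids, max_iter=100):
    """Alternate assignment and centroid update until no point changes cluster."""
    centroids = list(centroids)
    labels = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        new_labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # nothing changed, so we have converged
            break
        labels = new_labels
        # Update step: move each centroid to the center of mass of its points.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return centroids, labels

# Hypothetical data with two obvious groups and two hand-placed start centroids.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
initial = [(0.0, 0.0), (10.0, 10.0)]

final_centroids, final_labels = k_means(points, initial)
print(final_labels)  # [0, 0, 0, 1, 1, 1]
```

On this data the loop stops on the second pass, because the assignments after moving the centroids are identical to the assignments before.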

# The Elbow Method

The data points here are 2-dimensional, with X and Y coordinates. Note, however, that k-means clustering is not limited to 2 dimensions. It works with multidimensional data as well.

We already know how k-means clustering works. The first step is deciding how many clusters to select, and the **elbow method** is one of the approaches that helps us make this decision.

The elbow method is a pretty good, visual way to find a suitable number of clusters.

The elbow method requires us to look at an equation known as **WCSS - Within Cluster Sum of Squares**.

WCSS is the sum of the squared distances between each data point and the centroid of the cluster it belongs to.
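Written out as a formula, this definition could look as follows. The symbols are my own notation rather than something fixed by these notes: $K$ is the number of clusters, $C_j$ is the set of points in cluster $j$, $\mu_j$ is the centroid of that cluster, and $x_i$ is a data point.

$$
\text{WCSS} = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2
$$

The inner sum adds up the squared distances within one cluster, and the outer sum adds those totals over all clusters.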

----> diagram

For instance, in the diagram below, if we have one cluster, we measure the distance between each point and that centroid, square it, and add the results up.

---->diagram

If we have two clusters, we do this for the red points (measure the distances, square them, and add them up), do the same for the blue points, and then add the red and blue totals together.

The same thing for 3 clusters and so on.
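The calculation for one cluster versus two clusters can be sketched like this. The helper names, the four sample points, and the cluster assignments are hypothetical, and the centroids are simply the averages of the points in each cluster.

```python
def squared_distance(p, c):
    """Squared Euclidean distance between a point and a centroid."""
    return (p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2

def wcss(points, labels, centroids):
    """Sum the squared distance from each point to its own cluster's centroid."""
    return sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))

# Hypothetical data: four corners of a 4-by-2 rectangle.
points = [(0.0, 0.0), (0.0, 2.0), (4.0, 0.0), (4.0, 2.0)]

# One cluster: every point belongs to cluster 0, centroid in the middle.
wcss_one = wcss(points, [0, 0, 0, 0], [(2.0, 1.0)])

# Two clusters: left pair and right pair, each with its own centroid.
wcss_two = wcss(points, [0, 0, 1, 1], [(0.0, 1.0), (4.0, 1.0)])

print(wcss_one, wcss_two)  # 20.0 4.0
```

With one cluster each point contributes 2² + 1² = 5, giving 20 in total. With two clusters each point contributes 1² = 1, giving 4.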

**Important point to understand** - To calculate **WCSS**, the clusters need to **already exist**. So every time, we first have to run the k-means clustering algorithm and only then calculate the WCSS. It's a **bit backwards**: we don't first apply the elbow method to find the optimal number of clusters and then run k-means. **We run k-means many times and find the WCSS for every single setup**, whether that is 1, 2, 3, 4, 5, or more clusters. Then we apply the elbow method.

Secondly, the **more clusters we have, the smaller the WCSS**. We can see this in the diagram: with a single cluster, the distances are larger, so their squares are larger too, and hence the WCSS is larger.

With more clusters, the WCSS gets smaller. We can keep increasing the number of clusters up to the maximum, which equals the number of data points we have. At that point the WCSS is zero, because each data point sits on its own centroid and its distance is zero.

----->diagram

Below is the chart we get if we do this. The x-axis shows the number of clusters and the y-axis shows the WCSS. As we can see, it drops all the way down to zero. We look for **where the kink or the elbow is**, meaning the point after which adding more clusters only reduces the WCSS a little. In this chart it is at 100 on the WCSS axis. The number of clusters at that point is the **optimal number of clusters**, and that is how many clusters we create before proceeding with the data analysis.
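The procedure of running k-means once per candidate number of clusters and recording the WCSS each time can be sketched as below. Several things here are assumptions made for the example: the helper names, the sample data (three small groups of three points), and in particular the initialization, which picks evenly spaced data points as starting centroids instead of random ones so that the output is reproducible.

```python
def squared_distance(p, c):
    return (p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2

def k_means(points, centroids, max_iter=100):
    """Plain k-means loop: assign, update, stop when assignments are stable."""
    centroids = list(centroids)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return centroids, labels

def wcss(points, labels, centroids):
    return sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))

# Hypothetical data: three small groups of three points each.
points = [
    (1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
    (8.0, 8.0), (8.0, 9.0), (9.0, 8.0),
    (1.0, 8.0), (2.0, 9.0), (1.0, 9.0),
]
n = len(points)

wcss_by_k = {}
for k in range(1, n + 1):
    # Assumption: deterministic start, k evenly spaced data points.
    initial = [points[i * n // k] for i in range(k)]
    centroids, labels = k_means(points, initial)
    wcss_by_k[k] = wcss(points, labels, centroids)

for k, value in wcss_by_k.items():
    print(k, round(value, 2))
```

For this data the WCSS falls steeply from 1 to 3 clusters and reaches zero when there are as many clusters as data points, so the elbow would be read off at 3 clusters.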

# K-Means++

Let's understand how K-Means++ works and why it's used.

We have the dataset below, and let's say we want to apply k-means. For argument's sake, suppose the centroids are initialized as in the diagram below. This is **random initialization**. Once we apply k-means and all the steps take place, we end up with three clusters.

---diagram

Now we apply k-means to the same dataset again, but because the centroids are initialized at random, let's say this time they are placed as in the diagram below.

---->diagram

Now, we will get a different set of three clusters. 

This is not good. The results differ for the same dataset and the same k-means model, and they differ only because the centroids are being **initialized randomly**. This is called the **random initialization trap**.

Why is this a bad thing? Because we would like an algorithm such as k-means to be deterministic: the result should be the same on every run and should not change with the random initialization. The model tells us something about the business or the problem at hand, so the insight shouldn't depend on random initialization.
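The trap can be reproduced on a tiny example. The sketch below uses 2 clusters instead of 3 to keep it short, and rather than drawing random numbers it hard-codes two starting positions that a random initializer could plausibly produce. The data, the helper name `k_means`, and both starting positions are hypothetical.

```python
def squared_distance(p, c):
    return (p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2

def k_means(points, centroids, max_iter=100):
    """Plain k-means loop: assign, update, stop when assignments are stable."""
    centroids = list(centroids)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return centroids, labels

# Hypothetical data: four corners of a wide, flat rectangle.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

# Two possible "random" initializations of the same two centroids.
init_a = [(0.0, 0.5), (4.0, 0.5)]  # one centroid on the left, one on the right
init_b = [(2.0, 0.0), (2.0, 1.0)]  # one centroid at the bottom, one at the top

centroids_a, labels_a = k_means(points, init_a)
centroids_b, labels_b = k_means(points, init_b)

print(labels_a)  # [0, 0, 1, 1]  -> left pair vs right pair
print(labels_b)  # [0, 1, 0, 1]  -> bottom pair vs top pair
```

Both runs have converged, since no point changes cluster on the next pass, yet they describe the same data in two different ways.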

So what does K-Means++ do? **It adds some extra steps at the beginning to initialize the centroids in a certain way**.

Step1: The first centroid is chosen at random as usual.

Step2: For each remaining data point, compute the distance to the nearest of the already selected centroids.

Step3: We use a **weighted random selection** to pick the next centroid from the remaining data points, where each data point's weight is the square of the distance computed in step 2. This means that data points far from the existing centroids are the most likely to be chosen, although the farthest point is not guaranteed to be picked. To assign the second centroid, for example, the weights come from the distances of the data points to the initial centroid.

---->diagram

Step4: Repeat step 2 and step 3 until all the centroids have been initialized.

---->diagram

Step5: Proceed with K-means clustering.
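Steps 1 to 4 above can be sketched with Python's standard `random` module. This follows the usual description of K-Means++, in which the weight of each data point is its squared distance to the nearest centroid chosen so far. The function name `kmeans_pp_init`, the sample data, and the seed are hypothetical, and the sketch assumes the data points are distinct and that `k` is no larger than the number of points.

```python
import random

def squared_distance(p, c):
    return (p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2

def kmeans_pp_init(points, k, rng):
    """Pick k starting centroids from the data using weighted random selection."""
    # Step 1: the first centroid is a data point chosen uniformly at random.
    centroids = [rng.choice(points)]
    # Step 4: repeat steps 2 and 3 until k centroids have been chosen.
    while len(centroids) < k:
        # Step 2: squared distance from each point to its nearest chosen centroid.
        weights = [min(squared_distance(p, c) for c in centroids) for p in points]
        # Step 3: weighted random pick. Far-away points are more likely, and
        # points already chosen have weight 0, so they cannot be picked again.
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids

# Hypothetical data: three small groups of three points each.
points = [
    (1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
    (8.0, 8.0), (8.0, 9.0), (9.0, 8.0),
    (1.0, 8.0), (2.0, 9.0), (1.0, 9.0),
]

rng = random.Random(0)  # fixed seed only so that reruns give the same picks
initial_centroids = kmeans_pp_init(points, 3, rng)
print(initial_centroids)
```

The returned list would then be passed to the regular k-means loop as its starting centroids (step 5).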


One last thing to note: K-Means++ **doesn't guarantee** that there will be no issue with initialization, because the centroids are still chosen at random. But since it uses a weighted random selection to initialize the centroids, the chances of this happening are much lower.