## K-Means algorithm

K-Means is one of the most popular and straightforward clustering algorithms. It is a **centroid-based** algorithm, meaning it relies on finding the center points of the clusters.

The algorithm is iterative, following a few simple steps:

### The K-Means Steps

1.  **Decide on K:** We first decide on the value of **K**, which is the total number of clusters (groups) we want to find. (*For example, K=3*).

2.  **Initialize Centroids:** The algorithm randomly chooses $K$ points in the dataset to act as the initial center points, or **Centroids**.
    > ![](assets/kmeans_step1_initial.png)

3.  **Assign Points (The Assignment Step):** The algorithm assigns every single data point to the cluster of the **closest Centroid**. Closeness is measured with a distance metric, typically Euclidean distance.

    > ![](assets/kmeans_step2_assignment.png)

4.  **Update Centroids:** After all points are assigned, the algorithm moves the Centroid of each cluster to the true center (the **mean position**) of all the points currently assigned to it.

5.  **Iterate:** Steps 3 and 4 repeat until the centroids stop moving significantly. This indicates the clusters have stabilized, and the algorithm has **converged**.

    > ![](assets/kmeans_step3_converged.png)
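The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the scikit-learn implementation: the function name `kmeans`, its parameters (`n_iters`, `tol`, `seed`), and the toy array `X_demo` are all assumptions made for this example.

```python
import numpy as np

def kmeans(X, k, n_iters=100, tol=1e-6, seed=0):
    """Plain K-Means on an (n_samples, n_features) array (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each centroid to the mean of its assigned points.
        # An empty cluster simply keeps its previous centroid.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once the centroids barely move.
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    return centroids, labels

# Toy data (made up): two well-separated groups of three points each.
X_demo = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
                   [8.0, 8.0], [8.5, 9.0], [9.0, 8.0]])
centroids, labels = kmeans(X_demo, k=2)
print(labels)     # the first three points share one label, the last three the other
print(centroids)  # one centroid near (1.17, 1.5), the other near (8.5, 8.33)
```

Which group receives label 0 depends on the random start, so only the grouping itself is meaningful, not the label numbers.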

### The Objective Function (Minimizing Error)

The entire K-Means process is an attempt to make the clusters as "tight" as possible by minimizing the **Sum of Squared Errors (SSE)**.

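$$
\text{SSE} = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - c_j \rVert^2
$$

This formula adds up the squared distance between every data point ($x_i$) and the center ($c_j$) of the cluster ($C_j$) it is assigned to. Each assignment and update step can only lower the SSE or leave it unchanged, so the algorithm keeps running until this value stops decreasing.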
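As a small sketch of how the SSE can be computed for a given clustering, the snippet below uses a hypothetical helper `sse` and made-up points, labels, and centroids. scikit-learn reports the same quantity for a fitted model as `inertia_`.

```python
import numpy as np

def sse(X, labels, centroids):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    diffs = X - centroids[labels]   # row i holds x_i minus the centroid of its cluster
    return float(np.sum(diffs ** 2))

# Made-up example: four points, two clusters, centroids at the cluster means.
points = np.array([[1.0, 1.0], [1.0, 3.0], [9.0, 9.0], [11.0, 9.0]])
point_labels = np.array([0, 0, 1, 1])
centers = np.array([[1.0, 2.0], [10.0, 9.0]])

total = sse(points, point_labels, centers)
print(total)  # 4.0 (every point is exactly 1 unit from its centroid)
```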

### Important Notes on K-Means

K-Means is fast, but it has a crucial limitation:

*   **Local Optimum:** Because the initial centroids are chosen randomly, K-Means does not guarantee that it will find the absolute best grouping (the global optimum). It might only find a **local optimum**—a good solution for that specific random start.

**The Fix:** The standard practice is to **run the K-Means algorithm many times** (e.g., 20 or 50 times) with different random starting points and then choose the final clustering result that has the **lowest overall SSE**.
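The restart strategy can be sketched as follows. This is an illustrative NumPy version under stated assumptions: `run_once` is a hypothetical helper that runs a fixed number of K-Means iterations from one random start, and the three-blob dataset is synthetic. In scikit-learn, the `n_init` parameter of `KMeans` serves the same purpose.

```python
import numpy as np

def run_once(X, k, seed, n_iters=50):
    """One K-Means run from a single random start; returns (sse, labels)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign each point to its nearest centroid, then recompute the means.
        labels = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(axis=1)
        centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    # Final assignment and SSE for the centroids we ended up with.
    labels = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(axis=1)
    sse = float(np.sum((X - centroids[labels]) ** 2))
    return sse, labels

# Synthetic data: three Gaussian blobs of 30 points each.
data_rng = np.random.default_rng(0)
X_blobs = np.vstack([
    data_rng.normal(loc=center, scale=0.5, size=(30, 2))
    for center in ([0, 0], [5, 5], [0, 5])
])

# Run 20 times with different random starts and keep the lowest-SSE result.
results = [run_once(X_blobs, k=3, seed=s) for s in range(20)]
best_sse, best_labels = min(results, key=lambda r: r[0])
print(best_sse)
```

Comparing the individual SSE values in `results` shows whether some starts got stuck in a worse local optimum than others.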

---

## Finding the best value for "**_K_**"

OK! We know how the algorithm works, but how do we choose **_"K"_**? We could pick 3 or 4 or any other number, but picking it at random does not make much sense. Don't worry; there is a solution for that.

With _n_ instances, we can have anywhere from one up to _n_ clusters. Logically, though, _n_ clusters would put every point in its own group, so what would be the point of clustering at all?! On the other hand, a single cluster tells us nothing either. The best **_"K"_** lies somewhere in between.

The first solution is to run the algorithm with several values of **_"K"_** and compare the results. Keep in mind that the error (SSE) almost always drops as **_"K"_** grows, so the lowest error alone does not identify the best model; what matters is how much each extra cluster improves it.

The other solution is to plot the error against **_"K"_** and look for the bend in the curve. This is called the "Elbow method", shown in the image below.

---

![](assets/elbow_method.png)

---

As seen in the image, the **mean distance** drops sharply as **_"K"_** goes from 1 to 3. From **_K_** = **_4_** onward, the curve flattens and each extra cluster helps much less. In this case, we say that **_K_** = **_3_** seems to be the best **_"K"_**!

---

Enough theory! Let's do a quick lab here to see what's going on in action.

# Diving into code

**_Note:_** The following dataset comes from IBM.

Imagine that you have a customer dataset, and you need to apply customer segmentation to this historical data.
Customer segmentation is the practice of partitioning a customer base into groups of individuals that have similar characteristics. It is a significant strategy because a business can target these specific groups of customers and allocate marketing resources effectively. For example, one group might contain customers who are high-profit and low-risk, that is, more likely to purchase products or subscribe to a service. The business task is then to retain those customers. Another group might include customers from non-profit organizations, and so on.

### Load Data From CSV File

Before working with our data, let's load the customer segmentation CSV file and look at the main structure of the data.

In [None]:
import pandas as pd
cust_df = pd.read_csv("data/customer_segmentation.csv")
cust_df.head()

## Preprocessing data

We don't need the "**_Address_**" column here: it is a categorical value, and K-Means relies on Euclidean distance, which is not meaningful for categories. So we drop it.


In [None]:
df = cust_df.drop(columns=["Address"])
df.head()

### Let's normalize our data a little bit.

Why?

Imagine you have a dataset about customers with two features:

*   **Age:** (e.g., from 20 to 70)
*   **Income:** (e.g., from 30,000 to 150,000)

Now, imagine you are the K-Means algorithm trying to calculate the distance between two customers. A difference of **10** in age (e.g., 30 vs 40) is quite significant. But a difference of **10** in income (e.g., \$50,000 vs \$50,010) is tiny and meaningless.

Because the numbers for **Income** are so much bigger, any distance calculation will be completely dominated by the Income feature. The algorithm will essentially ignore Age, not because Age is unimportant, but because its numerical contribution to the distance is just too small to matter.

It's like comparing apples and oranges. The algorithm can't make a fair comparison.
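A quick numeric check makes the problem concrete. The three customers below are hypothetical `[age, income]` pairs invented for this illustration.

```python
import numpy as np

# Hypothetical customers as [age, income]; the values are made up.
a = np.array([30.0, 50_000.0])
b = np.array([60.0, 50_500.0])   # 30 years older, almost the same income
c = np.array([30.0, 52_000.0])   # same age, income 4% higher

dist_ab = np.linalg.norm(a - b)  # about 500.9: the 30-year age gap adds under 1 unit
dist_ac = np.linalg.norm(a - c)  # 2000.0
print(dist_ab, dist_ac)
```

On the raw values, customer `c` looks about four times farther from `a` than `b` does, even though `b` differs from `a` by a whole generation in age. The distance is driven almost entirely by income.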

### The Solution: StandardScaler

`StandardScaler` fixes this problem. For each feature, it does the following:

1.  It calculates the average (mean) and the standard deviation.
2.  It then transforms each value so that the new average of the feature is **0** and the new standard deviation is **1**.

### The Result

After using `StandardScaler`, your data might look something like this:

| Feature | Original Value | Scaled Value |
| :--- | :--- | :--- |
| **Age** | 35 | -0.52 |
| **Income** | 90,000 | 1.25 |
| **Age** | 25 | -1.50 |
| **Income** | 45,000 | -0.80 |

Now, a change of "1" in the scaled Age is just as significant as a change of "1" in the scaled Income. All features are now on a level playing field and have an equal chance to influence the clustering result.


## A quick example

Imagine we have a small dataset with just one feature: **Age**.

**Our Raw Data (Original Ages):**
`[25, 30, 35, 40, 50]`

`StandardScaler` follows a two-step process to find the "ingredients" it needs, and a final step to apply the transformation.

### Step 1: Calculate the Mean (the Average)

First, we find the average of our data.

$$
\text{Mean (μ)} = \frac{25 + 30 + 35 + 40 + 50}{5} = \frac{180}{5} = 36
$$

So, the average age is **36**.

### Step 2: Calculate the Standard Deviation (the "Spread")

Next, we calculate how spread out the data is. This takes a few sub-steps:

1.  For each number, subtract the mean (36) and square the result.
    *   $(25 - 36)^2 = (-11)^2 = 121$
    *   $(30 - 36)^2 = (-6)^2 = 36$
    *   $(35 - 36)^2 = (-1)^2 = 1$
    *   $(40 - 36)^2 = (4)^2 = 16$
    *   $(50 - 36)^2 = (14)^2 = 196$
2.  Find the average of these squared results.
    *   $\frac{121 + 36 + 1 + 16 + 196}{5} = \frac{370}{5} = 74$
3.  Take the square root of that average.
    *   $\sqrt{74} \approx 8.6$

So, the standard deviation (σ) is approximately **8.6**.

### Step 3: Apply the StandardScaler Transformation

Now we have our two "ingredients": **Mean (μ) = 36** and **Standard Deviation (σ) = 8.6**.

The formula for `StandardScaler` is:

$$
\text{Scaled Value} = \frac{\text{Original Value} - \text{Mean}}{\text{Standard Deviation}}
$$

Let's apply this to every one of our original age values:

| Original Age | Calculation | Scaled Age |
| :---: | :---: | :---: |
| **25** | (25 - 36) / 8.6 | **-1.28** |
| **30** | (30 - 36) / 8.6 | **-0.70** |
| **35** | (35 - 36) / 8.6 | **-0.12** |
| **40** | (40 - 36) / 8.6 | **0.47** |
| **50** | (50 - 36) / 8.6 | **1.63** |

### What We Achieved

We successfully transformed our original data into a new, scaled version:

*   **Original Data:** `[25, 30, 35, 40, 50]`
*   **Scaled Data:** `[-1.28, -0.70, -0.12, 0.47, 1.63]`

This new set of numbers has a **mean of 0** and a **standard deviation of 1**, and it's now ready to be used in a distance-based algorithm like K-Means without any risk of bias due to its scale.
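The hand calculation above can be double-checked with a few lines of NumPy. This sketch reproduces the same formula directly. `ndarray.std()` defaults to the population standard deviation (dividing by _n_), which is the convention used in Step 2 and also the one `StandardScaler` uses.

```python
import numpy as np

ages = np.array([25, 30, 35, 40, 50], dtype=float)

mu = ages.mean()     # 36.0
sigma = ages.std()   # population standard deviation, about 8.6023
scaled = (ages - mu) / sigma

print(np.round(scaled, 2))          # approximately -1.28, -0.70, -0.12, 0.46, 1.63
print(scaled.mean(), scaled.std())  # about 0 and 1, up to floating-point error
```

The fourth value comes out as 0.46 rather than the 0.47 in the table because the code divides by the unrounded σ ≈ 8.6023 instead of 8.6.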

In [None]:
from sklearn.preprocessing import StandardScaler
import numpy as np
X = df.values[:, 1:]  # skip "Customer Id" (column 0): Age all the way to DebtIncomeRatio
X = np.nan_to_num(X)  # NaN values -> 0, infinite -> a very large finite number
Clus_dataSet = StandardScaler().fit_transform(X)
Clus_dataSet

#### Modeling

Without the k-means algorithm, we would have to guess which age, income, education, and so on each customer group has, and then check those guesses through many tests and experiments. K-means clustering makes this whole process much easier.

Let's apply k-means to our dataset and take a look at the cluster labels.

In [None]:
from sklearn.cluster import KMeans

clusterNum = 3
k_means = KMeans(init = "k-means++", n_clusters = clusterNum, n_init = 30)
k_means.fit(Clus_dataSet)
labels = k_means.labels_
print(labels)

In [None]:
df["Clustered Label"] = labels
df.head()

In [None]:
overal_view = df.groupby("Clustered Label").mean().drop(columns=["Customer Id"])
overal_view

### Let's show the data on a plot

In [None]:
import matplotlib.pyplot as plt

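area = np.pi * (X[:, 1]) ** 2  # marker size is driven by column 1 of X
# x-axis: Age (column 0), y-axis: Income (column 3), colored by cluster label
plt.scatter(X[:, 0], X[:, 3], s=area, c=labels.astype(float), alpha=0.5)
plt.xlabel('Age', fontsize=18)
plt.ylabel('Income', fontsize=18)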

plt.show()

### Why **_K = 3_** ?

Well, as discussed before, K-Means has the elbow method. So let's apply it here as well and see the result:

In [None]:
wcss = []
k_values = range(1, 16)

for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(Clus_dataSet)  # fit on the scaled data, the same input the model above used
    wcss.append(kmeans.inertia_)  # inertia_ is WCSS

plt.plot(k_values, wcss, 'bo-')
plt.xlabel('Number of clusters (K)')
plt.ylabel('Within-Cluster Sum of Squares (WCSS)')
plt.title('Elbow Method For Optimal K')
plt.show()

As you can see, the curve flattens noticeably after **_K = 3_**. This suggests that 3 is a good trade-off between complexity and performance!

However, you might get better results with **_K = 4_** as well. Why don't you try that out yourself?

Anyway, that was all about K-means. Let's dive into the next algorithm, called **_Hierarchical_** clustering.