# KMeans Clustering: An Overview | [Link](https://github.com/AdilShamim8/50-Days-of-Machine-Learning/tree/main/Day%2046%20K-Means)

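KMeans is an unsupervised learning algorithm that partitions a set of \( n \) observations into \( k \) clusters, where each observation is assigned to the cluster with the nearest mean (centroid). The algorithm iteratively reduces the total intra-cluster variance and converges to a local minimum, which is not guaranteed to be the global one, so the result depends on the initial centroids.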

## Objective Function

The goal of KMeans is to minimize the **within-cluster sum of squares (WCSS)**:

$$
J = \sum_{i=1}^{k} \sum_{x \in S_i} \| x - \mu_i \|^2
$$

- \( S_i \) is the set of points in cluster \( i \).
- \( \mu_i \) is the centroid of cluster \( i \), computed as:

$$
\mu_i = \frac{1}{|S_i|} \sum_{x \in S_i} x
$$
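
To make the two formulas above concrete, here is a minimal NumPy sketch that computes \( J \) for a given assignment of points to clusters. The function name `wcss` and the four-point toy dataset are illustrative assumptions, not part of the original note.

```python
import numpy as np

def wcss(X, labels, k):
    """Within-cluster sum of squares J for a given cluster assignment."""
    total = 0.0
    for i in range(k):
        S_i = X[labels == i]           # points assigned to cluster i
        if len(S_i) == 0:              # an empty cluster contributes nothing
            continue
        mu_i = S_i.mean(axis=0)        # centroid: mean of the cluster's points
        total += np.sum((S_i - mu_i) ** 2)
    return total

# Toy data (hypothetical): two pairs of points far apart
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 4.0]])
labels = np.array([0, 0, 1, 1])

J = wcss(X, labels, k=2)
print(J)  # 10.0  (cluster 0 contributes 1 + 1, cluster 1 contributes 4 + 4)
```

Grouping the points differently, for example `labels = np.array([0, 1, 0, 1])`, gives a much larger \( J \), which is why the first grouping is the better clustering under this objective.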

## Algorithm Steps

1. **Initialization:**  
   Choose \( k \) initial centroids (e.g., randomly or using a method like k-means++).

2. **Assignment Step (Expectation):**  
   For each data point \( x \), assign it to the cluster with the nearest centroid:

   $$
   \text{Cluster}(x) = \arg \min_{i \in \{1,\dots,k\}} \| x - \mu_i \|^2
   $$

3. **Update Step (Maximization):**  
   Recalculate the centroid of each cluster by taking the mean of all points assigned to that cluster:

   $$
   \mu_i = \frac{1}{|S_i|} \sum_{x \in S_i} x
   $$

4. **Convergence:**  
   Repeat the assignment and update steps until the centroids no longer change (or the change is below a predefined threshold), or until a maximum number of iterations is reached.
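
The four steps above can be sketched from scratch in NumPy. This is a minimal illustration, not scikit-learn's implementation: it assumes plain random initialization (not k-means++), and the function name `lloyd_kmeans`, its parameters `max_iter`, `tol`, and `seed`, and the two-blob test data are all hypothetical choices made for the example.

```python
import numpy as np

def lloyd_kmeans(X, k, max_iter=100, tol=1e-6, seed=0):
    """Plain KMeans loop: random init, assign, update, repeat until stable."""
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick k distinct data points as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()

    for _ in range(max_iter):
        # 2. Assignment: squared distance from every point to every centroid
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)

        # 3. Update: each centroid becomes the mean of its assigned points
        new_centroids = centroids.copy()
        for i in range(k):
            members = X[labels == i]
            if len(members) > 0:       # keep the old centroid if a cluster is empty
                new_centroids[i] = members.mean(axis=0)

        # 4. Convergence: stop once the centroids barely move
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break

    # Final assignment so the labels match the returned centroids
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return centroids, d2.argmin(axis=1)

# Two well-separated synthetic blobs, centred near (0, 0) and (5, 5)
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])

centroids, labels = lloyd_kmeans(X, k=2)
print(centroids)
```

On data this well separated, the loop should recover one centroid near each blob. On harder data, a single random start can settle in a poor local minimum, which is the motivation for k-means++ and for running several restarts.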

## Python Code Example

The following Python example demonstrates KMeans clustering with scikit-learn:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate synthetic data with 4 centers
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Visualize the generated data
plt.figure(figsize=(6, 4))
plt.scatter(X[:, 0], X[:, 1], s=30, color='gray')
plt.title("Generated Data")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()

# Set the number of clusters
k = 4

# Initialize KMeans with k-means++ initialization
kmeans = KMeans(n_clusters=k, init="k-means++", n_init=10, max_iter=300, random_state=0)

# Fit KMeans to the data and get each point's cluster label in one call
y_kmeans = kmeans.fit_predict(X)

# Plot the clustered data with centroids
plt.figure(figsize=(6, 4))
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=30, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, marker='X')
plt.title("KMeans Clustering")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()
```
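
As a follow-up, the sketch below connects the fitted model back to the objective function. It refits the same model on the same synthetic data, recomputes \( J \) by hand from `labels_` and `cluster_centers_`, and compares it with the `inertia_` attribute that scikit-learn reports. The variable names `J_manual` and `new_point` are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Same synthetic data and settings as the example above
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
kmeans = KMeans(n_clusters=4, init="k-means++", n_init=10,
                max_iter=300, random_state=0).fit(X)

# Recompute J: squared distance from each point to its own cluster's centroid
diffs = X - kmeans.cluster_centers_[kmeans.labels_]
J_manual = np.sum(diffs ** 2)

# The two values are expected to agree up to floating-point error
print("manual J:", J_manual)
print("inertia_:", kmeans.inertia_)
print("iterations in best run:", kmeans.n_iter_)

# Assign a previously unseen point (hypothetical) to its nearest centroid
new_point = np.array([[0.0, 4.0]])
print("cluster of new point:", kmeans.predict(new_point)[0])
```

Because `n_init=10` runs the algorithm from ten different initializations and keeps the run with the lowest inertia, the reported values describe the best of those runs.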

## Summary

- **Objective:** Minimize the within-cluster sum of squares (WCSS), i.e. the total intra-cluster variance:

  $$
  J = \sum_{i=1}^{k} \sum_{x \in S_i} \| x - \mu_i \|^2
  $$

- **Centroid Calculation:**

  $$
  \mu_i = \frac{1}{|S_i|} \sum_{x \in S_i} x
  $$

- **Key Steps:**
  1. Initialization of centroids
  2. Assignment of points to the nearest centroid
  3. Recalculation of centroids
  4. Repeating until convergence

This note provides the essential formulas and a practical example to help you understand and implement KMeans clustering.