<h1 align=center> K-means Clustering In Depth </h1>

![kmeans.png](attachment:kmeans.png)

- Unsupervised learning algorithm
- Requires feature scaling
- Affected by imbalanced data
- Influenced by outliers

**Algorithm**: K-means works iteratively to assign each data point to one of the K clusters based on the features provided. It then calculates the centroid of each cluster and reassigns the data points to the nearest centroid. This process is repeated until the centroids no longer change significantly or a maximum number of iterations is reached.

#### **Objective**: 
- The goal of K-means clustering is to group similar data points into clusters, where “similar” means that points within the same cluster are closer to each other in terms of some distance metric, typically Euclidean distance.

$$
J = \sum_{j=1}^{k} \sum_{i=1}^{n_j} \lVert X_i^{(j)} - C_j \rVert^2
$$

- $J$: objective function
- $k$: number of clusters
- $n_j$: number of cases in cluster $j$
- $X_i^{(j)}$: the $i$-th case assigned to cluster $j$
- $C_j$: centroid of cluster $j$


#### **Cost Function for k-means:**

$$
J(c^{(1)},\dots,c^{(m)}, \mu_1,\dots,\mu_k) = \frac{1}{m} \sum_{i=1}^{m} \lVert x^{(i)} - \mu_{c^{(i)}} \rVert^2
$$

- $m$: number of examples
- $c^{(i)}$: index of the cluster (1, 2, 3, …, k) to which example $x^{(i)}$ is currently assigned
- $\mu_k$: centroid of cluster $k$
- $\mu_{c^{(i)}}$: centroid of the cluster to which example $x^{(i)}$ has been assigned
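
To make the cost function concrete, here is a minimal NumPy sketch that evaluates it for a fixed assignment. The function name `kmeans_cost` and the toy data are assumptions chosen for illustration; they do not come from any library.

```python
import numpy as np

def kmeans_cost(X, assignments, centroids):
    """Mean squared distance between each example and its assigned centroid."""
    X = np.asarray(X, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    # centroids[assignments] looks up mu_{c^(i)} for every example x^(i)
    diffs = X - centroids[assignments]
    return np.mean(np.sum(diffs ** 2, axis=1))

# Hypothetical toy data: two groups of three points each
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])
assignments = np.array([0, 0, 0, 1, 1, 1])          # c^(i) for each example
centroids = np.array([[1.0, 2.0], [10.0, 2.0]])     # mu_1, mu_2

cost = kmeans_cost(X, assignments, centroids)
print(cost)  # 16 / 6, about 2.667
```

Each group contributes squared distances of 0, 4 and 4, so the total is 16 and the mean over the 6 examples is 16/6.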

### **How it works:**

1. **Specifying K:** The user must specify the desired number of clusters (k) upfront. Choosing an appropriate k is an important step that often involves experimentation and domain knowledge.
2. **Initialization**: K-means typically starts by randomly placing k centroids within the data space. There are also more sophisticated methods for centroid initialization, such as k-means++ which aims for better initial placement.
3. **Assignment**: Assign each data point to the nearest centroid, based on a distance measure (often Euclidean distance).
4. **Update**: Recalculate the centroids of the clusters by taking the mean of all data points assigned to each cluster.
5. **Convergence**: Repeat the assignment and update steps until the centroids no longer change significantly or a maximum number of iterations is reached.
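
The steps above can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not scikit-learn's implementation: the function name `kmeans`, its parameters (`max_iters`, `tol`, `seed`), the toy data, and the choice to initialize from randomly chosen data points are all made up for the example.

```python
import numpy as np

def kmeans(X, k, max_iters=100, tol=1e-6, seed=0):
    """Basic K-means loop: initialize, assign, update, check convergence."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)

    # Initialization: use k distinct data points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    for _ in range(max_iters):
        # Assignment: Euclidean distance from every point to every centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)

        # Update: each centroid becomes the mean of its assigned points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])

        # Convergence: stop once the centroids barely move
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break

    return labels, centroids

# Hypothetical toy data: two well-separated groups
X = np.array([[0, 0], [0, 1], [1, 0],
              [10, 10], [10, 11], [11, 10]])
labels, centroids = kmeans(X, k=2)
print(labels)
print(centroids)
```

Which cluster receives index 0 depends on the random initialization, so only the grouping of points is meaningful, not the label values themselves.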

Practical Example:

- The data below contains 8 points:

![cluster1.png](attachment:cluster1.png)

- Step 1: We choose the number of clusters (k = 2 for this example).
- Step 2: We randomly select a centroid for each cluster.

![cluster2.png](attachment:cluster2.png)

- Step 3: We assign each point to the closest cluster centroid.

![cluster3.png](attachment:cluster3.png)

- Step 4: We recalculate the centroids of the newly formed clusters.

![cluster4.png](attachment:cluster4.png)

- Finally, we repeat steps 3 and 4 until the centroids no longer change significantly or a maximum number of iterations is reached.

![cluster5.png](attachment:cluster5.png)

- The figure above shows the data points grouped into their final clusters.

### **Key Considerations:**

- **Distance Measure:** The choice of distance measure (e.g., Euclidean distance, Manhattan distance) can impact the clustering results. Standard K-means uses squared Euclidean distance, for which the mean is the optimal centroid; variants such as k-medians pair Manhattan distance with a median update. Selecting a suitable measure depends on the nature of your data.
- **Initialization:** Random initialization can lead to a different clustering on each run, and a poor start can converge to a poor local optimum. K-means++ helps to mitigate this by favoring initial centroids that are far apart, often leading to better convergence.
- **Choosing the Right k:** Determining the optimal number of clusters can be challenging. Techniques like the elbow method (plotting the sum of squared distances within clusters vs. k) can be used as a guide.
- The **within-cluster sum of squares (WCSS)** is the sum of the squared distances from each data point to the centroid of its own cluster, taken over all clusters. Plotting WCSS against k helps identify the right number of clusters.
- **Inertia** is scikit-learn's name for the same quantity: the sum of squared distances from each point to the centroid of its cluster.
- The **silhouette score** evaluates clustering quality by comparing, for each point, the mean distance to points in its own cluster with the mean distance to points in the nearest other cluster. It ranges from -1 to 1, and higher values indicate better-separated clusters.
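
As a rough illustration of the elbow method and the silhouette score, the sketch below fits scikit-learn's `KMeans` for several values of k and records `inertia_` (WCSS) and `silhouette_score`. The toy data, the range of k, and `n_init=10` are assumptions chosen for the example; the silhouette score is skipped for k = 1 because it needs at least two clusters.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical toy data: two groups of three points each
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

inertias = {}
silhouettes = {}
for k in range(1, 5):
    model = KMeans(n_clusters=k, init="k-means++",
                   n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_          # WCSS for this k
    if k >= 2:
        # Silhouette is only defined for 2 <= k <= n_samples - 1
        silhouettes[k] = silhouette_score(X, model.labels_)

for k in inertias:
    print(k, round(inertias[k], 2), silhouettes.get(k))
```

On this data the inertia drops sharply from k = 1 to k = 2 and much less afterwards, which is the "elbow" the method looks for.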

### **Pros:**

1. Simple and Easy to Implement
2. Efficient: the time per iteration is linear in the number of samples, which makes it practical for large datasets

### **Cons:**

1. Sensitive to Initialization
2. Requires a Predefined Number of Clusters
3. Sensitive to Outliers
4. Not Suitable for Non-linearly Separable (Non-convex) Clusters
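
The sensitivity to outliers follows from the update step: each centroid is a mean, and a mean can be pulled far by a single extreme value. The short NumPy sketch below shows this on made-up data; the points and variable names are assumptions for illustration only.

```python
import numpy as np

# Hypothetical tight cluster of four points plus one distant outlier
cluster = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0]])
outlier = np.array([[20.0, 20.0]])

# Centroid (mean) without and with the outlier
centroid_clean = cluster.mean(axis=0)
centroid_with_outlier = np.vstack([cluster, outlier]).mean(axis=0)

# How far the single outlier drags the centroid
shift = np.linalg.norm(centroid_with_outlier - centroid_clean)

print(centroid_clean)          # [1.5 1.5]
print(centroid_with_outlier)   # [5.2 5.2]
print(shift)
```

One outlier moves the centroid from inside the cluster to a position that lies outside all four original points.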

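```python
from sklearn.cluster import KMeans
import numpy as np

# Six 2-D points forming two groups (x = 1 and x = 10)
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# Fit K-means with k = 2 and k-means++ initialization
kmeans = KMeans(n_clusters=2,
                init='k-means++',
                random_state=0,
                n_init="auto").fit(X)

# Cluster index assigned to each training point
print(kmeans.labels_)

# Assign new points to the nearest learned centroid
print(kmeans.predict([[0, 0], [12, 3]]))

# Coordinates of the learned centroids
print(kmeans.cluster_centers_)
```

Output:

```
[1 1 1 0 0 0]
[1 0]
[[10.  2.]
 [ 1.  2.]]
```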


