# K-means Clustering

K-means is one of the simplest and most widely used unsupervised machine learning algorithms. It partitions a dataset into \( K \) non-overlapping subgroups, known as clusters, where the number of clusters \( K \) is chosen by the user.

## How does K-means work?

1. **Initialization**: Randomly select \( K \) data points (seeds) to be the initial centroids.
2. **Assignment**: Assign each data point to its nearest centroid; the point takes that centroid's cluster label.
3. **Update**: Recompute each centroid as the mean of all the points assigned to its cluster.
4. **Repeat**: Repeat the assignment and update steps until the assignments no longer change (or a maximum number of iterations is reached).
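The four steps above can be sketched directly in NumPy. This is a minimal illustration rather than a production implementation: the function name `kmeans_sketch`, its parameters, and the choice to leave an empty cluster's centroid unchanged are assumptions made for this example.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, seed=0):
    """Illustrative K-means; names and defaults are hypothetical."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points as the starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Assignment: squared Euclidean distance from every point to every centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        # Repeat: stop once the assignments no longer change.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Update: move each centroid to the mean of its assigned points
        # (assumption: an empty cluster keeps its previous centroid).
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Usage on two well-separated synthetic groups of 50 points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, (50, 2)), rng.normal(10.0, 0.5, (50, 2))])
labels, centroids = kmeans_sketch(X, k=2)
print(centroids)
```

Which group receives label 0 and which receives label 1 depends on the random seeds, which is why only the grouping itself is meaningful.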

## Key Points to Remember:

- The outcome depends on the initial seeds, so the algorithm may yield different results on different runs.
- The value of \( K \) needs to be specified beforehand. One common method to find a reasonably good value of \( K \) is the "Elbow Method".
- K-means is sensitive to the scale of the data. Hence, it's often recommended to scale your data before applying K-means clustering.
- K-means separates clusters with linear boundaries and implicitly assumes that clusters are roughly spherical and similar in size, so it works best on data of that shape.
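As a rough sketch of the Elbow Method, the example below scales the data, fits K-means for several values of \( K \), and records the within-cluster sum of squares that scikit-learn exposes as `inertia_`. The synthetic dataset, the range of \( K \) values, and the `n_init` and `random_state` settings are assumptions chosen for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumed for this example): 300 points in 4 blobs.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Scale each feature to zero mean and unit variance before clustering.
X_scaled = StandardScaler().fit_transform(X)

# Fit K-means for K = 1..8 and record the inertia for each K.
inertias = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(f"K={k}: inertia={value:.1f}")
```

The "elbow" is the value of \( K \) after which the inertia stops dropping sharply; picking it is a visual judgment rather than an exact rule.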

## Mathematical Objective:

K-means aims to minimize the total within-cluster variance or, equivalently, the sum of squared distances from each point to its assigned centroid. Formally, if \( c(i) \) is the cluster to which data point \( x^{(i)} \) is assigned and \( \mu_{c(i)} \) is the centroid of that cluster, the objective \( J \) to be minimized is:

$$ J = \sum_{i=1}^{m} \lVert x^{(i)} - \mu_{c(i)} \rVert^2 $$

Where:
- \( x^{(i)} \) is a data point.
- \( \mu_{c(i)} \) is the centroid of the cluster assigned to \( x^{(i)} \).
- \( m \) is the number of data points.
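To make the objective concrete, the sketch below computes \( J \) by hand from a fitted scikit-learn model and compares it with the model's `inertia_` attribute, which reports the same sum of squared distances. The synthetic dataset and the `n_init` and `random_state` settings are assumptions for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data (assumed for this example).
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
model = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# mu_{c(i)}: look up the centroid assigned to each point x^(i).
assigned_centroids = model.cluster_centers_[model.labels_]

# J: sum over all points of the squared distance to the assigned centroid.
J = np.sum((X - assigned_centroids) ** 2)

print("J computed by hand:", J)
print("model.inertia_:    ", model.inertia_)
```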

## Limitations:

- Not suitable for clusters with complex (non-convex) shapes.
- Sensitive to the initial placement of centroids.
- May not work well when clusters differ in size or density.

To overcome some of these limitations, consider other algorithms such as K-medoids, hierarchical clustering, or DBSCAN.
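The shape limitation can be illustrated on two interleaved half-moons, where a density-based method such as DBSCAN tends to follow the curved clusters while K-means cuts across them. This is only a sketch: the dataset and the DBSCAN settings `eps=0.2` and `min_samples=5` are illustrative assumptions, and other data would need different values.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved, non-convex clusters (assumed example data).
X, y_true = make_moons(n_samples=200, noise=0.05, random_state=0)

# K-means with K=2 splits the plane with a straight boundary.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN groups points by density; eps and min_samples are illustrative choices.
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Adjusted Rand index: 1.0 means perfect agreement with the true grouping.
ari_kmeans = adjusted_rand_score(y_true, kmeans_labels)
ari_dbscan = adjusted_rand_score(y_true, dbscan_labels)
print(f"K-means ARI: {ari_kmeans:.2f}")
print(f"DBSCAN ARI:  {ari_dbscan:.2f}")
```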

## Libraries

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

In [None]:
# Generating random data for clustering
data, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Visualizing the data
plt.scatter(data[:, 0], data[:, 1], s=50)
plt.title('Data Points')
plt.xlabel('X1')
plt.ylabel('X2')
plt.show()

# Using KMeans to cluster the data into 4 clusters
# (n_init and random_state are set explicitly so the result is reproducible)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
predicted_clusters = kmeans.fit_predict(data)
centroids = kmeans.cluster_centers_

# Visualizing the clusters
plt.scatter(data[:, 0], data[:, 1], c=predicted_clusters, s=50, cmap='viridis')
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', s=200, marker='X')
plt.title('Clusters with Centroids')
plt.xlabel('X1')
plt.ylabel('X2')
plt.show()