# K-means Algorithm

K-means is a **clustering algorithm** that groups data into a predefined number of clusters based on feature similarity. The goal is to minimize the variance within each cluster by finding a set of centroids that best represent the data points.

## Steps of the K-means Algorithm:

1. **Choose the number of clusters, \( k \)**: 
   The first step in the K-means algorithm is to specify the number of clusters to form, denoted by \( k \). The value of \( k \) is a hyperparameter that must be chosen before running the algorithm.

2. **Initialize centroids**: 
   Pick \( k \) initial centroids, one per cluster. They can be chosen uniformly at random from the data points, or with a more careful seeding method such as **k-means++**, which spreads the initial centroids apart and typically leads to faster convergence and better final clusters.

3. **Assign points to the nearest centroid**: 
   For each data point \( x_i \), assign it to the cluster whose centroid is closest to it. The proximity is measured using the Euclidean distance:

   $$ d(x_i, c_j) = \| x_i - c_j \|_2 = \sqrt{\sum_{d=1}^{D} (x_{i,d} - c_{j,d})^2} $$

   where \( x_i \) is the data point, \( c_j \) is the centroid of cluster \( j \), and \( D \) is the number of features.

4. **Update the centroids**: 
   After assigning all data points to clusters, the centroids are recomputed by calculating the mean of all points assigned to each cluster:

   $$ c_j = \frac{1}{n_j} \sum_{i \in C_j} x_i $$

   where \( C_j \) is the set of data points assigned to cluster \( j \) and \( n_j \) is the number of points in that cluster.

5. **Repeat steps 3 and 4**: 
   Steps 3 and 4 are repeated until the centroids no longer change significantly, at which point the algorithm has converged, or until a set maximum number of iterations is reached.
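
The assignment and update steps above can be sketched in vectorized NumPy as follows. This is a minimal illustration rather than a reference implementation: the function name `assign_and_update` and the toy points are assumptions made for the example, and keeping the old centroid for an empty cluster is only one possible convention.

```python
import numpy as np

def assign_and_update(points, centroids):
    """One pass of steps 3 and 4 (hypothetical helper, not from a library)."""
    # Pairwise differences via broadcasting: shape (n_points, k, n_features)
    diffs = points[:, None, :] - centroids[None, :, :]
    # Euclidean distance from every point to every centroid: shape (n_points, k)
    distances = np.linalg.norm(diffs, axis=2)
    # Step 3: index of the nearest centroid for each point
    labels = np.argmin(distances, axis=1)
    # Step 4: each centroid becomes the mean of its assigned points;
    # an empty cluster keeps its previous centroid (an assumption of this sketch)
    new_centroids = centroids.astype(float).copy()
    for j in range(len(centroids)):
        members = points[labels == j]
        if len(members) > 0:
            new_centroids[j] = members.mean(axis=0)
    return labels, new_centroids

points = np.array([[0, 1], [1, 3], [2, 2], [8, 10], [9, 12]], dtype=float)
centroids = np.array([[0, 0], [10, 10]], dtype=float)
labels, new_centroids = assign_and_update(points, centroids)
print(labels)         # cluster index per point: 0, 0, 0, 1, 1
print(new_centroids)  # rows (1, 2) and (8.5, 11)
```

Calling the function repeatedly, feeding `new_centroids` back in as `centroids`, gives the loop described in step 5.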

## Mathematical Formulation

The K-means algorithm seeks to minimize the **within-cluster sum of squares (WCSS)**, which is the sum of the squared Euclidean distances between each data point and its respective centroid.

$$ J = \sum_{j=1}^{k} \sum_{i \in C_j} \| x_i - c_j \|^2 $$

Where:
- \( k \) is the number of clusters.
- \( C_j \) is the set of points in cluster \( j \).
- \( c_j \) is the centroid of cluster \( j \).
- \( x_i \) is a data point in \( C_j \).
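
As a small worked example of the objective \( J \), the sketch below evaluates it for a given assignment. The helper name `wcss` and the four toy points are assumptions chosen so the result is easy to check by hand: every point lies at distance 1 from its centroid, so \( J = 4 \).

```python
import numpy as np

def wcss(points, labels, centroids):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    # centroids[labels] selects, for every point, the centroid of its cluster
    diffs = points - centroids[labels]
    return float(np.sum(diffs ** 2))

points = np.array([[0, 0], [0, 2], [10, 0], [10, 2]], dtype=float)
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0, 1], [10, 1]], dtype=float)

J = wcss(points, labels, centroids)
print(J)  # 4.0
```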

## Convergence

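The K-means algorithm converges when the centroids no longer change significantly between iterations. Because neither the assignment step nor the update step can increase \( J \), and there are only finitely many ways to partition the data, the procedure always terminates, though possibly at a local minimum rather than the global one. In practice the algorithm can also be stopped after a set number of iterations.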
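
One way to make "no longer change significantly" concrete is to compare how far each centroid moved against a small threshold. The sketch below assumes a hypothetical helper `has_converged` and an arbitrary tolerance of `1e-4`. Libraries may define their stopping criterion differently.

```python
import numpy as np

def has_converged(old_centroids, new_centroids, tol=1e-4):
    """Return True when no centroid moved farther than `tol` (Euclidean distance)."""
    # Distance each centroid travelled between two consecutive iterations
    shifts = np.linalg.norm(new_centroids - old_centroids, axis=1)
    return bool(np.max(shifts) <= tol)

old = np.array([[1.0, 2.0], [8.5, 11.0]])
barely_moved = old + 1e-6                           # shift far below the tolerance
moved = old + np.array([[0.5, 0.0], [0.0, 0.0]])    # first centroid moved by 0.5

print(has_converged(old, barely_moved))  # True
print(has_converged(old, moved))         # False
```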

### Key Points:
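- K-means is sensitive to the initial placement of centroids, which can result in different clustering outcomes. A common remedy is to run it several times with different initializations and keep the result with the lowest WCSS.
- The number of clusters \( k \) needs to be chosen beforehand. This can be done using methods like the **elbow method** or **silhouette score**.
- K-means works best with spherical (globular) clusters of similar size. It tends to perform poorly on elongated or non-convex clusters.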
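
The elbow method and the silhouette score mentioned above can be sketched with scikit-learn as follows. This is an illustrative sketch under a few assumptions: it uses the ten sample points from this notebook, tries \( k = 1, \dots, 5 \), and fixes `n_init=10` and `random_state=0` for repeatability. `KMeans.inertia_` is the WCSS of the fitted model, and `silhouette_score` is only defined for at least two clusters.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.array([(0, 1), (1, 3), (2, 2), (3, 5), (4, 7),
              (5, 8), (6, 8), (7, 9), (8, 10), (9, 12)], dtype=float)

inertias = {}     # k -> WCSS, used for the elbow method
silhouettes = {}  # k -> mean silhouette score (k >= 2 only)
for k in range(1, 6):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_
    if k >= 2:
        silhouettes[k] = silhouette_score(X, model.labels_)

for k in inertias:
    sil = f"{silhouettes[k]:.3f}" if k in silhouettes else "n/a"
    print(f"k={k}  WCSS={inertias[k]:.2f}  silhouette={sil}")
```

For the elbow method, look for the value of \( k \) after which the WCSS stops dropping sharply. For the silhouette score, higher values indicate better-separated clusters.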

## Applications of K-means:

- **Customer segmentation**: Grouping customers based on similar purchasing behavior.
- **Document clustering**: Categorizing documents based on content similarity.
- **Image compression**: Reducing the number of colors in an image.
- **Anomaly detection**: Identifying outliers as points that lie unusually far from their nearest centroid or that end up in very small, isolated clusters.



In [1]:
import numpy as np

In [2]:

# Function to calculate the Euclidean distance between two points of any dimension
def euclidean_distance(p1, p2):
    return np.linalg.norm(np.asarray(p1) - np.asarray(p2))

In [3]:

# Simple K-means function
def simple_kmeans(points, k, max_iters=100):
    # Initialization: pick k distinct data points at random as the starting centroids
    # (cast to float so the centroids keep a consistent dtype across iterations)
    centroids = points[np.random.choice(len(points), k, replace=False)].astype(float)
    
    for _ in range(max_iters):
        # Assignment step: assign each point to the nearest centroid
        clusters = {i: [] for i in range(k)}  # Maps cluster index -> list of points
        
        for point in points:
            distances = [euclidean_distance(point, centroid) for centroid in centroids]
            closest_centroid = np.argmin(distances)  # Index of the nearest centroid
            clusters[closest_centroid].append(point)
        
        # Update step: recalculate each centroid as the mean of its cluster
        new_centroids = []
        for i in range(k):
            if clusters[i]:
                new_centroids.append(np.mean(clusters[i], axis=0))
            else:
                # An empty cluster has no mean; keep its previous centroid
                new_centroids.append(centroids[i])
        
        new_centroids = np.array(new_centroids)
        
        # Convergence check: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        
        centroids = new_centroids
    
    return clusters, centroids

In [4]:
# Example usage:
data_points = np.array([(0, 1), (1, 3), (2, 2), (3, 5), (4, 7), (5, 8), (6, 8), (7, 9), (8, 10), (9, 12)])
num_clusters = 3

clusters, centroids = simple_kmeans(data_points, num_clusters)


In [5]:

# Printing the results
for cluster_num, points in clusters.items():
    print(f"Cluster {cluster_num + 1}:")
    for point in points:
        print(f"  {point}")
    print(f"Centroid: {centroids[cluster_num]}")
    print()


Cluster 1:
  [7 9]
  [ 8 10]
  [ 9 12]
Centroid: [ 8.         10.33333333]

Cluster 2:
  [0 1]
  [1 3]
  [2 2]
  [3 5]
Centroid: [1.5  2.75]

Cluster 3:
  [4 7]
  [5 8]
  [6 8]
Centroid: [5.         7.66666667]



In [7]:
# using sklearn
import numpy as np
from sklearn.cluster import KMeans

# Function to perform K-means clustering and return clusters
def perform_kmeans_clustering(data_points, num_clusters):
    
    # Converting the data_points to a numpy array
    data_points = np.array(data_points)
    
    # Applying KMeans
    kmeans = KMeans(n_clusters=num_clusters, random_state=42)
    kmeans.fit(data_points)
    
    # Getting the labels (cluster assignments for each point)
    labels = kmeans.labels_
    
    # Organizing points into clusters
    clusters = {}
    for i in range(num_clusters):
        clusters[i+1] = data_points[labels == i]
    
    return clusters

# Example usage of the function
data_points = np.array([(0, 1), (1, 3), (2, 2), (3, 5), (4, 7), (5, 8), (6, 8), (7, 9), (8, 10), (9, 12)])
num_clusters = 3

# Performing K-means clustering
clusters = perform_kmeans_clustering(data_points, num_clusters)

# Printing the clusters
for cluster_num, cluster_points in clusters.items():
    print(f"Cluster {cluster_num}:")
    for point in cluster_points:
        print(f"  {point}")


Cluster 1:
  [3 5]
  [4 7]
  [5 8]
Cluster 2:
  [6 8]
  [7 9]
  [ 8 10]
  [ 9 12]
Cluster 3:
  [0 1]
  [1 3]
  [2 2]
