# K-means Clustering from Scratch

A custom implementation of the K-means algorithm, applied to the latitude and longitude of post offices in Telangana, India.

## 📂 1. Data Loading and Overview

This section loads the dataset and previews its structure.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from math import sqrt
from google.colab import files
uploaded = files.upload()

Saving clustering_data.csv to clustering_data.csv


## Dataset Loading

The dataset `clustering_data.csv` is loaded using Pandas. Initial exploration includes checking the structure, data types, and any missing values using `df.info()`, followed by previewing the first few records with `df.head()`. This step ensures that the dataset is correctly imported and helps identify necessary preprocessing steps such as handling missing data or converting data types.


In [2]:
# Load the dataset
df = pd.read_csv('clustering_data.csv')

# Display basic information
print("Dataset Info:")
print(df.info())

# Display first 5 rows
df.head()


Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 157126 entries, 0 to 157125
Data columns (total 11 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   CircleName    157126 non-null  object
 1   RegionName    157073 non-null  object
 2   DivisionName  157124 non-null  object
 3   OfficeName    157126 non-null  object
 4   Pincode       157126 non-null  int64 
 5   OfficeType    157126 non-null  object
 6   Delivery      157126 non-null  object
 7   District      157126 non-null  object
 8   StateName     157126 non-null  object
 9   Latitude      148288 non-null  object
 10  Longitude     148283 non-null  object
dtypes: int64(1), object(10)
memory usage: 13.2+ MB
None




Unnamed: 0,CircleName,RegionName,DivisionName,OfficeName,Pincode,OfficeType,Delivery,District,StateName,Latitude,Longitude
0,Andhra Pradesh Circle,Kurnool Region,Hindupur Division,Peddakotla B.O,515631,BO,Delivery,ANANTAPUR,ANDHRA PRADESH,14.5689,77.85624
1,Andhra Pradesh Circle,Kurnool Region,Hindupur Division,Pinnadhari B.O,515631,BO,Delivery,ANANTAPUR,ANDHRA PRADESH,14.5281,77.857014
2,Andhra Pradesh Circle,Kurnool Region,Hindupur Division,Yerraguntapalle B.O,515631,BO,Delivery,ANANTAPUR,ANDHRA PRADESH,14.561111,77.85715
3,Andhra Pradesh Circle,Kurnool Region,Hindupur Division,Obulareddipalli B.O,515581,BO,Delivery,ANANTAPUR,ANDHRA PRADESH,14.2488,78.2588
4,Andhra Pradesh Circle,Kurnool Region,Hindupur Division,Odulapalli B.O,515581,BO,Delivery,ANANTAPUR,ANDHRA PRADESH,14.24555,78.2477
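
The `df.info()` output above lists `Latitude` and `Longitude` as `object` columns, which suggests some entries are not plain numbers. The sketch below shows, on a small made-up frame (the `toy` values are illustrative assumptions, not rows from the dataset), one way to separate entries that are missing from entries that are present but cannot be parsed.

```python
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration).
toy = pd.DataFrame({
    "Latitude": ["14.5689", "17.3850", "not available", None],
    "Longitude": ["77.85624", "78.4867", "78.1", "79.0"],
})

# errors="coerce" turns anything unparseable into NaN instead of raising.
lat = pd.to_numeric(toy["Latitude"], errors="coerce")
lon = pd.to_numeric(toy["Longitude"], errors="coerce")

# Entries that were present in the source but could not be parsed as numbers.
bad_lat = int((lat.isna() & toy["Latitude"].notna()).sum())
# Entries that were missing to begin with.
missing_lat = int(toy["Latitude"].isna().sum())

print(f"Latitude: {missing_lat} missing, {bad_lat} unparseable")
```

Running the same two counts on the real `df` would show how many coordinates are lost to conversion rather than to missing data.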


## 🧹 2. Data Cleaning

Filters the dataset to Telangana, then converts latitude and longitude to numeric format and removes rows with missing or unparseable coordinates.

In [3]:
# Drop rows where 'StateName' is missing
df = df.dropna(subset=['StateName'])

# Filter the dataset for Telangana (case-insensitive match on 'StateName')
telangana_df = df[df['StateName'].str.lower() == 'telangana'].copy()

# Display the filtered data
print(f"Total Telangana pincodes: {len(telangana_df)}")
telangana_df.head()


Total Telangana pincodes: 5816


Unnamed: 0,CircleName,RegionName,DivisionName,OfficeName,Pincode,OfficeType,Delivery,District,StateName,Latitude,Longitude
138,Telangana Circle,Hyderabad Region,Nizamabad Division,Arsapalli B.O,503186,BO,Delivery,NIZAMABAD,TELANGANA,18.6845544,78.0773742
139,Telangana Circle,Hyderabad Region,Nizamabad Division,Camp Ootpalli B.O,503180,BO,Delivery,NIZAMABAD,TELANGANA,18.5329923,77.618717
140,Telangana Circle,Hyderabad Region,Nizamabad Division,Eraspalli B.O,503180,BO,Delivery,NIZAMABAD,TELANGANA,18.5329923,77.618717
141,Telangana Circle,Hyderabad Region,Nizamabad Division,Singitham B.O,503187,BO,Delivery,KAMAREDDY,TELANGANA,18.3096558,77.9466324
142,Telangana Circle,Hyderabad Region,Nizamabad Division,Mahmadpur B.O,503101,BO,Delivery,KAMAREDDY,TELANGANA,18.2108201,78.4853495


In [4]:
# Ensure Latitude and Longitude are numeric and drop missing values
telangana_df['Latitude'] = pd.to_numeric(telangana_df['Latitude'], errors='coerce')
telangana_df['Longitude'] = pd.to_numeric(telangana_df['Longitude'], errors='coerce')
telangana_df.dropna(subset=['Latitude', 'Longitude'], inplace=True)

import plotly.express as px

# Plot on world map using Plotly
fig = px.scatter_geo(telangana_df,
                     lat='Latitude',
                     lon='Longitude',
                     title='Telangana Pincodes on World Map',
                     projection='natural earth')
fig.update_layout(geo=dict(showland=True))
fig.show()




## 3. K-means Clustering from Scratch

The following implementation defines the K-means clustering algorithm **without using any external clustering libraries** (like scikit-learn). The algorithm follows the classical K-means steps:

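1. **Euclidean Distance Calculation**: Measures the straight-line distance between two points in the 2D latitude-longitude plane. Treating degrees as planar coordinates is an approximation, because a degree of longitude covers less ground than a degree of latitude away from the equator.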
2. **Centroid Initialization**: Randomly selects `k` initial centroids from the dataset.
3. **Cluster Assignment**: Each point is assigned to the nearest centroid based on Euclidean distance.
4. **Centroid Update**: For each cluster, the centroid is recalculated as the mean of all points assigned to it. A cluster that ends up empty is re-seeded with a randomly chosen data point.
5. **Convergence Check**: Iteration continues until centroids do not change significantly (`np.allclose`) or until a maximum number of iterations is reached.
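
In symbols, steps 3 and 4 can be written as below. The notation is introduced here for illustration and does not appear in the code: $x_i$ is the $i$-th point, $\mu_j$ the $j$-th centroid, $c_i$ the cluster index assigned to $x_i$, and $S_j$ the set of points currently in cluster $j$.

$$
c_i = \arg\min_{j \in \{1,\dots,k\}} \lVert x_i - \mu_j \rVert_2,
\qquad
\mu_j = \frac{1}{|S_j|} \sum_{i \in S_j} x_i,
\qquad
S_j = \{\, i : c_i = j \,\}
$$

As long as no cluster is empty, neither step can increase the within-cluster sum of squares $\sum_i \lVert x_i - \mu_{c_i} \rVert_2^2$, which is why the iteration settles after a finite number of passes.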

This custom function `kmeans(data, k)` returns both the final centroids and the cluster assignments, allowing full control and transparency over the clustering logic.


In [5]:
# K-means from scratch (initial implementation)

def euclidean_distance(p1, p2):
    return sqrt((p1[0] - p2[0])**2 + (p1[1] - p2[1])**2)

def initialize_centroids(data, k):
    return data.sample(n=k).to_numpy()

def assign_clusters(data, centroids):
    clusters = []
    for point in data.to_numpy():
        distances = [euclidean_distance(point, centroid) for centroid in centroids]
        cluster = np.argmin(distances)
        clusters.append(cluster)
    return np.array(clusters)

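def update_centroids(data, clusters, k):
    # `data` is a NumPy array of shape (n, 2) here: kmeans() passes data.to_numpy()
    new_centroids = []
    for i in range(k):
        cluster_points = data[clusters == i]
        if len(cluster_points) == 0:
            # Empty cluster: re-seed with a random data point.
            # NumPy arrays have no .sample(), so pick a random row index instead.
            new_centroids.append(data[np.random.randint(len(data))])
        else:
            new_centroids.append(cluster_points.mean(axis=0))
    return np.array(new_centroids)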

def kmeans(data, k, max_iters=100):
    """
    Runs K-means clustering on geospatial data (Latitude and Longitude).

    Parameters:
    data (DataFrame): DataFrame with 'Latitude' and 'Longitude' columns.
    k (int): Number of clusters.
    max_iters (int): Maximum number of iterations.

    Returns:
    tuple: Final centroids and cluster assignments.
    """
    data = data[['Latitude', 'Longitude']]
    centroids = initialize_centroids(data, k)

    for _ in range(max_iters):
        clusters = assign_clusters(data, centroids)
        new_centroids = update_centroids(data.to_numpy(), clusters, k)
        if np.allclose(centroids, new_centroids):
            break
        centroids = new_centroids
    return centroids, clusters
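
The assignment step above loops over every point in Python, which can be slow on larger inputs. As a rough sketch of an alternative, the same steps can be expressed with NumPy broadcasting. The function name `kmeans_vectorized` and its `seed` parameter are assumptions made for this example, the data is synthetic, and unlike the code above an empty cluster simply keeps its previous centroid.

```python
import numpy as np

def kmeans_vectorized(points, k, max_iters=100, seed=0):
    """Vectorized K-means sketch on an (n, 2) NumPy array (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Pick k distinct rows as the starting centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iters):
        # (n, 1, 2) - (1, k, 2) broadcasts to (n, k, 2); the norm over the
        # last axis gives an (n, k) matrix of point-to-centroid distances.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Recompute each centroid; keep the old one if its cluster is empty.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(centroids, new_centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Two well-separated synthetic blobs (made-up data, not the postal dataset).
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=[17.0, 78.0], scale=0.05, size=(50, 2))
blob_b = rng.normal(loc=[19.0, 80.0], scale=0.05, size=(50, 2))
points = np.vstack([blob_a, blob_b])

centroids, labels = kmeans_vectorized(points, k=2)
print(np.round(centroids, 1))
```

Passing a fixed `seed` makes the initial centroids, and therefore the result, repeatable between runs, which the unseeded `data.sample(n=k)` call above does not guarantee.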


## 📍 4. Clustering Visualization

Plot the clusters and centroids on a map using Plotly.

In [6]:
# Run K-means on Telangana data
k = 5  # You can change this to any number of clusters
centroids, clusters = kmeans(telangana_df, k)

# Add the cluster assignments to the DataFrame
telangana_df['Cluster'] = clusters


# At this point:
# - telangana_df has 'Latitude', 'Longitude', and 'Cluster' columns
# - centroids is a NumPy array of shape (k, 2)

# Create a DataFrame for centroids
centroid_df = pd.DataFrame(centroids, columns=['Latitude', 'Longitude'])
centroid_df['Cluster'] = [f'Centroid {i}' for i in range(len(centroid_df))]

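# Plot clustered points
# Cast the integer labels to strings so Plotly draws a discrete legend
# instead of a continuous color bar
fig = px.scatter_geo(telangana_df,
                     lat='Latitude',
                     lon='Longitude',
                     color=telangana_df['Cluster'].astype(str),
                     title=f'K-means Clustering of Telangana Pincodes (k={len(centroids)})',
                     projection='natural earth',
                     symbol_sequence=['circle'],
                     opacity=0.7)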

# Add centroids as black "X" markers
fig.add_scattergeo(
    lat=centroid_df['Latitude'],
    lon=centroid_df['Longitude'],
    mode='markers',
    marker=dict(symbol='x', size=12, color='black'),
    name='Centroids'
)

# Final layout tweaks
fig.update_layout(
    geo=dict(showland=True, landcolor='lightgray'),
    height=700,
    legend_title="Cluster",
    margin={"r":0,"t":40,"l":0,"b":0}
)

fig.show()
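
Since `k` can be changed freely, a numeric summary can help when comparing runs alongside the map. The sketch below computes the cluster sizes and the within-cluster sum of squares on a tiny made-up example. The helper name `within_cluster_ss` and the toy coordinates are assumptions for illustration, not part of the notebook above.

```python
import numpy as np

def within_cluster_ss(points, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    # centroids[labels] looks up the assigned centroid for every point.
    diffs = points - centroids[labels]
    return float((diffs ** 2).sum())

# Toy example (made-up coordinates): two pairs of points, one pair per cluster.
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.0, 1.0], [10.0, 1.0]])

wcss = within_cluster_ss(points, labels, centroids)
# Number of points in each cluster, indexed by cluster label.
sizes = np.bincount(labels, minlength=len(centroids))
print(wcss, sizes)
```

With the notebook's variables, the equivalent call would be `within_cluster_ss(telangana_df[['Latitude', 'Longitude']].to_numpy(), clusters, centroids)`. Lower values indicate tighter clusters, though the value also tends to fall as `k` grows.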

