### Project Summary: Scalable Clustering Pipeline with PCA Visualization
This project presents a modular clustering pipeline built with Python's scikit-learn. It groups multivariate data and visualizes the result. The workflow starts with preprocessing (exploratory analysis, removal of irrelevant columns, and feature scaling with `StandardScaler`), then runs KMeans clustering, using the Elbow Method to choose the number of clusters. Principal Component Analysis (PCA) reduces the scaled features to three components for a 3D view of the clusters. The code is structured so that it can be integrated into FastAPI or Dockerized microservices, and it is suited to data science exploration, educational use, and adaptable clustering systems.



In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

In [None]:

from sklearn.cluster import KMeans

## Step 1: Load and Inspect the Dataset

We begin by loading the university dataset and inspecting its structure using `.info()` and `.describe()`. This helps identify missing values, data types, and feature distributions.

In [None]:
df=pd.read_excel("University_Clustering.xlsx")
df

In [None]:
df.info()

In [None]:
df.describe()
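
As a complement to `.info()` and `.describe()`, missing values can also be counted directly. The sketch below uses a small made-up table (the column names and values are hypothetical stand-ins, not the real university data), since scikit-learn's `KMeans` does not accept `NaN` inputs and any gaps would need handling first.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the university data; names and values are made up.
toy = pd.DataFrame({
    "Univ": ["A", "B", "C", "D"],
    "SAT": [1310.0, np.nan, 1260.0, 1400.0],
    "Expenses": [22704, 63575, 25026, 31510],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# Rows that contain at least one missing value.
rows_with_missing = toy[toy.isna().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```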

In [None]:
df.drop("State",axis=1,inplace=True)

##  Step 2: Drop Irrelevant Columns

We remove the `"Univ"` column because it is a name that identifies each row, not a feature to cluster on. With `"State"` already dropped above, the clustering is based purely on numerical features.

In [None]:
new_df=df.drop("Univ",axis=1)
new_df
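
An alternative to dropping label columns by name is to keep only the numeric columns. This is a minimal sketch on a made-up table (all column names and values here are hypothetical); it assumes every numeric column is a genuine feature, which would not hold if the data contained numeric IDs.

```python
import pandas as pd

# Hypothetical stand-in table with two label columns and two numeric features.
toy = pd.DataFrame({
    "Univ": ["A", "B", "C"],
    "State": ["X", "Y", "Z"],
    "SAT": [1310, 1415, 1260],
    "GradRate": [94, 81, 72],
})

# Keep only numeric columns; text columns such as names are excluded.
features = toy.select_dtypes(include="number")

print(list(features.columns))
```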

In [None]:
from sklearn.preprocessing  import StandardScaler

##  Step 3: Feature Scaling with StandardScaler

To ensure fair clustering, we scale all features using `StandardScaler`. This standardizes each feature to have mean 0 and variance 1.

In [None]:
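# StandardScaler standardizes each column independently, so one fit covers all features
ss = StandardScaler()
new_df = pd.DataFrame(ss.fit_transform(new_df), columns=new_df.columns, index=new_df.index)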
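
To make the "mean 0 and variance 1" claim concrete, the sketch below compares `StandardScaler` with a manual z-score on a made-up table (the values are hypothetical). Note that `StandardScaler` uses the population standard deviation, so the manual version needs `ddof=0`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical feature table on very different scales.
toy = pd.DataFrame({
    "SAT": [1200.0, 1300.0, 1400.0, 1500.0],
    "Expenses": [10000.0, 20000.0, 40000.0, 50000.0],
})

# Scale with scikit-learn.
scaled = StandardScaler().fit_transform(toy)

# Same computation by hand: subtract the column mean, divide by the
# population standard deviation (ddof=0).
manual = (toy - toy.mean()) / toy.std(ddof=0)

print(np.allclose(scaled, manual.to_numpy()))
print(scaled.mean(axis=0), scaled.std(axis=0))
```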


## Step 4: Elbow Method to Determine Optimal Clusters

We compute Within-Cluster Sum of Squares (WCSS) for cluster counts from 1 to 8. The Elbow Method helps us choose the best number of clusters by identifying the point where WCSS starts to flatten.

In [None]:
wcss=[]
clusters=list(range(1,9))
for K in clusters:
    # Fix the seed so the WCSS curve is reproducible across runs
    model=KMeans(n_clusters=K, random_state=1)
    model.fit(new_df)
    wcss.append(model.inertia_)

In [None]:
plt.figure()
plt.plot(clusters,wcss,"x-")
plt.xlabel("K")
plt.ylabel("WCSS")
plt.show()


The Elbow Method is a graphical technique for finding the optimal K in k-means clustering. It plots WCSS, the sum of squared distances between each point and the centroid of its cluster, against K.
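
The WCSS definition can be checked numerically. The sketch below fits `KMeans` on synthetic two-blob data (made up for illustration), computes the sum of squared distances by hand, and compares it with the `inertia_` attribute used in the loop above.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: two well-separated blobs (illustrative only).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(20, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(20, 2)),
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Look up the centroid assigned to each point, then sum squared distances.
assigned_centroids = km.cluster_centers_[km.labels_]
manual_wcss = np.sum((X - assigned_centroids) ** 2)

print(manual_wcss, km.inertia_)
```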

In [None]:
# Fit the final model with K=3; a fixed seed keeps the cluster labels reproducible
model=KMeans(n_clusters=3, random_state=1)
model.fit(new_df)

In [None]:
model.labels_

In [None]:
df.columns

In [None]:
df["labels"]=model.labels_

In [None]:
df.head()

In [None]:
df[df['labels']==0]["Univ"]

In [None]:
df[df['labels']==1]["Univ"]

In [None]:
df[df['labels']==2]["Univ"]

In [None]:
from sklearn.decomposition import PCA

##  Step 6: Reduce Dimensions with PCA

We apply Principal Component Analysis (PCA) to project the scaled features onto 3 dimensions. This lets us visualize the clusters in 3D while keeping the three directions of greatest variance.

In [None]:
pca = PCA(n_components=3,random_state=1)

In [None]:
components=pca.fit_transform(new_df)
components

In [None]:
pc1=components[:,0]
pc2=components[:,1]
pc3=components[:,2]
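
How much of the original variance the three components keep can be read from `explained_variance_ratio_`. This is a sketch on synthetic data (a made-up stand-in for the scaled features); on the real data the same attribute is available on the fitted `pca` object.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in: 50 samples, 6 features, with one correlated pair.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=50)

pca_demo = PCA(n_components=3).fit(X)

# Fraction of total variance captured by each component, largest first.
ratios = pca_demo.explained_variance_ratio_

# Total fraction retained by the 3D projection.
retained = ratios.sum()

print(ratios, retained)
```

A low retained fraction would suggest that the 3D plot hides structure that the clustering, which ran on all features, may still have used.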

##  Step 7: Visualize Clusters in 3D

Using `matplotlib`, we plot the PCA components in a 3D scatter plot. Each point is colored by its cluster label, offering intuitive visual insights into cluster separation.

In [None]:
fig=plt.figure(figsize=(8,8))
ax=plt.axes(projection='3d')
ax.scatter(pc1,pc2,pc3 ,c=df['labels'])
plt.show()
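
The 3D scatter can be made easier to read with axis labels and a cluster legend. The sketch below uses random points and made-up labels in place of the PCA components and cluster labels; it relies on matplotlib's `legend_elements()` to build one legend entry per label value.

```python
import numpy as np
import matplotlib.pyplot as plt

# Random stand-ins for the three PCA components and the cluster labels.
rng = np.random.default_rng(0)
points = rng.normal(size=(30, 3))
demo_labels = np.repeat([0, 1, 2], 10)

fig = plt.figure(figsize=(8, 8))
ax = fig.add_subplot(projection="3d")
scatter = ax.scatter(points[:, 0], points[:, 1], points[:, 2], c=demo_labels)

# Name the axes after the components they show.
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.set_zlabel("PC3")

# One legend entry per distinct label value.
ax.legend(*scatter.legend_elements(), title="Cluster")
plt.show()
```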

In [None]:
plt.figure(figsize=(6,4))
plt.plot(clusters, wcss, marker='o')
plt.title('Elbow Method for Optimal K')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
plt.grid(True)
plt.show()

##  Elbow Method: Choosing Optimal Clusters
This plot helps identify the ideal number of clusters by locating the 'elbow' point where WCSS starts to flatten.
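
Reading the elbow by eye is subjective, so a simple heuristic can serve as a cross-check. The sketch below is one possible approach, not part of the original pipeline: it picks the K whose point lies farthest from the straight line joining the first and last points of the curve. The helper name `elbow_by_max_distance` and the WCSS values are hypothetical.

```python
import numpy as np

def elbow_by_max_distance(ks, wcss):
    """Hypothetical helper: return the K farthest from the first-to-last chord."""
    ks = np.asarray(ks, dtype=float)
    wcss = np.asarray(wcss, dtype=float)
    # Rescale both axes to [0, 1] so neither dominates the distance.
    x = (ks - ks.min()) / (ks.max() - ks.min())
    y = (wcss - wcss.min()) / (wcss.max() - wcss.min())
    # Perpendicular distance of each point from the line through the endpoints.
    dx, dy = x[-1] - x[0], y[-1] - y[0]
    dist = np.abs(dy * (x - x[0]) - dx * (y - y[0])) / np.hypot(dx, dy)
    return int(ks[np.argmax(dist)])

# Made-up WCSS values with a bend at K=3 (illustrative only).
example_ks = list(range(1, 9))
example_wcss = [150.0, 80.0, 40.0, 32.0, 27.0, 23.0, 20.0, 18.0]

best_k = elbow_by_max_distance(example_ks, example_wcss)
print(best_k)
```

With the notebook's own variables this would be called as `elbow_by_max_distance(clusters, wcss)`; the result should be treated as a suggestion to compare against the plot.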



In [None]:
df['Cluster'] = model.labels_
df.head()

##  Cluster Labels Assigned to Each University
Here’s how each university was grouped based on multivariate features.

In [None]:
df.groupby('Cluster').mean(numeric_only=True)

##  Cluster-Wise Feature Averages
This table shows the average values of each feature per cluster, helping interpret group characteristics.
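
Cluster averages are easier to interpret alongside the number of members in each cluster, since a mean over one or two universities is less informative. The sketch below adds a size column to the per-cluster means on a made-up table (column names and values are hypothetical).

```python
import pandas as pd

# Hypothetical clustered table.
toy = pd.DataFrame({
    "SAT": [1400, 1380, 1100, 1150, 1250],
    "GradRate": [95, 93, 70, 72, 80],
    "Cluster": [0, 0, 1, 1, 2],
})

# Per-cluster feature means, as in the cell above.
profile = toy.groupby("Cluster").mean(numeric_only=True)

# Number of rows in each cluster, aligned on the cluster index.
profile["n_universities"] = toy.groupby("Cluster").size()

print(profile)
```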

In [None]:
df.to_excel("Clustered_Universities.xlsx", index=False)

## Exporting Results
The clustered data is saved to `Clustered_Universities.xlsx` for downstream use or integration into microservices.