# Clustering Assignment – Complete Solution

**Name:** Arpan Paliwal  
**Subject:** Machine Learning – Clustering (PW Skills)

This notebook contains **all theoretical and practical questions fully solved** as per the provided assignment.

## Part 1: Theoretical Questions


1. **Unsupervised Learning:** Learning patterns from unlabeled data.
2. **K-Means Algorithm:** Initialize k centroids, assign each point to its nearest centroid, recompute each centroid as the mean of its assigned points, and repeat until the assignments stop changing.
3. **Dendrogram:** A tree diagram showing hierarchical cluster merging.
4. **K-Means vs Hierarchical:** K-Means is centroid-based and needs k chosen in advance; hierarchical clustering builds a tree of merges that can be cut at any level.
5. **Advantages of DBSCAN:** Detects noise, handles arbitrary shapes.
6. **Silhouette Score Usage:** Measures cluster cohesion and separation.
7. **Limitations of Hierarchical Clustering:** High time and memory complexity (at least quadratic in the number of points), and merges cannot be undone once made.
8. **Feature Scaling Importance:** Distance-based algorithms are scale-sensitive.
9. **DBSCAN Noise Points:** Points that are neither core points nor within eps of a core point; scikit-learn labels them -1.
10. **Inertia:** Sum of squared distances from points to their cluster centroids.
11. **Elbow Method:** Plots inertia against k and picks the k at the "elbow", where adding more clusters stops reducing inertia sharply.
12. **Density in DBSCAN:** Number of neighbors within epsilon.
13. **Categorical Data in Hierarchical:** Possible with suitable distance metrics.
14. **Negative Silhouette Score:** The point is, on average, closer to a neighboring cluster than to its own, so it is probably assigned to the wrong cluster.
15. **Linkage Criteria:** Defines distance between clusters.
16. **K-Means Poor Performance:** Struggles with non-spherical clusters, clusters of very different sizes or densities, and outliers.
17. **DBSCAN Parameters:** eps (radius), min_samples (density threshold).
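18. **K-Means++:** Smarter centroid initialization that picks each new centroid with probability proportional to its squared distance from the nearest centroid already chosen.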
19. **Agglomerative Clustering:** Bottom-up clustering approach.
20. **Silhouette vs Inertia:** Inertia measures only cohesion and keeps decreasing as k grows; silhouette considers both separation and cohesion and is bounded in [-1, 1].
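
The loop described in answer 2 and the inertia defined in answer 10 can be sketched in plain NumPy. This is a minimal illustration, not scikit-learn's implementation: the function name `lloyd_kmeans`, its defaults, and the toy data are assumptions made for the example, and the initialization is purely random rather than K-Means++.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Minimal Lloyd's algorithm (illustrative; not scikit-learn's code)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points as starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    def assign(centers):
        # Assignment step: index of the nearest centroid for every point.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        return dists.argmin(axis=1)

    for _ in range(n_iter):
        labels = assign(centroids)
        # Update step: each centroid moves to the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        converged = np.allclose(new_centroids, centroids)
        centroids = new_centroids
        if converged:
            break

    labels = assign(centroids)
    # Inertia: sum of squared distances from each point to its centroid.
    inertia = float(((X - centroids[labels]) ** 2).sum())
    return labels, centroids, inertia

# Toy data (hypothetical): two tight groups of three points each.
X_toy = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
labels, centroids, inertia = lloyd_kmeans(X_toy, k=2)
print(labels, inertia)
```

Because the starting centroids are random, the cluster numbering (which group is 0 and which is 1) depends on the seed; only the grouping itself is meaningful.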


## Part 2: Practical Questions – Code

In [None]:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.datasets import (
    make_blobs, make_moons, make_circles,
    load_iris, load_wine, load_breast_cancer, load_digits
)
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score


### 1. make_blobs (4 centers) + KMeans

In [None]:

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
kmeans = KMeans(n_clusters=4, random_state=42)
labels = kmeans.fit_predict(X)
plt.scatter(X[:,0], X[:,1], c=labels)
plt.title("KMeans on make_blobs")
plt.show()


### 2. Iris + Agglomerative Clustering

In [None]:

iris = load_iris()
agg = AgglomerativeClustering(n_clusters=3)
labels = agg.fit_predict(iris.data)
print(labels[:10])


### 3. make_moons + DBSCAN

In [None]:

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)  # fixed seed for reproducibility
db = DBSCAN(eps=0.3)
labels = db.fit_predict(X)
plt.scatter(X[:,0], X[:,1], c=labels)
plt.title("DBSCAN on Moons")
plt.show()
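
To make the roles of `eps` and `min_samples` more concrete, the sketch below counts neighbors by hand and compares the result with scikit-learn's `core_sample_indices_` and `labels_` attributes. The helper name `dbscan_point_types` and the six-point toy dataset are assumptions made for illustration; the neighbor count includes the point itself, which matches scikit-learn's convention for `min_samples`.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def dbscan_point_types(X, eps, min_samples):
    """Mark each point as core, border, or noise (illustrative helper)."""
    X = np.asarray(X, dtype=float)
    # Density of a point: number of points within eps, itself included.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbor_counts = (dists <= eps).sum(axis=1)

    db = DBSCAN(eps=eps, min_samples=min_samples).fit(X)
    is_core = np.zeros(len(X), dtype=bool)
    is_core[db.core_sample_indices_] = True
    is_noise = db.labels_ == -1        # scikit-learn marks noise as -1
    is_border = ~is_core & ~is_noise   # in a cluster, but not dense itself
    return neighbor_counts, is_core, is_border, is_noise

# Toy data (hypothetical): a short chain of points plus one distant outlier.
X_toy = np.array([[0.0, 0], [0.1, 0], [0.2, 0], [0.3, 0], [0.42, 0], [5.0, 0]])
counts, core, border, noise = dbscan_point_types(X_toy, eps=0.15, min_samples=3)
print("neighbors:", counts)
print("core:", core, "border:", border, "noise:", noise)
```

A point is a core point exactly when its neighbor count reaches `min_samples`; border points fall short of that but sit within `eps` of a core point, and everything else is noise.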


### 4. Wine + StandardScaler + KMeans

In [None]:

wine = load_wine()
X = StandardScaler().fit_transform(wine.data)
kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(X)
unique, counts = np.unique(labels, return_counts=True)
print(dict(zip(unique, counts)))
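
One rough way to see why scaling matters for this dataset is to cluster both the raw and the standardized features and compare each result with the known cultivar labels using the adjusted Rand index. The ground-truth labels are used only for this illustration, the helper `ari_for` is a hypothetical name, and `n_init=10` is set explicitly as an assumption so the behavior does not depend on the scikit-learn version.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_raw = wine.data
X_scaled = StandardScaler().fit_transform(X_raw)  # zero mean, unit variance per feature

def ari_for(X):
    """Cluster X into 3 groups and score agreement with the true cultivars."""
    labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
    return adjusted_rand_score(wine.target, labels)

ari_raw, ari_scaled = ari_for(X_raw), ari_for(X_scaled)
print(f"ARI raw: {ari_raw:.3f}, ARI scaled: {ari_scaled:.3f}")
```

On the raw data, features with large numeric ranges dominate the Euclidean distance, so the unscaled run tends to agree less with the true grouping.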


### 5. make_circles + DBSCAN

In [None]:

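# The default factor=0.8 leaves a gap of only about 0.2 between the rings,
# which eps=0.2 can bridge; factor=0.5 keeps the two rings separable.
X, _ = make_circles(n_samples=300, noise=0.05, factor=0.5, random_state=42)
labels = DBSCAN(eps=0.2).fit_predict(X)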
plt.scatter(X[:,0], X[:,1], c=labels)
plt.show()


### 6. Breast Cancer + MinMaxScaler + KMeans

In [None]:

bc = load_breast_cancer()
X = MinMaxScaler().fit_transform(bc.data)
kmeans = KMeans(n_clusters=2, random_state=42)
kmeans.fit(X)
print("Centroids:\n", kmeans.cluster_centers_)


### 7. make_blobs (varying std) + DBSCAN

In [None]:

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=[0.5, 1.5, 2.5], random_state=42)
labels = DBSCAN(eps=0.6).fit_predict(X)
plt.scatter(X[:,0], X[:,1], c=labels)
plt.show()


### 8. Digits + PCA (2D) + KMeans

In [None]:

digits = load_digits()
X_pca = PCA(n_components=2).fit_transform(digits.data)
labels = KMeans(n_clusters=10, random_state=42).fit_predict(X_pca)
plt.scatter(X_pca[:,0], X_pca[:,1], c=labels)
plt.show()


### 9. Silhouette Score for k = 2 to 5

In [None]:

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
scores = []
for k in range(2,6):
    labels = KMeans(n_clusters=k, random_state=42).fit_predict(X)
    scores.append(silhouette_score(X, labels))
plt.bar(range(2,6), scores)
plt.xlabel("Number of clusters (k)")
plt.ylabel("Silhouette score")
plt.show()
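
For reference, the silhouette value of a single point is s = (b - a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is its mean distance to the members of the nearest other cluster. The score plotted above is the mean of s over all points. A hand-rolled version, checked against `sklearn.metrics.silhouette_samples`, is sketched here; the helper name `silhouette_by_hand` and the toy data are assumptions, and the sketch presumes at least two clusters and Euclidean distance.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

def silhouette_by_hand(X, labels):
    """Per-point silhouette values (illustrative; Euclidean distance only)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        if own.sum() == 1:
            continue  # singleton cluster: the score is defined as 0
        # a: mean distance to the other members of the point's own cluster
        a = dists[i, own].sum() / (own.sum() - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(dists[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

# Toy data (hypothetical): two tight groups of three points each.
X_toy = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
good_labels = np.array([0, 0, 0, 1, 1, 1])
bad_labels = np.array([0, 0, 1, 1, 1, 1])  # third point put in the wrong group

print(silhouette_by_hand(X_toy, good_labels))
print(silhouette_by_hand(X_toy, bad_labels))
print(silhouette_samples(X_toy, bad_labels))
```

With `bad_labels`, the misassigned third point gets a negative value because it is far closer to the other cluster than to its own.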


### 10. Iris Dendrogram (Average Linkage)

In [None]:

from scipy.cluster.hierarchy import dendrogram, linkage
Z = linkage(load_iris().data, method='average')
dendrogram(Z)
plt.show()
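
A dendrogram only visualizes the merge order; to obtain flat cluster labels the tree has to be cut at some level. One option is SciPy's `fcluster`. The choice of three clusters below is an assumption based on the three Iris species, not something the dendrogram dictates.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import load_iris

# Same average-linkage tree as in the dendrogram cell.
Z = linkage(load_iris().data, method='average')

# Cut the tree so that at most 3 flat clusters remain (labels start at 1).
flat_labels = fcluster(Z, t=3, criterion='maxclust')
print(np.unique(flat_labels, return_counts=True))
```

Changing `t`, or switching to `criterion='distance'` with a height threshold, corresponds to cutting the dendrogram at a different level.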


### 11. Overlapping Blobs + KMeans

In [None]:

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=2.0, random_state=42)
labels = KMeans(n_clusters=3, random_state=42).fit_predict(X)
plt.scatter(X[:,0], X[:,1], c=labels)
plt.show()


### 12. Digits + t-SNE + DBSCAN

In [None]:

digits = load_digits()
X_tsne = TSNE(n_components=2, random_state=42).fit_transform(digits.data)
labels = DBSCAN(eps=5).fit_predict(X_tsne)
plt.scatter(X_tsne[:,0], X_tsne[:,1], c=labels)
plt.show()


### 13. make_blobs + Agglomerative (Complete)

In [None]:

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
labels = AgglomerativeClustering(n_clusters=3, linkage='complete').fit_predict(X)
plt.scatter(X[:,0], X[:,1], c=labels)
plt.show()


### 14. Breast Cancer Inertia (k=2 to 6)

In [None]:

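# Scale first: inertia is distance-based, so large-valued features would dominate.
X = StandardScaler().fit_transform(load_breast_cancer().data)
inertia = []
for k in range(2,7):
    inertia.append(KMeans(n_clusters=k, random_state=42).fit(X).inertia_)
plt.plot(range(2,7), inertia, marker='o')
plt.xlabel("Number of clusters (k)")
plt.ylabel("Inertia")
plt.show()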


### 15. make_circles + Agglomerative (Single Linkage)

In [None]:

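# factor=0.5 widens the gap between the rings; with the default factor=0.8,
# single linkage can chain across the narrow gap once noise is added.
X, _ = make_circles(n_samples=300, noise=0.05, factor=0.5, random_state=42)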
labels = AgglomerativeClustering(n_clusters=2, linkage='single').fit_predict(X)
plt.scatter(X[:,0], X[:,1], c=labels)
plt.show()


### 16. Wine + DBSCAN

In [None]:

wine = load_wine()
X = StandardScaler().fit_transform(wine.data)
labels = DBSCAN(eps=1.5).fit_predict(X)
print("Clusters:", len(set(labels)) - (1 if -1 in labels else 0))  # -1 is the noise label, not a cluster
print("Noise points:", int(np.sum(labels == -1)))


### 17. make_blobs + KMeans + Centers

In [None]:

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(X)
plt.scatter(X[:,0], X[:,1], c=labels)
plt.scatter(kmeans.cluster_centers_[:,0], kmeans.cluster_centers_[:,1], s=200, marker='X')
plt.show()


### 18. Iris + DBSCAN (Noise Count)

In [None]:

iris = load_iris()
labels = DBSCAN(eps=0.5).fit_predict(iris.data)
print("Noise samples:", list(labels).count(-1))


### 19. make_moons + KMeans

In [None]:

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)  # fixed seed for reproducibility
labels = KMeans(n_clusters=2, random_state=42).fit_predict(X)
plt.scatter(X[:,0], X[:,1], c=labels)
plt.show()


### 20. Digits + PCA (3D) + KMeans

In [None]:

from mpl_toolkits.mplot3d import Axes3D  # only needed on older Matplotlib to register the '3d' projection
digits = load_digits()
X_pca = PCA(n_components=3).fit_transform(digits.data)
labels = KMeans(n_clusters=10, random_state=42).fit_predict(X_pca)

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X_pca[:,0], X_pca[:,1], X_pca[:,2], c=labels)
plt.show()
