# Clustering — Complete Assignment (Theoretical + Practical)

This notebook answers the assignment questions (both theoretical and practical) from the uploaded PDF.

Run all cells in order in a Jupyter environment with `scikit-learn`, `numpy`, `pandas`, `matplotlib`, and `seaborn` installed.

---


## Theoretical Questions — Short & Interview-friendly answers

**1. What is unsupervised learning?**  
Unsupervised learning finds patterns in unlabeled data (no target). Common tasks: clustering and dimensionality reduction.

**2. How does K-Means work?**  
K-Means initializes k centroids, assigns points to nearest centroid, updates centroids to cluster means, and repeats until convergence (assign→update loop).
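
The assign→update loop can be sketched in plain NumPy. This is a minimal illustration on made-up data, not scikit-learn's implementation; the function name `lloyd_kmeans` and its parameters are hypothetical.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd iteration: assign points, then recompute centroids."""
    rng = np.random.default_rng(seed)
    # Initialise centroids as k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every point.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: centroids stopped moving
        centroids = new_centroids
    return labels, centroids

# Two well-separated toy groups (made-up data).
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels_demo, centroids_demo = lloyd_kmeans(X_demo, k=2)
print(labels_demo)
print(centroids_demo)
```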

**3. What is a dendrogram?**  
A dendrogram is a tree diagram showing hierarchical clustering merges/splits and distances at which clusters join.

**4. Main difference: K-Means vs Hierarchical**  
K-Means: partition-based, needs k, faster on large data. Hierarchical: builds tree (agglomerative or divisive), no need to pre-specify k (you can cut tree), more expensive.

**5. Advantages of DBSCAN over K-Means**  
DBSCAN finds clusters of arbitrary shape, labels outliers as noise, and does not need the number of clusters in advance. Because it uses a single global `eps`, it can still struggle when clusters have very different densities.

**6. When use Silhouette Score?**  
Use it to evaluate clustering quality (how similar samples are to their own cluster vs others). Helpful to choose k.

**7. Limitations of Hierarchical Clustering**  
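Computationally expensive (O(n^3) time in the naive algorithm, O(n^2) memory for the distance matrix), so it scales poorly to large datasets; merges cannot be undone once made; sensitive to noise/outliers and to the linkage choice.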

**8. Why is feature scaling important in K-Means?**  
K-Means uses distance — features with larger scales dominate. Scaling (StandardScaler/MinMax) prevents this bias.
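
A small sketch of the scaling effect, using made-up age/income data (the values and variable names are illustrative assumptions). The data has two clear age groups, but income is measured in much larger units, so the unscaled run is driven by income while the standardized run can pick up the age groups.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up data: two age groups (about 20 and about 60), incomes spread evenly.
ages = np.array([20, 21, 22, 23, 60, 61, 62, 63], dtype=float)
incomes = np.array([30, 50, 40, 60, 35, 55, 45, 65], dtype=float) * 1000
X_raw = np.column_stack([ages, incomes])

# Unscaled: distances are dominated by the income column.
raw_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_raw)

# Standardized: both features contribute on a comparable scale.
X_scaled = StandardScaler().fit_transform(X_raw)
scaled_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)

print('raw   :', raw_labels)
print('scaled:', scaled_labels)
```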

**9. How does DBSCAN identify noise points?**  
A point is noise (label -1) if it is not a core point (fewer than `min_samples` neighbors within `eps`) and is not within `eps` of any core point.
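
A minimal sketch on made-up data: six tightly packed points and one isolated point. The `eps` and `min_samples` values are chosen only for this toy example.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Made-up 2-D data: a tight group of six points plus one far-away point.
X_demo = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                   [0.1, 0.1], [0.05, 0.05], [0.05, 0.0],
                   [5.0, 5.0]])

labels = DBSCAN(eps=0.3, min_samples=3).fit_predict(X_demo)
noise_mask = labels == -1  # DBSCAN marks noise with the label -1

print('labels      :', labels)
print('noise points:', X_demo[noise_mask])
```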

**10. Define inertia in K-Means**  
Inertia = sum of squared distances of samples to their closest cluster center (measure of compactness). Lower is better.
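
As a quick check of the definition, inertia can be recomputed by hand and compared with scikit-learn's `inertia_` attribute. The blob data below is synthetic and only for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_demo, _ = make_blobs(n_samples=200, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)

# Squared distance from each sample to the centre of its assigned cluster.
assigned_centres = km.cluster_centers_[km.labels_]
manual_inertia = float(np.sum((X_demo - assigned_centres) ** 2))

print('manual :', manual_inertia)
print('sklearn:', km.inertia_)
```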

**11. What is the elbow method?**  
Plot inertia vs k and look for the "elbow" where inertia improvement slows — suggests suitable k.

**12. Describe "density" in DBSCAN**  
Density refers to how many points are within radius `eps`. High density regions form clusters.

**13. Can hierarchical clustering be used on categorical data?**  
Yes if you use suitable distance measures (e.g., Hamming) or encode categories, but typical implementations assume numeric data.
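
One possible approach, sketched with SciPy on made-up records: integer-encode each categorical column, compute Hamming distances, and pass the condensed distance matrix to `linkage`. The records and column meanings are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Made-up categorical records, integer-encoded per column
# (colour, size, shape); the codes carry no numeric meaning.
records = np.array([[0, 0, 0],
                    [0, 0, 1],
                    [2, 1, 2],
                    [2, 1, 3]])

# Hamming distance = fraction of attributes on which two records differ.
condensed = pdist(records, metric='hamming')
tree = linkage(condensed, method='average')

# Cut the tree into at most two flat clusters.
groups = fcluster(tree, t=2, criterion='maxclust')
print(groups)
```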

**14. What does a negative Silhouette Score indicate?**  
A sample is closer to a neighboring cluster than to its own cluster — poor clustering for that sample.
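
A worked example of the per-sample formula s = (b - a) / max(a, b), where a is the mean distance to the sample's own cluster and b is the mean distance to the nearest other cluster. The 1-D data is made up, and the sample at 1.8 is deliberately assigned to the far cluster so its score comes out negative.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

X_demo = np.array([[0.0], [1.0], [1.8], [5.0], [6.0]])
labels = np.array([0, 0, 1, 1, 1])  # the sample at 1.8 is deliberately mislabelled

i = 2
# a: mean distance to the other members of its own cluster (indices 3 and 4).
a = np.mean([abs(X_demo[i, 0] - X_demo[j, 0]) for j in (3, 4)])
# b: mean distance to the members of the nearest other cluster (indices 0 and 1).
b = np.mean([abs(X_demo[i, 0] - X_demo[j, 0]) for j in (0, 1)])
s_manual = (b - a) / max(a, b)

s_sklearn = silhouette_samples(X_demo, labels)[i]
print(s_manual, s_sklearn)
```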

**15. Explain linkage criteria in hierarchical clustering**  
Linkage defines distance between clusters: single (min), complete (max), average (mean), ward (minimizes variance).
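
The four criteria can be computed directly for two small made-up clusters. For Ward, the sketch uses the standard merge cost, i.e. the increase in within-cluster sum of squares that merging would cause; library implementations may report a rescaled version of this quantity.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two made-up clusters on a line.
A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[3.0, 0.0], [5.0, 0.0]])

pairwise = cdist(A, B)         # all point-to-point distances between A and B
single = pairwise.min()        # closest pair
complete = pairwise.max()      # farthest pair
average = pairwise.mean()      # mean over all pairs

# Ward merge cost: |A||B| / (|A| + |B|) * squared distance between centroids.
centroid_gap = A.mean(axis=0) - B.mean(axis=0)
ward_cost = len(A) * len(B) / (len(A) + len(B)) * float(centroid_gap @ centroid_gap)

print(single, complete, average, ward_cost)
```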

**16. Why K-Means may perform poorly with varying sizes/densities?**  
K-Means assumes spherical, equally-sized clusters — varying sizes/densities break this assumption, causing poor assignments.

**17. Core parameters in DBSCAN and their effect**  
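`eps` (neighborhood radius): a larger eps merges clusters and leaves fewer noise points. `min_samples`: the minimum number of points within eps (counting the point itself in scikit-learn) for a point to be a core point; raising it marks more points as noise. Together they set how dense a region must be to count as a cluster.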
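
A small sketch of how the two parameters change the result, on made-up 1-D data: two runs of points spaced 0.2 apart with a gap of 1.0 between them. The helper `count_clusters` is a hypothetical name introduced for this example.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Made-up data: two runs of five points (spacing 0.2), separated by a gap of 1.0.
left = np.linspace(0.0, 0.8, 5)
right = np.linspace(1.8, 2.6, 5)
X_demo = np.concatenate([left, right]).reshape(-1, 1)

def count_clusters(eps, min_samples):
    """Return (number of clusters excluding noise, number of noise points)."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X_demo)
    return len(set(labels) - {-1}), int(np.sum(labels == -1))

small_eps = count_clusters(eps=0.3, min_samples=2)   # gap is wider than eps
large_eps = count_clusters(eps=1.1, min_samples=2)   # eps bridges the gap
strict = count_clusters(eps=0.3, min_samples=4)      # no point has enough neighbours

print(small_eps, large_eps, strict)
```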

**18. How K-Means++ improves initialization**  
K-Means++ picks initial centroids probabilistically to be far apart, improving convergence and reducing bad random starts.
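
The seeding idea can be sketched as follows: the first centre is picked uniformly at random, and each later centre is sampled with probability proportional to its squared distance from the nearest centre chosen so far. This is a simplified illustration (the function name `kmeanspp_init` is hypothetical), and scikit-learn's own implementation differs in its details.

```python
import numpy as np

def kmeanspp_init(X, k, seed=0):
    """Simplified k-means++ seeding via squared-distance sampling."""
    rng = np.random.default_rng(seed)
    centres = [X[rng.integers(len(X))]]  # first centre: uniform random pick
    for _ in range(k - 1):
        # Squared distance from each point to its nearest chosen centre.
        diffs = X[:, None, :] - np.array(centres)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        # Far-away points are more likely to be picked next.
        probs = d2 / d2.sum()
        centres.append(X[rng.choice(len(X), p=probs)])
    return np.array(centres)

# Made-up data: three locations, each repeated five times, so points already
# covered by a centre have zero probability of being picked again.
X_demo = np.repeat(np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]]), 5, axis=0)
seeds = kmeanspp_init(X_demo, k=3)
print(seeds)
```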

**19. What is agglomerative clustering?**  
A bottom-up hierarchical method that starts with each point as a cluster and iteratively merges the closest clusters based on linkage.

**20. Why Silhouette > inertia?**  
Inertia measures compactness only; Silhouette considers both cohesion (within) and separation (between) making it more informative for cluster quality.
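
A short comparison on synthetic blobs with three known centres (the centres and spread are assumptions chosen for illustration). Inertia tends to keep falling as k grows, so it cannot pick k by itself, while the silhouette score should peak near the true number of clusters.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated synthetic clusters.
X_demo, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 0], [0, 10]],
                       cluster_std=1.0, random_state=0)

inertia_by_k, silhouette_by_k = {}, {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    inertia_by_k[k] = km.inertia_
    silhouette_by_k[k] = silhouette_score(X_demo, km.labels_)

# k with the highest mean silhouette score.
best_k = max(silhouette_by_k, key=silhouette_by_k.get)
print('inertia   :', inertia_by_k)
print('silhouette:', silhouette_by_k)
print('best k by silhouette:', best_k)
```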

---

*(The remaining theoretical items in the assignment are covered by the above (entropy, linkage, silhouette, DBSCAN concepts, etc.).)*


## Practical Tasks — Code (run each cell).  
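The cells below implement the practical questions listed in the assignment PDF. They all rely on the common imports cell, and some reuse variables from earlier tasks (for example `X_iris`, `digits`, `Xw_s`, and `X_ov`), so run them in order. All data comes from sklearn built-in or synthetic datasets.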

In [None]:

# Common imports for all practical tasks
import numpy as np, pandas as pd, matplotlib.pyplot as plt, seaborn as sns
from sklearn.datasets import make_blobs, make_moons, make_circles, load_iris, load_wine, load_breast_cancer, load_digits
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score
from scipy.cluster.hierarchy import dendrogram, linkage
sns.set(style='whitegrid')
print('Libraries imported')


### Task 1: make_blobs (4 centers) + KMeans (k=4)

In [None]:

X, y = make_blobs(n_samples=500, centers=4, cluster_std=0.6, random_state=42)
km4 = KMeans(n_clusters=4, random_state=42).fit(X)
plt.figure(figsize=(6,4))
plt.scatter(X[:,0], X[:,1], c=km4.labels_, cmap='tab10', s=25)
plt.scatter(km4.cluster_centers_[:,0], km4.cluster_centers_[:,1], c='black', s=120, marker='X')
plt.title('KMeans k=4 on make_blobs'); plt.show()


### Task 2: Iris + AgglomerativeClustering (n=3) — show first 10 labels

In [None]:

iris = load_iris(as_frame=True)
X_iris = iris.data; y_iris = iris.target
agg = AgglomerativeClustering(n_clusters=3).fit(X_iris)
print('First 10 predicted labels:', agg.labels_[:10])
plt.figure(figsize=(5,4)); plt.scatter(X_iris.iloc[:,0], X_iris.iloc[:,1], c=agg.labels_, cmap='viridis'); plt.title('Agglomerative on Iris (first two features)'); plt.show()


### Task 3: make_moons + DBSCAN (highlight noise)

In [None]:

X_moons, _ = make_moons(n_samples=400, noise=0.08, random_state=0)
db = DBSCAN(eps=0.2, min_samples=5).fit(X_moons)
labels_db = db.labels_
plt.figure(figsize=(6,4))
plt.scatter(X_moons[:,0], X_moons[:,1], c=labels_db, cmap='tab10', s=25)
noise = X_moons[labels_db==-1]
if len(noise)>0:
    plt.scatter(noise[:,0], noise[:,1], facecolors='none', edgecolors='k', s=80, label='noise')
    plt.legend()  # only draw the legend when a labelled artist exists
plt.title('DBSCAN on make_moons (noise labelled -1)'); plt.show()
print('Noise count:', np.sum(labels_db==-1))


### Task 4: Wine dataset + Standardize + KMeans (print cluster sizes)

In [None]:

wine = load_wine(as_frame=True); Xw = wine.data
sc = StandardScaler(); Xw_s = sc.fit_transform(Xw)
kw = KMeans(n_clusters=3, random_state=0).fit(Xw_s)
(unique, counts) = np.unique(kw.labels_, return_counts=True)
print('Cluster sizes:', dict(zip(unique, counts)))


### Task 5: make_circles + DBSCAN

In [None]:

X_circ, _ = make_circles(n_samples=500, factor=0.5, noise=0.05, random_state=1)
dbcirc = DBSCAN(eps=0.12, min_samples=5).fit(X_circ)
plt.figure(figsize=(5,5)); plt.scatter(X_circ[:,0], X_circ[:,1], c=dbcirc.labels_, cmap='tab10'); plt.title('DBSCAN on make_circles'); plt.show()


### Task 6: Breast Cancer + MinMaxScaler + KMeans (2) — show centroids

In [None]:

bc = load_breast_cancer(as_frame=True); Xbc = bc.data
mm = MinMaxScaler(); Xbc_mm = mm.fit_transform(Xbc)
k2 = KMeans(n_clusters=2, random_state=42).fit(Xbc_mm)
print('Centroids (scaled space):\n', k2.cluster_centers_)


### Task 7: make_blobs with varying std devs + DBSCAN

In [None]:

X_var, _ = make_blobs(n_samples=600, centers=[[0,0],[4,4],[8,0]], cluster_std=[0.2,1.5,0.5], random_state=2)
dbv = DBSCAN(eps=0.6, min_samples=8).fit(X_var)
plt.figure(figsize=(6,4)); plt.scatter(X_var[:,0], X_var[:,1], c=dbv.labels_, cmap='tab10'); plt.title('DBSCAN on varying std blobs'); plt.show()


### Task 8: Digits dataset → PCA(2) → KMeans visualize

In [None]:

digits = load_digits(); Xd = digits.data
pca = PCA(n_components=2, random_state=42); Xd2 = pca.fit_transform(Xd)
kd = KMeans(n_clusters=10, random_state=42).fit(Xd2)
plt.figure(figsize=(6,4)); plt.scatter(Xd2[:,0], Xd2[:,1], c=kd.labels_, cmap='tab10', s=25); plt.title('Digits PCA(2) + KMeans'); plt.show()


### Task 9: Silhouette scores for k=2..5 on synthetic blobs

In [None]:

X_sb, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.7, random_state=10)
scores = {k: silhouette_score(X_sb, KMeans(n_clusters=k, random_state=0).fit_predict(X_sb)) for k in range(2,6)}
plt.figure(figsize=(5,3)); plt.bar(scores.keys(), scores.values()); plt.title('Silhouette k=2..5'); plt.xlabel('k'); plt.show(); print('Scores:', scores)


### Task 10: Iris dendrogram (average linkage) — small subset for readability

In [None]:

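# The first 50 rows are all setosa, so take every third sample to keep all three species
idx = np.arange(0, len(X_iris), 3)
X_iris_small = X_iris.iloc[idx]
linked = linkage(X_iris_small, method='average')
plt.figure(figsize=(10,4)); dendrogram(linked, labels=list(iris.target_names[y_iris.iloc[idx].to_numpy()]), leaf_rotation=90); plt.title('Dendrogram (Iris subset)'); plt.show()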


### Task 11: Overlapping blobs → KMeans → visualize decision boundaries (2D)

In [None]:

# overlapping blobs and decision boundary visualization
X_ov, y_ov = make_blobs(n_samples=400, centers=[[0,0],[2,0],[1,1.5]], cluster_std=1.1, random_state=5)
km_ov = KMeans(n_clusters=3, random_state=0).fit(X_ov)
# decision boundary grid
xx, yy = np.meshgrid(np.linspace(X_ov[:,0].min()-1, X_ov[:,0].max()+1, 300), np.linspace(X_ov[:,1].min()-1, X_ov[:,1].max()+1, 300))
Z = km_ov.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.figure(figsize=(6,4)); plt.contourf(xx,yy,Z, alpha=0.2); plt.scatter(X_ov[:,0], X_ov[:,1], c=km_ov.labels_, s=25); plt.title('KMeans decision boundaries (overlapping blobs)'); plt.show()


### Task 12: Digits → t-SNE → DBSCAN → visualize

In [None]:

tsne = TSNE(n_components=2, random_state=42, init='random', learning_rate='auto', perplexity=30)
Xd_tsne = tsne.fit_transform(digits.data[:500])  # limit to 500 for speed
db_dig = DBSCAN(eps=3.5, min_samples=5).fit(Xd_tsne)
plt.figure(figsize=(6,4)); plt.scatter(Xd_tsne[:,0], Xd_tsne[:,1], c=db_dig.labels_, cmap='tab10', s=25); plt.title('Digits t-SNE + DBSCAN'); plt.show()
print('Noise in DBSCAN labels:', np.sum(db_dig.labels_==-1))


### Task 13: AgglomerativeClustering (complete linkage) on synthetic blobs

In [None]:

agg_comp = AgglomerativeClustering(n_clusters=3, linkage='complete').fit(X_ov)
plt.figure(figsize=(6,4)); plt.scatter(X_ov[:,0], X_ov[:,1], c=agg_comp.labels_, cmap='tab10'); plt.title('Agglomerative (complete) on blobs'); plt.show()


### Task 14: Wine dataset — inertia values for K=2..6 (plot)

In [None]:

inertias = []
Ks = range(2,7)
for k in Ks:
    km = KMeans(n_clusters=k, random_state=0).fit(Xw_s)
    inertias.append(km.inertia_)
plt.figure(figsize=(5,3)); plt.plot(Ks, inertias, marker='o'); plt.title('Wine dataset inertia K=2..6'); plt.xlabel('k'); plt.ylabel('Inertia'); plt.show(); print('Inertias:', dict(zip(Ks, inertias)))


### Task 15: Concentric circles → Agglomerative (single linkage)

In [None]:

X_cc, _ = make_circles(n_samples=400, factor=0.3, noise=0.03, random_state=0)  # fixed seed for reproducibility
agg_single = AgglomerativeClustering(n_clusters=2, linkage='single').fit(X_cc)
plt.figure(figsize=(5,5)); plt.scatter(X_cc[:,0], X_cc[:,1], c=agg_single.labels_, cmap='tab10'); plt.title('Agglomerative (single) on concentric circles'); plt.show()


### Task 16: Wine dataset → scale → DBSCAN → count clusters (exclude noise)

In [None]:

from sklearn.preprocessing import StandardScaler
Xw_scaled = StandardScaler().fit_transform(load_wine(as_frame=True).data)
db_w = DBSCAN(eps=1.3, min_samples=5).fit(Xw_scaled)
labels_w = db_w.labels_
n_clusters = len(set(labels_w)) - (1 if -1 in labels_w else 0)
print('DBSCAN clusters (excluding noise):', n_clusters, 'noise points:', np.sum(labels_w==-1))


### Task 17: KMeans cluster centers on blobs (plot centers)

In [None]:

X_b, _ = make_blobs(n_samples=300, centers=4, random_state=10)
km_b = KMeans(n_clusters=4, random_state=10).fit(X_b)
plt.figure(figsize=(6,4)); plt.scatter(X_b[:,0], X_b[:,1], c=km_b.labels_, cmap='tab10'); plt.scatter(km_b.cluster_centers_[:,0], km_b.cluster_centers_[:,1], c='k', s=120, marker='X'); plt.title('Blobs with cluster centers'); plt.show()


### Task 18: Iris → DBSCAN → count noise samples

In [None]:

db_iris = DBSCAN(eps=0.9, min_samples=5).fit(X_iris)
print('Iris DBSCAN noise count:', np.sum(db_iris.labels_==-1))


### Task 19: make_moons + KMeans (note poor fit)

In [None]:

X_moons2, _ = make_moons(n_samples=300, noise=0.05, random_state=6)
km_m = KMeans(n_clusters=2, random_state=6).fit(X_moons2)
plt.figure(figsize=(6,4)); plt.scatter(X_moons2[:,0], X_moons2[:,1], c=km_m.labels_, cmap='tab10'); plt.title('KMeans on make_moons (non-linear)'); plt.show()


### Task 20: Digits → PCA(3) → KMeans → 3D scatter (requires mpl_toolkits)

In [None]:

from mpl_toolkits.mplot3d import Axes3D
pca3 = PCA(n_components=3, random_state=42); Xd3 = pca3.fit_transform(digits.data)
k3d = KMeans(n_clusters=10, random_state=42).fit(Xd3)
fig = plt.figure(figsize=(6,5)); ax = fig.add_subplot(111, projection='3d'); ax.scatter(Xd3[:,0], Xd3[:,1], Xd3[:,2], c=k3d.labels_, s=20); ax.set_title('Digits PCA(3) + KMeans'); plt.show()


### Extra Practical Tasks (grouped):
- silhouette on 5-center blobs
- BreastCancer PCA + Agglomerative (visualize 2D)
- Noisy circles: compare KMeans vs DBSCAN side-by-side
- Silhouette per sample after KMeans on Iris
- Agglomerative (average) on blobs visualize
- Wine dataset KMeans pairplot (first 4 features)
- Noisy blobs + DBSCAN identify clusters and noise
- Digits t-SNE + Agglomerative clustering


In [None]:

# 1) Silhouette for 5-center blobs
X5, _ = make_blobs(n_samples=500, centers=5, cluster_std=0.6, random_state=8)
km5 = KMeans(n_clusters=5, random_state=8).fit(X5); print('Silhouette (k=5):', silhouette_score(X5, km5.labels_))

# 2) BreastCancer PCA + Agglomerative visualize 2D
Xbc = load_breast_cancer(as_frame=True).data
pca2 = PCA(n_components=2, random_state=0); Xbc2 = pca2.fit_transform(Xbc)
agg_bc = AgglomerativeClustering(n_clusters=2).fit(Xbc2)
plt.figure(figsize=(5,4)); plt.scatter(Xbc2[:,0], Xbc2[:,1], c=agg_bc.labels_, cmap='tab10', s=20); plt.title('BreastCancer PCA(2) + Agglomerative'); plt.show()

# 3) Noisy circles: KMeans vs DBSCAN
Xn, _ = make_circles(n_samples=400, factor=0.5, noise=0.08, random_state=9)
kmn = KMeans(n_clusters=2, random_state=9).fit(Xn); dbn = DBSCAN(eps=0.12, min_samples=5).fit(Xn)
plt.figure(figsize=(10,4))
plt.subplot(1,2,1); plt.scatter(Xn[:,0], Xn[:,1], c=kmn.labels_, cmap='tab10'); plt.title('KMeans on noisy circles')
plt.subplot(1,2,2); plt.scatter(Xn[:,0], Xn[:,1], c=dbn.labels_, cmap='tab10'); plt.title('DBSCAN on noisy circles'); plt.show()

# 4) Silhouette per sample for Iris after KMeans (k=3)
from sklearn.metrics import silhouette_samples
km_iris = KMeans(n_clusters=3, random_state=0).fit(X_iris)
samps = silhouette_samples(X_iris, km_iris.labels_)
plt.figure(figsize=(6,3)); plt.bar(range(len(samps)), samps); plt.title('Silhouette values per sample (Iris + k=3)'); plt.show()

# 5) Agglomerative (average) on blobs visualize
Xavg, _ = make_blobs(n_samples=300, centers=3, random_state=12)
agg_avg = AgglomerativeClustering(n_clusters=3, linkage='average').fit(Xavg)
plt.figure(figsize=(6,4)); plt.scatter(Xavg[:,0], Xavg[:,1], c=agg_avg.labels_, cmap='tab10'); plt.title('Agglomerative average linkage'); plt.show()

# 6) Wine pairplot (first 4 features) with KMeans labels
wine_df = load_wine(as_frame=True).data
km_w = KMeans(n_clusters=3, random_state=0).fit(wine_df.iloc[:,:4])
sns.pairplot(wine_df.iloc[:,:4].assign(cluster=km_w.labels_), hue='cluster', corner=True); plt.show()

# 7) Noisy blobs + DBSCAN identify clusters and noise, print counts
Xnb, _ = make_blobs(n_samples=400, centers=4, cluster_std=1.2, random_state=20)
dbnb = DBSCAN(eps=0.9, min_samples=6).fit(Xnb)
labels_nb = dbnb.labels_
print('DBSCAN found clusters (excluding noise):', len(set(labels_nb)) - (1 if -1 in labels_nb else 0), 'noise count:', np.sum(labels_nb==-1))

# 8) Digits t-SNE + Agglomerative (sample subset)
Xd_tsne2 = TSNE(n_components=2, random_state=42, init='random', learning_rate='auto', perplexity=30).fit_transform(digits.data[:500])
agg_tsne = AgglomerativeClustering(n_clusters=10).fit(Xd_tsne2)
plt.figure(figsize=(6,4)); plt.scatter(Xd_tsne2[:,0], Xd_tsne2[:,1], c=agg_tsne.labels_, cmap='tab10', s=20); plt.title('Digits t-SNE + Agglomerative'); plt.show()


Run this notebook in Jupyter; each cell is designed to execute without internet access (uses sklearn built-in and synthetic data).