
# Machine Learning Clustering Assignment Solutions
## PwSkills – Java + DSA

This notebook contains solutions for **all 48 questions**:

### Part A: Theoretical Questions (Q1–Q20)
- Unsupervised Learning
- K-Means Clustering
- Hierarchical Clustering
- DBSCAN
- Silhouette Score

### Part B: Practical Questions (Q21–Q48)
- Synthetic Data Generation
- Clustering Implementations
- PCA & t-SNE
- Visualization
- Model Evaluation

Each question is followed by its answer or implementation.


## Part A: Theoretical Questions

### Q1. What is unsupervised learning in the context of machine learning?

**Answer:**
Unsupervised learning is a machine learning approach where models learn patterns from unlabeled data. The algorithm identifies hidden structures or groupings without predefined outputs. Clustering and dimensionality reduction are common tasks.

### Q2. How does K-Means clustering algorithm work?

**Answer:**
K-Means partitions data into K clusters by initializing centroids, assigning points to the nearest centroid, updating centroids as the mean of assigned points, and repeating until convergence.
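
The four steps above can be sketched in plain NumPy. This is a minimal illustration, not scikit-learn's implementation: the function name `kmeans_sketch`, the random initialization, and the toy data are assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Minimal Lloyd's algorithm: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # 1. Initialise centroids as k distinct data points chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 2. Assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Move each centroid to the mean of its points (keep it if empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4. Stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Two well-separated groups of points (toy data)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centroids, labels = kmeans_sketch(X, k=2)
print(labels)
```

The cluster numbering (which group gets label 0) depends on the random initialization, so only the grouping itself is meaningful.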

### Q3. Explain the concept of a dendrogram in hierarchical clustering.

**Answer:**
A dendrogram is a tree-like diagram that shows the order in which clusters are merged (or split) in hierarchical clustering and the distance at which each merge happens. Cutting the tree at a chosen height yields a flat clustering.

### Q4. What is the main difference between K-Means and Hierarchical Clustering?

**Answer:**
K-Means requires specifying the number of clusters beforehand and partitions data iteratively, while hierarchical clustering builds a hierarchy of clusters without needing a predefined cluster count.

### Q5. What are the advantages of DBSCAN over K-Means?

**Answer:**
DBSCAN detects arbitrarily shaped clusters, automatically identifies noise, and does not require specifying the number of clusters.

### Q6. When would you use Silhouette Score in clustering?

**Answer:**
It is used to evaluate clustering quality and determine the optimal number of clusters by measuring cohesion and separation.

### Q7. What are the limitations of Hierarchical Clustering?

**Answer:**
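It is computationally expensive (the standard agglomerative algorithm needs O(n²) memory and up to O(n³) time), sensitive to noise and outliers, and greedy: once two clusters are merged, the merge cannot be undone at a later step.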

### Q8. Why is feature scaling important in clustering algorithms like K-Means?

**Answer:**
Because clustering uses distance metrics, features with larger scales dominate results unless scaled.
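
A small numeric sketch of this effect, using made-up values (the income and age columns are hypothetical). Standardization is done by hand here; it mirrors what `StandardScaler` computes.

```python
import numpy as np

# Hypothetical data: column 0 is income, column 1 is age
X = np.array([[30_000.0, 25.0],
              [30_500.0, 60.0],
              [60_000.0, 26.0]])

def euclidean(a, b):
    return float(np.linalg.norm(a - b))

# Unscaled: income dominates, so the 35-year age gap barely registers
d_raw_age_gap = euclidean(X[0], X[1])      # similar income, very different age
d_raw_income_gap = euclidean(X[0], X[2])   # similar age, very different income

# Standardise each feature to zero mean and unit variance
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
d_scaled_age_gap = euclidean(X_scaled[0], X_scaled[1])
d_scaled_income_gap = euclidean(X_scaled[0], X_scaled[2])

print(f"raw:    age gap {d_raw_age_gap:.1f}, income gap {d_raw_income_gap:.1f}")
print(f"scaled: age gap {d_scaled_age_gap:.2f}, income gap {d_scaled_income_gap:.2f}")
```

Before scaling, the income difference is dozens of times larger than the age difference; after scaling, the two gaps are of comparable size, so both features influence the cluster assignment.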

### Q9. How does DBSCAN identify noise points?

**Answer:**
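A point is labeled as noise when it is neither a core point (at least min_samples points, itself included, within the eps radius) nor within eps of any core point. A point with too few neighbors that still lies within eps of a core point is a border point and joins that cluster instead.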
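
The distinction between core, border, and noise points can be sketched with NumPy. The helper `classify_points` and the toy data are assumptions for illustration; this only classifies points and does not perform the full cluster expansion that DBSCAN does.

```python
import numpy as np

def classify_points(X, eps, min_samples):
    """Label each point 'core', 'border', or 'noise' using DBSCAN's definitions."""
    # Pairwise Euclidean distances
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbours = dists <= eps                 # each point counts as its own neighbour
    is_core = neighbours.sum(axis=1) >= min_samples
    # A non-core point is a border point if any core point lies within eps of it
    near_core = (neighbours & is_core[None, :]).any(axis=1)
    return np.where(is_core, "core", np.where(near_core, "border", "noise"))

X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],  # dense group
              [0.35, 0.1],                                      # on the edge of the group
              [5.0, 5.0]])                                      # isolated
kinds = classify_points(X, eps=0.3, min_samples=4)
print(kinds)
```

The edge point has only three points within `eps` (itself included), so it is not a core point, but it lies within `eps` of a core point and is therefore a border point rather than noise. Only the isolated point is noise.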

### Q10. Define inertia in the context of K-Means.

**Answer:**
Inertia is the sum of squared distances between data points and their assigned cluster centroid.
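
As a worked example, inertia can be computed directly from this definition. The helper name `inertia` and the four one-dimensional-looking points are assumptions chosen so the arithmetic is easy to check by hand.

```python
import numpy as np

def inertia(X, labels, centroids):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    diffs = X - centroids[labels]
    return float((diffs ** 2).sum())

X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [14.0, 0.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[1.0, 0.0], [12.0, 0.0]])   # means of each cluster
print(inertia(X, labels, centroids))  # 1 + 1 + 4 + 4 = 10.0
```

For a fitted scikit-learn `KMeans` model, the same quantity is available as the `inertia_` attribute.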

### Q11. What is the elbow method in K-Means clustering?

**Answer:**
Inertia is plotted against the number of clusters K, and the optimal K is taken at the "elbow": the point beyond which adding more clusters reduces inertia only marginally.

### Q12. Describe the concept of density in DBSCAN.

**Answer:**
Density refers to the number of points within a given radius. High-density areas form clusters.

### Q13. Can hierarchical clustering be used on categorical data?

**Answer:**
Yes, if suitable similarity or distance metrics like Hamming distance are used.
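
A short sketch with SciPy, assuming the categories have already been integer-encoded (the data below is hypothetical). Average linkage is used because it accepts any distance measure, whereas Ward linkage assumes Euclidean distances.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Hypothetical integer-encoded categorical data (rows = samples, columns = attributes)
X = np.array([[0, 0, 0, 0],
              [0, 0, 0, 1],
              [1, 1, 1, 1],
              [1, 1, 1, 0]])

# Hamming distance = fraction of attributes on which two rows disagree
D = pdist(X, metric="hamming")

# Build the hierarchy from the condensed distance matrix, then cut it into 2 clusters
Z = linkage(D, method="average")
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

The first two rows differ in one attribute out of four, as do the last two, so each pair ends up in its own cluster.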

### Q14. What does a negative Silhouette Score indicate?

**Answer:**
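A negative score means the sample is, on average, closer to the points of a neighboring cluster than to the points of its own cluster. This suggests the sample was assigned to the wrong cluster or that the clusters overlap.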
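
To make this concrete, the per-sample silhouette can be computed by hand for a deliberately mislabelled point. The helper `silhouette_values` is a sketch written for this example (scikit-learn provides `silhouette_samples` for real use), and the toy data and labels are assumptions.

```python
import numpy as np

def silhouette_values(X, labels):
    """Per-sample silhouette s = (b - a) / max(a, b).

    a = mean distance to the other points in the same cluster
    b = smallest mean distance to the points of any other cluster
    Assumes at least two clusters and no single-point clusters.
    """
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.empty(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False                       # exclude the point itself
        a = dists[i, same].mean()
        b = min(dists[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels = np.array([0, 0, 1, 1, 1, 1])   # the third point is deliberately mislabelled
scores = silhouette_values(X, labels)
print(np.round(scores, 2))
```

The third point sits next to cluster 0 but carries label 1, so its own-cluster distance `a` is far larger than its nearest-other-cluster distance `b`, and its score comes out negative.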

### Q15. Explain the term linkage criteria in hierarchical clustering.

**Answer:**
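It defines how the distance between two clusters is calculated. Common choices are single (closest pair of points), complete (farthest pair), average (mean over all pairs), and Ward linkage (the merge that least increases within-cluster variance).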
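
The three pairwise criteria can be sketched in a few lines of NumPy. The function `cluster_distance` and the two toy clusters are assumptions for illustration; Ward linkage is omitted because it is based on variance rather than pairwise distances.

```python
import numpy as np

def cluster_distance(A, B, linkage="single"):
    """Distance between clusters A and B under a pairwise linkage criterion."""
    # All pairwise Euclidean distances between a point in A and a point in B
    pairwise = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    if linkage == "single":
        return float(pairwise.min())     # closest pair
    if linkage == "complete":
        return float(pairwise.max())     # farthest pair
    if linkage == "average":
        return float(pairwise.mean())    # mean over all pairs
    raise ValueError(f"unsupported linkage: {linkage}")

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[3.0, 0.0], [5.0, 0.0]])
for name in ("single", "complete", "average"):
    print(name, cluster_distance(A, B, name))
# single 2.0, complete 5.0, average 3.5
```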

### Q16. Why might K-Means perform poorly on varying cluster sizes or densities?

**Answer:**
K-Means assumes spherical clusters of similar size and density, leading to poor results otherwise.

### Q17. What are the core parameters in DBSCAN?

**Answer:**
eps (the neighborhood radius) and min_samples (the minimum number of points within eps required to form a dense region).

### Q18. How does K-Means++ improve initialization?

**Answer:**
It spreads initial centroids apart to improve convergence and clustering quality.
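
A minimal sketch of the seeding step, assuming the standard k-means++ rule of sampling each new centroid with probability proportional to its squared distance from the nearest centroid already chosen. The function name `kmeans_pp_init` is hypothetical, and library implementations may differ in details.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Pick k initial centroids with k-means++ seeding."""
    rng = np.random.default_rng(seed)
    # First centroid: a data point chosen uniformly at random
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from each point to its nearest chosen centroid
        d2 = np.min(
            [((X - c) ** 2).sum(axis=1) for c in centroids], axis=0
        )
        # Sample the next centroid with probability proportional to d2,
        # so far-away points are favoured and chosen points have probability 0
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
init = kmeans_pp_init(X, k=2)
print(init)
```

Because points in the far group have much larger squared distances, the second centroid is very likely (though not guaranteed) to come from the group that the first centroid did not come from.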

### Q19. What is agglomerative clustering?

**Answer:**
A bottom-up hierarchical clustering approach where each point starts as a cluster and merges iteratively.

### Q20. Why is Silhouette Score better than inertia?

**Answer:**
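Inertia measures only compactness and keeps decreasing as K grows, so it cannot by itself indicate which K is best. Silhouette Score measures both cohesion and separation and is bounded between -1 and 1, which makes it comparable across different values of K.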
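
The difference can be seen by computing both metrics over a range of K on synthetic data. This is a sketch using the same `make_blobs` settings as the practical section; `n_init=10` is set explicitly as an assumption so that results do not depend on the scikit-learn version's default.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

results = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    results[k] = (km.inertia_, silhouette_score(X, km.labels_))

for k, (inertia, sil) in results.items():
    print(f"k={k}: inertia={inertia:.1f}, silhouette={sil:.3f}")
```

Inertia shrinks at every step, even when a natural cluster is being split, while the silhouette score is free to drop once K exceeds the number of well-separated groups.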

## Part B: Practical Questions

In [None]:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.datasets import make_blobs, make_moons, make_circles
from sklearn.datasets import load_iris, load_wine, load_digits, load_breast_cancer

from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.metrics import silhouette_score, silhouette_samples

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE


In [None]:
# Q21: Generate synthetic data with 4 centers using make_blobs and apply K-Means clustering
X,_ = make_blobs(n_samples=300, centers=4, random_state=42)
kmeans = KMeans(n_clusters=4, random_state=42)
labels = kmeans.fit_predict(X)

plt.scatter(X[:,0], X[:,1], c=labels)
plt.scatter(kmeans.cluster_centers_[:,0], kmeans.cluster_centers_[:,1], marker='x')
plt.title("KMeans with 4 centers")
plt.show()


In [None]:
# Q22: Load Iris dataset and apply Agglomerative Clustering
iris = load_iris()
X = iris.data

agg = AgglomerativeClustering(n_clusters=3)
labels = agg.fit_predict(X)

print(labels[:10])


In [None]:
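# Q23: make_moons with DBSCAN highlighting outliers
X,_ = make_moons(n_samples=300, noise=0.05, random_state=42)
db = DBSCAN(eps=0.2, min_samples=5)
labels = db.fit_predict(X)

noise = labels == -1  # DBSCAN labels outliers as -1
plt.scatter(X[~noise,0], X[~noise,1], c=labels[~noise])
plt.scatter(X[noise,0], X[noise,1], c='red', marker='x', label='Outliers')
plt.legend()
plt.title("DBSCAN on make_moons")
plt.show()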


In [None]:
# Q24: Wine dataset with KMeans after scaling
wine = load_wine()
X = StandardScaler().fit_transform(wine.data)

kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(X)

unique, counts = np.unique(labels, return_counts=True)
print(dict(zip(unique, counts)))


In [None]:
# Q25: make_circles with DBSCAN
X,_ = make_circles(n_samples=300, noise=0.05, factor=0.5, random_state=42)
labels = DBSCAN(eps=0.2).fit_predict(X)

plt.scatter(X[:,0], X[:,1], c=labels)
plt.title("DBSCAN on circles")
plt.show()


In [None]:
# Q26: Breast Cancer dataset with MinMaxScaler and KMeans
data = load_breast_cancer()
X = MinMaxScaler().fit_transform(data.data)

kmeans = KMeans(n_clusters=2, random_state=42)
kmeans.fit(X)

print("Cluster centroids:")
print(kmeans.cluster_centers_)


In [None]:
# Q27: make_blobs with varying std and DBSCAN
X,_ = make_blobs(n_samples=300, cluster_std=[1.0,2.5,0.5], random_state=42)
labels = DBSCAN(eps=0.8).fit_predict(X)

plt.scatter(X[:,0], X[:,1], c=labels)
plt.show()


In [None]:
# Q28: Digits dataset with PCA and KMeans visualization
digits = load_digits()
X = PCA(n_components=2).fit_transform(digits.data)

labels = KMeans(n_clusters=10, random_state=42).fit_predict(X)
plt.scatter(X[:,0], X[:,1], c=labels)
plt.title("Digits PCA + KMeans")
plt.show()


In [None]:
# Q29: Silhouette scores for k=2 to 5
X,_ = make_blobs(n_samples=300, centers=4, random_state=42)

scores = []
for k in range(2,6):
    labels = KMeans(n_clusters=k, random_state=42).fit_predict(X)
    scores.append(silhouette_score(X, labels))

plt.bar(range(2,6), scores)
plt.title("Silhouette Scores")
plt.show()


In [None]:
# Q30: Iris dendrogram (average linkage)
from scipy.cluster.hierarchy import dendrogram, linkage

iris = load_iris()
Z = linkage(iris.data, method='average')

plt.figure(figsize=(10,5))
dendrogram(Z)
plt.show()


In [None]:
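# Q31: Overlapping blobs with KMeans decision boundary
X,_ = make_blobs(n_samples=300, centers=3, cluster_std=2.5, random_state=42)
kmeans = KMeans(n_clusters=3, random_state=42).fit(X)

# Predict a cluster for every point on a grid to draw the decision regions
xx, yy = np.meshgrid(np.linspace(X[:,0].min()-1, X[:,0].max()+1, 300),
                     np.linspace(X[:,1].min()-1, X[:,1].max()+1, 300))
Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X[:,0], X[:,1], c=kmeans.labels_)
plt.title("KMeans decision boundary")
plt.show()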


In [None]:
# Q32: Digits dataset with t-SNE and DBSCAN
digits = load_digits()
X = TSNE(n_components=2, random_state=42).fit_transform(digits.data)

labels = DBSCAN(eps=3).fit_predict(X)
plt.scatter(X[:,0], X[:,1], c=labels)
plt.show()


In [None]:
# Q33: Agglomerative clustering with complete linkage
X,_ = make_blobs(n_samples=300, centers=3, random_state=42)
labels = AgglomerativeClustering(n_clusters=3, linkage='complete').fit_predict(X)

plt.scatter(X[:,0], X[:,1], c=labels)
plt.show()


In [None]:
# Q34: Wine dataset inertia comparison
wine = load_wine()
X = StandardScaler().fit_transform(wine.data)

inertia = []
for k in range(2,7):
    inertia.append(KMeans(n_clusters=k, random_state=42).fit(X).inertia_)

plt.plot(range(2,7), inertia)
plt.title("Inertia vs K")
plt.show()


In [None]:
# Q35: make_circles with Agglomerative (single linkage)
X,_ = make_circles(n_samples=300, noise=0.05, random_state=42)
labels = AgglomerativeClustering(n_clusters=2, linkage='single').fit_predict(X)

plt.scatter(X[:,0], X[:,1], c=labels)
plt.show()


In [None]:
# Q36: Wine dataset DBSCAN cluster count
wine = load_wine()
X = StandardScaler().fit_transform(wine.data)

labels = DBSCAN(eps=1.5).fit_predict(X)
print("Clusters:", len(set(labels)) - (1 if -1 in labels else 0))


In [None]:
# Q37: make_blobs with KMeans centers plotted
X,_ = make_blobs(n_samples=300, centers=3, random_state=42)
kmeans = KMeans(n_clusters=3, random_state=42).fit(X)

plt.scatter(X[:,0], X[:,1], c=kmeans.labels_)
plt.scatter(kmeans.cluster_centers_[:,0], kmeans.cluster_centers_[:,1], marker='x')
plt.show()


In [None]:
# Q38: Iris dataset with DBSCAN noise count
iris = load_iris()
labels = DBSCAN(eps=0.5).fit_predict(iris.data)

print("Noise samples:", list(labels).count(-1))


In [None]:
# Q39: make_moons with KMeans
X,_ = make_moons(n_samples=300, noise=0.05, random_state=42)
labels = KMeans(n_clusters=2, random_state=42).fit_predict(X)

plt.scatter(X[:,0], X[:,1], c=labels)
plt.show()


In [None]:
# Q40: Digits dataset PCA(3D) + KMeans
from mpl_toolkits.mplot3d import Axes3D

digits = load_digits()
X = PCA(n_components=3).fit_transform(digits.data)
labels = KMeans(n_clusters=10, random_state=42).fit_predict(X)

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X[:,0], X[:,1], X[:,2], c=labels)
plt.show()


In [None]:
# Q41: make_blobs with silhouette_score
X,_ = make_blobs(n_samples=300, centers=5, random_state=42)
labels = KMeans(n_clusters=5, random_state=42).fit_predict(X)

print("Silhouette Score:", silhouette_score(X, labels))


In [None]:
# Q42: Breast Cancer PCA + Agglomerative clustering
data = load_breast_cancer()
X = PCA(n_components=2).fit_transform(data.data)

labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)
plt.scatter(X[:,0], X[:,1], c=labels)
plt.show()


In [None]:
# Q43: KMeans vs DBSCAN on noisy circles
X,_ = make_circles(n_samples=300, noise=0.1, random_state=42)

plt.figure(figsize=(10,4))

plt.subplot(1,2,1)
plt.scatter(X[:,0], X[:,1], c=KMeans(n_clusters=2, random_state=42).fit_predict(X))
plt.title("KMeans")

plt.subplot(1,2,2)
plt.scatter(X[:,0], X[:,1], c=DBSCAN(eps=0.2).fit_predict(X))
plt.title("DBSCAN")

plt.show()


In [None]:
# Q44: Iris silhouette coefficient per sample
iris = load_iris()
labels = KMeans(n_clusters=3, random_state=42).fit_predict(iris.data)

sample_scores = silhouette_samples(iris.data, labels)
plt.plot(sample_scores)
plt.title("Silhouette Coefficient per Sample")
plt.show()


In [None]:
# Q45: Agglomerative clustering average linkage visualization
X,_ = make_blobs(n_samples=300, centers=3, random_state=42)
labels = AgglomerativeClustering(n_clusters=3, linkage='average').fit_predict(X)

plt.scatter(X[:,0], X[:,1], c=labels)
plt.show()


In [None]:
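# Q46: Wine dataset KMeans pairplot (first 4 features)
import pandas as pd

wine = load_wine()
df = pd.DataFrame(wine.data[:, :4], columns=wine.feature_names[:4])
X = StandardScaler().fit_transform(df)
df["cluster"] = KMeans(n_clusters=3, random_state=42).fit_predict(X)

sns.pairplot(df, hue="cluster")
plt.show()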


In [None]:
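# Q47: Noisy blobs DBSCAN cluster and noise count
X,_ = make_blobs(n_samples=300, centers=3, cluster_std=1.5, random_state=42)
labels = DBSCAN(eps=0.5).fit_predict(X)

# Noise is labelled -1, so exclude it when counting clusters
print("Clusters:", len(set(labels)) - (1 if -1 in labels else 0))
print("Noise points:", list(labels).count(-1))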


In [None]:
# Q48: Digits dataset t-SNE + Agglomerative clustering
digits = load_digits()
X = TSNE(n_components=2, random_state=42).fit_transform(digits.data)

labels = AgglomerativeClustering(n_clusters=10).fit_predict(X)
plt.scatter(X[:,0], X[:,1], c=labels)
plt.show()
