
# Machine Learning Clustering Assignment Solutions
## PwSkills – Java + DSA

This notebook contains **all 48 questions** including:

✅ Theoretical Questions (with explanations)  
✅ Practical Questions (with Python implementations)  

Topics covered:
- Unsupervised Learning
- K-Means Clustering
- Hierarchical Clustering
- DBSCAN
- Silhouette Score
- PCA and t-SNE
- Clustering Visualization


## Theoretical Questions

### Q1. What is unsupervised learning in the context of machine learning?

**Answer:**
Unsupervised learning is a machine learning approach where models learn patterns from unlabeled data. The algorithm identifies hidden structures, similarities, or groupings without predefined outputs. Clustering and dimensionality reduction are common unsupervised learning tasks.

### Q2. How does K-Means clustering algorithm work?

**Answer:**
K-Means divides data into K clusters by initializing K centroids, assigning data points to the nearest centroid, updating centroids as the mean of assigned points, and repeating the process until centroids stabilize.
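
The loop described above can be sketched in plain NumPy. This is a minimal illustration rather than the scikit-learn implementation; the function name `kmeans_sketch`, the random initialization from data points, and the toy data are assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=0):
    """Minimal K-Means sketch (hypothetical helper): returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize centroids as k distinct, randomly chosen data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Two well-separated toy groups (made-up data).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centroids, labels = kmeans_sketch(X, k=2)
print(labels)
print(centroids)
```

The first three points should share one label and the last three the other; which group gets label 0 depends on the random initialization.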

### Q3. Explain the concept of a dendrogram in hierarchical clustering.

**Answer:**
A dendrogram is a tree-like diagram used in hierarchical clustering that shows how clusters are merged or split at different distance levels. It helps visualize cluster relationships.
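
As a small illustration, the merge history behind a dendrogram can be computed with SciPy. The toy points and the choice of average linkage are assumptions for the example; `no_plot=True` returns the tree layout without needing a plotting backend.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

# Five toy points: two tight pairs and one distant point (made-up data).
X = np.array([[0.0, 0.0], [0.0, 1.0],
              [5.0, 5.0], [5.0, 6.0],
              [20.0, 20.0]])

# Each row of Z records one merge: the two merged cluster ids,
# the distance at which they merged, and the size of the new cluster.
Z = linkage(X, method="average")
print(Z)

# Compute the dendrogram layout only; "ivl" holds the leaf labels
# in the left-to-right order they would be drawn.
tree = dendrogram(Z, no_plot=True)
print(tree["ivl"])
```

Calling `dendrogram(Z)` without `no_plot=True` draws the tree with matplotlib; the height of each join corresponds to the merge distance in the third column of `Z`.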

### Q4. What is the main difference between K-Means and Hierarchical Clustering?

**Answer:**
K-Means requires specifying the number of clusters beforehand and partitions data iteratively, while hierarchical clustering builds a hierarchy of clusters and does not require a predefined number of clusters; the number can be chosen afterwards by cutting the dendrogram at a chosen level.

### Q5. What are the advantages of DBSCAN over K-Means?

**Answer:**
DBSCAN can detect clusters of arbitrary shape, automatically identifies noise/outliers, and does not require specifying the number of clusters beforehand.

### Q6. When would you use Silhouette Score in clustering?

**Answer:**
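Silhouette Score is used to evaluate clustering quality when no ground-truth labels are available and to select the number of clusters. It ranges from -1 to 1 and measures how well each data point fits within its own cluster compared to the nearest other cluster; higher values indicate better-separated clusters.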
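
One possible way to use the score for choosing K is sketched below. The synthetic data (four well-separated blobs with hand-picked centers) and the range of K values tried are assumptions for the example.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Four well-separated synthetic blobs (centers chosen for illustration).
X, _ = make_blobs(
    n_samples=200,
    centers=[[0, 0], [10, 10], [-10, 10], [10, -10]],
    cluster_std=1.0,
    random_state=0,
)

# Fit K-Means for several K and record the mean silhouette score of each.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Pick the K with the highest mean silhouette score.
best_k = max(scores, key=scores.get)
print(scores)
print("best K:", best_k)
```

On real data the maximum is usually less clear-cut, so the score is best read alongside domain knowledge and a visual check.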

### Q7. What are the limitations of Hierarchical Clustering?

**Answer:**
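It is computationally expensive for large datasets (standard agglomerative algorithms store a pairwise distance matrix, so memory grows quadratically with the number of points), it is sensitive to noise and outliers, and once clusters are merged or split, the decision cannot be reversed.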

### Q8. Why is feature scaling important in clustering algorithms like K-Means?

**Answer:**
K-Means relies on distance calculations. Features with larger scales dominate the distance metric, leading to biased clustering results. Scaling ensures equal contribution from all features.
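
The effect can be seen on a tiny made-up example. The columns (age in years, income in dollars) and the values are hypothetical; the z-score scaling shown is the same transformation `StandardScaler` applies.

```python
import numpy as np

# Hypothetical records: [age in years, income in dollars].
X = np.array([[25.0, 50_000.0],
              [60.0, 51_000.0],
              [26.0, 90_000.0]])

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))

# Unscaled: the income column dominates every distance.
raw_01 = dist(X[0], X[1])  # rows differ mainly in age
raw_02 = dist(X[0], X[2])  # rows differ mainly in income
print(raw_01, raw_02)

# z-score scaling: subtract the column mean, divide by the column std.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
scaled_01 = dist(X_scaled[0], X_scaled[1])
scaled_02 = dist(X_scaled[0], X_scaled[2])
print(scaled_01, scaled_02)
```

Before scaling, the 35-year age gap is almost invisible next to the income gap; after scaling, both differences contribute on a comparable footing.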

### Q9. How does DBSCAN identify noise points?

**Answer:**
DBSCAN labels a point as noise when it is neither a core point (a point with at least min_samples points, itself included, within radius eps) nor within eps of any core point. In scikit-learn, noise points receive the label -1.
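
A small sketch of noise detection with scikit-learn follows. The toy points and the parameter values are assumptions chosen so that one point is clearly isolated.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Five tightly packed points plus one isolated point (made-up data).
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [0.1, 0.1], [0.05, 0.05],
              [10.0, 10.0]])

db = DBSCAN(eps=0.5, min_samples=3).fit(X)

# scikit-learn marks noise points with the label -1.
noise_mask = db.labels_ == -1
print(db.labels_)
print("noise points:", X[noise_mask])
print("core point indices:", db.core_sample_indices_)
```

The isolated point has no neighbors within `eps`, so it is neither a core point nor reachable from one and is reported as noise.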

### Q10. Define inertia in the context of K-Means.

**Answer:**
Inertia is the sum of squared distances between data points and their assigned cluster centroids. It measures cluster compactness.
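
The definition can be checked by hand on a tiny example. The points and the fixed cluster assignment below are made up for illustration; for a fitted scikit-learn model the same quantity is available as `KMeans.inertia_`.

```python
import numpy as np

# Four toy points with a fixed cluster assignment (made-up data).
X = np.array([[0.0, 0.0], [0.0, 2.0],
              [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])

# Centroid of each cluster = mean of the points assigned to it.
centroids = np.array([X[labels == j].mean(axis=0) for j in range(2)])

# Inertia = sum over all points of the squared distance to their own centroid.
inertia = float(((X - centroids[labels]) ** 2).sum())
print(centroids)
print(inertia)
```

Each point lies exactly 1 unit from its centroid, so the four squared distances sum to 4.0.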

### Q11. What is the elbow method in K-Means clustering?

**Answer:**
The elbow method plots inertia against different values of K. The value of K beyond which inertia decreases only marginally forms an "elbow" in the curve and is taken as a reasonable number of clusters.
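
A minimal sketch of collecting the values for an elbow plot is shown below, assuming synthetic blob data and K from 1 to 8. The plotting call is left as a comment so the snippet runs without a display.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Synthetic data with 4 blobs (illustrative choice).
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Fit K-Means for each K and record the resulting inertia.
ks = list(range(1, 9))
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
    for k in ks
]

for k, inertia in zip(ks, inertias):
    print(k, round(inertia, 1))

# To draw the curve: plt.plot(ks, inertias, marker="o")
```

Inertia keeps falling as K grows, so the useful signal is where the drop flattens out, not where the value is smallest.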

### Q12. Describe the concept of 'density' in DBSCAN.

**Answer:**
Density refers to the concentration of data points in a region. DBSCAN treats a point as lying in a dense region when at least min_samples points fall within radius eps of it, and it grows clusters by connecting neighboring dense regions.

### Q13. Can hierarchical clustering be used on categorical data?

**Answer:**
Yes, hierarchical clustering can be used with categorical data if appropriate similarity or distance measures such as Hamming distance are used.
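
One way this might look with SciPy is sketched below. The integer-encoded records and the use of average linkage are assumptions for the example; Ward linkage is avoided here because it assumes Euclidean distances.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical integer-encoded categorical records: [color, size, shape].
X = np.array([[0, 0, 0],
              [0, 0, 1],
              [1, 1, 1],
              [1, 1, 0]])

# Hamming distance = fraction of attributes on which two records differ.
D = pdist(X, metric="hamming")

# Agglomerative clustering on the precomputed (condensed) distances.
Z = linkage(D, method="average")

# Cut the hierarchy into at most 2 flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(D)
print(labels)
```

Records 0 and 1 differ in only one attribute, as do records 2 and 3, so each pair ends up in its own cluster.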

### Q14. What does a negative Silhouette Score indicate?

**Answer:**
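A negative Silhouette Score for a data point means it is, on average, closer to the points of a neighboring cluster than to the points of its own cluster, so it may have been assigned to the wrong cluster. A negative average score suggests the clustering as a whole is poor.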

### Q15. Explain the term 'linkage criteria' in hierarchical clustering.

**Answer:**
Linkage criteria determine how distances between clusters are computed, such as single linkage, complete linkage, average linkage, and Ward linkage.
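
The four criteria can be compared on two small clusters. The points are made up, and the Ward value is computed here as the increase in within-cluster sum of squares caused by the merge; libraries may report a rescaled version of this quantity.

```python
import numpy as np

# Two toy clusters (made-up data).
A = np.array([[0.0, 0.0], [0.0, 1.0]])
B = np.array([[3.0, 0.0], [3.0, 1.0]])

# All pairwise Euclidean distances between points of A and points of B.
pairwise = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

single = pairwise.min()     # single linkage: closest pair
complete = pairwise.max()   # complete linkage: farthest pair
average = pairwise.mean()   # average linkage: mean over all pairs

# Ward: increase in within-cluster sum of squares if A and B are merged.
n_a, n_b = len(A), len(B)
ward = n_a * n_b / (n_a + n_b) * np.sum((A.mean(axis=0) - B.mean(axis=0)) ** 2)

print(single, complete, average, ward)
```

Single linkage never exceeds average linkage, which never exceeds complete linkage, and this ordering is why single linkage tends to chain clusters together while complete linkage favors compact ones.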

### Q16. Why might K-Means clustering perform poorly on data with varying cluster sizes or densities?

**Answer:**
K-Means assumes clusters are spherical and similar in size. When clusters vary significantly in size or density, centroids may be incorrectly positioned.

### Q17. What are the core parameters in DBSCAN, and how do they influence clustering?

**Answer:**
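The core parameters are eps (the radius that defines a point's neighborhood) and min_samples (the minimum number of points within that radius needed to form a dense region). A larger eps merges nearby clusters and leaves fewer noise points, while a smaller eps fragments clusters and marks more points as noise. A larger min_samples demands denser regions, which typically yields fewer clusters and more noise.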
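
The influence of eps can be seen by sweeping it on the same data. The dataset and the three eps values are illustrative assumptions: a very small radius, a moderate one, and one larger than the whole dataset.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

results = {}
for eps in (0.01, 0.2, 5.0):
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
    # Count clusters (excluding the noise label -1) and noise points.
    n_clusters = len(set(labels.tolist()) - {-1})
    n_noise = int(np.sum(labels == -1))
    results[eps] = (n_clusters, n_noise)
    print(f"eps={eps}: clusters={n_clusters}, noise={n_noise}")
```

With the tiny radius no neighborhood is dense enough and every point is noise; with the huge radius every point is a neighbor of every other point and everything collapses into a single cluster.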

### Q18. How does K-Means++ improve upon standard K-Means initialization?

**Answer:**
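K-Means++ picks the first centroid at random and then chooses each subsequent centroid with probability proportional to its squared distance from the nearest centroid already chosen. The initial centroids therefore tend to be spread out, which improves convergence speed and reduces the chance of a poor clustering result.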
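
The seeding step might be sketched as follows. The helper name `kmeans_pp_init` and the toy data are assumptions for the example, and production implementations may differ in details such as trying several candidates per step.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """K-Means++ style seeding sketch (hypothetical helper): returns k initial centroids."""
    rng = np.random.default_rng(seed)
    # First centroid: a data point chosen uniformly at random.
    centroids = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        # Squared distance from every point to its nearest chosen centroid.
        diffs = X[:, None, :] - np.array(centroids)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        # Sample the next centroid with probability proportional to d2,
        # so far-away points are favored and chosen points (d2 = 0) are excluded.
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)

# Three tight pairs of points far from each other (made-up data).
X = np.array([[0.0, 0.0], [0.1, 0.0],
              [10.0, 10.0], [10.1, 10.0],
              [-10.0, 10.0], [-10.0, 10.1]])
init = kmeans_pp_init(X, k=3)
print(init)
```

Because already-chosen points have zero selection probability, the returned centroids are always distinct data points, and with data like this they are very likely to land in three different groups.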

### Q19. What is agglomerative clustering?

**Answer:**
A bottom-up hierarchical clustering method where each data point starts as its own cluster and clusters are iteratively merged based on similarity.

### Q20. What makes Silhouette Score a better metric than just inertia for model evaluation?

**Answer:**
Inertia measures only compactness, while Silhouette Score considers both intra-cluster cohesion and inter-cluster separation, making it a better evaluation metric.

## Practical Questions

In [None]:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.datasets import make_blobs, make_moons, make_circles
from sklearn.datasets import load_iris, load_wine, load_digits, load_breast_cancer
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.metrics import silhouette_score, silhouette_samples
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE


In [None]:
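# Q21
# Generate 4 synthetic blobs and cluster them with K-Means
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

# Color each point by its assigned cluster
plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.title("KMeans with 4 centers")
plt.show()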


In [None]:
# Q22
iris = load_iris()
X = iris.data
agg = AgglomerativeClustering(n_clusters=3)
labels = agg.fit_predict(X)
print(labels[:10])


In [None]:
# Q23
X,_ = make_moons(n_samples=300, noise=0.05, random_state=42)
db = DBSCAN(eps=0.2, min_samples=5)
labels = db.fit_predict(X)

plt.scatter(X[:,0], X[:,1], c=labels)
plt.title("DBSCAN - make_moons")
plt.show()


In [None]:
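# Q24
wine = load_wine()
X = StandardScaler().fit_transform(wine.data)  # scale features before K-Means
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

# Number of samples in each cluster, printed as plain Python ints
unique, counts = np.unique(labels, return_counts=True)
print(dict(zip(unique.tolist(), counts.tolist())))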


In [None]:
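# Q25
# Concentric circles; random_state is set so the result is reproducible
X, _ = make_circles(n_samples=300, noise=0.05, factor=0.5, random_state=42)
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)  # label -1 marks noise
plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.title("DBSCAN - make_circles")
plt.show()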
