CLUSTERING ASSIGNMENT

Question 1: What is the difference between K-Means and Hierarchical Clustering?
Provide a use case for each.
ANSWER -🔹 Difference between K-Means and Hierarchical Clustering
Aspect	K-Means Clustering	Hierarchical Clustering
Approach	Partition-based: Divides data into k clusters by minimizing distance to cluster centroids.	Tree-based: Builds a hierarchy of clusters using agglomerative (bottom-up) or divisive (top-down) approaches.
Number of Clusters	Must be specified in advance (k).	No need to predefine clusters — you can choose clusters by cutting the dendrogram at a desired level.
Scalability	Works well with large datasets (fast and efficient).	Computationally expensive (at least O(n²) time and memory for standard agglomerative methods), better for small to medium datasets.
Cluster Shape	Assumes spherical clusters (works best when clusters are well-separated).	Can capture complex, nested cluster structures.
Result	Produces flat clusters.	Produces a dendrogram (tree structure of clusters).
Stability	Random initialization may lead to different results.	More stable since it does not rely on random initialization.
🔹 Use Cases

K-Means Use Case:
Customer segmentation in e-commerce.
→ For example, grouping customers into k segments based on purchase history, spending habits, and demographics for targeted marketing.

Hierarchical Clustering Use Case:
Gene expression analysis in bioinformatics.
→ Researchers use hierarchical clustering to group genes or proteins with similar expression patterns and visualize relationships through a dendrogram.
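
The contrast between the two approaches can be sketched in a few lines. This is a minimal illustration on synthetic blobs, not real customer or gene data; the blob centers, the Ward linkage, and the variable names are assumptions chosen for the example. K-Means needs k before fitting, while the hierarchical merge tree is built once and can then be cut at several levels.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Toy data: three well-separated blobs at hand-picked centers (illustrative only)
X, _ = make_blobs(n_samples=150, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=0.5, random_state=0)

# K-Means: the number of clusters is fixed before fitting
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Hierarchical: build the full merge tree once (Ward linkage assumed here),
# then cut the same tree at different levels without refitting
Z = linkage(X, method="ward")
hier_labels_3 = fcluster(Z, t=3, criterion="maxclust")
hier_labels_2 = fcluster(Z, t=2, criterion="maxclust")

print("K-Means clusters:", len(np.unique(kmeans_labels)))
print("Tree cut into 3 clusters:", len(np.unique(hier_labels_3)))
print("Tree cut into 2 clusters:", len(np.unique(hier_labels_2)))
print("Agreement (ARI) at 3 clusters:",
      adjusted_rand_score(kmeans_labels, hier_labels_3))
```

The linkage matrix Z has one row per merge (n - 1 rows), which is the information a dendrogram plot draws.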

Question 2: Explain the purpose of the Silhouette Score in evaluating clustering algorithms.
ANSWER - 🔹 Purpose of Silhouette Score

The Silhouette Score is a metric used to evaluate the quality of clusters created by a clustering algorithm (like K-Means, DBSCAN, or Hierarchical Clustering).

It measures how well each data point fits into its assigned cluster compared to other clusters.

🔹 Formula

For a data point i:

$$ s(i) = \frac{b(i) - a(i)}{\max(a(i),\, b(i))} $$

Where:

a(i) = Average distance of point i to all other points in its own cluster (cohesion).

b(i) = Average distance of point i to points in the nearest neighboring cluster (separation).

🔹 Interpretation

+1 (close to 1): Point is well-clustered → far from other clusters and close to its own cluster.

0: Point is on or near the boundary between clusters.

-1 (negative): Point is likely in the wrong cluster (closer to another cluster than its own).
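
As a sanity check on the formula, the score can be computed by hand and compared with scikit-learn. This is a sketch on a tiny hand-made dataset; the helper `silhouette_for_point` is a hypothetical name, and it assumes every cluster has at least two points (scikit-learn assigns 0 to single-point clusters, which this helper does not handle).

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

# Two tight groups on a line (illustrative values)
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

def silhouette_for_point(i, X, labels):
    """s(i) = (b - a) / max(a, b) for one point, using Euclidean distance."""
    dists = np.linalg.norm(X - X[i], axis=1)
    own = labels == labels[i]
    # a(i): mean distance to the OTHER points of the same cluster
    a = dists[own].sum() / (own.sum() - 1)
    # b(i): smallest mean distance to the points of any other cluster
    b = min(dists[labels == c].mean()
            for c in np.unique(labels) if c != labels[i])
    return (b - a) / max(a, b)

manual = np.array([silhouette_for_point(i, X, labels) for i in range(len(X))])
library = silhouette_samples(X, labels)

print("manual :", np.round(manual, 4))
print("sklearn:", np.round(library, 4))
print("mean score:", round(float(silhouette_score(X, labels)), 4))
```

For the first point, a = (1 + 2) / 2 = 1.5 and b = (10 + 11 + 12) / 3 = 11, so s = 9.5 / 11 ≈ 0.864, which is close to +1 as expected for a well-clustered point.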

🔹 Why it’s useful

Helps choose the optimal number of clusters (higher average silhouette score indicates better clustering).

Works without ground truth labels → well suited to evaluating unsupervised learning.

Gives both global and per-point quality: the average score summarizes overall cluster validity, and individual scores flag points that may be assigned to the wrong cluster.

Question 3: What are the core parameters of DBSCAN, and how do they influence the clustering process?
ANSWER- 🔹 Core Parameters of DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) mainly relies on two parameters:

ε (Epsilon / eps)

Defines the radius of the neighborhood around a point.

Two points are considered neighbors if the distance between them is ≤ ε.

Effect:

Small ε → many points become outliers, clusters may fragment.

Large ε → clusters may merge, reducing the number of clusters.

minPts (Minimum Points)

The minimum number of points required to form a dense region (a cluster).

Includes the point itself + neighbors within ε.

Effect:

Low minPts → even sparse areas become clusters (more noise absorbed).

High minPts → requires denser regions, more points labeled as noise.

🔹 Types of Points in DBSCAN

Core Point: Has at least minPts points within ε (counting the point itself).

Border Point: Has fewer than minPts neighbors, but lies within the neighborhood of a core point.

Noise (Outlier): Not a core point and not in any core point’s neighborhood.
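
The three point types can be recovered from a fitted scikit-learn model. The sketch below uses a hand-made layout (four points on a line, one point just off the end, one far away); the coordinates and parameter values are assumptions picked so that each type appears. In scikit-learn, `min_samples` plays the role of minPts and counts the point itself.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([
    [0.0, 0.0],   # end of the run: only 2 points within eps
    [1.0, 0.0],
    [2.0, 0.0],
    [3.0, 0.0],
    [4.2, 0.0],   # near the run, but only 2 points within eps
    [20.0, 0.0],  # isolated
])

db = DBSCAN(eps=1.5, min_samples=3).fit(X)

# Core points are reported directly by the fitted model
core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True

# Noise points carry the label -1; border points are everything else
noise_mask = db.labels_ == -1
border_mask = ~core_mask & ~noise_mask

print("core  :", np.flatnonzero(core_mask).tolist())
print("border:", np.flatnonzero(border_mask).tolist())
print("noise :", np.flatnonzero(noise_mask).tolist())
```

Points 1, 2 and 3 each have three points within ε and are core; points 0 and 4 have only two but sit inside a core point's neighborhood, so they are border points; point 5 is noise.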

🔹 How They Influence Clustering

ε controls "how close is close enough".

minPts controls "how many neighbors make a dense cluster".
Together, they determine:

Cluster shape (DBSCAN can find arbitrary-shaped clusters, unlike K-Means).

How much noise/outliers are identified.
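
A small parameter sweep makes these effects visible. This is a rough sketch on moon-shaped synthetic data; the specific ε and minPts values and the helper name `summarize` are assumptions for illustration, and the exact counts for intermediate settings depend on the data.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.1, random_state=42)

def summarize(eps, min_samples):
    """Return (number of clusters excluding noise, number of noise points)."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int(np.sum(labels == -1))
    return n_clusters, n_noise

# Vary eps with minPts fixed, then vary minPts with eps fixed
eps_results = {eps: summarize(eps, 5) for eps in (0.01, 0.1, 0.3, 5.0)}
minpts_results = {m: summarize(0.2, m) for m in (3, 5, 10, 20)}

for eps, (k, noise) in eps_results.items():
    print(f"eps={eps:<5} minPts=5  -> clusters={k}, noise={noise}")
for m, (k, noise) in minpts_results.items():
    print(f"eps=0.2   minPts={m:<2} -> clusters={k}, noise={noise}")
```

At the extremes the behavior is clear-cut: a tiny ε leaves every point as noise, while an ε larger than the whole dataset merges everything into one cluster. In between, the noise count can only fall as ε grows and can only rise as minPts grows.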


Question 4: Why is feature scaling important when applying clustering algorithms like K-Means and DBSCAN?
ANSWER - 🔹 Why Feature Scaling Matters in Clustering

Clustering algorithms like K-Means and DBSCAN rely heavily on distance measures (usually Euclidean distance) to group points.

👉 If features are on very different scales, the larger-scaled features will dominate the distance calculation, and clustering results will be biased.

🔹 Example

Suppose you’re clustering customers using:

Age: 20–70 (small range)

Annual Income: 30,000–200,000 (large range)

Without scaling:

Differences in income dominate the distance metric.

The algorithm may completely ignore age in forming clusters.
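
The effect is easy to see with three hypothetical customers. The values below are made up for illustration, and the scaling uses the feature ranges quoted above (age 20 to 70, income 30,000 to 200,000) as assumed minimum and maximum values.

```python
import numpy as np

# Hypothetical customers: [age, annual income]
A = np.array([25.0, 50_000.0])
B = np.array([65.0, 51_000.0])   # very different age, similar income
C = np.array([26.0, 60_000.0])   # similar age, different income

def dist(u, v):
    return float(np.linalg.norm(u - v))

# Raw distances: income differences swamp the 40-year age gap
raw_ab, raw_ac = dist(A, B), dist(A, C)

# Min-max scale each feature with the assumed ranges
lo = np.array([20.0, 30_000.0])
hi = np.array([70.0, 200_000.0])
scale = lambda p: (p - lo) / (hi - lo)
scaled_ab, scaled_ac = dist(scale(A), scale(B)), dist(scale(A), scale(C))

print(f"raw:    d(A,B)={raw_ab:.1f}  d(A,C)={raw_ac:.1f}")
print(f"scaled: d(A,B)={scaled_ab:.3f}  d(A,C)={scaled_ac:.3f}")
```

On raw values A looks closer to B, even though B is 40 years older. After scaling, A is closer to C, because the age gap now counts on the same footing as the income gap.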

🔹 Impact on Algorithms

K-Means:

Assigns each point to its nearest centroid by distance, then recomputes the centroids.

If one feature has a larger scale, clusters will be skewed toward that feature.

DBSCAN:

Uses ε-radius neighborhood.

If one feature dominates, ε becomes meaningless — clusters may not form properly.

🔹 Scaling Techniques Commonly Used

Standardization (Z-score):

$$ x' = \frac{x - \mu}{\sigma} $$


→ Centers features around 0 with unit variance.

Min-Max Normalization:

$$ x' = \frac{x - \min(x)}{\max(x) - \min(x)} $$


→ Scales features to a fixed range (often 0–1).

Robust Scaling:
Uses median and IQR (good for outliers).
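
The three techniques map onto scikit-learn's `StandardScaler`, `MinMaxScaler`, and `RobustScaler`. The sketch below applies each to one feature with a deliberate outlier; the sample values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# One feature with an outlier (illustrative values)
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

z = StandardScaler().fit_transform(x)    # (x - mean) / std
mm = MinMaxScaler().fit_transform(x)     # (x - min) / (max - min)
rb = RobustScaler().fit_transform(x)     # (x - median) / IQR

print("standardized:", np.round(z.ravel(), 3))
print("min-max     :", np.round(mm.ravel(), 3))
print("robust      :", np.round(rb.ravel(), 3))
```

With min-max scaling the outlier squeezes the four ordinary values into roughly the bottom 3% of the range. Robust scaling keeps them spread between -1 and 0.5 (median 3, IQR 2), leaving only the outlier far away.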

Question 5: What is the Elbow Method in K-Means clustering and how does it help determine the optimal number of clusters?
ANSWER - The Elbow Method is a heuristic used in K-Means clustering to choose a suitable number of clusters (k).

It works by looking at how the Within-Cluster Sum of Squares (WCSS), also called inertia, changes as you increase the number of clusters.

🔹 Steps of the Elbow Method

Run K-Means with different values of k (e.g., 1 to 10).

For each k, compute the WCSS (inertia):

$$ \mathrm{WCSS} = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2 $$

where C_i is cluster i, and μ_i is its centroid.

Plot WCSS vs. number of clusters (k).

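Look for the point where the rate of decrease sharply slows down: this point looks like an “elbow” in the curve. Beyond it, adding more clusters reduces WCSS only slightly, so the k at the elbow is a reasonable choice for the number of clusters.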
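
The steps above can be sketched without a plot by looking at how much WCSS drops at each k. This is a minimal example on synthetic blobs with four true centers; the helper `wcss` is a hypothetical name that evaluates the formula directly so it can be compared with scikit-learn's `inertia_`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=42)

def wcss(X, labels, centers):
    """Sum of squared distances from each point to its cluster centroid."""
    return float(sum(np.sum((X[labels == i] - c) ** 2)
                     for i, c in enumerate(centers)))

inertias = []
checks = []  # (manual WCSS, sklearn inertia_) pairs for comparison
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    checks.append((wcss(X, km.labels_, km.cluster_centers_), km.inertia_))
    inertias.append(km.inertia_)

# Drop in WCSS when going from k to k + 1 clusters
drops = -np.diff(inertias)
for k, (value, drop) in enumerate(zip(inertias[1:], drops), start=2):
    print(f"k={k}: WCSS={value:10.1f}  drop from k={k - 1}: {drop:10.1f}")
```

On this data the drop from k=3 to k=4 is much larger than the drop from k=4 to k=5, which is the elbow pattern described above.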

Question 6: Generate synthetic data using make_blobs(n_samples=300, centers=4),apply KMeans clustering, and visualize the results with cluster centers.
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Step 1: Generate synthetic data
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=42)

# Step 2: Apply KMeans clustering
kmeans = KMeans(n_clusters=4, random_state=42)
y_kmeans = kmeans.fit_predict(X)

# Step 3: Visualization
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')

# Plot cluster centers
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.75, marker='X', label='Centers')

plt.title("K-Means Clustering with 4 Centers")
plt.legend()
plt.show()

# Print cluster centers
print("Cluster Centers:\n", centers)

OUTPUT - [[ 4.6869   2.0143 ]
 [-2.6052   8.9928 ]
 [-6.8513  -6.8503 ]
 [-8.8346   7.2443 ]]

Question 7: Load the Wine dataset, apply StandardScaler , and then train a DBSCAN model. Print the number of clusters found (excluding noise).
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN
import numpy as np

# Step 1: Load the Wine dataset
wine = load_wine()
X = wine.data

# Step 2: Apply StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Step 3: Train DBSCAN model
dbscan = DBSCAN(eps=1.8, min_samples=5)  # eps tuned for wine dataset
labels = dbscan.fit_predict(X_scaled)

# Step 4: Count clusters (excluding noise = label -1)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)

print("Number of clusters found (excluding noise):", n_clusters)
print("Cluster labels:", np.unique(labels))

OUTPUT -Number of clusters found (excluding noise): 7
Cluster labels: [-1  0  1  2  3  4  5  6]

Question 8: Generate moon-shaped synthetic data using make_moons(n_samples=200, noise=0.1), apply DBSCAN, and highlight the outliers in the plot.
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN
import numpy as np

# Step 1: Generate moon-shaped synthetic data
X, y = make_moons(n_samples=200, noise=0.1, random_state=42)

# Step 2: Apply DBSCAN
dbscan = DBSCAN(eps=0.3, min_samples=5)
labels = dbscan.fit_predict(X)

# Step 3: Plot clusters and highlight outliers
plt.figure(figsize=(8, 6))

# Plot clustered points
plt.scatter(X[labels >= 0, 0], X[labels >= 0, 1],
            c=labels[labels >= 0], cmap='viridis', s=50, label="Clusters")

# Highlight outliers (label = -1)
plt.scatter(X[labels == -1, 0], X[labels == -1, 1],
            c='red', s=70, marker='x', label="Outliers")

plt.title("DBSCAN on Moon-shaped Data")
plt.legend()
plt.show()

# Print unique labels for clarity
print("Cluster labels:", np.unique(labels))

OUTPUT - Cluster labels: [-1  0  1]

Question 9: Load the Wine dataset, reduce it to 2D using PCA, then apply Agglomerative Clustering and visualize the result in 2D with a scatter plot.
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering

# Step 1: Load the Wine dataset
wine = load_wine()
X = wine.data

# Step 2: Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Step 3: Reduce dimensions to 2D using PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Step 4: Apply Agglomerative Clustering
agg_clustering = AgglomerativeClustering(n_clusters=3)  # wine has 3 classes
labels = agg_clustering.fit_predict(X_pca)

# Step 5: Visualization
plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, cmap='viridis', s=50)
plt.title("Agglomerative Clustering on Wine Dataset (PCA-reduced 2D)")
plt.xlabel("PCA Component 1")
plt.ylabel("PCA Component 2")
plt.show()

# Print cluster label summary
print("Unique cluster labels:", set(labels))

OUTPUT -Unique cluster labels: {0, 1, 2}

Question 10: You are working as a data analyst at an e-commerce company. The marketing team wants to segment customers based on their purchasing behavior to run
targeted promotions. The dataset contains customer demographics and their product purchase history across categories.
Describe your real-world data science workflow using clustering:
● Which clustering algorithm(s) would you use and why?
● How would you preprocess the data (missing values, scaling)?
● How would you determine the number of clusters?
● How would the marketing team benefit from your clustering analysis?
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs

# Step 1: Simulate customer data (demographics + purchase behavior)
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.2, random_state=42)

# Step 2: Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Step 3: Determine optimal clusters (Elbow + Silhouette)
wcss = []
sil_scores = []
K = range(2, 8)

for k in K:
    kmeans = KMeans(n_clusters=k, random_state=42)
    labels = kmeans.fit_predict(X_scaled)
    wcss.append(kmeans.inertia_)
    sil_scores.append(silhouette_score(X_scaled, labels))

# Plot Elbow Method
plt.figure(figsize=(12,5))
plt.subplot(1,2,1)
plt.plot(K, wcss, 'bo-')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('WCSS')
plt.title('Elbow Method')

# Plot Silhouette Score
plt.subplot(1,2,2)
plt.plot(K, sil_scores, 'go-')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Analysis')
plt.show()

# Step 4: Apply KMeans with optimal k (say 4)
kmeans = KMeans(n_clusters=4, random_state=42)
labels = kmeans.fit_predict(X_scaled)

# Step 5: Visualize clusters
plt.figure(figsize=(8,6))
plt.scatter(X_scaled[:,0], X_scaled[:,1], c=labels, cmap='viridis', s=50)
plt.scatter(kmeans.cluster_centers_[:,0], kmeans.cluster_centers_[:,1],
            c='red', marker='X', s=200, label='Centers')
plt.title("Customer Segmentation using K-Means")
plt.legend()
plt.show()

# Print results
print("Optimal number of clusters chosen:", 4)
print("Cluster labels distribution:\n", pd.Series(labels).value_counts())

OUTPUT - Elbow Plot → Shows “elbow” around k=4.

Silhouette Score Plot → Peak around k=4.

Cluster Visualization → 4 distinct customer groups with red centers.

Cluster distribution in the console (e.g., how many customers per segment).