# Clustering Assignment: 48 Questions with Answers

**1. What is unsupervised learning in the context of machine learning?**

Unsupervised learning is a type of machine learning where the model learns patterns from unlabelled data. The algorithm tries to find hidden structure, clusters, or associations within the dataset without any target output.

**2. How does K-Means clustering algorithm work?**

K-Means clusters data by initializing 'k' centroids, assigning each point to the nearest centroid, then updating the centroids as the mean of all points in each cluster. This process repeats until the centroids no longer change significantly.
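The assign-then-update loop can be sketched in plain Python. This is only an illustration, not scikit-learn's implementation: the function name `kmeans`, the random choice of initial centroids, the `tol` stopping threshold, and the toy points are all assumptions made for the example.

```python
import math
import random

def kmeans(points, k, max_iter=100, tol=1e-9, seed=0):
    """Plain Lloyd iterations on a list of equal-length tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)              # k distinct points as initial centroids
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[j])  # keep the centroid of an empty cluster
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= tol:                            # centroids stopped moving
            break
    return centroids, labels

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

On this toy data the two well-separated groups end up in different clusters whichever points are drawn as initial centroids. On harder data the result can depend on the initialization.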

**3. Explain the concept of a dendrogram in hierarchical clustering.**

A dendrogram is a tree-like diagram that shows the arrangement of clusters formed by hierarchical clustering. It illustrates the merging or splitting of clusters at various levels of similarity or distance.

**4. What is the main difference between K-Means and Hierarchical Clustering?**

K-Means is a partitional clustering method that needs the number of clusters as input, while Hierarchical Clustering builds a hierarchy of clusters and doesn't require specifying the number of clusters beforehand.

**5. What are the advantages of DBSCAN over K-Means?**

DBSCAN can find clusters of arbitrary shape, labels outliers as noise instead of forcing them into a cluster, and does not need the number of clusters in advance. K-Means, by contrast, implicitly assumes roughly spherical clusters and requires a fixed value of k.

**6. When would you use Silhouette Score in clustering?**

Silhouette Score is used to measure how well data points fit within their clusters. It helps evaluate the quality of clustering and decide the optimal number of clusters.

**7. What are the limitations of Hierarchical Clustering?**

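Standard agglomerative implementations need memory, and at least time, quadratic in the number of samples, which makes the method expensive for large datasets. It is also sensitive to noise and outliers, and it is greedy: once clusters have been merged or split, the decision cannot be undone.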

**8. Why is feature scaling important in clustering algorithms like K-Means?**

Feature scaling ensures that each feature contributes equally to the distance calculations used by clustering algorithms. Without scaling, features with larger ranges can dominate.
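A small worked example may make the effect concrete. The three customer records below are hypothetical, and the helper `standardize` is an illustrative stand-in for a z-score scaler, written with the standard library so the arithmetic stays visible.

```python
import math
import statistics

# Hypothetical customers: (age in years, annual income in dollars).
a, b, c = (25, 50_000), (60, 50_500), (26, 50_800)

# Raw distances: the income axis dominates because its range is far larger.
raw_ab, raw_ac = math.dist(a, b), math.dist(a, c)

def standardize(rows):
    """Z-score every column using the population standard deviation."""
    cols = list(zip(*rows))
    means = [statistics.fmean(col) for col in cols]
    stds = [statistics.pstdev(col) for col in cols]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds)) for row in rows]

sa, sb, sc = standardize([a, b, c])
scaled_ab, scaled_ac = math.dist(sa, sb), math.dist(sa, sc)

print(f"raw:    d(a,b)={raw_ab:.1f}  d(a,c)={raw_ac:.1f}")
print(f"scaled: d(a,b)={scaled_ab:.2f}  d(a,c)={scaled_ac:.2f}")
```

On the raw values, `a` is nearest to `b` even though their ages differ by 35 years. After standardization, `a` is nearest to `c`, whose age is almost the same.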

**9. How does DBSCAN identify noise points?**

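DBSCAN labels a point as noise if it is neither a core point nor within eps of one. A core point has at least min_samples neighbors within radius eps. A point with fewer neighbors that lies within eps of a core point is a border point and still joins that cluster. Noise points don’t belong to any cluster, and scikit-learn gives them the label -1.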
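One subtlety is worth making concrete: a point with fewer than min_samples neighbors is only noise if it is also out of reach of every core point. The sketch below assigns roles only and does not build clusters. The function name `dbscan_roles` and the toy points are assumptions for illustration, and neighborhoods count the point itself, which is the convention scikit-learn follows.

```python
import math

def dbscan_roles(points, eps, min_samples):
    """Label each point 'core', 'border' or 'noise' (roles only, no cluster ids)."""
    # Indices of all points within eps of each point, the point itself included.
    neighbours = [[j for j, q in enumerate(points) if math.dist(p, q) <= eps]
                  for p in points]
    core = {i for i, nb in enumerate(neighbours) if len(nb) >= min_samples}
    roles = []
    for i, nb in enumerate(neighbours):
        if i in core:
            roles.append("core")
        elif any(j in core for j in nb):
            roles.append("border")   # not dense itself, but within eps of a core point
        else:
            roles.append("noise")    # neither dense nor near a dense region
    return roles

points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5),   # a dense square
          (1.2, 0.5),                                        # just off its edge
          (5.0, 5.0)]                                        # far from everything
roles = dbscan_roles(points, eps=0.8, min_samples=4)
print(roles)   # ['core', 'core', 'core', 'core', 'border', 'noise']
```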

**10. Define inertia in the context of K-Means.**

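Inertia is the sum of squared distances between each point and its assigned cluster centroid. It measures how internally coherent the clusters are. For a fixed k, lower inertia means tighter clusters, but inertia keeps falling as k increases (reaching zero when every point is its own cluster), so it cannot be used on its own to compare different values of k.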
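The definition can be checked by hand on four points. The helper name `inertia` and the data are assumptions for illustration. The centroids passed in are the means of the stated groupings.

```python
def inertia(points, labels, centroids):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    return sum(
        sum((x - c) ** 2 for x, c in zip(p, centroids[lab]))
        for p, lab in zip(points, labels)
    )

points = [(1.0, 1.0), (1.0, 3.0), (5.0, 5.0), (7.0, 5.0)]

one_cluster = inertia(points, [0, 0, 0, 0], [(3.5, 3.5)])
two_clusters = inertia(points, [0, 0, 1, 1], [(1.0, 2.0), (6.0, 5.0)])
four_clusters = inertia(points, [0, 1, 2, 3], points)   # every point is its own centroid

print(one_cluster, two_clusters, four_clusters)   # 38.0 4.0 0.0
```

The value drops from 38 to 4 to 0 as k goes from 1 to 2 to 4, even though k = 4 is not a useful clustering. This is why inertia is usually read through the elbow method rather than minimized outright.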

**11. What is the elbow method in K-Means clustering?**

The elbow method involves plotting the inertia against various k values and selecting the 'elbow point' where inertia decreases less sharply. This helps determine the optimal number of clusters.

**12. Describe the concept of "density" in DBSCAN.**

Density in DBSCAN refers to the number of points in a given neighborhood (radius eps). A region is considered dense if it has at least min_samples points within eps distance.

**13. Can hierarchical clustering be used on categorical data?**

Yes, but it requires using a suitable distance metric for categorical data (like Hamming distance) and might not perform as well as specialized categorical clustering methods.
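As a rough illustration of a categorical metric, the sketch below computes the normalized Hamming distance (the fraction of attributes that differ) between a few hypothetical records and finds the pair an agglomerative method would merge first. The record values and the helper name `hamming` are made up for the example.

```python
from itertools import combinations

def hamming(u, v):
    """Fraction of attributes on which two equal-length records disagree."""
    return sum(a != b for a, b in zip(u, v)) / len(u)

# Hypothetical records: (colour, size, fabric)
records = [
    ("red", "S", "cotton"),
    ("red", "M", "cotton"),
    ("blue", "L", "wool"),
]

# Pairwise distances, keyed by the pair of record indices.
distances = {(i, j): hamming(records[i], records[j])
             for i, j in combinations(range(len(records)), 2)}

first_merge = min(distances, key=distances.get)   # closest pair merges first
print(distances, first_merge)
```

Records 0 and 1 differ in one attribute out of three, so they are the closest pair. Both are at the maximum distance of 1.0 from record 2.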

**14. What does a negative Silhouette Score indicate?**

A negative Silhouette Score means that the sample is probably in the wrong cluster: on average it is closer to the points of a neighboring cluster than to the points of its own.
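The sign follows directly from the formula s = (b - a) / max(a, b), where a is the mean distance to the other members of the sample's own cluster and b is the mean distance to the nearest other cluster. The sketch below evaluates it by hand. The function name `silhouette` and the deliberately mislabeled toy data are assumptions for illustration.

```python
import math
from statistics import fmean

def silhouette(i, points, labels):
    """Silhouette coefficient s = (b - a) / max(a, b) for the point at index i."""
    own = [math.dist(points[i], q)
           for j, q in enumerate(points) if labels[j] == labels[i] and j != i]
    if not own:
        return 0.0            # convention for a cluster with a single member
    a = fmean(own)            # mean distance to the rest of its own cluster
    b = min(                  # mean distance to the nearest other cluster
        fmean([math.dist(points[i], q) for q, lab in zip(points, labels) if lab == other])
        for other in set(labels) - {labels[i]}
    )
    return (b - a) / max(a, b)

points = [(0.0, 0.0), (1.0, 0.0), (7.0, 0.0), (9.0, 0.0), (10.0, 0.0)]
labels = [0, 0, 0, 1, 1]      # the point at x = 7 is deliberately put in cluster 0

s_well_placed = silhouette(0, points, labels)
s_misplaced = silhouette(2, points, labels)
print(round(s_well_placed, 3), round(s_misplaced, 3))   # 0.579 -0.615
```

For the point at x = 7, a = 6.5 and b = 2.5, so b - a is negative: the point sits nearer to cluster 1 than to the cluster it was assigned to.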

**15. Explain the term "linkage criteria" in hierarchical clustering.**

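Linkage criteria define how the distance between two clusters is calculated. Common choices are single (minimum pairwise distance), complete (maximum pairwise distance), average (mean pairwise distance), and Ward (the increase in within-cluster variance caused by the merge).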
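To see how the criteria differ, the sketch below evaluates each one for the same two toy clusters. The clusters `A` and `B` are made up for the example. The Ward line computes the merge cost as the increase in within-cluster sum of squares, which libraries may report on a different scale.

```python
import math
from itertools import product

A = [(0.0, 0.0), (1.0, 0.0)]
B = [(3.0, 0.0), (5.0, 0.0)]

# All distances between a point of A and a point of B: 3, 5, 2, 4.
pairwise = [math.dist(p, q) for p, q in product(A, B)]

single = min(pairwise)                      # closest pair
complete = max(pairwise)                    # farthest pair
average = sum(pairwise) / len(pairwise)     # mean over all pairs

def centroid(cluster):
    return tuple(sum(c) / len(cluster) for c in zip(*cluster))

# Ward: increase in within-cluster sum of squares if A and B were merged.
ward = (len(A) * len(B) / (len(A) + len(B))
        * math.dist(centroid(A), centroid(B)) ** 2)

print(single, complete, average, ward)   # 2.0 5.0 3.5 12.25
```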

**16. Why might K-Means clustering perform poorly on data with varying cluster sizes or densities?**

Because K-Means assumes clusters are spherical and of similar size and density. It may merge small dense clusters or split large sparse ones incorrectly.

**17. What are the core parameters in DBSCAN, and how do they influence clustering?**

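The main parameters are eps (the neighborhood radius) and min_samples (the minimum number of points within eps for a point to count as a core point). A larger eps merges nearby clusters and leaves fewer noise points, while a smaller eps fragments clusters and labels more points as noise. A larger min_samples demands denser regions, so more points end up as noise.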

**18. How does K-Means++ improve upon standard K-Means initialization?**

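K-Means++ picks the first centroid uniformly at random from the data and each subsequent centroid with probability proportional to its squared distance from the nearest centroid already chosen. Spreading the initial centroids out this way reduces the chance of converging to a poor local optimum and usually speeds up convergence.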
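The seeding step can be sketched as follows. This is a simplified illustration of the idea rather than scikit-learn's routine: the function name `kmeans_pp_init`, the fixed `seed`, and the toy points are assumptions, and the sketch assumes the points are distinct and k does not exceed their number.

```python
import math
import random

def kmeans_pp_init(points, k, seed=0):
    """Pick k initial centroids, favouring points far from those already chosen."""
    rng = random.Random(seed)
    centroids = [rng.choice(points)]            # first centroid: uniform at random
    while len(centroids) < k:
        # Squared distance from every point to its nearest chosen centroid.
        d2 = [min(math.dist(p, c) ** 2 for c in centroids) for p in points]
        # Sample the next centroid with probability proportional to d2.
        # Points already chosen have weight 0 and cannot be drawn again.
        centroids.append(rng.choices(points, weights=d2, k=1)[0])
    return centroids

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
seeds = kmeans_pp_init(points, k=2)
print(seeds)
```

Because the far group carries almost all of the sampling weight once the first centroid is chosen, the second centroid is very likely, though not guaranteed, to come from the other group.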

**19. What is agglomerative clustering?**

Agglomerative clustering is a bottom-up approach where each point starts in its own cluster, and clusters are merged step by step based on distance metrics until one big cluster is formed.
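The bottom-up process can be traced on a handful of 1-D values. The sketch below is a naive illustration that assumes single linkage and absolute difference as the distance. The function name `agglomerate` and the values are made up, and real implementations are far more efficient.

```python
from itertools import combinations

def agglomerate(values):
    """Naive single-linkage merging of 1-D values; returns the merge history."""
    clusters = [[v] for v in values]            # every value starts as its own cluster

    def gap(i, j):
        # Single linkage: smallest distance between any two members.
        return min(abs(a - b) for a in clusters[i] for b in clusters[j])

    history = []
    while len(clusters) > 1:
        # Find the closest pair of clusters and merge them.
        i, j = min(combinations(range(len(clusters)), 2), key=lambda ij: gap(*ij))
        distance = gap(i, j)
        merged = sorted(clusters[i] + clusters[j])
        history.append((merged, distance))
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return history

history = agglomerate([1.0, 2.0, 6.0, 7.0, 15.0])
for members, distance in history:
    print(members, distance)
```

The recorded distances (1, 1, 4, 8) are the heights at which a dendrogram would draw each merge.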

**20. What makes Silhouette Score a better metric than just inertia for model evaluation?**

Silhouette Score considers both intra-cluster tightness and inter-cluster separation, making it a more balanced and informative metric than inertia alone.

### Q21. Generate synthetic data with 4 centers using `make_blobs` and apply K-Means clustering. Visualize using a scatter plot.

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Generate 300 points around 4 centers
X, y = make_blobs(n_samples=300, centers=4, random_state=42)

# Fit K-Means and get a cluster label for every point
kmeans = KMeans(n_clusters=4, random_state=42)
labels = kmeans.fit_predict(X)

# Plot the points colored by cluster, with centroids as red crosses
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=200, color='red', marker='X')
plt.title("K-Means Clustering with 4 Centers")
plt.show()

### Q22. Load the Iris dataset and use Agglomerative Clustering to group the data into 3 clusters. Display the first 10 predicted labels.

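from sklearn.datasets import load_iris
from sklearn.cluster import AgglomerativeClustering

# Load the four Iris features
iris = load_iris()
X = iris.data

# Agglomerative clustering into 3 clusters (Ward linkage by default)
agg = AgglomerativeClustering(n_clusters=3)
labels = agg.fit_predict(X)

print("First 10 predicted labels:", labels[:10])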

### Q23. Generate synthetic data using `make_moons` and apply DBSCAN. Highlight outliers in the plot.

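from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt

X, y = make_moons(n_samples=300, noise=0.1, random_state=42)
dbscan = DBSCAN(eps=0.2, min_samples=5)
labels = dbscan.fit_predict(X)

# DBSCAN marks noise points with the label -1; draw them separately so they stand out
noise = labels == -1
plt.scatter(X[~noise, 0], X[~noise, 1], c=labels[~noise], cmap='Paired')
plt.scatter(X[noise, 0], X[noise, 1], c='red', marker='x', label='Outliers (label -1)')
plt.title("DBSCAN on make_moons (outliers highlighted)")
plt.legend()
plt.show()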

### Q24. Load the Wine dataset and apply K-Means clustering after standardizing the features. Print the size of each cluster.

from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import numpy as np

# Load data
data = load_wine()
X = data.data

# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply KMeans
kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(X_scaled)

# Print size of each cluster
unique, counts = np.unique(labels, return_counts=True)
print("Cluster sizes:", dict(zip(unique, counts)))


### Q25. Use `make_circles` to generate synthetic data and cluster it using DBSCAN. Plot the result.

from sklearn.datasets import make_circles
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt

# Generate data
X, _ = make_circles(n_samples=300, factor=0.5, noise=0.05, random_state=42)

# DBSCAN
db = DBSCAN(eps=0.2, min_samples=5)
labels = db.fit_predict(X)

# Plot
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='coolwarm')
plt.title("DBSCAN on make_circles Data")
plt.show()


### Q26. Load the Breast Cancer dataset, apply `MinMaxScaler`, and use K-Means with 2 clusters. Output the cluster centroids.

from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans

# Load data
data = load_breast_cancer()
X = data.data

# Scale
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# KMeans
kmeans = KMeans(n_clusters=2, random_state=42)
kmeans.fit(X_scaled)

# Centroids
print("Cluster centroids:\n", kmeans.cluster_centers_)


### Q27. Generate synthetic data using `make_blobs` with varying cluster standard deviations and cluster with DBSCAN.

from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt

# Generate data
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=[1.0, 2.5, 0.5], random_state=42)

# DBSCAN
db = DBSCAN(eps=1.2, min_samples=5)
labels = db.fit_predict(X)

# Plot
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='Set1')
plt.title("DBSCAN on Blobs with Varying Std")
plt.show()


### Q28. Load the Digits dataset, reduce it to 2D using PCA, and visualize clusters from K-Means.

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Load data
digits = load_digits()
X = digits.data

# PCA to 2D
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# KMeans
kmeans = KMeans(n_clusters=10, random_state=42)
labels = kmeans.fit_predict(X_pca)

# Plot
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, cmap='tab10')
plt.title("K-Means Clustering on Digits (PCA 2D)")
plt.show()


### Q29. Create synthetic data using `make_blobs` and evaluate silhouette scores for k = 2 to 5. Display as a bar chart.

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

# Data
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Calculate Silhouette Scores
scores = []
for k in range(2, 6):
    kmeans = KMeans(n_clusters=k, random_state=42)
    labels = kmeans.fit_predict(X)
    score = silhouette_score(X, labels)
    scores.append(score)

# Plot
plt.bar(range(2, 6), scores, color='orange')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Scores for K = 2 to 5')
plt.show()


### Q30. Load the Iris dataset and use hierarchical clustering to group data. Plot a dendrogram with average linkage.

from sklearn.datasets import load_iris
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt

# Load data
iris = load_iris()
X = iris.data

# Linkage and dendrogram
linked = linkage(X, method='average')

plt.figure(figsize=(10, 5))
dendrogram(linked, labels=iris.target, truncate_mode='level', p=5)
plt.title("Hierarchical Clustering Dendrogram (Average Linkage)")
plt.xlabel("Class label (or cluster size in parentheses)")
plt.ylabel("Distance")
plt.show()


### Q31. Generate synthetic data with overlapping clusters using `make_blobs`, then apply K-Means and visualize with decision boundaries.

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import numpy as np

# Data with overlapping clusters
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=2.5, random_state=42)

# Apply KMeans
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X)
labels = kmeans.predict(X)

# Plot decision boundaries
x_min, x_max = X[:, 0].min()-1, X[:, 0].max()+1
y_min, y_max = X[:, 1].min()-1, X[:, 1].max()+1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                     np.arange(y_min, y_max, 0.1))
Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.2)
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], marker='x', color='red')
plt.title("K-Means with Decision Boundaries")
plt.show()


### Q32. Load the Digits dataset and apply DBSCAN after reducing dimensions with t-SNE. Visualize the results.

from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt

# Load and reduce
digits = load_digits()
X = digits.data
X_tsne = TSNE(n_components=2, random_state=42).fit_transform(X)

# DBSCAN
db = DBSCAN(eps=5, min_samples=5)
labels = db.fit_predict(X_tsne)

# Plot
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=labels, cmap='tab10')
plt.title("DBSCAN on Digits Dataset (t-SNE Reduced)")
plt.show()


### Q33. Generate synthetic data using `make_blobs` and apply Agglomerative Clustering with complete linkage. Plot the result.

from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt

# Generate data
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Agglomerative Clustering
agg = AgglomerativeClustering(n_clusters=4, linkage='complete')
labels = agg.fit_predict(X)

# Plot
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='rainbow')
plt.title("Agglomerative Clustering (Complete Linkage)")
plt.show()


### Q34. Load the Breast Cancer dataset and compare inertia values for K = 2 to 6 using K-Means. Show results in a line plot.

from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Load and scale
data = load_breast_cancer()
X = StandardScaler().fit_transform(data.data)

# Inertia for k = 2 to 6
inertias = []
k_values = range(2, 7)
for k in k_values:
    km = KMeans(n_clusters=k, random_state=42)
    km.fit(X)
    inertias.append(km.inertia_)

# Plot
plt.plot(k_values, inertias, marker='o')
plt.title("Inertia for K = 2 to 6")
plt.xlabel("K")
plt.ylabel("Inertia")
plt.grid(True)
plt.show()


### Q35. Generate synthetic concentric circles using `make_circles` and cluster using Agglomerative Clustering with single linkage.

from sklearn.datasets import make_circles
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt

# Data
X, _ = make_circles(n_samples=300, factor=0.5, noise=0.05, random_state=42)

# Clustering
agg = AgglomerativeClustering(n_clusters=2, linkage='single')
labels = agg.fit_predict(X)

# Plot
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='cool')
plt.title("Agglomerative Clustering on Circles (Single Linkage)")
plt.show()


### Q36. Use the Wine dataset, apply DBSCAN after scaling the data, and count the number of clusters (excluding noise).

from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

# Load and scale
data = load_wine()
X = StandardScaler().fit_transform(data.data)

# DBSCAN
db = DBSCAN(eps=1.5, min_samples=5)
labels = db.fit_predict(X)

# Count clusters (excluding noise)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("Number of clusters (excluding noise):", n_clusters)


### Q37. Generate synthetic data with `make_blobs` and apply KMeans. Then plot the cluster centers on top of the data points.

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Data
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# KMeans
kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(X)

# Plot
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], color='red', marker='x')
plt.title("K-Means Clustering with Centers")
plt.show()


### Q38. Load the Iris dataset, cluster with DBSCAN, and print how many samples were identified as noise.

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN
import numpy as np

# Load and scale
X = load_iris().data
X_scaled = StandardScaler().fit_transform(X)

# DBSCAN
db = DBSCAN(eps=0.8, min_samples=5)
labels = db.fit_predict(X_scaled)

# Count noise (-1)
n_noise = np.sum(labels == -1)
print("Number of noise samples:", n_noise)


### Q39. Generate synthetic non-linearly separable data using `make_moons`, apply K-Means, and visualize the clustering result.


from sklearn.datasets import make_moons
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Data
X, _ = make_moons(n_samples=300, noise=0.1, random_state=42)

# KMeans
kmeans = KMeans(n_clusters=2, random_state=42)
labels = kmeans.fit_predict(X)

# Plot
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='Set2')
plt.title("K-Means on make_moons (Non-linear Clusters)")
plt.show()


### Q40. Load the Digits dataset, apply PCA to reduce to 3 components, then use KMeans and visualize with a 3D scatter plot.

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401  (registers the '3d' projection on older Matplotlib)

# Load and reduce
digits = load_digits()
X = digits.data
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X)

# KMeans
kmeans = KMeans(n_clusters=10, random_state=42)
labels = kmeans.fit_predict(X_pca)

# 3D Plot
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X_pca[:, 0], X_pca[:, 1], X_pca[:, 2], c=labels, cmap='Spectral')
ax.set_title("3D PCA + KMeans on Digits")
plt.show()


### Q41. Generate synthetic blobs with 5 centers and apply KMeans. Then use `silhouette_score` to evaluate the clustering.

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Data
X, _ = make_blobs(n_samples=500, centers=5, random_state=42)

# KMeans
kmeans = KMeans(n_clusters=5, random_state=42)
labels = kmeans.fit_predict(X)

# Silhouette Score
score = silhouette_score(X, labels)
print("Silhouette Score:", score)


### Q42. Load the Breast Cancer dataset, reduce dimensionality using PCA, and apply Agglomerative Clustering. Visualize in 2D.

from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt

# Load and scale
data = load_breast_cancer()
X = StandardScaler().fit_transform(data.data)

# PCA
X_pca = PCA(n_components=2).fit_transform(X)

# Clustering
agg = AgglomerativeClustering(n_clusters=2)
labels = agg.fit_predict(X_pca)

# Plot
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, cmap='coolwarm')
plt.title("Agglomerative Clustering on Breast Cancer (2D PCA)")
plt.show()


### Q43. Generate noisy circular data using `make_circles` and visualize clustering results from KMeans and DBSCAN side-by-side.

from sklearn.datasets import make_circles
from sklearn.cluster import KMeans, DBSCAN
import matplotlib.pyplot as plt

# Data
X, _ = make_circles(n_samples=300, factor=0.5, noise=0.05, random_state=42)

# Clustering (fit_predict returns the label array, not the fitted model)
kmeans_labels = KMeans(n_clusters=2, random_state=42).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Plot side by side
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.scatter(X[:, 0], X[:, 1], c=kmeans_labels, cmap='Set2')
plt.title("K-Means Clustering")

plt.subplot(1, 2, 2)
plt.scatter(X[:, 0], X[:, 1], c=dbscan_labels, cmap='Set1')
plt.title("DBSCAN Clustering")

plt.show()


### Q44. Load the Iris dataset and plot the Silhouette Coefficient for each sample after KMeans clustering.

from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples
import matplotlib.pyplot as plt
import numpy as np

# Data
X = load_iris().data

# KMeans
kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(X)

# Silhouette Coefficients
sil_values = silhouette_samples(X, labels)

# Plot
plt.bar(range(len(X)), sil_values)
plt.title("Silhouette Coefficient for Each Sample (Iris)")
plt.xlabel("Sample Index")
plt.ylabel("Silhouette Value")
plt.show()


### Q45. Generate synthetic data using `make_blobs` and apply Agglomerative Clustering with 'average' linkage. Visualize clusters.

from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt

# Data
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Agglomerative Clustering
agg = AgglomerativeClustering(n_clusters=3, linkage='average')
labels = agg.fit_predict(X)

# Plot
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='Accent')
plt.title("Agglomerative Clustering with Average Linkage")
plt.show()


### Q46. Load the Wine dataset, apply KMeans, and visualize the cluster assignments in a seaborn pairplot (first 4 features).

from sklearn.datasets import load_wine
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Load data
data = load_wine()
df = pd.DataFrame(data.data, columns=data.feature_names)
df = df.iloc[:, :4]  # First 4 features

# KMeans (fit on the 4 features before the label column is added)
kmeans = KMeans(n_clusters=3, random_state=42)
df['Cluster'] = kmeans.fit_predict(df)

# Pairplot
sns.pairplot(df, hue='Cluster', diag_kind='kde')
plt.show()


### Q47. Generate noisy blobs using `make_blobs` and use DBSCAN to identify both clusters and noise points. Print the count.

from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN
import numpy as np

# Data
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.5, random_state=42)

# DBSCAN
db = DBSCAN(eps=1.2, min_samples=5)
labels = db.fit_predict(X)

# Count clusters and noise
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = np.sum(labels == -1)

print("Clusters found:", n_clusters)
print("Noise points:", n_noise)


### Q48. Load the Digits dataset, reduce dimensions using t-SNE, then apply Agglomerative Clustering and plot the clusters.

from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt

# Load and reduce
X = load_digits().data
X_tsne = TSNE(n_components=2, random_state=42).fit_transform(X)

# Agglomerative Clustering
agg = AgglomerativeClustering(n_clusters=10)
labels = agg.fit_predict(X_tsne)

# Plot
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=labels, cmap='tab10')
plt.title("Agglomerative Clustering on Digits (t-SNE Reduced)")
plt.show()
