In [None]:
# 1. What is unsupervised learning in the context of machine learning?

Unsupervised learning is a type of machine learning where the model is trained on data without labeled responses. The goal is to identify patterns, groupings, or structure within the data. Common tasks include clustering and dimensionality reduction.

In [None]:
# 2. How does K-Means clustering algorithm work?

K-Means works by partitioning data into K clusters. It initializes K centroids, assigns each data point to the nearest centroid, recalculates the centroids as the mean of the points in each cluster, and repeats the process until the centroids stabilize (convergence).
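The loop described above can be sketched in plain NumPy. This is a minimal illustration, not scikit-learn's implementation; the function name `kmeans_sketch`, the random initialization from data points, and the demo data are assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Illustrative K-Means loop: assign, update, repeat until centroids stop moving."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct data points at random (an assumption).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every point.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged
        centroids = new_centroids
    return labels, centroids

# Demo data: two well-separated groups of 50 points each.
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels_demo, centroids_demo = kmeans_sketch(X_demo, k=2)
print(centroids_demo)
```

On data this well separated, the two recovered centroids should land near (0, 0) and (5, 5).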

In [None]:
# 3. Explain the concept of a dendrogram in hierarchical clustering.

A dendrogram is a tree-like diagram that records the sequence of merges or splits in hierarchical clustering. It shows how clusters are formed, and cutting the tree at a chosen height yields a specific set of clusters, which helps in deciding how many clusters to use.
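Cutting the tree at a height can be tried without drawing it, using SciPy's `fcluster`. The toy points and the cut height of 1.0 below are assumptions chosen so the result is easy to check by eye.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Six 1-D points forming three tight pairs (toy data).
points = np.array([[0.0], [0.2], [5.0], [5.3], [10.0], [10.1]])
Z_demo = linkage(points, method='average')

# Cut the tree at height 1.0: only merges below that distance are kept.
labels_cut = fcluster(Z_demo, t=1.0, criterion='distance')
print(labels_cut)                      # one label per point
print(len(set(labels_cut)), "clusters")
```

Raising the cut height above the distance between the pairs would merge them into fewer clusters.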

In [None]:
# 4. What is the main difference between K-Means and Hierarchical Clustering?

K-Means requires the number of clusters to be predefined and creates non-overlapping clusters, whereas hierarchical clustering builds a hierarchy of clusters and does not require the number of clusters initially.

In [None]:
# 5. What are the advantages of DBSCAN over K-Means?

No need to predefine the number of clusters

Can detect clusters of arbitrary shape

Can identify and label noise/outliers

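Less sensitive to outliers, since noise points are left unassigned instead of distorting a centroid (note that a single eps value can still struggle when clusters have very different densities)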

In [None]:
# 6. When would you use Silhouette Score in clustering?

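Silhouette Score is used to evaluate the quality of clusters, for example when comparing different values of k or different algorithms without ground-truth labels. It measures how similar a point is to its own cluster compared to the nearest other cluster. Scores range from -1 to 1, and a high score indicates well-separated and cohesive clusters.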
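For a single sample, the silhouette value is (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points of the nearest other cluster. The sketch below computes this by hand on toy data and compares it with scikit-learn's `silhouette_samples`. The helper `silhouette_of` is hypothetical and assumes every cluster has at least two points.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Toy data: two compact groups of three points each.
X_sil = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [8.0, 8.0], [8.0, 9.0], [9.0, 8.0]])
labels_sil = np.array([0, 0, 0, 1, 1, 1])

def silhouette_of(i, X, labels):
    """Silhouette of sample i: (b - a) / max(a, b)."""
    dists = np.linalg.norm(X - X[i], axis=1)
    own = labels == labels[i]
    own[i] = False  # exclude the point itself from its own cluster
    a = dists[own].mean()
    b = min(dists[labels == c].mean() for c in np.unique(labels) if c != labels[i])
    return (b - a) / max(a, b)

manual = np.array([silhouette_of(i, X_sil, labels_sil) for i in range(len(X_sil))])
print(manual)
print(silhouette_samples(X_sil, labels_sil))  # should match the manual values
```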

In [None]:
# 7. What are the limitations of Hierarchical Clustering?

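Scalability: standard agglomerative algorithms need O(n^2) memory for the pairwise distance matrix and between O(n^2) and O(n^3) time, so they are impractical for large datasets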

No re-evaluation: Once a merge or split is done, it can't be undone

Sensitive to noisy data and outliers

In [None]:
# 8. Why is feature scaling important in clustering algorithms like K-Means?

K-Means uses distance-based metrics (like Euclidean distance), which can be biased if features have different units or scales. Scaling ensures all features contribute equally to the distance calculations.
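A small numeric check makes the effect visible. The income and age values below are hypothetical. Before scaling, the income column dominates the Euclidean distance. After `StandardScaler`, both features contribute on a comparable scale.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features: income in dollars and age in years.
X_raw = np.array([[30_000.0, 25.0],
                  [30_500.0, 60.0],   # similar income, very different age
                  [60_000.0, 26.0]])  # very different income, similar age

def pairwise(X):
    """Euclidean distance between every pair of rows."""
    return np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

d_raw = pairwise(X_raw)
d_scaled = pairwise(StandardScaler().fit_transform(X_raw))

print("raw:   ", d_raw[0, 1], d_raw[0, 2])        # age difference is almost invisible
print("scaled:", d_scaled[0, 1], d_scaled[0, 2])  # both differences now count
```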

In [None]:
# 9. How does DBSCAN identify noise points?

DBSCAN labels a point as noise if it is not a core point (it has fewer than minPts neighbors within distance eps) and it does not lie within eps of any core point, so it is not reachable from any dense region (cluster).
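In scikit-learn, noise points receive the label -1 and core points are listed in `core_sample_indices_`. The toy data and parameter values below are assumptions for illustration. Note that scikit-learn's `min_samples` counts the point itself.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# A tight group of five points plus one isolated point (toy data).
X_db = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.05, 0.05],
                 [5.0, 5.0]])
db_demo = DBSCAN(eps=0.3, min_samples=4).fit(X_db)

print(db_demo.labels_)               # the first five share label 0; the last is -1
print(db_demo.core_sample_indices_)  # indices of the core points
```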

In [None]:
# 10. Define inertia in the context of K-Means.

Inertia is the sum of squared distances between each data point and its assigned centroid. Lower inertia indicates more compact clusters, but inertia keeps decreasing as the number of clusters grows (it reaches zero when every point is its own cluster), so it should be balanced against model complexity.
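The definition can be checked against scikit-learn's `inertia_` attribute by recomputing the sum by hand. The blob data and parameter values are assumptions for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_in, _ = make_blobs(n_samples=200, centers=3, random_state=0)
km_in = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_in)

# Look up the centroid assigned to each point, then sum the squared distances.
assigned = km_in.cluster_centers_[km_in.labels_]
manual_inertia = np.sum((X_in - assigned) ** 2)

print(manual_inertia, km_in.inertia_)  # the two values should agree closely
```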

In [None]:
# 11. What is the elbow method in K-Means clustering?

The elbow method helps determine the optimal number of clusters by plotting inertia vs. the number of clusters. The "elbow point" (where the curve bends) is considered the best trade-off between complexity and performance.

In [None]:
# 12. Describe the concept of "density" in DBSCAN.

In DBSCAN, density refers to the number of data points within a certain radius (eps) of a point. If a point has enough neighboring points (≥ minPts), it is a core point and belongs to a dense region, i.e., a cluster.

In [None]:
# 13. Can hierarchical clustering be used on categorical data?

Yes. Standard hierarchical clustering relies on a distance metric, so categorical data must either be encoded numerically (e.g., one-hot encoding) or compared with a distance suited to categories, such as Hamming distance.
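As a rough sketch of the Hamming-distance route, SciPy can build the hierarchy from a precomputed condensed distance matrix. The integer-coded records below are hypothetical, and average linkage is an arbitrary choice for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Hypothetical categorical records, each column encoded as integer category codes.
X_cat = np.array([[0, 1, 2],
                  [0, 1, 2],
                  [0, 1, 0],
                  [3, 2, 1],
                  [3, 2, 1],
                  [3, 0, 1]])

# Hamming distance: fraction of columns in which two records differ.
d_ham = pdist(X_cat, metric='hamming')
Z_cat = linkage(d_ham, method='average')

# Ask for at most two clusters.
labels_cat = fcluster(Z_cat, t=2, criterion='maxclust')
print(labels_cat)
```

The first three records differ from each other in at most one column, while every record in the first half differs from every record in the second half in all three columns, so the two halves separate cleanly.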

In [None]:
# 14. What does a negative Silhouette Score indicate?

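A negative Silhouette Score means that a data point is, on average, closer to the points of a neighboring cluster than to the points of the cluster it was assigned to. This suggests the point may be misassigned, and many negative values indicate poor clustering quality or overlapping clusters.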
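A deliberately mislabeled point shows this directly. In the toy data below (an assumption for illustration), the sample at index 2 sits in the lower-left group but is labeled as part of the far-away cluster.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

X_neg = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
# Index 2 lies next to points 0 and 1 but is deliberately assigned to cluster 1.
bad_labels = np.array([0, 0, 1, 1, 1, 1])

s_neg = silhouette_samples(X_neg, bad_labels)
print(s_neg[2])  # negative: the point is closer to cluster 0 than to its own cluster
```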

In [None]:
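# 15. Explain the term "linkage criteria" in hierarchical clustering.

Linkage criteria determine how the distance between two clusters is calculated when deciding which clusters to merge. Common types include:

Single linkage: the shortest distance between any pair of points, one from each cluster

Complete linkage: the longest distance between any pair of points, one from each cluster

Average linkage: the average distance over all such pairs

Ward’s method: merges the pair of clusters that gives the smallest increase in total within-cluster variance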
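The four criteria can be computed by hand for two tiny clusters. The points below are toy data, and the Ward line uses the standard expression for the increase in within-cluster sum of squares caused by a merge.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two toy clusters on a line.
A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[4.0, 0.0], [6.0, 0.0]])

D = cdist(A, B)          # all cross-cluster distances: [[4, 6], [3, 5]]
single = D.min()         # 3.0
complete = D.max()       # 6.0
average = D.mean()       # 4.5

# Ward: increase in within-cluster sum of squares if A and B were merged.
nA, nB = len(A), len(B)
gap = A.mean(axis=0) - B.mean(axis=0)
ward_increase = nA * nB / (nA + nB) * np.sum(gap ** 2)  # 20.25

print(single, complete, average, ward_increase)
```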

In [None]:
# 16. Why might K-Means clustering perform poorly on data with varying cluster sizes or densities?

K-Means assumes equal-sized, spherical clusters and is sensitive to outliers. It performs poorly when clusters have:

Different sizes or densities

Non-spherical shapes

Presence of noise or outliers

In [None]:
# 17. What are the core parameters in DBSCAN, and how do they influence clustering?

eps (epsilon): the maximum distance between two samples to be considered neighbors

minPts: the minimum number of points required to form a dense region

These control cluster compactness and minimum size, affecting the number and shape of detected clusters.
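The influence of eps can be seen by running DBSCAN with a very small, a moderate, and a very large radius on the same data. scikit-learn names the minPts parameter `min_samples`. The helper `summarize` and the chosen eps values are assumptions for the example.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X_m, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

def summarize(eps, min_samples=5):
    """Return (number of clusters excluding noise, number of noise points)."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X_m)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int((labels == -1).sum())
    return n_clusters, n_noise

for eps in (0.005, 0.2, 3.0):
    print(eps, summarize(eps))
```

With a tiny eps no point has enough neighbors, so everything is noise. With a very large eps all points are density-connected and collapse into a single cluster. Useful values lie in between.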

In [None]:
# 18. How does K-Means++ improve upon standard K-Means initialization?

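K-Means++ spreads the initial centroids out: the first is chosen uniformly at random from the data, and each subsequent one is sampled with probability proportional to its squared distance from the nearest centroid already chosen. This reduces the chance of converging to a poor local minimum and typically speeds up convergence, although it does not guarantee the global optimum.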
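A minimal sketch of this kind of seeding, assuming the basic rule of sampling each new centroid with probability proportional to its squared distance from the nearest centroid chosen so far. scikit-learn's implementation differs in details, and the function name `kmeans_pp_init` and the demo data are hypothetical.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Pick k initial centroids with distance-squared weighted sampling."""
    rng = np.random.default_rng(seed)
    # First centroid: a data point chosen uniformly at random.
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        C = np.array(centers)
        # Squared distance from every point to its nearest chosen centroid.
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        # Far-away points are more likely to be picked; chosen points have weight 0.
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)

# Demo data: three separated groups.
rng = np.random.default_rng(1)
X_pp = np.vstack([rng.normal(c, 0.3, (40, 2)) for c in (0.0, 5.0, 10.0)])
seeds = kmeans_pp_init(X_pp, k=3)
print(seeds)
```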

In [None]:
# 19. What is agglomerative clustering?

Agglomerative clustering is a type of hierarchical clustering that starts with each data point as an individual cluster and merges the closest clusters step by step until only one cluster remains or a stopping criterion is met.
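The bottom-up merging can be sketched naively as follows. This is for illustration only: it uses single linkage, stops at a target number of clusters, and is far slower than library implementations. The function name `agglomerative_sketch` and the toy points are assumptions.

```python
import numpy as np

def agglomerative_sketch(X, n_clusters):
    """Naive bottom-up clustering with single linkage."""
    # Start with every point in its own cluster (stored as lists of indices).
    clusters = [[i] for i in range(len(X))]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    while len(clusters) > n_clusters:
        best = (np.inf, None, None)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: closest pair of points across the two clusters.
                d = D[np.ix_(clusters[a], clusters[b])].min()
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Merge the closest pair of clusters.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

pts = np.array([[0.0], [0.5], [4.0], [4.4], [9.0]])
print(agglomerative_sketch(pts, n_clusters=3))  # [[0, 1], [2, 3], [4]]
```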

In [None]:
# 20. What makes Silhouette Score a better metric than just inertia for model evaluation?

Inertia only considers intra-cluster distance, not how well-separated clusters are. Silhouette Score evaluates both:

Cohesion (intra-cluster)

Separation (inter-cluster)

This makes it a more balanced and interpretable metric for assessing clustering performance.
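The difference shows up when both metrics are tracked over several values of k. In this sketch the four blob centers are hand-picked (an assumption) so that the true structure is unambiguous. Inertia keeps shrinking as k grows, while the silhouette score peaks at the true number of clusters.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Four clearly separated blobs at hand-picked centers.
centers = [[0, 0], [10, 0], [0, 10], [10, 10]]
X_cmp, _ = make_blobs(n_samples=400, centers=centers, cluster_std=0.5, random_state=0)

ks = list(range(2, 7))
inertias, silhouettes = [], []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_cmp)
    inertias.append(km.inertia_)
    silhouettes.append(silhouette_score(X_cmp, km.labels_))

best_k = ks[int(np.argmax(silhouettes))]
for k, i, s in zip(ks, inertias, silhouettes):
    print(k, round(i, 1), round(s, 3))
print("k with the highest silhouette:", best_k)
```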

In [None]:
# Practical Questions

# 21. Generate synthetic data with 4 centers using make_blobs and apply K-Means clustering. Visualize using a scatter plot
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
kmeans = KMeans(n_clusters=4, random_state=0)
y_kmeans = kmeans.fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=200, c='red', marker='X')
plt.show()


# 22. Load the Iris dataset and use Agglomerative Clustering to group the data into 3 clusters. Display the first 10 predicted labels
from sklearn.datasets import load_iris
from sklearn.cluster import AgglomerativeClustering

iris = load_iris()
X = iris.data
agg = AgglomerativeClustering(n_clusters=3)
labels = agg.fit_predict(X)
print(labels[:10])


# 23. Generate synthetic data using make_moons and apply DBSCAN. Highlight outliers in the plot
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

X, y = make_moons(n_samples=300, noise=0.05, random_state=42)
db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_

plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='Paired', s=50)
plt.scatter(X[labels == -1, 0], X[labels == -1, 1], color='red', label='Outliers')
plt.legend()
plt.show()


# 24. Load the Wine dataset and apply K-Means clustering after standardizing the features. Print the size of each cluster
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X = StandardScaler().fit_transform(wine.data)
km = KMeans(n_clusters=3, random_state=42)
labels = km.fit_predict(X)

import numpy as np
print(np.bincount(labels))


# 25. Use make_circles to generate synthetic data and cluster it using DBSCAN. Plot the result
from sklearn.datasets import make_circles

X, y = make_circles(n_samples=300, factor=0.5, noise=0.05, random_state=42)  # fixed seed for reproducibility
db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_

plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='Paired')
plt.show()


# 26. Load the Breast Cancer dataset, apply MinMaxScaler, and use K-Means with 2 clusters. Output the cluster centroids
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import MinMaxScaler

data = load_breast_cancer()
X = MinMaxScaler().fit_transform(data.data)
km = KMeans(n_clusters=2, random_state=42)
km.fit(X)
print(km.cluster_centers_)


# 27. Generate synthetic data using make_blobs with varying cluster standard deviations and cluster with DBSCAN
X, y = make_blobs(n_samples=300, centers=3, cluster_std=[1.0, 2.5, 0.5], random_state=42)
db = DBSCAN(eps=1.5, min_samples=5).fit(X)
labels = db.labels_

plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='rainbow')
plt.show()


# 28. Load the Digits dataset, reduce it to 2D using PCA, and visualize clusters from K-Means
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()
X = PCA(n_components=2).fit_transform(digits.data)
kmeans = KMeans(n_clusters=10, random_state=42).fit(X)
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, cmap='tab10')
plt.show()


# 29. Create synthetic data using make_blobs and evaluate silhouette scores for k = 2 to 5. Display as a bar chart
from sklearn.metrics import silhouette_score

X, y = make_blobs(n_samples=500, centers=4, random_state=42)
scores = []
for k in range(2, 6):
    km = KMeans(n_clusters=k, random_state=42).fit(X)
    scores.append(silhouette_score(X, km.labels_))

plt.bar(range(2, 6), scores)
plt.xlabel('Number of clusters (k)')
plt.ylabel('Silhouette Score')
plt.show()



# 30. Load the Iris dataset and use hierarchical clustering to group data. Plot a dendrogram with average linkage
from scipy.cluster.hierarchy import dendrogram, linkage

Z = linkage(load_iris().data, method='average')
dendrogram(Z)
plt.show()


# 31. Generate synthetic data with overlapping clusters using make_blobs, then apply K-Means and visualize with decision boundaries
from matplotlib.colors import ListedColormap
import numpy as np

X, y = make_blobs(n_samples=300, centers=3, cluster_std=2.5, random_state=42)
km = KMeans(n_clusters=3, random_state=42).fit(X)  # fixed seed for reproducibility

x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                     np.arange(y_min, y_max, 0.1))
Z = km.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.contourf(xx, yy, Z, cmap=ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF']), alpha=0.4)
plt.scatter(X[:, 0], X[:, 1], c=km.labels_, cmap='viridis')
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1], c='black', marker='X')
plt.show()

# 32. Load the Digits dataset and apply DBSCAN after reducing dimensions with t-SNE. Visualize the results
from sklearn.manifold import TSNE

X = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(load_digits().data)
labels = DBSCAN(eps=5, min_samples=5).fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='tab10')
plt.show()


# 33. Generate synthetic data using make_blobs and apply Agglomerative Clustering with complete linkage. Plot the result
X, y = make_blobs(n_samples=300, centers=3, random_state=42)
agg = AgglomerativeClustering(n_clusters=3, linkage='complete')
labels = agg.fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.show()


# 34. Load the Breast Cancer dataset and compare inertia values for K = 2 to 6 using K-Means. Show results in a line plot
X = StandardScaler().fit_transform(load_breast_cancer().data)
inertias = []
for k in range(2, 7):
    km = KMeans(n_clusters=k, random_state=42).fit(X)
    inertias.append(km.inertia_)

plt.plot(range(2, 7), inertias, marker='o')
plt.xlabel("Number of Clusters (k)")
plt.ylabel("Inertia")
plt.show()


# 35. Generate synthetic concentric circles using make_circles and cluster using Agglomerative Clustering with single linkage
X, y = make_circles(n_samples=300, factor=0.5, noise=0.05, random_state=42)  # fixed seed for reproducibility
agg = AgglomerativeClustering(n_clusters=2, linkage='single')
labels = agg.fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='coolwarm')
plt.show()


# 36. Use the Wine dataset, apply DBSCAN after scaling the data, and count the number of clusters (excluding noise)
X = StandardScaler().fit_transform(load_wine().data)
labels = DBSCAN(eps=1.5, min_samples=5).fit_predict(X)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("Number of clusters (excluding noise):", n_clusters)


# 37. Generate synthetic data with make_blobs and apply KMeans. Then plot the cluster centers on top of the data points
X, y = make_blobs(n_samples=300, centers=3, random_state=42)
km = KMeans(n_clusters=3, random_state=42).fit(X)

plt.scatter(X[:, 0], X[:, 1], c=km.labels_, cmap='viridis')
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1], c='red', marker='X', s=200)
plt.show()


# 38. Load the Iris dataset, cluster with DBSCAN, and print how many samples were identified as noise
X = load_iris().data
labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)
n_noise = list(labels).count(-1)
print("Number of noise samples:", n_noise)


# 39. Generate synthetic non-linearly separable data using make_moons, apply K-Means, and visualize the clustering result
X, y = make_moons(n_samples=300, noise=0.05, random_state=42)
km = KMeans(n_clusters=2, random_state=42).fit(X)

plt.scatter(X[:, 0], X[:, 1], c=km.labels_, cmap='coolwarm')
plt.show()


# 40. Load the Digits dataset, apply PCA to reduce to 3 components, then use KMeans and visualize with a 3D scatter plot
from mpl_toolkits.mplot3d import Axes3D  # only needed to register the '3d' projection on older Matplotlib versions

X = PCA(n_components=3).fit_transform(load_digits().data)
km = KMeans(n_clusters=10, random_state=42).fit(X)

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=km.labels_, cmap='tab10')
plt.show()


# 41. Generate synthetic blobs with 5 centers and apply KMeans. Then use silhouette_score to evaluate the clustering
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, y = make_blobs(n_samples=500, centers=5, random_state=42)
kmeans = KMeans(n_clusters=5, random_state=42)
labels = kmeans.fit_predict(X)

score = silhouette_score(X, labels)
print("Silhouette Score:", score)


# 42. Load the Breast Cancer dataset, reduce dimensionality using PCA, and apply Agglomerative Clustering. Visualize in 2D
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt

data = load_breast_cancer()
X = PCA(n_components=2).fit_transform(data.data)
agg = AgglomerativeClustering(n_clusters=2)
labels = agg.fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.title('Agglomerative Clustering on Breast Cancer (PCA-reduced)')
plt.show()


# 43. Generate noisy circular data using make_circles and visualize clustering results from KMeans and DBSCAN side-by-side
from sklearn.datasets import make_circles
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt

X, y = make_circles(n_samples=500, factor=0.5, noise=0.05, random_state=42)  # fixed seed for reproducibility
kmeans = KMeans(n_clusters=2, random_state=42).fit(X)
dbscan = DBSCAN(eps=0.2, min_samples=5).fit(X)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
ax1.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, cmap='coolwarm')
ax1.set_title('KMeans')
ax2.scatter(X[:, 0], X[:, 1], c=dbscan.labels_, cmap='coolwarm')
ax2.set_title('DBSCAN')
plt.show()


# 44. Load the Iris dataset and plot the Silhouette Coefficient for each sample after KMeans clustering
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_samples
import matplotlib.pyplot as plt

iris = load_iris()
X = iris.data
kmeans = KMeans(n_clusters=3, random_state=42).fit(X)
labels = kmeans.labels_
silhouette_vals = silhouette_samples(X, labels)

plt.bar(range(len(X)), silhouette_vals, color='teal')
plt.xlabel('Sample index')
plt.ylabel('Silhouette Coefficient')
plt.title('Silhouette Coefficient per sample - Iris')
plt.show()


# 45. Generate synthetic data using make_blobs and apply Agglomerative Clustering with 'average' linkage. Visualize clusters
X, y = make_blobs(n_samples=400, centers=4, random_state=42)
agg = AgglomerativeClustering(n_clusters=4, linkage='average')
labels = agg.fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='tab10')
plt.title("Agglomerative Clustering (average linkage)")
plt.show()


# 46. Load the Wine dataset, apply KMeans, and visualize the cluster assignments in a seaborn pairplot (first 4 features)
import seaborn as sns
import pandas as pd
from sklearn.datasets import load_wine

wine = load_wine()
X = pd.DataFrame(wine.data[:, :4], columns=wine.feature_names[:4])
labels = KMeans(n_clusters=3, random_state=42).fit_predict(wine.data)

X['cluster'] = labels
sns.pairplot(X, hue='cluster', palette='tab10')
plt.show()


# 47. Generate noisy blobs using make_blobs and use DBSCAN to identify both clusters and noise points. Print the count
X, y = make_blobs(n_samples=500, centers=4, cluster_std=1.5, random_state=42)
db = DBSCAN(eps=1.2, min_samples=5).fit(X)
labels = db.labels_

# Noise is labeled -1 and must not be counted as a cluster
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = list(labels).count(-1)

print("Clusters found:", n_clusters)
print("Noise points:", n_noise)


# 48. Load the Digits dataset, reduce dimensions using t-SNE, then apply Agglomerative Clustering and plot the clusters
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

digits = load_digits()
X_tsne = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(digits.data)
agg = AgglomerativeClustering(n_clusters=10).fit(X_tsne)

plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=agg.labels_, cmap='tab10')
plt.title("Agglomerative Clustering on Digits (t-SNE reduced)")
plt.show()
