**1. What is unsupervised learning in the context of machine learning?**

**Answer:** Unsupervised learning is a type of machine learning where the model is not provided with labeled data. Instead, it tries to identify hidden patterns or intrinsic structures in the input data. Clustering is a common technique used in unsupervised learning.

**2. How does the K-Means clustering algorithm work?**

**Answer:** K-Means initializes 'k' centroids, assigns each data point to its nearest centroid, and then recomputes each centroid as the mean of the points assigned to it. The assignment and update steps repeat until convergence, that is, until the assignments (and therefore the centroids) stop changing or an iteration limit is reached.
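The assign-and-update loop can be sketched in a few lines of NumPy. This is a minimal illustration, not scikit-learn's implementation: the function name `kmeans`, the toy data, and the choice to pass the starting centroids in explicitly (so the run is deterministic) are all assumptions made for the example.

```python
import numpy as np

def kmeans(X, init_centroids, max_iter=100):
    """Plain assign/update loop: repeat until the centroids stop moving."""
    centroids = np.asarray(init_centroids, dtype=float).copy()
    for _ in range(max_iter):
        # Assignment step: distance from every point to every centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        # An empty cluster simply keeps its previous centroid.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(len(centroids))
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged
        centroids = new_centroids
    return labels, centroids

# Toy data: two well-separated groups of three points each.
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 0.0],
              [8.0, 8.0], [9.0, 9.0], [8.0, 9.0]])
labels, centroids = kmeans(X, init_centroids=X[[0, 3]])
print(labels)
print(centroids)
```

Real implementations add refinements such as multiple restarts and smarter initialization, which this sketch leaves out.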

**3. Explain the concept of a dendrogram in hierarchical clustering**

**Answer:** A dendrogram is a tree-like diagram that records the sequences of merges or splits in hierarchical clustering. It helps visualize the arrangement of the clusters formed.

**4. What is the main difference between K-Means and Hierarchical Clustering?**

**Answer:** K-Means requires a predefined number of clusters and partitions the data into flat clusters, while Hierarchical Clustering builds a multilevel hierarchy of clusters without needing the number of clusters in advance.

**5. What are the advantages of DBSCAN over K-Means?**

**Answer:** DBSCAN can find clusters of arbitrary shape and does not require the number of clusters to be specified. It can also identify noise points effectively, unlike K-Means.

**6. When would you use Silhouette Score in clustering?**

**Answer:** Silhouette Score measures how similar a point is to its own cluster compared to other clusters, on a scale from -1 to 1 where higher is better. It is used to evaluate clustering performance without ground truth labels, for example when comparing candidate values of 'k'.
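For a single sample, the silhouette coefficient is s = (b - a) / max(a, b), where a is the mean distance to the other members of its own cluster and b is the mean distance to the members of the nearest other cluster. The sketch below computes it by hand on a made-up 1D dataset; `silhouette_of_point` is a hypothetical helper written for this example, and returning 0 for a single-member cluster is the convention assumed here.

```python
import numpy as np

def silhouette_of_point(X, labels, i):
    """Silhouette coefficient s = (b - a) / max(a, b) for sample i."""
    dists = np.linalg.norm(X - X[i], axis=1)
    own = labels == labels[i]
    own[i] = False  # exclude the point itself from its own cluster
    if not own.any():
        return 0.0  # assumed convention for a single-member cluster
    a = dists[own].mean()  # cohesion: mean distance within own cluster
    # Separation: smallest mean distance to any other cluster.
    b = min(dists[labels == c].mean() for c in np.unique(labels) if c != labels[i])
    return (b - a) / max(a, b)

X = np.array([[0.0], [1.0], [10.0], [11.0]])

# Sensible grouping: the point at 0 sits close to its cluster mate at 1.
s_good = silhouette_of_point(X, np.array([0, 0, 1, 1]), 0)

# Poor grouping: the point at 1 is assigned to the far cluster {10, 11}.
s_bad = silhouette_of_point(X, np.array([0, 1, 1, 1]), 1)

print(s_good, s_bad)
```

The second value is negative because the point is closer to the neighboring cluster than to the one it was assigned to.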

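**7. What are the limitations of Hierarchical Clustering?**

**Answer:** Hierarchical Clustering can be computationally expensive for large datasets, since standard agglomerative implementations need the full pairwise distance matrix (quadratic memory in the number of samples). It is also sensitive to noise and outliers, and it is greedy: once two clusters have been merged (or split), that decision is never revisited.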

**8. Why is feature scaling important in clustering algorithms like K-Means?**

**Answer:** Feature scaling ensures that all features contribute equally to the distance calculations used in clustering algorithms like K-Means. Without scaling, features with larger ranges can dominate.
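A small numeric example makes the effect visible. The data below is hypothetical (an income column in dollars and an age column in years), and the standardization is written out by hand as z-scores instead of calling a library scaler.

```python
import numpy as np

# Hypothetical data: column 0 is income in dollars, column 1 is age in years.
X = np.array([[30_000.0, 25.0],
              [30_500.0, 60.0],
              [60_000.0, 26.0]])

def dist(A, i, j):
    """Euclidean distance between rows i and j of A."""
    return np.linalg.norm(A[i] - A[j])

# Standardize each column to zero mean and unit variance.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# Raw distances: the income column dominates, so the 35-year age gap
# between rows 0 and 1 barely registers.
print(dist(X, 0, 1), dist(X, 0, 2))

# Scaled distances: the age gap and the income gap now carry similar weight.
print(dist(X_scaled, 0, 1), dist(X_scaled, 0, 2))
```

Before scaling, row 0 looks far closer to row 1 than to row 2 purely because income is measured in larger units; after scaling the two distances are of comparable size.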

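**9. How does DBSCAN identify noise points?**

**Answer:** DBSCAN labels a point as noise when it is neither a core point nor within `eps` of a core point. A core point has at least `min_samples` points within distance `eps` (in scikit-learn the count includes the point itself). Noise points therefore lie outside every dense region, and scikit-learn gives them the label -1.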
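The sketch below applies the usual DBSCAN definitions directly: a core point has at least `min_samples` points within `eps` (counting itself, as scikit-learn does), a border point is a non-core point within `eps` of some core point, and everything else is noise. `classify_points` and the toy data are assumptions made for illustration; the sketch only labels point types and does not build the clusters themselves.

```python
import numpy as np

def classify_points(X, eps, min_samples):
    """Label each point 'core', 'border', or 'noise' using DBSCAN's definitions."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = dists <= eps  # each point counts as its own neighbor
    is_core = neighbors.sum(axis=1) >= min_samples
    # A border point is not core but lies within eps of at least one core point.
    near_core = (neighbors & is_core[None, :]).any(axis=1)
    return np.where(is_core, "core", np.where(near_core, "border", "noise"))

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],  # dense square
              [2.2, 1.0],                                      # on the fringe
              [8.0, 8.0]])                                     # isolated
kinds = classify_points(X, eps=1.5, min_samples=4)
print(kinds)
```

The fringe point has too few neighbors to be core but is reachable from a core point, so it is a border point; the isolated point is noise.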

**10. Define inertia in the context of K-Means**

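**Answer:** Inertia is the sum of squared distances between each point and its assigned centroid in K-Means. For a fixed 'k', lower inertia indicates tighter clusters. Inertia keeps shrinking as 'k' grows (it reaches zero when every point is its own cluster), so it cannot be used on its own to compare different values of 'k'.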
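The definition translates directly into one line of NumPy. The helper name `inertia` and the four-point dataset are made up for this example; the three calls show the value falling as the number of clusters rises, down to zero when each point is its own centroid.

```python
import numpy as np

def inertia(X, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return float(((X - centroids[labels]) ** 2).sum())

X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

# k = 1: a single centroid at the overall mean.
inertia_k1 = inertia(X, np.array([0, 0, 0, 0]), X.mean(axis=0, keepdims=True))

# k = 2: one centroid per left/right pair.
inertia_k2 = inertia(X, np.array([0, 0, 1, 1]),
                     np.array([[0.0, 1.0], [10.0, 1.0]]))

# k = 4: every point is its own centroid.
inertia_k4 = inertia(X, np.arange(4), X)

print(inertia_k1, inertia_k2, inertia_k4)
```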

**11. What is the elbow method in K-Means clustering?**

**Answer:** The elbow method plots inertia against the number of clusters to find the optimal 'k'. The point where the rate of decrease sharply changes (the elbow) is considered optimal.

**12. Describe the concept of "density" in DBSCAN**

**Answer:** In DBSCAN, density refers to the number of points within a specified radius. Dense regions are considered clusters, and sparse regions are treated as noise.

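**13. Can hierarchical clustering be used on categorical data?**

**Answer:** Yes, but it requires a dissimilarity measure suited to categories, such as Hamming distance, instead of the Euclidean distance that is the usual default. Linkage methods that work from pairwise distances (single, complete, average) accept such a measure, whereas Ward linkage assumes Euclidean distance.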
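As a rough illustration, the normalized Hamming distance between two categorical records is the fraction of attributes on which they disagree. The records and the `hamming` helper below are invented for this example; the resulting pairwise matrix is the kind of input a hierarchical clustering routine that accepts precomputed distances could work from.

```python
def hamming(a, b):
    """Fraction of positions at which two equal-length records differ."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of attributes")
    return sum(x != y for x, y in zip(a, b)) / len(a)

# Hypothetical categorical records: (color, size, shape).
records = [
    ("red", "small", "round"),
    ("red", "small", "square"),
    ("blue", "large", "square"),
]

# Pairwise distance matrix between all records.
matrix = [[hamming(a, b) for b in records] for a in records]
for row in matrix:
    print(row)
```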

**14. What does a negative Silhouette Score indicate?**

**Answer:** A negative Silhouette Score indicates that the sample is closer to a neighboring cluster than to the cluster it is assigned to, suggesting poor clustering.

**15. Explain the term "linkage criteria" in hierarchical clustering**

**Answer:** Linkage criteria determine how the distance between clusters is calculated in hierarchical clustering (e.g., single, complete, average linkage).
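The three criteria named above differ only in how they summarize the pairwise distances between the members of two clusters. The sketch below assumes Euclidean distance between points; `cluster_distance` and the two small clusters are hypothetical names and data for this example.

```python
import numpy as np

def cluster_distance(A, B, linkage):
    """Distance between clusters A and B under the given linkage criterion."""
    # All pairwise Euclidean distances between a point in A and a point in B.
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    if linkage == "single":
        return float(d.min())   # closest pair
    if linkage == "complete":
        return float(d.max())   # farthest pair
    if linkage == "average":
        return float(d.mean())  # mean over all pairs
    raise ValueError(f"unknown linkage: {linkage}")

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[3.0, 0.0], [5.0, 0.0]])

for name in ("single", "complete", "average"):
    print(name, cluster_distance(A, B, name))
```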

**16. Why might K-Means clustering perform poorly on data with varying cluster sizes or densities?**

**Answer:** K-Means assumes clusters are spherical and similar in size. It struggles with clusters of different sizes, densities, or non-globular shapes.

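**17. What are the core parameters in DBSCAN, and how do they influence clustering?**

**Answer:** The main parameters in DBSCAN are `eps` (the radius of the neighborhood) and `min_samples` (minimum number of points to form a dense region). A larger `eps` merges nearby clusters and leaves fewer noise points, while a smaller `eps` fragments clusters and marks more points as noise. A larger `min_samples` demands denser regions, so sparse groups are more likely to be labeled noise.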

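**18. How does K-Means++ improve upon standard K-Means initialization?**

**Answer:** K-Means++ picks the first centroid uniformly at random from the data, then picks each subsequent centroid with probability proportional to its squared distance from the nearest centroid already chosen. This spreads the initial centroids out, which typically leads to better and faster convergence than purely random initialization.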
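One common way to spread the initial centroids out is to sample each new centroid with probability proportional to its squared distance from the nearest centroid already chosen. The sketch below follows that scheme; `kmeans_pp_init` is a hypothetical helper, and it assumes the data contains at least `k` distinct points so the sampling weights never sum to zero.

```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    """Pick k initial centroids from the rows of X, favoring far-apart points."""
    n = len(X)
    centroids = [X[rng.integers(n)]]  # first centroid: uniform at random
    for _ in range(k - 1):
        chosen = np.array(centroids)
        # Squared distance from each point to its nearest chosen centroid.
        d2 = ((X[:, None, :] - chosen[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        # Sample the next centroid with probability proportional to d2,
        # so points already chosen (d2 == 0) cannot be picked again.
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(n, p=probs)])
    return np.array(centroids)

X = np.array([[0.0, 0.0], [0.0, 0.1], [10.0, 10.0], [10.0, 10.1]])
init = kmeans_pp_init(X, k=2, rng=np.random.default_rng(0))
print(init)
```

With two tight groups this far apart, the second centroid almost always lands in the group the first one missed, which is the behavior the seeding is designed to encourage.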

**19. What is agglomerative clustering?**

**Answer:** Agglomerative clustering is a type of hierarchical clustering that starts with each point as its own cluster and merges the closest pairs of clusters iteratively.
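The bottom-up merging can be written as a deliberately naive loop. This sketch assumes single linkage and Euclidean distance, and it recomputes every cluster-to-cluster distance on each pass, so it is only suitable for tiny inputs; `agglomerate` is a hypothetical name. The recorded merge history (which clusters joined, and at what distance) is the information a dendrogram draws.

```python
import numpy as np

def agglomerate(X, n_clusters):
    """Naive single-linkage agglomerative clustering.

    Returns the final clusters (lists of row indices) and the merge history.
    """
    clusters = [[i] for i in range(len(X))]  # start: every point on its own
    merges = []
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], float(d)))
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]
    return clusters, merges

X = np.array([[0.0], [1.0], [5.0], [7.0]])
clusters, merges = agglomerate(X, n_clusters=2)
print(clusters)
for left, right, height in merges:
    print(left, "+", right, "at distance", height)
```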

**20. What makes Silhouette Score a better metric than just inertia for model evaluation?**

**Answer:** Silhouette Score considers intra-cluster cohesion and inter-cluster separation, offering a more comprehensive evaluation than inertia alone.

**Practical Questions:**

**1. Generate synthetic data with 4 centers using make_blobs and apply K-Means clustering. Visualize using a scatter plot**

In [None]:
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
kmeans = KMeans(n_clusters=4)
kmeans.fit(X)

plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=200, c='red')
plt.title('KMeans Clustering with 4 Centers')
plt.show()

**2. Load the Iris dataset and use Agglomerative Clustering to group the data into 3 clusters. Display the first 10 predicted labels**

In [None]:
from sklearn.datasets import load_iris
from sklearn.cluster import AgglomerativeClustering

iris = load_iris()
X = iris.data

agg = AgglomerativeClustering(n_clusters=3)
labels = agg.fit_predict(X)

print("First 10 predicted labels:", labels[:10])

**3. Generate synthetic data using make_moons and apply DBSCAN. Highlight outliers in the plot**

In [None]:
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt

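X, _ = make_moons(n_samples=300, noise=0.1)
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

# DBSCAN labels noise points as -1; plot them separately so they stand out.
noise = db.labels_ == -1
plt.scatter(X[~noise, 0], X[~noise, 1], c=db.labels_[~noise], cmap='Paired')
plt.scatter(X[noise, 0], X[noise, 1], c='red', marker='x', label='Outliers (noise)')
plt.legend()
plt.title('DBSCAN on make_moons data')
plt.show()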

**4. Load the Wine dataset and apply K-Means clustering after standardizing the features. Print the size of each cluster**

In [None]:
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
import numpy as np

data = load_wine()
X = StandardScaler().fit_transform(data.data)

kmeans = KMeans(n_clusters=3, random_state=0).fit(X)
labels, counts = np.unique(kmeans.labels_, return_counts=True)
print("Cluster sizes:", dict(zip(labels, counts)))

**5. Use make_circles to generate synthetic data and cluster it using DBSCAN. Plot the result**

In [None]:
from sklearn.datasets import make_circles

X, _ = make_circles(n_samples=300, factor=0.5, noise=0.05)
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

plt.scatter(X[:, 0], X[:, 1], c=db.labels_, cmap='plasma')
plt.title('DBSCAN on make_circles data')
plt.show()

**6. Load the Breast Cancer dataset, apply MinMaxScaler, and use K-Means with 2 clusters. Output the cluster centroids**

In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import MinMaxScaler

data = load_breast_cancer()
X = MinMaxScaler().fit_transform(data.data)

kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
print("Cluster centroids:", kmeans.cluster_centers_)

**7. Generate synthetic data using make_blobs with varying cluster standard deviations and cluster with DBSCAN**

In [None]:
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=[1.0, 2.5, 0.5], random_state=0)
db = DBSCAN(eps=0.9, min_samples=5).fit(X)

plt.scatter(X[:, 0], X[:, 1], c=db.labels_, cmap='tab10')
plt.title('DBSCAN with varying cluster std')
plt.show()

**8. Load the Digits dataset, reduce it to 2D using PCA, and visualize clusters from K-Means**

In [None]:
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()
X_pca = PCA(n_components=2).fit_transform(digits.data)

kmeans = KMeans(n_clusters=10, random_state=0).fit(X_pca)

plt.scatter(X_pca[:, 0], X_pca[:, 1], c=kmeans.labels_, cmap='tab10')
plt.title('KMeans on Digits PCA')
plt.show()

**9. Create synthetic data using make_blobs and evaluate silhouette scores for k = 2 to 5. Display as a bar chart**

In [None]:
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.5, random_state=0)
scores = []
for k in range(2, 6):
    kmeans = KMeans(n_clusters=k).fit(X)
    score = silhouette_score(X, kmeans.labels_)
    scores.append(score)

plt.bar(range(2, 6), scores)
plt.xlabel("Number of clusters")
plt.ylabel("Silhouette Score")
plt.title("Silhouette Scores for k = 2 to 5")
plt.show()

**10. Load the Iris dataset and use hierarchical clustering to group data. Plot a dendrogram with average linkage**

In [None]:
from scipy.cluster.hierarchy import dendrogram, linkage

linked = linkage(iris.data, method='average')

plt.figure(figsize=(10, 7))
dendrogram(linked, labels=iris.target)
plt.title("Hierarchical Clustering Dendrogram (Average Linkage)")
plt.show()

**11. Generate synthetic data with overlapping clusters using make_blobs, then apply K-Means and visualize with decision boundaries**

In [None]:
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)
kmeans = KMeans(n_clusters=3).fit(X)

x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                     np.arange(y_min, y_max, 0.02))
Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, s=30, cmap='viridis')
plt.title("KMeans Decision Boundaries")
plt.show()

**12. Load the Digits dataset and apply DBSCAN after reducing dimensions with t-SNE. Visualize the results**

In [None]:
from sklearn.manifold import TSNE

X_tsne = TSNE(n_components=2, random_state=42).fit_transform(digits.data)
db = DBSCAN(eps=5, min_samples=5).fit(X_tsne)

plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=db.labels_, cmap='tab10')
plt.title("DBSCAN on Digits (t-SNE reduced)")
plt.show()

**13. Generate synthetic data using make_blobs and apply Agglomerative Clustering with complete linkage. Plot the result**

In [None]:
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)
agg = AgglomerativeClustering(n_clusters=3, linkage='complete')
labels = agg.fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='tab10')
plt.title("Agglomerative Clustering with Complete Linkage")
plt.show()

**14. Load the Breast Cancer dataset and compare inertia values for K = 2 to 6 using K-Means. Show results in a line plot**

In [None]:
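# X still holds the make_blobs data from the previous cell,
# so reload and rescale the Breast Cancer features first.
X = MinMaxScaler().fit_transform(load_breast_cancer().data)

inertias = []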
for k in range(2, 7):
    kmeans = KMeans(n_clusters=k, random_state=0).fit(X)
    inertias.append(kmeans.inertia_)

plt.plot(range(2, 7), inertias, marker='o')
plt.xlabel("Number of Clusters")
plt.ylabel("Inertia")
plt.title("Inertia for KMeans (K=2 to 6)")
plt.show()

**15. Generate synthetic concentric circles using make_circles and cluster using Agglomerative Clustering with single linkage**

In [None]:
X, _ = make_circles(n_samples=300, noise=0.05, factor=0.5)
agg = AgglomerativeClustering(n_clusters=2, linkage='single')
labels = agg.fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='plasma')
plt.title("Agglomerative Clustering on make_circles")
plt.show()

**16. Use the Wine dataset, apply DBSCAN after scaling the data, and count the number of clusters (excluding noise)**

In [None]:
from sklearn.preprocessing import StandardScaler

# `data` currently refers to the Breast Cancer dataset, so load Wine explicitly.
X = StandardScaler().fit_transform(load_wine().data)
db = DBSCAN(eps=1.5, min_samples=5).fit(X)
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
print("Number of clusters (excluding noise):", n_clusters)

**17. Generate synthetic data with make_blobs and apply KMeans. Then plot the cluster centers on top of the data points**

In [None]:
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
kmeans = KMeans(n_clusters=4, random_state=0).fit(X)

plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, cmap='Set1')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=200, c='black')
plt.title("Cluster Centers with make_blobs")
plt.show()

**18. Load the Iris dataset, cluster with DBSCAN, and print how many samples were identified as noise**

In [None]:
X = iris.data
db = DBSCAN(eps=0.5, min_samples=5).fit(X)
n_noise = list(db.labels_).count(-1)
print("Number of noise samples identified:", n_noise)

**19. Generate synthetic non-linearly separable data using make_moons, apply K-Means, and visualize the clustering result**

In [None]:
X, _ = make_moons(n_samples=300, noise=0.1)
kmeans = KMeans(n_clusters=2).fit(X)

plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, cmap='coolwarm')
plt.title("KMeans on make_moons (non-linear data)")
plt.show()

**20. Load the Digits dataset, apply PCA to reduce to 3 components, then use KMeans and visualize with a 3D scatter plot**

In [None]:
from mpl_toolkits.mplot3d import Axes3D

X_pca = PCA(n_components=3).fit_transform(digits.data)
kmeans = KMeans(n_clusters=10).fit(X_pca)

fig = plt.figure(figsize=(8,6))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X_pca[:, 0], X_pca[:, 1], X_pca[:, 2], c=kmeans.labels_, cmap='tab10')
plt.title("3D PCA of Digits with KMeans Clustering")
plt.show()

**21. Generate synthetic blobs with 5 centers and apply KMeans. Then use silhouette_score to evaluate the clustering**

In [None]:
X, _ = make_blobs(n_samples=300, centers=5, random_state=42)
kmeans = KMeans(n_clusters=5).fit(X)
score = silhouette_score(X, kmeans.labels_)
print("Silhouette Score for KMeans with 5 centers:", score)

**22. Load the Breast Cancer dataset, reduce dimensionality using PCA, and apply Agglomerative Clustering. Visualize in 2D**

In [None]:
X = StandardScaler().fit_transform(load_breast_cancer().data)
X_pca = PCA(n_components=2).fit_transform(X)

agg = AgglomerativeClustering(n_clusters=2)
labels = agg.fit_predict(X_pca)

plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, cmap='Accent')
plt.title("Agglomerative Clustering on Breast Cancer (PCA Reduced)")
plt.show()

**23. Generate noisy circular data using make_circles and visualize clustering results from KMeans and DBSCAN side-by-side**

In [None]:
X, _ = make_circles(n_samples=300, factor=0.5, noise=0.05)

plt.figure(figsize=(10,4))
plt.subplot(1,2,1)
plt.title("KMeans")
plt.scatter(X[:,0], X[:,1], c=KMeans(n_clusters=2).fit_predict(X), cmap='cool')

plt.subplot(1,2,2)
plt.title("DBSCAN")
plt.scatter(X[:,0], X[:,1], c=DBSCAN(eps=0.2).fit_predict(X), cmap='cool')
plt.tight_layout()
plt.show()

**24. Load the Iris dataset and plot the Silhouette Coefficient for each sample after KMeans clustering**

In [None]:
from sklearn.metrics import silhouette_samples

X = iris.data
kmeans = KMeans(n_clusters=3).fit(X)
sil_values = silhouette_samples(X, kmeans.labels_)

plt.bar(range(len(sil_values)), sil_values)
plt.title("Silhouette Coefficients for Iris Dataset")
plt.xlabel("Sample Index")
plt.ylabel("Silhouette Coefficient")
plt.show()

**25. Generate synthetic data using make_blobs and apply Agglomerative Clustering with 'average' linkage. Visualize clusters**

In [None]:
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6)
agg = AgglomerativeClustering(n_clusters=4, linkage='average')
labels = agg.fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='tab10')
plt.title("Agglomerative Clustering (Average Linkage)")
plt.show()

**26. Load the Wine dataset, apply KMeans, and visualize the cluster assignments in a seaborn pairplot (first 4 features)**

In [None]:
import seaborn as sns
import pandas as pd

wine = load_wine()
X = StandardScaler().fit_transform(wine.data)
# Use the dataset's own feature names so the pairplot axes are readable.
df = pd.DataFrame(X[:, :4], columns=wine.feature_names[:4])
df['cluster'] = KMeans(n_clusters=3).fit_predict(X)

sns.pairplot(df, hue='cluster')
plt.suptitle("Seaborn Pairplot of Clusters (First 4 Features)", y=1.02)
plt.show()

**27. Generate noisy blobs using make_blobs and use DBSCAN to identify both clusters and noise points. Print the count**

In [None]:
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.7)
db = DBSCAN(eps=0.8, min_samples=5).fit(X)

# Label -1 marks noise; every other label is a cluster id.
labels, counts = np.unique(db.labels_, return_counts=True)
print("Cluster counts (including noise):", dict(zip(labels.tolist(), counts.tolist())))

**28. Load the Digits dataset, reduce dimensions using t-SNE, then apply Agglomerative Clustering and plot the clusters**

In [None]:
X_tsne = TSNE(n_components=2).fit_transform(digits.data)
agg = AgglomerativeClustering(n_clusters=10).fit(X_tsne)

plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=agg.labels_, cmap='tab10')
plt.title("Agglomerative Clustering on Digits (t-SNE Reduced)")
plt.show()