---
### THEORY QUESTIONS
---

<div style="font-family: Verdana; font-size: 18px; line-height: 1.6;">

**1. What is unsupervised learning in the context of machine learning?**
Unsupervised learning is a type of machine learning where the model learns patterns from data without labeled outputs. It tries to find the inherent structure, clusters, or features in the input data without guidance.

---

**2. How does K-Means clustering algorithm work?**
K-Means partitions data into K clusters by:

* Initializing K centroids randomly.
* Assigning each data point to the nearest centroid.
* Updating centroids by calculating the mean of assigned points.
* Repeating the assignment and update steps until convergence (centroids no longer change) or until a maximum number of iterations is reached.
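
The steps above can be sketched in plain NumPy. This is a minimal illustration, not scikit-learn's implementation: the function name `kmeans_sketch`, its parameters, and the toy data are assumptions made for the example, and initial centroids are drawn from the data points themselves.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Illustrative K-Means loop (hypothetical helper, not a library API)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated toy groups (made-up data).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centroids = kmeans_sketch(X, k=2)
print("labels:", labels)
print("centroids:\n", centroids)
```

In practice `sklearn.cluster.KMeans` is the better choice, since it adds smarter initialization and multiple restarts.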

---

**3. Explain the concept of a dendrogram in hierarchical clustering.**
A dendrogram is a tree-like diagram that shows the arrangement of clusters formed by hierarchical clustering, illustrating the order and distance at which clusters are merged or split.
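
As a rough illustration of what a dendrogram encodes, the sketch below builds the underlying merge table with SciPy on made-up one-dimensional points and then cuts the tree at a chosen height. The data and the cut height of 2 are assumptions for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Made-up 1-D points: two tight pairs and one distant point.
X = np.array([[0.0], [1.0], [5.0], [6.0], [20.0]])

# Each row of Z records one merge: the two cluster ids, the merge
# distance (the height of the join in the dendrogram), and the new size.
Z = linkage(X, method='single')
print(Z)

# Cutting the tree at height 2 keeps only merges made at distance <= 2.
flat = fcluster(Z, t=2, criterion='distance')
print("Flat cluster labels:", flat)
```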

---

**4. What is the main difference between K-Means and Hierarchical Clustering?**
K-Means partitions data into a fixed number of clusters based on centroid distance, while hierarchical clustering builds a tree of clusters either by merging (agglomerative) or splitting (divisive) without specifying the number of clusters upfront.

---

**5. What are the advantages of DBSCAN over K-Means?**

* DBSCAN can find clusters of arbitrary shapes.
* It identifies noise points (outliers).
* It does not require specifying the number of clusters in advance.

---

**6. When would you use Silhouette Score in clustering?**
Use Silhouette Score to evaluate how well clusters are separated and how cohesive they are. It helps determine the quality of clustering and can be used to choose the optimal number of clusters.

---

**7. What are the limitations of Hierarchical Clustering?**

* Computationally expensive for large datasets.
* Sensitive to noise and outliers.
* Cannot easily undo merges or splits (no reassignments).
* Choice of linkage method can greatly affect results.

---

**8. Why is feature scaling important in clustering algorithms like K-Means?**
Because K-Means uses distance metrics (like Euclidean distance), features with larger scales can dominate the distance calculation, leading to biased clusters. Scaling ensures all features contribute equally.
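
A small sketch of the effect, using made-up income and age values (both columns are assumptions for illustration): before scaling, the income column decides which points look close; after standardization, both features carry comparable weight.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: column 0 is income in dollars, column 1 is age in years.
X = np.array([[30_000.0, 25.0],
              [30_500.0, 60.0],
              [60_000.0, 26.0]])

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))

# Raw features: income dominates, so the 35-year age gap barely registers.
raw_01, raw_02 = dist(X[0], X[1]), dist(X[0], X[2])
print("raw distances:   ", raw_01, raw_02)

# Standardized features: each column has zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)
scaled_01, scaled_02 = dist(X_scaled[0], X_scaled[1]), dist(X_scaled[0], X_scaled[2])
print("scaled distances:", scaled_01, scaled_02)
```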

---

**9. How does DBSCAN identify noise points?**
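DBSCAN labels a point as noise when it is neither a core point (a point with at least minPts points within radius eps) nor within eps of any core point. Having too few neighbors is not enough on its own: such a point that lies within eps of a core point is a border point and still joins that cluster. Noise points belong to no cluster; scikit-learn gives them the label -1.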
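
The sketch below separates core, border, and noise points on a tiny made-up dataset. A border point here means a point that is not dense enough to be a core point but lies within `eps` of one, so it still joins a cluster. The coordinates and parameter values are assumptions chosen to produce one of each kind.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.0], [0.3, 0.0], [0.4, 0.0],
              [0.85, 0.0],   # sparse neighborhood, but close to a core point
              [5.0, 0.0]])   # isolated point

db = DBSCAN(eps=0.5, min_samples=4).fit(X)

# core_sample_indices_ lists the points that met the density requirement.
is_core = np.zeros(len(X), dtype=bool)
is_core[db.core_sample_indices_] = True
is_noise = db.labels_ == -1          # scikit-learn marks noise with -1
is_border = ~is_core & ~is_noise     # in a cluster, but not a core point

print("labels:", db.labels_)
print("core:  ", is_core)
print("border:", is_border)
print("noise: ", is_noise)
```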

---

**10. Define inertia in the context of K-Means.**
Inertia is the sum of squared distances between each point and its assigned cluster centroid. It measures how internally coherent the clusters are: the lower the inertia, the tighter the clusters.
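
To make the definition concrete, this sketch recomputes inertia by hand and compares it with the `inertia_` attribute that scikit-learn's `KMeans` exposes. The synthetic data and `n_init=10` are assumptions for the example.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=200, centers=3, random_state=42)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# Look up the centroid of the cluster each point was assigned to,
# then sum the squared distances to those centroids.
assigned_centroids = kmeans.cluster_centers_[kmeans.labels_]
manual_inertia = float(np.sum((X - assigned_centroids) ** 2))

print("manual inertia: ", manual_inertia)
print("kmeans.inertia_:", kmeans.inertia_)
```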

---

**11. What is the elbow method in K-Means clustering?**
It is a technique for choosing the number of clusters: plot inertia against the number of clusters and look for the "elbow", the point beyond which adding more clusters yields only small reductions in inertia.

---

**12. Describe the concept of "density" in DBSCAN.**
Density refers to the number of points within a specified radius (epsilon) around a point. A point is a core point if its neighborhood contains at least a minimum number of points (minPts).
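
The density check can be sketched directly with NumPy. The points, `eps`, and `min_pts` below are made-up values, and the count includes the point itself, which follows the convention scikit-learn uses for `min_samples`.

```python
import numpy as np

# Made-up data: four corners of a unit square, its center, and a far point.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
              [0.5, 0.5], [3.0, 3.0]])
eps, min_pts = 1.0, 4

# Pairwise Euclidean distances between all points.
dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

# Density of a point = number of points within eps (itself included).
neighbor_counts = (dists <= eps).sum(axis=1)
is_core = neighbor_counts >= min_pts

print("neighbor counts:", neighbor_counts)
print("core points:    ", is_core)
```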

---

**13. Can hierarchical clustering be used on categorical data?**
Yes, but it requires a suitable distance metric for categorical data (e.g., Hamming distance). Direct application of Euclidean distance is not appropriate.
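
One possible way to do this with SciPy is to compute Hamming distances first and pass them to `linkage`. The integer-coded records below are made up for illustration, and average linkage is just one reasonable choice.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# Made-up categorical records, each column encoded as integer category codes.
X = np.array([[0, 1, 2],
              [0, 1, 1],
              [2, 0, 0],
              [2, 0, 1]])

# Hamming distance = fraction of columns in which two records differ.
dists = pdist(X, metric='hamming')

# linkage accepts a condensed distance matrix, so no Euclidean
# assumption is made about the category codes.
Z = linkage(dists, method='average')
labels = fcluster(Z, t=2, criterion='maxclust')
print("Cluster labels:", labels)
```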

---

**14. What does a negative Silhouette Score indicate?**
It indicates that data points might be assigned to the wrong clusters, as they are closer to neighboring clusters than their own.
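
For a single point, the silhouette coefficient is commonly defined as `(b - a) / max(a, b)`, where `a` is the mean distance to the other points in its own cluster and `b` is the mean distance to the points of the nearest other cluster. The sketch below applies that definition to a deliberately mislabeled point in made-up data and checks it against `silhouette_samples`; the helper `silhouette_of` is hypothetical.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Made-up 1-D data; the point at 9.0 is deliberately put in the "wrong" cluster.
X = np.array([[0.0], [1.0], [9.0], [10.0], [11.0]])
labels = np.array([0, 0, 0, 1, 1])

def silhouette_of(i):
    """Silhouette coefficient of point i, computed from the definition."""
    d = np.abs(X[:, 0] - X[i, 0])            # distances from point i
    same = labels == labels[i]
    same[i] = False                          # exclude the point itself
    a = d[same].mean()                       # mean distance within own cluster
    b = min(d[labels == c].mean()            # mean distance to nearest other cluster
            for c in np.unique(labels) if c != labels[i])
    return (b - a) / max(a, b)

s2 = silhouette_of(2)
print("by hand:     ", s2)
print("scikit-learn:", silhouette_samples(X, labels)[2])
```

Because the point at 9.0 sits much closer to cluster 1 than to its assigned cluster 0, `a` exceeds `b` and the coefficient comes out negative.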

---

**15. Explain the term "linkage criteria" in hierarchical clustering.**
Linkage criteria define how the distance between two clusters is computed, such as:

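* Single linkage (minimum distance between any two points, one from each cluster)
* Complete linkage (maximum distance between any two points, one from each cluster)
* Average linkage (average of all pairwise distances between the two clusters)
* Ward linkage (merges the pair of clusters that gives the smallest increase in within-cluster variance)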
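
As a small worked example, single, complete, and average linkage can be computed by hand for two made-up clusters `A` and `B` from their pairwise distance matrix.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two made-up clusters on a line.
A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[3.0, 0.0], [6.0, 0.0]])

# Distances between every point in A and every point in B:
# [[3, 6],
#  [2, 5]]
pairwise = cdist(A, B)

single = pairwise.min()     # closest pair
complete = pairwise.max()   # farthest pair
average = pairwise.mean()   # mean over all pairs

print("single:", single, "complete:", complete, "average:", average)
```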

---

**16. Why might K-Means clustering perform poorly on data with varying cluster sizes or densities?**
Because K-Means assumes clusters are spherical and equally sized, it struggles with clusters of different sizes, densities, or non-globular shapes, leading to poor assignments.

---

**17. What are the core parameters in DBSCAN, and how do they influence clustering?**

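* **Epsilon (eps):** Radius around a point to search for neighbors. A larger eps merges nearby clusters and leaves fewer noise points; a smaller eps splits clusters and marks more points as noise.
* **minPts:** Minimum number of points within eps required for a point to count as a core point, i.e., to form a dense region (cluster). A larger minPts demands denser regions, so more points end up as noise.

Together, these parameters control cluster size and noise detection.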
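
A quick way to see the influence of eps is to hold `min_samples` fixed and sweep the radius. The sketch below does this on synthetic moons data; the three eps values are arbitrary picks meant to show the extremes (everything is noise, everything is one cluster) and a value in between.

```python
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

results = {}
for eps in (0.001, 0.2, 10.0):
    labels = DBSCAN(eps=eps, min_samples=5).fit(X).labels_
    # Count clusters, ignoring the noise label (-1).
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int((labels == -1).sum())
    results[eps] = (n_clusters, n_noise)
    print(f"eps={eps}: clusters={n_clusters}, noise points={n_noise}")
```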

---

**18. How does K-Means++ improve upon standard K-Means initialization?**
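K-Means++ picks the first centroid uniformly at random from the data, then chooses each subsequent centroid with probability proportional to its squared distance from the nearest centroid already chosen. This spreads the initial centroids out, reducing the chance of a poor clustering and typically speeding up convergence.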
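
The seeding step can be sketched as follows. This is an illustrative version of the usual K-Means++ sampling rule (each new center is drawn with probability proportional to its squared distance from the nearest center chosen so far), not scikit-learn's internal code; the function name `kmeans_pp_init` and the toy data are assumptions.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Illustrative K-Means++ seeding (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # First center: a data point chosen uniformly at random.
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from every point to its nearest chosen center.
        diffs = X[:, None, :] - np.array(centers)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        # Far-away points are more likely to be picked; points that are
        # already centers have probability zero.
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centers = kmeans_pp_init(X, k=3)
print(centers)
```

In scikit-learn this behavior is selected with `KMeans(init='k-means++')`, which is the default.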

---

**19. What is agglomerative clustering?**
A bottom-up hierarchical clustering method where each point starts as its own cluster, and clusters are merged step-by-step based on linkage criteria until one cluster or a desired number is reached.

---

**20. What makes Silhouette Score a better metric than just inertia for model evaluation?**
Silhouette Score considers both cohesion (within-cluster similarity) and separation (between-cluster differences), providing a normalized measure of clustering quality, while inertia only measures compactness without considering cluster separation.

</div>


---
### PRACTICAL QUESTIONS
---

### 1. Generate synthetic data with 4 centers using make\_blobs and apply K-Means clustering. Visualize using a scatter plot

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)
kmeans = KMeans(n_clusters=4, random_state=42).fit(X)

plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, cmap='viridis', s=30)
plt.title('KMeans Clustering on 4-center Blobs')
plt.show()
```

---

### 2. Load the Iris dataset and use Agglomerative Clustering to group the data into 3 clusters. Display the first 10 predicted labels

```python
from sklearn.datasets import load_iris
from sklearn.cluster import AgglomerativeClustering

iris = load_iris()
agglo = AgglomerativeClustering(n_clusters=3).fit(iris.data)
print("First 10 labels:", agglo.labels_[:10])
```

---

### 3. Generate synthetic data using make\_moons and apply DBSCAN. Highlight outliers in the plot

```python
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN
import numpy as np
import matplotlib.pyplot as plt

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)
dbscan = DBSCAN(eps=0.2, min_samples=5).fit(X)

plt.scatter(X[:, 0], X[:, 1], c=dbscan.labels_, cmap='plasma', s=30)
outliers = dbscan.labels_ == -1
plt.scatter(X[outliers, 0], X[outliers, 1], c='red', s=50, label='Outliers')
plt.title('DBSCAN on Moons with Outliers Highlighted')
plt.legend()
plt.show()
```

---

### 4. Load the Wine dataset and apply K-Means clustering after standardizing the features. Print the size of each cluster

```python
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import numpy as np

wine = load_wine()
scaler = StandardScaler()
X_scaled = scaler.fit_transform(wine.data)

kmeans = KMeans(n_clusters=3, random_state=42).fit(X_scaled)
unique, counts = np.unique(kmeans.labels_, return_counts=True)
print("Cluster sizes:", dict(zip(unique, counts)))
```

---

### 5. Use make\_circles to generate synthetic data and cluster it using DBSCAN. Plot the result

```python
from sklearn.datasets import make_circles
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt

X, _ = make_circles(n_samples=400, noise=0.05, factor=0.5, random_state=42)
dbscan = DBSCAN(eps=0.15, min_samples=5).fit(X)

plt.scatter(X[:, 0], X[:, 1], c=dbscan.labels_, cmap='coolwarm', s=30)
plt.title('DBSCAN Clustering on Circles')
plt.show()
```

---

### 6. Load the Breast Cancer dataset, apply MinMaxScaler, and use K-Means with 2 clusters. Output the cluster centroids

```python
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans

bc = load_breast_cancer()
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(bc.data)

kmeans = KMeans(n_clusters=2, random_state=42).fit(X_scaled)
print("Cluster centroids:\n", kmeans.cluster_centers_)
```

---

### 7. Generate synthetic data using make\_blobs with varying cluster standard deviations and cluster with DBSCAN

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt

X, _ = make_blobs(n_samples=500, centers=3, cluster_std=[1.0, 2.5, 0.5], random_state=42)
dbscan = DBSCAN(eps=0.8, min_samples=10).fit(X)

plt.scatter(X[:, 0], X[:, 1], c=dbscan.labels_, cmap='Set1', s=30)
plt.title('DBSCAN on Blobs with Varying Std Dev')
plt.show()
```

---

### 8. Load the Digits dataset, reduce it to 2D using PCA, and visualize clusters from K-Means

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

digits = load_digits()
pca = PCA(n_components=2)
X_2d = pca.fit_transform(digits.data)

kmeans = KMeans(n_clusters=10, random_state=42).fit(X_2d)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=kmeans.labels_, cmap='tab10', s=30)
plt.title('KMeans Clusters on Digits PCA 2D')
plt.show()
```

---

### 9. Create synthetic data using make\_blobs and evaluate silhouette scores for k = 2 to 5. Display as a bar chart

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

X, _ = make_blobs(n_samples=400, centers=4, random_state=42)
scores = []
ks = range(2, 6)
for k in ks:
    kmeans = KMeans(n_clusters=k, random_state=42).fit(X)
    score = silhouette_score(X, kmeans.labels_)
    scores.append(score)

plt.bar(ks, scores, color='skyblue')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Scores for KMeans')
plt.show()
```

---

### 10. Load the Iris dataset and use hierarchical clustering to group data. Plot a dendrogram with average linkage

```python
from sklearn.datasets import load_iris
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt

iris = load_iris()
linked = linkage(iris.data, method='average')

plt.figure(figsize=(8, 4))
dendrogram(linked, labels=iris.target, leaf_rotation=90)
plt.title('Hierarchical Clustering Dendrogram (Average Linkage) - Iris')
plt.show()
```

---

### 11. Generate synthetic data with overlapping clusters using make\_blobs, then apply K-Means and visualize with decision boundaries

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=2.5, random_state=42)
kmeans = KMeans(n_clusters=3, random_state=42).fit(X)

x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1), np.arange(y_min, y_max, 0.1))
Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

cmap = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
plt.contourf(xx, yy, Z, cmap=cmap, alpha=0.5)
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, edgecolor='k', s=30)
plt.title('KMeans with Decision Boundaries on Overlapping Blobs')
plt.show()
```

---

### 12. Load the Digits dataset and apply DBSCAN after reducing dimensions with t-SNE. Visualize the results

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt

digits = load_digits()
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(digits.data)

dbscan = DBSCAN(eps=3, min_samples=5).fit(X_tsne)
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=dbscan.labels_, cmap='tab20', s=30)
plt.title('DBSCAN on Digits t-SNE Reduced Data')
plt.show()
```

---

### 13. Generate synthetic data using make\_blobs and apply Agglomerative Clustering with complete linkage. Plot the result

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
agglo = AgglomerativeClustering(n_clusters=4, linkage='complete').fit(X)

plt.scatter(X[:, 0], X[:, 1], c=agglo.labels_, cmap='Set2', s=30)
plt.title('Agglomerative Clustering with Complete Linkage on Blobs')
plt.show()
```

---

### 14. Load the Breast Cancer dataset and compare inertia values for K = 2 to 6 using K-Means. Show results in a line plot

```python
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

bc = load_breast_cancer()
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(bc.data)

inertia = []
ks = range(2, 7)
for k in ks:
    kmeans = KMeans(n_clusters=k, random_state=42).fit(X_scaled)
    inertia.append(kmeans.inertia_)

plt.plot(ks, inertia, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.title('KMeans Inertia on Breast Cancer Dataset')
plt.show()
```

---

### 15. Generate synthetic concentric circles using make\_circles and cluster using Agglomerative Clustering with single linkage

```python
from sklearn.datasets import make_circles
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt

X, _ = make_circles(n_samples=400, noise=0.05, factor=0.5, random_state=42)
agglo = AgglomerativeClustering(n_clusters=2, linkage='single').fit(X)

# The fitted labels live in the `labels_` attribute (note the trailing underscore).
plt.scatter(X[:, 0], X[:, 1], c=agglo.labels_, cmap='cool', s=30)
plt.title('Agglomerative Clustering with Single Linkage on Circles')
plt.show()
```

---

### 16. Use the Wine dataset, apply DBSCAN after scaling the data, and count the number of clusters (excluding noise)

```python
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN
import numpy as np

wine = load_wine()
scaler = StandardScaler()
X_scaled = scaler.fit_transform(wine.data)

dbscan = DBSCAN(eps=1.5, min_samples=5).fit(X_scaled)
labels = dbscan.labels_

# Count clusters ignoring noise (-1 label)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("Number of clusters (excluding noise):", n_clusters)
```

---

### 17. Generate synthetic data with make\_blobs and apply KMeans. Then plot the cluster centers on top of the data points

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=400, centers=4, random_state=42)
kmeans = KMeans(n_clusters=4, random_state=42).fit(X)

plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, cmap='viridis', s=30, alpha=0.6)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            c='red', marker='X', s=200, label='Centers')
plt.title("KMeans with Cluster Centers")
plt.legend()
plt.show()
```

---

### 18. Load the Iris dataset, cluster with DBSCAN, and print how many samples were identified as noise

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

iris = load_iris()
scaler = StandardScaler()
X_scaled = scaler.fit_transform(iris.data)

dbscan = DBSCAN(eps=0.5, min_samples=5).fit(X_scaled)
labels = dbscan.labels_

n_noise = list(labels).count(-1)
print("Number of samples identified as noise:", n_noise)
```

---

### 19. Generate synthetic non-linearly separable data using make\_moons, apply K-Means, and visualize the clustering result

```python
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)
kmeans = KMeans(n_clusters=2, random_state=42).fit(X)

plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, cmap='coolwarm', s=30)
plt.title("KMeans Clustering on Non-linearly Separable Moons Data")
plt.show()
```

---

### 20. Load the Digits dataset, apply PCA to reduce to 3 components, then use KMeans and visualize with a 3D scatter plot

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 unused import

digits = load_digits()
pca = PCA(n_components=3)
X_pca = pca.fit_transform(digits.data)

kmeans = KMeans(n_clusters=10, random_state=42).fit(X_pca)

fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection='3d')
scatter = ax.scatter(X_pca[:, 0], X_pca[:, 1], X_pca[:, 2], 
                     c=kmeans.labels_, cmap='tab10', s=30)
ax.set_title("3D KMeans Clustering on PCA-reduced Digits")
plt.legend(*scatter.legend_elements(), title="Clusters")
plt.show()
```

---

### 21. Generate synthetic blobs with 5 centers and apply KMeans. Then use silhouette\_score to evaluate the clustering

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=5, random_state=42)
kmeans = KMeans(n_clusters=5, random_state=42).fit(X)
labels = kmeans.labels_

score = silhouette_score(X, labels)
print("Silhouette Score for KMeans clustering:", score)
```

---

### 22. Load the Breast Cancer dataset, reduce dimensionality using PCA, and apply Agglomerative Clustering. Visualize in 2D

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt

data = load_breast_cancer()
X = data.data

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

agglo = AgglomerativeClustering(n_clusters=2)
labels = agglo.fit_predict(X_pca)

plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, cmap='plasma', s=30)
plt.title('Agglomerative Clustering on Breast Cancer (PCA-reduced)')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()
```

---

### 23. Generate noisy circular data using make\_circles and visualize clustering results from KMeans and DBSCAN side-by-side

```python
from sklearn.datasets import make_circles
from sklearn.cluster import KMeans, DBSCAN
import matplotlib.pyplot as plt

X, _ = make_circles(n_samples=300, noise=0.1, factor=0.5, random_state=42)

kmeans = KMeans(n_clusters=2, random_state=42).fit(X)
dbscan = DBSCAN(eps=0.2, min_samples=5).fit(X)

fig, axs = plt.subplots(1, 2, figsize=(12, 5))

axs[0].scatter(X[:, 0], X[:, 1], c=kmeans.labels_, cmap='coolwarm', s=30)
axs[0].set_title('KMeans Clustering')

axs[1].scatter(X[:, 0], X[:, 1], c=dbscan.labels_, cmap='coolwarm', s=30)
axs[1].set_title('DBSCAN Clustering')

plt.show()
```

---

### 24. Load the Iris dataset and plot the Silhouette Coefficient for each sample after KMeans clustering

```python
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples
import matplotlib.pyplot as plt
import numpy as np

iris = load_iris()
X = iris.data

kmeans = KMeans(n_clusters=3, random_state=42).fit(X)
labels = kmeans.labels_

silhouette_vals = silhouette_samples(X, labels)

plt.bar(range(len(X)), silhouette_vals, color='skyblue')
plt.xlabel('Sample index')
plt.ylabel('Silhouette Coefficient')
plt.title('Silhouette Coefficient per sample after KMeans clustering')
plt.show()
```

---

### 25. Generate synthetic data using make\_blobs and apply Agglomerative Clustering with 'average' linkage. Visualize clusters

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt

X, _ = make_blobs(n_samples=400, centers=4, random_state=42)

agglo = AgglomerativeClustering(n_clusters=4, linkage='average')
labels = agglo.fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='tab10', s=30)
plt.title("Agglomerative Clustering with Average Linkage")
plt.show()
```

---

### 26. Load the Wine dataset, apply KMeans, and visualize the cluster assignments in a seaborn pairplot (first 4 features)

```python
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt  # needed for plt.suptitle and plt.show below
from sklearn.datasets import load_wine
from sklearn.cluster import KMeans

wine = load_wine()
df = pd.DataFrame(wine.data[:, :4], columns=wine.feature_names[:4])

kmeans = KMeans(n_clusters=3, random_state=42).fit(df)
df['Cluster'] = kmeans.labels_

sns.pairplot(df, hue='Cluster', palette='Set1')
plt.suptitle('Wine Dataset Clusters (first 4 features)', y=1.02)
plt.show()
```

---

### 27. Generate noisy blobs using make\_blobs and use DBSCAN to identify both clusters and noise points. Print the count

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN

X, _ = make_blobs(n_samples=500, centers=3, cluster_std=[1.0, 2.5, 0.5], random_state=42)

dbscan = DBSCAN(eps=0.8, min_samples=10).fit(X)
labels = dbscan.labels_

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = list(labels).count(-1)

print(f"Clusters found: {n_clusters}")
print(f"Noise points identified: {n_noise}")
```

---

### 28. Load the Digits dataset, reduce dimensions using t-SNE, then apply Agglomerative Clustering and plot the clusters

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt

digits = load_digits()
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(digits.data)

agglo = AgglomerativeClustering(n_clusters=10)
labels = agglo.fit_predict(X_tsne)

plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=labels, cmap='tab10', s=30)
plt.title('Agglomerative Clustering on t-SNE reduced Digits data')
plt.show()
```
