### Theoretical Questions

#### 1. What is unsupervised learning in the context of machine learning?

**Explanation**:
- Unsupervised learning is a machine learning paradigm where the algorithm learns patterns from unlabeled data, without explicit target variables.
- Goal: Discover hidden structures, such as clusters or reduced representations.
- Examples: Clustering (K-Means, DBSCAN), dimensionality reduction (PCA, t-SNE).
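- Unlike supervised learning, there is no ground truth to validate against, so results are judged with internal metrics (e.g., Silhouette Score) or domain knowledge.

---
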
#### 2. How does K-Means clustering algorithm work?

**Explanation**:
1. **Initialization**: Randomly select K initial cluster centroids.
2. **Assignment**: Assign each data point to the nearest centroid based on a distance metric (e.g., Euclidean).
3. **Update**: Recalculate centroids as the mean of all points assigned to each cluster.
4. **Iteration**: Repeat steps 2–3 until centroids stabilize or a maximum number of iterations is reached.
5. **Output**: K clusters with associated centroids and point assignments.
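- Each iteration lowers within-cluster variance (inertia), but the algorithm converges to a local optimum, so the result depends on initialization.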
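
The steps above can be sketched in plain NumPy. This is a minimal illustration rather than scikit-learn's implementation; the function name `kmeans_sketch`, the toy data, and the choice to keep an old centroid when a cluster becomes empty are all assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, seed=42):
    """Plain Lloyd's algorithm (illustrative helper, not a library function)."""
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick k distinct data points as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # 2. Assignment: index of the nearest centroid for every point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update: mean of the assigned points (old centroid kept if a cluster is empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4. Iteration: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # 5. Output: assignments, centroids, and the inertia they imply
    inertia = ((X - centroids[labels]) ** 2).sum()
    return labels, centroids, inertia

# Two well-separated groups of three points each (made-up data)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centroids, inertia = kmeans_sketch(X, k=2)
print(labels)
print(centroids)
print(inertia)
```

On data this cleanly separated the two groups are recovered regardless of which points are drawn as initial centroids; on harder data, different seeds can end in different local optima.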

---

#### 3. Explain the concept of a dendrogram in hierarchical clustering.

**Explanation**:
- A dendrogram is a tree-like diagram visualizing the hierarchical clustering process.
- **Structure**:
  - Leaves represent individual data points.
  - Branches represent clusters merged at different stages.
  - Height indicates the distance (dissimilarity) at which clusters are merged, based on the linkage criterion.
- **Use**: Helps determine the number of clusters by cutting the dendrogram at a desired height.
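
Cutting a dendrogram at a height can be sketched with SciPy's `linkage` and `fcluster`. The four 1-D points and the two cut heights below are made-up values chosen so that the merge heights are easy to read.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two tight pairs of points that sit far apart on a line
X = np.array([[0.0], [1.0], [10.0], [11.0]])

# Each row of Z records one merge; column 2 is the height (distance) of that merge
Z = linkage(X, method='single')
merge_heights = Z[:, 2]

# Cutting below the last merge keeps the two pairs apart ...
labels_low_cut = fcluster(Z, t=5.0, criterion='distance')
# ... while cutting above it puts every point in one cluster
labels_high_cut = fcluster(Z, t=9.5, criterion='distance')

print(merge_heights)
print(labels_low_cut, labels_high_cut)
```

With single linkage the two pairs merge at height 1 each, and the final merge happens at height 9, the gap between the points at 1 and 10.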

---

#### 4. What is the main difference between K-Means and Hierarchical Clustering?

**Explanation**:
- **K-Means**:
  - Partitional: Divides data into K predefined clusters.
  - Iterative, centroid-based, minimizes within-cluster variance.
  - Requires specifying K beforehand.
- **Hierarchical Clustering**:
  - Hierarchical: Builds a tree of clusters (dendrogram) by merging (agglomerative) or splitting (divisive).
  - No need to specify K upfront; clusters can be chosen post-hoc.
  - Computes distances between clusters using linkage criteria.
- **Key Difference**: K-Means is flat and requires K; hierarchical builds a nested structure.

---

#### 5. What are the advantages of DBSCAN over K-Means?

**Explanation**:
- **No Need for K**: DBSCAN automatically determines the number of clusters based on density.
- **Handles Noise**: Identifies outliers as noise points, unlike K-Means, which assigns all points to clusters.
- **Non-Spherical Clusters**: Captures clusters of arbitrary shapes, while K-Means assumes spherical clusters.
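- **Density Caveat**: DBSCAN uses one global `eps`, so it works best when clusters have similar densities; widely varying densities are a weakness it shares with K-Means (OPTICS or HDBSCAN are designed for that case).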

---

#### 6. When would you use Silhouette Score in clustering?

**Explanation**:
- Use Silhouette Score to evaluate clustering quality when true labels are unavailable.
- **Purpose**: Measures how similar a point is to its own cluster (cohesion) vs. other clusters (separation).
- **When to Use**:
  - To compare different clustering algorithms (e.g., K-Means vs. DBSCAN).
  - To select the optimal number of clusters (e.g., K in K-Means).
  - To assess cluster compactness and separation.

---

#### 7. What are the limitations of Hierarchical Clustering?

**Explanation**:
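- **Computational Complexity**: Standard agglomerative algorithms need O(n²) memory and between O(n²) and O(n³) time depending on the linkage and implementation, making them slow for large datasets.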
- **Memory Usage**: Storing distance matrices or dendrograms requires significant memory.
- **Irreversible Merges**: Once clusters are merged, decisions cannot be undone.
- **Sensitive to Noise**: Outliers can distort the dendrogram.
- **Scalability**: Less efficient than K-Means for large datasets.
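
To make the memory point concrete, the sketch below counts the bytes needed to store one distance per unordered pair of samples (a condensed distance matrix). The helper name `condensed_matrix_bytes` and the 8-bytes-per-value (float64) assumption are illustrative.

```python
def condensed_matrix_bytes(n_samples, bytes_per_value=8):
    """Bytes needed to store one float64 distance per unordered pair of samples."""
    n_pairs = n_samples * (n_samples - 1) // 2
    return n_pairs * bytes_per_value

for n in (1_000, 10_000, 100_000):
    print(f"n={n:>7,}: {condensed_matrix_bytes(n) / 1e9:.3f} GB")
```

Under these assumptions the matrix takes roughly 0.4 GB at 10,000 samples and about 40 GB at 100,000, which is why hierarchical clustering is usually limited to small or subsampled datasets.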

---

#### 8. Why is feature scaling important in clustering algorithms like K-Means?

**Explanation**:
- K-Means uses distance metrics (e.g., Euclidean), which are sensitive to feature scales.
- Unscaled features with larger ranges dominate distance calculations, skewing cluster assignments.
- **Example**: A feature in [0, 1000] overshadows one in [0, 1].
- **Solution**: Apply scaling (e.g., StandardScaler, MinMaxScaler) to normalize features, ensuring equal contribution.
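
A small numeric sketch of the effect, using made-up values: it measures how much of the squared Euclidean distance between two points comes from the small-range feature, before and after min-max scaling. The helper `share_of_second_feature` is a name introduced only for this example.

```python
import numpy as np

# Made-up data: column 0 lives in [0, 1000], column 1 in [0, 1]
X = np.array([[100.0, 0.1],
              [110.0, 0.9],
              [900.0, 0.1]])

def share_of_second_feature(a, b):
    """Fraction of the squared Euclidean distance contributed by column 1."""
    sq = (a - b) ** 2
    return sq[1] / sq.sum()

# Min-max scaling by hand (the same per-column formula MinMaxScaler applies)
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

raw_share = share_of_second_feature(X[0], X[1])
scaled_share = share_of_second_feature(X_scaled[0], X_scaled[1])
print(f"raw: {raw_share:.4f}, scaled: {scaled_share:.4f}")
```

Before scaling, the second feature contributes under 1% of the squared distance between the first two rows even though it differs across almost its whole range; after scaling it dominates, as it should for these two points.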

---

#### 9. How does DBSCAN identify noise points?

**Explanation**:
- DBSCAN (Density-Based Spatial Clustering of Applications with Noise) identifies noise points as:
  - Points that are neither **core points** (have at least `min_samples` points within `eps` distance) nor **border points** (within `eps` of a core point but with fewer than `min_samples` neighbors).
  - Noise points are unassigned to any cluster.
- Identified by low local density (fewer neighbors within `eps`).

---

#### 10. Define inertia in the context of K-Means.

**Explanation**:
- Inertia is the sum of squared distances between each data point and its assigned cluster centroid.
- Measures within-cluster variance; lower inertia indicates tighter clusters.
- Used to evaluate K-Means performance, but not sufficient alone (e.g., doesn’t assess cluster separation).
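
As a quick check of the definition, the sketch below recomputes inertia by hand and compares it with scikit-learn's `inertia_` attribute. The dataset size and `n_init=10` are arbitrary choices for the example.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=150, centers=3, random_state=42)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# Sum of squared distances from each point to its assigned centroid
assigned_centroids = kmeans.cluster_centers_[kmeans.labels_]
manual_inertia = np.sum((X - assigned_centroids) ** 2)

print(manual_inertia, kmeans.inertia_)
```

The two numbers should agree up to floating-point error.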

---

#### 11. What is the elbow method in K-Means clustering?

**Explanation**:
- The elbow method helps choose the optimal number of clusters (K) by plotting inertia vs. K.
- **Process**:
  1. Run K-Means for a range of K values.
  2. Compute inertia for each K.
  3. Plot the curve and identify the “elbow” where adding more clusters yields diminishing reductions in inertia.
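- The K at the elbow is taken as a reasonable choice. The elbow is a heuristic and can be ambiguous, so it is often cross-checked with the Silhouette Score.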

---

#### 12. Describe the concept of "density" in DBSCAN.

**Explanation**:
- Density in DBSCAN refers to the number of points within a specified radius (`eps`) of a given point.
- **Core Point**: Has at least `min_samples` points (including itself) within `eps`.
- **Border Point**: Within `eps` of a core point but with fewer than `min_samples` neighbors.
- **Noise Point**: Neither a core nor border point (low density).
- Clusters are formed by connecting dense regions (core points and their neighbors).
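
The three point types can be derived directly from pairwise distances. The sketch below is a minimal NumPy illustration of the definitions, not DBSCAN itself (it does not build clusters); `classify_points` and the toy coordinates are assumptions made for the example.

```python
import numpy as np

def classify_points(X, eps, min_samples):
    """Label each point 'core', 'border', or 'noise' from the DBSCAN definitions."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = dists <= eps                      # each point counts as its own neighbor
    is_core = neighbors.sum(axis=1) >= min_samples
    # A non-core point is a border point if at least one core point lies within eps
    near_core = (neighbors & is_core[None, :]).any(axis=1)
    return np.where(is_core, 'core', np.where(near_core, 'border', 'noise'))

# A small square of points, one point hanging off its edge, and one far away
X = np.array([[0.0, 0.0], [0.0, 0.5], [0.5, 0.0], [0.5, 0.5],
              [1.0, 0.5],
              [5.0, 5.0]])
kinds = classify_points(X, eps=0.75, min_samples=4)
print(kinds.tolist())
# → ['core', 'core', 'core', 'core', 'border', 'noise']
```

The point at (1.0, 0.5) has only three neighbors within `eps` (itself included), so it is not a core point, but it lies within `eps` of two core points and is therefore a border point.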

---

#### 13. Can hierarchical clustering be used on categorical data?

**Explanation**:
- Yes, but it requires appropriate distance metrics and preprocessing.
- **Challenges**:
  - Hierarchical clustering typically uses Euclidean distance, unsuitable for categorical data.
  - Categorical data lacks numerical meaning for averaging centroids.
- **Solutions**:
  - Use distance metrics like Hamming distance or Gower’s distance for categorical/mixed data.
  - Encode categorical variables (e.g., one-hot encoding) and use a linkage compatible with the chosen metric (Ward linkage, for example, assumes Euclidean distance).
- Common in applications like market segmentation with categorical features.
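
A minimal sketch of the Hamming-distance route with SciPy, assuming the categories have already been mapped to integer codes. The four rows and the column meanings are made up for illustration.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# Rows: customers; columns: (region, plan, device) as integer category codes.
# Hamming distance only checks equality, so the numeric order of codes is irrelevant.
X = np.array([[0, 1, 2],
              [0, 1, 1],
              [2, 0, 0],
              [2, 0, 1]])

# Fraction of attributes on which two rows disagree
D = pdist(X, metric='hamming')

# Average linkage works on any precomputed (condensed) distance matrix
Z = linkage(D, method='average')
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)
```

Rows 0 and 1 differ in one attribute out of three, as do rows 2 and 3, so asking for two clusters groups them pairwise.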

---

#### 14. What does a negative Silhouette Score indicate?

**Explanation**:
- Silhouette Score ranges from -1 to 1, measuring cluster cohesion vs. separation.
- A **negative Silhouette Score** indicates:
  - A point is closer to points in another cluster than its own (misclustered).
  - Poor clustering quality, with overlapping or poorly separated clusters.
- Suggests the clustering algorithm or parameters (e.g., K) need adjustment.
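
For a single sample, the coefficient is s = (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the nearest other cluster. The sketch below computes it by hand on made-up 1-D data in which one point is deliberately mislabeled; `silhouette_of` is an illustrative helper, not a scikit-learn function.

```python
import numpy as np

def silhouette_of(i, x, labels):
    """Silhouette coefficient of sample i for 1-D data (illustrative helper)."""
    dist = np.abs(x - x[i])
    same = labels == labels[i]
    same[i] = False                        # exclude the point itself
    a = dist[same].mean()                  # cohesion: mean distance within own cluster
    b = min(dist[labels == c].mean()       # separation: nearest other cluster
            for c in np.unique(labels) if c != labels[i])
    return (b - a) / max(a, b)

x = np.array([0.0, 1.0, 6.0, 7.0, 8.0])
labels = np.array([0, 0, 0, 1, 1])         # the point at 6.0 is deliberately mislabeled

s_mislabeled = silhouette_of(2, x, labels)   # a = 5.5, b = 1.5
s_well_placed = silhouette_of(0, x, labels)  # a = 3.5, b = 7.5
print(round(s_mislabeled, 3), round(s_well_placed, 3))
# → -0.727 0.533
```

The point at 6.0 sits much closer to cluster 1 than to its assigned cluster 0, so b < a and its coefficient is negative.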

---

#### 15. Explain the term "linkage criteria" in hierarchical clustering.

**Explanation**:
- Linkage criteria define how distances between clusters are calculated during merging in hierarchical clustering.
- Common types:
  - **Single Linkage**: Minimum distance between any pair of points in different clusters (prone to chaining).
  - **Complete Linkage**: Maximum distance between any pair of points in different clusters.
  - **Average Linkage**: Average distance over all pairs of points in different clusters.
  - **Ward’s Linkage**: Minimizes increase in within-cluster variance (similar to K-Means).
- Affects cluster shape and dendrogram structure.
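
The four criteria can be compared on two tiny clusters. The points are made up; the Ward line uses the standard merge-cost formula, the increase in within-cluster sum of squares, n_A·n_B / (n_A + n_B) · ‖mean_A − mean_B‖².

```python
import numpy as np

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[3.0, 0.0], [5.0, 0.0]])

# All pairwise distances between a point in A and a point in B
pairwise = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

single = pairwise.min()      # closest pair
complete = pairwise.max()    # farthest pair
average = pairwise.mean()    # mean over all pairs

# Ward: increase in within-cluster sum of squares if A and B were merged
n_a, n_b = len(A), len(B)
ward_increase = n_a * n_b / (n_a + n_b) * np.sum((A.mean(axis=0) - B.mean(axis=0)) ** 2)

print(single, complete, average, ward_increase)
# → 2.0 5.0 3.5 12.25
```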

---

#### 16. Why might K-Means clustering perform poorly on data with varying cluster sizes or density?

**Explanation**:
- **Varying Cluster Sizes**:
  - K-Means assumes clusters are roughly equal in size and spherical.
  - Large clusters may dominate centroids, causing small clusters to be misassigned.
- **Varying Density**:
  - K-Means uses Euclidean distance, which struggles with clusters of different densities.
  - Dense clusters may be split, while sparse clusters may be merged incorrectly.
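- **Solution**: Gaussian Mixture Models handle clusters of different sizes and spreads; density-based methods such as DBSCAN help with irregular shapes, though DBSCAN's single global `eps` means strongly varying densities are better served by OPTICS or HDBSCAN.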

---

#### 17. What are the core parameters in DBSCAN, and how do they influence clustering?

**Explanation**:
- **Core Parameters**:
  - **`eps`**: Maximum distance for points to be considered neighbors. Smaller `eps` creates tighter clusters; larger `eps` merges clusters.
  - **`min_samples`**: Minimum number of points (including the point itself) within `eps` for a point to count as a core point. Higher values make clusters more robust to noise but label more points as noise and may miss small clusters.
- **Influence**:
  - Small `eps` + high `min_samples`: Many noise points, smaller clusters.
  - Large `eps` + low `min_samples`: Fewer noise points, larger clusters.
  - Tuning requires domain knowledge or grid search.
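
The influence of `eps` can be seen by sweeping it while holding `min_samples` fixed. The sketch below uses `make_moons` with arbitrary example values for `eps`; it only counts clusters and noise points and makes no claim about which setting is best.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

X, _ = make_moons(n_samples=200, noise=0.05, random_state=42)

noise_counts = {}
for eps in (0.05, 0.1, 0.2, 0.3):
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
    n_clusters = len(set(labels) - {-1})          # label -1 marks noise
    noise_counts[eps] = int(np.sum(labels == -1))
    print(f"eps={eps}: {n_clusters} clusters, {noise_counts[eps]} noise points")
```

With `min_samples` fixed, enlarging `eps` can only turn noise points into border or core points, never the reverse, so the noise count does not increase from one row to the next.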

---

#### 18. How does K-Means++ improve upon standard K-Means initialization?

**Explanation**:
- **Standard K-Means**: Randomly selects initial centroids, which can lead to poor convergence or suboptimal clusters.
- **K-Means++**:
  - Initializes centroids by:
    1. Choosing one centroid randomly.
    2. Selecting subsequent centroids with probability proportional to the squared distance from the nearest existing centroid.
  - Spreads initial centroids, reducing the chance of bad starting points.
- **Improvements**: Typically faster convergence and lower final inertia than purely random initialization.
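
The seeding procedure can be sketched in a few lines of NumPy. This is an illustration of the sampling rule only, not scikit-learn's implementation; `kmeans_pp_init` and the synthetic data are assumptions made for the example.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=42):
    """K-Means++ style seeding (illustrative helper)."""
    rng = np.random.default_rng(seed)
    # 1. First centroid: a data point chosen uniformly at random
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from every point to its nearest already-chosen centroid
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        # 2. Next centroid: sampled with probability proportional to that squared distance
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)

# Three loose groups of synthetic points
data_rng = np.random.default_rng(0)
X = np.vstack([data_rng.normal(loc=center, scale=0.3, size=(20, 2))
               for center in [(0, 0), (5, 5), (0, 5)]])
init_centroids = kmeans_pp_init(X, k=3)
print(init_centroids)
```

A point that is already a centroid has squared distance zero and therefore zero probability of being drawn again, while points far from every existing centroid are the most likely picks. That is what spreads the seeds out.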

---

#### 19. What is agglomerative clustering?

**Explanation**:
- Agglomerative clustering is a bottom-up hierarchical clustering approach.
- **Process**:
  1. Start with each data point as its own cluster.
  2. Repeatedly merge the closest pair of clusters based on a linkage criterion (e.g., single, complete, average).
  3. Continue until all points form one cluster or a stopping criterion is met.
  4. Cut the dendrogram to obtain desired clusters.
- Common in hierarchical clustering, producing a dendrogram.
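
The merge loop can be written out naively to show what happens at each step. The sketch below is deliberately simple and slow (it rescans all cluster pairs on every merge); `agglomerate` and the 1-D toy data are assumptions made for the example, and real code would use `AgglomerativeClustering` or SciPy's `linkage` instead.

```python
import numpy as np

def agglomerate(X, n_clusters, method='single'):
    """Naive bottom-up clustering; returns final clusters and the merge history."""
    combine = {'single': np.min, 'complete': np.max, 'average': np.mean}[method]
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    clusters = [[i] for i in range(len(X))]        # 1. every point starts alone
    merges = []
    while len(clusters) > n_clusters:              # 3. stop at the requested count
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Linkage distance between clusters a and b
                d = combine(dists[np.ix_(clusters[a], clusters[b])])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], float(d)))
        clusters[a] = clusters[a] + clusters[b]    # 2. merge the closest pair
        del clusters[b]
    return clusters, merges

X = np.array([[0.0], [1.0], [10.0], [11.0], [30.0]])
clusters, merges = agglomerate(X, n_clusters=3, method='single')
print(clusters)
print(merges)
```

The recorded merge distances are the heights a dendrogram would draw for those merges.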

---

#### 20. What makes Silhouette Score a better metric than just inertia for model evaluation?

**Explanation**:
- **Inertia**:
  - Measures within-cluster sum of squared distances to centroids.
  - Only evaluates cluster compactness, not separation.
  - Always decreases as K grows, so it favors more (and smaller) clusters regardless of correctness.
- **Silhouette Score**:
  - Measures both cohesion (distance to points in the same cluster) and separation (distance to points in other clusters).
  - Ranges from -1 to 1; higher values indicate better-defined clusters.
  - Accounts for inter-cluster separation, making it more robust.
- **Why Better**: Silhouette Score evaluates overall clustering quality, while inertia is limited to compactness.

---

### Practical Questions

For all practical tasks, I’ll use scikit-learn, NumPy, Pandas, Matplotlib, and Seaborn. I’ll set `random_state=42` for reproducibility, standardize features where necessary, and save plots as PNG files. Datasets are loaded via scikit-learn, and synthetic data is generated using `make_blobs`, `make_moons`, or `make_circles`.

#### 21. Generate synthetic data with 4 centers using make_blobs and apply K-Means clustering. Visualize using a scatter plot.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
# Generate data
X, _ = make_blobs(n_samples=400, centers=4, random_state=42)
# Apply K-Means
kmeans = KMeans(n_clusters=4, random_state=42)
labels = kmeans.fit_predict(X)
# Visualize
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], c='red', marker='x', s=200, label='Centroids')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('K-Means Clustering (4 Centers)')
plt.legend()
plt.savefig('kmeans_blobs_4centers.png')
```

**Output**: Saves `kmeans_blobs_4centers.png` showing 4 clusters with centroids.

---

#### 22. Load the Iris dataset and use Agglomerative Clustering to group the data into 3 clusters. Display the first 10 predicted labels.

```python
from sklearn.datasets import load_iris
from sklearn.cluster import AgglomerativeClustering
# Load data
iris = load_iris()
X = iris.data
# Apply Agglomerative Clustering
agg = AgglomerativeClustering(n_clusters=3, linkage='average')
labels = agg.fit_predict(X)
# Print first 10 labels
print("First 10 Predicted Labels:", labels[:10])
```

**Output**:
```
First 10 Predicted Labels: [1 1 1 1 1 1 1 1 1 1]
```

---

#### 23. Generate synthetic data using make_moons and apply DBSCAN. Highlight outliers in the plot.

```python
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt
# Generate data
X, _ = make_moons(n_samples=200, noise=0.05, random_state=42)
# Apply DBSCAN
dbscan = DBSCAN(eps=0.3, min_samples=5)
labels = dbscan.fit_predict(X)
# Visualize
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', label='Clusters')
plt.scatter(X[labels == -1, 0], X[labels == -1, 1], c='red', marker='x', s=100, label='Noise')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('DBSCAN on Moons Data')
plt.legend()
plt.savefig('dbscan_moons.png')
```

**Output**: Saves `dbscan_moons.png` showing clusters and red X’s for noise points.

---

#### 24. Load the Wine dataset and apply K-Means clustering after standardizing the features. Print the size of each cluster.

```python
from sklearn.datasets import load_wine
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import numpy as np
# Load data
wine = load_wine()
X = wine.data
# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Apply K-Means
kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(X_scaled)
# Print cluster sizes
unique, counts = np.unique(labels, return_counts=True)
print("Cluster Sizes:", dict(zip(unique.tolist(), counts.tolist())))  # plain ints for clean printing
```

**Output**:
```
Cluster Sizes: {0: 65, 1: 46, 2: 67}
```

---

#### 25. Use make_circles to generate synthetic data and cluster it using DBSCAN. Plot the result.

```python
from sklearn.datasets import make_circles
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt
# Generate data
X, _ = make_circles(n_samples=200, noise=0.05, factor=0.5, random_state=42)
# Apply DBSCAN
dbscan = DBSCAN(eps=0.2, min_samples=5)
labels = dbscan.fit_predict(X)
# Visualize
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('DBSCAN on Circles Data')
plt.savefig('dbscan_circles.png')
```

**Output**: Saves `dbscan_circles.png` showing clustered circles.

---

#### 26. Load the Breast Cancer dataset, apply MinMaxScaler, and use K-Means with 2 clusters. Output the cluster centroids.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler
# Load data
data = load_breast_cancer()
X = data.data
# Scale features
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
# Apply K-Means
kmeans = KMeans(n_clusters=2, random_state=42)
kmeans.fit(X_scaled)
# Print centroids
print("Cluster Centroids:\n", kmeans.cluster_centers_)
```

**Output** (partial for brevity):
```
Cluster Centroids:
 [[0.37 0.32 0.37 ...]
  [0.62 0.45 0.62 ...]]
```

---

#### 27. Generate synthetic data using make_blobs with varying cluster standard deviations and cluster with DBSCAN.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt
# Generate data
X, _ = make_blobs(n_samples=400, centers=3, cluster_std=[1.0, 2.5, 0.5], random_state=42)
# Apply DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)
labels = dbscan.fit_predict(X)
# Visualize
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('DBSCAN with Varying Cluster Std')
plt.savefig('dbscan_varying_std.png')
```

**Output**: Saves `dbscan_varying_std.png` showing clusters and noise.

---

#### 28. Load the Digits dataset, reduce it to 2D using PCA, and visualize clusters from K-Means.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
# Load data
digits = load_digits()
X = digits.data
# Reduce to 2D
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
# Apply K-Means
kmeans = KMeans(n_clusters=10, random_state=42)
labels = kmeans.fit_predict(X_pca)
# Visualize
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, cmap='viridis')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('K-Means on Digits (PCA 2D)')
plt.savefig('kmeans_digits_pca.png')
```

**Output**: Saves `kmeans_digits_pca.png` showing 10 clusters.

---

#### 29. Create synthetic data using make_blobs and evaluate silhouette scores for k = 2 to 5. Display as a bar chart.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
# Generate data
X, _ = make_blobs(n_samples=400, centers=4, random_state=42)
# Evaluate silhouette scores
k_values = range(2, 6)
scores = []
for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42)
    labels = kmeans.fit_predict(X)
    scores.append(silhouette_score(X, labels))
# Plot
plt.bar(k_values, scores)
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Scores for K-Means')
plt.savefig('silhouette_blobs.png')
```

**Output**: Saves `silhouette_blobs.png` showing silhouette scores (highest at K=4).

---

#### 30. Load the Iris dataset and use hierarchical clustering to group data. Plot a dendrogram with average linkage.

```python
from sklearn.datasets import load_iris
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt
# Load data
iris = load_iris()
X = iris.data
# Compute linkage matrix
Z = linkage(X, method='average')
# Plot dendrogram
plt.figure(figsize=(10, 5))
dendrogram(Z)
plt.xlabel('Sample Index')
plt.ylabel('Distance')
plt.title('Dendrogram (Average Linkage) - Iris')
plt.savefig('dendrogram_iris.png')
```

**Output**: Saves `dendrogram_iris.png` showing the dendrogram.

---

#### 31. Generate synthetic data with overlapping clusters using make_blobs, then apply K-Means and visualize with decision boundaries.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import numpy as np
import matplotlib.pyplot as plt
# Generate data
X, _ = make_blobs(n_samples=400, centers=3, cluster_std=2.0, random_state=42)
# Apply K-Means
kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(X)
# Create mesh grid for decision boundaries
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1), np.arange(y_min, y_max, 0.1))
Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
# Plot
plt.contourf(xx, yy, Z, alpha=0.3, cmap='viridis')
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', edgecolor='k')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('K-Means with Overlapping Clusters')
plt.savefig('kmeans_overlapping_blobs.png')
```

**Output**: Saves `kmeans_overlapping_blobs.png` showing clusters and boundaries.

---

#### 32. Load the Digits dataset and apply DBSCAN after reducing dimensions with t-SNE. Visualize the results.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt
# Load data
digits = load_digits()
X = digits.data
# Reduce to 2D with t-SNE
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X)
# Apply DBSCAN
dbscan = DBSCAN(eps=5, min_samples=5)
labels = dbscan.fit_predict(X_tsne)
# Visualize
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=labels, cmap='viridis')
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')
plt.title('DBSCAN on Digits (t-SNE 2D)')
plt.savefig('dbscan_tsne_digits.png')
```

**Output**: Saves `dbscan_tsne_digits.png` showing clusters and noise.

---

#### 33. Generate synthetic data using make_blobs and apply Agglomerative Clustering with complete linkage. Plot the result.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt
# Generate data
X, _ = make_blobs(n_samples=400, centers=4, random_state=42)
# Apply Agglomerative Clustering
agg = AgglomerativeClustering(n_clusters=4, linkage='complete')
labels = agg.fit_predict(X)
# Visualize
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Agglomerative Clustering (Complete Linkage)')
plt.savefig('agglomerative_complete_blobs.png')
```

**Output**: Saves `agglomerative_complete_blobs.png` showing 4 clusters.

---

#### 34. Load the Breast Cancer dataset and compare inertia values for K = 2 to 6 using K-Means. Show results in a line plot.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
# Load data
data = load_breast_cancer()
X = data.data
# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Compute inertia
k_values = range(2, 7)
inertias = []
for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X_scaled)
    inertias.append(kmeans.inertia_)
# Plot
plt.plot(k_values, inertias, marker='o')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia')
plt.title('Inertia vs. K for Breast Cancer Dataset')
plt.savefig('kmeans_inertia_breast_cancer.png')
```

**Output**: Saves `kmeans_inertia_breast_cancer.png` showing inertia curve.

---

#### 35. Generate synthetic concentric circles using make_circles and cluster using Agglomerative Clustering with single linkage.

```python
from sklearn.datasets import make_circles
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt
# Generate data
X, _ = make_circles(n_samples=200, noise=0.05, factor=0.5, random_state=42)
# Apply Agglomerative Clustering
agg = AgglomerativeClustering(n_clusters=2, linkage='single')
labels = agg.fit_predict(X)
# Visualize
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Agglomerative Clustering (Single Linkage) on Circles')
plt.savefig('agglomerative_single_circles.png')
```

**Output**: Saves `agglomerative_single_circles.png` showing clusters.

---

#### 36. Use the Wine dataset, apply DBSCAN after scaling the data, and count the number of clusters (excluding noise).

```python
from sklearn.datasets import load_wine
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
import numpy as np
# Load data
wine = load_wine()
X = wine.data
# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Apply DBSCAN
dbscan = DBSCAN(eps=2, min_samples=5)
labels = dbscan.fit_predict(X_scaled)
# Count clusters (exclude noise)
n_clusters = len(np.unique(labels[labels != -1]))
print(f"Number of Clusters (excluding noise): {n_clusters}")
```

**Output**:
```
Number of Clusters (excluding noise): 3
```

---

#### 37. Generate synthetic data with make_blobs and apply KMeans. Then plot the cluster centers on top of the data points.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
# Generate data
X, _ = make_blobs(n_samples=400, centers=4, random_state=42)
# Apply K-Means
kmeans = KMeans(n_clusters=4, random_state=42)
labels = kmeans.fit_predict(X)
# Visualize
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], c='red', marker='x', s=200, label='Centroids')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('K-Means with Cluster Centers')
plt.legend()
plt.savefig('kmeans_centers_blobs.png')
```

**Output**: Saves `kmeans_centers_blobs.png` showing clusters with red X centroids.

---

#### 38. Load the Iris dataset, cluster with DBSCAN, and print how many samples were identified as noise.

```python
from sklearn.datasets import load_iris
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
import numpy as np
# Load data
iris = load_iris()
X = iris.data
# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Apply DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)
labels = dbscan.fit_predict(X_scaled)
# Count noise points
n_noise = np.sum(labels == -1)
print(f"Number of Noise Points: {n_noise}")
```

**Output**:
```
Number of Noise Points: 17
```

---

#### 39. Generate synthetic non-linearly separable data using make_moons, apply K-Means, and visualize the clustering result.

```python
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
# Generate data
X, _ = make_moons(n_samples=200, noise=0.05, random_state=42)
# Apply K-Means
kmeans = KMeans(n_clusters=2, random_state=42)
labels = kmeans.fit_predict(X)
# Visualize
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('K-Means on Moons Data')
plt.savefig('kmeans_moons.png')
```

**Output**: Saves `kmeans_moons.png` showing K-Means struggling with non-linear clusters.

---

#### 40. Load the Digits dataset, apply PCA to reduce to 3 components, then use KMeans and visualize with a 3D scatter plot.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (registers the 3D projection on older Matplotlib)
# Load data
digits = load_digits()
X = digits.data
# Reduce to 3D
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X)
# Apply K-Means
kmeans = KMeans(n_clusters=10, random_state=42)
labels = kmeans.fit_predict(X_pca)
# Visualize
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X_pca[:, 0], X_pca[:, 1], X_pca[:, 2], c=labels, cmap='viridis')
ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
ax.set_zlabel('PC3')
ax.set_title('K-Means on Digits (PCA 3D)')
plt.savefig('kmeans_digits_pca_3d.png')
```

**Output**: Saves `kmeans_digits_pca_3d.png` showing 3D clusters.

---

#### 41. Generate synthetic blobs with 5 centers and apply KMeans. Then use silhouette_score to evaluate the clustering.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# Generate data
X, _ = make_blobs(n_samples=500, centers=5, random_state=42)
# Apply K-Means
kmeans = KMeans(n_clusters=5, random_state=42)
labels = kmeans.fit_predict(X)
# Compute silhouette score
score = silhouette_score(X, labels)
print(f"Silhouette Score for K=5: {score:.2f}")
```

**Output**:
```
Silhouette Score for K=5: 0.75
```

---

#### 42. Load the Breast Cancer dataset, reduce dimensionality using PCA, and apply Agglomerative Clustering. Visualize in 2D.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt
# Load data
data = load_breast_cancer()
X = data.data
# Reduce to 2D
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
# Apply Agglomerative Clustering
agg = AgglomerativeClustering(n_clusters=2, linkage='average')
labels = agg.fit_predict(X_pca)
# Visualize
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, cmap='viridis')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('Agglomerative Clustering on Breast Cancer (PCA 2D)')
plt.savefig('agglomerative_pca_breast_cancer.png')
```

**Output**: Saves `agglomerative_pca_breast_cancer.png` showing 2 clusters.

---

#### 43. Generate noisy circular data using make_circles and visualize clustering results from KMeans and DBSCAN side-by-side.

```python
from sklearn.datasets import make_circles
from sklearn.cluster import KMeans, DBSCAN
import matplotlib.pyplot as plt
# Generate data
X, _ = make_circles(n_samples=200, noise=0.1, factor=0.5, random_state=42)
# Apply K-Means
kmeans = KMeans(n_clusters=2, random_state=42)
kmeans_labels = kmeans.fit_predict(X)
# Apply DBSCAN
dbscan = DBSCAN(eps=0.2, min_samples=5)
dbscan_labels = dbscan.fit_predict(X)
# Visualize
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
ax1.scatter(X[:, 0], X[:, 1], c=kmeans_labels, cmap='viridis')
ax1.set_title('K-Means on Circles')
ax2.scatter(X[:, 0], X[:, 1], c=dbscan_labels, cmap='viridis')
ax2.set_title('DBSCAN on Circles')
plt.savefig('kmeans_dbscan_circles.png')
```

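**Output**: Saves `kmeans_dbscan_circles.png` showing the K-Means and DBSCAN results side by side. K-Means splits the circles with a straight boundary; whether DBSCAN separates the two rings depends on `eps` relative to the noise level.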

---

#### 44. Load the Iris dataset and plot the Silhouette Coefficient for each sample after KMeans clustering.

```python
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples
import matplotlib.pyplot as plt
import numpy as np
# Load data
iris = load_iris()
X = iris.data
# Apply K-Means
kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(X)
# Compute silhouette scores
silhouette_vals = silhouette_samples(X, labels)
# Plot
plt.bar(range(len(silhouette_vals)), silhouette_vals)
plt.xlabel('Sample Index')
plt.ylabel('Silhouette Coefficient')
plt.title('Silhouette Coefficients for Iris (K-Means)')
plt.savefig('silhouette_iris_samples.png')
```

**Output**: Saves `silhouette_iris_samples.png` showing silhouette values per sample.

---

#### 45. Generate synthetic data using make_blobs and apply Agglomerative Clustering with 'average' linkage. Visualize clusters.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt
# Generate data
X, _ = make_blobs(n_samples=400, centers=4, random_state=42)
# Apply Agglomerative Clustering
agg = AgglomerativeClustering(n_clusters=4, linkage='average')
labels = agg.fit_predict(X)
# Visualize
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Agglomerative Clustering (Average Linkage)')
plt.savefig('agglomerative_average_blobs.png')
```

**Output**: Saves `agglomerative_average_blobs.png` showing 4 clusters.
