# Clustering

# Theoretical Questions:

1.  What is unsupervised learning in the context of machine learning?

    - Unsupervised learning is a type of machine learning where the model is trained on unlabeled data, meaning there are no predefined target outputs. The goal is to discover hidden patterns, groupings, or structures within the data. Typical tasks include clustering similar items and reducing the number of dimensions.


2. How does K-Means clustering algorithm work?

   - K-Means partitions the data into K clusters by minimizing the within-cluster variance (the sum of squared distances from each point to its cluster centroid).
    1. Choose the number of clusters (K).
    2. Randomly initialize K centroids.
    3. Assign each data point to the nearest centroid.
    4. Recalculate each centroid as the mean of the points assigned to it.
    5. Repeat steps 3 and 4 until the centroids no longer change.
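
The five steps above can be sketched in plain NumPy. This is a minimal illustration, not scikit-learn's implementation: the function name `kmeans_lloyd`, the toy data, and the choice to initialize from randomly picked data points are assumptions made for the example.

```python
import numpy as np

def kmeans_lloyd(X, k, n_iter=100, seed=0):
    """Illustrative K-Means loop (hypothetical helper, not a library API)."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 3: assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy data: two tight, well-separated groups of three points each.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centroids = kmeans_lloyd(X, k=2)
print(labels)
print(centroids)
```

On this toy data the two groups end up in separate clusters, with each centroid at the mean of its group.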


3. Explain the concept of a dendrogram in hierarchical clustering.

   - A dendrogram is a tree-like diagram that shows how data points are merged or split in hierarchical clustering. The height of each merge represents the distance or dissimilarity between the clusters being joined. Cutting the tree at a chosen height yields a flat clustering, so the dendrogram helps to decide the number of clusters visually.


4. What is the main difference between K-Means and Hierarchical Clustering?

   - K-Means: Requires the number of clusters (K) in advance; partitions data flatly.
   - Hierarchical: Builds a hierarchy of clusters without predefining K; visualized via dendrogram.
   - K-Means scales better to large datasets, while hierarchical clustering exposes the full merge structure, which makes the result easier to interpret.


5. What are the advantages of DBSCAN over K-Means?

   - The advantages of DBSCAN over K-Means are:
   1. Detects arbitrarily shaped clusters.
   2. Finds the number of clusters automatically from the density parameters, so K does not have to be chosen in advance.
   3. Can detect noise and outliers.
   4. Doesn't assume clusters are spherical, unlike K-Means.


6. When would you use Silhouette Score in clustering?

   - The Silhouette Score measures how well each point fits its assigned cluster compared with the nearest other cluster. It is an internal (unsupervised) evaluation metric, meaning it doesn't require the true cluster labels of the data. Typical uses:
   1. Choosing the Optimal Number of Clusters (k): The most common use of the Silhouette Score is to help determine the best number of clusters (k) for algorithms like K-Means or K-Medoids.
   2. Evaluating and Comparing Cluster Quality: The score provides a single, easy-to-interpret number to assess how well-defined the clusters are.
   3. Identifying Outliers and Poorly Clustered Points: The score is calculated for each individual data point, which allows for a more granular analysis.


7. What are the limitations of Hierarchical Clustering?

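   - The limitations of Hierarchical Clustering (HC) primarily stem from its computational intensity and the permanent nature of its decisions.
     - High computational and memory cost: standard agglomerative algorithms work from the pairwise distance matrix, which needs O(n²) memory and at least O(n²) time, so they scale poorly to large datasets.
     - Irreversible decisions: once two clusters are merged (or one is split), the step is never undone, so an early mistake propagates.
     - Sensitivity to input parameters: the result depends strongly on the chosen distance metric and linkage criterion.
     - Difficulty in handling noise and outliers: outliers can distort merges, for example by chaining clusters together under single linkage.
     - Difficulty in determining the optimal number of clusters: deciding where to cut the dendrogram is often a judgment call.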


8. Why is feature scaling important in clustering algorithms like K-Means?

   - K-Means uses Euclidean distance to measure similarity. If features have different scales, those with larger ranges dominate the distance metric, leading to biased clusters. Scaling ensures all features contribute equally.
     - Prevent Feature Dominance.
     - Ensure Equal Contribution.
     - Faster Convergence.
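
To make feature dominance concrete, the sketch below compares Euclidean distances before and after standardization. The three records (age in years, income in dollars) are made-up values chosen only for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical records: [age in years, income in dollars].
X = np.array([[25.0, 50_000.0],
              [60.0, 50_500.0],
              [26.0, 58_000.0]])

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))

# Raw features: income is on a far larger scale, so it dominates.
raw_01 = dist(X[0], X[1])   # records differ mostly in age
raw_02 = dist(X[0], X[2])   # records differ mostly in income

# Standardized features: zero mean and unit variance per column.
X_scaled = StandardScaler().fit_transform(X)
scaled_01 = dist(X_scaled[0], X_scaled[1])
scaled_02 = dist(X_scaled[0], X_scaled[2])

print(f"raw:    d(0,1)={raw_01:.1f}  d(0,2)={raw_02:.1f}")
print(f"scaled: d(0,1)={scaled_01:.2f}  d(0,2)={scaled_02:.2f}")
```

On the raw data the 35-year age gap is almost invisible next to the income gap; after scaling, the two distances are of similar size, so both features influence the clustering.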


9. How does DBSCAN identify noise points?

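   - DBSCAN classifies every point as one of three types:
     - Core points: Have at least min_samples points within radius eps (in scikit-learn the count includes the point itself).
     - Border points: Lie within eps of a core point but have fewer than min_samples neighbors themselves.
     - Noise points: Are neither core nor border points, so they belong to no cluster. scikit-learn gives them the label -1.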
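
A small sketch of how noise shows up in scikit-learn's `DBSCAN`: noise points receive the label `-1` in `labels_`. The toy coordinates and the `eps` and `min_samples` values are assumptions picked so that the outcome is easy to check by hand.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[0.0, 0.0], [0.0, 0.1], [0.1, 0.0], [0.1, 0.1],   # dense group A
              [5.0, 5.0], [5.0, 5.1], [5.1, 5.0], [5.1, 5.1],   # dense group B
              [20.0, 20.0]])                                    # isolated point

db = DBSCAN(eps=0.5, min_samples=3).fit(X)

# Every point in A and B has 4 points (itself included) within eps,
# so all of them are core points. The isolated point has only itself.
noise_mask = db.labels_ == -1
print("labels:", db.labels_)
print("core sample indices:", db.core_sample_indices_)
print("noise points:", X[noise_mask])
```

Only the isolated point at (20, 20) is reported as noise; the two dense groups become clusters 0 and 1.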


10. Define inertia in the context of K-Means.

    - Inertia is the sum of squared distances between each point and its assigned cluster centroid.
    - Lower inertia means tighter clusters, but inertia always decreases as K grows (it reaches zero when every point is its own cluster), so it cannot be used on its own to choose K.
    - The primary objective of the K-Means algorithm is to minimize the inertia for a fixed K. A lower inertia score means that the data points within each cluster are closer to their respective centroids, indicating denser and more tightly bound clusters.
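
As a quick check of the definition, inertia can be recomputed by hand from the fitted centroids and labels and compared with the `inertia_` attribute of scikit-learn's `KMeans`. The dataset parameters below are arbitrary choices for the sketch.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Look up, for every point, the centroid of the cluster it was assigned to.
assigned_centroids = kmeans.cluster_centers_[kmeans.labels_]

# Inertia = sum of squared distances between points and their centroids.
manual_inertia = np.sum((X - assigned_centroids) ** 2)

print("manual inertia :", manual_inertia)
print("kmeans.inertia_:", kmeans.inertia_)
```

The two values agree up to floating-point rounding.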


11. What is the elbow method in K-Means clustering?

    - The Elbow Method is a popular heuristic technique used to determine the optimal number of clusters (k) for the K-Means clustering algorithm.
    - It works by plotting inertia (the within-cluster sum of squares) against the number of clusters and looking for the point on the graph where the rate of improvement sharply decreases, resembling an elbow joint. Beyond that point, adding clusters reduces inertia only slightly.


12. Describe the concept of "density" in DBSCAN.

    - The concept of "density" is fundamental to the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm. Unlike K-Means, which uses distance to centroids, DBSCAN defines clusters as contiguous regions of high density separated by regions of low density.
    - In DBSCAN, density refers to how closely packed points are in a region. A point lies in a dense region if at least min_samples points fall within radius eps of it, and connected dense regions form a cluster.


13. Can hierarchical clustering be used on categorical data?

    - Yes, hierarchical clustering can be used on categorical data, but it requires specialized dissimilarity measures, such as Hamming, Jaccard, or Gower distance, instead of standard metrics like Euclidean distance.
    - Hierarchical clustering fundamentally relies on a distance matrix between all pairs of data points. For the algorithm to work with categorical data, you need a metric that quantifies the difference between two categorical records, together with a linkage that does not assume Euclidean geometry (single, complete, or average rather than Ward).
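
One possible way to do this with SciPy is to compute Hamming distances with `pdist` and pass the result to `linkage`. The four records and their integer encoding are hypothetical; the codes act only as category identifiers, because Hamming distance checks equality and ignores magnitude.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Hypothetical categorical records (colour, size, shape), integer-encoded.
records = np.array([
    [0, 0, 0],   # red,  small, round
    [0, 0, 1],   # red,  small, square
    [1, 1, 2],   # blue, large, oval
    [1, 1, 3],   # blue, large, star
])

# Hamming distance = fraction of attributes on which two records differ.
D = pdist(records, metric="hamming")

# Average linkage works directly on the condensed distance matrix.
Z = linkage(D, method="average")

# Cut the hierarchy into at most two flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

The two red/small records end up in one cluster and the two blue/large records in the other.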


14. What does a negative Silhouette Score indicate?

    - A negative silhouette value means that a point is, on average, closer to the points of another cluster than to those of its own cluster, which suggests it was assigned to the wrong cluster. A negative average score indicates poor clustering quality overall.
    - The Silhouette Score (s) for a single data point is calculated from two values as s = (b - a) / max(a, b), so s is negative exactly when a > b:
    1. Cohesion (a): The average distance of the point to all other points in its own cluster. A smaller a indicates better cohesion.
    2. Separation (b): The minimum average distance of the point to all points in any other cluster. A larger b indicates better separation.
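
The sketch below computes a and b by hand for one point, using the standard definition s = (b - a) / max(a, b), and checks the result against scikit-learn's `silhouette_samples`. The helper `silhouette_of`, the four points, and both label assignments are made up for illustration.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

X = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]])

def silhouette_of(i, X, labels):
    """Silhouette value of point i (hypothetical helper for this example)."""
    d = np.linalg.norm(X - X[i], axis=1)
    same = labels == labels[i]
    same[i] = False                      # exclude the point itself
    a = d[same].mean()                   # cohesion
    b = min(d[labels == c].mean()        # separation: nearest other cluster
            for c in set(labels) if c != labels[i])
    return (b - a) / max(a, b)

good = np.array([0, 0, 1, 1])   # left pair vs right pair
bad = np.array([0, 1, 1, 0])    # deliberately mixed assignment

s_good = silhouette_of(0, X, good)
s_bad = silhouette_of(0, X, bad)
print("good assignment:", s_good, silhouette_samples(X, good)[0])
print("bad assignment: ", s_bad, silhouette_samples(X, bad)[0])
```

With the sensible assignment the first point scores well above zero; with the mixed assignment its own cluster is farther away than the other one, so the value turns negative.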


15. Explain the term "linkage criteria" in hierarchical clustering.

    - Linkage criteria in hierarchical clustering define the distance between two clusters or groups of data points. Since the hierarchical process involves successively merging or splitting clusters, a rule is needed to calculate the separation between these multi-point groups so the algorithm knows which two clusters are the "closest" and should be merged next or how to define the distance for splitting.

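      - Single linkage: Minimum distance between any two points, one from each cluster.
      - Complete linkage: Maximum distance between any two points, one from each cluster.
      - Average linkage: Mean distance over all pairs of points, one from each cluster.
      - Ward's linkage: Merges the pair of clusters that gives the smallest increase in total within-cluster variance.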
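
The four criteria can be compared on two tiny clusters. This is a hand computation for illustration, assuming Euclidean distance; the clusters `A` and `B` and the helper `sse` are made up for the example.

```python
import numpy as np
from scipy.spatial.distance import cdist

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[3.0, 0.0], [6.0, 0.0]])

# All point-to-point distances between the two clusters: [[3, 6], [2, 5]].
pairwise = cdist(A, B)

single = pairwise.min()     # closest pair
complete = pairwise.max()   # farthest pair
average = pairwise.mean()   # mean over all pairs

def sse(points):
    """Sum of squared distances to the cluster mean."""
    return float(np.sum((points - points.mean(axis=0)) ** 2))

# Ward: how much the within-cluster sum of squares grows if A and B merge.
ward_increase = sse(np.vstack([A, B])) - sse(A) - sse(B)

print(single, complete, average, ward_increase)
```

Here single linkage gives 2, complete linkage 6, average linkage 4, and the Ward increase is 16, so the same pair of clusters can look close or far depending on the criterion.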


16. Why might K-Means clustering perform poorly on data with varying cluster sizes or densities?

    - K-Means clustering often performs poorly on data with varying cluster sizes or densities because its underlying mechanism and objective function are optimized for finding spherical, equally sized, and equally dense clusters.
    1. The Dependence on Mean: K-Means minimizes the Within-Cluster Sum of Squares, which pulls every point toward its cluster's mean. In a large or elongated cluster, points far from the mean can end up nearer to a neighboring centroid and be assigned to the wrong cluster.
    2. The Spherical Assumption: K-Means inherently assumes that clusters are convex and roughly spherical because it uses the Euclidean distance metric and assigns points to the nearest centroid.
    3. The Objective Function Bias: Because the algorithm seeks to reduce the overall variance, it can split a large or sparse cluster into several pieces while merging small or dense clusters with their neighbors.


17. What are the core parameters in DBSCAN, and how do they influence clustering?

    - The DBSCAN algorithm is controlled by two core parameters that define the concept of density and, consequently, the resulting clusters.
    1. Epsilon (eps): A distance threshold that defines the radius of the neighborhood around a given data point.
       - Influence: It determines how close points must be to each other to be considered part of the same cluster. If eps is too small, most points are labeled as noise; if it is too large, separate clusters merge into one.
    2. Minimum points (min_samples in scikit-learn): The minimum number of data points required to be present within the epsilon-neighborhood of a point for that point to be classified as a Core Point.
       - Influence: It defines the minimum required density for a region to be considered a cluster. Larger values demand denser regions, so more points are labeled as noise.
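
The effect of both parameters can be seen on a toy dataset with two tight groups. The coordinates and parameter values are assumptions chosen so that each outcome can be verified by hand; `summarize` is a hypothetical helper.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight groups of four points, about 10 units apart.
X = np.array([[0.0, 0.0], [0.0, 0.5], [0.5, 0.0], [0.5, 0.5],
              [10.0, 0.0], [10.0, 0.5], [10.5, 0.0], [10.5, 0.5]])

def summarize(eps, min_samples):
    """Return (number of clusters, number of noise points)."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int(np.sum(labels == -1))
    return n_clusters, n_noise

print(summarize(eps=1.0, min_samples=3))    # (2, 0): both groups found
print(summarize(eps=12.0, min_samples=3))   # (1, 0): radius spans both groups
print(summarize(eps=0.1, min_samples=3))    # (0, 8): radius too small
print(summarize(eps=1.0, min_samples=6))    # (0, 8): density requirement too strict
```

A moderate radius recovers both groups, an overly large radius merges them, and either a tiny radius or a high `min_samples` turns every point into noise.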


18. How does K-Means++ improve upon standard K-Means initialization?

    - K-Means++ improves upon standard K-Means initialization by using a probabilistic approach to select initial cluster centers that are well separated. The first centroid is chosen uniformly at random from the data; each subsequent centroid is chosen with probability proportional to its squared distance from the nearest centroid already selected.
    - Spreading the initial centroids apart reduces the chance of poor convergence and usually speeds up the algorithm. It often leads to better clustering stability and lower inertia.
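
The seeding rule can be sketched in a few lines of NumPy. This is an illustrative version of the usual K-Means++ procedure, not scikit-learn's code (there, `init="k-means++"` is the default for `KMeans`); the function name `kmeans_pp_init` and the toy data are assumptions.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Pick k initial centroids with K-Means++ style seeding (sketch)."""
    rng = np.random.default_rng(seed)
    # First centroid: a data point chosen uniformly at random.
    centroids = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        # Squared distance from each point to its nearest chosen centroid.
        diffs = X[:, None, :] - np.array(centroids)[None, :, :]
        d2 = np.min(np.sum(diffs ** 2, axis=2), axis=1)
        # Sample the next centroid with probability proportional to d2,
        # so far-away points are much more likely to be picked.
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)

# Three tight groups placed far apart.
offsets = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]])
X = np.vstack([offsets, offsets + [100.0, 0.0], offsets + [0.0, 100.0]])

init = kmeans_pp_init(X, k=3)
print(init)
```

Because already-covered regions have near-zero selection probability, the three seeds land in three different groups on this data.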

19. What is agglomerative clustering?

    - Agglomerative clustering is a bottom-up hierarchical approach. Each point starts as its own cluster, and pairs of clusters are merged iteratively based on distance until one cluster or the desired number of clusters remains.

    - How agglomerative clustering works:
    1. Initialization: Start by treating each data point as a single cluster.
    2. Distance Calculation: Compute the distance between all pairs of clusters using a chosen distance metric and linkage criterion.
    3. Merging: Merge the two closest clusters into a new, single cluster.
    4. Update: Recalculate the distances between the new cluster and all the remaining clusters.
    5. Iteration: Repeat steps 3-4 until all data points are merged into one large cluster or the desired number of clusters is reached.


20. What makes Silhouette Score a better metric than just inertia for model evaluation?

    - Inertia only measures compactness, not separation between clusters.
    - Silhouette Score evaluates both compactness and separation, providing a more comprehensive measure of clustering quality.
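
The difference shows up when both metrics are tracked over several values of k. In this sketch the data has four well-separated blobs (the centers, spread, and seeds are arbitrary choices): inertia keeps falling as k grows, while the silhouette score peaks at the true number of clusters.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Four compact blobs on the corners of a square.
centers = [[0, 0], [10, 0], [0, 10], [10, 10]]
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.5, random_state=0)

inertias, silhouettes = {}, {}
for k in range(2, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_
    silhouettes[k] = silhouette_score(X, km.labels_)
    print(f"k={k}  inertia={inertias[k]:.1f}  silhouette={silhouettes[k]:.3f}")

# Inertia alone would always favor the largest k; silhouette does not.
best_k = max(silhouettes, key=silhouettes.get)
print("k with the highest silhouette score:", best_k)
```

Inertia rewards every extra cluster, so it needs a heuristic such as the elbow method, whereas the silhouette score can be maximized directly.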


# Practical Questions:

21. Generate synthetic data with 4 centers using make_blobs and apply K-Means clustering. Visualize using a scatter plot.

- Answer

      from sklearn.datasets import make_blobs
      from sklearn.cluster import KMeans
      import matplotlib.pyplot as plt

      X, y = make_blobs(n_samples=300, centers=4, random_state=42)
      kmeans = KMeans(n_clusters=4, random_state=42)
      labels = kmeans.fit_predict(X)

      plt.scatter(X[:,0], X[:,1], c=labels, cmap='rainbow')
      plt.scatter(kmeans.cluster_centers_[:,0], kmeans.cluster_centers_[:,1], c='black', marker='X', s=200)
      plt.title("K-Means Clustering with 4 Centers")
      plt.show()

22. Load the Iris dataset and use Agglomerative Clustering to group the data into 3 clusters. Display the first 10 predicted labels.

- Answer
      
      from sklearn.datasets import load_iris
      from sklearn.cluster import AgglomerativeClustering

      # Load the four Iris measurements (the true labels are not used)
      X = load_iris().data

      # Group the samples into 3 clusters (Ward linkage is the default)
      agg = AgglomerativeClustering(n_clusters=3)
      labels = agg.fit_predict(X)

      print("First 10 predicted labels:", labels[:10])


23. Generate synthetic data using make_moons and apply DBSCAN. Highlight outliers in the plot.

- Answer

      from sklearn.datasets import make_moons
      from sklearn.cluster import DBSCAN
      import matplotlib.pyplot as plt

      X, y = make_moons(n_samples=500, noise=0.07, random_state=0)

      # eps and min_samples are illustrative values; tune them for other data
      labels = DBSCAN(eps=0.15, min_samples=5).fit_predict(X)
      outliers = labels == -1   # DBSCAN marks noise points with the label -1

      plt.scatter(X[~outliers, 0], X[~outliers, 1], c=labels[~outliers], cmap='coolwarm', s=25)
      plt.scatter(X[outliers, 0], X[outliers, 1], c='black', marker='x', s=60, label='Outliers')
      plt.title("DBSCAN on make_moons (outliers marked with x)")
      plt.legend()
      plt.show()


24. Load the Wine dataset and apply K-Means clustering after standardizing the features. Print the size of each cluster.

- Answer

      import numpy as np
      from sklearn.datasets import load_wine
      from sklearn.preprocessing import StandardScaler
      from sklearn.cluster import KMeans

      wine = load_wine()
      # Standardize so every feature has zero mean and unit variance
      X_scaled = StandardScaler().fit_transform(wine.data)

      kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
      labels = kmeans.fit_predict(X_scaled)

      # Count how many samples fall into each cluster
      sizes = np.bincount(labels)
      for cluster_id, size in enumerate(sizes):
          print(f"Cluster {cluster_id}: {size} samples")

25. Use make_circles to generate synthetic data and cluster it using DBSCAN. Plot the result

- Answer

      from sklearn.datasets import make_circles
      from sklearn.cluster import DBSCAN
      import matplotlib.pyplot as plt

      X, y = make_circles(n_samples=500, factor=0.5, noise=0.05, random_state=0)

      # eps is an illustrative value chosen for this noise level
      labels = DBSCAN(eps=0.15, min_samples=5).fit_predict(X)

      plt.scatter(X[:,0], X[:,1], c=labels, s=30, cmap='tab10')
      plt.title("DBSCAN on make_circles")
      plt.show()

26. Load the Breast Cancer dataset, apply MinMaxScaler, and use K-Means with 2 clusters. Output the cluster centroids.

- Answer

      import pandas as pd
      from sklearn.datasets import load_breast_cancer
      from sklearn.preprocessing import MinMaxScaler
      from sklearn.cluster import KMeans

      data = load_breast_cancer()
      # Rescale every feature to the [0, 1] range
      X_scaled = MinMaxScaler().fit_transform(data.data)

      kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
      kmeans.fit(X_scaled)

      # One row per centroid, one column per (scaled) feature
      centroids = pd.DataFrame(kmeans.cluster_centers_, columns=data.feature_names)
      print(centroids)

27. Generate synthetic data using make_blobs with varying cluster standard deviations and cluster with DBSCAN.

- Answer

      from sklearn.datasets import make_blobs
      from sklearn.cluster import DBSCAN
      import numpy as np

      X, y = make_blobs(n_samples=500, centers=3, cluster_std=[0.4, 0.8, 1.5], random_state=42)
      # Add scattered noise
      rng = np.random.RandomState(1)
      noise = rng.uniform(low=-8, high=8, size=(30, 2))
      X = np.vstack([X, noise])

      db = DBSCAN(eps=0.6, min_samples=5).fit(X)
      labels = db.labels_
      n_noise = np.sum(labels == -1)
      n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
      print("Clusters found (excluding noise):", n_clusters)
      print("Noise points:", n_noise)

28. Load the Digits dataset, reduce it to 2D using PCA, and visualize clusters from K-Means

- Answer

      from sklearn.datasets import load_digits
      from sklearn.decomposition import PCA
      from sklearn.cluster import KMeans
      import matplotlib.pyplot as plt

      digits = load_digits()
      pca = PCA(n_components=2)
      X_pca = pca.fit_transform(digits.data)

      kmeans = KMeans(n_clusters=10, random_state=42)
      labels = kmeans.fit_predict(X_pca)

      plt.scatter(X_pca[:,0], X_pca[:,1], c=labels, cmap='tab10')
      plt.title("K-Means Clustering on Digits (2D PCA Projection)")
      plt.show()

29. Create synthetic data using make_blobs and evaluate silhouette scores for k = 2 to 5. Display as a bar chart

- Answer

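      from sklearn.datasets import make_blobs
      from sklearn.cluster import KMeans
      from sklearn.metrics import silhouette_score
      import matplotlib.pyplot as plt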

      X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
      scores = []
      for k in range(2, 6):
          kmeans = KMeans(n_clusters=k, random_state=42)
          labels = kmeans.fit_predict(X)
          scores.append(silhouette_score(X, labels))

      plt.bar(range(2, 6), scores, color='skyblue')
      plt.xlabel('Number of Clusters (k)')
      plt.ylabel('Silhouette Score')
      plt.title('Silhouette Scores for k = 2 to 5')
      plt.show()


30. Load the Iris dataset and use hierarchical clustering to group data. Plot a dendrogram with average linkage.

- Answer

      from sklearn.datasets import load_iris
      from scipy.cluster.hierarchy import dendrogram, linkage
      import matplotlib.pyplot as plt

      X = load_iris().data
      Z = linkage(X, method='average')
      plt.figure(figsize=(8, 4))
      dendrogram(Z)
      plt.title("Dendrogram (Average Linkage) - Iris Dataset")
      plt.xlabel("Samples")
      plt.ylabel("Distance")
      plt.show()    

31. Generate synthetic data with overlapping clusters using make_blobs, then apply K-Means and visualize with decision boundaries.

- Answer

      from sklearn.datasets import make_blobs
      from sklearn.cluster import KMeans
      import numpy as np
      import matplotlib.pyplot as plt

      X, _ = make_blobs(n_samples=300, centers=3, cluster_std=2.0, random_state=42)
      kmeans = KMeans(n_clusters=3, random_state=42).fit(X)
      labels = kmeans.labels_

      # Decision boundary
      x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
      y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
      xx, yy = np.meshgrid(np.linspace(x_min, x_max, 300), np.linspace(y_min, y_max, 300))
      Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

      plt.contourf(xx, yy, Z, alpha=0.3, cmap='viridis')
      plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
      plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], c='red', marker='X', s=200)
      plt.title("K-Means Decision Boundaries with Overlapping Clusters")
      plt.show()      

32. Load the Digits dataset and apply DBSCAN after reducing dimensions with t-SNE. Visualize the results.

- Answer

      from sklearn.datasets import load_digits
      from sklearn.manifold import TSNE
      from sklearn.cluster import DBSCAN
      import matplotlib.pyplot as plt

      digits = load_digits()
      X_tsne = TSNE(n_components=2, random_state=42).fit_transform(digits.data)
      db = DBSCAN(eps=5, min_samples=5).fit(X_tsne)
      labels = db.labels_

      plt.scatter(X_tsne[:,0], X_tsne[:,1], c=labels, cmap='tab10', s=10)
      plt.title("DBSCAN Clustering on Digits (t-SNE Reduced Data)")
      plt.show()

33. Generate synthetic data using make_blobs and apply Agglomerative Clustering with complete linkage. Plot the result.

- Answer

      from sklearn.datasets import make_blobs
      from sklearn.cluster import AgglomerativeClustering
      import matplotlib.pyplot as plt

      X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
      agg = AgglomerativeClustering(n_clusters=3, linkage='complete')
      labels = agg.fit_predict(X)

      plt.scatter(X[:,0], X[:,1], c=labels, cmap='rainbow')
      plt.title("Agglomerative Clustering (Complete Linkage)")
      plt.show()


34. Load the Breast Cancer dataset and compare inertia values for K = 2 to 6 using K-Means. Show results in a line plot.

- Answer

      from sklearn.datasets import load_breast_cancer
      from sklearn.preprocessing import StandardScaler
      from sklearn.cluster import KMeans
      import matplotlib.pyplot as plt

      data = load_breast_cancer()
      scaler = StandardScaler()
      X_scaled = scaler.fit_transform(data.data)

      inertias = []
      for k in range(2, 7):
          kmeans = KMeans(n_clusters=k, random_state=42).fit(X_scaled)
          inertias.append(kmeans.inertia_)

      plt.plot(range(2, 7), inertias, marker='o')
      plt.title("K-Means Inertia Values for K = 2 to 6 (Breast Cancer Data)")
      plt.xlabel("Number of Clusters (k)")
      plt.ylabel("Inertia")
      plt.show()


35. Generate synthetic concentric circles using make_circles and cluster using Agglomerative Clustering with single linkage.

- Answer

      from sklearn.datasets import make_circles
      from sklearn.cluster import AgglomerativeClustering
      import matplotlib.pyplot as plt

      X, _ = make_circles(n_samples=300, factor=0.5, noise=0.05, random_state=42)
      agg = AgglomerativeClustering(n_clusters=2, linkage='single')
      labels = agg.fit_predict(X)

      plt.scatter(X[:,0], X[:,1], c=labels, cmap='rainbow')
      plt.title("Agglomerative Clustering (Single Linkage) on make_circles")
      plt.show()

36. Use the Wine dataset, apply DBSCAN after scaling the data, and count the number of clusters (excluding noise).

- Answer

      from sklearn.datasets import load_wine
      from sklearn.preprocessing import StandardScaler
      from sklearn.cluster import DBSCAN
      import numpy as np

      wine = load_wine()
      scaler = StandardScaler()
      X_scaled = scaler.fit_transform(wine.data)

      db = DBSCAN(eps=1.5, min_samples=5).fit(X_scaled)
      labels = db.labels_
      n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
      print("Number of clusters (excluding noise):", n_clusters)


37. Generate synthetic data with make_blobs and apply KMeans. Then plot the cluster centers on top of the data points.

- Answer

      from sklearn.datasets import make_blobs
      from sklearn.cluster import KMeans
      import matplotlib.pyplot as plt

      X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
      kmeans = KMeans(n_clusters=4, random_state=42).fit(X)
      labels = kmeans.labels_

      plt.scatter(X[:,0], X[:,1], c=labels, cmap='rainbow')
      plt.scatter(kmeans.cluster_centers_[:,0], kmeans.cluster_centers_[:,1], c='black', s=200, marker='X')
      plt.title("KMeans Clustering with Cluster Centers")
      plt.show()

38. Load the Iris dataset, cluster with DBSCAN, and print how many samples were identified as noise.

- Answer

      from sklearn.datasets import load_iris
      from sklearn.preprocessing import StandardScaler
      from sklearn.cluster import DBSCAN

      iris = load_iris()
      X_scaled = StandardScaler().fit_transform(iris.data)

      db = DBSCAN(eps=0.8, min_samples=5).fit(X_scaled)
      labels = db.labels_
      n_noise = list(labels).count(-1)
      print("Number of noise samples identified by DBSCAN:", n_noise)

39. Generate synthetic non-linearly separable data using make_moons, apply K-Means, and visualize the clustering result.

- Answer

      from sklearn.datasets import make_moons
      from sklearn.cluster import KMeans
      import matplotlib.pyplot as plt

      X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)
      kmeans = KMeans(n_clusters=2, random_state=42).fit(X)
      labels = kmeans.labels_

      plt.scatter(X[:,0], X[:,1], c=labels, cmap='coolwarm')
      plt.title("KMeans Clustering on make_moons (Non-linear Data)")
      plt.show()

40. Load the Digits dataset, apply PCA to reduce to 3 components, then use KMeans and visualize with a 3D scatter plot.

- Answer

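      from sklearn.datasets import load_digits
      from sklearn.decomposition import PCA
      from sklearn.cluster import KMeans
      import matplotlib.pyplot as plt
      from mpl_toolkits.mplot3d import Axes3D  # only needed to register the 3D projection on older Matplotlib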

      digits = load_digits()
      pca = PCA(n_components=3)
      X_pca = pca.fit_transform(digits.data)

      kmeans = KMeans(n_clusters=10, random_state=42)
      labels = kmeans.fit_predict(X_pca)

      fig = plt.figure(figsize=(8,6))
      ax = fig.add_subplot(111, projection='3d')
      ax.scatter(X_pca[:,0], X_pca[:,1], X_pca[:,2], c=labels, cmap='tab10', s=15)
      ax.set_title("3D Visualization of KMeans Clustering on Digits (PCA Reduced)")
      plt.show()

41. Generate synthetic blobs with 5 centers and apply KMeans. Then use silhouette_score to evaluate the clustering.

- Answer

      from sklearn.datasets import make_blobs
      from sklearn.cluster import KMeans
      from sklearn.metrics import silhouette_score

      X, _ = make_blobs(n_samples=400, centers=5, random_state=42)
      kmeans = KMeans(n_clusters=5, random_state=42).fit(X)
      labels = kmeans.labels_

      score = silhouette_score(X, labels)
      print("Silhouette Score for KMeans with 5 centers:", score)


42. Load the Breast Cancer dataset, reduce dimensionality using PCA, and apply Agglomerative Clustering. Visualize in 2D.

- Answer

      from sklearn.datasets import load_breast_cancer
      from sklearn.decomposition import PCA
      from sklearn.cluster import AgglomerativeClustering
      from sklearn.preprocessing import StandardScaler
      import matplotlib.pyplot as plt

      data = load_breast_cancer()
      X_scaled = StandardScaler().fit_transform(data.data)
      X_pca = PCA(n_components=2).fit_transform(X_scaled)

      agg = AgglomerativeClustering(n_clusters=2)
      labels = agg.fit_predict(X_pca)

      plt.scatter(X_pca[:,0], X_pca[:,1], c=labels, cmap='rainbow')
      plt.title("Agglomerative Clustering on Breast Cancer (PCA Reduced)")
      plt.show()

43. Generate noisy circular data using make_circles and visualize clustering results from KMeans and DBSCAN side-by-side.

- Answer

      from sklearn.datasets import make_circles
      from sklearn.cluster import KMeans, DBSCAN
      import matplotlib.pyplot as plt

      X, _ = make_circles(n_samples=300, factor=0.5, noise=0.05, random_state=42)

      # KMeans
      kmeans = KMeans(n_clusters=2, random_state=42).fit(X)
      labels_kmeans = kmeans.labels_

      # DBSCAN
      db = DBSCAN(eps=0.1, min_samples=5).fit(X)
      labels_dbscan = db.labels_

      fig, axes = plt.subplots(1, 2, figsize=(10, 4))
      axes[0].scatter(X[:,0], X[:,1], c=labels_kmeans, cmap='rainbow')
      axes[0].set_title("KMeans on make_circles")
      axes[1].scatter(X[:,0], X[:,1], c=labels_dbscan, cmap='rainbow')
      axes[1].set_title("DBSCAN on make_circles")
      plt.show()

44. Load the Iris dataset and plot the Silhouette Coefficient for each sample after KMeans clustering.

- Answer

      from sklearn.datasets import load_iris
      from sklearn.cluster import KMeans
      from sklearn.metrics import silhouette_samples
      import matplotlib.pyplot as plt

      iris = load_iris()
      X = iris.data
      kmeans = KMeans(n_clusters=3, random_state=42).fit(X)
      labels = kmeans.labels_

      sil_samples = silhouette_samples(X, labels)
      plt.bar(range(len(sil_samples)), sil_samples, color='skyblue')
      plt.title("Silhouette Coefficient for Each Sample (Iris Dataset)")
      plt.xlabel("Sample Index")
      plt.ylabel("Silhouette Coefficient")
      plt.show()

45. Generate synthetic data using make_blobs and apply Agglomerative Clustering with 'average' linkage. Visualize clusters.

- Answer

      from sklearn.datasets import make_blobs
      from sklearn.cluster import AgglomerativeClustering
      import matplotlib.pyplot as plt

      X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
      agg = AgglomerativeClustering(n_clusters=4, linkage='average')
      labels = agg.fit_predict(X)

      plt.scatter(X[:,0], X[:,1], c=labels, cmap='rainbow')
      plt.title("Agglomerative Clustering (Average Linkage)")
      plt.show()

46. Load the Wine dataset, apply KMeans, and visualize the cluster assignments in a seaborn pairplot (first 4 features)

- Answer
      
      from sklearn.datasets import load_wine
      from sklearn.preprocessing import StandardScaler
      from sklearn.cluster import KMeans
      import matplotlib.pyplot as plt
      import seaborn as sns
      import pandas as pd

      wine = load_wine()
      scaler = StandardScaler()
      X_scaled = scaler.fit_transform(wine.data[:, :4])

      kmeans = KMeans(n_clusters=3, random_state=42)
      labels = kmeans.fit_predict(X_scaled)

      df = pd.DataFrame(X_scaled, columns=wine.feature_names[:4])
      df['Cluster'] = labels
      sns.pairplot(df, hue='Cluster', palette='tab10')
      plt.suptitle("KMeans Clustering on Wine Data (First 4 Features)", y=1.02)
      plt.show()

47. Generate noisy blobs using make_blobs and use DBSCAN to identify both clusters and noise points. Print the count.

- Answer

      from sklearn.datasets import make_blobs
      from sklearn.cluster import DBSCAN
      import numpy as np

      X, _ = make_blobs(n_samples=300, centers=3, cluster_std=2.0, random_state=42)
      db = DBSCAN(eps=1.0, min_samples=5).fit(X)
      labels = db.labels_

      n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
      n_noise = list(labels).count(-1)

      print("Number of clusters:", n_clusters)
      print("Number of noise points:", n_noise)

48. Load the Digits dataset, reduce dimensions using t-SNE, then apply Agglomerative Clustering and plot the clusters.

- Answer

      from sklearn.datasets import load_digits
      from sklearn.manifold import TSNE
      from sklearn.cluster import AgglomerativeClustering
      import matplotlib.pyplot as plt

      digits = load_digits()
      X_tsne = TSNE(n_components=2, random_state=42).fit_transform(digits.data)

      agg = AgglomerativeClustering(n_clusters=10)
      labels = agg.fit_predict(X_tsne)

      plt.scatter(X_tsne[:,0], X_tsne[:,1], c=labels, cmap='tab10', s=10)
      plt.title("Agglomerative Clustering on Digits Dataset (t-SNE Reduced)")
      plt.show()
