**Theoretical Questions**

1. What is unsupervised learning in the context of machine learning?
* Unsupervised learning is a type of machine learning where algorithms analyze and learn patterns from data without using labeled outputs or predefined categories. Instead of predicting a specific target, the model explores the hidden structure of data, grouping similar items or reducing complexity. Common techniques include clustering, such as k-means, and dimensionality reduction, such as principal component analysis. It is often used for market segmentation, anomaly detection, and recommendation systems. Since no labels are provided, the system relies on identifying similarities, differences, and relationships within the dataset, helping uncover meaningful insights and hidden structures in raw, unstructured information.
2. How does the K-Means clustering algorithm work?
* K-Means clustering is an unsupervised learning algorithm used to group data into K clusters based on similarity. It starts by initializing K centroids, either randomly or by a method like k-means++. Each data point is assigned to the nearest centroid, forming clusters. Then, centroids are recalculated as the mean of points within each cluster. The process repeats until centroids stabilize or a maximum number of iterations is reached. K-Means minimizes the within-cluster sum of squares, ensuring points within the same cluster are close. It is widely applied in customer segmentation, pattern recognition, and image compression.
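The assign-and-update loop described above can be sketched in a few lines of NumPy. This is a minimal illustration, not scikit-learn's implementation: the function name `kmeans`, the plain random initialization, and the toy data are assumptions made for the example.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal Lloyd's algorithm: random initialization, then alternate assign/update."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct data points at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical toy data: two well-separated groups of three points
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centroids = kmeans(X, k=2)
print(labels)
print(centroids)
```

Which group receives label 0 depends on the random initialization, so only the grouping itself is meaningful.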
3. Explain the concept of a dendrogram in hierarchical clustering.
* A dendrogram is a tree-like diagram that represents the arrangement of clusters in hierarchical clustering. It shows how individual data points or groups of points are merged step by step into larger clusters. The vertical axis represents the distance or dissimilarity at which clusters merge, while the horizontal axis lists the data points. At the bottom, each data point starts as its own cluster, and moving upward, clusters join until all points form a single cluster. Cutting the dendrogram at a chosen height yields a flat clustering: the number of vertical branches the cut crosses is the number of clusters, and placing the cut across a large vertical gap between merges is a common heuristic for choosing that number.
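As a small sketch of cutting a dendrogram at a height, SciPy's `fcluster` with `criterion='distance'` turns a linkage matrix into flat cluster labels. The 1-D toy data and the two cut heights are assumptions chosen so the merge heights are easy to read off.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical 1-D toy data: two tight pairs and one distant point
X = np.array([[0.0], [1.0], [10.0], [11.0], [50.0]])

# Single linkage merges these points at heights 1, 1, 9 and 39
Z = linkage(X, method='single')

# Cut at height 5: only the merges at height 1 are kept, leaving 3 clusters
labels_at_5 = fcluster(Z, t=5, criterion='distance')
# Cut at height 20: the merge at height 9 is also kept, leaving 2 clusters
labels_at_20 = fcluster(Z, t=20, criterion='distance')

print(len(set(labels_at_5)), len(set(labels_at_20)))  # prints: 3 2
```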
4. What is the main difference between K-Means and Hierarchical Clustering?
* The main difference between K-Means and Hierarchical Clustering lies in their approach to grouping data. K-Means is a partitioning method that requires the number of clusters to be specified beforehand and works by iteratively assigning points to centroids, making it efficient for large datasets but sensitive to initialization. In contrast, Hierarchical Clustering builds a hierarchy of clusters either through agglomerative (bottom-up) or divisive (top-down) approaches, without requiring the number of clusters initially. It produces a dendrogram for visualization, but is computationally expensive and less suitable for very large datasets compared to K-Means.
5. What are the advantages of DBSCAN over K-Means?
* DBSCAN (Density-Based Spatial Clustering of Applications with Noise) has several advantages over K-Means. Unlike K-Means, it does not require specifying the number of clusters in advance, making it more flexible when the number of groups is unknown. DBSCAN can discover clusters of arbitrary shape, while K-Means tends to produce roughly spherical (convex) clusters. It also identifies noise and outliers by labeling them as points that belong to no cluster, whereas K-Means assigns every point to some cluster. Additionally, DBSCAN has no random centroid initialization, so its results depend on the density parameters rather than on a starting configuration. However, it can struggle when clusters have widely varying densities, because a single ε and MinPts setting may not suit all of them.
6. When would you use Silhouette Score in clustering?
* The Silhouette Score is used in clustering to measure how well data points fit within their assigned clusters compared to other clusters. It combines two concepts: cohesion (how close a point is to others in the same cluster) and separation (how far it is from points in different clusters). The score ranges from -1 to +1, where values close to +1 indicate well-separated, dense clusters, values near 0 suggest overlapping clusters, and negative values mean points may be in the wrong cluster. It is especially useful for evaluating clustering quality and selecting the optimal number of clusters.
7. What are the limitations of Hierarchical Clustering?
* Hierarchical clustering has several limitations:

1. **Computational Complexity** – It is slow on large datasets because it computes all pairwise distances and updates them after every merge; a naive agglomerative implementation takes O(n³) time.
2. **Scalability Issues** – Storing the pairwise distance matrix takes O(n²) memory, which makes it unsuitable for very large datasets.
3. **Irreversibility** – Once clusters are merged or split, the process cannot be undone, which may lead to suboptimal results.
4. **Sensitivity to Noise/Outliers** – Outliers can distort the clustering structure significantly.
5. **Choice of Linkage/Distance Metric** – Results vary depending on linkage criteria and distance measure, making it less consistent.
8. Why is feature scaling important in clustering algorithms like K-Means?
* Feature scaling is important in clustering algorithms like K-Means because the algorithm relies on distance measures (usually Euclidean distance) to assign points to clusters. If features are on different scales, variables with larger ranges dominate the distance calculation, biasing the clustering results. For example, income measured in thousands can overshadow age measured in years, making clusters form mainly on income. Scaling methods like standardization (z-score) or normalization (min-max) bring features to a comparable range, ensuring each variable contributes equally. Without scaling, K-Means may produce inaccurate or misleading clusters that do not truly reflect the data’s structure.
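The income-versus-age effect can be checked numerically. The three records below are made-up values, and the z-score is computed directly with NumPy rather than through a scaler class.

```python
import numpy as np

# Columns: age (years), income (currency units); hypothetical toy values
X = np.array([[25.0, 50_000.0],
              [60.0, 51_000.0],
              [26.0, 80_000.0]])

def dist(a, b):
    """Euclidean distance between two records."""
    return float(np.linalg.norm(a - b))

# Unscaled: income dominates, so person 0 looks far closer to person 1
# (similar income, 35-year age gap) than to person 2 (similar age)
raw_01, raw_02 = dist(X[0], X[1]), dist(X[0], X[2])

# z-score standardization: subtract the column mean, divide by the column std
Z = (X - X.mean(axis=0)) / X.std(axis=0)
z_01, z_02 = dist(Z[0], Z[1]), dist(Z[0], Z[2])

print(raw_01, raw_02)  # the income gap makes raw_02 about 30 times raw_01
print(z_01, z_02)      # after scaling, both gaps are of similar size
```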
9. How does DBSCAN identify noise points?
* DBSCAN identifies noise points based on density criteria. It defines two parameters: ε (epsilon), the maximum radius to consider neighbors, and MinPts, the minimum number of points required to form a dense region (cluster).

* A **core point** has at least MinPts within its ε-neighborhood.
* A **border point** lies within the ε-neighborhood of a core point but has fewer than MinPts neighbors.
* Any point that is neither a core point nor a border point is considered noise.

Thus, DBSCAN effectively labels outliers as noise while forming clusters only in dense regions.
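The three point types can be reproduced with a short NumPy sketch. It only classifies points and does not build the clusters themselves; the function name `classify_points` and the toy coordinates are assumptions for illustration.

```python
import numpy as np

def classify_points(X, eps, min_pts):
    """Label each point 'core', 'border', or 'noise' using DBSCAN's density rules."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = dists <= eps                  # each point counts as its own neighbor
    is_core = neighbors.sum(axis=1) >= min_pts
    kinds = []
    for i in range(len(X)):
        if is_core[i]:
            kinds.append("core")
        elif np.any(neighbors[i] & is_core):
            kinds.append("border")            # within eps of at least one core point
        else:
            kinds.append("noise")
    return kinds

# Hypothetical toy data
X = np.array([[0.0, 0.0], [0.0, 0.5], [0.5, 0.0], [0.5, 0.5],  # dense square
              [1.1, 0.5],                                       # just off the square
              [5.0, 5.0]])                                      # isolated point
kinds = classify_points(X, eps=0.75, min_pts=4)
print(kinds)  # ['core', 'core', 'core', 'core', 'border', 'noise']
```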
10. Define inertia in the context of K-Means.
* In the context of K-Means clustering, inertia is a measure of how well the data points are clustered around their respective centroids. It is calculated as the sum of squared distances between each data point and the centroid of the cluster it belongs to. Lower inertia values indicate that points are closer to their cluster centroids, suggesting tighter, more compact clusters. However, inertia alone cannot determine the optimal number of clusters, since it typically decreases as the number of clusters increases, so it is often used in combination with methods like the **elbow method** for cluster selection.
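A worked example of the definition, assuming labels and centroids are already known (the helper name `inertia` and the four 2-D points are made up for illustration):

```python
import numpy as np

def inertia(X, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return float(((X - centroids[labels]) ** 2).sum())

X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]])

# Two clusters, each centroid at the mean of its two points
labels_k2 = np.array([0, 0, 1, 1])
centroids_k2 = np.array([[1.0, 0.0], [11.0, 0.0]])
inertia_k2 = inertia(X, labels_k2, centroids_k2)   # 1 + 1 + 1 + 1 = 4

# One cluster, centroid at the overall mean
labels_k1 = np.zeros(4, dtype=int)
centroids_k1 = X.mean(axis=0, keepdims=True)
inertia_k1 = inertia(X, labels_k1, centroids_k1)   # 36 + 16 + 16 + 36 = 104

print(inertia_k1, inertia_k2)  # prints: 104.0 4.0
```

The drop from 104 to 4 when going from one cluster to two shows why inertia by itself always favors more clusters.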
11. What is the elbow method in K-Means clustering?
* The elbow method is a technique used to determine a suitable number of clusters (K) in K-Means clustering. It involves running K-Means for a range of K values, calculating the inertia (sum of squared distances of points to their cluster centroids) for each K, and plotting inertia against K. As K increases, inertia decreases because clusters are smaller and points are closer to centroids. The “elbow” point on the plot, where the rate of decrease sharply slows, indicates a suitable number of clusters, balancing compactness and simplicity. This helps avoid overfitting while capturing the natural structure in the data.
12. Describe the concept of "density" in DBSCAN.
* In DBSCAN, density refers to the number of data points within a specific neighborhood around a point, defined by a distance parameter called ε (epsilon). A point lies in a dense region, and is called a core point, if there are at least MinPts points (including itself) within its ε-radius. Dense regions form the core of clusters, while areas with lower density are treated as borders or noise. Essentially, density measures how tightly points are packed together: clusters are high-density regions separated by low-density areas. This concept allows DBSCAN to identify clusters of arbitrary shapes and automatically distinguish outliers from meaningful clusters.
13. Can hierarchical clustering be used on categorical data?
* Yes, hierarchical clustering can be used on categorical data, but it requires careful handling because standard distance metrics like Euclidean distance are designed for numerical data. For categorical features, alternative similarity or dissimilarity measures are used, such as:

* **Hamming distance** – counts the number of mismatches between categorical values.
* **Gower distance** – handles mixed data types, including categorical and numerical features.
* **Jaccard similarity** – measures similarity between sets, useful for binary/categorical attributes.

Once a suitable distance metric is chosen, hierarchical clustering can group categorical data into meaningful clusters, producing a dendrogram for visualization.
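Two of the measures listed above are simple enough to write out directly. The helper names and the sample records are assumptions for illustration; in practice the resulting pairwise dissimilarities would be passed to a hierarchical clustering routine as a precomputed distance matrix.

```python
def hamming(a, b):
    """Number of positions at which two equal-length categorical records differ."""
    return sum(x != y for x, y in zip(a, b))

def jaccard_similarity(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Hypothetical categorical records: (color, size, shape)
r1 = ("red", "small", "round")
r2 = ("red", "large", "round")
print(hamming(r1, r2))  # prints: 1  (only 'size' differs)

# Hypothetical shopping baskets treated as sets of items
b1 = {"milk", "bread", "eggs"}
b2 = {"milk", "bread", "jam"}
print(jaccard_similarity(b1, b2))  # prints: 0.5  (2 shared items out of 4 distinct)
```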
14. What does a negative Silhouette Score indicate?
* A negative Silhouette Score indicates that a data point is closer to points in a neighboring cluster than to points in its own cluster, meaning it may have been misclassified. In general:

* **+1** → The point is well-matched to its own cluster and far from neighboring clusters.
* **0** → The point lies on or near the boundary between clusters.
* **Negative** → The point is likely assigned to the wrong cluster, suggesting poor clustering structure or overlap between clusters.

Negative scores highlight areas where the clustering algorithm failed to separate clusters effectively.
15. Explain the term "linkage criteria" in hierarchical clustering.
* In hierarchical clustering, linkage criteria determine how the distance between clusters is calculated when merging or splitting them. It defines the rule for measuring similarity or dissimilarity between sets of points rather than individual points. Common linkage methods include:

* **Single linkage** – distance between the closest points of two clusters.
* **Complete linkage** – distance between the farthest points of two clusters.
* **Average linkage** – average distance between all pairs of points in the two clusters.
* **Ward’s linkage** – minimizes the increase in total within-cluster variance after merging.

The choice of linkage affects the shape and structure of the resulting dendrogram.
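For two small clusters, the four criteria can be computed by hand. The clusters `A` and `B` below are made-up points on a line; the Ward value is computed directly as the increase in within-cluster sum of squares caused by the merge.

```python
import numpy as np

# Two hypothetical clusters of two points each
A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[3.0, 0.0], [5.0, 0.0]])

# All pairwise distances between a point of A and a point of B
pairwise = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

single = float(pairwise.min())     # closest pair:   2.0
complete = float(pairwise.max())   # farthest pair:  5.0
average = float(pairwise.mean())   # mean of 3, 5, 2, 4: 3.5

def sse(P):
    """Within-cluster sum of squared distances to the cluster mean."""
    return float(((P - P.mean(axis=0)) ** 2).sum())

# Ward: how much the total within-cluster variance grows if A and B are merged
ward_increase = sse(np.vstack([A, B])) - sse(A) - sse(B)  # 14.75 - 0.5 - 2.0 = 12.25

print(single, complete, average, ward_increase)
```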
16. Why might K-Means clustering perform poorly on data with varying cluster sizes or densities?
* K-Means clustering can perform poorly on data with varying cluster sizes or densities because it assumes that clusters are spherical and roughly equal in size. The algorithm assigns points to the nearest centroid based on Euclidean distance, which can cause several issues:

1. **Unequal cluster sizes** – Larger clusters may dominate smaller ones, leading to misclassification.
2. **Varying densities** – Sparse clusters may be merged with denser clusters, or dense clusters may be split incorrectly.
3. **Non-spherical shapes** – K-Means cannot capture elongated or irregularly shaped clusters.

As a result, clusters may be poorly defined, and cluster boundaries may be inaccurate.
17. What are the core parameters in DBSCAN, and how do they influence clustering?
* The two core parameters in DBSCAN are ε (epsilon) and MinPts:

1. **ε (epsilon)** – Defines the radius around a point to consider its neighbors. It determines how close points must be to each other to be considered part of the same cluster. A small ε may result in many points being labeled as noise, while a large ε can merge distinct clusters.

2. **MinPts** – The minimum number of points required within the ε-radius to form a **core point**. Higher MinPts make clusters denser and more stringent, while lower values allow sparser clusters.

Together, these parameters control cluster density, size, and noise detection, shaping the overall clustering result.
18. How does K-Means++ improve upon standard K-Means initialization?
* **K-Means++** improves standard K-Means by using a smarter method to initialize centroids, reducing the risk of poor clustering due to random initialization.

* In standard K-Means, centroids are chosen randomly, which can lead to slow convergence or suboptimal clusters.
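* K-Means++ selects the first centroid randomly, then chooses subsequent centroids probabilistically, giving higher chances to points **farther from existing centroids** (each point is picked with probability proportional to its squared distance from the nearest centroid already chosen).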

This ensures that initial centroids are well-spaced, improving convergence speed and often producing better cluster quality. K-Means++ reduces the likelihood of empty or imbalanced clusters compared to random initialization.
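The seeding step can be sketched as follows. This is a plain version of the distance-weighted sampling idea, not scikit-learn's implementation; the function name `kmeans_pp_init` and the toy data are assumptions, and the sketch assumes the data contains no duplicate points.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Pick k initial centroids with K-Means++ style distance-weighted sampling."""
    rng = np.random.default_rng(seed)
    # First centroid: one data point chosen uniformly at random
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        chosen = np.array(centroids)
        # Squared distance from every point to its nearest already-chosen centroid
        d2 = ((X[:, None, :] - chosen[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        # Sample the next centroid with probability proportional to that squared
        # distance, so already-chosen points (distance 0) cannot be picked again
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)

# Hypothetical toy data: two well-separated groups
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
init_centroids = kmeans_pp_init(X, k=2)
print(init_centroids)
```

The returned array would then replace the random initialization in the usual assign-and-update loop.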
19. What is agglomerative clustering?
* Agglomerative clustering is a type of hierarchical clustering that follows a bottom-up approach. In this method, each data point starts as its own individual cluster. The algorithm then repeatedly merges the closest pair of clusters based on a chosen distance metric and linkage criterion, gradually building larger clusters. This process continues until all points are merged into a single cluster or a stopping condition (like a desired number of clusters) is reached. The result can be visualized using a dendrogram, which shows the sequence of merges and helps identify the natural grouping structure in the data.
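The bottom-up merging can be traced with a deliberately naive implementation. It uses single linkage and recomputes every cluster distance at each step, so it is only meant for tiny inputs; the function name `agglomerative_single` and the 1-D toy data are assumptions for illustration.

```python
import numpy as np

def agglomerative_single(X, n_clusters=1):
    """Naive bottom-up clustering with single linkage; returns clusters and merge log."""
    clusters = [[i] for i in range(len(X))]   # every point starts as its own cluster
    merges = []
    while len(clusters) > n_clusters:
        best = None
        # Find the pair of clusters whose closest members are nearest to each other
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), float(d)))
        clusters[a] = clusters[a] + clusters[b]   # merge b into a
        del clusters[b]
    return clusters, merges

# Hypothetical 1-D toy data
X = np.array([[0.0], [1.0], [5.0], [6.5]])
clusters, merges = agglomerative_single(X)
for left, right, height in merges:
    print(left, right, height)
# [0] [1] 1.0
# [2] [3] 1.5
# [0, 1] [2, 3] 4.0
```

The merge heights in the log are exactly the heights at which branches join in the corresponding dendrogram.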
20. What makes Silhouette Score a better metric than just inertia for model evaluation?
* The **Silhouette Score** is often considered better than just inertia for clustering evaluation because it evaluates both cohesion and separation, rather than only cluster compactness.

* **Inertia** measures the sum of squared distances between points and their cluster centroids, reflecting only how tightly points are grouped within clusters. It decreases as the number of clusters increases, which can be misleading.
* **Silhouette Score** considers:

  1. **Cohesion** – how close points are to their own cluster.
  2. **Separation** – how far points are from other clusters.

This provides a more holistic view of clustering quality and helps assess if clusters are well-separated and meaningful.


**Practical Questions**

21. Generate synthetic data with 4 centers using make_blobs and apply K-Means clustering. Visualize using a scatter plot.
* # Import libraries
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Generate synthetic data with 4 centers
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)

# Apply K-Means clustering
kmeans = KMeans(n_clusters=4, random_state=42)
y_kmeans = kmeans.fit_predict(X)

# Plotting the clusters
plt.figure(figsize=(8,6))
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')

# Plot cluster centroids
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, marker='X')
plt.title("K-Means Clustering on Synthetic Data")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()
22.  Load the Iris dataset and use Agglomerative Clustering to group the data into 3 clusters. Display the first 10
predicted labels.
* # Import libraries
from sklearn.datasets import load_iris
from sklearn.cluster import AgglomerativeClustering

# Load Iris dataset
iris = load_iris()
X = iris.data

# Apply Agglomerative Clustering
agglo = AgglomerativeClustering(n_clusters=3, linkage='ward')
labels = agglo.fit_predict(X)

# Display first 10 predicted labels
print("First 10 predicted labels:", labels[:10])
23. Generate synthetic data using make_moons and apply DBSCAN. Highlight outliers in the plot.
* # Import libraries
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN
import numpy as np

# Generate synthetic two-moons data
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=42)

# Apply DBSCAN
dbscan = DBSCAN(eps=0.2, min_samples=5)
labels = dbscan.fit_predict(X)

# Identify outliers (noise points labeled as -1)
outliers = X[labels == -1]
clustered_points = X[labels != -1]  # all non-noise points (core and border alike)

# Plot clusters and outliers
plt.figure(figsize=(8,6))
plt.scatter(clustered_points[:,0], clustered_points[:,1], c=labels[labels != -1], cmap='viridis', s=50, label='Clustered points')
plt.scatter(outliers[:,0], outliers[:,1], c='red', s=50, marker='x', label='Outliers')
plt.title("DBSCAN Clustering on Two-Moons Data")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend()
plt.show()
24.  Load the Wine dataset and apply K-Means clustering after standardizing the features. Print the size of each
cluster.
* # Import libraries
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import numpy as np

# Load Wine dataset
wine = load_wine()
X = wine.data

# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply K-Means clustering
kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(X_scaled)

# Print size of each cluster
unique, counts = np.unique(labels, return_counts=True)
cluster_sizes = dict(zip(unique, counts))
print("Cluster sizes:", cluster_sizes)
25.  Use make_circles to generate synthetic data and cluster it using DBSCAN. Plot the result.
* # Import libraries
import matplotlib.pyplot as plt
from sklearn.datasets import make_circles
from sklearn.cluster import DBSCAN
import numpy as np

# Generate synthetic circular data
X, y_true = make_circles(n_samples=300, noise=0.05, factor=0.5, random_state=42)

# Apply DBSCAN
dbscan = DBSCAN(eps=0.15, min_samples=5)
labels = dbscan.fit_predict(X)

# Identify outliers (points labeled as -1)
outliers = X[labels == -1]
clustered_points = X[labels != -1]

# Plot clusters and outliers
plt.figure(figsize=(8,6))
plt.scatter(clustered_points[:,0], clustered_points[:,1], c=labels[labels != -1], cmap='viridis', s=50, label='Clustered points')
plt.scatter(outliers[:,0], outliers[:,1], c='red', s=50, marker='x', label='Outliers')
plt.title("DBSCAN Clustering on Circular Data")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend()
plt.show()
26.  Load the Breast Cancer dataset, apply MinMaxScaler, and use K-Means with 2 clusters. Output the cluster
centroids.
* # Import libraries
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans

# Load Breast Cancer dataset
cancer = load_breast_cancer()
X = cancer.data

# Apply MinMaxScaler
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# Apply K-Means with 2 clusters
kmeans = KMeans(n_clusters=2, random_state=42)
kmeans.fit(X_scaled)

# Output cluster centroids
print("Cluster centroids:\n", kmeans.cluster_centers_)
27. Generate synthetic data using make_blobs with varying cluster standard deviations and cluster with
DBSCAN.
* # Import libraries
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN
import numpy as np

# Generate synthetic data with varying cluster std deviations
X, y_true = make_blobs(n_samples=300, centers=3, cluster_std=[0.3, 1.0, 2.0], random_state=42)

# Apply DBSCAN
dbscan = DBSCAN(eps=0.8, min_samples=5)
labels = dbscan.fit_predict(X)

# Identify outliers (points labeled as -1)
outliers = X[labels == -1]
clustered_points = X[labels != -1]

# Plot clusters and outliers
plt.figure(figsize=(8,6))
plt.scatter(clustered_points[:,0], clustered_points[:,1], c=labels[labels != -1], cmap='viridis', s=50, label='Clustered points')
plt.scatter(outliers[:,0], outliers[:,1], c='red', s=50, marker='x', label='Outliers')
plt.title("DBSCAN Clustering on Blobs with Varying Std Dev")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend()
plt.show()
28. Load the Digits dataset, reduce it to 2D using PCA, and visualize clusters from K-Means.
* # Import libraries
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Load Digits dataset
digits = load_digits()
X = digits.data

# Reduce to 2D using PCA
pca = PCA(n_components=2, random_state=42)
X_pca = pca.fit_transform(X)

# Apply K-Means clustering
kmeans = KMeans(n_clusters=10, random_state=42)
labels = kmeans.fit_predict(X_pca)

# Plot the clusters
plt.figure(figsize=(8,6))
plt.scatter(X_pca[:,0], X_pca[:,1], c=labels, cmap='tab10', s=50)
plt.title("K-Means Clusters on PCA-Reduced Digits Data")
plt.xlabel("PCA Component 1")
plt.ylabel("PCA Component 2")
plt.colorbar(label='Cluster Label')
plt.show()
29. Create synthetic data using make_blobs and evaluate silhouette scores for k = 2 to 5. Display as a bar chart.
* # Import libraries
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Generate synthetic data
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)

# Evaluate silhouette scores for k = 2 to 5
sil_scores = []
k_values = range(2, 6)
for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42)
    labels = kmeans.fit_predict(X)
    score = silhouette_score(X, labels)
    sil_scores.append(score)

# Plot as bar chart
plt.figure(figsize=(8,5))
plt.bar(k_values, sil_scores, color='skyblue')
plt.xlabel("Number of Clusters (k)")
plt.ylabel("Silhouette Score")
plt.title("Silhouette Scores for Different k Values")
plt.xticks(k_values)
plt.show()
30. Load the Iris dataset and use hierarchical clustering to group data. Plot a dendrogram with average linkage
* # Import libraries
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from scipy.cluster.hierarchy import dendrogram, linkage

# Load Iris dataset
iris = load_iris()
X = iris.data

# Perform hierarchical clustering using average linkage
linked = linkage(X, method='average')

# Plot dendrogram
plt.figure(figsize=(10, 6))
dendrogram(linked,
           labels=iris.target,
           distance_sort='ascending',
           leaf_rotation=90)
plt.title("Hierarchical Clustering Dendrogram (Average Linkage)")
plt.xlabel("Sample (labeled by target class)")
plt.ylabel("Distance")
plt.show()
31. Generate synthetic data with overlapping clusters using make_blobs, then apply K-Means and visualize with
decision boundaries.
* # Import libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from matplotlib.colors import ListedColormap

# Generate synthetic data with overlapping clusters
X, y_true = make_blobs(n_samples=300, centers=3, cluster_std=1.5, random_state=42)

# Apply K-Means
kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(X)

# Create a mesh grid for decision boundaries
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.05),
                     np.arange(y_min, y_max, 0.05))

# Predict cluster for each point in the grid
Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

# Plot decision boundaries
plt.figure(figsize=(8,6))
cmap = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
plt.contourf(xx, yy, Z, alpha=0.3, cmap=cmap)

# Plot original data points
plt.scatter(X[:, 0], X[:, 1], c=labels, s=50, cmap='viridis', edgecolor='k')
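# Plot cluster centroids (the source cell is cut off here; the rest follows the pattern of the other answers)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            c='red', s=200, marker='X', label='Centroids')
plt.title("K-Means Clustering with Decision Boundaries")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend()
plt.show()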
32. Load the Digits dataset and apply DBSCAN after reducing dimensions with t-SNE. Visualize the results
* # Import libraries
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.cluster import DBSCAN
import numpy as np

# Load Digits dataset
digits = load_digits()
X = digits.data

# Reduce dimensions to 2D using t-SNE
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X)

# Apply DBSCAN
dbscan = DBSCAN(eps=5, min_samples=5)
labels = dbscan.fit_predict(X_tsne)

# Identify outliers
outliers = X_tsne[labels == -1]
clustered_points = X_tsne[labels != -1]

# Plot clusters and outliers
plt.figure(figsize=(8,6))
plt.scatter(clustered_points[:,0], clustered_points[:,1], c=labels[labels != -1], cmap='tab10', s=50, label='Clustered points')
plt.scatter(outliers[:,0], outliers[:,1], c='red', s=50, marker='x', label='Outliers')
plt.title("DBSCAN Clustering on t-SNE Reduced Digits Data")
plt.xlabel("t-SNE Component 1")
plt.ylabel("t-SNE Component 2")
plt.legend()
plt.show()
33. Generate synthetic data using make_blobs and apply Agglomerative Clustering with complete linkage. Plot
the result.
* # Import libraries
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering

# Generate synthetic data
X, y_true = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)

# Apply Agglomerative Clustering with complete linkage
agglo = AgglomerativeClustering(n_clusters=3, linkage='complete')
labels = agglo.fit_predict(X)

# Plot the clusters
plt.figure(figsize=(8,6))
plt.scatter(X[:,0], X[:,1], c=labels, cmap='viridis', s=50)
plt.title("Agglomerative Clustering (Complete Linkage)")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()
34.  Load the Breast Cancer dataset and compare inertia values for K = 2 to 6 using K-Means. Show results in a
line plot.
* # Import libraries
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Load Breast Cancer dataset
cancer = load_breast_cancer()
X = cancer.data

# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Compute inertia for K = 2 to 6
k_values = range(2, 7)
inertia_values = []

for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X_scaled)
    inertia_values.append(kmeans.inertia_)

# Plot inertia values
plt.figure(figsize=(8,5))
plt.plot(k_values, inertia_values, marker='o', color='blue')
plt.title("K-Means Inertia for Different K Values")
plt.xlabel("Number of Clusters (K)")
plt.ylabel("Inertia")
plt.xticks(k_values)
plt.grid(True)
plt.show()
35. Generate synthetic concentric circles using make_circles and cluster using Agglomerative Clustering with
single linkage.
* # Import libraries
import matplotlib.pyplot as plt
from sklearn.datasets import make_circles
from sklearn.cluster import AgglomerativeClustering

# Generate synthetic concentric circles
X, y_true = make_circles(n_samples=300, noise=0.05, factor=0.5, random_state=42)

# Apply Agglomerative Clustering with single linkage
agglo = AgglomerativeClustering(n_clusters=2, linkage='single')
labels = agglo.fit_predict(X)

# Plot the clusters
plt.figure(figsize=(8,6))
plt.scatter(X[:,0], X[:,1], c=labels, cmap='viridis', s=50)
plt.title("Agglomerative Clustering (Single Linkage) on Concentric Circles")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()
36. Use the Wine dataset, apply DBSCAN after scaling the data, and count the number of clusters (excluding
noise).
* # Import libraries
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN
import numpy as np

# Load Wine dataset
wine = load_wine()
X = wine.data

# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply DBSCAN
dbscan = DBSCAN(eps=1.5, min_samples=5)
labels = dbscan.fit_predict(X_scaled)

# Count number of clusters (excluding noise)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("Number of clusters (excluding noise):", n_clusters)

# Optional: count number of noise points
n_noise = list(labels).count(-1)
print("Number of noise points:", n_noise)
37. Generate synthetic data with make_blobs and apply KMeans. Then plot the cluster centers on top of the
data points.
* # Import libraries
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Generate synthetic data
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)

# Apply K-Means clustering
kmeans = KMeans(n_clusters=4, random_state=42)
labels = kmeans.fit_predict(X)
centers = kmeans.cluster_centers_

# Plot data points
plt.figure(figsize=(8,6))
plt.scatter(X[:,0], X[:,1], c=labels, cmap='viridis', s=50, edgecolor='k', alpha=0.6)

# Plot cluster centers
plt.scatter(centers[:,0], centers[:,1], c='red', s=200, marker='X', label='Centroids')

plt.title("K-Means Clustering with Cluster Centers")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend()
plt.show()
38. Load the Iris dataset, cluster with DBSCAN, and print how many samples were identified as noise
* # Import libraries
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

# Load Iris dataset
iris = load_iris()
X = iris.data

# Scale the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)
labels = dbscan.fit_predict(X_scaled)

# Count noise points (labeled as -1)
n_noise = list(labels).count(-1)
print("Number of noise points:", n_noise)
39. Generate synthetic non-linearly separable data using make_moons, apply K-Means, and visualize the
clustering result?
* # Import libraries
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans

# Generate synthetic two-moons data
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=42)

# Apply K-Means clustering
kmeans = KMeans(n_clusters=2, random_state=42)
labels = kmeans.fit_predict(X)

# Plot the clustering result
plt.figure(figsize=(8,6))
plt.scatter(X[:,0], X[:,1], c=labels, cmap='viridis', s=50, edgecolor='k')
plt.scatter(kmeans.cluster_centers_[:,0], kmeans.cluster_centers_[:,1],
            c='red', s=200, marker='X', label='Centroids')
plt.title("K-Means Clustering on Two-Moons Data")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend()
plt.show()
40.  Load the Digits dataset, apply PCA to reduce to 3 components, then use KMeans and visualize with a 3D
scatter plot.
* # Import libraries
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Load Digits dataset
digits = load_digits()
X = digits.data

# Reduce to 3 components using PCA
pca = PCA(n_components=3, random_state=42)
X_pca = pca.fit_transform(X)

# Apply K-Means clustering
kmeans = KMeans(n_clusters=10, random_state=42)
labels = kmeans.fit_predict(X_pca)
centroids = kmeans.cluster_centers_

# 3D Scatter plot
fig = plt.figure(figsize=(10,7))
ax = fig.add_subplot(111, projection='3d')
scatter = ax.scatter(X_pca[:,0], X_pca[:,1], X_pca[:,2], c=labels, cmap='tab10', s=50)

# Plot centroids
ax.scatter(centroids[:,0], centroids[:,1], centroids[:,2], c='red', s=200, marker='X', label='Centroids')

ax.set_title("3D Scatter Plot of K-Means Clusters on PCA-Reduced Digits")
ax.set_xlabel("PCA Component 1")
ax.set_ylabel("PCA Component 2")
ax.set_zlabel("PCA Component 3")
ax.legend()
plt.show()
