In [None]:
1. What is unsupervised learning in the context of machine learning?

Unsupervised learning is a type of machine learning in which the algorithm is trained on data that has no labeled responses or predefined categories. Unlike supervised learning, where input-output pairs guide training, an unsupervised algorithm analyzes only the input data and tries to discover hidden patterns, structures, or relationships on its own. Typical tasks are grouping similar data points (clustering), reducing dimensionality, and detecting anomalies, all without human-provided labels. The goal is to uncover meaningful insights or organize the data by its inherent similarities, without prior knowledge of the true categories or outcomes.

2. How does the K-Means clustering algorithm work?

The K-Means clustering algorithm works as follows:

1.	Specify the number of clusters, k, to create.

2.	Randomly initialize k cluster centroids (means).

3.	Assign each data point to the nearest centroid based on distance (usually Euclidean).

4.	Recalculate each centroid as the mean of all points assigned to its cluster.

5.	Repeat steps 3 and 4 until the cluster assignments no longer change or a maximum number of iterations is reached.
  
The goal is to minimize the within-cluster sum of squares (WCSS), which quantifies the variance within each cluster. K-Means iteratively refines cluster centers and memberships until convergence, effectively partitioning the data into groups of similar points.
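The iteration described above can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not a production implementation: the function name `kmeans_sketch`, its parameters, and the toy array `X_demo` are invented for this example, and initialization simply picks k distinct data points at random.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, seed=0):
    """Basic K-Means loop; returns (centroids, labels, wcss)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialization: use k distinct data points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Assignment: each point goes to its nearest centroid (Euclidean)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Convergence: stop once no assignment changes
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Update: move each centroid to the mean of its assigned points
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                centroids[j] = members.mean(axis=0)
    # WCSS: sum of squared distances from each point to its centroid
    wcss = float(((X - centroids[labels]) ** 2).sum())
    return centroids, labels, wcss

# Hypothetical toy data: two well-separated groups of three points
X_demo = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
centroids, labels, wcss = kmeans_sketch(X_demo, k=2)
print(labels, round(wcss, 3))
```

In practice `sklearn.cluster.KMeans` would normally be used instead; the sketch only makes the assign/update cycle and the WCSS objective explicit.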


3. Explain the concept of a dendrogram in hierarchical clustering?

A dendrogram in hierarchical clustering is a tree-like diagram that visualizes how individual data points or clusters are progressively merged or split. It represents the hierarchy of clusters formed during the clustering process.

At the bottom of the dendrogram, every data point starts as its own cluster. As you move up, the clusters that are most similar (closest) merge, forming larger clusters. The height at which two clusters merge indicates their dissimilarity or distance: the lower the height, the more similar the clusters. This visual representation helps in understanding the structure and relationships in the data, and it allows the number of clusters to be selected by "cutting" the dendrogram at a chosen height.

Essentially, a dendrogram provides an intuitive way to see how clusters nest within each other and to decide on an appropriate clustering by analyzing the distances between merges.
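The idea of "cutting" the tree at a height can be illustrated with SciPy. The sketch below is an assumption-laden toy: the five one-dimensional points and the two cut heights (1.0 and 6.0) are made up for illustration, and average linkage is just one possible choice.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical 1-D data: two tight pairs and one distant point
points = np.array([[0.0], [0.2], [5.0], [5.3], [10.0]])

# Each row of Z records one merge: [cluster a, cluster b, height, new size]
Z = linkage(points, method="average")
merge_heights = Z[:, 2]

# "Cutting" the tree: clusters that only join above height t stay separate
labels_low_cut = fcluster(Z, t=1.0, criterion="distance")
labels_high_cut = fcluster(Z, t=6.0, criterion="distance")

print("merge heights:", merge_heights)
print("cut at 1.0 ->", len(set(labels_low_cut)), "clusters")
print("cut at 6.0 ->", len(set(labels_high_cut)), "clusters")
```

A lower cut keeps more, smaller clusters; a higher cut accepts more merges and therefore yields fewer clusters.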

4. What is the main difference between K-Means and Hierarchical Clustering?

The main difference between K-Means and Hierarchical Clustering lies in their approach to grouping data:

K-Means is a partitional clustering algorithm that requires the number of clusters, k, to be specified beforehand. It iteratively assigns data points to the nearest cluster centroid and updates centroids until convergence, producing flat, non-overlapping clusters. It assumes clusters are spherical and similarly sized, and is computationally efficient for large datasets.

Hierarchical Clustering builds a tree-like structure called a dendrogram without needing to predefine the number of clusters. It either merges points/clusters progressively (agglomerative) or splits a single cluster recursively (divisive), allowing exploration of cluster relationships at different levels. It can capture clusters of various shapes and sizes but is more computationally intensive, typically suited for smaller datasets.

In summary, K-Means efficiently partitions data into a fixed number of flat clusters, while Hierarchical Clustering creates a multi-level hierarchy of clusters, offering flexibility in cluster selection and deeper insight into data structure.

5. What are the advantages of DBSCAN over K-Means?

DBSCAN has several advantages over K-Means clustering:

Ability to find clusters of arbitrary shape: DBSCAN identifies clusters based on density and can detect non-spherical, irregularly shaped clusters, while K-Means assumes spherical clusters and tends to perform poorly on complex shapes.

No need to specify the number of clusters: DBSCAN automatically determines the number of clusters based on data density, whereas K-Means requires specifying the number of clusters k upfront.

Robustness to noise and outliers: DBSCAN explicitly labels low-density points as noise and excludes them from clusters, improving cluster quality. K-Means assigns every point to a cluster, which can distort cluster boundaries if outliers exist.

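No assumption of similar cluster sizes: DBSCAN grows clusters from dense regions rather than fitting them around centroids, so it does not assume that clusters have similar sizes or extents, whereas K-Means tends to split large clusters and absorb small ones. Widely varying densities are not an advantage, however: because ε and MinPts are global, DBSCAN struggles when clusters differ greatly in density.

However, DBSCAN can be sensitive to its parameters (epsilon radius and minimum points) and may struggle with very high-dimensional datasets. Overall, DBSCAN is preferred when clusters have complex shapes, the number of clusters is unknown, and noise or outliers are present.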

6. When would you use Silhouette Score in clustering?

The Silhouette Score is used in clustering to evaluate the quality of clusters formed by a clustering algorithm. It measures how well each data point fits within its assigned cluster compared to other clusters by considering two key factors:

Cohesion: How close the point is to other points in the same cluster (intra-cluster distance).

Separation: How far the point is from points in the nearest neighboring cluster (nearest-cluster distance).

The Silhouette Score ranges from -1 to +1:

A score close to +1 indicates that points are well matched to their own cluster and distinctly separated from others.

A score near 0 suggests points lie between clusters or clusters overlap.

A score close to -1 means points may be misclassified.

It is commonly used to:

Assess clustering effectiveness when no ground truth labels exist,

Help determine the optimal number of clusters by comparing average silhouette scores across different cluster counts,

Visually analyze cluster cohesion and separation.

Thus, the Silhouette Score is a valuable internal validation metric for clustering evaluation, especially when judging how well-separated and well-formed the clusters are.
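To make cohesion and separation concrete, the per-point silhouette can be computed by hand as s = (b - a) / max(a, b), where a is the mean distance to the other points in the same cluster and b is the smallest mean distance to the points of any other cluster. The sketch below assumes Euclidean distance and at least two clusters; the helper name `silhouette_values` and the four-point dataset are hypothetical.

```python
import numpy as np

def silhouette_values(X, labels):
    """Per-point silhouette s = (b - a) / max(a, b), Euclidean distance."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        if own.sum() == 1:
            continue  # common convention: a singleton cluster scores 0
        # a: mean distance to the other points in the same cluster
        a = dists[i, own].sum() / (own.sum() - 1)
        # b: smallest mean distance to the points of any other cluster
        b = min(dists[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

# Hypothetical toy data: two obvious groups on a line
X_toy = np.array([[0.0], [1.0], [10.0], [11.0]])
good = silhouette_values(X_toy, [0, 0, 1, 1])  # sensible grouping
bad = silhouette_values(X_toy, [0, 1, 0, 1])   # deliberately mixed grouping
print("good:", good.mean(), "bad:", bad.mean())
```

The sensible grouping scores close to +1, while the deliberately mixed grouping gives negative values for every point. For real data, `sklearn.metrics.silhouette_score` computes the mean of these per-point values.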

7. What are the limitations of Hierarchical Clustering?

Hierarchical clustering has several limitations:

	Computationally Expensive: The standard agglomerative algorithm takes roughly O(n^3) time and O(n^2) memory for the pairwise distance matrix, making it slow for large datasets.

	Sensitivity to Noise and Outliers: Outliers can significantly distort the clustering results and the dendrogram structure.

	Irreversibility: Once clusters are merged or split in the hierarchical process, the decision cannot be undone, which may lead to suboptimal clustering.

	Difficulty in Choosing Parameters: Selecting the appropriate linkage criterion and distance metric can greatly influence results, and there is no definitive method for choosing them.

	Interpretability Challenges: The dendrogram can be complex and hard to interpret, especially for large or high-dimensional data.

	Sensitivity to Data Order: When several pairs of clusters are equally close (ties), the clustering outcome may depend on the order in which data points are processed.

	Limited Scalability: Due to its computational demands, it is not suitable for very large datasets.

	Challenges with Mixed or Categorical Data: Hierarchical clustering works best with numeric data and struggles with categorical variables without proper encoding.

These limitations make hierarchical clustering more appropriate for smaller datasets or cases where understanding hierarchical relationships is important.


8. Why is feature scaling important in clustering algorithms like K-Means?

Feature scaling is important in clustering algorithms like K-Means because these algorithms rely on distance calculations (commonly Euclidean distance) to assign points to clusters. If the features have different scales or units, those with larger ranges can dominate the distance calculations, causing the clustering results to be biased toward those features.

Scaling features to a uniform range ensures that all features contribute equally to the distance metric, preventing any feature from disproportionately influencing cluster assignments. This leads to more accurate, meaningful cluster structures and often improves the algorithm's convergence speed.

Hence, feature scaling is crucial, especially when the dataset contains features measured in different units or ranges, for reliable and consistent clustering outcomes.
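A small numeric example shows how an unscaled feature can dominate Euclidean distance. The data here is entirely hypothetical (income in dollars, age in years), and standardization is done by hand as a z-score so the effect is visible without any library beyond NumPy.

```python
import numpy as np

# Hypothetical customers: [annual income in dollars, age in years]
X_raw = np.array([
    [30000.0, 25.0],   # customer 0 (reference)
    [31000.0, 65.0],   # similar income, very different age
    [36000.0, 26.0],   # somewhat higher income, almost the same age
    [90000.0, 45.0],   # very different income
])

def nearest_to_first(data):
    """Index of the row closest (Euclidean) to row 0, excluding row 0."""
    d = np.linalg.norm(data[1:] - data[0], axis=1)
    return int(d.argmin()) + 1

raw_neighbor = nearest_to_first(X_raw)

# Z-score standardization: zero mean and unit variance per feature
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
scaled_neighbor = nearest_to_first(X_std)

print("nearest before scaling:", raw_neighbor)
print("nearest after scaling: ", scaled_neighbor)
```

Before scaling, the 40-year age gap to customer 1 is invisible next to income differences measured in thousands, so customer 1 looks closest. After scaling, customer 2 becomes the nearest neighbor because both features contribute on a comparable scale.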

9. How does DBSCAN identify noise points?

DBSCAN identifies noise points as those which do not belong to any cluster based on density criteria. Specifically:

For each point, DBSCAN counts the number of points within a radius ε (epsilon), called the neighborhood.

If the number of points in the neighborhood is less than a minimum threshold MinPts, the point is not dense enough to be a core point.

Points that are neither core points (with sufficient neighbors) nor reachable from any core point are labeled as noise points.

Noise points are isolated from dense regions and are not assigned to any cluster.

In summary, noise points are those that do not have enough neighbors within the ε radius to be core points and are not density-reachable from any core point, which effectively identifies outliers or sparse areas in the data.
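The three roles a point can take can be inspected with scikit-learn's `DBSCAN`, which labels noise as -1 and exposes core points through `core_sample_indices_`. The six-point dataset and the parameter values below are assumptions chosen so that each role appears once; note that scikit-learn counts the point itself toward `min_samples`.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical toy data
X_toy = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.2, 0.0], [0.3, 0.0],  # dense group
    [0.75, 0.0],                                      # near the group's edge
    [5.0, 5.0],                                       # far from everything
])

db = DBSCAN(eps=0.5, min_samples=4).fit(X_toy)
core = set(db.core_sample_indices_.tolist())

kinds = []
for i, label in enumerate(db.labels_):
    if label == -1:
        kinds.append("noise")    # not reachable from any core point
    elif i in core:
        kinds.append("core")     # at least min_samples points within eps
    else:
        kinds.append("border")   # in a cluster, but not dense enough itself

print(kinds)
```

The point at (0.75, 0) has too few neighbors to be a core point but lies within ε of a core point, so it joins the cluster as a border point; the point at (5, 5) is reachable from nothing and is labeled noise.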

10. Define inertia in the context of K-Means?

In the context of K-Means clustering, inertia is defined as the sum of the squared distances between each data point and the centroid of the cluster to which it is assigned. It measures how internally coherent the clusters are; lower inertia indicates that the points are closer to their respective cluster centers, implying more compact clusters.
Mathematically, inertia is the total within-cluster sum of squares (WCSS), calculated as:

$$\text{Inertia} = \sum_{i=1}^{n} \text{distance}(x_i, c_j^{*})^2$$

where $x_i$ is a data point and $c_j^{*}$ is the centroid of the cluster nearest to $x_i$.

Inertia is commonly used in the Elbow Method to identify the optimal number of clusters k by plotting inertia against different k values and looking for the "elbow point" where the decrease in inertia starts to level off, indicating diminishing returns in cluster compactness improvement.
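The definition can be checked numerically by summing squared distances by hand and comparing the result with the `inertia_` attribute that scikit-learn's `KMeans` reports. The dataset below is synthetic and the variable names are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Synthetic data, used only to have something to cluster
X_demo, _ = make_blobs(n_samples=200, centers=3, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)

# For every point, look up the centroid of its assigned cluster,
# then sum the squared Euclidean distances
assigned_centroids = km.cluster_centers_[km.labels_]
manual_inertia = float(((X_demo - assigned_centroids) ** 2).sum())

print("manual:", manual_inertia)
print("sklearn:", km.inertia_)
```

The two numbers should agree up to floating-point rounding.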


11. What is the elbow method in K-Means clustering?

The elbow method in K-Means clustering is a heuristic technique used to determine the optimal number of clusters (K) in the data. It involves the following steps:
1.	Run K-Means clustering on the dataset for a range of K values (e.g., 1 to 10).

2.	For each K, calculate the within-cluster sum of squares (WCSS) or inertia, which measures how compact the clusters are.

3.	Plot the WCSS values against the number of clusters K.

4.	Look for the "elbow point" on the curve where adding more clusters results in only a small decrease in WCSS, indicating diminishing returns.

5.	The K value at this elbow point is considered optimal because it balances cluster compactness with model complexity.

The elbow method helps select a K that avoids both underfitting (too few clusters) and overfitting (too many clusters), providing a practical way to choose the cluster count based on data structure.


12. Describe the concept of "density" in DBSCAN?

The concept of "density" in DBSCAN (Density-Based Spatial Clustering of Applications with Noise) refers to how closely packed data points are in a region of the data space. Specifically, density is determined by two parameters:

Epsilon (ε) - The radius around a data point defining its neighborhood.

MinPts - The minimum number of points required within this ε-radius neighborhood to qualify the point as part of a dense region.

A point is considered a core point if it has at least MinPts neighbors within the ε distance, indicating it lies in a dense area. Points that are reachable from core points but do not themselves meet the density criteria are border points, while points that are neither core nor reachable border points are considered noise or outliers.

Thus, DBSCAN defines clusters as areas of high density separated by regions of low density, enabling it to discover clusters of arbitrary shape and identify noise effectively based on the spatial density of points.

13. Can hierarchical clustering be used on categorical data?

Yes, hierarchical clustering can be used on categorical data, but it requires specialized handling since traditional hierarchical clustering relies on distance measures like Euclidean distance, which are not suitable for categorical variables.

For categorical data, alternative similarity or distance measures such as Hamming distance, Jaccard distance, or Gower’s distance are employed. These metrics effectively capture dissimilarities based on category mismatches. The hierarchical clustering process then proceeds similarly, either merging or splitting clusters based on these distances.

Additionally, categorical data may be preprocessed through appropriate encoding methods or use of dissimilarity matrices. The resulting dendrogram visualizes the nested cluster structure and helps in selecting the number of clusters by cutting the tree at different heights.

This approach is commonly applied in domains like customer segmentation, survey analysis, or document grouping, where the data is categorical.
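As a rough sketch of this workflow, the example below integer-encodes a few categorical records, computes Hamming distances (the fraction of attributes on which two records differ), and passes the resulting dissimilarities to SciPy's hierarchical clustering. The records are invented, and average linkage is an arbitrary choice; the integer codes carry no order here because Hamming distance only checks equality.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical categorical records: (color, size, shape)
records = [
    ("red", "small", "round"),
    ("red", "small", "square"),
    ("blue", "large", "round"),
    ("blue", "large", "square"),
]

# Integer-encode each column; codes are only compared for equality
columns = list(zip(*records))
encoded = np.array(
    [[sorted(set(col)).index(v) for v in col] for col in columns],
    dtype=float,
).T

# Condensed vector of pairwise Hamming distances
dist = pdist(encoded, metric="hamming")

# Hierarchical clustering directly on the dissimilarities
Z = linkage(dist, method="average")
labels = fcluster(Z, t=2, criterion="maxclust")  # ask for two flat clusters

print(dist)
print(labels)
```

Records that share two of three attributes end up in the same cluster. For mixed numeric and categorical data, Gower's distance would be the more natural choice, but it is not part of SciPy and is not shown here.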

14. What does a negative Silhouette Score indicate?

A negative Silhouette Score indicates that a data point may be assigned to the wrong cluster. Specifically, it means that the point is, on average, closer to points in a neighboring cluster than to points within its own assigned cluster. This suggests poor clustering quality for that point and possibly for the clustering solution overall.

Intuitively, negative values imply misclassification or that the cluster boundaries are not well defined for those points. If many points have negative silhouette values, it may indicate too many or too few clusters or that the data naturally doesn't cluster well at the chosen granularity.

Thus, a negative Silhouette Score is a strong signal to reconsider the clustering configuration or the number of clusters.

15. Explain the term "linkage criteria" in hierarchical clustering?

In hierarchical clustering, the term "linkage criteria" refers to the rule or method used to measure the distance or dissimilarity between clusters when deciding which clusters to merge at each step. Linkage criteria determine how the distance between two clusters is computed based on the pairwise distances between data points in those clusters.

Common types of linkage criteria include:

Single Linkage: Distance between the closest points of two clusters (minimum distance). It tends to produce elongated, chain-like clusters.

Complete Linkage: Distance between the farthest points of two clusters (maximum distance). It favors compact, spherical clusters.

Average Linkage: Average distance between all pairs of points in the two clusters, providing a balance between single and complete linkage.

Ward's Linkage: Minimizes the increase in total within-cluster variance after merging, producing clusters with minimum variance.

The choice of linkage criterion affects the shape and size of the clusters and the structure of the resulting dendrogram.
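The four criteria can be compared on a tiny example by computing each one directly from the pairwise distances. The two clusters `A` and `B` below are hypothetical, and the Ward value is computed as the increase in within-cluster sum of squares caused by the merge.

```python
import numpy as np

# Two hypothetical clusters on the x-axis
A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[3.0, 0.0], [5.0, 0.0]])

# All pairwise Euclidean distances between a point in A and a point in B
pairwise = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

single = pairwise.min()     # closest pair
complete = pairwise.max()   # farthest pair
average = pairwise.mean()   # mean over all pairs

def sse(points):
    """Sum of squared distances from each point to the cluster mean."""
    return float(((points - points.mean(axis=0)) ** 2).sum())

# Ward: how much the total within-cluster variance grows if A and B merge
ward_increase = sse(np.vstack([A, B])) - (sse(A) + sse(B))

print(single, complete, average, ward_increase)
```

At each step, agglomerative clustering merges the pair of clusters with the smallest value of whichever criterion was chosen, which is why different criteria can produce different trees from the same data.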

16. Why might K-Means clustering perform poorly on data with varying cluster sizes or densities?

K-Means clustering can perform poorly on data with varying cluster sizes or densities because it assumes clusters are spherical and have roughly equal sizes and densities. This assumption causes several issues:

Unequal Cluster Sizes: K-Means assigns points to the nearest centroid, which works best when clusters are similar in size. Larger or smaller clusters may be split or merged incorrectly.

Varying Densities: K-Means uses distance to centroids without considering density, so clusters with different point densities may not be properly separated.

Cluster Shape Assumption: K-Means favors spherical (globular) clusters, thus struggling on elongated or irregular-shaped clusters.

Sensitivity to Outliers: Outliers can shift the cluster centroid, distorting cluster assignments especially when cluster densities vary.

Due to these limitations, K-Means may split large clusters inaccurately and merge smaller or less dense ones, leading to suboptimal clustering results on such data distributions. More flexible algorithms like DBSCAN or Gaussian Mixture Models are preferred for data with varying cluster sizes and densities.

17. What are the core parameters in DBSCAN, and how do they influence clustering?

The core parameters in DBSCAN are:

Epsilon (ε): This defines the radius of the neighborhood around a data point. Two points are considered neighbors if the distance between them is less than or equal to ε. The choice of ε determines the scale at which the algorithm searches for dense regions. If ε is too small, many points will be classified as noise; if too large, clusters may merge incorrectly.

MinPts (Minimum Points): This specifies the minimum number of points required in an ε-radius neighborhood for a point to be considered a core point, indicating a dense region. The value of MinPts influences cluster density definition; smaller MinPts may detect smaller clusters but be sensitive to noise, while larger MinPts require denser clusters.

Together, these parameters control how DBSCAN defines clusters based on density and how it distinguishes core points, border points, and noise, thereby influencing the shape, size, and number of clusters detected in the data.
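The effect of ε can be observed by running DBSCAN on the same data with a very small, a moderate, and a very large radius. The values 0.001, 0.2, and 10.0 are illustrative assumptions for the two-moons data generated here, not recommended settings.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

X_moons, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

summary = {}
for eps in (0.001, 0.2, 10.0):
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X_moons)
    # Label -1 marks noise, so it is not counted as a cluster
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int((labels == -1).sum())
    summary[eps] = (n_clusters, n_noise)
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")
```

With the tiny radius no point has enough neighbors, so everything is noise; with the huge radius every point is a neighbor of every other and the data collapses into a single cluster. Useful values lie in between and depend on the scale of the data.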

18. How does K-Means++ improve upon standard K-Means initialization?

K-Means++ improves upon standard K-Means initialization by choosing initial cluster centers more strategically rather than randomly.

In K-Means++, the first centroid is chosen randomly from the data points, but each subsequent centroid is selected with a probability proportional to the square of the distance from the point to the nearest already chosen centroid. This approach ensures the initial centroids are spread out across the data space.

By spreading out the initial cluster centers, K-Means++ often leads to faster convergence and better clustering results, avoiding poor solutions where centroids cluster too close together or some clusters are left empty, which can happen with random initialization.

Though the initialization step is computationally more involved than standard K-Means, the overall run-time is typically reduced due to fewer iterations needed for convergence and more stable clustering.
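The seeding rule can be sketched in a few lines of NumPy. This is an illustrative sketch of the sampling step only, not scikit-learn's implementation; the function name `kmeans_pp_init` and the toy data are assumptions.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Pick k initial centers using squared-distance-weighted sampling."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # First center: chosen uniformly at random from the data points
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from each point to its nearest chosen center
        diffs = X[:, None, :] - np.array(centers)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        # Sample the next center with probability proportional to d2,
        # so far-away points are much more likely to be picked
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)

# Hypothetical toy data: two tight groups that are far apart
X_demo = np.array([[0, 0], [0, 1], [1, 0],
                   [100, 100], [100, 101], [101, 100]])
init_centers = kmeans_pp_init(X_demo, k=2)
print(init_centers)
```

Because an already chosen point has squared distance zero, it can never be selected twice, and on data like this the second center almost always comes from the other group. In scikit-learn this behavior corresponds to `init="k-means++"`, which is the default for `KMeans`.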

19. What is agglomerative clustering?

Agglomerative clustering is a type of hierarchical clustering that follows a bottom-up approach. It begins by treating each data point as its own individual cluster. Then, iteratively, it merges the two closest or most similar clusters step-by-step based on a chosen distance metric and linkage criteria. This merging process continues until all points are combined into a single cluster or until a stopping criterion is reached.

The result is a hierarchy of clusters represented by a dendrogram, which shows how clusters are nested and merged at various levels of similarity.

Agglomerative clustering is useful for discovering the structure in data without pre-specifying the number of clusters, and it is good at identifying small and nested clusters.

20. What makes Silhouette Score a better metric than just inertia for model evaluation?

Silhouette Score is generally considered a better metric than inertia for clustering model evaluation because it captures both cohesion (how close points are within the same cluster) and separation (how well clusters are separated from each other), whereas inertia only measures cluster compactness.

Key reasons why Silhouette Score is better than inertia:

Considers Cluster Separation: Silhouette Score evaluates how far data points are from neighboring clusters in addition to their closeness within clusters. Inertia ignores separation and focuses solely on minimizing intra-cluster variance.

Scale-Free Interpretation: Silhouette values range from -1 to +1, providing an interpretable scale where values near +1 indicate well-defined clusters, values near 0 indicate overlapping clusters, and negative values suggest misclassification. Inertia is an unbounded quantity sensitive to data scale.

Prevents Overfitting on Cluster Count: Inertia monotonically decreases with more clusters, encouraging overfitting with too many clusters. Silhouette Score often identifies a peak value indicating a balanced number of clusters.

Reflects Overall Cluster Quality: Silhouette Score reflects cluster consistency and separation for individual points and averages them for global cluster structure insights. Inertia aggregates within-cluster distances only.

Thus, Silhouette Score is more robust for evaluating the goodness of clustering and choosing the optimal number of clusters because it balances the internal cohesion and external separation of clusters, unlike inertia which focuses narrowly on compactness.
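The contrast between the two metrics can be seen by sweeping the number of clusters on data whose true structure is known. The sketch below assumes synthetic blobs with three well-separated centers; the centers, sample size, and range of k are arbitrary illustrative choices.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data with three clearly separated groups
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

inertias, silhouettes = {}, {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    inertias[k] = km.inertia_
    silhouettes[k] = silhouette_score(X_demo, km.labels_)
    print(f"k={k}: inertia={inertias[k]:.1f}, silhouette={silhouettes[k]:.3f}")

# Inertia has no built-in optimum; the silhouette score does
best_k = max(silhouettes, key=silhouettes.get)
print("k with highest silhouette:", best_k)
```

Inertia keeps falling as k grows, so it cannot pick a value of k by itself, whereas the silhouette score peaks at the number of groups actually present in this data.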

                                     Practical

21. Generate synthetic data with 4 centers using make_blobs and apply K-Means clustering. Visualize using a
scatter plot?

In [None]:
# Import necessary libraries
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# 1. Generate synthetic data with 4 centers
X, y_true = make_blobs(n_samples=500, centers=4, cluster_std=0.8, random_state=42)

# 2. Apply K-Means clustering
kmeans = KMeans(n_clusters=4, random_state=42)
y_kmeans = kmeans.fit_predict(X)

# 3. Visualize the clusters
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, cmap='viridis', s=50, alpha=0.7)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            c='red', s=200, marker='X', label='Centroids')

plt.title('K-Means Clustering (4 Clusters)')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.show()


22. Load the Iris dataset and use Agglomerative Clustering to group the data into 3 clusters. Display the first 10 predicted labels?

In [None]:
from sklearn.datasets import load_iris
from sklearn.cluster import AgglomerativeClustering

# Load the Iris dataset
iris = load_iris()
X = iris.data

# Perform Agglomerative Clustering with 3 clusters
agg_clustering = AgglomerativeClustering(n_clusters=3)
labels = agg_clustering.fit_predict(X)

# Display the first 10 predicted labels
print(labels[:10])


23. Generate synthetic data using make_moons and apply DBSCAN. Highlight outliers in the plot?

In [None]:
# Import necessary libraries
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# 1. Generate synthetic data (two interleaving half circles)
X, y_true = make_moons(n_samples=300, noise=0.07, random_state=42)

# 2. Standardize data (DBSCAN is distance-based, scaling is important)
X_scaled = StandardScaler().fit_transform(X)

# 3. Apply DBSCAN clustering
# eps = neighborhood radius; min_samples = points required within eps
# (including the point itself) for a point to count as a core point
dbscan = DBSCAN(eps=0.3, min_samples=5)
y_dbscan = dbscan.fit_predict(X_scaled)

# 4. Identify outliers (DBSCAN labels outliers as -1)
outliers = (y_dbscan == -1)

# 5. Visualize clusters and outliers
plt.figure(figsize=(8, 6))
plt.scatter(X_scaled[~outliers, 0], X_scaled[~outliers, 1],
            c=y_dbscan[~outliers], cmap='plasma', s=50, label='Clusters')
plt.scatter(X_scaled[outliers, 0], X_scaled[outliers, 1],
            c='black', s=60, marker='x', label='Outliers')

plt.title('DBSCAN Clustering on make_moons Dataset')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.show()


24. Load the Wine dataset and apply K-Means clustering after standardizing the features. Print the size of each cluster?

In [None]:
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import numpy as np

# Load the Wine dataset
data = load_wine()
X = data.data

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply K-Means clustering with 3 clusters
kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(X_scaled)

# Print the size of each cluster
unique, counts = np.unique(labels, return_counts=True)
cluster_sizes = dict(zip(unique, counts))
print("Cluster sizes:", cluster_sizes)


25. Use make_circles to generate synthetic data and cluster it using DBSCAN. Plot the result?

In [None]:
import matplotlib.pyplot as plt
from sklearn.datasets import make_circles
from sklearn.cluster import DBSCAN

# Generate synthetic concentric circles data
X, y = make_circles(n_samples=300, noise=0.05, factor=0.5, random_state=42)

# Apply DBSCAN clustering
dbscan = DBSCAN(eps=0.1, min_samples=5)
labels = dbscan.fit_predict(X)

# Plot the clustering result
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=50)
plt.title('DBSCAN Clustering on Synthetic Circles Data')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()


26. Load the Breast Cancer dataset, apply MinMaxScaler, and use K-Means with 2 clusters. Output the cluster
centroids?

In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans

# Load Breast Cancer dataset
data = load_breast_cancer()
X = data.data

# Scale features using MinMaxScaler
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# Apply K-Means clustering with 2 clusters
kmeans = KMeans(n_clusters=2, random_state=42)
kmeans.fit(X_scaled)

# Output cluster centroids
print("Cluster Centroids:\n", kmeans.cluster_centers_)


27. Generate synthetic data using make_blobs with varying cluster standard deviations and cluster with
DBSCAN?

In [None]:
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN

# Generate synthetic data with varying cluster standard deviations
X, y = make_blobs(
    n_samples=500,
    centers=[[0, 0], [5, 5], [10, 0]],
    cluster_std=[0.5, 1.5, 0.3],  # Different std dev for each cluster
    random_state=42
)

# Apply DBSCAN clustering
dbscan = DBSCAN(eps=0.8, min_samples=5)
labels = dbscan.fit_predict(X)

# Plot the clustering results
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='plasma', s=50)
plt.title('DBSCAN Clustering on Synthetic Blobs with Varying Std Dev')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()


28. Load the Digits dataset, reduce it to 2D using PCA, and visualize clusters from K-Means?

In [None]:
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Load the Digits dataset
digits = load_digits()
X = digits.data

# Reduce dimensionality to 2D using PCA
pca = PCA(n_components=2, random_state=42)
X_reduced = pca.fit_transform(X)

# Apply K-Means clustering
kmeans = KMeans(n_clusters=10, random_state=42)
labels = kmeans.fit_predict(X_reduced)

# Plot the clusters
plt.figure(figsize=(8, 6))
scatter = plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=labels, cmap='tab10', s=50)
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.title('K-Means Clusters on PCA-Reduced Digits Data')
plt.colorbar(scatter, label='Cluster Label')
plt.show()


30. Load the Iris dataset and use hierarchical clustering to group data. Plot a dendrogram with average linkage?

In [None]:
# Import necessary libraries
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.preprocessing import StandardScaler

# 1. Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# 2. Standardize the data (important for distance-based methods)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 3. Perform Hierarchical Clustering with average linkage
Z = linkage(X_scaled, method='average')

# 4. Plot the dendrogram
plt.figure(figsize=(10, 6))
# No truncation: all 150 leaves are drawn (p only matters when truncate_mode is set)
dendrogram(Z, leaf_rotation=90, leaf_font_size=10)
plt.title('Hierarchical Clustering Dendrogram (Average Linkage) - Iris Dataset')
plt.xlabel('Sample Index')
plt.ylabel('Distance')
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()


31. Generate synthetic data with overlapping clusters using make_blobs, then apply K-Means and visualize with
decision boundaries?

In [None]:
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# 1. Generate synthetic data with overlapping clusters
X, y_true = make_blobs(n_samples=500, centers=3, cluster_std=2.0, random_state=42)

# Standardize features for better clustering
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 2. Apply K-Means clustering
kmeans = KMeans(n_clusters=3, random_state=42)
y_kmeans = kmeans.fit_predict(X_scaled)

# 3. Create a mesh grid for decision boundaries
x_min, x_max = X_scaled[:, 0].min() - 1, X_scaled[:, 0].max() + 1
y_min, y_max = X_scaled[:, 1].min() - 1, X_scaled[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 500),
                     np.linspace(y_min, y_max, 500))

# Predict cluster labels for each point in the grid
Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

# 4. Visualize clusters with decision boundaries
plt.figure(figsize=(8, 6))
plt.contourf(xx, yy, Z, alpha=0.3, cmap='viridis')
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=y_kmeans, cmap='viridis', s=50, edgecolor='k')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            c='red', s=200, marker='X', label='Centroids')

plt.title('K-Means Clustering with Decision Boundaries (Overlapping Clusters)')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.show()


32. Load the Digits dataset and apply DBSCAN after reducing dimensions with t-SNE. Visualize the results?

In [None]:
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.cluster import DBSCAN

# Load Digits dataset
digits = load_digits()
X = digits.data

# Reduce dimensionality to 2D using t-SNE
tsne = TSNE(n_components=2, random_state=42)
X_embedded = tsne.fit_transform(X)

# Apply DBSCAN clustering
dbscan = DBSCAN(eps=3.0, min_samples=5)
labels = dbscan.fit_predict(X_embedded)

# Plot clusters
plt.figure(figsize=(8, 6))
plt.scatter(X_embedded[:, 0], X_embedded[:, 1], c=labels, cmap='tab10', s=50)
plt.title("DBSCAN Clustering on t-SNE Reduced Digits Data")
plt.xlabel("t-SNE Dim 1")
plt.ylabel("t-SNE Dim 2")
plt.colorbar(label='Cluster Label')
plt.show()


33. Generate synthetic data using make_blobs and apply Agglomerative Clustering with complete linkage. Plot
the result?

In [None]:
# Import required libraries
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler

# 1. Generate synthetic data
X, y_true = make_blobs(n_samples=300, centers=3, cluster_std=1.5, random_state=42)

# 2. Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 3. Apply Agglomerative Clustering with complete linkage
agg_clust = AgglomerativeClustering(n_clusters=3, linkage='complete')
y_pred = agg_clust.fit_predict(X_scaled)

# 4. Visualize the clustering result
plt.figure(figsize=(8, 6))
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=y_pred, cmap='viridis', s=50, edgecolor='k')
plt.title('Agglomerative Clustering (Complete Linkage)')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()


34. Load the Breast Cancer dataset and compare inertia values for K = 2 to 6 using K-Means. Show results in a
line plot?

In [None]:
# Import necessary libraries
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# 1. Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data

# 2. Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 3. Compute K-Means inertia values for K = 2 to 6
inertias = []
K_values = range(2, 7)

for k in K_values:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X_scaled)
    inertias.append(kmeans.inertia_)

# 4. Plot inertia vs K
plt.figure(figsize=(8, 6))
plt.plot(K_values, inertias, marker='o', linestyle='-', color='blue')
plt.title('K-Means Inertia for Different K Values (Breast Cancer Dataset)')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia')
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()
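
Inertia tends to keep falling as K grows, so the curve alone rarely settles the choice of K. The sketch below is a minimal illustration, assuming the same standardized data and an explicit `n_init=10` (an assumption that is not part of the cell above). It prints how much inertia drops between consecutive K values; the point where the drops flatten out is a common heuristic for the "elbow".

```python
from sklearn.datasets import load_breast_cancer
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Same preprocessing as the cell above
X_scaled = StandardScaler().fit_transform(load_breast_cancer().data)

K_values = list(range(2, 7))
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled).inertia_
    for k in K_values
]

# Drop in inertia when moving from K to K + 1
drops = [inertias[i] - inertias[i + 1] for i in range(len(inertias) - 1)]
for k, drop in zip(K_values[:-1], drops):
    print(f"K={k} -> K={k + 1}: inertia drops by {drop:.1f}")
```

A large drop followed by much smaller ones suggests that extra clusters past that point add little.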


35. Generate synthetic concentric circles using make_circles and cluster using Agglomerative Clustering with single linkage.

In [None]:
# Import necessary libraries
import matplotlib.pyplot as plt
from sklearn.datasets import make_circles
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler

# 1. Generate synthetic concentric circles data
X, y_true = make_circles(n_samples=400, factor=0.5, noise=0.05, random_state=42)

# 2. Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 3. Apply Agglomerative Clustering with single linkage
agg_cluster = AgglomerativeClustering(n_clusters=2, linkage='single')
y_pred = agg_cluster.fit_predict(X_scaled)

# 4. Visualize clustering results
plt.figure(figsize=(7, 6))
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=y_pred, cmap='plasma', s=50, edgecolor='k')
plt.title("Agglomerative Clustering (Single Linkage) on Concentric Circles")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()


36. Use the Wine dataset, apply DBSCAN after scaling the data, and count the number of clusters excluding noise.

In [None]:
# Import required libraries
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

# 1. Load the Wine dataset
wine = load_wine()
X = wine.data

# 2. Standardize the data (important for DBSCAN)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 3. Apply DBSCAN clustering
dbscan = DBSCAN(eps=1.5, min_samples=5)
y_db = dbscan.fit_predict(X_scaled)

# 4. Count the number of clusters (excluding noise labeled as -1)
n_clusters = len(set(y_db)) - (1 if -1 in y_db else 0)
n_noise = list(y_db).count(-1)

print(f"Number of clusters (excluding noise): {n_clusters}")
print(f"Number of noise points: {n_noise}")
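
The `eps=1.5` value above is a fixed guess, and the cluster count DBSCAN reports can change a lot with `eps`. One common heuristic for choosing it is a k-distance curve: the distance from each sample to its `min_samples`-th nearest neighbor, sorted in ascending order. The sketch below is only an illustration, assuming the same scaled Wine data; the percentiles it prints are a starting point for picking `eps`, not a rule.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_wine().data)

min_samples = 5
# Querying with the training data returns each point as its own first
# neighbor, which matches DBSCAN's convention that min_samples counts
# the point itself.
nn = NearestNeighbors(n_neighbors=min_samples).fit(X_scaled)
distances, _ = nn.kneighbors(X_scaled)

# Distance from each sample to its min_samples-th neighbor, sorted ascending
k_dist = np.sort(distances[:, -1])

for q in (50, 75, 90):
    print(f"{q}th percentile of k-distance: {np.percentile(k_dist, q):.2f}")
```

If most k-distances are larger than the chosen `eps`, most samples cannot be core points and will be reported as noise.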


37. Generate synthetic data with make_blobs and apply KMeans. Then plot the cluster centers on top of the data points.

In [None]:
# Import necessary libraries
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# 1. Generate synthetic data
X, y_true = make_blobs(n_samples=400, centers=3, cluster_std=1.0, random_state=42)

# 2. Apply K-Means clustering
kmeans = KMeans(n_clusters=3, random_state=42)
y_kmeans = kmeans.fit_predict(X)

# 3. Get cluster centers
centers = kmeans.cluster_centers_

# 4. Plot the clustered data and cluster centers
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis', edgecolor='k', alpha=0.7)
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, marker='X', label='Cluster Centers')
plt.title('K-Means Clustering with Cluster Centers')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()


38. Load the Iris dataset, cluster with DBSCAN, and print how many samples were identified as noise.

In [None]:
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

# 1. Load the Iris dataset
iris = load_iris()
X = iris.data

# 2. Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 3. Apply DBSCAN clustering
dbscan = DBSCAN(eps=0.8, min_samples=5)
y_db = dbscan.fit_predict(X_scaled)

# 4. Count how many samples are identified as noise
n_noise = list(y_db).count(-1)
n_clusters = len(set(y_db)) - (1 if -1 in y_db else 0)

print(f"Number of clusters (excluding noise): {n_clusters}")
print(f"Number of noise samples: {n_noise}")


39. Generate synthetic non-linearly separable data using make_moons, apply K-Means, and visualize the clustering result.

In [None]:
# Import necessary libraries
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# 1. Generate synthetic non-linear data (two interleaving half-moons)
X, y_true = make_moons(n_samples=400, noise=0.1, random_state=42)

# 2. Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 3. Apply K-Means clustering
kmeans = KMeans(n_clusters=2, random_state=42)
y_kmeans = kmeans.fit_predict(X_scaled)

# 4. Visualize the clustering result
plt.figure(figsize=(8, 6))
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=y_kmeans, cmap='viridis', s=50, edgecolor='k')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            c='red', s=200, marker='X', label='Cluster Centers')
plt.title("K-Means Clustering on Non-linearly Separable Data (make_moons)")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend()
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()
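
The plot shows the mismatch visually; it can also be quantified. Because make_moons returns the true moon membership, the Adjusted Rand Index (ARI) gives a single number for how well K-Means recovers it. This is a minimal sketch under the same data settings, with `n_init=10` added as an assumption.

```python
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

X, y_true = make_moons(n_samples=400, noise=0.1, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

y_kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_scaled)

# ARI is 1.0 for a perfect match and close to 0.0 for random labelings
ari = adjusted_rand_score(y_true, y_kmeans)
print(f"Adjusted Rand Index (K-Means vs. true moons): {ari:.3f}")
```

K-Means separates the points with a straight boundary between the two centroids, so a score well below 1.0 is expected on interleaved moons.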


40. Load the Digits dataset, apply PCA to reduce to 3 components, then use KMeans and visualize with a 3D scatter plot.

In [None]:
# Import necessary libraries
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (only needed to register the 3D projection on older Matplotlib)
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# 1. Load the Digits dataset
digits = load_digits()
X = digits.data
y_true = digits.target

# 2. Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 3. Apply PCA to reduce dimensions to 3
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_scaled)

# 4. Apply K-Means clustering (choose 10 clusters for digits 0–9)
kmeans = KMeans(n_clusters=10, random_state=42)
y_kmeans = kmeans.fit_predict(X_pca)

# 5. 3D Visualization
fig = plt.figure(figsize=(10, 7))
ax = fig.add_subplot(111, projection='3d')

scatter = ax.scatter(X_pca[:, 0], X_pca[:, 1], X_pca[:, 2],
                     c=y_kmeans, cmap='tab10', s=40, alpha=0.8, edgecolor='k')

ax.set_title('K-Means Clustering on Digits Dataset (PCA 3D Visualization)', fontsize=12)
ax.set_xlabel('Principal Component 1')
ax.set_ylabel('Principal Component 2')
ax.set_zlabel('Principal Component 3')
plt.show()


41. Generate synthetic blobs with 5 centers and apply KMeans. Then use silhouette_score to evaluate the clustering.

In [None]:
# Import necessary libraries
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

# 1. Generate synthetic data with 5 centers
X, y_true = make_blobs(n_samples=500, centers=5, cluster_std=1.0, random_state=42)

# 2. Apply K-Means clustering
kmeans = KMeans(n_clusters=5, random_state=42)
y_kmeans = kmeans.fit_predict(X)

# 3. Evaluate clustering using Silhouette Score
score = silhouette_score(X, y_kmeans)
print(f"Silhouette Score for K-Means with 5 clusters: {score:.3f}")

# 4. Visualize the clusters
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, cmap='viridis', s=50, edgecolor='k', alpha=0.7)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            c='red', s=200, marker='X', label='Cluster Centers')
plt.title('K-Means Clustering (5 Centers) with Silhouette Evaluation')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()
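
A single silhouette score is easier to interpret next to the scores for other values of K. The sketch below assumes the same blobs and an explicit `n_init=10` (not part of the cell above), scores K = 2 to 8, and reports the K with the highest value. Whether that K equals the 5 generating centers depends on how much the blobs overlap.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=5, cluster_std=1.0, random_state=42)

scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(f"K={k}: silhouette = {scores[k]:.3f}")

# K whose clustering has the highest mean silhouette coefficient
best_k = max(scores, key=scores.get)
print(f"K with the highest silhouette score: {best_k}")
```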


42. Load the Breast Cancer dataset, reduce dimensionality using PCA, and apply Agglomerative Clustering. Visualize in 2D.

In [None]:
# Import necessary libraries
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt

# 1. Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target

# 2. Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 3. Apply PCA to reduce dimensionality to 2 components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# 4. Apply Agglomerative Clustering (e.g., 2 clusters for benign/malignant)
agg = AgglomerativeClustering(n_clusters=2, linkage='ward')
y_agg = agg.fit_predict(X_pca)

# 5. Visualize clusters in 2D PCA space
plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y_agg, cmap='viridis', s=50, alpha=0.7, edgecolor='k')
plt.title('Agglomerative Clustering on PCA-Reduced Breast Cancer Data')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()
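
Clustering in the 2D PCA space only uses the variance that the first two components retain. A quick check of `explained_variance_ratio_`, sketched here under the same standardization, shows how much of the original 30-feature variance that is.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_breast_cancer().data)

pca = PCA(n_components=2).fit(X_scaled)
ratios = pca.explained_variance_ratio_

for i, r in enumerate(ratios, start=1):
    print(f"PC{i}: {r:.1%} of the variance")
print(f"Total retained by 2 components: {ratios.sum():.1%}")
```

If the retained share is low, clusters found in the projection may not reflect structure in the full feature space.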


43. Generate noisy circular data using make_circles and visualize clustering results from KMeans and DBSCAN side-by-side.

In [None]:
# Import necessary libraries
from sklearn.datasets import make_circles
from sklearn.cluster import KMeans, DBSCAN
import matplotlib.pyplot as plt

# 1. Generate noisy circular data
X, y_true = make_circles(n_samples=500, factor=0.5, noise=0.05, random_state=42)

# 2. Apply K-Means clustering
kmeans = KMeans(n_clusters=2, random_state=42)
y_kmeans = kmeans.fit_predict(X)

# 3. Apply DBSCAN clustering
dbscan = DBSCAN(eps=0.15, min_samples=5)
y_dbscan = dbscan.fit_predict(X)

# 4. Visualize both results side-by-side
plt.figure(figsize=(12, 5))

# K-Means plot
plt.subplot(1, 2, 1)
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, cmap='viridis', s=50, edgecolor='k')
plt.title("K-Means Clustering on make_circles Data")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")

# DBSCAN plot
plt.subplot(1, 2, 2)
plt.scatter(X[:, 0], X[:, 1], c=y_dbscan, cmap='viridis', s=50, edgecolor='k')
plt.title("DBSCAN Clustering on make_circles Data")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")

plt.tight_layout()
plt.show()
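
The side-by-side plots can be backed by a number. Since make_circles returns the true ring membership, the Adjusted Rand Index (ARI) of each result against `y_true` makes the comparison explicit. This is a minimal sketch with the same parameters; `n_init=10` is an added assumption.

```python
from sklearn.datasets import make_circles
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import adjusted_rand_score

X, y_true = make_circles(n_samples=500, factor=0.5, noise=0.05, random_state=42)

y_kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
y_dbscan = DBSCAN(eps=0.15, min_samples=5).fit_predict(X)

# DBSCAN's noise label (-1) is treated by ARI as one more group
ari_kmeans = adjusted_rand_score(y_true, y_kmeans)
ari_dbscan = adjusted_rand_score(y_true, y_dbscan)

print(f"ARI, K-Means: {ari_kmeans:.3f}")
print(f"ARI, DBSCAN:  {ari_dbscan:.3f}")
```

K-Means splits the plane between two centroids, so it cannot isolate an inner ring from an outer one, while DBSCAN follows the density of each ring.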


44. Load the Iris dataset and plot the Silhouette Coefficient for each sample after KMeans clustering.

In [None]:
# Import required libraries
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score
import matplotlib.pyplot as plt
import numpy as np

# 1. Load the Iris dataset
data = load_iris()
X = data.data

# 2. Apply K-Means clustering (3 clusters for 3 iris species)
kmeans = KMeans(n_clusters=3, random_state=42)
y_kmeans = kmeans.fit_predict(X)

# 3. Compute silhouette values
silhouette_vals = silhouette_samples(X, y_kmeans)
overall_silhouette = silhouette_score(X, y_kmeans)

# 4. Plot silhouette coefficients for each sample
plt.figure(figsize=(8, 6))
y_lower = 10  # For spacing between clusters

for i in range(3):
    cluster_silhouette_vals = silhouette_vals[y_kmeans == i]
    cluster_silhouette_vals.sort()
    cluster_size = cluster_silhouette_vals.shape[0]
    y_upper = y_lower + cluster_size

    plt.fill_betweenx(
        np.arange(y_lower, y_upper),
        0,
        cluster_silhouette_vals,
        alpha=0.7,
        label=f"Cluster {i + 1}"
    )
    y_lower = y_upper + 10  # Add spacing between clusters

# Draw average silhouette score line
plt.axvline(x=overall_silhouette, color="red", linestyle="--", label=f"Avg Silhouette = {overall_silhouette:.2f}")

# Plot settings
plt.title("Silhouette Plot for K-Means Clustering on Iris Dataset")
plt.xlabel("Silhouette Coefficient Values")
plt.ylabel("Samples (grouped by cluster)")
plt.yticks([])  # y positions are only stacking offsets, not meaningful values
plt.legend()
plt.grid(True, linestyle="--", alpha=0.6)
plt.show()


45. Generate synthetic data using make_blobs and apply Agglomerative Clustering with 'average' linkage. Visualize clusters.

In [None]:
# Import required libraries
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt

# 1. Generate synthetic data
X, y_true = make_blobs(
    n_samples=400,
    centers=4,
    cluster_std=1.0,
    random_state=42
)

# 2. Apply Agglomerative Clustering with 'average' linkage
agg = AgglomerativeClustering(n_clusters=4, linkage='average')
y_agg = agg.fit_predict(X)

# 3. Visualize clusters
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=y_agg, cmap='viridis', s=50, edgecolor='k')
plt.title("Agglomerative Clustering with 'Average' Linkage")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()


46. Load the Wine dataset, apply KMeans, and visualize the cluster assignments in a seaborn pairplot (first 4 features).

In [None]:
# Import required libraries
from sklearn.datasets import load_wine
from sklearn.cluster import KMeans
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# 1. Load the Wine dataset
data = load_wine()
X = data.data
feature_names = data.feature_names

# 2. Create a DataFrame for easier visualization
df = pd.DataFrame(X, columns=feature_names)

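# 3. Apply K-Means clustering (3 clusters expected for 3 wine types)
# Note: the features are not standardized here, so large-valued features
# such as proline dominate the Euclidean distances K-Means relies on.
kmeans = KMeans(n_clusters=3, random_state=42)
df['Cluster'] = kmeans.fit_predict(X)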

# 4. Use only the first 4 features for pairplot visualization
subset_features = feature_names[:4]

# 5. Visualize with Seaborn pairplot
sns.pairplot(df, vars=subset_features, hue='Cluster', palette='viridis', diag_kind='kde')
plt.suptitle("K-Means Clustering on Wine Dataset (First 4 Features)", y=1.02, fontsize=14)
plt.show()


47. Generate noisy blobs using make_blobs and use DBSCAN to identify both clusters and noise points. Print the count.

In [None]:
# Import required libraries
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN
import numpy as np
import matplotlib.pyplot as plt

# 1. Generate synthetic noisy blob data
X, _ = make_blobs(n_samples=500, centers=3, cluster_std=0.8, random_state=42)

# Add random noise points
rng = np.random.RandomState(42)
noise = rng.uniform(low=-10, high=10, size=(30, 2))  # 30 random noise points
X_noisy = np.vstack([X, noise])

# 2. Apply DBSCAN clustering
dbscan = DBSCAN(eps=0.8, min_samples=5)
labels = dbscan.fit_predict(X_noisy)

# 3. Count clusters and noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = list(labels).count(-1)

print(f"Number of clusters found (excluding noise): {n_clusters}")
print(f"Number of noise points: {n_noise}")

# 4. Visualize results
plt.figure(figsize=(8, 6))
plt.scatter(X_noisy[:, 0], X_noisy[:, 1], c=labels, cmap='viridis', s=50, edgecolor='k')
plt.title("DBSCAN Clustering with Noisy Blobs")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()
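
Besides noise, DBSCAN distinguishes core points (at least `min_samples` neighbors within `eps`, counting the point itself) from border points (assigned to a cluster but not dense enough to be core). The sketch below rebuilds the same noisy data and uses the fitted estimator's `core_sample_indices_` attribute to count all three kinds of point.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN

# Same data as above: 3 blobs plus 30 uniformly scattered points
X, _ = make_blobs(n_samples=500, centers=3, cluster_std=0.8, random_state=42)
rng = np.random.RandomState(42)
X_noisy = np.vstack([X, rng.uniform(low=-10, high=10, size=(30, 2))])

dbscan = DBSCAN(eps=0.8, min_samples=5).fit(X_noisy)
labels = dbscan.labels_

# Boolean masks for the three kinds of point
core_mask = np.zeros(len(labels), dtype=bool)
core_mask[dbscan.core_sample_indices_] = True
noise_mask = labels == -1
border_mask = ~core_mask & ~noise_mask

print(f"Core points:   {core_mask.sum()}")
print(f"Border points: {border_mask.sum()}")
print(f"Noise points:  {noise_mask.sum()}")
```

The masks can also be passed to separate `plt.scatter` calls to draw noise points with a distinct marker.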


48. Load the Digits dataset, reduce dimensions using t-SNE, then apply Agglomerative Clustering and plot the clusters.

In [None]:
# Import required libraries
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# 1. Load the Digits dataset
digits = load_digits()
X = digits.data
y = digits.target

# 2. Standardize the data for t-SNE
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 3. Reduce dimensions using t-SNE (2 components for visualization)
tsne = TSNE(n_components=2, random_state=42, perplexity=30, learning_rate=200)
X_tsne = tsne.fit_transform(X_scaled)

# 4. Apply Agglomerative Clustering
agg = AgglomerativeClustering(n_clusters=10, linkage='ward')
y_agg = agg.fit_predict(X_tsne)

# 5. Visualize the clusters
plt.figure(figsize=(8, 6))
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y_agg, cmap='tab10', s=40, edgecolor='k', alpha=0.7)
plt.title("Agglomerative Clustering on t-SNE Reduced Digits Data")
plt.xlabel("t-SNE Component 1")
plt.ylabel("t-SNE Component 2")
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()
