
1. What is unsupervised learning in the context of machine learning?
Unsupervised learning is a type of machine learning where the model is trained on data that doesn't have labeled responses. The goal is to identify underlying patterns, structures, or relationships in the data. Common techniques include clustering (e.g., K-means) and dimensionality reduction (e.g., PCA). The model tries to find similarities or groupings within the data without specific guidance on what the output should be.

2. How does the K-Means clustering algorithm work?
K-Means clustering works by partitioning data into a predefined number of clusters (K) based on feature similarity. Here's how it works:

1. **Initialization**: Select K random data points as cluster centroids.
2. **Assignment Step**: Assign each data point to the nearest centroid based on distance (e.g., Euclidean distance).
3. **Update Step**: Calculate new centroids by averaging all points assigned to each centroid's cluster.
4. **Repeat**: Repeat the assignment and update steps until the centroids no longer change (convergence).

This algorithm minimizes the variance within clusters, making the data points in each cluster as similar as possible.
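
The four steps above can be sketched in plain Python. This is a minimal illustration rather than a production implementation: the function name `kmeans`, the `seed` and `max_iter` parameters, and the six sample points are all assumptions made for the example.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-Means on a list of equal-length tuples (illustrative, not optimized)."""
    rng = random.Random(seed)
    # Initialization: pick k distinct data points as the starting centroids.
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
        # Convergence: stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels

# Hypothetical data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

The cluster numbering depends on the random initialization, so only the grouping (first three points together, last three together) is meaningful, not the label values themselves.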

3. Explain the concept of a dendrogram in hierarchical clustering
A **dendrogram** in hierarchical clustering is a tree-like diagram that visually represents the merging process of clusters. It shows how data points are grouped into clusters step-by-step based on their similarity.

- **X-axis**: Represents the data points or clusters.
- **Y-axis**: Represents the dissimilarity or distance at which clusters are merged.

### Key Points:
1. **Bottom**: Individual data points start as separate clusters.
2. **Branches**: Data points or clusters are progressively merged into larger clusters as the algorithm moves upwards.
3. **Height**: The height of the branches indicates the distance (or dissimilarity) at which two clusters are joined.

You can decide the number of clusters by "cutting" the dendrogram at a certain height. This helps identify how many clusters are formed based on a chosen level of similarity.

4. What is the main difference between K-Means and Hierarchical Clustering
The main difference between **K-Means** and **Hierarchical Clustering** lies in their approach to forming clusters:

1. **K-Means Clustering**:
   - **Type**: Partitional.
   - **Approach**: Requires the number of clusters (K) to be predefined. It iteratively assigns data points to clusters and updates cluster centroids.
   - **Merge/Split**: Produces a single flat partition; it does not build a hierarchy of clusters.
   - **Efficiency**: Works well with large datasets but sensitive to the initial choice of centroids.

2. **Hierarchical Clustering**:
   - **Type**: Agglomerative or Divisive.
   - **Approach**: Builds a hierarchy of clusters either by merging smaller clusters (agglomerative) or splitting a larger cluster (divisive).
   - **Merge/Split**: Produces a dendrogram that shows the hierarchical relationships between clusters.
   - **Efficiency**: Computationally expensive, especially for large datasets.

In summary, **K-Means** requires the number of clusters in advance and is fast, while **Hierarchical Clustering** produces a hierarchy and does not require predefined clusters, but is more computationally intensive.

5. What are the advantages of DBSCAN over K-Means
**Advantages of DBSCAN over K-Means**:

1. **No Need to Predefine Number of Clusters**: Unlike K-Means, DBSCAN does not require specifying the number of clusters (K) in advance. It can automatically determine the number of clusters based on the density of data points.

2. **Can Handle Non-Spherical Clusters**: DBSCAN is capable of identifying clusters of arbitrary shapes, while K-Means typically works best with spherical (or circular) clusters.

3. **Handles Noise and Outliers**: DBSCAN can identify noise points as outliers and does not force them into any cluster, unlike K-Means, which assigns every data point to a cluster, even if it doesn't fit well.

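4. **Works Well with Unevenly Sized Clusters**: DBSCAN can discover clusters with very different numbers of points, whereas K-Means tends to split large clusters and merge small ones. Because `eps` is a single global threshold, DBSCAN itself can struggle when clusters differ widely in density.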

5. **No Need for Distance to Centroid**: K-Means uses a centroid-based approach, which can lead to problems if clusters are of irregular shape. DBSCAN uses density, making it more robust to such issues.

In summary, DBSCAN is more flexible for complex data structures, particularly in the presence of noise and varying cluster shapes.

6. When would you use Silhouette Score in clustering
The **Silhouette Score** is used in clustering to evaluate the quality of the clusters. It is particularly useful in the following scenarios:

1. **Determine Optimal Number of Clusters**: The Silhouette Score helps in selecting the best number of clusters (K) by evaluating how well-separated the clusters are. A higher silhouette score indicates better-defined clusters.

2. **Assess Cluster Quality**: After performing clustering, the Silhouette Score can be used to measure how similar each data point is to its own cluster compared to other clusters. Scores close to +1 indicate well-separated clusters, scores near 0 indicate overlapping clusters, and scores close to -1 indicate points that were probably assigned to the wrong cluster.

3. **Compare Different Clustering Algorithms**: It can be used to compare different clustering algorithms (e.g., K-Means vs DBSCAN) and their performance on the same dataset, helping to select the best model.

4. **Identifying Potential Outliers**: Points with a low or negative silhouette score can be considered as outliers or misclassified, providing insight into the clustering quality.

In summary, the Silhouette Score is a valuable metric for assessing cluster cohesion, separation, and identifying the optimal number of clusters.
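
To make the definition concrete, here is a small sketch that computes the per-point silhouette value s = (b - a) / max(a, b) by hand, where a is the mean distance to the other points in the same cluster and b is the smallest mean distance to the points of any other cluster. The function name `silhouette_values` and the five sample points are assumptions for illustration; the last point is deliberately given the wrong label so that its value comes out negative.

```python
import math

def silhouette_values(points, labels):
    """Per-point silhouette s = (b - a) / max(a, b); needs at least two clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    values = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            # Convention: a point in a singleton cluster gets a silhouette of 0.
            values.append(0.0)
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point's distance to itself is 0, so it does not affect the sum).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        values.append((b - a) / max(a, b))
    return values

# Hypothetical data: the last point sits next to cluster 0 but is labeled 1.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (0.5, 0.5)]
labels = [0, 0, 1, 1, 1]
values = silhouette_values(points, labels)
mean_score = sum(values) / len(values)
print([round(v, 3) for v in values], round(mean_score, 3))
```

The overall Silhouette Score is the mean of these per-point values, which is why a few badly placed points lower the score without necessarily making it negative.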

7. What are the limitations of Hierarchical Clustering
The limitations of **Hierarchical Clustering** are:

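1. **Scalability**: Hierarchical clustering is computationally expensive. Standard agglomerative algorithms need \(O(n^2)\) memory for the distance matrix and between \(O(n^2)\) and \(O(n^3)\) time depending on the linkage and implementation, making them inefficient for large datasets.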

2. **Sensitivity to Noise**: It can be sensitive to outliers, which may distort the clustering structure and lead to incorrect results.

3. **Difficulty in Deciding Number of Clusters**: Although the dendrogram provides a visual representation, selecting the right number of clusters is subjective and can be challenging.

4. **Linkage-Dependent Cluster Shapes**: Ward, complete, and average linkage favor compact, roughly spherical clusters and struggle with non-convex shapes. Single linkage can follow irregular shapes but is prone to chaining.

5. **Once Merged, Cannot Be Unmerged**: A merge cannot be undone, so an early poor merge carries through to every later level of the hierarchy.

These factors can make hierarchical clustering less practical for large, noisy, or complex datasets.

8. Why is feature scaling important in clustering algorithms like K-Means
Feature scaling is important in clustering algorithms like **K-Means** because:

1. **Distance Sensitivity**: K-Means relies on calculating distances (typically Euclidean) between data points. Features with larger ranges or units dominate the distance calculation, which can distort the clustering results.

2. **Equal Weightage**: Scaling ensures that all features contribute equally to the distance calculation, avoiding any single feature from disproportionately influencing the clusters.

3. **Improved Convergence**: Scaled data can help the K-Means algorithm converge faster and more reliably, leading to more accurate clustering.

Common techniques for feature scaling include **Standardization** (zero mean, unit variance) and **Normalization** (scaling to a fixed range).
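
The effect of scaling on distances can be seen in a small sketch. The three "customers" below (age in years, income in dollars) are made-up values, and `standardize` is a hypothetical helper that applies zero-mean, unit-variance scaling column by column; it assumes no column is constant.

```python
import math
import statistics

def standardize(rows):
    """Rescale each column to zero mean and unit (population) standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, stds))
        for row in rows
    ]

# Hypothetical customers: (age in years, annual income in dollars).
rows = [(25, 40_000), (60, 42_000), (27, 90_000)]

# Raw distances: income differences swamp the 35-year age gap.
raw_d01 = math.dist(rows[0], rows[1])
raw_d02 = math.dist(rows[0], rows[2])

# Scaled distances: both features now contribute on a comparable footing.
scaled = standardize(rows)
scaled_d01 = math.dist(scaled[0], scaled[1])
scaled_d02 = math.dist(scaled[0], scaled[2])

print(f"raw:    d01={raw_d01:.1f}  d02={raw_d02:.1f}")
print(f"scaled: d01={scaled_d01:.2f}  d02={scaled_d02:.2f}")
```

Before scaling, the second pair looks roughly 25 times farther apart than the first purely because income is measured in larger units; after scaling, the two distances are of similar magnitude.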

9. How does DBSCAN identify noise points
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) identifies noise points based on its density-based clustering approach. Here's how it works:

1. **Core Points**: A point is a core point if at least `min_samples` points (counting the point itself) lie within a radius `eps` of it.

2. **Border Points**: A point that is not a core point but lies within the `eps` distance of a core point is called a border point.

3. **Noise Points**: Any point that is neither a core point nor a border point is labeled as **noise**. These points don't belong to any cluster and are considered outliers.

Thus, DBSCAN effectively separates dense regions as clusters and labels sparsely populated regions as noise.
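
The three definitions above can be applied directly to a handful of points. The sketch below only classifies points as core, border, or noise; it does not perform the cluster expansion that full DBSCAN does. The function name `classify_points` and the sample coordinates are assumptions for illustration.

```python
import math

def classify_points(points, eps, min_samples):
    """Label each point 'core', 'border', or 'noise' using DBSCAN's definitions."""
    # Each neighborhood includes the point itself.
    neighbors = [
        [j for j, q in enumerate(points) if math.dist(p, q) <= eps]
        for p in points
    ]
    is_core = [len(nbrs) >= min_samples for nbrs in neighbors]
    kinds = []
    for i, nbrs in enumerate(neighbors):
        if is_core[i]:
            kinds.append("core")
        elif any(is_core[j] for j in nbrs):
            # Not dense enough itself, but within eps of a core point.
            kinds.append("border")
        else:
            kinds.append("noise")
    return kinds

# Hypothetical data: a small dense square, one point on its edge, one far away.
points = [(0.0, 0.0), (0.0, 0.5), (0.5, 0.0), (0.5, 0.5), (1.0, 0.5), (5.0, 5.0)]
kinds = classify_points(points, eps=0.6, min_samples=3)
print(kinds)  # ['core', 'core', 'core', 'core', 'border', 'noise']
```

The point at (1.0, 0.5) has only two points in its neighborhood, so it is not a core point, but it lies within `eps` of the core point (0.5, 0.5) and is therefore a border point. The isolated point at (5.0, 5.0) is noise.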

10. Define inertia in the context of K-Means
**Inertia** in K-Means clustering is the within-cluster sum of squared distances between each point and the centroid of its assigned cluster:

\[
\text{Inertia} = \sum_{i=1}^{k} \sum_{x_j \in C_i} \|x_j - \mu_i\|^2
\]

Where:
- \( k \) is the number of clusters,
- \( C_i \) is the set of points assigned to the \( i \)-th cluster,
- \( x_j \) represents each point in the cluster \( C_i \),
- \( \mu_i \) is the centroid (mean) of the \( i \)-th cluster,
- \( \|x_j - \mu_i\|^2 \) is the squared Euclidean distance between the point \( x_j \) and its assigned centroid \( \mu_i \).

This formula computes the sum of squared distances between each point and its assigned cluster centroid, which is used to measure the compactness of the clusters.
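
As a quick check of the formula, the following sketch evaluates it on four made-up points with known centroids. The function name `inertia` and the data are assumptions for illustration.

```python
def inertia(points, labels, centroids):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    return sum(
        sum((x - m) ** 2 for x, m in zip(p, centroids[lab]))
        for p, lab in zip(points, labels)
    )

# Hypothetical data: two clusters of two points, each point 1 unit from its centroid.
points = [(1.0, 1.0), (1.0, 3.0), (5.0, 5.0), (7.0, 5.0)]
labels = [0, 0, 1, 1]
centroids = [(1.0, 2.0), (6.0, 5.0)]

total = inertia(points, labels, centroids)
print(total)  # 4.0
```

Every point is exactly one unit from its centroid, so each contributes a squared distance of 1 and the inertia is 4.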

11. What is the elbow method in K-Means clustering
The **Elbow Method** in K-Means clustering is used to determine the optimal number of clusters, \( k \), for a given dataset. Here's how it works:

1. **Run K-Means clustering** for a range of values for \( k \) (e.g., from 1 to 10).
2. **Compute the inertia (within-cluster sum of squares)** for each value of \( k \). This represents the compactness of the clusters.
3. **Plot the inertia** against the number of clusters \( k \).
4. **Look for the "elbow" point** in the plot, which is where the inertia starts decreasing at a slower rate. This point indicates the optimal number of clusters.

The **elbow point** represents the value of \( k \) where adding more clusters no longer significantly improves the compactness of the clusters, thus balancing between the model's complexity and performance.
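
The elbow is normally read off the plot by eye. As a rough aid, the sketch below picks the k at which the curve bends most sharply, measured by the largest second difference of the inertia values. This heuristic and the function name `elbow_k` are assumptions for illustration, not part of the standard method, and the inertia values are invented.

```python
def elbow_k(ks, inertias):
    """Return the k where the inertia curve bends most (largest second difference)."""
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(ks) - 1):
        drop_before = inertias[i - 1] - inertias[i]
        drop_after = inertias[i] - inertias[i + 1]
        bend = drop_before - drop_after
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k

# Hypothetical inertia values for k = 1..6 (steep drop until k = 3, then flat).
ks = [1, 2, 3, 4, 5, 6]
inertias = [1000.0, 600.0, 200.0, 170.0, 150.0, 140.0]

print(elbow_k(ks, inertias))  # 3
```

On real data the curve is often smoother than this, so the result is best treated as a suggestion to check against the plot and against a metric such as the Silhouette Score.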

12. Describe the concept of "density" in DBSCAN
In DBSCAN (Density-Based Spatial Clustering of Applications with Noise), **density** refers to the number of data points within a specific region around a given data point, determined by two parameters: **ε (epsilon)** and **MinPts**.

1. **ε (epsilon)**: This is the radius that defines the neighborhood around a point. It determines the maximum distance between two points to be considered neighbors.

2. **MinPts**: This is the minimum number of points required to form a dense region (cluster). A point is considered a core point if it has at least **MinPts** points (including itself) within its ε-neighborhood.

DBSCAN classifies points based on their density:

- **Core Points**: Points that have at least **MinPts** points within their ε-neighborhood. These are the central points of a cluster.
  
- **Border Points**: Points that are not core points but lie within the ε-neighborhood of a core point. They can belong to a cluster but are not the center of it.

- **Noise Points (Outliers)**: Points that do not meet the density criteria (i.e., they don't have enough neighbors to be classified as core or border points). These points are considered noise and are excluded from any clusters.

DBSCAN's ability to find clusters of arbitrary shapes and handle noise is a direct result of its density-based approach.

13. Can hierarchical clustering be used on categorical data

Yes, **hierarchical clustering** can be used on categorical data, but it requires a suitable distance metric to handle categorical variables. Traditional distance metrics like **Euclidean distance** are not appropriate for categorical data. Instead, you can use specialized metrics such as:

1. **Hamming Distance**: This is used to calculate the distance between two categorical data points by counting the number of mismatches between corresponding attributes.

2. **Jaccard Index**: Measures the similarity between two sets. It is useful when clustering binary or categorical data, comparing the presence or absence of attributes.

3. **Matching Coefficient**: Similar to Hamming distance, it calculates the proportion of matching attributes between two categorical data points.

When using hierarchical clustering on categorical data, you can apply these distance metrics and then proceed with the usual agglomerative or divisive clustering techniques, such as **Single-Linkage**, **Complete-Linkage**, or **Average-Linkage**.

By choosing an appropriate distance metric, hierarchical clustering can effectively group categorical data into meaningful clusters.
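
As a small illustration of the first metric, the sketch below builds a pairwise Hamming distance matrix for three made-up categorical records. The function name `hamming_distance` and the records are assumptions for the example. A matrix like this could then be handed to a hierarchical clustering routine that accepts precomputed distances.

```python
def hamming_distance(a, b):
    """Count the attributes on which two equal-length categorical records differ."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of attributes")
    return sum(x != y for x, y in zip(a, b))

# Hypothetical records: (color, size, shape).
records = [
    ("red", "small", "round"),
    ("red", "large", "round"),
    ("blue", "large", "square"),
]

# Symmetric pairwise distance matrix with zeros on the diagonal.
distance_matrix = [[hamming_distance(r, s) for s in records] for r in records]
for row in distance_matrix:
    print(row)
# [0, 1, 3]
# [1, 0, 2]
# [3, 2, 0]
```

Dividing each count by the number of attributes gives the proportion of mismatches, and one minus that proportion is the matching coefficient.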

14. What does a negative Silhouette Score indicate
A **negative Silhouette Score** indicates that a data point is **closer to points in another cluster** than to points in its own cluster. This suggests:

- The point is likely **misclassified**.
- The clusters may be **overlapping** or **not well-separated**.
- The current clustering **may not be optimal**.

15. Explain the term "linkage criteria" in hierarchical clustering
**Linkage criteria** determine how the distance between two clusters is calculated in hierarchical clustering. Common types:

- **Single linkage**: Minimum distance between any two points in the clusters.  
- **Complete linkage**: Maximum distance between points.  
- **Average linkage**: Average of all pairwise distances.  
- **Ward’s method**: Minimizes variance increase.  

It influences the shape of the dendrogram and cluster formation.
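
The first three criteria differ only in how they summarize the pairwise distances between two clusters, which the following sketch shows on made-up points. The function name `linkage_distances` is an assumption for illustration, and Ward's method is left out because it is based on the variance increase rather than on a direct summary of pairwise distances.

```python
import math
from itertools import product

def linkage_distances(cluster_a, cluster_b):
    """Distance between two clusters under single, complete, and average linkage."""
    pairwise = [math.dist(p, q) for p, q in product(cluster_a, cluster_b)]
    return {
        "single": min(pairwise),                     # closest pair
        "complete": max(pairwise),                   # farthest pair
        "average": sum(pairwise) / len(pairwise),    # mean over all pairs
    }

# Hypothetical clusters on a line, so the distances are easy to verify by hand.
cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(3.0, 0.0), (5.0, 0.0)]

result = linkage_distances(cluster_a, cluster_b)
print(result)  # {'single': 2.0, 'complete': 5.0, 'average': 3.5}
```

The same two clusters are 2, 3.5, or 5 units apart depending on the criterion, which is why the choice of linkage changes the order of merges.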

16. Why might K-Means clustering perform poorly on data with varying cluster sizes or densities
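K-Means assigns each point to its nearest centroid, which implicitly assumes roughly spherical clusters of similar size and spread. When that assumption fails:

- Large or sparse clusters tend to be split, while small or dense clusters are absorbed by neighboring ones.  
- Boundaries fall midway between centroids, so points on the edge of a wide cluster can be assigned to a compact neighbor.  
- Non-globular shapes cannot be captured by a single centroid.  
- Outliers and a poor initialization can pull centroids away from the true cluster centers.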

17. What are the core parameters in DBSCAN, and how do they influence clustering
Core parameters in DBSCAN:

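1. **eps (ε)**: Maximum distance for two points to be neighbors. A larger `eps` merges nearby groups into fewer, larger clusters; a smaller `eps` leaves more points as noise.  
2. **min_samples**: Minimum number of points (including the point itself) within `eps` for a point to count as a core point. A higher `min_samples` demands denser regions, which typically yields fewer clusters and more noise.  
Together they set the density threshold that defines core, border, and noise points.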

18. How does K-Means++ improve upon standard K-Means initialization
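K-Means++ improves standard K-Means by choosing initial centroids more strategically:  
1. The first centroid is picked uniformly at random from the data points.  
2. Each subsequent centroid is sampled with probability proportional to the squared distance from a point to its nearest already-chosen centroid, so distant points are favored.  
This spreads the initial centroids out, which reduces the chance of a poor local optimum and usually speeds up convergence.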
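
A minimal sketch of this seeding procedure is shown below. The function name `kmeans_pp_init`, the `seed` parameter, and the sample points are assumptions for illustration. Each new centroid is drawn with weight equal to the squared distance to the nearest centroid chosen so far, the usual K-Means++ weighting.

```python
import math
import random

def kmeans_pp_init(points, k, seed=0):
    """Choose k initial centroids with distance-weighted sampling (K-Means++ style)."""
    rng = random.Random(seed)
    # First centroid: uniformly at random.
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Weight of each point: squared distance to its nearest chosen centroid.
        # Already-chosen points get weight 0 and cannot be picked again.
        weights = [min(math.dist(p, c) ** 2 for c in centroids) for p in points]
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids

# Hypothetical data: two tight groups far apart.
points = [(0.0, 0.0), (0.0, 0.1), (0.1, 0.0), (10.0, 10.0), (10.0, 10.1), (10.1, 10.0)]
centroids = kmeans_pp_init(points, k=2)
print(centroids)
```

Because points in the other group are roughly 20,000 times more heavily weighted than the immediate neighbors of the first centroid, the second centroid almost always lands in the other group. This sketch assumes the data contains at least `k` distinct points.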

19. What is agglomerative clustering
Agglomerative clustering is a bottom-up hierarchical clustering method.  
1. Each point starts as its own cluster.  
2. Closest clusters are merged step by step.  
3. This continues until all points are in one cluster or a stopping criterion is met.  
It builds a dendrogram to show the merging process.
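
The merging process can be traced by hand on a few points. The sketch below uses single linkage and records the distance at which each merge happens, which corresponds to the branch heights of the dendrogram. The function names and the four one-dimensional points are assumptions for illustration, and the brute-force search is only suitable for tiny inputs.

```python
import math

def agglomerative_single_linkage(points):
    """Merge clusters bottom-up; return a history of (members_a, members_b, distance)."""
    clusters = [[i] for i in range(len(points))]
    history = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest single-linkage distance.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(
                    math.dist(points[a], points[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        history.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return history

def clusters_at_height(history, n_points, height):
    """Number of clusters left when the dendrogram is cut at the given height."""
    return n_points - sum(1 for _, _, d in history if d <= height)

# Hypothetical one-dimensional data.
points = [(0.0,), (1.0,), (5.0,), (6.5,)]
history = agglomerative_single_linkage(points)
heights = [d for _, _, d in history]
print(heights)                                             # [1.0, 1.5, 4.0]
print(clusters_at_height(history, len(points), 2.0))       # 2
```

Cutting at height 2.0 keeps the first two merges and discards the last one, leaving the two clusters {0, 1} and {2, 3}.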

20. What makes Silhouette Score a better metric than just inertia for model evaluation?
Silhouette Score considers both **intra-cluster cohesion** and **inter-cluster separation**, unlike inertia, which only measures how close points are to their centroids.  
It gives a **normalized score** (−1 to 1), making it easier to interpret and compare across models, whereas inertia always decreases as K grows.  
It does not need centroids, so it can score the output of **any clustering algorithm**, although it still tends to favor compact, convex clusters.  
Hence, it is usually more **informative** than inertia alone.

21. Generate synthetic data with 4 centers using make_blobs and apply K-Means clustering. Visualize using a
scatter plot

In [None]:
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Generate data
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Apply K-Means
kmeans = KMeans(n_clusters=4, random_state=42)
y_kmeans = kmeans.fit_predict(X)

# Plot
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=200, c='red', marker='X', label='Centroids')
plt.title("K-Means Clustering with 4 Centers")
plt.legend()
plt.show()


22. Load the Iris dataset and use Agglomerative Clustering to group the data into 3 clusters. Display the first 10
predicted labels


In [None]:
from sklearn.datasets import load_iris
from sklearn.cluster import AgglomerativeClustering

# Load Iris data
data = load_iris()
X = data.data

# Apply Agglomerative Clustering
model = AgglomerativeClustering(n_clusters=3)
labels = model.fit_predict(X)

# Display first 10 predicted labels
print("First 10 labels:", labels[:10])


23. Generate synthetic data using make_moons and apply DBSCAN. Highlight outliers in the plot

In [None]:
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt

# Generate data
X, _ = make_moons(n_samples=300, noise=0.1, random_state=0)

# Apply DBSCAN
db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_

# Plot
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', edgecolor='k')
plt.scatter(X[labels == -1, 0], X[labels == -1, 1], color='red', label='Outliers')
plt.title("DBSCAN on make_moons")
plt.legend()
plt.show()


24. Load the Wine dataset and apply K-Means clustering after standardizing the features. Print the size of each
cluster

In [None]:
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import numpy as np

# Load the Wine dataset
data = load_wine()
X = data.data

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply K-Means clustering
kmeans = KMeans(n_clusters=3, random_state=0)
kmeans.fit(X_scaled)

# Print the size of each cluster
cluster_sizes = np.bincount(kmeans.labels_)
print("Cluster sizes:", cluster_sizes)


25. Use make_circles to generate synthetic data and cluster it using DBSCAN. Plot the result

In [None]:
import matplotlib.pyplot as plt
from sklearn.datasets import make_circles
from sklearn.cluster import DBSCAN

# Generate synthetic data with make_circles (fixed seed for reproducibility)
X, _ = make_circles(n_samples=300, factor=0.5, noise=0.1, random_state=42)

# Apply DBSCAN clustering
dbscan = DBSCAN(eps=0.1, min_samples=5)
labels = dbscan.fit_predict(X)

# Plot the result
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.title('DBSCAN Clustering on Synthetic Data (make_circles)')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()


26.  Load the Breast Cancer dataset, apply MinMaxScaler, and use K-Means with 2 clusters. Output the cluster
centroids

In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans

# Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data

# Apply MinMaxScaler
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# Apply K-Means clustering with 2 clusters
kmeans = KMeans(n_clusters=2, random_state=42)
kmeans.fit(X_scaled)

# Output the cluster centroids
centroids = kmeans.cluster_centers_
print("Cluster Centroids:\n", centroids)


27. Generate synthetic data using make_blobs with varying cluster standard deviations and cluster with
DBSCAN

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN

# Generate synthetic data with varying cluster standard deviations
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=[1.0, 2.0, 0.5], random_state=42)

# Apply DBSCAN clustering
dbscan = DBSCAN(eps=0.5, min_samples=5)
y_dbscan = dbscan.fit_predict(X)

# Plot the DBSCAN result
plt.scatter(X[:, 0], X[:, 1], c=y_dbscan, cmap='viridis')
plt.title("DBSCAN Clustering on Synthetic Data with Varying Cluster Std")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()


28. Load the Digits dataset, reduce it to 2D using PCA, and visualize clusters from K-Means

In [None]:
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import load_digits
from sklearn.cluster import KMeans

# Load the Digits dataset
digits = load_digits()
X = digits.data

# Reduce the data to 2D using PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Apply K-Means clustering
kmeans = KMeans(n_clusters=10, random_state=42)
kmeans_labels = kmeans.fit_predict(X_pca)

# Visualize the clusters
plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=kmeans_labels, cmap='viridis')
plt.title('K-Means Clusters on Digits Dataset (PCA-reduced)')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.colorbar(label='Cluster')
plt.show()


29. Create synthetic data using make_blobs and evaluate silhouette scores for k = 2 to 5. Display as a bar chart

In [None]:
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Generate synthetic data using make_blobs
X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

# List to store silhouette scores
silhouette_scores = []

# Evaluate silhouette scores for k = 2 to 5
for k in range(2, 6):
    kmeans = KMeans(n_clusters=k, random_state=42)
    labels = kmeans.fit_predict(X)
    score = silhouette_score(X, labels)
    silhouette_scores.append(score)

# Display the silhouette scores as a bar chart
plt.bar(range(2, 6), silhouette_scores, color='skyblue')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Scores for Different Values of k')
plt.xticks(range(2, 6))
plt.show()


30. Load the Iris dataset and use hierarchical clustering to group data. Plot a dendrogram with average linkage

In [None]:
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from scipy.cluster.hierarchy import dendrogram, linkage

# Load the Iris dataset
iris = load_iris()
X = iris.data

# Perform hierarchical clustering using average linkage
Z = linkage(X, method='average')

# Plot the dendrogram
plt.figure(figsize=(10, 7))
dendrogram(Z)
plt.title('Dendrogram for Iris Dataset (Average Linkage)')
plt.xlabel('Sample index')
plt.ylabel('Distance')
plt.show()


31. Generate synthetic data with overlapping clusters using make_blobs, then apply K-Means and visualize with
decision boundaries

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from matplotlib.colors import ListedColormap

# Generate synthetic data with overlapping clusters
X, y = make_blobs(n_samples=300, centers=3, cluster_std=2.5, random_state=42)

# Apply K-Means clustering
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)

# Define a function to plot decision boundaries
def plot_decision_boundaries(X, y, model, title="Decision Boundaries"):
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1

    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                         np.arange(y_min, y_max, 0.1))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    cmap_background = ListedColormap(['#FFAAAA', '#AAAAFF', '#AAFFAA'])
    cmap_points = ListedColormap(['#FF0000', '#0000FF', '#00FF00'])

    plt.contourf(xx, yy, Z, alpha=0.8, cmap=cmap_background)
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_points, edgecolors='k', s=30)
    plt.title(title)
    plt.show()

# Visualize decision boundaries
plot_decision_boundaries(X, y_kmeans, kmeans, title="K-Means Clustering with Decision Boundaries")


32. Load the Digits dataset and apply DBSCAN after reducing dimensions with t-SNE. Visualize the results

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Load the Digits dataset
digits = load_digits()
X = digits.data
y = digits.target

# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Reduce dimensions with t-SNE
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X_scaled)

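# Apply DBSCAN clustering
# t-SNE coordinates usually span tens of units, so a small eps such as 0.5 can
# label nearly every point as noise; eps=3.0 is an assumed starting value to tune
dbscan = DBSCAN(eps=3.0, min_samples=5)
y_dbscan = dbscan.fit_predict(X_tsne)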

# Plot the results
plt.figure(figsize=(8, 6))
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y_dbscan, cmap='viridis', edgecolors='k', s=50)
plt.title('DBSCAN on Digits Dataset after t-SNE Dimensionality Reduction')
plt.xlabel('t-SNE component 1')
plt.ylabel('t-SNE component 2')
plt.colorbar(label='Cluster')
plt.show()


33. Generate synthetic data using make_blobs and apply Agglomerative Clustering with complete linkage. Plot
the result

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering

# Generate synthetic data
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Apply Agglomerative Clustering with complete linkage
agg_clust = AgglomerativeClustering(linkage='complete', n_clusters=4)
y_agg = agg_clust.fit_predict(X)

# Plot the results
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=y_agg, cmap='viridis', edgecolors='k', s=50)
plt.title('Agglomerative Clustering with Complete Linkage')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()


34. Load the Breast Cancer dataset and compare inertia values for K = 2 to 6 using K-Means. Show results in a
line plot

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# List to store inertia values
inertia_values = []

# Apply K-Means for K = 2 to 6 and calculate inertia
for k in range(2, 7):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X_scaled)
    inertia_values.append(kmeans.inertia_)

# Plot the inertia values for different K
plt.figure(figsize=(8, 6))
plt.plot(range(2, 7), inertia_values, marker='o', linestyle='-', color='b')
plt.title('Inertia Values for Different K in K-Means')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia')
plt.xticks(range(2, 7))
plt.grid(True)
plt.show()


35. Generate synthetic concentric circles using make_circles and cluster using Agglomerative Clustering with
single linkage

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_circles
from sklearn.cluster import AgglomerativeClustering

# Generate synthetic data with concentric circles (fixed seed for reproducibility)
X, _ = make_circles(n_samples=300, factor=0.5, noise=0.1, random_state=42)

# Apply Agglomerative Clustering with single linkage
agg_clust = AgglomerativeClustering(n_clusters=2, linkage='single')
labels = agg_clust.fit_predict(X)

# Plot the result
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.title('Agglomerative Clustering with Single Linkage on Concentric Circles')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()


36. Use the Wine dataset, apply DBSCAN after scaling the data, and count the number of clusters (excluding noise)

In [None]:
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

# Load Wine dataset
wine = load_wine()
X = wine.data

# Scale the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

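# Apply DBSCAN
# Note: with 13 standardized features, pairwise distances are often well above 0.5,
# so this eps may label every sample as noise; increase eps if the count below is 0
dbscan = DBSCAN(eps=0.5, min_samples=5)
labels = dbscan.fit_predict(X_scaled)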

# Count the number of clusters (excluding noise)
# Noise points are labeled as -1 by DBSCAN
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)

print(f"Number of clusters (excluding noise): {n_clusters}")


37. Generate synthetic data with make_blobs and apply KMeans. Then plot the cluster centers on top of the
data points

In [None]:
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Generate synthetic data with 3 clusters
X, y = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)

# Apply KMeans clustering
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X)

# Plot the data points
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, cmap='viridis', s=50)

# Plot the cluster centers
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            marker='x', color='red', s=200, label='Centroids')

# Add labels and title
plt.title('K-Means Clustering with Cluster Centers')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()

# Show the plot
plt.show()


38.  Load the Iris dataset, cluster with DBSCAN, and print how many samples were identified as noise

In [None]:
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

# Load the Iris dataset
iris = load_iris()
X = iris.data

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply DBSCAN clustering
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan.fit(X_scaled)

# Identify noise points (label -1 indicates noise)
noise_samples = sum(dbscan.labels_ == -1)

print(f'Number of noise samples: {noise_samples}')


39. Generate synthetic non-linearly separable data using make_moons, apply K-Means, and visualize the
clustering result

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans

# Generate synthetic non-linearly separable data
X, _ = make_moons(n_samples=300, noise=0.1, random_state=42)

# Apply K-Means clustering
kmeans = KMeans(n_clusters=2, random_state=42)
y_kmeans = kmeans.fit_predict(X)

# Plot the clustering result
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, cmap='viridis', marker='o', s=50)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=200, c='red', marker='X', label='Centroids')
plt.title('K-Means Clustering on make_moons Dataset')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.show()


40.  Load the Digits dataset, apply PCA to reduce to 3 components, then use KMeans and visualize with a 3D
scatter plot

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from mpl_toolkits.mplot3d import Axes3D

# Load the Digits dataset
digits = load_digits()
X = digits.data

# Apply PCA to reduce to 3 components
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X)

# Apply KMeans clustering
kmeans = KMeans(n_clusters=10, random_state=42)
y_kmeans = kmeans.fit_predict(X_pca)

# 3D scatter plot
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')
scatter = ax.scatter(X_pca[:, 0], X_pca[:, 1], X_pca[:, 2], c=y_kmeans, cmap='viridis', s=50)
ax.set_title('K-Means Clustering on PCA-reduced Digits Dataset')
ax.set_xlabel('PCA Component 1')
ax.set_ylabel('PCA Component 2')
ax.set_zlabel('PCA Component 3')

# Add color bar
plt.colorbar(scatter)
plt.show()


41. Generate synthetic blobs with 5 centers and apply KMeans. Then use silhouette_score to evaluate the
clustering

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Generate synthetic data with 5 centers
X, y = make_blobs(n_samples=1000, centers=5, cluster_std=1.0, random_state=42)

# Apply KMeans clustering
kmeans = KMeans(n_clusters=5, random_state=42)
y_kmeans = kmeans.fit_predict(X)

# Evaluate clustering performance using silhouette score
sil_score = silhouette_score(X, y_kmeans)

# Print silhouette score
print(f'Silhouette Score: {sil_score:.3f}')

# Visualize the clustering result
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, cmap='viridis', s=50)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=200, c='red', marker='X')
plt.title('KMeans Clustering with 5 Centers')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
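
A single silhouette score is easier to interpret next to scores for other values of K. The sketch below repeats the clustering on the same blobs for K from 2 to 8 and reports the K with the highest score. The range of K and the explicit `n_init=10` are assumptions made for illustration, and the best-scoring K need not equal the number of generated centers when blobs overlap.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Same synthetic data as in the cell above
X, _ = make_blobs(n_samples=1000, centers=5, cluster_std=1.0, random_state=42)

# Silhouette score for each candidate K (range chosen arbitrarily)
scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(f'K={k}: silhouette={scores[k]:.3f}')

# K with the highest silhouette score
best_k = max(scores, key=scores.get)
print(f'Highest silhouette score at K={best_k}')
```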


42. Load the Breast Cancer dataset, reduce dimensionality using PCA, and apply Agglomerative Clustering. Visualize in 2D

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler

# Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Reduce dimensionality to 2D using PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Apply Agglomerative Clustering
agg_clust = AgglomerativeClustering(n_clusters=2)  # Assuming 2 clusters
y_pred = agg_clust.fit_predict(X_pca)

# Visualize the clustering result in 2D
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y_pred, cmap='viridis', s=50)
plt.title('Agglomerative Clustering on PCA-reduced Breast Cancer Data')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.colorbar(label='Cluster')
plt.show()
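
The cell above loads the diagnosis labels but never uses them. One possible use, sketched below, is to compare them with the two clusters via `adjusted_rand_score`. The choice of metric is an assumption made here for illustration; the labels play no part in the clustering itself.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

# Same pipeline as above: scale, project to 2D, cluster
data = load_breast_cancer()
X_scaled = StandardScaler().fit_transform(data.data)
X_pca = PCA(n_components=2).fit_transform(X_scaled)
y_pred = AgglomerativeClustering(n_clusters=2).fit_predict(X_pca)

# Compare clusters with the malignant/benign labels
# (ARI ignores which cluster happens to be numbered 0 or 1)
ari = adjusted_rand_score(data.target, y_pred)
print(f'Adjusted Rand Index vs. diagnosis: {ari:.3f}')
```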


43. Generate noisy circular data using make_circles and visualize clustering results from KMeans and DBSCAN side-by-side

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_circles
from sklearn.cluster import KMeans, DBSCAN
from sklearn.preprocessing import StandardScaler

# Generate noisy circular data using make_circles (seeded for reproducibility)
X, _ = make_circles(n_samples=300, noise=0.1, factor=0.5, random_state=42)

# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply KMeans clustering
kmeans = KMeans(n_clusters=2, random_state=42)
kmeans_labels = kmeans.fit_predict(X_scaled)

# Apply DBSCAN clustering
dbscan = DBSCAN(eps=0.2, min_samples=5)
dbscan_labels = dbscan.fit_predict(X_scaled)

# Create a figure to visualize both clustering results side-by-side
fig, axs = plt.subplots(1, 2, figsize=(12, 6))

# KMeans visualization
axs[0].scatter(X_scaled[:, 0], X_scaled[:, 1], c=kmeans_labels, cmap='viridis', s=50)
axs[0].set_title('KMeans Clustering')
axs[0].set_xlabel('Feature 1')
axs[0].set_ylabel('Feature 2')

# DBSCAN visualization (any points flagged as noise carry the label -1)
axs[1].scatter(X_scaled[:, 0], X_scaled[:, 1], c=dbscan_labels, cmap='viridis', s=50)
axs[1].set_title('DBSCAN Clustering')
axs[1].set_xlabel('Feature 1')
axs[1].set_ylabel('Feature 2')

plt.tight_layout()
plt.show()


44. Load the Iris dataset and plot the Silhouette Coefficient for each sample after KMeans clustering

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples
from sklearn.preprocessing import StandardScaler

# Load the Iris dataset
iris = load_iris()
X = iris.data

# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply KMeans clustering with 3 clusters
kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(X_scaled)

# Compute the Silhouette Coefficient for each sample
silhouette_vals = silhouette_samples(X_scaled, labels)

# Plot the Silhouette Coefficients
plt.figure(figsize=(8, 6))
plt.bar(range(len(silhouette_vals)), silhouette_vals, color='skyblue', edgecolor='black')
plt.axhline(y=0, color='black', linestyle='--')
plt.title('Silhouette Coefficients for Each Sample')
plt.xlabel('Sample Index')
plt.ylabel('Silhouette Coefficient')
plt.show()
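
The per-sample values can also be summarized per cluster, which is often easier to read than 150 individual bars. The sketch below is an illustrative addition and assumes an explicit `n_init=10`. It prints the mean coefficient for each cluster and checks that the mean over all samples matches `silhouette_score`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.preprocessing import StandardScaler

# Same preprocessing and clustering as in the cell above
X_scaled = StandardScaler().fit_transform(load_iris().data)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_scaled)
silhouette_vals = silhouette_samples(X_scaled, labels)

# Mean silhouette coefficient and size of each cluster
for cluster in np.unique(labels):
    cluster_vals = silhouette_vals[labels == cluster]
    print(f'Cluster {cluster}: n={cluster_vals.size}, '
          f'mean silhouette={cluster_vals.mean():.3f}')

# silhouette_score is the mean of the per-sample coefficients
overall = silhouette_score(X_scaled, labels)
print(f'Overall silhouette score: {overall:.3f}')
```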


45. Generate synthetic data using make_blobs and apply Agglomerative Clustering with 'average' linkage. Visualize clusters

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering

# Generate synthetic data using make_blobs
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=42)

# Apply Agglomerative Clustering with 'average' linkage
agg_clustering = AgglomerativeClustering(n_clusters=4, linkage='average')
labels = agg_clustering.fit_predict(X)

# Visualize the resulting clusters
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', marker='o')
plt.title('Agglomerative Clustering (Average Linkage)')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()


46. Load the Wine dataset, apply KMeans, and visualize the cluster assignments in a seaborn pairplot (first 4 features)

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.cluster import KMeans
import pandas as pd

# Load the Wine dataset
wine = load_wine()
X = wine.data[:, :4]  # First 4 features
y = wine.target  # Labels (not used in clustering, but can be used for comparison)

# Apply KMeans clustering
kmeans = KMeans(n_clusters=3, random_state=42)
cluster_labels = kmeans.fit_predict(X)

# Create a DataFrame for the features and cluster labels
# (only the first 4 feature names, to match the 4 selected columns)
df = pd.DataFrame(X, columns=wine.feature_names[:4])
df['Cluster'] = cluster_labels

# Visualize the cluster assignments using a pairplot
sns.pairplot(df, hue='Cluster', palette='viridis')
plt.suptitle('Wine Dataset - KMeans Clustering (First 4 Features)', y=1.02)
plt.show()


47. Generate noisy blobs using make_blobs and use DBSCAN to identify both clusters and noise points. Print the count

In [None]:
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt

# Generate synthetic data with noisy blobs
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)
rng = np.random.default_rng(42)  # Seeded so the noise points are reproducible
X = np.vstack([X, rng.uniform(low=-6, high=6, size=(50, 2))])  # Adding noise points

# Apply DBSCAN
db = DBSCAN(eps=0.5, min_samples=5)
labels = db.fit_predict(X)

# Count the number of clusters and noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # -1 indicates noise
n_noise = list(labels).count(-1)

# Print the results
print(f'Number of clusters: {n_clusters}')
print(f'Number of noise points: {n_noise}')

# Visualize the result
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', marker='o')
plt.title("DBSCAN Clustering (Noise points are labeled -1)")
plt.show()
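
Beyond the two totals, it can be useful to see how many points fall under each label. The sketch below rebuilds the same kind of data and tallies the labels with `np.unique(..., return_counts=True)`. The seeded generator for the uniform noise is an assumption made here so that the counts are reproducible.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN

# Blobs plus uniformly scattered points (seed 42 is an assumption)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)
rng = np.random.default_rng(42)
X = np.vstack([X, rng.uniform(low=-6, high=6, size=(50, 2))])

# Same DBSCAN settings as in the cell above
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# Tally points per label; -1 is DBSCAN's label for noise
unique_labels, counts = np.unique(labels, return_counts=True)
for label, count in zip(unique_labels, counts):
    name = 'noise' if label == -1 else f'cluster {label}'
    print(f'{name}: {count} points')
```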


48. Load the Digits dataset, reduce dimensions using t-SNE, then apply Agglomerative Clustering and plot the clusters

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.cluster import AgglomerativeClustering

# Load the Digits dataset
digits = load_digits()
X = digits.data

# Reduce dimensionality using t-SNE
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X)

# Apply Agglomerative Clustering
agg_clustering = AgglomerativeClustering(n_clusters=10)
labels = agg_clustering.fit_predict(X_tsne)

# Plot the clusters
plt.figure(figsize=(8, 6))
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=labels, cmap='tab10', s=50, marker='o')
plt.title('Agglomerative Clustering on Digits Dataset (t-SNE reduced)')
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')
plt.colorbar(label='Cluster')
plt.show()
