# Clustering Methods on MNIST Dataset

This notebook demonstrates the use of various clustering methods on the MNIST dataset. We will compare the performance of different clustering techniques using evaluation metrics such as Silhouette Score and Dunn Index. The dataset is located in the `./data` folder.

## 1. Introduction

- Brief introduction to clustering and its applications.
- Overview of the clustering methods we'll apply:
  - K-means Clustering
  - Agglomerative Hierarchical Clustering
  - DBSCAN
- Evaluation Metrics:
  - Inertia (for K-means)
  - Silhouette Score
  - Dunn Index

## 2. Load MNIST Dataset

```python
import pandas as pd
import utils  # Assuming utils.py contains functions to load data
from sklearn.preprocessing import StandardScaler

# Load training and test data
train_path = './data/mnist_train.csv'
test_path = './data/mnist_test.csv'

X_train, y_train = utils.load_data(train_path, test_path)

# Preprocess data (scaling)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Display shape of the dataset
print(f"Training data shape: {X_train_scaled.shape}")
```


## 3. Clustering Methods
### 3.1. K-means Clustering
K-means is a partitioning method that divides the dataset into K clusters by minimizing the within-cluster sum of squared distances to the cluster centroids (inertia).

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Initialize KMeans with 10 clusters (for 10 digits)
kmeans = KMeans(n_clusters=10, random_state=42)

# Fit the model
kmeans.fit(X_train_scaled)

# Predict the clusters
kmeans_labels = kmeans.predict(X_train_scaled)

# Evaluation metrics
inertia = kmeans.inertia_  # Measure of how tightly clusters are packed
silhouette_avg = silhouette_score(X_train_scaled, kmeans_labels)

print(f'K-means Inertia: {inertia}')
print(f'K-means Silhouette Score: {silhouette_avg}')
```
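
To make the inertia definition concrete, the following sketch recomputes it by hand and compares the result with `kmeans.inertia_`. It runs on a small synthetic dataset from `make_blobs` rather than MNIST; the data and the names `X_demo`, `km_demo`, and `manual_inertia` are assumptions made for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for MNIST (assumption: 300 points, 5 features, 3 blobs)
X_demo, _ = make_blobs(n_samples=300, n_features=5, centers=3, random_state=0)

km_demo = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)

# Inertia = sum of squared distances from each point to its assigned centroid
assigned_centroids = km_demo.cluster_centers_[km_demo.labels_]
manual_inertia = float(np.sum((X_demo - assigned_centroids) ** 2))

print(f"manual inertia:  {manual_inertia:.4f}")
print(f"kmeans.inertia_: {km_demo.inertia_:.4f}")
```

The two printed values should agree up to floating-point error, which shows that inertia depends only on the final centroids and assignments.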


### 3.2. Agglomerative Hierarchical Clustering
Agglomerative clustering builds a hierarchy of clusters bottom-up: each sample starts as its own cluster, and the two closest clusters are merged repeatedly according to a linkage criterion.


```python
from sklearn.cluster import AgglomerativeClustering
import scipy.cluster.hierarchy as sch

# Fit the agglomerative clustering model
agg_clustering = AgglomerativeClustering(n_clusters=10)
agg_labels = agg_clustering.fit_predict(X_train_scaled)

# Evaluation metrics
silhouette_avg = silhouette_score(X_train_scaled, agg_labels)

print(f'Hierarchical Clustering Silhouette Score: {silhouette_avg}')
```
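
Agglomerative clustering without a connectivity constraint needs memory that grows roughly quadratically with the number of samples, so fitting it on a random subsample is one possible workaround for large datasets. The sketch below illustrates that idea on synthetic data; the subsample size `n_sub` and the names `X_demo`, `X_sub`, and `agg_sub_labels` are hypothetical and not part of the notebook above.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Synthetic stand-in for the scaled MNIST matrix (assumption)
X_demo, _ = make_blobs(n_samples=2000, n_features=20, centers=10, random_state=0)

# Draw a random subsample without replacement (n_sub is a hypothetical choice)
rng = np.random.default_rng(42)
n_sub = 500
sub_idx = rng.choice(X_demo.shape[0], size=n_sub, replace=False)
X_sub = X_demo[sub_idx]

# Ward linkage is the scikit-learn default for AgglomerativeClustering
agg_demo = AgglomerativeClustering(n_clusters=10, linkage='ward')
agg_sub_labels = agg_demo.fit_predict(X_sub)

print(f"Subsample shape: {X_sub.shape}, clusters found: {len(set(agg_sub_labels))}")
```

Any metric computed on such labels has to be evaluated on the same subsample, since the remaining rows receive no label.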


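#### 3.2.1. Dendrogram
The dendrogram visualizes the hierarchy of clusters. Only the first 500 samples are used.

```python
import matplotlib.pyplot as plt

# Create a dendrogram (Ward linkage) to visualize the hierarchy of clusters
dendrogram = sch.dendrogram(sch.linkage(X_train_scaled[:500], method='ward'))
plt.show()
```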


### 3.3. DBSCAN (Density-Based Clustering)
DBSCAN is a density-based clustering method that groups points that are closely packed together while marking points that are far from others as noise.

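```python
from sklearn.cluster import DBSCAN

# Fit DBSCAN; points labeled -1 are noise.
# eps and min_samples usually need tuning for high-dimensional data.
dbscan = DBSCAN(eps=3, min_samples=5)
dbscan_labels = dbscan.fit_predict(X_train_scaled)

# Evaluation metrics: score only non-noise points, and only if at least
# two clusters were found (silhouette_score raises an error otherwise)
mask = dbscan_labels != -1
n_clusters = len(set(dbscan_labels[mask]))
if n_clusters >= 2:
    silhouette_avg = silhouette_score(X_train_scaled[mask], dbscan_labels[mask])
    print(f'DBSCAN Silhouette Score: {silhouette_avg}')
else:
    print(f'DBSCAN found {n_clusters} cluster(s); the silhouette score is undefined')
```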
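
The noise label is easier to see on a toy example. The sketch below uses a hand-made 2D dataset (an assumption for illustration, unrelated to MNIST) with two tight groups and one isolated point, and shows how the number of clusters and the number of noise points can be read off the labels.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight groups plus one far-away point (toy data, assumed for illustration)
X_toy = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1],
    [20.0, 20.0],
])

toy_labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X_toy)

# Noise points get the label -1 and do not count as a cluster
n_noise = int(np.sum(toy_labels == -1))
n_found = len(set(toy_labels)) - (1 if -1 in toy_labels else 0)

print(f"labels: {toy_labels}")
print(f"clusters: {n_found}, noise points: {n_noise}")
```

Here the isolated point has no neighbors within `eps`, so it is marked as noise, while each group of four points forms its own cluster.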


## 4. Evaluation of Clustering Methods
### 4.1. Silhouette Score
The silhouette score measures how similar an object is to its own cluster compared to other clusters. Values range from -1 to 1, where higher is better.
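
As a worked illustration, the silhouette value of a single sample is `(b - a) / max(a, b)`, where `a` is its mean distance to the other members of its own cluster and `b` is the smallest mean distance to the members of any other cluster. The sketch below computes this by hand on a four-point toy dataset and checks it against `silhouette_samples`; the helper `silhouette_of_point` and the toy data are hypothetical, and the helper assumes every cluster has at least two points.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Toy 1-D data with two clusters (an assumption for illustration)
X_toy = np.array([[0.0], [1.0], [10.0], [11.0]])
labels_toy = np.array([0, 0, 1, 1])

def silhouette_of_point(X, labels, i):
    """Silhouette value of sample i: (b - a) / max(a, b)."""
    dists = np.linalg.norm(X - X[i], axis=1)
    same = labels == labels[i]
    same[i] = False  # exclude the point itself from its own cluster
    a = dists[same].mean()  # mean distance to the rest of its own cluster
    # b: smallest mean distance to the members of any other cluster
    b = min(dists[labels == k].mean() for k in set(labels) if k != labels[i])
    return (b - a) / max(a, b)

manual = np.array([silhouette_of_point(X_toy, labels_toy, i)
                   for i in range(len(X_toy))])
reference = silhouette_samples(X_toy, labels_toy)

print(f"manual:    {manual}")
print(f"reference: {reference}")
```

For the first point, `a = 1` and `b = 10.5`, giving `9.5 / 10.5`. The overall silhouette score is the mean of these per-sample values.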

```python
from sklearn.metrics import silhouette_score

# Calculate silhouette scores for each method
kmeans_silhouette = silhouette_score(X_train_scaled, kmeans_labels)
agg_silhouette = silhouette_score(X_train_scaled, agg_labels)

# DBSCAN: exclude noise (-1); the score is undefined with fewer than two clusters
mask = dbscan_labels != -1
if len(set(dbscan_labels[mask])) >= 2:
    dbscan_silhouette = silhouette_score(X_train_scaled[mask], dbscan_labels[mask])
else:
    dbscan_silhouette = float('nan')

print(f'K-means Silhouette Score: {kmeans_silhouette}')
print(f'Hierarchical Clustering Silhouette Score: {agg_silhouette}')
print(f'DBSCAN Silhouette Score: {dbscan_silhouette}')
```


### 4.2. Dunn Index (optional)
Dunn Index is another clustering evaluation metric that aims to identify clusters that are compact and well-separated.

```python
# Function for Dunn Index: `dunn_index` is assumed to be an external library
# or a custom module that exposes dunn(X, labels)
from dunn_index import dunn

kmeans_dunn = dunn(X_train_scaled, kmeans_labels)
agg_dunn = dunn(X_train_scaled, agg_labels)
dbscan_dunn = dunn(X_train_scaled, dbscan_labels)

print(f'K-means Dunn Index: {kmeans_dunn}')
print(f'Hierarchical Clustering Dunn Index: {agg_dunn}')
print(f'DBSCAN Dunn Index: {dbscan_dunn}')
```
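
If no such module is available, a custom function is one option. The sketch below implements one common variant of the Dunn index (smallest distance between points of different clusters, divided by the largest cluster diameter). The name `dunn_index_sketch` is hypothetical and is not the `dunn` function imported above. Because it computes pairwise distances, it is only practical on a subsample of MNIST.

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist

def dunn_index_sketch(X, labels):
    """Dunn index: smallest between-cluster distance / largest cluster diameter.

    Hypothetical helper. Noise points (label -1) are ignored.
    Cost is quadratic in the number of samples.
    """
    X = np.asarray(X)
    labels = np.asarray(labels)
    clusters = [X[labels == k] for k in np.unique(labels) if k != -1]
    if len(clusters) < 2:
        return float('nan')
    # Largest pairwise distance inside any single cluster
    max_diameter = max(pdist(c).max() if len(c) > 1 else 0.0 for c in clusters)
    # Smallest distance between points of two different clusters
    min_separation = min(
        cdist(clusters[i], clusters[j]).min()
        for i in range(len(clusters))
        for j in range(i + 1, len(clusters))
    )
    return min_separation / max_diameter if max_diameter > 0 else float('inf')

# Toy check: two clusters of diameter 1 that are 4 units apart
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]])
labels_toy = np.array([0, 0, 1, 1])
dunn_toy = dunn_index_sketch(X_toy, labels_toy)
print(f"Dunn index (toy data): {dunn_toy}")
```

Higher values indicate clusters that are compact relative to the gaps between them.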


## 5. Save Results to Results Folder
Save clustering results, metrics, and visualizations to the `./results/` folder.

### 5.1. Save Silhouette Score and Dunn Index

```python
import os

# Create the results folder if it does not exist yet
os.makedirs('./results', exist_ok=True)

# Save silhouette scores
with open('./results/silhouette_scores.txt', 'w') as f:
    f.write(f'K-means Silhouette Score: {kmeans_silhouette}\n')
    f.write(f'Hierarchical Clustering Silhouette Score: {agg_silhouette}\n')
    f.write(f'DBSCAN Silhouette Score: {dbscan_silhouette}\n')

# Save Dunn Index (optional)
with open('./results/dunn_index.txt', 'w') as f:
    f.write(f'K-means Dunn Index: {kmeans_dunn}\n')
    f.write(f'Hierarchical Clustering Dunn Index: {agg_dunn}\n')
    f.write(f'DBSCAN Dunn Index: {dbscan_dunn}\n')
```


### 5.2. Save Clustering Visualizations

```python
import os
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Create the results folder if it does not exist yet
os.makedirs('./results', exist_ok=True)

# Function to visualize clusters on a 2D projection (e.g., PCA)
def plot_clusters(X, labels, title, filename):
    plt.figure(figsize=(8, 6))
    # Small markers and a categorical colormap keep dense plots readable
    plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='tab10', s=5)
    plt.title(title)
    plt.savefig(f'./results/{filename}')
    plt.show()

# Plot K-means clusters using PCA projection
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_train_scaled)
plot_clusters(X_pca, kmeans_labels, 'K-means Clustering', 'kmeans_clusters.png')

# Similar plots for other methods...
```


## 6. Conclusion
- Compare the clustering methods based on the evaluation metrics and visualizations.
- Summarize the strengths and weaknesses of each method.
- Choose the best method for the MNIST dataset based on metrics like the Silhouette Score and Dunn Index.


---

This notebook covers a range of clustering methods, uses metrics to compare them, and saves the results in a structured manner. It includes methods like K-means, Agglomerative Clustering, and DBSCAN, along with their evaluation and visualization steps. The MNIST dataset is used as an example to demonstrate each method’s performance.
