# Unsupervised Learning
In this notebook, we apply three clustering algorithms (DBSCAN, Agglomerative Clustering, and K-Means) to patient data and compare the resulting clusterings using the silhouette score.

## DBSCAN Clustering

In [None]:
import pandas as pd
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Load the preprocessed dataset
df = pd.read_csv("cleaned_dataset.csv")

# Drop unnecessary columns, in particular the class label 'Disease'
# (errors='ignore' skips any listed column that is not present)
df = df.drop(columns=['nan', 'Disease'], errors='ignore')

# Standardize each feature to zero mean and unit variance
# (this assumes all remaining columns are numeric)
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df)

# Apply DBSCAN algorithm with pre-defined parameters
dbscan = DBSCAN(eps=0.8, min_samples=15)
labels = dbscan.fit_predict(df_scaled)

# Store the cluster labels
df['Cluster'] = labels

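# Count clusters and noise points (DBSCAN labels noise as -1)
num_clusters = len(set(labels)) - (1 if -1 in labels else 0)
noise = int((labels == -1).sum())

# Evaluate clustering using Silhouette Score on non-noise points only,
# so that the noise label -1 is not scored as if it were a cluster
mask = labels != -1
silhouette = silhouette_score(df_scaled[mask], labels[mask]) if num_clusters > 1 else "N/A"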

print(f"Number of clusters: {num_clusters}")
print(f"Noise points: {noise}")
print(f"Silhouette Score: {silhouette}")
print(df['Cluster'].value_counts())

# Visualize the clustering result using PCA
pca = PCA(n_components=2)
data_2d = pca.fit_transform(df_scaled)
colors = ['gray' if label == -1 else f'C{label}' for label in labels]

plt.figure(figsize=(8, 6))
plt.scatter(data_2d[:, 0], data_2d[:, 1], c=colors, s=50)
plt.title('DBSCAN Clustering Visualization (eps=0.8, min_samples=15)')
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.grid(True)
plt.show()
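
The values `eps=0.8` and `min_samples=15` above are fixed by hand. One common heuristic for checking `eps` is the k-distance curve: sort every point's distance to its k-th nearest neighbor and look for the "knee" where the curve bends sharply upward. The sketch below is illustrative only. The helper name `k_distance_curve` and the synthetic two-blob data are assumptions standing in for `df_scaled`, and `k=15` simply mirrors `min_samples`.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def k_distance_curve(X, k):
    """Return each point's distance to its k-th nearest neighbor, sorted ascending."""
    # n_neighbors=k + 1 because each query point is returned as its own
    # nearest neighbor (distance 0) when X is also the fitted data.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    distances, _ = nn.kneighbors(X)
    return np.sort(distances[:, -1])

# Synthetic stand-in for df_scaled (assumption: two Gaussian blobs)
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(0.0, 0.3, (100, 2)),
    rng.normal(3.0, 0.3, (100, 2)),
])

curve = k_distance_curve(X_demo, k=15)
print(f"min={curve[0]:.3f}, median={np.median(curve):.3f}, max={curve[-1]:.3f}")
```

Plotting `curve` against its index and reading off the distance at the knee gives a candidate `eps`. This is a rule of thumb, not a guarantee of good clusters.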

## Agglomerative Clustering (HAC)

In [None]:
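# Note: this cell reuses df, df_scaled, pca, plt, and silhouette_score
# from the DBSCAN cell above, so run that cell first.
from sklearn.cluster import AgglomerativeClustering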

# Try different values for the number of clusters and find the best based on silhouette score
best_score = -1
best_k = 0
best_labels = None

for k in range(2, 8):
    hac = AgglomerativeClustering(n_clusters=k, linkage='ward')
    labels = hac.fit_predict(df_scaled)
    score = silhouette_score(df_scaled, labels)
    if score > best_score:
        best_score = score
        best_k = k
        best_labels = labels

# Assign best clustering result to DataFrame
df['Cluster'] = best_labels

print(f"Best number of clusters: {best_k}")
print(f"Best Silhouette Score: {best_score:.4f}")
print(df['Cluster'].value_counts())

# Visualize the HAC clustering result using PCA
data_2d = pca.fit_transform(df_scaled)

plt.figure(figsize=(8, 6))
plt.scatter(data_2d[:, 0], data_2d[:, 1], c=best_labels, cmap='Set1', s=50)
plt.title(f'HAC Clustering Visualization (k={best_k})')
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.colorbar(label='Cluster')
plt.grid(True)
plt.show()
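
The loop above refits `AgglomerativeClustering` once per candidate `k`. As an alternative sketch, SciPy can build the Ward merge tree once and then cut it at several values of `k`. This swaps in `scipy.cluster.hierarchy` for scikit-learn's estimator, so the labels are not guaranteed to match the loop above exactly. The three-blob data and the `_demo` names are assumptions for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import silhouette_score

# Synthetic stand-in for df_scaled (assumption: three well-separated blobs)
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X_demo = np.vstack([rng.normal(c, 0.5, (50, 2)) for c in centers])

# Build the Ward merge tree once
Z = linkage(X_demo, method="ward")

# Cut the same tree at each candidate k and score the resulting labels
scores_demo = {}
for k in range(2, 8):
    labels_k = fcluster(Z, t=k, criterion="maxclust")
    scores_demo[k] = silhouette_score(X_demo, labels_k)

best_k_demo = max(scores_demo, key=scores_demo.get)
print(f"Best k on the demo data: {best_k_demo}")
```

The same matrix `Z` can also be passed to `scipy.cluster.hierarchy.dendrogram` to inspect the merge heights visually.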

## K-Means Clustering

In [None]:
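# Note: this cell reuses df, df_scaled, pca, plt, and silhouette_score
# from the DBSCAN cell above, so run that cell first.
from sklearn.cluster import KMeans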

# Use the Elbow method to determine optimal k
inertia = []
K_range = range(2, 11)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(df_scaled)
    inertia.append(kmeans.inertia_)

# Plot Elbow Method result
plt.plot(K_range, inertia, marker='o')
plt.xlabel("Number of Clusters (K)")
plt.ylabel("Inertia")
plt.title("Elbow Method for Optimal K")
plt.show()

# Apply KMeans with the selected k
k_optimal = 5  # Change this if elbow plot shows a better value
kmeans = KMeans(n_clusters=k_optimal, random_state=42, n_init=10)
df['Cluster'] = kmeans.fit_predict(df_scaled)

# Evaluate clustering using silhouette score
sil_score = silhouette_score(df_scaled, df['Cluster'])
print(f"Silhouette Score: {sil_score:.4f}")
print(df['Cluster'].value_counts())

# Visualize the K-Means result
data_2d = pca.fit_transform(df_scaled)
plt.figure(figsize=(8, 6))
plt.scatter(data_2d[:, 0], data_2d[:, 1], c=df['Cluster'], cmap='Set2', s=50)
plt.title(f'K-Means Clustering Visualization (k={k_optimal})')
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.colorbar(label='Cluster')
plt.grid(True)
plt.show()
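
Because `k_optimal` is read off the elbow plot by eye, it can help to cross-check it against the silhouette score for each candidate `k`. The following is a minimal sketch of that check. The helper `silhouette_by_k` and the four-blob demo data are hypothetical and stand in for `df_scaled`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def silhouette_by_k(X, k_values, random_state=42):
    """Fit K-Means for each k and return a {k: silhouette score} dict."""
    scores = {}
    for k in k_values:
        labels_k = KMeans(n_clusters=k, random_state=random_state, n_init=10).fit_predict(X)
        scores[k] = silhouette_score(X, labels_k)
    return scores

# Synthetic stand-in for df_scaled (assumption: four well-separated blobs)
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
X_demo = np.vstack([rng.normal(c, 0.5, (50, 2)) for c in centers])

km_scores = silhouette_by_k(X_demo, range(2, 8))
km_best_k = max(km_scores, key=km_scores.get)
for k, s in km_scores.items():
    print(f"k={k}: silhouette={s:.4f}")
```

If the silhouette-based choice disagrees with the elbow plot, that is a hint to inspect both values of `k` rather than proof that either is correct.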

## Conclusion and Comparison

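- **DBSCAN** can find arbitrarily shaped clusters and flags outliers as noise (label `-1`), but with a single fixed `eps` it may fail to form distinct clusters where the data is sparse.
- **HAC** (Ward linkage, with `k` selected by silhouette score) gave the best silhouette score of the three methods and produced well-structured clusters.
- **K-Means** was simple and efficient; with `k` chosen from the elbow plot, it gave reasonable performance.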
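
To put the three methods side by side, the silhouette scores can be collected in one place. The sketch below is one possible way to do that. The function `compare_silhouettes` is hypothetical, it uses a single `k` for both HAC and K-Means (the notebook picks them separately), and the three-blob data is an assumption standing in for `df_scaled`.

```python
import numpy as np
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.metrics import silhouette_score

def compare_silhouettes(X, k, eps=0.8, min_samples=15):
    """Return a silhouette score per algorithm (None if DBSCAN finds < 2 clusters)."""
    scores = {}

    # DBSCAN: score only non-noise points, since label -1 is not a real cluster
    db_labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    mask = db_labels != -1
    n_db = len(set(db_labels[mask]))
    scores["DBSCAN"] = silhouette_score(X[mask], db_labels[mask]) if n_db > 1 else None

    hac_labels = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(X)
    scores["HAC"] = silhouette_score(X, hac_labels)

    km_labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    scores["K-Means"] = silhouette_score(X, km_labels)
    return scores

# Synthetic stand-in for df_scaled (assumption: three well-separated blobs)
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
X_demo = np.vstack([rng.normal(c, 0.3, (60, 2)) for c in centers])

comparison = compare_silhouettes(X_demo, k=3)
for name, score in comparison.items():
    print(name, "N/A" if score is None else f"{score:.4f}")
```

The DBSCAN score here excludes noise points, so it is computed on fewer samples than the other two and is not strictly comparable with them.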

## How Clustering Helps

Clustering helps identify patterns or groupings in the data that could enhance recommendations. For example, patients within the same cluster might share symptoms and could therefore receive similar treatments or suggestions.

Even if clustering does not directly improve recommendations, it still provides insight into the structure and variability of the data.
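
One simple way to see whether patients in a cluster share symptoms is to average each feature per cluster. The sketch below uses a toy DataFrame. The column names `fever` and `cough` and all values are hypothetical, chosen only to illustrate the idea; in the notebook the same `groupby` would be applied to `df` after `df['Cluster']` is assigned.

```python
import pandas as pd

# Hypothetical toy data: binary symptom flags plus a cluster label
toy = pd.DataFrame({
    "fever":   [1, 1, 0, 0, 1, 0],
    "cough":   [1, 1, 0, 0, 1, 1],
    "Cluster": [0, 0, 1, 1, 0, 1],
})

# For binary columns, the per-cluster mean is the fraction of patients
# in that cluster who have the symptom
profile = toy.groupby("Cluster").mean()
print(profile)
```

Features whose averages differ strongly between clusters are the ones that characterize each group.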

