# **DBScan Clustering - CMPT 459 Course Project**

This notebook performs **Density-based clustering** on the diabetic patient dataset.

We do the following:
- Data preprocessing consistent with the project pipeline
- PCA for dimensionality reduction and for visualization of results
- 2D/3D PCA visualization of best clustering
- Custom implementation of **density-based clustering** (`dbscan_clustering.py`)
- Silhouette score as a metric to evaluate cluster quality

This notebook is part of our group’s modular report and references:
- `dbscan_clustering.py`
- `dbscan_clustering_analysis.py` (original script version)


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from dbscan_clustering import DBScan

##  Loading and Preprocessing Data 

Following the same loading and preprocessing steps as the rest of the project, we apply the following:

- Replace `'?'` values with `NaN`  
- Drop columns with >40% missing values 
- One-hot encode high-cardinality categorical columns  
- Label-encode low-cardinality categorical features  
- Normalize numerical features  
- Encode our target variable `readmitted` as integers:  
  - `NO → 0`, `>30 → 1`, `<30 → 2`  
- Remove sensitive/identifying data: `encounter_id`, `patient_nbr`

Finally, we sample **1,000** rows (default) so that DBScan runs in a reasonable amount of time, since the algorithm is **O(n^2)** in the worst case.
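
To make the quadratic cost concrete, the sketch below counts the pairwise distance computations that a brute-force neighbourhood search performs. The helper name `pairwise_distance_count` and the larger dataset sizes are illustrative assumptions, not figures taken from this project.

```python
def pairwise_distance_count(n: int) -> int:
    """Number of unordered point pairs a brute-force neighbour search compares."""
    return n * (n - 1) // 2


# 10_000 and 100_000 are hypothetical sizes, used only to show the growth rate.
for n in (1_000, 10_000, 100_000):
    pairs = pairwise_distance_count(n)
    print(f"n = {n:>7,}: {pairs:>13,} pairwise distances")
```

Growing the input from 1,000 to 100,000 rows multiplies the work by roughly 10,000, which is why the sample is kept small.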

In [None]:
def load_and_preprocess_data(path):
    print("Loading data...")
    df = pd.read_csv(path)
    print(f"Original shape: {df.shape}")

    # Replace '?'
    df = df.replace('?', np.nan)

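    # Drop columns with >40% missing values.
    # dropna(thresh=k) keeps columns with at least k non-missing entries,
    # so requiring 60% non-missing drops anything that is more than 40% missing.
    threshold = int(np.ceil(0.6 * len(df)))
    df = df.dropna(thresh=threshold, axis=1)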

    # Fill remaining categorical NA
    for col in df.select_dtypes(include='object').columns:
        df[col] = df[col].fillna("Unknown")

    # Encode target
    df["readmitted"] = df["readmitted"].map({'NO':0, '>30':1, '<30':2})

    # Encode categorical
    cat_cols = df.select_dtypes(include='object').columns
    le = LabelEncoder()
    for col in cat_cols:
        if df[col].nunique() < 10:
            df[col] = le.fit_transform(df[col].astype(str))
        else:
            df = pd.get_dummies(df, columns=[col], drop_first=True)

    # Remove ID columns
    for col in ["encounter_id", "patient_nbr"]:
        if col in df.columns:
            df = df.drop(columns=[col])

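    # Separate the target first so it keeps its integer class labels
    # (otherwise it is an int64 column and would be standardized too)
    target = df["readmitted"].copy()
    df = df.drop(columns=["readmitted"])

    # Normalize numeric features
    num_cols = df.select_dtypes(include=["int64", "float64"]).columns
    scaler = StandardScaler()
    df[num_cols] = scaler.fit_transform(df[num_cols])

    print("Preprocessing complete!")
    print("Final shape:", df.shape)

    # Cast to float so one-hot (boolean) columns do not force an object array
    X = df.to_numpy(dtype=float)

    return X, target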

## Representative Subset Sampling
We take a random sample (default: 1,000 rows) of the diabetic dataset, intended to be representative of the full data, because DBScan scales quadratically with sample size.
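
The sampling step can be sketched on toy data as follows. The function `sample_rows` and the toy arrays are hypothetical; the point is that the same indices must be applied to the features and to the labels so that the rows stay aligned.

```python
import numpy as np


def sample_rows(X, y, sample_size, seed=42):
    """Draw the same random rows (without replacement) from X and y."""
    rng = np.random.default_rng(seed)
    if len(X) <= sample_size:
        return X, y  # already small enough, nothing to sample
    idx = rng.choice(len(X), size=sample_size, replace=False)
    return X[idx], y[idx]


# Toy data: the label equals the first feature, so alignment is easy to verify.
X_toy = np.arange(20).reshape(10, 2)
y_toy = X_toy[:, 0].copy()
X_small, y_small = sample_rows(X_toy, y_toy, sample_size=4)
print(X_small.shape, y_small.shape)
```

A uniform random sample preserves class proportions only approximately. If exact proportions matter, stratified sampling is an alternative.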

In [None]:
X, target = load_and_preprocess_data("data/diabetic_data.csv")
sample_size = 1000
np.random.seed(42)

if len(X) > sample_size:
    print(f"Sampling {sample_size} points for DBScan...")
    idx = np.random.choice(len(X), sample_size, replace = False)
    X = X[idx]
    target = target.iloc[idx].values
else:
    target = target.values

X.shape 

## PCA Dimensionality Reduction 

We use PCA to reduce the dataset to **50 principal components**, preserving ~85-90% of the variance. Doing so speeds up the clustering algorithm and limits the problems that very high-dimensional data causes for distance-based methods.
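
One way to check a variance figure like this, or to choose the number of components, is to look at the cumulative explained variance. The following is a NumPy-only sketch on synthetic data. The function `components_for_variance` and the toy data are assumptions for illustration, and the quantity it computes corresponds to the `explained_variance_ratio_` attribute that scikit-learn's `PCA` exposes.

```python
import numpy as np


def components_for_variance(X, target=0.90):
    """Smallest number of principal components whose cumulative
    explained variance ratio reaches `target`."""
    X_centered = X - X.mean(axis=0)
    # Squared singular values of the centred data are proportional
    # to the variance captured along each principal axis.
    singular_values = np.linalg.svd(X_centered, compute_uv=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    cumulative = np.cumsum(explained)
    k = int(np.searchsorted(cumulative, target)) + 1
    return min(k, len(cumulative)), cumulative


rng = np.random.default_rng(0)
# Synthetic data: 200 samples, 30 features driven by 5 latent factors plus small noise.
latent = rng.normal(size=(200, 5))
X_demo = latent @ rng.normal(size=(5, 30)) + 0.05 * rng.normal(size=(200, 30))
k, cumulative = components_for_variance(X_demo, target=0.90)
print(f"{k} components explain {cumulative[k - 1]:.1%} of the variance")
```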

In [None]:
n_components = 50 
pca = PCA(n_components)
X_pca = pca.fit_transform(X)

print("PCA shape:", X_pca.shape)
print("PCA done running. Explained variance:", np.sum(pca.explained_variance_ratio_))

##  Running DBScan Clustering

DBScan does not need the number of clusters in advance. Instead, its two parameters, epsilon (the neighbourhood radius) and minPts (the minimum number of points in a neighbourhood), determine how well separated the resulting clusters are. We therefore compute the **silhouette coefficient** for each (epsilon, minPts) setting we try.
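
The roles of epsilon and minPts are easiest to see in a brute-force version of the algorithm. The following is a generic textbook-style sketch, not the contents of `dbscan_clustering.py`. The function name `dbscan_sketch`, the `-1` noise label, and the toy data are assumptions made for illustration.

```python
import numpy as np

NOISE = -1  # label convention assumed for this sketch


def dbscan_sketch(X, eps, min_pts):
    """Brute-force DBSCAN: one label per row, NOISE for outliers."""
    n = len(X)
    # Pairwise Euclidean distances: O(n^2) time and memory.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Each neighbourhood includes the point itself (distance 0).
    neighbours = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    is_core = np.array([len(nb) >= min_pts for nb in neighbours])

    labels = np.full(n, NOISE)
    visited = np.zeros(n, dtype=bool)
    cluster_id = 0
    for i in range(n):
        if visited[i] or not is_core[i]:
            continue
        # Grow a new cluster outward from an unvisited core point.
        visited[i] = True
        labels[i] = cluster_id
        frontier = [i]
        while frontier:
            p = frontier.pop()
            for q in neighbours[p]:
                if labels[q] == NOISE:
                    labels[q] = cluster_id  # core or border point joins
                if is_core[q] and not visited[q]:
                    visited[q] = True
                    frontier.append(q)  # only core points are expanded
        cluster_id += 1
    return labels


rng = np.random.default_rng(0)
blob_a = rng.normal(loc=(0.0, 0.0), scale=0.1, size=(20, 2))
blob_b = rng.normal(loc=(5.0, 5.0), scale=0.1, size=(20, 2))
outlier = np.array([[10.0, -10.0]])
X_demo = np.vstack([blob_a, blob_b, outlier])
labels = dbscan_sketch(X_demo, eps=1.0, min_pts=5)
print(np.unique(labels))
```

A smaller epsilon or a larger minPts turns more points into noise, while a larger epsilon merges neighbouring clusters.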

*The silhouette score measures how well separated the clusters are. It ranges from -1 to 1, and higher is better.*
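
For reference, the silhouette coefficient of a single point is `(b - a) / max(a, b)`, where `a` is its mean distance to the other members of its own cluster and `b` is the smallest mean distance to the members of any other cluster. The sketch below computes the mean coefficient directly from that definition. The function `silhouette_sketch` and the toy points are hypothetical, and the function assumes at least two distinct labels.

```python
import numpy as np


def silhouette_sketch(X, labels):
    """Mean silhouette coefficient computed directly from its definition."""
    labels = np.asarray(labels)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i, own in enumerate(labels):
        same = labels == own
        if same.sum() == 1:
            continue  # singleton clusters score 0 by convention
        # a: mean distance to the other members of the point's own cluster
        a = dists[i, same].sum() / (same.sum() - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(dists[i, labels == other].mean()
                for other in np.unique(labels) if other != own)
        scores[i] = (b - a) / max(a, b)
    return scores.mean()


# Two tight pairs far apart: one sensible labelling and one that mixes them.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
good = silhouette_sketch(X_toy, np.array([0, 0, 1, 1]))
bad = silhouette_sketch(X_toy, np.array([0, 1, 0, 1]))
print(f"good labelling: {good:.3f}, mixed labelling: {bad:.3f}")
```

The well-separated labelling scores close to 1, while the labelling that mixes the two groups scores below 0.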



In [None]:
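silhouettes = []
best_score = -1
best_clustering = None

# Single run with epsilon = 0.2 and minPts = 5
db = DBScan(0.2, 5)
clustering = db.fit(X_pca)

# silhouette_score raises ValueError unless there are between
# 2 and n_samples - 1 distinct labels, so check before scoring
n_labels = len(np.unique(clustering))
if 2 <= n_labels <= len(X_pca) - 1:
    sil = silhouette_score(X_pca, clustering)
    silhouettes.append(sil)
    print(f"Silhouette = {sil:.4f}")

    if sil > best_score:
        best_score = sil
        best_clustering = clustering.copy()
else:
    print(f"Silhouette is undefined for {n_labels} distinct label(s)")

print("\nBest clustering: ", best_clustering, "with silhouette score: ", best_score)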

##  Silhouette Coefficient Plot

The plot below shows the silhouette coefficient of each DBScan run, with one point per epsilon setting tried.

In [None]:
plt.figure(figsize = (10, 6))
plt.plot(silhouettes, "o-", color = "steelblue")
plt.title("DBScan Clustering Silhouette Scores")
plt.xlabel("Run index (one per epsilon setting)")
plt.ylabel("Silhouette Coefficient")
plt.grid(alpha = 0.3)
plt.show()

## 2D Visualization of Clusters 

For visualization, we project the sampled data onto its first two principal components and draw a 2D scatter plot.

*Different colour = different cluster.*

In [None]:
X_vis = PCA(n_components = 2).fit_transform(X)
cmap = matplotlib.colormaps["tab20"]
unique_clusters = np.unique(clustering)
num_clusters = len(unique_clusters)

plt.figure(figsize = (10, 8))
for i, cid in enumerate(unique_clusters):
    mask = clustering == cid
    c_value = i / max(num_clusters - 1, 1) if num_clusters > 1 else 0
    color = cmap(c_value)
    plt.scatter(X_vis[mask, 0], X_vis[mask, 1],
        s = 20, alpha = 0.6, color = color, edgecolors = "black",
        linewidths = 0.5, label = f"Cluster {cid}")
    
plt.title("DBScan Clustering", fontsize = 14, fontweight = "bold")
plt.xlabel("Principal Component 1", fontsize=12)
plt.ylabel("Principal Component 2", fontsize=12)
plt.legend(bbox_to_anchor = (1.05, 1), loc = "upper left")
plt.tight_layout()
plt.show()

# **Interpretation & Discussion**
## **Strengths of DBScan Clustering**

* Does not require the number of clusters or initial centroids to be specified
* Robust to noise/outliers in the dataset
* May reveal relationships between features within a cluster, since points in the same cluster are density-connected

---

## **Problems of DBScan Clustering**

* Worst-case running time is **O(n^2)** (depending on the choice of epsilon and minPts and on how well noise and clusters are separated), which requires working on a representative subset of the dataset.
* As with hierarchical clustering, results depend on the Euclidean distances between points after PCA dimensionality reduction.
* Noise can be difficult to separate from clusters, which may negatively impact results.
* Varying density in the dataset may result in inaccurate clustering.
