<a href="https://www.kaggle.com/code/farrelad/eng-machine-learning-case-study-clustering?scriptVersionId=269047192" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Preparation

In [None]:
import pandas as pd
import numpy as np
import random
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_score, davies_bouldin_score
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
from annoy import AnnoyIndex
from IPython.display import display

import warnings
warnings.filterwarnings('ignore')

In [None]:
# Set the random seeds for reproducibility

RANDOM_SEED = 24
random.seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)

# 1. Preprocessing Data

In [None]:
df = pd.read_csv('/kaggle/input/heart-disease-dataset/heart.csv')

display(df.head(20))
display(df.describe())
df.info()  # info() prints its summary directly and returns None, so display() is not needed

## 1.1 Handle missing values

In [None]:
df.isnull().sum()

Based on the data observation, there are no columns with missing values.

## 1.2 Create One New Feature

In [None]:
# Add a new column chol_age (cholesterol * age)
df['chol_age'] = df['chol'] * df['age']

display(df['chol_age'].describe())
print('\n')
display(df['chol_age'].head(20))

## 1.3 Normalization / Standardization

In [None]:
num_cols = ["age", "trestbps", "chol", "thalach", "oldpeak", "chol_age"]

scaler = StandardScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])

# 2. Clustering

## 2.1 k-Means

### 2.1.1 Find the best k

#### 2.1.1.1 Elbow method

In [None]:
wcss = []
K = range(1, 20)

for k in K:
    kmeans = KMeans(n_clusters=k, random_state=RANDOM_SEED)
    kmeans.fit(df)
    inertia = kmeans.inertia_ # inertia_ = WCSS
    print(f"k={k}, inertia={inertia:.4f}")
    wcss.append(inertia)  

# Plot the elbow
plt.plot(K, wcss, 'bo-')
plt.xlabel('Number of clusters (k)')
plt.ylabel('WCSS (Inertia)')
plt.title('Elbow Method for Optimal k')
plt.show()

#### 2.1.1.2 Silhouette score

In [None]:
s_scores = []
K = range(2, 20)

for k in K:
    kmeans = KMeans(n_clusters=k, random_state=RANDOM_SEED)
    labels = kmeans.fit_predict(df)
    score = silhouette_score(df, labels)
    print(f"k={k}, silhouette score={score:.4f}")
    s_scores.append(score)
    
# Plot
plt.plot(K, s_scores, 'bo-')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Method for Optimal k')
plt.show()

### 2.1.2 Model training

In [None]:
kmeans = KMeans(n_clusters=2, random_state=RANDOM_SEED)
kmeans_labels = kmeans.fit_predict(df)

### 2.1.3 Visualization

In [None]:
pca = PCA(n_components=2)
reduced = pca.fit_transform(df)

plt.scatter(
    reduced[:, 0], 
    reduced[:, 1], 
    c=kmeans_labels, 
    cmap='viridis', 
    s=1
)
plt.title("K-Means Clustering Visualization (PCA-reduced)")
plt.show()

## 2.2 DBSCAN

### 2.2.1 Set up DBSCAN

In [None]:
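# Querying the fitted data itself returns each point as one of its own
# neighbors (distance 0). Asking for 5 neighbors therefore matches
# min_samples=5 in DBSCAN, which also counts the point itself.
neigh = NearestNeighbors(n_neighbors=5)
nbrs = neigh.fit(df)
distances, indices = nbrs.kneighbors(df)

# Last column: distance to the 5th nearest point, self included
distances = np.sort(distances[:, 4])

# Plot
plt.plot(distances)
plt.ylabel("Distance to 5th nearest point (self included)")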
plt.xlabel("Points sorted by distance")
plt.title("Find epsilon")
plt.show()
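
The k-distance plot is usually read by eye: a reasonable epsilon sits near the "knee", where the sorted distances start to rise sharply. As a rough cross-check, the knee can be estimated numerically. The sketch below is only one simple heuristic (the point farthest below the straight line joining the first and last values), and `knee_index` is a hypothetical helper, not a scikit-learn function. It assumes the curve is sorted in ascending order and is not constant.

```python
import numpy as np

def knee_index(sorted_dists):
    """Return the index of the point farthest below the chord joining the
    first and last values of an ascending curve (hypothetical helper)."""
    y = np.asarray(sorted_dists, dtype=float)
    x = np.arange(len(y), dtype=float)
    # Rescale both axes to [0, 1] so the chord becomes the diagonal y = x
    x_n = (x - x[0]) / (x[-1] - x[0])
    y_n = (y - y[0]) / (y[-1] - y[0])
    # For a curve lying below the diagonal, x_n - y_n is the vertical gap
    return int(np.argmax(x_n - y_n))

# Made-up curve: a long flat part followed by a steep rise
demo_curve = np.concatenate([np.linspace(0.1, 0.5, 90), np.linspace(0.6, 5.0, 10)])
knee = knee_index(demo_curve)
print("knee index:", knee, "suggested eps:", demo_curve[knee])
```

Applied to this notebook, `distances[knee_index(distances)]` would give a starting value for `eps` that should still be checked against the plot.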

### 2.2.2 Model training

In [None]:
dbscan = DBSCAN(eps=2, min_samples=5)
dbscan_labels = dbscan.fit_predict(df)

In [None]:
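# Noise points (label -1) are scored as if they formed one cluster;
# the score is undefined when fewer than two distinct labels exist
score = silhouette_score(df, dbscan_labels) if len(set(dbscan_labels)) > 1 else None
print("Silhouette Score (DBSCAN):", score)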

### 2.2.3 Visualization

In [None]:
pca_2d = PCA(n_components=2)
reduced_2d = pca_2d.fit_transform(df)

# 2D Plot
plt.figure(figsize=(8, 6))
plt.scatter(
    reduced_2d[:, 0], reduced_2d[:, 1],
    c=dbscan_labels,
    cmap='viridis',
    s=5
)
plt.title("DBSCAN Clustering Visualization")
plt.colorbar(label='Cluster Label')
plt.show()

## 2.3 Comparing Results

In [None]:
sil_kmeans = silhouette_score(df, kmeans_labels)
dbi_kmeans = davies_bouldin_score(df, kmeans_labels)

sil_dbscan = silhouette_score(df, dbscan_labels) if len(set(dbscan_labels)) > 1 else None
dbi_dbscan = davies_bouldin_score(df, dbscan_labels) if len(set(dbscan_labels)) > 1 else None

print(f"K-Means  → Silhouette: {sil_kmeans:.3f}, DBI: {dbi_kmeans:.3f}")
print(f"DBSCAN   → Silhouette: {sil_dbscan}, DBI: {dbi_dbscan}")

# 3. Analysis

K-Means yields a silhouette score of $0.185$, which is close to 0. This indicates that clusters exist, but their boundaries overlap substantially, so the separation between them is weak.

The Davies–Bouldin Index (DBI) for K-Means is $1.969$. Lower DBI values are better, with 0 as the minimum, so this score also points to poor separation between clusters.
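
To make the two metrics concrete, the sketch below recomputes them by hand on a tiny made-up 1-D dataset and compares the result with scikit-learn. The toy data and the helpers `manual_silhouette` and `manual_dbi` are illustrative assumptions, not part of this notebook's pipeline. The helpers assume Euclidean distance and at least two points per cluster.

```python
import numpy as np
from sklearn.metrics import silhouette_score, davies_bouldin_score

# Toy data (made up for illustration): two tight, well-separated groups
X_toy = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
toy_labels = np.array([0, 0, 0, 1, 1, 1])

def manual_silhouette(X, labels):
    """Mean of s = (b - a) / max(a, b) over all points."""
    scores = []
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        own = labels == labels[i]
        own[i] = False                      # exclude the point itself
        a = d[own].mean()                   # mean distance within own cluster
        b = min(d[labels == c].mean()       # mean distance to nearest other cluster
                for c in np.unique(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

def manual_dbi(X, labels):
    """Mean over clusters of the worst (spread_i + spread_j) / centroid distance."""
    ids = np.unique(labels)
    centroids = np.array([X[labels == c].mean(axis=0) for c in ids])
    spread = np.array([
        np.linalg.norm(X[labels == c] - centroids[k], axis=1).mean()
        for k, c in enumerate(ids)
    ])
    worst = []
    for i in range(len(ids)):
        ratios = [
            (spread[i] + spread[j]) / np.linalg.norm(centroids[i] - centroids[j])
            for j in range(len(ids)) if j != i
        ]
        worst.append(max(ratios))
    return float(np.mean(worst))

print("silhouette:", manual_silhouette(X_toy, toy_labels), silhouette_score(X_toy, toy_labels))
print("DBI:", manual_dbi(X_toy, toy_labels), davies_bouldin_score(X_toy, toy_labels))
```

On data this cleanly separated the silhouette is close to 1 and the DBI close to 0, which gives a reference point for the much weaker scores reported here.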

DBSCAN yields a silhouette score of $-0.10$. A negative value means that, on average, points lie closer to a neighboring cluster than to their own, so many points are likely assigned to the wrong cluster or the clusters are not well formed. Any points that DBSCAN labels as noise ($-1$) are scored here as if they formed one cluster. The DBI for DBSCAN is $1.913$, close to the K-Means value, which again indicates poor cluster separation.
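
Because DBSCAN marks outliers with the label $-1$, passing its labels straight to `silhouette_score` treats all noise points as one cluster, which can pull the score down. One common alternative is to score only the points assigned to a real cluster. The sketch below illustrates this on synthetic data. The blobs, the hand-placed outliers, and the `eps` value are assumptions chosen for the demonstration and say nothing about the heart dataset.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data (assumption): two tight blobs plus four isolated outliers
X_blobs, _ = make_blobs(
    n_samples=200, centers=[[0, 0], [6, 6]], cluster_std=0.4, random_state=0
)
outliers = np.array([[3.0, 3.0], [-4.0, 5.0], [10.0, -2.0], [3.0, -4.0]])
X_demo = np.vstack([X_blobs, outliers])

demo_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_demo)
clustered = demo_labels != -1  # mask of points that are not noise

# Score 1: noise counted as its own cluster (what the notebook does)
sil_all = silhouette_score(X_demo, demo_labels)

# Score 2: noise removed first; only defined with at least two real clusters
if len(set(demo_labels[clustered])) > 1:
    sil_clean = silhouette_score(X_demo[clustered], demo_labels[clustered])
else:
    sil_clean = None

print("noise points:", int((~clustered).sum()))
print("silhouette with noise as a cluster:", sil_all)
print("silhouette without noise:", sil_clean)
```

The second score describes only the clustered points, so it should be reported together with the share of points discarded as noise.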

# 4. ANN (Approximate Nearest Neighbor)

In [None]:
f = df.shape[1]  # number of features
index = AnnoyIndex(f, 'euclidean')  # or 'angular'
index.set_seed(RANDOM_SEED)  # make tree construction reproducible

# Add each vector into the index
for i, vector in enumerate(df.values):
    index.add_item(i, vector)

# Build the forest
index.build(10)  # number of trees (higher = more accurate, slower)

In [None]:
query_index = 10
k = 5  # the query item itself is normally returned as the first neighbor

neighbors, distances = index.get_nns_by_item(
    query_index,
    k,
    include_distances=True
)

print("Query index:", query_index)
print("Nearest neighbors:", neighbors)
print("Distances:", distances)
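
Since Annoy returns approximate results, it can help to compare them with exact neighbors computed by brute force. The sketch below shows one way to do that with NumPy alone. `exact_neighbors` and `recall_at_k` are hypothetical helper names, and the small array is made-up data used only for the demonstration.

```python
import numpy as np

def exact_neighbors(X, query_index, k):
    """Brute-force Euclidean k nearest neighbors of row `query_index`.
    The query row itself is included, as in Annoy's get_nns_by_item."""
    X = np.asarray(X, dtype=float)
    d = np.linalg.norm(X - X[query_index], axis=1)
    order = np.argsort(d, kind="stable")[:k]
    return order.tolist(), d[order].tolist()

def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact neighbors that the approximate search found."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

# Made-up data for the demonstration
X_small = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]])
ids, dists = exact_neighbors(X_small, query_index=0, k=3)
print("exact neighbors:", ids)
print("exact distances:", dists)
print("recall of a hypothetical approximate answer:", recall_at_k([0, 2, 3], ids))
```

In this notebook, calling `exact_neighbors(df.values, query_index, k)` and passing the Annoy `neighbors` list to `recall_at_k` would show how closely the index matches the exact answer for that query.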