# Lab: k-NN & Decision Tree on the Heart Attack Dataset

**Goal:** Train and compare **k-NN** and **Decision Tree** classifiers for predicting heart attack risk (binary classification).  
You will:
1. Load `Heart_Attack.csv` (Heart Attack dataset)
2. Clean & preprocess data
3. Tune hyperparameters with Cross-Validation (CV)
4. Train final models and evaluate on a held-out test set
5. Make predictions on new samples



In [2]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

RANDOM_STATE = 42


## 1) Load dataset

In [8]:
# Load dataset
CSV_PATH = "Heart_Attack.csv"
df = pd.read_csv(CSV_PATH)

# Display basic info
print("Loaded:", CSV_PATH)
print("Shape:", df.shape)
display(df.head())


Loaded: Heart_Attack.csv
Shape: (303, 14)


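| | age | sex | cp | trtbps | chol | fbs | restecg | thalachh | exng | oldpeak | slp | caa | thall | output |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |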


## 2) Inspect & basic cleaning

In [9]:
print("Columns:", list(df.columns))
print("\nMissing values per column:")
print(df.isna().sum())

# Basic cleanup (drop rows with missing values)
df = df.dropna().reset_index(drop=True)
print("\nAfter dropna -> Shape:", df.shape)

TARGET_COL = "output"
print("Using TARGET_COL =", TARGET_COL)

# Separate features (X) and target (y)

X = df.drop(columns=[TARGET_COL])
y = df[TARGET_COL]

print("X shape:", X.shape, "y counts:", np.bincount(y))


Columns: ['age', 'sex', 'cp', 'trtbps', 'chol', 'fbs', 'restecg', 'thalachh', 'exng', 'oldpeak', 'slp', 'caa', 'thall', 'output']

Missing values per column:
age         0
sex         0
cp          0
trtbps      0
chol        0
fbs         0
restecg     0
thalachh    0
exng        0
oldpeak     0
slp         0
caa         0
thall       0
output      0
dtype: int64

After dropna -> Shape: (303, 14)
Using TARGET_COL = output
X shape: (303, 13) y counts: [138 165]


## 3) Train/Test split  
We keep the test set untouched until the final evaluation.
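
With only 303 rows, an unstratified split can shift the class ratio between train and test. As a hedged sketch (this is not what the cell below does), `train_test_split` accepts a `stratify` argument that keeps the ratio roughly equal in both parts. The toy labels reuse the class counts printed in section 2 (138 vs. 165); the single dummy feature and the `*_demo` names are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 303 rows with the same class counts as the lab (138 zeros, 165 ones).
y_demo = np.array([0] * 138 + [1] * 165)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # dummy single feature

# stratify=y_demo asks for the same class proportions in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

full_ratio = y_demo.mean()
train_ratio = y_tr.mean()
test_ratio = y_te.mean()
print(f"positive share  full={full_ratio:.3f}  train={train_ratio:.3f}  test={test_ratio:.3f}")
```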

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

print("Train:", X_train.shape, "Test:", X_test.shape)


Train: (242, 13) Test: (61, 13)


## 4) k-NN (with scaling) + CV to select best `k`

### 4.1) Scaling
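**Why scaling?** k-NN classifies by distance, so a feature with a large numeric range (e.g. `chol`, in the hundreds) would otherwise dominate a feature with a small range (e.g. `oldpeak`). Standardizing puts all features on a comparable scale.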
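
To make this concrete, here is a small sketch with made-up values for two features on very different scales (`chol` and `oldpeak`). The four patients and the helper names `euclidean` and `standardize` are illustrative assumptions, not rows or functions from the lab.

```python
import math
import statistics

# Hypothetical patients as (chol, oldpeak). Values are illustrative only.
rows = [
    (200.0, 0.0),  # A: the query patient
    (210.0, 4.0),  # B: close to A in chol, far in oldpeak
    (240.0, 0.2),  # C: far from A in chol, close in oldpeak
    (300.0, 2.0),  # D: extra point so the column statistics are less degenerate
]

def euclidean(p, q):
    """Plain Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def standardize(data):
    """Z-score each column using the population standard deviation."""
    cols = list(zip(*data))
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) for c in cols]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds)) for row in data]

scaled = standardize(rows)

raw_ab, raw_ac = euclidean(rows[0], rows[1]), euclidean(rows[0], rows[2])
scaled_ab, scaled_ac = euclidean(scaled[0], scaled[1]), euclidean(scaled[0], scaled[2])

print(f"raw:    d(A,B)={raw_ab:.2f}  d(A,C)={raw_ac:.2f}")
print(f"scaled: d(A,B)={scaled_ab:.2f}  d(A,C)={scaled_ac:.2f}")
```

On the raw values B is the nearest neighbour of A, purely because `chol` differences are numerically large. After standardization the large `oldpeak` gap counts too, and C becomes the nearer one.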


In [11]:
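# ---- scaling with StandardScaler ----
# Fit the scaler on the training set only, then reuse it on the test set,
# so that no test-set statistics leak into preprocessing.
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)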

print("After scaling:")
print("\nX_train_s mean:", np.mean(X_train_s, axis=0))
print("\nX_train_s max:", np.max(X_train_s, axis=0))
print("\nX_train_s min:", np.min(X_train_s, axis=0))


After scaling:

X_train_s mean: [-1.02764445e-16  4.40419051e-17 -1.19280160e-17 -8.11105086e-16
  4.03717464e-17 -2.56911113e-17 -1.02764445e-16 -4.73450480e-16
  7.34031752e-18 -5.87225401e-17 -1.61486985e-16 -1.65157144e-17
 -2.62416351e-16]

X_train_s max: [2.48139745 0.69617712 1.94013791 3.52434923 5.94512442 2.39211668
 2.7826449  2.25652485 1.40984195 4.31192549 0.95577901 3.18198052
 1.14190596]

X_train_s min: [-2.89619822 -1.43641607 -0.92274852 -2.21351487 -2.16936794 -0.41803981
 -0.97936664 -3.40927198 -0.70929937 -0.89249331 -2.27916533 -0.70710678
 -3.67799943]


### 4.2) Cross-validation to select best k value

In [12]:
# Initialize StratifiedKFold (5 folds)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)

In [14]:
# Make a list of k values to try
# CV to try k = 1, 3, 5, ..., 15
k_list = list(range(1, 17, 2))

# Build mean_scores list
mean_scores = []

In [15]:
# Loop over k values and perform CV
# For each k, create KNN model and get CV accuracy
for k in k_list:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train_s, y_train, cv=cv, scoring="accuracy")
    mean_scores.append(scores.mean())

# Find best k
best_idx = int(np.argmax(mean_scores))
best_k = k_list[best_idx]
best_cv_acc = mean_scores[best_idx]

In [16]:
# Print results
print("=== k-NN CV results (TRAIN only) ===")
for k, m in zip(k_list, mean_scores):
    print(f"k={k:2d}  mean_CV_acc={m:.4f}")

print(f"\nBest k = {best_k} (mean CV acc = {best_cv_acc:.4f})")

=== k-NN CV results (TRAIN only) ===
k= 1  mean_CV_acc=0.7231
k= 3  mean_CV_acc=0.7937
k= 5  mean_CV_acc=0.7896
k= 7  mean_CV_acc=0.8144
k= 9  mean_CV_acc=0.8020
k=11  mean_CV_acc=0.8102
k=13  mean_CV_acc=0.8268
k=15  mean_CV_acc=0.8307

Best k = 15 (mean CV acc = 0.8307)
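
One caveat about the loop above: `X_train_s` was scaled with statistics from the whole training set, so each validation fold has already influenced the scaler. On a dataset this small the effect is usually minor, but a `Pipeline` refits the scaler inside every fold and avoids it. The sketch below uses synthetic data from `make_classification` as a stand-in for the unscaled `X_train`/`y_train`; the `*_demo` names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the (unscaled) training data.
X_demo, y_demo = make_classification(n_samples=240, n_features=13, random_state=0)

cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)
k_values = list(range(1, 17, 2))  # 1, 3, ..., 15

pipe_scores = []
for k in k_values:
    # Inside cross_val_score the scaler is fit on the training part of each fold only.
    pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(pipe, X_demo, y_demo, cv=cv_demo, scoring="accuracy")
    pipe_scores.append(scores.mean())

best_k_demo = k_values[int(np.argmax(pipe_scores))]
print("best k on synthetic data:", best_k_demo)
```

With this pattern the final model would be the pipeline itself, fit on the raw training data, so the same scaling is applied automatically at prediction time.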


### 4.3) Train final k-NN model with best k

In [17]:
# Train final k-NN model with best k
best_knn = KNeighborsClassifier(n_neighbors=best_k)
best_knn.fit(X_train_s, y_train)

### 4.4) Evaluate k-NN on test set

In [18]:
# Evaluate k-NN on test set
y_pred_knn = best_knn.predict(X_test_s)
acc_knn = accuracy_score(y_test, y_pred_knn)
cm_knn = confusion_matrix(y_test, y_pred_knn)


In [19]:
# Print results
print("\n=== k-NN Test ===")
print("Accuracy:", round(acc_knn, 4))
print("Confusion matrix [[TN FP],[FN TP]]:\n", cm_knn)
print("\nReport:\n", classification_report(y_test, y_pred_knn, digits=4))


=== k-NN Test ===
Accuracy: 0.8361
Confusion matrix [[TN FP],[FN TP]]:
 [[20  7]
 [ 3 31]]

Report:
               precision    recall  f1-score   support

           0     0.8696    0.7407    0.8000        27
           1     0.8158    0.9118    0.8611        34

    accuracy                         0.8361        61
   macro avg     0.8427    0.8263    0.8306        61
weighted avg     0.8396    0.8361    0.8341        61



## 5) Decision Tree + CV to select best `max_depth`

**Note:** Decision trees do **not** require scaling.
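
The reason is that a tree splits on a threshold of one feature at a time, and rescaling a feature changes the threshold value but not which samples fall on each side. A minimal sketch with a hand-rolled Gini split search on made-up values (the helpers `gini` and `best_split` are hypothetical names, not scikit-learn functions):

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p1 = sum(labels) / len(labels)
    return 2.0 * p1 * (1.0 - p1)

def best_split(values, labels):
    """Return (indices sent left, weighted Gini) for the best threshold on one feature."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    best = None
    for cut in range(1, len(order)):
        if values[order[cut - 1]] == values[order[cut]]:
            continue  # no threshold fits between equal values
        left, right = order[:cut], order[cut:]
        score = (len(left) * gini([labels[i] for i in left])
                 + len(right) * gini([labels[i] for i in right])) / len(order)
        if best is None or score < best[1]:
            best = (frozenset(left), score)
    return best

# Made-up feature values (think of max heart rate) and labels.
feature = [110, 125, 140, 150, 162, 170, 178, 187]
labels = [0, 0, 0, 1, 0, 1, 1, 1]

# Standardize the same feature by hand.
mean = sum(feature) / len(feature)
std = (sum((v - mean) ** 2 for v in feature) / len(feature)) ** 0.5
feature_scaled = [(v - mean) / std for v in feature]

raw_left, raw_score = best_split(feature, labels)
scaled_left, scaled_score = best_split(feature_scaled, labels)
print("raw:   ", sorted(raw_left), round(raw_score, 4))
print("scaled:", sorted(scaled_left), round(scaled_score, 4))
```

Both searches pick the same partition with the same impurity, because standardization preserves the ordering of the values.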

### 5.1) Cross-validation to select best max_depth

In [20]:
# Initialize StratifiedKFold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)

In [21]:
# Create a list of max_depth values to try
# max_depth => 1 to 10
depth_list = list(range(1, 11))

# Build mean_scores list
mean_scores = []

In [29]:
# Loop over depth values and perform CV
for d in depth_list:
    tree = DecisionTreeClassifier(criterion="gini", max_depth=d, random_state=7)
    scores = cross_val_score(tree, X_train, y_train, cv=cv, scoring="accuracy")
    mean_scores.append(scores.mean())

In [30]:
# Find best max_depth
best_idx = int(np.argmax(mean_scores))
best_depth = depth_list[best_idx]
best_cv_acc = mean_scores[best_idx]

In [31]:
# Print results
print("=== Decision Tree CV results (TRAIN only) ===")
for d, m in zip(depth_list, mean_scores):
    print(f"max_depth={d:>4d}  mean_CV_acc={m:.4f}")

print(f"\nBest max_depth = {best_depth} (mean CV acc = {best_cv_acc:.4f})")

=== Decision Tree CV results (TRAIN only) ===
max_depth=   1  mean_CV_acc=0.6904
max_depth=   2  mean_CV_acc=0.8018
max_depth=   3  mean_CV_acc=0.7851
max_depth=   4  mean_CV_acc=0.7608
max_depth=   5  mean_CV_acc=0.7688
max_depth=   6  mean_CV_acc=0.7645
max_depth=   7  mean_CV_acc=0.7438
max_depth=   8  mean_CV_acc=0.7522
max_depth=   9  mean_CV_acc=0.7358
max_depth=  10  mean_CV_acc=0.7521

Best max_depth = 2 (mean CV acc = 0.8018)


### 5.2) Train final Decision Tree model with best max_depth

In [25]:
# Train final Decision Tree model with best max_depth
best_tree = DecisionTreeClassifier(criterion="gini", max_depth=best_depth, random_state=7)
best_tree.fit(X_train, y_train)
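
A shallow tree is easy to read as text. `export_text` (imported at the top of the lab) prints the learned rules. The sketch below fits a depth-2 tree on synthetic data as a stand-in for `X_train`/`y_train`; the feature names `f0` to `f3` and the `demo_tree` variable are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data with four placeholder features.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
feature_names = ["f0", "f1", "f2", "f3"]

demo_tree = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=7)
demo_tree.fit(X_demo, y_demo)

# One line per node: split conditions for internal nodes, "class: ..." for leaves.
rules = export_text(demo_tree, feature_names=feature_names)
print(rules)
```

For the lab's model the equivalent call would be `export_text(best_tree, feature_names=list(X.columns))`.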

### 5.3) Evaluate Decision Tree on test set

In [32]:
# Evaluate Decision Tree on test set
y_pred_tree = best_tree.predict(X_test)
acc_tree = accuracy_score(y_test, y_pred_tree)
cm_tree = confusion_matrix(y_test, y_pred_tree)

In [33]:
# Print results
print("\n=== Decision Tree Test ===")
print("Accuracy:", round(acc_tree, 4))
print("Confusion matrix [[TN FP],[FN TP]]:\n", cm_tree)
print("\nReport:\n", classification_report(y_test, y_pred_tree, digits=4))



=== Decision Tree Test ===
Accuracy: 0.7377
Confusion matrix [[TN FP],[FN TP]]:
 [[13 14]
 [ 2 32]]

Report:
               precision    recall  f1-score   support

           0     0.8667    0.4815    0.6190        27
           1     0.6957    0.9412    0.8000        34

    accuracy                         0.7377        61
   macro avg     0.7812    0.7113    0.7095        61
weighted avg     0.7713    0.7377    0.7199        61



## 6) Compare models (test set)

You can report:
- accuracy
- confusion matrix
- precision/recall/F1 for class 1 (heart attack)
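
All of these numbers can be derived from the confusion matrix. As a check, the sketch below recomputes them from the two test-set matrices printed earlier in this lab (layout `[[TN, FP], [FN, TP]]`). The helper `class1_metrics` is a hypothetical name.

```python
def class1_metrics(cm):
    """Accuracy plus precision/recall/F1 for class 1 from [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)  # of the predicted positives, how many are real
    recall = tp / (tp + fn)     # of the real positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Confusion matrices copied from the test-set outputs in this lab.
cm_knn_reported = [[20, 7], [3, 31]]
cm_tree_reported = [[13, 14], [2, 32]]

knn_metrics = class1_metrics(cm_knn_reported)
tree_metrics = class1_metrics(cm_tree_reported)

for name, m in [("k-NN", knn_metrics), ("Decision Tree", tree_metrics)]:
    print(name, {key: round(value, 4) for key, value in m.items()})
```

Both models find most of the positive cases (recall about 0.91 vs. 0.94), but the tree produces twice as many false positives (14 vs. 7), which accounts for most of the accuracy gap.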


In [36]:
results = pd.DataFrame([
    {"model": "k-NN", "test_accuracy": acc_knn},
    {"model": "Decision Tree", "test_accuracy": acc_tree},
]).sort_values("test_accuracy", ascending=False)

display(results)


| | model | test_accuracy |
|---|---|---|
| 0 | k-NN | 0.836066 |
| 1 | Decision Tree | 0.737705 |


## 7) Predict a new patient

You must provide a value for **every feature column**, using the same names and order as `X.columns`.


In [37]:
print("Feature columns:", list(X.columns))

# Example input
new_sample = {col: float(X[col].median()) for col in X.columns}  # simple default using median

new_df = pd.DataFrame([new_sample])
display(new_df)

# k-NN needs scaling
new_df_s = scaler.transform(new_df)

pred_knn = int(best_knn.predict(new_df_s)[0])
pred_tree = int(best_tree.predict(new_df)[0])

print("k-NN prediction (0/1):", pred_knn)
print("Decision Tree prediction (0/1):", pred_tree)


Feature columns: ['age', 'sex', 'cp', 'trtbps', 'chol', 'fbs', 'restecg', 'thalachh', 'exng', 'oldpeak', 'slp', 'caa', 'thall']


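| | age | sex | cp | trtbps | chol | fbs | restecg | thalachh | exng | oldpeak | slp | caa | thall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 55.0 | 1.0 | 1.0 | 130.0 | 240.0 | 0.0 | 1.0 | 153.0 | 0.0 | 0.8 | 1.0 | 0.0 | 2.0 |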


k-NN prediction (0/1): 1
Decision Tree prediction (0/1): 1
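
When the input comes from a hand-written dictionary rather than from `X` itself, it helps to check for missing or extra keys and to put the columns in training order before calling `scaler.transform` or `predict`. A small sketch, where `build_sample` and `FEATURE_COLUMNS` are hypothetical names and the values are the medians shown above:

```python
import pandas as pd

# Feature names in training order (as printed by list(X.columns) in the lab).
FEATURE_COLUMNS = ["age", "sex", "cp", "trtbps", "chol", "fbs", "restecg",
                   "thalachh", "exng", "oldpeak", "slp", "caa", "thall"]

def build_sample(values, columns):
    """Return a one-row DataFrame with exactly `columns`, in that order."""
    missing = [c for c in columns if c not in values]
    extra = [c for c in values if c not in columns]
    if missing or extra:
        raise ValueError(f"missing={missing} extra={extra}")
    return pd.DataFrame([values], columns=list(columns))

# Keys deliberately given in a different order than the training columns.
patient = {"chol": 240.0, "age": 55.0, "sex": 1.0, "cp": 1.0, "trtbps": 130.0,
           "fbs": 0.0, "restecg": 1.0, "thalachh": 153.0, "exng": 0.0,
           "oldpeak": 0.8, "slp": 1.0, "caa": 0.0, "thall": 2.0}

sample_df = build_sample(patient, FEATURE_COLUMNS)
print(sample_df)
```

In the lab, `sample_df` would then take the place of `new_df` in the prediction cell.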
