# Lab: k-NN & Decision Tree on the Heart Attack Dataset

**Goal:** Train and compare **k-NN** and **Decision Tree** classifiers for predicting heart attack risk (binary classification).  
You will:
1. Load the Heart Attack dataset (`Heart_Attack.csv`)
2. Clean & preprocess data
3. Tune hyperparameters with Cross-Validation (CV)
4. Train final models and evaluate on a held-out test set
5. Make predictions on new samples



In [9]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

RANDOM_STATE = 42


## 1) Load dataset

In [None]:
# Load dataset
CSV_PATH = ...
df = ...

# Display basic info
print("Loaded:", CSV_PATH)
print("Shape:", df.shape)
display(df.head())


Loaded: Heart_Attack.csv
Shape: (303, 14)


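|   | age | sex | cp | trtbps | chol | fbs | restecg | thalachh | exng | oldpeak | slp | caa | thall | output |
|---|-----|-----|----|--------|------|-----|---------|----------|------|---------|-----|-----|-------|--------|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |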


## 2) Inspect & basic cleaning

In [None]:
print("Columns:", list(df.columns))
print("\nMissing values per column:")
print(df.isna().sum())

# Basic cleanup (drop rows with missing values)
df = df.dropna().reset_index(drop=True)
print("\nAfter dropna -> Shape:", df.shape)

print("Using TARGET_COL =", ...)

# Separate X / y
X = ...
y = ...

print("X shape:", X.shape, "y counts:", np.bincount(y))


Columns: ['age', 'sex', 'cp', 'trtbps', 'chol', 'fbs', 'restecg', 'thalachh', 'exng', 'oldpeak', 'slp', 'caa', 'thall', 'output']

Missing values per column:
age         0
sex         0
cp          0
trtbps      0
chol        0
fbs         0
restecg     0
thalachh    0
exng        0
oldpeak     0
slp         0
caa         0
thall       0
output      0
dtype: int64

After dropna -> Shape: (303, 14)
Using TARGET_COL = output
X shape: (303, 13) y counts: [138 165]


## 3) Train/Test split  
We keep the test set untouched until the final evaluation.

In [None]:
X_train, X_test, y_train, y_test = ...

print("Train:", X_train.shape, "Test:", X_test.shape)


Train: (242, 13) Test: (61, 13)
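
The split step can be sketched on stand-in data. This is a minimal illustration, not the lab's reference solution: the random matrix `X_demo`, the labels `y_demo`, and the choices `test_size=0.2` and `stratify=y_demo` are all assumptions, picked because they reproduce the 242/61 shapes printed above and keep the class ratio similar in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same shape and class counts as the lab's output
# (303 rows, 13 features, 138 zeros and 165 ones). Values are random.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(303, 13))
y_demo = np.array([0] * 138 + [1] * 165)
rng.shuffle(y_demo)

# Assumed settings: 20% test share, stratified on the label
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

print("Train:", X_tr.shape, "Test:", X_te.shape)
print("Class-1 share (full, train, test):",
      round(y_demo.mean(), 3), round(y_tr.mean(), 3), round(y_te.mean(), 3))
```

Stratifying matters most on small datasets like this one, where an unlucky random split can shift the class balance of the 61 test rows noticeably.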


## 4) k-NN (with scaling) + CV to select best `k`

### 4.1) Scaling
**Why scaling?** k-NN uses distances, so features must be on comparable scales.
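
To make the point concrete, the sketch below takes only the `chol` and `oldpeak` values from the five rows shown in `df.head()` and compares distances before and after standardization. The names `demo_train`, `demo_new`, and `demo_scaler` are hypothetical, and `demo_new` is an invented patient used only to show that the scaler is fitted on training rows and then reused unchanged.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# [chol, oldpeak] for the five rows shown in df.head() above
demo_train = np.array([
    [233.0, 2.3],
    [250.0, 3.5],
    [204.0, 1.4],
    [236.0, 0.8],
    [354.0, 0.6],
])
# Hypothetical unseen patient, used only to show the transform step
demo_new = np.array([[240.0, 0.8]])

# Raw distance between rows 0 and 1 is almost entirely the chol gap
raw_dist = np.linalg.norm(demo_train[0] - demo_train[1])
chol_gap = abs(demo_train[0, 0] - demo_train[1, 0])

# Fit on training rows only, then reuse the same mean/std for new rows
demo_scaler = StandardScaler().fit(demo_train)
demo_train_s = demo_scaler.transform(demo_train)
demo_new_s = demo_scaler.transform(demo_new)

scaled_diff = np.abs(demo_train_s[0] - demo_train_s[1])
print(f"raw distance = {raw_dist:.2f}, chol gap alone = {chol_gap:.2f}")
print("per-feature gap after scaling [chol, oldpeak]:", scaled_diff.round(2))
```

Before scaling, the `oldpeak` difference barely changes the distance; after scaling, both features contribute on a comparable footing.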


In [None]:
# ---- scaling with StandardScaler ----
scaler = ...
X_train_s = ...
X_test_s = ...

print("After scaling:")
print("\nX_train_s mean:", np.mean(X_train_s, axis=0))
print("\nX_train_s max:", np.max(X_train_s, axis=0))
print("\nX_train_s min:", np.min(X_train_s, axis=0))


After scaling:

X_train_s mean: [ 3.52335241e-16  1.02764445e-16  4.58769845e-17  2.64251431e-16
 -1.54146668e-16 -3.67015876e-17  1.44971271e-16 -3.96377146e-16
  1.21115239e-16  1.48182660e-16  0.00000000e+00  2.20209526e-17
  2.20209526e-17]

X_train_s max: [2.48629146 0.68313005 2.04442042 3.8031055  3.43709132 2.47338777
 2.74397736 2.36525938 1.39686059 4.26005746 0.94818498 3.1287583
 1.09506875]

X_train_s min: [-2.76857783 -1.46385011 -0.93599971 -2.09979655 -2.48980555 -0.40430377
 -1.02899151 -2.83250832 -0.71589105 -0.89045843 -2.28365678 -0.71469098
 -3.8738057 ]


### 4.2) Cross-validation to select best k value

In [None]:
# Initialize StratifiedKFold (5 folds)
cv = ...

In [None]:
# Make a list of k values to try
# CV to try odd k = 1, 3, 5, ..., 29
k_list = ...

# Build mean_scores list
mean_scores = []

In [None]:
# Loop over k values and perform CV
# For each k, create KNN model and get CV accuracy
for k in k_list:
    knn = ...
    scores = cross_val_score(...)
    mean_scores.append(scores.mean())

# Find best k
best_idx = int(np.argmax(mean_scores))
best_k = k_list[best_idx]
best_cv_acc = mean_scores[best_idx]

In [17]:
# Print results
print("=== k-NN CV results (TRAIN only) ===")
for k, m in zip(k_list, mean_scores):
    print(f"k={k:2d}  mean_CV_acc={m:.4f}")

print(f"\nBest k = {best_k} (mean CV acc = {best_cv_acc:.4f})")

=== k-NN CV results (TRAIN only) ===
k= 1  mean_CV_acc=0.7524
k= 3  mean_CV_acc=0.7976
k= 5  mean_CV_acc=0.8139
k= 7  mean_CV_acc=0.8098
k= 9  mean_CV_acc=0.8014
k=11  mean_CV_acc=0.8095
k=13  mean_CV_acc=0.8385
k=15  mean_CV_acc=0.8303
k=17  mean_CV_acc=0.8302
k=19  mean_CV_acc=0.8219
k=21  mean_CV_acc=0.8262
k=23  mean_CV_acc=0.8304
k=25  mean_CV_acc=0.8264
k=27  mean_CV_acc=0.8304
k=29  mean_CV_acc=0.8304

Best k = 13 (mean CV acc = 0.8385)
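
The selection loop above can be sketched end to end on synthetic data. This version is one possible variant rather than the lab's intended answer: it wraps `StandardScaler` and `KNeighborsClassifier` in a pipeline so that each CV fold refits the scaler on its own training part, which keeps validation rows from influencing the scaling statistics. The data from `make_classification` and the names `X_syn`, `cv_demo`, `k_values`, and `best_k_demo` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training split (not the heart data)
X_syn, y_syn = make_classification(
    n_samples=240, n_features=13, n_informative=6, random_state=42
)

# 5 stratified folds, shuffled with a fixed seed for repeatability
cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
k_values = list(range(1, 30, 2))  # odd k from 1 to 29

cv_means = []
for k in k_values:
    # Scaler lives inside the pipeline, so it is refit within every fold
    pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    fold_scores = cross_val_score(pipe, X_syn, y_syn, cv=cv_demo, scoring="accuracy")
    cv_means.append(fold_scores.mean())

best_k_demo = k_values[int(np.argmax(cv_means))]
print(f"Best k on synthetic data = {best_k_demo} "
      f"(mean CV acc = {max(cv_means):.4f})")
```

Odd values of `k` avoid tied votes in binary classification, which is presumably why the lab's grid skips even values.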


### 4.3) Train final k-NN model with best k

In [None]:
# Train final k-NN model with best k
best_knn = ...
....fit(...)

### 4.4) Evaluate k-NN on test set

In [None]:
# Evaluate k-NN on test set
y_pred_knn = ...
acc_knn = ...
cm_knn = ...


In [20]:
# Print results
print("\n=== k-NN Test ===")
print("Accuracy:", round(acc_knn, 4))
print("Confusion matrix [[TN FP],[FN TP]]:\n", cm_knn)
print("\nReport:\n", classification_report(y_test, y_pred_knn, digits=4))


=== k-NN Test ===
Accuracy: 0.8197
Confusion matrix [[TN FP],[FN TP]]:
 [[19  9]
 [ 2 31]]

Report:
               precision    recall  f1-score   support

           0     0.9048    0.6786    0.7755        28
           1     0.7750    0.9394    0.8493        33

    accuracy                         0.8197        61
   macro avg     0.8399    0.8090    0.8124        61
weighted avg     0.8346    0.8197    0.8154        61
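
The class-1 figures in the report follow directly from the confusion matrix above, read as TN = 19, FP = 9, FN = 2, TP = 31:

```latex
\begin{aligned}
\text{accuracy} &= \frac{TN + TP}{TN + FP + FN + TP} = \frac{19 + 31}{61} \approx 0.8197 \\
\text{precision}_1 &= \frac{TP}{TP + FP} = \frac{31}{31 + 9} = 0.7750 \\
\text{recall}_1 &= \frac{TP}{TP + FN} = \frac{31}{31 + 2} \approx 0.9394 \\
F_{1,1} &= \frac{2\,TP}{2\,TP + FP + FN} = \frac{62}{73} \approx 0.8493
\end{aligned}
```

The 9 false positives are also what pulls class-0 recall down to 19/28 ≈ 0.6786: the model rarely misses a positive case, but it flags a fair number of negatives.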



## 5) Decision Tree + CV to select best `max_depth`

**Note:** Decision trees do **not** require scaling.
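
The reason is that each split thresholds a single feature, so rescaling a feature moves the threshold but not which samples fall on either side of it. A small check on synthetic data, with hypothetical names (`X_syn`, `tree_raw`, `tree_scaled`) and an arbitrary `max_depth=3`, might look like this:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the heart dataset)
X_syn, y_syn = make_classification(
    n_samples=300, n_features=13, n_informative=6, random_state=42
)
X_syn_s = StandardScaler().fit_transform(X_syn)

# Same tree settings, fitted once on raw and once on standardized features
tree_raw = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_syn, y_syn)
tree_scaled = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_syn_s, y_syn)

# Fraction of samples on which the two trees predict the same class
agreement = np.mean(tree_raw.predict(X_syn) == tree_scaled.predict(X_syn_s))
print(f"Prediction agreement, raw vs. scaled features: {agreement:.3f}")
```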


### 5.1) Cross-validation to select best max_depth

In [None]:
# Initialize StratifiedKFold
cv = ...

In [None]:
# Create a list of max_depth values to try
# max_depth values: 1, 2, 3, 4, 5, 6, 8, 10
depth_list = ...

# Build mean_scores list
mean_scores = []

In [None]:
# Loop over depth values and perform CV
for d in depth_list:
    tree = ...
    scores = cross_val_score(...)
    mean_scores.append(scores.mean())

In [24]:
# Find best max_depth
best_idx = int(np.argmax(mean_scores))
best_depth = depth_list[best_idx]
best_cv_acc = mean_scores[best_idx]

In [25]:
# Print results
print("=== Decision Tree CV results (TRAIN only) ===")
for d, m in zip(depth_list, mean_scores):
    d_text = "None" if d is None else str(d)
    print(f"max_depth={d_text:>4s}  mean_CV_acc={m:.4f}")

print(f"\nBest max_depth = {('None' if best_depth is None else best_depth)} (mean CV acc = {best_cv_acc:.4f})")

=== Decision Tree CV results (TRAIN only) ===
max_depth=   1  mean_CV_acc=0.7399
max_depth=   2  mean_CV_acc=0.7107
max_depth=   3  mean_CV_acc=0.7770
max_depth=   4  mean_CV_acc=0.7520
max_depth=   5  mean_CV_acc=0.7229
max_depth=   6  mean_CV_acc=0.7313
max_depth=   8  mean_CV_acc=0.7271
max_depth=  10  mean_CV_acc=0.7271

Best max_depth = 3 (mean CV acc = 0.7770)


### 5.2) Train final Decision Tree model with best max_depth

In [None]:
# Train final Decision Tree model with best max_depth
best_tree = ...
....fit(...)

### 5.3) Evaluate Decision Tree on test set

In [None]:
# Evaluate Decision Tree on test set
y_pred_tree = ...
acc_tree = ...
cm_tree = ...

In [28]:
# Print results
print("\n=== Decision Tree Test ===")
print("Accuracy:", round(acc_tree, 4))
print("Confusion matrix [[TN FP],[FN TP]]:\n", cm_tree)
print("\nReport:\n", classification_report(y_test, y_pred_tree, digits=4))



=== Decision Tree Test ===
Accuracy: 0.7541
Confusion matrix [[TN FP],[FN TP]]:
 [[19  9]
 [ 6 27]]

Report:
               precision    recall  f1-score   support

           0     0.7600    0.6786    0.7170        28
           1     0.7500    0.8182    0.7826        33

    accuracy                         0.7541        61
   macro avg     0.7550    0.7484    0.7498        61
weighted avg     0.7546    0.7541    0.7525        61
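
`export_text` is imported at the top of the notebook and can be used to read a fitted tree as plain if/else rules. The sketch below shows the call on a small tree trained on synthetic data; the rules it prints are therefore meaningless for heart patients. Only the column names are taken from the lab, and `demo_tree` and `rules` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Column names as printed by the lab; the data itself is synthetic
feature_names = ['age', 'sex', 'cp', 'trtbps', 'chol', 'fbs', 'restecg',
                 'thalachh', 'exng', 'oldpeak', 'slp', 'caa', 'thall']
X_syn, y_syn = make_classification(
    n_samples=300, n_features=13, n_informative=6, random_state=42
)

# A shallow tree keeps the printed rule list short enough to read
demo_tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_syn, y_syn)

# One line per node: split conditions for inner nodes, "class: ..." for leaves
rules = export_text(demo_tree, feature_names=feature_names)
print(rules)
```

In the lab itself, the same call on `best_tree` with `feature_names=list(X.columns)` would show which features the depth-3 tree actually splits on.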



## 6) Compare models (test set)

You can report:
- accuracy
- confusion matrix
- precision/recall/F1 for class 1 (heart attack)
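
One way to gather these numbers into a single table is sketched below. So that it runs on its own, it rebuilds label arrays from the two confusion matrices printed earlier; in the notebook you would pass `y_test` together with `y_pred_knn` or `y_pred_tree` instead. The helper `labels_from_cm` and the column names `precision_1`, `recall_1`, and `f1_1` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def labels_from_cm(cm):
    """Rebuild (y_true, y_pred) arrays that reproduce a 2x2 confusion matrix."""
    (tn, fp), (fn, tp) = cm
    y_true = np.array([0] * (tn + fp) + [1] * (fn + tp))
    y_pred = np.array([0] * tn + [1] * fp + [0] * fn + [1] * tp)
    return y_true, y_pred

# Confusion matrices [[TN, FP], [FN, TP]] copied from the test outputs above
cms = {
    "k-NN": [[19, 9], [2, 31]],
    "Decision Tree": [[19, 9], [6, 27]],
}

rows = []
for name, cm in cms.items():
    y_true, y_pred = labels_from_cm(cm)
    # average="binary" reports metrics for pos_label only (class 1 here)
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, pos_label=1, average="binary"
    )
    rows.append({
        "model": name,
        "test_accuracy": accuracy_score(y_true, y_pred),
        "precision_1": p,
        "recall_1": r,
        "f1_1": f1,
    })

summary = (
    pd.DataFrame(rows)
    .sort_values("test_accuracy", ascending=False)
    .round(4)
)
print(summary.to_string(index=False))
```

Both models produce the same 9 false positives on this split; the gap in accuracy comes from false negatives (2 for k-NN versus 6 for the tree), which is why recall for class 1 is the metric that separates them most clearly.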


In [None]:
results = pd.DataFrame([
    {"model": "k-NN", "test_accuracy": ...},
    {"model": "Decision Tree", "test_accuracy": ...},
]).sort_values("test_accuracy", ascending=False)

display(results)


|   | model | test_accuracy |
|---|-------|---------------|
| 0 | k-NN | 0.819672 |
| 1 | Decision Tree | 0.754098 |


## 7) Predict a new patient

You must provide values for **all feature columns** exactly as in `X.columns`.
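
A small guard can catch missing or misspelled columns before they reach the models. The helper below is a hypothetical sketch (`make_patient_frame` is not part of the lab): it checks a dict of patient values against the expected feature list and returns a one-row DataFrame in the expected column order. The example values are the medians shown in the output further down.

```python
import pandas as pd

# Feature columns in the order printed by the lab
feature_cols = ['age', 'sex', 'cp', 'trtbps', 'chol', 'fbs', 'restecg',
                'thalachh', 'exng', 'oldpeak', 'slp', 'caa', 'thall']

def make_patient_frame(sample, columns):
    """Return a one-row float DataFrame with exactly `columns`, in that order."""
    missing = [c for c in columns if c not in sample]
    extra = [c for c in sample if c not in columns]
    if missing or extra:
        raise ValueError(f"missing={missing}, unexpected={extra}")
    # Selecting by `columns` fixes the order regardless of dict key order
    return pd.DataFrame([sample])[list(columns)].astype(float)

# Keys deliberately given out of order to show the reordering
patient = {
    "thall": 2, "caa": 0, "slp": 1, "oldpeak": 0.8, "exng": 0,
    "thalachh": 153, "restecg": 1, "fbs": 0, "chol": 240,
    "trtbps": 130, "cp": 1, "sex": 1, "age": 55,
}
patient_df = make_patient_frame(patient, feature_cols)
print(patient_df.to_string(index=False))
```

Column order matters because `scaler.transform` applies its stored per-column means and standard deviations in the order seen during fitting.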


In [30]:
print("Feature columns:", list(X.columns))

# Example input 
new_sample = {col: float(X[col].median()) for col in X.columns}  # simple default using median

new_df = pd.DataFrame([new_sample])
display(new_df)

# k-NN needs scaling
new_df_s = scaler.transform(new_df)

pred_knn = int(best_knn.predict(new_df_s)[0])
pred_tree = int(best_tree.predict(new_df)[0])

print("k-NN prediction (0/1):", pred_knn)
print("Decision Tree prediction (0/1):", pred_tree)


Feature columns: ['age', 'sex', 'cp', 'trtbps', 'chol', 'fbs', 'restecg', 'thalachh', 'exng', 'oldpeak', 'slp', 'caa', 'thall']


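|   | age | sex | cp | trtbps | chol | fbs | restecg | thalachh | exng | oldpeak | slp | caa | thall |
|---|-----|-----|----|--------|------|-----|---------|----------|------|---------|-----|-----|-------|
| 0 | 55.0 | 1.0 | 1.0 | 130.0 | 240.0 | 0.0 | 1.0 | 153.0 | 0.0 | 0.8 | 1.0 | 0.0 | 2.0 |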


k-NN prediction (0/1): 1
Decision Tree prediction (0/1): 1
