# Task #13978 & #16111: K-Nearest Neighbours and Grid Search

This notebook documents the implementation of a **K-Nearest Neighbours (K-NN)** algorithm to predict heart disease using the `enes_final_cleaned_data.csv` dataset. It is divided into two parts:
1. **Task #13978**: Building a baseline K-Nearest Neighbours model with scikit-learn's `KNeighborsClassifier`.
2. **Task #16111**: Applying `GridSearchCV` to exhaustively test hyperparameter combinations and try to improve on the baseline model's performance.

## 1. Importing Libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

## 2. Loading the Dataset

I use `enes_final_cleaned_data.csv`, which contains only numerical columns and is ready for modelling without further cleaning.

In [2]:
data = pd.read_csv('enes_final_cleaned_data.csv')
data.head()

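|  | age | sex | chest_pain_type | resting_blood_pressure | cholesterol | fasting_blood_sugar | resting_ecg | max_heart_rate | exercise_induced_angina | st_depression | slope | num_major_vessels | thalassemia | target |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |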


## 3. Preprocessing (Scaling and Splitting)

For K-Nearest Neighbours (K-NN), **feature scaling is essential** because the algorithm classifies a point from the distances to its neighbours, using a metric such as Euclidean distance. Without scaling, features with large ranges (such as `cholesterol`) would dominate that distance, while small-range features (such as `sex`) would barely count.
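
To make the effect concrete, here is a minimal sketch with three hypothetical patients described only by `sex` and `cholesterol`. The patient values, and the means and standard deviations used for standardising, are made up for illustration and are not taken from the dataset.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def standardise(point, means, stds):
    """Apply z = (x - mean) / std to each feature."""
    return [(x - m) / s for x, m, s in zip(point, means, stds)]

# Hypothetical patients, each given as [sex, cholesterol].
query = [1, 230]
p1 = [0, 232]   # different sex, almost identical cholesterol
p2 = [1, 250]   # same sex, cholesterol 20 units higher

raw_p1 = euclidean(query, p1)
raw_p2 = euclidean(query, p2)
print(f"raw:    d(query, p1) = {raw_p1:.2f}, d(query, p2) = {raw_p2:.2f}")

# Assumed per-feature statistics (illustrative only, not computed from the CSV).
means, stds = [0.5, 240.0], [0.5, 50.0]
q_s, p1_s, p2_s = (standardise(p, means, stds) for p in (query, p1, p2))

scaled_p1 = euclidean(q_s, p1_s)
scaled_p2 = euclidean(q_s, p2_s)
print(f"scaled: d(query, p1) = {scaled_p1:.2f}, d(query, p2) = {scaled_p2:.2f}")
```

On the raw values `p1` is the nearest neighbour, because the cholesterol difference swamps the difference in `sex`. After standardising, both features contribute on a comparable scale and `p2` becomes the nearest neighbour.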

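I hold out 20% of the dataset as a test set, stratifying on `target` so that both splits keep the same class balance. The scaler is fitted on the training set only and then applied to the test set, so no test-set statistics leak into training.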

In [3]:
X = data.drop(columns=['target'])
y = data['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"Training set shape: {X_train_scaled.shape}")
print(f"Test set shape: {X_test_scaled.shape}")

Training set shape: (241, 13)
Test set shape: (61, 13)
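
The two shapes imply a dataset of 302 rows (241 + 61). As a rough check of the arithmetic, the sketch below assumes that `train_test_split` rounds a fractional `test_size` up to the next whole sample and gives the remainder to the training set. That is my understanding of scikit-learn's behaviour, not something verified in this notebook.

```python
import math

# Row counts taken from the shapes printed above.
n_train_reported, n_test_reported = 241, 61
n_samples = n_train_reported + n_test_reported

test_size = 0.2
# Assumption: test count = ceil(test_size * n_samples); training set gets the rest.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(f"{n_samples} rows -> {n_train} train, {n_test} test")
```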


## 4. Part 1: Baseline K-NN Model

As a baseline, let's fit K-NN with an arbitrary number of neighbours ($k=5$) and review its performance on the test set.

In [4]:
knn_baseline = KNeighborsClassifier(n_neighbors=5)
knn_baseline.fit(X_train_scaled, y_train)

y_pred_base = knn_baseline.predict(X_test_scaled)

print(f"Baseline Accuracy (k=5): {accuracy_score(y_test, y_pred_base):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred_base))

Baseline Accuracy (k=5): 0.8033

Classification Report:
              precision    recall  f1-score   support

           0       0.86      0.68      0.76        28
           1       0.77      0.91      0.83        33

    accuracy                           0.80        61
   macro avg       0.82      0.79      0.80        61
weighted avg       0.81      0.80      0.80        61
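
For intuition about what the classifier does at prediction time, here is a minimal pure-Python sketch of K-NN with majority voting. It is not scikit-learn's implementation, which uses optimised neighbour searches and its own tie-breaking. `knn_predict` and the toy points are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=5):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Distance from the query to every training point, paired with that point's label.
    distances = [
        (math.dist(row, query), label)
        for row, label in zip(train_X, train_y)
    ]
    # Keep the k closest points and count their labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy, already-scaled training data: two small clusters.
train_X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
           [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]]
train_y = [0, 0, 0, 1, 1, 1]

pred_a = knn_predict(train_X, train_y, [0.15, 0.15], k=3)
pred_b = knn_predict(train_X, train_y, [0.95, 1.00], k=3)
print(pred_a, pred_b)
```

Each query takes the label of the cluster it sits in. Because every prediction scans the whole training set, K-NN has no real training phase: "fitting" mostly means storing the data.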



## 5. Part 2: Applying Grid Search (Task #16111)

Instead of guessing my parameters, I can employ **Grid Search**. `GridSearchCV` trains and scores a K-NN model for every combination of hyperparameter values in a supplied grid.

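I use **5-Fold Cross Validation** so that each candidate is scored on five different validation folds rather than on a single split it might happen to fit well. Because the scaler sits inside the `Pipeline`, it is refitted on the training portion of every fold. The grid covers:
- `n_neighbors`: every integer from 1 to 30.
- `weights`: whether all neighbours vote equally (`uniform`) or closer neighbours count for more (`distance`).
- `metric`: the function used to measure the distance between points (`euclidean`, `manhattan`, or `minkowski`). With scikit-learn's default `p=2`, `minkowski` is identical to `euclidean`, so those candidates duplicate each other.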
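
The options in the list above can be sketched in a few lines of plain Python. This is an illustration, not scikit-learn's code: `minkowski` and `vote` are hypothetical helpers, and the `distance` weighting is assumed to be the inverse of the distance, with all distances non-zero.

```python
def minkowski(u, v, p):
    """Minkowski distance of order p: p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1 / p)

u, v = [0.0, 0.0], [3.0, 4.0]
manhattan = minkowski(u, v, p=1)   # |3| + |4|
euclidean = minkowski(u, v, p=2)   # sqrt(3^2 + 4^2)
print(manhattan, euclidean)

def vote(neighbours, weights="uniform"):
    """Pick a label from (distance, label) pairs.

    'uniform' gives every neighbour one vote; 'distance' weights each
    vote by 1 / distance (assumes no neighbour is at distance zero).
    """
    totals = {}
    for distance, label in neighbours:
        weight = 1.0 if weights == "uniform" else 1.0 / distance
        totals[label] = totals.get(label, 0.0) + weight
    return max(totals, key=totals.get)

# One very close neighbour of class 1, two farther neighbours of class 0.
neighbours = [(0.1, 1), (0.9, 0), (1.0, 0)]
uniform_label = vote(neighbours, "uniform")
distance_label = vote(neighbours, "distance")
print(uniform_label, distance_label)
```

With uniform weights the two class-0 neighbours outvote the single class-1 neighbour. With distance weights the very close class-1 neighbour wins instead, which is why `weights` is worth tuning alongside `n_neighbors`.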

In [5]:
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier())
])

param_grid = {
    'knn__n_neighbors': list(range(1, 31)),
    'knn__weights': ['uniform', 'distance'],
    'knn__metric': ['euclidean', 'manhattan', 'minkowski']
}

grid_search = GridSearchCV(
    estimator=pipeline,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
    verbose=1,
    n_jobs=-1
)

# 30 values of k * 2 weightings * 3 metrics = 180 candidates; 180 * 5 folds = 900 fits
grid_search.fit(X_train, y_train)

print("Grid Search completed!")

Fitting 5 folds for each of 180 candidates, totalling 900 fits
Grid Search completed!
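
The candidate count in the log can be reproduced by enumerating the grid by hand. The sketch below only lists the combinations and trains nothing. It drops the pipeline's `knn__` prefix for brevity and is not meant to mirror how `GridSearchCV` is implemented internally.

```python
from itertools import product

# Same values as param_grid above, without the 'knn__' prefix.
grid = {
    "n_neighbors": list(range(1, 31)),
    "weights": ["uniform", "distance"],
    "metric": ["euclidean", "manhattan", "minkowski"],
}

# One candidate per combination of a single value for each hyperparameter.
names = list(grid)
candidates = [dict(zip(names, values)) for values in product(*grid.values())]

n_folds = 5
n_candidates = len(candidates)
n_fits = n_candidates * n_folds
print(f"{n_candidates} candidates x {n_folds} folds = {n_fits} fits")
```

Grid search then scores each candidate by its mean validation accuracy across the folds and keeps the one with the highest mean.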


## 6. Optimal Results and Evaluation
Now I take the best estimator found, which `GridSearchCV` refits on the full training set by default, and use it to predict on my hold-out test set.

In [6]:
print("Best Parameters Found:")
print(grid_search.best_params_)
print(f"Best Cross-Validation Accuracy: {grid_search.best_score_:.4f}\n")

best_model = grid_search.best_estimator_
y_pred_optimal = best_model.predict(X_test)

print("--- Final Evaluation on Unseen Test Data ---")
print(f"Test Set Accuracy: {accuracy_score(y_test, y_pred_optimal):.4f}\n")
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_optimal))
print("\nClassification Report:")
print(classification_report(y_test, y_pred_optimal))

Best Parameters Found:
{'knn__metric': 'manhattan', 'knn__n_neighbors': 16, 'knn__weights': 'uniform'}
Best Cross-Validation Accuracy: 0.8509

--- Final Evaluation on Unseen Test Data ---
Test Set Accuracy: 0.8033

Confusion Matrix:
[[19  9]
 [ 3 30]]

Classification Report:
              precision    recall  f1-score   support

           0       0.86      0.68      0.76        28
           1       0.77      0.91      0.83        33

    accuracy                           0.80        61
   macro avg       0.82      0.79      0.80        61
weighted avg       0.81      0.80      0.80        61
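
The figures in the classification report follow directly from the confusion matrix. As a check, the sketch below recomputes them by hand, reading the matrix with rows as true labels and columns as predicted labels. `per_class_metrics` is a hypothetical helper written for this check.

```python
# Confusion matrix from the output above: rows = true label, columns = predicted label.
matrix = [[19, 9],
          [3, 30]]

def per_class_metrics(matrix, cls):
    """Return (precision, recall, f1) for one class of a square confusion matrix."""
    tp = matrix[cls][cls]
    fp = sum(row[cls] for row in matrix) - tp   # predicted cls, truly another class
    fn = sum(matrix[cls]) - tp                  # truly cls, predicted another class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

correct = sum(matrix[i][i] for i in range(len(matrix)))
total = sum(sum(row) for row in matrix)
accuracy = correct / total

metrics = {cls: per_class_metrics(matrix, cls) for cls in (0, 1)}
for cls, (precision, recall, f1) in metrics.items():
    print(f"class {cls}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"accuracy={accuracy:.4f}")
```

The recomputed values match the report. Note that the report is identical to the baseline's: on this particular split the tuned model reaches the same test accuracy (0.8033) as $k=5$, even though its cross-validation accuracy was 0.8509.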

