# K-Nearest Neighbors (KNN) Regression Analysis

This notebook evaluates the **K-Nearest Neighbors (KNN) Regressor** on a dataset **(FINAL_cleaned.csv)** using:

- different train-test splits
- cross-validation strategies **(K-Fold and Leave-One-Out)**
- and performance metrics **(MSE, MAE, R²)**.

### Import Libraries & Load Dataset

This cell:

- Imports core libraries for data handling **(numpy, pandas)**, visualization **(matplotlib)**, and model evaluation **(sklearn)**.
- Loads the cleaned dataset **FINAL_cleaned.csv**.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split, KFold, LeaveOneOut, cross_val_score
from sklearn.metrics import make_scorer, mean_squared_error, mean_absolute_error, r2_score

df = pd.read_csv('FINAL_cleaned.csv')

### Train-Test Evaluation Across Splits & K-values

This cell:

- Defines input features X and target y.
- Iterates over different training set ratios and k values.
- Repeats each train-test split 5 times with different random seeds.
- Computes average and std. deviation of MSE, MAE, and R² for each combination.
- Stores the results in a DataFrame and displays the best-performing configurations by R².

In [None]:
X = df.drop(columns=["Quality_of_Life_Value", "Quality_of_Life_Category","Country","Sub-region","Sub_region_encoded"])
y = df["Quality_of_Life_Value"]

print(X.head())
print(y.head())


# Parameters to explore
split_ratios = [0.6, 0.7, 0.8, 0.9]
k_values = [3, 5, 7]
n_repeats = 5

rows = []

for train_ratio in split_ratios:
    for k in k_values:
        mse_list = []
        mae_list = []
        r2_list = []

        for seed in range(n_repeats):
            X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=train_ratio, random_state=seed)

            model = KNeighborsRegressor(n_neighbors=k)
            model.fit(X_train, y_train)
            y_pred = model.predict(X_test)

            mse_list.append(mean_squared_error(y_test, y_pred))
            mae_list.append(mean_absolute_error(y_test, y_pred))
            r2_list.append(r2_score(y_test, y_pred))

        rows.append({
            "train_ratio": train_ratio,
            "k": k,
            "MSE_mean": np.mean(mse_list),
            "MSE_std": np.std(mse_list),
            "MAE_mean": np.mean(mae_list),
            "MAE_std": np.std(mae_list),
            "R2_mean": np.mean(r2_list),
            "R2_std": np.std(r2_list),
        })

# Results DataFrame
results_df = pd.DataFrame(rows)
display(results_df.sort_values(by="R2_mean", ascending=False))

| | train_ratio | k | MSE_mean | MSE_std | MAE_mean | MAE_std | R2_mean | R2_std |
|---|---|---|---|---|---|---|---|---|
| 5 | 0.7 | 7 | 2261.455164 | 358.597862 | 38.967991 | 3.556619 | 0.008687 | 0.108375 |
| 2 | 0.6 | 7 | 2230.824857 | 288.491464 | 38.30315 | 3.161513 | -0.010835 | 0.116149 |
| 8 | 0.8 | 7 | 2292.758283 | 809.743254 | 38.589347 | 7.389943 | -0.01847 | 0.160177 |
| 4 | 0.7 | 5 | 2365.065148 | 336.113594 | 39.421897 | 2.582152 | -0.046067 | 0.156454 |
| 7 | 0.8 | 5 | 2408.614061 | 943.68268 | 39.796114 | 8.293346 | -0.064402 | 0.210008 |
| 1 | 0.6 | 5 | 2364.199275 | 357.576958 | 38.938857 | 2.738037 | -0.071654 | 0.151727 |
| 11 | 0.9 | 7 | 2133.935569 | 864.277583 | 38.473584 | 8.921787 | -0.131683 | 0.38453 |
| 10 | 0.9 | 5 | 2240.55789 | 876.702182 | 39.670545 | 8.754845 | -0.157321 | 0.30877 |
| 3 | 0.7 | 3 | 2673.002106 | 289.59509 | 40.852538 | 1.779079 | -0.198479 | 0.235772 |
| 6 | 0.8 | 3 | 2702.364708 | 836.682832 | 41.666127 | 6.825178 | -0.218815 | 0.146363 |
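
The repeated random splits used in this cell can also be expressed with scikit-learn's `ShuffleSplit` and `cross_validate`, which return all metrics from a single pass over the splits. The following is a minimal sketch under stated assumptions: `X_demo` and `y_demo` are synthetic data from `make_regression` standing in for `FINAL_cleaned.csv`, and `ShuffleSplit` draws its own random splits, so its numbers would not match the seed-by-seed loop exactly.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import ShuffleSplit, cross_validate
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the real features and target (assumption, not the notebook's data)
X_demo, y_demo = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

# 5 random splits, each training on 70% of the samples and testing on the rest
splitter = ShuffleSplit(n_splits=5, train_size=0.7, random_state=0)

# Built-in scorer names; the error scorers are negated so that higher is better
scoring = {
    "mse": "neg_mean_squared_error",
    "mae": "neg_mean_absolute_error",
    "r2": "r2",
}

cv_out = cross_validate(
    KNeighborsRegressor(n_neighbors=7), X_demo, y_demo, cv=splitter, scoring=scoring
)

# Flip the sign of the negated error scores to report positive MSE and MAE
summary = {
    "MSE_mean": -cv_out["test_mse"].mean(),
    "MSE_std": cv_out["test_mse"].std(),
    "MAE_mean": -cv_out["test_mae"].mean(),
    "MAE_std": cv_out["test_mae"].std(),
    "R2_mean": cv_out["test_r2"].mean(),
    "R2_std": cv_out["test_r2"].std(),
}
print(summary)
```

Wrapping this in loops over `train_size` and `n_neighbors` would give one summary row per configuration, in the same shape as `results_df`.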


### K-Fold Cross-Validation

This cell:

- Uses 6-fold cross-validation (KFold) to evaluate KNN for different k values.
- Calculates mean and standard deviation for MSE, MAE, and R².
- Stores results in a DataFrame and displays them sorted by R².

In [None]:


# Custom scorers: greater_is_better=False makes them return negated errors,
# so the scores are negated again below to report positive MSE and MAE
mse_scorer = make_scorer(mean_squared_error, greater_is_better=False)
mae_scorer = make_scorer(mean_absolute_error, greater_is_better=False)

kf = KFold(n_splits=6, shuffle=True, random_state=42)
k_values = [3, 5, 7]

rows_kfold = []

for k in k_values:
    model = KNeighborsRegressor(n_neighbors=k)

    mse_scores = -cross_val_score(model, X, y, cv=kf, scoring=mse_scorer)
    mae_scores = -cross_val_score(model, X, y, cv=kf, scoring=mae_scorer)
    r2_scores = cross_val_score(model, X, y, cv=kf, scoring='r2')

    rows_kfold.append({
        "split": "KFold",
        "k": k,
        "MSE_mean": mse_scores.mean(),
        "MSE_std": mse_scores.std(),
        "MAE_mean": mae_scores.mean(),
        "MAE_std": mae_scores.std(),
        "R2_mean": r2_scores.mean(),
        "R2_std": r2_scores.std()
    })

df_kfold = pd.DataFrame(rows_kfold)
display(df_kfold.sort_values(by="R2_mean", ascending=False))


| | split | k | MSE_mean | MSE_std | MAE_mean | MAE_std | R2_mean | R2_std |
|---|---|---|---|---|---|---|---|---|
| 2 | KFold | 7 | 2412.701844 | 1026.359269 | 39.902068 | 9.676326 | -0.103316 | 0.44549 |
| 1 | KFold | 5 | 2542.607454 | 898.737421 | 39.990035 | 9.123901 | -0.189932 | 0.397692 |
| 0 | KFold | 3 | 2760.806598 | 875.154554 | 42.587466 | 7.841172 | -0.311469 | 0.384556 |
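
The sign handling in this cell is easy to misread, so here is a small sketch of what `make_scorer(..., greater_is_better=False)` actually returns. It assumes synthetic data from `make_regression` (not the notebook's dataset) and compares the custom scorer with scikit-learn's built-in `"neg_mean_squared_error"` scoring string.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (assumption, not FINAL_cleaned.csv)
X_demo, y_demo = make_regression(n_samples=60, n_features=4, noise=10.0, random_state=0)

kf_demo = KFold(n_splits=6, shuffle=True, random_state=42)
model_demo = KNeighborsRegressor(n_neighbors=5)

# greater_is_better=False wraps the metric so that the scorer returns -MSE
custom_scorer = make_scorer(mean_squared_error, greater_is_better=False)
raw_custom = cross_val_score(model_demo, X_demo, y_demo, cv=kf_demo, scoring=custom_scorer)

# The built-in scoring string follows the same "negated error" convention
raw_builtin = cross_val_score(
    model_demo, X_demo, y_demo, cv=kf_demo, scoring="neg_mean_squared_error"
)

# Negating the raw scores recovers the usual positive MSE per fold
mse_per_fold = -raw_custom
print(raw_custom)
print(mse_per_fold)
```

Both scorers yield the same per-fold values here, so the built-in strings (`"neg_mean_squared_error"`, `"neg_mean_absolute_error"`) could replace the custom scorers without changing the results.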


### Leave-One-Out Cross-Validation (LOO)

This cell:

- Uses **Leave-One-Out cross-validation**, where each sample is used once as a test set.
- Calculates **MSE** and **MAE** for each k value.
- Note: **R²** cannot be computed per fold here because each test set contains a single sample, so it is recorded as None.
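
For reference, the problem with R² on a one-sample test set follows from its standard definition. The symbols below are generic and do not appear in the notebook: $y_i$ are the true test values, $\hat{y}_i$ the predictions, $\bar{y}$ the mean of the true test values, and $n$ the test set size.

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
\qquad
\text{with } n = 1:\ \bar{y} = y_1 \ \Rightarrow\ \sum_{i=1}^{1} (y_i - \bar{y})^2 = 0
```

With a single test sample the denominator is zero, so the ratio is undefined for every LOO fold.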

In [4]:


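# Imports repeated here so the cell also runs if the import cell was skipped
# (otherwise make_scorer is undefined); X and y come from the earlier cells
import pandas as pd
from sklearn.metrics import make_scorer, mean_squared_error, mean_absolute_error
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Scorers return negated errors (greater_is_better=False); negated again below
mse_scorer = make_scorer(mean_squared_error, greater_is_better=False)
mae_scorer = make_scorer(mean_absolute_error, greater_is_better=False)

loo = LeaveOneOut()
k_values = [3, 5, 7]

rows_loo = []

for k in k_values:
    model = KNeighborsRegressor(n_neighbors=k)

    mse_scores = -cross_val_score(model, X, y, cv=loo, scoring=mse_scorer)
    mae_scores = -cross_val_score(model, X, y, cv=loo, scoring=mae_scorer)

    rows_loo.append({
        "split": "LOO",
        "k": k,
        "MSE_mean": mse_scores.mean(),
        "MSE_std": mse_scores.std(),
        "MAE_mean": mae_scores.mean(),
        "MAE_std": mae_scores.std(),
        "R2_mean": None,  # Undefined on one-sample test sets
        "R2_std": None
    })

df_loo = pd.DataFrame(rows_loo)
# R2_mean is None for every row, so sort by MSE (lowest first) instead
display(df_loo.sort_values(by="MSE_mean"))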
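
Although R² is recorded as None for LOO, one common workaround is to pool the held-out predictions from all LOO iterations and compute a single R² over them. The sketch below illustrates that idea with `cross_val_predict`; it is a swapped-in technique rather than something this notebook does, and it runs on synthetic data from `make_regression` instead of the real dataset.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (assumption, not FINAL_cleaned.csv)
X_demo, y_demo = make_regression(n_samples=40, n_features=4, noise=10.0, random_state=0)

# Each prediction comes from a model trained on all samples except that one
y_loo_pred = cross_val_predict(
    KNeighborsRegressor(n_neighbors=5), X_demo, y_demo, cv=LeaveOneOut()
)

# One pooled score over all held-out predictions instead of one score per fold
pooled_r2 = r2_score(y_demo, y_loo_pred)
pooled_mse = mean_squared_error(y_demo, y_loo_pred)
print(pooled_r2, pooled_mse)
```

A pooled R² is a single number, so it has no per-fold standard deviation and is not directly comparable to the `R2_mean` and `R2_std` columns of the K-Fold table.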

<img src="KNN.png" width="500"/>
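
The code that produced `KNN.png` is not shown in this notebook. A plot of this kind could be produced along the following lines. This is only a sketch: it uses synthetic data from `make_regression`, a single 80/20 split for the train/test curve, and hypothetical helper names (`mse_by_method`, `plot_mse_vs_k`), so it will not reproduce the figure above.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (assumption, not FINAL_cleaned.csv)
X_demo, y_demo = make_regression(n_samples=60, n_features=4, noise=10.0, random_state=0)
k_grid = [3, 5, 7]


def mse_by_method(X, y, k):
    """Return the MSE of a k-NN regressor under three validation schemes (hypothetical helper)."""
    model = KNeighborsRegressor(n_neighbors=k)

    # Single hold-out split (the notebook averages several splits instead)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.8, random_state=0)
    holdout = mean_squared_error(y_te, model.fit(X_tr, y_tr).predict(X_te))

    # Built-in scorer is negated, so flip the sign to get positive MSE
    kf = KFold(n_splits=6, shuffle=True, random_state=42)
    kfold = -cross_val_score(model, X, y, cv=kf, scoring="neg_mean_squared_error").mean()
    loo = -cross_val_score(model, X, y, cv=LeaveOneOut(), scoring="neg_mean_squared_error").mean()

    return {"Train/Test Split": holdout, "K-Fold (6 folds)": kfold, "Leave-One-Out": loo}


def plot_mse_vs_k(X, y, k_values):
    """Draw one MSE-vs-k line per validation method and return the Axes (hypothetical helper)."""
    results = [mse_by_method(X, y, k) for k in k_values]
    fig, ax = plt.subplots(figsize=(6, 4))
    for method in results[0]:
        ax.plot(k_values, [r[method] for r in results], marker="o", label=method)
    ax.set_xlabel("k (number of neighbors)")
    ax.set_ylabel("MSE")
    ax.set_xticks(k_values)
    ax.set_title("MSE vs. k by validation method")
    ax.legend()
    return ax


ax = plot_mse_vs_k(X_demo, y_demo, k_grid)
```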


### Summary and Conclusion

This notebook explored the performance of the K-Nearest Neighbors (KNN) Regressor for predicting Quality_of_Life_Value based on a range of geographic and environmental features. We evaluated the model using three validation techniques:

- Train/Test Split
- K-Fold Cross-Validation (6 folds)
- Leave-One-Out Cross-Validation (LOO)

Each method was tested with different values of k (3, 5, 7), and the primary metric used for evaluation was Mean Squared Error (MSE).

### Observations from MSE vs. k Graph

- For all three validation methods, the lowest MSE among the tested values is achieved at k = 7, which suggests that averaging over more neighbors reduces prediction error on this dataset.
- Train/Test Split delivers the lowest MSE values overall among the three methods (at k = 7), though with only 5 random splits per configuration this may partly reflect favorable splits.
- K-Fold Cross-Validation shows the highest MSE overall among the three methods.
- Leave-One-Out Cross-Validation (LOO) yields intermediate MSE values, higher than Train/Test but lower than K-Fold. Because each LOO model is trained on all samples except one, its estimate does not depend on a particular random split.