# ML CUP 2022

## Regression based on k-nearest neighbors

This notebook builds a k-nearest neighbors regression (KNR) model for the ML Cup 2022 task. It looks for the best combination of hyperparameters by running a grid search over a given range of values.

Hyperparameters considered for the grid search:

1. n_neighbors
2. algorithm

### Loading libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from numpy import linalg as LA

from sklearn.metrics import make_scorer
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV

import joblib

import math
import random

In [2]:
# choosing a seed for reproducibility
seed = 1
random.seed(seed)
np.random.seed(seed)

### Definition of the Mean Euclidean Distance

In [3]:
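def my_mean_euclidean_distance(y_true, y_pred):
    """Average Euclidean (L2) distance between true and predicted target vectors."""
    points = len(y_true)
    tot_sum = 0
    for i in range(points):
        # Euclidean norm of the residual of the i-th sample
        tot_sum += LA.norm(y_true[i] - y_pred[i])

    return tot_sum / points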

In [4]:
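# greater_is_better=False: the scorer returns the negated distance,
# so that GridSearchCV (which maximizes the score) minimizes the error
mean_euclidean_distance = make_scorer(my_mean_euclidean_distance, greater_is_better=False)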
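
The loop in `my_mean_euclidean_distance` can also be written as a single NumPy expression. The sketch below is an equivalent vectorized form; the name `mee_vectorized` is hypothetical and not part of the notebook, and it assumes both arguments are arrays of shape `(n_samples, n_targets)`. Because the scorer is built with `greater_is_better=False`, scikit-learn reports the negated value, which is why the cross-validation scores in the grid search log are negative.

```python
import numpy as np

def mee_vectorized(y_true, y_pred):
    """Mean Euclidean distance between rows of y_true and y_pred."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # L2 norm of each residual row, then the average over samples
    return np.linalg.norm(y_true - y_pred, axis=1).mean()

# two samples: distances 0 and 5, so the mean is 2.5
y_true = np.array([[0.0, 0.0], [3.0, 4.0]])
y_pred = np.array([[0.0, 0.0], [0.0, 0.0]])
print(mee_vectorized(y_true, y_pred))  # 2.5
```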

### Loading data

In [5]:
colnames = ['id', 'a1', 'a2', 'a3', 'a4', 'a5', 'a6', 'a7', 'a8', 'a9', 'target1', 'target2']
mlcup_tr = pd.read_csv("./dataset/ml_cup22/ML-CUP22-TR.csv", sep = ",", names=colnames)
mlcup_tr = mlcup_tr.iloc[1:, :]
mlcup_tr = mlcup_tr.drop('id', axis=1)

In [6]:
x_mlcup_tr = mlcup_tr.iloc[:, 0:9].values
y_mlcup_tr = mlcup_tr.iloc[:, 9:11].values

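The cell below applies min-max normalization to the input features of the training set; the targets are left on their original scale. The per-column minimum and maximum are kept in `min_col_value_x` and `max_col_value_x`.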

In [7]:
x_cols = len(x_mlcup_tr[0])

# per-column statistics of the training inputs
max_col_value_x = [None]*x_cols
min_col_value_x = [None]*x_cols

for i in range(x_cols):
    col = x_mlcup_tr[:, i]
    max_vl = np.amax(col)
    min_vl = np.amin(col)

    # min-max scaling of column i to the range [0, 1]
    x_mlcup_tr[:, i] = (x_mlcup_tr[:, i] - min_vl) / (max_vl - min_vl)

    max_col_value_x[i] = max_vl
    min_col_value_x[i] = min_vl
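
The column-by-column loop can be condensed with NumPy broadcasting. The following is a minimal sketch under the assumption that the stored minima and maxima are meant to be reused on data other than the training set (for example a blind test set). The helper names `fit_min_max` and `apply_min_max` and the toy arrays are hypothetical and do not appear in the notebook.

```python
import numpy as np

def fit_min_max(x):
    """Return the per-column minimum and maximum of a 2-D array."""
    x = np.asarray(x, dtype=float)
    return x.min(axis=0), x.max(axis=0)

def apply_min_max(x, col_min, col_max):
    """Scale each column with previously computed statistics."""
    x = np.asarray(x, dtype=float)
    return (x - col_min) / (col_max - col_min)

# toy training inputs (illustrative values only)
x_train = np.array([[1.0, 10.0], [3.0, 20.0], [5.0, 40.0]])
col_min, col_max = fit_min_max(x_train)
x_train_scaled = apply_min_max(x_train, col_min, col_max)

# new data is scaled with the training statistics, not its own:
# first column (2 - 1) / (5 - 1) = 0.25, second (25 - 10) / (40 - 10) = 0.5
x_new = np.array([[2.0, 25.0]])
x_new_scaled = apply_min_max(x_new, col_min, col_max)
print(x_new_scaled)
```

Values that fall outside the training range map outside `[0, 1]`, and a constant column would cause a division by zero, so both cases may need explicit handling.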

### Grid search

In [8]:
n_neighbors = np.arange(1, 50)
algorithm = ['auto', 'ball_tree', 'kd_tree', 'brute']

param_grid = dict(
    n_neighbors = n_neighbors,
    algorithm = algorithm
)

grid = GridSearchCV(
    KNeighborsRegressor(),
    param_grid = param_grid,
    cv = 5,
    scoring = mean_euclidean_distance,
    verbose = 4,
    n_jobs = -1
)

grid.fit(x_mlcup_tr, y_mlcup_tr)

print(
    "The best parameters are %s with a score of %0.5f"
    % (grid.best_params_, grid.best_score_)
)

Fitting 5 folds for each of 196 candidates, totalling 980 fits
[CV 1/5] END ....algorithm=auto, n_neighbors=1;, score=-2.089 total time=   0.0s
[CV 4/5] END ....algorithm=auto, n_neighbors=2;, score=-1.761 total time=   0.0s
[CV 5/5] END ....algorithm=auto, n_neighbors=2;, score=-1.629 total time=   0.0s
[CV 1/5] END ....algorithm=auto, n_neighbors=3;, score=-1.731 total time=   0.0s
[CV 2/5] END ....algorithm=auto, n_neighbors=3;, score=-1.626 total time=   0.0s
[CV 4/5] END ....algorithm=auto, n_neighbors=3;, score=-1.720 total time=   0.0s
[CV 1/5] END ....algorithm=auto, n_neighbors=4;, score=-1.645 total time=   0.0s
[CV 3/5] END ....algorithm=auto, n_neighbors=4;, score=-1.568 total time=   0.0s
[CV 1/5] END ....algorithm=auto, n_neighbors=5;, score=-1.631 total time=   0.0s
[CV 4/5] END ....algorithm=auto, n_neighbors=5;, score=-1.610 total time=   0.0s
[CV 1/5] END ....algorithm=auto, n_neighbors=6;, score=-1.606 total time=   0.0s
[CV 4/5] END ....algorithm=auto, n_neighbors=6

[CV 3/5] END ....algorithm=auto, n_neighbors=2;, score=-1.684 total time=   0.0s
[CV 5/5] END ...algorithm=auto, n_neighbors=17;, score=-1.391 total time=   0.0s
[CV 1/5] END ...algorithm=auto, n_neighbors=18;, score=-1.518 total time=   0.0s
[CV 2/5] END ...algorithm=auto, n_neighbors=18;, score=-1.475 total time=   0.0s
[CV 3/5] END ...algorithm=auto, n_neighbors=18;, score=-1.439 total time=   0.0s
[CV 3/5] END ...algorithm=auto, n_neighbors=23;, score=-1.433 total time=   0.0s
[CV 4/5] END ...algorithm=auto, n_neighbors=23;, score=-1.459 total time=   0.0s
[CV 5/5] END ...algorithm=auto, n_neighbors=23;, score=-1.354 total time=   0.0s
[CV 1/5] END ...algorithm=auto, n_neighbors=24;, score=-1.560 total time=   0.0s
[CV 3/5] END ...algorithm=auto, n_neighbors=31;, score=-1.440 total time=   0.0s
[CV 4/5] END ...algorithm=auto, n_neighbors=31;, score=-1.488 total time=   0.0s
[CV 5/5] END ...algorithm=auto, n_neighbors=31;, score=-1.398 total time=   0.0s
[CV 1/5] END ...algorithm=au
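
The first line of the log can be checked by hand: 49 values of `n_neighbors` times 4 values of `algorithm` give 196 candidates, and 5 folds for each give 980 fits. The short sketch below reproduces that count with `itertools.product`; it only enumerates the grid and does not train anything. All four `algorithm` options perform an exact neighbor search, so the choice is expected to affect mainly the speed rather than the predictions.

```python
from itertools import product

n_neighbors = range(1, 50)  # 1..49, the same values as np.arange(1, 50)
algorithm = ['auto', 'ball_tree', 'kd_tree', 'brute']
cv_folds = 5

# every (algorithm, n_neighbors) pair is one candidate of the grid
candidates = list(product(algorithm, n_neighbors))
total_fits = len(candidates) * cv_folds

print(len(candidates), total_fits)  # 196 980
```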

In [9]:
knr = grid.best_estimator_

In [10]:
pred_label_knr_tr = knr.predict(x_mlcup_tr)

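After training, the model is evaluated on the training set. Only the inputs were normalized, so the predictions are already on the original target scale and no denormalization is needed.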

In [11]:
# Mean euclidean distance
points = y_mlcup_tr.shape[0]
tot_sum = 0
for i in range (points):
    tot_sum += math.sqrt(math.pow((y_mlcup_tr[i][0] - pred_label_knr_tr[i][0]), 2)
                         + math.pow((y_mlcup_tr[i][1] - pred_label_knr_tr[i][1]), 2))
    
print('MEE on the training set:', tot_sum / points)

MEE on the training set: 1.3685651958637293


### Saving the model

In [12]:
joblib.dump(knr, './results/ml_cup/KNR/knr.z')

['./results/ml_cup/KNR/knr.z']
