# ML CUP 2022

## Kernel Ridge Regression

This notebook builds two Kernel Ridge Regression (KRR) models for the ML cup 2022 regression task. It searches for the best combination of hyperparameters by running a grid search over a given range of values. One model is produced for each target, and both models are tuned over the same hyperparameters.

Hyperparameters considered for the grid search:

1. kernel
2. alpha
3. gamma (only for rbf and poly kernels)
4. degree (only for poly kernel)
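
As a rough illustration of why `gamma` and `degree` only apply to some kernels, the sketch below evaluates the three kernel functions from `sklearn.metrics.pairwise` on two toy points and compares them with their closed forms. The toy vectors and the hyperparameter values are assumptions chosen for readability; they are not taken from the notebook.

```python
import numpy as np
from sklearn.metrics.pairwise import linear_kernel, polynomial_kernel, rbf_kernel

# Two toy points (illustrative values, not taken from the ML-CUP data)
x = np.array([[1.0, 2.0]])
z = np.array([[0.5, -1.0]])

gamma, degree, coef0 = 0.5, 3, 1.0  # hypothetical hyperparameter values

# linear kernel: <x, z>; neither gamma nor degree is used
k_linear = linear_kernel(x, z)[0, 0]

# rbf kernel: exp(-gamma * ||x - z||^2); only gamma is used
k_rbf = rbf_kernel(x, z, gamma=gamma)[0, 0]

# poly kernel: (gamma * <x, z> + coef0) ** degree; gamma and degree are used
k_poly = polynomial_kernel(x, z, degree=degree, gamma=gamma, coef0=coef0)[0, 0]

# Closed-form ingredients for comparison
dot = np.dot(x[0], z[0])
sq_dist = np.sum((x[0] - z[0]) ** 2)
```

`alpha` does not appear here because it is the regularization strength of the ridge problem, not a parameter of the kernel function itself.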

Model selection is performed using 4-fold cross-validation.\
The model assessment phase is not included in this notebook.

### Importing libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from numpy import linalg as LA

from sklearn.metrics import make_scorer
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV

import joblib

import math
import random

In [2]:
# choosing a seed for reproducibility
seed = 1
random.seed(seed)
np.random.seed(seed)

### Definition of the Mean Euclidean Distance

In [3]:
def my_mean_euclidean_distance(y_true, y_pred):
    points = len(y_true)
    tot_sum = 0
    for i in range (points):
        tot_sum += LA.norm(y_true[i] - y_pred[i])
    
    return tot_sum / points

In [4]:
mean_euclidean_distance = make_scorer(my_mean_euclidean_distance, greater_is_better=False)
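
Each grid search in this notebook fits a single target column, so `y_true[i] - y_pred[i]` is a scalar and the Mean Euclidean Distance reduces to the mean absolute error. The sketch below shows this with a vectorized variant; the name `mee_vectorized` and the toy arrays are hypothetical and only serve the illustration.

```python
import numpy as np

def mee_vectorized(y_true, y_pred):
    """Mean Euclidean distance for 1-D or 2-D targets (illustrative helper)."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    if diff.ndim == 1:
        # Treat each scalar target as a one-dimensional point
        diff = diff[:, np.newaxis]
    return np.linalg.norm(diff, axis=1).mean()

# Single-target case: identical to the mean absolute error
y_true_1d = np.array([1.0, 2.0, 3.0])
y_pred_1d = np.array([1.5, 1.0, 3.0])
mee_1d = mee_vectorized(y_true_1d, y_pred_1d)
mae_1d = np.mean(np.abs(y_true_1d - y_pred_1d))

# Two-target case: row-wise Euclidean distance, then the mean
y_true_2d = np.array([[0.0, 0.0], [1.0, 1.0]])
y_pred_2d = np.array([[3.0, 4.0], [1.0, 1.0]])
mee_2d = mee_vectorized(y_true_2d, y_pred_2d)
```

Because the scorer is built with `greater_is_better=False`, `GridSearchCV` reports the negated value of this quantity.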

### Loading data

In [5]:
colnames = ['id', 'a1', 'a2', 'a3', 'a4', 'a5', 'a6', 'a7', 'a8', 'a9', 'target1', 'target2']
mlcup_tr = pd.read_csv("./dataset/ml_cup22/ML-CUP22-TR.csv", sep = ",", names=colnames)
mlcup_tr = mlcup_tr.iloc[1:, :]
mlcup_tr = mlcup_tr.drop('id', axis=1)

In [6]:
x_mlcup_tr = mlcup_tr.iloc[:, 0:9].values
y_mlcup_tr = mlcup_tr.iloc[:, 9:11].values

The cell below applies a min-max normalization to the input features of the training set, column by column, and stores each column's minimum and maximum. The target columns are not rescaled by this code.

In [7]:
x_cols = len(x_mlcup_tr[0])

max_col_value_x = [None]*x_cols
max_vl = None

min_col_value_x = [None]*x_cols
min_vl = None

for i in range(x_cols):
    col = x_mlcup_tr[:, i]
    max_vl = np.amax(col)
    min_vl = np.amin(col)
    
    x_mlcup_tr[:, i] = (x_mlcup_tr[:, i] - min_vl) / (max_vl - min_vl)
    
    max_col_value_x[i] = max_vl
    min_col_value_x[i] = min_vl
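
The stored per-column minima and maxima are what allow unseen inputs to be scaled consistently with the training set. A minimal vectorized sketch of that idea follows; `x_train` and `x_new` are toy arrays assumed for the example, and a column with identical minimum and maximum would cause a division by zero here just as in the loop above.

```python
import numpy as np

# Toy data standing in for the training inputs and for an unseen sample
x_train = np.array([[1.0, 10.0], [2.0, 20.0], [4.0, 40.0]])
x_new = np.array([[3.0, 15.0]])

# Per-column statistics computed on the training set only
col_min = x_train.min(axis=0)
col_max = x_train.max(axis=0)

# Min-max scaling via broadcasting (same formula as the loop above)
x_train_scaled = (x_train - col_min) / (col_max - col_min)

# New data reuses the training statistics, so it may fall outside [0, 1]
x_new_scaled = (x_new - col_min) / (col_max - col_min)
```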

In [8]:
y1_mlcup_tr = y_mlcup_tr[:, 0]
y2_mlcup_tr = y_mlcup_tr[:, 1]

### Grid search for target 1
#### rbf kernel

In [9]:
alpha_range = np.logspace(-9, 0, 30, base = 2)
gamma_range = np.logspace(-9, 3, 10, base = 2)

param_grid = [
    {'alpha': alpha_range, 'gamma': gamma_range},
    {'alpha': alpha_range}
]

kr = GridSearchCV(
    KernelRidge(kernel="rbf"),
    param_grid = param_grid,
    cv = 4,
    scoring = mean_euclidean_distance,
    n_jobs = -1
)

kr.fit(x_mlcup_tr, y1_mlcup_tr)

print(
    "The best parameters are %s with a score of %0.5f"
    % (kr.best_params_, kr.best_score_)
)

The best parameters are {'alpha': 0.03968747494481696, 'gamma': 1.259921049894872} with a score of -0.80567
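
The reported score is negative because the scorer negates the error so that larger is better; the cross-validated MEE is its absolute value. The sketch below reproduces the pattern on synthetic data with a much smaller grid. The data, the grid sizes, and the use of scikit-learn's built-in `"neg_mean_absolute_error"` scorer (a stand-in that matches the MEE for a single target) are assumptions for the example.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV

# Synthetic single-target regression problem (not the ML-CUP data)
rng = np.random.default_rng(0)
X = rng.uniform(size=(60, 3))
y = np.sin(X[:, 0] * 3.0) + 0.1 * rng.normal(size=60)

# base=2 logspace: exponents are evenly spaced, values are powers of two
alpha_range = np.logspace(-9, 0, 4, base=2)  # 2**-9, 2**-6, 2**-3, 2**0
gamma_range = np.logspace(-9, 3, 4, base=2)  # 2**-9, 2**-5, 2**-1, 2**3

search = GridSearchCV(
    KernelRidge(kernel="rbf"),
    param_grid={"alpha": alpha_range, "gamma": gamma_range},
    cv=4,
    scoring="neg_mean_absolute_error",
)
search.fit(X, y)

# Flip the sign to read the score as an error
best_mee = -search.best_score_
```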


In [10]:
rbf_krr_1 = kr.best_estimator_

#### linear kernel

In [11]:
alpha_range = np.logspace(-9, 0, 30, base = 2)

param_grid = dict(
    alpha = alpha_range
)

kr = GridSearchCV(
    KernelRidge(kernel="linear"),
    param_grid = param_grid,
    cv = 4,
    scoring = mean_euclidean_distance,
    n_jobs = -1
)

kr.fit(x_mlcup_tr, y1_mlcup_tr)

print(
    "The best parameters are %s with a score of %0.5f"
    % (kr.best_params_, kr.best_score_)
)

The best parameters are {'alpha': 0.8064489817576826} with a score of -1.55315


In [12]:
linear_krr_1 = kr.best_estimator_

#### polynomial kernel

In [13]:
degree_range = np.arange(2, 8, 1)
alpha_range = np.logspace(-9, 0, 10, base = 2)
gamma_range = np.logspace(-9, 3, 10, base = 2)

param_grid = [
    {'alpha': alpha_range, 'gamma': gamma_range, 'degree': degree_range},
    {'alpha': alpha_range, 'degree': degree_range}
]

kr = GridSearchCV(
    KernelRidge(kernel = 'poly'),
    param_grid = param_grid,
    cv = 4,
    scoring = mean_euclidean_distance,
    n_jobs = -1
)

kr.fit(x_mlcup_tr, y1_mlcup_tr)

print(
    "The best parameters are %s with a score of %0.5f"
    % (kr.best_params_, kr.best_score_)
)

The best parameters are {'alpha': 0.03125, 'degree': 7, 'gamma': 0.19842513149602486} with a score of -0.84055


In [14]:
poly_krr_1 = kr.best_estimator_
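
To get a feeling for the cost of the polynomial search, the number of candidate combinations can be counted with `ParameterGrid` before fitting. The sketch below rebuilds the same ranges as the cell above; the second dictionary omits `gamma`, which leaves it at the estimator's default.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

degree_range = np.arange(2, 8, 1)             # degrees 2..7
alpha_range = np.logspace(-9, 0, 10, base=2)
gamma_range = np.logspace(-9, 3, 10, base=2)

param_grid = [
    {"alpha": alpha_range, "gamma": gamma_range, "degree": degree_range},
    {"alpha": alpha_range, "degree": degree_range},
]

# Each dictionary is expanded separately and the results are concatenated
n_candidates = len(ParameterGrid(param_grid))

# With cv=4 every candidate is fitted four times (final refit not counted)
n_fits = n_candidates * 4
```

With these ranges the search evaluates 660 candidates, that is 2640 fits in total.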

### Grid search for target 2
#### rbf kernel

In [15]:
alpha_range = np.logspace(-9, 0, 30, base = 2)
gamma_range = np.logspace(-9, 3, 10, base = 2)

param_grid = [
    {'alpha': alpha_range, 'gamma': gamma_range},
    {'alpha': alpha_range}
]

kr = GridSearchCV(
    KernelRidge(kernel="rbf"),
    param_grid = param_grid,
    cv = 4,
    scoring = mean_euclidean_distance,
    n_jobs = -1
)

kr.fit(x_mlcup_tr, y2_mlcup_tr)

print(
    "The best parameters are %s with a score of %0.5f"
    % (kr.best_params_, kr.best_score_)
)

The best parameters are {'alpha': 0.004617665262461984, 'gamma': 0.5} with a score of -1.09913


In [16]:
rbf_krr_2 = kr.best_estimator_

#### linear kernel

In [17]:
alpha_range = np.logspace(-9, 0, 30, base = 2)

param_grid = dict(
    alpha = alpha_range
)

kr = GridSearchCV(
    KernelRidge(kernel="linear"),
    param_grid = param_grid,
    cv = 4,
    scoring = mean_euclidean_distance,
    n_jobs = -1
)

kr.fit(x_mlcup_tr, y2_mlcup_tr)

print(
    "The best parameters are %s with a score of %0.5f"
    % (kr.best_params_, kr.best_score_)
)

The best parameters are {'alpha': 1.0} with a score of -1.69602


In [18]:
linear_krr_2 = kr.best_estimator_

#### polynomial kernel

In [19]:
degree_range = np.arange(2, 8, 1)
alpha_range = np.logspace(-9, 0, 10, base = 2)
gamma_range = np.logspace(-9, 3, 10, base = 2)

param_grid = [
    {'alpha': alpha_range, 'gamma': gamma_range, 'degree': degree_range},
    {'alpha': alpha_range, 'degree': degree_range}
]

kr = GridSearchCV(
    KernelRidge(kernel = 'poly'),
    param_grid = param_grid,
    cv = 4,
    scoring = mean_euclidean_distance,
    n_jobs = -1
)

kr.fit(x_mlcup_tr, y2_mlcup_tr)

print(
    "The best parameters are %s with a score of %0.5f"
    % (kr.best_params_, kr.best_score_)
)

The best parameters are {'alpha': 0.0625, 'degree': 7, 'gamma': 0.19842513149602486} with a score of -1.12928


In [20]:
poly_krr_2 = kr.best_estimator_

## Model selection
### Target 1 and 2

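The scores reported above are negated MEE values, so the score closest to zero is the best. On the first target the rbf kernel performs best (0.80567). On the second target the rbf kernel (1.09913) is marginally ahead of the poly kernel (1.12928); the notebook nevertheless keeps the Kernel Ridge Regression model with the rbf kernel for the first target and the one with the poly kernel for the second target.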
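
To make the comparison across kernels explicit, the cross-validation scores printed by the six grid searches above can be collected and ranked. The numbers in the sketch are copied from those printed outputs; the dictionary layout and variable names are only illustrative.

```python
# Cross-validation scores as printed above (negated MEE, higher is better)
cv_scores = {
    "target1": {"rbf": -0.80567, "linear": -1.55315, "poly": -0.84055},
    "target2": {"rbf": -1.09913, "linear": -1.69602, "poly": -1.12928},
}

# Pick, for each target, the kernel whose score is closest to zero
best_kernel = {
    target: max(scores, key=scores.get)
    for target, scores in cv_scores.items()
}

# Convert the winning scores back to positive MEE values
best_mee = {
    target: -cv_scores[target][kernel]
    for target, kernel in best_kernel.items()
}
```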

In [21]:
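# save the selected model for each target
joblib.dump(rbf_krr_1, './results/ml_cup/KRR/rbf_krr_1.z')
# note: the object saved here is the poly model, although the file name says "rbf"
joblib.dump(poly_krr_2, './results/ml_cup/KRR/rbf_krr_2.z')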

['./results/ml_cup/KRR/rbf_krr_2.z']
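
The list shown above is the return value of `joblib.dump`, which returns the names of the files it wrote. The sketch below runs the same save and load round trip in a temporary directory; the toy model, its hyperparameters, and the file name `model.z` are assumptions for the example.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.kernel_ridge import KernelRidge

# Small model fitted on synthetic data (not the ML-CUP data)
rng = np.random.default_rng(0)
X = rng.uniform(size=(20, 2))
y = X[:, 0] - X[:, 1]
model = KernelRidge(kernel="rbf", alpha=0.1, gamma=1.0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.z")
    written = joblib.dump(model, path)  # returns a list of written file names
    restored = joblib.load(path)

# The restored estimator should reproduce the original predictions
same_predictions = np.allclose(model.predict(X), restored.predict(X))
```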

### MEE on both targets on the training set

In [22]:
krr1 = joblib.load('./results/ml_cup/KRR/rbf_krr_1.z')
krr2 = joblib.load('./results/ml_cup/KRR/rbf_krr_2.z')

In [23]:
pred_label_krr_1 = krr1.predict(x_mlcup_tr)
pred_label_krr_2 = krr2.predict(x_mlcup_tr)
pred_label_krr = np.vstack((pred_label_krr_1, pred_label_krr_2)).T

In [24]:
# Mean euclidean distance
points = y_mlcup_tr.shape[0]
tot_sum = 0
for i in range (points):
    tot_sum += math.sqrt(math.pow((y_mlcup_tr[i][0] - pred_label_krr[i][0]), 2)
                         + math.pow((y_mlcup_tr[i][1] - pred_label_krr[i][1]), 2))
    
print('MEE on the training set:', tot_sum / points)

MEE on the training set: 1.3637822027846693
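
The stacking and the loop above can also be written without explicit iteration. The following sketch uses `np.column_stack` and `np.hypot` on toy arrays; the values are assumptions chosen so the result is easy to check by hand.

```python
import numpy as np

# Toy targets and per-target predictions (illustrative values)
y_true = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
pred_1 = np.array([3.0, 1.0, 2.0])
pred_2 = np.array([4.0, 1.0, 5.0])

# Same result as np.vstack((pred_1, pred_2)).T
pred = np.column_stack((pred_1, pred_2))

# Row-wise Euclidean distance between target and prediction
dists = np.hypot(y_true[:, 0] - pred[:, 0], y_true[:, 1] - pred[:, 1])

mee = dists.mean()
```

Here the three distances are 5, 0 and 3, so the mean is 8/3.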
