# Supervised Learning
&rarr; find correlations between variables; predicting/estimating output variables based on one or more input variables

- Given an input ($x$, samples of one or more variables) and an output response ($y$, samples of one or more other variables of the dataset), we have to find the relationship ($f$) between them: $y = f(x) + e$
    - where $e$ is the **error term** (irreducible noise), a random error that is independent of the input and has mean zero. It should not be confused with the bias of a model.
- The estimated function ($\hat{f}$), applied to the same input ($x$), generates the predicted output ($\hat{y}$, the estimated samples): $\hat{y}=\hat{f}(x)$
- The accuracy depends on how close $\hat{y}$ is to $y$ (how close the predicted samples are to the real ones)
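
The setup above can be sketched on synthetic data. This is only an illustration: the "true" function `f`, the noise scale, and the sample size are assumptions chosen for the example, not values from the dataset used later.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)


# Hypothetical "true" relationship, chosen only for illustration
def f(x):
    return 2.0 * x + 1.0


x = rng.uniform(0, 10, size=200)
e = rng.normal(loc=0.0, scale=1.0, size=200)  # zero-mean error, independent of x
y = f(x) + e

# Estimate f from the samples: f_hat is the fitted model
f_hat = LinearRegression().fit(x.reshape(-1, 1), y)
y_hat = f_hat.predict(x.reshape(-1, 1))

print("estimated slope:", f_hat.coef_[0])
print("estimated intercept:", f_hat.intercept_)
print("mean residual (y - y_hat):", np.mean(y - y_hat))
```

The estimated slope and intercept should land near the assumed values 2 and 1, but they will not match exactly because of the error term.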

&rarr; **Regression**: predicting a continuous (quantitative) value <br>
&rarr; **Classification**: predicting a discrete (qualitative) value, i.e. the class a sample belongs to

### Metrics:
- https://scikit-learn.org/stable/modules/model_evaluation.html
- Classification model's Accuracy:
    - The fraction of correct predictions, i.e. 1 minus the mean error rate over all the predictions; the error of a single prediction is 1 if it is wrong, 0 otherwise
- Regression model's Accuracy:
    - **Residual Sum of Squares (RSS)**: the sum of the squared residuals, where a residual is the difference between the real value and the predicted one. Its drawback is that it grows with the number of samples. $$ RSS = \sum_{i=1}^{n} (y_i - \hat{y_i})^2 $$
    - **Mean Squared Error (MSE)**: the mean of all the distances (squared) between a predicted sample and the real one $$ MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y_i})^2 $$
    - **Mean Absolute Error (MAE)**: $$ MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y_i}| $$
    - **Residual Standard Error (RSE)**: absolute measure of the model's lack of fit. It is the RSS normalized by the residual degrees of freedom, where $n$ is the number of samples and $p$ the number of predictors: $$ RSE = \sqrt{\frac{1}{n - p - 1} \times RSS} $$
    - **Explained Sum of Squares (ESS)**: amount of variation explained by the model. How much the predicted data differ from the mean: $$ ESS = \sum_{i=1}^n (\hat{y_i} - \overline{y})^2 $$
    - **Total Sum of Squares (TSS)**: total variation in the response data. How much the real data differ from the mean. The decomposition into ESS and RSS holds for a least-squares linear fit with an intercept: $$ TSS = ESS + RSS = \sum_{i=1}^n (y_i - \overline{y})^2 $$
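    - **$R^2$**: the proportion of the variation in the response that is explained by the model
        $$ R^2 = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS} $$
        - close to 1: the model fits the data well
        - close to 0: the model explains little of the variation, either because the linear model is wrong or because the variance of the error term $e$ is high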
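
As a rough check of the definitions above, the metrics can be computed by hand with NumPy and compared against `sklearn.metrics`. The data below are synthetic and the coefficients are assumptions made for the example; RSE uses the $n - p - 1$ degrees of freedom of a linear model with an intercept.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import (
    accuracy_score,
    mean_absolute_error,
    mean_squared_error,
    r2_score,
)

rng = np.random.default_rng(0)
n, p = 100, 2  # samples, predictors (illustrative values)
X = rng.normal(size=(n, p))
y = 3.0 + X @ np.array([1.5, -2.0]) + rng.normal(scale=0.5, size=n)

# Least-squares fit with intercept, evaluated on the same data
y_hat = LinearRegression().fit(X, y).predict(X)

rss = np.sum((y - y_hat) ** 2)
mse = rss / n
mae = np.mean(np.abs(y - y_hat))
rse = np.sqrt(rss / (n - p - 1))
ess = np.sum((y_hat - y.mean()) ** 2)
tss = np.sum((y - y.mean()) ** 2)
r2 = 1 - rss / tss

print("MSE matches sklearn:", np.isclose(mse, mean_squared_error(y, y_hat)))
print("MAE matches sklearn:", np.isclose(mae, mean_absolute_error(y, y_hat)))
print("R^2 matches sklearn:", np.isclose(r2, r2_score(y, y_hat)))
print("TSS == ESS + RSS:", np.isclose(tss, ess + rss))
print("RSE:", rse)

# Classification: accuracy is 1 minus the mean 0/1 error
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])
error_rate = np.mean(y_true != y_pred)
accuracy = 1 - error_rate
print("accuracy:", accuracy, "sklearn:", accuracy_score(y_true, y_pred))
```

Note that the identity $TSS = ESS + RSS$ is checked here on the training data of a least-squares fit; it is not expected to hold for arbitrary models or on held-out data.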


---
## Over/Underfitting

- The estimation of $f$ can be made by a parametric approach: it is generally much easier to estimate a set of parameters, than it is to fit an entirely arbitrary function.
    -  **Overfitting**: the model we choose will usually not match the true unknown form of $f$. If the model is too flexible, it follows the noise in the training data too closely: the training error will be very low while the test error can be high. The solution is to change/remove parameters or otherwise simplify the model.
    - **Underfitting**: the model is too simple (not flexible enough) to capture the structure of $f$, so both the training error and the test error are high.
- It is important to decide, for any given set of data, which method produces the best results
    - **Linear** models are simple but often too inaccurate
    - **Highly non-linear** models can potentially provide more accurate predictions, but they are far more complex
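
A minimal sketch of the trade-off, assuming a synthetic sine-shaped $f$ and polynomial models of increasing degree (both are choices made for this example). A low degree is expected to underfit, while a very high degree can drive the training error down without improving the test error.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=60)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=60)  # assumed f plus noise
X = x.reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

errors = {}
for degree in (1, 4, 15):  # too rigid, moderate, very flexible
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    errors[degree] = (
        mean_squared_error(y_tr, model.predict(X_tr)),
        mean_squared_error(y_te, model.predict(X_te)),
    )
    print(
        "degree = {:2d}  train MSE = {:.4f}  test MSE = {:.4f}".format(
            degree, *errors[degree]
        )
    )
```

Comparing the two columns is the point: a gap between a small training error and a larger test error is the usual symptom of overfitting.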

---

In [3]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

plt.style.use("seaborn-v0_8")
random_state = 42

df = pd.read_csv("datasets/nba.csv")
# Original dataset (except the target variable we want to predict)
df_X = df.dropna().loc[:, df.columns != "AST"].select_dtypes(include="number")
# Original variable we want to predict
df_y = df.dropna()["AST"]

## Train/Validation/Test split

- The Training Set is the set of samples used for training our model
- The Validation Set is the set of samples used to tune the hyperparameters of our model. ***Tuning the hyperparameters on the test set instead means overestimating the performance***
- The Test Set is the set of samples on which to evaluate the final performance of the system

In [8]:
#########################################################################
#########################################################################
# CLASSIC SPLIT
#########################################################################

# Split X and y into train and test, with test size of 33% of the total
X_train, X_test, y_train, y_test = train_test_split(
    df_X, df_y, test_size=0.33, random_state=random_state
)
# Split X_train and y_train into train and validation, with validation size of 25% of the
# train total (0.25 * 0.67 = 16.75% of the whole dataset)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=random_state
)


#########################################################################
#########################################################################
# PERSONALIZED train_validation_test_split()
#########################################################################


def train_validation_test_split(
    *arrays, train_size=None, validation_size=None, test_size=None, random_state=None
):
    """
    Splits a dataset into train, validation, and test sets based on the provided proportions.

    # Parameters:
        *arrays : sequence of indexables with same length / shape[0]
            Allowed inputs are lists, numpy arrays, scipy-sparse
            matrices or pandas dataframes. Exactly two arrays (X and y)
            are expected, since six splits are returned.
        train_size (float):
            Proportion of the dataset for the training set (0.0 to 1.0).
        validation_size (float):
            Proportion of the dataset for the validation set (0.0 to 1.0).
        test_size (float):
            Proportion of the dataset for the test set (0.0 to 1.0).
        random_state (int or None):
            Seed passed to both train_test_split() calls for reproducibility.

    # Returns:
        X_train, X_val, X_test, y_train, y_val, y_test: Split datasets.
    """
    if train_size and validation_size:
        test_size = 1.0 - train_size - validation_size
    elif test_size and validation_size:
        train_size = 1.0 - test_size - validation_size
    else:
        raise ValueError(
            "Provide either train and validation sizes or test and validation sizes."
        )

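    # A size of exactly 0 would be rejected by train_test_split(), so require > 0
    if test_size <= 0 or validation_size <= 0 or train_size <= 0:
        raise ValueError("Each size must be > 0 and the given sizes must sum to < 1.0")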

    X_train, X_test, y_train, y_test = train_test_split(
        *arrays, test_size=test_size, random_state=random_state
    )
    X_train, X_val, y_train, y_val = train_test_split(
        X_train,
        y_train,
        test_size=validation_size
        * (X_train.shape[0] + X_test.shape[0])
        / X_train.shape[0],
        random_state=random_state,
    )

    return X_train, X_val, X_test, y_train, y_val, y_test


# Pass random_state so that the split is reproducible
X_train, X_val, X_test, y_train, y_val, y_test = train_validation_test_split(
    df_X, df_y, test_size=0.33, validation_size=0.1675, random_state=random_state
)

print("X:\t", df_X.shape)
print("X_train:", X_train.shape, ", %", X_train.shape[0] * 100 / df_X.shape[0])
print("X_val:\t", X_val.shape, ", %", X_val.shape[0] * 100 / df_X.shape[0])
print("X_test:\t", X_test.shape, ", %", X_test.shape[0] * 100 / df_X.shape[0])
print("y:\t", df_y.shape)
print("y_train:", y_train.shape)
print("y_val:\t", y_val.shape)
print("y_test:\t", y_test.shape)

X:	 (7836, 22)
X_train: (3937, 22) , % 50.242470648289945
X_val:	 (1313, 22) , % 16.755997958141908
X_test:	 (2586, 22) , % 33.00153139356815
y:	 (7836,)
y_train: (3937,)
y_val:	 (1313,)
y_test:	 (2586,)
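
The second call to `train_test_split()` only sees the samples left after the test split, so the validation fraction has to be rescaled. A small sketch of that arithmetic on a synthetic array of 1000 rows (the array and the 0.2/0.1 proportions are assumptions for the example), using `validation_size / (1 - test_size)`, which matches the ratio used in the function above up to rounding:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data, used only to count rows
X = np.arange(1000).reshape(-1, 1)
y = np.arange(1000)

test_size, validation_size = 0.2, 0.1  # fractions of the WHOLE dataset

# First split: carve out the test set
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=test_size, random_state=0
)

# Second split: the validation fraction is relative to what is left
relative_val = validation_size / (1.0 - test_size)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=relative_val, random_state=0
)

for name, part in [("train", X_train), ("val", X_val), ("test", X_test)]:
    print(name, part.shape[0], "rows,", 100 * part.shape[0] / X.shape[0], "%")
```

The printed percentages should come out at roughly 70/10/20, which is what the requested proportions imply.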


---
---
## Cross Validation with *Train-Validation-Test Split*

In [6]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

random_state = 42

wine = datasets.load_wine()
X = wine["data"]
y = wine["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=random_state
)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=random_state
)


clf = dict()
best_acc = 0
best_acc_ind = -1

for i in range(1, 22, 5):
    clf[i] = KNeighborsClassifier(n_neighbors=i)

    # Fit on TRAIN SET
    clf[i].fit(X_train, y_train)

    # Score on VALIDATION SET
    acc = clf[i].score(X_val, y_val)

    if best_acc < acc:
        best_acc = acc
        best_acc_ind = i
    print("Accuracy on VALIDATION SET: {:.5f}".format(acc), "K =", i)

print("\nBest validation model, K =", best_acc_ind)

# Test on TEST SET
acc = clf[best_acc_ind].score(X_test, y_test)
print("Accuracy on TEST SET: {:.5f}".format(acc))

Accuracy on VALIDATION SET: 0.75000 K = 1
Accuracy on VALIDATION SET: 0.75000 K = 6
Accuracy on VALIDATION SET: 0.72222 K = 11
Accuracy on VALIDATION SET: 0.69444 K = 16
Accuracy on VALIDATION SET: 0.72222 K = 21

Best validation model, K = 1
Accuracy on TEST SET: 0.72222


---
## Cross Validation with k-Fold
- Often there is no designated test set on which to estimate the test error and assess the model; cross-validation estimates it by repeatedly holding out part of the available samples.
- **Leave-One-Out Cross-Validation**. To estimate the test (or validation) error, leave out one sample, fit the model on the remaining samples, and finally test the model on the left-out sample. Then repeat for every sample, and calculate the final error as the mean of all errors. It is very expensive because the model has to be fitted N times, once per sample.
- **k-Fold Cross Validation**. Randomly divide the set of samples into k groups, or folds, of approximately equal size. The first fold is held out, the model is fitted on the remaining k-1 folds, and finally the model is tested on the held-out fold. Repeat for every fold and average the k errors. Typically 5 or 10 folds.
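
Both schemes can be run through `cross_val_score()` by changing only the `cv` argument. The sketch below uses the wine dataset and a 5-nearest-neighbours classifier as arbitrary choices for illustration.

```python
from sklearn import datasets
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = datasets.load_wine(return_X_y=True)
clf = KNeighborsClassifier(n_neighbors=5)

# Leave-One-Out: one fit per sample, each score is 0 or 1
loo_scores = cross_val_score(clf, X, y, cv=LeaveOneOut())

# k-Fold with k = 5: five fits, each scored on a held-out fold
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
kfold_scores = cross_val_score(clf, X, y, cv=kfold)

print("LOOCV:  fits =", len(loo_scores), " mean accuracy =", loo_scores.mean())
print("5-Fold: fits =", len(kfold_scores), " mean accuracy =", kfold_scores.mean())
```

The number of fits makes the cost difference concrete: Leave-One-Out fits the model once per sample, k-Fold only k times.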

### With Preprocessing and *Data Leakage*:

In [None]:
import numpy as np
from sklearn import datasets
from sklearn import preprocessing
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.svm import SVC

random_state = 42
dataset = datasets.load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    dataset["data"], dataset["target"], test_size=0.4, random_state=random_state
)

# PREPROCESSING
# WARNING: fitting a scaler (Min-Max, StandardScaler, or in general any
# normalization computed column-wise) on the whole training set and then
# calling cross_val_score() introduces *DATA LEAKAGE*.
# cross_val_score() splits the training set into "cv" folds and, at each
# iteration, trains on "cv-1" folds and validates on the fold left out.
# Here the left-out fold has already been transformed by a scaler fitted on
# all "cv" folds, not only on the "cv-1" training folds. The validation fold
# is therefore preprocessed using its own data, *and the cross-validation
# score tends to overestimate the performance*.
scaler = preprocessing.MinMaxScaler()
scaler.fit(X_train)
X_train_p = scaler.transform(X_train)
X_test_p = scaler.transform(X_test)

# Parameters to keep the best results
best_acc = 0
bestK = "linear"

# KFold
cv = KFold(n_splits=5, shuffle=True, random_state=random_state)

# Iterate over a hyperparameter
for kernel in ["linear", "poly", "rbf"]:
    clf = SVC(kernel=kernel, random_state=random_state)
    # Cross Validate
    scores = cross_val_score(clf, X_train_p, y_train, cv=cv, n_jobs=-1)

    # save best result so far
    acc = np.mean(scores)
    print(
        "Cross validation score: {:.3f}".format(acc),
        "kernel =",
        kernel,
    )
    if best_acc < acc:
        best_acc = acc
        bestK = kernel


print("Best Kernel:", bestK)
clf = SVC(kernel=bestK, random_state=random_state)
clf.fit(X_train_p, y_train)
print("Test set accuracy: {:.5f}".format(clf.score(X_test_p, y_test)))
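
For comparison, a leak-free variant of the cell above can be sketched by putting the scaler inside a pipeline, so that `cross_val_score()` refits it on the "cv-1" training folds at every iteration. The use of `make_pipeline()` and the variable names are choices made for this sketch.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

random_state = 42
dataset = datasets.load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    dataset["data"], dataset["target"], test_size=0.4, random_state=random_state
)

cv = KFold(n_splits=5, shuffle=True, random_state=random_state)

cv_means = {}
for kernel in ["linear", "poly", "rbf"]:
    # The scaler is part of the estimator: it is fitted only on the
    # training folds of each split, never on the validation fold
    pipe = make_pipeline(MinMaxScaler(), SVC(kernel=kernel, random_state=random_state))
    scores = cross_val_score(pipe, X_train, y_train, cv=cv)  # raw, unscaled data
    cv_means[kernel] = np.mean(scores)
    print("Cross validation score: {:.3f}".format(cv_means[kernel]), "kernel =", kernel)

best_kernel = max(cv_means, key=cv_means.get)

# Refit scaler + classifier on the whole training set, then score on the test set
final = make_pipeline(MinMaxScaler(), SVC(kernel=best_kernel, random_state=random_state))
final.fit(X_train, y_train)
test_acc = final.score(X_test, y_test)
print("Best Kernel:", best_kernel)
print("Test set accuracy: {:.5f}".format(test_acc))
```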

---
## Cross Validation with GridSearch
- Find the best hyperparameters, trying every combination of them
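
To make "every combination" concrete, `ParameterGrid` can enumerate the candidates that a grid search would try. The grid below is an example with three kernels and five values of `C`; with 5-fold cross-validation each candidate would be fitted five times.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "svc__kernel": ["linear", "poly", "rbf"],
    "svc__C": [0.01, 0.1, 1, 10, 100],
}

combinations = list(ParameterGrid(param_grid))
n_folds = 5

print("candidates:", len(combinations))  # 3 kernels x 5 values of C
print("cross-validation fits:", len(combinations) * n_folds)
for params in combinations[:3]:
    print(params)
```

The cost grows multiplicatively with every hyperparameter added to the grid, which is worth keeping in mind before widening it.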

### With Preprocessing and no *Data Leakage*:

In [1]:
from sklearn import preprocessing


# a function with different normalization and scaling techniques
def preprocess(X_train, X_test, modality):
    X_train_p = X_train
    X_test_p = X_test

    if modality == "l2" or modality == "l1":
        X_train_p = preprocessing.normalize(X_train, norm=modality)
        X_test_p = preprocessing.normalize(X_test, norm=modality)

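    # "std-scaler" is the pipeline step name used in the grid search below;
    # accepting it here avoids silently skipping the scaling
    if modality in ("standard", "std-scaler"):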
        scaler = preprocessing.StandardScaler()
        scaler.fit(X_train)
        X_train_p = scaler.transform(X_train)
        X_test_p = scaler.transform(X_test)

    if modality == "min-max":
        scaler = preprocessing.MinMaxScaler()
        scaler.fit(X_train)
        X_train_p = scaler.transform(X_train)
        X_test_p = scaler.transform(X_test)

    return X_train_p, X_test_p

In [9]:
from sklearn.pipeline import Pipeline
from sklearn import datasets, preprocessing
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

random_state = 42

dataset = datasets.load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    dataset["data"], dataset["target"], test_size=0.5, random_state=random_state
)

# Grid of hyperparameters
param_grid = {
    "svc__kernel": ["linear", "poly", "rbf"],
    "svc__C": [0.01, 0.1, 1, 10, 100],
}

# Parameters to keep the best results
best_score = 0
best_params = {}
bestP = "no"

# Iterate over Preprocess types
prep = [
    Pipeline([("svc", SVC())]),
    Pipeline([("std-scaler", preprocessing.StandardScaler()), ("svc", SVC())]),
    Pipeline([("min-max", preprocessing.MinMaxScaler()), ("svc", SVC())]),
]

for pipe in prep:
    search = GridSearchCV(pipe, param_grid, n_jobs=-1, cv=5)
    search.fit(X_train, y_train)

    print(
        "Steps:",
        pipe.steps[0][0],
        ", best params:",
        search.best_params_,
        ", score:",
        search.best_score_,
    )
    if search.best_score_ > best_score:
        best_score = search.best_score_
        best_params = search.best_params_
        bestP = pipe.steps[0][0]

# test with best configuration
clf = SVC(
    kernel=best_params["svc__kernel"],
    C=best_params["svc__C"],
    random_state=random_state,
)
X_train_p, X_test_p = preprocess(X_train, X_test, bestP)
clf.fit(X_train_p, y_train)
print("\nParameter used for test set:", best_params, ", scaling:", bestP)
print("Test set accuracy: {:.2f}".format(clf.score(X_test_p, y_test)))

Steps: svc , best params: {'svc__C': 0.1, 'svc__kernel': 'linear'} , score: 0.9437908496732026
Steps: std-scaler , best params: {'svc__C': 0.1, 'svc__kernel': 'linear'} , score: 0.9777777777777779
Steps: min-max , best params: {'svc__C': 1, 'svc__kernel': 'linear'} , score: 0.9888888888888889

Parameter used for test set: {'svc__C': 1, 'svc__kernel': 'linear'} , scaling: min-max
Test set accuracy: 0.99
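
An alternative to rebuilding the classifier by hand: with the default `refit=True`, `GridSearchCV` refits the best pipeline (scaler included) on the whole training set and exposes it as `best_estimator_`, so the test score can be taken from the search object directly. The sketch below uses only the Min-Max pipeline for brevity; that restriction is a simplification for this example.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

random_state = 42
dataset = datasets.load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    dataset["data"], dataset["target"], test_size=0.5, random_state=random_state
)

param_grid = {
    "svc__kernel": ["linear", "poly", "rbf"],
    "svc__C": [0.01, 0.1, 1, 10, 100],
}

pipe = Pipeline([("min-max", MinMaxScaler()), ("svc", SVC(random_state=random_state))])
search = GridSearchCV(pipe, param_grid, cv=5)  # refit=True by default
search.fit(X_train, y_train)

# search.score() delegates to best_estimator_, which already contains
# the scaler fitted on the full training set
test_acc = search.score(X_test, y_test)
print("Best params:", search.best_params_)
print("Test set accuracy: {:.2f}".format(test_acc))
```

This keeps the preprocessing and the classifier in one object, so the step-name lookup and the separate `preprocess()` call are not needed.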
