In [1]:
%matplotlib inline

In [3]:
import numpy as np
import sympy as sp

import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.model_selection import train_test_split, GridSearchCV

# 03. Model Training and Improvement
### How to train your model
* Training and testing set;
* Bias-variance tradeoff;
* k-fold cross-validation;
* Graphical methods: train/test curve, ROC, confusion matrix;
* Hyperparameters. Hyperparameter optimization;
* Model selection.

In [3]:
print('Kernel is working ..')

Kernel is working ..


### IEEE754 Examples

In [1]:
0.2 + 0.1 == 0.3

False

In [2]:
a = 0
for i in range(10_000_000):
    a += 0.01

In [3]:
a

99999.99998630969
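
The two cells above show rounding error in a single comparison and in a long running sum. A minimal sketch of the usual workarounds, using only the standard library: `math.isclose` compares with a tolerance, and `math.fsum` keeps track of the low-order bits that a plain `sum` loses. The variable names and the list size of 1,000,000 are chosen only for illustration.

```python
import math

# Direct equality fails: 0.1, 0.2 and 0.3 have no exact binary representation.
print(0.1 + 0.2 == 0.3)  # False

# Compare with a (relative) tolerance instead.
close = math.isclose(0.1 + 0.2, 0.3)
print(close)  # True

# Accumulating many small terms: the naive sum drifts, fsum compensates.
terms = [0.01] * 1_000_000
naive_sum = sum(terms)
accurate_sum = math.fsum(terms)
print(naive_sum, accurate_sum)
```

This matters for machine learning because features on very different scales make such errors worse, which is one motivation for scaling the data.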

### Scaling Data

For scaling we can use the `MinMaxScaler`. It works like this:
$$ x' = \frac{x - \min(x)}{\max(x) - \min(x)} $$

For standardization we can use the `StandardScaler`. It works like this ($\overline{x}$ is the mean and $s(x)$ is the standard deviation of the feature):
$$ x' = \frac{x - \overline{x}}{s(x)} $$
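
A small sketch that checks the two formulas against the scikit-learn scalers. The toy array `X` is made up for illustration; note that `StandardScaler` divides by the population standard deviation (NumPy's default `ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One feature with five samples (illustrative values).
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

X_minmax = MinMaxScaler().fit_transform(X)
X_standard = StandardScaler().fit_transform(X)

# The same transformations written out by hand.
manual_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
manual_standard = (X - X.mean(axis=0)) / X.std(axis=0)  # ddof=0

print(X_minmax.ravel())
print(X_standard.ravel())
```

In practice the scaler is usually fitted on the training data only and then applied to the test data with `transform`, so that no information from the test set leaks into the preprocessing.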

<img src="images/lr.png" />

## Regularization
Taming your model

### Bias-Variance Tradeoff
* When we fit models, we have two main sources of error
    * **Bias** - how far the predictions are, on average, from the actual values
    * **Variance** - how much the predictions vary when the model is trained on different samples of the data
* Illustration - shooter aimed at a bullseye target 
    * High bias - the aim is shifted away from the center
    * High variance - the points are more "spread out"
<img src="images/bv.png" />

`Polynomial wiggle` - the model explains the data points, but behaves chaotically between them: a small change in the input produces a huge difference in the output.

To test if a model is good we need to test it with new data.

* When we fit several models, they perform differently
    * Some are not complex enough (don't describe data well enough)
        * **Underfitting** (high bias)
    * Some may describe the data "too well" and **fail to generalize** when **new** data points are introduced
        * **Overfitting** (high variance)
* Optimal model: tradeoff between underfitting and overfitting
    * Usually, underfitting is easy to spot
        * Poor performance w.r.t. some metric
    * Overfitting is more complicated
        * Many methods exist to prevent overfitting
<img src="images/ovf.png" />

`Fitting a model` - approximating the data with a certain model

### Regularization
* Method for finding a good bias-variance tradeoff
    * Filter out noise from data
    * Handle highly correlated features
* **L2** regularization - "second norm" (Euclidean): $\lambda \|\omega\|_2^2 \equiv \lambda \displaystyle\sum_{j=1}^n \omega^{2}_{j}$
    * $ \lambda $ - regularization parameter
    * Shrinks all model weights toward 0, but rarely makes any of them exactly 0
    
* **L1** regularization - "first norm": $\lambda \|\omega\|_1 \equiv \lambda \displaystyle\sum_{j=1}^n \lvert \omega_j \rvert$
    * Sets some coefficients to 0: feature selection
* In the ideal case, we can use both L1 and L2
* Usage: add the regularization term to the cost function
    * $ \lambda > 0 $, larger $ \Rightarrow $ stronger regularization

$$ J(\omega) = J(\tilde{y}, y) + \lambda\|\omega\| _2^2$$

### Linear Regression with Regularization
* **Ridge** regression - L2
    * Cost function: $ J(\omega) = \frac{1}{2n} \displaystyle\sum^{n}_{i=1}(y_i - \tilde{y_i})^2 + \lambda\|\omega\|_2^2 $
    * We use the `Ridge` class from scikit-learn, which has the parameter `alpha` for $ \lambda $
* **LASSO** (**L**east **A**bsolute **S**hrinkage and **S**election **O**perator) - L1
    * Cost function: $ J(\omega) = \frac{1}{2n} \displaystyle\sum^{n}_{i=1}(y_i - \tilde{y_i})^2 + \lambda \|\omega\|_1 $
    * We use the `Lasso` class from scikit-learn, which has the parameter `alpha` for $ \lambda $
* **Elastic Net**
    * Has both regularization terms
    * We use the `ElasticNet` class from scikit-learn, which has the parameter `alpha` for $ \lambda $ and `l1_ratio` for the balance between the L1 and L2 terms

We also have regularization in the `LogisticRegression` class, but there we use $C$ instead of $ \lambda $, $C = \frac{1}{\lambda}$

Stronger regularization makes the model care more about keeping the weights small than about fitting the data.
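
A minimal sketch comparing the three regularized regressions with plain least squares. The synthetic data, the "true" weights and the `alpha` values are assumptions chosen for illustration. Note also that scikit-learn scales the data term differently in `Ridge` than in the formula above, so `alpha` corresponds to $\lambda$ only up to a constant factor.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge

# Synthetic data: only the first two of five features matter (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_w = np.array([3.0, -2.0, 0.0, 0.0, 0.0])
y = X @ true_w + rng.normal(scale=0.5, size=200)

models = {
    "ols": LinearRegression(),
    "ridge": Ridge(alpha=10.0),
    "lasso": Lasso(alpha=0.5),
    "elastic_net": ElasticNet(alpha=0.5, l1_ratio=0.5),
}

for name, model in models.items():
    model.fit(X, y)
    print(f"{name:12s}", np.round(model.coef_, 3))
```

With these settings one would expect the ridge weights to be smaller than the unregularized ones, and LASSO to set the weights of the irrelevant features to 0.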

## Model Testing
Seeing how well your model performs on new data

### Training and Testing Sets
* One of the most important rules in machine learning is **NEVER test the model with the data you trained it on!**
    * The model may "cheat" and learn the answers instead of finding structure in the data
* Since we usually have one dataset, it's useful to "hold out" some of the data for testing
    * E.g., **70%** of the data is for training and **30%** - for testing
    * We need to take **randomized samples**
    * In case of classification, we need stratified samples
* scikit-learn has a convenient function for this - `train_test_split`

If we do a train/test split, there is a chance that the model scores better on the testing data than on the training data. Why? Because the "number" it produces using the `score` method is only an estimate, computed on one particular sample. If the model knows its training data much better than the testing data, it has **high variance**. If the model does not fit even its own training data well (the train score is low), its capacity is not enough, so it has **high bias**.
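
A short sketch of a randomized, stratified 70/30 split and of comparing the two scores. The iris dataset, the `LogisticRegression` model and `random_state=42` are assumptions made only for the example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# 70% for training, 30% for testing; stratify keeps the class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
print(f"train: {train_score:.3f}, test: {test_score:.3f}")
```

A train score far above the test score would suggest high variance. Two low scores would suggest high bias.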

### Evaluating Model Performance
* Once we train the model, we use the test data to score it
    * Using one of the scoring metrics
* Scoring metrics
    * **Regression:** usually *coefficient of determination* $R^2$ - proportion of variance predictable from the independent variables
        * Other: mean squared error, mean absolute error, explained variance
    * **Classification:** usually *accuracy* (the proportion of items that have been properly classified)
        * Other: precision, recall, F1
* The output from scoring tells us how good the model is

### Evaluating Regression
* No fixed rules, use your intuition and knowledge about the data
* Several guidelines
    * One metric is usually not enough
        * For example, mean squared error and coefficient of determination
        * Also useful: mean absolute error and mean squared error
    * Create a residual plot ($O - E$: observed minus estimated)
        * There should be no visible structure
        * If there is some structure in the residuals, the model fails to explain something
    * Create a histogram of the residuals
        * Most residuals should be "sufficiently close" to zero
        * There should be no observable structure
<img src="images/pp.png" />
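
A minimal sketch of these guidelines on synthetic data: several metrics, the residuals (observed minus estimated) and the bin counts for a residual histogram. The linear data-generating process and all names are assumptions for illustration. Plotting is left out, but `residuals` can be passed directly to `plt.scatter` or `plt.hist`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic linear data with Gaussian noise (assumption).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.normal(scale=1.0, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# More than one metric.
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
print(f"R^2: {r2:.3f}, MSE: {mse:.3f}, MAE: {mae:.3f}")

# Residuals: observed minus estimated.
residuals = y_test - y_pred
counts, bin_edges = np.histogram(residuals, bins=10)
print("residual mean:", residuals.mean())
print("histogram counts:", counts)
```

Because the model family matches the data here, the residuals should be centered near zero with no visible structure.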

### Evaluating Classification
* **Confusion matrix** (error matrix)
    * Shows predicted vs. actual classes
    * Simplest case: 2-class classifier
        * Can be extended
    * FP $\equiv$ Type I error, FN $\equiv$ Type II error
* Metrics: numbers derived from the confusion matrix
    * Accuracy (proportion of correctly classified samples): $A = \frac{TP + TN}{TP+TN+FP+FN}$
        * If detecting anomalies, accuracy can be misleading
    * Precision (how many selected samples are relevant): $P = \frac{TP}{TP+FP}$
    * Recall (how many relevant samples are selected): $R = \frac{TP}{TP+FN}$
    * F1-score: $F_1 = \frac{2TP}{2TP + FP + FN} = 2 \frac{R \cdot P}{R+P}$
    * Many more metrics exist (useful for specific cases)

<img src="images/pn.png" />
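
A small worked example of the formulas above, checked against `sklearn.metrics`. The labels `y_true` and `y_pred` are made up for illustration.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Ten samples, 1 is the positive class (illustrative labels).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 5 1 1 3

accuracy = (tp + tn) / (tp + tn + fp + fn)  # 0.8
precision = tp / (tp + fp)                  # 0.75
recall = tp / (tp + fn)                     # 0.75
f1 = 2 * tp / (2 * tp + fp + fn)            # 0.75

# The library functions give the same values.
print(accuracy_score(y_true, y_pred), precision_score(y_true, y_pred),
      recall_score(y_true, y_pred), f1_score(y_true, y_pred))
```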

### Receiver Operating Characteristic
* Limited to 2-class classification
    * We can use "1 vs. all" for more classes
* A plot of true positive rate vs. false positive rate
    * A "bisector line" represents truly random guessing
    * Any curve above the line is better than random
        * Closer to the upper left corner = better
        * Below the line: worse than random as it is, but reversing the classifier output gives a classifier that is better than random
<img src="images/roc.png" />
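
A sketch of computing the ROC curve and the area under it (AUC) with scikit-learn. The synthetic dataset from `make_classification` and the logistic regression model are assumptions for the example. The last lines illustrate the remark about curves below the bisector: reversing the scores mirrors the curve, so the two AUC values add up to 1.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# ROC needs scores (here: probability of the positive class), not hard labels.
scores = model.predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, scores)
auc = roc_auc_score(y_test, scores)
print(f"AUC: {auc:.3f}")

# Reversing the classifier output mirrors the curve around the bisector.
inverted_auc = roc_auc_score(y_test, 1 - scores)
print(f"AUC of the reversed classifier: {inverted_auc:.3f}")
```

Plotting `fpr` against `tpr` (for example with `plt.plot(fpr, tpr)`) gives the curve itself.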

### Learning and Validation Curves
* Plots which allow us to diagnose bias and variance problems
    * Some metric (e.g., accuracy) vs. the training sample size (learning curve) or vs. a model hyperparameter (validation curve)
    * Plot two curves - for the training and validation data
* High bias - accuracy for both training and validation is too low
    * Solution: add more model features, decrease regularization
* High variance - large gap between the two curves
    * Solution: remove model features (preprocessing / feature selection / feature engineering, etc.), increase regularization
<img src="images/lv.png" />
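
A minimal sketch of computing the data for a learning curve with `learning_curve` from `sklearn.model_selection`. The synthetic dataset, the model and the five training sizes are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=600, n_features=20, random_state=0)

# Accuracy for 5 increasing training sizes, each evaluated with 5-fold CV.
train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X,
    y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
    scoring="accuracy",
)

# Average over the folds to get one point per training size.
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)

for size, tr, va in zip(train_sizes, train_mean, val_mean):
    print(f"{size:4d} samples: train={tr:.3f}, validation={va:.3f}")
```

Plotting `train_mean` and `val_mean` against `train_sizes` gives the two curves: both low suggests high bias, a large gap between them suggests high variance.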

$$ \tilde{y} = f(X, \overrightarrow{\beta}, \overrightarrow{h}) $$
* $ \overrightarrow{\beta} $ - parameters for the model
* $ \overrightarrow{h} $ - hyperparameters, e.g. $\lambda$, $\alpha$, etc.

If we have a bunch of models and want to choose the best one, scoring them all on the same test data will only show us the model which performs best on that particular test split, not in general. The set on which we compare the models and choose the best one (before testing it once again on the test set) is called the *validation set* (or development set): `X_val, y_val`.

**We run the `test` set only once!!!**
<img src="images/tvs.png" />

If the validation set would be too small, we can make several training and validation sets out of the same dataset.
<img src="images/kfc.png" />

### Cross-Validation
* When we tune a model, we usually improve its parameters based on the test scores
    * This means knowledge of the test data may "leak" into the model, which then overfits the test data
* Solution: cross-validation
    * Split all data into $ k $ groups (folds) - usually $ k = 10 $
        * **More** samples = **fewer** folds
        * Using a KFold splitter
    * Each time, train with $ k - 1 $ folds and test with the remaining fold

The more data we have, the smaller the percentage of it that the testing and validation sets need to take.
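
A short sketch of k-fold cross-validation with a `KFold` splitter and `cross_val_score`. The diabetes dataset, `Ridge(alpha=1.0)` and $k = 10$ are assumptions for the example.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

X, y = load_diabetes(return_X_y=True)

# 10 folds; shuffling makes the folds randomized samples.
cv = KFold(n_splits=10, shuffle=True, random_state=0)

# Each iteration trains on k - 1 folds and tests on the remaining one.
fold_sizes = [len(test_idx) for _, test_idx in cv.split(X)]
print("test fold sizes:", fold_sizes)

scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=cv, scoring="r2")
print("R^2 per fold:", scores.round(3))
print(f"mean: {scores.mean():.3f}, std: {scores.std():.3f}")
```

The spread of the per-fold scores gives a rough idea of how much a single train/test split could mislead us.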

### Improved Technique
* Test / train set $\rightarrow$ faster
* Cross-validation $\rightarrow$ more accurate
* Best performance (but even slower): combine the methods
    * Leave out some of the data for testing at the beginning (e.g. 30%)
    * Perform cross-validation on the other 70%
    * Fine-tune the model and / or select one of many models based on the best cross-validation score
    * Run the best model on the held-out 30%
        * We selected the best model based on the training data $\Rightarrow$ we have some bias
            * One model will always have the highest score, even if it's by chance
        * This truly out-of-sample method removes (some of) the bias
    * **Model selection**: choose the best performing model
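
The combined procedure above can be sketched as follows. The dataset, the three candidate models and their `alpha` values are assumptions picked for illustration, not recommended settings.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_diabetes(return_X_y=True)

# 1. Leave out 30% of the data for the final test.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# 2. Cross-validate every candidate on the other 70%.
candidates = {
    "ridge": Ridge(alpha=0.1),
    "lasso": Lasso(alpha=0.1),
    "elastic_net": ElasticNet(alpha=0.01, l1_ratio=0.5),
}
cv_scores = {
    name: cross_val_score(model, X_rest, y_rest, cv=5).mean()
    for name, model in candidates.items()
}

# 3. Select the model with the best cross-validation score.
best_name = max(cv_scores, key=cv_scores.get)
best_model = candidates[best_name].fit(X_rest, y_rest)

# 4. Run the selected model once on the held-out 30%.
final_score = best_model.score(X_test, y_test)
print(cv_scores)
print(f"best: {best_name}, held-out R^2: {final_score:.3f}")
```

The held-out score is the one to report, since the cross-validation score of the winner is optimistic by construction.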

### Hyperparameter Tuning
* Techniques for choosing the best model hyperparameters
    * Such as regularization
* Most widely used: grid search
    * "Brute-force": specify parameter values; run models with all possible parameter combinations; choose the best model
* Randomized search - each setting is sampled randomly from a parameter range
    * Faster but not guaranteed to produce the best results
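
A minimal sketch of randomized search with `RandomizedSearchCV`. Sampling `alpha` from a log-uniform distribution between 0.001 and 100 (via `scipy.stats.loguniform`), the 20 iterations and the diabetes dataset are all assumptions for the example.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

X, y = load_diabetes(return_X_y=True)

# Each of the 20 settings draws alpha at random from the given range.
search = RandomizedSearchCV(
    Ridge(),
    param_distributions={"alpha": loguniform(1e-3, 1e2)},
    n_iter=20,
    cv=5,
    random_state=0,
)
search.fit(X, y)

best_alpha = search.best_params_["alpha"]
print(f"best alpha: {best_alpha:.4f}, CV score: {search.best_score_:.3f}")
```

Only 20 models per fold are trained here regardless of how wide the range is, which is where the speed advantage over an exhaustive grid comes from.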

### Making the Best of Our Models
* Usually, we **don't know** right away which algorithm will perform **best**
* We *select several algorithm types*
    * Fine-tune their parameters using grid search (or some other technique)
    * Select the best combination of parameters
* After that, we *compare the best algorithms* for each type
    * Using the model selection procedure
        * Hold-out (test) set + cross-validation set
    * Select the best performing model type on the cross-validation set
        * Test it on the "hold-out" set
* Improvements: perform test/train split on the hyperparameter tuning step; use different performance scores

We could use the `GridSearchCV` class from the `sklearn.model_selection` module.
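
A minimal sketch of `GridSearchCV` combined with a hold-out test set. The `ElasticNet` model, the parameter grid and the diabetes dataset are assumptions chosen for illustration.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# 5 x 3 = 15 combinations, each evaluated with 5-fold cross-validation.
param_grid = {
    "alpha": [0.001, 0.01, 0.1, 1.0, 10.0],
    "l1_ratio": [0.2, 0.5, 0.8],
}
search = GridSearchCV(ElasticNet(max_iter=10_000), param_grid, cv=5, scoring="r2")
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print(f"best CV score: {search.best_score_:.3f}")

# By default the best model is refitted on all training data,
# so the search object can be scored on the hold-out set directly.
test_score = search.score(X_test, y_test)
print(f"hold-out R^2: {test_score:.3f}")
```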

## Manipulating Features
Making things simpler

### Guides for Manipulating Features
* Main ideas
    * **Reduce** the number of features ($\Rightarrow$ simpler model)
    * Keep **relevant** information only
* Feature selection
    * Removing irrelevant features
    * Regularization does a good job at this
    * Other methods: dimensionality reduction
        * We'll talk about this later in the course
* Feature engineering
    * Producing new, meaningful features
        * Requires a lot of work and domain knowledge

Date: 21.09.2023