# Python Machine Learning with Scikit-learn - Partitioning

In [11]:
import pandas as pd 
import numpy as np 
import math 

from matplotlib import pyplot as plt 

plt.style.use('seaborn')

# Introduction

In this notebook, we introduce `scikit-learn`, a Python machine learning library built on NumPy, SciPy and matplotlib.

The different modules are listed in the [documentation](https://scikit-learn.org/stable/user_guide.html).

As it is a heavy library, we do not normally import it in its entirety; instead, **we import only the necessary functionality**.

In [12]:
from sklearn.model_selection import cross_val_score, cross_validate, train_test_split

# load_boston was removed in scikit-learn 1.2; this notebook assumes an earlier release.
from sklearn.datasets import load_boston

from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

# Exercise 1

For each of the following hypotheses, **identify whether it is a possible source of *bias* or variance**:
- The use of **very flexible** models (e.g., non-parametric, non-linear, or with many parameters).
    - [] Bias.
    - [X] Variance.
- The use of **models with simplistic *assumptions* on the data.**
    - [X] Bias.
    - [] Variance.
- **Ignoring important features**.
    - [X] Bias.
    - [] Variance.
- Having **more features than training examples**.
    - [] Bias.
    - [X] Variance.

**How can we evaluate** the *bias* and variance of a model?

Through a train-test split: we fit the model on the training set and compare its training error with its test error. A high training error suggests bias; a test error much higher than the training error suggests variance.
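A minimal sketch of that comparison, assuming synthetic data from `make_regression` (not the notebook's dataset) and two models picked here purely for illustration: an unconstrained decision tree and a depth-1 tree. The helper name `train_test_scores` is hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (an assumption, for illustration only).
X_demo, y_demo = make_regression(n_samples=300, n_features=10, noise=20.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

def train_test_scores(model):
    """Fit on the training split and return (train R^2, test R^2)."""
    model.fit(X_tr, y_tr)
    return model.score(X_tr, y_tr), model.score(X_te, y_te)

# Very flexible model: expected to fit the training data almost perfectly.
flexible_train, flexible_test = train_test_scores(DecisionTreeRegressor(random_state=0))
# Very rigid model: a single split cannot capture much structure.
rigid_train, rigid_test = train_test_scores(DecisionTreeRegressor(max_depth=1, random_state=0))

print(f"flexible tree: train={flexible_train:.3f}, test={flexible_test:.3f}")
print(f"depth-1 tree:  train={rigid_train:.3f}, test={rigid_test:.3f}")
```

A large gap between the two scores of the flexible tree is the symptom of variance; a low training score for the depth-1 tree is the symptom of bias.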

# Exercise 2

Overall, data partitioning is done with the aim of **measuring the predictive performance of a model**, i.e., generalization.

To measure the performance of the model we can test it on a set of test data that was not used for training.

This technique is known as the *hold-out* method.

**Exercise**: Randomly split the data into two datasets: **training** (70% of the data) and **test** (30% of the data).

In [13]:

### BEGIN SOLUTION
X, y = load_boston(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

### END SOLUTION

#### Tests

In [14]:
assert X_train.shape == (354, 13)
assert X_test.shape == (152, 13)
assert y_train.shape == (354,)
assert y_test.shape == (152,)
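The shapes asserted above follow from how `train_test_split` rounds a fractional `test_size`: the test set gets the ceiling of `test_size * n_samples` and the training set gets the remainder. A small check, using a placeholder array with the same number of rows (506) instead of the real data:

```python
import math
import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 506                      # number of rows in the dataset used above
placeholder = np.arange(n_samples)   # stand-in for the real rows

# Expected sizes: ceiling for the test set, remainder for the training set.
expected_test = math.ceil(0.30 * n_samples)
expected_train = n_samples - expected_test

train_part, test_part = train_test_split(placeholder, test_size=0.30, random_state=42)
print(len(train_part), len(test_part), expected_train, expected_test)
```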

# Exercise 3

As we have seen, however, the test dataset **should not be burned** if it is to give an adequate estimate of the generalization error.

What we mean by this is that if we continually evaluate models by their performance on the test set and choose the best ones, we will be:
- **Inflating performance** on the test set (because we are optimizing for it).
- Losing the possibility of using it to predict the **performance of the model on data never seen before** (*out-of-sample*).

However, before settling on a "final" model it is essential to choose among several alternative models.

For this reason, **we split the training dataset into training and validation**, and use the second one to choose the best model(s).

**Exercise**: Now, we want to create three datasets: **training** (80% of data), **validation** (10% of data) and **test** (10% of data).

In [16]:

### BEGIN SOLUTION
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.1)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=.1/.9)
### END SOLUTION

#### Tests

In [17]:
assert X_train.shape == (404, 13)
assert X_val.shape == X_test.shape == (51, 13)
assert y_train.shape == (404,)
assert y_val.shape == y_test.shape == (51,)
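The second split uses `test_size=.1/.9` because it is applied to the 90% that remains once the test set is removed: 10% of the whole dataset is 1/9 of that remainder. A sketch that wraps both steps in a helper; `three_way_split` is a hypothetical name (not a scikit-learn function) and the inputs are placeholder arrays with the same shape as the data above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def three_way_split(X, y, val_frac=0.1, test_frac=0.1, random_state=None):
    """Split into train/validation/test; fractions refer to the full dataset."""
    X_rest, X_te, y_rest, y_te = train_test_split(
        X, y, test_size=test_frac, random_state=random_state)
    # Rescale the validation fraction to the data left after removing the test set.
    rel_val = val_frac / (1.0 - test_frac)
    X_tr, X_va, y_tr, y_va = train_test_split(
        X_rest, y_rest, test_size=rel_val, random_state=random_state)
    return X_tr, X_va, X_te, y_tr, y_va, y_te

# Placeholder data: 506 rows and 13 columns, all zeros.
X_demo = np.zeros((506, 13))
y_demo = np.zeros(506)

X_tr, X_va, X_te, y_tr, y_va, y_te = three_way_split(X_demo, y_demo, random_state=0)
print(X_tr.shape, X_va.shape, X_te.shape)
```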

# Exercise 4

Sometimes, however, **it is not possible to give up a significant part of the data** to create training and validation datasets.

A more efficient alternative is to use *cross-validation*. There are **several types of cross-validation**:
- *Leave-one-out*.
    - Train $n$ models, each on $n-1$ observations, and for each one calculate the error on the single observation omitted from training.
    - Calculate the average of the errors.
- *Leave-k-out*.
    - Similar to *leave-one-out* but with *k* observations omitted from training (for validation) in each iteration.
- *K-fold*.
    - The dataset is partitioned into *k* sub-datasets (*folds*) and each one is omitted from training (for validation) in exactly one of the *k* iterations.
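The three schemes above are available as splitter classes in `sklearn.model_selection`; leave-k-out is called `LeavePOut` there. A sketch on a toy array of 6 observations (the array and its size are assumptions for illustration) that counts how many models each scheme would train:

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut, LeavePOut

X_small = np.arange(12).reshape(6, 2)  # 6 toy observations, 2 features

loo = LeaveOneOut()        # one observation held out per iteration
lpo = LeavePOut(p=2)       # every possible pair held out once
kfold = KFold(n_splits=3)  # 3 disjoint folds, each held out once

n_loo = loo.get_n_splits(X_small)
n_lpo = lpo.get_n_splits(X_small)
n_kfold = kfold.get_n_splits(X_small)
print(n_loo, n_lpo, n_kfold)

# Each iteration yields the row indices used for training and for validation.
for train_idx, val_idx in kfold.split(X_small):
    print("train:", train_idx, "validation:", val_idx)
```

The number of iterations grows with $n$ for leave-one-out and combinatorially for leave-k-out, while K-fold keeps it fixed at *k*.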

The last alternative, *K-fold*, is, in practice, more consistent and **more robust to small changes in the data**.

It is the alternative most used in practice to **evaluate and select** machine learning **models**.

**Exercise**: Evaluate the models below using *K-fold* cross-validation, with 4 *folds*.

In [7]:
# Some models for testing.

# https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html
dt = DecisionTreeRegressor()

# https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
lr = LinearRegression()

# https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
rf = RandomForestRegressor()


### BEGIN SOLUTION

dt_results = cross_val_score(dt, X_train, y_train, cv=4)
lr_results = cross_val_score(lr, X_train, y_train, cv=4)
rf_results = cross_val_score(rf, X_train, y_train, cv=4)

### END SOLUTION

results = pd.DataFrame({
    'Decision Tree': dt_results,
    'Linear Regression': lr_results,
    'Random Forest': rf_results
})

results

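| Fold | Decision Tree | Linear Regression | Random Forest |
|------|---------------|-------------------|---------------|
| 0 | 0.783102 | 0.722671 | 0.860863 |
| 1 | 0.761287 | 0.698294 | 0.793908 |
| 2 | 0.864839 | 0.775026 | 0.896901 |
| 3 | 0.434137 | 0.575477 | 0.687971 |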


How can we **interpret the results**? Include in the discussion:
- What each column means.

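Each column corresponds to one of the 3 models under analysis.

- What each row means.

Each row corresponds to one of the 4 folds generated by K-fold partitioning: the model is trained on the other 3 folds and scored on that one.

- How to interpret the values in the cells.
     - Each result is a measure of the model's performance (not error): with no `scoring` argument, `cross_val_score` uses the estimator's `score` method, which for these regressors is the coefficient of determination $R^2$.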
     - The higher the result, the better the performance.
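For a regressor, `cross_val_score` with an integer `cv` and no `scoring` argument uses unshuffled K-fold splits and the estimator's own `score` method. The loop below is a sketch of what that amounts to, run on synthetic data (an assumption, not the notebook's dataset) with a linear regression:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data, for illustration only.
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Manual K-fold: train on 3 folds, score (R^2) on the held-out fold.
manual_scores = []
for train_idx, val_idx in KFold(n_splits=4).split(X_demo):
    model = LinearRegression()
    model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(model.score(X_demo[val_idx], y_demo[val_idx]))
manual_scores = np.array(manual_scores)

# The same evaluation through the library helper.
library_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=4)

print(manual_scores)
print(library_scores)
```

Both arrays hold one score per fold, which is what each row of the results table represents.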

# Exercise 5

Finally, we calculate the mean and the standard deviation of each model's scores across the folds (the standard deviation as a measure of variance).

In [18]:

### BEGIN SOLUTION

models_mean = results.mean().sort_values(ascending=False)
models_mean

### END SOLUTION


Random Forest        0.809911
Decision Tree        0.710841
Linear Regression    0.692867
dtype: float64

In [20]:

### BEGIN SOLUTION

models_std = results.std().sort_values()
models_std

### END SOLUTION



Linear Regression    0.084555
Random Forest        0.091813
Decision Tree        0.189778
dtype: float64

In [10]:
models_var = results.var()
models_var

Decision Tree        0.036016
Linear Regression    0.007149
Random Forest        0.008430
dtype: float64
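Note that pandas computes `std()` and `var()` with `ddof=1` by default (the sample estimators), and that the variance is simply the square of the standard deviation. A short sketch that gathers the three statistics in one table, using the fold scores copied from the results shown earlier; the names `scores` and `summary` are introduced here for illustration:

```python
import numpy as np
import pandas as pd

# Fold scores copied from the cross-validation results above.
scores = pd.DataFrame({
    'Decision Tree': [0.783102, 0.761287, 0.864839, 0.434137],
    'Linear Regression': [0.722671, 0.698294, 0.775026, 0.575477],
    'Random Forest': [0.860863, 0.793908, 0.896901, 0.687971],
})

summary = pd.DataFrame({
    'mean': scores.mean(),
    'std': scores.std(),   # sample standard deviation (ddof=1)
    'var': scores.var(),   # sample variance (ddof=1), equal to std squared
})

print(summary.sort_values('mean', ascending=False))
```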

## Exercise 5.1

Based on the results above, **sort the models in decreasing order of bias**.
 
 - Linear Regression, Decision Tree, Random Forest (from the lowest to the highest mean score)

## Exercise 5.2

Based on the results above, **sort the models in descending order of variance**.

 - Decision Tree, Random Forest, Linear Regression

## Exercise 5.3

Do these results seem to **contradict**, in any way, the *bias-variance trade-off*? If so, **why**?

 - At first sight, yes: Random Forest has both the highest mean score (lowest bias) and a low spread across folds (low variance), whereas the trade-off says that reducing bias tends to increase variance and vice versa. What is happening is that the Random Forest model appears to be close to the optimal level of complexity, the sweet spot. Averaging many decision trees reduces variance without giving up much of their flexibility.
 
 - In addition, the table of scores shows that the Decision Tree model is highly sensitive to the data it is trained on: its scores range from 0.864839 on one fold to 0.434137 on another.
 
 - The Linear Regression model shows consistently lower scores, indicating a high level of bias.
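
For reference, the trade-off discussed in this exercise comes from the standard decomposition of the expected squared error at a point $x$. It is stated here under the usual assumptions (not spelled out in this notebook) that $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$, and that $\hat{f}$ is fitted on a randomly drawn training set:

$$
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
$$

The mean and spread of cross-validation scores are only rough proxies for the first two terms, so the orderings in Exercises 5.1 and 5.2 should be read as indications rather than measurements.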