# SLU11 - Advanced Validation: Exercises notebook

In [1]:
import pandas as pd
import numpy as np

## 1 Bias-variance trade-off

### Exercise 1: Detecting bias and variance in a simple model (not graded)

Imagine you are measuring voting intentions, namely the percentage of people who will vote for a given political party A, as opposed to political party B.

A way to build this model would be to randomly choose 50 numbers from the phone book, call each one, and ask the respondent who they plan to vote for.

Now, consider we got the following results:

| Party A | Party B | Non-Respondents | Total |
|---------|---------|-----------------|-------|
| 13      | 16      | 21              | 50    |

From the data, we estimate the probability of voting A as:

In [2]:
13 / (13 + 16)

0.4482758620689655

Using our (flawed, as we will see) model, we predict a victory for party B. But can we expect our model to generalize when the elections come?

To understand that, we need to identify sources of bias and variance.

Below you will find a list of issues undermining the model. You need to identify which ones are sources of bias and which ones are sources of variance:

1. Only sampling people from the phone book (bias/~~variance~~)
2. Not following-up with non-respondents (bias/variance)
3. Not weighting responses by likelihood to vote (bias/variance)
4. Small sample size (bias/variance)

In [3]:
answers_dict_1 = { 1 : 'bias',
                   2 : 'bias',
                   3 : 'bias',
                   4 : 'variance' }
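
To put rough numbers on the two kinds of error in the poll above, here is a minimal sketch. It assumes simple random sampling for the standard error formula, and the "100 times larger" poll is a hypothetical comparison, not part of the exercise data.

```python
import math

# Poll results taken from the table above.
votes_a, votes_b, non_respondents = 13, 16, 21
respondents = votes_a + votes_b
total = respondents + non_respondents

p_hat = votes_a / respondents


def standard_error(p, n):
    # Standard error of a sample proportion, assuming simple random sampling.
    return math.sqrt(p * (1 - p) / n)


# Variance: how much the estimate would move from one sample to the next.
se_poll = standard_error(p_hat, respondents)
se_large = standard_error(p_hat, 100 * respondents)  # hypothetical larger poll

# Bias: extreme cases where every non-respondent votes B, or every one votes A.
share_a_low = votes_a / total
share_a_high = (votes_a + non_respondents) / total

print(f"estimate: {p_hat:.3f}, standard error: {se_poll:.3f}")
print(f"standard error with 100x the respondents: {se_large:.4f}")
print(f"share for A once non-respondents are counted: {share_a_low:.2f} to {share_a_high:.2f}")
```

Collecting more answers shrinks the standard error, but it does nothing about the non-respondents or about who is listed in the phone book. That is why only the small sample size counts as a source of variance here.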

### Exercise 2: Detecting bias and variance in the real world (not graded)

For each of the following, identify whether it is more likely to be a source of bias or variance:

1. Using very flexible models (e.g., non-parametric, non-linear), such as K-nearest neighbors or decision trees (bias/variance)
2. Using models with simplistic assumptions, such as linear or logistic regressions (bias/variance)
3. Increasing the polynomial degree of our hypothesis function (bias/variance)
4. Ignoring important features (bias/variance)

In [4]:
answers_dict_2 = { 1 : 'variance', 
                   2 : 'bias',
                   3 : 'variance',
                   4 : 'bias'}
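
The effect of the polynomial degree can be illustrated with a small experiment. This is a sketch on synthetic data: the sine-shaped target, the noise level, and the degrees 1, 3 and 10 are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_data(n):
    # Synthetic data: a smooth non-linear signal plus Gaussian noise.
    x = rng.uniform(-1, 1, size=n)
    y = np.sin(3 * x) + rng.normal(scale=0.3, size=n)
    return x, y


def mse(coeffs, x, y):
    # Mean squared error of the fitted polynomial on (x, y).
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))


x_train, y_train = make_data(30)
x_test, y_test = make_data(200)

results = {}
for degree in (1, 3, 10):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))

for degree, (train_err, test_err) in results.items():
    print(f"degree {degree:2d}: train MSE {train_err:.3f}, test MSE {test_err:.3f}")
```

The training error can only go down as the degree grows, because each higher-degree fit contains the lower-degree ones as special cases. A straight line that misses the curve on both sets points to bias, while a widening gap between training and test error at high degree usually points to variance.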

## 2 Train-test split

### Exercise 3: Create training and test datasets (graded)

1. Load the `data/beer.csv` dataset

In [5]:
beer = pd.read_csv('data/beer.csv')

In [6]:
from sklearn.model_selection import train_test_split


def implement_hold_out_method(X, y, test_size=.4):
    """ 
    Implementing the holdout method, using sklearn.
    
    Args:
        X (pd.DataFrame): a pandas dataframe containing the features
        y (pd.Series): a pandas series containing the target variable
        test_size (float): proportion of the dataset to include in the test set

    Returns:
        X_train (pd.DataFrame): the features for the training examples
        X_test (pd.DataFrame): the features for the test examples
        y_train (pd.Series): target for the training set 
        y_test (pd.Series): target for the test set

    """
    # use train_test_split to create the training and test datasets
    # X_train, X_test, y_train, y_test = ...
    ### BEGIN SOLUTION 
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size)
    ### END SOLUTION
    
    return X_train, X_test, y_train, y_test

In [7]:
"""Check that the solution is correct."""
def generate_test_data(m, n):
    # build an m-by-n frame of random integers; column 0 becomes the target
    values = np.random.randint(0, m, size=(m, n))
    df = pd.DataFrame(values)
    X = df.copy()
    y = X.pop(0)
    return X, y

X, y = generate_test_data(m=100, n=4)
X_train, X_test, y_train, y_test = implement_hold_out_method(X, y)
assert X_train.shape == (60, 3)
assert X_test.shape == (40, 3)
assert y_train.shape == (60, )
assert y_test.shape == (40, )
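
To see what a hold-out split does to the rows, here is a hand-rolled sketch using only NumPy. It is not scikit-learn's implementation: `hold_out_split` and its `seed` argument are hypothetical names, and the rounding of the test set size is an assumption.

```python
import numpy as np


def hold_out_split(X, y, test_size=0.4, seed=0):
    # Shuffle the row indices once, then cut them into test and train parts.
    n_samples = len(X)
    n_test = int(round(test_size * n_samples))  # assumed rounding rule
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    test_idx, train_idx = shuffled[:n_test], shuffled[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]


# Toy data where row i of the features starts with the value 3 * i.
X_demo = np.arange(300).reshape(100, 3)
y_demo = np.arange(100)

X_tr, X_te, y_tr, y_te = hold_out_split(X_demo, y_demo)
print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)
```

Fixing the seed makes the split reproducible. With scikit-learn, passing `random_state` to `train_test_split` serves the same purpose.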

### Exercise 4: Creating a validation dataset (graded)

In [8]:
def implement_validation_dataset(X, y, test_size=.25, validation_size=.25):
    """ 
    Implementing the holdout method with validation, using sklearn.
    
    Args:
        X (pd.DataFrame): a pandas dataframe containing the features
        y (pd.Series): a pandas series containing the target variable
        test_size (float): proportion of the dataset to include in the test set
        validation_size (float): proportion of the dataset to include in the validation set

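    Returns:
        X_train (pd.DataFrame): the features for the training examples
        X_test (pd.DataFrame): the features for the test examples
        X_val (pd.DataFrame): the features for the validation examples
        y_train (pd.Series): target for the training set
        y_test (pd.Series): target for the test set
        y_val (pd.Series): target for the validation set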
    """
    # use train_test_split to create the test dataset
    # X_temp, X_test, y_temp, y_test = ...
    ### BEGIN SOLUTION 
    X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=test_size)
    ### END SOLUTION
    
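    # compute the size of the validation dataset relative to the temp dataset
    # validation_temp_size = ...
    ### BEGIN SOLUTION 
    validation_temp_size = validation_size / (1 - test_size)
    ### END SOLUTION
    
    # use train_test_split to split the temp dataset into training and validation datasets
    # X_train, X_val, y_train, y_val = ...
    ### BEGIN SOLUTION 
    X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=validation_temp_size)
    ### END SOLUTION
    
    return X_train, X_test, X_val, y_train, y_test, y_val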
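
The rescaling `validation_size / (1 - test_size)` is easier to follow with concrete numbers. The sketch below works through the arithmetic for a hypothetical dataset of 100 rows, using the default proportions. It uses plain rounding, so the exact counts scikit-learn produces may differ by a row in edge cases.

```python
n_samples = 100          # hypothetical dataset size
test_size = 0.25         # proportion of the full dataset
validation_size = 0.25   # proportion of the full dataset

# First split: set the test rows aside.
n_test = round(test_size * n_samples)
n_temp = n_samples - n_test

# Second split: the validation proportion must be expressed relative to
# the remaining (temp) rows, not the full dataset.
validation_temp_size = validation_size / (1 - test_size)
n_val = round(validation_temp_size * n_temp)
n_train = n_temp - n_val

print(f"temp rows: {n_temp}, relative validation size: {validation_temp_size:.3f}")
print(f"train: {n_train}, validation: {n_val}, test: {n_test}")
```

Passing `validation_size` unchanged to the second split would take 25% of the 75 remaining rows, which is fewer rows than the 25% of the full dataset that was requested.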

## 3 Cross-validation

### Exercise 5: Implementing K-fold cross-validation (graded)