This notebook introduces how to measure the performance of your models and explains why validating a model is necessary.

In [None]:
import numpy as np
import pandas as pd

# Loss functions

Before building a model, we need to decide which properties of the model we want to emphasize and how to measure its performance. For a regression problem, we can measure the distance between the predictions and the ground truth. The best model is the one that minimizes this error.

A simple idea is to compute the difference between the predictions and the ground truth.

In [None]:
# To complete
def loss(predictions, groundTruth):
    """
        Computes the difference between each prediction and its ground truth
        and returns the mean over all the samples
    """
    error = ...
    return error

What is the limitation of this loss? Give an example that has an error of zero even though none of the predictions is close to the ground truth.

In [None]:
# To complete
predictions = [1, -1, 0, 5, -5, 3, -3]
groundTruth = [-1, 1, 0, 5, -5, 3, -3]
assert loss(predictions, groundTruth) == 0, "The loss is not zero"

How can this problem be corrected?
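
As one possible answer, and only as a sketch, the cancellation between positive and negative errors can be avoided by taking the absolute value or the square of each difference before averaging. The function names `signedMeanError`, `meanAbsoluteError` and `meanSquaredError` and the example values are assumptions made for this illustration; either of the last two could play the role of the `lossCorrected` function used later in this notebook.

```python
def signedMeanError(predictions, groundTruth):
    """Mean of the signed differences: positive and negative errors cancel."""
    differences = [p - t for p, t in zip(predictions, groundTruth)]
    return sum(differences) / len(differences)


def meanAbsoluteError(predictions, groundTruth):
    """Mean of the absolute differences: errors can no longer cancel."""
    differences = [abs(p - t) for p, t in zip(predictions, groundTruth)]
    return sum(differences) / len(differences)


def meanSquaredError(predictions, groundTruth):
    """Mean of the squared differences: large errors are penalized more."""
    differences = [(p - t) ** 2 for p, t in zip(predictions, groundTruth)]
    return sum(differences) / len(differences)


# Hypothetical example: every prediction is off by 10, in alternating directions
examplePredictions = [10, -10, 10, -10]
exampleGroundTruth = [0, 0, 0, 0]

print(signedMeanError(examplePredictions, exampleGroundTruth))    # 0.0
print(meanAbsoluteError(examplePredictions, exampleGroundTruth))  # 10.0
print(meanSquaredError(examplePredictions, exampleGroundTruth))   # 100.0
```

The signed mean reports a perfect score although every prediction is wrong, whereas the two other losses are zero only when every prediction matches its ground truth.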

What other aspects of the model do you think are important?

# Coefficient of determination

A refinement of this loss is the coefficient of determination, $R^2$, which measures how well the model replicates the observed outcomes.
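
As a minimal sketch, the coefficient of determination can be computed as $R^2 = 1 - SS_{res} / SS_{tot}$, where $SS_{res}$ is the sum of squared differences between the ground truth and the predictions, and $SS_{tot}$ is the sum of squared differences between the ground truth and its mean. The function name `coefficientOfDetermination` and the example values are assumptions for this illustration.

```python
import numpy as np


def coefficientOfDetermination(predictions, groundTruth):
    """Returns R^2 = 1 - SS_res / SS_tot (undefined if groundTruth is constant)."""
    predictions = np.asarray(predictions, dtype=float)
    groundTruth = np.asarray(groundTruth, dtype=float)
    # Sum of squared residuals between the observations and the predictions
    residualSum = np.sum((groundTruth - predictions) ** 2)
    # Total sum of squares around the mean of the observations
    totalSum = np.sum((groundTruth - groundTruth.mean()) ** 2)
    return 1.0 - residualSum / totalSum


# Hypothetical observations
observed = [1.0, 2.0, 3.0, 4.0]

perfectScore = coefficientOfDetermination(observed, observed)
meanScore = coefficientOfDetermination([2.5, 2.5, 2.5, 2.5], observed)
print(perfectScore)  # 1.0
print(meanScore)     # 0.0
```

A score of 1 means the predictions reproduce the observations exactly, and a score of 0 means the model does no better than always predicting the mean.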

# Overfitting

If we only minimize the loss on the training set, we will very likely run into **overfitting**. This section is based on the Medium article: [Memorizing is not learning! — 6 tricks to prevent overfitting in machine learning](https://hackernoon.com/memorizing-is-not-learning-6-tricks-to-prevent-overfitting-in-machine-learning-820b091dc42)

## What is overfitting?

Let's take a simple example: building a model with `if` / `else` conditions.
In Python, it works as follows:

    if condition:
        # If the condition is True executes this part of the code 
        code ...
    elif secondCondition:
        # If secondCondition is True executes this part 
        # (you can repeat elif as many times as you need)
        code ...
    else:
        # If none of the previous conditions is True
        code ...
        
Now let's build our first simplified model. Each house is represented as follows: colorWindow, yearConstruction, squareFeet, and you have to predict its price. Here is the dataset that you will use to train your model.

In [None]:
trainingHouses = [["blue", 2010, 700], ["blue", 2015, 550], ["green", 2015, 700]]
trainingSalePrices = [100000, 150000, 50000]

In [None]:
# To complete
# Here is your first model
def model(house):
    colorWindow, yearConstruction, squareFeet = house
    if ...:
        ...
    elif ...:
        ...

In [None]:
predictions = [model(trainingHouse) for trainingHouse in trainingHouses]
assert lossCorrected(predictions, trainingSalePrices) == 0, "Your model seems to be incorrect"

A real estate agent comes to you and wants your estimate for a new house that he has to sell. What is your estimate?

In [None]:
newHouse = ["blue", 2015, 700]
# Your estimate?


The agent is really displeased with your answer! Your model has not been able to generalize to new data: it has overfitted!

"The word overfitting refers to a model that **models the training data too well**. Instead of learning the general distribution of the data, the model learns the expected output for every data point. 

This is the same as memorizing the answers to a maths quiz instead of knowing the formulas. Because of this, the model cannot generalize. Everything is all good as long as you are in familiar territory, but as soon as you step outside, you’re lost.

The tricky part is that, at first glance, **it may seem that your model is performing well** because it has a very small error on the training data. However, as soon as you ask it to predict new data points, it will fail."
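
To make the quoted idea concrete, here is a sketch of a model that does nothing but memorize the training set given earlier. The names `memory` and `memorizingModel` are hypothetical and are not the model you are asked to write above; the data is the training data and the new house from this notebook.

```python
trainingHouses = [["blue", 2010, 700], ["blue", 2015, 550], ["green", 2015, 700]]
trainingSalePrices = [100000, 150000, 50000]

# Memorize each training house together with its sale price
memory = {tuple(house): price
          for house, price in zip(trainingHouses, trainingSalePrices)}


def memorizingModel(house):
    """Hypothetical model: returns the memorized price, or None for an unseen house."""
    return memory.get(tuple(house))


# Perfect on the training data...
trainingPredictions = [memorizingModel(house) for house in trainingHouses]
print(trainingPredictions == trainingSalePrices)  # True

# ...but it has nothing to say about a house it has never seen
newHouse = ["blue", 2015, 700]
print(memorizingModel(newHouse))  # None
```

The training error is zero, yet the model gives no usable estimate for the new house, which is exactly the situation the quote describes.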



![Overfitting / Underfitting](https://cdn-images-1.medium.com/max/1600/1*SBUK2QEfCP-zvJmKm14wGQ.png)

## How to measure it?

To detect overfitting, you need to analyze the model's performance on data that **it has never seen before**! This lets you measure its capacity to **generalize** to new data. However, it is hard to collect new data for each new model, so to overcome this issue you **split** your dataset into **training** and **testing** sets.

The training set has to be large enough for the model to generalize, and the testing set **representative** enough of the dataset to measure how **accurate** your model is (usually you select this testing set at random so that it gives an overview of all the data). The following table helps you detect when your model is overfitting the data.

<img src="https://cdn-images-1.medium.com/max/800/1*3XvSvKfde8u89TMwjkz3kg.png" alt="Detection" style="width: 600px;"/>

So let's split our original dataset into two sets (80% for training and 20% for testing).

To select a subset of a list in Python, use `[]`:

    values = [10,23,52,32,44]
    values[0] # Returns the value 10
    values[:2] # Returns the list [10,23]

To take 20% of this list, you can write

    length = len(values)
    stop = 0.2 * length # Takes only 20%
    stop = int(stop) # Converts to an integer
    # (because slice indices must be integers, not floats)
    twentyPercentFirst = values[:stop] # Returns the first 20%
    twentyPercentLast = values[-stop:] # Returns the last 20%
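
Putting the slicing steps above together, a shuffled 80/20 split could look like the following sketch. The function name `splitDataset`, its parameters `testFraction` and `seed`, and the sample values are assumptions for this illustration, not part of the exercise below.

```python
import random


def splitDataset(samples, testFraction=0.2, seed=0):
    """Shuffles a copy of the samples and returns (train, test)."""
    shuffled = list(samples)               # copy, so the input is left untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed for a reproducible split
    stop = int(testFraction * len(shuffled))
    test = shuffled[:stop]                 # first 20% of the shuffled samples
    train = shuffled[stop:]                # remaining 80%
    return train, test


# Hypothetical samples
samples = [10, 23, 52, 32, 44, 17, 8, 61, 29, 35]
train, test = splitDataset(samples)
print(len(train), len(test))  # 8 2
```

Taking the test set from the front of the shuffled list avoids the edge case where `stop` is 0, since a slice such as `[-0:]` would return the whole list.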

In [None]:
# To complete
# Open the data
houses = ...

In [None]:
# Shuffles the dataset
np.random.shuffle(houses)

In [None]:
# To complete
# Splits the data
train, test = ...

You are now able to detect whether your models overfit, but how can you prevent this behavior?

## How to prevent it?

### Augment dataset

A model can only "memorize" a limited amount of information, so the more data you have, the less likely it is to overfit. You can therefore **extend the dataset with new datapoints**, although collecting them is sometimes hard or even impossible.

When collecting more data is not possible, you can **create** new datapoints. In computer vision, it is common to add noise or reframe images to make the model more robust to the perturbations observed in real cases (the noise has to be realistic, which is why an expert's point of view is valuable).
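
As an illustration of creating new datapoints by adding noise, here is a minimal sketch on numeric features. The function name `augmentWithNoise`, the `noiseScale` value and the choice of Gaussian noise proportional to each column's standard deviation are all assumptions; whether such a perturbation is realistic for a given feature is precisely the kind of question an expert should answer.

```python
import numpy as np


def augmentWithNoise(features, noiseScale=0.01, seed=0):
    """Returns the original rows followed by one noisy copy of each row."""
    rng = np.random.default_rng(seed)
    features = np.asarray(features, dtype=float)
    # Scale the noise by each column's spread so all columns are perturbed comparably
    columnStd = features.std(axis=0)
    noise = rng.normal(size=features.shape) * noiseScale * columnStd
    return np.vstack([features, features + noise])


# Hypothetical numeric features: year of construction and square feet
numericFeatures = [[2010, 700], [2015, 550], [2015, 700]]
augmented = augmentWithNoise(numericFeatures)
print(augmented.shape)  # (6, 2)
```

The targets would have to be duplicated in the same order, so that each noisy copy keeps the price of the row it was derived from.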

Do you think it is possible to augment our current dataset? If so, how would you do it?

### Model complexity

<img src="https://cdn-images-1.medium.com/max/1000/1*vuZxFMi5fODz2OEcpG-S1g.png" alt="Overfitting / Underfitting Loss" style="width: 600px;"/>

A model has different characteristics that determine how much information it can memorize. Reducing the complexity of your model by varying these parameters can improve its performance on new data.

For instance, in our previous model, you can use only two `if` conditions. The performance on the training set may be slightly lower, but the goal is to find the right generalization of the problem.

Exploring every possibility by changing the different hyperparameters can be time-consuming. It is more efficient to integrate this search into the training itself.
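
The exploration described above can be sketched as a loop over a single hyperparameter. The example below is an assumption-laden illustration, not the house model: it fits polynomials of increasing degree to synthetic noisy data with `np.polyfit` and records the error on a training and a testing split. The data, the degrees tried and the variable names are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a noisy straight line
x = np.linspace(-1.0, 1.0, 30)
y = 2.0 * x + rng.normal(scale=0.3, size=x.shape)

# Shuffle the indices, then keep 80% for training and 20% for testing
indices = rng.permutation(len(x))
trainIdx, testIdx = indices[:24], indices[24:]


def meanSquaredError(predictions, groundTruth):
    return float(np.mean((predictions - groundTruth) ** 2))


results = {}
for degree in [1, 3, 9]:  # the degree controls the complexity of the model
    coefficients = np.polyfit(x[trainIdx], y[trainIdx], degree)
    trainError = meanSquaredError(np.polyval(coefficients, x[trainIdx]), y[trainIdx])
    testError = meanSquaredError(np.polyval(coefficients, x[testIdx]), y[testIdx])
    results[degree] = (trainError, testError)
    print(degree, trainError, testError)
```

The training error can only decrease as the degree grows, because each larger polynomial contains the smaller ones. Comparing it with the testing error for each degree is what reveals when the extra complexity stops helping.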

### Regularization

The last solution that we will present is to not only minimize the loss, but also add a penalty that quantifies the complexity of the model. If the model becomes too complex, the penalty increases, and so does the sum of the loss and this **regularizer**. The best model is the one that minimizes both the loss and the penalty.

$$ \min Loss(model(data), truth) + complexity(model)$$

This solution is faster than exploring the hyperparameter space because it acts directly during training and does not require fitting multiple models with different hyperparameters.
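
One common instance of this idea is L2 regularization of a linear model, often called ridge regression, where the complexity term is the squared norm of the weights. The sketch below assumes that choice; the function name `fitRidge`, the parameter `penaltyStrength` and the synthetic data are hypothetical and only illustrate the formula above.

```python
import numpy as np


def fitRidge(features, targets, penaltyStrength):
    """Minimizes ||Xw - y||^2 + penaltyStrength * ||w||^2 in closed form."""
    features = np.asarray(features, dtype=float)
    targets = np.asarray(targets, dtype=float)
    nFeatures = features.shape[1]
    # Normal equations with the penalty added on the diagonal
    gram = features.T @ features + penaltyStrength * np.eye(nFeatures)
    return np.linalg.solve(gram, features.T @ targets)


# Hypothetical data: a linear relation with a little noise
rng = np.random.default_rng(0)
features = rng.normal(size=(20, 3))
targets = features @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=20)

weightsNoPenalty = fitRidge(features, targets, penaltyStrength=0.0)
weightsWithPenalty = fitRidge(features, targets, penaltyStrength=10.0)
print(np.linalg.norm(weightsNoPenalty), np.linalg.norm(weightsWithPenalty))
```

With a larger penalty the fitted weights shrink toward zero, which is how the penalty limits the complexity of the model during training itself.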

In real applications, it is common to combine these techniques to obtain the best generalization.