# Cross Validation 

## 1. Introduction

In this exercise, we will walk through the whole machine learning development workflow. A common mistake (one we made intentionally during the previous days) is to learn the parameters of a prediction function and test it on the same data. What's wrong with that? A model evaluated on its own training data can look accurate simply because it has memorized that data, which tells us nothing about how well it will work on real data it has never seen before.

There are several techniques to avoid this. We could set aside a part of the training data and use it to evaluate the model trained on the rest of the data (= __Holdout Method__). But by reducing the training data, we risk losing patterns in the data set and increasing the error. __K-Fold cross validation__ helps us solve this problem.
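As a rough illustration of the Holdout Method, here is a minimal sketch using scikit-learn's `train_test_split`. The data is synthetic and the 80/20 split is an arbitrary choice made for this example; neither comes from the exercise.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration, not the housing data set)
rng = np.random.default_rng(0)
X = rng.uniform(500, 3000, size=(100, 1))
y = 100 * X[:, 0] + rng.normal(0, 10_000, size=100)

# Keep 20% of the rows aside; the model never sees them during training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

model = LinearRegression().fit(X_train, y_train)
holdout_r2 = model.score(X_test, y_test)  # R^2 measured on the held-out rows only
print(len(X_train), len(X_test), holdout_r2)
```

The score depends on which rows happen to land in the held-out part, which is one reason to repeat the procedure over several folds.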

In K-Fold cross validation, we split our data into k separate "folds". Then the Holdout Method is repeated k times, such that each time one of the k folds is the test subset and the other (k-1) folds are used together as the training set.
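To make the fold rotation concrete, the sketch below prints the train and test indices produced by scikit-learn's `KFold` on ten toy samples. The toy array and the choice of 5 folds are assumptions made only for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)  # 10 toy samples, one feature

kf_demo = KFold(n_splits=5)  # no shuffling here, so folds are contiguous blocks
test_indices = []
for fold, (train_idx, test_idx) in enumerate(kf_demo.split(X), start=1):
    # Each iteration holds out a different fold as the test subset
    print(f"fold {fold}: train={train_idx}, test={test_idx}")
    test_indices.extend(test_idx.tolist())
```

Every sample ends up in the test subset exactly once across the k iterations.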

__Note__ that this method does not depend on the model. In this example, we will use it with a Linear Regression, but you could use it with any method you want (KNN, Logistic Regression, ...).

The following image illustrates this algorithm.
<img src="https://scikit-learn.org/stable/_images/grid_search_cross_validation.png" style="width:50%;">

The general workflow to apply cross validation is always the same:
1. Instantiate the scikit-learn model you want to use (LinearRegression for today's exercise);
2. Instantiate the KFold class with the parameters you want;
3. Use the cross_val_score() function to measure the performance of your model.
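The three steps above can be sketched as follows. The data is synthetic and the parameter values (4 folds, `random_state=0`) are arbitrary assumptions for the example. No `scoring` argument is passed, so `cross_val_score` falls back on the estimator's default score, which is R² for a regressor.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data (an assumption for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(60, 1))
y = 3 * X[:, 0] + rng.normal(0, 1, size=60)

model = LinearRegression()                               # step 1: the model
folds = KFold(n_splits=4, shuffle=True, random_state=0)  # step 2: the folds
scores = cross_val_score(model, X, y, cv=folds)          # step 3: one score per fold

print(scores)
```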

## 2. Data exploration

In this section, we will analyse the sale price of houses in Iowa. We will try to predict the final sale price based on different criteria.

- Read the data from the file "saleprice_housing.txt" (__do not forget__ to specify the delimiter as "\t" for tab-separated values)
- Display the first 5 lines
- Generate 3 diagrams (scatter plots):
    1. The first one represents the "Garage Area" on the X-axis and the "SalePrice" on the Y-axis
    2. The second one represents the "Gr Liv Area" on the X-axis and the "SalePrice" on the Y-axis
    3. And the last one is the "Overall Cond" on the X-axis and the "SalePrice" on the Y-axis

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

data = pd.read_csv("data/saleprice_housing.txt", delimiter="\t")

## 3. Linear Regression with Scikit-Learn

- Use the LinearRegression() model from Scikit-Learn with the column "Gr Liv Area" as the feature and "SalePrice" as the target.
- Plot the regression line on the corresponding scatter plot.
- What would be the price for a house with "Gr Liv Area" equal to 2000?
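As a hint for the fitting and prediction steps, here is a minimal sketch on hypothetical, noiseless data (the areas and the price formula are invented for the example and are not taken from the housing file). Note that scikit-learn expects the features as a 2-D array, both in `fit` and in `predict`; with a DataFrame, selecting with double brackets such as `data[["Gr Liv Area"]]` gives a 2-D object.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical noiseless data: price = 100 * area + 50_000
area = np.array([[800], [1200], [1600], [2400]])  # shape (4, 1): rows x features
price = 100 * area[:, 0] + 50_000

reg = LinearRegression().fit(area, price)

# predict() also needs a 2-D input: one row, one column
predicted = reg.predict([[2000]])
print(reg.coef_[0], reg.intercept_, predicted[0])
```

Because the toy data lies exactly on a line, the fitted slope and intercept recover the formula used to generate it; real data will not behave this neatly.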

## 4. Cross Validation with Scikit-Learn 

Well done, you made a prediction (based on one feature). Now we will apply cross validation in order to measure the performance of our model.
- Create a `KFold` instance with the following parameters:
    - 5 folds;
    - shuffle = True;
    - random_state = 1 (so we all get the same values);
    - assign this object to the variable `kf`.

- Create a new instance of the class `LinearRegression` and assign it to the variable `lr`.

- Use the `cross_val_score()` function to run the cross validation over your folds:
    - use the `LinearRegression` instance `lr` you just created;
    - use the column "Gr Liv Area" as the feature and the column "SalePrice" as the target;
    - use the scoring parameter with the value "neg_mean_squared_error";
    - pass `kf` as the `cv` parameter;
    - the function returns an array with the negated MSE values (one for each fold);
    - assign the result to the variable `mses`.

- Compute the square root of the absolute value of each MSE and assign the result to the variable `rmses`.
- Compute the mean of the RMSE values and assign it to the variable `avg_rmses`.
- Compute the standard deviation of the RMSE values and assign it to the variable `std_rmses`.
- Print your results (`avg_rmses` and `std_rmses`).
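The "neg_mean_squared_error" scorer returns negative numbers because scikit-learn scorers follow a "greater is better" convention, so the MSE is negated. The sketch below shows the conversion to RMSE on a hypothetical array of scores (the three values are invented so the arithmetic is easy to check by hand).

```python
import numpy as np

# Hypothetical output of cross_val_score(..., scoring="neg_mean_squared_error")
neg_mses = np.array([-4.0, -9.0, -16.0])

# Drop the sign, then take the square root to get back to the target's units
rmses_demo = np.sqrt(np.abs(neg_mses))

print(rmses_demo)         # per-fold RMSE
print(rmses_demo.mean())  # average RMSE across folds
print(rmses_demo.std())   # spread of the RMSE across folds
```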

## 5. Explore different values for K

Well done! You just calculated the standard deviation and the mean of your Root Mean Squared Error (RMSE). But what does that mean? To answer this question, let's compare the `avg_rmses` and `std_rmses` for different values of K (number of folds).

- Using a for loop `for k in num_folds:`, compute the `avg_rmses` and `std_rmses` for each k in num_folds = [3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 100, 1000].
- For each iteration, print the result with the following command: `print(str(k), "folds: ", "AVG RMSE: ", str(avg_rmses), "STD RMSE: ", str(std_rmses))`.
- __hint__: do not hesitate to copy and paste from the previous exercise...
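The loop can be sketched as below on synthetic data. The data, the shorter list of k values and the `results` dictionary are assumptions made for this example; the exercise itself should use the housing data and the `num_folds` list given above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data (an assumption for illustration, not the housing data set)
rng = np.random.default_rng(0)
X = rng.uniform(500, 3000, size=(200, 1))
y = 100 * X[:, 0] + rng.normal(0, 20_000, size=200)

results = {}  # hypothetical helper: k -> (average RMSE, standard deviation of RMSE)
for k in [2, 5, 10, 50]:
    kf = KFold(n_splits=k, shuffle=True, random_state=1)
    mses = cross_val_score(LinearRegression(), X, y,
                           scoring="neg_mean_squared_error", cv=kf)
    rmses = np.sqrt(np.abs(mses))  # negated MSE -> RMSE, one value per fold
    results[k] = (rmses.mean(), rmses.std())
    print(k, "folds:", "AVG RMSE:", rmses.mean(), "STD RMSE:", rmses.std())
```

Keep in mind that `KFold` raises an error when the number of folds exceeds the number of samples, so large values of k only work on a data set with at least that many rows.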

It seems that as k becomes bigger, the average RMSE decreases but the standard deviation increases. The standard deviation grows because each test fold contains fewer rows, so the error measured on a single fold is noisier. In an ideal world, we would like both `avg_rmses` and `std_rmses` to be as small as possible. So we have to choose a k that offers a good compromise between a small `avg_rmses` and a small `std_rmses`.