## Random Forest Regression

Thanks to bagging, we get "smoother" boundaries: averaging many models reduces the variance. Now, we will use the Random Forest (RF) model, which adds feature bagging on top.

With bagging, we randomly choose which **samples** to train on. With RF, we also randomly choose which **features** to train on, but the principle stays the same! The number of features `d` can be tuned to optimize the model, but a common rule of thumb, with `D` the total number of features, is:
```python
import math
D = 13  # total number of features (13 in the housing dataset used below)
d = math.floor(D / 3)  # for regression
d = math.floor(D ** 0.5)  # for classification
```
Before going further, have a look at the docs and make sure you understand how the model works and what its options do (`max_depth`, `max_features`, `criterion`, ...):
- [Random Forest Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
- [Random Forest Regressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html)
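
As a minimal sketch of how the rule of thumb maps onto the `max_features` option: the value can be given as a count (int) or as a fraction of the features (float). The values `D = 13`, `n_estimators=50` and `random_state=10` are assumptions chosen for illustration only.

```python
import math
from sklearn.ensemble import RandomForestRegressor

D = 13  # assumed number of input features
d = max(1, math.floor(D / 3))  # regression rule of thumb

# max_features as an int: number of features considered at each split.
forest_by_count = RandomForestRegressor(n_estimators=50, max_features=d, random_state=10)

# max_features as a float: fraction of the features considered at each split.
forest_by_fraction = RandomForestRegressor(n_estimators=50, max_features=1 / 3, random_state=10)

print(d)
```

Note that recent scikit-learn versions default to `max_features=1.0` for the regressor (all features), so the rule of thumb only applies if you set it explicitly.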

Some questions you could ask yourself :
- What if most of the features are irrelevant ?
- What if your RF model randomly chooses only irrelevant features ?

Generally, RF can deal with this. However, in some cases it can be problematic. Later, we'll learn about an algorithm called boosting that fixes this problem.
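
To get a feeling for the first question, here is a small sketch on synthetic data (not the housing data): only one column out of ten carries signal, and we inspect `feature_importances_` after fitting. The data, `max_features=3` and the seeds are assumptions made for this illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(10)
n_samples, n_features = 300, 10
X = rng.normal(size=(n_samples, n_features))
# Only column 0 carries signal; the other nine columns are pure noise.
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=n_samples)

forest = RandomForestRegressor(n_estimators=50, max_features=3, random_state=10)
forest.fit(X, y)

# Importances sum to 1; a larger value means the feature was more useful for splitting.
importances = forest.feature_importances_
print(np.round(importances, 3))
print("most important column:", int(np.argmax(importances)))
```

Even though each split only sees 3 of the 10 features, the informative column should still dominate the importances in this toy setting.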

In this exercise, we will work with more than one feature: we already saw the "smoothing effect of ensembling" in the previous exercise, and RF makes no sense with just one feature, since there would be no subset of features to choose from.

For this exercise, we will work **on a housing price dataset**. Our goal is to predict the price of a house given some attributes about it.

Through this exercise, we are going to :
- Standardize the data
- Create a Linear Regression model
- Create a DecisionTree model
- Create a RF model
- Use cross-validation and the test score to compare the models

## Visualize the data

Before trying to solve this problem, have a look at the data : 
- check the head
- check the type of each column
- Try to understand the meaning of each column
- what's the shape of the dataset
- plot some data
- ...

As you will notice, the columns have no names. You can look at the dataset and its documentation on [this website](https://archive.ics.uci.edu/ml/machine-learning-databases/housing/) if you want more information. To help you, we create a variable `columns_name` in the following cell. It contains the names in the same order as they appear in the dataset.

Start by loading the data located in `data/housing.data` (you can use pandas `read_csv` method) and by renaming the columns with their corresponding names.
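
A minimal sketch of the loading pattern, assuming the file is whitespace-separated and has no header row. To stay self-contained, it reads a made-up three-column excerpt from memory instead of `data/housing.data`; `raw_text` and `toy_columns` are hypothetical names.

```python
import io
import pandas as pd

# Made-up three-column excerpt standing in for the real file.
raw_text = (
    "0.10  12.5  24.0\n"
    "0.25   0.0  21.5\n"
    "0.05   0.0  34.0\n"
)
toy_columns = ["crim", "zn", "medv"]

# sep=r"\s+" splits on any run of whitespace; header=None means there is no header row,
# and names=... assigns the column names in order.
toy_df = pd.read_csv(io.StringIO(raw_text), sep=r"\s+", header=None, names=toy_columns)

print(toy_df.head())
print(toy_df.dtypes)
print(toy_df.shape)
```

With the real file, you would pass the path instead of the `io.StringIO` object and `columns_name` instead of `toy_columns`.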

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score
np.random.seed(10)

columns_name = [
    'crim', # numerical
    'zn', # numerical
    'nonretail', # numerical
    'river', # BINARY
    'nox', # numerical
    'rooms', # numerical
    'age', # numerical
    'dis', # numerical
    'rad', # numerical
    'tax', # numerical
    'ptratio', # numerical
    'b', # numerical
    'lstat', # numerical
    'medv', # numerical -- this is the TARGET
]

# TO DO: add names to each column of the dataset



## Split and standardize your data

In this exercise, we will start by splitting our data into 2 subsets, and then standardize them:
- **training data** (X_train, y_train) : 70% of the dataset
- **test data** (X_test, y_test) : the rest of the dataset (30%).
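
A minimal sketch of a 70/30 split with `train_test_split`, on a toy DataFrame rather than the housing data. The column values and `random_state=10` are assumptions, so the resulting split is not guaranteed to match the expected outputs given later in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(10)
toy_df = pd.DataFrame({
    "rooms": rng.normal(6.0, 0.7, size=100),
    "river": rng.integers(0, 2, size=100),
    "medv": rng.normal(22.0, 5.0, size=100),
})

# Separate the features from the target before splitting.
X = toy_df.drop(columns="medv")
y = toy_df["medv"]

# test_size=0.3 keeps 30% of the rows for testing and 70% for training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=10)
print(X_train.shape, X_test.shape)
```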


In [3]:
# Your code here


Then, standardize `X_train` using `StandardScaler` from scikit-learn. Fit the scaler on `X_train` only, then use it to transform both `X_train` and `X_test`. Be careful: some columns should not be standardized, for example binary columns (0 or 1) and the target column `medv`.
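
One possible way to do this, sketched on toy data: select the non-binary columns, fit the scaler on the training rows only, and reuse it on the test rows. The DataFrames and the names `binary_columns` and `numeric_columns` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(10)

def make_toy_features(n_rows):
    """Build a toy feature table with two numeric columns and one binary column."""
    return pd.DataFrame({
        "rooms": rng.normal(6.0, 0.7, size=n_rows),
        "tax": rng.normal(400.0, 150.0, size=n_rows),
        "river": rng.integers(0, 2, size=n_rows),
    })

X_train = make_toy_features(70)
X_test = make_toy_features(30)

binary_columns = ["river"]
numeric_columns = [c for c in X_train.columns if c not in binary_columns]

scaler = StandardScaler()
X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()
# Fit on the training data only, so no information from the test set leaks in.
X_train_scaled[numeric_columns] = scaler.fit_transform(X_train[numeric_columns])
# Reuse the training mean and std to transform the test data.
X_test_scaled[numeric_columns] = scaler.transform(X_test[numeric_columns])

print(X_train_scaled.head())
```

The binary column is left untouched, and the test set is scaled with the statistics of the training set, so its mean and std will be close to, but not exactly, 0 and 1.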

In [5]:
# Your code here


Check your standardization by calculating the mean and std for each feature of `X_train`.
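
A small sketch of such a check on toy data. One detail worth knowing: `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas' `.std()` uses `ddof=1` by default, so the pandas default gives values slightly above 1. The toy columns are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(10)
toy_train = pd.DataFrame({
    "rooms": rng.normal(6.0, 0.7, size=70),
    "tax": rng.normal(400.0, 150.0, size=70),
})

scaled = pd.DataFrame(StandardScaler().fit_transform(toy_train), columns=toy_train.columns)

means = scaled.mean()
population_std = scaled.std(ddof=0)  # the quantity StandardScaler normalizes to 1
sample_std = scaled.std()            # pandas default (ddof=1), slightly above 1

print(means.round(6), population_std.round(6), sample_std.round(6), sep="\n")
```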

In [7]:
# Your code here


For `y_train` and `y_test`, replace each value by its logarithm. Why do we do that? Imagine you sell a house for 10.000Sk and realize you made an error of 5.000Sk. It's a big mistake, right? Now imagine you sell a house for 1.000.000Sk and make the same 5.000Sk error. Compared to the selling price, it's not that bad. Taking the log of the target column makes errors relative to the price rather than absolute.
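
The arithmetic behind this argument can be checked directly. The sketch below uses the two prices from the example above; the variable names are only illustrative.

```python
import numpy as np

prices = np.array([10_000.0, 1_000_000.0])
error = 5_000.0

# In the original units, the gap is the same for both houses.
absolute_gap = (prices + error) - prices

# In log units, the gap is log((price + error) / price), i.e. it depends on the relative error.
log_gap = np.log(prices + error) - np.log(prices)

print(absolute_gap)
print(np.round(log_gap, 4))  # roughly 0.405 for the cheap house and 0.005 for the expensive one
```

Applying `np.exp` to a prediction made in log units brings it back to the original price scale.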

In [10]:
# Your code here


## Random Forest Regressor
The data are split and standardized. We can start to use our models! We will compare the performance of 3 models:
- Random Forest Regressor 
- Linear Regression
- Decision Tree Regressor

For each model, you have to:
1. Initialize your model
2. Train your model on the training set
3. Make predictions on the test set
4. **graph 1** : scatter plot the Ytest on the X_axis and the predictions on the Y_axis
5. **graph 2** : plot the predictions and the Ytest
6. Compute the test score and the cross validation score

**Note** : the code for each model will be very similar... Do not hesitate to copy paste !
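
Steps 1 to 3 and 6 can be sketched as follows on synthetic data (not the housing data, so the numbers will differ from the expected outputs below). Computing the cross-validation score on the training set with `cv=5` is an assumption; for regressors, both `score` and the default `cross_val_score` scoring return the R² coefficient.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(10)
X = rng.normal(size=(300, 5))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.3, size=300)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=10)

model = RandomForestRegressor(n_estimators=50, random_state=10)  # 1. initialize
model.fit(X_train, y_train)                                      # 2. train
predictions = model.predict(X_test)                              # 3. predict

test_score = model.score(X_test, y_test)                         # 6. R^2 on the test set
cv_scores = cross_val_score(model, X_train, y_train, cv=5)       # 6. one R^2 per fold

print("CV:", cv_scores.mean())
print("test score:", test_score)
```

`cross_val_score` refits a fresh copy of the model on each fold, so it does not change the already fitted `model`.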

In [12]:
np.random.seed(10)
# TO DO:
# 1. Initialize the RandomForestRegressor with n_estimators=50
# 2. Train your model on the training set
# 3. Make predictions on the test set


# predictions[:5] should return : 
# array([3.24018446, 3.44032857, 3.23737663, 3.35909237, 3.03636152])

In [13]:
# TO DO:
# Scatter plot Ytest on the X_axis and the predictions on the Y_axis
# plot a line on the same graph with slope 1 and intercept 0 (the line y = x) --> Can you guess why we do that?

In [16]:
np.random.seed(10)
# TO DO:
# Print the cross_val_score and the score of the RF model



# You should get : 
# -> CV forest: 0.7564990017784684
# -> test score forest: 0.8555640751859941

## Linear Regression 
Well done! We will do exactly the same for the Linear Regression model. Quick reminder:
1. Initialize your model
2. Train your model on the training set
3. Make predictions on the test set
4. **graph 1** : scatter plot the Ytest on the X_axis and the predictions on the Y_axis
5. **graph 2** : plot the predictions and the Ytest
6. Compute the test score and the cross validation score
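
The two graphs could look like the sketch below, drawn here for a linear model on synthetic data. The data, figure size and colors are assumptions. In graph 1, the reference line is y = x: a perfect prediction falls exactly on it.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(10)
X_train = rng.normal(size=(70, 3))
X_test = rng.normal(size=(30, 3))
coefs = np.array([1.0, -0.5, 0.2])
y_train = X_train @ coefs + rng.normal(scale=0.2, size=70)
y_test = X_test @ coefs + rng.normal(scale=0.2, size=30)

predictions = LinearRegression().fit(X_train, y_train).predict(X_test)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# Graph 1: true values on the x-axis, predictions on the y-axis.
ax1.scatter(y_test, predictions)
limits = [y_test.min(), y_test.max()]
ax1.plot(limits, limits, color="red")  # the line y = x
ax1.set_xlabel("y_test")
ax1.set_ylabel("predictions")

# Graph 2: both series plotted against the sample index.
ax2.plot(y_test, label="y_test")
ax2.plot(predictions, label="predictions")
ax2.set_xlabel("sample index")
ax2.legend()

fig.tight_layout()
```

In a notebook, the figure is displayed automatically at the end of the cell.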


In [18]:
np.random.seed(10)
# TO DO:
# 1. Initialize the LinearRegression
# 2. Train your model on the training set
# 3. Make predictions on the test set


# predictions[:5] should return : 
# [3.40505004 3.44102604 3.41337444 3.07337911 2.93213621]

In [19]:
np.random.seed(10)
# TO DO:
# Scatter plot Ytest on the X_axis and the predictions on the Y_axis
# plot a line on the same graph with slope 1 and intercept 0 (the line y = x)




In [20]:
np.random.seed(10)
# TO DO:
# Print the cross_val_score and the score of the Linear Regression


# You should get : 
# -> CV baseline: 0.7299181763124764
# -> test score baseline: 0.7754525837588145

## Decision Tree Regressor
Again, repeat the previous steps for the DecisionTreeRegressor (train, predict, plot, measure).
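
When measuring a single tree, it can help to compare the training score with the test score. The sketch below, on synthetic data with assumed seeds, shows that a tree grown without a depth limit fits its training set almost perfectly while scoring lower on unseen data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(10)
X = rng.normal(size=(300, 5))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.3, size=300)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=10)

# Default max_depth=None: the tree keeps splitting until it cannot split further.
tree = DecisionTreeRegressor(random_state=10)
tree.fit(X_train, y_train)

train_score = tree.score(X_train, y_train)
test_score = tree.score(X_test, y_test)
print("train score:", train_score)
print("test score:", test_score)
```

This gap between the two scores is the high variance that bagging and Random Forests aim to reduce.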


In [22]:
np.random.seed(10)
# TO DO:
# 1. Initialize the DecisionTreeRegressor
# 2. Train your model on the training set
# 3. Make predictions on the test set


# predictions[:5] should return : 
# array([3.10009229 3.21084365 3.21084365 3.44041809 2.67414865])

In [23]:
np.random.seed(10)
# TO DO:
# Scatter plot Ytest on the X_axis and the predictions on the Y_axis
# plot a line on the same graph with slope 1 and intercept 0 (the line y = x)




In [24]:
np.random.seed(10)
# TO DO:
# Print the cross_val_score and the score of the Decision Tree


# You should get : 
# -> CV baseline: 0.6449516714366336
# -> test score baseline: 0.6228843203786647

## Conclusion

Congratulations 🎉🎉🎉! Based on the graphs, the cross-validation score and the test score of each model, which one is the best?