In this notebook, you will build three models and analyze them!

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# Open training data
houses = 

In [None]:
# Split into train / test groups
train, test = 

# Normalization

Because some models can be sensitive to the scale of the data, it is important to bring every dimension to a common scale. To ensure that, we can force each feature to have a mean of 0 and a standard deviation of 1 by computing:
$$X_{norm} = \frac{X - \mu}{\sigma}$$
with $\mu$ the mean of $X$ and $\sigma$ its standard deviation.
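
As a quick illustration of the formula on toy data (the array `x` below is made up and is not part of the housing dataset), a minimal sketch with NumPy could look like this:

```python
import numpy as np

# Hypothetical toy feature, not taken from the housing data
x = np.array([2.0, 4.0, 6.0, 8.0])

mu = x.mean()    # mean of the feature
sigma = x.std()  # population standard deviation (ddof=0)
x_norm = (x - mu) / sigma

# After the transformation, the mean is ~0 and the standard deviation is ~1
print(x_norm.mean(), x_norm.std())
```

Note that pandas uses `ddof=1` by default in `Series.std()`, while NumPy uses `ddof=0`. Either choice works as long as the same one is used everywhere.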

In [None]:
def applyNormalization(data, mean, std):
    """
        Given a mean and std, normalize the data
    """
    normalizedData = 
    return normalizedData

def normalize(data):
    """
        Normalize the given data
    """
    mean =
    std = 
    normalizedData = applyNormalization(data, mean, std)
    return normalizedData, mean, std

Now iterate this normalization over each column and save the mean and std in a dictionary. These values are needed to reverse the transformation and to apply the exact same transformation to a new set of data.

It is important to compute these values on the training set only, and to apply the same transformation to every other set that we will use, in order to make them comparable to the data used for training.
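
The idea can be sketched on toy data as follows. The frames `train_toy` and `test_toy` and their values are hypothetical, and the snippet only illustrates reusing the training statistics and reversing the transformation:

```python
import pandas as pd

# Hypothetical toy frames standing in for the train and test sets
train_toy = pd.DataFrame({"LotArea": [8000.0, 9500.0, 11000.0, 12500.0]})
test_toy = pd.DataFrame({"LotArea": [10000.0, 14000.0]})

# Statistics are computed on the training set only
params = {"mean": train_toy["LotArea"].mean(), "std": train_toy["LotArea"].std()}

# The same transformation is applied to the test set
test_norm = (test_toy["LotArea"] - params["mean"]) / params["std"]

# Reversing the transformation recovers the original values
test_back = test_norm * params["std"] + params["mean"]
print(test_back.tolist())
```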

In [None]:
savedTransformation = {}

## Train dataset

First, complete the following loop in order to iterate over each column of the training dataset, normalize each of them and save the parameters that you have computed in a dictionary.

In [None]:
columns = 
for column in columns:
    # Computes normalization
    normalizedData, mean, std =
    # Save the normalization in a dictionary
    # Actually we save a dictionary in a dictionary !
    savedTransformation[column] = {"std": std, "mean": mean} 
    # Write the new data in the dataset
    

Then, verify that the mean and standard deviation of the new columns are respectively 0 and 1.

## Test dataset

Finally, apply the same transformation as before to each column of the test dataset, using the mean and std saved from the training set.

Now, everything is ready for the first models!

# Modeling

[Sklearn](http://scikit-learn.org/stable/documentation.html) offers a lot of models for classification, regression, clustering, and much more! In addition to optimized code, it offers a standard way of training and testing models. For each model, you will have to do:

    from sklearn import model
    m = model() # some models accept the parameter n_jobs=-1 to run in parallel
    m.fit(trainData, trainLabels)
    m.score(testData, testLabels)

Which part of the data will you use for features and which one is the label?
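
As a hedged illustration of the fit / score pattern, here is a sketch on a made-up dataset. The frame `toy`, its columns, and the 80/20 split are assumptions chosen for the example and say nothing about how the housing data should be split:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy dataset: two features and a target column
rng = np.random.default_rng(0)
toy = pd.DataFrame({"featA": rng.normal(size=100), "featB": rng.normal(size=100)})
toy["target"] = 2.0 * toy["featA"] + 0.5 * toy["featB"]

# Features are every column except the target; the label is the target column
toyTrain, toyTest = toy.iloc[:80], toy.iloc[80:]
toyTrainData, toyTrainLabels = toyTrain.drop(columns="target"), toyTrain["target"]
toyTestData, toyTestLabels = toyTest.drop(columns="target"), toyTest["target"]

m = LinearRegression()
m.fit(toyTrainData, toyTrainLabels)
toyScore = m.score(toyTestData, toyTestLabels)  # R^2 for regressors
print(toyScore)
```

For regressors, `score` returns the coefficient of determination R^2, so a value close to 1 means the predictions closely match the labels.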

In [None]:
trainData, trainLabels =
testData, testLabels =

## Linear Regression

A linear regression tries to fit a line to the distribution of points. It is simpler to look at it in two dimensions. The following example plots the linear regression obtained when we try to predict `SalePrice` using only `OverallQual`.

In [None]:
# lmplot is a figure-level function: it creates its own figure
sns.lmplot(x = "OverallQual", y = "SalePrice", data = houses)
plt.show()

This line would be the result of the linear regression on this example. However, it is possible to compute it in multiple dimensions at the same time!

Do you already see a way to improve this model?

### Train & Test

In [None]:
# Import the model from sklearn
from sklearn.linear_model import LinearRegression

Create the first model and display its score on the training and testing sets.

In [None]:
lr = 

### Analysis of features

The linear regression puts a weight on each feature in order to compute the best prediction. The resulting model is easy to interpret: when the features share a common scale, it is enough to look at the weight of each feature and pick the one with the largest absolute value.

For instance, in the following model, which feature is the most important and which is the least important?
    
    price = 0.6 * numberRoom - 0.9 * yearConstruction 
        + 0.7 * squarefeet
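
One way to rank the features of such a formula is to sort them by the absolute value of their weight. The sketch below copies the weights from the example and assumes the three features are on the same scale:

```python
# Weights copied from the example formula
weights = {"numberRoom": 0.6, "yearConstruction": -0.9, "squarefeet": 0.7}

# Rank features by the absolute value of their weight, largest first
ranked = sorted(weights, key=lambda name: abs(weights[name]), reverse=True)
print(ranked)
```

A negative weight only changes the direction of the effect, not its strength, which is why the sign is dropped before sorting.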

In [None]:
def displayImportanceFeatures(dataset, importance):
    """
        Display the 10 most important features of
        the given dataset
    """
    sort = np.argsort(importance)[::-1]
    sort = sort[:10]
    plt.figure()
    sns.barplot(y = dataset.columns[sort], x = importance[sort], orient='h')
    plt.xlabel("Importance")
    plt.ylabel("Features")
    plt.show()

In [None]:
# Transform this value with numpy: a large negative weight matters as much as a large positive one
importance = lr.coef_ 

In [None]:
displayImportanceFeatures(houses, importance)

This model has interesting results, but we should continue to explore different models in order to see if we can reduce the loss.

Which feature is the most important for predicting the sale price of the house?

You have the following house. What would you have to change in your house to increase the prediction by $10,000?

In [None]:
yourHouse = test.iloc[48]
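
For a linear model, the change needed in a single feature can be derived directly from its weight. The sketch below uses made-up numbers and assumes that both the feature and the price were normalized with saved standard deviations. None of these values come from the housing data:

```python
# Hypothetical values, for illustration only
coef = 0.8            # weight of a normalized feature, predicting a normalized price
stdFeature = 1.4      # std of the feature saved during normalization
stdPrice = 80000.0    # std of the price saved during normalization
targetIncrease = 10000.0

# A change d in the raw feature moves the raw prediction by coef * d * stdPrice / stdFeature,
# so we solve that expression for d
requiredChange = targetIncrease * stdFeature / (coef * stdPrice)
print(requiredChange)
```

If neither the feature nor the price is normalized, the expression simplifies to `targetIncrease / coef`.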

You may want to analyze the impact of two features on the predicted price of your house. For a linear regression, the resulting surface is a plane, because the model assigns a single constant slope to each feature. For more complex models, the surface can have a different shape.

In [None]:
# In order to help you, here is a function that will plot the impact in 3d
from mpl_toolkits.mplot3d import Axes3D  # registers the 3d projection on older matplotlib

def plotInfluence(model, col1, col2, yourHouse, grid = 50): 
    """
        Plots the evolution of the model's prediction when
        varying the values of col1 and col2.
        yourHouse must contain exactly the features used to train the model.
    """
    x_values = np.linspace(houses[col1].min(), houses[col1].max(), grid)
    y_values = np.linspace(houses[col2].min(), houses[col2].max(), grid)
    copy = yourHouse.copy()
    
    toPredict = []
    for x in x_values:
        for y in y_values:
            copy[col1] = x
            copy[col2] = y
            toPredict.append(copy.copy().values.reshape((1, -1)))
            
    fig = plt.figure()
    ax = fig.add_subplot(projection='3d')
    # indexing="ij" matches the loop order above (x in the outer loop, y in the inner loop)
    X, Y = np.meshgrid(x_values, y_values, indexing="ij")
    Z = np.reshape(model.predict(np.concatenate(toPredict)), (grid, grid))
    ax.plot_surface(X, Y, Z)
    ax.set_xlabel(col1)
    ax.set_ylabel(col2)
    ax.set_zlabel("Predictions")
    plt.show()

In [None]:
# Columns to select
col1 = "OverallQual"
col2 = "LotArea"
plotInfluence(lr, col1, col2, yourHouse)

## Regression Tree

An interesting second model is a decision tree. Here is an example from the [notebook](https://www.kaggle.com/dansbecker/underfitting-overfitting-and-model-optimization):
![Tree](http://i.imgur.com/R3ywQsR.png)

The successive decisions allow the tree to fit more complex curves than a single line.
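
To make this concrete, here is a small sketch on made-up one-dimensional data that follows a curve. The data and the choice of `max_depth=3` are assumptions for the illustration only:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Hypothetical one-dimensional data following a curve, not a line
x = np.linspace(-3, 3, 200).reshape(-1, 1)
y = x.ravel() ** 2

# Score (R^2) of each model on the data it was fitted on
lineScore = LinearRegression().fit(x, y).score(x, y)
treeScore = DecisionTreeRegressor(max_depth=3).fit(x, y).score(x, y)
print(lineScore, treeScore)
```

The tree approximates the curve with a staircase of constant values, while a single line cannot follow a symmetric curve at all.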

### Train & Test

Implement the following model and observe how varying the two most important features affects the prediction, compared with the linear regression.

In [None]:
from sklearn.tree import DecisionTreeRegressor

In [None]:
# Train


In [None]:
# Plot 2d 


### Impact of depth

Decision trees have many parameters. Here we will only look at `max_depth`, the height of the tree, i.e. the maximum number of successive comparisons between the root and a final decision.
By how much does the number of nodes increase at each level (assuming the tree is complete)?
If you don't have an intuition, try to write a loop that computes it for you!

In [None]:
nodes = 1 # The root
for depth in range(10):
    print(depth, nodes)
    nodes = nodes + 
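
A result can be cross-checked against a direct count. The helper `completeTreeNodes` below is a hypothetical name introduced for this sketch, and it assumes a complete binary tree whose root is at depth 0:

```python
def completeTreeNodes(depth):
    """Total number of nodes in a complete binary tree of the given depth."""
    # Level d holds 2**d nodes, so the total is a sum over all levels
    return sum(2 ** level for level in range(depth + 1))

for depth in range(4):
    print(depth, completeTreeNodes(depth))
```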

How do you expect the performance to evolve as the depth increases?

Train a model for different depths and then plot the performance over the training and testing sets.

In [None]:
depths = list(range(5, 100, 15))
perfTest, perfTrain = [], []
for depth in depths:
    print(depth)
    # Trains model
    model = DecisionTreeRegressor(max_depth=depth)
    model.
    
    # Computes score
    scoreTest = 
    scoreTrain = 
    
    # Append the scores in the list
    perfTest.append(scoreTest)
    perfTrain.append(scoreTrain)

Display the evolution of the scores as a function of the depth, using the lists `depths`, `perfTest` and `perfTrain`.

Why should we look at only one tree? Let's explore a clever extension of the decision tree: the **random forest**!
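
Such a plot could be sketched as follows on made-up noisy data. All names prefixed with `toy`, the data itself, and the list of depths are assumptions for the illustration:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor

# Hypothetical noisy data, used only to illustrate the shape of the curves
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X.ravel()) + rng.normal(scale=0.3, size=300)
XTrain, XTest, yTrain, yTest = X[:200], X[200:], y[:200], y[200:]

toyDepths = [1, 2, 4, 8, 16, 32]
toyTrainScores, toyTestScores = [], []
for depth in toyDepths:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(XTrain, yTrain)
    toyTrainScores.append(tree.score(XTrain, yTrain))
    toyTestScores.append(tree.score(XTest, yTest))

# One curve per set, with the depth on the horizontal axis
fig, ax = plt.subplots()
ax.plot(toyDepths, toyTrainScores, label="train")
ax.plot(toyDepths, toyTestScores, label="test")
ax.set_xlabel("max_depth")
ax.set_ylabel("score (R^2)")
ax.legend()
```

A growing gap between the two curves is the usual sign that the deeper trees are memorizing noise in the training set.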

## Random forest

This model is an extension of the decision tree: you can think of it as a forest of trees. A few tricks, such as training each tree on a random sample of the data and of the features, are applied to make the model less sensitive to overfitting. In the end, many trees each give a prediction, and the final result is the average of all these predictions.
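
The averaging step can be checked on made-up data with scikit-learn's `RandomForestRegressor`, whose fitted trees are exposed in the `estimators_` attribute. The data and the choice of 20 trees are assumptions for this sketch:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical toy data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

forest = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y)

# Predict with each fitted tree separately, then average the predictions
perTree = np.stack([tree.predict(X[:5]) for tree in forest.estimators_])
averaged = perTree.mean(axis=0)

# The forest prediction matches the average of its trees
print(np.allclose(averaged, forest.predict(X[:5])))
```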

### Train & Test

In [None]:
from sklearn.ensemble import RandomForestRegressor
# For this model it is interesting to use n_estimators=100 
# Bonus: A better analysis can be made like in the previous point to see when the model begins to overfit
rf = 

### Analysis of importance

Display the importance of the different features. Is it consistent with the previous models?

In [None]:
displayImportanceFeatures(houses, rf.feature_importances_)

### Impact of the two first features

Analyze the impact of varying the two most important features on your house. What has changed compared to the previous models?

In [None]:
col1 = "..."
col2 = "..."
plotInfluence(rf, col1, col2, yourHouse)

## Comparison

In order to compare these different models, it is important to compare their performance on the test set, since the score on the training set rewards overfitting. However, if your goal is to explain to a seller which aspects of their house they should change, which model would you use?

# Kaggle

Do you want to take part in the Kaggle competition?

To do so, you have to make a prediction for the houses present in the file `kaggle.csv`, format it by selecting the `Id` and `SalePrice` columns, and save it in a file (an example is given in the file `sample_submission.csv`).
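
The formatting step could look like the sketch below. The ids and prices are made up, and an in-memory buffer stands in for the output file:

```python
import io
import pandas as pd

# Hypothetical ids and predictions; real values would come from kaggle.csv and your model
ids = [1461, 1462, 1463]
predictedPrices = [169000.0, 187500.0, 183000.0]

submission = pd.DataFrame({"Id": ids, "SalePrice": predictedPrices})

# An in-memory buffer is used here; passing a file name to to_csv writes to disk instead
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
print(buffer.getvalue())
```

If `SalePrice` was normalized before training, the predictions would need to be converted back with the saved mean and std before being written.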