### Machine Learning Technique 1 - Regression

# 1. Predict a value - Predict price of a house

This notebook demonstrates how to use basic scikit-learn functionality to predict a value.
As an example, we use a dataset from a Kaggle competition that works well for both basic and advanced regression techniques. The dataset contains housing data from the city of Ames, Iowa (USA).

- The dataset comes with 2 CSV files: a training CSV and a test CSV.
- The training CSV contains features of houses together with their prices. The goal is to create a model and predict the prices for the houses in the test CSV.
- The predicted prices for the test set can be submitted to Kaggle to participate in the competition.

Here, we only work on the training data and see whether we can build a reasonably accurate model.

This notebook encapsulates all important code into functions that are **easy to reuse in other notebooks or in your project**.

#### Data:

You can find the data in your OneDrive under `house_data/`

#### Links:

[Kaggle Competition](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/kernels)

[Explanations to the Dataset columns](https://ww2.amstat.org/publications/jse/v19n3/decock/DataDocumentation.txt)


# 2. Import Libraries

In [None]:
# Starting by importing our beloved libraries: pandas, numpy, matplotlib.pyplot

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline

# 3. Load the Data

In [None]:
# Load the data
house_train = pd.read_csv('../data/house_data/train.csv')
house_test = pd.read_csv('../data/house_data/test.csv')

In [None]:
# See the training data
house_train.head()

In [None]:
# See the test data
house_test.head()

# 4. Explore a little

In [None]:
# Notice that there are a lot of columns.
# Let's see how many.


# 5. Slice / Split the data into training and validation

Now, we are going to split our training data into 2 datasets: train and validation.

We learned in the course why it is important to have a training and validation set.

Just remember that when you use a machine learning algorithm, you basically create a model that needs to be trained on your data.
Essentially, you preprocess your data, give it to the algorithm and it will learn something.
In this case, we will give some columns of our housing data plus the prices from the training data.

The algorithm will try to find relations between, for example, the number of bedrooms, the size of the house and the price of the house.

Then, to test your model, you should ask it to predict the price of a house that is NOT part of the training data.
That way, you can check whether the model generalizes or has merely memorized the training data.
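The split described above can be sketched on a small toy table. This is only an illustration under assumptions: the `toy` DataFrame and its values are made up, and `random_state=42` is an arbitrary choice to make the shuffle repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the housing training CSV
toy = pd.DataFrame({
    'LotArea': np.arange(10) * 1000 + 5000,
    'OverallQual': [5, 6, 7, 5, 8, 6, 7, 9, 4, 6],
    'SalePrice': np.arange(10) * 10000 + 100000,
})

# Roughly 70% of the rows for training, 30% held out for validation.
# When given a single DataFrame, train_test_split returns two DataFrames.
toy_train, toy_validation = train_test_split(toy, test_size=0.3, random_state=42)

print('Size of train      : {}'.format(toy_train.shape[0]))
print('Size of validation : {}'.format(toy_validation.shape[0]))
```

Every row ends up in exactly one of the two parts, so the validation rows are never seen during training.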

In [None]:
# Split the data, 70% for training, 30% for validation
# If there is any red message, it is just a "DeprecationWarning", meaning that some function
# will be changed in the next version of the library
...

# train_test_split returns 2 values
...

In [None]:
print('Size of train      : {}'.format(house_x_train.shape[0]))
print('Size of validation : {}'.format(house_x_validation.shape[0]))

The data has been split. Now we are going to extract the column that we want to predict, which is `SalePrice`.

We do that because of how scikit-learn algorithms work: they typically expect the features and the target to be passed separately.
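One possible way to separate the target from the features is sketched below. The `toy` DataFrame is hypothetical; only the column name `SalePrice` comes from the dataset.

```python
import pandas as pd

# Hypothetical toy data; only the column name 'SalePrice' comes from the dataset
toy = pd.DataFrame({
    'LotArea': [8450, 9600, 11250],
    'OverallQual': [7, 6, 7],
    'SalePrice': [208500, 181500, 223500],
})

target_name = 'SalePrice'
toy_y = toy[target_name]                 # target: what we want to predict
toy_x = toy.drop(columns=[target_name])  # features: everything else

print(toy_x.columns.tolist())
print(toy_y.tolist())
```

Using `drop` returns a new DataFrame and leaves `toy` untouched, which makes the cell safe to re-run.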

In [None]:
# Extract the SalePrice as target y


# 6. Scale the values

Because we are doing basic regression, we will only select a few columns that contain integer or float values.
We need to scale those columns because of how regression algorithms work.

Essentially, most regression algorithms fit a linear (`y = c * x + i`) or polynomial (`y = c1 * x + c2 * x^2 + ... + i`) formula.

So if you use, for example, the number of bedrooms and the price in the same equation, you might run into numerical issues, because the number of bedrooms is in the range of 1 to 5 most of the time while prices can go into the millions.
For this reason, we scale or normalize the values.

Again, this is just an overview to show what is possible using scikit-learn.
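As a minimal sketch of the scaling idea, the snippet below standardizes a bedroom column and a price column with two separate scalers. `StandardScaler` is an assumption here (the notebook does not say which scaler to use), and the numbers are made up.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values: bedrooms are small numbers, prices are large numbers
bedrooms = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
prices = np.array([[100000.0], [150000.0], [200000.0], [250000.0], [300000.0]])

# One scaler for the features, one for the target, both fitted on training data only
x_scaler = StandardScaler().fit(bedrooms)
y_scaler = StandardScaler().fit(prices)

bedrooms_scaled = x_scaler.transform(bedrooms)
prices_scaled = y_scaler.transform(prices)

# After scaling, both columns have mean 0 and standard deviation 1
print(bedrooms_scaled.ravel())
print(prices_scaled.ravel())
```

Keeping a separate scaler for the prices matters later: it is the object that can map scaled predictions back to dollars.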

In [None]:
# Select a few features only
columns = ['MSSubClass', 'LotArea', 'OverallQual', 'OverallCond', '1stFlrSF']

In [None]:
# NOTE: Ignore any large red warnings; they are not important and do not break anything.



# Create two different scalers: fit one on all selected training columns and the other on the training prices



# 7. Creating, Training & Measuring the Model

Next, we are going to create multiple Machine Learning models, train them, test their performance and see what works best for us.

We provide multiple functions that will make it easier for you to get started.

**Performance:**

We will use the [coefficient of determination](https://en.wikipedia.org/wiki/Coefficient_of_determination) to measure the model's performance.
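This value is at most 1, with 1 being the best. A score of 0 corresponds to always predicting the mean price, and the score can become negative for a model that does worse than that.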
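To get a feeling for the metric, here is a small sketch using `sklearn.metrics.r2_score` on made-up numbers. It compares a perfect prediction, a prediction that is always the mean, and a deliberately bad prediction.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])

# Perfect predictions: residuals are all zero
perfect = r2_score(y_true, y_true)

# Always predicting the mean of y_true: no better than a constant baseline
mean_only = r2_score(y_true, np.full(4, y_true.mean()))

# Predicting the values in reverse order: worse than the constant baseline
reversed_pred = r2_score(y_true, y_true[::-1].copy())

print(perfect, mean_only, reversed_pred)
```

The regressors' `score` method in scikit-learn reports this same quantity.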

We will see that multiple algorithms can be used for regression. We have to measure them and see which model suits us best.
Note that we only cover the basics; each algorithm has many specific parameters and features that can be tuned to improve performance.

The next few cells will define a few functions.

In [None]:
# sklearn.cross_validation was removed from scikit-learn; its contents live in sklearn.model_selection
from sklearn.model_selection import KFold, cross_val_score

def train_and_evaluation(model, x_train, y_train):
    ''' Trains a regression model and evaluates its performance.
    
    Returns the scores array.
    '''
    ...
    
    print('Coefficient of determination on training set: {}'.format( ... ) )
          
    # create a k-fold cross validation iterator of k=5 folds
    # (the number of samples is no longer passed to KFold; it is inferred when splitting)
    cv = KFold(n_splits=5, shuffle=True, random_state=33)
    
    scores = cross_val_score(model, x_train, y_train, cv=cv)
    print('Average coefficient of determination using 5-fold cross-validation:', np.mean(scores))
    
    return scores

In [None]:
def plot_model(model, x, y, label='Model'):
    ''' Makes a simple plot of the predictions of a model.
    '''
    ...

## 7.1 Simple Linear Regression

Let's start with a simple linear regression model.

[Documentation](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html)
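As a rough sketch of what such a model looks like, the snippet below fits the `SGDRegressor` from the linked documentation on synthetic, already standardized data. The data, the `penalty='l2'` setting and `random_state=0` are illustrative assumptions, not the notebook's solution.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

# Synthetic data (an assumption): y is roughly 3 * x plus a little noise
rng = np.random.RandomState(0)
x = rng.normal(size=(200, 1))
y = 3.0 * x[:, 0] + 0.1 * rng.normal(size=200)

# Linear model trained with stochastic gradient descent and L2 regularization
sgd = SGDRegressor(penalty='l2', random_state=0)
sgd.fit(x, y)

# score() returns the coefficient of determination
train_r2 = sgd.score(x, y)
print('Learned slope:', sgd.coef_[0])
print('R^2 on training data:', train_r2)
```

Stochastic gradient descent is sensitive to feature scale, which is one reason the columns were scaled earlier.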

In [None]:
# Create the Model


# Train and evaluate


# Plot


In [None]:
# Create the Model


# Train and evaluate


# Plot


## 7.2 Support Vector Machines (SVM) 

This is another algorithm that can be used to perform regression.

[Documentation](http://scikit-learn.org/stable/modules/svm.html#regression)
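The effect of the kernel can be sketched on a small non-linear toy problem. The sine-shaped data and the value `C=10.0` are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.svm import SVR

# Toy non-linear data (an assumption): one feature, sine-shaped target
x = np.linspace(-3, 3, 100).reshape(-1, 1)
y = np.sin(x[:, 0])

# Same algorithm, two different kernels
svr_linear = SVR(kernel='linear', C=10.0).fit(x, y)
svr_rbf = SVR(kernel='rbf', C=10.0).fit(x, y)

linear_score = svr_linear.score(x, y)
rbf_score = svr_rbf.score(x, y)
print('Linear kernel R^2:', linear_score)
print('RBF kernel R^2   :', rbf_score)
```

On this curved target the RBF kernel can follow the shape, while the linear kernel is limited to a straight line.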

In [None]:
# Create the Model


# Train and evaluate


# Plot


In [None]:
# Create the Model


# Train and evaluate


# Plot


In [None]:
# Create the Model


# Train and evaluate


# Plot


## 7.3 Random Forest

This is yet another algorithm, but this time based on Decision Trees.

In this example it is interesting to see that we can find out which features / columns **are most important** to the algorithm for predicting the prices.

[Documentation](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesRegressor.html)
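A minimal sketch of reading feature importances from a tree ensemble follows. It uses `RandomForestRegressor` on synthetic data; the feature names and the data are hypothetical, and the target is constructed to depend mostly on the first feature.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (an assumption): the target depends mostly on the first feature
rng = np.random.RandomState(0)
feature_names = ['strong', 'weak', 'noise']  # hypothetical names
x = rng.uniform(size=(300, 3))
y = 10 * x[:, 0] + x[:, 1] + 0.1 * rng.normal(size=300)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(x, y)

# feature_importances_ holds one value per column; the values sum to 1
ranked = sorted(zip(feature_names, forest.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked:
    print('{:8s} {:.3f}'.format(name, importance))
```

`ExtraTreesRegressor` from the linked documentation exposes the same `feature_importances_` attribute.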

In [None]:
# Create the Model


# Train and evaluate


# Plot


In [None]:
# Here we print the importance of the features


We see that `OverallQual` weighs in at 60% when deciding the price of the house.
Also note that we should definitely use more features, but let's keep it simple here.

## 7.4 AdaBoost

And this is the final algorithm that we will try.

[Documentation](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostRegressor.html#sklearn.ensemble.AdaBoostRegressor)
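For orientation, here is a small sketch of `AdaBoostRegressor` on a synthetic curved target. The data and the settings `n_estimators=50` and `random_state=0` are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor

# Synthetic data (an assumption): a simple parabola
x = np.linspace(-2, 2, 200).reshape(-1, 1)
y = x[:, 0] ** 2

# AdaBoost combines many weak learners (shallow decision trees by default)
ada = AdaBoostRegressor(n_estimators=50, random_state=0).fit(x, y)

ada_score = ada.score(x, y)
print('R^2 on training data:', ada_score)
```

Each new weak learner focuses more on the samples that the previous ones predicted badly.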

In [None]:
# Create the Model


# Train and evaluate


# Plot


# 8. Model Evaluation

Let's finish by writing a small function that evaluates our models.

So far, we have trained and evaluated our models on the training data.
Now we will measure them on the validation data and get our final coefficient of determination.
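One way such an evaluation helper could look is sketched below. The name `evaluate_models`, the synthetic data and the two toy models are hypothetical; the idea is simply to call `score` on data the models have not seen during training.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split

def evaluate_models(models, x_validation, y_validation):
    """Return {name: R^2 on the validation data} for already fitted models."""
    return {name: model.score(x_validation, y_validation)
            for name, model in models.items()}

# Synthetic data (an assumption, not the housing data)
rng = np.random.RandomState(0)
x = rng.uniform(-1, 1, size=(300, 2))
y = 3 * x[:, 0] - 2 * x[:, 1] + 0.1 * rng.normal(size=300)
x_train, x_validation, y_train, y_validation = train_test_split(
    x, y, test_size=0.3, random_state=0)

toy_models = {
    'Random Forest': RandomForestRegressor(n_estimators=50, random_state=0).fit(x_train, y_train),
    'Adaboost': AdaBoostRegressor(random_state=0).fit(x_train, y_train),
}

validation_scores = evaluate_models(toy_models, x_validation, y_validation)
for name, score in sorted(validation_scores.items(), key=lambda item: item[1], reverse=True):
    print('{:15s} {:.3f}'.format(name, score))
```

A validation score that is clearly lower than the training score is a hint that the model overfits.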

In [None]:
models = {
    'Linear': model_linear,
    'Linear l2': model_linear_l2,
    'SVM-Linear': model_svr,
    'SVM-Poly': model_svr_poly,
    'SVM-RBF': model_svr_rbf,
    'Random Forest': model_rf,
    'Adaboost': model_ada
}



# 9. Predict Values

Finally, once you are happy with a model, you can predict the prices of new houses!
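The round trip from scaled predictions back to dollars can be sketched as follows. Everything here is an illustrative assumption: the made-up areas and prices, the use of `StandardScaler`, and `LinearRegression`, which is picked only because it fits this exactly linear toy data without randomness.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Made-up training data: price = 1000 * area + 50000
area = np.array([[50.0], [80.0], [100.0], [120.0], [150.0]])
prices = 1000.0 * area[:, 0] + 50000.0

# Separate scalers for features and target, fitted on the training data
x_scaler = StandardScaler().fit(area)
y_scaler = StandardScaler().fit(prices.reshape(-1, 1))

model = LinearRegression().fit(
    x_scaler.transform(area),
    y_scaler.transform(prices.reshape(-1, 1)).ravel())

# New houses must be scaled with the SAME feature scaler before predicting
new_area = np.array([[90.0], [110.0]])
prediction_scaled = model.predict(x_scaler.transform(new_area))

# The target scaler maps the scaled predictions back to dollars
prediction_dollars = y_scaler.inverse_transform(
    prediction_scaled.reshape(-1, 1)).ravel()

print('Predictions (scaled):', prediction_scaled)
print('Predictions ($):', prediction_dollars)
```

Note the `reshape(-1, 1)`: scalers expect a 2D array, while `predict` returns a 1D array.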

In [None]:
prediction = ...
print('Predictions (scaled): \n{}'.format(prediction)) # Those are the predictions scaled

# We can inverse transform the predictions to get the real dollar price
prediction = ...
print('\nPredictions ($): \n{}'.format(prediction))

# We can add the predictions as a column to our test data and then print.


# Print the house_test, note that many features are scaled.


# 10. Validation Curves: Plot scores to evaluate models (OPTIONAL)

This next section is more advanced.
It consists of plotting different things such as the scores and learning curves in order to visualize and evaluate our model.

[Documentation](http://scikit-learn.org/stable/modules/learning_curve.html)

In [None]:
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve
from sklearn.model_selection import ShuffleSplit


def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):
    """
    Generate a simple plot of the test and training learning curve.

    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.

    title : string
        Title for the chart.

    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.

    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.

    ylim : tuple, shape (ymin, ymax), optional
        Defines minimum and maximum yvalues plotted.

    cv : int, cross-validation generator or an iterable, optional
        Determines the cross-validation splitting strategy.
        Possible inputs for cv are:
          - None, to use the default 5-fold cross-validation (3-fold in scikit-learn versions before 0.22),
          - integer, to specify the number of folds.
          - An object to be used as a cross-validation generator.
          - An iterable yielding train/test splits.

        For integer/None inputs, if ``y`` is binary or multiclass,
        :class:`StratifiedKFold` used. If the estimator is not a classifier
        or if ``y`` is neither binary nor multiclass, :class:`KFold` is used.

        Refer :ref:`User Guide <cross_validation>` for the various
        cross-validators that can be used here.

    n_jobs : integer, optional
        Number of jobs to run in parallel (default 1).
    """
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt

In [None]:
# Plot the learning curve of the Random Forest model
from sklearn.model_selection import KFold

cv = KFold(n_splits=5, shuffle=True, random_state=33)
plot_learning_curve(model_rf, 'Random Forest', house_x_train[columns], house_y_train, cv=cv, n_jobs=4)

In [None]:
plot_learning_curve(model_svr_rbf, 'SVM RBF', house_x_train[columns], house_y_train, cv=cv, n_jobs=4)