In [18]:
%load_ext autoreload
%autoreload 2

from algebra import *
from cache import *
from costs import *
from features import *
from gradients import *
from helpers import *
from model import *
from splits import *

import numpy as np
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
import csv
import warnings
warnings.filterwarnings('ignore')


# Tutorial

Here we will see how to run a basic model and how to do a grid search over the model's parameters.

First, we define the directory paths. You can choose whether to load a subsample of the dataset or the full dataset by setting the `SUB_SAMPLE` constant to `True` or `False`:

In [19]:
SUB_SAMPLE = True
CACHE_DIR = "test/cache/" if SUB_SAMPLE else "cache/"
SUBMISSIONS_DIR = "test/submissions/" if SUB_SAMPLE else "submissions/"

Then, we load our data set:

In [20]:
y, x, ids = load_csv_data('data/train.csv', SUB_SAMPLE)

### Pre-processing

We now define how we want to process our dataset before doing any training. For this dataset we do the following preprocessing steps:
1. we remove all the `-999` values
2. we remove the outliers by clamping
3. we standardize our dataset
4. we drop the NaN features (`remove_nan_features`)
5. we do a polynomial expansion with the `degree` value passed as a hyperparameter

The functions called in this step are defined in the file `features.py`.

In [5]:
def clean_standardize_expand(y, x, h):
        
    degree = int(h['degree'])

    x = remove_errors(x)
    x = remove_outliers(x)
    x = standardize_all(x)
    x = remove_nan_features(x)
    x = build_poly(x, degree)

    return y, x
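
Since `features.py` is not shown in this tutorial, here is a minimal NumPy sketch of what the steps listed above could look like. It is not the project's implementation: the function name `preprocess_sketch`, the 3-standard-deviation clamping threshold, the zero imputation of missing values, and the column ordering of the expansion are all assumptions made for illustration.

```python
import numpy as np

def preprocess_sketch(x, degree):
    """Hypothetical stand-in for the pipeline above (not the code in features.py)."""
    x = np.asarray(x, dtype=float).copy()

    # 1. Treat the -999 sentinel as a missing value.
    x[x == -999] = np.nan

    # 2. Clamp each column to mean +/- 3 std (assumed threshold), ignoring NaNs.
    mean = np.nanmean(x, axis=0)
    std = np.nanstd(x, axis=0)
    x = np.clip(x, mean - 3 * std, mean + 3 * std)

    # 3. Standardize each column, guarding against zero variance.
    mean = np.nanmean(x, axis=0)
    std = np.nanstd(x, axis=0)
    std[std == 0] = 1.0
    x = (x - mean) / std

    # 4. Assumption: impute missing entries with 0, the column mean after
    #    standardization (the tutorial drops NaN features instead).
    x = np.nan_to_num(x, nan=0.0)

    # 5. Polynomial expansion: a bias column followed by x**1 ... x**degree.
    powers = [x ** d for d in range(1, degree + 1)]
    return np.hstack([np.ones((x.shape[0], 1))] + powers)
```

With `n` input features, the result has `1 + n * degree` columns, which is why the weight vector grows quickly as the degree increases.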

### Parameters exploration for a simple Least Squares model

Now we want to try some models with different parameters to see which one is the best for our problem.

We need to define our fitting function:

In [21]:
def least_squares_analytical(y, x, h):
    # `h` is not used here: the polynomial degree is already applied
    # in `clean_standardize_expand`.
    w = least_squares(y, x)
    
    return {
        'w': w,
        'mse': compute_mse(y, x, w)
    }

def mse(y, x, w):
    return {
        'mse' : compute_mse(y, x, w)
    }
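
The helpers `least_squares` and `compute_mse` come from the project's own modules and are not shown here. As a rough sketch of what such helpers typically compute, assuming an ordinary least-squares fit and a plain mean squared error (the project's version may differ, for example by a factor of 1/2), one could write the following. The `_sketch` names are hypothetical and chosen so they do not shadow the real functions.

```python
import numpy as np

def least_squares_sketch(y, tx):
    """Return the weights w minimizing ||y - tx @ w||^2."""
    # lstsq is more robust than inverting tx.T @ tx when the columns
    # are nearly collinear, which is common after a high-degree expansion.
    w, *_ = np.linalg.lstsq(tx, y, rcond=None)
    return w

def compute_mse_sketch(y, tx, w):
    """Mean squared error of the linear prediction tx @ w."""
    e = y - tx @ w
    return np.mean(e ** 2)
```

For example, fitting `y = 1 + 2 * t` on three points with a bias column recovers the weights `[1, 2]` and an error close to zero.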

Then we define the parameters we want to explore for our model. Since least squares is a simple model, the only hyperparameter to explore is the degree of the polynomial expansion. Let's see the results of this model with the degree varying from 10 to 13:

In [22]:
# where we store our results
cache = Cache(CACHE_DIR + 'Tutorial_Results')

# the parameters we want to try
hs = { 
    'degree': np.arange(10, 14), 
}

evaluate(
    clean = clean_standardize_expand, 
    fit   = fit_with_cache(least_squares_analytical, cache),  
    x     = x, 
    y     = y, 
    hs    = hs
)

[{'degree': 10,
  'w': array([-3.08785378e+00, -2.55094901e-01, -2.69387781e-01, -4.40360497e-02,
          3.22180321e-01, -4.25473762e-02, -9.31026075e-02,  4.09865122e-02,
         -5.22365049e-03, -8.47383602e-05,  4.13820917e-05,  2.92309566e-01,
         -6.81009438e-01, -5.44915567e-01,  4.07911791e-01,  7.49432146e-02,
         -7.46001940e-02,  8.52876491e-04,  6.06093493e-03, -1.22162964e-03,
          7.27929851e-05, -4.16238709e-02,  1.61502159e-02,  1.53711985e-01,
         -2.73094741e-01,  7.60526096e-03,  2.82576769e-01, -2.29951839e-01,
          7.87358517e-02, -1.27004232e-02,  7.93618152e-04,  9.03236040e-03,
          1.27048102e-01,  2.51074833e-01, -1.32402376e-01, -7.52288309e-02,
          3.48158492e-02,  1.12304889e-02, -4.23170628e-03, -5.24067243e-04,
          1.74828746e-04, -3.02130022e-01,  2.78895384e-01,  6.33260555e-01,
         -9.01032449e-01,  3.06380110e-02,  5.56720052e-01, -4.10768354e-01,
          1.34858752e-01, -2.17254333e-02,  1.39452928e

Now, if we look at `test/cache/Tutorial_Results.csv`, we can see that the best model is the one with the degree-13 polynomial expansion.
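
To make the grid search less of a black box, here is a minimal sketch of the kind of loop a function like `evaluate` might run. It is an assumption, not the project's code: the real `evaluate` may also split the data or use the cache differently. The names `grid_search_sketch` and `best_by_mse` are hypothetical, and "best" is taken here to mean the lowest `mse`.

```python
import itertools

def grid_search_sketch(clean, fit, y, x, hs):
    """Try every combination of the hyperparameter values in hs."""
    names = list(hs)
    results = []
    for values in itertools.product(*(hs[name] for name in names)):
        h = dict(zip(names, values))
        # Pre-process with the current hyperparameters, then fit.
        y_clean, x_clean = clean(y, x, h)
        fitted = fit(y_clean, x_clean, h)
        # Store the hyperparameters next to the fit results.
        results.append({**h, **fitted})
    return results

def best_by_mse(results):
    """Return the result with the smallest 'mse' entry."""
    return min(results, key=lambda r: r['mse'])
```

Each returned entry combines the hyperparameters with whatever the fitting function returns, which matches the shape of the output printed above (`'degree'`, `'w'`, and so on).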