## Make Kaggle Submission
According to the Kaggle site, our submission file needs to look like this:

ParcelId,201610,201611,201612,201710,201711,201712
10754147,0.1234,1.2234,-1.3012,1.4012,0.8642,-3.1412
10759547,0,0,0,0,0,0
etc.

In [3]:
import sys
sys.path.insert(0, './helper')
from helpers import read_in_dataset

import numpy as np
import pandas as pd
import gc
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23

CHUNK_SIZE = 50000

Load in the model we created in 05_FitFinalModel.ipynb to make our submission.

In [4]:
my_model = joblib.load('models/model.pkl')

### Create Naive Models for Comparison
Let's also create a couple of naive "models" to compare against ours. Since we want to minimize mean absolute error (MAE), the median of the target in the training set is a good naive model to try: among all constant predictions, the median is the one that minimizes MAE on that dataset. Another good baseline is predicting all zeros. The target here is the residual of Zillow's own model, which should ideally be zero, so predicting zero amounts to accepting that the Zillow model cannot be improved.
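The claim about the median can be checked numerically. The following is a minimal sketch on made-up data (the sample `y` is a hypothetical stand-in, not the Zillow target): it scans a grid of constant predictions and compares their MAE with the MAE of the median, the mean, and zero.

```python
import numpy as np

# Hypothetical heavy-tailed sample standing in for the training target.
rng = np.random.default_rng(0)
y = rng.standard_t(df=3, size=1001) * 0.1 + 0.01

def mae(y_true, constant):
    """Mean absolute error of predicting one constant for every row."""
    return np.mean(np.abs(y_true - constant))

# Try many constant predictions and keep the one with the lowest MAE.
candidates = np.linspace(y.min(), y.max(), 2001)
errors = np.array([mae(y, c) for c in candidates])
best_constant = candidates[errors.argmin()]

print('median          :', np.median(y))
print('best grid value :', best_constant)
print('MAE at median   :', mae(y, np.median(y)))
print('MAE at mean     :', mae(y, y.mean()))
print('MAE at zero     :', mae(y, 0.0))
```

No constant on the grid should beat the median, and the best grid value should sit right next to it.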

Any model that performs worse than either of these two naive models is totally, utterly USELESS! I've seen Kaggle submissions that don't clear this threshold, which is bonkers. It's a great lesson in remembering why you're doing what you're doing: there is no point concentrating on minimizing some loss function, or spending a bunch of time, energy, and compute tuning hyperparameters, when something is obviously wrong with your setup (i.e. you're not beating the naive models).

In [5]:
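class medianPredictor:
    """Naive baseline: always predict the median of the training target."""

    def fit(self, X, y=None):
        # X is the training target here (e.g. the logerror Series)
        self.med = X.median()
        return self

    def predict(self, X, y=None):
        return np.full(len(X), self.med)

class zeroPredictor:
    """Naive baseline: always predict 0."""

    def fit(self, X, y=None):
        return self

    def predict(self, X, y=None):
        return np.zeros(len(X))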

These are helper functions for generating the submission file. Since we're asked to make 6 predictions for each of the ~3 million properties, we need to make ~18 million predictions. Instead of building one giant dataframe with 18 million rows, I'll make roughly 50k predictions at a time to avoid using up all the memory on my machine. Since we wrote our code not to depend on the size of the data it is given, chunking has no effect on our final output.

In [6]:
def make_chunks(df, chunksize):
    """Generator to return chunks of a dataframe of a given size"""
    # Step through the rows; the last chunk may be shorter, and no empty
    # chunk is produced when len(df) is an exact multiple of chunksize.
    for start in range(0, len(df), chunksize):
        yield df.iloc[start:start + chunksize]
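As a quick sanity check of the chunking idea, here is a minimal sketch on a toy dataframe. `iter_chunks` is a hypothetical stand-in written for this example, not the notebook's helper. It shows that the chunks cover every row exactly once and that no empty chunk appears when the row count is an exact multiple of the chunk size.

```python
import pandas as pd

def iter_chunks(df, chunksize):
    """Yield consecutive row slices of at most `chunksize` rows (hypothetical helper)."""
    for start in range(0, len(df), chunksize):
        yield df.iloc[start:start + chunksize]

# Toy table with 10 rows standing in for the properties dataset.
toy = pd.DataFrame({'parcelid': range(10)})

sizes_uneven = [len(c) for c in iter_chunks(toy, 4)]  # 10 rows, chunks of 4
sizes_even = [len(c) for c in iter_chunks(toy, 5)]    # 10 rows, chunks of 5
rebuilt = pd.concat(list(iter_chunks(toy, 4)))        # glue the chunks back together

print(sizes_uneven)
print(sizes_even)
print(rebuilt.equals(toy))
```

An empty trailing chunk matters in practice because some models raise an error when asked to predict on zero rows.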

In [7]:
def add_date(df, dt):
    df['transactiondate'] = pd.to_datetime(dt)
    return df

Here's the code to actually make the submission file. It loops over the dates we're asked to predict on and, for each date, over chunks of the 2016 properties dataset, using a model (be it the median, zero, or GBM model) to generate the predictions.

In [14]:
def make_sub_file(model, chunksize):
    
    dates = ['2016-10-01', '2016-11-01', '2016-12-01', '2017-10-01', '2017-11-01', '2017-12-01']
    props = read_in_dataset('properties_2016')
    
    submission_df = pd.DataFrame(index=props.parcelid)
    
    for d in dates:
        props = add_date(props, d)
        for x in make_chunks(props, chunksize):
            preds = model.predict(x)
            ix = x.parcelid
            submission_df.loc[ix, pd.to_datetime(d).strftime('%Y%m')] = preds  # column label like '201610'
        print('Processed date {0}'.format(d))
    
    del props
    
    return submission_df.round(4).reset_index()
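To see the shape of the resulting table without loading the full properties file, here is a minimal sketch of the same loop on toy data. Everything in it is an assumption made for illustration: the three-row `props` table, the `taxamount` column values, and the `ConstantPredictor` class are hypothetical, and chunking is left out for brevity.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the properties table (hypothetical values).
props = pd.DataFrame({'parcelid': [10754147, 10759547, 10843547],
                      'taxamount': [1.0, 2.0, 3.0]})

class ConstantPredictor:
    """Hypothetical model that predicts the same value for every row."""

    def __init__(self, value):
        self.value = value

    def predict(self, X):
        return np.full(len(X), self.value)

dates = ['2016-10-01', '2016-11-01', '2016-12-01',
         '2017-10-01', '2017-11-01', '2017-12-01']
model = ConstantPredictor(0.0123456)

# One row per parcel, one column per prediction month.
submission = pd.DataFrame(index=props['parcelid'])
for d in dates:
    col = pd.to_datetime(d).strftime('%Y%m')  # e.g. '201610'
    batch = props.assign(transactiondate=pd.to_datetime(d))
    submission[col] = model.predict(batch)

submission = submission.round(4).reset_index()
print(submission)
```

The result has one identifier column followed by the six month columns, with predictions rounded to four decimal places.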

Make a directory to hold the submissions:

In [15]:
!mkdir -p submissions

### Just predict the Median

In [16]:
mp = medianPredictor()
mp.fit(read_in_dataset('train_2016').logerror)
make_sub_file(mp, CHUNK_SIZE).to_csv('submissions/median_submission.csv', index=False)
gc.collect() # because of memory issues, garbage collect

Processed date 2016-10-01
Processed date 2016-11-01
Processed date 2016-12-01
Processed date 2017-10-01
Processed date 2017-11-01
Processed date 2017-12-01


35755

### Just predict 0

In [18]:
zp = zeroPredictor()
# don't need to fit
make_sub_file(zp, CHUNK_SIZE).to_csv('submissions/zero_submission.csv', index=False)
gc.collect() # because of memory issues, garbage collect

Processed date 2016-10-01
Processed date 2016-11-01
Processed date 2016-12-01
Processed date 2017-10-01
Processed date 2017-11-01
Processed date 2017-12-01


42

### Use our Model

In [17]:
make_sub_file(my_model, CHUNK_SIZE).to_csv('submissions/model_submission.csv', index=False)
gc.collect() # because of memory issues, garbage collect

Processed date 2016-10-01
Processed date 2016-11-01
Processed date 2016-12-01
Processed date 2017-10-01
Processed date 2017-11-01
Processed date 2017-12-01


63