## Train and test notebook

In this notebook we train the final model on the whole training dataset and then evaluate it on the test data to get our final score.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime

In [2]:
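def metric(preds, actuals):
    # Root mean square percentage error (RMSPE), in percent.
    preds = preds.reshape(-1)
    actuals = actuals.reshape(-1)
    assert preds.shape == actuals.shape
    return 100 * np.linalg.norm((actuals - preds) / actuals) / np.sqrt(preds.shape[0])

def get_score(actuals, preds):
    # Score only the rows with non-zero actual sales.
    new_test = pd.DataFrame({'Actuals': actuals, 'Preds': preds})
    new_test = new_test.loc[new_test['Actuals'] != 0, :]
    # NOTE: metric() is declared as (preds, actuals) but is called here with
    # (actuals, preds), so the error is normalised by the predictions.
    return metric(np.array(new_test['Actuals']), np.array(new_test['Preds']))

def mean_encoder(df, col, target='Sales'):
    # Map each category in df[col] to the mean of `target` for that category.
    # The means come from the global training frame `no_closed_stores`.
    mean_encoded = no_closed_stores.groupby(col)[target].mean().to_dict()
    return df[col].map(mean_encoded)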
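
As a quick sanity check of what this kind of metric measures, here is a minimal sketch of the standard root mean square percentage error (RMSPE), which normalises each error by the actual value. The helper name `rmspe` and the toy arrays are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def rmspe(actuals, preds):
    """Root mean square percentage error, in percent (illustrative helper)."""
    actuals = np.asarray(actuals, dtype=float).reshape(-1)
    preds = np.asarray(preds, dtype=float).reshape(-1)
    keep = actuals != 0  # rows with zero actual sales cannot be scored
    rel_err = (actuals[keep] - preds[keep]) / actuals[keep]
    return 100 * np.sqrt(np.mean(rel_err ** 2))

# Made-up values: every scored prediction is off by 10% of the actual value.
toy_actuals = np.array([100.0, 200.0, 0.0, 400.0])
toy_preds = np.array([110.0, 180.0, 5.0, 440.0])
toy_score = rmspe(toy_actuals, toy_preds)
print(toy_score)
```

Because every scored row is off by the same relative amount, the result should come out at about 10, regardless of how large the individual sales figures are.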

In [14]:
no_null_sales = pd.read_csv('clean_data.csv', parse_dates = True)
# Assign the column directly so that it gets a datetime64 dtype (needed for .dt later).
no_null_sales['Date'] = pd.to_datetime(no_null_sales['Date'], format='%Y-%m-%d')
no_closed_stores = no_null_sales[no_null_sales['Open'] == 1]
stores = pd.read_csv('stores_light.csv')


In [15]:
# Keep only rows with positive sales; .copy() avoids chained-assignment warnings below.
no_closed_stores = no_closed_stores.loc[no_closed_stores['Sales'] > 0].copy()
no_closed_stores.loc[no_closed_stores['StateHoliday'] == 0, 'StateHoliday'] = '0'
no_closed_stores['Month'] = no_closed_stores['Date'].dt.month

In [16]:
no_closed_stores = no_closed_stores.merge(stores, how='left', on='Store')

In [10]:
rf_set = no_closed_stores.drop(['Date', 'Customers', 'SchoolHoliday'], axis = 1)

rf_set.loc[rf_set['Month'] != 12, 'Month'] = 0
rf_set.loc[rf_set['Month'] == 12, 'Month'] = 1
rf_set['Store'] = mean_encoder(rf_set, 'Store')
rf_set['StateHoliday'] = mean_encoder(rf_set, 'StateHoliday')

rf_set['StoreType'] = mean_encoder(rf_set, 'StoreType')
rf_set['Assortment'] = mean_encoder(rf_set, 'Assortment')

X = rf_set.drop('Sales', axis = 1)
y = rf_set['Sales']
mask = y > 1200
X = X[mask]
y = y[mask]

X = X.fillna(0)

from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(random_state = 42, max_depth = 20, n_estimators = 3000, 
                           min_samples_split = 15, max_samples = 0.7, max_features = 8)
rf.fit(X, y)
preds = rf.predict(X)
actuals = np.array(y)
print(f'Training score: {get_score(actuals, preds)}')

Training score: 14.783856308948387


In [11]:
import pickle

filename = 'model2.sav'
# Use a context manager so that the file is flushed and closed after writing.
with open(filename, 'wb') as f:
    pickle.dump(rf, f)
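
To reuse a saved model in a later session, it has to be read back with `pickle.load`. The sketch below shows the round trip with a small stand-in object instead of the fitted forest; the file name `toy_model.sav` and the dictionary are hypothetical. Pickles should only be loaded from trusted sources, and a scikit-learn model is best loaded with the same library version that saved it.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted model: any picklable object round-trips the same way.
toy_model = {"name": "toy", "n_estimators": 3}

# Hypothetical file name, written to a temporary directory for this example.
path = os.path.join(tempfile.mkdtemp(), "toy_model.sav")

with open(path, "wb") as f:
    pickle.dump(toy_model, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == toy_model)
```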

## Testing

Let's apply the model to the test data and find out what score we get in the competition!

In [None]:
# loading in and preparing the test data
test = pd.read_csv('data/test.csv', parse_dates=True)

# Assign the column directly so that it gets a datetime64 dtype (needed for .dt below).
test['Date'] = pd.to_datetime(test['Date'], format='%Y-%m-%d')
test['DayOfWeek'] = test['Date'].dt.weekday + 1

# 0 and 0.0 compare equal, so one comparison covers both.
test.loc[test.loc[:,'StateHoliday'] == 0, 'StateHoliday'] = '0'

test.drop(['Customers', 'SchoolHoliday'], axis = 1, inplace = True)

test.loc[:,'Month'] = test.loc[:,'Date'].dt.month

test.loc[test.loc[:,'Month'] != 12, 'Month'] = 0
test.loc[test.loc[:,'Month'] == 12, 'Month'] = 1

test = test.merge(stores, how = 'left', on = 'Store')

# Select 'Sales' before averaging so that non-numeric columns are never aggregated.
assortment_dict = no_closed_stores.groupby('Assortment')['Sales'].mean().to_dict()
storetype_dict = no_closed_stores.groupby('StoreType')['Sales'].mean().to_dict()
store_dict = no_closed_stores.groupby('Store')['Sales'].mean().to_dict()
holiday_dict = no_closed_stores.groupby('StateHoliday')['Sales'].mean().to_dict()

test['Store'] = test['Store'].map(store_dict)
test['StateHoliday'] = test['StateHoliday'].map(holiday_dict)
test['StoreType'] = test['StoreType'].map(storetype_dict)
test['Assortment'] = test['Assortment'].map(assortment_dict)

test.fillna(0, inplace = True) # just in case
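
Filling with 0 means that any category missing from the training data is encoded as zero average sales. One possible alternative, sketched here on made-up data, is to fall back to the global training mean instead. This is a suggestion rather than what the notebook does, and the frames `train` and `new_rows` are hypothetical.

```python
import pandas as pd

# Toy training data (made-up numbers).
train = pd.DataFrame({"Store": [1, 1, 2, 2],
                      "Sales": [100.0, 200.0, 300.0, 500.0]})
# Store 3 never appears in the training data.
new_rows = pd.DataFrame({"Store": [1, 2, 3]})

store_means = train.groupby("Store")["Sales"].mean()
global_mean = train["Sales"].mean()

# Unseen stores map to NaN first, then receive the global mean.
encoded = new_rows["Store"].map(store_means).fillna(global_mean)
print(encoded.tolist())
```

With this fallback, an unseen store is treated as an average store rather than as one that sells nothing.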

In [17]:
preds = rf.predict(test.drop(['Date', 'Sales'], axis = 1))
actuals = test['Sales']

get_score(actuals, preds)


18.825026651849814

We achieved a final score of 18.83%, earning us first place in the competition (second place scored 26%). Still, 18.83% is a big departure from our training score of 14.78%. Part of the gap is expected, since the training score was computed on the same rows the model was fitted on. There are probably other good reasons as well, which I hope to explore when I find the time (presumably when the course is over). For now, I have two main suspects:
- The general approach of training on randomly selected days instead of treating the problem as a time series. After all, yesterday's sales are a pretty good indicator of today's sales.
- Changes in stores over time. For example, a store may be remodeled and gain a sales boost. With our approach, mean encoding averages all sales, from both before and after the remodeling. When predicting the future, only the sales levels from after the remodeling are relevant in this example.
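
Both suspects could be probed along the following lines: hold out the most recent days for validation instead of a random sample, and compute the store means from a recent window only. This is a minimal sketch on made-up data, not something the notebook does; the frame `df`, the `cutoff` date, the three-day `window`, and the column name `StoreEnc` are all assumptions.

```python
import numpy as np
import pandas as pd

# Toy daily sales for two stores, both trending upwards (made-up numbers).
dates = pd.date_range("2015-01-01", periods=10, freq="D")
df = pd.DataFrame({
    "Date": dates.tolist() * 2,
    "Store": np.repeat([1, 2], len(dates)),
    "Sales": np.concatenate([np.linspace(100, 190, 10),
                             np.linspace(300, 390, 10)]),
})

# Hold out the most recent days rather than randomly selected days.
cutoff = pd.Timestamp("2015-01-08")
train = df[df["Date"] < cutoff]
valid = df[df["Date"] >= cutoff].copy()

# Mean-encode stores from the last `window` days before the cutoff only,
# so that older sales levels do not dilute the encoding.
window = 3
recent = train[train["Date"] >= cutoff - pd.Timedelta(days=window)]
recent_means = recent.groupby("Store")["Sales"].mean()
all_means = train.groupby("Store")["Sales"].mean()

valid["StoreEnc"] = valid["Store"].map(recent_means)
print(pd.DataFrame({"all_history": all_means, "recent_only": recent_means}))
```

On this toy data the recent-window means sit above the all-history means, which is the effect described in the second bullet: when sales levels shift over time, averaging over the whole history lags behind the current level.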