In [1]:
%autosave 0

Autosave disabled


# Rossmann: Random Forest (Stage2 - Train)

## Intro

Kaggle competition: [Rossmann Store Sales](https://www.kaggle.com/c/rossmann-store-sales)

## Usage

**Input parameters**
1. PROCESSED_TRAIN_CSV: The name of the file that stores the processed training data
1. MODEL_PKL: The output file used to store the generated model

**Output**
1. A file that stores the generated model, serialized with pickle

## Setup env

### Set global variables

In [2]:
!pwd

/opt/shared/notebooks


In [3]:
DATASETS_DIR = '../data'
MODELS_DIR = '../models'

In [4]:
# this cell is tagged `parameters`
PROCESSED_TRAIN_CSV = DATASETS_DIR + '/processed/tst-train.csv'
MODEL_PKL = MODELS_DIR + '/tst-model.pkl'

In [5]:
!wc -lc {PROCESSED_TRAIN_CSV} # show lines & bytes count

  1017210 174831981 ../data/processed/tst-train.csv


###  Install required packages

If the notebook runs on a clean machine, install the required packages with the following commands ...

In [6]:
#!curl https://raw.githubusercontent.com/andrea-gioia/boostrap.ai/master/fastai07colab	 | bash

In [7]:
#!pip list

If the notebook runs inside a conda virtual environment in which all the packages listed in the requirements file are already installed, just run a check with the following commands ...

In [8]:
!conda env list

# conda environments:
#
base                     /opt/conda
custom                *  /opt/conda/envs/custom



In [9]:
!python -V

Python 3.7.4


In [10]:
#!conda list

###  Dump environment

In [11]:
!python -V

Python 3.7.4


In [12]:
!conda env list

# conda environments:
#
base                     /opt/conda
custom                *  /opt/conda/envs/custom



In [13]:
#!conda list

In [14]:
#!pip list

### Import packages

In [15]:
%load_ext autoreload
%autoreload 2

%matplotlib inline

In [18]:
from fastai.imports import *
#from fastai.structured import *

#from pandas_summary import DataFrameSummary
from sklearn import model_selection
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from IPython.display import display

from sklearn import metrics

### Set random seed

In [19]:
# Set a seed value: 
seed_value= 42  


# Set `python` built-in pseudo-random generator at a fixed value: 
random.seed(seed_value) 

# Set `numpy` pseudo-random generator at a fixed value:
np.random.seed(seed_value) 

# Set `torch` pseudo-random generator at a fixed value:
torch.manual_seed(seed_value)
torch.backends.cudnn.deterministic = True 
torch.backends.cudnn.benchmark = False
    
if torch.cuda.is_available():
    torch.cuda.manual_seed(seed_value)
    torch.cuda.manual_seed_all(seed_value)

### Define shared functions

In [20]:
def display_all(df):
    with pd.option_context("display.max_rows", 1000, "display.max_columns", 1000): 
        display(df)

## Stage 2: train

### Define loss functions

In [21]:
def rmse(p,a): return math.sqrt(((a-p)**2).mean())

# returns a vector w where w_i = y_i^-2 if y_i != 0, 0 otherwise
# used to exclude from the final metric the cases where the target variable y is zero
def toWeight(y):
    w = np.zeros(y.shape, dtype=float)
    ind = y != 0
    w[ind] = 1./(y[ind]**2)
    return w


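def rmspe(p, a):
    # Root mean square percentage error of predictions p against actuals a;
    # rows with a == 0 get weight 0 (they still count in the mean's denominator)
    w = toWeight(a)
    return np.sqrt(np.mean(w * (a - p)**2))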
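
To make the behaviour of `rmspe` concrete, here is a small worked example on made-up numbers (the arrays are illustrative and not taken from the data set). It re-declares the two functions so that it runs on its own. As written, rows whose actual value is zero get weight 0 but still count in the denominator of the mean, so the result is lower than an RMSPE computed over the non-zero rows only.

```python
import numpy as np

def to_weight(y):
    # Same logic as toWeight above: 1 / y^2 where y != 0, else 0
    w = np.zeros(y.shape, dtype=float)
    ind = y != 0
    w[ind] = 1.0 / (y[ind] ** 2)
    return w

def rmspe(p, a):
    return np.sqrt(np.mean(to_weight(a) * (a - p) ** 2))

# Illustrative values only: the third row has zero actual sales
actual = np.array([100.0, 200.0, 0.0])
predicted = np.array([110.0, 180.0, 50.0])

# Mean taken over all three rows (the zero row contributes 0 to the sum)
score_all_rows = rmspe(predicted, actual)

# Mean taken over the two non-zero rows only
nonzero = actual != 0
score_nonzero_rows = rmspe(predicted[nonzero], actual[nonzero])

print(score_all_rows, score_nonzero_rows)  # approximately 0.0816 and 0.1
```

Both non-zero rows are off by 10%, so the second figure is the more intuitive one; which variant is appropriate depends on how the competition metric treats zero-sales rows.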

### Fill training set and validation set

In [22]:
df_processed_train = pd.read_csv(PROCESSED_TRAIN_CSV)
print('The input data frame {} size is {}\n'.format(PROCESSED_TRAIN_CSV, df_processed_train.shape))

The input data frame ../data/processed/tst-train.csv size is (1017209, 37)



With time series data, the validation set should not be drawn at random. Instead, the holdout data is generally the most recent data, as it would be in a real application. This issue is discussed in detail in [this post](https://www.fast.ai/2017/11/13/validation-sets/) on the fast.ai site.

One approach is to take the last 25% of rows (sorted by date) as our validation set.

In [23]:
#display_all(df_processed_train.tail().T)

In [24]:
df_processed_train = df_processed_train.loc[:, df_processed_train.columns != 'Date']
df_train, df_valid = model_selection.train_test_split(df_processed_train, test_size=.25, shuffle=False)
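
Note that `shuffle=False` only produces a chronological holdout if the rows of the CSV are already sorted by date. A minimal sketch of a split that sorts explicitly before cutting, assuming the frame still has its `Date` column (the helper name `chronological_split` and the toy frame are hypothetical):

```python
import math
import pandas as pd

def chronological_split(df, date_col="Date", valid_fraction=0.25):
    """Sort by date, then hold out the most recent `valid_fraction` of rows."""
    # mergesort is stable, so rows sharing a date keep their relative order
    ordered = df.sort_values(date_col, kind="mergesort").reset_index(drop=True)
    n_valid = math.ceil(len(ordered) * valid_fraction)
    split_at = len(ordered) - n_valid
    return ordered.iloc[:split_at], ordered.iloc[split_at:]

# Toy frame with deliberately unsorted dates (illustrative values)
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2015-01-03", "2015-01-01", "2015-01-04", "2015-01-02"]),
    "Sales": [30, 10, 40, 20],
})
toy_train, toy_valid = chronological_split(toy)
print(toy_train["Sales"].tolist(), toy_valid["Sales"].tolist())
```

In this notebook such a helper would have to run before the `Date` column is dropped.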

An even better option for picking a validation set is to use a time period of exactly the same length as the one the test set covers. This is not implemented yet; the commented-out line below is a starting point:

In [25]:
#TODO
#val_idx = np.flatnonzero((df.index<=datetime.datetime(2014,9,17)) & (df.index>=datetime.datetime(2014,8,1)))
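
One way the TODO could be filled in, sketched under assumptions: the frame still carries a `Date` column, the validation window is given by explicit start and end dates, and only rows strictly before the window are used for training so that no future data leaks in. The helper name `split_by_date_window` and the toy data are hypothetical.

```python
import pandas as pd

def split_by_date_window(df, start, end, date_col="Date"):
    """Validation = rows with start <= date <= end; training = rows before start."""
    dates = pd.to_datetime(df[date_col])
    valid_mask = (dates >= start) & (dates <= end)
    train_mask = dates < start
    return df[train_mask], df[valid_mask]

# Toy frame: eight consecutive days starting 2014-07-28 (illustrative values)
toy = pd.DataFrame({
    "Date": pd.date_range("2014-07-28", periods=8, freq="D"),
    "Sales": [10, 20, 30, 40, 50, 60, 70, 80],
})
window_train, window_valid = split_by_date_window(
    toy, pd.Timestamp("2014-08-01"), pd.Timestamp("2014-08-03")
)
print(len(window_train), len(window_valid))
```

Rows dated after the window are dropped from both sets here; whether to keep them for training is a design decision that depends on how strictly the validation set should mimic the test period.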

In [26]:
print('Train set size: {}; Validation set size: {}\n'.format(df_train.shape[0], df_valid.shape[0]))
df_train.describe(include='all').T

Train set size: 762906; Validation set size: 254303



| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Unnamed: 0 | 762906 | | | | 508795.0 | 293649.0 | 231.0 | 254497.0 | 508700.0 | 763234.0 | 1017210.0 |
| Store | 762906 | | | | 558.514 | 321.915 | 1.0 | 280.0 | 558.0 | 838.0 | 1115.0 |
| DayOfWeek | 762906 | | | | 3.99734 | 1.99709 | 1.0 | 2.0 | 4.0 | 6.0 | 7.0 |
| Customers | 762906 | | | | 634.863 | 464.663 | 0.0 | 405.0 | 610.0 | 839.0 | 7388.0 |
| Open | 762906 | | | | 0.832358 | 0.373548 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| Promo | 762906 | | | | 0.37833 | 0.484971 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| StateHoliday | 762906 | | | | 1.03981 | 0.259721 | 1.0 | 1.0 | 1.0 | 1.0 | 4.0 |
| SchoolHoliday | 762906 | | | | 0.181792 | 0.385673 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| StoreType | 762906 | | | | 2.2056 | 1.36467 | 1.0 | 1.0 | 1.0 | 4.0 | 4.0 |
| Assortment | 762906 | | | | 1.93649 | 0.99388 | 1.0 | 1.0 | 1.0 | 3.0 | 3.0 |


In [27]:
X_train = df_train.loc[:, df_train.columns != 'Sales']
y_train = df_train['Sales']

X_valid = df_valid.loc[:, df_valid.columns != 'Sales']
y_valid = df_valid['Sales']

X_train.shape, y_train.shape, X_valid.shape, y_valid.shape

((762906, 35), (762906,), (254303, 35), (254303,))

### Setup hyper parameters

In [28]:
# TODO ...

### Train model

In [29]:
rfm = RandomForestRegressor(n_jobs=-1)
%time rfm.fit(X_train, y_train)



CPU times: user 2min 46s, sys: 1.45 s, total: 2min 47s
Wall time: 53.5 s


RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=-1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)
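
The repr above shows `n_estimators=10` and `random_state=None`, the defaults of the installed scikit-learn version. The default number of trees changed to 100 in scikit-learn 0.22, so results can differ across versions unless it is set explicitly, and passing `random_state` makes the fit reproducible independently of the global NumPy seed. A hedged sketch on synthetic data (the toy arrays, the helper `fit_forest`, and the choice of 10 trees are illustrative assumptions, not tuned hyperparameters):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data, for illustration only
rng = np.random.RandomState(42)
X_toy = rng.rand(200, 3)
y_toy = X_toy[:, 0] * 10 + rng.rand(200)

def fit_forest(X, y, seed=42):
    # Pin the number of trees and the seed so that reruns build the same forest
    model = RandomForestRegressor(n_estimators=10, random_state=seed, n_jobs=-1)
    return model.fit(X, y)

# Two independent fits with the same seed should give matching predictions
pred_a = fit_forest(X_toy, y_toy).predict(X_toy)
pred_b = fit_forest(X_toy, y_toy).predict(X_toy)
print(np.allclose(pred_a, pred_b))
```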

### Evaluate model

In [30]:
def print_score(m, lossfunct=rmse):
    # Prints [train loss, validation loss, train R^2, validation R^2],
    # plus the OOB score when the model was fitted with oob_score=True
    lf_train = lossfunct(m.predict(X_train), y_train)
    lf_valid = lossfunct(m.predict(X_valid), y_valid)
    r2_train = m.score(X_train, y_train)
    r2_valid = m.score(X_valid, y_valid)
    res = [lf_train, lf_valid,  
           r2_train, r2_valid]
    if hasattr(m, 'oob_score_'): res.append(m.oob_score_)
    print(res)

In [31]:
print_score(rfm, lossfunct=rmspe)

[0.02639480588395337, 0.06847645742979377, 0.9974198433116129, 0.9797561541455969]
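
The four numbers are, in order, the training loss, the validation loss, the training R² and the validation R². A labelled variant can make such output easier to read. The sketch below is one possible shape, not part of the notebook: `score_report` takes the data explicitly instead of reading globals, and `ConstantModel` is a hypothetical stand-in that only mimics the `predict` and `score` methods of a fitted estimator.

```python
import math
import numpy as np

def rmse(p, a):
    return math.sqrt(((a - p) ** 2).mean())

def score_report(model, X_train, y_train, X_valid, y_valid, lossfunct=rmse):
    """Return the same figures as print_score, keyed by name."""
    report = {
        "loss_train": lossfunct(model.predict(X_train), y_train),
        "loss_valid": lossfunct(model.predict(X_valid), y_valid),
        "r2_train": model.score(X_train, y_train),
        "r2_valid": model.score(X_valid, y_valid),
    }
    if hasattr(model, "oob_score_"):
        report["oob_score"] = model.oob_score_
    return report

class ConstantModel:
    """Hypothetical stand-in that always predicts the same value."""
    def __init__(self, value):
        self.value = value

    def predict(self, X):
        return np.full(len(X), self.value, dtype=float)

    def score(self, X, y):
        # Coefficient of determination: 1 - SS_res / SS_tot
        y = np.asarray(y, dtype=float)
        ss_res = ((y - self.predict(X)) ** 2).sum()
        ss_tot = ((y - y.mean()) ** 2).sum()
        return 1.0 - ss_res / ss_tot

# Illustrative data: the constant 2.5 is the mean of both target vectors
X_a, y_a = np.zeros((4, 1)), np.array([1.0, 2.0, 3.0, 4.0])
X_b, y_b = np.zeros((2, 1)), np.array([2.0, 3.0])
report = score_report(ConstantModel(2.5), X_a, y_a, X_b, y_b)
print(report)
```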


### Save model

In [32]:
with open(MODEL_PKL, 'wb') as f:
    pickle.dump(rfm, f)
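
A later stage would need to read the model back. A minimal round-trip sketch, assuming plain `pickle` is used on both sides; the helper names `save_model` and `load_model` are hypothetical, and a small dictionary stands in for the fitted forest so that the example runs without training anything:

```python
import os
import pickle
import tempfile

def save_model(model, path):
    # The context manager closes the file even if pickling fails
    with open(path, "wb") as f:
        pickle.dump(model, f)

def load_model(path):
    with open(path, "rb") as f:
        return pickle.load(f)

# Placeholder object standing in for a fitted estimator
stand_in_model = {"n_estimators": 10, "note": "placeholder for a fitted estimator"}

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo-model.pkl")
    save_model(stand_in_model, demo_path)
    restored = load_model(demo_path)

print(restored == stand_in_model)
```

Pickle files should only be loaded from trusted sources, and a pickled scikit-learn model is best loaded with the same library version that produced it.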

In [24]:
# END