# Rossmann: XGBoost (Stage2 - Train)

## Intro

Kaggle Kernel: https://www.kaggle.com/paso84/xgboost-in-python-with-rmspe

## Usage

**Input parameters**
1. PROCESSED_TRAIN_CSV: The name of the file that stores the processed training data
1. MODEL_PKL: The output file used to store the generated model

**Output**
1. A file that stores the generated model, serialized with pickle


## Setup env

### Set global variables

In [1]:
!pwd

/opt/shared/notebooks


In [2]:
DATASETS_DIR = '../data'
MODELS_DIR = '../models'

In [3]:
# this cell is tagged `parameters`
PROCESSED_TRAIN_CSV = DATASETS_DIR + '/processed/tst-train.csv'
MODEL_PKL = MODELS_DIR + '/tst-model.pkl'

In [5]:
!wc -lc {PROCESSED_TRAIN_CSV} # show lines & bytes count

  1017210 174831981 ../data/processed/tst-train.csv


### Install required packages

If the notebook is run on a clean machine, install the required packages with the following commands ...

In [6]:
#!curl https://raw.githubusercontent.com/andrea-gioia/boostrap.ai/master/???	 | bash

In [7]:
#!pip list

If the notebook is run inside a conda virtual environment in which all the packages listed in the requirements file are already installed, just run the following commands as a check ...

In [8]:
!conda env list

# conda environments:
#
base                     /opt/conda
custom                *  /opt/conda/envs/custom



In [9]:
!python -V

Python 3.7.3


In [10]:
#!conda list

### Dump environment

In [29]:
!python -V

Python 3.7.3


In [30]:
!conda env list

# conda environments:
#
base                     /opt/conda
custom                *  /opt/conda/envs/custom



In [31]:
conda env export

name: custom
channels:
  - pytorch
  - fastai
  - conda-forge
  - defaults
dependencies:
  - _py-xgboost-mutex=2.0=cpu_0
  - asn1crypto=0.24.0=py37_0
  - attrs=19.1.0=py37_1
  - backcall=0.1.0=py37_0
  - beautifulsoup4=4.7.1=py37_1
  - blas=1.0=mkl
  - bleach=3.1.0=py37_0
  - bottleneck=1.2.1=py37h035aef0_1
  - ca-certificates=2019.5.15=0
  - certifi=2019.3.9=py37_0
  - cffi=1.12.3=py37h2e261b9_0
  - chardet=3.0.4=py37_1
  - cryptography=2.6.1=py37h1ba5d50_0
  - cudatoolkit=10.0.130=0
  - cycler=0.10.0=py37_0
  - cymem=2.0.2=py37hfd86e86_0
  - cytoolz=0.9.0.1=py37h14c3975_1
  - dataclasses=0.6=py_0
  - dbus=1.13.6=h746ee38_0
  - decorator=4.4.0=py37_1
  - defusedxml=0.6.0=py_0
  - dill=0.2.9=py37_0
  - entrypoints=0.3=py37_0
  - expat=2.2.6=he6710b0_0
  - fastai=1.0.53.post2=1
  - fastprogress=0.1.21=py_0
  - fontconfig=2.13.0=h9420a91_0
  - freetype=2.9.1=h8a8886c_1
  - glib=2.56.2=hd408876_0
  - gmp=6.1.2=h6c8ec71_1
  - gst-plugins-base=1.14.0=hbbd80ab_1
  - gstreamer=1.14.0=hb453b48

In [None]:
#!conda list

In [None]:
#!pip list

### Import packages

In [11]:
%load_ext autoreload
%autoreload 2

%matplotlib inline

In [35]:
from fastai.imports import *
import sys
import random  # used explicitly in the seed cell
import pickle
import datetime
import numpy as np
import pandas as pd
import torch  # used explicitly in the seed cell
import xgboost as xgb
from sklearn import model_selection

### Set random seed

In [36]:
# Set a seed value:
seed_value = 42

# Set `python` built-in pseudo-random generator at a fixed value: 
random.seed(seed_value) 

# Set `numpy` pseudo-random generator at a fixed value:
np.random.seed(seed_value) 

# Set `torch` pseudo-random generator at a fixed value:
torch.manual_seed(seed_value)
torch.backends.cudnn.deterministic = True 
torch.backends.cudnn.benchmark = False
    
if torch.cuda.is_available():
    torch.cuda.manual_seed(seed_value)
    torch.cuda.manual_seed_all(seed_value)

### Define shared functions

In [None]:
def display_all(df):
    with pd.option_context("display.max_rows", 1000, "display.max_columns", 1000): 
        display(df)

## Stage 2: train

### Define loss functions

In [13]:
def ToWeight(y):
    w = np.zeros(y.shape, dtype=float)
    ind = y != 0
    w[ind] = 1./(y[ind]**2)
    return w


def rmspe(yhat, y):
    w = ToWeight(y)
    rmspe = np.sqrt(np.mean( w * (y - yhat)**2 ))
    return rmspe


def rmspe_xg(yhat, y):
    y = y.get_label()
    y = np.exp(y) - 1
    yhat = np.exp(yhat) - 1
    w = ToWeight(y)
    rmspe = np.sqrt(np.mean(w * (y - yhat)**2))
    return "rmspe", rmspe
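
As a worked example of the metric above: RMSPE is the square root of the mean squared relative error, and `ToWeight` gives rows with zero sales a weight of 0. The sketch below restates the two functions (as `to_weight` and `rmspe`) so that it runs on its own; the three sample values are made up for illustration. Note that the mean still divides by all rows, so zero-sales rows lower the score rather than being dropped.

```python
import numpy as np

def to_weight(y):
    # 1 / y^2 for non-zero targets, 0 where the target is 0
    w = np.zeros(y.shape, dtype=float)
    nonzero = y != 0
    w[nonzero] = 1.0 / (y[nonzero] ** 2)
    return w

def rmspe(yhat, y):
    # Root mean squared percentage error on the original sales scale
    return np.sqrt(np.mean(to_weight(y) * (y - yhat) ** 2))

y = np.array([100.0, 200.0, 0.0])       # made-up true sales
yhat = np.array([110.0, 180.0, 5.0])    # made-up predictions
score = rmspe(yhat, y)
# relative errors: -0.10, 0.10, and a zero-weight row -> sqrt((0.01 + 0.01 + 0) / 3)
print(round(float(score), 4))  # 0.0816
```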

### Fill training set and validation set

In [15]:
df_processed_train = pd.read_csv(PROCESSED_TRAIN_CSV)
print('The input data frame {} size is {}\n'.format(PROCESSED_TRAIN_CSV, df_processed_train.shape))

The input data frame ../data/processed/tst-train.csv size is (1017209, 37)



With time series data, the validation set should not be a random sample. Instead, the holdout data is generally the most recent data, as it would be in a real application. This issue is discussed in detail in [this post](https://www.fast.ai/2017/11/13/validation-sets/) on the fast.ai site.

One approach is to take the last 25% of rows (sorted by date) as our validation set.

In [16]:
df_processed_train = df_processed_train.loc[:, df_processed_train.columns != 'Date']
df_train, df_valid = model_selection.train_test_split(df_processed_train, test_size=.25, shuffle=False)
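
`train_test_split(..., shuffle=False)` keeps the rows in file order, so the split above is chronological only if the processed CSV is already sorted by date in ascending order. The notebook does not show that, so here is a minimal sketch that sorts explicitly before splitting. The helper name `chronological_split` and the toy frame are hypothetical, and the helper would have to run before the `Date` column is dropped.

```python
import math
import pandas as pd

def chronological_split(df, date_col="Date", valid_fraction=0.25):
    # Sort oldest-first so the tail of the frame holds the most recent rows.
    ordered = df.sort_values(date_col, kind="stable").reset_index(drop=True)
    n_valid = math.ceil(len(ordered) * valid_fraction)
    split_at = len(ordered) - n_valid
    return ordered.iloc[:split_at], ordered.iloc[split_at:]

# Toy data stored newest-first (made-up values).
toy = pd.DataFrame({
    "Date": pd.date_range("2015-01-01", periods=8, freq="D")[::-1],
    "Sales": [80, 70, 60, 50, 40, 30, 20, 10],
})
toy_train, toy_valid = chronological_split(toy)
print(len(toy_train), len(toy_valid))  # 6 2
```

With real data, many stores share the same date, so the cut can fall in the middle of a day; splitting on a date boundary avoids that.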

An even better option for picking a validation set is to use a period of exactly the same length as the one covered by the test set. This is not implemented yet; the commented-out line below is a starting point:

In [17]:
#TODO
#val_idx = np.flatnonzero((df.index<=datetime.datetime(2014,9,17)) & (df.index>=datetime.datetime(2014,8,1)))
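
The commented-out line above selects rows whose date falls between 2014-08-01 and 2014-09-17. A minimal sketch of that idea follows, assuming a `Date` column that is still present and parseable; the helper name `split_by_date_window` and the toy frame are hypothetical. Here the training part keeps only rows before the window, which is one possible choice to avoid training on data newer than the validation period.

```python
import pandas as pd

def split_by_date_window(df, start, end, date_col="Date"):
    dates = pd.to_datetime(df[date_col])
    start, end = pd.Timestamp(start), pd.Timestamp(end)
    in_window = (dates >= start) & (dates <= end)   # validation rows
    before_window = dates < start                   # training rows
    return df[before_window], df[in_window]

# Toy frame with one row per day (made-up values).
toy = pd.DataFrame({"Date": pd.date_range("2014-07-25", "2014-09-20", freq="D")})
toy["Sales"] = range(len(toy))
toy_train, toy_valid = split_by_date_window(toy, "2014-08-01", "2014-09-17")
print(len(toy_train), len(toy_valid))  # 7 48
```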

In [18]:
print('Train set size: {}; Validation set size: {}\n'.format(df_train.shape[0], df_valid.shape[0]))
#df_train.describe(include='all').T

Train set size: 762906; Validation set size: 254303



In [19]:
#df_train.head().T

In [20]:
#df_train.tail().T

In [21]:
#df_valid.head().T

In [22]:
#df_valid.tail().T

In [23]:
#features = ['Store', 'CompetitionDistance', 'CompetitionOpenSinceMonth', 'CompetitionOpenSinceYear', 'Promo', 'Promo2', 'Promo2SinceWeek', 'Promo2SinceYear', 'SchoolHoliday', 'DayOfWeek', 'month', 'day', 'year', 'StoreType', 'Assortment']
#X_train = df_train[features]
X_train = df_train.loc[:, df_train.columns != 'Sales']
y_train = np.log(df_train["Sales"] + 1)  # log(Sales + 1); rmspe_xg inverts it with exp(y) - 1

#X_valid = df_valid[features]
X_valid = df_valid.loc[:, df_valid.columns != 'Sales']
y_valid = np.log(df_valid["Sales"] + 1)  # same transform as for y_train


dm_train = xgb.DMatrix(X_train, y_train)
dm_valid = xgb.DMatrix(X_valid, y_valid)
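
The target is trained as `log(Sales + 1)`, and `rmspe_xg` undoes this with `exp(y) - 1` before scoring. A short sketch of that round trip, using NumPy's `log1p` and `expm1` (equivalent to the expressions in the notebook) on made-up sales values, including a zero:

```python
import numpy as np

sales = np.array([0.0, 1.0, 5263.0, 41551.0])  # made-up sales values
y = np.log1p(sales)        # same as np.log(sales + 1)
restored = np.expm1(y)     # same as np.exp(y) - 1
print(np.allclose(restored, sales))  # True
```

The `+ 1` keeps zero-sales rows finite, since `log(0)` is undefined.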

### Setup hyper parameters

In [24]:
watchlist = [(dm_valid, 'valid'), (dm_train, 'train')]

params = {"objective": "reg:linear",
          "eta": 0.3,
          "max_depth": 8,
          "subsample": 0.7,
          "colsample_bytree": 0.7,
          "silent": 1
          }
num_trees = 300
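
The parameter names above are the ones accepted by the XGBoost version used for this run. In later XGBoost releases, `reg:linear` was renamed to `reg:squarederror` and `silent` was replaced by `verbosity`. If the notebook is rerun on a newer release, an equivalent dictionary would look roughly like the sketch below; `params_newer` is a hypothetical name, and the details should be checked against the documentation of the installed version.

```python
# Hypothetical equivalent of `params` for newer XGBoost releases.
params_newer = {
    "objective": "reg:squarederror",  # newer name for "reg:linear"
    "eta": 0.3,
    "max_depth": 8,
    "subsample": 0.7,
    "colsample_bytree": 0.7,
    "verbosity": 0,                   # replaces "silent": 1
}
```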

### Train model

In [25]:
gbm = xgb.train(params, dm_train, num_trees, evals=watchlist, early_stopping_rounds=50, feval=rmspe_xg, verbose_eval=True)

[0]	valid-rmse:5.29132	train-rmse:5.27622	valid-rmspe:0.90469	train-rmspe:0.909458
Multiple eval metrics have been passed: 'train-rmspe' will be used for early stopping.

Will train until train-rmspe hasn't improved in 50 rounds.
[1]	valid-rmse:3.81946	train-rmse:3.73742	valid-rmspe:0.892515	train-rmspe:0.896094
[2]	valid-rmse:2.70989	train-rmse:2.61962	valid-rmspe:0.85756	train-rmspe:0.858045
[3]	valid-rmse:1.93575	train-rmse:1.83739	valid-rmspe:0.791365	train-rmspe:0.785987
[4]	valid-rmse:1.3987	train-rmse:1.29075	valid-rmspe:0.697942	train-rmspe:0.684082
[5]	valid-rmse:1.0285	train-rmse:0.909838	valid-rmspe:0.591205	train-rmspe:0.567349
[6]	valid-rmse:0.774563	train-rmse:0.645268	valid-rmspe:0.486362	train-rmspe:0.452599
[7]	valid-rmse:0.619591	train-rmse:0.472604	valid-rmspe:0.398685	train-rmspe:0.355395
[8]	valid-rmse:0.504042	train-rmse:0.345525	valid-rmspe:0.325564	train-rmspe:0.274253
[9]	valid-rmse:0.430986	train-rmse:0.259482	valid-rmspe:0.270717	train-rmspe:0.214798
[10]	val

[94]	valid-rmse:0.290842	train-rmse:0.057472	valid-rmspe:0.14608	train-rmspe:0.056108
[95]	valid-rmse:0.290807	train-rmse:0.057238	valid-rmspe:0.146013	train-rmspe:0.055883
[96]	valid-rmse:0.290747	train-rmse:0.057066	valid-rmspe:0.145931	train-rmspe:0.055703
[97]	valid-rmse:0.290717	train-rmse:0.056932	valid-rmspe:0.145891	train-rmspe:0.055564
[98]	valid-rmse:0.290634	train-rmse:0.056644	valid-rmspe:0.145805	train-rmspe:0.055293
[99]	valid-rmse:0.290544	train-rmse:0.056507	valid-rmspe:0.145719	train-rmspe:0.055163
[100]	valid-rmse:0.290536	train-rmse:0.056423	valid-rmspe:0.145707	train-rmspe:0.055099
[101]	valid-rmse:0.290514	train-rmse:0.056318	valid-rmspe:0.145672	train-rmspe:0.054992
[102]	valid-rmse:0.290496	train-rmse:0.056237	valid-rmspe:0.145646	train-rmspe:0.054907
[103]	valid-rmse:0.290479	train-rmse:0.056108	valid-rmspe:0.145565	train-rmspe:0.054775
[104]	valid-rmse:0.290469	train-rmse:0.056032	valid-rmspe:0.145513	train-rmspe:0.054703
[105]	valid-rmse:0.290438	train-rmse:0.

[188]	valid-rmse:0.291068	train-rmse:0.049804	valid-rmspe:0.14547	train-rmspe:0.048731
[189]	valid-rmse:0.291066	train-rmse:0.049767	valid-rmspe:0.145467	train-rmspe:0.048693
[190]	valid-rmse:0.291059	train-rmse:0.049742	valid-rmspe:0.145463	train-rmspe:0.048668
[191]	valid-rmse:0.291014	train-rmse:0.049666	valid-rmspe:0.145426	train-rmspe:0.048595
[192]	valid-rmse:0.290922	train-rmse:0.049617	valid-rmspe:0.14535	train-rmspe:0.048554
[193]	valid-rmse:0.290927	train-rmse:0.049587	valid-rmspe:0.145351	train-rmspe:0.048528
[194]	valid-rmse:0.290914	train-rmse:0.049507	valid-rmspe:0.145343	train-rmspe:0.048447
[195]	valid-rmse:0.290807	train-rmse:0.049459	valid-rmspe:0.145273	train-rmspe:0.048403
[196]	valid-rmse:0.290802	train-rmse:0.0494	valid-rmspe:0.145259	train-rmspe:0.048343
[197]	valid-rmse:0.2909	train-rmse:0.049366	valid-rmspe:0.145285	train-rmspe:0.048311
[198]	valid-rmse:0.290908	train-rmse:0.04935	valid-rmspe:0.145289	train-rmspe:0.048296
[199]	valid-rmse:0.290904	train-rmse:0.

[282]	valid-rmse:0.289782	train-rmse:0.046161	valid-rmspe:0.145041	train-rmspe:0.045405
[283]	valid-rmse:0.289787	train-rmse:0.046135	valid-rmspe:0.145046	train-rmspe:0.045381
[284]	valid-rmse:0.289814	train-rmse:0.046119	valid-rmspe:0.145057	train-rmspe:0.045365
[285]	valid-rmse:0.289814	train-rmse:0.046101	valid-rmspe:0.145052	train-rmspe:0.045347
[286]	valid-rmse:0.289643	train-rmse:0.046093	valid-rmspe:0.144964	train-rmspe:0.045337
[287]	valid-rmse:0.289613	train-rmse:0.046044	valid-rmspe:0.14496	train-rmspe:0.045293
[288]	valid-rmse:0.287914	train-rmse:0.046007	valid-rmspe:0.144364	train-rmspe:0.045254
[289]	valid-rmse:0.287911	train-rmse:0.04598	valid-rmspe:0.144359	train-rmspe:0.04523
[290]	valid-rmse:0.287912	train-rmse:0.045945	valid-rmspe:0.14436	train-rmspe:0.045201
[291]	valid-rmse:0.287867	train-rmse:0.045911	valid-rmspe:0.144348	train-rmspe:0.045167
[292]	valid-rmse:0.288124	train-rmse:0.04589	valid-rmspe:0.14448	train-rmspe:0.045148
[293]	valid-rmse:0.288121	train-rmse:0
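
The log reports that `train-rmspe` is used for early stopping. XGBoost monitors the last entry of `evals`, and `watchlist` ends with the training set, so the training error drives the stopping rule and the 50-round patience is unlikely to trigger. A minimal sketch of one way to fix the ordering follows, using a hypothetical helper `put_last` that works on any list of `(data, name)` pairs; plain strings stand in for the `DMatrix` objects here.

```python
def put_last(evals, monitored_name):
    # Move the pair called `monitored_name` to the end of the list,
    # keeping the relative order of the other pairs.
    others = [pair for pair in evals if pair[1] != monitored_name]
    monitored = [pair for pair in evals if pair[1] == monitored_name]
    if not monitored:
        raise ValueError("no entry named " + repr(monitored_name))
    return others + monitored

# Strings are placeholders for dm_valid and dm_train.
watchlist = [("dm_valid", "valid"), ("dm_train", "train")]
watchlist = put_last(watchlist, "valid")
print(watchlist)  # [('dm_train', 'train'), ('dm_valid', 'valid')]
```

With the validation set last, `xgb.train` would base early stopping on `valid-rmspe` instead.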

### Evaluate model

In [26]:
def print_score(m, lossfunct=rmspe):
    lf_train = lossfunct(np.exp(m.predict(xgb.DMatrix(X_train)))-1, np.exp(y_train)-1)
    lf_valid = lossfunct(np.exp(m.predict(xgb.DMatrix(X_valid)))-1, np.exp(y_valid)-1)
    res = [lf_train, lf_valid]
    #if hasattr(m, 'oob_score_'): res.append(m.oob_score_)
    print(res)

In [27]:
print_score(gbm)

[0.04493089660236272, 0.1444584758207349]


### Save model

In [28]:
with open(MODEL_PKL, 'wb') as f:  # the context manager closes the file after writing
    pickle.dump(gbm, f)
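
To use the saved model in a later stage, the pickle file can be read back with `pickle.load`. The sketch below shows the round trip with a plain dictionary standing in for the trained booster and a temporary file standing in for `MODEL_PKL`; both stand-ins are assumptions made so the example runs on its own.

```python
import os
import pickle
import tempfile

# Placeholder object; in the notebook this would be the trained `gbm`.
stand_in_model = {"name": "placeholder, not a real booster"}

# Temporary path; in the notebook this would be MODEL_PKL.
model_path = os.path.join(tempfile.mkdtemp(), "tst-model.pkl")

with open(model_path, "wb") as f:
    pickle.dump(stand_in_model, f)

with open(model_path, "rb") as f:
    loaded_model = pickle.load(f)

print(loaded_model == stand_in_model)  # True
```

Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.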

In [28]:
# END