# Rossmann: XGBoost (Stage 3 - Evaluate)

## Intro

Kaggle kernel: https://www.kaggle.com/paso84/xgboost-in-python-with-rmspe

## Usage

**Input parameters**
1. PROCESSED_TRAIN_CSV: The file that contains the processed training data
1. MODEL_PKL: The file that contains the serialized model to evaluate
1. METRICS_OUT: The output file in which the calculated metrics are saved

**Output**
1. A file that contains the calculated metrics
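
The `parameters` tag on the cell further down follows the convention used by papermill, so one possible way to run this stage with different inputs is sketched here. This assumes papermill is installed; the helper names `build_parameters` and `run_stage` and the notebook file names in the comment are hypothetical and do not come from this notebook.

```python
def build_parameters(tag, datasets_dir='../data', models_dir='../models',
                     metrics_dir='../metrics'):
    """Build the three input parameters, mirroring the defaults of the tagged cell."""
    return {
        'PROCESSED_TRAIN_CSV': f'{datasets_dir}/processed/{tag}-train.csv',
        'MODEL_PKL': f'{models_dir}/{tag}-model.pkl',
        'METRICS_OUT': f'{metrics_dir}/{tag}.metrics',
    }


def run_stage(notebook_path, output_path, tag):
    """Execute the notebook with injected parameters (assumes papermill is available)."""
    import papermill as pm  # imported lazily: papermill is an assumed dependency
    return pm.execute_notebook(notebook_path, output_path,
                               parameters=build_parameters(tag))


params = build_parameters('tst')

# Hypothetical invocation; both notebook paths are placeholders:
# run_stage('stage3-evaluate.ipynb', 'stage3-evaluate.out.ipynb', 'tst')
```

papermill inserts a new cell with these values right after the tagged cell, so the defaults defined in the notebook remain usable for interactive runs.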


## Setup env

### Set global variables

In [21]:
!pwd

/opt/shared/notebooks


In [22]:
DATASETS_DIR = '../data'
MODELS_DIR = '../models'
METRICS_DIR = '../metrics'

In [23]:
# this cell is tagged `parameters`
PROCESSED_TRAIN_CSV = DATASETS_DIR + '/processed/tst-train.csv'
MODEL_PKL = MODELS_DIR + '/tst-model.pkl'
METRICS_OUT = METRICS_DIR + '/tst.metrics'

### Install required packages

If the notebook runs on a clean machine, install the required packages with the following commands ...

In [24]:
#!curl https://raw.githubusercontent.com/andrea-gioia/boostrap.ai/master/???	 | bash

In [25]:
#!pip list

If the notebook runs inside a conda virtual environment in which all the packages listed in the requirements file are already installed, just run a check with the following commands ...

### Dump environment

In [26]:
!python -V

Python 3.7.3


In [27]:
!conda env list

# conda environments:
#
base                     /opt/conda
custom                *  /opt/conda/envs/custom



In [28]:
#!conda list

In [29]:
#!pip list

### Import packages

In [30]:
%load_ext autoreload
%autoreload 2

%matplotlib inline

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [31]:
from fastai.imports import *
import sys
import pandas as pd
from sklearn import model_selection
import xgboost as xgb
import pickle
import datetime
import numpy as np

### Set random seed

In [32]:
# Set a seed value: 
seed_value= 42  


# Set `python` built-in pseudo-random generator at a fixed value: 
random.seed(seed_value) 

# Set `numpy` pseudo-random generator at a fixed value:
np.random.seed(seed_value) 

# Set `torch` pseudo-random generator at a fixed value:
torch.manual_seed(seed_value)
torch.backends.cudnn.deterministic = True 
torch.backends.cudnn.benchmark = False
    
if torch.cuda.is_available():
    torch.cuda.manual_seed(seed_value)
    torch.cuda.manual_seed_all(seed_value)

### Define shared functions

In [None]:
# nope

## Stage 3: evaluate

### Define loss functions

In [11]:
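def ToWeight(y):
    """Return 1 / y**2 where y is non-zero and 0 elsewhere."""
    w = np.zeros(y.shape, dtype=float)
    ind = y != 0
    w[ind] = 1./(y[ind]**2)
    return w


def rmspe(yhat, y):
    """Root mean square percentage error; rows with y == 0 contribute zero error."""
    w = ToWeight(y)
    return np.sqrt(np.mean(w * (y - yhat)**2))


def rmspe_xg(yhat, y):
    """RMSPE as an XGBoost custom metric: y is a DMatrix with log(Sales + 1) labels."""
    y = y.get_label()
    # Undo the log(x + 1) transform before measuring the error
    y = np.exp(y) - 1
    yhat = np.exp(yhat) - 1
    return "rmspe", rmspe(yhat, y)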
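
As a quick sanity check of the weighting scheme above, the following minimal sketch restates the same computation on four made-up values (the numbers are illustrative only, not Rossmann data). The lowercase name `to_weight` is used here only to keep the example self-contained.

```python
import numpy as np


def to_weight(y):
    """Return 1 / y**2 where y is non-zero and 0 elsewhere."""
    w = np.zeros(y.shape, dtype=float)
    nonzero = y != 0
    w[nonzero] = 1.0 / (y[nonzero] ** 2)
    return w


def rmspe(yhat, y):
    """sqrt(mean(((y - yhat) / y)**2)), with zero-target rows contributing 0."""
    return np.sqrt(np.mean(to_weight(y) * (y - yhat) ** 2))


# Illustrative values: two rows are off by 10%, one row has a zero target,
# and one row is predicted exactly.
y = np.array([100.0, 200.0, 0.0, 50.0])
yhat = np.array([110.0, 180.0, 30.0, 50.0])

score = rmspe(yhat, y)
print(score)
```

The squared percentage errors are 0.01, 0.01, 0 and 0, so the result is sqrt(0.02 / 4). Note that the zero-target row is not penalised, but it still counts in the denominator of the mean, which lowers the score slightly compared with dropping such rows first.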

### Fill training set and validation set

In [12]:
df_processed_train = pd.read_csv(PROCESSED_TRAIN_CSV)
print('The input data frame {} size is {}\n'.format(PROCESSED_TRAIN_CSV, df_processed_train.shape))

df_processed_train = df_processed_train.loc[:, df_processed_train.columns != 'Date']
df_train, df_valid = model_selection.train_test_split(df_processed_train, test_size=.25, shuffle=False)
print('Train set size: {}; Validation set size: {}\n'.format(df_train.shape[0], df_valid.shape[0]))


#features = ['Store', 'CompetitionDistance', 'CompetitionOpenSinceMonth', 'CompetitionOpenSinceYear', 'Promo', 'Promo2', 'Promo2SinceWeek', 'Promo2SinceYear', 'SchoolHoliday', 'DayOfWeek', 'month', 'day', 'year', 'StoreType', 'Assortment']
#X_train = df_train[features]
X_train = df_train.loc[:, df_train.columns != 'Sales']
y_train = np.log(df_train["Sales"] + 1) # log(Sales + 1); inverted with exp(.) - 1 when computing metrics

#X_valid = df_valid[features]
X_valid = df_valid.loc[:, df_valid.columns != 'Sales']
y_valid = np.log(df_valid["Sales"] + 1) # log(Sales + 1); inverted with exp(.) - 1 when computing metrics


dm_train = xgb.DMatrix(X_train, y_train)
dm_valid = xgb.DMatrix(X_valid, y_valid)

The input data frame ../data/processed/tst-train.csv size is (1017209, 37)

Train set size: 762906; Validation set size: 254303
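
The notebook does not say why the target is transformed with `np.log(Sales + 1)`. A common reason, offered here as an assumption rather than the author's stated intent, is that the logarithm compresses a right-skewed target and makes a squared error on the log scale behave roughly like a relative error, which is closer to what RMSPE measures. The `+ 1` keeps rows with zero sales defined. The sketch below, on made-up values, only shows that the transform is exactly invertible, which is what `calculate_metrics` relies on.

```python
import numpy as np

# Illustrative sales values, including a zero (e.g. a closed store)
sales = np.array([0.0, 1.0, 5263.0, 41551.0])

# Forward transform used for the training target
log_sales = np.log(sales + 1)

# Inverse transform applied to predictions and targets before scoring
restored = np.exp(log_sales) - 1

# NumPy also provides numerically safer equivalents for small values
log_sales_alt = np.log1p(sales)
restored_alt = np.expm1(log_sales_alt)

print(np.allclose(restored, sales), np.allclose(restored_alt, sales))
```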



### Load model

In [13]:
with open(MODEL_PKL, 'rb') as fd:  # close the file handle once the model is loaded
    gbm = pickle.load(fd)

### Calculate metrics

In [14]:
def calculate_metrics(m, lossfunct=rmspe):
    lf_train = lossfunct(np.exp(m.predict(xgb.DMatrix(X_train)))-1, np.exp(y_train)-1)
    lf_valid = lossfunct(np.exp(m.predict(xgb.DMatrix(X_valid)))-1, np.exp(y_valid)-1)
    res = [lf_train, lf_valid]
    #if hasattr(m, 'oob_score_'): res.append(m.oob_score_)
    return res

In [15]:
metrics = calculate_metrics(gbm)
metrics #[0.04457844963361768, 0.06381738749662394]

[0.04493089660236272, 0.1444584758207349]

### Save metrics

In [16]:
with open(METRICS_OUT, 'w') as fd:
    fd.write('rmspe(train): {}\n'.format(metrics[0]))
    fd.write('rmspe(valid): {}\n'.format(metrics[1]))

In [17]:
METRICS_OUT

'../metrics/tst.metrics'

In [18]:
!cat {METRICS_OUT}

rmspe(train): 0.04493089660236272
rmspe(valid): 0.1444584758207349
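
The metrics file uses one `name: value` pair per line. If another stage needs to read it back, a minimal parser could look like the sketch below; `parse_metrics` is a hypothetical helper, not part of this notebook, and the sample text simply repeats the output shown above.

```python
def parse_metrics(text):
    """Parse 'name: value' lines into a dict of floats, skipping blank lines."""
    metrics = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        name, _, value = line.partition(': ')
        metrics[name] = float(value)
    return metrics


sample = (
    "rmspe(train): 0.04493089660236272\n"
    "rmspe(valid): 0.1444584758207349\n"
)

parsed = parse_metrics(sample)
print(parsed)
```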


In [None]:
# END