# 😼 TPS-JAN22, Quick EDA + CatBoost

The following model is a simple implementation using the CatBoost Regressor. In some of the public code available, I saw that this model was performing better than XGBoost.
The objective is to provide a simple framework and a baseline for more sophisticated implementations using CatBoost.

Below is the table of contents of the notebook:
1. [Installing & Loading Python Libraries.](#1) -- Install and load everything that's necessary for the model
2. [Auxiliary Functions.](#2) -- I define a few functions that will be used in the model
3. [Configuring the Notebook.](#3) -- I set the number of decimals and the default number of columns and rows displayed
4. [Importing the Information and Creating a DataFrame.](#4) -- Loading the CSV files into a DataFrame
5. [Exploring the Loaded Data (DataFrames).](#5) -- Review the loaded information to check that everything is right
6. [Engineering some Features.](#6) -- I will create functions to build features for the model
7. [Pre-Processing the Features for Training.](#7) -- Encode the features or apply required transformations
8. [Identifying Features for Training.](#8) -- Select the features that I'm going to use in the training steps
9. [Creates a Simple Train / Validation Strategy.](#9) -- Just training a simple model, mostly as a baseline
10. [Train a Simple Model (CatBoost Regressor).](#10)
11. [Train a Simple Model (CatBoost Regressor) using a CV Loop.](#11)
12. [Model Inference (Submission to Kaggle).](#12)


**Data Description** </br>
For this challenge, you will be predicting a full year worth of sales for three items at two stores located in three different countries. This dataset is completely fictional, but contains many effects you see in real-world data, e.g., weekend and holiday effect, seasonality, etc. The dataset is small enough to allow you to try numerous different modeling approaches.

Good luck!

The objective of this competition is the following.

**Objective** </br>
Using the data from 2015-2018, predict the sales by date, country, store, and product for 2019.

**Strategy** </br>
Because this is a time-series estimation problem, we need to hide future information from the model. In this simple approach, all the data from 2018 is used for validation, so the model is trained on data from 2015-2017.

**Update 01/02/2021**
* Developed a simple Notebook, Quick EDA + Simple Feature Engineering.
* Cross-Validation strategy based on a fixed date.
* Cross-Validation strategy based on Time Series K-fold.

**Ideas that I want to implement**
* Incorporate some type of Lag Features for the Model.
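
The lag-feature idea is not implemented in this notebook yet. As a rough sketch of what it could look like, the hypothetical helper `add_lag_features` below shifts the target inside each country / store / product series. It assumes one row per day per series with no gaps, so a shift of `n` rows equals a lag of `n` days; the store and product values in the toy data are placeholders. Note that for a full-year forecast the lag has to be at least as long as the forecast horizon, otherwise the lagged value is unknown at prediction time.

```python
import pandas as pd

def add_lag_features(df, lags=(365,), group_cols=("country", "store", "product"), target="num_sold"):
    """Add lagged copies of the target within each series (hypothetical helper)."""
    df = df.sort_values("date").copy()
    grouped = df.groupby(list(group_cols))[target]
    for lag in lags:
        # shift() moves values down by `lag` rows inside each group,
        # so the first `lag` rows of every series become NaN.
        df[f"{target}_lag_{lag}"] = grouped.shift(lag)
    return df

# Toy data: two series observed on three consecutive days.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2015-01-01", "2015-01-01", "2015-01-02",
                            "2015-01-02", "2015-01-03", "2015-01-03"]),
    "country": ["Finland", "Norway"] * 3,
    "store": ["A"] * 6,      # placeholder store name
    "product": ["X"] * 6,    # placeholder product name
    "num_sold": [10, 20, 11, 21, 12, 22],
})
lagged = add_lag_features(toy, lags=(1,))
print(lagged[["date", "country", "num_sold", "num_sold_lag_1"]])
```

The NaN values at the start of each series would need to be dropped or filled before training.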

<a name="1"></a>
# 1. Installing & Loading Python Libraries. 

In [None]:
# Installing a library to utilize the holidays as a feature
!pip install holidays

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
%%time
# Import the CatBoost Regressor model and the scikit-learn utilities.

from catboost import CatBoostRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit

from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

import holidays

<a name="2"></a>
# 2. Auxiliary Functions

In [None]:
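# Define a function to measure the model performance (SMAPE, expressed in percent).
def SMAPE(y_true, y_pred):
    # Use absolute values for both terms so the denominator is never negative.
    denominator = (np.abs(y_true) + np.abs(y_pred)) / 200.0
    diff = np.abs(y_true - y_pred) / denominator
    diff[denominator == 0] = 0.0  # Define the error as 0 when both values are 0.
    return np.mean(diff)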
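
To make the metric concrete, SMAPE is the mean of `200 * |y - ŷ| / (|y| + |ŷ|)`, so it ranges from 0 to 200. The sketch below is a standalone NumPy version (the lowercase name `smape` is mine, chosen to avoid clashing with the notebook function) applied to a few made-up numbers. It also shows that the metric is not symmetric: under-predicting by 10 units costs more than over-predicting by 10.

```python
import numpy as np

def smape(y_true, y_pred):
    """Standalone SMAPE in percent; rows where both values are 0 count as 0 error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denominator = (np.abs(y_true) + np.abs(y_pred)) / 200.0
    # Replace zero denominators by 1 before dividing, then force those rows to 0.
    safe = np.where(denominator == 0, 1.0, denominator)
    diff = np.where(denominator == 0, 0.0, np.abs(y_true - y_pred) / safe)
    return diff.mean()

score = smape([100, 200, 0], [110, 180, 0])   # made-up values
over = smape([100], [110])    # over-prediction by 10 units
under = smape([100], [90])    # under-prediction by 10 units
print(score, over, under)
```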

<a name="3"></a>
# 3. Configuring the Notebook.

In [None]:
%%time
# I like to disable my Notebook Warnings.
import warnings
warnings.filterwarnings('ignore')

In [None]:
%%time
# Configure notebook display settings to only use 2 decimal places, tables look nicer.
pd.options.display.float_format = '{:,.2f}'.format
pd.set_option('display.max_columns', 15) 
pd.set_option('display.max_rows', 50)

<a name="4"></a>
# 4. Importing the Information and Creating a DataFrame.
...

In [None]:
%%time
# Define the datasets locations...

TRN_PATH = '/kaggle/input/tabular-playground-series-jan-2022/train.csv'
TST_PATH = '/kaggle/input/tabular-playground-series-jan-2022/test.csv'
SUB_PATH = '/kaggle/input/tabular-playground-series-jan-2022/sample_submission.csv'

In [None]:
%%time
# Read the datasets and create dataframes...

train_df = pd.read_csv(TRN_PATH)
test_df = pd.read_csv(TST_PATH)
submission_df = pd.read_csv(SUB_PATH)

<a name="5"></a>
# 5. Exploring the Loaded Data (DataFrames).

In [None]:
%%time
# Explore the size of the dataset loaded...

train_df.info()

In [None]:
%%time
# Explore the first 5 rows to have an idea what we are dealing with...

train_df.head()

In [None]:
%%time
# Explore the size of the dataset loaded...

test_df.info()

In [None]:
%%time
# Explore the first 5 rows to have an idea what we are dealing with, in this case the Test Set...

test_df.head()

In [None]:
%%time
# Review some information for the categorical variables...

country_list = train_df['country'].unique()
store_list = train_df['store'].unique()
product_list = train_df['product'].unique()

print(f'Country List:{country_list}')
print(f'Store List:{store_list}')
print(f'Product List:{product_list}')

In [None]:
%%time
# Review if there is missing information in the dataset...

train_df.isnull().sum()

In [None]:
# Create a simple function to evaluate the time-ranges of the information provided.
# It will help with the train / validation separations

def evaluate_time(df):
    min_date = df['date'].min()
    max_date = df['date'].max()
    print(f'Min Date: {min_date} /  Max Date: {max_date}')
    return None

evaluate_time(train_df)
evaluate_time(test_df)

In [None]:
# Average Sales / Year (helps me understand whether there is an upward trend).
train_df['date'] = pd.to_datetime(train_df['date']) # Convert the date to datetime.
train_df['year'] = train_df['date'].dt.year
summary = train_df.groupby(['country', 'year'])['num_sold'].mean()
summary

<a name="6"></a>
# 6. Engineering some Features.

In [None]:
# Define the model Target for future reference.
TARGET = 'num_sold'

In [None]:
# Country List:['Finland' 'Norway' 'Sweden']
holiday_FI = holidays.CountryHoliday('FI', years=[2015, 2016, 2017, 2018, 2019])
holiday_NO = holidays.CountryHoliday('NO', years=[2015, 2016, 2017, 2018, 2019])
holiday_SE = holidays.CountryHoliday('SE', years=[2015, 2016, 2017, 2018, 2019])

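# Note: the three calendars are merged into a single lookup, so a date that is a
# holiday in any one of the countries is flagged as a holiday for all of them.
holiday_dict = holiday_FI.copy()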
holiday_dict.update(holiday_NO)
holiday_dict.update(holiday_SE)

train_df['date'] = pd.to_datetime(train_df['date']) # Convert the date to datetime.
train_df['holiday_name'] = train_df['date'].map(holiday_dict)
train_df['is_holiday'] = np.where(train_df['holiday_name'].notnull(), 1, 0)
train_df['holiday_name'] = train_df['holiday_name'].fillna('Not Holiday')

test_df['date'] = pd.to_datetime(test_df['date']) # Convert the date to datetime.
test_df['holiday_name'] = test_df['date'].map(holiday_dict)
test_df['is_holiday'] = np.where(test_df['holiday_name'].notnull(), 1, 0)
test_df['holiday_name'] = test_df['holiday_name'].fillna('Not Holiday')
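
The cell above looks up holidays by date only. A possible refinement, not what this notebook does, is to match on both country and date. The sketch below shows the idea with a small hand-written `holiday_table` (three illustrative rows; in practice the rows would be generated from the `holidays` package) and a hypothetical helper `add_country_holidays`.

```python
import pandas as pd

# Hand-written illustration table: one row per (country, date) holiday.
holiday_table = pd.DataFrame({
    "country": ["Norway", "Sweden", "Finland"],
    "date": pd.to_datetime(["2019-05-17", "2019-06-06", "2019-12-06"]),
    "holiday_name": ["Constitution Day", "National Day", "Independence Day"],
})

def add_country_holidays(df, holiday_table):
    """Left-join holidays on country and date (hypothetical helper)."""
    out = df.merge(holiday_table, on=["country", "date"], how="left")
    out["is_holiday"] = out["holiday_name"].notna().astype(int)
    out["holiday_name"] = out["holiday_name"].fillna("Not Holiday")
    return out

# The same date in three countries: only the Norwegian row is a holiday.
toy = pd.DataFrame({
    "country": ["Norway", "Sweden", "Finland"],
    "date": pd.to_datetime(["2019-05-17"] * 3),
})
flagged = add_country_holidays(toy, holiday_table)
print(flagged)
```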

In [None]:
# Create some simple features based on the Date field...

def create_time_features(df: pd.DataFrame) -> pd.DataFrame:
    """
    Create features based on the date variable; the idea is to extract as much 
    information as possible from the date components.
    Args
        df: Input data to create the features.
    Returns
        df: A DataFrame with the new time-based features.
    """
    
    df['date'] = pd.to_datetime(df['date']) # Convert the date to datetime.
    
    # Start the feature creation process.
    df['year'] = df['date'].dt.year
    df['quarter'] = df['date'].dt.quarter
    df['month'] = df['date'].dt.month
    df['day'] = df['date'].dt.day
    df['dayofweek'] = df['date'].dt.dayofweek
    df['dayofmonth'] = df['date'].dt.days_in_month # Number of days in the month (28-31), not the day of the month.
    df['dayofyear'] = df['date'].dt.dayofyear
    df['weekofyear'] = df['date'].dt.isocalendar().week.astype(int) # dt.weekofyear is deprecated in recent pandas versions.
    df['weekday'] = df['date'].dt.weekday # Same values as 'dayofweek'.
    df['is_weekend'] = np.where((df['weekday'] == 5) | (df['weekday'] == 6), 1, 0)
    
    return df

In [None]:
# Apply the function 'create_time_features' to the dataset...
train_df = create_time_features(train_df)
test_df = create_time_features(test_df)
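
As a quick sanity check of what these date accessors return, the sketch below evaluates a few of them on two hand-picked dates (a Saturday in January and a Monday in February 2019). It is only an illustration on toy data and assumes a pandas version that provides `dt.isocalendar()`.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2019-01-05", "2019-02-11"]))

features = pd.DataFrame({
    "day": dates.dt.day,                        # day of the month
    "days_in_month": dates.dt.days_in_month,    # length of the month
    "dayofweek": dates.dt.dayofweek,            # Monday=0 ... Sunday=6
    "weekofyear": dates.dt.isocalendar().week.astype(int),  # ISO week number
    "is_weekend": (dates.dt.dayofweek >= 5).astype(int),
})
print(features)
```

Note that `day` and `days_in_month` are different things: the first is the position inside the month, the second is the month length.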

<a name="7"></a>
# 7. Pre-Processing the Features for Training.

In [None]:
# Convert the Categorical variables to one-hot or label encoded features...
# It will help in the training process

CATEGORICAL = ['country', 'store', 'product', 'holiday_name']
def create_one_hot(df, categ_colums = CATEGORICAL):
    """
    Creates one_hot encoded fields for the specified categorical columns...
    Args
        df
        categ_colums
    Returns
        df
    """
    df = pd.get_dummies(df, columns=categ_colums) # Use the function argument, not the global list.
    return df


def encode_categ_features(df, categ_colums = CATEGORICAL):
    """
    Use the label encoder to encode categorical features...
    Args
        df
        categ_colums
    Returns
        df
    """
    le = LabelEncoder()
    for col in categ_colums:
        df['enc_'+col] = le.fit_transform(df[col])
    return df

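# Note: the encoder is fit separately on each DataFrame, so the codes only match
# when train and test contain the same set of values in a column.
train_df = encode_categ_features(train_df)
test_df = encode_categ_features(test_df)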
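
If a column can contain values in the test set that never appear in the train set (holiday names are the likely candidate here), one way to keep the codes consistent is to derive the categories from train only and reuse them for test. The sketch below does that with `pd.Categorical` instead of `LabelEncoder`; the helper name `encode_with_train_categories` and the toy holiday names are made up for the illustration.

```python
import pandas as pd

def encode_with_train_categories(train, test, columns):
    """Encode columns with integer codes learned from the train set only."""
    train, test = train.copy(), test.copy()
    for col in columns:
        categories = sorted(train[col].unique())
        train["enc_" + col] = pd.Categorical(train[col], categories=categories).codes
        # Values never seen in train are encoded as -1.
        test["enc_" + col] = pd.Categorical(test[col], categories=categories).codes
    return train, test

toy_train = pd.DataFrame({"holiday_name": ["Not Holiday", "Christmas Day", "New Year's Day"]})
toy_test = pd.DataFrame({"holiday_name": ["New Year's Day", "Not Holiday", "Easter Monday"]})

enc_train, enc_test = encode_with_train_categories(toy_train, toy_test, ["holiday_name"])
print(enc_train)
print(enc_test)
```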

In [None]:
def create_log_target(df, target = TARGET):
    """
    Apply a log transformation to the target for better optimization 
    during training.
    """
    df[target] = np.log(df[target])
    return df

#train_df = create_log_target(train_df, TARGET)
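
A small illustration of why the log target can be attractive for a relative metric such as SMAPE: a constant error in log space corresponds to a constant percentage error in the original scale, whatever the sales volume. The numbers below are made up, and the `0.05` offset stands in for a hypothetical model error; predictions made in log space have to be mapped back with `np.exp` before scoring or submitting.

```python
import numpy as np

y = np.array([120.0, 450.0, 2884.0])   # made-up sales values
y_log = np.log(y)                      # what the model would be trained on

pred_log = y_log + 0.05                # hypothetical prediction, off by 0.05 in log space
pred = np.exp(pred_log)                # back to the original scale

relative_error = pred / y - 1.0        # identical for small and large values
print(relative_error)
```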

In [None]:
train_df['num_sold'].describe()

<a name="8"></a>
# 8. Identifying Features for Training.

In [None]:
# Extract features and avoid certain columns from the dataframe for training purposes...
avoid = ['row_id', 'date', 'num_sold']
FEATURES = [feat for feat in train_df.columns if feat not in avoid]

# Print a list of all the features created...
print(FEATURES)

In [None]:
# Selecting Features....
print(FEATURES)

In [None]:
FEATURES = [
            #'country',
            #'store',
            #'product',
            #'holiday_name',
            'is_holiday',
            'year',
            #'quarter',
            'month',
            'day',
            'dayofweek',
            'dayofmonth',
            'dayofyear',
            'weekofyear',
            'weekday',
            #'is_weekend',
            'enc_country',
            'enc_store',
            'enc_product',
            #'enc_holiday_name'
            ]

<a name="9"></a>
# 9. Creates a Simple Train / Validation Strategy.

In [None]:
# Creates the Train and Validation sets to train the model...
# Define a cutoff date to split the datasets
CUTOFF_DATE = '2018-01-01'

# Split the data into train and validation datasets using timestamp best suited for timeseries...
X_train = train_df[train_df['date'] < CUTOFF_DATE][FEATURES]
y_train = train_df[train_df['date'] < CUTOFF_DATE][TARGET]

X_val = train_df[train_df['date'] >= CUTOFF_DATE][FEATURES]
y_val = train_df[train_df['date'] >= CUTOFF_DATE][TARGET]

<a name="10"></a>
# 10. Train a Simple Model (CatBoost Regressor).

In [None]:
# Defines a really simple CatBoost Regressor...

catboost_params = {'n_estimators': 20_000}

# Create an instance of the CatBoostRegressor and set the model parameters...
cbr = CatBoostRegressor(**catboost_params)

# Train the CatBoostRegressor using the train and validation datasets, 
# Utilizes early_stopping_rounds to control overfitting...
cbr.fit(X_train,
        y_train,
        eval_set=[(X_val, y_val)],
        early_stopping_rounds = 250,
        verbose = 500)

In [None]:
# Use the model to predict on the validation set..
val_pred = cbr.predict(X_val)

# Convert the target back from the logarithmic transformation
# (only needed when create_log_target was applied to the training data).
#val_pred = np.exp(val_pred)
#y_val = np.exp(y_val)

score = np.sqrt(mean_squared_error(y_val, val_pred))
print(f'RMSE: {score} / SMAPE: {SMAPE(y_val, val_pred)}')

<a name="11"></a>
# 11. Train a Simple Model (CatBoost Regressor) using a CV Loop.

In [None]:
%%time
N_SPLITS = 20
EARLY_STOPPING_ROUNDS = 1500 # Stops training if the validation metric does not improve in the last 1,500 rounds
VERBOSE = 0 # Controls the level of information, verbosity

In [None]:
%%time
# Cross Validation Loop for the Regressor.
def cross_validation_train(train, labels, test, model, model_params, n_folds = 10):
    """
    The following function is responsible for training a model in a
    cross validation loop and generating predictions on the specified test set.
    The function also returns the model feature importances, among other variables.

    Args:
    train  (Dataframe): ...
    labels (Series): ...
    test   (Dataframe): ...
    model  (Model): ...
    model_params (dict of str: int): ...

    Return:
    regressor   (Model): ...
    feat_import (Dataframe): ...
    test_pred   (Dataframe): ...
    ...

    """
    # Creates empty place holders for out of fold and test predictions.
    oof_pred  = np.zeros(len(train)) # One prediction per training row.
    oof_label = np.zeros(len(train))
    test_pred = np.zeros(len(test)) # One averaged prediction per test row.
    test_pred_array = []
    val_indexes_used = []
    
    # Creates empty place holder for the feature importance.
    feat_import = np.zeros(len(FEATURES))
    
    # Creates a TimeSeriesSplit object to be used in the train / validation
    # phase of the model; it splits by row order, so the rows must be sorted by date.
    Kf = TimeSeriesSplit(n_splits = n_folds)
    
    # Start the training and validation loops.
    for fold, (train_idx, val_idx) in enumerate(Kf.split(train)):
        # Creates the index for each fold
        print(f'Fold: {fold+1}')        
        train_min_date = train_df.iloc[train_idx]['date'].min()
        train_max_date = train_df.iloc[train_idx]['date'].max()
        
        valid_min_date = train_df.iloc[val_idx]['date'].min()
        valid_max_date = train_df.iloc[val_idx]['date'].max()
        
        print(f'Train Min / Max Dates: {train_min_date} / {train_max_date}')
        print(f'Valid Min / Max Dates: {valid_min_date} / {valid_max_date}')

        print(f'Training on {train_df.iloc[train_idx].shape[0]} Records')
        print(f'Validating on {train_df.iloc[val_idx].shape[0]} Records')
        
        # Generates the Fold. Train and Validation datasets
        X_trn, y_trn = train.iloc[train_idx], labels.iloc[train_idx]
        X_val, y_val = train.iloc[val_idx], labels.iloc[val_idx]
        
        val_indexes_used = np.concatenate((val_indexes_used, val_idx), axis=None)
        
        # Instantiate a regressor based on the model parameters
        regressor = model(**model_params)
 
        regressor.fit(X_trn, 
                      y_trn, 
                      eval_set = [(X_val, y_val)], 
                      early_stopping_rounds = EARLY_STOPPING_ROUNDS, 
                      verbose = VERBOSE)
        
        # Generate predictions using the trained model
        val_pred = regressor.predict(X_val)
        oof_pred[val_idx]  = val_pred # store the predictions for that fold.
        oof_label[val_idx] = y_val # store the true labels for that fold.

        # Calculate the model error based on the selected metric
        error =  np.sqrt(mean_squared_error(y_val, val_pred))
        #error =  np.sqrt(mean_squared_error(np.exp(y_val), np.exp(val_pred)))

        
        # Print some of the model performance metrics
        print(f'RMSE: {error}')
        print(f'SMAPE: {SMAPE(y_val, val_pred)}')
        #print(f'SMAPE: {SMAPE(np.exp(y_val), np.exp(val_pred))}')
                        
        print("."*50)

        # Populate the feature importance matrix
        feat_import += regressor.feature_importances_

        # Generate predictions for the test set
        test_pred += (regressor.predict(test)) / n_folds
        test_pred_array.append(regressor.predict(test))
                        
    # Calculate the error across all the folds and print the results
    val_indexes_used = val_indexes_used.astype(int)
    global_error = np.sqrt(mean_squared_error(labels.iloc[val_indexes_used], oof_pred[val_indexes_used]))
    #global_error = np.sqrt(mean_squared_error(np.exp(labels.iloc[val_indexes_used]), np.exp(oof_pred[val_indexes_used])))
    
    print('')
    print(f'RMSE: {global_error}...')
    print(f'SMAPE: {SMAPE(labels.iloc[val_indexes_used], oof_pred[val_indexes_used])}...')
    #print(f'SMAPE: {SMAPE(np.exp(labels.iloc[val_indexes_used]), np.exp(oof_pred[val_indexes_used]))}...')

                           
    return regressor, feat_import, test_pred, oof_label, oof_pred, test_pred_array
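
One thing to keep in mind with `TimeSeriesSplit` is that it cuts the data by row position, so a fold boundary can fall in the middle of a day when there are several rows per date. A possible alternative, sketched below and not used in this notebook, is to build expanding-window folds on the unique dates instead. The helper name `expanding_date_folds` is hypothetical, and the toy data has two rows per day.

```python
import numpy as np
import pandas as pd

def expanding_date_folds(dates, n_folds):
    """Yield (train_idx, val_idx) row positions; boundaries always fall between days."""
    dates_np = pd.to_datetime(dates).to_numpy()
    unique_dates = np.unique(dates_np)                    # sorted unique days
    chunks = np.array_split(unique_dates, n_folds + 1)    # first chunk is train-only
    for k in range(1, n_folds + 1):
        # Train on every row strictly before the first validation day.
        train_idx = np.flatnonzero(dates_np < chunks[k][0])
        val_idx = np.flatnonzero(np.isin(dates_np, chunks[k]))
        yield train_idx, val_idx

# Toy data: 8 consecutive days with 2 rows per day.
toy_dates = pd.Series(pd.date_range("2015-01-01", periods=8, freq="D").repeat(2))
folds = list(expanding_date_folds(toy_dates, n_folds=3))

for fold, (train_idx, val_idx) in enumerate(folds, start=1):
    print(f"Fold {fold}: train rows {train_idx.tolist()} / valid rows {val_idx.tolist()}")
```

The index arrays are positional, so they can be used with `.iloc` in the same way as the ones produced by `Kf.split(train)`.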

In [None]:
%%time
# Uses the cross_validation_train to build and train the model with CatBoost
cbr, ft_imp, pred, oof_label, oof_pred, pred_arr = cross_validation_train(train  = train_df[FEATURES], 
                                                                           labels = train_df[TARGET], 
                                                                           test   = test_df[FEATURES], 
                                                                           model  = CatBoostRegressor, 
                                                                           model_params = catboost_params,
                                                                           n_folds = N_SPLITS
                                                                           )

Plain CatBoost Model </br>
RMSE: 69.09856619718157...
SMAPE: 9.879218782588355...
CPU times: user 6min 1s, sys: 42.9 s, total: 6min 44s
Wall time: 1min 59s

Plain CatBoost Model / Log Target </br>
RMSE: 73.75406067872883...
SMAPE: 8.18583587943846...
CPU times: user 4min 6s, sys: 30.6 s, total: 4min 36s
Wall time: 1min 22s

Plain CatBoost Model, Added is_weekend </br>
RMSE: 68.89177386350742...
SMAPE: 9.753530743531984...
CPU times: user 5min 55s, sys: 41.7 s, total: 6min 37s
Wall time: 1min 57s
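
The function also returns `pred_arr`, the list with one test prediction array per fold, while `pred` accumulates `prediction / n_folds` inside the loop. The sketch below, on made-up numbers, shows that this running sum is the plain mean of the per-fold predictions. It also shows two alternatives I have not tested here: the median, and a weighted mean that gives more weight to later folds, which in a time series split are trained on more data.

```python
import numpy as np

# Made-up predictions from 3 folds for 2 test rows.
fold_preds = [np.array([100.0, 200.0]),
              np.array([110.0, 190.0]),
              np.array([120.0, 210.0])]
n_folds = len(fold_preds)

# Same accumulation pattern as in the CV loop.
running = np.zeros(2)
for fold_pred in fold_preds:
    running += fold_pred / n_folds

stacked_mean = np.mean(fold_preds, axis=0)       # equivalent to the running sum
stacked_median = np.median(fold_preds, axis=0)   # less sensitive to one odd fold
weighted_mean = np.average(fold_preds, axis=0, weights=[1, 2, 3])  # later folds count more

print(running, stacked_mean, stacked_median, weighted_mean)
```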

<a name="12"></a>
# 12. Model Inference (Submission to Kaggle).

In [None]:
# Use the created model to predict the sales for 2019...
submission_df['num_sold'] = pred
submission_df['num_sold'] = submission_df['num_sold'].apply(np.ceil)
#submission_df['num_sold'] = np.exp(pred)

# Creates a submission file for Kaggle...
submission_df.to_csv('submission.csv',index=False)

# Print some of the submission
submission_df.head()