# Demand Forecasting Model with Python Scikit-Learn

## Executive Summary

The goal of this project is to forecast the demand for wholesale liquor in Iowa based on historical sales transactions between the Alcoholic Beverages Division (ABD) and retailers. Predicting the number of bottles that are sold to retailers in the future is a supervised learning task. 

Based on the earlier exploratory data analysis and feature selection, we have a set of features describing the item, origin, type of liquor, pack size, bottle volume and so on, which we use to predict bottles_sold (a.k.a. the label). This is a regression task because the label is a numeric quantity rather than a class. 

Although there are many algorithms for supervised regression tasks, I want to predict future demand with both accuracy and interpretability. Therefore, I will focus on 4 different modelling approaches covering the spectrum of complexity. 
- ARIMA
- Linear Regression
- Random Forest Regression
- Support Vector Machine Regression

This notebook defines the ML pipeline tasks using TFX so that they are ready for execution with a pipeline orchestrator such as Airflow or Kubeflow Pipelines. 

## Data Sources

summary_sales.parquet: Summarised liquor sales records from January 2018 to March 2021. 

## Revision History

- 04-24-2021: Started the notebook

## Overview

### TensorFlow Extended

TFX is a Google-production-scale machine learning (ML) platform based on TensorFlow. It provides a configuration framework and shared libraries to integrate common components needed to define, launch, and monitor your machine learning system.

A TFX pipeline is a sequence of components that implement an ML pipeline which is specifically designed for scalable, high-performance machine learning tasks. That includes modeling, training, serving inference, and managing deployments to online, native mobile, and JavaScript targets.

Further information about TFX can be found [here](https://www.tensorflow.org/tfx/guide).

### ML Pipeline

Below are the key takeaways about ML pipelines from the book [Building Machine Learning Pipelines by Hannes Hapke and Catherine Nelson](https://learning.oreilly.com/library/view/building-machine-learning/9781492053187/). 

A machine learning pipeline starts with the ingestion of new training data and ends with receiving some kind of feedback on how your newly trained model is performing. Machine learning pipelines can become very complicated and consume a lot of overhead to manage task dependencies. The key benefit of machine learning pipelines lies in the automation of the model life cycle steps. When new training data becomes available, a workflow which includes data validation, preprocessing, model training, analysis, and deployment should be triggered. 

A machine learning pipeline commonly includes the following steps. 
1. Data Ingestion & Data Versioning
    - Process the data into a format that the following components can digest
    - Do not perform any feature engineering here (this happens after the data validation step)
    - Version the incoming data to connect a data snapshot with the trained model at the end of the pipeline
    - Split the original dataset into a training set and an evaluation set
2. Data Validation
    - Check that the statistics of the new data are as expected (e.g., the range, number of categories, and distribution of categories)
    - Alert the data scientist if any anomalies (e.g. imbalanced dataset) are detected
    - Compare between the training and the evaluation set to ensure the label split is roughly the same between the two datasets
3. Data Preprocessing
    - Preprocess the data to use it for your training runs (e.g. one-hot encoding, feature scaling)
    - Since preprocessing is only required prior to model training and not with every training epoch, it makes the most sense to run the preprocessing in its own life cycle step before training the model.
4. Model Training & Tuning
    - Train a model to take inputs and predict an output with the lowest error possible
    - Tune the model to pick out the optimal model hyperparameters for our final production model
5. Model Analysis
    - Carry out a more in-depth analysis of the model’s performance
    - Check that the model’s predictions are fair
6. Model Versioning
    - Keep track of which model, set of hyperparameters, and datasets have been selected as the next version to be deployed
    - Document all inputs into a new model version (hyperparameters, datasets, architecture) and track them as part of the release step
7. Model Deployment
    - Deploy your models without writing web app code using API interfaces such as representational state transfer (REST) or remote procedure call (RPC) protocols
    - Update a model version without redeploying your application, which will reduce your application’s downtime and reduce the communication between the application development and the machine learning teams
8. Feedback Loops
    - Capture valuable information about the performance of the model
    - Capture new training data to increase our datasets and update our model
    
Except for the two manual review steps (the model analysis step and the feedback step), we can automate the entire pipeline. Data scientists should be able to focus on the development of new models, not on updating and maintaining existing models.
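
As a rough illustration of the data validation step, the sketch below compares a new batch of data against the training data using plain pandas. It only illustrates the idea and is not TFX's validation component; the function name `find_anomalies` and the toy values are hypothetical.

```python
import pandas as pd


def find_anomalies(train: pd.DataFrame, new: pd.DataFrame) -> list:
    """Return human-readable anomaly messages for a new batch of data."""
    anomalies = []
    for col in train.columns:
        if col not in new.columns:
            anomalies.append(f"{col}: column is missing")
            continue
        if pd.api.types.is_numeric_dtype(train[col]):
            # Flag numeric values outside the range seen during training
            lo, hi = train[col].min(), train[col].max()
            if new[col].min() < lo or new[col].max() > hi:
                anomalies.append(f"{col}: values outside [{lo}, {hi}]")
        else:
            # Flag categories that never appeared in the training data
            unseen = sorted(set(new[col]) - set(train[col]))
            if unseen:
                anomalies.append(f"{col}: unseen categories {unseen}")
    return anomalies


# Toy data (hypothetical values)
train = pd.DataFrame({"pack": [6, 12, 24], "origin": ["Local", "Imported", "Local"]})
new = pd.DataFrame({"pack": [6, 48], "origin": ["Local", "Unknown"]})

report = find_anomalies(train, new)
print(report)
```

In a pipeline, a non-empty report would be the signal to alert the data scientist before training starts.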

## Required Python Libraries

In [23]:
from pathlib import Path
from datetime import datetime
import pandas as pd
import numpy as np

# Ignore all warnings
import warnings
warnings.filterwarnings("ignore")

# Data preprocessing
from sklearn.model_selection import train_test_split
import category_encoders as ce
from sklearn.preprocessing import RobustScaler

# Model training
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import AdaBoostRegressor
from xgboost.sklearn import XGBRegressor

# Model evaluation
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

# Fine-tune model performance
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV

# Data Visualisation for EDA
import matplotlib.pyplot as plt
import seaborn as sns

# Set up matplotlib so it uses Jupyter's graphical backend when plotting the charts
%matplotlib inline 

# Adjust display options for pandas dataframes
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 60)
pd.set_option('float_format','{:.2f}'.format)

## File Locations

In [2]:
raw_data = Path.cwd().parent / "data" / "raw" / "all_sales.parquet"

# Summarise transactional data into training dataset for demand forecasting
summarised_data = Path.cwd().parent / "data" / "processed" / "summary_sales.parquet"

## Load the data & basic exploration

In [3]:
liquor_df = pd.read_parquet(summarised_data)
liquor_df

Unnamed: 0,date,item_number,vendor_name,category_name,city,county,bottle_volume_ml,state_bottle_cost,state_bottle_retail,pack,sale_dollars,bottles_sold,volume_sold_liters,item_description
0,2018-01-02,10006,DIAGEO AMERICAS,Scotch Whiskies,Dubuque,DUBUQUE,750,5.13,7.70,12,15.40,2,1.50,Scoresby Rare Scotch
1,2018-01-03,10006,DIAGEO AMERICAS,Scotch Whiskies,DeWitt,CLINTON,750,5.13,7.70,12,92.40,12,9.00,Scoresby Rare Scotch
2,2018-01-03,10006,DIAGEO AMERICAS,Scotch Whiskies,Emmetsburg,PALO ALTO,750,5.13,7.70,12,92.40,12,9.00,Scoresby Rare Scotch
3,2018-01-03,10006,DIAGEO AMERICAS,Scotch Whiskies,Knoxville,MARION,750,5.13,7.70,12,30.80,4,3.00,Scoresby Rare Scotch
4,2018-01-03,10006,DIAGEO AMERICAS,Scotch Whiskies,Nevada,STORY,750,5.13,7.70,12,15.40,2,1.50,Scoresby Rare Scotch
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6073614,2021-03-31,38530,SAZERAC COMPANY INC,American Vodkas,Cedar Falls,BLACK HAWK,50,4.30,6.45,12,6.45,1,0.05,Wheatley Vodka Mini
6073615,2021-03-31,57279,SAZERAC COMPANY INC,Cocktails / RTD,Cedar Rapids,LINN,1750,6.50,9.75,6,58.50,6,10.50,Chi-Chi's Pink Lemonade Margarita
6073616,2021-03-31,917527,Brown Forman Corp.,Straight Bourbon Whiskies,Allerton,WAYNE,750,17.15,25.73,6,154.38,6,4.50,Coopers' Craft Reserve Kentucky Straight Bourb...
6073617,2021-03-31,946606,Modern Matriarch,Gold Rum,Council Bluffs,POTTAWATTA,750,14.00,21.00,12,504.00,24,18.00,Modern Matriarch Amber Rum


In [4]:
# Retain data from 2020 onwards to speed up subsequent steps for data processing and model training
# .copy() avoids pandas' SettingWithCopyWarning when new columns are added later
liquor_df = liquor_df[liquor_df['date'] >= "2020-01-01"].copy()
liquor_df

Unnamed: 0,date,item_number,vendor_name,category_name,city,county,bottle_volume_ml,state_bottle_cost,state_bottle_retail,pack,sale_dollars,bottles_sold,volume_sold_liters,item_description
1302,2020-01-02,10006,SAZERAC COMPANY INC,Scotch Whiskies,Forest City,WINNEBAGO,750,5.13,7.70,12,15.40,2,1.50,Scoresby Rare Scotch
1303,2020-01-02,10006,SAZERAC COMPANY INC,Scotch Whiskies,Ottumwa,WAPELLO,750,5.13,7.70,12,30.80,4,3.00,Scoresby Rare Scotch
1304,2020-01-02,10006,SAZERAC COMPANY INC,Scotch Whiskies,Spencer,CLAY,750,5.13,7.70,12,23.10,3,2.25,Scoresby Rare Scotch
1305,2020-01-03,10006,SAZERAC COMPANY INC,Scotch Whiskies,Des Moines,POLK,750,5.13,7.70,12,92.40,12,9.00,Scoresby Rare Scotch
1306,2020-01-03,10006,SAZERAC COMPANY INC,Scotch Whiskies,Indianola,WARREN,750,5.13,7.70,12,184.80,24,18.00,Scoresby Rare Scotch
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6073614,2021-03-31,38530,SAZERAC COMPANY INC,American Vodkas,Cedar Falls,BLACK HAWK,50,4.30,6.45,12,6.45,1,0.05,Wheatley Vodka Mini
6073615,2021-03-31,57279,SAZERAC COMPANY INC,Cocktails / RTD,Cedar Rapids,LINN,1750,6.50,9.75,6,58.50,6,10.50,Chi-Chi's Pink Lemonade Margarita
6073616,2021-03-31,917527,Brown Forman Corp.,Straight Bourbon Whiskies,Allerton,WAYNE,750,17.15,25.73,6,154.38,6,4.50,Coopers' Craft Reserve Kentucky Straight Bourb...
6073617,2021-03-31,946606,Modern Matriarch,Gold Rum,Council Bluffs,POTTAWATTA,750,14.00,21.00,12,504.00,24,18.00,Modern Matriarch Amber Rum


## Feature Engineering

### date

In [5]:
# Add month, year, day of week and day of month to analyse liquor demand over time
liquor_df['month'] = pd.DatetimeIndex(liquor_df['date']).month
liquor_df['year'] = pd.DatetimeIndex(liquor_df['date']).year
liquor_df['day_of_week'] = pd.DatetimeIndex(liquor_df['date']).dayofweek
liquor_df['day'] = pd.DatetimeIndex(liquor_df['date']).day
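
For reference, a minimal sketch of what these date parts look like, using the `.dt` accessor on a toy Series (the three dates are taken from the tables printed earlier). Note that pandas numbers the days of the week from Monday = 0 to Sunday = 6.

```python
import pandas as pd

# Toy dates taken from the tables printed earlier in this notebook
dates = pd.Series(pd.to_datetime(["2020-01-02", "2020-01-03", "2021-03-31"]))

parts = pd.DataFrame({
    "month": dates.dt.month,
    "year": dates.dt.year,
    "day_of_week": dates.dt.dayofweek,  # Monday=0 ... Sunday=6
    "day": dates.dt.day,                # day of the month
})
print(parts)
```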

### category_name

In [6]:
# Consolidate the category_name into liquor_type
liquor_df['liquor_type'] = liquor_df['category_name'].replace({'Scotch Whiskies': 'Whiskey'
                                                               , 'American Vodkas': 'Vodka'
                                                               , 'Temporary & Specialty Packages': 'Specialty'
                                                               , 'Straight Bourbon Whiskies': 'Whiskey'
                                                               , 'Canadian Whiskies': 'Whiskey'
                                                               , 'Whiskey Liqueur': 'Liqueur'
                                                               , 'Irish Whiskies': 'Whiskey'
                                                               , 'Blended Whiskies': 'Whiskey'
                                                               , 'Bottled in Bond Bourbon': 'Whiskey'
                                                               , 'Spiced Rum': 'Rum'
                                                               , 'Straight Rye Whiskies': 'Whiskey'
                                                               , 'Single Barrel Bourbon Whiskies': 'Whiskey'
                                                               , 'Cream Liqueurs': 'Liqueur'
                                                               , 'Tennessee Whiskies': 'Whiskey'
                                                               , 'Iowa Distillery Whiskies': 'Whiskey'
                                                               , 'Corn Whiskies': 'Whiskey'
                                                               , 'American Distilled Spirits Specialty' : 'Specialty'
                                                               , 'Imported Dry Gins': 'Gin'
                                                               , 'American Dry Gins': 'Gin'
                                                               , 'Flavored Gin': 'Gin'
                                                               , 'Imported Flavored Vodka': 'Vodka'
                                                               , 'Imported Vodkas': 'Vodka'
                                                               , 'American Sloe Gins': 'Gin'
                                                               , 'American Flavored Vodka': 'Vodka'
                                                               , 'Single Malt Scotch': 'Whiskey'
                                                               , 'Neutral Grain Spirits': 'Others'
                                                               , 'Aged Dark Rum' : 'Rum'
                                                               , 'Imported Distilled Spirits Specialty': 'Specialty'
                                                               , 'Flavored Rum': 'Rum'
                                                               , 'White Rum': 'Rum'
                                                               , 'Gold Rum': 'Rum'
                                                               , 'American Cordials & Liqueurs': 'Liqueur'
                                                               , 'Imported Brandies': 'Brandy'
                                                               , 'Imported Cordials & Liqueurs': 'Liqueur'
                                                               , 'American Brandies': 'Brandy'
                                                               , 'Cocktails / RTD': 'Others'
                                                               , '100% Agave Tequila': 'Tequila'
                                                               , 'Imported Schnapps': 'Liqueur'
                                                               , 'American Schnapps': 'Liqueur'
                                                               , 'Coffee Liqueurs': 'Liqueur'
                                                               , 'Neutral Grain Spirits Flavored': 'Others'
                                                               , 'Triple Sec': 'Liqueur'
                                                               , 'Mixto Tequila': 'Tequila'
                                                               , 'Mezcal': 'Others'
                                                               , 'Special Order Items': 'Others'
                                                               , 'Distilled Spirits Specialty': 'Specialty'
                                                               , 'Imported Gins': 'Gin'
                                                               , 'Delisted / Special Order Items': 'Specialty'
                                                               , 'Imported Whiskies': 'Whiskey'})

In [7]:
# Identify product's origin based on category_name
liquor_df['origin'] = liquor_df['category_name'].replace({'Scotch Whiskies': 'Imported'
                                                               , 'American Vodkas': 'Local'
                                                               , 'Temporary & Specialty Packages': 'Unknown'
                                                               , 'Straight Bourbon Whiskies': 'Local'
                                                               , 'Canadian Whiskies': 'Imported'
                                                               , 'Whiskey Liqueur': 'Unknown'
                                                               , 'Irish Whiskies': 'Imported'
                                                               , 'Blended Whiskies': 'Unknown'
                                                               , 'Bottled in Bond Bourbon': 'Local'
                                                               , 'Spiced Rum': 'Unknown'
                                                               , 'Straight Rye Whiskies': 'Unknown'
                                                               , 'Single Barrel Bourbon Whiskies': 'Local'
                                                               , 'Cream Liqueurs': 'Unknown'
                                                               , 'Tennessee Whiskies': 'Local'
                                                               , 'Iowa Distillery Whiskies': 'Local'
                                                               , 'Corn Whiskies': 'Unknown'
                                                               , 'American Distilled Spirits Specialty' : 'Local'
                                                               , 'Imported Dry Gins': 'Imported'
                                                               , 'American Dry Gins': 'Local'
                                                               , 'Flavored Gin': 'Unknown'
                                                               , 'Imported Flavored Vodka': 'Imported'
                                                               , 'Imported Vodkas': 'Imported'
                                                               , 'American Sloe Gins': 'Local'
                                                               , 'American Flavored Vodka': 'Local'
                                                               , 'Single Malt Scotch': 'Imported'
                                                               , 'Neutral Grain Spirits': 'Unknown'
                                                               , 'Aged Dark Rum' : 'Unknown'
                                                               , 'Imported Distilled Spirits Specialty': 'Imported'
                                                               , 'Flavored Rum': 'Unknown'
                                                               , 'White Rum': 'Unknown'
                                                               , 'Gold Rum': 'Unknown'
                                                               , 'American Cordials & Liqueurs': 'Local'
                                                               , 'Imported Brandies': 'Imported'
                                                               , 'Imported Cordials & Liqueurs': 'Imported'
                                                               , 'American Brandies': 'Local'
                                                               , 'Cocktails / RTD': 'Unknown'
                                                               , '100% Agave Tequila': 'Imported'
                                                               , 'Imported Schnapps': 'Imported'
                                                               , 'American Schnapps': 'Local'
                                                               , 'Coffee Liqueurs': 'Unknown'
                                                               , 'Neutral Grain Spirits Flavored': 'Unknown'
                                                               , 'Triple Sec': 'Imported'
                                                               , 'Mixto Tequila': 'Imported'
                                                               , 'Mezcal': 'Imported'
                                                               , 'Special Order Items': 'Unknown'
                                                               , 'Distilled Spirits Specialty': 'Unknown'
                                                               , 'Imported Gins': 'Imported'
                                                               , 'Delisted / Special Order Items': 'Unknown'
                                                               , 'Imported Whiskies': 'Imported'})
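
One thing to keep in mind with `replace` and a dictionary: values that are not keys in the dictionary are left unchanged, so a category missing from the mapping would silently survive as its own `liquor_type` or `origin`. The sketch below shows a quick check on a toy Series; `'Some New Category'` is a made-up value and `liquor_type_map` holds only a subset of the mapping above.

```python
import pandas as pd

# Toy stand-in for liquor_df['category_name']; the last value is made up
category_name = pd.Series(["Scotch Whiskies", "Gold Rum", "Some New Category"])

# Subset of the mapping used above
liquor_type_map = {"Scotch Whiskies": "Whiskey", "Gold Rum": "Rum"}

liquor_type = category_name.replace(liquor_type_map)

# Categories that the mapping does not cover
unmapped = sorted(set(category_name) - set(liquor_type_map))

print(liquor_type.tolist())
print(unmapped)
```

An empty `unmapped` list would confirm that every category in the data is covered.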

In [8]:
# Classify imported or local LIQUEUR based on item_description

# Mark imported products
liquor_df['origin'] = np.where(liquor_df['item_description'].str.startswith("Baileys"), 'Imported', liquor_df['origin'])
liquor_df['origin'] = np.where(liquor_df['item_description'].str.startswith("Yukon Jack"), 'Imported', liquor_df['origin'])
liquor_df['origin'] = np.where(liquor_df['item_description'].str.startswith("Kahlua"), 'Imported', liquor_df['origin'])
liquor_df['origin'] = np.where(liquor_df['item_description'].str.startswith("Fireball"), 'Imported', liquor_df['origin'])
liquor_df['origin'] = np.where(liquor_df['item_description'].str.startswith("Ryan's Cream"), 'Imported', liquor_df['origin'])

# Mark local products
liquor_df['origin'] = np.where(liquor_df['item_description'].str.startswith("Dr"), 'Local', liquor_df['origin'])
liquor_df['origin'] = np.where(liquor_df['item_description'].str.startswith("Tippy Cow"), 'Local', liquor_df['origin'])
liquor_df['origin'] = np.where(liquor_df['item_description'].str.startswith("Atomic Fusion"), 'Local', liquor_df['origin'])
liquor_df['origin'] = np.where(liquor_df['item_description'].str.startswith("Blue Ox"), 'Local', liquor_df['origin'])
liquor_df['origin'] = np.where(liquor_df['item_description'].str.startswith("Rebel Yell"), 'Local', liquor_df['origin'])
liquor_df['origin'] = np.where(liquor_df['item_description'].str.startswith("Iowish"), 'Local', liquor_df['origin'])
liquor_df['origin'] = np.where(liquor_df['item_description'].str.startswith("Iowa"), 'Local', liquor_df['origin'])

In [9]:
# Classify imported or local WHISKEY based on item_description

# Mark imported products
liquor_df['origin'] = np.where(liquor_df['item_description'].str.startswith("Five Star"), 'Imported', liquor_df['origin'])

# Mark local products
liquor_df['origin'] = np.where(liquor_df['item_description'].str.startswith("Seagrams 7 Crown"), 'Local', liquor_df['origin'])
liquor_df['origin'] = np.where(liquor_df['item_description'].str.startswith("Kessler"), 'Local', liquor_df['origin'])
liquor_df['origin'] = np.where(liquor_df['item_description'].str.startswith("Templeton"), 'Local', liquor_df['origin'])
liquor_df['origin'] = np.where(liquor_df['item_description'].str.startswith("Beam's 8"), 'Local', liquor_df['origin'])
liquor_df['origin'] = np.where(liquor_df['item_description'].str.startswith("Hawkeye Blended"), 'Local', liquor_df['origin'])

In [10]:
# Classify imported or local RUM based on item_description

# Mark imported products
liquor_df['origin'] = np.where(liquor_df['item_description'].str.startswith("Captain Morgan"), 'Imported', liquor_df['origin'])
liquor_df['origin'] = np.where(liquor_df['item_description'].str.startswith("Bacardi"), 'Imported', liquor_df['origin'])
liquor_df['origin'] = np.where(liquor_df['item_description'].str.startswith("Malibu"), 'Imported', liquor_df['origin'])


# Mark local products
liquor_df['origin'] = np.where(liquor_df['item_description'].str.startswith("Admiral Nelson"), 'Local', liquor_df['origin'])

In [11]:
# Classify imported or local OTHERS based on item_description

# Mark imported products
liquor_df['origin'] = np.where(liquor_df['item_description'].str.startswith("Jose Cuervo"), 'Imported', liquor_df['origin'])
liquor_df['origin'] = np.where(liquor_df['item_description'].str.startswith("1800"), 'Imported', liquor_df['origin'])
liquor_df['origin'] = np.where(liquor_df['item_description'].str.startswith("Margaritaville"), 'Imported', liquor_df['origin'])

# Mark local products
liquor_df['origin'] = np.where(liquor_df['item_description'].str.startswith("Chi-Chi's"), 'Local', liquor_df['origin'])
liquor_df['origin'] = np.where(liquor_df['item_description'].str.startswith("Everclear Alcohol"), 'Local', liquor_df['origin'])
liquor_df['origin'] = np.where(liquor_df['item_description'].str.startswith("Desert Island Long Island"), 'Local', liquor_df['origin'])
liquor_df['origin'] = np.where(liquor_df['item_description'].str.startswith("Ole Smoky"), 'Local', liquor_df['origin'])
liquor_df['origin'] = np.where(liquor_df['item_description'].str.startswith("Midnight Moon"), 'Local', liquor_df['origin'])
liquor_df['origin'] = np.where(liquor_df['item_description'].str.startswith("Salvador's"), 'Local', liquor_df['origin'])
liquor_df['origin'] = np.where(liquor_df['item_description'].str.startswith("Skinnygirl"), 'Local', liquor_df['origin'])
liquor_df['origin'] = np.where(liquor_df['item_description'].str.startswith("Ice Box Mudslide"), 'Local', liquor_df['origin'])
liquor_df['origin'] = np.where(liquor_df['item_description'].str.startswith("Tooters"), 'Local', liquor_df['origin'])


In [12]:
# Classify imported or local SPECIALTY based on item_description

# Mark imported products
liquor_df['origin'] = np.where(liquor_df['item_description'].str.startswith("Crown Royal"), 'Imported', liquor_df['origin'])
liquor_df['origin'] = np.where(liquor_df['item_description'].str.startswith("Smirnoff Peppermint Twist"), 'Imported', liquor_df['origin'])
liquor_df['origin'] = np.where(liquor_df['item_description'].str.startswith("Ciroc"), 'Imported', liquor_df['origin'])

# Mark local products
liquor_df['origin'] = np.where(liquor_df['item_description'].str.startswith("Smirnoff Red, White & Berry"), 'Local', liquor_df['origin'])
liquor_df['origin'] = np.where(liquor_df['item_description'].str.startswith("Jack Daniels"), 'Local', liquor_df['origin'])
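
The prefix rules in the cells above could also be expressed as data and applied in a loop. This is only a sketch of that alternative on toy rows: three descriptions come from the tables above, `"Captain Morgan Example"` is a hypothetical description, and `prefix_rules` lists just two of the rules.

```python
import pandas as pd

# Toy rows; starting origins follow the category-based mapping above
df = pd.DataFrame({
    "item_description": [
        "Chi-Chi's Pink Lemonade Margarita",
        "Modern Matriarch Amber Rum",
        "Scoresby Rare Scotch",
        "Captain Morgan Example",  # hypothetical description
    ],
    "origin": ["Unknown", "Unknown", "Imported", "Unknown"],
})

# Ordered (prefix, origin) rules; a later rule overwrites an earlier one,
# just like the sequence of np.where calls
prefix_rules = [
    ("Captain Morgan", "Imported"),
    ("Chi-Chi's", "Local"),
]

for prefix, origin in prefix_rules:
    mask = df["item_description"].str.startswith(prefix)
    df.loc[mask, "origin"] = origin

print(df)
```

Rows that match no prefix keep the origin assigned from `category_name`.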

### pack

In [13]:
# Create new categorical feature called pack_size
liquor_df['pack_size'] = np.where(liquor_df['pack'].isin([6, 12, 24, 48]), liquor_df['pack'], 'Others')

### bottle_volume_ml

In [14]:
# Create new categorical feature called bottle_size
liquor_df['bottle_size'] = np.where(liquor_df['bottle_volume_ml'].isin([50, 100, 200, 375, 750, 1000, 1750])
                                    , liquor_df['bottle_volume_ml'], 'Non-standard')
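
Both `pack_size` and `bottle_size` combine numeric values with a string label. One way to keep every category label a single, explicit type is to convert to strings first. The sketch below shows this on a toy Series; it is an alternative formulation, not the code used in this notebook.

```python
import pandas as pd

pack = pd.Series([6, 12, 10, 48])  # toy values
standard_packs = [6, 12, 24, 48]

# Convert to strings first, then keep standard sizes and label the rest 'Others'
pack_size = pack.astype(str).where(pack.isin(standard_packs), "Others")
print(pack_size.tolist())
```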

In [15]:
# Drop irrelevant features
try:
    liquor_df = liquor_df.drop(columns = ['bottle_volume_ml', 'state_bottle_cost', 'pack', 'item_description', 'category_name', 'date'])
except KeyError:
    print("Features have already been removed.")

In [16]:
# Drop features that cause look-ahead bias (both are derived from bottles_sold, so they are unknown at prediction time)
try:
    liquor_df = liquor_df.drop(columns = ['sale_dollars', 'volume_sold_liters'])
except KeyError:
    print("Features have already been removed.")
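
To see why these two columns must go, the sketch below recovers `bottles_sold` from each of them for three rows copied from the table printed earlier. It only checks these sample rows; it does not prove that the relationship holds for every record.

```python
import pandas as pd

# Three rows copied from the table printed earlier in this notebook
rows = pd.DataFrame({
    "state_bottle_retail": [7.70, 7.70, 7.70],
    "bottle_volume_ml": [750, 750, 750],
    "sale_dollars": [15.40, 92.40, 30.80],
    "volume_sold_liters": [1.50, 9.00, 3.00],
    "bottles_sold": [2, 12, 4],
})

# bottles_sold reconstructed from sale_dollars and from volume_sold_liters
from_dollars = (rows["sale_dollars"] / rows["state_bottle_retail"]).round().astype(int)
from_volume = (rows["volume_sold_liters"] * 1000 / rows["bottle_volume_ml"]).round().astype(int)

print(from_dollars.tolist(), from_volume.tolist())
```

A model given either column could reproduce the label almost exactly, which would make its evaluation meaningless.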

## Split the data

I will use an 80-20 split for the training and test sets. Since this is a time series dataset, I will not randomly shuffle the data before splitting, for 2 main reasons. 
1. To ensure that chopping the data into windows of consecutive samples is still possible. 
2. To ensure the test results are realistically evaluated on data collected after the model has been trained. 

To ensure reproducibility, I will split the data by row position rather than by random sampling. The code is adapted and slightly modified based on [this tutorial](https://www.tensorflow.org/tutorials/structured_data/time_series#multi-step_models).

In [17]:
# Split the data into the training set and the test set based on 80-20 split
column_indices = {name: i for i, name in enumerate(liquor_df.columns)}
n = len(liquor_df)
train_df = liquor_df[0:int(n*0.8)]
test_df = liquor_df[int(n*0.8):]

X_train = train_df.drop(columns = 'bottles_sold')
y_train = train_df['bottles_sold']
X_test = test_df.drop(columns = 'bottles_sold')
y_test = test_df['bottles_sold']
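
A positional split only yields a "train on the past, test on the future" evaluation if the rows are ordered by date. The sketch below shows one way to make that ordering explicit. It assumes a date column is still available at split time (in this notebook `date` is dropped earlier, so the sort would have to happen before that step); `chronological_split` is a hypothetical helper and the data is a toy example.

```python
import pandas as pd


def chronological_split(df: pd.DataFrame, date_col: str, train_frac: float = 0.8):
    """Sort by date, then cut positionally so the test set is the most recent slice."""
    ordered = df.sort_values(date_col, kind="stable").reset_index(drop=True)
    cut = int(len(ordered) * train_frac)
    return ordered.iloc[:cut], ordered.iloc[cut:]


# Toy data, deliberately out of order
toy = pd.DataFrame({
    "date": pd.to_datetime(["2020-03-01", "2020-01-01", "2020-05-01", "2020-02-01", "2020-04-01"]),
    "bottles_sold": [3, 1, 5, 2, 4],
})

train_part, test_part = chronological_split(toy, "date")
print(train_part)
print(test_part)
```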

## Encode categorical values

Based on specific characteristics of categorical features, I will use different encoding methods. 
- No encoding: month, year, day_of_week, day (because these are already represented in numerical format)
- OneHotEncoder: liquor_type, origin, pack_size, bottle_size (because of low cardinality)
- HashingEncoder: item_number, vendor_name, city, county (because of high cardinality)

In [18]:
# Apply one-hot encoding
encoder = ce.OneHotEncoder(cols = ['origin', 'pack_size', 'bottle_size', 'liquor_type'], return_df = True)
X_train_transformed = encoder.fit_transform(X_train, y_train)
X_test_transformed = encoder.transform(X_test)

# Apply hash encoding for categorical features having high cardinality
encoder = ce.HashingEncoder(cols = ['item_number', 'vendor_name', 'city', 'county'], return_df = True)
X_train_transformed = encoder.fit_transform(X_train_transformed, y_train)
X_test_transformed = encoder.transform(X_test_transformed)

X_train_transformed

Unnamed: 0,col_0,col_1,col_2,col_3,col_4,col_5,col_6,col_7,state_bottle_retail,month,year,day_of_week,day,liquor_type_1,liquor_type_2,liquor_type_3,liquor_type_4,liquor_type_5,liquor_type_6,liquor_type_7,liquor_type_8,liquor_type_9,origin_1,origin_2,origin_3,pack_size_1,pack_size_2,pack_size_3,pack_size_4,pack_size_5,bottle_size_1,bottle_size_2,bottle_size_3,bottle_size_4,bottle_size_5,bottle_size_6,bottle_size_7,bottle_size_8
0,1,0,0,0,1,1,1,0,7.70,1,2020,3,2,1,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0
1,1,0,0,0,0,1,2,0,7.70,1,2020,3,2,1,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0
2,1,0,1,1,0,1,0,0,7.70,1,2020,3,2,1,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0
3,1,0,0,1,0,1,0,1,7.70,1,2020,4,3,1,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0
4,1,0,0,2,0,1,0,0,7.70,1,2020,4,3,1,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1967022,0,1,0,0,1,0,1,1,41.25,11,2020,2,4,0,0,0,0,0,0,0,0,1,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0
1967023,0,1,0,1,1,1,0,0,41.25,11,2020,2,4,0,0,0,0,0,0,0,0,1,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0
1967024,0,2,0,1,1,0,0,0,41.25,11,2020,2,4,0,0,0,0,0,0,0,0,1,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0
1967025,0,2,0,0,1,0,0,1,41.25,11,2020,2,4,0,0,0,0,0,0,0,0,1,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0
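
The `col_0` to `col_7` columns come from the hashing encoder: each of the four high-cardinality values in a row is hashed into one of eight buckets, and each column counts how many values landed in that bucket. This is why each row of those columns sums to 4 and some entries are 2. The sketch below illustrates the hashing trick with the standard library only. It is a simplified illustration and not necessarily the exact scheme `category_encoders` uses; `hash_encode` is a hypothetical helper.

```python
import hashlib


def hash_encode(row: dict, n_components: int = 8) -> list:
    """Map several categorical values into a fixed number of count columns."""
    counts = [0] * n_components
    for value in row.values():
        # Hash the value and use the remainder to pick a bucket
        digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
        counts[int(digest, 16) % n_components] += 1
    return counts


# Values taken from one row of the table printed earlier
row = {
    "item_number": 10006,
    "vendor_name": "SAZERAC COMPANY INC",
    "city": "Spencer",
    "county": "CLAY",
}

encoded = hash_encode(row)
print(encoded)
```

The output width stays fixed no matter how many distinct items, vendors or cities appear, at the cost of collisions between unrelated values.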


## Perform Feature Scaling

Based on earlier data preparation and exploratory data analysis, it is noted that outliers exist in the dataset. Therefore, I will be using RobustScaler. 

In [19]:
scaler = RobustScaler()
X_train_transformed = scaler.fit_transform(X_train_transformed, y_train)
X_test_transformed = scaler.transform(X_test_transformed)
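
With its default settings, `RobustScaler` subtracts the median and divides by the interquartile range (75th minus 25th percentile). The sketch below reproduces that formula with NumPy on a toy column containing one outlier and compares it with standard (mean and standard deviation) scaling.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # toy column; 100 is an outlier

# Robust scaling: centre on the median, scale by the interquartile range
median = np.median(x)
q25, q75 = np.percentile(x, [25, 75])
robust = (x - median) / (q75 - q25)

# Standard scaling for comparison: centre on the mean, scale by the standard deviation
standard = (x - x.mean()) / x.std()

print(robust)
print(standard)
```

Under robust scaling the four ordinary values keep a usable spread, whereas under standard scaling the outlier inflates the mean and standard deviation and squeezes them into a narrow band.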

## Model Training

In [20]:
models = []
models.append(('LNR', LinearRegression()))
models.append(('XGB', XGBRegressor()))
models.append(('RFR', RandomForestRegressor()))
models.append(('ADA', AdaBoostRegressor()))

results = []
names = []
scoring = 'neg_mean_absolute_error'
for name, model in models: 
    kfold = KFold(n_splits = 10)
    cv_results = cross_val_score(model, X_train_transformed, y_train, cv = kfold, scoring = scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)

LNR: -14.236720 (3.492755)
XGB: -13.709195 (3.874968)


KeyboardInterrupt: 
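
Note that `KFold` without shuffling cuts the data into contiguous blocks, so most folds are trained on rows that come after the validation block. For time-ordered data, scikit-learn also provides `TimeSeriesSplit`, where each training window precedes its validation window. The sketch below shows the resulting indices on toy data; it is an alternative to consider, not what was run above.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(8).reshape(-1, 1)  # eight time-ordered samples (toy data)
splitter = TimeSeriesSplit(n_splits=3)

folds = []
for train_idx, test_idx in splitter.split(X):
    folds.append((train_idx.tolist(), test_idx.tolist()))
    print("train:", train_idx.tolist(), "test:", test_idx.tolist())
```

The splitter can be passed as `cv = TimeSeriesSplit(n_splits = 10)` to `cross_val_score` in place of `KFold`.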

In an ideal world, with infinite resources and no time constraints, I would be tempted to proceed with Random Forest Regressor since it achieved the best mean absolute error (MAE), the metric used for scoring above. 

However, Random Forest Regressor takes very long to train. As the data grows, training time will grow even longer and each step in fine-tuning the model will become more expensive. Therefore, I decided to go with XGBoost Regressor since it is much faster to train and achieved the second-best MAE. 
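
The cross-validation above uses the `neg_mean_absolute_error` scorer, which returns the negated mean absolute error (MAE). A score of -14.24 therefore corresponds to an MAE of about 14.24 bottles, and scores closer to zero are better. MAE should not be confused with the mean absolute percentage error (MAPE). The sketch below computes both on made-up numbers to show the difference; `mae` and `mape` are hypothetical helper names.

```python
def mae(y_true, y_pred):
    """Mean absolute error, in the same unit as the label."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean absolute percentage error, relative to each true value (as a fraction)."""
    return sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up values for illustration
y_true = [2, 12, 24]
y_pred = [4, 10, 24]

mae_value = mae(y_true, y_pred)
mape_value = mape(y_true, y_pred)
print(mae_value, mape_value)
```

An error of 2 bottles counts the same in MAE whether the true value is 2 or 12, while MAPE penalises the error on the small order far more heavily.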

The next steps illustrate how to fine-tune XGBoost Regressor. 

## Fine-tuning the ML models

### XGBoost Regressor

Before fine-tuning, below are several key points about XGBoost adapted from [this article](https://info.cambridgespark.com/latest/getting-started-with-xgboost) that should be taken into account.

**Overview about XGBoost**

XGBoost stands for Extreme Gradient Boosting. It is a performant machine learning library based on the paper Greedy Function Approximation: A Gradient Boosting Machine, by Friedman. 

XGBoost implements a Gradient Boosting algorithm based on an ensemble of decision trees. Those trees are poor models individually, but when they are grouped they can be really performant. XGBoost is also recognized for its flexibility and speed. Whilst gradient boosting requires to build trees one by one sequentially, XGBoost implements a way to parallelize the training of each tree, making the training faster and the job of Data Scientists easier. 

**Difference between XGBoost and Random Forest**

The difference between XGBoost and Random Forest lies in the way the trees are built and combined. Random Forest builds fully grown decision trees in parallel on subsamples of the data. Each tree is highly specialized to predict on its subsample and does not generalize well (high variance). By combining the predictions made by the individual trees, the Random Forest algorithm decreases variance and gives good performance.

XGBoost, on the other hand, builds very short and simple decision trees iteratively. Each tree is called a “weak learner” because of its high bias. XGBoost starts by creating a first simple tree, which performs poorly by itself. It then builds another tree, itself a weak learner, which is trained to predict what the first tree was not able to. The algorithm continues to build weak learners sequentially, each one correcting the previous trees, until a stopping condition is reached, such as the number of trees (estimators) to build.
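
The sequential "each tree corrects the previous ones" idea can be sketched in a few lines of plain Python. This is a simplified gradient boosting loop for squared error using one-split trees (stumps) on a single feature, not XGBoost itself, which adds regularisation, second-order gradient information and many optimisations. All function names and the toy data are hypothetical, and the stump fitter assumes the feature has at least two distinct values.

```python
def fit_stump(x, residuals):
    """Find the single split on x that best fits the residuals (least squares)."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean


def boost(x, y, n_rounds, learning_rate):
    """Each round fits a stump to the residuals left by the previous rounds."""
    base = sum(y) / len(y)  # initial prediction: the mean of y
    stumps = []
    preds = [base] * len(y)
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        # Shrink each correction by the learning rate before adding it
        preds = [pi + learning_rate * stump(xi) for xi, pi in zip(x, preds)]
    return lambda xi: base + learning_rate * sum(s(xi) for s in stumps)


def mse(model, x, y):
    """Mean squared error of a model on (x, y)."""
    return sum((model(xi) - yi) ** 2 for xi, yi in zip(x, y)) / len(y)


# Toy one-feature data set (illustrative values only)
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [3, 4, 8, 9, 15, 14, 20, 22]

for n_rounds in (1, 10, 100):
    model = boost(x, y, n_rounds=n_rounds, learning_rate=0.3)
    print(n_rounds, round(mse(model, x, y), 3))
```

The training error shrinks as more weak learners are added, which is also why the number of trees and the learning rate need to be tuned together to avoid overfitting.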

**Warnings for XGBoost**

XGBoost can be harder to tune and more prone to overfitting than a simpler model such as Random Forest, and it performs poorly on unstructured data.

**XGBoost Hyperparameters**

It would be unrealistically slow and expensive to tune every XGBoost hyperparameter. I will focus only on the five hyperparameters below, which I expect to have the greatest impact on model performance. 

1. `max_depth` is the maximum number of nodes allowed from the root to the farthest leaf of a tree. Deeper trees can model more complex relationships by adding more nodes, but as we go deeper, splits become less relevant and are sometimes only due to noise, causing the model to overfit.
2. `min_child_weight` is the minimum weight (or number of samples, if all samples have a weight of 1) required to create a new node in the tree. A smaller `min_child_weight` allows the algorithm to create children that correspond to fewer samples, which allows more complex trees but, again, makes overfitting more likely.
3. `subsample` is the fraction of observations (rows) to subsample at each step. The default is 1, meaning that all rows are used. Building each tree on a slightly different sample of rows, instead of the whole training set every time, makes the model less likely to overfit to a single sample.
4. `colsample_bytree` is the fraction of features (columns) to use when building each tree. The default is 1, meaning that all features are used. Building each tree on a slightly different subset of columns makes the model less likely to overfit to a single feature.
5. `eta` (exposed as `learning_rate` in the scikit-learn wrapper) controls the learning rate. It is the shrinkage applied to the weights associated with features after each round; in other words, it defines the amount of "correction" made at each step. A lower eta makes the model more robust to overfitting, so a lower learning rate is usually better. However, a lower eta needs more boosting rounds, which take more time to train, sometimes for only marginal improvements.
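
To make the two sampling hyperparameters more concrete, the sketch below picks the rows and columns that each tree would see for given `subsample` and `colsample_bytree` fractions. It only illustrates the idea with Python's `random` module; the helper name, the data set size and the rounding rule are assumptions, and XGBoost's internal sampling is not reproduced here.

```python
import random


def sample_rows_and_columns(n_rows, n_cols, subsample, colsample_bytree, rng):
    """Pick the row and column indices that one tree would be trained on."""
    n_rows_kept = max(1, int(n_rows * subsample))
    n_cols_kept = max(1, int(n_cols * colsample_bytree))
    rows = sorted(rng.sample(range(n_rows), n_rows_kept))
    cols = sorted(rng.sample(range(n_cols), n_cols_kept))
    return rows, cols


rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical data set size: 10 rows and 6 features
for tree in range(3):
    rows, cols = sample_rows_and_columns(10, 6, subsample=0.8,
                                         colsample_bytree=0.5, rng=rng)
    print(f"tree {tree}: rows={rows} cols={cols}")
```

Because every tree is trained on a different slice of the data, no single row or feature can dominate the whole ensemble.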

#### Out-of-the-box XGBoost Regressor

In [24]:
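# Cross-validated MAE of the base model (default hyperparameters)
start_time = datetime.now()
xgb_reg = XGBRegressor()
# Baseline model fitted on the full training set
xgb_baseline = xgb_reg.fit(X_train_transformed, y_train)
# Out-of-fold predictions from 10-fold cross-validation
y_train_pred = cross_val_predict(xgb_reg, X_train_transformed, y_train, cv = 10)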
end_time = datetime.now()
print('Total running time:', (end_time - start_time).total_seconds())

base_model_MAE = mean_absolute_error(y_train, y_train_pred)
base_model_MAE

Total running time: 396.932523


13.709192394609925

#### Narrow the search with RandomizedSearchCV

In [25]:
# Create the random grid
# The scikit-learn wrapper names the eta hyperparameter `learning_rate`
random_grid = {'max_depth': [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
               'min_child_weight': [1, 3, 5, 7],
               'subsample': [0.5, 0.6, 0.7, 0.8, 0.9],
               'colsample_bytree': [0.3, 0.5, 0.7],
               'learning_rate': [0.2, 0.4, 0.6]}

print(random_grid)

{'max_depth': [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], 'min_child_weight': [1, 3, 5, 7], 'subsample': [0.5, 0.6, 0.7, 0.8, 0.9], 'colsample_bytree': [0.3, 0.5, 0.7], 'learning_rate': [0.2, 0.4, 0.6]}
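
A quick count shows why a randomized search is a sensible first pass over this grid. The sketch below uses only the standard library and the grid values printed above. The learning-rate entry is keyed as `learning_rate`, which is my assumption about the intended parameter name, and the final random draw only mimics the spirit of a randomized search rather than scikit-learn's actual sampler.

```python
import math
import random

random_grid = {
    "max_depth": [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
    "min_child_weight": [1, 3, 5, 7],
    "subsample": [0.5, 0.6, 0.7, 0.8, 0.9],
    "colsample_bytree": [0.3, 0.5, 0.7],
    "learning_rate": [0.2, 0.4, 0.6],  # the eta values from the grid
}

# Number of distinct combinations an exhaustive grid search would try
n_combinations = math.prod(len(values) for values in random_grid.values())

n_iter, cv = 100, 3  # settings used for the randomized search
print("combinations:", n_combinations)            # 1980
print("fits (exhaustive):", n_combinations * cv)  # 5940
print("fits (randomized):", n_iter * cv)          # 300

# Draw one random candidate, in the spirit of a randomized search
rng = random.Random(0)
candidate = {name: rng.choice(values) for name, values in random_grid.items()}
print(candidate)
```

With 100 iterations, the randomized search evaluates roughly 5% of the 1,980 possible combinations, which keeps the cost manageable before a narrower, exhaustive search.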


In [None]:
# Random search of parameters using 3-fold cross-validation
# No `scoring` is passed, so candidates are ranked by the estimator's default score (R^2)
xgb_random_search = XGBRegressor()
xgb_random = RandomizedSearchCV(estimator = xgb_random_search, param_distributions = random_grid, n_iter = 100, cv = 3, verbose = 2, n_jobs = -1)

start_time = datetime.now()
# Fit the random search model
xgb_random.fit(X_train_transformed, y_train)
end_time = datetime.now()
print('Total running time:', (end_time - start_time).total_seconds())

# View the best parameters from fitting the random search
xgb_random.best_params_

Fitting 3 folds for each of 100 candidates, totalling 300 fits
