# Build a Demand Forecasting Model with BigQuery ML

## Executive Summary

**What is this notebook for?**

- To create a baseline demand forecasting model with BigQuery ML using SQL (which is faster and simpler to set up than building an ML model with Python or TensorFlow)
- To understand the different metrics for evaluating a demand forecasting model
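
The notebook does not yet say which evaluation metrics it will use. As an illustration only, the sketch below computes three metrics commonly used for forecasts (MAE, RMSE and MAPE) with NumPy. The function name `forecast_metrics` and the choice of metrics are assumptions, not something taken from this notebook.

```python
import numpy as np


def forecast_metrics(actual, forecast):
    """Return MAE, RMSE and MAPE (in percent) for two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    error = forecast - actual

    mae = float(np.mean(np.abs(error)))
    rmse = float(np.sqrt(np.mean(error ** 2)))

    # MAPE is undefined where the actual value is zero, so those points are skipped.
    nonzero = actual != 0
    if nonzero.any():
        mape = float(np.mean(np.abs(error[nonzero] / actual[nonzero])) * 100)
    else:
        mape = float("nan")

    return {"mae": mae, "rmse": rmse, "mape": mape}


# Hypothetical daily bottles sold versus a forecast for the same days.
metrics = forecast_metrics([10, 20, 30], [12, 18, 33])
print(metrics)
```

MAE and RMSE are in the same unit as the target (for example bottles sold), while MAPE is relative and therefore easier to compare across items with very different sales volumes.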

**What have we learned from this notebook?** (To be updated)

1. Demand forecasting with ARIMA 
    1. Missing values
        - city: extract city from store_name
        - county: infer from the imputed values for missing city
        - category_name: impute based on category_name of the same items from other records
        - vendor_name: impute based on vendor_name of the same items from other records
2. Metrics for evaluating a demand forecasting model
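
The rule listed above for `category_name` and `vendor_name` (fill a missing value from other records of the same item) could be sketched in pandas as follows. This is a minimal illustration, not the notebook's actual preprocessing code: the helper name `impute_from_same_item` and the toy data are assumptions, and `item_number` is assumed to be the key that identifies an item.

```python
import pandas as pd


def impute_from_same_item(df, key, column):
    """Fill missing values in `column` with the first non-null value
    found among rows that share the same `key`."""
    # GroupBy.first() returns the first non-null entry per group.
    lookup = df.groupby(key)[column].first()
    return df[column].fillna(df[key].map(lookup))


# Toy data (hypothetical): item 3 has no known category anywhere.
sales = pd.DataFrame(
    {
        "item_number": [1, 1, 2, 2, 3],
        "category_name": ["A", None, None, "B", None],
    }
)
sales["category_name"] = impute_from_same_item(sales, "item_number", "category_name")
print(sales)
```

Items with no non-null value in any record stay missing, so a second rule is still needed for them.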

**What are the next steps after this notebook?**

1. Conduct further EDA and data preprocessing (e.g. feature scaling for numerical features, encoding for categorical features) with Python
2. Experiment with other models using scikit-learn, Keras or TensorFlow

## Data Sources

summary_sales.pkl: Summarised from all_sales.pkl

## Revision History

- 04-17-2021: Started the notebook

## Required Python Libraries

In [2]:
from google.cloud import bigquery
from pathlib import Path
from datetime import datetime
import pandas as pd
import numpy as np

# Ignore all warnings
import warnings
warnings.filterwarnings("ignore")

# EDA

# Chi-square test and t-test for EDA
from scipy.stats import chi2_contingency
from scipy import stats

# Logistic regression for EDA
import statsmodels.api as sm

# Data Visualisation for EDA
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Set up matplotlib so it uses Jupyter's graphical backend when plotting the charts
%matplotlib inline 

# Adjust display options for pandas dataframes
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 60)
pd.set_option('display.float_format', '{:.2f}'.format)

## File Locations

In [3]:
raw_data = Path.cwd().parent / "data" / "raw" / "all_sales.pkl"

# Summarise transactional data into training dataset for demand forecasting
summarised_data = Path.cwd().parent / "data" / "processed" / "summary_sales.pkl"

# Master file for common dimensions
item_data = Path.cwd().parent / "data" / "interim" / "item_list.pkl"

## Load the data & basic exploration

In [4]:
liquor_df = pd.read_pickle(summarised_data)
liquor_df.tail(10)

| | date | item_number | vendor_name | category_name | city | county | bottle_volume_ml | state_bottle_cost | state_bottle_retail | pack | sale_dollars | bottles_sold | volume_sold_liters | item_description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6073609 | 2021-03-30 | 991350 | Mara Imports | 100% Agave Tequila | Coralville | JOHNSON | 750 | 72.0 | 108.0 | 6 | 648.0 | 6 | 4.5 | Hoyo 19 Extra Anejo |
| 6073610 | 2021-03-31 | 28894 | BACARDI USA INC | Imported Dry Gins | Washington | WASHINGTON | 50 | 120.0 | 180.0 | 1 | 180.0 | 1 | 0.05 | Bombay Bramble |
| 6073611 | 2021-03-31 | 37403 | PERNOD RICARD USA | Imported Flavored Vodka | Ankeny | POLK | 750 | 9.99 | 14.99 | 6 | 44.97 | 3 | 2.25 | Absolut Watermelon |
| 6073612 | 2021-03-31 | 37403 | PERNOD RICARD USA | Imported Flavored Vodka | Cedar Falls | BLACK HAWK | 750 | 9.99 | 14.99 | 6 | 44.97 | 3 | 2.25 | Absolut Watermelon |
| 6073613 | 2021-03-31 | 37403 | PERNOD RICARD USA | Imported Flavored Vodka | Des Moines | POLK | 750 | 9.99 | 14.99 | 6 | 59.96 | 4 | 3.0 | Absolut Watermelon |
| 6073614 | 2021-03-31 | 38530 | SAZERAC COMPANY INC | American Vodkas | Cedar Falls | BLACK HAWK | 50 | 4.3 | 6.45 | 12 | 6.45 | 1 | 0.05 | Wheatley Vodka Mini |
| 6073615 | 2021-03-31 | 57279 | SAZERAC COMPANY INC | Cocktails / RTD | Cedar Rapids | LINN | 1750 | 6.5 | 9.75 | 6 | 58.5 | 6 | 10.5 | Chi-Chi's Pink Lemonade Margarita |
| 6073616 | 2021-03-31 | 917527 | Brown Forman Corp. | Straight Bourbon Whiskies | Allerton | WAYNE | 750 | 17.15 | 25.73 | 6 | 154.38 | 6 | 4.5 | Coopers' Craft Reserve Kentucky Straight Bourb... |
| 6073617 | 2021-03-31 | 946606 | Modern Matriarch | Gold Rum | Council Bluffs | POTTAWATTA | 750 | 14.0 | 21.0 | 12 | 504.0 | 24 | 18.0 | Modern Matriarch Amber Rum |
| 6073618 | 2021-03-31 | 946608 | Modern Matriarch | Flavored Rum | Council Bluffs | POTTAWATTA | 750 | 13.75 | 20.63 | 12 | 495.12 | 24 | 18.0 | Modern Matriarch Salted Caramel Flavored Rum |


In [5]:
liquor_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6073619 entries, 0 to 6073618
Data columns (total 14 columns):
 #   Column               Dtype         
---  ------               -----         
 0   date                 datetime64[ns]
 1   item_number          object        
 2   vendor_name          object        
 3   category_name        object        
 4   city                 object        
 5   county               object        
 6   bottle_volume_ml     int64         
 7   state_bottle_cost    float64       
 8   state_bottle_retail  float64       
 9   pack                 int64         
 10  sale_dollars         float64       
 11  bottles_sold         int64         
 12  volume_sold_liters   float64       
 13  item_description     object        
dtypes: datetime64[ns](1), float64(4), int64(3), object(6)
memory usage: 695.1+ MB


In [6]:
liquor_df.describe()

| | bottle_volume_ml | state_bottle_cost | state_bottle_retail | pack | sale_dollars | bottles_sold | volume_sold_liters |
|---|---|---|---|---|---|---|---|
| count | 6073619.0 | 6073619.0 | 6073619.0 | 6073619.0 | 6073619.0 | 6073619.0 | 6073619.0 |
| mean | 883.16 | 10.76 | 16.14 | 11.93 | 192.88 | 14.66 | 12.16 |
| std | 502.05 | 9.88 | 14.82 | 7.5 | 630.66 | 49.87 | 46.77 |
| min | 20.0 | 0.33 | 0.5 | 1.0 | 1.3 | 1.0 | 0.02 |
| 25% | 750.0 | 6.0 | 9.0 | 6.0 | 37.48 | 3.0 | 1.75 |
| 50% | 750.0 | 8.66 | 12.99 | 12.0 | 89.16 | 6.0 | 6.0 |
| 75% | 1000.0 | 13.0 | 19.5 | 12.0 | 180.0 | 12.0 | 10.5 |
| max | 6000.0 | 1871.2 | 2806.8 | 60.0 | 185248.8 | 7920.0 | 11812.5 |


## Set up BigQuery project and ingest the summarised data

In [5]:
client = bigquery.Client()


SyntaxError: invalid syntax (<ipython-input-5-30cba670a6e3>, line 2)
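
For reference, one way the client setup and ingestion step might look is sketched below. It is a sketch under stated assumptions, not the notebook's own code: the helper names `qualified_table_id` and `ingest_dataframe`, and the project, dataset and table names in the usage comment, are hypothetical. It also assumes Application Default Credentials are configured and that `pyarrow` is installed, which `load_table_from_dataframe` relies on.

```python
def qualified_table_id(project, dataset, table):
    """Build the `project.dataset.table` identifier that BigQuery expects."""
    return f"{project}.{dataset}.{table}"


def ingest_dataframe(df, table_id):
    """Load a pandas DataFrame into a BigQuery table and return its row count."""
    # Imported lazily so the helper above works without the client library.
    from google.cloud import bigquery

    # Picks up the project and credentials from the environment.
    client = bigquery.Client()

    # Overwrite the table if it already exists (an assumption; append may suit better).
    job_config = bigquery.LoadJobConfig(
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE
    )
    job = client.load_table_from_dataframe(df, table_id, job_config=job_config)
    job.result()  # block until the load job finishes

    return client.get_table(table_id).num_rows


# Hypothetical usage with placeholder names:
# table_id = qualified_table_id("my-project", "liquor_sales", "summary_sales")
# ingest_dataframe(liquor_df, table_id)
```

Passing the fully qualified table ID keeps the destination explicit instead of relying on the client's default project and dataset.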