# Create Training Data from BigQuery Public Dataset

The Iowa Liquor Sales data contains every wholesale purchase of liquor in the State of Iowa by retailers for sale to individuals since January 1, 2012. 

The State of Iowa controls the wholesale distribution of liquor intended for retail sale (off-premises consumption), which means this dataset offers a complete view of retail liquor consumption in the entire state. The dataset contains every wholesale order of liquor by all grocery stores, liquor stores, convenience stores, etc., with details about the store and location, the exact liquor brand and size, and the number of bottles ordered.

Because the project aims to develop an ML model for short-term demand forecasting, analysing the full history from 2012 would take too long and would not reflect recent trends in liquor consumption. I therefore extract only the data from 2018 onwards, roughly three years of history, to build the demand forecasting model.

## Data Sources

Iowa Liquor Retail Sales: hosted publicly on [Google BigQuery](https://console.cloud.google.com/marketplace/product/iowa-department-of-commerce/iowa-liquor-sales?project=australiarain&folder=&organizationId=)

## Revision History

- 04-15-2021: Started the project

In [1]:
import pandas as pd
from pathlib import Path
from datetime import datetime

## File Locations

In [2]:
today = datetime.today()
raw_data = Path.cwd().parent / "data" / "raw" / "all_sales.pkl"

## Create Training Dataset from BigQuery

In [3]:
from google.cloud import bigquery
client = bigquery.Client()

In [4]:
# Retrieve every sales row dated 2018-01-01 or later from the BigQuery public dataset (no aggregation is applied here)
sql = """
SELECT
  *
FROM
  `bigquery-public-data.iowa_liquor_sales.sales`
WHERE
  date >= "2018-01-01"
ORDER BY
  date;
  """
liquor_df = client.query(sql).to_dataframe()
liquor_df.head()

| | invoice_and_item_number | date | store_number | store_name | address | city | zip_code | store_location | county_number | county | ... | item_number | item_description | pack | bottle_volume_ml | state_bottle_cost | state_bottle_retail | bottles_sold | sale_dollars | volume_sold_liters | volume_sold_gallons |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | INV-09565300015 | 2018-01-02 | 5151 | IDA Liquor | 500, Hwy 175 | Ida Grove | 51445 | | 47 | IDA | ... | 89786 | Sauza Gold | 12 | 750 | 9.5 | 14.25 | 4 | 56.72 | 3.0 | 0.79 |
| 1 | INV-09556000113 | 2018-01-02 | 2524 | Hy-Vee Food Store / Dubuque | 3500 Dodge St | Dubuque | 52001 | | 31 | DUBUQUE | ... | 76042 | Midnight Moon Blackberry | 6 | 750 | 11.5 | 17.25 | 2 | 34.5 | 1.5 | 0.4 |
| 2 | INV-09555600025 | 2018-01-02 | 3632 | Wal-Mart 2004 / Dubuque | 4200 Dodge St | Dubuque | 52003 | POINT (-90.736955 42.489041) | 31 | DUBUQUE | ... | 42716 | Malibu Coconut Rum | 12 | 750 | 7.49 | 11.24 | 12 | 134.88 | 9.0 | 2.38 |
| 3 | INV-09556500061 | 2018-01-02 | 2465 | Sid's Beverage Shop | 2727 Dodge St | Dubuque | 52003 | POINT (-90.705328 42.491862) | 31 | DUBUQUE | ... | 43299 | Captain Morgan Coconut | 12 | 750 | 8.26 | 12.39 | 1 | 12.39 | 0.75 | 0.2 |
| 4 | INV-09560000001 | 2018-01-02 | 2587 | Hy-Vee Food Store / Johnston | 5750 Merle Hay Road | Johnston | 50131 | POINT (-93.697731 41.665172) | 77 | POLK | ... | 4367 | Balvenie Caribbean Cask 14yr | 6 | 750 | 39.96 | 59.94 | 2 | 119.88 | 1.5 | 0.4 |
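
The query above has no upper bound, so it returns everything from 2018-01-01 up to the day it is run. If a fixed three-year window is wanted, one option is a parameterised query. The sketch below is an assumption, not part of the original notebook: the helper names `window_bounds`, `build_sales_query` and `fetch_sales` are hypothetical, and the exclusive end date of 2021-01-01 is my reading of "3 years from 2018".

```python
from datetime import date
from typing import Tuple

TABLE = "bigquery-public-data.iowa_liquor_sales.sales"


def window_bounds(start: date, years: int) -> Tuple[date, date]:
    """Return (start, end), where end is exclusive and `years` after start.

    Note: date.replace raises ValueError if start is 29 February and the
    target year is not a leap year; a 1 January start avoids that case.
    """
    return start, start.replace(year=start.year + years)


def build_sales_query(table: str = TABLE) -> str:
    """Build the SQL text with named parameters instead of inline literals."""
    return (
        "SELECT *\n"
        f"FROM `{table}`\n"
        "WHERE date >= @start_date AND date < @end_date\n"
        "ORDER BY date"
    )


def fetch_sales(client, start: date, end: date):
    """Run the bounded query with an existing bigquery.Client (hypothetical helper)."""
    # Imported lazily so the two helpers above can be used without the library.
    from google.cloud import bigquery

    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("start_date", "DATE", start),
            bigquery.ScalarQueryParameter("end_date", "DATE", end),
        ]
    )
    return client.query(build_sales_query(), job_config=job_config).to_dataframe()


start, end = window_bounds(date(2018, 1, 1), years=3)
sql = build_sales_query()
```

With the `client` created earlier, `liquor_df = fetch_sales(client, start, end)` would take the place of the open-ended query. Passing the dates as query parameters keeps the SQL text fixed and avoids string formatting of date literals.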


In [5]:
liquor_df.dtypes

invoice_and_item_number     object
date                        object
store_number                object
store_name                  object
address                     object
city                        object
zip_code                    object
store_location              object
county_number               object
county                      object
category                    object
category_name               object
vendor_number               object
vendor_name                 object
item_number                 object
item_description            object
pack                         int64
bottle_volume_ml             int64
state_bottle_cost          float64
state_bottle_retail        float64
bottles_sold                 int64
sale_dollars               float64
volume_sold_liters         float64
volume_sold_gallons        float64
dtype: object
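
The `dtypes` listing does not show missing values, yet the preview earlier has blank `store_location` entries in its first two rows. A quick per-column count of missing values makes this visible. The sketch below runs on a hypothetical three-row frame copied from that preview, and assumes the blanks are loaded as `None`/`NaN`; on the real data the same two calls would be applied to `liquor_df`.

```python
import pandas as pd

# Hypothetical miniature of liquor_df, using values from the preview output.
# Assumption: blank store_location cells are missing values, not empty strings.
preview = pd.DataFrame(
    {
        "store_number": ["5151", "2524", "3632"],
        "store_location": [None, None, "POINT (-90.736955 42.489041)"],
        "bottles_sold": [4, 2, 12],
    }
)

# Number of missing values in each column.
missing_counts = preview.isna().sum()

# Share of rows without a store location.
missing_location_share = preview["store_location"].isna().mean()
```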

## Data Manipulation

In [6]:
# Convert the date column from object dtype to pandas datetime so it supports time-series operations
liquor_df['date'] = pd.to_datetime(liquor_df['date'])
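
To illustrate what the conversion enables, here is a minimal sketch on a hypothetical three-row frame. It assumes the BigQuery client returns `DATE` values as Python `date` objects (hence the `object` dtype shown earlier); the 2018-01-09 row is made up for the example. After conversion, the column can be grouped into daily totals, a typical first step for demand forecasting.

```python
import datetime as dt

import pandas as pd

# Hypothetical sample: dates stored as Python date objects, so dtype is object.
sample = pd.DataFrame(
    {
        "date": [dt.date(2018, 1, 2), dt.date(2018, 1, 2), dt.date(2018, 1, 9)],
        "bottles_sold": [4, 2, 12],
    }
)
dtype_before = sample["date"].dtype

# Same call as in the notebook cell above.
sample["date"] = pd.to_datetime(sample["date"])
is_datetime_after = pd.api.types.is_datetime64_any_dtype(sample["date"])

# Total bottles sold per calendar day.
daily_bottles = sample.groupby("date")["bottles_sold"].sum()
```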

## Save Output Files into Raw Data Directory

In [7]:
# Save the dataframe as a pickle (.pkl) file for faster read/write and to retain the column data types
liquor_df.to_pickle(raw_data)
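
`to_pickle` fails if the target directory does not exist, so it can help to create `data/raw` first. The sketch below is an assumption about how one might do that, using a temporary directory and a hypothetical two-row frame in place of the real project path and `liquor_df`. It also reads the file back to confirm that the dtypes survive the round trip.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for liquor_df, with values taken from the preview.
frame = pd.DataFrame(
    {
        "date": pd.to_datetime(["2018-01-02", "2018-01-02"]),
        "pack": [12, 6],
        "sale_dollars": [56.72, 34.5],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    # Mirrors the notebook's data/raw/all_sales.pkl layout under a temp root.
    target = Path(tmp) / "data" / "raw" / "all_sales.pkl"

    # Create the directory tree if it is missing; no error if it already exists.
    target.parent.mkdir(parents=True, exist_ok=True)

    frame.to_pickle(target)
    restored = pd.read_pickle(target)

# Pickle keeps the column dtypes, unlike a CSV round trip.
dtypes_preserved = restored.dtypes.equals(frame.dtypes)
```

In the notebook itself, the equivalent step would be `raw_data.parent.mkdir(parents=True, exist_ok=True)` before calling `liquor_df.to_pickle(raw_data)`.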