# Create Training Data from BigQuery Public Dataset

The Iowa Liquor Sales data contains every wholesale purchase of liquor in the State of Iowa by retailers for sale to individuals since January 1, 2012. 

The State of Iowa controls the wholesale distribution of liquor intended for retail sale (off-premises consumption), which means this dataset offers a complete view of retail liquor consumption in the entire state. The dataset contains every wholesale order of liquor by all grocery stores, liquor stores, convenience stores, etc., with details about the store and location, the exact liquor brand and size, and the number of bottles ordered.

Because the project aims to develop an ML model for short-term demand forecasting, analysing the full history from 2012 would take too long and would not reflect recent trends in liquor consumption. I therefore extract only the data from 2018 onwards (roughly three years at the time of writing) to build the demand forecasting model.

## Data Sources

Iowa Liquor Retail Sales: hosted publicly on [Google BigQuery](https://console.cloud.google.com/marketplace/product/iowa-department-of-commerce/iowa-liquor-sales?project=australiarain&folder=&organizationId=)

## Revision History

- 04-15-2021: Started the project

In [1]:
import pandas as pd
from pathlib import Path
from datetime import datetime

## File Locations

In [2]:
today = datetime.today()
raw_data = Path.cwd().parent / "data" / "raw" / "all_sales.parquet"
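
The Parquet write at the end of this notebook is likely to fail if the `data/raw` directory does not exist yet. A minimal sketch, assuming the same directory layout as above, that creates the folder before anything is written. The helper name `ensure_parent_dir` is hypothetical and not part of the original notebook.

```python
from pathlib import Path


def ensure_parent_dir(file_path: Path) -> Path:
    """Create the parent directory of file_path if it is missing, then return file_path."""
    # parents=True builds intermediate folders; exist_ok=True makes the call safe to repeat
    file_path.parent.mkdir(parents=True, exist_ok=True)
    return file_path
```

In this notebook it could wrap the existing assignment, for example `raw_data = ensure_parent_dir(Path.cwd().parent / "data" / "raw" / "all_sales.parquet")`.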

## Create Training Dataset from BigQuery

In [3]:
from google.cloud import bigquery
client = bigquery.Client()

In [4]:
# Retrieve all sales from 2018 onwards from the BigQuery public dataset
sql = """
SELECT
  *
FROM
  `bigquery-public-data.iowa_liquor_sales.sales`
WHERE
  date >= "2018-01-01"
ORDER BY
  date;
  """
liquor_df = client.query(sql).to_dataframe()
liquor_df.head()

| | invoice_and_item_number | date | store_number | store_name | address | city | zip_code | store_location | county_number | county | ... | item_number | item_description | pack | bottle_volume_ml | state_bottle_cost | state_bottle_retail | bottles_sold | sale_dollars | volume_sold_liters | volume_sold_gallons |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | INV-09550400072 | 2018-01-02 | 2549 | Hy-Vee Food Store / Indianola | 910 N Jefferson | Indianola | 50125 | | 91 | WARREN | ... | 18350 | Four Roses Single Barrel | 6 | 750 | 20.17 | 30.26 | 6 | 181.56 | 4.5 | 1.19 |
| 1 | INV-09567800094 | 2018-01-02 | 2505 | Hy-Vee Wine and Spirits / Boone | 1111 8TH ST | Boone | 50036 | POINT (-93.876159 42.06479800000001) | 8 | BOONE | ... | 43352 | Captain Morgan Pineapple | 12 | 750 | 8.26 | 12.39 | 2 | 24.78 | 1.5 | 0.4 |
| 2 | INV-09561800137 | 2018-01-02 | 2564 | Hy-Vee Food Store #4 / Waterloo | 4000 University | Waterloo | 50701 | POINT (-92.403843 42.505197) | 7 | BLACK HAWK | ... | 35418 | Burnett's Vodka 80 Prf | 6 | 1750 | 9.48 | 14.22 | 6 | 85.32 | 10.5 | 2.77 |
| 3 | INV-09554500039 | 2018-01-02 | 4844 | Iowa City Fast Break | 2580, Naples Ave | Iowa City | 52240 | POINT (-91.571064 41.632792) | 52 | JOHNSON | ... | 37993 | Smirnoff 80prf | 48 | 200 | 2.54 | 3.81 | 48 | 182.88 | 9.6 | 2.54 |
| 4 | INV-09558700018 | 2018-01-02 | 3691 | Target Store T-1791 / Urbandale | 11148 Plum Dr | Urbandale | 50322 | POINT (-93.769776 41.646972) | 77 | POLK | ... | 72913 | Captain Morgan Loconut | 6 | 750 | 9.06 | 13.59 | 6 | 81.0 | 4.5 | 1.19 |
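
The query above has no upper bound on `date`, so it returns everything from 2018 up to the latest available data, and the window grows each time the notebook is re-run. If a fixed three-year window is wanted (an assumption; the notebook does not state an end date), the SQL could be built with an exclusive end date, as in this sketch. The function name `build_sales_query` and its `years` parameter are hypothetical.

```python
from datetime import date


def build_sales_query(start: date, years: int = 3) -> str:
    """Return SQL selecting sales with start <= date < start + years."""
    # replace() raises ValueError if start is Feb 29 and the target year
    # is not a leap year; a start of January 1 is not affected.
    end = start.replace(year=start.year + years)
    return f"""
SELECT
  *
FROM
  `bigquery-public-data.iowa_liquor_sales.sales`
WHERE
  date >= "{start.isoformat()}"
  AND date < "{end.isoformat()}"
ORDER BY
  date;
"""


sql = build_sales_query(date(2018, 1, 1))
print(sql)
```

The resulting string can be passed to `client.query(sql).to_dataframe()` exactly as in the cell above.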


In [5]:
liquor_df.dtypes

invoice_and_item_number     object
date                        object
store_number                object
store_name                  object
address                     object
city                        object
zip_code                    object
store_location              object
county_number               object
county                      object
category                    object
category_name               object
vendor_number               object
vendor_name                 object
item_number                 object
item_description            object
pack                         int64
bottle_volume_ml             int64
state_bottle_cost          float64
state_bottle_retail        float64
bottles_sold                 int64
sale_dollars               float64
volume_sold_liters         float64
volume_sold_gallons        float64
dtype: object

## Data Manipulation

In [6]:
# Convert date column from string to DateTime
liquor_df['date'] = pd.to_datetime(liquor_df['date'])

## Save Output Files into Raw Data Directory

In [7]:
# Save the dataframe as a Parquet file for faster read/write and to retain data type information
liquor_df.to_parquet(raw_data)