# Create Training Data from BigQuery Public Dataset

The Iowa Liquor Sales data contains every wholesale purchase of liquor in the State of Iowa by retailers for sale to individuals since January 1, 2012. 

The State of Iowa controls the wholesale distribution of liquor intended for retail sale (off-premises consumption), which means this dataset offers a complete view of retail liquor consumption in the entire state. The dataset contains every wholesale order of liquor by all grocery stores, liquor stores, convenience stores, etc., with details about the store and location, the exact liquor brand and size, and the number of bottles ordered.

Since the project aims to develop an ML model for short-term demand forecasting, analysing the full history from 2012 would take too long and would not reflect recent trends in liquor consumption. Therefore, I extract only about three years of historical data, from 2018 onwards, to build the demand forecasting model.

## Data Sources

Iowa Liquor Retail Sales: hosted publicly on [Google BigQuery](https://console.cloud.google.com/marketplace/product/iowa-department-of-commerce/iowa-liquor-sales?project=australiarain&folder=&organizationId=)

## Revision History

- 04-15-2021: Started the project

In [1]:
import pandas as pd
from pathlib import Path
from datetime import datetime

## File Locations

In [2]:
today = datetime.today()
raw_data = Path.cwd().parent / "data" / "raw" / "all_sales.parquet"
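
The path above assumes that a `data/raw` directory already exists next to the notebook folder. As far as I know, pandas does not create missing parent directories when writing Parquet, so the save step can fail on a fresh checkout. A minimal sketch of a guard follows; `ensure_parent_dir` is a hypothetical helper name, not part of the original notebook, and the demo runs in a temporary directory so it does not touch the project tree.

```python
import tempfile
from pathlib import Path


def ensure_parent_dir(path: Path) -> Path:
    """Create the parent directory of `path` if it is missing, then return `path`."""
    # parents=True builds intermediate folders; exist_ok=True makes reruns harmless.
    path.parent.mkdir(parents=True, exist_ok=True)
    return path


# Demo in a throwaway directory (hypothetical layout mirroring data/raw).
with tempfile.TemporaryDirectory() as tmp:
    target = ensure_parent_dir(Path(tmp) / "data" / "raw" / "all_sales.parquet")
    # Only the directory is created; the Parquet file itself is written later.
    print(target.parent.is_dir(), target.exists())
```

In the notebook this would wrap the existing assignment, for example `raw_data = ensure_parent_dir(Path.cwd().parent / "data" / "raw" / "all_sales.parquet")`.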

## Create Training Dataset from BigQuery

In [3]:
from google.cloud import bigquery
client = bigquery.Client()

In [None]:
# Query the BigQuery public dataset for all sales records from 2018-01-01 onwards
sql = """
SELECT
  *
FROM
  `bigquery-public-data.iowa_liquor_sales.sales`
WHERE
  date >= "2018-01-01"
ORDER BY
  date;
  """
liquor_df = client.query(sql).to_dataframe()
liquor_df.head()
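
The query above hard-codes the `2018-01-01` cutoff. One way to keep the "last three years" window explicit is to derive the cutoff from a reference date and render the SQL from it. The sketch below is an assumption about how that could look, not the method used in this notebook: `cutoff_date` and `build_sales_query` are hypothetical helpers, and the reference date is fixed to the project start date so the result matches the query above.

```python
from datetime import date

TABLE = "bigquery-public-data.iowa_liquor_sales.sales"


def cutoff_date(reference: date, years: int = 3) -> date:
    """Return January 1 of the year that lies `years` before `reference`."""
    if years < 0:
        raise ValueError("years must be non-negative")
    return date(reference.year - years, 1, 1)


def build_sales_query(start: date, table: str = TABLE) -> str:
    """Render the extraction query for all rows dated on or after `start`."""
    # isoformat() on a date object always yields YYYY-MM-DD, so interpolating
    # the date is safe; `table` is a trusted constant here.
    return (
        "SELECT *\n"
        f"FROM `{table}`\n"
        f"WHERE date >= DATE '{start.isoformat()}'\n"
        "ORDER BY date"
    )


# Reference date taken from the revision history (project start).
start = cutoff_date(date(2021, 4, 15))
sql = build_sales_query(start)
print(sql)
```

The resulting string can be passed to `client.query(sql)` exactly as in the cell above. For values that come from user input, BigQuery query parameters would be the safer choice than string interpolation.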

In [None]:
liquor_df.dtypes

## Data Manipulation

In [None]:
# Convert the date column to pandas datetime64 (it may arrive as strings or Python date objects)
liquor_df['date'] = pd.to_datetime(liquor_df['date'])

## Save Output Files into Raw Data Directory

In [None]:
# Save the dataframe to a Parquet file for faster read/write and to retain data type information
liquor_df.to_parquet(raw_data)