# **Iowa Liquor Retail Sales BigQuery Dataset Transformation & Analysis**
---

**The dataset used here covers statewide wholesale liquor purchases by retailers in Iowa, USA, since January 1, 2012. It details orders from grocery stores, liquor stores, and convenience stores, including store locations, liquor brands, sizes, and quantities. The complete raw dataset of Iowa liquor retail sales contains more than 40 million records and can be downloaded from [this BigQuery link](https://console.cloud.google.com/marketplace/product/iowa-department-of-commerce/iowa-liquor-sales).**

**For this data transformation and analysis exercise, we use only the top 30,000 records from the dataset.**

---

## **Importing the dataset**

In [10]:
import pandas as pd

df = pd.read_csv('/content/BQ_liquor_sales_data.csv')
pd.set_option('display.max_columns', df.shape[1])
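
The `Unnamed: 0` column in the preview below is most likely a row index that was saved when the CSV was exported; that is an assumption, not something the notebook confirms. A minimal sketch of how such a column could be absorbed at read time, using a tiny in-memory stand-in (`sample_csv`, a hypothetical name) instead of the real file:

```python
import io
import pandas as pd

# Tiny stand-in for the CSV file; only three of the real columns are included.
sample_csv = io.StringIO(
    ",invoice_and_item_number,date,bottles_sold\n"
    "0,RINV-04433600124,2022-12-27,-84\n"
    "1,RINV-05415000024,2024-09-03,-24\n"
)

# index_col=0 uses the first (unlabelled) column as the row index,
# and parse_dates converts 'date' to a datetime dtype while reading.
sample_df = pd.read_csv(sample_csv, index_col=0, parse_dates=['date'])

print(sample_df.columns.tolist())
print(sample_df['date'].dtype)
```

Whether this matches how the notebook's `df` was actually prepared is not known; it only illustrates the `index_col` and `parse_dates` options of `pd.read_csv`.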

In [12]:
df.head(3)

Unnamed: 0,invoice_and_item_number,date,store_number,store_name,address,city,zip_code,store_location,county_number,county,category,category_name,vendor_number,vendor_name,item_number,item_description,pack,bottle_volume_ml,state_bottle_cost,state_bottle_retail,bottles_sold,sale_dollars,volume_sold_liters,volume_sold_gallons
0,RINV-04433600124,2022-12-27,5102,WILKIE LIQUORS,724 1ST STREET NE,MOUNT VERNON,52314.0,POINT(-91.41231 41.92012),,LINN,1012100.0,CANADIAN WHISKIES,260.0,DIAGEO AMERICAS,11297,CROWN ROYAL,12,1000,19.99,29.99,-84,-2519.16,-84.0,-22.19
1,RINV-05415000024,2024-09-03,3549,QUICKER LIQUOR STORE,1414 48TH ST,FORT MADISON,52627.0,POINT(-91.37319 40.62423),,LEE,1092100.0,IMPORTED DISTILLED SPIRITS SPECIALTY,434.0,LUXCO INC,75087,JUAREZ GOLD DSS,12,1000,5.09,7.64,-24,-183.36,-24.0,-6.34
2,RINV-04846600166,2023-09-12,2560,HY-VEE FOOD STORE (1396) / MARION,3600 BUSINESS HWY 151 EAST,MARION,52302.0,POINT(-91.572182976 42.037394006),,LINN,1031100.0,AMERICAN VODKAS,301.0,FIFTH GENERATION INC,38176,TITOS HANDMADE VODKA,12,750,10.0,15.0,-24,-360.0,-18.0,-4.75


## **Data Transformation**

**Some methods for preliminary checks are listed below. Run them one line at a time.**

In [None]:
'''Methods for preliminary checks'''

# df.shape              # o/p : (30000, 24)
# df.info()             # Shows df shape, col names, their non-null val count, & dtypes
# df.dtypes             # Col names & their dtypes
# df.describe()         # Statistical data about df's numeric cols
# df.axes               # List of row axis' and col axis' labels, in that order
# df.index              # List of labels in index col
# df.columns            # List of all col labels
# df.keys()             # List of all col labels
# df.index.name         # Label of index col
# df.index.names        # Labels of multi-col index, aka multi-index
# df.ndim               # No. of dimensions in df (2 here)
# df.memory_usage()     # Memory usage of each col in bytes
# df.select_dtypes(exclude = 'object')        # include/exclude cols of specified dtypes

**Checking dtypes of columns**

In [28]:
[df.dtypes]  # Enclosing in brackets for a compact o/p

[invoice_and_item_number            object
 date                       datetime64[ns]
 store_number                        int64
 store_name                         object
 address                            object
 city                               object
 zip_code                           object
 store_location                     object
 county_number                     float64
 county                             object
 category                          float64
 category_name                      object
 vendor_number                     float64
 vendor_name                        object
 item_number                         int64
 item_description                   object
 pack                                int64
 bottle_volume_ml                    int64
 state_bottle_cost                 float64
 state_bottle_retail               float64
 bottles_sold                        int64
 sale_dollars                      float64
 volume_sold_liters                float64
 volume_sol

**The** 'date' **column's dtype should be** datetime **so that we can extract various date components later on.**

In [None]:
df['date'] = pd.to_datetime(df['date'], format = '%Y-%m-%d')
df['date'].dtype  # o/p : dtype('<M8[ns]')
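
As an illustration of what the datetime dtype makes possible, here is a small sketch that extracts a few date components with the `.dt` accessor. The frame `dates` and the new column names (`year`, `month`, `weekday`) are hypothetical; the three dates are taken from the preview rows shown earlier.

```python
import pandas as pd

# Three dates from the preview rows, stored as strings at first.
dates = pd.DataFrame({'date': ['2022-12-27', '2024-09-03', '2023-09-12']})
dates['date'] = pd.to_datetime(dates['date'], format='%Y-%m-%d')

# The .dt accessor is only available on datetime-like columns.
dates['year'] = dates['date'].dt.year
dates['month'] = dates['date'].dt.month
dates['weekday'] = dates['date'].dt.day_name()

print(dates)
```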

**Identifying the number of unique values in each column**

In [27]:
[df.nunique()]  # Enclosing in brackets for a compact o/p

[invoice_and_item_number    30000
 date                        3236
 store_number                2399
 store_name                  2464
 address                     2388
 city                         438
 zip_code                     834
 store_location              4825
 county_number                 99
 county                        99
 category                      94
 category_name                 90
 vendor_number                161
 vendor_name                  218
 item_number                 3220
 item_description            3140
 pack                          19
 bottle_volume_ml              21
 state_bottle_cost           1400
 state_bottle_retail         1553
 bottles_sold                  29
 sale_dollars                2270
 volume_sold_liters            48
 volume_sold_gallons           49
 dtype: int64]
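
The counts above differ between columns that might be expected to pair one-to-one, for example 161 values of 'vendor_number' against 218 of 'vendor_name'. One possible explanation is that the same number appears with several name spellings. A hedged sketch of how to list such numbers, on a toy frame in which the second spelling of the vendor name is invented purely for illustration:

```python
import pandas as pd

# Toy data; 'DIAGEO AMERICAS INC' is a made-up variant spelling.
toy = pd.DataFrame({
    'vendor_number': [260, 260, 434, 301],
    'vendor_name': ['DIAGEO AMERICAS', 'DIAGEO AMERICAS INC',
                    'LUXCO INC', 'FIFTH GENERATION INC'],
})

# Count the distinct names recorded for each vendor number.
names_per_vendor = toy.groupby('vendor_number')['vendor_name'].nunique()

# Keep only the vendor numbers that map to more than one name.
ambiguous = names_per_vendor[names_per_vendor > 1]
print(ambiguous)
```

The same pattern could be applied to 'store_number' / 'store_name' or 'category' / 'category_name' on the real data.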

**Dropping unwanted columns**

In [29]:
df.drop(columns = ['volume_sold_gallons'], inplace = True)
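
The drop is safe only if 'volume_sold_gallons' adds nothing beyond 'volume_sold_liters'. A minimal sketch of a check one could run before dropping, assuming the column is a US-gallon conversion of the liters column (about 0.264172 gallons per liter) rounded to two decimals. The frame `preview` is hypothetical and holds the three preview rows shown earlier.

```python
import numpy as np
import pandas as pd

# Values copied from the three preview rows.
preview = pd.DataFrame({
    'volume_sold_liters': [-84.0, -24.0, -18.0],
    'volume_sold_gallons': [-22.19, -6.34, -4.75],
})

GALLONS_PER_LITER = 0.264172  # assumed conversion factor (US liquid gallon)

# Compare the stored gallons with the converted liters,
# allowing a small tolerance for rounding in the source data.
converted = preview['volume_sold_liters'] * GALLONS_PER_LITER
is_redundant = np.isclose(converted, preview['volume_sold_gallons'], atol=0.01)

print(is_redundant.all())
```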

**Checking the number of nulls in each column:**

In [31]:
[df.isnull().sum()]  # Enclosing in brackets for a compact o/p

[invoice_and_item_number       0
 date                          0
 store_number                  0
 store_name                    0
 address                     114
 city                        114
 zip_code                    114
 store_location             2170
 county_number              8362
 county                      190
 category                     12
 category_name                22
 vendor_number                 0
 vendor_name                   0
 item_number                   0
 item_description              0
 pack                          0
 bottle_volume_ml              0
 state_bottle_cost             1
 state_bottle_retail           1
 bottles_sold                  0
 sale_dollars                  1
 volume_sold_liters            0
 dtype: int64]
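
Raw null counts are easier to compare as a share of all rows. A small sketch on a toy frame (the values, including the county number, are hypothetical) that expresses nulls per column as a percentage:

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values and deliberately missing entries.
toy = pd.DataFrame({
    'county': ['LINN', 'LEE', np.nan, 'LINN'],
    'county_number': [np.nan, np.nan, np.nan, 57.0],
})

# isnull().mean() gives the fraction of nulls per column;
# multiply by 100 for a percentage and sort the worst columns first.
null_share = toy.isnull().mean().mul(100).round(2).sort_values(ascending=False)
print(null_share)
```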

In [38]:
# How many rows contain exactly 6 nulls?

# Count the nulls in each row
null_counts = df.isnull().sum(axis=1)

# Count the rows where the number of nulls equals 6
num_rows_with_6_nulls = (null_counts == 6).sum()

print(f"The number of rows with 6 null values is: {num_rows_with_6_nulls}")


114
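
Rather than testing one null count at a time, the full distribution of nulls per row can be tabulated in a single step. A minimal sketch on a hypothetical three-row frame:

```python
import numpy as np
import pandas as pd

# Toy frame: row 0 has no nulls, row 1 has three, row 2 has one.
toy = pd.DataFrame({
    'address': ['724 1ST STREET NE', np.nan, np.nan],
    'city': ['MOUNT VERNON', np.nan, 'MARION'],
    'zip_code': ['52314', np.nan, '52302'],
})

# Nulls per row, then how many rows share each null count.
nulls_per_row = toy.isnull().sum(axis=1)
distribution = nulls_per_row.value_counts().sort_index()
print(distribution)
```

On the real data, the entry for 6 in such a distribution would correspond to the count computed in the cell above.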

**Column renaming**

In [26]:
'''TO BE RUN'''
df.rename(columns = {'invoice_and_item_number': 'bill_number'}, inplace = True)