## Data Cleaning

### Author: Lucia Zou
### Contact: lucia.zouyuebca@gmail.com 
### Date: Oct.18th, 2023

### Table of Contents

- [Introduction](#Introduction)
- [Problem Statement](#Problem-Statement)
- [Methods](#Methods)
- [Feature Overview](#Feature-Overview)
- [Data Import](#Data-Import)
- [Assessment](#Assessment)
- [Handling Missing Values](#Handling-Missing-Values)
- [Extra Cleaning](#Extra-Cleaning)

###

### Introduction

This dataset compiles the spirits purchase data of Iowa Class “E” liquor license holders, detailing products and purchase dates from January 2021 to January 2022. It supports analysis of individual product sales at the store level and gives a view of total spirits sales in Iowa. The objective of this project is to use machine learning techniques to forecast future sales for these stores, so that store owners can increase revenue and optimize their operations.

###

### Problem Statement

Can we use machine learning models to predict future sales of liquor?

###

### Methods

We used Python with pandas for data manipulation, matplotlib for data visualization, and scikit-learn (sklearn) for data preprocessing and machine learning. We worked in Jupyter Notebook, which supports interactive programming and inline visualization.

###

### Feature Overview

1. invoice_and_item_number: concatenated invoice and line number associated with the liquor order. This provides a unique identifier for the individual liquor products included in the store order.

2. date: date of order.

3. store_number: unique number assigned to the store that ordered the liquor.

4. store_name: name of the store that ordered the liquor.

5. address: address of the store that ordered the liquor.

6. city: city where the store that ordered the liquor is located.

7. zip_code: zip code where the store that ordered the liquor is located.

8. store_location: location of the store that ordered the liquor. The address, city, state and zip code are geocoded to provide geographic coordinates.

9. county_number: Iowa county number for the county where the store that ordered the liquor is located.

10. county: county where the store that ordered the liquor is located.

11. category: category code associated with the liquor ordered.

12. category_name: category of the liquor ordered.

13. vendor_number: the vendor number of the company for the brand of liquor ordered.

14. vendor_name: the vendor name of the company for the brand of liquor ordered.

15. item_number: item number for the individual liquor product ordered.

16. item_description: description of the individual liquor product ordered.

17. pack: the number of bottles in a case for the liquor ordered.

18. bottle_volume_ml: volume of each liquor bottle ordered in milliliters.

19. state_bottle_cost: the amount that the Alcoholic Beverages Division paid for each bottle of liquor ordered.

20. state_bottle_retail: the amount the store paid for each bottle of liquor ordered.

21. bottles_sold: the number of bottles of liquor ordered by the store.

22. sale_dollars: total cost of liquor order (number of bottles multiplied by the state bottle retail).

23. volume_sold_liters: total volume of liquor ordered in liters. 

24. volume_sold_gallons: total volume of liquor ordered in gallons. 
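
The definition of sale_dollars above (bottles multiplied by the state bottle retail) suggests a consistency check that can be run before modeling. The following is a minimal sketch, assuming the column names listed above. The three sample rows are copied from outputs shown later in this notebook, and the 0.01 tolerance is an assumption chosen for illustration.

```python
import pandas as pd

# Three sample rows copied from this notebook's outputs (not the full dataset).
sample = pd.DataFrame({
    "item_description": [
        "Fireball Cinnamon Whiskey",
        "Tequila Rose Liqueur",
        "Margaritaville Gold Tequila",
    ],
    "bottle_volume_ml": [100, 750, 750],
    "state_bottle_retail": [1.35, 17.25, 11.27],
    "bottles_sold": [48, 4, 12],
    "sale_dollars": [64.80, 69.00, 108.00],
    "volume_sold_liters": [4.8, 3.0, 9.0],
})

# Recompute the derived columns from their documented definitions.
expected_dollars = sample["bottles_sold"] * sample["state_bottle_retail"]
expected_liters = sample["bottles_sold"] * sample["bottle_volume_ml"] / 1000

# Flag rows that deviate by more than the tolerance (0.01 is an assumption).
sample["dollars_mismatch"] = (sample["sale_dollars"] - expected_dollars).abs() > 0.01
sample["liters_mismatch"] = (sample["volume_sold_liters"] - expected_liters).abs() > 0.01

print(sample[["item_description", "dollars_mismatch", "liters_mismatch"]])
```

In this sample the third row is flagged: 12 bottles at 11.27 would give 135.24, while the recorded sale_dollars is 108.00, so the definition does not hold exactly for every row.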

###

### Data Import

In [1]:
import numpy as np
import pandas as pd

import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objs as go
import matplotlib.pyplot as plt
import seaborn as sns

from statsmodels.api import tsa 
import statsmodels.api as sm

In [2]:
df=pd.read_csv('Iowa.csv')
df.head()

Unnamed: 0,invoice_and_item_number,date,store_number,store_name,address,city,zip_code,store_location,county_number,county,...,item_number,item_description,pack,bottle_volume_ml,state_bottle_cost,state_bottle_retail,bottles_sold,sale_dollars,volume_sold_liters,volume_sold_gallons
0,INV-33179700135,2021-01-04,2576,Hy-Vee Wine and Spirits / Storm Lake,1250 N Lake St,Storm Lake,50588.0,POINT (-95.200758 42.65318400000001),11.0,BUENA VIST,...,64870,Fireball Cinnamon Whiskey,48,100,0.9,1.35,48,64.8,4.8,1.26
1,INV-33196200106,2021-01-04,2649,Hy-Vee #3 / Dubuque,400 Locust St,Dubuque,52001.0,POINT (-90.666497 42.49721900000001),31.0,DUBUQUE,...,65200,Tequila Rose Liqueur,12,750,11.5,17.25,4,69.0,3.0,0.79
2,INV-33184300011,2021-01-04,2539,Hy-Vee Food Store / Iowa Falls,640 S. Oak,Iowa Falls,50126.0,POINT (-93.262364 42.508752),42.0,HARDIN,...,38008,Smirnoff 80prf PET,6,1750,14.75,22.13,6,132.78,10.5,2.77
3,INV-33184100015,2021-01-04,4024,Wal-Mart 1546 / Iowa Falls,840 S Oak,Iowa Falls,50126.0,POINT (-93.262446 42.503407),42.0,HARDIN,...,36648,Caliber Vodka,12,750,3.31,4.97,12,59.64,9.0,2.37
4,INV-33174200025,2021-01-04,5385,Vine Food & Liquor,2704 Vine St.,West Des Moines,50265.0,POINT (-93.741511 41.580206),77.0,POLK,...,4626,Buchanan Deluxe 12YR,12,750,20.99,31.49,2,62.98,1.5,0.39


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2805307 entries, 0 to 2805306
Data columns (total 24 columns):
 #   Column                   Dtype  
---  ------                   -----  
 0   invoice_and_item_number  object 
 1   date                     object 
 2   store_number             int64  
 3   store_name               object 
 4   address                  object 
 5   city                     object 
 6   zip_code                 float64
 7   store_location           object 
 8   county_number            float64
 9   county                   object 
 10  category                 float64
 11  category_name            object 
 12  vendor_number            float64
 13  vendor_name              object 
 14  item_number              int64  
 15  item_description         object 
 16  pack                     int64  
 17  bottle_volume_ml         int64  
 18  state_bottle_cost        float64
 19  state_bottle_retail      float64
 20  bottles_sold             int64  
 21  sale_dol

#### The original data has 2805307 rows and 24 columns

In [4]:
df.shape

(2805307, 24)
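
The `df.info()` output above reports `date` as `object` (plain strings) and `zip_code` as `float64`. A minimal sketch of converting both, assuming ISO-formatted dates as in the rows printed earlier. The nullable `Int64` dtype is an assumption for illustration and is not used in the original notebook.

```python
import pandas as pd

# Small stand-in for df with the dtypes reported by df.info().
demo = pd.DataFrame({
    "date": ["2021-01-04", "2021-11-15"],
    "zip_code": [50588.0, None],
})

# Parse ISO dates so they can later be resampled or used as a time index.
demo["date"] = pd.to_datetime(demo["date"], format="%Y-%m-%d")

# The nullable integer dtype keeps missing zip codes as <NA> instead of forcing floats.
demo["zip_code"] = demo["zip_code"].astype("Int64")

print(demo.dtypes)
```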

In [5]:
df.nunique()

invoice_and_item_number    2805307
date                           326
store_number                  1954
store_name                    1952
address                       1936
city                           442
zip_code                       479
store_location                2570
county_number                   99
county                         119
category                        57
category_name                   57
vendor_number                  222
vendor_name                    227
item_number                   4545
item_description              4032
pack                            21
bottle_volume_ml                25
state_bottle_cost             1312
state_bottle_retail           1315
bottles_sold                   397
sale_dollars                 11136
volume_sold_liters             772
volume_sold_gallons            761
dtype: int64

###

### Assessment

In [6]:
df.describe()

Unnamed: 0,store_number,zip_code,county_number,category,vendor_number,item_number,pack,bottle_volume_ml,state_bottle_cost,state_bottle_retail,bottles_sold,sale_dollars,volume_sold_liters,volume_sold_gallons
count,2805307.0,2805165.0,2805165.0,2805307.0,2805303.0,2805307.0,2805307.0,2805307.0,2805307.0,2805307.0,2805307.0,2805307.0,2805307.0,2805307.0
mean,4137.796,51243.74,57.12714,1055993.0,287.1833,54561.97,11.94263,821.8059,11.325,16.98823,11.85837,162.4497,9.383131,2.473141
std,1265.248,990.655,27.32875,103932.2,141.5585,91853.49,7.846285,525.4814,11.03231,16.54628,35.66817,587.1846,41.20727,10.88601
min,2106.0,50002.0,1.0,1011000.0,33.0,258.0,1.0,20.0,0.66,0.99,1.0,1.34,0.02,0.0
25%,2644.0,50314.0,31.0,1012100.0,205.0,27125.0,6.0,375.0,6.0,9.0,3.0,42.0,1.5,0.39
50%,4186.0,51040.0,62.0,1031200.0,260.0,39916.0,12.0,750.0,8.99,13.49,6.0,88.92,4.5,1.18
75%,5244.0,52302.0,77.0,1062500.0,420.0,65251.0,12.0,1000.0,14.0,21.0,12.0,166.5,10.5,2.77
max,9049.0,57222.0,99.0,1901200.0,978.0,999995.0,120.0,5250.0,2098.94,3148.41,13200.0,250932.0,13200.0,3487.07


In [7]:
#describe() prints the values in scientific (e) notation, so switch to fixed-point display
pd.set_option('display.float_format', lambda x: '{:.6f}'.format(x))

df.describe()

Unnamed: 0,store_number,zip_code,county_number,category,vendor_number,item_number,pack,bottle_volume_ml,state_bottle_cost,state_bottle_retail,bottles_sold,sale_dollars,volume_sold_liters,volume_sold_gallons
count,2805307.0,2805165.0,2805165.0,2805307.0,2805303.0,2805307.0,2805307.0,2805307.0,2805307.0,2805307.0,2805307.0,2805307.0,2805307.0,2805307.0
mean,4137.795966,51243.743009,57.127137,1055993.194862,287.183324,54561.974709,11.942627,821.805932,11.324997,16.988232,11.858369,162.449687,9.383131,2.473141
std,1265.247896,990.654998,27.328748,103932.156424,141.558502,91853.494064,7.846285,525.481377,11.03231,16.546277,35.668169,587.18464,41.207272,10.886014
min,2106.0,50002.0,1.0,1011000.0,33.0,258.0,1.0,20.0,0.66,0.99,1.0,1.34,0.02,0.0
25%,2644.0,50314.0,31.0,1012100.0,205.0,27125.0,6.0,375.0,6.0,9.0,3.0,42.0,1.5,0.39
50%,4186.0,51040.0,62.0,1031200.0,260.0,39916.0,12.0,750.0,8.99,13.49,6.0,88.92,4.5,1.18
75%,5244.0,52302.0,77.0,1062500.0,420.0,65251.0,12.0,1000.0,14.0,21.0,12.0,166.5,10.5,2.77
max,9049.0,57222.0,99.0,1901200.0,978.0,999995.0,120.0,5250.0,2098.94,3148.41,13200.0,250932.0,13200.0,3487.07


#### Key statistics:
The mean number of bottles per order is 11.86 (about 12), and the 75th percentile is also 12, so three quarters of orders contain 12 bottles or fewer; the median is 6. The maximum of 13,200 bottles is far above this range and is a potential outlier. Sales show the same pattern: the mean is 162.45 US dollars, close to the 75th percentile of 166.50, while the median is only 88.92, which indicates a right-skewed distribution pulled up by a few very large orders. The maximum sale of 250,932 US dollars is another potential outlier.
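
One way to make "potential outlier" concrete is Tukey's IQR rule. This is a swapped-in illustration, not a method used in the original notebook. The sketch below runs on a small synthetic series, and the 1.5 multiplier is the conventional default.

```python
import pandas as pd

# Synthetic bottle counts (illustrative only); 13200 mimics the reported maximum.
bottles = pd.Series([1, 3, 3, 6, 6, 12, 12, 12, 24, 13200], name="bottles_sold")

# Quartiles and interquartile range.
q1, q3 = bottles.quantile([0.25, 0.75])
iqr = q3 - q1

# Values above the upper fence are flagged as potential outliers.
upper_fence = q3 + 1.5 * iqr
outliers = bottles[bottles > upper_fence]

print(f"Q1={q1}, Q3={q3}, upper fence={upper_fence}")
print(outliers)
```

With the quartiles reported by `describe()` for bottles_sold (3 and 12), the same rule gives an upper fence of 12 + 1.5 × 9 = 25.5 bottles.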

###

### Handling Missing Values

#### We now move on to data cleaning, starting with a check for null values.

In [8]:
df.isnull().sum()                                                                                

invoice_and_item_number         0
date                            0
store_number                    0
store_name                      0
address                       142
city                          142
zip_code                      142
store_location             330335
county_number                 142
county                        142
category                        0
category_name                   0
vendor_number                   4
vendor_name                     4
item_number                     0
item_description                0
pack                            0
bottle_volume_ml                0
state_bottle_cost               0
state_bottle_retail             0
bottles_sold                    0
sale_dollars                    0
volume_sold_liters              0
volume_sold_gallons             0
dtype: int64

In [9]:
#check % of missing values
df.isnull().sum()/len(df)*100

invoice_and_item_number    0.000000
date                       0.000000
store_number               0.000000
store_name                 0.000000
address                    0.005062
city                       0.005062
zip_code                   0.005062
store_location            11.775360
county_number              0.005062
county                     0.005062
category                   0.000000
category_name              0.000000
vendor_number              0.000143
vendor_name                0.000143
item_number                0.000000
item_description           0.000000
pack                       0.000000
bottle_volume_ml           0.000000
state_bottle_cost          0.000000
state_bottle_retail        0.000000
bottles_sold               0.000000
sale_dollars               0.000000
volume_sold_liters         0.000000
volume_sold_gallons        0.000000
dtype: float64
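
The two checks above (counts and percentages) can be combined into one table that lists only the affected columns. A minimal sketch on a small stand-in DataFrame; the names `demo` and `missing` are hypothetical.

```python
import pandas as pd

# Small stand-in for df with a few missing values.
demo = pd.DataFrame({
    "store_name": ["A", "B", "C", "D"],
    "address": ["1 Main St", None, "3 Oak Ave", None],
    "vendor_name": ["V1", "V2", None, "V4"],
})

# Count and share of nulls per column, side by side.
missing = pd.DataFrame({
    "n_missing": demo.isnull().sum(),
    "pct_missing": demo.isnull().mean() * 100,
})

# Keep only columns that actually contain nulls, largest share first.
missing = missing[missing["n_missing"] > 0].sort_values("pct_missing", ascending=False)
print(missing)
```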

###

Since the dataset already contains the address, zip code, county name, and city name of each store, the geocoded store_location column is unnecessary for our mapping visualizations. It is also the column with by far the most missing values (about 11.8%). Moreover, because the dataset covers only Iowa, the map's geographical scope is naturally limited. We have therefore decided to drop store_location from our analysis; the remaining location columns provide enough information for our mapping needs.

In [10]:
#drop the column; drop() returns a new DataFrame, so the original df is kept as a backup
df1= df.drop(columns='store_location')
#sanity check
df1.head()

Unnamed: 0,invoice_and_item_number,date,store_number,store_name,address,city,zip_code,county_number,county,category,...,item_number,item_description,pack,bottle_volume_ml,state_bottle_cost,state_bottle_retail,bottles_sold,sale_dollars,volume_sold_liters,volume_sold_gallons
0,INV-33179700135,2021-01-04,2576,Hy-Vee Wine and Spirits / Storm Lake,1250 N Lake St,Storm Lake,50588.0,11.0,BUENA VIST,1081600.0,...,64870,Fireball Cinnamon Whiskey,48,100,0.9,1.35,48,64.8,4.8,1.26
1,INV-33196200106,2021-01-04,2649,Hy-Vee #3 / Dubuque,400 Locust St,Dubuque,52001.0,31.0,DUBUQUE,1081200.0,...,65200,Tequila Rose Liqueur,12,750,11.5,17.25,4,69.0,3.0,0.79
2,INV-33184300011,2021-01-04,2539,Hy-Vee Food Store / Iowa Falls,640 S. Oak,Iowa Falls,50126.0,42.0,HARDIN,1031100.0,...,38008,Smirnoff 80prf PET,6,1750,14.75,22.13,6,132.78,10.5,2.77
3,INV-33184100015,2021-01-04,4024,Wal-Mart 1546 / Iowa Falls,840 S Oak,Iowa Falls,50126.0,42.0,HARDIN,1031100.0,...,36648,Caliber Vodka,12,750,3.31,4.97,12,59.64,9.0,2.37
4,INV-33174200025,2021-01-04,5385,Vine Food & Liquor,2704 Vine St.,West Des Moines,50265.0,77.0,POLK,1012200.0,...,4626,Buchanan Deluxe 12YR,12,750,20.99,31.49,2,62.98,1.5,0.39


Check null values again

In [11]:
rows_with_null = df1[df1.isnull().any(axis=1)]
rows_with_null.head(10)

Unnamed: 0,invoice_and_item_number,date,store_number,store_name,address,city,zip_code,county_number,county,category,...,item_number,item_description,pack,bottle_volume_ml,state_bottle_cost,state_bottle_retail,bottles_sold,sale_dollars,volume_sold_liters,volume_sold_gallons
101913,INV-41995800001,2021-11-15,6229,Beer Thirty Storm Lake / Storm Lake,,,,,,1701100.0,...,21228,1792 Full Proof,6,750,24.0,36.0,1,36.0,0.75,0.19
102454,INV-41995800027,2021-11-15,6229,Beer Thirty Storm Lake / Storm Lake,,,,,,1031100.0,...,38178,Titos Handmade Vodka,6,1750,19.0,28.5,360,10260.0,630.0,166.42
102802,INV-41995800003,2021-11-15,6229,Beer Thirty Storm Lake / Storm Lake,,,,,,1701100.0,...,16906,Bookers Bourbon,6,750,42.5,63.75,1,63.75,0.75,0.19
103286,INV-41995800011,2021-11-15,6229,Beer Thirty Storm Lake / Storm Lake,,,,,,1062100.0,...,42426,Barcelo Anejo,12,750,7.45,11.18,12,134.16,9.0,2.37
103602,INV-41995800014,2021-11-15,6229,Beer Thirty Storm Lake / Storm Lake,,,,,,1022100.0,...,89198,Jose Cuervo Especial Reposado,6,1750,21.5,32.25,6,193.5,10.5,2.77
103719,INV-41995800006,2021-11-15,6229,Beer Thirty Storm Lake / Storm Lake,,,,,,1011200.0,...,920374,SOOH Old Forester 1910,6,750,27.99,41.99,6,251.94,4.5,1.18
103735,INV-41995800022,2021-11-15,6229,Beer Thirty Storm Lake / Storm Lake,,,,,,1082000.0,...,65258,Jagermeister Liqueur,6,1750,26.05,39.08,2,78.16,3.5,0.92
103837,INV-41995800017,2021-11-15,6229,Beer Thirty Storm Lake / Storm Lake,,,,,,1012200.0,...,5349,Johnnie Walker Red,6,1750,26.5,39.75,6,238.5,10.5,2.77
104024,INV-41995800024,2021-11-15,6229,Beer Thirty Storm Lake / Storm Lake,,,,,,1071100.0,...,63753,Salvadors Peach Margarita,6,1750,6.0,9.0,6,54.0,10.5,2.77
104268,INV-41995800026,2021-11-15,6229,Beer Thirty Storm Lake / Storm Lake,,,,,,1081600.0,...,100413,Fireball Cinnamon Whiskey Party Bucket,1,50,51.6,77.4,1,77.4,0.05,0.01


###

We use the store name as the key for filling the null values in address, city, zip code, county number, and county. These attributes all describe a store's location, so they can be recovered from other rows that share the same store name.
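
The idea of filling location fields from other rows of the same store can also be expressed as a single grouped operation. This is a minimal sketch under stated assumptions, not the approach the notebook takes below: the store names and addresses are made up, and the rule "use the most recent earlier value, else the next later one" is an assumption.

```python
import pandas as pd

# Hypothetical rows: each store has one row with missing location fields.
demo = pd.DataFrame({
    "date": pd.to_datetime(
        ["2021-10-25", "2021-11-15", "2021-11-22", "2021-01-07", "2021-03-11"]
    ),
    "store_name": ["Store A", "Store A", "Store A", "Store B", "Store B"],
    "address": ["1 Old Rd", None, "2 New Rd", "9 Elm St", None],
    "city": ["Town A", None, "Town A", "Town B", None],
})
location_cols = ["address", "city"]

# Order by date, then within each store carry the last known value forward
# and, for gaps at the start, the next known value backward.
demo = demo.sort_values("date")
filled = demo.groupby("store_name")[location_cols].transform(
    lambda s: s.ffill().bfill()
)
demo[location_cols] = filled
demo = demo.sort_index()

print(demo)
```

The grouped version avoids writing a separate fill dictionary per store, at the cost of hiding which address was chosen when a store has more than one.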

###

Check how many unique store names have null values

In [12]:
unique_store_names_with_null = rows_with_null['store_name'].unique()
unique_store_names_with_null

array(['Beer Thirty Storm Lake / Storm Lake', 'Northside Liquor',
       'Benz Distributing', 'Hy-Vee #3 / BDI / Des Moines',
       'Hy-Vee Food Store / Mount Ayr', 'Bootlegging Barzinis',
       "Casey's General Store #2811 / Springville"], dtype=object)

###

In [13]:
#Before filling in the null values, make a copy of df1 just in case
df2 = df1.copy()

Check whether the store name maps to a single store number

In [14]:
beer_liquor_rows = df1[df1['store_name'] == 'Beer Thirty Storm Lake / Storm Lake'].groupby('store_number')['store_number'].unique()
beer_liquor_rows

store_number
6229    [6229]
Name: store_number, dtype: object

In [15]:
#check what information needs to be filled in, using this store's rows that have an address
beer_liquor_rows_with_noNull = df1[df1['store_name'] == 'Beer Thirty Storm Lake / Storm Lake'].dropna(subset=['address'])
beer_liquor_rows_with_noNull

Unnamed: 0,invoice_and_item_number,date,store_number,store_name,address,city,zip_code,county_number,county,category,...,item_number,item_description,pack,bottle_volume_ml,state_bottle_cost,state_bottle_retail,bottles_sold,sale_dollars,volume_sold_liters,volume_sold_gallons
155678,INV-42238900003,2021-11-22,6229,Beer Thirty Storm Lake / Storm Lake,208 E Milwaukee,Storm Lake,50588.000000,11.000000,BUENA VIST,1011200.000000,...,920376,SOOH Old Forester 1920,6,750,30.980000,46.470000,6,278.820000,4.500000,1.180000
156362,INV-42238900031,2021-11-22,6229,Beer Thirty Storm Lake / Storm Lake,208 E Milwaukee,Storm Lake,50588.000000,11.000000,BUENA VIST,1081200.000000,...,68036,Baileys Original Irish Cream,12,750,16.490000,24.740000,12,296.880000,9.000000,2.370000
156538,INV-42238900001,2021-11-22,6229,Beer Thirty Storm Lake / Storm Lake,208 E Milwaukee,Storm Lake,50588.000000,11.000000,BUENA VIST,1011300.000000,...,21589,EH Taylor Jr Single Barrel,6,750,33.340000,50.010000,1,50.010000,0.750000,0.190000
156710,INV-42238900013,2021-11-22,6229,Beer Thirty Storm Lake / Storm Lake,208 E Milwaukee,Storm Lake,50588.000000,11.000000,BUENA VIST,1022200.000000,...,89496,Margaritaville Gold Tequila,12,750,7.510000,11.270000,12,108.000000,9.000000,2.370000
157071,INV-42238900026,2021-11-22,6229,Beer Thirty Storm Lake / Storm Lake,208 E Milwaukee,Storm Lake,50588.000000,11.000000,BUENA VIST,1012100.000000,...,11348,Seagrams VO Canadian Whiskey PET,6,1750,10.450000,15.680000,6,94.080000,10.500000,2.770000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2664153,INV-41317000005,2021-10-25,6229,Beer Thirty Storm Lake / Storm Lake,205 E 7th Street,Storm Lake,50588.000000,11.000000,BUENA VIST,1031100.000000,...,36903,McCormick 80prf Vodka,48,200,1.130000,1.700000,48,81.600000,9.600000,2.530000
2664213,INV-41316800024,2021-10-25,6229,Beer Thirty Storm Lake / Storm Lake,205 E 7th Street,Storm Lake,50588.000000,11.000000,BUENA VIST,1082000.000000,...,64876,Drambuie Liqueur,12,750,23.000000,34.500000,3,103.500000,2.250000,0.590000
2664358,INV-41316800030,2021-10-25,6229,Beer Thirty Storm Lake / Storm Lake,205 E 7th Street,Storm Lake,50588.000000,11.000000,BUENA VIST,1011600.000000,...,27125,Templeton Rye 6YR,6,750,22.750000,34.130000,6,204.780000,4.500000,1.180000
2664438,INV-41316800010,2021-10-25,6229,Beer Thirty Storm Lake / Storm Lake,205 E 7th Street,Storm Lake,50588.000000,11.000000,BUENA VIST,1022200.000000,...,89841,Hornitos Anejo,12,750,16.000000,24.000000,12,288.000000,9.000000,2.370000


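The same store number is associated with two different addresses (208 E Milwaukee and 205 E 7th Street); the city, zip code, county number, and county are identical for both. To keep the dataset consistent, we fill the missing fields with a single address for this store, 205 E 7th Street. In the rows displayed above, 205 E 7th Street appears on invoices dated 2021-10-25 and 208 E Milwaukee on invoices dated 2021-11-22, so it is worth confirming against the invoice dates which address is the current one.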
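
A store number that maps to more than one address can also be detected programmatically, together with the address on its most recent invoice. A minimal sketch, using four rows copied from outputs elsewhere in this notebook; the variable names are hypothetical.

```python
import pandas as pd

# Rows copied from this notebook's outputs (stores 2658 and 3773).
demo = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-08", "2021-11-12", "2021-01-04", "2021-11-10"]),
    "store_number": [2658, 2658, 3773, 3773],
    "address": ["402 S Hayes St", "201 N Fillmore St", "501 7th Ave SE", "501 7th Ave SE"],
})

# Count distinct non-null addresses per store number.
address_counts = demo.groupby("store_number")["address"].nunique()
multi_address_stores = address_counts[address_counts > 1].index.tolist()

# For each store, take the address on its most recent dated row.
latest_address = (
    demo.dropna(subset=["address"])
        .sort_values("date")
        .groupby("store_number")["address"]
        .last()
)

print(multi_address_stores)
print(latest_address)
```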

In [16]:
fill_values = {
    'address': '205 E 7th Street',
    'city': 'Storm Lake',
    'zip_code': 50588.0,
    'county_number': 11.0,
    'county': 'BUENA VIST'
}
# Fill null values only in this store's rows (vectorized; avoids looping over all 2.8M rows)
is_store = df2['store_name'] == 'Beer Thirty Storm Lake / Storm Lake'
for column, value in fill_values.items():
    df2.loc[is_store & df2[column].isnull(), column] = value

In [17]:
#Sanity check: the filled columns (not store_name) should contain no nulls for this store
beer_null_values = df2.loc[df2['store_name'] == 'Beer Thirty Storm Lake / Storm Lake', list(fill_values)].isnull().any().any()
beer_null_values

False

###

Repeat those steps for Northside Liquor

In [18]:
#check how many store numbers Northside Liquor has
northside_liquor_rows = df1[df1['store_name'] == 'Northside Liquor'].groupby('store_number')['store_number'].unique()
northside_liquor_rows

store_number
5251    [5251]
Name: store_number, dtype: object

In [19]:
#check what information needs to be filled in
northside_liquor_rows = df1[df1['store_name'] == 'Northside Liquor']
northside_liquor_rows

Unnamed: 0,invoice_and_item_number,date,store_number,store_name,address,city,zip_code,county_number,county,category,...,item_number,item_description,pack,bottle_volume_ml,state_bottle_cost,state_bottle_retail,bottles_sold,sale_dollars,volume_sold_liters,volume_sold_gallons
31725,INV-33285700023,2021-01-07,5251,Northside Liquor,1303 North Federal,Mason City,50401.000000,17.000000,CERRO GORD,1022100.000000,...,89197,Jose Cuervo Especial Reposado,12,1000,13.000000,19.500000,12,234.000000,12.000000,3.170000
31738,INV-33285700065,2021-01-07,5251,Northside Liquor,1303 North Federal,Mason City,50401.000000,17.000000,CERRO GORD,1012400.000000,...,15641,Jameson Caskmates IPA,24,375,6.490000,9.740000,24,233.760000,9.000000,2.370000
32049,INV-33285800001,2021-01-07,5251,Northside Liquor,1303 North Federal,Mason City,50401.000000,17.000000,CERRO GORD,1091300.000000,...,76414,Ole Smoky Moonshine Pickles,6,750,12.500000,18.750000,18,337.500000,13.500000,3.560000
32106,INV-33285700044,2021-01-07,5251,Northside Liquor,1303 North Federal,Mason City,50401.000000,17.000000,CERRO GORD,1071100.000000,...,63345,On the Rocks Cocktails Tres Gen Jalapeno Pinea...,12,375,6.000000,9.000000,12,108.000000,4.500000,1.180000
32300,INV-33285900002,2021-01-07,5251,Northside Liquor,1303 North Federal,Mason City,50401.000000,17.000000,CERRO GORD,1012200.000000,...,5318,Johnnie Walker Double Black,6,750,24.050000,36.080000,6,216.480000,4.500000,1.180000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1807161,INV-34951900031,2021-03-11,5251,Northside Liquor,1303 North Federal,Mason City,50401.000000,17.000000,CERRO GORD,1071100.000000,...,59159,1800 Ultimate Raspberry Margarita,6,1750,10.040000,15.060000,6,90.360000,10.500000,2.770000
1807187,INV-34951900038,2021-03-11,5251,Northside Liquor,1303 North Federal,Mason City,50401.000000,17.000000,CERRO GORD,1012100.000000,...,11774,Black Velvet,24,375,3.070000,4.610000,48,221.280000,18.000000,4.750000
1807530,INV-34951900004,2021-03-11,5251,Northside Liquor,1303 North Federal,Mason City,50401.000000,17.000000,CERRO GORD,1701100.000000,...,21946,Wild Turkey Kentucky Spirit,6,750,30.000000,45.000000,1,45.000000,0.750000,0.190000
1807545,INV-34951900026,2021-03-11,5251,Northside Liquor,1303 North Federal,Mason City,50401.000000,17.000000,CERRO GORD,1031100.000000,...,37348,Phillips Vodka,6,1750,7.600000,11.400000,6,68.400000,10.500000,2.770000


In [20]:
# Define the values to fill nulls
fill_values = {
    'address': '1303 North Federal',
    'city': 'Mason City',
    'zip_code': 50401.0,
    'county_number': 17.0,
    'county': 'CERRO GORD'
}

# Fill null values only where 'store_name' is 'Northside Liquor' (vectorized, no row loop)
is_store = df2['store_name'] == 'Northside Liquor'
for column, value in fill_values.items():
    df2.loc[is_store & df2[column].isnull(), column] = value

In [21]:
#Sanity check: the filled columns (not store_name) should contain no nulls for this store
North_null_values = df2.loc[df2['store_name'] == 'Northside Liquor', list(fill_values)].isnull().any().any()
North_null_values

False
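
Because the same fill-and-check steps are repeated for every store, they can be wrapped in one function. The helper below, `fill_store_nulls`, is hypothetical and not part of the original notebook. It is a minimal sketch that modifies the DataFrame in place and returns the number of nulls left in the filled columns.

```python
import pandas as pd

def fill_store_nulls(df, store_name, fill_values):
    """Fill nulls in the given columns for one store; return the remaining null count."""
    is_store = df["store_name"] == store_name
    for column, value in fill_values.items():
        # Only touch cells that belong to this store and are currently null.
        df.loc[is_store & df[column].isnull(), column] = value
    # Sanity check on the filled columns, not on store_name itself.
    return int(df.loc[is_store, list(fill_values)].isnull().sum().sum())

# Small stand-in for df2; the "Other Store" row must stay untouched.
demo = pd.DataFrame({
    "store_name": ["Northside Liquor", "Northside Liquor", "Other Store"],
    "address": ["1303 North Federal", None, None],
    "city": ["Mason City", None, None],
    "zip_code": [50401.0, None, None],
})

remaining = fill_store_nulls(
    demo,
    "Northside Liquor",
    {"address": "1303 North Federal", "city": "Mason City", "zip_code": 50401.0},
)
print(remaining)
print(demo)
```

Each of the remaining stores would then need only one call with its own `fill_values` dictionary.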

###

Repeat those steps for Benz Distributing

In [22]:
Benz_liquor_rows = df2[df2['store_name'] == 'Benz Distributing'].groupby('store_number')['store_number'].unique()
Benz_liquor_rows

store_number
3773    [3773]
Name: store_number, dtype: object

In [23]:
Benz_liquor_rows = df2[df2['store_name'] == 'Benz Distributing']
Benz_liquor_rows

Unnamed: 0,invoice_and_item_number,date,store_number,store_name,address,city,zip_code,county_number,county,category,...,item_number,item_description,pack,bottle_volume_ml,state_bottle_cost,state_bottle_retail,bottles_sold,sale_dollars,volume_sold_liters,volume_sold_gallons
239,INV-33170600014,2021-01-04,3773,Benz Distributing,501 7th Ave SE,Cedar Rapids,52401.000000,57.000000,LINN,1041100.000000,...,29515,Bowling & Burch Gin,6,750,20.000000,30.000000,2,60.000000,1.500000,0.390000
378,INV-33170700043,2021-01-04,3773,Benz Distributing,501 7th Ave SE,Cedar Rapids,52401.000000,57.000000,LINN,1081300.000000,...,75210,Kinky Pink,6,750,10.000000,15.000000,30,450.000000,22.500000,5.940000
414,INV-33170600043,2021-01-04,3773,Benz Distributing,501 7th Ave SE,Cedar Rapids,52401.000000,57.000000,LINN,1011200.000000,...,27369,Cedar Ridge Port Cask Finished Bourbon,6,750,25.500000,38.250000,2,76.500000,1.500000,0.390000
425,INV-33170700031,2021-01-04,3773,Benz Distributing,501 7th Ave SE,Cedar Rapids,52401.000000,57.000000,LINN,1081100.000000,...,67527,Kahlua Coffee,12,1000,15.990000,23.990000,12,287.880000,12.000000,3.170000
507,INV-33170600021,2021-01-04,3773,Benz Distributing,501 7th Ave SE,Cedar Rapids,52401.000000,57.000000,LINN,1011100.000000,...,16026,Hatozaki Finest Japanese Whisky,6,750,23.150000,34.730000,2,69.460000,1.500000,0.390000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2784551,INV-41849100033,2021-11-10,3773,Benz Distributing,501 7th Ave SE,Cedar Rapids,52401.000000,57.000000,LINN,1091300.000000,...,76413,Ole Smoky Butter Pecan Moonshine,6,750,12.500000,18.750000,3,56.250000,2.250000,0.590000
2784686,INV-41849100004,2021-11-10,3773,Benz Distributing,501 7th Ave SE,Cedar Rapids,52401.000000,57.000000,LINN,1701100.000000,...,17917,Elijah Craig Barrel Proof,6,750,35.000000,52.500000,6,315.000000,4.500000,1.180000
2784836,INV-41849200053,2021-11-10,3773,Benz Distributing,501 7th Ave SE,Cedar Rapids,52401.000000,57.000000,LINN,1082000.000000,...,64645,Domaine de Canton,6,750,17.500000,26.250000,6,157.500000,4.500000,1.180000
2784844,INV-41849200011,2021-11-10,3773,Benz Distributing,501 7th Ave SE,Cedar Rapids,52401.000000,57.000000,LINN,1062200.000000,...,43127,Bacardi Superior,12,1000,9.500000,14.250000,24,342.000000,24.000000,6.340000


In [24]:
fill_values = {
    'address': '501 7th Ave SE',
    'city': 'Cedar Rapids',
    'zip_code': 52401.0,
    'county_number': 57.0,
    'county': 'LINN'
}

# Fill null values only in this store's rows (vectorized, no row loop)
is_store = df2['store_name'] == 'Benz Distributing'
for column, value in fill_values.items():
    df2.loc[is_store & df2[column].isnull(), column] = value

In [25]:
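#Sanity check: the filled columns (not store_name) should contain no nulls for this store
Benz_null_values = df2.loc[df2['store_name'] == 'Benz Distributing', list(fill_values)].isnull().any().any()
Benz_null_values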

False

###

Repeat those steps for Hy-Vee #3 / BDI / Des Moines

In [26]:
Hy3_liquor_rows = df2[df2['store_name'] == 'Hy-Vee #3 / BDI / Des Moines'].groupby('store_number')['store_number'].unique()
Hy3_liquor_rows

store_number
2633    [2633]
Name: store_number, dtype: object

In [27]:
Hy3_liquor_rows = df2[df2['store_name'] == 'Hy-Vee #3 / BDI / Des Moines']
Hy3_liquor_rows

Unnamed: 0,invoice_and_item_number,date,store_number,store_name,address,city,zip_code,county_number,county,category,...,item_number,item_description,pack,bottle_volume_ml,state_bottle_cost,state_bottle_retail,bottles_sold,sale_dollars,volume_sold_liters,volume_sold_gallons
87,INV-33169200066,2021-01-04,2633,Hy-Vee #3 / BDI / Des Moines,3221 SE 14th St,Des Moines,50320.000000,77.000000,POLK,1081200.000000,...,68049,Baileys Vanilla Cinnamon,12,750,16.490000,24.740000,12,296.880000,9.000000,2.370000
96,INV-33169200117,2021-01-04,2633,Hy-Vee #3 / BDI / Des Moines,3221 SE 14th St,Des Moines,50320.000000,77.000000,POLK,1062200.000000,...,43127,Bacardi Superior,12,1000,9.500000,14.250000,36,513.000000,36.000000,9.510000
380,INV-33169200051,2021-01-04,2633,Hy-Vee #3 / BDI / Des Moines,3221 SE 14th St,Des Moines,50320.000000,77.000000,POLK,1062500.000000,...,43050,Bacardi Dragon Berry,12,1000,9.500000,14.250000,24,342.000000,24.000000,6.340000
503,INV-33169200091,2021-01-04,2633,Hy-Vee #3 / BDI / Des Moines,3221 SE 14th St,Des Moines,50320.000000,77.000000,POLK,1081400.000000,...,82627,Dekuyper Cherry Pucker,12,1000,7.870000,11.810000,12,141.720000,12.000000,3.170000
586,INV-33169200061,2021-01-04,2633,Hy-Vee #3 / BDI / Des Moines,3221 SE 14th St,Des Moines,50320.000000,77.000000,POLK,1032200.000000,...,64914,Ciroc Mango,12,750,16.490000,24.740000,12,296.880000,9.000000,2.370000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2793582,INV-41877300005,2021-11-11,2633,Hy-Vee #3 / BDI / Des Moines,3221 SE 14th St,Des Moines,50320.000000,77.000000,POLK,1071100.000000,...,59037,Desert Island Long Island Iced Tea Cocktail,12,1000,4.360000,6.540000,24,156.960000,24.000000,6.340000
2793617,INV-41878700045,2021-11-11,2633,Hy-Vee #3 / BDI / Des Moines,3221 SE 14th St,Des Moines,50320.000000,77.000000,POLK,1032200.000000,...,34787,Stolichnaya Blueberi,12,1000,15.970000,23.960000,12,287.520000,12.000000,3.170000
2793675,INV-41878700069,2021-11-11,2633,Hy-Vee #3 / BDI / Des Moines,3221 SE 14th St,Des Moines,50320.000000,77.000000,POLK,1062500.000000,...,44419,Cruzan Black Cherry,12,750,7.000000,10.500000,12,126.000000,9.000000,2.370000
2793698,INV-41878700063,2021-11-11,2633,Hy-Vee #3 / BDI / Des Moines,3221 SE 14th St,Des Moines,50320.000000,77.000000,POLK,1062400.000000,...,43338,Captain Morgan Original Spiced,6,1750,18.000000,27.000000,6,162.000000,10.500000,2.770000


In [28]:
fill_values = {
    'address': '3221 SE 14th St',
    'city': 'Des Moines',
    'zip_code': 50320.0,
    'county_number': 77.0,
    'county': 'POLK'
}

# Fill null values only in this store's rows (vectorized, no row loop)
is_store = df2['store_name'] == 'Hy-Vee #3 / BDI / Des Moines'
for column, value in fill_values.items():
    df2.loc[is_store & df2[column].isnull(), column] = value

In [29]:
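#Sanity check: the filled columns (not store_name) should contain no nulls for this store
Hy3_null_values = df2.loc[df2['store_name'] == 'Hy-Vee #3 / BDI / Des Moines', list(fill_values)].isnull().any().any()
Hy3_null_values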

False

###

Repeat those steps for Hy-Vee Food Store / Mount Ayr

In [30]:
HyF_liquor_rows = df2[df2['store_name'] == 'Hy-Vee Food Store / Mount Ayr'].groupby('store_number')['store_number'].unique()
HyF_liquor_rows

store_number
2658    [2658]
Name: store_number, dtype: object

In [31]:
HyF_liquor_rows = df2[df2['store_name'] == 'Hy-Vee Food Store / Mount Ayr']
HyF_liquor_rows

Unnamed: 0,invoice_and_item_number,date,store_number,store_name,address,city,zip_code,county_number,county,category,...,item_number,item_description,pack,bottle_volume_ml,state_bottle_cost,state_bottle_retail,bottles_sold,sale_dollars,volume_sold_liters,volume_sold_gallons
38711,INV-33301900010,2021-01-08,2658,Hy-Vee Food Store / Mount Ayr,402 S Hayes St,Mount Ayr,50854.000000,80.000000,RINGGOLD,1062400.000000,...,43244,Captain Morgan 100prf Spiced Rum,12,750,10.490000,15.740000,12,188.880000,9.000000,2.370000
38965,INV-33301900033,2021-01-08,2658,Hy-Vee Food Store / Mount Ayr,402 S Hayes St,Mount Ayr,50854.000000,80.000000,RINGGOLD,1012100.000000,...,11588,Black Velvet Reserve,6,1750,12.990000,19.490000,6,116.940000,10.500000,2.770000
39186,INV-33301900022,2021-01-08,2658,Hy-Vee Food Store / Mount Ayr,402 S Hayes St,Mount Ayr,50854.000000,80.000000,RINGGOLD,1081600.000000,...,64865,Fireball Cinnamon Whiskey PET,12,750,8.980000,13.470000,12,161.640000,9.000000,2.370000
39550,INV-33301900020,2021-01-08,2658,Hy-Vee Food Store / Mount Ayr,402 S Hayes St,Mount Ayr,50854.000000,80.000000,RINGGOLD,1011400.000000,...,74801,Jack Daniels Tennessee Apple,12,750,15.570000,23.360000,2,46.720000,1.500000,0.390000
39623,INV-33301900004,2021-01-08,2658,Hy-Vee Food Store / Mount Ayr,402 S Hayes St,Mount Ayr,50854.000000,80.000000,RINGGOLD,1031100.000000,...,36904,McCormick 80prf Vodka PET,24,375,1.800000,2.700000,24,64.800000,9.000000,2.370000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2804424,INV-41946200013,2021-11-12,2658,Hy-Vee Food Store / Mount Ayr,201 N Fillmore St,Mount Ayr,50854.000000,80.000000,RINGGOLD,1011200.000000,...,17206,Cedar Ridge Bourbon,6,750,18.100000,27.150000,6,162.900000,4.500000,1.180000
2804467,INV-41946200008,2021-11-12,2658,Hy-Vee Food Store / Mount Ayr,201 N Fillmore St,Mount Ayr,50854.000000,80.000000,RINGGOLD,1011100.000000,...,87026,Skrewball Peanut Butter Whiskey,6,750,18.500000,27.750000,6,166.500000,4.500000,1.180000
2804554,INV-41946200022,2021-11-12,2658,Hy-Vee Food Store / Mount Ayr,201 N Fillmore St,Mount Ayr,50854.000000,80.000000,RINGGOLD,1701100.000000,...,100636,Jim Beam Mini 12 Seasons,6,50,6.500000,9.750000,1,9.750000,0.050000,0.010000
2804558,INV-41946200012,2021-11-12,2658,Hy-Vee Food Store / Mount Ayr,201 N Fillmore St,Mount Ayr,50854.000000,80.000000,RINGGOLD,1011200.000000,...,19066,Jim Beam,12,750,11.000000,16.500000,12,198.000000,9.000000,2.370000


In [32]:
# Two addresses appear for this store in the rows above; fill with the most recent one
fill_values = {
    'address': '201 N Fillmore St',
    'city': 'Mount Ayr',
    'zip_code': 50854.0,
    'county_number': 80.0,
    'county': 'RINGGOLD'
}

for index, row in df2.iterrows():
    if row['store_name'] == 'Hy-Vee Food Store / Mount Ayr':
        for column in fill_values.keys():
            if pd.isnull(row[column]):
                df2.at[index, column] = fill_values[column]

###

Repeat the same steps for Bootlegging Barzinis.

In [33]:
Boot_liquor_rows = df2[df2['store_name'] == 'Bootlegging Barzinis'].groupby('store_number')['store_number'].unique()
Boot_liquor_rows

store_number
6035    [6035]
Name: store_number, dtype: object

In [34]:
Boot_liquor_rows = df2[df2['store_name'] == 'Bootlegging Barzinis']
Boot_liquor_rows

Unnamed: 0,invoice_and_item_number,date,store_number,store_name,address,city,zip_code,county_number,county,category,...,item_number,item_description,pack,bottle_volume_ml,state_bottle_cost,state_bottle_retail,bottles_sold,sale_dollars,volume_sold_liters,volume_sold_gallons
31760,INV-33272300036,2021-01-07,6035,Bootlegging Barzinis,412 First Ave,Coralville,52241.000000,52.000000,JOHNSON,1082000.000000,...,65144,Il Tramonto Limoncello,6,750,10.500000,15.750000,6,94.500000,4.500000,1.180000
31835,INV-33272300041,2021-01-07,6035,Bootlegging Barzinis,412 First Ave,Coralville,52241.000000,52.000000,JOHNSON,1011400.000000,...,87043,Ole Smoky Peanut Butter Whiskey,6,750,10.000000,15.000000,6,90.000000,4.500000,1.180000
31843,INV-33272300029,2021-01-07,6035,Bootlegging Barzinis,412 First Ave,Coralville,52241.000000,52.000000,JOHNSON,1082000.000000,...,64876,Drambuie Liqueur,12,750,23.000000,34.500000,3,103.500000,2.250000,0.590000
31905,INV-33272100006,2021-01-07,6035,Bootlegging Barzinis,412 First Ave,Coralville,52241.000000,52.000000,JOHNSON,1012300.000000,...,994972,Glendronach 12YR Original,6,750,32.500000,48.750000,12,585.000000,9.000000,2.370000
31908,INV-33272200001,2021-01-07,6035,Bootlegging Barzinis,412 First Ave,Coralville,52241.000000,52.000000,JOHNSON,1082000.000000,...,64776,Cointreau Liqueur,12,750,19.990000,29.990000,12,359.880000,9.000000,2.370000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2774379,INV-41815900006,2021-11-09,6035,Bootlegging Barzinis,412 First Ave,Coralville,52241.000000,52.000000,JOHNSON,1082100.000000,...,966067,Toschi Nocello,6,750,17.670000,26.510000,24,636.240000,18.000000,4.750000
2774464,INV-41816100075,2021-11-09,6035,Bootlegging Barzinis,412 First Ave,Coralville,52241.000000,52.000000,JOHNSON,1011600.000000,...,101296,WhistlePig 6YR PiggyBack Rye,4,50,49.410000,74.120000,1,74.120000,0.050000,0.010000
2774686,INV-41816100027,2021-11-09,6035,Bootlegging Barzinis,412 First Ave,Coralville,52241.000000,52.000000,JOHNSON,1031200.000000,...,41989,UV Cake,12,750,7.000000,10.500000,3,31.500000,2.250000,0.590000
2775382,INV-41816100003,2021-11-09,6035,Bootlegging Barzinis,412 First Ave,Coralville,52241.000000,52.000000,JOHNSON,1701100.000000,...,25459,Slipknot Limited Edition Anniversary Iowa Labe...,6,750,25.000000,37.500000,6,225.000000,4.500000,1.180000


In [35]:
fill_values = {
    'address': '412 First Ave',
    'city': 'Coralville',
    'zip_code': 52241.0,
    'county_number': 52.0,
    'county': 'JOHNSON'
}

for index, row in df2.iterrows():
    if row['store_name'] == 'Bootlegging Barzinis':
        for column in fill_values.keys():
            if pd.isnull(row[column]):
                df2.at[index, column] = fill_values[column]

###

Repeat the same steps for Casey's General Store #2811 / Springville.

In [36]:
Casey_liquor_rows = df2[df2['store_name'] == "Casey's General Store #2811 / Springville"].groupby('store_number')['store_number'].unique()
Casey_liquor_rows

store_number
4934    [4934]
Name: store_number, dtype: object

In [37]:
Casey_liquor_rows = df2[df2['store_name'] == "Casey's General Store #2811 / Springville"]
Casey_liquor_rows

Unnamed: 0,invoice_and_item_number,date,store_number,store_name,address,city,zip_code,county_number,county,category,...,item_number,item_description,pack,bottle_volume_ml,state_bottle_cost,state_bottle_retail,bottles_sold,sale_dollars,volume_sold_liters,volume_sold_gallons
236815,INV-37221200007,2021-06-05,4934,Casey's General Store #2811 / Springville,605 6th St S,Springville,52336.000000,57.000000,LINN,1081600.000000,...,64865,Fireball Cinnamon Whiskey PET,12,750,8.980000,13.470000,12,161.640000,9.000000,2.370000
237326,INV-37221200003,2021-06-05,4934,Casey's General Store #2811 / Springville,605 6th St S,Springville,52336.000000,57.000000,LINN,1081300.000000,...,75230,Kinky Red,6,750,10.000000,15.000000,6,90.000000,4.500000,1.180000
238233,INV-37221300005,2021-06-05,4934,Casey's General Store #2811 / Springville,605 6th St S,Springville,52336.000000,57.000000,LINN,1062400.000000,...,43338,Captain Morgan Original Spiced,6,1750,18.000000,27.000000,6,162.000000,10.500000,2.770000
239212,INV-37221300003,2021-06-05,4934,Casey's General Store #2811 / Springville,605 6th St S,Springville,52336.000000,57.000000,LINN,1062200.000000,...,43125,Bacardi Superior PET,12,750,8.260000,12.390000,12,148.680000,9.000000,2.370000
239762,INV-37221300014,2021-06-05,4934,Casey's General Store #2811 / Springville,605 6th St S,Springville,52336.000000,57.000000,LINN,1062400.000000,...,43024,Admiral Nelson Spiced,24,375,2.990000,4.490000,24,107.760000,9.000000,2.370000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2803344,INV-41952100015,2021-11-12,4934,Casey's General Store #2811 / Springville,605 6th St S,Springville,52336.000000,57.000000,LINN,1081600.000000,...,65013,Fireball Cinnamon Whiskey Mini Sleeve,12,50,4.300000,6.450000,24,154.800000,1.200000,0.310000
2803590,INV-41952100006,2021-11-12,4934,Casey's General Store #2811 / Springville,605 6th St S,Springville,52336.000000,57.000000,LINN,1011100.000000,...,25616,Seagrams 7 Crown PET Flask,12,750,7.500000,11.250000,3,33.750000,2.250000,0.590000
2804277,INV-41952100003,2021-11-12,4934,Casey's General Store #2811 / Springville,605 6th St S,Springville,52336.000000,57.000000,LINN,1012100.000000,...,10805,Crown Royal Regal Apple,24,375,8.000000,12.000000,6,72.000000,2.250000,0.590000
2804573,INV-41952100012,2021-11-12,4934,Casey's General Store #2811 / Springville,605 6th St S,Springville,52336.000000,57.000000,LINN,1081600.000000,...,64866,Fireball Cinnamon Whiskey,12,750,9.000000,13.500000,6,81.000000,4.500000,1.180000


In [38]:
fill_values = {
    'address': '605 6th St S',
    'city': 'Springville',
    'zip_code': 52336.0,
    'county_number': 57.0,
    'county': 'LINN'
}

for index, row in df2.iterrows():
    if row['store_name'] == "Casey's General Store #2811 / Springville":
        for column in fill_values.keys():
            if pd.isnull(row[column]):
                df2.at[index, column] = fill_values[column]
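
The fill loops above differ only in the store name and the fill values, so they could be folded into one helper. The sketch below is a hypothetical refactoring, not the code that was run for this notebook: `fill_store_nulls` is an assumed name, the demo frame is made up, and the helper uses a boolean mask with `DataFrame.loc` instead of `iterrows`, which is usually much faster on a frame with millions of rows.

```python
import pandas as pd


def fill_store_nulls(df, store_name, fill_values):
    """Fill nulls in selected columns for one store, modifying df in place.

    Hypothetical helper: only rows whose store_name matches exactly are
    touched, and only cells that are currently null are overwritten.
    """
    store_mask = df['store_name'] == store_name
    for column, value in fill_values.items():
        df.loc[store_mask & df[column].isnull(), column] = value
    return df


# Small made-up frame to show the behaviour
demo = pd.DataFrame({
    'store_name': ['Bootlegging Barzinis', 'Bootlegging Barzinis', 'Other Store'],
    'city': ['Coralville', None, None],
    'county': [None, 'JOHNSON', None],
})
fill_store_nulls(demo, 'Bootlegging Barzinis',
                 {'city': 'Coralville', 'county': 'JOHNSON'})
print(demo)
```

Because the match on `store_name` is exact, a stray space in the name would silently match nothing, so it may be worth checking `store_mask.sum()` before filling.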

###

After running all the fill loops, we check the null counts for every column.

In [39]:
df2.isnull().sum()

invoice_and_item_number    0
date                       0
store_number               0
store_name                 0
address                    0
city                       0
zip_code                   0
county_number              0
county                     0
category                   0
category_name              0
vendor_number              4
vendor_name                4
item_number                0
item_description           0
pack                       0
bottle_volume_ml           0
state_bottle_cost          0
state_bottle_retail        0
bottles_sold               0
sale_dollars               0
volume_sold_liters         0
volume_sold_gallons        0
dtype: int64

Only 8 null values remain: 4 in vendor_number and 4 in vendor_name. We will try to fill these in as well.
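
With 23 columns, the full `isnull().sum()` listing is mostly zeros. A minimal sketch of a filter that keeps only the columns still containing nulls is shown here; `columns_with_nulls` is a hypothetical helper name and the demo frame is made up for illustration.

```python
import pandas as pd


def columns_with_nulls(df):
    """Return null counts for the columns that still contain nulls."""
    null_counts = df.isnull().sum()
    return null_counts[null_counts > 0]


# Made-up frame: two rows are missing vendor information
demo = pd.DataFrame({
    'store_number': [2633, 6035, 5251],
    'vendor_number': [421.0, None, None],
    'vendor_name': ['SAZERAC COMPANY INC', None, None],
})
remaining = columns_with_nulls(demo)
print(remaining)
```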

###

In [40]:
#Run this code to see all columns
pd.set_option('display.max_columns', None)

In [41]:
df2.head()

Unnamed: 0,invoice_and_item_number,date,store_number,store_name,address,city,zip_code,county_number,county,category,category_name,vendor_number,vendor_name,item_number,item_description,pack,bottle_volume_ml,state_bottle_cost,state_bottle_retail,bottles_sold,sale_dollars,volume_sold_liters,volume_sold_gallons
0,INV-33179700135,2021-01-04,2576,Hy-Vee Wine and Spirits / Storm Lake,1250 N Lake St,Storm Lake,50588.0,11.0,BUENA VIST,1081600.0,Whiskey Liqueur,421.0,SAZERAC COMPANY INC,64870,Fireball Cinnamon Whiskey,48,100,0.9,1.35,48,64.8,4.8,1.26
1,INV-33196200106,2021-01-04,2649,Hy-Vee #3 / Dubuque,400 Locust St,Dubuque,52001.0,31.0,DUBUQUE,1081200.0,Cream Liqueurs,300.0,McCormick Distilling Co.,65200,Tequila Rose Liqueur,12,750,11.5,17.25,4,69.0,3.0,0.79
2,INV-33184300011,2021-01-04,2539,Hy-Vee Food Store / Iowa Falls,640 S. Oak,Iowa Falls,50126.0,42.0,HARDIN,1031100.0,American Vodkas,260.0,DIAGEO AMERICAS,38008,Smirnoff 80prf PET,6,1750,14.75,22.13,6,132.78,10.5,2.77
3,INV-33184100015,2021-01-04,4024,Wal-Mart 1546 / Iowa Falls,840 S Oak,Iowa Falls,50126.0,42.0,HARDIN,1031100.0,American Vodkas,55.0,SAZERAC NORTH AMERICA,36648,Caliber Vodka,12,750,3.31,4.97,12,59.64,9.0,2.37
4,INV-33174200025,2021-01-04,5385,Vine Food & Liquor,2704 Vine St.,West Des Moines,50265.0,77.0,POLK,1012200.0,Scotch Whiskies,260.0,DIAGEO AMERICAS,4626,Buchanan Deluxe 12YR,12,750,20.99,31.49,2,62.98,1.5,0.39


In [42]:
#Get the rows where the vendor name or vendor number is missing
null_vendor_rows = df2[df2['vendor_name'].isnull() | df2['vendor_number'].isnull()]
null_vendor_rows

Unnamed: 0,invoice_and_item_number,date,store_number,store_name,address,city,zip_code,county_number,county,category,category_name,vendor_number,vendor_name,item_number,item_description,pack,bottle_volume_ml,state_bottle_cost,state_bottle_retail,bottles_sold,sale_dollars,volume_sold_liters,volume_sold_gallons
337624,INV-35139200005,2021-03-18,5251,Northside Liquor,1303 North Federal,Mason City,50401.0,17.0,CERRO GORD,1032100.0,Imported Vodkas,,,965108,Grey Goose VX,6,1000,31.34,80.0,18,1440.0,18.0,4.75
459143,INV-33740700003,2021-01-25,3773,Benz Distributing,501 7th Ave SE,Cedar Rapids,52401.0,57.0,LINN,1012400.0,Irish Whiskies,,,915574,Connemara 12 Year,6,750,14.66,56.25,6,337.5,4.5,1.18
465554,INV-33728600005,2021-01-25,2633,Hy-Vee #3 / BDI / Des Moines,3221 SE 14th St,Des Moines,50320.0,77.0,POLK,1022200.0,100% Agave Tequila,,,988100,Monte Alban Silver,12,750,12.98,15.56,12,186.72,9.0,2.37
1397930,INV-35973800003,2021-04-20,6035,Bootlegging Barzinis,412 First Ave,Coralville,52241.0,52.0,JOHNSON,1012400.0,Irish Whiskies,,,915574,Connemara 12YR,6,750,30.84,56.25,6,337.5,4.5,1.18


###

In [43]:
#Check if we can use invoice numbers to get vendor names and fill them in
invoice_numbers = ['INV-35139200005', 'INV-33740700003', 'INV-33728600005', 'INV-35973800003']
filtered_rows = df2[df2['invoice_and_item_number'].isin(invoice_numbers)]
filtered_rows

Unnamed: 0,invoice_and_item_number,date,store_number,store_name,address,city,zip_code,county_number,county,category,category_name,vendor_number,vendor_name,item_number,item_description,pack,bottle_volume_ml,state_bottle_cost,state_bottle_retail,bottles_sold,sale_dollars,volume_sold_liters,volume_sold_gallons
337624,INV-35139200005,2021-03-18,5251,Northside Liquor,1303 North Federal,Mason City,50401.0,17.0,CERRO GORD,1032100.0,Imported Vodkas,,,965108,Grey Goose VX,6,1000,31.34,80.0,18,1440.0,18.0,4.75
459143,INV-33740700003,2021-01-25,3773,Benz Distributing,501 7th Ave SE,Cedar Rapids,52401.0,57.0,LINN,1012400.0,Irish Whiskies,,,915574,Connemara 12 Year,6,750,14.66,56.25,6,337.5,4.5,1.18
465554,INV-33728600005,2021-01-25,2633,Hy-Vee #3 / BDI / Des Moines,3221 SE 14th St,Des Moines,50320.0,77.0,POLK,1022200.0,100% Agave Tequila,,,988100,Monte Alban Silver,12,750,12.98,15.56,12,186.72,9.0,2.37
1397930,INV-35973800003,2021-04-20,6035,Bootlegging Barzinis,412 First Ave,Coralville,52241.0,52.0,JOHNSON,1012400.0,Irish Whiskies,,,915574,Connemara 12YR,6,750,30.84,56.25,6,337.5,4.5,1.18


Each invoice-and-item number identifies exactly one row, so it cannot be used to look up vendor information from other rows. Because only 4 rows out of roughly 2.8 million are affected, we drop them rather than guess at the vendor.
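
One further check that could be run before dropping is whether other rows with the same item_number carry a vendor. The sketch below is an assumption about how such a lookup might be written, not a step taken in this notebook; `vendors_for_items` is a hypothetical helper, and the vendor values in the demo frame are invented.

```python
import pandas as pd


def vendors_for_items(df, item_numbers):
    """Look up vendor info recorded on other rows for the given item numbers."""
    has_vendor = df['vendor_name'].notnull() & df['vendor_number'].notnull()
    candidates = df[df['item_number'].isin(item_numbers) & has_vendor]
    # Keep the first vendor seen per item; an empty result means no match exists
    return candidates.groupby('item_number')[['vendor_number', 'vendor_name']].first()


# Made-up frame: item 915574 has one row with a (fictional) vendor, 965108 has none
demo = pd.DataFrame({
    'item_number': [915574, 915574, 965108],
    'vendor_number': [None, 100.0, None],
    'vendor_name': [None, 'EXAMPLE VENDOR', None],
})
lookup = vendors_for_items(demo, [915574, 965108])
print(lookup)
```

If the lookup comes back empty for the real data, dropping the rows remains the simplest option.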

###

In [44]:
#make a new df just in case
df3 = df2.drop(filtered_rows.index)
df3.head()

Unnamed: 0,invoice_and_item_number,date,store_number,store_name,address,city,zip_code,county_number,county,category,category_name,vendor_number,vendor_name,item_number,item_description,pack,bottle_volume_ml,state_bottle_cost,state_bottle_retail,bottles_sold,sale_dollars,volume_sold_liters,volume_sold_gallons
0,INV-33179700135,2021-01-04,2576,Hy-Vee Wine and Spirits / Storm Lake,1250 N Lake St,Storm Lake,50588.0,11.0,BUENA VIST,1081600.0,Whiskey Liqueur,421.0,SAZERAC COMPANY INC,64870,Fireball Cinnamon Whiskey,48,100,0.9,1.35,48,64.8,4.8,1.26
1,INV-33196200106,2021-01-04,2649,Hy-Vee #3 / Dubuque,400 Locust St,Dubuque,52001.0,31.0,DUBUQUE,1081200.0,Cream Liqueurs,300.0,McCormick Distilling Co.,65200,Tequila Rose Liqueur,12,750,11.5,17.25,4,69.0,3.0,0.79
2,INV-33184300011,2021-01-04,2539,Hy-Vee Food Store / Iowa Falls,640 S. Oak,Iowa Falls,50126.0,42.0,HARDIN,1031100.0,American Vodkas,260.0,DIAGEO AMERICAS,38008,Smirnoff 80prf PET,6,1750,14.75,22.13,6,132.78,10.5,2.77
3,INV-33184100015,2021-01-04,4024,Wal-Mart 1546 / Iowa Falls,840 S Oak,Iowa Falls,50126.0,42.0,HARDIN,1031100.0,American Vodkas,55.0,SAZERAC NORTH AMERICA,36648,Caliber Vodka,12,750,3.31,4.97,12,59.64,9.0,2.37
4,INV-33174200025,2021-01-04,5385,Vine Food & Liquor,2704 Vine St.,West Des Moines,50265.0,77.0,POLK,1012200.0,Scotch Whiskies,260.0,DIAGEO AMERICAS,4626,Buchanan Deluxe 12YR,12,750,20.99,31.49,2,62.98,1.5,0.39


In [45]:
#sanity check
df3.isna().sum()

invoice_and_item_number    0
date                       0
store_number               0
store_name                 0
address                    0
city                       0
zip_code                   0
county_number              0
county                     0
category                   0
category_name              0
vendor_number              0
vendor_name                0
item_number                0
item_description           0
pack                       0
bottle_volume_ml           0
state_bottle_cost          0
state_bottle_retail        0
bottles_sold               0
sale_dollars               0
volume_sold_liters         0
volume_sold_gallons        0
dtype: int64

###

### Extra Cleaning

In [47]:
#check all county names
county_names = df3['county'].unique()

# Convert the array to a list and print
print(list(county_names))

['BUENA VIST', 'DUBUQUE', 'HARDIN', 'POLK', 'BUTLER', 'BENTON', 'MUSCATINE', 'LINN', 'CLINTON', 'SAC', 'MITCHELL', 'POWESHIEK', 'SCOTT', 'JOHNSON', 'CEDAR', 'BLACK HAWK', 'FRANKLIN', 'Linn', 'CARROLL', 'OBRIEN', 'GRUNDY', 'KOSSUTH', 'STORY', 'CHICKASAW', 'CLARKE', 'MAHASKA', 'IOWA', 'JACKSON', 'IDA', 'WARREN', 'LEE', 'CHEROKEE', 'GREENE', 'WORTH', 'HOWARD', 'WEBSTER', 'CALHOUN', 'CERRO GORD', 'Polk', 'MARSHALL', 'FLOYD', 'WAPELLO', 'HAMILTON', 'WOODBURY', 'MARION', 'TAMA', 'LYON', 'Delaware', 'DALLAS', 'FAYETTE', 'JASPER', 'SHELBY', 'PLYMOUTH', 'AUDUBON', 'MONONA', 'CRAWFORD', 'BUCHANAN', 'SIOUX', 'WINNESHIEK', 'DELAWARE', 'HARRISON', 'Black Hawk', 'BREMER', 'CLAYTON', 'Webster', 'OSCEOLA', 'Dallas', 'JONES', 'JEFFERSON', 'PALO ALTO', 'POTTAWATTA', 'KEOKUK', 'HANCOCK', 'WRIGHT', 'DICKINSON', 'CLAY', 'WASHINGTON', 'EMMET', 'WINNEBAGO', 'VAN BUREN', 'Wapello', 'MILLS', 'DAVIS', 'HENRY', 'Hancock', 'WAYNE', 'GUTHRIE', 'APPANOOSE', 'ADAMS', 'ADAIR', 'MONROE', 'DES MOINES', 'ALLAMAKEE', 'Ma

In [48]:
#print outputs alphabetically
sorted_county_names = sorted(list(county_names))
print(sorted_county_names)

['ADAIR', 'ADAMS', 'ALLAMAKEE', 'APPANOOSE', 'AUDUBON', 'Adair', 'BENTON', 'BLACK HAWK', 'BOONE', 'BREMER', 'BUCHANAN', 'BUENA VIST', 'BUTLER', 'Black Hawk', 'CALHOUN', 'CARROLL', 'CASS', 'CEDAR', 'CERRO GORD', 'CHEROKEE', 'CHICKASAW', 'CLARKE', 'CLAY', 'CLAYTON', 'CLINTON', 'CRAWFORD', 'DALLAS', 'DAVIS', 'DECATUR', 'DELAWARE', 'DES MOINES', 'DICKINSON', 'DUBUQUE', 'Dallas', 'Delaware', 'Des Moines', 'EMMET', 'FAYETTE', 'FLOYD', 'FRANKLIN', 'FREMONT', 'GREENE', 'GRUNDY', 'GUTHRIE', 'HAMILTON', 'HANCOCK', 'HARDIN', 'HARRISON', 'HENRY', 'HOWARD', 'HUMBOLDT', 'Hancock', 'Hardin', 'Henry', 'IDA', 'IOWA', 'Iowa', 'JACKSON', 'JASPER', 'JEFFERSON', 'JOHNSON', 'JONES', 'Jackson', 'KEOKUK', 'KOSSUTH', 'LEE', 'LINN', 'LOUISA', 'LUCAS', 'LYON', 'Linn', 'MADISON', 'MAHASKA', 'MARION', 'MARSHALL', 'MILLS', 'MITCHELL', 'MONONA', 'MONROE', 'MONTGOMERY', 'MUSCATINE', 'Madison', 'Marion', 'Marshall', 'OBRIEN', 'OSCEOLA', 'PAGE', 'PALO ALTO', 'PLYMOUTH', 'POCAHONTAS', 'POLK', 'POTTAWATTA', 'POWESHIEK', 

###

The county names are not in a consistent format: some appear in all caps and others in mixed case (for example, 'LINN' and 'Linn'). We need to fix this.
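
To see exactly which counties are spelled more than one way, the names can be grouped by their upper-case form. This is a small illustrative sketch; `find_case_variants` is a hypothetical helper and the sample list is a hand-picked subset of the names printed above.

```python
def find_case_variants(names):
    """Group names that differ only by letter case; keep groups with 2+ spellings."""
    groups = {}
    for name in names:
        groups.setdefault(name.upper(), set()).add(name)
    return {key: sorted(variants)
            for key, variants in groups.items()
            if len(variants) > 1}


sample = ['POLK', 'Polk', 'LINN', 'Linn', 'DUBUQUE']
variants = find_case_variants(sample)
print(variants)
```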

In [56]:
#make a copy just in case
df4 = df3.copy()
df4.head(10)

Unnamed: 0,invoice_and_item_number,date,store_number,store_name,address,city,zip_code,county_number,county,category,category_name,vendor_number,vendor_name,item_number,item_description,pack,bottle_volume_ml,state_bottle_cost,state_bottle_retail,bottles_sold,sale_dollars,volume_sold_liters,volume_sold_gallons
0,INV-33179700135,2021-01-04,2576,Hy-Vee Wine and Spirits / Storm Lake,1250 N Lake St,Storm Lake,50588.0,11.0,BUENA VIST,1081600.0,Whiskey Liqueur,421.0,SAZERAC COMPANY INC,64870,Fireball Cinnamon Whiskey,48,100,0.9,1.35,48,64.8,4.8,1.26
1,INV-33196200106,2021-01-04,2649,Hy-Vee #3 / Dubuque,400 Locust St,Dubuque,52001.0,31.0,DUBUQUE,1081200.0,Cream Liqueurs,300.0,McCormick Distilling Co.,65200,Tequila Rose Liqueur,12,750,11.5,17.25,4,69.0,3.0,0.79
2,INV-33184300011,2021-01-04,2539,Hy-Vee Food Store / Iowa Falls,640 S. Oak,Iowa Falls,50126.0,42.0,HARDIN,1031100.0,American Vodkas,260.0,DIAGEO AMERICAS,38008,Smirnoff 80prf PET,6,1750,14.75,22.13,6,132.78,10.5,2.77
3,INV-33184100015,2021-01-04,4024,Wal-Mart 1546 / Iowa Falls,840 S Oak,Iowa Falls,50126.0,42.0,HARDIN,1031100.0,American Vodkas,55.0,SAZERAC NORTH AMERICA,36648,Caliber Vodka,12,750,3.31,4.97,12,59.64,9.0,2.37
4,INV-33174200025,2021-01-04,5385,Vine Food & Liquor,2704 Vine St.,West Des Moines,50265.0,77.0,POLK,1012200.0,Scotch Whiskies,260.0,DIAGEO AMERICAS,4626,Buchanan Deluxe 12YR,12,750,20.99,31.49,2,62.98,1.5,0.39
5,INV-33186700007,2021-01-04,4110,"Brothers Market, Inc.",706 Highway 57,Parkersburg,50665.0,12.0,BUTLER,1032100.0,Imported Vodkas,115.0,CONSTELLATION BRANDS INC,34821,Svedka 80prf,6,1750,13.5,20.25,6,121.5,10.5,2.77
6,INV-33197500003,2021-01-04,4228,Fareway Stores #462 / Vinton,501 A Ave,Vinton,52349.0,6.0,BENTON,1032100.0,Imported Vodkas,370.0,PERNOD RICARD USA,34006,Absolut Swedish Vodka 80prf,12,750,9.99,14.99,12,179.88,9.0,2.37
7,INV-33197200010,2021-01-04,2713,Hy-Vee Dyersville Dollar Fresh,1201 12th Ave SE,Dyersville,52040.0,31.0,DUBUQUE,1031100.0,American Vodkas,434.0,LUXCO INC,36308,Hawkeye Vodka,6,1750,7.17,10.76,6,64.56,10.5,2.77
8,INV-33174600126,2021-01-04,2648,Hy-Vee #4 / WDM,555 S 51st St,West Des Moines,50265.0,77.0,POLK,1012200.0,Scotch Whiskies,260.0,DIAGEO AMERICAS,5318,Johnnie Walker Double Black,6,750,24.05,36.08,6,216.48,4.5,1.18
9,INV-33202000002,2021-01-04,5735,Super Saver Liquor -Muscatine,1510 A Isett Avenue,Muscatine,52761.0,70.0,MUSCATINE,1091300.0,Neutral Grain Spirits Flavored,346.0,OLE SMOKY DISTILLERY LLC,86739,Ole Smoky Apple Pie Moonshine 70prf Mini,8,50,8.75,13.13,8,105.04,0.4,0.1


In [49]:
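# Capitalize and sort county names. capitalize() lowercases everything after the
# first letter, and mixed-case variants of the same county appear twice in the output.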
capitalized_counties = [county.capitalize() for county in county_names]
sorted_counties = sorted(capitalized_counties)

# Loop through sorted counties and print
for county in sorted_counties:
    print(county)

Adair
Adair
Adams
Allamakee
Appanoose
Audubon
Benton
Black hawk
Black hawk
Boone
Bremer
Buchanan
Buena vist
Butler
Calhoun
Carroll
Cass
Cedar
Cerro gord
Cherokee
Chickasaw
Clarke
Clay
Clayton
Clinton
Crawford
Dallas
Dallas
Davis
Decatur
Delaware
Delaware
Des moines
Des moines
Dickinson
Dubuque
Emmet
Fayette
Floyd
Franklin
Fremont
Greene
Grundy
Guthrie
Hamilton
Hancock
Hancock
Hardin
Hardin
Harrison
Henry
Henry
Howard
Humboldt
Ida
Iowa
Iowa
Jackson
Jackson
Jasper
Jefferson
Johnson
Jones
Keokuk
Kossuth
Lee
Linn
Linn
Louisa
Lucas
Lyon
Madison
Madison
Mahaska
Marion
Marion
Marshall
Marshall
Mills
Mitchell
Monona
Monroe
Montgomery
Muscatine
Obrien
Osceola
Page
Palo alto
Plymouth
Pocahontas
Polk
Polk
Pottawatta
Pottawatta
Poweshiek
Poweshiek
Ringgold
Sac
Scott
Scott
Shelby
Sioux
Story
Tama
Taylor
Union
Van buren
Wapello
Wapello
Warren
Washington
Wayne
Webster
Webster
Winnebago
Winneshiek
Woodbury
Worth
Wright


In [51]:
# Create a mapping dictionary from each original county name to its capitalized form,
# so that mixed-case variants (e.g. 'LINN' and 'Linn') map to the same name
county_mapping = {name: name.capitalize() for name in county_names}
county_mapping

{'BUENA VIST': 'Buena vist',
 'DUBUQUE': 'Dubuque',
 'HARDIN': 'Hardin',
 'POLK': 'Polk',
 'BUTLER': 'Butler',
 'BENTON': 'Benton',
 'MUSCATINE': 'Muscatine',
 'LINN': 'Linn',
 'CLINTON': 'Clinton',
 'SAC': 'Sac',
 'MITCHELL': 'Mitchell',
 'POWESHIEK': 'Poweshiek',
 'SCOTT': 'Scott',
 'JOHNSON': 'Johnson',
 'CEDAR': 'Cedar',
 'BLACK HAWK': 'Black hawk',
 'FRANKLIN': 'Franklin',
 'Linn': 'Linn',
 'CARROLL': 'Carroll',
 'OBRIEN': 'Obrien',
 'GRUNDY': 'Grundy',
 'KOSSUTH': 'Kossuth',
 'STORY': 'Story',
 'CHICKASAW': 'Chickasaw',
 'CLARKE': 'Clarke',
 'MAHASKA': 'Mahaska',
 'IOWA': 'Iowa',
 'JACKSON': 'Jackson',
 'IDA': 'Ida',
 'WARREN': 'Warren',
 'LEE': 'Lee',
 'CHEROKEE': 'Cherokee',
 'GREENE': 'Greene',
 'WORTH': 'Worth',
 'HOWARD': 'Howard',
 'WEBSTER': 'Webster',
 'CALHOUN': 'Calhoun',
 'CERRO GORD': 'Cerro gord',
 'Polk': 'Polk',
 'MARSHALL': 'Marshall',
 'FLOYD': 'Floyd',
 'WAPELLO': 'Wapello',
 'HAMILTON': 'Hamilton',
 'WOODBURY': 'Woodbury',
 'MARION': 

In [60]:
# Map the county names in df4 using the county_mapping dictionary
df4['capitalized_county'] = df4['county'].map(county_mapping)
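
A vectorized alternative to a mapping dictionary is pandas' string accessor. The sketch below assumes title case is acceptable, so that multi-word names such as 'Black Hawk' keep both capitals; it is shown on a made-up sample frame rather than on `df4`. Names that are truncated in the source data (e.g. 'CERRO GORD') stay truncated.

```python
import pandas as pd

# Made-up sample containing mixed-case variants of the same counties
sample = pd.DataFrame({
    'county': ['POLK', 'Polk', 'BLACK HAWK', 'Black Hawk', 'CERRO GORD'],
})

# str.title() capitalizes the first letter of every word in one pass
sample['capitalized_county'] = sample['county'].str.title()
unique_counties = sample['capitalized_county'].unique().tolist()
print(unique_counties)  # ['Polk', 'Black Hawk', 'Cerro Gord']
```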

In [62]:
#sanity check
df4.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2805303 entries, 0 to 2805306
Data columns (total 24 columns):
 #   Column                   Dtype  
---  ------                   -----  
 0   invoice_and_item_number  object 
 1   date                     object 
 2   store_number             int64  
 3   store_name               object 
 4   address                  object 
 5   city                     object 
 6   zip_code                 float64
 7   county_number            float64
 8   county                   object 
 9   category                 float64
 10  category_name            object 
 11  vendor_number            float64
 12  vendor_name              object 
 13  item_number              int64  
 14  item_description         object 
 15  pack                     int64  
 16  bottle_volume_ml         int64  
 17  state_bottle_cost        float64
 18  state_bottle_retail      float64
 19  bottles_sold             int64  
 20  sale_dollars             float64
 21  volume_s

###

In [63]:
#Check Duplicates
duplicate_rows = df4.duplicated()
num_duplicate_rows = duplicate_rows.sum()
num_duplicate_rows

0
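
`duplicated()` with no arguments only flags rows that match in every column. Since invoice_and_item_number is described as a unique identifier, a stricter check could look for repeats in that column alone. This is a sketch under that assumption; `count_duplicate_ids` is a hypothetical helper and the invoice IDs in the demo frame are made up.

```python
import pandas as pd


def count_duplicate_ids(df, id_column='invoice_and_item_number'):
    """Count rows whose identifier repeats an earlier row's identifier."""
    return int(df.duplicated(subset=[id_column]).sum())


# Made-up frame: the second and third rows share an ID but differ elsewhere
demo = pd.DataFrame({
    'invoice_and_item_number': ['INV-1', 'INV-2', 'INV-2'],
    'bottles_sold': [6, 12, 3],
})
full_row_duplicates = int(demo.duplicated().sum())
id_duplicates = count_duplicate_ids(demo)
print(full_row_duplicates, id_duplicates)  # 0 1
```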

Now our data is clean.

In [65]:
%pip install pyarrow

Collecting pyarrow
  Downloading pyarrow-13.0.0-cp310-cp310-win_amd64.whl (24.3 MB)
     --------------------------------------- 24.3/24.3 MB 11.3 MB/s eta 0:00:00
Installing collected packages: pyarrow
Successfully installed pyarrow-13.0.0
Note: you may need to restart the kernel to use updated packages.


In [66]:
# Save the cleaned data to a new Parquet file.
# pandas calls pyarrow through the engine argument, so no separate import is needed.
df4.to_parquet('IowaClean', engine='pyarrow')