## Dram Item Cleaning

This notebook attempts to map Dram Shop items onto the actual beverage, along with the producer and any additional information. For instance, we want to map `Ommegang - Rare Vos Belgian Style Amber Ale  - 6.5% ABV 20 IBU` to `ommegang` (the producer), `Rare Vos Belgian Style Amber Ale` (the beer), and `6.5% ABV 20 IBU` (the additional information).

This is hard! The items are entered manually, so we see abbreviations, misspellings, and odd orderings.
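
For a well-formed item, the mapping amounts to a split on hyphens that have whitespace on both sides. Here is a minimal sketch, assuming that delimiter convention holds; the function name `split_item_simple` and the pattern `SPACED_HYPHEN` are hypothetical and not part of the notebook.

```python
import re

# Assumption: fields are separated by a hyphen with whitespace on both sides.
SPACED_HYPHEN = re.compile(r"\s+-\s+")

def split_item_simple(item):
    """Split a well-formed item into (producer, beverage, extra info)."""
    pieces = [p.strip() for p in SPACED_HYPHEN.split(item.strip())]
    producer = pieces[0].lower()
    beverage = pieces[1] if len(pieces) > 1 else ""
    # Anything after the first two fields is kept as additional information
    extra = " - ".join(pieces[2:])
    return producer, beverage, extra

print(split_item_simple(
    "Ommegang - Rare Vos Belgian Style Amber Ale  - 6.5% ABV 20 IBU"
))
```

Items that put the beer before the producer, or that have no hyphen at all, will not come out right with this approach, which is why the rest of the notebook needs more than a plain split.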


We'll start by pulling in the items and their categories, which will be our starting point.

In [8]:
import os
import re
import datetime 
from collections import Counter

import pandas as pd
import numpy as np
import pandas_gbq
import janitor

# Imports for connecting to Google BigQuery
from google.cloud import bigquery
from google.oauth2 import service_account

### GBQ Set Up

In this next section we connect to our GBQ project and list the data sets inside to test the connection.

In [2]:
# These first two values will be different on your machine. 
service_path = "/Users/chandler/Dropbox/Teaching/"
service_file = 'umt-msba-037daf11ee16.json' # change this to your authentication information  
gbq_proj_id = 'umt-msba' # change this to your project. 

# And this should stay the same. 
private_key = service_path + service_file

# Now we pass in our credentials so that Python has permission to access our project.
credentials = service_account.Credentials.from_service_account_file(private_key)

# And finally we establish our connection
client = bigquery.Client(credentials = credentials, project=gbq_proj_id)

for item in client.list_datasets() : 
    print(item.full_dataset_id)

umt-msba:dram_shop
umt-msba:transactions
umt-msba:wedge_example
umt-msba:wedge_transactions


### Pulling data from GBQ

Let's get the items and their categories from our Dram data.

In [14]:
query = """
    SELECT DISTINCT item,
           category
    FROM `umt-msba.dram_shop.dram_items_20220901`
"""

item_cat = dict()
cats = list()

for row in client.query(query) :
    item_cat[row[0]] = [row[1]] # Let's make a list so we can add on information
    cats.append(row[1])
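
Because the query returns distinct (item, category) pairs, any item listed under more than one category appears in several rows, and the loop above keeps only the category from the last such row. The sketch below is one way to spot those items. It uses made-up tuples in place of the query result, and the item names and the helper `group_categories` are hypothetical.

```python
from collections import defaultdict

def group_categories(rows):
    """Collect every category seen for each item."""
    grouped = defaultdict(set)
    for item, category in rows:
        grouped[item].add(category)
    return grouped

# Hypothetical rows standing in for the (item, category) pairs from the query
rows = [
    ("Example IPA", "F-IPA Draught"),
    ("Example IPA", "C-IPA Draught"),
    ("Example Cider", "F-Cider Draught"),
]

grouped = group_categories(rows)

# Items that appear under more than one category
multi = {item: sorted(c) for item, c in grouped.items() if len(c) > 1}
print(multi)
```

If `multi` turns out to be empty on the real data, the one-category-per-item dictionary loses nothing.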

### Investigating Categories

Let's take a look at the categories and do some work to clean them up, mostly so that they are consistent across locations.

In [15]:
Counter(cats).most_common(25)

[('F-IPA Draught', 29),
 ('C-IPA Draught', 25),
 ('C-Seasonal', 21),
 ('Red Wine - Bottled', 21),
 ('IPA - Bottled', 16),
 ('F-Amber/Pale Draught', 15),
 ('Lagers/Pils/Wheat - Bottled', 15),
 ('F-Wine Draught', 14),
 ('F-Seasonal', 14),
 ('F-Lagers/Pils/Wheat Draught', 11),
 ('C-Lagers/Pils/Wheat Draught', 11),
 ('C-Wine Draught', 11),
 ('F-Cider Draught', 11),
 ('F-Sour Draught', 11),
 ('C-Sour Draught', 10),
 ('C-IPA - Bottles', 10),
 ('C-Sparkling Wine - Bottles', 10),
 ('C-Red Wine - Bottles', 9),
 ('C-Lagers/Pils/Wheat - Bottled', 9),
 ('Growlers', 9),
 ('F-Porter/Stout Draught', 9),
 ('Softgoods', 8),
 ('C-Sour - Bottles', 8),
 ('C-Wine Packages/Tastings', 7),
 ('Sparkling Wine - Bottled', 7)]

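So the category often has an "F-" or "C-" prefix, and in many cases it ends in " Draught" or " - Bottled". Some categories end in " - Bottles" instead (for example `C-IPA - Bottles`); the suffix pattern below leaves those untouched, as `NA - Bottles` shows in the output.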

In [29]:
cat_prefix = re.compile(r"[FC] ?- ?")
cat_suffix = re.compile(r" -? ?(Draught|Bottled)")

In [30]:
for cat in set(cats) :
    print(cat)
    holder = cat_prefix.sub("",cat)
    print(holder)
    holder = cat_suffix.sub("",holder)
    print(holder)
    print("-----------------")


F-Belgian Draught
Belgian Draught
Belgian
-----------------
NA - Bottles
NA - Bottles
NA - Bottles
-----------------
White Wine - Bottled
White Wine - Bottled
White Wine
-----------------
C-Belgian Draught
Belgian Draught
Belgian
-----------------
Softgoods
Softgoods
Softgoods
-----------------
Snacks
Snacks
Snacks
-----------------
F-Special Orders
Special Orders
Special Orders
-----------------
C-Beer Packages/Tastings
Beer Packages/Tastings
Beer Packages/Tastings
-----------------
Soda - Bottled
Soda - Bottled
Soda
-----------------
Belgian - Bottled
Belgian - Bottled
Belgian
-----------------
C-Lagers/Pils/Wheat - Bottled
Lagers/Pils/Wheat - Bottled
Lagers/Pils/Wheat
-----------------
Sparkling Wine - Bottled
Sparkling Wine - Bottled
Sparkling Wine
-----------------
C-Cider Draught
Cider Draught
Cider
-----------------
F - Hard Seltzer
Hard Seltzer
Hard Seltzer
-----------------
C-Wild - Bottled
Wild - Bottled
Wild
-----------------
Red Wine - Bottled
Red Wine - Bottled
Red Wine
--

In [34]:
# Now let's append on a clean category and the delivery method to our list

for item in item_cat : 
    cat = item_cat[item][0]

    holder = cat_prefix.sub("",cat)
    holder = cat_suffix.sub("",holder)

    item_cat[item].append(holder.lower())
    
    if "bottled" in cat.lower() : 
        item_cat[item].append("bottled")
    elif "draught" in cat.lower() :
        item_cat[item].append("draught")
    else : 
        item_cat[item].append("other")
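
The category counts above include names such as `C-IPA - Bottles`. Neither the `Bottled` suffix pattern nor the `"bottled"` check matches them, so those items keep their suffix and are tagged "other". One possible variant, offered as a sketch rather than as the notebook's method, anchors the prefix at the start of the string and accepts both spellings. The names `split_category`, `PREFIX`, and `SUFFIX` are hypothetical.

```python
import re

# Assumption: the location prefix only appears at the start of the category.
PREFIX = re.compile(r"^[FC] ?- ?")
# Assumption: "Bottles" should be treated the same as "Bottled".
SUFFIX = re.compile(r" -? ?(Draught|Bottled|Bottles)$")

def split_category(cat):
    """Return (clean category, delivery method) for a raw category."""
    match = SUFFIX.search(cat)
    if match is None:
        delivery = "other"
    elif match.group(1) == "Draught":
        delivery = "draught"
    else:
        delivery = "bottled"
    # Strip the prefix first, then the suffix, then lowercase
    clean = SUFFIX.sub("", PREFIX.sub("", cat)).lower()
    return clean, delivery

for cat in ["F-IPA Draught", "C-IPA - Bottles", "F - Hard Seltzer", "Softgoods"]:
    print(cat, "->", split_category(cat))
```

Whether "Bottles" and "Bottled" really mean the same thing is a question for the data owner, so check that before adopting the merged pattern.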
        
    


### Clean up beers

Building on our earlier code, let's clean up the items in the beer categories. A first step is to work out which items those are.

In [None]:
item_translation = dict()
# key = original item
# value = [clean_item_name, brewery (if present), remainder (if present)]

clean_items = set()

for item in items :
    
    clean_item = prefix_pattern.sub("",item).lower()
    
    clean_item = clean_item.replace("windmere","widmer")
    
    clean_items.add(clean_item)
    pieces = spaced_hyphen_pattern.split(clean_item)
    
    beer = ""
    brewery = ""
    other_info = ""
    
    # This next section tries to get the right values in the right 
    # places for beer/brewery/other stuff
    if len(pieces) > 1 : 
        pieces = [p.strip() for p in pieces]
        
        if len(pieces) == 2 :        
            if pieces[0] in brewery_set : 
                brewery = pieces[0]
                beer = pieces[1]
            
            
            else :
                brewery = pieces[1]
                beer = pieces[0]

                
        elif len(pieces) == 3 :
            if pieces[0] in brewery_set : 
                brewery, beer, other_info = pieces
            else :
                beer, brewery, other_info = pieces
        else : 
            if pieces[0] in brewery_set : 
                brewery, beer = pieces[:2]
            else :
                beer, brewery = pieces[:2]
                
            # Everything after the first two pieces is additional information
            other_info = " - ".join(pieces[2:]).strip()
        
    else :
        
        clean_item_tokens = clean_item.split()
        
        
        
        for bry in brewery_set :
            if bry in clean_item : 
                brewery = bry
                beer = clean_item.replace(brewery,"")
                print(f"Brewery = {brewery}; Beer = {beer}")
            
    
#        for token in clean_item_tokens :
#            if token in brewery_set :
#                brewery = token
        
#        beer = " ".join([token for token in clean_item_tokens if token != brewery])
                
            
    item_translation[item] = [beer, brewery, other_info]

    if "Cioke" in item :
        print(item)
        print(item_translation[item])
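
The cell above uses `prefix_pattern`, `spaced_hyphen_pattern`, and `brewery_set`, whose definitions are not shown in this section. The sketch below fills them in with placeholder values so that the piece-assignment logic can be tried on its own. All three definitions, and the helper `assign_pieces`, are assumptions for illustration only. The helper treats the first two pieces as beer and brewery (in whichever order the brewery lookup suggests) and everything after them as additional information.

```python
import re

# All three definitions below are assumptions made for illustration;
# the notebook's real versions are not shown in this section.
prefix_pattern = re.compile(r"^[FC] ?- ?")      # hypothetical location prefix
spaced_hyphen_pattern = re.compile(r"\s+-\s+")  # hypothetical field delimiter
brewery_set = {"ommegang", "widmer"}            # hypothetical known breweries

def assign_pieces(pieces, breweries):
    """Return [beer, brewery, other_info] for a list of two or more pieces."""
    if pieces[0] in breweries:
        brewery, beer = pieces[:2]
    else:
        beer, brewery = pieces[:2]
    other_info = " - ".join(pieces[2:])
    return [beer, brewery, other_info]

item = "Ommegang - Rare Vos Belgian Style Amber Ale  - 6.5% ABV 20 IBU"
clean_item = prefix_pattern.sub("", item).lower()
pieces = [p.strip() for p in spaced_hyphen_pattern.split(clean_item)]
print(assign_pieces(pieces, brewery_set))
```

Pulling the assignment into a function also makes it easy to check individual odd cases, such as a beer listed before its brewery, without looping over every item.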

