# Intro

The initial dataset is sourced from Hugging Face (https://huggingface.co/datasets/brianarbuckle/cocktail_recipes).

This initial dataset contains columns `title`, `ingredients`, `directions`, `misc`, `source`, and `ner`.

**Overview**
The `ingredients` column of our dataset presents a unique challenge. Each entry is a list of strings, where each string describes one part of a cocktail recipe. These strings typically include the quantity, unit, and name of an ingredient, but they can also contain preparation instructions or garnishing details that are not directly usable in their current form for data analysis or recipe generation.

**Objective**
Our goal is to parse these strings to extract structured information that can be effectively utilized in our project. Specifically, we aim to separate the ingredient details into distinct components: quantity, unit, and ingredient name, while filtering out the non-ingredient related information.
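
As a concrete target, the sketch below shows one ingredient string and the record we would like to derive from it. The string is written in the lowercased form it takes after cleaning, and the field names mirror the parsed output produced later in this notebook; the variable names are only illustrative.

```python
# Illustrative only: one ingredient string and the structured record
# we would like to derive from it.
raw = "0.5 oz. lime juice"

target = {
    "quantity": "0.5",        # kept as text at this stage
    "unit": "oz",
    "ingredient": "lime juice",
}

# Strings such as preparation notes carry no quantity or unit at all,
# so they should be filtered out rather than parsed.
non_ingredient = "shake on ice and strain"

print(raw, "->", target)
```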

## Imports

In [1]:
import pandas as pd
import re
from datasets import load_dataset
import matplotlib.pyplot as plt
import seaborn as sns

## Initial Data Inspection

In [2]:
# Loading the cocktail_recipes dataset
dataset = load_dataset('brianarbuckle/cocktail_recipes')

df = pd.DataFrame(dataset['train'])

In [3]:
df.head(5)

Unnamed: 0,title,ingredients,directions,misc,source,ner
0,151 Swizzle,[1.5 oz. 151-Proof Demerara Rum [Lemon Hart or...,[],[],Beachbum Berry Remixed,"[pernod, rum]"
1,20th Century,"[The 21st Century, 2 oz. Siete Leguas Blanco T...","[shake on ice and strain into coupe , The Best...",[],Jim Meehan,"[cocchi americano, pernod, tequila]"
2,20th Century,"[1.5 oz. Plymouth Gin, 3\/4 oz. Mari Brizard W...",[shake on ice and strain],[],PDT,"[lillet, gin]"
3,Abbey Cocktail,[],"[Shake liquid ingredients with ice., Strain in...",[Suggested glassware is Cocktail Glass],The Ultimate Bar Book,[]
4,Absinthe Drip,[1 1/2 ounces Pernod (or other absinthe substi...,[Pour Pernod into a pousse-caf or sour glass....,[The Absinthe Drip was made famous by Toulouse...,The Ultimate Bar Book,"[pernod, absinthe]"


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 875 entries, 0 to 874
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   title        875 non-null    object
 1   ingredients  875 non-null    object
 2   directions   875 non-null    object
 3   misc         875 non-null    object
 4   source       875 non-null    object
 5   ner          875 non-null    object
dtypes: object(6)
memory usage: 41.1+ KB


In [5]:
df.shape

(875, 6)

# Data Cleaning

## Lowercase standardization

Here we lowercase all ingredient strings so that entries differing only in capitalization are not treated as distinct ingredients.

In [6]:
def lowercase_ingredients(ingredients_list):
    """
    Converts all strings within a list of ingredients to lowercase.

    This function is useful for standardizing the case of all ingredient entries,
    making the dataset more uniform for analysis or processing.

    Parameters:
    - ingredients_list (list): A list of ingredient strings.

    Returns:
    - A list of ingredients with all strings converted to lowercase.
    """
    # Check if ingredients_list is indeed a list; if not, return it as is
    if not isinstance(ingredients_list, list):
        return ingredients_list

    # Convert each ingredient in the list to lowercase
    lowercase_list = [ingredient.lower() for ingredient in ingredients_list]
    return lowercase_list

# Apply the function to the 'ingredients' column
df['ingredients'] = df['ingredients'].apply(lowercase_ingredients)

## Adjective Removal

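To reduce ingredient redundancy, the following code block removes the adjective "fresh", so that, for example, `fresh lime juice` and `lime juice` are treated as the same ingredient:

In [7]:
# Remove the adjective 'fresh' from every ingredient string (already lowercased above)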
for index, row in df.iterrows():
    # Update the ingredients list by removing 'fresh' if it exists
    updated_ingredients = [ingredient.replace('fresh ', '') if 'fresh ' in ingredient else ingredient for ingredient in row['ingredients']]
    # Update the DataFrame with the modified ingredients list
    df.at[index, 'ingredients'] = updated_ingredients

## Deleting Empty Rows

### Filtering rows where `ingredients` is empty

Two cases occur in the data: an `ingredients` list of length 0, and a list whose only element is an empty string (`['']`, which has length 1 but carries no content).

In [8]:
# Filter rows where `ingredients` is not empty and not just an empty string list
modified_df = df[df['ingredients'].apply(lambda x: len(x) > 0 and x != [''])].reset_index(drop=True)
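
The two empty cases can be checked on plain Python lists. This is a minimal sketch with made-up sample values; `has_ingredients` is a hypothetical name for the same predicate the lambda above applies.

```python
# Toy stand-ins for values of the `ingredients` column (hypothetical data).
samples = [
    [],                        # no entries at all
    [""],                      # one entry, but it is an empty string
    ["1.5 oz. plymouth gin"],  # a usable ingredient list
]

def has_ingredients(ingredients):
    """Same predicate as the lambda used in the filter above."""
    return len(ingredients) > 0 and ingredients != [""]

kept = [s for s in samples if has_ingredients(s)]
print(kept)  # [['1.5 oz. plymouth gin']]
```

Only the third list survives, which is the behavior the filter relies on.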

### Removing Irreparable Entries

This section deletes entries that we judged to be problematic and unfixable; the reason for each deletion is given in a comment.

In [9]:
indices_to_drop = [7]  # Industrial-size recipe; acts as an outlier for portions

In [10]:
# Reset the index only AFTER dropping every entry in `indices_to_drop`: the listed indices
# refer to the current index, so resetting between drops would shift the positions of the
# recipes that are still to be dropped.
modified_df = modified_df.drop(index=indices_to_drop).reset_index(drop=True) 

## Normalization

### Ingredients Lists: Separating Combined Strings

In [11]:
def split_combined_ingredients(ingredients_list):
    """
    Splits a single-item list containing a string of ingredients separated by commas
    into a list of individual ingredients. If the list contains more than one item or
    if the single item does not contain a comma, the original list is returned.

    This is useful for normalizing ingredients lists where all ingredients are combined
    into a single string.

    Parameters:
    - ingredients_list: A list of ingredient(s).

    Returns:
    - A list of separated and trimmed ingredient strings.
    """
    if len(ingredients_list) == 1 and ',' in ingredients_list[0]:
        return [ingredient.strip() for ingredient in ingredients_list[0].split(', ')]
    else:
        return ingredients_list

# Applying the function
modified_df['ingredients'] = modified_df['ingredients'].apply(split_combined_ingredients)
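
One detail worth noting: the function above checks for `','` but splits on `', '`, so a combined string with no space after a comma stays in one piece. The sketch below is a hypothetical variant (`split_on_commas` is not part of the notebook) that splits on any comma with optional surrounding whitespace. It assumes commas never appear inside a single ingredient, which may not hold for every recipe.

```python
import re

def split_on_commas(ingredients_list):
    """Hypothetical variant: split a single combined string on any comma."""
    if len(ingredients_list) == 1 and "," in ingredients_list[0]:
        parts = re.split(r"\s*,\s*", ingredients_list[0].strip())
        return [p for p in parts if p]  # drop empty pieces left by trailing commas
    return ingredients_list

print(split_on_commas(["2 oz. gin,1 oz. lime juice, 1 dash bitters"]))
# ['2 oz. gin', '1 oz. lime juice', '1 dash bitters']
print(split_on_commas(["2 oz. gin", "1 oz. lime juice"]))
# ['2 oz. gin', '1 oz. lime juice']
```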

### Deleting ingredients listed as optional 

In [12]:
def filter_optional_ingredients(ingredients_list):
    """
    Filters out ingredients marked as "optional" from a list of ingredients.
    
    This function iterates through a given list of ingredient strings, removing
    any ingredient that contains the word "optional". The goal is to create a more
    concise and essential list of ingredients by excluding those that are not
    crucial to the recipe. This step is part of data cleaning to standardize
    the ingredients data for further analysis.
    
    Parameters:
    - ingredients_list: A list of strings, where each string is an ingredient.

    Returns:
    - A list of ingredient strings, excluding any marked as "optional".
    """
    optional_keyword = 'optional'
    filtered_list = [ingredient for ingredient in ingredients_list if optional_keyword not in ingredient.lower()]
    return filtered_list

# Apply the filtering function to the 'ingredients' column
modified_df['ingredients'] = modified_df['ingredients'].apply(filter_optional_ingredients)


### Deleting steps in `ingredients` that are not usable

In [14]:
def delete_ingredients_at_steps(df, row_index, ingredient_steps):
    """
    Removes ingredients at specified steps (indices) from the list of ingredients
    for a given row in the DataFrame.

    Parameters:
    - df (DataFrame): The DataFrame containing the recipe information.
    - row_index (int): The index of the row from which to remove the ingredients.
    - ingredient_steps (list of int): The indices of the ingredients to remove within the ingredients list.
    
    Returns:
    - None; modifies the DataFrame in place.
    """
    # Ensure the row_index is within the DataFrame's range
    if row_index < 0 or row_index >= len(df):
        print("Row index is out of DataFrame's range.")
        return
    
    # Get the current list of ingredients for the specified row
    ingredients_list = df.at[row_index, 'ingredients']
    
    # Check if ingredient_steps is a single integer, wrap it in a list
    if isinstance(ingredient_steps, int):
        ingredient_steps = [ingredient_steps]

    # Ensure ingredients_list is a list
    if not isinstance(ingredients_list, list):
        print("Ingredients are not in list format for the specified row.")
        return
    
    # Remove the ingredients at the specified steps
    # Sort the indices in reverse order to avoid index shift during deletion
    for step in sorted(ingredient_steps, reverse=True):
        # Check each step's validity before attempting to delete
        if step < 0 or step >= len(ingredients_list):
            print(f"Ingredient step {step} is out of range and will not be deleted.")
            continue
        del ingredients_list[step]
    
    # Update the DataFrame in place
    df.at[row_index, 'ingredients'] = ingredients_list
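
The reverse sort in the loop above matters because every deletion shifts the remaining items to the left. The following sketch shows the effect on a plain list; `delete_positions` and the sample strings are hypothetical and only illustrate the idea.

```python
def delete_positions(items, positions):
    """Hypothetical helper: delete several list positions safely."""
    result = list(items)  # work on a copy so the input is untouched
    for pos in sorted(set(positions), reverse=True):
        if 0 <= pos < len(result):
            del result[pos]
    return result

steps = ["2 oz. gin", "garnish note", "1 oz. lime juice", "shake on ice and strain"]

print(delete_positions(steps, [1, 3]))
# ['2 oz. gin', '1 oz. lime juice']

# Deleting in ascending order instead shifts later items to the left,
# so position 3 no longer exists by the time it is reached:
wrong = list(steps)
for pos in [1, 3]:
    if pos < len(wrong):
        del wrong[pos]
print(wrong)
# ['2 oz. gin', '1 oz. lime juice', 'shake on ice and strain']
```

Unlike the notebook function, this sketch copies the list and removes duplicate positions first; whether that is desirable depends on how the helper is used.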

In [26]:
# Use this to check row [n] of modified_df['ingredients'] 
modified_df['ingredients'][0]

['1.5 oz. 151-proof demerara rum [lemon hart or el dorado]',
 '0.5 oz. lime juice',
 '0.5 oz. sugar syrup',
 '1 dash angostura bitters',
 '6 drops [1\\/8 tsp.] pernod',
 '8 oz. crushed ice']

In [15]:
# Use delete_ingredients_at_steps to remove unusable entries
delete_ingredients_at_steps(modified_df, 0, [6, 7])

In [16]:
# Check that the call above removed the intended indices
modified_df['ingredients'][0]

['1.5 oz. 151-proof demerara rum [lemon hart or el dorado]',
 '0.5 oz. lime juice',
 '0.5 oz. sugar syrup',
 '1 dash angostura bitters',
 '6 drops [1\\/8 tsp.] pernod',
 '8 oz. crushed ice']

# Parsing

In [17]:
def parse_ingredient(ingredient_str, step_index):
    """
    Parses a single ingredient string into its parts: quantity, unit, and ingredient name,
    while also correcting known formatting issues and standardizing units.
    The function removes backslashes and surrounding whitespace, then applies a regular
    expression to extract the quantity, unit, and ingredient name from the cleaned string.
    Lowercasing and removal of the term "fresh" are handled in the earlier cleaning steps.

    Parameters:
    - ingredient_str (str): The ingredient string to be parsed.
    - step_index (int): The step index or sequence number of the ingredient in the recipe.

    Returns:
    - A dictionary with the parsed components of the ingredient: 'quantity', 'unit', 'ingredient',
      and 'ingredient_step'. If the string does not match the expected pattern, 'quantity' and 'unit'
      are returned as None, and 'ingredient' contains the original (corrected) string with 'ingredient_step'
      reflecting the passed step_index.

    The function is designed to handle a variety of ingredient formats by using a comprehensive regular
    expression. It accounts for different measurement units and formats, aiming to standardize the data
    for further processing or analysis.
    """
    
    # Pre-process to correct known formatting issues
    corrected_str = re.sub(r'\\', '', ingredient_str)  # Remove backslashes that might interfere with parsing
    corrected_str = corrected_str.strip()
    
    pattern = re.compile(
        r'(?P<quantity>\d+\s*\d*\/\d+|\d*\.\d+|\d+)?\s*'  # Capture quantities, fractions, decimals
        r'(?P<unit>oz|ounces?|tsp|teaspoons?|tablespoons?|tbl|tbs|cups?|pints?|quarts?|gallons?|lbs?|pounds?|ml|mL|liters?|dash|dashes|drops?|pinch|pinches|qt|qts|cl)?\.?\s*' 
        r'(?P<ingredient>.+)', re.IGNORECASE)
    
    match = pattern.match(corrected_str)
    
    if match:
        # Normalize unit names
        unit = match.group('unit')
        if unit:
            unit = unit.lower()
            if unit in ['tsp', 'teaspoon', 'teaspoons']:
                unit = 'tsp'
            elif unit in ['tbl', 'tbs', 'tablespoon', 'tablespoons']:
                unit = 'tbsp'
            elif unit in ['lb', 'lbs', 'pound', 'pounds']:
                unit = 'lb'
            elif unit in ['gallon', 'gallons']:
                unit = 'gal'
            elif unit in ['oz', 'ounce', 'ounces']:
                unit = 'oz'
            elif unit in ['liter', 'liters']:
                unit = 'l'
            elif unit in ['qt', 'qts']:
                unit = 'qt'
            # 'ml' needs no branch: the pattern only matches ml/mL, and the unit is already lowercased
        
        return {
            'quantity': match.group('quantity'), 
            'unit': unit, 
            'ingredient': match.group('ingredient').strip(), 
            'ingredient_step': step_index
        }
    else:
        # Handling cases that don't fit the expected pattern
        return {
            'quantity': None, 
            'unit': None, 
            'ingredient': corrected_str, 
            'ingredient_step': step_index
        }
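
One limitation of the pattern above is that the unit group can match the beginning of a longer word, and that `dash` is tried before `dashes`. The sketch below compares a shortened copy of the notebook's pattern with an assumed variant that adds a word boundary (`\b`) after the unit. The names `UNITS`, `loose`, `strict`, and `show`, as well as the sample strings, are hypothetical.

```python
import re

UNITS = r"oz|ounces?|tsp|dash|dashes|drops?|cl"  # shortened unit list for the demo

# Same shape as the notebook's pattern: the unit may match a word prefix.
loose = re.compile(
    r"(?P<quantity>\d+\s*\d*/\d+|\d*\.\d+|\d+)?\s*"
    rf"(?P<unit>{UNITS})?\.?\s*(?P<ingredient>.+)",
    re.IGNORECASE,
)

# Assumed variant: a word boundary forces the unit to be a whole word.
strict = re.compile(
    r"(?P<quantity>\d+\s*\d*/\d+|\d*\.\d+|\d+)?\s*"
    rf"(?P<unit>(?:{UNITS})\b)?\.?\s*(?P<ingredient>.+)",
    re.IGNORECASE,
)

def show(pattern, text):
    """Return the (unit, ingredient) pair the pattern extracts from text."""
    m = pattern.match(text)
    return m.group("unit"), m.group("ingredient")

print(show(loose, "2 dashes orange bitters"))   # ('dash', 'es orange bitters')
print(show(strict, "2 dashes orange bitters"))  # ('dashes', 'orange bitters')
print(show(loose, "2 cloves"))                  # ('cl', 'oves')
print(show(strict, "2 cloves"))                 # (None, 'cloves')
```

Adopting the stricter pattern would change which rows end up without a unit, so the parsed output would need to be regenerated and re-inspected.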

In [18]:
parsed_ingredients_list = []

In [19]:
for index, row in modified_df.iterrows():
    for idx, ingredient in enumerate(row['ingredients']):
        parsed_ingredient = parse_ingredient(ingredient, idx)
        parsed_ingredient['recipe_id'] = index  # Adding the recipe ID to each ingredient
        parsed_ingredients_list.append(parsed_ingredient)

# Convert the list of dictionaries into a DataFrame
parsed_df = pd.DataFrame(parsed_ingredients_list)

In [20]:
parsed_df

Unnamed: 0,quantity,unit,ingredient,ingredient_step,recipe_id
0,1.5,oz,151-proof demerara rum [lemon hart or el dorado],0,0
1,0.5,oz,lime juice,1,0
2,0.5,oz,sugar syrup,2,0
3,1,dash,angostura bitters,3,0
4,6,drops,[1/8 tsp.] pernod,4,0
...,...,...,...,...,...
3871,.5,oz,dolin blanc,3,868
3872,1,dash,orange bitters,4,868
3873,,,"shake on ice and strain""",5,868
3874,,,champagne,0,869
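
The `quantity` column above still holds text such as `.5`, and the pattern can also capture fractions and mixed numbers such as `3/4` or `1 1/2`. If numeric quantities are needed later, a conversion along the following lines may help. This is a sketch, not part of the notebook: `quantity_to_float` is a hypothetical helper built on the standard-library `fractions` module.

```python
from fractions import Fraction

def quantity_to_float(quantity):
    """Hypothetical helper: turn '1 1/2', '3/4', '.5' or '2' into a float.

    Returns None for missing or unparseable quantities.
    """
    if quantity is None:
        return None
    try:
        # A mixed number is the sum of its whitespace-separated parts.
        return float(sum(Fraction(part) for part in quantity.split()))
    except (ValueError, ZeroDivisionError):
        return None

for text in ["1 1/2", "3/4", ".5", "2", None]:
    print(repr(text), "->", quantity_to_float(text))
```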


# Post-Parsing Analysis

## Tools to Inspect

### `inspect_row` function

In [21]:
def inspect_row(modified_df, parsed_df, row_index):
    """
    Prints the title, ingredients, and directions of a given row in the modified DataFrame,
    and then prints all corresponding parsed ingredient rows from the parsed DataFrame.
    
    Parameters:
    - modified_df: pandas DataFrame containing the original cocktail data.
    - parsed_df: pandas DataFrame containing the parsed ingredients data.
    - row_index: Integer index of the row to inspect in modified_df and to match in parsed_df.
    """
    # Print details from the modified DataFrame
    print(f"Title: {modified_df.loc[row_index, 'title']}\n")
    print("Ingredients:")
    for ingredient in modified_df.loc[row_index, 'ingredients']:
        print(f"- {ingredient}")
    print("\nDirections:")
    for direction in modified_df.loc[row_index, 'directions']:
        print(f"- {direction}")
    
    # Print corresponding rows from the parsed DataFrame
    print("\nParsed Ingredients:")
    parsed_rows = parsed_df[parsed_df['recipe_id'] == row_index]
    if not parsed_rows.empty:
        print(parsed_rows.to_string(index=False))
    else:
        print("No parsed ingredients found for this recipe.")

In [22]:
inspect_row(modified_df, parsed_df, 0)

Title: 151 Swizzle

Ingredients:
- 1.5 oz. 151-proof demerara rum [lemon hart or el dorado]
- 0.5 oz. lime juice
- 0.5 oz. sugar syrup
- 1 dash angostura bitters
- 6 drops [1\/8 tsp.] pernod
- 8 oz. crushed ice

Directions:
- 

Parsed Ingredients:
quantity  unit                                       ingredient  ingredient_step  recipe_id
     1.5    oz 151-proof demerara rum [lemon hart or el dorado]                0          0
     0.5    oz                                       lime juice                1          0
     0.5    oz                                      sugar syrup                2          0
       1  dash                                angostura bitters                3          0
       6 drops                                [1/8 tsp.] pernod                4          0
       8    oz                                      crushed ice                5          0


### `none_rows` filter

In [23]:
# Filter rows and reset index 
none_rows = parsed_df.loc[parsed_df['quantity'].isnull() | parsed_df['unit'].isnull()].reset_index(drop=True)

# Display the filtered dataframe
none_rows

Unnamed: 0,quantity,unit,ingredient,ingredient_step,recipe_id
0,,,the 21st century,0,1
1,,,rinse coupe glass with pernod,4,1
2,1,,sugar cube,1,3
3,1,,perforated spoon,3,3
4,1,,barspoon lime juice,3,5
...,...,...,...,...,...
1489,,,variation: substitute 1/2 ounce galliano and 1...,8,865
1490,,,,9,865
1491,,,"shake on ice and strain""",5,868
1492,,,champagne,0,869
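
The rows above mix two situations: countable items that have a quantity but no unit (for example `1 sugar cube`), and free text with neither. Separating them may make the inspection easier. The sketch below uses plain dictionaries copied from the output above; `missing_kind` and its labels are hypothetical.

```python
# A few rows copied from the none_rows output above, as plain dicts.
rows = [
    {"quantity": None, "unit": None, "ingredient": "the 21st century"},
    {"quantity": "1", "unit": None, "ingredient": "sugar cube"},
    {"quantity": "1", "unit": None, "ingredient": "barspoon lime juice"},
    {"quantity": None, "unit": None, "ingredient": "champagne"},
]

def missing_kind(row):
    """Hypothetical labels describing which fields are absent."""
    if row["quantity"] is None and row["unit"] is None:
        return "no quantity, no unit"
    if row["unit"] is None:
        return "quantity only"
    return "unit only"

for row in rows:
    print(f'{missing_kind(row):22} {row["ingredient"]}')
```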


In [24]:
none_rows.loc[none_rows['ingredient_step'] == 0]

Unnamed: 0,quantity,unit,ingredient,ingredient_step,recipe_id
0,,,the 21st century,0,1
9,,,"(a classic variation on the perfect rob roy), ...",0,7
11,,,a sidecar with lime juice instead of lemon juice.,0,9
16,1,,part yellow chartreuse,0,12
27,3/4,,dry gin,0,15
...,...,...,...,...,...
1477,,,after dinner cocktail,0,858
1481,2,,orange wedges,0,861
1485,,,woo woo shooter on page 429.,0,862
1486,,,a caribbean favorite.,0,865



### Future tools/calls to look into problems

This section is intentionally left empty for now; it is reserved for future inspection helpers.

## Visual Analysis

### Graphs/Visuals