## Step 1: Library Installation

In [7]:
!pip install datasets pandas openai pymongo

Collecting datasets
  Downloading datasets-2.18.0-py3-none-any.whl (510 kB)
Collecting openai
  Downloading openai-1.14.2-py3-none-any.whl (262 kB)
Collecting pymongo
  Downloading pymongo-4.6.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (677 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)
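
After installation, it can help to confirm which versions actually landed in the environment. The following is a minimal sketch using only the standard library; the `installed_versions` helper and the `REQUIRED` list are illustrative names, not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version

# Packages named in the pip command above.
REQUIRED = ["datasets", "pandas", "openai", "pymongo"]

def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

for name, ver in installed_versions(REQUIRED).items():
    print(f"{name}: {ver or 'not installed'}")
```

A `None` entry suggests the install did not complete for that package, or that the notebook kernel is using a different environment than the one pip wrote to.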

## Step 2: Data Loading

In [8]:
# 1. Load Dataset
from datasets import load_dataset
import pandas as pd

# https://huggingface.co/datasets/brianarbuckle/cocktail_recipes
dataset = load_dataset("brianarbuckle/cocktail_recipes")

# Convert the dataset to a pandas dataframe
dataset_df = pd.DataFrame(dataset['train'])

dataset_df.head(5)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Unnamed: 0,title,ingredients,directions,misc,source,ner
0,151 Swizzle,[1.5 oz. 151-Proof Demerara Rum [Lemon Hart or...,[],[],Beachbum Berry Remixed,"[pernod, rum]"
1,20th Century,"[The 21st Century, 2 oz. Siete Leguas Blanco T...","[shake on ice and strain into coupe , The Best...",[],Jim Meehan,"[cocchi americano, pernod, tequila]"
2,20th Century,"[1.5 oz. Plymouth Gin, 3\/4 oz. Mari Brizard W...",[shake on ice and strain],[],PDT,"[lillet, gin]"
3,Abbey Cocktail,[],"[Shake liquid ingredients with ice., Strain in...",[Suggested glassware is Cocktail Glass],The Ultimate Bar Book,[]
4,Absinthe Drip,[1 1/2 ounces Pernod (or other absinthe substi...,[Pour Pernod into a pousse-caf or sour glass....,[The Absinthe Drip was made famous by Toulouse...,The Ultimate Bar Book,"[pernod, absinthe]"


## Step 3: Data Cleaning and Preparation


In [9]:
import numpy as np

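def process_ingredients(ner):
    """Flatten a list of extracted ingredient names into one comma-separated string.

    Empty lists become NaN so they can be dropped later; non-list values pass through unchanged.
    """
    if isinstance(ner, list):
        return ", ".join(ner) if ner else np.nan
    return ner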

dataset_df['ner'] = dataset_df['ner'].apply(process_ingredients)
dataset_df


Unnamed: 0,title,ingredients,directions,misc,source,ner
0,151 Swizzle,[1.5 oz. 151-Proof Demerara Rum [Lemon Hart or...,[],[],Beachbum Berry Remixed,"pernod, rum"
1,20th Century,"[The 21st Century, 2 oz. Siete Leguas Blanco T...","[shake on ice and strain into coupe , The Best...",[],Jim Meehan,"cocchi americano, pernod, tequila"
2,20th Century,"[1.5 oz. Plymouth Gin, 3\/4 oz. Mari Brizard W...",[shake on ice and strain],[],PDT,"lillet, gin"
3,Abbey Cocktail,[],"[Shake liquid ingredients with ice., Strain in...",[Suggested glassware is Cocktail Glass],The Ultimate Bar Book,
4,Absinthe Drip,[1 1/2 ounces Pernod (or other absinthe substi...,[Pour Pernod into a pousse-caf or sour glass....,[The Absinthe Drip was made famous by Toulouse...,The Ultimate Bar Book,"pernod, absinthe"
...,...,...,...,...,...,...
870,Yellow Bird,"[ A Caribbean favorite., 1 ounce dark rum, 1 o...","[Shake liquid ingredients with ice., Strain in...",[Suggested glassware is Cocktail Glass],The Ultimate Bar Book,"galliano, triple sec, rum, cointreau"
871,Yellow Fever,"[1 1/2 ounces vodka, 1/2 ounce Galliano, 1/2 o...","[Shake ingredients with ice., Strain into a ch...",[Suggested glassware is Cocktail Glass],The Ultimate Bar Book,"vodka, galliano"
872,Yellow Parrot Cocktail,"[3/4 ounce yellow Chartreuse, 3/4 ounce Pernod...","[Shake ingredients with ice., Strain into a ch...",[Suggested glassware is Cocktail Glass],The Ultimate Bar Book,"brandy, pernod, chartreuse"
873,[The Spirit of the] Algonquin,"[ 2oz. 90 Proof Rye, .75oz. Lemon Juice, .75oz...",[shake on ice and strain],[Suggested glassware is Cocktail Glass],PDT,
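
To make the three branches of `process_ingredients` concrete, here is a small sketch that applies the same logic to a toy Series. The values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

def process_ingredients(ner):
    # Same logic as the cell above: join non-empty lists, map empty lists to NaN.
    if isinstance(ner, list):
        return ", ".join(ner) if ner else np.nan
    return ner

# Toy values: a non-empty list, an empty list, and an already-flat string.
toy = pd.Series([["gin", "lillet"], [], "rum"])
result = toy.apply(process_ingredients)
print(result.tolist())
```

Only the empty list turns into a missing value, which is why the `ner` column can later be filtered with `dropna`.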


In [10]:
print("Columns:", dataset_df.columns)
print("\nNumber of rows and columns:", dataset_df.shape)
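print("\nBasic Statistics for numerical data:")
# Every column here has object dtype, so describe() returns count/unique/top/freq
# instead of numeric statistics such as mean and std.
print(dataset_df.describe())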
print("\nNumber of missing values in each column:")
print(dataset_df.isnull().sum())

Columns: Index(['title', 'ingredients', 'directions', 'misc', 'source', 'ner'], dtype='object')

Number of rows and columns: (875, 6)

Basic Statistics for numerical data:
             title                                        ingredients  \
count          875                                                875   
unique         646                                                858   
top     Ward Eight  [After Dinner Cocktail, 5 cl Cognac, 2 cl Crme...   
freq             9                                                  6   

       directions misc                 source  ner  
count         875  875                    875  753  
unique        456   63                     83  252  
top            []   []  The Ultimate Bar Book  gin  
freq           83  376                    438   81  

Number of missing values in each column:
title            0
ingredients      0
directions       0
misc             0
source           0
ner            122
dtype: int64
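
The zero counts for `ingredients`, `directions`, and `misc` do not mean those columns are always populated: the `describe()` output shows `[]` as the most frequent value in two of them. A hedged sketch of how empty lists could be counted follows, assuming the cells hold Python lists (as the `isinstance(ner, list)` check suggests). The toy rows are hypothetical.

```python
import pandas as pd

# Toy frame (hypothetical rows) mimicking the list-valued columns.
toy_df = pd.DataFrame({
    "ingredients": [["1 oz gin"], [], ["2 oz rum"]],
    "directions": [[], [], ["shake on ice and strain"]],
})

# isnull() sees a list object in every cell, so it reports no missing values.
print(toy_df.isnull().sum())

# Counting zero-length lists exposes the cells that are effectively empty.
empty_counts = toy_df.apply(lambda col: col.map(len).eq(0).sum())
print(empty_counts)
```

Whether such rows should be dropped depends on which fields the later steps rely on; the notebook itself filters only on `ner`.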


In [11]:
# Drop rows without extracted ingredients, then rename 'ner' to 'base'.
# Chaining returns a new DataFrame, which avoids the SettingWithCopyWarning
# that an in-place rename on the filtered frame can raise.
dataset_df = dataset_df.dropna(subset=['ner']).rename(columns={'ner': 'base'})
dataset_df


Unnamed: 0,title,ingredients,directions,misc,source,base
0,151 Swizzle,[1.5 oz. 151-Proof Demerara Rum [Lemon Hart or...,[],[],Beachbum Berry Remixed,"pernod, rum"
1,20th Century,"[The 21st Century, 2 oz. Siete Leguas Blanco T...","[shake on ice and strain into coupe , The Best...",[],Jim Meehan,"cocchi americano, pernod, tequila"
2,20th Century,"[1.5 oz. Plymouth Gin, 3\/4 oz. Mari Brizard W...",[shake on ice and strain],[],PDT,"lillet, gin"
4,Absinthe Drip,[1 1/2 ounces Pernod (or other absinthe substi...,[Pour Pernod into a pousse-caf or sour glass....,[The Absinthe Drip was made famous by Toulouse...,The Ultimate Bar Book,"pernod, absinthe"
5,Acapulco,"[1 ounce gold tequila, 1 ounce gold rum, 2 oun...","[Shake ingredients with ice., Strain into an i...",[Suggested glassware is Highball Glass],The Ultimate Bar Book,"tequila, rum"
...,...,...,...,...,...,...
869,Yellow Bird,"[3 cl White Rum, 1.5 cl Galliano, 1.5 cl Tripl...","[Shake liquid ingredients with ice., Strain in...",[Suggested glassware is Cocktail Glass],IBA,"galliano, triple sec, rum"
870,Yellow Bird,"[ A Caribbean favorite., 1 ounce dark rum, 1 o...","[Shake liquid ingredients with ice., Strain in...",[Suggested glassware is Cocktail Glass],The Ultimate Bar Book,"galliano, triple sec, rum, cointreau"
871,Yellow Fever,"[1 1/2 ounces vodka, 1/2 ounce Galliano, 1/2 o...","[Shake ingredients with ice., Strain into a ch...",[Suggested glassware is Cocktail Glass],The Ultimate Bar Book,"vodka, galliano"
872,Yellow Parrot Cocktail,"[3/4 ounce yellow Chartreuse, 3/4 ounce Pernod...","[Shake ingredients with ice., Strain into a ch...",[Suggested glassware is Cocktail Glass],The Ultimate Bar Book,"brandy, pernod, chartreuse"
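
Note that the row labels in the output above skip from 2 to 4: `dropna` keeps the original index, so gaps remain where rows were removed. If a contiguous index is wanted downstream, one option is `reset_index(drop=True)`. The sketch below shows this on a hypothetical toy frame and is not a step the original notebook performs.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical rows) with one missing 'ner' value.
toy_df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "ner": ["gin", np.nan, "rum"],
})

cleaned = toy_df.dropna(subset=["ner"]).rename(columns={"ner": "base"})
print(cleaned.index.tolist())  # original labels survive, leaving a gap

# Optional: renumber the rows from 0 and discard the old labels.
cleaned = cleaned.reset_index(drop=True)
print(cleaned.index.tolist())
```

Keeping the original labels is also reasonable when rows need to be traced back to the source dataset.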
