# Imports

In [1]:
import pandas as pd

from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Dataset Inspection

In [2]:
# Load the cleaned data
df = pd.read_csv('clean_df.csv')

# Display the first few rows of the DataFrame to verify it's loaded correctly
df.head(20)

Unnamed: 0,quantity,unit,ingredient,ingredient_step,recipe_id
0,45.0,ml,dark rum,0,0
1,22.5,ml,lime juice,1,0
2,15.0,ml,sugar,2,0
3,1.0,dash,angostura,3,0
4,6.0,drop,pernod,4,0
5,240.0,ml,crushed ice,5,0
6,60.0,ml,silver tequila,0,1
7,22.5,ml,marie brizard creme de cacao,1,1
8,22.5,ml,lemon juice,2,1
9,45.0,ml,gin,0,2


In [3]:
df.tail(5)

Unnamed: 0,quantity,unit,ingredient,ingredient_step,recipe_id
2887,60.0,ml,rye whiskey,0,654
2888,22.5,ml,lemon juice,1,654
2889,22.5,ml,pineapple syrup,2,654
2890,15.0,ml,dolin blanc,3,654
2891,1.0,dash,orange bitters,4,654


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2892 entries, 0 to 2891
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   quantity         2586 non-null   float64
 1   unit             2524 non-null   object 
 2   ingredient       2890 non-null   object 
 3   ingredient_step  2892 non-null   int64  
 4   recipe_id        2892 non-null   int64  
dtypes: float64(1), int64(2), object(2)
memory usage: 113.1+ KB


In [5]:
# Group by recipe_id and aggregate ingredients into lists
grouped_df = df.groupby('recipe_id')['ingredient'].apply(list).reset_index()

# Display grouped DataFrame
print(grouped_df)

     recipe_id                                         ingredient
0            0  [dark rum, lime juice, sugar, angostura, perno...
1            1  [silver tequila, marie brizard creme de cacao,...
2            2  [gin, mari brizard white creme de cacao, lille...
3            3                             [pernod, sugar, water]
4            4  [gold tequila, gold rum, grapefruit juice, pin...
..         ...                                                ...
650        650      [light rum, galliano, triple sec, lime juice]
651        651  [dark rum, light rum, tia maria, orange juice,...
652        652     [vodka, galliano, lime juice, pineapple juice]
653        653        [yellow chartreuse, pernod, apricot brandy]
654        654  [rye whiskey, lemon juice, pineapple syrup, do...

[655 rows x 2 columns]


In [6]:
# Extract the list of lists of ingredients
recipes = grouped_df['ingredient'].tolist()

# Get the list of all unique ingredients
all_ingredients = df['ingredient'].unique().tolist()

# Use MultiLabelBinarizer to encode the ingredients
mlb = MultiLabelBinarizer(classes=all_ingredients)
encoded_recipes = mlb.fit_transform(recipes)

# Create a DataFrame for easier manipulation
encoded_df = pd.DataFrame(encoded_recipes, columns=mlb.classes_)

In [7]:
encoded_df.shape

(655, 389)
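The `df.info()` output above reports two rows with a missing `ingredient`, so `unique()` most likely returns `NaN` as one of the classes. A minimal sketch of dropping missing values before encoding, using made-up toy rows rather than the real dataset (all `toy_*` names are illustrative):

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy rows standing in for clean_df.csv (illustrative values, not real data)
toy_df = pd.DataFrame({
    "recipe_id": [0, 0, 1, 1, 1],
    "ingredient": ["gin", "lime juice", "gin", None, "sugar"],
})

# Drop rows with a missing ingredient so NaN never becomes a class
toy_clean = toy_df.dropna(subset=["ingredient"])

# One list of ingredients per recipe, plus the sorted vocabulary
toy_recipes = toy_clean.groupby("recipe_id")["ingredient"].apply(list).tolist()
toy_classes = sorted(toy_clean["ingredient"].unique())

toy_mlb = MultiLabelBinarizer(classes=toy_classes)
toy_encoded = pd.DataFrame(toy_mlb.fit_transform(toy_recipes), columns=toy_mlb.classes_)
print(toy_encoded)
```

Applied to the real data, the same `dropna` step would come before both the `groupby` and the `unique()` call.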

# Outline:

Binary Encoding of Ingredients

In [8]:
# `recipes` is a list of lists, where each inner list contains the ingredients of one recipe
mlb = MultiLabelBinarizer(classes=all_ingredients)
encoded_recipes = mlb.fit_transform(recipes)  # fit on the ingredient lists, not on the raw DataFrame

# Create a DataFrame for easier manipulation
encoded_df = pd.DataFrame(encoded_recipes, columns=mlb.classes_)



Model Building:

In [8]:
input_dim = len(all_ingredients)  # Number of unique ingredients

model = Sequential()
model.add(Dense(units=128, input_dim=input_dim, activation='relu'))
model.add(Dense(units=128, activation='relu'))
model.add(Dense(units=input_dim, activation='sigmoid'))  # Output layer with sigmoid activation

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
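To get a feel for the size of this network: a `Dense` layer has `inputs * units + units` trainable parameters (weights plus biases). A quick sketch of the arithmetic, assuming 389 input columns as reported by `encoded_df.shape`; the helper `dense_params` is a hypothetical name used only here:

```python
def dense_params(n_inputs: int, n_units: int) -> int:
    """Weights plus biases for one fully connected layer."""
    return n_inputs * n_units + n_units

example_input_dim = 389  # taken from encoded_df.shape; adjust if the encoding changes
layer_sizes = [example_input_dim, 128, 128, example_input_dim]

# Pair each layer size with the next one: (389, 128), (128, 128), (128, 389)
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total_params = sum(per_layer)
print(per_layer, total_params)  # [49920, 16512, 50181] 116613
```

With only 655 recipes to learn from, a model of this size has far more parameters than training examples, so the validation loss is worth watching for overfitting.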


Model Training:

In [10]:
X = encoded_df.values  # Input features (binary vectors of ingredients)
y = encoded_df.values  # Targets are identical to the inputs, so the network learns to reconstruct each recipe

model.fit(X, y, epochs=50, batch_size=32, validation_split=0.2)

Epoch 1/50
...
Epoch 50/50


<keras.callbacks.History at 0x1a15ca12910>
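
Because `X` and `y` are identical, the network can lower the loss by simply copying its input, which gives it little reason to suggest ingredients beyond the seed. One common alternative, which is not what the notebook above does, is to hide some ingredients in each input and keep the full recipe as the target. A minimal NumPy sketch, where `mask_ingredients` and `drop_prob` are hypothetical names chosen for illustration:

```python
import numpy as np

def mask_ingredients(X, drop_prob=0.3, seed=0):
    """Randomly zero out entries of X; returns a masked copy, X is unchanged."""
    rng = np.random.default_rng(seed)
    keep = rng.random(X.shape) >= drop_prob  # True where an entry survives
    return X * keep

# Toy binary recipe matrix (3 recipes, 5 ingredients), illustrative only
X_full = np.array([
    [1, 1, 0, 1, 0],
    [0, 1, 1, 0, 0],
    [1, 0, 0, 1, 1],
])

X_masked = mask_ingredients(X_full)
y_target = X_full  # the model would be asked to recover the full recipe
print(X_masked)
```

With the real data this would become something like `model.fit(mask_ingredients(X), X, ...)`, under the assumption that recipe completion is the goal.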

Generating New Recipes:

Start with a seed ingredient or set of ingredients.
Use the trained model to predict additional ingredients.

In [11]:
seed_ingredients = ['vodka', 'orange juice']  # Example seed ingredients
seed_vector = mlb.transform([seed_ingredients])[0]

predicted_probabilities = model.predict(seed_vector.reshape(1, -1))
predicted_ingredients = mlb.inverse_transform((predicted_probabilities > 0.5).astype(int))[0]

generated_recipe = list(set(seed_ingredients + list(predicted_ingredients)))
print("Generated Recipe:", generated_recipe)


Generated Recipe: ['mari brizard white creme de cacao', 'port', 'walnut liqueur', 'apricot brandy', 'cucumber spear', 'jägermeister', 'irish whiskey', 'chopped green onion', 'orange bitters', 'parts scotch', 'limoncello', 'disaronno', 'orange juice', 'frangelico', 'vodka', 'kina lillet', 'jigger sazerac brandy']
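
The fixed 0.5 threshold gives no control over how many ingredients come back; the output above contains 17. A hedged alternative is to rank the predicted probabilities and keep only the top few that are not already in the seed. The helper `top_k_new_ingredients` is a hypothetical name, and the probabilities below are made up for illustration:

```python
import numpy as np

def top_k_new_ingredients(probs, classes, seed_ingredients, k=3):
    """Return the k most probable ingredients that are not in the seed."""
    order = np.argsort(probs)[::-1]  # indices from highest to lowest probability
    picks = [classes[i] for i in order if classes[i] not in seed_ingredients]
    return picks[:k]

example_classes = ["vodka", "orange juice", "galliano", "lime juice", "pernod"]
example_probs = np.array([0.95, 0.90, 0.60, 0.40, 0.05])  # made-up model output

suggestions = top_k_new_ingredients(
    example_probs, example_classes, ["vodka", "orange juice"], k=2
)
print(suggestions)  # ['galliano', 'lime juice']
```

In the notebook, the inputs would be `predicted_probabilities[0]` and `mlb.classes_`.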


Summary of Steps:

Clean Data Preparation:

Create a DataFrame with binary encoded ingredient vectors for each recipe.

Model Building:

Define a neural network to learn from these vectors.

Model Training:

Train the model on the dataset.

Recipe Generation:

Generate new recipes starting from seed ingredients.

Following these steps produces a model that suggests cocktail ingredients based on the co-occurrence patterns it learns from the dataset.

The architecture described here is a simple feedforward neural network, also known as a Multi-Layer Perceptron (MLP). It is framed as multi-label classification: each ingredient is a binary label that is either present or absent in a recipe. Because the notebook uses the same vectors as input and target, the network is effectively trained to reconstruct each recipe, much like an autoencoder.

Architecture Explanation

Input Layer:

The input layer consists of nodes equal to the number of unique ingredients. Each node represents whether a particular ingredient is present (1) or absent (0) in the recipe.
Input Dimension: input_dim = len(all_ingredients)

Hidden Layers:

The network has two hidden layers, each with 128 neurons. These layers use the ReLU (Rectified Linear Unit) activation function, which introduces non-linearity to the model and allows it to learn complex patterns.
First Hidden Layer: Dense(units=128, input_dim=input_dim, activation='relu')
Second Hidden Layer: Dense(units=128, activation='relu')

Output Layer:

The output layer also consists of nodes equal to the number of unique ingredients. Each node represents the probability of the corresponding ingredient being part of the recipe.
The sigmoid activation function is used in the output layer to produce probabilities between 0 and 1 for each ingredient.
Output Layer: Dense(units=input_dim, activation='sigmoid')
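
The three layers described above amount to two ReLU transformations followed by a sigmoid. A rough NumPy sketch of one forward pass, using randomly initialised weights as stand-ins for trained ones (the sizes 389 and 128 follow the notebook; everything else is illustrative):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_ingredients, n_hidden = 389, 128

# Random weights and zero biases, standing in for trained parameters
W1, b1 = rng.normal(0, 0.05, (n_ingredients, n_hidden)), np.zeros(n_hidden)
W2, b2 = rng.normal(0, 0.05, (n_hidden, n_hidden)), np.zeros(n_hidden)
W3, b3 = rng.normal(0, 0.05, (n_hidden, n_ingredients)), np.zeros(n_ingredients)

x = np.zeros((1, n_ingredients))
x[0, [3, 17]] = 1.0  # a recipe containing two arbitrary ingredients

h1 = relu(x @ W1 + b1)       # first hidden layer
h2 = relu(h1 @ W2 + b2)      # second hidden layer
out = sigmoid(h2 @ W3 + b3)  # one independent probability per ingredient
print(out.shape)
```

Each output is computed independently by the sigmoid, so the probabilities do not have to sum to 1, which is what allows several ingredients to be selected at once.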

Loss Function:

The model uses binary cross-entropy loss, suitable for multi-label classification where each label (ingredient) is a binary decision (present or absent).
Loss Function: binary_crossentropy
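
For a single label with target y and predicted probability p, binary cross-entropy is -(y*log(p) + (1-y)*log(1-p)), averaged over all labels. A small pure-Python sketch with made-up predictions (the function below is written for illustration and is not the Keras implementation):

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over all labels of one recipe."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

y_example = [1, 0, 0, 1]
confident = [0.9, 0.1, 0.1, 0.9]  # close to the targets
unsure = [0.5, 0.5, 0.5, 0.5]     # no information

loss_confident = binary_cross_entropy(y_example, confident)
loss_unsure = binary_cross_entropy(y_example, unsure)
print(round(loss_confident, 4), round(loss_unsure, 4))  # 0.1054 0.6931
```

Predictions close to the targets give a loss near 0, while uninformative predictions of 0.5 give log(2), about 0.693.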
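
Optimizer:

The Adam optimizer is used to minimize the loss function during training. Adam is a popular choice due to its adaptive learning rate and efficiency.
Optimizer: adam

Metrics:

The accuracy metric is tracked during training and validation. Because each recipe contains only a few of the available ingredients, most target entries are 0, so accuracy can be high even when few ingredients are predicted correctly and should be read with caution.
Metrics: accuracy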
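
To see why accuracy alone can mislead on sparse targets, consider a "model" that never predicts any ingredient. The sketch below uses toy targets with exactly 5 ingredients per recipe, which is an illustrative assumption (the real data averages roughly 4.4 rows per recipe, 2892 rows over 655 recipes):

```python
import numpy as np

n_recipes, n_ingredients = 100, 389
rng = np.random.default_rng(0)

# Toy targets: every recipe contains exactly 5 randomly chosen ingredients
y_true = np.zeros((n_recipes, n_ingredients), dtype=int)
for row in y_true:
    row[rng.choice(n_ingredients, size=5, replace=False)] = 1

y_pred = np.zeros_like(y_true)  # never predicts any ingredient

# Fraction of matching entries versus fraction of true ingredients recovered
elementwise_accuracy = (y_pred == y_true).mean()
recall = (y_pred & y_true).sum() / y_true.sum()
print(round(elementwise_accuracy, 4), recall)  # 0.9871 0.0
```

Element-wise accuracy is about 98.7% even though no ingredient is ever recovered, so a metric such as recall or precision on the positive entries is arguably more informative here.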

Why This Architecture?

Simplicity: This architecture is straightforward and easy to implement, making it suitable for a dataset of this size (655 recipes and 389 ingredient columns, per encoded_df.shape).
Flexibility: The use of dense layers allows the network to learn from the presence and absence of each ingredient, capturing the relationships between different ingredients.
Scalability: Adding more hidden layers or increasing the number of neurons in each layer can increase the model's capacity to learn more complex patterns, if needed.

Summary

This architecture is a feedforward neural network designed for multi-label classification. It takes a binary vector representing the presence or absence of each ingredient as input and outputs a vector of per-ingredient probabilities, which is then thresholded at 0.5 to decide which ingredients belong in the cocktail recipe. This approach generates new recipes by learning common ingredient combinations from the training data.






