<center><h1>PantryPalette: Your Ingredient-Inspired Recipe Guide</h1></center>

# Problem Statement

Every day, households struggle with unused ingredients in their kitchens, often leading to food waste due
to a lack of inspiration or knowledge on how to use them effectively. Many individuals resort to takeout
or repetitive meals, simply because they don’t know what to cook with the ingredients they already have.


Meanwhile, traditional recipe search methods require manual browsing through extensive databases,
leading to time-consuming and frustrating experiences. Users with dietary restrictions, allergies, or
cultural food preferences face an additional challenge in finding recipes that align with their needs.

PantryPalette is designed to solve this problem by offering a seamless, ingredient-based recipe discovery
experience that helps users maximize their groceries, reduce food waste, and discover new meal ideas
with ease.

# Purpose

The purpose of the MLOps project is to develop an end-to-end data science solution that is usable by someone with no technical knowledge. 

It must include the following components:
- A process map that documents the workflow
- Data ingested from an online source
- A data repository and a model repository
- A predictive model built on the data
- Model predictions that are:

  a. Served through a Streamlit application

  b. Deployed via Docker

  c. Accessible to users

- A model monitoring dashboard
- Documentation of the model process and the risks of running it in production

The goal is not a complex model but a simpler one that works. All of this is to be done in teams of two, along with individual users. Both the modeling teams and the users will take part in the presentations.

# Workflow

```mermaid
graph TD;
    A[User Inputs Ingredients] --> B[Preprocessing]
    B --> C[Tokenization & Standardization]
    C --> D[Feature Engineering]
    D --> E[TF-IDF Vectorization]
    E --> F[Cosine Similarity Calculation]
    F --> G[Retrieve Relevant Recipes]
    G --> H[Rank Recipes Based on Similarity]
    H --> I[Provide Alternative Ingredient Suggestions]
    I --> J[Display Recipes with Ingredients & Instructions]
```

# Implementation

## Data Collection and Ingestion

- Get RecipeNLG data (static data) + Spoonacular API (dynamic data)

In [None]:
import pandas as pd

### Static Data Collection (RecipeNLG Dataset)

- Get RecipeNLG data (static data)

In [None]:
receipenlg = pd.read_csv("../dataset/RecipeNLG_dataset.csv")

In [None]:
receipenlg.shape

In [None]:
receipenlg.head()

In [None]:
# receipenlg = receipenlg.sample(n=10000, random_state=42).reset_index(drop=True)
# receipenlg.shape

In [None]:
receipenlg.info()

In [None]:
# Drop unnecessary columns
receipenlg.drop(columns=["Unnamed: 0", "link", "source", "NER"], inplace=True)
receipenlg.head()

In [None]:
df_static = receipenlg.copy()

### Data Ingestion

- Load the recipes collected from the Spoonacular API (dynamic data)

In [None]:
df_dynamic = pd.read_csv("../dataset/recipes.csv")
df_dynamic.head()

In [None]:
# Drop unnecessary columns
df_dynamic.drop(columns=["image", "total time", "description"], inplace=True)
df_dynamic.head()

### Data Preprocessing

- Clean & standardize ingredients

Used regex and string methods to clean and standardize:
- Ingredients: lowercase, alphanumeric filtering, list normalization.
- Instructions: sentence splitting, stripping, and cleaning.


In [None]:
import pandas as pd
import ast
import re

# -----------------------
# Step 1: Standardize Columns
# -----------------------
df_static.rename(columns={'directions': 'instructions'}, inplace=True)
df_dynamic.rename(columns={'Recipe_name': 'title',
                           'Recipe_ingredients': 'ingredients',
                           'Recipe_instructions': 'instructions'}, inplace=True)

# -----------------------
# Step 2: Clean Ingredients
# -----------------------
def process_ingredients(ing):
    if isinstance(ing, list):
        return [re.sub(r'[^a-z0-9\s.]', '', i.lower().strip()) for i in ing]
    elif isinstance(ing, str):
        tokens = ing.split()
        return [re.sub(r'[^a-z0-9\s.]', '', i.lower().strip()) for i in tokens if len(i) > 1]
    else:
        return []

df_static['ingredients'] = df_static['ingredients'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)
df_dynamic['ingredients'] = df_dynamic['ingredients'].apply(process_ingredients)

# -----------------------
# Step 3: Clean Instructions
# -----------------------
def process_instructions(instr):
    if isinstance(instr, list):
        return [i.strip() for i in instr if len(i.strip()) > 0]
    elif isinstance(instr, str):
        return [i.strip() for i in instr.split('.') if len(i.strip()) > 0]
    else:
        return []

df_static['instructions'] = df_static['instructions'].apply(process_instructions)
df_dynamic['instructions'] = df_dynamic['instructions'].apply(process_instructions)

# -----------------------
# Step 4: Merge Datasets
# -----------------------
df_combined = pd.concat([df_static, df_dynamic], ignore_index=True)
df_combined.dropna(subset=['ingredients', 'instructions'], inplace=True)

df_combined.head()
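
To make the cleaning rules concrete, here is a minimal sketch that applies the same regex filter and the same period-based sentence split to two made-up inputs. The helper names `clean_token` and `split_instructions` and the sample strings are illustrative assumptions, not part of the pipeline above.

```python
import re

def clean_token(text: str) -> str:
    """Lowercase, trim, and keep only a-z, 0-9, whitespace, and periods."""
    return re.sub(r'[^a-z0-9\s.]', '', text.lower().strip())

def split_instructions(text: str) -> list:
    """Split free text on periods and drop empty fragments."""
    return [step.strip() for step in text.split('.') if step.strip()]

# Hypothetical sample inputs, chosen to expose edge cases
sample_ingredients = ["2 Cups Flour!", " 1/2 tsp. Salt "]
cleaned = [clean_token(item) for item in sample_ingredients]
print(cleaned)   # ['2 cups flour', '12 tsp. salt']

sample_instructions = "Mix the flour. Bake for 20 minutes. "
steps = split_instructions(sample_instructions)
print(steps)     # ['Mix the flour', 'Bake for 20 minutes']
```

Two side effects are visible here: removing the slash turns `1/2` into `12`, and splitting on `.` would also break decimals such as `1.5 cups`. Neither matters much once quantities are dropped later, but both are worth keeping in mind if the cleaned text is ever shown to users.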

In [None]:
import os

os.makedirs("../processed_dataset", exist_ok=True)

# Save clean dataset
df_combined.to_csv("../processed_dataset/combined_recipes_cleaned.csv", index=False)
print("Data integration complete. Ready for feature engineering!")

In [None]:
df_combined.dtypes

In [None]:
import pandas as pd
import ast
import re
from collections import Counter
import matplotlib.pyplot as plt

# Load your combined dataset
df = pd.read_csv("../processed_dataset/combined_recipes_cleaned.csv")  # Replace with actual path

# STEP 1: Parse ingredients column into token list
def flatten_and_tokenize(ingredient_col):
    all_tokens = []
    for entry in ingredient_col:
        try:
            ingredients = ast.literal_eval(entry) if isinstance(entry, str) and entry.startswith("[") else [entry]
            for line in ingredients:
                # Lowercase, remove punctuation/numbers
                line = line.lower()
                line = re.sub(r'[^a-z\s]', '', line)
                tokens = line.split()
                all_tokens.extend(tokens)
        except Exception:
            # Skip rows that cannot be parsed (malformed list strings, NaN values)
            continue
    return all_tokens

# STEP 2: Generate token frequency
tokens = flatten_and_tokenize(df['ingredients'])
token_freq = Counter(tokens)

# STEP 3: Convert to DataFrame for inspection
stopword_df = pd.DataFrame(token_freq.items(), columns=["token", "count"]).sort_values(by="count", ascending=False)

# STEP 4: Show top 100 most common tokens
print(stopword_df.head(100))

# Optional: Save all tokens with counts
stopword_df.to_csv("../processed_dataset/ingredient_token_counts.csv", index=False)

In [None]:
import pandas as pd
import re
import ast
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')  # assumption: newer NLTK releases need this resource for word_tokenize
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

# Define custom stopwords 
custom_stopwords = stop_words.union({
    # Measurement units
    'c', 'cup', 'cups', 'tsp', 'teaspoon', 'teaspoons', 'tbsp', 'tablespoon', 'tablespoons',
    'oz', 'ounce', 'ounces', 'lb', 'lbs', 'pound', 'pounds', 'g', 'gram', 'grams', 'kg',
    'ml', 'milliliter', 'milliliters', 'l', 'liter', 'liters', 'qt', 'quart', 'quarts',
    'pt', 'pint', 'pints', 'gal', 'gallon', 'gallons', 'pkg', 'pkgs', 'package', 'packages',
    'stick', 'sticks', 'dash', 'pinch', 'can', 'cans', 'fluid', 'fl', 'jar', 'jars',
    'box', 'boxes', 'bottle', 'bottles', 't', 'tbs', 'tbls', 'qt.', 'pt.', 'oz.', 'lb.', 'g.', 'ml.', 'kg.', 'l.', 'pkg.', 'pkt',

    # Preparation and cooking descriptors
    'chopped', 'minced', 'diced', 'sliced', 'grated', 'crushed', 'shredded', 'cut',
    'peeled', 'optional', 'seeded', 'halved', 'coarsely', 'finely', 'thinly', 'roughly',
    'cubed', 'crumbled', 'ground', 'trimmed', 'boneless', 'skinless', 'melted', 'softened',
    'cooled', 'boiled', 'cooked', 'uncooked', 'raw', 'drained', 'rinsed', 'beaten', 'size',

    # Quantity and portion descriptors
    'small', 'medium', 'large', 'extra', 'light', 'dark', 'best', 'fresh', 'freshly',
    'ripe', 'mini', 'whole', 'big', 'room', 'temperature', 'zero', 'one', 'two', 'three',
    'four', 'five', 'six', 'eight', 'ten', 'twelve', 'half', 'third', 'quarter', 'dozen',
    'thousand', 'bite',

    # Filler or generic stopwords
    'plus', 'with', 'without', 'into', 'about', 'of', 'the', 'to', 'for', 'in', 'from',
    'as', 'and', 'or', 'on', 'your', 'if', 'such', 'you', 'use', 'may'
})

def preprocess_ingredients(ingredients):
    try:
        if isinstance(ingredients, str):
            ingredients_list = ast.literal_eval(ingredients) if ingredients.startswith("[") else [ingredients]
        elif isinstance(ingredients, list):
            ingredients_list = ingredients
        else:
            return ""

        cleaned_ingredients = set()

        for ing in ingredients_list:
            ing = re.sub(r'\(.*?\)', '', str(ing)).lower() 
            ing = re.sub(r'[^a-z\s]', '', ing)  
            tokens = word_tokenize(ing)
            tokens = [lemmatizer.lemmatize(token) for token in tokens if token not in custom_stopwords and len(token) > 1]
            if tokens:
                phrase = " ".join(tokens)
                if "oil" not in phrase and "salt" not in phrase and "water" not in phrase:
                    cleaned_ingredients.add(phrase)

        return ", ".join(sorted(cleaned_ingredients))

    except Exception as e:
        print(f"Preprocessing error: {e}")
        return ""

def preprocess_user_ingredients(user_input):
    """Split a comma-separated string and clean it like the recipe ingredients."""
    ingredients = [ing.strip() for ing in user_input.split(',')]
    return preprocess_ingredients(ingredients)

df_combined = pd.read_csv("../processed_dataset/combined_recipes_cleaned.csv") 

df_combined['ingredients_clean'] = df_combined['ingredients'].apply(preprocess_ingredients)

print(df_combined[['ingredients', 'ingredients_clean']].head())

df_combined.to_csv('../processed_dataset/combined_recipes_preprocessed.csv', index=False)
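
One detail of the filter above is that `"salt" not in phrase` is a substring test, so it also drops phrases that merely contain those letters. The sketch below compares that behavior with a whole-token check on a few made-up phrases. The names `PANTRY_STAPLES`, `has_staple_substring`, and `has_staple_token` are hypothetical, and the token-based variant is only one possible alternative, not the approach used in this notebook.

```python
PANTRY_STAPLES = {"oil", "salt", "water"}

def has_staple_substring(phrase: str) -> bool:
    """Mirror the substring test used in the preprocessing cell."""
    return any(staple in phrase for staple in PANTRY_STAPLES)

def has_staple_token(phrase: str) -> bool:
    """Alternative: flag a phrase only when a whole word is a staple."""
    return any(token in PANTRY_STAPLES for token in phrase.split())

# Hypothetical cleaned phrases
phrases = ["olive oil", "unsalted butter", "watermelon", "salt"]

kept_by_substring = [p for p in phrases if not has_staple_substring(p)]
kept_by_token = [p for p in phrases if not has_staple_token(p)]

print(kept_by_substring)  # []
print(kept_by_token)      # ['unsalted butter', 'watermelon']
```

Whether "unsalted butter" and "watermelon" should survive is a product decision. The point is only that the two checks give different results.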

In [None]:
df_combined

## Feature Engineering

- TF-IDF + Cosine Similarity

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors
import pandas as pd

# 1. Train-Test Split
train_df, test_df = train_test_split(df_combined, test_size=0.2, random_state=42)

# 2. TF-IDF Vectorization with Bigrams
vectorizer = TfidfVectorizer(max_features=500, ngram_range=(1, 2))
train_vectors = vectorizer.fit_transform(train_df['ingredients_clean'])

# 3. Nearest Neighbors Model (Cosine Similarity)
nn = NearestNeighbors(n_neighbors=10, metric='cosine')
nn.fit(train_vectors)
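
The same vectorize-then-search pattern can be tried on a toy corpus to see how it behaves. This is a minimal sketch: the three recipe strings, the query, and the `toy_*` names are made up for illustration, and `max_features` is omitted because the toy vocabulary is tiny.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Hypothetical cleaned ingredient strings (one per recipe)
toy_recipes = [
    "egg, flour, milk, sugar",
    "garlic, onion, tomato",
    "butter, chicken, garlic",
]

# Unigrams and bigrams, as in the cell above
toy_vectorizer = TfidfVectorizer(ngram_range=(1, 2))
toy_vectors = toy_vectorizer.fit_transform(toy_recipes)

# Cosine distance = 1 - cosine similarity
toy_nn = NearestNeighbors(n_neighbors=2, metric="cosine")
toy_nn.fit(toy_vectors)

query_vector = toy_vectorizer.transform(["onion, tomato, garlic"])
distances, indices = toy_nn.kneighbors(query_vector)

for dist, idx in zip(distances[0], indices[0]):
    print(f"{toy_recipes[idx]!r}: cosine distance {dist:.3f}")
```

The query shares three unigrams and one bigram with the second recipe, so that recipe should come back first. Bigrams depend on token order, which is one reason the preprocessing step sorts the cleaned ingredient phrases before joining them.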

## Model Development

- Train NLP-based model

In [None]:
def find_similar_recipes(user_input, vectorizer, nn_model, train_df):
    """
    Finds the top 10 similar recipes based on user's ingredients input and shows similarity score.
    """
    user_cleaned = preprocess_user_ingredients(user_input)
    print(f"\n🔍 Processed Input: {user_cleaned}")
    
    # Vectorize the cleaned user input using the fitted TF-IDF vectorizer
    user_vector = vectorizer.transform([user_cleaned])
    distances, indices = nn_model.kneighbors(user_vector)

    print("\n🍽️ Top 10 similar recipes:\n")
    for i, (idx, dist) in enumerate(zip(indices[0], distances[0])):
        similarity_score = round((1 - dist) * 100, 2)  # Convert cosine distance to percentage similarity
        title = train_df.iloc[idx]['title'] if 'title' in train_df.columns else 'N/A'
        ingredients = train_df.iloc[idx]['ingredients']
        ingredients_clean = train_df.iloc[idx]['ingredients_clean']

        print(f"{i+1}. Title: {title}")
        print(f"   Similarity: {similarity_score}%")
        print(f"   Ingredients: {ingredients}")
        print(f"   Cleaned Ingredients: {ingredients_clean}\n")

In [None]:
user_input = "milk, eggs, sugar, pasta, maida"
find_similar_recipes(user_input, vectorizer, nn, train_df)

In [None]:
user_input = "onion, tomato, garlic"
find_similar_recipes(user_input, vectorizer, nn, train_df)
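
The percentage printed by `find_similar_recipes` comes from converting cosine distance back to cosine similarity. As a small worked example, assuming two hand-written vectors rather than real TF-IDF output, the arithmetic looks like this (`cosine_similarity` here is a local helper written for illustration, not the scikit-learn function):

```python
import math

def cosine_similarity(u, v):
    """Dot product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical vectors: the user has two terms, the recipe shares one of them
user_vec = [1.0, 1.0, 0.0]
recipe_vec = [1.0, 0.0, 0.0]

similarity = cosine_similarity(user_vec, recipe_vec)  # 1 / sqrt(2)
distance = 1 - similarity                             # what kneighbors reports
score = round((1 - distance) * 100, 2)                # same formula as above

print(score)  # 70.71
```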

In [None]:
import os
import joblib

os.makedirs("../models", exist_ok=True)

# Save vectorizer and nearest neighbors model
joblib.dump(vectorizer, "../models/tfidf_vectorizer.pkl")
joblib.dump(nn, "../models/nearest_neighbors_model.pkl")

# Optionally save the training DataFrame (with titles + ingredients)
train_df.to_csv("../processed_dataset/train_data.csv", index=False)
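
A downstream app would need to reload these artifacts before serving predictions. The following is a minimal round-trip sketch using a throwaway vectorizer and a temporary directory, so it does not touch the real files in `../models`. The `demo_*` and `restored_*` names are assumptions for illustration.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer

# Throwaway vectorizer fitted on hypothetical data
demo_vectorizer = TfidfVectorizer().fit(["egg, milk", "garlic, onion"])

with tempfile.TemporaryDirectory() as tmp_dir:
    artifact_path = os.path.join(tmp_dir, "tfidf_vectorizer.pkl")
    joblib.dump(demo_vectorizer, artifact_path)    # save
    restored_vectorizer = joblib.load(artifact_path)  # reload

# The restored object should carry the same fitted vocabulary
same_vocabulary = restored_vectorizer.vocabulary_ == demo_vectorizer.vocabulary_
print(same_vocabulary)  # True
```

Because `kneighbors` returns row positions in the matrix the model was fitted on, the reloaded `train_data.csv` should be kept in its saved row order and indexed with `iloc`.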

## Store in Database

In [None]:
import pandas as pd
import sqlite3
import os
import ast

# Load the preprocessed CSV containing recipes
df = pd.read_csv("../processed_dataset/combined_recipes_preprocessed.csv")

df = df[['title', 'ingredients', 'instructions']] 

# Ensure that the database folder exists
os.makedirs("../database", exist_ok=True)

# Connect to (or create) the SQLite database
conn = sqlite3.connect("../database/pantrypalette.db")

# Create a table named 'recipes' and populate it with the DataFrame data
df.to_sql("recipes", conn, if_exists="replace", index=False)

conn.close()
print("SQLite database 'pantrypalette.db' has been created and populated.")
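
Once the table exists, recipes can be read back with ordinary SQL. The sketch below uses an in-memory database with two made-up rows so that it runs without the real `pantrypalette.db`. The `find_by_title` helper and the sample rows are hypothetical, and only the `recipes` table name and its three columns come from the cell above.

```python
import sqlite3

# In-memory stand-in for ../database/pantrypalette.db
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE recipes (title TEXT, ingredients TEXT, instructions TEXT)"
)
demo_conn.executemany(
    "INSERT INTO recipes VALUES (?, ?, ?)",
    [
        ("Tomato Soup", "['tomato', 'onion', 'garlic']", "['Simmer', 'Blend']"),
        ("Pancakes", "['flour', 'egg', 'milk']", "['Mix', 'Fry']"),
    ],
)

def find_by_title(conn, keyword):
    """Return (title, ingredients) rows whose title contains the keyword."""
    cursor = conn.execute(
        "SELECT title, ingredients FROM recipes WHERE title LIKE ?",
        (f"%{keyword}%",),  # parameterized to avoid SQL injection
    )
    return cursor.fetchall()

matches = find_by_title(demo_conn, "soup")
print(matches)
demo_conn.close()
```

The list-valued columns are stored as their string representations, so a consumer would still need something like `ast.literal_eval` to turn them back into Python lists.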

## UI Development

- Streamlit

## Model Monitoring

- PowerBI

<center><h4>Enjoy Your Perfect Recipe! 🍽️</h4></center>
<center>Designed By: Madhurya Shankar & Sandhya Kilari</center>