# Building a Personalized Recipe Finder with Retrieval-Augmented Generation (RAG) and a Dense Passage Retrieval (DPR) Pipeline






## Description
This project implements a Retrieval-Augmented Generation (RAG) pipeline with Dense Passage Retrieval (DPR) for personalized recipe search and refinement. A DPR context encoder embeds each recipe in the dataset, and the embeddings are stored in a FAISS index for efficient retrieval. When a user submits a query, a DPR question encoder embeds it and FAISS returns the recipes with the highest inner-product similarity. The top-ranked recipes then serve as context for a generative model (facebook/bart-large), which refines or adapts the recipe to the user's preferences.
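
The retrieval step described above can be illustrated without any model: documents and the query are vectors, and the most relevant documents are those with the largest inner product against the query. The sketch below is a toy stand-in for the FAISS search, not the notebook's implementation. The function names `dot` and `top_k` and the 3-dimensional vectors are made up for illustration; real DPR base embeddings have 768 dimensions.

```python
# Toy stand-in for an inner-product search: rank documents by dot product.

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def top_k(query_vec, doc_vecs, k=3):
    """Return (document index, score) pairs for the k highest-scoring documents."""
    scored = [(dot(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(i, score) for score, i in scored[:k]]

# Hypothetical embeddings, chosen only to make the ranking easy to follow
doc_vecs = [
    [0.9, 0.1, 0.0],  # "shrimp pasta"
    [0.1, 0.8, 0.1],  # "chocolate cake"
    [0.7, 0.2, 0.1],  # "garlic noodles"
]
query_vec = [1.0, 0.0, 0.0]

results = top_k(query_vec, doc_vecs, k=2)
print(results)  # documents 0 and 2 rank first
```

FAISS performs the same ranking over the full embedding matrix, which is why it stays practical as the number of recipes grows.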


## References

The recipe dataset used in this project is available at this [link](https://raw.githubusercontent.com/tabatkins/recipe-db/master/db-recipes.json).



## Setup

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
cd "YOUR-PATH-HERE"

In [None]:
%%capture
# Install necessary libraries (%%capture must be the first line of the cell)
!pip install torch transformers faiss-cpu numpy pandas requests


In [None]:
import pandas as pd
import numpy as np
import torch
import faiss
import requests
import json
from transformers import (
    DPRContextEncoder, DPRContextEncoderTokenizer,
    DPRQuestionEncoder, DPRQuestionEncoderTokenizer,
    pipeline
)
import re
import unicodedata


In [None]:
# Load recipe dataset from a URL or a local JSON file
def load_recipes(source):
    if source.startswith("http"):
        response = requests.get(source, timeout=30)
        response.raise_for_status()  # Fail loudly instead of silently returning an empty dataset
        return response.json()
    with open(source, "r", encoding="utf-8") as f:
        return json.load(f)

In [None]:
# Define the dataset source
recipe_source = "https://raw.githubusercontent.com/tabatkins/recipe-db/master/db-recipes.json"
recipes = load_recipes(recipe_source)


In [None]:
print(len(recipes))

540


In [None]:
# Convert recipes dictionary into a DataFrame
recipe_texts = [
    f"{r['name']} | Ingredients: {', '.join(r.get('ingredients', []))} | Instructions: {r.get('instructions', '')}"
    for r in recipes.values()
]
recipe_df = pd.DataFrame(recipe_texts, columns=["text"])


In [None]:
# Load DPR context encoder & tokenizer
ctx_encoder = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-multiset-base")
ctx_tokenizer = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-multiset-base")


Some weights of the model checkpoint at facebook/dpr-ctx_encoder-multiset-base were not used when initializing DPRContextEncoder: ['ctx_encoder.bert_model.pooler.dense.bias', 'ctx_encoder.bert_model.pooler.dense.weight']
- This IS expected if you are initializing DPRContextEncoder from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DPRContextEncoder from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'DPRQuestionEncoderTokenizer'. 
The class this function is called from is 'DPRContextEncoderTokenizer'.


In [None]:
# Generate embeddings for recipes
def generate_embeddings(documents):
    inputs = ctx_tokenizer(documents, return_tensors="pt", padding=True, truncation=True, max_length=512)  # Truncate to the encoder's 512-token limit
    with torch.no_grad():
        embeddings = ctx_encoder(**inputs).pooler_output
    return embeddings

# Generate embeddings
recipe_embeddings = generate_embeddings(recipe_df["text"].tolist())
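
Encoding every recipe in a single call pads all texts to the longest one in one large tensor, which can use a lot of memory. A common alternative is to encode in fixed-size batches and concatenate the results. The sketch below shows only the batching logic. `batched`, `embed_in_batches`, and `fake_encode` are hypothetical names, and `fake_encode` is a stand-in so the example runs without loading a model.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_in_batches(documents, encode_fn, batch_size=32):
    """Apply `encode_fn` (assumed: list of texts -> list of vectors) batch by batch."""
    vectors = []
    for batch in batched(documents, batch_size):
        vectors.extend(encode_fn(batch))
    return vectors

# Stand-in encoder: one 1-dimensional "embedding" per text
def fake_encode(texts):
    return [[float(len(t))] for t in texts]

docs = [f"recipe {i}" for i in range(70)]
vectors = embed_in_batches(docs, fake_encode, batch_size=32)
print(len(vectors))  # 70
```

In this notebook, `encode_fn` could wrap the tokenizer and encoder call from `generate_embeddings`, with the per-batch tensors joined by `torch.cat`. That wiring is an assumption and is not shown in the original cells.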


In [None]:
# Convert embeddings to numpy and create FAISS index
recipe_embeddings_np = recipe_embeddings.numpy()
dimension = recipe_embeddings_np.shape[1]
index = faiss.IndexFlatIP(dimension)
index.add(recipe_embeddings_np)
print(f"Number of recipes indexed: {index.ntotal}")

Number of recipes indexed: 540
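
`IndexFlatIP` ranks by raw inner product, so longer vectors score higher even when they point in the same direction. If cosine similarity is preferred, a common approach is to L2-normalize the vectors before indexing and before searching. The pure-Python sketch below only demonstrates the effect. The helper names are hypothetical, and this is a variant rather than what the notebook does: the cells above index unnormalized DPR embeddings.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

a = [3.0, 4.0]
b = [6.0, 8.0]  # same direction as a, twice the length

raw_score = dot(a, b)
cosine_score = dot(l2_normalize(a), l2_normalize(b))

print(raw_score)     # 50.0, grows with vector length
print(cosine_score)  # close to 1.0, depends only on direction
```

With FAISS, the equivalent step would be calling `faiss.normalize_L2` on the float32 arrays before `index.add` and `index.search`. Whether this improves results for DPR embeddings is not established here, since DPR is trained with dot-product similarity.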


In [None]:
# Load DPR question encoder & tokenizer
query_encoder = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-multiset-base")
query_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-multiset-base")

Some weights of the model checkpoint at facebook/dpr-question_encoder-multiset-base were not used when initializing DPRQuestionEncoder: ['question_encoder.bert_model.pooler.dense.bias', 'question_encoder.bert_model.pooler.dense.weight']
- This IS expected if you are initializing DPRQuestionEncoder from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DPRQuestionEncoder from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



In [None]:
# Function to retrieve the most relevant recipes
def retrieve_recipes(query, top_k=3):
    inputs = query_tokenizer(query, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        query_embedding = query_encoder(**inputs).pooler_output.numpy()

    # IndexFlatIP returns inner-product scores (higher means more similar), not distances
    scores, indices = index.search(query_embedding, top_k)
    results = [(recipe_df["text"].iloc[i], scores[0][j]) for j, i in enumerate(indices[0])]
    return results

In [None]:
# Load generative model
generator = pipeline("text2text-generation", model="facebook/bart-large")

Device set to use cuda:0


In [None]:
def clean_text(text):
    """Normalize and clean text to remove unusual characters and restore missing spaces."""
    text = unicodedata.normalize("NFKC", text)  # Normalize Unicode text
    text = text.encode("ascii", "ignore").decode("ascii")  # Remove non-ASCII characters
    # Split words fused at a lowercase-to-uppercase boundary; uppercase runs such as "BBQ" stay intact
    text = re.sub(r"([a-z])([A-Z])", r"\1 \2", text)
    text = re.sub(r"(?<=[a-zA-Z])(?=\d)", " ", text)  # Insert space between a letter and a following digit
    text = re.sub(r"(?<=\d)(?=[a-zA-Z])", " ", text)  # Insert space between a digit and a following letter
    text = re.sub(r"([.,])(?=[a-zA-Z])", r"\1 ", text)  # Ensure space after punctuation
    text = re.sub(r" {2,}", " ", text)  # Collapse runs of spaces into one
    return text.strip()

In [None]:
def format_generated_recipe(generated_text):
    """Formats the generated recipe for better readability."""

    # Define a consistent separator length
    SEPARATOR = "=" * 60

    # Normalize and clean text
    generated_text = clean_text(generated_text)

    # Split recipe into sections
    parts = re.split(r'Here are some similar recipes:|Ingredients:', generated_text, maxsplit=1)

    # Extract and clean recipe name
    recipe_name = clean_text(parts[0].replace("Modify this recipe for:", "").strip())

    # Extract and clean ingredients & instructions
    ingredients, instructions = "", ""
    if len(parts) > 1:
        content = parts[1].strip()
        if "Instructions:" in content:
            ingredients, instructions = content.split("Instructions:", maxsplit=1)

    # Ensure ingredients and instructions are formatted correctly
    formatted_ingredients = "\n".join([
        f"- {clean_text(line.strip())}" for line in ingredients.split(",")
        if line.strip() and "=" not in line  # Prevent inclusion of "===" in ingredients
    ])

    formatted_instructions = "\n".join([
        f"{i+1}. {clean_text(line.strip())}" for i, line in enumerate(instructions.split("."))
        if line.strip()
    ])

    # Format final output with a fixed-length separator
    formatted_output = f"""
{SEPARATOR}
🍽️ **Recipe Name:** {recipe_name}
{SEPARATOR}

📝 **Ingredients:**
{formatted_ingredients}

👨‍🍳 **Instructions:**
{formatted_instructions}

{SEPARATOR}
"""
    return formatted_output

In [None]:
# Generate a refined recipe
def generate_recipe(query, retrieved_recipes):
    context = "\n\n".join([f"{i+1}. {doc[0]}\n(Score: {doc[1]:.2f})" for i, doc in enumerate(retrieved_recipes)])
    prompt = f"Modify this recipe for: {query}\n\nHere are some similar recipes:\n{context}"

    # truncation=True keeps long prompts within the model's input limit
    answer = generator(prompt, max_length=300, num_beams=5, early_stopping=True, truncation=True)[0]["generated_text"]
    formatted_answer = format_generated_recipe(answer)
    display_result(query, retrieved_recipes, formatted_answer)
    return formatted_answer
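
To see what the generator actually receives, it can help to build the prompt on its own and print it. The sketch below reproduces the prompt template from `generate_recipe` in a standalone helper. `build_prompt` is a hypothetical name, and the sample recipes and scores are invented for illustration.

```python
def build_prompt(query, retrieved):
    """Assemble the generation prompt from a query and (text, score) pairs."""
    context = "\n\n".join(
        f"{i + 1}. {text}\n(Score: {score:.2f})"
        for i, (text, score) in enumerate(retrieved)
    )
    return f"Modify this recipe for: {query}\n\nHere are some similar recipes:\n{context}"

# Made-up retrieval results in the same (text, score) shape as retrieve_recipes returns
sample = [
    ("Shrimp Pasta | Ingredients: shrimp, pasta | Instructions: Boil pasta.", 81.234),
    ("Garlic Noodles | Ingredients: noodles, garlic | Instructions: Fry garlic.", 79.5),
]

prompt = build_prompt("low-calorie pasta with shrimps", sample)
print(prompt)
```

Printing the prompt this way makes it easier to check its length and the markers (`Here are some similar recipes:`, `Ingredients:`, `Instructions:`) that `format_generated_recipe` later splits on.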


In [None]:
# Function to format and display results
def display_result(query, retrieved_recipes, generated_recipe):
    print("\n" + "=" * 60)
    print(f"USER QUERY: {query}")
    print("=" * 60)
    print("RETRIEVED RECIPES:")
    for i, (recipe, score) in enumerate(retrieved_recipes, start=1):
        print(f"\n{i}. {recipe}\n   🔹 **Relevance Score:** {score:.2f}")
    print("=" * 60)
    print("GENERATED RECIPE:")
    print(generated_recipe)
    print("=" * 60)

In [None]:
user_query = "low-calorie pasta with shrimps"
retrieved_recipes = retrieve_recipes(user_query)
final_recipe = generate_recipe(user_query, retrieved_recipes)



USER QUERY: low-calorie pasta with shrimps
RETRIEVED RECIPES:

1. Shrimp Scampi with Pasta | Ingredients: 1 pound large raw shrimp, peeled and deveined, 1/4 cup unsalted butter, 1/4 cup olive oil, 3 clove garlic, minced, Pinch of red pepper flakes, 1/4 cup dry white wine, 1 teaspoon salt, 1/2 teaspoon pepper, 3 tablespoon lemon juice, 1 tablespoon lemon zest, 1/4 cup fresh parsley, chopped, 8 ounce dried angel hair pasta (or linguine) | Instructions: If serving scampi over pasta (optional), boil pasta according to package instructions.

In a heavy frying pan, melt the butter and olive oil over medium heat. Lower the heat and add garlic and red pepper flakes. Cook over low heat, stirring occasionally, for 5 minutes
Raise the heat to high and when oil is hot, add the shrimp. Cook, stirring frequently, until the shrimp are pink and opaque, about 3 minutes.

Add the wine and stir, scraping up any browned bits from the bottom of the pan. Cook for 1 minute to let the alcohol