# Vegan Recipe Retrieval with LangChain + ChromaDB

## Purpose

This notebook explores building a retrieval-augmented generation (RAG) pipeline with LangChain, using LangSmith for tracing and ChromaDB as the vector database. The vegan recipe dataset comes from Kaggle.

## Table Of Contents

1. Import Packages required and initial preparation
2. Helper Function preparation
3. Data Preparation
4. Create Vector DB
5. Prepare Simple RAG
6. Sample Query

#### 1. Import Packages required and initial preparation

In [1]:
# Use pandas to import the Vegan Recipe Kaggle Dataset
import pandas as pd
# Use to substitute string via regex
import re
# LangChain ChromaDB vector store initialization
from langchain.vectorstores import Chroma
# LangChain HuggingFaceEmbeddings for generating vector embeddings
from langchain.embeddings import HuggingFaceEmbeddings
# LangChain Document schema for storing individual recipe entries
from langchain.schema import Document
# Load environment variables from a .env file (e.g. for API keys or config)
from dotenv import load_dotenv
# Enables tracing of function executions for monitoring and debugging
from langsmith import traceable
# Imports OpenAI API client for interacting with OpenAI models
from openai import OpenAI
# Provides type hinting for lists to improve code clarity and checks
from typing import List
# Allows nested use of asyncio event loops, useful in interactive environments
import nest_asyncio


In [2]:
# Load Environment Variables
load_dotenv()
# bring in Vegan Recipe Kaggle Dataset
df = pd.read_csv('Data/vegan_recipes.csv')
# quick look into the data
df.head()

Unnamed: 0.1,Unnamed: 0,href,title,ingredients,preparation
0,0,https://veganuary.com/recipes/rainbow-rice/,Rainbow Rice,\nIngredients\n\nCarrot ribbons (just use a pe...,\nMethod\n\nCook the rice as instructed on the...
1,1,https://veganuary.com/recipes/mfc-nachos/,Nachos,\nIngredients\n\n400g Meatless Farm Co mince (...,\nPreparation\n\nPreheat the oven to 350ºF\nHe...
2,2,https://veganuary.com/recipes/hazelnut-truffles/,Hazelnut Truffles,\nIngredients\n\n100g hazelnuts\n2 tablespoons...,\nMethod\n\nPreheat the oven to 200c\nPut the ...
3,3,https://veganuary.com/recipes/simple-roasted-r...,Simple Roasted Radish by ChicP,\nIngredients\n\n1 170g tub beetroot and horse...,\nPreparation\nPre heat the oven to 160°C\nCut...
4,4,https://veganuary.com/recipes/baked-apple-char...,Baked Apple Charlotte,\nIngredients\n\n2 tbsp rapeseed oil\n75g pitt...,\nPreparation\n\nServes 9\nYou will need an 8i...


#### 2. Helper Function preparation

- For Ingredients:
    - Replace the leading 'Ingredients' header ('\nIngredients\n\n') with 'Ingredients:'
    - Replace every remaining newline ('\n') with a period, so a paragraph break ('\n\n') becomes two periods
- For Preparation:
    - Remove the leading 'Preparation' or 'Method' header ('\nPreparation\n\n' or '\nMethod\n\n')
    - Replace every remaining newline ('\n') with a period, so a paragraph break ('\n\n') becomes two periods

In [3]:
# Helper Functions

def cleanNewLineAndParagraph(stringToBeCleaned):
    """
    Replaces all newline characters (\n) in the given string with periods.
    Useful for turning multiline text into single-line sentences for cleaner formatting.
    """
    cleanNewLine = re.sub(r"\n", ".", stringToBeCleaned)
    return cleanNewLine

def cleanIngredients(IngredientString):
    """
    Cleans the ingredients section by:
    - Replacing the specific section header '\nIngredients\n\n' with 'Ingredients:'
    - Replacing newlines with periods for consistency.
    """
    cleanedString = re.sub(r'\nIngredients\n\n', 'Ingredients:', IngredientString)
    cleanedStringSecond = cleanNewLineAndParagraph(cleanedString)
    return cleanedStringSecond

def cleanPreparation(PreparationString):
    """
    Cleans the preparation section by:
    - Removing specific section headers like '\nMethod\n\n' and '\nPreparation\n\n'
    - Replacing newlines with periods for smoother readability.
    """
    cleanedString = re.sub(r'\nMethod\n\n', '', PreparationString)
    cleanedStringSecond = re.sub(r'\nPreparation\n\n', '', cleanedString)
    cleanedStringThird = cleanNewLineAndParagraph(cleanedStringSecond)
    return cleanedStringThird


In [4]:
# Sample of how the values for ingredients and preparation are cleaned.

raw_ingredients = "\nIngredients\n\n2 cups flour\n1 tsp sugar"
raw_preparation = "\nPreparation\n\nServes 9\nYou will need an 8i.."

print(cleanIngredients(raw_ingredients))
print(cleanPreparation(raw_preparation))

Ingredients:2 cups flour.1 tsp sugar
Serves 9.You will need an 8i..
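
The helpers above match the headers only in their exact `'\nIngredients\n\n'` form, and the retrieval output later in this notebook contains runs such as `..Ingredients:..Scale.1x2x3x ....`, which suggests some rows carry extra blank lines. A minimal sketch of a more tolerant cleaner follows. The name `clean_section`, the `headers` parameter, and the choice to join lines with `". "` are assumptions for illustration and are not part of the original notebook. Unlike `cleanIngredients`, it drops the header instead of rewriting it to `Ingredients:`.

```python
def clean_section(text, headers=("Ingredients", "Method", "Preparation")):
    """Hypothetical helper: drop a leading section header and join non-empty lines."""
    # Split on newlines and discard lines that are empty or whitespace only
    lines = [line.strip() for line in text.split("\n") if line.strip()]
    # Drop the first line when it is exactly one of the known section headers
    if lines and lines[0] in headers:
        lines = lines[1:]
    # Join the remaining lines with a period and a space
    return ". ".join(lines)

print(clean_section("\nIngredients\n\n2 cups flour\n1 tsp sugar"))
print(clean_section("\n\nMethod\n\n\nCook the rice\n\nServe warm"))
```

Because empty lines are discarded before joining, consecutive newlines no longer produce runs of periods.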


#### 3. Data Preparation

- Build the two text strings per recipe (one for ingredients, one for preparation) that will be embedded and stored

In [5]:
# Work on a copy so the original dataframe is left unchanged
dfCleaned = df.copy()
dfCleaned['ingredientsV2'] = dfCleaned['ingredients'].apply(cleanIngredients)
dfCleaned['preparationV2'] = dfCleaned['preparation'].apply(cleanPreparation)
dfCleaned['ingredientTokenStrings'] = 'These are the Ingredients for ' + dfCleaned['title'] + ': ' + dfCleaned['ingredientsV2']
dfCleaned['preparationTokenStrings'] = 'These are the steps for ' + dfCleaned['title'] + ': ' + dfCleaned['preparationV2']

In [6]:
# Look into the first 5 rows of the new dataframe containing data for vectorDB Storage
dfCleaned[['title','href','ingredientTokenStrings','preparationTokenStrings']].head()

Unnamed: 0,title,href,ingredientTokenStrings,preparationTokenStrings
0,Rainbow Rice,https://veganuary.com/recipes/rainbow-rice/,These are the Ingredients for Rainbow Rice: In...,These are the steps for Rainbow Rice: Cook the...
1,Nachos,https://veganuary.com/recipes/mfc-nachos/,These are the Ingredients for Nachos: Ingredie...,These are the steps for Nachos: Preheat the ov...
2,Hazelnut Truffles,https://veganuary.com/recipes/hazelnut-truffles/,These are the Ingredients for Hazelnut Truffle...,These are the steps for Hazelnut Truffles: Pre...
3,Simple Roasted Radish by ChicP,https://veganuary.com/recipes/simple-roasted-r...,These are the Ingredients for Simple Roasted R...,These are the steps for Simple Roasted Radish ...
4,Baked Apple Charlotte,https://veganuary.com/recipes/baked-apple-char...,These are the Ingredients for Baked Apple Char...,These are the steps for Baked Apple Charlotte:...


#### 4. Create Vector DB

- The embedding model all-MiniLM-L6-v2 was chosen because it is lightweight, free, readily available, and a good starting point
- Each Document has the form:

    ```python
    Document(
        metadata={
            'type': '',
            'title': '',
            'link': ''
        },
        page_content=''
    )
    ```


In [7]:
# Embedding model 
embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# Folder where ChromaDB will store vectors
persist_directory = "veganRecipeChromaDB"

# Prepare the list of documents to feed into the veganRecipeChromaDB vector store
documents = []

for idx, row in dfCleaned.iterrows():
    title = row['title']
    link = row['href']

    # Add Documents for Ingredients
    documents.append(Document(
        page_content=row['ingredientTokenStrings'],
        metadata={"type": "ingredients", "title": title, "link": link}
    ))

    # Add Documents for Preparation
    documents.append(Document(
        page_content=row['preparationTokenStrings'],
        metadata={"type": "preparation", "title": title, "link": link}
    ))




In [8]:
# Create the vector store and embed the documents
vectorstore = Chroma.from_documents(
    documents,
    embedding=embedding_model,
    persist_directory=persist_directory
)
# Save the vector store to the persist_directory folder
vectorstore.persist()



- Test retrieval by fetching the 5 documents most relevant to a query. Relevance is measured with the L2 (Euclidean) distance between embeddings.
- Let's ask a query about tomatoes and look at the top 5 matching documents
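
To make the distance-based ranking concrete, here is a minimal pure-Python sketch of top-k retrieval by L2 distance. The function names `l2_distance` and `top_k` and the toy 2-dimensional vectors are assumptions for illustration. In the notebook the vector store performs this search internally over the real embeddings. The ordering is the same whether or not the square root is applied.

```python
import math

def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_k(query_vec, doc_vecs, k=5):
    """Return (name, distance) pairs for the k vectors closest to query_vec."""
    scored = [(name, l2_distance(query_vec, vec)) for name, vec in doc_vecs.items()]
    # Smaller distance means more similar, so sort ascending
    return sorted(scored, key=lambda pair: pair[1])[:k]

# Toy 2-dimensional "embeddings", made up for this example
toy_vectors = {
    "tomato salad": [1.0, 0.0],
    "tomato tart": [0.9, 0.1],
    "hazelnut truffles": [0.0, 1.0],
}
nearest = top_k([1.0, 0.0], toy_vectors, k=2)
print(nearest)
```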

In [9]:
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

query = "What are some recipes where tomatoes are pertinent?"
docs = retriever.get_relevant_documents(query)

for doc in docs:
    print("-----")
    print("Title:", doc.metadata.get("title"))
    print("Type:", doc.metadata.get("type"))
    print("Link:", doc.metadata.get("link"))
    print("Content:\n", doc.page_content)




-----
Title: HEIRLOOM TOMATO & ENDIVE SALAD + OLIVE BAGNA CAUDA
Type: ingredients
Link: https://simple-veganista.com/heirloom-tomato-and-endive-salad-bagna/
Content:
 These are the Ingredients for HEIRLOOM TOMATO & ENDIVE SALAD + OLIVE BAGNA CAUDA: ..Ingredients:..Scale.1x2x3x ....Salad..2 medium heirloom tomatoes (any color), sliced into wedges.1/2 cup grape heirloom tomatoes, sliced in halve.1 cup cooked chickpeas.2 –3 endive, sliced (arugula would be great too).lemon wedges, to serve.chopped parsley, to serve (optional)..Olive Bagna Cauda..1/4 cup good olive oil, more as needed.1/3 – 1/2 cup black olives and/or capers (I used a mix), pitted and minced, or diced.3 large cloves garlic, minced.pinch of red pepper flakes, optional.juice of 1/2 lemon.salt and freshly cracked pepper to taste...
-----
Title: Moroccan-Style Lentil, Chickpea and Kale Soup
Type: ingredients
Link: https://veganuary.com/recipes/moroccan-lentil-chickpea-and-kale-soup/
Content:
 These are the Ingredients for Moro
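
Because each recipe is stored as two documents (one for ingredients, one for preparation), a top-5 result can contain the same recipe twice. A minimal sketch of collapsing results to one entry per recipe follows. `RetrievedDoc` is a hypothetical stand-in that mirrors the `page_content` and `metadata` attributes of a LangChain `Document`, and `unique_recipes` is an illustrative name, not part of the original notebook.

```python
from dataclasses import dataclass, field

@dataclass
class RetrievedDoc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def unique_recipes(docs):
    """Keep the first (best-ranked) document per recipe title, preserving order."""
    seen = set()
    unique = []
    for doc in docs:
        title = doc.metadata.get("title")
        if title not in seen:
            seen.add(title)
            unique.append(doc)
    return unique

sample = [
    RetrievedDoc("ingredients text", {"type": "ingredients", "title": "Nachos"}),
    RetrievedDoc("steps text", {"type": "preparation", "title": "Nachos"}),
    RetrievedDoc("ingredients text", {"type": "ingredients", "title": "Rainbow Rice"}),
]
titles = [doc.metadata["title"] for doc in unique_recipes(sample)]
print(titles)
```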

#### 5. Prepare Simple RAG

The code was partially adapted from a LangChain Academy tutorial. The `@traceable` decorators below log each step as a trace in LangSmith.

In [10]:
MODEL_PROVIDER = "openai"
MODEL_NAME = "gpt-4o-mini"
APP_VERSION = 1.0
RAG_SYSTEM_PROMPT = """You are an assistant for answering questions about vegan recipes.
Use the following pieces of retrieved context to answer the latest question in the conversation. 
If you don't know the answer, just say that you don't know. 
Use three sentences maximum and keep the answer concise.
"""

openai_client = OpenAI()
nest_asyncio.apply()


"""
retrieve_documents
- Returns documents fetched from a vectorstore based on the user's question
"""
@traceable(run_type="chain")
def retrieve_documents(question: str):
    return retriever.get_relevant_documents(question)

"""
generate_response
- Calls `call_openai` to generate a model response after formatting inputs
"""
@traceable(run_type="chain")
def generate_response(question: str, documents):
    formatted_docs = "\n\n".join(doc.page_content for doc in documents)
    messages = [
        {
            "role": "system",
            "content": RAG_SYSTEM_PROMPT
        },
        {
            "role": "user",
            "content": f"Context: {formatted_docs} \n\n Question: {question}"
        }
    ]
    return call_openai(messages)

"""
call_openai
- Returns the chat completion output from OpenAI
"""
@traceable(run_type="llm")
def call_openai(
    messages: List[dict], model: str = MODEL_NAME, temperature: float = 0.0
):
    # Returns the full chat completion object; the caller extracts the message text
    return openai_client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=temperature,
    )

"""
langsmith_rag
- Calls `retrieve_documents` to fetch documents
- Calls `generate_response` to generate a response based on the fetched documents
- Returns the model response
"""
@traceable(run_type="chain")
def langsmith_rag(question: str):
    documents = retrieve_documents(question)
    response = generate_response(question, documents)
    return response.choices[0].message.content
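
`generate_response` passes only `page_content` to the model, so the recipe links stored in the metadata never reach the prompt. A minimal sketch of a context formatter that keeps the title and link with each passage is shown below. `ContextDoc` is a hypothetical stand-in for a LangChain `Document`, and `format_docs_with_sources` is an illustrative name, not part of the original notebook.

```python
from dataclasses import dataclass, field

@dataclass
class ContextDoc:
    """Hypothetical stand-in exposing the same attributes as a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def format_docs_with_sources(documents):
    """Number each passage and prefix it with its title and link."""
    blocks = []
    for i, doc in enumerate(documents, start=1):
        title = doc.metadata.get("title", "unknown title")
        link = doc.metadata.get("link", "no link")
        blocks.append(f"[{i}] {title} ({link})\n{doc.page_content}")
    # Blank line between passages, matching the join used in generate_response
    return "\n\n".join(blocks)

example = [
    ContextDoc(
        "These are the steps for Nachos: Preheat the oven",
        {"type": "preparation", "title": "Nachos",
         "link": "https://veganuary.com/recipes/mfc-nachos/"},
    )
]
context = format_docs_with_sources(example)
print(context)
```

Swapping this in for the `"\n\n".join(...)` line in `generate_response` would let the model name the recipe and its link in an answer, at the cost of a slightly longer prompt.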


#### 6. Sample Query

Time to test the RAG pipeline with our tomato query!

In [11]:
question = "What are some recipes where tomatoes are pertinent?"
ai_answer = langsmith_rag(question, langsmith_extra={"metadata": {"website": "www.google.com"}})
print(ai_answer)

Some recipes where tomatoes are pertinent include the Heirloom Tomato & Endive Salad, which features heirloom tomatoes and grape heirloom tomatoes, and the Mediterranean Tomato Tart, which uses large tomatoes as a key ingredient. Additionally, the Moroccan-Style Lentil, Chickpea and Kale Soup includes cherry tomatoes and sundried tomatoes. These dishes highlight the versatility of tomatoes in vegan cooking.
