# Indexing Recipes Using LazyGraphRag
This notebook demonstrates how to index cooking recipes and related Q&A data using the LazyGraphRag library.

Learn more about LazyGraphRag here: [GraphRAG](https://datastax.github.io/graph-rag/examples/lazy-graph-rag/?h=lazy)

## Datasets
The datasets used in this notebook are:
- **CookingRecipes Dataset**:
    - source: https://huggingface.co/datasets/CodeKapital/CookingRecipes
    - description: A dataset of cooking recipes with ingredients, directions, and other relevant information.
- **Q&A For Recipes Dataset**:
    - source: https://huggingface.co/datasets/Hieu-Pham/cooking_squad
    - description: A dataset of cooking-related questions and answers to help users troubleshoot issues with recipe directions. The context for each question is a recipe from the CookingRecipes dataset.
- **General preference Q&A Dataset**:
    - source: https://huggingface.co/datasets/andrewsiah/se_cooking_preference_sft
    - description: A dataset of questions and answers to help better inform users about cooking techniques and ingredients.

## Instantiation

In [1]:
from langchain_openai import OpenAIEmbeddings
from dotenv import load_dotenv

load_dotenv(dotenv_path='../.env')

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

## Prepare Recipe Data

In [2]:
import pandas as pd 

recipes_data = pd.read_csv('hf://datasets/CodeKapital/CookingRecipes/Data.csv', nrows=3000)

# Rename the 'Unnamed: 0' column to 'i64'
recipes_data.rename(columns={'Unnamed: 0':'i64'}, inplace=True)

In [3]:
import json
import re
regex = re.compile('[^a-zA-Z]')

recipes_df = recipes_data.copy()

# Drop duplicates on the 'title' column
recipes_df.drop_duplicates(subset='title', inplace=True)
# rename the column "NER" to "ner"
recipes_df.rename(columns={'NER':'ner'}, inplace=True)

# Create a 'source_id' column of the form "<i64>_<title>", where the title is lowercased
# and stripped of every non-letter character (spaces, digits, and punctuation).
recipes_df['source_id'] = recipes_df.apply(lambda x: f"{x['i64']}_{regex.sub('', x['title']).lower()}", axis=1)
# Parse the JSON-encoded 'directions', 'ingredients', and 'ner' columns with json.loads
recipes_df['directions'] = recipes_df['directions'].apply(json.loads)
recipes_df['ingredients'] = recipes_df['ingredients'].apply(json.loads)
recipes_df['ner'] = recipes_df['ner'].apply(json.loads)
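
The `source_id` construction can be checked in isolation. The sketch below reimplements it as a standalone helper; `make_source_id` is a hypothetical name introduced for illustration, and the sample titles are taken from the retrieval output shown later in this notebook.

```python
import re

# Same pattern as the notebook: matches every character that is not an ASCII letter.
NON_LETTERS = re.compile('[^a-zA-Z]')


def make_source_id(i64, title):
    """Build an id of the form '<i64>_<letters of the title, lowercased>'."""
    return f"{i64}_{NON_LETTERS.sub('', title).lower()}"


print(make_source_id(11, "Buckeye Candy"))       # 11_buckeyecandy
print(make_source_id(409, "Buckeyes(Cookies)"))  # 409_buckeyescookies
```

Because the pattern drops spaces, digits, and punctuation, different titles can collapse to the same letters. Assuming `i64` is unique per row, the numeric prefix is what keeps the ids distinct.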

In [4]:
# NOTE: The AI will receive recipe context formatted as Markdown.

# Create a function that takes a row from the recipe dataset and returns a string in Markdown format
recipe_format_md = lambda r: """# {title}

## Ingredients
- {ingredients}

## Directions
- {directions}
""".format(title=r['title'], ingredients='\n- '.join(r['ingredients']), directions='\n- '.join(r['directions']))

recipes_df['md'] = recipes_df.apply(recipe_format_md, axis=1)


#########
# uncomment the following line to test the function on a single row
# print(recipes_df['md'][0])
#########
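
For readers who prefer a named function over a lambda, the same Markdown layout can be sketched as follows. `recipe_to_markdown` and the sample recipe are hypothetical and only illustrate the format; unlike the lambda above, this version emits no stray bullet when a list is empty.

```python
def recipe_to_markdown(title, ingredients, directions):
    """Render a recipe as Markdown with one bullet per ingredient and per step."""
    lines = [f"# {title}", "", "## Ingredients"]
    lines += [f"- {item}" for item in ingredients]
    lines += ["", "## Directions"]
    lines += [f"- {step}" for step in directions]
    return "\n".join(lines) + "\n"


# Made-up sample data, not a row from the dataset.
sample = recipe_to_markdown(
    "Toast",
    ["2 slices bread", "1 tsp. butter"],
    ["Toast the bread.", "Spread with butter."],
)
print(sample)
```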


### Recipe Documents
Before loading the recipe data into the vector store, we need to convert each recipe into a document.
The `page_content` will be the Markdown representation of the recipe.
LazyGraphRag will generate the graph edges using metadata from the recipe documents:
- `keywords`: The CookingRecipes dataset includes a `ner` field that contains entities extracted from the recipe. These entities are the ingredients found in the recipe.
- `source_id`: The unique identifier for the recipe.
- `type`: All recipes will have the type `recipe`. This will help distinguish the recipe nodes from other nodes in the graph, such as the question-answer nodes.
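
To make the metadata-driven edges concrete, the sketch below links toy documents whenever their `keywords` lists share a value. It is a plain-Python illustration of the idea only, not how the library computes edges; `shared_keyword_links` and the sample documents are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def shared_keyword_links(docs):
    """Return {(id_a, id_b): shared keywords} for every pair with overlap."""
    # Group document ids by each keyword they mention.
    by_keyword = defaultdict(set)
    for doc in docs:
        for kw in doc["metadata"]["keywords"]:
            by_keyword[kw].add(doc["id"])

    # Every pair of documents under the same keyword is linked by it.
    links = defaultdict(set)
    for kw, ids in by_keyword.items():
        for a, b in combinations(sorted(ids), 2):
            links[(a, b)].add(kw)
    return dict(links)


# Made-up documents for illustration.
docs = [
    {"id": "1_fudge", "metadata": {"keywords": ["butter", "chocolate chips"]}},
    {"id": "2_toffee", "metadata": {"keywords": ["butter", "sugar"]}},
    {"id": "3_salsa", "metadata": {"keywords": ["tomatoes", "onion"]}},
]
links = shared_keyword_links(docs)
print(links)  # {('1_fudge', '2_toffee'): {'butter'}}
```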

In [5]:
from langchain_core.documents import Document

# convert the recipes to langchain documents
recipe_docs = [Document(page_content=r['md'], id=r['source_id'], metadata={'keywords':r['ner'], 'source_id': r['source_id'], 'type':'recipe'}) for r in recipes_df.to_dict(orient='records')]

## Prepare Cooking Q&A with Recipe Context
As discussed earlier, [Hieu-Pham's dataset](https://huggingface.co/datasets/Hieu-Pham/cooking_squad) contains questions and answers related to the CookingRecipes dataset. We will use this dataset to generate the question-answer nodes in the graph. Each Q&A document shares a `source_id` with its recipe, and that shared value is what produces the graph edges between the two.

In [6]:
recipe_qa_df = pd.read_json("hf://datasets/Hieu-Pham/cooking_squad/squad_cooking_transformed.json")

# Flatten the 'answers' column into separate 'answer_start' and 'answer' columns
recipe_qa_df['answer_start'] = recipe_qa_df['answers'].apply(lambda x: x['answer_start'])
recipe_qa_df['answer'] = recipe_qa_df['answers'].apply(lambda x: x['text'])

# drop the initial 'answers' column
recipe_qa_df.drop(columns=['answers'], inplace=True)

# Extract the recipe title, which is the first line of the 'context' column
recipe_qa_df['title'] = recipe_qa_df['context'].apply(lambda x: x.split('\n')[0])
# drop the original 'context' column
recipe_qa_df.drop(columns=['context'], inplace=True)

# join recipe_qa_df with recipes_df on the 'title' column. keep all rows in recipe_qa_df
recipe_qa_df = recipe_qa_df.merge(recipes_df[['title', 'source_id', 'md']], on='title', how='left')

# rename the 'md' column to 'context'
recipe_qa_df.rename(columns={'md':'context'}, inplace=True)

# format the qa pairs in markdown for the AI
qa_format_md = lambda qa: """
The context for this question is a recipe titled *{title}*

Question: {question}
Answer: {answer}
""".format(question=qa['question'], answer=qa['answer'], title=qa['title'])

recipe_qa_df['md'] = recipe_qa_df.apply(qa_format_md, axis=1)

# rename the 'id' column to 'qa_id'
recipe_qa_df.rename(columns={'id':'qa_id'}, inplace=True)
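
The title-based join can be pictured in plain Python. The sketch below is not the pandas code from the cell above; `attach_recipe` and the sample rows are hypothetical. It shows the one property of a left join worth keeping in mind here: a Q&A row whose title has no matching recipe is kept, but without a `source_id` (pandas fills such cells with `NaN`). Since only the first 3000 recipes are loaded, some Q&A rows may fall into that case.

```python
def attach_recipe(qa_rows, recipes_by_title):
    """Left-join Q&A rows to recipes on title; unmatched rows get source_id None."""
    joined = []
    for row in qa_rows:
        title = row["context"].split("\n")[0]  # the first line of the context is the title
        recipe = recipes_by_title.get(title)
        joined.append({
            "question": row["question"],
            "title": title,
            "source_id": recipe["source_id"] if recipe else None,
        })
    return joined


# Made-up rows for illustration.
recipes_by_title = {"Buckeye Candy": {"source_id": "11_buckeyecandy"}}
qa_rows = [
    {"question": "How long do the balls stay in the freezer?",
     "context": "Buckeye Candy\nMix sugar, butter and peanut butter."},
    {"question": "How hot should the oven be?",
     "context": "Some Other Recipe\nPreheat the oven."},
]
joined = attach_recipe(qa_rows, recipes_by_title)
unmatched = [r["title"] for r in joined if r["source_id"] is None]
print(unmatched)  # ['Some Other Recipe']
```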


### Question & Answer Documents
As with the recipes, the Q&A pairs must be converted into documents before they are loaded into the vector store.
The `page_content` will be the Markdown representation of the Q&A.
LazyGraphRag will generate the graph edges using metadata from the documents:
- `source_id`: The unique identifier for the recipe context linked to the document.
- `type`: All Q&A documents will have the type `question-answer`. This will help distinguish the nodes from other nodes in the graph.

In [7]:

# Prepare Question-Answer Document
recipe_qa_docs = [Document(page_content=qa['md'], id=qa['qa_id'], metadata={'source_id': qa['source_id'], 'type':'question-answer'}) for qa in recipe_qa_df.to_dict(orient='records')]

## Populating the Vector store

In [8]:
from langchain_chroma.vectorstores import Chroma
from langchain_graph_retriever.transformers import ShreddingTransformer

#########
# To store only the recipe documents, uncomment the following assignment and comment out the one below it
# vector_store = Chroma.from_documents(
#     documents=list(ShreddingTransformer().transform_documents(recipe_docs)),
#     embedding=embeddings,
#     collection_name="recipes",
#     persist_directory="./data/recipes_chroma_db"
# )
#########
shredder = ShreddingTransformer() 
vector_store = Chroma.from_documents(
    documents=list(shredder.transform_documents(recipe_docs + recipe_qa_docs)),
    embedding=embeddings,
    collection_name="recipe_qa_combined",
    persist_directory="./data/recipe_qa_combined_chroma_db"
)

## Graph Traversal

In [9]:
from graph_retriever.strategies import Eager
from langchain_graph_retriever import GraphRetriever
from langchain_graph_retriever.adapters.chroma import ChromaAdapter

traversal_retriever = GraphRetriever(
    store = ChromaAdapter(vector_store, shredder, {"keywords"}),
    edges = [("keywords", "keywords"), ("source_id", "source_id")],
    strategy = Eager(k=5, start_k=2, max_depth=3),
)
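
The `Eager` parameters can be read roughly as follows: `start_k` seed documents come from similarity search, edges are followed up to `max_depth` hops from a seed, and at most `k` documents are returned. The toy breadth-first walk below illustrates that reading on a hand-written adjacency map. It is a simplified sketch rather than the library's algorithm (the seeds are passed in directly instead of being found by similarity search), and `eager_like_walk` and the graph are hypothetical.

```python
from collections import deque


def eager_like_walk(adjacency, seeds, k, max_depth):
    """Breadth-first walk from the seeds, stopping at k nodes or max_depth hops."""
    selected = []
    seen = set(seeds)
    queue = deque((seed, 0) for seed in seeds)
    while queue and len(selected) < k:
        node, depth = queue.popleft()
        selected.append(node)
        if depth == max_depth:
            continue  # do not expand nodes that sit at the depth limit
        for neighbor in adjacency.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
    return selected


# Made-up graph: each key lists the nodes reachable over one edge.
adjacency = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": [],
    "d": ["e"],
    "e": ["f"],
}
print(eager_like_walk(adjacency, seeds=["a"], k=5, max_depth=3))  # ['a', 'b', 'c', 'd', 'e']
print(eager_like_walk(adjacency, seeds=["a"], k=5, max_depth=1))  # ['a', 'b', 'c']
```

Raising `max_depth` lets the walk reach documents that are only indirectly related to the query, while `k` caps how much context is handed to the model.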

In [18]:
# Test the retrieval on a single question. This should return relevant recipes and their context
results = traversal_retriever.invoke("I'm in Ohio and I just had a small round chocolate that had peanut butter. I can't remeber the name of it. All I remember is that it had an 'eye' in the name. If you find it, get me the recipe")
#########
# To test retrieval of a Q&A about a specific recipe, uncomment the following line
# results = traversal_retriever.invoke("No Bake Cookies: How long should the clusters stand until the firm up?")
#########
for doc in results:
    print(f"{doc.id}:\n{doc.page_content}")
    # print(doc.page_content)

2437_buckeyes:
# Buckeyes

## Ingredients
- 1 stick butter, softened
- 1 lb. powdered sugar
- 2 c. crunchy peanut butter
- 3 c. Rice Krispies
- 12 oz. chocolate chips
- 1/3 stick paraffin wax

## Directions
- Mix butter, sugar, peanut butter and Rice Krispies well.
- Form mixture in balls about the size of a walnut.
- Melt chocolate chips and paraffin in double boiler.
- Dip balls in this mixture and place on wax paper until chocolate is set.

11_buckeyecandy:
# Buckeye Candy

## Ingredients
- 1 box powdered sugar
- 8 oz. soft butter
- 1 (8 oz.) peanut butter
- paraffin
- 12 oz. chocolate chips

## Directions
- Mix sugar, butter and peanut butter.
- Roll into balls and place on cookie sheet.
- Set in freezer for at least 30 minutes. Melt chocolate chips and paraffin in double boiler.
- Using a toothpick, dip balls 3/4 of way into chocolate chip and paraffin mixture to make them look like buckeyes.

409_buckeyescookies:
# Buckeyes(Cookies)  

## Ingredients
- 1 (18 oz.) jar crunchy pean

## Use within a chain

In [37]:
from langchain.chat_models import init_chat_model

llm = init_chat_model("gpt-4o-mini", model_provider="openai")

In [None]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

prompt = ChatPromptTemplate.from_template(
    """Answer the question based only on the context provided.

Context: {context}

Question: {question}"""
)


def format_docs(docs):
    return "\n\n".join(
        f"text: {doc.page_content} metadata: {doc.metadata}" for doc in docs
    )


# chain = (
#     {"sources": traversal_retriever}
#     | {"context": RunnableLambda(lambda x: format_docs(x['sources'])), "question": RunnablePassthrough()}
#     | prompt
#     | llm
#     | StrOutputParser()
# )

chain = (
    {"context": traversal_retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    # | StrOutputParser()
)

In [48]:
# response = chain.invoke("I'm in Ohio and I just had a small round chocolate that had peanut butter. I can't remeber the name of it. All I remember is that it had an 'eye' in the name. If you find it, get me the recipe")
# response = chain.invoke("What are some recipe that use chocolate and creamcheese? Give me the recipes")
response = chain.invoke("I'm looking for some seafood recipes. Can you help me?")
# response = chain.invoke("I'm looking for some chili recipes that use pork tenderloin?")
# response = chain.invoke("What is the id for the recipe Fruit Medley?")
response

AIMessage(content='Sure! Here are some seafood recipes you can try:\n\n1. **Seafood Casserole**\n   - Ingredients: shrimp, scallops, flour, butter, mushrooms, onion, light cream, buttered Ritz crackers, water, and white wine.\n   - Directions: Boil water and wine, cook shrimp and scallops, then combine with other ingredients for a delicious casserole.\n\n2. **Seafood And Pasta Salad**\n   - Ingredients: scallops, cooked shrimp, imitation crab, green onions, olive oil, boiled eggs, mayonnaise, and tri-colored rotini noodles.\n   - Directions: Boil noodles, sauté scallops, and mix with other ingredients for a refreshing salad.\n\n3. **Bouillabaisse**\n   - Ingredients: minced onions, leeks, olive oil, garlic, canned tomatoes, water, parsley, thyme, saffron, assorted fish and shellfish.\n   - Directions: Cook vegetables, add stock and seafood, then bring to boil for a flavorful soup.\n\n4. **Creole Flounder**\n   - Ingredients: flounder or pollack fillets, chopped tomatoes, green pepper, 

In [49]:
response.model_dump()

{'content': 'Sure! Here are some seafood recipes you can try:\n\n1. **Seafood Casserole**\n   - Ingredients: shrimp, scallops, flour, butter, mushrooms, onion, light cream, buttered Ritz crackers, water, and white wine.\n   - Directions: Boil water and wine, cook shrimp and scallops, then combine with other ingredients for a delicious casserole.\n\n2. **Seafood And Pasta Salad**\n   - Ingredients: scallops, cooked shrimp, imitation crab, green onions, olive oil, boiled eggs, mayonnaise, and tri-colored rotini noodles.\n   - Directions: Boil noodles, sauté scallops, and mix with other ingredients for a refreshing salad.\n\n3. **Bouillabaisse**\n   - Ingredients: minced onions, leeks, olive oil, garlic, canned tomatoes, water, parsley, thyme, saffron, assorted fish and shellfish.\n   - Directions: Cook vegetables, add stock and seafood, then bring to boil for a flavorful soup.\n\n4. **Creole Flounder**\n   - Ingredients: flounder or pollack fillets, chopped tomatoes, green pepper, lemon 

In [28]:
response = chain.invoke("Get me the recipe for Seafood And Pasta Salad")
print(response)

# Seafood And Pasta Salad

## Ingredients
- 1 lb. scallops
- 1 lb. cooked shrimp
- 1/2 lb. imitation crab
- 4 green onions
- 1 tsp. olive oil
- 4 boiled eggs
- 2 c. Best Foods mayonnaise
- 1 bag tri-colored rotini noodles, cooked and drained

## Directions
1. Boil the noodles, then rinse and put in a bowl which you will be using for serving the salad.
2. Saute scallops in the olive oil. Add the scallops along with the rest of the seafood to the noodles.
3. Dice the green onions (exclude the chive part) and boiled eggs.
4. Add to the salad.
5. Add mayonnaise; use more or less for taste.
6. Mix ingredients together.
7. Ready to serve!
