<a href="https://colab.research.google.com/github/TairCohen/personal-nutritionist-agent/blob/tair/simple_csv_rag.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Simple RAG (Retrieval-Augmented Generation) System for CSV Files

## Overview

This code implements a basic Retrieval-Augmented Generation (RAG) system for processing and querying CSV documents. The system encodes the document content into a vector store, which can then be queried to retrieve relevant information.

## CSV File Structure and Use Case
The CSV file contains nutritional data for food items: a Hebrew name (`shmmitzrach`), an English name (`english_name`), energy (`food_energy`), and nutrient columns such as protein, total fat, and carbohydrates. This dataset is used for a RAG use case: a Q&A system that looks up and estimates the calories in a food or dish.

## Key Components

1. Loading and splitting CSV files.
2. Vector store creation using [FAISS](https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/) and OpenAI embeddings.
3. Retriever setup for querying the processed documents.
4. Creating a question-answering chain over the CSV data.

## Method Details

### Document Preprocessing

1. The CSV is loaded using LangChain's `CSVLoader`.
2. The data is split into chunks.
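
As a rough illustration of the loading step, the sketch below turns each CSV row into one text document made of `column: value` lines, which is the shape of the `page_content` that `CSVLoader` prints later in this notebook. It uses only the standard library; `SAMPLE_CSV`, `rows_to_documents`, and the `sample.csv` source name are hypothetical names introduced for this example, not part of LangChain.

```python
import csv
import io

# Hypothetical two-row sample in the same shape as two columns of the notebook's CSV.
SAMPLE_CSV = """english_name,food_energy
"Whey, acid, fluid",24
"Milk, human",70
"""


def rows_to_documents(csv_text, source="sample.csv"):
    """Turn each CSV row into a dict with 'page_content' and 'metadata'."""
    documents = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader):
        # One "column: value" line per field, joined by newlines.
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        documents.append({
            "page_content": content,
            "metadata": {"source": source, "row": row_number},
        })
    return documents


docs = rows_to_documents(SAMPLE_CSV)
print(docs[0]["page_content"])
```

Because each row is short, every row ends up as a single document, which matches the one-document-per-row count reported further down.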


### Vector Store Creation

1. OpenAI embeddings are used to create vector representations of the text chunks.
2. A FAISS vector store is created from these embeddings for efficient similarity search.
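
To make the similarity search concrete, here is a minimal pure-Python sketch of what a flat L2 index does conceptually: it compares the query against every stored vector and returns the closest ones by squared Euclidean distance. The three-dimensional vectors and the `flat_l2_search` helper are assumptions made for illustration; FAISS performs the same kind of exhaustive comparison far more efficiently on real embeddings.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_l2_search(vectors, query, k=2):
    """Score every stored vector and return the k nearest as (index, distance) pairs."""
    scored = [(i, squared_l2(v, query)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Toy 3-dimensional "embeddings"; real embedding vectors have many more dimensions.
stored = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
]

nearest = flat_l2_search(stored, query=[1.0, 0.0, 0.0], k=2)
print([index for index, _ in nearest])
```

The query matches vector 0 exactly and is close to vector 2, so those two indices come back first; the parameter `k` plays the role of the number of retrieved results.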

### Retriever Setup

1. A retriever is configured to fetch the most relevant chunks for a given query.

## Benefits of this Approach

1. Scalability: Can handle large documents by processing them in chunks.
2. Flexibility: Easy to adjust parameters like chunk size and number of retrieved results.
3. Efficiency: Utilizes FAISS for fast similarity search in high-dimensional spaces.
4. Integration with Advanced NLP: Uses OpenAI embeddings for state-of-the-art text representation.

## Conclusion

This simple RAG system provides a solid foundation for building more complex information retrieval and question-answering systems. By encoding document content into a searchable vector store, it enables efficient retrieval of relevant information in response to queries. This approach is particularly useful for applications requiring quick access to specific information within a csv file.

Install libraries

In [None]:
!pip install -q --upgrade langchain-text-splitters langchain-community langgraph
!pip install -q langchain-openai
# Quote the requirement so the shell does not treat ">=" as an output redirect.
!pip install -q "faiss-cpu>=1.7.4"
!pip install -q langchain

Import libraries

In [None]:
import os

from google.colab import userdata
from langchain_community.document_loaders.csv_loader import CSVLoader
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Read the OpenAI API key from Colab's secret store and expose it to the client libraries.
os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')
llm = ChatOpenAI(model="gpt-4-turbo")

Load the CSV file from Google Drive and inspect it. Each row describes one food item, with its Hebrew name (`shmmitzrach`), English name (`english_name`), energy (`food_energy`), and several nutrient columns.

In [None]:
from google.colab import drive
import pandas as pd

drive.mount('/content/drive')

file_path = "/content/drive/MyDrive/סיכומים מתואר שני/NLP/calories_ds.csv"
tabular_data = pd.read_csv(file_path)
print(f"data shape {tabular_data.shape}")
tabular_data.head()

Mounted at /content/drive
data shape (4624, 9)


Unnamed: 0,Code,shmmitzrach,english_name,food_energy,protein,total_fat,carbohydrates,total_dietary_fiber,sodium
0,5,"מי גבינה, חומצי, נוזלי","Whey, acid, fluid",24,0.8,0.1,5.1,0.0,48.0
1,10,"בורגול, מבושל עם שעועית לבנה ועגבניות","Bulgur, cooked with white beans and tomatoes",112,5.3,1.7,15.3,5.1,141.2
2,14,חלב אם,"Milk, human",70,1.0,4.4,6.9,0.0,17.0
3,15,"חלב 3% שומן, תנובה, טרה, הרדוף, יטבתה","Milk, cow, 3% fat",60,3.3,3.0,4.6,0.0,50.0
4,17,"חלב 1% שומן בקרטון מועשר ויטמין A,D, וסידן","Milk, cow, 1% fat, fortified with calcium",42,3.0,1.0,4.6,1.7,40.0


In [None]:
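# Keep only the columns needed for calorie lookup; .copy() avoids working on a slice.
data = tabular_data[['shmmitzrach', 'english_name', 'food_energy']].copy()
filtered_file_path = "/content/drive/MyDrive/סיכומים מתואר שני/NLP/calories_ds_filtered_data.csv"
# Run once to write the filtered file that CSVLoader reads below:
# data.to_csv(filtered_file_path, index=False)
data = pd.read_csv(filtered_file_path)
data.head(15)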

Unnamed: 0,shmmitzrach,english_name,food_energy
0,"מי גבינה, חומצי, נוזלי","Whey, acid, fluid",24
1,"בורגול, מבושל עם שעועית לבנה ועגבניות","Bulgur, cooked with white beans and tomatoes",112
2,חלב אם,"Milk, human",70
3,"חלב 3% שומן, תנובה, טרה, הרדוף, יטבתה","Milk, cow, 3% fat",60
4,"חלב 1% שומן בקרטון מועשר ויטמין A,D, וסידן","Milk, cow, 1% fat, fortified with calcium",42
5,"חלב 3% שומן, מועשר בסידן, תנובה,טרה,יטבתה","Milk, cow, 3% fat, fortified with calcium, Tnu...",58
6,"חלב 3% שומן, מועשר בויטמינים B12, D,E, יטבתה","Milk, cow, 3% fat, fortified with vitamins, Yo...",57
7,"חלב 1% שומן, תנובה, טרה, הרדוף, יטבתה","Milk, cow, 1% fat, Tnuva/Tara/Harduf/Yotvata",43
8,"משקה חלב בטעם וניל,3% שומן, טרה","Milk drink, 3% fat, vanilla/banana/mocha, Tara",86
9,"חלב 2% שומן, כולל דל לקטוז, תנובה","Milk, cow, 3% fat, reduced lactose, Tnuva",51


Load and process the CSV data

In [None]:
loader = CSVLoader(file_path=filtered_file_path)
docs = loader.load_and_split()
docs[0]

# Alternative tried: DataFrameLoader (retrieval results were worse).
# from langchain_community.document_loaders.dataframe import DataFrameLoader
# loader = DataFrameLoader(data, page_content_column='shmmitzrach')
# docs = loader.load_and_split()
# docs[0]

Document(metadata={'source': '/content/drive/MyDrive/סיכומים מתואר שני/NLP/calories_ds_filtered_data.csv', 'row': 0}, page_content='shmmitzrach: מי גבינה, חומצי, נוזלי\nenglish_name: Whey, acid, fluid\nfood_energy: 24')

Initialize the FAISS vector store and OpenAI embeddings

In [None]:
import faiss
from langchain_community.docstore.in_memory import InMemoryDocstore
from langchain_community.vectorstores import FAISS

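embeddings = OpenAIEmbeddings()
# Embed a dummy query to discover the embedding dimension, then build a flat L2 index of that size.
index = faiss.IndexFlatL2(len(embeddings.embed_query(" ")))
vector_store = FAISS(
    embedding_function=embeddings,  # reuse the same embeddings object
    index=index,
    docstore=InMemoryDocstore(),
    index_to_docstore_id={}
)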

Add the split CSV data to the vector store

In [None]:
vector_store.add_documents(documents=docs)
len(docs)  # one document per row in the table

4624

Create the retrieval chain

In [None]:
from langchain_core.prompts import ChatPromptTemplate
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain

retriever = vector_store.as_retriever()

# Set up system prompt
system_prompt = (
    "You are an AI nutrition assistant that estimates the total calories in a dish based on a text description or an image.\n\n"
    "### *Estimation Methodology:*\n"
    "1. *Check for an exact match in the database.*\n"
    "   - If an exact match exists, return its calorie count per 100g.\n"
    "2. *If no exact match exists, break the dish into ingredients and estimate calories.*\n"
    "   - Identify the *most relevant base food* (e.g., a plain omelet for 'cheese omelet').\n"
    "   - Check for *similar variations* (e.g., 'Egg or omelet, fried without oil' as the base).\n"
    "   - *Only include ingredients explicitly mentioned in the description.*\n"
    "   - Add ingredients like cheese based on the closest match in the database. *Do not assume any extra ingredients (e.g., mushrooms) unless explicitly mentioned.*\n"
    "   - Adjust calorie estimates proportionally to the expected ingredient ratio.\n"
    "3. *Do NOT assume extra ingredients unless explicitly mentioned.*\n"
    "4. *Do NOT use the calorie value of a mixed dish (e.g., 'omelet with mushrooms and cheese') as a direct replacement for a different variant (e.g., 'cheese omelet').*\n"
    "5. *Clearly explain the steps taken, including any assumptions about portions.*\n"
    "6. *For each ingredient:*\n"
    "   - Provide the closest match from the database (e.g., 'Egg or omelet, fried without oil') and its calorie count per 100g.\n"
    "   - If the exact calorie count for an ingredient is missing, explain that and provide an estimated serving size (e.g., 150g for eggs, 30g for cheese).\n"
    "   - Use the standard serving size to calculate the calories from each ingredient based on the proportion of the total dish.\n"
    "7. *Provide the final total calories for the dish.*\n\n"
    "Use the retrieved database context below to find accurate calorie values:\n"
    "{context}\n\n"
    "If the exact ingredient is not found, use the closest alternative and explain why.\n"
    "If specific calorie counts are missing, make assumptions based on standard serving sizes and ingredient ratios. Always provide the final total calorie estimate.")


prompt = ChatPromptTemplate.from_messages([
    ("system", system_prompt),
    ("human", "{input}"),
])

# Create the question-answer chain
question_answer_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)
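
The chain built above retrieves documents and then "stuffs" them into the `{context}` placeholder of the system prompt. A minimal sketch of that stuffing step, using only the standard library, might look as follows. `SYSTEM_TEMPLATE`, `stuff_documents`, and `build_messages` are hypothetical names for this illustration, and the blank-line separator between documents is an assumption about how the documents are joined rather than something this notebook confirms.

```python
# Shortened stand-in for the system prompt; only the {context} placeholder matters here.
SYSTEM_TEMPLATE = (
    "Use the retrieved database context below to find accurate calorie values:\n"
    "{context}"
)


def stuff_documents(page_contents, separator="\n\n"):
    """Join the retrieved document texts into one context string."""
    return separator.join(page_contents)


def build_messages(question, page_contents):
    """Fill the context into the system prompt and pair it with the user question."""
    context = stuff_documents(page_contents)
    return [
        ("system", SYSTEM_TEMPLATE.format(context=context)),
        ("human", question),
    ]


# Two rows from the dataset, written in the "column: value" form the loader produces.
retrieved = [
    "english_name: Milk, cow, 3% fat\nfood_energy: 60",
    "english_name: Milk, cow, 1% fat, fortified with calcium\nfood_energy: 42",
]

messages = build_messages("How much food energy in Milk 3%?", retrieved)
print(messages[0][1])
```

The model only ever sees the rows that were retrieved, so the quality of the answer depends on the retriever returning the right rows.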

Query the RAG chain with a question based on the CSV data

In [None]:
food = "Milk"
# Work on an explicit copy so that adding a column does not raise SettingWithCopyWarning.
data = data.copy()
data['english_name_lower'] = data['english_name'].str.lower()
# Case-insensitive keyword lookup, used to check the RAG answers against the raw data.
data[data['english_name_lower'].str.contains(food.lower(), na=False)]


Unnamed: 0,shmmitzrach,english_name,food_energy,english_name_lower
2,חלב אם,"Milk, human",70,"milk, human"
3,"חלב 3% שומן, תנובה, טרה, הרדוף, יטבתה","Milk, cow, 3% fat",60,"milk, cow, 3% fat"
4,"חלב 1% שומן בקרטון מועשר ויטמין A,D, וסידן","Milk, cow, 1% fat, fortified with calcium",42,"milk, cow, 1% fat, fortified with calcium"
5,"חלב 3% שומן, מועשר בסידן, תנובה,טרה,יטבתה","Milk, cow, 3% fat, fortified with calcium, Tnu...",58,"milk, cow, 3% fat, fortified with calcium, tnu..."
6,"חלב 3% שומן, מועשר בויטמינים B12, D,E, יטבתה","Milk, cow, 3% fat, fortified with vitamins, Yo...",57,"milk, cow, 3% fat, fortified with vitamins, yo..."
...,...,...,...,...
4481,FFQ-שוקו או משקה חלב אחר מכל סוג,"FFQ- Chocolate-flavored milk drink, or other m...",59,"ffq- chocolate-flavored milk drink, or other m..."
4488,"FFQ-יוגורט, מעדן חלב או גבינה דיאט 0% שומן, לל...","FFQ- Yogurt, milk dessert or cheese, diet, 0% ...",38,"ffq- yogurt, milk dessert or cheese, diet, 0% ..."
4524,"FFQ-דגני בוקר מבושלים כגון דייסת קוואקר, סולת...","FFQ- Cereal, breakfast, cooked, inc. rolled oa...",62,"ffq- cereal, breakfast, cooked, inc. rolled oa..."
4564,FFQ-גלידה - כל סוג על בסיס חלב 1. כל השנה או ...,"FFQ- Icecream, milk based, all types- 1. all y...",243,"ffq- icecream, milk based, all types- 1. all y..."


In [None]:
answer = rag_chain.invoke({"input": "How much food energy in Milk 3%?"})
answer['answer']

'Milk with 3% fat contains about 61 calories per 100 grams.'

In [None]:
# food = "yogurt"
food = "yoghurt"
data[data['english_name_lower'].str.contains(food.lower(), na=False)]


Unnamed: 0,shmmitzrach,english_name,food_energy,english_name_lower
18,"יוגורט 4.5% שומן, תנובה","Yoghurt, cow milk, 4.5% fat, unflavored, Tnuva",68,"yoghurt, cow milk, 4.5% fat, unflavored, tnuva"
19,"יוגורט ביו 3% שומן, תנובה","Yoghurt, cow milk, bio, 3% fat, unflavored, Tnuva",65,"yoghurt, cow milk, bio, 3% fat, unflavored, tnuva"
21,"יוגורט של פעם 3% שומן, השומרון","Yoghurt, cow milk, 4% fat, unflavored",71,"yoghurt, cow milk, 4% fat, unflavored"
22,"יוגורט 1.9% שומן עם גרנולה ופירות ,פרילי טבע,...","Yoghurt, cow milk, 3% fat, with granola and fr...",98,"yoghurt, cow milk, 3% fat, with granola and fr..."
23,"יוגורט ביו 3% שומן,דנונה, שטראוס","Yoghurt, cow milk, 3% fat, unflavored, with pr...",70,"yoghurt, cow milk, 3% fat, unflavored, with pr..."
...,...,...,...,...
4232,"משקה יוגורט 1.1% שומן, דנכול, טעמים שונים, שטראוס","Yoghurt drink, cow milk, 1.1% fat, Dancol, fla...",43,"yoghurt drink, cow milk, 1.1% fat, dancol, fla..."
4317,"יוגורט, דנונה פרו, עשיר בחלבון, 2.9% שומן, שטראוס","Yoghurt, cow milk, protein-enriched, 2.9% fat,...",74,"yoghurt, cow milk, protein-enriched, 2.9% fat,..."
4318,"יוגורט, דנונה פרו, עם פרי, עשיר בחלבון, 2.4% ש...","Yoghurt, cow milk, protein-enriched ,added fru...",90,"yoghurt, cow milk, protein-enriched ,added fru..."
4383,"יוגורט קפוא, בסיס להכנת גלידת יוגורט, (פרוזן),...","Yoghurt, frozen, 1% fat, Strauss",118,"yoghurt, frozen, 1% fat, strauss"


In [None]:
answer = rag_chain.invoke({"input": "How much food energy in yoghurt bio 3%?"})
answer['answer']

'Yoghurt bio 3% fat has 65 kcal of food energy per serving.'
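
The answer above says "per serving", while the system prompt treats database values as calories per 100 g. Assuming the values are per 100 g, scaling to an actual portion is simple arithmetic, sketched below. The 150 g and 100 g portion sizes are hypothetical and chosen only for illustration; the kcal values are the ones shown in the tables above.

```python
def calories_for_portion(kcal_per_100g, grams):
    """Scale a per-100 g energy value to a portion of the given weight."""
    return kcal_per_100g * grams / 100


def total_calories(ingredients):
    """Sum the calories of (kcal_per_100g, grams) pairs."""
    return sum(calories_for_portion(kcal, grams) for kcal, grams in ingredients)


# 65 kcal/100 g: bio 3% yoghurt from the table above; 150 g is a hypothetical portion.
yoghurt_kcal = calories_for_portion(65, 150)
print(yoghurt_kcal)  # 97.5

# Adding a hypothetical 100 g of 3% milk at 60 kcal/100 g.
print(total_calories([(65, 150), (60, 100)]))  # 157.5
```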

In [None]:
import base64
from langchain_core.messages import HumanMessage

# Function to encode an image as Base64
def encode_image(image_path):
    with open(image_path, "rb") as img_file:
        return base64.b64encode(img_file.read()).decode("utf-8")

# Function to identify food from an image
def identify_food(image_path):
    image_base64 = encode_image(image_path)
    response = llm.invoke([
        HumanMessage(
            content=[
                {"type": "text", "text": "What food is in this image?"},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_base64}"}}
            ]
        )
    ])
    return response.content  # Returns identified food items


def get_calories(food_items):
    response = rag_chain.invoke({"input": f"How much food energy is in {food_items}?"})
    # return response  # Returns the answer from RAG
    return response['answer']

def estimate_calories(image_path):
    # Step 1: Identify food items in the image
    food_items = identify_food(image_path)
    print("Identified foods:", food_items)

    # Step 2: Retrieve calorie data from RAG
    calories = get_calories(food_items)

    return calories

# Example usage
image_path = "/content/drive/MyDrive/סיכומים מתואר שני/NLP/download.jpg"
calories = estimate_calories(image_path)
print(f"Total estimated calories: {calories}")

Identified foods: The image shows a hamburger paired with a side of French fries. The hamburger features a sesame seed bun, lettuce, tomato slices, one or more beef patties, and it appears to have bacon on top. Beside the hamburger, there's a glass of beer in the background.
Total estimated calories: To estimate the total food energy in the dish as described and shown in the image, we need to discern the calorie counts of the individual components—namely the hamburger, side of French fries, and the beer. Here's how we would break it down, providing estimates based on similar items:

### Hamburger
This hamburger description is closest to the "Hamburger, Big Mac, McDonald's" for comparison purposes:

- It features a sesame seed bun, lettuce, tomato slices, beef patties, and bacon. 
- The calorie content for the Big Mac is 181 kcal/100g, though the Big Mac itself often is around 550 kcal for the entire sandwich given that it typically weighs around 240g (based on McDonald's nutritional in