[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NirDiamant/RAG_Techniques/blob/main/all_rag_techniques/simple_csv_rag.ipynb)

# Simple RAG (Retrieval-Augmented Generation) System for CSV Files

## Overview

This code implements a basic Retrieval-Augmented Generation (RAG) system for processing and querying CSV documents. The system encodes the document content into a vector store, which can then be queried to retrieve relevant information.

## CSV File Structure and Use Case
The CSV file contains dummy customer data with attributes such as first name, last name, and company. This notebook uses the dataset to build a customer-information Q&A system on top of RAG.

## Key Components

1. Loading and splitting CSV files.
2. Vector store creation using [FAISS](https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/) and OpenAI embeddings.
3. Retriever setup for querying the processed documents.
4. A question-answering chain over the CSV data.

## Method Details

### Document Preprocessing

1. The CSV is loaded with LangChain's `CSVLoader`, which turns each row into one document.
2. The resulting documents are split into chunks.
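
To make the preprocessing step concrete, here is a minimal sketch of how a CSV row can become a text document with `column: value` lines and row metadata, which is the shape of the documents printed later in this notebook. It uses only the standard library as a stand-in for `CSVLoader` and is not the loader's actual implementation. The function name `rows_to_documents`, the `sample.csv` source name, and the two-row sample are hypothetical.

```python
import csv
import io

# Hypothetical two-row sample using a few of the dataset's column names
SAMPLE_CSV = """Index,First Name,Last Name,Company
1,Sheryl,Baxter,Rasmussen Group
2,Preston,Lozano,Vega-Gentry
"""

def rows_to_documents(csv_text, source="sample.csv"):
    """Turn each CSV row into a dict with 'page_content' and 'metadata'."""
    documents = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader):
        # One "column: value" line per field, joined by newlines
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        documents.append({
            "page_content": content,
            "metadata": {"source": source, "row": row_number},
        })
    return documents

documents = rows_to_documents(SAMPLE_CSV)
print(documents[0]["page_content"])
```

Because every row becomes its own short document, a question about one customer can be answered from a single retrieved chunk.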


### Vector Store Creation

1. OpenAI embeddings are used to create vector representations of the text chunks.
2. A FAISS vector store is created from these embeddings for efficient similarity search.
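
The idea behind a flat L2 index can be sketched in a few lines of plain Python: store every vector, and for each query compute the squared Euclidean distance to all of them. The class below is a toy illustration of that exhaustive search, not FAISS itself. The name `FlatL2Index` and the three-dimensional vectors are assumptions made for the example; real embeddings have far more dimensions.

```python
class FlatL2Index:
    """Toy exact-search index: stores vectors and scans all of them per query."""

    def __init__(self, dimension):
        self.dimension = dimension
        self.vectors = []

    def add(self, vectors):
        """Append vectors to the index, checking their dimension."""
        for vector in vectors:
            if len(vector) != self.dimension:
                raise ValueError("vector has the wrong dimension")
            self.vectors.append(list(vector))

    def search(self, query, k):
        """Return the k closest vectors as (squared_distance, position) pairs."""
        scored = [
            (sum((q - v) ** 2 for q, v in zip(query, vector)), position)
            for position, vector in enumerate(self.vectors)
        ]
        return sorted(scored)[:k]

index = FlatL2Index(dimension=3)
index.add([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]])
results = index.search([1.0, 0.0, 0.0], k=2)
print(results)
```

A brute-force scan like this is exact but costs one distance computation per stored vector, which is fine for a 100-row CSV.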

### Retriever Setup

1. A retriever is configured to fetch the most relevant chunks for a given query.
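
As a rough sketch of what a retriever does, the example below embeds a query, finds the nearest stored vectors, and maps their index positions back to documents through an ID table. Everything here is a simplified stand-in: `toy_embed` counts words instead of calling an embedding model, and the three-document `docstore` and the `retrieve` helper are hypothetical.

```python
import re

# Hypothetical mini-corpus; the keys stand in for docstore IDs
docstore = {
    "doc-a": "First Name: Sheryl\nLast Name: Baxter\nCompany: Rasmussen Group",
    "doc-b": "First Name: Preston\nLast Name: Lozano\nCompany: Vega-Gentry",
    "doc-c": "First Name: Roy\nLast Name: Berry\nCompany: Murillo-Perry",
}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

# Vocabulary built from the corpus; each word gets one vector position
vocabulary = sorted({token for text in docstore.values() for token in tokenize(text)})

def toy_embed(text):
    """Stand-in for a real embedding model: word counts over the vocabulary."""
    tokens = tokenize(text)
    return [tokens.count(word) for word in vocabulary]

# "Index": vectors in insertion order, plus a position -> document ID mapping
index_vectors = []
index_to_docstore_id = {}
for position, (doc_id, text) in enumerate(docstore.items()):
    index_vectors.append(toy_embed(text))
    index_to_docstore_id[position] = doc_id

def retrieve(query, k=1):
    """Return the k documents whose vectors are closest (squared L2) to the query."""
    query_vector = toy_embed(query)
    scored = sorted(
        (sum((q - v) ** 2 for q, v in zip(query_vector, vector)), position)
        for position, vector in enumerate(index_vectors)
    )
    return [docstore[index_to_docstore_id[position]] for _, position in scored[:k]]

top = retrieve("which company does Preston work for?", k=1)
print(top[0])
```

The parameter `k` controls how many chunks are handed to the language model; raising it gives more context at the cost of a longer prompt.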

## Benefits of this Approach

1. Scalability: Can handle large documents by processing them in chunks.
2. Flexibility: Easy to adjust parameters like chunk size and number of retrieved results.
3. Efficiency: Utilizes FAISS for fast similarity search in high-dimensional spaces.
4. Integration with Advanced NLP: Uses OpenAI embeddings for state-of-the-art text representation.

## Conclusion

This simple RAG system provides a solid foundation for building more complex information retrieval and question-answering systems. By encoding document content into a searchable vector store, it enables efficient retrieval of relevant information in response to queries. This approach is particularly useful for applications requiring quick access to specific information within a CSV file.

Import libraries

# Package Installation and Imports

The cell below installs all necessary packages required to run this notebook.


In [4]:
# Install required packages
!pip install faiss-cpu langchain langchain-community langchain-openai pandas python-dotenv






In [5]:
from langchain_community.document_loaders.csv_loader import CSVLoader
from pathlib import Path
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
import os
from dotenv import load_dotenv

# Load environment variables from a .env file
load_dotenv()

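# Make sure the OpenAI API key is available; fail early with a clear message if it is not
if not os.getenv("OPENAI_API_KEY"):
    raise ValueError("OPENAI_API_KEY is not set; add it to your .env file or environment")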

llm = ChatOpenAI(model="gpt-3.5-turbo-0125")


In [6]:
# Download required data files (uncomment the lines below to run)
# import os
# os.makedirs('data', exist_ok=True)

# # Only the CSV file is used in this notebook; the PDF is not needed here
# !wget -O data/Understanding_Climate_Change.pdf https://raw.githubusercontent.com/NirDiamant/RAG_TECHNIQUES/main/data/Understanding_Climate_Change.pdf
# !wget -O data/customers-100.csv https://raw.githubusercontent.com/NirDiamant/RAG_TECHNIQUES/main/data/customers-100.csv


In [7]:
import pandas as pd

file_path = '../data/customers-100.csv'  # path to the CSV file; adjust it to where your copy is stored
data = pd.read_csv(file_path)

# Preview the first rows of the CSV file
data.head()

| | Index | Customer Id | First Name | Last Name | Company | City | Country | Phone 1 | Phone 2 | Email | Subscription Date | Website |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | DD37Cf93aecA6Dc | Sheryl | Baxter | Rasmussen Group | East Leonard | Chile | 229.077.5154 | 397.884.0519x718 | zunigavanessa@smith.info | 2020-08-24 | http://www.stephenson.com/ |
| 1 | 2 | 1Ef7b82A4CAAD10 | Preston | Lozano | Vega-Gentry | East Jimmychester | Djibouti | 5153435776 | 686-620-1820x944 | vmata@colon.com | 2021-04-23 | http://www.hobbs.com/ |
| 2 | 3 | 6F94879bDAfE5a6 | Roy | Berry | Murillo-Perry | Isabelborough | Antigua and Barbuda | +1-539-402-0259 | (496)978-3969x58947 | beckycarr@hogan.com | 2020-03-25 | http://www.lawrence.com/ |
| 3 | 4 | 5Cef8BFA16c5e3c | Linda | Olsen | Dominguez, Mcmillan and Donovan | Bensonview | Dominican Republic | 001-808-617-6467x12895 | +1-813-324-8756 | stanleyblackwell@benson.org | 2020-06-02 | http://www.good-lyons.com/ |
| 4 | 5 | 053d585Ab6b3159 | Joanna | Bender | Martin, Lang and Andrade | West Priscilla | Slovakia (Slovak Republic) | 001-234-203-0635x76146 | 001-199-446-3860x3486 | colinalvarado@miles.net | 2021-04-17 | https://goodwin-ingram.com/ |


Load and process the CSV data

In [8]:
loader = CSVLoader(file_path=file_path)
docs = loader.load_and_split()

In [9]:
docs

[Document(metadata={'source': '../data/customers-100.csv', 'row': 0}, page_content='Index: 1\nCustomer Id: DD37Cf93aecA6Dc\nFirst Name: Sheryl\nLast Name: Baxter\nCompany: Rasmussen Group\nCity: East Leonard\nCountry: Chile\nPhone 1: 229.077.5154\nPhone 2: 397.884.0519x718\nEmail: zunigavanessa@smith.info\nSubscription Date: 2020-08-24\nWebsite: http://www.stephenson.com/'),
 Document(metadata={'source': '../data/customers-100.csv', 'row': 1}, page_content='Index: 2\nCustomer Id: 1Ef7b82A4CAAD10\nFirst Name: Preston\nLast Name: Lozano\nCompany: Vega-Gentry\nCity: East Jimmychester\nCountry: Djibouti\nPhone 1: 5153435776\nPhone 2: 686-620-1820x944\nEmail: vmata@colon.com\nSubscription Date: 2021-04-23\nWebsite: http://www.hobbs.com/'),
 Document(metadata={'source': '../data/customers-100.csv', 'row': 2}, page_content='Index: 3\nCustomer Id: 6F94879bDAfE5a6\nFirst Name: Roy\nLast Name: Berry\nCompany: Murillo-Perry\nCity: Isabelborough\nCountry: Antigua and Barbuda\nPhone 1: +1-539-402-0

Initialize the FAISS vector store and OpenAI embeddings

In [10]:
import faiss
from langchain_community.docstore.in_memory import InMemoryDocstore
from langchain_community.vectorstores import FAISS

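# Create the embedding model once and reuse it everywhere
embeddings = OpenAIEmbeddings()

# Embed a dummy query to get the embedding dimension, then build a flat L2 index of that size
index = faiss.IndexFlatL2(len(embeddings.embed_query(" ")))
vector_store = FAISS(
    embedding_function=embeddings,
    index=index,
    docstore=InMemoryDocstore(),
    index_to_docstore_id={}
)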

Add the split CSV data to the vector store

In [11]:
vector_store.add_documents(documents=docs)

['19ba6df0-e3c8-4e3b-b5e8-7d1190a2c66f',
 'e6f1eeb4-71ba-44f6-9f95-8553bbe66bb5',
 '912688f2-9f8b-45c2-b7c5-a48243a2ea37',
 '87a746e3-dfbf-4841-b1a2-4970d3351cfc',
 'f6884a52-836f-47fb-949e-a502aaa757ba',
 'ccc0255d-2993-40ec-ae27-bb3411810880',
 'adc55fdf-6bd8-467e-8cc5-1a7857804545',
 'b3e76e72-0741-4517-ba05-e4443756f5e5',
 'dbe965ed-6c0f-45e3-bcb6-d1f559330e3d',
 '18800f35-85d3-4b6a-833c-d763457bec24',
 'fbe52ac3-0241-4900-89c5-430544761d6c',
 '15eec540-bb17-4cd2-b513-2845b5700f31',
 'abbb4595-b269-4c64-8da8-279c6e96ef11',
 '654bc370-3322-4eae-80e3-8ec74d551f1b',
 'fa57b1ec-0f2a-418a-a836-798b252bd407',
 '99d1955c-399b-4816-9f19-df0dcbac2dae',
 'f5482cff-0a46-4d16-9ca8-c6dbf0373ae4',
 '10861ab3-6500-44d8-a861-b7f0b4c7a72e',
 '7fc814e9-b528-4d8f-8f6c-fe6b9e42b619',
 'e3a2bf66-eb42-4d3b-aee3-dd6db7cc8a5f',
 'fd0844f0-6fdd-44a2-9b85-ce2f100b1be0',
 '0da82d12-288f-4de4-a39e-a572fb25ccea',
 '77b7d31a-ae7f-4e72-bfa5-31ca6c01c549',
 '5e4ff0a9-afaa-4a4a-8955-23bdf4142c52',
 '76efb07e-f374-

In [12]:
# The mapping from IDs to documents is in vector_store.docstore._dict (a dict of {id: Document}).

for i, (doc_id, doc) in enumerate(vector_store.docstore._dict.items()):
    print(f"Doc {i}: {doc.page_content[:100]}...")  # Print first 100 chars

Doc 0: Index: 1
Customer Id: DD37Cf93aecA6Dc
First Name: Sheryl
Last Name: Baxter
Company: Rasmussen Group
...
Doc 1: Index: 2
Customer Id: 1Ef7b82A4CAAD10
First Name: Preston
Last Name: Lozano
Company: Vega-Gentry
Cit...
Doc 2: Index: 3
Customer Id: 6F94879bDAfE5a6
First Name: Roy
Last Name: Berry
Company: Murillo-Perry
City: ...
Doc 3: Index: 4
Customer Id: 5Cef8BFA16c5e3c
First Name: Linda
Last Name: Olsen
Company: Dominguez, Mcmilla...
Doc 4: Index: 5
Customer Id: 053d585Ab6b3159
First Name: Joanna
Last Name: Bender
Company: Martin, Lang and...
Doc 5: Index: 6
Customer Id: 2d08FB17EE273F4
First Name: Aimee
Last Name: Downs
Company: Steele Group
City:...
Doc 6: Index: 7
Customer Id: EA4d384DfDbBf77
First Name: Darren
Last Name: Peck
Company: Lester, Woodard an...
Doc 7: Index: 8
Customer Id: 0e04AFde9f225dE
First Name: Brett
Last Name: Mullen
Company: Sanford, Davenpor...
Doc 8: Index: 9
Customer Id: C2dE4dEEc489ae0
First Name: Sheryl
Last Name: Meyers
Company: Browning-Simon
C...
D

Create the retrieval chain

In [13]:
from langchain_core.prompts import ChatPromptTemplate
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain

retriever = vector_store.as_retriever()

# Set up system prompt
system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Use three sentences maximum and keep the "
    "answer concise."
    "\n\n"
    "{context}"
)

prompt = ChatPromptTemplate.from_messages([
    ("system", system_prompt),
    ("human", "{input}"),
])

# Create the question-answer chain
question_answer_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)
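
Conceptually, the "stuff" strategy places all retrieved documents into a single prompt: their texts are joined into one string that fills the `{context}` placeholder, and the user's question becomes the human message. The sketch below imitates that step with plain string formatting. It is an illustration rather than LangChain's implementation; the `build_messages` helper, the shortened template, the `"\n\n"` separator, and the sample texts are assumptions.

```python
# Shortened version of the system prompt used above (illustrative only)
SYSTEM_TEMPLATE = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question.\n\n{context}"
)

def build_messages(question, retrieved_texts, separator="\n\n"):
    """Join the retrieved texts into one context string and fill the prompt."""
    context = separator.join(retrieved_texts)
    return [
        ("system", SYSTEM_TEMPLATE.format(context=context)),
        ("human", question),
    ]

messages = build_messages(
    "which company does Preston work for?",
    [
        "First Name: Preston\nCompany: Vega-Gentry",
        "First Name: Roy\nCompany: Murillo-Perry",
    ],
)
for role, text in messages:
    print(f"[{role}] {text}")
```

Because every retrieved chunk ends up in one prompt, this strategy suits small chunks like single CSV rows; with many or long chunks the prompt can exceed the model's context window.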

Query the RAG chain with a question based on the CSV data

In [14]:
answer = rag_chain.invoke({"input": "which company does sheryl Baxter work for?"})
answer['answer']

'Sheryl Baxter works for the Rasmussen Group.'

In [15]:
answer = rag_chain.invoke({"input": "which company does Preston work for?"})
answer['answer']

'Preston works for Vega-Gentry.'