# Simple RAG (Retrieval-Augmented Generation) System for CSV Files

## Overview

This code implements a basic Retrieval-Augmented Generation (RAG) system for processing and querying CSV documents. The system encodes the document content into a vector store, which can then be queried to retrieve relevant information.

## CSV File Structure and Use Case
The CSV file contains dummy customer data, comprising various attributes like first name, last name, company, etc. This dataset will be utilized for a RAG use case, facilitating the creation of a customer information Q&A system.

## Key Components

1. Loading and splitting CSV files
2. Vector store creation using [FAISS](https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/) and OpenAI embeddings
3. Retriever setup for querying the processed documents
4. Building a question-answering chain over the CSV data

## Method Details

### Document Preprocessing

1. The CSV file is loaded with LangChain's `CSVLoader`.
2. The data is split into chunks.
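
To make the chunking step concrete, here is a minimal sketch of a fixed-size character splitter with overlap. It is not the splitter LangChain applies in `load_and_split()`; the function name `split_into_chunks` and the parameters `chunk_size` and `overlap` are hypothetical and chosen only for illustration.

```python
def split_into_chunks(text: str, chunk_size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into character chunks of at most chunk_size that overlap by `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current chunk already reaches the end of the text
    return chunks


# Illustrative row text in a "column: value" layout
row_text = "First Name: Sheryl\nLast Name: Baxter\nCompany: Rasmussen Group"
chunks = split_into_chunks(row_text, chunk_size=40, overlap=10)
for chunk in chunks:
    print(repr(chunk))
```

The overlap means text near a chunk boundary appears in both neighbouring chunks, so a fact that straddles the boundary is less likely to be cut in half.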


### Vector Store Creation

1. OpenAI embeddings are used to create vector representations of the text chunks.
2. A FAISS vector store is created from these embeddings for efficient similarity search.

### Retriever Setup

1. A retriever is configured to fetch the most relevant chunks for a given query.
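
Conceptually, the retrieval step is a nearest-neighbour search over the stored chunk vectors. The sketch below does this by brute force with squared L2 distance in plain Python. The helper `top_k` and the toy vectors are assumptions made for illustration, not part of LangChain or FAISS.

```python
import heapq


def top_k(query, vectors, k=2):
    """Return (squared L2 distance, position) pairs for the k vectors nearest to query."""
    scored = [
        (sum((q - v) ** 2 for q, v in zip(query, vec)), position)
        for position, vec in enumerate(vectors)
    ]
    return heapq.nsmallest(k, scored)


# Toy 2-dimensional "embeddings" for four chunks
chunk_vectors = [[0.0, 4.0], [3.0, 0.0], [2.0, 1.0], [1.0, 1.0]]
hits = top_k([3.0, 0.0], chunk_vectors, k=2)
print(hits)
```

Changing `k` is the toy equivalent of adjusting how many chunks the retriever returns for each query.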

## Benefits of this Approach

1. Scalability: Can handle large documents by processing them in chunks.
2. Flexibility: Easy to adjust parameters like chunk size and number of retrieved results.
3. Efficiency: Utilizes FAISS for fast similarity search in high-dimensional spaces.
4. Integration with Advanced NLP: Uses OpenAI embeddings for state-of-the-art text representation.

## Conclusion

This simple RAG system provides a solid foundation for building more complex information retrieval and question-answering systems. By encoding document content into a searchable vector store, it enables efficient retrieval of relevant information in response to queries. This approach is particularly useful for applications that need quick access to specific records within a CSV file.

## Introduction to LangChain

This notebook, while demonstrating a RAG system, also serves as a practical introduction to [LangChain](https://www.langchain.com/). LangChain is an open-source framework designed to simplify the creation of applications that use large language models (LLMs). It provides a standard interface for chains, many integrations with other tools, and end-to-end chains for common applications.

LangChain can be seen as a higher-level library that orchestrates the different components of a RAG system (and other LLM applications) in a more streamlined and modular way. Instead of writing boilerplate code to connect your data source, embedding model, vector store, and LLM, LangChain provides convenient abstractions to do so.

In this notebook, we will be using several LangChain components, which will be explained in more detail in the following sections.

# Package Installation and Imports

The cell below installs all necessary packages required to run this notebook.


In [None]:
# Install required packages
!pip install faiss-cpu langchain==0.1.14 langchain-community langchain-openai pandas python-dotenv


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import os

from dotenv import load_dotenv
from langchain_community.document_loaders.csv_loader import CSVLoader
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Load environment variables (including OPENAI_API_KEY) from a .env file
env_path = "/content/drive/MyDrive/.env"
load_dotenv(env_path)

# Fail early with a clear message if the API key was not loaded
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set; check the .env file at " + env_path)

llm = ChatOpenAI(model="gpt-3.5-turbo-0125")


In [None]:
# Download required data files
import os
os.makedirs('data', exist_ok=True)

# Download the sample files from the repository (only customers-100.csv is used in this notebook)
!wget -O data/Understanding_Climate_Change.pdf https://raw.githubusercontent.com/NirDiamant/RAG_TECHNIQUES/main/data/Understanding_Climate_Change.pdf
!wget -O data/customers-100.csv https://raw.githubusercontent.com/NirDiamant/RAG_TECHNIQUES/main/data/customers-100.csv



In [None]:
import pandas as pd

file_path = 'data/customers-100.csv'  # path to the CSV file
data = pd.read_csv(file_path)

# Preview the first five rows of the CSV file
data.head()

| | Index | Customer Id | First Name | Last Name | Company | City | Country | Phone 1 | Phone 2 | Email | Subscription Date | Website |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | DD37Cf93aecA6Dc | Sheryl | Baxter | Rasmussen Group | East Leonard | Chile | 229.077.5154 | 397.884.0519x718 | zunigavanessa@smith.info | 2020-08-24 | http://www.stephenson.com/ |
| 1 | 2 | 1Ef7b82A4CAAD10 | Preston | Lozano | Vega-Gentry | East Jimmychester | Djibouti | 5153435776 | 686-620-1820x944 | vmata@colon.com | 2021-04-23 | http://www.hobbs.com/ |
| 2 | 3 | 6F94879bDAfE5a6 | Roy | Berry | Murillo-Perry | Isabelborough | Antigua and Barbuda | +1-539-402-0259 | (496)978-3969x58947 | beckycarr@hogan.com | 2020-03-25 | http://www.lawrence.com/ |
| 3 | 4 | 5Cef8BFA16c5e3c | Linda | Olsen | Dominguez, Mcmillan and Donovan | Bensonview | Dominican Republic | 001-808-617-6467x12895 | +1-813-324-8756 | stanleyblackwell@benson.org | 2020-06-02 | http://www.good-lyons.com/ |
| 4 | 5 | 053d585Ab6b3159 | Joanna | Bender | Martin, Lang and Andrade | West Priscilla | Slovakia (Slovak Republic) | 001-234-203-0635x76146 | 001-199-446-3860x3486 | colinalvarado@miles.net | 2021-04-17 | https://goodwin-ingram.com/ |


### Loading Data with LangChain's `CSVLoader`

The first step in our RAG pipeline is to load the data from our CSV file. LangChain provides a variety of document loaders for different file formats, and in this case, we use the `CSVLoader`. This loader reads the CSV file and creates a `Document` object for each row. Each document contains the page content (the row data) and metadata.

In [None]:
loader = CSVLoader(file_path=file_path)
docs = loader.load_and_split()
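
To illustrate what "one document per row" means, the sketch below mimics the loader with only the standard library. It assumes each row is rendered as `column: value` lines and that the source path and row number go into the metadata, which is how `CSVLoader` behaves to my understanding. `RowDocument` and `load_rows` are hypothetical names, not LangChain classes.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class RowDocument:
    """Hypothetical stand-in for LangChain's Document class."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_rows(csv_text: str, source: str) -> list[RowDocument]:
    """Turn every CSV row into one document with 'column: value' lines."""
    reader = csv.DictReader(io.StringIO(csv_text))
    documents = []
    for row_number, row in enumerate(reader):
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        documents.append(
            RowDocument(page_content=content, metadata={"source": source, "row": row_number})
        )
    return documents


# A three-column excerpt of the customer data, inlined so the sketch runs on its own
sample_csv = (
    "First Name,Last Name,Company\n"
    "Sheryl,Baxter,Rasmussen Group\n"
    'Linda,Olsen,"Dominguez, Mcmillan and Donovan"\n'
)
sample_docs = load_rows(sample_csv, source="data/customers-100.csv")
print(sample_docs[0].page_content)
print(sample_docs[1].metadata)
```

Because the column name is repeated in every row's text, each document stays meaningful on its own once it is embedded and retrieved without the CSV header.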

### Embeddings with LangChain and OpenAI

After loading the data, we need to create vector embeddings for our documents. LangChain provides a seamless integration with various embedding models, including OpenAI's. The `OpenAIEmbeddings` class is a wrapper around the OpenAI API that allows us to easily generate embeddings for our text data. These embeddings are high-dimensional vectors that capture the semantic meaning of the text, which is crucial for the retrieval step in our RAG system.

In [None]:
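import faiss

embeddings = OpenAIEmbeddings()
# Embed a dummy string to discover the embedding dimension, then build an exact L2 index of that size
index = faiss.IndexFlatL2(len(embeddings.embed_query(" ")))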
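
As a rough intuition for what an embedding is, the sketch below uses a toy bag-of-words vector in place of a real embedding model. It is a deliberately simplified substitute for OpenAI embeddings: `toy_embed`, `squared_l2`, and the four-word vocabulary are all hypothetical. It also shows why the index is sized from the length of one embedded string.

```python
def toy_embed(text: str, vocabulary: list[str]) -> list[float]:
    """Count how often each vocabulary term occurs in the text (toy embedding)."""
    words = text.lower().split()
    return [float(words.count(term)) for term in vocabulary]


def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


vocabulary = ["company", "phone", "email", "city"]
query_vec = toy_embed("which company", vocabulary)
company_vec = toy_embed("company name and city", vocabulary)
contact_vec = toy_embed("phone and email", vocabulary)

# Every vector has the same length; an index must be created with this dimension
dimension = len(query_vec)
print(dimension)
print(squared_l2(query_vec, company_vec), squared_l2(query_vec, contact_vec))
```

Texts that share terms with the query end up closer to it. A real embedding model produces a similar effect for related meanings rather than exact word matches.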

### Vector Stores in LangChain: FAISS

Once we have the embeddings, we need a place to store them and efficiently search for similar vectors. This is where vector stores come in. LangChain supports a wide range of vector stores, and in this notebook, we use `FAISS` (Facebook AI Similarity Search), a library for efficient similarity search and clustering of dense vectors.

LangChain's `FAISS` class provides a convenient wrapper around the FAISS library, allowing us to create a vector store from our documents and embeddings with just a few lines of code. This vector store will be used by our retriever to find the most relevant documents for a given query.

In [None]:
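import faiss
from langchain_community.docstore.in_memory import InMemoryDocstore
from langchain_community.vectorstores import FAISS

# Wrap the empty FAISS index in a LangChain vector store
vector_store = FAISS(
    embedding_function=embeddings,  # reuse the embeddings object created earlier
    index=index,
    docstore=InMemoryDocstore(),  # holds the Document objects
    index_to_docstore_id={}  # maps index positions to docstore ids
)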
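
The three constructor arguments `index`, `docstore`, and `index_to_docstore_id` suggest how the pieces fit together: the index holds vectors by position, and the mapping leads from a position to a stored document. The toy class below is a conceptual sketch of that bookkeeping under those assumptions. `ToyVectorStore` is hypothetical and is not LangChain's implementation.

```python
import uuid


class ToyVectorStore:
    """Conceptual sketch: vectors by position, plus a position -> id -> text lookup."""

    def __init__(self):
        self.vectors = []               # plays the role of the FAISS index
        self.index_to_docstore_id = {}  # vector position -> document id
        self.docstore = {}              # document id -> document text

    def add(self, vector, text):
        """Store a vector and its text, returning the generated document id."""
        doc_id = str(uuid.uuid4())
        position = len(self.vectors)
        self.vectors.append(vector)
        self.index_to_docstore_id[position] = doc_id
        self.docstore[doc_id] = text
        return doc_id

    def search(self, query, k=1):
        """Return the texts of the k vectors closest to query (squared L2, brute force)."""
        def distance(position):
            return sum((q - v) ** 2 for q, v in zip(query, self.vectors[position]))

        ranked = sorted(range(len(self.vectors)), key=distance)
        return [self.docstore[self.index_to_docstore_id[pos]] for pos in ranked[:k]]


store = ToyVectorStore()
first_id = store.add([1.0, 0.0], "Company: Rasmussen Group")
store.add([0.0, 1.0], "Company: Vega-Gentry")
print(store.search([0.9, 0.1], k=1))
```

As in the real notebook output, `add` hands back generated ids, one per stored document.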

Add the split CSV data to the vector store.

In [None]:
vector_store.add_documents(documents=docs)

['bab6b72b-0844-4e67-8ff8-102992154ed5',
 'df13f472-ba1b-4dc4-866a-d821041f7263',
 '8fb16ed1-b74c-4550-b4ce-3db64a36d640',
 '2dc09e0e-f91c-4539-b1ac-72b5d353c692',
 '1d7e1994-e287-408a-b8ab-3eaa04dd9c3a',
 '3edb89bb-1b53-4a53-bcb1-268ba8fdb8cd',
 '9a40a444-7fb9-4216-adb4-18de5c3b9f9a',
 '0784e231-2840-4575-8289-1f651645b140',
 '1e226d08-338d-42d5-a42f-48e1bcc4fd23',
 '5ce7851d-94d7-47db-bdf4-b7f15f87a5af',
 '03e52efc-c692-49e7-92a1-f385a9b18bb9',
 '941cbce4-c588-46a6-91a8-4d7af3f887b8',
 'db5a3b9b-f8e5-4a22-aaab-be68d0131d4a',
 '17f79201-6c9b-4eaa-b84a-3883211cd287',
 '759b69c1-f15b-43be-866d-0e0fa009ee6f',
 '270127ac-089b-4cc4-bc62-f01f460b8a44',
 '55a32c8b-7951-4145-929c-683c3908d2e0',
 'abcef757-b579-4d7a-8522-34e2d750b7fc',
 '46830fa7-defa-45dd-a94b-ab59e35bc36a',
 '2a130110-ff72-4df4-8107-d24e55a14efb',
 '8a123445-7d39-43be-9e52-a49df224e68a',
 'a06cd3f4-0767-4559-9c11-69f7d2fefe78',
 '965e4649-2b48-4eb0-86d6-896a0dc90416',
 '4e3f1552-ff59-4ae1-9371-16b45f3c95cc',
 '7c72fa28-b510-

### LangChain Chains: `create_retrieval_chain`

The core of LangChain lies in its concept of 'chains'. Chains allow us to combine multiple components together to create a single, coherent application. In our case, we want to create a retrieval-based question-answering chain.

The `create_retrieval_chain` function is a helper that simplifies the process of creating such a chain. It takes a retriever and a document-combining chain as input; here that second chain is built with `create_stuff_documents_chain`, which inserts ("stuffs") the retrieved documents into the prompt and sends it to the language model. The resulting chain first retrieves relevant documents from the vector store and then passes them to the language model to generate an answer. This is the essence of the Retrieval-Augmented Generation (RAG) pattern.

In [None]:
from langchain_core.prompts import ChatPromptTemplate
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain

retriever = vector_store.as_retriever()

# Set up system prompt
system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Use three sentences maximum and keep the "
    "answer concise."
    "\n\n"
    "{context}"
)

prompt = ChatPromptTemplate.from_messages([
    ("system", system_prompt),
    ("human", "{input}"),

])

# Create the question-answer chain
question_answer_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)

Query the RAG chain with a question based on the CSV data.

In [None]:
answer = rag_chain.invoke({"input": "which company does sheryl Baxter work for?"})
answer['answer']

'Sheryl Baxter works for Rasmussen Group.'
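
To show the control flow of a retrieval chain without any API calls, the sketch below wires together a toy retriever, a prompt builder, and a placeholder model. All function names are hypothetical. `fake_llm` only extracts the `Company:` line and stands in for a real chat model. The returned dictionary mirrors the `input`, `context`, and `answer` keys that the chain above produces.

```python
def retrieve(question, documents, k=1):
    """Toy retriever: rank documents by how many question words they contain."""
    question_words = set(question.lower().replace("?", "").split())
    ranked = sorted(
        documents,
        key=lambda doc: len(question_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, context_docs):
    """Stuff the retrieved documents into a single prompt string."""
    context = "\n\n".join(context_docs)
    return f"Use the following context to answer the question.\n\n{context}\n\nQuestion: {question}"


def fake_llm(prompt):
    """Placeholder for a chat model: returns the first 'Company:' value in the prompt."""
    for line in prompt.splitlines():
        if line.startswith("Company: "):
            return line[len("Company: "):]
    return "I don't know."


def rag_answer(question, documents, llm, k=1):
    """Retrieve first, then generate: the two steps of the RAG pattern."""
    context_docs = retrieve(question, documents, k)
    prompt = build_prompt(question, context_docs)
    return {"input": question, "context": context_docs, "answer": llm(prompt)}


customer_docs = [
    "First Name: Sheryl\nLast Name: Baxter\nCompany: Rasmussen Group",
    "First Name: Preston\nLast Name: Lozano\nCompany: Vega-Gentry",
]
result = rag_answer("Which company does Sheryl Baxter work for?", customer_docs, fake_llm)
print(result["answer"])
```

Swapping `fake_llm` for a real model call and `retrieve` for a vector-store retriever gives the same shape of pipeline as the chain built in this notebook.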
