<a href="https://colab.research.google.com/github/Haiz14/rag_practice/blob/main/csv_rag_llama_index_clean_implementation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

A clean RAG implementation using Llama 3.1 8B served by Groq, over a CSV file containing data for 100 customers.

Uses a local embedding model from Hugging Face and in-memory FAISS vector storage. To implement a custom query engine with your own prompts and processing, see https://docs.llamaindex.ai/en/stable/examples/query_engine/custom_query_engine/
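
To make the retrieval step concrete, here is a minimal pure-Python sketch of what a flat (brute-force) L2 index does conceptually: it stores every vector and, at query time, compares the query against all of them. The names `FlatL2Sketch`, `add`, and `search` are hypothetical and are not the FAISS API, and the 3-dimensional vectors are made-up toy data rather than real embeddings.

```python
class FlatL2Sketch:
    """Toy brute-force index, for illustration only (not the FAISS API)."""

    def __init__(self, dimension):
        self.dimension = dimension
        self.vectors = []

    def add(self, vector):
        # Reject vectors whose length does not match the index dimension.
        if len(vector) != self.dimension:
            raise ValueError(f"expected {self.dimension} values, got {len(vector)}")
        self.vectors.append(list(vector))

    def search(self, query, k):
        """Return up to k (position, squared_distance) pairs, closest first."""
        if len(query) != self.dimension:
            raise ValueError(f"expected {self.dimension} values, got {len(query)}")
        scored = [
            (position, sum((a - b) ** 2 for a, b in zip(vector, query)))
            for position, vector in enumerate(self.vectors)
        ]
        scored.sort(key=lambda pair: pair[1])
        return scored[:k]


# Toy data: three unit vectors standing in for embeddings.
index = FlatL2Sketch(dimension=3)
index.add([1.0, 0.0, 0.0])
index.add([0.0, 1.0, 0.0])
index.add([0.0, 0.0, 1.0])

nearest = index.search([0.9, 0.1, 0.0], k=2)
print(nearest)
```

The search cost grows linearly with the number of stored vectors, which is usually acceptable for a dataset of about 100 rows like the one used here.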

In [None]:
# set up the Groq API key (stored as the Colab secret "GROK_API_KEY") and install packages
from google.colab import userdata
grok_api_key = userdata.get("GROK_API_KEY")

# system libraries needed by faiss
!apt install -y libopenblas-base libomp-dev

!pip install llama-index \
  llama-index-embeddings-huggingface \
  llama-index-llms-groq \
  llama-index-vector-stores-faiss \
  faiss-cpu \
  pandas


Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
libomp-dev is already the newest version (1:14.0-55~exp2).
libopenblas-base is already the newest version (0.3.20+ds-1).
0 upgraded, 0 newly installed, 0 to remove and 29 not upgraded.
Collecting llama-index-vector-stores-faiss
  Downloading llama_index_vector_stores_faiss-0.3.0-py3-none-any.whl.metadata (658 bytes)
Downloading llama_index_vector_stores_faiss-0.3.0-py3-none-any.whl (3.9 kB)
Installing collected packages: llama-index-vector-stores-faiss
Successfully installed llama-index-vector-stores-faiss-0.3.0


In [None]:
# save data for usage
!mkdir -p data
!wget https://raw.githubusercontent.com/NirDiamant/RAG_Techniques/refs/heads/main/data/customers-100.csv -O "data/100_customers.csv"

--2025-03-18 15:09:10--  https://raw.githubusercontent.com/NirDiamant/RAG_Techniques/refs/heads/main/data/customers-100.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 17160 (17K) [text/plain]
Saving to: ‘data/100_customers.csv’


2025-03-18 15:09:11 (2.74 MB/s) - ‘data/100_customers.csv’ saved [17160/17160]



In [None]:
# set up the LLM, the embedding model, and the FAISS vector store
from llama_index.core import Settings, VectorStoreIndex
from llama_index.core.readers import SimpleDirectoryReader
from llama_index.core.ingestion import IngestionPipeline
from llama_index.llms.groq import Groq
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.vector_stores.faiss import FaissVectorStore
from llama_index.readers.file import PagedCSVReader

from faiss import IndexFlatL2

EMBED_DIMENSION = 384 # BAAI/bge-small embed dimension is 384

llama_llm = Groq(
    model="llama-3.1-8b-instant",
    api_key=grok_api_key
)
embedding_model = HuggingFaceEmbedding()  # default model: BAAI/bge-small-en (BERT tokenizer)
Settings.llm = llama_llm
Settings.embed_model = embedding_model
print(embedding_model.embed_batch_size)

faiss_index = IndexFlatL2(EMBED_DIMENSION)
faiss_store = FaissVectorStore(faiss_index=faiss_index)

10


In [None]:
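# load the CSV and run it through the ingestion pipeline
from pandas import read_csv  # needed here: read_csv is called below

data_file_path = "./data/100_customers.csv"
data = read_csv(data_file_path)  # kept for inspecting rows later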

csv_reader = PagedCSVReader()

reader = SimpleDirectoryReader(
    input_files=[data_file_path],
    file_extractor={".csv": csv_reader},
)

docs = reader.load_data()

pipeline = IngestionPipeline(
    vector_store=faiss_store,
    documents=docs
)

nodes = pipeline.run()
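
As a rough illustration of the row-per-document idea behind the CSV reader used above, the sketch below turns each CSV row into its own `column: value` text block using only the standard library. The helper name `rows_to_text_blocks` is hypothetical, the exact text layout that `PagedCSVReader` produces may differ, and the sample is a single row with a subset of the columns, not the full file.

```python
import csv
import io


def rows_to_text_blocks(csv_text):
    """Hypothetical helper: build one 'column: value' text block per CSV row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        "\n".join(f"{column}: {value}" for column, value in row.items())
        for row in reader
    ]


# One row taken from the customer data, trimmed to five columns.
sample_csv = (
    "First Name,Last Name,Company,City,Country\n"
    'Carl,Schroeder,"Oconnell, Meza and Everett",Shannonville,Guernsey\n'
)

blocks = rows_to_text_blocks(sample_csv)
print(blocks[0])
```

Keeping each row in its own block means a retrieved chunk describes exactly one customer, which suits questions about a single person.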

In [None]:
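# set up a query engine over the ingested nodes
# note: an index built with nodes= and no storage_context uses LlamaIndex's
# default in-memory vector store, so queries here do not go through faiss_store
vector_store_index = VectorStoreIndex(nodes=nodes)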
query_engine = vector_store_index.as_query_engine()

In [None]:
# verify that retrieval works: print one row (index 10),
# then ask the LLM about that person
from pandas import read_csv
data = read_csv(data_file_path)
print(data.loc[10])
resp = query_engine.query("Tell me about Carl Schroeder")
resp.response

Index                                         11
Customer Id                      216E205d6eBb815
First Name                                  Carl
Last Name                              Schroeder
Company               Oconnell, Meza and Everett
City                                Shannonville
Country                                 Guernsey
Phone 1                         637-854-0256x825
Phone 2                         114.336.0784x788
Email                         kirksalas@webb.com
Subscription Date                     2021-10-20
Website              https://simmons-hurley.com/
Name: 10, dtype: object


"Carl Schroeder is a customer from Guernsey, and his company is Oconnell, Meza and Everett. He has a phone number of 637-854-0256x825 and an email address of kirksalas@webb.com. Carl's subscription date is in 2021, and he is from the city of Shannonville."
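
The manual comparison above can also be done programmatically. The sketch below checks which field values from the row appear verbatim in the response text. The helper name `mentioned_fields` is hypothetical, and verbatim substring matching is a simplifying assumption: a paraphrased value, such as a date reduced to its year, counts as missing.

```python
def mentioned_fields(response_text, row):
    """Hypothetical helper: names of fields whose value appears verbatim in the response."""
    return [name for name, value in row.items() if str(value) in response_text]


# Field values copied from the printed row above (subset of columns).
row = {
    "First Name": "Carl",
    "Last Name": "Schroeder",
    "Company": "Oconnell, Meza and Everett",
    "City": "Shannonville",
    "Country": "Guernsey",
    "Phone 1": "637-854-0256x825",
    "Phone 2": "114.336.0784x788",
    "Email": "kirksalas@webb.com",
    "Subscription Date": "2021-10-20",
}

# Response text copied from the cell output above.
response_text = (
    "Carl Schroeder is a customer from Guernsey, and his company is "
    "Oconnell, Meza and Everett. He has a phone number of 637-854-0256x825 "
    "and an email address of kirksalas@webb.com. Carl's subscription date is "
    "in 2021, and he is from the city of Shannonville."
)

found = mentioned_fields(response_text, row)
missing = [name for name in row if name not in found]
print("found:", found)
print("missing:", missing)
```

In the notebook, the same check could presumably be run on live objects with `mentioned_fields(resp.response, data.loc[10].to_dict())`.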