### Goal
Build a retrieval-augmented generation (RAG) pipeline that answers natural-language questions over unstructured and structured business data (PDFs and CSVs).

##### Workflow

* Load documents.
* Split them into chunks.
* Embed them into vector form using OpenAI embeddings.
* Store the embeddings in a FAISS vector database.
* At query time, retrieve the top-k relevant chunks and feed them into GPT-4 for question answering.
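
The retrieval step in the workflow above can be illustrated without any API calls. The following is a minimal sketch, assuming toy 3-dimensional vectors in place of OpenAI embeddings and a brute-force cosine-similarity scan in place of a FAISS index. The names `cosine` and `top_k` and the sample vectors are hypothetical, chosen only for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunk_vecs, k=2):
    """Return the indices of the k chunks most similar to the query."""
    scored = sorted(
        ((cosine(query_vec, vec), idx) for idx, vec in enumerate(chunk_vecs)),
        reverse=True,
    )
    return [idx for _, idx in scored[:k]]


# Toy "embeddings" (assumption: real embeddings have far more dimensions)
chunk_vecs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query_vec = [1.0, 0.05, 0.0]

print(top_k(query_vec, chunk_vecs, k=2))
```

A real vector index avoids scanning every vector and may use a different distance metric; cosine similarity is used here only because it is easy to follow.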

##### Tech Stack

* LangChain: framework for chaining LLM workflows and RAG pipelines
* OpenAI: GPT-4 as the LLM, plus OpenAI embeddings for vectorization
* FAISS: similarity-search library, used here as the in-memory vector store
* pypdf: used by PyPDFLoader to read PDF files (PyPDF2 is its legacy predecessor)
* pandas: CSV loading and flattening to text
* python-dotenv: loads API keys from a .env file

In [22]:
pip install -U langchain langchain-openai

Collecting langchain-openai
  Downloading langchain_openai-0.3.18-py3-none-any.whl.metadata (2.3 kB)
Downloading langchain_openai-0.3.18-py3-none-any.whl (63 kB)
Installing collected packages: langchain-openai
Successfully installed langchain-openai-0.3.18

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m25.1.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip3 install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


In [7]:
pip install tiktoken


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m25.1.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip3 install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


In [8]:
pip install pypdf


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m25.1.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip3 install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


In [9]:
pip install --upgrade langchain


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m25.1.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip3 install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


In [10]:
pip install langchain openai faiss-cpu pandas PyPDF2 python-dotenv


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m25.1.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip3 install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


In [11]:
pip install -U langchain-community


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m25.1.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip3 install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


In [12]:
import pandas as pd
from langchain.chains import RetrievalQA
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
# ChatOpenAI and OpenAIEmbeddings are imported from langchain_openai in a later cell

In [24]:
# Updated imports: the OpenAI integrations now live in the langchain-openai package
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

In [13]:
# loading API key from .env file
from dotenv import load_dotenv
load_dotenv()

True

In [14]:
def load_pdf(path):
    loader = PyPDFLoader(path)
    return loader.load()

In [15]:
def load_csv(path):
    df = pd.read_csv(path)
    return df.to_string(index=False)  # flatten the table to plain text for the LLM

In [16]:
def embed_docs(docs):
    splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
    chunks = splitter.split_documents(docs)
    embeddings = OpenAIEmbeddings()
    vectorstore = FAISS.from_documents(chunks, embeddings)
    return vectorstore
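
To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a simplified sketch of overlapping chunking. It is an assumption-laden stand-in: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph and line breaks, whereas this hypothetical `sliding_chunks` helper cuts at fixed character offsets.

```python
def sliding_chunks(text, chunk_size=500, chunk_overlap=100):
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Small numbers so the overlap is visible: each chunk repeats the last
# character of the previous one.
print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
```

The overlap exists so that a sentence straddling a chunk boundary still appears whole in at least one chunk, at the cost of embedding some text twice.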

In [17]:
def build_rag_chain(vectorstore):
    llm = ChatOpenAI(model_name="gpt-4", temperature=0)
    chain = RetrievalQA.from_chain_type(
        llm=llm,
        retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
        return_source_documents=True
    )
    return chain
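
`RetrievalQA.from_chain_type` defaults to the "stuff" chain type, which places all retrieved chunks into a single prompt. The sketch below shows the general idea only; `build_stuff_prompt` and its template wording are hypothetical and differ from the prompt LangChain actually sends.

```python
def build_stuff_prompt(question, chunks):
    """Join retrieved chunks into one context block and append the question."""
    context = "\n\n".join(chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Hypothetical retrieved chunks, standing in for the retriever's top-k results
retrieved = ["Chunk about revenue.", "Chunk about expenses."]
print(build_stuff_prompt("What changed this quarter?", retrieved))
```

Because every retrieved chunk lands in one prompt, `k` and `chunk_size` together determine how much context the model sees and how many tokens each query consumes.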

In [18]:
def create_documents_from_csv(csv_path):
    content = load_csv(csv_path)
    return [Document(page_content=content, metadata={"source": csv_path})]
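
Flattening the whole CSV into one string means the splitter may later cut a chunk in the middle of the table, leaving rows separated from their column headers. One possible alternative, sketched here with the standard-library `csv` module rather than pandas, renders each row as a self-describing line. The `rows_to_text` helper and the sample data are hypothetical.

```python
import csv
import io


def rows_to_text(csv_text):
    """Render each CSV row as 'column: value' pairs so it reads on its own."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        "; ".join(f"{col}: {val}" for col, val in row.items())
        for row in reader
    ]


# Made-up sample data for illustration only
sample = "region,quarter,revenue\nEMEA,Q1,120\nAPAC,Q1,95\n"
for line in rows_to_text(sample):
    print(line)
```

Each resulting line could become its own `Document`, so a retrieved chunk always carries the column names it needs to be interpreted.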

In [19]:
def prepare_all_docs(pdf_paths, csv_paths):
    docs = []
    for pdf in pdf_paths:
        docs.extend(load_pdf(pdf))
    for csv in csv_paths:
        docs.extend(create_documents_from_csv(csv))
    return docs

In [20]:
# pdf_paths = ["Annual_Report_2024.pdf"]
# csv_paths = ["sales_data.csv"]

# docs = prepare_all_docs(pdf_paths, csv_paths)
# vectorstore = embed_docs(docs)
# qa_chain = build_rag_chain(vectorstore)

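# query = "Why did net income drop in Q3 2024?"
# # build_rag_chain sets return_source_documents=True, so the chain has two
# # outputs and .run() would raise; call invoke() and read the "result" key
# response = qa_chain.invoke({"query": query})

# print("Answer:", response["result"])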

In [25]:
# Load and split the 10-Q PDF
loader = PyPDFLoader("aapl-20250329.pdf")
pages = loader.load()

splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=200)
chunks = splitter.split_documents(pages)

# Embed and build vector store
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(chunks, embeddings)

# Build QA chain
llm = ChatOpenAI(model_name="gpt-4", temperature=0)
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=vectorstore.as_retriever())

# Ask question
query = "Summarize Apple’s performance in the quarter ending March 29, 2025."
response = qa_chain.invoke({"query": query})["result"]  # .run() is deprecated


In [30]:
from pprint import pprint
pprint(response, sort_dicts=False)

("The text does not provide a comprehensive summary of Apple's performance for "
 'the quarter ending March 29, 2025. It only mentions that in Greater China, '
 'iPhone revenue represented a moderately higher proportion of net sales. The '
 'company had total deferred revenue of $13.6 billion as of March 29, 2025. '
 'The company expects 66% of total deferred revenue to be realized in less '
 'than a year, 24% within one-to-two years, 9% within two-to-three years and '
 '1% in greater than three years.')
