This notebook builds a Retrieval-Augmented Generation (RAG) chatbot that retrieves relevant passages from PDF documents and generates responses with a large language model (LLM). The pipeline loads the PDFs, extracts and chunks their text, embeds the chunks in a vector store, and uses semantic search to ground each answer in the retrieved text.

![Screenshot%202025-02-26%20214204.png](attachment:Screenshot%202025-02-26%20214204.png)

## 1- Import Libraries 

In [None]:
import os
import warnings
from datetime import datetime, timedelta

import numpy as np
import pandas as pd
from langchain.chains import ConversationalRetrievalChain, RetrievalQA
from langchain.document_loaders import PyPDFLoader
from langchain.memory import ConversationBufferMemory
from langchain.prompts import PromptTemplate
from langchain.retrievers import EnsembleRetriever
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS, Chroma
from langchain_fireworks import ChatFireworks, Fireworks, FireworksEmbeddings

## 2- Set API key 

In [None]:
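import getpass

# Read the API key from the environment; prompt for it if it is not set,
# so the secret never has to be written into the notebook
if not os.environ.get("FIREWORKS_API_KEY"):
    os.environ["FIREWORKS_API_KEY"] = getpass.getpass("Fireworks API key: ")

# Quick smoke test: send one prompt to DeepSeek-V3 and print the reply
llm = Fireworks(api_key=os.environ["FIREWORKS_API_KEY"], model="accounts/fireworks/models/deepseek-v3")
response = llm.invoke("Hello, how are you?")
print(response)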

This cell sets up authentication, initializes a language model (DeepSeek-V3 served by Fireworks AI), sends a test prompt, and prints the model's response. The `invoke` method generates a reply to the input prompt.

## 3- Initialize embeddings

In [None]:
# Reuse the key stored in the environment instead of repeating it in the code
embeddings = FireworksEmbeddings(api_key=os.environ["FIREWORKS_API_KEY"])

## 4- Reading PDFs

In [None]:
pdf_files = [
    r"C:\Users\Esraa\OneDrive - Alexandria University\Chatbot-Using-LangChain\How-to-Manage-your-Finances.pdf",
    r"C:\Users\Esraa\OneDrive - Alexandria University\Chatbot-Using-LangChain\pdf_50_20_30.pdf",
    r"C:\Users\Esraa\OneDrive - Alexandria University\Chatbot-Using-LangChain\Personal-Finance-Management-Handbook.pdf",
    r"C:\Users\Esraa\OneDrive - Alexandria University\Chatbot-Using-LangChain\reach-my-financial-goals.pdf",
    r"C:\Users\Esraa\OneDrive - Alexandria University\Chatbot-Using-LangChain\tips-to-manage-your-money.pdf",
    r"C:\Users\Esraa\OneDrive - Alexandria University\Chatbot-Using-LangChain\beginners-guide-to-saving-2024.pdf",
    r"C:\Users\Esraa\OneDrive - Alexandria University\Chatbot-Using-LangChain\40MoneyManagementTips.pdf",
]

## 5- Splitting documents into smaller meaningful chunks

In [None]:
# Load and split PDF
documents = []
for pdf in pdf_files:
    loader = PyPDFLoader(pdf)
    documents.extend(loader.load())

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = text_splitter.split_documents(documents)

# Generate embeddings in batches to keep each API request small
def batch_items(items, batch_size=256):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

batch_size = 256
all_embeddings = []
for batch in batch_items(chunks, batch_size):
    # Use a name that does not shadow the batching helper
    texts = [chunk.page_content for chunk in batch]
    all_embeddings.extend(embeddings.embed_documents(texts))
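
The `chunk_size=1000` and `chunk_overlap=200` settings above mean that neighbouring chunks share some text, so a sentence cut at a chunk boundary still appears whole in one of them. The sketch below illustrates the idea with a plain sliding window over characters. It is a simplification: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph and line breaks, and `split_with_overlap` is a hypothetical helper, not part of LangChain.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Cut `text` into windows of at most `chunk_size` characters.

    Each window starts `chunk_size - chunk_overlap` characters after the
    previous one, so consecutive windows share `chunk_overlap` characters.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# Small numbers make the overlap easy to see
sample = "abcdefghijklmnopqrstuvwxyz"
pieces = split_with_overlap(sample, chunk_size=10, chunk_overlap=4)
print(pieces)
```

With these toy settings, the last four characters of each piece reappear at the start of the next one.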

## 6- Store chunks in a FAISS vector store

In [None]:
#  Store in FAISS
vector_store = FAISS.from_embeddings(
    text_embeddings=list(zip([chunk.page_content for chunk in chunks], all_embeddings)),
    embedding=embeddings
)

retriever = vector_store.as_retriever(search_kwargs={"k": 3})  # Retrieve top 3 relevant chunks
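
The retriever above returns the three chunks whose embeddings lie closest to the query embedding. Below is a minimal pure-Python sketch of that top-k search. It assumes Euclidean distance (which, as far as I know, is the default for LangChain's FAISS wrapper) and uses made-up 2-D vectors instead of real embeddings. `top_k_nearest` is a hypothetical helper; FAISS does the same job in optimized native code.

```python
def top_k_nearest(query, vectors, k=3):
    """Return the indices of the k vectors closest to `query` (squared Euclidean distance)."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Rank every stored vector by its distance to the query, nearest first
    ranked = sorted(range(len(vectors)), key=lambda i: squared_distance(query, vectors[i]))
    return ranked[:k]


# Toy "embeddings": three points sit near (1, 1), two are far away
toy_vectors = [[0.0, 0.0], [1.0, 1.0], [0.9, 1.1], [5.0, 5.0], [1.2, 0.8]]
nearest = top_k_nearest([1.0, 1.0], toy_vectors, k=3)
print(nearest)
```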

## 7- Create memory

In [None]:
#  Initialize memory
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)


## 8- Define a prompt template

In [None]:
# Define the prompt template for financial advice
finance_template = PromptTemplate(
    input_variables=["context", "question", "chat_history"],
    template="""
You are an expert financial advisor. Use the chat history and retrieved context to answer the question in a conversational manner.

Chat History:
{chat_history}

Context:
{context}

Question:
{question}

Answer:
"""
)
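
To make the three placeholders concrete, here is a small sketch of how such a template gets filled before it is sent to the model. It uses plain `str.format` and a shortened copy of the template. Joining the retrieved chunks with a blank line is my understanding of what the "stuff" chain does by default. `render_prompt` and the sample strings are hypothetical.

```python
TEMPLATE = (
    "Chat History:\n{chat_history}\n\n"
    "Context:\n{context}\n\n"
    "Question:\n{question}\n\n"
    "Answer:"
)


def render_prompt(chat_history, context_chunks, question):
    """Join the retrieved chunks into one context string and fill the placeholders."""
    context = "\n\n".join(context_chunks)
    return TEMPLATE.format(chat_history=chat_history, context=context, question=question)


prompt = render_prompt(
    chat_history="Human: Hi\nAI: Hello! How can I help?",
    context_chunks=["Pay yourself first.", "Track every expense for a month."],
    question="How do I start saving?",
)
print(prompt)
```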


## 9- Initialize the LLM (DeepSeek)

In [None]:
# Initialize the Fireworks LLM, reading the key from the environment
llm = Fireworks(
    api_key=os.environ["FIREWORKS_API_KEY"],
    model="accounts/fireworks/models/deepseek-v3",
    max_tokens=1024
)


## 10- Create conversational RAG pipeline

In [None]:
# Create the conversational RAG pipeline
conversational_rag = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=retriever,
    chain_type="stuff",
    memory=memory,
    combine_docs_chain_kwargs={"prompt": finance_template}
)
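
Conceptually, each call to the chain above goes through three steps: rewrite the follow-up question into a standalone one using the chat history, retrieve chunks for that question, then generate an answer from the chunks. The sketch below mirrors that flow with stub functions in place of the LLM and the retriever. All names here are hypothetical, and the real `ConversationalRetrievalChain` has more options than this shows.

```python
def conversational_rag_step(question, history, rewrite, retrieve, generate):
    """Run one simplified turn of a conversational RAG loop.

    rewrite(question, history)          -> standalone question
    retrieve(standalone_question)       -> list of text chunks
    generate(history, chunks, question) -> answer string
    """
    # With no history there is nothing to resolve, so the question is used as-is
    standalone = rewrite(question, history) if history else question
    chunks = retrieve(standalone)
    answer = generate(history, chunks, standalone)
    history.append((question, answer))  # remember the turn for the next call
    return answer


# Stubs standing in for the LLM and the vector store
def fake_rewrite(question, history):
    return f"{question} (follow-up to: {history[-1][0]})"

def fake_retrieve(question):
    return ["Automate transfers to a savings account."]

def fake_generate(history, chunks, question):
    return f"Based on {len(chunks)} chunk(s): {chunks[0]}"


history = []
first = conversational_rag_step("How can I save money?", history, fake_rewrite, fake_retrieve, fake_generate)
second = conversational_rag_step("Which tip is best?", history, fake_rewrite, fake_retrieve, fake_generate)
print(first)
print(len(history))
```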

## Example follow-up questions

In [None]:
query_1 = "What are the best strategies for saving money?"
response_1 = conversational_rag.invoke({"question": query_1})
print(response_1["answer"])



In [None]:
query_2 = "Just choose the best two strategies from the previous question"
response_2 = conversational_rag.invoke({"question": query_2})
print(response_2["answer"])

## 11- Load CSV dataset

In [None]:
df = pd.read_csv("cleand_finance_data.csv")

In [None]:
df.head()

## 12- Convert dataset into documents

In [None]:
documents = df.apply(lambda row: f"Date: {row['Date']}, Description: {row['Description']}, "
                                 f"Debit: {row['Debit']}, Credit: {row['Credit']}, Amount: {row['Amount']}, "
                                 f"Sub-category: {row['sub-category']}, Category: {row['Category']}, "
                                 f"Category Type: {row['Category Type']}", axis=1).tolist()
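
Each transaction row becomes one short text string, so it can be embedded the same way as a PDF chunk. The sketch below shows the same idea with only the standard library. The column names come from the cell above, but the two rows are made-up sample data and `row_to_text` is a hypothetical helper.

```python
import csv
import io

# Made-up rows, only to show the shape of the output
SAMPLE_CSV = """Date,Description,Debit,Credit,Amount,sub-category,Category,Category Type
2024-01-05,Grocery store,54.20,0,-54.20,Groceries,Food,Expense
2024-01-31,Salary,0,3000,3000,Paycheck,Income,Income
"""


def row_to_text(row):
    """Serialize one CSV row as 'column: value' pairs separated by commas."""
    return ", ".join(f"{column}: {value}" for column, value in row.items())


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
texts = [row_to_text(row) for row in rows]
print(texts[0])
```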


## 13- Generate embeddings 

In [None]:
csv_embeddings = embeddings.embed_documents(documents)

## 14- Create vector store

In [None]:
# Create FAISS vector database
csv_vector_store = FAISS.from_embeddings(
    text_embeddings=list(zip(documents, csv_embeddings)),
    embedding=embeddings
)

# Create a retriever for searching
csv_retriever = csv_vector_store.as_retriever(search_kwargs={"k": 3})

## 15- Merge CSV and PDF retrieval

In [None]:
# Combine both retrievers (PDF and CSV)
combined_retriever = EnsembleRetriever(retrievers=[retriever, csv_retriever], weights=[0.5, 0.5])
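
The equal weights mean that PDF and CSV results count the same when the two ranked lists are merged. To my understanding, `EnsembleRetriever` merges them with weighted Reciprocal Rank Fusion, where each result earns `weight / (rank + c)` from every list it appears in. Here is a minimal sketch of that scoring, with the constant `c=60` as an assumption and made-up result labels. `weighted_rrf` is a hypothetical helper.

```python
from collections import defaultdict


def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse several ranked lists into one, best first, using weighted RRF."""
    scores = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc in enumerate(ranking, start=1):
            # Higher positions (smaller rank) contribute a larger share
            scores[doc] += weight / (rank + c)
    return sorted(scores, key=scores.get, reverse=True)


pdf_hits = ["budgeting tip", "emergency fund", "50/30/20 rule"]
csv_hits = ["rent payment", "budgeting tip", "grocery purchase"]
fused = weighted_rrf([pdf_hits, csv_hits], weights=[0.5, 0.5])
print(fused)
```

A result that appears in both lists, like "budgeting tip" here, collects score from each and rises to the top.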


## 16- Update pipeline

In [None]:
#  RAG Pipeline
conversational_rag = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=combined_retriever,
    chain_type="stuff",
    memory=memory,
    combine_docs_chain_kwargs={"prompt": finance_template}
)

In [None]:
query = "How much did I spend last month?"
response = conversational_rag.invoke({"question": query})
print(response["answer"])
