<a href="https://colab.research.google.com/github/Shashwat-KT/RAG/blob/main/RAG_for_Finance.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RAG for Finance - Introduction



### Mounting Google Drive

In [11]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Installing Necessary Libraries

In [1]:
!pip install langchain langchain-core langchain-groq langchain-community chromadb pypdf langchain-huggingface xformers
!pip uninstall transformers -y
!pip install transformers
!pip uninstall torch torchvision -y
!pip install torch torchvision

Found existing installation: transformers 4.48.2
Uninstalling transformers-4.48.2:
  Successfully uninstalled transformers-4.48.2
Collecting transformers
  Using cached transformers-4.48.2-py3-none-any.whl.metadata (44 kB)
Using cached transformers-4.48.2-py3-none-any.whl (9.7 MB)
Installing collected packages: transformers
Successfully installed transformers-4.48.2
Found existing installation: torch 2.6.0
Uninstalling torch-2.6.0:
  Successfully uninstalled torch-2.6.0
Found existing installation: torchvision 0.20.1+cu124
Uninstalling torchvision-0.20.1+cu124:
  Successfully uninstalled torchvision-0.20.1+cu124
Collecting torch
  Using cached torch-2.6.0-cp311-cp311-manylinux1_x86_64.whl.metadata (28 kB)
Collecting torchvision
  Downloading torchvision-0.21.0-cp311-cp311-manylinux1_x86_64.whl.metadata (6.1 kB)
Using cached torch-2.6.0-cp311-cp311-manylinux1_x86_64.whl (766.7 MB)
Downloading torchvision-0.21.0-cp311-cp311-manylinux1_x86_64.whl (7.2 MB)

### Importing Necessary Libraries

In [2]:
# LLM Service and Embeddings
from langchain_groq import ChatGroq
from langchain_huggingface import HuggingFaceEmbeddings

# Document Processing
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma

# Chain Management
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain
from langchain import hub

### Defining Constants

In [18]:
from google.colab import userdata
# import chromadb

GROQ_API_KEY = userdata.get('GROQ_API_KEY')
MODEL_ID = "deepseek-r1-distill-llama-70b"
PDF_PATH = "/content/drive/MyDrive/interviewclub_rag_for_finance/workshop_exercise/10-Q4-2024-As-Filed.pdf"

# # Initialize ChromaDB client
# db = chromadb.PersistentClient(path="./chroma_db")

### Setting Up Chat Model Inference

In [19]:
llm = ChatGroq(model=MODEL_ID, api_key=GROQ_API_KEY)

# Example usage - Invoking
# result = llm.invoke("Explain the Fibonacci sequence.")
# print(result.content)

# Example usage - Streaming
# for chunk in llm.stream("Explain the Fibonacci sequence."):
#   print(chunk.content, end="")

### Loading PDF documents

In [29]:
# Reuse the PDF_PATH constant defined above instead of repeating the path
pdf_loader = PyPDFLoader(PDF_PATH)
documents = pdf_loader.load()

In [30]:
len(documents)

121

### Splitting the Document into Manageable Chunks

In [25]:
# Split text into chunks
def split_documents(documents):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=100
    )
    return text_splitter.split_documents(documents)

In [31]:
docs = split_documents(documents)

To learn more about text splitters, see: https://medium.com/@harsh.vardhan7695/mastering-text-splitting-in-langchain-735313216e01
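To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a minimal sketch of a fixed-size sliding window over characters. It is a simplification and not how `RecursiveCharacterTextSplitter` works internally, since that splitter first tries to break on paragraph, line, and word boundaries. The function name `chunk_text` and the sample string are made up for illustration.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Simplified illustration only; it ignores separators entirely.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
```

Each chunk repeats the last three characters of the previous one, which is the role `chunk_overlap=100` plays in the cell above: a sentence cut at a chunk boundary still appears whole in at least one chunk, as long as it is shorter than the overlap.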

### Loading Embeddings from HuggingFace model hub

In [32]:
model_name = "NovaSearch/stella_en_400M_v5"
model_kwargs = {'device': 'cuda:0', 'trust_remote_code': True}
encode_kwargs = {'normalize_embeddings': False}
hf = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



A new version of the following files was downloaded from https://huggingface.co/NovaSearch/stella_en_400M_v5:
- configuration.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.



A new version of the following files was downloaded from https://huggingface.co/NovaSearch/stella_en_400M_v5:
- modeling.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.



Some weights of the model checkpoint at NovaSearch/stella_en_400M_v5 were not used when initializing NewModel: {'new.pooler.dense.weight', 'new.pooler.dense.bias'}
- This IS expected if you are initializing NewModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing NewModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



Model selected based on: https://huggingface.co/spaces/mteb/leaderboard
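The `normalize_embeddings: False` setting above means the vectors are not rescaled to unit length when they are encoded. Whether this matters depends on the distance metric the vector store is configured with. The sketch below uses toy 2-D vectors (hypothetical values, not real model output) to show that a raw dot product is sensitive to vector length, while cosine similarity, or a dot product on normalized vectors, is not.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [6.0, 8.0]  # same direction as a, twice the length

raw_dot = dot(a, b)                                      # grows with vector length
normalized_dot = dot(l2_normalize(a), l2_normalize(b))   # length no longer matters
cos = cosine_similarity(a, b)

print(raw_dot, normalized_dot, cos)
```

If retrieval quality looks off, comparing results with `normalize_embeddings` set to `True` is a cheap experiment to run.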

In [39]:
len(docs)

5

### Creating and Storing Embeddings in ChromaDB

In [33]:
# Create embeddings and store them in ChromaDB
def create_vectorstore(documents):
    # Embed each chunk with the HuggingFace model (`hf`) and persist the index to disk
    persist_directory = "chroma_db"
    vectorstore = Chroma.from_documents(
        documents,
        hf,
        persist_directory=persist_directory,
        collection_name="financial_docs",
    )
    return vectorstore

In [34]:
vectorstore = create_vectorstore(docs)

To confirm that the embedding model works correctly, run a similarity search against the vector store:

In [40]:
query = "What is Apple’s current ratio over the last 3 years, and how has it changed?"
# Note: this rebinds `docs` from the list of chunks to the top-10 search results
docs = vectorstore.similarity_search(query, k=10)
for doc in docs:
    print(doc.page_content, "----sep-----")

2023 spanned 53 weeks.
Macroeconomic Conditions
Macroeconomic conditions, including inflation, interest rates and currency fluctuations, have directly and indirectly impacted, and 
could in the future materially impact, the Company’s results of operations and financial condition.
Apple Inc. | 2024 Form 10-K | 21 ----sep-----
Company Stock Performance
The following graph shows a comparison of five-year cumulative total shareholder return, calculated on a dividend-reinvested 
basis, for the Company, the S&P 500 Index and the Dow Jones U.S. Technology Supersector Index. The graph assumes $100 
was invested in each of the Company’s common stock, the S&P 500 Index and the Dow Jones U.S. Technology Supersector 
Index as of the market close on September 27, 2019. Past stock price performance is not necessarily indicative of future stock 
price performance.
COMPARISON OF 5-YEAR CUMULATIVE TOTAL RETURN
Among Apple Inc., the S&P 500 Index and the Dow Jones U.S. Technology Supersector Index
Apple

In [42]:
docs[0]

Document(metadata={'page': 23, 'page_label': '24', 'source': '/content/drive/MyDrive/interviewclub_rag_for_finance/workshop_exercise/10-Q4-2024-As-Filed.pdf'}, page_content='2023 spanned 53 weeks.\nMacroeconomic Conditions\nMacroeconomic conditions, including inflation, interest rates and currency fluctuations, have directly and indirectly impacted, and \ncould in the future materially impact, the Company’s results of operations and financial condition.\nApple Inc. | 2024 Form 10-K | 21')

In [41]:
len(docs)

10
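Conceptually, a similarity search embeds the query, scores it against every stored chunk, and returns the `k` best matches. The sketch below does this by brute force with cosine similarity on hypothetical 2-D vectors. Real vector stores typically use an approximate index rather than scanning everything, so this only illustrates the idea; `top_k` and the sample data are made up.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def top_k(query_vec, indexed, k):
    """Return the k (text, score) pairs most similar to query_vec."""
    scored = [(text, cosine(query_vec, vec)) for text, vec in indexed]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical 2-D "embeddings"; real ones have hundreds of dimensions.
indexed = [
    ("balance sheet", [0.9, 0.1]),
    ("stock performance", [0.1, 0.9]),
    ("current liabilities", [0.8, 0.3]),
]

results = top_k([1.0, 0.0], indexed, k=2)
for text, score in results:
    print(f"{score:.3f}  {text}")
```

In the output above, the first hits are about macroeconomic conditions and stock performance rather than the balance sheet, which suggests the ranking step is worth inspecting before blaming the LLM for a weak answer.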

### Setting Up Retrieval Chain

In [36]:
retrieval_qa_chat_prompt = hub.pull("langchain-ai/retrieval-qa-chat")
combine_docs_chain = create_stuff_documents_chain(
    llm, retrieval_qa_chat_prompt
)

retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 10})
retrieval_chain = create_retrieval_chain(retriever, combine_docs_chain)



Source of the prompt: https://smith.langchain.com/hub/langchain-ai/retrieval-qa-chat
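The "stuff" strategy used by the chain above is the simplest way to combine documents: all retrieved chunks are placed into a single prompt as context. A rough sketch of the idea follows, in plain Python. The prompt wording, the separator, and the helper names are assumptions for illustration and are not the text of the hub prompt or LangChain's implementation.

```python
def stuff_documents(chunks: list[str], separator: str = "\n\n") -> str:
    """Join all retrieved chunks into one context string."""
    return separator.join(chunks)


def build_prompt(question: str, chunks: list[str]) -> str:
    """Assemble a single prompt from the context and the question."""
    context = stuff_documents(chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Hypothetical chunk texts standing in for retrieved documents.
prompt = build_prompt(
    "How is the current ratio calculated?",
    ["Total current assets ...", "Total current liabilities ..."],
)
print(prompt)
```

Because every chunk goes into one prompt, `k=10` chunks of up to 1000 characters each add up to about 10,000 characters of context, which has to fit within the model's context window.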

Regular Output

In [37]:
results = retrieval_chain.invoke({"input": "What is Apple’s current ratio over the last 3 years, and how has it changed?"})
print(results['answer'])

<think>
Okay, so I need to figure out Apple's current ratio over the last three years and see how it has changed. The current ratio is a financial metric that helps assess a company's ability to pay its short-term debts using its short-term assets. The formula for the current ratio is current assets divided by current liabilities.

First, I'll look through the provided context to find the necessary data. I see that the context includes Apple's consolidated balance sheets for September 28, 2024, and September 30, 2023. However, I don't see the data for 2022. Wait, maybe it's there somewhere else. Let me check again.

Looking through the context, I notice that in the "CONSOLIDATED BALANCE SHEETS" section, there are figures for September 28, 2024, and September 30, 2023. The assets and liabilities for those two years are provided, but 2022 is only mentioned in the notes, not in the balance sheet. So, I might only have data for 2023 and 2024, but not for 2022.

Wait, perhaps the 2022 data 
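The answer above begins with a `<think>` block holding the model's reasoning. If only the final answer should be shown, the block can be removed afterwards. The sketch below assumes the reasoning is wrapped in `<think>...</think>` tags; it also drops an unclosed block, which can happen when the output is cut off. The helper name `strip_reasoning` and the sample string are hypothetical.

```python
import re

# Match a <think> block up to its closing tag, or to the end of the
# string if the closing tag is missing (for example, truncated output).
THINK_BLOCK = re.compile(r"<think>.*?(?:</think>|\Z)", re.DOTALL)


def strip_reasoning(answer: str) -> str:
    """Remove <think>...</think> reasoning and surrounding whitespace."""
    return THINK_BLOCK.sub("", answer).strip()


sample = "<think>\nLet me check the balance sheet.\n</think>\n\nFinal answer goes here."
print(strip_reasoning(sample))
```

With the chain above, this would be applied as `strip_reasoning(results['answer'])`.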

Streaming output

In [46]:
for item in retrieval_chain.stream({"input": "What is Apple’s current ratio over the last 3 years, and how has it changed?"}):
  if item.get('answer'):
    print(item.get('answer'), end="")

  # To also print the retrieved reference documents, uncomment the lines below:
  # if item.get('context'):
  #   print(item.get('context'), end="")

<think>
Alright, so I need to figure out Apple's current ratio over the last three years and see how it has changed. The current ratio is a financial metric that helps assess a company's ability to pay short-term liabilities with its short-term assets. The formula for the current ratio is current assets divided by current liabilities.

Looking at the context provided, I see that the user has included Apple's 2024 Form 10-K. The relevant sections are the Consolidated Balance Sheets, which list the current assets and current liabilities for September 28, 2024, and September 30, 2023. However, the context only provides two years of data: 2024 and 2023. To get three years, I might need to infer or check if there's data for 2022, but from the context, I don't see 2022's current assets and liabilities explicitly stated except in the segment information and perhaps in the notes.

Wait, in the Notes to Consolidated Financial Statements under Note 1, it mentions that fiscal years 2024 and 2022 
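When streaming, each item is a dictionary that carries only part of the result, so the full answer has to be reassembled if it is needed later. The sketch below shows one way to do that. `fake_stream` is a hypothetical stand-in for `retrieval_chain.stream(...)` so the example runs without an API key, and the chunk contents are invented.

```python
def fake_stream():
    """Stand-in for retrieval_chain.stream(...); yields dict chunks."""
    yield {"input": "What is the current ratio?"}
    yield {"context": ["chunk A", "chunk B"]}
    for piece in ["The ", "current ", "ratio ", "is ..."]:
        yield {"answer": piece}


def collect(stream):
    """Gather streamed answer pieces and the retrieved context."""
    answer_parts, context = [], []
    for item in stream:
        if "context" in item:
            context = item["context"]
        if item.get("answer"):
            answer_parts.append(item["answer"])
    return "".join(answer_parts), context


answer, context = collect(fake_stream())
print(answer)
print(len(context), "context chunks")
```

Keeping the context alongside the answer makes it possible to show sources to the user after the stream finishes.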