# RAG for Finance - Introduction



### Installing Necessary Libraries

In [2]:
!pip install langchain langchain-core langchain-groq langchain-community chromadb pypdf langchain-huggingface xformers python-dotenv
!pip uninstall transformers -y
!pip install transformers
!pip uninstall torch torchvision -y
!pip install torch torchvision

Collecting langchain-groq
  Downloading langchain_groq-0.2.4-py3-none-any.whl.metadata (3.0 kB)
Collecting langchain-community
  Downloading langchain_community-0.3.16-py3-none-any.whl.metadata (2.9 kB)
Collecting chromadb
  Downloading chromadb-0.6.3-py3-none-any.whl.metadata (6.8 kB)
Collecting pypdf
  Downloading pypdf-5.2.0-py3-none-any.whl.metadata (7.2 kB)
Collecting langchain-huggingface
  Downloading langchain_huggingface-0.1.2-py3-none-any.whl.metadata (1.3 kB)
Collecting xformers
  Downloading xformers-0.0.29.post2-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (1.0 kB)
Collecting groq<1,>=0.4.1 (from langchain-groq)
  Downloading groq-0.16.0-py3-none-any.whl.metadata (14 kB)
Collecting langchain-core
  Downloading langchain_core-0.3.33-py3-none-any.whl.metadata (6.3 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain-community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting httpx-sse<0.5.0,>=0.4.0 (from langchain-community)
  Downlo

Found existing installation: transformers 4.47.1
Uninstalling transformers-4.47.1:
  Successfully uninstalled transformers-4.47.1
Collecting transformers
  Downloading transformers-4.48.2-py3-none-any.whl.metadata (44 kB)
Downloading transformers-4.48.2-py3-none-any.whl (9.7 MB)
Installing collected packages: transformers
Successfully installed transformers-4.48.2
Found existing installation: torch 2.6.0
Uninstalling torch-2.6.0:
  Successfully uninstalled torch-2.6.0
Found existing installation: torchvision 0.20.1+cu124
Uninstalling torchvision-0.20.1+cu124:
  Successfully uninstalled torchvision-0.20.1+cu124
Collecting torch
  Using cached torch-2.6.0-cp311-cp311-manylinux1_x86_64.whl.metadata (28 kB)
Collecting torchvision
  Downloading torchvi

### Importing Necessary Libraries

In [1]:
# LLM Service and Embeddings
from langchain_groq import ChatGroq
from langchain_huggingface import HuggingFaceEmbeddings

# Document Processing
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma

# Chain Management
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain
from langchain import hub

### Defining Constants

In [8]:
import os
from dotenv import load_dotenv

# Load .env file
load_dotenv('GROQ_API_KEY.env')  # Update with correct path if needed

# Access the key
GROQ_API_KEY = os.getenv('GROQ_API_KEY')

if GROQ_API_KEY is None:
    raise Exception("GROQ_API_KEY not found. Check your .env file.")

print("API Key loaded successfully!")


API Key loaded successfully!
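
If the `.env` file is missing (for example on a fresh runtime), one option is to fall back to an interactive prompt instead of failing immediately. The sketch below is not part of the original notebook: `resolve_api_key` is a hypothetical helper, and its `prompt` argument defaults to `getpass` so that a typed key is not echoed to the screen.

```python
import os
from getpass import getpass
from typing import Callable, Optional


def resolve_api_key(name: str = "GROQ_API_KEY",
                    prompt: Optional[Callable[[str], str]] = getpass) -> str:
    """Return the key from the environment, or ask for it interactively."""
    key = os.getenv(name)
    if not key and prompt is not None:
        # Fall back to a hidden prompt when the variable is not set.
        key = prompt(f"Enter {name}: ").strip()
    if not key:
        raise RuntimeError(f"{name} not found in the environment or prompt.")
    return key
```

After `load_dotenv(...)`, this could be called as `GROQ_API_KEY = resolve_api_key()`.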


In [9]:
MODEL_ID = "deepseek-r1-distill-llama-70b"
PDF_PATH = "10-Q4-2024-As-Filed.pdf"


### Setting Up Chat Model Inference

In [12]:
llm = ChatGroq(model=MODEL_ID, api_key=GROQ_API_KEY)

# Example usage: invoking (returns the full response at once)
result = llm.invoke("Solve me fibonacci sequence?")
print(result.content)

# Example usage: streaming (prints the response as it is generated)
for chunk in llm.stream("Solve me fibonacci sequence?"):
    print(chunk.content, end="")

<think>
First, I recognize that the Fibonacci sequence is a series of numbers where each number is the sum of the two preceding ones, starting from 0 and 1.

To generate the sequence, I'll begin with the initial terms, \( F_0 = 0 \) and \( F_1 = 1 \).

Next, I'll use the recurrence relation \( F_n = F_{n-1} + F_{n-2} \) to compute each subsequent term.

By applying this relation iteratively, I can generate the Fibonacci sequence up to the desired term.

Finally, I'll present the sequence in a clear and organized manner for the user to understand.
</think>

Sure! Let's explore the Fibonacci sequence step by step.

### What is the Fibonacci Sequence?

The **Fibonacci sequence** is a series of numbers where each number is the sum of the two preceding ones, usually starting with 0 and 1. It begins like this:

0, 1, 1, 2, 3, 5, 8, 13, 21, 34, ...

### Key Features of the Fibonacci Sequence

1. **Initial Terms:**
   - \( F_0 = 0 \)
   - \( F_1 = 1 \)

2. **Recurrence Relation:**
   - \( F_n 
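
The response above opens with a `<think>...</think>` reasoning block before the actual answer. If only the answer is needed, the block can be removed with a regular expression. This is a minimal sketch, assuming the tags always appear as a complete pair; `strip_reasoning` is a hypothetical helper, not part of LangChain or Groq.

```python
import re

# Matches a complete <think>...</think> block, including newlines inside it.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)


def strip_reasoning(text: str) -> str:
    """Remove <think>...</think> blocks and return the remaining answer."""
    return THINK_BLOCK.sub("", text).strip()


sample = "<think>\nPlan the steps.\n</think>\n\nSure! Let's explore."
print(strip_reasoning(sample))
```

In the notebook this could be applied as `print(strip_reasoning(result.content))`. A response that is cut off before the closing `</think>` tag is left unchanged by this pattern.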

### Loading PDF Documents

In [13]:
pdf_loader = PyPDFLoader(PDF_PATH)
documents = pdf_loader.load()

In [14]:
len(documents)

121

### Splitting the Document into Manageable Chunks

In [15]:
# Split text into chunks
def split_documents(documents):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=100
    )
    return text_splitter.split_documents(documents)

In [16]:
docs = split_documents(documents)

To learn more about text splitters, see: https://medium.com/@harsh.vardhan7695/mastering-text-splitting-in-langchain-735313216e01
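
To build intuition for `chunk_size=1000` and `chunk_overlap=100`, the sketch below cuts a string into fixed-size windows that share 100 characters with their neighbor. It is a deliberately simplified stand-in: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs and line breaks, which this hypothetical `sliding_window_chunks` function ignores.

```python
def sliding_window_chunks(text: str, chunk_size: int = 1000,
                          chunk_overlap: int = 100) -> list[str]:
    """Cut text into fixed-size windows that share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


chunks = sliding_window_chunks("abcdefghij" * 250)  # 2,500 characters
print([len(c) for c in chunks])
```

The overlap means the last 100 characters of one chunk reappear at the start of the next, so a sentence that straddles a boundary is still seen whole in at least one chunk.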

### Loading Embeddings from the Hugging Face Model Hub

In [17]:
model_name = "NovaSearch/stella_en_400M_v5"
model_kwargs = {'device': 'cuda:0', 'trust_remote_code': True}
encode_kwargs = {'normalize_embeddings': False}
hf = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


A new version of the following files was downloaded from https://huggingface.co/NovaSearch/stella_en_400M_v5:
- configuration.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


A new version of the following files was downloaded from https://huggingface.co/NovaSearch/stella_en_400M_v5:
- modeling.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


Some weights of the model checkpoint at NovaSearch/stella_en_400M_v5 were not used when initializing NewModel: {'new.pooler.dense.weight', 'new.pooler.dense.bias'}
- This IS expected if you are initializing NewModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing NewModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


The embedding model was selected based on the MTEB leaderboard: https://huggingface.co/spaces/mteb/leaderboard
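
The `encode_kwargs` above set `normalize_embeddings` to `False`, so the vectors keep their original length. Whether that matters depends on the distance metric configured for the vector store, which is worth checking. The sketch below uses toy 2-D vectors (illustrative values, not real embeddings) to show that a raw dot product is sensitive to vector length, while the dot product of normalized vectors equals cosine similarity.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))


query = [1.0, 0.0]
short_doc = [0.9, 0.1]  # nearly the same direction as the query
long_doc = [3.0, 3.0]   # different direction, but larger magnitude

# The raw dot product favors the longer vector.
print(dot(query, short_doc), dot(query, long_doc))
# After normalization, the dot product matches cosine similarity.
print(dot(l2_normalize(query), l2_normalize(short_doc)), cosine(query, short_doc))
```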

### Creating and Storing Embeddings in ChromaDB

In [18]:
# Create embeddings and store them in ChromaDB
def create_vectorstore(documents):
    persist_directory = "chroma_db"
    vectorstore = Chroma.from_documents(
        documents,
        hf,  # HuggingFaceEmbeddings instance defined above
        persist_directory=persist_directory,
        collection_name="financial_docs",
    )
    return vectorstore

In [19]:
vectorstore = create_vectorstore(docs)

To confirm that the embedding model and vector store work correctly, run a quick similarity search:


In [20]:
query = "What is Apple’s current ratio over the last 3 years, and how has it changed?"
# Use a separate name so the chunk list `docs` is not overwritten
retrieved_docs = vectorstore.similarity_search(query, k=5)
for doc in retrieved_docs:
    print(doc.page_content, "----sep-----")

2023 spanned 53 weeks.
Macroeconomic Conditions
Macroeconomic conditions, including inflation, interest rates and currency fluctuations, have directly and indirectly impacted, and 
could in the future materially impact, the Company’s results of operations and financial condition.
Apple Inc. | 2024 Form 10-K | 21 ----sep-----
Company Stock Performance
The following graph shows a comparison of five-year cumulative total shareholder return, calculated on a dividend-reinvested 
basis, for the Company, the S&P 500 Index and the Dow Jones U.S. Technology Supersector Index. The graph assumes $100 
was invested in each of the Company’s common stock, the S&P 500 Index and the Dow Jones U.S. Technology Supersector 
Index as of the market close on September 27, 2019. Past stock price performance is not necessarily indicative of future stock 
price performance.
COMPARISON OF 5-YEAR CUMULATIVE TOTAL RETURN
Among Apple Inc., the S&P 500 Index and the Dow Jones U.S. Technology Supersector Index
Apple

### Setting Up Retrieval Chain

In [21]:
retrieval_qa_chat_prompt = hub.pull("langchain-ai/retrieval-qa-chat")
combine_docs_chain = create_stuff_documents_chain(
    llm, retrieval_qa_chat_prompt
)

retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 10})
retrieval_chain = create_retrieval_chain(retriever, combine_docs_chain)



Source of the prompt: https://smith.langchain.com/hub/langchain-ai/retrieval-qa-chat
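
The "stuff" strategy used by `create_stuff_documents_chain` places all retrieved chunks into a single prompt. The plain-Python sketch below illustrates the idea only: `stuff_documents` and its template text are hypothetical stand-ins, not the hub prompt or LangChain's internal implementation.

```python
def stuff_documents(chunks: list[str], question: str,
                    separator: str = "\n\n") -> str:
    """Join every retrieved chunk into one context block and build a prompt."""
    context = separator.join(chunks)
    # Hypothetical template; the real prompt is pulled from the LangChain hub.
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


prompt = stuff_documents(
    ["Total current assets: 152,987", "Total current liabilities: 176,392"],
    "What is the current ratio?",
)
print(prompt)
```

With `k=10` and a chunk size of 1000 characters, the stuffed context can reach roughly 10,000 characters, which is worth keeping in mind against the model's context window.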

Regular Output

In [22]:
results = retrieval_chain.invoke({"input": "What is Apple’s current ratio over the last 3 years, and how has it changed?"})
print(results['answer'])

<think>
Alright, I need to figure out Apple's current ratio over the last three years and how it has changed. The current ratio is calculated by dividing current assets by current liabilities. 

First, I'll look for the current assets and current liabilities in the provided context. From the consolidated balance sheets, I can see the figures for September 28, 2024, and September 30, 2023. However, I also need the data for 2022, which isn't directly provided in the context. Since the context doesn't include the 2022 balance sheet, I might not be able to calculate the current ratio for that year. 

For 2023 and 2024, the current assets and liabilities are as follows:

- 2023: Current assets = $143,566 million, Current liabilities = $145,308 million
- 2024: Current assets = $152,987 million, Current liabilities = $176,392 million

I'll calculate the current ratio for each year by dividing current assets by current liabilities.

For 2023: 143,566 / 145,308 ≈ 0.988
For 2024: 152,987 / 176,3
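
The arithmetic in the model's reasoning can be checked independently. The sketch below recomputes the current ratio from the figures quoted in the answer above (USD millions); the figures themselves are copied from the model output and should be verified against the filing.

```python
# Figures in USD millions, copied from the model's answer above.
balance_sheet = {
    2023: {"current_assets": 143_566, "current_liabilities": 145_308},
    2024: {"current_assets": 152_987, "current_liabilities": 176_392},
}


def current_ratio(current_assets: float, current_liabilities: float) -> float:
    """Current ratio = current assets / current liabilities."""
    return current_assets / current_liabilities


ratios = {year: current_ratio(**row) for year, row in balance_sheet.items()}
for year, ratio in ratios.items():
    print(f"{year}: {ratio:.3f}")
# 2023: 0.988
# 2024: 0.867
```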

Streaming Output

In [23]:
for item in retrieval_chain.stream({"input": "What is Apple’s current ratio over the last 3 years, and how has it changed?"}):
    # Only items that carry an 'answer' piece are printed
    if item.get('answer'):
        print(item.get('answer'), end="")

<think>
Alright, so I need to figure out Apple's current ratio over the last three years and see how it's changed. I'm a bit new to this, so I'll take it step by step. 

First, I remember that the current ratio is a financial metric that helps assess a company's ability to pay its short-term liabilities with its short-term assets. The formula is Current Assets divided by Current Liabilities. A higher ratio is generally better because it indicates the company can more easily cover its short-term debts.

Looking at the context provided, I need to find the figures for Current Assets and Current Liabilities for each of the last three years. The context includes the Consolidated Balance Sheets for September 28, 2024, and September 30, 2023. I also notice that the 2022 figures are mentioned in some sections, so I should check those as well.

Starting with 2024, the balance sheet shows Total current assets as $152,987 million and Total current liabilities as $176,392 million. So the current r
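
The streaming loop above prints answer pieces but does not keep them. A small helper can print the pieces as they arrive and also return the full text for later use. In the sketch below, `fake_stream` is a hypothetical stand-in for `retrieval_chain.stream(...)`; the shape of its items is an assumption modeled on the loop above, which skips any item without an `answer` key.

```python
from typing import Iterable, Iterator


def fake_stream() -> Iterator[dict]:
    """Hypothetical stand-in for retrieval_chain.stream(...)."""
    yield {"input": "What is the current ratio?"}
    yield {"context": ["chunk 1", "chunk 2"]}
    for piece in ["The current ", "ratio is ", "below 1."]:
        yield {"answer": piece}


def collect_answer(stream: Iterable[dict]) -> str:
    """Print answer pieces as they arrive and return the full text."""
    parts = []
    for item in stream:
        piece = item.get("answer")
        if piece:
            print(piece, end="", flush=True)
            parts.append(piece)
    print()  # finish the line once the stream ends
    return "".join(parts)


full_answer = collect_answer(fake_stream())
```

In the notebook, `collect_answer(retrieval_chain.stream({"input": query}))` would be the corresponding call.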