<a href="https://colab.research.google.com/github/Thomas-Xiang/agentic-AI/blob/main/finance_rag_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# FinanceGPT - RAG System Tutorial
## Learn by Building: A Complete RAG Pipeline from Scratch

In this notebook you will learn how to:

- Process PDF documents
- Create text embeddings (sentence-transformers library)
- Run vector search with FAISS
- Generate answers with a language model


## Part 1: Setup & Installation

In [1]:
# Install required packages
!pip install -q torch transformers sentence-transformers accelerate
!pip install -q faiss-cpu pypdf pandas numpy tqdm requests bitsandbytes

print("✅ All packages installed!")

✅ All packages installed!


### Check GPU Availability

In [2]:
import torch

print("CUDA Available:", torch.cuda.is_available())

if torch.cuda.is_available():
    print("GPU Name:", torch.cuda.get_device_name(0))
    device = "cuda"
else:
    print("⚠️  Using CPU")
    device = "cpu"

print(f"Using device: {device}")

CUDA Available: True
GPU Name: Tesla T4
Using device: cuda


### Import All Libraries

- **torch**: PyTorch - deep learning foundation
- **transformers**: Hugging Face models
- **sentence_transformers**: Embedding models
- **faiss**: Fast similarity search
- **pypdf**: PDF processing

In [3]:
# Core libraries
import torch
import numpy as np
import pandas as pd
from typing import List, Dict, Tuple


# Transformers
from huggingface_hub import login
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from sentence_transformers import SentenceTransformer


# FAISS for vector search
import faiss

# PDF and utilities
from pypdf import PdfReader
import os
import re
import pickle
import requests
from tqdm.auto import tqdm

print("✅ All libraries imported!")

✅ All libraries imported!


## Part 2: Download Financial Documents

In [4]:
# Create directories
os.makedirs("data/raw", exist_ok=True)
os.makedirs("data/processed", exist_ok=True)

# Download Warren Buffett's 2022 letter
url = "https://www.berkshirehathaway.com/letters/2022ltr.pdf"
output_path = "data/raw/buffett_letter_2022.pdf"

print(f"üì• Downloading...")
try:
    response = requests.get(url, timeout=30)
    with open(output_path, "wb") as f:
        f.write(response.content)
    print(f"‚úÖ Downloaded {os.path.getsize(output_path)/1024:.0f} KB")
except Exception as e:
    print(f"‚ùå Error: {e}")

üì• Downloading...
‚úÖ Downloaded 54 KB
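
The download cell writes whatever bytes the server returns, so it can be useful to confirm that the payload is really a PDF before parsing it. A minimal sketch, assuming a hypothetical helper `looks_like_pdf` (not part of the original notebook) that checks for the `%PDF-` signature at the start of the file:

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if the byte string starts with the PDF signature."""
    # PDF files begin with the ASCII marker "%PDF-" followed by a version number
    return data.startswith(b"%PDF-")


# Illustrative payloads (made up for this example)
pdf_bytes = b"%PDF-1.7\n..."
html_bytes = b"<html><body>403 Forbidden</body></html>"

pdf_ok = looks_like_pdf(pdf_bytes)
html_ok = looks_like_pdf(html_bytes)
print(pdf_ok, html_ok)
```

In the notebook, a check like this could be applied to `response.content` before the file is written.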


## Part 3: Document Processing

1. Extract text from PDF
2. Clean the text  
3. Split into chunks

In [5]:
# Load PDF
pdf_path = "data/raw/buffett_letter_2022.pdf"
reader = PdfReader(pdf_path)

print(f"üìä Pages: {len(reader.pages)}")

# Extract all text
all_text = ""
for page in tqdm(reader.pages, desc="Reading pages"):
    all_text += page.extract_text() + "\n\n"

print(f"‚úÖ Extracted {len(all_text):,} characters")

üìä Pages: 10


Reading pages:   0%|          | 0/10 [00:00<?, ?it/s]

‚úÖ Extracted 26,658 characters


In [6]:
# Clean text
print("üßπ Cleaning text...")

# Remove multiple spaces
cleaned_text = re.sub(r'\s+', ' ', all_text)

# Remove special characters
cleaned_text = re.sub(r'[^\w\s\.\,\!\?\-\(\)\[\]\:\$\%]', '', cleaned_text)

cleaned_text = cleaned_text.strip()

print(f"‚úÖ Cleaned: {len(cleaned_text):,} characters")

üßπ Cleaning text...
‚úÖ Cleaned: 26,417 characters
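
To see what the two regex passes do to a concrete string, here is a small standalone sketch. The wrapper name `clean_text` and the sample string are assumptions made for illustration; the patterns are the ones used in the cell above.

```python
import re


def clean_text(text: str) -> str:
    """Apply the same two regex passes as the cleaning cell."""
    # Pass 1: collapse whitespace runs (including newlines) into one space
    text = re.sub(r'\s+', ' ', text)
    # Pass 2: drop every character outside the allowed set
    text = re.sub(r'[^\w\s\.\,\!\?\-\(\)\[\]\:\$\%]', '', text)
    return text.strip()


sample = "Berkshire's  Performance vs.\nthe S&P 500"
cleaned_sample = clean_text(sample)
print(cleaned_sample)  # Berkshires Performance vs. the SP 500
```

The apostrophe and the ampersand are removed because neither is in the allowed set, which explains spellings such as "SP 500" in the chunks shown later.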


In [7]:
# Split into chunks
chunk_size = 1000
chunk_overlap = 200

print(f"Splitting (size={chunk_size}, overlap={chunk_overlap})...")

chunks = []
start = 0

while start < len(cleaned_text):
    end = start + chunk_size
    chunk_text = cleaned_text[start:end]

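    # Try to end at a sentence boundary
    if end < len(cleaned_text):
        last_period = max(
            chunk_text.rfind('.'),
            chunk_text.rfind('!'),
            chunk_text.rfind('?')
        )
        # Accept the boundary only if it lies past the overlap window;
        # otherwise `start = end - chunk_overlap` would not move forward
        if last_period > chunk_overlap:
            chunk_text = chunk_text[:last_period + 1]
            end = start + last_period + 1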

    chunks.append({
        'text': chunk_text.strip(),
        'chunk_id': len(chunks),
        'source': pdf_path
    })

    start = end - chunk_overlap

print(f"✅ Created {len(chunks)} chunks")
print(f"\nSample chunk:\n{chunks[0]['text'][:300]}...")

Splitting (size=1000, overlap=200)...
✅ Created 36 chunks

Sample chunk:
Berkshires Performance vs. the SP 500 Annual Percentage Change Year in Per-Share Market Value of Berkshire in SP 500 with Dividends Included 1965 ......................................................................... 49.5 10.0 1966 ....................................................................
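
The sliding-window logic can also be packaged as a reusable function. The sketch below is one possible formulation, not the notebook's exact code: the name `split_with_overlap` is hypothetical, and it adds two safeguards as assumptions (it rejects an overlap that is not smaller than the chunk size, and it stops once the end of the text is reached).

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int = 1000, overlap: int = 200) -> List[str]:
    """Split text into overlapping chunks, preferring sentence boundaries."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    pieces = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            window = text[start:end]
            # Position of the last sentence-ending character in the window
            boundary = max(window.rfind(c) for c in ".!?")
            # Only cut there if the next start still moves forward
            if boundary > overlap:
                end = start + boundary + 1
        pieces.append(text[start:end].strip())
        if end >= len(text):
            break  # the final chunk has been emitted
        start = end - overlap
    return pieces


# Toy input: twenty short sentences
sample = " ".join(f"Sentence number {i} ends here." for i in range(20))
pieces = split_with_overlap(sample, chunk_size=100, overlap=20)
print(len(pieces), "chunks; first:", repr(pieces[0]))
```

Because a boundary is accepted only when it lies beyond the overlap window, each iteration is guaranteed to advance `start`.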


## 🔢 Part 4: Creating Embeddings

Convert text to numerical vectors using sentence-transformers.

In [8]:
# Load embedding model
print("Loading embedding model...")

model_name = "all-MiniLM-L6-v2"
embedding_model = SentenceTransformer(model_name)
embedding_model = embedding_model.to(device)

print(f"✅ Model loaded on {device}")
print(f"   Dimension: {embedding_model.get_sentence_embedding_dimension()}")

Loading embedding model...


✅ Model loaded on cuda
   Dimension: 384


In [9]:
# Test with simple examples
test_sentences = [
    "Apple's revenue increased significantly.",
    "Apple's sales grew substantially.",
    "The weather is sunny today."
]

print("Testing embeddings...\n")

test_embeddings = embedding_model.encode(test_sentences, convert_to_numpy=True)

print(f"Shape: {test_embeddings.shape}")
print(f"First embedding (first 10 values): {test_embeddings[0][:10]}")

Testing embeddings...

Shape: (3, 384)
First embedding (first 10 values): [ 0.03692767 -0.02243091  0.04270913 -0.04272933  0.0168224  -0.00700162
  0.01143348  0.05511599  0.03566209  0.04839956]


In [10]:
# Calculate similarity
from sklearn.metrics.pairwise import cosine_similarity

similarity_matrix = cosine_similarity(test_embeddings)

print("Similarity Matrix:\n")
for i, row in enumerate(similarity_matrix):
    print(f"Sent{i+1}: {row}")

print(f"\nSent1 ↔ Sent2: {similarity_matrix[0][1]:.3f} (HIGH)")
print(f"Sent1 ↔ Sent3: {similarity_matrix[0][2]:.3f} (LOW)")

Similarity Matrix:

Sent1: [ 0.9999999   0.8519086  -0.04464385]
Sent2: [8.5190862e-01 9.9999994e-01 6.8118947e-04]
Sent3: [-4.4643845e-02  6.8118947e-04  1.0000001e+00]

Sent1 ↔ Sent2: 0.852 (HIGH)
Sent1 ↔ Sent3: -0.045 (LOW)
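
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends only on direction, not magnitude. A dependency-free sketch of the formula (the function name `cosine` and the toy vectors are illustrative, not taken from the notebook):

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


same_direction = cosine([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])       # unrelated directions
opposite = cosine([1.0, 0.0], [-1.0, 0.0])        # opposite directions
print(same_direction, orthogonal, opposite)
```

Values near 1 indicate similar direction (as for the two Apple sentences above), values near 0 indicate unrelated content.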


In [11]:
# Create embeddings for all chunks
chunk_texts = [chunk['text'] for chunk in chunks]

print(f"Creating embeddings for {len(chunk_texts)} chunks...")

chunk_embeddings = embedding_model.encode(
    chunk_texts,
    batch_size=32,
    show_progress_bar=True,
    convert_to_numpy=True,
    normalize_embeddings=True
)

print(f"\n✅ Shape: {chunk_embeddings.shape}")
print(f"   Memory: {chunk_embeddings.nbytes/1024/1024:.2f} MB")

Creating embeddings for 36 chunks...



✅ Shape: (36, 384)
   Memory: 0.05 MB


## 🔍 Part 5: Vector Search with FAISS

Build a fast search index to find relevant chunks.

In [12]:
# Build FAISS index
print("üèóÔ∏è  Building FAISS index...")

embedding_dim = chunk_embeddings.shape[1]

# Create IndexFlatL2 (exact brute-force search; reports squared L2 distances)
faiss_index = faiss.IndexFlatL2(embedding_dim)

# Add embeddings (must be float32)
if chunk_embeddings.dtype != np.float32:
    chunk_embeddings = chunk_embeddings.astype('float32')

faiss_index.add(chunk_embeddings)

print(f"✅ Index built with {faiss_index.ntotal} vectors")

🏗️  Building FAISS index...
✅ Index built with 36 vectors
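
Conceptually, an exact ("flat") index compares the query against every stored vector and keeps the closest ones. The pure-Python sketch below illustrates that idea only; it is not how FAISS is implemented internally, and the names `squared_l2` and `brute_force_search` are hypothetical.

```python
from typing import List, Sequence, Tuple


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def brute_force_search(
    vectors: List[Sequence[float]], query: Sequence[float], k: int
) -> List[Tuple[float, int]]:
    """Return the k nearest vectors as (distance, index) pairs, closest first."""
    scored = [(squared_l2(v, query), i) for i, v in enumerate(vectors)]
    return sorted(scored)[:k]


# Toy 2-D "embeddings" (made up for this example)
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
top2 = brute_force_search(toy_vectors, [0.9, 0.1], k=2)
print(top2)
```

With only 36 chunks, scanning every vector is cheap; approximate indexes become interesting only for much larger collections.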


In [13]:
# Test search
test_query = "What were the main investment strategies?"

print(f"Query: '{test_query}'\n")

# Convert query to embedding
query_embedding = embedding_model.encode(
    [test_query],
    convert_to_numpy=True,
    normalize_embeddings=True
).astype('float32')

# Search
k = 3
distances, indices = faiss_index.search(query_embedding, k)

print(f"Top {k} results:\n")
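for i, (dist, idx) in enumerate(zip(distances[0], indices[0])):
    # Heuristic score: exp(-d), where d is the squared L2 distance from IndexFlatL2.
    # It decreases monotonically with distance but is not a cosine similarity.
    similarity = np.exp(-dist)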
    print(f"Rank {i+1}: Chunk {idx}")
    print(f"  Similarity: {similarity:.2%}")
    print(f"  Preview: {chunks[idx]['text'][:150]}...\n")

Query: 'What were the main investment strategies?'

Top 3 results:

Rank 1: Chunk 32
  Similarity: 30.37%
  Preview: long term its a weighing machine. If you keep making something more valuable, then some wise person is going to notice it and start buying.  There is ...

Rank 2: Chunk 11
  Similarity: 29.70%
  Preview: ics, many that enjoy very good economic characteristics, and a large group that are marginal. Along the way, other businesses in which I have invested...

Rank 3: Chunk 10
  Similarity: 26.91%
  Preview: we have no say in management. 3 Our goal in both forms of ownership is to make meaningful investments in businesses with both long-lasting favorable e...
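
Because the chunk and query embeddings were normalized to unit length, the squared L2 distance d² between two of them relates to their cosine similarity by d² = 2 - 2·cos, so cos = 1 - d²/2. The sketch below checks that identity on toy vectors; the helper names are hypothetical and the conversion is only valid for unit-length vectors.

```python
import math
from typing import List, Sequence


def normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_from_squared_l2(d2: float) -> float:
    """Convert a squared L2 distance to cosine similarity (unit vectors only)."""
    return 1.0 - d2 / 2.0


a = normalize([3.0, 4.0])   # -> [0.6, 0.8]
b = normalize([4.0, 3.0])   # -> [0.8, 0.6]

d2 = sum((x - y) ** 2 for x, y in zip(a, b))   # squared L2 distance
dot = sum(x * y for x, y in zip(a, b))         # cosine, since both are unit length
recovered = cosine_from_squared_l2(d2)
print(d2, dot, recovered)
```

An alternative, assuming the same normalized embeddings, is an inner-product index (`faiss.IndexFlatIP`), whose scores are then cosine similarities directly.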



## 🤖 Part 6: Language Model for Generation

Load Gemma 2 2B (the instruction-tuned `google/gemma-2-2b-it` checkpoint) with 4-bit quantization to generate answers.

In [14]:

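# Gemma is a gated model: accept its license on the Hugging Face Hub,
# then replace the placeholder below with your own access token
login(token="your_hugging_face_token")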

os.environ["HF_TOKEN"] = "your_hugging_face_token"

In [15]:
# Configure quantization
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

print("Quantization configured (4-bit)")

Quantization configured (4-bit)
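
A rough back-of-the-envelope estimate shows why 4-bit loading helps on a small GPU. The sketch assumes about 2.6 billion parameters for a 2B-class model (an assumption, not a figure from this notebook) and counts weight storage only; quantization constants, layers kept in higher precision, activations, and the KV cache all add to the real footprint.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# ASSUMPTION: roughly 2.6 billion parameters
n_params = 2.6e9

fp16_gb = weight_memory_gb(n_params, 16)  # half precision
nf4_gb = weight_memory_gb(n_params, 4)    # 4-bit quantized weights

print(f"fp16: ~{fp16_gb:.1f} GB, 4-bit: ~{nf4_gb:.1f} GB")
```

Under these assumptions the weights shrink by about a factor of four.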


In [16]:
# Load model and tokenizer
model_name = "google/gemma-2-2b-it"

print(f"Loading {model_name}...")
print("   This may take a few minutes...\n")

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
print("✅ Tokenizer loaded")

# Load model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto",
    torch_dtype=torch.float16,
)

print(f"✅ Model loaded on {next(model.parameters()).device}")

Loading google/gemma-2-2b-it...
   This may take a few minutes...



✅ Tokenizer loaded


`torch_dtype` is deprecated! Use `dtype` instead!


✅ Model loaded on cuda:0


In [17]:
# Test generation
test_prompt = "What is machine learning? Answer briefly."

print(f"Test: '{test_prompt}'\n")

inputs = tokenizer(test_prompt, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )

generated = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Generated:", generated)

Test: 'What is machine learning? Answer briefly.'

Generated: What is machine learning? Answer briefly.

Machine learning is a branch of artificial intelligence (AI) where computers learn from data without being explicitly programmed.

In simpler terms:

Imagine teaching a child to recognize a cat. You show them pictures of cats, and they learn to identify the features that make a cat a cat (e.g., whiskers, tail, pointy ears). 

Machine learning works in a similar way.  The computer is given data, and it learns to find patterns and make predictions based on that data.
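
The `temperature=0.7` argument rescales the model's token scores before sampling: values below 1 make the distribution sharper, so the most likely tokens are chosen more often. The following is a conceptual sketch of that effect on made-up scores, not the `transformers` implementation; the function name is hypothetical.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Turn raw scores into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens

p_default = softmax_with_temperature(toy_logits, 1.0)
p_sharper = softmax_with_temperature(toy_logits, 0.7)

print([round(p, 3) for p in p_default])
print([round(p, 3) for p in p_sharper])
```

At temperature 0.7 the top token gains probability mass and the least likely token loses it, which makes outputs less varied than at temperature 1.0.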


## 🔗 Part 7: Complete RAG Pipeline

Combine all components!

In [18]:
def create_rag_prompt(query: str, context_chunks: List[Dict]) -> str:
    """Create prompt with context and query"""
    context = "\n\n".join([
        f"[Source {i+1}]\n{chunk['text']}"
        for i, chunk in enumerate(context_chunks)
    ])

    prompt = f"""You are a helpful financial analyst. Answer based ONLY on the context.

CONTEXT:
{context}

QUESTION: {query}

INSTRUCTIONS:
1. Answer using ONLY the context above
2. If insufficient info, say so
3. Cite sources (e.g., "According to Source 1...")
4. Be concise and professional

ANSWER:"""

    return prompt

print("✅ Prompt function defined")

✅ Prompt function defined
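
To see how retrieved chunks become numbered sources inside the prompt, the context-building step can be run on toy data. The helper name `format_context` and the two passages are made up for illustration; the join expression mirrors the one in `create_rag_prompt`.

```python
from typing import Dict, List


def format_context(context_chunks: List[Dict]) -> str:
    """Label each chunk as [Source N] and separate chunks with a blank line."""
    return "\n\n".join(
        f"[Source {i+1}]\n{chunk['text']}"
        for i, chunk in enumerate(context_chunks)
    )


toy_chunks = [{"text": "First passage."}, {"text": "Second passage."}]
context = format_context(toy_chunks)
print(context)
```

The `[Source N]` labels are what allow the model to cite passages by number in its answer.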


In [19]:
def query_rag(query: str, top_k: int = 3):
    """Complete RAG query"""

    print(f"\n{'='*60}")
    print(f"{query}")
    print(f"{'='*60}\n")

    # 1. Embed query
    query_emb = embedding_model.encode(
        [query],
        convert_to_numpy=True,
        normalize_embeddings=True
    ).astype('float32')

    # 2. Search
    distances, indices = faiss_index.search(query_emb, top_k)
    retrieved = [chunks[idx] for idx in indices[0]]

    # 3. Create prompt
    prompt = create_rag_prompt(query, retrieved)

    # 4. Generate
    inputs = tokenizer(prompt, return_tensors="pt").to(device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=512,
            temperature=0.7,
            do_sample=True,
            top_p=0.9,
            pad_token_id=tokenizer.eos_token_id,
        )

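    # Decode only the newly generated tokens. Slicing the decoded string by
    # len(prompt) is fragile if decoding does not reproduce the prompt exactly.
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    answer = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()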

    # Display
    print("ANSWER:")
    print(f"{'='*60}")
    print(answer)
    print(f"{'='*60}\n")

    print("SOURCES:")
    for i, chunk in enumerate(retrieved, 1):
        # Heuristic score exp(-d) from the squared L2 distance, not a cosine similarity
        sim = np.exp(-distances[0][i-1])
        print(f"[{i}] ({sim:.1%}) {chunk['text'][:100]}...\n")

    return answer

print("✅ RAG function defined")

✅ RAG function defined


## 🎯 Part 8: Test the System!

In [20]:
# Test Query 1
query_rag("What were the main investment strategies discussed?")


What were the main investment strategies discussed?

ANSWER:
The main investment strategies discussed include: 

* **Long-term investing:**  The importance of long-term perspective is emphasized.
* **Understanding market cycles and change:**  Recognizing that markets change and adapting to these changes is crucial. 
* **Identifying undervalued businesses:**  Identifying businesses with high potential but trading at a low price. 
* **Capital allocation decisions:**  Evaluating the quality of investment decisions and recognizing that even "good" decisions are not always guaranteed success. 
* **Long-term perspective and luck:**  The long-term perspective and occasional luck plays a significant role in investment success.


**SOURCES:**

* [Source 1]
* [Source 2]
* [Source 3]

SOURCES:
[1] (30.8%) long term its a weighing machine. If you keep making something more valuable, then some wise person ...

[2] (29.6%) ics, many that enjoy very good economic characteristics, and a large group t


In [21]:
# Test Query 2
query_rag("What risks were mentioned?")


What risks were mentioned?

ANSWER:
The text mentions several risks, including:

- **Long-term bets against America:** Source 1 mentions a belief that long-term bets against America are unwise. 
- **Misinterpretation of the world:** Source 1 states that "if you dont see the world the way it is, its like judging something through a distorted lens" implying the possibility of misinterpreting the world.
- **Deception and manipulation:** Source 3 emphasizes that "Beating expectations is heralded as a managerial triumph. That activity is disgusting. It requires no talent to manipulate numbers: Only a deep desire to deceive is required."  This suggests that manipulating financial figures for personal gain is a risk.

It's important to note that the text does not explicitly list all potential risks.

SOURCES:
[1] (22.3%) nd self-doubt, I have yet to see a time when it made sense to make a long-term bet against America. ...

[2] (22.3%) ce its existence as well. Beating expectations is herald


In [22]:
# Test Query 3
query_rag("Summarize the financial performance.")


Summarize the financial performance.

ANSWER:
Berkshire's financial performance is characterized by:
-  A focus on operational earnings, specifically the record-setting $30.8 billion achieved in 2022. 
-  A long-term strategy, exemplified by the company's ability to weather economic downturns. 
-  A commitment to long-term investing, evident by the successful holding of high-quality assets and the appreciation of a few winning investments over time. 
-  A recognition of the importance of capital allocation decisions, as evidenced by the author's acknowledgement of both good and bad decisions made over the past 58 years.
-  A focus on preserving the company's unmatched staying power. 

**Sources:**
- Source 1
- Source 2
- Source 3

**Explanation:** 
This response summarizes the financial performance of Berkshire based on the provided text. The response highlights key aspects of their financial management, including their focus on operational earnings, long-term investment strategy, and


## 💾 Part 9: Save Everything

In [23]:
# Save for later use
print("Saving...\n")

# Save embeddings and chunks
with open('data/processed/embeddings_chunks.pkl', 'wb') as f:
    pickle.dump({'embeddings': chunk_embeddings, 'chunks': chunks}, f)

# Save FAISS index
faiss.write_index(faiss_index, 'data/processed/faiss_index.index')

print("✅ Saved!")
print("   - data/processed/embeddings_chunks.pkl")
print("   - data/processed/faiss_index.index")

Saving...

✅ Saved!
   - data/processed/embeddings_chunks.pkl
   - data/processed/faiss_index.index
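
To reuse the saved artifacts in a later session, they need to be read back. The sketch below demonstrates the pickle round trip on a small stand-in payload written to a temporary directory (the payload contents are made up); the FAISS side is noted in a comment only, so the example needs no extra packages.

```python
import os
import pickle
import tempfile

# Stand-in for the real {'embeddings': ..., 'chunks': ...} dictionary
payload = {"chunks": [{"text": "example", "chunk_id": 0, "source": "demo.pdf"}]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "embeddings_chunks.pkl")

    # Save, as in the cell above
    with open(path, "wb") as f:
        pickle.dump(payload, f)

    # Load in a later session
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The index written with faiss.write_index can be read back with
# faiss.read_index("data/processed/faiss_index.index").

print(restored["chunks"][0]["text"])
```

Pickle files can execute arbitrary code when loaded, so only unpickle files you created yourself or otherwise trust.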
