<a href="https://github.com/Deffro/Data-Science-Portfolio/tree/master"><img src="https://img.shields.io/badge/GitHub%20Repository-black?logo=github"></a>
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/13PSwmgbQwpcXDIVGCZrvbpZNK7VngNk5?usp=sharing)

# **Building a Retrieval-Augmented Generation (RAG) System with NVIDIA's Annual Financial Statement**

In this tutorial, we’ll explore the creation of a **Retrieval-Augmented Generation (RAG) system** using NVIDIA's latest 10-K filing as our data source.

RAG combines **retrieval** and **generation** to deliver factual, context-driven answers from a given dataset. Because the model answers from the retrieved passages rather than from memory alone, responses stay grounded in the provided information, which reduces the hallucinations often associated with language models.

By the end of this tutorial, we will have built a pipeline that:
1. Downloads and processes real-world financial data from the SEC EDGAR database.
2. Splits and chunks the data for efficient retrieval.
3. Embeds the chunks into a high-dimensional space for similarity search using FAISS.
4. Retrieves relevant information for a given query.
5. Generates a coherent and accurate answer using a Large Language Model (LLM).

### **Why RAG?**
RAG is ideal for tasks like:

- Summarizing company reports.
- Providing contextual answers to domain-specific questions.
- Enabling users to "chat with their data."

### **Tutorial Overview**
We’ll use NVIDIA’s latest **10-K filing** from the SEC as our dataset. Here's what we’ll cover:
1. **Data Acquisition**: How to fetch the latest 10-K filing directly from the SEC EDGAR database.
2. **Data Preprocessing**: Splitting the content into manageable chunks with overlapping context.
3. **Embedding and Indexing**: Leveraging **sentence-transformers** for embeddings and **FAISS** for efficient similarity search.
4. **Querying the Data**: Retrieving relevant chunks for user queries.
5. **Grounded Generation**: Using a pre-trained LLM to generate answers based on the retrieved information.


In [None]:
%%capture
!pip install transformers faiss-gpu==1.7.2 sentence-transformers==3.0.1

# Loading the latest NVIDIA Annual Financial Statement

In [None]:
import requests
from bs4 import BeautifulSoup

def get_latest_10k_text(cik):
    """
    Fetches the latest 10-K filing for a given company CIK and returns the content as a single text.

    Parameters:
    - cik (str): CIK number of the company.

    Returns:
    - str: Text content of the latest 10-K filing.
    """
    # Step 1: Get the filing index from SEC EDGAR
    base_url = "https://data.sec.gov/submissions/"
    # SEC asks automated clients to identify themselves: replace with your own name and email
    headers = {"User-Agent": "YourName Contact@Email.com"}

    # Fetch the company's filing index
    response = requests.get(f"{base_url}CIK{cik}.json", headers=headers)
    response.raise_for_status()
    data = response.json()

    # Step 2: Find the latest 10-K filing
    filings = data["filings"]["recent"]
    for i, form_type in enumerate(filings["form"]):
        if form_type == "10-K":
            accession_number = filings["accessionNumber"][i].replace("-", "")
            file_url = f"https://www.sec.gov/Archives/edgar/data/{cik}/{accession_number}/index.json"
            break
    else:
        raise ValueError("No 10-K filings found for the given CIK.")

    # Step 3: Get the document URL from the filing index
    filing_response = requests.get(file_url, headers=headers)
    filing_response.raise_for_status()
    filing_data = filing_response.json()

    # Locate the primary document (usually ending with ".htm" or ".txt")
    primary_doc = None
    for doc in filing_data["directory"]["item"]:
        if doc["name"].endswith(".htm") or doc["name"].endswith(".txt"):
            primary_doc = doc["name"]
            break

    if not primary_doc:
        raise ValueError("Could not locate primary document in filing.")

    # Step 4: Fetch the primary document
    filing_url = f"https://www.sec.gov/Archives/edgar/data/{cik}/{accession_number}/{primary_doc}"
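    doc_response = requests.get(filing_url, headers=headers)
    doc_response.raise_for_status()  # Fail loudly instead of parsing an error page
    filing_content = doc_response.text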

    # Step 5: Parse and clean up the content
    soup = BeautifulSoup(filing_content, "html.parser")
    text_content = soup.get_text(separator=" ", strip=True)

    return text_content

# Example: Fetch NVIDIA's latest 10-K
cik = "0001045810"  # CIK for NVIDIA
try:
    latest_10k_text = get_latest_10k_text(cik)
    print(latest_10k_text[:1000])  # Print the first 1000 characters for brevity
except Exception as e:
    print(f"Error: {e}")


0001045810-24-000029.txt : 20240221 0001045810-24-000029.hdr.sgml : 20240221 20240221163657
ACCESSION NUMBER:		0001045810-24-000029
CONFORMED SUBMISSION TYPE:	10-K
PUBLIC DOCUMENT COUNT:		114
CONFORMED PERIOD OF REPORT:	20240128
FILED AS OF DATE:		20240221
DATE AS OF CHANGE:		20240221

FILER:

	COMPANY DATA:	
		COMPANY CONFORMED NAME:			NVIDIA CORP
		CENTRAL INDEX KEY:			0001045810
		STANDARD INDUSTRIAL CLASSIFICATION:	SEMICONDUCTORS & RELATED DEVICES [3674]
		ORGANIZATION NAME:           	04 Manufacturing
		IRS NUMBER:				943177549
		STATE OF INCORPORATION:			DE
		FISCAL YEAR END:			0128

	FILING VALUES:
		FORM TYPE:		10-K
		SEC ACT:		1934 Act
		SEC FILE NUMBER:	000-23985
		FILM NUMBER:		24660316

	BUSINESS ADDRESS:	
		STREET 1:		2788 SAN TOMAS EXPRESSWAY
		CITY:			SANTA CLARA
		STATE:			CA
		ZIP:			95051
		BUSINESS PHONE:		408-486-2000

	MAIL ADDRESS:	
		STREET 1:		2788 SAN TOMAS EXPRESSWAY
		CITY:			SANTA CLARA
		STATE:			CA
		ZIP:			95051

	FORMER COMPANY:	
		FORMER CONFORMED NAME:
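
The loop in `get_latest_10k_text` takes the first `.htm` or `.txt` entry in the filing directory, which is not guaranteed to be the main 10-K document. One possible alternative, sketched here, reads the `primaryDocument` field that the submissions JSON lists alongside `form` and `accessionNumber`. The helper name `latest_filing_url` is hypothetical, and the sample dictionary is made up for illustration: only the 10-K accession number comes from the output above, and the document names are placeholders.

```python
def latest_filing_url(submissions, cik, form="10-K"):
    """Build the URL of the primary document of the first filing of the given form type."""
    recent = submissions["filings"]["recent"]
    rows = zip(recent["form"], recent["accessionNumber"], recent["primaryDocument"])
    for form_type, accession, primary_doc in rows:
        if form_type == form:
            folder = accession.replace("-", "")  # archive folder names drop the dashes
            return f"https://www.sec.gov/Archives/edgar/data/{cik}/{folder}/{primary_doc}"
    raise ValueError(f"No {form} filings found.")


# Made-up sample that mirrors the layout of the submissions JSON.
# The document names and the 8-K accession number are placeholders.
sample = {
    "filings": {
        "recent": {
            "form": ["8-K", "10-K"],
            "accessionNumber": ["0000000000-00-000000", "0001045810-24-000029"],
            "primaryDocument": ["placeholder-8k.htm", "placeholder-10k.htm"],
        }
    }
}

url = latest_filing_url(sample, "0001045810")
print(url)
```

Switching to a different source document would change the text, so the character offsets used in the next cell would no longer line up and would need to be found again.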

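We will keep only Item 1 (Business) and Item 1A (Risk Factors) for this project. The character offsets below are specific to this particular filing and would need to be updated for a different one.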

In [None]:
latest_10k_text = latest_10k_text[30332:185600]
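
Hard-coded offsets break as soon as the filing changes. As a rough sketch of an alternative, the section can be located by its headings. The function name `extract_section`, the regular expressions, and the toy text are all assumptions for illustration. Because the headings also appear in the table of contents, the sketch keeps the longest span between a start heading and the next end heading, which is a heuristic and should be checked by eye on a real filing.

```python
import re


def extract_section(text, start_pattern, end_pattern):
    """Return the longest span that begins at a start heading and stops before the next end heading."""
    end_re = re.compile(end_pattern, flags=re.IGNORECASE)
    best = ""
    for start in re.finditer(start_pattern, text, flags=re.IGNORECASE):
        end = end_re.search(text, start.end())
        if end is None:
            continue
        candidate = text[start.start():end.start()]
        if len(candidate) > len(best):
            best = candidate
    if not best:
        raise ValueError("Section headings not found.")
    return best


# Toy text: a short table of contents followed by the actual sections
toy_filing = (
    "Table of Contents Item 1. Business 4 Item 1A. Risk Factors 13 "
    "Item 1B. Unresolved Staff Comments 35 "
    "Item 1. Business Our company description goes here. "
    "Item 1A. Risk Factors Our risk discussion goes here. "
    "Item 1B. Unresolved Staff Comments Not applicable."
)

section = extract_section(toy_filing, r"Item\s+1\.\s+Business", r"Item\s+1B\.")
print(section)
```

The table-of-contents match is discarded because the span it produces is shorter than the one that covers the real Item 1 and Item 1A text.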

# Step 1: Splitting the text into chunks

First, let's split the text and create a list of all sentences.

In [None]:
# Split the text into sentences
sentences = [sentence.strip() for sentence in latest_10k_text.split('.') if sentence.strip()]

In [None]:
len(sentences)

878

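Splitting on the "." character gives 878 sentences. This naive split also cuts decimals and abbreviations apart (the sample chunk printed further down contains "$45 3 billion"), which is acceptable for this tutorial but worth keeping in mind.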
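
If decimals need to stay intact, one option is a regular-expression split that only breaks after sentence punctuation followed by whitespace and a capital letter. This is a minimal sketch rather than the tutorial's method: `split_sentences` is a hypothetical helper, and the sample text is made up.

```python
import re


def split_sentences(text):
    """Split after '.', '!' or '?' only when whitespace and a capital letter follow."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    return [part.strip() for part in parts if part.strip()]


# Made-up sample text
sample = "Revenue grew to $45.3 billion. Mr. Smith joined in 2010. The company ships GPUs."

parts = split_sentences(sample)
for part in parts:
    print(part)
```

The decimal in "$45.3" survives because no whitespace follows the dot, but abbreviations such as "Mr." are still split from the name that follows. A dedicated sentence tokenizer handles those cases better. Using a different splitter would also change the sentence count reported above.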

We could use these sentences as they are to create chunks, but we will group them into **overlapping chunks** instead.

When processing text for retrieval systems, ensuring that no critical information is lost is essential.
This is especially important for documents where ideas and facts often span multiple sentences or sections.

A technique called **overlapping chunks** can help maintain contextual continuity and relevance.

Overlapping chunks include a portion of sentences from the surrounding text, ensuring that the boundaries of each chunk retain context from adjacent parts. This approach minimizes the loss of meaning that can occur when splitting text into discrete sections.

For instance, if we divide text into chunks of three sentences with an overlap of one sentence:
- **Chunk 1**: Sentences 1, 2, 3
- **Chunk 2**: Sentences 3, 4, 5
- **Chunk 3**: Sentences 5, 6, 7

Notice that each chunk shares some sentences with the previous one, which improves context and ensures that no vital information is left out.
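
The three-sentence example above can be reproduced with a few lines of Python. This is a small sketch using sentence numbers instead of real text; `chunk_with_overlap` is a hypothetical name, and the check on `overlap` is an addition to guard against a zero or negative step.

```python
def chunk_with_overlap(items, size=3, overlap=1):
    """Group items into windows of `size` that share `overlap` items with the previous window."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(items), step):
        chunks.append(items[start:start + size])
        # Stop once the current window reaches the end of the list
        if start + size >= len(items):
            break
    return chunks


sentence_ids = [1, 2, 3, 4, 5, 6, 7]
chunks = chunk_with_overlap(sentence_ids, size=3, overlap=1)
print(chunks)  # [[1, 2, 3], [3, 4, 5], [5, 6, 7]]
```

Each window starts `size - overlap` positions after the previous one, which is why sentence 3 and sentence 5 each appear twice.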

---

### **Advantages of Overlapping Chunks**

1. **Improved Context**: By overlapping sentences, each chunk carries forward important information from its predecessor, creating a seamless flow of ideas.
   
2. **Reduced Fragmentation**: Critical information located at the edges of chunks is retained, reducing the risk of incomplete or inaccurate retrieval.
   
3. **Higher Relevance**: Overlapping ensures that relevant information is less likely to be excluded during query processing, resulting in more accurate results.



In [None]:
def chunk_sentences_with_context(sentences, sentences_per_chunk=3, overlap=1):
    """
    Chunk sentences into groups with overlap for context.

    Parameters:
    - sentences (list): A list of sentences.
    - sentences_per_chunk (int): The number of sentences per chunk.
    - overlap (int): The number of sentences that should overlap between chunks.

    Returns:
    - list: A list of overlapping chunks.
    """
    chunks = []
    for i in range(0, len(sentences), sentences_per_chunk - overlap):
        chunk = sentences[i:i + sentences_per_chunk]
        chunks.append(' '.join(chunk))
        # Break if we've reached the end of the list
        if i + sentences_per_chunk >= len(sentences):
            break
    return chunks

# Specify the number of sentences per chunk and overlap
sentences_per_chunk = 10
overlap = 2
texts = chunk_sentences_with_context(sentences, sentences_per_chunk, overlap)
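
As a sanity check, the number of chunks can be worked out before running the function. Each chunk starts `sentences_per_chunk - overlap` sentences after the previous one, and chunking stops at the first chunk that reaches the end of the list. The helper `expected_chunk_count` is a hypothetical name and assumes `overlap` is smaller than `sentences_per_chunk`.

```python
import math


def expected_chunk_count(n_sentences, sentences_per_chunk, overlap):
    """Number of chunks produced by a sliding window that stops at the end of the list."""
    if n_sentences == 0:
        return 0
    if n_sentences <= sentences_per_chunk:
        return 1
    step = sentences_per_chunk - overlap
    # Windows start at 0, step, 2*step, ... until one of them covers the last sentence
    return math.ceil((n_sentences - sentences_per_chunk) / step) + 1


print(expected_chunk_count(878, 10, 2))  # 110
```

With 878 sentences, 10 sentences per chunk, and an overlap of 2, the step is 8, so `len(texts)` should be 110.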

Let's see the 3rd chunk:

In [None]:
texts[2]

'We have invested over $45 3 billion in research and development since our inception, yielding inventions that are essential to modern computing Our invention of the GPU in 1999 sparked the growth of the PC gaming market and redefined computer graphics With our introduction of the CUDA programming model in 2006, we opened the parallel processing capabilities of our GPU to a broad range of compute-intensive applications, paving the way for the emergence of modern AI In 2012, the AlexNet neural network, trained on NVIDIA GPUs, won the ImageNet computer image recognition competition, marking the “Big Bang” moment of AI We introduced our first Tensor Core GPU in 2017, built from the ground-up for the new era of AI, and our first autonomous driving system-on-chips, or SoC, in 2018 Our acquisition of Mellanox in 2020 expanded our innovation canvas to include networking and led to the introduction of a new processor class – the data processing unit, or DPU Over the past 5 years, we have built

# Step 2: Convert chunks to embeddings


After creating chunks of text, the next step is to convert these chunks into numerical representations called **embeddings**. These embeddings are essential for enabling similarity-based search and retrieval tasks.

---

### **What Are Embeddings?**

Embeddings are dense vector representations of text in a high-dimensional space. They capture the semantic meaning of text, making it possible to compare and retrieve chunks based on their contextual similarity rather than just matching keywords.

For instance:
- Sentences like "The dog barked loudly." and "A loud noise was made by the dog." may have similar embeddings despite using different words.
- This semantic understanding is crucial for retrieval systems to find relevant results.
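
To make "similar embeddings" concrete, here is a small sketch that compares vectors with cosine similarity, one common way to measure how closely two embeddings point in the same direction. The three-dimensional vectors are made up for illustration and are not real model output; real embeddings from the model used below have 768 dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up toy vectors standing in for sentence embeddings
dog_barked = [0.9, 0.1, 0.2]   # "The dog barked loudly."
loud_dog = [0.8, 0.2, 0.3]     # "A loud noise was made by the dog."
unrelated = [0.1, 0.9, 0.1]    # a sentence on a different topic

similar_score = cosine_similarity(dog_barked, loud_dog)
unrelated_score = cosine_similarity(dog_barked, unrelated)
print(similar_score, unrelated_score)
```

The two dog sentences score much higher with each other than either does with the unrelated vector, which is the property a retrieval system relies on.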

---

### **Embedding Model**

To generate embeddings, we use the **`sentence-transformers`** library and the `all-mpnet-base-v2` model, which is well-suited for capturing semantic relationships in text. This model outputs a fixed-dimensional embedding (768 dimensions) for each input text.


In [None]:
from sentence_transformers import SentenceTransformer

# Load SentenceTransformer model
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# Convert chunks to embeddings
embeds = model.encode(texts, show_progress_bar=True)


# Step 3: Store embeddings in an index database

Now that we have generated embeddings for our text chunks, the next step is to store these embeddings in an **index database**. The index enables efficient similarity search, allowing us to quickly find the most relevant chunks for any query.

---

### **Why Use an Index Database?**

When dealing with a large number of embeddings, searching through them one by one for similarity becomes computationally expensive. By creating an **index**, we can:
- **Optimize Search Speed**: Perform nearest neighbor searches in milliseconds, even with millions of embeddings.
- **Enable Scalability**: Handle large datasets without performance degradation.
- **Support Complex Queries**: Retrieve multiple relevant chunks for a given query.

---

### **Using FAISS**

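We use **FAISS (Facebook AI Similarity Search)**, a highly efficient library designed for similarity search tasks. It supports high-dimensional vector data and provides various indexing methods optimized for speed and scalability. In this tutorial we use `IndexFlatL2`, the simplest of them: it stores the vectors as they are and compares the query with every one of them using L2 distance, so the results are exact. That is fast enough for the small number of chunks here; approximate indexes such as `IndexIVFFlat` or `IndexHNSWFlat` become useful at larger scales.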


In [None]:
import numpy as np
import pandas as pd
import faiss

dim = embeds.shape[1]
index = faiss.IndexFlatL2(dim)
index.add(np.float32(embeds))
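
For intuition, the search performed by a flat L2 index can be written out in plain Python: compute the squared L2 distance from the query to every stored vector and keep the `k` smallest. This is an illustrative sketch with made-up two-dimensional vectors, not how FAISS is implemented internally, and `flat_l2_search` is a hypothetical name.

```python
def flat_l2_search(vectors, query, k):
    """Exhaustive search: return (squared_distances, ids) of the k vectors closest to the query."""
    scored = []
    for idx, vec in enumerate(vectors):
        sq_dist = sum((v - q) ** 2 for v, q in zip(vec, query))
        scored.append((sq_dist, idx))
    scored.sort()  # smallest distance first
    top = scored[:k]
    return [dist for dist, _ in top], [idx for _, idx in top]


# Made-up toy vectors
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
distances, ids = flat_l2_search(toy_vectors, query=[0.9, 0.1], k=2)
print(ids)
print(distances)
```

The cost grows linearly with the number of stored vectors, which is why approximate index types exist for much larger collections.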

# Step 4: Retrieve the relevant information for a query

With our FAISS index ready, we can now search for the most relevant chunks of text based on a given query. This is the retrieval step in the **Retrieval-Augmented Generation (RAG)** pipeline. It enables us to extract the most contextually relevant information to ground subsequent generation tasks.

---

### **How It Works**

1. **Encode the Query**:
   - The query is embedded into the same vector space as the indexed text chunks using the same `SentenceTransformer` model. This ensures that the query and the text chunks can be compared semantically.

2. **Search the FAISS Index**:
   - The FAISS index is queried to find the nearest neighbors (most similar text chunks) to the query embedding.
   - The search returns the indices of the closest embeddings along with their distances. With `IndexFlatL2` these are squared L2 distances, so a smaller value means a closer match.

3. **Retrieve and Format Results**:
   - The indices are used to fetch the corresponding text chunks.
   - The results, including the text chunks and their distances, are stored in a structured DataFrame for easy analysis and use.


In [None]:
def retrieve(query, number_of_results=3):
    """
    Search for the nearest neighbors of a query in the FAISS index.

    Parameters:
    - query (str): The query text to search for.
    - number_of_results (int): Number of nearest neighbors to retrieve.

    Returns:
    - pd.DataFrame: A DataFrame containing the nearest texts and their distances.
    """
    # 1. Get the query's embedding
    query_embed = model.encode([query])  # Ensure query is encoded as a list

    # 2. Retrieve the nearest neighbors
    distances, similar_item_ids = index.search(np.float32(query_embed), number_of_results)

    # 3. Format the results
    texts_np = np.array(texts)  # Convert texts list to numpy array for indexing
    results = pd.DataFrame(data={
        'texts': texts_np[similar_item_ids[0]],
        'distance': distances[0]
    })

    return results

In [None]:
query = "which are the executive officers?"
results = retrieve(query)
results

| | texts | distance |
|---|---|---|
| 0 | This flexibility supports diverse hiring, rete... | 1.05917 |
| 1 | B A degree from Harvard Business School Debora... | 1.245988 |
| 2 | , a networking equipment company, since 2010 A... | 1.306535 |
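
The distances above are easier to read once converted to cosine similarity. For unit-length vectors, the squared L2 distance and the cosine similarity are linked by `squared_distance = 2 - 2 * cosine`. The sketch below applies that identity under the assumption that the embeddings are unit-length, which should be verified first (for example by checking that `np.linalg.norm(embeds, axis=1)` is close to 1 for every row). The helper names are hypothetical.

```python
import math


def squared_l2_to_cosine(squared_distance):
    """Convert a squared L2 distance to cosine similarity. Valid only for unit-length vectors."""
    return 1.0 - squared_distance / 2.0


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


# Check the identity on two made-up unit vectors
a = normalize([3.0, 4.0])
b = normalize([4.0, 3.0])
sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
cosine = sum(x * y for x, y in zip(a, b))
assert math.isclose(squared_l2_to_cosine(sq_dist), cosine)

# Distances taken from the results table above
converted = [round(squared_l2_to_cosine(d), 3) for d in (1.05917, 1.245988, 1.306535)]
print(converted)  # [0.47, 0.377, 0.347]
```

If the assumption holds, the best match has a cosine similarity of about 0.47, a moderate score that reflects how a chunk of ten sentences only partly overlaps with a short question.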


In [None]:
results["texts"][0]

"This flexibility supports diverse hiring, retention, and employee engagement, which we believe makes NVIDIA a great place to work During fiscal year 2025, we will continue to have a flexible work environment and maintain our company wide 2-days off a quarter for employees to rest and recharge Information About Our Executive Officers The following sets forth certain information regarding our executive officers, their ages, and positions as of February\xa016, 2024: Name Age Position Jen-Hsun Huang 60 President and Chief Executive Officer Colette M Kress 56 Executive Vice President and Chief Financial Officer Ajay K Puri 69 Executive Vice President, Worldwide Field Operations Debora Shoquist 69 Executive Vice President, Operations Timothy S Teter 57 Executive Vice President and General Counsel Jen-Hsun Huang co-founded NVIDIA in 1993 and has served as our President, Chief Executive Officer, and a member of the Board of Directors since our inception From 1985 to 1993, Mr Huang was employe

# Step 5: Use an LLM to answer

The final step in our **Retrieval-Augmented Generation (RAG)** pipeline involves utilizing a **Large Language Model (LLM)** to provide a well-grounded answer based on the retrieved text. This step, often referred to as **grounded generation**, ensures that the LLM generates responses that are both contextually accurate and relevant to the query.

---

### **How It Works**

1. **LLM Integration**:
   - Use a pre-trained LLM to process the query and the retrieved context.
   - Hugging Face’s `pipeline` makes it easy to integrate text-generation models for this purpose.

2. **Prompt Design**:
   - The LLM is guided by a carefully crafted prompt, which includes:
     - **Persona**: Defines the role or expertise of the model.
     - **Instruction**: Specifies how the LLM should process the retrieved information.
     - **Query**: Combines the user’s question with the retrieved context.

3. **Grounded Answering**:
   - The LLM generates a response based on the retrieved context.
   - If the information is insufficient or unrelated, the LLM is instructed to acknowledge this (e.g., “I am not sure”).
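
The prompt structure described above can be sketched as a small helper that keeps the persona, the retrieved context, the instruction, and the question clearly separated. `build_prompt`, the `[Context N]` labels, and the placeholder chunks are assumptions for illustration; the tutorial itself concatenates the strings directly and uses only the top retrieved chunk.

```python
def build_prompt(persona, contexts, instruction, question):
    """Assemble a grounded prompt: persona, numbered context chunks, instruction, then the question."""
    context_block = "\n\n".join(
        f"[Context {i}]\n{text}" for i, text in enumerate(contexts, start=1)
    )
    return f"{persona}\n\n{context_block}\n\n{instruction}\n\nQuestion: {question}"


prompt = build_prompt(
    persona="You are a helpful assistant specialized in company annual financial statements.",
    contexts=["First retrieved chunk.", "Second retrieved chunk."],  # placeholder chunks
    instruction="Answer using only the context above. If the answer is not there, say 'I am not sure'.",
    question="which are the executive officers?",
)
print(prompt)
```

Passing several chunks gives the model more material to draw on, at the cost of a longer prompt. Explicit separators also make it harder for the context to run into the instruction.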


In [None]:
from transformers import pipeline

# Create a pipeline
pipe = pipeline(
    task="text-generation",
    model="Qwen/Qwen2.5-1.5B-Instruct",
    return_full_text=False,
    max_new_tokens=500,
    do_sample=True,
    temperature=0.2,
)


Device set to use cuda:0


In [None]:
# Prompt components
persona = "You are a helpful assistant specialized in company annual financial statements.\n"
instruction = "Answer using the relevant information provided above. If you didn't find the information say 'I am not sure'.\n"

# The full prompt - remove and add pieces to view its impact on the generated output
# A newline after the retrieved chunk keeps the context from running into the instruction
llm_query = persona + results["texts"][0] + "\n" + instruction + query

messages = [
    {"role": "user", "content": llm_query}
]

In [None]:
output = pipe(messages)
print(output[0]["generated_text"])

The executive officers mentioned in the given text are:

1. **Jen-Hsun Huang** - He is described as the President and Chief Executive Officer (CEO) and a member of the Board of Directors since the company's inception.

2. **Colette M Kress** - She is listed as the Executive Vice President and Chief Financial Officer.

3. **Ajay K Puri** - He is identified as the Executive Vice President, Worldwide Field Operations.

4. **Debora Shoquist** - She is named as the Executive Vice President, Operations.

5. **Timothy S Teter** - He is referred to as the Executive Vice President and General Counsel.

These individuals hold significant roles within NVIDIA and play crucial parts in its operations and leadership.
