# Assignment 2 – Comparative Financial QA System: RAG vs Fine-Tuning

Group No 16
Group Member Names:
## Group Member Names:
1. | Anup Jindal        | 2023ac05472 |100%
2. | Yogesh Chaturvedi  | 2023ac05167 |100%
3. | HRISHIKESH MALAKAR | 2023Ac05058 |100%
4. | Anit Nair          | 2023ac05503 |100%
5. | DEBASISH ACHARYA   | 2023ac05417 |100%


Objective
Develop and compare two systems for answering questions based on company financial statements (last two years):

Retrieval-Augmented Generation (RAG) Chatbot: Combines document retrieval and generative response.
Fine-Tuned Language Model (FT) Chatbot: Directly fine-tunes a small open-source language model on financial Q&A.
Use the same financial data for both methods and perform a detailed comparison on accuracy, speed, and robustness.

In [1]:
%pip install -r requirements.txt

Collecting streamlit (from -r requirements.txt (line 1))
  Downloading streamlit-1.48.1-py3-none-any.whl.metadata (9.5 kB)
Collecting altair!=5.4.0,!=5.4.1,<6,>=4.0 (from streamlit->-r requirements.txt (line 1))
  Downloading altair-5.5.0-py3-none-any.whl.metadata (11 kB)
Collecting blinker<2,>=1.5.0 (from streamlit->-r requirements.txt (line 1))
  Downloading blinker-1.9.0-py3-none-any.whl.metadata (1.6 kB)
Collecting pyarrow>=7.0 (from streamlit->-r requirements.txt (line 1))
  Downloading pyarrow-21.0.0-cp311-cp311-win_amd64.whl.metadata (3.4 kB)
Collecting toml<2,>=0.10.1 (from streamlit->-r requirements.txt (line 1))
  Downloading toml-0.10.2-py2.py3-none-any.whl.metadata (7.1 kB)
Collecting watchdog<7,>=2.1.5 (from streamlit->-r requirements.txt (line 1))
  Downloading watchdog-6.0.0-py3-none-win_amd64.whl.metadata (44 kB)
Collecting gitpython!=3.1.19,<4,>=3.0.7 (from streamlit->-r requirements.txt (line 1))
  Downloading gitpython-3.1.45-py3-none-any.whl.metadata (13 kB)
Collect

In [2]:
# imports
import zipfile
import os
from bs4 import BeautifulSoup

## 1. Data Collection & Preprocessing

Downloaded Financial Statement of GE Healthcare From United States Securities and Exchange Commission:

Click [here](https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0001932393&type=10-Q&dateb=&owner=include&count=40&search_text=) for link to source of data.

### 1.1 Extract the data and convert them to plain text. (Source data is html files)

In [3]:
zip_file_path = './content/gehc-annual-report-2023-2024.zip'
extracted_dir_path = './content/gehc_fin_extracted'

# Create the extraction directory if it doesn't exist
os.makedirs(extracted_dir_path, exist_ok=True)

# Extract the zip file
with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    zip_ref.extractall(extracted_dir_path)

print(f"Extracted {zip_file_path} to {extracted_dir_path}")

Extracted ./content/gehc-annual-report-2023-2024.zip to ./content/gehc_fin_extracted


In [4]:
extracted_dir_path = './content/gehc_fin_extracted'
plain_text_dir_path = './content/gehc_fin_plain_text'

# Create the directory for plain text files if it doesn't exist
os.makedirs(plain_text_dir_path, exist_ok=True)

html_files = []
for root, _, files in os.walk(extracted_dir_path):
    for file in files:
        if file.endswith(".html") or file.endswith(".htm"):
            html_files.append(os.path.join(root, file))

print(f"Found {len(html_files)} HTML files.")

for html_file_path in html_files:
    try:
        # Try reading with utf-8 first, then latin-1
        try:
            with open(html_file_path, 'r', encoding='utf-8') as f:
                html_content = f.read()
        except UnicodeDecodeError:
            with open(html_file_path, 'r', encoding='latin-1') as f:
                html_content = f.read()


        # Use BeautifulSoup to parse HTML and extract text
        soup = BeautifulSoup(html_content, 'html.parser')
        plain_text = soup.get_text(separator='\n')

        # Create a corresponding plain text file path
        relative_path = os.path.relpath(html_file_path, extracted_dir_path)
        plain_text_file_path = os.path.join(plain_text_dir_path, relative_path + ".txt")

        # Create directories for the plain text file if they don't exist
        os.makedirs(os.path.dirname(plain_text_file_path), exist_ok=True)

        with open(plain_text_file_path, 'w', encoding='utf-8') as f:
            f.write(plain_text)

        print(f"Converted {html_file_path} to plain text and saved to {plain_text_file_path}")

    except Exception as e:
        print(f"Error processing {html_file_path}: {e}")

print("Finished converting HTML files to plain text.")

Found 2 HTML files.
Converted ./content/gehc_fin_extracted\gehc-20231231.html to plain text and saved to ./content/gehc_fin_plain_text\gehc-20231231.html.txt
Converted ./content/gehc_fin_extracted\gehc-20241231.html to plain text and saved to ./content/gehc_fin_plain_text\gehc-20241231.html.txt
Finished converting HTML files to plain text.



### 1.2 Walk through each text file and save them to a list as string.

In [5]:
plain_text_data = []
# Walk through the directory and read all .txt files
for root, _, files in os.walk(plain_text_dir_path):
    for file in files:
        if file.endswith(".txt"):
            file_path = os.path.join(root, file)
            try:
                with open(file_path, 'r', encoding='utf-8') as f:
                    plain_text_data.append(f.read())
                print(f"Loaded {file_path}")
            except Exception as e:
                print(f"Error reading {file_path}: {e}")

print(f"Loaded {len(plain_text_data)} plain text files.")


Loaded ./content/gehc_fin_plain_text\gehc-20231231.html.txt
Loaded ./content/gehc_fin_plain_text\gehc-20241231.html.txt
Loaded 2 plain text files.


### 1.3 Clean text by removing noise like headers, footers, and page numbers.

In [6]:
import re

cleaned_text_data = []

# Function to clean text
def clean_text(text):
    # Remove common headers/footers (example patterns, adjust as needed)
    text = re.sub(r'\[\s*\d+\s*\]', '', text) # Remove numbers in brackets like [ 1 ]
    text = re.sub(r'Page\s+\d+\s+of\s+\d+', '', text, flags=re.IGNORECASE) # Remove "Page X of Y"
    text = re.sub(r'Exhibit\s+\d+\.\d+', '', text, flags=re.IGNORECASE) # Remove "Exhibit X.Y"
    text = re.sub(r'\n\s*\n', '\n', text) # Remove excessive newlines

    # You might need more specific patterns based on the actual data structure

    return text

# Apply cleaning to each document
for text in plain_text_data:
    cleaned_text_data.append(clean_text(text))



### 1.4 Segment reports into logical sections (e.g., income statement, balance sheet).

In [7]:
import re

segmented_financial_statements = []

# Define the financial statement segments and their potential headings
# Using a dictionary to map a user-friendly name to a list of potential regex patterns.
# This allows for variations in how headings might appear.
financial_segment_patterns = {
    "Statements of Operations / Income": [
        r"CONSOLIDATED STATEMENTS OF OPERATIONS\s*\n(.*?)(?=\n(?:Statements of Financial Position|Statements of Comprehensive Income|Statements of Changes in Equity|balance sheet|cash flows)|\Z)",
        r"Statements of Income\s*\n(.*?)(?=\n(?:Statements of Financial Position|Statements of Comprehensive Income|Statements of Changes in Equity|balance sheet|cash flows)|\Z)",
    ],
    "Statements of Financial Position / Balance Sheet": [
        r"Statements of Financial Position\s*\n(.*?)(?=\n(?:Statements of Comprehensive Income|Statements of Changes in Equity|balance sheet|cash flows)|\Z)",
        r"balance sheet\s*\n(.*?)(?=\n(?:Statements of Comprehensive Income|Statements of Changes in Equity|cash flows)|\Z)",
    ],
    "Statements of Cash Flows": [
        r"cash flows\s*\n(.*?)(?=\Z)",
    ]
}

# Iterate through each cleaned document
for doc_text in cleaned_text_data:
    doc_segments = {}
    remaining_text = doc_text

    # Iterate through each financial segment and try to find its content using the defined patterns
    for segment_name, patterns in financial_segment_patterns.items():
        found_segment = False
        for pattern in patterns:
            match = re.search(pattern, remaining_text, re.DOTALL | re.IGNORECASE) # Use IGNORECASE for flexibility
            if match:
                doc_segments[segment_name] = match.group(1).strip()
                # Update remaining_text to be the part after the found segment if a match is found
                remaining_text = remaining_text[match.end():]
                found_segment = True
                break # Move to the next segment after finding a match

        if not found_segment:
             doc_segments[segment_name] = "Segment not found." # Indicate if a segment is not found after trying all patterns


    segmented_financial_statements.append(doc_segments)

print(f"Segmented {len(segmented_financial_statements)} documents into financial statements.")

# You can inspect the first segmented financial statements to see the results
import json
print(json.dumps(segmented_financial_statements[0], indent=2))

Segmented 2 documents into financial statements.
{
  "Statements of Operations / Income": "For the years ended December 31\n2023\n2022\nSales of products\n$\n13,127\n$\n12,044\nSales of services\n6,425\n6,297\nTotal revenues\n19,552\n18,341\nCost of products\n8,465\n7,975\nCost of services\n3,165\n3,187\nGross profit\n7,922\n7,179\nSelling, general, and administrative\n4,282\n3,631\nResearch and development\n1,205\n1,026\nTotal operating expenses\n5,487\n4,657\nOperating income\n2,435\n2,522\nInterest and other financial charges \u0096 net\n542\n77\nNon-operating benefit (income) costs\n(382)\n(5)\nOther (income) expense \u0096 net\n(86)\n(62)\nIncome from continuing operations before income taxes\n2,361\n2,512\nBenefit (provision) for income taxes\n(743)\n(563)\nNet income from continuing operations\n1,618\n1,949\nIncome (loss) from discontinued operations, net of taxes\n(4)\n18\nNet income\n1,614\n1,967\nNet (income) loss attributable to noncontrolling interests\n(46)\n(51)\nNet inco

### 1.5 From the segmented_financial_statements, create a data structure as below to store the information:

```json
{
    document: number, // document id
    segment: string,  // finacial segment like Operations, inancial Position / Balance Sheet,  Comprehensive Income
    line_item: string, Cash Flow
    2024: number, // value in each year
    2023: number,
}
```

In [8]:
financial_data = []
# Regex to find line items and their values for 2024, 2023 and 2022
line_item_pattern = re.compile(
    r"^(.*?)\s+"  # Capture the line item description
    r"\$\s*([\d,]+)\s+"  # Capture the 2024 value
    r"\$\s*([\d,]+)\s+"  # Capture the 2023 value
    r"\$\s*([\d,]+)\s+", # Capture the 2022 value
    re.MULTILINE # Pass the flag here
)

for i, doc_segments in enumerate(segmented_financial_statements):
    for segment_name, content in doc_segments.items():
        if content != "Segment not found.":
            # Find all matches in the content
            matches = line_item_pattern.finditer(content)
            for match in matches:
                line_item = match.group(1).strip()
                value_2024 = match.group(2).replace(',', '')
                value_2023 = match.group(3).replace(',', '')
                value_2022 = match.group(4).replace(',', '') # Corrected index for 2022 value
                # Add to our structured data list
                financial_data.append({
                    "document": i + 1,
                    "segment": segment_name,
                    "line_item": line_item,
                    "2024": int(value_2024),
                    "2023": int(value_2023),
                    "2022": int(value_2022)
                })

# Print the first 5 extracted key-value pairs
for item in financial_data[:5]:
    print(item)

{'document': 1, 'segment': 'Statements of Operations / Income', 'line_item': 'Impact on PBO/APBO at December 31, 2023', '2024': 940, '2023': 318, '2022': 40}
{'document': 1, 'segment': 'Statements of Operations / Income', 'line_item': 'Sales of products', '2024': 13127, '2023': 12044, '2022': 11165}
{'document': 1, 'segment': 'Statements of Operations / Income', 'line_item': 'Net income attributable to GE HealthCare common stockholders', '2024': 1385, '2023': 1916, '2022': 2247}
{'document': 1, 'segment': 'Statements of Operations / Income', 'line_item': 'Net income attributable to GE HealthCare', '2024': 1568, '2023': 1916, '2022': 2247}
{'document': 1, 'segment': 'Statements of Operations / Income', 'line_item': 'Comprehensive income attributable to GE HealthCare', '2024': 755, '2023': 1073, '2022': 2049}


### 1.6 Formulate Questions at least 50 (Q/A) pairs

In [9]:
generated_questions = []
count  = 0
# Iterate through the extracted financial data
for item in financial_data:
    line_item = item["line_item"]
    value_2024 = item["2024"]
    value_2023 = item["2023"]
    value_2022 = item["2022"]
    segment = item["segment"]

    # Question type 1: Value in a specific year
    generated_questions.append({
        "based_on_data_item": item,
        "question": f"What was the value of '{line_item}' in {2024} according to the {segment}?",
    })
    if (len(generated_questions) == 50):
      break;
    generated_questions.append({
        "based_on_data_item": item,
        "question": f"Find the value for '{line_item}' in {2023} from the {segment}.",
    })
    if (len(generated_questions) == 50):
      break;
    generated_questions.append({
        "based_on_data_item": item,
        "question": f"Could you provide the figure for '{line_item}' in {2022} as reported in the {segment}?",
    })
    if (len(generated_questions) == 50):
          break;
    # Question type 2: Change between two years
    generated_questions.append({
        "based_on_data_item": item,
        "question": f"How much did the '{line_item}' change from {2023} to {2024} based on the {segment}?",
    })
    if (len(generated_questions) == 50):
      break;
    generated_questions.append({
        "based_on_data_item": item,
        "question": f"What was the difference in '{line_item}' between {2022} and {2023} according to the {segment}?",
    })
    if (len(generated_questions) == 50):
      break;
    # Question type 3: Value across multiple years (if applicable and makes sense)
    # Only generate this if all three years have values
    if value_2024 is not None and value_2023 is not None and value_2022 is not None:
         generated_questions.append({
            "based_on_data_item": item,
            "question": f"What were the values for '{line_item}' for the years {2024}, {2023}, and {2022} in the {segment}?",
        })
    if (len(generated_questions) == 50):
      break;
# Print the first 10 generated questions to inspect
for q in generated_questions[:10]:
    print(q)

print(f"\nGenerated {len(generated_questions)} questions.")

{'based_on_data_item': {'document': 1, 'segment': 'Statements of Operations / Income', 'line_item': 'Impact on PBO/APBO at December 31, 2023', '2024': 940, '2023': 318, '2022': 40}, 'question': "What was the value of 'Impact on PBO/APBO at December 31, 2023' in 2024 according to the Statements of Operations / Income?"}
{'based_on_data_item': {'document': 1, 'segment': 'Statements of Operations / Income', 'line_item': 'Impact on PBO/APBO at December 31, 2023', '2024': 940, '2023': 318, '2022': 40}, 'question': "Find the value for 'Impact on PBO/APBO at December 31, 2023' in 2023 from the Statements of Operations / Income."}
{'based_on_data_item': {'document': 1, 'segment': 'Statements of Operations / Income', 'line_item': 'Impact on PBO/APBO at December 31, 2023', '2024': 940, '2023': 318, '2022': 40}, 'question': "Could you provide the figure for 'Impact on PBO/APBO at December 31, 2023' in 2022 as reported in the Statements of Operations / Income?"}
{'based_on_data_item': {'document':

#### For each questions formulated above generate the answers.

In [10]:
# Iterate through the generated questions
for q in generated_questions:
    item = q["based_on_data_item"]
    line_item = item["line_item"]
    value_2024 = item["2024"]
    value_2023 = item["2023"]
    value_2022 = item["2022"]
    segment = item["segment"]
    question_text = q["question"]
    answer = ""

    # Determine the type of question and extract/calculate the answer
    if f"in {2024}" in question_text and f"{2023}, and {2022}" not in question_text:
        answer = f"The value of '{line_item}' in 2024 was {value_2024} millions of dollars."
    elif f"in {2023}" in question_text and f"{2024}, and {2022}" not in question_text:
        answer = f"The value of '{line_item}' in 2023 was {value_2023} millions of dollars."
    elif f"in {2022}" in question_text and f"{2024}, and {2023}" not in question_text:
        answer = f"The value of '{line_item}' in 2022 was {value_2022} millions of dollars."
    elif f"change from {2023} to {2024}" in question_text:
        change = value_2024 - value_2023
        answer = f"The change in '{line_item}' from 2023 to 2024 was {change} millions of dollars."
    elif f"difference in '{line_item}' between {2022} and {2023}" in question_text:
        difference = value_2023 - value_2022
        answer = f"The difference in '{line_item}' between 2022 and 2023 was {difference} millions of dollars."
    elif f"for the years {2024}, {2023}, and {2022}" in question_text:
         answer = f"The values for '{line_item}' for the years 2024, 2023, and 2022 were {value_2024}, {value_2023}, and {value_2022} millions of dollars, respectively."
    else:
        # Handle any unexpected question formats or if the pattern doesn't match
        answer = "Could not determine the specific answer based on the question format."


    q['answer'] = answer

# Print the first 10 question-answer pairs
print("First 10 Generated Q/A Pairs:")
for q in generated_questions[:10]:
    print(f"Question: {q['question']}")
    print(f"Answer: {q['answer']}")
    print("-" * 20)

generated_questions_answer = generated_questions.copy();
# Print the total number of Q/A pairs generated
print(f"\nTotal Q/A pairs generated: {len(generated_questions)}")

First 10 Generated Q/A Pairs:
Question: What was the value of 'Impact on PBO/APBO at December 31, 2023' in 2024 according to the Statements of Operations / Income?
Answer: The value of 'Impact on PBO/APBO at December 31, 2023' in 2024 was 940 millions of dollars.
--------------------
Question: Find the value for 'Impact on PBO/APBO at December 31, 2023' in 2023 from the Statements of Operations / Income.
Answer: The value of 'Impact on PBO/APBO at December 31, 2023' in 2023 was 318 millions of dollars.
--------------------
Question: Could you provide the figure for 'Impact on PBO/APBO at December 31, 2023' in 2022 as reported in the Statements of Operations / Income?
Answer: The value of 'Impact on PBO/APBO at December 31, 2023' in 2022 was 40 millions of dollars.
--------------------
Question: How much did the 'Impact on PBO/APBO at December 31, 2023' change from 2023 to 2024 based on the Statements of Operations / Income?
Answer: The change in 'Impact on PBO/APBO at December 31, 2023

## 2. Retrieval-Augmented Generation (RAG) System Implementation

### 2.1 Data Processing

 - Split the cleaned text into chunks suitable for retrieval with at least two chunk sizes (e.g., 100 and 400 tokens).
 - Assign unique IDs and metadata to chunks.


In [11]:
# Assuming 'cleaned_text_data' is available from the previous data cleaning step

def chunk_text(text, chunk_size=100, overlap=20):
    """
    Splits text into overlapping chunks.

    Args:
        text (str): The input text.
        chunk_size (int): The desired size of each chunk (in words or tokens, depending on how you split).
        overlap (int): The number of words/tokens to overlap between chunks.

    Returns:
        list: A list of text chunks.
    """
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size - overlap):
        chunk = " ".join(words[i:i + chunk_size])
        chunks.append(chunk)
    return chunks

# Define chunk sizes
chunk_sizes = [100, 400]
chunked_data = {}

# Process each cleaned document and create chunks of different sizes
for doc_id, cleaned_text in enumerate(cleaned_text_data):
    for size in chunk_sizes:
        chunks = chunk_text(cleaned_text, chunk_size=size)
        if f'chunks_{size}' not in chunked_data:
            chunked_data[f'chunks_{size}'] = []

        for i, chunk in enumerate(chunks):
            chunked_data[f'chunks_{size}'].append({
                'id': f'doc_{doc_id}_chunk_{i}_size_{size}',
                'content': chunk,
                'metadata': {
                    'document_id': doc_id,
                    'chunk_id': i,
                    'chunk_size': size
                }
            })

# Print some information about the generated chunks
for size, chunks in chunked_data.items():
    print(f"Generated {len(chunks)} chunks of size {size}.")
    if len(chunks) > 0:
        print(f"First chunk ({size}): {chunks[0]['content'][:200]}...") # Print first 200 characters of the first chunk

Generated 1997 chunks of size chunks_100.
First chunk (chunks_100): gehc-20231231 false 0001932393 FY 2023 http://www.gehealthcare.com/20231231#PropertyPlantAndEquipmentAndOperatingLeaseRightOfUseAssetAfterAccumulatedDepreciationAndAmortization http://www.gehealthcare...
Generated 421 chunks of size chunks_400.
First chunk (chunks_400): gehc-20231231 false 0001932393 FY 2023 http://www.gehealthcare.com/20231231#PropertyPlantAndEquipmentAndOperatingLeaseRightOfUseAssetAfterAccumulatedDepreciationAndAmortization http://www.gehealthcare...


### 2.2 Embedding & Indexing

#### 2.2.1 Embed the chunks using the all-MiniLM-L6-v2

In [12]:
# Install the sentence-transformers library if you haven't already

from sentence_transformers import SentenceTransformer
import numpy as np

# Load the sentence embedding model
try:
    embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
    print("Sentence embedding model 'all-MiniLM-L6-v2' loaded successfully.")
except Exception as e:
    print(f"Error loading sentence embedding model: {e}")
    print("Please ensure you have an active internet connection to download the model.")
    embedding_model = None # Set to None if loading fails

# Assuming 'chunked_data' is available from the previous chunking step
# We will embed chunks of size 100 and 400

if embedding_model is not None:
    embedded_chunks = {}
    for size, chunks in chunked_data.items():
        print(f"Embedding {len(chunks)} chunks of size {size.split('_')[-1]}...")
        # Extract the content of the chunks to embed
        chunks_content = [chunk['content'] for chunk in chunks]

        # Generate embeddings
        try:
            embeddings = embedding_model.encode(chunks_content, show_progress_bar=True)
            embedded_chunks[size] = {
                'chunks': chunks, # Keep the original chunk data
                'embeddings': embeddings
            }
            print(f"Finished embedding chunks of size {size.split('_')[-1]}. Shape of embeddings: {embeddings.shape}")
        except Exception as e:
            print(f"Error during embedding for chunk size {size.split('_')[-1]}: {e}")
            embedded_chunks[size] = None # Indicate if embedding failed


else:
    print("Embedding model not loaded, skipping embedding step.")

# You can inspect the shape of the embeddings for one chunk size, e.g., size 100
# if embedded_chunks.get('chunks_100') and embedded_chunks['chunks_100']['embeddings'] is not None:
#     print(f"\nShape of embeddings for chunk size 100: {embedded_chunks['chunks_100']['embeddings'].shape}")
#     print(f"Shape of embeddings for chunk size 400: {embedded_chunks['chunks_400']['embeddings'].shape}")

Sentence embedding model 'all-MiniLM-L6-v2' loaded successfully.
Embedding 1997 chunks of size 100...


Batches:   0%|          | 0/63 [00:00<?, ?it/s]

Finished embedding chunks of size 100. Shape of embeddings: (1997, 384)
Embedding 421 chunks of size 400...


Batches:   0%|          | 0/14 [00:00<?, ?it/s]

Finished embedding chunks of size 400. Shape of embeddings: (421, 384)


#### 2.2.2 Build dense vector store to capture semantic relation using ChromaDB

In [13]:
import chromadb

# Initialize ChromaDB client
# By default, it will use an in-memory database. You can configure it for persistent storage if needed.
try:
    client = chromadb.Client()
    print("ChromaDB client initialized.")
except Exception as e:
    print(f"Error initializing ChromaDB client: {e}")
    client = None # Set to None if initialization fails


# Create or get a collection for our chunks
# A collection is like a table in a traditional database.
collection_name = "financial_report_chunks"
try:
    collection = client.get_or_create_collection(name=collection_name)
    print(f"ChromaDB collection '{collection_name}' created or retrieved.")
except Exception as e:
    print(f"Error getting or creating ChromaDB collection: {e}")
    collection = None # Set to None if collection creation fails

# Add the embedded chunks to the collection
# We'll add the chunks from one of the sizes, for example, size 100, to the dense vector store.
# You could potentially add both sizes to separate collections or experiment with different strategies.
if collection is not None and embedded_chunks.get('chunks_100') and embedded_chunks['chunks_100']['embeddings'] is not None:
    chunks_to_add = embedded_chunks['chunks_100']['chunks']
    embeddings_to_add = embedded_chunks['chunks_100']['embeddings']

    # Prepare data for ChromaDB
    ids = [chunk['id'] for chunk in chunks_to_add]
    documents = [chunk['content'] for chunk in chunks_to_add]
    metadatas = [chunk['metadata'] for chunk in chunks_to_add]


    # Add to ChromaDB in batches to avoid potential issues with large numbers of documents
    batch_size = 100  # Adjust batch size as needed
    for i in range(0, len(ids), batch_size):
        batch_ids = ids[i:i + batch_size]
        batch_documents = documents[i:i + batch_size]
        batch_embeddings = embeddings_to_add[i:i + batch_size]
        batch_metadatas = metadatas[i:i+ batch_size]

        try:
            collection.add(
                embeddings=batch_embeddings.tolist(), # ChromaDB expects a list of lists
                documents=batch_documents,
                metadatas=batch_metadatas,
                ids=batch_ids
            )
            print(f"Added batch {i//batch_size + 1} to ChromaDB.")
        except Exception as e:
            print(f"Error adding batch {i//batch_size + 1} to ChromaDB: {e}")

    print(f"Finished adding {len(ids)} chunks to ChromaDB collection '{collection_name}'.")

# You can verify the count of items in the collection
if collection is not None:
    try:
        count = collection.count()
        print(f"Total items in ChromaDB collection '{collection_name}': {count}")
    except Exception as e:
        print(f"Error getting count from ChromaDB collection: {e}")

ChromaDB client initialized.
ChromaDB collection 'financial_report_chunks' created or retrieved.
Added batch 1 to ChromaDB.
Added batch 2 to ChromaDB.
Added batch 3 to ChromaDB.
Added batch 4 to ChromaDB.
Added batch 5 to ChromaDB.
Added batch 6 to ChromaDB.
Added batch 7 to ChromaDB.
Added batch 8 to ChromaDB.
Added batch 9 to ChromaDB.
Added batch 10 to ChromaDB.
Added batch 11 to ChromaDB.
Added batch 12 to ChromaDB.
Added batch 13 to ChromaDB.
Added batch 14 to ChromaDB.
Added batch 15 to ChromaDB.
Added batch 16 to ChromaDB.
Added batch 17 to ChromaDB.
Added batch 18 to ChromaDB.
Added batch 19 to ChromaDB.
Added batch 20 to ChromaDB.
Finished adding 1997 chunks to ChromaDB collection 'financial_report_chunks'.
Total items in ChromaDB collection 'financial_report_chunks': 1997


### 2.2.3 Create Sparse index (BM25 or TF-IDF) for keyword retrieval

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Assuming 'chunked_data' is available from the previous chunking step
# We will use the chunks of size 100 for building the TF-IDF index

# Extract the content of the chunks
if 'chunks_100' in embedded_chunks:
    chunks_to_embed = embedded_chunks['chunks_100']['chunks']
    chunks_content = [chunk['content'] for chunk in chunks_to_embed]

    # Initialize TF-IDF Vectorizer
    # You can adjust parameters like max_features, min_df, max_df, ngram_range
    tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)

    # Fit the vectorizer to the chunk content and transform the chunks
    try:
        tfidf_matrix = tfidf_vectorizer.fit_transform(chunks_content)
        print("TF-IDF vectorizer fitted and matrix created successfully.")
        print(f"Shape of TF-IDF matrix: {tfidf_matrix.shape}")
    except Exception as e:
        print(f"Error creating TF-IDF matrix: {e}")
        tfidf_vectorizer = None # Set to None if fitting fails
        tfidf_matrix = None # Set to None if fitting fails

else:
    print("Chunks of size 100 not found in chunked_data. Cannot build TF-IDF index.")
    tfidf_vectorizer = None
    tfidf_matrix = None

# The tfidf_matrix now represents the sparse index of our chunks.

TF-IDF vectorizer fitted and matrix created successfully.
Shape of TF-IDF matrix: (1997, 5000)


### 2.3 Hybrid Retrieval Pipeline

#### 2.3.1 Preprocess data clean

In [15]:
import re
from nltk.corpus import stopwords
import nltk

# Download stopwords if not already downloaded
try:
    nltk.data.find('corpora/stopwords')
except LookupError: # Corrected exception type
    nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

def preprocess_query(query):
    """
    Cleans, lowercases, and removes stopwords from a query.

    Args:
        query (str): The input query string.

    Returns:
        str: The preprocessed query string.
    """
    # Convert to lowercase
    query = query.lower()
    # Remove special characters and punctuation
    query = re.sub(r'[^a-z0-9\s]', '', query)
    # Remove stopwords
    query = ' '.join([word for word in query.split() if word not in stop_words])
    return query

# Example Usage:
# user_query = "What was the total revenues in 2024 for GE Healthcare?"
# preprocessed_query = preprocess_query(user_query)
# print(f"Original Query: {user_query}")
# print(f"Preprocessed Query: {preprocessed_query}")

#### 2.3.2 Generate query embedding.

In [16]:
# Assuming 'preprocess_query' and 'embedding_model' are available from previous steps.

def generate_query_embedding(query, embedding_model):
    """
    Generates the embedding for a preprocessed query.

    Args:
        query (str): The preprocessed query string.
        embedding_model: The sentence embedding model.

    Returns:
        numpy.ndarray: The query embedding.
    """
    if embedding_model is None:
        print("Embedding model is not loaded. Cannot generate query embedding.")
        return None
    try:
        # Encode the query to get its embedding
        query_embedding = embedding_model.encode(query)
        print("Query embedding generated successfully.")
        return query_embedding
    except Exception as e:
        print(f"Error generating query embedding: {e}")
        return None

# Example Usage (assuming 'user_query' is defined):
# preprocessed_user_query = preprocess_query(user_query)
# query_embedding = generate_query_embedding(preprocessed_user_query, embedding_model)

# if query_embedding is not None:
#     print(f"Shape of query embedding: {query_embedding.shape}")

#### 2.3.3 Retrieve top-N chunks from:
- Dense retrieval (vector similarity).
- Sparse retrieval (BM25).

In [17]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Assuming 'collection', 'embedding_model', 'tfidf_vectorizer', 'tfidf_matrix',
# and 'chunked_data' are available from previous steps.

def dense_retrieve(query, collection, embedding_model, n_results=5):
    """
    Retrieves top-N relevant chunks using dense vector similarity with ChromaDB.

    Args:
        query (str): The user query.
        collection (chromadb.Collection): The ChromaDB collection.
        embedding_model: The sentence embedding model.
        n_results (int): The number of results to retrieve.

    Returns:
        list: A list of dictionaries, where each dictionary contains the 'id',
              'content', and 'metadata' of a retrieved chunk. Returns an empty
              list if retrieval fails or no results are found.
    """
    if collection is None or embedding_model is None:
        print("ChromaDB collection or embedding model not loaded. Cannot perform dense retrieval.")
        return []

    try:
        # Generate embedding for the query
        query_embedding = embedding_model.encode([query]).tolist() # ChromaDB expects a list of lists

        # Query ChromaDB
        results = collection.query(
            query_embeddings=query_embedding,
            n_results=n_results,
            include=['documents', 'metadatas'] # Request documents (content) and metadatas
        )

        # Process the results
        retrieved_chunks = []
        if results and results['ids'] and results['documents'] and results['metadatas']:
            # Assuming the structure of results is as expected from ChromaDB query with include=['documents', 'metadatas']
            # results['ids'][0] is a list of ids for the first query (since we queried with a list of one embedding)
            # results['documents'][0] is a list of document contents for the first query
            # results['metadatas'][0] is a list of metadatas for the first query

            for i in range(len(results['ids'][0])):
                 retrieved_chunks.append({
                    'id': results['ids'][0][i],
                    'content': results['documents'][0][i],
                    'metadata': results['metadatas'][0][i]
                })

        print(f"Dense retrieval found {len(retrieved_chunks)} results.")
        return retrieved_chunks

    except Exception as e:
        print(f"Error during dense retrieval: {e}")
        return []


def sparse_retrieve_tfidf(query, tfidf_vectorizer, tfidf_matrix, chunks, n_results=5):
    """
    Retrieves top-N relevant chunks using sparse keyword similarity (TF-IDF).

    Args:
        query (str): The user query.
        tfidf_vectorizer (TfidfVectorizer): The fitted TF-IDF vectorizer.
        tfidf_matrix (sparse matrix): The TF-IDF matrix of the chunks.
        chunks (list): A list of chunk dictionaries (e.g., from chunked_data['chunks_100']['chunks']).
        n_results (int): The number of results to retrieve.

    Returns:
        list: A list of dictionaries, where each dictionary contains the 'id',
              'content', and 'metadata' of a retrieved chunk. Returns an empty
              list if retrieval fails or no results are found.
    """
    if tfidf_vectorizer is None or tfidf_matrix is None or not chunks:
        print("TF-IDF vectorizer, matrix, or chunks not available. Cannot perform sparse retrieval.")
        return []

    try:
        # Transform the query using the same TF-IDF vectorizer
        query_tfidf = tfidf_vectorizer.transform([query])

        # Calculate cosine similarity between the query TF-IDF and chunk TF-IDF matrix
        cosine_similarities = cosine_similarity(query_tfidf, tfidf_matrix).flatten()

        # Get the indices of the top-N most similar chunks
        # Use argpartition for efficiency if n_results is much smaller than the total number of chunks
        # Or use argsort if you need the results sorted by similarity
        # top_n_indices = np.argsort(cosine_similarities)[::-1][:n_results] # Gets indices in descending order of similarity
        top_n_indices = np.argpartition(cosine_similarities, -n_results)[-n_results:] # More efficient for large matrices

        # Filter out indices that might be out of bounds if n_results is larger than available chunks
        top_n_indices = top_n_indices[top_n_indices < len(chunks)]

        # Retrieve the actual chunks based on the indices
        retrieved_chunks = []
        # Sort by similarity score (optional, but good for presentation)
        # Sorting indices by similarity score in descending order before picking top-N
        sorted_indices = top_n_indices[np.argsort(cosine_similarities[top_n_indices])][::-1]


        for idx in sorted_indices:
             # Explicitly cast idx to int just in case
             int_idx = int(idx)
             retrieved_chunks.append({
                'id': chunks[int_idx]['id'],
                'content': chunks[int_idx]['content'],
                'metadata': chunks[int_idx]['metadata']
            })


        print(f"Sparse retrieval found {len(retrieved_chunks)} results.")
        return retrieved_chunks

    except Exception as e:
        print(f"Error during sparse retrieval: {e}")
        return []

# Example Usage (assuming 'preprocessed_user_query' is defined):
# user_query = "What was the total revenues in 2024?"
# preprocessed_user_query = preprocess_query(user_query) # Make sure preprocess_query is run first

# dense_results = dense_retrieve(preprocessed_user_query, collection, embedding_model, n_results=5)
# print("\nDense Retrieval Results:")
# for chunk in dense_results:
#     print(f"- ID: {chunk['id']}, Content: {chunk['content'][:100]}...") # Print first 100 chars of content

# sparse_results = sparse_retrieve_tfidf(preprocessed_user_query, tfidf_vectorizer, tfidf_matrix, chunked_data['chunks_100']['chunks'], n_results=5)
# print("\nSparse Retrieval Results:")
# for chunk in sparse_results:
#      print(f"- ID: {chunk['id']}, Content: {chunk['content'][:100]}...") # Print first 100 chars of content

In [18]:
def combine_retrieval_results(dense_results, sparse_results):
    """
    Combines the results from dense and sparse retrieval.

    Args:
        dense_results (list): List of chunks from dense retrieval.
        sparse_results (list): List of chunks from sparse retrieval.

    Returns:
        list: A list of unique retrieved chunks.
    """
    combined_chunks = {}

    # Add dense retrieval results
    for chunk in dense_results:
        combined_chunks[chunk['id']] = chunk # Use chunk ID to handle potential duplicates

    # Add sparse retrieval results
    for chunk in sparse_results:
        combined_chunks[chunk['id']] = chunk # Overwrite if already exists (optional, depending on desired behavior)

    # Convert the dictionary values back to a list
    return list(combined_chunks.values())

# Example Usage (assuming 'dense_results' and 'sparse_results' are defined from previous steps):
# combined_results = combine_retrieval_results(dense_results, sparse_results)
# print(f"\nCombined Retrieval Results: Found {len(combined_results)} unique chunks.")
# for chunk in combined_results:
#      print(f"- ID: {chunk['id']}, Content: {chunk['content'][:100]}...") # Print first 100 chars of content

#### 2.3.4 Advanced RAG Technique (Select One)



In [19]:
# Define the number of initial candidates for broad retrieval
n_broad_dense = 10 # Retrieve more candidates from dense retrieval
n_broad_sparse = 10 # Retrieve more candidates from sparse retrieval

# Example User Query
user_query = "What was the total revenues in 2024 for GE Healthcare?"
# Preprocess the query
preprocessed_user_query = preprocess_query(user_query) # Make sure preprocess_query is run first

# Perform broad dense retrieval
broad_dense_results = dense_retrieve(preprocessed_user_query, collection, embedding_model, n_results=n_broad_dense)
print(f"\nBroad Dense Retrieval found {len(broad_dense_results)} candidates.")

# Perform broad sparse retrieval
# Assuming 'chunks_to_embed' is the list of chunks used for TF-IDF
# Corrected the variable name to access the chunks from embedded_chunks
chunks_to_embed = embedded_chunks['chunks_100']['chunks'] # Make sure this is correctly referenced
broad_sparse_results = sparse_retrieve_tfidf(preprocessed_user_query, tfidf_vectorizer, tfidf_matrix, chunks_to_embed, n_results=n_broad_sparse)
print(f"Broad Sparse Retrieval found {len(broad_sparse_results)} candidates.")

# The next step will be to combine these broad results.

Dense retrieval found 10 results.

Broad Dense Retrieval found 10 candidates.
Sparse retrieval found 10 results.
Broad Sparse Retrieval found 10 candidates.


In [20]:
# Step 1: Combine broad retrieval results (already implemented in a previous cell)
# Assuming 'broad_dense_results' and 'broad_sparse_results' are available from the previous execution

combined_results = combine_retrieval_results(broad_dense_results, broad_sparse_results)
print(f"Combined retrieval results: Found {len(combined_results)} unique chunks.")

# Step 2: Load a cross-encoder model for reranking
# Install the sentence-transformers library if you haven't already (already done)

try:
    from sentence_transformers import CrossEncoder
    # Load a pre-trained cross-encoder model suitable for reranking
    # ms-marco-MiniLM-L-6-v2 is a good choice for general domain, you might explore others
    cross_encoder_model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2') # Corrected model name
    print("Cross-encoder model 'cross-encoder/ms-marco-MiniLM-L-6-v2' loaded successfully.")
except Exception as e:
    print(f"Error loading cross-encoder model: {e}")
    cross_encoder_model = None # Set to None if loading fails
    print("Please ensure you have an active internet connection to download the model.")

Combined retrieval results: Found 19 unique chunks.
Cross-encoder model 'cross-encoder/ms-marco-MiniLM-L-6-v2' loaded successfully.


### 2.5 Response generation

#### Guard Rail function defination.

In [1]:
FINANCIAL_KEYWORDS = [
    'value', 'sales', 'income', 'cost', 'pbo', 'apbo', 'operations',
    'financial', 'stockholders', 'change', 'difference', 'revenue', 'products'
]

def is_relevant(question):
    """Checks if the question contains any financial keywords."""
    return any(keyword in question.lower() for keyword in FINANCIAL_KEYWORDS)

# Example Usage
print(f"'What is the value of sales in 2024?' is relevant: {is_relevant('What is the value of sales in 2024?')}")
print(f"'What is the capital of France?' is relevant: {is_relevant('What is the capital of France?')}")

'What is the value of sales in 2024?' is relevant: True
'What is the capital of France?' is relevant: False


In [None]:
# Step 3: Rerank the combined results using a cross-encoder model
# Assuming 'combined_results' and 'cross_encoder_model' are available from previous steps.

if cross_encoder_model is not None and combined_results:
    print("\nReranking combined results...")
    # Prepare sentence pairs for the cross-encoder: [query, document]
    sentence_pairs = [[preprocessed_user_query, chunk['content']] for chunk in combined_results]

    # Get scores from the cross-encoder
    try:
        reranking_scores = cross_encoder_model.predict(sentence_pairs)

        # Combine the original chunks with their reranking scores
        scored_results = []
        for i, chunk in enumerate(combined_results):
            scored_results.append({
                'chunk': chunk,
                'score': reranking_scores[i]
            })

        # Sort the results by reranking score in descending order
        reranked_results = sorted(scored_results, key=lambda x: x['score'], reverse=True)

        print(f"Finished reranking. Top score: {reranked_results[0]['score'] if reranked_results else 'N/A'}")

    except Exception as e:
        print(f"Error during reranking: {e}")
        reranked_results = [] # Set to empty list if reranking fails

else:
    print("\nSkipping reranking due to missing cross-encoder model or combined results.")
    reranked_results = [] # Set to empty list if prerequisites are not met


# Step 4: Select top-k chunks for response generation
# Assuming 'reranked_results' are available from the previous reranking step.

k = 3  # Define the number of top chunks to use as context
top_k_chunks = [item['chunk'] for item in reranked_results[:k]]

print(f"\nSelected top {k} chunks for response generation.")
for chunk in top_k_chunks:
    print(f"- ID: {chunk['id']}, Content: {chunk['content'][:150]}...")

# Step 5: Generate Answer using a small generative model (GPT-2 Small)
# Install transformers library if not already installed
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load GPT-2 Small model and tokenizer
try:
    model_name = "gpt2" # Using the base gpt2 model which is equivalent to gpt2-small
    tokenizer = GPT2Tokenizer.from_pretrained(model_name)
    model = GPT2LMHeadModel.from_pretrained(model_name)
    print(f"\nLoaded generative model: {model_name}")
except Exception as e:
    print(f"Error loading generative model {model_name}: {e}")
    tokenizer = None
    model = None


if tokenizer is not None and model is not None and top_k_chunks and is_relevant(user_query):
    # Concatenate retrieved passages and user query
    # Add a clear separator between context and query
    context = "\n".join([chunk['content'] for chunk in top_k_chunks])
    prompt = f"Context:\n{context}\n\nQuestion: {user_query}\n\nAnswer:"

    # Limit input tokens to the model context window
    # GPT-2 has a context window of 1024 tokens
    max_model_input_length = tokenizer.model_max_length # This should be 1024 for gpt2
    max_prompt_length = max_model_input_length - 50 # Reserve some tokens for the answer

    # Encode the prompt and truncate if necessary
    encoded_prompt = tokenizer.encode(prompt, max_length=max_prompt_length, truncation=True, return_tensors="pt")

    # Generate the answer
    try:
        print("\nGenerating answer...")
        output_sequences = model.generate(
            encoded_prompt,
            max_length=max_model_input_length,
            num_return_sequences=1,
            no_repeat_ngram_size=2,
            early_stopping=True,
            temperature=0.7, # Adjust temperature for creativity vs focus
            top_k=50, # Limit the vocabulary to the top 50 tokens
            top_p=0.95 # Nucleus sampling
        )

        # Decode the generated answer
        generated_text = tokenizer.decode(output_sequences[0], skip_special_tokens=True)

        # Extract only the answer part (assuming the model follows the "Answer:" format)
        answer_start = generated_text.find("Answer:")
        if answer_start != -1:
            final_answer = generated_text[answer_start + len("Answer:"):].strip()
        else:
            final_answer = generated_text.strip() # If "Answer:" not found, use the whole generated text

        print("\nGenerated Answer:")
        print(final_answer)

    except Exception as e:
        print(f"Error during answer generation: {e}")

else:
    print("\nSkipping answer generation due to missing model, tokenizer, or chunks.")

# Step 6: Finish task (This will be handled by the Speak field and final summary)


Reranking combined results...
Finished reranking. Top score: 8.195188522338867

Selected top 3 chunks for response generation.
- ID: doc_0_chunk_517_size_100, Content: operations 1,618 1,949 Income (loss) from discontinued operations, net of taxes (4) 18 Net income 1,614 1,967 Net (income) loss attributable to noncon...
- ID: doc_1_chunk_445_size_100, Content: World revenues were $3,158 million, growing 5% or $162 million due to growth in PDx, Imaging, and AVS revenues, partially offset by unfavorable foreig...
- ID: doc_1_chunk_434_size_100, Content: 2,361 2,512 Benefit (provision) for income taxes (531) (743) (563) Net income from continuing operations 2,050 1,618 1,949 Income (loss) from disconti...


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'early_stopping']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.



Loaded generative model: gpt2

Generating answer...

Generated Answer:
The total revenue for the year ended Dec 31 was $1,811 million. The net revenue was primarily due primarily to the acquisition of GE Medical Services, which was a $2.4 billion acquisition.
. . .
,
 (6)
The net income was partially due due partially to a decrease in the net profit of the Company's medical device business. In addition, the decrease was due largely to an increase in operating expenses. As a result of these changes, GE was able to reduce its operating costs by $4.2 billion. This decrease is partially attributable primarily because of a reduction in its net operating income of $0.9 billion, primarily related to its medical devices business, as well as a decline in net revenues of approximately $5.7 billion due partly to lower operating margins. These changes were partially driven by the increase of operating margin in our medical equipment business and the reduction of our operating expense. We believe 

### 2.7 Build Interface Development