#### **Synthetic Data Generation for RAG Evaluation**

##### **Overview**
This notebook demonstrates how to use Large Language Models (LLMs) to generate synthetic question-answer pairs from structured financial documents, specifically **10-K filings**. By leveraging LLMs, we can create high-quality Q&A datasets that help evaluate **Retrieval-Augmented Generation (RAG)** systems in financial applications.

##### **Why Synthetic Data?**
Manually curated Q&A datasets are **expensive and time-consuming** to create. Synthetic data generation using LLMs offers a scalable alternative, enabling:
- **Efficient dataset creation** for RAG model evaluation.
- **Diverse and comprehensive** question sets from real-world documents.
- **Customization** based on specific retrieval and generative tasks.

##### **What You’ll Learn**
In this notebook, we will:
1. **Preprocess** 10-K filings by extracting relevant text.
2. **Generate** synthetic question-answer pairs using an LLM.
3. **Populate** RAG answers for each generated question to build an evaluation dataset.

Let's get started! 🚀


##### 1️⃣ Environment Setup
🔧 Import Required Dependencies

In [19]:
import argparse
import asyncio
import csv
import getpass
import json
import os
import re
from concurrent.futures import ThreadPoolExecutor

import nest_asyncio
import networkx as nx
import pandas as pd
# Query the graph using LangChain
from langchain.chains.graph_qa.base import GraphQAChain
from langchain.indexes.graph import NetworkxEntityGraph
from langchain.prompts import PromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_nvidia_ai_endpoints import ChatNVIDIA
from openai import OpenAI

# Allow nested event loops so asyncio code can run inside Jupyter
nest_asyncio.apply()

🔑 Setting Up NVIDIA API Key

To access NVIDIA AI Endpoints, you need to provide a valid **NVIDIA API key**.

- If the key is not already set as an environment variable, you will be prompted to enter it.
- The key should start with `nvapi-`, ensuring it is correctly formatted.
- This step is essential for interacting with NVIDIA's LLM services.

Run the following cell to set up your API key:


In [3]:
# Ensure NVIDIA API key is set
if not os.environ.get("NVIDIA_API_KEY", "").startswith("nvapi-"):
    nvapi_key = getpass.getpass("Enter your NVIDIA API key: ")
    assert nvapi_key.startswith("nvapi-"), f"{nvapi_key[:5]}... is not a valid key"
    os.environ["NVIDIA_API_KEY"] = nvapi_key

client = OpenAI(
  base_url = "https://integrate.api.nvidia.com/v1",
  api_key = os.environ["NVIDIA_API_KEY"] 
)

##### 2️⃣ Load and Preprocess 10-K Filings

📄 What is a 10-K?
- A 10-K is an annual report filed with the SEC that provides a detailed overview of a company's financial performance and risks.

🔍 Key Sections We Will Use:
- Item 1A (Risk Factors) – Outlines significant risks the company faces.
- Item 7 (MD&A) – Management’s review of financial results, trends, and business performance.

In [4]:
def read_10k_sections(file_path):
    """Reads a 10-K JSON file and extracts Item 1A and Item 7 sections."""
    with open(file_path, "r", encoding="utf-8") as f:
        data = json.load(f)

    return data.get("item_1A", "").strip(), data.get("item_7", "").strip()

# Example 10-k we'll use in this notebook
file_path = "data/Altair Engineering Inc_10K_2020.json"
item_1a_text, item_7_text = read_10k_sections(file_path)

# Preview text
print("📌 Item 1A (Risk Factors) - Preview:\n", item_1a_text[:500], "...\n")
print("📌 Item 7 (MD&A) - Preview:\n", item_7_text[:500], "...")


📌 Item 1A (Risk Factors) - Preview:
 Item 1A. Risk Factors
An investment in our Class A common stock involves a high degree of risk. You should carefully consider the risks and uncertainties described below, together with all the other information in this Annual Report on Form 10-K, including “Management’s Discussion and Analysis of Financial Condition and Results of Operations” and the consolidated financial statements and the related notes. If any of the following risks actually occurs, our business, reputation, financial conditi ...

📌 Item 7 (MD&A) - Preview:
 Item 7. Management’s Discussion and Analysis of Financial Condition and Results of Operations
The following discussion and analysis of our financial condition and results of operations should be read in conjunction with our audited consolidated financial statements (and notes thereto) for the year ended December 31, 2020 included elsewhere in this Annual Report on Form 10-K. This discussion contains forward-looking statements
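
The loader above and the later pipeline cells read four keys from each 10-K JSON file: `company`, `period_of_report`, `item_1A`, and `item_7`. The sketch below writes a tiny placeholder record with those keys to a temporary file and reads it back. The record values, the file name `example_10k.json`, and the helper name `read_sections` are made up for illustration, and real files may carry additional fields.

```python
import json
import os
import tempfile

# Placeholder record; only the key names come from the notebook
record = {
    "company": "Example Corp",
    "period_of_report": "2020-12-31",
    "item_1A": "  Item 1A. Risk Factors\nPlaceholder risk text.  ",
    "item_7": "Item 7. MD&A\nPlaceholder discussion text.",
}

def read_sections(file_path):
    """Hypothetical stand-in for read_10k_sections with the same logic."""
    with open(file_path, "r", encoding="utf-8") as f:
        data = json.load(f)
    # Missing keys fall back to an empty string instead of raising KeyError
    return data.get("item_1A", "").strip(), data.get("item_7", "").strip()

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example_10k.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(record, f)
    item_1a, item_7 = read_sections(path)

print(repr(item_1a))
print(repr(item_7))
```

Because the loader uses `dict.get` with an empty-string default, a file that lacks one of the sections yields an empty string for it rather than an error.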

##### 3️⃣ Data Preprocessing

📌 Replace Company References

In [5]:
replacements_pattern = re.compile(r"\b(we|us|our|the Company|We|Us|Our|The Company)\b")

def preprocess_text(text, company_name):
    """Replaces company references (we, us, our, the Company) with the actual company name."""
    def _replace(match):
        # Only "our" is possessive; "we", "us", and "the Company" map to the plain name
        return f"{company_name}'s" if match.group(0).lower() == "our" else company_name

    return replacements_pattern.sub(_replace, text)
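
As a quick illustration of the substitution, the sketch below applies the same regular expression to a made-up sentence. The helper `replace_refs` and the company name "Acme Corp" are hypothetical, and this version assumes that only "our" should become a possessive.

```python
import re

# Same pattern as in the notebook cell
pattern = re.compile(r"\b(we|us|our|the Company|We|Us|Our|The Company)\b")

def replace_refs(text, company_name):
    """Hypothetical helper: swap first-person references for the company name."""
    def _sub(match):
        # "our" becomes a possessive; every other matched form becomes the plain name
        if match.group(0).lower() == "our":
            return f"{company_name}'s"
        return company_name
    return pattern.sub(_sub, text)

sample = "We believe our customers rely on us."
result = replace_refs(sample, "Acme Corp")
print(result)
# Acme Corp believe Acme Corp's customers rely on Acme Corp.
```

The word boundaries (`\b`) keep the pattern from firing inside longer words such as "customers". The replacement is purely lexical, so verb agreement ("Acme Corp believe") is left as-is.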

📌 Handle Escaped Characters in JSON

In [6]:
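def preprocess_json_content(json_content):
    """Fix improperly escaped sequences in JSON data."""
    # Turn literal backslash-n sequences into real newlines
    json_content = re.sub(r'\\n', '\n', json_content)
    # Matches \uXXXX escapes; as written, the replacement re-emits them unchanged
    json_content = re.sub(r'\\u([0-9a-fA-F]{4})', r'\\u\1', json_content)
    return json_content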
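
To make the first substitution concrete, here is a minimal sketch on a made-up string that contains a literal backslash followed by `n` (two characters) rather than a real newline. The sample text is an assumption chosen for illustration.

```python
import re

# The source string holds the two characters "\" and "n", not a newline
raw = "Item 1A. Risk Factors\\nAn investment involves risk."

# The pattern r'\\n' matches a literal backslash followed by "n"
fixed = re.sub(r"\\n", "\n", raw)

print(fixed.splitlines())
```

After the substitution the text splits into two lines, which is usually what downstream chunking expects.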

##### 4️⃣ Generate Question-Answer Pairs using LLM
📜 Instruction Prompts for QA Generation

In [7]:
item1a_instruction_prompt = """
The provided context is retrieved from 10K Item 1A “Risk Factors”, which includes information about the most significant risks that apply to the company or to its securities. 
Companies generally list the risk factors in order of their importance. In practice, this section focuses on the risks themselves, not how the company addresses those risks. 
Some risks may be true for the entire economy, some may apply only to the company’s industry sector or geographic region, and some may be unique to the company. 
Given the context, create three very good question answer pairs. The questions should focus on information about the most significant risks. 
Your output should be in a json format of individual question answer pairs. Restrict the question to the context information provided.
"""

item7_instruction_prompt = """
The provided context is retrieved from 10K Item 7 “Management’s Discussion and Analysis of Financial Condition and Results of Operations”, which gives the company’s perspective on the business results of the past financial year. This section, known as the MD&A for short, allows company management to tell its story in its own words.
The MD&A presents:
1. The company’s operations and financial results, including information about the company’s liquidity and capital resources and any known trends or uncertainties that could materially affect the company’s results. This section may also discuss management’s views of key business risks and what it is doing to address them.
2. Material changes in the company’s results compared to a prior period.
3. Critical accounting judgments, such as estimates and assumptions. These accounting judgments – and any changes from previous years – can have a significant impact on the numbers in the financial statements, such as assets, costs, and net income. 
Given the context, create three very good question answer pairs. The questions should focus on company's financial condition and results of operations. 
Your output should be in a json format of individual question answer pairs. Restrict the question to the context information provided.
"""

📌 Generate QA Pairs

🔹 About nemotron-4-340b-instruct-128k

The NVIDIA NeMo Nemotron-4 340B Instruct is a 340-billion parameter instruction-tuned Large Language Model (LLM), designed to enhance contextual understanding and structured text generation. This model is fine-tuned for multi-turn conversations, making it highly effective for retrieval-augmented generation (RAG), synthetic data generation, and long-context comprehension.

💡 The Nemotron-4-340B-Instruct model underwent additional alignment training using synthetic data, ensuring robust instruction-following and high response accuracy for structured text generation tasks.

🔗 Official Model Card & Documentation: [NVIDIA Build Catalog](https://build.nvidia.com/nvidia/nemotron-4-340b-instruct/modelcard)

In [8]:
def generate_qa_pairs(chunk, prompt, company_name, section_name):
    """
    Generates synthetic Q&A pairs using an LLM based on a given text chunk. 
    We will use the `nemotron-4-340b-instruct-128k` model, which is a high-capacity instruction-tuned model.
    The generated structured question-answer pairs will be returned in JSON format.

    Parameters:
    -----------
    chunk : str
        The text segment from the 10-K filing.
    prompt : str
        The structured prompt to guide Q&A generation.
    company_name : str
        The name of the company whose 10-K is being processed.
    section_name : str
        The section of the 10-K report (e.g., "Item 1A", "Item 7").
    
    Returns:
    --------
    list of dict
        A list of dictionaries, each containing:
        - `company_name` : str
        - `section_name` : str
        - `question` : str
        - `gt_answer` : str (ground truth answer)
        - `gt_context` : str (source context from 10-K)
    """

    # Combine extracted text with prompt
    context = f"{chunk}\n{prompt}"
    
    try:
        # Query NVIDIA Nemotron-4 340B Instruct model
        completion = client.chat.completions.create(
            model="nvdev/nvidia/nemotron-4-340b-instruct-128k",
            messages=[{"role": "user", "content": context}],
            temperature=0.05,  # Low temperature for near-deterministic responses
            max_tokens=1024,   # Limit response length
        )

        response_content = completion.choices[0].message.content.strip()
        
        # Extract JSON response using regex
        match = re.search(r'\[\s*{.*?}\s*\]', response_content, re.DOTALL)
        if not match:
            print("Error: No JSON array found in the response.")
            return []

        extracted_json = match.group(0)

        # Parse JSON safely
        qa_data = json.loads(extracted_json)
        
        # Structure the synthetic Q&A dataset
        synthetic_qa = [
            {
                "company_name": company_name,
                "section_name": section_name,
                "question": qa.get("question", "").strip(),
                "gt_answer": qa.get("answer", "").strip(),
                "gt_context": chunk
            }
            for qa in qa_data if isinstance(qa, dict) and "question" in qa and "answer" in qa
        ]
        
        print(synthetic_qa)  # Debugging output
        return synthetic_qa

    except Exception as e:
        print(f"Error during QA generation: {e}")
        return []
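
The parsing step can be tried offline. The sketch below runs the same regular expression and key filter on a mocked model response; the response text is invented for illustration and no API call is made.

```python
import json
import re

# Invented response: prose around a JSON array, plus one malformed entry
mock_response = """Here are three question-answer pairs:
[
  {"question": "How could new debt affect the company?", "answer": "It would likely increase fixed obligations."},
  {"question": "What is the effect of issuing equity?", "answer": "It would likely dilute existing stockholders."},
  {"comment": "this entry lacks the expected keys and is skipped"}
]
Let me know if you need more."""

# Same pattern as the notebook: the first "[ {" through the first "} ]"
match = re.search(r"\[\s*{.*?}\s*\]", mock_response, re.DOTALL)
qa_data = json.loads(match.group(0)) if match else []

# Keep only dict entries that carry both expected keys
pairs = [
    {"question": qa["question"].strip(), "gt_answer": qa["answer"].strip()}
    for qa in qa_data
    if isinstance(qa, dict) and "question" in qa and "answer" in qa
]

print(len(pairs))
```

Because the pattern is lazy, it stops at the first `}` that is followed by `]`. If an answer itself contained that sequence, the extracted text would be cut short and `json.loads` would raise a `JSONDecodeError`, which the `try`/`except` in `generate_qa_pairs` turns into an empty result.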


##### 5️⃣ Process 10-K Sections with Async Processing

📌 Process a 10-K Section in Chunks
- The LangChain library provides a variety of document transformers, such as text splitters. In this example, we use the generic `RecursiveCharacterTextSplitter` with a chunk size of 2,000 characters and an overlap of 100 characters.
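
To show how chunk size and overlap interact, here is a simplified fixed-window splitter in plain Python. It is not the `RecursiveCharacterTextSplitter` algorithm, which also tries to break on separators such as paragraphs and sentences; the helper name `split_fixed` and the sample text are assumptions for illustration.

```python
def split_fixed(text, chunk_size=2000, chunk_overlap=100):
    """Simplified sketch: fixed-size windows that share `chunk_overlap` characters."""
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
        start += step
    return chunks

# 5,000 characters of filler text
text = "".join(chr(ord("a") + i % 26) for i in range(5000))
chunks = split_fixed(text)

print([len(c) for c in chunks])
```

Each window starts 1,900 characters after the previous one, so the last 100 characters of one chunk reappear at the start of the next. The overlap reduces the chance that a sentence needed to answer a question is cut in half at a chunk boundary.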

In [9]:
async def process_section(text, prompt, section_name, company_name):
    """Splits text into chunks and generates QA pairs asynchronously."""
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=100)
    chunks = text_splitter.create_documents([text])

    tasks = [
        asyncio.to_thread(generate_qa_pairs, chunk.page_content, prompt, company_name, section_name)
        for chunk in chunks
    ]

    results = await asyncio.gather(*tasks)
    return [qa for result in results for qa in result]  # Flatten results
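
The concurrency pattern used here, `asyncio.to_thread` plus `asyncio.gather`, can be sketched without any API calls. In the example below, `fake_generate` is a hypothetical stand-in for `generate_qa_pairs` that just sleeps, and the chunk names are invented. The sketch calls `asyncio.run` as a standalone script would; inside a notebook you could `await process_chunks(...)` directly.

```python
import asyncio
import time

def fake_generate(chunk):
    """Hypothetical blocking worker standing in for an LLM request."""
    time.sleep(0.1)  # simulate network latency
    return [{"question": f"Question about {chunk}", "gt_context": chunk}]

async def process_chunks(chunks):
    # Each blocking call runs in a worker thread so the calls can overlap
    tasks = [asyncio.to_thread(fake_generate, chunk) for chunk in chunks]
    # gather() returns results in the same order as the tasks were passed in
    results = await asyncio.gather(*tasks)
    # Flatten the list of per-chunk lists into one list of QA dicts
    return [qa for result in results for qa in result]

flat = asyncio.run(process_chunks(["chunk-a", "chunk-b", "chunk-c"]))
print([qa["gt_context"] for qa in flat])
```

`asyncio.to_thread` requires Python 3.9 or later. Since every chunk is submitted at once, a long section can produce many simultaneous requests, which may matter if the endpoint enforces rate limits.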

📌 Process an Entire 10-K File

In [10]:
async def process_file(file_data, filename, period_of_report):
    """Processes an entire 10-K file, extracting QA pairs from Items 1A and 7."""
    company_name = file_data.get("company", "")

    # Process Item 1A
    item_1A_text = preprocess_text(file_data.get("item_1A", ""), company_name)
    item_1A_qa_pairs = await process_section(item_1A_text, item1a_instruction_prompt, "Item 1A", company_name)

    # Process Item 7
    item_7_text = preprocess_text(file_data.get("item_7", ""), company_name)
    item_7_qa_pairs = await process_section(item_7_text, item7_instruction_prompt, "Item 7", company_name)

    # Add metadata
    return [
        {**qa_pair, "filename": filename, "year": period_of_report}
        for qa_pair in item_1A_qa_pairs + item_7_qa_pairs
    ]

##### 6️⃣ Run the Pipeline on All 10-K Files
📌 Process Multiple 10-K Files in a Directory

In [11]:
async def extract_qa_pairs_from_directory(directory, output_filename):
    async def process_wrapper(file_path):
        with open(file_path, 'r', encoding='utf-8') as file:
            file_data = json.load(file)
            filename = os.path.basename(file_path)
            period_of_report = file_data.get("period_of_report", "")
            return await process_file(file_data, filename, period_of_report)
    
    file_paths = [os.path.join(directory, f) for f in os.listdir(directory) if f.endswith(".json")]
    tasks = [process_wrapper(file_path) for file_path in file_paths]
    results = await asyncio.gather(*tasks)
    
    all_qa_pairs = [qa_pair for result in results for qa_pair in result]
    
    with open(output_filename, 'w', newline='', encoding='utf-8') as csvfile:
        fieldnames = ["company_name", "filename", "year", "section_name", "question", "gt_answer", "gt_context"]
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

        writer.writeheader()
        writer.writerows(all_qa_pairs)


# Run pipeline
asyncio.run(extract_qa_pairs_from_directory(os.getcwd()+'/data/', "data/synthetic_qa_pairs.csv"))

[{'company_name': 'Altair Engineering Inc.', 'section_name': 'Item 1A', 'question': 'What are the potential negative impacts of Altair Engineering Inc. using cash to fund an acquisition?', 'gt_answer': 'The payment of cash will decrease available cash.', 'gt_context': "Altair Engineering Inc. may pay cash, incur debt, or issue equity securities to fund an acquisition. The payment of cash will decrease available cash. The incurrence of debt would likely increase Altair Engineering Inc.'s fixed obligations and could subject Altair Engineering Inc. to restrictive covenants or obligations. The issuance of equity securities would likely be dilutive to Altair Engineering Inc.'s stockholders. Altair Engineering Inc. may also incur unanticipated liabilities as a result of acquiring companies. Future acquisition activity may disrupt Altair Engineering Inc.'s core business, divert Altair Engineering Inc.'s resources, or require significant management attention.\nInternational operations expose A
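
The CSV step can be checked in isolation. The sketch below writes one invented row with the same `fieldnames` to an in-memory buffer and reads it back, to confirm that multi-line context text containing commas survives the round trip. The row values are placeholders.

```python
import csv
import io

fieldnames = ["company_name", "filename", "year", "section_name",
              "question", "gt_answer", "gt_context"]

# Placeholder row; gt_context deliberately contains a newline and a comma
rows = [{
    "company_name": "Example Corp",
    "filename": "example_10k.json",
    "year": "2020-12-31",
    "section_name": "Item 1A",
    "question": "What risk is described?",
    "gt_answer": "A placeholder risk.",
    "gt_context": "Line one of the context.\nLine two, with a comma.",
}]

# newline="" mirrors the newline="" argument passed to open() in the notebook
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(rows)

# Read the data back and compare with what was written
buffer.seek(0)
round_trip = list(csv.DictReader(buffer))
print(round_trip[0]["gt_context"] == rows[0]["gt_context"])
```

By default, `csv.DictWriter` raises a `ValueError` if a row contains a key that is not listed in `fieldnames`, so any extra metadata added in `process_file` also needs a matching column name.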

##### 7️⃣ Filling RAG Outputs for Evaluation

🔹 Why Fill RAG Outputs for Evaluation?
- To effectively evaluate a Retrieval-Augmented Generation (RAG) system, we need to populate the generated answers for each question-answer pair in our synthetic dataset.

We will query the knowledge graph constructed in notebook 1, following the same method used there.

In [15]:
# Import the synthetic QA pairs
qa_pairs_df = pd.read_csv('data/synthetic_qa_pairs.csv')
qa_pairs_df.head()

| | company_name | filename | year | section_name | question | gt_answer | gt_context |
|---|---|---|---|---|---|---|---|
| 0 | Altair Engineering Inc. | Altair Engineering Inc_10K_2020.json | 2020-12-31 | Item 1A | What is the primary risk associated with inves... | The primary risk associated with investing in ... | Item 1A. Risk Factors\nAn investment in Altair... |
| 1 | Altair Engineering Inc. | Altair Engineering Inc_10K_2020.json | 2020-12-31 | Item 1A | What are the key risks related to Altair Engin... | The key risks related to Altair Engineering In... | Item 1A. Risk Factors\nAn investment in Altair... |
| 2 | Altair Engineering Inc. | Altair Engineering Inc_10K_2020.json | 2020-12-31 | Item 1A | How could Altair Engineering Inc.'s business b... | If any of the risks described in the risk fact... | Item 1A. Risk Factors\nAn investment in Altair... |
| 3 | Altair Engineering Inc. | Altair Engineering Inc_10K_2020.json | 2020-12-31 | Item 1A | What is the most significant risk factor menti... | The most significant risk factor mentioned is ... | •\nAltair Engineering Inc.'s customers’ softwa... |
| 4 | Altair Engineering Inc. | Altair Engineering Inc_10K_2020.json | 2020-12-31 | Item 1A | How does the impact of acquisitions of busines... | The impact of acquisitions of businesses and p... | •\nAltair Engineering Inc.'s customers’ softwa... |


In [17]:
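# Import the knowledge graph
# Build the path with os.path.join so it works on any OS and avoids the invalid "\A" escape
# Note: read_graphml expects GraphML content, even though the file has a .gml extension
graphml_file = os.path.join("data", "Altair Engineering Inc._knowledge_graph.gml")
G = nx.read_graphml(graphml_file)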


📌 Initialize GraphQAChain with the custom prompt


In [20]:
# Define a custom prompt template
# GraphQAChain fills the `context` variable with the retrieved triples
custom_prompt_template = """
You are an expert in financial analysis and knowledge graphs. Use the following knowledge graph triples to answer the user's question.

Knowledge Graph Triples:
{context}

Note:
- Entities in the graph include both subject names (e.g., "Microsoft") and their categories (e.g., "COMP" for companies).
- Always include subject categories in your reasoning.
- If you don't know the answer, say "I don't know."

Question: {question}
Answer:
"""

# Create a PromptTemplate object
custom_prompt = PromptTemplate(
    template=custom_prompt_template,
    input_variables=["context", "question"]
)

graph = NetworkxEntityGraph(G)
llm = ChatNVIDIA(model="nvdev/meta/llama-3.3-70b-instruct")

# Initialize GraphQAChain with the custom prompt
chain = GraphQAChain.from_llm(
    llm=llm,
    graph=graph,
    verbose=True,
    qa_prompt=custom_prompt  # from_llm takes the answer prompt as `qa_prompt`
)

📌 Query GraphRAG for each synthetic question to create evaluation data

In [22]:
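async def graphrag_retrieval(row):
    """Retrieves an answer from the GraphRAG system for a given question."""
    question = row['question']
    print(question)
    # chain.run is a blocking call; run it in a worker thread so gather() can overlap requests
    res = await asyncio.to_thread(chain.run, question)
    return res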

async def populate_answer(df):
    """Populates the DataFrame with answers retrieved from the GraphRAG system."""
    tasks = [graphrag_retrieval(row) for _, row in df.iterrows()]
    answers = await asyncio.gather(*tasks)  # Gather all tasks concurrently
    df['answer'] = answers
    return df

async def query_knowledge_graph_async(df):
    """Wrapper to run the populate_answer function asynchronously."""
    return await populate_answer(df)

eval_df = await query_knowledge_graph_async(qa_pairs_df)
eval_df.to_csv('data/evaluation_data.csv', index=False)

What is the primary risk associated with investing in Altair Engineering Inc.'s Class A common stock?


> Entering new GraphQAChain chain...
Entities Extracted:
Altair Engineering Inc.
Full Context:
Altair Engineering Inc. Expect Amendment of Interest Rates
Altair Engineering Inc. Face Bad Debt Expenses
Altair Engineering Inc. Face Extension of Payment Terms
Altair Engineering Inc. Operate_In Seasonal variations
Altair Engineering Inc. Face Fluctuations in Foreign Currency Exchange Rates
Altair Engineering Inc. Have Revenue and Expenses Denominated in Foreign Currencies
Altair Engineering Inc. Face Currency Risk
Altair Engineering Inc. Dependent_on Performance of Distributors and Resellers
Altair Engineering Inc. Focus_on Long-term Growth
Altair Engineering Inc. Invest_in Research and Development
Altair Engineering Inc. Face Loss of Senior Execuives
Altair Engineering Inc. Face Regulatory, Economic and Political Risks
Altair Engineering Inc. Face I