<a href="https://colab.research.google.com/github/Saba-Gul/Leveraging-LlamaIndex-for-Synthetic-Financial-Data-Analysis/blob/main/Synthetic_Data_Generation_For_Financial_Report_llamaindex.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Synthetic Data Generation for Financial Report using LlamaIndex**

This notebook generates synthetic questions from the content of a financial report (specifically a 10-K filing) using the LlamaIndex library and an OpenAI language model. Here's a breakdown of its functionality:

1. **PDF Data Extraction**: It downloads a 10-K financial report and extracts the text content from the PDF document.
  
2. **Data Chunking**: The extracted content is chunked into smaller sections (nodes) for more manageable processing.

3. **Synthetic Question Generation**: Using the LLM (OpenAI's GPT model), the code generates a specified number of diverse questions related to the financial report. These questions are designed to cover key financial concepts, metrics, and specific numerical data from the report.

4. **Output Storage**: The generated questions are stored in a dictionary keyed by a unique ID, and a second dictionary links each question ID to its relevant context (the node of the report it was generated from).

### **Setup and Installation**

In [14]:
# Install the LlamaIndex library for data handling and LLM interactions
!pip install llama-index -q

### **Explanation:**
This line installs the llama-index library, which provides tools for data handling and interaction with large language models (LLMs).
### **Importance:**
Financial analysts often need tools that streamline data processing and enhance analytical capabilities. Using libraries like llama-index can significantly speed up the generation of insights from large datasets.

### **Download the Financial Document**

In [15]:
# Download the Amazon 2023 10-K report in PDF format
!wget "https://d18rn0p25nwr6d.cloudfront.net/CIK-0001018724/c7c14359-36fa-40c3-b3ca-5bf7f3fa0b96.pdf" -O amzn_2023_10k.pdf

--2024-10-08 04:20:26--  https://d18rn0p25nwr6d.cloudfront.net/CIK-0001018724/c7c14359-36fa-40c3-b3ca-5bf7f3fa0b96.pdf
Resolving d18rn0p25nwr6d.cloudfront.net (d18rn0p25nwr6d.cloudfront.net)... 18.238.139.87, 18.238.139.198, 18.238.139.223, ...
Connecting to d18rn0p25nwr6d.cloudfront.net (d18rn0p25nwr6d.cloudfront.net)|18.238.139.87|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 800598 (782K) [application/pdf]
Saving to: ‘amzn_2023_10k.pdf’


2024-10-08 04:20:26 (8.79 MB/s) - ‘amzn_2023_10k.pdf’ saved [800598/800598]



### **Explanation:**
This command fetches the Amazon 2023 10-K report in PDF format from a specified URL and saves it as amzn_2023_10k.pdf.
### **Importance:**
Access to accurate and up-to-date financial documents is crucial for analysis. Analysts need to work with current reports to make informed decisions.

### **PDF Chunking**

In [16]:
import json

from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SimpleNodeParser

#### **Load the Document**

In [17]:
# Define the file containing the financial report
files = ['amzn_2023_10k.pdf']
# Load the PDF document using SimpleDirectoryReader
loader = SimpleDirectoryReader(input_files=files)
documents = loader.load_data()  # Extract content from the PDF
print(f"len of documents {len(documents)}")  # Print the number of documents loaded

len of documents 94


### **Explanation:**
This segment uses `SimpleDirectoryReader` to load the PDF and extract its text. By default the reader's PDF loader typically yields one `Document` per page, which is consistent with the 94 documents reported above.
### **Importance:**
Loading the report as page-level documents gives the chunking step that follows manageable units to work with, which matters for extensive filings such as a 10-K.

#### **Parse the Document**

In [18]:
# Parse the documents into nodes for easier processing
parser = SimpleNodeParser.from_defaults()
nodes = parser.get_nodes_from_documents(documents, show_progress=True)
print(f"len of nodes {len(nodes)}")  # Print the number of nodes created

len of nodes 118


In [19]:
# Create a corpus dictionary to store the content of each node
corpus = {node.node_id: node.get_content() for node in nodes}

### **Explanation:**
`SimpleNodeParser` splits the 94 page-level documents into 118 nodes (chunks), and the `corpus` dictionary then maps each node's ID to its text content.
### **Importance:**
Organizing content into nodes helps analysts retrieve specific information quickly, which is vital for effective financial analysis.
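To make the idea of chunking concrete, here is a minimal sketch of an overlapping splitter in plain Python. It is not how `SimpleNodeParser` works internally; the function name `chunk_text`, the word-based sizing, and the `chunk_size` and `overlap` values are all assumptions chosen for illustration. It does show why the node count can exceed the document count: any page longer than the chunk size is split into several pieces.

```python
def chunk_text(text, chunk_size=200, overlap=20):
    """Split text into overlapping word-based chunks.

    Hypothetical helper for illustration only; LlamaIndex's parser
    uses its own splitting logic, which this does not reproduce.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


# Toy input: ten placeholder words, split into windows of 4 with 1 word of overlap
sample = " ".join(f"w{i}" for i in range(10))
toy_chunks = chunk_text(sample, chunk_size=4, overlap=1)

# Same shape as the notebook's `corpus`: an ID mapped to the chunk text
toy_corpus = {f"node-{i}": chunk for i, chunk in enumerate(toy_chunks)}
print(toy_corpus)
```

The overlap means neighbouring chunks share a little text, so a sentence that straddles a boundary is less likely to lose its context.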

In [20]:
import re
import uuid
from llama_index.llms.openai import OpenAI
from tqdm.notebook import tqdm

### **Define the Prompt Template**

In [21]:
# Define the prompt template for generating questions
prompt_template = """\
Context information is below.
---------------------
{context_str}
---------------------
Given the context information and not prior knowledge,
generate only questions based on the below query.
You are a financial analyst.
Your task is to set up {num_questions_per_chunk} questions for a 10-K financial report.
The questions should be diverse in nature across the document.
The questions should contain key concepts and metrics in a financial report.
Some of the questions should contain numbers, quarters and years to reflect the nature of a financial report.
Restrict the questions to the context information provided.
"""

### **Explanation:**
This block creates a template for generating questions based on the context provided from the document.
### **Importance:**
This template ensures that the generated questions are relevant and focused, enabling analysts to gather critical insights from the report efficiently.
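As a small illustration of how such a template is filled in, the sketch below formats a shortened copy of the prompt with `str.format`. The names `demo_template` and `build_prompt` and the sample context string are made up for this example. One detail it demonstrates: curly braces inside the context text are safe, because only the template itself is scanned for placeholders.

```python
# Shortened, hypothetical copy of the prompt; only the two placeholders matter here
demo_template = """\
Context information is below.
---------------------
{context_str}
---------------------
Your task is to set up {num_questions_per_chunk} questions for a 10-K financial report.
"""


def build_prompt(template, context, num_questions):
    """Fill the two placeholders the notebook's template expects."""
    return template.format(
        context_str=context,
        num_questions_per_chunk=num_questions,
    )


# Invented context text; the braces are passed through untouched
demo_context = "Net sales are discussed in Item 7 {see the notes}."
demo_prompt = build_prompt(demo_template, demo_context, 3)
print(demo_prompt)
```

Literal braces in the template itself would need to be doubled (`{{` and `}}`), otherwise `format` raises a `KeyError` or `ValueError`.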

### **Generate Hypothetical Questions**

#### **Setup for Question Generation**

In [22]:
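import nest_asyncio  # Lets asyncio event loops nest, which notebooks need
nest_asyncio.apply()  # Patch asyncio so nested event loops are allowed
from google.colab import userdata  # Colab's secrets store (available only in Colab)

# Retrieve the OpenAI API key from the Colab secret named 'openai-key'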
OPENAI_API_KEY = userdata.get('openai-key')

### **Explanation:**
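The preceding import cell brings in the libraries for regex processing, unique ID generation, interaction with OpenAI's API, and progress tracking. This cell applies `nest_asyncio` so that asynchronous calls work inside the notebook's already-running event loop, and reads the OpenAI API key from Colab's secrets store.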
### **Importance:**
Utilizing these tools allows analysts to automate the question generation process, saving time and reducing manual effort.
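The cell above only runs inside Google Colab. For readers working elsewhere, one possible approach is to try the Colab secret first and fall back to an environment variable. The helper name `get_openai_api_key` and the environment variable name `OPENAI_API_KEY` are assumptions for this sketch, not part of the original notebook.

```python
import os


def get_openai_api_key(secret_name="openai-key", env_var="OPENAI_API_KEY"):
    """Return the API key from Colab secrets, else from the environment.

    Hypothetical helper: the fallback order and names are assumptions.
    """
    try:
        from google.colab import userdata  # importable only inside Colab
    except ImportError:
        userdata = None

    if userdata is not None:
        try:
            key = userdata.get(secret_name)
            if key:
                return key
        except Exception:
            # Secret missing or notebook access not granted; try the environment
            pass

    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(
            f"No API key found: set the Colab secret '{secret_name}' "
            f"or the environment variable '{env_var}'."
        )
    return key
```

With this in place, `OPENAI_API_KEY = get_openai_api_key()` could stand in for the direct `userdata.get` call, and a missing key fails early with a readable message rather than at the first LLM request.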

#### **Question Generation Loop**

In [23]:
# Number of questions to generate per chunk of text
num_questions_per_chunk = 3
# Initialize the OpenAI model
llm = OpenAI(model='gpt-4o-mini', api_key=OPENAI_API_KEY)
# Create dictionaries to hold generated queries and their relevant context
queries = {}
relevant_context = {}

# Iterate over each node in the corpus to generate questions
for node_id, text in tqdm(corpus.items()):
    # Format the prompt with the current context and desired number of questions
    query = prompt_template.format(context_str=text, num_questions_per_chunk=num_questions_per_chunk)
    # Get the response from the LLM based on the query
    response = llm.complete(query)
    # Process the response to extract questions
    result = str(response).strip().split("\n")
    # Clean the questions by removing leading numbers and whitespace
    questions = [
        re.sub(r"^\d+[\).\s]", "", question).strip() for question in result
    ]
    # Filter out empty questions
    questions = [question for question in questions if len(question) > 0]
    print(questions)  # Print the generated questions for review
    # Store the questions and associate them with the corresponding node ID
    for question in questions:
        question_id = str(uuid.uuid4())  # Generate a unique ID for each question
        queries[question_id] = question  # Store the question
        relevant_context[question_id] = [node_id]  # Associate the question with its context



['What was the total revenue reported by Amazon.com, Inc. for the fiscal year ended December 31, 2023, and how does it compare to the revenue reported for the previous fiscal year?', "Can you provide details on the company's net income or loss for the fourth quarter of 2023, including any significant factors that contributed to this financial outcome?", "What are the key metrics related to Amazon.com, Inc.'s cash flow from operating activities for the fiscal year ended December 31, 2023, and how do these metrics reflect the company's overall financial health?"]
['What was the aggregate market value of voting stock held by non-affiliates of the registrant as of June 30, 2023, and how does this figure compare to the previous fiscal year?', "As of January 24, 2024, how many shares of common stock were outstanding, and what implications might this have for the company's earnings per share calculation?", "Does the registrant's financial statements reflect any corrections of errors to previo

In [24]:
# Display the questions from the last chunk processed
# (the loop reassigns `questions` on every iteration; the full set lives in `queries`)
questions


['What specific criteria must be met for the Company to initiate a clawback of Incentive-Based Compensation received by an Executive Officer, and how does the three-year look-back period impact the calculation of the amount to be recovered?',
 'In the event of a financial restatement, how does the clawback policy differentiate between the recovery of cash bonuses and equity awards for Executive Officers, particularly in relation to the timing of the misconduct or error?',
 'Can you provide an example of how the Leadership Development and Compensation Committee would determine the amount of Incentive-Based Compensation to be recovered from an Executive Officer if the Company had to restate its financial statements for the fiscal year ending December 31, 2023?']

### **Explanation:**
This block generates hypothetical questions for each chunk of the document by passing the context to the LLM and capturing the generated questions.
### **Importance:**
By generating diverse and context-specific questions, analysts can gain deeper insights into financial reports. This helps identify key metrics, trends, and anomalies that may require further investigation.
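The loop splits the response on newlines and strips leading numbers, which works when the model returns a clean numbered list. Model output can also contain an introductory sentence, bullet markers, or bold markup. The sketch below is one possible stricter parser, not the notebook's method: the helper name `extract_questions` and the rule "keep only lines ending in a question mark" are assumptions, and that rule would drop any question that spans several lines.

```python
import re

# Leading list markers: "-", "*", "•" followed by space, or "1.", "2)", "3:"
_LIST_PREFIX = re.compile(r"^\s*(?:[-*•]\s+|\d+\s*[.):]\s*)")


def extract_questions(raw_response):
    """Return the lines of an LLM response that look like questions.

    Hypothetical helper for illustration; assumes one question per line.
    """
    questions = []
    for line in raw_response.splitlines():
        # Remove the list marker, then any surrounding markdown bold markers
        cleaned = _LIST_PREFIX.sub("", line).strip().strip("*").strip()
        if cleaned.endswith("?"):
            questions.append(cleaned)
    return questions


# Invented response text that mixes the formats described above
demo_response = """Here are the questions:
1. What was revenue?
2) **How did cash flow change?**
- Which segment grew?

10. Why?"""
print(extract_questions(demo_response))
```

Filtering on the trailing question mark also keeps preamble lines such as "Here are the questions:" out of `queries`, where they would otherwise be stored as if they were questions.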

### **Function for Synthetic Data Generation**

In [25]:
def synthetic_data_generation(corpus, num_questions_per_chunk=3, prompt_template=None):
    """
    Automatically generate hypothetical questions that could be answered with
    documents in the corpus.

    Args:
        corpus (dict): The corpus containing document content.
        num_questions_per_chunk (int): Number of questions to generate per document chunk.
        prompt_template (str): The template for generating questions.

    Returns:
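        tuple: (queries, relevant_context), where queries maps each question ID
            to its question text and relevant_context maps each question ID to
            a list containing the ID of the node it was generated from.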
    """
    # Initialize the OpenAI model again
    llm = OpenAI(model='gpt-4o-mini', api_key=OPENAI_API_KEY)
    # Use the provided prompt template or fallback to the default
    prompt_template = prompt_template or """\
    Context information is below.
    ---------------------
    {context_str}
    ---------------------
    Given the context information and not prior knowledge,
    generate only questions based on the below query.
    You are a financial analyst.
    Your task is to set up {num_questions_per_chunk} questions for a 10-K financial report.
    The questions should be diverse in nature across the document.
    The questions should contain key concepts and metrics in a financial report.
    Some of the questions should contain numbers, quarters and years to reflect the nature of a financial report.
    Restrict the questions to the context information provided.
    """

    # Create dictionaries to hold generated queries and their relevant context
    queries = {}
    relevant_context = {}

    # Iterate over each node in the corpus to generate questions
    for node_id, text in tqdm(corpus.items()):
        # Format the prompt with the current context and desired number of questions
        query = prompt_template.format(context_str=text, num_questions_per_chunk=num_questions_per_chunk)
        # Get the response from the LLM based on the query
        response = llm.complete(query)
        # Process the response to extract questions
        result = str(response).strip().split("\n")
        # Clean the questions by removing leading numbers and whitespace
        questions = [
            re.sub(r"^\d+[\).\s]", "", question).strip() for question in result
        ]
        # Filter out empty questions
        questions = [question for question in questions if len(question) > 0]
        print(questions)  # Print the generated questions for review
        # Store the questions and associate them with the corresponding node ID
        for question in questions:
            question_id = str(uuid.uuid4())  # Generate a unique ID for each question
            queries[question_id] = question  # Store the question
            relevant_context[question_id] = [node_id]  # Associate the question with its context
    return queries, relevant_context  # Return the generated queries and their relevant context

### **Explanation:**
A reusable function is defined for generating questions based on different input parameters, streamlining the process.
### **Importance:**
This modular approach allows analysts to adapt the function for various reports or documents, promoting efficiency and flexibility in data analysis.
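The notebook imports `json` but never writes the results to disk. A minimal sketch of persisting the function's output follows. The file layout (top-level keys `queries`, `corpus`, and `relevant_docs`) and the helper names `save_dataset` and `load_dataset` are assumptions made for this example, and the toy dictionaries stand in for the real return values.

```python
import json
import os
import tempfile


def save_dataset(path, queries, corpus, relevant_context):
    """Write the three dictionaries to one JSON file (assumed layout)."""
    dataset = {
        "queries": queries,                 # question ID -> question text
        "corpus": corpus,                   # node ID -> chunk text
        "relevant_docs": relevant_context,  # question ID -> [node ID]
    }
    with open(path, "w", encoding="utf-8") as f:
        json.dump(dataset, f, indent=2, ensure_ascii=False)


def load_dataset(path):
    """Read the dataset back as a plain dictionary."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


# Toy stand-ins for the values returned by synthetic_data_generation
toy_queries = {"q-1": "What was total revenue for fiscal year 2023?"}
toy_corpus = {"node-1": "Placeholder chunk text."}
toy_relevant = {"q-1": ["node-1"]}

demo_path = os.path.join(tempfile.mkdtemp(), "synthetic_dataset.json")
save_dataset(demo_path, toy_queries, toy_corpus, toy_relevant)
loaded = load_dataset(demo_path)
print(sorted(loaded.keys()))
```

Saving the output means the LLM calls, which cost time and money, do not have to be repeated each time the questions are needed.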

### **Summary**
The code automates the generation of relevant questions from a financial report. Using LlamaIndex to load and chunk the 10-K and an OpenAI model to write the questions, it produces a set of questions, each linked to the chunk of the report it was generated from. Because this synthetic dataset is produced without manual data preparation, analysts can apply the same workflow to other reports and use the questions to guide a more in-depth, data-driven review.