# Make Sample Questions

This notebook starts with a reference to a directory full of documents and an abstract description of the content of those documents.  It then does the following:

1. It uses the abstract description of the documents to generate a bunch of questions by calling a question generator model, which is currently set to gpt-4o.  You want a very powerful and smart model for that purpose because generating a large volume of questions from an abstract description is a pretty challenging task.
2. It builds a vector database from the content of those documents using Docling to analyze them.
3. It uses RAG and a reference answer generator model (also gpt-4o currently) to generate reference answers.  You really need a very powerful model to be the reference answer generator because you're going to be treating these reference answers as ground truth for the smaller and presumably less powerful models that you are actually trying to evaluate in the next notebook.
4. It goes through each of the reference answers and asks the reference answer generator model to assess whether the answer is really answering the question or just saying that it doesn't know.  This is important because you often want a separate analysis of how well each model works on questions that have reference answers versus questions where the reference behavior is to decline to answer because the content doesn't say.
5. It stores all of this information in a file for use in the next notebook, [evaluate-using-sample-questions.ipynb](./evaluate-using-sample-questions.ipynb).

If you have time, you should also get a human to vet the reference answers and improve them, but that's expensive to do at scale, so in practice it often won't happen.

## Import dependencies

In [1]:
import evaluation_utilities

import logging
import os
import re
import requests
import importlib
from typing import NamedTuple

from pathlib import Path

from IPython.display import clear_output

from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.core.llms import ChatMessage



In [2]:
# Rerun this cell whenever you change evaluation_utilities
importlib.reload(evaluation_utilities)

<module 'evaluation_utilities' from '/Users/bmurdock/lls-comparisons/evaluation_utilities.py'>

## Configure and initialize models

The main configuration options for this notebook are in the following cell, so you may want to edit some values there before running.

In [3]:
QUESTION_GENERATOR_MODEL_INFO={"model": "gpt-4o", "timeout": 7200}
REFERENCE_ANSWER_GENERATOR_MODEL={"model": "gpt-4o"}
EMBED_MODEL_ID="ibm-granite/granite-embedding-125m-english"

NUM_TOPICS_PER_PROFILE=20

CONTENT_URLS=["https://www.ibm.com/downloads/documents/us-en/1227c12d3a38b173"]
CONTENT_LOCATION="./docs/"
CONTENT_DESCRIPTION="IBM 2024 Annual Report"

In [4]:
class UserProfile(NamedTuple):
    description:str
    number_of_topics:int
    number_of_iterations_per_topic:int

# For each profile, this will generate N questions in total, where
# N = number_of_topics * number_of_iterations_per_topic * <number of question types, typically 3>.
# Use higher numbers for profiles that are more important for your application,
# or the same numbers for each if all of your profiles are equally important.
USER_PROFILES=[
    UserProfile(description="Professional stock market analyst", number_of_topics=50, number_of_iterations_per_topic=12),
    UserProfile(description="Manager at a company that is considering buying an IBM product", number_of_topics=10, number_of_iterations_per_topic=10),
    UserProfile(description="High school student taking a business course", number_of_topics=5, number_of_iterations_per_topic=5),
    UserProfile(description="Fifth grader who wants to learn about IBM", number_of_topics=5, number_of_iterations_per_topic=5)
]

# Smaller version for testing.  Comment out this one to get a larger set of questions.
USER_PROFILES=[
    UserProfile(description="Professional stock market analyst", number_of_topics=8, number_of_iterations_per_topic=2),
    UserProfile(description="Manager at a company that is considering buying an IBM product", number_of_topics=5, number_of_iterations_per_topic=1),
    UserProfile(description="High school student taking a business course", number_of_topics=5, number_of_iterations_per_topic=1)
]

In [5]:

total_number_of_iterations = sum([p.number_of_topics * p.number_of_iterations_per_topic * 3 for p in USER_PROFILES])

EXPERIMENT_SHORT_LABEL = f"ibm-2024-{total_number_of_iterations}"

EXPERIMENT_SHORT_LABEL

'ibm-2024-78'
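
As a quick check on the label above, the total can be broken down per profile. The following is a minimal sketch that re-declares the smaller test `USER_PROFILES` from this notebook and assumes three question types (fact, summary, reasoning); the helper name `questions_per_profile` is hypothetical.

```python
from typing import NamedTuple


class UserProfile(NamedTuple):
    description: str
    number_of_topics: int
    number_of_iterations_per_topic: int


# Assumption: three question types, matching the prompts defined later in this notebook.
NUM_QUESTION_TYPES = 3

# Copy of the smaller test configuration.
profiles = [
    UserProfile("Professional stock market analyst", 8, 2),
    UserProfile("Manager at a company that is considering buying an IBM product", 5, 1),
    UserProfile("High school student taking a business course", 5, 1),
]


def questions_per_profile(profile, num_question_types=NUM_QUESTION_TYPES):
    """Total questions requested for one profile, across all topics and question types."""
    return profile.number_of_topics * profile.number_of_iterations_per_topic * num_question_types


breakdown = {p.description: questions_per_profile(p) for p in profiles}
total = sum(breakdown.values())

for description, count in breakdown.items():
    print(f"{count:>3}  {description}")
print(f"{total:>3}  total")
```

The label assumes the model returns exactly `number_of_topics` topics for each profile. The notebook recomputes the expected count later from the topics that were actually returned.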

In [6]:
EMBED_MODEL = HuggingFaceEmbedding(model_name=EMBED_MODEL_ID)
question_generator_model = OpenAI(**QUESTION_GENERATOR_MODEL_INFO)
reference_answer_generator_model = OpenAI(**REFERENCE_ANSWER_GENERATOR_MODEL)

In [7]:
messages = [
    ChatMessage(role="user", content="Say hello to the world"),
]
question_generator_model.chat(messages)

ChatResponse(message=ChatMessage(role=<MessageRole.ASSISTANT: 'assistant'>, additional_kwargs={}, blocks=[TextBlock(block_type='text', text='Hello, world!')]), raw=ChatCompletion(id='chatcmpl-BlyKjEzbr4eAwrdamDAbdiBUGhEwG', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='Hello, world!', refusal=None, role='assistant', annotations=[], audio=None, function_call=None, tool_calls=None))], created=1750773037, model='gpt-4o-2024-08-06', object='chat.completion', service_tier='default', system_fingerprint='fp_07871e2ad8', usage=CompletionUsage(completion_tokens=4, prompt_tokens=12, total_tokens=16, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=0))), delta=None, logprobs=None, additional_kwargs={'prompt_tokens': 12, 'completion_tokens': 4, 'total_tokens': 16})

In [8]:
question_generator_model.complete("Say hello to the world")

CompletionResponse(text='Hello, World!', additional_kwargs={'prompt_tokens': 12, 'completion_tokens': 4, 'total_tokens': 16}, raw=ChatCompletion(id='chatcmpl-BlyKjD49g0WPQvmoJMgd2FB6ORr1D', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='Hello, World!', refusal=None, role='assistant', annotations=[], audio=None, function_call=None, tool_calls=None))], created=1750773037, model='gpt-4o-2024-08-06', object='chat.completion', service_tier='default', system_fingerprint='fp_a288987b44', usage=CompletionUsage(completion_tokens=4, prompt_tokens=12, total_tokens=16, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=0))), logprobs=None, delta=None)

## Download content

This downloads the content at the URLs specified in `CONTENT_URLS` and stores it in `CONTENT_LOCATION`.

In [9]:

def download_file(url: str, output_dir: str|Path):
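    # The 60-second timeout is an arbitrary choice; raise_for_status() stops us from
    # saving an HTTP error page as if it were the document.
    response = requests.get(url, timeout=60)
    response.raise_for_status()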
    if "content-disposition" in response.headers:
        content_disposition = response.headers["content-disposition"]
        filename = content_disposition.split("filename=")[1]
        if filename.startswith('"') and filename.endswith('"'):
            filename = filename[1:-1]
    else:
        filename = url.split("/")[-1]
    target = Path(output_dir, filename)
    with open(target, mode="wb") as file:
        file.write(response.content)
    return target

os.makedirs(CONTENT_LOCATION, exist_ok=True)
for url in CONTENT_URLS:
    target = download_file(url=url, output_dir=CONTENT_LOCATION)
    print(target)

docs/ibm-annual-report-2024.pdf
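
The `split("filename=")` approach above works for the header this server returns, but it would keep any parameters that follow the filename and would fail on a `Content-Disposition` header with no filename at all. One possible alternative, sketched here with the standard library's `email.message.Message` parser, is shown below. The function name `filename_from_headers` is hypothetical and this is not what the notebook currently does.

```python
from email.message import Message
from pathlib import Path


def filename_from_headers(content_disposition, url):
    """Pick a filename from a Content-Disposition header value, falling back to the URL."""
    fallback = url.split("/")[-1]
    if not content_disposition:
        return fallback
    # Let the standard library parse the header parameters (quotes, extra params, etc.).
    msg = Message()
    msg["Content-Disposition"] = content_disposition
    name = msg.get_filename()
    # Keep only the final path component, dropping any directory parts.
    return Path(name).name if name else fallback


print(filename_from_headers('attachment; filename="ibm-annual-report-2024.pdf"', "https://example.com/x"))
print(filename_from_headers("attachment; filename=report.pdf; size=10", "https://example.com/x"))
print(filename_from_headers(None, "https://example.com/downloads/abc123"))
```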


## Question type prompts

These prompts are adapted from Docling SDG's [generation_prompts.py](https://github.com/docling-project/docling-sdg/blob/main/docling_sdg/qa/prompts/generation_prompts.py). However, they've been rewritten to use abstract descriptions of content rather than the content itself. This approach serves two main objectives:

*Generating More Realistic and Challenging Questions*

First, generating questions from abstract descriptions yields queries that are less directly tied to the specific terminology, level of specificity, and phrasing of the source document's text. Solutions that generate questions directly from document passages tend to produce questions that closely mirror the original text. This can make it easier to locate relevant documents and extract answers.  That's unhelpful for test data, because it can lead to a misleadingly high assessment of system accuracy. Questions generated this way are inherently easier for systems designed to find direct textual matches, often resulting in artificially inflated performance metrics. By using abstract descriptions, we aim to create a more challenging and realistic test set that better reflects real-world user queries, which often don't precisely echo the source material.

*Evaluating Unanswerable Questions*

Second, when questions are generated from an abstract understanding of the content, it naturally leads to instances where the generated questions may not be directly answerable by the available content. Many RAG systems are designed with the crucial capability to identify and decline to answer questions for which insufficient information exists within their knowledge base. To effectively evaluate a RAG system's proficiency in this "unanswerable question" scenario, it's essential to have a test set that includes examples of queries a user might reasonably pose based on a general understanding of the available content, but which aren't explicitly addressed by the specific documents. This allows for a robust assessment of the system's ability to differentiate between answerable and unanswerable queries.

It's important to note that this work builds upon the significant and valuable innovation presented by the Docling SDG project. Their [research](https://aclanthology.org/2025.coling-industry.4/) demonstrates the clear benefit of explicitly prompting for diverse question types (e.g., single-fact, summary, reasoning), leading to a much richer and varied set of generated questions. Our goal is to retain these advantages, leveraging the insights into effective question type prompting, while simultaneously addressing the limitations of existing approaches that generate questions directly from document content instead of from a more abstract representation.

In [10]:
class MetaPromptFormatter(dict):
    def __missing__(self, key):
        return f"{{{key}}}"

In [11]:
DEFAULT_META_QUESTION_PROMPT = (
    'A "{type_str}" question is a question with the following properties:\n{type_def_str}\n'
    "I will provide you with an abstract description of some content and a topic and a user profile and a list of existing questions.\n"
    'Think of a "{type_str}" question that a user with the specified profile might ask that could plausibly be answered using only '
    "information contained in the content and that is distinct from the existing questions.\n"
    "\n"
    "## Abstract Description of Content\n\n{content_description_str}\n\n"
    "## Topic\n\n{topic_str}\n\n"
    "## Existing Questions\n\n{existing_questions_str}\n\n"
    "## User Profile\n\n{user_profile_str}\n\n"
    "\n"
    "What question did you think about? Do not say anything other than the question."
)

DEFAULT_FACT_SINGLE_QUESTION_PROMPT = DEFAULT_META_QUESTION_PROMPT.format_map(
    MetaPromptFormatter(
        type_str = "single-fact",
        type_def_str = (
            "- It is a natural language question.\n"
            "- It is answered with a single piece of factual information.\n"
        )
    )
)

DEFAULT_SUMMARY_QUESTION_PROMPT = DEFAULT_META_QUESTION_PROMPT.format_map(
    MetaPromptFormatter(
        type_str = "summary",
        type_def_str = (
            "- It is a natural language question.\n"
            "- It is answered with a summary of multiple pieces of information.\n"
            "- It cannot be answered with a single piece of factual information.\n"
        )
    )
)


DEFAULT_REASONING_QUESTION_PROMPT = DEFAULT_META_QUESTION_PROMPT.format_map(
    MetaPromptFormatter(
        type_str = "reasoning",
        type_def_str = (
            "- It is a natural language question.\n"
            "- It requires the reader to think critically and make an inference or draw a "
            "conclusion based on the information provided.\n"
        )
    )
)

DEFAULT_TOPIC_GENERATION_PROMPT = (
    "I will provide you with an abstract description of a document and a user profile and ask you to generate a list of topics "
    "that might be covered in that document and that a user with that profile might be interested in.\n\n"
    "## Abstract Description of Document\n\n{content_description_str}\n\n"
    "## User Profile\n\n{user_profile_str}\n\n"
    "Please generate a list of {num_topics} topics.  Generate one topic per line. "
    'For each line, put a number and then a "." and then a short description of the topic.'
)

In [12]:
print(DEFAULT_FACT_SINGLE_QUESTION_PROMPT)

A "single-fact" question is a question with the following properties:
- It is a natural language question.
- It is answered with a single piece of factual information.

I will provide you with an abstract description of some content and a topic and a user profile and a list of existing questions.
Think of a "single-fact" question that a user with the specified profile might ask that could plausibly be answered using only information contained in the content and that is distinct from the existing questions.

## Abstract Description of Content

{content_description_str}

## Topic

{topic_str}

## Existing Questions

{existing_questions_str}

## User Profile

{user_profile_str}


What question did you think about? Do not say anything other than the question.
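
The `{...}` placeholders that remain in the printed prompt survive because `MetaPromptFormatter` returns the placeholder itself for any key it was not given, so the template can be filled in two stages. The following is a small sketch of that behavior using a made-up template string:

```python
class MetaPromptFormatter(dict):
    def __missing__(self, key):
        # Re-emit the placeholder unchanged so a later .format() call can fill it.
        return f"{{{key}}}"


# Hypothetical template, shorter than the real prompts in this notebook.
template = 'A "{type_str}" question about {topic_str} for {user_profile_str}.'

# Stage one: fill only the question type.
stage_one = template.format_map(MetaPromptFormatter(type_str="summary"))
print(stage_one)

# Stage two: fill the remaining placeholders with an ordinary format() call.
stage_two = stage_one.format(topic_str="revenue", user_profile_str="an analyst")
print(stage_two)
```

One caveat with this pattern: if a value filled in during the first stage contains literal braces, the second `format()` call will try to interpret them as placeholders.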


In [13]:
QUESTION_PROMPTS = {
    "fact": DEFAULT_FACT_SINGLE_QUESTION_PROMPT,
    "summary": DEFAULT_SUMMARY_QUESTION_PROMPT,
    "reasoning": DEFAULT_REASONING_QUESTION_PROMPT
}

In [14]:
evaluation_utilities.write_json(QUESTION_PROMPTS, "question_prompts.json")

In [15]:
print(DEFAULT_SUMMARY_QUESTION_PROMPT.format(content_description_str="Stuff about content", topic_str="A topic", existing_questions_str="Who?\nWhen?\nWhere?", user_profile_str=USER_PROFILES[0].description))

A "summary" question is a question with the following properties:
- It is a natural language question.
- It is answered with a summary of multiple pieces of information.
- It cannot be answered with a single piece of factual information.

I will provide you with an abstract description of some content and a topic and a user profile and a list of existing questions.
Think of a "summary" question that a user with the specified profile might ask that could plausibly be answered using only information contained in the content and that is distinct from the existing questions.

## Abstract Description of Content

Stuff about content

## Topic

A topic

## Existing Questions

Who?
When?
Where?

## User Profile

Professional stock market analyst


What question did you think about? Do not say anything other than the question.


## Generate topics

In [16]:
# Run with just 3 topics to show what the outputs can look like before continuing on with the full generation, which uses each profile's number_of_topics.
message = DEFAULT_TOPIC_GENERATION_PROMPT.format(content_description_str=CONTENT_DESCRIPTION, num_topics=3, user_profile_str=USER_PROFILES[0].description)
messages = [ChatMessage(role="user", content=message)]
resp = question_generator_model.chat(messages)

In [17]:
print(resp.message.blocks[0].text)

1. Financial Performance and Revenue Growth: Analysis of IBM's financial results, including revenue, profit margins, and year-over-year growth.

2. Strategic Initiatives and Innovations: Overview of IBM's strategic priorities, including advancements in technology, AI, and cloud computing.

3. Market Position and Competitive Landscape: Evaluation of IBM's position in the market relative to competitors and insights into industry trends.


In [18]:
# Assisted by Google Gemini 
def extract_list_items(text_block: str) -> list[str]:
    """
    Extracts items from a multi-line string.

    It handles:
    - Optional blank lines between items.
    - Optional numbering (e.g., "1.", "1 ", "2.") at the start of lines.
    It returns a list of strings, with numbers and blank lines removed,
    and each item stripped of leading/trailing whitespace.

    Args:
        text_block: The multi-line string to process.

    Returns:
        A list of extracted string values.
    """
    items = []
    # Regex to identify leading numbers followed by an optional period and optional whitespace.
    # Example: "1.", "1 ", "  1. ", "2 "
    # This pattern is applied to lines that have already had their outer whitespace stripped.
    # ^      Matches the beginning of the string (the stripped line).
    # \d+    Matches one or more digits (the number).
    # \.?    Matches an optional literal period.
    # \s*    Matches zero or more whitespace characters following the number/period.
    number_prefix_pattern = re.compile(r"^\d+\.?\s*")

    for line in text_block.splitlines():
        # 1. Remove leading/trailing whitespace from the current line.
        stripped_line = line.strip()

        # 2. If the line is blank after stripping, skip it.
        if not stripped_line:
            continue

        # 3. Remove the number prefix, if present.
        #    The sub() method replaces the matched pattern with an empty string.
        item_text = number_prefix_pattern.sub("", stripped_line)

        # 4. Strip any leading/trailing whitespace that might remain on the item_text.
        #    This is important if the original item had spaces after the number,
        #    or if the item itself had leading/trailing spaces (which strip() in step 1
        #    would have handled if no number was present, but this ensures cleanliness
        #    after potential prefix removal).
        final_item_text = item_text.strip()

        # 5. Add the cleaned item to the list, only if it's not empty.
        #    (e.g., a line like "1." would become "" after processing).
        if final_item_text:
            items.append(final_item_text)

    return items
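
To see how the parsing behaves, here is a compact re-statement of the same logic applied to a made-up response. It uses the same regular expression as the function above; the sample strings are illustrative only.

```python
import re


def extract_list_items(text_block: str) -> list[str]:
    """Compact version of the function above: strip numbering and skip blank lines."""
    prefix = re.compile(r"^\d+\.?\s*")
    items = []
    for line in text_block.splitlines():
        item = prefix.sub("", line.strip()).strip()
        if item:
            items.append(item)
    return items


sample = "1. Financial Performance\n\n2. Strategic Initiatives\n3 Market Position\n"
print(extract_list_items(sample))

# Caveat: an unnumbered item that starts with digits loses them,
# because the leading number is treated as list numbering.
print(extract_list_items("2024 revenue outlook"))

# A numbered item keeps digits that follow the numbering.
print(extract_list_items("1. 2024 revenue outlook"))
```

Since the topic prompt asks the model to number every line, the caveat should rarely matter in practice, but it is worth knowing if you reuse this function on unnumbered lists.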

In [19]:
def generate_topics(prompt, content_description, question_generator_model, user_profiles):
    retval = []
    for user_profile in user_profiles:
        message = prompt.format(content_description_str=content_description, num_topics=user_profile.number_of_topics, user_profile_str=user_profile.description)
        messages = [ChatMessage(role="user", content=message)]
        resp = question_generator_model.chat(messages)
        response_text = resp.message.blocks[0].text
        topics = extract_list_items(response_text)
        retval.append((user_profile, topics))
    return retval

In [20]:
profiles_with_topics = generate_topics(DEFAULT_TOPIC_GENERATION_PROMPT, CONTENT_DESCRIPTION, question_generator_model, USER_PROFILES)
profiles_with_topics

[(UserProfile(description='Professional stock market analyst', number_of_topics=8, number_of_iterations_per_topic=2),
  ["Financial Performance Overview: Analysis of IBM's revenue, profit margins, and overall financial health in 2024.",
   "Stock Performance and Shareholder Returns: Examination of IBM's stock price trends and dividend payouts over the past year.",
   "Strategic Initiatives and Innovations: Insights into IBM's new projects, technological advancements, and strategic goals for future growth.",
   "Market Position and Competitive Analysis: Evaluation of IBM's standing in the tech industry compared to its competitors.",
   'Risk Factors and Management: Discussion of potential risks facing IBM and the strategies in place to mitigate them.',
   "Sustainability and Corporate Responsibility: Overview of IBM's efforts in sustainability, environmental impact, and corporate social responsibility.",
   "Leadership and Governance: Information on IBM's executive team, board of direct

In [21]:
def compute_num_questions_expected(profiles_with_topics, question_prompts):
    num_question_prompts = len(question_prompts)
    n = 0
    for user_profile, topics in profiles_with_topics:
        n += len(topics) * user_profile.number_of_iterations_per_topic * num_question_prompts
    return n

compute_num_questions_expected(profiles_with_topics, QUESTION_PROMPTS)

78

## Generate questions for each topic

In [22]:
existing_questions = []

question_prompt = DEFAULT_SUMMARY_QUESTION_PROMPT
user_profile = profiles_with_topics[0][0]
topic = profiles_with_topics[0][1][0]

existing_questions_str="\n".join(existing_questions) if existing_questions else "NONE"
message = question_prompt.format(content_description_str=CONTENT_DESCRIPTION, topic_str=topic, existing_questions_str=existing_questions_str, user_profile_str=user_profile.description)
print(message)
messages = [ChatMessage(role="user", content=message)]
resp = question_generator_model.chat(messages)

A "summary" question is a question with the following properties:
- It is a natural language question.
- It is answered with a summary of multiple pieces of information.
- It cannot be answered with a single piece of factual information.

I will provide you with an abstract description of some content and a topic and a user profile and a list of existing questions.
Think of a "summary" question that a user with the specified profile might ask that could plausibly be answered using only information contained in the content and that is distinct from the existing questions.

## Abstract Description of Content

IBM 2024 Annual Report

## Topic

Financial Performance Overview: Analysis of IBM's revenue, profit margins, and overall financial health in 2024.

## Existing Questions

NONE

## User Profile

Professional stock market analyst


What question did you think about? Do not say anything other than the question.


In [23]:
existing_questions.append(resp.message.blocks[0].text)
print(resp.message.blocks[0].text)

What are the key factors that influenced IBM's revenue and profit margins in 2024, and how do these reflect the company's overall financial health?


In [24]:
existing_questions_str="\n".join(existing_questions) if existing_questions else "NONE"                                 
message = question_prompt.format(content_description_str=CONTENT_DESCRIPTION, topic_str=topic, existing_questions_str=existing_questions_str, user_profile_str=user_profile.description)

print(message)
messages = [ChatMessage(role="user", content=message)]
resp = question_generator_model.chat(messages)

A "summary" question is a question with the following properties:
- It is a natural language question.
- It is answered with a summary of multiple pieces of information.
- It cannot be answered with a single piece of factual information.

I will provide you with an abstract description of some content and a topic and a user profile and a list of existing questions.
Think of a "summary" question that a user with the specified profile might ask that could plausibly be answered using only information contained in the content and that is distinct from the existing questions.

## Abstract Description of Content

IBM 2024 Annual Report

## Topic

Financial Performance Overview: Analysis of IBM's revenue, profit margins, and overall financial health in 2024.

## Existing Questions

What are the key factors that influenced IBM's revenue and profit margins in 2024, and how do these reflect the company's overall financial health?

## User Profile

Professional stock market analyst


What questio

In [25]:
existing_questions.append(resp.message.blocks[0].text)
print(resp.message.blocks[0].text)

How did IBM's financial performance in 2024 compare to previous years, and what trends can be identified in their revenue and profit margins over this period?


In [26]:
QUESTION_PROMPTS

{'fact': 'A "single-fact" question is a question with the following properties:\n- It is a natural language question.\n- It is answered with a single piece of factual information.\n\nI will provide you with an abstract description of some content and a topic and a user profile and a list of existing questions.\nThink of a "single-fact" question that a user with the specified profile might ask that could plausibly be answered using only information contained in the content and that is distinct from the existing questions.\n\n## Abstract Description of Content\n\n{content_description_str}\n\n## Topic\n\n{topic_str}\n\n## Existing Questions\n\n{existing_questions_str}\n\n## User Profile\n\n{user_profile_str}\n\n\nWhat question did you think about? Do not say anything other than the question.',
 'summary': 'A "summary" question is a question with the following properties:\n- It is a natural language question.\n- It is answered with a summary of multiple pieces of information.\n- It cannot 

'A "single-fact" question is a question with the following properties:\n- It is a natural language question.\n- It is answered with a single piece of factual information.\n\nI will provide you with an abstract description of some content and a topic and a user profile and a list of existing questions.\nThink of a "single-fact" question that a user with the specified profile might ask that could plausibly be answered using only information contained in the content and that is distinct from the existing questions.\n\n## Abstract Description of Content\n\nIBM 2024 Annual Report\n\n## Topic\n\nIBM\'s 2024 financial performance overview\n\n## Existing Questions\n\nNONE\n\n## User Profile\n\nProfessional stock market analyst\n\n\nWhat question did you think about? Do not say anything other than the question.'

In [27]:
# Now we put all the pieces above into a function that iterates through all the profiles, topics, and question prompts, repeating each combination the configured number of times.
def generate_questions(profiles_with_topics, content_description, question_prompts, question_generator_model):
    all_questions_with_prompts_and_topics_and_profiles = []
    num_questions_expected = compute_num_questions_expected(profiles_with_topics, question_prompts)
    i = 1
    for user_profile, topics in profiles_with_topics:
        for topic in topics:
            for question_prompt_label, question_prompt in question_prompts.items():
                existing_questions_for_topic_and_prompt = []
                for _ in range(user_profile.number_of_iterations_per_topic):
                    existing_questions_str="\n".join(existing_questions_for_topic_and_prompt) if existing_questions_for_topic_and_prompt else "NONE"
                    message = question_prompt.format(content_description_str=content_description, topic_str=topic, existing_questions_str=existing_questions_str, user_profile_str=user_profile.description)
                    messages = [ChatMessage(role="user", content=message)]
                    resp = question_generator_model.chat(messages)
                    question_text = resp.message.blocks[0].text
                    existing_questions_for_topic_and_prompt.append(question_text)
                    # Note that what we're storing here is the (question, prompt label, topic, profile description) tuple.  What we really want as an output is just the question, but the other fields might be useful for understanding where the question came from.
                    all_questions_with_prompts_and_topics_and_profiles.append((question_text, question_prompt_label, topic, user_profile.description))
                    clear_output(wait=True)
                    print(f"{i} / {num_questions_expected}")
                    i += 1
    return all_questions_with_prompts_and_topics_and_profiles

In [28]:
all_questions_with_prompts_and_topics_and_profiles = generate_questions(profiles_with_topics, CONTENT_DESCRIPTION, QUESTION_PROMPTS, question_generator_model)

78 / 78
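
The key idea in the inner loop is that each new prompt includes the questions already generated for the same topic and question type, which nudges the model away from repeating itself. The sketch below isolates that accumulation with a stand-in for the model so it runs offline. `generate_for_topic`, `fake_ask`, and the short template are hypothetical names used only for illustration.

```python
def generate_for_topic(ask, prompt_template, topic, iterations):
    """Ask for `iterations` questions on one topic, feeding earlier questions back in."""
    existing = []
    for _ in range(iterations):
        existing_str = "\n".join(existing) if existing else "NONE"
        prompt = prompt_template.format(topic_str=topic, existing_questions_str=existing_str)
        existing.append(ask(prompt))
    return existing


# Stand-in for the LLM call: records each prompt and returns a numbered question.
prompts_seen = []


def fake_ask(prompt):
    prompts_seen.append(prompt)
    return f"Question {len(prompts_seen)}?"


template = "Topic: {topic_str}\nExisting Questions:\n{existing_questions_str}"
questions = generate_for_topic(fake_ask, template, "Revenue", 3)

print(questions)
print(prompts_seen[0])
print(prompts_seen[2])
```

Because the list of existing questions is reset for every topic and question type, the model never sees questions generated for other topics or profiles, so duplicates across those boundaries remain possible.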


In [29]:
all_questions_with_prompts_and_topics_and_profiles

[("What was IBM's total revenue in 2024?",
  'fact',
  "Financial Performance Overview: Analysis of IBM's revenue, profit margins, and overall financial health in 2024.",
  'Professional stock market analyst'),
 ("What was IBM's net profit margin in 2024?",
  'fact',
  "Financial Performance Overview: Analysis of IBM's revenue, profit margins, and overall financial health in 2024.",
  'Professional stock market analyst'),
 ("What are the key factors that influenced IBM's revenue growth and profit margins in 2024, and how do these elements reflect the company's overall financial health?",
  'summary',
  "Financial Performance Overview: Analysis of IBM's revenue, profit margins, and overall financial health in 2024.",
  'Professional stock market analyst'),
 ("How did IBM's financial performance in 2024 compare to previous years, and what trends can be identified in terms of revenue growth, profit margins, and financial stability?",
  'summary',
  "Financial Performance Overview: Analysi

In [30]:
evaluation_utilities.write_json(all_questions_with_prompts_and_topics_and_profiles, "all_questions_with_prompts_and_topics_and_profiles.json")

In [31]:
questions = [t[0] for t in all_questions_with_prompts_and_topics_and_profiles]
print(len(questions))
questions[0:5]

78


["What was IBM's total revenue in 2024?",
 "What was IBM's net profit margin in 2024?",
 "What are the key factors that influenced IBM's revenue growth and profit margins in 2024, and how do these elements reflect the company's overall financial health?",
 "How did IBM's financial performance in 2024 compare to previous years, and what trends can be identified in terms of revenue growth, profit margins, and financial stability?",
 "How did IBM's revenue growth in 2024 compare to industry trends, and what factors contributed to any discrepancies?"]

In [32]:
# Assisted by Google Gemini
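def sort_and_remove_duplicates(question_data):
    # First, sort a copy of the list to ensure consistent results without reordering the caller's list
    sorted_data = sorted(question_data, key=lambda x: x[0])

    # Use a dictionary keyed on the question text to remove duplicates while keeping the full tuple.
    # When the same question text appears more than once, the last tuple in sorted order wins.
    unique_data_dict = {item[0]: item for item in sorted_data}
    return list(unique_data_dict.values())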

In [33]:
sorted_unique_questions_with_prompts_and_topics_and_profiles = sort_and_remove_duplicates(all_questions_with_prompts_and_topics_and_profiles)

In [34]:
len(sorted_unique_questions_with_prompts_and_topics_and_profiles)

77
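
The drop from 78 to 77 means one question text appeared twice. The sketch below uses made-up rows to illustrate how dictionary-based de-duplication decides which tuple survives; `sort_and_dedupe` is a hypothetical name for a variant that sorts a copy rather than the original list.

```python
def sort_and_dedupe(rows):
    """Sort by question text, then keep one row per distinct question."""
    ordered = sorted(rows, key=lambda row: row[0])
    # Later rows overwrite earlier ones with the same question text.
    unique = {row[0]: row for row in ordered}
    return list(unique.values())


rows = [
    ("What was revenue?", "fact", "Topic A", "Analyst"),
    ("How did margins change?", "summary", "Topic A", "Analyst"),
    ("What was revenue?", "fact", "Topic B", "Student"),
]

result = sort_and_dedupe(rows)
for row in result:
    print(row)
```

Only the question text is compared, so when two profiles produce the same question, the topic and profile of one of them are discarded. Here the "Topic B" row wins because the sort is stable and the dictionary keeps the last value written for each key.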

In [35]:
sorted_unique_questions_with_prompts_and_topics_and_profiles[0:3]

[("How did IBM's financial performance in 2024 compare to previous years, and what might this indicate about the company's ability to adapt to market changes?",
  'reasoning',
  "Overview of IBM's Financial Performance: A summary of IBM's revenue, profits, and financial health in 2024, which can provide insights into how large corporations manage their finances.",
  'High school student taking a business course'),
 ("How did IBM's financial performance in 2024 compare to previous years, and what trends can be identified in terms of revenue growth, profit margins, and financial stability?",
  'summary',
  "Financial Performance Overview: Analysis of IBM's revenue, profit margins, and overall financial health in 2024.",
  'Professional stock market analyst'),
 ("How did IBM's revenue growth in 2024 compare to industry trends, and what factors contributed to any discrepancies?",
  'reasoning',
  "Financial Performance Overview: Analysis of IBM's revenue, profit margins, and overall financ

In [36]:
question_data = [{"user_input" : q[0], "user_input_type": q[1], "user_input_topic": q[2], "user_profile": q[3]} for q in sorted_unique_questions_with_prompts_and_topics_and_profiles]

In [37]:
evaluation_utilities.write_json(question_data, f"./questions-{EXPERIMENT_SHORT_LABEL}.json")
f"./questions-{EXPERIMENT_SHORT_LABEL}.json"

'./questions-ibm-2024-78.json'

## Make the vector database

This next block lists all the files in the specified directory and then ingests them all into a vector database.

It uses the LlamaIndex DoclingReader, which is a simple and naive way to use Docling.  It converts everything to Markdown and then uses built-in primitives in LlamaIndex to do the chunking.

You can see a much more sophisticated use of Docling in IBM's [Granite_Multimodal_RAG.ipynb](https://github.com/ibm-granite-community/granite-snack-cookbook/blob/main/recipes/RAG/Granite_Multimodal_RAG.ipynb),
where they use the Docling hierarchical chunker, skip over chunks with tables in them, and then iterate through the tables separately and convert them to
Markdown one at a time.  That example also uses a model to generate descriptions of pictures and then includes those descriptions in the index too.  At some point, we'd like
to do some or all of that here and get a better foundation for building the vector index that we would use both for generating the reference answers in this notebook and
for doing the actual RAG evaluations in [evaluate-using-sample-questions.ipynb](./evaluate-using-sample-questions.ipynb).
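
As a rough picture of what "convert to Markdown, then chunk" means, the sketch below splits a Markdown string into one chunk per heading using only the standard library. It is a toy illustration, not the code in `make_simple_index` and not LlamaIndex's actual chunking logic; the function name `split_markdown_by_heading` and the sample text are assumptions made for this example.

```python
import re


def split_markdown_by_heading(markdown_text):
    """Split Markdown into chunks, starting a new chunk at each heading line."""
    chunks = []
    current = []
    for line in markdown_text.splitlines():
        # A heading is one to six '#' characters followed by a space.
        if re.match(r"#{1,6} ", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    # Drop chunks that are empty after stripping whitespace.
    return [c for c in chunks if c]


sample_markdown = """## Overview

Revenue grew this year.

## Segments

Software and Infrastructure are reported separately.
"""

chunks = split_markdown_by_heading(sample_markdown)
for i, chunk in enumerate(chunks):
    print(f"--- chunk {i} ---\n{chunk}")
```

A splitter this simple keeps a table together only if no heading falls inside it, which is one reason the more sophisticated approach handles tables separately.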

In [38]:
# Reload the data from disk.  This is here to let you restart the notebook from this point.
question_data = evaluation_utilities.read_json(f"./questions-{EXPERIMENT_SHORT_LABEL}.json")

In [39]:
question_data[0:3]

[{'user_input': "How did IBM's financial performance in 2024 compare to previous years, and what might this indicate about the company's ability to adapt to market changes?",
  'user_input_type': 'reasoning',
  'user_input_topic': "Overview of IBM's Financial Performance: A summary of IBM's revenue, profits, and financial health in 2024, which can provide insights into how large corporations manage their finances.",
  'user_profile': 'High school student taking a business course'},
 {'user_input': "How did IBM's financial performance in 2024 compare to previous years, and what trends can be identified in terms of revenue growth, profit margins, and financial stability?",
  'user_input_type': 'summary',
  'user_input_topic': "Financial Performance Overview: Analysis of IBM's revenue, profit margins, and overall financial health in 2024.",
  'user_profile': 'Professional stock market analyst'},
 {'user_input': "How did IBM's revenue growth in 2024 compare to industry trends, and what fac

In [40]:
file_paths = evaluation_utilities.list_files(CONTENT_LOCATION)
file_paths

[PosixPath('docs/ibm-annual-report-2024.pdf')]

In [41]:
index = evaluation_utilities.make_simple_index(file_paths, EMBED_MODEL)
index

<llama_index.core.indices.vector_store.base.VectorStoreIndex at 0x4ca934410>

In [42]:
q = "Tell me about Z mainframe sales"
result = index.as_query_engine(llm=question_generator_model).query(q)
#print(f"Q: {q}\nA: {result.response.strip()}\n\nSources:")
#display([(n.text, n.metadata) for n in result.source_nodes])

[n.text for n in result.source_nodes]

["## 14 Management Discussion\n\nInternational Business Machines Corporation and Subsidiary Companies\n\nIBM Z: the premier transaction processing platform with leading security, resilience and scale, highly optimized for mission-critical, high-volume transaction workloads and enabled for enterprise AI and hybrid cloud. It includes IBM Z and LinuxONE, with a range of high-performance systems designed to address enterprise computing capacity, security and performance needs, z/OS, a securityrich, high-performance enterprise operating system, as well as Linux and other operating systems.\n\nDistributed  Infrastructure: includes  Power,  Storage  and  IBM  Cloud  Infrastructure-as-a-Service  (IaaS).  Power  consists  of  highperformance servers, designed and engineered for data intensive and AI-enabled workloads and optimized for hybrid cloud and Linux. The Storage portfolio consists of a broad range of storage hardware and software-defined offerings, including Z-attach and distributed fla

## Use RAG to generate reference answers


Here we run RAG on all of the questions and collect the answers.  We label these "reference answers" because they are generated by the model that we have designated as our reference answer generator, i.e., the model that we trust to be close enough to perfect that it can act as our "ground truth" for evaluating the other models that we intend to evaluate.  We just discard the retrieved contexts instead of storing them as reference contexts because there's no particular reason to believe that they're particularly good.  If we had an ultra-high-power search capability (e.g., something that retrieved a long list of results and then had a powerful model rate each result), we would want to use it here to get reference contexts and then use those reference contexts to generate the answer instead of the actual retrieved contexts.

Note that we are calling "run_reference_rag", our reference answer generator RAG.  That RAG is very slow because it uses LlamaIndex's LLMRerank postprocessor to assess search result quality, then takes the search results rated highest by the LLM and uses them to generate the answer (and records them as the reference contexts).  This is often impractical in a deployed application because it is slow, but it can be useful for evaluation purposes because we expect it to generate something that is closer to a "ground truth" than what we would get with a simpler/faster RAG.
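
The retrieve, rerank, then answer flow described above can be sketched in plain Python. This is not the implementation of `run_reference_rag` and it does not call LlamaIndex; `retrieve`, `rate_relevance`, and `generate_answer` are hypothetical callables standing in for the vector search, the LLM-based reranker, and the answer generator, and the `min_score` threshold and the use of `None` for a missing answer are likewise assumptions.

```python
def answer_with_reranking(question, retrieve, rate_relevance, generate_answer,
                          number_of_search_results=5, min_score=0.5):
    """Retrieve candidates, keep those the rater scores as relevant, then answer."""
    candidates = retrieve(question, number_of_search_results)
    # Score every candidate.  With an LLM as the rater this is the slow step,
    # since it adds model calls on top of the final answer generation.
    scored = [(rate_relevance(question, text), text) for text in candidates]
    kept = [text for score, text in sorted(scored, reverse=True) if score >= min_score]
    if not kept:
        # Nothing relevant was found, so record no contexts and no answer.
        return {"user_input": question, "reference_contexts": [], "reference": None}
    return {
        "user_input": question,
        "reference_contexts": kept,
        "reference": generate_answer(question, kept),
    }


# Toy stand-ins so the sketch runs without any model or index.
documents = ["IBM Z revenue declined.", "Software revenue grew.", "Power servers."]


def toy_retrieve(question, k):
    return documents[:k]


def toy_rate(question, text):
    # Fraction of question words that also appear in the text.
    words = set(question.lower().split())
    return len(words & set(text.lower().rstrip(".").split())) / len(words)


def toy_generate(question, contexts):
    return " ".join(contexts)


result = answer_with_reranking("IBM Z revenue", toy_retrieve, toy_rate, toy_generate)
empty = answer_with_reranking("Why is the sky blue?", toy_retrieve, toy_rate, toy_generate)
print(result)
print(empty)
```

Passing the three steps in as callables keeps the control flow visible and lets the slow rater be swapped for a cheap one while experimenting.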

In [43]:
qna = [{"user_input": "Tell me about Z mainframe sales"}, {"user_input": "Why is the sky blue?"}]
sample_data = evaluation_utilities.run_reference_rag(qna, question_generator_model, index, "temp.json", number_of_search_results=5)
sample_data

[{'user_input': 'Tell me about Z mainframe sales',
  'reference': "In 2023, IBM Z mainframe sales experienced a decrease in revenue by 4.5 percent as reported (4.2 percent adjusted for currency) compared to the previous year. This decline was consistent with the z16 product cycle, which was introduced in the second quarter of 2022. Despite the decrease, the z16 program significantly outperformed prior cycles, including the successful z15 program. The z16 mainframe incorporates several key innovations, such as cloud-native development for hybrid cloud, embedded AI at scale, quantum-safe cyber-resilient security, energy efficiency, and strong reliability and scalability. These features drove increased demand for IBM Z, with clients leveraging it for more workloads, resulting in a doubling of installed MIPS over the last two product cycles.\n\nBy 2024, the IBM z16 had become the most successful mainframe program in IBM's history, underscoring its enduring value to clients. The z16's succe

Notice above that a question that has highly relevant content in the index can get multiple reference_contexts (i.e., search results rated as relevant by LLMRerank), while a question that does not have relevant content in the index can get no reference contexts and thus no answer.  We could ask the model to generate a reference answer without any context, but that answer wouldn't be very useful for evaluating RAG applications.  Instead, we want to mark such questions as not having a RAG answer so we can test the ability of candidate solutions to decline to answer such questions.
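
One way to use that marking downstream is to split the records into answerable and unanswerable groups so each can be scored separately. The sketch below assumes each record carries a `reference_contexts` list, as the text above describes; the helper name `split_by_reference_contexts` and the sample records are made up for illustration.

```python
def split_by_reference_contexts(records):
    """Separate records that have reference contexts from those that do not."""
    answerable = [r for r in records if r.get("reference_contexts")]
    unanswerable = [r for r in records if not r.get("reference_contexts")]
    return answerable, unanswerable


# Made-up records for illustration only.
records = [
    {"user_input": "Tell me about Z mainframe sales",
     "reference_contexts": ["(some retrieved passage)"]},
    {"user_input": "Why is the sky blue?", "reference_contexts": []},
]

answerable, unanswerable = split_by_reference_contexts(records)
print(len(answerable), "answerable;", len(unanswerable), "unanswerable")
```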

In [44]:
output_file = f"./questions_and_reference_answers-{EXPERIMENT_SHORT_LABEL}-{len(question_data)}.json"
output_file

'./questions_and_reference_answers-ibm-2024-78-77.json'

In [45]:
data = evaluation_utilities.run_reference_rag(question_data, reference_answer_generator_model, index, output_file, number_of_search_results=5)

data[0:5]

[{'user_input': "How did IBM's financial performance in 2024 compare to previous years, and what might this indicate about the company's ability to adapt to market changes?",
  'user_input_type': 'reasoning',
  'user_input_topic': "Overview of IBM's Financial Performance: A summary of IBM's revenue, profits, and financial health in 2024, which can provide insights into how large corporations manage their finances.",
  'user_profile': 'High school student taking a business course',
  'reference': "In 2024, IBM's financial performance showed both strengths and challenges compared to previous years. The company reported $62.8 billion in revenue, which represented a 1.4 percent growth year-over-year as reported and a 3 percent increase when adjusted for currency. This growth was primarily driven by strong performance in the Software segment, which saw an 8.3 percent increase in revenue as reported (9.0 percent adjusted for currency), with notable contributions from Red Hat and Automation. 

In [46]:
#evaluation_utilities.write_json(data, output_file)

## Analyzing Questions

In [53]:
# Reload the data from disk.  This is here to let you restart the notebook from this point.
data = evaluation_utilities.read_json(output_file)
output_file

'./questions_and_reference_answers-ibm-2024-78-77.json'

In [48]:
data[0:2]

[{'user_input': "How did IBM's financial performance in 2024 compare to previous years, and what might this indicate about the company's ability to adapt to market changes?",
  'user_input_type': 'reasoning',
  'user_input_topic': "Overview of IBM's Financial Performance: A summary of IBM's revenue, profits, and financial health in 2024, which can provide insights into how large corporations manage their finances.",
  'user_profile': 'High school student taking a business course',
  'reference': "In 2024, IBM's financial performance showed both strengths and challenges compared to previous years. The company reported $62.8 billion in revenue, which represented a 1.4 percent growth year-over-year as reported and a 3 percent increase when adjusted for currency. This growth was primarily driven by strong performance in the Software segment, which saw an 8.3 percent increase in revenue as reported (9.0 percent adjusted for currency), with notable contributions from Red Hat and Automation. 

In [49]:
evaluation_utilities.LOGGER.setLevel(logging.DEBUG)
response = evaluation_utilities.check_if_answer_is_attempting_to_answer_question(data[1]["user_input"], data[1]["reference"], question_generator_model)
evaluation_utilities.LOGGER.setLevel(logging.INFO)
response

2025-06-24 10:04:34,537 - evaluation_utilities - DEBUG - Sending prompt to LLM:

You are an expert at evaluating the intent behind responses.
Given the following Question and Candidate Answer, your task is to determine if the Candidate Answer is making a direct and genuine attempt to address or provide a response to the *specific* Question asked.

**CRITICAL RULE:** You MUST COMPLETELY IGNORE the factual accuracy or correctness of the Candidate Answer. Whether the answer is right or wrong is irrelevant for this task. Your sole focus is on its relevance and whether it appears to be an *intended, on-topic reply* to the question.

* Respond with "YES" if the Candidate Answer clearly tries to directly answer the Question, even if it's factually incorrect, incomplete, or vague.
* Respond with "NO" if the Candidate Answer is off-topic, evasive, tangential, asks another question, or discusses unrelated subjects (even if broadly similar).

---
Question: How did IBM's financial performance in 2

True
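
The prompt asks for a bare "YES" or "NO", while the helper returns a Python boolean. One plausible way to bridge the two is sketched below; this is an assumption about the general approach, not the actual code in `evaluation_utilities`, and `parse_yes_no` is a hypothetical name.

```python
import re


def parse_yes_no(llm_response):
    """Map a model reply to True (YES), False (NO), or None if unclear."""
    # Look at the first alphabetic token so replies such as "YES." or
    # "**No**, because..." are still handled.
    match = re.search(r"[A-Za-z]+", llm_response)
    if match is None:
        return None
    first_word = match.group(0).upper()
    if first_word == "YES":
        return True
    if first_word == "NO":
        return False
    return None


for reply in ["YES", "no.", "**No**, it is off-topic.", "Maybe"]:
    print(repr(reply), "->", parse_yes_no(reply))
```

Returning `None` for unclear replies, rather than defaulting to `False`, makes it possible to retry or flag those cases instead of silently treating them as non-answers.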

In [50]:
for qna in data:
    if qna["reference_contexts"]:
        qna["has_reference_answer"] = evaluation_utilities.check_if_answer_is_attempting_to_answer_question(qna["user_input"], qna["reference"], question_generator_model)
        if not qna["has_reference_answer"]:
            print("Rejecting question because it does not attempt to answer the question.")
            print(qna["user_input"])
            print(qna["reference"])
            print("--------------------------------")
    else:
        qna["has_reference_answer"] = False

Rejecting question because it does not attempt to answer the question.
How does IBM's 2024 Annual Report illustrate the company's approach to addressing environmental and social issues through its corporate social responsibility and sustainability efforts?
The context provided does not include specific details about IBM's 2024 Annual Report or its approach to addressing environmental and social issues through corporate social responsibility and sustainability efforts. To find this information, you may need to access IBM's 2024 Annual Report directly from their website or contact their investor relations department for the most recent updates.
--------------------------------
Rejecting question because it does not attempt to answer the question.
What are the key risks identified by IBM in their 2024 Annual Report, and what strategies have they outlined to manage these risks?
The provided context does not include specific details about the key risks identified by IBM in their 2024 Annual

In [51]:
final_output_file = f"./qna-{EXPERIMENT_SHORT_LABEL}-{len(data)}.json"
final_output_file

'./qna-ibm-2024-78-77.json'

In [52]:
evaluation_utilities.write_json(data, final_output_file)