# Example of generating QAs for a 10K
In this example, we show how to generate question-answer pairs (QAs) from a PDF using OpenAI's models via `uniflow`'s [OpenAIJsonModelFlow](https://github.com/CambioML/uniflow/blob/main/uniflow/flow/model_flow.py#L125).

For this example, we're using a [10K from Nike](https://investors.nike.com/investors/news-events-and-reports/).

### Before running the code

You will need the `uniflow` conda environment to run this notebook. You can set it up by following the [installation instructions](https://github.com/CambioML/uniflow/tree/main#installation).

Next, you will need a valid [OpenAI API key](https://platform.openai.com/api-keys) to run the code. Once you have the key, set it as the environment variable `OPENAI_API_KEY` in a `.env` file in the root directory of this repository. For more details, see these [instructions](https://github.com/CambioML/uniflow/tree/main#api-keys).
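
Before making any API calls, it can help to confirm that the key is actually visible to the Python process. The following is a minimal sketch using only the standard library. The helper name `check_api_key` is hypothetical and not part of `uniflow`, and it assumes the variable has already been loaded into the environment (for example by `load_dotenv()`, which this notebook calls later).

```python
import os


def check_api_key(env=None, name="OPENAI_API_KEY"):
    """Return a masked form of the key, or raise if it is missing.

    Hypothetical helper for illustration; not part of uniflow.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    # Show only the last four characters so the key is never printed in full.
    return "*" * max(len(key) - 4, 0) + key[-4:]
```

Calling `check_api_key()` with no arguments reads from `os.environ`; passing a dictionary makes the function easy to try without touching the real environment.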

Finally, we store the Nike 10K in the `data/raw_input` directory as `nike-10k-2023.pdf`. You can download the file from [here](https://s1.q4cdn.com/806093406/files/doc_downloads/2023/414759-1-_5_Nike-NPS-Combo_Form-10-K_WR.pdf).

### Update system path

In [1]:
%reload_ext autoreload
%autoreload 2

import sys

sys.path.append(".")
sys.path.append("..")
sys.path.append("../..")

### Install helper packages

In [2]:
!{sys.executable} -m pip install -q python-dotenv openai
!{sys.executable} -m pip uninstall -y uniflow


### Import dependencies

In [3]:
from dotenv import load_dotenv
from uniflow.flow.client import TransformClient
from uniflow.flow.config import TransformOpenAIConfig, TransformForGenerationOpenAIGPT3p5Config
from uniflow.op.model.model_config import OpenAIModelConfig
from uniflow.op.prompt import Context, PromptTemplate

load_dotenv()




True

### Prepare sample prompts

First, we need to show the LLM what a good response looks like. We do this by passing an instruction and a list of sample `Context` examples (each with a context, a question, and an answer) to the `PromptTemplate` class.

In [4]:
guided_prompt = PromptTemplate(
    instruction="""Generate one question and its corresponding answer based on the last context in the last
    example. Follow the format of the examples below to include context, question, and answer in the response""",
    few_shot_prompt=[
        Context(
            context="In 1948, Claude E. Shannon published A Mathematical Theory of\nCommunication (Shannon, 1948) establishing the theory of\ninformation. In his article, Shannon introduced the concept of\ninformation entropy for the first time. We will begin our journey here.",
            question="Who published A Mathematical Theory of Communication in 1948?",
            answer="Claude E. Shannon.",
        ),
        Context(
            context="""The Compute & Networking segment is comprised of our Data Center accelerated computing platforms and end-to-end networking platforms including Quantum
for InfiniBand and Spectrum for Ethernet; our NVIDIA DRIVE automated-driving platform and automotive development agreements; """,
            question="What does the Compute & Networking segment include?",
            answer="""The Compute & Networking segment includes Data Center accelerated computing platforms, end-to-end networking platforms (Quantum for InfiniBand and Spectrum for Ethernet), the NVIDIA DRIVE automated-driving platform, and automotive development agreements.""",
        ),
])
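
To make the role of the few-shot examples concrete, here is a rough sketch of how an instruction and its examples could be flattened into a single prompt string. It uses a plain dataclass, `Example`, and a function, `render_prompt`, both of which are hypothetical. `uniflow` builds its prompts internally, and its actual format may differ.

```python
from dataclasses import dataclass


@dataclass
class Example:
    """Hypothetical stand-in for a few-shot example; not uniflow's Context."""
    context: str
    question: str = ""
    answer: str = ""


def render_prompt(instruction, examples, new_context):
    """Join the instruction, worked examples, and the new context into one string."""
    parts = [instruction.strip(), ""]
    for ex in examples:
        parts += [
            f"context: {ex.context}",
            f"question: {ex.question}",
            f"answer: {ex.answer}",
            "",
        ]
    # The final block carries only the context; the model is expected to fill in the rest.
    parts.append(f"context: {new_context}")
    return "\n".join(parts)


prompt = render_prompt(
    "Generate one question and its answer from the last passage.",
    [Example("Shannon published his paper in 1948.", "When was the paper published?", "In 1948.")],
    "Some new passage to ask about.",
)
print(prompt)
```

The point of the layout is that the last block ends where a worked example would continue, which nudges the model to produce a question and an answer in the same format.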

Next, the input text is wrapped in `Context` objects so that `uniflow` can process it; this is done by the `extract_contexts` helper further down.

### Use LLM to generate data

In this example, we will use the [OpenAIModelConfig](https://github.com/CambioML/uniflow/blob/main/uniflow/model/config.py#L17)'s default LLM to generate questions and answers.

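Here, we pass our `guided_prompt` to `TransformForGenerationOpenAIGPT3p5Config` so that our customized instruction and examples are used instead of the `uniflow` defaults. We also set `auto_split_long_text=True` so that inputs that are too long for the model are split automatically.

The active configuration keeps the `OpenAIModelConfig` defaults, so responses come back in the default `text` format. To get the response as `json` instead, set `response_format` to `{"type": "json_object"}`.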

In [5]:
# config = TransformOpenAIConfig(
#     prompt_template=guided_prompt,
#     model_config=OpenAIModelConfig(response_format={"type": "text"}),
#     # auto_split_long_text=False
# )

config2 = TransformForGenerationOpenAIGPT3p5Config(
    prompt_template=guided_prompt,
    model_config=OpenAIModelConfig(),
    auto_split_long_text=True
)

client = TransformClient(config2)

In [6]:
# def read_and_chunk(file_path, words_per_chunk=2500):
#     # Initialize variables
#     contexts = []
#     current_chunk_words = []

#     # Open and read the file
#     with open(file_path, 'r', encoding='utf-8') as file:
#         for line in file:
#             # Split the line into words
#             words = line.split()
#             for word in words:
#                 current_chunk_words.append(word)
#                 # Check if the current chunk reached the specified number of words
#                 if len(current_chunk_words) >= words_per_chunk:
#                     # Join the words to form a context and add to the list
#                     contexts.append(Context(context=' '.join(current_chunk_words)))
#                     current_chunk_words = []  # Reset for the next chunk

#     # Add the last chunk if there are any remaining words
#     if current_chunk_words:
#         contexts.append(Context(context=' '.join(current_chunk_words)))

#     return contexts

# # Example usage
# file_path = './data/raw_input/book-war-and-peace.txt'
# contexts = read_and_chunk(file_path)
# for context in contexts[:1]:  # Just printing the first Context for brevity
#     print(f"---\nContext(context='{context.context[:50]}...')\n---")

In [7]:
def extract_contexts(file_path, ranges):
    # Read the entire file content
    with open(file_path, 'r', encoding='utf-8') as file:
        content = file.read()

    # Initialize the list for holding contexts
    contexts = []

    # Extract specified ranges
    for start, end in ranges:
        # Adjusting end index to fit within the file length if necessary
        end = min(end, len(content))
        context_text = content[start:end]
        contexts.append(Context(context=context_text))

    return contexts

# Define the ranges for the contexts
ranges = [(100, 992000), (992000+12600, 992000+13200)]

# Specify the file path
file_path = './data/raw_input/book-war-and-peace.txt'

# Extract contexts
contexts = extract_contexts(file_path, ranges)

# Print the contexts for demonstration
for i, context in enumerate(contexts):
    preview_text = context.context[:50] + "..." if len(context.context) > 50 else context.context
    print(f"---\nContext {i+1} (Preview): '{preview_text}'\n---")

---
Context 1 (Preview): 'arn you, if you don't tell me that this means war,...'
---
---
Context 2 (Preview): 'e said
to Pierre as he kissed her hand. She had kn...'
---
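
Because `extract_contexts` slices by character offsets, a mistyped range quietly yields an empty or shortened string. Below is a small sketch of a stricter variant that works on an in-memory string. `slice_ranges` is a hypothetical helper, and raising on bad ranges is a design choice assumed here, not the behavior of the cell above.

```python
def slice_ranges(text, ranges):
    """Return text[start:end] for each (start, end) pair, clamping end to len(text)."""
    pieces = []
    for start, end in ranges:
        end = min(end, len(text))
        if start < 0 or start >= end:
            # An inverted or out-of-bounds range would otherwise give an empty slice.
            raise ValueError(f"empty or invalid range: ({start}, {end})")
        pieces.append(text[start:end])
    return pieces


sample = "abcdefghij"
print(slice_ranges(sample, [(0, 3), (5, 100)]))  # ['abc', 'fghij']
```

Failing loudly is useful when the offsets are hand-picked for one file and later reused on another of a different length.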


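Now we call the `run` method on the `client` object to generate question-answer pairs for the contexts shown above. Because `auto_split_long_text=True`, a context that exceeds the token limit is split into smaller pieces before it is sent to the model.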

In [8]:
# print(contexts[0])

output = client.run(contexts[:15])

The current Context object needs splitting because it exceeds the token limitation.
9782
9873
9817
9715
9864
9957
9927
9317
9607
9732
9562
9968
9796
9744
8175
9809
9998
9883
9803
9589
8727
8528
9631
9908
9692
9826
9791
9933
9496
9951
9760
9941
9995
9934
9931
9418
9922
9579
8898
9880
9994
8807
9898
9809
9600
9541
9844
9292
9326
9888
9488
9395
8969
9569
9950
9962
9947
9848
9919
9604
9833
9295
9884
9704
9803
9938
9717
9891
9938
9785
9503
9434
9738
9827
9940
9968
9743
10000
9786
9212
7587
9176
9456
9931
9920
9759
9985
9991
9989
9885
9988
9713
9474
9788
9482
9578
9884
9684
9760
9595
9818
7133


100%|██████████| 103/103 [01:40<00:00,  1.02it/s]
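
The message at the top of this output reports that a `Context` was too long and had to be split. As a rough illustration of the idea only, the sketch below packs whitespace-separated words into chunks of at most `max_chars` characters. The function name and the character-based limit are assumptions; `uniflow`'s own splitter may use a different unit and strategy.

```python
def split_text(text, max_chars):
    """Greedily pack whitespace-separated words into chunks of at most max_chars.

    A single word longer than max_chars is kept whole in its own chunk.
    """
    chunks, current, size = [], [], 0
    for word in text.split():
        # +1 accounts for the joining space when the chunk is non-empty.
        extra = len(word) + (1 if current else 0)
        if current and size + extra > max_chars:
            chunks.append(" ".join(current))
            current, size = [], 0
            extra = len(word)
        current.append(word)
        size += extra
    if current:
        chunks.append(" ".join(current))
    return chunks


print(split_text("aa bb cc dd", max_chars=5))  # ['aa bb', 'cc dd']
```

Splitting on whitespace keeps words intact, at the cost of chunks that are usually a little shorter than the limit.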


### Output

In [9]:
from pprint import pprint
pprint(len(output))

pprint(output[0])

for o in output:
    pprint(o['output'][0]['response'])

103
{'output': [{'error': 'No errors.',
             'response': ['question: Who was the well-known woman greeting '
                          'Prince Vasili Kuragin in the text?\n'
                          'answer: The well-known Anna Pavlovna Scherer, maid '
                          'of honor and favorite of the Empress Marya '
                          'Fedorovna, greeted Prince Vasili Kuragin.']}],
 'root': <uniflow.node.Node object at 0x10d48a070>}
['question: Who was the well-known woman greeting Prince Vasili Kuragin in the '
 'text?\n'
 'answer: The well-known Anna Pavlovna Scherer, maid of honor and favorite of '
 'the Empress Marya Fedorovna, greeted Prince Vasili Kuragin.']
['question: Who was the beautiful Princess Helene and what group was she '
 'gathered with?\n'
 "answer: The beautiful Princess Helene was Prince Vasili's daughter, and she "
 'was grouped with the young people, along with the little Princess '
 'Bolkonskaya.']
['question: Who entered the drawing room i
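
Each response above is a single string of the form `question: ...\nanswer: ...`. Here is a minimal sketch for turning such strings into dictionaries. `parse_qa` is a hypothetical helper, and it assumes the model kept to this two-label format, which is not guaranteed.

```python
def parse_qa(response):
    """Split a 'question: ...' / 'answer: ...' string into a dict, or return None."""
    question, sep, answer = response.partition("\nanswer:")
    if not sep or not question.startswith("question:"):
        return None  # The response did not follow the expected format.
    return {
        "question": question[len("question:"):].strip(),
        "answer": answer.strip(),
    }


sample = (
    "question: Who published A Mathematical Theory of Communication in 1948?\n"
    "answer: Claude E. Shannon."
)
print(parse_qa(sample))
```

With the `output` list from the cell above, this could be applied as `[parse_qa(r) for o in output for r in o['output'][0]['response']]`, filtering out any `None` entries afterwards.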

## End of the notebook

Check out more `uniflow` use cases in the [example folder](https://github.com/CambioML/uniflow/tree/main/example/model#examples)!

<a href="https://www.cambioml.com/" title="Title">
    <img src="../image/cambioml_logo_large.png" style="height: 100px; display: block; margin-left: auto; margin-right: auto;"/>
</a>