# Example of generating QAs for a 10K
In this example, we will show you how to generate question-answer pairs (QAs) from a PDF using OpenAI's models via `uniflow`'s [OpenAIJsonModelFlow](https://github.com/CambioML/uniflow/blob/main/uniflow/flow/model_flow.py#L125).

For this example, we're using a [10K from Nike](https://investors.nike.com/investors/news-events-and-reports/).

### Before running the code

You will need the `uniflow` conda environment to run this notebook. You can set up the environment by following the installation instructions: https://github.com/CambioML/uniflow/tree/main#installation.

Next, you will need a valid [OpenAI API key](https://platform.openai.com/api-keys) to run the code. Once you have the key, set it as the environment variable `OPENAI_API_KEY` within a `.env` file in the root directory of this repository. For more details, see these [instructions](https://github.com/CambioML/uniflow/tree/main#api-keys).
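
Before running the cells that call OpenAI, it can help to confirm that the key is actually visible to the Python process. The following is a minimal sketch, assuming the key has already been loaded into the environment (for example by `load_dotenv()`). The helper name `require_env` is hypothetical and is not part of `uniflow`.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error.

    Hypothetical helper for illustration; not part of uniflow.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or export it in your shell."
        )
    return value
```

Calling `require_env("OPENAI_API_KEY")` right after `load_dotenv()` fails early with a readable message instead of surfacing later as an authentication error.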

Finally, we are storing the Nike 10K in the `data/raw_input` directory as "nike-10k-2023.pdf". You can download the file from [here](https://s1.q4cdn.com/806093406/files/doc_downloads/2023/414759-1-_5_Nike-NPS-Combo_Form-10-K_WR.pdf).

### Update system path

In [1]:
%reload_ext autoreload
%autoreload 2

import sys

sys.path.append(".")
sys.path.append("..")
sys.path.append("../..")

### Install helper packages

In [2]:
!{sys.executable} -m pip install -q pandas pypdf nougat-ocr

### Import dependencies

In [3]:
from dotenv import load_dotenv
import os
import pandas as pd
from uniflow.flow.client import TransformClient, ExtractClient
from uniflow.flow.config import TransformOpenAIConfig
from uniflow.flow.config import ExtractPDFConfig, NougatModelConfig
from uniflow.op.model.model_config import OpenAIModelConfig
from uniflow.op.prompt import Context, PromptTemplate

load_dotenv()




True

### Prepare the input data
First, we need to pre-process the PDF to get text chunks that we can feed into the model. We will use `Nougat` to process the PDF data.

In [4]:
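# NOTE: the outputs shown in this notebook were generated from the Amazon 10-K.
# To use the Nike 10-K described above, switch the assignment below.
pdf_file = "amazon-10k-2023.pdf"
# pdf_file = "nike-10k-2023.pdf"
# pdf_file = "alphabet-10k-2023.pdf"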

##### Set current directory and input data directory.

In [5]:
dir_cur = os.getcwd()
input_file = os.path.join(f"{dir_cur}/data/raw_input/", pdf_file)

##### Load and split the pdf using Nougat

In [6]:
pdf_directory = [
    {"pdf": input_file},
]

extract_config = ExtractPDFConfig(
    model_config=NougatModelConfig(
        model_name="0.1.0-small",
        batch_size=128,  # When batch_size>1, nougat will run on CUDA, otherwise it will run on CPU
    )
)

nougat_client = ExtractClient(extract_config)

pdf_output = nougat_client.run(pdf_directory)

  0%|                                                                                                                 | 0/1 [00:00<?, ?it/s]

INFO: likely hallucinated title at the end of the page: ## Appendix B


100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [07:08<00:00, 428.76s/it]


##### Regroup the extracted output, where each element is an individual line of the PDF, into chunks of at most 1,000 tokens each.

In [7]:
def count_tokens(text):
    # Approximate the token count by treating each whitespace-separated word as one token.
    return len(text.split())

def recreate_string(pdf_output):
    recreated_output = []
    current_element = ""
    for line in pdf_output:
        line = line.rstrip('\n')  # Remove the trailing newline character
        if current_element:
            temp_element = current_element + " " + line
        else:
            temp_element = line

        if count_tokens(temp_element) <= 1000:
            current_element = temp_element
        else:
            recreated_output.append(current_element)
            current_element = line

    if current_element:
        recreated_output.append(current_element)

    return recreated_output

page_contents = recreate_string(pdf_output[0]['output'][0]['text'])
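
To see how the greedy merge behaves, here is a small self-contained sketch of the same idea with a configurable limit. The function name `merge_lines` and the `max_tokens` parameter are illustrative assumptions, not part of `uniflow`. Tokens are approximated by whitespace-separated words, as in `count_tokens` above. Unlike `recreate_string`, this variant skips appending an empty chunk when the very first line already exceeds the limit.

```python
def merge_lines(lines, max_tokens):
    """Greedily join consecutive lines while the chunk stays within max_tokens.

    Tokens are approximated by whitespace-separated words.
    """
    chunks, current = [], ""
    for line in lines:
        line = line.rstrip("\n")  # drop the trailing newline, if any
        candidate = f"{current} {line}" if current else line
        if len(candidate.split()) <= max_tokens:
            current = candidate  # the line still fits in the current chunk
        else:
            if current:
                chunks.append(current)  # close the current chunk
            current = line  # start a new chunk with this line
    if current:
        chunks.append(current)
    return chunks


lines = ["one two three\n", "four five\n", "six\n", "seven eight nine ten\n"]
print(merge_lines(lines, max_tokens=5))
# ['one two three four five', 'six seven eight nine ten']
```

Note that a single line longer than the limit is kept whole as its own oversized chunk; neither this sketch nor `recreate_string` splits inside a line.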

### Prepare sample prompts

First, we need to provide sample prompts for the LLM. We do this by giving a list of sample `Context` examples to the `PromptTemplate` class.

In [8]:
guided_prompt = PromptTemplate(
    instruction="""Generate one question and its corresponding answer based on the last context in the last
    example. Follow the format of the examples below to include context, question, and answer in the response""",
    few_shot_prompt=[
        Context(
            context="In 1948, Claude E. Shannon published A Mathematical Theory of\nCommunication (Shannon, 1948) establishing the theory of\ninformation. In his article, Shannon introduced the concept of\ninformation entropy for the first time. We will begin our journey here.",
            question="Who published A Mathematical Theory of Communication in 1948?",
            answer="Claude E. Shannon.",
        ),
])
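
Conceptually, a few-shot prompt combines the instruction, the worked examples, and the new context that still needs a question and answer. The sketch below illustrates that idea only; `Example` and `render_prompt` are hypothetical names, and `uniflow` may assemble the final prompt differently.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Example:
    # Hypothetical stand-in for uniflow's Context; not the library class.
    context: str
    question: Optional[str] = None
    answer: Optional[str] = None


def render_prompt(instruction: str, examples: List[Example], query: Example) -> str:
    """Concatenate the instruction, the worked examples, and the unanswered query."""
    parts = [instruction.strip()]
    for ex in examples + [query]:
        lines = [f"context: {ex.context}"]
        if ex.question is not None:
            lines.append(f"question: {ex.question}")
        if ex.answer is not None:
            lines.append(f"answer: {ex.answer}")
        parts.append("\n".join(lines))
    return "\n\n".join(parts)


prompt = render_prompt(
    "Generate one question and its corresponding answer based on the last context.",
    [Example("Shannon published his theory in 1948.", "Who published it?", "Claude E. Shannon.")],
    Example("Free cash flows are driven primarily by operating income."),
)
print(prompt)
```

The last block of the rendered prompt contains only a `context:` line, which is what cues the model to produce the missing question and answer.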

Next, for the given `page_contents` above, we convert them to the `Context` class to be processed by `uniflow`.

In [9]:
input_data = [Context(context=p[:500]) for p in page_contents[6:16] if len(p) > 200]
input_data

[Context(context='In addition, failure to optimize inventory or staffing in our fulfillment network increases our net shipping cost by requiring long-zone or partial shipments. We and our co-sourcers may be unable to adequately staff our fulfillment network and customer service centers. For example, productivity across our fulfillment network currently is being affected by regional labor market and global supply chain constraints, which increase payroll costs and make it difficult to hire, train, and deploy a suf'),
 Context(context='We also rely on a significant number of personnel to operate our stores, fulfillment network, and data centers and carry out our other operations. Failure to successfully hire, train, manage, and retain sufficient personnel to meet our needs can strain our operations, increase payroll and other costs, and harm our business and reputation. In addition, changes in laws and regulations applicable to employees, independent contractors, and temporary personnel 
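
The selection above does three things: it slices a range of chunks, drops chunks of 200 characters or fewer, and truncates the rest to 500 characters. The sketch below reproduces that logic on toy strings; the function name `select_contexts` and its parameter names are assumptions made for illustration.

```python
def select_contexts(chunks, start, stop, min_chars=200, max_chars=500):
    """Slice chunks[start:stop], drop short chunks, and truncate the rest by characters."""
    return [c[:max_chars] for c in chunks[start:stop] if len(c) > min_chars]


chunks = ["x" * 50, "y" * 300, "z" * 900]
selected = select_contexts(chunks, 0, 3)
print([len(s) for s in selected])  # [300, 500]
```

Because the truncation counts characters rather than words or tokens, a context can end in the middle of a word, as with "deploy a suf" in the output above.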

### Use LLM to generate data

In this example, we will use the [OpenAIModelConfig](https://github.com/CambioML/uniflow/blob/main/uniflow/model/config.py#L17)'s default LLM to generate questions and answers.

Here, we pass our `guided_prompt` to the `TransformOpenAIConfig` to use our customized instructions and examples instead of the `uniflow` defaults.

We also want to get the response in the `json` format instead of the `text` default, so we set the `response_format` to `json_object`.

In [10]:
config = TransformOpenAIConfig(
    prompt_template=guided_prompt,
    model_config=OpenAIModelConfig(response_format={"type": "json_object"}),
)
client = TransformClient(config)
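
Requesting a JSON object makes the reply machine-readable, but it does not guarantee that every expected key is present. The sketch below shows one way to validate a raw JSON reply, assuming you have the model's reply as a string; `parse_qa` and `REQUIRED_KEYS` are hypothetical names. The notebook's own post-processing further down works on already-parsed dictionaries.

```python
import json

REQUIRED_KEYS = ("context", "question", "answer")


def parse_qa(raw: str):
    """Parse a JSON string; return the dict only if it holds all required keys."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # the reply was not valid JSON
    if not isinstance(data, dict) or any(k not in data for k in REQUIRED_KEYS):
        return None  # valid JSON, but not the object shape we expect
    return data


print(parse_qa('{"context": "c", "question": "q", "answer": "a"}'))
print(parse_qa("not json"))
```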

Now we call the `run` method on the `client` object to execute the question-answer generation operation on the data shown above.

In [11]:
output = client.run(input_data)

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:23<00:00,  2.34s/it]


### Process the output

Let's take a look at the generated output. We need to do a little post-processing on the raw output.

In [12]:
# Extracting context, question, and answer into a DataFrame
contexts = []
questions = []
answers = []

for item in output:
    for i in item.get('output', []):
        for response in i.get('response', []):
            if any(key not in response for key in ['context', 'question', 'answer']):
                print("Missing context, question or answer in response:", response)
                continue
            contexts.append(response['context'])
            questions.append(response['question'])
            answers.append(response['answer'])

# Set display options
pd.set_option('display.max_colwidth', None)
pd.set_option('display.width', 1000)

df = pd.DataFrame({
    'Context': contexts,
    'Question': questions,
    'Answer': answers
})

styled_df = df.style.set_properties(**{'text-align': 'left'}).set_table_styles([{
    'selector': 'th',
    'props': [('text-align', 'left')]
}])
styled_df

Unnamed: 0,Context,Question,Answer
0,"In addition, failure to optimize inventory or staffing in our fulfillment network increases our net shipping cost by requiring long-zone or partial shipments. We and our co-sourcers may be unable to adequately staff our fulfillment network and customer service centers. For example, productivity across our fulfillment network currently is being affected by regional labor market and global supply chain constraints, which increase payroll costs and make it difficult to hire, train, and deploy a suf",What are some factors affecting productivity across the fulfillment network?,"Regional labor market and global supply chain constraints, which increase payroll costs and make it difficult to hire, train, and deploy a sufficient staff."
1,"We also rely on a significant number of personnel to operate our stores, fulfillment network, and data centers and carry out our other operations. Failure to successfully hire, train, manage, and retain sufficient personnel to meet our needs can strain our operations, increase payroll and other costs, and harm our business and reputation. In addition, changes in laws and regulations applicable to employees, independent contractors, and temporary personnel could increase our payroll costs, decrea","What are the potential impacts of failure to successfully hire, train, manage, and retain sufficient personnel?","Strain our operations, increase payroll and other costs, and harm our business and reputation."
2,"The potential risks associated with disruptions in business operations include loss of management focus, problems retaining key personnel, additional operating losses and expenses, potential impairment of assets and goodwill, and impairment of customer relationships.",What are some potential risks associated with disruptions in business operations?,"Loss of management focus, problems retaining key personnel, additional operating losses and expenses, potential impairment of assets and goodwill, and impairment of customer relationships."
3,"In addition, we provide regulated services in certain jurisdictions because we enable customers to keep account balances with us and transfer money to third parties, and because we provide services to third parties to facilitate payments on their behalf. Jurisdictions subject us to requirements for licensing, regulatory inspection, bonding and capital maintenance, the use, handling, and segregation of transferred funds, consumer disclosures, maintaining or processing data, and authentication.",What are the requirements jurisdictions subject the company to?,"Licensing, regulatory inspection, bonding and capital maintenance, the use, handling, and segregation of transferred funds, consumer disclosures, maintaining or processing data, and authentication."
4,"As an innovative company offering a wide range of consumer and business products and services around the world, we are regularly subject to actual and threatened claims, litigation, reviews, investigations, and other proceedings, including proceedings by governments and regulatory authorities, involving a wide range of issues, including patent and other intellectual property matters, taxes, labor and employment, competition and antitrust, privacy, data use, data protection, data security, data l",What kind of issues is the innovative company regularly involved in?,"The company is regularly involved in issues such as patent and other intellectual property matters, taxes, labor and employment, competition and antitrust, privacy, data use, data protection, and data security."
5,"We are also subject to tax controversies in various jurisdictions that can result in tax assessments against us. Developments in an audit, investigation, or other tax controversy can have a material effect on our operating results or cash flows in the period or periods for which that development occurs, as well as for prior and subsequent periods. Due to the inherent complexity and uncertainty of these matters, interpretations of certain tax laws by authorities, and judicial, administrative, and",What can result in tax assessments against the company in various jurisdictions?,Tax controversies in various jurisdictions can result in tax assessments against the company.
6,"Our financial focus is on long-term, sustainable growth in free cash flows. Free cash flows are driven primarily by increasing operating income and efficiently managing accounts receivable, inventory, accounts payable, and cash capital expenditures, including our decision to purchase or lease property and equipment. Increases in operating income primarily result from increases in sales of products and services and efficiently managing our operating costs, partially offset by investments we mak",What drives free cash flows according to our financial focus?,"Free cash flows are primarily driven by increasing operating income and efficiently managing accounts receivable, inventory, accounts payable, and cash capital expenditures, including the decision to purchase or lease property and equipment."
7,"For additional information about each line item addressed above, refer to Item 8 of Part II, ""Financial Statements and Supplementary Data -- Note 1 -- Description of Business, Accounting Policies, and Supplemental Disclosures."" Our Annual Report on Form 10-K for the year ended December 31, 2021 includes a discussion and analysis of our financial condition and results of operations for the year ended December 31, 2020 in Item 7 of Part II, ""Management's Discussion and Analysis of Financial Condit",Where can additional information about the financial statements and supplementary data be found?,"Additional information can be found in Item 8 of Part II, ""Financial Statements and Supplementary Data -- Note 1 -- Description of Business, Accounting Policies, and Supplemental Disclosures"" in the Annual Report on Form 10-K for the year ended December 31, 2021."
8,"Cash provided by (used in) operating activities was $46.3 billion and $46.8 billion in 2021 and 2022. Our operating cash flows result primarily from cash received from our consumer, seller, developer, enterprise, and content reactor customers, and advertisers, offset by cash payments we make for products and services, employee compensation, payment processing and related transaction costs, operating leases, and interest payments. Cash received from our customers and other activities generally co",What are the main sources of operating cash flows mentioned in the context?,"Cash received from consumer, seller, developer, enterprise, and content reactor customers, and advertisers."
9,"We have organized our operations into three segments: North America, International, and AWS. These segments reflect the way the Company evaluates its business performance and manages its operations. See Item 8 of Part II, ""Financial Statements and Supplementary Data -- Note 10 -- Segment Information."" #### Overview Macroeconomic factors, including inflation, increased interest rates, significant capital market volatility, the prolonged COVID-19 pandemic, global supply chain constraints, and glob",What are the three segments into which the operations are organized?,"North America, International, and AWS."
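
The nested loop in the post-processing cell can be tried without calling the API. The sketch below runs the same traversal on mock data; the nesting (`output` then `response`) is inferred from that loop rather than from the `uniflow` documentation, and `flatten_responses` is a hypothetical helper name.

```python
def flatten_responses(output, keys=("context", "question", "answer")):
    """Collect complete responses as rows and keep incomplete ones separately."""
    rows, skipped = [], []
    for item in output:
        for out in item.get("output", []):
            for response in out.get("response", []):
                if all(k in response for k in keys):
                    rows.append({k: response[k] for k in keys})
                else:
                    skipped.append(response)
    return rows, skipped


mock_output = [
    {"output": [{"response": [
        {"context": "c1", "question": "q1", "answer": "a1"},
        {"context": "c2", "question": "q2"},  # missing "answer"
    ]}]},
    {},  # an item without an "output" key is tolerated
]
rows, skipped = flatten_responses(mock_output)
print(len(rows), len(skipped))  # 1 1
```

Returning the skipped responses, rather than only printing them, makes it easier to inspect or retry the contexts that did not yield a complete QA pair.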


Finally, we can save the output to a CSV file.

In [13]:
output_df = df[['Question', 'Answer']]

output_dir = 'data/output'

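# Create the output directory if it does not exist yet.
os.makedirs(output_dir, exist_ok=True)

# Name the output after the input PDF so the file matches the document that was processed.
output_name = f"{os.path.splitext(pdf_file)[0]}_QApairs.csv"
output_df.to_csv(os.path.join(output_dir, output_name), index=False)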
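
Many answers contain commas, so it is worth checking that the saved file reads back unchanged. The sketch below does a round trip with Python's standard `csv` module instead of pandas; `write_qa_csv` and `read_qa_csv` are hypothetical helpers, and the file is written to a temporary directory rather than `data/output`.

```python
import csv
import os
import tempfile


def write_qa_csv(rows, path):
    """Write question-answer rows to a CSV file, creating the parent directory if needed."""
    directory = os.path.dirname(path)
    if directory:
        os.makedirs(directory, exist_ok=True)
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["Question", "Answer"])
        writer.writeheader()
        writer.writerows(rows)


def read_qa_csv(path):
    """Read the CSV back into a list of dictionaries."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


rows = [{
    "Question": "What are the three segments into which the operations are organized?",
    "Answer": "North America, International, and AWS.",
}]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "output", "qa_pairs.csv")
    write_qa_csv(rows, path)
    assert read_qa_csv(path) == rows  # commas inside fields survive the round trip
```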

## End of the notebook

Check out more Uniflow use cases in the [example folder](https://github.com/CambioML/uniflow/tree/main/example/model#examples)!

<a href="https://www.cambioml.com/" title="Title">
    <img src="../image/cambioml_logo_large.png" style="height: 100px; display: block; margin-left: auto; margin-right: auto;"/>
</a>