# Example of generating QAs for a 10K
In this example, we will show you how to generate question-answer pairs (QAs) from a PDF using OpenAI's models via `uniflow`'s [OpenAIJsonModelFlow](https://github.com/CambioML/uniflow/blob/main/uniflow/flow/model_flow.py#L125).

For this example, we're using a [10K from Nike](https://investors.nike.com/investors/news-events-and-reports/).

### Before running the code

You will need the `uniflow` conda environment to run this notebook. You can set up the environment by following the installation instructions: https://github.com/CambioML/uniflow/tree/main#installation.

Next, you will need a valid [OpenAI API key](https://platform.openai.com/api-keys) to run the code. Once you have the key, set it as the environment variable `OPENAI_API_KEY` within a `.env` file in the root directory of this repository. For more details, see these [instructions](https://github.com/CambioML/uniflow/tree/main#api-keys).

Finally, we are storing the Nike 10K in the `data/raw_input` directory as `nike-10k-2023.pdf`. You can download the file from [here](https://s1.q4cdn.com/806093406/files/doc_downloads/2023/414759-1-_5_Nike-NPS-Combo_Form-10-K_WR.pdf).
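
Once the `.env` file is in place, it can help to confirm that the key is actually visible to Python before making any API calls. The helper below is a minimal sketch and not part of `uniflow`; `require_env` is a hypothetical name introduced here for illustration.

```python
import os


def require_env(name, env=None):
    """Return the value of an environment variable, or raise a clear error.

    `env` defaults to os.environ; passing a plain dict makes the check easy to test.
    """
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or shell environment."
        )
    return value
```

After `load_dotenv()` has run, calling `require_env("OPENAI_API_KEY")` fails early with a readable message instead of surfacing later as an authentication error.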

### Update system path

In [1]:
%reload_ext autoreload
%autoreload 2

import sys

sys.path.append(".")
sys.path.append("..")
sys.path.append("../..")

### Install helper packages

In [2]:
!{sys.executable} -m pip install -q langchain pandas pypdf


### Import dependencies

In [3]:
from dotenv import load_dotenv
import os
import pandas as pd
from uniflow.flow.client import TransformClient
from uniflow.flow.config import TransformOpenAIConfig
from uniflow.op.model.model_config import OpenAIModelConfig
from langchain.document_loaders import PyPDFLoader
from uniflow.op.prompt import Context, PromptTemplate

load_dotenv()


True

### Prepare the input data
First, we need to pre-process the PDF to get text chunks that we can feed into the model. We will use `PyPDFLoader` from langchain.

In [4]:
pdf_file = "nike-10k-2023.pdf"

##### Set current directory and input data directory.

In [5]:
dir_cur = os.getcwd()
input_file = os.path.join(dir_cur, "data", "raw_input", pdf_file)
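
If the PDF is not where the notebook expects it, the loader will fail later with a less obvious error. The following is a minimal sketch of an early check, assuming the `data/raw_input` layout described above; `resolve_input` is a hypothetical helper, not a `uniflow` function.

```python
from pathlib import Path


def resolve_input(base_dir, filename):
    """Build <base_dir>/data/raw_input/<filename> and fail early if it is missing."""
    path = Path(base_dir) / "data" / "raw_input" / filename
    if not path.is_file():
        raise FileNotFoundError(f"Expected input PDF at {path}")
    return path
```

For example, `resolve_input(os.getcwd(), "nike-10k-2023.pdf")` returns the same location as `input_file`, as a `Path` object.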

##### Load and split the pdf

In [6]:
loader = PyPDFLoader(input_file)
pages = loader.load_and_split()
page_contents = [page.page_content for page in pages]
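
Before choosing which chunks to send to the model, it can be useful to look at how long the extracted chunks are. This is a small sketch that works on any list of strings, such as `page_contents`; `summarize_chunks` is a hypothetical helper name.

```python
def summarize_chunks(chunks):
    """Return basic length statistics (in characters) for a list of text chunks."""
    lengths = [len(chunk) for chunk in chunks]
    if not lengths:
        return {"count": 0, "min": 0, "max": 0, "mean": 0.0}
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": sum(lengths) / len(lengths),
    }
```

Calling `summarize_chunks(page_contents)` gives a quick sense of whether a length threshold or a truncation limit is reasonable for this document.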

### Prepare sample prompts

First, we need to provide sample prompts for the LLM. We do this by passing a list of `Context` examples to the `PromptTemplate` class.

In [7]:
guided_prompt = PromptTemplate(
    instruction="""Generate one question and its corresponding answer based on the last context in the last
    example. Follow the format of the examples below to include context, question, and answer in the response""",
    few_shot_prompt=[
        Context(
            context="In 1948, Claude E. Shannon published A Mathematical Theory of\nCommunication (Shannon, 1948) establishing the theory of\ninformation. In his article, Shannon introduced the concept of\ninformation entropy for the first time. We will begin our journey here.",
            question="Who published A Mathematical Theory of Communication in 1948?",
            answer="Claude E. Shannon.",
        ),
    ],
)

Next, for the given `page_contents` above, we convert them to the `Context` class to be processed by `uniflow`.

In [8]:
input_data = [Context(context=p[:500]) for p in page_contents[6:16] if len(p) > 200]
input_data

[Context(context='We also offer interactive consumer services and experiences as well as digital products through our digital platforms, including \nfitness and activity apps; sport, fitness and wellness content; and digital services and features in retail stores that enhance the \nconsumer experience.\nSALES AND MARKETING\nWe experience moderate fluctuations in aggregate sales volume during the year. Historically, revenues in the first and fourth \nfiscal quarters have slightly exceeded those in the second and third '),
 Context(context='INTERNATIONAL MARKETS\nFor fiscal 2023, non-U.S. NIKE Brand and Converse sales accounted for approximately 57% of total revenues, compared to 60% \nand 61% for fiscal 2022 and fiscal 2021, respectively. We sell our products to retail accounts through our own NIKE Direct \noperations and through a mix of independent distributors, licensees and sales representatives around the world. W e sell to \nthousands of retail accounts and ship products from 67 d
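
The slice `p[:500]` cuts each chunk at a fixed character count, so a context can end in the middle of a word. If that matters for your use case, one alternative is to cut at the last whitespace before the limit. The sketch below is not what the notebook does; it operates on plain strings, and both `truncate_at_word` and `select_contexts` are hypothetical helper names.

```python
def truncate_at_word(text, limit=500):
    """Truncate text to at most `limit` characters without splitting a word."""
    if limit <= 0:
        return ""
    if len(text) <= limit:
        return text
    cut = text[:limit]
    # The cut is already clean if it ends in whitespace or is followed by whitespace.
    if cut[-1].isspace() or text[limit].isspace():
        return cut.rstrip()
    parts = cut.rsplit(None, 1)
    # With no whitespace at all, fall back to the hard character cut.
    return parts[0] if len(parts) == 2 else cut


def select_contexts(pages, start=6, stop=16, min_len=200, limit=500):
    """Mirror the notebook's filter: slice pages, drop short ones, truncate the rest."""
    return [truncate_at_word(p, limit) for p in pages[start:stop] if len(p) > min_len]
```

Each returned string could then be wrapped as `Context(context=...)` in the same way as above.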

### Use LLM to generate data

In this example, we will use the [OpenAIModelConfig](https://github.com/CambioML/uniflow/blob/main/uniflow/model/config.py#L17)'s default LLM to generate questions and answers.

Here, we pass our `guided_prompt` to `TransformOpenAIConfig` to use our customized instructions and examples instead of the `uniflow` defaults.

We also want to get the response in the `json` format instead of the `text` default, so we set the `response_format` to `json_object`.

In [9]:
config = TransformOpenAIConfig(
    prompt_template=guided_prompt,
    model_config=OpenAIModelConfig(response_format={"type": "json_object"}),
)
client = TransformClient(config)
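
With the `json_object` response format, the model is asked to return a JSON object rather than free text. As an illustration of what a well-formed reply needs to contain for this notebook, here is a minimal sketch that validates a raw JSON string. It assumes the reply carries `context`, `question`, and `answer` keys, as in the few-shot example; `parse_qa_response` is a hypothetical helper, and `uniflow` may already do this parsing internally.

```python
import json

REQUIRED_KEYS = ("context", "question", "answer")


def parse_qa_response(raw):
    """Parse a JSON string; return the QA dict, or None if malformed or incomplete."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    if any(key not in data for key in REQUIRED_KEYS):
        return None
    return {key: data[key] for key in REQUIRED_KEYS}
```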

Now we call the `run` method on the `client` object to execute the question-answer generation operation on the data shown above.

In [13]:
output = client.run(input_data)

### Process the output

Let's take a look at the generated output. We need to do a little post-processing on the raw output.

In [14]:
# Extracting context, question, and answer into a DataFrame
contexts = []
questions = []
answers = []

for item in output:
    for i in item.get('output', []):
        for response in i.get('response', []):
            if any(key not in response for key in ['context', 'question', 'answer']):
                print("Missing context, question or answer in response:", response)
                continue
            contexts.append(response['context'])
            questions.append(response['question'])
            answers.append(response['answer'])

# Set display options
pd.set_option('display.max_colwidth', None)
pd.set_option('display.width', 1000)

df = pd.DataFrame({
    'Context': contexts,
    'Question': questions,
    'Answer': answers
})

styled_df = df.style.set_properties(**{'text-align': 'left'}).set_table_styles([{
    'selector': 'th',
    'props': [('text-align', 'left')]
}])
styled_df

,Context,Question,Answer
0,"We also offer interactive consumer services and experiences as well as digital products through our digital platforms, including fitness and activity apps; sport, fitness and wellness content; and digital services and features in retail stores that enhance the consumer experience. SALES AND MARKETING We experience moderate fluctuations in aggregate sales volume during the year. Historically, revenues in the first and fourth fiscal quarters have slightly exceeded those in the second and third",What are some of the digital products offered through the digital platforms?,"fitness and activity apps; sport, fitness and wellness content; and digital services and features in retail stores."
1,"INTERNATIONAL MARKETS For fiscal 2023, non-U.S. NIKE Brand and Converse sales accounted for approximately 57% of total revenues, compared to 60% and 61% for fiscal 2022 and fiscal 2021, respectively. We sell our products to retail accounts through our own NIKE Direct operations and through a mix of independent distributors, licensees and sales representatives around the world. W e sell to thousands of retail accounts and ship products from 67 distribution centers outside of the United States.",How much did non-U.S. NIKE Brand and Converse sales account for in total revenues for fiscal 2023?,Approximately 57%.
2,"footwear production. For fiscal 2023, factories in Vietnam, Indonesia and China manufactured approximately 50%, 27% and 18% of total NIKE Brand footwear, respectively. For fiscal 2023, four footwear contract manufacturers each accounted for greater than 10% of footwear production and in the aggregate accounted for approximately 58% of NIKE Brand footwear production. As of May 31, 2023, our contract manufacturers operated 291 finished goods apparel factories located in 31 countries. For fiscal","Which countries manufactured approximately 50%, 27%, and 18% of total NIKE Brand footwear in fiscal 2023?","Vietnam, Indonesia, and China."
3,"NIKE's contract manufacturers buy raw materials for the manufacturing of our footwear, apparel and equipment products. Most raw materials are available and purchased by those contract manufacturers in the countries where manufact",Where do NIKE's contract manufacturers buy raw materials for the manufacturing of their products?,Most raw materials are available and purchased by those contract manufacturers in the countries where manufacturing takes place.
4,"We monitor protectionist trends and developments throughout the world that may materially impact our industry, and we engage in administrative and judicial processes to mitigate trade restrictions. W e are actively monitoring actions that may result in additional anti-dumping measures and could affect our industry. We are also monitoring for and advocating against other impediments that may limit or delay customs clearance for imports of footwear , apparel and equipment. NIKE also advocates f",What does NIKE actively monitor and advocate against?,"NIKE actively monitors and advocates against actions that may result in additional anti-dumping measures and other impediments that may limit or delay customs clearance for imports of footwear, apparel, and equipment."
5,"Our international operations are also subject to compliance with the U.S . Foreign Corrupt Practices Act (the ""FCPA""), and other anti-bribery laws applicable to our operations. We source a significant portion of our products from, and have important consumer markets, outside of the United States. We have an ethics and compliance program to address compliance with the FCPA and similar laws by us, our employees, agents, suppliers and other partners. Refer to Item 1A. Risk Factors for additiona",What laws are applicable to our international operations?,"Compliance with the U.S. Foreign Corrupt Practices Act (the ""FCPA""), and other anti-bribery laws."
6,"devices, and related software applications. These patents expire at various times. We believe our success depends upon our capabilities in areas such as design, research and development, production and marketing and is supported and protected by our intellectual property rights, such as trademarks, utility and design patents, copyrights, and trade secrets, among others. We have followed a policy of applying for and registering intellectual property rights in the United States and select forei","What supports and protects the company's capabilities in areas such as design, research and development, production, and marketing?","Intellectual property rights such as trademarks, utility and design patents, copyrights, and trade secrets."
7,"HUMAN CAPITAL RESOURCES At NIKE, we consider the strength and effective management of our workforce to be essential to the ongoing success of our business. We believe that it is important to attract, develop and retain a diverse and engaged workforce at all levels of our business and that such a workforce fosters creativity and accelerates innovation. W e are focused on building an increasingly diverse talent pipeline that reflects our consumers, athletes and the communities we serve. CULTURE",What does NIKE consider essential for the ongoing success of its business?,The strength and effective management of its workforce.
8,"Diversity, equity and inclusion (""DE&I"") is a strategic priority for NIKE and we are committed to having an increa",What is a strategic priority for NIKE?,"Diversity, equity and inclusion (""DE&I"") is a strategic priority for NIKE."
9,"Our DE&I focus extends beyond our workforce and includes our communities, which we support in a number of ways. We have committed to investments that aim to address racial inequality and improve diversity and representation in our communities. W e also are leveraging our global scale to accelerate business diversity , including investing in business training programs for women and increasing the proportion of services supplied by minority-owned businesses. COMPENSATION AND BENEFITS NIKE's to","What is NIKE's focus on diversity, equity, and inclusion?","NIKE's focus extends beyond their workforce and includes their communities, aiming to address racial inequality, improve diversity and representation, invest in business training programs for women, and increase the proportion of services supplied by minority-owned businesses."
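
The nested loop in the post-processing cell can also be written as a reusable function, which makes it easy to test on a small hand-built input. This sketch assumes the same shape the loop relies on: a list of items, each with an `output` list whose entries hold a `response` list of dictionaries. `extract_qa_rows` is a hypothetical name.

```python
QA_KEYS = ("context", "question", "answer")


def extract_qa_rows(output):
    """Flatten the nested output into row dicts, collecting incomplete responses."""
    rows, skipped = [], []
    for item in output:
        for result in item.get("output", []):
            for response in result.get("response", []):
                # Skip anything that is not a dict or lacks one of the three keys.
                if not isinstance(response, dict) or any(
                    key not in response for key in QA_KEYS
                ):
                    skipped.append(response)
                    continue
                rows.append({key: response[key] for key in QA_KEYS})
    return rows, skipped
```

The returned `rows` list can be passed directly to `pd.DataFrame(rows)`, and `skipped` shows which responses were dropped and why they might need a retry.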


Finally, we can save the output to a CSV file.

In [15]:
output_df = df[['Question', 'Answer']]

output_dir = 'data/output'

os.makedirs(output_dir, exist_ok=True)

output_df.to_csv(f"{output_dir}/Nike_10k_QApairs.csv", index=False)
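
To confirm that the saved file round-trips cleanly, including answers that contain commas or quotes, it can be read back and inspected. The sketch below uses only the standard library `csv` module rather than pandas; `read_qa_pairs` is a hypothetical helper name.

```python
import csv
from pathlib import Path


def read_qa_pairs(csv_path):
    """Read a CSV file with a header row and return its rows as a list of dicts."""
    with Path(csv_path).open(newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))
```

For example, `read_qa_pairs("data/output/Nike_10k_QApairs.csv")` should return one dictionary per saved pair, each with `Question` and `Answer` keys.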

## End of the notebook

Check out more `uniflow` use cases in the [example folder](https://github.com/CambioML/uniflow/tree/main/example/model#examples)!

<a href="https://www.cambioml.com/" title="Title">
    <img src="../image/cambioml_logo_large.png" style="height: 100px; display: block; margin-left: auto; margin-right: auto;"/>
</a>