# Example of generating summaries for a 10K
In this example, we show how to generate page summaries from a PDF using OpenAI's models via `uniflow`'s [OpenAIJsonModelFlow](https://github.com/CambioML/uniflow/blob/main/uniflow/flow/model_flow.py#L125).

For this example, we're using a [10K from Nike](https://investors.nike.com/investors/news-events-and-reports/).

### Before running the code

You will need the `uniflow` conda environment to run this notebook. You can set up the environment by following the installation instructions: https://github.com/CambioML/uniflow/tree/main#installation.

Next, you will need a valid [OpenAI API key](https://platform.openai.com/api-keys) to run the code. Once you have the key, set it as the environment variable `OPENAI_API_KEY` in a `.env` file in the root directory of this repository. For more details, see these [instructions](https://github.com/CambioML/uniflow/tree/main#api-keys).
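
A missing key usually surfaces only when the first request is sent, so it can help to check for it up front. The snippet below is a minimal sketch of such a check; `require_env` is a hypothetical helper written for this example and is not part of `uniflow`.

```python
import os


def require_env(name, env=None):
    """Return the value of an environment variable or raise a clear error."""
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; add a line like {name}=<your key> to the .env file"
        )
    return value


# Example with an explicit mapping, so the check can be tried without a real key.
print(require_env("OPENAI_API_KEY", env={"OPENAI_API_KEY": "sk-example"}))
```

In the notebook, calling `require_env("OPENAI_API_KEY")` right after `load_dotenv()` would fail early if the `.env` file was not picked up.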

Finally, we store the Nike 10K in the `data/raw_input` directory as `nike-10k-2023.pdf`. You can download the file from [here](https://s1.q4cdn.com/806093406/files/doc_downloads/2023/414759-1-_5_Nike-NPS-Combo_Form-10-K_WR.pdf).

### Update system path

In [1]:
%reload_ext autoreload
%autoreload 2

import sys

sys.path.append(".")
sys.path.append("..")
sys.path.append("../..")

### Install helper packages

In [2]:
!{sys.executable} -m pip install -q pandas pypdf poetry nougat-ocr
!poetry -C ../../ install --no-root # install uniflow dependencies

Installing dependencies from lock file

Package operations: 0 installs, 1 update, 0 removals

  • Updating platformdirs (3.11.0 -> 4.2.0)


### Import dependencies

In [3]:
from dotenv import load_dotenv
import os
import pandas as pd
from uniflow.flow.client import TransformClient, ExtractClient
from uniflow.flow.config import TransformOpenAIConfig
from uniflow.flow.config import ExtractPDFConfig, NougatModelConfig
from uniflow.op.model.model_config import OpenAIModelConfig
from uniflow.op.prompt import Context, PromptTemplate

load_dotenv()


True

### Prepare the input data
First, we need to pre-process the PDF to get text chunks that we can feed into the model. We will use `Nougat` to process the PDF data.

In [4]:
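# Choose which 10-K to process. The outputs shown in this notebook were generated with the Amazon file selected.
pdf_file = "amazon-10k-2023.pdf"
# pdf_file = "nike-10k-2023.pdf"
# pdf_file = "alphabet-10k-2023.pdf"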

##### Set current directory and input data directory.

In [5]:
dir_cur = os.getcwd()
input_file = os.path.join(dir_cur, "data", "raw_input", pdf_file)
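
Because the extraction step takes several minutes, it is worth confirming that the selected file exists before starting it. The following is a minimal sketch; `locate_pdf` is a hypothetical helper introduced here, not a `uniflow` function.

```python
import tempfile
from pathlib import Path


def locate_pdf(input_dir, filename):
    """Return the path to `filename` inside `input_dir`, or raise if it is missing."""
    directory = Path(input_dir)
    path = directory / filename
    if not path.is_file():
        # List the PDFs that are present to make a wrong filename easy to spot.
        available = sorted(p.name for p in directory.glob("*.pdf")) if directory.is_dir() else []
        raise FileNotFoundError(f"{path} not found; PDFs available: {available}")
    return path


# Demonstration in a temporary directory with an empty placeholder file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "example-10k.pdf").touch()
    found = locate_pdf(tmp, "example-10k.pdf")
    print(found.name)
```

In the notebook, the equivalent call would be `locate_pdf(os.path.join(dir_cur, "data", "raw_input"), pdf_file)`.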

##### Load and split the pdf

In [6]:
pdf_directory = [
    {"pdf": input_file},
]

extract_config = ExtractPDFConfig(
    model_config=NougatModelConfig(
        model_name="0.1.0-small",
        batch_size=128,  # When batch_size>1, nougat will run on CUDA, otherwise it will run on CPU
    )
)

nougat_client = ExtractClient(extract_config)

pdf_output = nougat_client.run(pdf_directory)


INFO: likely hallucinated title at the end of the page: ## Appendix B


100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [07:10<00:00, 430.25s/it]


##### Merge the extracted lines into chunks of at most 1,000 tokens each, instead of one element per PDF line.

In [7]:
def count_tokens(text):
    # Approximate the token count by treating each whitespace-separated word as one token.
    return len(text.split())

def recreate_string(pdf_output):
    # Greedily merge consecutive lines into chunks of at most 1,000 tokens.
    recreated_output = []
    current_element = ""
    for line in pdf_output:
        line = line.rstrip('\n')  # Remove the trailing newline character
        if current_element:
            temp_element = current_element + " " + line
        else:
            temp_element = line

        if count_tokens(temp_element) <= 1000:
            current_element = temp_element
        else:
            # Close the current chunk (if any) and start a new one with this line.
            if current_element:
                recreated_output.append(current_element)
            current_element = line

    if current_element:
        recreated_output.append(current_element)

    return recreated_output

page_contents = recreate_string(pdf_output[0]['output'][0]['text'])
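
To see how this greedy merging behaves, here is a small self-contained sketch of the same idea with a limit of 5 tokens instead of 1,000. `pack_lines` is a hypothetical name used only for this illustration. Note that counting whitespace-separated words is only an approximation; the tokenizer used by OpenAI models generally yields a different count.

```python
def pack_lines(lines, max_tokens, count=lambda text: len(text.split())):
    """Greedily merge consecutive lines into chunks of at most max_tokens tokens."""
    chunks, current = [], ""
    for line in lines:
        candidate = f"{current} {line}" if current else line
        if count(candidate) <= max_tokens:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = line  # a single over-long line becomes its own chunk
    if current:
        chunks.append(current)
    return chunks


sample = ["a b c", "d e", "f g h i", "j"]
print(pack_lines(sample, max_tokens=5))
# → ['a b c d e', 'f g h i j']
```

The `count` argument is where a real tokenizer could be plugged in if a tighter bound on prompt size is needed.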

### Prepare sample prompts

First, we need to define a sample prompt for the LLM. Because we are generating summaries rather than the default questions and answers, we need a custom `instruction` and custom examples (the `few_shot_prompt` argument), both of which we configure in the `PromptTemplate` class.

The custom `instruction` tells the LLM to generate summaries instead of the default questions and answers.

The examples are a list of `Context` objects. Each `Context` carries a custom `summary` property that holds an example summary of its `context`.

In [8]:
guided_prompt = PromptTemplate(
    instruction="Generate a one sentence summary based on the last context below. Follow the format of the examples below to include context and summary in the response",
    few_shot_prompt=[
        Context(
            context="When you're operating on the maker's schedule, meetings are a disaster. A single meeting can blow a whole afternoon, by breaking it into two pieces each too small to do anything hard in. Plus you have to remember to go to the meeting. That's no problem for someone on the manager's schedule. There's always something coming on the next hour; the only question is what. But when someone on the maker's schedule has a meeting, they have to think about it.",
            summary="Meetings disrupt the productivity of those following a maker's schedule, dividing their time into impractical segments, while those on a manager's schedule are accustomed to a continuous flow of tasks.",
        ),
    ],
)
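
To make the role of the instruction and the few-shot examples concrete, the sketch below shows one plausible way such pieces can be combined into a single prompt. It is an illustration only: `render_few_shot_prompt` is a hypothetical function, and `uniflow` assembles its actual prompt internally, possibly in a different layout.

```python
import json


def render_few_shot_prompt(instruction, examples, context):
    """Join an instruction, worked examples, and one unanswered context into a prompt."""
    blocks = [instruction]
    for example in examples:
        blocks.append(json.dumps(example, ensure_ascii=False))
    # The last block has an empty summary, which is what the model is asked to fill in.
    blocks.append(json.dumps({"context": context, "summary": ""}, ensure_ascii=False))
    return "\n\n".join(blocks)


prompt = render_few_shot_prompt(
    instruction="Generate a one sentence summary based on the last context below.",
    examples=[
        {
            "context": "Meetings split a maker's afternoon in two.",
            "summary": "Meetings fragment a maker's working time.",
        }
    ],
    context="Fulfillment costs rise when inventory is not optimized.",
)
print(prompt)
```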

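Next, we convert the `page_contents` from above into `Context` objects to be processed by `uniflow`. We take chunks 6 through 15, keep only those longer than 200 characters, and truncate each to its first 800 characters.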

In [9]:
data = [Context(context=p[:800], summary="") for p in page_contents[6:16] if len(p) > 200]
data

[Context(context='In addition, failure to optimize inventory or staffing in our fulfillment network increases our net shipping cost by requiring long-zone or partial shipments. We and our co-sourcers may be unable to adequately staff our fulfillment network and customer service centers. For example, productivity across our fulfillment network currently is being affected by regional labor market and global supply chain constraints, which increase payroll costs and make it difficult to hire, train, and deploy a sufficient number of people to operate our fulfillment network as efficiently as we would like. Under some of our commercial agreements, we maintain the inventory of other companies, thereby increasing the complexity of tracking inventory and operating our fulfillment network. Our failure to properly h', summary=''),
 Context(context='We also rely on a significant number of personnel to operate our stores, fulfillment network, and data centers and carry out our other operations. F
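
The selection above slices by characters, not tokens, which is why the first context ends mid-word. A small sketch of the same filter-and-truncate logic on toy strings, using a hypothetical `select_chunks` helper:

```python
def select_chunks(chunks, start, stop, min_chars=200, max_chars=800):
    """Take chunks[start:stop], keep those longer than min_chars, cut to max_chars."""
    return [c[:max_chars] for c in chunks[start:stop] if len(c) > min_chars]


chunks = ["x" * 50, "y" * 300, "z" * 1200]
selected = select_chunks(chunks, 0, 3)
print([len(c) for c in selected])
# → [300, 800]
```

The 50-character chunk is dropped, the 300-character chunk passes through unchanged, and the 1,200-character chunk is cut to 800 characters.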

### Use LLM to generate data

In this example, we will use the [OpenAIModelConfig](https://github.com/CambioML/uniflow/blob/main/uniflow/model/config.py#L17)'s default LLM to generate summaries.

Here, we pass in our `guided_prompt` to the `TransformOpenAIConfig` to use our customized instructions and examples, instead of the `uniflow` default ones.

We also want to get the response in JSON format instead of the default `text`, so we set the `response_format` to `json_object`.

In [10]:
config = TransformOpenAIConfig(
    prompt_template=guided_prompt,
    model_config=OpenAIModelConfig(response_format={"type": "json_object"}),
)
client = TransformClient(config)
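
As far as OpenAI's documentation describes it, the `json_object` response format constrains the model to emit valid JSON but does not guarantee any particular keys, so a response may still lack `context` or `summary`. The sketch below shows one way to check a raw JSON string for required keys. `parse_summary` is a hypothetical helper; `uniflow` may already parse responses before returning them.

```python
import json


def parse_summary(raw, required=("context", "summary")):
    """Parse a JSON-object response and report which required keys are missing."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None, list(required)
    if not isinstance(data, dict):
        return None, list(required)
    missing = [key for key in required if key not in data]
    return data, missing


data, missing = parse_summary('{"summary": "Short summary."}')
print(missing)
# → ['context']
```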

Now we call the `run` method on the `client` object to execute the summary generation operation on the data shown above.

In [11]:
output = client.run(data)

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:19<00:00,  1.91s/it]


### Process the output

Let's take a look at the generated output. We need to do a little post-processing on the raw output.

In [12]:
# Extract each context and its summary into a DataFrame
contexts = []
summaries = []

for item in output:
    for i in item.get('output', []):
        for response in i.get('response', []):
            if any(key not in response for key in ['context', 'summary']):
                print("Missing context or summary in response:", response)
                continue
            contexts.append(response['context'])
            summaries.append(response['summary'])

# Set display options
pd.set_option('display.max_colwidth', None)
pd.set_option('display.width', 1000)

df = pd.DataFrame({
    'Context': contexts,
    'Summaries': summaries,
})

styled_df = df.style.set_properties(**{'text-align': 'left'}).set_table_styles([{
    'selector': 'th',
    'props': [('text-align', 'left')]
}])
styled_df

Missing context or summary in response: {'summary': 'Failure to optimize inventory and staffing in the fulfillment network increases shipping costs and affects productivity due to labor market and supply chain constraints, making it difficult to efficiently operate the network and deploy sufficient staff.'}
Missing context or summary in response: {'summary': "Operating on the maker's schedule can be disrupted by meetings, while those on the manager's schedule are accustomed to a continuous flow of tasks."}
Missing context or summary in response: {'summary': "Those on a maker's schedule find meetings disruptive and impractical, while those on a manager's schedule are accustomed to a continuous flow of tasks."}
Missing context or summary in response: {'few_shot_response': [{'context': "When you're operating on the maker's schedule, meetings are a disaster. A single meeting can blow a whole afternoon, by breaking it into two pieces each too small to do anything hard in. Plus you have to r

| | Context | Summaries |
|---|---|---|
| 0 | * disruption of our ongoing business, including loss of management focus on existing businesses; * problems retaining key personnel; * additional operating losses and expenses of the businesses we acquired or in which we invested; * the potential impairment of tangible and intangible assets and goodwill, including as a result of acquisitions; * the potential impairment of customer and other relationships of the company we acquired or in which we invested or our own customers as a result of any integration of operations; * the difficulty of completing such transactions, including obtaining regulatory approvals or satisfying other closing conditions, and achieving anticipated benefits within expected timeframes, or at all; * the difficulty of incorporating acquired operations, technology, and employees into our existing business; * the potential for diversion of management's attention from other business concerns; * risks related to the concentration of investment we have in a few companies; * the potential failure to generate, or delays in the generation of, expected revenue or synergies from new investments and acquisitions; * higher-than-expected costs or unanticipated liabilities associated with new investments and acquisitions; * potential loss of our ability to use net operating losses to offset future taxable income; * the potential effect on our brand and customer demand for our products and services; * the potential failure to maintain the value of our brands; * our vulnerability to general adverse economic and industry conditions; * increased competition in our markets and the potential effect on our market share; * potential changes in the market for and demand for our products and services; * the potential effect of tax law changes and changes in tax rates; * potential changes in accounting standards and other legal requirements or environmental and other regulations; * potential changes in interest rates or foreign currency exchange rates; * the potential effect of competition, regulatory changes and other factors in our industry; * potential changes in the relationships between the United States and other countries; and * potential changes in the political environment in the United States and other countries, and their potential effects on our business. | Risks associated with disruptions to ongoing businesses, difficulties in integrating operations and technology, potential loss of key personnel, and challenges in realizing expected benefits and revenue from new investments and acquisitions pose significant threats to the company's stability and success. |
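
The messages above show that several responses lacked a `context` key and were therefore dropped. A variation of the loop that returns the skipped responses alongside the complete rows makes it easier to see how much was lost. This is a minimal sketch on toy data; `collect_rows` is a hypothetical helper, and the nested `output` / `response` layout simply mirrors the loop used in this notebook.

```python
def collect_rows(output, required=("context", "summary")):
    """Split responses into complete (context, summary) rows and skipped responses."""
    rows, skipped = [], []
    for item in output:
        for result in item.get("output", []):
            for response in result.get("response", []):
                if isinstance(response, dict) and all(k in response for k in required):
                    rows.append((response["context"], response["summary"]))
                else:
                    skipped.append(response)
    return rows, skipped


sample_output = [
    {"output": [{"response": [{"context": "c1", "summary": "s1"}, {"summary": "s2"}]}]},
]
rows, skipped = collect_rows(sample_output)
print(len(rows), len(skipped))
# → 1 1
```

Comparing `len(rows)` with the number of input contexts gives a quick measure of how often the model followed the requested format.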


Finally, we can save the output to a CSV file.

In [13]:
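output_dir = 'data/output'

# Create the output directory if it does not exist yet.
os.makedirs(output_dir, exist_ok=True)

# Name the CSV after the processed PDF so the file name matches its contents.
output_name = os.path.splitext(pdf_file)[0]
df.to_csv(f"{output_dir}/{output_name}-summaries.csv", index=False)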