# Notebook for Report Questions Flow via OpenAI
In this example, we will show you how to generate question-answer pairs (QAs) from a PDF using OpenAI's models via `uniflow`'s [OpenAIJsonModelFlow](https://github.com/CambioML/uniflow/blob/main/uniflow/flow/model_flow.py#L125).


### Before running the code

You will need the `uniflow` conda environment to run this notebook. You can set up the environment by following the installation instructions: https://github.com/CambioML/uniflow/tree/main#installation.

Next, you will need a valid [OpenAI API key](https://platform.openai.com/api-keys) to run the code. Once you have the key, set it as the environment variable `OPENAI_API_KEY` within a `.env` file in the root directory of this repository. For more details, see these [instructions](https://github.com/CambioML/uniflow/tree/main#api-keys).
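
Before running the cells below, it can help to confirm that the keys are visible to the Python process. The following is a minimal sketch using only the standard library. `missing_keys` is a hypothetical helper (not part of `uniflow`), and it assumes the `.env` file has already been loaded, for example by `load_dotenv()`. `CAMBIO_API_KEY` is included because later cells read it for PDF parsing.

```python
import os


def missing_keys(names, env=None):
    """Return the names that are unset or empty in the environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Keys read in this notebook: OPENAI_API_KEY for the LLM, CAMBIO_API_KEY for parsing.
missing = missing_keys(["OPENAI_API_KEY", "CAMBIO_API_KEY"])
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Passing `env` explicitly makes the helper easy to check against a plain dictionary without touching the real environment.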

In this example, we'll be using PDF reports and news articles stored under `data/raw_input/PDD/`.

### Update system path

In [28]:
%reload_ext autoreload
%autoreload 2

import sys

sys.path.append(".")
sys.path.append("..")
sys.path.append("../..")

### Import dependencies

In [99]:
from dotenv import load_dotenv

from uniflow.flow.flow_factory import FlowFactory
from uniflow.flow.client import TransformClient
from uniflow.flow.config import TransformOpenAIConfig
from uniflow.op.model.model_config import OpenAIModelConfig
from uniflow.op.prompt import Context, PromptTemplate

load_dotenv()


True

In [100]:
FlowFactory.list()

{'extract': ['ExtractHTMLFlow',
  'ExtractImageFlow',
  'ExtractIpynbFlow',
  'ExtractMarkdownFlow',
  'ExtractPDFFlow',
  'ExtractTxtFlow',
  'ExtractGmailFlow'],
 'transform': ['TransformAzureOpenAIFlow',
  'TransformComparisonGoogleFlow',
  'TransformComparisonOpenAIFlow',
  'TransformCopyFlow',
  'TransformGoogleFlow',
  'TransformGoogleMultiModalModelFlow',
  'TransformHuggingFaceFlow',
  'TransformLMQGFlow',
  'TransformOpenAIFlow',
  'TransformQuestionExtractionOpenAIFlow',
  'TransformNewsFeedOpenAIFlow',
  'TransformReportGenerationOpenAIFlow'],
 'rater': ['RaterFlow']}

### Prepare the input data
The input PDF is parsed into markdown text before it is passed to the LLM.

In [31]:
import os

from any_parser import AnyParser

example_apikey = os.getenv("CAMBIO_API_KEY")
example_local_file = "data/raw_input/PDD/Hayden.pdf"

op = AnyParser(example_apikey)
content_result = op.extract(example_local_file)
content_result

['# Pinduoduo (Nasdaq: PDD)\n\nPinduoduo is the third largest ecommerce platform in China. We believe the stock is under-\nvalued, as PDD is now trading at ~10.4x EV/FCF (adjusted for stock-based comp) or ~6.4x EV\n/ FCF on 2025 estimates. This valuation seems far too cheap, for a company who is growing\ncurrent revenues at +65% y/y (as of Q3 2022), expected to grow top-line at ~24% y/y over the\nnext three years, and where we expect operating income to 2 - 3x over the same timeframe¹.\n\nIt seems the basis of this opportunity, lies in the investment community\'s broad aversion to\nChinese equities (especially internet companies), in addition to several company specific\nconcerns:\n\n1. The Recent Chinese Equity Sell-off is Due to Politics, Not Fundamentals: At the\nlowest point on Oct 24th, PDD\'s stock price was down by -34% in a single day, due to\nforeign capital fleeing the Chinese equity markets after the President Xi\'s "landslide"\nreelection victory and the appointment of his 

In [52]:
# TODO: Fix later; change the Open-parser library to Any-parser
from uniflow.flow.client import ExtractClient
from uniflow.flow.config import ExtractPDFConfig
from uniflow.op.model.model_config import OpenParserModelConfig

input_file = "data/raw_input/PDD/Hayden.pdf"

data = [
    {"filename": input_file},
]

config = ExtractPDFConfig(
    model_config=OpenParserModelConfig(
        model_name = "CambioML/open-parser",
        api_key = os.getenv("CAMBIO_API_KEY"),
    ),
)
openparser_client = ExtractClient(config)

output = openparser_client.run(data)
# content_result = output[0]['output'][0]['text']
# content_result

  0%|          | 0/1 [00:00<?, ?it/s]

Upload response: 204


100%|██████████| 1/1 [00:23<00:00, 23.16s/it]

Extraction success.





['# HAYDEN CAPITAL',
 '1345 AVENUE OF THE AMERICAS',
 '33RD FLOOR',
 'NEW YORK, NY 10105',
 'HAYDENCAP ITAL.CO',
 '## Pinduoduo (Nasdaq: PDD)',
 'Pinduoduo is the third largest ecommerce platform in China. We believe the stock is under-',
 'valued, as PDD is now trading at ~10.4x EV/FCF (adjusted for stock-based comp) or ~6.4x EV',
 '/ FCF on 2025 estimates. This valuation seems far too cheap, for a company who is growing',
 'current revenues at +65% y/y (as of Q3 2022), expected to grow top-line at ~24% y/y over the',
 'next three years, and where we expect operating income to 2 - 3x over the same timeframe¹.',
 "It seems the basis of this opportunity, lies in the investment community's broad aversion to",
 'Chinese equities (especially internet companies), in addition to several company specific',
 'concerns:',
 '1. The Recent Chinese Equity Sell-off is Due to Politics, Not Fundamentals: At the',
 "lowest point on Oct 24th, PDD's stock price was down by -34% in a single day, due to",

### Use LLM to generate question/answer pairs for given reports

In this example, we will use the [OpenAIModelConfig](https://github.com/CambioML/uniflow/blob/main/uniflow/model/config.py#L17)'s default LLM to generate questions and answers.

In [32]:
# initialize questions bank
questions_bank = []

In [33]:
input_data = [Context(Context=content_result[0])]
config = TransformOpenAIConfig(
    flow_name="TransformQuestionExtractionOpenAIFlow",
    model_config=OpenAIModelConfig(),
)
client = TransformClient(config)

Now we call the `run` method on the `client` object to execute the question-answer generation operation on the data shown above.

In [34]:
output = client.run(input_data)

100%|██████████| 1/1 [00:23<00:00, 23.40s/it]
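
The result of `run` is a nested list of dictionaries. Judging from the outputs printed later in this notebook, each response is a string of the form `question: ...\nanswer: ...`. The following is a minimal sketch of parsing that shape. `parse_qa` is a hypothetical helper, and `sample_output` is a hand-copied sample of one response shown later in this notebook, not a fresh model result.

```python
def parse_qa(response):
    """Split a 'question: ...' / 'answer: ...' string into (question, answer)."""
    question, answer = None, None
    for line in response.split("\n"):
        if line.startswith("question:"):
            question = line[len("question:"):].strip()
        elif line.startswith("answer:"):
            answer = line[len("answer:"):].strip()
    return question, answer


# Sample mirroring the output structure printed later in this notebook.
sample_output = [
    {
        "output": [
            {
                "response": [
                    "question: What is the full name of the company?\nanswer: PDD Holdings Inc."
                ],
                "error": "No errors.",
            }
        ]
    }
]

pairs = [
    parse_qa(response)
    for item in sample_output
    for cur_output in item.get("output", [])
    for response in cur_output.get("response", [])
]
print(pairs)
```

Returning `None` for a missing part lets the caller decide whether to skip incomplete responses.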


### Pipeline

In [57]:
# Extract PDF, generate questions, then store them into a question bank
questions_bank = []

def extract_and_get_questions(file_path):
    content_result = op.extract(file_path)
    input_data = [Context(Context=content_result[0])]
    config = TransformOpenAIConfig(
        flow_name="TransformQuestionExtractionOpenAIFlow",
        model_config=OpenAIModelConfig(),
    )
    client = TransformClient(config)
    output = client.run(input_data)

    # Keep only the question line of each response
    for item in output:
        for cur_output in item['output']:
            for response in cur_output['response']:
                if response.startswith('question:'):
                    question = response.split('\n')[0].replace('question:', '').strip()
                    questions_bank.append(question)

In [None]:
dir_cur = os.getcwd()  # or you can specify any directory
target_dir = os.path.join(dir_cur, "data/raw_input/PDD/reports")
for root, dirs, files in os.walk(target_dir):
    for file in files:
        file_path = os.path.join(root, file)
        extract_and_get_questions(file_path)
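
The loop above hands every file under the target directory to the parser. If the folder may also contain non-PDF files, a filter can avoid failed extraction calls. The following is a minimal sketch: `iter_pdf_paths` is a hypothetical helper that only lists paths and does not call the parser.

```python
import os


def iter_pdf_paths(target_dir):
    """Yield paths of .pdf files under target_dir in a deterministic order."""
    for root, dirs, files in os.walk(target_dir):
        dirs.sort()  # visit subdirectories in sorted order
        for name in sorted(files):
            if name.lower().endswith(".pdf"):
                yield os.path.join(root, name)
```

With this helper, the loop could iterate over `iter_pdf_paths(target_dir)` and call `extract_and_get_questions(file_path)` on each path.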

In [63]:
# total num of questions generated
len(questions_bank)

91

### Now process incoming news feed

In [79]:
example_news = "data/raw_input/PDD/news/sa_news_2.pdf"
news_content = op.extract(example_news)
news_content

["## PDD jumps after Q1 profit surges 200%, transaction services drive revenue growth\n\nMay 22, 2024 7:16 AM ET PDD Holdings Inc. (PDD) Stock NTES, BIDU, BABA.. By: Ravikash Bakolia, SA\nNews Editor\n\n![Etsy, Temu, Etsy, Temu, Gumtree, ebay, eBay, Amazon, AliExpress](https://seekingalpha.com/news/4109204-pdd-rises-after-q1-profit-surges-200-transaction-services-drive-revenue-growth)\n\nPDD's (NASDAQ:PDD) stock rose about 8% premarket on Wednesday after first\nquarter results beat estimates.\n\nAdjusted earnings per American depositary shares surged 199.4% year-over-year to\nRMB20.72 ($2.83), the company said.",
 '5/22/24, 3:01 PM\n\nPDD jumps after Q1 profit surges 200%, transaction services drive revenue growth_Seeking Alpha\n\nThe e-commerce giant\'s total revenues soared nearly 131% year-over-year to\nRMB86.81B (about $12.02B). Both top and bottom lines surpassed analysts\' estimates.\n\n"We will focus our efforts on improving the overall consumer experience, strengthening\nour su

In [93]:
# Optional: use questions_bank[:5] for faster compute time and lower op cost
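
Every question in the bank is sent as part of the prompt instruction, so trimming the bank reduces both latency and cost. Questions generated from several reports may also repeat. The following is a minimal sketch of removing exact duplicates before capping the list. `trim_question_bank` is a hypothetical helper, and the sample questions are copied from outputs shown in this notebook.

```python
def trim_question_bank(questions, limit=None):
    """Drop blanks and exact duplicates (keeping first occurrences), then cap the list."""
    unique = list(dict.fromkeys(q.strip() for q in questions if q.strip()))
    return unique if limit is None else unique[:limit]


sample_bank = [
    "What is the full name of the company?",
    "What is the full name of the company?",
    "Where will the American depositary shares of PDD be listed?",
]
print(trim_question_bank(sample_bank, limit=5))
```

`dict.fromkeys` preserves insertion order, so the first occurrence of each question keeps its position.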

In [83]:
input_data = [Context(Context=news_content[0])]
config = TransformOpenAIConfig(
    flow_name="TransformNewsFeedOpenAIFlow",
    prompt_template=PromptTemplate(instruction='\n'.join(questions_bank)),
    model_config=OpenAIModelConfig(),
)
client = TransformClient(config)

In [91]:
output = client.run(input_data)
output

100%|██████████| 1/1 [01:00<00:00, 60.50s/it]


[{'output': [{'response': ['question: Where is the United States Securities and Exchange Commission located?\nanswer: The United States Securities and Exchange Commission is headquartered in Washington, D.C.'],
    'error': 'No errors.'},
   {'response': ["question: What is the status of the company's Form 20-F filing?\nanswer: N/A"],
    'error': 'No errors.'},
   {'response': ["question: What were PDD's first quarter revenue and net income results?\nanswer: N/A"],
    'error': 'No errors.'},
   {'response': ['question: What is the full name of the company?\nanswer: PDD Holdings Inc.'],
    'error': 'No errors.'},
   {'response': ["question: What were PDD's first quarter 2024 adjusted earnings per American depositary shares?\nanswer: RMB20.72 ($2.83)"],
    'error': 'No errors.'},
   {'response': ['question: What is the translation of the registrant\'s name "Cayman Islands" into English?\nanswer: N/A'],
    'error': 'No errors.'},
   {'response': ['question: What is the jurisdiction o

In [92]:
news_questions = []
news_answers = []
result = output

# Loop through the data structure
for item in result:
    for cur_output in item.get('output', []):
        for response in cur_output.get('response', []):
            # Split the response to extract the answer
            parts = response.split('\n')
            for part in parts:
                if part.startswith('answer:'):
                    answer = part.replace('answer: ', '')
                    if answer != "N/A":
                        news_answers.append(answer)
news_answers

['The United States Securities and Exchange Commission is headquartered in Washington, D.C.',
 'PDD Holdings Inc.',
 'RMB20.72 ($2.83)',
 'PDD Holdings Inc. is incorporated in the Cayman Islands.',
 'The trading symbol of Pinduoduo is PDD and it is registered on the NASDAQ exchange.',
 'The American depositary shares of PDD will be listed on the NASDAQ.',
 'The impact of the consumer wallet share gain strategy on revenue and earnings growth is not provided in the given context.',
 "PDD's performance in Q1 exceeded estimates, with adjusted earnings per American depositary shares surging 199.4% year-over-year to RMB20.72 ($2.83). The drivers of this performance were strong transaction services driving revenue growth.",
 'Transaction services drove revenue growth for PDD in Q1, leading to a 200% surge in profit for the company.',
 "As a financial analyst for institutional investors, the investment recommendation for PDD's stock would be to consider the positive first quarter results and t

In [98]:
# extract questions and answers used for report generation flow
questions = []
answers = []

for item in result:
    for cur_output in item.get('output', []):
        for response in cur_output.get('response', []):
            parts = response.split('\n')
            question = None
            answer = None
            for part in parts:
                if part.startswith('question:'):
                    question = part.replace('question: ', '')
                elif part.startswith('answer:'):
                    answer = part.replace('answer: ', '')

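            if answer and answer != 'N/A' and question:
                questions.append(question)
                answers.append(answer)

# print the paired questions and answers
for i, (question, answer) in enumerate(zip(questions, answers), start=1):
    print(f"Question {i}: {question}")
    print(f"Answer {i}: {answer}")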


Question 1: Where is the United States Securities and Exchange Commission located?
Answer 1: The United States Securities and Exchange Commission is headquartered in Washington, D.C.
Question 2: What is the full name of the company?
Answer 2: PDD Holdings Inc.
Question 3: What were PDD's first quarter 2024 adjusted earnings per American depositary shares?
Answer 3: RMB20.72 ($2.83)
Question 4: What is the jurisdiction of incorporation or organization for PDD Holdings Inc.?
Answer 4: PDD Holdings Inc. is incorporated in the Cayman Islands.
Question 5: What is the trading symbol of Pinduoduo (PDD) and on which exchange is it registered?
Answer 5: The trading symbol of Pinduoduo is PDD and it is registered on the NASDAQ exchange.
Question 6: Where will the American depositary shares of PDD be listed?
Answer 6: The American depositary shares of PDD will be listed on the NASDAQ.
Question 7: What is the impact of the consumer wallet share gain strategy on revenue and earnings growth?
Answer 

### Report Generation

In [123]:
concatenated_questions = "\n".join(questions)
concatenated_answers = "\n".join(answers)

input_data = [[Context(context=concatenated_questions), Context(context=concatenated_answers)]]
config = TransformOpenAIConfig(
    flow_name="TransformReportGenerationOpenAIFlow",
    model_config=OpenAIModelConfig(),
)
client = TransformClient(config)

In [124]:
client.run(input_data)

100%|██████████| 1/1 [00:12<00:00, 12.31s/it]


[{'error': 'list indices must be integers or slices, not str',
  'traceback': 'Traceback (most recent call last):\n  File "c:\\Users\\Pumpkinfries\\Desktop\\Cambio\\uniflow-llm-based-pdf-extraction-text-cleaning-data-clustering\\example\\transform\\../..\\uniflow\\flow\\server.py", line 165, in _run_flow\n    output = f(input_list)\n  File "c:\\Users\\Pumpkinfries\\Desktop\\Cambio\\uniflow-llm-based-pdf-extraction-text-cleaning-data-clustering\\example\\transform\\../..\\uniflow\\flow\\flow.py", line 36, in __call__\n    nodes = self.run(nodes)\n  File "c:\\Users\\Pumpkinfries\\Desktop\\Cambio\\uniflow-llm-based-pdf-extraction-text-cleaning-data-clustering\\example\\transform\\../..\\uniflow\\flow\\transform\\transform_report_generation_flow.py", line 210, in run\n    answer_node_grouped = self._group(\n  File "c:\\Users\\Pumpkinfries\\Desktop\\Cambio\\uniflow-llm-based-pdf-extraction-text-cleaning-data-clustering\\example\\transform\\../..\\uniflow\\op\\basic\\group_op.py", line 52, i
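
The error message above is the standard Python `TypeError` raised when a list is indexed with a string key. One possible cause, stated here as an assumption because the traceback is truncated, is that `input_data` wraps the two `Context` objects in an extra list, so a step that expects a single item receives a list instead. The sketch below reproduces the message with plain dictionaries. `read_contexts` is a hypothetical stand-in and not the `uniflow` grouping code.

```python
def read_contexts(items):
    """Read the 'context' field from each item (dict-based stand-in)."""
    return [item["context"] for item in items]


flat = [{"context": "questions"}, {"context": "answers"}]
nested = [flat]  # one extra level of nesting

print(read_contexts(flat))

try:
    read_contexts(nested)
except TypeError as err:
    # Indexing the inner list with a string key triggers the same message.
    print(err)
```

If this is the cause, checking the input shape the flow expects would be the first thing to verify.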

## End of the notebook

Check more Uniflow use cases in the [example folder](https://github.com/CambioML/uniflow/tree/main/example/model#examples)!

<a href="https://www.cambioml.com/" title="Title">
    <img src="../image/cambioml_logo_large.png" style="height: 100px; display: block; margin-left: auto; margin-right: auto;"/>
</a>