# Pipeline example: extracting and transforming a PDF file

In this example, we will show you how to use uniflow to extract and transform knowledge from an unstructured PDF file.

Specifically, we will show you how to generate question-answer pairs (QAs) end to end from a given PDF using uniflow's `MultiFlowsPipeline`.

### Before running the code

You will need the `uniflow` conda environment to run this notebook. You can set up the environment by following the installation instructions: https://github.com/CambioML/uniflow/tree/main#installation.

### Update system path

In [1]:
%reload_ext autoreload
%autoreload 2

import sys
import pprint
import re

sys.path.append(".")
sys.path.append("..")
sys.path.append("../..")

### Install libraries

In [None]:
!{sys.executable} -m pip install -q transformers accelerate bitsandbytes scipy nougat-ocr

### Import dependency

In [3]:
import os
import pandas as pd
from uniflow.pipeline import MultiFlowsPipeline
from uniflow.flow.config import PipelineConfig
from uniflow.flow.config import TransformHuggingFaceConfig, ExtractPDFConfig
from uniflow.op.model.model_config import HuggingfaceModelConfig, NougatModelConfig
from uniflow.op.prompt import PromptTemplate, Context
from uniflow.op.extract.split.constants import MARKDOWN_HEADER_SPLITTER



### Prepare the input data

First, let's set the current directory and the input data directory, and build the list of raw input files.

In [4]:
dir_cur = os.getcwd()
pdf_file = "nike-paper.pdf"
input_file = os.path.join(dir_cur, "data", "raw_input", pdf_file)

data = [
    {"pdf": input_file},
]
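
Before running a long extraction, it can help to confirm that every input path actually points to a file. The helper below is a minimal sketch that is not part of uniflow; the name `find_missing_inputs` and the stand-in list `example_data` are hypothetical and only mirror the shape of `data` above.

```python
import os


def find_missing_inputs(records, key="pdf"):
    """Return the paths stored under `key` that do not point to an existing file."""
    return [r[key] for r in records if not os.path.isfile(r[key])]


# Stand-in list with the same shape as `data` in the cell above (hypothetical).
example_data = [
    {"pdf": os.path.join(os.getcwd(), "data", "raw_input", "nike-paper.pdf")},
]

missing = find_missing_inputs(example_data)
if missing:
    print("Missing input files:", missing)
else:
    print("All input files found.")
```

Failing early here is cheaper than discovering a wrong path after the models have been loaded.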

### Define extract config using Nougat

In [5]:
extract_config = ExtractPDFConfig(
    model_config=NougatModelConfig(
        model_name="0.1.0-small",
        batch_size=1,  # When batch_size > 1, nougat will run on CUDA; otherwise it will run on CPU
    ),
    splitter=MARKDOWN_HEADER_SPLITTER
)
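
To give a rough idea of what splitting on markdown headers means, here is a minimal plain-Python sketch. It is an illustration under our own assumptions, not uniflow's implementation of `MARKDOWN_HEADER_SPLITTER`, which may treat edge cases differently; the function name `split_on_markdown_headers` and the sample text are hypothetical.

```python
import re


def split_on_markdown_headers(text):
    """Split markdown text into chunks, each starting at a header line."""
    chunks, current = [], []
    for line in text.splitlines():
        # Treat 1-6 '#' characters followed by a space as a header line.
        if re.match(r"#{1,6} ", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    # Drop chunks that are empty after stripping whitespace.
    return [c for c in chunks if c]


sample = (
    "# Title\nIntro text.\n"
    "## 2 Study Design\nWe selected athletes.\n"
    "## 3 Data Exploration\nIn Figure 1, we plot."
)
for chunk in split_on_markdown_headers(sample):
    print(repr(chunk))
```

Each chunk keeps its header, which is consistent with the contexts in the final output table starting with headings such as `## 2 Study Design`.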

### Prepare sample prompts

Now we need to write a few prompts to generate a question and answer for a given paragraph. Each prompt includes an instruction and a list of examples with "context", "question" and "answer". We do this by giving a sample list of `Context` examples to the `PromptTemplate` class.

In [6]:
guided_prompt = PromptTemplate(
    instruction="""Generate one question and its corresponding answer based on the last context in the last
    example. Follow the format of the examples below to include context, question, and answer in the response""",
    few_shot_prompt=[
        Context(
            context="In 1948, Claude E. Shannon published A Mathematical Theory of\nCommunication (Shannon, 1948) establishing the theory of\ninformation. In his article, Shannon introduced the concept of\ninformation entropy for the first time. We will begin our journey here.",
            question="Who published A Mathematical Theory of Communication in 1948?",
            answer="Claude E. Shannon.",
        ),
        
])
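
To make the idea of a few-shot prompt concrete, the sketch below renders an instruction, a list of examples, and a new context into a single text prompt. This is an assumption about what such rendering could look like, not how uniflow serializes a `PromptTemplate` internally. The `Example` dataclass and `render_prompt` function are hypothetical stand-ins; the `context:`, `question:` and `answer:` labels are chosen to match the keywords used in the post-processing cell later in this notebook.

```python
from dataclasses import dataclass


@dataclass
class Example:
    """Hypothetical stand-in for a few-shot example (not uniflow's Context)."""
    context: str
    question: str = ""
    answer: str = ""


def render_prompt(instruction, examples, new_context):
    """Join the instruction, the worked examples, and the new context into one prompt."""
    parts = [instruction.strip()]
    for ex in examples:
        parts.append(
            f"context: {ex.context}\nquestion: {ex.question}\nanswer: {ex.answer}"
        )
    # The last block has only a context; the model is expected to continue from here.
    parts.append(f"context: {new_context}\nquestion:")
    return "\n\n".join(parts)


prompt = render_prompt(
    "Generate one question and its corresponding answer based on the last context.",
    [Example("Shannon published his theory in 1948.", "Who published it?", "Shannon.")],
    "We selected athletes who recorded a sufficiently fast marathon time.",
)
print(prompt)
```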

### Define transform config

In this example, we will use the default LLM of [HuggingfaceModelConfig](https://github.com/CambioML/uniflow/blob/main/uniflow/model/config.py#L39) to generate questions and answers.

Here, we pass our `guided_prompt` to `TransformHuggingFaceConfig` to use our customized instructions and examples instead of the `uniflow` default ones.

Note: based on your GPU memory, you can set your optimal `batch_size` below.

In [7]:
current_batch_size = 1
print("batch size:", current_batch_size)

transform_config = TransformHuggingFaceConfig(
    prompt_template=guided_prompt,
    model_config=HuggingfaceModelConfig(batch_size=current_batch_size)
)

batch size: 1
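
As a rough mental model, `batch_size` presumably controls how many inputs are sent to the model in one forward pass, so a larger value means fewer passes but more GPU memory per pass. The sketch below only illustrates that grouping in plain Python; `make_batches` is a hypothetical helper, the chunk count of 13 is arbitrary, and uniflow's internal batching may differ.

```python
def make_batches(items, batch_size):
    """Group `items` into consecutive lists of at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


chunks = [f"chunk-{i}" for i in range(13)]  # arbitrary illustrative inputs

print(len(make_batches(chunks, 1)))  # 13 batches of one chunk each
print(len(make_batches(chunks, 4)))  # 4 batches: sizes 4, 4, 4, 1
```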


### Use MultiFlowsPipeline

Let's use `PipelineConfig` to connect `extract_config` and `transform_config` in a `MultiFlowsPipeline`.

In [8]:
p = MultiFlowsPipeline(PipelineConfig(
    extract_config=extract_config,
    transform_config=transform_config,
))

Loading checkpoint shards: 100%|██████████| 2/2 [00:04<00:00,  2.10s/it]


Now we call the `run` method on the `MultiFlowsPipeline` object to execute the question-answer generation operation on the data shown above.

In [9]:
output = p.run(data)

  0%|          | 0/1 [00:00<?, ?it/s]

100%|██████████| 1/1 [00:42<00:00, 42.38s/it]
100%|██████████| 13/13 [07:19<00:00, 33.78s/it]


### Output

Let's take a look at the generated output for the Abstract segment.

### Process the output

Let's take a look at the generated output. We need to do a little post-processing on the raw output.

In [13]:
# Extracting context, question, and answer into a DataFrame
contexts = []
questions = []
answers = []

keywords = ["context:", "question:", "answer:"]
pattern = '|'.join(map(re.escape, keywords))

for item in output[0][3:5]:
    o = item['output'][0]['response'][0]
    segments = [segment for segment in re.split(pattern, o) if segment.strip()]

    contexts.append(segments[-3])
    questions.append(segments[-2])
    answers.append(segments[-1])

# Set display options
pd.set_option('display.max_colwidth', None)
pd.set_option('display.width', 1000)

df = pd.DataFrame({
    'Context': contexts,
    'Question': questions,
    'Answer': answers
})

styled_df = df.style.set_properties(**{'text-align': 'left'}).set_table_styles([{
    'selector': 'th',
    'props': [('text-align', 'left')]
}])
styled_df

Index,Context,Question,Answer
0,"## 2 Study Design We selected athletes who recorded a sufficiently fast marathon time--men under 2:24 and women under 2:45--at a collection of 22 distinct marathon venues in 2015 or 2016, including the 2016 U.S. Olympic Marathon Trials, which were contested in Los Angeles in February of 2016. The list of marathons is included in the Appendix. This resulted in a sample of 270 distinct women and 308 distinct men after matching names and our best effort to correct alternate spellings of names. We recorded these athletes' performances in the same 22 marathon venues over the period 2015 to 2019, and searched publicly available online photographs, manually identifying whether or not each athlete was wearing a Nike Vaporfly shoe by visual inspection. All marathon times were downloaded from the website www.marathonguide.com. Our criteria for inclusion in the study were meant to satisfy certain objectives. First, we wanted to study elite and sub-elite athletes, since shoe regulations are motivated by performance advantages for athletes in this group. Second, we wanted to study athletes who had achieved success in the marathon before the Nike Vaporfly shoes had been released to the public. This ensures that inclusion in the study is unrelated to whether an athlete was wearing the shoes in the race where they qualified for inclusion in the study. This is important because, if any shoe effect exists, the magnitude of the effect may differ among different athletes. If we were to use performances potentially aided by the shoes to select the athletes, that might have biased our sample towards athletes who benefit most from the shoes. To identify shoes worn by the runners, we used photos posted on public websites such as marathonfoto.com, marathon-photos.com, sportphoto.com, and flashframe.io. We also collected photographs from social media sites such as facebook.com and instagram.com. 
We assumed that Vaporfly shoes were not worn in 2015 or 2016 by any runners except for a few that were reported to have worn prototypes in the 2016 US Olympic Trials Marathon. Identification of shoes via photos is a manual process that is subject to error. We have made all of our shoe identifications publicly available at [https://github.com/joeguinness/vaporfly](https://github.com/joeguinness/vaporfly) and will update this paper with new data if we are made aware of any errors in shoe identification. We identified the shoes worn in 840 of 880 (95.5%) men's performances in our dataset and in 778 of 810 (96.0%) women's performances.",What were the criteria for inclusion in the study design?,"The criteria for inclusion in the study design were to study elite and sub-elite athletes, and athletes who had achieved success in the marathon before the Nike Vaporfly shoes had been released to the public."
1,"## 3 Data Exploration In Figure 1, we plot some summaries of the data. The left plot contains the proportion of runners wearing Vaporflys in each race in our dataset, separated by sex. Aside from a few prototypes being used in 2016, adoption of the shoes began in early 2017 and rose to over 50% on average in races at the end of 2019. The right plot contains the average marathon time for each athlete in the dataset in Vaporfly vs. non-Vaporfly shoes. Most runners' average time in Vaporfly shoes is faster than their average time in non-Vaporfly shoes. Specifically, 53 of 71 men (74.5%) who switched to Vaporflys ran faster in them, and 40 of 56 women (71.4%) who switched to Vaporflys ran faster in them. The right plot does not tell the whole story because it might be the case that runners who switched to Vaporflys did so when they ran on faster marathon courses. Some courses, such as the Boston Marathon course, have hills or often have poor weather, while others are flat and fast. So it is important to use the data to attempt to account for the difficulty of each Figure 1: (Left) Each circle represents an individual race, with the area of the circle proportional to the number of runners from the race in our dataset, and the vertical position equal to the proportion of runners wearing Vaporfly shoes in the race. (Right) Each circle represents an athlete, with the horizontal position being the athlete’s average marathon time in non-Vaporfly shoes, and the vertical position being the athlete’s average time in Vaporfly shoes. course. To get a satisfactory estimate of the effect of Vaporfly shoes, we need to analyze all of the data holistically, controlling for the strength of each runner and the difficulty of each marathon course. 
In the next section, we describe a statistical model intended for that purpose.",What is shown in the left plot of Figure 1?,"The left plot of Figure 1 shows the proportion of runners wearing Vaporflys in each race in our dataset, separated by sex. It also indicates that adoption of the shoes began in early 2017 and rose to over 50% on average in races at the end of 2019."
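
The post-processing cell above indexes `segments[-3]`, `segments[-2]` and `segments[-1]` directly, which raises an `IndexError` if a response contains fewer than three segments. A slightly more defensive variant is sketched below; `parse_last_qa` and the `mock` response are hypothetical and only imitate the assumed shape of a model response.

```python
import re

KEYWORDS = ("context:", "question:", "answer:")
PATTERN = "|".join(map(re.escape, KEYWORDS))


def parse_last_qa(response):
    """Return (context, question, answer) from the last example in `response`.

    Returns None when fewer than three non-empty segments are found.
    """
    segments = [s.strip() for s in re.split(PATTERN, response) if s.strip()]
    if len(segments) < 3:
        return None
    return tuple(segments[-3:])


# Mock response imitating an instruction, one few-shot example, and one generated QA.
mock = (
    "Generate one question...\n"
    "context: A\nquestion: Q1?\nanswer: A1.\n"
    "context: B text\nquestion: Q2?\nanswer: A2."
)
print(parse_last_qa(mock))
print(parse_last_qa("no keywords here"))
```

Rows for which the function returns `None` can then be skipped or logged instead of stopping the whole loop.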
