# Build Your Own Research Paper Agent: Transform an Unstructured Research Paper into a Finetuned LLM
Do you want to build an agent that can answer questions about a research paper? In this example, we will show you how to use `uniflow` and `pykoi` to extract knowledge from a Nike research paper and then finetune an LLM on this knowledge.

First, we'll use `uniflow` to generate question-answer pairs (QAs) from a PDF using OpenAI's models via `uniflow`'s `MultiFlowsPipeline`.

Next, we'll use `pykoi` to run supervised fine-tuning (SFT) on the QAs generated by `uniflow`.

Finally, we'll use `pykoi`'s Chatbot to run the SFT model, so you can ask questions about the paper and get answers.

For this example, we're using the paper "An Observational Study of the Effect of Nike Vaporfly Shoes on Marathon Performance" from Cornell.

>*Note: In order to run this notebook, you need a GPU (for the `RLHF`).*

### Before running the code

You will need to set up a conda environment to run this notebook. You can set up the environment by following these [instructions](https://github.com/CambioML/cambio-recipes/tree/main#installation).

We are using `uniflow` and several of the `pykoi` modules, so you will need to install these in your environment as well:

```bash
pip3 install uniflow
pip3 install "pykoi[huggingface, rag, rlhf]"
```
Finally, you will need to install torch:

```bash
pip3 uninstall torch
pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu121  # cu121 means cuda 12.1
```

Next, you will need a valid [OpenAI API key](https://platform.openai.com/api-keys) to run the code. Once you have the key, set it as the environment variable `OPENAI_API_KEY` within a `.env` file in the root directory of this repository. For more details, see these [instructions](https://github.com/CambioML/cambio-recipes/tree/main#api-keys).
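
Before running the pipeline, it can help to confirm that the key is actually visible to the Python process. The following is a minimal sketch using only the standard library; `has_openai_key` is a hypothetical helper name, not part of `uniflow` or `pykoi`.

```python
import os


def has_openai_key(env=os.environ):
    """Return True if OPENAI_API_KEY is present and non-empty in `env`."""
    return bool(env.get("OPENAI_API_KEY", "").strip())


if not has_openai_key():
    print("OPENAI_API_KEY is not set; add it to your .env file and reload it.")
```

Run this check after `load_dotenv()` so that values from the `.env` file have already been loaded into the environment.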

## 1. Generate QAs from a Research Paper using `uniflow`

### Update System Path

In [1]:
%reload_ext autoreload
%autoreload 2

import sys

sys.path.append(".")
sys.path.append("..")
sys.path.append("../..")

### Install helper packages
If you already have these installed, feel free to skip this step.

In [2]:
!{sys.executable} -m pip install -q transformers accelerate bitsandbytes scipy #nougat-ocr


### Import Dependencies

In [3]:
import os
import pandas as pd
from uniflow.pipeline import MultiFlowsPipeline
from uniflow.flow.config import PipelineConfig
from uniflow.flow.config import TransformOpenAIConfig, ExtractPDFConfig
from uniflow.op.model.model_config import  NougatModelConfig
from uniflow.op.prompt import Context
from uniflow.op.extract.split.constants import PARAGRAPH_SPLITTER

from dotenv import load_dotenv
load_dotenv()

True

### Prepare Input data

First, let's set the current directory and input data directory, and load the raw input data.

In [4]:
dir_cur = os.getcwd()
pdf_file = "nike-paper.pdf"
input_file = os.path.join(f"{dir_cur}/data/raw_input/", pdf_file)

data = [
    {"pdf": input_file},
]
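
A missing or misnamed input file is an easy mistake to make at this step, so a quick check before running the extraction can save time. This is a minimal sketch under the assumption that the input is a local file; `require_pdf` is a hypothetical helper, not a `uniflow` function.

```python
from pathlib import Path


def require_pdf(path):
    """Return `path` as a Path, or raise if it is not an existing .pdf file."""
    p = Path(path)
    if p.suffix.lower() != ".pdf":
        raise ValueError(f"Expected a .pdf file, got: {p.name}")
    if not p.is_file():
        raise FileNotFoundError(f"Input file not found: {p}")
    return p
```

In this notebook you could call `require_pdf(input_file)` right after building `input_file`.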

### Define extract config using Nougat
For this example, we'll run the `ExtractPDF` flow to extract the text from the research paper PDF. This uses the [Nougat](https://pypi.org/project/nougat-ocr/0.1.17/) PDF parser.

In [5]:
extract_config = ExtractPDFConfig(
    model_config=NougatModelConfig(
        batch_size = 4 # When batch_size>1, nougat will run on CUDA, otherwise it will run on CPU
    ),
    splitter=PARAGRAPH_SPLITTER
)

Here are some reference runtimes for different values of `batch_size` with the `PARAGRAPH_SPLITTER`:

| `batch_size` | Runtime [m:s] |
|--------------|---------------|
| 1 | 01:05 |
| 2 | 00:41 |
| 4 | 00:27 |
| 8 | 00:15 | 

### Prepare sample prompts
Now we need to write a short prompt that generates questions and answers for a given paragraph. We'll generate multiple QAs for each paragraph.

In [6]:
# Modify the number of Q&A sets as desired
number_QAs = 3 #1, 3, or 5

Each prompt includes an instruction and a list of examples, each with a "context" and a list of "Question" / "Answer" pairs. We do this by giving a sample list of `Context` examples to the `PromptTemplate` class.

Modify the `few_shot_examples` so that each example has the same number of questions as `number_QAs`.
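
Since the example count has to be kept in sync by hand, a small check can catch a mismatch early. The sketch below uses plain dictionaries as stand-ins for `Context` objects, and `check_qa_counts` is a hypothetical helper written for illustration only.

```python
def check_qa_counts(examples, expected):
    """Return the indices of examples whose number of QA pairs differs from `expected`."""
    return [i for i, ex in enumerate(examples) if len(ex["qas"]) != expected]


# Plain-dict stand-ins for the few-shot examples (illustrative data only)
examples = [
    {
        "context": "The quick brown fox jumps over the lazy black dog.",
        "qas": [
            {"Question": "What is the color of the fox?", "Answer": "Brown"},
            {"Question": "What is the color of the dog?", "Answer": "Black"},
            {"Question": "What does the fox jump over?", "Answer": "Dog"},
        ],
    },
    {
        "context": "Snoopy can be selfish, but loves his owner, Charlie Brown.",
        "qas": [
            {"Question": "How does Snoopy sometimes behave?", "Answer": "Selfish"},
            {"Question": "How does Snoopy feel about his owner?", "Answer": "Loves"},
            {"Question": "What is the name of Snoopy's owner?", "Answer": "Charlie Brown"},
        ],
    },
]

mismatched = check_qa_counts(examples, expected=3)
print("Mismatched examples:", mismatched)
```

An empty list means every example matches the requested number of QAs.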

In [7]:
from pprint import pprint

instruction = """
# Instruction
Generate {} question(s) and the corresponding answer(s) based on the context. Following \
the format of the examples below to include context and qas in the response.

## Note: Use the below two examples just as a reference, do not include in your response.
""".format(number_QAs)


few_shot_examples = [
    Context(
        context="The quick brown fox jumps over the lazy black dog.",
        qas=[
        {"Question": "What is the color of the fox?", "Answer": "Brown"},
        {"Question": "What is the color of the dog?", "Answer": "Black"},
        {"Question": "What does the fox jump over?", "Answer": "Dog"},
        # {"Question": "Was the dog lazy?", "Answer": "Yes"},
        # {"Question": "What did the fox do?", "Answer": "Jump"},

    ]
    ),
    Context(
        context="Snoopy can be selfish, but loves his owner, Charlie Brown.",
        qas=[
        {"Question": "How does Snoopy sometimes behave?", "Answer": "Selfish"},
        {"Question": "How does Snoopy feel about his owner?", "Answer": "Loves"},
        {
            "Question": "What is the name of Snoopy's owner?",
            "Answer": "Charlie Brown",
        },
        # { "Question": "Who can be selfish?", "Answer": "Snoopy"},
        # { "Question": "Who is loved?", "Answer": "Charlie Brown"},
    ])
]


### Define transform config
Next, we set up the `TransformConfig`.

In [8]:
transform_config = TransformOpenAIConfig()

First, we customize the prompt template with our instruction and few-shot examples.

In [9]:
transform_config.prompt_template.instruction = instruction
transform_config.prompt_template.few_shot_prompt = few_shot_examples

If we want the response format to be JSON, we need to update two aspects of the default config:

1. Change the `model_name` to "gpt-4-1106-preview", which is the only GPT-4 model that supports the JSON format.
1. Change the `response_format` to a `json_object`.
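
The benefit of a JSON response is that it can be parsed directly instead of being scraped out of free text. The sketch below shows this with the standard `json` module; the `raw_response` string is hand-written for illustration (it is not real model output), and `parse_qa_response` is a hypothetical helper.

```python
import json

# Hand-written sample in the same shape as the few-shot examples (an assumption)
raw_response = (
    '{"context": "The quick brown fox jumps over the lazy black dog.", '
    '"qas": [{"Question": "What is the color of the fox?", "Answer": "Brown"}]}'
)


def parse_qa_response(text):
    """Parse a JSON string into (context, [(question, answer), ...])."""
    payload = json.loads(text)
    pairs = [(qa["Question"], qa["Answer"]) for qa in payload["qas"]]
    return payload["context"], pairs


context, pairs = parse_qa_response(raw_response)
print(context)
print(pairs)
```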

In [10]:
transform_config.model_config.model_name = "gpt-4-1106-preview"
transform_config.model_config.response_format = {"type": "json_object"}
transform_config.model_config.num_call = 1
transform_config.model_config.temperature = 0.0

Finally, we update `num_thread` and `batch_size`. You'll want to tune this number to maximize efficiency. Note that the two values must be the same.

In [11]:
num_thread_batch_size = 32
transform_config.model_config.num_thread = num_thread_batch_size
transform_config.model_config.batch_size = num_thread_batch_size
pprint(transform_config)

TransformOpenAIConfig(flow_name='TransformOpenAIFlow',
                      model_config=OpenAIModelConfig(model_name='gpt-4-1106-preview',
                                                     model_server='OpenAIModelServer',
                                                     num_call=1,
                                                     temperature=0.0,
                                                     response_format={'type': 'json_object'},
                                                     num_thread=32,
                                                     batch_size=32),
                      num_thread=1,
                      prompt_template=PromptTemplate(instruction='\n# Instruction\nGenerate 3 question(s) and the corresponding answer(s) based on the context. Following the format of the examples below to include context and qas in the response.\n\n## Note: Use the below two examples just as a reference, do not include in your response.\n', few_shot_prompt=[Context(c

Here are some reference runtimes for `num_thread` / `batch_size` when generating 3 QAs for each paragraph:

| `num_thread` / `batch_size` | runtime [m:s] |
| ----------------------------|---------|
| 2 | 13:34 |
| 4 | 09:15 |
| 16 | 03:09 |
| 32 | 02:21 |
| 64 | 01:42 |
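
To see how quickly the returns diminish, the runtimes in the table above can be converted into speedups relative to the 2-thread run. This is a small arithmetic sketch; the helper names are made up for illustration, and "efficiency" here simply means speedup divided by the increase in thread count.

```python
def to_seconds(mmss):
    """Convert an 'mm:ss' string to a number of seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


# Runtimes copied from the table above (3 QAs per paragraph)
runtimes = {2: "13:34", 4: "09:15", 16: "03:09", 32: "02:21", 64: "01:42"}


def scaling_report(runtimes):
    """Map thread count -> (speedup, efficiency), relative to the smallest thread count."""
    base_threads = min(runtimes)
    base_seconds = to_seconds(runtimes[base_threads])
    report = {}
    for threads, mmss in sorted(runtimes.items()):
        speedup = base_seconds / to_seconds(mmss)
        efficiency = speedup / (threads / base_threads)
        report[threads] = (round(speedup, 2), round(efficiency, 2))
    return report


report = scaling_report(runtimes)
for threads, (speedup, efficiency) in report.items():
    print(f"{threads:>2} threads: {speedup:.2f}x faster, {efficiency:.0%} efficiency")
```

On these numbers, going from 32 to 64 threads saves only 39 seconds, so a mid-range setting may be a reasonable trade-off if you are concerned about API rate limits.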

Here is a reference runtime for 4 QAs per paragraph:

| `num_thread` / `batch_size` | runtime [m:s] |
| ----------------------------|---------|
| 8 | 05:12 |



### Run MultiFlowsPipeline
Let's use the `PipelineConfig` to connect `extract_config` and `transform_config`, and pass that into our `MultiFlowsPipeline` to run the data.

In [12]:
p = MultiFlowsPipeline(PipelineConfig(
    extract_config=extract_config,
    transform_config=transform_config,
))


Now we call the `run` method on the `MultiFlowsPipeline` object to execute the question-answer generation operation on the data shown above.

In [13]:
output = p.run(data)

100%|██████████| 1/1 [00:27<00:00, 27.01s/it]
100%|██████████| 4/4 [02:21<00:00, 35.49s/it]


### Output
Let's take a look at the generated output.

#### Process the output
First, let's make sure there weren't any errors.

In [14]:
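for i, o in enumerate(output):
    # Each element of `output` is a dict whose 'output' list holds the results
    for j, item in enumerate(o['output']):
        # A successful item carries the message 'No errors.'
        if item.get('error', 'No errors.') != 'No errors.':
            print("Error at output[{}]['output'][{}]: {}".format(i, j, item['error']))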


Let's print out the raw output:

In [15]:

pprint(output)

[{'output': [{'error': 'No errors.',
              'response': [{'context': '# An Observational Study of the Effect '
                                       'of Nike Vaporly Shoes on Marathon '
                                       'Performance',
                            'qas': [{'Answer': 'The effect of Nike Vaporly '
                                               'Shoes on Marathon Performance',
                                     'Question': 'What is the focus of the '
                                                 'study?'},
                                    {'Answer': 'An Observational Study',
                                     'Question': 'What type of study is being '
                                                 'conducted?'},
                                    {'Answer': 'The performance of marathon '
                                               'runners wearing Nike Vaporly '
                                               'Shoes',
                            

Next, we need to do a little postprocessing on the raw output.

In [16]:
import re

# Extracting context, question, and answer into a DataFrame
contexts = []
questions = []
answers = []

error_list = []

for item in output:
    o = item['output'][0]['response']

    for resp in o:
        context = resp['context']

        for qa in resp['qas']:
            questions.append(qa['Question'])
            answers.append(qa['Answer'])
            contexts.append(context)

# Set display options
pd.set_option('display.max_colwidth', None)
pd.set_option('display.width', 1000)

df = pd.DataFrame({
    'Context': contexts,
    'Question': questions,
    'Answer': answers
})

# Remove any headers or short contexts
df_filtered = df[df['Context'].apply(lambda x: len(x) > 100 and not x.startswith('#'))]
df_filtered = df_filtered.reset_index(drop=True)

styled_df = df_filtered.style.set_properties(**{'text-align': 'left'}).set_table_styles([{
    'selector': 'th',
    'props': [('text-align', 'left')]
}])
styled_df

Unnamed: 0,Context,Question,Answer
0,"Mount Everest is Earth's highest mountain above sea level, located in the Mahalangur Himal sub-range of the Himalayas. The international border between China (Tibet Autonomous Region) and Nepal runs across its summit point.",What is Mount Everest known for?,Mount Everest is known for being Earth's highest mountain above sea level.
1,"Mount Everest is Earth's highest mountain above sea level, located in the Mahalangur Himal sub-range of the Himalayas. The international border between China (Tibet Autonomous Region) and Nepal runs across its summit point.",Where is Mount Everest located?,Mount Everest is located in the Mahalangur Himal sub-range of the Himalayas.
2,"Mount Everest is Earth's highest mountain above sea level, located in the Mahalangur Himal sub-range of the Himalayas. The international border between China (Tibet Autonomous Region) and Nepal runs across its summit point.",Which countries does the international border on Mount Everest's summit point separate?,The international border on Mount Everest's summit point separates China (Tibet Autonomous Region) and Nepal.
3,"We collected marathon performance data from a systematic sample of elite and sub-elite athletes over the period 2015 to 2019, then searched the internet for publicly-available photographs of these performances, identifying whether the Nike Vaporly shoes were worn or not in each performance. Controlling for athlete ability and race difficulty, we estimated the effect on marathon times of wearing the Vaporfly shoes. Assuming that the effect of Vaporfly shoes is additive, we estimate that the Vaporfly shoes improve men's times between 2.0 and 3.9 minutes, while they improve women's times between 0.8 and 3.5 minutes. Assuming that the effect of Vaporfly shoes is multiplicative, we estimate that they improve men's times between 1.4 and 2.8 percent and women's performances between 0.6 and 2.2 percent. The improvements are in comparison to the shoe the athlete was wearing before switching to Vaporfly shoes, and represents an expected improvement rather than a guaranteed improvement.",What was the method used to collect data on marathon performances?,"A systematic sample of elite and sub-elite athletes' performances from 2015 to 2019 was collected, and publicly-available photographs of these performances were searched to identify whether Nike Vaporfly shoes were worn."
4,"We collected marathon performance data from a systematic sample of elite and sub-elite athletes over the period 2015 to 2019, then searched the internet for publicly-available photographs of these performances, identifying whether the Nike Vaporly shoes were worn or not in each performance. Controlling for athlete ability and race difficulty, we estimated the effect on marathon times of wearing the Vaporfly shoes. Assuming that the effect of Vaporfly shoes is additive, we estimate that the Vaporfly shoes improve men's times between 2.0 and 3.9 minutes, while they improve women's times between 0.8 and 3.5 minutes. Assuming that the effect of Vaporfly shoes is multiplicative, we estimate that they improve men's times between 1.4 and 2.8 percent and women's performances between 0.6 and 2.2 percent. The improvements are in comparison to the shoe the athlete was wearing before switching to Vaporfly shoes, and represents an expected improvement rather than a guaranteed improvement.",What is the estimated improvement in marathon times for men wearing Vaporfly shoes?,"The estimated improvement in marathon times for men wearing Vaporfly shoes is between 2.0 and 3.9 minutes if the effect is additive, and between 1.4 and 2.8 percent if the effect is multiplicative."
5,"We collected marathon performance data from a systematic sample of elite and sub-elite athletes over the period 2015 to 2019, then searched the internet for publicly-available photographs of these performances, identifying whether the Nike Vaporly shoes were worn or not in each performance. Controlling for athlete ability and race difficulty, we estimated the effect on marathon times of wearing the Vaporfly shoes. Assuming that the effect of Vaporfly shoes is additive, we estimate that the Vaporfly shoes improve men's times between 2.0 and 3.9 minutes, while they improve women's times between 0.8 and 3.5 minutes. Assuming that the effect of Vaporfly shoes is multiplicative, we estimate that they improve men's times between 1.4 and 2.8 percent and women's performances between 0.6 and 2.2 percent. The improvements are in comparison to the shoe the athlete was wearing before switching to Vaporfly shoes, and represents an expected improvement rather than a guaranteed improvement.",How do Vaporfly shoes affect women's marathon performances?,"Vaporfly shoes are estimated to improve women's marathon times between 0.8 and 3.5 minutes if the effect is additive, and between 0.6 and 2.2 percent if the effect is multiplicative."
6,"Mount Everest is Earth's highest mountain above sea level, located in the Mahalangur Himal sub-range of the Himalayas. The international border between China (Tibet Autonomous Region) and Nepal runs across its summit point.",What is Mount Everest?,Earth's highest mountain above sea level
7,"Mount Everest is Earth's highest mountain above sea level, located in the Mahalangur Himal sub-range of the Himalayas. The international border between China (Tibet Autonomous Region) and Nepal runs across its summit point.",Where is Mount Everest located?,In the Mahalangur Himal sub-range of the Himalayas
8,"Mount Everest is Earth's highest mountain above sea level, located in the Mahalangur Himal sub-range of the Himalayas. The international border between China (Tibet Autonomous Region) and Nepal runs across its summit point.",What runs across the summit point of Mount Everest?,The international border between China (Tibet Autonomous Region) and Nepal
9,"There is a growing consensus that Nike Corporation's new line of marathon racing shoes, which are commonly referred to as Vaporflys, provide a significant performance advantage to athletes who wear them. While several different versions of the shoes have appeared in races, including the Vaporfly 4%, the Vaporfly Next%, the Alphafly, and several prototype shoes, each iteration of the shoes has in common a carbon fiber plate stacked inside of a highly responsive foam sole.",What is the common feature of all the Nike marathon racing shoes mentioned?,A carbon fiber plate stacked inside of a highly responsive foam sole


Finally, we can save the `uniflow` output to a `.csv` file.

In [17]:
output_df = df[['Question', 'Answer']]

output_dir = 'data/output'

uniflow_output_path = f"{output_dir}/Nike_Research_QApairs.csv"

os.makedirs(output_dir, exist_ok=True)

output_df.to_csv(uniflow_output_path, index=False)
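
Because the fine-tuning step reads this file, it may be worth checking that it has the expected two columns and no empty cells. The following is a sketch using only the standard `csv` module; `check_qa_csv` is a hypothetical helper, and the demo writes a throwaway file so the snippet runs on its own.

```python
import csv
import os
import tempfile


def check_qa_csv(path):
    """Return the number of QA rows, raising ValueError if the file looks malformed."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        if reader.fieldnames != ["Question", "Answer"]:
            raise ValueError(f"Unexpected columns: {reader.fieldnames}")
        rows = list(reader)
    for i, row in enumerate(rows):
        question = (row.get("Question") or "").strip()
        answer = (row.get("Answer") or "").strip()
        if not question or not answer:
            raise ValueError(f"Empty question or answer in row {i}")
    return len(rows)


# Demo on a temporary file with one illustrative row
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_qa.csv")
    with open(demo_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Question", "Answer"])
        writer.writerow(["What is the color of the fox?", "Brown"])
    demo_count = check_qa_csv(demo_path)

print("Rows in demo file:", demo_count)
```

In this notebook you could call `check_qa_csv(uniflow_output_path)` after saving.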

#### Release GPU Memory
We'll need to use our GPU for future steps, so let's release the memory.

In [18]:
import torch
if torch.cuda.is_available():
    torch.cuda.empty_cache()
    print("GPU memory has been released.")
else:
    print("No GPU devices found.")

GPU memory has been released.
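
Note that `torch.cuda.empty_cache()` only returns cached memory that is no longer occupied by live tensors. If an object such as the pipeline `p` still holds model weights on the GPU (an assumption about `uniflow` internals), you may need to drop it with `del p` first. A slightly more defensive sketch, with `release_gpu_memory` as a hypothetical helper name:

```python
import gc


def release_gpu_memory():
    """Collect garbage, then ask PyTorch to return unused cached GPU memory.

    Returns True if a CUDA device was found and the cache was cleared.
    """
    gc.collect()  # free unreachable Python objects that may still hold tensors
    try:
        import torch
    except ImportError:
        return False  # PyTorch is not installed, so there is nothing to release
    if not torch.cuda.is_available():
        return False
    torch.cuda.empty_cache()
    return True


if release_gpu_memory():
    print("GPU memory has been released.")
else:
    print("No GPU devices found.")
```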


## 2. Running `pykoi` `SupervisedFinetuning` on the QA pairs

### Install helper packages
If you already have these installed, feel free to skip this step.

In [None]:
!{sys.executable} -m pip install -q peft

### Import Dependencies

In [None]:
from pykoi.rlhf import RLHFConfig
from pykoi.rlhf import SupervisedFinetuning
from peft import LoraConfig, TaskType

### Set the parameters

In [None]:
base_model_path = "meta-llama/Llama-2-7b-chat-hf"
dataset_name = uniflow_output_path
peft_model_path = "./models/rlhf_step1_sft"
dataset_type = "local_csv"
learning_rate = 1e-3
weight_decay = 0.0
max_steps = 1600
per_device_train_batch_size = 1
per_device_eval_batch_size = 4
log_freq = 20
eval_freq = 2000
save_freq = 200
train_test_split_ratio = 0.0001
dataset_subset_sft_train = 999999999
size_valid_set = 0
device_map = "auto"

r = 8
lora_alpha = 16
lora_dropout = 0.05
bias = "none"
task_type = TaskType.CAUSAL_LM
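
To get a feel for what `r` and `lora_alpha` control, here is a small arithmetic sketch. In standard LoRA, a weight matrix of shape `d_out x d_in` is adapted by two low-rank factors with `r * (d_in + d_out)` trainable parameters, and the update is scaled by `lora_alpha / r`. The 4096 x 4096 matrix below matches the hidden size of Llama-2-7b and is used purely as an illustration; which layers are actually adapted depends on the library defaults.

```python
def lora_scaling(lora_alpha, r):
    """Scaling factor applied to the low-rank update in standard LoRA."""
    return lora_alpha / r


def lora_param_count(d_in, d_out, r):
    """Trainable parameters for one adapted matrix: A is (r, d_in), B is (d_out, r)."""
    return r * (d_in + d_out)


r, lora_alpha = 8, 16
hidden = 4096  # illustrative: hidden size of Llama-2-7b

full_params = hidden * hidden
adapter_params = lora_param_count(hidden, hidden, r)

print(f"Scaling factor: {lora_scaling(lora_alpha, r)}")
print(f"Full matrix: {full_params:,} parameters")
print(f"LoRA adapter: {adapter_params:,} parameters "
      f"({adapter_params / full_params:.2%} of the full matrix)")
```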

Set the parameters in the `LoraConfig` and `RLHFConfig`.

In [None]:
lora_config = LoraConfig(
    r=r,
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    bias=bias,
    task_type=task_type,
    )


# run supervised finetuning
config = RLHFConfig(
    base_model_path=base_model_path,
    dataset_type=dataset_type,
    dataset_name=dataset_name,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    max_steps=max_steps,
    per_device_train_batch_size=per_device_train_batch_size,
    per_device_eval_batch_size=per_device_eval_batch_size,
    log_freq=log_freq,
    eval_freq=eval_freq,
    save_freq=save_freq,
    train_test_split_ratio=train_test_split_ratio,
    dataset_subset_sft_train=dataset_subset_sft_train,
    size_valid_set=size_valid_set,
    lora_config_rl=lora_config,
    device_map=device_map,
    )

### Run the SupervisedFinetuning

In [None]:
rlhf_step1_sft = SupervisedFinetuning(config)
rlhf_step1_sft.train_and_save(peft_model_path)

#### Release GPU Memory
We'll need to use our GPU for future steps, so let's release the memory.

In [None]:
import torch
if torch.cuda.is_available():
    torch.cuda.empty_cache()
    print("GPU memory has been released.")
else:
    print("No GPU devices found.")

## 3. Running a `pykoi` `Chatbot` on the fine-tuned model

### Import pykoi components

In [None]:
from pykoi.application import Application
from pykoi.chat import ModelFactory
from pykoi.chat import QuestionAnswerDatabase
from pykoi.component import Chatbot, Dashboard

### Create the Model

Here, we create a model from `meta-llama/Llama-2-7b-chat-hf` and the fine-tuned model we created above.

In [None]:
model = ModelFactory.create_model(
    model_source="peft_huggingface",
    base_model_path="meta-llama/Llama-2-7b-chat-hf",
    lora_model_path=peft_model_path,  # the path the SFT step saved to above
)

### Create the Chatbot with the model
Next, we create the database, chatbot, and dashboard components via `pykoi`.

In [None]:
database = QuestionAnswerDatabase(debug=True)
chatbot = Chatbot(model=model, feedback="vote")
dashboard = Dashboard(database=database)

### Run the Chatbot app!

#### Add `nest_asyncio` 
Apply `nest_asyncio` to avoid errors such as `asyncio.run() cannot be called from a running event loop`. A Jupyter notebook already runs an asyncio event loop, and `uvicorn.run()` uses `asyncio.run()`, which is not compatible with a running event loop, so launching the app from the notebook would otherwise fail.

In [None]:
# !pip install -q nest_asyncio
import nest_asyncio
nest_asyncio.apply()
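
If you want to see the underlying error for yourself, the sketch below reproduces it with the standard library alone. It is meant to be run as a plain Python script outside Jupyter (inside a notebook, the outermost `asyncio.run()` call would itself fail for the same reason), and without `nest_asyncio` applied.

```python
import asyncio


async def inner():
    return "done"


async def outer():
    # We are now inside a running event loop, as in a Jupyter notebook.
    coro = inner()
    try:
        asyncio.run(coro)  # nested call: raises RuntimeError
    except RuntimeError as exc:
        coro.close()  # tidy up the coroutine that was never started
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```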

Now we can run our app!

In [None]:
app = Application(debug=False, share=False)
app.add_component(chatbot)
app.add_component(dashboard)
app.run()

Congrats! You've just built your own research paper chatbot!

## End of the notebook

Check out more use cases in the [example folder](../../examples/)!

<a href="https://www.cambioml.com/" title="Title">
    <img src="../image/cambioml_logo_large.png" style="height: 100px; display: block; margin-left: auto; margin-right: auto;"/>
</a>