# Example of loading PDF using Nougat
Source: https://arxiv.org/abs/1408.5882

### Before running the code

You will need the `uniflow` conda environment to run this notebook. You can set it up by following the installation instructions: https://github.com/CambioML/uniflow/tree/main#installation. In addition, make sure the following package is installed:

In [None]:
# pip3 install nougat-ocr

### Load packages

In [None]:
%reload_ext autoreload
%autoreload 2

import sys

sys.path.append(".")
sys.path.append("..")
sys.path.append("../..")

In [None]:
import os
import pandas as pd
from uniflow.client import Client
from uniflow.config import Config
from uniflow.model.config import OpenAIModelConfig
from uniflow.config import NougatConfig

### Prepare the input data

First, let's get the current directory and build the path to the input PDF.

In [None]:
dir_cur = os.getcwd()
pdf_file = "1408.5882_page-1.pdf"
input_file = os.path.join(dir_cur, "data", "raw_input", pdf_file)
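
A wrong path is easier to diagnose before any model is loaded, so it can help to check that the PDF exists first. The following is a minimal sketch: `resolve_input_file` is a hypothetical helper (not part of uniflow), and it assumes the `data/raw_input` layout used in the cell above.

```python
import os


def resolve_input_file(base_dir, filename):
    """Build the path to a raw input file and check that it exists.

    Hypothetical helper (not part of uniflow); assumes inputs live under
    <base_dir>/data/raw_input.
    """
    path = os.path.join(base_dir, "data", "raw_input", filename)
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Input PDF not found: {path}")
    return path
```

With this helper, `input_file = resolve_input_file(os.getcwd(), pdf_file)` would raise immediately if the file is missing.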

### Load the PDF using Nougat

In [None]:
data = [
    {"pdf": input_file},
]

config = NougatConfig()
nougat_client = Client(config)

output = nougat_client.run(data)


In [None]:
p = output[0]['output'][0]['response'][0]
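
The chained indexing above raises an exception if any level of the result is empty. A more defensive variant might look like the sketch below. It assumes only the nested layout indexed above (`output[0]['output'][0]['response'][0]`); `first_response` is a hypothetical helper, and the mock data is for illustration, not real Nougat output.

```python
def first_response(output):
    """Return the first response of the first output item, or None.

    Assumes the nested layout used in this notebook:
    output[0]["output"][0]["response"][0]
    """
    try:
        return output[0]["output"][0]["response"][0]
    except (IndexError, KeyError, TypeError):
        return None


# Mock data for illustration only; not real Nougat output.
mock_output = [{"output": [{"response": ["# Title\n\nSome text."]}]}]
print(first_response(mock_output))
print(first_response([]))
```

Returning `None` lets the caller decide whether an empty result should stop the notebook or be skipped.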

Now we need to write a prompt to generate a question and answer for a given paragraph. Each prompt entry includes an instruction and a list of examples, each with a "context", a "question", and an "answer".

In [None]:
data = [{
    "instruction": """Generate one question and its corresponding answer based on the context. Following the format of the examples below to include the same context, question, and answer in the response.""",
    "examples": [
        {
            "context": """In 1948, Claude E. Shannon published A Mathematical Theory of\nCommunication (Shannon, 1948) establishing the theory of\ninformation. In his article, Shannon introduced the concept of\ninformation entropy for the first time. We will begin our journey here.""",
            "question": """Who published A Mathematical Theory of Communication in 1948?""",
            "answer": """Claude E. Shannon."""
        },
        {
            "context": p[:1000],
            "question": """""",
            "answer": """""",
        }
    ],
}]
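
The cell above uses only the first 1000 characters of the page. To cover the rest of the text, one option is to split it into chunks and build one prompt entry per chunk, as in the sketch below. Both `chunk_text` and `build_prompt_data` are hypothetical helpers, and the sketch assumes the client accepts several entries in the `data` list, each with the same structure as the single entry above.

```python
def chunk_text(text, size=1000):
    """Split text into consecutive chunks of at most `size` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [text[i:i + size] for i in range(0, len(text), size)]


def build_prompt_data(instruction, example, contexts):
    """Build one prompt entry per context, reusing the same worked example."""
    return [
        {
            "instruction": instruction,
            "examples": [
                example,
                # Empty question/answer slots for the model to fill in.
                {"context": context, "question": "", "answer": ""},
            ],
        }
        for context in contexts
    ]
```

Fixed-size character chunks can cut sentences in half; splitting on paragraph boundaries would be a reasonable alternative.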


### Run the model

In this example, we will use the [OpenAIModelServer](https://github.com/CambioML/uniflow/blob/main/uniflow/model/server.py#L108) as the LLM to generate questions and answers. Let's create the config and client for this model.

In [None]:
config = Config(model_config=OpenAIModelConfig())
client = Client(config)

Now we call the `run` method on the `client` object to execute the question-answer generation operation on the data shown above.

Note that the LLM sometimes fails to return JSON output. In that case, uniflow handles the failure and automatically retries to generate a new output.
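
To illustrate the general idea of retrying on malformed output, here is a minimal sketch of the pattern. It is not uniflow's actual implementation: `generate` is a hypothetical zero-argument callable that stands in for an LLM call, and `max_retries` is an assumed parameter.

```python
import json


def generate_json_with_retry(generate, max_retries=3):
    """Call `generate` until it returns parseable JSON or retries run out.

    `generate` is a hypothetical zero-argument callable returning a string;
    it stands in for an LLM call.
    """
    last_error = None
    for _ in range(max_retries):
        raw = generate()
        try:
            return json.loads(raw)
        except json.JSONDecodeError as err:
            last_error = err  # malformed output: try again
    raise ValueError(f"No valid JSON after {max_retries} attempts") from last_error
```

Bounding the number of attempts keeps a persistently malformed response from looping forever.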

In [None]:
output = client.run(data)

### Process the output

Let's take a look at the generated output. The raw output needs a little postprocessing first.

In [None]:
# Extracting context, question, and answer into a DataFrame
contexts = []
questions = []
answers = []

for item in output:
    for i in item.get('output', []):
        for response in i.get('response', []):
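            # Skip responses that are not dicts (e.g. raw text) or that lack a required key
            if not isinstance(response, dict) or any(key not in response for key in ['context', 'question', 'answer']):
                continue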
            contexts.append(response['context'])
            questions.append(response['question'])
            answers.append(response['answer'])

df = pd.DataFrame({
    'Context': contexts,
    'Question': questions,
    'Answer': answers
})
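
The loop above skips responses that lack a key, but it still keeps entries whose question or answer is an empty string. A stricter variant is sketched below. `extract_qa_rows` is a hypothetical helper, and the mock data is for illustration only; it assumes the same nested `output` / `response` layout as the loop above.

```python
def extract_qa_rows(output, required=("context", "question", "answer")):
    """Collect responses that have a non-blank value for every required key."""
    rows = []
    for item in output:
        for out in item.get("output", []):
            for response in out.get("response", []):
                if not isinstance(response, dict):
                    continue  # e.g. raw text that was never parsed
                if any(not str(response.get(key, "")).strip() for key in required):
                    continue  # missing or blank field
                rows.append({key: response[key] for key in required})
    return rows


# Mock data for illustration only; not real model output.
mock_output = [{"output": [{"response": [
    {"context": "c1", "question": "q1", "answer": "a1"},
    {"context": "c2", "question": "", "answer": ""},
    "unparsed text",
]}]}]
rows = extract_qa_rows(mock_output)
```

The resulting list of dicts can be passed directly to `pd.DataFrame(rows)`.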

In [None]:
# Set display options
pd.set_option('display.max_colwidth', None)  # or use a specific width like 50
pd.set_option('display.width', 1000)

df

Finally, we can save the generated question-answer pairs to a `.csv` file.

In [None]:
# Directory path you want to ensure exists
directory = 'data/output'

# Create the directory (and any intermediate directories) if it does not exist
os.makedirs(directory, exist_ok=True)

In [None]:
output_df = df[['Question', 'Answer']]
output_df.to_csv("data/output/Nike_10k_QApairs.csv", index=False)