# Example of loading a PDF using Nougat
Source: https://arxiv.org/abs/1408.5882

### Before running the code

You will need the `uniflow` conda environment to run this notebook. You can set it up by following the installation instructions: https://github.com/CambioML/uniflow/tree/main#installation.

### Load packages

In [1]:
%reload_ext autoreload
%autoreload 2

import sys

sys.path.append(".")
sys.path.append("..")
sys.path.append("../..")

In [2]:
import os
import pandas as pd
from uniflow.flow.client import ExtractClient, TransformClient
from uniflow.flow.config import TransformOpenAIConfig, ExtractPDFConfig
from uniflow.op.model.model_config import OpenAIModelConfig, NougatModelConfig
from uniflow.op.prompt import PromptTemplate, Context
from uniflow.op.extract.split.splitter_factory import SplitterOpsFactory
from uniflow.op.extract.split.constants import PARAGRAPH_SPLITTER


### Prepare the input data

First, let's set the current directory and the input data directory, then build the path to the raw PDF.

In [3]:
dir_cur = os.getcwd()
pdf_file = "1408.5882_page-1.pdf"
input_file = os.path.join(dir_cur, "data", "raw_input", pdf_file)
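
Before starting the extraction, it can help to confirm that the PDF is actually where the path says it is. A minimal sketch, assuming the `data/raw_input` layout used above; `resolve_input_file` is a hypothetical helper and not part of uniflow:

```python
import os


def resolve_input_file(base_dir, filename):
    """Return the path to filename under <base_dir>/data/raw_input.

    Hypothetical helper (not part of uniflow); raises if the file is missing.
    """
    path = os.path.join(base_dir, "data", "raw_input", filename)
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Input PDF not found: {path}")
    return path
```

It could be called as `resolve_input_file(os.getcwd(), "1408.5882_page-1.pdf")` in place of the plain `os.path.join` call.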

### List all the available splitters
These are the different splitters we can use to post-process the loaded PDF.

In [4]:
SplitterOpsFactory.list()

['ParagraphSplitter', 'MarkdownHeaderSplitter', 'RecursiveCharacterSplitter']
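
To give a feel for what a paragraph-level splitter produces, here is a minimal sketch that splits Markdown text on blank lines. It is an illustrative stand-in, not uniflow's `ParagraphSplitter` implementation, and `split_paragraphs` is a hypothetical name:

```python
import re


def split_paragraphs(text):
    """Split text on one or more blank lines and drop empty chunks."""
    chunks = re.split(r"\n\s*\n", text)
    return [chunk.strip() for chunk in chunks if chunk.strip()]


# Small made-up sample: a heading followed by two paragraphs
sample = "# Title\n\nFirst paragraph.\n\n\nSecond paragraph."
paragraphs = split_paragraphs(sample)
print(paragraphs)
```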

### Load the PDF using Nougat

In [5]:
data = [
    {"filename": input_file},
]

config = ExtractPDFConfig(
    model_config=NougatModelConfig(
        model_name="facebook/nougat-small",
        batch_size=2,
    ),
    splitter=PARAGRAPH_SPLITTER,
)
nougat_client = ExtractClient(config)

output = nougat_client.run(data)


  0%|          | 0/1 [00:00<?, ?it/s]

In [6]:
contexts = output[0]['output'][0]['text']
output

[{'output': [{'text': ['# Convolutional Neural Networks for Sentence Classification',
     ' Yoon Kim',
     'New York University',
     'yhk255@nyu.edu',
     '###### Abstract',
     'We report on a series of experiments with convolutional neural networks (CNN) trained on top of pre-trained word vectors for sentence-level classification tasks. We show that a simple CNN with little hyperparameter tuning and static vectors achieves excellent results on multiple benchmarks. Learning task-specific vectors through fine-tuning offers further gains in performance. We additionally propose a simple modification to the architecture to allow for the use of both task-specific and static vectors. The CNN models discussed herein improve upon the state of the art on 4 out of 7 tasks, which include sentiment analysis and question classification.',
     '## 1 Introduction',
     'Deep learning models have achieved remarkable results in computer vision [11] and speech recognition [1] in recent years. W
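
The nested indexing `output[0]['output'][0]['text']` assumes exactly one input file and one result entry. A minimal sketch, assuming the result shape printed above (a list of dicts, each with an `'output'` list whose entries carry a `'text'` list), that collects the paragraphs from every entry; `collect_text` is a hypothetical helper:

```python
def collect_text(results):
    """Gather all 'text' chunks from an ExtractClient-style result list."""
    chunks = []
    for item in results:
        for entry in item.get("output", []):
            chunks.extend(entry.get("text", []))
    return chunks


# Mock result mirroring the structure printed above (not real extractor output)
mock_output = [{"output": [{"text": ["# Title", "Yoon Kim"]}]}]
print(collect_text(mock_output))
```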

Now we need to write a short prompt to generate a question and an answer for a given paragraph. Each prompt includes an instruction and a list of examples, each with a "context", "question", and "answer".

In [8]:
guided_prompt = PromptTemplate(
    instruction="""Generate one question and its corresponding answer based on the last context in the last
    example. Follow the format of the examples below to include context, question, and answer in the response""",
    few_shot_prompt=[Context(
        context="In 1948, Claude E. Shannon published A Mathematical Theory of\nCommunication (Shannon, 1948) establishing the theory of\ninformation. In his article, Shannon introduced the concept of\ninformation entropy for the first time. We will begin our journey here.",
        question="Who published A Mathematical Theory of Communication in 1948?",
        answer="Claude E. Shannon.",
    )]
)
input_data = [
        Context(
            context=p[:1000],
            question="",
            answer="",
        )
        for p in contexts
]
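
Short fragments such as an author name or an affiliation carry too little information for a meaningful question. A minimal sketch that filters paragraphs before they are wrapped in `Context` objects; `select_paragraphs` and its `min_chars` threshold are hypothetical, and the value 40 is illustrative rather than tuned:

```python
def select_paragraphs(paragraphs, min_chars=40, max_chars=1000):
    """Drop fragments shorter than min_chars and truncate the rest to max_chars."""
    selected = []
    for p in paragraphs:
        text = p.strip()
        if len(text) < min_chars:
            continue
        selected.append(text[:max_chars])
    return selected


# Shortened sample paragraphs for illustration
sample = [
    "Yoon Kim",
    "New York University",
    "We report on a series of experiments with convolutional neural networks.",
]
print(select_paragraphs(sample))
```

The list comprehension above could then iterate over `select_paragraphs(contexts)` instead of `contexts`.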


### Run the model

In this example, we will use the [OpenAIModelServer](https://github.com/CambioML/uniflow/blob/main/uniflow/model/server.py#L108) as the LLM to generate questions and answers. Let's set up the config and client for this model.

In [9]:
config = TransformOpenAIConfig(
    prompt_template=guided_prompt,
    model_config=OpenAIModelConfig(
        response_format={"type": "json_object"}
    ),
)
transform_client = TransformClient(config)

Now we call the `run` method on the `transform_client` object to generate question-answer pairs for the data prepared above.

Note that the LLM sometimes fails to return JSON output; in that case uniflow handles the failure and automatically retries the generation.
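
The retry behaviour can be pictured as a loop that requests a new completion until one parses as JSON. The sketch below illustrates that general pattern only; it is not uniflow's internal implementation, and `generate_json_with_retry`, `generate`, and `max_retries` are hypothetical names:

```python
import json


def generate_json_with_retry(generate, max_retries=3):
    """Call generate() until its result parses as a JSON object.

    generate is any zero-argument callable returning a string.
    """
    last_error = None
    for _ in range(max_retries):
        raw = generate()
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError as err:
            last_error = err
            continue
        if isinstance(parsed, dict):
            return parsed
        last_error = ValueError("response was valid JSON but not an object")
    raise RuntimeError(f"no JSON object after {max_retries} attempts") from last_error


# Fake model for illustration: the first reply is not JSON, the second is
replies = iter(["Sure! Here is the answer", '{"question": "Q?", "answer": "A."}'])
result = generate_json_with_retry(lambda: next(replies))
print(result)
```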

In [10]:
output = transform_client.run(input_data)

100%|██████████| 14/14 [00:27<00:00,  1.95s/it]


### Process the output

Let's take a look at the generated output. We need to do a little postprocessing on the raw output.

In [11]:
# Extracting context, question, and answer into a DataFrame
contexts = []
questions = []
answers = []

for item in output:
    for i in item.get('output', []):
        for response in i.get('response', []):
            if any(key not in response for key in ['context', 'question', 'answer']):
                continue
            contexts.append(response['context'])
            questions.append(response['question'])
            answers.append(response['answer'])

df = pd.DataFrame({
    'Context': contexts,
    'Question': questions,
    'Answer': answers
})

In [12]:
# Set display options
pd.set_option('display.max_colwidth', None)  # or use a specific width like 50
pd.set_option('display.width', 1000)

df

,Context,Question,Answer
0,Convolutional Neural Networks for Sentence Classification,What is the focus of the article?,The focus of the article is Convolutional Neural Networks for Sentence Classification.
1,Yoon Kim,Who is Yoon Kim?,We need additional context to provide an accurate answer.
2,New York University,What is the name of the university?,New York University
3,"In 1948, Claude E. Shannon published A Mathematical Theory of Communication (Shannon, 1948) establishing the theory of information. In his article, Shannon introduced the concept of information entropy for the first time. We will begin our journey here.",What concept did Claude E. Shannon introduce for the first time in his article A Mathematical Theory of Communication?,Claude E. Shannon introduced the concept of information entropy for the first time.
4,"In 1948, Claude E. Shannon published A Mathematical Theory of\nCommunication (Shannon, 1948) establishing the theory of\ninformation. In his article, Shannon introduced the concept of\ninformation entropy for the first time. We will begin our journey here.",What concept did Shannon introduce for the first time in his article A Mathematical Theory of Communication?,Shannon introduced the concept of information entropy for the first time.
5,"We report on a series of experiments with convolutional neural networks (CNN) trained on top of pre-trained word vectors for sentence-level classification tasks. We show that a simple CNN with little hyperparameter tuning and static vectors achieves excellent results on multiple benchmarks. Learning task-specific vectors through fine-tuning offers further gains in performance. We additionally propose a simple modification to the architecture to allow for the use of both task-specific and static vectors. The CNN models discussed herein improve upon the state of the art on 4 out of 7 tasks, which include sentiment analysis and question classification.",What type of neural network was used in the experiments?,Convolutional neural networks (CNN).
6,"In 1948, Claude E. Shannon published A Mathematical Theory of\nCommunication (Shannon, 1948) establishing the theory of\ninformation. In his article, Shannon introduced the concept of\ninformation entropy for the first time. We will begin our journey here.",What concept did Shannon introduce for the first time in his article A Mathematical Theory of Communication?,Shannon introduced the concept of information entropy for the first time.
7,"Deep learning models have achieved remarkable results in computer vision [11] and speech recognition [1] in recent years. Within natural language processing, much of the work with deep learning methods has involved learning word vector representations through neural language models [1, 1, 2] and performing composition over the learned word vectors for classification [1]. Word vectors, wherein words are projected from a sparse, 1-of-\(V\) encoding (here \(V\) is the vocabulary size) onto a lower dimensional vector space via a hidden layer, are essentially feature extractors that encode semantic features of words in their dimensions. In such dense representations, semantically close words are likewise close--in euclidean or cosine distance--in the lower dimensional vector space.",What are word vectors used for in natural language processing?,Word vectors are used for learning word vector representations through neural language models and performing composition over the learned word vectors for classification.
8,"Convolutional neural networks (CNN) utilize layers with convolving filters that are applied to local features [1]. Originally invented for computer vision, CNN models have subsequently been shown to be effective for NLP and have achieved excellent results in semantic parsing [13], search query retrieval [2], sentence modeling [1], and other traditional NLP tasks [1].",What are some tasks for which CNN models have been shown to be effective?,"CNN models have been shown to be effective for semantic parsing, search query retrieval, sentence modeling, and other traditional NLP tasks."
9,"In the present work, we train a simple CNN with one layer of convolution on top of word vectors obtained from an unsupervised neural language model. These vectors were trained by Mikolov et al. (2013) on 100 billion words of Google News, and are publicly available.1 We initially keep the word vectors static and learn only the other parameters of the model. Despite little tuning of hyperparameters, this simple model achieves excellent results on multiple benchmarks, suggesting that the pre-trained vectors are 'universal' feature extractors that can be utilized for various classification tasks. Learning task-specific vectors through fine-tuning results in further improvements. We finally describe a simple modification to the architecture to allow for the use of both pre-trained and task-specific vectors by having multiple channels.",Who trained the word vectors on 100 billion words of Google News?,Mikolov et al. (2013)
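
Rows 3, 4, and 6 above repeat the few-shot Shannon example instead of a paragraph from the paper. A minimal sketch, assuming such rows can be recognized by comparing whitespace-normalized contexts against the few-shot context, that filters them out along with exact duplicates; `normalize` and `drop_few_shot_echoes` are hypothetical helpers, and the sample rows are shortened for illustration:

```python
def normalize(text):
    """Collapse all runs of whitespace so line-wrapped copies compare equal."""
    return " ".join(text.split())


def drop_few_shot_echoes(rows, few_shot_contexts):
    """Remove rows whose context matches a few-shot context, plus duplicates."""
    blocked = {normalize(c) for c in few_shot_contexts}
    seen = set()
    kept = []
    for row in rows:
        key = (normalize(row["context"]), normalize(row["question"]))
        if key[0] in blocked or key in seen:
            continue
        seen.add(key)
        kept.append(row)
    return kept


# Shortened, made-up sample data
few_shot = ["In 1948, Claude E. Shannon published A Mathematical Theory of\nCommunication."]
rows = [
    {"context": "Yoon Kim", "question": "Who is Yoon Kim?", "answer": "Unknown."},
    {
        "context": "In 1948, Claude E. Shannon published A Mathematical Theory of Communication.",
        "question": "Who published it?",
        "answer": "Claude E. Shannon.",
    },
]
print(drop_few_shot_echoes(rows, few_shot))
```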


Finally, we can save the generated question-answer pairs to a `.csv` file.

In [13]:
import os

# Directory that will hold the output file
directory = 'data/output'

# Create the directory (and any intermediate directories) if it does not exist yet
os.makedirs(directory, exist_ok=True)

In [14]:
output_df = df[['Question', 'Answer']]
output_df.to_csv("data/output/CNN_pdf_QApairs.csv", index=False)

## End of the notebook

Check more Uniflow use cases in the [example folder](https://github.com/CambioML/uniflow/tree/main/example/model#examples)!

<a href="https://www.cambioml.com/" title="Title">
    <img src="../image/cambioml_logo_large.png" style="height: 100px; display: block; margin-left: auto; margin-right: auto;"/>
</a>