# Example of loading a PDF using Nougat without splitting
Source: https://arxiv.org/abs/1408.5882

### Before running the code

You will need the `uniflow` conda environment to run this notebook. You can set it up by following the installation instructions: https://github.com/CambioML/uniflow/tree/main#installation. Also make sure the following package is installed:

In [1]:
# pip3 install nougat-ocr

### Load packages

In [2]:
%reload_ext autoreload
%autoreload 2

import sys

sys.path.append(".")
sys.path.append("..")
sys.path.append("../..")

In [3]:
import os
from uniflow.flow.client import ExtractClient
from uniflow.flow.config import ExtractPDFConfig, NougatModelConfig




### Prepare the input data

First, let's set the current directory and the input data directory, and build the path to the raw PDF file.

In [4]:
dir_cur = os.getcwd()
pdf_file = "1408.5882_page-1.pdf"
input_file = os.path.join(dir_cur, "data", "raw_input", pdf_file)
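
Because the extraction step further down can take several minutes, it may be worth confirming that the path points at a real PDF before starting it. The following is a minimal sketch; `require_pdf` is a hypothetical helper written for this notebook, not part of `uniflow` or Nougat.

```python
from pathlib import Path


def require_pdf(path: str) -> Path:
    """Return `path` as a Path, or raise if it is not an existing .pdf file.

    Hypothetical helper for illustration; not a uniflow API.
    """
    pdf_path = Path(path)
    # Reject obviously wrong inputs early, based on the file extension.
    if pdf_path.suffix.lower() != ".pdf":
        raise ValueError(f"Expected a .pdf file, got: {pdf_path.name}")
    # Fail fast if the file is missing, instead of during extraction.
    if not pdf_path.is_file():
        raise FileNotFoundError(f"Input PDF not found: {pdf_path}")
    return pdf_path
```

In this notebook it would be called as `require_pdf(input_file)` right after the path is built.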

### Load the PDF using Nougat
Note that we do not pass a `splitter` to `ExtractPDFConfig`. As a result, the entire PDF is loaded as a single string and is not split into sections.

In [5]:
data = [
    {"filename": input_file},
]

config = ExtractPDFConfig(
    model_config=NougatModelConfig(
        model_name="0.1.0-small",
        # When batch_size>1, nougat will run on CUDA, otherwise it will run on CPU
        batch_size=1,
    ),
)
nougat_client = ExtractClient(config)

output = nougat_client.run(data)


100%|██████████| 1/1 [03:51<00:00, 231.45s/it]


### Process the output
Without a splitter, the extraction returns the whole document as a single `str` under the `text` key.

In [6]:
context = output[0]['output'][0]['text']
context

"# Convolutional Neural Networks for Sentence Classification\n\n Yoon Kim\n\nNew York University\n\nyhk255@nyu.edu\n\n###### Abstract\n\nWe report on a series of experiments with convolutional neural networks (CNN) trained on top of pre-trained word vectors for sentence-level classification tasks. We show that a simple CNN with little hyperparameter tuning and static vectors achieves excellent results on multiple benchmarks. Learning task-specific vectors through fine-tuning offers further gains in performance. We additionally propose a simple modification to the architecture to allow for the use of both task-specific and static vectors. The CNN models discussed herein improve upon the state of the art on 4 out of 7 tasks, which include sentiment analysis and question classification.\n\n## 1 Introduction\n\nDeep learning models have achieved remarkable results in computer vision [11] and speech recognition [1] in recent years. Within natural language processing, much of the work with d
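
Since the result is one Markdown-style string, a quick way to get an overview of it is to list its heading lines. The sketch below is an illustration only: `list_headings` is a hypothetical helper, and `sample` is a shortened stand-in for the real output, assuming headings are marked with leading `#` characters as in the text shown above.

```python
import re

# Shortened stand-in for the Nougat output above (same Markdown-like style).
sample = (
    "# Convolutional Neural Networks for Sentence Classification\n\n"
    "Yoon Kim\n\n"
    "###### Abstract\n\n"
    "We report on a series of experiments ...\n\n"
    "## 1 Introduction\n\n"
    "Deep learning models have achieved remarkable results ...\n"
)

# One to six '#' characters at the start of a line, then the heading title.
HEADING_RE = re.compile(r"^(#{1,6})[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def list_headings(markdown_text: str) -> list[tuple[int, str]]:
    """Return (level, title) for every '#'-style heading line in the text."""
    return [(len(m.group(1)), m.group(2)) for m in HEADING_RE.finditer(markdown_text)]


headings = list_headings(sample)
for level, title in headings:
    # Indent each title by its heading level to show the outline.
    print(f"{'  ' * (level - 1)}{title}")
```

In the notebook, `list_headings(context)` would apply the same idea to the full extracted text.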

## End of the notebook

Check more Uniflow use cases in the [example folder](https://github.com/CambioML/uniflow/tree/main/example/model#examples)!

<a href="https://www.cambioml.com/" title="Title">
    <img src="../image/cambioml_logo_large.png" style="height: 100px; display: block; margin-left: auto; margin-right: auto;"/>
</a>