# Example of loading a PDF using the recursive splitter

Recursive splitter: splits text by recursively looking at characters.
It tries a series of different separator characters in turn until it finds one that works.
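
To make the idea concrete, here is a minimal, self-contained sketch of recursive character splitting. It is not uniflow's implementation: the function name `recursive_split`, the `chunk_size` parameter, and the default separator list are assumptions chosen for illustration only.

```python
from typing import List, Sequence


def recursive_split(
    text: str,
    chunk_size: int = 40,
    separators: Sequence[str] = ("\n\n", "\n", " ", ""),
) -> List[str]:
    """Illustrative sketch: split `text` into chunks of at most `chunk_size` characters."""
    if len(text) <= chunk_size:
        return [text] if text else []

    # Pick the first separator that occurs in the text; "" means "split per character".
    sep, rest = "", ()
    for i, candidate in enumerate(separators):
        if candidate == "" or candidate in text:
            sep, rest = candidate, separators[i + 1:]
            break

    pieces = list(text) if sep == "" else text.split(sep)

    chunks: List[str] = []
    current = ""
    for piece in pieces:
        if len(piece) > chunk_size:
            # The piece is still too large: flush what we have, then
            # recurse on the piece with the remaining, finer separators.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(piece, chunk_size, rest))
            continue
        merged = piece if not current else current + sep + piece
        if len(merged) <= chunk_size:
            current = merged  # Keep merging small pieces into one chunk.
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


example_chunks = recursive_split(
    "aaa bbb ccc\n\nddd eee fff ggg hhh", chunk_size=12
)
```

In this sketch the text is first split on blank lines; the second paragraph is still longer than 12 characters, so it is split again on spaces and the resulting words are merged back into chunks that fit the limit.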

### Before running the code

You will need the `uniflow` conda environment to run this notebook. You can set up the environment by following the installation instructions: https://github.com/CambioML/uniflow/tree/main#installation. Furthermore, make sure you have the following packages installed:

In [None]:
# pip3 install nougat-ocr

### Load packages

In [1]:
%reload_ext autoreload
%autoreload 2

import sys

sys.path.append(".")
sys.path.append("..")
sys.path.append("../..")

In [2]:
import os
import pandas as pd
import pprint
from uniflow.flow.client import ExtractClient, TransformClient
from uniflow.flow.config import TransformOpenAIConfig, ExtractPDFConfig
from uniflow.op.model.model_config import OpenAIModelConfig, NougatModelConfig
from uniflow.op.prompt import PromptTemplate, Context
from uniflow.op.extract.split.splitter_factory import SplitterOpsFactory
from uniflow.op.extract.split.constants import RECURSIVE_CHARACTER_SPLITTER


### Prepare the input data

First, let's set the current directory and the input data directory, and build the path to the raw PDF file.

In [3]:
dir_cur = os.getcwd()
pdf_file = "1408.5882_page-1.pdf"
input_file = os.path.join(dir_cur, "data", "raw_input", pdf_file)

### List all the available splitters
These are the different splitters we can use to post-process the loaded PDF.

In [4]:
SplitterOpsFactory.list()

['ParagraphSplitter', 'MarkdownHeaderSplitter', 'RecursiveCharacterSplitter']

### Load the PDF using the recursive splitter

In [5]:
data = [
    {"filename": input_file},
]

config = ExtractPDFConfig(
    model_config=NougatModelConfig(
        model_name="0.1.0-small",
        batch_size=1,  # When batch_size > 1, nougat will run on CUDA; otherwise it will run on CPU
    ),
    splitter=RECURSIVE_CHARACTER_SPLITTER,
)
nougat_client = ExtractClient(config)


In [6]:
output = nougat_client.run(data)
contexts = output[0]['output'][0]['text']

  0%|          | 0/1 [00:00<?, ?it/s]

100%|██████████| 1/1 [00:05<00:00,  5.07s/it]
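
The cell above reads only the first file's chunks via `output[0]['output'][0]['text']`. If several files were passed in, something like the sketch below could gather every chunk. It assumes the result is a list of dicts, each holding an `"output"` list whose items carry a `"text"` list; this structure is inferred from the indexing above rather than from uniflow's documentation. The helper name `collect_chunks` and the mock data are hypothetical.

```python
from typing import Any, Dict, List


def collect_chunks(results: List[Dict[str, Any]]) -> List[str]:
    """Flatten every 'text' list found under each result's 'output' list."""
    chunks: List[str] = []
    for result in results:
        for item in result.get("output", []):
            chunks.extend(item.get("text", []))
    return chunks


# Mock data shaped like the assumed client output (not real uniflow output).
mock_output = [
    {"output": [{"text": ["chunk a", "chunk b"]}]},
    {"output": [{"text": ["chunk c"]}]},
]
all_chunks = collect_chunks(mock_output)
```

Using `.get` with empty defaults means that a result without an `"output"` or `"text"` key is skipped rather than raising a `KeyError`.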


### Process the output

Let's take a look at the extracted output.

In [7]:
for i, _s in enumerate(contexts):
    pprint.pprint(f"chunk_{i}: {_s[:200]}...")

('chunk_0: # Convolutional Neural Networks for Sentence Classification Yoon '
 'KimNew York Universityyhk255@nyu.edu###### AbstractWe report on a series of '
 'experiments with convolutional neural networks (CNN) traine...')
('chunk_1: Deep learning models have achieved remarkable results in computer '
 'vision [11] and speech recognition [1] in recent years. Within natural '
 'language processing, much of the work with deep learning method...')
('chunk_2: Convolutional neural networks (CNN) utilize layers with convolving '
 'filters that are applied to local features [1]. Originally invented for '
 'computer vision, CNN models have subsequently been shown to b...')
('chunk_3: In the present work, we train a simple CNN with one layer of '
 'convolution on top of word vectors obtained from an unsupervised neural '
 'language model. These vectors were trained by Mikolov et al. (2013)...')
('chunk_4: Our work is philosophically similar to Razavian et al. (2014) which '
 'showed that for i
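
Beyond printing previews, it can be useful to summarize the chunks in tabular form. The following is a minimal sketch using only the standard library; the helper name `summarize_chunks` and its `preview_chars` parameter are hypothetical, and the sample strings stand in for the real `contexts` list.

```python
from typing import Dict, List, Union


def summarize_chunks(
    chunks: List[str], preview_chars: int = 40
) -> List[Dict[str, Union[int, str]]]:
    """Build one summary row per chunk: index, length, and a short preview."""
    rows: List[Dict[str, Union[int, str]]] = []
    for i, chunk in enumerate(chunks):
        preview = chunk[:preview_chars]
        if len(chunk) > preview_chars:
            preview += "..."  # Mark previews that were cut short.
        rows.append({"chunk": i, "n_chars": len(chunk), "preview": preview})
    return rows


# Stand-in data; in the notebook you would pass `contexts` instead.
rows = summarize_chunks(["short", "x" * 50], preview_chars=10)
```

Because each row is a plain dict, the result can be passed directly to `pd.DataFrame(rows)` for inspection in the notebook.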

## End of the notebook

Check out more Uniflow use cases in the [example folder](https://github.com/CambioML/uniflow/tree/main/example/model#examples)!

<a href="https://www.cambioml.com/" title="Title">
    <img src="../image/cambioml_logo_large.png" style="height: 100px; display: block; margin-left: auto; margin-right: auto;"/>
</a>