# Extract text from PDFs

Extracting text from PDFs is challenging because the files may be scanned, use complex layouts, or contain content that does not linearize cleanly, such as images and tables. When building a dataset to benchmark embedding models, it is important to avoid noisy or poorly formatted text, as well as text merged across section boundaries.

Libraries like [PyMuPDF](https://github.com/pymupdf/PyMuPDF), [pypdf](https://github.com/py-pdf/pypdf) (formerly PyPDF2), and [pdfplumber](https://github.com/jsvine/pdfplumber) can extract text from simple PDFs. However, they struggle with complex layouts and cannot read scanned documents, which often contain only page images with no text layer. Embedding models, unlike LLMs, cannot interpret the context or structure of a document. They simply embed whatever text they receive, regardless of its quality.
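
One way to decide whether a plain text extractor is enough is to check how much text each page's text layer yields. The sketch below assumes PyMuPDF's `page.get_text()`; the helper names and the 50-character threshold are illustrative choices and not part of the original workflow.

```python
def pages_needing_ocr(page_texts, min_chars=50):
    """Return indices of pages whose text layer is (nearly) empty.

    `min_chars` is an arbitrary illustrative threshold, not a tuned value.
    """
    return [
        index
        for index, text in enumerate(page_texts)
        if len(text.strip()) < min_chars
    ]


def read_text_layer(pdf_path):
    """Extract the embedded text layer of every page with PyMuPDF."""
    import pymupdf  # imported lazily so the helper above has no dependencies

    with pymupdf.open(pdf_path) as document:
        return [page.get_text() for page in document]
```

Calling `pages_needing_ocr(read_text_layer(pdf_path))` on a scanned file would be expected to flag most pages, which is a hint that an OCR-capable model is needed.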

Therefore, the extracted text must be clean and well structured. Modern multimodal LLMs are good at reading text in images, so we can use them to extract text from PDFs in a form that closely matches how a human would read the document.

## Nanonets-OCR2-3B

Start by initializing the model and processor. You will need at least 3.5 GB of VRAM to run this model.

In [1]:
import torch

from transformers import (
    AutoTokenizer,
    AutoProcessor,
    AutoModelForImageTextToText,
    BitsAndBytesConfig,
)


model_path = "nanonets/Nanonets-OCR2-3B"

# 4-bit NF4 quantization with double quantization to reduce memory use
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

# Note: flash_attention_2 needs the flash-attn package and a CUDA GPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    dtype="auto",
    device_map=device,
    attn_implementation="flash_attention_2",
    quantization_config=quantization_config,
)
model.eval()  # evaluation mode: disables dropout

tokenizer = AutoTokenizer.from_pretrained(model_path)
processor = AutoProcessor.from_pretrained(model_path)
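
As a rough sanity check on the memory requirement, the weights-only footprint can be estimated from the parameter count and the bits stored per parameter. This is a back-of-envelope sketch: the 3 billion figure is an assumption taken from the "3B" in the model name, and `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_parameters, bits_per_parameter):
    """Approximate size of the model weights alone, in gigabytes (10**9 bytes)."""
    return num_parameters * bits_per_parameter / 8 / 1e9


# Assumption: a nominal 3 billion parameters, read off the "3B" in the model name.
nominal_parameters = 3e9

print(weight_memory_gb(nominal_parameters, 16))  # half precision -> 6.0
print(weight_memory_gb(nominal_parameters, 4))   # 4-bit quantization -> 1.5
```

The weights-only estimate is a lower bound. The gap up to the 3.5 GB stated above is plausibly taken up by layers that stay unquantized, activations for high-resolution page images, and the generation cache.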

Prepare the input to be passed to the model.

In [3]:
from PIL import Image


image_path = "../images/test_ocr.png"
image = Image.open(image_path)
prompt = """Extract the text from the above document as if you were reading it naturally. Return the tables in html format. Return the equations in LaTeX representation. If there is an image in the document and image caption is not present, add a small description of the image inside the <img></img> tag; otherwise, add the image caption inside <img></img>. Watermarks should be wrapped in brackets. Ex: <watermark>OFFICIAL COPY</watermark>. Page numbers should be wrapped in brackets. Ex: <page_number>14</page_number> or <page_number>9/22</page_number>. Prefer using ☐ and ☑ for check boxes."""
# Chat-style request: a system message plus one user turn holding the image and the prompt
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            {"type": "image", "image": f"file://{image_path}"},
            {"type": "text", "text": prompt},
        ],
    },
]
# Render the chat template to a prompt string, then tokenize it together with the image
text = processor.apply_chat_template(
    conversation=messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(text=[text], images=[image], padding=True, return_tensors="pt")
inputs = inputs.to(model.device)
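
The same message structure is needed for every image, so it can be factored into a small helper. `build_messages` is a hypothetical name introduced here for illustration; it only rearranges what the cell above already does.

```python
def build_messages(image_reference, prompt, system_prompt="You are a helpful assistant."):
    """Build the chat messages for one page image.

    `image_reference` may be a `file://` path or a `data:` URI,
    matching the two forms used in this notebook.
    """
    return [
        {"role": "system", "content": system_prompt},
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_reference},
                {"type": "text", "text": prompt},
            ],
        },
    ]
```

With this helper, the request for a single image would reduce to `messages = build_messages(f"file://{image_path}", prompt)`.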

Generate the output.

In [4]:
# Greedy decoding (do_sample=False) keeps the transcription deterministic
output_ids = model.generate(**inputs, max_new_tokens=4096, do_sample=False)
# generate() returns prompt + completion; keep only the newly generated tokens
generated_ids = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, output_ids)
]
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
)

The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
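
For decoder-only models, `generate` returns the prompt tokens followed by the newly generated ones, which is why the cell above slices off the first `len(input_ids)` positions of each sequence. The toy example below illustrates the slicing with plain lists instead of tensors; `trim_prompt_tokens` and the token ids are made up for illustration.

```python
def trim_prompt_tokens(input_ids, output_ids):
    """Drop the echoed prompt from each generated sequence."""
    return [
        generated[len(prompt):]
        for prompt, generated in zip(input_ids, output_ids)
    ]


# Toy token ids standing in for real tensors
prompt_batch = [[101, 7, 8]]
generated_batch = [[101, 7, 8, 42, 43, 2]]

print(trim_prompt_tokens(prompt_batch, generated_batch))  # [[42, 43, 2]]
```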


In [7]:
extracted_text = output_text[0]
print(extracted_text)

<header>ENERGY BUDGET OF WASP-121 b FROM JWST/NIRISS PHASE CURVE</header>
<page_number>9</page_number>

while the kernel weights are structured as $(N_{\text{slice}}, N_{\text{time}})$. This precomputation significantly accelerates our calculations, which is essential since the longitudinal slices are at least partially degenerate with one another. Consequently, the fits require more steps and walkers to ensure proper convergence.

To address this, we follow a similar approach to our sinusoidal fits using emcee, but we increase the total number of steps to 100,000 and use 100 walkers. Naively, the fit would include $2N_{\text{slice}} + 1$ parameters: $N_{\text{slice}}$ for the albedo values, $N_{\text{slice}}$ for the emission parameters, and one additional scatter parameter, $\sigma$. However, since night-side slices do not contribute to the reflected light component, we exclude these albedo values from the fit. In any case, our choice of 100 walkers ensures a sufficient number of wal
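
The output wraps running headers and page numbers in tags, which makes them easy to remove before embedding. The sketch below is one possible post-processing step, not part of the original pipeline: `strip_page_furniture` is a hypothetical helper, and whether headers should be dropped at all depends on the benchmark being built.

```python
import re

# `header` and `page_number` appear in the sample output above;
# `watermark` is named in the prompt. Other tags are left untouched.
FURNITURE_TAGS = ("header", "page_number", "watermark")


def strip_page_furniture(text, tags=FURNITURE_TAGS):
    """Remove tagged headers, page numbers, and watermarks, including their content."""
    for tag in tags:
        text = re.sub(rf"<{tag}>.*?</{tag}>\s*", "", text, flags=re.DOTALL)
    return text.strip()
```

Image descriptions inside `<img></img>` are kept, since they may carry information that is useful for retrieval.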

Next, run the same model on a scanned PDF. Each page is rendered to an image and passed to the model one page at a time.

In [13]:
import base64
import pymupdf

from tqdm import tqdm

extracted_text_from_all_pages = ""
pdf_path = "../data/documents/rog_strix_gaming_notebook_pc_scanned_file.pdf"
pdf_document = pymupdf.open(pdf_path)

for page in tqdm(pdf_document, total=pdf_document.page_count):  # type: ignore
    # Render the page to a 300 DPI PNG and wrap it in a base64 data URI
    pix = page.get_pixmap(dpi=300)
    image_bytes = pix.tobytes("png")
    encoded_string = base64.b64encode(image_bytes).decode("utf-8")
    data_uri = f"data:image/png;base64,{encoded_string}"

    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                {"type": "image", "image": data_uri},
                {"type": "text", "text": prompt},
            ],
        },
    ]
    text = processor.apply_chat_template(
        conversation=messages, tokenize=False, add_generation_prompt=True
    )
    inputs = processor(
        text=[text], images=[data_uri], padding=True, return_tensors="pt"
    )
    inputs = inputs.to(model.device)

    output_ids = model.generate(**inputs, max_new_tokens=4096, do_sample=False)
    # Keep only the newly generated tokens, then decode them
    generated_ids = [
        out_ids[len(in_ids) :]
        for in_ids, out_ids in zip(inputs.input_ids, output_ids)
    ]
    output_text = processor.batch_decode(
        generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
    )
    # Append this page's text; pages are separated by a single newline
    extracted_text_from_all_pages += output_text[0] + "\n"

print(extracted_text_from_all_pages.strip())

100%|██████████| 19/19 [03:24<00:00, 10.78s/it]

<img>BC HDMI™ HIGH-DEFINITION MULTIMEDIA INTERFACE</img>

E25294

REVISED EDITION V5 / DECEMBER 2024

<img>ROG STRIX logo</img>

ROG STRIX
GAMING NOTEBOOK PC

MORE INFO:
<img>QR Code</img>
# COPYRIGHT INFORMATION

No part of this manual, including the products and software described in it, may be reproduced, transmitted, transcribed, stored in a retrieval system, or translated into any language in any form or by any means, except documentation kept by the purchaser for backup purposes, without the express written permission of ASUSTeK COMPUTER INC. ("ASUS").

ASUS PROVIDES THIS MANUAL "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE IMPLIED WARRANTIES OR CONDITIONS OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. IN NO EVENT SHALL ASUS, ITS DIRECTORS, OFFICERS, EMPLOYEES OR AGENTS BE LIABLE FOR ANY INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES (INCLUDING DAMAGES FOR LOSS OF PROFITS, LOSS OF BUSINESS, LOSS OF USE OR DATA,
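
In the output above, the cover page's last line (`<img>QR Code</img>`) appears to run directly into the next page's heading, because pages are joined with a single newline. A possible alternative, sketched below, is to collect the per-page outputs in a list and join them with a blank line so that page boundaries stay visible. `join_pages` and its default separator are assumptions made for illustration.

```python
def join_pages(page_texts, separator="\n\n"):
    """Join per-page outputs, trimming stray whitespace and skipping empty pages."""
    cleaned = [text.strip() for text in page_texts]
    return separator.join(text for text in cleaned if text)


print(join_pages(["<img>QR Code</img>\n", "# COPYRIGHT INFORMATION"]))
```

A more explicit separator, such as a page marker line, would also work if downstream chunking needs to know where each page starts.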




In [14]:
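file_path = "../data/extracted_text/nanonets_docling_output.txt"
# Write as UTF-8 explicitly: the output can contain non-ASCII characters such as ™, ☐ and ☑
with open(file_path, "w", encoding="utf-8") as f:
    f.write(extracted_text_from_all_pages)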
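
The write above fails if the output directory does not exist yet. A small helper along the following lines could create it first; `save_text` is a hypothetical name, and this is a minimal sketch rather than part of the original notebook.

```python
from pathlib import Path


def save_text(text, file_path):
    """Write text as UTF-8, creating parent directories if they are missing."""
    path = Path(file_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    return path
```

It would be called as `save_text(extracted_text_from_all_pages, file_path)`.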