# Extract text from PDFs

Extracting text from PDFs is challenging because these files may be scanned, have complex layouts, or contain non-text content such as images and tables. When building a dataset to benchmark embedding models, it is important to avoid noisy or poorly formatted text, as well as text that has been merged across section boundaries.

Libraries like [PyMuPDF](https://github.com/pymupdf/PyMuPDF), [pypdf](https://github.com/py-pdf/pypdf) (formerly PyPDF2), and [pdfplumber](https://github.com/jsvine/pdfplumber) can extract text from simple PDFs. However, they struggle with complex layouts, and they return little or no text for scanned documents that have no text layer. Embedding models, unlike LLMs, cannot interpret the context or structure of a document. They simply embed whatever text they receive, regardless of its quality.
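
To make this limitation concrete, here is a minimal sketch of a baseline extraction with pypdf, plus a rough check for pages that came back with almost no text, which is typical of scans without a text layer. The helper names `extract_page_texts` and `sparse_pages` and the 20-character threshold are assumptions chosen for illustration; they are not part of any library.

```python
def extract_page_texts(pdf_path: str) -> list[str]:
    """Return the text layer of each page using pypdf (requires `pip install pypdf`)."""
    from pypdf import PdfReader  # imported lazily so sparse_pages works without pypdf

    reader = PdfReader(pdf_path)
    # Pages without a text layer yield an empty string.
    return [page.extract_text() or "" for page in reader.pages]


def sparse_pages(page_texts: list[str], min_chars: int = 20) -> list[int]:
    """Return 1-based numbers of pages whose extracted text is shorter than min_chars.

    min_chars is an arbitrary illustrative threshold, not a recommended value.
    """
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if len(text.strip()) < min_chars
    ]
```

If `sparse_pages(extract_page_texts(path))` lists most of the pages, the document is probably a scan, and a text-layer library alone will not be enough.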

Therefore, it is essential to ensure that the extracted text is clean and well-structured. Modern multimodal LLMs excel at reading images and understanding text, so we can use them to extract text from PDFs in a format that closely matches how a human would perceive the document.

## Gemini

Let's see how to do this with Gemini, a multimodal model that can process text, images, or even an entire PDF file. You can use any other multimodal LLM you prefer.

When passing a PDF to Gemini inline, ensure that the total request size (PDF plus prompt) is less than 20 MB. For larger files, use the [File API](https://ai.google.dev/gemini-api/docs/document-processing#large-pdfs) to upload the document.

Read more about [working with PDF files in Gemini](https://ai.google.dev/gemini-api/docs/document-processing).
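
A small helper can decide between the two upload paths. This is a sketch under two assumptions: the 20 MB limit is read as 20 × 1024 × 1024 bytes, and the raw file size is treated as a good enough proxy for the request size (the actual request may be larger once encoded). The names `fits_inline` and `make_pdf_part` are hypothetical; `types.Part.from_bytes` and `client.files.upload` come from the `google-genai` SDK.

```python
from pathlib import Path

# Assumption: "20 MB" is interpreted as 20 MiB; adjust if the documented limit differs.
INLINE_LIMIT_BYTES = 20 * 1024 * 1024


def fits_inline(pdf_path, prompts, limit=INLINE_LIMIT_BYTES):
    """Return True when the PDF bytes plus the prompt text stay under the limit."""
    pdf_size = Path(pdf_path).stat().st_size
    prompt_size = sum(len(text.encode("utf-8")) for text in prompts)
    return pdf_size + prompt_size < limit


def make_pdf_part(client, pdf_path, prompts):
    """Return an item usable in `contents`: inline bytes or an uploaded file."""
    from google.genai import types  # lazy import; requires the google-genai package

    if fits_inline(pdf_path, prompts):
        return types.Part.from_bytes(
            data=Path(pdf_path).read_bytes(),
            mime_type="application/pdf",
        )
    # Larger documents go through the File API instead of the request body.
    return client.files.upload(file=str(pdf_path))
```

The value returned by `make_pdf_part` can take the place of the inline PDF part in the `contents` list of a `generate_content` call.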

In [1]:
from dotenv import load_dotenv

load_dotenv()

True

Let's write a system prompt to teach the model how to extract text from the PDF.

In [2]:
system_prompt = """You are an expert AI assistant tasked with extracting the entire text from any PDF document. The document can be simple, complex, or even scanned; this shouldn't matter to you.

You will be given the entire PDF as input. Examine the document page by page. When you come across text, extract it as is; don't convert it into another format like HTML or Markdown. If you come across images, replace them with a very detailed description of the image, taking the surrounding context into account.

When you come across tables, describe them too, as you would an image. The description should be very detailed, so that someone can understand the table without seeing it.

Keep the structure of the document: if there are sections, subsections, bullet points, or numbered lists, keep them as is. If there are any headers, footers, or page numbers, remove them.

The final output should be a clean, well-structured text that represents the content of the entire PDF document as closely as possible to how a human would see it with their eyes when reading the document. Don't say anything else, just output the text you extracted from the PDF.

Here is the PDF:
"""

Submit the request and obtain the extracted text.

In [None]:
import pathlib

from google import genai
from google.genai import types

# The client reads the API key from the environment (loaded from .env above).
client = genai.Client()

relative_path = "../data/documents/rog_strix_gaming_notebook_pc_scanned_file.pdf"
file_path = pathlib.Path(relative_path)
prompt = "Extract the text from the PDF please."

response = client.models.generate_content(
    model="gemini-2.5-pro",
    # model="gemini-2.5-flash",  # alternative model
    contents=[
        system_prompt,
        # Send the PDF inline as raw bytes (suitable for requests under 20 MB).
        types.Part.from_bytes(
            data=file_path.read_bytes(),
            mime_type="application/pdf",
        ),
        prompt,
    ],
    # Allow a long response so that the extraction is less likely to be cut off.
    config=types.GenerateContentConfig(max_output_tokens=32_768),
)

Printing the response shows that the LLM extracted the text from the PDF in a well-structured format.

However, this is not enough. We still need to review the output manually to make sure that only relevant information ends up in the benchmarking dataset.
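
One part of the review can be automated: checking whether the response stopped because it reached `max_output_tokens`, in which case the extracted text is incomplete. The sketch below assumes the response exposes `candidates[0].finish_reason`, as in the `google-genai` SDK. The helper name `hit_token_limit` is hypothetical, and the comparison is done on the reason's name so the function can be tried without the SDK installed.

```python
def hit_token_limit(response) -> bool:
    """Return True when the first candidate stopped because of the output token limit."""
    candidates = getattr(response, "candidates", None) or []
    if not candidates:
        return False
    reason = candidates[0].finish_reason
    # finish_reason is an enum in the SDK; fall back to str() for plain strings.
    name = getattr(reason, "name", None) or str(reason)
    return name == "MAX_TOKENS"
```

For example, `if hit_token_limit(response): print("Output was truncated")` flags a document that needs a larger token budget or has to be split before extraction.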

In [6]:
print(response.text)

An image of the front cover of a user manual for a gaming laptop. The background is dark gray and black with a diagonal split. The upper right section has a faint, stylized text pattern that includes words like "GAMER" and "JOIN". A large, silver, stylized eye logo, which is the symbol for ASUS Republic of Gamers (ROG), is prominently displayed in the center. Below the logo, the text "ROG STRIX" is in large, silver, stylized capital letters, with "GAMING NOTEBOOK PC" written underneath in a smaller, sans-serif font.
In the top left corner, there are two logos: a "BC" logo and the "HDMI HIGH-DEFINITION MULTIMEDIA INTERFACE" logo.
In the top right corner, "E25294" is written, and below it, "REVISED EDITION V5 / DECEMBER 2024".
In the bottom right corner, there is a QR code with "MORE INFO" written above it.

COPYRIGHT INFORMATION
No part of this manual, including the products and software described in it, may be reproduced, transmitted, transcribed, stored in a retrieval system, or trans

Save the extracted text to a file so that we can review it later.

In [None]:
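output_path = pathlib.Path("../data/extracted_text/rog_strix_gaming_notebook_pc_scanned_file.txt")
output_path.parent.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing
# response.text can be None (for example, for an empty response), so fall back to "".
# An explicit encoding keeps non-ASCII characters intact across platforms.
output_path.write_text(response.text or "", encoding="utf-8")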