# Extract text from PDFs

Extracting text from PDFs is challenging because these files may be scanned, have complex layouts, or contain unstructured data such as images and tables. When building a dataset to benchmark embedding models, it is important to avoid text that is noisy, poorly formatted, or merged across sections.

Libraries like [PyMuPDF](https://github.com/pymupdf/PyMuPDF), [PyPDF2](https://github.com/py-pdf/pypdf) (now maintained as `pypdf`), and [pdfplumber](https://github.com/jsvine/pdfplumber) can extract text from simple PDFs. However, they struggle with unstructured data and cannot read scanned documents that lack a text layer. Embedding models, unlike LLMs, cannot interpret the context or structure of a document. They simply embed whatever text they receive, regardless of its quality.

Therefore, it is essential to ensure that the extracted text is clean and well-structured. Modern LLMs excel at reading images and understanding text, allowing us to leverage them to extract text from PDFs in a format that closely matches how a human would perceive the document.

## PyMuPDF

### Normal PDF

By a `normal PDF`, I mean a document that is not scanned and contains selectable text that can be extracted directly. Let's use the `pymupdf` library to extract text from each page of the PDF.

In [2]:
import pymupdf

pdf_path = "../data/documents/the_state_of_ai_how_organizations_are_rewiring_to_capture_value_final.pdf"
pdf_document = pymupdf.open(pdf_path)
print(f"Number of pages: {pdf_document.page_count}")

Number of pages: 26


In [3]:
extracted_text = ""
for page in pdf_document:
    extracted_text += page.get_text()  # type: ignore
print(f"Extracted {len(extracted_text)} characters of text.")

Extracted 50979 characters of text.
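The loop above concatenates every page into one string with no marker between them, so page boundaries are lost. A minimal sketch of one alternative, assuming you want to keep those boundaries: collect the text per page and join it with an explicit separator. The helper names `extract_pages` and `join_pages` and the separator string are illustrative choices, not part of PyMuPDF; the code relies only on each page exposing `get_text()`, as PyMuPDF pages do.

```python
from typing import Iterable, List


def extract_pages(document: Iterable) -> List[str]:
    """Return one stripped text string per page.

    `document` is anything iterable whose items have a `get_text()` method,
    such as an open `pymupdf` document.
    """
    return [page.get_text().strip() for page in document]


def join_pages(page_texts: List[str], separator: str = "\n\n--- page break ---\n\n") -> str:
    """Join per-page texts with an explicit, easy-to-split separator."""
    return separator.join(page_texts)
```

With the document opened above, `pages = extract_pages(pdf_document)` would return one string per page, and `join_pages(pages)` would produce a single string that can still be split back into pages later.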


We successfully extracted text from the PDF, but is it well-structured? Let's examine the first 2,000 characters to find out.

In [4]:
print(extracted_text[:2000])

The state of AI  
March 2025
Alex Singla  
Alexander Sukharevsky  
Lareina Yee  
Michael Chui 
Bryce Hall
How organizations are rewiring to capture value
Organizations are beginning to create the 
structures and processes that lead to 
meaningful value from gen AI. While  
still in early days, companies are  
redesigning workflows, elevating  
governance, and mitigating  
more risks.   
O
rganizations are starting to make 
organizational changes designed to 
generate future value from gen AI, and 
large companies are leading the way. The 
latest McKinsey Global Survey on AI finds 
that organizations are beginning to take steps that drive 
bottom-line impact—for example, redesigning workflows as 
they deploy gen AI and putting senior leaders in critical roles, 
such as overseeing AI governance. The findings also show 
that organizations are working to mitigate a growing set of 
gen-AI-related risks and are hiring for new AI-related roles 
while they retrain employees to participate in A

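The answer is no: the text is not well-structured. Lines are hard-wrapped at the column width, and the drop cap "O" is split from "rganizations". The problem is even clearer on the page at index 4 (the fifth page, since PyMuPDF indexes pages from zero), which contains a stacked bar chart.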

![example_of_unstructured_data_in_pdf](../images/example_of_unstructured_data_in_pdf.jpg)

In [5]:
page_number = 4  # zero-based index: the fifth page of the PDF
print(pdf_document[page_number].get_text())  # type: ignore

Organizations are selectively centralizing elements of their  
AI deployment 
The survey findings also shed light on how organizations are structuring their AI deployment 
efforts. Some essential elements for deploying AI tend to be fully or partially centralized  
(Exhibit 1). For risk and compliance, as well as data governance, organizations often use a fully 
centralized model such as a center of excellence. For tech talent and adoption of AI solutions, 
on the other hand, respondents most often report using a hybrid or partially centralized model, 
with some resources handled centrally and others distributed across functions or business 
units—though respondents at organizations with less than $500 million in annual revenues  
are more likely than others to report fully centralizing these elements. 
Exhibit 1
Degree of centralization of AI deployment,¹ % of respondents
McKinsey & Company
¹Question was asked only of respondents whose organizations use AI in at least 1 function, n = 

Other libraries such as `PyPDF2` and `pdfplumber` produce similar results. However, with LLMs that can interpret both text and images, we can extract and reconstruct the entire content of a PDF in a well-structured format.
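A multimodal model needs to see the page itself, not only its text layer, so a common first step is to render each page to an image. The sketch below is one way to do that with PyMuPDF's `Page.get_pixmap` and `Pixmap.tobytes`. The helper name `render_pages_as_png` and the 150 DPI default are assumptions made for illustration, and the call to the model itself is left to the later notebooks.

```python
from typing import Iterable, Iterator


def render_pages_as_png(document: Iterable, dpi: int = 150) -> Iterator[bytes]:
    """Yield each page of an open PyMuPDF document as PNG-encoded bytes."""
    for page in document:
        # Rasterize the page at the requested resolution.
        pixmap = page.get_pixmap(dpi=dpi)
        # Encode the raster as PNG so it can be saved or sent to a model.
        yield pixmap.tobytes("png")
```

For example, `png_pages = list(render_pages_as_png(pdf_document))` would need to run while the document is still open. A higher `dpi` keeps small chart labels legible at the cost of larger images.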

In [6]:
pdf_document.close()

### Scanned PDF

Even if a PDF is scanned, we can still determine the number of pages it contains.

In [7]:
pdf_path = "../data/documents/rog_strix_gaming_notebook_pc_scanned_file.pdf"
pdf_document = pymupdf.open(pdf_path)
print(f"Number of pages: {pdf_document.page_count}")

Number of pages: 19


In [8]:
extracted_text = ""
for page in pdf_document:
    extracted_text += page.get_text()  # type: ignore
print(f"Extracted {len(extracted_text)} characters of text.")

Extracted 0 characters of text.

As the output shows, no characters were extracted. The PDF is scanned, so each page is an image of text rather than text that can be extracted directly. The next notebooks explore how to handle this using LLMs:

**Proprietary models:**
- [Gemini](./1_2_ExtractTextFomPDFsGemini.ipynb)

**Open-source models:**
- [Granite docling 258M](./1_3_ExtractTextFomPDFsGraniteDocling258M.ipynb)
- [Gemma3 12B](./1_4_ExtractTextFomPDFsGemma3.ipynb)
- [Nanonets OCR2 3B](./1_5_ExtractTextFomPDFsNanonetsOCR2.ipynb)
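The zero-character result for the scanned file suggests a simple automated check for deciding which files need OCR or a vision model. A minimal sketch, assuming a page with fewer than a handful of extractable characters counts as image-only; the function names `find_image_only_pages` and `needs_ocr` and the `min_chars` threshold are illustrative and should be tuned on your own documents.

```python
from typing import List


def find_image_only_pages(page_texts: List[str], min_chars: int = 20) -> List[int]:
    """Return zero-based indices of pages whose text layer is (nearly) empty."""
    return [
        index
        for index, text in enumerate(page_texts)
        if len(text.strip()) < min_chars
    ]


def needs_ocr(page_texts: List[str], min_chars: int = 20) -> bool:
    """True when the document has pages and every one of them looks image-only."""
    if not page_texts:
        return False
    return len(find_image_only_pages(page_texts, min_chars)) == len(page_texts)
```

Called as `needs_ocr([page.get_text() for page in pdf_document])`, this would return `True` for the scanned file above, since none of its pages yields any text. `find_image_only_pages` is the more useful of the two for mixed documents where only some pages are scans.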
