<p align="center">
  <br/>
    <img alt="Spark Pdf" src="https://stabrise.com/media/filer_public_thumbnails/filer_public/16/d6/16d6a0d6-f162-42ad-a5a3-7dc20361ad24/sparkpdf.png__1000x300_subsampling-2.webp" width="450" style="max-width: 100%;">
  <br/>
</p>

This notebook demonstrates how to use the PDF DataSource to load multi-page PDF files with Apache Spark.

<p align="center">
    <a target="_blank" href="https://colab.research.google.com/github/StabRise/spark-pdf/blob/main/examples/PdfDataSource.ipynb">
      <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
    </a>
    <a href="https://search.maven.org/artifact/com.stabrise/spark-pdf_2.12">
        <img alt="Maven Central Version" src="https://img.shields.io/maven-central/v/com.stabrise/spark-pdf_2.12">
    </a>
    <a href="https://github.com/StabRise/spark-pdf/blob/master/LICENSE" >
        <img src="https://img.shields.io/badge/License-AGPL%203-blue.svg" alt="License"/>
    </a>
</p>

---

**Source Code**: [https://github.com/StabRise/spark-pdf](https://github.com/StabRise/spark-pdf)

⭐ Star us on GitHub — it motivates us a lot!

---

## Key features:

- Read PDF documents into a Spark DataFrame
- Read PDF files lazily, page by page
- Handle large files, up to 10k pages
- Process scanned PDF files using OCR
- No separate Tesseract OCR installation needed; it is included in the package

In [0]:
# Install PySpark, plus Pillow for displaying images
%pip install pyspark==3.4.1
%pip install Pillow

## Creating Spark Session with Spark Pdf DataSource

In [0]:
import io
from PIL import Image
from IPython.display import display
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

builder = SparkSession.builder \
    .master("local[*]") \
    .config("spark.jars.packages", "com.stabrise:spark-pdf_2.12:0.1.7") \
    .config("spark.driver.memory", "8g") \
    .appName("SparkPdf")

spark = builder.getOrCreate()
spark

## Loading PDF documents into Spark

In [0]:
# Downloading example PDF files
import urllib.request

filenames = ["./example1.pdf", "./example2.pdf", "./example3.pdf"]
url = "https://raw.githubusercontent.com/StabRise/spark-pdf/refs/heads/main/examples/"
for f in filenames:
    urllib.request.urlretrieve(url + f.split("/")[-1], f)

In [0]:
df = spark.read.format("pdf") \
    .option("imageType", "BINARY") \
    .option("resolution", "300") \
    .option("pagePerPartition", "8") \
    .option("reader", "pdfBox") \
    .load(filenames)

Available options for the data source:

- `imageType`: Output image type. One of "BINARY", "GREY", "RGB". Default: "RGB".
- `resolution`: Resolution used to render a PDF page to an image. Default: "300" dpi.
- `pagePerPartition`: Number of pages per partition in the Spark DataFrame. Default: "5".
- `reader`: PDF rendering backend. Supported values: `pdfBox` (based on the PDFBox Java library) and `gs` (based on Ghostscript, which must be installed on the system).
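
The options listed above can be collected in one place and checked before they reach the reader. The following is a minimal sketch, not part of Spark PDF itself: `build_pdf_options` is a hypothetical helper, the allowed values are simply the ones listed above, and the `pdfBox` default is borrowed from this notebook's example because no default is documented here.

```python
# Hypothetical helper (not part of the spark-pdf package): collects the
# documented options into a dict and rejects values that are not listed above.
ALLOWED_IMAGE_TYPES = {"BINARY", "GREY", "RGB"}
ALLOWED_READERS = {"pdfBox", "gs"}


def build_pdf_options(image_type="RGB", resolution=300,
                      page_per_partition=5, reader="pdfBox"):
    """Return the data source options as a dict of strings."""
    if image_type not in ALLOWED_IMAGE_TYPES:
        raise ValueError(f"imageType must be one of {sorted(ALLOWED_IMAGE_TYPES)}")
    if reader not in ALLOWED_READERS:
        raise ValueError(f"reader must be one of {sorted(ALLOWED_READERS)}")
    if int(resolution) <= 0 or int(page_per_partition) <= 0:
        raise ValueError("resolution and pagePerPartition must be positive")
    # Spark data source options are passed as strings in this notebook.
    return {
        "imageType": image_type,
        "resolution": str(resolution),
        "pagePerPartition": str(page_per_partition),
        "reader": reader,
    }


options = build_pdf_options(image_type="BINARY", page_per_partition=8)
print(options)
```

With a live session, the resulting dict could be passed on as `spark.read.format("pdf").options(**options).load(filenames)`, since PySpark's `DataFrameReader.options` accepts keyword arguments.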

### Counting total number of pages in all documents

Spark PDF operates with a lazy evaluation approach, extracting metadata from PDF files without loading the entire file into memory.

In this example, we loaded three PDF documents:  
- The first document contains 1 page.  
- The second document contains 1 page with no text layer (its text is not recognized).
- The last document contains 30 pages.


In [0]:
df.count()

### Checking Number of Partitions

We specified the option `pagePerPartition = 8` in the configuration.<br/>
This results in 6 partitions:  
- 1 partition for the first file.  
- 1 partition for the second file.  
- 4 partitions for the last file, which contains 30 pages.  
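
The partition count above follows from simple arithmetic. The sketch below assumes (consistent with the breakdown above, but not verified against the library's implementation) that each file is split independently into chunks of at most `pagePerPartition` pages. `expected_partitions` is a hypothetical helper, and the page counts are those of the three example files.

```python
import math


def expected_partitions(page_counts, pages_per_partition):
    """Sum of per-file partition counts, rounding each file up to a whole partition."""
    return sum(math.ceil(pages / pages_per_partition) for pages in page_counts)


page_counts = [1, 1, 30]  # pages in example1.pdf, example2.pdf, example3.pdf

print(sum(page_counts))                     # total rows expected from df.count(): 32
print(expected_partitions(page_counts, 8))  # 1 + 1 + 4 = 6
```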

In [0]:
df.rdd.getNumPartitions()

### Showing the DataFrame

The DataFrame contains the following columns:

- `path`: path to the file
- `filename`: name of the file (used in the query below)
- `page_number`: page number of the document
- `text`: extracted text from the text layer of the PDF page
- `image`: image representation of the page
- `document`: the OCR-extracted text from the rendered image (calls Tesseract OCR)
- `partition_number`: partition number

In [0]:
df.select("filename", "page_number", "partition_number", "text") \
    .orderBy("filename", "page_number") \
    .show()

In [0]:
df.printSchema()

## PDF document page with a text layer (digital/searchable PDF)

In [0]:
# Load the first page of a document that has a text layer
row = spark.read.format("pdf") \
    .option("imageType", "BINARY") \
    .option("resolution", "300") \
    .option("pagePerPartition", "8") \
    .option("reader", "pdfBox") \
    .load(["example1.pdf"]) \
    .select("page_number", "text", "image.data", "path") \
    .limit(1) \
    .collect()[0]

In [0]:
# Image representation of the page 
display(Image.open(io.BytesIO(row.data)).resize((600, 800)))

In [0]:
print(row.text) # Text representation of the page. 

## PDF document page containing image data (scanned or image-based PDF)

In [0]:
# Load the first page of the document whose text is not recognized (no text layer)
row = spark.read.format("pdf") \
    .option("imageType", "BINARY") \
    .option("resolution", "300") \
    .option("pagePerPartition", "8") \
    .option("reader", "pdfBox") \
    .load(["example2.pdf"]) \
    .select("page_number", "text", "document", "image.data", "path") \
    .limit(1) \
    .collect()[0]

In [0]:
display(Image.open(io.BytesIO(row.data)).resize((600, 800)))

In [0]:
print(row.text) # empty, because this page doesn't contain a text layer

In [0]:
# Showing text recognized by the OCR
print(row.document.text)
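
The two cases above suggest a simple fallback: use the text layer when it has content, and the OCR result otherwise. Here is a minimal sketch, assuming a collected row exposes `text` and `document.text` as in the cells above. `best_text` is a hypothetical helper, and the `SimpleNamespace` objects only stand in for Spark rows so that the snippet runs without a Spark session.

```python
from types import SimpleNamespace


def best_text(row):
    """Prefer the PDF text layer; fall back to the OCR text when it is empty."""
    text_layer = (row.text or "").strip()
    if text_layer:
        return text_layer
    # `document` is only present if it was selected, so look it up defensively.
    document = getattr(row, "document", None)
    ocr_text = getattr(document, "text", None) if document is not None else None
    return (ocr_text or "").strip()


# Stand-ins for collected rows (illustrative values, not real output).
digital_page = SimpleNamespace(text="Text from the text layer",
                               document=SimpleNamespace(text=""))
scanned_page = SimpleNamespace(text="",
                               document=SimpleNamespace(text="Text from OCR\n"))

print(best_text(digital_page))
print(best_text(scanned_page))
```

Checking the text layer first reflects the column descriptions earlier in this notebook: `text` comes directly from the PDF, while `document` is produced by running OCR on the rendered image.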