<p align="center">
  <br/>
    <a href="https://stabrise.com/spark-pdf/"><img alt="Spark Pdf" src="https://stabrise.com/static/images/projects/sparkpdf.webp" style="max-width: 100%;"></a>
  <br/>
</p>

This notebook demonstrates how to use the PDF Datasource to read PDF files from a Unity Catalog volume on Databricks.

<p align="center">
    <a target="_blank" href="https://colab.research.google.com/github/StabRise/spark-pdf/blob/main/examples/PdfDataSource.ipynb">
      <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
    </a>
    <a href="https://search.maven.org/artifact/com.stabrise/spark-pdf-spark35_2.12">
        <img alt="Maven Central Version" src="https://img.shields.io/maven-central/v/com.stabrise/spark-pdf-spark35_2.12">
    </a>
    <a href="https://github.com/StabRise/spark-pdf/blob/master/LICENSE" >
        <img src="https://img.shields.io/badge/License-AGPL%203-blue.svg" alt="License"/>
    </a>
</p>

---

**Source Code**: [https://github.com/StabRise/spark-pdf](https://github.com/StabRise/spark-pdf)

**Related Blog Posts**: 
 - [https://stabrise.com/blog/spark-pdf-databricks-unity-catalog/](https://stabrise.com/blog/spark-pdf-databricks-unity-catalog/)
 - [https://stabrise.com/blog/spark-pdf-on-databricks/](https://stabrise.com/blog/spark-pdf-on-databricks/)

⭐ Star us on GitHub — it motivates us a lot!

---

## Key features:

- Read PDF documents into a Spark DataFrame
- Read PDF files lazily, page by page
- Support large files, up to 10k pages
- Support scanned PDF files (via OCR)
- No need to install Tesseract OCR; it's included in the package

## Requirements

- Databricks Runtime 15.4 or above (Spark v3.5.x)
- Spark PDF v0.1.16 or above (maven: com.stabrise:spark-pdf-spark35_2.12:0.1.16)


In [0]:
import os
import urllib.request

## Download example files and copy them to the Unity Catalog volume

In [0]:
CATALOG_NAME = "your catalog"
SCHEMA_NAME = "default"
VOLUME_NAME = "your volume"


# Download the example PDF files and copy each one to the volume

filenames = ["example1.pdf", "example2.pdf", "example3.pdf"]
url = "https://raw.githubusercontent.com/StabRise/spark-pdf/refs/heads/main/examples/"
for filename in filenames:
    # Download to the driver's current working directory
    urllib.request.urlretrieve(url + filename, filename)
    volume_path = f"/Volumes/{CATALOG_NAME}/{SCHEMA_NAME}/{VOLUME_NAME}/{filename}"
    # Copy from the driver's local filesystem to the Unity Catalog volume
    dbutils.fs.cp(f"file:{os.getcwd()}/{filename}", volume_path)

## Loading PDF documents into Spark

In [0]:
df = spark.read.format("pdf") \
    .option("imageType", "BINARY") \
    .option("resolution", "300") \
    .option("pagePerPartition", "8") \
    .option("reader", "pdfBox") \
    .load(f"/Volumes/{CATALOG_NAME}/{SCHEMA_NAME}/{VOLUME_NAME}/*.pdf")

Available options for the data source:

- `imageType`: Output image type. Can be "BINARY", "GREY", or "RGB". Default: "RGB".
- `resolution`: Resolution, in dpi, for rendering a PDF page to an image. Default: "300".
- `pagePerPartition`: Number of pages per partition in the Spark DataFrame. Default: "5".
- `reader`: PDF reader to use. Supported values: `pdfBox` (based on the PDFBox Java library) and `gs` (based on Ghostscript; requires Ghostscript to be installed on the system).
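
The options listed above can also be collected in a dictionary and passed in a single call, which keeps the defaults in one place. The following is a minimal sketch: `build_pdf_options` and `read_pdfs` are hypothetical helper names, not part of Spark PDF. The default for `reader` is not stated above, so the sketch assumes `pdfBox`, the value used in the example cell. It relies on PySpark's `DataFrameReader.options(**options)`.

```python
# Defaults taken from the option list above. The "reader" default is an
# assumption: it reuses the value from the example cell.
PDF_DEFAULTS = {
    "imageType": "RGB",
    "resolution": "300",
    "pagePerPartition": "5",
    "reader": "pdfBox",
}


def build_pdf_options(**overrides):
    """Merge overrides into the defaults, converting every value to a string."""
    merged = dict(PDF_DEFAULTS)
    merged.update({key: str(value) for key, value in overrides.items()})
    return merged


def read_pdfs(spark, path, **overrides):
    """Read PDFs from `path` using the merged options (hypothetical helper)."""
    options = build_pdf_options(**overrides)
    return spark.read.format("pdf").options(**options).load(path)
```

For example, `read_pdfs(spark, "/Volumes/<catalog>/<schema>/<volume>/*.pdf", imageType="BINARY", pagePerPartition=8)` would use the same `imageType` and `pagePerPartition` values as the cell above, with the placeholders replaced by your own names.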

### Counting total number of pages in all documents

Spark PDF operates with a lazy evaluation approach, extracting metadata from PDF files without loading the entire file into memory.

In this example, we loaded three PDF documents:  
- The first document contains 1 page.  
- The second document contains 1 page with no recognized text.
- The last document contains 30 pages.


In [0]:
df.count()

32

### Checking Number of Partitions

We specified the option `pagePerPartition = 8` in the configuration.<br/>
This results in 6 partitions:  
- 1 partition for the first file.  
- 1 partition for the second file.  
- 4 partitions for the last file, which contains 30 pages.  

In [0]:
df.rdd.getNumPartitions()

6
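
The partition count follows from rounding each file's page count up to a whole number of partitions. Here is a small sketch of that arithmetic, assuming pages from different files never share a partition (which matches the breakdown above). `expected_partitions` is a hypothetical helper, not part of Spark PDF.

```python
import math


def expected_partitions(page_counts, pages_per_partition):
    """Sum the per-file partition counts, rounding each file up."""
    return sum(math.ceil(pages / pages_per_partition) for pages in page_counts)


page_counts = [1, 1, 30]  # pages in the three example files

print(sum(page_counts))                     # total rows: 32
print(expected_partitions(page_counts, 8))  # partitions: 1 + 1 + 4 = 6
```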

### Showing the DataFrame

The DataFrame contains the following columns:

- `path`: path to the file
- `filename`: name of the file
- `page_number`: page number of the document
- `text`: extracted text from the text layer of the PDF page
- `image`: image representation of the page
- `document`: the OCR-extracted text from the rendered image (calls Tesseract OCR)
- `partition_number`: partition number
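
Because `image` and `document` are struct columns, their fields can be selected with dot notation. This is a minimal sketch, assuming the nested field names `document.text`, `image.width`, and `image.height` reported by `df.printSchema()`. `select_ocr_columns` is a hypothetical helper, not part of Spark PDF.

```python
# Column paths to select; nested struct fields use dot notation.
OCR_COLUMNS = [
    "filename",
    "page_number",
    "document.text",
    "image.width",
    "image.height",
]


def select_ocr_columns(df):
    """Return file name, page number, OCR text, and rendered image size."""
    return df.select(*OCR_COLUMNS)
```

Calling `select_ocr_columns(df).show()` reads the `document` column, which the list above describes as calling Tesseract OCR, so it may run noticeably slower than selecting `text` alone.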

In [0]:
df.select("filename", "page_number", "partition_number", "text") \
    .orderBy("filename", "page_number") \
    .show()

+------------+-----------+----------------+--------------------+
|    filename|page_number|partition_number|                text|
+------------+-----------+----------------+--------------------+
|example1.pdf|          0|               4|RECIPE\nStrawberr...|
|example2.pdf|          0|               5|                  \n|
|example3.pdf|          0|               0|Lorem ipsum \nLor...|
|example3.pdf|          1|               0|In non mauris jus...|
|example3.pdf|          2|               0|Lorem ipsum dolor...|
|example3.pdf|          3|               0|Maecenas mauris l...|
|example3.pdf|          4|               0|Etiam vehicula lu...|
|example3.pdf|          5|               0|Lorem ipsum \nLor...|
|example3.pdf|          6|               0|In non mauris jus...|
|example3.pdf|          7|               0|Lorem ipsum dolor...|
|example3.pdf|          8|               1|Maecenas mauris l...|
|example3.pdf|          9|               1|Etiam vehicula lu...|
|example3.pdf|         10

In [0]:
df.printSchema()

root
 |-- path: string (nullable = true)
 |-- filename: string (nullable = true)
 |-- page_number: integer (nullable = true)
 |-- partition_number: integer (nullable = true)
 |-- text: string (nullable = true)
 |-- image: struct (nullable = true)
 |    |-- path: string (nullable = true)
 |    |-- resolution: integer (nullable = true)
 |    |-- data: binary (nullable = true)
 |    |-- imageType: string (nullable = true)
 |    |-- exception: string (nullable = true)
 |    |-- height: integer (nullable = true)
 |    |-- width: integer (nullable = true)
 |-- document: struct (nullable = true)
 |    |-- path: string (nullable = true)
 |    |-- text: string (nullable = true)
 |    |-- outputType: string (nullable = true)
 |    |-- bBoxes: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- text: string (nullable = true)
 |    |    |    |-- score: float (nullable = true)
 |    |    |    |-- x: integer (nullable = true)
 |    |    |    |-- y: intege