![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/reader/SparkNLP_PDF_Reader_Demo.ipynb)

# Introducing the PDF Reader in Spark NLP
This notebook showcases the newly added `sparknlp.read().pdf()` method in Spark NLP, which parses PDF content from both local files and distributed file systems into a Spark DataFrame.

## Setup and Initialization
Let's keep in mind a few things before we start 😊

Support for reading PDF files was introduced in Spark NLP 6.0.0. Please make sure you have upgraded to the latest Spark NLP release.

Let's install and set up Spark NLP in Google Colab. This part is pretty easy with our setup script:

In [1]:
! wget -q http://setup.johnsnowlabs.com/colab.sh -O - | bash

Mounted at /content/drive
Processing ./spark_nlp-6.0.0-py2.py3-none-any.whl
Installing collected packages: spark-nlp
Successfully installed spark-nlp-6.0.0
Apache Spark version: 3.5.5


For the local-files example, we will download a couple of PDF files from the Spark NLP GitHub repository:

In [2]:
!mkdir pdf-files
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/pdf/pdf-title.pdf -P pdf-files
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/pdf/text_3_pages.pdf -P pdf-files

--2025-04-29 08:48:49--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/pdf/pdf-title.pdf
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25803 (25K) [application/octet-stream]
Saving to: ‘pdf-files/pdf-title.pdf’


2025-04-29 08:48:49 (11.9 MB/s) - ‘pdf-files/pdf-title.pdf’ saved [25803/25803]

--2025-04-29 08:48:49--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/pdf/text_3_pages.pdf
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 9487 (9.3K) 

## Parsing PDFs from Local Files
Use the `pdf()` method to parse PDF content from local directories.

In [4]:
import sparknlp

pdf_df = sparknlp.read().pdf("./pdf-files")
pdf_df.show()

+--------------------+--------------------+------+--------------------+----------------+---------------+-------+---------+-------+
|                path|    modificationTime|length|                text|height_dimension|width_dimension|content|exception|pagenum|
+--------------------+--------------------+------+--------------------+----------------+---------------+-------+---------+-------+
|file:/content/pdf...|2025-04-29 08:48:...| 25803|This is a Title \...|             842|            596|   NULL|     NULL|      0|
|file:/content/pdf...|2025-04-29 08:48:...|  9487|This is a page.\n...|             841|            595|   NULL|     NULL|      0|
+--------------------+--------------------+------+--------------------+----------------+---------------+-------+---------+-------+



In [5]:
pdf_df.printSchema()

root
 |-- path: string (nullable = true)
 |-- modificationTime: timestamp (nullable = true)
 |-- length: long (nullable = true)
 |-- text: string (nullable = true)
 |-- height_dimension: integer (nullable = true)
 |-- width_dimension: integer (nullable = true)
 |-- content: binary (nullable = true)
 |-- exception: string (nullable = true)
 |-- pagenum: integer (nullable = true)
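
The `text` and `exception` columns in this schema suggest a simple sanity check after reading. The sketch below is not part of the Spark NLP API: `split_ok_and_failed` is a hypothetical helper that works on plain dictionaries, which for a small DataFrame you could obtain with `[row.asDict() for row in pdf_df.collect()]`. It assumes that a null `exception` means the file parsed without error; the sample rows and their paths are made up for illustration.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def split_ok_and_failed(rows: List[Row]) -> Tuple[List[Row], List[Row]]:
    """Separate rows with a null `exception` from rows that carry an error."""
    ok = [r for r in rows if r.get("exception") is None]
    failed = [r for r in rows if r.get("exception") is not None]
    return ok, failed


# Illustrative rows (assumed values), shaped like a subset of the schema above.
sample_rows = [
    {"path": "file:/tmp/example/good.pdf", "text": "This is a Title \n",
     "exception": None, "pagenum": 0},
    {"path": "file:/tmp/example/broken.pdf", "text": None,
     "exception": "hypothetical parse error", "pagenum": 0},
]

ok_rows, failed_rows = split_ok_and_failed(sample_rows)
print(len(ok_rows), "parsed,", len(failed_rows), "failed")
```

For larger inputs, collecting to the driver is not a good idea; the same check can be expressed as a DataFrame filter on the `exception` column instead.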



You can also read from distributed file systems (DFS), for example:
- Databricks: `dbfs://`
- HDFS: `hdfs://`
- Microsoft Fabric OneLake: `abfss://`
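
As a rough sketch of how these prefixes might be used, the snippet below classifies a path by its URI scheme and then hands it to the same `sparknlp.read().pdf()` call shown earlier. The helper names `storage_kind` and `read_pdfs`, as well as the example host and directory, are hypothetical; whether a given scheme actually resolves depends on how your cluster is configured.

```python
from urllib.parse import urlparse

# Schemes taken from the list above; extend as needed for your environment.
KNOWN_DFS_SCHEMES = {"dbfs", "hdfs", "abfss"}


def storage_kind(path: str) -> str:
    """Return the DFS scheme of `path`, or 'local' / 'other' if it is not one."""
    scheme = urlparse(path).scheme
    if scheme in KNOWN_DFS_SCHEMES:
        return scheme
    return "local" if scheme in ("", "file") else "other"


def read_pdfs(path: str):
    """Read PDFs from `path` with the reader demonstrated in this notebook."""
    import sparknlp  # imported lazily so storage_kind works without Spark
    return sparknlp.read().pdf(path)


# Hypothetical paths, used only to show the classification.
print(storage_kind("hdfs://namenode:8020/data/pdf-files"))
print(storage_kind("./pdf-files"))
```

Calling `read_pdfs("hdfs://namenode:8020/data/pdf-files")` would then be expected to return a DataFrame with the same schema as the local example, assuming the cluster can reach that location.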

### Configuration Parameters

You can customize the behavior of the PDF reader by passing parameters to `sparknlp.read()`.

- `storeSplittedPdf`: Defaults to `false`. When set to `true`, the reader stores the byte content of the split PDF in the `content` column.

In [7]:
params = {"storeSplittedPdf": "true"}
pdf_df = sparknlp.read(params).pdf("./pdf-files")
pdf_df.show()

+--------------------+--------------------+------+--------------------+----------------+---------------+--------------------+---------+-------+
|                path|    modificationTime|length|                text|height_dimension|width_dimension|             content|exception|pagenum|
+--------------------+--------------------+------+--------------------+----------------+---------------+--------------------+---------+-------+
|file:/content/pdf...|2025-04-29 08:48:...| 25803|This is a Title \...|             842|            596|[25 50 44 46 2D 3...|     NULL|      0|
|file:/content/pdf...|2025-04-29 08:48:...|  9487|This is a page.\n...|             841|            595|[25 50 44 46 2D 3...|     NULL|      0|
+--------------------+--------------------+------+--------------------+----------------+---------------+--------------------+---------+-------+
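
The leading bytes shown in the `content` column, `25 50 44 46 2D`, are the ASCII codes for `%PDF-`, the signature that opens a PDF file. A minimal sketch of a check built on that observation follows; `looks_like_pdf` is a hypothetical helper, not a Spark NLP function, and the sample value contains only the five bytes visible in the truncated output above.

```python
from typing import Optional, Union

PDF_MAGIC = b"%PDF-"  # 25 50 44 46 2D in hexadecimal

BytesLike = Union[bytes, bytearray]


def looks_like_pdf(content: Optional[BytesLike]) -> bool:
    """Return True when `content` starts with the PDF signature."""
    return content is not None and bytes(content[:5]) == PDF_MAGIC


# Only the bytes visible in the output above; the rest is truncated there.
sample = bytes.fromhex("25 50 44 46 2D")

print(sample)
print(looks_like_pdf(sample))
print(looks_like_pdf(None))
```

When a binary column is collected in PySpark, the value typically arrives as a `bytearray`, which is why the helper converts the slice to `bytes` before comparing.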

