# Introducing Partition in Spark NLP
This notebook showcases the newly added `Partition` component in Spark NLP,
which provides a streamlined, user-friendly interface for interacting with Spark NLP readers.

## Setup and Initialization
Let's keep in mind a few things before we start 😊

Support for **Partitioning** files was introduced in Spark NLP 6.0.1. Please make sure you have upgraded to the latest Spark NLP release.
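A quick way to confirm the requirement above is to compare the installed version against 6.0.1. The sketch below is only an illustration: `parse_version` and `supports_partition` are hypothetical helper names, and it assumes `sparknlp.version()` returns a dotted version string such as `6.0.1`.

```python
import re

# Minimum release named in this notebook for Partition support.
MIN_PARTITION_VERSION = (6, 0, 1)


def parse_version(version):
    """Turn '6.0.1' into (6, 0, 1); suffixes such as 'rc1' are ignored."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    parts += [0] * (3 - len(parts))  # pad short versions like '6.1'
    return tuple(parts)


def supports_partition(version):
    """True when the given version is at least 6.0.1."""
    return parse_version(version) >= MIN_PARTITION_VERSION


if __name__ == "__main__":
    try:
        import sparknlp

        installed = sparknlp.version()
        print(installed, "supports Partition:", supports_partition(installed))
    except ImportError:
        print("Spark NLP is not installed in this environment")
```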

For the local-file examples, we will download several sample files from the Spark NLP GitHub repository:

**Downloading HTML files**

In [7]:
!mkdir html-files
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/html/example-10k.html -P html-files
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/html/fake-html.html -P html-files

--2025-04-30 22:09:44--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1116-Adding-Partitioning-Documents-to-SparkNLP/src/test/resources/reader/html/example-10k.html
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2456707 (2.3M) [text/plain]
Saving to: ‘html-files/example-10k.html’


2025-04-30 22:09:44 (165 MB/s) - ‘html-files/example-10k.html’ saved [2456707/2456707]

--2025-04-30 22:09:45--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1116-Adding-Partitioning-Documents-to-SparkNLP/src/test/resources/reader/html/fake-html.html
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubus

**Downloading PDF files**

In [8]:
!mkdir pdf-files
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/pdf/image_3_pages.pdf -P pdf-files
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/pdf/pdf-title.pdf -P pdf-files
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/pdf/text_3_pages.pdf -P pdf-files

--2025-04-30 22:09:45--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1116-Adding-Partitioning-Documents-to-SparkNLP/src/test/resources/reader/pdf/image_3_pages.pdf
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 15629 (15K) [application/octet-stream]
Saving to: ‘pdf-files/image_3_pages.pdf’


2025-04-30 22:09:45 (73.4 MB/s) - ‘pdf-files/image_3_pages.pdf’ saved [15629/15629]

--2025-04-30 22:09:45--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1116-Adding-Partitioning-Documents-to-SparkNLP/src/test/resources/reader/pdf/pdf-title.pdf
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.gi

**Downloading Word files**

In [9]:
!mkdir word-files
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/doc/contains-pictures.docx -P word-files
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/doc/fake_table.docx -P word-files
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/doc/page-breaks.docx -P word-files

--2025-04-30 22:09:46--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1116-Adding-Partitioning-Documents-to-SparkNLP/src/test/resources/reader/doc/contains-pictures.docx
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 95087 (93K) [application/octet-stream]
Saving to: ‘word-files/contains-pictures.docx’


2025-04-30 22:09:46 (22.7 MB/s) - ‘word-files/contains-pictures.docx’ saved [95087/95087]

--2025-04-30 22:09:46--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1116-Adding-Partitioning-Documents-to-SparkNLP/src/test/resources/reader/doc/fake_table.docx
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubuser

**Downloading Excel files**

In [10]:
!mkdir excel-files
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/xls/vodafone.xlsx -P excel-files
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/xls/2023-half-year-analyses-by-segment.xlsx -P excel-files
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/xls/page-break-example.xlsx -P excel-files
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/xls/xlsx-subtable-cases.xlsx -P excel-files

--2025-04-30 22:09:47--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1116-Adding-Partitioning-Documents-to-SparkNLP/src/test/resources/reader/xls/vodafone.xlsx
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12541 (12K) [application/octet-stream]
Saving to: ‘excel-files/vodafone.xlsx’


2025-04-30 22:09:47 (74.1 MB/s) - ‘excel-files/vodafone.xlsx’ saved [12541/12541]

--2025-04-30 22:09:47--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1116-Adding-Partitioning-Documents-to-SparkNLP/src/test/resources/reader/xls/2023-half-year-analyses-by-segment.xlsx
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubuserc

In [11]:
!cp drive/MyDrive/JSL/PageBreakExample.xlsx ./excel-files

**Downloading PowerPoint files**

In [12]:
!mkdir ppt-files
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/ppt/fake-power-point.pptx -P ppt-files
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/ppt/fake-power-point-table.pptx -P ppt-files
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/ppt/speaker-notes.pptx -P ppt-files

--2025-04-30 22:09:48--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1116-Adding-Partitioning-Documents-to-SparkNLP/src/test/resources/reader/ppt/fake-power-point.pptx
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 38412 (38K) [application/octet-stream]
Saving to: ‘ppt-files/fake-power-point.pptx’


2025-04-30 22:09:48 (16.0 MB/s) - ‘ppt-files/fake-power-point.pptx’ saved [38412/38412]

--2025-04-30 22:09:48--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1116-Adding-Partitioning-Documents-to-SparkNLP/src/test/resources/reader/ppt/fake-power-point-table.pptx
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.git

**Downloading Email files**

In [13]:
!mkdir email-files
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/email/email-text-attachments.eml -P email-files
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/email/test-several-attachments.eml -P email-files

--2025-04-30 22:09:49--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1116-Adding-Partitioning-Documents-to-SparkNLP/src/test/resources/reader/email/email-text-attachments.eml
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3175 (3.1K) [text/plain]
Saving to: ‘email-files/email-text-attachments.eml’


2025-04-30 22:09:49 (39.7 MB/s) - ‘email-files/email-text-attachments.eml’ saved [3175/3175]

--2025-04-30 22:09:49--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1116-Adding-Partitioning-Documents-to-SparkNLP/src/test/resources/reader/email/test-several-attachments.eml
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to

**Downloading Text files**

In [14]:
!mkdir txt-files
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/txt/simple-text.txt -P txt-files

--2025-04-30 22:09:49--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1116-Adding-Partitioning-Documents-to-SparkNLP/src/test/resources/reader/txt/simple-text.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 300 [text/plain]
Saving to: ‘txt-files/simple-text.txt’


2025-04-30 22:09:50 (5.08 MB/s) - ‘txt-files/simple-text.txt’ saved [300/300]
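The `mkdir` and `wget` cells above fail or re-download when the notebook is run a second time. As a hedged alternative, the following sketch does the same job with the Python standard library. `BASE_URL` and the file names are copied from the cells above; `build_url`, `download_samples`, and `SAMPLE_FILES` are hypothetical names, and only a subset of the files is listed to keep the example short.

```python
import os
import urllib.request

BASE_URL = (
    "https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/"
    "master/src/test/resources/reader"
)

# local directory -> (repository subfolder, file names), taken from the wget cells
SAMPLE_FILES = {
    "html-files": ("html", ["example-10k.html", "fake-html.html"]),
    "txt-files": ("txt", ["simple-text.txt"]),
    "email-files": (
        "email",
        ["email-text-attachments.eml", "test-several-attachments.eml"],
    ),
}


def build_url(subfolder, filename):
    """Compose the raw GitHub URL for one sample file."""
    return f"{BASE_URL}/{subfolder}/{filename}"


def download_samples(files=SAMPLE_FILES):
    """Download every listed file, skipping those that already exist locally."""
    for local_dir, (subfolder, names) in files.items():
        os.makedirs(local_dir, exist_ok=True)  # safe to call on re-runs
        for name in names:
            target = os.path.join(local_dir, name)
            if not os.path.exists(target):
                urllib.request.urlretrieve(build_url(subfolder, name), target)
```

Calling `download_samples()` once fetches the files; nothing is downloaded when the block is merely defined.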



## Partitioning Documents
Use the `Partition` component to parse content from local files and directories.

In [16]:
from sparknlp.partition.partition import Partition

partition_df = Partition().partition("./txt-files/simple-text.txt")
partition_df.show()

+--------------------+--------------------+
|                path|                 txt|
+--------------------+--------------------+
|file:/content/txt...|[{Title, BIG DATA...|
+--------------------+--------------------+
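The content column holds an array of elements, each printed as a type (`Title`, `NarrativeText`, ...) followed by its text. A small helper like the one below can summarize a document by element type. It is a sketch under an assumption: the struct field is taken to be named `elementType`, which the truncated output above does not confirm, so check `partition_df.printSchema()` first. `count_element_types` is a hypothetical name.

```python
from collections import Counter


def count_element_types(elements):
    """Count elements per type; assumes each element exposes `.elementType`."""
    return Counter(element.elementType for element in elements)


# Usage in this notebook (needs the DataFrame from the cell above):
# row = partition_df.select("txt").first()
# print(count_element_types(row["txt"]))
```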



Partitioning a Word Document

In [17]:
partition_df = Partition().partition("./word-files/fake_table.docx")
partition_df.show()

+--------------------+--------------------+
|                path|                 doc|
+--------------------+--------------------+
|file:/content/wor...|[{Table, Header C...|
+--------------------+--------------------+



Partitioning an Excel Document

In [18]:
partition_df = Partition().partition("./excel-files/vodafone.xlsx")
partition_df.show()



Partitioning a PowerPoint Document

In [19]:
partition_df = Partition().partition("./ppt-files/fake-power-point.pptx")
partition_df.show()

+--------------------+--------------------+
|                path|                 ppt|
+--------------------+--------------------+
|file:/content/ppt...|[{Title, Adding a...|
+--------------------+--------------------+



Partitioning an Email Document

In [20]:
partition_df = Partition().partition("./email-files/test-several-attachments.eml")
partition_df.show()

+--------------------+--------------------+
|                path|               email|
+--------------------+--------------------+
|file:/content/ema...|[{Title, Test Sev...|
+--------------------+--------------------+



Partitioning an HTML Document

In [21]:
partition_df = Partition().partition("./html-files/fake-html.html")
partition_df.show()

+--------------------+--------------------+
|                path|                html|
+--------------------+--------------------+
|file:/content/htm...|[{Title, My First...|
+--------------------+--------------------+



Partitioning a PDF Document

In [22]:
partition_df = Partition().partition("./pdf-files/text_3_pages.pdf")
partition_df.show()

+--------------------+--------------------+------+--------------------+----------------+---------------+-------+---------+-------+
|                path|    modificationTime|length|                text|height_dimension|width_dimension|content|exception|pagenum|
+--------------------+--------------------+------+--------------------+----------------+---------------+-------+---------+-------+
|file:/content/pdf...|2025-04-30 22:09:...|  9487|This is a page.\n...|             841|            595|   NULL|     NULL|      0|
+--------------------+--------------------+------+--------------------+----------------+---------------+-------+---------+-------+
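Judging from the outputs in this notebook, the name of the content column depends on the file type (`txt`, `doc`, `xls`, `ppt`, `email`, `html`), while PDFs return a wider schema whose extracted text is in `text`. When writing generic code, a lookup such as the following can pick the right column. The mapping is inferred from the outputs shown here, not from the library's documentation, and `content_column` is a hypothetical helper.

```python
import os

# File extension -> content column, as observed in this notebook's outputs.
CONTENT_COLUMN = {
    ".txt": "txt",
    ".docx": "doc",
    ".xlsx": "xls",
    ".pptx": "ppt",
    ".eml": "email",
    ".html": "html",
    ".pdf": "text",
}


def content_column(path):
    """Return the column expected to hold the parsed content for `path`."""
    extension = os.path.splitext(path)[1].lower()
    if extension not in CONTENT_COLUMN:
        raise ValueError(f"No known content column for extension {extension!r}")
    return CONTENT_COLUMN[extension]


# Usage in this notebook:
# path = "./ppt-files/fake-power-point.pptx"
# Partition().partition(path).select(content_column(path)).show()
```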



You can also read from distributed file systems (DFS) such as:
- Databricks: `dbfs://`
- HDFS: `hdfs://`
- Microsoft Fabric OneLake: `abfss://`

### Configuration Parameters

The `Partition` feature lets you extract content from various file formats and customize the extraction through keyword arguments (`kwargs`).

| Kwargs Option             | Document Type | Usage                                                                      |
|:-------------------------:|:-------------:|:--------------------------------------------------------------------------:|
| `content_type`            | All           | Override automatic file detection                                          |
| `timeout`                 | HTML          | Set the maximum wait time (in seconds) for URL requests                    |
| `include_page_breaks`     | Word, Excel   | Preserve page boundaries (page breaks) in the output                       |
| `group_broken_paragraphs` | Text          | Merge lines into paragraphs, treating blank lines as paragraph separators  |
| `include_slide_notes`     | PowerPoint    | Include speaker notes                                                      |
| `infer_table_structure`   | Excel         | Include an HTML version of the table                                       |
| `append_cells`            | Excel         | Return all the data in one row instead of partitioning by cell             |

One important customization is setting the content type explicitly with the `content_type` parameter. This bypasses automatic file identification and processes the files according to the given [MIME](https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/MIME_types) type.

In [23]:
partition_df = Partition(content_type = "application/msword").partition("./word-files")
partition_df.show()

+--------------------+--------------------+
|                path|                 doc|
+--------------------+--------------------+
|file:/content/wor...|[{Header, An inli...|
|file:/content/wor...|[{Table, Header C...|
|file:/content/wor...|[{NarrativeText, ...|
+--------------------+--------------------+



In [24]:
partition_df = Partition(content_type = "application/pdf").partition("./pdf-files")
partition_df.show()

+--------------------+--------------------+------+--------------------+----------------+---------------+-------+---------+-------+
|                path|    modificationTime|length|                text|height_dimension|width_dimension|content|exception|pagenum|
+--------------------+--------------------+------+--------------------+----------------+---------------+-------+---------+-------+
|file:/content/pdf...|2025-04-30 22:09:...| 25803|This is a Title \...|             842|            596|   NULL|     NULL|      0|
|file:/content/pdf...|2025-04-30 22:09:...| 15629|              \n\n\n|             841|            595|   NULL|     NULL|      0|
|file:/content/pdf...|2025-04-30 22:09:...|  9487|This is a page.\n...|             841|            595|   NULL|     NULL|      0|
+--------------------+--------------------+------+--------------------+----------------+---------------+-------+---------+-------+



In [25]:
partition_df = Partition(content_type = "application/vnd.ms-excel").partition("./excel-files/PageBreakExample.xlsx")
partition_df.show(truncate=False)

+-----------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|path                                           |xls                                                                                                                                                                                                                                                                                                 
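Since `content_type` applies to a whole `Partition` call, a mixed list of files has to be split by type first. The sketch below groups paths by the MIME strings used in the three cells above. `NOTEBOOK_CONTENT_TYPES` and `group_by_content_type` are hypothetical names, and only the MIME types that appear in this notebook are listed; files with other extensions are skipped.

```python
import os
from collections import defaultdict

# Extension -> MIME string, exactly as passed to `content_type` in this notebook.
NOTEBOOK_CONTENT_TYPES = {
    ".docx": "application/msword",
    ".pdf": "application/pdf",
    ".xlsx": "application/vnd.ms-excel",
}


def group_by_content_type(paths):
    """Group file paths by MIME type, ignoring unknown extensions."""
    groups = defaultdict(list)
    for path in paths:
        extension = os.path.splitext(path)[1].lower()
        mime = NOTEBOOK_CONTENT_TYPES.get(extension)
        if mime is not None:
            groups[mime].append(path)
    return dict(groups)


# Usage in this notebook:
# for mime, files in group_by_content_type(my_paths).items():
#     for path in files:
#         Partition(content_type=mime).partition(path).show()
```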

The `timeout` parameter lets you define how long to wait (in seconds) for a response when fetching HTML content, preventing long stalls on slow or unresponsive sites.

In [26]:
partition_df = Partition(timeout = 1).partition("https://www.blizzard.com")
partition_df.show(truncate=False)

+------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

For Word and Excel documents, use `include_page_breaks` to preserve structural information such as page boundaries in the output. The two cells below pass the option to an Excel file, first in snake_case and then in camelCase (`includePageBreaks`).

In [27]:
partition_df = Partition(content_type = "application/vnd.ms-excel", include_page_breaks = True).partition("./excel-files/PageBreakExample.xlsx")
partition_df.show(truncate=False)

+-----------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [28]:
partition_df = Partition(content_type = "application/vnd.ms-excel", includePageBreaks = True).partition("./excel-files/PageBreakExample.xlsx")
partition_df.show(truncate=False)

+-----------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

When parsing plain text files, `group_broken_paragraphs` can be enabled to merge lines that belong to the same paragraph, treating only blank lines as true paragraph breaks.

In [29]:
text = (
    "The big brown fox\n"
    "was walking down the lane.\n"
    "\n"
    "At the end of the lane,\n"
    "the fox met a bear."
)

In [30]:
text_df = Partition(group_broken_paragraphs=True).partition_text(text = text)
text_df.show(truncate=False)

+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|txt                                                                                                                                                              |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{NarrativeText, The big brown fox was walking down the lane., {paragraph -> 0}}, {NarrativeText, At the end of the lane, the fox met a bear., {paragraph -> 0}}]|
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
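To make the behavior concrete, here is a plain-Python sketch of the idea: split on blank lines, then join the remaining line breaks inside each paragraph with spaces. It illustrates the concept only and is not Spark NLP's implementation; `merge_broken_paragraphs` is a hypothetical name.

```python
import re


def merge_broken_paragraphs(text):
    """Split on blank lines and re-join the lines inside each paragraph."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    merged = []
    for paragraph in paragraphs:
        lines = [line.strip() for line in paragraph.splitlines() if line.strip()]
        if lines:
            merged.append(" ".join(lines))
    return merged


sample = (
    "The big brown fox\n"
    "was walking down the lane.\n"
    "\n"
    "At the end of the lane,\n"
    "the fox met a bear."
)
print(merge_broken_paragraphs(sample))
# ['The big brown fox was walking down the lane.',
#  'At the end of the lane, the fox met a bear.']
```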



For PowerPoint files, the `include_slide_notes` flag ensures that speaker notes from each slide are extracted and included in the output.

In [31]:
partition_df = Partition(include_slide_notes = True).partition("./ppt-files/speaker-notes.pptx")
partition_df.show(truncate=False)

+------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|path                                      |ppt                                                                                                                                                                                                                                                                                                                      |
+------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

For Excel files, enabling `infer_table_structure` makes `Partition` generate an HTML representation of each table, which is useful for downstream parsing or display.

In [32]:
partition_df = Partition(infer_table_structure = True).partition("./excel-files/page-break-example.xlsx")
partition_df.select("xls").show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

With Excel inputs, set `append_cells` to concatenate all cell values into a single element instead of returning each cell separately.

In [33]:
partition_df = Partition(append_cells = True).partition("./excel-files/xlsx-subtable-cases.xlsx")
partition_df.select("xls").show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|xls                                                                                                                                                                                                                                                      |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{NarrativeText, a\tb\nc\td\te\n- f\na\nb\tc\nd\te\na\nb\nc\td\ne\tf\na\tb\nc\td\n2. e\na\tb\nc\td\ne\nf\na\nb\tc\nd\te\nf\na\nb\nc\td\ne\tf\ng\na\nb\tc\nd\te\nf\ng\na\nb\nc\td\ne\tf\ng\nh\na\tb\tc\na\nb\tc\td\na\tb\tc\nd\ne, {SheetName -> She
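The appended content above appears to separate cells with tabs and rows with newlines. Under that assumption, which is read off the printed output rather than taken from the library's documentation, the single string can be turned back into rows as sketched below. `rows_from_appended` is a hypothetical helper.

```python
def rows_from_appended(content):
    """Split appended cell text into rows (newline) and cells (tab)."""
    return [line.split("\t") for line in content.split("\n") if line]


# First rows of the output shown above.
print(rows_from_appended("a\tb\nc\td\te\n- f"))
# [['a', 'b'], ['c', 'd', 'e'], ['- f']]
```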

### Headers for URLs

Another available parameter is `headers`. This is currently used when a URL is provided, allowing you to set the necessary headers for the request. It can be useful in scenarios such as requesting web pages in a specific language or when authentication is required, for example by passing a Bearer token.



In [34]:
partition_df = Partition().partition("https://www.blizzard.com", headers = {"Accept-Language": "es-ES"})
partition_df.show(truncate=False)

+------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
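For the two scenarios mentioned above (language selection and Bearer authentication), the headers dictionary can be assembled with a small helper. This is a sketch: `build_headers` is a hypothetical name, the token value is a placeholder, and `Authorization: Bearer <token>` is the standard HTTP form rather than anything specific to Spark NLP.

```python
def build_headers(language=None, bearer_token=None):
    """Build a request-headers dict, adding only the values that are provided."""
    headers = {}
    if language:
        headers["Accept-Language"] = language
    if bearer_token:
        headers["Authorization"] = f"Bearer {bearer_token}"
    return headers


print(build_headers(language="es-ES"))
# {'Accept-Language': 'es-ES'}

# Usage in this notebook, mirroring the cell above:
# headers = build_headers(language="es-ES", bearer_token="<your-token>")
# Partition().partition("https://www.blizzard.com", headers=headers)
```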