# Introducing Partition with Semantic Chunking in Spark NLP
This notebook showcases the newly added `Partition` component in Spark NLP,
which provides a streamlined, user-friendly interface for working with Spark NLP readers.

## Setup and Initialization
Let's keep in mind a few things before we start 😊

Support for **partitioning** files was introduced in Spark NLP 6.0.1.

Chunking support was added in Spark NLP 6.0.3.
Please make sure you have upgraded to the latest Spark NLP release.

- Let's install and set up Spark NLP in Google Colab. This part is easy with our setup script.

In [None]:
! wget -q http://setup.johnsnowlabs.com/colab.sh -O - | bash

For the local-file examples, we will download a few files from the Spark NLP GitHub repo:

**Downloading Files**

In [7]:
!mkdir txt-files
!mkdir html-files

In [8]:
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1125-Implement-Chunking-Strategies/src/test/resources/reader/txt/long-text.txt -P txt-files
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1125-Implement-Chunking-Strategies/src/test/resources/reader/html/fake-html.html -P html-files

--2025-06-06 15:19:01--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1125-Implement-Chunking-Strategies/src/test/resources/reader/txt/long-text.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1032 (1.0K) [text/plain]
Saving to: ‘txt-files/long-text.txt’


2025-06-06 15:19:01 (58.1 MB/s) - ‘txt-files/long-text.txt’ saved [1032/1032]

--2025-06-06 15:19:01--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1125-Implement-Chunking-Strategies/src/test/resources/reader/html/fake-html.html
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... conne

## Partitioning Documents with Chunking
Use the `basic` chunking strategy to segment the data into coherent chunks based on character limits.

In [9]:
from sparknlp.partition.partition import Partition

partition_df = Partition(content_type="text/plain", chunking_strategy="basic").partition("./txt-files/long-text.txt")



Output without `basic` chunking:

In [10]:
from pyspark.sql.functions import explode, col

result_df = partition_df.select(explode(col("txt.content")))
result_df.show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Output with `basic` chunking:

In [11]:
result_df = partition_df.select(explode(col("chunks.content")))
result_df.show(truncate=False)

+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|col                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
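To make the idea of character-limit chunking concrete, here is a minimal pure-Python sketch. It is not Spark NLP's implementation: the function name `basic_chunks` and the parameter `max_characters` are hypothetical, and the sketch assumes that chunks are packed greedily on whitespace boundaries.

```python
from typing import List


def basic_chunks(text: str, max_characters: int = 100) -> List[str]:
    """Greedily pack whitespace-separated words into chunks.

    Hypothetical helper for illustration only; it does not call Spark NLP.
    Each chunk holds at most max_characters characters, except when a
    single word is longer than the limit (it then forms its own chunk).
    """
    chunks: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_characters or not current:
            # The word still fits, or it is the first word of a new chunk
            current = candidate
        else:
            # Limit reached: close the current chunk and start a new one
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


example = basic_chunks("one two three four five", max_characters=9)
print(example)
```

Splitting on whitespace keeps words intact, which is one simple way to keep chunks readable; the real component may use different boundaries and limits.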


Use `by_title` chunking to group sections in documents that contain headings, tables, and mixed semantic elements.

In [12]:
partition_df = Partition(content_type="text/html", chunking_strategy="by_title", combineTextUnderNChars=50).partition("./html-files/fake-html.html")



Output without `by_title` chunking:

In [13]:
result_df = partition_df.select(explode(col("html.content")))
result_df.show(truncate=False)

+----------------------------------------------------------------------------------------------------------------------------------+
|col                                                                                                                               |
+----------------------------------------------------------------------------------------------------------------------------------+
|My First Heading                                                                                                                  |
|My Second Heading                                                                                                                 |
|My first paragraph. lorem ipsum dolor set amet. if the cow comes home under the sun how do you fault the cow for it's worn hooves?|
|A Third Heading                                                                                                                   |
|Column 1 Column 2 Row 1, Cell 1 Row 1, Cell 2 Row 2, Cell 1 Row 2, C

Output with `by_title` chunking:

In [14]:
result_df = partition_df.select(explode(col("chunks.content")))
result_df.show(truncate=False)

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|col                                                                                                                                                                                  |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|My First Heading My Second Heading My first paragraph. lorem ipsum dolor set amet. if the cow comes home under the sun how do you fault the cow for it's worn hooves? A Third Heading|
|Column 1 Column 2 Row 1, Cell 1 Row 1, Cell 2 Row 2, Cell 1 Row 2, Cell 2                                                                                                            |
+-------------------------------------------------------------------------------
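The core idea behind title-based grouping can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not Spark NLP's algorithm: the function `chunk_by_title` and the element type labels (`"title"`, `"text"`, `"table"`) are hypothetical, and the sketch simply starts a new chunk at every title.

```python
from typing import List, Tuple


def chunk_by_title(elements: List[Tuple[str, str]]) -> List[str]:
    """Group (element_type, text) pairs into chunks, one per title.

    Hypothetical helper for illustration only; it ignores options such as
    combineTextUnderNChars and gives tables no special treatment.
    """
    groups: List[List[str]] = []
    for element_type, text in elements:
        if element_type == "title" or not groups:
            # A title (or the very first element) opens a new chunk
            groups.append([text])
        else:
            # Everything else is attached to the chunk opened by the last title
            groups[-1].append(text)
    return [" ".join(parts) for parts in groups]


elements = [
    ("title", "My First Heading"),
    ("title", "My Second Heading"),
    ("text", "My first paragraph."),
    ("title", "A Third Heading"),
    ("table", "Column 1 Column 2"),
]
for chunk in chunk_by_title(elements):
    print(chunk)
```

The notebook's actual output differs from this sketch: there, short sections are combined into one chunk and the table stands alone, so treat the code only as the basic grouping idea.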

You can also read from distributed file systems (DFS) such as:
- Databricks: `dbfs://`
- HDFS: `hdfs://`
- Microsoft Fabric OneLake: `abfss://`
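
As a rough sketch of how a remote path could be passed to the same `Partition` call used earlier, the snippet below wraps it in a helper. The helper names, the scheme tuple, and the example HDFS path are hypothetical; the tuple only mirrors the schemes listed above, and other schemes may work as well. Running `partition_text` assumes Spark NLP is installed and your cluster can reach the path.

```python
from urllib.parse import urlparse

# Schemes named in the list above (assumption: this sketch covers only these)
DFS_SCHEMES = ("dbfs", "hdfs", "abfss")


def is_listed_dfs_path(path: str) -> bool:
    """Return True if the path uses one of the schemes listed above."""
    return urlparse(path).scheme in DFS_SCHEMES


def partition_text(path: str):
    """Partition a plain-text file with `basic` chunking.

    The import is done lazily so that is_listed_dfs_path can be used
    even where Spark NLP is not installed.
    """
    from sparknlp.partition.partition import Partition

    return Partition(content_type="text/plain", chunking_strategy="basic").partition(path)


# Hypothetical path, shown only to illustrate the URI shape
remote_path = "hdfs://namenode:8020/data/long-text.txt"
print(is_listed_dfs_path(remote_path))
```

With a reachable path, `partition_text(remote_path)` would return the same kind of DataFrame as the local example, and the `chunks.content` column can be exploded in the same way.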