# Introducing PartitionTransformer in Spark NLP
Spark NLP Readers and `Partition` help build structured inputs for your downstream NLP tasks.

The new `PartitionTransformer` makes your current Spark NLP workflow smoother by letting you reuse your existing pipelines.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
!cp drive/MyDrive/JSL/sparknlp/sparknlp.jar .
!cp drive/MyDrive/JSL/sparknlp/spark_nlp-6.0.1-py2.py3-none-any.whl .

In [3]:
!pip install spark_nlp-6.0.1-py2.py3-none-any.whl

Processing ./spark_nlp-6.0.1-py2.py3-none-any.whl
Installing collected packages: spark-nlp
Successfully installed spark-nlp-6.0.1


In [4]:
# import sparknlp
# # let's start Spark with Spark NLP
# spark = sparknlp.start()

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("SparkNLP") \
    .master("local[*]") \
    .config("spark.driver.memory", "12G") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.kryoserializer.buffer.max", "2000M") \
    .config("spark.driver.maxResultSize", "0") \
    .config("spark.jars", "./sparknlp.jar") \
    .getOrCreate()


print("Apache Spark version: {}".format(spark.version))

Apache Spark version: 3.5.1



## Setup and Initialization
Let's keep in mind a few things before we start 😊

Support for **PartitionTransformer** was introduced in Spark NLP 6.0.2. Please make sure you have upgraded to the latest Spark NLP release.
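
One quick way to check the version requirement above is to compare version strings. The sketch below is plain Python and does not need a Spark session; the helper names `parse_version` and `supports_partition_transformer` are made up for this example and are not part of Spark NLP.

```python
import re

def parse_version(version):
    """Turn a string such as '6.0.2' into a comparable tuple (6, 0, 2)."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)  # keep only the leading digits of each piece
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 3:  # pad short versions such as '6.1'
        parts.append(0)
    return tuple(parts)

def supports_partition_transformer(version, minimum="6.0.2"):
    """Hypothetical helper: True if `version` is at least `minimum`."""
    return parse_version(version) >= parse_version(minimum)

print(supports_partition_transformer("6.0.1"))
print(supports_partition_transformer("6.0.2"))
```

In a live session you would pass the string returned by `sparknlp.version()`. A locally built wheel, like the one installed earlier in this notebook, may report a lower version number while already containing the feature, so treat the result as a hint and not as a guarantee.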

For the local-files example, we will download a couple of HTML files from the Spark NLP GitHub repository:

Downloading HTML files

In [5]:
!mkdir html-files
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1174-Adding-PartitionTransformer/src/test/resources/reader/html/example-10k.html -P html-files
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1174-Adding-PartitionTransformer/src/test/resources/reader/html/fake-html.html -P html-files

--2025-05-24 14:57:07--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1174-Adding-PartitionTransformer/src/test/resources/reader/html/example-10k.html
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2456707 (2.3M) [text/plain]
Saving to: ‘html-files/example-10k.html’


2025-05-24 14:57:08 (31.6 MB/s) - ‘html-files/example-10k.html’ saved [2456707/2456707]

--2025-05-24 14:57:08--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1174-Adding-PartitionTransformer/src/test/resources/reader/html/fake-html.html
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.

## Partitioning Documents

`PartitionTransformer` outputs a different schema than `Partition`: its output column follows the standard Spark NLP Annotation schema.

In [6]:
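from pyspark.ml import Pipeline
from sparknlp.partition.partition_transformer import PartitionTransformer

# The input DataFrame is empty: the documents are read from the content path set below
empty_df = spark.createDataFrame([], "string").toDF("text")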

partition_transformer = PartitionTransformer() \
    .setInputCols(["text"]) \
    .setContentType("text/html") \
    .setContentPath("./html-files") \
    .setOutputCol("partition")

pipeline = Pipeline(stages=[
    partition_transformer
])

pipeline_model = pipeline.fit(empty_df)
result_df = pipeline_model.transform(empty_df)

result_df.show()

+--------------------+--------------------+--------------------+--------------------+
|                path|             content|                text|           partition|
+--------------------+--------------------+--------------------+--------------------+
|file:/content/htm...|<?xml  version="1...|[{Title, UNITED S...|[{document, 0, 12...|
|file:/content/htm...|<!DOCTYPE html>\n...|[{Title, My First...|[{document, 0, 15...|
+--------------------+--------------------+--------------------+--------------------+



In [11]:
result_df.printSchema()

root
 |-- path: string (nullable = true)
 |-- content: string (nullable = true)
 |-- text: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- elementType: string (nullable = true)
 |    |    |-- content: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |-- partition: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)



You can integrate `PartitionTransformer` directly into your existing Spark NLP pipelines.

In [15]:
text = (
    "The big brown fox\n"
    "was walking down the lane.\n"
    "\n"
    "At the end of the lane,\n"
    "the fox met a bear."
)

testDataSet = spark.createDataFrame(
    [(text,)],
    ["text"]
)

In [17]:
from pyspark.ml import Pipeline
from sparknlp import DocumentAssembler

emptyDataSet = spark.createDataFrame([], testDataSet.schema)

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

partition = PartitionTransformer() \
    .setInputCols(["document"]) \
    .setOutputCol("partition") \
    .setGroupBrokenParagraphs(True)

pipeline = Pipeline(stages=[documentAssembler, partition])
pipelineModel = pipeline.fit(emptyDataSet)

In [18]:
resultDf = pipelineModel.transform(testDataSet)
resultDf.select("partition").show(truncate=False)

+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|partition                                                                                                                                                                    |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{document, 0, 43, The big brown fox was walking down the lane., {paragraph -> 0}, []}, {document, 0, 42, At the end of the lane, the fox met a bear., {paragraph -> 0}, []}]|
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
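
To make the output above easier to read, here is a minimal plain-Python sketch of the grouping it shows: lines separated by a single newline are joined into one paragraph, and a blank line starts a new one. This is an illustration under that assumption, not Spark NLP's actual implementation; the function names and the dictionary layout are made up for the example.

```python
import re

def group_broken_paragraphs(text):
    """Split on blank lines, then join each paragraph's wrapped lines with spaces."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    return [
        " ".join(line.strip() for line in paragraph.splitlines() if line.strip())
        for paragraph in paragraphs
        if paragraph.strip()
    ]

def to_annotations(paragraphs):
    """Build plain dicts that mimic a few of the annotation fields shown above.

    Offsets restart at 0 for every paragraph, as in the notebook output.
    """
    return [
        {"annotatorType": "document", "begin": 0, "end": len(p) - 1, "result": p}
        for p in paragraphs
    ]

text = (
    "The big brown fox\n"
    "was walking down the lane.\n"
    "\n"
    "At the end of the lane,\n"
    "the fox met a bear."
)

paragraphs = group_broken_paragraphs(text)
annotations = to_annotations(paragraphs)
for annotation in annotations:
    print(annotation["begin"], annotation["end"], annotation["result"])
```

With the sample text from this notebook, the sketch yields two paragraphs whose `end` offsets are 43 and 42, matching the `partition` column above.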

