![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/annotation/text/english/DocumentTokenSplitter.ipynb)

## Colab + Data Setup

In [10]:
!wget -q http://setup.johnsnowlabs.com/colab.sh -O - | bash

Installing PySpark 3.2.3 and Spark NLP 5.2.2
setup Colab for PySpark 3.2.3 and Spark NLP 5.2.2


In [11]:
!wget https://github.com/JohnSnowLabs/spark-nlp/blob/587f79020de7bc09c2b2fceb37ec258bad57e425/src/test/resources/spell/sherlockholmes.txt > /dev/null 2>&1

## Create a Spark NLP Pipeline with DocumentTokenSplitter

In [12]:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

spark = sparknlp.start()

print(f"Spark NLP version {sparknlp.version()}\nApache Spark version: {spark.version}")

Spark NLP version 5.2.2
Apache Spark version: 3.2.3


In [13]:
textDF = spark.read.text(
    "sherlockholmes.txt",
    wholetext=True
).toDF("text")

In [14]:
DocumentTokenSplitter

sparknlp.annotator.document_token_splitter.DocumentTokenSplitter

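Let's create a Spark NLP pipeline with two stages: a `DocumentAssembler` that turns the raw text into a document annotation, and a `DocumentTokenSplitter` that cuts that document into splits of 512 tokens, each overlapping the previous one by 10 tokens: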

In [15]:
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

textSplitter = DocumentTokenSplitter() \
    .setInputCols(["document"]) \
    .setOutputCol("splits") \
    .setNumTokens(512) \
    .setTokenOverlap(10) \
    .setExplodeSplits(True)

pipeline = Pipeline().setStages([documentAssembler, textSplitter])
result = pipeline.fit(textDF).transform(textDF)

result.selectExpr(
      "splits.result as result",
      "splits[0].begin as begin",
      "splits[0].end as end",
      "splits[0].end - splits[0].begin as length",
      "splits[0].metadata.numTokens as tokens") \
    .show(8, truncate = 80)

+--------------------------------------------------------------------------------+-----+-----+------+------+
|                                                                          result|begin|  end|length|tokens|
+--------------------------------------------------------------------------------+-----+-----+------+------+
|[{"payload":{"allShortcutsEnabled":false,"fileTree":{"src/test/resources/spel...|    0|11335| 11335|   512|
|[the case of the Trepoff murder, of his clearing up","of the singular tragedy...|11280|14436|  3156|   512|
|[order to remove crusted mud from it.","Hence, you see, my double deduction t...|14379|17697|  3318|   512|
|[a \"P,\" and a","large \"G\" with a small \"t\" woven into the texture of th...|17644|20993|  3349|   512|
|[which he had apparently adjusted that very moment,","for his hand was still ...|20928|24275|  3347|   512|
|[his high white forehead, \"you","can understand that I am not accustomed to ...|24214|27991|  3777|   512|
|[send it on the da
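
The `begin` and `end` columns in the table above show that each split starts before the previous one ends, which is the visible effect of `setTokenOverlap(10)`. As a quick sanity check, the plain-Python sketch below copies the six complete rows from that table and computes how many characters each split shares with its predecessor. The variable names are ours, not part of Spark NLP, and the sketch assumes `end` can be treated as an exclusive offset.

```python
# (begin, end) pairs copied from the six complete rows of the table above.
spans = [
    (0, 11335),
    (11280, 14436),
    (14379, 17697),
    (17644, 20993),
    (20928, 24275),
    (24214, 27991),
]

# Characters shared by each split and the one before it.
# Assumption: `end` is exclusive; add 1 to each value if it is inclusive.
shared_chars = [
    prev_end - begin
    for (_, prev_end), (begin, _) in zip(spans, spans[1:])
]

print(shared_chars)  # [55, 57, 53, 65, 61]
```

Each overlap covers roughly 50 to 65 characters, which is plausible for 10 tokens of English prose.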

## Now let's build another pipeline to check that this actually works

First, let's get the data ready:

In [16]:
df = spark.createDataFrame([
    [("All emotions, and that\none particularly, were abhorrent to his cold, "
      "precise but\nadmirably balanced mind.\n\nHe was, I take it, the most "
      "perfect\nreasoning and observing machine that the world has seen.")]
]).toDF("text")


Let's create a Spark NLP pipeline with the same stages as before, this time using 3 tokens per split and an overlap of 1 token so the result is easy to check by hand:

In [17]:
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

document_token_splitter = DocumentTokenSplitter() \
    .setInputCols("document") \
    .setOutputCol("splits") \
    .setNumTokens(3) \
    .setTokenOverlap(1) \
    .setExplodeSplits(True) \
    .setTrimWhitespace(True)

pipeline = Pipeline().setStages([documentAssembler, document_token_splitter])
pipeline_df = pipeline.fit(df).transform(df)

results = pipeline_df.select("splits").collect()

splits = [
    row["splits"][0].result.replace("\n\n", " ").replace("\n", " ")
    for row in results
]

**Evaluation**

In [18]:
expected = [
    "All emotions, and",
    "and that one",
    "one particularly, were",
    "were abhorrent to",
    "to his cold,",
    "cold, precise but",
    "but admirably balanced",
    "balanced mind. He",
    "He was, I",
    "I take it,",
    "it, the most",
    "most perfect reasoning",
    "reasoning and observing",
    "observing machine that",
    "that the world",
    "world has seen.",
]

splits == expected

True

Great, it works!
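
To build intuition for why exactly these 16 splits come out of `setNumTokens(3)` and `setTokenOverlap(1)`, here is a minimal pure-Python sketch of a sliding token window. It assumes tokens are separated by whitespace and that each window starts `num_tokens - token_overlap` tokens after the previous one. `sliding_token_splits` is a hypothetical helper written for this illustration, not Spark NLP's actual implementation.

```python
def sliding_token_splits(text, num_tokens, token_overlap):
    """Split text into windows of `num_tokens` whitespace-separated tokens,
    where consecutive windows share `token_overlap` tokens."""
    if not 0 <= token_overlap < num_tokens:
        raise ValueError("token_overlap must be >= 0 and smaller than num_tokens")

    tokens = text.split()  # assumption: plain whitespace tokenization
    step = num_tokens - token_overlap
    splits = []
    for start in range(0, len(tokens), step):
        splits.append(" ".join(tokens[start:start + num_tokens]))
        if start + num_tokens >= len(tokens):
            break  # this window already reaches the end of the text
    return splits


sample = (
    "All emotions, and that\none particularly, were abhorrent to his cold, "
    "precise but\nadmirably balanced mind.\n\nHe was, I take it, the most "
    "perfect\nreasoning and observing machine that the world has seen."
)

sketch_splits = sliding_token_splits(sample, num_tokens=3, token_overlap=1)
print(len(sketch_splits))  # 16
print(sketch_splits[:3])   # ['All emotions, and', 'and that one', 'one particularly, were']
```

With 33 tokens, a window of 3, and a step of 2, the windows start at tokens 0, 2, ..., 30, which gives the 16 splits listed in `expected`.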