![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/annotation/text/english/DocumentCharacterTextSplitter.ipynb)

## Colab + Data Setup

In [1]:
!wget -q http://setup.johnsnowlabs.com/colab.sh -O - | bash

Installing PySpark 3.2.3 and Spark NLP 5.2.2
setup Colab for PySpark 3.2.3 and Spark NLP 5.2.2


In [2]:
!wget https://github.com/JohnSnowLabs/spark-nlp/blob/587f79020de7bc09c2b2fceb37ec258bad57e425/src/test/resources/spell/sherlockholmes.txt > /dev/null 2>&1
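Note that the URL above points at GitHub's `blob` page for the file rather than at the file itself, so the saved `sherlockholmes.txt` may contain page data instead of plain text (the first split printed further down begins with `{"payload":...`). A minimal sketch of rewriting such a URL to its `raw.githubusercontent.com` form is shown here. `blob_to_raw` is a hypothetical helper, not part of Spark NLP or any library, and it assumes the file is still available at that commit.

```python
def blob_to_raw(url: str) -> str:
    """Rewrite a github.com '/blob/' URL to its raw.githubusercontent.com form.

    Hypothetical helper for illustration only.
    """
    prefix = "https://github.com/"
    if not url.startswith(prefix) or "/blob/" not in url:
        raise ValueError(f"not a GitHub blob URL: {url}")
    # "owner/repo" comes before "/blob/", "ref/path/to/file" after it.
    owner_repo, ref_and_path = url[len(prefix):].split("/blob/", 1)
    return f"https://raw.githubusercontent.com/{owner_repo}/{ref_and_path}"


blob_url = (
    "https://github.com/JohnSnowLabs/spark-nlp/blob/"
    "587f79020de7bc09c2b2fceb37ec258bad57e425/"
    "src/test/resources/spell/sherlockholmes.txt"
)
raw_url = blob_to_raw(blob_url)
print(raw_url)
```

Passing the printed URL to `wget` in place of the blob URL should fetch the plain text of the book.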

In [3]:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

spark = sparknlp.start()

print(f"Spark NLP version {sparknlp.version()}\nApache Spark version: {spark.version}")

Spark NLP version 5.2.2
Apache Spark version: 3.2.3


In [4]:
textDF = spark.read.text(
    "sherlockholmes.txt",
    wholetext=True
).toDF("text")

# Create a Spark NLP Pipeline with DocumentCharacterTextSplitter

In [5]:
DocumentCharacterTextSplitter

sparknlp.annotator.document_character_text_splitter.DocumentCharacterTextSplitter

Let's create a Spark NLP pipeline with two stages: a `DocumentAssembler` followed by a `DocumentCharacterTextSplitter`.

In [7]:
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

textSplitter = DocumentCharacterTextSplitter() \
    .setInputCols(["document"]) \
    .setOutputCol("splits") \
    .setChunkSize(20000) \
    .setChunkOverlap(200) \
    .setExplodeSplits(True)

pipeline = Pipeline().setStages([documentAssembler, textSplitter])
result = pipeline.fit(textDF).transform(textDF)

result.selectExpr(
      "splits.result",
      "splits[0].begin",
      "splits[0].end",
      "splits[0].end - splits[0].begin as length").show(8, truncate = 80)

+--------------------------------------------------------------------------------+---------------+-------------+------+
|                                                                          result|splits[0].begin|splits[0].end|length|
+--------------------------------------------------------------------------------+---------------+-------------+------+
|[{"payload":{"allShortcutsEnabled":false,"fileTree":{"src/test/resources/spel...|              0|        19998| 19998|
|[Doctor, and give us your best","attention.\"","","A slow and heavy step, whi...|          19806|        39805| 19999|
|[he said as he turned hungrily on the simple fare that","our landlady had pro...|          39606|        59604| 19998|
|[armchair and","putting his fingertips together, as was his custom when in","...|          59407|        79406| 19999|
|[after a time, he did not come in at","all. Still, of course, I never dared t...|          79208|        99201| 19993|
|[least an hour before us,\" he remarked
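The `begin` and `end` columns in the table above make it possible to check how much consecutive chunks overlap. The sketch below is plain Python and not part of the Spark NLP API. The offsets are copied from the five complete rows of the table, `overlap_lengths` is a hypothetical helper name, and the calculation assumes that `end` is an inclusive character index.

```python
# (begin, end) offsets copied from the complete rows of the table above.
spans = [
    (0, 19998),
    (19806, 39805),
    (39606, 59604),
    (59407, 79406),
    (79208, 99201),
]


def overlap_lengths(spans):
    """Characters shared by each consecutive pair of spans.

    Assumes `end` is inclusive, hence the +1.
    """
    return [
        max(0, prev_end - next_begin + 1)
        for (_, prev_end), (next_begin, _) in zip(spans, spans[1:])
    ]


overlaps = overlap_lengths(spans)
print(overlaps)  # [193, 200, 198, 199]
```

Under that assumption, every overlap stays within the limit of 200 set by `setChunkOverlap(200)`.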

# Now let's build a second pipeline on a small example to verify the splits

First, let's get the data ready:

In [8]:
df = spark.createDataFrame([
    [("All emotions, and that\none particularly, were abhorrent to his cold, "
      "precise but\nadmirably balanced mind.\n\nHe was, I take it, the most "
      "perfect\nreasoning and observing machine that the world has seen.")]
]).toDF("text")


Let's create a Spark NLP pipeline with the same stages as before, this time setting the split patterns, chunk size, and overlap explicitly:

In [13]:
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

document_character_text_splitter = DocumentCharacterTextSplitter() \
    .setInputCols("document") \
    .setOutputCol("splits") \
    .setChunkSize(20) \
    .setChunkOverlap(5) \
    .setExplodeSplits(True) \
    .setPatternsAreRegex(False) \
    .setKeepSeparators(True) \
    .setSplitPatterns(["\n\n", "\n", " ", ""]) \
    .setTrimWhitespace(True)

pipeline = Pipeline().setStages([documentAssembler, document_character_text_splitter])
pipeline_df = pipeline.fit(df).transform(df)

results = pipeline_df.select("splits").collect()

splits = [
    row["splits"][0].result.replace("\n\n", " ").replace("\n", " ")
    for row in results
]

**Evaluation**

In [15]:
expected = [
    "All emotions, and",
    "and that",
    "one particularly,",
    "were abhorrent to",
    "to his cold,",
    "precise but",
    "admirably balanced",
    "mind.",
    "He was, I take it,",
    "it, the most",
    "most perfect",
    "reasoning and",
    "and observing",
    "machine that the",
    "the world has seen.",
]

splits == expected

True

Great, it works! The splits produced by the pipeline match the expected chunks exactly.
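As an extra sanity check, the expected chunks can be compared against the configured limits (`setChunkSize(20)` and `setChunkOverlap(5)`). The following is a sketch in plain Python, independent of Spark NLP. `shared_affix` is a hypothetical helper, and the lengths are measured on the normalized strings (newlines replaced by spaces), so they only approximate what the annotator sees internally.

```python
chunks = [
    "All emotions, and",
    "and that",
    "one particularly,",
    "were abhorrent to",
    "to his cold,",
    "precise but",
    "admirably balanced",
    "mind.",
    "He was, I take it,",
    "it, the most",
    "most perfect",
    "reasoning and",
    "and observing",
    "machine that the",
    "the world has seen.",
]


def shared_affix(left: str, right: str) -> int:
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return size
    return 0


lengths = [len(chunk) for chunk in chunks]
overlaps = [shared_affix(a, b) for a, b in zip(chunks, chunks[1:])]

print(max(lengths))   # longest chunk, in characters
print(max(overlaps))  # largest overlap between neighbouring chunks
```

Both values should stay within the configured chunk size and overlap. An overlap of 0 between two neighbours means no text was repeated across that boundary.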