![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/open-source-nlp/21.0.Document_Splitters.ipynb)

# **Document Splitters**

This notebook covers the parameters and usage of the document splitter annotators, `DocumentCharacterTextSplitter` and `DocumentTokenSplitter`.

**📖 Learning Objectives:**

1. Background: Understand the document splitters `DocumentCharacterTextSplitter` and `DocumentTokenSplitter`.

2. Colab setup.

3. Become comfortable with using the different parameters of the annotators.

**🔗 Helpful Links:**

- For translation models: [Model Hub](https://sparknlp.org/models?tag=translation)



## **🎬 Colab Setup**

In [None]:
!pip install -q pyspark==3.4.1 spark-nlp

In [2]:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml import Pipeline, PipelineModel

import pandas as pd

import warnings
warnings.filterwarnings('ignore')

params = {"spark.driver.memory":"16G",
          "spark.kryoserializer.buffer.max":"2000M",
          "spark.driver.maxResultSize":"2000M"}

spark = sparknlp.start(params=params)

print("Spark NLP Version :", sparknlp.version())

spark

Spark NLP Version : 5.3.3


## **Dataset**

In [None]:
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/open-source-nlp/data/holmes.txt

In [4]:
textDF = spark.read.text( "holmes.txt", wholetext=True).toDF("text")
textDF.show(truncate=100)

+----------------------------------------------------------------------------------------------------+
|                                                                                                text|
+----------------------------------------------------------------------------------------------------+
|THE ADVENTURES OF SHERLOCK HOLMESArthur Conan Doyle Table of contents A Scandal in Bohemia The Re...|
+----------------------------------------------------------------------------------------------------+



# DocumentCharacterTextSplitter Model

## **📜 Background**

`DocumentCharacterTextSplitter`: Annotator which splits large documents into chunks of roughly given size.

DocumentCharacterTextSplitter takes a list of separators. It takes the separators in order and splits subtexts if they are over the chunk length, considering optional overlap of the chunks.

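For example, given chunk size 20 and overlap 5, the sentence `"He was, I take it, the most perfect reasoning and observing machine that the world has seen."` is split into `["He was, I take it,", "it, the most", "most perfect", "reasoning and", "and observing", "machine that the", "the world has seen."]`.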
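
How chunk size and overlap interact can be sketched in plain Python. This is a simplified illustration, not Spark NLP's implementation: it assumes a single whitespace separator, and the helper name `merge_words` is hypothetical. Words are appended to the current chunk until the next word would exceed `chunk_size`; the chunk is then emitted, and only as many trailing words as fit within `chunk_overlap` are carried over into the next chunk.

```python
def merge_words(text, chunk_size, chunk_overlap):
    """Greedy word-level chunking with overlap (simplified illustration)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    chunks = []
    current = []  # words collected for the chunk being built
    for word in text.split():
        if current and len(" ".join(current + [word])) > chunk_size:
            chunks.append(" ".join(current))
            # Drop leading words until the carried-over tail fits in the
            # overlap budget and still leaves room for the next word.
            while current and (
                len(" ".join(current)) > chunk_overlap
                or len(" ".join(current + [word])) > chunk_size
            ):
                current = current[1:]
        current.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks


chunks = merge_words("one two three four five six seven", chunk_size=13, chunk_overlap=5)
print(chunks)
# ['one two three', 'three four', 'four five six', 'six seven']
```

Because this sketch only knows one separator, a single word longer than `chunk_size` is emitted as an oversized chunk. The annotator instead moves on to the next split pattern in its list, and it may place chunk boundaries differently from this sketch.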


**🔗 Helpful Links:**

- Python API: [DocumentCharacterTextSplitter](https://sparknlp.org/api/python/reference/autosummary/sparknlp/annotator/document_character_text_splitter/index.html#sparknlp.annotator.document_character_text_splitter.DocumentCharacterTextSplitter)
- Scala API: [DocumentCharacterTextSplitter](https://sparknlp.org/api/com/johnsnowlabs/nlp/annotators/DocumentCharacterTextSplitter)




## **🖨️ Input/Output Annotation Types**

- Input: `DOCUMENT`

- Output: `DOCUMENT`

## **🔎 Parameters**

- `chunkSize`: Size of each chunk of text.
- `chunkOverlap`: Length of the overlap between text chunks, by default `0`.
- `splitPatterns`: Patterns to separate the text by in decreasing priority, by default `["\n\n", "\n", " ", ""]`.
- `patternsAreRegex`: Whether to interpret the split patterns as regular expressions, by default `False`.
- `keepSeparators`: Whether to keep the separators in the final result, by default `True`.
- `explodeSplits`: Whether to explode split chunks to separate rows, by default `False`.
- `trimWhitespace`: Whether to trim whitespaces of extracted chunks, by default `True`.
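
To see why `patternsAreRegex` and `trimWhitespace` matter, here is a small plain-Python sketch using the standard `re` module. It only mimics the idea behind these two parameters: `split_by_pattern` is a hypothetical helper, it discards the separators (the opposite of the `keepSeparators` default), and it does not reflect how Spark NLP implements splitting.

```python
import re


def split_by_pattern(text, pattern, patterns_are_regex=False, trim_whitespace=True):
    """Split text on one pattern, treated as a literal string or as a regex."""
    # A literal pattern must be escaped so that characters such as "." lose
    # their special regex meaning.
    regex = pattern if patterns_are_regex else re.escape(pattern)
    pieces = re.split(regex, text)
    if trim_whitespace:
        pieces = [piece.strip() for piece in pieces]
    return [piece for piece in pieces if piece]  # drop empty pieces


text = "First point.  Second point! Third"

literal = split_by_pattern(text, ".")
print(literal)  # ['First point', 'Second point! Third']

as_regex = split_by_pattern(text, r"[.!]\s*", patterns_are_regex=True)
print(as_regex)  # ['First point', 'Second point', 'Third']
```

The same pattern string can therefore produce very different splits depending on whether it is read literally or as a regular expression.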


## Define Spark NLP pipeline

Let's create a Spark NLP pipeline with two stages, a `DocumentAssembler` followed by a `DocumentCharacterTextSplitter`:

In [5]:
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

character_text_splitter = DocumentCharacterTextSplitter() \
    .setInputCols(["document"]) \
    .setOutputCol("splits") \
    .setChunkSize(2000) \
    .setChunkOverlap(20) \
    .setExplodeSplits(True)

pipeline = Pipeline()\
    .setStages([
        document_assembler,
        character_text_splitter
])

result = pipeline.fit(textDF).transform(textDF)

result.selectExpr(
      "splits.result",
      "splits[0].begin as begin",
      "splits[0].end as end",
      "splits[0].end - splits[0].begin as length").show(8, truncate = 80)

+--------------------------------------------------------------------------------+-----+-----+------+
|                                                                          result|begin|  end|length|
+--------------------------------------------------------------------------------+-----+-----+------+
|[THE ADVENTURES OF SHERLOCK HOLMESArthur Conan Doyle Table of contents A Scan...|    0| 1995|  1995|
|[his whole Bohemian soul, remained in our lodgings in Baker Street, buried am...| 1977| 3974|  1997|
|[It seldom was; but he was glad, I think, to see me. With hardly a word spoke...| 3960| 5955|  1995|
|[I must be dull, indeed, if I do not pronounce him to be an active member of ...| 5940| 7939|  1999|
|[It is a capital mistake to theorize before one has data. Insensibly one begi...| 7924| 9915|  1991|
|[to resolve all our doubts." As he spoke there was the sharp sound of horses'...| 9897|11890|  1993|
|[long, straight chin suggestive of resolution pushed to the length of obstina...|

**Now let's build a second pipeline with a small chunk size, so that the splits are easy to verify by hand.**

Let's create a Spark NLP pipeline with the same stages as before, this time setting every splitter parameter explicitly:

In [6]:
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

character_text_splitter = DocumentCharacterTextSplitter() \
    .setInputCols("document") \
    .setOutputCol("splits") \
    .setChunkSize(20) \
    .setChunkOverlap(5) \
    .setExplodeSplits(True) \
    .setPatternsAreRegex(False) \
    .setKeepSeparators(True) \
    .setSplitPatterns(["\n\n", "\n", " ", ""]) \
    .setTrimWhitespace(True)

pipeline = Pipeline()\
    .setStages([
        document_assembler,
        character_text_splitter
])

Let's get the data ready:

In [7]:
df = spark.createDataFrame([
    [("All emotions, and that\none particularly, were abhorrent to his cold, "
      "precise but\nadmirably balanced mind.\n\nHe was, I take it, the most "
      "perfect\nreasoning and observing machine that the world has seen.")]
]).toDF("text")

In [8]:
result_df = pipeline.fit(df).transform(df)
result_df.selectExpr(
      "splits.result[0]",
      "splits[0].begin as begin",
      "splits[0].end as end",
      "splits[0].end - splits[0].begin as length").show(truncate = 80)

+-------------------+-----+---+------+
|   splits.result[0]|begin|end|length|
+-------------------+-----+---+------+
|  All emotions, and|    0| 17|    17|
|           and that|   14| 22|     8|
|  one particularly,|   23| 40|    17|
|  were abhorrent to|   41| 58|    17|
|       to his cold,|   56| 68|    12|
|        precise but|   69| 80|    11|
| admirably balanced|   81| 99|    18|
|              mind.|  100|105|     5|
| He was, I take it,|  107|125|    18|
|       it, the most|  122|134|    12|
|       most perfect|  130|142|    12|
|      reasoning and|  143|156|    13|
|      and observing|  153|166|    13|
|   machine that the|  167|183|    16|
|the world has seen.|  180|199|    19|
+-------------------+-----+---+------+



**Evaluation**

In [9]:
results = result_df.select("splits").collect()

splits = [
    row["splits"][0].result.replace("\n\n", " ").replace("\n", " ")
    for row in results
]

In [10]:
expected = [
    "All emotions, and",
    "and that",
    "one particularly,",
    "were abhorrent to",
    "to his cold,",
    "precise but",
    "admirably balanced",
    "mind.",
    "He was, I take it,",
    "it, the most",
    "most perfect",
    "reasoning and",
    "and observing",
    "machine that the",
    "the world has seen.",
]

splits == expected

True
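
The `begin` and `end` offsets printed by the chunk-size-20 pipeline allow a second sanity check: no chunk should be longer than `chunkSize`, and consecutive chunks should share at most `chunkOverlap` characters. A minimal sketch in plain Python, with the offsets copied by hand from that output (the variable names are illustrative only):

```python
# (begin, end) offsets copied from the output of the chunk-size-20 pipeline
offsets = [
    (0, 17), (14, 22), (23, 40), (41, 58), (56, 68),
    (69, 80), (81, 99), (100, 105), (107, 125), (122, 134),
    (130, 142), (143, 156), (153, 166), (167, 183), (180, 199),
]

chunk_size, chunk_overlap = 20, 5

lengths = [end - begin for begin, end in offsets]
# Overlap between neighbours: how far the previous chunk extends past the
# start of the next one (zero when there is a gap).
overlaps = [
    max(0, prev_end - next_begin)
    for (_, prev_end), (next_begin, _) in zip(offsets, offsets[1:])
]

print(max(lengths), max(overlaps))  # 19 4
assert max(lengths) <= chunk_size
assert max(overlaps) <= chunk_overlap
```

Both limits act as upper bounds: many neighbouring chunks do not overlap at all, because the splitter only carries over whole pieces that fit into the overlap budget.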

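## Cleaning documents with DocumentNormalizer before splitting

The file downloaded below is a GitHub HTML page rather than plain text, so a `DocumentNormalizer` stage strips the HTML tags before the text is split.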

In [11]:
!rm -r sherlockholmes*

rm: cannot remove 'sherlockholmes*': No such file or directory


In [None]:
!wget  https://github.com/JohnSnowLabs/spark-nlp/blob/587f79020de7bc09c2b2fceb37ec258bad57e425/src/test/resources/spell/sherlockholmes.txt  -P ./

In [13]:
unnormalized_textDF = spark.read.text("sherlockholmes.txt", wholetext=True).toDF("text")

unnormalized_textDF.collect()[0]["text"][:1000]

'\n\n\n\n\n\n<!DOCTYPE html>\n<html\n  lang="en"\n  \n  data-color-mode="auto" data-light-theme="light" data-dark-theme="dark"\n  data-a11y-animated-images="system" data-a11y-link-underlines="true"\n  >\n\n\n\n\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n  \n\n\n  <link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/light-0eace2597ca3.css" /><link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/dark-a167e256da9c.css" /><link data-color-theme="dark_dimme

In [14]:
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

cleanUpPatterns = ["""(<[^>]*>)"""]

document_normalizer = DocumentNormalizer() \
    .setInputCols("document") \
    .setOutputCol("normalize_document") \
    .setAction("clean") \
    .setPatterns(cleanUpPatterns) \
    .setReplacement(" ") \
    .setPolicy("pretty_all") \
    .setLowercase(False)

character_text_splitter = DocumentCharacterTextSplitter() \
    .setInputCols(["normalize_document"]) \
    .setOutputCol("splits") \
    .setChunkSize(2000) \
    .setChunkOverlap(20) \
    .setExplodeSplits(True)

pipeline = Pipeline()\
    .setStages([
        document_assembler,
        document_normalizer,
        character_text_splitter
])
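
The clean-up pattern `(<[^>]*>)` matches everything from a `<` up to the next `>`, that is, an HTML tag. Its effect can be previewed with Python's standard `re` module. This is only an approximation of the normalizer stage: collapsing repeated whitespace is an assumption made here for readability, not a statement of what the `pretty_all` policy does exactly, and `strip_tags` is a hypothetical helper.

```python
import re

TAG_PATTERN = re.compile(r"(<[^>]*>)")  # same pattern as in cleanUpPatterns


def strip_tags(html, replacement=" "):
    """Replace every HTML tag with `replacement`, then tidy the whitespace."""
    without_tags = TAG_PATTERN.sub(replacement, html)
    return " ".join(without_tags.split())


sample = (
    '<html lang="en"><head><title>Sherlock</title></head>'
    "<body><p>To Sherlock Holmes she is always <b>the</b> woman.</p></body></html>"
)
cleaned = strip_tags(sample)
print(cleaned)  # Sherlock To Sherlock Holmes she is always the woman.
```

Only the tags themselves are removed. Text that sits between tags, including the contents of `<title>` or `<script>` elements, is kept, so some page metadata can survive in the normalized document.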

In [15]:

result = pipeline.fit(unnormalized_textDF).transform(unnormalized_textDF)

result.selectExpr(
      "splits.result",
      "splits[0].begin as begin",
      "splits[0].end as end",
      "splits[0].end - splits[0].begin as length").show(truncate = 80)

+--------------------------------------------------------------------------------+-----+-----+------+
|                                                                          result|begin|  end|length|
+--------------------------------------------------------------------------------+-----+-----+------+
|[{"locale":"en","featureFlags":["code_vulnerability_scanning","copilot_conver...|    1| 2000|  1999|
|[on another tab or window. Reload to refresh your session. Dismiss alert {{ m...| 1983| 2317|   334|
|[ {"payload":{"allShortcutsEnabled":false,"fileTree":{"src/test/resources/spe...| 2317|10155|  7838|
|[Gutenberg's The Adventures of Sherlock Holmes, by Arthur Conan Doyle","","Th...|10156|12153|  1997|
|[actions. But for the trained reasoner","to admit such intrusions into his ow...|12137|14134|  1997|
|[practice), when my way led me through Baker Street. As I","passed the well-r...|14119|16112|  1993|
|[but as I have changed my clothes I can't imagine how you","deduce it. As to ...|

# DocumentTokenSplitter Model

## **📜 Background**

`DocumentTokenSplitter`: Annotator that splits large documents into smaller documents based on the number of tokens in the text.

Currently, DocumentTokenSplitter splits the text by whitespaces to create the tokens. The number of these tokens will then be used as a measure of the text length. In the future, other tokenization techniques will be supported.
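
The token-based windowing can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not Spark NLP's implementation: tokens come from `str.split()`, each new chunk starts `num_tokens - token_overlap` tokens after the previous one, and the helper name `split_by_tokens` is hypothetical.

```python
def split_by_tokens(text, num_tokens, token_overlap=0):
    """Chunk whitespace-separated tokens into overlapping windows."""
    if not 0 <= token_overlap < num_tokens:
        raise ValueError("token_overlap must be in [0, num_tokens)")
    tokens = text.split()
    step = num_tokens - token_overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(" ".join(tokens[start:start + num_tokens]))
        if start + num_tokens >= len(tokens):
            break  # the window already reached the last token
        start += step
    return chunks


chunks = split_by_tokens("a b c d e f g h i j", num_tokens=4, token_overlap=1)
print(chunks)  # ['a b c d', 'd e f g', 'g h i j']
```

Here the length of a chunk is measured in tokens rather than characters, so chunks with the same token count can differ considerably in character length.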


**🔗 Helpful Links:**

- Python API: [DocumentTokenSplitter](https://sparknlp.org/api/python/reference/autosummary/sparknlp/annotator/document_token_splitter/index.html#sparknlp.annotator.document_token_splitter.DocumentTokenSplitter)
- Scala API: [DocumentTokenSplitter](https://sparknlp.org/api/com/johnsnowlabs/nlp/annotators/DocumentTokenSplitter)




## **🖨️ Input/Output Annotation Types**

- Input: `DOCUMENT`

- Output: `DOCUMENT`

## **🔎 Parameters**

- `numTokens`: Limit on the number of tokens in each chunk.
- `tokenOverlap`: Length of the token overlap between text chunks, by default `0`.
- `explodeSplits`: Whether to explode split chunks to separate rows, by default `False`.
- `trimWhitespace`: Whether to trim whitespaces of extracted chunks, by default `True`.


## Define Spark NLP pipeline

In [16]:
token_splitter = DocumentTokenSplitter() \
    .setInputCols(["document"]) \
    .setOutputCol("splits") \
    .setNumTokens(512) \
    .setTokenOverlap(10) \
    .setExplodeSplits(True)

pipeline = Pipeline()\
    .setStages([
        document_assembler,
        token_splitter
    ])

In [17]:
result = pipeline.fit(textDF).transform(textDF)

result.selectExpr(
      "splits.result[0] as result",
      "splits[0].begin as begin",
      "splits[0].end as end",
      "splits[0].end - splits[0].begin as length",
      "splits[0].metadata.numTokens as tokens") \
    .show(8, truncate = 80)

+--------------------------------------------------------------------------------+-----+-----+------+------+
|                                                                          result|begin|  end|length|tokens|
+--------------------------------------------------------------------------------+-----+-----+------+------+
|THE ADVENTURES OF SHERLOCK HOLMESArthur Conan Doyle Table of contents A Scand...|    0| 2940|  2940|   512|
|daily press, I knew little of my former friend and companion. One night--it w...| 2890| 5582|  2692|   512|
|mud from it. Hence, you see, my double deduction that you had been out in vil...| 5529| 8352|  2823|   512|
|processes. "Such paper could not be bought under half a crown a packet. It is...| 8297|11080|  2783|   512|
|akin to bad taste. Heavy bands of astrakhan were slashed across the sleeves a...|11024|13840|  2816|   512|
|Our visitor glanced with some apparent surprise at the languid, lounging figu...|13777|16824|  3047|   512|
|recovered." "We ha

# Medical Use-Case

In [18]:
! wget -q https://github.com/JohnSnowLabs/spark-nlp-workshop/raw/master/healthcare-nlp/data/diabetes_txt_files.zip

In [19]:
import shutil

filename = "./diabetes_txt_files.zip"
extract_dir = "./"
archive_format = "zip"

shutil.unpack_archive(filename, extract_dir, archive_format)

In [20]:
multi_doc = spark.read.text("./diabetes_txt_files", wholetext=True).toDF("text")
multi_doc = multi_doc.withColumn("filename", F.input_file_name())\
                      .withColumn("filename",F.split('filename', '/'))\
                      .withColumn('filename', F.col('filename')[F.size('filename') -1])
multi_doc.show(truncate=100)


+----------------------------------------------------------------------------------------------------+-----------------------+
|                                                                                                text|               filename|
+----------------------------------------------------------------------------------------------------+-----------------------+
|Diabetes mellitus is a group of diseases associated with various metabolic disorders, the main fe...|PMC4020724_abstract.txt|
|Objective: The peer interaction–based online model has been influential in the recent development...|PMC7432193_abstract.txt|
|Gestational diabetes mellitus (GDM) is associated with developing type 2 diabetes, but very few s...|PMC5770032_abstract.txt|
|A diagnosis of diabetes or hyperglycemia should be confirmed prior to ordering, dispensing, or ad...|PMC6104264_abstract.txt|
|The aim of this study was to describe the characteristics and outcomes of pregnancies in a nation...|PMC705437

In [None]:
!pip install -q langchain

In [22]:
from langchain.document_loaders import PySparkDataFrameLoader
loader = PySparkDataFrameLoader(spark, multi_doc, page_content_column="text")
documents = loader.load()

In [23]:
documents[0]

Document(page_content='Diabetes mellitus is a group of diseases associated with various metabolic disorders, the main feature of which is chronic hyperglycemia due to insufficient insulin action. Its pathogenesis involves both genetic and environmental factors. The long‐term persistence of metabolic disorders can cause susceptibility to specific complications and also foster arteriosclerosis. Diabetes mellitus is associated with a broad range of clinical presentations, from being asymptomatic to ketoacidosis or coma, depending on the degree of metabolic disorder.\nNote: Those that cannot at present be classified as any of the above are called unclassifiable.\nThe occurrence of diabetes‐specific complications has not been confirmed in some of these conditions.\nThe occurrence of diabetes‐specific complications has not been confirmed in some of these conditions.\n\u2002A scheme of the relationship between etiology (mechanism) and patho‐physiological stages (states) of diabetes mellitus. 