![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/open-source-nlp/21.0.Document_Splitters.ipynb)

# **Document Splitters**

This notebook will cover the different parameters and usages of `DocumentCharacterTextSplitter` and `DocumentTokenSplitter`.

**📖 Learning Objectives:**

1. Background: Understand the Document Splitters such as `DocumentCharacterTextSplitter` and `DocumentTokenSplitter`.

2. Colab setup.

3. Become comfortable with using the different parameters of the annotator.

**🔗 Helpful Links:**

- For Translation models: [Model Hub](https://sparknlp.org/models?tag=translation)



## **🎬 Colab Setup**

In [None]:
!pip install -q pyspark==3.4.1 spark-nlp

In [None]:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml import Pipeline, PipelineModel

import pandas as pd

import warnings
warnings.filterwarnings('ignore')

params = {"spark.driver.memory":"16G",
          "spark.kryoserializer.buffer.max":"2000M",
          "spark.driver.maxResultSize":"2000M"}

spark = sparknlp.start(params=params)

print("Spark NLP Version :", sparknlp.version())

spark

Spark NLP Version : 5.3.3


# DocumentCharacterTextSplitter Model

## **📜 Background**

`DocumentCharacterTextSplitter`: Annotator which splits large documents into chunks of roughly the given size.

`DocumentCharacterTextSplitter` takes a list of separators. It applies the separators in order and keeps splitting any subtext that is still longer than the chunk size, optionally letting consecutive chunks overlap.

For example, given chunk size 20 and overlap 5:

```
"He was, I take it, the most perfect reasoning and observing machine that the world has seen."

["He was, I take it,", "it, the most", "most perfect", "reasoning and", "and observing", "machine that the", "the world has seen."]
```
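
The example above can be traced with a short pure-Python sketch of the merge step. It is a simplification under stated assumptions, not Spark NLP's actual implementation: it uses a single space separator, attaches each separator to the piece that follows it, and the helper names `split_keep_separator` and `merge_pieces` are hypothetical.

```python
import re

def split_keep_separator(text, sep):
    """Split on sep, keeping each separator attached to the piece after it."""
    parts = re.split(f"({re.escape(sep)})", text)
    pieces = [parts[0]] if parts[0] else []
    for i in range(1, len(parts), 2):
        pieces.append(parts[i] + parts[i + 1])
    return pieces

def merge_pieces(pieces, chunk_size, chunk_overlap):
    """Greedily merge pieces into chunks of at most chunk_size characters."""
    chunks, window, total = [], [], 0
    for piece in pieces:
        if window and total + len(piece) > chunk_size:
            chunks.append("".join(window).strip())
            # Drop leading pieces until the carried-over tail fits the overlap
            # budget and leaves room for the next piece.
            while window and (total > chunk_overlap
                              or total + len(piece) > chunk_size):
                total -= len(window.pop(0))
        window.append(piece)
        total += len(piece)
    if window:
        chunks.append("".join(window).strip())
    return [c for c in chunks if c]

sentence = ("He was, I take it, the most perfect reasoning and observing "
            "machine that the world has seen.")
chunks = merge_pieces(split_keep_separator(sentence, " "),
                      chunk_size=20, chunk_overlap=5)
print(chunks)
```

Because the overlap is built from whole pieces, the shared text between two chunks is a word such as "it," or "most" rather than exactly five characters.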

**🔗 Helpful Links:**

- Python API: [DocumentCharacterTextSplitter](https://sparknlp.org/api/python/reference/autosummary/sparknlp/annotator/document_character_text_splitter/index.html#sparknlp.annotator.document_character_text_splitter.DocumentCharacterTextSplitter)
- Scala API: [DocumentCharacterTextSplitter](https://sparknlp.org/api/com/johnsnowlabs/nlp/annotators/DocumentCharacterTextSplitter)




## **🖨️ Input/Output Annotation Types**

- Input: `DOCUMENT`

- Output: `DOCUMENT`

## **🔎 Parameters**

- `chunkSize`: Size of each chunk of text.
- `chunkOverlap`: Length of the overlap between text chunks, by default `0`.
- `splitPatterns`: Patterns to separate the text by in decreasing priority, by default `["\n\n", "\n", " ", ""]`.
- `patternsAreRegex`: Whether to interpret the split patterns as regular expressions, by default `False`.
- `keepSeparators`: Whether to keep the separators in the final result, by default `True`.
- `explodeSplits`: Whether to explode split chunks to separate rows, by default `False`.
- `trimWhitespace`: Whether to trim whitespaces of extracted chunks, by default `True`.


## **💻Pipeline**

Let's create a Spark NLP pipeline with the following stages:

In [None]:
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

character_text_splitter = DocumentCharacterTextSplitter() \
    .setInputCols(["document"]) \
    .setOutputCol("splits") \
    .setChunkSize(50) \
    .setChunkOverlap(10) \
    .setExplodeSplits(False)

pipeline = Pipeline()\
    .setStages([
        document_assembler,
        character_text_splitter
])

In [None]:
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/open-source-nlp/data/holmes.txt

In [None]:
holmesDF = spark.read.text("holmes.txt", wholetext=True).toDF("text")
holmesDF.show(truncate=100)

+----------------------------------------------------------------------------------------------------+
|                                                                                                text|
+----------------------------------------------------------------------------------------------------+
|THE ADVENTURES OF SHERLOCK HOLMESArthur Conan Doyle Table of contents A Scandal in Bohemia The Re...|
+----------------------------------------------------------------------------------------------------+



In [None]:
result = pipeline.fit(holmesDF).transform(holmesDF)

result.select("splits").show(truncate = 300)

+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|                                                                                                                                                                                                                                                                                                      splits|
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{document, 0, 45, THE ADVENTURES OF SHERLOCK HOLMESArthur Conan, {sentence -> 0, document

### ▶ explodeSplits

Whether to explode split chunks to separate rows, by default `False`.

In [None]:
character_text_splitter = DocumentCharacterTextSplitter() \
    .setInputCols(["document"]) \
    .setOutputCol("splits") \
    .setChunkSize(100) \
    .setChunkOverlap(10) \
    .setExplodeSplits(True)

pipeline = Pipeline()\
    .setStages([
        document_assembler,
        character_text_splitter
])

result = pipeline.fit(holmesDF).transform(holmesDF)

result.select("splits").show(truncate=False)

+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|splits                                                                                                                                                           |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{document, 0, 94, THE ADVENTURES OF SHERLOCK HOLMESArthur Conan Doyle Table of contents A Scandal in Bohemia The, {sentence -> 0, document -> 0}, []}]          |
|[{document, 91, 188, The Red-Headed League A Case of Identity The Boscombe Valley Mystery The Five Orange Pips The Man, {sentence -> 0, document -> 1}, []}]     |
|[{document, 181, 280, The Man with the Twisted Lip The Adventure of the Blue Carbuncle The Adventure of the Speckled Band, {sentence -> 0, document -> 2}, []}]  |
|[{document, 276

In [None]:
# Let's prettify
result.selectExpr(
      "splits.result[0] as split",
      "splits[0].begin as begin",
      "splits[0].end as end",
      "splits[0].end - splits[0].begin as length").show(8, truncate = 120)

+---------------------------------------------------------------------------------------------------+-----+---+------+
|                                                                                              split|begin|end|length|
+---------------------------------------------------------------------------------------------------+-----+---+------+
|     THE ADVENTURES OF SHERLOCK HOLMESArthur Conan Doyle Table of contents A Scandal in Bohemia The|    0| 94|    94|
|  The Red-Headed League A Case of Identity The Boscombe Valley Mystery The Five Orange Pips The Man|   91|188|    97|
|The Man with the Twisted Lip The Adventure of the Blue Carbuncle The Adventure of the Speckled Band|  181|280|    99|
|Band The Adventure of the Engineer's Thumb The Adventure of the Noble Bachelor The Adventure of the|  276|375|    99|
|    of the Beryl Coronet The Adventure of the Copper Beeches A SCANDAL IN BOHEMIA Table of contents|  369|464|    95|
| contents Chapter 1 Chapter 2 Chapter 3CHAPTER 
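
The `begin` and `end` columns above also show how much text consecutive chunks actually share. The sketch below copies the first five spans from the table and assumes that `end` of one chunk and `begin` of the next refer to the same character offsets in the original document.

```python
# (begin, end) pairs copied from the first five rows of the table above.
spans = [(0, 94), (91, 188), (181, 280), (276, 375), (369, 464)]

# Shared characters between each chunk and the one that follows it.
overlaps = [prev_end - next_begin
            for (_, prev_end), (next_begin, _) in zip(spans, spans[1:])]
print(overlaps)
```

Under that assumption, every realized overlap stays at or below the configured `chunkOverlap` of 10 and corresponds to whole words ("The", "The Man", "Band", "of the"), which suggests the overlap snaps to separator boundaries instead of being cut at exactly 10 characters.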

### ▶ splitPatterns

Patterns to separate the text by in decreasing priority, by default `["\n\n", "\n", " ", ""]`.

In [None]:
text = """  (Medical Transcription Sample Report)

PRESENT ILLNESS:
Patient with hypertension, syncope, and spinal stenosis - for recheck.

SUBJECTIVE:
The patient is a 78-year-old female who returns for recheck. She has hypertension. She denies difficulty with chest pain, palpations, orthopnea, nocturnal dyspnea, or edema.

MEDICAL HISTORY:
Reviewed and unchanged from the dictation on 12/03/2003.

MEDICATIONS:
Atenolol 50 mg daily, Premarin 0.625 mg daily, calcium with vitamin D two to three pills daily, multivitamin daily, aspirin as needed, and TriViFlor 25 mg two pills daily.
She also has Elocon cream 0.1% and Synalar cream 0.01% that she uses as needed for rash."""

textDF = spark.createDataFrame([[text]]).toDF("text")
textDF.show(truncate = 100)

+----------------------------------------------------------------------------------------------------+
|                                                                                                text|
+----------------------------------------------------------------------------------------------------+
|  (Medical Transcription Sample Report)\n\nPRESENT ILLNESS:\nPatient with hypertension, syncope, ...|
+----------------------------------------------------------------------------------------------------+



In [None]:
character_text_splitter = DocumentCharacterTextSplitter() \
    .setInputCols(["document"]) \
    .setOutputCol("splits") \
    .setChunkSize(100) \
    .setChunkOverlap(10) \
    .setExplodeSplits(True)\
    .setSplitPatterns(["\n\n", "\n"])

pipeline = Pipeline()\
    .setStages([
        document_assembler,
        character_text_splitter
])

result = pipeline.fit(textDF).transform(textDF)

result.selectExpr(
      "splits.result[0] as split",
      "splits[0].begin as begin",
      "splits[0].end as end",
      "splits[0].end - splits[0].begin as length").show(truncate = 120)

+------------------------------------------------------------------------------------------------------------------------+-----+---+------+
|                                                                                                                   split|begin|end|length|
+------------------------------------------------------------------------------------------------------------------------+-----+---+------+
|                                                                                   (Medical Transcription Sample Report)|    2| 39|    37|
|                                PRESENT ILLNESS:\nPatient with hypertension, syncope, and spinal stenosis - for recheck.|   41|128|    87|
|                                                                                                             SUBJECTIVE:|  130|141|    11|
|\nThe patient is a 78-year-old female who returns for recheck. She has hypertension. She denies difficulty with chest...|  141|315|   174|
|                   

In [None]:
character_text_splitter = DocumentCharacterTextSplitter() \
    .setInputCols(["document"]) \
    .setOutputCol("splits") \
    .setChunkSize(100) \
    .setChunkOverlap(10) \
    .setExplodeSplits(True)\
    .setSplitPatterns(["\n\n", "\n", " "])

pipeline = Pipeline()\
    .setStages([
        document_assembler,
        character_text_splitter
])

result = pipeline.fit(textDF).transform(textDF)

result.selectExpr(
      "splits.result[0] as split",
      "splits[0].begin as begin",
      "splits[0].end as end",
      "splits[0].end - splits[0].begin as length").show(truncate = 120)

+-----------------------------------------------------------------------------------------------+-----+---+------+
|                                                                                          split|begin|end|length|
+-----------------------------------------------------------------------------------------------+-----+---+------+
|                                                          (Medical Transcription Sample Report)|    2| 39|    37|
|       PRESENT ILLNESS:\nPatient with hypertension, syncope, and spinal stenosis - for recheck.|   41|128|    87|
|                                                                                    SUBJECTIVE:|  130|141|    11|
|  The patient is a 78-year-old female who returns for recheck. She has hypertension. She denies|  142|235|    93|
|         denies difficulty with chest pain, palpations, orthopnea, nocturnal dyspnea, or edema.|  229|315|    86|
|                     MEDICAL HISTORY:\nReviewed and unchanged from the dictatio
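
The two runs above differ only in the separator list: without `" "` the paragraph under `SUBJECTIVE:` stays 174 characters long, because no remaining separator can break it further. The following sketch illustrates that priority fallback in plain Python. It is a simplified illustration, not the annotator's implementation: the function name `recursive_split` is hypothetical, small pieces are not merged back together, overlap is ignored, and the empty-string separator is left out.

```python
def recursive_split(text, separators, chunk_size):
    """Split with the first separator; recurse into pieces that are too long."""
    if len(text) <= chunk_size or not separators:
        # Either short enough, or nothing finer left to split on.
        return [text]
    first, *rest = separators
    pieces = []
    for part in text.split(first):
        if part:
            pieces.extend(recursive_split(part, rest, chunk_size))
    return pieces

sample = "aaa bbb\nccc ddd eee\n\nfff"

# Without " " the 11-character line cannot be reduced below chunk_size=8.
coarse = recursive_split(sample, ["\n\n", "\n"], chunk_size=8)
# Adding " " as the lowest-priority separator breaks it into words.
fine = recursive_split(sample, ["\n\n", "\n", " "], chunk_size=8)

print(coarse)
print(fine)
```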

### ▶ trimWhitespace

Whether to trim whitespaces of extracted chunks, by default `True`.

In [None]:
character_text_splitter = DocumentCharacterTextSplitter() \
    .setInputCols(["document"]) \
    .setOutputCol("splits") \
    .setChunkSize(100) \
    .setChunkOverlap(10) \
    .setExplodeSplits(True)\
    .setTrimWhitespace(False)

pipeline = Pipeline()\
    .setStages([
        document_assembler,
        character_text_splitter
])

result = pipeline.fit(textDF).transform(textDF)

result.selectExpr(
      "splits.result[0] as split",
      "splits[0].begin as begin",
      "splits[0].end as end",
      "splits[0].end - splits[0].begin as length").show(truncate = 120)

+-------------------------------------------------------------------------------------------------+-----+---+------+
|                                                                                            split|begin|end|length|
+-------------------------------------------------------------------------------------------------+-----+---+------+
|                                                            (Medical Transcription Sample Report)|    0| 39|    39|
|     \n\nPRESENT ILLNESS:\nPatient with hypertension, syncope, and spinal stenosis - for recheck.|   39|128|    89|
|                                                                                  \n\nSUBJECTIVE:|  128|141|    13|
|  \nThe patient is a 78-year-old female who returns for recheck. She has hypertension. She denies|  141|235|    94|
|           denies difficulty with chest pain, palpations, orthopnea, nocturnal dyspnea, or edema.|  228|315|    87|
|                   \n\nMEDICAL HISTORY:\nReviewed and unchanged
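
Comparing this table with the earlier trimmed output, the first chunk moves from `begin=2` to `begin=0` and the second from `begin=41` to `begin=39`. A minimal sketch of that relationship, assuming trimming simply shrinks the span by the stripped characters (the helper `trimmed_span` is hypothetical and not part of Spark NLP):

```python
def trimmed_span(text, begin, end):
    """Return the (begin, end) span left after stripping whitespace."""
    chunk = text[begin:end]
    left = len(chunk) - len(chunk.lstrip())
    right = len(chunk) - len(chunk.rstrip())
    return begin + left, end - right

# First two sections of the sample report used above.
text = ("  (Medical Transcription Sample Report)\n\n"
        "PRESENT ILLNESS:\n"
        "Patient with hypertension, syncope, and spinal stenosis - for recheck.")

first = trimmed_span(text, 0, 39)           # untrimmed span of the title
second = trimmed_span(text, 39, len(text))  # untrimmed span of the next chunk
print(first, second)
```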

### ▶ keepSeparators

Whether to keep the separators in the final result, by default `True`.

In [None]:
character_text_splitter = DocumentCharacterTextSplitter() \
    .setInputCols(["document"]) \
    .setOutputCol("splits") \
    .setChunkSize(100) \
    .setChunkOverlap(10) \
    .setExplodeSplits(True)\
    .setTrimWhitespace(False)\
    .setKeepSeparators(False)

pipeline = Pipeline()\
    .setStages([
        document_assembler,
        character_text_splitter
])

result = pipeline.fit(textDF).transform(textDF)

result.selectExpr(
      "splits.result[0] as split",
      "splits[0].begin as begin",
      "splits[0].end as end",
      "splits[0].end - splits[0].begin as length").show(truncate = 120)

+-----------------------------------------------------------------------------------------------+-----+---+------+
|                                                                                          split|begin|end|length|
+-----------------------------------------------------------------------------------------------+-----+---+------+
|                                                          (Medical Transcription Sample Report)|    0| 39|    39|
|       PRESENT ILLNESS:\nPatient with hypertension, syncope, and spinal stenosis - for recheck.|   41|128|    87|
|                                                                                    SUBJECTIVE:|  130|141|    11|
|  The patient is a 78-year-old female who returns for recheck. She has hypertension. She denies|  142|235|    93|
|     She denies difficulty with chest pain, palpations, orthopnea, nocturnal dyspnea, or edema.|  225|315|    90|
|                     MEDICAL HISTORY:\nReviewed and unchanged from the dictatio

### ▶ patternsAreRegex

Whether to interpret the split patterns as regular expressions, by default `False`.

In [None]:
character_text_splitter = DocumentCharacterTextSplitter() \
    .setInputCols(["document"]) \
    .setOutputCol("splits") \
    .setChunkSize(100) \
    .setChunkOverlap(10) \
    .setExplodeSplits(True)\
    .setPatternsAreRegex(True)\
    .setSplitPatterns(["(?:\n\\s*)"])\
    .setTrimWhitespace(True)\
    .setKeepSeparators(True)


pipeline = Pipeline()\
    .setStages([
        document_assembler,
        character_text_splitter
])

result = pipeline.fit(textDF).transform(textDF)

result.selectExpr(
      "splits.result[0] as split",
      "splits[0].begin as begin",
      "splits[0].end as end",
      "splits[0].end - splits[0].begin as length").show(truncate = 120)

+------------------------------------------------------------------------------------------------------------------------+-----+---+------+
|                                                                                                                   split|begin|end|length|
+------------------------------------------------------------------------------------------------------------------------+-----+---+------+
|                                                               (Medical Transcription Sample Report)\n\nPRESENT ILLNESS:|    2| 57|    55|
|                                   Patient with hypertension, syncope, and spinal stenosis - for recheck.\n\nSUBJECTIVE:|   58|141|    83|
|\nThe patient is a 78-year-old female who returns for recheck. She has hypertension. She denies difficulty with chest...|  141|315|   174|
|                              MEDICAL HISTORY:\nReviewed and unchanged from the dictation on 12/03/2003.\n\nMEDICATIONS:|  317|404|    87|
|\nAtenolol 50 mg da
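
To see what a pattern like `\n\s*` matches compared with a literal `"\n"`, the sketch below uses Python's `re` module as a stand-in. Spark NLP evaluates the pattern with the JVM regex engine, so this only illustrates the pattern itself, on a made-up sample string.

```python
import re

sample = "A:\nfoo\n\n  B:\nbar"

# Literal split: every single newline is a boundary, so blank lines and
# leading indentation survive as separate or padded pieces.
literal = sample.split("\n")

# Regex split: a newline plus any following whitespace is one boundary.
regex = re.split(r"\n\s*", sample)

print(literal)
print(regex)
```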

## Testing

**Now let's make another pipeline to see if this actually works!**



Let's create a Spark NLP pipeline following the same stages as before:

In [None]:
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

character_text_splitter = DocumentCharacterTextSplitter() \
    .setInputCols("document") \
    .setOutputCol("splits") \
    .setChunkSize(20) \
    .setChunkOverlap(5) \
    .setExplodeSplits(True) \
    .setPatternsAreRegex(False) \
    .setKeepSeparators(True) \
    .setSplitPatterns(["\n\n", "\n", " ", ""]) \
    .setTrimWhitespace(True)

pipeline = Pipeline()\
    .setStages([
        document_assembler,
        character_text_splitter
])

Let's get the data ready.

In [None]:
df = spark.createDataFrame([
    [("All emotions, and that\none particularly, were abhorrent to his cold, "
      "precise but\nadmirably balanced mind.\n\nHe was, I take it, the most "
      "perfect\nreasoning and observing machine that the world has seen.")]
]).toDF("text")

In [None]:
result_df = pipeline.fit(df).transform(df)
result_df.selectExpr(
      "splits.result[0] as split",
      "splits[0].begin as begin",
      "splits[0].end as end",
      "splits[0].end - splits[0].begin as length").show(truncate = 80)

+-------------------+-----+---+------+
|              split|begin|end|length|
+-------------------+-----+---+------+
|  All emotions, and|    0| 17|    17|
|           and that|   14| 22|     8|
|  one particularly,|   23| 40|    17|
|  were abhorrent to|   41| 58|    17|
|       to his cold,|   56| 68|    12|
|        precise but|   69| 80|    11|
| admirably balanced|   81| 99|    18|
|              mind.|  100|105|     5|
| He was, I take it,|  107|125|    18|
|       it, the most|  122|134|    12|
|       most perfect|  130|142|    12|
|      reasoning and|  143|156|    13|
|      and observing|  153|166|    13|
|   machine that the|  167|183|    16|
|the world has seen.|  180|199|    19|
+-------------------+-----+---+------+



**Evaluation**

In [None]:
results = result_df.select("splits").collect()

splits = [row["splits"][0].result.replace("\n\n", " ").replace("\n", " ") for row in results]

In [None]:
expected = [
    "All emotions, and",
    "and that",
    "one particularly,",
    "were abhorrent to",
    "to his cold,",
    "precise but",
    "admirably balanced",
    "mind.",
    "He was, I take it,",
    "it, the most",
    "most perfect",
    "reasoning and",
    "and observing",
    "machine that the",
    "the world has seen.",
]

splits == expected

True

## With DocumentNormalizer

In [None]:
!wget -q https://github.com/JohnSnowLabs/spark-nlp/blob/587f79020de7bc09c2b2fceb37ec258bad57e425/src/test/resources/spell/sherlockholmes.txt  -P ./

In [None]:
unnormalized_textDF = spark.read.text("sherlockholmes.txt", wholetext=True).toDF("text")

unnormalized_textDF.collect()[0]["text"][:1000]

'\n\n\n\n\n\n<!DOCTYPE html>\n<html\n  lang="en"\n  \n  data-color-mode="auto" data-light-theme="light" data-dark-theme="dark"\n  data-a11y-animated-images="system" data-a11y-link-underlines="true"\n  >\n\n\n\n\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n  \n\n\n  <link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/light-f13f84a2af0d.css" /><link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/dark-1ee85695b584.css" /><link data-color-theme="dark_dimme

In [None]:
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

cleanUpPatterns = ["""(<[^>]*>)"""]

document_normalizer = DocumentNormalizer() \
    .setInputCols("document") \
    .setOutputCol("normalize_document") \
    .setAction("clean") \
    .setPatterns(cleanUpPatterns) \
    .setReplacement(" ") \
    .setPolicy("pretty_all") \
    .setLowercase(False)

character_text_splitter = DocumentCharacterTextSplitter() \
    .setInputCols(["normalize_document"]) \
    .setOutputCol("splits") \
    .setChunkSize(2000) \
    .setChunkOverlap(20) \
    .setExplodeSplits(True)

pipeline = Pipeline()\
    .setStages([
        document_assembler,
        document_normalizer,
        character_text_splitter
])

In [None]:
result = pipeline.fit(unnormalized_textDF).transform(unnormalized_textDF)

result.selectExpr(
      "splits.result[0] as split",
      "splits[0].begin as begin",
      "splits[0].end as end",
      "splits[0].end - splits[0].begin as length").show(truncate = 100)

+----------------------------------------------------------------------------------------------------+-----+-----+------+
|                                                                                               split|begin|  end|length|
+----------------------------------------------------------------------------------------------------+-----+-----+------+
|{"locale":"en","featureFlags":["code_vulnerability_scanning","copilot_conversational_ux_history_r...|    1| 1999|  1998|
|You switched accounts on another tab or window. Reload to refresh your session. Dismiss alert {{ ...| 1987| 2343|   356|
| {"payload":{"allShortcutsEnabled":false,"fileTree":{"src/test/resources/spell":{"items":[{"name"...| 2343|10181|  7838|
|Gutenberg's The Adventures of Sherlock Holmes, by Arthur Conan Doyle","","This eBook is for the u...|10182|12179|  1997|
|actions. But for the trained reasoner","to admit such intrusions into his own delicate and finely...|12163|14160|  1997|
|practice), when my way 
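
The clean-up pattern `(<[^>]*>)` only targets markup tags, which is why JSON payloads embedded in the page still appear in the first chunks above. The sketch below applies the same pattern with Python's `re` module to made-up strings. It illustrates the pattern only and does not reproduce `DocumentNormalizer` or its `pretty_all` policy.

```python
import re

TAG_PATTERN = r"<[^>]*>"  # same expression as in cleanUpPatterns, without the group

html = "<p>The <b>Adventures</b> of Sherlock Holmes</p>"
no_tags = re.sub(TAG_PATTERN, " ", html)   # replace every tag with a space
cleaned = " ".join(no_tags.split())        # collapse the leftover whitespace
print(cleaned)

# Text that contains no angle-bracket tags passes through unchanged.
json_blob = '{"locale":"en"}'
untouched = re.sub(TAG_PATTERN, " ", json_blob)
print(untouched)
```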

# DocumentTokenSplitter Model

## **📜 Background**

`DocumentTokenSplitter`: Annotator that splits large documents into smaller documents based on the number of tokens in the text.

Currently, DocumentTokenSplitter splits the text by whitespaces to create the tokens. The number of these tokens will then be used as a measure of the text length. In the future, other tokenization techniques will be supported.
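
The idea of measuring length in whitespace tokens can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the annotator's implementation: the function `token_windows` is hypothetical, and the real annotator may place window boundaries differently.

```python
def token_windows(text, num_tokens, token_overlap=0):
    """Group whitespace tokens into windows of at most num_tokens tokens."""
    if token_overlap >= num_tokens:
        raise ValueError("token_overlap must be smaller than num_tokens")
    tokens = text.split()  # whitespace tokenization, as described above
    windows, start = [], 0
    while start < len(tokens):
        windows.append(" ".join(tokens[start:start + num_tokens]))
        if start + num_tokens >= len(tokens):
            break  # this window already reaches the end of the text
        start += num_tokens - token_overlap
    return windows

text = "one two three four five six seven eight nine ten"
windows = token_windows(text, num_tokens=4, token_overlap=1)
print(windows)
```

Each window repeats the last token of the previous one, which mirrors the role of `tokenOverlap` at a much smaller scale.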


**🔗 Helpful Links:**

- Python API: [DocumentTokenSplitter](https://sparknlp.org/api/python/reference/autosummary/sparknlp/annotator/document_token_splitter/index.html#sparknlp.annotator.document_token_splitter.DocumentTokenSplitter)
- Scala API: [DocumentTokenSplitter](https://sparknlp.org/api/com/johnsnowlabs/nlp/annotators/DocumentTokenSplitter)




## **🖨️ Input/Output Annotation Types**

- Input: `DOCUMENT`

- Output: `DOCUMENT`

## **🔎 Parameters**

- `numTokens`: Limit of the number of tokens in a text.
- `tokenOverlap`: Length of the token overlap between text chunks, by default `0`.
- `explodeSplits`: Whether to explode split chunks to separate rows, by default `False`.
- `trimWhitespace`: Whether to trim whitespaces of extracted chunks, by default `True`.


## **💻Pipeline**

In [None]:
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

token_splitter = DocumentTokenSplitter() \
    .setInputCols(["document"]) \
    .setOutputCol("splits") \
    .setNumTokens(512) \
    .setTokenOverlap(10) \
    .setExplodeSplits(True)

pipeline = Pipeline()\
    .setStages([
        document_assembler,
        token_splitter
    ])

In [None]:
result = pipeline.fit(holmesDF).transform(holmesDF)

result.selectExpr(
      "splits.result[0] as result",
      "splits[0].begin as begin",
      "splits[0].end as end",
      "splits[0].end - splits[0].begin as length",
      "splits[0].metadata.numTokens as tokens") \
    .show(8, truncate = 80)

+--------------------------------------------------------------------------------+-----+-----+------+------+
|                                                                          result|begin|  end|length|tokens|
+--------------------------------------------------------------------------------+-----+-----+------+------+
|THE ADVENTURES OF SHERLOCK HOLMESArthur Conan Doyle Table of contents A Scand...|    0| 2940|  2940|   512|
|daily press, I knew little of my former friend and companion. One night--it w...| 2890| 5582|  2692|   512|
|mud from it. Hence, you see, my double deduction that you had been out in vil...| 5529| 8352|  2823|   512|
|processes. "Such paper could not be bought under half a crown a packet. It is...| 8297|11080|  2783|   512|
|akin to bad taste. Heavy bands of astrakhan were slashed across the sleeves a...|11024|13840|  2816|   512|
|Our visitor glanced with some apparent surprise at the languid, lounging figu...|13777|16824|  3047|   512|
|recovered." "We ha

# Medical Use-Case

In [None]:
! wget -q https://github.com/JohnSnowLabs/spark-nlp-workshop/raw/master/healthcare-nlp/data/diabetes_txt_files.zip

In [None]:
import shutil

filename = "./diabetes_txt_files.zip"
extract_dir = "./"
archive_format = "zip"

shutil.unpack_archive(filename, extract_dir, archive_format)

In [None]:
multi_doc = spark.read.text("./diabetes_txt_files", wholetext=True).toDF("text")
multi_doc = multi_doc.withColumn("filename", F.input_file_name())\
                      .withColumn("filename",F.split('filename', '/'))\
                      .withColumn('filename', F.col('filename')[F.size('filename') -1])
multi_doc.show(truncate=100)


+----------------------------------------------------------------------------------------------------+-----------------------+
|                                                                                                text|               filename|
+----------------------------------------------------------------------------------------------------+-----------------------+
|Diabetes mellitus is a group of diseases associated with various metabolic disorders, the main fe...|PMC4020724_abstract.txt|
|Objective: The peer interaction–based online model has been influential in the recent development...|PMC7432193_abstract.txt|
|Gestational diabetes mellitus (GDM) is associated with developing type 2 diabetes, but very few s...|PMC5770032_abstract.txt|
|A diagnosis of diabetes or hyperglycemia should be confirmed prior to ordering, dispensing, or ad...|PMC6104264_abstract.txt|
|The aim of this study was to describe the characteristics and outcomes of pregnancies in a nation...|PMC705437
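
The three `withColumn` calls above reduce the full input path to its last segment. The same steps on a single string look like this in plain Python. The path is a made-up example, since the actual value returned by `input_file_name()` depends on where the files were unpacked.

```python
# Hypothetical value of input_file_name() for one row.
path = "file:///content/diabetes_txt_files/PMC4020724_abstract.txt"

parts = path.split("/")            # F.split('filename', '/')
filename = parts[len(parts) - 1]   # F.col('filename')[F.size('filename') - 1]
print(filename)
```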

In [None]:
!pip install -q langchain

In [None]:
from langchain.document_loaders import PySparkDataFrameLoader
loader = PySparkDataFrameLoader(spark, multi_doc, page_content_column="text")
documents = loader.load()

In [None]:
documents[0]

Document(page_content='Diabetes mellitus is a group of diseases associated with various metabolic disorders, the main feature of which is chronic hyperglycemia due to insufficient insulin action. Its pathogenesis involves both genetic and environmental factors. The long‐term persistence of metabolic disorders can cause susceptibility to specific complications and also foster arteriosclerosis. Diabetes mellitus is associated with a broad range of clinical presentations, from being asymptomatic to ketoacidosis or coma, depending on the degree of metabolic disorder.\nNote: Those that cannot at present be classified as any of the above are called unclassifiable.\nThe occurrence of diabetes‐specific complications has not been confirmed in some of these conditions.\nThe occurrence of diabetes‐specific complications has not been confirmed in some of these conditions.\n\u2002A scheme of the relationship between etiology (mechanism) and patho‐physiological stages (states) of diabetes mellitus. 

In [None]:
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

character_text_splitter = DocumentCharacterTextSplitter() \
    .setInputCols("document") \
    .setOutputCol("splits") \
    .setChunkSize(1000) \
    .setChunkOverlap(50) \
    .setExplodeSplits(True) \
    .setPatternsAreRegex(False) \
    .setKeepSeparators(True) \
    .setSplitPatterns(["\n", " ", ""]) \
    .setTrimWhitespace(True)

pipeline = Pipeline()\
    .setStages([
        document_assembler,
        character_text_splitter
])

In [None]:
result = pipeline.fit(multi_doc).transform(multi_doc)

result.selectExpr(
      "splits.result[0] as split",
      "splits[0].begin as begin",
      "splits[0].end as end",
      "splits[0].end - splits[0].begin as length",
      "filename")\
    .show(50,truncate = 120)

+------------------------------------------------------------------------------------------------------------------------+-----+----+------+-----------------------+
|                                                                                                                   split|begin| end|length|               filename|
+------------------------------------------------------------------------------------------------------------------------+-----+----+------+-----------------------+
|Diabetes mellitus is a group of diseases associated with various metabolic disorders, the main feature of which is ch...|    0| 846|   846|PMC4020724_abstract.txt|
| A scheme of the relationship between etiology (mechanism) and patho‐physiological stages (states) of diabetes mellit...|  847|1715|   868|PMC4020724_abstract.txt|
|The classification of glucose metabolism disorders is principally derived from etiology, and includes staging of path...| 1716|2710|   994|PMC4020724_abstract.txt|
|on the de

# Medical Document Splitter

The [`Medical Document Splitter`](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/30.0.InternalDocumentSplitter.ipynb) annotator offers more flexibility and customization for RAG pipelines.

The Internal Document Splitter is an annotator that breaks long documents into smaller, manageable segments. Users can define custom separators, and the annotator divides the text so that each chunk adheres to the specified length criteria.
