![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# **InternalDocumentSplitter**

This notebook covers the use of `InternalDocumentSplitter`, an annotator designed to split documents into relevant sections.




**📖 Learning Objectives:**

1. Understand how `InternalDocumentSplitter` works.

2. Become comfortable using the parameters of the annotator.


**🔗 Helpful Links:**

- Documentation : [InternalDocumentSplitter](https://nlp.johnsnowlabs.com/docs/en/licensed_annotators#internaldocumentsplitter)


- For extended examples of usage, see [Spark NLP Workshop repository](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/38.InternalDocumentSplitter.ipynb).


## **📜 Background**

This annotator splits large documents into smaller documents. Use the `setSplitMode` method to choose how documents are split.

If `splitMode` is `recursive`, the annotator applies the separators in order and splits any subtext that exceeds the chunk length, optionally overlapping consecutive chunks.

Additionally, you can set:
- custom patterns with `setSplitPatterns`
- whether patterns should be interpreted as regex with `setPatternsAreRegex`
- whether to keep the separators with `setKeepSeparators`
- whether to trim whitespace with `setTrimWhitespace`
- whether to explode the splits into individual rows with `setExplodeSplits`
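
The recursive idea described above can be pictured with a few lines of plain Python. This is only a minimal sketch of the general technique, not the annotator's implementation: `recursive_split` is a hypothetical helper, separators are treated as literal strings, separators are dropped, whitespace is trimmed, and chunk overlap is left out for brevity.

```python
from typing import List


def recursive_split(text: str, separators: List[str], chunk_size: int) -> List[str]:
    """Split on the first separator; re-split oversized pieces with the remaining ones."""
    if len(text) <= chunk_size or not separators:
        # Small enough, or nothing left to split on (the piece may stay oversized).
        return [text]

    head, *rest = separators
    chunks: List[str] = []
    current = ""
    for piece in text.split(head):
        candidate = piece if not current else current + head + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging neighbouring pieces while they fit
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: fall back to the next separator.
            chunks.extend(recursive_split(piece, rest, chunk_size))
    if current:
        chunks.append(current)
    # Mimic trimming: strip whitespace and drop empty chunks.
    return [c.strip() for c in chunks if c.strip()]


sample = "aaa bbb\n\nccc ddd eee"
print(recursive_split(sample, ["\n\n", "\n", " "], chunk_size=8))
```

Paragraph breaks are tried first, then line breaks, then spaces, so a chunk is only cut at a finer boundary when the coarser one leaves it too long.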

## **🎬 Colab Setup**

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp, medical

# After uploading your license, run this to install all licensed Python wheels and pre-download the JARs for the Spark session JVM
nlp.install()

In [None]:
from johnsnowlabs import nlp, medical
import pandas as pd

# Automatically load license data and start a session with all jars user has access to
spark = nlp.start()

👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_8744_521_4_530.json
👌 Launched cpu optimized session with with: 🚀Spark-NLP==5.3.1, 💊Spark-Healthcare==5.3.0, running on ⚡ PySpark==3.4.0


In [None]:
from pyspark.sql import DataFrame
import pyspark.sql.functions as F
import pyspark.sql.types as T
import pyspark.sql as SQL
from pyspark import keyword_only

## **🖨️ Input/Output Annotation Types**

- Input: `DOCUMENT`

- Output: `DOCUMENT`

Optionally, `TOKEN` and one more `DOCUMENT` (sentence) annotation can be supplied as additional inputs.

## **🔎 Parameters**

- `chunkSize`: Size of each chunk of text. This param is applicable only for `recursive` splitMode.
- `chunkOverlap`: Length of the overlap between text chunks, by default `0`. This param is applicable only for `recursive` splitMode.
- `splitPatterns`: Patterns to split the document.
- `patternsAreRegex`: Whether to interpret the split patterns as regular expressions, by default `True`.
- `keepSeparators`: Whether to keep the separators in the final result, by default `True`. This param is applicable only for `recursive` splitMode.
- `explodeSplits`: Whether to explode split chunks to separate rows, by default `False`.
- `trimWhitespace`: Whether to trim whitespace from extracted chunks, by default `True`.
- `splitMode`: The split mode to determine how text should be segmented. Default: 'regex'. It should be one of the following values:
  - "char": Split text based on individual characters.
  - "token": Split text based on tokens. You should supply tokens from inputCols.
  - "sentence": Split text based on sentences. You should supply sentences from inputCols.
  - "recursive": Split text recursively, applying the split patterns in order until the chunks fit within `chunkSize`.
  - "regex": Split text based on a regular expression pattern.
- `sentenceAwareness`: Whether to respect sentence boundaries when splitting, if possible.
  - If true, a split can end before `maxLength` is reached so that sentences stay intact.
  - If true, you should supply sentences from inputCols. Default: `False`.
  - This param is not applicable for `regex` and `recursive` splitMode.
- `maxLength`: The maximum length allowed for splitting. The unit and the default value depend on `splitMode`:
  - "char": Maximum length is measured in characters. Default: `512`
  - "token": Maximum length is measured in tokens. Default: `128`
  - "sentence": Maximum length is measured in sentences. Default: `8`
- `customBoundsStrategy`: The custom bounds strategy for text splitting using regular expressions. This param is applicable only for `regex` splitMode.
- `caseSensitive`: Whether regex matching is case sensitive, by default `False`. This param is applicable only for `regex` splitMode.
- `metaDataFields`: Columns whose values are added to the metadata of the split documents. Set this to the names of the columns to read.
- `enableSentenceIncrement`: Whether the sentence index should be incremented in the metadata of the annotator. When set to true, the annotator increments the sentence index in the metadata for each split document. Default: `False`.

## **💻 Pipeline**

### `inputCols` and `outputCol`

In [None]:
text = """(Medical Transcription Sample Report)

PRESENT ILLNESS:
Patient with hypertension, syncope, and spinal stenosis - for recheck.

SUBJECTIVE:
The patient is a 78-year-old female who returns for recheck. She has hypertension. She denies difficulty with chest pain, palpations, orthopnea, nocturnal dyspnea, or edema.

MEDICAL HISTORY:
Reviewed and unchanged from the dictation on 12/03/2003.

MEDICATIONS:
Atenolol 50 mg daily, Premarin 0.625 mg daily, calcium with vitamin D two to three pills daily, multivitamin daily, aspirin as needed, and TriViFlor 25 mg two pills daily.
She also has Elocon cream 0.1% and Synalar cream 0.01% that she uses as needed for rash."""

textDF = spark.createDataFrame([[text]]).toDF("text")

In [None]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

document_splitter = medical.InternalDocumentSplitter() \
    .setInputCols(["document"]) \
    .setOutputCol("splits") \
    .setExplodeSplits(True)

pipeline = nlp.Pipeline().setStages([
    document_assembler,
    document_splitter
])

# show() returns None, so display the result without rebinding `pipeline`
pipeline.fit(textDF).transform(textDF).select("splits").show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|splits                                                                                                                                                                                                                                                                                                                                                                                       |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

## Recursive Mode

Recursive mode supports the behavior of Spark NLP's `DocumentCharacterTextSplitter`, which splits large documents into smaller chunks. The splitter takes a list of separators, applies them in order, and splits any subtext that exceeds the chunk length, optionally overlapping consecutive chunks. It is inspired by the `CharacterTextSplitter` and `RecursiveCharacterTextSplitter` implementations in the `LangChain` library, and it is intended to be optimized, production-ready, and scalable.

In [None]:
df = spark.createDataFrame([[(
    "The patient is a 28-year-old, who is status post gastric bypass surgery"
    " nearly one year ago. \nHe has lost about 200 pounds and was otherwise doing well"
    " until yesterday evening around 7:00-8:00 when he developed nausea and right upper quadrant pain,"
    " which apparently wrapped around toward his right side and back. He feels like he was on it"
    " but has not done so. He has overall malaise and a low-grade temperature of 100.3."
    " \n\nHe denies any prior similar or lesser symptoms. His last normal bowel movement was yesterday."
    " He denies any outright chills or blood per rectum."
)]]).toDF("text")


document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

document_splitter = medical.InternalDocumentSplitter()\
    .setInputCols("document")\
    .setOutputCol("splits")\
    .setSplitMode("recursive")\
    .setChunkSize(100)\
    .setChunkOverlap(3)\
    .setExplodeSplits(True)\
    .setPatternsAreRegex(False)\
    .setSplitPatterns(["\n\n", "\n", " "])\
    .setKeepSeparators(False)\
    .setTrimWhitespace(True)

pipeline = nlp.Pipeline().setStages([
    document_assembler,
    document_splitter
])

pipeline_df = pipeline.fit(df).transform(df).select("splits").show(truncate=False)

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|splits                                                                                                                                                                                                       |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{document, 0, 92, The patient is a 28-year-old, who is status post gastric bypass surgery nearly one year ago., {sentence -> 0, document -> 0, uuid -> 6d8207f9-ef43-4b81-9055-030cf8c0483d}, []}]          |
|[{document, 94, 192, He has lost about 200 pounds and was otherwise doing well until yesterday evening around 7:00-8:00, {sentence -> 0, document -> 1, uuid -> 811fcac

## Regex Mode

In [None]:
data = """
Beyond OpenAI in Commercial LLM Landscape
Exploring the Innovators and Challengers in the Commercial LLM Landscape beyond OpenAI: Anthropic, Cohere, Mosaic ML, Cerebras, Aleph Alpha, AI21 Labs and John Snow Labs.
Veysel Kocaman
John Snow Labs
Veysel Kocaman

This blog post explores the emerging players in the commercial large language model (LLM) landscape, namely Anthropic, Cohere, Mosaic ML, Cerebras, Aleph Alpha, AI21 Labs and John Snow Labs. While OpenAI is well-known, these companies bring fresh ideas and tools to the LLM world. We discuss their unique offerings, compliance with the EU AI Act, pricing, and performance on various tasks.

In the burgeoning world of artificial intelligence, large language models (LLMs) are the new vanguard, shaping how we interact with machines and expanding the boundaries of what technology can achieve. As the field evolves, a dynamic set of companies have emerged, each contributing unique perspectives and solutions to the landscape. They range from established tech giants flexing their AI muscles, to innovative start-ups pushing the boundaries of what’s possible.

This landscape is a vibrant blend of commercial entities and open-source advocates, with a wealth of diversity in their origin stories, funding, and the models they have developed. From licensed models delivered via APIs to open-source alternatives available for local deployment, the offerings span a broad spectrum, meeting the varied needs of developers, businesses, and researchers worldwide.

In this blog post, we will dive into the fascinating ecosystem of LLM companies. We’ll start with an overview, presenting a snapshot of the current landscape. We’ll then delve into more detailed profiles of each major player, exploring their unique contributions, the models they’ve brought to life, and the strategic decisions that have shaped their paths. So, whether you’re an AI enthusiast, a developer navigating the LLM waters, or just a curious mind, join us as we journey through the bustling landscape of LLM companies.

The landscape of large language models (LLMs) companies
The landscape of large language models (LLMs) companies is diverse, featuring both well-established organizations and dynamic newcomers. Dominating the industry are leading LLM companies, such as OpenAI, which was founded in 2015 and has accumulated $11.3 billion in funding by June 2023. Known for their GPT-3.5 and GPT-4 (ChatGPT) models, OpenAI provides access to these tools through a licensed API.
"""
mediumDF = spark.createDataFrame([[data]]).toDF("text")

**default regex**

In [None]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

document_splitter = medical.InternalDocumentSplitter()\
    .setInputCols("document")\
    .setOutputCol("splits")\
    .setSplitMode("regex")\
    .setExplodeSplits(True)

pipeline = nlp.Pipeline().setStages([
    document_assembler,
    document_splitter
])

pipeline_df = pipeline.fit(mediumDF).transform(mediumDF).select("splits").show(truncate=False)

+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|splits                                                                                                                                                                                                                                                                                                                                                                  

**Custom Regex Split Patterns**

In [None]:
# with SplitPatterns
document_splitter = medical.InternalDocumentSplitter()\
    .setInputCols("document")\
    .setOutputCol("splits")\
    .setSplitMode("regex")\
    .setSplitPatterns(["\n\n","\n"])\
    .setExplodeSplits(True)

pipeline = nlp.Pipeline().setStages([
    document_assembler,
    document_splitter
])

pipeline_df = pipeline.fit(mediumDF).transform(mediumDF).select("splits").show(truncate=False)

+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|splits                                                                                                                                                                                                                                                                                                                                                                  

**Custom Phrase/Words Split Patterns**


In [None]:
text = """(Medical Transcription Sample Report)

PRESENT ILLNESS:
Patient with hypertension, syncope, and spinal stenosis - for recheck.

SUBJECTIVE:
The patient is a 78-year-old female who returns for recheck. She has hypertension. She denies difficulty with chest pain, palpations, orthopnea, nocturnal dyspnea, or edema.

MEDICAL HISTORY:
Reviewed and unchanged from the dictation on 12/03/2003.

MEDICATIONS:
Atenolol 50 mg daily, Premarin 0.625 mg daily, calcium with vitamin D two to three pills daily, multivitamin daily, aspirin as needed, and TriViFlor 25 mg two pills daily.
She also has Elocon cream 0.1% and Synalar cream 0.01% that she uses as needed for rash."""

textDF = spark.createDataFrame([[text]]).toDF("text")

In [None]:
document_splitter = medical.InternalDocumentSplitter()\
    .setInputCols("document")\
    .setOutputCol("splits")\
    .setSplitMode("regex")\
    .setSplitPatterns(["PRESENT ILLNESS:", "SUBJECTIVE:", "MEDICAL HISTORY:", "MEDICATIONS:"])\
    .setCaseSensitive(True) \
    .setExplodeSplits(True)

pipeline = nlp.Pipeline().setStages([
    document_assembler,
    document_splitter
])

pipeline_df = pipeline.fit(textDF).transform(textDF).select("splits").show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|splits                                                                                                                                                                                                                                                                                                                                                                                       |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

**setCustomBoundsStrategy** can be `none`, `prepend`, or `append`. The default is `none`.
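
To make the three strategies easier to picture before running the cells, here is a small standalone sketch using Python's `re` module. It is not the annotator's code. It assumes, based only on the option names, that `prepend` keeps the matched pattern at the start of the following split, `append` keeps it at the end of the preceding split, and `none` drops it. `split_with_bounds` and its arguments are hypothetical names.

```python
import re
from typing import List


def split_with_bounds(text: str, patterns: List[str], strategy: str = "none",
                      case_sensitive: bool = False) -> List[str]:
    """Split `text` at every pattern match, placing the match according to `strategy`."""
    flags = 0 if case_sensitive else re.IGNORECASE
    regex = re.compile("|".join(f"(?:{p})" for p in patterns), flags)

    chunks: List[str] = []
    start = 0
    for match in regex.finditer(text):
        if strategy == "prepend":    # assumed: match opens the next split
            cut, next_start = match.start(), match.start()
        elif strategy == "append":   # assumed: match closes the previous split
            cut, next_start = match.end(), match.end()
        else:                        # "none": the match itself is discarded
            cut, next_start = match.start(), match.end()
        chunks.append(text[start:cut])
        start = next_start
    chunks.append(text[start:])
    # Trim whitespace and drop empty chunks.
    return [c.strip() for c in chunks if c.strip()]


note = "Intro. HISTORY: none. PLAN: rest."
for mode in ("none", "prepend", "append"):
    print(mode, split_with_bounds(note, ["HISTORY:", "PLAN:"], mode, case_sensitive=True))
```

With `case_sensitive=True` and lowercase patterns, nothing matches the uppercase headers and the text comes back as a single split, which is the situation the `setCaseSensitive` cells further down explore.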

In [None]:
document_splitter = medical.InternalDocumentSplitter()\
    .setInputCols("document")\
    .setOutputCol("splits")\
    .setSplitMode("regex")\
    .setSplitPatterns(["PRESENT ILLNESS:", "SUBJECTIVE:", "MEDICAL HISTORY:", "MEDICATIONS:"])\
    .setCaseSensitive(True) \
    .setCustomBoundsStrategy("prepend")\
    .setExplodeSplits(True)

pipeline = nlp.Pipeline().setStages([
    document_assembler,
    document_splitter
])

pipeline_df = pipeline.fit(textDF).transform(textDF).select("splits").show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|splits                                                                                                                                                                                                                                                                                                                                                                                       |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [None]:
document_splitter = medical.InternalDocumentSplitter()\
    .setInputCols("document")\
    .setOutputCol("splits")\
    .setSplitMode("regex")\
    .setSplitPatterns(["PRESENT ILLNESS:", "SUBJECTIVE:", "MEDICAL HISTORY:", "MEDICATIONS:"])\
    .setCaseSensitive(True) \
    .setCustomBoundsStrategy("append")\
    .setExplodeSplits(True)

pipeline = nlp.Pipeline().setStages([
    document_assembler,
    document_splitter
])

pipeline_df = pipeline.fit(textDF).transform(textDF).select("splits").show(truncate=False)

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|splits                                                                                                                                                                                                                                                                                                                                                                         |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [None]:
document_splitter = medical.InternalDocumentSplitter()\
    .setInputCols("document")\
    .setOutputCol("splits")\
    .setSplitMode("regex")\
    .setSplitPatterns(["PRESENT ILLNESS:", "SUBJECTIVE:", "MEDICAL HISTORY:", "MEDICATIONS:"])\
    .setCaseSensitive(True) \
    .setCustomBoundsStrategy("none")\
    .setExplodeSplits(True)

pipeline = nlp.Pipeline().setStages([
    document_assembler,
    document_splitter
])

pipeline_df = pipeline.fit(textDF).transform(textDF).select("splits").show(truncate=False)

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|splits                                                                                                                                                                                                                                                                                                                                                                         |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

**setCaseSensitive** True

In [None]:
# Lowercase patterns with case-sensitive matching: the uppercase section headers should not match
document_splitter = medical.InternalDocumentSplitter()\
    .setInputCols("document")\
    .setOutputCol("splits")\
    .setSplitMode("regex")\
    .setSplitPatterns(["present illness:", "subjective:", "medical history:", "medications:"])\
    .setCaseSensitive(True) \
    .setCustomBoundsStrategy("prepend")\
    .setExplodeSplits(True)

pipeline = nlp.Pipeline().setStages([
    document_assembler,
    document_splitter
])

pipeline_df = pipeline.fit(textDF).transform(textDF).select("splits").show(truncate=False)

+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|splits                                                                                                                                                                                                                  

**setCaseSensitive** False

In [None]:
# Lowercase patterns with case-insensitive matching: the uppercase section headers should match
document_splitter = medical.InternalDocumentSplitter()\
    .setInputCols("document")\
    .setOutputCol("splits")\
    .setSplitMode("regex")\
    .setSplitPatterns(["present illness:", "subjective:", "medical history:", "medications:"])\
    .setCaseSensitive(False) \
    .setCustomBoundsStrategy("prepend")\
    .setExplodeSplits(True)

pipeline = nlp.Pipeline().setStages([
    document_assembler,
    document_splitter
])

pipeline_df = pipeline.fit(textDF).transform(textDF).select("splits").show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|splits                                                                                                                                                                                                                                                                                                                                                                                       |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

## Char Mode
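
As a rough mental model of `char` mode, the text is cut into consecutive windows of at most `maxLength` characters. The sketch below is a plain-Python illustration under that assumption. `split_by_chars` is a hypothetical helper, and its half-open `(begin, end)` offsets are a choice made here, not necessarily the convention used in the annotation output.

```python
from typing import List, Tuple


def split_by_chars(text: str, max_length: int) -> List[Tuple[int, int, str]]:
    """Return (begin, end, chunk) triples for fixed-size character windows."""
    return [
        (i, min(i + max_length, len(text)), text[i:i + max_length])
        for i in range(0, len(text), max_length)
    ]


for begin, end, chunk in split_by_chars("abcdefghij", 4):
    print(begin, end, repr(chunk))
```

Because the cut positions ignore word and sentence boundaries, a word can end up divided between two splits, as with "Sch" and "olars" in the output of the cell in this section.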


In [None]:
ai = """AI advancements impact fields, improving data analysis. Ethical concerns, like privacy and bias, shape academic discussions.
Scholars explore AI's responsible development. Ongoing research navigates evolving challenges."""

df = spark.createDataFrame([[ai]]).toDF("text")

In [None]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

document_splitter = medical.InternalDocumentSplitter()\
    .setInputCols("document")\
    .setOutputCol("splits")\
    .setSplitMode("char")\
    .setMaxLength(128)\
    .setExplodeSplits(True)

pipeline = nlp.Pipeline().setStages([
    document_assembler,
    document_splitter
])

pipeline_df = pipeline.fit(df).transform(df).select("splits").show(truncate=False)

+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|splits                                                                                                                                                                                                                                   |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{document, 0, 128, AI advancements impact fields, improving data analysis. Ethical concerns, like privacy and bias, shape academic discussions.\nSch, {sentence -> 0, document -> 0, uuid -> a4574d33-820a-4d78-a821-967f119333d1}, []}]|
|[{document, 128, 219, olars explore AI's responsible de

**sentenceAwareness**
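
With sentence awareness, a split may close before `maxLength` is reached so that no sentence is cut in half. One way to picture this is greedy packing of whole sentences, sketched below in plain Python. This is an illustration of the idea only: `pack_sentences` is a hypothetical helper, sentences are joined with a single space, and the naive regex split stands in for a real sentence detector.

```python
import re
from typing import List


def pack_sentences(sentences: List[str], max_length: int) -> List[str]:
    """Greedily group whole sentences so each chunk stays within max_length characters."""
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = sentence if not current else current + " " + sentence
        if current and len(candidate) > max_length:
            chunks.append(current)   # close the chunk early instead of cutting a sentence
            current = sentence
        else:
            current = candidate      # a single over-long sentence is kept whole
    if current:
        chunks.append(current)
    return chunks


text = "aa bb. cc dd. ee ff."
sentences = re.split(r"(?<=[.!?])\s+", text)  # stand-in for a sentence detector
print(pack_sentences(sentences, max_length=13))
```

This matches the shape of the output in this section, where the first split ends at offset 124 rather than at the 128-character limit.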

In [None]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = nlp.SentenceDetector()\
    .setInputCols("document")\
    .setOutputCol("sentence")

document_splitter = medical.InternalDocumentSplitter()\
    .setInputCols("document", "sentence")\
    .setOutputCol("splits")\
    .setSplitMode("char")\
    .setMaxLength(128)\
    .setSentenceAwareness(True)\
    .setExplodeSplits(True)

pipeline = nlp.Pipeline().setStages([
    document_assembler,
    sentence_detector,
    document_splitter
])

pipeline_df = pipeline.fit(df).transform(df).select("splits").show(truncate=False)

+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|splits                                                                                                                                                                                                                              |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{document, 0, 124, AI advancements impact fields, improving data analysis. Ethical concerns, like privacy and bias, shape academic discussions., {sentence -> 0, document -> 0, uuid -> 7e60fd98-512f-464d-955f-21b85466376d}, []}]|
|[{document, 125, 219, Scholars explore AI's responsible development. Ongoin

## Token Mode
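
In `token` mode, `maxLength` counts tokens rather than characters. The following sketch shows the idea in plain Python: group every `max_tokens` tokens and slice the original text from the first token's start to the last token's end. `split_by_tokens` and `TOKEN_RE` are hypothetical, and the regex is only a rough stand-in for the `Tokenizer` annotator, which may tokenize differently.

```python
import re
from typing import List

# Rough stand-in tokenizer: runs of word characters/apostrophes, or single punctuation marks.
TOKEN_RE = re.compile(r"[\w']+|[^\w\s]")


def split_by_tokens(text: str, max_tokens: int) -> List[str]:
    """Cut `text` into chunks that each cover at most `max_tokens` tokens."""
    spans = [m.span() for m in TOKEN_RE.finditer(text)]
    chunks: List[str] = []
    for i in range(0, len(spans), max_tokens):
        window = spans[i:i + max_tokens]
        # Slice the original text so spacing inside the chunk is preserved.
        chunks.append(text[window[0][0]:window[-1][1]])
    return chunks


print(split_by_tokens("a b, c d. e", max_tokens=3))
```

Punctuation counts as a token here, so a chunk can end on a comma in the middle of a sentence, similar to the first split in the output of the next cell.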

In [None]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = nlp.Tokenizer()\
    .setInputCols("document")\
    .setOutputCol("token")

document_splitter = medical.InternalDocumentSplitter()\
    .setInputCols("document", "token")\
    .setOutputCol("splits")\
    .setSplitMode("token")\
    .setMaxLength(12)\
    .setExplodeSplits(True)

pipeline = nlp.Pipeline().setStages([
    document_assembler,
    tokenizer,
    document_splitter
])

pipeline_df = pipeline.fit(df).transform(df).select("splits").show(truncate=False)

+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|splits                                                                                                                                                                            |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{document, 0, 73, AI advancements impact fields, improving data analysis. Ethical concerns,, {sentence -> 0, document -> 0, uuid -> b47856d8-ffdd-40fd-9fac-b67dae8100f2}, []}]  |
|[{document, 74, 146, like privacy and bias, shape academic discussions.\nScholars explore AI's, {sentence -> 0, document -> 1, uuid -> 99b40869-b0a3-4504-9671-577517b33e05}, []}]|
|[{document, 147, 219, responsible development. Ongoing research navigates evolving challenges.

**sentenceAwareness**

In [None]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = nlp.SentenceDetector()\
    .setInputCols("document")\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols("document")\
    .setOutputCol("token")

document_splitter = medical.InternalDocumentSplitter()\
    .setInputCols("document", "sentence", "token")\
    .setOutputCol("splits")\
    .setSplitMode("token")\
    .setSentenceAwareness(True)\
    .setMaxLength(12)\
    .setExplodeSplits(True)

pipeline = nlp.Pipeline().setStages([
    document_assembler,
    sentence_detector,
    tokenizer,
    document_splitter
])


pipeline_df = pipeline.fit(df).transform(df).select("splits").show(truncate=False)

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|splits                                                                                                                                                                                                  |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{document, 0, 55, AI advancements impact fields, improving data analysis., {sentence -> 0, document -> 0, uuid -> 51e6717b-9fd6-49a8-bf70-110aa4fa44ef}, []}]                                          |
|[{document, 56, 124, Ethical concerns, like privacy and bias, shape academic discussions., {sentence -> 0, document -> 1, uuid -> ff3ab5c4-dc2b-4238-b5c0-7ccf6e5523f7}, []}]              

## Sentence Mode

In [None]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector_dl = nlp.SentenceDetectorDLModel\
    .pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")\
    .setInputCols("document")\
    .setOutputCol("sentence")

document_splitter = medical.InternalDocumentSplitter() \
    .setInputCols(["document", "sentence"]) \
    .setOutputCol("splits") \
    .setSplitMode("sentence") \
    .setMaxLength(4)\
    .setExplodeSplits(True)

pipeline = nlp.Pipeline().setStages([
    document_assembler,
    sentence_detector_dl,
    document_splitter
])

# .show() prints the table and returns None, so there is nothing useful to assign.
pipeline.fit(mediumDF).transform(mediumDF).select("splits").show(truncate=False)

sentence_detector_dl_healthcare download started this may take some time.
Approximate size to download 367.3 KB
[OK!]
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|splits                                                                                                                  

## Metadata Fields

**setMetaDataFields**

`setMetaDataFields` takes a list of column names from the input DataFrame and copies the value of each listed column into the metadata of every split document. The names must match existing columns exactly.
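
As a rough illustration of that idea, the plain-Python sketch below copies selected column values from one input row into the metadata of each split. It is not the annotator's code: `attach_metadata` is a hypothetical helper, the split texts are placeholders, and the `document` index key is an assumption made for this example.

```python
def attach_metadata(row, split_texts, fields):
    """Return one dict per split, each carrying the chosen row columns."""
    # Values are stringified because annotation metadata is a string-to-string map.
    extra = {name: str(row[name]) for name in fields}
    return [
        {"result": text, "metadata": {"document": str(index), **extra}}
        for index, text in enumerate(split_texts)
    ]


# One row shaped like the mt_data.csv sample, with placeholder split texts.
row = {
    "PATIENT_ID": "#99373",
    "medical_speciality": "Surgery",
    "file_name": "Surgery_1066.txt",
}
splits = attach_metadata(
    row,
    ["first split text", "second split text"],
    ["PATIENT_ID", "medical_speciality", "file_name"],
)

for split in splits:
    print(split["result"], split["metadata"])
```

Every split from the same row ends up with the same `PATIENT_ID`, `medical_speciality`, and `file_name` values, so each chunk can be traced back to its source record.
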

In [None]:
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/healthcare-nlp/data/mt_data.csv -O ./mt_data.csv

In [None]:
mt_data_df = spark.createDataFrame(pd.read_csv("mt_data.csv").sample(10))
mt_data_df.show()

+----------+--------------------+--------------------+--------------------+
|PATIENT_ID|  medical_speciality|           file_name|                text|
+----------+--------------------+--------------------+--------------------+
|    #99373|             Surgery|    Surgery_1066.txt|\nMedical Special...|
|    #92885|             Urology|      Urology_46.txt|\nMedical Special...|
|    #89507|Cardiovascular_Pu...|Cardiovascular_Pu...|\nMedical Special...|
|    #59823|Cardiovascular_Pu...|Cardiovascular_Pu...|\nMedical Special...|
|    #99164|Consult_History_a...|Consult_History_a...|\nMedical Special...|
|    #43503|        Neurosurgery| Neurosurgery_57.txt|\nMedical Special...|
|    #72216|    General_Medicine|General_Medicine_...|\nMedical Special...|
|    #84082|        Chiropractic| Chiropractic_05.txt|\nMedical Special...|
|    #62923|Cardiovascular_Pu...|Cardiovascular_Pu...|\nMedical Special...|
|    #16590| Hematology_Oncology|Hematology_Oncolo...|\nMedical Special...|
+----------+--------------------+--------------------+--------------------+

In [None]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector_dl = nlp.SentenceDetectorDLModel\
    .pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")\
    .setInputCols("document")\
    .setOutputCol("sentence")

document_splitter = medical.InternalDocumentSplitter() \
    .setInputCols(["document", "sentence"]) \
    .setOutputCol("splits") \
    .setSplitMode("sentence") \
    .setMaxLength(3)\
    .setExplodeSplits(True)\
    .setMetaDataFields(["PATIENT_ID","medical_speciality","file_name"])

pipeline = nlp.Pipeline().setStages([
    document_assembler,
    sentence_detector_dl,
    document_splitter
])

# Do not assign the result back to `pipeline`: .show() returns None and would overwrite it.
pipeline.fit(mt_data_df)\
    .transform(mt_data_df)\
    .selectExpr("splits.result as splits", "splits.metadata as metadata")\
    .show(truncate=False)

sentence_detector_dl_healthcare download started this may take some time.
Approximate size to download 367.3 KB
[OK!]
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|splits                                                                                                                                                                                                                                                                                                                                                  