![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# **ChunkSentenceSplitter**

This notebook covers the parameters and usage of `ChunkSentenceSplitter`, an annotator that splits documents or sentences at the chunks provided. Each resulting part can be labeled with the chunk that produced the split.

**📖 Learning Objectives:**

1. Understand how `ChunkSentenceSplitter` helps when you need to apply different models or analyses to different sections of a document (for example, headers, clauses, or items).

2. Become comfortable using the different parameters of the annotator.


**🔗 Helpful Links:**

- Documentation : [ChunkSentenceSplitter](https://nlp.johnsnowlabs.com/docs/en/licensed_annotators#chunksentencesplitter)

- Python Docs : [ChunkSentenceSplitter](https://nlp.johnsnowlabs.com/licensed/api/python/reference/autosummary/sparknlp_jsl/annotator/chunker/chunk_sentence_splitter/index.html)

- Scala Docs : [ChunkSentenceSplitter](https://nlp.johnsnowlabs.com/licensed/api/com/johnsnowlabs/nlp/annotators/chunker/ChunkSentenceSplitter.html)

- For extended examples of usage, see the [Spark NLP Workshop repository](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/tutorials/Certification_Trainings/Healthcare).

## **🎬 Colab Setup**

In [1]:
!pip install -q johnsnowlabs

In [2]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

Please Upload your John Snow Labs License using the button below


Saving 5.3.3.spark_nlp_for_healthcare.json to 5.3.3.spark_nlp_for_healthcare.json


In [3]:
from johnsnowlabs import nlp, medical

nlp.install()

👌 Detected license file /content/5.3.3.spark_nlp_for_healthcare.json
🚨 Outdated Medical Secrets in license file. Version=5.3.3 but should be Version=5.3.2
🚨 Outdated OCR Secrets in license file. Version=5.1.2 but should be Version=5.3.2
👷 Trying to install compatible secrets. Use nlp.settings.enforce_versions=False if you want to install outdated secrets.
📋 Stored John Snow Labs License in /root/.johnsnowlabs/licenses/license_number_0_for_Spark-Healthcare_Spark-OCR.json
👷 Setting up  John Snow Labs home in /root/.johnsnowlabs, this might take a few minutes.
Downloading 🐍+🚀 Python Library spark_nlp-5.3.2-py2.py3-none-any.whl
Downloading 🐍+💊 Python Library spark_nlp_jsl-5.3.2-py3-none-any.whl
Downloading 🫘+🚀 Java Library spark-nlp-assembly-5.3.2.jar
Downloading 🫘+💊 Java Library spark-nlp-jsl-5.3.2.jar
🙆 JSL Home setup in /root/.johnsnowlabs
👌 Detected license file /content/5.3.3.spark_nlp_for_healthcare.json
👷 Trying to install compatible secrets. Use nlp.settings.enforce_versions=False 

In [4]:
import pandas as pd

spark = nlp.start()
spark

👌 Detected license file /content/5.3.3.spark_nlp_for_healthcare.json
👷 Trying to install compatible secrets. Use nlp.settings.enforce_versions=False if you want to install outdated secrets.
👌 Launched cpu optimized session with with: 🚀Spark-NLP==5.3.2, 💊Spark-Healthcare==5.3.2, running on ⚡ PySpark==3.4.0


## **🖨️ Input/Output Annotation Types**

- Input: `DOCUMENT`, `CHUNK`

- Output: `DOCUMENT`
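
To picture what these annotation types carry, a small stand-in class can help. This is a simplified sketch, not the library's class: the `Annotation` dataclass and its snake_case field names are illustrative assumptions that mirror the fields visible in the annotations printed later in this notebook (type, begin, end, result, metadata). The offsets are copied from those printed outputs.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Annotation:
    """Illustrative stand-in for a Spark NLP annotation (not the real class)."""
    annotator_type: str   # "document" or "chunk" for this annotator's inputs
    begin: int            # offset of the first character
    end: int              # offset of the last character (inclusive)
    result: str           # text covered by the annotation
    metadata: Dict[str, str] = field(default_factory=dict)


# The splitter consumes DOCUMENT annotations plus the CHUNK annotations
# that mark where to split.
document = Annotation("document", 0, 41, "Sample Name: Mesothelioma - Pleural Biopsy")
header = Annotation("chunk", 43, 54, "Description:", {"entity": "Header"})

inputs = [document, header]
input_types = sorted({a.annotator_type for a in inputs})
print(input_types)
```

The parts produced by the splitter are `DOCUMENT` annotations again, with the label of each part stored in the metadata.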

## **🔎 Parameters**


- `GroupBySentences`: (boolean) Whether to group the chunks by sentence when splitting the paragraphs.

- `InsertChunk`: (boolean) Whether to include the splitting chunk in the resulting paragraph.

- `DefaultEntity`: (str) Sets the entity label given to the text before the first chunk, as in the `setDefaultEntity("Intro")` example in this notebook (the label shown is `introduction` when it is not set).
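
The interplay of `InsertChunk` and `DefaultEntity` can be illustrated without Spark. The sketch below is plain Python written for this explanation, not the annotator's implementation: `split_by_chunks`, `find_chunk`, and the sample text are hypothetical, and the `UNK` placeholder and `introduction` default simply copy what the outputs in this notebook display. It splits at every chunk and does not model `GroupBySentences`.

```python
from typing import Dict, List, Tuple

Chunk = Tuple[int, int, str]  # (begin, inclusive end, entity)


def find_chunk(text: str, header: str, entity: str = "Header") -> Chunk:
    """Locate `header` in `text` and return it as a chunk with an inclusive end."""
    begin = text.index(header)
    return (begin, begin + len(header) - 1, entity)


def split_by_chunks(
    text: str,
    chunks: List[Chunk],
    insert_chunk: bool = True,
    default_entity: str = "introduction",
) -> List[Dict[str, str]]:
    """Cut `text` at the start of every chunk and label each part."""
    ordered = sorted(chunks)
    parts = []
    # Text before the first chunk has no splitting chunk: use the default entity.
    first_begin = ordered[0][0] if ordered else len(text)
    if first_begin > 0:
        parts.append({"result": text[:first_begin],
                      "entity": default_entity,
                      "splitter_chunk": "UNK"})
    for i, (begin, end, entity) in enumerate(ordered):
        next_begin = ordered[i + 1][0] if i + 1 < len(ordered) else len(text)
        # Either keep the chunk text in the part or start right after it.
        start = begin if insert_chunk else end + 1
        parts.append({"result": text[start:next_begin],
                      "entity": entity,
                      "splitter_chunk": text[begin:end + 1]})
    return parts


text = "Title line\nHISTORY: cough for 3 days.\nPLAN: rest."
chunks = [find_chunk(text, "HISTORY:"), find_chunk(text, "PLAN:")]

with_chunk = split_by_chunks(text, chunks)
without_chunk = split_by_chunks(text, chunks, insert_chunk=False, default_entity="Intro")

for part in without_chunk:
    print(part)
```

With `insert_chunk=False` the header text is dropped from each part but remains available under `splitter_chunk`, which matches the pattern visible in the `setInsertChunk(False)` output later in this notebook.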



## Data Preparation

In [5]:
#input data

input_list = ["""Sample Name: Mesothelioma - Pleural Biopsy
Description: Right pleural effusion and suspected malignant mesothelioma. (Medical Transcription Sample Report)
PREOPERATIVE DIAGNOSIS:  Right pleural effusion and suspected malignant mesothelioma.
POSTOPERATIVE DIAGNOSIS: Right pleural effusion, suspected malignant mesothelioma.
ANESTHESIA: General double-lumen endotracheal.
DESCRIPTION OF FINDINGS:  Right pleural effusion, firm nodules, diffuse scattered throughout the right pleura and diaphragmatic surface.
SPECIMEN:  Pleural biopsies for pathology and microbiology.
INDICATIONS:  Briefly, this is a 66-year-old gentleman who has been transferred from an outside hospital after a pleural effusion had been drained and biopsies taken from the right chest that were thought to be consistent with mesothelioma. Upon transfer, he had a right pleural effusion demonstrated on x-ray as well as some shortness of breath and dyspnea on exertion. The risks, benefits, and alternatives to right VATS pleurodesis and pleural biopsy were discussed with the patient and his family and they wished to proceed.
Dr. X was present for the entire procedure which was right VATS pleurodesis and pleural biopsies.The counts were correct x2 at the end of the case."""]

In [6]:
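# One file name per input text; `file_names` avoids shadowing the `files` module imported from google.colab
file_names = [f"{i}.txt" for i in range(1, len(input_list) + 1)]

df = spark.createDataFrame(pd.DataFrame({'text': input_list, 'file': file_names}))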

df.show()

+--------------------+-----+
|                text| file|
+--------------------+-----+
|Sample Name: Meso...|1.txt|
+--------------------+-----+



In [7]:
documentAssembler = nlp.DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("document")

sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
        .setInputCols(["document"])\
        .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
      .setInputCols(["sentence"])\
      .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
      .setInputCols(["sentence", "token"])\
      .setOutputCol("embeddings")

clinical_ner = medical.NerModel.pretrained("ner_jsl_slim", "en", "clinical/models") \
      .setInputCols(["sentence", "token", "embeddings"]) \
      .setOutputCol("ner")

ner_converter = medical.NerConverterInternal() \
      .setInputCols(["sentence", "token", "ner"]) \
      .setOutputCol("ner_chunk")\
      .setWhiteList(["Header"])

pipeline_sentence = nlp.Pipeline(
    stages = [
        documentAssembler,
        sentenceDetector,
        tokenizer,
        word_embeddings,
        clinical_ner,
        ner_converter
    ])

empty_df = spark.createDataFrame([[""]]).toDF('text')
pipeline_model_sentence = pipeline_sentence.fit(empty_df)

sentence_detector_dl_healthcare download started this may take some time.
Approximate size to download 367.3 KB
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_jsl_slim download started this may take some time.
[OK!]


In [8]:
result = pipeline_model_sentence.transform(df)
result.selectExpr('explode(ner_chunk)').show(truncate=False)

+------------------------------------------------------------------------------------------------------------------------------------------------+
|col                                                                                                                                             |
+------------------------------------------------------------------------------------------------------------------------------------------------+
|{chunk, 43, 54, Description:, {chunk -> 0, confidence -> 0.8571, ner_source -> ner_chunk, entity -> Header, sentence -> 1}, []}                 |
|{chunk, 155, 177, PREOPERATIVE DIAGNOSIS:, {chunk -> 1, confidence -> 0.87280005, ner_source -> ner_chunk, entity -> Header, sentence -> 3}, []}|
|{chunk, 241, 264, POSTOPERATIVE DIAGNOSIS:, {chunk -> 2, confidence -> 0.8618, ner_source -> ner_chunk, entity -> Header, sentence -> 4}, []}   |
|{chunk, 324, 334, ANESTHESIA:, {chunk -> 3, confidence -> 0.68285, ner_source -> ner_chunk, entity -> Header, sentenc
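
A quick way to read the `begin` and `end` values in the output above: slicing the raw text with `text[begin:end + 1]` should give back the chunk text, since the end offset is inclusive. The snippet below checks this for the first two chunks. It is a standalone sketch that retypes only the opening lines of the sample note, so it runs without Spark.

```python
# Opening lines of the sample note, up to the second header.
text = (
    "Sample Name: Mesothelioma - Pleural Biopsy\n"
    "Description: Right pleural effusion and suspected malignant mesothelioma. "
    "(Medical Transcription Sample Report)\n"
    "PREOPERATIVE DIAGNOSIS:"
)

# (begin, end, result) triples copied from the NER output above.
chunks = [(43, 54, "Description:"), (155, 177, "PREOPERATIVE DIAGNOSIS:")]

for begin, end, expected in chunks:
    covered = text[begin:end + 1]  # end is inclusive, so add 1 for Python slicing
    print(begin, end, repr(covered), covered == expected)
```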

### `setGroupBySentences()`

In [9]:
#applying ChunkSentenceSplitter
chunkSentenceSplitter = medical.ChunkSentenceSplitter()\
    .setInputCols("document","ner_chunk")\
    .setOutputCol("paragraphs")\
    .setGroupBySentences(False)

paragraphs = chunkSentenceSplitter.transform(result)

In [10]:
paragraphs.show()

+--------------------+-----+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text| file|            document|            sentence|               token|          embeddings|                 ner|           ner_chunk|          paragraphs|
+--------------------+-----+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|Sample Name: Meso...|1.txt|[{document, 0, 12...|[{document, 0, 41...|[{token, 0, 5, Sa...|[{word_embeddings...|[{named_entity, 0...|[{chunk, 43, 54, ...|[{document, 0, 43...|
+--------------------+-----+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+



In [11]:
paragraphs.select("paragraphs.result").show(truncate=100)

+----------------------------------------------------------------------------------------------------+
|                                                                                              result|
+----------------------------------------------------------------------------------------------------+
|[Sample Name: Mesothelioma - Pleural Biopsy\n, Description: Right pleural effusion and suspected ...|
+----------------------------------------------------------------------------------------------------+



In [12]:
pd.set_option('display.max_colwidth', None)
result_df = paragraphs.selectExpr("explode(paragraphs) as result").selectExpr("result.result","result.metadata.entity").toPandas()
result_df.head()

| | result | entity |
|---|---|---|
| 0 | Sample Name: Mesothelioma - Pleural Biopsy\n | introduction |
| 1 | Description: Right pleural effusion and suspected malignant mesothelioma. (Medical Transcription Sample Report)\n | Header |
| 2 | PREOPERATIVE DIAGNOSIS: Right pleural effusion and suspected malignant mesothelioma.\n | Header |
| 3 | POSTOPERATIVE DIAGNOSIS: Right pleural effusion, suspected malignant mesothelioma.\n | Header |
| 4 | ANESTHESIA: General double-lumen endotracheal.\n | Header |
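
Once each section is its own row, later steps can address sections by header, which is the use case named in the learning objectives. As a rough sketch of that idea in plain Python (the `sections` list retypes three rows from the table above, and `by_header` is a hypothetical helper structure, not part of the library):

```python
# (result, entity) pairs retyped from the table above.
sections = [
    ("Sample Name: Mesothelioma - Pleural Biopsy\n", "introduction"),
    ("PREOPERATIVE DIAGNOSIS: Right pleural effusion and suspected malignant mesothelioma.\n", "Header"),
    ("ANESTHESIA: General double-lumen endotracheal.\n", "Header"),
]

# Map each header to the body text that follows it.
by_header = {}
for result, entity in sections:
    if entity != "Header":
        continue  # skip the introduction, which has no header chunk
    header, _, body = result.partition(":")
    by_header[header.strip()] = body.strip()

print(by_header["ANESTHESIA"])
```

A downstream step could then pick, for instance, only the diagnosis sections by key and pass them to a dedicated model.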


### `setInsertChunk()`



In [13]:
documentAssembler = nlp.DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("document")

tokenizer= nlp.Tokenizer()\
      .setInputCols(["document"])\
      .setOutputCol("token")

tokenClassifier = medical.BertForTokenClassification.pretrained("bert_token_classifier_ner_jsl_slim", "en", "clinical/models")\
    .setInputCols("token", "document")\
    .setOutputCol("ner")\
    .setCaseSensitive(True)

ner_converter = medical.NerConverterInternal() \
      .setInputCols(["document", "token", "ner"]) \
      .setOutputCol("ner_chunk")\
      .setWhiteList(["Header"])

pipeline = nlp.Pipeline(
    stages = [
        documentAssembler,
        tokenizer,
        tokenClassifier,
        ner_converter
    ])

empty_df = spark.createDataFrame([[""]]).toDF('text')
pipeline_model = pipeline.fit(empty_df)

bert_token_classifier_ner_jsl_slim download started this may take some time.
[OK!]


In [14]:
result = pipeline_model.transform(df)
result.selectExpr('explode(ner_chunk)').show(truncate=False)

+------------------------------------------------------------------------------------------------------------------------------------------------+
|col                                                                                                                                             |
+------------------------------------------------------------------------------------------------------------------------------------------------+
|{chunk, 155, 177, PREOPERATIVE DIAGNOSIS:, {chunk -> 0, confidence -> 0.9710055, ner_source -> ner_chunk, entity -> Header, sentence -> 0}, []} |
|{chunk, 241, 264, POSTOPERATIVE DIAGNOSIS:, {chunk -> 1, confidence -> 0.9611764, ner_source -> ner_chunk, entity -> Header, sentence -> 0}, []}|
|{chunk, 324, 333, ANESTHESIA, {chunk -> 2, confidence -> 0.80923826, ner_source -> ner_chunk, entity -> Header, sentence -> 0}, []}             |
|{chunk, 371, 393, DESCRIPTION OF FINDINGS, {chunk -> 3, confidence -> 0.9926481, ner_source -> ner_chunk, entity -> H

In [15]:
result.columns #no sentence column

['text', 'file', 'document', 'token', 'ner', 'ner_chunk']

In [16]:
chunkSentenceSplitter = medical.ChunkSentenceSplitter()\
    .setInputCols("ner_chunk","document")\
    .setOutputCol("paragraphs")\
    .setInsertChunk(False)

paragraphs = chunkSentenceSplitter.transform(result)

In [17]:
paragraphs.select("paragraphs.result").show(truncate=100)

+----------------------------------------------------------------------------------------------------+
|                                                                                              result|
+----------------------------------------------------------------------------------------------------+
|[Sample Name: Mesothelioma - Pleural Biopsy\nDescription: Right pleural effusion and suspected ma...|
+----------------------------------------------------------------------------------------------------+



In [18]:
result_insert = paragraphs.selectExpr("explode(paragraphs) as result").selectExpr("result.result","result.metadata.entity", "result.metadata.splitter_chunk").toPandas()
result_insert.head()

| | result | entity | splitter_chunk |
|---|---|---|---|
| 0 | Sample Name: Mesothelioma - Pleural Biopsy\nDescription: Right pleural effusion and suspected malignant mesothelioma. (Medical Transcription Sample Report)\n | introduction | UNK |
| 1 | Right pleural effusion and suspected malignant mesothelioma.\nPOSTOPERATIVE DIAGNOSIS: Right pleural effusion, suspected malignant mesothelioma.\nANESTHESIA: General double-lumen endotracheal.\nDESCRIPTION OF FINDINGS: Right pleural effusion, firm nodules, diffuse scattered throughout the right pleura and diaphragmatic surface.\nSPECIMEN: Pleural biopsies for pathology and microbiology.\nINDICATIONS: Briefly, this is a 66-year-old gentleman who has been transferred from an outside hospital after a pleural effusion had been drained and biopsies taken from the right chest that were thought to be consistent with mesothelioma. Upon transfer, he had a right pleural effusion demonstrated on x-ray as well as some shortness of breath and dyspnea on exertion. The risks, benefits, and alternatives to right VATS pleurodesis and pleural biopsy were discussed with the patient and his family and they wished to proceed.\nDr. X was present for the entire procedure which was right VATS pleurodesis and pleural biopsies.The counts were correct x2 at the end of the case | Header | PREOPERATIVE DIAGNOSIS: |


### `setDefaultEntity()`

In [19]:
chunkSentenceSplitter_2 = medical.ChunkSentenceSplitter()\
    .setInputCols("ner_chunk","document")\
    .setOutputCol("paragraphs")\
    .setInsertChunk(True)\
    .setDefaultEntity("Intro") #to set the name of the introduction entity


paragraphs = chunkSentenceSplitter_2.transform(result)

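# Use a new name so the Spark DataFrame in `result` is not overwritten by a pandas DataFrame
result_default = paragraphs.selectExpr("explode(paragraphs) as result").selectExpr("result.result","result.metadata.entity", "result.metadata.splitter_chunk").toPandas()
result_default.head()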

| | result | entity | splitter_chunk |
|---|---|---|---|
| 0 | Sample Name: Mesothelioma - Pleural Biopsy\nDescription: Right pleural effusion and suspected malignant mesothelioma. (Medical Transcription Sample Report)\n | Intro | UNK |
| 1 | PREOPERATIVE DIAGNOSIS: Right pleural effusion and suspected malignant mesothelioma.\nPOSTOPERATIVE DIAGNOSIS: Right pleural effusion, suspected malignant mesothelioma.\nANESTHESIA: General double-lumen endotracheal.\nDESCRIPTION OF FINDINGS: Right pleural effusion, firm nodules, diffuse scattered throughout the right pleura and diaphragmatic surface.\nSPECIMEN: Pleural biopsies for pathology and microbiology.\nINDICATIONS: Briefly, this is a 66-year-old gentleman who has been transferred from an outside hospital after a pleural effusion had been drained and biopsies taken from the right chest that were thought to be consistent with mesothelioma. Upon transfer, he had a right pleural effusion demonstrated on x-ray as well as some shortness of breath and dyspnea on exertion. The risks, benefits, and alternatives to right VATS pleurodesis and pleural biopsy were discussed with the patient and his family and they wished to proceed.\nDr. X was present for the entire procedure which was right VATS pleurodesis and pleural biopsies.The counts were correct x2 at the end of the case | Header | PREOPERATIVE DIAGNOSIS: |
