![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/18.0.Chunk_Sentence_Splitter.ipynb)

# Chunk Sentence Splitter
We are releasing the `ChunkSentenceSplitter` annotator, which splits documents or sentences at the chunks it is given. Each split part can be labeled with the chunk that starts it. <br/>
With this annotator you can, for example, split clinical documents into sections in accordance with CDA (Clinical Document Architecture).

## Colab Setup

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs==5.1.0

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp, medical, visual

# After uploading your license, run this to install all licensed Python wheels and pre-download the JARs for the Spark session JVM
nlp.install()

In [None]:
from johnsnowlabs import nlp, medical, visual

# Automatically load the license data and start a session with all the JARs the user has access to
spark = nlp.start()

In [None]:
spark

## How It Works


In [None]:
# Rules for RegexMatcher, one per line as <regex>,<identifier>; its matches are the chunks that ChunkSentenceSplitter splits on
regex = """Reporting Template,title1
SPECIMEN,title2
RESULTS,title3"""

with open("title_regex.txt", 'w') as f:
  f.write(regex)
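
To make the rule file format concrete, here is a minimal pure-Python sketch of how rules of the form `<regex>,<identifier>` can be turned into chunks with inclusive begin/end offsets. It only uses the standard `re` module and is not the `RegexMatcher` implementation; `parse_rules` and `find_chunks` are hypothetical helper names, and splitting each line on the last delimiter is an assumption of this sketch.

```python
import re

rules_text = """Reporting Template,title1
SPECIMEN,title2
RESULTS,title3"""


def parse_rules(raw, delimiter=","):
    """Turn '<regex><delimiter><identifier>' lines into (compiled regex, identifier) pairs."""
    rules = []
    for line in raw.splitlines():
        if not line.strip():
            continue
        # Assumption of this sketch: split on the LAST delimiter only.
        pattern, identifier = line.rsplit(delimiter, 1)
        rules.append((re.compile(pattern), identifier.strip()))
    return rules


def find_chunks(text, rules):
    """Return (begin, inclusive end, matched text, identifier) tuples sorted by position."""
    found = []
    for pattern, identifier in rules:
        for match in pattern.finditer(text):
            found.append((match.start(), match.end() - 1, match.group(0), identifier))
    return sorted(found)


sample = (
    "\nThis is the header that have not title\n\n"
    "Reporting Template\n\nSome body text.\n\nSPECIMEN\n___ Adequate\n"
)
chunks = find_chunks(sample, parse_rules(rules_text))
print(chunks[0])  # (41, 58, 'Reporting Template', 'title1')
```

The inclusive end offset (`match.end() - 1`) mirrors the `begin`/`end` convention seen in the annotation output of this notebook.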

In [None]:
import pandas as pd

documentAssembler = nlp.DocumentAssembler()\
     .setInputCol("text")\
     .setOutputCol("document")

regexMatcher = nlp.RegexMatcher()\
     .setExternalRules("/content/title_regex.txt", ",")\
     .setInputCols("document")\
     .setOutputCol("chunks")

pipeline =  nlp.Pipeline().setStages([
                                  documentAssembler,
                                  regexMatcher])

text_list = ["""
This is the header that have not title

Reporting Template

Writers write descriptive paragraphs because their purpose is to describe something. Their point is that something
is beautiful or disgusting or strangely intriguing.
Writers write persuasive and argument paragraphs because their purpose is to persuade or convince someone. T

SPECIMEN
+Adequacy of Sample for Testing
___ Adequate
+Estimated % Tumor Cellularity
___ Suboptimal (explain): _________________

RESULTS
+Mutational Analysis
___ Mutation detected
___ Mutation no identified
___ EGFR
"""]

data_chunk = spark.createDataFrame([["text"]]).toDF("text")

pipeline_model = pipeline.fit(data_chunk)

chunk_df = pipeline_model.transform(spark.createDataFrame(pd.DataFrame({'text': text_list})))

In [None]:
chunk_df.show()

+--------------------+--------------------+--------------------+
|                text|            document|              chunks|
+--------------------+--------------------+--------------------+
|
This is the head...|[{document, 0, 55...|[{chunk, 41, 58, ...|
+--------------------+--------------------+--------------------+



In [None]:
chunk_df.selectExpr('explode(chunks)').show(truncate=False)

+------------------------------------------------------------------------------------------+
|col                                                                                       |
+------------------------------------------------------------------------------------------+
|{chunk, 41, 58, Reporting Template, {identifier -> title1, sentence -> 0, chunk -> 0}, []}|
|{chunk, 338, 345, SPECIMEN, {identifier -> title2, sentence -> 0, chunk -> 1}, []}        |
|{chunk, 468, 474, RESULTS, {identifier -> title3, sentence -> 0, chunk -> 2}, []}         |
+------------------------------------------------------------------------------------------+



Applying `ChunkSentenceSplitter()`

In [None]:
chunkSentenceSplitter = medical.ChunkSentenceSplitter()\
      .setInputCols("chunks","document")\
      .setOutputCol("paragraphs")

paragraphs = chunkSentenceSplitter.transform(chunk_df)

In [None]:
paragraphs.selectExpr("explode(paragraphs) as result").selectExpr("result.result","result.metadata.entity").toPandas()

Unnamed: 0,result,entity
0,\nThis is the header that have not title\n\n,introduction
1,Reporting Template\n\nWriters write descriptiv...,title1
2,SPECIMEN\n+Adequacy of Sample for Testing\n___...,title2
3,RESULTS\n+Mutational Analysis\n___ Mutation de...,title3
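
The table above can be pictured as slicing the document at the start offset of every chunk. The following is a minimal pure-Python sketch of that idea, not the annotator's actual implementation: `split_by_chunks` and its `default_entity` argument are hypothetical names, and the sketch assumes chunks are given as `(begin, inclusive end, label)` tuples.

```python
from typing import Dict, List, Tuple

Chunk = Tuple[int, int, str]  # (begin, inclusive end, label)


def split_by_chunks(
    text: str, chunks: List[Chunk], default_entity: str = "introduction"
) -> List[Dict[str, str]]:
    """Cut `text` at the start of each chunk and label each part with the chunk's label."""
    ordered = sorted(chunks)
    parts = []
    # Text before the first chunk gets the default label.
    first_begin = ordered[0][0] if ordered else len(text)
    if first_begin > 0:
        parts.append({"result": text[:first_begin], "entity": default_entity})
    # Each chunk owns the text from its start up to the start of the next chunk.
    for i, (begin, _end, label) in enumerate(ordered):
        next_begin = ordered[i + 1][0] if i + 1 < len(ordered) else len(text)
        parts.append({"result": text[begin:next_begin], "entity": label})
    return parts


text = (
    "\nHeader without a title\n\nReporting Template\n\nBody text.\n\n"
    "SPECIMEN\n___ Adequate\n"
)
chunks = []
for title, label in [("Reporting Template", "title1"), ("SPECIMEN", "title2")]:
    begin = text.index(title)
    chunks.append((begin, begin + len(title) - 1, label))

parts = split_by_chunks(text, chunks)
for part in parts:
    print(repr(part["result"]), "->", part["entity"])
```

Because every part runs up to the next chunk, concatenating the parts in order gives back the original text.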


### NER Pipeline with Sentence Splitting

In [None]:
#input data

input_list = ["""Sample Name: Mesothelioma - Pleural Biopsy
Description: Right pleural effusion and suspected malignant mesothelioma. (Medical Transcription Sample Report)
PREOPERATIVE DIAGNOSIS:  Right pleural effusion and suspected malignant mesothelioma.
POSTOPERATIVE DIAGNOSIS: Right pleural effusion, suspected malignant mesothelioma.
ANESTHESIA: General double-lumen endotracheal.
DESCRIPTION OF FINDINGS:  Right pleural effusion, firm nodules, diffuse scattered throughout the right pleura and diaphragmatic surface.
SPECIMEN:  Pleural biopsies for pathology and microbiology.
INDICATIONS:  Briefly, this is a 66-year-old gentleman who has been transferred from an outside hospital after a pleural effusion had been drained and biopsies taken from the right chest that were thought to be consistent with mesothelioma. Upon transfer, he had a right pleural effusion demonstrated on x-ray as well as some shortness of breath and dyspnea on exertion. The risks, benefits, and alternatives to right VATS pleurodesis and pleural biopsy were discussed with the patient and his family and they wished to proceed.
Dr. X was present for the entire procedure which was right VATS pleurodesis and pleural biopsies.The counts were correct x2 at the end of the case."""]

In [None]:
files = [f"{i}.txt" for i in (range(1, len(input_list)+1))]

df = spark.createDataFrame(pd.DataFrame({'text': input_list, 'file' : files}))

df.show()

+--------------------+-----+
|                text| file|
+--------------------+-----+
|Sample Name: Meso...|1.txt|
+--------------------+-----+



Next, create an NER pipeline that extracts the `Header` chunks used for splitting.

In [None]:
documentAssembler = nlp.DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("document")

sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
        .setInputCols(["document"])\
        .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
      .setInputCols(["sentence"])\
      .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
      .setInputCols(["sentence", "token"])\
      .setOutputCol("embeddings")

clinical_ner = medical.NerModel.pretrained("ner_jsl_slim", "en", "clinical/models") \
      .setInputCols(["sentence", "token", "embeddings"]) \
      .setOutputCol("ner")

ner_converter = medical.NerConverterInternal() \
      .setInputCols(["sentence", "token", "ner"]) \
      .setOutputCol("ner_chunk")\
      .setWhiteList(["Header"])

pipeline_sentence = nlp.Pipeline(
    stages = [
        documentAssembler,
        sentenceDetector,
        tokenizer,
        word_embeddings,
        clinical_ner,
        ner_converter
    ])

empty_df = spark.createDataFrame([[""]]).toDF('text')
pipeline_model_sentence = pipeline_sentence.fit(empty_df)

sentence_detector_dl_healthcare download started this may take some time.
Approximate size to download 367.3 KB
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_jsl_slim download started this may take some time.
[OK!]


In [None]:
result = pipeline_model_sentence.transform(df)
result.selectExpr('explode(ner_chunk)').show(truncate=False)

+------------------------------------------------------------------------------------------------------------------------------------------------+
|col                                                                                                                                             |
+------------------------------------------------------------------------------------------------------------------------------------------------+
|{chunk, 43, 54, Description:, {chunk -> 0, confidence -> 0.8571, ner_source -> ner_chunk, entity -> Header, sentence -> 1}, []}                 |
|{chunk, 155, 177, PREOPERATIVE DIAGNOSIS:, {chunk -> 1, confidence -> 0.87280005, ner_source -> ner_chunk, entity -> Header, sentence -> 3}, []}|
|{chunk, 241, 264, POSTOPERATIVE DIAGNOSIS:, {chunk -> 2, confidence -> 0.8618, ner_source -> ner_chunk, entity -> Header, sentence -> 4}, []}   |
|{chunk, 324, 334, ANESTHESIA:, {chunk -> 3, confidence -> 0.68285, ner_source -> ner_chunk, entity -> Header, sentenc

In [None]:
result.columns

['text',
 'file',
 'document',
 'sentence',
 'token',
 'embeddings',
 'ner',
 'ner_chunk']

In [None]:
#applying ChunkSentenceSplitter
chunkSentenceSplitter = medical.ChunkSentenceSplitter()\
    .setInputCols("document","ner_chunk")\
    .setOutputCol("paragraphs")\
    .setGroupBySentences(False)

paragraphs = chunkSentenceSplitter.transform(result)

In [None]:
paragraphs.show()

+--------------------+-----+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text| file|            document|            sentence|               token|          embeddings|                 ner|           ner_chunk|          paragraphs|
+--------------------+-----+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|Sample Name: Meso...|1.txt|[{document, 0, 12...|[{document, 0, 41...|[{token, 0, 5, Sa...|[{word_embeddings...|[{named_entity, 0...|[{chunk, 43, 54, ...|[{document, 0, 43...|
+--------------------+-----+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+



In [None]:
paragraphs.select("paragraphs.result").show(truncate=100)

+----------------------------------------------------------------------------------------------------+
|                                                                                              result|
+----------------------------------------------------------------------------------------------------+
|[Sample Name: Mesothelioma - Pleural Biopsy
, Description: Right pleural effusion and suspected m...|
+----------------------------------------------------------------------------------------------------+



In [None]:
pd.set_option('display.max_colwidth', None)
result_df = paragraphs.selectExpr("explode(paragraphs) as result").selectExpr("result.result","result.metadata.entity").toPandas()
result_df.head()

Unnamed: 0,result,entity
0,Sample Name: Mesothelioma - Pleural Biopsy\n,introduction
1,Description: Right pleural effusion and suspected malignant mesothelioma. (Medical Transcription Sample Report)\n,Header
2,PREOPERATIVE DIAGNOSIS: Right pleural effusion and suspected malignant mesothelioma.\n,Header
3,"POSTOPERATIVE DIAGNOSIS: Right pleural effusion, suspected malignant mesothelioma.\n",Header
4,ANESTHESIA: General double-lumen endotracheal.\n,Header


### NER Pipeline without Sentence Splitter

In [None]:
documentAssembler = nlp.DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("document")

#sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
#        .setInputCols(["document"])\
#        .setOutputCol("sentence")

tokenizer= nlp.Tokenizer()\
      .setInputCols(["document"])\
      .setOutputCol("token")

tokenClassifier = medical.BertForTokenClassification.pretrained("bert_token_classifier_ner_jsl_slim", "en", "clinical/models")\
    .setInputCols("token", "document")\
    .setOutputCol("ner")\
    .setCaseSensitive(True)

ner_converter = medical.NerConverterInternal() \
      .setInputCols(["document", "token", "ner"]) \
      .setOutputCol("ner_chunk")\
      .setWhiteList(["Header"])

pipeline = nlp.Pipeline(
    stages = [
        documentAssembler,
        tokenizer,
        tokenClassifier,
        ner_converter
    ])

empty_df = spark.createDataFrame([[""]]).toDF('text')
pipeline_model = pipeline.fit(empty_df)

bert_token_classifier_ner_jsl_slim download started this may take some time.
[OK!]


In [None]:
result = pipeline_model.transform(df)
result.selectExpr('explode(ner_chunk)').show(truncate=False)

+-------------------------------------------------------------------------------------------------------------------------------------------------+
|col                                                                                                                                              |
+-------------------------------------------------------------------------------------------------------------------------------------------------+
|{chunk, 155, 177, PREOPERATIVE DIAGNOSIS:, {chunk -> 0, confidence -> 0.9710056, ner_source -> ner_chunk, entity -> Header, sentence -> 0}, []}  |
|{chunk, 241, 264, POSTOPERATIVE DIAGNOSIS:, {chunk -> 1, confidence -> 0.96117634, ner_source -> ner_chunk, entity -> Header, sentence -> 0}, []}|
|{chunk, 324, 333, ANESTHESIA, {chunk -> 2, confidence -> 0.80923885, ner_source -> ner_chunk, entity -> Header, sentence -> 0}, []}              |
|{chunk, 371, 393, DESCRIPTION OF FINDINGS, {chunk -> 3, confidence -> 0.9926482, ner_source -> ner_chunk, entit

In [None]:
result.columns #no sentence column

['text', 'file', 'document', 'token', 'ner', 'ner_chunk']

In [None]:
#applying ChunkSentenceSplitter
chunkSentenceSplitter = medical.ChunkSentenceSplitter()\
    .setInputCols("ner_chunk","document")\
    .setOutputCol("paragraphs")

paragraphs = chunkSentenceSplitter.transform(result)

In [None]:
paragraphs.show()

+--------------------+-----+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text| file|            document|               token|                 ner|           ner_chunk|          paragraphs|
+--------------------+-----+--------------------+--------------------+--------------------+--------------------+--------------------+
|Sample Name: Meso...|1.txt|[{document, 0, 12...|[{token, 0, 5, Sa...|[{named_entity, 0...|[{chunk, 155, 177...|[{document, 0, 15...|
+--------------------+-----+--------------------+--------------------+--------------------+--------------------+--------------------+



In [None]:
paragraphs.select("paragraphs.result").show(truncate=100)

+----------------------------------------------------------------------------------------------------+
|                                                                                              result|
+----------------------------------------------------------------------------------------------------+
|[Sample Name: Mesothelioma - Pleural Biopsy
Description: Right pleural effusion and suspected mal...|
+----------------------------------------------------------------------------------------------------+



In [None]:
result_df = paragraphs.selectExpr("explode(paragraphs) as result").selectExpr("result.result","result.metadata.entity", "result.metadata.splitter_chunk").toPandas()
result_df.head()

Unnamed: 0,result,entity,splitter_chunk
0,Sample Name: Mesothelioma - Pleural Biopsy\nDescription: Right pleural effusion and suspected malignant mesothelioma. (Medical Transcription Sample Report)\n,introduction,UNK
1,PREOPERATIVE DIAGNOSIS: Right pleural effusion and suspected malignant mesothelioma.\n,Header,PREOPERATIVE DIAGNOSIS:
2,"POSTOPERATIVE DIAGNOSIS: Right pleural effusion, suspected malignant mesothelioma.\n",Header,POSTOPERATIVE DIAGNOSIS:
3,ANESTHESIA: General double-lumen endotracheal.\n,Header,ANESTHESIA
4,"DESCRIPTION OF FINDINGS: Right pleural effusion, firm nodules, diffuse scattered throughout the right pleura and diaphragmatic surface.\n",Header,DESCRIPTION OF FINDINGS


The `.setInsertChunk()` parameter controls whether the splitting chunk is kept in each split part. With `False`, the chunk text is removed from the part.

In [None]:
chunkSentenceSplitter = medical.ChunkSentenceSplitter()\
    .setInputCols("ner_chunk","document")\
    .setOutputCol("paragraphs")\
    .setInsertChunk(False)

paragraphs = chunkSentenceSplitter.transform(result)

In [None]:
paragraphs.select("paragraphs.result").show(truncate=100)

+----------------------------------------------------------------------------------------------------+
|                                                                                              result|
+----------------------------------------------------------------------------------------------------+
|[Sample Name: Mesothelioma - Pleural Biopsy
Description: Right pleural effusion and suspected mal...|
+----------------------------------------------------------------------------------------------------+



In [None]:
result_insert = paragraphs.selectExpr("explode(paragraphs) as result").selectExpr("result.result","result.metadata.entity", "result.metadata.splitter_chunk").toPandas()
result_insert.head()

Unnamed: 0,result,entity,splitter_chunk
0,Sample Name: Mesothelioma - Pleural Biopsy\nDescription: Right pleural effusion and suspected malignant mesothelioma. (Medical Transcription Sample Report)\n,introduction,UNK
1,Right pleural effusion and suspected malignant mesothelioma.\n,Header,PREOPERATIVE DIAGNOSIS:
2,"Right pleural effusion, suspected malignant mesothelioma.\n",Header,POSTOPERATIVE DIAGNOSIS:
3,: General double-lumen endotracheal.\n,Header,ANESTHESIA
4,": Right pleural effusion, firm nodules, diffuse scattered throughout the right pleura and diaphragmatic surface.\n",Header,DESCRIPTION OF FINDINGS
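
The difference between keeping and dropping the chunk can be sketched in plain Python as a choice of where each part starts. This is an illustration under assumptions, not the annotator's code: `section_text` and `insert_chunk` are hypothetical names, and any whitespace handling the annotator may apply is not modelled.

```python
def section_text(text, begin, end_inclusive, next_begin, insert_chunk=True):
    """Return the part owned by a chunk, with or without the chunk text itself."""
    # Keep the chunk: start at its first character.
    # Drop the chunk: start right after its last character.
    start = begin if insert_chunk else end_inclusive + 1
    return text[start:next_begin]


doc = "ANESTHESIA: General double-lumen endotracheal.\nSPECIMEN: Pleural biopsies.\n"
begin = doc.index("ANESTHESIA")
end_inclusive = begin + len("ANESTHESIA") - 1
next_begin = doc.index("SPECIMEN")

with_chunk = section_text(doc, begin, end_inclusive, next_begin, insert_chunk=True)
without_chunk = section_text(doc, begin, end_inclusive, next_begin, insert_chunk=False)

print(repr(with_chunk))     # 'ANESTHESIA: General double-lumen endotracheal.\n'
print(repr(without_chunk))  # ': General double-lumen endotracheal.\n'
```

The leading colon in the second result matches the `ANESTHESIA` row above: the detected chunk did not include the colon, so it stays in the part.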


Check how `.setInsertChunk(True)` affects the result. `.setDefaultEntity()` sets the entity name given to the text before the first chunk.

In [None]:
chunkSentenceSplitter_2 = medical.ChunkSentenceSplitter()\
    .setInputCols("ner_chunk","document")\
    .setOutputCol("paragraphs")\
    .setInsertChunk(True)\
    .setDefaultEntity("Intro") #to set the name of the introduction entity


paragraphs = chunkSentenceSplitter_2.transform(result)

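# Use a new name so the Spark DataFrame `result` is not overwritten by the pandas output
result_intro = paragraphs.selectExpr("explode(paragraphs) as result").selectExpr("result.result","result.metadata.entity", "result.metadata.splitter_chunk").toPandas()
result_intro.head()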

Unnamed: 0,result,entity,splitter_chunk
0,Sample Name: Mesothelioma - Pleural Biopsy\nDescription: Right pleural effusion and suspected malignant mesothelioma. (Medical Transcription Sample Report)\n,Intro,UNK
1,PREOPERATIVE DIAGNOSIS: Right pleural effusion and suspected malignant mesothelioma.\n,Header,PREOPERATIVE DIAGNOSIS:
2,"POSTOPERATIVE DIAGNOSIS: Right pleural effusion, suspected malignant mesothelioma.\n",Header,POSTOPERATIVE DIAGNOSIS:
3,ANESTHESIA: General double-lumen endotracheal.\n,Header,ANESTHESIA
4,"DESCRIPTION OF FINDINGS: Right pleural effusion, firm nodules, diffuse scattered throughout the right pleura and diaphragmatic surface.\n",Header,DESCRIPTION OF FINDINGS
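
Once the parts are in a pandas-like table, one possible follow-up is to collect them into a dictionary keyed by section header. The sketch below works on plain dictionaries copied from three rows of the table above; `sections_by_header` is a hypothetical helper, and stripping the trailing colon from the header is a choice of this sketch.

```python
rows = [
    {
        "result": "Sample Name: Mesothelioma - Pleural Biopsy\nDescription: Right pleural effusion and suspected malignant mesothelioma. (Medical Transcription Sample Report)\n",
        "entity": "Intro",
        "splitter_chunk": "UNK",
    },
    {
        "result": "PREOPERATIVE DIAGNOSIS: Right pleural effusion and suspected malignant mesothelioma.\n",
        "entity": "Header",
        "splitter_chunk": "PREOPERATIVE DIAGNOSIS:",
    },
    {
        "result": "ANESTHESIA: General double-lumen endotracheal.\n",
        "entity": "Header",
        "splitter_chunk": "ANESTHESIA",
    },
]


def sections_by_header(rows):
    """Map each section header to its text; parts without a chunk use their entity name."""
    sections = {}
    for row in rows:
        if row["splitter_chunk"] == "UNK":
            key = row["entity"]
        else:
            # Normalize headers such as 'PREOPERATIVE DIAGNOSIS:' and 'ANESTHESIA'.
            key = row["splitter_chunk"].rstrip(":").strip()
        # Note: a repeated header would overwrite the earlier section in this sketch.
        sections[key] = row["result"].strip()
    return sections


sections = sections_by_header(rows)
print(list(sections))  # ['Intro', 'PREOPERATIVE DIAGNOSIS', 'ANESTHESIA']
```

With a real pandas DataFrame, the same function could be fed the output of `DataFrame.to_dict("records")`.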
