![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/18.1.Section_Header_Splitting_and_Classification.ipynb)

# Clinical Section Header Splitting and Classification


## Colab Setup

Note: This notebook is prepared for JohnSnowLabs 5.2.0 and later versions.

In [None]:
! pip install -q johnsnowlabs

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp, medical

# After uploading your license, run this to install all licensed Python wheels and pre-download the jars for the Spark session JVM
nlp.install()

In [None]:
from johnsnowlabs import nlp, medical

# Automatically load license data and start a session with all jars user has access to
spark = nlp.start()

In [None]:
spark

In [None]:
from pyspark.sql import functions as F

## Classifying texts into sections

We currently have two pretrained models for this task, trained on slightly different text data:

- `bert_sequence_classifier_clinical_sections`: Classifies the text assuming that the section header can be part of the text
- `bert_sequence_classifier_clinical_sections_headless`: Classifies the text without the section name in the text

| Model Name           |            Predicted Classes              |
|----------------------|-------------------------------------------|
| [`bert_sequence_classifier_clinical_sections`](https://nlp.johnsnowlabs.com/2023/12/21/bert_sequence_classifier_clinical_sections_en.html) | `Complications and Risk Factors`, `Consultation and Referral`, <br>`Diagnostic and Laboratory Data`, `Discharge Information`, `Habits`, <br>`History`, `Patient Information`, `Procedures`, `Impression`, `Other` |
| [`bert_sequence_classifier_clinical_sections_headless`](https://nlp.johnsnowlabs.com/2023/12/21/bert_sequence_classifier_clinical_sections_headless_en.html)   | `Consultation and Referral`, `Habits`, `Complications and Risk Factors`,<br> `Diagnostic and Laboratory Data`, `Discharge Information`, `History`, <br>`Impression`, `Patient Information`, `Procedures`, `Other` |
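The headless model is described as classifying text that does not contain the section name. If your sections still start with a header, one option is to remove it before classification. The snippet below is a minimal pure-Python sketch of that idea, not part of the `johnsnowlabs` library: `strip_header` and `HEADER_PREFIX` are hypothetical names, and the pattern assumes headers are upper-case words followed by a colon, as in the samples used in this notebook. Whether stripping headers changes the predictions is not evaluated here.

```python
import re

# Hypothetical helper (not a johnsnowlabs API): matches a leading "HEADER:" prefix.
# Assumption: headers are written in upper case and end with a colon.
HEADER_PREFIX = re.compile(r"^\s*[A-Z][A-Z /&-]*:\s*")

def strip_header(section_text: str) -> str:
    """Return the section body without its leading header, if one is present."""
    return HEADER_PREFIX.sub("", section_text, count=1)

sections = [
    "MEDICAL HISTORY: Reviewed and unchanged from the dictation on 12/03/2003.",
    "Patient with hypertension, syncope, and spinal stenosis - for recheck.",
]

# Texts without a header prefix are returned unchanged.
headless_sections = [strip_header(s) for s in sections]
print(headless_sections)
```

The resulting strings could then be placed in the `text` column of the DataFrame that is passed to the pipeline.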

First, let's create a pipeline to process the texts.

In [None]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol('text')\
    .setOutputCol('document')

tokenizer = nlp.Tokenizer()\
    .setInputCols(['document'])\
    .setOutputCol('token')

sequenceClassifier = medical.BertForSequenceClassification\
    .pretrained('bert_sequence_classifier_clinical_sections_headless', 'en', 'clinical/models')\
    .setInputCols(["document",'token'])\
    .setOutputCol("prediction")\
    .setCaseSensitive(False)

pipeline = nlp.Pipeline(stages=[
    document_assembler,
    tokenizer,
    sequenceClassifier
])

model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

bert_sequence_classifier_clinical_sections_headless download started this may take some time.
[OK!]


In this example, we will classify a text extracted from a clinical document.

In [None]:
text = [["""(Medical Transcription Sample Report)
PRESENT ILLNESS:
Patient with hypertension, syncope, and spinal stenosis - for recheck.
SUBJECTIVE:
The patient is a 78-year-old female who returns for recheck. She has hypertension. She denies difficulty with chest pain, palpations, orthopnea, nocturnal dyspnea, or edema.
MEDICAL HISTORY:
Reviewed and unchanged from the dictation on 12/03/2003.
MEDICATIONS:
Atenolol 50 mg daily, Premarin 0.625 mg daily, calcium with vitamin D two to three pills daily, multivitamin daily, aspirin as needed, and TriViFlor 25 mg two pills daily.
She also has Elocon cream 0.1% and Synalar cream 0.01% that she uses as needed for rash."""]]

data = spark.createDataFrame(text).toDF("text")

result = model.transform(data)
result.selectExpr("text","prediction.result").show(truncate=40)

+----------------------------------------+--------------------------------+
|                                    text|                          result|
+----------------------------------------+--------------------------------+
|(Medical Transcription Sample Report)...|[Complications and Risk Factors]|
+----------------------------------------+--------------------------------+



The input combines four sections (present illness, subjective findings, medical history, and medications), but the classifier returns a single label for the whole text, here `Complications and Risk Factors`. One label cannot describe a multi-section text well, so let's classify each section separately.

In [None]:
text = [
  ["""PRESENT ILLNESS: Patient with hypertension, syncope, and spinal stenosis - for recheck."""],
  ["""SUBJECTIVE: The patient is a 78-year-old female who returns for recheck. She has hypertension. She denies difficulty with chest pain, palpations, orthopnea, nocturnal dyspnea, or edema."""],
  ["""MEDICAL HISTORY: Reviewed and unchanged from the dictation on 12/03/2003."""],
  ["""MEDICATIONS: Atenolol 50 mg daily, Premarin 0.625 mg daily, calcium with vitamin D two to three pills daily, multivitamin daily, aspirin as needed, and TriViFlor 25 mg two pills daily.
      She also has Elocon cream 0.1% and Synalar cream 0.01% that she uses as needed for rash."""]
]

data = spark.createDataFrame(text).toDF("text")

result = model.transform(data)
result.selectExpr("text","prediction.result[0] as Classes").show(truncate=75)

+---------------------------------------------------------------------------+------------------------------+
|                                                                       text|                       Classes|
+---------------------------------------------------------------------------+------------------------------+
|PRESENT ILLNESS: Patient with hypertension, syncope, and spinal stenosis...|     Consultation and Referral|
|SUBJECTIVE: The patient is a 78-year-old female who returns for recheck....|Diagnostic and Laboratory Data|
|  MEDICAL HISTORY: Reviewed and unchanged from the dictation on 12/03/2003.|                       History|
|MEDICATIONS: Atenolol 50 mg daily, Premarin 0.625 mg daily, calcium with...|                       History|
+---------------------------------------------------------------------------+------------------------------+



The texts above were extracted manually from a sample document in the `mtsamples` corpus. In practice, we usually want to process a whole document: split it into chunks, then classify each chunk to find out which section of the document it belongs to. Let's see how to do that with `ChunkSentenceSplitter` and text classification.

## Splitting Documents with NER Models

We currently have the following pretrained NER models that can identify section headers:



| Model Name           |            Predicted Classes              |
|----------------------|-------------------------------------------|
| [`ner_jsl_slim`](https://nlp.johnsnowlabs.com/2021/08/13/ner_jsl_slim_en.html) | `Header` |
| [`ner_jsl`](https://nlp.johnsnowlabs.com/2022/10/19/ner_jsl_en.html)   | `Family_History_Header`, `Medical_History_Header`, `Section_Header`,<br> `Social_History_Header`, `Vital_Signs_Header` |
| [`ner_section_header_diagnosis`](https://nlp.johnsnowlabs.com/2023/07/26/ner_section_header_diagnosis_en.html) | `Patient info header`, `Medical History Header`, `Clinical History Header`, <br> `History of Present Illness Header`, `Medications Header`, `Allergies Header`, <br> `Laboratory Results Header`, `Imaging Studies Header`,  `Diagnosis Header`,<br> `Treatment Plan Header` |

In this example, we first identify the headers in a full document with an `NER` model trained for that purpose. We then use the `ChunkSentenceSplitter` to split the document at those headers, and finally categorize each chunk with our section classifier.

In [None]:
example = """Medical Specialty:
Cardiovascular / Pulmonary

Sample Name: Aortic Valve Replacement

Description: Aortic valve replacement using a mechanical valve and two-vessel coronary artery bypass grafting procedure using saphenous vein graft to the first obtuse marginal artery and left radial artery graft to the left anterior descending artery.
(Medical Transcription Sample Report)

DIAGNOSIS: Aortic valve stenosis with coronary artery disease associated with congestive heart failure. The patient has diabetes and is morbidly obese.

PROCEDURES: Aortic valve replacement using a mechanical valve and two-vessel coronary artery bypass grafting procedure using saphenous vein graft to the first obtuse marginal artery and left radial artery graft to the left anterior descending artery.

ANESTHESIA: General endotracheal

INCISION: Median sternotomy

INDICATIONS: The patient presented with severe congestive heart failure associated with the patient's severe diabetes. The patient was found to have moderately stenotic aortic valve. In addition, The patient had significant coronary artery disease consisting of a chronically occluded right coronary artery but a very important large obtuse marginal artery coming off as the main circumflex system. The patient also has a left anterior descending artery which has moderate disease and this supplies quite a bit of collateral to the patient's right system. It was decided to perform a valve replacement as well as coronary artery bypass grafting procedure.

FINDINGS: The left ventricle is certainly hypertrophied· The aortic valve leaflet is calcified and a severe restrictive leaflet motion. It is a tricuspid type of valve. The coronary artery consists of a large left anterior descending artery which is associated with 60% stenosis but a large obtuse marginal artery which has a tight proximal stenosis.

The radial artery was used for the left anterior descending artery. Flow was excellent. Looking at the targets in the posterior descending artery territory, there did not appear to be any large branches. On the angiogram these vessels appeared to be quite small. Because this is a chronically occluded vessel and the patient has limited conduit due to the patient's massive obesity, attempt to bypass to this area was not undertaken. The patient was brought to the operating room

PROCEDURE: The patient was brought to the operating room and placed in supine position. A median sternotomy incision was carried out and conduits were taken from the left arm as well as the right thigh. The patient weighs nearly three hundred pounds. There was concern as to taking down the left internal mammary artery. Because the radial artery appeared to be a good conduit The patient would have arterial graft to the left anterior descending artery territory. The patient was cannulated after the aorta and atrium were exposed and full heparinization.

The patient went on cardiopulmonary bypass and the aortic cross-clamp was applied Cardioplegia was delivered through the coronary sinuses in a retrograde manner. The patient was cooled to 32 degrees. Iced slush was applied to the heart. The aortic valve was then exposed through the aortic root by transverse incision. The valve leaflets were removed and the #23 St. Jude mechanical valve was secured into position by circumferential pledgeted sutures. At this point, aortotomy was closed.

The first obtuse marginal artery was a very large target and the vein graft to this target indeed produced an excellent amount of flow. Proximal anastomosis was then carried out to the foot of the aorta. The left anterior descending artery does not have severe disease but is also a very good target and the radial artery was anastomosed to this target in an end-to-side manner. The two proximal anastomoses were then carried out to the root of the aorta.

The patient came off cardiopulmonary bypass after aortic cross-clamp was released. The patient was adequately warmed. Protamine was given without adverse effect. Sternal closure was then done using wires. The subcutaneous layers were closed using Vicryl suture. The skin was approximated using staples.
"""

df = spark.createDataFrame([[example]]).toDF("text")

df.show()

+--------------------+
|                text|
+--------------------+
|Medical Specialty...|
+--------------------+



We create an NER pipeline to identify headers in the text. Since the model also identifies other entities, we whitelist only the `Header` entity.

In [None]:
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

clinical_ner = medical.NerModel.pretrained("ner_jsl_slim", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = medical.NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")\
    .setWhiteList(["Header"])

chunkSentenceSplitter = medical.ChunkSentenceSplitter()\
    .setInputCols("document","ner_chunk")\
    .setOutputCol("paragraphs")\
    .setGroupBySentences(False)

sequenceClassifier = medical.BertForSequenceClassification\
    .pretrained('bert_sequence_classifier_clinical_sections', 'en', 'clinical/models')\
    .setInputCols(["paragraphs", "token"])\
    .setOutputCol("prediction")\
    .setCaseSensitive(False)

pipeline_sentence = nlp.Pipeline(
    stages = [
        documentAssembler,
        sentenceDetector,
        tokenizer,
        word_embeddings,
        clinical_ner,
        ner_converter,
        chunkSentenceSplitter,
        sequenceClassifier
    ])

empty_df = spark.createDataFrame([[""]]).toDF('text')
pipeline_model_sentence = pipeline_sentence.fit(empty_df)

result = pipeline_model_sentence.transform(df)

sentence_detector_dl_healthcare download started this may take some time.
Approximate size to download 367.3 KB
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_jsl_slim download started this may take some time.
[OK!]
bert_sequence_classifier_clinical_sections download started this may take some time.
[OK!]


Let's check which entities were found by the model:

In [None]:
result.selectExpr('explode(ner_chunk)').show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------+
|col                                                                                                                              |
+---------------------------------------------------------------------------------------------------------------------------------+
|{chunk, 86, 97, Description:, {chunk -> 0, confidence -> 0.62765, ner_source -> ner_chunk, entity -> Header, sentence -> 2}, []} |
|{chunk, 377, 386, DIAGNOSIS:, {chunk -> 1, confidence -> 0.6609, ner_source -> ner_chunk, entity -> Header, sentence -> 4}, []}  |
|{chunk, 530, 540, PROCEDURES:, {chunk -> 2, confidence -> 0.6484, ner_source -> ner_chunk, entity -> Header, sentence -> 6}, []} |
|{chunk, 782, 792, ANESTHESIA:, {chunk -> 3, confidence -> 0.65955, ner_source -> ner_chunk, entity -> Header, sentence -> 7}, []}|
|{chunk, 845, 856, INDICATIONS:, {chunk -> 4, confidence -> 0.8141, ner_sour

The `ChunkSentenceSplitter` stage has already split the document at those headers to obtain its sections.

Let's see the resulting chunks:

In [None]:
result_df = result.selectExpr("explode(paragraphs) as result").selectExpr("result.result as section","result.metadata.entity")
result_df.show(truncate=100)

+----------------------------------------------------------------------------------------------------+------------+
|                                                                                             section|      entity|
+----------------------------------------------------------------------------------------------------+------------+
|         Medical Specialty:\nCardiovascular / Pulmonary\n\nSample Name: Aortic Valve Replacement\n\n|introduction|
|Description: Aortic valve replacement using a mechanical valve and two-vessel coronary artery byp...|      Header|
|DIAGNOSIS: Aortic valve stenosis with coronary artery disease associated with congestive heart fa...|      Header|
|PROCEDURES: Aortic valve replacement using a mechanical valve and two-vessel coronary artery bypa...|      Header|
|                                 ANESTHESIA: General endotracheal\n\nINCISION: Median sternotomy\n\n|      Header|
|INDICATIONS: The patient presented with severe congestive heart failure

The first section was automatically named `introduction` (a name that can be customized with the `defaultEntity` parameter) and contains the text before the first entity.

You can find more details on this annotator in [this notebook](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/18.0.Chunk_Sentence_Splitter.ipynb).
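To make the splitting behaviour concrete, here is a rough pure-Python sketch of the idea: cut the text at each header start offset and label whatever precedes the first header with a default entity. This is only an illustration under simplifying assumptions, not the `ChunkSentenceSplitter` implementation. `split_at_headers` is a hypothetical helper, every header is labeled `Header`, and the sample is a shortened version of the document above.

```python
from typing import List, Tuple

def split_at_headers(text: str, header_starts: List[int],
                     default_entity: str = "introduction") -> List[Tuple[str, str]]:
    """Cut `text` at each header start offset and return (entity, section) pairs."""
    starts = sorted(set(header_starts))
    sections: List[Tuple[str, str]] = []
    # Text before the first header (if any) becomes the default-entity section.
    first = starts[0] if starts else len(text)
    if first > 0:
        sections.append((default_entity, text[:first]))
    # Each header opens a section that runs until the next header or the end of the text.
    for begin, end in zip(starts, starts[1:] + [len(text)]):
        sections.append(("Header", text[begin:end]))
    return sections

# Shortened sample; in the pipeline the offsets would come from the NER chunks.
sample = ("Sample Name: Aortic Valve Replacement\n\n"
          "DIAGNOSIS: Aortic valve stenosis.\n\n"
          "ANESTHESIA: General endotracheal\n")
header_starts = [sample.index("DIAGNOSIS:"), sample.index("ANESTHESIA:")]

sections = split_at_headers(sample, header_starts)
for entity, section in sections:
    print(f"{entity:>12} | {section!r}")
```

Because the sections are plain slices of the input, concatenating them in order reproduces the original text.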

Now, we can pair each section with the class predicted by our classifier.

In [None]:
result_df = result.select(F.explode(F.arrays_zip(result.paragraphs.result,
                                                 result.paragraphs.metadata,
                                                 result.prediction.result)).alias("cols")) \
                  .select(F.expr("cols['0']").alias("paragraph"),
                          F.expr("cols['1']['entity']").alias("entity"),
                          F.expr("cols['2']").alias("prediction"))

result_df.show(truncate=80)

+--------------------------------------------------------------------------------+------------+------------------------------+
|                                                                       paragraph|      entity|                    prediction|
+--------------------------------------------------------------------------------+------------+------------------------------+
|Medical Specialty:\nCardiovascular / Pulmonary\n\nSample Name: Aortic Valve R...|introduction|                       History|
|Description: Aortic valve replacement using a mechanical valve and two-vessel...|      Header|Complications and Risk Factors|
|DIAGNOSIS: Aortic valve stenosis with coronary artery disease associated with...|      Header|Diagnostic and Laboratory Data|
|PROCEDURES: Aortic valve replacement using a mechanical valve and two-vessel ...|      Header|                    Procedures|
|             ANESTHESIA: General endotracheal\n\nINCISION: Median sternotomy\n\n|      Header|                

Each chunk can then be processed further according to its category.
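As one possible next step, the sections can be grouped by their predicted class so that each category is handled by its own downstream logic. The sketch below uses plain Python on a few rows copied (in truncated form) from the table above. In practice the rows would come from collecting `result_df`, and the name `sections_by_class` is only illustrative.

```python
from collections import defaultdict

# Rows copied (truncated) from the table above: (paragraph, entity, prediction).
rows = [
    ("Medical Specialty:...", "introduction", "History"),
    ("Description: Aortic valve replacement...", "Header", "Complications and Risk Factors"),
    ("DIAGNOSIS: Aortic valve stenosis...", "Header", "Diagnostic and Laboratory Data"),
    ("PROCEDURES: Aortic valve replacement...", "Header", "Procedures"),
]

# Group the paragraphs under their predicted section class.
sections_by_class = defaultdict(list)
for paragraph, _entity, prediction in rows:
    sections_by_class[prediction].append(paragraph)

for label, paragraphs in sections_by_class.items():
    print(label, "->", len(paragraphs), "section(s)")
```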

## Splitting Documents with InternalDocumentSplitter

In [None]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["document"])\
    .setOutputCol("token")


document_splitter = medical.InternalDocumentSplitter() \
    .setInputCols("document")\
    .setOutputCol("splits")\
    .setSplitMode("recursive")\
    .setChunkSize(100)\
    .setChunkOverlap(3)\
    .setExplodeSplits(True)\
    .setPatternsAreRegex(False)\
    .setSplitPatterns(["\n\n", "\n"])\
    .setKeepSeparators(False)\
    .setTrimWhitespace(True)

sequenceClassifier = medical.BertForSequenceClassification\
    .pretrained('bert_sequence_classifier_clinical_sections', 'en', 'clinical/models')\
    .setInputCols(["splits", "token"])\
    .setOutputCol("prediction")\
    .setCaseSensitive(False)

pipeline = nlp.Pipeline().setStages([
    document_assembler,
    tokenizer,
    document_splitter,
    sequenceClassifier
])

result = pipeline.fit(df).transform(df)

bert_sequence_classifier_clinical_sections download started this may take some time.
[OK!]
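The splitter above is configured in `recursive` mode with the separators `"\n\n"` and `"\n"` and a chunk size of 100. A rough pure-Python sketch of recursive splitting follows. It is not the `InternalDocumentSplitter` implementation and rests on several assumptions: the chunk size is measured in characters, small neighbouring pieces are merged greedily, pieces that cannot be split further are kept whole, and chunk overlap is ignored. `recursive_split` and `merge_small` are hypothetical helpers.

```python
from typing import List

def merge_small(pieces: List[str], sep: str, chunk_size: int) -> List[str]:
    """Greedily join adjacent pieces with `sep` while the result fits in `chunk_size`."""
    chunks: List[str] = []
    for piece in pieces:
        if chunks and len(chunks[-1]) + len(sep) + len(piece) <= chunk_size:
            chunks[-1] += sep + piece
        else:
            chunks.append(piece)
    return chunks

def recursive_split(text: str, separators: List[str], chunk_size: int) -> List[str]:
    """Split on the first separator; retry oversized pieces with the remaining separators."""
    if not separators:
        return [text]  # nothing left to split on: keep the piece whole
    sep, rest = separators[0], separators[1:]
    chunks: List[str] = []
    small: List[str] = []
    for piece in text.split(sep):
        if len(piece) <= chunk_size:
            small.append(piece)
        else:
            # Flush the small pieces collected so far, then split the big one further.
            chunks.extend(merge_small(small, sep, chunk_size))
            small = []
            chunks.extend(recursive_split(piece, rest, chunk_size))
    chunks.extend(merge_small(small, sep, chunk_size))
    # Trim surrounding whitespace and drop empty chunks.
    return [c.strip() for c in chunks if c.strip()]

sample = ("Medical Specialty:\nCardiovascular / Pulmonary\n\n"
          "Sample Name: Aortic Valve Replacement\n\n"
          "ANESTHESIA: General endotracheal\n\n"
          "INCISION: Median sternotomy\n")

chunks = recursive_split(sample, ["\n\n", "\n"], chunk_size=100)
for chunk in chunks:
    print(repr(chunk))
```

Under these assumptions, short neighbouring paragraphs end up in the same chunk, while a paragraph longer than the chunk size is split again on the next separator.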


In [None]:
result.select("splits").show(truncate=False)

+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|splits                                                                                                                                                                                                                                    

In [None]:
result.select("prediction.result","splits.result").show(truncate=False)

+--------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                          |result                                                                                                                                                                                                                                                                           

## Document Filtering by Classification

The `DocumentFiltererByClassifier` annotator filters documents based on the results produced by classifier annotators. It uses two lists: a white list of classifier results that are allowed to pass through the filter, and a black list of results that are blocked. Matching is case-sensitive by default; setting `caseSensitive` to `False` makes it case-insensitive. In the pipeline cells that follow, we first run the pipeline without the filter to inspect all predictions, then add the filter to keep only the `Diagnostic and Laboratory Data` splits.
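The filtering logic described above can be sketched in plain Python. This is an illustration only, not the annotator's implementation: `filter_by_label` is a hypothetical helper, and the choice that the black list takes precedence when a label appears in both lists is an assumption. The sample pairs are copied (in truncated form) from the earlier classification output.

```python
from typing import Iterable, List, Optional, Tuple

def filter_by_label(docs: Iterable[Tuple[str, str]],
                    white_list: Optional[List[str]] = None,
                    black_list: Optional[List[str]] = None,
                    case_sensitive: bool = True) -> List[Tuple[str, str]]:
    """Keep (text, label) pairs whose label is allowed by the white and black lists."""
    def norm(label: str) -> str:
        return label if case_sensitive else label.lower()

    allowed = {norm(l) for l in white_list} if white_list else None
    blocked = {norm(l) for l in (black_list or [])}
    kept = []
    for text, label in docs:
        key = norm(label)
        if key in blocked:
            continue  # assumption: the black list wins over the white list
        if allowed is not None and key not in allowed:
            continue
        kept.append((text, label))
    return kept

docs = [
    ("DIAGNOSIS: Aortic valve stenosis...", "Diagnostic and Laboratory Data"),
    ("PROCEDURES: Aortic valve replacement...", "Procedures"),
]

# A lower-case white list entry only matches when case sensitivity is turned off.
kept = filter_by_label(docs, white_list=["diagnostic and laboratory data"],
                       case_sensitive=False)
print(kept)
```

With `case_sensitive=True`, the same lower-case entry would match nothing, which is why case-insensitive matching can be convenient when label casing is uncertain.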


In [None]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["document"])\
    .setOutputCol("token")

document_splitter = medical.InternalDocumentSplitter() \
    .setInputCols("document")\
    .setOutputCol("splits")\
    .setSplitMode("recursive")\
    .setChunkSize(100)\
    .setChunkOverlap(3)\
    .setExplodeSplits(True)\
    .setPatternsAreRegex(False)\
    .setSplitPatterns(["\n\n", "\n"])\
    .setKeepSeparators(False)\
    .setTrimWhitespace(True)
    #.setEnableSentenceIncrement(False)

sequenceClassifier = medical.BertForSequenceClassification\
    .pretrained('bert_sequence_classifier_clinical_sections', 'en', 'clinical/models')\
    .setInputCols(["splits", "token"])\
    .setOutputCol("prediction")\
    .setCaseSensitive(False)

document_filterer = medical.DocumentFiltererByClassifier()\
    .setInputCols(["splits", "prediction"])\
    .setOutputCol("filteredDocuments")\
    .setWhiteList(["Diagnostic and Laboratory Data"])\
    .setCaseSensitive(False)


pipeline = nlp.Pipeline().setStages([
    document_assembler,
    tokenizer,
    document_splitter,
    sequenceClassifier,
    # document_filterer  (left out for now so that we can inspect all predictions first)
])

result = pipeline.fit(df).transform(df)

bert_sequence_classifier_clinical_sections download started this may take some time.
[OK!]


In [None]:
result.selectExpr("splits.result[0] as splits",
                  "prediction.result[0] as classes"
                  ).show(truncate=80)

+--------------------------------------------------------------------------------+------------------------------+
|                                                                          splits|                       classes|
+--------------------------------------------------------------------------------+------------------------------+
|Medical Specialty:\nCardiovascular / Pulmonary\n\nSample Name: Aortic Valve R...|                       History|
|Description: Aortic valve replacement using a mechanical valve and two-vessel...|Complications and Risk Factors|
|                                           (Medical Transcription Sample Report)|Complications and Risk Factors|
|DIAGNOSIS: Aortic valve stenosis with coronary artery disease associated with...|Diagnostic and Laboratory Data|
|PROCEDURES: Aortic valve replacement using a mechanical valve and two-vessel ...|                    Procedures|
|                 ANESTHESIA: General endotracheal\n\nINCISION: Median sternotomy|      

In [None]:
pipeline = nlp.Pipeline().setStages([
    document_assembler,
    tokenizer,
    document_splitter,
    sequenceClassifier,
    document_filterer
])

result = pipeline.fit(df).transform(df)
# Keep only the rows that still carry a class label after filtering
result.selectExpr("filteredDocuments.result[0] as splits",
                  "filteredDocuments.metadata[0].class_label as classes")\
                  .filter(F.col("classes").isNotNull()).show(truncate=80)

+--------------------------------------------------------------------------------+------------------------------+
|                                                                          splits|                       classes|
+--------------------------------------------------------------------------------+------------------------------+
|DIAGNOSIS: Aortic valve stenosis with coronary artery disease associated with...|Diagnostic and Laboratory Data|
|FINDINGS: The left ventricle is certainly hypertrophied· The aortic valve lea...|Diagnostic and Laboratory Data|
|The radial artery was used for the left anterior descending artery. Flow was ...|Diagnostic and Laboratory Data|
|The first obtuse marginal artery was a very large target and the vein graft t...|Diagnostic and Laboratory Data|
+--------------------------------------------------------------------------------+------------------------------+

