![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/30.2.Text_Classification_with_Contextual_Window_Splitting.ipynb)


# **Text Classification with Contextual Window Splitting**

**Text Classification** is a natural language processing (NLP) task of assigning a label to a piece of text. For example, a text classification model could be used to classify emails as spam or not spam, or to classify news articles as sports, politics, or entertainment.

**Contextual Window Splitting** is a technique used to improve the performance of text classification models. In text classification with contextual window splitting, the text is split into multiple windows, and each window is treated as a separate instance for classification. This allows for a more fine-grained analysis of the text and can help capture the nuances of the language used.

The size of the window and the amount of overlap between adjacent windows can be adjusted to optimize the classification performance.
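To make the two knobs concrete, here is a minimal pure-Python sketch of window splitting over tokens. The function name `split_into_windows` and its `size` and `overlap` parameters are hypothetical, introduced only for illustration; they are not part of Spark NLP, which builds its windows from sentences as shown later in this notebook.

```python
from typing import List


def split_into_windows(tokens: List[str], size: int, overlap: int) -> List[List[str]]:
    """Split `tokens` into windows of up to `size` tokens.

    Adjacent windows share `overlap` tokens. Hypothetical helper for
    illustration only, not a Spark NLP API.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")

    step = size - overlap  # how far the window start moves each time
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the text
    return windows


tokens = "the patient has a history of deep venous thrombosis".split()
windows = split_into_windows(tokens, size=4, overlap=2)
for w in windows:
    print(" ".join(w))
```

A larger `overlap` gives every window more shared context at the cost of producing more windows to classify.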

Because each window carries some of its surrounding text, contextual window splitting can improve classification accuracy. It is simple to apply, although the gain depends on the task and on how the classifier was trained.

## **Colab Setup**

📌 To run this notebook yourself, you need to upload your license keys. Run the cells below in order and select your license file when prompted. Alternatively, open the file explorer on the left side of the screen and upload the file there; the cell below looks for it under the name `spark_jsl.json`.
Otherwise, you can look at the example outputs included in the notebook.

In [None]:
import json, os
from google.colab import files

if 'spark_jsl.json' not in os.listdir():
  license_keys = files.upload()
  os.rename(list(license_keys.keys())[0], 'spark_jsl.json')

with open('spark_jsl.json') as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)
os.environ.update(license_keys)

In [None]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.3.0 spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

# Installing Spark NLP Display Library for visualization
! pip install -q spark-nlp-display

In [3]:
from pyspark.ml import Pipeline,PipelineModel
from pyspark.sql import SparkSession

import sparknlp_jsl
import sparknlp

from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from sparknlp.base import *
from sparknlp_jsl.annotator.windowed.windowed_sentence import WindowedSentenceModel

params = {"spark.driver.memory":"16G",
"spark.kryoserializer.buffer.max":"2000M",
"spark.driver.maxResultSize":"2000M"}

spark = sparknlp_jsl.start(license_keys['SECRET'],params=params)

print ("Spark NLP Version :", sparknlp.version())
print ("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark

Spark NLP Version : 5.0.0
Spark NLP_JSL Version : 5.0.0


## 📃 Sentence Splitting with Contextual Window Embeddings


To keep the input length manageable, classifiers require large documents to be split into smaller chunks. This can be done at the sentence, paragraph, section, or page level.

However, after splitting, the classifier only sees one split at a time, so the rest of the text can no longer be taken into account.

This creates several issues:

1️⃣- Small splits may not have a meaning of their own if not combined with the surroundings.

2️⃣- Splits may be ambiguous, and disambiguation may only happen taking into account the surroundings.

Fortunately, `WindowedSentenceModel` can help to provide some context from the surroundings.

# Comparing Classification of Isolated Sentences with Contextualized Windows

We can use the transcribed medical reports from the [MTSamples](https://mtsamples.com/) webpage. Let us use the first report and split this document into sentences.

Then, we are going to run a classifier and see the results.

In [4]:
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/mt_samples_10.csv

In [5]:
mt_samples_df = spark.read.csv("mt_samples_10.csv", header=True, multiLine=True)

In [6]:
sample_text = mt_samples_df.limit(1).collect()[0]['text']
print(sample_text)

Sample Type / Medical Specialty:
Hematology - Oncology
Sample Name:
Discharge Summary - Mesothelioma - 1
Description:
Mesothelioma, pleural effusion, atrial fibrillation, anemia, ascites, esophageal reflux, and history of deep venous thrombosis.
(Medical Transcription Sample Report)
PRINCIPAL DIAGNOSIS:
Mesothelioma.
SECONDARY DIAGNOSES:
Pleural effusion, atrial fibrillation, anemia, ascites, esophageal reflux, and history of deep venous thrombosis.
PROCEDURES
1. On August 24, 2007, decortication of the lung with pleural biopsy and transpleural fluoroscopy.
2. On August 20, 2007, thoracentesis.
3. On August 31, 2007, Port-A-Cath placement.
HISTORY AND PHYSICAL:
The patient is a 41-year-old Vietnamese female with a nonproductive cough that started last week. She has had right-sided chest pain radiating to her back with fever starting yesterday. She has a history of pericarditis and pericardectomy in May 2006 and developed cough with right-sided chest pain, and went to an urgent care cen

In [7]:
df = spark.createDataFrame([[sample_text]]).toDF("text")

## 📃 classifierdl_pico_biobert

We will use the [PICO Classifier](https://nlp.johnsnowlabs.com/2021/01/21/classifierdl_pico_biobert_en.html) model from the John Snow Labs Models Hub, which is trained on a custom dataset derived from a PICO classification dataset.

This model classifies the document into one of the following classes:

`CONCLUSIONS` `DESIGN_SETTING` `INTERVENTION` `PARTICIPANTS` `FINDINGS` `MEASUREMENTS` `AIMS`

### Sentence Splitting Pipeline

Sentence detection (or sentence splitting) in Spark NLP is the process of automatically identifying sentence boundaries in a given text. It is a critical step because many NLP tasks take a sentence as their input unit.
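As a rough intuition for boundary detection, the sketch below splits on whitespace that follows sentence-ending punctuation. This naive rule is an assumption made for illustration and is not the algorithm used by Spark NLP's `SentenceDetector`; the function name `naive_sentence_split` is hypothetical. It does show why a heading such as `PRINCIPAL DIAGNOSIS:` stays attached to the sentence that follows it: a colon and a newline are not treated as a sentence end.

```python
import re
from typing import List


def naive_sentence_split(text: str) -> List[str]:
    """Split on whitespace that follows '.', '!' or '?'.

    Illustrative only: a rule this simple would also split after
    enumerators such as '1.', so real detectors need more logic.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


sample = "PRINCIPAL DIAGNOSIS:\nMesothelioma.\nSECONDARY DIAGNOSES:\nPleural effusion."
sentences = naive_sentence_split(sample)
for s in sentences:
    print(repr(s))
```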

In [8]:
doc_assembler = DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

sentence_detector = SentenceDetector()\
  .setInputCols(["document"])\
  .setOutputCol("isolated_sentence")

sentence_pipeline = Pipeline(stages=[doc_assembler, sentence_detector])

sentence_pipeline_model = sentence_pipeline.fit(df)

sentence_pipeline_lp = LightPipeline(sentence_pipeline_model)

Let's look at the first 25 sentences to check the quality of the sentence splitting.

In [9]:
isolated_sentences = sentence_pipeline_lp.annotate(sample_text)['isolated_sentence'][:25]
isolated_sentences

['Sample Type / Medical Specialty:\nHematology - Oncology\nSample Name:\nDischarge Summary - Mesothelioma - 1\nDescription:\nMesothelioma, pleural effusion, atrial fibrillation, anemia, ascites, esophageal reflux, and history of deep venous thrombosis.',
 '(Medical Transcription Sample Report)\nPRINCIPAL DIAGNOSIS:\nMesothelioma.',
 'SECONDARY DIAGNOSES:\nPleural effusion, atrial fibrillation, anemia, ascites, esophageal reflux, and history of deep venous thrombosis.',
 'PROCEDURES',
 '1. On August 24, 2007, decortication of the lung with pleural biopsy and transpleural fluoroscopy.',
 '2. On August 20, 2007, thoracentesis.',
 '3. On August 31, 2007, Port-A-Cath placement.',
 'HISTORY AND PHYSICAL:\nThe patient is a 41-year-old Vietnamese female with a nonproductive cough that started last week.',
 'She has had right-sided chest pain radiating to her back with fever starting yesterday.',
 'She has a history of pericarditis and pericardectomy in May 2006 and developed cough with right-s

### Classification Pipeline

In [10]:
tokenizer = Tokenizer().setInputCols('document').setOutputCol('token')

embeddings = BertEmbeddings.pretrained('biobert_pubmed_base_cased')\
   .setInputCols(["document", 'token'])\
   .setOutputCol("word_embeddings")

sentence_embeddings = SentenceEmbeddings() \
   .setInputCols(["document", "word_embeddings"]) \
   .setOutputCol("sentence_embeddings") \
   .setPoolingStrategy("AVERAGE")

classifier = ClassifierDLModel.pretrained('classifierdl_pico_biobert', 'en', 'clinical/models')\
   .setInputCols(['sentence_embeddings']).setOutputCol('label')

prediction_pipeline = Pipeline(stages=[doc_assembler, tokenizer, embeddings, sentence_embeddings, classifier])

prediction_model = prediction_pipeline.fit(df)

prediction_lp = LightPipeline(prediction_model)

biobert_pubmed_base_cased download started this may take some time.
Approximate size to download 386.4 MB
[OK!]
classifierdl_pico_biobert download started this may take some time.
Approximate size to download 22 MB
[OK!]


Let's create a dataframe, where we will store the sentences and the predicted labels.

In [11]:
import pandas as pd

comparison_table = []

for s in isolated_sentences:
  label = prediction_lp.annotate(s)['label']

  if len(label) == 0:
    label = ['OTHER']
  comparison_table.append((s, label))

comparison_table_df = pd.DataFrame(comparison_table, columns=['sentence', 'label_no_context'])

comparison_table_df

| | sentence | label_no_context |
|---|---|---|
| 0 | Sample Type / Medical Specialty:\nHematology -... | [PARTICIPANTS] |
| 1 | (Medical Transcription Sample Report)\nPRINCIP... | [MEASUREMENTS] |
| 2 | SECONDARY DIAGNOSES:\nPleural effusion, atrial... | [PARTICIPANTS] |
| 3 | PROCEDURES | [DESIGN_SETTING] |
| 4 | 1. On August 24, 2007, decortication of the lu... | [INTERVENTION] |
| 5 | 2. On August 20, 2007, thoracentesis. | [INTERVENTION] |
| 6 | 3. On August 31, 2007, Port-A-Cath placement. | [PARTICIPANTS] |
| 7 | HISTORY AND PHYSICAL:\nThe patient is a 41-yea... | [PARTICIPANTS] |
| 8 | She has had right-sided chest pain radiating t... | [AIMS] |
| 9 | She has a history of pericarditis and pericard... | [PARTICIPANTS] |


## 🚩  **Context Windows**

With context windows, each sentence is given a context of `[-n, +n]`, so the model can classify it with more information: the `n` sentences before and the `n` sentences after the original sentence are included in its window.

The `setWindowSize` parameter sets `n`, the size of the context window.
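The `[-n, +n]` idea can be sketched in a few lines of plain Python. This is an illustration of the concept, not the implementation of `WindowedSentenceModel`; the helper `build_windows` and its `glue` parameter are hypothetical, and joining with a single space is an assumption based on the window output shown in this notebook.

```python
from typing import List


def build_windows(sentences: List[str], n: int, glue: str = " ") -> List[str]:
    """Return one window per sentence: the sentence plus up to `n`
    neighbours on each side, clipped at the document edges.

    Hypothetical helper for illustration, not a Spark NLP API.
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    windows = []
    for i in range(len(sentences)):
        start = max(0, i - n)                    # clip at the first sentence
        end = min(len(sentences), i + n + 1)     # clip at the last sentence
        windows.append(glue.join(sentences[start:end]))
    return windows


sentences = [
    "PROCEDURES",
    "1. On August 24, 2007, decortication of the lung with pleural biopsy and transpleural fluoroscopy.",
    "2. On August 20, 2007, thoracentesis.",
    "3. On August 31, 2007, Port-A-Cath placement.",
]
windows = build_windows(sentences, n=1)
for w in windows:
    print(repr(w))
```

Note that the first and last sentences get smaller windows because they have a neighbour on one side only.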

In [12]:
doc_assembler = DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

sentence_detector = SentenceDetector()\
  .setInputCols(["document"])\
  .setOutputCol("isolated_sentence")

context_window = WindowedSentenceModel()\
                    .setInputCols(["isolated_sentence"])\
                    .setOutputCol("window")\
                    .setWindowSize(1)

window_splitting_pipeline = Pipeline(stages=[doc_assembler, sentence_detector, context_window])

window_splitting_model = window_splitting_pipeline.fit(df)

window_splitting_lp = LightPipeline(window_splitting_model)

Below you can see the windows that the **`WindowedSentenceModel`** annotator produces. The classifier will receive one of these windows each time, instead of a single isolated sentence.

In [13]:
windows = window_splitting_lp.annotate(sample_text)['window'][:25]
windows

['Sample Type / Medical Specialty:\nHematology - Oncology\nSample Name:\nDischarge Summary - Mesothelioma - 1\nDescription:\nMesothelioma, pleural effusion, atrial fibrillation, anemia, ascites, esophageal reflux, and history of deep venous thrombosis. (Medical Transcription Sample Report)\nPRINCIPAL DIAGNOSIS:\nMesothelioma.',
 'Sample Type / Medical Specialty:\nHematology - Oncology\nSample Name:\nDischarge Summary - Mesothelioma - 1\nDescription:\nMesothelioma, pleural effusion, atrial fibrillation, anemia, ascites, esophageal reflux, and history of deep venous thrombosis. (Medical Transcription Sample Report)\nPRINCIPAL DIAGNOSIS:\nMesothelioma. SECONDARY DIAGNOSES:\nPleural effusion, atrial fibrillation, anemia, ascites, esophageal reflux, and history of deep venous thrombosis.',
 '(Medical Transcription Sample Report)\nPRINCIPAL DIAGNOSIS:\nMesothelioma. SECONDARY DIAGNOSES:\nPleural effusion, atrial fibrillation, anemia, ascites, esophageal reflux, and history of deep venous thr

Now let's classify each sentence together with its window, so that we can compare the result with the classification of the isolated sentence:

In [14]:
window_labels = []
for w in windows:
  label = prediction_lp.annotate(w)['label']
  if len(label) == 0:
    label = ['OTHER']
  window_labels.append(label)

## 📍 Difference between using a single sentence and a window

In [15]:
comparison_table_df['label_context'] = window_labels
comparison_table_df

| | sentence | label_no_context | label_context |
|---|---|---|---|
| 0 | Sample Type / Medical Specialty:\nHematology -... | [PARTICIPANTS] | [MEASUREMENTS] |
| 1 | (Medical Transcription Sample Report)\nPRINCIP... | [MEASUREMENTS] | [MEASUREMENTS] |
| 2 | SECONDARY DIAGNOSES:\nPleural effusion, atrial... | [PARTICIPANTS] | [PARTICIPANTS] |
| 3 | PROCEDURES | [DESIGN_SETTING] | [PARTICIPANTS] |
| 4 | 1. On August 24, 2007, decortication of the lu... | [INTERVENTION] | [INTERVENTION] |
| 5 | 2. On August 20, 2007, thoracentesis. | [INTERVENTION] | [INTERVENTION] |
| 6 | 3. On August 31, 2007, Port-A-Cath placement. | [PARTICIPANTS] | [PARTICIPANTS] |
| 7 | HISTORY AND PHYSICAL:\nThe patient is a 41-yea... | [PARTICIPANTS] | [PARTICIPANTS] |
| 8 | She has had right-sided chest pain radiating t... | [AIMS] | [AIMS] |
| 9 | She has a history of pericarditis and pericard... | [PARTICIPANTS] | [AIMS] |
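To see at a glance which predictions the context changed, the two label columns can be compared row by row. The sketch below copies the labels from the table above into plain lists so that it runs without Spark or pandas; the variable names are chosen for illustration.

```python
# Labels copied from the comparison table, one entry per row (0 to 9).
no_context = [
    "PARTICIPANTS", "MEASUREMENTS", "PARTICIPANTS", "DESIGN_SETTING", "INTERVENTION",
    "INTERVENTION", "PARTICIPANTS", "PARTICIPANTS", "AIMS", "PARTICIPANTS",
]
with_context = [
    "MEASUREMENTS", "MEASUREMENTS", "PARTICIPANTS", "PARTICIPANTS", "INTERVENTION",
    "INTERVENTION", "PARTICIPANTS", "PARTICIPANTS", "AIMS", "AIMS",
]

# Keep only the rows where the windowed prediction differs from the isolated one.
changed = [
    (row, before, after)
    for row, (before, after) in enumerate(zip(no_context, with_context))
    if before != after
]

for row, before, after in changed:
    print(f"row {row}: {before} -> {after}")
print(f"{len(changed)} of {len(no_context)} labels changed")
```

In this sample, three of the ten labels change once the surrounding sentences are included: rows 0, 3, and 9.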


<br/>

Although this is not a recipe that works in every case, and the outcome depends heavily on how your classifier was trained, it can really help to:

✅ Standardise blocks of predictions, capturing and grouping sentences together,

✅ Carry meaning over from the surroundings to resolve short sentences that have little meaning on their own.

❌ However, this comes with a caveat. Sometimes short sentences with no meaningful text of their own are influenced by their surroundings.
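One rough way to anticipate this caveat is to measure how much of a window actually comes from the sentence being classified. The sketch below uses the share of characters as a crude proxy; this is an assumption made for illustration, since the classifier works on embeddings rather than character counts. The helper `centre_share` is hypothetical, and the windows are joined with a space as in the output shown earlier.

```python
from typing import List


def centre_share(sentences: List[str], i: int, n: int, glue: str = " ") -> float:
    """Fraction of the window's characters that belong to sentence `i`.

    Hypothetical helper: a crude proxy for how much the centre sentence
    can influence the prediction for its own window.
    """
    start = max(0, i - n)
    end = min(len(sentences), i + n + 1)
    window = glue.join(sentences[start:end])
    if not window:
        return 0.0
    return len(sentences[i]) / len(window)


# Three consecutive sentences taken from the sample report.
sentences = [
    "SECONDARY DIAGNOSES:\nPleural effusion, atrial fibrillation, anemia, ascites, "
    "esophageal reflux, and history of deep venous thrombosis.",
    "PROCEDURES",
    "1. On August 24, 2007, decortication of the lung with pleural biopsy and transpleural fluoroscopy.",
]

short_share = centre_share(sentences, i=1, n=1)  # the one-word heading
long_share = centre_share(sentences, i=0, n=1)   # a long sentence at the edge
print(f"PROCEDURES share of its window: {short_share:.2f}")
print(f"First sentence share of its window: {long_share:.2f}")
```

The one-word heading `PROCEDURES` contributes only a small fraction of its window, which is consistent with its label following that of a neighbouring sentence in the comparison table.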
