![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/clinical_text_classification/3.Clinical_Longformer_vs_BertSentence_&_USE.ipynb)

# Colab Setup

In [None]:
import json
import os
from google.colab import files

license_keys = files.upload()

with open(list(license_keys.keys())[0]) as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)

# Adding license key-value pairs to environment variables

os.environ.update(license_keys)

In [30]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.1.2 spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

# Installing Spark NLP Display Library for visualization
! pip install -q spark-nlp-display

In [31]:
from pyspark.ml import Pipeline,PipelineModel
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from sparknlp.base import *
import sparknlp_jsl
import sparknlp

import warnings
warnings.filterwarnings('ignore')

params = {"spark.driver.memory":"16G", 
          "spark.kryoserializer.buffer.max":"2000M", 
          "spark.driver.maxResultSize":"2000M"} 

print ("Spark NLP Version :", sparknlp.version())
print ("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark = sparknlp_jsl.start(license_keys['SECRET'],params=params)

spark

Spark NLP Version : 3.4.0
Spark NLP_JSL Version : 3.4.0


In [32]:
# if you want to start the session with custom params as in start function above
from pyspark.sql import SparkSession

def start(SECRET):
    builder = SparkSession.builder \
        .appName("Spark NLP Licensed") \
        .master("local[*]") \
        .config("spark.driver.memory", "16G") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
        .config("spark.kryoserializer.buffer.max", "2000M") \
        .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:"+PUBLIC_VERSION) \
        .config("spark.jars", "https://pypi.johnsnowlabs.com/"+SECRET+"/spark-nlp-jsl-"+JSL_VERSION+".jar")
      
    return builder.getOrCreate()

#spark = start(SECRET)

## Download MTSamples Dataset

In [37]:
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/mtsamples_classifier.csv

In [38]:
spark_df = spark.read.csv("mtsamples_classifier.csv", header = True)

spark_df.show()

+----------------+--------------------+
|        category|                text|
+----------------+--------------------+
|Gastroenterology| PROCEDURES PERFO...|
|Gastroenterology| OPERATION 1. Ivo...|
|Gastroenterology| PREOPERATIVE DIA...|
|Gastroenterology| PROCEDURE: Colon...|
|Gastroenterology| PREOPERATIVE DIA...|
|Gastroenterology| PREOPERATIVE DIA...|
|Gastroenterology| The patient was ...|
|Gastroenterology| PREOPERATIVE DIA...|
|Gastroenterology| PREOPERATIVE DIA...|
|Gastroenterology| HISTORY OF PRESE...|
|Gastroenterology| PREOPERATIVE DIA...|
|Gastroenterology| PREOPERATIVE DIA...|
|Gastroenterology| PROCEDURE IN DET...|
|Gastroenterology| PREPROCEDURE DIA...|
|Gastroenterology| The patient is a...|
|Gastroenterology| DIAGNOSIS ON ADM...|
|Gastroenterology| EXAM: CT abdomen...|
|Gastroenterology| Sample Doctor, M...|
|Gastroenterology| CHIEF COMPLAINT:...|
|Gastroenterology| PREOPERATIVE DIA...|
+----------------+--------------------+
only showing top 20 rows



In [None]:
spark_df.count()

638

In [None]:
spark_df = spark_df.filter(spark_df.text != "")
# Rows whose text column is empty or null do not pass this filter and are dropped --> 8 rows

In [None]:
spark_df.count()

630

In [None]:
spark_df.groupBy("category").count().show()

+----------------+-----+
|        category|count|
+----------------+-----+
|         Urology|  110|
|       Neurology|  141|
|      Orthopedic|  222|
|Gastroenterology|  157|
+----------------+-----+



**I tried the following hyperparameter values for all models:**
- **setLr :** 0.001, 0.003, 0.005, 0.0001, 0.0003, 0.0005
- **setMaxEpochs :** 16, 32, 64, 128
- **setBatchSize :** 8, 10, 16, 32
- **setDropout :** 0.2, 0.3, 0.4, 0.5
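
The values listed above form a search grid. As a minimal sketch, the candidates can be enumerated like this before being passed to the `ClassifierDLApproach` setters. The notebook does not say whether every combination was run, so the exhaustive enumeration is an assumption, and `param_grid` and `iter_configs` are illustrative names:

```python
from itertools import product

# Candidate values copied from the list above.
param_grid = {
    "lr": [0.001, 0.003, 0.005, 0.0001, 0.0003, 0.0005],
    "max_epochs": [16, 32, 64, 128],
    "batch_size": [8, 10, 16, 32],
    "dropout": [0.2, 0.3, 0.4, 0.5],
}


def iter_configs(grid):
    """Yield one dict per combination of the candidate values (hypothetical helper)."""
    keys = list(grid)
    for values in product(*(grid[key] for key in keys)):
        yield dict(zip(keys, values))


configs = list(iter_configs(param_grid))
print(len(configs))  # 6 * 4 * 4 * 4 = 384 combinations
print(configs[0])
```

Each dictionary maps onto `setLr`, `setMaxEpochs`, `setBatchSize`, and `setDropout`. Since a single fit in this notebook takes anywhere from seconds to tens of minutes, sampling a subset of the grid may be more practical than running the full product.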

In [None]:
(trainingData, testData) = spark_df.randomSplit([0.8, 0.2], seed = 42)

## Universal Sentence Encoder

In [None]:
document_assembler = DocumentAssembler()\
        .setInputCol("text")\
        .setOutputCol("document")

use = UniversalSentenceEncoder.pretrained()\
        .setInputCols("document")\
        .setOutputCol("sentence_embeddings")

classifier_dl = ClassifierDLApproach()\
        .setInputCols(["sentence_embeddings"])\
        .setOutputCol("class")\
        .setLabelColumn("category")\
        .setBatchSize(8)\
        .setMaxEpochs(64)\
        .setLr(0.003)\
        .setDropout(0.3)\
        .setEnableOutputLogs(True)

use_clf_pipeline = Pipeline(
    stages = [
        document_assembler,
        use,
        classifier_dl
    ])

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]


In [None]:
%%time
use_clf_pipelineModel = use_clf_pipeline.fit(trainingData)

CPU times: user 172 ms, sys: 21.1 ms, total: 193 ms
Wall time: 29.7 s


In [None]:
!cd ~/annotator_logs/ && ls -lt

total 32
-rw-r--r-- 1 root root 4507 Feb 15 21:53 ClassifierDLApproach_41c1c26d040d.log
-rw-r--r-- 1 root root 9020 Feb 15 21:03 ClassifierDLApproach_726271d783f5.log
-rw-r--r-- 1 root root 9075 Feb 15 20:52 ClassifierDLApproach_9571e7315f7f.log


In [None]:
!cat ~/annotator_logs/ClassifierDLApproach_8442e46ada47.log

Training started - epochs: 64 - learning_rate: 0.003 - batch_size: 8 - training_examples: 503 - classes: 4
Epoch 0/64 - 0.52s - loss: 74.91817 - acc: 0.51180875 - batches: 63
Epoch 1/64 - 0.36s - loss: 68.47723 - acc: 0.6569701 - batches: 63
Epoch 2/64 - 0.34s - loss: 62.247604 - acc: 0.74769586 - batches: 63
Epoch 3/64 - 0.34s - loss: 58.552536 - acc: 0.82430875 - batches: 63
Epoch 4/64 - 0.35s - loss: 55.85074 - acc: 0.85656685 - batches: 63
Epoch 5/64 - 0.34s - loss: 54.345898 - acc: 0.8666475 - batches: 63
Epoch 6/64 - 0.35s - loss: 52.835968 - acc: 0.890841 - batches: 63
Epoch 7/64 - 0.35s - loss: 51.885544 - acc: 0.9133065 - batches: 63
Epoch 8/64 - 0.35s - loss: 51.30577 - acc: 0.91935486 - batches: 63
Epoch 9/64 - 0.34s - loss: 50.97224 - acc: 0.92741936 - batches: 63
Epoch 10/64 - 0.36s - loss: 50.771362 - acc: 0.93346775 - batches: 63
Epoch 11/64 - 0.35s - loss: 50.639965 - acc: 0.93346775 - batches: 63
Epoch 12/64 - 0.34s - loss: 50.550507 - acc: 0.93346775 - batches: 63
Epo

As you can see, we achieved a **95% accuracy score** in less than 1 minute with no text preprocessing, which is usually the most time-consuming and laborious step in any NLP modeling. On the test set, we achieved **91% accuracy**.
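
The training log has a regular line format, so the per-epoch accuracy can be extracted programmatically instead of being read by eye. This is a minimal sketch that assumes the line layout shown in the log output above. `parse_training_log` is a hypothetical helper, not part of Spark NLP, and the sample lines are copied from that output:

```python
import re

# Matches lines such as:
# "Epoch 0/64 - 0.52s - loss: 74.91817 - acc: 0.51180875 - batches: 63"
EPOCH_RE = re.compile(
    r"Epoch (\d+)/(\d+) - ([\d.]+)s - loss: ([\d.]+) - acc: ([\d.]+) - batches: (\d+)"
)


def parse_training_log(text):
    """Return one dict per complete epoch line; other lines are skipped."""
    rows = []
    for line in text.splitlines():
        match = EPOCH_RE.match(line.strip())
        if match:
            rows.append({
                "epoch": int(match.group(1)),
                "seconds": float(match.group(3)),
                "loss": float(match.group(4)),
                "acc": float(match.group(5)),
            })
    return rows


sample_log = """Training started - epochs: 64 - learning_rate: 0.003 - batch_size: 8 - training_examples: 503 - classes: 4
Epoch 0/64 - 0.52s - loss: 74.91817 - acc: 0.51180875 - batches: 63
Epoch 1/64 - 0.36s - loss: 68.47723 - acc: 0.6569701 - batches: 63
Epoch 2/64 - 0.34s - loss: 62.247604 - acc: 0.74769586 - batches: 63
"""

history = parse_training_log(sample_log)
for row in history:
    print(row["epoch"], row["acc"])
```

In the notebook, the same function could be applied to the contents of a log file under `~/annotator_logs/`.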

In [None]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

preds = use_clf_pipelineModel.transform(testData)

preds_df = preds.select("category","text","class.result").toPandas()

preds_df["result"] = preds_df["result"].apply(lambda x : x[0])

print (classification_report(preds_df["category"], preds_df["result"]))

                  precision    recall  f1-score   support

Gastroenterology       0.90      0.90      0.90        31
       Neurology       0.89      0.89      0.89        28
      Orthopedic       0.95      0.90      0.92        41
         Urology       0.86      0.93      0.89        27

        accuracy                           0.91       127
       macro avg       0.90      0.91      0.90       127
    weighted avg       0.91      0.91      0.91       127



## Bert Sentence Embeddings (sent_biobert_clinical_base_cased)

In [None]:
document_assembler = DocumentAssembler()\
        .setInputCol("text")\
        .setOutputCol("document")
        
bert_sent = BertSentenceEmbeddings.pretrained("sent_biobert_clinical_base_cased")\
      .setInputCols(["document"])\
      .setOutputCol("sentence_embeddings")

classifier_dl = ClassifierDLApproach()\
      .setInputCols(["sentence_embeddings"])\
      .setOutputCol("class")\
      .setLabelColumn("category")\
      .setMaxEpochs(128)\
      .setBatchSize(8)\
      .setEnableOutputLogs(True)\
      .setLr(0.0005)\
      .setDropout(0.2)

bert_sent_clf_pipeline = Pipeline(
    stages = [
        document_assembler,
        bert_sent,
        classifier_dl
    ])

sent_biobert_clinical_base_cased download started this may take some time.
Approximate size to download 386.6 MB
[OK!]


In [None]:
%%time
bert_sent_pipelineModel = bert_sent_clf_pipeline.fit(trainingData)

CPU times: user 2.65 s, sys: 296 ms, total: 2.95 s
Wall time: 7min 47s


In [None]:
!cd ~/annotator_logs/ && ls -lt

total 24
-rw-r--r-- 1 root root 9020 Feb 15 21:03 ClassifierDLApproach_726271d783f5.log
-rw-r--r-- 1 root root 9075 Feb 15 20:52 ClassifierDLApproach_9571e7315f7f.log


In [None]:
!cat ~/annotator_logs/ClassifierDLApproach_726271d783f5.log

Training started - epochs: 128 - learning_rate: 5.0E-4 - batch_size: 8 - training_examples: 528 - classes: 4
Epoch 0/128 - 0.61s - loss: 86.71257 - acc: 0.39204547 - batches: 66
Epoch 1/128 - 0.46s - loss: 77.47093 - acc: 0.6136364 - batches: 66
Epoch 2/128 - 0.44s - loss: 70.96891 - acc: 0.6875 - batches: 66
Epoch 3/128 - 0.46s - loss: 68.40075 - acc: 0.7140151 - batches: 66
Epoch 4/128 - 0.48s - loss: 67.243576 - acc: 0.72159094 - batches: 66
Epoch 5/128 - 0.45s - loss: 66.59227 - acc: 0.7310606 - batches: 66
Epoch 6/128 - 0.45s - loss: 66.1715 - acc: 0.7405303 - batches: 66
Epoch 7/128 - 0.44s - loss: 65.86946 - acc: 0.74810606 - batches: 66
Epoch 8/128 - 0.66s - loss: 65.62933 - acc: 0.75 - batches: 66
Epoch 9/128 - 0.46s - loss: 65.43666 - acc: 0.75189394 - batches: 66
Epoch 10/128 - 0.46s - loss: 65.26919 - acc: 0.75189394 - batches: 66
Epoch 11/128 - 0.44s - loss: 65.11844 - acc: 0.7556818 - batches: 66
Epoch 12/128 - 0.46s - loss: 64.98228 - acc: 0.7613636 - batches: 66
Epoch 1

In [None]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

preds = bert_sent_pipelineModel.transform(testData)

preds_df = preds.select("category","text","class.result").toPandas()

preds_df["result"] = preds_df["result"].apply(lambda x : x[0])

print (classification_report(preds_df["category"], preds_df["result"]))

                  precision    recall  f1-score   support

Gastroenterology       0.87      0.93      0.90        28
       Neurology       0.88      0.88      0.88        24
      Orthopedic       0.86      0.83      0.85        30
         Urology       0.74      0.70      0.72        20

        accuracy                           0.84       102
       macro avg       0.84      0.83      0.83       102
    weighted avg       0.84      0.84      0.84       102



## Bert Sentence Embeddings (sbiobert_base_cased_mli)

In [None]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sbert_embedder = BertSentenceEmbeddings.pretrained('sbiobert_base_cased_mli', 'en','clinical/models')\
    .setInputCols(["document"])\
    .setOutputCol("sentence_embeddings")\
    .setMaxSentenceLength(512)\
    .setCaseSensitive(False)

classifier_dl = ClassifierDLApproach()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("class")\
    .setLabelColumn("category")\
    .setMaxEpochs(64)\
    .setBatchSize(8)\
    .setLr(0.0005)\
    .setDropout(0.3)\
    .setEnableOutputLogs(True)

bert_sent_clf_pipeline = Pipeline(
    stages = [
        document_assembler,
        sbert_embedder,
        classifier_dl
    ])

sbiobert_base_cased_mli download started this may take some time.
Approximate size to download 384.3 MB
[OK!]


In [None]:
%%time
bert_sent_pipelineModel = bert_sent_clf_pipeline.fit(trainingData)

CPU times: user 12.5 s, sys: 1.32 s, total: 13.8 s
Wall time: 36min


In [None]:
!cd ~/annotator_logs/ && ls -lt

total 8
-rw-r--r-- 1 root root 4512 Feb 15 23:18 ClassifierDLApproach_40e43d4ec7d3.log


In [None]:
!cat ~/annotator_logs/ClassifierDLApproach_40e43d4ec7d3.log

Training started - epochs: 64 - learning_rate: 5.0E-4 - batch_size: 8 - training_examples: 528 - classes: 4
Epoch 0/64 - 1.08s - loss: 77.64776 - acc: 0.5814394 - batches: 66
Epoch 1/64 - 0.86s - loss: 68.01735 - acc: 0.7234849 - batches: 66
Epoch 2/64 - 0.86s - loss: 64.66168 - acc: 0.84090906 - batches: 66
Epoch 3/64 - 0.84s - loss: 62.0315 - acc: 0.8806818 - batches: 66
Epoch 4/64 - 0.85s - loss: 60.525566 - acc: 0.90909094 - batches: 66
Epoch 5/64 - 0.86s - loss: 59.666126 - acc: 0.9185606 - batches: 66
Epoch 6/64 - 0.84s - loss: 59.00746 - acc: 0.92424244 - batches: 66
Epoch 7/64 - 0.83s - loss: 58.42534 - acc: 0.9261364 - batches: 66
Epoch 8/64 - 0.82s - loss: 57.921482 - acc: 0.92992425 - batches: 66
Epoch 9/64 - 0.85s - loss: 57.518993 - acc: 0.9337121 - batches: 66
Epoch 10/64 - 0.83s - loss: 57.21453 - acc: 0.9318182 - batches: 66
Epoch 11/64 - 0.82s - loss: 56.977867 - acc: 0.9337121 - batches: 66
Epoch 12/64 - 0.82s - loss: 56.789948 - acc: 0.9337121 - batches: 66
Epoch 13/

Comparing the two Bert Sentence Embeddings models, **sbiobert_base_cased_mli** runs slower (36 min vs. 7 min 47 s wall time for training) but gives better results (95% vs. 84% test accuracy).

In [None]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

preds = bert_sent_pipelineModel.transform(testData)

preds_df = preds.select("category","text","class.result").toPandas()

preds_df["result"] = preds_df["result"].apply(lambda x : x[0])

print (classification_report(preds_df["category"], preds_df["result"]))

                  precision    recall  f1-score   support

Gastroenterology       0.97      1.00      0.98        28
       Neurology       0.88      0.96      0.92        24
      Orthopedic       1.00      0.90      0.95        30
         Urology       0.95      0.95      0.95        20

        accuracy                           0.95       102
       macro avg       0.95      0.95      0.95       102
    weighted avg       0.95      0.95      0.95       102



## Clinical Longformer Embeddings

This embeddings model was imported from Hugging Face ([link](https://huggingface.co/yikuan8/Clinical-Longformer)). Clinical-Longformer is a clinical-knowledge-enriched version of Longformer that was further pretrained on MIMIC-III clinical notes. It accepts up to 4,096 tokens as model input.
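
Whether a longer input window matters depends on how long the notes are. A rough sketch for checking this uses whitespace-separated words as a stand-in for model tokens. That is an assumption: the model's own subword tokenizer will typically produce more tokens, so this undercounts. `share_over_limit` and the toy `notes` list are illustrative and not taken from the dataset:

```python
def whitespace_token_count(text):
    """Approximate token count: number of whitespace-separated words."""
    return len(text.split())


def share_over_limit(texts, limit):
    """Fraction of texts whose approximate token count exceeds `limit`."""
    if not texts:
        return 0.0
    over = sum(1 for text in texts if whitespace_token_count(text) > limit)
    return over / len(texts)


# Toy documents of 2, 600, and 5000 words (illustrative only).
notes = ["short note", "word " * 600, "word " * 5000]

print(share_over_limit(notes, 512))   # two of the three toy notes exceed 512 words
print(share_over_limit(notes, 4096))  # one of the three toy notes exceeds 4096 words
```

In the notebook, `notes` could be filled from `spark_df.select("text").toPandas()["text"]`, and the limit compared against the value passed to `setMaxSentenceLength`.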

In [None]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = Tokenizer()\
    .setInputCols(["document"])\
    .setOutputCol("token")

lf_embeddings = LongformerEmbeddings.pretrained("clinical_longformer", "en")\
  .setInputCols(["document", "token"])\
  .setOutputCol("embeddings")\
  .setCaseSensitive(False)\
  .setMaxSentenceLength(512)

embeddingsSentence = SentenceEmbeddings() \
    .setInputCols(["document", "embeddings"]) \
    .setOutputCol("sentence_embeddings") \
    .setPoolingStrategy("AVERAGE")

classifierdl = ClassifierDLApproach()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("class")\
    .setLabelColumn("category")\
    .setBatchSize(8)\
    .setEnableOutputLogs(True)\
    .setMaxEpochs(64)\
    .setLr(0.003)\
    .setDropout(0.3)

lf_pipeline = Pipeline(
               stages = [
                    document_assembler,
                    tokenizer,
                    lf_embeddings,
                    embeddingsSentence,
                    classifierdl
               ])

clinical_longformer download started this may take some time.
Approximate size to download 510.1 MB
[OK!]


In [None]:
%%time
lf_model = lf_pipeline.fit(trainingData)

CPU times: user 15.5 s, sys: 1.71 s, total: 17.2 s
Wall time: 45min 7s


In [None]:
!cd ~/annotator_logs/ && ls -lt

total 32
-rw-r--r-- 1 root root 4507 Feb 15 21:53 ClassifierDLApproach_41c1c26d040d.log
-rw-r--r-- 1 root root 9020 Feb 15 21:03 ClassifierDLApproach_726271d783f5.log
-rw-r--r-- 1 root root 9075 Feb 15 20:52 ClassifierDLApproach_9571e7315f7f.log


In [None]:
!cat ~/annotator_logs/ClassifierDLApproach_41c1c26d040d.log

Training started - epochs: 64 - learning_rate: 0.003 - batch_size: 8 - training_examples: 528 - classes: 4
Epoch 0/64 - 0.85s - loss: 90.93513 - acc: 0.33901516 - batches: 66
Epoch 1/64 - 0.47s - loss: 79.88781 - acc: 0.51704544 - batches: 66
Epoch 2/64 - 0.48s - loss: 69.131226 - acc: 0.6287879 - batches: 66
Epoch 3/64 - 0.45s - loss: 67.21849 - acc: 0.7064394 - batches: 66
Epoch 4/64 - 0.49s - loss: 67.160805 - acc: 0.72727275 - batches: 66
Epoch 5/64 - 0.47s - loss: 65.39544 - acc: 0.7651515 - batches: 66
Epoch 6/64 - 0.49s - loss: 64.52378 - acc: 0.7878788 - batches: 66
Epoch 7/64 - 0.48s - loss: 63.951595 - acc: 0.79545456 - batches: 66
Epoch 8/64 - 0.47s - loss: 63.488316 - acc: 0.8087121 - batches: 66
Epoch 9/64 - 0.45s - loss: 63.128845 - acc: 0.81439394 - batches: 66
Epoch 10/64 - 0.48s - loss: 62.890724 - acc: 0.8238636 - batches: 66
Epoch 11/64 - 0.48s - loss: 62.742283 - acc: 0.8276515 - batches: 66
Epoch 12/64 - 0.46s - loss: 62.6539 - acc: 0.8314394 - batches: 66
Epoch 13

In [None]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

preds = lf_model.transform(testData)

preds_df = preds.select("category","text","class.result").toPandas()

preds_df["result"] = preds_df["result"].apply(lambda x : x[0])

print (classification_report(preds_df["category"], preds_df["result"]))

                  precision    recall  f1-score   support

Gastroenterology       0.77      0.96      0.86        28
       Neurology       0.95      0.88      0.91        24
      Orthopedic       0.91      0.97      0.94        30
         Urology       0.92      0.60      0.73        20

        accuracy                           0.87       102
       macro avg       0.89      0.85      0.86       102
    weighted avg       0.88      0.87      0.87       102
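
The `macro avg` and `weighted avg` rows can be recomputed from the per-class rows, which shows why the low Urology recall pulls the macro average down more than the weighted one. This small sketch uses the rounded recall and support values printed above, so it agrees with the report only up to rounding:

```python
# Per-class recall and support copied from the Clinical Longformer report above.
report = {
    "Gastroenterology": {"recall": 0.96, "support": 28},
    "Neurology": {"recall": 0.88, "support": 24},
    "Orthopedic": {"recall": 0.97, "support": 30},
    "Urology": {"recall": 0.60, "support": 20},
}


def macro_average(rows, metric):
    """Unweighted mean: every class counts the same."""
    return sum(row[metric] for row in rows.values()) / len(rows)


def weighted_average(rows, metric):
    """Mean weighted by the number of test examples in each class."""
    total = sum(row["support"] for row in rows.values())
    return sum(row[metric] * row["support"] for row in rows.values()) / total


macro_recall = macro_average(report, "recall")
weighted_recall = weighted_average(report, "recall")
print(f"macro recall:    {macro_recall:.4f}")
print(f"weighted recall: {weighted_recall:.4f}")
```

Weighted-average recall is the same quantity as overall accuracy, which is why both rows read 0.87 in the report. The macro average is lower because Urology, the smallest class in the test set, counts as much as each of the larger classes.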



The Clinical Longformer Embeddings results are good (87% test accuracy), but training takes longer than with Bert Sentence Embeddings or USE on Google Colab (45 min 7 s wall time).
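
To put the four runs side by side, the wall times and test accuracies reported in the cells above can be collected in one place. This is only a bookkeeping sketch with numbers copied from the outputs above. Because the USE run shows a different test-set size (127 vs. 102 examples), the comparison is indicative rather than exact:

```python
# Wall time (seconds) and test accuracy copied from the cell outputs above.
results = [
    {"model": "USE (tfhub_use)", "wall_time_s": 29.7, "test_accuracy": 0.91},
    {"model": "sent_biobert_clinical_base_cased", "wall_time_s": 7 * 60 + 47, "test_accuracy": 0.84},
    {"model": "sbiobert_base_cased_mli", "wall_time_s": 36 * 60, "test_accuracy": 0.95},
    {"model": "clinical_longformer", "wall_time_s": 45 * 60 + 7, "test_accuracy": 0.87},
]

# Sort from most to least accurate and print a small comparison table.
by_accuracy = sorted(results, key=lambda row: row["test_accuracy"], reverse=True)
for row in by_accuracy:
    minutes = row["wall_time_s"] / 60
    print(f'{row["model"]:<34} {row["test_accuracy"]:.2f} {minutes:6.1f} min')
```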