![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# Colab Setup

In [None]:
import json
import os
from google.colab import files

license_keys = files.upload()

with open(list(license_keys.keys())[0]) as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)

# Adding license key-value pairs to environment variables

os.environ.update(license_keys)
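The later cells read `SECRET`, `PUBLIC_VERSION` and `JSL_VERSION` from the uploaded license file, so it can help to check for them before installing anything. The following is a minimal sketch: the helper name `missing_license_keys` and the example JSON are hypothetical, and a real license file may contain additional keys.

```python
import json

# Keys this notebook reads from the license file (taken from the cells that use them).
REQUIRED_KEYS = ("SECRET", "PUBLIC_VERSION", "JSL_VERSION")

def missing_license_keys(license_keys):
    """Return the required keys that are absent or empty, in REQUIRED_KEYS order."""
    return [key for key in REQUIRED_KEYS if not license_keys.get(key)]

# Hypothetical file content, for illustration only (not a real license).
example = json.loads('{"SECRET": "xxxx", "PUBLIC_VERSION": "3.4.0"}')
print(missing_license_keys(example))  # ['JSL_VERSION']
```

In the notebook itself, this would be called on `license_keys` right after `json.load(f)`.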

In [None]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.1.2 spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

# Installing Spark NLP Display Library for visualization
! pip install -q spark-nlp-display

In [3]:
from pyspark.ml import Pipeline,PipelineModel
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from sparknlp.base import *
import sparknlp_jsl
import sparknlp

import warnings
warnings.filterwarnings('ignore')

params = {"spark.driver.memory":"16G", 
          "spark.kryoserializer.buffer.max":"2000M", 
          "spark.driver.maxResultSize":"2000M"} 

print ("Spark NLP Version :", sparknlp.version())
print ("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark = sparknlp_jsl.start(license_keys['SECRET'],params=params)

spark

Spark NLP Version : 3.4.0
Spark NLP_JSL Version : 3.4.0


In [4]:
# Alternative: build the session manually with custom params, equivalent to the sparknlp_jsl.start call above
from pyspark.sql import SparkSession

def start(SECRET):
    builder = SparkSession.builder \
        .appName("Spark NLP Licensed") \
        .master("local[*]") \
        .config("spark.driver.memory", "16G") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
        .config("spark.kryoserializer.buffer.max", "2000M") \
        .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:"+PUBLIC_VERSION) \
        .config("spark.jars", "https://pypi.johnsnowlabs.com/"+SECRET+"/spark-nlp-jsl-"+JSL_VERSION+".jar")
      
    return builder.getOrCreate()

#spark = start(SECRET)

## Loading the MTSamples Dataset

In [5]:
spark_df = spark.read.parquet("/content/drive/MyDrive/mtsamples.parquet")

spark_df.show(truncate=100)

+----------+----------------------------------------------------------------------------------------------------+
|  category|                                                                                                text|
+----------+----------------------------------------------------------------------------------------------------+
|Orthopedic| PREOPERATIVE DIAGNOSES 1. EMG-proven left carpal tunnel syndrome. 2. Tenosynovitis of the left t...|
|Orthopedic| PREOPERATIVE DIAGNOSIS: Bunion, left foot. POSTOPERATIVE DIAGNOSIS: Bunion, left foot. PROCEDURE...|
|Orthopedic| RICE stands for the most important elements of treatment for many injuries---rest, ice, compress...|
|Orthopedic| The patient is an 84-year-old retired male who is referred to our office by Dr. O. He comes in t...|
|Orthopedic| PREOPERATIVE DIAGNOSES: 1. Left carpal tunnel syndrome (354.0). 2. Left ulnar nerve entrapment a...|
|Orthopedic| PREOPERATIVE DIAGNOSIS: Herniated nucleus pulposus T8-T9. POSTOPERATIVE DIA

In [6]:
spark_df.count()

638

In [7]:
spark_df.groupBy("category").count().show()

+----------------+-----+
|        category|count|
+----------------+-----+
|         Urology|  115|
|       Neurology|  143|
|      Orthopedic|  223|
|Gastroenterology|  157|
+----------------+-----+



In [8]:
spark_df = spark_df.filter(spark_df.text != "")

# Rows whose text is empty (or null) are dropped by the filter above

In [9]:
spark_df.groupBy("category").count().show()

+----------------+-----+
|        category|count|
+----------------+-----+
|         Urology|  110|
|       Neurology|  141|
|      Orthopedic|  222|
|Gastroenterology|  157|
+----------------+-----+
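As a quick sanity check on the filter, the two `groupBy` outputs can be compared by hand. This sketch only does arithmetic on the counts printed above; it does not touch Spark.

```python
# Per-category counts copied from the two groupBy outputs.
before = {"Urology": 115, "Neurology": 143, "Orthopedic": 223, "Gastroenterology": 157}
after = {"Urology": 110, "Neurology": 141, "Orthopedic": 222, "Gastroenterology": 157}

# Number of rows the filter removed in each category.
removed = {category: before[category] - after[category] for category in before}

print(removed)
print(sum(before.values()), "->", sum(after.values()))
```

The total before filtering matches the 638 rows reported by `spark_df.count()`.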



**I tried the following hyperparameter values for all models:**
- **setLr :** 0.001, 0.003, 0.005, 0.0001, 0.0003, 0.0005
- **setMaxEpochs :** 16, 32, 64, 128
- **setBatchSize :** 8, 10, 16, 32
- **setDropout :** 0.2, 0.3, 0.4, 0.5
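The notebook does not say whether every combination was tried or only a subset. Assuming a full grid, the search space listed above could be enumerated as in this sketch (the dictionary keys are illustrative names, not Spark NLP parameters):

```python
from itertools import product

# Candidate values copied from the list above.
search_space = {
    "lr": [0.001, 0.003, 0.005, 0.0001, 0.0003, 0.0005],
    "max_epochs": [16, 32, 64, 128],
    "batch_size": [8, 10, 16, 32],
    "dropout": [0.2, 0.3, 0.4, 0.5],
}

# One dict per combination, e.g. {"lr": 0.001, "max_epochs": 16, ...}.
grid = [dict(zip(search_space, values)) for values in product(*search_space.values())]

print(len(grid))  # 6 * 4 * 4 * 4 = 384 combinations
print(grid[0])
```

Each entry could then be passed to `setLr`, `setMaxEpochs`, `setBatchSize` and `setDropout` when building a `ClassifierDLApproach`. With 384 combinations and the training times reported in this notebook, a full grid is only practical for the faster embeddings.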

## Universal Sentence Encoder

In [10]:
(trainingData, testData) = spark_df.randomSplit([0.8, 0.2], seed = 100)

In [11]:
document_assembler = DocumentAssembler()\
        .setInputCol("text")\
        .setOutputCol("document")

use = UniversalSentenceEncoder.pretrained()\
        .setInputCols("document")\
        .setOutputCol("sentence_embeddings")

classifier_dl = ClassifierDLApproach()\
        .setInputCols(["sentence_embeddings"])\
        .setOutputCol("class")\
        .setLabelColumn("category")\
        .setBatchSize(8)\
        .setMaxEpochs(64)\
        .setLr(0.003)\
        .setDropout(0.3)\
        .setEnableOutputLogs(True)

use_clf_pipeline = Pipeline(
    stages = [
        document_assembler,
        use,
        classifier_dl
    ])

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]


In [12]:
%%time
use_clf_pipelineModel = use_clf_pipeline.fit(trainingData)

CPU times: user 313 ms, sys: 40.3 ms, total: 354 ms
Wall time: 47.3 s


In [14]:
!cd ~/annotator_logs/ && ls -lt

total 8
-rw-r--r-- 1 root root 4469 Feb 14 11:11 ClassifierDLApproach_bae5c4d5fab2.log


In [15]:
!cat ~/annotator_logs/ClassifierDLApproach_bae5c4d5fab2.log

Training started - epochs: 64 - learning_rate: 0.003 - batch_size: 8 - training_examples: 503 - classes: 4
Epoch 0/64 - 0.86s - loss: 76.92651 - acc: 0.60627884 - batches: 63
Epoch 1/64 - 0.53s - loss: 64.66552 - acc: 0.75547236 - batches: 63
Epoch 2/64 - 0.54s - loss: 62.406364 - acc: 0.8156682 - batches: 63
Epoch 3/64 - 0.54s - loss: 60.451885 - acc: 0.874424 - batches: 63
Epoch 4/64 - 0.52s - loss: 58.452198 - acc: 0.90063363 - batches: 63
Epoch 5/64 - 0.50s - loss: 56.841595 - acc: 0.9228111 - batches: 63
Epoch 6/64 - 0.51s - loss: 56.0128 - acc: 0.9248272 - batches: 63
Epoch 7/64 - 0.48s - loss: 55.354492 - acc: 0.92684335 - batches: 63
Epoch 8/64 - 0.50s - loss: 54.849907 - acc: 0.93490785 - batches: 63
Epoch 9/64 - 0.51s - loss: 54.478996 - acc: 0.94095623 - batches: 63
Epoch 10/64 - 0.51s - loss: 54.206657 - acc: 0.94095623 - batches: 63
Epoch 11/64 - 0.49s - loss: 53.993572 - acc: 0.94297236 - batches: 63
Epoch 12/64 - 0.51s - loss: 53.825882 - acc: 0.94297236 - batches: 63
Ep

As you can see, the training log reports about **95% accuracy** after less than 1 minute of training, with no text preprocessing, which is usually the most time-consuming and laborious step in NLP modeling. On the held-out test set, evaluated below, the model reaches **91% accuracy**.
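To inspect the loss and accuracy curves rather than read the raw log, the epoch lines can be parsed into records. This is a minimal sketch that assumes the line format shown in the log output above; `parse_training_log` is a hypothetical helper, and the sample text is copied from that log.

```python
import re

# Matches lines such as:
# "Epoch 3/64 - 0.54s - loss: 60.451885 - acc: 0.874424 - batches: 63"
EPOCH_LINE = re.compile(
    r"Epoch (?P<epoch>\d+)/(?P<total>\d+) - (?P<seconds>[\d.]+)s"
    r" - loss: (?P<loss>[\d.]+) - acc: (?P<acc>[\d.]+) - batches: (?P<batches>\d+)"
)

def parse_training_log(text):
    """Return one dict per epoch line; lines that do not match are skipped."""
    rows = []
    for line in text.splitlines():
        match = EPOCH_LINE.match(line.strip())
        if match:
            rows.append({
                "epoch": int(match["epoch"]),
                "loss": float(match["loss"]),
                "acc": float(match["acc"]),
            })
    return rows

sample = """Training started - epochs: 64 - learning_rate: 0.003 - batch_size: 8 - training_examples: 503 - classes: 4
Epoch 0/64 - 0.86s - loss: 76.92651 - acc: 0.60627884 - batches: 63
Epoch 1/64 - 0.53s - loss: 64.66552 - acc: 0.75547236 - batches: 63
"""
history = parse_training_log(sample)
print(history[-1])
```

In the notebook, `sample` would be replaced by the contents of the log file under `~/annotator_logs/`.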

In [16]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

preds = use_clf_pipelineModel.transform(testData)

preds_df = preds.select("category","text","class.result").toPandas()

preds_df["result"] = preds_df["result"].apply(lambda x : x[0])

print (classification_report(preds_df["category"], preds_df["result"]))

                  precision    recall  f1-score   support

Gastroenterology       0.93      0.90      0.92        31
       Neurology       0.86      0.89      0.88        28
      Orthopedic       0.95      0.88      0.91        41
         Urology       0.87      0.96      0.91        27

        accuracy                           0.91       127
       macro avg       0.90      0.91      0.90       127
    weighted avg       0.91      0.91      0.91       127
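The `macro avg` and `weighted avg` rows of the report differ only in how the per-class scores are combined. The sketch below recomputes both for precision from the values printed above; because those inputs are already rounded to two decimals, recomputed averages can differ from the report in the last digit for other columns.

```python
# Per-class precision and support copied from the report above.
report = {
    "Gastroenterology": {"precision": 0.93, "support": 31},
    "Neurology": {"precision": 0.86, "support": 28},
    "Orthopedic": {"precision": 0.95, "support": 41},
    "Urology": {"precision": 0.87, "support": 27},
}

total_support = sum(row["support"] for row in report.values())

# Macro average: every class counts equally.
macro_precision = sum(row["precision"] for row in report.values()) / len(report)

# Weighted average: each class counts in proportion to its support.
weighted_precision = sum(
    row["precision"] * row["support"] for row in report.values()
) / total_support

print(total_support, round(macro_precision, 2), round(weighted_precision, 2))
```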



## BioBert Sentence Embeddings

In [12]:
document_assembler = DocumentAssembler()\
        .setInputCol("text")\
        .setOutputCol("document")

bert_sent = BertSentenceEmbeddings.pretrained("sent_biobert_clinical_base_cased","en")\
      .setInputCols(["document"])\
      .setOutputCol("sentence_embeddings")

classifier_dl = ClassifierDLApproach()\
      .setInputCols(["sentence_embeddings"])\
      .setOutputCol("class")\
      .setLabelColumn("category")\
      .setMaxEpochs(128)\
      .setBatchSize(8)\
      .setEnableOutputLogs(True)\
      .setLr(0.0005)\
      .setDropout(0.2)

bert_sent_clf_pipeline = Pipeline(
    stages = [
        document_assembler,
        bert_sent,
        classifier_dl
    ])

sent_biobert_clinical_base_cased download started this may take some time.
Approximate size to download 386.6 MB
[OK!]


In [13]:
%%time
bert_sent_pipelineModel = bert_sent_clf_pipeline.fit(trainingData)

CPU times: user 2.97 s, sys: 307 ms, total: 3.28 s
Wall time: 8min 20s


In [14]:
!cd ~/annotator_logs/ && ls -lt

total 12
-rw-r--r-- 1 root root 9044 Feb 14 16:23 ClassifierDLApproach_89b9788d5d3d.log


In [15]:
!cat ~/annotator_logs/ClassifierDLApproach_89b9788d5d3d.log

Training started - epochs: 128 - learning_rate: 5.0E-4 - batch_size: 8 - training_examples: 503 - classes: 4
Epoch 0/128 - 0.82s - loss: 80.75726 - acc: 0.41071427 - batches: 63
Epoch 1/128 - 0.50s - loss: 73.471695 - acc: 0.66129035 - batches: 63
Epoch 2/128 - 0.50s - loss: 68.14546 - acc: 0.7258065 - batches: 63
Epoch 3/128 - 0.50s - loss: 65.51254 - acc: 0.7419355 - batches: 63
Epoch 4/128 - 0.53s - loss: 64.090836 - acc: 0.75 - batches: 63
Epoch 5/128 - 0.56s - loss: 63.20666 - acc: 0.7520161 - batches: 63
Epoch 6/128 - 0.49s - loss: 62.58945 - acc: 0.76008064 - batches: 63
Epoch 7/128 - 0.51s - loss: 62.10406 - acc: 0.766129 - batches: 63
Epoch 8/128 - 0.49s - loss: 61.68844 - acc: 0.7782258 - batches: 63
Epoch 9/128 - 0.48s - loss: 61.330368 - acc: 0.7782258 - batches: 63
Epoch 10/128 - 0.48s - loss: 61.022247 - acc: 0.7822581 - batches: 63
Epoch 11/128 - 0.48s - loss: 60.74977 - acc: 0.7842742 - batches: 63
Epoch 12/128 - 0.47s - loss: 60.506935 - acc: 0.7883065 - batches: 63
Ep

In [16]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

preds = bert_sent_pipelineModel.transform(testData)

preds_df = preds.select("category","text","class.result").toPandas()

preds_df["result"] = preds_df["result"].apply(lambda x : x[0])

print (classification_report(preds_df["category"], preds_df["result"]))

                  precision    recall  f1-score   support

Gastroenterology       0.88      0.90      0.89        31
       Neurology       0.71      0.86      0.77        28
      Orthopedic       0.87      0.83      0.85        41
         Urology       0.82      0.67      0.73        27

        accuracy                           0.82       127
       macro avg       0.82      0.81      0.81       127
    weighted avg       0.82      0.82      0.82       127



## Clinical Longformer Embeddings

This embeddings model was imported from Hugging Face ([link](https://huggingface.co/yikuan8/Clinical-Longformer)). Clinical-Longformer is a clinical-knowledge-enriched version of Longformer that was further pretrained on MIMIC-III clinical notes. It accepts up to 4,096 tokens as model input.

In [11]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = Tokenizer()\
    .setInputCols(["document"])\
    .setOutputCol("token")

lf_embeddings = LongformerEmbeddings.pretrained("clinical_longformer", "en")\
    .setInputCols(["document", "token"])\
    .setOutputCol("embeddings")\
    .setCaseSensitive(True)\
    .setMaxSentenceLength(4096)

embeddingsSentence = SentenceEmbeddings() \
    .setInputCols(["document", "embeddings"]) \
    .setOutputCol("sentence_embeddings") \
    .setPoolingStrategy("AVERAGE")

classifierdl = ClassifierDLApproach()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("class")\
    .setLabelColumn("category")\
    .setBatchSize(8)\
    .setMaxEpochs(32)\
    .setLr(0.003)\
    .setDropout(0.3)\
    .setEnableOutputLogs(True)

lf_pipeline = Pipeline(
               stages = [
                    document_assembler,
                    tokenizer,
                    lf_embeddings,
                    embeddingsSentence,
                    classifierdl
               ])

clinical_longformer download started this may take some time.
Approximate size to download 510.1 MB
[OK!]
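Unlike the two previous pipelines, this one produces token-level embeddings, so the `SentenceEmbeddings` stage with the `AVERAGE` pooling strategy is needed to turn them into one vector per document before classification. Conceptually, average pooling is an element-wise mean over the token vectors. The sketch below illustrates the idea in plain Python with made-up numbers; it is not Spark NLP's implementation.

```python
def average_pool(token_vectors):
    """Average equally sized token vectors element-wise into one vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    if any(len(vec) != dim for vec in token_vectors):
        raise ValueError("all token vectors must have the same dimension")
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]

# Toy 3-dimensional "token embeddings" (made-up numbers, for illustration only).
tokens = [
    [1.0, 2.0, 3.0],
    [3.0, 4.0, 5.0],
]
print(average_pool(tokens))  # [2.0, 3.0, 4.0]
```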


In [12]:
%%time
lf_model = lf_pipeline.fit(trainingData)

CPU times: user 40 s, sys: 4.87 s, total: 44.8 s
Wall time: 1h 47min 56s


In [13]:
!cd ~/annotator_logs/ && ls -lt

total 4
-rw-r--r-- 1 root root 2274 Feb 14 10:18 ClassifierDLApproach_f28696d93dc9.log


In [17]:
!cat ~/annotator_logs/ClassifierDLApproach_f28696d93dc9.log

Training started - epochs: 32 - learning_rate: 0.003 - batch_size: 8 - training_examples: 491 - classes: 4
Epoch 1/32 - 0.40s - loss: 86.889915 - acc: 0.39890712 - batches: 62
Epoch 2/32 - 0.16s - loss: 75.45212 - acc: 0.5928962 - batches: 62
Epoch 3/32 - 0.16s - loss: 68.27135 - acc: 0.72336066 - batches: 62
Epoch 4/32 - 0.16s - loss: 66.29152 - acc: 0.74248636 - batches: 62
Epoch 5/32 - 0.16s - loss: 64.60902 - acc: 0.7581967 - batches: 62
Epoch 6/32 - 0.16s - loss: 63.89928 - acc: 0.76229507 - batches: 62
Epoch 7/32 - 0.16s - loss: 62.76322 - acc: 0.76229507 - batches: 62
Epoch 8/32 - 0.16s - loss: 61.017834 - acc: 0.7916667 - batches: 62
Epoch 9/32 - 0.16s - loss: 60.243973 - acc: 0.80601096 - batches: 62
Epoch 10/32 - 0.16s - loss: 59.79163 - acc: 0.8162569 - batches: 62
Epoch 11/32 - 0.16s - loss: 59.42674 - acc: 0.8340164 - batches: 62
Epoch 12/32 - 0.16s - loss: 59.10232 - acc: 0.83811474 - batches: 62
Epoch 13/32 - 0.16s - loss: 58.826294 - acc: 0.8422131 - batches: 62
Epoch 1

In [None]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

preds = lf_model.transform(testData)

preds_df = preds.select("category","text","class.result").toPandas()

preds_df["result"] = preds_df["result"].apply(lambda x : x[0])

print (classification_report(preds_df["category"], preds_df["result"]))

The Clinical Longformer embeddings give good results, but training takes much longer on Colab than with BioBERT or USE (about 1 h 48 min, versus about 8 min and under 1 min).

```
                  precision    recall  f1-score   support

Gastroenterology       0.82      0.92      0.87        36
       Neurology       0.89      0.76      0.82        33
      Orthopedic       0.88      0.96      0.91        45
         Urology       0.77      0.68      0.72        25

        accuracy                           0.85       139
       macro avg       0.84      0.83      0.83       139
    weighted avg       0.85      0.85      0.85       139
```
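To put the three runs side by side, the wall times and test accuracies reported in this notebook can be tabulated as in the sketch below. Note that the Longformer run was evaluated on a different split (139 test rows instead of 127), so the accuracies are indicative rather than strictly comparable.

```python
# Wall times copied from the %%time outputs of the three fit() calls.
wall_time_seconds = {
    "USE": 47.3,
    "BioBERT": 8 * 60 + 20,
    "Clinical Longformer": 1 * 3600 + 47 * 60 + 56,
}
# Test-set accuracy copied from the three classification reports.
test_accuracy = {"USE": 0.91, "BioBERT": 0.82, "Clinical Longformer": 0.85}

baseline = wall_time_seconds["USE"]
for name, seconds in wall_time_seconds.items():
    ratio = seconds / baseline
    print(f"{name}: {seconds:.0f} s ({ratio:.1f}x USE), accuracy {test_accuracy[name]:.2f}")
```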




