# **MedicalBertForSequenceClassification Models**

This notebook covers the parameters and usage of the `MedicalBertForSequenceClassification` annotator.


**📖 Learning Objectives:**

Become comfortable using the different parameters of the annotator.

**🔗 Helpful Links:**

- Documentation : [MedicalBertForSequenceClassification](https://nlp.johnsnowlabs.com/2023/05/09/generic_svm_classifier_ade_en.html)

- Python Docs : [MedicalBertForSequenceClassification](https://nlp.johnsnowlabs.com/licensed/api/python/reference/autosummary/sparknlp_jsl/annotator/classification/medical_bert_for_sequence_classification/index.html#)

- For extended examples of usage, see [Spark NLP Workshop repository](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/19.MedicalBertForSequenceClassification_in_SparkNLP.ipynb).


## **📜 Background**

`MedicalBertForSequenceClassification` can load BERT models with a sequence classification/regression head on top (a linear layer on top of the pooled output), e.g. for multi-class document classification tasks. Pretrained models can be loaded with the `pretrained` method of the companion object.
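
To make "a linear layer on top of the pooled output" more concrete, the following is a minimal pure-Python sketch of such a classification head. It is not the annotator's implementation: `classify_pooled` is a hypothetical helper, and the pooled vector, weights, and bias are made-up toy values chosen only for illustration.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify_pooled(pooled, weights, bias, labels):
    """Apply a linear layer to a pooled vector, then pick the most probable label."""
    logits = [sum(w * x for w, x in zip(row, pooled)) + b
              for row, b in zip(weights, bias)]
    probs = softmax(logits)
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs

# Toy values, assumed for illustration (a real pooled vector has hundreds of dimensions).
pooled = [0.2, -0.1, 0.4]
weights = [[1.0, 0.0, -1.0],   # row producing the score for label "False"
           [-1.0, 0.5, 1.0]]   # row producing the score for label "True"
bias = [0.0, 0.0]

label, probs = classify_pooled(pooled, weights, bias, ["False", "True"])
print(label, [round(p, 3) for p in probs])
```

The label with the highest probability becomes the `result` of the output annotation; in the real model the weights come from fine-tuning rather than being written by hand.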

For available pretrained models please see the [`Models Hub`](https://nlp.johnsnowlabs.com/models?task=Named+Entity+Recognition)

To see which models are compatible and how to import them see [`Import Transformers into Spark NLP` 🚀](https://github.com/JohnSnowLabs/spark-nlp/discussions/5669)

## **🎬 Colab Setup**

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs==5.1.4

In [None]:
from johnsnowlabs import nlp, medical, visual

# After uploading your license, run this to install all licensed Python wheels and pre-download the jars for the Spark session JVM
nlp.install()

In [None]:
from johnsnowlabs import nlp, medical, visual
import pandas as pd

# Automatically load license data and start a session with all jars user has access to
spark = nlp.start()

In [None]:
spark

## **🖨️ Input/Output Annotation Types**



- Input: `DOCUMENT, TOKEN`

- Output: `CATEGORY`

## **🖨️ Running Classifier**



In [None]:
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

sequenceClassifier = medical.BertForSequenceClassification.pretrained("bert_sequence_classifier_ade", "en", "clinical/models")\
    .setInputCols(["document","token"])\
    .setOutputCol("classes")

pipeline = nlp.Pipeline(stages=[
    document_assembler,
    tokenizer,
    sequenceClassifier
])

model = pipeline.fit(spark.createDataFrame([['']]).toDF("text"))


bert_sequence_classifier_ade download started this may take some time.
[OK!]


In [None]:
data_list =[["Right inguinal hernia repair in childhood Cervical discectomy 3 years ago Umbilical hernia repair 2137. Retired schoolteacher, now substitutes. Lives with wife in location 1439. Has a 27 yo son and a 25 yo daughter. Name (NI) past or present smoking hx, no EtOH."],
     ["Atrial Septal Defect with Right Atrial Thrombus Pulmonary Hypertension Obesity, Obstructive Sleep Apnea. Denies tobacco and ETOH. Works as cafeteria worker."]]

data = spark.createDataFrame(data_list).toDF("text")


In [None]:
result = model.transform(data)

result.select("text", "classes.result").show(2,truncate=100)

+----------------------------------------------------------------------------------------------------+-------+
|                                                                                                text| result|
+----------------------------------------------------------------------------------------------------+-------+
|Right inguinal hernia repair in childhood Cervical discectomy 3 years ago Umbilical hernia repair...|[False]|
|Atrial Septal Defect with Right Atrial Thrombus Pulmonary Hypertension Obesity, Obstructive Sleep...|[False]|
+----------------------------------------------------------------------------------------------------+-------+



## **🔎 Parameters**

- `batchSize`: Size of every batch. Default: 8.

- `coalesceSentences`: Instead of 1 class per sentence (if inputCols is `sentence`), output 1 class per document by averaging probabilities in all sentences. Default: False.

- `maxSentenceLength`: Max sentence length to process. Default: 128.

- `caseSensitive`: Whether to ignore case in tokens for embeddings matching. Default: True.

### ▶ `batchSize`



```
    batchSize
        Batch size. Larger values allow faster processing but require more
        memory, by default 8
```
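
As a rough illustration of what the batch size controls, the sketch below splits a list of rows into chunks and counts how many model calls each setting would need. It is a simplification under stated assumptions: `make_batches` is a hypothetical helper, and the way Spark NLP actually groups rows (for example per partition) is not modeled here.

```python
import math

def make_batches(rows, batch_size):
    """Split rows into consecutive chunks of at most batch_size items."""
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]

# 2,000 rows, the same row count as the timing runs in this section.
rows = [f"note {i}" for i in range(2000)]

for batch_size in (4, 8, 64):
    batches = make_batches(rows, batch_size)
    # The number of batches is always ceil(rows / batch_size).
    assert len(batches) == math.ceil(len(rows) / batch_size)
    print(batch_size, len(batches), len(batches[-1]))
```

Fewer, larger batches mean fewer model invocations, but each invocation holds more rows in memory at once.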



The following two runs use batch sizes of 4 and 64 on the same 2,000 rows so that their wall times can be compared.

In [None]:
sequenceClassifier = medical.BertForSequenceClassification.pretrained("bert_sequence_classifier_ade", "en", "clinical/models")\
    .setInputCols(["document","token"])\
    .setOutputCol("classes")\
    .setBatchSize(4)

pipeline = nlp.Pipeline(stages=[
    document_assembler,
    tokenizer,
    sequenceClassifier
])

model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))


bert_sequence_classifier_ade download started this may take some time.
[OK!]


In [None]:
data_list =[["Right inguinal hernia repair in childhood Cervical discectomy 3 years ago Umbilical hernia repair 2137. Retired schoolteacher, now substitutes. Lives with wife in location 1439. Has a 27 yo son and a 25 yo daughter. Name (NI) past or present smoking hx, no EtOH."],
     ["Atrial Septal Defect with Right Atrial Thrombus Pulmonary Hypertension Obesity, Obstructive Sleep Apnea. Denies tobacco and ETOH. Works as cafeteria worker."]]

data = spark.createDataFrame(data_list*1000).toDF("text")

In [None]:
%%time
result = model.transform(data)
result.write.mode("overwrite").format("noop").save()
# result.select("text", "classes.result").show(truncate=False)

CPU times: user 758 ms, sys: 109 ms, total: 867 ms
Wall time: 2min 10s


In [None]:
sequenceClassifier = medical.BertForSequenceClassification.pretrained("bert_sequence_classifier_ade", "en", "clinical/models")\
    .setInputCols(["document","token"])\
    .setOutputCol("classes")\
    .setBatchSize(64)\
    .setCoalesceSentences(True)

pipeline = nlp.Pipeline(stages=[
    document_assembler,
    tokenizer,
    sequenceClassifier
])

model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

bert_sequence_classifier_ade download started this may take some time.
[OK!]


In [None]:
data_list =[["Right inguinal hernia repair in childhood Cervical discectomy 3 years ago Umbilical hernia repair 2137. Retired schoolteacher, now substitutes. Lives with wife in location 1439. Has a 27 yo son and a 25 yo daughter. Name (NI) past or present smoking hx, no EtOH."],
     ["Atrial Septal Defect with Right Atrial Thrombus Pulmonary Hypertension Obesity, Obstructive Sleep Apnea. Denies tobacco and ETOH. Works as cafeteria worker."]]

data = spark.createDataFrame(data_list*1000).toDF("text")

In [None]:
%%time
result2 = model.transform(data)
result2.write.mode("overwrite").format("noop").save()
# result2.select("text", "classes.result").show(truncate=False)

CPU times: user 755 ms, sys: 106 ms, total: 861 ms
Wall time: 2min 9s


### ▶`setCoalesceSentences`

```
Instead of 1 class per sentence (if inputCols is '''sentence'''), output 1 class per document by averaging probabilities in all sentences.
Because almost all transformer models such as BERT have a max sequence length limit (512 tokens), this parameter helps feed all the sentences
 into the model and average the probabilities over the entire document instead of reporting probabilities per sentence.
 ```
 **(Default: False)**
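
The averaging described above can be sketched in a few lines of plain Python. This is an illustration only: `coalesce` is a hypothetical helper and the per-sentence probabilities are invented, not produced by the model.

```python
def coalesce(sentence_probs, labels):
    """Average per-sentence class probabilities and return one label for the document."""
    n = len(sentence_probs)
    avg = [sum(p[i] for p in sentence_probs) / n for i in range(len(labels))]
    best = max(range(len(labels)), key=lambda i: avg[i])
    return labels[best], avg

labels = ["False", "True"]
# Hypothetical probabilities for a three-sentence document, one [P(False), P(True)] pair per sentence.
sentence_probs = [[0.9, 0.1], [0.4, 0.6], [0.8, 0.2]]

# Without coalescing: one label per sentence.
per_sentence = [labels[max(range(len(labels)), key=lambda i: p[i])] for p in sentence_probs]
# With coalescing: one label for the whole document.
doc_label, doc_avg = coalesce(sentence_probs, labels)

print(per_sentence)
print(doc_label, [round(a, 2) for a in doc_avg])
```

Note that a single sentence predicted as `True` can be outvoted once its probabilities are averaged with the rest of the document.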

 In the next two runs, the result column shows the difference.



In [None]:
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")


tokenizer = nlp.Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

sequenceClassifier = medical.BertForSequenceClassification.pretrained("bert_sequence_classifier_ade", "en", "clinical/models")\
    .setInputCols(["sentence","token"])\
    .setOutputCol("classes")\
    .setBatchSize(8)\
    .setCoalesceSentences(False)  # set False

pipeline = nlp.Pipeline(stages=[
    document_assembler,
    sentenceDetector,
    tokenizer,
    sequenceClassifier
])

model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))


bert_sequence_classifier_ade download started this may take some time.
[OK!]


In [None]:
data_list =[["Right inguinal hernia repair in childhood Cervical discectomy 3 years ago Umbilical hernia repair 2137. Retired schoolteacher, now substitutes. Lives with wife in location 1439. Has a 27 yo son and a 25 yo daughter. Name (NI) past or present smoking hx, no EtOH."],
     ["Atrial Septal Defect with Right Atrial Thrombus Pulmonary Hypertension Obesity, Obstructive Sleep Apnea. Denies tobacco and ETOH. Works as cafeteria worker."]]

data = spark.createDataFrame(data_list*100).toDF("text")

result = model.transform(data)
result.select("text", "classes.result").show(2,truncate=100)

+----------------------------------------------------------------------------------------------------+-----------------------------------+
|                                                                                                text|                             result|
+----------------------------------------------------------------------------------------------------+-----------------------------------+
|Right inguinal hernia repair in childhood Cervical discectomy 3 years ago Umbilical hernia repair...|[False, False, False, False, False]|
|Atrial Septal Defect with Right Atrial Thrombus Pulmonary Hypertension Obesity, Obstructive Sleep...|              [False, False, False]|
+----------------------------------------------------------------------------------------------------+-----------------------------------+
only showing top 2 rows



In [None]:
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")


tokenizer = nlp.Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

sequenceClassifier = medical.BertForSequenceClassification.pretrained("bert_sequence_classifier_ade", "en", "clinical/models")\
    .setInputCols(["sentence","token"])\
    .setOutputCol("classes")\
    .setBatchSize(8)\
    .setCoalesceSentences(True)  # set True


pipeline = nlp.Pipeline(stages=[
    document_assembler,
    sentenceDetector,
    tokenizer,
    sequenceClassifier
])

model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))


bert_sequence_classifier_ade download started this may take some time.
[OK!]


In [None]:
data_list =[["Right inguinal hernia repair in childhood Cervical discectomy 3 years ago Umbilical hernia repair 2137. Retired schoolteacher, now substitutes. Lives with wife in location 1439. Has a 27 yo son and a 25 yo daughter. Name (NI) past or present smoking hx, no EtOH."],
     ["Atrial Septal Defect with Right Atrial Thrombus Pulmonary Hypertension Obesity, Obstructive Sleep Apnea. Denies tobacco and ETOH. Works as cafeteria worker."]]

data = spark.createDataFrame(data_list*100).toDF("text")
result = model.transform(data)
result.select("text", "classes.result").show(2,truncate=100)

+----------------------------------------------------------------------------------------------------+-------+
|                                                                                                text| result|
+----------------------------------------------------------------------------------------------------+-------+
|Right inguinal hernia repair in childhood Cervical discectomy 3 years ago Umbilical hernia repair...|[False]|
|Atrial Septal Defect with Right Atrial Thrombus Pulmonary Hypertension Obesity, Obstructive Sleep...|[False]|
+----------------------------------------------------------------------------------------------------+-------+
only showing top 2 rows



### ▶`setMaxSentenceLength`

```
Sets max sentence length to process, by default 128
 ```
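
A simplified way to picture this limit is plain truncation of the token list. The sketch below assumes exactly that; `truncate_tokens` is a hypothetical helper, and the real annotator works on wordpieces and also reserves room for special tokens, which is not modeled here.

```python
def truncate_tokens(tokens, max_len):
    """Keep at most max_len tokens; anything past the limit never reaches the model."""
    return tokens[:max_len]

tokens = "Denies tobacco and ETOH . Works as cafeteria worker .".split()

for max_len in (2, 128):
    kept = truncate_tokens(tokens, max_len)
    print(max_len, len(kept), kept)
```

With a very small limit the model sees almost none of the sentence, so its predictions can no longer reflect the actual content.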



In [None]:
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")


tokenizer = nlp.Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

sequenceClassifier = medical.BertForSequenceClassification.pretrained("bert_sequence_classifier_ade", "en", "clinical/models")\
    .setInputCols(["sentence","token"])\
    .setOutputCol("classes")\
    .setCoalesceSentences(False)\
    .setMaxSentenceLength(2)


pipeline = nlp.Pipeline(stages=[
    document_assembler,
    sentenceDetector,
    tokenizer,
    sequenceClassifier
])

model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

bert_sequence_classifier_ade download started this may take some time.
[OK!]


In [None]:
data_list =[["Right inguinal hernia repair in childhood Cervical discectomy 3 years ago Umbilical hernia repair 2137. Retired schoolteacher, now substitutes. Lives with wife in location 1439. Has a 27 yo son and a 25 yo daughter. Name (NI) past or present smoking hx, no EtOH."],
     ["Atrial Septal Defect with Right Atrial Thrombus Pulmonary Hypertension Obesity, Obstructive Sleep Apnea. Denies tobacco and ETOH. Works as cafeteria worker."]]

data = spark.createDataFrame(data_list*100).toDF("text")
result = model.transform(data)
result.select("text", "classes.result").show(2,truncate=100)

+----------------------------------------------------------------------------------------------------+------------------------------+
|                                                                                                text|                        result|
+----------------------------------------------------------------------------------------------------+------------------------------+
|Right inguinal hernia repair in childhood Cervical discectomy 3 years ago Umbilical hernia repair...|[True, True, True, True, True]|
|Atrial Septal Defect with Right Atrial Thrombus Pulmonary Hypertension Obesity, Obstructive Sleep...|            [True, True, True]|
+----------------------------------------------------------------------------------------------------+------------------------------+
only showing top 2 rows



In [None]:
# For testing purposes, maxSentenceLength is set to 2; with the input truncated this heavily, every prediction shown above changes to True.

### ▶ `caseSensitive`

Whether to ignore case in tokens for embeddings matching. **(Default: True)**
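
One way to think about this flag is as a normalization step applied to tokens before they are matched against the model vocabulary. The sketch below assumes that case-insensitive mode simply lowercases each token; `normalize_tokens` is a hypothetical helper, not a Spark NLP function.

```python
def normalize_tokens(tokens, case_sensitive):
    """Return tokens unchanged when case_sensitive is True, lowercased otherwise."""
    return list(tokens) if case_sensitive else [t.lower() for t in tokens]

tokens = ["Denies", "tobacco", "and", "ETOH"]

print(normalize_tokens(tokens, case_sensitive=True))
print(normalize_tokens(tokens, case_sensitive=False))
```

Because the model then receives different input tokens, predictions can differ between the two settings, so the flag should match how the pretrained model was trained.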

In [None]:
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

sequenceClassifier = medical.BertForSequenceClassification.pretrained("bert_sequence_classifier_ade", "en", "clinical/models")\
    .setInputCols(["document","token"])\
    .setOutputCol("classes")\
    .setCaseSensitive(False)

pipeline = nlp.Pipeline(stages=[
    document_assembler,
    tokenizer,
    sequenceClassifier
])

model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

bert_sequence_classifier_ade download started this may take some time.
[OK!]


In [None]:
data_list =[["Right inguinal hernia repair in childhood Cervical discectomy 3 years ago Umbilical hernia repair 2137. Retired schoolteacher, now substitutes. Lives with wife in location 1439. Has a 27 yo son and a 25 yo daughter. Name (NI) past or present smoking hx, no EtOH."],
     ["Atrial Septal Defect with Right Atrial Thrombus Pulmonary Hypertension Obesity, Obstructive Sleep Apnea. Denies tobacco and ETOH. Works as cafeteria worker."]]

data = spark.createDataFrame(data_list*100).toDF("text")
result = model.transform(data)
result.select("text", "classes.result").show(2,truncate=100)

+----------------------------------------------------------------------------------------------------+------+
|                                                                                                text|result|
+----------------------------------------------------------------------------------------------------+------+
|Right inguinal hernia repair in childhood Cervical discectomy 3 years ago Umbilical hernia repair...|[True]|
|Atrial Septal Defect with Right Atrial Thrombus Pulmonary Hypertension Obesity, Obstructive Sleep...|[True]|
+----------------------------------------------------------------------------------------------------+------+
only showing top 2 rows

