# **MedicalDistilBertForSequenceClassification models**

This notebook covers the parameters and usage of the `MedicalDistilBertForSequenceClassification` annotator.


**📖 Learning Objectives:**

Become comfortable using the different parameters of the annotator.

**🔗 Helpful Links:**

- Documentation: [MedicalDistilBertForSequenceClassification](https://nlp.johnsnowlabs.com/2022/02/08/distilbert_sequence_classifier_ade_en.html)

- Python Docs: [MedicalDistilBertForSequenceClassification](https://nlp.johnsnowlabs.com/licensed/api/python/reference/autosummary/sparknlp_jsl/annotator/classification/medical_distilbert_for_sequence_classification/index.html#)

- For extended examples of usage, see [Spark NLP Workshop repository](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare).


## **📜 Background**

`MedicalDistilBertForSequenceClassification` can load DistilBERT models with a sequence classification/regression head on top (a linear layer on top of the pooled output), e.g. for multi-class document classification tasks. Pretrained models can be loaded with the `.pretrained` method of the companion object.

For available pretrained models, please see the [`Models Hub`](https://nlp.johnsnowlabs.com/models?).

Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀.
To see which models are compatible and how to import them, see [`Import Transformers into Spark NLP` 🚀](https://github.com/JohnSnowLabs/spark-nlp/discussions/5669).

## **🎬 Colab Setup**

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs==5.1.4

In [None]:
from johnsnowlabs import nlp, medical, visual

# After uploading your license, run this to install all licensed Python wheels and pre-download the jars for the Spark session JVM
nlp.install()

In [None]:
from johnsnowlabs import nlp, medical, visual
import pandas as pd

# Automatically load license data and start a session with all jars user has access to
spark = nlp.start()

In [None]:
spark

## **🖨️ Input/Output Annotation Types**



- Input: `DOCUMENT, TOKEN`

- Output: `CATEGORY`

## **🖨️ Running Classifier**



In [None]:
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

sequenceClassifier = medical.DistilBertForSequenceClassification.pretrained("distilbert_sequence_classifier_ade", "en", "clinical/models")\
    .setInputCols(["document","token"])\
    .setOutputCol("class")

pipeline = nlp.Pipeline(stages=[
    document_assembler,
    tokenizer,
    sequenceClassifier
])

model = pipeline.fit(spark.createDataFrame([['']]).toDF("text"))


distilbert_sequence_classifier_ade download started this may take some time.
[OK!]


In [None]:
data = spark.createDataFrame([["I have an allergic reaction to vancomycin so I have itchy skin, sore throat/burning/itching, numbness of tongue and gums.I would not recommend this drug to anyone, especially since I have never had such an adverse reaction to any other medication."],
                              ["Religare Capital Ranbaxy has been accepting approval for Diovan since 2012"]]).toDF("text")

result = model.transform(data)

result.select("text", "class.result").show(truncate=100)

+----------------------------------------------------------------------------------------------------+-------+
|                                                                                                text| result|
+----------------------------------------------------------------------------------------------------+-------+
|I have an allergic reaction to vancomycin so I have itchy skin, sore throat/burning/itching, numb...| [True]|
|                          Religare Capital Ranbaxy has been accepting approval for Diovan since 2012|[False]|
+----------------------------------------------------------------------------------------------------+-------+



## **🔎 Parameters**

- `batchSize`: Size of every batch (default: 8).

- `coalesceSentences`: Instead of 1 class per sentence (if inputCols is `sentence`), output 1 class per document by averaging probabilities in all sentences (default: False).

- `maxSentenceLength`: Max sentence length to process (default: 128).

- `caseSensitive`: Whether to ignore case in tokens for embeddings matching (default: True).

In [None]:
sequenceClassifier.extractParamMap()


{Param(parent='MedicalDistilBertForSequenceClassification_60da933a49f9', name='batchSize', doc='Size of every batch'): 8,
 Param(parent='MedicalDistilBertForSequenceClassification_60da933a49f9', name='coalesceSentences', doc="Instead of 1 class per sentence (if inputCols is '''sentence''') output 1 class per document by averaging probabilities in all sentences."): False,
 Param(parent='MedicalDistilBertForSequenceClassification_60da933a49f9', name='engine', doc='Deep Learning engine used for this model'): 'tensorflow',
 Param(parent='MedicalDistilBertForSequenceClassification_60da933a49f9', name='lazyAnnotator', doc='Whether this AnnotatorModel acts as lazy in RecursivePipelines'): False,
 Param(parent='MedicalDistilBertForSequenceClassification_60da933a49f9', name='maxSentenceLength', doc='Max sentence length to process'): 128,
 Param(parent='MedicalDistilBertForSequenceClassification_60da933a49f9', name='caseSensitive', doc='whether to ignore case in tokens for embeddings matching'):

### ▶ `batchSize`



```
    batchSize
        Batch size. Larger values allow faster processing but require more
        memory, by default 8
```



In [None]:
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

sequenceClassifier = medical.DistilBertForSequenceClassification.pretrained("distilbert_sequence_classifier_ade", "en", "clinical/models")\
    .setInputCols(["document","token"])\
    .setOutputCol("class")\
    .setBatchSize(4)

pipeline = nlp.Pipeline(stages=[
    document_assembler,
    tokenizer,
    sequenceClassifier
])

model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))


distilbert_sequence_classifier_ade download started this may take some time.
[OK!]


In [None]:
data = spark.createDataFrame([["I have an allergic reaction to vancomycin so I have itchy skin, sore throat/burning/itching, numbness of tongue and gums.I would not recommend this drug to anyone, especially since I have never had such an adverse reaction to any other medication."],
                              ["Religare Capital Ranbaxy has been accepting approval for Diovan since 2012"]]*100).toDF("text")

In [None]:
%%time
result = model.transform(data)
result.write.mode("overwrite").format("noop").save()
# result.select("text", "class.result").show(truncate=False)

CPU times: user 76.4 ms, sys: 3.77 ms, total: 80.2 ms
Wall time: 9.26 s


In [None]:
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

sequenceClassifier = medical.DistilBertForSequenceClassification.pretrained("distilbert_sequence_classifier_ade", "en", "clinical/models")\
    .setInputCols(["document","token"])\
    .setOutputCol("class")\
    .setBatchSize(64)

pipeline2 = nlp.Pipeline(stages=[
    document_assembler,
    tokenizer,
    sequenceClassifier
])

model2 = pipeline2.fit(spark.createDataFrame([[""]]).toDF("text"))

distilbert_sequence_classifier_ade download started this may take some time.
[OK!]


In [None]:
data = spark.createDataFrame([["I have an allergic reaction to vancomycin so I have itchy skin, sore throat/burning/itching, numbness of tongue and gums.I would not recommend this drug to anyone, especially since I have never had such an adverse reaction to any other medication."],
                              ["Religare Capital Ranbaxy has been accepting approval for Diovan since 2012"]]*100).toDF("text")

In [None]:
%%time
result2 = model2.transform(data)
result2.write.mode("overwrite").format("noop").save()
# result2.select("text", "class.result").show(truncate=False)

CPU times: user 32.4 ms, sys: 1.83 ms, total: 34.3 ms
Wall time: 3.23 s
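
The shorter wall time with the larger batch size is consistent with fewer, larger batches being handed to the model. As a rough illustration only, the sketch below counts how many batches 200 rows would form at each of the two batch sizes used above. The `chunk` helper is hypothetical and ignores Spark partitioning, so it should not be read as the annotator's actual implementation.

```python
from typing import Iterator, List

def chunk(rows: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `rows` with at most `batch_size` items each."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

# Same shape as the benchmark data: 2 texts repeated 100 times = 200 rows.
rows = ["adverse reaction text", "neutral text"] * 100

batches_small = list(chunk(rows, 4))
batches_large = list(chunk(rows, 64))

print(len(batches_small))  # 50
print(len(batches_large))  # 4
```

Fewer batches means fewer model invocations, at the cost of holding more rows in memory at once.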


### ▶`setCoalesceSentences`

```
Instead of 1 class per sentence (if inputCols is '''sentence''') output 1 class per document by averaging probabilities in all sentences.
Due to the max sequence length limit in almost all transformer models such as BERT (512 tokens), this parameter helps feed all the sentences
 into the model and average the probabilities over the entire document instead of reporting probabilities per sentence.
 ```
**(Default: False)**

In the next two runs, the result column shows the difference.



In [None]:
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")


tokenizer = nlp.Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

sequenceClassifier = medical.DistilBertForSequenceClassification.pretrained("distilbert_sequence_classifier_ade", "en", "clinical/models")\
    .setInputCols(["sentence","token"])\
    .setOutputCol("class_")\
    .setCoalesceSentences(False)  # set False

pipeline = nlp.Pipeline(stages=[
    document_assembler,
    sentenceDetector,
    tokenizer,
    sequenceClassifier
])

model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))


distilbert_sequence_classifier_ade download started this may take some time.
[OK!]


In [None]:
data = spark.createDataFrame([
    [
        "I have an allergic reaction to vancomycin so I have itchy skin, sore throat/burning/itching, numbness of tongue and gums.I would not recommend this drug to anyone, especially since I have never had such an adverse reaction to any other medication. " \
        "Religare Capital Ranbaxy has been accepting approval for Diovan since 2012.",
        ],
    ["Religare Capital Ranbaxy has been accepting approval for Diovan since 2012"]]).toDF("text")

result = model.transform(data)
result.select(result.text, result.class_.result).show()

+--------------------+-------------+
|                text|class_.result|
+--------------------+-------------+
|I have an allergi...|[True, False]|
|Religare Capital ...|      [False]|
+--------------------+-------------+



In [None]:
df=result.select(result.class_.metadata).toPandas()
df['class_.metadata'].iloc[0]  # the first row has two sentences; their confidence levels are shown below
# With .setCoalesceSentences(True), averaging the two sentences' probabilities yields False


[{'sentence': '0', 'Some(False)': '0.02774548', 'Some(True)': '0.9722545'},
 {'sentence': '1', 'Some(False)': '0.98720175', 'Some(True)': '0.012798243'}]
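
To make the averaging concrete, the sketch below takes the two per-sentence probabilities printed above and averages them per label. It is a plain-Python illustration of the idea, not the annotator's internal code; the `average_probabilities` helper and the simplified label keys (with the `Some(...)` wrappers dropped) are assumptions made for this example.

```python
from typing import Dict, List

def average_probabilities(sentences: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each label's probability across all sentences of a document."""
    labels = sentences[0].keys()
    return {
        label: sum(s[label] for s in sentences) / len(sentences)
        for label in labels
    }

# Per-sentence probabilities copied from the metadata output above.
sentence_probs = [
    {"False": 0.02774548, "True": 0.9722545},
    {"False": 0.98720175, "True": 0.012798243},
]

doc_probs = average_probabilities(sentence_probs)
doc_label = max(doc_probs, key=doc_probs.get)

print(doc_probs)
print(doc_label)  # False
```

The averaged `False` probability comes out slightly above 0.5, so a document-level prediction based on these numbers would be `False`.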

In [None]:
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")


tokenizer = nlp.Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

sequenceClassifier = medical.DistilBertForSequenceClassification.pretrained("distilbert_sequence_classifier_ade", "en", "clinical/models")\
    .setInputCols(["sentence","token"])\
    .setOutputCol("class")\
    .setCoalesceSentences(True)  # set True


pipeline = nlp.Pipeline(stages=[
    document_assembler,
    sentenceDetector,
    tokenizer,
    sequenceClassifier
])

model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))


distilbert_sequence_classifier_ade download started this may take some time.
[OK!]


In [None]:
data = spark.createDataFrame([
    [
        "I have an allergic reaction to vancomycin so I have itchy skin, sore throat/burning/itching, numbness of tongue and gums.I would not recommend this drug to anyone, especially since I have never had such an adverse reaction to any other medication. " \
        "Religare Capital Ranbaxy has been accepting approval for Diovan since 2012.",
        ],
    ["Religare Capital Ranbaxy has been accepting approval for Diovan since 2012"]]).toDF("text")

result = model.transform(data)

result.select("text", "class.result").show(truncate=60)

+------------------------------------------------------------+-------+
|                                                        text| result|
+------------------------------------------------------------+-------+
|I have an allergic reaction to vancomycin so I have itchy...|[False]|
|Religare Capital Ranbaxy has been accepting approval for ...|[False]|
+------------------------------------------------------------+-------+



In [None]:
# As shown above, the first document, which has two sentences, is predicted as False

### ▶`setMaxSentenceLength`

```
Sets max sentence length to process, by default 128
 ```



In [None]:
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")



tokenizer = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

sequenceClassifier = medical.DistilBertForSequenceClassification.pretrained("distilbert_sequence_classifier_ade", "en", "clinical/models")\
    .setInputCols(["document","token"])\
    .setOutputCol("class_")\
    .setMaxSentenceLength(2)  # set 2

pipeline = nlp.Pipeline(stages=[
    document_assembler,
    tokenizer,
    sequenceClassifier
])

model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
data = spark.createDataFrame([
    [
        "I have an allergic reaction to vancomycin so I have itchy skin, sore throat/burning/itching, numbness of tongue and gums.I would not recommend this drug to anyone, especially since I have never had such an adverse reaction to any other medication. " \
        "Religare Capital Ranbaxy has been accepting approval for Diovan since 2012.",
        ],
    ["Religare Capital Ranbaxy has been accepting approval for Diovan since 2012"]]).toDF("text")

result = model.transform(data)

result.select("text", "class_.result").show(truncate=60)


distilbert_sequence_classifier_ade download started this may take some time.
[OK!]
+------------------------------------------------------------+------+
|                                                        text|result|
+------------------------------------------------------------+------+
|I have an allergic reaction to vancomycin so I have itchy...|[True]|
|Religare Capital Ranbaxy has been accepting approval for ...|[True]|
+------------------------------------------------------------+------+



In [None]:
# For testing purposes, maxSentenceLength is set to 2; both results are True, as shown above
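
With a limit of 2, almost none of the input reaches the model, which is presumably why both rows end up with the same label. The sketch below illustrates truncation on whitespace tokens only. It is a simplification: the real model operates on subword pieces and may reserve positions for special tokens, and the `truncate` helper is hypothetical.

```python
from typing import List

def truncate(tokens: List[str], max_length: int) -> List[str]:
    """Keep only the first `max_length` tokens, dropping the rest."""
    return tokens[:max_length]

text = "Religare Capital Ranbaxy has been accepting approval for Diovan since 2012"
tokens = text.split()

print(truncate(tokens, 2))    # ['Religare', 'Capital']
print(truncate(tokens, 128))  # all 11 tokens, since the text is shorter than the limit
```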

### ▶ `caseSensitive`

`Whether to ignore case in tokens for embeddings matching` **(Default: True)**

In [None]:
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

sequenceClassifier = medical.DistilBertForSequenceClassification.pretrained("distilbert_sequence_classifier_ade", "en", "clinical/models")\
    .setInputCols(["document","token"])\
    .setOutputCol("class_")\
    .setCaseSensitive(False)  # set False

pipeline = nlp.Pipeline(stages=[
    document_assembler,
    tokenizer,
    sequenceClassifier
])

model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
data = spark.createDataFrame([
    [
        "I have an allergic reaction to vancomycin so I have itchy skin, sore throat/burning/itching, numbness of tongue and gums.I would not recommend this drug to anyone, especially since I have never had such an adverse reaction to any other medication. " \
        "Religare Capital Ranbaxy has been accepting approval for Diovan since 2012.",
        ],
    ["Religare Capital Ranbaxy has been accepting approval for Diovan since 2012"]]).toDF("text")

result = model.transform(data)

result.select("text", "class_.result").show(truncate=60)


distilbert_sequence_classifier_ade download started this may take some time.
[OK!]
+------------------------------------------------------------+------+
|                                                        text|result|
+------------------------------------------------------------+------+
|I have an allergic reaction to vancomycin so I have itchy...|[True]|
|Religare Capital Ranbaxy has been accepting approval for ...|[True]|
+------------------------------------------------------------+------+
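
One way to picture this parameter: with `caseSensitive` set to `False`, tokens are presumably lowercased before they are matched against the model vocabulary, so `Diovan` and `diovan` would be treated alike. The sketch below shows only that normalization step; the `normalize` helper is hypothetical and is not taken from the library.

```python
from typing import List

def normalize(tokens: List[str], case_sensitive: bool) -> List[str]:
    """Return tokens unchanged when case matters, otherwise lowercased."""
    return list(tokens) if case_sensitive else [t.lower() for t in tokens]

tokens = ["Religare", "Capital", "Ranbaxy", "Diovan"]

print(normalize(tokens, case_sensitive=True))   # ['Religare', 'Capital', 'Ranbaxy', 'Diovan']
print(normalize(tokens, case_sensitive=False))  # ['religare', 'capital', 'ranbaxy', 'diovan']
```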

