![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# **DocumentMLClassifierModel**

This notebook will cover the different parameters and usages of `DocumentMLClassifierModel`.

**🔗 Helpful Links:**

- Python Docs : [DocumentMLClassifierModel](https://nlp.johnsnowlabs.com/licensed/api/python/reference/autosummary/sparknlp_jsl/annotator/classification/document_ml_classifier/index.html#sparknlp_jsl.annotator.classification.document_ml_classifier.DocumentMLClassifierModel)

- Scala Docs : [DocumentMLClassifierModel](https://nlp.johnsnowlabs.com/licensed/api/com/johnsnowlabs/nlp/annotators/classification/DocumentMLClassifierModel.html)

- For extended examples of usage, see the [Spark NLP Workshop repository](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/healthcare-nlp).

## **🎬 Colab Setup**

In [1]:
!pip install -q johnsnowlabs


In [2]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

Please Upload your John Snow Labs License using the button below


Saving spark_nlp_for_healthcare_spark_ocr_7139.json to spark_nlp_for_healthcare_spark_ocr_7139.json


In [3]:
from johnsnowlabs import nlp

nlp.install()

👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_7139.json
🚨 Outdated OCR Secrets in license file. Version=5.1.0 but should be Version=5.0.2
📋 Stored John Snow Labs License in /root/.johnsnowlabs/licenses/license_number_0_for_Spark-Healthcare_Spark-OCR.json
👷 Setting up  John Snow Labs home in /root/.johnsnowlabs, this might take a few minutes.
Downloading 🐍+🚀 Python Library spark_nlp-5.1.4-py2.py3-none-any.whl
Downloading 🐍+💊 Python Library spark_nlp_jsl-5.1.3-py3-none-any.whl
Downloading 🫘+🚀 Java Library spark-nlp-assembly-5.1.4.jar
Downloading 🫘+💊 Java Library spark-nlp-jsl-5.1.3.jar
🙆 JSL Home setup in /root/.johnsnowlabs
👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_7139.json
Installing /root/.johnsnowlabs/py_installs/spark_nlp_jsl-5.1.3-py3-none-any.whl to /usr/bin/python3
Installed 1 products:
💊 Spark-Healthcare==5.1.3 installed! ✅ Heal the planet with NLP! 


In [1]:
from johnsnowlabs import nlp, medical
import pyspark.sql.functions as F
import pandas as pd

spark = nlp.start()

👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_7139.json
🚨 Outdated OCR Secrets in license file. Version=5.1.0 but should be Version=5.0.2
👌 Launched cpu optimized session with with: 🚀Spark-NLP==5.1.4, 💊Spark-Healthcare==5.1.3, running on ⚡ PySpark==3.1.2


## **🖨️ Input/Output Annotation Types**

- Input: `TOKEN`

- Output: `CATEGORY`

## **🔎 Parameters**


- `labels`: (list) Sets the name of labels to be used.

- `minTokenNgram`: (int) Sets minimum number of tokens for Ngrams.*

- `maxTokenNgram`: (int) Sets maximum number of tokens for Ngrams.*


> **\* Use with caution, as pretrained models were trained with specific values for minimum and maximum values of n-grams.**

## Prepare data

In [2]:
data = spark.createDataFrame([
    ["I feel great after taking tylenol."],
    ["Toxic epidermal necrolysis resulted after 19 days of treatment with 5-fluorocytosine and amphotericin B."]
]).toDF("text")

In [3]:
data.show(truncate=False)

+--------------------------------------------------------------------------------------------------------+
|text                                                                                                    |
+--------------------------------------------------------------------------------------------------------+
|I feel great after taking tylenol.                                                                      |
|Toxic epidermal necrolysis resulted after 19 days of treatment with 5-fluorocytosine and amphotericin B.|
+--------------------------------------------------------------------------------------------------------+



### `setLabels()`


The pretrained model `classifierml_ade` uses the labels `"False"` and `"True"` to indicate whether the text describes an Adverse Drug Event (ADE). Let's change the labels to `"Not ADE"` and `"ADE"`:

In [4]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = nlp.Tokenizer()\
    .setInputCols("document")\
    .setOutputCol("token")

classifier_ml = medical.DocumentMLClassifierModel.pretrained("classifierml_ade", "en", "clinical/models")\
    .setInputCols("token")\
    .setOutputCol("prediction")\
    .setLabels(["Not ADE", "ADE"])

clf_Pipeline = nlp.Pipeline(stages=[
    document_assembler,
    tokenizer,
    classifier_ml])


classifierml_ade download started this may take some time.
[OK!]


In [5]:
clf_model = clf_Pipeline.fit(data)
result = clf_model.transform(data)

In [6]:
result.select('text','prediction.result').show(truncate=False)

+--------------------------------------------------------------------------------------------------------+---------+
|text                                                                                                    |result   |
+--------------------------------------------------------------------------------------------------------+---------+
|Toxic epidermal necrolysis resulted after 19 days of treatment with 5-fluorocytosine and amphotericin B.|[ADE]    |
|I feel great after taking tylenol.                                                                      |[Not ADE]|
+--------------------------------------------------------------------------------------------------------+---------+
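
The relabeling seen above can be pictured with a small plain-Python sketch. It assumes that the new labels replace the original ones position by position (`"False"` becomes `"Not ADE"`, `"True"` becomes `"ADE"`), which is what the output suggests; it is not the annotator's actual implementation, and `raw_predictions` is a hypothetical list made up for illustration.

```python
# Labels of the pretrained model, as described in the text above.
original_labels = ["False", "True"]
# Labels passed to setLabels() in this notebook.
new_labels = ["Not ADE", "ADE"]

# Assumption: relabeling is positional (first original -> first new, and so on).
label_map = dict(zip(original_labels, new_labels))

# Hypothetical raw predictions, used only to illustrate the mapping.
raw_predictions = ["True", "False"]
print([label_map[p] for p in raw_predictions])
# ['ADE', 'Not ADE']
```

Under this assumption, the order of the list passed to `setLabels()` matters: swapping the two entries would swap the meaning of the predictions.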



### `setMinTokenNgram()` and `setMaxTokenNgram()`

These setters define the range of n-gram sizes (in tokens) that the vectorizer model uses.


> **\* Use with caution, as pretrained models were trained with specific values for minimum and maximum values of n-grams.**


These parameters are used internally to create the features to be used as input to the model.
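
To make the role of the two values concrete, here is a minimal plain-Python sketch of token n-gram generation. The helper `token_ngrams` is hypothetical and only illustrates the idea of a minimum and maximum n-gram size; it is not the code the annotator runs internally.

```python
from typing import List


def token_ngrams(tokens: List[str], min_n: int, max_n: int) -> List[str]:
    """Return all contiguous token n-grams with min_n <= n <= max_n."""
    if min_n < 1 or max_n < min_n:
        raise ValueError("expected 1 <= min_n <= max_n")
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens over the sentence.
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


tokens = ["I", "feel", "great"]
print(token_ngrams(tokens, 1, 2))
# ['I', 'feel', 'great', 'I feel', 'feel great']
print(token_ngrams(tokens, 2, 4))
# ['I feel', 'feel great', 'I feel great']
```

With (2, 4) the single tokens disappear from the feature set and longer n-grams appear, so the same text yields a different set of features.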

In [7]:
# Change min to 2 and max to 4

classifier_ml.setMinTokenNgram(2).setMaxTokenNgram(4)

clf_Pipeline = nlp.Pipeline(stages=[
    document_assembler,
    tokenizer,
    classifier_ml])

clf_model = clf_Pipeline.fit(data)
result = clf_model.transform(data)

In [8]:
result.select('text', 'prediction.result').show(truncate=False)

+--------------------------------------------------------------------------------------------------------+---------+
|text                                                                                                    |result   |
+--------------------------------------------------------------------------------------------------------+---------+
|Toxic epidermal necrolysis resulted after 19 days of treatment with 5-fluorocytosine and amphotericin B.|[ADE]    |
|I feel great after taking tylenol.                                                                      |[Not ADE]|
+--------------------------------------------------------------------------------------------------------+---------+



Please note that this specific pretrained model was fitted with (1,2) as the minimum and maximum n-gram sizes. N-grams outside that range (here, 3-grams and 4-grams) were never seen during training, so the vectorizer model treats them as Out-of-Vocabulary (OOV). Usually, it is recommended to use a subinterval of the range the pretrained model was trained with. This way, every n-gram size is one the model already knows.
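
The OOV effect can be illustrated with a toy vocabulary in plain Python. Everything here is an assumption made for illustration: the helper names, the two-document "training" corpus, and the idea of counting known n-grams. The pretrained model's real vocabulary and vectorizer are not reproduced.

```python
from typing import List, Set, Tuple


def token_ngrams(tokens: List[str], min_n: int, max_n: int) -> List[str]:
    """Return all contiguous token n-grams with min_n <= n <= max_n."""
    return [
        " ".join(tokens[start:start + n])
        for n in range(min_n, max_n + 1)
        for start in range(len(tokens) - n + 1)
    ]


# Toy training corpus; the vocabulary is built with the (1, 2) range.
train_docs = [
    ["rash", "after", "treatment"],
    ["feel", "great", "after", "treatment"],
]
vocab: Set[str] = set()
for doc in train_docs:
    vocab.update(token_ngrams(doc, 1, 2))


def known_counts(tokens: List[str], min_n: int, max_n: int) -> Tuple[int, int]:
    """Return (n-grams found in the vocabulary, total n-grams generated)."""
    grams = token_ngrams(tokens, min_n, max_n)
    return sum(g in vocab for g in grams), len(grams)


new_doc = ["rash", "after", "treatment"]
print(known_counts(new_doc, 1, 2))  # (5, 5): same range as training
print(known_counts(new_doc, 1, 1))  # (3, 3): subinterval, all known
print(known_counts(new_doc, 2, 4))  # (2, 3): the 3-gram is OOV
```

In this sketch a subinterval such as (1, 1) never produces unknown n-grams, but it also produces fewer features, because the bigrams are no longer generated.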

For example, let's use (1,1):

In [9]:
classifier_ml.setMinTokenNgram(1).setMaxTokenNgram(1)

clf_Pipeline = nlp.Pipeline(stages=[
    document_assembler,
    tokenizer,
    classifier_ml])

clf_model = clf_Pipeline.fit(data)
result = clf_model.transform(data)
result.select('text', 'prediction.result').show(truncate=False)

+--------------------------------------------------------------------------------------------------------+---------+
|text                                                                                                    |result   |
+--------------------------------------------------------------------------------------------------------+---------+
|Toxic epidermal necrolysis resulted after 19 days of treatment with 5-fluorocytosine and amphotericin B.|[Not ADE]|
|I feel great after taking tylenol.                                                                      |[Not ADE]|
+--------------------------------------------------------------------------------------------------------+---------+



We can see that the information carried by bigrams was lost: the model now incorrectly classifies the toxic epidermal necrolysis document as `Not ADE`.