![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# **DocumentMLClassifierApproach**

This notebook will cover the different parameters and usages of `DocumentMLClassifierApproach`.

**📖 Learning Objectives:**

1. Understand how to train a model to classify documents with a Logistic Regression or Support Vector Machine (SVM) algorithm.

2. Become comfortable using the different parameters of the annotator.


**🔗 Helpful Links:**

- Documentation : [DocumentMLClassifierApproach](https://nlp.johnsnowlabs.com/docs/en/licensed_annotators#documentmlclassifier)

- Python Docs : [DocumentMLClassifierApproach](https://nlp.johnsnowlabs.com/licensed/api/python/reference/autosummary/sparknlp_jsl/annotator/classification/document_ml_classifier/index.html)

- Scala Docs : [DocumentMLClassifierApproach](https://nlp.johnsnowlabs.com/licensed/api/com/johnsnowlabs/nlp/annotators/classification/DocumentMLClassifierApproach.html)

- For extended examples of usage, see the [Spark NLP Workshop repository](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/healthcare-nlp).

## **🎬 Colab Setup**

In [None]:
!pip install -q johnsnowlabs


In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

Please Upload your John Snow Labs License using the button below


Saving spark_nlp_for_healthcare_spark_ocr_8734_532.json to spark_nlp_for_healthcare_spark_ocr_8734_532.json


In [None]:
from johnsnowlabs import nlp, medical

nlp.install()

👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_8734_532.json
📋 Stored John Snow Labs License in /root/.johnsnowlabs/licenses/license_number_0_for_Spark-Healthcare_Spark-OCR.json
👷 Setting up  John Snow Labs home in /root/.johnsnowlabs, this might take a few minutes.
Downloading 🐍+🚀 Python Library spark_nlp-5.3.2-py2.py3-none-any.whl
Downloading 🐍+💊 Python Library spark_nlp_jsl-5.3.2-py3-none-any.whl
Downloading 🫘+🚀 Java Library spark-nlp-assembly-5.3.2.jar
Downloading 🫘+💊 Java Library spark-nlp-jsl-5.3.2.jar
🙆 JSL Home setup in /root/.johnsnowlabs
👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_8734_532.json
Installing /root/.johnsnowlabs/py_installs/spark_nlp_jsl-5.3.2-py3-none-any.whl to /usr/bin/python3
Installed 1 products:
💊 Spark-Healthcare==5.3.2 installed! ✅ Heal the planet with NLP! 


In [None]:
import pyspark.sql.functions as F
import pandas as pd

spark = nlp.start()
spark

👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_8734_532.json
👌 Launched cpu optimized session with with: 🚀Spark-NLP==5.3.2, 💊Spark-Healthcare==5.3.2, running on ⚡ PySpark==3.4.0


## **🖨️ Input/Output Annotation Types**

- Input: `TOKEN`

- Output: `CATEGORY`

## **🔎 Parameters**


- `labelCol`: (str) Sets the column that holds the label we are trying to predict.

- `maxIter`: (int) Sets the maximum number of iterations (Default: 10).

- `tol`: (float) Sets the convergence tolerance after each iteration (Default: 1e-6).

- `fitIntercept`: (bool) Sets whether to fit an intercept term (Default: true).

- `classificationModelClass`: (str) Sets the classification model class from SparkML to use; possible values are `logreg` (Logistic Regression) or `svm` (Support Vector Machines). Defaults to `svm`.

- `minTokenNgram`: (int) Sets the minimum number of tokens for n-grams.

- `maxTokenNgram`: (int) Sets the maximum number of tokens for n-grams.

- `mergeChunks`: (bool) Sets whether to merge all chunks in a document (Default: false).

## Prepare Data

We will use a dataset with Adverse Drug Events (ADE) examples to train a binary classification model (contains ADE or not).

In [None]:
#downloading sample datasets
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/ADE_Corpus_V2/ADE-NEG.txt
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/ADE_Corpus_V2/DRUG-AE.rel

In [None]:
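# Read each line of ADE-NEG.txt into a single column.
df_neg = pd.read_csv("ADE-NEG.txt", header=None, delimiter="\t", names=["col1"])
# Keep the text that follows the "NEG" marker and label it as negative.
df_neg["text"] = df_neg.col1.str.split("NEG").str[1]
df_neg["category"] = "neg"
df_neg = df_neg[["text", "category"]]
df_neg.head()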

| | text | category |
|---|---|---|
| 0 | Clioquinol intoxication occurring in the trea... | neg |
| 1 | "Retinoic acid syndrome" was prevented with s... | neg |
| 2 | BACKGROUND: External beam radiation therapy o... | neg |
| 3 | Although the enuresis ceased, she developed t... | neg |
| 4 | A 42-year-old woman had uneventful bilateral ... | neg |


In [None]:
# DRUG-AE.rel is pipe-delimited; column 1 holds the sentence text.
df_pos = pd.read_csv("DRUG-AE.rel", header=None, delimiter="|")
df_pos["category"] = "pos"
df_pos = df_pos.rename(columns={1: "text"})
df_pos = df_pos[["text", "category"]]
df_pos.head()

| | text | category |
|---|---|---|
| 0 | Intravenous azithromycin-induced ototoxicity. | pos |
| 1 | Immobilization, while Paget's bone disease was... | pos |
| 2 | Unaccountable severe hypercalcemia in a patien... | pos |
| 3 | METHODS: We report two cases of pseudoporphyri... | pos |
| 4 | METHODS: We report two cases of pseudoporphyri... | pos |


In [None]:
ade_df= pd.concat([df_neg, df_pos]).sample(frac=1) #merge and shuffle the data
ade_df.head()

| | text | category |
|---|---|---|
| 16244 | Because the patient remained bradycardic on p... | neg |
| 12402 | This agent is, however, associated with a rar... | neg |
| 4430 | A 51-yr-old nonsmoking male patient without an... | pos |
| 5751 | Vicks VapoRub induces mucin secretion, decrea... | neg |
| 4623 | He was treated with the immunosuppressive age... | neg |


In [None]:
ade_df.category.value_counts()

category
neg    16695
pos     6821
Name: count, dtype: int64
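The classes are imbalanced, so it helps to know what accuracy a trivial classifier would reach before reading the model reports. The sketch below only reuses the two counts printed above; the variable names are illustrative, and the figure is computed on the full dataset rather than on the test split.

```python
# Class counts copied from the value_counts() output above.
class_counts = {"neg": 16695, "pos": 6821}

total = sum(class_counts.values())
majority_label = max(class_counts, key=class_counts.get)

# Accuracy of a classifier that always predicts the majority class.
majority_baseline = class_counts[majority_label] / total

print(f"total documents: {total}")
print(f"majority class: {majority_label}")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```

Any trained model should be compared against this baseline, and per-class precision and recall for `pos` are more informative than accuracy alone.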

Convert the pandas DataFrame to a Spark DataFrame:

In [None]:
spark_df = spark.createDataFrame(ade_df)
spark_df.show(3)

+--------------------+--------+
|                text|category|
+--------------------+--------+
| Because the pati...|     neg|
| This agent is, h...|     neg|
|A 51-yr-old nonsm...|     pos|
+--------------------+--------+
only showing top 3 rows



In [None]:
train_data, test_data = spark_df.randomSplit([0.7, 0.3], seed=100)

print("Train Dataset Count: " + str(train_data.count()))
print("Test Dataset Count: " + str(test_data.count()))

Train Dataset Count: 16464
Test Dataset Count: 7052
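The counts above are close to, but not exactly, 70% and 30% of the data. A weighted random split assigns each row independently, so the sizes are only approximate. The single-machine sketch below illustrates the idea in plain Python; `random_split` is a hypothetical helper, and Spark's distributed implementation differs in its details.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one split with probability proportional to its weight."""
    total = sum(weights)
    # Cumulative upper bounds of each split on the [0, 1) interval.
    boundaries = []
    running = 0.0
    for weight in weights:
        running += weight / total
        boundaries.append(running)

    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for index, upper in enumerate(boundaries):
            # The last split also catches any floating-point rounding gap.
            if draw < upper or index == len(boundaries) - 1:
                splits[index].append(row)
                break
    return splits

rows = list(range(1000))  # stand-in for DataFrame rows
train, test = random_split(rows, [0.7, 0.3], seed=100)
print(len(train), len(test))
```

Every row lands in exactly one split, and rerunning with the same seed gives the same partition, which is why the notebook fixes `seed=100`.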


Now, let's check how to train models using different hyperparameters.

### `setLabelCol()`



We could set text preprocessing stages in the pipeline (stopword removal, stemming, lemmatization, etc.), but for simplicity we keep the pipeline with tokenization only.

The first parameter, `labelCol`, names the DataFrame column that holds the ground-truth label for each text.

In [None]:
document = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

token = nlp.Tokenizer()\
    .setInputCols("document")\
    .setOutputCol("token")

classifier = medical.DocumentMLClassifierApproach() \
    .setInputCols("token") \
    .setLabelCol("category") \
    .setOutputCol("prediction")

pipeline = nlp.Pipeline(stages=[document, token, classifier])

Train the model:

In [None]:
%%time
model = pipeline.fit(train_data)


CPU times: user 394 ms, sys: 67 ms, total: 461 ms
Wall time: 1min 6s


In [None]:
result = model.transform(test_data).cache()
result.show()

+--------------------+--------+--------------------+--------------------+--------------------+
|                text|category|            document|               token|          prediction|
+--------------------+--------+--------------------+--------------------+--------------------+
| 'Maqianzi' (the ...|     neg|[{document, 0, 10...|[{token, 1, 1, ',...|[{category, 1, 10...|
| 10 months later ...|     neg|[{document, 0, 72...|[{token, 1, 2, 10...|[{category, 1, 72...|
| 18 perfusions we...|     neg|[{document, 0, 16...|[{token, 1, 2, 18...|[{category, 1, 16...|
| 2. Prior to the ...|     neg|[{document, 0, 15...|[{token, 1, 1, 2,...|[{category, 1, 15...|
| 3 of the 5 previ...|     neg|[{document, 0, 98...|[{token, 1, 1, 3,...|[{category, 1, 98...|
| 53-year-old woma...|     neg|[{document, 0, 21...|[{token, 1, 11, 5...|[{category, 1, 21...|
| 5: Movement diso...|     neg|[{document, 0, 71...|[{token, 1, 1, 5,...|[{category, 1, 71...|
| 9. More accurate...|     neg|[{document, 0, 10..

Evaluate the model:

In [None]:
from sklearn.metrics import classification_report

In [None]:
result_df = result.select('category','prediction.result').toPandas()
result_df['result'] = result_df.result.apply(lambda x : x[0])

print (classification_report(result_df['category'], result_df['result']))

              precision    recall  f1-score   support

         neg       0.89      0.90      0.90      5027
         pos       0.75      0.74      0.74      2025

    accuracy                           0.85      7052
   macro avg       0.82      0.82      0.82      7052
weighted avg       0.85      0.85      0.85      7052
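The precision, recall, and F1 columns in the report can be derived from counts of true positives, false positives, and false negatives. The following is a minimal sketch for a single label; `binary_metrics` and the toy label lists are hypothetical and are not part of scikit-learn or Spark NLP.

```python
def binary_metrics(y_true, y_pred, positive="pos"):
    """Return (precision, recall, f1) for the `positive` label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1

# Hypothetical toy labels, not taken from the ADE dataset.
y_true = ["pos", "pos", "pos", "neg", "neg", "neg", "neg", "neg"]
y_pred = ["pos", "pos", "neg", "pos", "neg", "neg", "neg", "neg"]

precision, recall, f1 = binary_metrics(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In the report, the `pos` row is the one to watch: it describes how well the model finds sentences that actually contain an adverse drug event.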



### `setClassificationModelClass()`



By default, the annotator will train an SVM model (`svm`), but we can also set it to `logreg`.

In [None]:
document = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

token = nlp.Tokenizer()\
    .setInputCols("document")\
    .setOutputCol("token")

classifier_logreg = medical.DocumentMLClassifierApproach() \
    .setInputCols("token") \
    .setLabelCol("category") \
    .setOutputCol("prediction") \
    .setClassificationModelClass("logreg")

pipeline = nlp.Pipeline(stages=[document, token, classifier_logreg])

In [None]:
%%time
logreg_model = pipeline.fit(train_data)

CPU times: user 180 ms, sys: 25.8 ms, total: 206 ms
Wall time: 26 s


In [None]:
result_logreg = logreg_model.transform(test_data).cache()
result_logreg.show()

+--------------------+--------+--------------------+--------------------+--------------------+
|                text|category|            document|               token|          prediction|
+--------------------+--------+--------------------+--------------------+--------------------+
| 'Maqianzi' (the ...|     neg|[{document, 0, 10...|[{token, 1, 1, ',...|[{category, 1, 10...|
| 10 months later ...|     neg|[{document, 0, 72...|[{token, 1, 2, 10...|[{category, 1, 72...|
| 18 perfusions we...|     neg|[{document, 0, 16...|[{token, 1, 2, 18...|[{category, 1, 16...|
| 2. Prior to the ...|     neg|[{document, 0, 15...|[{token, 1, 1, 2,...|[{category, 1, 15...|
| 3 of the 5 previ...|     neg|[{document, 0, 98...|[{token, 1, 1, 3,...|[{category, 1, 98...|
| 53-year-old woma...|     neg|[{document, 0, 21...|[{token, 1, 11, 5...|[{category, 1, 21...|
| 5: Movement diso...|     neg|[{document, 0, 71...|[{token, 1, 1, 5,...|[{category, 1, 71...|
| 9. More accurate...|     neg|[{document, 0, 10..

In [None]:
result_df = result_logreg.select('category','prediction.result').toPandas()
result_df['result'] = result_df.result.apply(lambda x : x[0])

print (classification_report(result_df['category'], result_df['result']))

              precision    recall  f1-score   support

         neg       0.89      0.91      0.90      5027
         pos       0.76      0.73      0.75      2025

    accuracy                           0.86      7052
   macro avg       0.83      0.82      0.82      7052
weighted avg       0.86      0.86      0.86      7052



### `setMaxIter()`

Sets the maximum number of training iterations (Default: 10). This affects the training time and convergence.

In [None]:
document = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

token = nlp.Tokenizer()\
    .setInputCols("document")\
    .setOutputCol("token")

classifier_svm = medical.DocumentMLClassifierApproach() \
    .setInputCols("token") \
    .setLabelCol("category") \
    .setOutputCol("prediction") \
    .setClassificationModelClass("svm")\
    .setMaxIter(5)

pipeline = nlp.Pipeline(stages=[document, token, classifier_svm])

result_svm = pipeline.fit(train_data).transform(test_data)

result_svm_df = result_svm.select('category','prediction.result').toPandas()
result_svm_df['result'] = result_svm_df.result.apply(lambda x : x[0])

print (classification_report(result_svm_df['category'], result_svm_df['result']))

              precision    recall  f1-score   support

         neg       0.90      0.90      0.90      5027
         pos       0.74      0.75      0.74      2025

    accuracy                           0.85      7052
   macro avg       0.82      0.82      0.82      7052
weighted avg       0.85      0.85      0.85      7052



### `setTol()`

Sets the convergence tolerance applied after each iteration (Default: 1e-6).
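An iteration cap and a convergence tolerance usually act as two competing stopping rules: training ends as soon as either one is met. The toy gradient descent below is only an illustration of that interaction. The function `minimize`, its objective, and its learning rate are assumptions made for this sketch and do not reproduce the optimizer that SparkML uses.

```python
def minimize(max_iter, tol, learning_rate=0.1):
    """Gradient descent on f(w) = (w - 3)^2; stop on tolerance or iteration cap."""
    w = 0.0
    for iteration in range(1, max_iter + 1):
        gradient = 2 * (w - 3)
        new_w = w - learning_rate * gradient
        # Converged: the update is smaller than the tolerance.
        if abs(new_w - w) < tol:
            return new_w, iteration
        w = new_w
    # The iteration cap was reached before the tolerance was met.
    return w, max_iter

w_capped, iters_capped = minimize(max_iter=10, tol=1e-6)
w_loose, iters_loose = minimize(max_iter=1000, tol=1e-3)
w_tight, iters_tight = minimize(max_iter=1000, tol=1e-6)

print("capped:", w_capped, iters_capped)
print("loose tolerance:", w_loose, iters_loose)
print("tight tolerance:", w_tight, iters_tight)
```

A small cap can stop training before the tolerance is ever reached, while a looser tolerance ends training earlier at the cost of a less precise solution.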

In [None]:
document = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

token = nlp.Tokenizer()\
    .setInputCols("document")\
    .setOutputCol("token")

classifier_logreg = medical.DocumentMLClassifierApproach() \
    .setInputCols("token") \
    .setLabelCol("category") \
    .setOutputCol("prediction") \
    .setClassificationModelClass("logreg")\
    .setTol(1e-3)

pipeline = nlp.Pipeline(stages=[document, token, classifier_logreg])

result_logreg = pipeline.fit(train_data).transform(test_data).cache()

result_logreg_df = result_logreg.select('category','prediction.result').toPandas()
result_logreg_df['result'] = result_logreg_df.result.apply(lambda x : x[0])

print (classification_report(result_logreg_df['category'], result_logreg_df['result']))

              precision    recall  f1-score   support

         neg       0.89      0.91      0.90      5027
         pos       0.76      0.73      0.75      2025

    accuracy                           0.86      7052
   macro avg       0.83      0.82      0.82      7052
weighted avg       0.86      0.86      0.86      7052



### `setMinTokenNgram()` and `setMaxTokenNgram()`
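These two parameters bound the length, in tokens, of the n-grams that are built from each document. The sketch below shows what a range of 1 to 3 means for a short, hand-tokenized sentence. `token_ngrams` is a hypothetical helper written for illustration; how the annotator builds its features internally is not shown in this notebook.

```python
def token_ngrams(tokens, min_n, max_n):
    """Return all contiguous n-grams with min_n <= n <= max_n, joined by spaces."""
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams

# Hand-tokenized example sentence (punctuation omitted for brevity).
tokens = ["Intravenous", "azithromycin-induced", "ototoxicity"]
features = token_ngrams(tokens, min_n=1, max_n=3)

for feature in features:
    print(feature)
```

Widening the range adds longer phrases as candidate features, which grows the vocabulary and can increase training time.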

In [None]:
document = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

token = nlp.Tokenizer()\
    .setInputCols("document")\
    .setOutputCol("token")

classifier_logreg = medical.DocumentMLClassifierApproach() \
    .setInputCols("token") \
    .setLabelCol("category") \
    .setOutputCol("prediction") \
    .setClassificationModelClass("logreg")\
    .setMinTokenNgram(1)\
    .setMaxTokenNgram(3)

pipeline = nlp.Pipeline(stages=[document, token, classifier_logreg])

result_logreg = pipeline.fit(train_data).transform(test_data).cache()

result_logreg_pf = result_logreg.select('category','prediction.result').toPandas()
result_logreg_pf['result'] = result_logreg_pf.result.apply(lambda x : x[0])

print (classification_report(result_logreg_pf['category'], result_logreg_pf['result']))

              precision    recall  f1-score   support

         neg       0.89      0.91      0.90      5027
         pos       0.76      0.73      0.75      2025

    accuracy                           0.86      7052
   macro avg       0.83      0.82      0.82      7052
weighted avg       0.86      0.86      0.86      7052



### `setMergeChunks()`

Sets whether to merge all chunks in a document (Default: false).

In [None]:
document = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

token = nlp.Tokenizer()\
    .setInputCols("document")\
    .setOutputCol("token")

classifier_logreg = medical.DocumentMLClassifierApproach() \
    .setInputCols("token") \
    .setLabelCol("category") \
    .setOutputCol("prediction") \
    .setClassificationModelClass("logreg")\
    .setMergeChunks(True)

pipeline = nlp.Pipeline(stages=[document, token, classifier_logreg])

result_logreg = pipeline.fit(train_data).transform(test_data).cache()

result_logreg_pf = result_logreg.select('category','prediction.result').toPandas()
result_logreg_pf['result'] = result_logreg_pf.result.apply(lambda x : x[0])

print (classification_report(result_logreg_pf['category'], result_logreg_pf['result']))

              precision    recall  f1-score   support

         neg       0.89      0.91      0.90      5027
         pos       0.76      0.73      0.75      2025

    accuracy                           0.86      7052
   macro avg       0.83      0.82      0.82      7052
weighted avg       0.86      0.86      0.86      7052



### `setFitIntercept()`



Sets whether to fit an intercept term (Default: true).
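In a linear classifier the intercept is the constant added to the weighted sum of features, so it lets the model express a preference for one class even when no feature fires. The sketch below uses a logistic scoring function to illustrate this. `predict_proba`, the weights, and the choice of intercept are assumptions for illustration only; a fitted model would learn its own values.

```python
import math

def predict_proba(features, weights, intercept=0.0):
    """Logistic probability of the positive class for one feature vector."""
    score = sum(w * x for w, x in zip(weights, features)) + intercept
    return 1.0 / (1.0 + math.exp(-score))

weights = [1.5, -2.0]   # hypothetical feature weights
no_signal = [0.0, 0.0]  # a document in which no feature is active

# Without an intercept, a document with no signal sits exactly on the boundary.
p_without = predict_proba(no_signal, weights, intercept=0.0)

# Illustrative intercept: the log-odds of the class counts reported earlier
# in this notebook (6821 pos vs. 16695 neg).
prior_log_odds = math.log(6821 / 16695)
p_with = predict_proba(no_signal, weights, intercept=prior_log_odds)

print(f"P(pos) without intercept: {p_without:.3f}")
print(f"P(pos) with intercept:    {p_with:.3f}")
```

With `fitIntercept` disabled, the model has to encode any such class preference through the feature weights alone, which can shift the balance between precision and recall.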

In [None]:
document = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

token = nlp.Tokenizer()\
    .setInputCols("document")\
    .setOutputCol("token")

classifier_logreg = medical.DocumentMLClassifierApproach() \
    .setInputCols("token") \
    .setLabelCol("category") \
    .setOutputCol("prediction") \
    .setClassificationModelClass("logreg")\
    .setFitIntercept(False)

pipeline = nlp.Pipeline(stages=[document, token, classifier_logreg])

result_logreg = pipeline.fit(train_data).transform(test_data).cache()

result_logreg_pf = result_logreg.select('category','prediction.result').toPandas()
result_logreg_pf['result'] = result_logreg_pf.result.apply(lambda x : x[0])

print (classification_report(result_logreg_pf['category'], result_logreg_pf['result']))

              precision    recall  f1-score   support

         neg       0.87      0.97      0.92      5027
         pos       0.89      0.66      0.76      2025

    accuracy                           0.88      7052
   macro avg       0.88      0.81      0.84      7052
weighted avg       0.88      0.88      0.87      7052

