![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# **DocumentMLClassifierApproach**

This notebook will cover the different parameters and usages of `DocumentMLClassifierApproach`.

**📖 Learning Objectives:**

1. Understand how to train a model to classify documents with a Logistic Regression algorithm. Training data requires a column for the text and a column for its label. The result is a trained `DocumentMLClassifierModel`.

2. Become comfortable using the different parameters of the annotator.


**🔗 Helpful Links:**

- Python Docs : [DocumentMLClassifierApproach](https://nlp.johnsnowlabs.com/licensed/api/python/reference/autosummary/sparknlp_jsl/annotator/classification/document_ml_classifier/index.html)

- Scala Docs : [DocumentMLClassifierApproach](https://nlp.johnsnowlabs.com/licensed/api/com/johnsnowlabs/nlp/annotators/classification/DocumentMLClassifierApproach.html)

- For extended examples of usage, see the [Spark NLP Workshop repository](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/tutorials/Certification_Trainings/Healthcare).

## **🎬 Colab Setup**

In [None]:
!pip install -q johnsnowlabs

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp

nlp.install()

In [4]:
from johnsnowlabs import nlp, medical
import pyspark.sql.functions as F
import pandas as pd

spark = nlp.start()

👌 Detected license file /content/5.1.1.spark_nlp_for_healthcare.json
👌 Launched cpu optimized session with: 🚀Spark-NLP==5.1.1, 💊Spark-Healthcare==5.1.1, running on ⚡ PySpark==3.1.2


## **🖨️ Input/Output Annotation Types**

- Input: `TOKEN`

- Output: `CATEGORY`

## **🔎 Parameters**


- `labelCol`: (str) Sets column with the value result we are trying to predict.

- `maxIter`: (int) Sets the maximum number of iterations.

- `tol`: (float) Sets convergence tolerance after each iteration.

- `fitIntercept`: (bool) Sets whether to fit an intercept term (default: true).

- `vectorizationModelPath`: (str) Sets a path to the vectorization model if it has already been trained.

- `classificationModelPath`: (str) Sets a path to the classification model if it has been already trained.

- `classificationModelClass`: (str) Sets the classification model class from SparkML to use; possible values are `logreg` and `svm`.

- `minTokenNgram`: (int) Sets minimum number of tokens for Ngrams.

- `maxTokenNgram`: (int) Sets maximum number of tokens for Ngrams.

- `mergeChunks`: (bool) Sets whether to merge all chunks in a document or not (default: false).
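
To make `minTokenNgram` and `maxTokenNgram` more concrete, here is a minimal pure-Python sketch of what a token n-gram range means. The helper `token_ngrams` is hypothetical and is not part of Spark NLP; how the annotator builds its features internally is not shown in this notebook, so treat this only as an illustration of the concept.

```python
from typing import List


def token_ngrams(tokens: List[str], min_n: int, max_n: int) -> List[str]:
    """Return all contiguous token n-grams with min_n <= n <= max_n."""
    if min_n < 1 or max_n < min_n:
        raise ValueError("expected 1 <= min_n <= max_n")
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the sentence.
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


tokens = ["The", "pupils", "were", "pinpoint", "."]
unigrams_and_bigrams = token_ngrams(tokens, 1, 2)  # 5 unigrams + 4 bigrams
trigrams_only = token_ngrams(tokens, 3, 3)         # 3 trigrams

print(unigrams_and_bigrams)
print(trigrams_only)
```

Raising the minimum drops the shorter n-grams entirely, while raising the maximum adds longer ones on top of what is already there.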

## Prepare Data

In [None]:
#downloading sample datasets
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/ADE_Corpus_V2/ADE-NEG.txt
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/ADE_Corpus_V2/DRUG-AE.rel

In [None]:
df_neg= pd.read_csv("ADE-NEG.txt", header=None, delimiter="\t", names=["col1"])
df_neg['text'] =  df_neg.col1.str.split('NEG').str[1]
df_neg["category"] = "neg"
df_neg= df_neg[["text", "category"]]
df_neg.head()

Unnamed: 0,text,category
0,Clioquinol intoxication occurring in the trea...,neg
1,"""Retinoic acid syndrome"" was prevented with s...",neg
2,BACKGROUND: External beam radiation therapy o...,neg
3,"Although the enuresis ceased, she developed t...",neg
4,A 42-year-old woman had uneventful bilateral ...,neg
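
The cell above keeps element `[1]` of an unbounded split on `NEG`. As a hedged aside, the sketch below shows on a single made-up line (hypothetical, only in the style of `ADE-NEG.txt`) why limiting the split to one occurrence can be safer when the sentence itself contains the marker.

```python
# Hypothetical line: an id, the marker "NEG", then the sentence.
line = "0000000 NEG Sample sentence mentioning NEG again."

# An unbounded split cuts the sentence at the second "NEG" as well,
# so element [1] holds only the first fragment.
unbounded = line.split("NEG")[1]

# Splitting at most once keeps the whole sentence; strip() removes
# the space left over after the marker.
bounded = line.split("NEG", 1)[1].strip()

print(repr(unbounded))
print(repr(bounded))
```

In pandas, the equivalent of the bounded version would be `str.split('NEG', n=1).str[1].str.strip()`. The leading space that the current code leaves behind is visible on the `neg` rows in the Spark `show()` outputs later in this notebook.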


In [None]:
df_pos= pd.read_csv("DRUG-AE.rel", header=None, delimiter="|")
df_pos["category"]= "pos"
df_pos.rename(columns={1: "text"}, inplace=True)
df_pos= df_pos[["text", "category"]]
df_pos.head()

Unnamed: 0,text,category
0,Intravenous azithromycin-induced ototoxicity.,pos
1,"Immobilization, while Paget's bone disease was...",pos
2,Unaccountable severe hypercalcemia in a patien...,pos
3,METHODS: We report two cases of pseudoporphyri...,pos
4,METHODS: We report two cases of pseudoporphyri...,pos


In [None]:
ade_df= pd.concat([df_neg, df_pos]).sample(frac=1) #merge and shuffle the data
ade_df.head()

Unnamed: 0,text,category
1537,Erythema multiforme associated with phenytoin ...,pos
2472,The pupils were pinpoint.,neg
8053,Cross-reactivity between clindamycin and ampi...,neg
15362,The fourth was a 49-year-old female patient w...,neg
14401,DISCUSSION: Fifty percent of VPA is metaboliz...,neg


In [None]:
spark_df = spark.createDataFrame(ade_df)
spark_df.show(3)

+--------------------+--------+
|                text|category|
+--------------------+--------+
|Erythema multifor...|     pos|
| The pupils were ...|     neg|
| Cross-reactivity...|     neg|
+--------------------+--------+
only showing top 3 rows



In [None]:
train_data, test_data = spark_df.randomSplit([0.8, 0.2], seed = 100)

print("Train Dataset Count: " + str(train_data.count()))
print("Test Dataset Count: " + str(test_data.count()))

Train Dataset Count: 18855
Test Dataset Count: 4661
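
Note that the split is approximate: 18855 of 23516 rows is about 80.2%, not exactly 80%. The sketch below illustrates one way such behaviour arises, assuming each row is assigned by an independent seeded random draw. The function `approx_split` is hypothetical and is not Spark's implementation.

```python
import random


def approx_split(rows, train_fraction, seed):
    """Assign each row to train or test with an independent seeded draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test


rows = list(range(1000))
train, test = approx_split(rows, 0.8, seed=100)
train_again, test_again = approx_split(rows, 0.8, seed=100)

# Sizes are close to 800/200 but not guaranteed to match exactly;
# the same seed reproduces the same assignment.
print(len(train), len(test))
print(train == train_again)
```

Fixing the seed, as the notebook does with `seed = 100`, makes the comparison between the parameter settings below fair because every model sees the same train and test rows.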


### `setLabelCol()`

Sets the column holding the label we are trying to predict.

In [None]:
document = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

token = nlp.Tokenizer()\
    .setInputCols("document")\
    .setOutputCol("token")

stemmer = nlp.Stemmer() \
    .setInputCols(["token"]) \
    .setOutputCol("stem")

classifier = medical.DocumentMLClassifierApproach() \
    .setInputCols("stem") \
    .setLabelCol("category") \
    .setOutputCol("prediction")

pipeline = nlp.Pipeline(stages=[document, token, stemmer, classifier])

In [None]:
model = pipeline.fit(train_data)
result = model.transform(test_data).cache()
result.show()

+--------------------+--------+--------------------+--------------------+--------------------+--------------------+
|                text|category|            document|               token|                stem|          prediction|
+--------------------+--------+--------------------+--------------------+--------------------+--------------------+
| A 69-year-old ty...|     neg|[{document, 0, 72...|[{token, 1, 1, A,...|[{token, 1, 1, a,...|[{category, 1, 72...|
| A 78-year-old ma...|     neg|[{document, 0, 15...|[{token, 1, 1, A,...|[{token, 1, 1, a,...|[{category, 1, 15...|
| Four months afte...|     neg|[{document, 0, 12...|[{token, 1, 4, Fo...|[{token, 1, 4, fo...|[{category, 1, 12...|
| Radiation therap...|     neg|[{document, 0, 65...|[{token, 1, 9, Ra...|[{token, 1, 9, ra...|[{category, 1, 65...|
|Administration of...|     pos|[{document, 0, 11...|[{token, 0, 13, A...|[{token, 0, 13, a...|[{category, 0, 11...|
|Assessment of cor...|     pos|[{document, 0, 20...|[{token, 0, 9, As...

### `setClassificationModelClass()`



Sets the classification model class from SparkML to use; possible values are `logreg` and `svm`. The pipeline below trains the SVM classifier of `DocumentMLClassifierApproach` by calling `setClassificationModelClass("svm")`.

In [None]:
document = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

token = nlp.Tokenizer()\
    .setInputCols("document")\
    .setOutputCol("token")

stemmer = nlp.Stemmer() \
    .setInputCols(["token"]) \
    .setOutputCol("stem")

classifier_svm = medical.DocumentMLClassifierApproach() \
    .setInputCols("stem") \
    .setLabelCol("category") \
    .setOutputCol("prediction") \
    .setClassificationModelClass("svm")

pipeline = nlp.Pipeline(stages=[document, token, stemmer, classifier_svm])

In [None]:
svm_model = pipeline.fit(train_data)
result_svm = svm_model.transform(test_data).cache()
result_svm.show()

+--------------------+--------+--------------------+--------------------+--------------------+--------------------+
|                text|category|            document|               token|                stem|          prediction|
+--------------------+--------+--------------------+--------------------+--------------------+--------------------+
| A 69-year-old ty...|     neg|[{document, 0, 72...|[{token, 1, 1, A,...|[{token, 1, 1, a,...|[{category, 1, 72...|
| A 78-year-old ma...|     neg|[{document, 0, 15...|[{token, 1, 1, A,...|[{token, 1, 1, a,...|[{category, 1, 15...|
| Four months afte...|     neg|[{document, 0, 12...|[{token, 1, 4, Fo...|[{token, 1, 4, fo...|[{category, 1, 12...|
| Radiation therap...|     neg|[{document, 0, 65...|[{token, 1, 9, Ra...|[{token, 1, 9, ra...|[{category, 1, 65...|
|Administration of...|     pos|[{document, 0, 11...|[{token, 0, 13, A...|[{token, 0, 13, a...|[{category, 0, 11...|
|Assessment of cor...|     pos|[{document, 0, 20...|[{token, 0, 9, As...

In [None]:
result_svm.select("text","prediction").show(4, truncate=100)

+----------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------+
|                                                                                                text|                                                                      prediction|
+----------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------+
|                            A 69-year-old type 2 diabetic man was admitted due to diabetic gangrane.| [{category, 1, 72, neg, {sentence -> 0, chunk -> 0, confidence -> 0.8634}, []}]|
| A 78-year-old man with a long history of major depression responded well to a course of ECT but ...|[{category, 1, 152, neg, {sentence -> 0, chunk -> 0, confidence -> 0.9379}, []}]|
| Four months after cessation of treatment, a severe acne with comedones, papule
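
Each entry in the `prediction` column is an annotation carrying the label and a confidence score in its metadata. The sketch below rewrites the first annotation shown above as a plain Python dict to show how the two values could be pulled out. The key names and the string-typed metadata values are assumptions modelled on the usual Spark NLP annotation layout, and `label_and_confidence` is a hypothetical helper.

```python
# First prediction annotation from the output above, as a plain dict.
# Key names are assumed; the values are copied from the displayed row.
annotation = {
    "annotatorType": "category",
    "begin": 1,
    "end": 72,
    "result": "neg",
    "metadata": {"sentence": "0", "chunk": "0", "confidence": "0.8634"},
    "embeddings": [],
}


def label_and_confidence(ann):
    """Return the predicted label and its confidence as a float."""
    return ann["result"], float(ann["metadata"]["confidence"])


label, confidence = label_and_confidence(annotation)
print(label, confidence)
```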

In [None]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

In [None]:
result_svm_df = result_svm.select('category','prediction.result').toPandas()
result_svm_df['result'] = result_svm_df.result.apply(lambda x : x[0])

print (classification_report(result_svm_df['category'], result_svm_df['result']))

              precision    recall  f1-score   support

         neg       0.88      0.97      0.93      3312
         pos       0.91      0.69      0.79      1349

    accuracy                           0.89      4661
   macro avg       0.90      0.83      0.86      4661
weighted avg       0.89      0.89      0.89      4661
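
For readers less familiar with the report, the sketch below computes precision, recall, and F1 for one class by hand on a small made-up label list. The data and the helper `binary_report` are hypothetical; they are chosen to mimic the pattern above, where `pos` has high precision but lower recall.

```python
def binary_report(y_true, y_pred, positive):
    """Compute precision, recall and F1 for the class named `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy labels: 4 true "pos" sentences, of which the model finds only 2.
y_true = ["pos", "pos", "pos", "pos", "neg", "neg", "neg", "neg", "neg", "neg"]
y_pred = ["pos", "pos", "neg", "neg", "neg", "neg", "neg", "neg", "neg", "neg"]

precision, recall, f1 = binary_report(y_true, y_pred, positive="pos")
print(precision, recall, round(f1, 2))
```

Read this way, the `pos` recall of 0.69 in the report above means that roughly 31% of the sentences labelled `pos` in the test set were predicted as `neg`.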



### `setMaxIter()`

Sets the maximum number of iterations (Default: 10).

In [None]:
document = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

token = nlp.Tokenizer()\
    .setInputCols("document")\
    .setOutputCol("token")

classifier_logreg = medical.DocumentMLClassifierApproach() \
    .setInputCols("token") \
    .setLabelCol("category") \
    .setOutputCol("prediction") \
    .setClassificationModelClass("logreg")\
    .setMaxIter(5)

pipeline = nlp.Pipeline(stages=[document, token, classifier_logreg])

result_logreg = pipeline.fit(train_data).transform(test_data).cache()

result_logreg_df = result_logreg.select('category','prediction.result').toPandas()
result_logreg_df['result'] = result_logreg_df.result.apply(lambda x : x[0])

print (classification_report(result_logreg_df['category'], result_logreg_df['result']))

              precision    recall  f1-score   support

         neg       0.87      0.98      0.92      3312
         pos       0.94      0.65      0.77      1349

    accuracy                           0.89      4661
   macro avg       0.91      0.82      0.85      4661
weighted avg       0.89      0.89      0.88      4661



### `setTol()`

Sets the convergence tolerance checked after each iteration (Default: 1e-6).

In [None]:
document = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

token = nlp.Tokenizer()\
    .setInputCols("document")\
    .setOutputCol("token")

classifier_logreg = medical.DocumentMLClassifierApproach() \
    .setInputCols("token") \
    .setLabelCol("category") \
    .setOutputCol("prediction") \
    .setClassificationModelClass("logreg")\
    .setTol(1e-3)

pipeline = nlp.Pipeline(stages=[document, token, classifier_logreg])

result_logreg = pipeline.fit(train_data).transform(test_data).cache()

result_logreg_df = result_logreg.select('category','prediction.result').toPandas()
result_logreg_df['result'] = result_logreg_df.result.apply(lambda x : x[0])

print (classification_report(result_logreg_df['category'], result_logreg_df['result']))

              precision    recall  f1-score   support

         neg       0.88      0.98      0.93      3312
         pos       0.92      0.68      0.78      1349

    accuracy                           0.89      4661
   macro avg       0.90      0.83      0.86      4661
weighted avg       0.89      0.89      0.89      4661
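
To illustrate how a tolerance and an iteration cap interact as stopping criteria, here is a minimal sketch using plain gradient descent on a one-feature logistic regression. This is not the optimizer Spark ML uses; the function `fit_logreg_1d`, the learning rate, and the toy data are all assumptions made for the example.

```python
import math


def fit_logreg_1d(xs, ys, max_iter, tol, lr=0.5):
    """Gradient descent on 1-D logistic regression.

    Stops when the loss changes by less than `tol` between two
    iterations, or after `max_iter` iterations, whichever comes first.
    Returns (weight, bias, iterations_run).
    """
    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    def loss(w, b):
        total = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(w * x + b)
            total -= y * math.log(p) + (1 - y) * math.log(1 - p)
        return total / len(xs)

    w, b = 0.0, 0.0
    previous = loss(w, b)
    for iteration in range(1, max_iter + 1):
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys))
        grad_b = sum(sigmoid(w * x + b) - y for x, y in zip(xs, ys))
        w -= lr * grad_w / len(xs)
        b -= lr * grad_b / len(xs)
        current = loss(w, b)
        if abs(previous - current) < tol:
            return w, b, iteration
        previous = current
    return w, b, max_iter


# Toy, non-separable data so the weights stay bounded.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 1, 0, 1, 1]

loose = fit_logreg_1d(xs, ys, max_iter=1000, tol=1e-3)
tight = fit_logreg_1d(xs, ys, max_iter=1000, tol=1e-6)
capped = fit_logreg_1d(xs, ys, max_iter=5, tol=1e-6)

print("iterations:", loose[2], tight[2], capped[2])
```

A looser tolerance never needs more iterations than a tighter one, and a low iteration cap stops training regardless of the tolerance. On this dataset the reports for `setMaxIter(5)` and `setTol(1e-3)` differ only slightly.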



### `setMaxTokenNgram()`

Sets the maximum number of tokens for n-grams.

In [None]:
document = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

token = nlp.Tokenizer()\
    .setInputCols("document")\
    .setOutputCol("token")

classifier_logreg = medical.DocumentMLClassifierApproach() \
    .setInputCols("token") \
    .setLabelCol("category") \
    .setOutputCol("prediction") \
    .setClassificationModelClass("logreg")\
    .setMaxTokenNgram(2)

pipeline = nlp.Pipeline(stages=[document, token, classifier_logreg])

result_logreg = pipeline.fit(train_data).transform(test_data).cache()

result_logreg_pf = result_logreg.select('category','prediction.result').toPandas()
result_logreg_pf['result'] = result_logreg_pf.result.apply(lambda x : x[0])

print (classification_report(result_logreg_pf['category'], result_logreg_pf['result']))

              precision    recall  f1-score   support

         neg       0.89      0.96      0.93      3312
         pos       0.89      0.72      0.79      1349

    accuracy                           0.89      4661
   macro avg       0.89      0.84      0.86      4661
weighted avg       0.89      0.89      0.89      4661



### `setMinTokenNgram()`

Sets the minimum number of tokens for n-grams.

In [None]:
document = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

token = nlp.Tokenizer()\
    .setInputCols("document")\
    .setOutputCol("token")

classifier_logreg = medical.DocumentMLClassifierApproach() \
    .setInputCols("token") \
    .setLabelCol("category") \
    .setOutputCol("prediction") \
    .setClassificationModelClass("logreg")\
    .setMinTokenNgram(3)

pipeline = nlp.Pipeline(stages=[document, token, classifier_logreg])

result_logreg = pipeline.fit(train_data).transform(test_data).cache()

result_logreg_pf = result_logreg.select('category','prediction.result').toPandas()
result_logreg_pf['result'] = result_logreg_pf.result.apply(lambda x : x[0])

print (classification_report(result_logreg_pf['category'], result_logreg_pf['result']))

              precision    recall  f1-score   support

         neg       0.86      0.96      0.90      3312
         pos       0.85      0.61      0.71      1349

    accuracy                           0.85      4661
   macro avg       0.85      0.78      0.81      4661
weighted avg       0.85      0.85      0.85      4661



### `setMergeChunks()`

Sets whether to merge all chunks in a document or not (Default: false).

In [None]:
document = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

token = nlp.Tokenizer()\
    .setInputCols("document")\
    .setOutputCol("token")

classifier_logreg = medical.DocumentMLClassifierApproach() \
    .setInputCols("token") \
    .setLabelCol("category") \
    .setOutputCol("prediction") \
    .setClassificationModelClass("logreg")\
    .setMergeChunks(True)

pipeline = nlp.Pipeline(stages=[document, token, classifier_logreg])

result_logreg = pipeline.fit(train_data).transform(test_data).cache()

result_logreg_pf = result_logreg.select('category','prediction.result').toPandas()
result_logreg_pf['result'] = result_logreg_pf.result.apply(lambda x : x[0])

print (classification_report(result_logreg_pf['category'], result_logreg_pf['result']))

              precision    recall  f1-score   support

         neg       0.88      0.98      0.93      3312
         pos       0.92      0.68      0.78      1349

    accuracy                           0.89      4661
   macro avg       0.90      0.83      0.86      4661
weighted avg       0.89      0.89      0.89      4661



### `setFitIntercept()`



Sets whether to fit an intercept term (Default: true).

In [None]:
document = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

token = nlp.Tokenizer()\
    .setInputCols("document")\
    .setOutputCol("token")

classifier_logreg = medical.DocumentMLClassifierApproach() \
    .setInputCols("token") \
    .setLabelCol("category") \
    .setOutputCol("prediction") \
    .setClassificationModelClass("logreg")\
    .setFitIntercept(True)

pipeline = nlp.Pipeline(stages=[document, token, classifier_logreg])

result_logreg = pipeline.fit(train_data).transform(test_data).cache()

result_logreg_pf = result_logreg.select('category','prediction.result').toPandas()
result_logreg_pf['result'] = result_logreg_pf.result.apply(lambda x : x[0])

print (classification_report(result_logreg_pf['category'], result_logreg_pf['result']))

              precision    recall  f1-score   support

         neg       0.88      0.98      0.93      3312
         pos       0.92      0.68      0.78      1349

    accuracy                           0.89      4661
   macro avg       0.90      0.83      0.86      4661
weighted avg       0.89      0.89      0.89      4661
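
As a rough illustration of what the intercept term does in a logistic model, the sketch below scores a document with no active features. The weights, the intercept value, and the helper `predict_proba` are hypothetical and unrelated to the model trained above.

```python
import math


def predict_proba(features, weights, intercept=0.0):
    """Logistic model: sigmoid of the weighted sum plus an optional intercept."""
    z = sum(w * x for w, x in zip(weights, features)) + intercept
    return 1.0 / (1.0 + math.exp(-z))


weights = [1.2, -0.7]    # hypothetical learned weights
empty_doc = [0.0, 0.0]   # a document containing none of the known features

# Without an intercept the score is fixed at 0.5 for such a document.
without_intercept = predict_proba(empty_doc, weights)

# With an intercept the model can encode a baseline preference for one class.
with_intercept = predict_proba(empty_doc, weights, intercept=-1.0)

print(without_intercept, round(with_intercept, 4))
```

Since `True` is stated as the default, the cell above exercises the default behaviour, which is consistent with its report matching the `setTol()` run.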



`setClassificationModelPath()`: Sets the path to the classification model if it has already been trained.

`setVectorizationModelPath()`: Sets the path to the vectorization model if it has already been trained.

# **DocumentMLClassifierModel**

This section covers the usage of `DocumentMLClassifierModel`, the model produced by `DocumentMLClassifierApproach`, with a pretrained model.

**🔗 Helpful Links:**

- Python Docs : [DocumentMLClassifierModel](https://nlp.johnsnowlabs.com/licensed/api/python/reference/autosummary/sparknlp_jsl/annotator/classification/document_ml_classifier/index.html#sparknlp_jsl.annotator.classification.document_ml_classifier.DocumentMLClassifierModel)

- Scala Docs : [DocumentMLClassifierModel](https://nlp.johnsnowlabs.com/licensed/api/com/johnsnowlabs/nlp/annotators/classification/DocumentMLClassifierModel.html)

- For extended examples of usage, see the [Spark NLP Workshop repository](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/tutorials/Certification_Trainings/Healthcare).

## **🖨️ Input/Output Annotation Types**

- Input: `TOKEN`

- Output: `CATEGORY`

## Build a Pipeline

In [5]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = nlp.Tokenizer()\
    .setInputCols("document")\
    .setOutputCol("token")

classifier_ml = medical.DocumentMLClassifierModel.pretrained("classifierml_ade", "en", "clinical/models")\
    .setInputCols("token")\
    .setOutputCol("prediction")

clf_Pipeline = nlp.Pipeline(stages=[
    document_assembler,
    tokenizer,
    classifier_ml])

data = spark.createDataFrame([["""I feel great after taking tylenol."""], ["""Detection of activated eosinophils in nasal polyps of an aspirin-induced asthma patient."""]]).toDF("text")

result = clf_Pipeline.fit(data).transform(data)

classifierml_ade download started this may take some time.
[OK!]


In [7]:
result.show(truncate=False)

+----------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------+
|text                                                                                    |document  

In [10]:
result.select('text','prediction.result').show(truncate=False)

+----------------------------------------------------------------------------------------+-------+
|text                                                                                    |result |
+----------------------------------------------------------------------------------------+-------+
|Detection of activated eosinophils in nasal polyps of an aspirin-induced asthma patient.|[False]|
|I feel great after taking tylenol.                                                      |[False]|
+----------------------------------------------------------------------------------------+-------+

