![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/30.1.Text_Classification_with_DocumentMLClassifier.ipynb)

# 30.1 Text Classification with DocumentMLClassifier

## Colab Setup

In [None]:
import json
import os

from google.colab import files

if 'spark_jsl.json' not in os.listdir():
  license_keys = files.upload()
  os.rename(list(license_keys.keys())[0], 'spark_jsl.json')

with open('spark_jsl.json') as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)
os.environ.update(license_keys)

In [None]:
# Installing pyspark and spark-nlp
%pip install --upgrade -q pyspark==3.4.1 spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
%pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

In [3]:
import pandas as pd
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

from pyspark.ml import Pipeline,PipelineModel
from pyspark.sql import SparkSession

from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from sparknlp_jsl.base import *
from sparknlp.base import *
import sparknlp_jsl
import sparknlp

import warnings
warnings.filterwarnings('ignore')

params = {"spark.driver.memory":"12G",
          "spark.kryoserializer.buffer.max":"2000M",
          "spark.driver.maxResultSize":"2000M"}

print ("Spark NLP Version :", sparknlp.version())
print ("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark = sparknlp_jsl.start(license_keys['SECRET'],params=params)

spark

Spark NLP Version : 5.1.0
Spark NLP_JSL Version : 5.1.0


## 🔎 Models :


|index|model|
|-----:|:-----|
|1|[classifierml_ade](https://nlp.johnsnowlabs.com/2023/05/16/classifierml_ade_en.html)|


## Load ADE Classification Dataset

In [4]:
#downloading sample datasets
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/ADE_Corpus_V2/ADE-NEG.txt
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/ADE_Corpus_V2/DRUG-AE.rel

**ADE-Negative Dataset**

In [5]:
df_neg= pd.read_csv("ADE-NEG.txt", header=None, delimiter="\t", names=["col1"])
df_neg['text'] =  df_neg.col1.str.split('NEG').str[1]
df_neg["category"] = "neg"
df_neg= df_neg[["text", "category"]]
df_neg.head()

|index|text|category|
|-----:|:-----|:-----|
|0|Clioquinol intoxication occurring in the trea...|neg|
|1|"Retinoic acid syndrome" was prevented with s...|neg|
|2|BACKGROUND: External beam radiation therapy o...|neg|
|3|Although the enuresis ceased, she developed t...|neg|
|4|A 42-year-old woman had uneventful bilateral ...|neg|


**ADE-Positive Dataset**

In [6]:
df_pos= pd.read_csv("DRUG-AE.rel", header=None, delimiter="|")
df_pos["category"]= "pos"
df_pos.rename(columns={1: "text"}, inplace=True)
df_pos= df_pos[["text", "category"]]
df_pos.head()

|index|text|category|
|-----:|:-----|:-----|
|0|Intravenous azithromycin-induced ototoxicity.|pos|
|1|Immobilization, while Paget's bone disease was...|pos|
|2|Unaccountable severe hypercalcemia in a patien...|pos|
|3|METHODS: We report two cases of pseudoporphyri...|pos|
|4|METHODS: We report two cases of pseudoporphyri...|pos|


**Merging positive and negative datasets**

In [7]:
ade_df= pd.concat([df_neg, df_pos]).sample(frac=1) #merge and shuffle the data
ade_df.head()

|index|text|category|
|-----:|:-----|:-----|
|4164|Although this type of hyperpigmentation has be...|pos|
|5360|We report a 16-year-old male who developed nep...|pos|
|155|The literature on these associations is revie...|neg|
|6462|We report here the first case using posaconaz...|neg|
|7330|6-Thioguanine is being increasingly used in t...|neg|


In [8]:
ade_df["category"].value_counts()

neg    16695
pos     6821
Name: category, dtype: int64
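
The two classes are clearly imbalanced, which matters when reading the accuracy figures later in the notebook. As a rough reference point, the sketch below computes the accuracy of a trivial classifier that always predicts the majority class. The counts are copied from the output above; the variable names are illustrative and not part of the notebook.

```python
# Class counts copied from the value_counts() output above.
class_counts = {"neg": 16695, "pos": 6821}

total = sum(class_counts.values())
majority_label = max(class_counts, key=class_counts.get)

# Accuracy of a trivial classifier that always predicts the majority class.
baseline_accuracy = class_counts[majority_label] / total

# Ratio of the largest class to the smallest class.
imbalance_ratio = max(class_counts.values()) / min(class_counts.values())

print(f"total documents   : {total}")
print(f"majority class    : {majority_label}")
print(f"baseline accuracy : {baseline_accuracy:.3f}")
print(f"imbalance ratio   : {imbalance_ratio:.2f}")
```

Any trained model should be judged against this baseline rather than against 50%, and the per-class precision and recall for `pos` are usually more informative than overall accuracy.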

In [9]:
ade_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 23516 entries, 4164 to 3305
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   text      23516 non-null  object
 1   category  23516 non-null  object
dtypes: object(2)
memory usage: 551.2+ KB


In [10]:
spark_df = spark.createDataFrame(ade_df)
spark_df.head(3)

[Row(text='Although this type of hyperpigmentation has been previously seen in patients with cancer who are receiving bleomycin, this is, to our knowledge, the first reported case of bleomycin-induced hyperpigmentation in an AIDS patient and should be added to the growing list of cutaneous eruptions seen in these patients.', category='pos'),
 Row(text='We report a 16-year-old male who developed nephrotic syndrome related to membranous glomerulopathy with clinical and serological evidence of systemic lupus erythematosus after treatment with griseofulvin.', category='pos'),
 Row(text=' The literature on these associations is reviewed.', category='neg')]

In [11]:
spark_df.groupBy("category").count().show()
spark_df.printSchema()

+--------+-----+
|category|count|
+--------+-----+
|     pos| 6821|
|     neg|16695|
+--------+-----+

root
 |-- text: string (nullable = true)
 |-- category: string (nullable = true)



**Get train & test set**

In [12]:
train_data, test_data = spark_df.randomSplit([0.8, 0.2], seed = 100)

print("Train Dataset Count: " + str(train_data.count()))
print("Test Dataset Count: " + str(test_data.count()))

Train Dataset Count: 18793
Test Dataset Count: 4723
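
The counts are close to, but not exactly, an 80/20 split because `randomSplit` treats the weights as probabilities rather than exact quotas. The plain-Python sketch below illustrates that idea with an independent random draw per row. It is a conceptual stand-in, not Spark's actual sampler, so the exact counts it prints will differ from the ones above; `random_split` is a hypothetical helper.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to one bucket by an independent weighted random draw."""
    rng = random.Random(seed)
    total = sum(weights)

    # Cumulative upper bounds, e.g. [0.8, 1.0] for weights [0.8, 0.2].
    bounds = []
    running = 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)

    buckets = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for index, bound in enumerate(bounds):
            if draw < bound:
                buckets[index].append(row)
                break
        else:
            # Guard against floating-point rounding in the last bound.
            buckets[-1].append(row)
    return buckets


rows = list(range(23516))  # same size as the dataset above
train_rows, test_rows = random_split(rows, [0.8, 0.2], seed=100)

print("train:", len(train_rows))
print("test :", len(test_rows))
print("train fraction:", round(len(train_rows) / len(rows), 4))
```

Fixing the seed makes the split repeatable, which is what allows the classifiers in the following sections to be compared on the same test rows.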


## DocumentMLClassifier with Linear SVM

The pipeline below trains `DocumentMLClassifierApproach` with a linear SVM, selected by calling `setClassificationModelClass("svm")`. Tokens are stemmed before they are passed to the classifier.

In [13]:
document = DocumentAssembler().setInputCol("text").setOutputCol("document")

token = Tokenizer().setInputCols("document").setOutputCol("token")

stemmer = Stemmer() \
    .setInputCols(["token"]) \
    .setOutputCol("stem")

classifier_svm = DocumentMLClassifierApproach() \
    .setInputCols("stem") \
    .setLabelCol("category") \
    .setOutputCol("prediction") \
    .setClassificationModelClass("svm")

pipeline = Pipeline(stages=[document, token, stemmer, classifier_svm])

In [14]:
svm_model = pipeline.fit(train_data)
result_svm = svm_model.transform(test_data).cache()
result_svm.show()

+--------------------+--------+--------------------+--------------------+--------------------+--------------------+
|                text|category|            document|               token|                stem|          prediction|
+--------------------+--------+--------------------+--------------------+--------------------+--------------------+
| All-trans-retino...|     neg|[{document, 0, 93...|[{token, 1, 18, A...|[{token, 1, 18, a...|[{category, 1, 93...|
| Although radiati...|     neg|[{document, 0, 15...|[{token, 1, 8, Al...|[{token, 1, 8, al...|[{category, 1, 15...|
| Atypical antipsy...|     neg|[{document, 0, 11...|[{token, 1, 8, At...|[{token, 1, 8, at...|[{category, 1, 11...|
| These agents hav...|     neg|[{document, 0, 16...|[{token, 1, 5, Th...|[{token, 1, 5, th...|[{category, 1, 16...|
|The reported case...|     pos|[{document, 0, 29...|[{token, 0, 2, Th...|[{token, 0, 2, th...|[{category, 0, 29...|
|Transient left ho...|     pos|[{document, 0, 14...|[{token, 0, 8, Tr...

In [15]:
result_svm.printSchema()

root
 |-- text: string (nullable = true)
 |-- category: string (nullable = true)
 |-- document: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- token: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- va

In [16]:
result_svm.select("text","prediction").show(4, truncate=100)

+----------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------+
|                                                                                                text|                                                                      prediction|
+----------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------+
|       All-trans-retinoic acid is an effective induction treatment for acute promyelocytic leukemia.| [{category, 1, 93, neg, {sentence -> 0, chunk -> 0, confidence -> 0.8559}, []}]|
| Although radiation-induced laryngeal necrosis has become a rare complication, the combination of...|[{category, 1, 159, neg, {sentence -> 0, chunk -> 0, confidence -> 0.9202}, []}]|
| Atypical antipsychotic induced hypothermia is a rare adverse effect that may p
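
Each entry in the `prediction` column is an annotation whose `result` holds the predicted label and whose `metadata` map carries a `confidence` value. The sketch below shows one way to pull both out once a row has been collected to Python. The `prediction` list here is a hand-written stand-in that mirrors the fields visible in the output above, and `label_and_confidence` is a hypothetical helper, not part of the library.

```python
# Hand-written stand-in for one row of the `prediction` column, mirroring the
# fields shown in the output above. Metadata values are strings, as in the schema.
prediction = [
    {
        "annotatorType": "category",
        "begin": 1,
        "end": 93,
        "result": "neg",
        "metadata": {"sentence": "0", "chunk": "0", "confidence": "0.8559"},
        "embeddings": [],
    }
]


def label_and_confidence(annotations):
    """Return (label, confidence) from the first annotation, or (None, None)."""
    if not annotations:
        return None, None
    first = annotations[0]
    return first["result"], float(first["metadata"]["confidence"])


label, confidence = label_and_confidence(prediction)
print(label, confidence)
```

Converting the confidence to a float makes it possible to filter or sort predictions, for example to review low-confidence cases by hand.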

In [17]:
result_svm_df = result_svm.select('category','prediction.result').toPandas()
result_svm_df['result'] = result_svm_df.result.apply(lambda x : x[0])

print (classification_report(result_svm_df['category'], result_svm_df['result']))

              precision    recall  f1-score   support

         neg       0.90      0.91      0.90      3385
         pos       0.76      0.75      0.76      1338

    accuracy                           0.86      4723
   macro avg       0.83      0.83      0.83      4723
weighted avg       0.86      0.86      0.86      4723
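
For readers who want to see where the numbers in the report come from, the sketch below computes precision, recall and F1 for each class from scratch. The label lists are a tiny made-up example, not the notebook's predictions, and `per_class_metrics` is a hypothetical helper rather than a scikit-learn function.

```python
def per_class_metrics(y_true, y_pred, label):
    """Compute precision, recall, F1 and support for a single class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1, "support": tp + fn}


# Made-up toy labels, for illustration only.
y_true = ["neg", "neg", "neg", "neg", "pos", "pos", "pos", "neg"]
y_pred = ["neg", "neg", "pos", "neg", "pos", "neg", "pos", "neg"]

metrics = {label: per_class_metrics(y_true, y_pred, label) for label in ("neg", "pos")}
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

for label, values in metrics.items():
    print(label, {k: round(v, 2) for k, v in values.items()})
print("accuracy", accuracy)
```

In the report above, the lower scores for `pos` reflect that the minority class is harder to recover, which overall accuracy alone would hide.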



## DocumentMLClassifier with Logistic Regression

In [18]:
classifier_logreg = DocumentMLClassifierApproach() \
    .setInputCols("token") \
    .setLabelCol("category") \
    .setOutputCol("prediction") \
    .setClassificationModelClass("logreg")

pipeline = Pipeline(stages=[document, token, classifier_logreg])

logreg_model = pipeline.fit(train_data)
result_logreg = logreg_model.transform(test_data).cache()

result_logreg_df = result_logreg.select('category','prediction.result').toPandas()
result_logreg_df['result'] = result_logreg_df.result.apply(lambda x : x[0])

print (classification_report(result_logreg_df['category'], result_logreg_df['result']))

              precision    recall  f1-score   support

         neg       0.90      0.92      0.91      3385
         pos       0.79      0.75      0.77      1338

    accuracy                           0.87      4723
   macro avg       0.84      0.84      0.84      4723
weighted avg       0.87      0.87      0.87      4723



## Playing with Some Parameters

In [19]:
classifier_logreg.extractParamMap()

{Param(parent='DocumentMLClassifierApproach_767a0c1f3fd5', name='lazyAnnotator', doc='Whether this AnnotatorModel acts as lazy in RecursivePipelines'): False,
 Param(parent='DocumentMLClassifierApproach_767a0c1f3fd5', name='labelCol', doc='column with the value result we are trying to predict.'): 'category',
 Param(parent='DocumentMLClassifierApproach_767a0c1f3fd5', name='maxIter', doc='maximum number of iterations.'): 10,
 Param(parent='DocumentMLClassifierApproach_767a0c1f3fd5', name='tol', doc='convergence tolerance after each iteration.'): 1e-06,
 Param(parent='DocumentMLClassifierApproach_767a0c1f3fd5', name='fitIntercept', doc='whether to fit an intercept term, default is true.'): True,
 Param(parent='DocumentMLClassifierApproach_767a0c1f3fd5', name='vectorizationModelPath', doc='specify the vectorization model if it has been already trained.'): '',
 Param(parent='DocumentMLClassifierApproach_767a0c1f3fd5', name='classificationModelPath', doc='specify the classification model if 

**Change `maxIter` parameter with `setMaxIter()`**

In [20]:
classifier_logreg = DocumentMLClassifierApproach() \
    .setInputCols("token") \
    .setLabelCol("category") \
    .setOutputCol("prediction") \
    .setClassificationModelClass("logreg")\
    .setMaxIter(5)

pipeline = Pipeline(stages=[document, token, classifier_logreg])

result_logreg = pipeline.fit(train_data).transform(test_data).cache()

result_logreg_df = result_logreg.select('category','prediction.result').toPandas()
result_logreg_df['result'] = result_logreg_df.result.apply(lambda x : x[0])

print (classification_report(result_logreg_df['category'], result_logreg_df['result']))

              precision    recall  f1-score   support

         neg       0.89      0.95      0.92      3385
         pos       0.85      0.71      0.77      1338

    accuracy                           0.88      4723
   macro avg       0.87      0.83      0.85      4723
weighted avg       0.88      0.88      0.88      4723



**Change `tol` (tolerance) parameter with `setTol()`**

In [21]:
classifier_logreg = DocumentMLClassifierApproach() \
    .setInputCols("token") \
    .setLabelCol("category") \
    .setOutputCol("prediction") \
    .setClassificationModelClass("logreg")\
    .setTol(1e-3)

pipeline = Pipeline(stages=[document, token, classifier_logreg])

result_logreg = pipeline.fit(train_data).transform(test_data).cache()

result_logreg_df = result_logreg.select('category','prediction.result').toPandas()
result_logreg_df['result'] = result_logreg_df.result.apply(lambda x : x[0])

print (classification_report(result_logreg_df['category'], result_logreg_df['result']))

              precision    recall  f1-score   support

         neg       0.90      0.92      0.91      3385
         pos       0.79      0.75      0.77      1338

    accuracy                           0.87      4723
   macro avg       0.84      0.84      0.84      4723
weighted avg       0.87      0.87      0.87      4723



**Change `maxTokenNgram` parameter with `setMaxTokenNgram()`**

In [22]:
classifier_logreg = DocumentMLClassifierApproach() \
    .setInputCols("token") \
    .setLabelCol("category") \
    .setOutputCol("prediction") \
    .setClassificationModelClass("logreg")\
    .setMaxTokenNgram(2)

pipeline = Pipeline(stages=[document, token, classifier_logreg])

result_logreg = pipeline.fit(train_data).transform(test_data).cache()

result_logreg_df = result_logreg.select('category','prediction.result').toPandas()
result_logreg_df['result'] = result_logreg_df.result.apply(lambda x : x[0])

print (classification_report(result_logreg_df['category'], result_logreg_df['result']))

              precision    recall  f1-score   support

         neg       0.91      0.92      0.92      3385
         pos       0.79      0.78      0.79      1338

    accuracy                           0.88      4723
   macro avg       0.85      0.85      0.85      4723
weighted avg       0.88      0.88      0.88      4723
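
To give a feel for what a maximum token n-gram size of 2 means, the sketch below lists the unigrams and bigrams of a short token sequence. It illustrates the general n-gram idea only; how `DocumentMLClassifierApproach` builds its features internally is not shown in this notebook, and `token_ngrams` is a hypothetical helper.

```python
def token_ngrams(tokens, max_n):
    """Return all contiguous token n-grams of length 1 up to max_n."""
    grams = []
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


tokens = ["azithromycin", "induced", "ototoxicity"]

unigrams_only = token_ngrams(tokens, 1)
up_to_bigrams = token_ngrams(tokens, 2)

print(unigrams_only)
print(up_to_bigrams)
```

Bigrams such as "azithromycin induced" keep a little word-order information that single tokens lose, at the cost of a larger feature space.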



## Using More Annotators

In [23]:
document = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

token = Tokenizer() \
  .setInputCols(["document"]) \
  .setOutputCol("token")

normalizer = Normalizer() \
    .setInputCols(["token"]) \
    .setOutputCol("normalized")

stopwords_cleaner = StopWordsCleaner()\
      .setInputCols("normalized")\
      .setOutputCol("cleanTokens")\
      .setCaseSensitive(False)

stemmer = Stemmer() \
    .setInputCols(["cleanTokens"]) \
    .setOutputCol("stem")

classifier_svm = DocumentMLClassifierApproach() \
    .setInputCols("stem") \
    .setLabelCol("category") \
    .setOutputCol("prediction") \
    .setClassificationModelClass("svm")\
    .setMaxIter(5)

finisher = Finisher() \
    .setInputCols(["prediction"]) \
    .setOutputCols(["predictions"]) \
    .setCleanAnnotations(True)

pipeline = Pipeline(stages=[document, token, normalizer, stopwords_cleaner, stemmer, classifier_svm, finisher])

model= pipeline.fit(train_data)
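
The extra stages clean the tokens before they reach the classifier. The plain-Python sketch below loosely mirrors the intent of the normalization and stop-word steps on one sentence. The regular expression and the small stop-word set are assumptions made for illustration and do not reproduce the annotators' actual defaults; stemming is left out.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration.
STOP_WORDS = {"we", "two", "of", "by", "and"}


def normalize(tokens):
    """Strip non-letter characters and drop tokens that become empty."""
    cleaned = (re.sub(r"[^A-Za-z]", "", token) for token in tokens)
    return [token for token in cleaned if token]


def remove_stop_words(tokens, case_sensitive=False):
    """Drop stop words, optionally ignoring case when matching."""
    if case_sensitive:
        return [token for token in tokens if token not in STOP_WORDS]
    return [token for token in tokens if token.lower() not in STOP_WORDS]


tokens = ["We", "report", "two", "cases", "of", "pseudoporphyria",
          "caused", "by", "naproxen", "and", "oxaprozin", "."]

normalized = normalize(tokens)
clean_tokens = remove_stop_words(normalized, case_sensitive=False)
print(clean_tokens)
```

With case-insensitive matching, "We" is removed even though the list stores "we", which is the effect `setCaseSensitive(False)` is meant to have in the pipeline above.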

In [24]:
sample_df = spark.createDataFrame([["We report two cases of pseudoporphyria caused by naproxen and oxaprozin."],
                                   ["Special attention should be paid when attempting to sample the endometrium in patients with mullerian abnormalities."]]).toDF("text")

result = model.transform(sample_df)
result.show(truncate=False)

+--------------------------------------------------------------------------------------------------------------------+-----------+
|text                                                                                                                |predictions|
+--------------------------------------------------------------------------------------------------------------------+-----------+
|We report two cases of pseudoporphyria caused by naproxen and oxaprozin.                                            |[pos]      |
|Special attention should be paid when attempting to sample the endometrium in patients with mullerian abnormalities.|[neg]      |
+--------------------------------------------------------------------------------------------------------------------+-----------+

