![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/30.Clinical_Text_Classification_with_Spark_NLP.ipynb)

# Colab Setup

In [None]:
import json
import os

from google.colab import files

if 'spark_jsl.json' not in os.listdir():
  license_keys = files.upload()
  os.rename(list(license_keys.keys())[0], 'spark_jsl.json')

with open('spark_jsl.json') as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)
os.environ.update(license_keys)

In [None]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.4.1 spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

# Installing libcudnn library
! apt install -qq --allow-change-held-packages libcudnn8=8.1.0.77-1+cuda11.2 -y &> /dev/null

In [None]:
!pip install -q tensorflow==2.12.0
!pip install -q tensorflow-addons

❗ **PLEASE RE-START RUNTIME AND CONTINUE**

---



In [None]:
import json
import os

with open('spark_jsl.json') as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)
os.environ.update(license_keys)

In [None]:
import pandas as pd
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

import sparknlp
import sparknlp_jsl

from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp_jsl.base import *
from sparknlp_jsl.annotator import *

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline,PipelineModel

import warnings
warnings.filterwarnings('ignore')

params = {"spark.driver.memory":"24G",
          "spark.kryoserializer.buffer.max":"2000M",
          "spark.driver.maxResultSize":"2000M"}

spark = sparknlp_jsl.start(license_keys['SECRET'],params=params, gpu=True)

print ("Spark NLP Version :", sparknlp.version())
print ("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark

Spark NLP Version : 5.4.0
Spark NLP_JSL Version : 5.4.0


## Healthcare NLP for Data Scientists Course

If you are not familiar with the components in this notebook, see the [Healthcare NLP for Data Scientists Udemy Course](https://www.udemy.com/course/healthcare-nlp-for-data-scientists/) and the [MOOC Notebooks](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/Spark_NLP_Udemy_MOOC/Healthcare_NLP) for each component.

# Classifiers

The classifiers below are used in this notebook. ClassifierDL, MultiClassifierDL, and GenericClassifier will be trained using healthcare_100d, embeddings_clinical, and BERT sentence embeddings (sbiobert_base_cased_mli). DocumentLogRegClassifier accepts tokens, so sentence embeddings are not used when training it.

## ClassifierDL

ClassifierDL is a generic Multi-class Text Classification annotator. ClassifierDL uses the state-of-the-art Universal Sentence Encoder as an input for text classifications. The ClassifierDL annotator uses a deep learning model (DNNs) built inside TensorFlow and supports up to 100 classes. For more information please [follow the link](https://nlp.johnsnowlabs.com/docs/en/annotators#classifierdl).


##  MultiClassifierDL

MultiClassifierDL is a Multi-label Text Classification annotator. MultiClassifierDL uses a Bidirectional GRU with a convolutional model built inside TensorFlow and supports up to 100 classes. For more information please [follow the link](https://nlp.johnsnowlabs.com/docs/en/annotators#multiclassifierdl).

Here are some Multi-label Text Classification models trained with MultiClassifierDL:

|index|model|
|-----:|:-----|
|1|[multiclassifierdl_heart_disease_en](https://nlp.johnsnowlabs.com/2023/10/16/multiclassifierdl_heart_disease_en.html)|
|2|[multiclassifierdl_hoc_en](https://nlp.johnsnowlabs.com/2023/07/04/multiclassifierdl_hoc_en.html)|
|3|[multiclassifierdl_litcovid_en](https://nlp.johnsnowlabs.com/2023/07/04/multiclassifierdl_litcovid_en.html)|
|4|[multiclassifierdl_respiratory_disease_en](https://nlp.johnsnowlabs.com/2023/10/03/multiclassifierdl_respiratory_disease_en.html)|


## GenericClassifier

GenericClassifier is a TensorFlow model for the generic classification of feature vectors in the Healthcare Library. It takes FEATURE_VECTOR annotations from FeaturesAssembler as input, classifies them, and outputs CATEGORY annotations. Please see [the link](https://nlp.johnsnowlabs.com/docs/en/licensed_annotators#genericclassifier) for more information.

Here are some Social Determinants of Health (SDOH) models trained with GenericClassifier:

|index|model|
|-----:|:-----|
|1|[genericclassifier_sdoh_economics_binary_sbiobert_cased_mli](https://nlp.johnsnowlabs.com/2023/01/14/genericclassifier_sdoh_economics_binary_sbiobert_cased_mli_en.html)|
|2|[genericclassifier_sdoh_substance_usage_binary_sbiobert_cased_mli](https://nlp.johnsnowlabs.com/2023/01/14/genericclassifier_sdoh_substance_usage_binary_sbiobert_cased_mli_en.html)|
|3|[genericclassifier_sdoh_tobacco_usage_sbiobert_cased_mli](https://nlp.johnsnowlabs.com/2023/01/14/genericclassifier_sdoh_tobacco_usage_sbiobert_cased_mli_en.html)|
|4|[genericclassifier_sdoh_alcohol_usage_sbiobert_cased_mli](https://nlp.johnsnowlabs.com/2023/01/14/genericclassifier_sdoh_alcohol_usage_sbiobert_cased_mli_en.html)|
|5|[genericclassifier_sdoh_alcohol_usage_binary_sbiobert_cased_mli](https://nlp.johnsnowlabs.com/2023/01/14/genericclassifier_sdoh_alcohol_usage_binary_sbiobert_cased_mli_en.html)|
|6|[genericclassifier_sdoh_housing_insecurity_sbiobert_cased_mli](https://nlp.johnsnowlabs.com/2023/04/10/genericclassifier_sdoh_housing_insecurity_sbiobert_cased_mli_en.html)|
|7|[genericclassifier_sdoh_mental_health_clinical](https://nlp.johnsnowlabs.com/2023/04/10/genericclassifier_sdoh_mental_health_clinical_en.html)|
|8|[genericclassifier_sdoh_under_treatment_sbiobert_cased_mli](https://nlp.johnsnowlabs.com/2023/04/10/genericclassifier_sdoh_under_treatment_sbiobert_cased_mli_en.html)|
|9|[genericclassifier_patient_complaint_sbiobert_cased_mli](https://nlp.johnsnowlabs.com/2023/08/31/patient_complaint_classifier_generic_bert_M1_en.html)|
|10|[genericclassifier_age_group_sbiobert_cased_mli](https://nlp.johnsnowlabs.com/2023/08/16/genericclassifier_age_group_sbiobert_cased_mli_en.html)|

## GenericLogRegClassifier

`GenericLogRegClassifier` is a derivative of GenericClassifier which implements a multinomial *Logistic Regression*. This is a single layer neural network with the logistic function at the output. The input to the model is `FeatureVector` and the output is `Category` annotations with labels and corresponding confidence scores varying between 0 and 1. Training data requires "text" and their "label" columns only and the trained model will be a `GenericLogRegClassifierModel()`.

|index|model|
|-----:|:-----|
|1|[generic_logreg_classifier_ade](https://nlp.johnsnowlabs.com/2023/05/09/generic_logreg_classifier_ade_en.html)|
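To make the "single layer with a logistic output" description concrete, here is a minimal pure-Python sketch of that computation. It is not the library's implementation: the three-dimensional feature vector, the weights, and the biases are hypothetical values chosen for illustration, whereas the real model works on the assembled `FeatureVector` and learns its weights during training.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def logreg_scores(features, weights, biases):
    """One dense layer (no hidden layers) followed by a logistic output per label."""
    scores = {}
    for label, w in weights.items():
        z = sum(x * wi for x, wi in zip(features, w)) + biases[label]
        scores[label] = sigmoid(z)
    return scores


# Hypothetical feature vector and parameters, for illustration only.
features = [0.2, -0.1, 0.4]
weights = {"neg": [0.5, 0.3, -0.8], "pos": [-0.5, -0.3, 0.8]}
biases = {"neg": 0.1, "pos": -0.1}

scores = logreg_scores(features, weights, biases)
prediction = max(scores, key=scores.get)  # label with the highest confidence
print(scores, prediction)
```

Each label gets its own confidence between 0 and 1, and the predicted category is the label with the highest confidence.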



## GenericSVMClassifier

`GenericSVMClassifier` is a derivative of GenericClassifier which implements *SVM (Support Vector Machine)* classification. The input to the model is `FeatureVector` and the output is `Category` annotations with labels and corresponding confidence scores. The scores are standardized using the logistic function so that they vary between 0 and 1. Training data requires "text" and their "label" columns only and the trained model will be a `GenericSVMClassifierModel()`.

|index|model|
|-----:|:-----|
|1|[generic_svm_classifier_ade](https://nlp.johnsnowlabs.com/2023/05/09/generic_svm_classifier_ade_en.html)|
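The score standardization mentioned above can be illustrated with a few lines of plain Python. This is only a sketch of the logistic function applied to unbounded decision values. The margins below are hypothetical, and any additional scaling the library may apply before the logistic function is not shown here.

```python
import math


def squash(margin):
    """Map an unbounded SVM decision value into (0, 1) with the logistic function."""
    return 1.0 / (1.0 + math.exp(-margin))


# Hypothetical raw decision values for one document.
margins = {"neg": -1.3, "pos": 2.1}
confidences = {label: squash(m) for label, m in margins.items()}
print(confidences)
```

A margin of zero maps to 0.5, large positive margins approach 1, and large negative margins approach 0. The per-label scores are squashed independently, so they are not expected to sum to 1.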

## DocumentLogRegClassifier

DocumentLogRegClassifier is a model that classifies documents with a Logistic Regression algorithm in the Healthcare Library. Training data requires columns for text and labels. The result is a trained DocumentLogRegClassifierModel. You can get more info [here](https://nlp.johnsnowlabs.com/docs/en/licensed_annotators#documentlogregclassifier).

|index|model|
|-----:|:-----|
|1|[classifier_logreg_ade](https://nlp.johnsnowlabs.com/2023/05/16/classifier_logreg_ade_en.html)|


# ADE Dataset

### Data Preprocessing

In [None]:
#downloading sample datasets
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/ADE_Corpus_V2/ADE-NEG.txt
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/ADE_Corpus_V2/DRUG-AE.rel

**ADE Negative Dataset**

In [None]:
df_neg= pd.read_csv("ADE-NEG.txt", header=None, delimiter="\t", names=["col1"])
df_neg.head()

Unnamed: 0,col1
0,6460590 NEG Clioquinol intoxication occurring ...
1,"8600337 NEG ""Retinoic acid syndrome"" was preve..."
2,8402502 NEG BACKGROUND: External beam radiatio...
3,"8700794 NEG Although the enuresis ceased, she ..."
4,17662448 NEG A 42-year-old woman had uneventfu...


In [None]:
df_neg['text'] =  df_neg.col1.str.split('NEG').str[1]
df_neg["category"] = "neg"
df_neg= df_neg[["text", "category"]]
df_neg.head()

Unnamed: 0,text,category
0,Clioquinol intoxication occurring in the trea...,neg
1,"""Retinoic acid syndrome"" was prevented with s...",neg
2,BACKGROUND: External beam radiation therapy o...,neg
3,"Although the enuresis ceased, she developed t...",neg
4,A 42-year-old woman had uneventful bilateral ...,neg


**ADE Positive Dataset**

In [None]:
df_pos= pd.read_csv("DRUG-AE.rel", header=None, delimiter="|")
df_pos.head()

Unnamed: 0,0,1,2,3,4,5,6,7
0,10030778,Intravenous azithromycin-induced ototoxicity.,ototoxicity,43,54,azithromycin,22,34
1,10048291,"Immobilization, while Paget's bone disease was...",increased calcium-release,960,985,dihydrotachysterol,908,926
2,10048291,Unaccountable severe hypercalcemia in a patien...,hypercalcemia,31,44,dihydrotachysterol,94,112
3,10082597,METHODS: We report two cases of pseudoporphyri...,pseudoporphyria,620,635,naproxen,646,654
4,10082597,METHODS: We report two cases of pseudoporphyri...,pseudoporphyria,620,635,oxaprozin,659,668


In [None]:
df_pos["category"]= "pos"
df_pos.rename(columns={1: "text"}, inplace=True)
df_pos= df_pos[["text", "category"]]
df_pos.head()

Unnamed: 0,text,category
0,Intravenous azithromycin-induced ototoxicity.,pos
1,"Immobilization, while Paget's bone disease was...",pos
2,Unaccountable severe hypercalcemia in a patien...,pos
3,METHODS: We report two cases of pseudoporphyri...,pos
4,METHODS: We report two cases of pseudoporphyri...,pos


**Merging Positive and Negative dataset**

In [None]:
ade_df= pd.concat([df_neg, df_pos])
ade_df.head()

Unnamed: 0,text,category
0,Clioquinol intoxication occurring in the trea...,neg
1,"""Retinoic acid syndrome"" was prevented with s...",neg
2,BACKGROUND: External beam radiation therapy o...,neg
3,"Although the enuresis ceased, she developed t...",neg
4,A 42-year-old woman had uneventful bilateral ...,neg


In [None]:
ade_df["category"].value_counts()

category
neg    16695
pos     6821
Name: count, dtype: int64

In [None]:
ade_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 23516 entries, 0 to 6820
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   text      23516 non-null  object
 1   category  23516 non-null  object
dtypes: object(2)
memory usage: 551.2+ KB


We sample roughly 30% of the data so the notebook runs faster. You can use all of the data for better scores.

In [None]:
spark_df = spark.createDataFrame(ade_df).sample(0.3, 3) # limit the data

trainingData, testData = spark_df.randomSplit([0.8, 0.2], seed = 100)

print("Training Dataset Count: " + str(trainingData.count()))
print("Test Dataset Count: " + str(testData.count()))

Training Dataset Count: 5652
Test Dataset Count: 1436


In [None]:
spark_df.groupBy("category").count().show()

+--------+-----+
|category|count|
+--------+-----+
|     neg| 5023|
|     pos| 2065|
+--------+-----+
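The counts above show that the sample is imbalanced. As a quick sanity check (an addition to the notebook, using only the numbers printed above), the class shares and the accuracy of a trivial "always predict `neg`" baseline can be computed as follows:

```python
# Class counts copied from the groupBy output above.
counts = {"neg": 5023, "pos": 2065}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a classifier that always predicts the majority class.
majority_baseline = max(shares.values())

print(total)
print({label: round(share, 3) for label, share in shares.items()})
print(round(majority_baseline, 3))
```

Because about 71% of the rows are `neg`, accuracy alone can look good even for a weak model, which is one reason to also look at the F1 score of the `pos` class.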



In [None]:
spark_df.printSchema()

root
 |-- text: string (nullable = true)
 |-- category: string (nullable = true)



In [None]:
spark_df.head(3)

[Row(text=' Clioquinol intoxication occurring in the treatment of acrodermatitis enteropathica with reference to SMON outside of Japan.', category='neg'),
 Row(text=' A 42-year-old woman had uneventful bilateral laser-assisted subepithelial keratectomy (LASEK) to correct myopia.', category='neg'),
 Row(text=' A 16-year-old girl with erosive, polyarticular JRA showed no detectable change in her articular disease following nine exchanges.', category='neg')]

## 100 Dimension Healthcare Embeddings (embeddings_healthcare_100d)



Now we will extract [healthcare_100d embeddings](https://nlp.johnsnowlabs.com/2020/05/29/embeddings_healthcare_100d_en.html) and use them in the classification model training.

In [None]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_healthcare_100d","en","clinical/models")\
    .setInputCols(["document","token"])\
    .setOutputCol("word_embeddings")

sentence_embeddings = SentenceEmbeddings() \
    .setInputCols(["document", "word_embeddings"]) \
    .setOutputCol("sentence_embeddings") \
    .setPoolingStrategy("AVERAGE")

embeddings_pipeline = Pipeline(
    stages = [
        document_assembler,
        tokenizer,
        word_embeddings,
        sentence_embeddings,

])

embeddings_healthcare_100d download started this may take some time.
Approximate size to download 475.8 MB
[OK!]
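The `SentenceEmbeddings` stage above is configured with the `AVERAGE` pooling strategy, which turns the per-token word vectors of a document into one fixed-size vector. The sketch below shows the idea in plain Python. The four-dimensional toy vectors are hypothetical (the real embeddings here have 100 dimensions), and this is an illustration of mean pooling rather than the annotator's actual code.

```python
def average_pool(token_vectors):
    """Element-wise mean of equal-length token vectors."""
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]


# Hypothetical word vectors for a three-token document.
token_vectors = [
    [1.0, 0.0, 2.0, -1.0],
    [3.0, 2.0, 0.0, 1.0],
    [2.0, 4.0, 1.0, 0.0],
]

doc_vector = average_pool(token_vectors)
print(doc_vector)  # [2.0, 2.0, 1.0, 0.0]
```

Whatever the number of tokens, the pooled vector has the same dimension as a single word vector, which is what the downstream classifiers expect.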


In [None]:
trainingData_with_embeddings = embeddings_pipeline.fit(trainingData).transform(trainingData)
trainingData_with_embeddings = trainingData_with_embeddings.select("text","category","sentence_embeddings")
trainingData_with_embeddings.show(2,truncate=60)

+------------------------------------------------------------+--------+------------------------------------------------------------+
|                                                        text|category|                                         sentence_embeddings|
+------------------------------------------------------------+--------+------------------------------------------------------------+
| A 16-year-old girl with erosive, polyarticular JRA showe...|     neg|[{sentence_embeddings, 0, 129,  A 16-year-old girl with e...|
| A 42-year-old woman had uneventful bilateral laser-assis...|     neg|[{sentence_embeddings, 0, 112,  A 42-year-old woman had u...|
+------------------------------------------------------------+--------+------------------------------------------------------------+
only showing top 2 rows



In [None]:
testData_with_embeddings = embeddings_pipeline.fit(testData).transform(testData)
testData_with_embeddings = testData_with_embeddings.select("text","category","sentence_embeddings")
testData_with_embeddings.show(2,truncate=60)

+------------------------------------------------------------+--------+------------------------------------------------------------+
|                                                        text|category|                                         sentence_embeddings|
+------------------------------------------------------------+--------+------------------------------------------------------------+
| Air conditioning and proper ventilation of the operating...|     neg|[{sentence_embeddings, 0, 155,  Air conditioning and prop...|
| Biological data mainly revealed hepatic failure and lact...|     neg|[{sentence_embeddings, 0, 68,  Biological data mainly rev...|
+------------------------------------------------------------+--------+------------------------------------------------------------+
only showing top 2 rows



In [None]:
testData_with_embeddings.printSchema()

root
 |-- text: string (nullable = true)
 |-- category: string (nullable = true)
 |-- sentence_embeddings: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)



In [None]:
testData_with_embeddings.select(testData_with_embeddings.sentence_embeddings.embeddings).show(3,truncate=120)

+------------------------------------------------------------------------------------------------------------------------+
|                                                                                          sentence_embeddings.embeddings|
+------------------------------------------------------------------------------------------------------------------------+
|[[0.023146994, -0.048840538, 0.2768485, 0.09796163, -0.13117273, -0.0791464, 0.068399966, -0.04242571, -0.025790978, ...|
|[[0.11048286, 0.07001462, 0.14409271, 0.16125715, -0.05150216, 0.051806085, 0.08491022, -0.06517192, 0.07884913, -0.0...|
|[[-0.049716324, 0.019244563, 0.07450515, 0.11221783, -0.12938853, 0.026980795, 0.025576496, -0.060712427, 0.044161126...|
+------------------------------------------------------------------------------------------------------------------------+
only showing top 3 rows



In [None]:
log_folder="ADE_logs_healthcare_100d"
!mkdir -p $log_folder

### ClassifierDL

In [None]:
classifier_dl = ClassifierDLApproach()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("prediction")\
    .setLabelColumn("category")\
    .setBatchSize(16)\
    .setMaxEpochs(30)\
    .setLr(0.002)\
    .setDropout(0.3)\
    .setEnableOutputLogs(True)\
    .setOutputLogsPath(log_folder)\
    # .setValidationSplit(0.1)

classifier_dl_pipeline = Pipeline(
    stages = [
        classifier_dl
])

In [None]:
clfDL_model_hc100 = classifier_dl_pipeline.fit(trainingData_with_embeddings)

In [None]:
!cat $log_folder/ClassifierDLApproach_*

Training started - epochs: 30 - learning_rate: 0.002 - batch_size: 16 - training_examples: 5652 - classes: 2
Epoch 0/30 - 2.12s - loss: 195.05273 - acc: 0.75637394 - batches: 354
Epoch 1/30 - 1.48s - loss: 173.61128 - acc: 0.8082507 - batches: 354
Epoch 2/30 - 1.46s - loss: 169.4416 - acc: 0.8284348 - batches: 354
Epoch 3/30 - 1.46s - loss: 166.92075 - acc: 0.8413598 - batches: 354
Epoch 4/30 - 1.46s - loss: 165.17186 - acc: 0.85216004 - batches: 354
Epoch 5/30 - 1.45s - loss: 163.48428 - acc: 0.8569405 - batches: 354
Epoch 6/30 - 1.44s - loss: 162.08125 - acc: 0.86331445 - batches: 354
Epoch 7/30 - 1.48s - loss: 160.86395 - acc: 0.8700425 - batches: 354
Epoch 8/30 - 1.45s - loss: 159.82391 - acc: 0.8742918 - batches: 354
Epoch 9/30 - 1.47s - loss: 158.91988 - acc: 0.87712467 - batches: 354
Epoch 10/30 - 1.45s - loss: 158.10605 - acc: 0.8796034 - batches: 354
Epoch 11/30 - 1.46s - loss: 157.3355 - acc: 0.88172805 - batches: 354
Epoch 12/30 - 1.45s - loss: 156.63329 - acc: 0.8850921 - b

In [None]:
preds = clfDL_model_hc100.transform(testData_with_embeddings)

In [None]:
preds.printSchema()
preds.select(preds.prediction).show(5, truncate=False)
preds.select(preds.category, preds.prediction.result).show(5, truncate=False)

root
 |-- text: string (nullable = true)
 |-- category: string (nullable = true)
 |-- sentence_embeddings: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- prediction: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    

In [None]:
preds_df = preds.select("category","text","prediction.result").toPandas()
preds_df["result"] = preds_df["result"].apply(lambda x : x[0])

print (classification_report(preds_df["category"], preds_df["result"]))

              precision    recall  f1-score   support

         neg       0.88      0.91      0.90      1028
         pos       0.76      0.69      0.72       408

    accuracy                           0.85      1436
   macro avg       0.82      0.80      0.81      1436
weighted avg       0.85      0.85      0.85      1436



In [None]:
res = classification_report(preds_df["category"], preds_df["result"], output_dict=True)
results_df = pd.DataFrame(columns=["pos-f1-score","accuracy"])
results_df.loc["ClassifierDL_100d"] = [res["pos"]["f1-score"], res["accuracy"]]
results_df

Unnamed: 0,pos-f1-score,accuracy
ClassifierDL_100d,0.723295,0.850279
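The `pos-f1-score` column is the harmonic mean of precision and recall for the `pos` class. As a small worked example (an addition for illustration), the value can be approximated from the rounded precision and recall printed in the classification report above. Because the inputs are rounded to two decimals, the result only approximates the exact figure in the table.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded values for the "pos" class, taken from the report above.
pos_f1 = f1_score(0.76, 0.69)
print(round(pos_f1, 2))  # 0.72
```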


### MultiClassifierDL

We will use MultiClassifierDL, which is built with a Bidirectional GRU and CNNs inside TensorFlow and supports up to 100 classes. It is designed for multi-label classification. Here we will use it as a binary classifier.

In [None]:
from pyspark.sql import functions as F

# MultiClassifierDL expects the label column to be an array of strings, so we wrap each label in an array.
trainingData_with_embeddings = trainingData_with_embeddings.withColumn("category_array", F.array(F.col("category")))
testData_with_embeddings = testData_with_embeddings.withColumn("category_array", F.array(F.col("category")))

In [None]:
multiClassifier = MultiClassifierDLApproach()\
    .setInputCols("sentence_embeddings")\
    .setOutputCol("prediction")\
    .setLabelColumn("category_array")\
    .setBatchSize(16)\
    .setMaxEpochs(15)\
    .setLr(9e-3)\
    .setThreshold(0.5)\
    .setShufflePerEpoch(False)\
    .setEnableOutputLogs(True)\
    .setOutputLogsPath(log_folder)\
  #   .setValidationSplit(0.1)

multiClassifier_pipeline = Pipeline(
    stages = [
        multiClassifier
])

In [None]:
multiClassifier_model_hc100 = multiClassifier_pipeline.fit(trainingData_with_embeddings)

In [None]:
!cat $log_folder/MultiClassifierDLApproach_*

Training started - epochs: 15 - learning_rate: 0.009 - batch_size: 16 - training_examples: 5652 - classes: 2
Epoch 0/15 - 7.98s - loss: 0.5103771 - acc: 0.7445113 - batches: 354
Epoch 1/15 - 3.59s - loss: 0.4528057 - acc: 0.7809844 - batches: 354
Epoch 2/15 - 3.71s - loss: 0.42339063 - acc: 0.8026735 - batches: 354
Epoch 3/15 - 3.73s - loss: 0.4023571 - acc: 0.81303114 - batches: 354
Epoch 4/15 - 3.67s - loss: 0.38445872 - acc: 0.8226806 - batches: 354
Epoch 5/15 - 3.42s - loss: 0.3669401 - acc: 0.8327727 - batches: 354
Epoch 6/15 - 3.48s - loss: 0.35005027 - acc: 0.84233356 - batches: 354
Epoch 7/15 - 3.60s - loss: 0.33449456 - acc: 0.84817636 - batches: 354
Epoch 8/15 - 3.69s - loss: 0.31806666 - acc: 0.85578966 - batches: 354
Epoch 9/15 - 3.73s - loss: 0.3028526 - acc: 0.8666785 - batches: 354
Epoch 10/15 - 3.70s - loss: 0.28788775 - acc: 0.8720786 - batches: 354
Epoch 11/15 - 3.82s - loss: 0.27171096 - acc: 0.8838527 - batches: 354
Epoch 12/15 - 3.83s - loss: 0.2568614 - acc: 0.890

In [None]:
preds = multiClassifier_model_hc100.transform(testData_with_embeddings)

In [None]:
preds.printSchema()
preds.select(preds.prediction).show(5, truncate=False)
preds.select(preds.category_array, preds.prediction.result).show(5, truncate=False)

root
 |-- text: string (nullable = true)
 |-- category: string (nullable = true)
 |-- sentence_embeddings: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- category_array: array (nullable = false)
 |    |-- element: string (containsNull = true)
 |-- prediction: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable

In [None]:
preds_df = preds.select("category","text","prediction.result","prediction.metadata").toPandas()
preds_df.result.apply(lambda x: str(x) ).value_counts()

result
['neg']    1063
['pos']     373
Name: count, dtype: int64

In [None]:
preds_df[preds_df.result.apply(len)==2]

Unnamed: 0,category,text,result,metadata


In [None]:
preds_df[preds_df.result.apply(len)==0]

Unnamed: 0,category,text,result,metadata


MultiClassifierDL is a multi-label classifier, so some predictions may include both labels or none of the labels. This can be controlled to some extent with the `.setThreshold()` parameter during training. For now we will not keep zero-label predictions and will instead take the label with the highest score as the prediction.
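To see why a thresholded multi-label output can contain zero, one, or two labels, consider the following plain-Python sketch. The scores are hypothetical, and the `>=` comparison is an assumption made for illustration. The library's exact comparison rule is not shown in this notebook.

```python
def labels_above_threshold(scores, threshold=0.5):
    """Keep every label whose score reaches the threshold (assumed rule)."""
    return [label for label, s in scores.items() if s >= threshold]


def top_label(scores):
    """Fallback used in this notebook: always pick the highest-scoring label."""
    return max(scores, key=scores.get)


# Hypothetical per-label scores for three documents.
examples = [
    {"neg": 0.91, "pos": 0.07},  # exactly one label passes
    {"neg": 0.62, "pos": 0.58},  # both labels pass
    {"neg": 0.44, "pos": 0.31},  # no label passes
]

for scores in examples:
    print(labels_above_threshold(scores), "->", top_label(scores))
```

Taking the highest-scoring label turns every row into a single-label prediction, which is what a binary evaluation with `classification_report` needs.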

In [None]:
# We will get the highest score label as result. You can control the number of zero label results with setThreshold() in the training.
preds_df["scores"] = preds_df.metadata.apply(lambda x: {k:float(v) for k,v in x[0].items()} if len(x)>=1 else "")
preds_df["result"] = preds_df.scores.apply(lambda x: max(x, key=x.get) if len(x)>=1 else "")
preds_df["result"].value_counts()

result
neg    1063
pos     373
Name: count, dtype: int64

In [None]:
res = classification_report(preds_df["category"], preds_df["result"], output_dict=True)
results_df.loc["MultiClassifierDL_100d"] = [res["pos"]["f1-score"], res["accuracy"]]

print (classification_report(preds_df["category"], preds_df["result"]))

              precision    recall  f1-score   support

         neg       0.88      0.91      0.89      1028
         pos       0.74      0.68      0.71       408

    accuracy                           0.84      1436
   macro avg       0.81      0.79      0.80      1436
weighted avg       0.84      0.84      0.84      1436



### Generic Classifier

In [None]:
from sparknlp_jsl.annotator import TFGraphBuilder

graph_folder = "gc_graph"

gc_graph_builder = TFGraphBuilder()\
    .setModelName("generic_classifier")\
    .setInputCols(["features"])\
    .setLabelColumn("category")\
    .setHiddenLayers([300,200, 50])\
    .setHiddenAct("tanh")\
    .setHiddenActL2(True)\
    .setBatchNorm(True)\
    .setGraphFolder(graph_folder)\
    .setGraphFile("gcf_graph_100d.pb")

GenericClassifier needs outputs from FeaturesAssembler. The FeaturesAssembler is used to collect features from different columns or an embeddings column.

The GenericClassifierApproach takes FEATURE_VECTOR annotations as input, classifies them and outputs CATEGORY annotations.

In [None]:
features_asm = FeaturesAssembler()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("features")

gen_clf = GenericClassifierApproach()\
    .setLabelColumn("category")\
    .setInputCols("features")\
    .setOutputCol("prediction")\
    .setModelFile(f"{graph_folder}/gcf_graph_100d.pb")\
    .setEpochsNumber(25)\
    .setBatchSize(128)\
    .setLearningRate(0.004)\
    .setOutputLogsPath(log_folder)\
    .setDropout(0.2)\
    .setFixImbalance(True)\
    # .setValidationSplit(0.1)

clf_Pipeline = Pipeline(
    stages=[
        features_asm,
        gc_graph_builder,
        gen_clf
])

In [None]:
generic_model_hc100 = clf_Pipeline.fit(trainingData_with_embeddings)

TF Graph Builder configuration:
Model name: generic_classifier
Graph folder: gc_graph
Graph file name: gcf_graph_100d.pb
Build params: {'input_dim': 100, 'output_dim': 2, 'hidden_layers': [300, 200, 50], 'hidden_act': 'tanh', 'hidden_act_l2': True, 'batch_norm': True}


Instructions for updating:
Colocations handled automatically by placer.


generic_classifier graph exported to gc_graph/gcf_graph_100d.pb
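The build parameters printed above (`input_dim: 100`, `hidden_layers: [300, 200, 50]`, `output_dim: 2`) describe a stack of dense layers. The sketch below, which is an illustration and not part of `TFGraphBuilder`, shows how those sizes translate into layer shapes and counts the weights and biases of the dense layers only. Parameters added by batch normalization are deliberately ignored, so the real graph holds more values than this.

```python
def dense_layer_shapes(input_dim, hidden_layers, output_dim):
    """Pair consecutive layer sizes into (inputs, outputs) shapes."""
    sizes = [input_dim, *hidden_layers, output_dim]
    return list(zip(sizes[:-1], sizes[1:]))


def count_parameters(shapes):
    """Weights plus biases for each dense layer (batch norm not included)."""
    return sum(n_in * n_out + n_out for n_in, n_out in shapes)


shapes = dense_layer_shapes(100, [300, 200, 50], 2)
print(shapes)                    # [(100, 300), (300, 200), (200, 50), (50, 2)]
print(count_parameters(shapes))  # 100652
```

With an empty `hidden_layers` list the same helper yields a single `(100, 2)` layer, which corresponds to a plain logistic-regression style model.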


In [None]:
!cat $log_folder/GenericClassifierApproach_*

Training 25 epochs
Epoch 1/25	0.36s	Loss: 12.278396	ACC: 0.64663196
Epoch 2/25	0.16s	Loss: 10.034459	ACC: 0.7231945
Epoch 3/25	0.16s	Loss: 9.83074	ACC: 0.73107636
Epoch 4/25	0.16s	Loss: 10.007244	ACC: 0.7165625
Epoch 5/25	0.16s	Loss: 9.52974	ACC: 0.7328125
Epoch 6/25	0.16s	Loss: 9.4485445	ACC: 0.7447917
Epoch 7/25	0.16s	Loss: 9.463392	ACC: 0.73868054
Epoch 8/25	0.16s	Loss: 9.271053	ACC: 0.74010414
Epoch 9/25	0.16s	Loss: 9.327106	ACC: 0.74041665
Epoch 10/25	0.16s	Loss: 8.715538	ACC: 0.7618403
Epoch 11/25	0.16s	Loss: 8.908168	ACC: 0.7553125
Epoch 12/25	0.16s	Loss: 8.754789	ACC: 0.7663889
Epoch 13/25	0.16s	Loss: 8.500905	ACC: 0.7721875
Epoch 14/25	0.16s	Loss: 8.649995	ACC: 0.7684722
Epoch 15/25	0.16s	Loss: 8.317527	ACC: 0.77774304
Epoch 16/25	0.16s	Loss: 8.258308	ACC: 0.7796181
Epoch 17/25	0.16s	Loss: 8.145815	ACC: 0.78875005
Epoch 18/25	0.16s	Loss: 8.38633	ACC: 0.7724653
Epoch 19/25	0.16s	Loss: 8.27932	ACC: 0.7781944
Epoch 20/25	0.16s	Loss: 8.107455	ACC: 0.78416663
Epoch 21/25	0.16s	Loss

In [None]:
pred_df = generic_model_hc100.transform(testData_with_embeddings)

In [None]:
pred_df.printSchema()
pred_df.select(pred_df.prediction).show(5, truncate=False)
pred_df.select(pred_df.category, pred_df.prediction.result).show(5, truncate=False)

root
 |-- text: string (nullable = true)
 |-- category: string (nullable = true)
 |-- sentence_embeddings: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- category_array: array (nullable = false)
 |    |-- element: string (containsNull = true)
 |-- prediction: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable

In [None]:
preds_df = pred_df.select('category','prediction.result').toPandas()
preds_df['result'] = preds_df.result.apply(lambda x : x[0])

print (classification_report(preds_df['category'], preds_df['result']))

              precision    recall  f1-score   support

         neg       0.92      0.74      0.82      1028
         pos       0.57      0.84      0.68       408

    accuracy                           0.77      1436
   macro avg       0.74      0.79      0.75      1436
weighted avg       0.82      0.77      0.78      1436



In [None]:
res = classification_report(preds_df["category"], preds_df["result"], output_dict=True)
results_df.loc["GenericClassifier_100d"] = [res["pos"]["f1-score"], res["accuracy"]]

### GenericLogRegClassifier

In [None]:
from sparknlp_jsl.annotator import TFGraphBuilder

graph_folder = "gc_graph"

gc_logreg_graph_builder = TFGraphBuilder()\
    .setModelName("logreg_classifier")\
    .setInputCols(["feature_vector"]) \
    .setLabelColumn("category")\
    .setGraphFolder(graph_folder)\
    .setGraphFile("log_reg_graph.pb")

GenericLogRegClassifier consumes the output of FeaturesAssembler, which collects features from one or more columns (here, a single embeddings column) into a feature vector.

GenericLogRegClassifier takes FEATURE_VECTOR annotations as input, classifies them, and outputs CATEGORY annotations.
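
Conceptually, a logistic-regression classifier over a feature vector is a single dense layer followed by a sigmoid. The sketch below shows that forward pass in plain Python on a toy 3-dimensional vector. It is an illustration only, not the library's implementation; `logreg_forward`, the weights, and the toy features are all assumptions made up for this example.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def logreg_forward(features, weights, biases):
    """One dense layer (no hidden layers) followed by an element-wise sigmoid.

    features: list of input_dim floats
    weights:  output_dim rows, each with input_dim floats
    biases:   list of output_dim floats
    """
    scores = []
    for row, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(row, features)) + bias
        scores.append(sigmoid(z))
    return scores


labels = ["neg", "pos"]
features = [0.5, -1.0, 2.0]                      # toy feature vector
weights = [[0.2, 0.4, -0.3], [-0.2, -0.4, 0.3]]  # hypothetical weights
biases = [0.0, 0.0]

scores = logreg_forward(features, weights, biases)
predicted = labels[scores.index(max(scores))]    # label with the highest score
print(predicted, [round(s, 3) for s in scores])
```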

In [None]:
features_asm = FeaturesAssembler()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("feature_vector")

gen_clf = GenericLogRegClassifierApproach()\
    .setLabelColumn("category")\
    .setInputCols("feature_vector")\
    .setOutputCol("prediction")\
    .setModelFile(f"{graph_folder}/log_reg_graph.pb")\
    .setEpochsNumber(20)\
    .setBatchSize(128)\
    .setLearningRate(0.01)\
    .setOutputLogsPath(log_folder)\
    .setDropout(0.1)\
    .setFixImbalance(True)\
    # .setValidationSplit(0.1)

clf_Pipeline = Pipeline(
    stages=[
        features_asm,
        gc_logreg_graph_builder,
        gen_clf
])


In [None]:
generic_model_hc100 = clf_Pipeline.fit(trainingData_with_embeddings)

TF Graph Builder configuration:
Model name: logreg_classifier
Graph folder: gc_graph
Graph file name: log_reg_graph.pb
Build params: {'input_dim': 100, 'output_dim': 2, 'hidden_layers': [], 'output_act': 'sigmoid'}
logreg_classifier graph exported to gc_graph/log_reg_graph.pb


In [None]:
!cat $log_folder/GenericLogRegClassifierApproach*

Training 20 epochs
Epoch 1/20	0.13s	Loss: 27.13766	ACC: 0.7051042
Epoch 2/20	0.05s	Loss: 24.967659	ACC: 0.71524304
Epoch 3/20	0.05s	Loss: 24.239285	ACC: 0.7254861
Epoch 4/20	0.05s	Loss: 23.404781	ACC: 0.7356597
Epoch 5/20	0.05s	Loss: 23.012838	ACC: 0.7459722
Epoch 6/20	0.05s	Loss: 22.866966	ACC: 0.7447917
Epoch 7/20	0.05s	Loss: 22.636919	ACC: 0.7487153
Epoch 8/20	0.05s	Loss: 22.422686	ACC: 0.75104165
Epoch 9/20	0.05s	Loss: 22.32955	ACC: 0.75166667
Epoch 10/20	0.05s	Loss: 22.380241	ACC: 0.7515973
Epoch 11/20	0.05s	Loss: 22.189754	ACC: 0.7584375
Epoch 12/20	0.05s	Loss: 22.186195	ACC: 0.7554514
Epoch 13/20	0.05s	Loss: 22.087923	ACC: 0.7646528
Epoch 14/20	0.05s	Loss: 22.137875	ACC: 0.75722224
Epoch 15/20	0.05s	Loss: 22.073769	ACC: 0.75576395
Epoch 16/20	0.05s	Loss: 22.048752	ACC: 0.75489587
Epoch 17/20	0.05s	Loss: 21.747688	ACC: 0.7603125
Epoch 18/20	0.05s	Loss: 21.954409	ACC: 0.7601042
Epoch 19/20	0.05s	Loss: 21.748411	ACC: 0.76565975
Epoch 20/20	0.05s	Loss: 21.80159	ACC: 0.75677085
Train
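
To compare runs, it can help to turn the per-epoch lines of this log into structured records. The sketch below parses lines in the format shown above (tab-separated `Epoch i/n`, duration, `Loss:`, `ACC:`); the function name `parse_training_log` and the two-line sample are assumptions for illustration, and lines that do not match the pattern are skipped.

```python
import re

# Matches e.g. "Epoch 1/20\t0.13s\tLoss: 27.13766\tACC: 0.7051042"
EPOCH_LINE = re.compile(
    r"Epoch (\d+)/(\d+)\s+([\d.]+)s\s+Loss: ([\d.]+)\s+ACC: ([\d.]+)"
)


def parse_training_log(text):
    """Return one dict per fully formed epoch line; other lines are ignored."""
    rows = []
    for line in text.splitlines():
        match = EPOCH_LINE.match(line.strip())
        if match:
            rows.append({
                "epoch": int(match.group(1)),
                "seconds": float(match.group(3)),
                "loss": float(match.group(4)),
                "acc": float(match.group(5)),
            })
    return rows


sample = (
    "Training 20 epochs\n"
    "Epoch 1/20\t0.13s\tLoss: 27.13766\tACC: 0.7051042\n"
    "Epoch 2/20\t0.05s\tLoss: 24.967659\tACC: 0.71524304\n"
)
rows = parse_training_log(sample)
best = max(rows, key=lambda r: r["acc"])  # epoch with the highest training accuracy
print(best["epoch"], best["acc"])
```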

In [None]:
pred_df = generic_model_hc100.transform(testData_with_embeddings)

In [None]:
pred_df.printSchema()
pred_df.select(pred_df.prediction).show(5, truncate=False)
pred_df.select(pred_df.category, pred_df.prediction.result).show(5, truncate=False)

root
 |-- text: string (nullable = true)
 |-- category: string (nullable = true)
 |-- sentence_embeddings: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- category_array: array (nullable = false)
 |    |-- element: string (containsNull = true)
 |-- prediction: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable

In [None]:
preds_df = pred_df.select('category','prediction.result').toPandas()
preds_df['result'] = preds_df.result.apply(lambda x : x[0])

print (classification_report(preds_df['category'], preds_df['result']))

              precision    recall  f1-score   support

         neg       0.80      0.93      0.86      1028
         pos       0.70      0.41      0.51       408

    accuracy                           0.78      1436
   macro avg       0.75      0.67      0.69      1436
weighted avg       0.77      0.78      0.76      1436



In [None]:
res = classification_report(preds_df["category"], preds_df["result"], output_dict=True)
results_df.loc["GenericLogReg_100d"] = [res["pos"]["f1-score"], res["accuracy"]]

### GenericSVMClassifier

In [None]:
from sparknlp_jsl.annotator import TFGraphBuilder

graph_folder = "gc_graph"

gc_svm_graph_builder = TFGraphBuilder()\
    .setModelName("svm_classifier")\
    .setInputCols(["feature_vector"]) \
    .setLabelColumn("category")\
    .setGraphFolder(graph_folder)\
    .setGraphFile("svm_graph.pb")

GenericSVMClassifier consumes the output of FeaturesAssembler, which collects features from one or more columns (here, a single embeddings column) into a feature vector.

GenericSVMClassifier takes FEATURE_VECTOR annotations as input, classifies them, and outputs CATEGORY annotations.

In [None]:
features_asm = FeaturesAssembler()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("feature_vector")

gen_clf = GenericSVMClassifierApproach()\
    .setLabelColumn("category")\
    .setInputCols("feature_vector")\
    .setOutputCol("prediction")\
    .setModelFile(f"{graph_folder}/svm_graph.pb")\
    .setEpochsNumber(21)\
    .setBatchSize(128)\
    .setLearningRate(0.02)\
    .setOutputLogsPath(log_folder)\
    .setDropout(0.1)\
    .setFixImbalance(True)\
    # .setValidationSplit(0.1)

clf_Pipeline = Pipeline(
    stages=[
        features_asm,
        gc_svm_graph_builder,
        gen_clf
])

In [None]:
generic_model_hc100 = clf_Pipeline.fit(trainingData_with_embeddings)

TF Graph Builder configuration:
Model name: svm_classifier
Graph folder: gc_graph
Graph file name: svm_graph.pb
Build params: {'input_dim': 100, 'output_dim': 2, 'hidden_layers': [], 'output_act': 'sigmoid', 'loss_func': 'hinge'}
svm_classifier graph exported to gc_graph/svm_graph.pb
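
The build params above report `'loss_func': 'hinge'`, which is the loss usually associated with SVMs. As a reminder of what that loss computes, here is the textbook binary form in plain Python. This is a sketch under the assumption of labels in {-1, +1} and a raw margin score; it does not reproduce the exported TensorFlow graph, and `hinge_loss` is a name introduced for this example.

```python
def hinge_loss(score: float, label: int) -> float:
    """Textbook hinge loss: zero once the example is on the correct
    side of the margin (label * score >= 1), linear otherwise."""
    return max(0.0, 1.0 - label * score)


print(hinge_loss(2.0, +1))   # confidently correct
print(hinge_loss(0.5, +1))   # correct but inside the margin
print(hinge_loss(0.5, -1))   # wrong side of the boundary
```

Unlike log loss, the hinge loss stops penalizing examples once they clear the margin, so only borderline and misclassified examples drive the updates.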


In [None]:
!cat $log_folder/GenericSVMClassifierApproach*

Training 21 epochs
Epoch 1/21	0.14s	Loss: 28.87314	ACC: 0.69805557
Epoch 2/21	0.05s	Loss: 25.810778	ACC: 0.721875
Epoch 3/21	0.04s	Loss: 25.194712	ACC: 0.7365625
Epoch 4/21	0.04s	Loss: 24.96621	ACC: 0.74645835
Epoch 5/21	0.05s	Loss: 24.297335	ACC: 0.7527084
Epoch 6/21	0.04s	Loss: 24.534353	ACC: 0.7510764
Epoch 7/21	0.04s	Loss: 24.283468	ACC: 0.75343746
Epoch 8/21	0.04s	Loss: 24.278233	ACC: 0.7574653
Epoch 9/21	0.04s	Loss: 24.64222	ACC: 0.7555208
Epoch 10/21	0.05s	Loss: 24.371727	ACC: 0.7576736
Epoch 11/21	0.05s	Loss: 24.236877	ACC: 0.7629166
Epoch 12/21	0.04s	Loss: 24.634861	ACC: 0.75687504
Epoch 13/21	0.04s	Loss: 24.194399	ACC: 0.7654861
Epoch 14/21	0.04s	Loss: 24.202686	ACC: 0.7618403
Epoch 15/21	0.05s	Loss: 24.175367	ACC: 0.7584722
Epoch 16/21	0.04s	Loss: 24.432009	ACC: 0.7605209
Epoch 17/21	0.04s	Loss: 24.445057	ACC: 0.75972223
Epoch 18/21	0.04s	Loss: 24.109804	ACC: 0.75885415
Epoch 19/21	0.05s	Loss: 24.303558	ACC: 0.7605209
Epoch 20/21	0.05s	Loss: 24.265621	ACC: 0.7626042
Epoch 21

In [None]:
pred_df = generic_model_hc100.transform(testData_with_embeddings)

In [None]:
pred_df.printSchema()
pred_df.select(pred_df.prediction).show(5, truncate=False)
pred_df.select(pred_df.category, pred_df.prediction.result).show(5, truncate=False)

root
 |-- text: string (nullable = true)
 |-- category: string (nullable = true)
 |-- sentence_embeddings: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- category_array: array (nullable = false)
 |    |-- element: string (containsNull = true)
 |-- prediction: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable

In [None]:
preds_df = pred_df.select('category','prediction.result').toPandas()
preds_df['result'] = preds_df.result.apply(lambda x : x[0])

print (classification_report(preds_df['category'], preds_df['result']))

              precision    recall  f1-score   support

         neg       0.80      0.92      0.86      1028
         pos       0.68      0.43      0.53       408

    accuracy                           0.78      1436
   macro avg       0.74      0.68      0.69      1436
weighted avg       0.77      0.78      0.77      1436



In [None]:
res = classification_report(preds_df["category"], preds_df["result"], output_dict=True)
results_df.loc["GenericSVM_100d"] = [res["pos"]["f1-score"], res["accuracy"]]

## 200 Dimension Healthcare Embeddings (embeddings_clinical)



Next, we extract [embeddings_clinical](https://nlp.johnsnowlabs.com/2020/01/28/embeddings_clinical_en.html) embeddings, which have a 200-dimensional output, and use them for model training.

In [None]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")\
    .setInputCols(["document","token"])\
    .setOutputCol("word_embeddings")

sentence_embeddings = SentenceEmbeddings() \
    .setInputCols(["document", "word_embeddings"]) \
    .setOutputCol("sentence_embeddings") \
    .setPoolingStrategy("AVERAGE")

embeddings_pipeline = Pipeline(
    stages = [
        document_assembler,
        tokenizer,
        word_embeddings,
        sentence_embeddings,
])

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
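
The `SentenceEmbeddings` stage above uses the `AVERAGE` pooling strategy, which turns the per-token word vectors of a document into a single vector. The plain-Python sketch below illustrates that idea on toy 2-dimensional vectors; `average_pool` and the sample data are assumptions for illustration, not the annotator's actual code.

```python
def average_pool(token_vectors):
    """Element-wise mean of equally sized token vectors.

    Returns an empty list when there are no tokens.
    """
    if not token_vectors:
        return []
    count = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


# Three toy token vectors standing in for 200-dimensional word embeddings.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
sentence_vector = average_pool(tokens)
print(sentence_vector)
```

The pooled vector has the same dimensionality as the word embeddings, which is why the classifiers in this section see 200-dimensional inputs.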


In [None]:
trainingData_with_embeddings = embeddings_pipeline.fit(trainingData).transform(trainingData)\
                                                  .select("text","category","sentence_embeddings")
trainingData_with_embeddings.show(2,truncate=60)

+------------------------------------------------------------+--------+------------------------------------------------------------+
|                                                        text|category|                                         sentence_embeddings|
+------------------------------------------------------------+--------+------------------------------------------------------------+
| A 16-year-old girl with erosive, polyarticular JRA showe...|     neg|[{sentence_embeddings, 0, 129,  A 16-year-old girl with e...|
| A 42-year-old woman had uneventful bilateral laser-assis...|     neg|[{sentence_embeddings, 0, 112,  A 42-year-old woman had u...|
+------------------------------------------------------------+--------+------------------------------------------------------------+
only showing top 2 rows



In [None]:
testData_with_embeddings = embeddings_pipeline.fit(testData).transform(testData)\
                                                  .select("text","category","sentence_embeddings")
testData_with_embeddings.show(2,truncate=60)

+------------------------------------------------------------+--------+------------------------------------------------------------+
|                                                        text|category|                                         sentence_embeddings|
+------------------------------------------------------------+--------+------------------------------------------------------------+
| Air conditioning and proper ventilation of the operating...|     neg|[{sentence_embeddings, 0, 155,  Air conditioning and prop...|
| Biological data mainly revealed hepatic failure and lact...|     neg|[{sentence_embeddings, 0, 68,  Biological data mainly rev...|
+------------------------------------------------------------+--------+------------------------------------------------------------+
only showing top 2 rows



In [None]:
log_folder="ADE_logs_healthcare_200d"

### ClassifierDL

In [None]:
classifier_dl = ClassifierDLApproach()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("class")\
    .setLabelColumn("category")\
    .setBatchSize(16)\
    .setMaxEpochs(30)\
    .setLr(0.001)\
    .setDropout(0.3)\
    .setEnableOutputLogs(True)\
    .setOutputLogsPath(log_folder)

classifier_dl_pipeline = Pipeline(
    stages = [
        classifier_dl
])

In [None]:
clfDL_model_hc200 = classifier_dl_pipeline.fit(trainingData_with_embeddings)

In [None]:
preds = clfDL_model_hc200.transform(testData_with_embeddings)

preds_df = preds.select("category","text","class.result").toPandas()
preds_df["result"] = preds_df["result"].apply(lambda x : x[0])

print (classification_report(preds_df["category"], preds_df["result"]))

              precision    recall  f1-score   support

         neg       0.89      0.93      0.91      1028
         pos       0.81      0.70      0.75       408

    accuracy                           0.87      1436
   macro avg       0.85      0.82      0.83      1436
weighted avg       0.86      0.87      0.86      1436



In [None]:
res = classification_report(preds_df["category"], preds_df["result"], output_dict=True)
results_df.loc["ClassifierDL_200d"] = [res["pos"]["f1-score"], res["accuracy"]]

### MultiClassifierDL

In [None]:
# MultiClassifierDL expects the label column to be an array of strings, so we wrap each label in an array.
trainingData_with_embeddings = trainingData_with_embeddings.withColumn("category_array", F.array(F.col("category")))
testData_with_embeddings = testData_with_embeddings.withColumn("category_array", F.array(F.col("category")))

In [None]:
multiClassifier = MultiClassifierDLApproach()\
    .setInputCols("sentence_embeddings")\
    .setOutputCol("prediction")\
    .setLabelColumn("category_array")\
    .setBatchSize(32)\
    .setMaxEpochs(20)\
    .setLr(9e-3)\
    .setThreshold(0.5)\
    .setShufflePerEpoch(False)\
    .setEnableOutputLogs(True)\
    .setOutputLogsPath(log_folder)\
  #   .setValidationSplit(0.1)

multiClassifier_pipeline = Pipeline(
    stages = [
        multiClassifier
])

In [None]:
multiClassifier_model_hc200 = multiClassifier_pipeline.fit(trainingData_with_embeddings)

In [None]:
preds = multiClassifier_model_hc200.transform(testData_with_embeddings)

In [None]:
preds_df = preds.select("category","text","prediction.result","prediction.metadata").toPandas()
preds_df.result.apply(lambda x: str(x) ).value_counts()

result
['neg']           926
['pos']           508
['neg', 'pos']      1
[]                  1
Name: count, dtype: int64
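
The counts above include one row with both labels and one with none. That is what a multi-label classifier produces when each label is accepted independently against a threshold. The sketch below illustrates this, assuming a label is kept when its score is at least the threshold (the exact comparison used by the model is an assumption here); the helper names and scores are made up for the example.

```python
def labels_above_threshold(scores, threshold=0.5):
    """Keep every label whose score reaches the threshold (may be 0, 1, or many)."""
    return [label for label, score in scores.items() if score >= threshold]


def top_label(scores):
    """Single-label fallback: the highest-scoring label, or "" when there are no scores."""
    return max(scores, key=scores.get) if scores else ""


both = {"neg": 0.7, "pos": 0.6}      # hypothetical scores
neither = {"neg": 0.4, "pos": 0.3}   # hypothetical scores

print(labels_above_threshold(both))      # two labels pass
print(labels_above_threshold(neither))   # no label passes
print(top_label(neither))                # argmax still yields one label
```

Taking the highest-scoring label, as the next cell does, forces exactly one prediction per row whenever scores are available.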

In [None]:
# Take the label with the highest score as the result. setThreshold() during training controls how many rows end up with no label at all.
preds_df["scores"] = preds_df.metadata.apply(lambda x: {k:float(v) for k,v in x[0].items()} if len(x)>=1 else "")
preds_df["result"] = preds_df.scores.apply(lambda x: max(x, key=x.get) if len(x)>=1 else "")
preds_df["result"].value_counts()

result
neg    926
pos    509
         1
Name: count, dtype: int64

In [None]:
print (classification_report(preds_df["category"], preds_df["result"]))

              precision    recall  f1-score   support

                   0.00      0.00      0.00         0
         neg       0.93      0.84      0.88      1028
         pos       0.68      0.85      0.75       408

    accuracy                           0.84      1436
   macro avg       0.54      0.56      0.55      1436
weighted avg       0.86      0.84      0.85      1436
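
The blank first row in the report comes from the single prediction with an empty label. Because the macro average weights every class equally, that empty class drags the macro figures down even though its support is 0. The sketch below recomputes the macro precision and recall from the rounded per-class values above, with and without the empty class; it is an arithmetic illustration only.

```python
# (precision, recall) per class, rounded values copied from the report above.
per_class = {
    "": (0.00, 0.00),     # empty prediction, treated as its own class
    "neg": (0.93, 0.84),
    "pos": (0.68, 0.85),
}


def macro_average(classes):
    """Unweighted mean of precision and recall across the given classes."""
    count = len(classes)
    precision = sum(p for p, _ in classes.values()) / count
    recall = sum(r for _, r in classes.values()) / count
    return precision, recall


macro_precision, macro_recall = macro_average(per_class)

# Same computation after dropping the empty class.
real_classes = {k: v for k, v in per_class.items() if k != ""}
clean_precision, clean_recall = macro_average(real_classes)

print(round(macro_precision, 2), round(macro_recall, 2))
print(clean_precision, clean_recall)
```

Filtering out rows with an empty prediction before calling `classification_report` removes the extra class, at the cost of evaluating on slightly fewer rows.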



In [None]:
res = classification_report(preds_df["category"], preds_df["result"], output_dict=True)
results_df.loc["MultiClassifierDL_200d"] = [res["pos"]["f1-score"], res["accuracy"]]

### Generic Classifier

In [None]:
# !pip install -q tensorflow==2.12.0 tensorflow_addons

In [None]:
from sparknlp_jsl.annotator import TFGraphBuilder

graph_folder = "gc_graph"

gc_graph_builder = TFGraphBuilder()\
    .setModelName("generic_classifier")\
    .setInputCols(["features"])\
    .setLabelColumn("category")\
    .setHiddenLayers([300,200, 50])\
    .setHiddenAct("tanh")\
    .setHiddenActL2(True)\
    .setBatchNorm(True)\
    .setGraphFolder(graph_folder)\
    .setGraphFile("gcf_graph_200d.pb")

In [None]:
features_asm = FeaturesAssembler()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("features")

gen_clf = GenericClassifierApproach()\
    .setLabelColumn("category")\
    .setInputCols("features")\
    .setOutputCol("prediction")\
    .setModelFile(f"{graph_folder}/gcf_graph_200d.pb")\
    .setEpochsNumber(40)\
    .setBatchSize(16)\
    .setLearningRate(0.001)\
    .setOutputLogsPath(log_folder)\
    .setDropout(0.5)\
    .setFixImbalance(True)\
    # .setValidationSplit(0.1)

clf_Pipeline = Pipeline(
    stages=[
        features_asm,
        gc_graph_builder,
        gen_clf
])

In [None]:
generic_model_hc200 = clf_Pipeline.fit(trainingData_with_embeddings)

TF Graph Builder configuration:
Model name: generic_classifier
Graph folder: gc_graph
Graph file name: gcf_graph_200d.pb
Build params: {'input_dim': 200, 'output_dim': 2, 'hidden_layers': [300, 200, 50], 'hidden_act': 'tanh', 'hidden_act_l2': True, 'batch_norm': True}
generic_classifier graph exported to gc_graph/gcf_graph_200d.pb


In [None]:
pred_df = generic_model_hc200.transform(testData_with_embeddings)

preds_df = pred_df.select('category','prediction.result').toPandas()
preds_df['result'] = preds_df.result.apply(lambda x : x[0])

print (classification_report(preds_df['category'], preds_df['result']))

              precision    recall  f1-score   support

         neg       0.91      0.89      0.90      1028
         pos       0.74      0.78      0.76       408

    accuracy                           0.86      1436
   macro avg       0.83      0.84      0.83      1436
weighted avg       0.86      0.86      0.86      1436



In [None]:
res = classification_report(preds_df["category"], preds_df["result"], output_dict=True)
results_df.loc["GenericClassifier_200d"] = [res["pos"]["f1-score"], res["accuracy"]]

### GenericLogRegClassifier

In [None]:
from sparknlp_jsl.annotator import TFGraphBuilder

graph_folder = "gc_graph"

gc_logreg_graph_builder = TFGraphBuilder()\
    .setModelName("logreg_classifier")\
    .setInputCols(["feature_vector"]) \
    .setLabelColumn("category")\
    .setGraphFolder(graph_folder)\
    .setGraphFile("log_reg_graph.pb")

GenericLogRegClassifier consumes the output of FeaturesAssembler, which collects features from one or more columns (here, a single embeddings column) into a feature vector.

GenericLogRegClassifier takes FEATURE_VECTOR annotations as input, classifies them, and outputs CATEGORY annotations.

In [None]:
features_asm = FeaturesAssembler()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("feature_vector")

gen_clf = GenericLogRegClassifierApproach()\
    .setLabelColumn("category")\
    .setInputCols("feature_vector")\
    .setOutputCol("prediction")\
    .setModelFile(f"{graph_folder}/log_reg_graph.pb")\
    .setEpochsNumber(20)\
    .setBatchSize(128)\
    .setLearningRate(0.01)\
    .setOutputLogsPath(log_folder)\
    .setDropout(0.1)\
    .setFixImbalance(True)\
    # .setValidationSplit(0.1)

clf_Pipeline = Pipeline(
    stages=[
        features_asm,
        gc_logreg_graph_builder,
        gen_clf
])


In [None]:
generic_model_hc200 = clf_Pipeline.fit(trainingData_with_embeddings)

TF Graph Builder configuration:
Model name: logreg_classifier
Graph folder: gc_graph
Graph file name: log_reg_graph.pb
Build params: {'input_dim': 200, 'output_dim': 2, 'hidden_layers': [], 'output_act': 'sigmoid'}
logreg_classifier graph exported to gc_graph/log_reg_graph.pb


In [None]:
pred_df = generic_model_hc200.transform(testData_with_embeddings)

In [None]:
preds_df = pred_df.select('category','prediction.result').toPandas()
preds_df['result'] = preds_df.result.apply(lambda x : x[0])

print (classification_report(preds_df['category'], preds_df['result']))

              precision    recall  f1-score   support

         neg       0.85      0.91      0.88      1028
         pos       0.72      0.58      0.64       408

    accuracy                           0.82      1436
   macro avg       0.78      0.74      0.76      1436
weighted avg       0.81      0.82      0.81      1436



In [None]:
res = classification_report(preds_df["category"], preds_df["result"], output_dict=True)
results_df.loc["GenericLogReg_200d"] = [res["pos"]["f1-score"], res["accuracy"]]

### GenericSVMClassifier

In [None]:
from sparknlp_jsl.annotator import TFGraphBuilder

graph_folder = "gc_graph"

gc_svm_graph_builder = TFGraphBuilder()\
    .setModelName("svm_classifier")\
    .setInputCols(["feature_vector"]) \
    .setLabelColumn("category")\
    .setGraphFolder(graph_folder)\
    .setGraphFile("svm_graph.pb")

GenericSVMClassifier consumes the output of FeaturesAssembler, which collects features from one or more columns (here, a single embeddings column) into a feature vector.

GenericSVMClassifier takes FEATURE_VECTOR annotations as input, classifies them, and outputs CATEGORY annotations.

In [None]:
features_asm = FeaturesAssembler()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("feature_vector")

gen_clf = GenericSVMClassifierApproach()\
    .setLabelColumn("category")\
    .setInputCols("feature_vector")\
    .setOutputCol("prediction")\
    .setModelFile(f"{graph_folder}/svm_graph.pb")\
    .setEpochsNumber(20)\
    .setBatchSize(128)\
    .setLearningRate(0.005)\
    .setOutputLogsPath(log_folder)\
    .setDropout(0.1)\
    .setFixImbalance(True)\
    # .setValidationSplit(0.1)

clf_Pipeline = Pipeline(
    stages=[
        features_asm,
        gc_svm_graph_builder,
        gen_clf
])


In [None]:
generic_model_hc200 = clf_Pipeline.fit(trainingData_with_embeddings)

TF Graph Builder configuration:
Model name: svm_classifier
Graph folder: gc_graph
Graph file name: svm_graph.pb
Build params: {'input_dim': 200, 'output_dim': 2, 'hidden_layers': [], 'output_act': 'sigmoid', 'loss_func': 'hinge'}
svm_classifier graph exported to gc_graph/svm_graph.pb


In [None]:
pred_df = generic_model_hc200.transform(testData_with_embeddings)

In [None]:
preds_df = pred_df.select('category','prediction.result').toPandas()
preds_df['result'] = preds_df.result.apply(lambda x : x[0])

print (classification_report(preds_df['category'], preds_df['result']))

              precision    recall  f1-score   support

         neg       0.84      0.92      0.88      1028
         pos       0.73      0.54      0.62       408

    accuracy                           0.81      1436
   macro avg       0.78      0.73      0.75      1436
weighted avg       0.81      0.81      0.80      1436



In [None]:
res = classification_report(preds_df["category"], preds_df["result"], output_dict=True)
results_df.loc["GenericSVM_200d"] = [res["pos"]["f1-score"], res["accuracy"]]

## Bert Sentence Embeddings (sbiobert_base_cased_mli)

Next, we extract [sbiobert_base_cased_mli](https://nlp.johnsnowlabs.com/2020/11/27/sbiobert_base_cased_mli_en.html) embeddings, which have a 768-dimensional output, and use them for model training.

In [None]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

bert_sent = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", 'en','clinical/models')\
    .setInputCols(["document"])\
    .setOutputCol("sentence_embeddings")

embeddings_pipeline = Pipeline(
    stages = [
        document_assembler,
        bert_sent
])

sbiobert_base_cased_mli download started this may take some time.
Approximate size to download 384.3 MB
[OK!]


In [None]:
trainingData_with_embeddings = embeddings_pipeline.fit(trainingData).transform(trainingData)\
                                                  .select("text","category","sentence_embeddings")

trainingData_with_embeddings.show(2,truncate=60)

+------------------------------------------------------------+--------+------------------------------------------------------------+
|                                                        text|category|                                         sentence_embeddings|
+------------------------------------------------------------+--------+------------------------------------------------------------+
| A 16-year-old girl with erosive, polyarticular JRA showe...|     neg|[{sentence_embeddings, 0, 129,  A 16-year-old girl with e...|
| A 42-year-old woman had uneventful bilateral laser-assis...|     neg|[{sentence_embeddings, 0, 112,  A 42-year-old woman had u...|
+------------------------------------------------------------+--------+------------------------------------------------------------+
only showing top 2 rows



In [None]:
testData_with_embeddings = embeddings_pipeline.fit(testData).transform(testData)\
                                                  .select("text","category","sentence_embeddings")

testData_with_embeddings.show(2,truncate=60)

+------------------------------------------------------------+--------+------------------------------------------------------------+
|                                                        text|category|                                         sentence_embeddings|
+------------------------------------------------------------+--------+------------------------------------------------------------+
| Air conditioning and proper ventilation of the operating...|     neg|[{sentence_embeddings, 0, 155,  Air conditioning and prop...|
| Biological data mainly revealed hepatic failure and lact...|     neg|[{sentence_embeddings, 0, 68,  Biological data mainly rev...|
+------------------------------------------------------------+--------+------------------------------------------------------------+
only showing top 2 rows



In [None]:
log_folder="ADE_logs_bert"

### ClassifierDL

In [None]:
classifier_dl = ClassifierDLApproach()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("class")\
    .setLabelColumn("category")\
    .setBatchSize(32)\
    .setMaxEpochs(20)\
    .setLr(0.001)\
    .setDropout(0.3)\
    .setEnableOutputLogs(True)\
    .setOutputLogsPath(log_folder)

classifier_dl_pipeline = Pipeline(
    stages=[
        classifier_dl
])

In [None]:
clfDL_model_bert = classifier_dl_pipeline.fit(trainingData_with_embeddings)

In [None]:
preds = clfDL_model_bert.transform(testData_with_embeddings)

preds_df = preds.select("category","text","class.result").toPandas()
preds_df["result"] = preds_df["result"].apply(lambda x : x[0])

print (classification_report(preds_df["category"], preds_df["result"]))

              precision    recall  f1-score   support

         neg       0.94      0.92      0.93      1028
         pos       0.82      0.84      0.83       408

    accuracy                           0.90      1436
   macro avg       0.88      0.88      0.88      1436
weighted avg       0.90      0.90      0.90      1436



In [None]:
res = classification_report(preds_df["category"], preds_df["result"], output_dict=True)
results_df.loc["ClassifierDL_bert"] = [res["pos"]["f1-score"], res["accuracy"]]

### MultiClassifierDL

In [None]:
# MultiClassifierDL expects the label column to be an array of strings, so we wrap each label in an array.
trainingData_with_embeddings = trainingData_with_embeddings.withColumn("category_array", F.array(F.col("category")))
testData_with_embeddings = testData_with_embeddings.withColumn("category_array", F.array(F.col("category")))

In [None]:
multiClassifier = MultiClassifierDLApproach()\
    .setInputCols("sentence_embeddings")\
    .setOutputCol("prediction")\
    .setLabelColumn("category_array")\
    .setBatchSize(32)\
    .setMaxEpochs(20)\
    .setLr(9e-3)\
    .setThreshold(0.5)\
    .setShufflePerEpoch(False)\
    .setEnableOutputLogs(True)\
    .setOutputLogsPath(log_folder)\
  #   .setValidationSplit(0.1)

multiClassifier_pipeline = Pipeline(
    stages = [
        multiClassifier
])

In [None]:
multiClassifier_model_bert = multiClassifier_pipeline.fit(trainingData_with_embeddings)

In [None]:
preds = multiClassifier_model_bert.transform(testData_with_embeddings)

In [None]:
preds_df = preds.select("category","text","prediction.result","prediction.metadata").toPandas()
preds_df.result.apply(lambda x: str(x) ).value_counts()

result
['neg']    1055
['pos']     381
Name: count, dtype: int64

In [None]:
# Take the label with the highest score as the result. setThreshold() during training controls how many rows end up with no label at all.
preds_df["scores"] = preds_df.metadata.apply(lambda x: {k:float(v) for k,v in x[0].items()} if len(x)>=1 else "")
preds_df["result"] = preds_df.scores.apply(lambda x: max(x, key=x.get) if len(x)>=1 else "")
preds_df = preds_df[(preds_df["result"]!="")]
preds_df["result"].value_counts()

result
neg    1055
pos     381
Name: count, dtype: int64

In [None]:
print (classification_report(preds_df["category"], preds_df["result"]))

              precision    recall  f1-score   support

         neg       0.91      0.93      0.92      1028
         pos       0.82      0.77      0.79       408

    accuracy                           0.89      1436
   macro avg       0.87      0.85      0.86      1436
weighted avg       0.88      0.89      0.89      1436



In [None]:
res = classification_report(preds_df["category"], preds_df["result"], output_dict=True)
results_df.loc["MultiClassifierDL_bert"] = [res["pos"]["f1-score"], res["accuracy"]]

### Generic Classifier

In [None]:
# !pip install -q tensorflow==2.12.0 tensorflow_addons

In [None]:
from sparknlp_jsl.annotator import TFGraphBuilder

graph_folder = "gc_graph"

gc_graph_builder = TFGraphBuilder()\
    .setModelName("generic_classifier")\
    .setInputCols(["features"])\
    .setLabelColumn("category")\
    .setHiddenLayers([300,200, 50])\
    .setHiddenAct("tanh")\
    .setHiddenActL2(True)\
    .setBatchNorm(True)\
    .setGraphFolder(graph_folder)\
    .setGraphFile("gcf_graph_bert.pb")

In [None]:
features_asm = FeaturesAssembler()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("features")

gen_clf = GenericClassifierApproach()\
    .setLabelColumn("category")\
    .setInputCols("features")\
    .setOutputCol("prediction")\
    .setModelFile(f"{graph_folder}/gcf_graph_bert.pb")\
    .setEpochsNumber(20)\
    .setBatchSize(32)\
    .setLearningRate(0.001)\
    .setOutputLogsPath(log_folder)\
    .setDropout(0.5)\
    .setFixImbalance(True)\
    # .setValidationSplit(0.1)

clf_Pipeline = Pipeline(
    stages=[
        features_asm,
        gc_graph_builder,
        gen_clf
])


In [None]:
generic_model_bert = clf_Pipeline.fit(trainingData_with_embeddings)

TF Graph Builder configuration:
Model name: generic_classifier
Graph folder: gc_graph
Graph file name: gcf_graph_bert.pb
Build params: {'input_dim': 768, 'output_dim': 2, 'hidden_layers': [300, 200, 50], 'hidden_act': 'tanh', 'hidden_act_l2': True, 'batch_norm': True}
generic_classifier graph exported to gc_graph/gcf_graph_bert.pb


In [None]:
pred_df = generic_model_bert.transform(testData_with_embeddings)

preds_df = pred_df.select('category','prediction.result').toPandas()
preds_df['result'] = preds_df.result.apply(lambda x : x[0])

print (classification_report(preds_df['category'], preds_df['result']))

              precision    recall  f1-score   support

         neg       0.96      0.87      0.91      1028
         pos       0.73      0.91      0.81       408

    accuracy                           0.88      1436
   macro avg       0.85      0.89      0.86      1436
weighted avg       0.89      0.88      0.88      1436



In [None]:
res = classification_report(preds_df["category"], preds_df["result"], output_dict=True)
results_df.loc["GenericClassifier_bert"] = [res["pos"]["f1-score"], res["accuracy"]]

### GenericLogRegClassifier

In [None]:
from sparknlp_jsl.annotator import TFGraphBuilder

graph_folder = "gc_graph"

gc_logreg_graph_builder = TFGraphBuilder()\
    .setModelName("logreg_classifier")\
    .setInputCols(["feature_vector"]) \
    .setLabelColumn("category")\
    .setGraphFolder(graph_folder)\
    .setGraphFile("log_reg_graph.pb")

GenericLogRegClassifier needs outputs from FeaturesAssembler. The FeaturesAssembler is used to collect features from different columns or an embeddings column.

The GenericLogRegClassifier takes FEATURE_VECTOR annotations as input, classifies them and outputs CATEGORY annotations.
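
As a rough mental model, a logistic-regression classifier with no hidden layers and a sigmoid output applies one weighted sum per class to the feature vector and squashes it into (0, 1). The sketch below illustrates that computation in plain Python; the weights, biases and the helper names `sigmoid` and `logreg_scores` are hypothetical, and this is not the library's implementation.

```python
import math

def sigmoid(z):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logreg_scores(features, weights, biases):
    """Score one feature vector against every class.

    weights holds one weight vector per class, biases one scalar per class.
    All values are hypothetical; a real model learns them during fit().
    """
    return [
        sigmoid(sum(w * x for w, x in zip(class_weights, features)) + b)
        for class_weights, b in zip(weights, biases)
    ]

# Toy 3-dimensional feature vector and two classes.
labels = ["neg", "pos"]
features = [0.5, -1.0, 2.0]
weights = [[0.1, 0.2, -0.3], [-0.1, -0.2, 0.3]]
biases = [0.0, 0.0]

scores = logreg_scores(features, weights, biases)
predicted = labels[scores.index(max(scores))]
print(dict(zip(labels, scores)), predicted)
```

In the notebook the feature vector has as many dimensions as the sentence embeddings, and the weights come from training rather than being written by hand.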

In [None]:
features_asm = FeaturesAssembler()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("feature_vector")

gen_clf = GenericLogRegClassifierApproach()\
    .setLabelColumn("category")\
    .setInputCols("feature_vector")\
    .setOutputCol("prediction")\
    .setModelFile(f"{graph_folder}/log_reg_graph.pb")\
    .setEpochsNumber(20)\
    .setBatchSize(128)\
    .setLearningRate(0.01)\
    .setOutputLogsPath(log_folder)\
    .setDropout(0.1)\
    .setFixImbalance(True)\
    # .setValidationSplit(0.1)

clf_Pipeline = Pipeline(
    stages=[
        features_asm,
        gc_logreg_graph_builder,
        gen_clf
])


In [None]:
generic_model_bert = clf_Pipeline.fit(trainingData_with_embeddings)

TF Graph Builder configuration:
Model name: logreg_classifier
Graph folder: gc_graph
Graph file name: log_reg_graph.pb
Build params: {'input_dim': 768, 'output_dim': 2, 'hidden_layers': [], 'output_act': 'sigmoid'}
logreg_classifier graph exported to gc_graph/log_reg_graph.pb


In [None]:
pred_df = generic_model_bert.transform(testData_with_embeddings)

In [None]:
preds_df = pred_df.select('category','prediction.result').toPandas()
preds_df['result'] = preds_df.result.apply(lambda x : x[0])

print (classification_report(preds_df['category'], preds_df['result']))

              precision    recall  f1-score   support

         neg       0.89      0.95      0.92      1028
         pos       0.84      0.70      0.76       408

    accuracy                           0.88      1436
   macro avg       0.86      0.82      0.84      1436
weighted avg       0.87      0.88      0.87      1436



In [None]:
res = classification_report(preds_df["category"], preds_df["result"], output_dict=True)
results_df.loc["GenericLogReg_bert"] = [res["pos"]["f1-score"], res["accuracy"]]

### GenericSVMClassifier

In [None]:
from sparknlp_jsl.annotator import TFGraphBuilder

graph_folder = "gc_graph"

gc_svm_graph_builder = TFGraphBuilder()\
    .setModelName("svm_classifier")\
    .setInputCols(["feature_vector"]) \
    .setLabelColumn("category")\
    .setGraphFolder(graph_folder)\
    .setGraphFile("svm_graph.pb")

GenericSVMClassifier needs outputs from FeaturesAssembler. The FeaturesAssembler is used to collect features from different columns or an embeddings column.

The GenericSVMClassifier takes FEATURE_VECTOR annotations as input, classifies them and outputs CATEGORY annotations.
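
What sets the SVM variant apart from the logistic-regression one is the hinge loss that appears in the graph build parameters of this notebook. The sketch below shows the standard binary hinge loss in plain Python; the exact formulation used inside the library is not shown in this notebook, so treat the function `hinge_loss` and the toy scores as assumptions for illustration.

```python
def hinge_loss(score, target):
    """Standard binary hinge loss for one example.

    target is +1 or -1 and score is the raw (unbounded) model output.
    The loss is zero once the example is on the correct side with margin >= 1.
    """
    return max(0.0, 1.0 - target * score)

# (score, target) pairs: confident hit, weak hit, miss, confident negative hit.
examples = [(2.5, 1), (0.3, 1), (-0.8, 1), (-1.7, -1)]
losses = [hinge_loss(score, target) for score, target in examples]

for (score, target), loss in zip(examples, losses):
    print(f"score={score:+.1f} target={target:+d} loss={loss:.1f}")
```

Correct predictions with a comfortable margin contribute nothing to the loss, so training concentrates on examples near or across the decision boundary.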

In [None]:
features_asm = FeaturesAssembler()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("feature_vector")

gen_clf = GenericSVMClassifierApproach()\
    .setLabelColumn("category")\
    .setInputCols("feature_vector")\
    .setOutputCol("prediction")\
    .setModelFile(f"{graph_folder}/svm_graph.pb")\
    .setEpochsNumber(20)\
    .setBatchSize(128)\
    .setLearningRate(0.005)\
    .setOutputLogsPath(log_folder)\
    .setDropout(0.1)\
    .setFixImbalance(True)\
    # .setValidationSplit(0.1)

clf_Pipeline = Pipeline(
    stages=[
        features_asm,
        gc_svm_graph_builder,
        gen_clf
])


In [None]:
generic_model_bert = clf_Pipeline.fit(trainingData_with_embeddings)

TF Graph Builder configuration:
Model name: svm_classifier
Graph folder: gc_graph
Graph file name: svm_graph.pb
Build params: {'input_dim': 768, 'output_dim': 2, 'hidden_layers': [], 'output_act': 'sigmoid', 'loss_func': 'hinge'}
svm_classifier graph exported to gc_graph/svm_graph.pb


In [None]:
pred_df = generic_model_bert.transform(testData_with_embeddings)

In [None]:
preds_df = pred_df.select('category','prediction.result').toPandas()
preds_df['result'] = preds_df.result.apply(lambda x : x[0])

print (classification_report(preds_df['category'], preds_df['result']))

              precision    recall  f1-score   support

         neg       0.94      0.89      0.92      1028
         pos       0.76      0.86      0.80       408

    accuracy                           0.88      1436
   macro avg       0.85      0.87      0.86      1436
weighted avg       0.89      0.88      0.88      1436



In [None]:
res = classification_report(preds_df["category"], preds_df["result"], output_dict=True)
results_df.loc["GenericSVM_bert"] = [res["pos"]["f1-score"], res["accuracy"]]

## DocumentLogRegClassifier

In [None]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

normalizer = Normalizer() \
    .setInputCols(["token"]) \
    .setOutputCol("normalized")

stopwords_cleaner = StopWordsCleaner()\
    .setInputCols("normalized")\
    .setOutputCol("cleanTokens")\
    .setCaseSensitive(False)

stemmer = Stemmer() \
    .setInputCols(["cleanTokens"]) \
    .setOutputCol("stem")

logreg = DocumentLogRegClassifierApproach()\
    .setInputCols("stem")\
    .setLabelCol("category")\
    .setOutputCol("prediction")\
    .setMaxIter(10)\
    .setTol(1e-6)

clf_Pipeline = Pipeline(
    stages=[
        document_assembler,
        tokenizer,
        normalizer,
        stopwords_cleaner,
        stemmer,
        logreg
])

doclogreg_model = clf_Pipeline.fit(trainingData)

In [None]:
pred_df = doclogreg_model.transform(testData)

preds_df = pred_df.select('category','prediction.result').toPandas()
preds_df['result'] = preds_df.result.apply(lambda x : x[0])

print (classification_report(preds_df['category'], preds_df['result']))

              precision    recall  f1-score   support

         neg       0.88      0.87      0.87      1028
         pos       0.68      0.70      0.69       408

    accuracy                           0.82      1436
   macro avg       0.78      0.78      0.78      1436
weighted avg       0.82      0.82      0.82      1436



In [None]:
res = classification_report(preds_df["category"], preds_df["result"], output_dict=True)
results_df.loc["DocumentLogRegClassifier"] = [res["pos"]["f1-score"], res["accuracy"]]

## Comparison of All Models' Results on the ADE Dataset

Here are the results of all classifier models. Only the F1 score of the positive (ADE) class and the overall accuracy are reported.

In [None]:
results_df

| Model | pos-f1-score | accuracy |
|---|---|---|
| ClassifierDL_100d | 0.723295 | 0.850279 |
| MultiClassifierDL_100d | 0.709347 | 0.841922 |
| GenericClassifier_100d | 0.675862 | 0.770891 |
| GenericLogReg_100d | 0.513932 | 0.781337 |
| GenericSVM_100d | 0.530735 | 0.782033 |
| ClassifierDL_200d | 0.75 | 0.867688 |
| MultiClassifierDL_200d | 0.752454 | 0.841226 |
| GenericClassifier_200d | 0.75895 | 0.859331 |
| GenericLogReg_200d | 0.641407 | 0.81546 |
| GenericSVM_200d | 0.623596 | 0.81337 |
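
To read the comparison more easily, the rows can be ranked by either metric. The sketch below copies the values shown in the output above into a plain dictionary and sorts them; it covers only the rows displayed there. With the live DataFrame, `results_df.sort_values("pos-f1-score", ascending=False)` should give the same ordering.

```python
# (pos-f1-score, accuracy) per model, copied from the displayed output.
results = {
    "ClassifierDL_100d": (0.723295, 0.850279),
    "MultiClassifierDL_100d": (0.709347, 0.841922),
    "GenericClassifier_100d": (0.675862, 0.770891),
    "GenericLogReg_100d": (0.513932, 0.781337),
    "GenericSVM_100d": (0.530735, 0.782033),
    "ClassifierDL_200d": (0.75, 0.867688),
    "MultiClassifierDL_200d": (0.752454, 0.841226),
    "GenericClassifier_200d": (0.75895, 0.859331),
    "GenericLogReg_200d": (0.641407, 0.81546),
    "GenericSVM_200d": (0.623596, 0.81337),
}

# Rank by positive-class F1, highest first.
ranked_by_f1 = sorted(results.items(), key=lambda item: item[1][0], reverse=True)
best_by_accuracy = max(results, key=lambda name: results[name][1])

for name, (pos_f1, accuracy) in ranked_by_f1[:3]:
    print(f"{name:<24} pos-f1={pos_f1:.3f} accuracy={accuracy:.3f}")
print("Best accuracy:", best_by_accuracy)
```

The two metrics do not agree on a single winner here: the best positive-class F1 and the best accuracy belong to different models, which is common on imbalanced data.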


# Mtsamples Dataset

❗ **PLEASE RE-START RUNTIME AND CONTINUE**

In [None]:
import json
import os

with open('spark_jsl.json') as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)
os.environ.update(license_keys)

In [None]:
import pandas as pd
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

import sparknlp
import sparknlp_jsl

from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp_jsl.base import *
from sparknlp_jsl.annotator import *

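from pyspark.sql import SparkSession
from pyspark.sql import functions as F  # F.array / F.col are used later to build the array-typed label column
from pyspark.ml import Pipeline,PipelineModel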

import warnings
warnings.filterwarnings('ignore')

params = {"spark.driver.memory":"24G",
          "spark.kryoserializer.buffer.max":"2000M",
          "spark.driver.maxResultSize":"2000M"}

spark = sparknlp_jsl.start(license_keys['SECRET'],params=params, gpu=True)

print ("Spark NLP Version :", sparknlp.version())
print ("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark

Spark NLP Version : 5.4.0
Spark NLP_JSL Version : 5.4.0


## Load Dataset

In [None]:
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/mtsamples_classifier.csv

In [None]:
spark_df = spark.read.csv("mtsamples_classifier.csv", header = True)

spark_df.show(10,truncate=100)

+----------------+----------------------------------------------------------------------------------------------------+
|        category|                                                                                                text|
+----------------+----------------------------------------------------------------------------------------------------+
|Gastroenterology| PROCEDURES PERFORMED: Colonoscopy. INDICATIONS: Renewed symptoms likely consistent with active f...|
|Gastroenterology| OPERATION 1. Ivor-Lewis esophagogastrectomy. 2. Feeding jejunostomy. 3. Placement of two right-s...|
|Gastroenterology| PREOPERATIVE DIAGNOSES: 1. Gastroesophageal reflux disease. 2. Chronic dyspepsia. POSTOPERATIVE ...|
|Gastroenterology| PROCEDURE: Colonoscopy. PREOPERATIVE DIAGNOSES: Rectal bleeding and perirectal abscess. POSTOPER...|
|Gastroenterology| PREOPERATIVE DIAGNOSIS: Right colon tumor. POSTOPERATIVE DIAGNOSES: 1. Right colon cancer. 2. As...|
|Gastroenterology| PREOPERATIVE DIAGNOSI

In [None]:
spark_df.printSchema()

root
 |-- category: string (nullable = true)
 |-- text: string (nullable = true)



In [None]:
spark_df.count()

638

In [None]:
spark_df.groupBy("category").count().show()

+----------------+-----+
|        category|count|
+----------------+-----+
|         Urology|  115|
|       Neurology|  143|
|      Orthopedic|  223|
|Gastroenterology|  157|
+----------------+-----+



In [None]:
(trainingData, testData) = spark_df.randomSplit([0.8, 0.2], seed = 42)

In [None]:
trainingData.groupBy("category").count().show()

+----------------+-----+
|        category|count|
+----------------+-----+
|         Urology|   95|
|       Neurology|  121|
|      Orthopedic|  188|
|Gastroenterology|  132|
+----------------+-----+



In [None]:
testData.groupBy("category").count().show()

+----------------+-----+
|        category|count|
+----------------+-----+
|         Urology|   20|
|       Neurology|   22|
|      Orthopedic|   35|
|Gastroenterology|   25|
+----------------+-----+
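
`randomSplit` samples rows at random rather than per class, so both the overall split ratio and the per-class ratios are only approximately 80/20. A quick check, using the counts printed above, shows how far the actual split deviates; this is a plain-Python sketch and the variable names are illustrative.

```python
# Counts copied from the groupBy outputs above.
total_counts = {"Urology": 115, "Neurology": 143, "Orthopedic": 223, "Gastroenterology": 157}
test_counts = {"Urology": 20, "Neurology": 22, "Orthopedic": 35, "Gastroenterology": 25}

# Fraction of each class that ended up in the test set.
test_share = {label: test_counts[label] / total_counts[label] for label in total_counts}
overall_share = sum(test_counts.values()) / sum(total_counts.values())

for label, share in sorted(test_share.items(), key=lambda item: item[1], reverse=True):
    print(f"{label:<17} {share:.1%}")
print(f"{'overall':<17} {overall_share:.1%}")
```

With only about 100 test documents, each class contributes 20 to 35 examples, so a single misclassification moves per-class scores by several points.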



## 100 Dimension Healthcare Embeddings (embeddings_healthcare_100d)



We will extract [healthcare_100d embeddings](https://nlp.johnsnowlabs.com/2020/05/29/embeddings_healthcare_100d_en.html) and use them to train the classification models.

In [None]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_healthcare_100d","en","clinical/models")\
    .setInputCols(["document","token"])\
    .setOutputCol("word_embeddings")

sentence_embeddings = SentenceEmbeddings() \
    .setInputCols(["document", "word_embeddings"]) \
    .setOutputCol("sentence_embeddings") \
    .setPoolingStrategy("AVERAGE")

embeddings_pipeline = Pipeline(
    stages = [
        document_assembler,
        tokenizer,
        word_embeddings,
        sentence_embeddings,
])

embeddings_healthcare_100d download started this may take some time.
Approximate size to download 475.8 MB
[OK!]
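
The `AVERAGE` pooling strategy set on `SentenceEmbeddings` above turns the per-token word vectors of a document into one fixed-size vector. The sketch below shows the idea in plain Python with toy 3-dimensional vectors; it illustrates element-wise averaging only and is not Spark NLP's implementation.

```python
def average_pool(token_vectors):
    """Element-wise mean of equally sized token vectors."""
    if not token_vectors:
        raise ValueError("at least one token vector is required")
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vector[i] for vector in token_vectors) / count for i in range(dim)]

# Three hypothetical token embeddings for one short document.
token_vectors = [
    [1.0, 2.0, 3.0],
    [3.0, 2.0, 1.0],
    [2.0, 2.0, 2.0],
]

doc_vector = average_pool(token_vectors)
print(doc_vector)
```

The pooled vector has the same dimension as the word embeddings, which is why the classifier input dimension follows the embedding model (100 here).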


In [None]:
trainingData_with_embeddings = embeddings_pipeline.fit(trainingData).transform(trainingData)
trainingData_with_embeddings = trainingData_with_embeddings.select("text","category","sentence_embeddings")
trainingData_with_embeddings.show(2,truncate=60)

+------------------------------------------------------------+----------------+------------------------------------------------------------+
|                                                        text|        category|                                         sentence_embeddings|
+------------------------------------------------------------+----------------+------------------------------------------------------------+
| ADMISSION DIAGNOSIS: Symptomatic cholelithiasis. DISCHAR...|Gastroenterology|[{sentence_embeddings, 0, 2228,  ADMISSION DIAGNOSIS: Sym...|
| ADMITTING DIAGNOSES: Hiatal hernia, gastroesophageal ref...|Gastroenterology|[{sentence_embeddings, 0, 3237,  ADMITTING DIAGNOSES: Hia...|
+------------------------------------------------------------+----------------+------------------------------------------------------------+
only showing top 2 rows



In [None]:
testData_with_embeddings = embeddings_pipeline.fit(testData).transform(testData)
testData_with_embeddings = testData_with_embeddings.select("text","category","sentence_embeddings")
testData_with_embeddings.show(2,truncate=60)

+------------------------------------------------------------+----------------+------------------------------------------------------------+
|                                                        text|        category|                                         sentence_embeddings|
+------------------------------------------------------------+----------------+------------------------------------------------------------+
| ADMITTING DIAGNOSIS: Gastrointestinal bleed. HISTORY OF ...|Gastroenterology|[{sentence_embeddings, 0, 3978,  ADMITTING DIAGNOSIS: Gas...|
| CHIEF COMPLAINT: Dysphagia and hematemesis while vomitin...|Gastroenterology|[{sentence_embeddings, 0, 6515,  CHIEF COMPLAINT: Dysphag...|
+------------------------------------------------------------+----------------+------------------------------------------------------------+
only showing top 2 rows



In [None]:
testData_with_embeddings.printSchema()

root
 |-- text: string (nullable = true)
 |-- category: string (nullable = true)
 |-- sentence_embeddings: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)



In [None]:
testData_with_embeddings.select(testData_with_embeddings.sentence_embeddings.embeddings).show(3,truncate=120)

+------------------------------------------------------------------------------------------------------------------------+
|                                                                                          sentence_embeddings.embeddings|
+------------------------------------------------------------------------------------------------------------------------+
|[[-0.010433162, 0.0127568655, 0.110687375, 0.1609855, -0.14818177, -0.050856464, 0.064748034, -0.019294674, -0.002178...|
|[[-0.0055608996, 0.018396137, 0.12305377, 0.13861515, -0.18651922, -0.07559964, 0.12978752, -0.0046452526, -0.0031408...|
|[[-0.019461514, 0.060639214, 0.11142021, 0.12826534, -0.16216077, -0.10416892, 0.17136306, -0.020363769, 0.01169337, ...|
+------------------------------------------------------------------------------------------------------------------------+
only showing top 3 rows



In [None]:
log_folder="Mt_logs_healthcare_100d"
!mkdir -p $log_folder

### ClassifierDL

In [None]:
classifier_dl = ClassifierDLApproach()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("prediction")\
    .setLabelColumn("category")\
    .setBatchSize(8)\
    .setMaxEpochs(50)\
    .setLr(0.005)\
    .setDropout(0.3)\
    .setEnableOutputLogs(True)\
    .setOutputLogsPath(log_folder)\
    # .setValidationSplit(0.1)

classifier_dl_pipeline = Pipeline(stages=[classifier_dl])

In [None]:
clfDL_model_hc100 = classifier_dl_pipeline.fit(trainingData_with_embeddings)

In [None]:
preds = clfDL_model_hc100.transform(testData_with_embeddings)

In [None]:
preds_df = preds.select("category","text","prediction.result").toPandas()
preds_df["result"] = preds_df["result"].apply(lambda x : x[0])

print (classification_report(preds_df["category"], preds_df["result"]))

                  precision    recall  f1-score   support

Gastroenterology       0.83      0.80      0.82        25
       Neurology       0.78      0.82      0.80        22
      Orthopedic       0.91      0.91      0.91        35
         Urology       0.75      0.75      0.75        20

        accuracy                           0.83       102
       macro avg       0.82      0.82      0.82       102
    weighted avg       0.83      0.83      0.83       102



In [None]:
res = classification_report(preds_df["category"], preds_df["result"], output_dict=True)
results_df = pd.DataFrame(columns=["macro-f1-score","weighted-f1-score","accuracy"])
results_df.loc["ClassifierDL_100d"] = [res["macro avg"]["f1-score"], res["weighted avg"]["f1-score"], res["accuracy"]]
results_df

| Model | macro-f1-score | weighted-f1-score | accuracy |
|---|---|---|---|
| ClassifierDL_100d | 0.820153 | 0.833413 | 0.833333 |
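
The two F1 columns differ only in how per-class scores are combined: the macro average treats every class equally, while the weighted average weights each class by its support. The sketch below recomputes both from the rounded per-class values in the classification report above, so the results match the table only to about two decimals.

```python
# (f1-score, support) per class, copied from the ClassifierDL report above.
per_class = {
    "Gastroenterology": (0.82, 25),
    "Neurology": (0.80, 22),
    "Orthopedic": (0.91, 35),
    "Urology": (0.75, 20),
}

# Macro average: unweighted mean over classes.
macro_f1 = sum(f1 for f1, _ in per_class.values()) / len(per_class)

# Weighted average: mean over classes weighted by number of test examples.
total_support = sum(support for _, support in per_class.values())
weighted_f1 = sum(f1 * support for f1, support in per_class.values()) / total_support

print(f"macro F1:    {macro_f1:.3f}")
print(f"weighted F1: {weighted_f1:.3f}")
```

Because Orthopedic is both the largest class and the best-scoring one, the weighted average comes out slightly above the macro average.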


### MultiClassifierDL

In [None]:
# MultiClassifierDL expects the label column to be an array of strings, so we wrap each category in a one-element array.
trainingData_with_embeddings = trainingData_with_embeddings.withColumn("category_array", F.array(F.col("category")))
testData_with_embeddings = testData_with_embeddings.withColumn("category_array", F.array(F.col("category")))

In [None]:
multiClassifier = MultiClassifierDLApproach()\
    .setInputCols("sentence_embeddings")\
    .setOutputCol("prediction")\
    .setLabelColumn("category_array")\
    .setBatchSize(8)\
    .setMaxEpochs(40)\
    .setLr(5e-3)\
    .setThreshold(0.5)\
    .setShufflePerEpoch(False)\
    .setEnableOutputLogs(True)\
    .setOutputLogsPath(log_folder)\
  #   .setValidationSplit(0.1)

multiClassifier_pipeline = Pipeline(
    stages = [
        multiClassifier
    ])

In [None]:
multiClassifier_model_hc100 = multiClassifier_pipeline.fit(trainingData_with_embeddings)

In [None]:
preds = multiClassifier_model_hc100.transform(testData_with_embeddings)

In [None]:
preds_df = preds.select("category","text","prediction.result","prediction.metadata").toPandas()
preds_df.result.apply(lambda x: str(x) ).value_counts()

result
['Orthopedic']                        35
['Urology']                           21
['Gastroenterology']                  19
['Neurology']                         17
[]                                     5
['Neurology', 'Orthopedic']            4
['Orthopedic', 'Gastroenterology']     1
Name: count, dtype: int64

In [None]:
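# Keep the highest-scoring label as the single prediction for each row.
# Rows with no predicted label get an empty string and are dropped, so the report below covers fewer rows than the test set.
# The number of empty results can be controlled with setThreshold() during training.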
preds_df["scores"] = preds_df.metadata.apply(lambda x: {k:float(v) for k,v in x[0].items()} if len(x)>=1 else "")
preds_df["result"] = preds_df.scores.apply(lambda x: max(x, key=x.get) if len(x)>=1 else "")
preds_df = preds_df[(preds_df["result"]!="")]
preds_df["result"].value_counts()

result
Orthopedic          36
Urology             21
Gastroenterology    20
Neurology           20
Name: count, dtype: int64
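
The interplay between the threshold and the "highest score" rule can be sketched in plain Python. The scores below are hypothetical, the helper names are illustrative, and whether the library compares with `>=` or `>` is an assumption; the point is only that a threshold can yield zero, one or several labels, while taking the maximum always yields one label when scores are available.

```python
def labels_above_threshold(scores, threshold):
    """Multi-label view: every label whose score reaches the threshold."""
    return [label for label, score in scores.items() if score >= threshold]

def top_label(scores):
    """Single-label view: the best-scoring label, or '' if there are no scores."""
    return max(scores, key=scores.get) if scores else ""

# Hypothetical per-label scores for one document.
scores = {"Neurology": 0.62, "Orthopedic": 0.55, "Urology": 0.03, "Gastroenterology": 0.01}

print(labels_above_threshold(scores, 0.5))  # two labels pass
print(labels_above_threshold(scores, 0.7))  # no label passes
print(top_label(scores))                    # single best label
print(repr(top_label({})))                  # nothing to choose from
```

In the cell above, rows that end up with the empty string are filtered out before the classification report is computed.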

In [None]:
print (classification_report(preds_df["category"], preds_df["result"]))

                  precision    recall  f1-score   support

Gastroenterology       0.95      0.86      0.90        22
       Neurology       0.80      0.76      0.78        21
      Orthopedic       0.89      0.91      0.90        35
         Urology       0.90      1.00      0.95        19

        accuracy                           0.89        97
       macro avg       0.89      0.88      0.88        97
    weighted avg       0.89      0.89      0.89        97



In [None]:
res = classification_report(preds_df["category"], preds_df["result"], output_dict=True)
results_df.loc["MultiClassifierDL_100d"] = [res["macro avg"]["f1-score"], res["weighted avg"]["f1-score"], res["accuracy"]]

### Generic Classifier

In [None]:
from sparknlp_jsl.annotator import TFGraphBuilder

graph_folder = "gc_graph"

gc_graph_builder = TFGraphBuilder()\
    .setModelName("generic_classifier")\
    .setInputCols(["features"])\
    .setLabelColumn("category")\
    .setHiddenLayers([300,200, 50])\
    .setHiddenAct("tanh")\
    .setHiddenActL2(True)\
    .setBatchNorm(True)\
    .setGraphFolder(graph_folder)\
    .setGraphFile("gcf_graph_100d.pb")

In [None]:
features_asm = FeaturesAssembler()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("features")

gen_clf = GenericClassifierApproach()\
    .setLabelColumn("category")\
    .setInputCols("features")\
    .setOutputCol("prediction")\
    .setModelFile(f"{graph_folder}/gcf_graph_100d.pb")\
    .setEpochsNumber(50)\
    .setBatchSize(32)\
    .setLearningRate(0.001)\
    .setOutputLogsPath(log_folder)\
    .setDropout(0.3)\
    .setFixImbalance(False)\
    # .setValidationSplit(0.1)

clf_Pipeline = Pipeline(
    stages=[
        features_asm,
        gc_graph_builder,
        gen_clf
])

In [None]:
generic_model_hc100 = clf_Pipeline.fit(trainingData_with_embeddings)

TF Graph Builder configuration:
Model name: generic_classifier
Graph folder: gc_graph
Graph file name: gcf_graph_100d.pb
Build params: {'input_dim': 100, 'output_dim': 4, 'hidden_layers': [300, 200, 50], 'hidden_act': 'tanh', 'hidden_act_l2': True, 'batch_norm': True}


Instructions for updating:
Colocations handled automatically by placer.


generic_classifier graph exported to gc_graph/gcf_graph_100d.pb


In [None]:
pred_df = generic_model_hc100.transform(testData_with_embeddings)

In [None]:
preds_df = pred_df.select('category','prediction.result').toPandas()
preds_df['result'] = preds_df.result.apply(lambda x : x[0])

print (classification_report(preds_df['category'], preds_df['result']))

                  precision    recall  f1-score   support

Gastroenterology       0.89      0.96      0.92        25
       Neurology       0.89      0.77      0.83        22
      Orthopedic       0.92      0.94      0.93        35
         Urology       0.90      0.90      0.90        20

        accuracy                           0.90       102
       macro avg       0.90      0.89      0.90       102
    weighted avg       0.90      0.90      0.90       102



In [None]:
res = classification_report(preds_df["category"], preds_df["result"], output_dict=True)
results_df.loc["GenericClassifier_100d"] = [res["macro avg"]["f1-score"], res["weighted avg"]["f1-score"], res["accuracy"]]

### GenericLogRegClassifier

In [None]:
from sparknlp_jsl.annotator import TFGraphBuilder

graph_folder = "gc_graph"

gc_logreg_graph_builder = TFGraphBuilder()\
    .setModelName("logreg_classifier")\
    .setInputCols(["feature_vector"]) \
    .setLabelColumn("category")\
    .setGraphFolder(graph_folder)\
    .setGraphFile("log_reg_graph.pb")

GenericLogRegClassifier needs outputs from FeaturesAssembler. The FeaturesAssembler is used to collect features from different columns or an embeddings column.

The GenericLogRegClassifier takes FEATURE_VECTOR annotations as input, classifies them and outputs CATEGORY annotations.

In [None]:
features_asm = FeaturesAssembler()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("feature_vector")

gen_clf = GenericLogRegClassifierApproach()\
    .setLabelColumn("category")\
    .setInputCols("feature_vector")\
    .setOutputCol("prediction")\
    .setModelFile(f"{graph_folder}/log_reg_graph.pb")\
    .setEpochsNumber(45)\
    .setBatchSize(32)\
    .setLearningRate(0.02)\
    .setOutputLogsPath(log_folder)\
    .setDropout(0.1)\
    .setFixImbalance(True)\
    # .setValidationSplit(0.1)

clf_Pipeline = Pipeline(
    stages=[
        features_asm,
        gc_logreg_graph_builder,
        gen_clf
])


In [None]:
generic_model_hc100 = clf_Pipeline.fit(trainingData_with_embeddings)

TF Graph Builder configuration:
Model name: logreg_classifier
Graph folder: gc_graph
Graph file name: log_reg_graph.pb
Build params: {'input_dim': 100, 'output_dim': 4, 'hidden_layers': [], 'output_act': 'sigmoid'}
logreg_classifier graph exported to gc_graph/log_reg_graph.pb


In [None]:
pred_df = generic_model_hc100.transform(testData_with_embeddings)

In [None]:
preds_df = pred_df.select('category','prediction.result').toPandas()
preds_df['result'] = preds_df.result.apply(lambda x : x[0])

print (classification_report(preds_df['category'], preds_df['result']))

                  precision    recall  f1-score   support

Gastroenterology       0.91      0.84      0.87        25
       Neurology       0.81      0.77      0.79        22
      Orthopedic       0.84      0.91      0.88        35
         Urology       0.85      0.85      0.85        20

        accuracy                           0.85       102
       macro avg       0.85      0.84      0.85       102
    weighted avg       0.85      0.85      0.85       102



In [None]:
res = classification_report(preds_df["category"], preds_df["result"], output_dict=True)
results_df.loc["GenericLogReg_100d"] = [res["macro avg"]["f1-score"], res["weighted avg"]["f1-score"], res["accuracy"]]

### GenericSVMClassifier

In [None]:
from sparknlp_jsl.annotator import TFGraphBuilder

graph_folder = "gc_graph"

gc_svm_graph_builder = TFGraphBuilder()\
    .setModelName("svm_classifier")\
    .setInputCols(["feature_vector"]) \
    .setLabelColumn("category")\
    .setGraphFolder(graph_folder)\
    .setGraphFile("svm_graph.pb")

GenericSVMClassifier needs outputs from FeaturesAssembler. The FeaturesAssembler is used to collect features from different columns or an embeddings column.

The GenericSVMClassifier takes FEATURE_VECTOR annotations as input, classifies them and outputs CATEGORY annotations.
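
Conceptually, assembling features means flattening the values of several input columns into one numeric vector. The sketch below illustrates that idea in plain Python; the function `assemble_features`, the row layout and the extra columns `age` and `num_tokens` are hypothetical, and this is not how `FeaturesAssembler` is implemented.

```python
def assemble_features(row, input_cols):
    """Concatenate scalar and list-valued columns into one flat float vector."""
    vector = []
    for col in input_cols:
        value = row[col]
        if isinstance(value, (list, tuple)):
            # Embedding-like columns contribute all of their components.
            vector.extend(float(v) for v in value)
        else:
            # Scalar columns contribute a single component.
            vector.append(float(value))
    return vector

# Hypothetical row: a tiny embedding plus two numeric columns.
row = {"sentence_embeddings": [0.1, -0.2, 0.3], "age": 54, "num_tokens": 120}

feature_vector = assemble_features(row, ["sentence_embeddings", "age", "num_tokens"])
print(feature_vector)
```

In this notebook only the `sentence_embeddings` column is passed in, so the feature vector is simply the pooled embedding.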

In [None]:
features_asm = FeaturesAssembler()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("feature_vector")

gen_clf = GenericSVMClassifierApproach()\
    .setLabelColumn("category")\
    .setInputCols("feature_vector")\
    .setOutputCol("prediction")\
    .setModelFile(f"{graph_folder}/svm_graph.pb")\
    .setEpochsNumber(45)\
    .setBatchSize(32)\
    .setLearningRate(0.02)\
    .setOutputLogsPath(log_folder)\
    .setDropout(0.1)\
    .setFixImbalance(True)\
    # .setValidationSplit(0.1)

clf_Pipeline = Pipeline(
    stages=[
        features_asm,
        gc_svm_graph_builder,
        gen_clf
])

In [None]:
generic_model_hc100 = clf_Pipeline.fit(trainingData_with_embeddings)

TF Graph Builder configuration:
Model name: svm_classifier
Graph folder: gc_graph
Graph file name: svm_graph.pb
Build params: {'input_dim': 100, 'output_dim': 4, 'hidden_layers': [], 'output_act': 'sigmoid', 'loss_func': 'hinge'}
svm_classifier graph exported to gc_graph/svm_graph.pb


In [None]:
pred_df = generic_model_hc100.transform(testData_with_embeddings)

In [None]:
preds_df = pred_df.select('category','prediction.result').toPandas()
preds_df['result'] = preds_df.result.apply(lambda x : x[0])

print (classification_report(preds_df['category'], preds_df['result']))

                  precision    recall  f1-score   support

Gastroenterology       0.88      0.88      0.88        25
       Neurology       0.86      0.82      0.84        22
      Orthopedic       0.94      0.91      0.93        35
         Urology       0.82      0.90      0.86        20

        accuracy                           0.88       102
       macro avg       0.87      0.88      0.88       102
    weighted avg       0.88      0.88      0.88       102



In [None]:
res = classification_report(preds_df["category"], preds_df["result"], output_dict=True)
results_df.loc["GenericSVM_100d"] = [res["macro avg"]["f1-score"], res["weighted avg"]["f1-score"], res["accuracy"]]

## 200 Dimension Healthcare Embeddings (embeddings_clinical)



Now we will extract [embeddings_clinical](https://nlp.johnsnowlabs.com/2020/01/28/embeddings_clinical_en.html) embeddings, which have 200 dimensions, and use them for model training.

In [None]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")\
    .setInputCols(["document","token"])\
    .setOutputCol("word_embeddings")

sentence_embeddings = SentenceEmbeddings() \
    .setInputCols(["document", "word_embeddings"]) \
    .setOutputCol("sentence_embeddings") \
    .setPoolingStrategy("AVERAGE")

embeddings_pipeline = Pipeline(
    stages = [
        document_assembler,
        tokenizer,
        word_embeddings,
        sentence_embeddings,
])

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]


In [None]:
trainingData_with_embeddings = embeddings_pipeline.fit(trainingData).transform(trainingData)\
                                                  .select("text","category","sentence_embeddings")
trainingData_with_embeddings.show(2,truncate=60)

+------------------------------------------------------------+----------------+------------------------------------------------------------+
|                                                        text|        category|                                         sentence_embeddings|
+------------------------------------------------------------+----------------+------------------------------------------------------------+
| ADMISSION DIAGNOSIS: Symptomatic cholelithiasis. DISCHAR...|Gastroenterology|[{sentence_embeddings, 0, 2228,  ADMISSION DIAGNOSIS: Sym...|
| ADMITTING DIAGNOSES: Hiatal hernia, gastroesophageal ref...|Gastroenterology|[{sentence_embeddings, 0, 3237,  ADMITTING DIAGNOSES: Hia...|
+------------------------------------------------------------+----------------+------------------------------------------------------------+
only showing top 2 rows



In [None]:
testData_with_embeddings = embeddings_pipeline.fit(testData).transform(testData)\
                                                  .select("text","category","sentence_embeddings")
testData_with_embeddings.show(2,truncate=60)

+------------------------------------------------------------+----------------+------------------------------------------------------------+
|                                                        text|        category|                                         sentence_embeddings|
+------------------------------------------------------------+----------------+------------------------------------------------------------+
| ADMITTING DIAGNOSIS: Gastrointestinal bleed. HISTORY OF ...|Gastroenterology|[{sentence_embeddings, 0, 3978,  ADMITTING DIAGNOSIS: Gas...|
| CHIEF COMPLAINT: Dysphagia and hematemesis while vomitin...|Gastroenterology|[{sentence_embeddings, 0, 6515,  CHIEF COMPLAINT: Dysphag...|
+------------------------------------------------------------+----------------+------------------------------------------------------------+
only showing top 2 rows



In [None]:
log_folder="Mt_logs_healthcare_200d"
!mkdir -p $log_folder

### ClassifierDL

In [None]:
classifier_dl = ClassifierDLApproach()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("class")\
    .setLabelColumn("category")\
    .setBatchSize(32)\
    .setMaxEpochs(30)\
    .setLr(0.01)\
    .setDropout(0.1)\
    .setEnableOutputLogs(True)\
    .setOutputLogsPath(log_folder)

classifier_dl_pipeline = Pipeline(stages=[classifier_dl])

In [None]:
clfDL_model_hc200 = classifier_dl_pipeline.fit(trainingData_with_embeddings)

In [None]:
preds = clfDL_model_hc200.transform(testData_with_embeddings)

preds_df = preds.select("category","text","class.result").toPandas()
preds_df["result"] = preds_df["result"].apply(lambda x : x[0])

print (classification_report(preds_df["category"], preds_df["result"]))

                  precision    recall  f1-score   support

Gastroenterology       0.00      0.00      0.00        25
       Neurology       0.00      0.00      0.00        22
      Orthopedic       0.00      0.00      0.00        35
         Urology       0.20      1.00      0.33        20

        accuracy                           0.20       102
       macro avg       0.05      0.25      0.08       102
    weighted avg       0.04      0.20      0.06       102



In [None]:
res = classification_report(preds_df["category"], preds_df["result"], output_dict=True)
results_df.loc["ClassifierDL_200d"] = [res["macro avg"]["f1-score"], res["weighted avg"]["f1-score"], res["accuracy"]]
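
The scores stored above can be recomputed by hand as a sanity check. The report shows recall 1.00 for Urology and 0.00 for every other class, which is the pattern you get when all 102 test documents are predicted as Urology (precision 20/102 ≈ 0.20). The sketch below assumes exactly that pattern, inferred from the report rather than read from the model, and uses plain Python instead of scikit-learn.

```python
# Per-class support taken from the classification report above.
support = {"Gastroenterology": 25, "Neurology": 22, "Orthopedic": 35, "Urology": 20}
total = sum(support.values())  # 102 test documents

# Assumption (inferred from the report): every document is predicted as "Urology".
predicted_label = "Urology"

def f1_for(label):
    """F1 of one class when all documents receive `predicted_label`."""
    true_pos = support[label] if label == predicted_label else 0
    predicted = total if label == predicted_label else 0
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / support[label]
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

f1 = {label: f1_for(label) for label in support}

# Macro F1: unweighted mean over classes. Weighted F1: mean weighted by support.
macro_f1 = sum(f1.values()) / len(f1)
weighted_f1 = sum(f1[label] * support[label] for label in support) / total
accuracy = support[predicted_label] / total

print(round(macro_f1, 6), round(weighted_f1, 6), round(accuracy, 6))
# 0.081967 0.064288 0.196078
```

These values agree with the `ClassifierDL_200d` scores in the final comparison table. A macro F1 this far below the other models suggests that this training run collapsed to a single class, so its hyperparameters (for example the learning rate) may be worth revisiting before drawing conclusions about the embeddings.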

### MultiClassifierDL



In [None]:
# MultiClassifierDL expects the label column to be a list of strings, so we convert the label column to an array type.
trainingData_with_embeddings = trainingData_with_embeddings.withColumn("category_array", F.array(F.col("category")))
testData_with_embeddings = testData_with_embeddings.withColumn("category_array", F.array(F.col("category")))

In [None]:
multiClassifier = MultiClassifierDLApproach()\
    .setInputCols("sentence_embeddings")\
    .setOutputCol("prediction")\
    .setLabelColumn("category_array")\
    .setBatchSize(16)\
    .setMaxEpochs(20)\
    .setLr(9e-3)\
    .setThreshold(0.2)\
    .setShufflePerEpoch(False)\
    .setEnableOutputLogs(True)\
    .setOutputLogsPath(log_folder)\
  #   .setValidationSplit(0.1)

multiClassifier_pipeline = Pipeline(
    stages = [
        multiClassifier
])

In [None]:
multiClassifier_model_hc200 = multiClassifier_pipeline.fit(trainingData_with_embeddings)

In [None]:
preds = multiClassifier_model_hc200.transform(testData_with_embeddings)

In [None]:
preds_df = preds.select("category","text","prediction.result","prediction.metadata").toPandas()
preds_df.result.apply(lambda x: str(x) ).value_counts()

result
['Orthopedic']                        32
['Gastroenterology']                  19
['Urology']                           18
['Neurology']                         13
['Neurology', 'Orthopedic']           11
[]                                     4
['Orthopedic', 'Gastroenterology']     3
['Urology', 'Orthopedic']              1
['Urology', 'Gastroenterology']        1
Name: count, dtype: int64

In [None]:
# Keep only the highest-scoring label for each document and drop documents with no predicted label.
# The number of empty predictions can be controlled with setThreshold() at training time.
preds_df["scores"] = preds_df.metadata.apply(lambda x: {k:float(v) for k,v in x[0].items()} if len(x)>=1 else "")
preds_df["result"] = preds_df.scores.apply(lambda x: max(x, key=x.get) if len(x)>=1 else "")
preds_df = preds_df[(preds_df["result"]!="")]
preds_df["result"].value_counts()

result
Orthopedic          38
Gastroenterology    22
Urology             20
Neurology           18
Name: count, dtype: int64
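
The cell above reduces each multi-label prediction to a single label. Here is a minimal sketch of that step on hand-written data. It assumes each row's metadata is a list whose first element maps every label to a score string, and the scores are made up for illustration.

```python
# Hypothetical metadata for three documents, in the shape the cell above
# reads from `prediction.metadata`. Scores are illustrative only.
metadata_rows = [
    [{"Neurology": "0.61", "Orthopedic": "0.35", "Urology": "0.02", "Gastroenterology": "0.02"}],
    [{"Neurology": "0.10", "Orthopedic": "0.15", "Urology": "0.70", "Gastroenterology": "0.05"}],
    [],  # no label passed the threshold, so there is no metadata entry
]

def top_label(metadata):
    """Return the highest-scoring label, or "" when nothing was predicted."""
    if len(metadata) < 1:
        return ""
    scores = {label: float(score) for label, score in metadata[0].items()}
    return max(scores, key=scores.get)

results = [top_label(row) for row in metadata_rows]
kept = [label for label in results if label != ""]  # empty predictions are dropped

print(results)  # ['Neurology', 'Urology', '']
print(kept)     # ['Neurology', 'Urology']
```

In this run four test documents had an empty prediction, which is why the label counts above sum to 98 rather than 102.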

In [None]:
print (classification_report(preds_df["category"], preds_df["result"]))

                  precision    recall  f1-score   support

Gastroenterology       1.00      0.92      0.96        24
       Neurology       0.89      0.80      0.84        20
      Orthopedic       0.87      0.94      0.90        35
         Urology       0.90      0.95      0.92        19

        accuracy                           0.91        98
       macro avg       0.91      0.90      0.91        98
    weighted avg       0.91      0.91      0.91        98



In [None]:
res = classification_report(preds_df["category"], preds_df["result"], output_dict=True)
results_df.loc["MultiClassifierDL_200d"] = [res["macro avg"]["f1-score"], res["weighted avg"]["f1-score"], res["accuracy"]]
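
The MultiClassifierDL scores are computed on 98 documents, while the other classifiers in this notebook are evaluated on all 102. The short calculation below shows how much the accuracy changes if the dropped documents are counted as errors. That treatment is an assumption made here for comparison and is not part of the original evaluation.

```python
# Figures taken from the outputs in this notebook.
test_size = 102               # support in the other classification reports
evaluated = 98                # support after dropping empty predictions
reported_accuracy = 0.908163  # MultiClassifierDL_200d accuracy on the 98 kept rows

correct = round(reported_accuracy * evaluated)  # correctly labelled documents
dropped = test_size - evaluated                 # documents with no predicted label

# Assumption: treat every dropped document as a misclassification.
accuracy_full = correct / test_size

print(correct, dropped, round(accuracy_full, 4))
# 89 4 0.8725
```

Under that assumption the accuracy falls from about 0.91 to about 0.87, so keep the smaller evaluation set in mind when comparing this row with the others.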

### Generic Classifier

In [None]:
# !pip install -q tensorflow==2.7.0 tensorflow_addons

In [None]:
from sparknlp_jsl.annotator import TFGraphBuilder

graph_folder = "gc_graph"

gc_graph_builder = TFGraphBuilder()\
    .setModelName("generic_classifier")\
    .setInputCols(["features"])\
    .setLabelColumn("category")\
    .setHiddenLayers([300,200, 50])\
    .setHiddenAct("tanh")\
    .setHiddenActL2(True)\
    .setBatchNorm(True)\
    .setGraphFolder(graph_folder)\
    .setGraphFile("gcf_graph_200d.pb")

In [None]:
features_asm = FeaturesAssembler()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("features")

gen_clf = GenericClassifierApproach()\
    .setLabelColumn("category")\
    .setInputCols("features")\
    .setOutputCol("prediction")\
    .setModelFile(f"{graph_folder}/gcf_graph_200d.pb")\
    .setEpochsNumber(30)\
    .setBatchSize(32)\
    .setLearningRate(0.001)\
    .setOutputLogsPath(log_folder)\
    .setDropout(0.3)\
    .setFixImbalance(True)\
    # .setValidationSplit(0.1)

clf_Pipeline = Pipeline(
    stages=[
        features_asm,
        gc_graph_builder,
        gen_clf
])

In [None]:
generic_model_hc200 = clf_Pipeline.fit(trainingData_with_embeddings)

TF Graph Builder configuration:
Model name: generic_classifier
Graph folder: gc_graph
Graph file name: gcf_graph_200d.pb
Build params: {'input_dim': 200, 'output_dim': 4, 'hidden_layers': [300, 200, 50], 'hidden_act': 'tanh', 'hidden_act_l2': True, 'batch_norm': True}
generic_classifier graph exported to gc_graph/gcf_graph_200d.pb


In [None]:
pred_df = generic_model_hc200.transform(testData_with_embeddings)

preds_df = pred_df.select('category','prediction.result').toPandas()
preds_df['result'] = preds_df.result.apply(lambda x : x[0])

print (classification_report(preds_df['category'], preds_df['result']))

                  precision    recall  f1-score   support

Gastroenterology       0.85      0.88      0.86        25
       Neurology       0.90      0.82      0.86        22
      Orthopedic       0.89      0.94      0.92        35
         Urology       0.89      0.85      0.87        20

        accuracy                           0.88       102
       macro avg       0.88      0.87      0.88       102
    weighted avg       0.88      0.88      0.88       102



In [None]:
res = classification_report(preds_df["category"], preds_df["result"], output_dict=True)
results_df.loc["GenericClassifier_200d"] = [res["macro avg"]["f1-score"], res["weighted avg"]["f1-score"], res["accuracy"]]

### GenericLogRegClassifier

In [None]:
from sparknlp_jsl.annotator import TFGraphBuilder

graph_folder = "gc_graph"

gc_logreg_graph_builder = TFGraphBuilder()\
    .setModelName("logreg_classifier")\
    .setInputCols(["feature_vector"]) \
    .setLabelColumn("category")\
    .setGraphFolder(graph_folder)\
    .setGraphFile("log_reg_graph.pb")

GenericLogRegClassifier needs the output of FeaturesAssembler, which collects features from different columns, or from an embeddings column, into a single feature vector.

The GenericLogRegClassifier takes FEATURE_VECTOR annotations as input, classifies them and outputs CATEGORY annotations.
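
As a conceptual illustration only, collecting features from several columns can be pictured as concatenating them into one flat vector. The function below is a plain-Python stand-in written for this explanation. It is not the FeaturesAssembler implementation, and the extra numeric columns are hypothetical.

```python
# Conceptual sketch only: this is NOT the FeaturesAssembler implementation.
def assemble_features(*columns):
    """Concatenate scalar values and vector values into one flat feature vector."""
    features = []
    for value in columns:
        if isinstance(value, (list, tuple)):
            features.extend(float(v) for v in value)
        else:
            features.append(float(value))
    return features

sentence_embedding = [0.12, -0.40, 0.33]  # stand-in for a 200-dimensional embedding
age, weight = 54, 81.5                    # hypothetical extra numeric columns

only_embedding = assemble_features(sentence_embedding)
with_extras = assemble_features(sentence_embedding, age, weight)

print(len(only_embedding), len(with_extras))  # 3 5
```

In this notebook only the `sentence_embeddings` column is assembled, which is consistent with the graph builder reporting `'input_dim': 200` when the pipeline is fit.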

In [None]:
features_asm = FeaturesAssembler()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("feature_vector")

gen_clf = GenericLogRegClassifierApproach()\
    .setLabelColumn("category")\
    .setInputCols("feature_vector")\
    .setOutputCol("prediction")\
    .setModelFile(f"{graph_folder}/log_reg_graph.pb")\
    .setEpochsNumber(45)\
    .setBatchSize(32)\
    .setLearningRate(0.02)\
    .setOutputLogsPath(log_folder)\
    .setDropout(0.1)\
    .setFixImbalance(True)\
    # .setValidationSplit(0.1)

clf_Pipeline = Pipeline(
    stages=[
        features_asm,
        gc_logreg_graph_builder,
        gen_clf
])


In [None]:
generic_model_hc200 = clf_Pipeline.fit(trainingData_with_embeddings)

TF Graph Builder configuration:
Model name: logreg_classifier
Graph folder: gc_graph
Graph file name: log_reg_graph.pb
Build params: {'input_dim': 200, 'output_dim': 4, 'hidden_layers': [], 'output_act': 'sigmoid'}
logreg_classifier graph exported to gc_graph/log_reg_graph.pb


In [None]:
pred_df = generic_model_hc200.transform(testData_with_embeddings)

In [None]:
preds_df = pred_df.select('category','prediction.result').toPandas()
preds_df['result'] = preds_df.result.apply(lambda x : x[0])

print (classification_report(preds_df['category'], preds_df['result']))

                  precision    recall  f1-score   support

Gastroenterology       0.81      0.84      0.82        25
       Neurology       0.83      0.68      0.75        22
      Orthopedic       0.84      0.91      0.88        35
         Urology       0.80      0.80      0.80        20

        accuracy                           0.82       102
       macro avg       0.82      0.81      0.81       102
    weighted avg       0.82      0.82      0.82       102



In [None]:
res = classification_report(preds_df["category"], preds_df["result"], output_dict=True)
results_df.loc["GenericLogReg_200d"] = [res["macro avg"]["f1-score"], res["weighted avg"]["f1-score"], res["accuracy"]]

### GenericSVMClassifier

In [None]:
from sparknlp_jsl.annotator import TFGraphBuilder

graph_folder = "gc_graph"

gc_svm_graph_builder = TFGraphBuilder()\
    .setModelName("svm_classifier")\
    .setInputCols(["feature_vector"]) \
    .setLabelColumn("category")\
    .setGraphFolder(graph_folder)\
    .setGraphFile("svm_graph.pb")

GenericSVMClassifier needs the output of FeaturesAssembler, which collects features from different columns, or from an embeddings column, into a single feature vector.

The GenericSVMClassifier takes FEATURE_VECTOR annotations as input, classifies them and outputs CATEGORY annotations.

In [None]:
features_asm = FeaturesAssembler()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("feature_vector")

gen_clf = GenericSVMClassifierApproach()\
    .setLabelColumn("category")\
    .setInputCols("feature_vector")\
    .setOutputCol("prediction")\
    .setModelFile(f"{graph_folder}/svm_graph.pb")\
    .setEpochsNumber(45)\
    .setBatchSize(32)\
    .setLearningRate(0.02)\
    .setOutputLogsPath(log_folder)\
    .setDropout(0.1)\
    .setFixImbalance(True)\
    # .setValidationSplit(0.1)

clf_Pipeline = Pipeline(
    stages=[
        features_asm,
        gc_svm_graph_builder,
        gen_clf
])


In [None]:
generic_model_hc200 = clf_Pipeline.fit(trainingData_with_embeddings)

TF Graph Builder configuration:
Model name: svm_classifier
Graph folder: gc_graph
Graph file name: svm_graph.pb
Build params: {'input_dim': 200, 'output_dim': 4, 'hidden_layers': [], 'output_act': 'sigmoid', 'loss_func': 'hinge'}
svm_classifier graph exported to gc_graph/svm_graph.pb


In [None]:
pred_df = generic_model_hc200.transform(testData_with_embeddings)

In [None]:
preds_df = pred_df.select('category','prediction.result').toPandas()
preds_df['result'] = preds_df.result.apply(lambda x : x[0])

print (classification_report(preds_df['category'], preds_df['result']))

                  precision    recall  f1-score   support

Gastroenterology       0.81      0.84      0.82        25
       Neurology       0.86      0.82      0.84        22
      Orthopedic       0.89      0.91      0.90        35
         Urology       0.84      0.80      0.82        20

        accuracy                           0.85       102
       macro avg       0.85      0.84      0.85       102
    weighted avg       0.85      0.85      0.85       102



In [None]:
res = classification_report(preds_df["category"], preds_df["result"], output_dict=True)
results_df.loc["GenericSVM_200d"] = [res["macro avg"]["f1-score"], res["weighted avg"]["f1-score"], res["accuracy"]]

## Bert Sentence Embeddings (sbiobert_base_cased_mli)

We will extract [sbiobert_base_cased_mli](https://nlp.johnsnowlabs.com/2020/11/27/sbiobert_base_cased_mli_en.html) sentence embeddings, which have 768 dimensions, and use them to train the models.

In [None]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

bert_sent = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", 'en','clinical/models')\
    .setInputCols(["document"])\
    .setOutputCol("sentence_embeddings")

embeddings_pipeline = Pipeline(
    stages = [
        document_assembler,
        bert_sent
])

sbiobert_base_cased_mli download started this may take some time.
Approximate size to download 384.3 MB
[OK!]


In [None]:
trainingData_with_embeddings = embeddings_pipeline.fit(trainingData).transform(trainingData)\
                                                  .select("text","category","sentence_embeddings")

trainingData_with_embeddings.show(2,truncate=60)

+------------------------------------------------------------+----------------+------------------------------------------------------------+
|                                                        text|        category|                                         sentence_embeddings|
+------------------------------------------------------------+----------------+------------------------------------------------------------+
| ADMISSION DIAGNOSIS: Symptomatic cholelithiasis. DISCHAR...|Gastroenterology|[{sentence_embeddings, 0, 2228,  ADMISSION DIAGNOSIS: Sym...|
| ADMITTING DIAGNOSES: Hiatal hernia, gastroesophageal ref...|Gastroenterology|[{sentence_embeddings, 0, 3237,  ADMITTING DIAGNOSES: Hia...|
+------------------------------------------------------------+----------------+------------------------------------------------------------+
only showing top 2 rows



In [None]:
testData_with_embeddings = embeddings_pipeline.fit(testData).transform(testData)\
                                                  .select("text","category","sentence_embeddings")

testData_with_embeddings.show(2,truncate=60)

+------------------------------------------------------------+----------------+------------------------------------------------------------+
|                                                        text|        category|                                         sentence_embeddings|
+------------------------------------------------------------+----------------+------------------------------------------------------------+
| ADMITTING DIAGNOSIS: Gastrointestinal bleed. HISTORY OF ...|Gastroenterology|[{sentence_embeddings, 0, 3978,  ADMITTING DIAGNOSIS: Gas...|
| CHIEF COMPLAINT: Dysphagia and hematemesis while vomitin...|Gastroenterology|[{sentence_embeddings, 0, 6515,  CHIEF COMPLAINT: Dysphag...|
+------------------------------------------------------------+----------------+------------------------------------------------------------+
only showing top 2 rows



In [None]:
log_folder="Mt_logs_bert"

### ClassifierDL

In [None]:
classifier_dl = ClassifierDLApproach()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("class")\
    .setLabelColumn("category")\
    .setBatchSize(8)\
    .setMaxEpochs(30)\
    .setLr(0.002)\
    .setDropout(0.1)\
    .setEnableOutputLogs(True)\
    .setOutputLogsPath(log_folder)

classifier_dl_pipeline = Pipeline(stages=[classifier_dl])

In [None]:
clfDL_model_bert = classifier_dl_pipeline.fit(trainingData_with_embeddings)

In [None]:
preds = clfDL_model_bert.transform(testData_with_embeddings)

preds_df = preds.select("category","text","class.result").toPandas()
preds_df["result"] = preds_df["result"].apply(lambda x : x[0])

print (classification_report(preds_df["category"], preds_df["result"]))

                  precision    recall  f1-score   support

Gastroenterology       0.92      0.88      0.90        25
       Neurology       0.83      0.86      0.84        22
      Orthopedic       0.88      0.86      0.87        35
         Urology       0.90      0.95      0.93        20

        accuracy                           0.88       102
       macro avg       0.88      0.89      0.88       102
    weighted avg       0.88      0.88      0.88       102



In [None]:
res = classification_report(preds_df["category"], preds_df["result"], output_dict=True)
results_df.loc["ClassifierDL_bert"] = [res["macro avg"]["f1-score"], res["weighted avg"]["f1-score"], res["accuracy"]]

### MultiClassifierDL

In [None]:
# MultiClassifierDL expects the label column to be a list of strings, so we convert the label column to an array type.
trainingData_with_embeddings = trainingData_with_embeddings.withColumn("category_array", F.array(F.col("category")))
testData_with_embeddings = testData_with_embeddings.withColumn("category_array", F.array(F.col("category")))

In [None]:
multiClassifier = MultiClassifierDLApproach()\
    .setInputCols("sentence_embeddings")\
    .setOutputCol("prediction")\
    .setLabelColumn("category_array")\
    .setBatchSize(8)\
    .setMaxEpochs(20)\
    .setLr(9e-3)\
    .setThreshold(0.5)\
    .setShufflePerEpoch(False)\
    .setEnableOutputLogs(True)\
    .setOutputLogsPath(log_folder)\
  #   .setValidationSplit(0.1)

multiClassifier_pipeline = Pipeline(
    stages = [
        multiClassifier
])

In [None]:
multiClassifier_model_bert = multiClassifier_pipeline.fit(trainingData_with_embeddings)

In [None]:
preds = multiClassifier_model_bert.transform(testData_with_embeddings)

In [None]:
preds_df = preds.select("category","text","prediction.result","prediction.metadata").toPandas()
preds_df.result.apply(lambda x: str(x) ).value_counts()

result
['Orthopedic']          31
['Neurology']           24
['Gastroenterology']    22
['Urology']             20
[]                       5
Name: count, dtype: int64
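
Every prediction here has at most one label, whereas the 200-dimensional run with `setThreshold(0.2)` also produced two-label predictions. A rough sketch of one reason for this follows. It assumes the model keeps every label whose score reaches the threshold. Both the exact comparison rule and the scores below are assumptions made for illustration, not values from the model.

```python
def labels_above(scores, threshold):
    """Return labels whose score is at least `threshold`, highest score first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [label for label, score in ranked if score >= threshold]

# Hypothetical per-label scores for two documents.
ambiguous_doc = {"Neurology": 0.55, "Orthopedic": 0.30, "Urology": 0.05, "Gastroenterology": 0.10}
uncertain_doc = {"Neurology": 0.40, "Orthopedic": 0.35, "Urology": 0.15, "Gastroenterology": 0.10}

for threshold in (0.2, 0.5):
    print(threshold, labels_above(ambiguous_doc, threshold), labels_above(uncertain_doc, threshold))
# 0.2 ['Neurology', 'Orthopedic'] ['Neurology', 'Orthopedic']
# 0.5 ['Neurology'] []
```

A higher threshold tends to trade multi-label outputs for empty ones. The two runs also use different embeddings, so the threshold is unlikely to be the only cause of the difference.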

In [None]:
# Keep only the highest-scoring label for each document and drop documents with no predicted label.
# The number of empty predictions can be controlled with setThreshold() at training time.
preds_df["scores"] = preds_df.metadata.apply(lambda x: {k:float(v) for k,v in x[0].items()} if len(x)>=1 else "")
preds_df["result"] = preds_df.scores.apply(lambda x: max(x, key=x.get) if len(x)>=1 else "")
preds_df = preds_df[(preds_df["result"]!="")]
preds_df["result"].value_counts()

result
Orthopedic          31
Neurology           24
Gastroenterology    22
Urology             20
Name: count, dtype: int64

In [None]:
print (classification_report(preds_df["category"], preds_df["result"]))

                  precision    recall  f1-score   support

Gastroenterology       1.00      0.92      0.96        24
       Neurology       0.75      0.86      0.80        21
      Orthopedic       0.94      0.85      0.89        34
         Urology       0.85      0.94      0.89        18

        accuracy                           0.89        97
       macro avg       0.88      0.89      0.89        97
    weighted avg       0.90      0.89      0.89        97



In [None]:
res = classification_report(preds_df["category"], preds_df["result"], output_dict=True)
results_df.loc["MultiClassifierDL_bert"] = [res["macro avg"]["f1-score"], res["weighted avg"]["f1-score"], res["accuracy"]]

### Generic Classifier

In [None]:
from sparknlp_jsl.annotator import TFGraphBuilder

graph_folder = "gc_graph"

gc_graph_builder = TFGraphBuilder()\
    .setModelName("generic_classifier")\
    .setInputCols(["features"])\
    .setLabelColumn("category")\
    .setHiddenLayers([300,200, 50])\
    .setHiddenAct("tanh")\
    .setHiddenActL2(True)\
    .setBatchNorm(True)\
    .setGraphFolder(graph_folder)\
    .setGraphFile("gcf_graph_bert.pb")

In [None]:
features_asm = FeaturesAssembler()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("features")

gen_clf = GenericClassifierApproach()\
    .setLabelColumn("category")\
    .setInputCols("features")\
    .setOutputCol("prediction")\
    .setModelFile(f"{graph_folder}/gcf_graph_bert.pb")\
    .setEpochsNumber(25)\
    .setBatchSize(16)\
    .setLearningRate(0.002)\
    .setOutputLogsPath(log_folder)\
    .setDropout(0.2)\
    .setFixImbalance(True)\
    # .setValidationSplit(0.1)

clf_Pipeline = Pipeline(
    stages=[
        features_asm,
        gc_graph_builder,
        gen_clf
])


In [None]:
generic_model_bert = clf_Pipeline.fit(trainingData_with_embeddings)

TF Graph Builder configuration:
Model name: generic_classifier
Graph folder: gc_graph
Graph file name: gcf_graph_bert.pb
Build params: {'input_dim': 768, 'output_dim': 4, 'hidden_layers': [300, 200, 50], 'hidden_act': 'tanh', 'hidden_act_l2': True, 'batch_norm': True}
generic_classifier graph exported to gc_graph/gcf_graph_bert.pb


In [None]:
pred_df = generic_model_bert.transform(testData_with_embeddings)

preds_df = pred_df.select('category','prediction.result').toPandas()
preds_df['result'] = preds_df.result.apply(lambda x : x[0])

print (classification_report(preds_df['category'], preds_df['result']))

                  precision    recall  f1-score   support

Gastroenterology       0.96      0.92      0.94        25
       Neurology       0.79      0.86      0.83        22
      Orthopedic       0.88      0.83      0.85        35
         Urology       0.86      0.90      0.88        20

        accuracy                           0.87       102
       macro avg       0.87      0.88      0.87       102
    weighted avg       0.88      0.87      0.87       102



In [None]:
res = classification_report(preds_df["category"], preds_df["result"], output_dict=True)
results_df.loc["GenericClassifier_bert"] = [res["macro avg"]["f1-score"], res["weighted avg"]["f1-score"], res["accuracy"]]

This pretrained age classification model categorizes text according to the age group it refers to. It distinguishes among Old Adult, Adult, Child and Teen, and Other/Unknown contexts, which helps when analyzing text that references age-specific scenarios or concerns.

In [None]:
generic_classifier = GenericClassifierModel.pretrained("genericclassifier_age_e5", "en", "clinical/models")\
    .setInputCols(["features"])\
    .setOutputCol("prediction")

sample_texts = [
"""The patient presents with conditions often associated with the stresses and lifestyle of early career and possibly higher education stages, including sleep irregularities and repetitive stress injuries. There's a notable emphasis on preventative care, with discussions around lifestyle choices that can impact long-term health, such as smoking cessation, regular exercise, and balanced nutrition. The patient is also counseled on mental health, particularly in managing stress and anxiety that may arise from personal and professional responsibilities and ambitions at this stage of life.""",
"""The senior patient presents with age-related issues such as reduced hearing and vision, arthritis, and memory lapses. Emphasis is on managing chronic conditions, maintaining social engagement, and adapting lifestyle to changing physical abilities. Discussions include medication management, dietary adjustments to suit older digestion, and the importance of regular, low-impact exercise.""",
"""The late teenage patient is dealing with final growth spurts, the stress of impending adulthood, and decisions about higher education or career paths. Health discussions include maintaining a balanced diet, the importance of regular sleep patterns, and managing academic and social pressures. Mental health support is considered crucial at this stage, with a focus on building resilience and coping mechanisms.""",
"""The patient, faces adjustments to a new lifestyle with changes in daily routines and social interactions. Health concerns include managing the transition from an active work life to more leisure time, which may impact physical and mental health. Preventative health measures are emphasized, along with the importance of staying mentally and physically active and engaged in the community."""
]

genericclassifier_age_e5 download started this may take some time.
[OK!]


### GenericLogRegClassifier

In [None]:
from sparknlp_jsl.annotator import TFGraphBuilder

graph_folder = "gc_graph"

gc_logreg_graph_builder = TFGraphBuilder()\
    .setModelName("logreg_classifier")\
    .setInputCols(["feature_vector"]) \
    .setLabelColumn("category")\
    .setGraphFolder(graph_folder)\
    .setGraphFile("log_reg_graph.pb")

GenericLogRegClassifier needs the output of FeaturesAssembler, which collects features from different columns, or from an embeddings column, into a single feature vector.

The GenericLogRegClassifier takes FEATURE_VECTOR annotations as input, classifies them and outputs CATEGORY annotations.

In [None]:
features_asm = FeaturesAssembler()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("feature_vector")

gen_clf = GenericLogRegClassifierApproach()\
    .setLabelColumn("category")\
    .setInputCols("feature_vector")\
    .setOutputCol("prediction")\
    .setModelFile(f"{graph_folder}/log_reg_graph.pb")\
    .setEpochsNumber(30)\
    .setBatchSize(64)\
    .setLearningRate(0.004)\
    .setOutputLogsPath(log_folder)\
    .setDropout(0.1)\
    .setFixImbalance(True)\
    # .setValidationSplit(0.1)

clf_Pipeline = Pipeline(
    stages=[
        features_asm,
        gc_logreg_graph_builder,
        gen_clf
])


In [None]:
generic_model_bert = clf_Pipeline.fit(trainingData_with_embeddings)

TF Graph Builder configuration:
Model name: logreg_classifier
Graph folder: gc_graph
Graph file name: log_reg_graph.pb
Build params: {'input_dim': 768, 'output_dim': 4, 'hidden_layers': [], 'output_act': 'sigmoid'}
logreg_classifier graph exported to gc_graph/log_reg_graph.pb


In [None]:
pred_df = generic_model_bert.transform(testData_with_embeddings)

In [None]:
preds_df = pred_df.select('category','prediction.result').toPandas()
preds_df['result'] = preds_df.result.apply(lambda x : x[0])

print (classification_report(preds_df['category'], preds_df['result']))

                  precision    recall  f1-score   support

Gastroenterology       0.96      0.92      0.94        25
       Neurology       0.81      0.77      0.79        22
      Orthopedic       0.88      0.86      0.87        35
         Urology       0.83      0.95      0.88        20

        accuracy                           0.87       102
       macro avg       0.87      0.87      0.87       102
    weighted avg       0.87      0.87      0.87       102



In [None]:
res = classification_report(preds_df["category"], preds_df["result"], output_dict=True)
results_df.loc["GenericCLogReg_bert"] = [res["macro avg"]["f1-score"], res["weighted avg"]["f1-score"], res["accuracy"]]

### GenericSVMClassifier

In [None]:
from sparknlp_jsl.annotator import TFGraphBuilder

graph_folder = "gc_graph"

gc_svm_graph_builder = TFGraphBuilder()\
    .setModelName("svm_classifier")\
    .setInputCols(["feature_vector"]) \
    .setLabelColumn("category")\
    .setGraphFolder(graph_folder)\
    .setGraphFile("svm_graph.pb")

GenericSVMClassifier needs the output of FeaturesAssembler, which collects features from different columns, or from an embeddings column, into a single feature vector.

The GenericSVMClassifier takes FEATURE_VECTOR annotations as input, classifies them and outputs CATEGORY annotations.

In [None]:
features_asm = FeaturesAssembler()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("feature_vector")

gen_clf = GenericSVMClassifierApproach()\
    .setLabelColumn("category")\
    .setInputCols("feature_vector")\
    .setOutputCol("prediction")\
    .setModelFile(f"{graph_folder}/svm_graph.pb")\
    .setEpochsNumber(30)\
    .setBatchSize(64)\
    .setLearningRate(0.004)\
    .setOutputLogsPath(log_folder)\
    .setDropout(0.1)\
    .setFixImbalance(True)\
    # .setValidationSplit(0.1)

clf_Pipeline = Pipeline(
    stages=[
        features_asm,
        gc_svm_graph_builder,
        gen_clf
])


In [None]:
generic_model_bert = clf_Pipeline.fit(trainingData_with_embeddings)

TF Graph Builder configuration:
Model name: svm_classifier
Graph folder: gc_graph
Graph file name: svm_graph.pb
Build params: {'input_dim': 768, 'output_dim': 4, 'hidden_layers': [], 'output_act': 'sigmoid', 'loss_func': 'hinge'}
svm_classifier graph exported to gc_graph/svm_graph.pb


In [None]:
pred_df = generic_model_bert.transform(testData_with_embeddings)

In [None]:
preds_df = pred_df.select('category','prediction.result').toPandas()
preds_df['result'] = preds_df.result.apply(lambda x : x[0])

print (classification_report(preds_df['category'], preds_df['result']))

                  precision    recall  f1-score   support

Gastroenterology       0.96      0.92      0.94        25
       Neurology       0.72      0.95      0.82        22
      Orthopedic       0.93      0.80      0.86        35
         Urology       0.84      0.80      0.82        20

        accuracy                           0.86       102
       macro avg       0.86      0.87      0.86       102
    weighted avg       0.88      0.86      0.86       102



In [None]:
res = classification_report(preds_df["category"], preds_df["result"], output_dict=True)
results_df.loc["GenericCSVM_bert"] = [res["macro avg"]["f1-score"], res["weighted avg"]["f1-score"], res["accuracy"]]

## DocumentLogRegClassifier

In [None]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

logreg = DocumentLogRegClassifierApproach()\
    .setInputCols("token")\
    .setLabelCol("category")\
    .setOutputCol("prediction")\
    .setMaxIter(10)\
    .setTol(1e-6)

clf_Pipeline = Pipeline(
    stages=[
        document_assembler,
        tokenizer,
        logreg
])

doclogreg_model = clf_Pipeline.fit(trainingData)

In [None]:
pred_df = doclogreg_model.transform(testData)

preds_df = pred_df.select('category','prediction.result').toPandas()
preds_df['result'] = preds_df.result.apply(lambda x : x[0])

print (classification_report(preds_df['category'], preds_df['result']))

                  precision    recall  f1-score   support

Gastroenterology       0.92      0.88      0.90        25
       Neurology       0.80      0.73      0.76        22
      Orthopedic       0.91      0.89      0.90        35
         Urology       0.79      0.95      0.86        20

        accuracy                           0.86       102
       macro avg       0.86      0.86      0.86       102
    weighted avg       0.87      0.86      0.86       102



In [None]:
res = classification_report(preds_df["category"], preds_df["result"], output_dict=True)
results_df.loc["DocumentLogRegClassifier"] = [res["macro avg"]["f1-score"], res["weighted avg"]["f1-score"], res["accuracy"]]

## Comparison of All Models' Results on the Mtsamples Dataset

Here are the results of all the classifier models. Macro F1, weighted F1, and accuracy scores are presented.

In [None]:
results_df

Unnamed: 0,macro-f1-score,weighted-f1-score,accuracy
ClassifierDL_100d,0.820153,0.833413,0.833333
MultiClassifierDL_100d,0.884165,0.885508,0.886598
GenericClassifier_100d,0.895481,0.900549,0.901961
GenericLogReg_100d,0.848103,0.852503,0.852941
GenericSVM_100d,0.875472,0.8826,0.882353
ClassifierDL_200d,0.081967,0.064288,0.196078
MultiClassifierDL_200d,0.906453,0.907969,0.908163
GenericClassifier_200d,0.877087,0.881814,0.882353
GenericLogReg_200d,0.81256,0.821306,0.823529
GenericSVM_200d,0.845665,0.852612,0.852941
