![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/30.Clinical_Text_Classification_with_Spark_NLP.ipynb)

# Colab Setup

In [1]:
import json
import os

from google.colab import files

if 'spark_jsl.json' not in os.listdir():
  license_keys = files.upload()
  os.rename(list(license_keys.keys())[0], 'spark_jsl.json')

with open('spark_jsl.json') as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)
os.environ.update(license_keys)

In [2]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.5.1 spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

# Installing libcudnn library
! apt install -qq --allow-change-held-packages libcudnn8=8.1.0.77-1+cuda11.2 -y &> /dev/null

In [None]:
!pip install git+https://github.com/tensorflow/addons.git

❗ **PLEASE RE-START RUNTIME AND CONTINUE**

---



In [4]:
import json
import os

with open('spark_jsl.json') as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)
os.environ.update(license_keys)

In [5]:
import pandas as pd
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

import sparknlp
import sparknlp_jsl

from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp_jsl.base import *
from sparknlp_jsl.annotator import *

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline,PipelineModel

import warnings
warnings.filterwarnings('ignore')

params = {"spark.driver.memory":"24G",
          "spark.kryoserializer.buffer.max":"2000M",
          "spark.driver.maxResultSize":"2000M"}

spark = sparknlp_jsl.start(license_keys['SECRET'],params=params, gpu=True)

print ("Spark NLP Version :", sparknlp.version())
print ("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark

Spark NLP Version : 6.1.3
Spark NLP_JSL Version : 6.1.1


## Healthcare NLP for Data Scientists Course

If you are not familiar with the components in this notebook, you can check [Healthcare NLP for Data Scientists Udemy Course](https://www.udemy.com/course/healthcare-nlp-for-data-scientists/) and the [MOOC Notebooks](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/Spark_NLP_Udemy_MOOC/Healthcare_NLP) for each components.

# Classifiers

The below classifiers will be used in this notebook.ClassifierDL, MultiClassifierDL, and GenericClassifier will be trained using healthcare_100d, embeddingd_clinical, and bert sentence embeddings(sbiobert_base_cased_mli). DocumentLogRegClassifier accepts tokens, so sentence embeddings are not utilized during DocumentLogRegClassifier training.

## ClassifierDL

ClassifierDL is a generic Multi-class Text Classification annotator. ClassifierDL uses the state-of-the-art Universal Sentence Encoder as an input for text classifications. The ClassifierDL annotator uses a deep learning model (DNNs) built inside TensorFlow and supports up to 100 classes. For more information please [follow the link](https://nlp.johnsnowlabs.com/docs/en/annotators#classifierdl).


##  MultiClassifierDL

MultiClassifierDL is a Multi-label Text Classification annotator.MultiClassifierDL uses a Bidirectional GRU with a convolutional model built inside TensorFlow and supports up to 100 classes.  For more information please [follow the link](https://nlp.johnsnowlabs.com/docs/en/annotators#multiclassifierdl).

Here are some Multi-label Text Classification models that trained with MultiClassifierDL:

|index|model|
|-----:|:-----|
|1|[multiclassifierdl_heart_disease_en](https://nlp.johnsnowlabs.com/2023/10/16/multiclassifierdl_heart_disease_en.html)|
|2|[multiclassifierdl_hoc_en](https://nlp.johnsnowlabs.com/2023/07/04/multiclassifierdl_hoc_en.html)|
|3|[multiclassifierdl_litcovid_en](https://nlp.johnsnowlabs.com/2023/07/04/multiclassifierdl_litcovid_en.html)|
|4|[multiclassifierdl_respiratory_disease_en](https://nlp.johnsnowlabs.com/2023/10/03/multiclassifierdl_respiratory_disease_en.html)|


## GenericClassifier

GenericClassifier is a TensorFlow model for the generic classification of feature vectors in Healthcare  Lİbrary. It takes FEATURE_VECTOR annotations from FeaturesAssembler as input, classifies them, and outputs CATEGORY annotations. Please see [the link](https://nlp.johnsnowlabs.com/docs/en/licensed_annotators#genericclassifier) for more information.

Here are some Social Determinants of Health (SDOH) models that trained with GenericClassifier:

|index|model|
|-----:|:-----|
|1|[genericclassifier_sdoh_economics_binary_sbiobert_cased_mli](https://nlp.johnsnowlabs.com/2023/01/14/genericclassifier_sdoh_economics_binary_sbiobert_cased_mli_en.html)|
|2|[genericclassifier_sdoh_substance_usage_binary_sbiobert_cased_mli](https://nlp.johnsnowlabs.com/2023/01/14/genericclassifier_sdoh_substance_usage_binary_sbiobert_cased_mli_en.html)|
|3|[genericclassifier_sdoh_tobacco_usage_sbiobert_cased_mli](https://nlp.johnsnowlabs.com/2023/01/14/genericclassifier_sdoh_tobacco_usage_sbiobert_cased_mli_en.html)|
|4|[genericclassifier_sdoh_alcohol_usage_sbiobert_cased_mli](https://nlp.johnsnowlabs.com/2023/01/14/genericclassifier_sdoh_alcohol_usage_sbiobert_cased_mli_en.html)|
|5|[genericclassifier_sdoh_alcohol_usage_binary_sbiobert_cased_mli](https://nlp.johnsnowlabs.com/2023/01/14/genericclassifier_sdoh_alcohol_usage_binary_sbiobert_cased_mli_en.html)|
|6|[genericclassifier_sdoh_housing_insecurity_sbiobert_cased_mli](https://nlp.johnsnowlabs.com/2023/04/10/genericclassifier_sdoh_housing_insecurity_sbiobert_cased_mli_en.html)
|7|[genericclassifier_sdoh_mental_health_clinical](https://nlp.johnsnowlabs.com/2023/04/10/genericclassifier_sdoh_mental_health_clinical_en.html)
|8|[genericclassifier_sdoh_under_treatment_sbiobert_cased_mli](https://nlp.johnsnowlabs.com/2023/04/10/genericclassifier_sdoh_under_treatment_sbiobert_cased_mli_en.html)
|9|[genericclassifier_patient_complaint_sbiobert_cased_mli](https://nlp.johnsnowlabs.com/2023/08/31/patient_complaint_classifier_generic_bert_M1_en.html)
|10|[Genericclassifier_age_group_sbiobert_cased_mli](https://nlp.johnsnowlabs.com/2023/08/16/genericclassifier_age_group_sbiobert_cased_mli_en.html)

## GenericLogRegClassifier

`GenericLogRegClassifier` is a derivative of GenericClassifier which implements a multinomial *Logistic Regression*. This is a single layer neural network with the logistic function at the output. The input to the model is `FeatureVector` and the output is `Category` annotations with labels and corresponding confidence scores varying between 0 and 1. Training data requires "text" and their "label" columns only and the trained model will be a `GenericLogRegClassifierModel()`.

|index|model|
|-----:|:-----|
|1|[generic_logreg_classifier_ade](https://nlp.johnsnowlabs.com/2023/05/09/generic_logreg_classifier_ade_en.html)
|2|[generic_logreg_classifier_metastasis](https://nlp.johnsnowlabs.com/2024/08/09/generic_logreg_classifier_metastasis_en.html)



## GenericSVMClassifier

`GenericSVMClassifier` is a derivative of GenericClassifier which implements *SVM (Support Vector Machine)* classification. The input to the model is `FeatureVector` and the output is `Category` annotations with labels and corresponding confidence scores. The scores are standardized using the logistic function so that they vary between 0 and 1. Taining data requires "text" and their "label" columns only and the trained model will be a `GenericSVMClassifierModel()`

|index|model|
|-----:|:-----|
|1|[generic_svm_classifier_ade](https://nlp.johnsnowlabs.com/2023/05/09/generic_svm_classifier_ade_en.html)
|2|[generic_svm_classifier_metastasis](https://nlp.johnsnowlabs.com/2024/08/09/generic_svm_classifier_metastasis_en.html)

## DocumentLogRegClassifier

DocumentLogRegClassifier is a model to classify documents with a Logarithmic Regression algorithm in Healthcare  Library. Training data requires columns for text and labels. The result is a trained DocumentLogRegClassifierModel. you can get more info [here](https://nlp.johnsnowlabs.com/docs/en/licensed_annotators#documentlogregclassifier).

|index|model|
|-----:|:-----|
|1|[classifier_logreg_ade](https://nlp.johnsnowlabs.com/2023/05/16/classifier_logreg_ade_en.html)|


# ADE Dataset

### Data Preprocessing

In [6]:
#downloading sample datasets
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/ADE_Corpus_V2/ADE-NEG.txt
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/ADE_Corpus_V2/DRUG-AE.rel

**ADE Negative Dataset**

In [7]:
df_neg= pd.read_csv("ADE-NEG.txt", header=None, delimiter="\t", names=["col1"])
df_neg.head()

Unnamed: 0,col1
0,6460590 NEG Clioquinol intoxication occurring ...
1,"8600337 NEG ""Retinoic acid syndrome"" was preve..."
2,8402502 NEG BACKGROUND: External beam radiatio...
3,"8700794 NEG Although the enuresis ceased, she ..."
4,17662448 NEG A 42-year-old woman had uneventfu...


In [8]:
df_neg['text'] =  df_neg.col1.str.split('NEG').str[1]
df_neg["category"] = "neg"
df_neg= df_neg[["text", "category"]]
df_neg.head()

Unnamed: 0,text,category
0,Clioquinol intoxication occurring in the trea...,neg
1,"""Retinoic acid syndrome"" was prevented with s...",neg
2,BACKGROUND: External beam radiation therapy o...,neg
3,"Although the enuresis ceased, she developed t...",neg
4,A 42-year-old woman had uneventful bilateral ...,neg


**ADE Positive Dataset**

In [9]:
df_pos= pd.read_csv("DRUG-AE.rel", header=None, delimiter="|")
df_pos.head()

Unnamed: 0,0,1,2,3,4,5,6,7
0,10030778,Intravenous azithromycin-induced ototoxicity.,ototoxicity,43,54,azithromycin,22,34
1,10048291,"Immobilization, while Paget's bone disease was...",increased calcium-release,960,985,dihydrotachysterol,908,926
2,10048291,Unaccountable severe hypercalcemia in a patien...,hypercalcemia,31,44,dihydrotachysterol,94,112
3,10082597,METHODS: We report two cases of pseudoporphyri...,pseudoporphyria,620,635,naproxen,646,654
4,10082597,METHODS: We report two cases of pseudoporphyri...,pseudoporphyria,620,635,oxaprozin,659,668


In [10]:
df_pos["category"]= "pos"
df_pos.rename(columns={1: "text"}, inplace=True)
df_pos= df_pos[["text", "category"]]
df_pos.head()

Unnamed: 0,text,category
0,Intravenous azithromycin-induced ototoxicity.,pos
1,"Immobilization, while Paget's bone disease was...",pos
2,Unaccountable severe hypercalcemia in a patien...,pos
3,METHODS: We report two cases of pseudoporphyri...,pos
4,METHODS: We report two cases of pseudoporphyri...,pos


**Merging Positive and Negative dataset**

In [11]:
ade_df= pd.concat([df_neg, df_pos])
ade_df.head()

Unnamed: 0,text,category
0,Clioquinol intoxication occurring in the trea...,neg
1,"""Retinoic acid syndrome"" was prevented with s...",neg
2,BACKGROUND: External beam radiation therapy o...,neg
3,"Although the enuresis ceased, she developed t...",neg
4,A 42-year-old woman had uneventful bilateral ...,neg


In [12]:
ade_df["category"].value_counts()

Unnamed: 0_level_0,count
category,Unnamed: 1_level_1
neg,16695
pos,6821


In [13]:
ade_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 23516 entries, 0 to 6820
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   text      23516 non-null  object
 1   category  23516 non-null  object
dtypes: object(2)
memory usage: 551.2+ KB


We take 30% of the data to make a faster run. You can use all data for better scores.

In [14]:
spark_df = spark.createDataFrame(ade_df).sample(0.3, 3) # limit the data

trainingData, testData = spark_df.randomSplit([0.8, 0.2], seed = 100)

print("Training Dataset Count: " + str(trainingData.count()))
print("Test Dataset Count: " + str(testData.count()))

Training Dataset Count: 5606
Test Dataset Count: 1473


In [15]:
spark_df.groupBy("category").count().show()

+--------+-----+
|category|count|
+--------+-----+
|     neg| 5028|
|     pos| 2051|
+--------+-----+



In [16]:
spark_df.printSchema()

root
 |-- text: string (nullable = true)
 |-- category: string (nullable = true)



In [17]:
spark_df.head(3)

[Row(text=' Clioquinol intoxication occurring in the treatment of acrodermatitis enteropathica with reference to SMON outside of Japan.', category='neg'),
 Row(text=' A 42-year-old woman had uneventful bilateral laser-assisted subepithelial keratectomy (LASEK) to correct myopia.', category='neg'),
 Row(text=' A 16-year-old girl with erosive, polyarticular JRA showed no detectable change in her articular disease following nine exchanges.', category='neg')]

## 100 Dimension Healthcare Embeddings (embeddings_healthcare_100d)



Now we will extract [healthcare_100d embeddings](https://nlp.johnsnowlabs.com/2020/05/29/embeddings_healthcare_100d_en.html) and use it in the classificaiton model training.

In [18]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_healthcare_100d","en","clinical/models")\
    .setInputCols(["document","token"])\
    .setOutputCol("word_embeddings")

sentence_embeddings = SentenceEmbeddings() \
    .setInputCols(["document", "word_embeddings"]) \
    .setOutputCol("sentence_embeddings") \
    .setPoolingStrategy("AVERAGE")

embeddings_pipeline = Pipeline(
    stages = [
        document_assembler,
        tokenizer,
        word_embeddings,
        sentence_embeddings,

])

embeddings_healthcare_100d download started this may take some time.
Approximate size to download 475.8 MB
[OK!]


In [19]:
trainingData_with_embeddings = embeddings_pipeline.fit(trainingData).transform(trainingData)
trainingData_with_embeddings = trainingData_with_embeddings.select("text","category","sentence_embeddings")
trainingData_with_embeddings.show(2,truncate=60)

+------------------------------------------------------------+--------+------------------------------------------------------------+
|                                                        text|category|                                         sentence_embeddings|
+------------------------------------------------------------+--------+------------------------------------------------------------+
| 'Bail-out' bivalirudin use in patients with thrombotic c...|     neg|[{sentence_embeddings, 0, 150,  'Bail-out' bivalirudin us...|
| (4) The acute rehabilitation of a 21-year-old woman with...|     neg|[{sentence_embeddings, 0, 182,  (4) The acute rehabilitat...|
+------------------------------------------------------------+--------+------------------------------------------------------------+
only showing top 2 rows



In [20]:
testData_with_embeddings = embeddings_pipeline.fit(testData).transform(testData)
testData_with_embeddings = testData_with_embeddings.select("text","category","sentence_embeddings")
testData_with_embeddings.show(2,truncate=60)

+------------------------------------------------------------+--------+------------------------------------------------------------+
|                                                        text|category|                                         sentence_embeddings|
+------------------------------------------------------------+--------+------------------------------------------------------------+
| A 15-year-old boy had temporary hypertropia, supraductio...|     neg|[{sentence_embeddings, 0, 262,  A 15-year-old boy had tem...|
| A 51-year-old-woman presented with chronic eosinophilia,...|     neg|[{sentence_embeddings, 0, 139,  A 51-year-old-woman prese...|
+------------------------------------------------------------+--------+------------------------------------------------------------+
only showing top 2 rows



In [21]:
testData_with_embeddings.printSchema()

root
 |-- text: string (nullable = true)
 |-- category: string (nullable = true)
 |-- sentence_embeddings: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)



In [22]:
testData_with_embeddings.select(testData_with_embeddings.sentence_embeddings.embeddings).show(3,truncate=120)

+------------------------------------------------------------------------------------------------------------------------+
|                                                                                          sentence_embeddings.embeddings|
+------------------------------------------------------------------------------------------------------------------------+
|[[0.036828704, 0.052321058, 0.11382609, -0.05798419, -0.1450678, 0.13031518, 0.13105926, 0.077215254, -0.11359191, -0...|
|[[0.04241796, 0.10478715, 0.042885326, 0.06311753, -0.3185922, 0.046458624, 0.17697075, -0.037831675, -0.01754189, -0...|
|[[0.057804022, 0.047698125, 0.06304918, 0.15662706, -0.04416214, -0.06267688, 0.08489624, -0.0543084, -0.0027059284, ...|
+------------------------------------------------------------------------------------------------------------------------+
only showing top 3 rows



In [23]:
log_folder="ADE_logs_healthcare_100d"
!mkdir -p $log_folder

### ClassifierDL

In [24]:
classifier_dl = ClassifierDLApproach()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("prediction")\
    .setLabelColumn("category")\
    .setBatchSize(16)\
    .setMaxEpochs(30)\
    .setLr(0.002)\
    .setDropout(0.3)\
    .setEnableOutputLogs(True)\
    .setOutputLogsPath(log_folder)\
    # .setValidationSplit(0.1)

classifier_dl_pipeline = Pipeline(
    stages = [
        classifier_dl
])

In [25]:
clfDL_model_hc100 = classifier_dl_pipeline.fit(trainingData_with_embeddings)

In [26]:
!cat $log_folder/ClassifierDLApproach_*

Training started - epochs: 30 - learning_rate: 0.002 - batch_size: 16 - training_examples: 5606 - classes: 2
Epoch 0/30 - 2.30s - loss: 211.6643 - acc: 0.70928574 - batches: 351
Epoch 1/30 - 1.69s - loss: 211.95518 - acc: 0.71 - batches: 351
Epoch 2/30 - 1.70s - loss: 211.95518 - acc: 0.71 - batches: 351
Epoch 3/30 - 1.95s - loss: 211.95518 - acc: 0.71 - batches: 351
Epoch 4/30 - 2.04s - loss: 211.95518 - acc: 0.71 - batches: 351
Epoch 5/30 - 2.09s - loss: 211.95518 - acc: 0.71 - batches: 351
Epoch 6/30 - 1.81s - loss: 211.95518 - acc: 0.71 - batches: 351
Epoch 7/30 - 1.74s - loss: 211.95518 - acc: 0.71 - batches: 351
Epoch 8/30 - 1.82s - loss: 211.95518 - acc: 0.71 - batches: 351
Epoch 9/30 - 2.01s - loss: 211.95518 - acc: 0.71 - batches: 351
Epoch 10/30 - 2.41s - loss: 211.95518 - acc: 0.71 - batches: 351
Epoch 11/30 - 2.10s - loss: 211.95518 - acc: 0.71 - batches: 351
Epoch 12/30 - 1.90s - loss: 211.95518 - acc: 0.71 - batches: 351
Epoch 13/30 - 2.03s - loss: 211.95518 - acc: 0.71 -

In [27]:
preds = clfDL_model_hc100.transform(testData_with_embeddings)

In [28]:
preds.printSchema()
preds.select(preds.prediction).show(5, truncate=False)
preds.select(preds.category, preds.prediction.result).show(5, truncate=False)

root
 |-- text: string (nullable = true)
 |-- category: string (nullable = true)
 |-- sentence_embeddings: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- prediction: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    

In [29]:
preds_df = preds.select("category","text","prediction.result").toPandas()
preds_df["result"] = preds_df["result"].apply(lambda x : x[0])

print (classification_report(preds_df["category"], preds_df["result"]))

              precision    recall  f1-score   support

         neg       0.72      1.00      0.84      1057
         pos       0.00      0.00      0.00       416

    accuracy                           0.72      1473
   macro avg       0.36      0.50      0.42      1473
weighted avg       0.51      0.72      0.60      1473



In [30]:
res = classification_report(preds_df["category"], preds_df["result"], output_dict=True)
results_df = pd.DataFrame(columns=["pos-f1-score","accuracy"])
results_df.loc["ClassifierDL_100d"] = [res["pos"]["f1-score"], res["accuracy"]]
results_df

Unnamed: 0,pos-f1-score,accuracy
ClassifierDL_100d,0.0,0.717583


### MultiClassifierDL

We will use MultiClassifierDL built by using Bidirectional GRU and CNNs inside TensorFlow that supports up to 100 classes. It is designed for multi-label classification purposes. Here we will use MultiClassifierDL as a binary classifier.

In [31]:
# MultiClassifierDL accepts list of strings as input label. So we convert label column to array type.
trainingData_with_embeddings = trainingData_with_embeddings.withColumn("category_array", F.array(F.col("category")))
testData_with_embeddings = testData_with_embeddings.withColumn("category_array", F.array(F.col("category")))

In [32]:
multiClassifier = MultiClassifierDLApproach()\
    .setInputCols("sentence_embeddings")\
    .setOutputCol("prediction")\
    .setLabelColumn("category_array")\
    .setBatchSize(16)\
    .setMaxEpochs(15)\
    .setLr(9e-3)\
    .setThreshold(0.5)\
    .setShufflePerEpoch(False)\
    .setEnableOutputLogs(True)\
    .setOutputLogsPath(log_folder)\
  #   .setValidationSplit(0.1)

multiClassifier_pipeline = Pipeline(
    stages = [
        multiClassifier
])

In [33]:
multiClassifier_model_hc100 = multiClassifier_pipeline.fit(trainingData_with_embeddings)

In [34]:
!cat $log_folder/MultiClassifierDLApproach_*

Training started - epochs: 15 - learning_rate: 0.009 - batch_size: 16 - training_examples: 5606 - classes: 2
Epoch 0/15 - 9.67s - loss: 0.5211027 - acc: 0.7369643 - batches: 351
Epoch 1/15 - 3.46s - loss: 0.4577158 - acc: 0.7746429 - batches: 351
Epoch 2/15 - 3.32s - loss: 0.42108843 - acc: 0.79839283 - batches: 351
Epoch 3/15 - 3.29s - loss: 0.39720863 - acc: 0.8158929 - batches: 351
Epoch 4/15 - 3.35s - loss: 0.37761903 - acc: 0.82625 - batches: 351
Epoch 5/15 - 3.17s - loss: 0.3612009 - acc: 0.836875 - batches: 351
Epoch 6/15 - 3.26s - loss: 0.34774014 - acc: 0.8429464 - batches: 351
Epoch 7/15 - 3.15s - loss: 0.33186734 - acc: 0.85169643 - batches: 351
Epoch 8/15 - 3.25s - loss: 0.31702447 - acc: 0.85625 - batches: 351
Epoch 9/15 - 3.05s - loss: 0.30283532 - acc: 0.8632143 - batches: 351
Epoch 10/15 - 3.13s - loss: 0.2903383 - acc: 0.86892855 - batches: 351
Epoch 11/15 - 3.18s - loss: 0.27682382 - acc: 0.8769643 - batches: 351
Epoch 12/15 - 3.24s - loss: 0.26144764 - acc: 0.8842857

In [35]:
preds = multiClassifier_model_hc100.transform(testData_with_embeddings)

In [36]:
preds.printSchema()
preds.select(preds.prediction).show(5, truncate=False)
preds.select(preds.category_array, preds.prediction.result).show(5, truncate=False)

root
 |-- text: string (nullable = true)
 |-- category: string (nullable = true)
 |-- sentence_embeddings: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- category_array: array (nullable = false)
 |    |-- element: string (containsNull = true)
 |-- prediction: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable

In [37]:
preds_df = preds.select("category","text","prediction.result","prediction.metadata").toPandas()
preds_df.result.apply(lambda x: str(x) ).value_counts()

Unnamed: 0_level_0,count
result,Unnamed: 1_level_1
['neg'],1011
['pos'],461
"['neg', 'pos']",1


In [38]:
preds_df[preds_df.result.apply(len)==2]

Unnamed: 0,category,text,result,metadata
1312,pos,A patient with seropositive rheumatoid arthrit...,"[neg, pos]","[{'sentence': '0', 'neg': '0.5015679', 'pos': ..."


In [39]:
preds_df[preds_df.result.apply(len)==0]

Unnamed: 0,category,text,result,metadata


MultiClassifierDL is a multi-label classifier, so some predictions may include both labels or none of the labels. That can be controlled a bit with `.setThreshold()` parameter during training. For now we will keep not keep zero label predictions and get the highest score as prediction.

In [40]:
# We will get the highest score label as result. You can control the number of zero label results with setThreshold() in the training.
preds_df["scores"] = preds_df.metadata.apply(lambda x: {k:float(v) for k,v in x[0].items()} if len(x)>=1 else "")
preds_df["result"] = preds_df.scores.apply(lambda x: max(x, key=x.get) if len(x)>=1 else "")
preds_df["result"].value_counts()

Unnamed: 0_level_0,count
result,Unnamed: 1_level_1
neg,1012
pos,461


In [41]:
res = classification_report(preds_df["category"], preds_df["result"], output_dict=True)
results_df.loc["MultiClassifierDL_100d"] = [res["pos"]["f1-score"], res["accuracy"]]

print (classification_report(preds_df["category"], preds_df["result"]))

              precision    recall  f1-score   support

         neg       0.88      0.85      0.87      1057
         pos       0.65      0.72      0.68       416

    accuracy                           0.81      1473
   macro avg       0.77      0.78      0.77      1473
weighted avg       0.82      0.81      0.81      1473



### Generic Classifier

In [42]:
from sparknlp_jsl.annotator import TFGraphBuilder

graph_folder = "gc_graph"

gc_graph_builder = TFGraphBuilder()\
    .setModelName("generic_classifier")\
    .setInputCols(["features"])\
    .setLabelColumn("category")\
    .setHiddenLayers([300,200, 50])\
    .setHiddenAct("tanh")\
    .setHiddenActL2(True)\
    .setBatchNorm(True)\
    .setGraphFolder(graph_folder)\
    .setGraphFile("gcf_graph_100d.pb")

GenericClassifier needs outputs from FeaturesAssembler. The FeaturesAssembler is used to collect features from different columns or an embeddings column.

The GenericClassifierApproach takes FEATURE_VECTOR annotations as input, classifies them and outputs CATEGORY annotations.

In [43]:
features_asm = FeaturesAssembler()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("features")

gen_clf = GenericClassifierApproach()\
    .setLabelColumn("category")\
    .setInputCols("features")\
    .setOutputCol("prediction")\
    .setModelFile(f"{graph_folder}/gcf_graph_100d.pb")\
    .setEpochsNumber(25)\
    .setBatchSize(128)\
    .setLearningRate(0.004)\
    .setOutputLogsPath(log_folder)\
    .setDropout(0.2)\
    .setFixImbalance(True)\
    # .setValidationSplit(0.1)

clf_Pipeline = Pipeline(
    stages=[
        features_asm,
        gc_graph_builder,
        gen_clf
])

In [44]:
generic_model_hc100 = clf_Pipeline.fit(trainingData_with_embeddings)

TF Graph Builder configuration:
Model name: generic_classifier
Graph folder: gc_graph
Graph file name: gcf_graph_100d.pb
Build params: {'input_dim': 100, 'output_dim': 2, 'hidden_layers': [300, 200, 50], 'hidden_act': 'tanh', 'hidden_act_l2': True, 'batch_norm': True}
generic_classifier graph exported to gc_graph/gcf_graph_100d.pb


In [45]:
!cat $log_folder/GenericClassifierApproach_*

Training 25 epochs
Epoch 1/25	0.48s	Loss: 12.836317	ACC: 0.62284493
Epoch 2/25	0.16s	Loss: 10.319197	ACC: 0.7103352
Epoch 3/25	0.15s	Loss: 9.808507	ACC: 0.7316942
Epoch 4/25	0.14s	Loss: 9.430674	ACC: 0.7394476
Epoch 5/25	0.15s	Loss: 9.631403	ACC: 0.7268549
Epoch 6/25	0.14s	Loss: 9.181558	ACC: 0.7434513
Epoch 7/25	0.14s	Loss: 8.920649	ACC: 0.7546757
Epoch 8/25	0.14s	Loss: 8.803186	ACC: 0.76352215
Epoch 9/25	0.13s	Loss: 8.693391	ACC: 0.76298946
Epoch 10/25	0.13s	Loss: 8.648312	ACC: 0.7679611
Epoch 11/25	0.13s	Loss: 8.404008	ACC: 0.77371955
Epoch 12/25	0.13s	Loss: 8.311676	ACC: 0.78447384
Epoch 13/25	0.14s	Loss: 8.23434	ACC: 0.78349555
Epoch 14/25	0.14s	Loss: 8.347946	ACC: 0.7799479
Epoch 15/25	0.13s	Loss: 8.146642	ACC: 0.7863852
Epoch 16/25	0.13s	Loss: 8.045092	ACC: 0.78655237
Epoch 17/25	0.13s	Loss: 8.182261	ACC: 0.7809645
Epoch 18/25	0.13s	Loss: 7.8782854	ACC: 0.7938809
Epoch 19/25	0.14s	Loss: 7.8657374	ACC: 0.7899781
Epoch 20/25	0.13s	Loss: 7.989813	ACC: 0.78953594
Epoch 21/25	0.13s	L

In [46]:
pred_df = generic_model_hc100.transform(testData_with_embeddings)

In [47]:
preds.printSchema()
preds.select(preds.prediction).show(5, truncate=False)
preds.select(preds.category, preds.prediction.result).show(5, truncate=False)

root
 |-- text: string (nullable = true)
 |-- category: string (nullable = true)
 |-- sentence_embeddings: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- category_array: array (nullable = false)
 |    |-- element: string (containsNull = true)
 |-- prediction: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable

In [48]:
preds_df = pred_df.select('category','prediction.result').toPandas()
preds_df['result'] = preds_df.result.apply(lambda x : x[0])

print (classification_report(preds_df['category'], preds_df['result']))

              precision    recall  f1-score   support

         neg       0.89      0.82      0.86      1057
         pos       0.63      0.75      0.68       416

    accuracy                           0.80      1473
   macro avg       0.76      0.79      0.77      1473
weighted avg       0.82      0.80      0.81      1473



In [49]:
res = classification_report(preds_df["category"], preds_df["result"], output_dict=True)
results_df.loc["GenericClassifier_100d"] = [res["pos"]["f1-score"], res["accuracy"]]

### GenericLogRegClassifier

In [50]:
from sparknlp_jsl.annotator import TFGraphBuilder

graph_folder = "gc_graph"

gc_logreg_graph_builder = TFGraphBuilder()\
    .setModelName("logreg_classifier")\
    .setInputCols(["feature_vector"]) \
    .setLabelColumn("category")\
    .setGraphFolder(graph_folder)\
    .setGraphFile("log_reg_graph.pb")

GenericLogRegClassifier needs outputs from FeaturesAssembler. The FeaturesAssembler is used to collect features from different columns or an embeddings column.

The GenericLogRegClassifier takes FEATURE_VECTOR annotations as input, classifies them and outputs CATEGORY annotations.

In [51]:
features_asm = FeaturesAssembler()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("feature_vector")

gen_clf = GenericLogRegClassifierApproach()\
    .setLabelColumn("category")\
    .setInputCols("feature_vector")\
    .setOutputCol("prediction")\
    .setModelFile(f"{graph_folder}/log_reg_graph.pb")\
    .setEpochsNumber(20)\
    .setBatchSize(128)\
    .setLearningRate(0.01)\
    .setOutputLogsPath(log_folder)\
    .setDropout(0.1)\
    .setFixImbalance(True)\
    # .setValidationSplit(0.1)

clf_Pipeline = Pipeline(
    stages=[
        features_asm,
        gc_logreg_graph_builder,
        gen_clf
])


In [52]:
generic_model_hc100 = clf_Pipeline.fit(trainingData_with_embeddings)

TF Graph Builder configuration:
Model name: logreg_classifier
Graph folder: gc_graph
Graph file name: log_reg_graph.pb
Build params: {'input_dim': 100, 'output_dim': 2, 'hidden_layers': [], 'output_act': 'sigmoid'}
logreg_classifier graph exported to gc_graph/log_reg_graph.pb


In [53]:
!cat $log_folder/GenericLogRegClassifierApproach*

Training 20 epochs
Epoch 1/20	0.15s	Loss: 25.916553	ACC: 0.7090053
Epoch 2/20	0.05s	Loss: 24.165014	ACC: 0.7188127
Epoch 3/20	0.05s	Loss: 23.344551	ACC: 0.72741896
Epoch 4/20	0.05s	Loss: 22.890036	ACC: 0.73701054
Epoch 5/20	0.05s	Loss: 22.586634	ACC: 0.74278635
Epoch 6/20	0.05s	Loss: 22.23237	ACC: 0.7447325
Epoch 7/20	0.05s	Loss: 22.053646	ACC: 0.75246143
Epoch 8/20	0.05s	Loss: 22.02954	ACC: 0.75041777
Epoch 9/20	0.05s	Loss: 21.754704	ACC: 0.7569038
Epoch 10/20	0.05s	Loss: 21.879087	ACC: 0.7567228
Epoch 11/20	0.05s	Loss: 21.688179	ACC: 0.75823724
Epoch 12/20	0.05s	Loss: 21.55584	ACC: 0.7572102
Epoch 13/20	0.05s	Loss: 21.638292	ACC: 0.75326216
Epoch 14/20	0.05s	Loss: 21.568205	ACC: 0.75734943
Epoch 15/20	0.05s	Loss: 21.573969	ACC: 0.7568969
Epoch 16/20	0.05s	Loss: 21.32011	ACC: 0.7602774
Epoch 17/20	0.05s	Loss: 21.364315	ACC: 0.75979
Epoch 18/20	0.05s	Loss: 21.33092	ACC: 0.76347345
Epoch 19/20	0.06s	Loss: 21.29929	ACC: 0.7621818
Epoch 20/20	0.05s	Loss: 21.250221	ACC: 0.7565905
Training 

In [54]:
pred_df = generic_model_hc100.transform(testData_with_embeddings)

In [55]:
preds.printSchema()
preds.select(preds.prediction).show(5, truncate=False)
preds.select(preds.category, preds.prediction.result).show(5, truncate=False)

root
 |-- text: string (nullable = true)
 |-- category: string (nullable = true)
 |-- sentence_embeddings: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- category_array: array (nullable = false)
 |    |-- element: string (containsNull = true)
 |-- prediction: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable

In [56]:
preds_df = pred_df.select('category','prediction.result').toPandas()
preds_df['result'] = preds_df.result.apply(lambda x : x[0])

print (classification_report(preds_df['category'], preds_df['result']))

              precision    recall  f1-score   support

         neg       0.80      0.92      0.86      1057
         pos       0.67      0.41      0.51       416

    accuracy                           0.78      1473
   macro avg       0.74      0.67      0.68      1473
weighted avg       0.76      0.78      0.76      1473



In [57]:
res = classification_report(preds_df["category"], preds_df["result"], output_dict=True)
results_df.loc["GenericLogReg_100d"] = [res["pos"]["f1-score"], res["accuracy"]]

### GenericSVMClassifier

In [58]:
from sparknlp_jsl.annotator import TFGraphBuilder

graph_folder = "gc_graph"

gc_svm_graph_builder = TFGraphBuilder()\
    .setModelName("svm_classifier")\
    .setInputCols(["feature_vector"]) \
    .setLabelColumn("category")\
    .setGraphFolder(graph_folder)\
    .setGraphFile("svm_graph.pb")

GenericSVMClassifier needs outputs from FeaturesAssembler. The FeaturesAssembler is used to collect features from different columns or an embeddings column.

The GenericSVMClassifier takes FEATURE_VECTOR annotations as input, classifies them and outputs CATEGORY annotations.

In [59]:
features_asm = FeaturesAssembler()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("feature_vector")

gen_clf = GenericSVMClassifierApproach()\
    .setLabelColumn("category")\
    .setInputCols("feature_vector")\
    .setOutputCol("prediction")\
    .setModelFile(f"{graph_folder}/svm_graph.pb")\
    .setEpochsNumber(21)\
    .setBatchSize(128)\
    .setLearningRate(0.02)\
    .setOutputLogsPath(log_folder)\
    .setDropout(0.1)\
    .setFixImbalance(True)\
    # .setValidationSplit(0.1)

clf_Pipeline = Pipeline(
    stages=[
        features_asm,
        gc_svm_graph_builder,
        gen_clf
])

In [60]:
generic_model_hc100 = clf_Pipeline.fit(trainingData_with_embeddings)

TF Graph Builder configuration:
Model name: svm_classifier
Graph folder: gc_graph
Graph file name: svm_graph.pb
Build params: {'input_dim': 100, 'output_dim': 2, 'hidden_layers': [], 'output_act': 'sigmoid', 'loss_func': 'hinge'}
svm_classifier graph exported to gc_graph/svm_graph.pb


In [61]:
!cat $log_folder/GenericSVMClassifierApproach*

Training 21 epochs
Epoch 1/21	0.14s	Loss: 28.110487	ACC: 0.70421124
Epoch 2/21	0.05s	Loss: 24.993732	ACC: 0.7264929
Epoch 3/21	0.05s	Loss: 24.563284	ACC: 0.7368399
Epoch 4/21	0.05s	Loss: 24.22303	ACC: 0.7489172
Epoch 5/21	0.05s	Loss: 24.104843	ACC: 0.75104445
Epoch 6/21	0.05s	Loss: 23.926744	ACC: 0.7526459
Epoch 7/21	0.05s	Loss: 23.967041	ACC: 0.75605434
Epoch 8/21	0.05s	Loss: 23.872986	ACC: 0.7553093
Epoch 9/21	0.05s	Loss: 23.740416	ACC: 0.75947666
Epoch 10/21	0.05s	Loss: 23.43158	ACC: 0.7590345
Epoch 11/21	0.05s	Loss: 23.815622	ACC: 0.75748175
Epoch 12/21	0.05s	Loss: 23.687748	ACC: 0.7590345
Epoch 13/21	0.05s	Loss: 23.645468	ACC: 0.7592086
Epoch 14/21	0.05s	Loss: 23.73254	ACC: 0.7591181
Epoch 15/21	0.05s	Loss: 23.776913	ACC: 0.75903106
Epoch 16/21	0.05s	Loss: 23.470568	ACC: 0.7592504
Epoch 17/21	0.05s	Loss: 24.0196	ACC: 0.7593931
Epoch 18/21	0.05s	Loss: 23.664045	ACC: 0.7618789
Epoch 19/21	0.06s	Loss: 23.57697	ACC: 0.7630278
Epoch 20/21	0.05s	Loss: 23.888702	ACC: 0.7555252
Epoch 21/2

In [62]:
pred_df = generic_model_hc100.transform(testData_with_embeddings)

In [63]:
preds.printSchema()
preds.select(preds.prediction).show(5, truncate=False)
preds.select(preds.category, preds.prediction.result).show(5, truncate=False)

root
 |-- text: string (nullable = true)
 |-- category: string (nullable = true)
 |-- sentence_embeddings: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- category_array: array (nullable = false)
 |    |-- element: string (containsNull = true)
 |-- prediction: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable

In [64]:
preds_df = pred_df.select('category','prediction.result').toPandas()
preds_df['result'] = preds_df.result.apply(lambda x : x[0])

print (classification_report(preds_df['category'], preds_df['result']))

              precision    recall  f1-score   support

         neg       0.76      0.97      0.85      1057
         pos       0.71      0.20      0.32       416

    accuracy                           0.75      1473
   macro avg       0.73      0.59      0.58      1473
weighted avg       0.74      0.75      0.70      1473



In [65]:
res = classification_report(preds_df["category"], preds_df["result"], output_dict=True)
results_df.loc["GenericSVM_100d"] = [res["pos"]["f1-score"], res["accuracy"]]

## 200 Dimension Healthcare Embeddings (embeddings_clinical)



Now we will extract [embeddings_clinical](https://nlp.johnsnowlabs.com/2020/01/28/embeddings_clinical_en.html) embeddings which has 200 dimension output and use this embeddings in the model training.

In [66]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")\
    .setInputCols(["document","token"])\
    .setOutputCol("word_embeddings")

sentence_embeddings = SentenceEmbeddings() \
    .setInputCols(["document", "word_embeddings"]) \
    .setOutputCol("sentence_embeddings") \
    .setPoolingStrategy("AVERAGE")

embeddings_pipeline = Pipeline(
    stages = [
        document_assembler,
        tokenizer,
        word_embeddings,
        sentence_embeddings,
])

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]


In [67]:
trainingData_with_embeddings = embeddings_pipeline.fit(trainingData).transform(trainingData)\
                                                  .select("text","category","sentence_embeddings")
trainingData_with_embeddings.show(2,truncate=60)

+------------------------------------------------------------+--------+------------------------------------------------------------+
|                                                        text|category|                                         sentence_embeddings|
+------------------------------------------------------------+--------+------------------------------------------------------------+
| 'Bail-out' bivalirudin use in patients with thrombotic c...|     neg|[{sentence_embeddings, 0, 150,  'Bail-out' bivalirudin us...|
| (4) The acute rehabilitation of a 21-year-old woman with...|     neg|[{sentence_embeddings, 0, 182,  (4) The acute rehabilitat...|
+------------------------------------------------------------+--------+------------------------------------------------------------+
only showing top 2 rows



In [68]:
testData_with_embeddings = embeddings_pipeline.fit(testData).transform(testData)\
                                                  .select("text","category","sentence_embeddings")
testData_with_embeddings.show(2,truncate=60)

+------------------------------------------------------------+--------+------------------------------------------------------------+
|                                                        text|category|                                         sentence_embeddings|
+------------------------------------------------------------+--------+------------------------------------------------------------+
| A 15-year-old boy had temporary hypertropia, supraductio...|     neg|[{sentence_embeddings, 0, 262,  A 15-year-old boy had tem...|
| A 51-year-old-woman presented with chronic eosinophilia,...|     neg|[{sentence_embeddings, 0, 139,  A 51-year-old-woman prese...|
+------------------------------------------------------------+--------+------------------------------------------------------------+
only showing top 2 rows



In [69]:
log_folder="ADE_logs_healthcare_200d"

### ClassifierDL

In [70]:
classifier_dl = ClassifierDLApproach()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("class")\
    .setLabelColumn("category")\
    .setBatchSize(16)\
    .setMaxEpochs(30)\
    .setLr(0.001)\
    .setDropout(0.3)\
    .setEnableOutputLogs(True)\
    .setOutputLogsPath(log_folder)

classifier_dl_pipeline = Pipeline(
    stages = [
        classifier_dl
])

In [71]:
clfDL_model_hc200 = classifier_dl_pipeline.fit(trainingData_with_embeddings)

In [72]:
preds = clfDL_model_hc200.transform(testData_with_embeddings)

preds_df = preds.select("category","text","class.result").toPandas()
preds_df["result"] = preds_df["result"].apply(lambda x : x[0])

print (classification_report(preds_df["category"], preds_df["result"]))

              precision    recall  f1-score   support

         neg       0.88      0.91      0.89      1057
         pos       0.75      0.68      0.71       416

    accuracy                           0.85      1473
   macro avg       0.81      0.80      0.80      1473
weighted avg       0.84      0.85      0.84      1473



In [73]:
res = classification_report(preds_df["category"], preds_df["result"], output_dict=True)
results_df.loc["ClassifierDL_200d"] = [res["pos"]["f1-score"], res["accuracy"]]

### MultiClassifierDL

In [74]:
# MultiClassifierDL accepts list of strings as input label. So we convert label column to array type.
trainingData_with_embeddings = trainingData_with_embeddings.withColumn("category_array", F.array(F.col("category")))
testData_with_embeddings = testData_with_embeddings.withColumn("category_array", F.array(F.col("category")))

In [75]:
multiClassifier = MultiClassifierDLApproach()\
    .setInputCols("sentence_embeddings")\
    .setOutputCol("prediction")\
    .setLabelColumn("category_array")\
    .setBatchSize(32)\
    .setMaxEpochs(20)\
    .setLr(9e-3)\
    .setThreshold(0.5)\
    .setShufflePerEpoch(False)\
    .setEnableOutputLogs(True)\
    .setOutputLogsPath(log_folder)\
  #   .setValidationSplit(0.1)

multiClassifier_pipeline = Pipeline(
    stages = [
        multiClassifier
])

In [76]:
multiClassifier_model_hc200 = multiClassifier_pipeline.fit(trainingData_with_embeddings)

In [77]:
preds = multiClassifier_model_hc200.transform(testData_with_embeddings)

In [78]:
preds_df = preds.select("category","text","prediction.result","prediction.metadata").toPandas()
preds_df.result.apply(lambda x: str(x) ).value_counts()

Unnamed: 0_level_0,count
result,Unnamed: 1_level_1
['neg'],1082
['pos'],391


In [79]:
# We will get the highest score label as result. You can control the number of zero label results with setThreshold() in the training.
preds_df["scores"] = preds_df.metadata.apply(lambda x: {k:float(v) for k,v in x[0].items()} if len(x)>=1 else "")
preds_df["result"] = preds_df.scores.apply(lambda x: max(x, key=x.get) if len(x)>=1 else "")
preds_df["result"].value_counts()

Unnamed: 0_level_0,count
result,Unnamed: 1_level_1
neg,1082
pos,391


In [80]:
print (classification_report(preds_df["category"], preds_df["result"]))

              precision    recall  f1-score   support

         neg       0.88      0.90      0.89      1057
         pos       0.74      0.69      0.72       416

    accuracy                           0.84      1473
   macro avg       0.81      0.80      0.80      1473
weighted avg       0.84      0.84      0.84      1473



In [81]:
res = classification_report(preds_df["category"], preds_df["result"], output_dict=True)
results_df.loc["MultiClassifierDL_200d"] = [res["pos"]["f1-score"], res["accuracy"]]

### Generic Classifier

In [82]:
# !pip install -q tensorflow==2.12.0 tensorflow_addons

In [83]:
from sparknlp_jsl.annotator import TFGraphBuilder

graph_folder = "gc_graph"

gc_graph_builder = TFGraphBuilder()\
    .setModelName("generic_classifier")\
    .setInputCols(["features"])\
    .setLabelColumn("category")\
    .setHiddenLayers([300,200, 50])\
    .setHiddenAct("tanh")\
    .setHiddenActL2(True)\
    .setBatchNorm(True)\
    .setGraphFolder(graph_folder)\
    .setGraphFile("gcf_graph_200d.pb")

In [84]:
features_asm = FeaturesAssembler()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("features")

gen_clf = GenericClassifierApproach()\
    .setLabelColumn("category")\
    .setInputCols("features")\
    .setOutputCol("prediction")\
    .setModelFile(f"{graph_folder}/gcf_graph_200d.pb")\
    .setEpochsNumber(40)\
    .setBatchSize(16)\
    .setLearningRate(0.001)\
    .setOutputLogsPath(log_folder)\
    .setDropout(0.5)\
    .setFixImbalance(True)\
    # .setValidationSplit(0.1)

clf_Pipeline = Pipeline(
    stages=[
        features_asm,
        gc_graph_builder,
        gen_clf
])

In [85]:
generic_model_hc200 = clf_Pipeline.fit(trainingData_with_embeddings)

TF Graph Builder configuration:
Model name: generic_classifier
Graph folder: gc_graph
Graph file name: gcf_graph_200d.pb
Build params: {'input_dim': 200, 'output_dim': 2, 'hidden_layers': [300, 200, 50], 'hidden_act': 'tanh', 'hidden_act_l2': True, 'batch_norm': True}
generic_classifier graph exported to gc_graph/gcf_graph_200d.pb


In [86]:
pred_df = generic_model_hc200.transform(testData_with_embeddings)

preds_df = pred_df.select('category','prediction.result').toPandas()
preds_df['result'] = preds_df.result.apply(lambda x : x[0])

print (classification_report(preds_df['category'], preds_df['result']))

              precision    recall  f1-score   support

         neg       0.94      0.65      0.77      1057
         pos       0.50      0.89      0.65       416

    accuracy                           0.72      1473
   macro avg       0.72      0.77      0.71      1473
weighted avg       0.82      0.72      0.74      1473



In [87]:
res = classification_report(preds_df["category"], preds_df["result"], output_dict=True)
results_df.loc["GenericClassifier_200d"] = [res["pos"]["f1-score"], res["accuracy"]]

### GenericLogRegClassifier

In [88]:
from sparknlp_jsl.annotator import TFGraphBuilder

graph_folder = "gc_graph"

gc_logreg_graph_builder = TFGraphBuilder()\
    .setModelName("logreg_classifier")\
    .setInputCols(["feature_vector"]) \
    .setLabelColumn("category")\
    .setGraphFolder(graph_folder)\
    .setGraphFile("log_reg_graph.pb")

GenericLogRegClassifier needs outputs from FeaturesAssembler. The FeaturesAssembler is used to collect features from different columns or an embeddings column.

The GenericLogRegClassifier takes FEATURE_VECTOR annotations as input, classifies them and outputs CATEGORY annotations.

In [89]:
features_asm = FeaturesAssembler()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("feature_vector")

gen_clf = GenericLogRegClassifierApproach()\
    .setLabelColumn("category")\
    .setInputCols("feature_vector")\
    .setOutputCol("prediction")\
    .setModelFile(f"{graph_folder}/log_reg_graph.pb")\
    .setEpochsNumber(20)\
    .setBatchSize(128)\
    .setLearningRate(0.01)\
    .setOutputLogsPath(log_folder)\
    .setDropout(0.1)\
    .setFixImbalance(True)\
    # .setValidationSplit(0.1)

clf_Pipeline = Pipeline(
    stages=[
        features_asm,
        gc_logreg_graph_builder,
        gen_clf
])


In [90]:
generic_model_hc200 = clf_Pipeline.fit(trainingData_with_embeddings)

TF Graph Builder configuration:
Model name: logreg_classifier
Graph folder: gc_graph
Graph file name: log_reg_graph.pb
Build params: {'input_dim': 200, 'output_dim': 2, 'hidden_layers': [], 'output_act': 'sigmoid'}
logreg_classifier graph exported to gc_graph/log_reg_graph.pb


In [91]:
pred_df = generic_model_hc200.transform(testData_with_embeddings)

In [92]:
preds_df = pred_df.select('category','prediction.result').toPandas()
preds_df['result'] = preds_df.result.apply(lambda x : x[0])

print (classification_report(preds_df['category'], preds_df['result']))

              precision    recall  f1-score   support

         neg       0.83      0.92      0.87      1057
         pos       0.72      0.51      0.60       416

    accuracy                           0.81      1473
   macro avg       0.77      0.72      0.73      1473
weighted avg       0.80      0.81      0.79      1473



In [93]:
res = classification_report(preds_df["category"], preds_df["result"], output_dict=True)
results_df.loc["GenericLogReg_200d"] = [res["pos"]["f1-score"], res["accuracy"]]

### GenericSVMClassifier

In [94]:
from sparknlp_jsl.annotator import TFGraphBuilder

graph_folder = "gc_graph"

gc_svm_graph_builder = TFGraphBuilder()\
    .setModelName("svm_classifier")\
    .setInputCols(["feature_vector"]) \
    .setLabelColumn("category")\
    .setGraphFolder(graph_folder)\
    .setGraphFile("svm_graph.pb")

GenericSVMClassifier needs outputs from FeaturesAssembler. The FeaturesAssembler is used to collect features from different columns or an embeddings column.

The GenericSVMClassifier takes FEATURE_VECTOR annotations as input, classifies them and outputs CATEGORY annotations.

In [95]:
features_asm = FeaturesAssembler()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("feature_vector")

gen_clf = GenericSVMClassifierApproach()\
    .setLabelColumn("category")\
    .setInputCols("feature_vector")\
    .setOutputCol("prediction")\
    .setModelFile(f"{graph_folder}/svm_graph.pb")\
    .setEpochsNumber(20)\
    .setBatchSize(128)\
    .setLearningRate(0.005)\
    .setOutputLogsPath(log_folder)\
    .setDropout(0.1)\
    .setFixImbalance(True)\
    # .setValidationSplit(0.1)

clf_Pipeline = Pipeline(
    stages=[
        features_asm,
        gc_svm_graph_builder,
        gen_clf
])


In [96]:
generic_model_hc200 = clf_Pipeline.fit(trainingData_with_embeddings)

TF Graph Builder configuration:
Model name: svm_classifier
Graph folder: gc_graph
Graph file name: svm_graph.pb
Build params: {'input_dim': 200, 'output_dim': 2, 'hidden_layers': [], 'output_act': 'sigmoid', 'loss_func': 'hinge'}
svm_classifier graph exported to gc_graph/svm_graph.pb


In [97]:
pred_df = generic_model_hc200.transform(testData_with_embeddings)

In [98]:
preds_df = pred_df.select('category','prediction.result').toPandas()
preds_df['result'] = preds_df.result.apply(lambda x : x[0])

print (classification_report(preds_df['category'], preds_df['result']))

              precision    recall  f1-score   support

         neg       0.84      0.89      0.87      1057
         pos       0.68      0.57      0.62       416

    accuracy                           0.80      1473
   macro avg       0.76      0.73      0.74      1473
weighted avg       0.80      0.80      0.80      1473



In [99]:
res = classification_report(preds_df["category"], preds_df["result"], output_dict=True)
results_df.loc["GenericSVM_200d"] = [res["pos"]["f1-score"], res["accuracy"]]

## Bert Sentence Embeddings (sbiobert_base_cased_mli)

Now we will extract [sbiobert_base_cased_mli](https://nlp.johnsnowlabs.com/2020/11/27/sbiobert_base_cased_mli_en.html) embeddings which has 768 dimension output and use this embeddings in the model training.

In [100]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

bert_sent = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", 'en','clinical/models')\
    .setInputCols(["document"])\
    .setOutputCol("sentence_embeddings")

embeddings_pipeline = Pipeline(
    stages = [
        document_assembler,
        bert_sent
])

sbiobert_base_cased_mli download started this may take some time.
Approximate size to download 384.3 MB
[OK!]


In [101]:
trainingData_with_embeddings = embeddings_pipeline.fit(trainingData).transform(trainingData)\
                                                  .select("text","category","sentence_embeddings")

trainingData_with_embeddings.show(2,truncate=60)

+------------------------------------------------------------+--------+------------------------------------------------------------+
|                                                        text|category|                                         sentence_embeddings|
+------------------------------------------------------------+--------+------------------------------------------------------------+
| 'Bail-out' bivalirudin use in patients with thrombotic c...|     neg|[{sentence_embeddings, 0, 150,  'Bail-out' bivalirudin us...|
| (4) The acute rehabilitation of a 21-year-old woman with...|     neg|[{sentence_embeddings, 0, 182,  (4) The acute rehabilitat...|
+------------------------------------------------------------+--------+------------------------------------------------------------+
only showing top 2 rows



In [102]:
testData_with_embeddings = embeddings_pipeline.fit(testData).transform(testData)\
                                                  .select("text","category","sentence_embeddings")

testData_with_embeddings.show(2,truncate=60)

+------------------------------------------------------------+--------+------------------------------------------------------------+
|                                                        text|category|                                         sentence_embeddings|
+------------------------------------------------------------+--------+------------------------------------------------------------+
| A 15-year-old boy had temporary hypertropia, supraductio...|     neg|[{sentence_embeddings, 0, 262,  A 15-year-old boy had tem...|
| A 51-year-old-woman presented with chronic eosinophilia,...|     neg|[{sentence_embeddings, 0, 139,  A 51-year-old-woman prese...|
+------------------------------------------------------------+--------+------------------------------------------------------------+
only showing top 2 rows



In [103]:
log_folder="ADE_logs_bert"

### ClassifierDL

In [104]:
classifier_dl = ClassifierDLApproach()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("class")\
    .setLabelColumn("category")\
    .setBatchSize(32)\
    .setMaxEpochs(20)\
    .setLr(0.001)\
    .setDropout(0.3)\
    .setEnableOutputLogs(True)\
    .setOutputLogsPath(log_folder)

classifier_dl_pipeline = Pipeline(
    stages=[
        classifier_dl
])

In [105]:
clfDL_model_bert = classifier_dl_pipeline.fit(trainingData_with_embeddings)

In [106]:
preds = clfDL_model_bert.transform(testData_with_embeddings)

preds_df = preds.select("category","text","class.result").toPandas()
preds_df["result"] = preds_df["result"].apply(lambda x : x[0])

print (classification_report(preds_df["category"], preds_df["result"]))

              precision    recall  f1-score   support

         neg       0.90      0.94      0.92      1057
         pos       0.82      0.73      0.77       416

    accuracy                           0.88      1473
   macro avg       0.86      0.83      0.84      1473
weighted avg       0.88      0.88      0.88      1473



In [107]:
res = classification_report(preds_df["category"], preds_df["result"], output_dict=True)
results_df.loc["ClassifierDL_bert"] = [res["pos"]["f1-score"], res["accuracy"]]

### MultiClassifierDL

In [108]:
# MultiClassifierDL accepts list of strings as input label. So we convert label column to array type.
trainingData_with_embeddings = trainingData_with_embeddings.withColumn("category_array", F.array(F.col("category")))
testData_with_embeddings = testData_with_embeddings.withColumn("category_array", F.array(F.col("category")))

In [109]:
multiClassifier = MultiClassifierDLApproach()\
    .setInputCols("sentence_embeddings")\
    .setOutputCol("prediction")\
    .setLabelColumn("category_array")\
    .setBatchSize(32)\
    .setMaxEpochs(20)\
    .setLr(9e-3)\
    .setThreshold(0.5)\
    .setShufflePerEpoch(False)\
    .setEnableOutputLogs(True)\
    .setOutputLogsPath(log_folder)\
  #   .setValidationSplit(0.1)

multiClassifier_pipeline = Pipeline(
    stages = [
        multiClassifier
])

In [110]:
multiClassifier_model_bert = multiClassifier_pipeline.fit(trainingData_with_embeddings)

In [111]:
preds = multiClassifier_model_bert.transform(testData_with_embeddings)

In [112]:
preds_df = preds.select("category","text","prediction.result","prediction.metadata").toPandas()
preds_df.result.apply(lambda x: str(x) ).value_counts()

Unnamed: 0_level_0,count
result,Unnamed: 1_level_1
['neg'],1132
['pos'],341


In [113]:
# We will get the highest score label as result. You can control the number of zero label results with setThreshold() in the training.
preds_df["scores"] = preds_df.metadata.apply(lambda x: {k:float(v) for k,v in x[0].items()} if len(x)>=1 else "")
preds_df["result"] = preds_df.scores.apply(lambda x: max(x, key=x.get) if len(x)>=1 else "")
preds_df = preds_df[(preds_df["result"]!="")]
preds_df["result"].value_counts()

Unnamed: 0_level_0,count
result,Unnamed: 1_level_1
neg,1132
pos,341


In [114]:
print (classification_report(preds_df["category"], preds_df["result"]))

              precision    recall  f1-score   support

         neg       0.88      0.94      0.91      1057
         pos       0.83      0.68      0.75       416

    accuracy                           0.87      1473
   macro avg       0.85      0.81      0.83      1473
weighted avg       0.87      0.87      0.86      1473



In [115]:
res = classification_report(preds_df["category"], preds_df["result"], output_dict=True)
results_df.loc["MultiClassifierDL_bert"] = [res["pos"]["f1-score"], res["accuracy"]]

### Generic Classifier

In [116]:
# !pip install -q tensorflow==2.11.0 tensorflow_addons

In [117]:
from sparknlp_jsl.annotator import TFGraphBuilder

graph_folder = "gc_graph"

gc_graph_builder = TFGraphBuilder()\
    .setModelName("generic_classifier")\
    .setInputCols(["features"])\
    .setLabelColumn("category")\
    .setHiddenLayers([300,200, 50])\
    .setHiddenAct("tanh")\
    .setHiddenActL2(True)\
    .setBatchNorm(True)\
    .setGraphFolder(graph_folder)\
    .setGraphFile("gcf_graph_bert.pb")

In [118]:
features_asm = FeaturesAssembler()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("features")

gen_clf = GenericClassifierApproach()\
    .setLabelColumn("category")\
    .setInputCols("features")\
    .setOutputCol("prediction")\
    .setModelFile(f"{graph_folder}/gcf_graph_bert.pb")\
    .setEpochsNumber(20)\
    .setBatchSize(32)\
    .setLearningRate(0.001)\
    .setOutputLogsPath(log_folder)\
    .setDropout(0.5)\
    .setFixImbalance(True)\
    # .setValidationSplit(0.1)

clf_Pipeline = Pipeline(
    stages=[
        features_asm,
        gc_graph_builder,
        gen_clf
])


In [119]:
generic_model_bert = clf_Pipeline.fit(trainingData_with_embeddings)

TF Graph Builder configuration:
Model name: generic_classifier
Graph folder: gc_graph
Graph file name: gcf_graph_bert.pb
Build params: {'input_dim': 768, 'output_dim': 2, 'hidden_layers': [300, 200, 50], 'hidden_act': 'tanh', 'hidden_act_l2': True, 'batch_norm': True}
generic_classifier graph exported to gc_graph/gcf_graph_bert.pb


In [120]:
pred_df = generic_model_bert.transform(testData_with_embeddings)

preds_df = pred_df.select('category','prediction.result').toPandas()
preds_df['result'] = preds_df.result.apply(lambda x : x[0])

print (classification_report(preds_df['category'], preds_df['result']))

              precision    recall  f1-score   support

         neg       0.94      0.89      0.92      1057
         pos       0.76      0.85      0.80       416

    accuracy                           0.88      1473
   macro avg       0.85      0.87      0.86      1473
weighted avg       0.89      0.88      0.88      1473



In [121]:
res = classification_report(preds_df["category"], preds_df["result"], output_dict=True)
results_df.loc["GenericClassifier_bert"] = [res["pos"]["f1-score"], res["accuracy"]]

### GenericLogRegClassifier

In [122]:
from sparknlp_jsl.annotator import TFGraphBuilder

graph_folder = "gc_graph"

gc_logreg_graph_builder = TFGraphBuilder()\
    .setModelName("logreg_classifier")\
    .setInputCols(["feature_vector"]) \
    .setLabelColumn("category")\
    .setGraphFolder(graph_folder)\
    .setGraphFile("log_reg_graph.pb")

GenericLogRegClassifier needs outputs from FeaturesAssembler. The FeaturesAssembler is used to collect features from different columns or an embeddings column.

The GenericLogRegClassifier takes FEATURE_VECTOR annotations as input, classifies them and outputs CATEGORY annotations.

In [123]:
features_asm = FeaturesAssembler()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("feature_vector")

gen_clf = GenericLogRegClassifierApproach()\
    .setLabelColumn("category")\
    .setInputCols("feature_vector")\
    .setOutputCol("prediction")\
    .setModelFile(f"{graph_folder}/log_reg_graph.pb")\
    .setEpochsNumber(20)\
    .setBatchSize(128)\
    .setLearningRate(0.01)\
    .setOutputLogsPath(log_folder)\
    .setDropout(0.1)\
    .setFixImbalance(True)\
    # .setValidationSplit(0.1)

clf_Pipeline = Pipeline(
    stages=[
        features_asm,
        gc_logreg_graph_builder,
        gen_clf
])


In [124]:
generic_model_bert = clf_Pipeline.fit(trainingData_with_embeddings)

TF Graph Builder configuration:
Model name: logreg_classifier
Graph folder: gc_graph
Graph file name: log_reg_graph.pb
Build params: {'input_dim': 768, 'output_dim': 2, 'hidden_layers': [], 'output_act': 'sigmoid'}
logreg_classifier graph exported to gc_graph/log_reg_graph.pb


In [125]:
pred_df = generic_model_bert.transform(testData_with_embeddings)

In [126]:
preds_df = pred_df.select('category','prediction.result').toPandas()
preds_df['result'] = preds_df.result.apply(lambda x : x[0])

print (classification_report(preds_df['category'], preds_df['result']))

              precision    recall  f1-score   support

         neg       0.92      0.93      0.92      1057
         pos       0.82      0.80      0.81       416

    accuracy                           0.89      1473
   macro avg       0.87      0.86      0.87      1473
weighted avg       0.89      0.89      0.89      1473



In [127]:
res = classification_report(preds_df["category"], preds_df["result"], output_dict=True)
results_df.loc["GenericLogReg_bert"] = [res["pos"]["f1-score"], res["accuracy"]]

### GenericSVMClassifier

In [128]:
from sparknlp_jsl.annotator import TFGraphBuilder

graph_folder = "gc_graph"

gc_svm_graph_builder = TFGraphBuilder()\
    .setModelName("svm_classifier")\
    .setInputCols(["feature_vector"]) \
    .setLabelColumn("category")\
    .setGraphFolder(graph_folder)\
    .setGraphFile("svm_graph.pb")

GenericSVMClassifier needs outputs from FeaturesAssembler. The FeaturesAssembler is used to collect features from different columns or an embeddings column.

The GenericSVMClassifier takes FEATURE_VECTOR annotations as input, classifies them and outputs CATEGORY annotations.

In [129]:
features_asm = FeaturesAssembler()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("feature_vector")

gen_clf = GenericSVMClassifierApproach()\
    .setLabelColumn("category")\
    .setInputCols("feature_vector")\
    .setOutputCol("prediction")\
    .setModelFile(f"{graph_folder}/svm_graph.pb")\
    .setEpochsNumber(20)\
    .setBatchSize(128)\
    .setLearningRate(0.005)\
    .setOutputLogsPath(log_folder)\
    .setDropout(0.1)\
    .setFixImbalance(True)\
    # .setValidationSplit(0.1)

clf_Pipeline = Pipeline(
    stages=[
        features_asm,
        gc_svm_graph_builder,
        gen_clf
])


In [130]:
generic_model_bert = clf_Pipeline.fit(trainingData_with_embeddings)

TF Graph Builder configuration:
Model name: svm_classifier
Graph folder: gc_graph
Graph file name: svm_graph.pb
Build params: {'input_dim': 768, 'output_dim': 2, 'hidden_layers': [], 'output_act': 'sigmoid', 'loss_func': 'hinge'}
svm_classifier graph exported to gc_graph/svm_graph.pb


In [131]:
pred_df = generic_model_bert.transform(testData_with_embeddings)

In [132]:
preds_df = pred_df.select('category','prediction.result').toPandas()
preds_df['result'] = preds_df.result.apply(lambda x : x[0])

print (classification_report(preds_df['category'], preds_df['result']))

              precision    recall  f1-score   support

         neg       0.88      0.96      0.92      1057
         pos       0.88      0.65      0.75       416

    accuracy                           0.88      1473
   macro avg       0.88      0.81      0.83      1473
weighted avg       0.88      0.88      0.87      1473



In [133]:
res = classification_report(preds_df["category"], preds_df["result"], output_dict=True)
results_df.loc["GenericSVM_bert"] = [res["pos"]["f1-score"], res["accuracy"]]

## DocumentLogRegClassifier

In [134]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

normalizer = Normalizer() \
    .setInputCols(["token"]) \
    .setOutputCol("normalized")

stopwords_cleaner = StopWordsCleaner()\
    .setInputCols("normalized")\
    .setOutputCol("cleanTokens")\
    .setCaseSensitive(False)

stemmer = Stemmer() \
    .setInputCols(["cleanTokens"]) \
    .setOutputCol("stem")

logreg = DocumentLogRegClassifierApproach()\
    .setInputCols("stem")\
    .setLabelCol("category")\
    .setOutputCol("prediction")\
    .setMaxIter(10)\
    .setTol(1e-6)

clf_Pipeline = Pipeline(
    stages=[
        document_assembler,
        tokenizer,
        normalizer,
        stopwords_cleaner,
        stemmer,
        logreg
])

doclogreg_model = clf_Pipeline.fit(trainingData)

In [135]:
pred_df = doclogreg_model.transform(testData)

preds_df = pred_df.select('category','prediction.result').toPandas()
preds_df['result'] = preds_df.result.apply(lambda x : x[0])

print (classification_report(preds_df['category'], preds_df['result']))

              precision    recall  f1-score   support

         neg       0.87      0.85      0.86      1057
         pos       0.64      0.67      0.65       416

    accuracy                           0.80      1473
   macro avg       0.75      0.76      0.76      1473
weighted avg       0.80      0.80      0.80      1473



In [136]:
res = classification_report(preds_df["category"], preds_df["result"], output_dict=True)
results_df.loc["DocumentLogRegClassifier"] = [res["pos"]["f1-score"], res["accuracy"]]

## Comparision of All Models' Result in ADE Dataset

Here is the all results of classifier models. Only positive ADE f1 scores and accuracy scores are presented.

In [137]:
results_df

Unnamed: 0,pos-f1-score,accuracy
ClassifierDL_100d,0.0,0.717583
MultiClassifierDL_100d,0.68187,0.810591
GenericClassifier_100d,0.684096,0.803123
GenericLogReg_100d,0.508221,0.776646
GenericSVM_100d,0.317164,0.751527
ClassifierDL_200d,0.714465,0.845893
MultiClassifierDL_200d,0.716233,0.844535
GenericClassifier_200d,0.645273,0.722335
GenericLogReg_200d,0.597183,0.805838
GenericSVM_200d,0.6206,0.802444


# Mtsamples Dataset

❗ **PLEASE RE-START RUNTIME AND CONTINUE**

In [138]:
import json
import os

with open('spark_jsl.json') as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)
os.environ.update(license_keys)

In [139]:
import pandas as pd
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

import sparknlp
import sparknlp_jsl

from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp_jsl.base import *
from sparknlp_jsl.annotator import *

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline,PipelineModel

import warnings
warnings.filterwarnings('ignore')

params = {"spark.driver.memory":"24G",
          "spark.kryoserializer.buffer.max":"2000M",
          "spark.driver.maxResultSize":"2000M"}

spark = sparknlp_jsl.start(license_keys['SECRET'],params=params, gpu=True)

print ("Spark NLP Version :", sparknlp.version())
print ("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark

Spark NLP Version : 6.1.3
Spark NLP_JSL Version : 6.1.1


## Load Dataset

In [140]:
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/mtsamples_classifier.csv

In [141]:
spark_df = spark.read.csv("mtsamples_classifier.csv", header = True)

spark_df.show(10,truncate=100)

+----------------+----------------------------------------------------------------------------------------------------+
|        category|                                                                                                text|
+----------------+----------------------------------------------------------------------------------------------------+
|Gastroenterology| PROCEDURES PERFORMED: Colonoscopy. INDICATIONS: Renewed symptoms likely consistent with active f...|
|Gastroenterology| OPERATION 1. Ivor-Lewis esophagogastrectomy. 2. Feeding jejunostomy. 3. Placement of two right-s...|
|Gastroenterology| PREOPERATIVE DIAGNOSES: 1. Gastroesophageal reflux disease. 2. Chronic dyspepsia. POSTOPERATIVE ...|
|Gastroenterology| PROCEDURE: Colonoscopy. PREOPERATIVE DIAGNOSES: Rectal bleeding and perirectal abscess. POSTOPER...|
|Gastroenterology| PREOPERATIVE DIAGNOSIS: Right colon tumor. POSTOPERATIVE DIAGNOSES: 1. Right colon cancer. 2. As...|
|Gastroenterology| PREOPERATIVE DIAGNOSI

In [142]:
spark_df.printSchema()

root
 |-- category: string (nullable = true)
 |-- text: string (nullable = true)



In [143]:
spark_df.count()

638

In [144]:
spark_df.groupBy("category").count().show()

+----------------+-----+
|        category|count|
+----------------+-----+
|         Urology|  115|
|       Neurology|  143|
|      Orthopedic|  223|
|Gastroenterology|  157|
+----------------+-----+



In [145]:
(trainingData, testData) = spark_df.randomSplit([0.8, 0.2], seed = 42)

In [146]:
trainingData.groupBy("category").count().show()

+----------------+-----+
|        category|count|
+----------------+-----+
|         Urology|   95|
|       Neurology|  121|
|      Orthopedic|  188|
|Gastroenterology|  132|
+----------------+-----+



In [147]:
testData.groupBy("category").count().show()

+----------------+-----+
|        category|count|
+----------------+-----+
|         Urology|   20|
|       Neurology|   22|
|      Orthopedic|   35|
|Gastroenterology|   25|
+----------------+-----+



## 100 Dimension Healthcare Embeddings (embeddings_healthcare_100d)



We will extract [healthcare_100d embeddings](https://nlp.johnsnowlabs.com/2020/05/29/embeddings_healthcare_100d_en.html) and use it in the classificaiton model training.

In [148]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_healthcare_100d","en","clinical/models")\
    .setInputCols(["document","token"])\
    .setOutputCol("word_embeddings")

sentence_embeddings = SentenceEmbeddings() \
    .setInputCols(["document", "word_embeddings"]) \
    .setOutputCol("sentence_embeddings") \
    .setPoolingStrategy("AVERAGE")

embeddings_pipeline = Pipeline(
    stages = [
        document_assembler,
        tokenizer,
        word_embeddings,
        sentence_embeddings,
])

embeddings_healthcare_100d download started this may take some time.
Approximate size to download 475.8 MB
[OK!]


In [149]:
trainingData_with_embeddings = embeddings_pipeline.fit(trainingData).transform(trainingData)
trainingData_with_embeddings = trainingData_with_embeddings.select("text","category","sentence_embeddings")
trainingData_with_embeddings.show(2,truncate=60)

+------------------------------------------------------------+----------------+------------------------------------------------------------+
|                                                        text|        category|                                         sentence_embeddings|
+------------------------------------------------------------+----------------+------------------------------------------------------------+
| ADMISSION DIAGNOSIS: Symptomatic cholelithiasis. DISCHAR...|Gastroenterology|[{sentence_embeddings, 0, 2228,  ADMISSION DIAGNOSIS: Sym...|
| ADMITTING DIAGNOSES: Hiatal hernia, gastroesophageal ref...|Gastroenterology|[{sentence_embeddings, 0, 3237,  ADMITTING DIAGNOSES: Hia...|
+------------------------------------------------------------+----------------+------------------------------------------------------------+
only showing top 2 rows



In [150]:
testData_with_embeddings = embeddings_pipeline.fit(testData).transform(testData)
testData_with_embeddings = testData_with_embeddings.select("text","category","sentence_embeddings")
testData_with_embeddings.show(2,truncate=60)

+------------------------------------------------------------+----------------+------------------------------------------------------------+
|                                                        text|        category|                                         sentence_embeddings|
+------------------------------------------------------------+----------------+------------------------------------------------------------+
| ADMITTING DIAGNOSIS: Gastrointestinal bleed. HISTORY OF ...|Gastroenterology|[{sentence_embeddings, 0, 3978,  ADMITTING DIAGNOSIS: Gas...|
| CHIEF COMPLAINT: Dysphagia and hematemesis while vomitin...|Gastroenterology|[{sentence_embeddings, 0, 6515,  CHIEF COMPLAINT: Dysphag...|
+------------------------------------------------------------+----------------+------------------------------------------------------------+
only showing top 2 rows



In [151]:
testData_with_embeddings.printSchema()

root
 |-- text: string (nullable = true)
 |-- category: string (nullable = true)
 |-- sentence_embeddings: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)



In [152]:
testData_with_embeddings.select(testData_with_embeddings.sentence_embeddings.embeddings).show(3,truncate=120)

+------------------------------------------------------------------------------------------------------------------------+
|                                                                                          sentence_embeddings.embeddings|
+------------------------------------------------------------------------------------------------------------------------+
|[[-0.010433162, 0.0127568655, 0.110687375, 0.1609855, -0.14818177, -0.050856464, 0.064748034, -0.019294674, -0.002178...|
|[[-0.0055608996, 0.018396137, 0.12305377, 0.13861515, -0.18651922, -0.07559964, 0.12978752, -0.0046452526, -0.0031408...|
|[[-0.019461514, 0.060639214, 0.11142021, 0.12826534, -0.16216077, -0.10416892, 0.17136306, -0.020363769, 0.01169337, ...|
+------------------------------------------------------------------------------------------------------------------------+
only showing top 3 rows



In [153]:
log_folder="Mt_logs_healthcare_100d"
!mkdir -p $log_folder

### ClassifierDL

In [154]:
classifier_dl = ClassifierDLApproach()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("prediction")\
    .setLabelColumn("category")\
    .setBatchSize(8)\
    .setMaxEpochs(50)\
    .setLr(0.005)\
    .setDropout(0.3)\
    .setEnableOutputLogs(True)\
    .setOutputLogsPath(log_folder)\
    # .setValidationSplit(0.1)

classifier_dl_pipeline = Pipeline(stages=[classifier_dl])

In [155]:
clfDL_model_hc100 = classifier_dl_pipeline.fit(trainingData_with_embeddings)

In [156]:
preds = clfDL_model_hc100.transform(testData_with_embeddings)

In [157]:
preds_df = preds.select("category","text","prediction.result").toPandas()
preds_df["result"] = preds_df["result"].apply(lambda x : x[0])

print (classification_report(preds_df["category"], preds_df["result"]))

                  precision    recall  f1-score   support

Gastroenterology       0.88      0.84      0.86        25
       Neurology       0.80      0.73      0.76        22
      Orthopedic       0.89      0.94      0.92        35
         Urology       0.71      0.75      0.73        20

        accuracy                           0.83       102
       macro avg       0.82      0.82      0.82       102
    weighted avg       0.83      0.83      0.83       102



In [158]:
res = classification_report(preds_df["category"], preds_df["result"], output_dict=True)
results_df = pd.DataFrame(columns=["macro-f1-score","weighted-f1-score","accuracy"])
results_df.loc["ClassifierDL_100d"] = [res["macro avg"]["f1-score"], res["weighted avg"]["f1-score"], res["accuracy"]]
results_df

Unnamed: 0,macro-f1-score,weighted-f1-score,accuracy
ClassifierDL_100d,0.816855,0.832431,0.833333


### MultiClassifierDL

In [159]:
# MultiClassifierDL accepts list of strings as input label. So we convert label column to array type.
trainingData_with_embeddings = trainingData_with_embeddings.withColumn("category_array", F.array(F.col("category")))
testData_with_embeddings = testData_with_embeddings.withColumn("category_array", F.array(F.col("category")))

In [160]:
multiClassifier = MultiClassifierDLApproach()\
    .setInputCols("sentence_embeddings")\
    .setOutputCol("prediction")\
    .setLabelColumn("category_array")\
    .setBatchSize(8)\
    .setMaxEpochs(40)\
    .setLr(5e-3)\
    .setThreshold(0.5)\
    .setShufflePerEpoch(False)\
    .setEnableOutputLogs(True)\
    .setOutputLogsPath(log_folder)\
  #   .setValidationSplit(0.1)

multiClassifier_pipeline = Pipeline(
    stages = [
        multiClassifier
    ])

In [161]:
multiClassifier_model_hc100 = multiClassifier_pipeline.fit(trainingData_with_embeddings)

In [162]:
preds = multiClassifier_model_hc100.transform(testData_with_embeddings)

In [163]:
preds_df = preds.select("category","text","prediction.result","prediction.metadata").toPandas()
preds_df.result.apply(lambda x: str(x) ).value_counts()

Unnamed: 0_level_0,count
result,Unnamed: 1_level_1
['Orthopedic'],35
['Urology'],21
['Gastroenterology'],19
['Neurology'],17
[],5
"['Neurology', 'Orthopedic']",4
"['Orthopedic', 'Gastroenterology']",1


In [164]:
# We will get the highest score label as result. You can control the number of zero label results with setThreshold() in the training.
preds_df["scores"] = preds_df.metadata.apply(lambda x: {k:float(v) for k,v in x[0].items()} if len(x)>=1 else "")
preds_df["result"] = preds_df.scores.apply(lambda x: max(x, key=x.get) if len(x)>=1 else "")
preds_df = preds_df[(preds_df["result"]!="")]
preds_df["result"].value_counts()

Unnamed: 0_level_0,count
result,Unnamed: 1_level_1
Orthopedic,36
Urology,21
Gastroenterology,20
Neurology,20


In [165]:
print (classification_report(preds_df["category"], preds_df["result"]))

                  precision    recall  f1-score   support

Gastroenterology       0.95      0.86      0.90        22
       Neurology       0.80      0.76      0.78        21
      Orthopedic       0.89      0.91      0.90        35
         Urology       0.90      1.00      0.95        19

        accuracy                           0.89        97
       macro avg       0.89      0.88      0.88        97
    weighted avg       0.89      0.89      0.89        97



In [166]:
res = classification_report(preds_df["category"], preds_df["result"], output_dict=True)
results_df.loc["MultiClassifierDL_100d"] = [res["macro avg"]["f1-score"], res["weighted avg"]["f1-score"], res["accuracy"]]

### Generic Classifier

In [167]:
from sparknlp_jsl.annotator import TFGraphBuilder

graph_folder = "gc_graph"

gc_graph_builder = TFGraphBuilder()\
    .setModelName("generic_classifier")\
    .setInputCols(["features"])\
    .setLabelColumn("category")\
    .setHiddenLayers([300,200, 50])\
    .setHiddenAct("tanh")\
    .setHiddenActL2(True)\
    .setBatchNorm(True)\
    .setGraphFolder(graph_folder)\
    .setGraphFile("gcf_graph_100d.pb")

In [168]:
features_asm = FeaturesAssembler()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("features")

gen_clf = GenericClassifierApproach()\
    .setLabelColumn("category")\
    .setInputCols("features")\
    .setOutputCol("prediction")\
    .setModelFile(f"{graph_folder}/gcf_graph_100d.pb")\
    .setEpochsNumber(50)\
    .setBatchSize(32)\
    .setLearningRate(0.001)\
    .setOutputLogsPath(log_folder)\
    .setDropout(0.3)\
    .setFixImbalance(False)\
    # .setValidationSplit(0.1)

clf_Pipeline = Pipeline(
    stages=[
        features_asm,
        gc_graph_builder,
        gen_clf
])

In [169]:
generic_model_hc100 = clf_Pipeline.fit(trainingData_with_embeddings)

TF Graph Builder configuration:
Model name: generic_classifier
Graph folder: gc_graph
Graph file name: gcf_graph_100d.pb
Build params: {'input_dim': 100, 'output_dim': 4, 'hidden_layers': [300, 200, 50], 'hidden_act': 'tanh', 'hidden_act_l2': True, 'batch_norm': True}
generic_classifier graph exported to gc_graph/gcf_graph_100d.pb


In [170]:
pred_df = generic_model_hc100.transform(testData_with_embeddings)

In [171]:
preds_df = pred_df.select('category','prediction.result').toPandas()
preds_df['result'] = preds_df.result.apply(lambda x : x[0])

print (classification_report(preds_df['category'], preds_df['result']))

                  precision    recall  f1-score   support

Gastroenterology       0.96      0.88      0.92        25
       Neurology       0.77      0.77      0.77        22
      Orthopedic       0.86      0.91      0.89        35
         Urology       0.90      0.90      0.90        20

        accuracy                           0.87       102
       macro avg       0.87      0.87      0.87       102
    weighted avg       0.87      0.87      0.87       102



In [172]:
res = classification_report(preds_df["category"], preds_df["result"], output_dict=True)
results_df.loc["GenericClassifier_100d"] = [res["macro avg"]["f1-score"], res["weighted avg"]["f1-score"], res["accuracy"]]

### GenericLogRegClassifier

In [173]:
from sparknlp_jsl.annotator import TFGraphBuilder

graph_folder = "gc_graph"

gc_logreg_graph_builder = TFGraphBuilder()\
    .setModelName("logreg_classifier")\
    .setInputCols(["feature_vector"]) \
    .setLabelColumn("category")\
    .setGraphFolder(graph_folder)\
    .setGraphFile("log_reg_graph.pb")

GenericLogRegClassifier needs outputs from FeaturesAssembler. The FeaturesAssembler is used to collect features from different columns or an embeddings column.

The GenericLogRegClassifier takes FEATURE_VECTOR annotations as input, classifies them and outputs CATEGORY annotations.

In [174]:
features_asm = FeaturesAssembler()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("feature_vector")

gen_clf = GenericLogRegClassifierApproach()\
    .setLabelColumn("category")\
    .setInputCols("feature_vector")\
    .setOutputCol("prediction")\
    .setModelFile(f"{graph_folder}/log_reg_graph.pb")\
    .setEpochsNumber(45)\
    .setBatchSize(32)\
    .setLearningRate(0.02)\
    .setOutputLogsPath(log_folder)\
    .setDropout(0.1)\
    .setFixImbalance(True)\
    # .setValidationSplit(0.1)

clf_Pipeline = Pipeline(
    stages=[
        features_asm,
        gc_logreg_graph_builder,
        gen_clf
])


In [175]:
generic_model_hc100 = clf_Pipeline.fit(trainingData_with_embeddings)

TF Graph Builder configuration:
Model name: logreg_classifier
Graph folder: gc_graph
Graph file name: log_reg_graph.pb
Build params: {'input_dim': 100, 'output_dim': 4, 'hidden_layers': [], 'output_act': 'sigmoid'}
logreg_classifier graph exported to gc_graph/log_reg_graph.pb


In [176]:
pred_df = generic_model_hc100.transform(testData_with_embeddings)

In [177]:
preds_df = pred_df.select('category','prediction.result').toPandas()
preds_df['result'] = preds_df.result.apply(lambda x : x[0])

print (classification_report(preds_df['category'], preds_df['result']))

                  precision    recall  f1-score   support

Gastroenterology       0.92      0.88      0.90        25
       Neurology       0.81      0.77      0.79        22
      Orthopedic       0.84      0.91      0.88        35
         Urology       0.89      0.85      0.87        20

        accuracy                           0.86       102
       macro avg       0.87      0.85      0.86       102
    weighted avg       0.86      0.86      0.86       102



In [178]:
res = classification_report(preds_df["category"], preds_df["result"], output_dict=True)
results_df.loc["GenericLogReg_100d"] = [res["macro avg"]["f1-score"], res["weighted avg"]["f1-score"], res["accuracy"]]

### GenericSVMClassifier

In [179]:
from sparknlp_jsl.annotator import TFGraphBuilder

graph_folder = "gc_graph"

gc_svm_graph_builder = TFGraphBuilder()\
    .setModelName("svm_classifier")\
    .setInputCols(["feature_vector"]) \
    .setLabelColumn("category")\
    .setGraphFolder(graph_folder)\
    .setGraphFile("svm_graph.pb")

GenericSVMClassifier needs outputs from FeaturesAssembler. The FeaturesAssembler is used to collect features from different columns or an embeddings column.

The GenericSVMClassifier takes FEATURE_VECTOR annotations as input, classifies them and outputs CATEGORY annotations.

In [180]:
features_asm = FeaturesAssembler()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("feature_vector")

gen_clf = GenericSVMClassifierApproach()\
    .setLabelColumn("category")\
    .setInputCols("feature_vector")\
    .setOutputCol("prediction")\
    .setModelFile(f"{graph_folder}/svm_graph.pb")\
    .setEpochsNumber(45)\
    .setBatchSize(32)\
    .setLearningRate(0.02)\
    .setOutputLogsPath(log_folder)\
    .setDropout(0.1)\
    .setFixImbalance(True)\
    # .setValidationSplit(0.1)

clf_Pipeline = Pipeline(
    stages=[
        features_asm,
        gc_svm_graph_builder,
        gen_clf
])

In [181]:
generic_model_hc100 = clf_Pipeline.fit(trainingData_with_embeddings)

TF Graph Builder configuration:
Model name: svm_classifier
Graph folder: gc_graph
Graph file name: svm_graph.pb
Build params: {'input_dim': 100, 'output_dim': 4, 'hidden_layers': [], 'output_act': 'sigmoid', 'loss_func': 'hinge'}
svm_classifier graph exported to gc_graph/svm_graph.pb


In [182]:
pred_df = generic_model_hc100.transform(testData_with_embeddings)

In [183]:
preds_df = pred_df.select('category','prediction.result').toPandas()
preds_df['result'] = preds_df.result.apply(lambda x : x[0])

print (classification_report(preds_df['category'], preds_df['result']))

                  precision    recall  f1-score   support

Gastroenterology       0.91      0.84      0.88        25
       Neurology       0.82      0.82      0.82        22
      Orthopedic       0.89      0.91      0.90        35
         Urology       0.81      0.85      0.83        20

        accuracy                           0.86       102
       macro avg       0.86      0.86      0.86       102
    weighted avg       0.86      0.86      0.86       102



In [184]:
res = classification_report(preds_df["category"], preds_df["result"], output_dict=True)
results_df.loc["GenericSVM_100d"] = [res["macro avg"]["f1-score"], res["weighted avg"]["f1-score"], res["accuracy"]]

## 200 Dimension Healthcare Embeddings (embeddings_clinical)



Now we will extract [embeddings_clinical](https://nlp.johnsnowlabs.com/2020/01/28/embeddings_clinical_en.html) embeddings which has 200 dimension output and use this embeddings in the model training.

In [185]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")\
    .setInputCols(["document","token"])\
    .setOutputCol("word_embeddings")

sentence_embeddings = SentenceEmbeddings() \
    .setInputCols(["document", "word_embeddings"]) \
    .setOutputCol("sentence_embeddings") \
    .setPoolingStrategy("AVERAGE")

embeddings_pipeline = Pipeline(
    stages = [
        document_assembler,
        tokenizer,
        word_embeddings,
        sentence_embeddings,
])

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]


In [186]:
trainingData_with_embeddings = embeddings_pipeline.fit(trainingData).transform(trainingData)\
                                                  .select("text","category","sentence_embeddings")
trainingData_with_embeddings.show(2,truncate=60)

+------------------------------------------------------------+----------------+------------------------------------------------------------+
|                                                        text|        category|                                         sentence_embeddings|
+------------------------------------------------------------+----------------+------------------------------------------------------------+
| ADMISSION DIAGNOSIS: Symptomatic cholelithiasis. DISCHAR...|Gastroenterology|[{sentence_embeddings, 0, 2228,  ADMISSION DIAGNOSIS: Sym...|
| ADMITTING DIAGNOSES: Hiatal hernia, gastroesophageal ref...|Gastroenterology|[{sentence_embeddings, 0, 3237,  ADMITTING DIAGNOSES: Hia...|
+------------------------------------------------------------+----------------+------------------------------------------------------------+
only showing top 2 rows



In [187]:
testData_with_embeddings = embeddings_pipeline.fit(testData).transform(testData)\
                                                  .select("text","category","sentence_embeddings")
testData_with_embeddings.show(2,truncate=60)

+------------------------------------------------------------+----------------+------------------------------------------------------------+
|                                                        text|        category|                                         sentence_embeddings|
+------------------------------------------------------------+----------------+------------------------------------------------------------+
| ADMITTING DIAGNOSIS: Gastrointestinal bleed. HISTORY OF ...|Gastroenterology|[{sentence_embeddings, 0, 3978,  ADMITTING DIAGNOSIS: Gas...|
| CHIEF COMPLAINT: Dysphagia and hematemesis while vomitin...|Gastroenterology|[{sentence_embeddings, 0, 6515,  CHIEF COMPLAINT: Dysphag...|
+------------------------------------------------------------+----------------+------------------------------------------------------------+
only showing top 2 rows



In [188]:
log_folder="Mt_logs_healthcare_200d"

### ClassifierDL

In [189]:
classifier_dl = ClassifierDLApproach()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("class")\
    .setLabelColumn("category")\
    .setBatchSize(32)\
    .setMaxEpochs(30)\
    .setLr(0.01)\
    .setDropout(0.1)\
    .setEnableOutputLogs(True)\
    .setOutputLogsPath(log_folder)

classifier_dl_pipeline = Pipeline(stages=[classifier_dl])

In [190]:
clfDL_model_hc200 = classifier_dl_pipeline.fit(trainingData_with_embeddings)

In [191]:
preds = clfDL_model_hc200.transform(testData_with_embeddings)

preds_df = preds.select("category","text","class.result").toPandas()
preds_df["result"] = preds_df["result"].apply(lambda x : x[0])

print (classification_report(preds_df["category"], preds_df["result"]))

                  precision    recall  f1-score   support

Gastroenterology       0.00      0.00      0.00        25
       Neurology       0.00      0.00      0.00        22
      Orthopedic       0.00      0.00      0.00        35
         Urology       0.20      1.00      0.33        20

        accuracy                           0.20       102
       macro avg       0.05      0.25      0.08       102
    weighted avg       0.04      0.20      0.06       102



In [192]:
res = classification_report(preds_df["category"], preds_df["result"], output_dict=True)
results_df.loc["ClassifierDL_200d"] = [res["macro avg"]["f1-score"], res["weighted avg"]["f1-score"], res["accuracy"]]

### MultiClassifierDL



In [193]:
# MultiClassifierDL accepts list of strings as input label. So we convert label column to array type.
trainingData_with_embeddings = trainingData_with_embeddings.withColumn("category_array", F.array(F.col("category")))
testData_with_embeddings = testData_with_embeddings.withColumn("category_array", F.array(F.col("category")))

In [194]:
multiClassifier = MultiClassifierDLApproach()\
    .setInputCols("sentence_embeddings")\
    .setOutputCol("prediction")\
    .setLabelColumn("category_array")\
    .setBatchSize(16)\
    .setMaxEpochs(20)\
    .setLr(9e-3)\
    .setThreshold(0.2)\
    .setShufflePerEpoch(False)\
    .setEnableOutputLogs(True)\
    .setOutputLogsPath(log_folder)\
  #   .setValidationSplit(0.1)

multiClassifier_pipeline = Pipeline(
    stages = [
        multiClassifier
])

In [195]:
multiClassifier_model_hc200 = multiClassifier_pipeline.fit(trainingData_with_embeddings)

In [196]:
preds = multiClassifier_model_hc200.transform(testData_with_embeddings)

In [197]:
preds_df = preds.select("category","text","prediction.result","prediction.metadata").toPandas()
preds_df.result.apply(lambda x: str(x) ).value_counts()

Unnamed: 0_level_0,count
result,Unnamed: 1_level_1
['Orthopedic'],32
['Gastroenterology'],19
['Urology'],18
['Neurology'],13
"['Neurology', 'Orthopedic']",11
[],4
"['Orthopedic', 'Gastroenterology']",3
"['Urology', 'Orthopedic']",1
"['Urology', 'Gastroenterology']",1


In [198]:
# We will get the highest score label as result. you can control the number of zero label results with setThreshold() in the training.
preds_df["scores"] = preds_df.metadata.apply(lambda x: {k:float(v) for k,v in x[0].items()} if len(x)>=1 else "")
preds_df["result"] = preds_df.scores.apply(lambda x: max(x, key=x.get) if len(x)>=1 else "")
preds_df = preds_df[(preds_df["result"]!="")]
preds_df["result"].value_counts()

Unnamed: 0_level_0,count
result,Unnamed: 1_level_1
Orthopedic,38
Gastroenterology,22
Urology,20
Neurology,18


In [199]:
print (classification_report(preds_df["category"], preds_df["result"]))

                  precision    recall  f1-score   support

Gastroenterology       1.00      0.92      0.96        24
       Neurology       0.89      0.80      0.84        20
      Orthopedic       0.87      0.94      0.90        35
         Urology       0.90      0.95      0.92        19

        accuracy                           0.91        98
       macro avg       0.91      0.90      0.91        98
    weighted avg       0.91      0.91      0.91        98



In [200]:
res = classification_report(preds_df["category"], preds_df["result"], output_dict=True)
results_df.loc["MultiClassifierDL_200d"] = [res["macro avg"]["f1-score"], res["weighted avg"]["f1-score"], res["accuracy"]]

### Generic Classifier

In [201]:
# !pip install -q tensorflow==2.7.0 tensorflow_addons

In [202]:
from sparknlp_jsl.annotator import TFGraphBuilder

graph_folder = "gc_graph"

gc_graph_builder = TFGraphBuilder()\
    .setModelName("generic_classifier")\
    .setInputCols(["features"])\
    .setLabelColumn("category")\
    .setHiddenLayers([300,200, 50])\
    .setHiddenAct("tanh")\
    .setHiddenActL2(True)\
    .setBatchNorm(True)\
    .setGraphFolder(graph_folder)\
    .setGraphFile("gcf_graph_200d.pb")

In [203]:
features_asm = FeaturesAssembler()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("features")

gen_clf = GenericClassifierApproach()\
    .setLabelColumn("category")\
    .setInputCols("features")\
    .setOutputCol("prediction")\
    .setModelFile(f"{graph_folder}/gcf_graph_200d.pb")\
    .setEpochsNumber(30)\
    .setBatchSize(32)\
    .setLearningRate(0.001)\
    .setOutputLogsPath(log_folder)\
    .setDropout(0.3)\
    .setFixImbalance(True)\
    # .setValidationSplit(0.1)

clf_Pipeline = Pipeline(
    stages=[
        features_asm,
        gc_graph_builder,
        gen_clf
])

In [204]:
generic_model_hc200 = clf_Pipeline.fit(trainingData_with_embeddings)

TF Graph Builder configuration:
Model name: generic_classifier
Graph folder: gc_graph
Graph file name: gcf_graph_200d.pb
Build params: {'input_dim': 200, 'output_dim': 4, 'hidden_layers': [300, 200, 50], 'hidden_act': 'tanh', 'hidden_act_l2': True, 'batch_norm': True}
generic_classifier graph exported to gc_graph/gcf_graph_200d.pb


In [205]:
pred_df = generic_model_hc200.transform(testData_with_embeddings)

preds_df = pred_df.select('category','prediction.result').toPandas()
preds_df['result'] = preds_df.result.apply(lambda x : x[0])

print (classification_report(preds_df['category'], preds_df['result']))

                  precision    recall  f1-score   support

Gastroenterology       1.00      0.72      0.84        25
       Neurology       0.75      0.82      0.78        22
      Orthopedic       0.91      0.89      0.90        35
         Urology       0.73      0.95      0.83        20

        accuracy                           0.84       102
       macro avg       0.85      0.84      0.84       102
    weighted avg       0.86      0.84      0.84       102



In [206]:
res = classification_report(preds_df["category"], preds_df["result"], output_dict=True)
results_df.loc["GenericClassifier_200d"] = [res["macro avg"]["f1-score"], res["weighted avg"]["f1-score"], res["accuracy"]]

### GenericLogRegClassifier

In [207]:
from sparknlp_jsl.annotator import TFGraphBuilder

graph_folder = "gc_graph"

gc_logreg_graph_builder = TFGraphBuilder()\
    .setModelName("logreg_classifier")\
    .setInputCols(["feature_vector"]) \
    .setLabelColumn("category")\
    .setGraphFolder(graph_folder)\
    .setGraphFile("log_reg_graph.pb")

GenericLogRegClassifier needs outputs from FeaturesAssembler. The FeaturesAssembler is used to collect features from different columns or an embeddings column.

The GenericLogRegClassifier takes FEATURE_VECTOR annotations as input, classifies them and outputs CATEGORY annotations.

In [208]:
features_asm = FeaturesAssembler()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("feature_vector")

gen_clf = GenericLogRegClassifierApproach()\
    .setLabelColumn("category")\
    .setInputCols("feature_vector")\
    .setOutputCol("prediction")\
    .setModelFile(f"{graph_folder}/log_reg_graph.pb")\
    .setEpochsNumber(45)\
    .setBatchSize(32)\
    .setLearningRate(0.02)\
    .setOutputLogsPath(log_folder)\
    .setDropout(0.1)\
    .setFixImbalance(True)\
    # .setValidationSplit(0.1)

clf_Pipeline = Pipeline(
    stages=[
        features_asm,
        gc_logreg_graph_builder,
        gen_clf
])


In [209]:
generic_model_hc200 = clf_Pipeline.fit(trainingData_with_embeddings)

TF Graph Builder configuration:
Model name: logreg_classifier
Graph folder: gc_graph
Graph file name: log_reg_graph.pb
Build params: {'input_dim': 200, 'output_dim': 4, 'hidden_layers': [], 'output_act': 'sigmoid'}
logreg_classifier graph exported to gc_graph/log_reg_graph.pb


In [210]:
pred_df = generic_model_hc200.transform(testData_with_embeddings)

In [211]:
preds_df = pred_df.select('category','prediction.result').toPandas()
preds_df['result'] = preds_df.result.apply(lambda x : x[0])

print (classification_report(preds_df['category'], preds_df['result']))

                  precision    recall  f1-score   support

Gastroenterology       0.81      0.84      0.82        25
       Neurology       0.83      0.68      0.75        22
      Orthopedic       0.84      0.91      0.88        35
         Urology       0.80      0.80      0.80        20

        accuracy                           0.82       102
       macro avg       0.82      0.81      0.81       102
    weighted avg       0.82      0.82      0.82       102



In [212]:
res = classification_report(preds_df["category"], preds_df["result"], output_dict=True)
results_df.loc["GenericLogReg_200d"] = [res["macro avg"]["f1-score"], res["weighted avg"]["f1-score"], res["accuracy"]]

### GenericSVMClassifier

In [213]:
from sparknlp_jsl.annotator import TFGraphBuilder

graph_folder = "gc_graph"

gc_svm_graph_builder = TFGraphBuilder()\
    .setModelName("svm_classifier")\
    .setInputCols(["feature_vector"]) \
    .setLabelColumn("category")\
    .setGraphFolder(graph_folder)\
    .setGraphFile("svm_graph.pb")

GenericSVMClassifier needs outputs from FeaturesAssembler. The FeaturesAssembler is used to collect features from different columns or an embeddings column.

The GenericSVMClassifier takes FEATURE_VECTOR annotations as input, classifies them and outputs CATEGORY annotations.

In [214]:
features_asm = FeaturesAssembler()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("feature_vector")

gen_clf = GenericSVMClassifierApproach()\
    .setLabelColumn("category")\
    .setInputCols("feature_vector")\
    .setOutputCol("prediction")\
    .setModelFile(f"{graph_folder}/svm_graph.pb")\
    .setEpochsNumber(45)\
    .setBatchSize(32)\
    .setLearningRate(0.02)\
    .setOutputLogsPath(log_folder)\
    .setDropout(0.1)\
    .setFixImbalance(True)\
    # .setValidationSplit(0.1)

clf_Pipeline = Pipeline(
    stages=[
        features_asm,
        gc_svm_graph_builder,
        gen_clf
])


In [215]:
generic_model_hc200 = clf_Pipeline.fit(trainingData_with_embeddings)

TF Graph Builder configuration:
Model name: svm_classifier
Graph folder: gc_graph
Graph file name: svm_graph.pb
Build params: {'input_dim': 200, 'output_dim': 4, 'hidden_layers': [], 'output_act': 'sigmoid', 'loss_func': 'hinge'}
svm_classifier graph exported to gc_graph/svm_graph.pb


In [216]:
pred_df = generic_model_hc200.transform(testData_with_embeddings)

In [217]:
preds_df = pred_df.select('category','prediction.result').toPandas()
preds_df['result'] = preds_df.result.apply(lambda x : x[0])

print (classification_report(preds_df['category'], preds_df['result']))

                  precision    recall  f1-score   support

Gastroenterology       0.81      0.88      0.85        25
       Neurology       0.78      0.82      0.80        22
      Orthopedic       0.88      0.86      0.87        35
         Urology       0.89      0.80      0.84        20

        accuracy                           0.84       102
       macro avg       0.84      0.84      0.84       102
    weighted avg       0.85      0.84      0.84       102



In [218]:
res = classification_report(preds_df["category"], preds_df["result"], output_dict=True)
results_df.loc["GenericSVM_200d"] = [res["macro avg"]["f1-score"], res["weighted avg"]["f1-score"], res["accuracy"]]

## Bert Sentence Embeddings (sbiobert_base_cased_mli)

We will extract [sbiobert_base_cased_mli](https://nlp.johnsnowlabs.com/2020/11/27/sbiobert_base_cased_mli_en.html) embeddings which has 768 dimension output and use this embeddings in the model training.

In [219]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

bert_sent = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", 'en','clinical/models')\
    .setInputCols(["document"])\
    .setOutputCol("sentence_embeddings")

embeddings_pipeline = Pipeline(
    stages = [
        document_assembler,
        bert_sent
])

sbiobert_base_cased_mli download started this may take some time.
Approximate size to download 384.3 MB
[OK!]


In [220]:
trainingData_with_embeddings = embeddings_pipeline.fit(trainingData).transform(trainingData)\
                                                  .select("text","category","sentence_embeddings")

trainingData_with_embeddings.show(2,truncate=60)

+------------------------------------------------------------+----------------+------------------------------------------------------------+
|                                                        text|        category|                                         sentence_embeddings|
+------------------------------------------------------------+----------------+------------------------------------------------------------+
| ADMISSION DIAGNOSIS: Symptomatic cholelithiasis. DISCHAR...|Gastroenterology|[{sentence_embeddings, 0, 2228,  ADMISSION DIAGNOSIS: Sym...|
| ADMITTING DIAGNOSES: Hiatal hernia, gastroesophageal ref...|Gastroenterology|[{sentence_embeddings, 0, 3237,  ADMITTING DIAGNOSES: Hia...|
+------------------------------------------------------------+----------------+------------------------------------------------------------+
only showing top 2 rows



In [221]:
testData_with_embeddings = embeddings_pipeline.fit(testData).transform(testData)\
                                                  .select("text","category","sentence_embeddings")

testData_with_embeddings.show(2,truncate=60)

+------------------------------------------------------------+----------------+------------------------------------------------------------+
|                                                        text|        category|                                         sentence_embeddings|
+------------------------------------------------------------+----------------+------------------------------------------------------------+
| ADMITTING DIAGNOSIS: Gastrointestinal bleed. HISTORY OF ...|Gastroenterology|[{sentence_embeddings, 0, 3978,  ADMITTING DIAGNOSIS: Gas...|
| CHIEF COMPLAINT: Dysphagia and hematemesis while vomitin...|Gastroenterology|[{sentence_embeddings, 0, 6515,  CHIEF COMPLAINT: Dysphag...|
+------------------------------------------------------------+----------------+------------------------------------------------------------+
only showing top 2 rows



In [222]:
log_folder="Mt_logs_bert"

### ClassifierDL

In [223]:
classifier_dl = ClassifierDLApproach()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("class")\
    .setLabelColumn("category")\
    .setBatchSize(8)\
    .setMaxEpochs(30)\
    .setLr(0.002)\
    .setDropout(0.1)\
    .setEnableOutputLogs(True)\
    .setOutputLogsPath(log_folder)

classifier_dl_pipeline = Pipeline(stages=[classifier_dl])

In [224]:
clfDL_model_bert = classifier_dl_pipeline.fit(trainingData_with_embeddings)

In [225]:
preds = clfDL_model_bert.transform(testData_with_embeddings)

preds_df = preds.select("category","text","class.result").toPandas()
preds_df["result"] = preds_df["result"].apply(lambda x : x[0])

print (classification_report(preds_df["category"], preds_df["result"]))

                  precision    recall  f1-score   support

Gastroenterology       0.88      0.88      0.88        25
       Neurology       0.82      0.82      0.82        22
      Orthopedic       0.88      0.86      0.87        35
         Urology       0.90      0.95      0.93        20

        accuracy                           0.87       102
       macro avg       0.87      0.88      0.87       102
    weighted avg       0.87      0.87      0.87       102



In [226]:
res = classification_report(preds_df["category"], preds_df["result"], output_dict=True)
results_df.loc["ClassifierDL_bert"] = [res["macro avg"]["f1-score"], res["weighted avg"]["f1-score"], res["accuracy"]]

### MultiClassifierDL

In [227]:
# MultiClassifierDL accepts list of strings as input label. So we convert label column to array type.
trainingData_with_embeddings = trainingData_with_embeddings.withColumn("category_array", F.array(F.col("category")))
testData_with_embeddings = testData_with_embeddings.withColumn("category_array", F.array(F.col("category")))

In [228]:
multiClassifier = MultiClassifierDLApproach()\
    .setInputCols("sentence_embeddings")\
    .setOutputCol("prediction")\
    .setLabelColumn("category_array")\
    .setBatchSize(8)\
    .setMaxEpochs(20)\
    .setLr(9e-3)\
    .setThreshold(0.5)\
    .setShufflePerEpoch(False)\
    .setEnableOutputLogs(True)\
    .setOutputLogsPath(log_folder)\
  #   .setValidationSplit(0.1)

multiClassifier_pipeline = Pipeline(
    stages = [
        multiClassifier
])

In [229]:
multiClassifier_model_bert = multiClassifier_pipeline.fit(trainingData_with_embeddings)

In [230]:
preds = multiClassifier_model_bert.transform(testData_with_embeddings)

In [231]:
preds_df = preds.select("category","text","prediction.result","prediction.metadata").toPandas()
preds_df.result.apply(lambda x: str(x) ).value_counts()

Unnamed: 0_level_0,count
result,Unnamed: 1_level_1
['Orthopedic'],32
['Gastroenterology'],25
['Neurology'],22
['Urology'],21
[],1
"['Neurology', 'Orthopedic']",1


In [232]:
# We will get the highest score label as result. You can control the number of zero label results with setThreshold() in the training.
preds_df["scores"] = preds_df.metadata.apply(lambda x: {k:float(v) for k,v in x[0].items()} if len(x)>=1 else "")
preds_df["result"] = preds_df.scores.apply(lambda x: max(x, key=x.get) if len(x)>=1 else "")
preds_df = preds_df[(preds_df["result"]!="")]
preds_df["result"].value_counts()

Unnamed: 0_level_0,count
result,Unnamed: 1_level_1
Orthopedic,32
Gastroenterology,25
Neurology,23
Urology,21


In [233]:
print (classification_report(preds_df["category"], preds_df["result"]))

                  precision    recall  f1-score   support

Gastroenterology       0.92      0.92      0.92        25
       Neurology       0.78      0.86      0.82        21
      Orthopedic       0.91      0.83      0.87        35
         Urology       0.86      0.90      0.88        20

        accuracy                           0.87       101
       macro avg       0.87      0.88      0.87       101
    weighted avg       0.87      0.87      0.87       101



In [234]:
res = classification_report(preds_df["category"], preds_df["result"], output_dict=True)
results_df.loc["MultiClassifierDL_bert"] = [res["macro avg"]["f1-score"], res["weighted avg"]["f1-score"], res["accuracy"]]

### Generic Classifier

In [235]:
from sparknlp_jsl.annotator import TFGraphBuilder

graph_folder = "gc_graph"

gc_graph_builder = TFGraphBuilder()\
    .setModelName("generic_classifier")\
    .setInputCols(["features"])\
    .setLabelColumn("category")\
    .setHiddenLayers([300,200, 50])\
    .setHiddenAct("tanh")\
    .setHiddenActL2(True)\
    .setBatchNorm(True)\
    .setGraphFolder(graph_folder)\
    .setGraphFile("gcf_graph_bert.pb")

In [236]:
features_asm = FeaturesAssembler()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("features")

gen_clf = GenericClassifierApproach()\
    .setLabelColumn("category")\
    .setInputCols("features")\
    .setOutputCol("prediction")\
    .setModelFile(f"{graph_folder}/gcf_graph_bert.pb")\
    .setEpochsNumber(25)\
    .setBatchSize(16)\
    .setLearningRate(0.002)\
    .setOutputLogsPath(log_folder)\
    .setDropout(0.2)\
    .setFixImbalance(True)\
    # .setValidationSplit(0.1)

clf_Pipeline = Pipeline(
    stages=[
        features_asm,
        gc_graph_builder,
        gen_clf
])


In [237]:
generic_model_bert = clf_Pipeline.fit(trainingData_with_embeddings)

TF Graph Builder configuration:
Model name: generic_classifier
Graph folder: gc_graph
Graph file name: gcf_graph_bert.pb
Build params: {'input_dim': 768, 'output_dim': 4, 'hidden_layers': [300, 200, 50], 'hidden_act': 'tanh', 'hidden_act_l2': True, 'batch_norm': True}
generic_classifier graph exported to gc_graph/gcf_graph_bert.pb


In [238]:
pred_df = generic_model_bert.transform(testData_with_embeddings)

preds_df = pred_df.select('category','prediction.result').toPandas()
preds_df['result'] = preds_df.result.apply(lambda x : x[0])

print (classification_report(preds_df['category'], preds_df['result']))

                  precision    recall  f1-score   support

Gastroenterology       1.00      0.84      0.91        25
       Neurology       0.59      0.86      0.70        22
      Orthopedic       0.85      0.80      0.82        35
         Urology       0.81      0.65      0.72        20

        accuracy                           0.79       102
       macro avg       0.81      0.79      0.79       102
    weighted avg       0.82      0.79      0.80       102



In [239]:
res = classification_report(preds_df["category"], preds_df["result"], output_dict=True)
results_df.loc["GenericClassifier_bert"] = [res["macro avg"]["f1-score"], res["weighted avg"]["f1-score"], res["accuracy"]]

This age classification model is a sophisticated text classification tool tailored to identify and categorize text according to different age groups. Model distinguishes among Old Adult, Adult, Child and Teen, and Other/Unknown contexts, providing valuable insights into textual data that references age-specific scenarios or concerns.

In [240]:
generic_classifier = GenericClassifierModel.pretrained("genericclassifier_age_e5", "en", "clinical/models")\
    .setInputCols(["features"])\
    .setOutputCol("prediction")

sample_texts = [
"""The patient presents with conditions often associated with the stresses and lifestyle of early career and possibly higher education stages, including sleep irregularities and repetitive stress injuries. There's a notable emphasis on preventative care, with discussions around lifestyle choices that can impact long-term health, such as smoking cessation, regular exercise, and balanced nutrition. The patient is also counseled on mental health, particularly in managing stress and anxiety that may arise from personal and professional responsibilities and ambitions at this stage of life.""",
"""The senior patient presents with age-related issues such as reduced hearing and vision, arthritis, and memory lapses. Emphasis is on managing chronic conditions, maintaining social engagement, and adapting lifestyle to changing physical abilities. Discussions include medication management, dietary adjustments to suit older digestion, and the importance of regular, low-impact exercise.""",
"""The late teenage patient is dealing with final growth spurts, the stress of impending adulthood, and decisions about higher education or career paths. Health discussions include maintaining a balanced diet, the importance of regular sleep patterns, and managing academic and social pressures. Mental health support is considered crucial at this stage, with a focus on building resilience and coping mechanisms.""",
"""The patient, faces adjustments to a new lifestyle with changes in daily routines and social interactions. Health concerns include managing the transition from an active work life to more leisure time, which may impact physical and mental health. Preventative health measures are emphasized, along with the importance of staying mentally and physically active and engaged in the community."""
]

genericclassifier_age_e5 download started this may take some time.
Approximate size to download 4.1 MB
[OK!]


### GenericLogRegClassifier

In [241]:
from sparknlp_jsl.annotator import TFGraphBuilder

graph_folder = "gc_graph"

gc_logreg_graph_builder = TFGraphBuilder()\
    .setModelName("logreg_classifier")\
    .setInputCols(["feature_vector"]) \
    .setLabelColumn("category")\
    .setGraphFolder(graph_folder)\
    .setGraphFile("log_reg_graph.pb")

GenericLogRegClassifier needs outputs from FeaturesAssembler. The FeaturesAssembler is used to collect features from different columns or an embeddings column.

The GenericLogRegClassifier takes FEATURE_VECTOR annotations as input, classifies them and outputs CATEGORY annotations.

In [242]:
features_asm = FeaturesAssembler()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("feature_vector")

gen_clf = GenericLogRegClassifierApproach()\
    .setLabelColumn("category")\
    .setInputCols("feature_vector")\
    .setOutputCol("prediction")\
    .setModelFile(f"{graph_folder}/log_reg_graph.pb")\
    .setEpochsNumber(30)\
    .setBatchSize(64)\
    .setLearningRate(0.004)\
    .setOutputLogsPath(log_folder)\
    .setDropout(0.1)\
    .setFixImbalance(True)\
    # .setValidationSplit(0.1)

clf_Pipeline = Pipeline(
    stages=[
        features_asm,
        gc_logreg_graph_builder,
        gen_clf
])


In [243]:
generic_model_bert = clf_Pipeline.fit(trainingData_with_embeddings)

TF Graph Builder configuration:
Model name: logreg_classifier
Graph folder: gc_graph
Graph file name: log_reg_graph.pb
Build params: {'input_dim': 768, 'output_dim': 4, 'hidden_layers': [], 'output_act': 'sigmoid'}
logreg_classifier graph exported to gc_graph/log_reg_graph.pb


In [244]:
pred_df = generic_model_bert.transform(testData_with_embeddings)

In [245]:
preds_df = pred_df.select('category','prediction.result').toPandas()
preds_df['result'] = preds_df.result.apply(lambda x : x[0])

print (classification_report(preds_df['category'], preds_df['result']))

                  precision    recall  f1-score   support

Gastroenterology       0.96      0.92      0.94        25
       Neurology       0.81      0.77      0.79        22
      Orthopedic       0.86      0.86      0.86        35
         Urology       0.86      0.95      0.90        20

        accuracy                           0.87       102
       macro avg       0.87      0.87      0.87       102
    weighted avg       0.87      0.87      0.87       102



In [246]:
res = classification_report(preds_df["category"], preds_df["result"], output_dict=True)
results_df.loc["GenericCLogReg_bert"] = [res["macro avg"]["f1-score"], res["weighted avg"]["f1-score"], res["accuracy"]]

### GenericSVMClassifier

In [247]:
from sparknlp_jsl.annotator import TFGraphBuilder

graph_folder = "gc_graph"

gc_svm_graph_builder = TFGraphBuilder()\
    .setModelName("svm_classifier")\
    .setInputCols(["feature_vector"]) \
    .setLabelColumn("category")\
    .setGraphFolder(graph_folder)\
    .setGraphFile("svm_graph.pb")

GenericSVMClassifier needs outputs from FeaturesAssembler. The FeaturesAssembler is used to collect features from different columns or an embeddings column.

The GenericSVMClassifier takes FEATURE_VECTOR annotations as input, classifies them and outputs CATEGORY annotations.

In [248]:
features_asm = FeaturesAssembler()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("feature_vector")

gen_clf = GenericSVMClassifierApproach()\
    .setLabelColumn("category")\
    .setInputCols("feature_vector")\
    .setOutputCol("prediction")\
    .setModelFile(f"{graph_folder}/svm_graph.pb")\
    .setEpochsNumber(30)\
    .setBatchSize(64)\
    .setLearningRate(0.004)\
    .setOutputLogsPath(log_folder)\
    .setDropout(0.1)\
    .setFixImbalance(True)\
    # .setValidationSplit(0.1)

clf_Pipeline = Pipeline(
    stages=[
        features_asm,
        gc_svm_graph_builder,
        gen_clf
])


In [249]:
generic_model_bert = clf_Pipeline.fit(trainingData_with_embeddings)

TF Graph Builder configuration:
Model name: svm_classifier
Graph folder: gc_graph
Graph file name: svm_graph.pb
Build params: {'input_dim': 768, 'output_dim': 4, 'hidden_layers': [], 'output_act': 'sigmoid', 'loss_func': 'hinge'}
svm_classifier graph exported to gc_graph/svm_graph.pb


In [250]:
pred_df = generic_model_bert.transform(testData_with_embeddings)

In [251]:
preds_df = pred_df.select('category','prediction.result').toPandas()
preds_df['result'] = preds_df.result.apply(lambda x : x[0])

print (classification_report(preds_df['category'], preds_df['result']))

                  precision    recall  f1-score   support

Gastroenterology       0.92      0.92      0.92        25
       Neurology       0.80      0.91      0.85        22
      Orthopedic       0.88      0.86      0.87        35
         Urology       0.89      0.80      0.84        20

        accuracy                           0.87       102
       macro avg       0.87      0.87      0.87       102
    weighted avg       0.88      0.87      0.87       102



In [252]:
res = classification_report(preds_df["category"], preds_df["result"], output_dict=True)
results_df.loc["GenericCSVM_bert"] = [res["macro avg"]["f1-score"], res["weighted avg"]["f1-score"], res["accuracy"]]

## DocumentLogRegClassifier

In [253]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

logreg = DocumentLogRegClassifierApproach()\
    .setInputCols("token")\
    .setLabelCol("category")\
    .setOutputCol("prediction")\
    .setMaxIter(10)\
    .setTol(1e-6)

clf_Pipeline = Pipeline(
    stages=[
        document_assembler,
        tokenizer,
        logreg
])

doclogreg_model = clf_Pipeline.fit(trainingData)

In [254]:
pred_df = doclogreg_model.transform(testData)

preds_df = pred_df.select('category','prediction.result').toPandas()
preds_df['result'] = preds_df.result.apply(lambda x : x[0])

print (classification_report(preds_df['category'], preds_df['result']))

                  precision    recall  f1-score   support

Gastroenterology       0.92      0.88      0.90        25
       Neurology       0.80      0.73      0.76        22
      Orthopedic       0.91      0.89      0.90        35
         Urology       0.79      0.95      0.86        20

        accuracy                           0.86       102
       macro avg       0.86      0.86      0.86       102
    weighted avg       0.87      0.86      0.86       102



In [255]:
res = classification_report(preds_df["category"], preds_df["result"], output_dict=True)
results_df.loc["DocumentLogRegClassifier"] = [res["macro avg"]["f1-score"], res["weighted avg"]["f1-score"], res["accuracy"]]

## Comparision of All Models' Result in Mtsamples Dataset

Here is the all results of classifier models. Macro f1, weighted f1 and accuracy scores are presented.

In [256]:
results_df

Unnamed: 0,macro-f1-score,weighted-f1-score,accuracy
ClassifierDL_100d,0.816855,0.832431,0.833333
MultiClassifierDL_100d,0.884165,0.885508,0.886598
GenericClassifier_100d,0.869571,0.872821,0.872549
GenericLogReg_100d,0.859291,0.862404,0.862745
GenericSVM_100d,0.855965,0.86284,0.862745
ClassifierDL_200d,0.081967,0.064288,0.196078
MultiClassifierDL_200d,0.906453,0.907969,0.908163
GenericClassifier_200d,0.836114,0.8443,0.843137
GenericLogReg_200d,0.81256,0.821306,0.823529
GenericSVM_200d,0.839456,0.843439,0.843137
