![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/30.Clinical_Text_Classification_with_Spark_NLP.ipynb)

# Colab Setup

In [1]:
import json
import os

from google.colab import files

if 'spark_jsl.json' not in os.listdir():
  license_keys = files.upload()
  os.rename(list(license_keys.keys())[0], 'spark_jsl.json')

with open('spark_jsl.json') as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)
os.environ.update(license_keys)

In [None]:
# Installing spark-nlp for GPU
!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.2.1 -s $PUBLIC_VERSION -g

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

**PLEASE RE-START RUNTIME AND CONTINUE**

In [2]:
import json
import os

with open('spark_jsl.json') as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)
os.environ.update(license_keys)

In [3]:
import pandas as pd
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

from pyspark.ml import Pipeline,PipelineModel
from pyspark.sql import SparkSession

from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from sparknlp_jsl.base import *
from sparknlp.base import *
import sparknlp_jsl
import sparknlp

import warnings
warnings.filterwarnings('ignore')

params = {"spark.driver.memory":"24G", 
          "spark.kryoserializer.buffer.max":"2000M", 
          "spark.driver.maxResultSize":"2000M"} 

print ("Spark NLP Version :", sparknlp.version())
print ("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark = sparknlp_jsl.start(license_keys['SECRET'],params=params,gpu=True)

spark

Spark NLP Version : 4.2.4
Spark NLP_JSL Version : 4.2.4


# Classifiers

The below classifiers will be used in this notebook.ClassifierDL, MultiClassifierDL, and GenericClassifier will be trained using healthcare_100d, embeddingd_clinical, and bert sentence embeddings(sbiobert_base_cased_mli). DocumentLogRegClassifier accepts tokens, so sentence embeddings are not utilized during DocumentLogRegClassifier training.

## ClassifierDL

ClassifierDL is a generic Multi-class Text Classification annotator. ClassifierDL uses the state-of-the-art Universal Sentence Encoder as an input for text classifications. The ClassifierDL annotator uses a deep learning model (DNNs) built inside TensorFlow and supports up to 100 classes. For more information please [follow the link](https://nlp.johnsnowlabs.com/docs/en/annotators#classifierdl).


##  MultiClassifierDL

MultiClassifierDL is a Multi-label Text Classification annotator.MultiClassifierDL uses a Bidirectional GRU with a convolutional model built inside TensorFlow and supports up to 100 classes.  For more information please [follow the link](https://nlp.johnsnowlabs.com/docs/en/annotators#multiclassifierdl).


## GenericClassifier

GenericClassifier is a TensorFlow model for the generic classification of feature vectors in Healthcare  Lİbrary. It takes FEATURE_VECTOR annotations from FeaturesAssembler as input, classifies them, and outputs CATEGORY annotations. Please see [the link](https://nlp.johnsnowlabs.com/docs/en/licensed_annotators#genericclassifier) for more information.

## DocumentLogRegClassifier

DocumentLogRegClassifier is a model to classify documents with a Logarithmic Regression algorithm in Healthcare  Library. Training data requires columns for text and labels. The result is a trained DocumentLogRegClassifierModel. you can get more info [here](https://nlp.johnsnowlabs.com/docs/en/licensed_annotators#documentlogregclassifier). 


# ADE Dataset

### Data Preprocessing

In [159]:
#downloading sample datasets
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/ADE_Corpus_V2/ADE-NEG.txt
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/ADE_Corpus_V2/DRUG-AE.rel

**ADE-Negative Dataset**

In [4]:
df_neg= pd.read_csv("ADE-NEG.txt", header=None, delimiter="\t", names=["col1"])
df_neg.head()

Unnamed: 0,col1
0,6460590 NEG Clioquinol intoxication occurring ...
1,"8600337 NEG ""Retinoic acid syndrome"" was preve..."
2,8402502 NEG BACKGROUND: External beam radiatio...
3,"8700794 NEG Although the enuresis ceased, she ..."
4,17662448 NEG A 42-year-old woman had uneventfu...


In [5]:
df_neg['text'] =  df_neg.col1.str.split('NEG').str[1]
df_neg["category"] = "neg"
df_neg= df_neg[["text", "category"]]
df_neg.head()

Unnamed: 0,text,category
0,Clioquinol intoxication occurring in the trea...,neg
1,"""Retinoic acid syndrome"" was prevented with s...",neg
2,BACKGROUND: External beam radiation therapy o...,neg
3,"Although the enuresis ceased, she developed t...",neg
4,A 42-year-old woman had uneventful bilateral ...,neg


**ADE Positive Dataset**

In [6]:
df_pos= pd.read_csv("DRUG-AE.rel", header=None, delimiter="|")
df_pos.head()

Unnamed: 0,0,1,2,3,4,5,6,7
0,10030778,Intravenous azithromycin-induced ototoxicity.,ototoxicity,43,54,azithromycin,22,34
1,10048291,"Immobilization, while Paget's bone disease was...",increased calcium-release,960,985,dihydrotachysterol,908,926
2,10048291,Unaccountable severe hypercalcemia in a patien...,hypercalcemia,31,44,dihydrotachysterol,94,112
3,10082597,METHODS: We report two cases of pseudoporphyri...,pseudoporphyria,620,635,naproxen,646,654
4,10082597,METHODS: We report two cases of pseudoporphyri...,pseudoporphyria,620,635,oxaprozin,659,668


In [7]:
df_pos["category"]= "pos"
df_pos.rename(columns={1: "text"}, inplace=True)
df_pos= df_pos[["text", "category"]]
df_pos.head()

Unnamed: 0,text,category
0,Intravenous azithromycin-induced ototoxicity.,pos
1,"Immobilization, while Paget's bone disease was...",pos
2,Unaccountable severe hypercalcemia in a patien...,pos
3,METHODS: We report two cases of pseudoporphyri...,pos
4,METHODS: We report two cases of pseudoporphyri...,pos


**Merging Positive and Negative dataset**

In [8]:
ade_df= pd.concat([df_neg, df_pos])
ade_df.head()

Unnamed: 0,text,category
0,Clioquinol intoxication occurring in the trea...,neg
1,"""Retinoic acid syndrome"" was prevented with s...",neg
2,BACKGROUND: External beam radiation therapy o...,neg
3,"Although the enuresis ceased, she developed t...",neg
4,A 42-year-old woman had uneventful bilateral ...,neg


In [9]:
ade_df["category"].value_counts()

neg    16695
pos     6821
Name: category, dtype: int64

In [10]:
ade_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 23516 entries, 0 to 6820
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   text      23516 non-null  object
 1   category  23516 non-null  object
dtypes: object(2)
memory usage: 551.2+ KB


We take 30% of the data to make a faster run. You can use all data for better scores.

In [11]:
spark_df = spark.createDataFrame(ade_df).sample(0.3, 3) # limit the data

trainingData, testData = spark_df.randomSplit([0.8, 0.2], seed = 100)

print("Training Dataset Count: " + str(trainingData.count()))
print("Test Dataset Count: " + str(testData.count()))

Training Dataset Count: 5531
Test Dataset Count: 1427


In [12]:
spark_df.groupBy("category").count().show()

+--------+-----+
|category|count|
+--------+-----+
|     neg| 4932|
|     pos| 2026|
+--------+-----+



In [13]:
spark_df.printSchema()

root
 |-- text: string (nullable = true)
 |-- category: string (nullable = true)



In [14]:
spark_df.head(3)

[Row(text=' Clioquinol intoxication occurring in the treatment of acrodermatitis enteropathica with reference to SMON outside of Japan.', category='neg'),
 Row(text=' A 42-year-old woman had uneventful bilateral laser-assisted subepithelial keratectomy (LASEK) to correct myopia.', category='neg'),
 Row(text=' A 16-year-old girl with erosive, polyarticular JRA showed no detectable change in her articular disease following nine exchanges.', category='neg')]

## 100 Dimension Healthcare Embeddings (embeddings_healthcare_100d)



Now we will extract [healthcare_100d embeddings](https://nlp.johnsnowlabs.com/2020/05/29/embeddings_healthcare_100d_en.html) and use it in the classificaiton model training.

In [15]:
document_assembler = DocumentAssembler()\
        .setInputCol("text")\
        .setOutputCol("document")

tokenizer = Tokenizer() \
            .setInputCols(["document"]) \
            .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_healthcare_100d","en","clinical/models")\
        .setInputCols(["document","token"])\
        .setOutputCol("word_embeddings")

sentence_embeddings = SentenceEmbeddings() \
        .setInputCols(["document", "word_embeddings"]) \
        .setOutputCol("sentence_embeddings") \
        .setPoolingStrategy("AVERAGE")

embeddings_pipeline = Pipeline(
    stages = [
        document_assembler,
        tokenizer,
        word_embeddings,
        sentence_embeddings,

    ])

embeddings_healthcare_100d download started this may take some time.
Approximate size to download 475.8 MB
[OK!]


In [16]:
trainingData_with_embeddings = embeddings_pipeline.fit(trainingData).transform(trainingData)
trainingData_with_embeddings = trainingData_with_embeddings.select("text","category","sentence_embeddings")
trainingData_with_embeddings.show(2,truncate=80)

+--------------------------------------------------------------------------------+--------+--------------------------------------------------------------------------------+
|                                                                            text|category|                                                             sentence_embeddings|
+--------------------------------------------------------------------------------+--------+--------------------------------------------------------------------------------+
| "Syndrome malin"-like symptoms probably due to interaction between neurolept...|     neg|[{sentence_embeddings, 0, 109,  "Syndrome malin"-like symptoms probably due t...|
| 'Bail-out' bivalirudin use in patients with thrombotic complications unrespo...|     neg|[{sentence_embeddings, 0, 150,  'Bail-out' bivalirudin use in patients with t...|
+--------------------------------------------------------------------------------+--------+--------------------------------------------

In [17]:
testData_with_embeddings = embeddings_pipeline.fit(testData).transform(testData)
testData_with_embeddings = testData_with_embeddings.select("text","category","sentence_embeddings")
testData_with_embeddings.show(2,truncate=80)

+--------------------------------------------------------------------------------+--------+--------------------------------------------------------------------------------+
|                                                                            text|category|                                                             sentence_embeddings|
+--------------------------------------------------------------------------------+--------+--------------------------------------------------------------------------------+
| 1 previously untreated patient with plasma-cell leukaemia and 8 patients wit...|     neg|[{sentence_embeddings, 0, 166,  1 previously untreated patient with plasma-ce...|
| A 15-year-old boy had temporary hypertropia, supraduction deficit, ipsilater...|     neg|[{sentence_embeddings, 0, 262,  A 15-year-old boy had temporary hypertropia, ...|
+--------------------------------------------------------------------------------+--------+--------------------------------------------

In [18]:
testData_with_embeddings.printSchema()

root
 |-- text: string (nullable = true)
 |-- category: string (nullable = true)
 |-- sentence_embeddings: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)



In [19]:
testData_with_embeddings.select(testData_with_embeddings.sentence_embeddings.embeddings).show(3,truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [20]:
log_folder="ADE_logs_healthcare_100d"
!mkdir -p $log_folder

### ClassifierDLApproach

In [23]:
classifier_dl = ClassifierDLApproach()\
        .setInputCols(["sentence_embeddings"])\
        .setOutputCol("prediction")\
        .setLabelColumn("category")\
        .setBatchSize(16)\
        .setMaxEpochs(30)\
        .setLr(0.002)\
        .setDropout(0.3)\
        .setEnableOutputLogs(True)\
        .setOutputLogsPath(log_folder)\
        # .setValidationSplit(0.1)

classifier_dl_pipeline = Pipeline(
    stages = [
        classifier_dl
    ])

In [None]:
clfDL_model_hc100 = classifier_dl_pipeline.fit(trainingData_with_embeddings)

In [25]:
!cat $log_folder/ClassifierDLApproach_*

Training started - epochs: 30 - learning_rate: 0.002 - batch_size: 16 - training_examples: 5531 - classes: 2
Epoch 0/30 - 3.56s - loss: 193.45416 - acc: 0.7459486 - batches: 346
Epoch 1/30 - 0.65s - loss: 168.86009 - acc: 0.7964921 - batches: 346
Epoch 2/30 - 0.62s - loss: 162.3323 - acc: 0.8223979 - batches: 346
Epoch 3/30 - 0.61s - loss: 159.34192 - acc: 0.83498025 - batches: 346
Epoch 4/30 - 0.65s - loss: 157.53806 - acc: 0.84295124 - batches: 346
Epoch 5/30 - 0.61s - loss: 156.41156 - acc: 0.8516469 - batches: 346
Epoch 6/30 - 0.61s - loss: 155.46634 - acc: 0.85635704 - batches: 346
Epoch 7/30 - 0.59s - loss: 154.58415 - acc: 0.8596179 - batches: 346
Epoch 8/30 - 0.57s - loss: 153.78036 - acc: 0.86233526 - batches: 346
Epoch 9/30 - 0.57s - loss: 152.91891 - acc: 0.864328 - batches: 346
Epoch 10/30 - 0.60s - loss: 152.0986 - acc: 0.8663208 - batches: 346
Epoch 11/30 - 0.57s - loss: 151.33493 - acc: 0.8684947 - batches: 346
Epoch 12/30 - 0.57s - loss: 150.58063 - acc: 0.87266135 - ba

In [26]:
preds = clfDL_model_hc100.transform(testData_with_embeddings)

In [27]:
preds.printSchema()
preds.select(preds.prediction).show(5, truncate=False)
preds.select(preds.category, preds.prediction.result).show(5, truncate=False)

root
 |-- text: string (nullable = true)
 |-- category: string (nullable = true)
 |-- sentence_embeddings: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- prediction: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    

In [28]:
preds_df = preds.select("category","text","prediction.result").toPandas()
preds_df["result"] = preds_df["result"].apply(lambda x : x[0])

print (classification_report(preds_df["category"], preds_df["result"]))

              precision    recall  f1-score   support

         neg       0.86      0.92      0.89       982
         pos       0.80      0.66      0.72       445

    accuracy                           0.84      1427
   macro avg       0.83      0.79      0.80      1427
weighted avg       0.84      0.84      0.84      1427



### MultiClassifierDL

We will use MultiClassifierDL built by using Bidirectional GRU and CNNs inside TensorFlow that supports up to 100 classes. It is designed for multi-label classification purposes. Here we will use MultiClassifierDL as a binary classifier. 

In [21]:
# MultiClassifierDL accepts list of strings as input label. So we convert label column to array type.
trainingData_with_embeddings = trainingData_with_embeddings.withColumn("category_array", F.array(F.col("category")))
testData_with_embeddings = testData_with_embeddings.withColumn("category_array", F.array(F.col("category")))

In [22]:
multiClassifier = MultiClassifierDLApproach()\
  .setInputCols("sentence_embeddings")\
  .setOutputCol("prediction")\
  .setLabelColumn("category_array")\
  .setBatchSize(16)\
  .setMaxEpochs(15)\
  .setLr(9e-3)\
  .setThreshold(0.5)\
  .setShufflePerEpoch(False)\
  .setEnableOutputLogs(True)\
  .setOutputLogsPath(log_folder)\
#   .setValidationSplit(0.1)

multiClassifier_pipeline = Pipeline(
    stages = [
        multiClassifier
    ])

In [23]:
multiClassifier_model_hc100 = multiClassifier_pipeline.fit(trainingData_with_embeddings)

In [24]:
!cat $log_folder/MultiClassifierDLApproach_*

Training started - epochs: 15 - learning_rate: 0.009 - batch_size: 16 - training_examples: 5531 - classes: 2
Epoch 0/15 - 13.73s - loss: 0.5167169 - acc: 0.7443017 - batches: 346
Epoch 1/15 - 2.56s - loss: 0.45646402 - acc: 0.7796278 - batches: 346
Epoch 2/15 - 2.55s - loss: 0.42079407 - acc: 0.8013669 - batches: 346
Epoch 3/15 - 2.67s - loss: 0.4000755 - acc: 0.81649375 - batches: 346
Epoch 4/15 - 2.82s - loss: 0.3859832 - acc: 0.8221097 - batches: 346
Epoch 5/15 - 3.84s - loss: 0.37113047 - acc: 0.8292655 - batches: 346
Epoch 6/15 - 3.24s - loss: 0.35642692 - acc: 0.8384058 - batches: 346
Epoch 7/15 - 2.54s - loss: 0.34178704 - acc: 0.8451993 - batches: 346
Epoch 8/15 - 2.57s - loss: 0.32826015 - acc: 0.85163045 - batches: 346
Epoch 9/15 - 2.64s - loss: 0.31305727 - acc: 0.8611413 - batches: 346
Epoch 10/15 - 2.61s - loss: 0.30029684 - acc: 0.87083334 - batches: 346
Epoch 11/15 - 2.63s - loss: 0.28393316 - acc: 0.8763587 - batches: 346
Epoch 12/15 - 2.66s - loss: 0.26964858 - acc: 0.

In [25]:
preds = multiClassifier_model_hc100.transform(testData_with_embeddings)

In [26]:
preds.printSchema()
preds.select(preds.prediction).show(5, truncate=False)
preds.select(preds.category_array, preds.prediction.result).show(5, truncate=False)

root
 |-- text: string (nullable = true)
 |-- category: string (nullable = true)
 |-- sentence_embeddings: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- category_array: array (nullable = false)
 |    |-- element: string (containsNull = true)
 |-- prediction: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable

In [27]:
preds_df = preds.select("category","text","prediction.result","prediction.metadata").toPandas()
preds_df.result.apply(lambda x: str(x) ).value_counts()

['neg']           1168
['pos']            258
['neg', 'pos']       1
Name: result, dtype: int64

In [28]:
preds_df[preds_df.result.apply(len)==2]

Unnamed: 0,category,text,result,metadata
1310,pos,The development of an IgG lambda-type monoclon...,"[neg, pos]","[{'sentence': '0', 'neg': '0.5005941', 'pos': ..."


In [29]:
preds_df[preds_df.result.apply(len)==0]

Unnamed: 0,category,text,result,metadata


MultiClassifierDL is a multi-label classifier, so some predictions may include both labels or none of the labels. That can be controlled a bit with `.setThreshold()` parameter during training. For now we will keep not keep zero label predictions and get the highest score as prediction.

In [30]:
# We will get the highest score label as result. You can control the number of zero label results with setThreshold() in the training.
preds_df["scores"] = preds_df.metadata.apply(lambda x: {k:float(v) for k,v in x[0].items()} if len(x)>=1 else "")
preds_df["result"] = preds_df.scores.apply(lambda x: max(x, key=x.get) if len(x)>=1 else "")
preds_df["result"].value_counts()

neg    1168
pos     259
Name: result, dtype: int64

In [31]:
print (classification_report(preds_df["category"], preds_df["result"]))

              precision    recall  f1-score   support

         neg       0.79      0.95      0.86       982
         pos       0.79      0.46      0.58       445

    accuracy                           0.79      1427
   macro avg       0.79      0.70      0.72      1427
weighted avg       0.79      0.79      0.78      1427



### Generic Classifier

In [40]:
!pip install -q tensorflow==2.7.0 tensorflow_addons

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m489.6/489.6 MB[0m [31m3.0 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.1/1.1 MB[0m [31m53.3 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.3/1.3 MB[0m [31m40.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m463.1/463.1 KB[0m [31m48.2 MB/s[0m eta [36m0:00:00[0m
[?25h

In [41]:
from sparknlp_jsl.annotator import TFGraphBuilder

graph_folder = "gc_graph"

gc_graph_builder = TFGraphBuilder()\
    .setModelName("generic_classifier")\
    .setInputCols(["features"])\
    .setLabelColumn("category")\
    .setHiddenLayers([300,200, 50])\
    .setHiddenAct("tanh")\
    .setHiddenActL2(True)\
    .setBatchNorm(True)\
    .setGraphFolder(graph_folder)\
    .setGraphFile("gcf_graph_100d.pb")

GenericClassifier needs outputs from FeaturesAssembler. The FeaturesAssembler is used to collect features from different columns or an embeddings column.

The GenericClassifierApproach takes FEATURE_VECTOR annotations as input, classifies them and outputs CATEGORY annotations.

In [42]:
features_asm = FeaturesAssembler()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("features")
      
gen_clf = GenericClassifierApproach()\
    .setLabelColumn("category")\
    .setInputCols("features")\
    .setOutputCol("prediction")\
    .setModelFile(f"{graph_folder}/gcf_graph_100d.pb")\
    .setEpochsNumber(25)\
    .setBatchSize(128)\
    .setLearningRate(0.004)\
    .setOutputLogsPath(log_folder)\
    .setDropout(0.2)\
    .setFixImbalance(True)\
    # .setValidationSplit(0.1)

clf_Pipeline = Pipeline(stages=[
    features_asm,
    gc_graph_builder,
    gen_clf])


In [43]:
generic_model_hc100 = clf_Pipeline.fit(trainingData_with_embeddings)

TF Graph Builder configuration:
Model name: generic_classifier
Graph folder: gc_graph
Graph file name: gcf_graph_100d.pb
Build params: {'input_dim': 100, 'output_dim': 2, 'hidden_layers': [300, 200, 50], 'hidden_act': 'tanh', 'hidden_act_l2': True, 'batch_norm': True}


Instructions for updating:
Colocations handled automatically by placer.


generic_classifier graph exported to gc_graph/gcf_graph_100d.pb


In [44]:
!cat $log_folder/GenericClassifierApproach_*

Training 25 epochs
Epoch 1/25	0.26s	Loss: 11.043847	ACC: 0.6627407
Epoch 2/25	0.11s	Loss: 9.728848	ACC: 0.7298572
Epoch 3/25	0.10s	Loss: 9.650522	ACC: 0.7189341
Epoch 4/25	0.11s	Loss: 9.593583	ACC: 0.71947336
Epoch 5/25	0.10s	Loss: 9.05388	ACC: 0.7394913
Epoch 6/25	0.13s	Loss: 8.973709	ACC: 0.74871767
Epoch 7/25	0.12s	Loss: 8.660992	ACC: 0.76186347
Epoch 8/25	0.12s	Loss: 8.845654	ACC: 0.7544521
Epoch 9/25	0.11s	Loss: 8.723371	ACC: 0.7587529
Epoch 10/25	0.15s	Loss: 8.709335	ACC: 0.75538594
Epoch 11/25	0.14s	Loss: 8.385962	ACC: 0.77450943
Epoch 12/25	0.13s	Loss: 8.240552	ACC: 0.77154356
Epoch 13/25	0.11s	Loss: 8.21255	ACC: 0.7771728
Epoch 14/25	0.11s	Loss: 8.314084	ACC: 0.76443475
Epoch 15/25	0.13s	Loss: 8.309925	ACC: 0.7594303
Epoch 16/25	0.13s	Loss: 8.300139	ACC: 0.7672362
Epoch 17/25	0.13s	Loss: 8.188091	ACC: 0.78023726
Epoch 18/25	0.11s	Loss: 7.853233	ACC: 0.78653723
Epoch 19/25	0.13s	Loss: 7.9730425	ACC: 0.7863268
Epoch 20/25	0.12s	Loss: 8.140604	ACC: 0.77815264
Epoch 21/25	0.15s	Lo

In [45]:
pred_df = generic_model_hc100.transform(testData_with_embeddings)

In [46]:
preds.printSchema()
preds.select(preds.prediction).show(5, truncate=False)
preds.select(preds.category, preds.prediction.result).show(5, truncate=False)

root
 |-- text: string (nullable = true)
 |-- category: string (nullable = true)
 |-- sentence_embeddings: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- category_array: array (nullable = false)
 |    |-- element: string (containsNull = true)
 |-- prediction: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable

In [47]:
preds_df = pred_df.select('category','prediction.result').toPandas()
preds_df['result'] = preds_df.result.apply(lambda x : x[0])

print (classification_report(preds_df['category'], preds_df['result']))

              precision    recall  f1-score   support

         neg       0.92      0.72      0.81       982
         pos       0.58      0.87      0.69       445

    accuracy                           0.76      1427
   macro avg       0.75      0.79      0.75      1427
weighted avg       0.81      0.76      0.77      1427



## 200 Dimension Healthcare Embeddings (embeddings_clinical)



Now we will extract [embeddings_clinical](https://nlp.johnsnowlabs.com/2020/01/28/embeddings_clinical_en.html) embeddings which has 200 dimension output and use this embeddings in the model training.

In [48]:
document_assembler = DocumentAssembler()\
        .setInputCol("text")\
        .setOutputCol("document")

tokenizer = Tokenizer() \
            .setInputCols(["document"]) \
            .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")\
        .setInputCols(["document","token"])\
        .setOutputCol("word_embeddings")

sentence_embeddings = SentenceEmbeddings() \
        .setInputCols(["document", "word_embeddings"]) \
        .setOutputCol("sentence_embeddings") \
        .setPoolingStrategy("AVERAGE")

embeddings_pipeline = Pipeline(
    stages = [
        document_assembler,
        tokenizer,
        word_embeddings,
        sentence_embeddings,
    ])

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]


In [49]:
trainingData_with_embeddings = embeddings_pipeline.fit(trainingData).transform(trainingData)\
                                                  .select("text","category","sentence_embeddings")
trainingData_with_embeddings.show(2,truncate=80)

+--------------------------------------------------------------------------------+--------+--------------------------------------------------------------------------------+
|                                                                            text|category|                                                             sentence_embeddings|
+--------------------------------------------------------------------------------+--------+--------------------------------------------------------------------------------+
| "Syndrome malin"-like symptoms probably due to interaction between neurolept...|     neg|[{sentence_embeddings, 0, 109,  "Syndrome malin"-like symptoms probably due t...|
| 'Bail-out' bivalirudin use in patients with thrombotic complications unrespo...|     neg|[{sentence_embeddings, 0, 150,  'Bail-out' bivalirudin use in patients with t...|
+--------------------------------------------------------------------------------+--------+--------------------------------------------

In [50]:
testData_with_embeddings = embeddings_pipeline.fit(testData).transform(testData)\
                                                  .select("text","category","sentence_embeddings")
testData_with_embeddings.show(2,truncate=80)

+--------------------------------------------------------------------------------+--------+--------------------------------------------------------------------------------+
|                                                                            text|category|                                                             sentence_embeddings|
+--------------------------------------------------------------------------------+--------+--------------------------------------------------------------------------------+
| 1 previously untreated patient with plasma-cell leukaemia and 8 patients wit...|     neg|[{sentence_embeddings, 0, 166,  1 previously untreated patient with plasma-ce...|
| A 15-year-old boy had temporary hypertropia, supraduction deficit, ipsilater...|     neg|[{sentence_embeddings, 0, 262,  A 15-year-old boy had temporary hypertropia, ...|
+--------------------------------------------------------------------------------+--------+--------------------------------------------

In [51]:
log_folder="ADE_logs_healthcare_200d"

### ClassifierDLApproach

In [52]:
classifier_dl = ClassifierDLApproach()\
        .setInputCols(["sentence_embeddings"])\
        .setOutputCol("class")\
        .setLabelColumn("category")\
        .setBatchSize(16)\
        .setMaxEpochs(30)\
        .setLr(0.001)\
        .setDropout(0.3)\
        .setEnableOutputLogs(True)\
        .setOutputLogsPath(log_folder)

classifier_dl_pipeline = Pipeline(
    stages = [
        classifier_dl
    ])

In [53]:
clfDL_model_hc200 = classifier_dl_pipeline.fit(trainingData_with_embeddings)

In [54]:
preds = clfDL_model_hc200.transform(testData_with_embeddings)

preds_df = preds.select("category","text","class.result").toPandas()
preds_df["result"] = preds_df["result"].apply(lambda x : x[0])

print (classification_report(preds_df["category"], preds_df["result"]))

              precision    recall  f1-score   support

         neg       0.89      0.91      0.90       982
         pos       0.79      0.74      0.77       445

    accuracy                           0.86      1427
   macro avg       0.84      0.83      0.83      1427
weighted avg       0.86      0.86      0.86      1427



### MultiClassifierDL

In [55]:
# MultiClassifierDL accepts list of strings as input label. So we convert label column to array type.
trainingData_with_embeddings = trainingData_with_embeddings.withColumn("category_array", F.array(F.col("category")))
testData_with_embeddings = testData_with_embeddings.withColumn("category_array", F.array(F.col("category")))

In [56]:
multiClassifier = MultiClassifierDLApproach()\
  .setInputCols("sentence_embeddings")\
  .setOutputCol("prediction")\
  .setLabelColumn("category_array")\
  .setBatchSize(32)\
  .setMaxEpochs(20)\
  .setLr(9e-3)\
  .setThreshold(0.5)\
  .setShufflePerEpoch(False)\
  .setEnableOutputLogs(True)\
  .setOutputLogsPath(log_folder)\
#   .setValidationSplit(0.1)

multiClassifier_pipeline = Pipeline(
    stages = [
        multiClassifier
    ])

In [57]:
multiClassifier_model_hc200 = multiClassifier_pipeline.fit(trainingData_with_embeddings)

In [58]:
preds = multiClassifier_model_hc200.transform(testData_with_embeddings)

In [59]:
preds_df = preds.select("category","text","prediction.result","prediction.metadata").toPandas()
preds_df.result.apply(lambda x: str(x) ).value_counts()

['neg']    944
['pos']    483
Name: result, dtype: int64

In [60]:
# We will get the highest score label as result. You can control the number of zero label results with setThreshold() in the training.
preds_df["scores"] = preds_df.metadata.apply(lambda x: {k:float(v) for k,v in x[0].items()} if len(x)>=1 else "")
preds_df["result"] = preds_df.scores.apply(lambda x: max(x, key=x.get) if len(x)>=1 else "")
preds_df["result"].value_counts()

neg    944
pos    483
Name: result, dtype: int64

In [61]:
print (classification_report(preds_df["category"], preds_df["result"]))

              precision    recall  f1-score   support

         neg       0.90      0.87      0.89       982
         pos       0.73      0.80      0.77       445

    accuracy                           0.85      1427
   macro avg       0.82      0.83      0.83      1427
weighted avg       0.85      0.85      0.85      1427



### Generic Classifier

In [62]:
# !pip install -q tensorflow==2.7.0 tensorflow_addons

In [63]:
from sparknlp_jsl.annotator import TFGraphBuilder

graph_folder = "gc_graph"

gc_graph_builder = TFGraphBuilder()\
    .setModelName("generic_classifier")\
    .setInputCols(["features"])\
    .setLabelColumn("category")\
    .setHiddenLayers([300,200, 50])\
    .setHiddenAct("tanh")\
    .setHiddenActL2(True)\
    .setBatchNorm(True)\
    .setGraphFolder(graph_folder)\
    .setGraphFile("gcf_graph_200d.pb")

In [64]:
features_asm = FeaturesAssembler()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("features")
      
gen_clf = GenericClassifierApproach()\
    .setLabelColumn("category")\
    .setInputCols("features")\
    .setOutputCol("prediction")\
    .setModelFile(f"{graph_folder}/gcf_graph_200d.pb")\
    .setEpochsNumber(40)\
    .setBatchSize(16)\
    .setLearningRate(0.001)\
    .setOutputLogsPath(log_folder)\
    .setDropout(0.5)\
    .setFixImbalance(True)\
    # .setValidationSplit(0.1)

clf_Pipeline = Pipeline(stages=[
    features_asm,
    gc_graph_builder,
    gen_clf])


In [65]:
generic_model_hc200 = clf_Pipeline.fit(trainingData_with_embeddings)

TF Graph Builder configuration:
Model name: generic_classifier
Graph folder: gc_graph
Graph file name: gcf_graph_200d.pb
Build params: {'input_dim': 200, 'output_dim': 2, 'hidden_layers': [300, 200, 50], 'hidden_act': 'tanh', 'hidden_act_l2': True, 'batch_norm': True}
generic_classifier graph exported to gc_graph/gcf_graph_200d.pb


In [66]:
pred_df = generic_model_hc200.transform(testData_with_embeddings)

preds_df = pred_df.select('category','prediction.result').toPandas()
preds_df['result'] = preds_df.result.apply(lambda x : x[0])

print (classification_report(preds_df['category'], preds_df['result']))

              precision    recall  f1-score   support

         neg       0.93      0.76      0.84       982
         pos       0.63      0.88      0.73       445

    accuracy                           0.80      1427
   macro avg       0.78      0.82      0.79      1427
weighted avg       0.84      0.80      0.81      1427



## Bert Sentence Embeddings (sbiobert_base_cased_mli)

Now we will extract [sbiobert_base_cased_mli](https://nlp.johnsnowlabs.com/2020/11/27/sbiobert_base_cased_mli_en.html) embeddings which has 768 dimension output and use this embeddings in the model training.

In [67]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")
        
bert_sent = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", 'en','clinical/models')\
    .setInputCols(["document"])\
    .setOutputCol("sentence_embeddings")

embeddings_pipeline = Pipeline(
    stages = [document_assembler,
              bert_sent])

sbiobert_base_cased_mli download started this may take some time.
Approximate size to download 384.3 MB
[OK!]


In [68]:
trainingData_with_embeddings = embeddings_pipeline.fit(trainingData).transform(trainingData)\
                                                  .select("text","category","sentence_embeddings")

trainingData_with_embeddings.show(2,truncate=80)

+--------------------------------------------------------------------------------+--------+--------------------------------------------------------------------------------+
|                                                                            text|category|                                                             sentence_embeddings|
+--------------------------------------------------------------------------------+--------+--------------------------------------------------------------------------------+
| "Syndrome malin"-like symptoms probably due to interaction between neurolept...|     neg|[{sentence_embeddings, 0, 109,  "Syndrome malin"-like symptoms probably due t...|
| 'Bail-out' bivalirudin use in patients with thrombotic complications unrespo...|     neg|[{sentence_embeddings, 0, 150,  'Bail-out' bivalirudin use in patients with t...|
+--------------------------------------------------------------------------------+--------+--------------------------------------------

In [69]:
testData_with_embeddings = embeddings_pipeline.fit(testData).transform(testData)\
                                                  .select("text","category","sentence_embeddings")

testData_with_embeddings.show(2,truncate=80)

+--------------------------------------------------------------------------------+--------+--------------------------------------------------------------------------------+
|                                                                            text|category|                                                             sentence_embeddings|
+--------------------------------------------------------------------------------+--------+--------------------------------------------------------------------------------+
| 1 previously untreated patient with plasma-cell leukaemia and 8 patients wit...|     neg|[{sentence_embeddings, 0, 166,  1 previously untreated patient with plasma-ce...|
| A 15-year-old boy had temporary hypertropia, supraduction deficit, ipsilater...|     neg|[{sentence_embeddings, 0, 262,  A 15-year-old boy had temporary hypertropia, ...|
+--------------------------------------------------------------------------------+--------+--------------------------------------------

In [70]:
log_folder="ADE_logs_bert"

### ClassifierDLApproach

In [71]:
classifier_dl = ClassifierDLApproach()\
        .setInputCols(["sentence_embeddings"])\
        .setOutputCol("class")\
        .setLabelColumn("category")\
        .setBatchSize(8)\
        .setMaxEpochs(2)\
        .setLr(0.001)\
        .setDropout(0.3)\
        .setEnableOutputLogs(True)\
        .setOutputLogsPath(log_folder)

classifier_dl_pipeline = Pipeline(stages=[classifier_dl])

In [72]:
clfDL_model_bert = classifier_dl_pipeline.fit(trainingData_with_embeddings)

In [73]:
preds = clfDL_model_bert.transform(testData_with_embeddings)

preds_df = preds.select("category","text","class.result").toPandas()
preds_df["result"] = preds_df["result"].apply(lambda x : x[0])

print (classification_report(preds_df["category"], preds_df["result"]))

              precision    recall  f1-score   support

         neg       0.92      0.88      0.90       982
         pos       0.76      0.83      0.79       445

    accuracy                           0.86      1427
   macro avg       0.84      0.85      0.84      1427
weighted avg       0.87      0.86      0.87      1427



### MultiClassifierDL

In [74]:
# MultiClassifierDL accepts list of strings as input label. So we convert label column to array type.
trainingData_with_embeddings = trainingData_with_embeddings.withColumn("category_array", F.array(F.col("category")))
testData_with_embeddings = testData_with_embeddings.withColumn("category_array", F.array(F.col("category")))

In [75]:
multiClassifier = MultiClassifierDLApproach()\
  .setInputCols("sentence_embeddings")\
  .setOutputCol("prediction")\
  .setLabelColumn("category_array")\
  .setBatchSize(32)\
  .setMaxEpochs(20)\
  .setLr(9e-3)\
  .setThreshold(0.5)\
  .setShufflePerEpoch(False)\
  .setEnableOutputLogs(True)\
  .setOutputLogsPath(log_folder)\
#   .setValidationSplit(0.1)

multiClassifier_pipeline = Pipeline(
    stages = [
        multiClassifier
    ])

In [76]:
multiClassifier_model_hc200 = multiClassifier_pipeline.fit(trainingData_with_embeddings)

In [77]:
preds = multiClassifier_model_hc200.transform(testData_with_embeddings)

In [78]:
preds_df = preds.select("category","text","prediction.result","prediction.metadata").toPandas()
preds_df.result.apply(lambda x: str(x) ).value_counts()

['neg']    1093
['pos']     334
Name: result, dtype: int64

In [79]:
# We will get the highest score label as result. You can control the number of zero label results with setThreshold() in the training.
preds_df["scores"] = preds_df.metadata.apply(lambda x: {k:float(v) for k,v in x[0].items()} if len(x)>=1 else "")
preds_df["result"] = preds_df.scores.apply(lambda x: max(x, key=x.get) if len(x)>=1 else "")
preds_df["result"].value_counts()

neg    1093
pos     334
Name: result, dtype: int64

In [80]:
print (classification_report(preds_df["category"], preds_df["result"]))

              precision    recall  f1-score   support

         neg       0.85      0.95      0.90       982
         pos       0.85      0.64      0.73       445

    accuracy                           0.85      1427
   macro avg       0.85      0.79      0.81      1427
weighted avg       0.85      0.85      0.85      1427



### Generic Classifier

In [81]:
# !pip install -q tensorflow==2.7.0 tensorflow_addons

In [82]:
from sparknlp_jsl.annotator import TFGraphBuilder

graph_folder = "gc_graph"

gc_graph_builder = TFGraphBuilder()\
    .setModelName("generic_classifier")\
    .setInputCols(["features"])\
    .setLabelColumn("category")\
    .setHiddenLayers([300,200, 50])\
    .setHiddenAct("tanh")\
    .setHiddenActL2(True)\
    .setBatchNorm(True)\
    .setGraphFolder(graph_folder)\
    .setGraphFile("gcf_graph_bert.pb")

In [83]:
features_asm = FeaturesAssembler()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("features")
      
gen_clf = GenericClassifierApproach()\
    .setLabelColumn("category")\
    .setInputCols("features")\
    .setOutputCol("prediction")\
    .setModelFile(f"{graph_folder}/gcf_graph_bert.pb")\
    .setEpochsNumber(20)\
    .setBatchSize(32)\
    .setLearningRate(0.001)\
    .setOutputLogsPath(log_folder)\
    .setDropout(0.5)\
    .setFixImbalance(True)\
    # .setValidationSplit(0.1)

clf_Pipeline = Pipeline(stages=[
    features_asm,
    gc_graph_builder,
    gen_clf])


In [84]:
generic_model_bert = clf_Pipeline.fit(trainingData_with_embeddings)

TF Graph Builder configuration:
Model name: generic_classifier
Graph folder: gc_graph
Graph file name: gcf_graph_bert.pb
Build params: {'input_dim': 768, 'output_dim': 2, 'hidden_layers': [300, 200, 50], 'hidden_act': 'tanh', 'hidden_act_l2': True, 'batch_norm': True}
generic_classifier graph exported to gc_graph/gcf_graph_bert.pb


In [85]:
pred_df = generic_model_bert.transform(testData_with_embeddings)

preds_df = pred_df.select('category','prediction.result').toPandas()
preds_df['result'] = preds_df.result.apply(lambda x : x[0])

print (classification_report(preds_df['category'], preds_df['result']))

              precision    recall  f1-score   support

         neg       0.90      0.95      0.92       982
         pos       0.86      0.77      0.81       445

    accuracy                           0.89      1427
   macro avg       0.88      0.86      0.87      1427
weighted avg       0.89      0.89      0.89      1427



## DocumentLogRegClassifier

In [86]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

normalizer = Normalizer() \
      .setInputCols(["token"]) \
      .setOutputCol("normalized")

stopwords_cleaner = StopWordsCleaner()\
      .setInputCols("normalized")\
      .setOutputCol("cleanTokens")\
      .setCaseSensitive(False)

stemmer = Stemmer() \
      .setInputCols(["cleanTokens"]) \
      .setOutputCol("stem")

logreg = DocumentLogRegClassifierApproach()\
    .setInputCols("stem")\
    .setLabelCol("category")\
    .setOutputCol("prediction")\
    .setMaxIter(10)\
    .setTol(1e-6)

clf_Pipeline = Pipeline(stages=[
    document_assembler, 
    tokenizer,
    normalizer,
    stopwords_cleaner, 
    stemmer, 
    logreg])
doclogreg_model = clf_Pipeline.fit(trainingData)

In [87]:
pred_df = doclogreg_model.transform(testData)

preds_df = pred_df.select('category','prediction.result').toPandas()
preds_df['result'] = preds_df.result.apply(lambda x : x[0])

print (classification_report(preds_df['category'], preds_df['result']))

              precision    recall  f1-score   support

         neg       0.86      0.86      0.86       982
         pos       0.70      0.69      0.69       445

    accuracy                           0.81      1427
   macro avg       0.78      0.78      0.78      1427
weighted avg       0.81      0.81      0.81      1427



# Mtsamples Dataset

## Load Dataset

In [88]:
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/mtsamples_classifier.csv

In [89]:
spark_df = spark.read.csv("mtsamples_classifier.csv", header = True)

spark_df.show(10,truncate=100)

+----------------+----------------------------------------------------------------------------------------------------+
|        category|                                                                                                text|
+----------------+----------------------------------------------------------------------------------------------------+
|Gastroenterology| PROCEDURES PERFORMED: Colonoscopy. INDICATIONS: Renewed symptoms likely consistent with active f...|
|Gastroenterology| OPERATION 1. Ivor-Lewis esophagogastrectomy. 2. Feeding jejunostomy. 3. Placement of two right-s...|
|Gastroenterology| PREOPERATIVE DIAGNOSES: 1. Gastroesophageal reflux disease. 2. Chronic dyspepsia. POSTOPERATIVE ...|
|Gastroenterology| PROCEDURE: Colonoscopy. PREOPERATIVE DIAGNOSES: Rectal bleeding and perirectal abscess. POSTOPER...|
|Gastroenterology| PREOPERATIVE DIAGNOSIS: Right colon tumor. POSTOPERATIVE DIAGNOSES: 1. Right colon cancer. 2. As...|
|Gastroenterology| PREOPERATIVE DIAGNOSI

In [90]:
spark_df.printSchema()

root
 |-- category: string (nullable = true)
 |-- text: string (nullable = true)



In [91]:
spark_df.count()

638

In [92]:
spark_df.groupBy("category").count().show()

+----------------+-----+
|        category|count|
+----------------+-----+
|         Urology|  115|
|       Neurology|  143|
|      Orthopedic|  223|
|Gastroenterology|  157|
+----------------+-----+



In [93]:
(trainingData, testData) = spark_df.randomSplit([0.8, 0.2], seed = 42)

In [94]:
trainingData.groupBy("category").count().show()

+----------------+-----+
|        category|count|
+----------------+-----+
|         Urology|   95|
|       Neurology|  121|
|      Orthopedic|  188|
|Gastroenterology|  132|
+----------------+-----+



In [95]:
testData.groupBy("category").count().show()

+----------------+-----+
|        category|count|
+----------------+-----+
|         Urology|   20|
|       Neurology|   22|
|      Orthopedic|   35|
|Gastroenterology|   25|
+----------------+-----+



## 100 Dimension Healthcare Embeddings (embeddings_healthcare_100d)



We will extract [healthcare_100d embeddings](https://nlp.johnsnowlabs.com/2020/05/29/embeddings_healthcare_100d_en.html) and use it in the classificaiton model training.

In [96]:
document_assembler = DocumentAssembler()\
        .setInputCol("text")\
        .setOutputCol("document")

tokenizer = Tokenizer() \
            .setInputCols(["document"]) \
            .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_healthcare_100d","en","clinical/models")\
        .setInputCols(["document","token"])\
        .setOutputCol("word_embeddings")

sentence_embeddings = SentenceEmbeddings() \
        .setInputCols(["document", "word_embeddings"]) \
        .setOutputCol("sentence_embeddings") \
        .setPoolingStrategy("AVERAGE")

embeddings_pipeline = Pipeline(
    stages = [
        document_assembler,
        tokenizer,
        word_embeddings,
        sentence_embeddings,

    ])

embeddings_healthcare_100d download started this may take some time.
Approximate size to download 475.8 MB
[OK!]


In [97]:
trainingData_with_embeddings = embeddings_pipeline.fit(trainingData).transform(trainingData)
trainingData_with_embeddings = trainingData_with_embeddings.select("text","category","sentence_embeddings")
trainingData_with_embeddings.show(2,truncate=80)

+--------------------------------------------------------------------------------+----------------+--------------------------------------------------------------------------------+
|                                                                            text|        category|                                                             sentence_embeddings|
+--------------------------------------------------------------------------------+----------------+--------------------------------------------------------------------------------+
| ADMISSION DIAGNOSIS: Symptomatic cholelithiasis. DISCHARGE DIAGNOSIS: Sympto...|Gastroenterology|[{sentence_embeddings, 0, 2228,  ADMISSION DIAGNOSIS: Symptomatic cholelithia...|
| ADMITTING DIAGNOSES: Hiatal hernia, gastroesophageal reflux disease reflux. ...|Gastroenterology|[{sentence_embeddings, 0, 3237,  ADMITTING DIAGNOSES: Hiatal hernia, gastroes...|
+--------------------------------------------------------------------------------+-------------

In [98]:
testData_with_embeddings = embeddings_pipeline.fit(testData).transform(testData)
testData_with_embeddings = testData_with_embeddings.select("text","category","sentence_embeddings")
testData_with_embeddings.show(2,truncate=80)

+--------------------------------------------------------------------------------+----------------+--------------------------------------------------------------------------------+
|                                                                            text|        category|                                                             sentence_embeddings|
+--------------------------------------------------------------------------------+----------------+--------------------------------------------------------------------------------+
| ADMITTING DIAGNOSIS: Gastrointestinal bleed. HISTORY OF PRESENT ILLNESS: Ms....|Gastroenterology|[{sentence_embeddings, 0, 3978,  ADMITTING DIAGNOSIS: Gastrointestinal bleed....|
| CHIEF COMPLAINT: Dysphagia and hematemesis while vomiting. HISTORY OF PRESEN...|Gastroenterology|[{sentence_embeddings, 0, 6515,  CHIEF COMPLAINT: Dysphagia and hematemesis w...|
+--------------------------------------------------------------------------------+-------------

In [99]:
testData_with_embeddings.printSchema()

root
 |-- text: string (nullable = true)
 |-- category: string (nullable = true)
 |-- sentence_embeddings: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)



In [100]:
testData_with_embeddings.select(testData_with_embeddings.sentence_embeddings.embeddings).show(3,truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [101]:
log_folder="Mt_logs_healthcare_100d"
!mkdir -p $log_folder

### ClassifierDLApproach

In [102]:
classifier_dl = ClassifierDLApproach()\
        .setInputCols(["sentence_embeddings"])\
        .setOutputCol("prediction")\
        .setLabelColumn("category")\
        .setBatchSize(8)\
        .setMaxEpochs(50)\
        .setLr(0.005)\
        .setDropout(0.3)\
        .setEnableOutputLogs(True)\
        .setOutputLogsPath(log_folder)\
        # .setValidationSplit(0.1)

classifier_dl_pipeline = Pipeline(stages=[classifier_dl])

In [103]:
clfDL_model_hc100 = classifier_dl_pipeline.fit(trainingData_with_embeddings)

In [104]:
preds = clfDL_model_hc100.transform(testData_with_embeddings)

In [105]:
preds_df = preds.select("category","text","prediction.result").toPandas()
preds_df["result"] = preds_df["result"].apply(lambda x : x[0])

print (classification_report(preds_df["category"], preds_df["result"]))

                  precision    recall  f1-score   support

Gastroenterology       0.76      0.76      0.76        25
       Neurology       0.73      0.73      0.73        22
      Orthopedic       0.87      0.94      0.90        35
         Urology       0.59      0.50      0.54        20

        accuracy                           0.76       102
       macro avg       0.74      0.73      0.73       102
    weighted avg       0.76      0.76      0.76       102



### MultiClassifierDL

In [106]:
# MultiClassifierDL accepts list of strings as input label. So we convert label column to array type.
trainingData_with_embeddings = trainingData_with_embeddings.withColumn("category_array", F.array(F.col("category")))
testData_with_embeddings = testData_with_embeddings.withColumn("category_array", F.array(F.col("category")))

In [107]:
multiClassifier = MultiClassifierDLApproach()\
  .setInputCols("sentence_embeddings")\
  .setOutputCol("prediction")\
  .setLabelColumn("category_array")\
  .setBatchSize(8)\
  .setMaxEpochs(40)\
  .setLr(5e-3)\
  .setThreshold(0.5)\
  .setShufflePerEpoch(False)\
  .setEnableOutputLogs(True)\
  .setOutputLogsPath(log_folder)\
#   .setValidationSplit(0.1)

multiClassifier_pipeline = Pipeline(
    stages = [
        multiClassifier
    ])

In [108]:
multiClassifier_model_hc100 = multiClassifier_pipeline.fit(trainingData_with_embeddings)

In [109]:
preds = multiClassifier_model_hc100.transform(testData_with_embeddings)

In [110]:
preds_df = preds.select("category","text","prediction.result","prediction.metadata").toPandas()
preds_df.result.apply(lambda x: str(x) ).value_counts()

['Orthopedic']                        35
['Urology']                           21
['Gastroenterology']                  19
['Neurology']                         17
[]                                     5
['Neurology', 'Orthopedic']            4
['Orthopedic', 'Gastroenterology']     1
Name: result, dtype: int64

In [111]:
# We will get the highest score label as result. You can control the number of zero label results with setThreshold() in the training.
preds_df["scores"] = preds_df.metadata.apply(lambda x: {k:float(v) for k,v in x[0].items()} if len(x)>=1 else "")
preds_df["result"] = preds_df.scores.apply(lambda x: max(x, key=x.get) if len(x)>=1 else "")
preds_df["result"].value_counts()

Orthopedic          36
Urology             21
Gastroenterology    20
Neurology           20
                     5
Name: result, dtype: int64

In [112]:
print (classification_report(preds_df["category"], preds_df["result"]))

                  precision    recall  f1-score   support

                       0.00      0.00      0.00         0
Gastroenterology       0.95      0.76      0.84        25
       Neurology       0.80      0.73      0.76        22
      Orthopedic       0.89      0.91      0.90        35
         Urology       0.90      0.95      0.93        20

        accuracy                           0.84       102
       macro avg       0.71      0.67      0.69       102
    weighted avg       0.89      0.84      0.86       102



### Generic Classifier

In [113]:
!pip install -q tensorflow==2.7.0 tensorflow_addons

In [114]:
from sparknlp_jsl.annotator import TFGraphBuilder

graph_folder = "gc_graph"

gc_graph_builder = TFGraphBuilder()\
    .setModelName("generic_classifier")\
    .setInputCols(["features"])\
    .setLabelColumn("category")\
    .setHiddenLayers([300,200, 50])\
    .setHiddenAct("tanh")\
    .setHiddenActL2(True)\
    .setBatchNorm(True)\
    .setGraphFolder(graph_folder)\
    .setGraphFile("gcf_graph_100d.pb")

In [115]:
features_asm = FeaturesAssembler()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("features")
      
gen_clf = GenericClassifierApproach()\
    .setLabelColumn("category")\
    .setInputCols("features")\
    .setOutputCol("prediction")\
    .setModelFile(f"{graph_folder}/gcf_graph_100d.pb")\
    .setEpochsNumber(50)\
    .setBatchSize(32)\
    .setLearningRate(0.001)\
    .setOutputLogsPath(log_folder)\
    .setDropout(0.3)\
    .setFixImbalance(False)\
    # .setValidationSplit(0.1)

clf_Pipeline = Pipeline(stages=[
    features_asm,
    gc_graph_builder,
    gen_clf])


In [116]:
generic_model_hc100 = clf_Pipeline.fit(trainingData_with_embeddings)

TF Graph Builder configuration:
Model name: generic_classifier
Graph folder: gc_graph
Graph file name: gcf_graph_100d.pb
Build params: {'input_dim': 100, 'output_dim': 4, 'hidden_layers': [300, 200, 50], 'hidden_act': 'tanh', 'hidden_act_l2': True, 'batch_norm': True}
generic_classifier graph exported to gc_graph/gcf_graph_100d.pb


In [117]:
pred_df = generic_model_hc100.transform(testData_with_embeddings)

In [118]:
preds_df = pred_df.select('category','prediction.result').toPandas()
preds_df['result'] = preds_df.result.apply(lambda x : x[0])

print (classification_report(preds_df['category'], preds_df['result']))

                  precision    recall  f1-score   support

Gastroenterology       0.92      0.92      0.92        25
       Neurology       0.88      0.68      0.77        22
      Orthopedic       0.79      0.94      0.86        35
         Urology       0.94      0.85      0.89        20

        accuracy                           0.86       102
       macro avg       0.88      0.85      0.86       102
    weighted avg       0.87      0.86      0.86       102



## 200 Dimension Healthcare Embeddings (embeddings_clinical)



Now we will extract [embeddings_clinical](https://nlp.johnsnowlabs.com/2020/01/28/embeddings_clinical_en.html) embeddings which has 200 dimension output and use this embeddings in the model training.

In [119]:
document_assembler = DocumentAssembler()\
        .setInputCol("text")\
        .setOutputCol("document")

tokenizer = Tokenizer() \
            .setInputCols(["document"]) \
            .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")\
        .setInputCols(["document","token"])\
        .setOutputCol("word_embeddings")

sentence_embeddings = SentenceEmbeddings() \
        .setInputCols(["document", "word_embeddings"]) \
        .setOutputCol("sentence_embeddings") \
        .setPoolingStrategy("AVERAGE")

embeddings_pipeline = Pipeline(
    stages = [
        document_assembler,
        tokenizer,
        word_embeddings,
        sentence_embeddings,
    ])

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]


In [120]:
trainingData_with_embeddings = embeddings_pipeline.fit(trainingData).transform(trainingData)\
                                                  .select("text","category","sentence_embeddings")
trainingData_with_embeddings.show(2,truncate=80)

+--------------------------------------------------------------------------------+----------------+--------------------------------------------------------------------------------+
|                                                                            text|        category|                                                             sentence_embeddings|
+--------------------------------------------------------------------------------+----------------+--------------------------------------------------------------------------------+
| ADMISSION DIAGNOSIS: Symptomatic cholelithiasis. DISCHARGE DIAGNOSIS: Sympto...|Gastroenterology|[{sentence_embeddings, 0, 2228,  ADMISSION DIAGNOSIS: Symptomatic cholelithia...|
| ADMITTING DIAGNOSES: Hiatal hernia, gastroesophageal reflux disease reflux. ...|Gastroenterology|[{sentence_embeddings, 0, 3237,  ADMITTING DIAGNOSES: Hiatal hernia, gastroes...|
+--------------------------------------------------------------------------------+-------------

In [121]:
testData_with_embeddings = embeddings_pipeline.fit(testData).transform(testData)\
                                                  .select("text","category","sentence_embeddings")
testData_with_embeddings.show(2,truncate=80)

+--------------------------------------------------------------------------------+----------------+--------------------------------------------------------------------------------+
|                                                                            text|        category|                                                             sentence_embeddings|
+--------------------------------------------------------------------------------+----------------+--------------------------------------------------------------------------------+
| ADMITTING DIAGNOSIS: Gastrointestinal bleed. HISTORY OF PRESENT ILLNESS: Ms....|Gastroenterology|[{sentence_embeddings, 0, 3978,  ADMITTING DIAGNOSIS: Gastrointestinal bleed....|
| CHIEF COMPLAINT: Dysphagia and hematemesis while vomiting. HISTORY OF PRESEN...|Gastroenterology|[{sentence_embeddings, 0, 6515,  CHIEF COMPLAINT: Dysphagia and hematemesis w...|
+--------------------------------------------------------------------------------+-------------

In [122]:
log_folder="Mt_logs_healthcare_200d"

### ClassifierDLApproach 

In [123]:
classifier_dl = ClassifierDLApproach()\
        .setInputCols(["sentence_embeddings"])\
        .setOutputCol("class")\
        .setLabelColumn("category")\
        .setBatchSize(8)\
        .setMaxEpochs(50)\
        .setLr(0.005)\
        .setDropout(0.1)\
        .setEnableOutputLogs(True)\
        .setOutputLogsPath(log_folder)

classifier_dl_pipeline = Pipeline(stages=[classifier_dl])

In [124]:
clfDL_model_hc200 = classifier_dl_pipeline.fit(trainingData_with_embeddings)

In [125]:
preds = clfDL_model_hc200.transform(testData_with_embeddings)

preds_df = preds.select("category","text","class.result").toPandas()
preds_df["result"] = preds_df["result"].apply(lambda x : x[0])

print (classification_report(preds_df["category"], preds_df["result"]))

                  precision    recall  f1-score   support

Gastroenterology       1.00      0.76      0.86        25
       Neurology       0.94      0.77      0.85        22
      Orthopedic       0.89      0.97      0.93        35
         Urology       0.74      1.00      0.85        20

        accuracy                           0.88       102
       macro avg       0.89      0.88      0.87       102
    weighted avg       0.90      0.88      0.88       102



### MultiClassifierDL



In [126]:
# MultiClassifierDL accepts list of strings as input label. So we convert label column to array type.
trainingData_with_embeddings = trainingData_with_embeddings.withColumn("category_array", F.array(F.col("category")))
testData_with_embeddings = testData_with_embeddings.withColumn("category_array", F.array(F.col("category")))

In [127]:
multiClassifier = MultiClassifierDLApproach()\
  .setInputCols("sentence_embeddings")\
  .setOutputCol("prediction")\
  .setLabelColumn("category_array")\
  .setBatchSize(16)\
  .setMaxEpochs(20)\
  .setLr(9e-3)\
  .setThreshold(0.2)\
  .setShufflePerEpoch(False)\
  .setEnableOutputLogs(True)\
  .setOutputLogsPath(log_folder)\
#   .setValidationSplit(0.1)

multiClassifier_pipeline = Pipeline(
    stages = [
        multiClassifier
    ])

In [128]:
multiClassifier_model_hc200 = multiClassifier_pipeline.fit(trainingData_with_embeddings)

In [129]:
preds = multiClassifier_model_hc200.transform(testData_with_embeddings)

In [130]:
preds_df = preds.select("category","text","prediction.result","prediction.metadata").toPandas()
preds_df.result.apply(lambda x: str(x) ).value_counts()

['Orthopedic']                        32
['Gastroenterology']                  19
['Urology']                           18
['Neurology']                         13
['Neurology', 'Orthopedic']           11
[]                                     4
['Orthopedic', 'Gastroenterology']     3
['Urology', 'Orthopedic']              1
['Urology', 'Gastroenterology']        1
Name: result, dtype: int64

In [131]:
# We will get the highest score label as result. you can control the number of zero label results with setThreshold() in the training.
preds_df["scores"] = preds_df.metadata.apply(lambda x: {k:float(v) for k,v in x[0].items()} if len(x)>=1 else "")
preds_df["result"] = preds_df.scores.apply(lambda x: max(x, key=x.get) if len(x)>=1 else "")
preds_df["result"].value_counts()

Orthopedic          38
Gastroenterology    22
Urology             20
Neurology           18
                     4
Name: result, dtype: int64

In [132]:
print (classification_report(preds_df["category"], preds_df["result"]))

                  precision    recall  f1-score   support

                       0.00      0.00      0.00         0
Gastroenterology       1.00      0.88      0.94        25
       Neurology       0.89      0.73      0.80        22
      Orthopedic       0.87      0.94      0.90        35
         Urology       0.90      0.90      0.90        20

        accuracy                           0.87       102
       macro avg       0.73      0.69      0.71       102
    weighted avg       0.91      0.87      0.89       102



### Generic Classifier 

In [133]:
# !pip install -q tensorflow==2.7.0 tensorflow_addons

In [134]:
from sparknlp_jsl.annotator import TFGraphBuilder

graph_folder = "gc_graph"

gc_graph_builder = TFGraphBuilder()\
    .setModelName("generic_classifier")\
    .setInputCols(["features"])\
    .setLabelColumn("category")\
    .setHiddenLayers([300,200, 50])\
    .setHiddenAct("tanh")\
    .setHiddenActL2(True)\
    .setBatchNorm(True)\
    .setGraphFolder(graph_folder)\
    .setGraphFile("gcf_graph_200d.pb")

In [135]:
features_asm = FeaturesAssembler()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("features")
      
gen_clf = GenericClassifierApproach()\
    .setLabelColumn("category")\
    .setInputCols("features")\
    .setOutputCol("prediction")\
    .setModelFile(f"{graph_folder}/gcf_graph_200d.pb")\
    .setEpochsNumber(30)\
    .setBatchSize(32)\
    .setLearningRate(0.001)\
    .setOutputLogsPath(log_folder)\
    .setDropout(0.3)\
    .setFixImbalance(True)\
    # .setValidationSplit(0.1)

clf_Pipeline = Pipeline(stages=[
    features_asm,
    gc_graph_builder,
    gen_clf])


In [136]:
generic_model_hc200 = clf_Pipeline.fit(trainingData_with_embeddings)

TF Graph Builder configuration:
Model name: generic_classifier
Graph folder: gc_graph
Graph file name: gcf_graph_200d.pb
Build params: {'input_dim': 200, 'output_dim': 4, 'hidden_layers': [300, 200, 50], 'hidden_act': 'tanh', 'hidden_act_l2': True, 'batch_norm': True}
generic_classifier graph exported to gc_graph/gcf_graph_200d.pb


In [137]:
pred_df = generic_model_hc200.transform(testData_with_embeddings)

preds_df = pred_df.select('category','prediction.result').toPandas()
preds_df['result'] = preds_df.result.apply(lambda x : x[0])

print (classification_report(preds_df['category'], preds_df['result']))

                  precision    recall  f1-score   support

Gastroenterology       0.85      0.88      0.86        25
       Neurology       0.79      0.86      0.83        22
      Orthopedic       0.91      0.86      0.88        35
         Urology       0.89      0.85      0.87        20

        accuracy                           0.86       102
       macro avg       0.86      0.86      0.86       102
    weighted avg       0.87      0.86      0.86       102



## Bert Sentence Embeddings (sbiobert_base_cased_mli)

We will extract [sbiobert_base_cased_mli](https://nlp.johnsnowlabs.com/2020/11/27/sbiobert_base_cased_mli_en.html) embeddings which has 768 dimension output and use this embeddings in the model training.

In [138]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")
        
bert_sent = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", 'en','clinical/models')\
    .setInputCols(["document"])\
    .setOutputCol("sentence_embeddings")

embeddings_pipeline = Pipeline(
    stages = [document_assembler,
              bert_sent])

sbiobert_base_cased_mli download started this may take some time.
Approximate size to download 384.3 MB
[OK!]


In [139]:
trainingData_with_embeddings = embeddings_pipeline.fit(trainingData).transform(trainingData)\
                                                  .select("text","category","sentence_embeddings")

trainingData_with_embeddings.show(2,truncate=80)

+--------------------------------------------------------------------------------+----------------+--------------------------------------------------------------------------------+
|                                                                            text|        category|                                                             sentence_embeddings|
+--------------------------------------------------------------------------------+----------------+--------------------------------------------------------------------------------+
| ADMISSION DIAGNOSIS: Symptomatic cholelithiasis. DISCHARGE DIAGNOSIS: Sympto...|Gastroenterology|[{sentence_embeddings, 0, 2228,  ADMISSION DIAGNOSIS: Symptomatic cholelithia...|
| ADMITTING DIAGNOSES: Hiatal hernia, gastroesophageal reflux disease reflux. ...|Gastroenterology|[{sentence_embeddings, 0, 3237,  ADMITTING DIAGNOSES: Hiatal hernia, gastroes...|
+--------------------------------------------------------------------------------+-------------

In [140]:
testData_with_embeddings = embeddings_pipeline.fit(testData).transform(testData)\
                                                  .select("text","category","sentence_embeddings")

testData_with_embeddings.show(2,truncate=80)

+--------------------------------------------------------------------------------+----------------+--------------------------------------------------------------------------------+
|                                                                            text|        category|                                                             sentence_embeddings|
+--------------------------------------------------------------------------------+----------------+--------------------------------------------------------------------------------+
| ADMITTING DIAGNOSIS: Gastrointestinal bleed. HISTORY OF PRESENT ILLNESS: Ms....|Gastroenterology|[{sentence_embeddings, 0, 3978,  ADMITTING DIAGNOSIS: Gastrointestinal bleed....|
| CHIEF COMPLAINT: Dysphagia and hematemesis while vomiting. HISTORY OF PRESEN...|Gastroenterology|[{sentence_embeddings, 0, 6515,  CHIEF COMPLAINT: Dysphagia and hematemesis w...|
+--------------------------------------------------------------------------------+-------------

In [141]:
log_folder="Mt_logs_bert"

### ClassifierDLApproach 

In [142]:
classifier_dl = ClassifierDLApproach()\
        .setInputCols(["sentence_embeddings"])\
        .setOutputCol("class")\
        .setLabelColumn("category")\
        .setBatchSize(8)\
        .setMaxEpochs(30)\
        .setLr(0.002)\
        .setDropout(0.1)\
        .setEnableOutputLogs(True)\
        .setOutputLogsPath(log_folder)

classifier_dl_pipeline = Pipeline(stages=[classifier_dl])

In [143]:
clfDL_model_bert = classifier_dl_pipeline.fit(trainingData_with_embeddings)

In [144]:
preds = clfDL_model_bert.transform(testData_with_embeddings)

preds_df = preds.select("category","text","class.result").toPandas()
preds_df["result"] = preds_df["result"].apply(lambda x : x[0])

print (classification_report(preds_df["category"], preds_df["result"]))

                  precision    recall  f1-score   support

Gastroenterology       1.00      0.84      0.91        25
       Neurology       0.73      0.86      0.79        22
      Orthopedic       0.91      0.83      0.87        35
         Urology       0.74      0.85      0.79        20

        accuracy                           0.84       102
       macro avg       0.84      0.85      0.84       102
    weighted avg       0.86      0.84      0.85       102



### MultiClassifierDL 

In [145]:
# MultiClassifierDL accepts list of strings as input label. So we convert label column to array type.
trainingData_with_embeddings = trainingData_with_embeddings.withColumn("category_array", F.array(F.col("category")))
testData_with_embeddings = testData_with_embeddings.withColumn("category_array", F.array(F.col("category")))

In [146]:
multiClassifier = MultiClassifierDLApproach()\
  .setInputCols("sentence_embeddings")\
  .setOutputCol("prediction")\
  .setLabelColumn("category_array")\
  .setBatchSize(8)\
  .setMaxEpochs(20)\
  .setLr(9e-3)\
  .setThreshold(0.5)\
  .setShufflePerEpoch(False)\
  .setEnableOutputLogs(True)\
  .setOutputLogsPath(log_folder)\
#   .setValidationSplit(0.1)

multiClassifier_pipeline = Pipeline(
    stages = [
        multiClassifier
    ])

In [147]:
multiClassifier_model_hc200 = multiClassifier_pipeline.fit(trainingData_with_embeddings)

In [148]:
preds = multiClassifier_model_hc200.transform(testData_with_embeddings)

In [149]:
preds_df = preds.select("category","text","prediction.result","prediction.metadata").toPandas()
preds_df.result.apply(lambda x: str(x) ).value_counts()

['Orthopedic']          33
['Gastroenterology']    23
['Neurology']           22
['Urology']             19
[]                       5
Name: result, dtype: int64

In [150]:
# We will get the highest score label as result. You can control the number of zero label results with setThreshold() in the training.
preds_df["scores"] = preds_df.metadata.apply(lambda x: {k:float(v) for k,v in x[0].items()} if len(x)>=1 else "")
preds_df["result"] = preds_df.scores.apply(lambda x: max(x, key=x.get) if len(x)>=1 else "")
preds_df["result"].value_counts()

Orthopedic          33
Gastroenterology    23
Neurology           22
Urology             19
                     5
Name: result, dtype: int64

In [151]:
print (classification_report(preds_df["category"], preds_df["result"]))

                  precision    recall  f1-score   support

                       0.00      0.00      0.00         0
Gastroenterology       1.00      0.92      0.96        25
       Neurology       0.82      0.82      0.82        22
      Orthopedic       0.91      0.86      0.88        35
         Urology       0.89      0.85      0.87        20

        accuracy                           0.86       102
       macro avg       0.72      0.69      0.71       102
    weighted avg       0.91      0.86      0.89       102



### Generic Classifier 

In [152]:
# !pip install -q tensorflow==2.7.0 tensorflow_addons

In [153]:
from sparknlp_jsl.annotator import TFGraphBuilder

graph_folder = "gc_graph"

gc_graph_builder = TFGraphBuilder()\
    .setModelName("generic_classifier")\
    .setInputCols(["features"])\
    .setLabelColumn("category")\
    .setHiddenLayers([300,200, 50])\
    .setHiddenAct("tanh")\
    .setHiddenActL2(True)\
    .setBatchNorm(True)\
    .setGraphFolder(graph_folder)\
    .setGraphFile("gcf_graph_bert.pb")

In [154]:
features_asm = FeaturesAssembler()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("features")
      
gen_clf = GenericClassifierApproach()\
    .setLabelColumn("category")\
    .setInputCols("features")\
    .setOutputCol("prediction")\
    .setModelFile(f"{graph_folder}/gcf_graph_bert.pb")\
    .setEpochsNumber(25)\
    .setBatchSize(8)\
    .setLearningRate(0.002)\
    .setOutputLogsPath(log_folder)\
    .setDropout(0.2)\
    .setFixImbalance(True)\
    # .setValidationSplit(0.1)

clf_Pipeline = Pipeline(stages=[
    features_asm,
    gc_graph_builder,
    gen_clf])


In [155]:
generic_model_bert = clf_Pipeline.fit(trainingData_with_embeddings)

TF Graph Builder configuration:
Model name: generic_classifier
Graph folder: gc_graph
Graph file name: gcf_graph_bert.pb
Build params: {'input_dim': 768, 'output_dim': 4, 'hidden_layers': [300, 200, 50], 'hidden_act': 'tanh', 'hidden_act_l2': True, 'batch_norm': True}
generic_classifier graph exported to gc_graph/gcf_graph_bert.pb


In [156]:
pred_df = generic_model_bert.transform(testData_with_embeddings)

preds_df = pred_df.select('category','prediction.result').toPandas()
preds_df['result'] = preds_df.result.apply(lambda x : x[0])

print (classification_report(preds_df['category'], preds_df['result']))

                  precision    recall  f1-score   support

Gastroenterology       0.92      0.88      0.90        25
       Neurology       0.79      0.86      0.83        22
      Orthopedic       0.93      0.74      0.83        35
         Urology       0.73      0.95      0.83        20

        accuracy                           0.84       102
       macro avg       0.84      0.86      0.84       102
    weighted avg       0.86      0.84      0.84       102



## DocumentLogRegClassifier

In [157]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

logreg = DocumentLogRegClassifierApproach()\
    .setInputCols("token")\
    .setLabelCol("category")\
    .setOutputCol("prediction")\
    .setMaxIter(10)\
    .setTol(1e-6)

clf_Pipeline = Pipeline(stages=[document_assembler,
                                tokenizer,
                                logreg])

doclogreg_model = clf_Pipeline.fit(trainingData)

In [158]:
pred_df = doclogreg_model.transform(testData)

preds_df = pred_df.select('category','prediction.result').toPandas()
preds_df['result'] = preds_df.result.apply(lambda x : x[0])

print (classification_report(preds_df['category'], preds_df['result']))

                  precision    recall  f1-score   support

Gastroenterology       0.92      0.88      0.90        25
       Neurology       0.80      0.73      0.76        22
      Orthopedic       0.91      0.89      0.90        35
         Urology       0.79      0.95      0.86        20

        accuracy                           0.86       102
       macro avg       0.86      0.86      0.86       102
    weighted avg       0.87      0.86      0.86       102

