![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.5.Resume_MedicalNer_Model_Training.ipynb)

# 1.5 Resume MedicalNer Model Training

Steps:
- Train a new model for a few epochs.
- Load the same model, train it for more epochs on the same taxonomy, and check the stats.
- Continue training a model that was already trained on a different dataset.

## Colab Setup

In [None]:
import json

from google.colab import files

license_keys = files.upload()

with open(list(license_keys.keys())[0]) as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)

# Adding license key-value pairs to environment variables
import os
os.environ.update(license_keys)

In [None]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.1.2 spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

# Installing Spark NLP Display Library for visualization
! pip install -q spark-nlp-display

In [None]:
# Optional: use the start() function below if you want to start the session manually with custom params
from pyspark.sql import SparkSession

def start(SECRET):
    builder = SparkSession.builder \
        .appName("Spark NLP Licensed") \
        .master("local[*]") \
        .config("spark.driver.memory", "16G") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
        .config("spark.kryoserializer.buffer.max", "2000M") \
        .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:"+PUBLIC_VERSION) \
        .config("spark.jars", "https://pypi.johnsnowlabs.com/"+SECRET+"/spark-nlp-jsl-"+JSL_VERSION+".jar")
      
    return builder.getOrCreate()

#spark = start(SECRET)

In [None]:
import json
import os
from pyspark.ml import Pipeline,PipelineModel
from pyspark.sql import SparkSession

from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from sparknlp.base import *
import sparknlp_jsl
import sparknlp

import warnings
warnings.filterwarnings('ignore')

params = {"spark.driver.memory":"16G", # Amount of memory to use for the driver process, i.e. where SparkContext is initialized
          "spark.kryoserializer.buffer.max":"2000M", # Maximum allowable size of Kryo serialization buffer, in MiB unless otherwise specified. 
          "spark.driver.maxResultSize":"2000M"} # Limit of total size of serialized results of all partitions for each Spark action (e.g. collect) in bytes. 
                                                # Should be at least 1M, or 0 for unlimited. 

spark = sparknlp_jsl.start(license_keys['SECRET'],params=params)

print ("Spark NLP Version :", sparknlp.version())
print ("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark

Spark NLP Version : 3.4.0
Spark NLP_JSL Version : 3.4.0


## Download Clinical Word Embeddings for training

In [None]:
clinical_embeddings = WordEmbeddingsModel.pretrained('embeddings_clinical', "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]


## Download Data for Training (NCBI Disease Dataset)

In [None]:
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/NCBI_disease_official_test.conll
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/NCBI_disease_official_train_dev.conll

In [None]:
from sparknlp.training import CoNLL

training_data = CoNLL().readDataset(spark, 'NCBI_disease_official_train_dev.conll')

training_data.show(3)

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|                 pos|               label|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|Identification of...|[{document, 0, 89...|[{document, 0, 89...|[{token, 0, 13, I...|[{pos, 0, 13, NN,...|[{named_entity, 0...|
|The adenomatous p...|[{document, 0, 21...|[{document, 0, 21...|[{token, 0, 2, Th...|[{pos, 0, 2, NN, ...|[{named_entity, 0...|
|Complex formation...|[{document, 0, 63...|[{document, 0, 63...|[{token, 0, 6, Co...|[{pos, 0, 6, NN, ...|[{named_entity, 0...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
only showing top 3 rows



In [None]:
from sparknlp.training import CoNLL

test_data = CoNLL().readDataset(spark, 'NCBI_disease_official_test.conll')

test_data.show(3)

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|                 pos|               label|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|Clustering of mis...|[{document, 0, 10...|[{document, 0, 10...|[{token, 0, 9, Cl...|[{pos, 0, 9, NN, ...|[{named_entity, 0...|
|Ataxia - telangie...|[{document, 0, 13...|[{document, 0, 13...|[{token, 0, 5, At...|[{pos, 0, 5, NN, ...|[{named_entity, 0...|
|The risk of cance...|[{document, 0, 15...|[{document, 0, 15...|[{token, 0, 2, Th...|[{pos, 0, 2, NN, ...|[{named_entity, 0...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
only showing top 3 rows



## Split the test data into two parts:
- We keep the first part separate and use it to train the model further, since it is completely unseen data from the same taxonomy.

- The second part is used for testing and evaluation.

In [None]:
(test_data_1, test_data_2) = test_data.randomSplit([0.5, 0.5], seed = 100)

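# save the test data (with embeddings) as parquet for easy testing;
# overwrite mode lets this cell be re-run without a "path already exists" error
clinical_embeddings.transform(test_data_1).write.mode('overwrite').parquet('test_1.parquet')

clinical_embeddings.transform(test_data_2).write.mode('overwrite').parquet('test_2.parquet')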

## Train a new model, pause, and resume training on the same dataset.

### Create a graph

In [None]:
!pip install -q tensorflow-addons

In [None]:
from sparknlp_jsl.training import tf_graph

tf_graph.print_model_params("ner_dl")

tf_graph.build("ner_dl", build_params={"embeddings_dim": 200, "nchars": 128, "ntags": 12, "is_medical": 1}, model_location="./medical_ner_graphs", model_filename="auto")
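
The build parameters above have to be compatible with the training data: the graph needs room for the labels and characters that occur in the dataset, and `embeddings_dim` has to match the embeddings (200 here). As a rough way to check this before building a graph, the sketch below counts the distinct tags and token characters in CoNLL-formatted text. `conll_stats` is a hypothetical helper, not part of Spark NLP, the sample sentence is made up for illustration, and the way Spark NLP counts characters internally may differ from this simple count.

```python
def conll_stats(conll_text):
    """Count the distinct NER tags and distinct token characters in CoNLL text.

    Assumes the CoNLL-2003 layout: one token per line, whitespace-separated
    columns, token in the first column and NER tag in the last one.
    """
    tags, chars = set(), set()
    for line in conll_text.splitlines():
        line = line.strip()
        # Skip sentence separators and document markers.
        if not line or line.startswith("-DOCSTART-"):
            continue
        columns = line.split()
        if len(columns) < 2:
            continue
        tags.add(columns[-1])
        chars.update(columns[0])
    return {"ntags": len(tags), "nchars": len(chars), "tags": sorted(tags)}


# Made-up sample in the same layout as the NCBI files (illustration only).
sample = """-DOCSTART- -X- -X- O

Patient NN NN O
has NN NN O
cystic NN NN B-Disease
fibrosis NN NN I-Disease
. NN NN O
"""

stats = conll_stats(sample)
print("tags:", stats["ntags"], stats["tags"])
print("chars:", stats["nchars"])
```

In the notebook you would pass the contents of `NCBI_disease_official_train_dev.conll` instead of `sample`, then choose `ntags` and `nchars` at least as large as the reported counts.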


### Train for 2 epochs

In [None]:
nerTagger = MedicalNerApproach()\
      .setInputCols(["sentence", "token", "embeddings"])\
      .setLabelColumn("label")\
      .setOutputCol("ner")\
      .setMaxEpochs(2)\
      .setLr(0.003)\
      .setBatchSize(8)\
      .setRandomSeed(0)\
      .setVerbose(1)\
      .setEvaluationLogExtended(True) \
      .setEnableOutputLogs(True)\
      .setIncludeConfidence(True)\
      .setTestDataset('./test_2.parquet')\
      .setGraphFolder('./medical_ner_graphs')\
      .setOutputLogsPath('./ner_logs')

ner_pipeline = Pipeline(stages=[
      clinical_embeddings,
      nerTagger
 ])

In [None]:

%%time
ner_model = ner_pipeline.fit(training_data)

CPU times: user 1.77 s, sys: 272 ms, total: 2.04 s
Wall time: 6min 18s


In [None]:
!ls ner_logs/MedicalNerApproach*

ner_logs/MedicalNerApproach_088475de9d98.log


In [None]:
# Training Logs
! cat ner_logs/MedicalNerApproach*

Name of the selected graph: /content/./medical_ner_graphs/blstm_12_200_128_128.pb
Training started - total epochs: 2 - lr: 0.003 - batch size: 8 - labels: 3 - chars: 84 - training examples: 6347


Epoch 1/2 started, lr: 0.003, dataset size: 6347


Epoch 1/2 - 172.76s - loss: 2031.3741 - avg training loss: 2.5551877 - batches: 795
Quality on test dataset: 
time to finish evaluation: 3.90s
Total test loss: 99.7826	Avg test loss: 1.6912
label	 tp	 fp	 fn	 prec	 rec	 f1
I-Disease	 507	 143	 29	 0.78	 0.9458955	 0.8549747
B-Disease	 401	 109	 63	 0.7862745	 0.86422414	 0.8234086
tp: 908 fp: 252 fn: 92 labels: 2
Macro-average	 prec: 0.7831372, rec: 0.9050598, f1: 0.8396959
Micro-average	 prec: 0.7827586, rec: 0.908, f1: 0.84074074


Epoch 2/2 started, lr: 0.0029850747, dataset size: 6347


Epoch 2/2 - 170.24s - loss: 799.6125 - avg training loss: 1.0058019 - batches: 795
Quality on test dataset: 
time to finish evaluation: 2.78s
Total test loss: 74.0579	Avg test loss: 1.2552
label	 tp	 fp	 f
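
To make the averaged lines in the log easier to read, here is a small sketch that recomputes the epoch 1 macro and micro averages from the per-label token counts printed above. Judging from the logged numbers, the macro F1 appears to be the harmonic mean of the macro precision and macro recall rather than the mean of the per-label F1 scores. That is an inference from this log, not a documented guarantee.

```python
def prf(tp, fp, fn):
    """Return (precision, recall, f1) for one set of counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Token-level counts from epoch 1 of the log above: label -> (tp, fp, fn)
epoch1 = {"I-Disease": (507, 143, 29), "B-Disease": (401, 109, 63)}

per_label = {label: prf(*counts) for label, counts in epoch1.items()}

# Macro: average precision and recall over the labels, then combine them.
macro_p = sum(p for p, _, _ in per_label.values()) / len(per_label)
macro_r = sum(r for _, r, _ in per_label.values()) / len(per_label)
macro_f1 = 2 * macro_p * macro_r / (macro_p + macro_r)

# Micro: pool the counts over all labels first, then compute the metrics once.
tp, fp, fn = (sum(counts[i] for counts in epoch1.values()) for i in range(3))
micro_p, micro_r, micro_f1 = prf(tp, fp, fn)

print(f"macro prec={macro_p:.4f} rec={macro_r:.4f} f1={macro_f1:.4f}")
print(f"micro prec={micro_p:.4f} rec={micro_r:.4f} f1={micro_f1:.4f}")
```

Both sets of values agree with the `Macro-average` and `Micro-average` lines of epoch 1 up to rounding.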

In [None]:
# Logs of 4 consecutive epochs to compare with 2+2 epochs on separate datasets from same taxonomy

#!cat ner_logs/MedicalNerApproach_4d3d69967c3f.log

### Evaluate

In [None]:
from sparknlp_jsl.eval import NerDLMetrics
import pyspark.sql.functions as F

pred_df = ner_model.stages[1].transform(clinical_embeddings.transform(test_data_2))

evaler = NerDLMetrics(mode="full_chunk", dropO=True)

eval_result = evaler.computeMetricsFromDF(pred_df.select("label","ner"), prediction_col="ner", label_col="label").cache()

eval_result.withColumn("precision", F.round(eval_result["precision"],4))\
    .withColumn("recall", F.round(eval_result["recall"],4))\
    .withColumn("f1", F.round(eval_result["f1"],4)).show(100)

# show() prints the table itself and returns None, so it is not wrapped in print()
eval_result.selectExpr("avg(f1) as macro").show()

# support-weighted average of the per-entity F1 scores, reported here as "micro"
eval_result.selectExpr("sum(f1*total) as sumprod","sum(total) as sumtotal").selectExpr("sumprod/sumtotal as micro").show()

+-------+-----+----+-----+-----+---------+------+------+
| entity|   tp|  fp|   fn|total|precision|recall|    f1|
+-------+-----+----+-----+-----+---------+------+------+
|Disease|358.0|61.0|102.0|460.0|   0.8544|0.7783|0.8146|
+-------+-----+----+-----+-----+---------+------+------+

+------------------+
|             macro|
+------------------+
|0.8145620022753128|
+------------------+

+------------------+
|             micro|
+------------------+
|0.8145620022753128|
+------------------+

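
The chunk-level table can be reproduced by hand from its `tp`, `fp` and `fn` columns. The sketch below does that in plain Python and then applies the same two aggregations as the `selectExpr` calls (a plain average of F1, and an average of F1 weighted by `total`). It assumes `total` is `tp + fn`, which matches the table. With a single entity type both aggregations collapse to that entity's F1, which is why the two numbers are identical here.

```python
# Chunk-level counts for the only entity type, taken from the table above.
rows = [{"entity": "Disease", "tp": 358.0, "fp": 61.0, "fn": 102.0}]

for row in rows:
    row["total"] = row["tp"] + row["fn"]  # number of gold chunks (assumption, matches the table)
    row["precision"] = row["tp"] / (row["tp"] + row["fp"])
    row["recall"] = row["tp"] / row["total"]
    row["f1"] = (2 * row["precision"] * row["recall"]
                 / (row["precision"] + row["recall"]))

# Same aggregations as the two selectExpr calls in the cell above.
macro = sum(row["f1"] for row in rows) / len(rows)
weighted = (sum(row["f1"] * row["total"] for row in rows)
            / sum(row["total"] for row in rows))

for row in rows:
    print(row["entity"], round(row["precision"], 4),
          round(row["recall"], 4), round(row["f1"], 4))
print("macro:", macro, "weighted:", weighted)
```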


### Save the model to disk

In [None]:
ner_model.stages[1].write().overwrite().save('models/NCBI_NER_2_epoch')

### Train using the saved model on unseen dataset
#### We use unseen data from the same taxonomy

In [None]:

nerTagger = MedicalNerApproach()\
      .setInputCols(["sentence", "token", "embeddings"])\
      .setLabelColumn("label")\
      .setOutputCol("ner")\
      .setMaxEpochs(2)\
      .setLr(0.003)\
      .setBatchSize(8)\
      .setRandomSeed(0)\
      .setVerbose(1)\
      .setEvaluationLogExtended(True) \
      .setEnableOutputLogs(True)\
      .setIncludeConfidence(True)\
      .setTestDataset('/content/test_2.parquet')\
      .setOutputLogsPath('ner_logs')\
      .setGraphFolder('medical_ner_graphs')\
      .setPretrainedModelPath("models/NCBI_NER_2_epoch") # load the existing model saved above
    
ner_pipeline = Pipeline(stages=[
      clinical_embeddings,
      nerTagger
 ])

In [None]:
# remove the existing logs

! rm -r ./ner_logs

In [None]:

%%time
ner_model_retrained = ner_pipeline.fit(test_data_1)


CPU times: user 225 ms, sys: 35.7 ms, total: 261 ms
Wall time: 41.3 s


In [None]:
!ls ner_logs/MedicalNerApproach*

ner_logs/MedicalNerApproach_75af07e71f01.log


In [None]:
!cat ./ner_logs/MedicalNerApproach*

Name of the selected graph: pretrained graph
Training started - total epochs: 2 - lr: 0.003 - batch size: 8 - labels: 3 - chars: 84 - training examples: 477


Epoch 1/2 started, lr: 0.003, dataset size: 477


Epoch 1/2 - 15.21s - loss: 71.56894 - avg training loss: 1.1732613 - batches: 61
Quality on test dataset: 
time to finish evaluation: 3.12s
Total test loss: 63.2947	Avg test loss: 1.0728
label	 tp	 fp	 fn	 prec	 rec	 f1
I-Disease	 472	 60	 64	 0.88721806	 0.880597	 0.88389504
B-Disease	 407	 45	 57	 0.9004425	 0.8771552	 0.88864625
tp: 879 fp: 105 fn: 121 labels: 2
Macro-average	 prec: 0.8938303, rec: 0.8788761, f1: 0.88629013
Micro-average	 prec: 0.89329267, rec: 0.879, f1: 0.8860887


Epoch 2/2 started, lr: 0.0029850747, dataset size: 477


Epoch 2/2 - 14.03s - loss: 51.73764 - avg training loss: 0.84815806 - batches: 61
Quality on test dataset: 
time to finish evaluation: 2.69s
Total test loss: 71.7597	Avg test loss: 1.2163
label	 tp	 fp	 fn	 prec	 rec	 f1
I-Disease	 501	 89	 3
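
In both logs the learning rate drops from 0.003 in epoch 1 to 0.0029850747 in epoch 2. That is consistent with an inverse-time decay of the form `lr / (1 + decay * epoch)` with a decay coefficient of 0.005. The sketch below only reproduces the logged numbers under that assumption; the coefficient is inferred from the log, and `decayed_lr` is a hypothetical helper, not a Spark NLP function.

```python
def decayed_lr(initial_lr, decay, epoch):
    """Learning rate for a zero-based epoch index under inverse-time decay."""
    return initial_lr / (1.0 + decay * epoch)


# decay=0.005 is an assumption: it is the value that reproduces the log above.
for epoch in range(4):
    print(f"epoch {epoch + 1}: lr = {decayed_lr(0.003, 0.005, epoch):.10f}")
```

Note that the resumed run starts again at 0.003 in its first epoch, so the schedule seems to restart when training continues from a saved model instead of carrying on from the previous run.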

In [None]:
from sparknlp_jsl.eval import NerDLMetrics
import pyspark.sql.functions as F

pred_df = ner_model_retrained.stages[1].transform(clinical_embeddings.transform(test_data_2))

evaler = NerDLMetrics(mode="full_chunk", dropO=True)

eval_result = evaler.computeMetricsFromDF(pred_df.select("label","ner"), prediction_col="ner", label_col="label").cache()

eval_result.withColumn("precision", F.round(eval_result["precision"],4))\
    .withColumn("recall", F.round(eval_result["recall"],4))\
    .withColumn("f1", F.round(eval_result["f1"],4)).show(100)

# show() prints the table itself and returns None, so it is not wrapped in print()
eval_result.selectExpr("avg(f1) as macro").show()

# support-weighted average of the per-entity F1 scores, reported here as "micro"
eval_result.selectExpr("sum(f1*total) as sumprod","sum(total) as sumtotal").selectExpr("sumprod/sumtotal as micro").show()

+-------+-----+----+----+-----+---------+------+------+
| entity|   tp|  fp|  fn|total|precision|recall|    f1|
+-------+-----+----+----+-----+---------+------+------+
|Disease|415.0|77.0|45.0|460.0|   0.8435|0.9022|0.8718|
+-------+-----+----+----+-----+---------+------+------+

+------------------+
|             macro|
+------------------+
|0.8718487394957983|
+------------------+

+------------------+
|             micro|
+------------------+
|0.8718487394957982|
+------------------+



## Now let's take a model trained on a different dataset and train it further on this dataset

In [None]:
jsl_ner = MedicalNerModel.pretrained('ner_jsl','en','clinical/models')

jsl_ner.getClasses()

ner_jsl download started this may take some time.
Approximate size to download 14.5 MB
[OK!]


['O',
 'B-Injury_or_Poisoning',
 'B-Direction',
 'B-Test',
 'I-Route',
 'B-Admission_Discharge',
 'B-Death_Entity',
 'I-Oxygen_Therapy',
 'I-Drug_BrandName',
 'B-Relationship_Status',
 'B-Duration',
 'I-Alcohol',
 'I-Triglycerides',
 'I-Date',
 'B-Respiration',
 'B-Hyperlipidemia',
 'I-Test',
 'B-Birth_Entity',
 'I-VS_Finding',
 'B-Age',
 'I-Social_History_Header',
 'B-Labour_Delivery',
 'I-Medical_Device',
 'B-Family_History_Header',
 'B-BMI',
 'I-Fetus_NewBorn',
 'I-BMI',
 'B-Temperature',
 'I-Section_Header',
 'I-Communicable_Disease',
 'I-ImagingFindings',
 'I-Psychological_Condition',
 'I-Obesity',
 'I-Sexually_Active_or_Sexual_Orientation',
 'I-Modifier',
 'B-Alcohol',
 'I-Temperature',
 'I-Vaccine',
 'I-Symptom',
 'B-Kidney_Disease',
 'I-Pulse',
 'B-Oncological',
 'I-EKG_Findings',
 'B-Medical_History_Header',
 'I-Relationship_Status',
 'I-Blood_Pressure',
 'B-Cerebrovascular_Disease',
 'I-Diabetes',
 'B-Oxygen_Therapy',
 'B-O2_Saturation',
 'B-Psychological_Condition',
 'B-Hear

### Now train a model using this model as base

In [None]:

nerTagger = MedicalNerApproach()\
      .setInputCols(["sentence", "token", "embeddings"])\
      .setLabelColumn("label")\
      .setOutputCol("ner")\
      .setMaxEpochs(2)\
      .setLr(0.003)\
      .setBatchSize(8)\
      .setRandomSeed(0)\
      .setVerbose(1)\
      .setEvaluationLogExtended(True) \
      .setEnableOutputLogs(True)\
      .setIncludeConfidence(True)\
      .setTestDataset('/content/test_2.parquet')\
      .setOutputLogsPath('ner_logs')\
      .setGraphFolder('medical_ner_graphs')\
      .setPretrainedModelPath("/root/cache_pretrained/ner_jsl_en_3.1.0_3.0_1624284761441")\
      .setOverrideExistingTags(True) # since the tags do not align, set this flag to true
    
# tune the hyperparameters above (max epochs, learning rate, dropout, etc.) to get better results
ner_pipeline = Pipeline(stages=[
      clinical_embeddings,
      nerTagger
 ])
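
Before reusing a pretrained model as a base, it can help to check whether its label set covers the labels in the new dataset. The sketch below does that with plain set arithmetic. `unseen_labels` is a hypothetical helper, only the first few classes from the truncated `getClasses()` output above are used, and the dataset labels are assumed to be `O`, `B-Disease` and `I-Disease` based on the training log, so the result is illustrative only.

```python
def unseen_labels(pretrained_classes, dataset_labels):
    """Return the dataset labels that are missing from the pretrained model."""
    return sorted(set(dataset_labels) - set(pretrained_classes))


# First few classes printed by jsl_ner.getClasses() above (the output is truncated).
pretrained_subset = ["O", "B-Injury_or_Poisoning", "B-Direction", "B-Test", "I-Route"]

# Labels of the NCBI disease data, as suggested by the training log (assumption).
dataset_labels = ["O", "B-Disease", "I-Disease"]

missing = unseen_labels(pretrained_subset, dataset_labels)
print("labels the base model does not know:", missing)
```

In the notebook you would pass the full `jsl_ner.getClasses()` list. A non-empty result means the tag sets do not align, which is the situation in which the cell above sets `setOverrideExistingTags(True)`.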

In [None]:
# remove the existing logs

! rm -r ./ner_logs

In [None]:

%%time
ner_jsl_retrained = ner_pipeline.fit(training_data)


CPU times: user 2.37 s, sys: 364 ms, total: 2.73 s
Wall time: 8min 21s


In [None]:
!cat ./ner_logs/MedicalNerApproach*

Name of the selected graph: pretrained graph
Training started - total epochs: 2 - lr: 0.003 - batch size: 8 - labels: 3 - chars: 106 - training examples: 6347


Epoch 1/2 started, lr: 0.003, dataset size: 6347


Epoch 1/2 - 214.96s - loss: 1914.243 - avg training loss: 2.407853 - batches: 795
Quality on test dataset: 
time to finish evaluation: 22.07s
Total test loss: 116.4437	Avg test loss: 1.9736
label	 tp	 fp	 fn	 prec	 rec	 f1
I-Disease	 477	 95	 59	 0.83391607	 0.88992536	 0.86101085
B-Disease	 394	 81	 70	 0.8294737	 0.8491379	 0.83919066
tp: 871 fp: 176 fn: 129 labels: 2
Macro-average	 prec: 0.83169484, rec: 0.86953163, f1: 0.8501924
Micro-average	 prec: 0.83190066, rec: 0.871, f1: 0.85100144


Epoch 2/2 started, lr: 0.0029850747, dataset size: 6347


Epoch 2/2 - 225.94s - loss: 924.982 - avg training loss: 1.1634994 - batches: 795
Quality on test dataset: 
time to finish evaluation: 21.30s
Total test loss: 91.7439	Avg test loss: 1.5550
label	 tp	 fp	 fn	 prec	 rec	 f1
I-Disease

In [None]:
from sparknlp_jsl.eval import NerDLMetrics
import pyspark.sql.functions as F

pred_df = ner_jsl_retrained.stages[1].transform(clinical_embeddings.transform(test_data_2))

evaler = NerDLMetrics(mode="full_chunk", dropO=True)

eval_result = evaler.computeMetricsFromDF(pred_df.select("label","ner"), prediction_col="ner", label_col="label").cache()

eval_result.withColumn("precision", F.round(eval_result["precision"],4))\
    .withColumn("recall", F.round(eval_result["recall"],4))\
    .withColumn("f1", F.round(eval_result["f1"],4)).show(100)

# show() prints the table itself and returns None, so it is not wrapped in print()
eval_result.selectExpr("avg(f1) as macro").show()

# support-weighted average of the per-entity F1 scores, reported here as "micro"
eval_result.selectExpr("sum(f1*total) as sumprod","sum(total) as sumtotal").selectExpr("sumprod/sumtotal as micro").show()

+-------+-----+----+----+-----+---------+------+------+
| entity|   tp|  fp|  fn|total|precision|recall|    f1|
+-------+-----+----+----+-----+---------+------+------+
|Disease|395.0|70.0|65.0|460.0|   0.8495|0.8587|0.8541|
+-------+-----+----+----+-----+---------+------+------+

+------------------+
|             macro|
+------------------+
|0.8540540540540541|
+------------------+

+------------------+
|             micro|
+------------------+
|0.8540540540540541|
+------------------+

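
To compare the three experiments side by side, the sketch below collects the chunk-level counts and rounded F1 scores reported in the evaluation tables of this notebook and prints how each run differs from the 2-epoch baseline. The run names are labels chosen here for readability, and the numbers come from single runs, so small differences should not be over-interpreted.

```python
# Chunk-level results on test_2, copied from the three evaluation tables.
runs = {
    "scratch, 2 epochs": {"tp": 358, "fp": 61, "fn": 102, "f1": 0.8146},
    "resumed on test_1, +2 epochs": {"tp": 415, "fp": 77, "fn": 45, "f1": 0.8718},
    "ner_jsl as base, 2 epochs": {"tp": 395, "fp": 70, "fn": 65, "f1": 0.8541},
}

baseline = runs["scratch, 2 epochs"]

for name, result in runs.items():
    # Differences relative to the model trained from scratch.
    d_f1 = result["f1"] - baseline["f1"]
    d_fp = result["fp"] - baseline["fp"]
    d_fn = result["fn"] - baseline["fn"]
    print(f"{name:<30} f1={result['f1']:.4f} "
          f"d_f1={d_f1:+.4f} d_fp={d_fp:+d} d_fn={d_fn:+d}")

best = max(runs, key=lambda name: runs[name]["f1"])
print("best run by F1:", best)
```

In these runs, both continued-training variants mainly reduce the false negatives, at the cost of a few more false positives.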
