![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.5.Resume_MedicalNer_Model_Training.ipynb)

# 1.5 Resume MedicalNer Model Training

Steps:
- Train a new model for a few epochs.
- Load the same model, train it for more epochs on the same taxonomy, and check the stats.
- Resume training from a model that was already trained on a different dataset.

## Colab Setup

In [None]:
import os

jsl_secret = os.getenv('SECRET')

import sparknlp
sparknlp_version = sparknlp.version()
import sparknlp_jsl
jsl_version = sparknlp_jsl.version()

print (jsl_secret)

In [3]:
# Optional: start the session manually with custom params
# instead of calling sparknlp_jsl.start()
from pyspark.sql import SparkSession

def start(secret):
    # build the jar URL from the `secret` argument rather than a global
    builder = SparkSession.builder \
        .appName("Spark NLP Licensed") \
        .master("local[*]") \
        .config("spark.driver.memory", "16G") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
        .config("spark.kryoserializer.buffer.max", "2000M") \
        .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:"+sparknlp_version) \
        .config("spark.jars", "https://pypi.johnsnowlabs.com/"+secret+"/spark-nlp-jsl-"+jsl_version+".jar")

    return builder.getOrCreate()

#spark = start(jsl_secret)

In [4]:
import json
import os
from pyspark.ml import Pipeline,PipelineModel
from pyspark.sql import SparkSession

from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from sparknlp.base import *
import sparknlp_jsl
import sparknlp

params = {"spark.driver.memory":"16G",
"spark.kryoserializer.buffer.max":"2000M",
"spark.driver.maxResultSize":"2000M"}

spark = sparknlp_jsl.start(jsl_secret, params=params)

print ("Spark NLP Version :", sparknlp.version())
print ("Spark NLP_JSL Version :", sparknlp_jsl.version())

Spark NLP Version : 3.1.2
Spark NLP_JSL Version : 3.1.2


## Download Clinical Word Embeddings for training

In [5]:
clinical_embeddings = WordEmbeddingsModel.pretrained('embeddings_clinical', "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]


## Download Data for Training (NCBI Disease Dataset)

In [6]:
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/NCBI_disease_official_test.conll
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/NCBI_disease_official_train_dev.conll

In [7]:
from sparknlp.training import CoNLL

training_data = CoNLL().readDataset(spark, 'NCBI_disease_official_train_dev.conll')

training_data.show(3)

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|                 pos|               label|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|Identification of...|[[document, 0, 89...|[[document, 0, 89...|[[token, 0, 13, I...|[[pos, 0, 13, NN,...|[[named_entity, 0...|
|The adenomatous p...|[[document, 0, 21...|[[document, 0, 21...|[[token, 0, 2, Th...|[[pos, 0, 2, NN, ...|[[named_entity, 0...|
|Complex formation...|[[document, 0, 63...|[[document, 0, 63...|[[token, 0, 6, Co...|[[pos, 0, 6, NN, ...|[[named_entity, 0...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
only showing top 3 rows
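
For readers unfamiliar with the input format, the sketch below parses a tiny CoNLL-style sample in plain Python. It assumes the common CoNLL-2003 layout (one token per line with four whitespace-separated columns, blank lines between sentences, `-DOCSTART-` lines between documents). The sample text and the `parse_conll` helper are hypothetical and are not part of Spark NLP; `CoNLL().readDataset` does the real work in this notebook.

```python
from typing import List, Tuple

# Hypothetical sample in CoNLL-2003 layout: token, POS tag, chunk tag, NER label.
SAMPLE = """-DOCSTART- -X- -X- O

Identification NN NN O
of NN NN O
APC2 NN NN O

colon NN NN B-Disease
cancer NN NN I-Disease
"""

def parse_conll(text: str) -> List[List[Tuple[str, str]]]:
    """Return one list of (token, label) pairs per sentence."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        # Blank lines and document markers close the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split()
        # First column is the token, last column is the NER label.
        current.append((parts[0], parts[-1]))
    if current:
        sentences.append(current)
    return sentences

sentences = parse_conll(SAMPLE)
for sentence in sentences:
    print(sentence)
```

Each parsed sentence corresponds to one row of the DataFrame shown above, with the labels ending up in the `label` column.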



In [8]:
from sparknlp.training import CoNLL

test_data = CoNLL().readDataset(spark, 'NCBI_disease_official_test.conll')

test_data.show(3)

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|                 pos|               label|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|Clustering of mis...|[[document, 0, 10...|[[document, 0, 10...|[[token, 0, 9, Cl...|[[pos, 0, 9, NN, ...|[[named_entity, 0...|
|Ataxia - telangie...|[[document, 0, 13...|[[document, 0, 13...|[[token, 0, 5, At...|[[pos, 0, 5, NN, ...|[[named_entity, 0...|
|The risk of cance...|[[document, 0, 15...|[[document, 0, 15...|[[token, 0, 2, Th...|[[pos, 0, 2, NN, ...|[[named_entity, 0...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
only showing top 3 rows



## Split the test data into two parts:
- We keep the first part separate and use it to train the model further, since it is completely unseen data from the same taxonomy.

- The second part will be used for testing and evaluation.

In [9]:
(test_data_1, test_data_2) = test_data.randomSplit([0.5, 0.5], seed = 100)

# save the test data as parquet for easy testing
clinical_embeddings.transform(test_data_1).write.parquet('test_1.parquet')

clinical_embeddings.transform(test_data_2).write.parquet('test_2.parquet')

## Train a new model, pause, and resume training on the same dataset.

### Create a graph

In [10]:
from sparknlp_jsl.training import tf_graph
%tensorflow_version 1.x
tf_graph.print_model_params("ner_dl")

tf_graph.build("ner_dl", build_params={"embeddings_dim": 200, "nchars": 128, "ntags": 12, "is_medical": 1}, model_location="./medical_ner_graphs", model_filename="auto")


TensorFlow 1.x selected.
ner_dl parameters.
Parameter            Required   Default value        Description
ntags                yes        -                    Number of tags.
embeddings_dim       no         200                  Embeddings dimension.
nchars               no         100                  Number of chars.
lstm_size            no         128                  Number of LSTM units.
gpu_device           no         0                    Device for training.
is_medical           no         0                    Build a Medical Ner graph.
Instructions for updating:
non-resource variables are not supported in the long term

Instructions for updating:
If using Keras pass *_constraint arguments to layers.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor

### Train for 2 epochs

In [12]:
nerTagger = MedicalNerApproach()\
      .setInputCols(["sentence", "token", "embeddings"])\
      .setLabelColumn("label")\
      .setOutputCol("ner")\
      .setMaxEpochs(2)\
      .setLr(0.003)\
      .setBatchSize(8)\
      .setRandomSeed(0)\
      .setVerbose(1)\
      .setEvaluationLogExtended(True) \
      .setEnableOutputLogs(True)\
      .setIncludeConfidence(True)\
      .setTestDataset('./test_2.parquet')\
      .setGraphFolder('./medical_ner_graphs')\
      .setOutputLogsPath('./ner_logs')

ner_pipeline = Pipeline(stages=[
      clinical_embeddings,
      nerTagger
 ])

In [13]:

%%time
ner_model = ner_pipeline.fit(training_data)


CPU times: user 3.28 s, sys: 297 ms, total: 3.57 s
Wall time: 11min 40s


In [None]:
# Training Logs
! cat ner_logs/MedicalNerApproach_a6231a3fe051.log

Name of the selected graph: /content/./medical_ner_graphs/blstm_12_200_128_128.pb
Training started - total epochs: 2 - lr: 0.003 - batch size: 8 - labels: 3 - chars: 84 - training examples: 6347


Epoch 1/2 started, lr: 0.003, dataset size: 6347


Epoch 1/2 - 237.31s - loss: 2064.9941 - batches: 795
Quality on test dataset: 
time to finish evaluation: 7.34s
label	 tp	 fp	 fn	 prec	 rec	 f1
I-Disease	 494	 121	 42	 0.80325204	 0.92164177	 0.858384
B-Disease	 388	 90	 76	 0.8117155	 0.8362069	 0.82377917
tp: 882 fp: 211 fn: 118 labels: 2
Macro-average	 prec: 0.8074838, rec: 0.87892437, f1: 0.84169084
Micro-average	 prec: 0.8069533, rec: 0.882, f1: 0.8428094


Epoch 2/2 started, lr: 0.0029850747, dataset size: 6347


Epoch 2/2 - 235.26s - loss: 806.94214 - batches: 795
Quality on test dataset: 
time to finish evaluation: 6.18s
label	 tp	 fp	 fn	 prec	 rec	 f1
I-Disease	 476	 81	 60	 0.8545781	 0.8880597	 0.8709972
B-Disease	 382	 58	 82	 0.8681818	 0.82327586	 0.8451327
tp: 858 fp: 139 fn
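
The macro and micro averages in the log can be reproduced from the per-label counts. The sketch below uses the Epoch 1/2 counts from the log above. It assumes that the macro F1 is computed from the macro-averaged precision and recall (rather than by averaging per-label F1 scores); that assumption is inferred from the logged numbers, not from the library source.

```python
# Token-level counts copied from the Epoch 1/2 block of the log: (tp, fp, fn)
counts = {
    "I-Disease": (494, 121, 42),
    "B-Disease": (388, 90, 76),
}

def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Macro average: average the per-label precision and recall first.
precisions = [tp / (tp + fp) for tp, fp, fn in counts.values()]
recalls = [tp / (tp + fn) for tp, fp, fn in counts.values()]
macro_p = sum(precisions) / len(precisions)
macro_r = sum(recalls) / len(recalls)
macro_f1 = f1_score(macro_p, macro_r)  # assumption: F1 of the averaged P and R

# Micro average: pool the counts over all labels first.
tp_sum = sum(tp for tp, fp, fn in counts.values())
fp_sum = sum(fp for tp, fp, fn in counts.values())
fn_sum = sum(fn for tp, fp, fn in counts.values())
micro_p = tp_sum / (tp_sum + fp_sum)
micro_r = tp_sum / (tp_sum + fn_sum)
micro_f1 = f1_score(micro_p, micro_r)

print(f"Macro-average prec: {macro_p:.7f}, rec: {macro_r:.7f}, f1: {macro_f1:.7f}")
print(f"Micro-average prec: {micro_p:.7f}, rec: {micro_r:.7f}, f1: {micro_f1:.7f}")
```

Note that these are token-level scores over the `B-` and `I-` tags, so they are not directly comparable to the chunk-level scores from `NerDLMetrics` in the Evaluate section.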

In [15]:
# Logs of 4 consecutive epochs to compare with 2+2 epochs on separate datasets from same taxonomy

#!cat ner_logs/MedicalNerApproach_4d3d69967c3f.log

Name of the selected graph: /content/./medical_ner_graphs/blstm_12_200_128_128.pb
Training started - total epochs: 4 - lr: 0.003 - batch size: 8 - labels: 3 - chars: 84 - training examples: 6347


Epoch 1/4 started, lr: 0.003, dataset size: 6347


Epoch 1/4 - 161.11s - loss: 1997.8568 - batches: 795
Quality on test dataset: 
time to finish evaluation: 4.95s
label	 tp	 fp	 fn	 prec	 rec	 f1
I-Disease	 510	 194	 26	 0.7244318	 0.95149255	 0.8225807
B-Disease	 390	 110	 74	 0.78	 0.8405172	 0.8091286
tp: 900 fp: 304 fn: 100 labels: 2
Macro-average	 prec: 0.75221586, rec: 0.8960049, f1: 0.8178384
Micro-average	 prec: 0.7475083, rec: 0.9, f1: 0.8166969


Epoch 2/4 started, lr: 0.0029850747, dataset size: 6347


Epoch 2/4 - 161.94s - loss: 777.53375 - batches: 795
Quality on test dataset: 
time to finish evaluation: 4.13s
label	 tp	 fp	 fn	 prec	 rec	 f1
I-Disease	 462	 60	 74	 0.88505745	 0.8619403	 0.87334585
B-Disease	 362	 47	 102	 0.8850856	 0.7801724	 0.8293242
tp: 824 fp: 107 fn: 176 
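
In both logs the learning rate drops from 0.003 in epoch 1 to 0.0029850747 in epoch 2. These values are consistent with an inverse-time decay of the form `lr / (1 + decay * epoch)` with a decay coefficient of 0.005. Both the formula and the coefficient are assumptions inferred from the logged values, and `decayed_lr` is a hypothetical helper, not a library function.

```python
def decayed_lr(initial_lr: float, decay: float, epoch_index: int) -> float:
    """Inverse-time decay; epoch_index is 0 for the first epoch."""
    return initial_lr / (1.0 + decay * epoch_index)

# Assumed values: lr from setLr(0.003), decay coefficient 0.005 inferred from the logs.
for epoch_index in range(4):
    lr = decayed_lr(0.003, 0.005, epoch_index)
    print(f"Epoch {epoch_index + 1}: lr = {lr:.10f}")
```

Under this assumption, a resumed run that starts again at `lr = 0.003` does not follow the same schedule as four consecutive epochs, which is worth keeping in mind when comparing the 2+2 and 4-epoch logs.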

### Evaluate

In [None]:
from sparknlp_jsl.eval import NerDLMetrics
import pyspark.sql.functions as F

pred_df = ner_model.stages[1].transform(clinical_embeddings.transform(test_data_2))

evaler = NerDLMetrics(mode="full_chunk", dropO=True)

eval_result = evaler.computeMetricsFromDF(pred_df.select("label","ner"), prediction_col="ner", label_col="label").cache()

eval_result.withColumn("precision", F.round(eval_result["precision"],4))\
    .withColumn("recall", F.round(eval_result["recall"],4))\
    .withColumn("f1", F.round(eval_result["f1"],4)).show(100)

print(eval_result.selectExpr("avg(f1) as macro").show())
print (eval_result.selectExpr("sum(f1*total) as sumprod","sum(total) as sumtotal").selectExpr("sumprod/sumtotal as micro").show())

+-------+-----+----+----+-----+---------+------+------+
| entity|   tp|  fp|  fn|total|precision|recall|    f1|
+-------+-----+----+----+-----+---------+------+------+
|Disease|377.0|63.0|83.0|460.0|   0.8568|0.8196|0.8378|
+-------+-----+----+----+-----+---------+------+------+

+------------------+
|             macro|
+------------------+
|0.8377777777777777|
+------------------+

None
+------------------+
|             micro|
+------------------+
|0.8377777777777777|
+------------------+

None
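
As a sanity check, the `Disease` row above can be recomputed by hand from its `tp`, `fp`, and `fn` counts. The sketch below is plain Python and does not use `NerDLMetrics`; the helper name `precision_recall_f1` is hypothetical.

```python
def precision_recall_f1(tp: float, fp: float, fn: float):
    """Compute precision, recall and F1 from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Chunk-level counts copied from the Disease row of the table above.
tp, fp, fn = 377.0, 63.0, 83.0
total = tp + fn  # number of gold chunks, matches the `total` column
precision, recall, f1 = precision_recall_f1(tp, fp, fn)

print(f"total={total:.0f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

Because the dataset has a single entity type, `avg(f1)` and the `total`-weighted average `sum(f1*total)/sum(total)` reduce to the same number, which is why the `macro` and `micro` tables show identical values.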


### Save the model to disk

In [None]:
ner_model.stages[1].write().overwrite().save('models/NCBI_NER_2_epoch')

### Resume training from the saved model on an unseen dataset
#### We use unseen data from the same taxonomy

In [None]:

nerTagger = MedicalNerApproach()\
      .setInputCols(["sentence", "token", "embeddings"])\
      .setLabelColumn("label")\
      .setOutputCol("ner")\
      .setMaxEpochs(2)\
      .setLr(0.003)\
      .setBatchSize(8)\
      .setRandomSeed(0)\
      .setVerbose(1)\
      .setEvaluationLogExtended(True) \
      .setEnableOutputLogs(True)\
      .setIncludeConfidence(True)\
      .setTestDataset('/content/test_2.parquet')\
      .setOutputLogsPath('ner_logs')\
      .setGraphFolder('medical_ner_graphs')\
      .setPretrainedModelPath("models/NCBI_NER_2_epoch") ## load the existing model
    
ner_pipeline = Pipeline(stages=[
      clinical_embeddings,
      nerTagger
 ])

In [None]:

%%time
ner_model_retrained = ner_pipeline.fit(test_data_1)


CPU times: user 373 ms, sys: 45.1 ms, total: 418 ms
Wall time: 1min 2s


In [None]:
!cat ./ner_logs/MedicalNerApproach_f7726480b5ef.log

Name of the selected graph: pretrained graph
Training started - total epochs: 2 - lr: 0.003 - batch size: 8 - labels: 3 - chars: 79 - training examples: 477


Epoch 1/2 started, lr: 0.003, dataset size: 477


Epoch 1/2 - 20.49s - loss: 89.05175 - batches: 61
Quality on test dataset: 
time to finish evaluation: 7.03s
label	 tp	 fp	 fn	 prec	 rec	 f1
I-Disease	 449	 51	 87	 0.898	 0.83768654	 0.86679536
B-Disease	 390	 51	 74	 0.88435376	 0.8405172	 0.86187845
tp: 839 fp: 102 fn: 161 labels: 2
Macro-average	 prec: 0.8911769, rec: 0.8391019, f1: 0.8643558
Micro-average	 prec: 0.89160466, rec: 0.839, f1: 0.8645028


Epoch 2/2 started, lr: 0.0029850747, dataset size: 477


Epoch 2/2 - 19.23s - loss: 50.519394 - batches: 61
Quality on test dataset: 
time to finish evaluation: 6.61s
label	 tp	 fp	 fn	 prec	 rec	 f1
I-Disease	 496	 81	 40	 0.8596187	 0.92537314	 0.8912848
B-Disease	 421	 70	 43	 0.8574338	 0.9073276	 0.88167536
tp: 917 fp: 151 fn: 83 labels: 2
Macro-average	 prec: 0.85852623, 

In [None]:
from sparknlp_jsl.eval import NerDLMetrics
import pyspark.sql.functions as F

pred_df = ner_model_retrained.stages[1].transform(clinical_embeddings.transform(test_data_2))

evaler = NerDLMetrics(mode="full_chunk", dropO=True)

eval_result = evaler.computeMetricsFromDF(pred_df.select("label","ner"), prediction_col="ner", label_col="label").cache()

eval_result.withColumn("precision", F.round(eval_result["precision"],4))\
    .withColumn("recall", F.round(eval_result["recall"],4))\
    .withColumn("f1", F.round(eval_result["f1"],4)).show(100)

print(eval_result.selectExpr("avg(f1) as macro").show())
print (eval_result.selectExpr("sum(f1*total) as sumprod","sum(total) as sumtotal").selectExpr("sumprod/sumtotal as micro").show())

+-------+-----+----+----+-----+---------+------+------+
| entity|   tp|  fp|  fn|total|precision|recall|    f1|
+-------+-----+----+----+-----+---------+------+------+
|Disease|410.0|81.0|50.0|460.0|    0.835|0.8913|0.8623|
+-------+-----+----+----+-----+---------+------+------+

+------------------+
|             macro|
+------------------+
|0.8622502628811777|
+------------------+

None
+------------------+
|             micro|
+------------------+
|0.8622502628811777|
+------------------+

None


## Now let's take a model trained on a different dataset and train it further on this dataset

In [None]:
jsl_ner = MedicalNerModel.pretrained('ner_jsl','en','clinical/models')

jsl_ner.getClasses()

ner_jsl download started this may take some time.
Approximate size to download 14.5 MB
[OK!]


['O',
 'B-Injury_or_Poisoning',
 'B-Direction',
 'B-Test',
 'I-Route',
 'B-Admission_Discharge',
 'B-Death_Entity',
 'I-Oxygen_Therapy',
 'I-Drug_BrandName',
 'B-Relationship_Status',
 'B-Duration',
 'I-Alcohol',
 'I-Triglycerides',
 'I-Date',
 'B-Respiration',
 'B-Hyperlipidemia',
 'I-Test',
 'B-Birth_Entity',
 'I-VS_Finding',
 'B-Age',
 'I-Social_History_Header',
 'B-Labour_Delivery',
 'I-Medical_Device',
 'B-Family_History_Header',
 'B-BMI',
 'I-Fetus_NewBorn',
 'I-BMI',
 'B-Temperature',
 'I-Section_Header',
 'I-Communicable_Disease',
 'I-ImagingFindings',
 'I-Psychological_Condition',
 'I-Obesity',
 'I-Sexually_Active_or_Sexual_Orientation',
 'I-Modifier',
 'B-Alcohol',
 'I-Temperature',
 'I-Vaccine',
 'I-Symptom',
 'B-Kidney_Disease',
 'I-Pulse',
 'B-Oncological',
 'I-EKG_Findings',
 'B-Medical_History_Header',
 'I-Relationship_Status',
 'I-Blood_Pressure',
 'B-Cerebrovascular_Disease',
 'I-Diabetes',
 'B-Oxygen_Therapy',
 'B-O2_Saturation',
 'B-Psychological_Condition',
 'B-Hear

### Now train a model using this model as base

In [None]:

nerTagger = MedicalNerApproach()\
      .setInputCols(["sentence", "token", "embeddings"])\
      .setLabelColumn("label")\
      .setOutputCol("ner")\
      .setMaxEpochs(2)\
      .setLr(0.003)\
      .setBatchSize(8)\
      .setRandomSeed(0)\
      .setVerbose(1)\
      .setEvaluationLogExtended(True) \
      .setEnableOutputLogs(True)\
      .setIncludeConfidence(True)\
      .setTestDataset('/content/test_2.parquet')\
      .setOutputLogsPath('ner_logs')\
      .setGraphFolder('medical_ner_graphs')\
      .setPretrainedModelPath("/root/cache_pretrained/ner_jsl_en_3.1.0_2.4_1624566960534")\
      .setOverrideExistingTags(True) # since the tags do not align, set this flag to true
    
# tune the hyperparameters above (max epochs, LR, dropout, etc.) to get better results
ner_pipeline = Pipeline(stages=[
      clinical_embeddings,
      nerTagger
 ])

In [None]:

%%time
ner_jsl_retrained = ner_pipeline.fit(training_data)


CPU times: user 3.66 s, sys: 360 ms, total: 4.02 s
Wall time: 11min 35s


In [None]:
!cat ./ner_logs/MedicalNerApproach_880049ba07d7.log

Name of the selected graph: pretrained graph
Training started - total epochs: 2 - lr: 0.003 - batch size: 8 - labels: 3 - chars: 84 - training examples: 6347


Epoch 1/2 started, lr: 0.003, dataset size: 6347


Epoch 1/2 - 296.37s - loss: 2145.1052 - batches: 795
Quality on test dataset: 
time to finish evaluation: 16.62s
label	 tp	 fp	 fn	 prec	 rec	 f1
I-Disease	 431	 55	 105	 0.8868313	 0.8041045	 0.8434442
B-Disease	 394	 87	 70	 0.81912684	 0.8491379	 0.8338624
tp: 825 fp: 142 fn: 175 labels: 2
Macro-average	 prec: 0.85297906, rec: 0.8266212, f1: 0.83959335
Micro-average	 prec: 0.85315406, rec: 0.825, f1: 0.8388409


Epoch 2/2 started, lr: 0.0029850747, dataset size: 6347


Epoch 2/2 - 319.75s - loss: 985.34314 - batches: 795
Quality on test dataset: 
time to finish evaluation: 15.92s
label	 tp	 fp	 fn	 prec	 rec	 f1
I-Disease	 442	 69	 94	 0.8649706	 0.82462686	 0.8443171
B-Disease	 396	 70	 68	 0.8497854	 0.8534483	 0.85161287
tp: 838 fp: 139 fn: 162 labels: 2
Macro-average	 pre

In [None]:
from sparknlp_jsl.eval import NerDLMetrics
import pyspark.sql.functions as F

pred_df = ner_jsl_retrained.stages[1].transform(clinical_embeddings.transform(test_data_2))

evaler = NerDLMetrics(mode="full_chunk", dropO=True)

eval_result = evaler.computeMetricsFromDF(pred_df.select("label","ner"), prediction_col="ner", label_col="label").cache()

eval_result.withColumn("precision", F.round(eval_result["precision"],4))\
    .withColumn("recall", F.round(eval_result["recall"],4))\
    .withColumn("f1", F.round(eval_result["f1"],4)).show(100)

print(eval_result.selectExpr("avg(f1) as macro").show())
print (eval_result.selectExpr("sum(f1*total) as sumprod","sum(total) as sumtotal").selectExpr("sumprod/sumtotal as micro").show())

+-------+-----+----+----+-----+---------+------+-----+
| entity|   tp|  fp|  fn|total|precision|recall|   f1|
+-------+-----+----+----+-----+---------+------+-----+
|Disease|388.0|78.0|72.0|460.0|   0.8326|0.8435|0.838|
+-------+-----+----+----+-----+---------+------+-----+

+-----------------+
|            macro|
+-----------------+
|0.838012958963283|
+-----------------+

None
+------------------+
|             micro|
+------------------+
|0.8380129589632831|
+------------------+

None
