![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# **MedicalNerApproach**

In this notebook, we will examine how to use `MedicalNerApproach` to train a `MedicalNerModel`.


**📖 Learning Objectives:**

1. Understand the meaning and usage of Named Entity Recognition.

2. Learn how to preprocess your data before training a model.

3. Understand how to train a Named Entity Recognition model using `MedicalNerApproach`.




**🔗 Helpful Links:**

- For extended examples of usage, see the [Spark NLP Workshop](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/01.0.Clinical_Named_Entity_Recognition_Model.ipynb)

- Python Documentation: [MedicalNerApproach](https://nlp.johnsnowlabs.com/licensed/api/python/reference/autosummary/sparknlp_jsl/annotator/ner/medical_ner/index.html#sparknlp_jsl.annotator.ner.medical_ner.MedicalNerApproach.po)

- Scala Documentation: [MedicalNerApproach](https://nlp.johnsnowlabs.com/licensed/api/com/johnsnowlabs/nlp/annotators/ner/MedicalNerApproach.html) <br/>


**Blogposts and videos:**


- [Named Entity Recognition (NER) with BERT in Spark NLP](https://towardsdatascience.com/named-entity-recognition-ner-with-bert-in-spark-nlp-874df20d1d77)

- [State of the art Clinical Named Entity Recognition in Spark NLP - Youtube](https://www.youtube.com/watch?v=YM-e4eOiQ34)

- [Named Entity Recognition for Healthcare with SparkNLP NerDL and NerCRF](https://medium.com/spark-nlp/named-entity-recognition-for-healthcare-with-sparknlp-nerdl-and-nercrf-a7751b6ad571)

- [Named Entity Recognition for Clinical Text](https://medium.com/atlas-research/ner-for-clinical-text-7c73caddd180)

## **📜 Background**


`MedicalNerApproach` is used for training generic NER models based on Neural Networks.

The architecture of the neural network is Char CNNs - BiLSTM - CRF, which achieves state-of-the-art results on most datasets.

In the original framework, the CNN extracts a fixed-length feature vector from character-level features. For each word, these vectors are concatenated and fed to the BiLSTM network and then to the output layers. The authors employed a stacked bi-directional recurrent neural network with long short-term memory units to transform word features into named entity tag scores. The extracted features of each word are fed into a forward LSTM network and a backward LSTM network. The output of each network at each time step is decoded by a linear layer and a log-softmax layer into log-probabilities for each tag category. These two vectors are then simply added together to produce the final output. In the architecture proposed in the original paper, 50-dimensional pretrained word embeddings are used for word features, 25-dimensional character embeddings are used for character features, and capitalization features (allCaps, upperInitial, lowercase, mixedCaps, noinfo) are used for case features. <br/>
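
As a rough illustration of the case features listed above, the sketch below assigns one of the five capitalization categories to a token. The function name `case_feature` and the exact decision rules are assumptions made for this example; the original paper and the Spark NLP implementation may define the categories differently.

```python
def case_feature(token: str) -> str:
    """Return one of the five capitalization categories for a token.

    The rules below are illustrative assumptions, not the library's logic.
    """
    letters = [ch for ch in token if ch.isalpha()]
    if not letters:
        return "noinfo"        # digits, punctuation, empty strings
    if all(ch.isupper() for ch in letters):
        return "allCaps"
    if all(ch.islower() for ch in letters):
        return "lowercase"
    if token[0].isupper() and all(ch.islower() for ch in letters[1:]):
        return "upperInitial"
    return "mixedCaps"


tokens = ["BRCA1", "Hendrix", "cancer", "mRNA", "42"]
features = [case_feature(t) for t in tokens]
print(list(zip(tokens, features)))
```

In the architecture described above, a feature like this would be embedded and combined with the word and character features of each token before the BiLSTM layers.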








## **🎬 Colab Setup**

### Important Note

To run this notebook, you need a runtime with high RAM.

In [None]:
!pip install -q johnsnowlabs


In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

Please Upload your John Snow Labs License using the button below


Saving 5.3.3.spark_nlp_for_healthcare.json to 5.3.3.spark_nlp_for_healthcare.json


In [None]:
from johnsnowlabs import nlp, medical, visual

# After uploading your license, run this to install all licensed Python wheels and pre-download the JARs for the Spark session JVM
nlp.install()

👌 Detected license file /content/5.3.3.spark_nlp_for_healthcare.json
🚨 Outdated Medical Secrets in license file. Version=5.3.3 but should be Version=5.3.2
🚨 Outdated OCR Secrets in license file. Version=5.1.2 but should be Version=5.3.2
👷 Trying to install compatible secrets. Use nlp.settings.enforce_versions=False if you want to install outdated secrets.
📋 Stored John Snow Labs License in /root/.johnsnowlabs/licenses/license_number_0_for_Spark-Healthcare_Spark-OCR.json
👷 Setting up  John Snow Labs home in /root/.johnsnowlabs, this might take a few minutes.
Downloading 🐍+🚀 Python Library spark_nlp-5.3.2-py2.py3-none-any.whl
Downloading 🐍+💊 Python Library spark_nlp_jsl-5.3.2-py3-none-any.whl
Downloading 🫘+🚀 Java Library spark-nlp-assembly-5.3.2.jar
Downloading 🫘+💊 Java Library spark-nlp-jsl-5.3.2.jar
🙆 JSL Home setup in /root/.johnsnowlabs
👌 Detected license file /content/5.3.3.spark_nlp_for_healthcare.json
👷 Trying to install compatible secrets. Use nlp.settings.enforce_versions=False 

In [None]:
import pyspark.sql.functions as F

spark = nlp.start()

spark

Spark Session already created, some configs may not take.
👌 Detected license file /content/5.3.3.spark_nlp_for_healthcare.json
👷 Trying to install compatible secrets. Use nlp.settings.enforce_versions=False if you want to install outdated secrets.


## **🖨️ Input/Output Annotation Types**

- Input: `DOCUMENT`, `TOKEN`, `WORD_EMBEDDINGS`

- Output: `NAMED_ENTITY`

## **📂 Training Data**

The training data should be a labeled Spark Dataset, in the format of CoNLL 2003 IOB with Annotation type columns. The data should have columns of type `DOCUMENT`, `TOKEN`, `WORD_EMBEDDINGS` and an additional label column of annotator type `NAMED_ENTITY`.

Excluding the label, these columns can be produced with, for example:

- SentenceDetector
- Tokenizer
- WordEmbeddingsModel (any embeddings can be chosen, e.g. BertEmbeddings for BERT based embeddings). <br/>
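
To make the IOB labelling scheme concrete, here is a minimal, library-independent sketch that groups token-level IOB tags into entity chunks. The helper `iob_to_chunks` is hypothetical and is not part of Spark NLP; in a real pipeline this step is typically handled by a converter annotator such as `NerConverterInternal`, which is used later in this notebook.

```python
def iob_to_chunks(tokens, tags):
    """Group IOB-tagged tokens into (chunk_text, entity_type) pairs."""
    chunks = []
    current_tokens, current_type = [], None

    def flush():
        # Close the chunk that is currently open, if any.
        if current_tokens:
            chunks.append((" ".join(current_tokens), current_type))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open chunk
            # (such a stray I- tag is simply ignored in this sketch).
            flush()
            current_tokens, current_type = [], None
    flush()
    return chunks


tokens = ["She", "has", "metastatic", "breast", "cancer", "and", "anemia"]
tags = ["O", "O", "B-Disease", "I-Disease", "I-Disease", "O", "B-Disease"]
chunks = iob_to_chunks(tokens, tags)
print(chunks)
```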



`CoNLL(includeDocId=True).readDataset(spark, "conll_file.txt")`: if your CoNLL file contains doc_id information, this call adds it to the resulting dataframe as a `doc_id` column.

```
conll="""-DOCSTART- -X- -1- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O

-DOCSTART- -X- 2 O

Rare NNP B-NP O
Hendrix NNP I-NP B-PER

-DOCSTART- -X- -3-1- O

China NNP B-NP B-LOC
says VBZ B-VP O

-DOCSTART-

China NNP B-NP B-LOC
says VBZ B-VP O
"""
```
```
with open('conll_file.txt', 'w') as f:
    f.write(conll)

data = nlp.CoNLL(includeDocId=True).readDataset(spark, "conll_file.txt")

data.show()
```



```
+------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|doc_id|                text|            document|            sentence|               token|                 pos|               label|
+------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|     1|EU rejects German...|[{document, 0, 28...|[{document, 0, 47...|[{token, 0, 1, EU...|[{pos, 0, 1, NNP,...|[{named_entity, 0...|
|     2|Rare Hendrix song...|[{document, 0, 97...|[{document, 0, 50...|[{token, 0, 3, Ra...|[{pos, 0, 3, NNP,...|[{named_entity, 0...|
|   3-1|China says Taiwan...|[{document, 0, 13...|[{document, 0, 46...|[{token, 0, 4, Ch...|[{pos, 0, 4, NNP,...|[{named_entity, 0...|
|     X|China says Taiwan...|[{document, 0, 13...|[{document, 0, 46...|[{token, 0, 4, Ch...|[{pos, 0, 4, NNP,...|[{named_entity, 0...|
+------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
```



We will download the NCBI Disease dataset for training.

In [None]:
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/NER_NCBIconlltrain.txt
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/NER_NCBIconlltest.txt

In [None]:
conll_data = nlp.CoNLL().readDataset(spark, 'NER_NCBIconlltrain.txt')

conll_data.show(3)

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|                 pos|               label|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|Identification of...|[{document, 0, 89...|[{document, 0, 89...|[{token, 0, 13, I...|[{pos, 0, 13, NN,...|[{named_entity, 0...|
|The adenomatous p...|[{document, 0, 21...|[{document, 0, 21...|[{token, 0, 2, Th...|[{pos, 0, 2, NN, ...|[{named_entity, 0...|
|Complex formation...|[{document, 0, 63...|[{document, 0, 63...|[{token, 0, 6, Co...|[{pos, 0, 6, NN, ...|[{named_entity, 0...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
only showing top 3 rows



In [None]:
conll_data.count()

3266

In [None]:
conll_data.select(F.explode(F.arrays_zip(conll_data.token.result,
                                         conll_data.label.result)).alias("cols")) \
          .select(F.expr("cols['0']").alias("token"),
                  F.expr("cols['1']").alias("ground_truth"))\
          .groupBy('ground_truth')\
          .count()\
          .orderBy('count', ascending=False)\
          .show(100,truncate=False)

+------------+-----+
|ground_truth|count|
+------------+-----+
|O           |75093|
|I-Disease   |3547 |
|B-Disease   |3093 |
+------------+-----+
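
As a quick sanity check on the class balance, the counts from the table above can be turned into shares. This is plain Python arithmetic on the numbers already shown; nothing here touches Spark.

```python
# Label counts copied from the output table above.
label_counts = {"O": 75093, "I-Disease": 3547, "B-Disease": 3093}

total = sum(label_counts.values())
shares = {label: count / total for label, count in label_counts.items()}

for label, share in shares.items():
    print(f"{label:<10} {share:.2%}")
```

Roughly nine out of ten tokens carry the `O` label, which is the imbalance addressed in the next step.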



In [None]:
conll_data.select("label.result").distinct().count()

1537

As you can see, the dataset is dominated by `O` labels.
To make it more balanced, we can drop the sentences that have only `O` labels.
(`c>1` means we keep only the sentences with more than one distinct label, that is, we drop all the sentences that have no valuable labels other than `O`.)

```
conll_data = conll_data.withColumn('unique', F.array_distinct("label.result"))\
                       .withColumn('c', F.size('unique'))\
                       .filter(F.col('c')>1)

conll_data.select(F.explode(F.arrays_zip(conll_data.token.result,conll_data.label.result)).alias("cols")) \
          .select(F.expr("cols['0']").alias("token"),
                  F.expr("cols['1']").alias("ground_truth"))\
          .groupBy('ground_truth')\
          .count()\
          .orderBy('count', ascending=False)\
          .show(100,truncate=False)
```
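
The filtering rule above can be sketched in plain Python to show exactly which sentences survive. The sample sentences are made up for illustration. Note one consequence of counting distinct labels: a sentence whose tokens all share a single non-`O` label is dropped as well.

```python
# Each inner list holds the labels of one sentence (illustrative data).
sentences = [
    ["O", "O", "O"],                  # only O          -> dropped
    ["O", "B-Disease", "I-Disease"],  # mixed labels    -> kept
    ["B-Disease", "O"],               # mixed labels    -> kept
    ["B-Disease"],                    # single distinct -> dropped too
]

# Mirrors F.size(F.array_distinct("label.result")) > 1
kept = [labels for labels in sentences if len(set(labels)) > 1]
print(kept)
```

If keeping entity-only sentences matters for your data, a condition that checks for the presence of any non-`O` label would be an alternative.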


In [None]:
# Clinical word embeddings trained on PubMED dataset
clinical_embeddings = nlp.WordEmbeddingsModel.pretrained('embeddings_clinical', "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]


Preparing the test data for the model evaluation

In [None]:
test_data = nlp.CoNLL().readDataset(spark, 'NER_NCBIconlltest.txt')

test_data = clinical_embeddings.transform(test_data)

test_data.write.mode("overwrite").parquet('NER_NCBIconlltest.parquet')

## **✅ Graph Creation**

We will use `TFGraphBuilder` annotator which can be used to create graphs in the model training pipeline. `TFGraphBuilder` inspects the data and creates the proper graph if a suitable version of TensorFlow is available. The graph is stored in the defined folder and loaded by the `MedicalNerApproach` annotator.

In [None]:
!pip install -q tensorflow==2.11.0
!pip install -q tensorflow-addons


In [None]:
graph_folder_path = "medical_ner_graphs"

ner_graph_builder = medical.TFGraphBuilder()\
    .setModelName("ner_dl")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setLabelColumn("label")\
    .setGraphFolder(graph_folder_path)\
    .setGraphFile("auto")\
    .setHiddenUnitsNumber(20)\
    .setIsLicensed(True) # False -> if you want to use TFGraphBuilder with NerDLApproach

## **🔎 Parameters**


- `batchSize`: (Int) Size for each batch in the optimization process (Default: 64).

- `dropout`: (Float) Dropout at the output of each layer (Default: 0.05f)

- `enableOutputLogs`: (Bool) Whether to output to annotators log folder (Default: false).

- `maxEpochs`: (Int) Maximum number of epochs to train (Default: 70).

- `minEpochs`: (Int) Minimum number of epochs to train (Default: 0).

- `graphFile`: (Str) File path that contains external graph file.

- `graphFolder`: (Str) Folder path that contains external graph files.

- `includeConfidence`: (Bool) Whether to include confidence scores in annotation metadata. Setting this parameter to True will add the confidence score to the metadata of the NAMED_ENTITY annotation. In addition, if includeAllConfidenceScores is set to True, then the confidence scores of all the tags will be added to the metadata, otherwise only for the predicted tag (the one with maximum score) (Default: False).

- `includeAllConfidenceScores`: (Bool) Whether to include confidence scores for all tags in annotation metadata or just the score of the predicted tag, by default False.
Needs the includeConfidence parameter to be set to True. Enabling this may slow down the inference speed (Default: False).

- `labelColumn`: (Str) Column with one label per token.

- `lr`: (Float) Learning rate for the optimization process (Default: 0.001).

- `outputLogsPath`: (Str) Folder path to save training logs. If no path is specified, the logs won’t be stored on disk. The path can be a local file path, a distributed file path (HDFS, DBFS), or a cloud storage path (S3).

- `testDataset`: (Str) Path to a test dataset in parquet format. If set, the dataset will be used to calculate statistics during training.

- `validationSplit`: (Float) Proportion of the training dataset to hold out and validate the model against after each epoch. The value should be between 0.0 and 1.0 (Default: 0.0, which disables validation). <br/>
The validation dataset is randomly extracted from the training dataset before training starts. If the value is 0.0, then no validation will be performed.

- `verbose`: (Int) Level of verbosity during training (Default: 2).

- `randomSeed`: (Int) Random seed. Set to positive integer to get reproducible results (Default: none)

- `evaluationLogExtended`: (Bool) Whether to extend the validation logs (Default: False).

- `tagsMapping`: (List) A map specifying how old tags are mapped to new ones as a list of comma-separated entities, where the first entity is the old tag and the second entity is the new tag. For example, if the map is set to [“OLDTAG,NEWTAG”, “B-PER,B-VIP”, “I-PER, I-VIP”], then all occurrences of “OLDTAG” will be mapped to “NEWTAG”, all occurrences of “B-PER” will be mapped to “B-VIP”, and all occurrences of “I-PER” will be mapped to “I-VIP”. It only works if overrideExistingTags is set to False.

- `earlyStoppingPatience`: (Int) Number of epochs to wait before early stopping if there is no improvement, by default 5. Given the earlyStoppingCriterion, if the performance does not improve for the given number of epochs, then the training will stop. If the value is 0, then early stopping will occur as soon as the criterion is met (no patience).

- `earlyStoppingCriterion`: (Float) If set, this param specifies the criterion to stop training if performance is not improving. The default value is 0, which means that early stopping is not used. <br/>
The metric the criterion is applied to is chosen in the following order of priority: if testDataset is defined, it is the F1-score on the test set; otherwise, if validationSplit is greater than 0.0, it is the F1-score on the validation set; otherwise, it is the model loss. <br/>
Note that while the F1-score ranges from 0.0 to 1.0, the loss ranges from 0.0 to infinity, so the appropriate value for the criterion can be very different depending on which case you are in. For example, if validationSplit is 0.1, then a criterion of 0.01 means that training should stop if the F1-score on the validation set improves by less than 0.01 from the last epoch. However, if no validation or test set is defined, then a criterion of 2.0 means that training should stop if the loss difference between the last epoch and the current one is less than 2.0.

- `pretrainedModelPath`: (Str) Path to an already trained MedicalNerModel. <br/>
This pretrained model will be used as a starting point for training the new one. The path can be a local file path, a distributed file path (HDFS, DBFS), or a cloud storage (S3).

- `overrideExistingTags`: (Bool) Whether to override the already learned tags when using a pretrained model to initialize the new model. A value of True (default) will override existing tags.

- `logPrefix`: (Str) A prefix that will be appended to every log, default value is empty.

- `useBestModel`: (Bool) Whether to restore and use the model from the epoch that has achieved the best performance at the end of the training. By default False (keep the model from the last trained epoch). <br/>
The best model depends on the earlyStoppingCriterion, which can be F1-score on test/validation dataset or the value of loss.

- `randomValidationSplitPerEpoch`: (Bool) When `True`, the validation set is randomly re-split for each epoch; when `False`, the split is done only once before training (the same validation split is used after each epoch). Default is `False`. Set it with `setRandomValidationSplitPerEpoch`.
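
To illustrate the `tagsMapping` format from the list above, the following sketch parses the comma-separated pairs into a dictionary and applies it to a tag sequence. The helpers `parse_tags_mapping` and `remap` are hypothetical names for this example, and stripping whitespace around each tag is an assumption; the library's own parsing may behave differently.

```python
def parse_tags_mapping(pairs):
    """Turn ["OLD,NEW", ...] into {"OLD": "NEW", ...}."""
    mapping = {}
    for pair in pairs:
        # Assumption: whitespace around each tag is not significant.
        old, new = (part.strip() for part in pair.split(","))
        mapping[old] = new
    return mapping


def remap(tags, mapping):
    """Replace every tag that has a mapping; leave the others untouched."""
    return [mapping.get(tag, tag) for tag in tags]


mapping = parse_tags_mapping(["OLDTAG,NEWTAG", "B-PER,B-VIP", "I-PER, I-VIP"])
remapped = remap(["B-PER", "I-PER", "O", "B-LOC"], mapping)
print(remapped)
```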


## **🦾 Model Training**

Now, we will define a `MedicalNerApproach` (exposed as `medical.NerApproach` in the `johnsnowlabs` library) and build a pipeline together with the clinical embeddings and the TF graph builder.

In [None]:
nerTagger = medical.NerApproach()\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setLabelColumn("label")\
    .setOutputCol("ner")\
    .setMaxEpochs(10)\
    .setBatchSize(64)\
    .setRandomSeed(0)\
    .setVerbose(1)\
    .setValidationSplit(0.2)\
    .setEvaluationLogExtended(True) \
    .setEnableOutputLogs(True)\
    .setIncludeConfidence(True)\
    .setOutputLogsPath('ner_logs')\
    .setTestDataset("NER_NCBIconlltest.parquet")\
    .setGraphFolder(graph_folder_path)
   # .setEnableMemoryOptimizer(True) #>> if you have a limited memory and a large conll file, you can set this True to train batch by batch

ner_pipeline = nlp.Pipeline(stages=[
          clinical_embeddings,
          ner_graph_builder,
          nerTagger
 ])

Fitting the pipeline with the training data.

In [None]:
%%time
ner_model = ner_pipeline.fit(conll_data)

TF Graph Builder configuration:
Model name: ner_dl
Graph folder: medical_ner_graphs
Graph file name: auto
Build params: {'ntags': 3, 'embeddings_dim': 200, 'nchars': 85, 'is_medical': True, 'lstm_size': 20}


Instructions for updating:
non-resource variables are not supported in the long term
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor


ner_dl graph exported to medical_ner_graphs/blstm_3_200_20_85.pb
CPU times: user 13.5 s, sys: 536 ms, total: 14 s
Wall time: 2min 38s
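
The graph file name reported above appears to encode the build parameters. The sketch below reproduces that name from the `Build params` dictionary printed in the output. The pattern is inferred from this single log, so treat the helper `graph_file_name` as an assumption and not as the library's documented naming rule.

```python
def graph_file_name(ntags, embeddings_dim, lstm_size, nchars):
    # Pattern inferred from the log line
    # "ner_dl graph exported to medical_ner_graphs/blstm_3_200_20_85.pb"
    return f"blstm_{ntags}_{embeddings_dim}_{lstm_size}_{nchars}.pb"


# Build params copied from the TFGraphBuilder output above.
build_params = {"ntags": 3, "embeddings_dim": 200, "nchars": 85,
                "is_medical": True, "lstm_size": 20}

name = graph_file_name(build_params["ntags"],
                       build_params["embeddings_dim"],
                       build_params["lstm_size"],
                       build_params["nchars"])
print(name)
```

Reading the name this way can help when checking whether a graph in `graphFolder` fits the number of tags, the embedding dimension, and the hidden units of a given training run.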


Checking the results saved in the log file

In [None]:
import os
log_file= os.listdir("ner_logs")[0]

with open (f"./ner_logs/{log_file}") as f:
  print(f.read())

Name of the selected graph: /content/medical_ner_graphs/blstm_3_200_20_85.pb
Training started - total epochs: 10 - lr: 0.001 - batch size: 64 - labels: 3 - chars: 84 - training examples: 3266


Epoch 1/10 started, lr: 0.001, dataset size: 3266


Epoch 1/10 - 13.06s - loss: 422.98694 - avg training loss: 9.8369055 - batches: 43
Quality on validation dataset (20.0%), validation examples = 653
time to finish evaluation: 1.60s
Total validation loss: 86.7688	Avg validation loss: 6.1978
label	 tp	 fp	 fn	 prec	 rec	 f1
I-Disease	 51	 11	 647	 0.82258064	 0.0730659	 0.13421053
B-Disease	 2	 5	 661	 0.2857143	 0.0030165913	 0.0059701493
tp: 53 fp: 16 fn: 1308 labels: 2
Macro-average	 prec: 0.5541475, rec: 0.038041245, f1: 0.07119507
Micro-average	 prec: 0.76811594, rec: 0.038941953, f1: 0.07412587
Quality on test dataset: 
time to finish evaluation: 0.84s
Total test loss: 93.6073	Avg test loss: 6.6862
label	 tp	 fp	 fn	 prec	 rec	 f1
I-Disease	 42	 11	 747	 0.7924528	 0.05323194	 0.09976247
B-

**Evaluating the model**

In [None]:
pred_df = ner_model.stages[2].transform(test_data)

In [None]:
pred_df.columns

['text', 'document', 'sentence', 'token', 'pos', 'label', 'embeddings', 'ner']

In [None]:
evaler = medical.NerDLMetrics(mode="full_chunk")

eval_result = evaler.computeMetricsFromDF(pred_df.select("label","ner"), prediction_col="ner", label_col="label", drop_o = True, case_sensitive = True).cache()

eval_result.withColumn("precision", F.round(eval_result["precision"],4))\
           .withColumn("recall", F.round(eval_result["recall"],4))\
           .withColumn("f1", F.round(eval_result["f1"],4)).show(100)

print(eval_result.selectExpr("avg(f1) as macro").show())
print (eval_result.selectExpr("sum(f1*total) as sumprod","sum(total) as sumtotal").selectExpr("sumprod/sumtotal as micro").show())

+-------+-----+-----+-----+-----+---------+------+------+
| entity|   tp|   fp|   fn|total|precision|recall|    f1|
+-------+-----+-----+-----+-----+---------+------+------+
|Disease|511.0|232.0|193.0|704.0|   0.6878|0.7259|0.7063|
+-------+-----+-----+-----+-----+---------+------+------+

+------------------+
|             macro|
+------------------+
|0.7062888735314443|
+------------------+

None
+------------------+
|             micro|
+------------------+
|0.7062888735314443|
+------------------+

None
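
The precision, recall, and F1 values in the table above follow directly from the `tp`, `fp`, and `fn` counts. The short check below recomputes them with the standard formulas; `prf` is just a helper name for this example.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from true/false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Counts for the Disease entity in full_chunk mode, from the table above.
precision, recall, f1 = prf(tp=511, fp=232, fn=193)
print(round(precision, 4), round(recall, 4), round(f1, 4))
```

Because the dataset has a single entity type, the macro average (mean of per-entity F1) and the micro value computed here as a `total`-weighted average of F1 are both taken over one row, which is why they coincide.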


In [None]:
evaler = medical.NerDLMetrics(mode="partial_chunk_per_token")

eval_result = evaler.computeMetricsFromDF(pred_df.select("label","ner"), prediction_col="ner", label_col="label", drop_o = True, case_sensitive = True).cache()

eval_result.withColumn("precision", F.round(eval_result["precision"],4))\
           .withColumn("recall", F.round(eval_result["recall"],4))\
           .withColumn("f1", F.round(eval_result["f1"],4)).show(100)

print(eval_result.selectExpr("avg(f1) as macro").show())
print (eval_result.selectExpr("sum(f1*total) as sumprod","sum(total) as sumtotal").selectExpr("sumprod/sumtotal as micro").show())

+-------+------+-----+-----+------+---------+------+-----+
| entity|    tp|   fp|   fn| total|precision|recall|   f1|
+-------+------+-----+-----+------+---------+------+-----+
|Disease|1183.0|150.0|314.0|1497.0|   0.8875|0.7902|0.836|
+-------+------+-----+-----+------+---------+------+-----+

+------------------+
|             macro|
+------------------+
|0.8360424028268552|
+------------------+

None
+------------------+
|             micro|
+------------------+
|0.8360424028268552|
+------------------+

None


### Train a Model Choosing Best Model

`useBestModel`:  This param preserves and restores the model that has achieved the best performance at the end of the training. The priority is metrics from testDataset (micro F1), metrics from validationSplit (micro F1), and if none is set it will keep track of loss during the training. <br/>

`earlyStopping`: You can stop training at the point when the performance on the test/validation dataset starts to degrade. Two params are used to enable this feature: <br/>

- `earlyStoppingCriterion`: Sets the minimal improvement of the monitored metric; training is terminated when the improvement falls below it. The metric monitored is the same as the one used by `useBestModel` (micro F1 when using a test/validation set, loss otherwise). Default is 0, which means no early stopping is applied.

- `earlyStoppingPatience`: The number of epochs without improvement that will be tolerated. With a value of 0, early stopping occurs the first time the performance in the current epoch is no better than in the previous epoch.
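
The interplay of these params can be sketched in plain Python. This is a simplified model and not the library's implementation: it assumes that improvement is measured against the previous epoch, that an epoch counts as "without improvement" when the gain is below the criterion, and that training stops once more than `patience` such epochs occur in a row. All function names and the F1 values are hypothetical.

```python
def monitored_metric(test_dataset=None, validation_split=0.0):
    """Metric priority as described above: test set, validation set, then loss."""
    if test_dataset:
        return "f1_test"
    if validation_split > 0.0:
        return "f1_validation"
    return "loss"


def early_stopping_epoch(scores, criterion, patience, higher_is_better=True):
    """Return the 1-based epoch at which training would stop, or None."""
    if criterion <= 0:
        return None  # a criterion of 0 disables early stopping
    stale = 0
    for i in range(1, len(scores)):
        delta = scores[i] - scores[i - 1]
        improvement = delta if higher_is_better else -delta
        if improvement < criterion:
            stale += 1
            if stale > patience:
                return i + 1
        else:
            stale = 0  # assumption: a good epoch resets the counter
    return None


def best_epoch(scores, higher_is_better=True):
    """1-based epoch a useBestModel-style restore would pick."""
    pick = max if higher_is_better else min
    return scores.index(pick(scores)) + 1


# Made-up micro-F1 values per epoch, for illustration only.
f1_per_epoch = [0.10, 0.40, 0.60, 0.62, 0.63, 0.61, 0.62, 0.62]

stop = early_stopping_epoch(f1_per_epoch, criterion=0.04, patience=3)
print(monitored_metric("NER_NCBIconlltest.parquet", 0.2), stop,
      best_epoch(f1_per_epoch[:stop]))
```

Under these assumptions the run would stop a few epochs after the gains flatten, and restoring the best epoch would return an earlier model than the last one trained, which is the motivation for combining early stopping with `useBestModel`.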

In [None]:
# remove the existing logs

! rm -r ./ner_logs

In [None]:
nerTagger = medical.NerApproach()\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setLabelColumn("label")\
    .setOutputCol("ner")\
    .setMaxEpochs(30)\
    .setBatchSize(64)\
    .setRandomSeed(0)\
    .setVerbose(1)\
    .setValidationSplit(0.2)\
    .setEvaluationLogExtended(True) \
    .setEnableOutputLogs(True)\
    .setIncludeConfidence(True)\
    .setOutputLogsPath('ner_logs')\
    .setTestDataset("NER_NCBIconlltest.parquet")\
    .setUseBestModel(True)\
    .setEarlyStoppingCriterion(0.04)\
    .setEarlyStoppingPatience(3)\
    .setGraphFolder(graph_folder_path)


ner_pipeline = nlp.Pipeline(stages=[
          clinical_embeddings,
          ner_graph_builder,
          nerTagger
 ])

In [None]:
%%time
ner_model = ner_pipeline.fit(conll_data)

TF Graph Builder configuration:
Model name: ner_dl
Graph folder: medical_ner_graphs
Graph file name: auto
Build params: {'ntags': 3, 'embeddings_dim': 200, 'nchars': 85, 'is_medical': True, 'lstm_size': 20}
ner_dl graph exported to medical_ner_graphs/blstm_3_200_20_85.pb
CPU times: user 10.2 s, sys: 261 ms, total: 10.5 s
Wall time: 3min 16s


Checking the results saved in the log file

In [None]:
import os
log_file= os.listdir("ner_logs")[0]

with open (f"./ner_logs/{log_file}") as f:
  print(f.read())

Name of the selected graph: /content/medical_ner_graphs/blstm_3_200_20_85.pb
Training started - total epochs: 30 - lr: 0.001 - batch size: 64 - labels: 3 - chars: 84 - training examples: 3266


Epoch 1/30 started, lr: 0.001, dataset size: 3266


Epoch 1/30 - 12.47s - loss: 826.4876 - avg training loss: 19.220642 - batches: 43
Quality on validation dataset (20.0%), validation examples = 653
time to finish evaluation: 1.40s
Total validation loss: 120.7536	Avg validation loss: 8.6253
label	 tp	 fp	 fn	 prec	 rec	 f1
I-Disease	 1	 1	 697	 0.5	 0.0014326648	 0.002857143
B-Disease	 0	 0	 663	 0.0	 0.0	 0.0
tp: 1 fp: 1 fn: 1360 labels: 2
Macro-average	 prec: 0.25, rec: 7.163324E-4, f1: 0.0014285715
Micro-average	 prec: 0.5, rec: 7.3475385E-4, f1: 0.0014673515
Quality on test dataset: 
time to finish evaluation: 0.84s
Total test loss: 126.2043	Avg test loss: 9.0146
label	 tp	 fp	 fn	 prec	 rec	 f1
I-Disease	 0	 0	 789	 0.0	 0.0	 0.0
B-Disease	 0	 0	 708	 0.0	 0.0	 0.0
tp: 0 fp: 0 fn: 1497 labe

As you can see above, the **earlyStopping** feature worked: training was terminated before the 30th epoch.

**Evaluating the model**

In [None]:
pred_df = ner_model.stages[2].transform(test_data)

In [None]:
pred_df.columns

['text', 'document', 'sentence', 'token', 'pos', 'label', 'embeddings', 'ner']

In [None]:
evaler = medical.NerDLMetrics(mode="full_chunk")

eval_result = evaler.computeMetricsFromDF(pred_df.select("label","ner"), prediction_col="ner", label_col="label", drop_o = True, case_sensitive = True).cache()

eval_result.withColumn("precision", F.round(eval_result["precision"],4))\
           .withColumn("recall", F.round(eval_result["recall"],4))\
           .withColumn("f1", F.round(eval_result["f1"],4)).show(100)

print(eval_result.selectExpr("avg(f1) as macro").show())
print (eval_result.selectExpr("sum(f1*total) as sumprod","sum(total) as sumtotal").selectExpr("sumprod/sumtotal as micro").show())

+-------+-----+-----+-----+-----+---------+------+------+
| entity|   tp|   fp|   fn|total|precision|recall|    f1|
+-------+-----+-----+-----+-----+---------+------+------+
|Disease|497.0|238.0|207.0|704.0|   0.6762| 0.706|0.6908|
+-------+-----+-----+-----+-----+---------+------+------+

+------------------+
|             macro|
+------------------+
|0.6907574704656011|
+------------------+

None
+------------------+
|             micro|
+------------------+
|0.6907574704656011|
+------------------+

None


In [None]:
evaler = medical.NerDLMetrics(mode="partial_chunk_per_token")

eval_result = evaler.computeMetricsFromDF(pred_df.select("label","ner"), prediction_col="ner", label_col="label", drop_o = True, case_sensitive = True).cache()

eval_result.withColumn("precision", F.round(eval_result["precision"],4))\
           .withColumn("recall", F.round(eval_result["recall"],4))\
           .withColumn("f1", F.round(eval_result["f1"],4)).show(100)

print(eval_result.selectExpr("avg(f1) as macro").show())
print (eval_result.selectExpr("sum(f1*total) as sumprod","sum(total) as sumtotal").selectExpr("sumprod/sumtotal as micro").show())

+-------+------+-----+-----+------+---------+------+------+
| entity|    tp|   fp|   fn| total|precision|recall|    f1|
+-------+------+-----+-----+------+---------+------+------+
|Disease|1174.0|137.0|323.0|1497.0|   0.8955|0.7842|0.8362|
+-------+------+-----+-----+------+---------+------+------+

+------------------+
|             macro|
+------------------+
|0.8361823361823362|
+------------------+

None
+------------------+
|             micro|
+------------------+
|0.8361823361823362|
+------------------+

None


`ner_utils`: This new module is used after NER training to calculate chunk-based metrics and plot training logs.

`evaluate`: if verbose, returns overall performance, as well as performance per chunk type; otherwise, simply returns overall precision, recall, f1 scores

`loss_plot`: Plots the figure of loss vs epochs

`get_charts` : Plots the figures of metrics ( precision, recall, f1) vs epochs

```
import sparknlp_jsl
from sparknlp_jsl.training_log_parser import ner_log_parser
parser = ner_log_parser()

pred_df = ner_model.stages[2].transform(test_data)

pred_df = pred_df.select(F.explode(F.arrays_zip(pred_df.ner.result,
                                                pred_df.label.result)).alias("cols"))\
                 .select(F.expr("cols['0']").alias("prediction"),
                         F.expr("cols['1']").alias("ground_truth"))
                 
df = pred_df.toPandas()


metrics = parser.evaluate( df['prediction'].values, df['ground_truth'].values)

parser.loss_plot(f"./ner_logs/{log_file}")

parser.get_charts('./ner_logs/'+log_file)
```
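
If you only need the per-epoch loss values, the training log can also be parsed directly. The sketch below is independent of `ner_log_parser` and relies only on the epoch summary line format visible in the logs printed earlier in this notebook; the regular expression and the helper `parse_epoch_losses` are assumptions that may need adjusting for other log versions.

```python
import re

# Matches lines such as:
# "Epoch 1/10 - 13.06s - loss: 422.98694 - avg training loss: 9.8369055 - batches: 43"
EPOCH_LINE = re.compile(
    r"Epoch (\d+)/(\d+) - ([\d.]+)s - loss: ([\d.eE+-]+)"
    r" - avg training loss: ([\d.eE+-]+) - batches: (\d+)"
)


def parse_epoch_losses(log_text):
    """Extract one record per epoch summary line found in the log text."""
    rows = []
    for match in EPOCH_LINE.finditer(log_text):
        epoch, total, seconds, loss, avg_loss, batches = match.groups()
        rows.append({
            "epoch": int(epoch),
            "total_epochs": int(total),
            "seconds": float(seconds),
            "loss": float(loss),
            "avg_loss": float(avg_loss),
            "batches": int(batches),
        })
    return rows


# Sample lines copied from the log outputs shown earlier.
sample = """Epoch 1/10 started, lr: 0.001, dataset size: 3266
Epoch 1/10 - 13.06s - loss: 422.98694 - avg training loss: 9.8369055 - batches: 43
Epoch 1/30 - 12.47s - loss: 826.4876 - avg training loss: 19.220642 - batches: 43
"""

rows = parse_epoch_losses(sample)
for row in rows:
    print(row["epoch"], row["total_epochs"], row["loss"], row["avg_loss"])
```

The resulting list of dictionaries can be passed to `pandas.DataFrame` or any plotting library to draw a loss curve.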

Saving the trained model and using it in another pipeline via `load()`.

In [None]:
ner_model.stages[2].write().overwrite().save('models/custom_NER_model')

In [None]:
document = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence = nlp.SentenceDetector()\
    .setInputCols(['document'])\
    .setOutputCol('sentence')

token = nlp.Tokenizer()\
    .setInputCols(['sentence'])\
    .setOutputCol('token')

loaded_ner_model = medical.NerModel.load("models/custom_NER_model")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

converter = medical.NerConverterInternal()\
    .setInputCols(["document", "token", "ner"])\
    .setOutputCol("ner_chunk")

ner_prediction_pipeline = nlp.Pipeline(stages=[
    document,
    sentence,
    token,
    clinical_embeddings,
    loaded_ner_model,
    converter])

empty_data = spark.createDataFrame([['']]).toDF("text")

prediction_model = ner_prediction_pipeline.fit(empty_data)


light_model = nlp.LightPipeline(prediction_model)

In [None]:
text = "She has a metastatic breast cancer"

result = light_model.fullAnnotate(text)[0]

[(i.result, i.metadata['entity']) for i in result['ner_chunk']]

[('metastatic breast cancer', 'Disease')]
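
The converter stage turns token-level IOB tags into entity chunks. The sketch below shows that idea in plain Python, assuming the model tagged the example sentence as shown in `tags` (the token-level tags are not printed above, so they are an assumption). `bio_to_chunks` is a hypothetical helper, not the actual `NerConverterInternal` logic.

```python
def bio_to_chunks(tokens, tags):
    """Merge B-/I- tagged tokens into (chunk_text, entity) pairs."""
    chunks, current_tokens, current_entity = [], [], None
    for token, tag in zip(tokens, tags):
        prefix, _, entity = tag.partition("-")
        # A new chunk starts on B-, or on an I- whose entity differs from the open chunk
        starts_new = prefix == "B" or (prefix == "I" and entity != current_entity)
        if prefix == "O" or starts_new:
            if current_tokens:
                chunks.append((" ".join(current_tokens), current_entity))
            current_tokens, current_entity = [], None
        if prefix in ("B", "I"):
            current_tokens.append(token)
            current_entity = entity
    if current_tokens:  # flush a chunk that runs to the end of the sentence
        chunks.append((" ".join(current_tokens), current_entity))
    return chunks

tokens = "She has a metastatic breast cancer".split()
tags = ["O", "O", "O", "B-Disease", "I-Disease", "I-Disease"]  # assumed tags

chunks = bio_to_chunks(tokens, tags)
print(chunks)
```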

### Resume NER Model Training

Steps that we will follow:
- Train a new model for a few epochs.
- Load the same model, train it for more epochs on the same taxonomy, and check the stats.
- Retrain a model that was already trained on different data.

**Downloading data for training**

In [None]:
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/NCBI_disease_official_test.conll
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/NCBI_disease_official_train_dev.conll

In [None]:
# from sparknlp.training import CoNLL

training_data = nlp.CoNLL().readDataset(spark, 'NCBI_disease_official_train_dev.conll')

training_data.show(3)

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|                 pos|               label|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|Identification of...|[{document, 0, 89...|[{document, 0, 89...|[{token, 0, 13, I...|[{pos, 0, 13, NN,...|[{named_entity, 0...|
|The adenomatous p...|[{document, 0, 21...|[{document, 0, 21...|[{token, 0, 2, Th...|[{pos, 0, 2, NN, ...|[{named_entity, 0...|
|Complex formation...|[{document, 0, 63...|[{document, 0, 63...|[{token, 0, 6, Co...|[{pos, 0, 6, NN, ...|[{named_entity, 0...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
only showing top 3 rows



In [None]:
test_data = nlp.CoNLL().readDataset(spark, 'NCBI_disease_official_test.conll')

test_data.show(3)

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|                 pos|               label|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|Clustering of mis...|[{document, 0, 10...|[{document, 0, 10...|[{token, 0, 9, Cl...|[{pos, 0, 9, NN, ...|[{named_entity, 0...|
|Ataxia - telangie...|[{document, 0, 13...|[{document, 0, 13...|[{token, 0, 5, At...|[{pos, 0, 5, NN, ...|[{named_entity, 0...|
|The risk of cance...|[{document, 0, 15...|[{document, 0, 15...|[{token, 0, 2, Th...|[{pos, 0, 2, NN, ...|[{named_entity, 0...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
only showing top 3 rows



**Split the test data into two parts:**
- We keep the first part separate and use it to train the model further, since it is completely unseen data from the same taxonomy.

- The second part will be used for testing and evaluation.

In [None]:
(test_data_1, test_data_2) = test_data.randomSplit([0.5, 0.5], seed = 100)

# save the test data as parquet for easy testing
clinical_embeddings.transform(test_data_1).write.parquet('test_1.parquet')

clinical_embeddings.transform(test_data_2).write.parquet('test_2.parquet')
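
A weighted random split with a fixed seed is reproducible, but the resulting parts are only approximately the requested size because each row is assigned independently. The plain-Python sketch below illustrates that behaviour with a hypothetical `random_split` helper. It does not reproduce the exact row assignment that Spark's `randomSplit` makes.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one split by drawing one uniform number per row."""
    total = sum(weights)
    bounds, cumulative = [], 0.0
    for weight in weights:
        cumulative += weight / total
        bounds.append(cumulative)  # upper bound of each split's interval
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        x = rng.random()
        for i, upper in enumerate(bounds):
            # The last split also catches any floating-point leftovers
            if x < upper or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

rows = list(range(1000))  # stand-in for the rows of a DataFrame
part_1, part_2 = random_split(rows, [0.5, 0.5], seed=100)
print(len(part_1), len(part_2))
```

Calling the function again with the same seed returns the same two parts, which is why the notebook fixes `seed = 100`.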

#### Train a new model, pause, and resume training on the same dataset.

**Create a graph**

In [None]:
graph_folder_path = "medical_ner_graphs"

ner_graph_builder = medical.TFGraphBuilder()\
    .setModelName("ner_dl")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setLabelColumn("label")\
    .setGraphFolder(graph_folder_path)\
    .setGraphFile("auto")\
    .setHiddenUnitsNumber(24)\
    .setIsLicensed(True) # set to False to use TFGraphBuilder with NerDLApproach

**Train for 2 epochs**

In [None]:
# remove the existing logs (-f avoids an error if the folder does not exist yet)

! rm -rf ./ner_logs

In [None]:
nerTagger = medical.NerApproach()\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setLabelColumn("label")\
    .setOutputCol("ner")\
    .setMaxEpochs(2)\
    .setLr(0.003)\
    .setBatchSize(8)\
    .setRandomSeed(0)\
    .setVerbose(1)\
    .setEvaluationLogExtended(True) \
    .setEnableOutputLogs(True)\
    .setIncludeConfidence(True)\
    .setTestDataset('./test_2.parquet')\
    .setGraphFolder(graph_folder_path)\
    .setOutputLogsPath('./ner_logs')

ner_pipeline = nlp.Pipeline(stages=[
    clinical_embeddings,
    ner_graph_builder,
    nerTagger
 ])

In [None]:
%%time
ner_model = ner_pipeline.fit(training_data)

TF Graph Builder configuration:
Model name: ner_dl
Graph folder: medical_ner_graphs
Graph file name: auto
Build params: {'ntags': 3, 'embeddings_dim': 200, 'nchars': 85, 'is_medical': True, 'lstm_size': 24}
ner_dl graph exported to medical_ner_graphs/blstm_3_200_24_85.pb
CPU times: user 9.62 s, sys: 137 ms, total: 9.75 s
Wall time: 1min 43s
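
The exported graph file name appears to encode the build parameters. The sketch below reconstructs the name from the `Build params` dictionary printed above. The pattern `blstm_{ntags}_{embeddings_dim}_{lstm_size}_{nchars}.pb` is inferred from this output only, and `graph_file_name` is a hypothetical helper, not part of `TFGraphBuilder`.

```python
def graph_file_name(build_params):
    """Compose the graph file name from the logged build parameters."""
    # Extra keys such as 'is_medical' are ignored by str.format
    return "blstm_{ntags}_{embeddings_dim}_{lstm_size}_{nchars}.pb".format(**build_params)

# Build params copied from the output above
build_params = {"ntags": 3, "embeddings_dim": 200, "nchars": 85,
                "is_medical": True, "lstm_size": 24}

name = graph_file_name(build_params)
print(name)
```

Because `nchars` depends on the characters seen in the training data, fitting on a different dataset can produce a differently named graph file.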


In [None]:
!ls ner_logs/MedicalNerApproach*

ner_logs/MedicalNerApproach_10161838cd72.log


In [None]:
# Training Logs
! cat ner_logs/MedicalNerApproach*

Name of the selected graph: /content/medical_ner_graphs/blstm_3_200_24_85.pb
Training started - total epochs: 2 - lr: 0.003 - batch size: 8 - labels: 3 - chars: 84 - training examples: 6347


Epoch 1/2 started, lr: 0.003, dataset size: 6347


Epoch 1/2 - 39.76s - loss: 2162.3325 - avg training loss: 2.7199152 - batches: 795
Quality on test dataset: 
time to finish evaluation: 1.53s
Total test loss: 93.0759	Avg test loss: 1.4543
label	 tp	 fp	 fn	 prec	 rec	 f1
I-Disease	 472	 65	 158	 0.87895715	 0.74920636	 0.8089117
B-Disease	 444	 103	 87	 0.81170017	 0.8361582	 0.82374763
tp: 916 fp: 168 fn: 245 labels: 2
Macro-average	 prec: 0.8453287, rec: 0.7926823, f1: 0.81815946
Micro-average	 prec: 0.84501845, rec: 0.788975, f1: 0.8160357


Epoch 2/2 started, lr: 0.0029850747, dataset size: 6347


Epoch 2/2 - 37.48s - loss: 1019.747 - avg training loss: 1.2827007 - batches: 795
Quality on test dataset: 
time to finish evaluation: 0.91s
Total test loss: 75.5340	Avg test loss: 1.1802
label	 tp	
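
The log lines follow a regular layout, so per-epoch numbers can be pulled out with a few regular expressions. This is a minimal sketch that assumes the line format shown above. `parse_training_log` is a hypothetical helper and is unrelated to `ner_log_parser`.

```python
import re

EPOCH_RE = re.compile(
    r"Epoch (\d+)/(\d+) - ([\d.]+)s - loss: ([\d.]+) - avg training loss: ([\d.]+)"
)
AVG_RE = re.compile(
    r"(Macro|Micro)-average\s+prec: ([\d.]+), rec: ([\d.]+), f1: ([\d.]+)"
)

def parse_training_log(text):
    """Return one dict per finished epoch with loss and, if logged, average F1."""
    epochs = []
    for line in text.splitlines():
        match = EPOCH_RE.search(line)
        if match:
            epochs.append({
                "epoch": int(match.group(1)),
                "seconds": float(match.group(3)),
                "loss": float(match.group(4)),
                "avg_loss": float(match.group(5)),
            })
            continue
        match = AVG_RE.search(line)
        if match and epochs:
            # Attach the test-set averages to the most recent epoch
            epochs[-1][match.group(1).lower() + "_f1"] = float(match.group(4))
    return epochs

# Lines copied from the log above; the epoch 2 metrics are cut off in the output
sample_log = (
    "Epoch 1/2 started, lr: 0.003, dataset size: 6347\n"
    "Epoch 1/2 - 39.76s - loss: 2162.3325 - avg training loss: 2.7199152 - batches: 795\n"
    "Macro-average\t prec: 0.8453287, rec: 0.7926823, f1: 0.81815946\n"
    "Micro-average\t prec: 0.84501845, rec: 0.788975, f1: 0.8160357\n"
    "Epoch 2/2 started, lr: 0.0029850747, dataset size: 6347\n"
    "Epoch 2/2 - 37.48s - loss: 1019.747 - avg training loss: 1.2827007 - batches: 795\n"
)

epochs = parse_training_log(sample_log)
for entry in epochs:
    print(entry)
```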

**Evaluate**

In [None]:
#from sparknlp_jsl.eval import NerDLMetrics

pred_df = ner_model.stages[2].transform(clinical_embeddings.transform(test_data_2))

evaler = medical.NerDLMetrics(mode="full_chunk")

eval_result = evaler.computeMetricsFromDF(pred_df.select("label","ner"), prediction_col="ner", label_col="label", drop_o = True, case_sensitive = True).cache()

eval_result.withColumn("precision", F.round(eval_result["precision"],4))\
           .withColumn("recall", F.round(eval_result["recall"],4))\
           .withColumn("f1", F.round(eval_result["f1"],4)).show(100)

print(eval_result.selectExpr("avg(f1) as macro").show())
print (eval_result.selectExpr("sum(f1*total) as sumprod","sum(total) as sumtotal").selectExpr("sumprod/sumtotal as micro").show())

+-------+-----+----+-----+-----+---------+------+------+
| entity|   tp|  fp|   fn|total|precision|recall|    f1|
+-------+-----+----+-----+-----+---------+------+------+
|Disease|397.0|68.0|130.0|527.0|   0.8538|0.7533|0.8004|
+-------+-----+----+-----+-----+---------+------+------+

+------------------+
|             macro|
+------------------+
|0.8004032258064516|
+------------------+

None
+------------------+
|             micro|
+------------------+
|0.8004032258064516|
+------------------+

None
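
The numbers in the table can be checked by hand from the tp/fp/fn counts. The sketch below recomputes precision, recall, and F1 for the `Disease` row, then applies the same two aggregations as the cell above: `macro` is the plain mean of the per-entity F1 scores, and the value labelled `micro` is the mean of F1 weighted by `total` (which equals tp + fn in the table). The helper `prf` is hypothetical.

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from raw counts, guarding against division by zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Counts copied from the table above
rows = [{"entity": "Disease", "tp": 397, "fp": 68, "fn": 130}]

scores = []
for row in rows:
    precision, recall, f1 = prf(row["tp"], row["fp"], row["fn"])
    total = row["tp"] + row["fn"]  # number of gold chunks for this entity
    scores.append({"entity": row["entity"], "precision": precision,
                   "recall": recall, "f1": f1, "total": total})

macro = sum(s["f1"] for s in scores) / len(scores)
micro = sum(s["f1"] * s["total"] for s in scores) / sum(s["total"] for s in scores)

print(round(scores[0]["precision"], 4), round(scores[0]["recall"], 4), round(scores[0]["f1"], 4))
print(macro, micro)
```

With a single entity type the two averages coincide, which is why `macro` and `micro` show the same value here.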


**Save the model to disk**

In [None]:
ner_model.stages[2].write().overwrite().save('models/NCBI_NER_2_epoch')

**Train using the saved model on an unseen dataset** <br/>
We will use unseen data from the same taxonomy.

In [None]:
nerTagger = medical.NerApproach()\
      .setInputCols(["sentence", "token", "embeddings"])\
      .setLabelColumn("label")\
      .setOutputCol("ner")\
      .setMaxEpochs(2)\
      .setLr(0.003)\
      .setBatchSize(8)\
      .setRandomSeed(0)\
      .setVerbose(1)\
      .setEvaluationLogExtended(True) \
      .setEnableOutputLogs(True)\
      .setIncludeConfidence(True)\
      .setTestDataset('/content/test_2.parquet')\
      .setOutputLogsPath('ner_logs')\
      .setGraphFolder(graph_folder_path)\
      .setPretrainedModelPath("models/NCBI_NER_2_epoch") ## load existing model

ner_pipeline = nlp.Pipeline(stages=[
      clinical_embeddings,
      ner_graph_builder,
      nerTagger
 ])

In [None]:
# remove the existing logs (-f avoids an error if the folder does not exist yet)

! rm -rf ./ner_logs

In [None]:
%%time
ner_model_retrained = ner_pipeline.fit(test_data_1)

TF Graph Builder configuration:
Model name: ner_dl
Graph folder: medical_ner_graphs
Graph file name: auto
Build params: {'ntags': 3, 'embeddings_dim': 200, 'nchars': 79, 'is_medical': True, 'lstm_size': 24}
ner_dl graph exported to medical_ner_graphs/blstm_3_200_24_79.pb
CPU times: user 9.17 s, sys: 147 ms, total: 9.32 s
Wall time: 25.1 s


In [None]:
!cat ./ner_logs/MedicalNerApproach*

Name of the selected graph: pretrained graph
Training started - total epochs: 2 - lr: 0.003 - batch size: 8 - labels: 3 - chars: 84 - training examples: 438


Epoch 1/2 started, lr: 0.003, dataset size: 438


Epoch 1/2 - 5.98s - loss: 72.05317 - avg training loss: 1.2640907 - batches: 57
Quality on test dataset: 
time to finish evaluation: 1.33s
Total test loss: 62.7809	Avg test loss: 0.9810
label	 tp	 fp	 fn	 prec	 rec	 f1
I-Disease	 526	 63	 104	 0.89303905	 0.83492064	 0.8630025
B-Disease	 458	 68	 73	 0.8707224	 0.86252356	 0.8666036
tp: 984 fp: 131 fn: 177 labels: 2
Macro-average	 prec: 0.88188076, rec: 0.8487221, f1: 0.86498374
Micro-average	 prec: 0.8825112, rec: 0.8475452, f1: 0.86467487


Epoch 2/2 started, lr: 0.0029850747, dataset size: 438


Epoch 2/2 - 3.29s - loss: 59.75368 - avg training loss: 1.0483102 - batches: 57
Quality on test dataset: 
time to finish evaluation: 0.91s
Total test loss: 61.0334	Avg test loss: 0.9536
label	 tp	 fp	 fn	 prec	 rec	 f1
I-Disease	 535	 8
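
The logs show the learning rate dropping from `0.003` in epoch 1 to `0.0029850747` in epoch 2. That is consistent with the decay rule documented for the open-source `NerDLApproach`, `lr / (1 + po * epoch)`, with the decay coefficient `po = 0.005`. The sketch below checks the arithmetic. That `MedicalNerApproach` uses the same rule and default is an assumption here, since `setPo` is never called in this notebook.

```python
def decayed_lr(lr, po, epoch):
    """Learning rate for a zero-based epoch index under lr / (1 + po * epoch)."""
    return lr / (1.0 + po * epoch)

initial_lr = 0.003  # value passed to setLr above
po = 0.005          # assumed default decay coefficient

schedule = [decayed_lr(initial_lr, po, epoch) for epoch in range(2)]
for epoch, lr in enumerate(schedule, start=1):
    print(f"epoch {epoch}: lr = {lr:.10f}")
```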

In [None]:
#from sparknlp_jsl.eval import NerDLMetrics
import pyspark.sql.functions as F

pred_df = ner_model_retrained.stages[2].transform(clinical_embeddings.transform(test_data_2))

evaler = medical.NerDLMetrics(mode="full_chunk")

eval_result = evaler.computeMetricsFromDF(pred_df.select("label","ner"), prediction_col="ner", label_col="label",case_sensitive = True).cache()

eval_result.withColumn("precision", F.round(eval_result["precision"],4))\
    .withColumn("recall", F.round(eval_result["recall"],4))\
    .withColumn("f1", F.round(eval_result["f1"],4)).show(100)

print(eval_result.selectExpr("avg(f1) as macro").show())
print (eval_result.selectExpr("sum(f1*total) as sumprod","sum(total) as sumtotal").selectExpr("sumprod/sumtotal as micro").show())

+-------+-----+-----+----+-----+---------+------+------+
| entity|   tp|   fp|  fn|total|precision|recall|    f1|
+-------+-----+-----+----+-----+---------+------+------+
|Disease|453.0|103.0|74.0|527.0|   0.8147|0.8596|0.8366|
+-------+-----+-----+----+-----+---------+------+------+

+------------------+
|             macro|
+------------------+
|0.8365650969529086|
+------------------+

None
+------------------+
|             micro|
+------------------+
|0.8365650969529086|
+------------------+

None
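
Chunk-level scoring is stricter than token-level scoring. The sketch below assumes that `full_chunk` mode counts a predicted chunk as a true positive only when its boundaries and entity type both match a gold chunk exactly. This reading is an assumption, and the helpers `extract_spans` and `chunk_counts` are hypothetical, so the actual `NerDLMetrics` implementation may differ in detail.

```python
def extract_spans(tags):
    """Return the set of (start, end, entity) spans encoded by an IOB tag sequence."""
    spans, start, entity = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel "O" flushes the last span
        prefix, _, label = tag.partition("-")
        continues = prefix == "I" and label == entity
        if start is not None and not continues:
            spans.add((start, i, entity))
            start, entity = None, None
        if prefix == "B" or (prefix == "I" and start is None):
            start, entity = i, label
    return spans

def chunk_counts(pred_tags, gold_tags):
    """Exact-match chunk counts: (tp, fp, fn)."""
    pred, gold = extract_spans(pred_tags), extract_spans(gold_tags)
    return len(pred & gold), len(pred - gold), len(gold - pred)

# Toy sequences (hypothetical): the first predicted chunk is one token too short
gold_tags = ["O", "B-Disease", "I-Disease", "I-Disease", "O", "B-Disease"]
pred_tags = ["O", "B-Disease", "I-Disease", "O", "O", "B-Disease"]

tp, fp, fn = chunk_counts(pred_tags, gold_tags)
print(tp, fp, fn)
```

Under exact matching, a partially overlapping prediction counts as both a false positive and a false negative, so chunk-level F1 is typically lower than the token-level F1 reported in the training logs.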


#### Train a model already trained on different data

We will use the pretrained `ner_jsl` model as the starting point for retraining.

In [None]:
jsl_ner = medical.NerModel.pretrained('ner_jsl','en','clinical/models')

jsl_ner.getClasses()

ner_jsl download started this may take some time.
[OK!]


['O',
 'B-Injury_or_Poisoning',
 'B-Direction',
 'B-Test',
 'I-Route',
 'B-Admission_Discharge',
 'B-Death_Entity',
 'I-Oxygen_Therapy',
 'B-Relationship_Status',
 'I-Drug_BrandName',
 'B-Duration',
 'I-Alcohol',
 'I-Triglycerides',
 'I-Date',
 'B-Hyperlipidemia',
 'B-Respiration',
 'I-Test',
 'B-Birth_Entity',
 'I-VS_Finding',
 'B-Age',
 'I-Vaccine_Name',
 'I-Social_History_Header',
 'B-Labour_Delivery',
 'I-Medical_Device',
 'B-Family_History_Header',
 'B-BMI',
 'I-Fetus_NewBorn',
 'I-BMI',
 'B-Temperature',
 'I-Section_Header',
 'I-Communicable_Disease',
 'I-ImagingFindings',
 'I-Psychological_Condition',
 'I-Obesity',
 'I-Sexually_Active_or_Sexual_Orientation',
 'I-Modifier',
 'B-Alcohol',
 'I-Temperature',
 'I-Vaccine',
 'I-Symptom',
 'I-Pulse',
 'B-Kidney_Disease',
 'B-Oncological',
 'I-EKG_Findings',
 'B-Medical_History_Header',
 'I-Relationship_Status',
 'B-Cerebrovascular_Disease',
 'I-Blood_Pressure',
 'I-Diabetes',
 'B-Oxygen_Therapy',
 'B-O2_Saturation',
 'B-Psychological_C

**Now train a model using this model as the base**

In [None]:
nerTagger = medical.NerApproach()\
      .setInputCols(["sentence", "token", "embeddings"])\
      .setLabelColumn("label")\
      .setOutputCol("ner")\
      .setMaxEpochs(2)\
      .setLr(0.003)\
      .setBatchSize(8)\
      .setRandomSeed(0)\
      .setVerbose(1)\
      .setEvaluationLogExtended(True) \
      .setEnableOutputLogs(True)\
      .setIncludeConfidence(True)\
      .setTestDataset('/content/test_2.parquet')\
      .setOutputLogsPath('ner_logs')\
      .setGraphFolder(graph_folder_path)\
      .setPretrainedModelPath("/root/cache_pretrained/ner_jsl_en_4.2.0_3.0_1666181370373")\
      .setOverrideExistingTags(True) # since the tags do not align, set this flag to true

# tune the hyperparameters above (max epochs, LR, dropout, etc.) to get better results
ner_pipeline = nlp.Pipeline(stages=[
      clinical_embeddings,
      ner_graph_builder,
      nerTagger
 ])
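
Whether `setOverrideExistingTags(True)` is needed depends on whether the dataset labels are covered by the classes of the pretrained model. A quick way to check is to compare entity types after stripping the `B-`/`I-` prefixes. The sketch below uses only the first few classes visible in the truncated `getClasses()` output above, so the comparison is illustrative rather than complete. `entity_types` is a hypothetical helper.

```python
def entity_types(labels):
    """Strip the B-/I- prefix and drop 'O' to get the set of entity types."""
    return {label.split("-", 1)[1] for label in labels if label != "O"}

# A few classes copied from the (truncated) getClasses() output above
pretrained_labels = ["O", "B-Injury_or_Poisoning", "B-Direction", "B-Test",
                     "I-Route", "I-Test", "B-Oncological", "I-Symptom"]

# Labels of the NCBI disease dataset used for training
dataset_labels = ["O", "B-Disease", "I-Disease"]

missing = entity_types(dataset_labels) - entity_types(pretrained_labels)
print(missing)
```

In practice, `pretrained_labels` would be the full list returned by `jsl_ner.getClasses()`. A non-empty `missing` set means the tags do not align.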

In [None]:
# remove the existing logs (-f avoids an error if the folder does not exist yet)

! rm -rf ./ner_logs

In [None]:
%%time
ner_jsl_retrained = ner_pipeline.fit(training_data)

TF Graph Builder configuration:
Model name: ner_dl
Graph folder: medical_ner_graphs
Graph file name: auto
Build params: {'ntags': 3, 'embeddings_dim': 200, 'nchars': 85, 'is_medical': True, 'lstm_size': 24}
ner_dl graph exported to medical_ner_graphs/blstm_3_200_24_85.pb
CPU times: user 11.2 s, sys: 266 ms, total: 11.4 s
Wall time: 5min 11s


In [None]:
!cat ./ner_logs/MedicalNerApproach*

Name of the selected graph: pretrained graph
Training started - total epochs: 2 - lr: 0.003 - batch size: 8 - labels: 3 - chars: 103 - training examples: 6347


Epoch 1/2 started, lr: 0.003, dataset size: 6347


Epoch 1/2 - 136.21s - loss: 1420.0325 - avg training loss: 1.7862043 - batches: 795
Quality on test dataset: 
time to finish evaluation: 9.56s
Total test loss: 77.1888	Avg test loss: 1.2061
label	 tp	 fp	 fn	 prec	 rec	 f1
I-Disease	 489	 51	 141	 0.90555555	 0.77619046	 0.83589745
B-Disease	 475	 91	 56	 0.8392226	 0.8945386	 0.86599815
tp: 964 fp: 142 fn: 197 labels: 2
Macro-average	 prec: 0.8723891, rec: 0.8353645, f1: 0.8534754
Micro-average	 prec: 0.8716094, rec: 0.8303187, f1: 0.85046315


Epoch 2/2 started, lr: 0.0029850747, dataset size: 6347


Epoch 2/2 - 131.42s - loss: 655.38513 - avg training loss: 0.8243838 - batches: 795
Quality on test dataset: 
time to finish evaluation: 8.30s
Total test loss: 61.2552	Avg test loss: 0.9571
label	 tp	 fp	 fn	 prec	 rec	 f1
I-Dise

**Evaluating**

In [None]:
#from sparknlp_jsl.eval import NerDLMetrics

pred_df = ner_jsl_retrained.stages[2].transform(clinical_embeddings.transform(test_data_2))

evaler = medical.NerDLMetrics(mode="full_chunk")

eval_result = evaler.computeMetricsFromDF(pred_df.select("label","ner"), prediction_col="ner", label_col="label", drop_o = True, case_sensitive = True).cache()

eval_result.withColumn("precision", F.round(eval_result["precision"],4))\
    .withColumn("recall", F.round(eval_result["recall"],4))\
    .withColumn("f1", F.round(eval_result["f1"],4)).show(100)

print(eval_result.selectExpr("avg(f1) as macro").show())
print (eval_result.selectExpr("sum(f1*total) as sumprod","sum(total) as sumtotal").selectExpr("sumprod/sumtotal as micro").show())

+-------+-----+----+-----+-----+---------+------+------+
| entity|   tp|  fp|   fn|total|precision|recall|    f1|
+-------+-----+----+-----+-----+---------+------+------+
|Disease|423.0|77.0|104.0|527.0|    0.846|0.8027|0.8238|
+-------+-----+----+-----+-----+---------+------+------+

+------------------+
|             macro|
+------------------+
|0.8237585199610516|
+------------------+

None
+------------------+
|             micro|
+------------------+
|0.8237585199610516|
+------------------+

None
