![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/10.4.Resume_RelationExtractionApproach_Training.ipynb)

# 10.4 Resume RelationExtractionApproach Model Training

Steps:
- Train a new model for a few epochs.
- Load the same model and train for more epochs on the same taxnonomy, and evaluate.
- Train a model already trained on a different dataset.

## Colab Setup

In [None]:
import json
import os

from google.colab import files

if 'spark_jsl.json' not in os.listdir():
  license_keys = files.upload()
  os.rename(list(license_keys.keys())[0], 'spark_jsl.json')

with open('spark_jsl.json') as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)
os.environ.update(license_keys)

In [None]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.1.2 spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

# Installing Spark NLP Display Library for visualization
! pip install -q spark-nlp-display

In [3]:
import json
import os

import sparknlp
import sparknlp_jsl

from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *

from pyspark.ml import Pipeline,PipelineModel
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
import pyspark.sql.types as T

import warnings
warnings.filterwarnings('ignore')

params = {"spark.driver.memory":"16G", # Amount of memory to use for the driver process, i.e. where SparkContext is initialized
          "spark.kryoserializer.buffer.max":"2000M", # Maximum allowable size of Kryo serialization buffer, in MiB unless otherwise specified. 
          "spark.driver.maxResultSize":"2000M"} # Limit of total size of serialized results of all partitions for each Spark action (e.g. collect) in bytes. 
                                                # Should be at least 1M, or 0 for unlimited. 

spark = sparknlp_jsl.start(license_keys['SECRET'],params=params)

print ("Spark NLP Version :", sparknlp.version())
print ("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark

Spark NLP Version : 4.3.2
Spark NLP_JSL Version : 4.3.2


## Download Data for Training (NCBI Disease Dataset)

In [4]:
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/i2b2_clinical_rel_dataset.csv

In [5]:
data = spark.read.option("header","true").format("csv").load("i2b2_clinical_rel_dataset.csv")

data = data.select( 'sentence','firstCharEnt1','firstCharEnt2','lastCharEnt1','lastCharEnt2', "chunk1", "chunk2", "label1", "label2",'rel','dataset')

data.show(10)

# you only need these columns>> 'sentence','firstCharEnt1','firstCharEnt2','lastCharEnt1','lastCharEnt2', "chunk1", "chunk2", "label1", "label2",'rel'
# ('dataset' column is optional)

+--------------------+-------------+-------------+------------+------------+--------------------+--------------------+---------+---------+-----+-------+
|            sentence|firstCharEnt1|firstCharEnt2|lastCharEnt1|lastCharEnt2|              chunk1|              chunk2|   label1|   label2|  rel|dataset|
+--------------------+-------------+-------------+------------+------------+--------------------+--------------------+---------+---------+-----+-------+
|VITAL SIGNS - Tem...|           49|           75|          64|          84|    respiratory rate|          saturation|     test|     test|    O|   test|
|No lotions , crea...|            3|           34|           9|          42|             lotions|           incisions|treatment|  problem|TrNAP|   test|
|Because of expect...|           11|           58|          54|          68|expected long ter...|         a picc line|treatment|treatment|    O|  train|
|She states this l...|           16|           82|          31|          92|    li

In [6]:
data.groupby('dataset').count().show()

+-------+-----+
|dataset|count|
+-------+-----+
|  train|  350|
|   test|  650|
+-------+-----+



In [7]:
data.groupby('rel').count().show()

+-----+-----+
|  rel|count|
+-----+-----+
| TrIP|   14|
| TrAP|  164|
| TeCP|   26|
|    O|  414|
|TrNAP|   14|
| TrCP|   28|
|  PIP|  153|
| TrWP|   11|
| TeRP|  176|
+-----+-----+



In [8]:
from sparknlp_jsl.training import REDatasetHelper

# map entity columns to dataset columns
column_map = {
    "begin1": "firstCharEnt1",
    "end1": "lastCharEnt1",
    "begin2": "firstCharEnt2",
    "end2": "lastCharEnt2",
    "chunk1": "chunk1",
    "chunk2": "chunk2",
    "label1": "label1",
    "label2": "label2"
}

# apply preprocess function to dataframe
data = REDatasetHelper(data).create_annotation_column(
    column_map,
    ner_column_name="train_ner_chunks" # optional, default train_ner_chunks
)

In [9]:
# add relation direction

@F.udf(T.StringType())
def encodeRelationDirection(rel, begin1, begin2):
    if rel != "O":
        if begin1 > begin2:
            return "leftwards"
        else:
            return "rightwards"
    else:
        return "both"



data = data.withColumn("rel_dir", encodeRelationDirection("rel", "firstCharEnt1", "lastCharEnt2"))

train_data = data.where("dataset='train'")
test_data = data.where("dataset='test'")

## Split the test data into two parts:
- We keep the first part separate and use it for training the model further, as it will be totally unseen data from the same taxonomy.

- The second part will be used to testing and evaluating

In [10]:
(test_data_1, test_data_2) = test_data.randomSplit([0.5, 0.5], seed = 100)

## Train a new model, pause, and resume training on the same dataset.

### Create graph 

We will use `TFGraphBuilder` annotator which can be used to create graphs automatically in the model training pipeline. 

`TFGraphBuilder` inspects the data and creates the proper graph if a suitable version of TensorFlow is available. The graph is stored in the defined folder and loaded by the approach.

You can also create a custom graph by using `tf_graph` module in Spark NLP for Healthcare.

In [None]:
!pip install -q tensorflow==2.11.0 tensorflow-addons

In [12]:
from sparknlp_jsl.annotator import TFGraphBuilder

graph_folder= "./tf_graphs"

re_graph_builder = TFGraphBuilder()\
    .setModelName("relation_extraction")\
    .setInputCols(["embeddings", "pos_tags", "train_ner_chunks", "dependencies"]) \
    .setLabelColumn("rel")\
    .setGraphFolder(graph_folder)\
    .setGraphFile("re_graph.pb")\
    .setHiddenLayers([300, 200])\
    .setHiddenAct("relu")\
    .setHiddenActL2(True)\
    .setHiddenWeightsL2(False)\
    .setBatchNorm(False)

### Train for 100 epochs

In [13]:
documenter = DocumentAssembler()\
    .setInputCol("sentence")\
    .setOutputCol("sentences")

tokenizer = Tokenizer()\
    .setInputCols(["sentences"])\
    .setOutputCol("tokens")\

words_embedder = WordEmbeddingsModel()\
    .pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("embeddings")

pos_tagger = PerceptronModel()\
    .pretrained("pos_clinical", "en", "clinical/models") \
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("pos_tags")
    
dependency_parser = DependencyParserModel()\
    .pretrained("dependency_conllu", "en")\
    .setInputCols(["sentences", "pos_tags", "tokens"])\
    .setOutputCol("dependencies")

reApproach = RelationExtractionApproach()\
    .setInputCols(["embeddings", "pos_tags", "train_ner_chunks", "dependencies"])\
    .setOutputCol("relations")\
    .setLabelColumn("rel")\
    .setEpochsNumber(100)\
    .setBatchSize(200)\
    .setDropout(0.5)\
    .setLearningRate(0.005)\
    .setModelFile(f"{graph_folder}/re_graph.pb")\
    .setFixImbalance(True)\
    .setFromEntity("firstCharEnt1", "lastCharEnt1", "label1")\
    .setToEntity("firstCharEnt2", "lastCharEnt2", "label2")\
    .setOutputLogsPath('/content')\
    .setRelationDirectionCol("rel_dir")\
    .setMaxSyntacticDistance(10)

finisher = Finisher()\
    .setInputCols(["relations"])\
    .setOutputCols(["relations_out"])\
    .setCleanAnnotations(False)\
    .setValueSplitSymbol(",")\
    .setAnnotationSplitSymbol(",")\
    .setOutputAsArray(False)

train_pipeline = Pipeline(stages=[
    documenter, 
    tokenizer, 
    words_embedder, 
    pos_tagger, 
    dependency_parser, 
    re_graph_builder,
    reApproach, 
    finisher
])

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
pos_clinical download started this may take some time.
Approximate size to download 1.5 MB
[OK!]
dependency_conllu download started this may take some time.
Approximate size to download 16.7 MB
[OK!]


In [14]:
%%time 
rel_model = train_pipeline.fit(train_data)

TF Graph Builder configuration:
Model name: relation_extraction
Graph folder: ./tf_graphs
Graph file name: re_graph.pb
Build params: {'input_dim': 1149, 'output_dim': 27, 'hidden_layers': [300, 200], 'hidden_act': 'relu', 'hidden_act_l2': True, 'hidden_weights_l2': False, 'batch_norm': False}
relation_extraction graph exported to ./tf_graphs/re_graph.pb
CPU times: user 6.91 s, sys: 753 ms, total: 7.67 s
Wall time: 31.6 s


In [15]:
result_test = rel_model.transform(test_data_2)

### Evaluate

In [16]:
result_test_df = result_test.select('rel', 'relations.result').toPandas()
result_test_df.sample(5)

Unnamed: 0,rel,result
284,TrCP,[]
254,TrCP,[]
248,PIP,[PIP]
69,TeRP,[]
158,O,[O]


In [17]:
from sklearn.metrics import classification_report

pd_test_results = result_test_df.explode("result").fillna("O")

print(classification_report(pd_test_results["rel"], pd_test_results["result"]))

              precision    recall  f1-score   support

           O       0.56      0.89      0.69       133
         PIP       0.63      0.55      0.59        40
        TeCP       0.00      0.00      0.00         7
        TeRP       0.89      0.16      0.27        51
        TrAP       0.69      0.62      0.65        58
        TrCP       0.00      0.00      0.00        12
        TrIP       0.00      0.00      0.00         6
       TrNAP       0.00      0.00      0.00         4
        TrWP       0.00      0.00      0.00         3

    accuracy                           0.59       314
   macro avg       0.31      0.25      0.24       314
weighted avg       0.59      0.59      0.53       314



### Save to disk

In [18]:
rel_model.stages[-2].write().overwrite().save('RE_model_30e')

### Train using the saved model on unseen dataset

We use unseen data from the same taxonomy

In [19]:
reApproach_finetune = RelationExtractionApproach()\
    .setInputCols(["embeddings", "pos_tags", "train_ner_chunks", "dependencies"])\
    .setOutputCol("relations")\
    .setLabelColumn("rel")\
    .setEpochsNumber(100)\
    .setBatchSize(200)\
    .setDropout(0.5)\
    .setLearningRate(0.005)\
    .setFixImbalance(True)\
    .setFromEntity("firstCharEnt1", "lastCharEnt1", "label1")\
    .setToEntity("firstCharEnt2", "lastCharEnt2", "label2")\
    .setRelationDirectionCol("rel_dir")\
    .setPretrainedModelPath("RE_model_30e")\
    .setОverrideExistingLabels(False)

finetune_pipeline = Pipeline(stages=[
    documenter, 
    tokenizer, 
    words_embedder, 
    pos_tagger, 
    dependency_parser, 
    re_graph_builder,
    reApproach_finetune, 
    finisher
])

In [20]:
%%time 
rel_model = finetune_pipeline.fit(test_data_1)

TF Graph Builder configuration:
Model name: relation_extraction
Graph folder: ./tf_graphs
Graph file name: re_graph.pb
Build params: {'input_dim': 1149, 'output_dim': 27, 'hidden_layers': [300, 200], 'hidden_act': 'relu', 'hidden_act_l2': True, 'hidden_weights_l2': False, 'batch_norm': False}
relation_extraction graph exported to ./tf_graphs/re_graph.pb
CPU times: user 929 ms, sys: 61.9 ms, total: 991 ms
Wall time: 12.2 s


In [21]:
result = rel_model.transform(test_data_2)

### Evaluate

In [22]:
result_test_df = result.select('rel', 'relations.result').toPandas()
result_test_df.sample(5)

Unnamed: 0,rel,result
216,TrAP,[TrAP]
174,TrAP,[TrAP]
75,O,[O]
276,TeRP,[TeRP]
12,O,[O]


In [23]:
from sklearn.metrics import classification_report

pd_test_results = result_test_df.explode("result").fillna("O")

print(classification_report(pd_test_results["rel"], pd_test_results["result"]))

              precision    recall  f1-score   support

           O       0.68      0.92      0.79       133
         PIP       0.76      0.40      0.52        40
        TeCP       0.00      0.00      0.00         7
        TeRP       0.83      0.76      0.80        51
        TrAP       0.79      0.71      0.75        58
        TrCP       0.12      0.08      0.10        12
        TrIP       0.67      0.33      0.44         6
       TrNAP       0.00      0.00      0.00         4
        TrWP       0.00      0.00      0.00         3

    accuracy                           0.71       314
   macro avg       0.43      0.36      0.38       314
weighted avg       0.68      0.71      0.68       314



### Save to disk

In [24]:
rel_model.stages[-2].write().overwrite().save('RE_model_finetuned')

## Now let's take a model trained on a different dataset (pretrained pipeline) and train on this dataset

In [25]:
clinical_re_Model = RelationExtractionModel()\
    .pretrained("re_clinical", "en", 'clinical/models')\
    .setInputCols(["embeddings", "pos_tags", "ner_chunks", "dependencies"])\
    .setOutputCol("relations")

clinical_re_Model.getClasses()

re_clinical download started this may take some time.
Approximate size to download 6 MB
[OK!]


['TrWP', 'TrNAP', 'TrCP', 'PIP', 'TeCP', 'TeRP', 'TrIP', 'TrAP', 'O']

### Now train a model using this model as base

In [26]:
reApproach_finetune = RelationExtractionApproach()\
    .setInputCols(["embeddings", "pos_tags", "train_ner_chunks", "dependencies"])\
    .setOutputCol("relations")\
    .setLabelColumn("rel")\
    .setEpochsNumber(200)\
    .setBatchSize(200)\
    .setDropout(0.5)\
    .setLearningRate(0.001)\
    .setFixImbalance(True)\
    .setFromEntity("firstCharEnt1", "lastCharEnt1", "label1")\
    .setToEntity("firstCharEnt2", "lastCharEnt2", "label2")\
    .setPretrainedModelPath("/root/cache_pretrained/re_clinical_en_2.5.5_2.4_1596928426753")\
    .setОverrideExistingLabels(False)

finetune_pipeline = Pipeline(stages=[
    documenter, 
    tokenizer, 
    words_embedder, 
    pos_tagger, 
    dependency_parser, 
    re_graph_builder,
    reApproach_finetune, 
    finisher
])

In [27]:
%%time
rel_model = finetune_pipeline.fit(train_data)

TF Graph Builder configuration:
Model name: relation_extraction
Graph folder: ./tf_graphs
Graph file name: re_graph.pb
Build params: {'input_dim': 1149, 'output_dim': 27, 'hidden_layers': [300, 200], 'hidden_act': 'relu', 'hidden_act_l2': True, 'hidden_weights_l2': False, 'batch_norm': False}
relation_extraction graph exported to ./tf_graphs/re_graph.pb
CPU times: user 905 ms, sys: 44.6 ms, total: 950 ms
Wall time: 10.7 s


In [28]:
result = rel_model.transform(test_data_2)

### Evaluate

In [29]:
result_test_df = result.select('rel', 'relations.result').toPandas()
result_test_df.sample(5)

Unnamed: 0,rel,result
106,O,[TrAP]
72,TrAP,[O]
4,TeRP,[PIP]
163,TeRP,[TeCP]
194,O,[O]


In [30]:
from sklearn.metrics import classification_report

pd_test_results = result_test_df.explode("result").fillna("O")

print(classification_report(pd_test_results["rel"], pd_test_results["result"]))

              precision    recall  f1-score   support

           O       0.63      0.78      0.70       133
         PIP       0.48      0.75      0.58        40
        TeCP       0.00      0.00      0.00         7
        TeRP       0.79      0.67      0.72        51
        TrAP       0.68      0.43      0.53        58
        TrCP       0.00      0.00      0.00        12
        TrIP       0.00      0.00      0.00         6
       TrNAP       0.50      0.25      0.33         4
        TrWP       0.00      0.00      0.00         3

    accuracy                           0.62       314
   macro avg       0.34      0.32      0.32       314
weighted avg       0.59      0.62      0.59       314



### Save to disk

In [31]:
rel_model.stages[-2].write().overwrite().save('RE_pretrained_model_finetuned')