![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/Spark_NLP_Udemy_MOOC/Healthcare_NLP/RelationExtractionApproach.ipynb)

# **RelationExtractionApproach**

In this notebook, we will examine the `RelationExtractionApproach` annotator, which is used to train a `RelationExtractionModel`.


**📖 Learning Objectives:**

1. Learn how to preprocess your data before training a Relation Extraction model.

2. Understand how to train a Relation Extraction model using `RelationExtractionApproach`.

3. Understand how to resume Relation Extraction model training.




**🔗 Helpful Links:**

- Documentation: [RelationExtractionApproach](https://nlp.johnsnowlabs.com/docs/en/licensed_annotators#relationextraction)

- For extended examples of usage, see the [Spark NLP Workshop](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/03.0.Clinical_Relation_Extraction.ipynb)

- Python Documentation: [RelationExtractionApproach](https://nlp.johnsnowlabs.com/licensed/api/python/reference/autosummary/sparknlp_jsl/annotator/re/relation_extraction/index.html#sparknlp_jsl.annotator.re.relation_extraction.RelationExtractionApproach.name)

- Scala Documentation: [RelationExtractionApproach](https://nlp.johnsnowlabs.com/licensed/api/com/johnsnowlabs/nlp/annotators/re/RelationExtractionApproach.html) <br/>



## **📜 Background**


`RelationExtractionApproach` is used for training a TensorFlow model for relation extraction. <br/>

To train a custom relation extraction model, you first need to create a TensorFlow graph using either the `TFGraphBuilder` annotator or the `tf_graph` module. Then, set the path to the TensorFlow graph using the method `.setModelFile("path/to/tensorflow_graph.pb")`.

If the parameter `relationDirectionCol` is set, the model will be trained using the direction information (see the parameter description for details). Otherwise, the trained model will not distinguish the direction of the relation between the entities.

After training a model (using the `.fit()` method), the resulting object is of class `RelationExtractionModel`.








## **🎬 Colab Setup**

In [1]:
!pip install -q johnsnowlabs


In [2]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

Please Upload your John Snow Labs License using the button below


Saving 5.3.3.spark_nlp_for_healthcare.json to 5.3.3.spark_nlp_for_healthcare.json


In [3]:
from johnsnowlabs import nlp, medical

# After uploading your license, run this to install all licensed Python wheels and pre-download the JARs for the Spark session JVM
nlp.install()

👌 Detected license file /content/5.3.3.spark_nlp_for_healthcare.json
🚨 Outdated Medical Secrets in license file. Version=5.3.3 but should be Version=5.3.2
🚨 Outdated OCR Secrets in license file. Version=5.1.2 but should be Version=5.3.2
👷 Trying to install compatible secrets. Use nlp.settings.enforce_versions=False if you want to install outdated secrets.
📋 Stored John Snow Labs License in /root/.johnsnowlabs/licenses/license_number_0_for_Spark-Healthcare_Spark-OCR.json
👷 Setting up  John Snow Labs home in /root/.johnsnowlabs, this might take a few minutes.
Downloading 🐍+🚀 Python Library spark_nlp-5.3.2-py2.py3-none-any.whl
Downloading 🐍+💊 Python Library spark_nlp_jsl-5.3.2-py3-none-any.whl
Downloading 🫘+🚀 Java Library spark-nlp-assembly-5.3.2.jar
Downloading 🫘+💊 Java Library spark-nlp-jsl-5.3.2.jar
🙆 JSL Home setup in /root/.johnsnowlabs
👌 Detected license file /content/5.3.3.spark_nlp_for_healthcare.json
👷 Trying to install compatible secrets. Use nlp.settings.enforce_versions=False 

In [4]:
import pyspark.sql.functions as F
import pyspark.sql.types as T
import pandas as pd

spark = nlp.start()
spark

👌 Detected license file /content/5.3.3.spark_nlp_for_healthcare.json
👷 Trying to install compatible secrets. Use nlp.settings.enforce_versions=False if you want to install outdated secrets.
👌 Launched cpu optimized session with with: 🚀Spark-NLP==5.3.2, 💊Spark-Healthcare==5.3.2, running on ⚡ PySpark==3.4.0


## **🖨️ Input/Output Annotation Types**

- Input: `WORD_EMBEDDINGS`, `POS`, `CHUNK`, `DEPENDENCY`

- Output: `CATEGORY`

## **📂 Training Data**

Downloading sample data:

In [5]:
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/data/i2b2_clinical_rel_dataset.csv

In [6]:
data = spark.read.option("header","true").format("csv").load("i2b2_clinical_rel_dataset.csv")

data = data.select( 'sentence','firstCharEnt1','firstCharEnt2','lastCharEnt1','lastCharEnt2', "chunk1", "chunk2", "label1", "label2",'rel','dataset')

data.show(10)

# you only need these columns>> 'sentence','firstCharEnt1','firstCharEnt2','lastCharEnt1','lastCharEnt2', "chunk1", "chunk2", "label1", "label2",'rel'
# ('dataset' column is optional)

+--------------------+-------------+-------------+------------+------------+--------------------+--------------------+---------+---------+-----+-------+
|            sentence|firstCharEnt1|firstCharEnt2|lastCharEnt1|lastCharEnt2|              chunk1|              chunk2|   label1|   label2|  rel|dataset|
+--------------------+-------------+-------------+------------+------------+--------------------+--------------------+---------+---------+-----+-------+
|VITAL SIGNS - Tem...|           49|           75|          64|          84|    respiratory rate|          saturation|     test|     test|    O|   test|
|No lotions , crea...|            3|           34|           9|          42|             lotions|           incisions|treatment|  problem|TrNAP|   test|
|Because of expect...|           11|           58|          54|          68|expected long ter...|         a picc line|treatment|treatment|    O|  train|
|She states this l...|           16|           82|          31|          92|    li
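
Before training, it can help to confirm that the character offsets really point at the chunk text. Judging from the sample rows above (`lotions` spans 3 to 9, `respiratory rate` spans 49 to 64), the last-character offset appears to be inclusive. The sketch below checks that assumption on a single row in plain Python; the row itself is hypothetical and written only for illustration, and `offsets_match` is not part of any library.

```python
# Hypothetical row written for illustration; it is NOT taken from the dataset.
row = {
    "sentence": "Patient was given aspirin for chest pain .",
    "firstCharEnt1": 18, "lastCharEnt1": 24, "chunk1": "aspirin",
    "firstCharEnt2": 30, "lastCharEnt2": 39, "chunk2": "chest pain",
}


def offsets_match(row, idx):
    """Return True if the inclusive [first, last] span of entity `idx` equals its chunk text."""
    # int() also accepts numeric strings, which is what a CSV read without a schema yields.
    begin = int(row[f"firstCharEnt{idx}"])
    end = int(row[f"lastCharEnt{idx}"])
    # The end offset is treated as inclusive (an assumption based on the sample rows).
    return row["sentence"][begin:end + 1] == row[f"chunk{idx}"]


checks = [offsets_match(row, 1), offsets_match(row, 2)]
print(checks)
```

Rows that fail such a check usually indicate shifted offsets or text that was altered after annotation, and are worth inspecting before they reach the training pipeline.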

In [7]:
data.groupby('dataset').count().show()

+-------+-----+
|dataset|count|
+-------+-----+
|  train|  350|
|   test|  650|
+-------+-----+



In [8]:
data.groupby('rel').count().show()

+-----+-----+
|  rel|count|
+-----+-----+
| TrIP|   14|
| TrAP|  164|
| TeCP|   26|
|    O|  414|
|TrNAP|   14|
| TrCP|   28|
|  PIP|  153|
| TrWP|   11|
| TeRP|  176|
+-----+-----+



In [9]:
#from sparknlp_jsl.training import REDatasetHelper

# map entity columns to dataset columns
column_map = {
    "begin1": "firstCharEnt1",
    "end1": "lastCharEnt1",
    "begin2": "firstCharEnt2",
    "end2": "lastCharEnt2",
    "chunk1": "chunk1",
    "chunk2": "chunk2",
    "label1": "label1",
    "label2": "label2"
}

# apply preprocess function to dataframe
data = medical.REDatasetHelper(data).create_annotation_column(
    column_map,
    ner_column_name="train_ner_chunks" # optional, default train_ner_chunks
)

train_data = data.where("dataset='train'")
test_data = data.where("dataset='test'")

In [10]:
data.show(10)

+------------+-------------+------------+-------------+---------+--------------------+-------+--------------------+--------------------+--------------------+---------+-----+
|lastCharEnt2|firstCharEnt2|lastCharEnt1|firstCharEnt1|   label1|            sentence|dataset|    train_ner_chunks|              chunk1|              chunk2|   label2|  rel|
+------------+-------------+------------+-------------+---------+--------------------+-------+--------------------+--------------------+--------------------+---------+-----+
|          84|           75|          64|           49|     test|VITAL SIGNS - Tem...|   test|[{chunk, 49, 64, ...|    respiratory rate|          saturation|     test|    O|
|          42|           34|           9|            3|treatment|No lotions , crea...|   test|[{chunk, 3, 9, lo...|             lotions|           incisions|  problem|TrNAP|
|          68|           58|          54|           11|treatment|Because of expect...|  train|[{chunk, 11, 54, ...|expected long t
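
The `train_ner_chunks` column above holds one chunk annotation per entity. The following is a rough plain-Python sketch of the kind of mapping `create_annotation_column` appears to perform, inferred only from the visible output (`{chunk, 3, 9, lo...`). The function name `to_chunk_annotations`, the dictionary field names, and the `metadata` layout are assumptions made for illustration, not the library's documented schema.

```python
# Same mapping as in the cell above: logical names -> dataset column names.
column_map = {
    "begin1": "firstCharEnt1", "end1": "lastCharEnt1",
    "begin2": "firstCharEnt2", "end2": "lastCharEnt2",
    "chunk1": "chunk1", "chunk2": "chunk2",
    "label1": "label1", "label2": "label2",
}

# Values copied from the second sample row shown earlier (offsets kept as CSV strings).
row = {
    "firstCharEnt1": "3", "lastCharEnt1": "9", "chunk1": "lotions", "label1": "treatment",
    "firstCharEnt2": "34", "lastCharEnt2": "42", "chunk2": "incisions", "label2": "problem",
}


def to_chunk_annotations(row, column_map):
    """Build one chunk-like record per entity (hypothetical structure)."""
    annotations = []
    for idx in ("1", "2"):
        annotations.append({
            "annotatorType": "chunk",
            "begin": int(row[column_map["begin" + idx]]),
            "end": int(row[column_map["end" + idx]]),
            "result": row[column_map["chunk" + idx]],
            # Assumed place for the entity label; the real metadata may differ.
            "metadata": {"entity": row[column_map["label" + idx]]},
        })
    return annotations


annotations = to_chunk_annotations(row, column_map)
for annotation in annotations:
    print(annotation)
```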

## **✅ Graph Creation**

We will use the `TFGraphBuilder` annotator, which creates graphs automatically in the model training pipeline.

`TFGraphBuilder` inspects the data and creates the proper graph if a suitable version of TensorFlow is available. The graph is stored in the defined folder and loaded by the approach.

You can also create a custom graph by using the `tf_graph` module in Spark NLP for Healthcare.

In [11]:
!pip install -q tensorflow==2.11.0 tensorflow-addons


In [12]:
'''
# custom graph

medical.tf_graph.print_model_params("relation_extraction")

medical.tf_graph.build("relation_extraction",
                       build_params={"input_dim": 10000,
                                     "output_dim": 10,
                                     "batch_norm": 1,
                                     "hidden_layers": [300, 200],
                                     "hidden_act": "relu",
                                     "hidden_act_l2": 1},
                       model_location=".",
                       model_filename="re_in6000D_out10.pb")
'''

'''
# ready to use graph
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Healthcare/generic_classifier_graph/RE_in1200D_out20.pb
'''


In [13]:
graph_folder= "./tf_graphs"

In [14]:
re_graph_builder = medical.TFGraphBuilder()\
    .setModelName("relation_extraction")\
    .setInputCols(["embeddings", "pos_tags", "train_ner_chunks", "dependencies"]) \
    .setLabelColumn("rel")\
    .setGraphFolder(graph_folder)\
    .setGraphFile("re_graph.pb")\
    .setHiddenLayers([300, 200])\
    .setHiddenAct("relu")\
    .setHiddenActL2(True)\
    .setHiddenWeightsL2(False)\
    .setBatchNorm(False)

## **🔎 Parameters**


- `FromEntity`: (begin_col: str, end_col: str, label_col: str) Sets the "from" entity

> `begin_col`: Column holding the character offset where the chunk begins

> `end_col`: Column holding the character offset where the chunk ends

> `label_col`: Column holding the entity type of the chunk

- `ToEntity`: (begin_col: str, end_col: str, label_col: str) Sets the "to" entity

> `begin_col`: Column holding the character offset where the chunk begins

> `end_col`: Column holding the character offset where the chunk ends

> `label_col`: Column holding the entity type of the chunk

- `CustomLabels`: (labels: dict[str, str]) Sets custom relation labels

> `labels`: Dictionary which maps old to new labels

- `RelationDirectionCol`: (col: str) Relation direction column (possible values are: "none", "left" or "right"). If this parameter is not set, the model will not have direction between the relation of the entities

> `col`: Column containing the relation direction values

- `PretrainedModelPath`: (value: str) Path to an already trained model saved to disk, which is used as a starting point for training the new model

- `ОverrideExistingLabels`: (Boolean) Whether to override already learned labels when using a pretrained model to initialize the new model. A value of true overrides the existing labels. Default is `true`

- `BatchSize`: (Int) Size for each batch in the optimization process

- `EpochsNumber`: (Int) Maximum number of epochs to train

- `Dropout`: (Float) Dropout at the output of each layer

- `LearningRate`: (Float) Learning rate for the optimization process

- `OutputLogsPath`: (Str) Folder path to save training logs. If no path is specified, the logs won't be stored on disk. The path can be a local file path, a distributed file path (HDFS, DBFS), or cloud storage (S3).

- `ModelFile`: (Str) The path to the TensorFlow graph file used by the model

- `FixImbalance`: (Boolean) Fix the imbalance in the training set by replicating examples of underrepresented categories

- `ValidationSplit`: (Float) The proportion of the training dataset to be used as the validation set

- `MultiClass`: (Boolean) If multiClass is set, the model will return all the labels with corresponding scores. By default, multiClass is false.

- `MaxSyntacticDistance` (Int) Maximal syntactic distance, as threshold (Default: 0)
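
The `FixImbalance` description mentions replicating examples of underrepresented categories. How the annotator does this internally is not stated here, so the snippet below is only a sketch of one common scheme: replicate random examples of each minority class until every class matches the largest one. The function `oversample_by_replication` and the toy rows are hypothetical.

```python
import random
from collections import Counter, defaultdict


def oversample_by_replication(rows, label_key="rel", seed=0):
    """Replicate examples of smaller classes until all classes have equal size.

    Illustration only: this is not the implementation used by the annotator.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Draw the missing examples with replacement from the same class.
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


# Toy data: 4 "O", 2 "TrAP", 1 "TrWP" examples.
rows = [{"rel": "O"}] * 4 + [{"rel": "TrAP"}] * 2 + [{"rel": "TrWP"}]
balanced = oversample_by_replication(rows)
counts = Counter(row["rel"] for row in balanced)
print(counts)
```

Replication leaves the information content of rare classes unchanged, so it cannot substitute for more labeled examples of those classes.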














## **🦾 Model Training**

Now, we will define `RelationExtractionApproach` and generate a pipeline along with required annotators and TFGraph.

In [15]:
documenter = nlp.DocumentAssembler()\
    .setInputCol("sentence")\
    .setOutputCol("sentences")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentences"])\
    .setOutputCol("tokens")

words_embedder = nlp.WordEmbeddingsModel()\
    .pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("embeddings")

pos_tagger = nlp.PerceptronModel()\
    .pretrained("pos_clinical", "en", "clinical/models") \
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("pos_tags")

dependency_parser = nlp.DependencyParserModel()\
    .pretrained("dependency_conllu", "en")\
    .setInputCols(["sentences", "pos_tags", "tokens"])\
    .setOutputCol("dependencies")

# set training params and upload model graph (see ../Healthcare/8.Generic_Classifier.ipynb)
reApproach = medical.RelationExtractionApproach()\
    .setInputCols(["embeddings", "pos_tags", "train_ner_chunks", "dependencies"])\
    .setOutputCol("relations")\
    .setLabelColumn("rel")\
    .setEpochsNumber(80)\
    .setBatchSize(200)\
    .setDropout(0.5)\
    .setLearningRate(0.001)\
    .setModelFile(f"{graph_folder}/re_graph.pb")\
    .setFixImbalance(True)\
    .setFromEntity("firstCharEnt1", "lastCharEnt1", "label1")\
    .setToEntity("firstCharEnt2", "lastCharEnt2", "label2")\
    .setOutputLogsPath('/content')

finisher = nlp.Finisher()\
    .setInputCols(["relations"])\
    .setOutputCols(["relations_out"])\
    .setCleanAnnotations(False)\
    .setValueSplitSymbol(",")\
    .setAnnotationSplitSymbol(",")\
    .setOutputAsArray(False)

train_pipeline = nlp.Pipeline(stages=[
    documenter,
    tokenizer,
    words_embedder,
    pos_tagger,
    dependency_parser,
    re_graph_builder,
    reApproach,
    finisher
])

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
pos_clinical download started this may take some time.
Approximate size to download 1.5 MB
[OK!]
dependency_conllu download started this may take some time.
Approximate size to download 16.7 MB
[OK!]


In [16]:
%%time
rel_model = train_pipeline.fit(train_data)

TF Graph Builder configuration:
Model name: relation_extraction
Graph folder: ./tf_graphs
Graph file name: re_graph.pb
Build params: {'input_dim': 1149, 'output_dim': 27, 'hidden_layers': [300, 200], 'hidden_act': 'relu', 'hidden_act_l2': True, 'hidden_weights_l2': False, 'batch_norm': False}
relation_extraction graph exported to ./tf_graphs/re_graph.pb
CPU times: user 3.32 s, sys: 277 ms, total: 3.6 s
Wall time: 14.2 s


In [17]:
rel_model.stages

[DocumentAssembler_96c1787bdf2a,
 REGEX_TOKENIZER_9bfd3037b899,
 WORD_EMBEDDINGS_MODEL_9004b1d00302,
 POS_6f55785005bf,
 dependency_e7755462ba78,
 TFGraphBuilderModel_8c134760e723,
 RelationExtractionModel_c13d7ac73199,
 Finisher_5b9859831457]

In [18]:
rel_model.stages[-2]

RelationExtractionModel_c13d7ac73199

Saving the model

In [19]:
rel_model.stages[-2].write().overwrite().save('custom_RE_model')

### Evaluating The Performance of the Model

We will create a pipeline that loads our trained RE model with the `.load` method. Then we will get predictions by transforming the test set with this pipeline.

In [20]:
customReModel = medical.RelationExtractionModel.load("custom_RE_model")\
    .setInputCols(["embeddings", "pos_tags", "train_ner_chunks", "dependencies"])\
    .setOutputCol("relations_pred")\
    .setMaxSyntacticDistance(0)

test_pipeline = nlp.Pipeline(stages=[
    documenter,
    tokenizer,
    words_embedder,
    pos_tagger,
    dependency_parser,
    customReModel])

test_results = test_pipeline.fit(test_data).transform(test_data)

In [21]:
test_results.show(5)

+------------+-------------+------------+-------------+---------+--------------------+-------+--------------------+--------------------+--------------------+-------+-----+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|lastCharEnt2|firstCharEnt2|lastCharEnt1|firstCharEnt1|   label1|            sentence|dataset|    train_ner_chunks|              chunk1|              chunk2| label2|  rel|           sentences|              tokens|          embeddings|            pos_tags|        dependencies|      relations_pred|
+------------+-------------+------------+-------------+---------+--------------------+-------+--------------------+--------------------+--------------------+-------+-----+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|          84|           75|          64|           49|     test|VITAL SIGNS - Tem...|   test|[{chunk, 49,

When you check the `relations_pred` column, you can see that the model did not detect the relation for some of the chunk pairs.

Now we will get metrics using the ground truth (`rel`) and prediction (`relations_pred`) columns in the result dataframe.

In [22]:
pd_test_results = test_results.select('rel', 'relations_pred.result').toPandas()
pd_test_results.head()

|   | rel | result |
|---|-------|--------|
| 0 | O | [O] |
| 1 | TrNAP | [O] |
| 2 | PIP | [PIP] |
| 3 | TeRP | [TeRP] |
| 4 | TrAP | [TrAP] |


We will explode the `result` column and fill null values with the `O` label.

In [23]:
pd_test_results = pd_test_results.explode("result").fillna("O")
pd_test_results.result.value_counts()

result
O        329
TrAP     115
PIP      107
TeRP      95
TeCP       2
TrCP       1
TrNAP      1
Name: count, dtype: int64
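
To make the explode-and-fill step concrete, here is a plain-Python sketch of the same idea, without pandas: each predicted label becomes its own row, and a row with no prediction is given the `O` label. The helper `explode_predictions` and the sample rows are hypothetical and only mirror the behavior described above.

```python
def explode_predictions(rows, fill_label="O"):
    """Turn (gold, [pred, ...]) rows into (gold, pred) pairs.

    A row with an empty prediction list yields a single pair with `fill_label`,
    mirroring explode() followed by fillna("O").
    """
    exploded = []
    for gold, predictions in rows:
        if not predictions:
            exploded.append((gold, fill_label))
        else:
            exploded.extend((gold, pred) for pred in predictions)
    return exploded


# Hypothetical rows: the second one has no predicted relation at all.
rows = [("O", ["O"]), ("TrNAP", []), ("PIP", ["PIP", "O"])]
pairs = explode_predictions(rows)
print(pairs)
```

Note that a row with several predictions is counted once per prediction, so the number of evaluated pairs can exceed the number of input rows.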

In [24]:
pd_test_results.rel.value_counts()

rel
O        274
TeRP     116
TrAP     108
PIP       88
TrCP      20
TeCP      15
TrIP      11
TrNAP      9
TrWP       9
Name: count, dtype: int64

In [25]:
from sklearn.metrics import classification_report
print(classification_report(pd_test_results["rel"], pd_test_results["result"]))

              precision    recall  f1-score   support

           O       0.68      0.82      0.75       274
         PIP       0.55      0.67      0.61        88
        TeCP       0.50      0.07      0.12        15
        TeRP       0.78      0.64      0.70       116
        TrAP       0.59      0.63      0.61       108
        TrCP       0.00      0.00      0.00        20
        TrIP       0.00      0.00      0.00        11
       TrNAP       0.00      0.00      0.00         9
        TrWP       0.00      0.00      0.00         9

    accuracy                           0.66       650
   macro avg       0.35      0.31      0.31       650
weighted avg       0.61      0.66      0.63       650
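
The rows of zeros for the rare classes and the gap between the macro and weighted averages follow directly from how the metrics are defined. The sketch below recomputes per-class precision, recall, and F1 in plain Python on a tiny hypothetical label list; `per_class_report` is an illustrative helper, not scikit-learn's implementation, and it returns 0.0 whenever a denominator is zero.

```python
from collections import Counter


def per_class_report(y_true, y_pred):
    """Return ({label: (precision, recall, f1)}, macro_f1) for two label lists."""
    true_count = Counter(y_true)
    pred_count = Counter(y_pred)
    true_positive = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = true_positive[label]
        precision = tp / pred_count[label] if pred_count[label] else 0.0
        recall = tp / true_count[label] if true_count[label] else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[label] = (precision, recall, f1)

    # Macro average: every class counts equally, however rare it is.
    macro_f1 = sum(f1 for _, _, f1 in report.values()) / len(report)
    return report, macro_f1


# Hypothetical labels: "TrCP" is never predicted, so all its scores are 0.
y_true = ["O", "O", "O", "PIP", "PIP", "TrCP"]
y_pred = ["O", "O", "PIP", "PIP", "O", "O"]
report, macro_f1 = per_class_report(y_true, y_pred)
for label, scores in report.items():
    print(label, [round(score, 2) for score in scores])
print("macro F1:", round(macro_f1, 2))
```

A class that is never predicted contributes a full zero to the macro average, which is why the macro scores in the report sit well below the weighted ones.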



### Training by Setting the Relation Direction

The `setRelationDirectionCol` parameter points training to a separate column that specifies the direction of each relation. The column should contain one of the following values:

 - `rightwards`: The first entity in the text is also the first argument of the relation (and the second entity in the text is the second argument). In other words, the relation arguments are ordered *left to right* in the text.
 - `leftwards`: The first entity in the text is the second argument of the relation (and the second entity in the text is the first argument).
 - `both`: Order doesn't matter (relation is symmetric).
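
The three values can be read as a rule for mapping text order to argument order. The small function below spells that rule out in plain Python; `relation_arguments` is a hypothetical helper for illustration, and keeping text order for `both` is merely a convention chosen here, since the relation is symmetric.

```python
def relation_arguments(first_in_text, second_in_text, direction):
    """Return (argument_1, argument_2) of a relation given the text order and direction."""
    if direction == "rightwards":
        # Arguments appear left to right in the text.
        return first_in_text, second_in_text
    if direction == "leftwards":
        # The entity that appears later in the text is the first argument.
        return second_in_text, first_in_text
    if direction == "both":
        # Symmetric relation: order carries no meaning, so text order is kept.
        return first_in_text, second_in_text
    raise ValueError(f"unknown direction: {direction!r}")


# Hypothetical entities, listed in the order they occur in a sentence.
print(relation_arguments("his pain", "po pain medications", "leftwards"))
print(relation_arguments("light-headedness", "diaphoresis", "rightwards"))
```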

Let's modify our example training dataset accordingly

In [26]:
data.columns

['lastCharEnt2',
 'firstCharEnt2',
 'lastCharEnt1',
 'firstCharEnt1',
 'label1',
 'sentence',
 'dataset',
 'train_ner_chunks',
 'chunk1',
 'chunk2',
 'label2',
 'rel']

In [29]:
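@F.udf(T.StringType())
def encodeRelationDirection(rel, begin1, begin2):
    # Offsets may arrive as strings when the CSV is read without a schema,
    # so cast to int to compare numerically rather than lexicographically.
    if rel != "O":
        if int(begin1) > int(begin2):
            return "leftwards"
        else:
            return "rightwards"
    else:
        return "both"



# Compare the start offsets of the two entities to decide the direction
data = data.withColumn("rel_dir", encodeRelationDirection("rel", "firstCharEnt1", "firstCharEnt2"))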

train_data = data.where("dataset='train'")
test_data = data.where("dataset='test'")

Checking the new training data with **rel_dir** column

In [30]:
train_data.select("chunk1","label1","label2","chunk2","rel","rel_dir").show(10)

+--------------------+---------+---------+--------------------+----+----------+
|              chunk1|   label1|   label2|              chunk2| rel|   rel_dir|
+--------------------+---------+---------+--------------------+----+----------+
|expected long ter...|treatment|treatment|         a picc line|   O|      both|
|    light-headedness|  problem|  problem|         diaphoresis| PIP|rightwards|
| po pain medications|treatment|  problem|            his pain|TrAP| leftwards|
|bilateral pleural...|  problem|  problem|increased work of...| PIP|rightwards|
|    her urine output|     test|  problem|           decreased|TeRP|rightwards|
|his psychiatric i...|  problem|  problem|his neurologic in...| PIP|rightwards|
|   white blood cells|     test|     test|     red blood cells|   O|      both|
|            chloride|     test|     test|                 bun|   O|      both|
|     further work-up|     test|  problem|his neurologic co...|TeCP|rightwards|
|         four liters|treatment|     tes

In [31]:
train_data\
    .selectExpr("concat(label1, \", \", label2) AS args", "rel", "rel_dir")\
    .groupBy("rel", "args", "rel_dir").count().where("count > 10").orderBy("rel").show(1000, truncate=False)

+----+--------------------+----------+-----+
|rel |args                |rel_dir   |count|
+----+--------------------+----------+-----+
|O   |test, test          |both      |59   |
|O   |treatment, treatment|both      |26   |
|O   |problem, problem    |both      |38   |
|PIP |problem, problem    |rightwards|47   |
|PIP |problem, problem    |leftwards |18   |
|TeRP|test, problem       |rightwards|57   |
|TrAP|treatment, problem  |rightwards|29   |
|TrAP|treatment, problem  |leftwards |27   |
+----+--------------------+----------+-----+



Generating a new `RelationExtractionApproach()` with `.setRelationDirectionCol("rel_dir")` and training a new model.

In [32]:
documenter = nlp.DocumentAssembler()\
    .setInputCol("sentence")\
    .setOutputCol("sentences")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentences"])\
    .setOutputCol("tokens")

words_embedder = nlp.WordEmbeddingsModel()\
    .pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("embeddings")

pos_tagger = nlp.PerceptronModel()\
    .pretrained("pos_clinical", "en", "clinical/models") \
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("pos_tags")

dependency_parser = nlp.DependencyParserModel()\
    .pretrained("dependency_conllu", "en")\
    .setInputCols(["sentences", "pos_tags", "tokens"])\
    .setOutputCol("dependencies")

reApproach = medical.RelationExtractionApproach()\
    .setInputCols(["embeddings", "pos_tags", "train_ner_chunks", "dependencies"])\
    .setOutputCol("relations")\
    .setLabelColumn("rel")\
    .setEpochsNumber(80)\
    .setBatchSize(200)\
    .setDropout(0.5)\
    .setLearningRate(0.001)\
    .setModelFile(f"{graph_folder}/re_graph.pb")\
    .setFixImbalance(True)\
    .setFromEntity("firstCharEnt1", "lastCharEnt1", "label1")\
    .setToEntity("firstCharEnt2", "lastCharEnt2", "label2")\
    .setOutputLogsPath('/rel_log')\
    .setRelationDirectionCol("rel_dir")

train_pipeline = nlp.Pipeline(stages=[
    documenter,
    tokenizer,
    words_embedder,
    pos_tagger,
    dependency_parser,
    re_graph_builder,
    reApproach,
    finisher
])

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
pos_clinical download started this may take some time.
Approximate size to download 1.5 MB
[OK!]
dependency_conllu download started this may take some time.
Approximate size to download 16.7 MB
[OK!]


In [33]:
%%time
rel_model = train_pipeline.fit(train_data)

TF Graph Builder configuration:
Model name: relation_extraction
Graph folder: ./tf_graphs
Graph file name: re_graph.pb
Build params: {'input_dim': 1149, 'output_dim': 27, 'hidden_layers': [300, 200], 'hidden_act': 'relu', 'hidden_act_l2': True, 'hidden_weights_l2': False, 'batch_norm': False}
relation_extraction graph exported to ./tf_graphs/re_graph.pb
CPU times: user 718 ms, sys: 50.9 ms, total: 769 ms
Wall time: 5.65 s


In [34]:
rel_model.stages

[DocumentAssembler_aef2345a5aa1,
 REGEX_TOKENIZER_3a5ca79b51de,
 WORD_EMBEDDINGS_MODEL_9004b1d00302,
 POS_6f55785005bf,
 dependency_e7755462ba78,
 TFGraphBuilderModel_b971cb7c6941,
 RelationExtractionModel_b9c86169cbe7,
 Finisher_5b9859831457]

Saving the model

In [35]:
rel_model.stages[-2].write().overwrite().save('rel_dir_RE_model')

**Evaluating The Performance of the Model Trained with Directions**

In [36]:
customReModelDir = medical.RelationExtractionModel.load("rel_dir_RE_model")\
    .setInputCols(["embeddings", "pos_tags", "train_ner_chunks", "dependencies"])\
    .setOutputCol("relations_pred")\
    .setMaxSyntacticDistance(0)

test_pipeline_dir = nlp.Pipeline(stages=[
    documenter,
    tokenizer,
    words_embedder,
    pos_tagger,
    dependency_parser,
    customReModelDir])

test_results_dir = test_pipeline_dir.fit(test_data).transform(test_data)

Now we will get metrics using the ground truth (`rel`) and prediction (`relations_pred`) columns in the result dataframe.

In [37]:
pd_test_results_dir = test_results_dir.select('rel', 'relations_pred.result').toPandas()
pd_test_results_dir.head()

|   | rel | result |
|---|-------|--------|
| 0 | O | [O] |
| 1 | TrNAP | [O] |
| 2 | PIP | [PIP] |
| 3 | TeRP | [TeRP] |
| 4 | TrAP | [TrAP] |


We will explode the `result` column, fill null values with the `O` label, and compute the metrics.

In [38]:
from sklearn.metrics import classification_report

pd_test_results_dir = pd_test_results_dir.explode("result").fillna("O")

print(classification_report(pd_test_results_dir["rel"], pd_test_results_dir["result"]))

              precision    recall  f1-score   support

           O       0.66      0.88      0.75       274
         PIP       0.59      0.65      0.62        88
        TeCP       0.00      0.00      0.00        15
        TeRP       0.82      0.59      0.69       116
        TrAP       0.73      0.64      0.68       108
        TrCP       0.00      0.00      0.00        20
        TrIP       0.00      0.00      0.00        11
       TrNAP       0.00      0.00      0.00         9
        TrWP       0.00      0.00      0.00         9

    accuracy                           0.67       650
   macro avg       0.31      0.31      0.31       650
weighted avg       0.63      0.67      0.64       650



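This model is marginally better than the previous one: accuracy rises from 0.66 to 0.67 and weighted F1 from 0.63 to 0.64, driven mostly by `TrAP` (F1 from 0.61 to 0.68). Macro F1 is unchanged at 0.31, and the rare classes (`TeCP`, `TrCP`, `TrIP`, `TrNAP`, `TrWP`) all score 0.00.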

### Resume RelationExtractionApproach Model Training

Steps:
- Train a new model for a few epochs.
- Load the same model and train for more epochs on the same taxonomy, and evaluate.
- Train a model already trained on a different dataset.

**Split the test data into two parts:**
- We keep the first part separate and use it for training the model further, as it will be totally unseen data from the same taxonomy.

- The second part will be used for testing and evaluation.

In [39]:
(test_data_1, test_data_2) = test_data.randomSplit([0.5, 0.5], seed = 100)
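
A weighted random split of this kind assigns each row independently, so the two halves are only approximately equal in size, and the seed makes the assignment repeatable. The plain-Python sketch below illustrates that behavior; `bernoulli_split` is a hypothetical helper and not Spark's implementation, so it will not reproduce the exact rows Spark selects for the same seed.

```python
import random


def bernoulli_split(rows, weights, seed):
    """Assign each row to one part at random, with probability proportional to `weights`."""
    rng = random.Random(seed)
    total = float(sum(weights))
    parts = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        cumulative = 0.0
        for index, weight in enumerate(weights):
            cumulative += weight / total
            if draw < cumulative:
                parts[index].append(row)
                break
        else:
            # Guard against floating-point rounding in the cumulative sum.
            parts[-1].append(row)
    return parts


rows = list(range(650))  # stand-in for the 650 test rows
part_1, part_2 = bernoulli_split(rows, [0.5, 0.5], seed=100)
print(len(part_1), len(part_2))
```

Because membership is decided per row, every row lands in exactly one part, but neither part is guaranteed to contain exactly half of the rows.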

#### Train a new model, pause, and resume training on the same dataset.

**Create a Graph**

In [40]:
#from sparknlp_jsl.annotator import TFGraphBuilder

graph_folder= "./tf_graphs"

re_graph_builder = medical.TFGraphBuilder()\
    .setModelName("relation_extraction")\
    .setInputCols(["embeddings", "pos_tags", "train_ner_chunks", "dependencies"]) \
    .setLabelColumn("rel")\
    .setGraphFolder(graph_folder)\
    .setGraphFile("re_graph.pb")\
    .setHiddenLayers([300, 200])\
    .setHiddenAct("relu")\
    .setHiddenActL2(True)\
    .setHiddenWeightsL2(False)\
    .setBatchNorm(False)

**Train for 100 epochs**

In [41]:
reApproach = medical.RelationExtractionApproach()\
    .setInputCols(["embeddings", "pos_tags", "train_ner_chunks", "dependencies"])\
    .setOutputCol("relations")\
    .setLabelColumn("rel")\
    .setEpochsNumber(100)\
    .setBatchSize(200)\
    .setDropout(0.5)\
    .setLearningRate(0.001)\
    .setModelFile(f"{graph_folder}/re_graph.pb")\
    .setFixImbalance(True)\
    .setFromEntity("firstCharEnt1", "lastCharEnt1", "label1")\
    .setToEntity("firstCharEnt2", "lastCharEnt2", "label2")\
    .setOutputLogsPath('/content')\
    .setRelationDirectionCol("rel_dir")\
    .setMaxSyntacticDistance(10)

finisher = nlp.Finisher()\
    .setInputCols(["relations"])\
    .setOutputCols(["relations_out"])\
    .setCleanAnnotations(False)\
    .setValueSplitSymbol(",")\
    .setAnnotationSplitSymbol(",")\
    .setOutputAsArray(False)

train_pipeline = nlp.Pipeline(stages=[
    documenter,
    tokenizer,
    words_embedder,
    pos_tagger,
    dependency_parser,
    re_graph_builder,
    reApproach,
    finisher
])

In [42]:
%%time
rel_model = train_pipeline.fit(train_data)

TF Graph Builder configuration:
Model name: relation_extraction
Graph folder: ./tf_graphs
Graph file name: re_graph.pb
Build params: {'input_dim': 1149, 'output_dim': 27, 'hidden_layers': [300, 200], 'hidden_act': 'relu', 'hidden_act_l2': True, 'hidden_weights_l2': False, 'batch_norm': False}
relation_extraction graph exported to ./tf_graphs/re_graph.pb
CPU times: user 709 ms, sys: 57.8 ms, total: 767 ms
Wall time: 5.25 s


In [43]:
result_test = rel_model.transform(test_data_2)

**Evaluate**

In [44]:
result_test_df = result_test.select('rel', 'relations.result').toPandas()
result_test_df.sample(5)

| | rel | result |
|---|---|---|
| 119 | O | [] |
| 43 | O | [] |
| 9 | O | [O] |
| 98 | O | [O] |
| 278 | O | [O] |


In [45]:
from sklearn.metrics import classification_report

pd_test_results = result_test_df.explode("result").fillna("O")

print(classification_report(pd_test_results["rel"], pd_test_results["result"]))

              precision    recall  f1-score   support

           O       0.56      0.93      0.70       140
         PIP       0.71      0.57      0.63        42
        TeCP       0.00      0.00      0.00         8
        TeRP       0.89      0.15      0.26        52
        TrAP       0.69      0.52      0.59        46
        TrCP       0.00      0.00      0.00        10
        TrIP       0.00      0.00      0.00         7
       TrNAP       0.00      0.00      0.00         5
        TrWP       0.00      0.00      0.00         4

    accuracy                           0.59       314
   macro avg       0.32      0.24      0.24       314
weighted avg       0.59      0.59      0.52       314
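
The evaluation cell flattens the predictions with `explode` and then calls `fillna("O")`. A minimal sketch on toy data (the three rows below are invented for illustration, not taken from the dataset) shows why this works: `explode` turns an empty prediction list into `NaN`, and `fillna` then counts that row as the "no relation" class `O`.

```python
import pandas as pd

# Toy frame mimicking the shape of result_test_df (values are illustrative only)
toy = pd.DataFrame({
    "rel": ["O", "TrAP", "PIP"],
    "result": [[], ["TrAP"], ["O"]],
})

# explode: one row per predicted label; an empty list becomes NaN
# fillna: treat "no prediction" as the "O" class
flat = toy.explode("result").fillna("O")
print(flat)
```

Rows whose prediction list holds several labels would be duplicated by `explode`, once per label, so the support counts in the report refer to exploded rows.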



**Save to disk**

In [46]:
# stages[-1] is the Finisher, so stages[-2] is the trained relation extraction model
rel_model.stages[-2].write().overwrite().save('RE_model_100e')
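
Indexing with `stages[-2]` depends on the `Finisher` being the last stage. As a hedged alternative, the sketch below looks a stage up by its type instead. `find_stage` is a hypothetical helper, not part of Spark NLP, and the two stand-in classes exist only to make the example self-contained.

```python
def find_stage(stages, stage_type):
    """Return the first stage that is an instance of stage_type."""
    for stage in stages:
        if isinstance(stage, stage_type):
            return stage
    raise ValueError(f"no stage of type {stage_type.__name__} found")


# Stand-in classes (hypothetical) that mimic the tail of the fitted pipeline
class FakeRelationModel:
    pass


class FakeFinisher:
    pass


stages = [object(), FakeRelationModel(), FakeFinisher()]
re_stage = find_stage(stages, FakeRelationModel)
print(re_stage is stages[-2])
```

In the notebook, the same helper would presumably be called as `find_stage(rel_model.stages, medical.RelationExtractionModel)`, assuming the fitted stage is an instance of that class.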

#### Train using the saved model on an unseen dataset

We continue training the saved model on unseen data (`test_data_1`) that uses the same label taxonomy.

In [47]:
reApproach_finetune = medical.RelationExtractionApproach()\
    .setInputCols(["embeddings", "pos_tags", "train_ner_chunks", "dependencies"])\
    .setOutputCol("relations")\
    .setLabelColumn("rel")\
    .setEpochsNumber(30)\
    .setBatchSize(200)\
    .setDropout(0.5)\
    .setLearningRate(0.001)\
    .setFixImbalance(True)\
    .setFromEntity("firstCharEnt1", "lastCharEnt1", "label1")\
    .setToEntity("firstCharEnt2", "lastCharEnt2", "label2")\
    .setRelationDirectionCol("rel_dir")\
    .setPretrainedModelPath("RE_model_100e")\
    .setОverrideExistingLabels(False)

finetune_pipeline = nlp.Pipeline(stages=[
    documenter,
    tokenizer,
    words_embedder,
    pos_tagger,
    dependency_parser,
    re_graph_builder,
    reApproach_finetune,
    finisher
])

In [48]:
%%time
rel_model = finetune_pipeline.fit(test_data_1)

TF Graph Builder configuration:
Model name: relation_extraction
Graph folder: ./tf_graphs
Graph file name: re_graph.pb
Build params: {'input_dim': 1149, 'output_dim': 27, 'hidden_layers': [300, 200], 'hidden_act': 'relu', 'hidden_act_l2': True, 'hidden_weights_l2': False, 'batch_norm': False}
relation_extraction graph exported to ./tf_graphs/re_graph.pb
CPU times: user 752 ms, sys: 42.8 ms, total: 794 ms
Wall time: 4.99 s


In [49]:
result = rel_model.transform(test_data_2)

**Evaluate**

In [50]:
result_test_df = result.select('rel', 'relations.result').toPandas()
result_test_df.sample(5)

| | rel | result |
|---|---|---|
| 189 | PIP | [PIP] |
| 21 | TrAP | [TrAP] |
| 53 | TeRP | [] |
| 117 | TrAP | [TrAP] |
| 129 | O | [TrAP] |


In [51]:
from sklearn.metrics import classification_report

pd_test_results = result_test_df.explode("result").fillna("O")

print(classification_report(pd_test_results["rel"], pd_test_results["result"]))

              precision    recall  f1-score   support

           O       0.71      0.84      0.77       140
         PIP       0.70      0.55      0.61        42
        TeCP       0.00      0.00      0.00         8
        TeRP       0.78      0.67      0.72        52
        TrAP       0.55      0.74      0.63        46
        TrCP       0.17      0.10      0.12        10
        TrIP       0.50      0.14      0.22         7
       TrNAP       0.00      0.00      0.00         5
        TrWP       1.00      0.50      0.67         4

    accuracy                           0.68       314
   macro avg       0.49      0.39      0.42       314
weighted avg       0.65      0.68      0.65       314
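
Rows that show `0.00` across the board, such as `TeCP`, are classes for which the model produced no correct prediction. The sketch below, in plain Python on invented toy labels, illustrates how per-class precision, recall, and F1 are computed and how a zero denominator is mapped to `0.0`, which is also the value scikit-learn reports by default (with a warning). It is an illustration of the metric definitions, not the scikit-learn implementation.

```python
from collections import Counter


def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1)}, using 0.0 when a denominator is zero."""
    true_count = Counter(y_true)   # support per class
    pred_count = Counter(y_pred)   # number of predictions per class
    tp = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        precision = tp[label] / pred_count[label] if pred_count[label] else 0.0
        recall = tp[label] / true_count[label] if true_count[label] else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


# Toy labels (illustrative only): "TeCP" is never predicted
y_true = ["O", "O", "TeRP", "TeRP", "TeCP"]
y_pred = ["O", "O", "TeRP", "O", "O"]

scores = per_class_scores(y_true, y_pred)
for label, (p, r, f1) in scores.items():
    print(f"{label:>5}  precision={p:.2f}  recall={r:.2f}  f1={f1:.2f}")
```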



**Save to disk**

In [52]:
# stages[-2] is the fine-tuned relation extraction model (the Finisher is last)
rel_model.stages[-2].write().overwrite().save('RE_model_finetuned')

#### Take a model trained on a different dataset (pretrained model) and train on this dataset

In [53]:
clinical_re_Model = medical.RelationExtractionModel()\
    .pretrained("re_clinical", "en", 'clinical/models')\
    .setInputCols(["embeddings", "pos_tags", "ner_chunks", "dependencies"])\
    .setOutputCol("relations")

clinical_re_Model.getClasses()

re_clinical download started this may take some time.
[OK!]


['TrWP', 'TrNAP', 'TrCP', 'PIP', 'TeCP', 'TeRP', 'TrIP', 'TrAP', 'O']
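
Before fine-tuning on top of a pretrained model, it can help to check how its classes line up with the labels in the training data. The sketch below compares the classes printed above with the labels that appear in the classification reports of this notebook; both lists are copied by hand here, whereas in practice they would come from `getClasses()` and the `rel` column.

```python
# Classes returned by clinical_re_Model.getClasses() above
model_classes = ['TrWP', 'TrNAP', 'TrCP', 'PIP', 'TeCP', 'TeRP', 'TrIP', 'TrAP', 'O']

# Labels listed in the classification reports of this notebook
dataset_labels = ['O', 'PIP', 'TeCP', 'TeRP', 'TrAP', 'TrCP', 'TrIP', 'TrNAP', 'TrWP']

# Labels the pretrained model has never seen, and classes absent from the data
new_labels = sorted(set(dataset_labels) - set(model_classes))
unused_classes = sorted(set(model_classes) - set(dataset_labels))

print("labels not known to the pretrained model:", new_labels)
print("model classes not present in the data:", unused_classes)
```

Here both differences are empty, so the pretrained model and the dataset share the same taxonomy.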

**Now train a model using this pretrained model as the base**

In [54]:
reApproach_finetune = medical.RelationExtractionApproach()\
    .setInputCols(["embeddings", "pos_tags", "train_ner_chunks", "dependencies"])\
    .setOutputCol("relations")\
    .setLabelColumn("rel")\
    .setEpochsNumber(30)\
    .setBatchSize(200)\
    .setDropout(0.5)\
    .setLearningRate(0.001)\
    .setFixImbalance(True)\
    .setFromEntity("firstCharEnt1", "lastCharEnt1", "label1")\
    .setToEntity("firstCharEnt2", "lastCharEnt2", "label2")\
    .setPretrainedModelPath("/root/cache_pretrained/re_clinical_en_2.5.5_2.4_1596928426753")\
    .setОverrideExistingLabels(False)

finetune_pipeline = nlp.Pipeline(stages=[
    documenter,
    tokenizer,
    words_embedder,
    pos_tagger,
    dependency_parser,
    re_graph_builder,
    reApproach_finetune,
    finisher
])

In [55]:
%%time
rel_model = finetune_pipeline.fit(train_data)

TF Graph Builder configuration:
Model name: relation_extraction
Graph folder: ./tf_graphs
Graph file name: re_graph.pb
Build params: {'input_dim': 1149, 'output_dim': 27, 'hidden_layers': [300, 200], 'hidden_act': 'relu', 'hidden_act_l2': True, 'hidden_weights_l2': False, 'batch_norm': False}
relation_extraction graph exported to ./tf_graphs/re_graph.pb
CPU times: user 700 ms, sys: 29.1 ms, total: 729 ms
Wall time: 4.24 s


In [56]:
result = rel_model.transform(test_data_2)

**Evaluate**

In [57]:
result_test_df = result.select('rel', 'relations.result').toPandas()
result_test_df.sample(5)

| | rel | result |
|---|---|---|
| 199 | O | [O] |
| 32 | TrNAP | [TrAP] |
| 263 | O | [O] |
| 80 | TeRP | [] |
| 207 | TeRP | [O] |


In [58]:
from sklearn.metrics import classification_report

pd_test_results = result_test_df.explode("result").fillna("O")

print(classification_report(pd_test_results["rel"], pd_test_results["result"]))

              precision    recall  f1-score   support

           O       0.59      0.77      0.67       140
         PIP       0.68      0.36      0.47        42
        TeCP       0.00      0.00      0.00         8
        TeRP       0.61      0.52      0.56        52
        TrAP       0.43      0.52      0.47        46
        TrCP       0.00      0.00      0.00        10
        TrIP       0.50      0.14      0.22         7
       TrNAP       1.00      0.20      0.33         5
        TrWP       0.00      0.00      0.00         4

    accuracy                           0.56       314
   macro avg       0.42      0.28      0.30       314
weighted avg       0.55      0.56      0.53       314
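
The three runs were all evaluated on `test_data_2`, so their summary rows can be compared directly. The sketch below collects the accuracy, macro F1, and weighted F1 values copied from the three reports in this notebook and picks the run with the highest macro F1. The run names are descriptive labels chosen for this example.

```python
# Summary metrics copied from the three classification reports (test_data_2)
summary = {
    "100 epochs on train_data": {
        "accuracy": 0.59, "macro_f1": 0.24, "weighted_f1": 0.52},
    "RE_model_100e + 30 epochs on test_data_1": {
        "accuracy": 0.68, "macro_f1": 0.42, "weighted_f1": 0.65},
    "re_clinical + 30 epochs on train_data": {
        "accuracy": 0.56, "macro_f1": 0.30, "weighted_f1": 0.53},
}

# Macro F1 weights every class equally, so it is sensitive to the rare classes
best = max(summary, key=lambda name: summary[name]["macro_f1"])

for name, metrics in summary.items():
    marker = "*" if name == best else " "
    print(f"{marker} {name:<42} "
          f"acc={metrics['accuracy']:.2f}  "
          f"macro_f1={metrics['macro_f1']:.2f}  "
          f"weighted_f1={metrics['weighted_f1']:.2f}")
```

With only 314 test rows and single-digit support for several classes, differences between runs should be read with caution.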



**Save to disk**

In [59]:
# stages[-2] is the model fine-tuned from re_clinical (the Finisher is last)
rel_model.stages[-2].write().overwrite().save('RE_pretrained_model_finetuned')