![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# RelationExtractionDLModel

In this notebook, we will examine the `RelationExtractionDLModel` annotator.

This annotator extracts and classifies relations between named entities. Unlike `RelationExtractionModel`, `RelationExtractionDLModel` is based on BERT.

**📖 Learning Objectives:**

1. Understand how to extract and classify the relations between named entities by using pre-trained BERT models.

2. Become comfortable using the different parameters of the annotator.

**🔗 Helpful Links:**


- Documentation: [RelationExtractionDLModel](https://nlp.johnsnowlabs.com/docs/en/licensed_annotators#relationextractiondl)

- For extended examples of usage, see the [Spark NLP Workshop](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/03.0.Clinical_Relation_Extraction.ipynb)

- Python Documentation: [RelationExtractionDLModel](https://nlp.johnsnowlabs.com/licensed/api/python/reference/autosummary/sparknlp_jsl/annotator/re/relation_extraction_dl/index.html#sparknlp_jsl.annotator.re.relation_extraction_dl.RelationExtractionDLModel.name)

- Scala Documentation: [RelationExtractionDLModel](https://nlp.johnsnowlabs.com/licensed/api/com/johnsnowlabs/finance/graph/relation_extraction/RelationExtractionDLModel.html)

- Relation Extraction Models and Relation Pairs Table: [this table](https://nlp.johnsnowlabs.com/docs/en/best_practices_pretrained_models#relation-extraction-models-and-relation-pairs-table) lists the available Relation Extraction models, their labels, the optimal NER model for each, and the meaningful relation pairs.

## **📜 Background**


These models are trained as end-to-end BERT models using BioBERT and ported into the Spark NLP ecosystem.
They offer SOTA performance on most benchmark tasks and outperform our existing Relation Extraction models.

## **🎬 Colab Setup**

In [1]:
# Install the johnsnowlabs library to access Spark-NLP for Healthcare
! pip install -q johnsnowlabs


In [2]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

Please Upload your John Snow Labs License using the button below


Saving 5.3.3.spark_nlp_for_healthcare.json to 5.3.3.spark_nlp_for_healthcare.json


In [3]:
from johnsnowlabs import nlp, medical

# After uploading your license, run this to install all licensed Python wheels and pre-download the JARs for the Spark session JVM
nlp.install()

👌 Detected license file /content/5.3.3.spark_nlp_for_healthcare.json
🚨 Outdated Medical Secrets in license file. Version=5.3.3 but should be Version=5.3.2
🚨 Outdated OCR Secrets in license file. Version=5.1.2 but should be Version=5.3.2
👷 Trying to install compatible secrets. Use nlp.settings.enforce_versions=False if you want to install outdated secrets.
📋 Stored John Snow Labs License in /root/.johnsnowlabs/licenses/license_number_0_for_Spark-Healthcare_Spark-OCR.json
👷 Setting up  John Snow Labs home in /root/.johnsnowlabs, this might take a few minutes.
Downloading 🐍+🚀 Python Library spark_nlp-5.3.2-py2.py3-none-any.whl
Downloading 🐍+💊 Python Library spark_nlp_jsl-5.3.2-py3-none-any.whl
Downloading 🫘+🚀 Java Library spark-nlp-assembly-5.3.2.jar
Downloading 🫘+💊 Java Library spark-nlp-jsl-5.3.2.jar
🙆 JSL Home setup in /root/.johnsnowlabs
👌 Detected license file /content/5.3.3.spark_nlp_for_healthcare.json
👷 Trying to install compatible secrets. Use nlp.settings.enforce_versions=False 

In [4]:
import pandas as pd

# Automatically load the license data and start a session with all JARs the user has access to
spark = nlp.start()

👌 Detected license file /content/5.3.3.spark_nlp_for_healthcare.json
👷 Trying to install compatible secrets. Use nlp.settings.enforce_versions=False if you want to install outdated secrets.
👌 Launched cpu optimized session with with: 🚀Spark-NLP==5.3.2, 💊Spark-Healthcare==5.3.2, running on ⚡ PySpark==3.4.0


In [5]:
spark

## **🖨️ Input/Output Annotation Types**
- Input: `CHUNK, DOCUMENT`
- Output: `CATEGORY`

## **🔎 Parameters**


- `predictionThreshold` *(Float)*: Sets minimal activation of the target unit to encode a new relation instance. (Default: 0.5f)

- `customLabels` *(dict[str, str])*: Custom relation labels.

- `doExceptionHandling` *(Boolean)*: If true, exceptions are handled.

- `relationPairsCaseSensitive` *(Boolean)*: Determines whether relation pairs are case sensitive.







### `setPredictionThreshold`

This parameter is used for setting the minimal activation of the target unit to encode a new relation instance.


**Build a pipeline using Spark NLP pretrained models and the relation extraction model**

We will first use the `ner_ade_clinical` NER model to detect `DRUG` and `ADE` entities, then find the relations between them with the `redl_ade_biobert` Relation Extraction model.

In [6]:
documenter = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencer = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentences")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentences"])\
    .setOutputCol("tokens")

words_embedder = nlp.WordEmbeddingsModel()\
    .pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("embeddings")

pos_tagger = nlp.PerceptronModel()\
    .pretrained("pos_clinical", "en", "clinical/models") \
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("pos_tags")

ner_tagger = medical.NerModel.pretrained("ner_ade_clinical", "en", "clinical/models")\
    .setInputCols("sentences", "tokens", "embeddings")\
    .setOutputCol("ner_tags")

ner_chunker = medical.NerConverterInternal()\
    .setInputCols(["sentences", "tokens", "ner_tags"])\
    .setOutputCol("ner_chunks")

dependency_parser = nlp.DependencyParserModel()\
    .pretrained("dependency_conllu", "en")\
    .setInputCols(["sentences", "pos_tags", "tokens"])\
    .setOutputCol("dependencies")

ade_re_ner_chunk_filter = medical.RENerChunksFilter() \
    .setInputCols(["ner_chunks", "dependencies"])\
    .setOutputCol("re_ner_chunks")\
    .setMaxSyntacticDistance(10)\
    .setRelationPairs(["drug-ade", "ade-drug"])

ade_re_model = medical.RelationExtractionDLModel()\
    .pretrained('redl_ade_biobert', 'en', "clinical/models") \
    .setInputCols(["re_ner_chunks", "sentences"]) \
    .setPredictionThreshold(0.5)\
    .setOutputCol("relations")

pipeline = nlp.Pipeline(stages=[
    documenter,
    sentencer,
    tokenizer,
    words_embedder,
    pos_tagger,
    ner_tagger,
    ner_chunker,
    dependency_parser,
    ade_re_ner_chunk_filter,
    ade_re_model
])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = pipeline.fit(empty_data)


embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
pos_clinical download started this may take some time.
Approximate size to download 1.5 MB
[OK!]
ner_ade_clinical download started this may take some time.
[OK!]
dependency_conllu download started this may take some time.
Approximate size to download 16.7 MB
[OK!]
redl_ade_biobert download started this may take some time.
[OK!]


**Create a light pipeline for annotating free text**

In [7]:
text ="""A 44-year-old man taking naproxen for chronic low back pain and a 20-year-old woman on oxaprozin for rheumatoid arthritis presented with tense bullae and cutaneous fragility on the face and the back of the hands."""


lmodel = nlp.LightPipeline(model)

results = lmodel.fullAnnotate(text)

In [8]:
results[0].keys()

dict_keys(['sentences', 'document', 'ner_chunks', 'ner_tags', 'relations', 'tokens', 'embeddings', 'pos_tags', 're_ner_chunks', 'dependencies'])

In [9]:
results[0]['ner_chunks']

[Annotation(chunk, 25, 32, naproxen, {'chunk': '0', 'confidence': '0.996', 'ner_source': 'ner_chunks', 'entity': 'DRUG', 'sentence': '0'}, []),
 Annotation(chunk, 87, 95, oxaprozin, {'chunk': '1', 'confidence': '0.9749', 'ner_source': 'ner_chunks', 'entity': 'DRUG', 'sentence': '0'}, []),
 Annotation(chunk, 137, 148, tense bullae, {'chunk': '2', 'confidence': '0.77250004', 'ner_source': 'ner_chunks', 'entity': 'ADE', 'sentence': '0'}, []),
 Annotation(chunk, 154, 210, cutaneous fragility on the face and the back of the hands, {'chunk': '3', 'confidence': '0.7458182', 'ner_source': 'ner_chunks', 'entity': 'ADE', 'sentence': '0'}, [])]
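
The four chunks above are the candidates that `RENerChunksFilter` pairs up before the relation model scores them. As a rough, library-free sketch of what a `drug-ade` / `ade-drug` pair filter does, one could write the following. This is not the annotator's implementation: the function name `candidate_pairs` is hypothetical, pairing in text order and case-insensitive matching are assumptions, and the syntactic-distance check is left out.

```python
from itertools import combinations

# Chunks copied from the `ner_chunks` output above: (begin, end, text, entity label).
chunks = [
    (25, 32, "naproxen", "DRUG"),
    (87, 95, "oxaprozin", "DRUG"),
    (137, 148, "tense bullae", "ADE"),
    (154, 210, "cutaneous fragility on the face and the back of the hands", "ADE"),
]


def candidate_pairs(chunks, relation_pairs):
    """Hypothetical helper: keep chunk pairs whose labels match an allowed pair.

    Assumes pairs are formed in text order and that label matching is
    case-insensitive; the real annotator may differ on both points.
    """
    allowed = {pair.strip().lower() for pair in relation_pairs}
    ordered = sorted(chunks, key=lambda c: c[0])  # sort by begin offset
    kept = []
    for first, second in combinations(ordered, 2):
        label_pair = f"{first[3]}-{second[3]}".lower()
        if label_pair in allowed:
            kept.append((first[2], second[2]))
    return kept


pairs = candidate_pairs(chunks, ["drug-ade", "ade-drug"])
for chunk1, chunk2 in pairs:
    print(chunk1, "->", chunk2)
```

The sketch keeps the four DRUG/ADE combinations and drops the DRUG/DRUG and ADE/ADE ones.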

In [10]:
results[0]['relations']

[Annotation(category, 25, 148, 1, {'chunk2': 'tense bullae', 'confidence': '0.9989047', 'entity2_end': '148', 'syntactic_distance': '6', 'chunk1': 'naproxen', 'entity1': 'DRUG', 'entity2_begin': '137', 'chunk2_confidence': '0.77250004', 'entity1_begin': '25', 'sentence': '0', 'entity1_end': '32', 'entity2': 'ADE', 'chunk1_confidence': '0.996', 'context': 'A 44-year-old man taking naproxen for chronic low back pain and a 20-year-old woman on oxaprozin for rheumatoid arthritis presented with tense bullae and cutaneous fragility on the face and the back of the hands.'}, []),
 Annotation(category, 25, 210, 1, {'chunk2': 'cutaneous fragility on the face and the back of the hands', 'confidence': '0.9989704', 'entity2_end': '210', 'syntactic_distance': '7', 'chunk1': 'naproxen', 'entity1': 'DRUG', 'entity2_begin': '154', 'chunk2_confidence': '0.7458182', 'entity1_begin': '25', 'sentence': '0', 'entity1_end': '32', 'entity2': 'ADE', 'chunk1_confidence': '0.996', 'context': 'A 44-year-old man t

**Show extracted relations**

In [11]:
for rel in results[0]["relations"]:
    print("{}({}={} - {}={})".format(
        rel.result,
        rel.metadata['entity1'],
        rel.metadata['chunk1'],
        rel.metadata['entity2'],
        rel.metadata['chunk2']
    ))

1(DRUG=naproxen - ADE=tense bullae)
1(DRUG=naproxen - ADE=cutaneous fragility on the face and the back of the hands)
1(DRUG=oxaprozin - ADE=tense bullae)
1(DRUG=oxaprozin - ADE=cutaneous fragility on the face and the back of the hands)


**Get relations in a pandas DataFrame**


In [12]:
def get_relations_df(results, rel_col='relations', chunk_col='ner_chunks'):
    """Return the relations of the first annotated document as a pandas DataFrame."""
    rel_pairs = []
    chunks = []

    for rel in results[0][rel_col]:
        rel_pairs.append((
            rel.metadata['entity1_begin'],
            rel.metadata['entity1_end'],
            rel.metadata['chunk1'],
            rel.metadata['entity1'],
            rel.metadata['entity2_begin'],
            rel.metadata['entity2_end'],
            rel.metadata['chunk2'],
            rel.metadata['entity2'],
            rel.result,
            rel.metadata['confidence'],
        ))

    for chunk in results[0][chunk_col]:
        chunks.append((
            chunk.metadata["sentence"],
            chunk.begin,
            chunk.end,
            chunk.result,
        ))

    rel_df = pd.DataFrame(rel_pairs, columns=['entity1_begin', 'entity1_end', 'chunk1', 'entity1', 'entity2_begin', 'entity2_end', 'chunk2', 'entity2', 'relation', 'confidence'])

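    chunks_df = pd.DataFrame(chunks, columns=["sentence", "begin", "end", "chunk"])
    # Relation metadata stores offsets as strings, so cast the chunk offsets to match before merging.
    chunks_df.begin = chunks_df.begin.astype(str)
    chunks_df.end = chunks_df.end.astype(str)

    # Join on the first entity's span to attach its sentence index to each relation.
    result_df = pd.merge(rel_df, chunks_df, left_on=["entity1_begin", "entity1_end", "chunk1"], right_on=["begin", "end", "chunk"])[["sentence"] + list(rel_df.columns)]

    return result_df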

In [13]:
print(text, "\n")

rel_df = get_relations_df(results)
rel_df

A 44-year-old man taking naproxen for chronic low back pain and a 20-year-old woman on oxaprozin for rheumatoid arthritis presented with tense bullae and cutaneous fragility on the face and the back of the hands. 



|   | sentence | entity1_begin | entity1_end | chunk1 | entity1 | entity2_begin | entity2_end | chunk2 | entity2 | relation | confidence |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 25 | 32 | naproxen | DRUG | 137 | 148 | tense bullae | ADE | 1 | 0.9989047 |
| 1 | 0 | 25 | 32 | naproxen | DRUG | 154 | 210 | cutaneous fragility on the face and the back o... | ADE | 1 | 0.9989704 |
| 2 | 0 | 87 | 95 | oxaprozin | DRUG | 137 | 148 | tense bullae | ADE | 1 | 0.99895453 |
| 3 | 0 | 87 | 95 | oxaprozin | DRUG | 154 | 210 | cutaneous fragility on the face and the back o... | ADE | 1 | 0.99900633 |


**Visualization of Extracted Relations**

In [14]:
vis = nlp.viz.RelationExtractionVisualizer()
vis.display(results[0], 'relations', show_relations=True) # default show_relations: True


As seen above, only relations with a confidence higher than 0.5 are returned, since we set `setPredictionThreshold(0.5)`.

Now we will set `setPredictionThreshold(0.999)` and expect to see only the relations with a confidence score higher than 0.999.

In [15]:

ade_re_model = medical.RelationExtractionDLModel()\
    .pretrained('redl_ade_biobert', 'en', "clinical/models") \
    .setInputCols(["re_ner_chunks", "sentences"]) \
    .setPredictionThreshold(0.999)\
    .setOutputCol("relations")

pipeline = nlp.Pipeline(stages=[
    documenter,
    sentencer,
    tokenizer,
    words_embedder,
    pos_tagger,
    ner_tagger,
    ner_chunker,
    dependency_parser,
    ade_re_ner_chunk_filter,
    ade_re_model
])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = pipeline.fit(empty_data)

redl_ade_biobert download started this may take some time.
[OK!]


In [16]:
text ="""A 44-year-old man taking naproxen for chronic low back pain and a 20-year-old woman on oxaprozin for rheumatoid arthritis presented with tense bullae and cutaneous fragility on the face and the back of the hands."""


lmodel = nlp.LightPipeline(model)

results = lmodel.fullAnnotate(text)

Extracted chunks:

In [17]:
results[0]['ner_chunks']

[Annotation(chunk, 25, 32, naproxen, {'chunk': '0', 'confidence': '0.996', 'ner_source': 'ner_chunks', 'entity': 'DRUG', 'sentence': '0'}, []),
 Annotation(chunk, 87, 95, oxaprozin, {'chunk': '1', 'confidence': '0.9749', 'ner_source': 'ner_chunks', 'entity': 'DRUG', 'sentence': '0'}, []),
 Annotation(chunk, 137, 148, tense bullae, {'chunk': '2', 'confidence': '0.77250004', 'ner_source': 'ner_chunks', 'entity': 'ADE', 'sentence': '0'}, []),
 Annotation(chunk, 154, 210, cutaneous fragility on the face and the back of the hands, {'chunk': '3', 'confidence': '0.7458182', 'ner_source': 'ner_chunks', 'entity': 'ADE', 'sentence': '0'}, [])]

Relations:

In [18]:
print(text, "\n")

rel_df = get_relations_df(results)
rel_df

A 44-year-old man taking naproxen for chronic low back pain and a 20-year-old woman on oxaprozin for rheumatoid arthritis presented with tense bullae and cutaneous fragility on the face and the back of the hands. 



|   | sentence | entity1_begin | entity1_end | chunk1 | entity1 | entity2_begin | entity2_end | chunk2 | entity2 | relation | confidence |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 87 | 95 | oxaprozin | DRUG | 154 | 210 | cutaneous fragility on the face and the back o... | ADE | 1 | 0.99900633 |


As you can see above, only the relation with a confidence score higher than 0.999 remains.
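
The effect of `setPredictionThreshold` on this example can be reproduced outside Spark as a plain filter over the confidence scores reported in the first results table. This only illustrates the filtering behaviour, not the model's internal logic: `apply_threshold` is a hypothetical helper, and whether the annotator uses a strict or inclusive comparison is an assumption here.

```python
# (chunk1, chunk2, confidence) rows copied from the first results table above.
scored = [
    ("naproxen", "tense bullae", 0.9989047),
    ("naproxen", "cutaneous fragility on the face and the back of the hands", 0.9989704),
    ("oxaprozin", "tense bullae", 0.99895453),
    ("oxaprozin", "cutaneous fragility on the face and the back of the hands", 0.99900633),
]


def apply_threshold(rows, threshold):
    """Keep rows whose confidence exceeds the threshold (strict '>' is an assumption)."""
    return [row for row in rows if row[2] > threshold]


kept_default = apply_threshold(scored, 0.5)
kept_strict = apply_threshold(scored, 0.999)

print(len(kept_default), "relations at 0.5")
print(len(kept_strict), "relation(s) at 0.999:", [(r[0], r[1]) for r in kept_strict])
```

Three of the four scores fall just below 0.999, which is why only the oxaprozin / cutaneous fragility pair survives the stricter setting.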

### `setCustomLabels`

This parameter sets custom relation labels.

Let's replace the default labels with custom ones by using the `setCustomLabels` method.

In [19]:
ade_re_model = medical.RelationExtractionDLModel()\
    .pretrained('redl_ade_biobert', 'en', "clinical/models") \
    .setInputCols(["re_ner_chunks", "sentences"]) \
    .setPredictionThreshold(0.5)\
    .setOutputCol("relations")


# set custom labels
ade_re_model.setCustomLabels({"1": "is_related", "0": "not_related"})


pipeline = nlp.Pipeline(stages=[
    documenter,
    sentencer,
    tokenizer,
    words_embedder,
    pos_tagger,
    ner_tagger,
    ner_chunker,
    dependency_parser,
    ade_re_ner_chunk_filter,
    ade_re_model
])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = pipeline.fit(empty_data)

redl_ade_biobert download started this may take some time.
[OK!]


Create a LightPipeline for annotation.

In [20]:
text ="""A 44-year-old man taking naproxen for chronic low back pain and a 20-year-old woman on oxaprozin for rheumatoid arthritis presented with tense bullae and cutaneous fragility on the face and the back of the hands."""

lmodel = nlp.LightPipeline(model)

results = lmodel.fullAnnotate(text)

Show the results in a pandas DataFrame.

In [21]:
print(text, "\n")

rel_df = get_relations_df(results)
rel_df

A 44-year-old man taking naproxen for chronic low back pain and a 20-year-old woman on oxaprozin for rheumatoid arthritis presented with tense bullae and cutaneous fragility on the face and the back of the hands. 



|   | sentence | entity1_begin | entity1_end | chunk1 | entity1 | entity2_begin | entity2_end | chunk2 | entity2 | relation | confidence |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 25 | 32 | naproxen | DRUG | 137 | 148 | tense bullae | ADE | is_related | 0.9989047 |
| 1 | 0 | 25 | 32 | naproxen | DRUG | 154 | 210 | cutaneous fragility on the face and the back o... | ADE | is_related | 0.9989704 |
| 2 | 0 | 87 | 95 | oxaprozin | DRUG | 137 | 148 | tense bullae | ADE | is_related | 0.99895453 |
| 3 | 0 | 87 | 95 | oxaprozin | DRUG | 154 | 210 | cutaneous fragility on the face and the back o... | ADE | is_related | 0.99900633 |


As seen above, the `relation` column now shows the custom label `is_related` in place of the default `1`.
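
Conceptually, custom labels act as a lookup from the model's raw labels to the names you supply. A minimal sketch of that idea in plain Python follows. The `relabel` helper is hypothetical, and passing unmapped labels through unchanged is an assumption rather than documented annotator behaviour.

```python
# Same mapping that was passed to setCustomLabels in the cell above.
custom_labels = {"1": "is_related", "0": "not_related"}


def relabel(raw_label, mapping):
    """Hypothetical helper: return the custom label, or the raw one if unmapped (assumed)."""
    return mapping.get(raw_label, raw_label)


raw_predictions = ["1", "1", "1", "1"]  # raw labels of the four relations in this example
relabelled = [relabel(label, custom_labels) for label in raw_predictions]
print(relabelled)
```

Note that the keys are strings (`"1"`, `"0"`), matching the string-valued `result` field of each relation annotation.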