![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/)

# RENerChunksFilter

In this notebook, we will examine the `RENerChunksFilter` annotator.

The `RENerChunksFilter` annotator filters the desired relation pairs (defined by the `relationPairs` parameter) and stores them in the output column.

**📖 Learning Objectives:**

1. Understand how to filter the desired relation pairs for the `RelationExtractionDLModel`.

2. Become comfortable using the different parameters of the annotator.

**🔗 Helpful Links:**

For extended examples of usage, see the [Spark NLP Workshop](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/03.0.Clinical_Relation_Extraction.ipynb)

Python Documentation: [RENerChunksFilter](https://nlp.johnsnowlabs.com/licensed/api/python/reference/autosummary/sparknlp_jsl/annotator/re/relation_ner_chunk_filter/index.html#sparknlp_jsl.annotator.re.relation_ner_chunk_filter.RENerChunksFilter.name)

Scala Documentation: [RENerChunksFilter](https://nlp.johnsnowlabs.com/licensed/api/com/johnsnowlabs/nlp/annotators/re/RENerChunksFilter.html)

Relation Extraction Models and Relation Pairs Table: [this table](https://nlp.johnsnowlabs.com/docs/en/best_practices_pretrained_models#relation-extraction-models-and-relation-pairs-table) lists the available Relation Extraction models together with their labels, optimal NER models, and meaningful relation pairs.

## **📜 Background**


Filtering the candidate relations is useful when an analysis targets a specific use case (e.g., checking adverse drug reactions and drug relations). The filtered pairs then serve as the input to a pretrained `RelationExtractionDLModel`.

## **🎬 Colab Setup**

In [None]:
# Install the johnsnowlabs library to access Spark-NLP for Healthcare
! pip install -q johnsnowlabs

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp, medical

# After uploading your license, run this to install all licensed Python wheels and pre-download the jars for the Spark session JVM
nlp.install()

In [None]:
from johnsnowlabs import nlp, medical
import pandas as pd

# Automatically load license data and start a session with all jars user has access to
spark = nlp.start()

In [None]:
spark

In [None]:
from pyspark.sql import DataFrame
import pyspark.sql.functions as F
import pyspark.sql.types as T

## **🖨️ Input/Output Annotation Types**
- Input: `CHUNK, DEPENDENCY`
- Output: `CHUNK`

## **🔎 Parameters**


- `maxSyntacticDistance` *(Int)*: Maximum syntactic distance between a pair of named entities to consider them as a relation. Increasing this value will increase recall, but also increase the number of false positives.

- `relationPairs` *(List[Str])*: List of dash-separated pairs of named entities. For example, [“Biomarker-RelativeDay”] will process all relations between entities of type “Biomarker” and “RelativeDay”.

- `relationPairsCaseSensitive` *(Boolean)*: Determines whether relation pairs are case sensitive.
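
To make the interaction between `relationPairs` and `relationPairsCaseSensitive` concrete, here is a minimal pure-Python sketch of label-pair filtering. It only illustrates the idea and is not the annotator's implementation: the helper name `filter_label_pairs` is hypothetical, and it assumes that a pair matches in either order and that an empty pair list keeps every pair, which is consistent with the outputs shown later in this notebook.

```python
from itertools import combinations

def filter_label_pairs(chunks, relation_pairs, case_sensitive=False):
    """Keep chunk pairs whose entity labels match one of the dash-separated pairs.

    chunks: list of (chunk_text, entity_label) tuples.
    Hypothetical helper for illustration only; not part of Spark NLP.
    """
    normalize = (lambda s: s) if case_sensitive else str.lower

    # Parse "A-B" strings into unordered label pairs (assumption: order is ignored).
    allowed = set()
    for pair in relation_pairs:
        left, right = pair.split("-")
        allowed.add(frozenset((normalize(left.strip()), normalize(right.strip()))))

    kept = []
    for (text1, label1), (text2, label2) in combinations(chunks, 2):
        labels = frozenset((normalize(label1), normalize(label2)))
        # Assumption: with no relation pairs configured, every pair is kept.
        if not allowed or labels in allowed:
            kept.append((text1, text2))
    return kept

chunks = [
    ("naproxen", "DRUG"),
    ("oxaprozin", "DRUG"),
    ("tense bullae", "ADE"),
    ("cutaneous fragility", "ADE"),
]

# Case-insensitive matching keeps only the DRUG/ADE combinations.
print(filter_label_pairs(chunks, ["drug-ade"]))
# Case-sensitive matching finds nothing: "drug"/"ade" differ from "DRUG"/"ADE".
print(filter_label_pairs(chunks, ["drug-ade"], case_sensitive=True))
```

The sketch splits each pair on a single dash, so it would not handle entity labels that themselves contain a dash.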



### `setMaxSyntacticDistance`
This parameter sets the maximum syntactic distance, which is used as a threshold.

**Build a pipeline using Spark NLP pretrained models and the relation extraction model**.

The precision of the RE model is controlled by `setMaxSyntacticDistance`, which sets the maximum syntactic distance allowed between a pair of named entities. A larger value improves recall at the expense of precision. A small value such as 4 tends to give very high precision (few false positives) while keeping reasonably good recall.
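
As an intuition for what a syntactic distance measures, the sketch below counts the edges on the path between two tokens in a dependency tree. This is an assumption made for illustration: the exact definition used by the annotator is not stated here, the `heads` list is a hand-written toy parse rather than parser output, and the function names are hypothetical.

```python
def path_to_root(heads, index):
    """Return the token indices from `index` up to the root (head == -1)."""
    path = [index]
    while heads[path[-1]] != -1:
        path.append(heads[path[-1]])
    return path

def syntactic_distance(heads, a, b):
    """Number of edges between tokens a and b via their lowest common ancestor."""
    depth_from_a = {node: depth for depth, node in enumerate(path_to_root(heads, a))}
    for depth_from_b, node in enumerate(path_to_root(heads, b)):
        if node in depth_from_a:
            return depth_from_a[node] + depth_from_b
    raise ValueError("tokens do not belong to the same tree")

# Toy parse: heads[i] is the index of token i's head, -1 marks the root.
tokens = ["woman", "on", "oxaprozin", "presented", "with", "bullae"]
heads = [3, 2, 0, -1, 5, 3]

distance = syntactic_distance(heads, tokens.index("oxaprozin"), tokens.index("bullae"))
max_distance = 4
print(distance, distance <= max_distance)
```

Under this reading, a pair is kept as a candidate only when its distance does not exceed the configured maximum.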

Now, we will first use `ner_ade_clinical` NER model and detect `DRUG` and `ADE` entities. Then we can find the relations between them by using `redl_ade_biobert` Relation Extraction model.

Let's set `.setMaxSyntacticDistance(10)` and see the results.

In [None]:
documenter = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencer = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentences")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentences"])\
    .setOutputCol("tokens")

words_embedder = nlp.WordEmbeddingsModel()\
    .pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("embeddings")

pos_tagger = nlp.PerceptronModel()\
    .pretrained("pos_clinical", "en", "clinical/models") \
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("pos_tags")

ner_tagger = medical.NerModel.pretrained("ner_ade_clinical", "en", "clinical/models")\
    .setInputCols("sentences", "tokens", "embeddings")\
    .setOutputCol("ner_tags")

ner_chunker = medical.NerConverterInternal()\
    .setInputCols(["sentences", "tokens", "ner_tags"])\
    .setOutputCol("ner_chunks")

dependency_parser = nlp.DependencyParserModel()\
    .pretrained("dependency_conllu", "en")\
    .setInputCols(["sentences", "pos_tags", "tokens"])\
    .setOutputCol("dependencies")

ade_re_ner_chunk_filter = medical.RENerChunksFilter() \
    .setInputCols(["ner_chunks", "dependencies"])\
    .setOutputCol("re_ner_chunks")\
    .setMaxSyntacticDistance(10)

ade_re_model = medical.RelationExtractionDLModel()\
    .pretrained('redl_ade_biobert', 'en', "clinical/models") \
    .setInputCols(["re_ner_chunks", "sentences"]) \
    .setPredictionThreshold(0.5)\
    .setOutputCol("relations")

pipeline = nlp.Pipeline(stages=[
    documenter,
    sentencer,
    tokenizer,
    words_embedder,
    pos_tagger,
    ner_tagger,
    ner_chunker,
    dependency_parser,
    ade_re_ner_chunk_filter,
    ade_re_model
])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = pipeline.fit(empty_data)


embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
pos_clinical download started this may take some time.
Approximate size to download 1.5 MB
[OK!]
ner_ade_clinical download started this may take some time.
[OK!]
dependency_conllu download started this may take some time.
Approximate size to download 16.7 MB
[OK!]
redl_ade_biobert download started this may take some time.
[OK!]


**Create a light pipeline for annotating free text**

In [None]:
text ="""A 44-year-old man taking naproxen for chronic low back pain and a 20-year-old woman on oxaprozin for rheumatoid arthritis presented with tense bullae and cutaneous fragility on the face and the back of the hands."""


lmodel = nlp.LightPipeline(model)

results = lmodel.fullAnnotate(text)

In [None]:
results[0].keys()

dict_keys(['sentences', 'document', 'ner_chunks', 'ner_tags', 'relations', 'tokens', 'embeddings', 'pos_tags', 're_ner_chunks', 'dependencies'])

In [None]:
results[0]['ner_chunks']

[Annotation(chunk, 25, 32, naproxen, {'chunk': '0', 'confidence': '0.996', 'ner_source': 'ner_chunks', 'entity': 'DRUG', 'sentence': '0'}, []),
 Annotation(chunk, 87, 95, oxaprozin, {'chunk': '1', 'confidence': '0.9749', 'ner_source': 'ner_chunks', 'entity': 'DRUG', 'sentence': '0'}, []),
 Annotation(chunk, 137, 148, tense bullae, {'chunk': '2', 'confidence': '0.77250004', 'ner_source': 'ner_chunks', 'entity': 'ADE', 'sentence': '0'}, []),
 Annotation(chunk, 154, 210, cutaneous fragility on the face and the back of the hands, {'chunk': '3', 'confidence': '0.7458182', 'ner_source': 'ner_chunks', 'entity': 'ADE', 'sentence': '0'}, [])]

In [None]:
results[0]['relations']

[Annotation(category, 25, 95, 1, {'chunk2': 'oxaprozin', 'confidence': '0.9989786', 'entity2_end': '95', 'syntactic_distance': '3', 'chunk1': 'naproxen', 'entity1': 'DRUG', 'entity2_begin': '87', 'chunk2_confidence': '0.9749', 'entity1_begin': '25', 'entity1_end': '32', 'entity2': 'DRUG', 'chunk1_confidence': '0.996', 'context': 'A 44-year-old man taking naproxen for chronic low back pain and a 20-year-old woman on oxaprozin for rheumatoid arthritis presented with tense bullae and cutaneous fragility on the face and the back of the hands.'}, []),
 Annotation(category, 25, 148, 1, {'chunk2': 'tense bullae', 'confidence': '0.9989047', 'entity2_end': '148', 'syntactic_distance': '6', 'chunk1': 'naproxen', 'entity1': 'DRUG', 'entity2_begin': '137', 'chunk2_confidence': '0.77250004', 'entity1_begin': '25', 'entity1_end': '32', 'entity2': 'ADE', 'chunk1_confidence': '0.996', 'context': 'A 44-year-old man taking naproxen for chronic low back pain and a 20-year-old woman on oxaprozin for rheum

**Show extracted relations**

In [None]:
for rel in results[0]["relations"]:
    print("{}({}={} - {}={})".format(
        rel.result,
        rel.metadata['entity1'],
        rel.metadata['chunk1'],
        rel.metadata['entity2'],
        rel.metadata['chunk2']
    ))

1(DRUG=naproxen - DRUG=oxaprozin)
1(DRUG=naproxen - ADE=tense bullae)
1(DRUG=naproxen - ADE=cutaneous fragility on the face and the back of the hands)
1(DRUG=oxaprozin - ADE=tense bullae)
1(DRUG=oxaprozin - ADE=cutaneous fragility on the face and the back of the hands)
1(ADE=tense bullae - ADE=cutaneous fragility on the face and the back of the hands)


Get the relations in a pandas DataFrame.


In [None]:
def get_relations_df(results, rel_col='relations', chunk_col='ner_chunks'):
    rel_pairs=[]
    chunks = []

    for rel in results[0][rel_col]:
        rel_pairs.append((
            rel.metadata['entity1_begin'],
            rel.metadata['entity1_end'],
            rel.metadata['chunk1'],
            rel.metadata['entity1'],
            rel.metadata['entity2_begin'],
            rel.metadata['entity2_end'],
            rel.metadata['chunk2'],
            rel.metadata['entity2'],
            rel.result,
            rel.metadata['confidence'],
        ))

    for chunk in results[0][chunk_col]:
        chunks.append((
            chunk.metadata["sentence"],
            chunk.begin,
            chunk.end,
            chunk.result,
        ))

    rel_df = pd.DataFrame(rel_pairs, columns=['entity1_begin', 'entity1_end', 'chunk1', 'entity1', 'entity2_begin', 'entity2_end', 'chunk2', 'entity2', 'relation', 'confidence'])

    chunks_df = pd.DataFrame(chunks, columns = ["sentence", "begin", "end", "chunk"])
    chunks_df.begin = chunks_df.begin.astype(str)
    chunks_df.end = chunks_df.end.astype(str)

    result_df = pd.merge(rel_df,chunks_df, left_on=["entity1_begin", "entity1_end", "chunk1"], right_on=["begin", "end", "chunk"])[["sentence"] + list(rel_df.columns)]


    return result_df

In [None]:
print(text, "\n")

rel_df = get_relations_df(results)
rel_df

A 44-year-old man taking naproxen for chronic low back pain and a 20-year-old woman on oxaprozin for rheumatoid arthritis presented with tense bullae and cutaneous fragility on the face and the back of the hands. 



Unnamed: 0,sentence,entity1_begin,entity1_end,chunk1,entity1,entity2_begin,entity2_end,chunk2,entity2,relation,confidence
0,0,25,32,naproxen,DRUG,87,95,oxaprozin,DRUG,1,0.9989786
1,0,25,32,naproxen,DRUG,137,148,tense bullae,ADE,1,0.9989047
2,0,25,32,naproxen,DRUG,154,210,cutaneous fragility on the face and the back o...,ADE,1,0.9989704
3,0,87,95,oxaprozin,DRUG,137,148,tense bullae,ADE,1,0.99895453
4,0,87,95,oxaprozin,DRUG,154,210,cutaneous fragility on the face and the back o...,ADE,1,0.99900633
5,0,137,148,tense bullae,ADE,154,210,cutaneous fragility on the face and the back o...,ADE,1,0.99894017
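
The values in the relation metadata are strings (see the raw `relations` output above), so numeric comparisons need an explicit conversion. This is also why `get_relations_df` casts the chunk offsets to strings before merging. The following sketch converts the fields of one metadata dictionary, copied from the first relation above with the `context` field omitted; the helper name `typed_relation` is hypothetical.

```python
INT_FIELDS = ("entity1_begin", "entity1_end", "entity2_begin", "entity2_end", "syntactic_distance")
FLOAT_FIELDS = ("confidence", "chunk1_confidence", "chunk2_confidence")

def typed_relation(metadata):
    """Return a copy of a relation metadata dict with numeric fields converted."""
    typed = dict(metadata)
    for key in INT_FIELDS:
        if key in typed:
            typed[key] = int(typed[key])
    for key in FLOAT_FIELDS:
        if key in typed:
            typed[key] = float(typed[key])
    return typed

metadata = {
    "chunk1": "naproxen", "entity1": "DRUG", "entity1_begin": "25", "entity1_end": "32",
    "chunk2": "oxaprozin", "entity2": "DRUG", "entity2_begin": "87", "entity2_end": "95",
    "confidence": "0.9989786", "syntactic_distance": "3",
}

rel = typed_relation(metadata)
# With real numbers, thresholds and offset arithmetic behave as expected.
print(rel["syntactic_distance"] <= 4, rel["confidence"] > 0.9)
```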


**Visualization of Extracted Relations**

In [None]:
vis = nlp.viz.RelationExtractionVisualizer()
vis.display(results[0], 'relations', show_relations=True) # default show_relations: True


Now, we will set `setMaxSyntacticDistance(4)` and compare the output with the previous results.

In [None]:
ade_re_ner_chunk_filter = medical.RENerChunksFilter() \
    .setInputCols(["ner_chunks", "dependencies"])\
    .setOutputCol("re_ner_chunks")\
    .setMaxSyntacticDistance(4)

pipeline = nlp.Pipeline(stages=[
    documenter,
    sentencer,
    tokenizer,
    words_embedder,
    pos_tagger,
    ner_tagger,
    ner_chunker,
    dependency_parser,
    ade_re_ner_chunk_filter,
    ade_re_model
])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = pipeline.fit(empty_data)

In [None]:
text ="""A 44-year-old man taking naproxen for chronic low back pain and a 20-year-old woman on oxaprozin for rheumatoid arthritis presented with tense bullae and cutaneous fragility on the face and the back of the hands."""


lmodel = nlp.LightPipeline(model)

results = lmodel.fullAnnotate(text)

In [None]:
results[0]['ner_chunks']

[Annotation(chunk, 25, 32, naproxen, {'chunk': '0', 'confidence': '0.996', 'ner_source': 'ner_chunks', 'entity': 'DRUG', 'sentence': '0'}, []),
 Annotation(chunk, 87, 95, oxaprozin, {'chunk': '1', 'confidence': '0.9749', 'ner_source': 'ner_chunks', 'entity': 'DRUG', 'sentence': '0'}, []),
 Annotation(chunk, 137, 148, tense bullae, {'chunk': '2', 'confidence': '0.77250004', 'ner_source': 'ner_chunks', 'entity': 'ADE', 'sentence': '0'}, []),
 Annotation(chunk, 154, 210, cutaneous fragility on the face and the back of the hands, {'chunk': '3', 'confidence': '0.7458182', 'ner_source': 'ner_chunks', 'entity': 'ADE', 'sentence': '0'}, [])]

In [None]:
print(text, "\n")

rel_df = get_relations_df(results)
rel_df

A 44-year-old man taking naproxen for chronic low back pain and a 20-year-old woman on oxaprozin for rheumatoid arthritis presented with tense bullae and cutaneous fragility on the face and the back of the hands. 



Unnamed: 0,sentence,entity1_begin,entity1_end,chunk1,entity1,entity2_begin,entity2_end,chunk2,entity2,relation,confidence
0,0,25,32,naproxen,DRUG,87,95,oxaprozin,DRUG,1,0.9989786
1,0,87,95,oxaprozin,DRUG,137,148,tense bullae,ADE,1,0.99895453
2,0,87,95,oxaprozin,DRUG,154,210,cutaneous fragility on the face and the back o...,ADE,1,0.99900633
3,0,137,148,tense bullae,ADE,154,210,cutaneous fragility on the face and the back o...,ADE,1,0.99894017
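
To see exactly which candidate pairs the lower threshold removed, the two result tables can be compared as sets. The pairs below are copied by hand from the two outputs above (with the long chunk abbreviated as `cutaneous fragility`), so this is a sketch rather than something computed from the pipeline.

```python
# Chunk pairs returned with setMaxSyntacticDistance(10)
pairs_distance_10 = {
    ("naproxen", "oxaprozin"),
    ("naproxen", "tense bullae"),
    ("naproxen", "cutaneous fragility"),
    ("oxaprozin", "tense bullae"),
    ("oxaprozin", "cutaneous fragility"),
    ("tense bullae", "cutaneous fragility"),
}

# Chunk pairs returned with setMaxSyntacticDistance(4)
pairs_distance_4 = {
    ("naproxen", "oxaprozin"),
    ("oxaprozin", "tense bullae"),
    ("oxaprozin", "cutaneous fragility"),
    ("tense bullae", "cutaneous fragility"),
}

dropped = pairs_distance_10 - pairs_distance_4
for chunk1, chunk2 in sorted(dropped):
    print(chunk1, "->", chunk2)
```

Both dropped pairs start at `naproxen`. The raw output of the first run reports a `syntactic_distance` of 6 for naproxen and tense bullae, which is above the new threshold of 4.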


### `setRelationPairs`
This parameter sets the list of dash-separated pairs of named entities. For example, ["drug-ade"] will process all relations between entities of type "drug" and "ade".

Now, we will set `setRelationPairs(["OCCURRENCE-DURATION"])` and expect only "OCCURRENCE-DURATION" and "DURATION-OCCURRENCE" relations.

In [None]:
ner_tagger_clinical = medical.NerModel.pretrained("ner_events_clinical", "en", "clinical/models")\
    .setInputCols("sentences", "tokens", "embeddings")\
    .setOutputCol("ner_tags")

clinical_re_ner_chunk_filter = medical.RENerChunksFilter() \
    .setInputCols(["ner_chunks", "dependencies"])\
    .setOutputCol("re_ner_chunks")\
    .setMaxSyntacticDistance(10)\
    .setRelationPairs(["OCCURRENCE-DURATION"])

clinical_re_model = medical.RelationExtractionDLModel()\
    .pretrained('redl_temporal_events_biobert', 'en', "clinical/models") \
    .setInputCols(["re_ner_chunks", "sentences"]) \
    .setPredictionThreshold(0.5)\
    .setOutputCol("relations")


pipeline = nlp.Pipeline(stages=[
    documenter,
    sentencer,
    tokenizer,
    words_embedder,
    pos_tagger,
    ner_tagger_clinical,
    ner_chunker,
    dependency_parser,
    clinical_re_ner_chunk_filter,
    clinical_re_model
])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = pipeline.fit(empty_data)

ner_events_clinical download started this may take some time.
[OK!]
redl_temporal_events_biobert download started this may take some time.
[OK!]


We will create a LightPipeline for annotation.

In [None]:
text ="""The patient is a 68-year-old Caucasian male with past medical history of diabetes mellitus.
He was doing fairly well until last week while mowing the lawn, he injured his right foot. """

lmodel = nlp.LightPipeline(model)

results = lmodel.fullAnnotate(text)

Extracted NER chunks:

In [None]:
results[0]['ner_chunks']

[Annotation(chunk, 73, 89, diabetes mellitus, {'chunk': '0', 'confidence': '0.97415', 'ner_source': 'ner_chunks', 'entity': 'PROBLEM', 'sentence': '0'}, []),
 Annotation(chunk, 99, 115, doing fairly well, {'chunk': '1', 'confidence': '0.6459667', 'ner_source': 'ner_chunks', 'entity': 'OCCURRENCE', 'sentence': '1'}, []),
 Annotation(chunk, 123, 131, last week, {'chunk': '2', 'confidence': '0.4821', 'ner_source': 'ner_chunks', 'entity': 'DURATION', 'sentence': '1'}, []),
 Annotation(chunk, 139, 153, mowing the lawn, {'chunk': '3', 'confidence': '0.42610002', 'ner_source': 'ner_chunks', 'entity': 'OCCURRENCE', 'sentence': '1'}, []),
 Annotation(chunk, 159, 180, injured his right foot, {'chunk': '4', 'confidence': '0.56665', 'ner_source': 'ner_chunks', 'entity': 'PROBLEM', 'sentence': '1'}, [])]

Let's show the OCCURRENCE-DURATION relations in a pandas DataFrame.

In [None]:
print(text, "\n")

rel_df = get_relations_df(results)
rel_df

The patient is a 68-year-old Caucasian male with past medical history of diabetes mellitus.
He was doing fairly well until last week while mowing the lawn, he injured his right foot.  



Unnamed: 0,sentence,entity1_begin,entity1_end,chunk1,entity1,entity2_begin,entity2_end,chunk2,entity2,relation,confidence
0,1,99,115,doing fairly well,OCCURRENCE,123,131,last week,DURATION,BEFORE,0.5968868
1,1,123,131,last week,DURATION,139,153,mowing the lawn,OCCURRENCE,OVERLAP,0.66523635


As seen above, we only have relations between `OCCURRENCE` and `DURATION` entities, in either order.
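
A quick sanity check on such a table is to confirm that every row involves only the requested entity types. The sketch below does this for the two rows shown above, copied by hand; the helper name `unexpected_rows` is hypothetical, and treating the pair as unordered is an assumption based on that output.

```python
def unexpected_rows(rows, requested_pair):
    """Return the rows whose entity labels do not match the requested pair in either order."""
    wanted = frozenset(requested_pair.split("-"))
    return [row for row in rows if frozenset((row["entity1"], row["entity2"])) != wanted]

rows = [
    {"chunk1": "doing fairly well", "entity1": "OCCURRENCE",
     "chunk2": "last week", "entity2": "DURATION", "relation": "BEFORE"},
    {"chunk1": "last week", "entity1": "DURATION",
     "chunk2": "mowing the lawn", "entity2": "OCCURRENCE", "relation": "OVERLAP"},
]

print(unexpected_rows(rows, "OCCURRENCE-DURATION"))  # prints []
```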

### `setRelationPairsCaseSensitive`
This parameter determines whether relation pairs are case sensitive (Default: False).

We will use the same `ADE` Relation Extraction pipeline as above with `setRelationPairsCaseSensitive(True)` and see the difference.

The NER model returns the `ADE` and `DRUG` entity labels in uppercase. However, we will pass lowercase pairs with `setRelationPairs(["drug-ade, ade-drug"])` and set `setRelationPairsCaseSensitive(True)`. Therefore, we do not expect any relations, since the case of the NER labels and the case of the relation pairs do not match.

In [None]:
ade_re_ner_chunk_filter = medical.RENerChunksFilter() \
    .setInputCols(["ner_chunks", "dependencies"])\
    .setOutputCol("re_ner_chunks")\
    .setMaxSyntacticDistance(10)\
    .setRelationPairs(["drug-ade, ade-drug"])\
    .setRelationPairsCaseSensitive(True)      # True: entity labels must match the pairs in setRelationPairs exactly, including case,
                                              #       so the lowercase pairs above will not match the uppercase DRUG/ADE labels.
                                              # False: relation pairs are matched case-insensitively.


pipeline = nlp.Pipeline(stages=[
    documenter,
    sentencer,
    tokenizer,
    words_embedder,
    pos_tagger,
    ner_tagger,
    ner_chunker,
    dependency_parser,
    ade_re_ner_chunk_filter,
    ade_re_model
])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = pipeline.fit(empty_data)

Create a LightPipeline for annotation.

In [None]:
text ="""A 44-year-old man taking naproxen for chronic low back pain and a 20-year-old woman on oxaprozin for rheumatoid arthritis presented with tense bullae and cutaneous fragility on the face and the back of the hands."""


lmodel = nlp.LightPipeline(model)

results = lmodel.fullAnnotate(text)

Extracted NER chunks:

In [None]:
results[0]['ner_chunks']

[Annotation(chunk, 25, 32, naproxen, {'chunk': '0', 'confidence': '0.996', 'ner_source': 'ner_chunks', 'entity': 'DRUG', 'sentence': '0'}, []),
 Annotation(chunk, 87, 95, oxaprozin, {'chunk': '1', 'confidence': '0.9749', 'ner_source': 'ner_chunks', 'entity': 'DRUG', 'sentence': '0'}, []),
 Annotation(chunk, 137, 148, tense bullae, {'chunk': '2', 'confidence': '0.77250004', 'ner_source': 'ner_chunks', 'entity': 'ADE', 'sentence': '0'}, []),
 Annotation(chunk, 154, 210, cutaneous fragility on the face and the back of the hands, {'chunk': '3', 'confidence': '0.7458182', 'ner_source': 'ner_chunks', 'entity': 'ADE', 'sentence': '0'}, [])]

Show the ADE-DRUG relations in a pandas DataFrame.

In [None]:
print(text, "\n")

rel_df = get_relations_df(results)
rel_df

A 44-year-old man taking naproxen for chronic low back pain and a 20-year-old woman on oxaprozin for rheumatoid arthritis presented with tense bullae and cutaneous fragility on the face and the back of the hands. 



Unnamed: 0,sentence,entity1_begin,entity1_end,chunk1,entity1,entity2_begin,entity2_end,chunk2,entity2,relation,confidence


As seen above, no relation is caught because of the case mismatch between the relation pairs and the NER entities.

Now we will set `setRelationPairsCaseSensitive(False)` and expect to see relations.

In [None]:
ade_re_ner_chunk_filter = medical.RENerChunksFilter() \
    .setInputCols(["ner_chunks", "dependencies"])\
    .setOutputCol("re_ner_chunks")\
    .setMaxSyntacticDistance(10)\
    .setRelationPairs(["drug-ade, ade-drug"])\
    .setRelationPairsCaseSensitive(False)     # False: relation pairs are matched case-insensitively,
                                              #        so the lowercase pairs match the uppercase DRUG/ADE labels.
                                              # True: entity labels must match the pairs in setRelationPairs exactly, including case.


pipeline = nlp.Pipeline(stages=[
    documenter,
    sentencer,
    tokenizer,
    words_embedder,
    pos_tagger,
    ner_tagger,
    ner_chunker,
    dependency_parser,
    ade_re_ner_chunk_filter,
    ade_re_model
])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = pipeline.fit(empty_data)

In [None]:
text ="""A 44-year-old man taking naproxen for chronic low back pain and a 20-year-old woman on oxaprozin for rheumatoid arthritis presented with tense bullae and cutaneous fragility on the face and the back of the hands."""


lmodel = nlp.LightPipeline(model)

results = lmodel.fullAnnotate(text)

In [None]:
print(text, "\n")

rel_df = get_relations_df(results)
rel_df

A 44-year-old man taking naproxen for chronic low back pain and a 20-year-old woman on oxaprozin for rheumatoid arthritis presented with tense bullae and cutaneous fragility on the face and the back of the hands. 



Unnamed: 0,sentence,entity1_begin,entity1_end,chunk1,entity1,entity2_begin,entity2_end,chunk2,entity2,relation,confidence
0,0,25,32,naproxen,DRUG,137,148,tense bullae,ADE,1,0.9989047
1,0,25,32,naproxen,DRUG,154,210,cutaneous fragility on the face and the back o...,ADE,1,0.9989704
2,0,87,95,oxaprozin,DRUG,137,148,tense bullae,ADE,1,0.99895453
3,0,87,95,oxaprozin,DRUG,154,210,cutaneous fragility on the face and the back o...,ADE,1,0.99900633


### Relation Extraction Across Sentences

We can extract relations across the sentences in a document by dropping the `SentenceDetector` from the pipeline. Since ReDL models are trained with BERT embeddings, they can perform better than their RE counterparts. **However**, it all depends on the data we are working with, so we can try different models and use the best one in our pipeline.

In [None]:
import pandas as pd

def get_only_relations_df(results, col='relations'):
    # Collect one tuple per extracted relation from the first annotated document
    rel_pairs = []
    for rel in results[0][col]:
        rel_pairs.append((
            rel.metadata['entity1_begin'],
            rel.metadata['entity1_end'],
            rel.metadata['chunk1'],
            rel.metadata['entity1'],
            rel.metadata['entity2_begin'],
            rel.metadata['entity2_end'],
            rel.metadata['chunk2'],
            rel.metadata['entity2'],
            rel.result,
            rel.metadata['confidence']
        ))

    rel_df = pd.DataFrame(rel_pairs, columns=['entity1_begin', 'entity1_end', 'chunk1', 'entity1', 'entity2_begin', 'entity2_end', 'chunk2', 'entity2', 'relation', 'confidence'])

    return rel_df

In [76]:
documenter = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["document"])\
    .setOutputCol("tokens")

words_embedder = nlp.WordEmbeddingsModel()\
    .pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["document", "tokens"])\
    .setOutputCol("embeddings")

ner_tagger = medical.NerModel.pretrained("ner_ade_clinical", "en", "clinical/models")\
    .setInputCols("document", "tokens", "embeddings")\
    .setOutputCol("ner_tags")

ner_chunker = medical.NerConverterInternal()\
    .setInputCols(["document", "tokens", "ner_tags"])\
    .setOutputCol("ner_chunks")

pos_tagger = nlp.PerceptronModel()\
    .pretrained("pos_clinical", "en", "clinical/models") \
    .setInputCols(["document", "tokens"])\
    .setOutputCol("pos_tags")

dependency_parser = nlp.DependencyParserModel()\
    .pretrained("dependency_conllu", "en")\
    .setInputCols(["document", "pos_tags", "tokens"])\
    .setOutputCol("dependencies")

ade_re_ner_chunk_filter = medical.RENerChunksFilter() \
    .setInputCols(["ner_chunks", "dependencies"])\
    .setOutputCol("re_ner_chunks")\
    .setMaxSyntacticDistance(0)\
    .setRelationPairs(["drug-ade, ade-drug"])

ade_re_model = medical.RelationExtractionDLModel()\
    .pretrained('redl_ade_biobert', 'en', "clinical/models") \
    .setPredictionThreshold(0.5)\
    .setInputCols(["re_ner_chunks", "document"]) \
    .setOutputCol("relations")

ade_pipeline = nlp.Pipeline(stages=[
    documenter,
    tokenizer,
    words_embedder,
    ner_tagger,
    ner_chunker,
    pos_tagger,
    dependency_parser,
    ade_re_ner_chunk_filter,
    ade_re_model
])

empty_data = spark.createDataFrame([[""]]).toDF("text")
ade_model = ade_pipeline.fit(empty_data)

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_ade_clinical download started this may take some time.
[OK!]
pos_clinical download started this may take some time.
Approximate size to download 1.5 MB
[OK!]
dependency_conllu download started this may take some time.
Approximate size to download 16.7 MB
[OK!]
redl_ade_biobert download started this may take some time.
[OK!]


In [None]:
ade_lmodel = nlp.LightPipeline(ade_model)

text ="""A 44-year-old man taking naproxen for chronic low back pain and a 20-year-old woman on oxaprozin for rheumatoid arthritis.
They both presented with tense bullae and cutaneous fragility on the face and the back of the hands.
"""

ade_results = ade_lmodel.fullAnnotate(text)
rel_df = get_only_relations_df(ade_results)

rel_df

Unnamed: 0,entity1_begin,entity1_end,chunk1,entity1,entity2_begin,entity2_end,chunk2,entity2,relation,confidence
0,25,32,naproxen,DRUG,148,159,tense bullae,ADE,1,0.9988153
1,25,32,naproxen,DRUG,165,221,cutaneous fragility on the face and the back o...,ADE,1,0.9989286
2,87,95,oxaprozin,DRUG,148,159,tense bullae,ADE,1,0.9987915
3,87,95,oxaprozin,DRUG,165,221,cutaneous fragility on the face and the back o...,ADE,1,0.9988927
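
Because this pipeline has no `SentenceDetector`, the whole text is treated as one document. To confirm that the relations above really span the two lines of the input, one can compare the entity offsets with the position of the line break. This is a rough sketch that uses the newline as the sentence boundary, an assumption that happens to hold for this input; `crosses_line_break` is a hypothetical helper.

```python
text = """A 44-year-old man taking naproxen for chronic low back pain and a 20-year-old woman on oxaprozin for rheumatoid arthritis.
They both presented with tense bullae and cutaneous fragility on the face and the back of the hands.
"""

boundary = text.index("\n")  # offset of the first line break

def crosses_line_break(entity1_end, entity2_begin):
    """True when the first entity ends before the break and the second starts after it."""
    return entity1_end < boundary < entity2_begin

# Offsets copied from the table above: (entity1_end, entity2_begin)
print(crosses_line_break(32, 148))  # naproxen, tense bullae
print(crosses_line_break(95, 165))  # oxaprozin, cutaneous fragility ...
```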


In [77]:
text = """The patient continued to receive regular insulin 4 times per day over the following 3 years with only occasional hives.
Two days ago, suddenly severe urticaria, angioedema, and occasional wheezing began."""

ade_results = ade_lmodel.fullAnnotate(text)
rel_df = get_only_relations_df(ade_results)

rel_df

Unnamed: 0,entity1_begin,entity1_end,chunk1,entity1,entity2_begin,entity2_end,chunk2,entity2,relation,confidence
0,41,47,insulin,DRUG,113,117,hives,ADE,1,0.99795294
1,41,47,insulin,DRUG,143,158,severe urticaria,ADE,1,0.9975689
2,41,47,insulin,DRUG,161,170,angioedema,ADE,1,0.9965976
3,41,47,insulin,DRUG,177,195,occasional wheezing,ADE,1,0.99725205
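
A relation table like the one above is often easier to read when grouped by drug. The sketch below does this with a plain dictionary, using the four rows shown above copied by hand; the variable names are illustrative.

```python
from collections import defaultdict

# (chunk1, chunk2) pairs copied from the table above
rows = [
    ("insulin", "hives"),
    ("insulin", "severe urticaria"),
    ("insulin", "angioedema"),
    ("insulin", "occasional wheezing"),
]

# Group the ADE chunks under the drug they were related to
ades_by_drug = defaultdict(list)
for drug, ade in rows:
    ades_by_drug[drug].append(ade)

for drug, ades in ades_by_drug.items():
    print(f"{drug}: {', '.join(ades)}")
```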
