![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# RENerChunksFilter

The `RENerChunksFilter` annotator filters NER chunks, keeping only those that form the desired NER entity pairs.

**📖 Learning Objectives:**

1. Understand how to filter the desired relation pairs for the `RelationExtractionDLModel`.

2. Become comfortable using the different parameters of the annotator.

**🔗 Helpful Links:**

* Documentation : [RENerChunksFilter](https://nlp.johnsnowlabs.com/docs/en/licensed_annotators#renerchunksfilter)

* For extended examples of usage, see the [Spark NLP Workshop](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/03.0.Clinical_Relation_Extraction.ipynb)

* Python Documentation: [RENerChunksFilter](https://nlp.johnsnowlabs.com/licensed/api/python/reference/autosummary/sparknlp_jsl/annotator/re/relation_ner_chunk_filter/index.html#sparknlp_jsl.annotator.re.relation_ner_chunk_filter.RENerChunksFilter.name)

* Scala Documentation: [RENerChunksFilter](https://nlp.johnsnowlabs.com/licensed/api/com/johnsnowlabs/nlp/annotators/re/RENerChunksFilter.html)

* Relation Extraction Models and Relation Pairs Table: [In this link](https://nlp.johnsnowlabs.com/docs/en/best_practices_pretrained_models#relation-extraction-models-and-relation-pairs-table). It lists the available Relation Extraction models, their labels, the associated NER models, and the meaningful relation pairs to filter.

## **📜 Background**


Filtering the possible relations is useful when a use case targets specific relation types (e.g., checking adverse drug reactions and drug relations). The filtered chunks can then be the input for further analysis using a pretrained `RelationExtractionDLModel`.

This annotator keeps the NER chunks that form the desired relation pairs and excludes the pairs between other entities.
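The idea can be sketched in plain Python, independent of Spark NLP. This is a minimal illustration, not the annotator's implementation: the chunks are hypothetical, and it assumes that every unordered pair of chunks in a sentence is a candidate before filtering.

```python
from itertools import combinations

# Hypothetical NER chunks from one sentence: (text, entity label).
chunks = [
    ("aspirin", "DRUG"),
    ("ibuprofen", "DRUG"),
    ("rash", "ADE"),
    ("nausea", "ADE"),
]

# Assumption: every unordered pair of chunks is a candidate relation.
candidates = list(combinations(chunks, 2))

# Keep only the pairs whose labels form a desired combination.
wanted = {frozenset({"DRUG", "ADE"})}
kept = [(a, b) for a, b in candidates if frozenset({a[1], b[1]}) in wanted]

print(len(candidates))  # 6
print(len(kept))        # 4
for (text1, label1), (text2, label2) in kept:
    print(f"{label1}:{text1} <-> {label2}:{text2}")
```

The DRUG-DRUG and ADE-ADE pairs are dropped, so a downstream relation model would only need to classify the four DRUG-ADE candidates.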

## **🎬 Colab Setup**

In [1]:
# Install the johnsnowlabs library to access Spark-NLP for Healthcare
! pip install -q johnsnowlabs


In [2]:
# Upload license keys
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

Please Upload your John Snow Labs License using the button below


Saving 5.3.3.spark_nlp_for_healthcare.json to 5.3.3.spark_nlp_for_healthcare.json


In [3]:
# Install licensed libraries
from johnsnowlabs import nlp, medical

nlp.install()

👌 Detected license file /content/5.3.3.spark_nlp_for_healthcare.json
🚨 Outdated Medical Secrets in license file. Version=5.3.3 but should be Version=5.3.2
🚨 Outdated OCR Secrets in license file. Version=5.1.2 but should be Version=5.3.2
👷 Trying to install compatible secrets. Use nlp.settings.enforce_versions=False if you want to install outdated secrets.
📋 Stored John Snow Labs License in /root/.johnsnowlabs/licenses/license_number_0_for_Spark-Healthcare_Spark-OCR.json
👷 Setting up  John Snow Labs home in /root/.johnsnowlabs, this might take a few minutes.
Downloading 🐍+🚀 Python Library spark_nlp-5.3.2-py2.py3-none-any.whl
Downloading 🐍+💊 Python Library spark_nlp_jsl-5.3.2-py3-none-any.whl
Downloading 🫘+🚀 Java Library spark-nlp-assembly-5.3.2.jar
Downloading 🫘+💊 Java Library spark-nlp-jsl-5.3.2.jar
🙆 JSL Home setup in /root/.johnsnowlabs

In [4]:
import pyspark.sql.functions as F

# Start Spark Session
spark = nlp.start()
spark

👌 Detected license file /content/5.3.3.spark_nlp_for_healthcare.json
👷 Trying to install compatible secrets. Use nlp.settings.enforce_versions=False if you want to install outdated secrets.
👌 Launched cpu optimized session with with: 🚀Spark-NLP==5.3.2, 💊Spark-Healthcare==5.3.2, running on ⚡ PySpark==3.4.0


## **🖨️ Input/Output Annotation Types**
- Input: `CHUNK, DEPENDENCY`
- Output: `CHUNK`

## **🔎 Parameters**


- `maxSyntacticDistance` *(Int)*: Maximum syntactic distance between a pair of named entities to consider them as a relation. Increasing this value will increase recall, but also increase the number of false positives.

- `relationPairs` *(List[Str])*: List of dash-separated pairs of named entities. For example, `["Biomarker-RelativeDay"]` will process all relations between entities of type `Biomarker` and `RelativeDay`.

- `relationPairsCaseSensitive` *(Boolean)*: Determines whether relation pairs are case sensitive.


### `setMaxSyntacticDistance`

This setter defines the maximum syntactic distance, which is used as a threshold.

`maxSyntacticDistance` controls the precision of the RE model: it sets the maximum syntactic distance allowed between two named entities and therefore determines which pairs are passed on for classification. A larger value improves recall at the expense of lower precision.
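To build intuition for what a "syntactic distance" might look like, here is a small plain-Python sketch. It assumes the distance is the number of dependency edges on the path between two tokens, which is one common definition; the library's exact computation may differ. The toy parse and the `heads` encoding (index of each token's head, `-1` for the root) are made up for illustration.

```python
def path_to_root(heads, i):
    """Return the token indices from token i up to the root (inclusive)."""
    path = [i]
    while heads[path[-1]] != -1:
        path.append(heads[path[-1]])
    return path


def syntactic_distance(heads, i, j):
    """Count the dependency edges on the path between tokens i and j."""
    steps_from_j = {node: steps for steps, node in enumerate(path_to_root(heads, j))}
    for steps_from_i, node in enumerate(path_to_root(heads, i)):
        if node in steps_from_j:  # first common ancestor
            return steps_from_i + steps_from_j[node]
    raise ValueError("tokens are not in the same tree")


# Toy parse of "She took aspirin and developed a rash" (hypothetical).
tokens = ["She", "took", "aspirin", "and", "developed", "a", "rash"]
heads = [1, -1, 1, 4, 1, 6, 4]

print(syntactic_distance(heads, 2, 6))  # aspirin <-> rash: 3
print(syntactic_distance(heads, 0, 2))  # She <-> aspirin: 2
```

Under this definition, entities in different clauses tend to be further apart in the tree, which is why a lower threshold removes more loosely connected pairs.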

Let's compare the results for different values on a sample text.

First, we set `maxSyntacticDistance` to `10`.

In [5]:
documenter = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencer = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentences")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentences"])\
    .setOutputCol("tokens")

words_embedder = nlp.WordEmbeddingsModel()\
    .pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("embeddings")

pos_tagger = nlp.PerceptronModel()\
    .pretrained("pos_clinical", "en", "clinical/models") \
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("pos_tags")

ner_tagger = medical.NerModel.pretrained("ner_ade_clinical", "en", "clinical/models")\
    .setInputCols("sentences", "tokens", "embeddings")\
    .setOutputCol("ner_tags")

ner_chunker = medical.NerConverter()\
    .setInputCols(["sentences", "tokens", "ner_tags"])\
    .setOutputCol("ner_chunks")

dependency_parser = nlp.DependencyParserModel()\
    .pretrained("dependency_conllu", "en")\
    .setInputCols(["sentences", "pos_tags", "tokens"])\
    .setOutputCol("dependencies")

ade_re_ner_chunk_filter = medical.RENerChunksFilter() \
    .setInputCols(["ner_chunks", "dependencies"])\
    .setOutputCol("re_ner_chunks")\
    .setMaxSyntacticDistance(10)

ade_re_model = medical.RelationExtractionDLModel()\
    .pretrained('redl_ade_biobert', 'en', "clinical/models") \
    .setInputCols(["re_ner_chunks", "sentences"]) \
    .setPredictionThreshold(0.5)\
    .setOutputCol("relations")

pipeline = nlp.Pipeline(stages=[
    documenter,
    sentencer,
    tokenizer,
    words_embedder,
    pos_tagger,
    ner_tagger,
    ner_chunker,
    dependency_parser,
    ade_re_ner_chunk_filter,
    ade_re_model
])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = pipeline.fit(empty_data)


embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
pos_clinical download started this may take some time.
Approximate size to download 1.5 MB
[OK!]
ner_ade_clinical download started this may take some time.
[OK!]
dependency_conllu download started this may take some time.
Approximate size to download 16.7 MB
[OK!]
redl_ade_biobert download started this may take some time.
[OK!]


In [6]:
example = """A 44-year-old man taking naproxen for chronic low back pain and a 20-year-old woman on oxaprozin for rheumatoid arthritis presented with tense bullae and cutaneous fragility on the face and the back of the hands."""

spark_df = spark.createDataFrame([[example]]).toDF("text")

results = model.transform(spark_df)

In [7]:
results.select(
    F.explode(F.arrays_zip(results.relations.result, results.relations.metadata)).alias("cols")
).select(
    F.expr("cols['0']").alias("relation"),
    F.expr("cols['1'].chunk1").alias("chunk1"),
    F.expr("cols['1'].entity1").alias("label1"),
    F.expr("cols['1'].chunk2").alias("chunk2"),
    F.expr("cols['1'].entity2").alias("label2"),
    F.expr("cols['1'].syntactic_distance ").alias("syntactic_distance ")
).show()

+--------+------------+------+--------------------+------+-------------------+
|relation|      chunk1|label1|              chunk2|label2|syntactic_distance |
+--------+------------+------+--------------------+------+-------------------+
|       1|    naproxen|  DRUG|           oxaprozin|  DRUG|                  3|
|       1|    naproxen|  DRUG|        tense bullae|   ADE|                  6|
|       1|    naproxen|  DRUG|cutaneous fragili...|   ADE|                  7|
|       1|   oxaprozin|  DRUG|        tense bullae|   ADE|                  3|
|       1|   oxaprozin|  DRUG|cutaneous fragili...|   ADE|                  4|
|       1|tense bullae|   ADE|cutaneous fragili...|   ADE|                  3|
+--------+------------+------+--------------------+------+-------------------+



Now, we will set the `maxSyntacticDistance` to `4` and see the difference from the previous results.

In [8]:
ade_re_ner_chunk_filter = medical.RENerChunksFilter() \
    .setInputCols(["ner_chunks", "dependencies"])\
    .setOutputCol("re_ner_chunks")\
    .setMaxSyntacticDistance(4)

pipeline = nlp.Pipeline(stages=[
    documenter,
    sentencer,
    tokenizer,
    words_embedder,
    pos_tagger,
    ner_tagger,
    ner_chunker,
    dependency_parser,
    ade_re_ner_chunk_filter,
    ade_re_model
])

model = pipeline.fit(empty_data)

In [9]:
results = model.transform(spark_df)

In [10]:
results.select(
    F.explode(F.arrays_zip(results.relations.result, results.relations.metadata)).alias("cols")
).select(
    F.expr("cols['0']").alias("relation"),
    F.expr("cols['1'].chunk1").alias("chunk1"),
    F.expr("cols['1'].entity1").alias("label1"),
    F.expr("cols['1'].chunk2").alias("chunk2"),
    F.expr("cols['1'].entity2").alias("label2"),
    F.expr("cols['1'].syntactic_distance ").alias("syntactic_distance ")
).show()

+--------+------------+------+--------------------+------+-------------------+
|relation|      chunk1|label1|              chunk2|label2|syntactic_distance |
+--------+------------+------+--------------------+------+-------------------+
|       1|    naproxen|  DRUG|           oxaprozin|  DRUG|                  3|
|       1|   oxaprozin|  DRUG|        tense bullae|   ADE|                  3|
|       1|   oxaprozin|  DRUG|cutaneous fragili...|   ADE|                  4|
|       1|tense bullae|   ADE|cutaneous fragili...|   ADE|                  3|
+--------+------------+------+--------------------+------+-------------------+



We can see that the relations with a syntactic distance greater than 4 were removed from the results. In this case, two entities were close to each other, so the higher maximum syntactic distance was producing false positives; lowering the parameter gave the expected result.
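The difference between the two tables can be reproduced with a simple threshold in plain Python. This is only a sketch of the observed behavior, not the annotator's code; the rows are copied from the first output table (chunk text truncated exactly as displayed), and the inclusive comparison is inferred from the pair at distance 4 being kept.

```python
# (chunk1, chunk2, syntactic_distance) from the output with maxSyntacticDistance=10.
rows = [
    ("naproxen", "oxaprozin", 3),
    ("naproxen", "tense bullae", 6),
    ("naproxen", "cutaneous fragili...", 7),
    ("oxaprozin", "tense bullae", 3),
    ("oxaprozin", "cutaneous fragili...", 4),
    ("tense bullae", "cutaneous fragili...", 3),
]


def within_distance(rows, max_distance):
    """Keep rows whose distance is at most max_distance (inclusive)."""
    return [row for row in rows if row[2] <= max_distance]


kept = within_distance(rows, 4)
dropped = [row for row in rows if row not in kept]

print(len(kept))  # 4
print(dropped)    # the two naproxen-ADE pairs at distances 6 and 7
```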

### `setRelationPairs`


This parameter is used for setting the list of dash-separated pairs of named entities. For example, `["drug-ade"]` will process all relations between entities of type "drug" and "ade".

Now, we will set `setRelationPairs(["OCCURRENCE-DURATION"])` and expect only "OCCURRENCE-DURATION" and "DURATION-OCCURRENCE" relations.
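As a rough mental model, the matching described above can be sketched in plain Python. The helper names `parse_relation_pairs` and `is_allowed` are hypothetical, and treating each pair as unordered is an assumption taken from the expectation stated above (both orders appear for a single configured pair); the annotator's actual logic may differ.

```python
def parse_relation_pairs(pairs):
    """Split dash-separated pair strings into unordered label pairs."""
    parsed = set()
    for pair in pairs:
        first, second = pair.split("-", 1)
        parsed.add(frozenset({first.strip(), second.strip()}))
    return parsed


def is_allowed(label1, label2, allowed):
    """Check whether two entity labels form one of the allowed pairs."""
    return frozenset({label1, label2}) in allowed


allowed = parse_relation_pairs(["OCCURRENCE-DURATION"])

print(is_allowed("OCCURRENCE", "DURATION", allowed))    # True
print(is_allowed("DURATION", "OCCURRENCE", allowed))    # True
print(is_allowed("OCCURRENCE", "OCCURRENCE", allowed))  # False
```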

In [11]:
ner_tagger_clinical = medical.NerModel.pretrained("ner_events_clinical", "en", "clinical/models")\
    .setInputCols("sentences", "tokens", "embeddings")\
    .setOutputCol("ner_tags")

clinical_re_ner_chunk_filter = medical.RENerChunksFilter() \
    .setInputCols(["ner_chunks", "dependencies"])\
    .setOutputCol("re_ner_chunks")\
    .setMaxSyntacticDistance(10)\
    .setRelationPairs(["OCCURRENCE-DURATION"])

clinical_re_model = medical.RelationExtractionDLModel()\
    .pretrained('redl_temporal_events_biobert', 'en', "clinical/models") \
    .setInputCols(["re_ner_chunks", "sentences"]) \
    .setPredictionThreshold(0.5)\
    .setOutputCol("relations")

pipeline = nlp.Pipeline(stages=[
    documenter,
    sentencer,
    tokenizer,
    words_embedder,
    pos_tagger,
    ner_tagger_clinical,
    ner_chunker,
    dependency_parser,
    clinical_re_ner_chunk_filter,
    clinical_re_model
])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = pipeline.fit(empty_data)

ner_events_clinical download started this may take some time.
[OK!]
redl_temporal_events_biobert download started this may take some time.
[OK!]


In [12]:
example = """The patient is a 68-year-old Caucasian male with past medical history of diabetes mellitus.
He was doing fairly well until last week while mowing the lawn, he injured his right foot. """

spark_df = spark.createDataFrame([[example]]).toDF("text")
results = model.transform(spark_df)

Let's show the OCCURRENCE-DURATION relations.

In [13]:
results.select(
    F.explode(F.arrays_zip(results.relations.result, results.relations.metadata)).alias("cols")
).select(
    F.expr("cols['0']").alias("relation"),
    F.expr("cols['1'].chunk1").alias("chunk1"),
    F.expr("cols['1'].entity1").alias("label1"),
    F.expr("cols['1'].chunk2").alias("chunk2"),
    F.expr("cols['1'].entity2").alias("label2"),
    F.expr("cols['1'].syntactic_distance ").alias("syntactic_distance ")
).show()

+--------+-----------------+----------+---------------+----------+-------------------+
|relation|           chunk1|    label1|         chunk2|    label2|syntactic_distance |
+--------+-----------------+----------+---------------+----------+-------------------+
|  BEFORE|doing fairly well|OCCURRENCE|      last week|  DURATION|                  3|
| OVERLAP|        last week|  DURATION|mowing the lawn|OCCURRENCE|                  3|
+--------+-----------------+----------+---------------+----------+-------------------+



As seen above, we only have "OCCURRENCE-DURATION" relations.

### `setRelationPairsCaseSensitive`

This parameter determines whether relation pairs are case sensitive (Default: `False`).
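The effect of this flag can be illustrated with a small plain-Python comparison. The function `pair_matches` is a hypothetical helper written for this sketch, not part of the library, and it assumes that case-insensitive matching simply lowercases both sides.

```python
def pair_matches(label1, label2, relation_pair, case_sensitive):
    """Compare two entity labels against one dash-separated relation pair."""
    first, second = (part.strip() for part in relation_pair.split("-", 1))
    if not case_sensitive:
        # Assumption: case-insensitive matching lowercases every label.
        label1, label2, first, second = (
            s.lower() for s in (label1, label2, first, second)
        )
    return {label1, label2} == {first, second}


# NER labels are uppercased, the configured pair is lowercased.
print(pair_matches("DRUG", "ADE", "drug-ade", case_sensitive=True))   # False
print(pair_matches("DRUG", "ADE", "drug-ade", case_sensitive=False))  # True
```

With case-sensitive matching, a lowercased pair never matches uppercased NER labels, so every candidate would be filtered out.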


In this example, the NER model contains `ADE` and `DRUG` entities (uppercased). We first set `relationPairs` to `["drug-ade, ade-drug"]` (lowercased) and `relationPairsCaseSensitive` to `True`.

In [14]:
ade_re_ner_chunk_filter = medical.RENerChunksFilter() \
    .setInputCols(["ner_chunks", "dependencies"])\
    .setOutputCol("re_ner_chunks")\
    .setMaxSyntacticDistance(10)\
    .setRelationPairs(["drug-ade, ade-drug"])\
    .setRelationPairsCaseSensitive(True)

pipeline = nlp.Pipeline(stages=[
    documenter,
    sentencer,
    tokenizer,
    words_embedder,
    pos_tagger,
    ner_tagger,
    ner_chunker,
    dependency_parser,
    ade_re_ner_chunk_filter,
    ade_re_model
])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = pipeline.fit(empty_data)

In [15]:
example = """A 44-year-old man taking naproxen for chronic low back pain and a 20-year-old woman on oxaprozin for rheumatoid arthritis presented with tense bullae and cutaneous fragility on the face and the back of the hands."""

spark_df = spark.createDataFrame([[example]]).toDF("text")

results = model.transform(spark_df)

In [16]:
results.select(
    F.explode(F.arrays_zip(results.relations.result, results.relations.metadata)).alias("cols")
).select(
    F.expr("cols['0']").alias("relation"),
    F.expr("cols['1'].chunk1").alias("chunk1"),
    F.expr("cols['1'].entity1").alias("label1"),
    F.expr("cols['1'].chunk2").alias("chunk2"),
    F.expr("cols['1'].entity2").alias("label2"),
    F.expr("cols['1'].syntactic_distance ").alias("syntactic_distance ")
).show()

+--------+------+------+------+------+-------------------+
|relation|chunk1|label1|chunk2|label2|syntactic_distance |
+--------+------+------+------+------+-------------------+
+--------+------+------+------+------+-------------------+



As seen above, no relation was identified by the model because of the mismatch between the relation pairs and the NER entities.

Now we will set `relationPairsCaseSensitive` to `False` and expect to see relations.

In [17]:
ade_re_ner_chunk_filter = medical.RENerChunksFilter() \
    .setInputCols(["ner_chunks", "dependencies"])\
    .setOutputCol("re_ner_chunks")\
    .setMaxSyntacticDistance(10)\
    .setRelationPairs(["drug-ade, ade-drug"])\
    .setRelationPairsCaseSensitive(False)

pipeline = nlp.Pipeline(stages=[
    documenter,
    sentencer,
    tokenizer,
    words_embedder,
    pos_tagger,
    ner_tagger,
    ner_chunker,
    dependency_parser,
    ade_re_ner_chunk_filter,
    ade_re_model
])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = pipeline.fit(empty_data)

In [18]:
results = model.transform(spark_df)

In [19]:
results.select(
    F.explode(F.arrays_zip(results.relations.result, results.relations.metadata)).alias("cols")
).select(
    F.expr("cols['0']").alias("relation"),
    F.expr("cols['1'].chunk1").alias("chunk1"),
    F.expr("cols['1'].entity1").alias("label1"),
    F.expr("cols['1'].chunk2").alias("chunk2"),
    F.expr("cols['1'].entity2").alias("label2"),
    F.expr("cols['1'].syntactic_distance ").alias("syntactic_distance ")
).show()

+--------+---------+------+--------------------+------+-------------------+
|relation|   chunk1|label1|              chunk2|label2|syntactic_distance |
+--------+---------+------+--------------------+------+-------------------+
|       1| naproxen|  DRUG|        tense bullae|   ADE|                  6|
|       1| naproxen|  DRUG|cutaneous fragili...|   ADE|                  7|
|       1|oxaprozin|  DRUG|        tense bullae|   ADE|                  3|
|       1|oxaprozin|  DRUG|cutaneous fragili...|   ADE|                  4|
+--------+---------+------+--------------------+------+-------------------+

