![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/Spark_NLP_Udemy_MOOC/Healthcare_NLP/REChunkMerger.ipynb)

# **📜 REChunkMerger**


The **`REChunkMerger`** annotator takes the entity pairs found by a relation extraction model and merges each pair into a single chunk, joining the two entities with a customizable separator.

**📖 Learning Objectives:**

1. Understand how to use the annotator.

2. Become comfortable using the different parameters of the annotator.

**🔗 Helpful Links:**

- Reference Documentation: [REChunkMerger](https://nlp.johnsnowlabs.com/docs/en/licensed_annotators)


## **🎬 Colab Setup**

In [None]:
# Install the johnsnowlabs library to access Spark-NLP for Healthcare
! pip install -q johnsnowlabs

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp, medical

# After uploading your license, run this to install the licensed Python wheels and pre-download the JARs for the Spark session JVM
nlp.settings.enforce_versions=False
nlp.install(refresh_install=True)

In [None]:
# Automatically load the license data and start a session with all the JARs the user has access to
spark = nlp.start()

In [None]:
spark

## **🖨️ Input/Output Annotation Types**

- Input: `CHUNK`

- Output: `CHUNK`

## **🔎 Parameters**


- `setSeparator`: Sets the string placed between the two related entities in each merged chunk, for example `" && "` or `" >>> "`.
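As a rough illustration of what the separator does, the plain-Python sketch below joins entity pairs into the kind of phrase shown in the outputs later in this notebook. `merge_relation_chunks` is a hypothetical helper written only for this example; it is not part of Spark NLP, and the pairs are typed in by hand rather than produced by a model.

```python
# Plain-Python illustration only; this is NOT the Spark NLP implementation.
def merge_relation_chunks(pairs, separator):
    """Join each (first_entity, second_entity) pair into one phrase."""
    return [f"{first}{separator}{second}" for first, second in pairs]


# Hand-written entity pairs based on the sample sentence used in this notebook.
pairs = [
    ("gestational diabetes mellitus", "subsequent type two diabetes mellitus"),
    ("gestational diabetes mellitus", "T2DM"),
]

merged = merge_relation_chunks(pairs, separator=" && ")
for phrase in merged:
    print(phrase)
```

Including the surrounding spaces in the separator string (`" && "` rather than `"&&"`) keeps the merged phrase readable.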

      
  

### Pipeline

In [None]:
documenter = nlp.DocumentAssembler() \
    .setInputCol("sentence") \
    .setOutputCol("document")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("tokens")

words_embedder = nlp.WordEmbeddingsModel() \
    .pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["document", "tokens"]) \
    .setOutputCol("embeddings")

pos_tagger = nlp.PerceptronModel() \
    .pretrained("pos_clinical", "en", "clinical/models") \
    .setInputCols(["document", "tokens"]) \
    .setOutputCol("pos_tags")

ner_tagger = medical.NerModel() \
    .pretrained("ner_clinical", "en", "clinical/models") \
    .setInputCols(["document", "tokens", "embeddings"]) \
    .setOutputCol("ner_tags")

ner_converter = medical.NerConverter() \
    .setInputCols(["document", "tokens", "ner_tags"]) \
    .setOutputCol("ner_chunks")

depency_parser = nlp.DependencyParserModel() \
    .pretrained("dependency_conllu", "en") \
    .setInputCols(["document", "pos_tags", "tokens"]) \
    .setOutputCol("dependencies")

re_model = medical.RelationExtractionModel \
    .pretrained("re_clinical", "en", "clinical/models") \
    .setCustomLabels({"TeRP": "CustomLabel_TeRP", "TrWP": "CustomLabel_TrWP"}) \
    .setInputCols(["embeddings", "pos_tags", "ner_chunks", "dependencies"]) \
    .setOutputCol("re_chunk")

re_chunk_merger = medical.REChunkMerger() \
    .setInputCols(["re_chunk"]) \
    .setOutputCol("relation_chunks") \
    .setSeparator(" && ")

nlpPipeline = nlp.Pipeline(
    stages=[
        documenter,
        tokenizer,
        words_embedder,
        pos_tagger,
        ner_tagger,
        ner_converter,
        depency_parser,
        re_model,
        re_chunk_merger
    ])

empty_data = spark.createDataFrame([[""]]).toDF("sentence")

model = nlpPipeline.fit(empty_data)

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
pos_clinical download started this may take some time.
Approximate size to download 1.5 MB
[OK!]
ner_clinical download started this may take some time.
[OK!]
dependency_conllu download started this may take some time.
Approximate size to download 16.7 MB
[OK!]
re_clinical download started this may take some time.
[OK!]


In [None]:
text = ("28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to "
        "presentation and subsequent type two diabetes mellitus ( T2DM ).")

result = model.transform(spark.createDataFrame([[text]]).toDF("sentence"))

In [None]:
result.selectExpr("explode(relation_chunks.result) result").show(truncate=False)

+----------------------------------------------------------------------+
|result                                                                |
+----------------------------------------------------------------------+
|gestational diabetes mellitus && subsequent type two diabetes mellitus|
|gestational diabetes mellitus && T2DM                                 |
|subsequent type two diabetes mellitus && T2DM                         |
+----------------------------------------------------------------------+



**Experimenting with a different separator on the same sample sentence.**

In [None]:
re_chunk_merger = medical.REChunkMerger() \
    .setInputCols(["re_chunk"]) \
    .setOutputCol("relation_chunks_2") \
    .setSeparator(" >>> ")

nlpPipeline = nlp.Pipeline(
    stages=[
        documenter,
        tokenizer,
        words_embedder,
        pos_tagger,
        ner_tagger,
        ner_converter,
        depency_parser,
        re_model,
        re_chunk_merger
    ])

empty_data = spark.createDataFrame([[""]]).toDF("sentence")

model = nlpPipeline.fit(empty_data)

In [None]:
result = model.transform(spark.createDataFrame([[text]]).toDF("sentence"))

result.selectExpr("explode(relation_chunks_2.result) result").show(truncate=False)

+-----------------------------------------------------------------------+
|result                                                                 |
+-----------------------------------------------------------------------+
|gestational diabetes mellitus >>> subsequent type two diabetes mellitus|
|gestational diabetes mellitus >>> T2DM                                 |
|subsequent type two diabetes mellitus >>> T2DM                         |
+-----------------------------------------------------------------------+
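If the merged phrases need to be split back into their two entities downstream, the separator is what makes that possible. The following is a minimal sketch, assuming the separator never occurs inside the entity text itself. `split_merged_chunk` is a hypothetical helper, not a Spark NLP API, and the rows are copied by hand from the output above.

```python
def split_merged_chunk(merged, separator):
    """Split a merged phrase back into (first_entity, second_entity).

    Splits at the first occurrence of the separator, so the result is only
    reliable when the separator does not appear inside the entity text.
    """
    first, found, second = merged.partition(separator)
    if not found:
        raise ValueError(f"separator {separator!r} not found in {merged!r}")
    return first, second


# Rows copied from the output table above.
rows = [
    "gestational diabetes mellitus >>> subsequent type two diabetes mellitus",
    "gestational diabetes mellitus >>> T2DM",
    "subsequent type two diabetes mellitus >>> T2DM",
]

entity_pairs = [split_merged_chunk(row, " >>> ") for row in rows]
for first, second in entity_pairs:
    print(f"{first!r} | {second!r}")
```

This is one reason to prefer a distinctive separator such as `" >>> "` over a single space: a space also occurs inside multi-word entities, so the split point would be ambiguous.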

