![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# **AnnotationMerger**

This notebook covers the usage of `AnnotationMerger`, an annotator that merges columns of the same annotation type, produced by two or more annotators, into a single column.

**📖 Learning Objectives:**

- Merging two or more annotation results of the same type in a Spark NLP pipeline


**🔗 Helpful Links:**

- Documentation : [AnnotationMerger](https://nlp.johnsnowlabs.com/docs/en/licensed_annotators#annotationmerger)

- Python Docs : [AnnotationMerger](https://nlp.johnsnowlabs.com/licensed/api/python/reference/autosummary/sparknlp_jsl/annotator/annotation_merger/index.html#sparknlp_jsl.annotator.annotation_merger.AnnotationMerger)

- Scala Docs : [AnnotationMerger](https://nlp.johnsnowlabs.com/licensed/api/com/johnsnowlabs/annotator/AnnotationMerger.html)

## **📜 Background**


`AnnotationMerger` merges annotations that come from different pipeline steps but share the same annotation type into a single, unified annotation column. Annotation types that can be merged include:

- document (e.g., output of `DocumentAssembler` annotator)
- token (e.g., output of `Tokenizer` annotator)
- word_embeddings (e.g., output of `WordEmbeddingsModel` annotator)
- sentence_embeddings (e.g., output of `BertSentenceEmbeddings` annotator)
- category (e.g., output of `RelationExtractionModel` annotator)
- date (e.g., output of `DateMatcher` annotator)
- sentiment (e.g., output of `SentimentDLModel` annotator)
- pos (e.g., output of `PerceptronModel` annotator)
- chunk (e.g., output of `NerConverter` annotator)
- named_entity (e.g., output of `NerDLModel` annotator)
- dependency (e.g., output of `DependencyParserModel` annotator)
- language (e.g., output of `LanguageDetectorDL` annotator)
- keyword (e.g., output of `YakeModel` annotator)
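
Conceptually, merging amounts to collecting the annotations from several same-typed columns into one list. The following plain-Python sketch illustrates that idea only; it is not the library's implementation. The `Annotation` dataclass and `merge_annotations` function are hypothetical names, and both the input-order concatenation and the type check are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Annotation:
    """Hypothetical stand-in for a Spark NLP annotation (illustration only)."""
    annotator_type: str
    begin: int
    end: int
    result: str
    metadata: Dict[str, str] = field(default_factory=dict)


def merge_annotations(columns: List[List[Annotation]], input_type: str) -> List[Annotation]:
    """Concatenate same-typed annotation columns in the order they are given."""
    merged: List[Annotation] = []
    for column in columns:
        for ann in column:
            # Assumption: a column of a different type is treated as an error.
            if ann.annotator_type != input_type:
                raise ValueError(f"expected '{input_type}', got '{ann.annotator_type}'")
            merged.append(ann)
    return merged


# Two toy "columns" of token annotations
tokens = [Annotation("token", 0, 2, "The"), Annotation("token", 4, 10, "results")]
normalized = [Annotation("token", 0, 2, "the"), Annotation("token", 4, 10, "results")]

all_tokens = merge_annotations([tokens, normalized], input_type="token")
print([a.result for a in all_tokens])
```

In this sketch no deduplication happens: every annotation from every input column appears in the merged list.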

## **🎬 Colab Setup**

In [None]:
!pip install -q johnsnowlabs

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp

nlp.install()

In [None]:
from johnsnowlabs import nlp, medical
import pyspark.sql.functions as F
import pandas as pd

spark = nlp.start()

## Helper Functions

In [None]:
def get_relations_df(results, rel_col='relations', chunk_col='ner_chunks'):
    """Convert a relation column from LightPipeline `fullAnnotate` results
    into a pandas DataFrame. `chunk_col` is currently unused."""
    rel_pairs = []

    for rel in results[0][rel_col]:
        rel_pairs.append((
            rel.metadata['entity1_begin'],
            rel.metadata['entity1_end'],
            rel.metadata['chunk1'], 
            rel.metadata['entity1'], 
            rel.metadata['entity2_begin'],
            rel.metadata['entity2_end'],
            rel.metadata['chunk2'], 
            rel.metadata['entity2'],
            rel.result, 
            rel.metadata['confidence'],
        ))

    rel_df = pd.DataFrame(rel_pairs, columns=['entity1_begin', 'entity1_end', 
                                              'chunk1', 'entity1', 'entity2_begin', 
                                              'entity2_end', 'chunk2', 'entity2', 
                                              'relation', 'confidence'])

    return rel_df
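
The helper above reads `results[0][rel_col]` and expects each item to expose a `.result` attribute and a `.metadata` dictionary. A pandas-free sketch of that input shape follows, with `SimpleNamespace` standing in for a real annotation object. `relation_rows` and `fake_relation` are hypothetical names, and storing the metadata values as strings is an assumption.

```python
from types import SimpleNamespace


def relation_rows(results, rel_col="relations"):
    """Flatten relation annotations into plain dicts (simplified sketch)."""
    rows = []
    for rel in results[0][rel_col]:
        rows.append({
            "chunk1": rel.metadata["chunk1"],
            "entity1": rel.metadata["entity1"],
            "chunk2": rel.metadata["chunk2"],
            "entity2": rel.metadata["entity2"],
            "relation": rel.result,
            "confidence": rel.metadata["confidence"],
        })
    return rows


# Hypothetical stand-in for one relation annotation
fake_relation = SimpleNamespace(
    result="DOSAGE-DRUG",
    metadata={
        "chunk1": "1 unit", "entity1": "DOSAGE",
        "chunk2": "naproxen", "entity2": "DRUG",
        "confidence": "1.0",
    },
)

# fullAnnotate-style result: one dict per input text, keyed by column name
fake_results = [{"relations": [fake_relation]}]

rows = relation_rows(fake_results)
print(rows[0]["chunk1"], "->", rows[0]["chunk2"], ":", rows[0]["relation"])
```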

## **🖨️ Input/Output Annotation Types**

- Input: `ANY`

- Output: `ANY`

## **🔎 Parameters**


- `inputCols`: The names of the columns containing the input annotations. It can read either a String column or an Array.

- `outputCol`: The name of the output column that holds the merged annotations. Only one column can be specified.

- `inputType`: (String) The type of the annotations to merge. Possible values are:

  `document | token | wordpiece | word_embeddings | sentence_embeddings | category | date | sentiment | pos | chunk | named_entity | regex | dependency | labeled_dependency | language | keyword`

All parameters can be set using the corresponding setter method in camel case, for example `.setInputCols()`.
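
The naming convention and the `inputType` values can be sketched in plain Python. `setter_name`, `check_input_type`, and `ALLOWED_INPUT_TYPES` are hypothetical helpers written for this notebook, not part of the library; the allowed values are copied from the list above.

```python
# Values copied from the `inputType` parameter description
ALLOWED_INPUT_TYPES = {
    "document", "token", "wordpiece", "word_embeddings", "sentence_embeddings",
    "category", "date", "sentiment", "pos", "chunk", "named_entity", "regex",
    "dependency", "labeled_dependency", "language", "keyword",
}


def setter_name(param: str) -> str:
    """Derive the camel-case setter name for a parameter, e.g. inputCols -> setInputCols."""
    return "set" + param[0].upper() + param[1:]


def check_input_type(value: str) -> str:
    """Return the value unchanged if it is a listed input type, otherwise raise."""
    if value not in ALLOWED_INPUT_TYPES:
        raise ValueError(f"unsupported inputType: {value!r}")
    return value


for name in ("inputCols", "outputCol", "inputType"):
    print(name, "->", setter_name(name))
```

Running a check like `check_input_type("token")` before building a pipeline catches a misspelled type early, before any models are downloaded.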

## Merging token

Here is a pipeline that uses 3 different annotators with a `token` output type. We will merge all these `token` type columns into one `all_token` column.

In [None]:
# Annotator that transforms a text column from dataframe into an Annotation ready for NLP
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Sentence Detector annotator, processes various sentences per line
sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

# Tokenizer splits words in a relevant format for NLP
tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Normalizer that outputs token type
normalizer = nlp.Normalizer() \
    .setInputCols(["token"]) \
    .setOutputCol("normalized")\
    .setLowercase(True)\
    .setCleanupPatterns([r"[^\w\d\s]"]) # remove punctuation (keep alphanumeric chars); raw string avoids invalid-escape warnings

# Regex pattern used to split sentences into tokens (raw string, so backslashes reach the regex engine unchanged)
pattern = r"\s+|(?=[-.:;*+,&%\[\]])|(?<=[-.:;*+,&%\[\]])"
regexTokenizer = nlp.RegexTokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("regex_token") \
    .setPattern(pattern) \
    .setPositionalMask(False)

# Annotation merger that merges same type outputs
annotation_merger = medical.AnnotationMerger()\
    .setInputCols("token", "normalized", "regex_token")\
    .setInputType("token")\
    .setOutputCol("all_token")

nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler, 
    sentenceDetector,
    tokenizer,
    normalizer,
    regexTokenizer,
    annotation_merger
    ])
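
To see what the two regular expressions in the cell above do, they can be tried with Python's built-in `re` module. This is only an approximation: Spark NLP evaluates these patterns on the JVM, and the helper names `normalize` and `regex_tokenize` are made up for this sketch. For patterns this simple the two regex engines are expected to behave alike, but that is an assumption.

```python
import re

# Same patterns as in the pipeline, written as raw strings
CLEANUP_PATTERN = r"[^\w\d\s]"
SPLIT_PATTERN = r"\s+|(?=[-.:;*+,&%\[\]])|(?<=[-.:;*+,&%\[\]])"


def normalize(token: str) -> str:
    """Remove every character matched by the cleanup pattern, then lowercase."""
    return re.sub(CLEANUP_PATTERN, "", token).lower()


def regex_tokenize(text: str) -> list:
    """Split on whitespace and immediately before/after the listed punctuation.

    Splitting on zero-width matches requires Python 3.7 or newer.
    Empty pieces produced by the split are dropped.
    """
    return [piece for piece in re.split(SPLIT_PATTERN, text) if piece]


print([normalize(t) for t in ["T1-T2", "DATE**[12/24/13]", "$1.99"]])
print(regex_tokenize("test T1-T2 DATE**[12/24/13]"))
```

The lookahead and lookbehind alternatives match empty positions, which is why punctuation such as `-` and `*` ends up as separate tokens instead of being discarded.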

In [None]:
sample_text = "The results of the test T1-T2 DATE**[12/24/13] $1.99 () (10/12), ph+ 90%"
df = spark.createDataFrame([[sample_text]]).toDF("text")

result = nlpPipeline.fit(df).transform(df)
result.show(truncate=False)


Display the results with the Colab `TabBar` widget:

In [None]:
df_all = result.select(F.explode(result.all_token.result).alias('all_token')).toPandas()
df_token = result.select(F.explode(result.token.result).alias('token')).toPandas()
df_normalized = result.select(F.explode(result.normalized.result).alias('normalized')).toPandas()
df_regex = result.select(F.explode(result.regex_token.result).alias('regex_token')).toPandas()

from google.colab import widgets

t = widgets.TabBar(["Token","Normalized", "Regex Token", "All Merged"])

with t.output_to(1):
    display(df_normalized)

with t.output_to(2):
    display(df_regex)

with t.output_to(3):
    display(df_all)

with t.output_to(0):
    display(df_token)



| | normalized |
|---|---|
| 0 | `the` |
| 1 | `results` |
| 2 | `of` |
| 3 | `the` |
| 4 | `test` |
| 5 | `t1t2` |
| 6 | `date122413` |
| 7 | `199` |
| 8 | `1012` |
| 9 | `ph` |



| | regex_token |
|---|---|
| 0 | `The` |
| 1 | `results` |
| 2 | `of` |
| 3 | `the` |
| 4 | `test` |
| 5 | `T1` |
| 6 | `-` |
| 7 | `T2` |
| 8 | `DATE` |
| 9 | `*` |



| | all_token |
|---|---|
| 0 | `The` |
| 1 | `results` |
| 2 | `of` |
| 3 | `the` |
| 4 | `test` |
| 5 | `T1-T2` |
| 6 | `DATE**[12/24/13]` |
| 7 | `$1.99` |
| 8 | `()` |
| 9 | `(` |



| | token |
|---|---|
| 0 | `The` |
| 1 | `results` |
| 2 | `of` |
| 3 | `the` |
| 4 | `test` |
| 5 | `T1-T2` |
| 6 | `DATE**[12/24/13]` |
| 7 | `$1.99` |
| 8 | `()` |
| 9 | `(` |



## Merging Relation Extraction

In [None]:
# Create the pipeline with two RE models
documenter = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencer = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentences")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentences"])\
    .setOutputCol("tokens")

words_embedder = nlp.WordEmbeddingsModel()\
    .pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("embeddings")

pos_tagger = nlp.PerceptronModel()\
    .pretrained("pos_clinical", "en", "clinical/models") \
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("pos_tags")

pos_ner_tagger = medical.NerModel()\
    .pretrained("ner_posology", "en", "clinical/models")\
    .setInputCols("sentences", "tokens", "embeddings")\
    .setOutputCol("ner_pos")

pos_ner_chunker = medical.NerConverterInternal()\
    .setInputCols(["sentences", "tokens", "ner_pos"])\
    .setOutputCol("pos_ner_chunks")

dependency_parser = nlp.DependencyParserModel()\
    .pretrained("dependency_conllu", "en")\
    .setInputCols(["sentences", "pos_tags", "tokens"])\
    .setOutputCol("dependencies")

# posology relation extraction model
pos_reModel = medical.RelationExtractionModel()\
    .pretrained("posology_re")\
    .setInputCols(["embeddings", "pos_tags", "pos_ner_chunks", "dependencies"])\
    .setOutputCol("pos_relations")\
    .setMaxSyntacticDistance(4)

ade_ner_tagger = medical.NerModel.pretrained("ner_ade_clinical", "en", "clinical/models")\
    .setInputCols("sentences", "tokens", "embeddings")\
    .setOutputCol("ade_ner_tags")  

ade_ner_chunker = medical.NerConverterInternal()\
    .setInputCols(["sentences", "tokens", "ade_ner_tags"])\
    .setOutputCol("ade_ner_chunks")

# ADE relation extraction model
ade_reModel = medical.RelationExtractionModel()\
    .pretrained("re_ade_clinical", "en", 'clinical/models')\
    .setInputCols(["embeddings", "pos_tags", "ade_ner_chunks", "dependencies"])\
    .setOutputCol("ade_relations")\
    .setMaxSyntacticDistance(10)\
    .setRelationPairs(["drug-ade", "ade-drug"])  # one string per entity pair

# Annotation merger that merges same type outputs
annotation_merger = medical.AnnotationMerger()\
    .setInputCols("ade_relations", "pos_relations")\
    .setInputType("category")\
    .setOutputCol("all_relations")

merger_pipeline = nlp.Pipeline(stages=[
    documenter,
    sentencer,
    tokenizer, 
    words_embedder, 
    pos_tagger, 
    pos_ner_tagger,
    pos_ner_chunker,
    dependency_parser,
    pos_reModel,
    ade_ner_tagger,
    ade_ner_chunker,
    ade_reModel,
    annotation_merger
])

empty_df= spark.createDataFrame([[""]]).toDF("text")
merger_model= merger_pipeline.fit(empty_df)

lmodel = nlp.LightPipeline(merger_model)

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
pos_clinical download started this may take some time.
Approximate size to download 1.5 MB
[OK!]
ner_posology download started this may take some time.
[OK!]
dependency_conllu download started this may take some time.
Approximate size to download 16.7 MB
[OK!]
ner_ade_clinical download started this may take some time.
[OK!]
re_ade_clinical download started this may take some time.
Approximate size to download 10.9 MB
[OK!]


In [None]:
# Show example result
text = """
The patient was prescribed 1 unit of naproxen for 5 days after meals for chronic low back pain. 
The patient was also given 1 unit of oxaprozin daily for rheumatoid arthritis presented with tense bullae 
and cutaneous fragility on the face and the back of the hands. 
"""

results = lmodel.fullAnnotate(text)

In [None]:
get_relations_df(results, rel_col='ade_relations')

| | entity1_begin | entity1_end | chunk1 | entity1 | entity2_begin | entity2_end | chunk2 | entity2 | relation | confidence |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 135 | 143 | oxaprozin | DRUG | 191 | 202 | tense bullae | ADE | 1 | 1.0 |
| 1 | 135 | 143 | oxaprozin | DRUG | 209 | 265 | cutaneous fragility on the face and the back o... | ADE | 1 | 1.0 |


In [None]:
get_relations_df(results, rel_col='pos_relations')

| | entity1_begin | entity1_end | chunk1 | entity1 | entity2_begin | entity2_end | chunk2 | entity2 | relation | confidence |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 28 | 33 | 1 unit | DOSAGE | 38 | 45 | naproxen | DRUG | DOSAGE-DRUG | 1.0 |
| 1 | 38 | 45 | naproxen | DRUG | 47 | 56 | for 5 days | DURATION | DRUG-DURATION | 1.0 |
| 2 | 125 | 130 | 1 unit | DOSAGE | 135 | 143 | oxaprozin | DRUG | DOSAGE-DRUG | 1.0 |
| 3 | 135 | 143 | oxaprozin | DRUG | 145 | 149 | daily | FREQUENCY | DRUG-FREQUENCY | 1.0 |
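
The `begin` and `end` values in the relation outputs above are character offsets into the input text, and `end` is inclusive. A quick plain-Python check is sketched here; it rebuilds the sample text with its leading newline and trailing spaces written out explicitly. The `span` helper and the `sample_text` name are hypothetical and used only for this sketch.

```python
# Same text as in the cell above; trailing spaces and newlines are spelled out
sample_text = (
    "\n"
    "The patient was prescribed 1 unit of naproxen for 5 days after meals for chronic low back pain. \n"
    "The patient was also given 1 unit of oxaprozin daily for rheumatoid arthritis presented with tense bullae \n"
    "and cutaneous fragility on the face and the back of the hands. \n"
)


def span(text: str, begin: int, end: int) -> str:
    """Return the substring for an annotation whose `end` offset is inclusive."""
    return text[begin:end + 1]


# Offsets taken from the relation outputs
print(span(sample_text, 28, 33))
print(span(sample_text, 135, 143))
print(span(sample_text, 191, 202))
```

Because `end` is inclusive, slicing with `text[begin:end]` would drop the last character of each chunk.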


The merged column contains the annotations from both relation extraction models:

In [None]:
get_relations_df(results, rel_col='all_relations')

| | entity1_begin | entity1_end | chunk1 | entity1 | entity2_begin | entity2_end | chunk2 | entity2 | relation | confidence |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 135 | 143 | oxaprozin | DRUG | 191 | 202 | tense bullae | ADE | 1 | 1.0 |
| 1 | 135 | 143 | oxaprozin | DRUG | 209 | 265 | cutaneous fragility on the face and the back o... | ADE | 1 | 1.0 |
| 2 | 28 | 33 | 1 unit | DOSAGE | 38 | 45 | naproxen | DRUG | DOSAGE-DRUG | 1.0 |
| 3 | 38 | 45 | naproxen | DRUG | 47 | 56 | for 5 days | DURATION | DRUG-DURATION | 1.0 |
| 4 | 125 | 130 | 1 unit | DOSAGE | 135 | 143 | oxaprozin | DRUG | DOSAGE-DRUG | 1.0 |
| 5 | 135 | 143 | oxaprozin | DRUG | 145 | 149 | daily | FREQUENCY | DRUG-FREQUENCY | 1.0 |
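
Comparing the three outputs suggests that the merged column holds the `ade_relations` rows followed by the `pos_relations` rows, which is the order given to `setInputCols`. The small check below uses `(chunk1, chunk2, relation)` values copied from the outputs above, with the long chunk kept truncated as displayed. It verifies the displayed results only and makes no claim about other inputs.

```python
# (chunk1, chunk2, relation) triples copied from the three outputs
ade_relations = [
    ("oxaprozin", "tense bullae", "1"),
    ("oxaprozin", "cutaneous fragility on the face and the back o...", "1"),
]

pos_relations = [
    ("1 unit", "naproxen", "DOSAGE-DRUG"),
    ("naproxen", "for 5 days", "DRUG-DURATION"),
    ("1 unit", "oxaprozin", "DOSAGE-DRUG"),
    ("oxaprozin", "daily", "DRUG-FREQUENCY"),
]

all_relations = [
    ("oxaprozin", "tense bullae", "1"),
    ("oxaprozin", "cutaneous fragility on the face and the back o...", "1"),
    ("1 unit", "naproxen", "DOSAGE-DRUG"),
    ("naproxen", "for 5 days", "DRUG-DURATION"),
    ("1 unit", "oxaprozin", "DOSAGE-DRUG"),
    ("oxaprozin", "daily", "DRUG-FREQUENCY"),
]

# The merged rows equal the two inputs concatenated in input-column order
is_concatenation = all_relations == ade_relations + pos_relations
print(is_concatenation, len(all_relations))
```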
