![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb)

# Chunk Mapping

## Colab Setup

In [None]:
import json
import os

from google.colab import files

license_keys = files.upload()

with open(list(license_keys.keys())[0]) as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)

# Adding license key-value pairs to environment variables
os.environ.update(license_keys)

In [2]:
%%capture

# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.1.2 spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

In [3]:
import json
import os
from pyspark.ml import Pipeline, PipelineModel
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from sparknlp.base import *
import sparknlp_jsl
import sparknlp
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

params = {"spark.driver.memory":"16G", 
          "spark.kryoserializer.buffer.max":"2000M", 
          "spark.driver.maxResultSize":"2000M"} 

print ("Spark NLP Version :", sparknlp.version())
print ("Spark NLP_JSL Version :", sparknlp_jsl.version())


spark = sparknlp_jsl.start(license_keys['SECRET'],params=params)

spark

Spark NLP Version : 4.3.2
Spark NLP_JSL Version : 4.3.2


# 1- Pretrained Chunk Mapper Models and Pretrained Pipelines

**<center>MAPPER MODELS**

|index|model|index|model|index|model|
|-----:|:-----|-----:|:-----|-----:|:-----|
| 1| [abbreviation_category_mapper](https://nlp.johnsnowlabs.com/2022/11/16/abbreviation_category_mapper_en.html)  | 2| [abbreviation_mapper](https://nlp.johnsnowlabs.com/2022/05/11/abbreviation_mapper_en_3_0.html)  | 3| [cvx_code_mapper](https://nlp.johnsnowlabs.com/2022/10/12/cvx_code_mapper_en.html)  |
| 4| [cvx_name_mapper](https://nlp.johnsnowlabs.com/2022/10/12/cvx_name_mapper_en.html)  | 5| [drug_action_treatment_mapper](https://nlp.johnsnowlabs.com/2022/03/31/drug_action_treatment_mapper_en_3_0.html)  | 6| [drug_ade_mapper](https://nlp.johnsnowlabs.com/2022/08/23/drug_ade_mapper_en.html)  |
| 7| [drug_brandname_ndc_mapper](https://nlp.johnsnowlabs.com/2022/05/11/drug_brandname_ndc_mapper_en_3_0.html)  | 8| [drug_category_mapper](https://nlp.johnsnowlabs.com/2022/12/18/drug_category_mapper_en.html)  | 9| [icd10_icd9_mapper](https://nlp.johnsnowlabs.com/2022/09/30/icd10_icd9_mapper_en.html)  |
| 10| [icd10cm_mapper](https://nlp.johnsnowlabs.com/2022/10/29/icd10cm_mapper_en.html)  | 11| [icd10cm_snomed_mapper](https://nlp.johnsnowlabs.com/2022/06/26/icd10cm_snomed_mapper_en_3_0.html)  | 12| [icd10cm_umls_mapper](https://nlp.johnsnowlabs.com/2022/06/26/icd10cm_umls_mapper_en_3_0.html)  |
| 13| [icd9_icd10_mapper](https://nlp.johnsnowlabs.com/2022/09/30/icd9_icd10_mapper_en.html)  | 14| [icd9_mapper](https://nlp.johnsnowlabs.com/2022/09/30/icd9_mapper_en.html)  | 15| [icdo_snomed_mapper](https://nlp.johnsnowlabs.com/2022/06/26/icdo_snomed_mapper_en_3_0.html)  |
| 16| [kegg_disease_mapper](https://nlp.johnsnowlabs.com/2022/11/18/kegg_disease_mapper_en.html)  | 17| [kegg_drug_mapper](https://nlp.johnsnowlabs.com/2022/11/21/kegg_drug_mapper_en.html)  | 18| [mesh_umls_mapper](https://nlp.johnsnowlabs.com/2022/06/26/mesh_umls_mapper_en_3_0.html)  |
| 19| [ndc_drug_brandname_mapper](https://nlp.johnsnowlabs.com/2023/02/22/ndc_drug_brandname_mapper_en.html)  | 20| [normalized_section_header_mapper](https://nlp.johnsnowlabs.com/2022/06/26/normalized_section_header_mapper_en_3_0.html)  | 21| [rxnorm_action_treatment_mapper](https://nlp.johnsnowlabs.com/2022/05/08/rxnorm_action_treatment_mapper_en_3_0.html)  |
| 22| [rxnorm_drug_brandname_mapper](https://nlp.johnsnowlabs.com/2023/02/09/rxnorm_drug_brandname_mapper_en.html)  | 23| [rxnorm_mapper](https://nlp.johnsnowlabs.com/2022/06/27/rxnorm_mapper_en_3_0.html)  | 24| [rxnorm_ndc_mapper](https://nlp.johnsnowlabs.com/2022/05/20/rxnorm_ndc_mapper_en_3_0.html)  |
| 25| [rxnorm_nih_mapper](https://nlp.johnsnowlabs.com/2023/02/23/rxnorm_nih_mapper_en.html)  | 26| [rxnorm_normalized_mapper](https://nlp.johnsnowlabs.com/2022/09/29/rxnorm_normalized_mapper_en.html)  | 27| [rxnorm_umls_mapper](https://nlp.johnsnowlabs.com/2022/06/26/rxnorm_umls_mapper_en_3_0.html)  |
| 28| [snomed_icd10cm_mapper](https://nlp.johnsnowlabs.com/2022/06/26/snomed_icd10cm_mapper_en_3_0.html)  | 29| [snomed_icdo_mapper](https://nlp.johnsnowlabs.com/2022/06/26/snomed_icdo_mapper_en_3_0.html)  | 30| [snomed_umls_mapper](https://nlp.johnsnowlabs.com/2022/06/27/snomed_umls_mapper_en_3_0.html)  |
| 31| [umls_clinical_drugs_mapper](https://nlp.johnsnowlabs.com/2022/07/06/umls_clinical_drugs_mapper_en_3_0.html)  | 32| [umls_clinical_findings_mapper](https://nlp.johnsnowlabs.com/2022/07/08/umls_clinical_findings_mapper_en_3_0.html)  | 33| [umls_disease_syndrome_mapper](https://nlp.johnsnowlabs.com/2022/07/11/umls_disease_syndrome_mapper_en_3_0.html)  |
| 34| [umls_drug_substance_mapper](https://nlp.johnsnowlabs.com/2022/07/11/umls_drug_substance_mapper_en_3_0.html)  | 35| [umls_major_concepts_mapper](https://nlp.johnsnowlabs.com/2022/07/11/umls_major_concepts_mapper_en_3_0.html)  |  |  |

**You can find all these models and more on the [NLP Models Hub](https://nlp.johnsnowlabs.com/models?q=Chunk+Mapping&edition=Spark+NLP+for+Healthcare)**

<br>

**<center>PRETRAINED MAPPER PIPELINES**

|index|model|
|-----:|:-----|
| 1| [icd10_icd9_mapping](https://nlp.johnsnowlabs.com/2022/09/30/icd10_icd9_mapping_en.html)  |
| 2| [icd10cm_snomed_mapping](https://nlp.johnsnowlabs.com/2022/06/27/icd10cm_snomed_mapping_en_3_0.html)  |
| 3| [icd10cm_umls_mapping](https://nlp.johnsnowlabs.com/2021/05/04/icd10cm_umls_mapping_en.html)  |
| 4| [icdo_snomed_mapping](https://nlp.johnsnowlabs.com/2022/06/27/icdo_snomed_mapping_en_3_0.html)  |
| 5| [mesh_umls_mapping](https://nlp.johnsnowlabs.com/2021/05/04/mesh_umls_mapping_en.html)  |
| 6| [rxnorm_mesh_mapping](https://nlp.johnsnowlabs.com/2021/05/04/rxnorm_mesh_mapping_en.html)  |
| 7| [rxnorm_ndc_mapping](https://nlp.johnsnowlabs.com/2022/06/27/rxnorm_ndc_mapping_en_3_0.html)  |
| 8| [rxnorm_umls_mapping](https://nlp.johnsnowlabs.com/2021/05/04/rxnorm_umls_mapping_en.html)  |
| 9| [snomed_icd10cm_mapping](https://nlp.johnsnowlabs.com/2021/05/02/snomed_icd10cm_mapping_en.html)  |
| 10| [snomed_icdo_mapping](https://nlp.johnsnowlabs.com/2022/06/27/snomed_icdo_mapping_en_3_0.html)  |
| 11| [snomed_umls_mapping](https://nlp.johnsnowlabs.com/2021/05/04/snomed_umls_mapping_en.html)  |



See the [Healthcare Code Mapping Notebook](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings_JSL/Healthcare/11.1.Healthcare_Code_Mapping.ipynb) for examples of the pretrained mapper pipelines.

## 1.1- Drug Action Treatment Mapper

The pretrained `drug_action_treatment_mapper` model maps drugs to their corresponding `action` and `treatment` through the `ChunkMapperModel()` annotator. <br/>


**Action** of drug refers to the function of a drug in various body systems. <br/>
**Treatment** refers to which disease the drug is used to treat. 

We can choose which relation is returned by passing it to the `setRels()` method of `ChunkMapperModel()`.
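To build some intuition for what selecting a relation means, here is a toy, pure-Python lookup. It is only a sketch and not the Spark NLP implementation: `TOY_DICTIONARY` and `lookup` are hypothetical names, the values are copied from the outputs shown later in this section, and the rule "first value becomes the result, the remaining values go to `all_relations`" is an assumption read off those outputs.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical in-memory dictionary: chunk -> relation name -> values.
# Values are copied from the mapper outputs shown in this section.
TOY_DICTIONARY: Dict[str, Dict[str, List[str]]] = {
    "Dermovate": {
        "action": ["anti-inflammatory", "corticosteroids",
                   "dermatological preparations", "very strong"],
        "treatment": ["lupus", "discoid lupus erythematosus",
                      "empeines", "psoriasis", "eczema"],
    }
}


def lookup(chunk: str, rel: str) -> Optional[Tuple[str, str]]:
    """Return (result, all_relations) for one chunk and one relation, or None."""
    values = TOY_DICTIONARY.get(chunk, {}).get(rel)
    if not values:
        return None
    # Assumption: the first value plays the role of `result`,
    # the remaining values are joined with ":::" like `all_relations`.
    return values[0], ":::".join(values[1:])


print(lookup("Dermovate", "action"))
print(lookup("Dermovate", "treatment"))
print(lookup("UnknownDrug", "action"))
```

Switching the second argument from `"action"` to `"treatment"` plays the same role here as switching the value passed to `setRels()`.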
 

We will create a pipeline consisting of the `bert_token_classifier_drug_development_trials` NER model, which extracts the NER chunks, and `ChunkMapperModel()`. <br/>
 We will first call `.setRels(["action"])` and look at the results. 

In [None]:
#ChunkMapper Pipeline
document_assembler = DocumentAssembler()\
      .setInputCol('text')\
      .setOutputCol('document')

sentence_detector = SentenceDetector()\
      .setInputCols(["document"])\
      .setOutputCol("sentence")

tokenizer = Tokenizer()\
      .setInputCols("sentence")\
      .setOutputCol("token")

ner =  MedicalBertForTokenClassifier.pretrained("bert_token_classifier_drug_development_trials", "en", "clinical/models")\
      .setInputCols("token","sentence")\
      .setOutputCol("ner")

nerconverter = NerConverterInternal()\
      .setInputCols("sentence", "token", "ner")\
      .setOutputCol("ner_chunk")

#drug_action_treatment_mapper with "action" mappings
chunkerMapper= ChunkMapperModel.pretrained("drug_action_treatment_mapper", "en", "clinical/models")\
    .setInputCols(["ner_chunk"])\
    .setOutputCol("action_mappings")\
    .setRels(["action"])
    

pipeline = Pipeline().setStages([document_assembler,
                                 sentence_detector,
                                 tokenizer,
                                 ner, 
                                 nerconverter, 
                                 chunkerMapper])

text = [["""The patient was female and patient of Dr. X. and she was given Dermovate, Aspagin"""]]


test_data = spark.createDataFrame(text).toDF("text")

res = pipeline.fit(test_data).transform(test_data)

bert_token_classifier_drug_development_trials download started this may take some time.
[OK!]
drug_action_treatment_mapper download started this may take some time.
[OK!]


Chunks detected by the NER model

In [None]:
res.select(F.explode('ner_chunk.result').alias("chunks")).show(truncate=False)

+---------+
|chunks   |
+---------+
|Dermovate|
|Aspagin  |
+---------+



Checking mapping results

In [None]:
res.select("action_mappings.result").show(truncate=False)

+------------------------------+
|result                        |
+------------------------------+
|[anti-inflammatory, analgesic]|
+------------------------------+



In [None]:
res.selectExpr("action_mappings.metadata").show(truncate=False)

+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|metadata                                                                                                                                                                                                                                                                                                                                                                                          

As you can see above, the ***metadata*** column lists all the relations available for each chunk, when any exist. <br/>


In [None]:
res.select(F.explode(F.arrays_zip(res.ner_chunk.result, 
                                  res.action_mappings.result, 
                                  res.action_mappings.metadata)).alias("col"))\
    .select(F.expr("col['0']").alias("ner_chunk"),
            F.expr("col['1']").alias("mapping_result"),
            F.expr("col['2']['all_relations']").alias("all_relations")).show(truncate=False)

+---------+-----------------+------------------------------------------------------------+
|ner_chunk|mapping_result   |all_relations                                               |
+---------+-----------------+------------------------------------------------------------+
|Dermovate|anti-inflammatory|corticosteroids::: dermatological preparations:::very strong|
|Aspagin  |analgesic        |anti-inflammatory:::antipyretic                             |
+---------+-----------------+------------------------------------------------------------+



Now, let's set the `.setRels(["treatment"])` and see the results. 

In [None]:
#drug_action_treatment_mapper with "treatment" mappings
chunkerMapper= ChunkMapperModel.pretrained("drug_action_treatment_mapper", "en", "clinical/models")\
    .setInputCols(["ner_chunk"])\
    .setOutputCol("action_mappings")\
    .setRels(["treatment"])

pipeline = Pipeline().setStages([document_assembler,
                                 sentence_detector,
                                 tokenizer,
                                 ner, 
                                 nerconverter, 
                                 chunkerMapper])

text = [
    ["""The patient was female and patient of Dr. X. and she was given Dermovate, Aspagin"""]
]

test_data = spark.createDataFrame(text).toDF("text")

res = pipeline.fit(test_data).transform(test_data)


drug_action_treatment_mapper download started this may take some time.
[OK!]


In [None]:
res.select(F.explode('ner_chunk.result').alias("chunks")).show(truncate=False)

+---------+
|chunks   |
+---------+
|Dermovate|
|Aspagin  |
+---------+



In [None]:
res.selectExpr("action_mappings.metadata").show(truncate=False)

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|metadata                                                                                                                                                                                       

Here are the ***treatment*** mappings and all relations under the metadata column. 

In [None]:
res.select(F.explode(F.arrays_zip(res.ner_chunk.result, 
                                  res.action_mappings.result, 
                                  res.action_mappings.metadata)).alias("col"))\
    .select(F.expr("col['0']").alias("ner_chunk"),
            F.expr("col['1']").alias("mapping_result"),
            F.expr("col['2']['all_relations']").alias("all_relations")).show(truncate=False)

+---------+----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|ner_chunk|mapping_result        |all_relations                                                                                                                                                                                                          |
+---------+----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Dermovate|lupus                 |discoid lupus erythematosus:::empeines:::psoriasis:::eczema                                                                                                                                                          

## 1.2- Section Header Normalizer Mapper

The `normalized_section_header_mapper` model normalizes the section headers in clinical notes. It returns two levels of normalization called `level_1` and `level_2`. <br/>

**level_1** refers to the most comprehensive "section header" for the corresponding chunk, while **level_2** refers to the second most comprehensive one.

Let's create a pipeline with `normalized_section_header_mapper` and see how it works.

In [None]:
document_assembler = DocumentAssembler()\
       .setInputCol('text')\
       .setOutputCol('document')

sentence_detector = SentenceDetector()\
       .setInputCols(["document"])\
       .setOutputCol("sentence")

tokenizer = Tokenizer()\
      .setInputCols("sentence")\
      .setOutputCol("token")

embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en","clinical/models")\
      .setInputCols(["sentence", "token"])\
      .setOutputCol("word_embeddings")

clinical_ner = MedicalNerModel.pretrained("ner_jsl_slim", "en", "clinical/models")\
      .setInputCols(["sentence","token", "word_embeddings"])\
      .setOutputCol("ner")

ner_converter = NerConverterInternal()\
      .setInputCols(["sentence", "token", "ner"])\
      .setOutputCol("ner_chunk")\
      .setWhiteList(["Header"])

chunkerMapper = ChunkMapperModel.pretrained("normalized_section_header_mapper", "en", "clinical/models") \
       .setInputCols("ner_chunk")\
       .setOutputCol("mappings")\
       .setRels(["level_1"]) #or level_2

pipeline = Pipeline().setStages([document_assembler,
                                sentence_detector,
                                tokenizer, 
                                embeddings,
                                clinical_ner, 
                                ner_converter, 
                                chunkerMapper])

sentences = [
    ["""ADMISSION DIAGNOSIS Right pleural effusion and suspected malignant mesothelioma.
        PRINCIPAL DIAGNOSIS Right pleural effusion, suspected malignant mesothelioma.
        GENERAL REVIEW Right pleural effusion, firm nodules, diffuse scattered throughout the right pleura and diaphragmatic surface.
    """]]

test_data = spark.createDataFrame(sentences).toDF("text")
res = pipeline.fit(test_data).transform(test_data)

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_jsl_slim download started this may take some time.
[OK!]
normalized_section_header_mapper download started this may take some time.
[OK!]


Checking the headers detected by the NER model

In [None]:
res.select(F.explode('ner_chunk.result').alias("chunks")).show(truncate=False)

+-------------------+
|chunks             |
+-------------------+
|ADMISSION DIAGNOSIS|
|PRINCIPAL DIAGNOSIS|
|GENERAL REVIEW     |
+-------------------+



Checking mapping results

In [None]:
res.select("mappings.result").show(truncate=False)

+-----------------------------------+
|result                             |
+-----------------------------------+
|[DIAGNOSIS, DIAGNOSIS, REVIEW TYPE]|
+-----------------------------------+



In [None]:
res.select(F.explode(F.arrays_zip(res.ner_chunk.result, 
                                  res.mappings.result)).alias("col"))\
    .select(F.expr("col['0']").alias("ner_chunk"),
            F.expr("col['1']").alias("mapping_result")).show(truncate=False)

+-------------------+--------------+
|ner_chunk          |mapping_result|
+-------------------+--------------+
|ADMISSION DIAGNOSIS|DIAGNOSIS     |
|PRINCIPAL DIAGNOSIS|DIAGNOSIS     |
|GENERAL REVIEW     |REVIEW TYPE   |
+-------------------+--------------+



As shown above, each section header is mapped to its `level_1` normalized form.

## 1.3- Drug Brand Name NDC Mapper

The `drug_brandname_ndc_mapper` model maps drug brand names to their corresponding National Drug Codes (NDC). Product NDCs for each strength are returned in the result and metadata. <br/>

It has a single relation type called `Strength_NDC`.

Let's create a pipeline with `drug_brandname_ndc_mapper` and see how it works.

In [4]:
document_assembler = DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("chunk")

chunkerMapper = ChunkMapperModel.pretrained("drug_brandname_ndc_mapper", "en", "clinical/models")\
      .setInputCols(["chunk"])\
      .setOutputCol("ndc")\
      .setRels(["Strength_NDC"])

pipeline = Pipeline().setStages([document_assembler,
                                 chunkerMapper])  

model = pipeline.fit(spark.createDataFrame([['']]).toDF('text')) 

lp = LightPipeline(model)

res = lp.fullAnnotate('ZYVOX')

drug_brandname_ndc_mapper download started this may take some time.
[OK!]


In [5]:
chunks = []
mappings = []
all_re= []

for m, n in list(zip(res[0]['chunk'], res[0]["ndc"])):
        
    chunks.append(m.result)
    mappings.append(n.result) 
    all_re.append(n.metadata["all_relations"])
    
import pandas as pd
pd.set_option('display.max_colwidth', None)

df = pd.DataFrame({'Brand_Name':chunks, 'Strength_NDC': mappings, 'Other_NDC':all_re})

df

| |Brand_Name|Strength_NDC|Other_NDC|
|-----:|:-----|:-----|:-----|
| 0|ZYVOX|600 mg/300mL \| 0009-4992|600 mg/300mL \| 66298-7807:::600 mg/300mL \| 0009-7807:::600 mg/300mL \| 0009-5140:::100 mg/5mL \| 0009-5136:::600 mg/1 \| 70518-1226:::600 mg/300mL \| 66298-5140:::200 mg/100mL \| 66298-5137:::200 mg/100mL \| 0009-5137:::600 mg/1 \| 0009-5138|
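Each entry returned for ZYVOX has the form `strength | NDC`, and the additional entries are joined with `:::`. The following sketch groups the codes by strength in plain Python. `parse_strength_ndc` and `ndcs_by_strength` are hypothetical names, and the strings are copied from the ZYVOX output above.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def parse_strength_ndc(entry: str) -> Tuple[str, str]:
    """Split one 'strength | NDC' entry into its two parts."""
    strength, ndc = (part.strip() for part in entry.split("|", 1))
    return strength, ndc


# Mapper result and `all_relations` metadata for ZYVOX, copied from the output.
result = "600 mg/300mL | 0009-4992"
all_relations = (
    "600 mg/300mL | 66298-7807:::600 mg/300mL | 0009-7807:::"
    "600 mg/300mL | 0009-5140:::100 mg/5mL | 0009-5136:::"
    "600 mg/1 | 70518-1226:::600 mg/300mL | 66298-5140:::"
    "200 mg/100mL | 66298-5137:::200 mg/100mL | 0009-5137:::"
    "600 mg/1 | 0009-5138"
)

# Group every NDC under its strength, keeping the order of appearance.
ndcs_by_strength: Dict[str, List[str]] = defaultdict(list)
for entry in [result] + all_relations.split(":::"):
    strength, ndc = parse_strength_ndc(entry)
    ndcs_by_strength[strength].append(ndc)

for strength, ndcs in ndcs_by_strength.items():
    print(f"{strength}: {ndcs}")
```

Grouping this way makes it easier to see which product codes share a strength than reading the flat `:::`-separated string.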


As shown above, the model returns the corresponding NDC mappings for each brand name.

## 1.4- RxNorm NDC Mapper

The `rxnorm_ndc_mapper` model maps RxNorm and RxNorm Extension codes to their corresponding National Drug Codes (NDC).

It has two relation types that can be selected with `setRels()`: **Product NDC** and **Package NDC**.

Let's create a pipeline with the `rxnorm_ndc_mapper` model, select the relation with `setRels(["Product NDC"])`, and see the results. 

In [6]:
document_assembler = DocumentAssembler()\
      .setInputCol('text')\
      .setOutputCol('ner_chunk')

sbert_embedder = BertSentenceEmbeddings.pretrained('sbiobert_base_cased_mli', 'en','clinical/models')\
      .setInputCols(["ner_chunk"])\
      .setOutputCol("sentence_embeddings")\
      .setCaseSensitive(False)
    
rxnorm_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_rxnorm_augmented","en", "clinical/models") \
      .setInputCols(["sentence_embeddings"]) \
      .setOutputCol("rxnorm_code")\
      .setDistanceFunction("EUCLIDEAN")

chunkerMapper_product = ChunkMapperModel.pretrained("rxnorm_ndc_mapper", "en", "clinical/models")\
      .setInputCols(["rxnorm_code"])\
      .setOutputCol("Product NDC")\
      .setRels(["Product NDC"]) #or Package NDC

pipeline = Pipeline().setStages([document_assembler,
                                 sbert_embedder,
                                 rxnorm_resolver,
                                 chunkerMapper_product
                                 ])

model = pipeline.fit(spark.createDataFrame([['']]).toDF('text')) 

lp = LightPipeline(model)

result = lp.fullAnnotate('macadamia nut 100 MG/ML')

sbiobert_base_cased_mli download started this may take some time.
Approximate size to download 384.3 MB
[OK!]
sbiobertresolve_rxnorm_augmented download started this may take some time.
[OK!]
rxnorm_ndc_mapper download started this may take some time.
[OK!]


In [7]:
chunks = []
rxnorm_code = []
product= []


for m, n, j in list(zip(result[0]['ner_chunk'], result[0]["rxnorm_code"], result[0]["Product NDC"])):

    chunks.append(m.result)
    rxnorm_code.append(n.result) 
    product.append(j.result)
    
import pandas as pd

df = pd.DataFrame({'ner_chunk':chunks,
                   'rxnorm_code': rxnorm_code,
                   'Product NDC': product})

df

| |ner_chunk|rxnorm_code|Product NDC|
|-----:|:-----|:-----|:-----|
| 0|macadamia nut 100 MG/ML|212433|00187-1474|


As shown above, the model returns the corresponding "Product NDC" mapping for each RxNorm code.

## 1.5- RxNorm Action Treatment Mapper

The `rxnorm_action_treatment_mapper` model maps RxNorm and RxNorm Extension codes to their corresponding action and treatment. It has two relation types that can be selected with `setRels()`: <br/>

**Action** of drug refers to the function of a drug in various body systems. <br/>
**Treatment** refers to which disease the drug is used to treat.

Let's create a pipeline and see how it works. 

In [None]:
document_assembler = DocumentAssembler()\
      .setInputCol('text')\
      .setOutputCol('ner_chunk')

sbert_embedder = BertSentenceEmbeddings.pretrained('sbiobert_base_cased_mli', 'en','clinical/models')\
      .setInputCols(["ner_chunk"])\
      .setOutputCol("sentence_embeddings")\
      .setCaseSensitive(False)
    
rxnorm_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_rxnorm_augmented","en", "clinical/models") \
      .setInputCols(["sentence_embeddings"]) \
      .setOutputCol("rxnorm_code")\
      .setDistanceFunction("EUCLIDEAN")

resolver2chunk = Resolution2Chunk()\
      .setInputCols(["rxnorm_code"]) \
      .setOutputCol("resolver2chunk")

chunkerMapper_action = ChunkMapperModel.pretrained("rxnorm_action_treatment_mapper", "en", "clinical/models")\
      .setInputCols(["resolver2chunk"])\
      .setOutputCol("action_mapping")\
      .setRels(["action"]) #or treatment

pipeline = Pipeline().setStages([document_assembler,
                                 sbert_embedder,
                                 rxnorm_resolver,
                                 resolver2chunk,
                                 chunkerMapper_action
                                 ])

data= spark.createDataFrame([['Zonalon 50 mg']]).toDF('text')

res= pipeline.fit(data).transform(data)

sbiobert_base_cased_mli download started this may take some time.
Approximate size to download 384.3 MB
[OK!]
sbiobertresolve_rxnorm_augmented download started this may take some time.
[OK!]
rxnorm_action_treatment_mapper download started this may take some time.
[OK!]


In [None]:
res.select(F.explode(F.arrays_zip(res.ner_chunk.result, 
                                  res.rxnorm_code.result,
                                  res.action_mapping.result)).alias("col"))\
    .select(F.expr("col['0']").alias("document"),
            F.expr("col['1']").alias("rxnorm_code"),
            F.expr("col['2']").alias("Action Mapping")).show(truncate=False)

+-------------+-----------+--------------+
|document     |rxnorm_code|Action Mapping|
+-------------+-----------+--------------+
|Zonalon 50 mg|103971     |Analgesic     |
+-------------+-----------+--------------+



As shown above, the model returns the corresponding "Action" mapping for each RxNorm code.

## 1.6- Abbreviation Mapper

The `abbreviation_mapper` model maps abbreviations and acronyms of medical regulatory activities to their definitions. <br/> It has a single relation type, selected with `setRels(["definition"])`.

Let's create a pipeline that uses `ner_abbreviation_clinical` to extract abbreviations from the text and feeds them to `abbreviation_mapper`. 

In [None]:
document_assembler = DocumentAssembler()\
      .setInputCol('text')\
      .setOutputCol('document')

sentence_detector = SentenceDetector()\
      .setInputCols(["document"])\
      .setOutputCol("sentence")

tokenizer = Tokenizer()\
      .setInputCols("sentence")\
      .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
      .setInputCols(["sentence", "token"])\
      .setOutputCol("embeddings")

#NER model to detect abbreviations in the text
abbr_ner = MedicalNerModel.pretrained('ner_abbreviation_clinical', 'en', 'clinical/models') \
      .setInputCols(["sentence", "token", "embeddings"]) \
      .setOutputCol("abbr_ner")

abbr_converter = NerConverterInternal() \
      .setInputCols(["sentence", "token", "abbr_ner"]) \
      .setOutputCol("abbr_ner_chunk")

chunkerMapper = ChunkMapperModel.pretrained("abbreviation_mapper", "en", "clinical/models")\
      .setInputCols(["abbr_ner_chunk"])\
      .setOutputCol("mappings")\
      .setRels(["definition"]) 

pipeline = Pipeline().setStages([document_assembler,
                                 sentence_detector,
                                 tokenizer, 
                                 word_embeddings,
                                 abbr_ner, 
                                 abbr_converter, 
                                 chunkerMapper])

text = ["""Gravid with estimated fetal weight of 6-6/12 pounds.
           LABORATORY DATA: Laboratory tests include a CBC which is normal. 
           HIV: Negative. One-Hour Glucose: 117. Group B strep has not been done as yet."""]

test_data = spark.createDataFrame([text]).toDF("text")

model = pipeline.fit(test_data)
res= model.transform(test_data)

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_abbreviation_clinical download started this may take some time.
[OK!]
abbreviation_mapper download started this may take some time.
[OK!]


Checking the results

In [None]:
#abbreviations extracted by ner model
res.select("abbr_ner_chunk.result").show()

+----------+
|    result|
+----------+
|[CBC, HIV]|
+----------+



In [None]:
res.select(F.explode(F.arrays_zip(res.abbr_ner_chunk.result, res.mappings.result)).alias("col"))\
    .select(F.expr("col['0']").alias("Abbreviation"),
            F.expr("col['1']").alias("Definition")).show(truncate=False)

+------------+----------------------------+
|Abbreviation|Definition                  |
+------------+----------------------------+
|CBC         |complete blood count        |
|HIV         |human immunodeficiency virus|
+------------+----------------------------+



As shown above, the model returns the corresponding "definition" mapping for each abbreviation.

# 2- Creating a Mapper Model

You can train your own mapper model with `ChunkMapperApproach()`. <br/>

It takes an `ner_chunk` column and a JSON dictionary that maps NER entities to relations, and returns the `ner_chunk` augmented with the relations from the JSON ontology. <br/> We pass the path of the JSON file to `setDictionary()`.

Let's create an example JSON file, then create a drug mapper model. This model will map the given drug name (only "metformin" in our example) to its corresponding action and treatment.  

The JSON file should have the following format:


In [None]:
data_set= {
  "mappings": [
    {
      "key": "metformin",
      "relations": [
        {
          "key": "action",
          "values" : ["hypoglycemic", "Drugs Used In Diabetes"]
        },
        {
          "key": "treatment",
          "values" : ["diabetes", "t2dm"]
        }
      ]
    }
  ]
}

import json
with open('sample_drug.json', 'w', encoding='utf-8') as f:
    json.dump(data_set, f, ensure_ascii=False, indent=4)
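When the dictionary has more than a handful of entries, it can be easier to start from a plain `{chunk: {relation: [values]}}` Python dict and generate the nested layout from it. The sketch below does that and runs a few sanity checks. `build_mapper_dictionary`, `check_mapper_dictionary` and the output file name are hypothetical, and the checks are our own rules rather than requirements documented by the library.

```python
import json
import os
import tempfile
from typing import Dict, List


def build_mapper_dictionary(entries: Dict[str, Dict[str, List[str]]]) -> dict:
    """Convert {chunk: {relation: [values]}} into the "mappings" layout shown above."""
    return {
        "mappings": [
            {
                "key": chunk,
                "relations": [
                    {"key": rel, "values": list(values)}
                    for rel, values in relations.items()
                ],
            }
            for chunk, relations in entries.items()
        ]
    }


def check_mapper_dictionary(data: dict) -> None:
    """Raise ValueError when a key is not a string or a value list is empty."""
    for mapping in data.get("mappings", []):
        if not isinstance(mapping.get("key"), str):
            raise ValueError(f"mapping without a string key: {mapping!r}")
        for relation in mapping.get("relations", []):
            values = relation.get("values")
            if not isinstance(relation.get("key"), str):
                raise ValueError(f"relation without a string key: {relation!r}")
            if not isinstance(values, list) or not values:
                raise ValueError(f"relation without values: {relation!r}")


entries = {
    "metformin": {
        "action": ["hypoglycemic", "Drugs Used In Diabetes"],
        "treatment": ["diabetes", "t2dm"],
    }
}

generated = build_mapper_dictionary(entries)
check_mapper_dictionary(generated)

# Hypothetical output location; any writable path works.
out_path = os.path.join(tempfile.gettempdir(), "generated_drug.json")
with open(out_path, "w", encoding="utf-8") as f:
    json.dump(generated, f, ensure_ascii=False, indent=4)

print(out_path)
```

For the single "metformin" entry, `generated` has the same structure as `data_set` above, so the resulting file could be passed to `setDictionary()` in the same way.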

With `setRels()`, we tell the model which type of mapping we want. To return the **action** mapping, we call `setRels(["action"])`; to return the **treatment** mapping, we call `setRels(["treatment"])`.

Let's create a pipeline and see it in action. 

In [None]:
document_assembler = DocumentAssembler()\
      .setInputCol('text')\
      .setOutputCol('document')

sentence_detector = SentenceDetector()\
      .setInputCols(["document"])\
      .setOutputCol("sentence")

tokenizer = Tokenizer()\
      .setInputCols("sentence")\
      .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
      .setInputCols(["sentence", "token"])\
      .setOutputCol("embeddings")

#NER model to detect drug in the text
clinical_ner = MedicalNerModel.pretrained("ner_posology_small","en","clinical/models")\
      .setInputCols(["sentence","token","embeddings"])\
      .setOutputCol("ner")\
      .setLabelCasing("upper")
 
ner_converter = NerConverterInternal()\
      .setInputCols(["sentence", "token", "ner"])\
      .setOutputCol("ner_chunk")\
      .setWhiteList(["DRUG"])

chunkerMapper = ChunkMapperApproach()\
      .setInputCols(["ner_chunk"])\
      .setOutputCol("mappings")\
      .setDictionary("/content/sample_drug.json")\
      .setRels(["action"]) #or treatment

pipeline = Pipeline().setStages([document_assembler,
                                 sentence_detector,
                                 tokenizer, 
                                 word_embeddings,
                                 clinical_ner, 
                                 ner_converter, 
                                 chunkerMapper])

text = ["The patient was given 1 unit of metformin daily."]

test_data = spark.createDataFrame([text]).toDF("text")

model = pipeline.fit(test_data)
res= model.transform(test_data)


embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_posology_small download started this may take some time.
[OK!]


In [None]:
res.printSchema()

root
 |-- text: string (nullable = true)
 |-- document: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- sentence: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true

Checking the ner result

In [None]:
res.select(F.explode('ner_chunk.result').alias("chunks")).show(truncate=False)

+---------+
|chunks   |
+---------+
|metformin|
+---------+



Checking the mapper result

In [None]:
res.selectExpr("mappings.metadata").show(truncate=False)

+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|metadata                                                                                                                                                                                                                                                                                |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{chunk -> 0, __trained__ -> metformin, relation -> action, __distance_function__ -> levenshtein, confidence -> 0.9994, ner_source -> ner_chunk, ops -

In [None]:
res.select(F.explode(F.arrays_zip(res.ner_chunk.result, 
                                  res.mappings.result, 
                                  res.mappings.metadata)).alias("col"))\
    .select(F.expr("col['0']").alias("ner_chunk"),
            F.expr("col['1']").alias("mapping_result"),
            F.expr("col['2']['all_relations']").alias("all_relations")).show(truncate=False)

+---------+--------------+----------------------+
|ner_chunk|mapping_result|all_relations         |
+---------+--------------+----------------------+
|metformin|hypoglycemic  |Drugs Used In Diabetes|
+---------+--------------+----------------------+



As you can see, the model that we created with `ChunkMapperApproach()` successfully mapped "metformin". Under the metadata, we can see all the relations that we defined in the JSON file.
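
The `arrays_zip` plus `explode` query above pairs the i-th chunk with the i-th mapping result and its metadata. A plain-Python analogue, with values copied from the output above purely for illustration, looks like this. Spark's `arrays_zip` pads shorter arrays with nulls, so the sketch uses `zip_longest` rather than `zip`.

```python
from itertools import zip_longest

# Illustrative values copied from the single-row output shown above.
ner_chunks = ["metformin"]
mapping_results = ["hypoglycemic"]
mapping_metadata = [{"relation": "action", "all_relations": "Drugs Used In Diabetes"}]

# Pair elements by position; missing elements become None, as with arrays_zip.
rows = [
    {
        "ner_chunk": chunk,
        "mapping_result": result,
        "all_relations": (meta or {}).get("all_relations"),
    }
    for chunk, result, meta in zip_longest(ner_chunks, mapping_results, mapping_metadata)
]

for row in rows:
    print(row)
```

Because the pairing is purely positional, it only lines up when each chunk produces exactly one mapping row.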

### 2.1- Save the model to disk 

Now we will save our model and then load it with `ChunkMapperModel()`.

In [None]:
model.stages[-1].write().save("models/drug_mapper")

Now let's use the saved model. This time we will check the 'treatment' mapping results.


In [None]:
document_assembler = DocumentAssembler()\
      .setInputCol('text')\
      .setOutputCol('document')

sentence_detector = SentenceDetector()\
      .setInputCols(["document"])\
      .setOutputCol("sentence")

tokenizer = Tokenizer()\
      .setInputCols("sentence")\
      .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
      .setInputCols(["sentence", "token"])\
      .setOutputCol("embeddings")

#NER model to detect drug in the text
clinical_ner = MedicalNerModel.pretrained("ner_posology_small","en","clinical/models")\
	    .setInputCols(["sentence","token","embeddings"])\
	    .setOutputCol("ner")\
      .setLabelCasing("upper")
 
ner_converter = NerConverterInternal()\
      .setInputCols(["sentence", "token", "ner"])\
      .setOutputCol("ner_chunk")\
      .setWhiteList(["DRUG"])

chunkerMapper = ChunkMapperModel.load("/content/models/drug_mapper")\
      .setInputCols(["ner_chunk"])\
      .setOutputCol("mappings")\
      .setRels(["treatment"]) 

pipeline = Pipeline().setStages([document_assembler,
                                 sentence_detector,
                                 tokenizer, 
                                 word_embeddings,
                                 clinical_ner, 
                                 ner_converter, 
                                 chunkerMapper])

text = ["The patient was given 1 unit of metformin daily."]

test_data = spark.createDataFrame([text]).toDF("text")

model = pipeline.fit(test_data)
res= model.transform(test_data)

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_posology_small download started this may take some time.
[OK!]


In [None]:
res.selectExpr("mappings.metadata").show(truncate=False)

+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|metadata                                                                                                                                                                                                                                                                    |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{chunk -> 0, __trained__ -> metformin, relation -> treatment, __distance_function__ -> levenshtein, confidence -> 0.9994, ner_source -> ner_chunk, ops -> 0.0, all_relations -> t2dm, ent

In [None]:
res.select(F.explode(F.arrays_zip(res.ner_chunk.result, 
                                  res.mappings.result, 
                                  res.mappings.metadata)).alias("col"))\
    .select(F.expr("col['0']").alias("ner_chunk"),
            F.expr("col['1']").alias("mapping_result"),
            F.expr("col['2']['all_relations']").alias("all_relations")).show(truncate=False)

+---------+--------------+-------------+
|ner_chunk|mapping_result|all_relations|
+---------+--------------+-------------+
|metformin|diabetes      |t2dm         |
+---------+--------------+-------------+



As you see above, we created our own drug mapper model successfully. 

### 2.2- Create a Case-Insensitive or Case-Sensitive Model

While creating a model with `ChunkMapperApproach`, we can control case sensitivity with the `setLowerCase()` parameter.

Let's create a new mapping dictionary and see how it works. 

In [None]:
data_set= {
    "mappings": [
        {
            "key": "Warfarina lusa",
            "relations": [
                {
                    "key": "action",
                    "values": [
                        "Analgesic",
                        "Antipyretic"
                    ]
                },
                {
                    "key": "treatment",
                    "values": [
                        "diabetes",
                        "t2dm"
                    ]
                }
            ]
        }
    ]
}

import json
with open('mappings.json', 'w', encoding='utf-8') as f:
    json.dump(data_set, f, ensure_ascii=False, indent=4)

In [None]:
sentences = [
        ["""The patient was given Warfarina Lusa and amlodipine 10 MG.The patient was given Aspagin, coumadin 5 mg, coumadin, and he has metamorfin"""]
    ]


test_data = spark.createDataFrame(sentences).toDF("text")

**`setLowerCase(True)`**

In [None]:
chunkerMapper = ChunkMapperApproach() \
        .setInputCols(["ner_chunk"]) \
        .setOutputCol("mappings") \
        .setDictionary("mappings.json") \
        .setRels(["action"]) \
        .setLowerCase(True)

pipeline = Pipeline().setStages([document_assembler,
                                 sentence_detector,
                                 tokenizer, 
                                 word_embeddings,
                                 clinical_ner, 
                                 ner_converter, 
                                 chunkerMapper])


result_df = pipeline.fit(test_data).transform(test_data)
result_df.selectExpr("explode(mappings)").show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|col                                                                                                                                                                                                                                                                                                                                                |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

The key "Warfarina lusa" has a lowercase "l" in the source JSON file, while the example sentence spells it "Warfarina Lusa" with a capital "L". Because we trained the model with `setLowerCase(True)`, it mapped the entity despite the difference in casing.

Let's check with `setLowerCase(False)` and see the difference. 

In [None]:
chunkerMapper = ChunkMapperApproach() \
        .setInputCols(["ner_chunk"]) \
        .setOutputCol("mappings") \
        .setDictionary("mappings.json") \
        .setRels(["action"]) \
        .setLowerCase(False)

pipeline = Pipeline().setStages([document_assembler,
                                 sentence_detector,
                                 tokenizer, 
                                 word_embeddings,
                                 clinical_ner, 
                                 ner_converter, 
                                 chunkerMapper])


result_df = pipeline.fit(test_data).transform(test_data)
result_df.selectExpr("explode(mappings)").show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------------+
|col                                                                                                                                          |
+---------------------------------------------------------------------------------------------------------------------------------------------+
|{labeled_dependency, 22, 35, NONE, {chunk -> 0, confidence -> 0.66565, ner_source -> ner_chunk, entity -> Warfarina Lusa, sentence -> 0}, []}|
|{labeled_dependency, 41, 50, NONE, {chunk -> 1, confidence -> 0.9999, ner_source -> ner_chunk, entity -> amlodipine, sentence -> 0}, []}     |
|{labeled_dependency, 80, 86, NONE, {chunk -> 2, confidence -> 0.9905, ner_source -> ner_chunk, entity -> Aspagin, sentence -> 0}, []}        |
|{labeled_dependency, 89, 96, NONE, {chunk -> 3, confidence -> 0.9997, ner_source -> ner_chunk, entity -> coumadin, sentence -> 0}, []} 

As you can see, with `setLowerCase(False)` our model couldn't map "Warfarina Lusa", because its casing differs from the dictionary key "Warfarina lusa".
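
The effect of the flag can be pictured as normalizing both the dictionary key and the chunk before comparing them. This is a minimal plain-Python sketch of that idea; `find_key` is a hypothetical helper and does not reflect how the annotator is implemented.

```python
def find_key(keys, chunk, lower_case):
    """Return the dictionary key matching `chunk`, or None if there is no match."""
    if lower_case:
        # Compare lowercased forms, but return the key as written in the dictionary.
        normalized = {key.lower(): key for key in keys}
        return normalized.get(chunk.lower())
    # Case-sensitive comparison: the chunk must match a key exactly.
    return chunk if chunk in keys else None


keys = ["Warfarina lusa"]
print(find_key(keys, "Warfarina Lusa", lower_case=True))
print(find_key(keys, "Warfarina Lusa", lower_case=False))
```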

### 2.3- Selecting Multiple Relations 

We can select multiple relations for the same chunk with the `setRels()` parameter.

In [None]:
chunkerMapper = ChunkMapperApproach() \
        .setInputCols(["ner_chunk"]) \
        .setOutputCol("mappings") \
        .setDictionary("mappings.json") \
        .setLowerCase(True) \
        .setRels(["action", "treatment"])

pipeline = Pipeline().setStages([document_assembler,
                                 sentence_detector,
                                 tokenizer, 
                                 word_embeddings,
                                 clinical_ner, 
                                 ner_converter, 
                                 chunkerMapper])


result_df = pipeline.fit(test_data).transform(test_data)
result_df.selectExpr("explode(mappings)").show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|col                                                                                                                                                                                                                                                                                                                                                |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

As you can see, we get all the selected relations (action and treatment) at the same time.

### 2.4- Filtering Multi-token Chunks

If a chunk consists of multiple tokens separated by whitespace, we can control whether it is mapped by using the `setAllowMultiTokenChunk()` parameter.

In [None]:
chunkerMapper = ChunkMapperApproach() \
        .setInputCols(["ner_chunk"]) \
        .setOutputCol("mappings") \
        .setDictionary("mappings.json") \
        .setLowerCase(True) \
        .setRels(["action", "treatment"]) \
        .setAllowMultiTokenChunk(False)

pipeline = Pipeline().setStages([document_assembler,
                                 sentence_detector,
                                 tokenizer, 
                                 word_embeddings,
                                 clinical_ner, 
                                 ner_converter, 
                                 chunkerMapper])


result_df = pipeline.fit(test_data).transform(test_data)
result_df.selectExpr("explode(mappings)").show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------------+
|col                                                                                                                                          |
+---------------------------------------------------------------------------------------------------------------------------------------------+
|{labeled_dependency, 22, 35, NONE, {chunk -> 0, confidence -> 0.66565, ner_source -> ner_chunk, entity -> Warfarina Lusa, sentence -> 0}, []}|
|{labeled_dependency, 41, 50, NONE, {chunk -> 1, confidence -> 0.9999, ner_source -> ner_chunk, entity -> amlodipine, sentence -> 0}, []}     |
|{labeled_dependency, 80, 86, NONE, {chunk -> 2, confidence -> 0.9905, ner_source -> ner_chunk, entity -> Aspagin, sentence -> 0}, []}        |
|{labeled_dependency, 89, 96, NONE, {chunk -> 3, confidence -> 0.9997, ner_source -> ner_chunk, entity -> coumadin, sentence -> 0}, []} 

The chunk "Warfarina Lusa" consists of multiple tokens, so our mapper model skips that entity. <br/>
Now let's set `.setAllowMultiTokenChunk(True)` and see the difference.

In [None]:
chunkerMapper = ChunkMapperApproach() \
        .setInputCols(["ner_chunk"]) \
        .setOutputCol("mappings") \
        .setDictionary("mappings.json") \
        .setLowerCase(True) \
        .setRels(["action", "treatment"]) \
        .setAllowMultiTokenChunk(True)

pipeline = Pipeline().setStages([document_assembler,
                                 sentence_detector,
                                 tokenizer, 
                                 word_embeddings,
                                 clinical_ner, 
                                 ner_converter, 
                                 chunkerMapper])


result_df = pipeline.fit(test_data).transform(test_data)
result_df.selectExpr("explode(mappings)").show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|col                                                                                                                                                                                                                                                                                                                                                |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
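
As a recap of the two runs above, the multi-token switch can be thought of as a simple eligibility check on each chunk. The sketch below is plain Python for illustration; `is_mappable` is a hypothetical name, and splitting on whitespace is an assumption taken from the description in this section.

```python
def is_mappable(chunk, allow_multi_token):
    """Decide whether a chunk is eligible for dictionary lookup."""
    if allow_multi_token:
        return True
    # A chunk counts as multi-token when whitespace splits it into several parts.
    return len(chunk.split()) == 1


chunks = ["Warfarina Lusa", "amlodipine", "Aspagin", "coumadin"]
for chunk in chunks:
    print(chunk, is_mappable(chunk, allow_multi_token=False))
```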

### 2.5- Lexical Fuzzy Matching Options in the ChunkMapper annotator
There are multiple options for fuzzy matching with the ChunkMapper annotator:
- Partial Token NGram Fingerprinting: Especially useful when two frequent cases combine: noisy, non-informative tokens at the beginning or end of the chunk, and token order that is not strictly relevant, e.g. stomach acute pain --> acute pain stomach ; metformin 100 mg --> metformin.
- Char NGram Fingerprinting: Especially useful for typos or different spacing patterns in chunks, e.g. head ache / ache head --> headache ; metformini / metformoni / metformni --> metformin.
- Fuzzy Distance (Slow): Especially useful when the mapping can be defined in terms of edit-distance thresholds, using character-based functions such as Levenshtein, Hamming, or LongestCommonSubsequence, or token-based functions such as Cosine or Jaccard.

The mapping logic runs these options in the order listed above and, within each option, tries the longest keys first, as an intuitive way to minimize false positives.
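
To make the first option concrete, here is a minimal plain-Python sketch of token n-gram fingerprinting in the spirit of the OpenRefine fingerprint method. It is an illustration under stated assumptions, not the annotator's implementation: `fingerprint`, `token_ngrams`, and `match_by_token_fingerprint` are hypothetical names, and the sketch ignores settings such as the dropping-characters ratio.

```python
def fingerprint(tokens):
    """Build an order-insensitive, case-insensitive key for a token sequence."""
    return " ".join(sorted(token.lower() for token in tokens))


def token_ngrams(tokens, min_n, max_n):
    """Yield contiguous token n-grams, longest first."""
    for n in range(min(max_n, len(tokens)), min_n - 1, -1):
        for start in range(len(tokens) - n + 1):
            yield tokens[start:start + n]


def match_by_token_fingerprint(chunk, keys, min_n=1, max_n=3):
    """Return the first key whose fingerprint equals that of some n-gram of `chunk`."""
    index = {fingerprint(key.split()): key for key in keys}
    for gram in token_ngrams(chunk.split(), min_n, max_n):
        hit = index.get(fingerprint(gram))
        if hit is not None:
            return hit
    return None


keys = ["Warfarina lusa", "amlodipine", "coumadin", "aspagin", "metformin"]
for chunk in ["Lusa Warfarina 5mg", "amlodipine 10", "coumadin 5 mg", "Aspaginaspa"]:
    print(chunk, "->", match_by_token_fingerprint(chunk, keys))
```

Trying longer n-grams first mirrors the longest-key-first idea: a chunk is matched on as many of its tokens as possible before falling back to a single token.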

For more information, please visit the following links:  
https://en.wikipedia.org/wiki/Fingerprint_(computing)  
https://openrefine.org/docs/technical-reference/clustering-in-depth  
https://commons.apache.org/proper/commons-text/apidocs/org/apache/commons/text/similarity/package-summary.html
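
For the fuzzy distance option, the sketch below shows a Levenshtein distance normalized by the longer string's length and compared against a threshold. It is plain Python for illustration only: `normalized_distance` and `closest_key` are hypothetical helpers, and normalizing by the longer length is an assumption rather than the annotator's documented behaviour.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalized_distance(a, b):
    """Edit distance divided by the length of the longer string (0.0 means identical)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def closest_key(chunk, keys, max_distance):
    """Return the nearest key if it is within `max_distance`, else None."""
    best = min(keys, key=lambda k: normalized_distance(chunk.lower(), k.lower()))
    if normalized_distance(chunk.lower(), best.lower()) <= max_distance:
        return best
    return None


keys = ["Warfarina lusa", "amlodipine", "coumadin", "aspagin", "metformin"]
print(closest_key("metformini", keys, max_distance=0.31))
print(closest_key("xyz", keys, max_distance=0.31))
```

Unlike a fingerprint lookup, this compares the chunk against every key, which is consistent with this option being labelled slow in the list above.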

In [None]:
data_set_mappings = [
        {
            "key": "Warfarina lusa",
            "relations": [
                {
                    "key": "action",
                    "values": [
                        "Analgesic",
                        "Antipyretic"
                    ]
                },
                {
                    "key": "treatment",
                    "values": [
                        "diabetes",
                        "t2dm"
                    ]
                }
            ]
        },
        {
            "key": "amlodipine",
            "relations": [
                {
                    "key": "action",
                    "values": [
                        "Calcium Ions Inhibitor"
                    ]
                },
                {
                    "key": "treatment",
                    "values": [
                        "hypertension"
                    ]
                }
            ]
        },
        {
            "key": "coumadin",
            "relations": [
                {
                    "key": "action",
                    "values": [
                        "Coagulation Inhibitor"
                    ]
                },
                {
                    "key": "treatment",
                    "values": [
                        "hypertension"
                    ]
                }
            ]
        },
        {
            "key": "aspagin",
            "relations": [
                {
                    "key": "action",
                    "values": [
                        "Cycooxygenase Inhibitor"
                    ]
                },
                {
                    "key": "treatment",
                    "values": [
                        "arthritis"
                    ]
                }
            ]
        },
        {
            "key": "metformin",
            "relations": [
                {
                  "key": "action",
                  "values" : ["hypoglycemic", "Drugs Used In Diabetes"]
                },
                {
                  "key": "treatment",
                  "values" : ["diabetes", "t2dm"]
                }
            ]
        }
    ]

#### Different mapping sizes to test the ChunkMapper's sensitivity in terms of speed and efficiency

In [None]:
# Keys to test speed and efficiency
extra_keys = {
    "s500": [{"key": f"short key {i}", "relations": [
                {
                    "key": "any",
                    "values": [
                        "anyvalue",
                        "anyvalue"
                    ]
                }]} for i in range(500)],
    "s5000": [{"key": f"short key {i}", "relations": [
                {
                    "key": "any",
                    "values": [
                        "anyvalue",
                        "anyvalue"
                    ]
                }]} for i in range(5000)],
    "l5000": [{"key": f"a bit longer key {i}", "relations": [
                {
                    "key": "any",
                    "values": [
                        "anyvalue",
                        "anyvalue"
                    ]
                }]} for i in range(5000)]
}

In [None]:
import json
for c, extra_mappings in extra_keys.items():
    with open(f'mappings_{c}.json', 'w', encoding='utf-8') as f:
        json.dump({'mappings': data_set_mappings + extra_mappings}, f, ensure_ascii=False, indent=4)

In [None]:
sentences = [
        ["""The patient was given Lusa Warfarina 5mg and amlodipine 10 MG.The patient was given Aspaginaspa, coumadin 5 mg, coumadin, and he has metamorfin"""]
    ]

test_data = spark.createDataFrame(sentences).toDF("text")

#### Greedy Posology for longer and more illustrative chunks

In [None]:
document_assembler = DocumentAssembler()\
      .setInputCol('text')\
      .setOutputCol('document')

sentence_detector = SentenceDetector()\
      .setInputCols(["document"])\
      .setOutputCol("sentence")

tokenizer = Tokenizer()\
      .setInputCols("sentence")\
      .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en","clinical/models")\
      .setInputCols(["sentence", "token"])\
      .setOutputCol("embeddings")

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]


In [None]:
#NER model to detect drug in the text
clinical_ner = MedicalNerModel.pretrained("ner_posology_greedy","en","clinical/models")\
        .setInputCols(["sentence","token","embeddings"])\
        .setOutputCol("ner")\
        .setLabelCasing("upper")

ner_converter = NerConverterInternal()\
      .setInputCols("sentence", "token", "ner")\
      .setOutputCol("ner_chunk")

pipeline = Pipeline().setStages([document_assembler,
                                 sentence_detector,
                                 tokenizer, 
                                 word_embeddings,
                                 clinical_ner, 
                                 ner_converter])
cached_df = pipeline.fit(test_data).transform(test_data).cache()
cached_df.selectExpr("explode(ner_chunk) as chunk").show(truncate=False)

ner_posology_greedy download started this may take some time.
[OK!]
+-----------------------------------------------------------------------------------------------------------------------------------+
|chunk                                                                                                                              |
+-----------------------------------------------------------------------------------------------------------------------------------+
|{chunk, 22, 39, Lusa Warfarina 5mg, {chunk -> 0, confidence -> 0.8111, ner_source -> ner_chunk, entity -> DRUG, sentence -> 0}, []}|
|{chunk, 45, 57, amlodipine 10, {chunk -> 1, confidence -> 0.66709995, ner_source -> ner_chunk, entity -> DRUG, sentence -> 0}, []} |
|{chunk, 84, 94, Aspaginaspa, {chunk -> 2, confidence -> 0.9827, ner_source -> ner_chunk, entity -> DRUG, sentence -> 0}, []}       |
|{chunk, 97, 109, coumadin 5 mg, {chunk -> 3, confidence -> 0.7287, ner_source -> ner_chunk, entity -> DRUG, sentence -> 0}, []}

#### Example with just token fingerprinting

In [None]:
cm = ChunkMapperApproach() \
        .setInputCols(["ner_chunk"]) \
        .setLowerCase(True) \
        .setRels(["action", "treatment"]) \
        .setAllowMultiTokenChunk(True) \
        .setEnableTokenFingerprintMatching(True) \
        .setMinTokenNgramFingerprint(1) \
        .setMaxTokenNgramFingerprint(3) \
        .setMaxTokenNgramDroppingCharsRatio(0.5)

chunkerMappers = [
    cm.copy().setOutputCol(f"mappings_{c}").setDictionary(f"mappings_{c}.json") \
    for c in extra_keys]

result_df = Pipeline(stages=chunkerMappers).fit(cached_df).transform(cached_df)
result_df.selectExpr("explode(mappings_s500)").show(truncate=False)

+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|col                                                                                                                                                                                                                                                                                                                                                 |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [None]:
result_df.select(F.explode(F.arrays_zip(result_df.mappings_s500.result, 
                                  result_df.mappings_s500.metadata)).alias("col"))\
    .select(F.expr("col['1']['entity']").alias("ner_chunk"),
            F.expr("col['1']['__trained__']").alias("fixed_chunk"),
            F.expr("col['0']").alias("action_mapping_result"),
            F.expr("col['1']['relation']").alias("relation")).show(truncate=False)

+------------------+--------------+----------------------+---------+
|ner_chunk         |fixed_chunk   |action_mapping_result |relation |
+------------------+--------------+----------------------+---------+
|Lusa Warfarina 5mg|Warfarina lusa|Analgesic             |action   |
|Lusa Warfarina 5mg|Warfarina lusa|diabetes              |treatment|
|amlodipine 10     |amlodipine    |Calcium Ions Inhibitor|action   |
|amlodipine 10     |amlodipine    |hypertension          |treatment|
|Aspaginaspa       |null          |NONE                  |null     |
|coumadin 5 mg     |coumadin      |Coagulation Inhibitor |action   |
|coumadin 5 mg     |coumadin      |hypertension          |treatment|
|coumadin          |coumadin      |Coagulation Inhibitor |action   |
|coumadin          |coumadin      |hypertension          |treatment|
|metamorfin        |null          |NONE                  |null     |
+------------------+--------------+----------------------+---------+



In [None]:
%%timeit
result_df.selectExpr("explode(mappings_s500)").write.mode("overwrite").save("timing_test")

252 ms ± 39.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [None]:
%%timeit
result_df.selectExpr("explode(mappings_s5000)").write.mode("overwrite").save("timing_test")

241 ms ± 43.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [None]:
%%timeit
result_df.selectExpr("explode(mappings_l5000)").write.mode("overwrite").save("timing_test")

245 ms ± 50.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


As we can see, token fingerprinting is largely insensitive to the size of the mapping dictionary.
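
One plausible reading of these timings, offered as an assumption rather than a description of the library internals: if fingerprints are kept in a hash-based index, a lookup costs roughly the same whether the dictionary holds hundreds or thousands of keys. The plain-Python sketch below illustrates that idea with hypothetical helpers `fingerprint` and `build_index`.

```python
def fingerprint(text):
    """Order-insensitive, case-insensitive key for a whitespace-separated string."""
    return " ".join(sorted(text.lower().split()))


def build_index(keys):
    """Map each key's fingerprint to the key; built once, queried many times."""
    return {fingerprint(key): key for key in keys}


base_keys = ["Warfarina lusa", "amlodipine", "coumadin", "aspagin", "metformin"]
small_index = build_index(base_keys + [f"short key {i}" for i in range(500)])
large_index = build_index(base_keys + [f"a bit longer key {i}" for i in range(5000)])

# A dictionary lookup is a single hash probe regardless of how many keys are indexed.
query = fingerprint("Lusa Warfarina")
print(len(small_index), small_index.get(query))
print(len(large_index), large_index.get(query))
```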

#### Example with token and char fingerprinting

In [None]:
cm = ChunkMapperApproach() \
        .setInputCols(["ner_chunk"]) \
        .setLowerCase(True) \
        .setRels(["action", "treatment"]) \
        .setAllowMultiTokenChunk(True) \
        .setEnableTokenFingerprintMatching(True) \
        .setMinTokenNgramFingerprint(1) \
        .setMaxTokenNgramFingerprint(3) \
        .setMaxTokenNgramDroppingCharsRatio(0.5) \
        .setEnableCharFingerprintMatching(True) \
        .setMinCharNgramFingerprint(1) \
        .setMaxCharNgramFingerprint(3)

chunkerMappers = [
    cm.copy().setOutputCol(f"mappings_{c}").setDictionary(f"mappings_{c}.json") \
    for c in extra_keys]

result_df = Pipeline(stages=chunkerMappers).fit(cached_df).transform(cached_df)
result_df.selectExpr("explode(mappings_s500)").show(truncate=False)

+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|col                                                                                                                                                                                                                                                                                                                                                 |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [None]:
result_df.select(F.explode(F.arrays_zip(result_df.mappings_s500.result, 
                                  result_df.mappings_s500.metadata)).alias("col"))\
    .select(F.expr("col['1']['entity']").alias("ner_chunk"),
            F.expr("col['1']['__trained__']").alias("fixed_chunk"),
            F.expr("col['0']").alias("action_mapping_result"),
            F.expr("col['1']['relation']").alias("relation")).show(truncate=False)

+------------------+--------------+-----------------------+---------+
|ner_chunk         |fixed_chunk   |action_mapping_result  |relation |
+------------------+--------------+-----------------------+---------+
|Lusa Warfarina 5mg|Warfarina lusa|Analgesic              |action   |
|Lusa Warfarina 5mg|Warfarina lusa|diabetes               |treatment|
|amlodipine 10     |amlodipine    |Calcium Ions Inhibitor |action   |
|amlodipine 10     |amlodipine    |hypertension           |treatment|
|Aspaginaspa       |aspagin       |Cycooxygenase Inhibitor|action   |
|Aspaginaspa       |aspagin       |arthritis              |treatment|
|coumadin 5 mg     |coumadin      |Coagulation Inhibitor  |action   |
|coumadin 5 mg     |coumadin      |hypertension           |treatment|
|coumadin          |coumadin      |Coagulation Inhibitor  |action   |
|coumadin          |coumadin      |hypertension           |treatment|
|metamorfin        |null          |NONE                   |null     |
+------------------+

In [None]:
%%timeit
result_df.selectExpr("explode(mappings_s500)").write.mode("overwrite").save("timing_test")

268 ms ± 41.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [None]:
%%timeit
result_df.selectExpr("explode(mappings_s5000)").write.mode("overwrite").save("timing_test")

220 ms ± 23.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [None]:
%%timeit
result_df.selectExpr("explode(mappings_l5000)").write.mode("overwrite").save("timing_test")

225 ms ± 23.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


As the timings show, fingerprint matching is largely insensitive to the size of the mappings dictionary.
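
One way to see why fingerprint matching scales this way is a minimal pure-Python sketch. It is not the `ChunkMapperApproach` implementation: the `tokenize`, `fingerprint`, `token_ngrams` and `lookup` helpers and the two-entry dictionary are hypothetical, and the normalization (lowercase, de-duplicate, sort tokens) is an assumption. The point is only that each chunk is reduced to a handful of keys that are checked with hash lookups, so the cost depends on the chunk length rather than on the number of dictionary entries.

```python
import re

def tokenize(text):
    """Lowercase the text and keep alphanumeric tokens only."""
    return re.findall(r"[a-z0-9]+", text.lower())

def fingerprint(tokens):
    """Order-insensitive key: unique tokens, sorted and joined."""
    return " ".join(sorted(set(tokens)))

def token_ngrams(tokens, min_n=1, max_n=3):
    """Yield contiguous token n-grams, longest first."""
    for n in range(min(max_n, len(tokens)), min_n - 1, -1):
        for i in range(len(tokens) - n + 1):
            yield tokens[i:i + n]

# Hypothetical miniature dictionary: trained key -> action
mappings = {"Warfarina lusa": "Analgesic", "amlodipine": "Calcium Ions Inhibitor"}

# Built once at "fit" time: fingerprint -> (trained key, action)
index = {fingerprint(tokenize(key)): (key, action) for key, action in mappings.items()}

def lookup(chunk):
    """Try every n-gram fingerprint of the chunk; each try is one hash lookup."""
    for gram in token_ngrams(tokenize(chunk)):
        hit = index.get(fingerprint(gram))
        if hit is not None:
            return hit
    return None

print(lookup("Lusa Warfarina 5mg"))
print(lookup("amlodipine 10"))
print(lookup("metamorfin"))
```

In this sketch a misspelling such as `metamorfin` still returns `None`, because no n-gram of the chunk produces a key that exists in the index.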

#### Example with token and char fingerprinting plus fuzzy distance calculation

In [None]:
cm = ChunkMapperApproach() \
        .setInputCols(["ner_chunk"]) \
        .setOutputCol("mappings") \
        .setDictionary("mappings.json") \
        .setLowerCase(True) \
        .setRels(["action", "treatment"]) \
        .setAllowMultiTokenChunk(True) \
        .setEnableTokenFingerprintMatching(True) \
        .setMinTokenNgramFingerprint(1) \
        .setMaxTokenNgramFingerprint(3) \
        .setMaxTokenNgramDroppingCharsRatio(0.5) \
        .setEnableCharFingerprintMatching(True) \
        .setMinCharNgramFingerprint(1) \
        .setMaxCharNgramFingerprint(3) \
        .setEnableFuzzyMatching(True) \
        .setFuzzyMatchingDistanceThresholds(0.31)

chunkerMappers = [
    cm.copy().setOutputCol(f"mappings_{c}").setDictionary(f"mappings_{c}.json") \
    for c in extra_keys]

result_df = Pipeline(stages=chunkerMappers).fit(cached_df).transform(cached_df)
result_df.selectExpr("explode(mappings_s500)").show(truncate=False)

+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|col                                                                                                                                                                                                                                                                                                                                                 |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [None]:
result_df.select(F.explode(F.arrays_zip(result_df.mappings_s500.result, 
                                  result_df.mappings_s500.metadata)).alias("col"))\
    .select(F.expr("col['1']['entity']").alias("ner_chunk"),
            F.expr("col['1']['__trained__']").alias("fixed_chunk"),
            F.expr("col['0']").alias("action_mapping_result"),
            F.expr("col['1']['relation']").alias("relation")).show(truncate=False)

+------------------+--------------+-----------------------+---------+
|ner_chunk         |fixed_chunk   |action_mapping_result  |relation |
+------------------+--------------+-----------------------+---------+
|Lusa Warfarina 5mg|Warfarina lusa|Analgesic              |action   |
|Lusa Warfarina 5mg|Warfarina lusa|diabetes               |treatment|
|amlodipine 10     |amlodipine    |Calcium Ions Inhibitor |action   |
|amlodipine 10     |amlodipine    |hypertension           |treatment|
|Aspaginaspa       |aspagin       |Cycooxygenase Inhibitor|action   |
|Aspaginaspa       |aspagin       |arthritis              |treatment|
|coumadin 5 mg     |coumadin      |Coagulation Inhibitor  |action   |
|coumadin 5 mg     |coumadin      |hypertension           |treatment|
|coumadin          |coumadin      |Coagulation Inhibitor  |action   |
|coumadin          |coumadin      |hypertension           |treatment|
|metamorfin        |metformin     |hypoglycemic           |action   |
|metamorfin        |

In [None]:
%%timeit
result_df.selectExpr("explode(mappings_s500)").write.mode("overwrite").save("timing_test")

286 ms ± 39.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [None]:
%%timeit
result_df.selectExpr("explode(mappings_s5000)").write.mode("overwrite").save("timing_test")

653 ms ± 56.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [None]:
%%timeit
result_df.selectExpr("explode(mappings_l5000)").write.mode("overwrite").save("timing_test")

751 ms ± 119 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


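As the timings show, fuzzy distance calculation is strongly affected by the size of the mappings dictionary: the run time grows from 286 ms for `s500` to 653 ms for `s5000` and 751 ms for `l5000`.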
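
The dependence on dictionary size can be illustrated with a minimal pure-Python sketch of distance-based matching. This is an illustration under assumptions, not the library's algorithm: the `levenshtein` and `fuzzy_lookup` helpers are hypothetical, the three-key list is made up, and normalizing the edit distance by the longer string length is only one possible way to compare it with a threshold such as `0.31`. What it shows is that every candidate key needs its own distance computation, so the work grows with the number of keys.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def fuzzy_lookup(chunk, keys, threshold):
    """Return the closest key if its normalized distance is within the threshold."""
    best_key, best_score = None, float("inf")
    for key in keys:  # one distance computation per dictionary key
        score = levenshtein(chunk, key) / max(len(chunk), len(key), 1)
        if score < best_score:
            best_key, best_score = key, score
    return best_key if best_score <= threshold else None

# Hypothetical key list; a real dictionary would hold thousands of entries
keys = ["metformin", "coumadin", "amlodipine"]
print(fuzzy_lookup("metamorfin", keys, threshold=0.31))
```

With a stricter threshold, for example `0.1`, the same call returns `None`, which is the trade-off the threshold parameter controls in this sketch.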

#### Example with fuzzy distance calculation using a pretrained model

In [None]:
chunkerMapper_action = ChunkMapperModel.pretrained("drug_action_treatment_mapper", "en", "clinical/models")\
    .setInputCols(["ner_chunk"])\
    .setOutputCol("Action")\
    .setRels(["action"]) \
    .setAllowMultiTokenChunk(True) \
    .setEnableFuzzyMatching(True) \
    .setFuzzyMatchingDistanceThresholds(0.6)


result_df = chunkerMapper_action.transform(cached_df)
result_df.selectExpr("explode(Action)").show(truncate=False)

drug_action_treatment_mapper download started this may take some time.
[OK!]
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|col                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
+

In [None]:
result_df.select(F.explode(F.arrays_zip(result_df.Action.result, 
                                  result_df.Action.metadata)).alias("col"))\
    .select(F.expr("col['1']['entity']").alias("ner_chunk"),
            F.expr("col['1']['__trained__']").alias("fixed_chunk"),
            F.expr("col['0']").alias("action_mapping_result"),
            F.expr("col['1']['relation']").alias("relation")).show(truncate=False)

+------------------+---------------------------+---------------------------------------------+---------+
|ner_chunk         |fixed_chunk                |action_mapping_result                        |relation |
+------------------+---------------------------+---------------------------------------------+---------+
|Lusa Warfarina 5mg|pravastatina fg            |lipid modifying agents                       |action   |
|Lusa Warfarina 5mg|warfarin pmcs              |anticoagulant                                |action   |
|Lusa Warfarina 5mg|warfarina mk               |anticoagulant                                |action   |
|amlodipine 10     |boie amlodipine besilate   |antianginal                                  |action   |
|amlodipine 10     |azathioprine eg            |antitumour                                   |action   |
|amlodipine 10     |pharex amlodipine besylate |antianginal                                  |action   |
|amlodipine 10     |temax (amlodipine)         |antiang

# 3- ChunkMapperFilterer

The `ChunkMapperFilterer` annotator filters the chunks that were passed through a `ChunkMapperModel`. <br/>

The filter is selected with `.setReturnCriteria()`, which accepts two options: <br/>


**success:** Returns the chunks which are mapped by ChunkMapper <br/>

**fail:** Returns the chunks which are not mapped by ChunkMapper <br/>
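
The two criteria can be pictured with a small pure-Python sketch. The `filter_chunks` helper is hypothetical and works on plain lists rather than on Spark NLP annotations; it assumes that an unmapped chunk is marked with the `NONE` result that the mapper outputs in this notebook show.

```python
def filter_chunks(chunks, mapper_results, criteria="success", missing="NONE"):
    """Keep the chunks that were mapped ("success") or not mapped ("fail")."""
    if criteria not in ("success", "fail"):
        raise ValueError("criteria must be 'success' or 'fail'")
    keep_mapped = criteria == "success"
    return [chunk
            for chunk, result in zip(chunks, mapper_results)
            if (result != missing) == keep_mapped]

# One chunk with a mapping and one without
chunks = ["Adapin 10 MG", "coumadn 5 mg"]
mapper_results = ["1000049", "NONE"]

print(filter_chunks(chunks, mapper_results, "success"))
print(filter_chunks(chunks, mapper_results, "fail"))
```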

Let's apply both options and check the results.

In [None]:
chunkerMapper = ChunkMapperApproach() \
        .setInputCols(["ner_chunk"]) \
        .setOutputCol("mappings") \
        .setDictionary("mappings.json") \
        .setRel("action") \
        .setLowerCase(True) \
        .setRels(["action", "treatment"])

pipeline = Pipeline().setStages([document_assembler,
                                 sentence_detector,
                                 tokenizer, 
                                 word_embeddings,
                                 clinical_ner, 
                                 ner_converter, 
                                 chunkerMapper])


result_df = pipeline.fit(test_data).transform(test_data)
result_df.selectExpr("explode(mappings)").show(truncate=False)

+------------------------------------------------------------------------------------------------------------------------------------------------+
|col                                                                                                                                             |
+------------------------------------------------------------------------------------------------------------------------------------------------+
|{labeled_dependency, 22, 39, NONE, {chunk -> 0, confidence -> 0.8111, ner_source -> ner_chunk, entity -> Lusa Warfarina 5mg, sentence -> 0}, []}|
|{labeled_dependency, 45, 57, NONE, {chunk -> 1, confidence -> 0.66709995, ner_source -> ner_chunk, entity -> amlodipine 10, sentence -> 0}, []} |
|{labeled_dependency, 84, 94, NONE, {chunk -> 2, confidence -> 0.9827, ner_source -> ner_chunk, entity -> Aspaginaspa, sentence -> 0}, []}       |
|{labeled_dependency, 97, 109, NONE, {chunk -> 3, confidence -> 0.7287, ner_source -> ner_chunk, entity -> coumadin 5 

**`.setReturnCriteria("success")`**

In [None]:
cfModel = ChunkMapperFilterer() \
        .setInputCols(["ner_chunk","mappings"]) \
        .setOutputCol("chunks_filtered")\
        .setReturnCriteria("success")

cfModel.transform(result_df).selectExpr("explode(chunks_filtered)").show(truncate=False)

+---+
|col|
+---+
+---+



**`.setReturnCriteria("fail")`**

In [None]:
cfModel = ChunkMapperFilterer() \
        .setInputCols(["ner_chunk","mappings"]) \
        .setOutputCol("chunks_filtered")\
        .setReturnCriteria("fail")

cfModel.transform(result_df).selectExpr("explode(chunks_filtered)").show(truncate=False)

+-----------------------------------------------------------------------------------------------------------------------------------+
|col                                                                                                                                |
+-----------------------------------------------------------------------------------------------------------------------------------+
|{chunk, 22, 39, Lusa Warfarina 5mg, {chunk -> 0, confidence -> 0.8111, ner_source -> ner_chunk, entity -> DRUG, sentence -> 0}, []}|
|{chunk, 45, 57, amlodipine 10, {chunk -> 1, confidence -> 0.66709995, ner_source -> ner_chunk, entity -> DRUG, sentence -> 0}, []} |
|{chunk, 84, 94, Aspaginaspa, {chunk -> 2, confidence -> 0.9827, ner_source -> ner_chunk, entity -> DRUG, sentence -> 0}, []}       |
|{chunk, 97, 109, coumadin 5 mg, {chunk -> 3, confidence -> 0.7287, ner_source -> ner_chunk, entity -> DRUG, sentence -> 0}, []}    |
|{chunk, 112, 119, coumadin, {chunk -> 4, confidence -> 0.9969

# 4- ResolverMerger - Using Sentence Entity Resolver and `ChunkMapperModel` Together

We can merge the results of `ChunkMapperModel` and `SentenceEntityResolverModel` with the `ResolverMerger` annotator.

We first use `ChunkMapperFilterer` to detect the chunks that `ChunkMapperModel` failed to map, resolve only those chunks with `SentenceEntityResolverModel`, and then merge the resolver and mapper results with `ResolverMerger`.

In [None]:
document_assembler = DocumentAssembler()\
      .setInputCol('text')\
      .setOutputCol('document')

sentence_detector = SentenceDetector()\
      .setInputCols(["document"])\
      .setOutputCol("sentence")

tokenizer = Tokenizer()\
      .setInputCols("sentence")\
      .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
      .setInputCols(["sentence", "token"])\
      .setOutputCol("embeddings")

ner_model = MedicalNerModel.pretrained("ner_posology_greedy", "en", "clinical/models")\
      .setInputCols(["sentence", "token", "embeddings"])\
      .setOutputCol("ner")

ner_converter = NerConverterInternal()\
      .setInputCols("sentence", "token", "ner")\
      .setOutputCol("chunk")

chunkerMapper = ChunkMapperModel.pretrained("rxnorm_mapper", "en", "clinical/models")\
      .setInputCols(["chunk"])\
      .setOutputCol("RxNorm_Mapper")\
      .setRels(["rxnorm_code"])

cfModel = ChunkMapperFilterer() \
      .setInputCols(["chunk", "RxNorm_Mapper"]) \
      .setOutputCol("chunks_fail") \
      .setReturnCriteria("fail")

chunk2doc = Chunk2Doc() \
      .setInputCols("chunks_fail") \
      .setOutputCol("chunk_doc")

sbert_embedder = BertSentenceEmbeddings.pretrained('sbiobert_base_cased_mli', 'en','clinical/models')\
      .setInputCols(["chunk_doc"])\
      .setOutputCol("sentence_embeddings")\
      .setCaseSensitive(False)

resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_rxnorm_augmented", "en", "clinical/models") \
      .setInputCols(["sentence_embeddings"]) \
      .setOutputCol("resolver_code") \
      .setDistanceFunction("EUCLIDEAN")

resolverMerger = ResolverMerger()\
      .setInputCols(["resolver_code","RxNorm_Mapper"])\
      .setOutputCol("RxNorm")

mapper_pipeline = Pipeline(
      stages = [
          document_assembler,
          sentence_detector,
          tokenizer,
          word_embeddings,
          ner_model,
          ner_converter,
          chunkerMapper,
          cfModel,
          chunk2doc,
          sbert_embedder,
          resolver,
          resolverMerger
      ])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = mapper_pipeline.fit(empty_data)

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_posology_greedy download started this may take some time.
[OK!]
rxnorm_mapper download started this may take some time.
[OK!]
sbiobert_base_cased_mli download started this may take some time.
Approximate size to download 384.3 MB
[OK!]
sbiobertresolve_rxnorm_augmented download started this may take some time.
[OK!]


In [None]:
samples = [['The patient was given Adapin 10 MG, coumadn 5 mg'],
           ['The patient was given Avandia 4 mg, Tegretol, zitiga'] ]

result = model.transform(spark.createDataFrame(samples).toDF("text"))

In [None]:
result.selectExpr('chunk.result as chunk', 
                  'RxNorm_Mapper.result as RxNorm_Mapper', 
                  'chunks_fail.result as chunks_fail', 
                  'resolver_code.result as resolver_code',
                  'RxNorm.result as RxNorm'
              ).show(truncate = False)

+--------------------------------+----------------------+--------------+-------------+------------------------+
|chunk                           |RxNorm_Mapper         |chunks_fail   |resolver_code|RxNorm                  |
+--------------------------------+----------------------+--------------+-------------+------------------------+
|[Adapin 10 MG, coumadn 5 mg]    |[1000049, NONE]       |[coumadn 5 mg]|[200883]     |[1000049, 200883]       |
|[Avandia 4 mg, Tegretol, zitiga]|[261242, 203029, NONE]|[zitiga]      |[220989]     |[261242, 203029, 220989]|
+--------------------------------+----------------------+--------------+-------------+------------------------+
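
The `RxNorm` column in the table above can be reproduced with a short pure-Python sketch. The `merge_codes` helper is hypothetical and is not how `ResolverMerger` is implemented; it assumes that the resolver codes arrive in the same order as the chunks the mapper failed on, which is consistent with the two rows shown.

```python
def merge_codes(mapper_results, resolver_results, missing="NONE"):
    """Fill each unmapped slot with the next resolver code, in order."""
    resolved = iter(resolver_results)
    return [next(resolved, missing) if code == missing else code
            for code in mapper_results]

# Values copied from the RxNorm_Mapper and resolver_code columns above
print(merge_codes(["1000049", "NONE"], ["200883"]))
print(merge_codes(["261242", "203029", "NONE"], ["220989"]))
```

If the resolver returns fewer codes than there are unmapped chunks, this sketch leaves the remaining slots as `NONE`.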



# 5- Section Header Normalizer Mapper with ChunkSentenceSplitter

The `ChunkSentenceSplitter()` annotator splits documents or sentences by the chunks provided. <br/> For detailed usage of this annotator, visit [this notebook](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/18.Chunk_Sentence_Splitter.ipynb) <br/>

In this section, we will take the following steps:
- Detect "section headers" in the given text with an NER model
- Split the given text by headers with `ChunkSentenceSplitter()`
- Normalize the `ChunkSentenceSplitter()` outputs with the `normalized_section_header_mapper` model.

Let's start by creating an NER pipeline to detect "Header" entities.

In [None]:
sentences = [
    ["""ADMISSION DIAGNOSIS Right pleural effusion and suspected malignant mesothelioma.
        PRINCIPAL DIAGNOSIS Right pleural effusion, suspected malignant mesothelioma.
        REVIEW OF SYSTEMS Right pleural effusion, firm nodules, diffuse scattered throughout the right pleura and diaphragmatic surface.
    """]]

df= spark.createDataFrame(sentences).toDF("text")

In [None]:
documentAssembler = DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("document")

tokenizer= Tokenizer()\
      .setInputCols(["document"])\
      .setOutputCol("token")

tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_jsl_slim", "en", "clinical/models")\
      .setInputCols("token", "document")\
      .setOutputCol("ner")\
      .setCaseSensitive(True)

ner_converter = NerConverterInternal() \
      .setInputCols(["document", "token", "ner"]) \
      .setOutputCol("ner_chunk")\
      .setWhiteList(["Header"])

pipeline = Pipeline(
    stages = [
        documentAssembler,
        tokenizer,
        tokenClassifier,
        ner_converter
    ])
 
empty_df = spark.createDataFrame([[""]]).toDF('text')
pipeline_model = pipeline.fit(empty_df)

bert_token_classifier_ner_jsl_slim download started this may take some time.
[OK!]


In [None]:
result = pipeline_model.transform(df)
result.selectExpr('explode(ner_chunk)').show(truncate=False)

+-------------------------------------------------------------------------------------------------------------------------------------------+
|col                                                                                                                                        |
+-------------------------------------------------------------------------------------------------------------------------------------------+
|{chunk, 0, 18, ADMISSION DIAGNOSIS, {chunk -> 0, confidence -> 0.9994346, ner_source -> ner_chunk, entity -> Header, sentence -> 0}, []}   |
|{chunk, 89, 107, PRINCIPAL DIAGNOSIS, {chunk -> 1, confidence -> 0.99020165, ner_source -> ner_chunk, entity -> Header, sentence -> 0}, []}|
|{chunk, 175, 191, REVIEW OF SYSTEMS, {chunk -> 2, confidence -> 0.9989373, ner_source -> ner_chunk, entity -> Header, sentence -> 0}, []}  |
+-------------------------------------------------------------------------------------------------------------------------------------------+



Now, we have our header entities. We will split the text by the headers.

In [None]:
#applying ChunkSentenceSplitter 
chunkSentenceSplitter = ChunkSentenceSplitter()\
    .setInputCols("document","ner_chunk")\
    .setOutputCol("paragraphs")\
    .setGroupBySentences(False)

paragraphs = chunkSentenceSplitter.transform(result)
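
For intuition, splitting a text at chunk offsets can be sketched in plain Python. The `split_by_headers` helper and the shortened sample text are hypothetical, and the sketch makes the simplifying assumption that each paragraph runs from the start of one header to the start of the next; it does not claim to match every behavior of `ChunkSentenceSplitter()`.

```python
def split_by_headers(text, headers):
    """Split text into (header, paragraph) pairs.

    headers is a list of (begin_offset, header_text) sorted by offset.
    """
    begins = [begin for begin, _ in headers]
    ends = begins[1:] + [len(text)]
    return [(name, text[begin:end])
            for (begin, name), end in zip(headers, ends)]

# Shortened, made-up sample text with two section headers
text = ("ADMISSION DIAGNOSIS Right pleural effusion.\n"
        "REVIEW OF SYSTEMS Firm nodules.\n")
names = ["ADMISSION DIAGNOSIS", "REVIEW OF SYSTEMS"]
headers = [(text.index(name), name) for name in names]

for name, paragraph in split_by_headers(text, headers):
    print(repr(name), "->", repr(paragraph))
```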

In [None]:
paragraphs.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|               token|                 ner|           ner_chunk|          paragraphs|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|ADMISSION DIAGNOS...|[{document, 0, 30...|[{token, 0, 8, AD...|[{named_entity, 0...|[{chunk, 0, 18, A...|[{document, 0, 89...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+



In [None]:
pd.set_option('display.max_colwidth', None)
result_df = paragraphs.selectExpr("explode(paragraphs) as result").selectExpr("result.result","result.metadata.entity", "result.metadata.splitter_chunk").toPandas()
result_df

| |result|entity|splitter_chunk|
|-----:|:-----|:-----|:-----|
| 0| ADMISSION DIAGNOSIS Right pleural effusion and suspected malignant mesothelioma.\n | Header | ADMISSION DIAGNOSIS |
| 1| PRINCIPAL DIAGNOSIS Right pleural effusion, suspected malignant mesothelioma.\n | Header | PRINCIPAL DIAGNOSIS |
| 2| REVIEW OF SYSTEMS Right pleural effusion, firm nodules, diffuse scattered throughout the right pleura and diaphragmatic surface.\n | Header | REVIEW OF SYSTEMS |


As you can see, we have our split text and **section headers**. <br/>
Now we will normalize these section headers with `normalized_section_header_mapper`.

In [None]:
chunkerMapper = ChunkMapperModel.pretrained("normalized_section_header_mapper", "en", "clinical/models") \
       .setInputCols("ner_chunk")\
       .setOutputCol("mappings")\
       .setRels(["level_1"]) #or level_2

normalized_df= chunkerMapper.transform(paragraphs)

normalized_section_header_mapper download started this may take some time.
[OK!]


In [None]:
normalized_df.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|               token|                 ner|           ner_chunk|          paragraphs|            mappings|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|ADMISSION DIAGNOS...|[{document, 0, 30...|[{token, 0, 8, AD...|[{named_entity, 0...|[{chunk, 0, 18, A...|[{document, 0, 89...|[{labeled_depende...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+



In [None]:
normalized_df= normalized_df.select(F.explode(F.arrays_zip(normalized_df.ner_chunk.result, 
                                                           normalized_df.mappings.result)).alias("col"))\
                            .select(F.expr("col['0']").alias("ner_chunk"),
                                    F.expr("col['1']").alias("normalized_headers")).toPandas()
normalized_df.head()

| |ner_chunk|normalized_headers|
|-----:|:-----|:-----|
| 0| ADMISSION DIAGNOSIS | DIAGNOSIS |
| 1| PRINCIPAL DIAGNOSIS | DIAGNOSIS |
| 2| REVIEW OF SYSTEMS | REVIEW TYPE |


Now we have our normalized headers. We will merge them with the `ChunkSentenceSplitter()` output.

In [None]:
normalized_df= normalized_df.rename(columns={"ner_chunk": "splitter_chunk"})
df= pd.merge(result_df, normalized_df, on=["splitter_chunk"])

In [None]:
df

| |result|entity|splitter_chunk|normalized_headers|
|-----:|:-----|:-----|:-----|:-----|
| 0| ADMISSION DIAGNOSIS Right pleural effusion and suspected malignant mesothelioma.\n | Header | ADMISSION DIAGNOSIS | DIAGNOSIS |
| 1| PRINCIPAL DIAGNOSIS Right pleural effusion, suspected malignant mesothelioma.\n | Header | PRINCIPAL DIAGNOSIS | DIAGNOSIS |
| 2| REVIEW OF SYSTEMS Right pleural effusion, firm nodules, diffuse scattered throughout the right pleura and diaphragmatic surface.\n | Header | REVIEW OF SYSTEMS | REVIEW TYPE |


Ultimately, we have the split paragraphs, their headers, and the normalized headers.

# 6- Pretrained Mapper Pipelines

We will show an example of the `rxnorm_umls_mapping` pipeline here. For more examples of pretrained mapper pipelines, check the [Healthcare Code Mapping Notebook](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/11.1.Healthcare_Code_Mapping.ipynb).

In [None]:
from sparknlp.pretrained import PretrainedPipeline

rxnorm_umls_pipeline= PretrainedPipeline("rxnorm_umls_mapping", "en", "clinical/models")

rxnorm_umls_mapping download started this may take some time.
Approx size to download 1.8 MB
[OK!]


In [None]:
rxnorm_umls_pipeline.annotate("1161611 315677 343663")

{'document': ['1161611 315677 343663'],
 'rxnorm_code': ['1161611', '315677', '343663'],
 'umls_code': ['C3215948', 'C0984912', 'C1146501']}

|**RxNorm Code** | **RxNorm Details** | **UMLS Code** | **UMLS Details** |
| ---------- | -----------:| ---------- | -----------:|
| 1161611 |  metformin Pill | C3215948 | metformin pill |
| 315677 | cimetidine 100 mg | C0984912 | cimetidine 100 mg |
| 343663 | insulin lispro 50 UNT/ML | C1146501 | insulin lispro 50 unt/ml |
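
The dictionary returned by `annotate()` can be turned into a direct code-to-code lookup with a few lines of plain Python. This sketch copies the output shown above into a literal and assumes that the `rxnorm_code` and `umls_code` lists are aligned index by index, as the table suggests; the `rxnorm_to_umls` name is only illustrative.

```python
# Output of rxnorm_umls_pipeline.annotate(...) copied from the cell above
annotation = {
    "document": ["1161611 315677 343663"],
    "rxnorm_code": ["1161611", "315677", "343663"],
    "umls_code": ["C3215948", "C0984912", "C1146501"],
}

# Pair each RxNorm code with the UMLS code at the same position
rxnorm_to_umls = dict(zip(annotation["rxnorm_code"], annotation["umls_code"]))

for rxnorm, umls in rxnorm_to_umls.items():
    print(rxnorm, "->", umls)
```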