![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/26.Chunk_Mapping.ipynb)

# Chunk Mapping

## Colab Setup

In [None]:
import json
import os

from google.colab import files

license_keys = files.upload()

with open(list(license_keys.keys())[0]) as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)

# Adding license key-value pairs to environment variables
os.environ.update(license_keys)

In [3]:
%%capture

# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.1.2 spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

In [None]:
import json
import os
from pyspark.ml import Pipeline, PipelineModel
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from sparknlp.base import *
import sparknlp_jsl
import sparknlp
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

params = {"spark.driver.memory":"16G", 
          "spark.kryoserializer.buffer.max":"2000M", 
          "spark.driver.maxResultSize":"2000M"} 

print ("Spark NLP Version :", sparknlp.version())
print ("Spark NLP_JSL Version :", sparknlp_jsl.version())


spark = sparknlp_jsl.start(license_keys['SECRET'],params=params)

spark

# 1- Pretrained Chunk Mapper Models

| Mapper Model Name                | Relation Values          |
|----------------------------------|--------------------------|
| drug_action_treatment_mapper     | action, treatment        |
| normalized_section_header_mapper | level_1, level_2         |
| drug_brandname_ndc_mapper        | Strength_NDC             |
| rxnorm_ndc_mapper                | Product NDC, Package NDC |
| rxnorm_action_treatment_mapper   | Action, Treatment        |
| abbreviation_mapper              | definition               |
| rxnorm_mapper                    | rxnorm_code              |

## 1.1- Drug Action Treatment Mapper

The pretrained `drug_action_treatment_mapper` model maps drugs to their corresponding `action` and `treatment` through the `ChunkMapperModel()` annotator. <br/>

**Action** refers to the function of a drug in various body systems. <br/>
**Treatment** refers to the disease that the drug is used to treat.

We can choose which relation to return by setting the `setRel()` parameter of `ChunkMapperModel()`.

We will create a pipeline consisting of the `bert_token_classifier_drug_development_trials` NER model, which extracts the NER chunks, and a `ChunkMapperModel()`. <br/>
We will set the `.setRel()` parameter to `action` and look at the results.

In [None]:
#ChunkMapper Pipeline
document_assembler = DocumentAssembler()\
      .setInputCol('text')\
      .setOutputCol('document')

sentence_detector = SentenceDetector()\
      .setInputCols(["document"])\
      .setOutputCol("sentence")

tokenizer = Tokenizer()\
      .setInputCols("sentence")\
      .setOutputCol("token")

ner =  MedicalBertForTokenClassifier.pretrained("bert_token_classifier_drug_development_trials", "en", "clinical/models")\
      .setInputCols("token","sentence")\
      .setOutputCol("ner")

nerconverter = NerConverter()\
      .setInputCols("sentence", "token", "ner")\
      .setOutputCol("ner_chunk")

#drug_action_treatment_mapper with "action" mappings
chunkerMapper = ChunkMapperModel.pretrained("drug_action_treatment_mapper", "en", "clinical/models")\
    .setInputCols(["ner_chunk"])\
    .setOutputCol("action_mappings")\
    .setRel("action")
    

pipeline = Pipeline().setStages([document_assembler,
                                 sentence_detector,
                                 tokenizer,
                                 ner, 
                                 nerconverter, 
                                 chunkerMapper])

text = [
    ["""The patient was female and patient of Dr. X. and she was given Dermovate, Aspagin"""]
]


test_data = spark.createDataFrame(text).toDF("text")

res = pipeline.fit(test_data).transform(test_data)

bert_token_classifier_drug_development_trials download started this may take some time.
[OK!]
drug_action_treatment_mapper download started this may take some time.
[OK!]


Chunks detected by ner model

In [None]:
res.select(F.explode('ner_chunk.result').alias("chunks")).show(truncate=False)

+---------+
|chunks   |
+---------+
|Dermovate|
|Aspagin  |
+---------+



Checking mapping results

In [None]:
res.select("action_mappings.result").show(truncate=False)

+------------------------------+
|result                        |
+------------------------------+
|[Anti-Inflammatory, Analgesic]|
+------------------------------+



In [None]:
res.selectExpr("action_mappings.metadata").show(truncate=False)

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|metadata                                                                                                                                                                                                                                                                                                                       |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{chunk -> 0, relation -> action,

As shown above, the ***metadata*** column lists all the relations for each chunk, when any exist. <br/>


In [None]:
res.select(F.explode(F.arrays_zip("ner_chunk.result", "action_mappings.result", "action_mappings.metadata")).alias("col"))\
    .select(F.expr("col['0']").alias("ner_chunk"),
            F.expr("col['1']").alias("mapping_result"),
            F.expr("col['2']['all_relations']").alias("all_relations")).show(truncate=False)

+---------+-----------------+------------------------------------------------------------+
|ner_chunk|mapping_result   |all_relations                                               |
+---------+-----------------+------------------------------------------------------------+
|Dermovate|Anti-Inflammatory|Corticosteroids::: Dermatological Preparations:::Very Strong|
|Aspagin  |Analgesic        |Anti-Inflammatory:::Antipyretic                             |
+---------+-----------------+------------------------------------------------------------+
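
The `all_relations` values in the table above appear to be joined with a `:::` separator, sometimes with stray spaces around the items. A minimal sketch for turning such a string into a clean Python list follows. The helper name `split_relations` is hypothetical and not part of Spark NLP, and the separator is inferred only from the output shown here.

```python
from typing import List


def split_relations(all_relations: str, sep: str = ":::") -> List[str]:
    """Split an `all_relations` metadata string into a list of trimmed values.

    Assumption: values are joined with ":::" as seen in the output above.
    """
    if not all_relations:
        return []
    # Strip whitespace around each item and drop empty fragments
    return [part.strip() for part in all_relations.split(sep) if part.strip()]


# Example string copied from the table above
dermovate_relations = split_relations(
    "Corticosteroids::: Dermatological Preparations:::Very Strong"
)
print(dermovate_relations)
```

After collecting the mapper output into pandas, the same helper could be applied to a whole column with `df["all_relations"].apply(split_relations)`.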



Now, let's set `.setRel("treatment")` and see the results.

In [None]:
#drug_action_treatment_mapper with "treatment" mappings
chunkerMapper = ChunkMapperModel.pretrained("drug_action_treatment_mapper", "en", "clinical/models")\
    .setInputCols(["ner_chunk"])\
    .setOutputCol("action_mappings")\
    .setRel("treatment")

pipeline = Pipeline().setStages([document_assembler,
                                 sentence_detector,
                                 tokenizer,
                                 ner, 
                                 nerconverter, 
                                 chunkerMapper])

text = [
    ["""The patient was female and patient of Dr. X. and she was given Dermovate, Aspagin"""]
]

test_data = spark.createDataFrame(text).toDF("text")

res = pipeline.fit(test_data).transform(test_data)


drug_action_treatment_mapper download started this may take some time.
Approximate size to download 8.3 MB
[OK!]


In [None]:
res.select(F.explode('ner_chunk.result').alias("chunks")).show(truncate=False)

+---------+
|chunks   |
+---------+
|Dermovate|
|Aspagin  |
+---------+



In [None]:
res.selectExpr("action_mappings.metadata").show(truncate=False)

+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|metadata                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                

Here are the ***treatment*** mappings and all relations under the metadata column. 

In [None]:
res.select(F.explode(F.arrays_zip("ner_chunk.result", "action_mappings.result", "action_mappings.metadata")).alias("col"))\
    .select(F.expr("col['0']").alias("ner_chunk"),
            F.expr("col['1']").alias("mapping_result"),
            F.expr("col['2']['all_relations']").alias("all_relations")).show(truncate=False)

+---------+----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|ner_chunk|mapping_result        |all_relations                                                                                                                                                                                                          |
+---------+----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Dermovate|Lupus                 |Discoid Lupus Erythematosus:::Empeines:::Psoriasis:::Eczema                                                                                                                                                          

## 1.2- Section Header Normalizer Mapper

The `normalized_section_header_mapper` model normalizes the section headers in clinical notes. It returns two levels of normalization, `level_1` and `level_2`. <br/>

**level_1** refers to the most comprehensive "section header" for the corresponding chunk, while **level_2** refers to the second most comprehensive one.

Let's create a pipeline with `normalized_section_header_mapper` and see how it works.

In [None]:
document_assembler = DocumentAssembler()\
       .setInputCol('text')\
       .setOutputCol('document')

sentence_detector = SentenceDetector()\
       .setInputCols(["document"])\
       .setOutputCol("sentence")

tokenizer = Tokenizer()\
      .setInputCols("sentence")\
      .setOutputCol("token")

embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en","clinical/models")\
      .setInputCols(["sentence", "token"])\
      .setOutputCol("word_embeddings")

clinical_ner = MedicalNerModel.pretrained("ner_jsl_slim", "en", "clinical/models")\
      .setInputCols(["sentence","token", "word_embeddings"])\
      .setOutputCol("ner")

ner_converter = NerConverter()\
      .setInputCols(["sentence", "token", "ner"])\
      .setOutputCol("ner_chunk")\
      .setWhiteList(["Header"])

chunkerMapper = ChunkMapperModel.pretrained("normalized_section_header_mapper", "en", "clinical/models") \
       .setInputCols("ner_chunk")\
       .setOutputCol("mappings")\
       .setRel("level_1") #or level_2

pipeline = Pipeline().setStages([document_assembler,
                                sentence_detector,
                                tokenizer, 
                                embeddings,
                                clinical_ner, 
                                ner_converter, 
                                chunkerMapper])

sentences = [
    ["""ADMISSION DIAGNOSIS Right pleural effusion and suspected malignant mesothelioma.
        PRINCIPAL DIAGNOSIS Right pleural effusion, suspected malignant mesothelioma.
        GENERAL REVIEW Right pleural effusion, firm nodules, diffuse scattered throughout the right pleura and diaphragmatic surface.
    """]]

test_data = spark.createDataFrame(sentences).toDF("text")
res = pipeline.fit(test_data).transform(test_data)

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_jsl_slim download started this may take some time.
Approximate size to download 14.4 MB
[OK!]
normalized_section_header_mapper download started this may take some time.
Approximate size to download 13.9 KB
[OK!]


Checking the headers detected by ner model

In [None]:
res.select(F.explode('ner_chunk.result').alias("chunks")).show(truncate=False)

+-------------------+
|chunks             |
+-------------------+
|ADMISSION DIAGNOSIS|
|PRINCIPAL DIAGNOSIS|
|GENERAL REVIEW     |
+-------------------+



Checking mapping results

In [None]:
res.select("mappings.result").show(truncate=False)

+-----------------------------------+
|result                             |
+-----------------------------------+
|[DIAGNOSIS, DIAGNOSIS, REVIEW TYPE]|
+-----------------------------------+



In [None]:
res.select(F.explode(F.arrays_zip("ner_chunk.result", "mappings.result")).alias("col"))\
    .select(F.expr("col['0']").alias("ner_chunk"),
            F.expr("col['1']").alias("mapping_result")).show(truncate=False)

+-------------------+--------------+
|ner_chunk          |mapping_result|
+-------------------+--------------+
|ADMISSION DIAGNOSIS|DIAGNOSIS     |
|PRINCIPAL DIAGNOSIS|DIAGNOSIS     |
|GENERAL REVIEW     |REVIEW TYPE   |
+-------------------+--------------+



As shown above, we get the `level_1` normalized version of each section header.

## 1.3- Drug Brand Name NDC Mapper

The `drug_brandname_ndc_mapper` model maps drug brand names to their corresponding National Drug Codes (NDC). Product NDCs for each strength are returned in the result and metadata. <br/>

It has one relation type, `Strength_NDC`.

Let's create a pipeline with `drug_brandname_ndc_mapper` and see how it works.

In [None]:
document_assembler = DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("chunk")

chunkerMapper = ChunkMapperModel.pretrained("drug_brandname_ndc_mapper", "en", "clinical/models")\
      .setInputCols(["chunk"])\
      .setOutputCol("ndc")\
      .setRel("Strength_NDC") 

pipeline = Pipeline().setStages([document_assembler,
                                 chunkerMapper])  

model = pipeline.fit(spark.createDataFrame([['']]).toDF('text')) 

lp = LightPipeline(model)

res = lp.fullAnnotate(["zytiga", "zyvana", "ZYVOX", "ZYTIGA"])

drug_brandname_ndc_mapper download started this may take some time.
[OK!]


Checking mapping results

In [None]:
chunks = []
mappings = []
all_re= []

for i in range(4):

  for m, n in list(zip(res[i]['chunk'], res[i]["ndc"])):
          
      chunks.append(m.result)
      mappings.append(n.result) 
      all_re.append(n.metadata["all_relations"])
    
import pandas as pd

df = pd.DataFrame({'Brand_Name':chunks, 'Strength_NDC': mappings, 'Other_NDC':all_re})

df.head(20)

Unnamed: 0,Brand_Name,Strength_NDC,Other_NDC
0,zytiga,500 mg/1 | 57894-195,250 mg/1 | 57894-150
1,zyvana,527 mg/1 | 69336-405,
2,ZYVOX,600 mg/300mL | 0009-4992,600 mg/300mL | 66298-7807:::600 mg/300mL | 000...
3,ZYTIGA,500 mg/1 | 57894-195,250 mg/1 | 57894-150


As shown above, we get the corresponding NDC mapping for each brand name.
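
Each mapping in the output above packs a strength and an NDC into one string separated by `|`, and the remaining entries are joined with `:::`. The sketch below splits those strings into structured pairs. `StrengthNdc`, `parse_strength_ndc`, and `parse_all_relations` are hypothetical helper names, and the string layout is assumed only from the rows printed in this notebook.

```python
from typing import List, NamedTuple, Optional


class StrengthNdc(NamedTuple):
    strength: str
    ndc: str


def parse_strength_ndc(value: str) -> Optional[StrengthNdc]:
    """Parse a 'strength | NDC' string such as '500 mg/1 | 57894-195'."""
    if "|" not in value:
        return None
    # Split on the last '|' so the NDC is always the right-hand part
    strength, ndc = (part.strip() for part in value.rsplit("|", 1))
    if not strength or not ndc:
        return None
    return StrengthNdc(strength, ndc)


def parse_all_relations(all_relations: str, sep: str = ":::") -> List[StrengthNdc]:
    """Parse every 'strength | NDC' entry of an `all_relations` string."""
    parsed = (parse_strength_ndc(item) for item in all_relations.split(sep))
    return [pair for pair in parsed if pair is not None]


# Strings copied from the zytiga row above
main_mapping = parse_strength_ndc("500 mg/1 | 57894-195")
other_mappings = parse_all_relations("250 mg/1 | 57894-150")
print(main_mapping, other_mappings)
```

Keeping the strength and the code in separate fields makes it easier to filter by strength or to join the NDC against another table.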

## 1.4- RxNorm NDC Mapper

The `rxnorm_ndc_mapper` model maps RxNorm and RxNorm Extension codes to their corresponding National Drug Codes (NDC).

It has two relation types that can be set with the `setRel()` parameter: **Product NDC** and **Package NDC**.

Let's create a pipeline with the `rxnorm_ndc_mapper` model, setting the relation with `setRel("Product NDC")`, and see the results.

In [None]:
document_assembler = DocumentAssembler()\
      .setInputCol('text')\
      .setOutputCol('ner_chunk')

sbert_embedder = BertSentenceEmbeddings.pretrained('sbiobert_base_cased_mli', 'en','clinical/models')\
      .setInputCols(["ner_chunk"])\
      .setOutputCol("sentence_embeddings")\
      .setCaseSensitive(False)
    
rxnorm_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_rxnorm_augmented","en", "clinical/models") \
      .setInputCols(["ner_chunk", "sentence_embeddings"]) \
      .setOutputCol("rxnorm_code")\
      .setDistanceFunction("EUCLIDEAN")

chunkerMapper_product = ChunkMapperModel.pretrained("rxnorm_ndc_mapper", "en", "clinical/models")\
      .setInputCols(["rxnorm_code"])\
      .setOutputCol("Product NDC")\
      .setRel("Product NDC") #or Package NDC

pipeline = Pipeline().setStages([document_assembler,
                                 sbert_embedder,
                                 rxnorm_resolver,
                                 chunkerMapper_product
                                 ])

model = pipeline.fit(spark.createDataFrame([['']]).toDF('text')) 

lp = LightPipeline(model)

result = lp.fullAnnotate(['doxepin hydrochloride 50 MG/ML', 'macadamia nut 100 MG/ML'])


sbiobert_base_cased_mli download started this may take some time.
Approximate size to download 384.3 MB
[OK!]
sbiobertresolve_rxnorm_augmented download started this may take some time.
[OK!]
rxnorm_ndc_mapper download started this may take some time.
[OK!]


Checking the results

In [None]:
chunks = []
rxnorm_code = []
product= []


for i in range(2):

  for m, n, j in list(zip(result[i]['ner_chunk'], result[i]["rxnorm_code"], result[i]["Product NDC"])):

      chunks.append(m.result)
      rxnorm_code.append(n.result) 
      product.append(j.result)
    
import pandas as pd

df = pd.DataFrame({'ner_chunk':chunks,
                   'rxnorm_code': rxnorm_code,
                   'Product NDC': product})

df.head(20)

Unnamed: 0,ner_chunk,rxnorm_code,Product NDC
0,doxepin hydrochloride 50 MG/ML,1000091,00378-8117
1,macadamia nut 100 MG/ML,212433,00064-2120


As shown above, we get the corresponding "Product NDC" mapping for each RxNorm code.
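
The collection loops in this notebook hard-code the number of inputs (`range(2)`, `range(4)`). A more general way to flatten `fullAnnotate` output into rows is sketched below. `FakeAnnotation` is a hypothetical stand-in that only mimics the `.result` and `.metadata` attributes used in the cells above, so the snippet runs without a Spark session. `annotations_to_rows` is likewise an illustrative helper, not a Spark NLP function.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FakeAnnotation:
    """Hypothetical stand-in for an annotation with `.result` and `.metadata`."""
    result: str
    metadata: Dict[str, str] = field(default_factory=dict)


def annotations_to_rows(results: List[dict], columns: List[str]) -> List[Dict[str, str]]:
    """Flatten a list of per-document annotation dicts into one row per chunk."""
    rows = []
    for doc in results:  # one dict per input text, whatever the input count
        per_column = [doc[col] for col in columns]
        for group in zip(*per_column):  # annotations aligned by position
            rows.append({col: ann.result for col, ann in zip(columns, group)})
    return rows


# Values copied from the result table above
fake_results = [
    {"ner_chunk": [FakeAnnotation("doxepin hydrochloride 50 MG/ML")],
     "rxnorm_code": [FakeAnnotation("1000091")],
     "Product NDC": [FakeAnnotation("00378-8117")]},
    {"ner_chunk": [FakeAnnotation("macadamia nut 100 MG/ML")],
     "rxnorm_code": [FakeAnnotation("212433")],
     "Product NDC": [FakeAnnotation("00064-2120")]},
]

rows = annotations_to_rows(fake_results, ["ner_chunk", "rxnorm_code", "Product NDC"])
print(rows)
```

With a real pipeline, the list returned by `lp.fullAnnotate(...)` would take the place of `fake_results`, and `pd.DataFrame(rows)` would build the same table as above.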

## 1.5- RxNorm Action Treatment Mapper

The `rxnorm_action_treatment_mapper` model maps RxNorm and RxNorm Extension codes to their corresponding action and treatment. It has two relation types that can be set with the `setRel()` parameter: <br/>

**Action** refers to the function of a drug in various body systems. <br/>
**Treatment** refers to the disease that the drug is used to treat.

Let's create a pipeline and see how it works. 

In [None]:
document_assembler = DocumentAssembler()\
      .setInputCol('text')\
      .setOutputCol('ner_chunk')

sbert_embedder = BertSentenceEmbeddings.pretrained('sbiobert_base_cased_mli', 'en','clinical/models')\
      .setInputCols(["ner_chunk"])\
      .setOutputCol("sentence_embeddings")\
      .setCaseSensitive(False)
    
rxnorm_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_rxnorm_augmented","en", "clinical/models") \
      .setInputCols(["ner_chunk", "sentence_embeddings"]) \
      .setOutputCol("rxnorm_code")\
      .setDistanceFunction("EUCLIDEAN")

chunkerMapper_action = ChunkMapperModel.pretrained("rxnorm_action_treatment_mapper", "en", "clinical/models")\
      .setInputCols(["rxnorm_code"])\
      .setOutputCol("Action")\
      .setRel("Action") #or Treatment

pipeline = Pipeline().setStages([document_assembler,
                                 sbert_embedder,
                                 rxnorm_resolver,
                                 chunkerMapper_action
                                 ])

model = pipeline.fit(spark.createDataFrame([['']]).toDF('text')) 

lp = LightPipeline(model)

res = lp.fullAnnotate(['Sinequan 150 MG', 'Zonalon 50 mg'])

sbiobert_base_cased_mli download started this may take some time.
Approximate size to download 384.3 MB
[OK!]
sbiobertresolve_rxnorm_augmented download started this may take some time.
[OK!]
rxnorm_action_treatment_mapper download started this may take some time.
[OK!]


Checking the results

In [None]:
chunks = []
rxnorm_code = []
action= []


for i in range(2):

  for m, n, j in list(zip(res[i]['ner_chunk'], res[i]["rxnorm_code"], res[i]["Action"])):

      chunks.append(m.result)
      rxnorm_code.append(n.result) 
      action.append(j.result)
    
import pandas as pd

df = pd.DataFrame({'ner_chunk':chunks,
                   'rxnorm_code': rxnorm_code,
                   'Action': action})

df.head(20)

Unnamed: 0,ner_chunk,rxnorm_code,Action
0,Sinequan 150 MG,1000067,Antidepressant:::Anxiolytic:::Psychoanaleptics...
1,Zonalon 50 mg,103971,Analgesic:::Analgesic (Opioid):::Analgetic:::O...


As shown above, we get the corresponding "Action" mapping for each RxNorm code.

## 1.6- Abbreviation Mapper

The `abbreviation_mapper` model maps abbreviations and acronyms of medical regulatory activities to their definitions. <br/> It has one relation type, which is set with `setRel("definition")`.

Let's create a pipeline that uses `ner_abbreviation_clinical` to extract abbreviations from the text and feeds them to `abbreviation_mapper`.

In [None]:
document_assembler = DocumentAssembler()\
      .setInputCol('text')\
      .setOutputCol('document')

sentence_detector = SentenceDetector()\
      .setInputCols(["document"])\
      .setOutputCol("sentence")

tokenizer = Tokenizer()\
      .setInputCols("sentence")\
      .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
      .setInputCols(["sentence", "token"])\
      .setOutputCol("embeddings")

#NER model to detect abbreviations in the text
abbr_ner = MedicalNerModel.pretrained('ner_abbreviation_clinical', 'en', 'clinical/models') \
      .setInputCols(["sentence", "token", "embeddings"]) \
      .setOutputCol("abbr_ner")

abbr_converter = NerConverter() \
      .setInputCols(["sentence", "token", "abbr_ner"]) \
      .setOutputCol("abbr_ner_chunk")

chunkerMapper = ChunkMapperModel.pretrained("abbreviation_mapper", "en", "clinical/models")\
      .setInputCols(["abbr_ner_chunk"])\
      .setOutputCol("mappings")\
      .setRel("definition") 

pipeline = Pipeline().setStages([document_assembler,
                                 sentence_detector,
                                 tokenizer, 
                                 word_embeddings,
                                 abbr_ner, 
                                 abbr_converter, 
                                 chunkerMapper])

text = ["""Gravid with estimated fetal weight of 6-6/12 pounds.
           LABORATORY DATA: Laboratory tests include a CBC which is normal. 
           HIV: Negative. One-Hour Glucose: 117. Group B strep has not been done as yet."""]

test_data = spark.createDataFrame([text]).toDF("text")

model = pipeline.fit(test_data)
res= model.transform(test_data)

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_abbreviation_clinical download started this may take some time.
[OK!]
abbreviation_mapper download started this may take some time.
[OK!]


Checking the results

In [None]:
#abbreviations extracted by ner model
res.select("abbr_ner_chunk.result").show()

+----------+
|    result|
+----------+
|[CBC, HIV]|
+----------+



In [None]:
res.select(F.explode(F.arrays_zip("abbr_ner_chunk.result", "mappings.result")).alias("col"))\
    .select(F.expr("col['0']").alias("Abbreviation"),
            F.expr("col['1']").alias("Definition")).show(truncate=False)

+------------+----------------------------+
|Abbreviation|Definition                  |
+------------+----------------------------+
|CBC         |complete blood count        |
|HIV         |human immunodeficiency virus|
+------------+----------------------------+



As shown above, we get the corresponding "definition" mapping for each abbreviation.

# 2- Creating a Mapper Model

`ChunkMapperApproach()` lets you create your own mapper model. <br/>

It receives an `ner_chunk` and a JSON file that maps NER entities to relations, and returns the `ner_chunk` augmented with the relations from the JSON ontology. <br/> We pass the path of the JSON file to the `setDictionary()` parameter.

Let's create an example JSON file, then create a drug mapper model. This model will match the given drug name (only "metformin" in our example) with its corresponding action and treatment.

The JSON file should have the following format:


In [None]:
data_set= {
  "mappings": [
    {
      "key": "metformin",
      "relations": [
        {
          "key": "action",
          "values" : ["hypoglycemic", "Drugs Used In Diabetes"]
        },
        {
          "key": "treatment",
          "values" : ["diabetes", "t2dm"]
        }
      ]
    }
  ]
}

import json
with open('sample_drug.json', 'w', encoding='utf-8') as f:
    json.dump(data_set, f, ensure_ascii=False, indent=4)

With the `setRel()` parameter, we tell the model which type of mapping we want. In our case, we set `setRel("action")` to return the **action** mapping and `setRel("treatment")` to return the **treatment** mapping.
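
Before handing a dictionary file to `setDictionary()`, it can help to check its structure and list the relation names it defines, since those are the values that make sense for `setRel()`. The function below is a minimal sketch that assumes only the layout of the sample dictionary in this section. `validate_mappings` is a hypothetical helper, not part of Spark NLP, and the library may accept or reject files differently.

```python
from typing import Dict, List


def validate_mappings(data: dict) -> Dict[str, List[str]]:
    """Check the dictionary layout and return {entity key: [relation names]}."""
    if not isinstance(data.get("mappings"), list):
        raise ValueError("top-level 'mappings' must be a list")
    summary = {}
    for i, entry in enumerate(data["mappings"]):
        key = entry.get("key")
        if not isinstance(key, str) or not key:
            raise ValueError(f"mappings[{i}] needs a non-empty string 'key'")
        relations = entry.get("relations")
        if not isinstance(relations, list) or not relations:
            raise ValueError(f"mappings[{i}] needs a non-empty 'relations' list")
        names = []
        for rel in relations:
            values = rel.get("values")
            if not isinstance(rel.get("key"), str) or not isinstance(values, list) or not values:
                raise ValueError(f"each relation of '{key}' needs a 'key' and non-empty 'values'")
            names.append(rel["key"])
        summary[key] = names
    return summary


# Same content as the sample dictionary in this section
sample = {
    "mappings": [
        {"key": "metformin",
         "relations": [
             {"key": "action", "values": ["hypoglycemic", "Drugs Used In Diabetes"]},
             {"key": "treatment", "values": ["diabetes", "t2dm"]},
         ]}
    ]
}

summary = validate_mappings(sample)
print(summary)
```

For a file on disk, the same check could be run on the object returned by `json.load`.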

Let's create a pipeline and see it in action. 

In [None]:
document_assembler = DocumentAssembler()\
      .setInputCol('text')\
      .setOutputCol('document')

sentence_detector = SentenceDetector()\
      .setInputCols(["document"])\
      .setOutputCol("sentence")

tokenizer = Tokenizer()\
      .setInputCols("sentence")\
      .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
      .setInputCols(["sentence", "token"])\
      .setOutputCol("embeddings")

#NER model to detect drug in the text
clinical_ner = MedicalNerModel.pretrained("ner_posology_small","en","clinical/models")\
      .setInputCols(["sentence","token","embeddings"])\
      .setOutputCol("ner")\
      .setLabelCasing("upper")
 
ner_converter = NerConverter()\
      .setInputCols(["sentence", "token", "ner"])\
      .setOutputCol("ner_chunk")\
      .setWhiteList(["DRUG"])

chunkerMapper = ChunkMapperApproach()\
      .setInputCols(["ner_chunk"])\
      .setOutputCol("mappings")\
      .setDictionary("/content/sample_drug.json")\
      .setRel("action") #or treatment

pipeline = Pipeline().setStages([document_assembler,
                                 sentence_detector,
                                 tokenizer, 
                                 word_embeddings,
                                 clinical_ner, 
                                 ner_converter, 
                                 chunkerMapper])

text = ["The patient was given 1 unit of metformin daily."]

test_data = spark.createDataFrame([text]).toDF("text")

model = pipeline.fit(test_data)
res= model.transform(test_data)


embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_posology_small download started this may take some time.
[OK!]


In [None]:
res.printSchema()

root
 |-- text: string (nullable = true)
 |-- document: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- sentence: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true

Checking the ner result

In [None]:
res.select(F.explode('ner_chunk.result').alias("chunks")).show(truncate=False)

+---------+
|chunks   |
+---------+
|metformin|
+---------+



Checking the mapper result

In [None]:
res.selectExpr("mappings.metadata").show(truncate=False)

+-------------------------------------------------------------------------------------------------------------------------------------+
|metadata                                                                                                                             |
+-------------------------------------------------------------------------------------------------------------------------------------+
|[{chunk -> 0, relation -> action, confidence -> 0.9994, all_relations -> Drugs Used In Diabetes, entity -> metformin, sentence -> 0}]|
+-------------------------------------------------------------------------------------------------------------------------------------+



In [None]:
res.select(F.explode(F.arrays_zip("ner_chunk.result", "mappings.result", "mappings.metadata")).alias("col"))\
    .select(F.expr("col['0']").alias("ner_chunk"),
            F.expr("col['1']").alias("mapping_result"),
            F.expr("col['2']['all_relations']").alias("all_relations")).show(truncate=False)

+---------+--------------+----------------------+
|ner_chunk|mapping_result|all_relations         |
+---------+--------------+----------------------+
|metformin|hypoglycemic  |Drugs Used In Diabetes|
+---------+--------------+----------------------+



As shown above, the model that we created with `ChunkMapperApproach()` successfully mapped "metformin". Under the metadata, we can see the other values that we defined for this relation in the JSON file.
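
Judging from the output above, the first value of the selected relation becomes the mapping result, and the remaining values end up in `all_relations`. The plain-Python sketch below reproduces only that observed behaviour for exact key matches. It is not the library's implementation, `map_chunk` is a hypothetical name, and it ignores details such as the confidence score.

```python
from typing import Dict, Optional


def map_chunk(dictionary: dict, chunk: str, rel: str, sep: str = ":::") -> Optional[Dict[str, str]]:
    """Look up `chunk` and return the first value of relation `rel` plus the rest."""
    for entry in dictionary["mappings"]:
        if entry["key"] != chunk:  # exact match only (assumption)
            continue
        for relation in entry["relations"]:
            if relation["key"] == rel and relation["values"]:
                first, *rest = relation["values"]
                return {"result": first, "relation": rel, "all_relations": sep.join(rest)}
    return None  # unknown chunk or relation


# Same content as the sample dictionary in this section
sample = {
    "mappings": [
        {"key": "metformin",
         "relations": [
             {"key": "action", "values": ["hypoglycemic", "Drugs Used In Diabetes"]},
             {"key": "treatment", "values": ["diabetes", "t2dm"]},
         ]}
    ]
}

action = map_chunk(sample, "metformin", "action")
treatment = map_chunk(sample, "metformin", "treatment")
print(action)
print(treatment)
```

The `action` lookup mirrors the `hypoglycemic` / `Drugs Used In Diabetes` row shown above. A chunk that is missing from the dictionary returns `None` in this sketch.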

### 2.1- Save the model to disk 

Now, we will save our model and then load it with `ChunkMapperModel()`.

In [None]:
model.stages[-1].write().save("models/drug_mapper")

Now let's use the saved model. This time we will check the 'treatment' mapping results.


In [None]:
document_assembler = DocumentAssembler()\
      .setInputCol('text')\
      .setOutputCol('document')

sentence_detector = SentenceDetector()\
      .setInputCols(["document"])\
      .setOutputCol("sentence")

tokenizer = Tokenizer()\
      .setInputCols("sentence")\
      .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
      .setInputCols(["sentence", "token"])\
      .setOutputCol("embeddings")

#NER model to detect drug in the text
clinical_ner = MedicalNerModel.pretrained("ner_posology_small","en","clinical/models")\
      .setInputCols(["sentence","token","embeddings"])\
      .setOutputCol("ner")\
      .setLabelCasing("upper")
 
ner_converter = NerConverter()\
      .setInputCols(["sentence", "token", "ner"])\
      .setOutputCol("ner_chunk")\
      .setWhiteList(["DRUG"])

chunkerMapper = ChunkMapperModel.load("/content/models/drug_mapper")\
      .setInputCols(["ner_chunk"])\
      .setOutputCol("mappings")\
      .setRel("treatment") 

pipeline = Pipeline().setStages([document_assembler,
                                 sentence_detector,
                                 tokenizer, 
                                 word_embeddings,
                                 clinical_ner, 
                                 ner_converter, 
                                 chunkerMapper])

text = ["The patient was given 1 unit of metformin daily."]

test_data = spark.createDataFrame([text]).toDF("text")

model = pipeline.fit(test_data)
res= model.transform(test_data)

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_posology_small download started this may take some time.
[OK!]


In [None]:
res.selectExpr("mappings.metadata").show(truncate=False)

+----------------------------------------------------------------------------------------------------------------------+
|metadata                                                                                                              |
+----------------------------------------------------------------------------------------------------------------------+
|[{chunk -> 0, relation -> treatment, confidence -> 0.9994, all_relations -> t2dm, entity -> metformin, sentence -> 0}]|
+----------------------------------------------------------------------------------------------------------------------+



In [None]:
res.select(F.explode(F.arrays_zip("ner_chunk.result", "mappings.result", "mappings.metadata")).alias("col"))\
    .select(F.expr("col['0']").alias("ner_chunk"),
            F.expr("col['1']").alias("mapping_result"),
            F.expr("col['2']['all_relations']").alias("all_relations")).show(truncate=False)

+---------+--------------+-------------+
|ner_chunk|mapping_result|all_relations|
+---------+--------------+-------------+
|metformin|diabetes      |t2dm         |
+---------+--------------+-------------+



As you see above, we created our own drug mapper model successfully. 

### 2.2- Create a Model with Lower-Cased or Case-Sensitive Matching

We can control how `ChunkMapperApproach` handles letter case while creating a model by using the `setLowerCase()` parameter.

Let's create a new mapping dictionary and see how it works. 

In [7]:
data_set= {
    "mappings": [
        {
            "key": "Warfarina lusa",
            "relations": [
                {
                    "key": "action",
                    "values": [
                        "Analgesic",
                        "Antipyretic"
                    ]
                },
                {
                    "key": "treatment",
                    "values": [
                        "diabetes",
                        "t2dm"
                    ]
                }
            ]
        }
    ]
}

import json
with open('mappings.json', 'w', encoding='utf-8') as f:
    json.dump(data_set, f, ensure_ascii=False, indent=4)
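
Before passing a hand-written dictionary to `setDictionary()`, it can help to check that it has the shape used in this notebook. The helper below is a minimal sketch, not part of Spark NLP: `validate_mappings` is a hypothetical name, and it only checks the structure seen in the examples here (a top-level `mappings` list whose entries have a `key` and a list of `relations`, each with a `key` and non-empty `values`).

```python
def validate_mappings(data):
    """Return a list of structural problems found in a mapping dictionary."""
    mappings = data.get("mappings")
    if not isinstance(mappings, list) or not mappings:
        return ["top-level 'mappings' must be a non-empty list"]

    problems = []
    for i, entry in enumerate(mappings):
        key = entry.get("key")
        if not isinstance(key, str) or not key.strip():
            problems.append(f"mappings[{i}]: missing or empty 'key'")
        relations = entry.get("relations")
        if not isinstance(relations, list) or not relations:
            problems.append(f"mappings[{i}]: 'relations' must be a non-empty list")
            continue
        for j, rel in enumerate(relations):
            if not isinstance(rel.get("key"), str):
                problems.append(f"mappings[{i}].relations[{j}]: missing 'key'")
            values = rel.get("values")
            if not isinstance(values, list) or not values:
                problems.append(
                    f"mappings[{i}].relations[{j}]: 'values' must be a non-empty list"
                )
    return problems


example = {
    "mappings": [
        {
            "key": "Warfarina lusa",
            "relations": [
                {"key": "action", "values": ["Analgesic", "Antipyretic"]},
                {"key": "treatment", "values": ["diabetes", "t2dm"]},
            ],
        }
    ]
}

# An empty list means no structural problems were found.
print(validate_mappings(example))
```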

In [5]:
sentences = [
        ["""The patient was given Warfarina Lusa and amlodipine 10 MG.The patient was given Aspagin, coumadin 5 mg, coumadin, and he has metamorfin"""]
    ]


test_data = spark.createDataFrame(sentences).toDF("text")

**`setLowerCase(True)`**

In [None]:
chunkerMapper = ChunkMapperApproach() \
        .setInputCols(["ner_chunk"]) \
        .setOutputCol("mappings") \
        .setDictionary("mappings.json") \
        .setRel("action") \
        .setLowerCase(True)

pipeline = Pipeline().setStages([document_assembler,
                                 sentence_detector,
                                 tokenizer, 
                                 word_embeddings,
                                 clinical_ner, 
                                 ner_converter, 
                                 chunkerMapper])


result_df = pipeline.fit(test_data).transform(test_data)
result_df.selectExpr("explode(mappings)").show(truncate=False)

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|col                                                                                                                                                                       |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{labeled_dependency, 22, 35, Analgesic, {chunk -> 0, relation -> action, confidence -> 0.6642, all_relations -> Antipyretic, entity -> Warfarina Lusa, sentence -> 0}, []}|
|{labeled_dependency, 41, 50, NONE, {entity -> amlodipine, sentence -> 0, chunk -> 1, confidence -> 0.9999}, []}                                                           |
|{labeled_dependency, 80, 86, NONE, {entity -> Aspagin, sentence -> 0, chunk -> 2, confidence -> 0.9908}, []}                          

The key is written as "Warfarina lusa" in the source JSON file, but it appears as "Warfarina Lusa" (with an upper-case "L") in our example sentence. Because we trained the model with `setLowerCase(True)`, it mapped the entity even though the casing differs. <br/>

Let's check with `setLowerCase(False)` and see the difference. 

In [None]:
chunkerMapper = ChunkMapperApproach() \
        .setInputCols(["ner_chunk"]) \
        .setOutputCol("mappings") \
        .setDictionary("mappings.json") \
        .setRel("action") \
        .setLowerCase(False)

pipeline = Pipeline().setStages([document_assembler,
                                 sentence_detector,
                                 tokenizer, 
                                 word_embeddings,
                                 clinical_ner, 
                                 ner_converter, 
                                 chunkerMapper])


result_df = pipeline.fit(test_data).transform(test_data)
result_df.selectExpr("explode(mappings)").show(truncate=False)

+-------------------------------------------------------------------------------------------------------------------+
|col                                                                                                                |
+-------------------------------------------------------------------------------------------------------------------+
|{labeled_dependency, 22, 35, NONE, {entity -> Warfarina Lusa, sentence -> 0, chunk -> 0, confidence -> 0.6642}, []}|
|{labeled_dependency, 41, 50, NONE, {entity -> amlodipine, sentence -> 0, chunk -> 1, confidence -> 0.9999}, []}    |
|{labeled_dependency, 80, 86, NONE, {entity -> Aspagin, sentence -> 0, chunk -> 2, confidence -> 0.9908}, []}       |
|{labeled_dependency, 89, 96, NONE, {entity -> coumadin, sentence -> 0, chunk -> 3, confidence -> 0.9997}, []}      |
|{labeled_dependency, 104, 111, NONE, {entity -> coumadin, sentence -> 0, chunk -> 4, confidence -> 0.9994}, []}    |
|{labeled_dependency, 125, 134, NONE, {entity -> metamor

As you see, with `setLowerCase(False)` our model couldn't map "Warfarina Lusa", because its casing differs from the dictionary key "Warfarina lusa".
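
The two outputs above can be reproduced with a small pure-Python lookup. This is only a sketch of the observed behaviour, not the library's implementation: it assumes that lower-casing is applied to both the dictionary keys and the chunk text, and that the first value of a relation becomes the result while the remaining values are reported separately (as `all_relations` does in the metadata). `MAPPINGS` and `map_chunk` are hypothetical names.

```python
# Same content as mappings.json, flattened for the illustration.
MAPPINGS = {
    "Warfarina lusa": {
        "action": ["Analgesic", "Antipyretic"],
        "treatment": ["diabetes", "t2dm"],
    }
}


def map_chunk(chunk, rel, lower_case):
    """Look up `chunk` and return (result, remaining_values) for relation `rel`."""
    if lower_case:
        # Normalize both sides so "Warfarina Lusa" matches "Warfarina lusa".
        table = {key.lower(): rels for key, rels in MAPPINGS.items()}
        chunk = chunk.lower()
    else:
        # Exact, case-sensitive comparison against the dictionary keys.
        table = MAPPINGS
    values = table.get(chunk, {}).get(rel)
    if not values:
        return "NONE", []
    return values[0], values[1:]


print(map_chunk("Warfarina Lusa", "action", lower_case=True))
print(map_chunk("Warfarina Lusa", "action", lower_case=False))
```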

### 2.3- Selecting Multiple Relations 

We can select multiple relations for the same chunk with the `setRels()` parameter.

In [None]:
chunkerMapper = ChunkMapperApproach() \
        .setInputCols(["ner_chunk"]) \
        .setOutputCol("mappings") \
        .setDictionary("mappings.json") \
        .setLowerCase(True) \
        .setRels(["action", "treatment"])

pipeline = Pipeline().setStages([document_assembler,
                                 sentence_detector,
                                 tokenizer, 
                                 word_embeddings,
                                 clinical_ner, 
                                 ner_converter, 
                                 chunkerMapper])


result_df = pipeline.fit(test_data).transform(test_data)
result_df.selectExpr("explode(mappings)").show(truncate=False)

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|col                                                                                                                                                                       |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{labeled_dependency, 22, 35, Analgesic, {chunk -> 0, relation -> action, confidence -> 0.6642, all_relations -> Antipyretic, entity -> Warfarina Lusa, sentence -> 0}, []}|
|{labeled_dependency, 22, 35, diabetes, {chunk -> 0, relation -> treatment, confidence -> 0.6642, all_relations -> t2dm, entity -> Warfarina Lusa, sentence -> 0}, []}     |
|{labeled_dependency, 41, 50, NONE, {entity -> amlodipine, sentence -> 0, chunk -> 1, confidence -> 0.9999}, []}                       

As you see, both relations (action and treatment) are returned for the same chunk at the same time.
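
The shape of this output (one row per requested relation for a mapped chunk, a single `NONE` row for an unmapped one) can be sketched in plain Python. This is an illustration of what the table above shows, assuming lower-cased matching. It is not the annotator's actual code, and `MAPPINGS` and `map_chunks` are hypothetical names.

```python
MAPPINGS = {
    "warfarina lusa": {
        "action": ["Analgesic", "Antipyretic"],
        "treatment": ["diabetes", "t2dm"],
    }
}


def map_chunks(chunks, rels):
    """Return (chunk, result, relation) rows, one per matched relation."""
    rows = []
    for chunk in chunks:
        relations = MAPPINGS.get(chunk.lower(), {})
        matched = [
            (chunk, relations[rel][0], rel) for rel in rels if relations.get(rel)
        ]
        # Chunks without any matching relation still produce one NONE row.
        rows.extend(matched or [(chunk, "NONE", None)])
    return rows


for row in map_chunks(["Warfarina Lusa", "amlodipine"], ["action", "treatment"]):
    print(row)
```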

### 2.4- Filtering Multi-token Chunks

If a chunk consists of multiple tokens separated by whitespace, we can exclude it from mapping by using the `setAllowMultiTokenChunk()` parameter.

In [8]:
chunkerMapper = ChunkMapperApproach() \
        .setInputCols(["ner_chunk"]) \
        .setOutputCol("mappings") \
        .setDictionary("mappings.json") \
        .setLowerCase(True) \
        .setRels(["action", "treatment"]) \
        .setAllowMultiTokenChunk(False)

pipeline = Pipeline().setStages([document_assembler,
                                 sentence_detector,
                                 tokenizer, 
                                 word_embeddings,
                                 clinical_ner, 
                                 ner_converter, 
                                 chunkerMapper])


result_df = pipeline.fit(test_data).transform(test_data)
result_df.selectExpr("explode(mappings)").show(truncate=False)

+-------------------------------------------------------------------------------------------------------------------+
|col                                                                                                                |
+-------------------------------------------------------------------------------------------------------------------+
|{labeled_dependency, 22, 35, NONE, {entity -> Warfarina Lusa, sentence -> 0, chunk -> 0, confidence -> 0.6642}, []}|
|{labeled_dependency, 41, 50, NONE, {entity -> amlodipine, sentence -> 0, chunk -> 1, confidence -> 0.9999}, []}    |
|{labeled_dependency, 80, 86, NONE, {entity -> Aspagin, sentence -> 0, chunk -> 2, confidence -> 0.9908}, []}       |
|{labeled_dependency, 89, 96, NONE, {entity -> coumadin, sentence -> 0, chunk -> 3, confidence -> 0.9997}, []}      |
|{labeled_dependency, 104, 111, NONE, {entity -> coumadin, sentence -> 0, chunk -> 4, confidence -> 0.9994}, []}    |
|{labeled_dependency, 125, 134, NONE, {entity -> metamor

The chunk "Warfarina Lusa" consists of multiple tokens. Therefore, our mapper model skips that entity. <br/>
So, let's set `.setAllowMultiTokenChunk(True)` and see the difference. 

In [None]:
chunkerMapper = ChunkMapperApproach() \
        .setInputCols(["ner_chunk"]) \
        .setOutputCol("mappings") \
        .setDictionary("mappings.json") \
        .setLowerCase(True) \
        .setRels(["action", "treatment"]) \
        .setAllowMultiTokenChunk(True)

pipeline = Pipeline().setStages([document_assembler,
                                 sentence_detector,
                                 tokenizer, 
                                 word_embeddings,
                                 clinical_ner, 
                                 ner_converter, 
                                 chunkerMapper])


result_df = pipeline.fit(test_data).transform(test_data)
result_df.selectExpr("explode(mappings)").show(truncate=False)

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|col                                                                                                                                                                       |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{labeled_dependency, 22, 35, Analgesic, {chunk -> 0, relation -> action, confidence -> 0.6642, all_relations -> Antipyretic, entity -> Warfarina Lusa, sentence -> 0}, []}|
|{labeled_dependency, 22, 35, diabetes, {chunk -> 0, relation -> treatment, confidence -> 0.6642, all_relations -> t2dm, entity -> Warfarina Lusa, sentence -> 0}, []}     |
|{labeled_dependency, 41, 50, NONE, {entity -> amlodipine, sentence -> 0, chunk -> 1, confidence -> 0.9999}, []}                       
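
The effect of `setAllowMultiTokenChunk()` seen in this section can be mimicked with a whitespace check in front of the dictionary lookup. The sketch below is only an approximation under that assumption. `KNOWN_KEYS` and `can_map` are hypothetical names, and the real annotator may tokenize differently.

```python
# Lower-cased dictionary keys, as in the setLowerCase(True) examples.
KNOWN_KEYS = {"warfarina lusa"}


def can_map(chunk, allow_multi_token):
    """Return True if `chunk` would be looked up and found in the dictionary."""
    is_multi_token = len(chunk.split()) > 1
    if is_multi_token and not allow_multi_token:
        # Multi-token chunks are skipped entirely, so they end up as NONE.
        return False
    return chunk.lower() in KNOWN_KEYS


print(can_map("Warfarina Lusa", allow_multi_token=False))
print(can_map("Warfarina Lusa", allow_multi_token=True))
```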

# 3- ChunkMapperFiltererModel

The `ChunkMapperFiltererModel` annotator filters the chunks that were passed through the `ChunkMapperModel`, based on whether they were mapped or not. <br/>

We can filter chunks by setting the `.setReturnCriteria()` parameter. It has two options: <br/>


**success:** Returns the chunks which are mapped by ChunkMapper <br/>

**fail:** Returns the chunks which are not mapped by ChunkMapper <br/>

Let's apply both options and check the results.

In [None]:
chunkerMapper = ChunkMapperApproach() \
        .setInputCols(["ner_chunk"]) \
        .setOutputCol("mappings") \
        .setDictionary("mappings.json") \
        .setLowerCase(True) \
        .setRels(["action", "treatment"])

pipeline = Pipeline().setStages([document_assembler,
                                 sentence_detector,
                                 tokenizer, 
                                 word_embeddings,
                                 clinical_ner, 
                                 ner_converter, 
                                 chunkerMapper])


result_df = pipeline.fit(test_data).transform(test_data)
result_df.selectExpr("explode(mappings)").show(truncate=False)

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|col                                                                                                                                                                       |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{labeled_dependency, 22, 35, Analgesic, {chunk -> 0, relation -> action, confidence -> 0.6642, all_relations -> Antipyretic, entity -> Warfarina Lusa, sentence -> 0}, []}|
|{labeled_dependency, 22, 35, diabetes, {chunk -> 0, relation -> treatment, confidence -> 0.6642, all_relations -> t2dm, entity -> Warfarina Lusa, sentence -> 0}, []}     |
|{labeled_dependency, 41, 50, NONE, {entity -> amlodipine, sentence -> 0, chunk -> 1, confidence -> 0.9999}, []}                       

**`.setReturnCriteria("success")`**

In [None]:
cfModel = ChunkMapperFiltererModel() \
        .setInputCols(["ner_chunk","mappings"]) \
        .setOutputCol("chunks_filtered")\
        .setReturnCriteria("success")

cfModel.transform(result_df).selectExpr("explode(chunks_filtered)").show(truncate=False)

+------------------------------------------------------------------------------------------------------+
|col                                                                                                   |
+------------------------------------------------------------------------------------------------------+
|{chunk, 22, 35, Warfarina Lusa, {entity -> DRUG, sentence -> 0, chunk -> 0, confidence -> 0.6642}, []}|
+------------------------------------------------------------------------------------------------------+



**`.setReturnCriteria("fail")`**

In [None]:
cfModel = ChunkMapperFiltererModel() \
        .setInputCols(["ner_chunk","mappings"]) \
        .setOutputCol("chunks_filtered")\
        .setReturnCriteria("fail")

cfModel.transform(result_df).selectExpr("explode(chunks_filtered)").show(truncate=False)

+----------------------------------------------------------------------------------------------------+
|col                                                                                                 |
+----------------------------------------------------------------------------------------------------+
|{chunk, 41, 50, amlodipine, {entity -> DRUG, sentence -> 0, chunk -> 1, confidence -> 0.9999}, []}  |
|{chunk, 80, 86, Aspagin, {entity -> DRUG, sentence -> 0, chunk -> 2, confidence -> 0.9908}, []}     |
|{chunk, 89, 96, coumadin, {entity -> DRUG, sentence -> 0, chunk -> 3, confidence -> 0.9997}, []}    |
|{chunk, 104, 111, coumadin, {entity -> DRUG, sentence -> 0, chunk -> 4, confidence -> 0.9994}, []}  |
|{chunk, 125, 134, metamorfin, {entity -> DRUG, sentence -> 0, chunk -> 5, confidence -> 0.9995}, []}|
+----------------------------------------------------------------------------------------------------+
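
The two return criteria can be summarized with a short pure-Python sketch. It uses the chunk indices and mapping results shown in the outputs above and assumes that a chunk counts as mapped when at least one of its mapping rows is not `NONE`. `filter_chunks` is a hypothetical helper, not the annotator's implementation.

```python
chunks = ["Warfarina Lusa", "amlodipine", "Aspagin", "coumadin", "coumadin", "metamorfin"]

# (chunk index, mapping result) pairs; chunk 0 has one row per relation.
mappings = [
    (0, "Analgesic"), (0, "diabetes"),
    (1, "NONE"), (2, "NONE"), (3, "NONE"), (4, "NONE"), (5, "NONE"),
]


def filter_chunks(chunks, mappings, criteria):
    """Keep mapped chunks for 'success' and unmapped chunks for 'fail'."""
    if criteria not in ("success", "fail"):
        raise ValueError("criteria must be 'success' or 'fail'")
    mapped = {index for index, result in mappings if result != "NONE"}
    keep_mapped = criteria == "success"
    return [c for i, c in enumerate(chunks) if (i in mapped) == keep_mapped]


print(filter_chunks(chunks, mappings, "success"))
print(filter_chunks(chunks, mappings, "fail"))
```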



# 4- Section Header Normalizer Mapper with ChunkSentenceSplitter

The `ChunkSentenceSplitter()` annotator splits documents or sentences by the chunks provided. <br/> For detailed usage of this annotator, visit [this notebook](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/18.Chunk_Sentence_Splitter.ipynb). <br/>

In this section, we will perform the following steps:
- Detect "section headers" in the given text with an NER model
- Split the given text by headers with `ChunkSentenceSplitter()`
- Normalize the `ChunkSentenceSplitter()` outputs with the `normalized_section_header_mapper` model.
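
The splitting step in the list above can be pictured as cutting the text at each header's start offset. The sketch below illustrates that idea in plain Python with a shortened, made-up text. It assumes inclusive end offsets like those in Spark NLP annotations, drops any text before the first header, and is not how `ChunkSentenceSplitter()` is actually implemented. `split_by_headers` is a hypothetical helper.

```python
def split_by_headers(text, header_spans):
    """Split `text` into paragraphs, each starting at a header's begin offset."""
    starts = sorted(begin for begin, _end in header_spans)
    paragraphs = []
    for i, begin in enumerate(starts):
        # A paragraph runs until the next header starts, or to the end of the text.
        end = starts[i + 1] if i + 1 < len(starts) else len(text)
        paragraphs.append(text[begin:end])
    return paragraphs


text = (
    "ADMISSION DIAGNOSIS Right pleural effusion.\n"
    "PRINCIPAL DIAGNOSIS Suspected malignant mesothelioma.\n"
)
headers = ["ADMISSION DIAGNOSIS", "PRINCIPAL DIAGNOSIS"]

# (begin, end) offsets with an inclusive end, located in the sample text.
spans = [(text.find(h), text.find(h) + len(h) - 1) for h in headers]

for paragraph in split_by_headers(text, spans):
    print(repr(paragraph))
```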

Let's start by creating an NER pipeline to detect "Header" entities.

In [None]:
sentences = [
    ["""ADMISSION DIAGNOSIS Right pleural effusion and suspected malignant mesothelioma.
        PRINCIPAL DIAGNOSIS Right pleural effusion, suspected malignant mesothelioma.
        REVIEW OF SYSTEMS Right pleural effusion, firm nodules, diffuse scattered throughout the right pleura and diaphragmatic surface.
    """]]

df= spark.createDataFrame(sentences).toDF("text")

In [None]:
documentAssembler = DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("document")

tokenizer= Tokenizer()\
      .setInputCols(["document"])\
      .setOutputCol("token")

tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_jsl_slim", "en", "clinical/models")\
    .setInputCols("token", "document")\
    .setOutputCol("ner")\
    .setCaseSensitive(True)

ner_converter = NerConverter() \
      .setInputCols(["document", "token", "ner"]) \
      .setOutputCol("ner_chunk")\
      .setWhiteList(["Header"])

pipeline = Pipeline(
    stages = [
        documentAssembler,
        tokenizer,
        tokenClassifier,
        ner_converter
    ])
 
empty_df = spark.createDataFrame([[""]]).toDF('text')
pipeline_model = pipeline.fit(empty_df)

bert_token_classifier_ner_jsl_slim download started this may take some time.
Approximate size to download 385.7 MB
[OK!]


In [None]:
result = pipeline_model.transform(df)
result.selectExpr('explode(ner_chunk)').show(truncate=False)

+------------------------------------------------------------------------------------------------------------------+
|col                                                                                                               |
+------------------------------------------------------------------------------------------------------------------+
|{chunk, 0, 18, ADMISSION DIAGNOSIS, {entity -> Header, sentence -> 0, chunk -> 0, confidence -> 0.9994346}, []}   |
|{chunk, 89, 107, PRINCIPAL DIAGNOSIS, {entity -> Header, sentence -> 0, chunk -> 1, confidence -> 0.99020165}, []}|
|{chunk, 175, 191, REVIEW OF SYSTEMS, {entity -> Header, sentence -> 0, chunk -> 2, confidence -> 0.9989373}, []}  |
+------------------------------------------------------------------------------------------------------------------+



Now, we have our header entities. We will split the text by the headers.

In [None]:
#applying ChunkSentenceSplitter 
chunkSentenceSplitter = ChunkSentenceSplitter()\
    .setInputCols("document","ner_chunk")\
    .setOutputCol("paragraphs")\
    .setGroupBySentences(False)

paragraphs = chunkSentenceSplitter.transform(result)

In [None]:
paragraphs.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|               token|                 ner|           ner_chunk|          paragraphs|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|ADMISSION DIAGNOS...|[{document, 0, 30...|[{token, 0, 8, AD...|[{named_entity, 0...|[{chunk, 0, 18, A...|[{document, 0, 89...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+



In [None]:
pd.set_option('display.max_colwidth', None)
result_df = paragraphs.selectExpr("explode(paragraphs) as result").selectExpr("result.result","result.metadata.entity", "result.metadata.splitter_chunk").toPandas()
result_df.head()

|   | result | entity | splitter_chunk |
|---|--------|--------|----------------|
| 0 | ADMISSION DIAGNOSIS Right pleural effusion and suspected malignant mesothelioma.\n | Header | ADMISSION DIAGNOSIS |
| 1 | PRINCIPAL DIAGNOSIS Right pleural effusion, suspected malignant mesothelioma.\n | Header | PRINCIPAL DIAGNOSIS |
| 2 | REVIEW OF SYSTEMS Right pleural effusion, firm nodules, diffuse scattered throughout the right pleura and diaphragmatic surface.\n | Header | REVIEW OF SYSTEMS |


As you see, we have our split text and **section headers**. <br/>
Now we will normalize these section headers with `normalized_section_header_mapper`.

In [None]:
chunkerMapper = ChunkMapperModel.pretrained("normalized_section_header_mapper", "en", "clinical/models") \
       .setInputCols("ner_chunk")\
       .setOutputCol("mappings")\
       .setRel("level_1") #or level_2

normalized_df= chunkerMapper.transform(paragraphs)

normalized_section_header_mapper download started this may take some time.
Approximate size to download 13.9 KB
[OK!]


In [None]:
normalized_df.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|               token|                 ner|           ner_chunk|          paragraphs|            mappings|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|ADMISSION DIAGNOS...|[{document, 0, 30...|[{token, 0, 8, AD...|[{named_entity, 0...|[{chunk, 0, 18, A...|[{document, 0, 89...|[{labeled_depende...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+



In [None]:
normalized_df= normalized_df.select(F.explode(F.arrays_zip("ner_chunk.result", "mappings.result")).alias("col"))\
                            .select(F.expr("col['0']").alias("ner_chunk"),
                                    F.expr("col['1']").alias("normalized_headers")).toPandas()
normalized_df.head()

|   | ner_chunk | normalized_headers |
|---|-----------|--------------------|
| 0 | ADMISSION DIAGNOSIS | DIAGNOSIS |
| 1 | PRINCIPAL DIAGNOSIS | DIAGNOSIS |
| 2 | REVIEW OF SYSTEMS | REVIEW TYPE |


Now, we have our normalized headers. We will merge them with the `ChunkSentenceSplitter()` output.

In [None]:
normalized_df= normalized_df.rename(columns={"ner_chunk": "splitter_chunk"})
df= pd.merge(result_df, normalized_df, on=["splitter_chunk"])

In [None]:
df.head()

|   | result | entity | splitter_chunk | normalized_headers |
|---|--------|--------|----------------|--------------------|
| 0 | ADMISSION DIAGNOSIS Right pleural effusion and suspected malignant mesothelioma.\n | Header | ADMISSION DIAGNOSIS | DIAGNOSIS |
| 1 | PRINCIPAL DIAGNOSIS Right pleural effusion, suspected malignant mesothelioma.\n | Header | PRINCIPAL DIAGNOSIS | DIAGNOSIS |
| 2 | REVIEW OF SYSTEMS Right pleural effusion, firm nodules, diffuse scattered throughout the right pleura and diaphragmatic surface.\n | Header | REVIEW OF SYSTEMS | REVIEW TYPE |


Ultimately, we have the split paragraphs, their headers, and the normalized headers in a single table.
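
One caveat about the merge: joining on `splitter_chunk` matches rows by header text, so a header that occurs more than once in a note would be matched against every row with the same text. The sketch below shows the same inner join in plain Python with each header text collapsed to a single normalized label first. The row dictionaries are made-up stand-ins for the two pandas DataFrames, and `attach_normalized` is a hypothetical helper.

```python
def attach_normalized(paragraph_rows, normalized_rows):
    """Inner-join paragraphs with normalized headers on 'splitter_chunk'."""
    lookup = {}
    for row in normalized_rows:
        # Keep the first label seen for each header text (de-duplication).
        lookup.setdefault(row["splitter_chunk"], row["normalized_headers"])
    return [
        dict(row, normalized_headers=lookup[row["splitter_chunk"]])
        for row in paragraph_rows
        if row["splitter_chunk"] in lookup
    ]


paragraph_rows = [
    {"result": "ADMISSION DIAGNOSIS First finding.", "splitter_chunk": "ADMISSION DIAGNOSIS"},
    {"result": "ADMISSION DIAGNOSIS Second finding.", "splitter_chunk": "ADMISSION DIAGNOSIS"},
]
normalized_rows = [
    {"splitter_chunk": "ADMISSION DIAGNOSIS", "normalized_headers": "DIAGNOSIS"},
    {"splitter_chunk": "ADMISSION DIAGNOSIS", "normalized_headers": "DIAGNOSIS"},
]

merged = attach_normalized(paragraph_rows, normalized_rows)
# Two paragraphs in, two rows out, even though the header text repeats.
print(len(merged))
```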