![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/7.Clinical_NER_Chunk_Merger.ipynb)

# 7. Clinical NER Chunk Merger

In [None]:
import json

from google.colab import files

license_keys = files.upload()

with open(list(license_keys.keys())[0]) as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)

# Adding license key-value pairs to environment variables
import os
os.environ.update(license_keys)

In [None]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.1.2 spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

In [None]:
import json
import os
from pyspark.ml import Pipeline,PipelineModel
from pyspark.sql import SparkSession

from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from sparknlp.base import *
import sparknlp_jsl
import sparknlp

import pandas as pd

params = {"spark.driver.memory":"16G",
"spark.kryoserializer.buffer.max":"2000M",
"spark.driver.maxResultSize":"2000M"}

spark = sparknlp_jsl.start(license_keys['SECRET'],params=params)

print (sparknlp.version())
print (sparknlp_jsl.version())

3.4.0
3.4.0


In [None]:
# Alternative: build the session manually with the same custom params as the start() call above

def start(secret):
    # Read the library versions from the installed packages
    version = sparknlp.version()
    jsl_version = sparknlp_jsl.version()

    builder = SparkSession.builder \
        .appName("Spark NLP Licensed") \
        .master("local[*]") \
        .config("spark.driver.memory", "16G") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
        .config("spark.kryoserializer.buffer.max", "2000M") \
        .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:"+version) \
        .config("spark.jars", "https://pypi.johnsnowlabs.com/"+secret+"/spark-nlp-jsl-"+jsl_version+".jar")

    return builder.getOrCreate()


#spark = start(license_keys['SECRET'])

In [None]:
spark

In [None]:
# Sample data
data_chunk_merge = spark.createDataFrame([
  (1,"""A 63 years old man presents to the hospital with a history of recurrent infections that include cellulitis, pneumonias, and upper respiratory tract infections. He reports subjective fevers at home along with unintentional weight loss and occasional night sweats. The patient has a remote history of arthritis, which was diagnosed approximately 20 years ago and treated intermittently with methotrexate (MTX) and prednisone. On physical exam, he is found to be febrile at 102°F, rather cachectic, pale, and have hepatosplenomegaly. Several swollen joints that are tender to palpation and have decreased range of motion are also present. His laboratory values show pancytopenia with the most severe deficiency in neutrophils.
""")]).toDF("id","text")

data_chunk_merge.show(truncate=50)

+---+--------------------------------------------------+
| id|                                              text|
+---+--------------------------------------------------+
|  1|A 63 years old man presents to the hospital wit...|
+---+--------------------------------------------------+



In [None]:
# Annotator that transforms a text column from dataframe into an Annotation ready for NLP

documentAssembler = DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

# Sentence Detector annotator, processes various sentences per line
sentenceDetector = SentenceDetector()\
  .setInputCols(["document"])\
  .setOutputCol("sentence")

# Tokenizer splits words in a relevant format for NLP
tokenizer = Tokenizer()\
  .setInputCols(["sentence"])\
  .setOutputCol("token")

# Clinical word embeddings trained on the PubMed dataset
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
  .setInputCols(["sentence", "token"])\
  .setOutputCol("embeddings")

# NER model trained on i2b2 (sampled from MIMIC) dataset
clinical_ner = MedicalNerModel.pretrained("ner_deid_large", "en", "clinical/models") \
  .setInputCols(["sentence", "token", "embeddings"]) \
  .setOutputCol("clinical_ner")

clinical_ner_converter = NerConverter() \
  .setInputCols(["sentence", "token", "clinical_ner"]) \
  .setOutputCol("clinical_ner_chunk")

# internal clinical NER (general terms)
jsl_ner = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models") \
  .setInputCols(["sentence", "token", "embeddings"]) \
  .setOutputCol("jsl_ner")

jsl_ner_converter = NerConverter() \
  .setInputCols(["sentence", "token", "jsl_ner"]) \
  .setOutputCol("jsl_ner_chunk")

# Merge NER chunks; where chunks from the two models overlap, the longer and more informative chunk is kept
chunk_merger = ChunkMergeApproach()\
  .setInputCols('clinical_ner_chunk', "jsl_ner_chunk")\
  .setOutputCol('merged_ner_chunk')

# Merge NER chunks regardless of overlapping indices (every chunk from both models is kept)
# setMergeOverlapping only works with version 2.7 and later
chunk_merger_NonOverlapped = ChunkMergeApproach()\
  .setInputCols('clinical_ner_chunk', "jsl_ner_chunk")\
  .setOutputCol('nonOverlapped_ner_chunk')\
  .setMergeOverlapping(False)


nlpPipeline = Pipeline(stages=[
    documentAssembler, 
    sentenceDetector,
    tokenizer,
    word_embeddings,
    clinical_ner,
    clinical_ner_converter,
    jsl_ner,
    jsl_ner_converter,
    chunk_merger,
    chunk_merger_NonOverlapped])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)


embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_deid_large download started this may take some time.
Approximate size to download 14.1 MB
[OK!]
ner_jsl download started this may take some time.
Approximate size to download 14.5 MB
[OK!]


In [None]:
merged_data = model.transform(data_chunk_merge).cache()

In [None]:
from pyspark.sql import functions as F

result_df = merged_data.select('id',F.explode('merged_ner_chunk').alias("cols")) \
                       .select('id',F.expr("cols.begin").alias("begin"),
                               F.expr("cols.end").alias("end"),
                               F.expr("cols.result").alias("chunk"),
                               F.expr("cols.metadata.entity").alias("entity"))

result_df.show(50, truncate=100)

+---+-----+---+----------------------------------+-------------------------+
| id|begin|end|                             chunk|                   entity|
+---+-----+---+----------------------------------+-------------------------+
|  1|    2| 13|                      63 years old|                      Age|
|  1|   15| 17|                               man|                   Gender|
|  1|   35| 42|                          hospital|            Clinical_Dept|
|  1|   62| 70|                         recurrent|                 Modifier|
|  1|   72| 81|                        infections|Disease_Syndrome_Disorder|
|  1|   96|105|                        cellulitis|Disease_Syndrome_Disorder|
|  1|  108|117|                        pneumonias|Disease_Syndrome_Disorder|
|  1|  124|157|upper respiratory tract infections|Disease_Syndrome_Disorder|
|  1|  160|161|                                He|                   Gender|
|  1|  171|180|                        subjective|                 Modifier|

## NonOverlapped Chunk

With `setMergeOverlapping(False)`, all the entities from each NER model are returned one by one, even when their indices overlap.

In [None]:
from pyspark.sql import functions as F

result_df2 = merged_data.select('id',F.explode('nonOverlapped_ner_chunk').alias("cols")) \
                        .select('id',F.expr("cols.begin").alias("begin"),
                                F.expr("cols.end").alias("end"),
                                F.expr("cols.result").alias("chunk"),
                                F.expr("cols.metadata.entity").alias("entity"))

result_df2.show(50, truncate=100)


+---+-----+---+----------------------------------+-------------------------+
| id|begin|end|                             chunk|                   entity|
+---+-----+---+----------------------------------+-------------------------+
|  1|    2| 13|                      63 years old|                      Age|
|  1|    2|  3|                                63|                      AGE|
|  1|   15| 17|                               man|                   Gender|
|  1|   35| 42|                          hospital|            Clinical_Dept|
|  1|   62| 70|                         recurrent|                 Modifier|
|  1|   72| 81|                        infections|Disease_Syndrome_Disorder|
|  1|   96|105|                        cellulitis|Disease_Syndrome_Disorder|
|  1|  108|117|                        pneumonias|Disease_Syndrome_Disorder|
|  1|  124|157|upper respiratory tract infections|Disease_Syndrome_Disorder|
|  1|  160|161|                                He|                   Gender|
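
The difference between the two tables above can be sketched in plain Python. This is only an illustration of the "keep the longer chunk on overlap" idea described in the pipeline comments, not the library's implementation; the `Chunk` type and the `merge_prefer_longer` helper are hypothetical names introduced here, and the rows are copied from the non-overlapped output.

```python
from typing import List, NamedTuple


class Chunk(NamedTuple):
    begin: int   # inclusive start offset
    end: int     # inclusive end offset
    text: str
    entity: str


def overlaps(a: Chunk, b: Chunk) -> bool:
    # Inclusive ranges overlap when neither one ends before the other starts.
    return a.begin <= b.end and b.begin <= a.end


def merge_prefer_longer(chunks: List[Chunk]) -> List[Chunk]:
    # Visit longer chunks first; sorted() is stable, so ties keep input order.
    kept: List[Chunk] = []
    for cand in sorted(chunks, key=lambda c: c.begin - c.end):
        if not any(overlaps(cand, k) for k in kept):
            kept.append(cand)
    return sorted(kept, key=lambda c: c.begin)


# Rows taken from the non-overlapped output: both models tagged the age.
chunks = [
    Chunk(2, 13, "63 years old", "Age"),
    Chunk(2, 3, "63", "AGE"),
    Chunk(15, 17, "man", "Gender"),
]

merged = merge_prefer_longer(chunks)
print([(c.text, c.entity) for c in merged])
```

In this sketch the shorter "63" chunk is dropped because it overlaps the longer "63 years old" chunk, which mirrors what the merged table shows, while the non-overlapped column keeps both.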

### ChunkMergeApproach with N input columns
ChunkMergeApproach accepts more than two chunk columns as input. The entity types that we don't want in its output can be filtered out with the `setBlackList` parameter.

In [None]:
import json

!mkdir -p data

In [None]:
sample_text = """A 28 year old female with a history of gestational diabetes mellitus diagnosed eight years prior to 
presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis 
three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index 
( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting.
Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection . 
She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG . 
She had been on dapagliflozin for six months at the time of presentation . Physical examination on presentation was 
significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness , guarding , 
or rigidity . Pertinent laboratory findings on admission were : serum glucose 111 mg/dl , bicarbonate 18 mmol/l , 
anion gap 20 , creatinine 0.4 mg/dL , triglycerides 508 mg/dL , total cholesterol 122 mg/dL , glycated hemoglobin 
( HbA1c ) 10% , and venous pH 7.27 . Serum lipase was normal at 43 U/L . Serum acetone levels could not be assessed 
as blood samples kept hemolyzing due to significant lipemia .
The patient was initially admitted for starvation ketosis , as she reported poor oral intake for three days prior 
to admission . However , serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL , 
the anion gap was still elevated at 21 , serum bicarbonate was 16 mmol/L , triglyceride level peaked at 2050 mg/dL , 
and lipase was 52 U/L .
 β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - the original sample was centrifuged 
 and the chylomicron layer removed prior to analysis due to interference from turbidity caused by lipemia again . 
 The patient was treated with an insulin drip for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides 
 to 1400 mg/dL , within 24 hours .
 Twenty days ago.
 Her euDKA was thought to be precipitated by her respiratory tract infection in the setting of SGLT2 inhibitor use . 
 At birth the typical boy is growing slightly faster than the typical girl, but the velocities become equal at about 
 seven months, and then the girl grows faster until four years. 
 From then until adolescence no differences in velocity 
 can be detected. 21-02-2020 
21/04/2020
"""

In [None]:
# Defining ContextualParser rules whose output will feed ChunkMergeApproach

# Defining the rules
date = {
  "entity": "Parser_Date",
  "ruleScope": "sentence",
  "regex": "\\d{1,2}[\\/\\-\\:]{1}(\\d{1,2}[\\/\\-\\:]{1}){0,1}\\d{2,4}",
  "valuesDefinition":[],
  "prefix": [],
  "suffix": [],
  "contextLength": 150,
  "context": []
}


with open('data/date.json', 'w') as f:
    json.dump(date, f)


age = {
  "entity": "Parser_Age",
  "ruleScope": "sentence",
  "matchScope":"token",
  "regex" : "^[1][0-9][0-9]|[1-9][0-9]|[1-9]$",
  "prefix":["age of", "age"],
  "suffix": ["-years-old",
             "years-old",
             "-year-old",
             "-months-old",
             "-month-old",
             "-day-old",
             "-days-old",
             "month old",
             "days old",
             "year old",
             "years old", 
             "years",
             "year", 
             "months", 
             "old"
              ],
  "contextLength": 25,
  "context": [],
  "contextException": ["ago"],
  "exceptionDistance": 10
}

with open("data/age.json", 'w') as f:
  json.dump(age, f)
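
Before plugging the rules into a pipeline, the two regular expressions can be tried out on their own. The sketch below uses Python's `re` module, which is an assumption: the parser runs on the JVM, and the source does not say whether it applies a pattern with search or full-match semantics. It shows that in the age pattern `^` and `$` bind only to the first and last alternatives, so an unanchored search also finds a hit inside a longer number. `grouped_regex` is a hypothetical stricter variant shown for comparison, not part of the rule files.

```python
import re

# Patterns copied from the rule definitions above.
age_regex = "^[1][0-9][0-9]|[1-9][0-9]|[1-9]$"
date_regex = "\\d{1,2}[\\/\\-\\:]{1}(\\d{1,2}[\\/\\-\\:]{1}){0,1}\\d{2,4}"

# Hypothetical variant: the group makes both anchors apply to every alternative.
grouped_regex = "^(?:1[0-9][0-9]|[1-9][0-9]|[1-9])$"

for token in ["28", "150", "2050"]:
    found = re.search(age_regex, token) is not None
    strict = re.fullmatch(grouped_regex, token) is not None
    print(token, "search:", found, "grouped fullmatch:", strict)

# The date pattern on the two date strings at the end of the sample text.
for text in ["can be detected. 21-02-2020", "21/04/2020"]:
    match = re.search(date_regex, text)
    print(text, "->", match.group(0) if match else None)
```

Whether the looser behaviour matters in practice depends on how the parser applies the pattern together with the prefix and suffix checks, so treat this only as a way to inspect the patterns in isolation.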



Next, we use two ContextualParserApproach models and an NER model in the same pipeline and merge their outputs with ChunkMergeApproach.

In [None]:
documentAssembler = DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

# Sentence Detector annotator, processes various sentences per line
sentenceDetector = SentenceDetector()\
  .setInputCols(["document"])\
  .setOutputCol("sentence")

# Tokenizer splits words in a relevant format for NLP
tokenizer = Tokenizer()\
  .setInputCols(["sentence"])\
  .setOutputCol("token")

# Contextual parser for age 
age_contextual_parser = ContextualParserApproach() \
  .setInputCols(["sentence", "token"]) \
  .setOutputCol("entity_age") \
  .setJsonPath("data/age.json") \
  .setCaseSensitive(False) \
  .setContextMatch(False)\
  .setPrefixAndSuffixMatch(False)

chunks_age= ChunkConverter()\
  .setInputCols("entity_age")\
  .setOutputCol("chunk_age")

# Contextual parser for date
date_contextual_parser = ContextualParserApproach() \
  .setInputCols(["sentence", "token"]) \
  .setOutputCol("entity_date") \
  .setJsonPath("data/date.json") \
  .setCaseSensitive(False) \
  .setContextMatch(False)\
  .setPrefixAndSuffixMatch(False)

chunks_date = ChunkConverter().setInputCols("entity_date").setOutputCol("chunk_date")

# Clinical word embeddings 
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
  .setInputCols(["sentence", "token"])\
  .setOutputCol("embeddings")

# Extracting entities by ner_deid_large
ner_model = MedicalNerModel.pretrained("ner_deid_large","en","clinical/models") \
  .setInputCols("sentence","token","embeddings") \
  .setOutputCol("ner")

ner_converter= NerConverter()\
  .setInputCols(["sentence", "token", "ner"])\
  .setOutputCol("ner_chunk")\
  .setWhiteList(["DATE", "AGE"])

# ChunkMerger: the contextual parser chunks are listed first, so they take priority
parser_based_merge= ChunkMergeApproach()\
  .setInputCols(["chunk_age", "chunk_date", "ner_chunk"])\
  .setOutputCol("merged_chunks")

# ChunkMerger: the NER chunks are listed first, so they take priority
ner_based_merge= ChunkMergeApproach()\
  .setInputCols(["ner_chunk", "chunk_age", "chunk_date"])\
  .setOutputCol("merged_chunks_2")

# Using black list for limiting the entity types that will be extracted
limited_merge= ChunkMergeApproach()\
  .setInputCols(["ner_chunk", "chunk_age", "chunk_date"])\
  .setOutputCol("merged_chunks_black_list")\
  .setBlackList(["DATE", "Parser_Date"]) # this will block the dates. 

pipeline= Pipeline(stages=[
                           documentAssembler,
                           sentenceDetector,
                           tokenizer,
                           age_contextual_parser,
                           chunks_age,
                           date_contextual_parser,
                           chunks_date,
                           word_embeddings,
                           ner_model,
                           ner_converter,
                           parser_based_merge,
                           ner_based_merge,
                           limited_merge
])

empty_df= spark.createDataFrame([[""]]).toDF("text")
model= pipeline.fit(empty_df)


lmodel= LightPipeline(model)
lresult= lmodel.fullAnnotate(sample_text)[0]


embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_deid_large download started this may take some time.
Approximate size to download 14.1 MB
[OK!]


In [None]:
lresult.keys()

dict_keys(['chunk_age', 'document', 'ner_chunk', 'token', 'entity_date', 'ner', 'merged_chunks_2', 'entity_age', 'merged_chunks_black_list', 'embeddings', 'chunk_date', 'sentence', 'merged_chunks'])

When chunks from different inputs overlap, ChunkMergeApproach gives priority to the leftmost input column.

In `parser_based_merge`, the contextual parser chunks are listed first, so the "Parser_Age" and "Parser_Date" entities take priority over the "AGE" and "DATE" entities that come from the NER model.

In `ner_based_merge`, the NER model's chunks are listed first, so the "AGE" and "DATE" entities take priority over "Parser_Age" and "Parser_Date".

In `limited_merge`, the "DATE" and "Parser_Date" entity types are excluded.
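
The priority rule can be illustrated with a small stand-alone sketch. It is a simplification under stated assumptions, not the library's algorithm: `merge_by_input_order` is a hypothetical helper, the date offsets are made up, and the black list is applied after merging, which the source does not confirm (for this example, filtering before merging gives the same result).

```python
from typing import Iterable, List, Sequence, Tuple

# (begin, end, text, entity); offsets are inclusive
Span = Tuple[int, int, str, str]


def merge_by_input_order(inputs: Sequence[List[Span]],
                         black_list: Iterable[str] = ()) -> List[Span]:
    blocked = set(black_list)
    kept: List[Span] = []
    for spans in inputs:  # earlier (leftmost) inputs are visited first
        for span in spans:
            clash = any(span[0] <= k[1] and k[0] <= span[1] for k in kept)
            if not clash:
                kept.append(span)
    # Assumption: unwanted entity types are dropped after merging.
    return sorted(s for s in kept if s[3] not in blocked)


chunk_age = [(2, 3, "28", "Parser_Age")]
chunk_date = [(100, 109, "21-02-2020", "Parser_Date")]  # made-up offsets
ner_chunk = [(2, 3, "28", "AGE"), (100, 109, "21-02-2020", "DATE")]

parser_first = merge_by_input_order([chunk_age, chunk_date, ner_chunk])
ner_first = merge_by_input_order([ner_chunk, chunk_age, chunk_date])
limited = merge_by_input_order([ner_chunk, chunk_age, chunk_date],
                               black_list=["DATE", "Parser_Date"])

print([s[3] for s in parser_first])
print([s[3] for s in ner_first])
print([s[3] for s in limited])
```

Swapping the order of the input lists is the only change between `parser_first` and `ner_first`, which is the same lever the pipeline uses through `setInputCols`.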

Let's compare the results of these ChunkMergeApproach models below:

In [None]:
chunk= []
parser_entities= []   # separate names so the annotators defined above are not overwritten
ner_entities= []

# zip pairs the chunks positionally, which assumes both merged columns list the same chunks in the same order
for i, k in zip(lresult["merged_chunks"], lresult["merged_chunks_2"]):
  parser_entities.append(i.metadata["entity"])
  ner_entities.append(k.metadata["entity"])
  chunk.append(i.result)

df= pd.DataFrame({"chunk": chunk, "parser_based_merged_entity": parser_entities, "ner_based_merged_entity": ner_entities})
df.head()

        chunk parser_based_merged_entity ner_based_merged_entity
0          28                 Parser_Age                     AGE
1  21-02-2020                Parser_Date                    DATE
2  21/04/2020                Parser_Date                    DATE


`.setBlackList()` applied results:

In [None]:
chunk= []
limited_merge_entity= []

for i in list(lresult["merged_chunks_black_list"]):
  chunk.append(i.result)
  limited_merge_entity.append(i.metadata["entity"])

df= pd.DataFrame({"chunk": chunk, "limited_entity": limited_merge_entity }) 
df.head()

  chunk limited_entity
0    28            AGE
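
The two cells above repeat the same loop over annotations. A small helper could do the flattening once for any merged column. This is a sketch: `annotations_to_rows` is a hypothetical name, and `FakeAnnotation` is a stand-in so the example runs without a Spark session. It assumes only that each annotation exposes `begin`, `end`, `result` and a `metadata` dict, as the loops above already rely on for `result` and `metadata`.

```python
from collections import namedtuple

# Stand-in for an annotation object; only the attributes used below are modelled.
FakeAnnotation = namedtuple("FakeAnnotation", ["begin", "end", "result", "metadata"])


def annotations_to_rows(annotations):
    """Flatten annotation-like objects into plain dicts, one per chunk."""
    return [
        {
            "begin": a.begin,
            "end": a.end,
            "chunk": a.result,
            "entity": a.metadata.get("entity"),
        }
        for a in annotations
    ]


demo = [FakeAnnotation(2, 3, "28", {"entity": "AGE"})]
rows = annotations_to_rows(demo)
print(rows)
```

In the notebook, the list of dicts could be passed straight to pandas, for example `pd.DataFrame(annotations_to_rows(lresult["merged_chunks_black_list"]))`.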
