![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/10.0.Clinical_NER_Chunk_Merger.ipynb)

# Clinical NER Chunk Merger

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp, medical

# After uploading your license, run this to install all licensed Python wheels and pre-download the JARs for the Spark session JVM
nlp.settings.enforce_versions=True
nlp.install(refresh_install=True)

In [None]:
from johnsnowlabs import nlp, medical
import pandas as pd
import json
import warnings
warnings.filterwarnings('ignore')

# Automatically load license data and start a session with all jars user has access to
spark = nlp.start()

In [5]:
spark

## Healthcare NLP for Data Scientists Course

If you are not familiar with the components in this notebook, see the [Healthcare NLP for Data Scientists Udemy Course](https://www.udemy.com/course/healthcare-nlp-for-data-scientists/) and the [MOOC Notebooks](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/Spark_NLP_Udemy_MOOC/Healthcare_NLP) for each component.

## Overlapped Chunk

In [6]:
# Sample data
data_chunk_merge = spark.createDataFrame([
  (1,"""A 63 years old man presents to the hospital with a history of recurrent infections that include cellulitis, pneumonias, and upper respiratory tract infections. He reports subjective fevers at home along with unintentional weight loss and occasional night sweats. The patient has a remote history of arthritis, which was diagnosed approximately 20 years ago and treated intermittently with methotrexate (MTX) and prednisone. On physical exam, he is found to be febrile at 102°F, rather cachectic, pale, and have hepatosplenomegaly. Several swollen joints that are tender to palpation and have decreased range of motion are also present. His laboratory values show pancytopenia with the most severe deficiency in neutrophils.""")
]).toDF("id","text")

data_chunk_merge.show(truncate=150)

+---+------------------------------------------------------------------------------------------------------------------------------------------------------+
| id|                                                                                                                                                  text|
+---+------------------------------------------------------------------------------------------------------------------------------------------------------+
|  1|A 63 years old man presents to the hospital with a history of recurrent infections that include cellulitis, pneumonias, and upper respiratory tract...|
+---+------------------------------------------------------------------------------------------------------------------------------------------------------+



In [7]:
# Annotator that transforms a text column from dataframe into an Annotation ready for NLP

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Sentence Detector annotator, processes various sentences per line
sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

# Tokenizer splits words in a relevant format for NLP
tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Clinical word embeddings trained on the PubMed dataset
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

# NER model trained on i2b2 (sampled from MIMIC) dataset
clinical_ner = medical.NerModel.pretrained("ner_deid_large", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("clinical_ner")

clinical_ner_converter = medical.NerConverterInternal() \
    .setInputCols(["sentence", "token", "clinical_ner"]) \
    .setOutputCol("clinical_ner_chunk")

# internal clinical NER (general terms)
jsl_ner = medical.NerModel.pretrained("ner_jsl", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("jsl_ner")

jsl_ner_converter = medical.NerConverterInternal() \
    .setInputCols(["sentence", "token", "jsl_ner"]) \
    .setOutputCol("jsl_ner_chunk")

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_deid_large download started this may take some time.
Approximate size to download 14.1 MB
[OK!]
ner_jsl download started this may take some time.
Approximate size to download 14.5 MB
[OK!]


**Merging overlapped chunks by considering their length** <br/>
If we set `setOrderingFeatures(["ChunkLength"])` and `setSelectionStrategy("DiverseLonger")`, the longest chunk is prioritized when chunks overlap.
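
As a rough illustration of the length-based rule, the pure-Python sketch below resolves overlaps among plain tuples by keeping the longest span first. It does not call Spark NLP: the `Chunk` type and the `keep_longest` helper are hypothetical names, and the library's actual tie-breaking may differ.

```python
from typing import NamedTuple


class Chunk(NamedTuple):
    begin: int
    end: int      # inclusive end index, as in the annotation output
    text: str
    entity: str


def overlaps(a: Chunk, b: Chunk) -> bool:
    """True when the two inclusive character ranges share at least one index."""
    return a.begin <= b.end and b.begin <= a.end


def keep_longest(chunks):
    """Greedy sketch: visit chunks from longest to shortest and keep a chunk
    only if it does not overlap one that was already kept."""
    kept = []
    for c in sorted(chunks, key=lambda c: c.end - c.begin, reverse=True):
        if not any(overlaps(c, k) for k in kept):
            kept.append(c)
    return sorted(kept, key=lambda c: c.begin)


chunks = [
    Chunk(2, 3, "63", "AGE"),             # from clinical_ner_chunk
    Chunk(2, 13, "63 years old", "Age"),  # from jsl_ner_chunk
    Chunk(15, 17, "man", "Gender"),
]
merged = keep_longest(chunks)
print([c.text for c in merged])  # ['63 years old', 'man']
```

Here the longer `63 years old` span displaces the shorter `63` span, while the non-overlapping `man` chunk is untouched.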


In [8]:
chunk_merger = medical.ChunkMergeApproach()\
    .setInputCols('clinical_ner_chunk', "jsl_ner_chunk")\
    .setOutputCol('merged_ner_chunk')\
    .setOrderingFeatures(["ChunkLength"])\
    .setSelectionStrategy("DiverseLonger")\
    .setCaseSensitive(False)

nlpPipeline = nlp.Pipeline(
    stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        word_embeddings,
        clinical_ner,
        clinical_ner_converter,
        jsl_ner,
        jsl_ner_converter,
        chunk_merger
])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

In [9]:
merged_data = model.transform(data_chunk_merge).cache()

In [10]:
merged_data.select("clinical_ner_chunk").show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------------+
|clinical_ner_chunk                                                                                                         |
+---------------------------------------------------------------------------------------------------------------------------+
|[{chunk, 2, 3, 63, {chunk -> 0, confidence -> 0.9997, ner_source -> clinical_ner_chunk, entity -> AGE, sentence -> 0}, []}]|
+---------------------------------------------------------------------------------------------------------------------------+



In [11]:
merged_data.select("jsl_ner_chunk").show(truncate=False)

[`jsl_ner_chunk` output truncated]

In [12]:
from pyspark.sql import functions as F

result_df = merged_data.select('id',F.explode('merged_ner_chunk').alias("cols")) \
                       .select('id',F.expr("cols.begin").alias("begin"),
                               F.expr("cols.end").alias("end"),
                               F.expr("cols.result").alias("chunk"),
                               F.expr("cols.metadata.entity").alias("entity"))

result_df.show(50, truncate=100)

+---+-----+---+----------------------------------+-------------------------+
| id|begin|end|                             chunk|                   entity|
+---+-----+---+----------------------------------+-------------------------+
|  1|    2| 13|                      63 years old|                      Age|
|  1|   15| 17|                               man|                   Gender|
|  1|   35| 42|                          hospital|            Clinical_Dept|
|  1|   62| 70|                         recurrent|                 Modifier|
|  1|   72| 81|                        infections|Disease_Syndrome_Disorder|
|  1|   96|105|                        cellulitis|Disease_Syndrome_Disorder|
|  1|  108|117|                        pneumonias|Disease_Syndrome_Disorder|
|  1|  124|157|upper respiratory tract infections|Disease_Syndrome_Disorder|
|  1|  160|161|                                He|                   Gender|
|  1|  171|180|                        subjective|                 Modifier|

**Merging overlapped chunks by considering their sequence** <br/>

If we set `setSelectionStrategy("Sequential")`, the leftmost chunk is prioritized when chunks overlap.
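
The idea of a left-to-right pass can be sketched in plain Python as follows. This is a simplified illustration rather than the library's implementation: `sequential_merge` is a hypothetical helper, and breaking ties on the begin index by input column order is an assumption.

```python
def sequential_merge(columns):
    """Walk all chunks left to right and keep a chunk only if it starts
    after the end of the previously kept chunk.

    `columns` is a list of chunk lists; each chunk is (begin, end, text)
    with an inclusive end index.
    """
    # Tag each chunk with its column index so ties on `begin` are
    # resolved in favour of the earlier input column (an assumption).
    tagged = [
        (begin, col_idx, end, text)
        for col_idx, col in enumerate(columns)
        for (begin, end, text) in col
    ]
    tagged.sort()

    kept = []
    last_end = -1
    for begin, _col_idx, end, text in tagged:
        if begin > last_end:
            kept.append((begin, end, text))
            last_end = end
    return kept


clinical = [(2, 3, "63")]
jsl = [(2, 13, "63 years old"), (15, 17, "man")]
result = sequential_merge([clinical, jsl])
print(result)  # [(2, 3, '63'), (15, 17, 'man')]
```

Both age chunks begin at index 2, so the one from the first column is reached first and the overlapping `63 years old` span is skipped.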


In [13]:
chunk_merger = medical.ChunkMergeApproach() \
    .setInputCols("clinical_ner_chunk", "jsl_ner_chunk") \
    .setOutputCol("ner_chunk_new") \
    .setMergeOverlapping(True) \
    .setSelectionStrategy("Sequential")\
    .setCaseSensitive(False)

nlpPipeline = nlp.Pipeline(
    stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        word_embeddings,
        clinical_ner,
        clinical_ner_converter,
        jsl_ner,
        jsl_ner_converter,
        chunk_merger
])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

In [14]:
merged_data = model.transform(data_chunk_merge).cache()

In [15]:
merged_data.select("clinical_ner_chunk").show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------------+
|clinical_ner_chunk                                                                                                         |
+---------------------------------------------------------------------------------------------------------------------------+
|[{chunk, 2, 3, 63, {chunk -> 0, confidence -> 0.9997, ner_source -> clinical_ner_chunk, entity -> AGE, sentence -> 0}, []}]|
+---------------------------------------------------------------------------------------------------------------------------+



In [16]:
merged_data.select("jsl_ner_chunk").show(truncate=False)

[`jsl_ner_chunk` output truncated]

In [17]:
from pyspark.sql import functions as F

result_df = merged_data.select('id',F.explode('ner_chunk_new').alias("cols")) \
                       .select('id',F.expr("cols.begin").alias("begin"),
                               F.expr("cols.end").alias("end"),
                               F.expr("cols.result").alias("chunk"),
                               F.expr("cols.metadata.entity").alias("entity"))

result_df.show(50, truncate=100)

+---+-----+---+----------------------------------+-------------------------+
| id|begin|end|                             chunk|                   entity|
+---+-----+---+----------------------------------+-------------------------+
|  1|    2|  3|                                63|                      AGE|
|  1|   15| 17|                               man|                   Gender|
|  1|   35| 42|                          hospital|            Clinical_Dept|
|  1|   62| 70|                         recurrent|                 Modifier|
|  1|   72| 81|                        infections|Disease_Syndrome_Disorder|
|  1|   96|105|                        cellulitis|Disease_Syndrome_Disorder|
|  1|  108|117|                        pneumonias|Disease_Syndrome_Disorder|
|  1|  124|157|upper respiratory tract infections|Disease_Syndrome_Disorder|
|  1|  160|161|                                He|                   Gender|
|  1|  171|180|                        subjective|                 Modifier|

**Merging overlapped chunks by considering their confidence** <br/>

If we set `setSelectionStrategy("Sequential")` and `setOrderingFeatures(["ChunkConfidence"])` parameters, the chunk with the highest confidence score will be prioritized in case of overlapping.
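
A minimal sketch of confidence-based selection, written in plain Python and independent of Spark NLP, might look like this. The `merge_by_confidence` helper is hypothetical, and the confidence values marked below are invented for illustration because the corresponding output is not shown in this notebook.

```python
def merge_by_confidence(chunks):
    """Greedy sketch: visit chunks from most to least confident and keep
    a chunk only if it is disjoint from every chunk already kept."""
    kept = []
    for c in sorted(chunks, key=lambda c: c["confidence"], reverse=True):
        disjoint = all(c["end"] < k["begin"] or c["begin"] > k["end"] for k in kept)
        if disjoint:
            kept.append(c)
    return sorted(kept, key=lambda c: c["begin"])


chunks = [
    # Confidence taken from the clinical_ner_chunk output in this notebook.
    {"begin": 2, "end": 3, "result": "63", "confidence": 0.9997},
    # Hypothetical confidence values, chosen only for the example.
    {"begin": 2, "end": 13, "result": "63 years old", "confidence": 0.90},
    {"begin": 15, "end": 17, "result": "man", "confidence": 0.95},
]
winners = merge_by_confidence(chunks)
print([c["result"] for c in winners])  # ['63', 'man']
```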


In [18]:
chunk_merger = medical.ChunkMergeApproach() \
    .setInputCols("clinical_ner_chunk", "jsl_ner_chunk") \
    .setOutputCol("ner_chunk_new") \
    .setMergeOverlapping(True) \
    .setOrderingFeatures(["ChunkConfidence"])\
    .setSelectionStrategy("Sequential")\
    .setCaseSensitive(False)


nlpPipeline = nlp.Pipeline(
    stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        word_embeddings,
        clinical_ner,
        clinical_ner_converter,
        jsl_ner,
        jsl_ner_converter,
        chunk_merger
])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

In [19]:
merged_data = model.transform(data_chunk_merge).cache()

In [20]:
merged_data.select("clinical_ner_chunk").show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------------+
|clinical_ner_chunk                                                                                                         |
+---------------------------------------------------------------------------------------------------------------------------+
|[{chunk, 2, 3, 63, {chunk -> 0, confidence -> 0.9997, ner_source -> clinical_ner_chunk, entity -> AGE, sentence -> 0}, []}]|
+---------------------------------------------------------------------------------------------------------------------------+



In [21]:
merged_data.select("jsl_ner_chunk").show(truncate=False)

[`jsl_ner_chunk` output truncated]

In [22]:
from pyspark.sql import functions as F

result_df = merged_data.select('id',F.explode('ner_chunk_new').alias("cols")) \
                       .select('id',F.expr("cols.begin").alias("begin"),
                               F.expr("cols.end").alias("end"),
                               F.expr("cols.result").alias("chunk"),
                               F.expr("cols.metadata.entity").alias("entity"))

result_df.show(50, truncate=100)

+---+-----+---+----------------------------------+-------------------------+
| id|begin|end|                             chunk|                   entity|
+---+-----+---+----------------------------------+-------------------------+
|  1|    2|  3|                                63|                      AGE|
|  1|   15| 17|                               man|                   Gender|
|  1|   35| 42|                          hospital|            Clinical_Dept|
|  1|   62| 70|                         recurrent|                 Modifier|
|  1|   72| 81|                        infections|Disease_Syndrome_Disorder|
|  1|   96|105|                        cellulitis|Disease_Syndrome_Disorder|
|  1|  108|117|                        pneumonias|Disease_Syndrome_Disorder|
|  1|  124|157|upper respiratory tract infections|Disease_Syndrome_Disorder|
|  1|  160|161|                                He|                   Gender|
|  1|  171|180|                        subjective|                 Modifier|

**Merging overlapped chunks by considering custom values that we set** <br/>
`setChunkPrecedence` selects which metadata parameters drive prioritization. The desired order is given as a comma-separated string such as `"parameter_1,parameter_2"`.

Then, to set the values of these parameters, pass `setChunkPrecedenceValuePrioritization` a list of string pairs like `["parameter_1,value_1", "parameter_2,value_2"]`.

Here is a sample metadata of a NER chunk annotation. You can choose any of the parameters to set prioritization.

`{chunk -> 0, confidence -> 0.9997, ner_source -> posology_ner_chunk, entity -> DRUG, sentence -> 0}`

Let's select the `ner_source` and `entity` parameters for prioritization by setting:
> `setChunkPrecedence('ner_source,entity')`

Then we will set values of these parameters to prioritize.
>`setChunkPrecedenceValuePrioritization(["clinical_ner_chunk,AGE", "jsl_ner_chunk,Age"])`

For an overlapped chunk, we prioritized the output of the `clinical_ner_chunk` column with the `AGE` entity, then the `jsl_ner_chunk` column with the `Age` entity.
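
One way to picture this ranking is the plain-Python sketch below. It is not the library's implementation: the `precedence` helper is hypothetical, and it assumes each priority string lists one value per selected parameter, in the same order as the `setChunkPrecedence` string.

```python
# Metadata parameters that drive prioritization (mirrors setChunkPrecedence).
fields = "ner_source,entity".split(",")

# Desired order of value combinations
# (mirrors setChunkPrecedenceValuePrioritization).
priorities = ["clinical_ner_chunk,AGE", "jsl_ner_chunk,Age"]
rank = {tuple(p.split(",")): i for i, p in enumerate(priorities)}


def precedence(metadata):
    """Lower is better; unlisted combinations rank after every listed one."""
    key = tuple(metadata[f] for f in fields)
    return rank.get(key, len(rank))


# Two overlapping candidates for the same region of text.
candidates = [
    {"result": "63 years old",
     "metadata": {"ner_source": "jsl_ner_chunk", "entity": "Age"}},
    {"result": "63",
     "metadata": {"ner_source": "clinical_ner_chunk", "entity": "AGE"}},
]
winner = min(candidates, key=lambda c: precedence(c["metadata"]))
print(winner["result"])  # 63
```

Under these assumptions the `clinical_ner_chunk`/`AGE` combination ranks first, so the `63` chunk wins the overlap.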

In [23]:
chunk_merger = medical.ChunkMergeApproach() \
    .setInputCols("clinical_ner_chunk", "jsl_ner_chunk") \
    .setOutputCol("ner_chunk_new") \
    .setMergeOverlapping(True) \
    .setSelectionStrategy("Sequential")\
    .setOrderingFeatures(["ChunkPrecedence"]) \
    .setChunkPrecedence('ner_source,entity')\
    .setChunkPrecedenceValuePrioritization(["clinical_ner_chunk,AGE", "jsl_ner_chunk,Age"])\
    .setCaseSensitive(False)

nlpPipeline = nlp.Pipeline(
    stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        word_embeddings,
        clinical_ner,
        clinical_ner_converter,
        jsl_ner,
        jsl_ner_converter,
        chunk_merger
])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

In [24]:
merged_data = model.transform(data_chunk_merge)

In [25]:
merged_data.select("clinical_ner_chunk").show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------------+
|clinical_ner_chunk                                                                                                         |
+---------------------------------------------------------------------------------------------------------------------------+
|[{chunk, 2, 3, 63, {chunk -> 0, confidence -> 0.9997, ner_source -> clinical_ner_chunk, entity -> AGE, sentence -> 0}, []}]|
+---------------------------------------------------------------------------------------------------------------------------+



In [26]:
merged_data.select("jsl_ner_chunk").show(truncate=False)

[`jsl_ner_chunk` output truncated]

In [27]:
from pyspark.sql import functions as F

result_df = merged_data.select('id',F.explode('ner_chunk_new').alias("cols")) \
                       .select('id',F.expr("cols.begin").alias("begin"),
                               F.expr("cols.end").alias("end"),
                               F.expr("cols.result").alias("chunk"),
                               F.expr("cols.metadata.entity").alias("entity"))

result_df.show(50, truncate=100)

+---+-----+---+----------------------------------+-------------------------+
| id|begin|end|                             chunk|                   entity|
+---+-----+---+----------------------------------+-------------------------+
|  1|    2|  3|                                63|                      AGE|
|  1|   15| 17|                               man|                   Gender|
|  1|   35| 42|                          hospital|            Clinical_Dept|
|  1|   62| 70|                         recurrent|                 Modifier|
|  1|   72| 81|                        infections|Disease_Syndrome_Disorder|
|  1|   96|105|                        cellulitis|Disease_Syndrome_Disorder|
|  1|  108|117|                        pneumonias|Disease_Syndrome_Disorder|
|  1|  124|157|upper respiratory tract infections|Disease_Syndrome_Disorder|
|  1|  160|161|                                He|                   Gender|
|  1|  171|180|                        subjective|                 Modifier|

**Merging overlapped chunks by considering begin indices** <br/>

If we set `setOrderingFeatures(["ChunkBegin"])`, the chunk with the lowest begin index is prioritized when chunks overlap.
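
The begin-index rule can be illustrated with a few lines of plain Python. The `prefer_earliest` helper is a hypothetical name and the function is only a sketch of the idea; the two candidate spans are the `DRUG` chunks produced by the posology models in the cells that follow.

```python
def prefer_earliest(chunks):
    """Keep chunks in order of begin index, skipping any chunk that
    overlaps the one kept just before it. Chunks are (begin, end, text)
    with an inclusive end index."""
    kept = []
    for begin, end, text in sorted(chunks):
        if not kept or begin > kept[-1][1]:
            kept.append((begin, end, text))
    return kept


candidates = [(42, 50, "metformin"), (35, 50, "100 mg metformin")]
print(prefer_earliest(candidates))  # [(35, 50, '100 mg metformin')]
```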




In [28]:
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Sentence Detector annotator, processes various sentences per line
sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

# Tokenizer splits words in a relevant format for NLP
tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Clinical word embeddings trained on the PubMed dataset
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

pos_ner = medical.NerModel.pretrained("ner_posology_large", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("pos_ner")

pos_ner_converter = medical.NerConverterInternal() \
    .setInputCols(["sentence", "token", "pos_ner"]) \
    .setOutputCol("pos_ner_chunk")\
    .setWhiteList(['DRUG'])

greedy_ner = medical.NerModel.pretrained("ner_posology_greedy", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("greedy_ner")

greedy_ner_converter = medical.NerConverterInternal() \
    .setInputCols(["sentence", "token", "greedy_ner"]) \
    .setOutputCol("greedy_ner_chunk") \
    .setWhiteList(['DRUG'])

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_posology_large download started this may take some time.
Approximate size to download 13.8 MB
[OK!]
ner_posology_greedy download started this may take some time.
Approximate size to download 13.9 MB
[OK!]


In [29]:
chunk_merger = medical.ChunkMergeApproach() \
    .setInputCols("pos_ner_chunk", "greedy_ner_chunk") \
    .setOutputCol("ner_chunk_new") \
    .setMergeOverlapping(True) \
    .setOrderingFeatures(["ChunkBegin"])\
    .setCaseSensitive(False)


nlpPipeline = nlp.Pipeline(
    stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        word_embeddings,
        pos_ner,
        pos_ner_converter,
        greedy_ner,
        greedy_ner_converter,
        chunk_merger
])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

In [30]:
# Sample data
data_chunk_merge = spark.createDataFrame([
  (1,"""A 43 years of woman was prescribed 100 mg metformin for 5 days.""")]).toDF("id","text")

data_chunk_merge.show(truncate=150)

+---+---------------------------------------------------------------+
| id|                                                           text|
+---+---------------------------------------------------------------+
|  1|A 43 years of woman was prescribed 100 mg metformin for 5 days.|
+---+---------------------------------------------------------------+



In [31]:
merged_data = model.transform(data_chunk_merge)

In [32]:
merged_data.select("pos_ner_chunk").show(truncate=False)

+--------------------------------------------------------------------------------------------------------------------------------+
|pos_ner_chunk                                                                                                                   |
+--------------------------------------------------------------------------------------------------------------------------------+
|[{chunk, 42, 50, metformin, {chunk -> 0, confidence -> 0.9996, ner_source -> pos_ner_chunk, entity -> DRUG, sentence -> 0}, []}]|
+--------------------------------------------------------------------------------------------------------------------------------+



In [33]:
merged_data.select("greedy_ner_chunk").show(truncate=False)

+----------------------------------------------------------------------------------------------------------------------------------------------+
|greedy_ner_chunk                                                                                                                              |
+----------------------------------------------------------------------------------------------------------------------------------------------+
|[{chunk, 35, 50, 100 mg metformin, {entity -> DRUG, confidence -> 0.67116666, ner_source -> greedy_ner_chunk, chunk -> 0, sentence -> 0}, []}]|
+----------------------------------------------------------------------------------------------------------------------------------------------+



In [34]:
from pyspark.sql import functions as F

result_df = merged_data.select('id',F.explode('ner_chunk_new').alias("cols")) \
                       .select('id',F.expr("cols.begin").alias("begin"),
                               F.expr("cols.end").alias("end"),
                               F.expr("cols.result").alias("chunk"),
                               F.expr("cols.metadata.entity").alias("entity"))

result_df.show(5, truncate=100)

+---+-----+---+----------------+------+
| id|begin|end|           chunk|entity|
+---+-----+---+----------------+------+
|  1|   35| 50|100 mg metformin|  DRUG|
+---+-----+---+----------------+------+



## NonOverlapped Chunk

All the entities from each NER model are returned one by one, even when their spans overlap.
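
Conceptually, this mode behaves like a plain concatenation of the input columns. The sketch below shows that idea in pure Python; sorting by `(begin, end)` is an assumption made so the rows come out in the same order as the result table shown later in this section.

```python
from itertools import chain

# Chunks as (begin, end, text, entity) with an inclusive end index.
clinical = [(2, 3, "63", "AGE")]
jsl = [(2, 13, "63 years old", "Age"), (15, 17, "man", "Gender")]

# No overlap resolution: concatenate every column and order by position.
combined = sorted(chain(clinical, jsl), key=lambda c: (c[0], c[1]))
for row in combined:
    print(row)
```

Both `63` and `63 years old` survive, because nothing is discarded when overlap merging is turned off.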

In [35]:
# merge ner_chunks regardless of overlapping indices
# only works with 2.7 and later
chunk_merger_NonOverlapped = medical.ChunkMergeApproach()\
    .setInputCols('clinical_ner_chunk', "jsl_ner_chunk")\
    .setOutputCol('nonOverlapped_ner_chunk')\
    .setMergeOverlapping(False)\
    .setCaseSensitive(False)


nlpPipeline = nlp.Pipeline(
    stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        word_embeddings,
        clinical_ner,
        clinical_ner_converter,
        jsl_ner,
        jsl_ner_converter,
        chunk_merger_NonOverlapped
])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

In [36]:
# Sample data
data_chunk_merge = spark.createDataFrame([
  (1,"""A 63 years old man presents to the hospital with a history of recurrent infections that include cellulitis, pneumonias, and upper respiratory tract infections. He reports subjective fevers at home along with unintentional weight loss and occasional night sweats. The patient has a remote history of arthritis, which was diagnosed approximately 20 years ago and treated intermittently with methotrexate (MTX) and prednisone. On physical exam, he is found to be febrile at 102°F, rather cachectic, pale, and have hepatosplenomegaly. Several swollen joints that are tender to palpation and have decreased range of motion are also present. His laboratory values show pancytopenia with the most severe deficiency in neutrophils.""")
]).toDF("id","text")

data_chunk_merge.show(truncate=150)

+---+------------------------------------------------------------------------------------------------------------------------------------------------------+
| id|                                                                                                                                                  text|
+---+------------------------------------------------------------------------------------------------------------------------------------------------------+
|  1|A 63 years old man presents to the hospital with a history of recurrent infections that include cellulitis, pneumonias, and upper respiratory tract...|
+---+------------------------------------------------------------------------------------------------------------------------------------------------------+



In [37]:
merged_data = model.transform(data_chunk_merge)

In [38]:
from pyspark.sql import functions as F

result_df2 = merged_data.select('id',F.explode('nonOverlapped_ner_chunk').alias("cols")) \
                        .select('id',F.expr("cols.begin").alias("begin"),
                                F.expr("cols.end").alias("end"),
                                F.expr("cols.result").alias("chunk"),
                                F.expr("cols.metadata.entity").alias("entity"))

result_df2.show(50, truncate=100)


+---+-----+---+----------------------------------+-------------------------+
| id|begin|end|                             chunk|                   entity|
+---+-----+---+----------------------------------+-------------------------+
|  1|    2|  3|                                63|                      AGE|
|  1|    2| 13|                      63 years old|                      Age|
|  1|   15| 17|                               man|                   Gender|
|  1|   35| 42|                          hospital|            Clinical_Dept|
|  1|   62| 70|                         recurrent|                 Modifier|
|  1|   72| 81|                        infections|Disease_Syndrome_Disorder|
|  1|   96|105|                        cellulitis|Disease_Syndrome_Disorder|
|  1|  108|117|                        pneumonias|Disease_Syndrome_Disorder|
|  1|  124|157|upper respiratory tract infections|Disease_Syndrome_Disorder|
|  1|  160|161|                                He|                   Gender|

## ChunkMergeApproach with N input columns
We can feed `ChunkMergeApproach` more than two chunk columns. We can also filter out unwanted entities from its output using the `setBlackList` parameter.
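
The combination of several inputs with an entity blacklist can be pictured with the pure-Python sketch below. Everything in it is hypothetical: the `merge_columns` helper, the sample sentence, and the entity labels are invented for illustration. The sketch also skips overlap resolution and filters after concatenation, which may not match the order in which the library applies its blacklist.

```python
from itertools import chain


def merge_columns(*columns, blacklist=()):
    """Concatenate any number of chunk lists and drop blacklisted entities.
    Chunks are (begin, end, text, entity) with an inclusive end index."""
    banned = set(blacklist)
    merged = [c for c in chain.from_iterable(columns) if c[3] not in banned]
    return sorted(merged, key=lambda c: (c[0], c[1]))


# Hypothetical chunks over the sentence "She takes metformin for diabetes".
drug_chunks = [(10, 18, "metformin", "DRUG")]
problem_chunks = [(24, 31, "diabetes", "PROBLEM")]
gender_chunks = [(0, 2, "She", "Gender")]

filtered = merge_columns(drug_chunks, problem_chunks, gender_chunks,
                         blacklist=["Gender"])
print(filtered)
```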

In [39]:
# Create the directory for the rule files (-p avoids an error if it already exists)

!mkdir -p data

In [40]:
sample_text = """A 28 year old female with a history of gestational diabetes mellitus diagnosed eight years prior to
presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis
three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index
( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting.
Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection .
She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG .
She had been on dapagliflozin for six months at the time of presentation . Physical examination on presentation was
significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness , guarding ,
or rigidity . Pertinent laboratory findings on admission were : serum glucose 111 mg/dl , bicarbonate 18 mmol/l ,
anion gap 20 , creatinine 0.4 mg/dL , triglycerides 508 mg/dL , total cholesterol 122 mg/dL , glycated hemoglobin
( HbA1c ) 10% , and venous pH 7.27 . Serum lipase was normal at 43 U/L . Serum acetone levels could not be assessed
as blood samples kept hemolyzing due to significant lipemia .
The patient was initially admitted for starvation ketosis , as she reported poor oral intake for three days prior
to admission . However , serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL ,
the anion gap was still elevated at 21 , serum bicarbonate was 16 mmol/L , triglyceride level peaked at 2050 mg/dL ,
and lipase was 52 U/L .
 β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - the original sample was centrifuged
 and the chylomicron layer removed prior to analysis due to interference from turbidity caused by lipemia again .
 The patient was treated with an insulin drip for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides
 to 1400 mg/dL , within 24 hours .
 Twenty days ago.
 Her euDKA was thought to be precipitated by her respiratory tract infection in the setting of SGLT2 inhibitor use .
 At birth the typical boy is growing slightly faster than the typical girl, but the velocities become equal at about
 seven months, and then the girl grows faster until four years.
 From then until adolescence no differences in velocity
 can be detected. 21-02-2020
21/04/2020
"""

In [41]:
# Define the ContextualParser rules that will feed ChunkMergeApproach

# Date rule
date = {
  "entity": "Parser_Date",
  "ruleScope": "sentence",
  "regex": "\\d{1,2}[\\/\\-\\:]{1}(\\d{1,2}[\\/\\-\\:]{1}){0,1}\\d{2,4}",
  "valuesDefinition":[],
  "prefix": [],
  "suffix": [],
  "contextLength": 150,
  "context": []
}


with open('data/date.json', 'w') as f:
    json.dump(date, f)


age = {
  "entity": "Parser_Age",
  "ruleScope": "sentence",
  "matchScope":"token",
  "regex" : "^[1][0-9][0-9]|[1-9][0-9]|[1-9]$",
  "prefix":["age of", "age"],
  "suffix": ["-years-old",
             "years-old",
             "-year-old",
             "-months-old",
             "-month-old",
             "-months-old",
             "-day-old",
             "-days-old",
             "month old",
             "days old",
             "year old",
             "years old",
             "years",
             "year",
             "months",
             "old"
              ],
  "contextLength": 25,
  "context": [],
  "contextException": ["ago"],
  "exceptionDistance": 10
}

with open("data/age.json", 'w') as f:
  json.dump(age, f)
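
Before handing rules to the ContextualParser, the regular expressions can be sanity-checked in plain Python. The sketch below is only an approximation: Spark NLP evaluates the rules on the JVM and also applies the prefix, suffix, and context settings, none of which are modeled here. The sample strings are short fragments like those in `sample_text`, and `AGE_REGEX_GROUPED` is a hypothetical variant shown only for comparison; it is not part of the notebook's rule.

```python
import re

# Same patterns as in the rule dictionaries above
DATE_REGEX = "\\d{1,2}[\\/\\-\\:]{1}(\\d{1,2}[\\/\\-\\:]{1}){0,1}\\d{2,4}"
AGE_REGEX = "^[1][0-9][0-9]|[1-9][0-9]|[1-9]$"

# Hypothetical variant: parentheses make ^ and $ apply to every alternative
AGE_REGEX_GROUPED = "^([1][0-9][0-9]|[1-9][0-9]|[1-9])$"

# Which sample strings are matched in full by the date pattern?
date_samples = ["21-02-2020", "21/04/2020", "33.5", "Twenty"]
date_hits = [s for s in date_samples if re.fullmatch(DATE_REGEX, s)]
print(date_hits)

# In Python's re, | binds more loosely than the anchors, so the ungrouped
# pattern can match part of a longer number such as "2050".
token = "2050"
ungrouped = re.search(AGE_REGEX, token)
grouped = re.search(AGE_REGEX_GROUPED, token)
print(ungrouped.group(0) if ungrouped else None)
print(grouped.group(0) if grouped else None)

# A plain two-digit age is accepted by both variants
both_accept_28 = bool(re.search(AGE_REGEX, "28")) and bool(re.search(AGE_REGEX_GROUPED, "28"))
print(both_accept_28)
```

Whether the ungrouped alternation matters in practice depends on how the ContextualParser applies the pattern to each token, so treat this as a check worth running against your own rules rather than a confirmed defect.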



Next, we use two ContextualParserApproach models and an NER model in the same pipeline and merge their outputs with ChunkMergeApproach.

In [42]:
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Sentence Detector annotator, processes various sentences per line
sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

# Tokenizer splits words in a relevant format for NLP
tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Contextual parser for age
age_contextual_parser = medical.ContextualParserApproach() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("entity_age") \
    .setJsonPath("data/age.json") \
    .setCaseSensitive(False) \
    .setPrefixAndSuffixMatch(False)\
    .setOptionalContextRules(False)

chunks_age= medical.ChunkConverter()\
    .setInputCols("entity_age")\
    .setOutputCol("chunk_age")

# Contextual parser for date
date_contextual_parser = medical.ContextualParserApproach() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("entity_date") \
    .setJsonPath("data/date.json") \
    .setCaseSensitive(False) \
    .setPrefixAndSuffixMatch(False)

chunks_date = medical.ChunkConverter().setInputCols("entity_date").setOutputCol("chunk_date")

# Clinical word embeddings
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

# Extracting entities by ner_deid_large
ner_model = medical.NerModel.pretrained("ner_deid_large","en","clinical/models") \
    .setInputCols("sentence","token","embeddings") \
    .setOutputCol("ner")

ner_converter= medical.NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")\
    .setWhiteList(["DATE", "AGE"])

# ChunkMerger: the contextual parser chunks come first, so they take priority
parser_based_merge= medical.ChunkMergeApproach()\
    .setInputCols(["chunk_age", "chunk_date", "ner_chunk"])\
    .setOutputCol("merged_chunks")

# Chunkmerger; prioritize ner_chunk
ner_based_merge= medical.ChunkMergeApproach()\
    .setInputCols(["ner_chunk", "chunk_age", "chunk_date"])\
    .setOutputCol("merged_chunks_2")

# Using black list for limiting the entity types that will be extracted
limited_merge= medical.ChunkMergeApproach()\
    .setInputCols(["ner_chunk", "chunk_age", "chunk_date"])\
    .setOutputCol("merged_chunks_black_list")\
    .setBlackList(["DATE", "Parser_Date"]) # this will block the dates.

pipeline= nlp.Pipeline(
    stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        age_contextual_parser,
        chunks_age,
        date_contextual_parser,
        chunks_date,
        word_embeddings,
        ner_model,
        ner_converter,
        parser_based_merge,
        ner_based_merge,
        limited_merge
])

empty_df= spark.createDataFrame([[""]]).toDF("text")
model= pipeline.fit(empty_df)


lmodel= nlp.LightPipeline(model)
lresult= lmodel.fullAnnotate(sample_text)[0]


embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_deid_large download started this may take some time.
Approximate size to download 14.1 MB
[OK!]


In [43]:
lresult.keys()

dict_keys(['chunk_age', 'document', 'ner_chunk', 'token', 'entity_date', 'ner', 'merged_chunks_2', 'entity_age', 'merged_chunks_black_list', 'embeddings', 'chunk_date', 'sentence', 'merged_chunks'])

If chunks from different inputs overlap, the ChunkMergeApproach model prioritizes the leftmost input column. <br/>

In `parser_based_merge`, the contextual parsers' chunks come first. Therefore, `parser_based_merge` prioritizes the "Parser_Age" and "Parser_Date" entities over the "AGE" and "DATE" entity types that come from the NER model. <br/>

In `ner_based_merge`, the NER model's chunks come first, so `ner_based_merge` prioritizes the "AGE" and "DATE" entities over "Parser_Age" and "Parser_Date". <br/>

In `limited_merge`, the "DATE" and "Parser_Date" entity types are excluded.
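
The precedence rule described above can be pictured with a small pure-Python model. This is a toy sketch of the "leftmost input wins on overlap" behavior and of a black list, not the annotator's actual implementation, which has further options (for example `setMergeOverlapping` and `setChunkPrecedence`). The `Chunk` type, the `merge_by_input_order` helper, and the date offsets are all made up for illustration.

```python
from typing import List, NamedTuple, Optional, Sequence


class Chunk(NamedTuple):
    begin: int
    end: int      # inclusive end offset, as in Spark NLP annotations
    text: str
    entity: str


def overlaps(a: Chunk, b: Chunk) -> bool:
    """True when the two character spans share at least one position."""
    return a.begin <= b.end and b.begin <= a.end


def merge_by_input_order(inputs: Sequence[List[Chunk]],
                         black_list: Optional[List[str]] = None) -> List[Chunk]:
    """Keep a chunk only if no chunk from an earlier (leftmost) input overlaps it."""
    kept: List[Chunk] = []
    for chunks in inputs:                      # leftmost input is visited first
        for c in chunks:
            if not any(overlaps(c, k) for k in kept):
                kept.append(c)
    blocked = set(black_list or [])            # assumption: filtering happens after merging
    return sorted((c for c in kept if c.entity not in blocked), key=lambda c: c.begin)


# Illustrative chunks; the date offsets are hypothetical
chunk_age = [Chunk(2, 3, "28", "Parser_Age")]
chunk_date = [Chunk(2000, 2009, "21-02-2020", "Parser_Date"),
              Chunk(2011, 2020, "21/04/2020", "Parser_Date")]
ner_chunk = [Chunk(2, 3, "28", "AGE"),
             Chunk(2000, 2009, "21-02-2020", "DATE"),
             Chunk(2011, 2020, "21/04/2020", "DATE")]

parser_first = [c.entity for c in merge_by_input_order([chunk_age, chunk_date, ner_chunk])]
ner_first = [c.entity for c in merge_by_input_order([ner_chunk, chunk_age, chunk_date])]
limited = [c.entity for c in merge_by_input_order([ner_chunk, chunk_age, chunk_date],
                                                  black_list=["DATE", "Parser_Date"])]
print(parser_first, ner_first, limited)
```

Under these assumptions the three calls mirror the three mergers in the pipeline: parser labels win when the parser columns come first, NER labels win when `ner_chunk` comes first, and the black list removes both date labels.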

Let's compare the results of these ChunkMergeApproach annotators below:

In [44]:
chunk = []
parser_based_entities = []   # separate names, so the annotators defined above are not overwritten
ner_based_entities = []

# zip pairs the i-th chunk of each merge; this assumes both outputs
# contain the same chunks in the same order
for i, k in zip(lresult["merged_chunks"], lresult["merged_chunks_2"]):
  parser_based_entities.append(i.metadata["entity"])
  ner_based_entities.append(k.metadata["entity"])
  chunk.append(i.result)

df = pd.DataFrame({"chunk": chunk, "parser_based_merged_entity": parser_based_entities, "ner_based_merged_entity": ner_based_entities})
df.head()

Unnamed: 0,chunk,parser_based_merged_entity,ner_based_merged_entity
0,28,Parser_Age,AGE
1,21-02-2020,Parser_Date,DATE
2,21/04/2020,Parser_Date,DATE


`.setBlackList()` applied results:

In [45]:
chunk= []
limited_merge_entity= []

for i in list(lresult["merged_chunks_black_list"]):
  chunk.append(i.result)
  limited_merge_entity.append(i.metadata["entity"])

df= pd.DataFrame({"chunk": chunk, "limited_entity": limited_merge_entity })
df.head()

Unnamed: 0,chunk,limited_entity
0,28,AGE


## Dictionary Format for the Selective Merging
`ChunkMergeModel` provides `setReplaceDict` for replacing entity labels and `setFalsePositives` for precise control over chunk merging outcomes. In the example below, `setReplaceDict` maps both `DOCTOR` and `PATIENT` to `NAME`, and each `setFalsePositives` entry is a three-element list. Additionally, `ChunkMergeApproach` has `setEntitiesConfidence`, which lets users set confidence thresholds per entity for further customization.
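
To make the relabeling idea concrete, here is a pure-Python sketch of how a replace dictionary and false-positive rules could act on `(chunk text, entity)` pairs. It assumes each false-positive rule reads as `[chunk text, current label, new label]` and that a rule fires only when both the text and the current label match; rules with an empty new label are left unmodeled. The `relabel` helper is hypothetical and is not how `ChunkMergeModel` is implemented.

```python
from typing import Dict, List, Tuple

Chunk = Tuple[str, str]  # (chunk text, entity label)


def relabel(chunks: List[Chunk],
            replace_dict: Dict[str, str],
            false_positives: List[List[str]]) -> List[Chunk]:
    out: List[Chunk] = []
    for text, entity in chunks:
        for fp_text, fp_entity, fp_target in false_positives:
            # Assumption: the rule applies only on an exact text + label match;
            # rules with an empty target are skipped in this sketch.
            if text == fp_text and entity == fp_entity and fp_target:
                entity = fp_target
        # Assumption: label replacement runs after the false-positive rules
        out.append((text, replace_dict.get(entity, entity)))
    return out


replace_dict = {"DOCTOR": "NAME", "PATIENT": "NAME"}
false_positives = [["metformin", "TREATMENT", "DRUG"],
                   ["glipizide", "TREATMENT", ""]]

chunks = [("Jennifer", "PATIENT"), ("58", "AGE"), ("John Green", "DOCTOR"),
          ("metformin", "DRUG"), ("glipizide", "DRUG")]

relabeled = relabel(chunks, replace_dict, false_positives)
print(relabeled)

# Hypothetical case where the first rule would fire
corrected = relabel([("metformin", "TREATMENT")], replace_dict, false_positives)
print(corrected)
```

With these inputs the name labels collapse to `NAME`, while the drug chunks are unchanged because they already carry the `DRUG` label rather than `TREATMENT`.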

In [46]:
posology_ner = medical.NerModel.pretrained("ner_posology", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("posology_ner")

posology_ner_converter = medical.NerConverterInternal() \
    .setInputCols(["sentence", "token", "posology_ner"]) \
    .setOutputCol("posology_ner_chunk")

# Deid NER
deid_ner = medical.NerModel \
    .pretrained('ner_deid_subentity_augmented', 'en', 'clinical/models') \
    .setInputCols(['sentence', 'token', 'embeddings']) \
    .setOutputCol('deid_ner')

deid_ner_converter = medical.NerConverterInternal() \
    .setInputCols(['sentence', 'token', 'deid_ner']) \
    .setOutputCol('deid_ner_chunk')

chunk_merger = medical.ChunkMergeApproach()\
    .setInputCols("posology_ner_chunk",'deid_ner_chunk')\
    .setOutputCol('merged_ner_chunk')

chunk_merge_model = medical.ChunkMergeModel() \
    .setInputCols("posology_ner_chunk","deid_ner_chunk") \
    .setOutputCol("merged_chunk") \
    .setReplaceDict({"DOCTOR": "NAME",
                     "PATIENT": "NAME"}) \
    .setFalsePositives([["metformin", "TREATMENT", "DRUG"],
                        ["glipizide","TREATMENT",""]])

nlpPipeline = nlp.Pipeline(
    stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        word_embeddings,
        posology_ner,
        posology_ner_converter,
        deid_ner,
        deid_ner_converter,
        chunk_merger,
        chunk_merge_model
])

model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

text ="""Jennifer is 58 years old. She was  seen by Dr. John Green and discharged on metformin, glipizide for T2DM and atorvastatin and gemfibrozil for HTG."""

light_model = nlp.LightPipeline(model)

light_result = light_model.fullAnnotate(text)

ner_posology download started this may take some time.
Approximate size to download 13.8 MB
[OK!]
ner_deid_subentity_augmented download started this may take some time.
Approximate size to download 14.1 MB
[OK!]


In [47]:
light_result[0]["merged_ner_chunk"]

[Annotation(chunk, 0, 7, Jennifer, {'entity': 'PATIENT', 'confidence': '0.9993', 'ner_source': 'deid_ner_chunk', 'chunk': '0', 'sentence': '0'}, []),
 Annotation(chunk, 12, 13, 58, {'entity': 'AGE', 'confidence': '1.0', 'ner_source': 'deid_ner_chunk', 'chunk': '1', 'sentence': '0'}, []),
 Annotation(chunk, 47, 56, John Green, {'entity': 'DOCTOR', 'confidence': '0.7381', 'ner_source': 'deid_ner_chunk', 'chunk': '2', 'sentence': '1'}, []),
 Annotation(chunk, 76, 84, metformin, {'entity': 'DRUG', 'confidence': '1.0', 'ner_source': 'posology_ner_chunk', 'chunk': '3', 'sentence': '1'}, []),
 Annotation(chunk, 87, 95, glipizide, {'entity': 'DRUG', 'confidence': '0.9983', 'ner_source': 'posology_ner_chunk', 'chunk': '4', 'sentence': '1'}, []),
 Annotation(chunk, 110, 121, atorvastatin, {'entity': 'DRUG', 'confidence': '1.0', 'ner_source': 'posology_ner_chunk', 'chunk': '5', 'sentence': '1'}, []),
 Annotation(chunk, 127, 137, gemfibrozil, {'entity': 'DRUG', 'confidence': '0.9996', 'ner_source'

In [48]:
light_result[0]["merged_chunk"]

[Annotation(chunk, 0, 7, Jennifer, {'entity': 'NAME', 'confidence': '0.9993', 'ner_source': 'deid_ner_chunk', 'chunk': '0', 'sentence': '0'}, []),
 Annotation(chunk, 12, 13, 58, {'entity': 'AGE', 'confidence': '1.0', 'ner_source': 'deid_ner_chunk', 'chunk': '1', 'sentence': '0'}, []),
 Annotation(chunk, 47, 56, John Green, {'entity': 'NAME', 'confidence': '0.7381', 'ner_source': 'deid_ner_chunk', 'chunk': '2', 'sentence': '1'}, []),
 Annotation(chunk, 76, 84, metformin, {'entity': 'DRUG', 'confidence': '1.0', 'ner_source': 'posology_ner_chunk', 'chunk': '3', 'sentence': '1'}, []),
 Annotation(chunk, 87, 95, glipizide, {'entity': 'DRUG', 'confidence': '0.9983', 'ner_source': 'posology_ner_chunk', 'chunk': '4', 'sentence': '1'}, []),
 Annotation(chunk, 110, 121, atorvastatin, {'entity': 'DRUG', 'confidence': '1.0', 'ner_source': 'posology_ner_chunk', 'chunk': '5', 'sentence': '1'}, []),
 Annotation(chunk, 127, 137, gemfibrozil, {'entity': 'DRUG', 'confidence': '0.9996', 'ner_source': 'po

In [49]:
chunks = []
entities = []
sentence= []
begin = []
end = []
confidence = []

for n in light_result[0]['merged_chunk']:
    begin.append(n.begin)
    end.append(n.end)
    chunks.append(n.result)
    entities.append(n.metadata['entity'])
    sentence.append(n.metadata['sentence'])
    confidence.append(n.metadata['confidence'])

df_clinical = pd.DataFrame({'chunks':chunks,
                            'begin': begin,
                            'end':end,
                            'sentence_id':sentence,
                            'entities':entities,
                            'confidence':confidence})

df_clinical.head(20)

Unnamed: 0,chunks,begin,end,sentence_id,entities,confidence
0,Jennifer,0,7,0,NAME,0.9993
1,58,12,13,0,AGE,1.0
2,John Green,47,56,1,NAME,0.7381
3,metformin,76,84,1,DRUG,1.0
4,glipizide,87,95,1,DRUG,0.9983
5,atorvastatin,110,121,1,DRUG,1.0
6,gemfibrozil,127,137,1,DRUG,0.9996


## Filtering Chunks According To Confidence


The `ChunkMergeApproach` annotator can filter chunks according to per-entity confidence thresholds. All you need to do is provide a CSV file that has the NER labels as keys and the confidence thresholds as values, and pass its path with `setEntitiesConfidenceResource`, as in the pipeline below.


In [50]:
conf_dict = """DRUG,0.99
FREQUENCY,0.99
DOSAGE,0.99
DURATION,0.99
STRENGTH,0.99
"""
with open('conf_dict.csv', 'w') as f:
    f.write(conf_dict)

In [51]:
posology_ner = medical.NerModel.pretrained("ner_posology", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("posology_ner")

posology_ner_converter = medical.NerConverterInternal() \
    .setInputCols(["sentence", "token", "posology_ner"]) \
    .setOutputCol("posology_ner_chunk")

chunk_merger = medical.ChunkMergeApproach()\
    .setInputCols("posology_ner_chunk")\
    .setOutputCol('merged_ner_chunk')

chunk_merger_filter = medical.ChunkMergeApproach()\
    .setInputCols("posology_ner_chunk")\
    .setOutputCol('filtered_ner_chunk')\
    .setEntitiesConfidenceResource("conf_dict.csv")

nlpPipeline = nlp.Pipeline(
    stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        word_embeddings,
        posology_ner,
        posology_ner_converter,
        chunk_merger,
        chunk_merger_filter
])

model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

text ="""The patient was prescribed 1 capsule of Advil for 5 days. He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night."""

light_model = nlp.LightPipeline(model)

light_result = light_model.fullAnnotate(text)

ner_posology download started this may take some time.
Approximate size to download 13.8 MB
[OK!]


In [52]:
light_result[0]["merged_ner_chunk"]

[Annotation(chunk, 27, 27, 1, {'entity': 'DOSAGE', 'confidence': '0.9992', 'ner_source': 'posology_ner_chunk', 'chunk': '0', 'sentence': '0'}, []),
 Annotation(chunk, 29, 35, capsule, {'entity': 'FORM', 'confidence': '0.9897', 'ner_source': 'posology_ner_chunk', 'chunk': '1', 'sentence': '0'}, []),
 Annotation(chunk, 40, 44, Advil, {'entity': 'DRUG', 'confidence': '0.997', 'ner_source': 'posology_ner_chunk', 'chunk': '2', 'sentence': '0'}, []),
 Annotation(chunk, 46, 55, for 5 days, {'entity': 'DURATION', 'confidence': '0.71383333', 'ner_source': 'posology_ner_chunk', 'chunk': '3', 'sentence': '0'}, []),
 Annotation(chunk, 125, 132, 40 units, {'entity': 'DOSAGE', 'confidence': '0.85029995', 'ner_source': 'posology_ner_chunk', 'chunk': '4', 'sentence': '1'}, []),
 Annotation(chunk, 137, 152, insulin glargine, {'entity': 'DRUG', 'confidence': '0.82715', 'ner_source': 'posology_ner_chunk', 'chunk': '5', 'sentence': '1'}, []),
 Annotation(chunk, 154, 161, at night, {'entity': 'FREQUENCY', 

In [53]:
light_result[0]["filtered_ner_chunk"]

[Annotation(chunk, 27, 27, 1, {'entity': 'DOSAGE', 'confidence': '0.9992', 'ner_source': 'posology_ner_chunk', 'chunk': '0', 'sentence': '0'}, []),
 Annotation(chunk, 29, 35, capsule, {'entity': 'FORM', 'confidence': '0.9897', 'ner_source': 'posology_ner_chunk', 'chunk': '1', 'sentence': '0'}, []),
 Annotation(chunk, 40, 44, Advil, {'entity': 'DRUG', 'confidence': '0.997', 'ner_source': 'posology_ner_chunk', 'chunk': '2', 'sentence': '0'}, [])]

In [54]:
chunks = []
entities = []
sentence= []
begin = []
end = []
confidence = []

for n in light_result[0]['filtered_ner_chunk']:
    begin.append(n.begin)
    end.append(n.end)
    chunks.append(n.result)
    entities.append(n.metadata['entity'])
    sentence.append(n.metadata['sentence'])
    confidence.append(n.metadata['confidence'])

df_clinical = pd.DataFrame({'chunks':chunks,
                            'begin': begin,
                            'end':end,
                            'sentence_id':sentence,
                            'entities':entities,
                            'confidence':confidence})

df_clinical.head(20)

Unnamed: 0,chunks,begin,end,sentence_id,entities,confidence
0,1,27,27,0,DOSAGE,0.9992
1,capsule,29,35,0,FORM,0.9897
2,Advil,40,44,0,DRUG,0.997
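
The filtering seen above can be mimicked in a few lines of plain Python. This is a sketch under two assumptions that the notebook does not state explicitly: a chunk is kept when its confidence is greater than or equal to the threshold, and labels missing from the CSV (such as `FORM`) are not filtered at all. The chunk values are copied from the `merged_ner_chunk` output; the last chunk is left out because its confidence is truncated in that output.

```python
import csv
import io

# Same content as conf_dict.csv
conf_csv = """DRUG,0.99
FREQUENCY,0.99
DOSAGE,0.99
DURATION,0.99
STRENGTH,0.99
"""

# Parse "LABEL,threshold" rows into a dictionary
thresholds = {row[0]: float(row[1]) for row in csv.reader(io.StringIO(conf_csv)) if row}

# (chunk text, entity, confidence) taken from the unfiltered output
chunks = [
    ("1", "DOSAGE", 0.9992),
    ("capsule", "FORM", 0.9897),
    ("Advil", "DRUG", 0.997),
    ("for 5 days", "DURATION", 0.71383333),
    ("40 units", "DOSAGE", 0.85029995),
    ("insulin glargine", "DRUG", 0.82715),
]

# Assumption: labels without a threshold pass through unfiltered
kept = [text for text, entity, conf in chunks if conf >= thresholds.get(entity, 0.0)]
print(kept)
```

Under these assumptions the sketch keeps the same three chunks that appear in `filtered_ner_chunk`.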


## Merging NERs with TextMatcher and RegexMatcher outputs in the same pipeline

### TextMatcher

Let's build a special NER label for female mentions using a dictionary of female-related terms.

In [55]:
# write the target entities to txt file

entities = ['she', 'her', 'girl', 'woman', 'women', 'womanish', 'womanlike', 'womanly', 'madam', 'madame', 'senora', 'lady', 'miss', 'girlfriend', 'wife', 'bride', 'misses', 'mrs.', 'female']
with open ('female_entities.txt', 'w') as f:
    for i in entities:
        f.write(i+'\n')

In [56]:
sample_text = """A 28 year old female with a history of gestational diabetes mellitus diagnosed eight years prior to
presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis
three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index
( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting.
Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection .
She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG .
She had been on dapagliflozin for six months at the time of presentation . Physical examination on presentation was
significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness , guarding ,
or rigidity . Pertinent laboratory findings on admission were : serum glucose 111 mg/dl , bicarbonate 18 mmol/l ,
anion gap 20 , creatinine 0.4 mg/dL , triglycerides 508 mg/dL , total cholesterol 122 mg/dL , glycated hemoglobin
( HbA1c ) 10% , and venous pH 7.27 . Serum lipase was normal at 43 U/L . Serum acetone levels could not be assessed
as blood samples kept hemolyzing due to significant lipemia .
The patient was initially admitted for starvation ketosis , as she reported poor oral intake for three days prior
to admission . However , serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL ,
the anion gap was still elevated at 21 , serum bicarbonate was 16 mmol/L , triglyceride level peaked at 2050 mg/dL ,
and lipase was 52 U/L .
β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - the original sample was centrifuged
and the chylomicron layer removed prior to analysis due to interference from turbidity caused by lipemia again .
This senora was treated with an insulin drip for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides
to 1400 mg/dL , within 24 hours .
Twenty days ago.
Her euDKA was thought to be precipitated by her respiratory tract infection in the setting of SGLT2 inhibitor use .
At birth the typical boy is growing slightly faster than the typical girl, but the velocities become equal at about
seven months, and then the girl grows faster until four years.
From then until adolescence no differences in velocity
can be detected. 21-02-2020
21/04/2020
"""

In [57]:
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Sentence Detector annotator, processes various sentences per line
sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

# Tokenizer splits words in a relevant format for NLP
tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Clinical word embeddings
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

# Extracting entities by ner_jsl
ner_model = medical.NerModel.pretrained("ner_jsl","en","clinical/models") \
    .setInputCols("sentence","token","embeddings") \
    .setOutputCol("ner")

ner_converter= medical.NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

# Find female entities using TextMatcher
female_entity_extractor = nlp.TextMatcher() \
    .setInputCols(["document",'token'])\
    .setOutputCol("female_entities")\
    .setEntities("female_entities.txt")\
    .setCaseSensitive(False)\
    .setEntityValue('female_entity')

# Chunkmerger; prioritize female_entity
merger= medical.ChunkMergeApproach()\
    .setInputCols(["female_entities", "ner_chunk"])\
    .setOutputCol("merged_chunks")

pipeline= nlp.Pipeline(
    stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        word_embeddings,
        ner_model,
        ner_converter,
        female_entity_extractor,
        merger
])

empty_df= spark.createDataFrame([[""]]).toDF("text")
model= pipeline.fit(empty_df)


tm_model= nlp.LightPipeline(model)
tm_result= tm_model.fullAnnotate(sample_text)[0]

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_jsl download started this may take some time.
Approximate size to download 14.5 MB
[OK!]


In [58]:
chunk= []
ner = []

for i in tm_result["ner_chunk"]:
  ner.append(i.metadata["entity"])
  chunk.append(i.result)

df_ner= pd.DataFrame({"chunk": chunk, "ner_entity": ner})


merged_chunk= []
merged_entity=[]

for i in tm_result["merged_chunks"]:
  merged_entity.append(i.metadata["entity"])
  merged_chunk.append(i.result)

df_merge= pd.DataFrame({"chunk": merged_chunk, "merged_entity": merged_entity})


df= df_ner.merge(df_merge, on='chunk', how='inner')
df= df[(df.ner_entity=="Gender") | (df.merged_entity=="female_entity")]
df.head(25)

Unnamed: 0,chunk,ner_entity,merged_entity
1,female,Gender,female_entity
21,she,Gender,female_entity
22,she,Gender,female_entity
27,She,Gender,female_entity
28,She,Gender,female_entity
39,She,Gender,female_entity
40,She,Gender,female_entity
45,her,Gender,female_entity
46,her,Gender,female_entity
47,her,Gender,female_entity


As seen in the table above, `Gender` NER entities that refer to females are replaced with `female_entity`. The chunk `senora` is incorrectly identified as `Drug_BrandName` by the NER model, but merging with the TextMatcher output corrects this label to `female_entity`.
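
The comparison table repeats rows such as `she` because it joins the two frames on chunk text: every occurrence of a word in one frame pairs with every occurrence of the same word in the other. A sketch of the difference, using made-up offsets and plain tuples instead of DataFrames, is shown here. In the notebook, the equivalent change would be to also collect `i.begin` and `i.end` for each annotation and merge on those columns.

```python
# Hypothetical (begin, end, text, entity) rows; the offsets are made up
ner_rows = [
    (10, 12, "she", "Gender"),
    (50, 52, "she", "Gender"),
    (80, 85, "senora", "Drug_BrandName"),
]
merged_rows = [
    (10, 12, "she", "female_entity"),
    (50, 52, "she", "female_entity"),
    (80, 85, "senora", "female_entity"),
]

# Join on text: each "she" pairs with both "she" rows on the other side
by_text = [(n[2], n[3], m[3]) for n in ner_rows for m in merged_rows if n[2] == m[2]]

# Join on character offsets: exactly one partner per chunk
merged_by_span = {(b, e): entity for b, e, _, entity in merged_rows}
by_span = [(text, entity, merged_by_span[(b, e)])
           for b, e, text, entity in ner_rows
           if (b, e) in merged_by_span]

print(len(by_text), len(by_span))
print(by_span)
```

Joining on offsets only lines up chunks whose spans are identical in both outputs, so chunks whose boundaries change during merging would need separate handling.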

If your lookup table is large, you can use [BigTextMatcher](https://nlp.johnsnowlabs.com/docs/en/annotators#bigtextmatcher) instead.

### RegexMatcher

Here we will use [RegexMatcher](https://nlp.johnsnowlabs.com/docs/en/annotators#regexmatcher) to build an NER label. First, we create a file that contains one or more lines of regex rules. For more on using RegexMatcher, see [this notebook](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/2.Text_Preprocessing_with_SparkNLP_Annotators_Transformers.ipynb).

In [59]:
rules = '''
\b[A-Z]+(\s+[A-Z]+)*:\b, SECTION_HEADER
'''

with open('regex_rules.txt', 'w') as f:
    f.write(rules)

This regex rule finds `SECTION_HEADER` chunks of the document. There are some pre-trained models that can find `SECTION_HEADER`, but here we will use this method just to demonstrate the use of RegexMatcher.
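
The idea behind the rule, a run of upper-case words followed by a colon, can be tried out with Python's `re` module. This is an approximation rather than a reproduction of what RegexMatcher does on the JVM. In particular, the trailing `\b` of the rule is left out of this sketch, because in Python's `re` a word boundary cannot occur between a colon and a following space. The sample lines are shortened versions of lines from the clinical note used in this section.

```python
import re

# Approximation of the SECTION_HEADER rule (trailing \b omitted, see note above)
SECTION_HEADER = re.compile(r"\b[A-Z]+(?:\s+[A-Z]+)*:")

lines = [
    "POSTOPERATIVE DIAGNOSIS: Cervical lymphadenopathy.",
    "Specimen:  Right cervical lymph node.",
    "EBL: 10 cc.",
    "INDICATIONS FOR PROCEDURE:  This is a 43-year-old female.",
]

# Collect every match on every line
headers = [m.group(0) for line in lines for m in SECTION_HEADER.finditer(line)]
print(headers)
```

The mixed-case `Specimen:` is not picked up because the pattern only accepts upper-case letters before the colon, which is consistent with its absence from the RegexMatcher output later in this section.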

In [60]:
sample_text = """
POSTOPERATIVE DIAGNOSIS: Cervical lymphadenopathy.
PROCEDURE:  Excisional biopsy of right cervical lymph node.
ANESTHESIA:  General endotracheal anesthesia.
Specimen:  Right cervical lymph node.
EBL: 10 cc.
COMPLICATIONS:  None.
FINDINGS: Enlarged level 2 lymph node was identified and removed and sent for pathologic examination.
FLUIDS:  Please see anesthesia report.
URINE OUTPUT:  None recorded during the case.
INDICATIONS FOR PROCEDURE:  This is a 43-year-old female with a several-year history of persistent cervical lymphadenopathy.
She reports that it is painful to palpation on the right and has had multiple CT scans as well as an FNA which were all nondiagnostic.
After risks and benefits of surgery were discussed with the patient, an informed consent was obtained.
She was scheduled for an excisional biopsy of the right cervical lymph node.
PROCEDURE IN DETAIL:  The patient was taken to the operating room and placed in the supine position.
She was anesthetized with general endotracheal anesthesia. The neck was then prepped and draped in the sterile fashion.
Again, noted on palpation there was an enlarged level 2 cervical lymph node.
A 3-cm horizontal incision was made over this lymph node. Dissection was carried down until the sternocleidomastoid muscle was identified.
The enlarged lymph node that measured approximately 2 cm in diameter was identified and was removed and sent to Pathology for touch prep evaluation.
The area was then explored for any other enlarged lymph nodes. None were identified, and hemostasis was achieved with electrocautery.
A quarter-inch Penrose drain was placed in the wound.The wound was then irrigated and
closed with 3-0 interrupted Vicryl sutures for a deep closure followed by a running 4-0 Prolene subcuticular suture.
Mastisol and Steri-Strip were placed over the incision, and sterile bandage was applied.
The patient tolerated this procedure well and was extubated without complications and transported to the recovery room in stable condition.
She will return to the office tomorrow in followup to have the Penrose drain removed.
"""

Below is a typical pipeline with a RegexMatcher added. RegexMatcher output chunks don't have an entity label, so we need to use [ChunkConverter](https://nlp.johnsnowlabs.com/licensed/api/python/reference/autosummary/sparknlp_jsl/annotator/chunker/chunk_converter/index.html?highlight=chunkconverter#sparknlp_jsl.annotator.chunker.chunk_converter.ChunkConverter) to add entity labels to the regex chunks. Finally, the NER and RegexMatcher (through ChunkConverter) outputs are merged by ChunkMergeApproach.

In [61]:
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Sentence Detector annotator, processes various sentences per line
sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

# Tokenizer splits words in a relevant format for NLP
tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Clinical word embeddings
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

# Extracting entities using ner_clinical_large pretrained model
ner_model = medical.NerModel.pretrained("ner_clinical_large","en","clinical/models") \
    .setInputCols("sentence","token","embeddings") \
    .setOutputCol("ner")

ner_converter= medical.NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

# Find all matches of the rules in the regex rule file
regex_matcher = nlp.RegexMatcher()\
    .setInputCols('document')\
    .setStrategy("MATCH_ALL")\
    .setOutputCol("regex_matches")\
    .setExternalRules(path='./regex_rules.txt', delimiter=',')

# Add entity label to regex chunks to be able to merge with previous NER
chunkConverter= medical.ChunkConverter()\
    .setInputCols("regex_matches")\
    .setOutputCol("regex_chunk")

# Chunkmerger, prioritize regex
merger= medical.ChunkMergeApproach()\
    .setInputCols(["regex_chunk", "ner_chunk"])\
    .setOutputCol("merged_chunks")\
    .setMergeOverlapping(True)\
    .setChunkPrecedence("field")

pipeline= nlp.Pipeline(
    stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        word_embeddings,
        ner_model,
        ner_converter,
        regex_matcher,
        chunkConverter,
        merger
])

empty_df= spark.createDataFrame([[""]]).toDF("text")
model= pipeline.fit(empty_df)

rm_model= nlp.LightPipeline(model)
rm_result=rm_model.fullAnnotate(sample_text)[0]


embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_clinical_large download started this may take some time.
Approximate size to download 13.9 MB
[OK!]


In [62]:
rm_result["regex_chunk"]

[Annotation(chunk, 1, 24, POSTOPERATIVE DIAGNOSIS:, {'entity': 'SECTION_HEADER', 'ner_source': 'regex_chunk', 'identifier': 'SECTION_HEADER', 'chunk': '0', 'sentence': '0'}, []),
 Annotation(chunk, 52, 61, PROCEDURE:, {'entity': 'SECTION_HEADER', 'ner_source': 'regex_chunk', 'identifier': 'SECTION_HEADER', 'chunk': '1', 'sentence': '0'}, []),
 Annotation(chunk, 112, 122, ANESTHESIA:, {'entity': 'SECTION_HEADER', 'ner_source': 'regex_chunk', 'identifier': 'SECTION_HEADER', 'chunk': '2', 'sentence': '0'}, []),
 Annotation(chunk, 196, 199, EBL:, {'entity': 'SECTION_HEADER', 'ner_source': 'regex_chunk', 'identifier': 'SECTION_HEADER', 'chunk': '3', 'sentence': '0'}, []),
 Annotation(chunk, 208, 221, COMPLICATIONS:, {'entity': 'SECTION_HEADER', 'ner_source': 'regex_chunk', 'identifier': 'SECTION_HEADER', 'chunk': '4', 'sentence': '0'}, []),
 Annotation(chunk, 230, 238, FINDINGS:, {'entity': 'SECTION_HEADER', 'ner_source': 'regex_chunk', 'identifier': 'SECTION_HEADER', 'chunk': '5', 'sentenc

In [63]:
chunk= []
ner = []
for i in list(rm_result["ner_chunk"]):
  ner.append(i.metadata["entity"])
  chunk.append(i.result)
df_ner = pd.DataFrame({"chunk": chunk,  "ner_entity": ner})

chunk= []
regex = []
for i in list(rm_result["regex_chunk"]):
  regex.append(i.metadata["entity"])
  chunk.append(i.result)
df_regex = pd.DataFrame({"chunk": chunk,  "ner_entity": regex})

chunk= []
merge= []
for i in list(rm_result["merged_chunks"]):
  merge.append(i.metadata["entity"])
  chunk.append(i.result)
df_merge = pd.DataFrame({"chunk": chunk,  "merged_entity": merge})




As seen in the widget below, `SECTION_HEADER` labels are added to the merged NER listing.

In [64]:
from google.colab import widgets

t = widgets.TabBar(["NER model", "RegexMatcher", "Merged NER + RegexMatcher"])

with t.output_to(0):
    display(df_ner)

with t.output_to(1):
    display(df_regex)

with t.output_to(2):
    display(df_merge)


Unnamed: 0,chunk,ner_entity
0,Cervical lymphadenopathy,PROBLEM
1,Excisional biopsy of right cervical lymph node,TEST
2,General endotracheal anesthesia,TREATMENT
3,Right cervical lymph node,PROBLEM
4,EBL,TEST
5,Enlarged level 2 lymph node,PROBLEM
6,pathologic examination,TEST
7,persistent cervical lymphadenopathy,PROBLEM
8,painful,PROBLEM
9,palpation on the right,TEST



Unnamed: 0,chunk,ner_entity
0,POSTOPERATIVE DIAGNOSIS:,SECTION_HEADER
1,PROCEDURE:,SECTION_HEADER
2,ANESTHESIA:,SECTION_HEADER
3,EBL:,SECTION_HEADER
4,COMPLICATIONS:,SECTION_HEADER
5,FINDINGS:,SECTION_HEADER
6,FLUIDS:,SECTION_HEADER
7,URINE OUTPUT:,SECTION_HEADER
8,INDICATIONS FOR PROCEDURE:,SECTION_HEADER
9,PROCEDURE IN DETAIL:,SECTION_HEADER



Unnamed: 0,chunk,merged_entity
0,POSTOPERATIVE DIAGNOSIS:,SECTION_HEADER
1,Cervical lymphadenopathy,PROBLEM
2,PROCEDURE:,SECTION_HEADER
3,Excisional biopsy of right cervical lymph node,TEST
4,ANESTHESIA:,SECTION_HEADER
5,General endotracheal anesthesia,TREATMENT
6,Right cervical lymph node,PROBLEM
7,EBL:,SECTION_HEADER
8,COMPLICATIONS:,SECTION_HEADER
9,FINDINGS:,SECTION_HEADER

