![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.2.Contextual_Parser_Rule_Based_NER.ipynb)

# 1.2 ContextualParser (Rule Based NER)

In [None]:
import os

jsl_secret = os.getenv('SECRET')

import sparknlp
sparknlp_version = sparknlp.version()
import sparknlp_jsl
jsl_version = sparknlp_jsl.version()

print (jsl_secret)

In [3]:
import json
import os
from pyspark.ml import Pipeline,PipelineModel
from pyspark.sql import SparkSession

from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from sparknlp.base import *
import sparknlp_jsl
import sparknlp

params = {"spark.driver.memory":"16G",
"spark.kryoserializer.buffer.max":"2000M",
"spark.driver.maxResultSize":"2000M"}

spark = sparknlp_jsl.start(jsl_secret, params=params)

print ("Spark NLP Version :", sparknlp.version())
print ("Spark NLP_JSL Version :", sparknlp_jsl.version())

Spark NLP Version : 3.1.2
Spark NLP_JSL Version : 3.1.2


In [4]:
spark

## How it works

This annotator is a kind of `RegexMatcher` driven by a JSON file, whose path you set with `setJsonPath()`.

In this JSON file, you define the regex that you want to match along with the information that will be written to the metadata field of each match.

For example, the rule below defines the name of the entity that categorizes the matches, the regex itself, and the `matchScope`, which tells the annotator whether to return the full token or only the part of the token that the regex matched.

```
{
  "entity": "Stage",
  "ruleScope": "sentence",
  "regex": "[cpyrau]?[T][0-9X?][a-z^cpyrau]*",
  "matchScope": "token"
}
```


Ignore the `ruleScope` for the moment; it is always set at the `sentence` level, which means matches are searched for within each sentence. For example, for this text:
```
A patient has liver metastases pT1bN0M0 and the T5 primary site may be colon or lung. If the primary site is not clearly identified , this case is cT4bcN2M1, Stage Grouping 88. N4 A child T?N3M1  has soft tissue aM3 sarcoma and the staging has been left unstaged. Both clinical and pathologic staging would be coded pT1bN0M0 as unstageable cT3cN2.Medications started.
```

The expected result will be:
```
val expectedResult = Array("pT1bN0M0", "T5", "cT4bcN2M1", "T?N3M1", "pT1bN0M0", "cT3cN2.Medications")
val expectedMetadata =
Array(Map("field" -> "Stage", "normalized" -> "", "confidenceValue" -> "0.13", "hits" -> "regex", "sentence" -> "0"),
	  Map("field" -> "Stage", "normalized" -> "", "confidenceValue" -> "0.13", "hits" -> "regex", "sentence" -> "0"),
	  Map("field" -> "Stage", "normalized" -> "", "confidenceValue" -> "0.13", "hits" -> "regex", "sentence" -> "1"),
	  Map("field" -> "Stage", "normalized" -> "", "confidenceValue" -> "0.13", "hits" -> "regex", "sentence" -> "2"),
	  Map("field" -> "Stage", "normalized" -> "", "confidenceValue" -> "0.13", "hits" -> "regex", "sentence" -> "3"),
	  Map("field" -> "Stage", "normalized" -> "", "confidenceValue" -> "0.13", "hits" -> "regex", "sentence" -> "3")
	 )
```

Whereas with `matchScope` set at the sub-token level, the output will be:

```
val expectedResult = Array("pT1b", "T5", "cT4bc", "T?", "pT1b", "cT3c")
val expectedMetadata =
Array(Map("field" -> "Stage", "normalized" -> "", "confidenceValue" -> "0.13", "hits" -> "regex", "sentence" -> "0"),
Map("field" -> "Stage", "normalized" -> "", "confidenceValue" -> "0.13", "hits" -> "regex", "sentence" -> "0"),
Map("field" -> "Stage", "normalized" -> "", "confidenceValue" -> "0.13", "hits" -> "regex", "sentence" -> "1"),
Map("field" -> "Stage", "normalized" -> "", "confidenceValue" -> "0.13", "hits" -> "regex", "sentence" -> "2"),
Map("field" -> "Stage", "normalized" -> "", "confidenceValue" -> "0.13", "hits" -> "regex", "sentence" -> "3"),
Map("field" -> "Stage", "normalized" -> "", "confidenceValue" -> "0.13", "hits" -> "regex", "sentence" -> "3")
)
```
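
The difference between the two scopes can be reproduced outside Spark with plain Python. This is only a rough sketch: `emulate_match_scope` is a hypothetical helper, whitespace splitting stands in for the real `Tokenizer`, and it makes no claim about how the annotator is implemented internally.

```python
import re

# Regex copied from the "Stage" rule above.
STAGE_REGEX = re.compile(r"[cpyrau]?[T][0-9X?][a-z^cpyrau]*")

TEXT = (
    "A patient has liver metastases pT1bN0M0 and the T5 primary site may be "
    "colon or lung. If the primary site is not clearly identified , this case "
    "is cT4bcN2M1, Stage Grouping 88. N4 A child T?N3M1  has soft tissue aM3 "
    "sarcoma and the staging has been left unstaged. Both clinical and "
    "pathologic staging would be coded pT1bN0M0 as unstageable "
    "cT3cN2.Medications started."
)


def emulate_match_scope(text, pattern, match_scope):
    """Hypothetical helper: return whole tokens or only the matched part."""
    results = []
    for raw in text.split():
        token = raw.strip(",.")  # crude stand-in for punctuation splitting
        found = pattern.search(token)
        if found is None:
            continue
        # "token" keeps the full token; anything else keeps just the regex match.
        results.append(token if match_scope == "token" else found.group(0))
    return results


token_level = emulate_match_scope(TEXT, STAGE_REGEX, "token")
sub_token_level = emulate_match_scope(TEXT, STAGE_REGEX, "sub-token")
print(token_level)
print(sub_token_level)
```

The two printed lists correspond to the two `expectedResult` arrays shown above.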

The confidence value (`confidenceValue` in the metadata) is another feature. It is computed with a heuristic based on how many matches the rule has.

To clarify what counts as a match, here is an example of a JSON file with additional fields that further define the match we want to get:

```
{
  "entity": "Gender",
  "ruleScope": "sentence",
  "matchScope": "token",
  "prefix": ["birth", "growing", "assessment"],
  "suffix": ["faster", "velocities"],
  "contextLength": 50,
  "context": ["typical", "grows"]
}
```


For example, `prefix` and `suffix` refer to the words that are required to be near the word we want to match.

These two work together with `contextLength`, which sets the maximum distance that prefix or suffix words can be from the word to match, whereas `context` lists words that must appear immediately before or after the word to match.
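
To make the role of `contextLength` concrete, here is a minimal sketch of a character-window check. The functions `prefix_within` and `suffix_within` are hypothetical, and measuring the distance in characters is an assumption made for illustration; the annotator's actual algorithm may differ.

```python
def prefix_within(text, start, words, context_length):
    """True if any word occurs in the `context_length` chars before `start`."""
    window = text[max(0, start - context_length):start].lower()
    return any(word.lower() in window for word in words)


def suffix_within(text, end, words, context_length):
    """True if any word occurs in the `context_length` chars after `end`."""
    window = text[end:end + context_length].lower()
    return any(word.lower() in window for word in words)


sentence = ("At birth, the typical boy is growing slightly faster "
            "than the typical girl.")
start = sentence.index("boy")
end = start + len("boy")

# Values taken from the "Gender" rule above.
prefix_hit = prefix_within(sentence, start, ["birth", "growing", "assessment"], 50)
suffix_hit = suffix_within(sentence, end, ["faster", "velocities"], 50)

# With a much smaller window the same words are out of reach.
prefix_hit_short = prefix_within(sentence, start, ["birth"], 5)
suffix_hit_short = suffix_within(sentence, end, ["faster"], 10)
print(prefix_hit, suffix_hit, prefix_hit_short, suffix_hit_short)
```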

Another feature is the `dictionary` parameter. With it, you define the set of words that you want to match and the word that will replace each match.

For example, with the definition below, you are telling `ContextualParser` that when the words `woman`, `female`, and `girl` are matched they will be replaced by `female`, whereas when `man`, `male`, `boy`, and `gentleman` are matched they will be replaced by `male`. The replacement is reported in the `normalized` metadata field.

```
female	woman	female	girl
male	man	male	boy	gentleman
```

So, for example for this text:

```
At birth, the typical boy is growing slightly faster than the typical girl, but the velocities become equal at about seven months, and then the girl grows faster until four years. From then until adolescence no differences in velocity can be detected.
```

The expected output of the annotator will be:

```
val expectedResult = Array("boy", "girl", "girl")
val expectedMetadata =
Array(Map("field" -> "Gender", "normalized" -> "male", "confidenceValue" -> "0.13", "hits" -> "regex", "sentence" -> "0"),
Map("field" -> "Gender", "normalized" -> "female", "confidenceValue" -> "0.13", "hits" -> "regex", "sentence" -> "0"),
Map("field" -> "Gender", "normalized" -> "female", "confidenceValue" -> "0.13", "hits" -> "regex", "sentence" -> "0"))
```

For the `dictionary`, you just need to define a CSV or TSV file in which the first element of each row is the normalized word and the remaining elements are the values to match. You can define several normalized words and their values to match by adding more rows. You set the path to the file with `setDictionary`.

The `dictionary` parameter is of type `ExternalResource`. By default the delimiter is `"\t"`; you can set another delimiter to suit your dictionary file format.
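
The row layout described above can be pictured as a lookup from each value to match to its normalized word. The sketch below builds that lookup with the standard `csv` module; `load_normalization_dictionary` is a hypothetical helper for illustration and is not part of Spark NLP.

```python
import csv
import io


def load_normalization_dictionary(lines, delimiter="\t"):
    """Map each value to match to the normalized word in the first column."""
    lookup = {}
    for row in csv.reader(lines, delimiter=delimiter):
        if not row:
            continue  # skip blank lines
        normalized, *surface_forms = row
        for form in surface_forms:
            lookup[form.strip().lower()] = normalized.strip()
    return lookup


# Same rows as the tab-separated example above.
tsv_rows = io.StringIO(
    "female\twoman\tfemale\tgirl\n"
    "male\tman\tmale\tboy\tgentleman\n"
)
lookup = load_normalization_dictionary(tsv_rows)

# A comma-separated file only needs a different delimiter.
csv_lookup = load_normalization_dictionary(io.StringIO("male,man,boy\n"), delimiter=",")
print(lookup["girl"], lookup["gentleman"], csv_lookup["boy"])
```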


In [5]:
sample_text = """A 28 year old female with a history of gestational diabetes mellitus diagnosed eight years prior to 
presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis 
three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index 
( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting.
Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection . 
She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG . 
She had been on dapagliflozin for six months at the time of presentation . Physical examination on presentation was 
significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness , guarding , 
or rigidity . Pertinent laboratory findings on admission were : serum glucose 111 mg/dl , bicarbonate 18 mmol/l , 
anion gap 20 , creatinine 0.4 mg/dL , triglycerides 508 mg/dL , total cholesterol 122 mg/dL , glycated hemoglobin 
( HbA1c ) 10% , and venous pH 7.27 . Serum lipase was normal at 43 U/L . Serum acetone levels could not be assessed 
as blood samples kept hemolyzing due to significant lipemia .
The patient was initially admitted for starvation ketosis , as she reported poor oral intake for three days prior 
to admission . However , serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL , 
the anion gap was still elevated at 21 , serum bicarbonate was 16 mmol/L , triglyceride level peaked at 2050 mg/dL , 
and lipase was 52 U/L .
 β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - the original sample was centrifuged 
 and the chylomicron layer removed prior to analysis due to interference from turbidity caused by lipemia again . 
 The patient was treated with an insulin drip for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides 
 to 1400 mg/dL , within 24 hours .
 Twenty days ago.
 Her euDKA was thought to be precipitated by her respiratory tract infection in the setting of SGLT2 inhibitor use . 
 At birth the typical boy is growing slightly faster than the typical girl, but the velocities become equal at about 
 seven months, and then the girl grows faster until four years. 
 From then until adolescence no differences in velocity 
 can be detected. 21-02-2020 
21/04/2020
"""
data = spark.createDataFrame([[sample_text]]).toDF("text").cache()

data.show(truncate = 100)

+----------------------------------------------------------------------------------------------------+
|                                                                                                text|
+----------------------------------------------------------------------------------------------------+
|A 28 year old female with a history of gestational diabetes mellitus diagnosed eight years prior ...|
+----------------------------------------------------------------------------------------------------+



## Rules

In [6]:
!mkdir -p data

In [7]:
gender = '''male,man,male,boy,gentleman,he,him
female,woman,female,girl,lady,old-lady,she,her
neutral,neutral'''

with open('data/gender.csv', 'w') as f:
    f.write(gender)


gender = {
  "entity": "Gender",
  "ruleScope": "sentence", 
  "completeMatchRegex": "true"
}

import json

with open('data/gender.json', 'w') as f:
    json.dump(gender, f)


date = {
  "entity": "Date",
  "ruleScope": "sentence",
  "regex": "\\d{1,2}[\\/\\-\\:]{1}(\\d{1,2}[\\/\\-\\:]{1}){0,1}\\d{2,4}",
  "valuesDefinition":[],
  "prefix": [],
  "suffix": [],
  "contextLength": 150,
  "context": []
}

with open('data/date.json', 'w') as f:
    json.dump(date, f)


age = {
  "entity": "Age",
  "ruleScope": "sentence",
  "matchScope":"token",
  "regex" : "^[1][0-9][0-9]|[1-9][0-9]|[1-9]$",
  "prefix":["age of", "age"],
  "suffix": ["-years-old",
             "years-old",
             "-year-old",
             "-months-old",
             "-month-old",
             "-months-old",
             "-day-old",
             "-days-old",
             "month old",
             "days old",
             "year old",
             "years old", 
             "years",
             "year", 
             "months", 
             "old"
              ],
  "contextLength": 25,
  "context": [],
  "contextException": ["ago"],
  "exceptionDistance": 10
}

with open('data/age.json', 'w') as f:
    json.dump(age, f)
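
Before running the pipeline, it can help to try the rule regexes with Python's `re` module. This is only a sanity check under an assumption: Spark NLP runs on the JVM, whose regex engine agrees with `re` for simple constructs like these, but how the annotator applies a pattern to each token is not shown here. `first_match` and `AGE_REGEX_GROUPED` are hypothetical names introduced for the illustration.

```python
import re

# Patterns as they end up in date.json and age.json.
DATE_REGEX = r"\d{1,2}[\/\-\:]{1}(\d{1,2}[\/\-\:]{1}){0,1}\d{2,4}"
AGE_REGEX = r"^[1][0-9][0-9]|[1-9][0-9]|[1-9]$"
# Hypothetical variant: the group makes ^ and $ apply to every alternative.
AGE_REGEX_GROUPED = r"^([1][0-9][0-9]|[1-9][0-9]|[1-9])$"


def first_match(pattern, text):
    """Return the first substring matched by `pattern`, or None."""
    found = re.search(pattern, text)
    return found.group(0) if found else None


print(first_match(DATE_REGEX, "can be detected. 21-02-2020"))
print(first_match(DATE_REGEX, "21/04/2020"))
print(first_match(DATE_REGEX, "a five-day course"))

# In AGE_REGEX, ^ binds only to the first alternative and $ only to the last,
# so the middle alternative can match inside a longer number.
print(first_match(AGE_REGEX, "28"))
print(first_match(AGE_REGEX, "2050"))
print(first_match(AGE_REGEX_GROUPED, "2050"))
```

Whether the unanchored middle alternative matters in practice depends on how the parser applies the pattern to tokens, so treat the grouped variant as something to test rather than a required change.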


## Pipeline definition

Each rule file in the `data` folder gets its own `ContextualParserApproach` stage in the pipeline. Each stage writes to its own output column, so the resulting annotations need to be consolidated afterwards.

In [8]:
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")


In [9]:
!cd data && ls -lt

total 16
-rw-r--r-- 1 root root 434 Jul 25 16:12 age.json
-rw-r--r-- 1 root root 205 Jul 25 16:12 date.json
-rw-r--r-- 1 root root  97 Jul 25 16:12 gender.csv
-rw-r--r-- 1 root root  75 Jul 25 16:12 gender.json


In [10]:
gender_contextual_parser = ContextualParserApproach() \
        .setInputCols(["sentence", "token"]) \
        .setOutputCol("entity_gender") \
        .setJsonPath("data/gender.json") \
        .setCaseSensitive(False) \
        .setContextMatch(False)\
        .setDictionary('data/gender.csv', read_as=ReadAs.TEXT, options={"delimiter":","})\
        .setPrefixAndSuffixMatch(False)        

In [11]:
age_contextual_parser = ContextualParserApproach() \
        .setInputCols(["sentence", "token"]) \
        .setOutputCol("entity_age") \
        .setJsonPath("data/age.json") \
        .setCaseSensitive(False) \
        .setContextMatch(False)\
        .setPrefixAndSuffixMatch(False)

In [12]:
date_contextual_parser = ContextualParserApproach() \
        .setInputCols(["sentence", "token"]) \
        .setOutputCol("entity_date") \
        .setJsonPath("data/date.json") \
        .setCaseSensitive(False) \
        .setContextMatch(False)\
        .setPrefixAndSuffixMatch(False)

In [13]:
parserPipeline = Pipeline(stages=[
        document_assembler, 
        sentence_detector,
        tokenizer,
        gender_contextual_parser,
        age_contextual_parser,
        date_contextual_parser
        ])

empty_data = spark.createDataFrame([[""]]).toDF("text")

parserModel = parserPipeline.fit(empty_data)

light_model = LightPipeline(parserModel)


In [14]:
annotations = light_model.fullAnnotate(sample_text)[0]
annotations.keys()

dict_keys(['document', 'entity_gender', 'token', 'entity_date', 'entity_age', 'sentence'])

In [15]:
print (annotations['entity_gender'])
print (annotations['entity_age'])
print (annotations['entity_date'])

[Annotation(chunk, 14, 19, female, {'field': 'Gender', 'normalized': 'female', 'confidenceValue': '0.13', 'hits': 'regex', 'sentence': '0'}), Annotation(chunk, 471, 473, she, {'field': 'Gender', 'normalized': 'female', 'confidenceValue': '0.13', 'hits': 'regex', 'sentence': '1'}), Annotation(chunk, 562, 564, she, {'field': 'Gender', 'normalized': 'female', 'confidenceValue': '0.13', 'hits': 'regex', 'sentence': '2'}), Annotation(chunk, 668, 670, she, {'field': 'Gender', 'normalized': 'female', 'confidenceValue': '0.13', 'hits': 'regex', 'sentence': '3'}), Annotation(chunk, 835, 837, her, {'field': 'Gender', 'normalized': 'female', 'confidenceValue': '0.13', 'hits': 'regex', 'sentence': '5'}), Annotation(chunk, 1377, 1379, she, {'field': 'Gender', 'normalized': 'female', 'confidenceValue': '0.13', 'hits': 'regex', 'sentence': '9'}), Annotation(chunk, 1517, 1519, her, {'field': 'Gender', 'normalized': 'female', 'confidenceValue': '0.13', 'hits': 'regex', 'sentence': '10'}), Annotation(ch

In [16]:
import random

def get_color():
    r = lambda: random.randint(100,255)
    return '#%02X%02X%02X' % (r(),r(),r())

In [17]:
ner_chunks = []
label_color = {}
unified_entities = {'entity':[]}
for ent_name in annotations.keys():
    if "entity" in ent_name and len(annotations[ent_name])>0:
        ner_chunks.append(ent_name)
        label = annotations[ent_name][0].metadata['field']
        label_color[label] = get_color()
        unified_entities['entity'].extend(annotations[ent_name])

In [18]:
unified_entities['entity'].sort(key=lambda x: x.begin, reverse=False)

In [19]:
unified_entities['entity']

[Annotation(chunk, 2, 3, 28, {'field': 'Age', 'normalized': '', 'confidenceValue': '0.47', 'hits': 'suffix,prefix,regex', 'sentence': '0'}),
 Annotation(chunk, 14, 19, female, {'field': 'Gender', 'normalized': 'female', 'confidenceValue': '0.13', 'hits': 'regex', 'sentence': '0'}),
 Annotation(chunk, 471, 473, she, {'field': 'Gender', 'normalized': 'female', 'confidenceValue': '0.13', 'hits': 'regex', 'sentence': '1'}),
 Annotation(chunk, 562, 564, she, {'field': 'Gender', 'normalized': 'female', 'confidenceValue': '0.13', 'hits': 'regex', 'sentence': '2'}),
 Annotation(chunk, 668, 670, she, {'field': 'Gender', 'normalized': 'female', 'confidenceValue': '0.13', 'hits': 'regex', 'sentence': '3'}),
 Annotation(chunk, 835, 837, her, {'field': 'Gender', 'normalized': 'female', 'confidenceValue': '0.13', 'hits': 'regex', 'sentence': '5'}),
 Annotation(chunk, 1377, 1379, she, {'field': 'Gender', 'normalized': 'female', 'confidenceValue': '0.13', 'hits': 'regex', 'sentence': '9'}),
 Annotatio
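
The raw annotation list is hard to scan, so a flat table of rows can be easier to read. The sketch below uses a hypothetical `Chunk` namedtuple as a stand-in that carries only the attributes used in this notebook (`begin`, `end`, `result`, `metadata`). With the real objects, you would pass `unified_entities['entity']` instead of the sample list.

```python
from collections import namedtuple

# Hypothetical stand-in for the annotation objects returned by fullAnnotate.
Chunk = namedtuple("Chunk", ["begin", "end", "result", "metadata"])


def to_rows(chunks):
    """Flatten chunks into (begin, end, text, field, normalized) tuples."""
    return [
        (c.begin, c.end, c.result,
         c.metadata.get("field", ""), c.metadata.get("normalized", ""))
        for c in sorted(chunks, key=lambda c: c.begin)
    ]


# Two entries copied from the output above.
sample_chunks = [
    Chunk(14, 19, "female", {"field": "Gender", "normalized": "female"}),
    Chunk(2, 3, "28", {"field": "Age", "normalized": ""}),
]

rows = to_rows(sample_chunks)
for begin, end, text, field, normalized in rows:
    print(f"{begin:>5} {end:>5}  {text:<10} {field:<8} {normalized}")
```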

## Highlighting the entities with HTML

In [20]:
import html

# Open the wrapper <div> that is closed after the loop.
html_output = '<div>'
pos = 0

for n in unified_entities['entity']:
    # Emit the plain text between the previous entity and this one.
    if pos < n.begin and pos < len(sample_text):
        white_text = html.escape(sample_text[pos:n.begin])
        html_output += '<span class="others" style="background-color: white">{}</span>'.format(white_text)
    pos = n.end+1
    # Emit the entity itself, colored by its label.
    html_output += '<span class="entity-wrapper" style="background-color: {}"><span class="entity-name">{} </span><span class="entity-type">[{}]</span></span>'.format(
        label_color[n.metadata['field']],
        html.escape(n.result),
        n.metadata['field'])

# Emit any text remaining after the last entity.
if pos < len(sample_text):
    html_output += '<span class="others" style="background-color: white">{}</span>'.format(html.escape(sample_text[pos:]))

html_output += """</div>"""

In [21]:
from IPython.display import HTML

HTML(html_output)