![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/01.5.Contextual_Parser_Rule_Based_NER.ipynb)

# 01.5 ContextualParser (Rule Based NER)

## Colab Setup

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs==5.1.7

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp, medical, visual
nlp.settings.enforce_versions=False
# After uploading your license, run this to install all licensed Python wheels and pre-download the JARs for the Spark session JVM
nlp.install()

In [5]:
from johnsnowlabs import nlp, medical, visual
import pandas as pd

# Automatically load license data and start a session with all jars user has access to
spark = nlp.start()
spark

👌 Detected license file /content/5.1.2.spark_nlp_for_healthcare.json
👌 Launched cpu optimized session with with: 🚀Spark-NLP==5.1.2, 💊Spark-Healthcare==5.1.3, running on ⚡ PySpark==3.1.2


In [11]:
from pyspark.sql import DataFrame
import pyspark.sql.functions as F
import pyspark.sql.types as T
import pyspark.sql as SQL
from pyspark import keyword_only


# How the ContextualParser Works

Spark NLP's `ContextualParser` is a licensed annotator that extracts entities from a document based on pattern matching. It provides more functionality than its open-source counterpart `EntityRuler` by allowing users to customize specific characteristics for pattern matching. You can find entities using regex rules for full and partial matches, a dictionary with normalization options, and context parameters that take into account things such as token distances.

There are 3 components necessary to understand when using the `ContextualParser` annotator:

1. `ContextualParser` annotator's parameters
2. JSON configuration file
3. Dictionary

## 1. ContextualParser Annotator Parameters

Here are the main parameters available to use with the `ContextualParserApproach`:

```
contextualParser = ContextualParserApproach() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("entity") \
    .setCaseSensitive(True) \
    .setJsonPath("context_config.json") \
    .setPrefixAndSuffixMatch(True) \
    .setCompleteContextMatch(True) \
    .setDictionary("dictionary.tsv", options={"orientation":"vertical"})
```


We will dive deeper into the details of each parameter, but here's a quick overview:

- `setCaseSensitive`: whether matching is case sensitive (applies to all JSON properties apart from the regex property)
- `setJsonPath`: the path to your JSON configuration file
- `setPrefixAndSuffixMatch`: whether a match requires both the prefix AND the suffix properties from the JSON configuration file
- `setCompleteContextMatch`: whether the prefix and suffix must match exactly
- `setDictionary`: the path to your dictionary, used for normalizing entities

Let's start by looking at the JSON configuration file.

## 2. JSON Configuration File

Here is a fully utilized JSON configuration file.

```
{
  "entity": "Gender",
  "ruleScope": "sentence",
  "regex": "girl|boy",
  "completeMatchRegex": "true",
  "matchScope": "token",
  "prefix": ["birth", "growing", "assessment"],
  "suffix": ["faster", "velocities"],
  "contextLength": 100,
  "contextException": ["slightly"],
  "exceptionDistance": 40
}
```

### 2.1. Basic Properties

There are 5 basic properties you can set in your JSON configuration file:

- `entity`
- `ruleScope`
- `regex`
- `completeMatchRegex`
- `matchScope`

Let's first look at the 3 most essential properties to set:

```
{
  "entity": "Digit",
  "ruleScope": "sentence",
  "regex": "\\d+" # Note here: backslashes are escape characters in JSON, so for regex pattern "\d+" we need to write it out as "\\d+"
}
```

Here, we're looking for tokens in our text that match the regex: "`\d+`" and assign the "`Digit`" entity to those tokens. When `ruleScope` is set to "`sentence`", we're looking for a match on each *token* of a **sentence**. You can change it to "`document`" to look for a match on each *sentence* of a **document**. The latter is particularly useful when working with multi-word matches, but we'll explore this at a later stage.
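
To avoid escaping backslashes by hand, one option is to build the rule as a Python dictionary and let `json.dump` write the file, which is also how the pipeline examples later in this notebook create their configuration files. The following is a minimal sketch; the file name `digit.json` is a placeholder chosen for illustration.

```python
import json

# Rule from the example above; the raw string holds the regex exactly as written: \d+
digit_rule = {
    "entity": "Digit",
    "ruleScope": "sentence",
    "regex": r"\d+",
}

# json.dump escapes the backslash, so the file on disk contains "\\d+".
with open("digit.json", "w") as f:  # hypothetical file name
    json.dump(digit_rule, f)

with open("digit.json") as f:
    raw_text = f.read()

print(raw_text)                       # shows the escaped form stored in the file
print(json.loads(raw_text)["regex"])  # reading it back yields \d+ again
```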

The next properties to look at are `completeMatchRegex` and `matchScope`. To understand their use case, let's take a look at an example where we're trying to match all digits in our text.

Let's say we come across the following string: ***XYZ987***

Depending on how we set the `completeMatchRegex` and `matchScope` properties, we'll get the following results:

```
{
  "entity": "Digit",
  "ruleScope": "sentence",
  "regex": "\\d+",
  "completeMatchRegex": "false",
  "matchScope": "token"
}
```

`OUTPUT: [XYZ987]`

```
{
  "entity": "Digit",
  "ruleScope": "sentence",
  "regex": "\\d+",  
  "completeMatchRegex": "false",
  "matchScope": "sub-token"
}
```

`OUTPUT: [987]`


```
{
  "entity": "Digit",
  "ruleScope": "sentence",
  "regex": "\\d+",
  "completeMatchRegex": "true"
  # matchScope is ignored here
}
```

`OUTPUT: []`

`"completeMatchRegex": "true"` returns an output only if our string is modified so that a token matches the regex completely and exactly, for example: **XYZ 987**

```
{
  "entity": "Digit",
  "ruleScope": "sentence",
  "regex": "\\d+",  
  "completeMatchRegex": "true",
  "matchScope": "token" # Note here: sub-token would return the same output
}
```

`OUTPUT: [987]`
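
The four cases above can be reproduced with a small emulation in plain Python. This is only a sketch of the behaviour described in this section, assuming whitespace tokenization; the function `match_digits` is a hypothetical helper and not part of the library.

```python
import re

def match_digits(text, complete_match, match_scope="token", regex=r"\d+"):
    """Toy emulation of completeMatchRegex / matchScope on whitespace tokens."""
    pattern = re.compile(regex)
    results = []
    for token in text.split():
        if complete_match:
            # The whole token must match the regex; matchScope no longer matters.
            if pattern.fullmatch(token):
                results.append(token)
            continue
        found = pattern.search(token)
        if found:
            # "token" returns the whole token, "sub-token" only the matching part.
            results.append(token if match_scope == "token" else found.group())
    return results

print(match_digits("XYZ987", complete_match=False, match_scope="token"))      # ['XYZ987']
print(match_digits("XYZ987", complete_match=False, match_scope="sub-token"))  # ['987']
print(match_digits("XYZ987", complete_match=True))                            # []
print(match_digits("XYZ 987", complete_match=True))                           # ['987']
```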

### 2.2. Context Awareness Properties

There are 5 properties related to context awareness:

- `contextLength`
- `prefix`
- `suffix`
- `contextException`
- `exceptionDistance`



Let's look at a similar example. Say we have the following text: ***At birth, the typical boy is growing slightly faster than the typical girl, but growth rates become equal at about seven months.***

If we want to match the gender that grows faster at birth, we can start by defining our regex: "`girl|boy`"

Next, we add a prefix ("`birth`") and suffix ("`faster`") to ask the parser to match the regex only if the word "`birth`" comes before and only if the word "`faster`" comes after. Finally, we will need to set the `contextLength` - this is the maximum number of tokens after the prefix and before the suffix that will be searched to find a regex match.

Here's what the JSON configuration file would look like:

```
{
  "entity": "Gender",
  "ruleScope": "sentence",
  "regex": "girl|boy",
  "contextLength": 50,
  "prefix": ["birth"],
  "suffix": ["faster"]
}
```

`OUTPUT: [boy]`

If you remember, the annotator has a `setPrefixAndSuffixMatch()` parameter. If you set it to `True`, the previous output would remain as is. However, if you had set it to `False` and used the following JSON configuration:

```
{
  "entity": "Gender",
  "ruleScope": "sentence",
  "regex": "girl|boy",
  "contextLength": 50,
  "prefix": ["birth"],
  "suffix": ["faster", "rates"]
}
```

`OUTPUT: [boy, girl]`

The parser now takes into account either the prefix OR the suffix: only one of the conditions has to be fulfilled for a match to count.

The annotator also has a `setCompleteContextMatch()` parameter. If you set it to `True` and use the following JSON configuration:

```
{
  "entity": "Gender",
  "ruleScope": "sentence",
  "regex": "girl|boy",
  "contextLength": 50,
  "prefix": ["birth"],
  "suffix": ["fast"]
}
```

`OUTPUT: []`

However, if we set `setCompleteContextMatch()` to `False` and use the same JSON configuration as above, we get the following output:

`OUTPUT: [boy]`
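
The interaction between prefix, suffix, `setPrefixAndSuffixMatch` and `setCompleteContextMatch` can be sketched in plain Python. This is a toy model and not the annotator's implementation: it assumes the context window is counted in tokens (as described above), treats a non-complete context match as a substring match, and uses hypothetical names (`context_matches`, `both`, `complete`).

```python
import re

TEXT = ("At birth, the typical boy is growing slightly faster than the "
        "typical girl, but growth rates become equal at about seven months.")

def tokenize(text):
    # Words and punctuation marks become separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)

def found(words, context, complete):
    # complete=True needs an identical token; otherwise a partial match is enough.
    if complete:
        return any(word in context for word in words)
    return any(word in token for word in words for token in context)

def context_matches(text, regex, prefix, suffix, context_length,
                    both=True, complete=True):
    tokens = tokenize(text)
    hits = []
    for i, token in enumerate(tokens):
        if not re.fullmatch(regex, token):
            continue
        left = tokens[max(0, i - context_length):i]
        right = tokens[i + 1:i + 1 + context_length]
        has_prefix = found(prefix, left, complete)
        has_suffix = found(suffix, right, complete)
        accepted = (has_prefix and has_suffix) if both else (has_prefix or has_suffix)
        if accepted:
            hits.append(token)
    return hits

print(context_matches(TEXT, "girl|boy", ["birth"], ["faster"], 50))                       # ['boy']
print(context_matches(TEXT, "girl|boy", ["birth"], ["faster", "rates"], 50, both=False))  # ['boy', 'girl']
print(context_matches(TEXT, "girl|boy", ["birth"], ["fast"], 50, complete=True))          # []
print(context_matches(TEXT, "girl|boy", ["birth"], ["fast"], 50, complete=False))         # ['boy']
```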

Here's the sentence again: ***At birth, the typical boy is growing slightly faster than the typical girl, but growth rates become equal at about seven months.***

The last 2 properties related to context awareness are `contextException` and `exceptionDistance`. This rules out matches based on a given exception:

```
{
  "entity": "Gender",
  "ruleScope": "sentence",
  "regex": "girl|boy",
  "contextLength": 50,
  "prefix": ["birth"],
  "suffix": ["faster", "rates"],
  "contextException": ["At"],
  "exceptionDistance": 5
}
```

`OUTPUT: [girl]`

Here we've asked the parser to ignore a match if the token "`At`" is within 5 tokens of the matched regex. This caused the token "`boy`" to be ignored.
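
The exception rule can be illustrated in isolation with a few lines of plain Python. This sketch assumes the distance is counted in tokens, as described above, and leaves out the prefix and suffix checks for brevity; `drop_exceptions` is a hypothetical helper. The comparison is case sensitive, so the lowercase "at" near the end of the sentence does not count as an exception.

```python
import re

TEXT = ("At birth, the typical boy is growing slightly faster than the "
        "typical girl, but growth rates become equal at about seven months.")

def tokenize(text):
    # Words and punctuation marks become separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)

def drop_exceptions(text, regex, exceptions, distance):
    """Keep regex matches that have no exception word within `distance` tokens."""
    tokens = tokenize(text)
    kept = []
    for i, token in enumerate(tokens):
        if not re.fullmatch(regex, token):
            continue
        nearby = tokens[max(0, i - distance):i + distance + 1]
        if any(word in nearby for word in exceptions):
            continue  # an exception word is too close, so the match is discarded
        kept.append(token)
    return kept

print(drop_exceptions(TEXT, "girl|boy", ["At"], 5))  # ['girl']
print(drop_exceptions(TEXT, "girl|boy", [], 5))      # ['boy', 'girl']
```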

If the annotator's `setOptionalContextRules` parameter is set to `True`, regex matches are returned regardless of whether the context (prefix and suffix configuration) matches. Example 2 below shows where this parameter is set.

When the `shortestContextMatch` parameter is set to `True`, the parser stops searching for matches as soon as one of the prefix or suffix entries is found in the text.


Confidence Value Scenarios:
* When there is regex match only, the confidence value will be 0.5.
* When there are regex and prefix matches together, the confidence value will be > 0.5 depending on the distance between target token and the prefix.
* When there are regex and suffix matches together, the confidence value will be > 0.5 depending on the distance between target token and the suffix.
* When there are regex, prefix, and suffix matches all together, the confidence value will be greater than in the other scenarios.

## 3. Dictionary

Another key feature of the `ContextualParser` annotator is the use of dictionaries. You can specify a path to a dictionary in `tsv` or `csv` format using the `setDictionary()` parameter. Using a dictionary is useful when you have a list of exact words that you want the parser to pick up when processing some text.

### 3.1. Orientation

The first thing to be aware of when supplying a dictionary is its layout. The `ContextualParser` annotator accepts dictionaries in either a horizontal or a vertical format. This is how they would look in practice:

Horizontal:

| normalize | word1 | word2 | word3     |
|-----------|-------|-------|-----------|
| female    | woman | girl  | lady      |
| male      | man   | boy   | gentleman |



Vertical:

| normalize | female    |
|-----------|-----------|
| word1     | woman     |
| word2     | girl      |
| word3     | lady      |

As you can see, your dictionary needs to have a `normalize` field that lets the annotator know which entity labels to use, and another field that lets the annotator know a list of words it should be looking to match. Here's how to set the format that your dictionary uses:

```
contextualParser = ContextualParserApproach() \
    .setDictionary("dictionary.tsv", options={"orientation":"vertical"}) # default is horizontal
```
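
To make the two layouts concrete, here is a small sketch that parses both into the same word-to-normalized lookup. It mirrors the file contents used in the examples later in this notebook (no header row, normalized value first) and is only an illustration of the layouts, not the annotator's own loader; `load_dictionary` is a hypothetical helper.

```python
import csv
import io

# Horizontal: one row per normalized value, followed by its words.
HORIZONTAL = "female\twoman\tgirl\tlady\nmale\tman\tboy\tgentleman\n"
# Vertical: the normalized value on the first line, one word per following line.
VERTICAL = "female\nwoman\ngirl\nlady\n"

def load_dictionary(text, orientation="horizontal", delimiter="\t"):
    """Map each dictionary word to its normalized value."""
    rows = [row for row in csv.reader(io.StringIO(text), delimiter=delimiter) if row]
    if orientation == "vertical":
        # Transpose so that every column becomes a row.
        rows = [list(column) for column in zip(*rows)]
    lookup = {}
    for normalized, *words in rows:
        for word in words:
            lookup[word] = normalized
    return lookup

horizontal = load_dictionary(HORIZONTAL)
vertical = load_dictionary(VERTICAL, orientation="vertical")
print(horizontal["boy"])  # male
print(vertical["girl"])   # female
```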

### 3.2. Dictionary-related JSON Properties

When working with dictionaries, there are 2 properties in the JSON configuration file to be aware of:

- `ruleScope`
- `matchScope`

This is especially true when you have multi-word entities in your dictionary.

Let's take an example of a dictionary that contains a list of cities, sometimes made up of multiple words:

| normalize | word1 | word2 | word3     |
|-----------|-------|-------|-----------|
| City      | New York | Salt Lake City  | Washington      |




Let's say we're working with the following text: ***I love New York. Salt Lake City is nice too.***

With the following JSON properties, here's what you would get:

```
{
  "entity": "City",
  "ruleScope": "sentence",
  "matchScope": "sub-token"
}
```

`OUTPUT: []`

When `ruleScope` is set to `"sentence"`, the annotator attempts to find matches at the token level, parsing through each token in the sentence one by one, looking for a match with the dictionary items. Since `"New York"` and `"Salt Lake City"` are made up of multiple tokens, the annotator would never find a match from the dictionary. Let's change `ruleScope` to `"document"`:

```
{
  "entity": "City",
  "ruleScope": "document",
  "matchScope": "sub-token"
}
```

`OUTPUT: [New York, Salt Lake City]`

When `ruleScope` is set to `"document"`, the annotator attempts to find matches by parsing through each sentence in the document one by one, looking for a match with the dictionary items. Beware of how you set `matchScope`. Taking the previous example, if we were to set `matchScope` to `"token"` instead of `"sub-token"`, here's what would happen:

```
{
  "entity": "City",
  "ruleScope": "document",
  "matchScope": "token"
}
```

`OUTPUT: [I love New York., Salt Lake City is nice too.]`

As you can see, when `ruleScope` is at the document level, if you set your `matchScope` to the token level, the annotator will output each sentence containing the matched entities as individual chunks.
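
The three outcomes above can be mimicked with a short plain-Python sketch. It is a simplified model of the behaviour described in this section, assuming a naive sentence splitter and word tokenizer; `dictionary_matches` is a hypothetical helper and not a library function.

```python
import re

TEXT = "I love New York. Salt Lake City is nice too."
CITIES = ["New York", "Salt Lake City", "Washington"]

def dictionary_matches(text, words, rule_scope, match_scope):
    # Naive sentence splitter: break after a full stop followed by whitespace.
    sentences = re.split(r"(?<=\.)\s+", text.strip())
    results = []
    for sentence in sentences:
        if rule_scope == "sentence":
            # Each token is compared with the dictionary on its own.
            results += [tok for tok in re.findall(r"\w+", sentence) if tok in words]
        else:
            # "document": each sentence is searched as a whole.
            for word in words:
                if word in sentence:
                    results.append(word if match_scope == "sub-token" else sentence)
    return results

print(dictionary_matches(TEXT, CITIES, "sentence", "sub-token"))  # []
print(dictionary_matches(TEXT, CITIES, "document", "sub-token"))  # ['New York', 'Salt Lake City']
print(dictionary_matches(TEXT, CITIES, "document", "token"))      # ['I love New York.', 'Salt Lake City is nice too.']
```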

### 3.3. Working with Multi-Word Matches

Although not directly related to dictionaries, if we build on top of what we've just seen, there is a use-case that is particularly in demand when working with the `ContextualParser` annotator: finding regex matches for chunks of words that span across multiple tokens.

Let's reiterate how the `ruleScope` property works: when `ruleScope` is set to `"sentence"`, we're looking for a match on each token of a sentence. When `ruleScope` is set to `"document"`, we're looking for a match on each sentence of a document.

So now let's imagine you're parsing through medical documents trying to tag the *Family History* headers in those documents.

```
{
  "entity": "Family History Header",
  "regex": "[fF]amily\\s+[hH]istory",
  "ruleScope": "document",
  "matchScope": "sub-token"
}
```


`OUTPUT: [Family History, family history, Family history]`

If you had set `ruleScope` to  `"sentence"`, here's what would have happened:

```
{
  "entity": "Family History Header",
  "regex": "[fF]amily\\s+[hH]istory",
  "ruleScope": "sentence",
  "matchScope": "sub-token"
}
```

`OUTPUT: []`

Since Family History is divided into two different tokens, the annotator will never find a match since it's now looking for a match on each token of a sentence.
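
The difference can be checked with Python's `re` module alone. In this sketch the note text is invented for illustration, and searching whole sentences versus single tokens stands in for the `"document"` and `"sentence"` rule scopes.

```python
import re

# Invented sample note, used only for this illustration.
NOTE = "Family History: diabetes. No family history of cancer. Family history reviewed."
PATTERN = re.compile(r"[fF]amily\s+[hH]istory")

# Document-style matching: search each sentence as a whole string.
sentences = re.split(r"(?<=\.)\s+", NOTE)
per_sentence = [m.group() for s in sentences for m in PATTERN.finditer(s)]

# Sentence-style matching: test each token on its own.
per_token = [tok for tok in NOTE.split() if PATTERN.search(tok)]

print(per_sentence)  # ['Family History', 'family history', 'Family history']
print(per_token)     # []
```

A single token never contains whitespace, so a pattern that requires `\s+` between two words cannot match at the token level.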

# Running a Pipeline

## Example 1: Detecting Cities

Let's try running through some examples to build on top of what you've learned so far.

In [None]:
# Here's some sample text
sample_text = """Peter Parker is a nice guy and lives in New York . Bruce Wayne is also a nice guy and lives in San Antonio and Gotham City . """

In [None]:
# Create a dictionary to detect cities
cities = """City\nNew York\nGotham City\nSan Antonio\nSalt Lake City"""

with open('cities.tsv', 'w') as f:
    f.write(cities)

# Check what dictionary looks like
!cat cities.tsv

City
New York
Gotham City
San Antonio
Salt Lake City

In [None]:
# Create JSON file
cities = {
  "entity": "City",
  "ruleScope": "document",
  "matchScope":"sub-token",
  "completeMatchRegex": "false"
}

import json
with open('cities.json', 'w') as f:
    json.dump(cities, f)

In [None]:
from johnsnowlabs.nlp import *
from johnsnowlabs.medical import *


In [None]:
# Build pipeline
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = nlp.SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

contextual_parser = medical.ContextualParserApproach() \
    .setInputCols(["sentence", "token"])\
    .setOutputCol("entity")\
    .setJsonPath("cities.json")\
    .setCaseSensitive(True)\
    .setDictionary('cities.tsv', options={"orientation":"vertical"})

chunk_converter = medical.ChunkConverter() \
    .setInputCols(["entity"]) \
    .setOutputCol("ner_chunk")

parserPipeline = nlp.Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    contextual_parser,
    chunk_converter,
    ])

In [None]:
# Create a lightpipeline model
empty_data = spark.createDataFrame([[""]]).toDF("text")

parserModel = parserPipeline.fit(empty_data)

light_model = nlp.LightPipeline(parserModel)

In [None]:
# Annotate the sample text
annotations = light_model.fullAnnotate(sample_text)[0]

In [None]:
# Check outputs
annotations.get('ner_chunk')

[Annotation(chunk, 40, 47, New York, {'field': 'City', 'tokenIndex': '9', 'ner_source': 'ner_chunk', 'normalized': 'City', 'confidenceValue': '0.50', 'entity': 'City', 'sentence': '0'}, []),
 Annotation(chunk, 95, 105, San Antonio, {'field': 'City', 'tokenIndex': '10', 'ner_source': 'ner_chunk', 'normalized': 'City', 'confidenceValue': '0.50', 'entity': 'City', 'sentence': '1'}, []),
 Annotation(chunk, 111, 121, Gotham City, {'field': 'City', 'tokenIndex': '13', 'ner_source': 'ner_chunk', 'normalized': 'City', 'confidenceValue': '0.50', 'entity': 'City', 'sentence': '1'}, [])]

In [None]:
# Visualize outputs
from sparknlp_display import NerVisualizer

visualiser = NerVisualizer()

visualiser.display(annotations, label_col='ner_chunk', document_col='document', save_path="display_result.html")

Feel free to experiment with the annotator parameters and JSON properties to see how the output might change.
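
When comparing runs, it can help to flatten the chunk annotations into plain rows. The sketch below assumes each annotation exposes `result`, `begin`, `end` and `metadata` attributes, as the printed output above suggests; `chunks_to_rows` is a hypothetical helper, and `SimpleNamespace` objects stand in for real annotations so the snippet runs without a Spark session.

```python
from types import SimpleNamespace

def chunks_to_rows(chunks):
    """Flatten chunk annotations into plain dictionaries."""
    return [
        {
            "chunk": c.result,
            "begin": c.begin,
            "end": c.end,
            "entity": c.metadata.get("entity"),
            "sentence": c.metadata.get("sentence"),
        }
        for c in chunks
    ]

# Stand-ins mimicking two of the annotations printed above.
demo = [
    SimpleNamespace(result="New York", begin=40, end=47,
                    metadata={"entity": "City", "sentence": "0"}),
    SimpleNamespace(result="San Antonio", begin=95, end=105,
                    metadata={"entity": "City", "sentence": "1"}),
]

rows = chunks_to_rows(demo)
for row in rows:
    print(row)
```

In the notebook itself, the same helper could be called as `chunks_to_rows(annotations.get('ner_chunk'))`.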

## Example 2: Detect Gender and Age

In [None]:
# Here's some sample text
sample_text = """A 28 year old female with a history of gestational diabetes mellitus diagnosed 8 years ago.
                 3 years ago, he reported an episode of HTG-induced pancreatitis .
                 5 months old boy with repeated concussions."""

In [None]:
# Create a dictionary to detect gender
gender = '''male,man,male,boy,gentleman,he,him
female,woman,female,girl,lady,old-lady,she,her
neutral,they,neutral,it'''

with open('gender.csv', 'w') as f:
    f.write(gender)

# Check what dictionary looks like
!cat gender.csv

male,man,male,boy,gentleman,he,him
female,woman,female,girl,lady,old-lady,she,her
neutral,they,neutral,it

In [None]:
# Create JSON file for gender
gender = {
  "entity": "Gender",
  "ruleScope": "sentence",
  "completeMatchRegex": "true",
  "matchScope":"token"
}

import json
with open('gender.json', 'w') as f:
    json.dump(gender, f)

In [None]:
# Create JSON file for age
age = {
  "entity": "Age",
  "ruleScope": "sentence",
  "matchScope":"token",
  "regex":"\\d{1,3}",
  "prefix":["age of", "age"],
  "suffix": ["-years-old", "years-old", "-year-old",
             "-months-old", "-month-old",
             "-day-old", "-days-old", "month old",
             "days old", "year old", "years old",
             "years", "year", "months", "old"],
  "contextLength": 25,
  "contextException": ["ago"],
  "exceptionDistance": 12
}

with open('age.json', 'w') as f:
    json.dump(age, f)

In [None]:
# Build pipeline
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = nlp.SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

gender_contextual_parser = medical.ContextualParserApproach() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("chunk_gender") \
    .setJsonPath("gender.json") \
    .setCaseSensitive(False) \
    .setDictionary('gender.csv', options={"delimiter":","}) \
    .setPrefixAndSuffixMatch(False)

age_contextual_parser = medical.ContextualParserApproach() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("chunk_age") \
    .setJsonPath("age.json") \
    .setCaseSensitive(False) \
    .setPrefixAndSuffixMatch(False)\
    .setShortestContextMatch(True)\
    .setOptionalContextRules(False)

chunk_merger = medical.ChunkMergeApproach() \
    .setInputCols(["chunk_gender", "chunk_age"]) \
    .setOutputCol("ner_chunk")

parserPipeline = nlp.Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    gender_contextual_parser,
    age_contextual_parser,
    chunk_merger
    ])

In [None]:
# Create a lightpipeline model
empty_data = spark.createDataFrame([[""]]).toDF("text")

parserModel = parserPipeline.fit(empty_data)

light_model = nlp.LightPipeline(parserModel)

In [None]:
# Annotate the sample text
annotations = light_model.fullAnnotate(sample_text)[0]

In [None]:
# Check outputs
annotations.get('ner_chunk')

[Annotation(chunk, 2, 3, 28, {'tokenIndex': '1', 'entity': 'Age', 'field': 'Age', 'chunk': '0', 'normalized': '', 'sentence': '0', 'confidenceValue': '0.74'}, []),
 Annotation(chunk, 14, 19, female, {'tokenIndex': '4', 'entity': 'Gender', 'field': 'Gender', 'chunk': '1', 'normalized': 'female', 'sentence': '0', 'confidenceValue': '0.50'}, []),
 Annotation(chunk, 122, 123, he, {'tokenIndex': '4', 'entity': 'Gender', 'field': 'Gender', 'chunk': '2', 'normalized': 'male', 'sentence': '1', 'confidenceValue': '0.50'}, []),
 Annotation(chunk, 192, 192, 5, {'tokenIndex': '0', 'entity': 'Age', 'field': 'Age', 'chunk': '3', 'normalized': '', 'sentence': '2', 'confidenceValue': '0.74'}, []),
 Annotation(chunk, 205, 207, boy, {'tokenIndex': '3', 'entity': 'Gender', 'field': 'Gender', 'chunk': '4', 'normalized': 'male', 'sentence': '2', 'confidenceValue': '0.50'}, [])]

In [None]:
# Visualize outputs
from sparknlp_display import NerVisualizer

visualiser = NerVisualizer()

visualiser.display(annotations, label_col='ner_chunk', document_col='document', save_path="display_result_2.html")

Feel free to experiment with the annotator parameters and JSON properties to see how the output might change. To run the pipeline on a full dataset, call the `fit()` and `transform()` methods directly on your dataset instead of using the `LightPipeline`.

In [None]:
# Create example dataframe with sample text
data = spark.createDataFrame([[sample_text]]).toDF("text")

# Fit and show
results = parserPipeline.fit(data).transform(data)
results.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|        chunk_gender|           chunk_age|           ner_chunk|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|A 28 year old fem...|[{document, 0, 23...|[{document, 0, 90...|[{token, 0, 0, A,...|[{chunk, 14, 19, ...|[{chunk, 2, 3, 28...|[{chunk, 2, 3, 28...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+



In [None]:
results.select("chunk_age.result").show()

+-------+
| result|
+-------+
|[28, 5]|
+-------+



In [None]:
results.select("chunk_gender.result").show()

+-----------------+
|           result|
+-----------------+
|[female, he, boy]|
+-----------------+



## Example 3: Detect Test Result and Date

Medical text has a complex structure. Sometimes our de-identification NER model mistakenly labels entities such as test results or dimensions as dates. In such cases, we can use a rule-based NER (the `ContextualParser`) alongside the model.

In [None]:
data = pd.DataFrame(
    {'text': [
        '''Mark White was born 06-20-1990. Mark White is 45 years old. Test Result: RHC 11-22-33, LHC 11\\22\\33, Wedge 11-16-1972.''',
        '''John was born on 07-25-2000 and he was discharged on 03/15/2022. Test Result: RV 26/2. Left Ventricle 26-2.  Wedge 11/16/19.''',
        '''John Moore was born 03/20/2012 and he is 18 years old. Test Result: Pulmonary Artery 07\\31\\19 ( PA 07/31/19 ).'''
]})

# pre-process text
#data['text'].replace('\\', '\\\\', inplace=True)

In [None]:
# convert data for Spark processing
input_df = spark.createDataFrame(data)
input_df.show(truncate=False)

+----------------------------------------------------------------------------------------------------------------------------+
|text                                                                                                                        |
+----------------------------------------------------------------------------------------------------------------------------+
|Mark White was born 06-20-1990. Mark White is 45 years old. Test Result: RHC 11-22-33, LHC 11\22\33, Wedge 11-16-1972.      |
|John was born on 07-25-2000 and he was discharged on 03/15/2022. Test Result: RV 26/2. Left Ventricle 26-2.  Wedge 11/16/19.|
|John Moore was born 03/20/2012 and he is 18 years old. Test Result: Pulmonary Artery 07\31\19 ( PA 07/31/19 ).              |
+----------------------------------------------------------------------------------------------------------------------------+



In [None]:
# create JSON file for test result patterns (to be used with ContextualParser)
test_result_rules = {
    'entity': 'test_result',
    'ruleScope': 'sentence',
    'matchScope': 'token',
    'regex': r'(\d{2}.?\d{2}.?\d{2})|(\d{2}.?\d{2}.?\d{4})|(\d{2}.?\d{1})',
    'prefix': ['Right atrium', 'RA',
               'Left atrium', 'LA',
               'Wedge', "Catheterization",
               'Right Heart Catheterization', 'RHC',
               'Left Heart Catheterization', 'LHC',
               'PA', 'pulmonary artery',
               'RV', 'right ventricle'],
    'suffix': ['.', ','],
    'contextLength': 45,
    'completeMatchRegex': 'true',
    "contextException": ["born",  "on"],
    "exceptionDistance":10,
}

with open('test_result_rules.json', 'w', encoding='utf-8') as f:
    json.dump(test_result_rules, f, ensure_ascii=False, indent=4)
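
Before running the pipeline, the regex can be sanity-checked on its own with Python's `re` module. The sketch below tests a few strings taken from the sample texts against the full pattern. It also shows that the pattern alone accepts a birth date such as `06-20-1990`, which is why the rule relies on the prefix list and on `contextException` to tell test results apart from dates.

```python
import re

TEST_RESULT_REGEX = r"(\d{2}.?\d{2}.?\d{2})|(\d{2}.?\d{2}.?\d{4})|(\d{2}.?\d{1})"

# Strings taken from the sample texts above; "45" is the age in the first text.
candidates = ["11-22-33", "11\\22\\33", "11-16-1972", "26/2", "06-20-1990", "45"]

# Record whether the whole string matches the pattern.
full_matches = {
    candidate: re.fullmatch(TEST_RESULT_REGEX, candidate) is not None
    for candidate in candidates
}

for candidate, matched in full_matches.items():
    print(candidate, "->", matched)
```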

In [None]:
embeddings = nlp.WordEmbeddingsModel \
    .pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("word_embeddings")

# identify test results
test_contextual_parser = medical.ContextualParserApproach() \
    .setInputCols(['sentence', 'token']) \
    .setOutputCol('test_result') \
    .setJsonPath('test_result_rules.json') \
    .setCaseSensitive(False) \
    .setPrefixAndSuffixMatch(True) \
    .setShortestContextMatch(True) \
    .setOptionalContextRules(False)

test_contextual_parser_converter = medical.ChunkConverter() \
    .setInputCols(['test_result']) \
    .setOutputCol('test_result_chunk')

# Deid NER
deid_ner = medical.NerModel \
    .pretrained('ner_deid_subentity_augmented', 'en', 'clinical/models') \
    .setInputCols(['sentence', 'token', 'word_embeddings']) \
    .setOutputCol('deid_ner')

deid_ner_converter = medical.NerConverterInternal() \
    .setInputCols(['sentence', 'token', 'deid_ner']) \
    .setOutputCol('deid_ner_chunk') \
    .setWhiteList(['date'])

# merge
chunk_merger = medical.ChunkMergeApproach() \
    .setInputCols(['test_result_chunk', 'deid_ner_chunk']) \
    .setOutputCol('ner_chunk') \
    .setMergeOverlapping(True)

parserPipeline = nlp.Pipeline().setStages([
    document_assembler,
    sentence_detector,
    tokenizer,
    embeddings,
    test_contextual_parser,
    test_contextual_parser_converter,
    deid_ner,
    deid_ner_converter,
    chunk_merger,
])

empty_data = spark.createDataFrame([[""]]).toDF("text")

pipeline_model = parserPipeline.fit(empty_data)

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_deid_subentity_augmented download started this may take some time.
[OK!]


In [None]:
output = pipeline_model.transform(input_df)

In [None]:
def process_output(result_col, chunk_alias, output):
    """Show each chunk in `result_col` with its span, entity label and confidence."""

    # Zip the parallel annotation arrays, then explode to one row per chunk
    output.select(F.explode(F.arrays_zip(output[result_col].result,
                                         output[result_col].begin,
                                         output[result_col].end,
                                         output[result_col].metadata,)).alias("cols")) \
          .select(F.expr("cols['0']").alias(chunk_alias),
                  F.expr("cols['1']").alias("begin"),
                  F.expr("cols['2']").alias("end"),
                  F.expr("cols['3']['entity']").alias("entity"),
                  F.expr("cols['3']['confidence']").alias("confidence")) \
          .show(50, truncate=False)


In [None]:
from google.colab import widgets
t = widgets.TabBar(["ner_deid_result","contextual_text_result","merged_ner_result"])

with t.output_to(0):

    process_output("deid_ner_chunk", "deid_ner_chunk",output)

with t.output_to(1):

    process_output("test_result_chunk", "contextual_ner_chunk", output)

with t.output_to(2):

    process_output("ner_chunk", "merged_ner_chunk", output)



+--------------+-----+---+------+----------+
|deid_ner_chunk|begin|end|entity|confidence|
+--------------+-----+---+------+----------+
|06-20-1990    |20   |29 |DATE  |0.9677    |
|11-22-33      |77   |84 |DATE  |0.9997    |
|11-16-1972    |107  |116|DATE  |0.9966    |
|07-25-2000    |17   |26 |DATE  |0.987     |
|03/15/2022    |53   |62 |DATE  |1.0       |
|11/16/19      |115  |122|DATE  |1.0       |
|03/20/2012    |20   |29 |DATE  |0.9998    |
|07/31/19      |99   |106|DATE  |1.0       |
+--------------+-----+---+------+----------+



+--------------------+-----+---+-----------+----------+
|contextual_ner_chunk|begin|end|entity     |confidence|
+--------------------+-----+---+-----------+----------+
|11-22-33            |77   |84 |test_result|null      |
|11\22\33            |91   |98 |test_result|null      |
|11-16-1972          |107  |116|test_result|null      |
|26/2                |81   |84 |test_result|null      |
|11/16/19            |115  |122|test_result|null      |
|07\31\19            |85   |92 |test_result|null      |
|07/31/19            |99   |106|test_result|null      |
+--------------------+-----+---+-----------+----------+



+----------------+-----+---+-----------+----------+
|merged_ner_chunk|begin|end|entity     |confidence|
+----------------+-----+---+-----------+----------+
|06-20-1990      |20   |29 |DATE       |0.9677    |
|11-22-33        |77   |84 |test_result|null      |
|11\22\33        |91   |98 |test_result|null      |
|11-16-1972      |107  |116|test_result|null      |
|07-25-2000      |17   |26 |DATE       |0.987     |
|03/15/2022      |53   |62 |DATE       |1.0       |
|26/2            |81   |84 |test_result|null      |
|11/16/19        |115  |122|test_result|null      |
|03/20/2012      |20   |29 |DATE       |0.9998    |
|07\31\19        |85   |92 |test_result|null      |
|07/31/19        |99   |106|test_result|null      |
+----------------+-----+---+-----------+----------+
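
The `confidence` column is `null` for the `test_result` rows most likely because, as the annotation outputs in Examples 1 and 2 show, chunks produced by the `ContextualParser` carry their score under the `confidenceValue` metadata key, while `process_output` reads the `confidence` key. A minimal plain-Python sketch of reading whichever key is present follows; `read_confidence` is a hypothetical helper and the metadata dictionaries are illustrative stand-ins.

```python
def read_confidence(metadata):
    """Return the confidence stored under either metadata key, or None."""
    for key in ("confidence", "confidenceValue"):
        value = metadata.get(key)
        if value is not None:
            return float(value)
    return None

# Stand-in metadata dictionaries; the values are illustrative only.
ner_metadata = {"entity": "DATE", "confidence": "0.9677"}
parser_metadata = {"entity": "test_result", "confidenceValue": "0.50"}

print(read_confidence(ner_metadata))     # 0.9677
print(read_confidence(parser_metadata))  # 0.5
print(read_confidence({}))               # None
```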



In [None]:
from google.colab import widgets
t = widgets.TabBar(["deid_ner_result","contextual_result","merged_ner_result"])

from sparknlp_display import NerVisualizer
visualiser = NerVisualizer()

results = output.collect()

with t.output_to(0):
    for i in range(len(results)):
        visualiser.display(results[i], label_col= 'deid_ner_chunk')

with t.output_to(1):
    for i in range(len(results)):
        visualiser.display(results[i], label_col= 'test_result_chunk')

with t.output_to(2):
    for i in range(len(results)):
        visualiser.display(results[i], label_col= 'ner_chunk')

# Pretrained Contextual Parser Models

<center><b>Contextual Parser Model List</b>

|index|model| description|
|-----:|:-----|:-----|
| 1| [date_of_birth_parser](https://nlp.johnsnowlabs.com/2023/08/22/date_of_birth_parser_en.html)  | This model can extract date-of-birth (DOB) entities in clinical texts. |
| 2| [date_of_death_parser](https://nlp.johnsnowlabs.com/2023/08/22/date_of_birth_parser_en.html)  | This model can extract date-of-death (DOD) entities in clinical texts. |
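
As a rough intuition for what a context-based date parser does, the sketch below keeps a date only when a trigger keyword appears shortly before it. This is a simplified plain-Python stand-in, not the implementation of the pretrained models: the regex, the keyword list, the 15-character window, and the helper name `find_dob_candidates` are all assumptions made for illustration.

```python
import re

# Hypothetical, simplified stand-in for a context-aware rule.
# None of these values come from the pretrained models.
DATE_PATTERN = re.compile(r"\b\d{1,2}[./-]\d{1,2}[./-]\d{2,4}\b")
DOB_PREFIXES = ("dob", "db", "date of birth", "born on")  # assumed trigger keywords
CONTEXT_LENGTH = 15  # assumed: characters inspected before each date


def find_dob_candidates(text, prefixes=DOB_PREFIXES, context_length=CONTEXT_LENGTH):
    """Return (chunk, begin, end) for dates preceded by a trigger keyword.

    `end` is inclusive, mirroring the begin/end convention of the chunk tables.
    """
    hits = []
    for match in DATE_PATTERN.finditer(text):
        # Look only at the characters immediately before the date.
        window = text[max(0, match.start() - context_length):match.start()].lower()
        # Require the keyword as a whole word inside that window.
        if any(re.search(r"\b" + re.escape(p) + r"\b", window) for p in prefixes):
            hits.append((match.group(), match.start(), match.end() - 1))
    return hits


sample = "Record date : 01-04-2081\nDB : 11.04.1962\nDT : 12-03-1978"
print(find_dob_candidates(sample))
```

In this sample only the date after `DB :` is kept, because the other two dates have no trigger keyword in their preceding window. A larger window would catch longer phrases but could also pick up keywords that belong to a neighbouring date.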



## Date-of-Birth and Date-of-Death Parsers

In [7]:
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

#date_of_birth_parser
dob_contextual_parser = medical.ContextualParserModel.pretrained("date_of_birth_parser", "en", "clinical/models") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("chunk_dob")

chunk_converter_dob = medical.ChunkConverter() \
    .setInputCols(["chunk_dob"]) \
    .setOutputCol("ner_chunk_dob")

#date_of_death_parser
dod_contextual_parser = medical.ContextualParserModel.pretrained("date_of_death_parser", "en", "clinical/models") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("chunk_dod")

chunk_converter_dod = medical.ChunkConverter() \
    .setInputCols(["chunk_dod"]) \
    .setOutputCol("ner_chunk_dod")

chunk_merger = medical.ChunkMergeApproach() \
    .setInputCols(["ner_chunk_dob", "ner_chunk_dod"]) \
    .setOutputCol("ner_chunk")

parserPipeline = nlp.Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    dob_contextual_parser,
    chunk_converter_dob,
    dod_contextual_parser,
    chunk_converter_dod,
    chunk_merger
    ])

# No stage needs training data here, so fitting on an empty DataFrame only assembles the pipeline.
model = parserPipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

sentence_detector_dl_healthcare download started this may take some time.
Approximate size to download 367.3 KB
[OK!]
date_of_birth_parser download started this may take some time.
[OK!]
date_of_death_parser download started this may take some time.
[OK!]


In [8]:
text = """
Record date : 2081-01-04
DB : 11.04.1962
DT : 12-03-1978
DOD : 10.25.23

SOCIAL HISTORY:
She was born on Nov 04, 1962 in London and got married on 04/05/1979. When she got pregnant on 15 May 1079, the doctor wanted to verify her DOB was November 4, 1962. Her date of birth was confirmed to be 11-04-1962, the patient is 45 years old on 25 Sep 2007.

PROCEDURES:
Patient was evaluated on 1988-03-15 for allergies. She was seen by the endocrinology service and she was discharged on 9/23/1988.

MEDICATIONS
1. Coumadin 1 mg daily. Last INR was on August 14, 2007, and her INR was 2.3."""

result = model.transform(spark.createDataFrame([[text]]).toDF("text"))

In [9]:
from pyspark.sql import functions as F

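# Zip the chunk fields so that each exploded row holds one chunk:
# cols['0'] = chunk text, cols['1'] = begin, cols['2'] = end, cols['3'] = metadata map.
result.select(F.explode(F.arrays_zip(result.ner_chunk.result,
                                     result.ner_chunk.begin,
                                     result.ner_chunk.end,
                                     result.ner_chunk.metadata)).alias("cols")) \
      .select(F.expr("cols['3']['sentence']").alias("sentence_id"),
              F.expr("cols['0']").alias("chunk"),
              F.expr("cols['2']").alias("end"),
              # cols['3'] is the full metadata map; use cols['3']['entity'] to show only the label.
              F.expr("cols['3']").alias("ner_label"))\
      .show(truncate=False)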

+-----------+----------------+---+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|sentence_id|chunk           |end|ner_label                                                                                                                                                           |
+-----------+----------------+---+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|1          |11.04.1962      |40 |{tokenIndex -> 2, entity -> DOB, confidence -> 0.72, field -> DOB, ner_source -> ner_chunk_dob, chunk -> 0, normalized -> , sentence -> 1, confidenceValue -> 0.72} |
|3          |10.25.23        |71 |{tokenIndex -> 2, entity -> DOD, confidence -> 0.72, field -> DOD, ner_source -> ner_chunk_dod, chunk -> 1, normalized -> , sentence -> 3, confidenceValue -> 0.72} |


In [12]:
# Visualize outputs
from sparknlp_display import NerVisualizer

light_model = nlp.LightPipeline(model)

# Annotate the sample text
annotations = light_model.fullAnnotate(text)[0]

visualiser = NerVisualizer()

visualiser.display(annotations, label_col='ner_chunk', document_col='document', save_path="display_result_2.html")
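
The dictionary returned by `fullAnnotate` can also be flattened into plain rows for inspection or for loading into pandas. The sketch below is a minimal illustration that assumes each chunk exposes `result`, `begin`, `end`, and `metadata`, as Spark NLP `Annotation` objects do. The helper name `chunks_to_rows` is hypothetical, and the demo uses `SimpleNamespace` stand-ins (with values mirroring the two chunks shown in the table above) so that it runs without a Spark session.

```python
from types import SimpleNamespace


def chunks_to_rows(annotations, chunk_col="ner_chunk"):
    """Flatten the chunk annotations of one document into a list of dicts."""
    rows = []
    for chunk in annotations.get(chunk_col, []):
        rows.append({
            "chunk": chunk.result,
            "begin": chunk.begin,
            "end": chunk.end,
            "entity": chunk.metadata.get("entity"),
            "sentence": chunk.metadata.get("sentence"),
        })
    return rows


# Stand-in for `light_model.fullAnnotate(text)[0]`; not real pipeline output.
demo_annotations = {
    "ner_chunk": [
        SimpleNamespace(result="11.04.1962", begin=31, end=40,
                        metadata={"entity": "DOB", "sentence": "1"}),
        SimpleNamespace(result="10.25.23", begin=64, end=71,
                        metadata={"entity": "DOD", "sentence": "3"}),
    ]
}

rows = chunks_to_rows(demo_annotations)
for row in rows:
    print(row)
```

With a real session, passing `annotations` from the cell above in place of `demo_annotations` should work the same way, and `pd.DataFrame(rows)` turns the result into a table.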