![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/legal-nlp/05.6.Contextual_Parser_Rule_Based_NER.ipynb)

# ContextualParser (Rule Based NER)

#🎬 Installation

In [None]:
! pip install -q johnsnowlabs

##🔗 Automatic Installation
Using my.johnsnowlabs.com SSO

In [None]:
from johnsnowlabs import nlp, legal

# nlp.install(force_browser=True)

##🔗 Manual downloading
If you are not registered at my.johnsnowlabs.com, received your license by email, or are using Safari, you may need to install the license manually.

- Go to my.johnsnowlabs.com
- Download your license
- Upload it using the following command

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

- Install it

In [None]:
nlp.install()

#📌 Starting

In [None]:
spark = nlp.start()

#🔎 How the ContextualParser Works

Spark NLP's `ContextualParser` is a licensed annotator that allows users to extract entities from a document based on pattern matching. It provides more functionality than its open-source counterpart `EntityRuler` by allowing users to customize specific characteristics for pattern matching. You can find entities using regex rules for full and partial matches, a dictionary with normalizing options, and context parameters that take into account things such as token distances.

📚There are 3 components to understand when using the `ContextualParser` annotator:

1. `ContextualParser` annotator's parameters
2. JSON configuration file
3. Dictionary

##📌 1. ContextualParser Annotator Parameters

📚Here are the main parameters available to use with the `ContextualParserApproach`:

```
contextualParser = legal.ContextualParserApproach() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("entity") \
    .setCaseSensitive(True) \
    .setJsonPath("context_config.json") \
    .setPrefixAndSuffixMatch(True) \
    .setCompleteContextMatch(True) \
    .setDictionary("dictionary.tsv", options={"orientation":"vertical"})
```


📚We will dive deeper into the details of each parameter, but here's a quick overview:

- `setCaseSensitive`: whether matching is case sensitive (applies to all JSON properties apart from the regex property)
- `setJsonPath`: the path to your JSON configuration file
- `setPrefixAndSuffixMatch`: whether a match requires both the prefix AND the suffix properties from the JSON configuration file
- `setCompleteContextMatch`: whether the prefix and suffix must match exactly rather than partially
- `setDictionary`: the path to your dictionary, used for normalizing entities

Let's start by looking at the JSON configuration file.

##📌 2. JSON Configuration File

Here is a JSON configuration file that uses every available property.

```
{
  "entity": "Header",
  "ruleScope": "sentence",
  "regex": "\\d\\.\\d+\\.?[A-Z-,; a-z]+",
  "completeMatchRegex": "true",
  "matchScope": "token",
  "prefix": ["PART"],
  "suffix": ["contract"],
  "contextLength": 100,
  "contextException": ["of"],
  "exceptionDistance": 40
}
 ```

###✔️ 2.1. Basic Properties

There are 5 basic properties you can set in your JSON configuration file:

- `entity`
- `ruleScope`
- `regex`
- `completeMatchRegex`
- `matchScope`

Let's first look at the 3 most essential properties to set:

```
{
  "entity": "Digit",
  "ruleScope": "sentence",
  "regex": "\\d+" # Note here: backslashes are escape characters in JSON, so for regex pattern "\d+" we need to write it out as "\\d+"
}
```

📚Here, we're looking for tokens in our text that match the regex: "`\d+`" and assign the "`Digit`" entity to those tokens. When `ruleScope` is set to "`sentence`", we're looking for a match on each *token* of a **sentence**. You can change it to "`document`" to look for a match on each *sentence* of a **document**. The latter is particularly useful when working with multi-word matches, but we'll explore this at a later stage.
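
A minimal sketch of the escaping rule noted above, assuming you build the configuration in Python and serialize it with the standard `json` module (the `rule` variable is only illustrative). A raw string holds the regex exactly as written, and `json.dumps` doubles the backslash for you:

```python
import json

# Hypothetical rule, written with a raw string so Python keeps the backslash as-is.
rule = {"entity": "Digit", "ruleScope": "sentence", "regex": r"\d+"}

# Serializing escapes the backslash, so the JSON text contains "\\d+".
serialized = json.dumps(rule)
print(serialized)

# Loading the JSON back restores the original two-character escape \d.
restored = json.loads(serialized)
print(restored["regex"])
```

Writing the file through `json.dump` rather than by hand avoids having to count backslashes yourself.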

The next properties to look at are `completeMatchRegex` and `matchScope`. To understand their use case, let's take a look at an example where we're trying to match all digits in our text. 

Let's say we come across the following string: ***XYZ987***

Depending on how we set the `completeMatchRegex` and `matchScope` properties, we'll get the following results:

```
{
  "entity": "Digit",
  "ruleScope": "sentence",
  "regex": "\\d+",
  "completeMatchRegex": "false",
  "matchScope": "token"
}
```

`OUTPUT: [XYZ987]`

```
{
  "entity": "Digit",
  "ruleScope": "sentence",
  "regex": "\\d+",  
  "completeMatchRegex": "false",
  "matchScope": "sub-token"
}
```

`OUTPUT: [987]`


```
{
  "entity": "Digit",
  "ruleScope": "sentence",
  "regex": "\\d+",
  "completeMatchRegex": "true"
  # matchScope is ignored here
}
```

`OUTPUT: []`

`"completeMatchRegex": "true"` only returns an output if a whole token matches the regex exactly, for example if our string were split as follows: **XYZ 987**

```
{
  "entity": "Digit",
  "ruleScope": "sentence",
  "regex": "\\d+",  
  "completeMatchRegex": "true",
  "matchScope": "token" # Note here: sub-token would return the same output
}
```

`OUTPUT: [987]`
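
The four outcomes above can be mimicked in plain Python. This is only an illustration of the behavior described in this section, built on the standard `re` module; it is not the annotator's implementation, and `emulate_match` and its arguments are hypothetical names.

```python
import re

def emulate_match(tokens, pattern, complete_match=False, match_scope="token"):
    """Mimic how completeMatchRegex and matchScope shape the output for each token."""
    results = []
    for token in tokens:
        if complete_match:
            # The whole token must match the regex; matchScope plays no role here.
            if re.fullmatch(pattern, token):
                results.append(token)
            continue
        found = re.search(pattern, token)
        if found is None:
            continue
        # "token" returns the full token, "sub-token" only the matching part.
        results.append(token if match_scope == "token" else found.group(0))
    return results

print(emulate_match(["XYZ987"], r"\d+", False, "token"))      # ['XYZ987']
print(emulate_match(["XYZ987"], r"\d+", False, "sub-token"))  # ['987']
print(emulate_match(["XYZ987"], r"\d+", True))                # []
print(emulate_match(["XYZ", "987"], r"\d+", True))            # ['987']
```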

###✔️ 2.2. Context Awareness Properties

There are 5 properties related to context awareness:

- `contextLength`
- `prefix`
- `suffix`
- `contextException`
- `exceptionDistance`



Let's look at a similar example. Say we have the following text: ***At birth, the typical XYZ Corporation is growing slightly faster than the typical ABC Inc., but growth rates become equal at about seven months.***

If we want to match the company that grows faster at birth, we can start by defining our regex: "`XYZ|ABC`"

Next, we add a prefix ("`birth`") and suffix ("`faster`") to ask the parser to match the regex only if the word "`birth`" comes before and only if the word "`faster`" comes after. Finally, we will need to set the `contextLength` - this is the maximum number of tokens after the prefix and before the suffix that will be searched to find a regex match.

Here's what the JSON configuration file would look like:

```
{
  "entity": "Company",
  "ruleScope": "sentence",
  "regex": "XYZ|ABC",
  "contextLength": 50,
  "prefix": ["birth"],
  "suffix": ["faster"]
}
```

`OUTPUT: [XYZ]`

If you remember, the annotator has a `setPrefixAndSuffixMatch()` parameter. If you set it to `True`, the previous output would remain as is. However, if you had set it to `False` and used the following JSON configuration:

```
{
  "entity": "Company",
  "ruleScope": "sentence",
  "regex": "XYZ|ABC",
  "contextLength": 50,
  "prefix": ["birth"],
  "suffix": ["faster", "rates"]
}
```

`OUTPUT: [XYZ,ABC]`

The parser now takes into account either the prefix OR the suffix: only one of the conditions has to be fulfilled for a match to count.
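
To make the AND versus OR logic concrete, here is a small emulation in plain Python. It is a sketch under stated assumptions, not the annotator's code: tokens are taken to be runs of word characters, `context_length` is counted in tokens as described above, and the function and argument names are hypothetical.

```python
import re

def emulate_context_match(text, pattern, prefixes, suffixes, context_length, require_both):
    """Keep regex matches that have a prefix before and/or a suffix after them."""
    tokens = re.findall(r"\w+", text)  # assumed tokenization
    matches = []
    for i, token in enumerate(tokens):
        if not re.fullmatch(pattern, token):
            continue
        before = tokens[max(0, i - context_length):i]
        after = tokens[i + 1:i + 1 + context_length]
        has_prefix = any(p in before for p in prefixes)
        has_suffix = any(s in after for s in suffixes)
        # require_both=True plays the role of setPrefixAndSuffixMatch(True).
        accepted = (has_prefix and has_suffix) if require_both else (has_prefix or has_suffix)
        if accepted:
            matches.append(token)
    return matches

text = ("At birth, the typical XYZ Corporation is growing slightly faster than "
        "the typical ABC Inc., but growth rates become equal at about seven months.")

print(emulate_context_match(text, "XYZ|ABC", ["birth"], ["faster"], 50, True))            # ['XYZ']
print(emulate_context_match(text, "XYZ|ABC", ["birth"], ["faster", "rates"], 50, False))  # ['XYZ', 'ABC']
```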

The annotator also has a `setCompleteContextMatch()` parameter. If you set it to `True` and use the following JSON configuration, where the suffix is only a partial match (`fast` instead of `faster`), nothing is returned:

```
{
  "entity": "Company",
  "ruleScope": "sentence",
  "regex": "XYZ|ABC",
  "contextLength": 50,
  "prefix": ["birth"],
  "suffix": ["fast"]
}
```

`OUTPUT: []`

However, if we set `setCompleteContextMatch()` to `False` and use the same JSON configuration as above, we get the following output:

`OUTPUT: [XYZ]`
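
One way to picture the difference is the sketch below. It assumes that a partial context match means the keyword occurs inside a context token; that is an assumption, since this section only shows that `fast` can stand in for `faster`. The helper name `context_found` is hypothetical.

```python
def context_found(context_tokens, keywords, complete_match):
    """Return True if any keyword is found among the context tokens."""
    if complete_match:
        # Exact comparison: "fast" is not the same token as "faster".
        return any(keyword in context_tokens for keyword in keywords)
    # Partial comparison (assumed): the keyword only needs to occur inside a token.
    return any(keyword in token for keyword in keywords for token in context_tokens)

after_xyz = ["Corporation", "is", "growing", "slightly", "faster"]
print(context_found(after_xyz, ["fast"], complete_match=True))   # False
print(context_found(after_xyz, ["fast"], complete_match=False))  # True
```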

Here's the sentence again: ***At birth, the typical XYZ Corporation is growing slightly faster than the typical ABC Inc., but growth rates become equal at about seven months.***

The last 2 properties related to context awareness are `contextException` and `exceptionDistance`. Together they rule out matches based on a given exception:

```
{
  "entity": "Company",
  "ruleScope": "sentence",
  "regex": "XYZ|ABC",
  "contextLength": 50,
  "prefix": ["birth"],
  "suffix": ["faster", "rates"],
  "contextException": ["At"],
  "exceptionDistance": 5
}
```

`OUTPUT: [ABC]`

Here we've asked the parser to ignore a match if the token "`At`" is within 5 tokens of the matched regex. This caused the token "`XYZ`" to be ignored.
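
The exception filter on its own can be sketched as follows. This emulation leaves out the prefix and suffix checks, assumes tokens are runs of word characters, and counts `exception_distance` in tokens on both sides of the match; all names are hypothetical and the real annotator may measure distance differently.

```python
import re

def drop_exceptions(text, pattern, exceptions, exception_distance):
    """Discard regex matches that have an exception word within the given token distance."""
    tokens = re.findall(r"\w+", text)  # assumed tokenization
    kept = []
    for i, token in enumerate(tokens):
        if not re.fullmatch(pattern, token):
            continue
        start = max(0, i - exception_distance)
        window = tokens[start:i] + tokens[i + 1:i + 1 + exception_distance]
        if any(word in window for word in exceptions):
            continue  # an exception word is too close, so the match is ignored
        kept.append(token)
    return kept

text = ("At birth, the typical XYZ Corporation is growing slightly faster than "
        "the typical ABC Inc., but growth rates become equal at about seven months.")

print(drop_exceptions(text, "XYZ|ABC", ["At"], 5))  # ['ABC']
```

The comparison is case sensitive here, so the lowercase "at" later in the sentence does not count as an exception.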

If the annotator's `setOptionalContextRules` parameter is set to `True`, regex matches are output regardless of whether the context (prefix and suffix configuration) matches.

When the `shortestContextMatch` parameter is set to `True`, the annotator stops looking for matches as soon as one of the prefix or suffix entries is found in the text.


📚Confidence Value Scenarios:
* When there is regex match only, the confidence value will be 0.5.
* When there are regex and prefix matches together, the confidence value will be > 0.5 depending on the distance between target token and the prefix.
* When there are regex and suffix matches together, the confidence value will be > 0.5 depending on the distance between target token and the suffix.
* When there are regex, prefix, and suffix matches all together, the confidence value will be greater than in the other scenarios.

##📌 3. Dictionary

Another key feature of the `ContextualParser` annotator is the use of dictionaries. You can specify a path to a dictionary in `tsv` or `csv` format using the `setDictionary()` parameter. Using a dictionary is useful when you have a list of exact words that you want the parser to pick up when processing some text.

###✔️ 3.1. Orientation

The first feature to be aware of when feeding dictionaries to the annotator is their format. The `ContextualParser` annotator accepts dictionaries in either a horizontal or a vertical format. This is how they would look in practice:

Horizontal:

| normalize | word1  | word2  | word3          |
|-----------|--------|--------|----------------|
| country   | US     | Spain  | India          |
| company   | Amazon | Google | John Snow Labs |



Vertical:

| country    | company |
|-----------|-----------|
| US     | Amazon     |
| India      | Google     |
| Spain      | John Snow Labs     | 

As you can see, your dictionary needs to have a `normalize` field that lets the annotator know which entity labels to use, and another field that lets the annotator know a list of words it should be looking to match. Here's how to set the format that your dictionary uses:

```
contextualParser = legal.ContextualParserApproach() \
    .setDictionary("dictionary.tsv", options={"orientation":"vertical"}) # default is horizontal
```
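
As a rough sketch of how the two layouts relate, the snippet below writes the same dictionary in both orientations with Python's standard `csv` module. The file names and the temporary directory are hypothetical, and leaving out the illustrative `normalize`/`word1` header row is an assumption based on the dictionaries written later in this notebook.

```python
import csv
import os
import tempfile
from itertools import zip_longest

# Horizontal layout: one row per label, with the label first and its words after it.
horizontal_rows = [
    ["country", "US", "Spain", "India"],
    ["company", "Amazon", "Google", "John Snow Labs"],
]

# Vertical layout: one column per label, so the rows are transposed.
vertical_rows = [list(column) for column in zip_longest(*horizontal_rows, fillvalue="")]

out_dir = tempfile.mkdtemp()  # hypothetical scratch location
for name, rows in [("horizontal.tsv", horizontal_rows), ("vertical.tsv", vertical_rows)]:
    with open(os.path.join(out_dir, name), "w", newline="") as f:
        csv.writer(f, delimiter="\t").writerows(rows)

print(vertical_rows[0])  # ['country', 'company']
```

Whichever file you pass to `setDictionary()`, the `orientation` option has to agree with the layout you wrote.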

###✔️ 3.2. Dictionary-related JSON Properties

📚When working with dictionaries, there are 2 properties in the JSON configuration file to be aware of:

- `ruleScope`
- `matchScope`

This is especially true when you have multi-word entities in your dictionary.

Let's take an example of a dictionary that contains a list of cities, sometimes made up of multiple words:

| normalize | word1 | word2 | word3     |
|-----------|-------|-------|-----------|
| City      | New York | Salt Lake City  | Washington      |




Let's say we're working with the following text: ***I love New York. Salt Lake City is nice too.***

With the following JSON properties, here's what you would get:

```
{
  "entity": "City",
  "ruleScope": "sentence",
  "matchScope": "sub-token"
}
```

`OUTPUT: []`

📚When `ruleScope` is set to `"sentence"`, the annotator attempts to find matches at the token level, parsing through each token in the sentence one by one, looking for a match with the dictionary items. Since `"New York"` and `"Salt Lake City"` are made up of multiple tokens, the annotator would never find a match from the dictionary. Let's change `ruleScope` to `"document"`:

```
{
  "entity": "City",
  "ruleScope": "document",
  "matchScope": "sub-token"
}
```

`OUTPUT: [New York, Salt Lake City]`

📚When `ruleScope` is set to `"document"`, the annotator attempts to find matches by parsing through each sentence in the document one by one, looking for a match with the dictionary items. Beware of how you set `matchScope`. Taking the previous example, if we were to set `matchScope` to `"token"` instead of `"sub-token"`, here's what would happen:

```
{
  "entity": "City",
  "ruleScope": "document",
  "matchScope": "token"
}
```

`OUTPUT: [I love New York., Salt Lake City is nice too.]`

As you can see, when `ruleScope` is at the document level, if you set your `matchScope` to the token level, the annotator will output each sentence containing the matched entities as individual chunks.

###✔️ 3.3. Working with Multi-Word Matches

📚Although not directly related to dictionaries, if we build on top of what we've just seen, there is a use-case that is particularly in demand when working with the `ContextualParser` annotator: finding regex matches for chunks of words that span across multiple tokens. 

Let's reiterate how the `ruleScope` property works: when `ruleScope` is set to `"sentence"`, we're looking for a match on each token of a sentence. When `ruleScope` is set to `"document"`, we're looking for a match on each sentence of a document.

So now let's imagine you're parsing through legal documents trying to tag the *John Snow* headers in those documents.

```
{
  "entity": "John Snow",
  "regex": "[jJ]ohn\\s+[sS]now",
  "ruleScope": "document",
  "matchScope": "sub-token"
}
```


`OUTPUT: [John Snow, john snow, John snow]`

If you had set `ruleScope` to  `"sentence"`, here's what would have happened:

```
{
  "entity": "John Snow",
  "regex": "[jJ]ohn\\s+[sS]now",
  "ruleScope": "sentence",
  "matchScope": "sub-token"
}
```

`OUTPUT: []`

Since John Snow is divided into two different tokens, the annotator will never find a match since it's now looking for a match on each token of a sentence.
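
The same effect can be reproduced with Python's `re` module. This is only an analogy for the two scopes, using a made-up sentence and whitespace tokenization; the pattern is written with the character classes `[jJ]` and `[sS]`, since a `|` inside square brackets would be matched literally.

```python
import re

pattern = re.compile(r"[jJ]ohn\s+[sS]now")

# Hypothetical sentence used only for this illustration.
sentence = "John Snow Labs was named after john snow."
tokens = sentence.rstrip(".").split()

# Token by token (like ruleScope "sentence"): no single token contains whitespace.
token_level = [token for token in tokens if pattern.search(token)]

# Whole sentence at once (like ruleScope "document"): multi-word matches are found.
sentence_level = pattern.findall(sentence)

print(token_level)     # []
print(sentence_level)  # ['John Snow', 'john snow']
```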

#🏃 Running a Pipeline

##🔎 Example 1: Detecting DOC, ALIAS, PARTY, and Subheaders in a Credit Agreement

Let's try running through some examples to build on top of what you've learned so far.

In [None]:
# Here's a credit agreement
sample_text = """
1.1 RESTATED CREDIT AGREEMENT
THIS TWELFTH AMENDMENT TO AMENDED AND RESTATED CREDIT AGREEMENT , ("Twelfth Amendment") is made as of the 27th day of December, 2007 , by
and between CULP , INC. , a North Carolina corporation (together with its
successors and permitted assigns, the "Borrower"), and WACHOVIA BANK , NATIONAL ASSOCIATION (formerly, Wachovia Bank , N.A ), a National banking association , as
Agent and as a Bank (together with its endorsees, successors and assigns, the "Bank" ).
"""


In [None]:
# Create a dictionary to detect date
date = '''date\n27th day of December, 2007'''

with open('date.tsv', 'w') as f:
    f.write(date)

# Check what dictionary looks like
!cat date.tsv

date
27th day of December, 2007

In [None]:
# Create JSON file
date= {
  "entity": "EFFDATE",
  "ruleScope": "document", 
  "matchScope":"sub-token",
  "completeMatchRegex": "true"
}

import json
with open('date.json', 'w') as f:
    json.dump(date, f)

In [None]:
# Create a dictionary to detect doc
doc= '''doc\nTWELFTH AMENDMENT TO AMENDED AND RESTATED CREDIT AGREEMENT'''

with open('doc.tsv', 'w') as f:
    f.write(doc)

# Check what dictionary looks like
!cat doc.tsv

doc
TWELFTH AMENDMENT TO AMENDED AND RESTATED CREDIT AGREEMENT

In [None]:
# Create JSON file
doc= {
  "entity": "Doc",
  "ruleScope": "document", 
  "matchScope":"sub-token",
  "completeMatchRegex": "true"
}

import json
with open('doc.json', 'w') as f:
    json.dump(doc, f)

In [None]:
# Create a dictionary to detect alias
alias = '''alias\nBorrower'''

with open('alias.tsv', 'w') as f:
    f.write(alias)

# Check what dictionary looks like
!cat alias.tsv

alias
Borrower

In [None]:
# Create JSON file
alias= {
  "entity": "ALIAS",
  "ruleScope": "document", 
  "matchScope":"sub-token",
  "completeMatchRegex": "true"
}

import json
with open('alias.json', 'w') as f:
    json.dump(alias, f)

In [None]:
# Create JSON file for sub header
sub_header = {
  "entity": "SUBHEADER",
  "ruleScope": "document", 
  "completeMatchRegex": "true",
  "regex": r"\d\.\d+\.?[A-Z-,; a-z]+",  # raw string keeps the regex backslashes intact
  "matchScope": "sub-token",
  "contextLength": 100
}
# Alternative patterns: \d\.+[A-Z ]+ matches headers, ^(\d\.?\d.*?)$ matches subheaders
import json
with open('sub_header.json', 'w') as f:
    json.dump(sub_header, f)
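
Before wiring a pattern into the annotator, it can help to try it with Python's `re` module first. The check below is only a sanity test of the regex itself on the first two lines of the sample text; the annotator's own matching may differ in details.

```python
import re

subheader_pattern = re.compile(r"\d\.\d+\.?[A-Z-,; a-z]+")

# First two lines of the sample credit agreement, including the leading newline.
excerpt = ("\n1.1 RESTATED CREDIT AGREEMENT\n"
           "THIS TWELFTH AMENDMENT TO AMENDED AND RESTATED CREDIT AGREEMENT")

# Collect (begin, inclusive end, text) for every match.
found = [(m.start(), m.end() - 1, m.group(0)) for m in subheader_pattern.finditer(excerpt)]
print(found)  # [(1, 29, '1.1 RESTATED CREDIT AGREEMENT')]
```

The match stops at the end of the first line because a newline is not part of the character class.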

In [None]:
# Create a dictionary to detect party
party = '''party\nCULP , INC. , a North Carolina corporation\nWACHOVIA BANK , NATIONAL ASSOCIATION'''

with open('party.tsv', 'w') as f:
    f.write(party)

# Check what dictionary looks like
!cat party.tsv

party
CULP , INC. , a North Carolina corporation
WACHOVIA BANK , NATIONAL ASSOCIATION

In [None]:
# Create JSON file
party= {
  "entity": "party",
  "ruleScope": "document", 
  "matchScope":"sub-token",
  "completeMatchRegex": "true"
}

import json
with open('party.json', 'w') as f:
    json.dump(party, f)

In [None]:
# Create a dictionary to detect former_name
former_name = '''former_name\nWachovia Bank , N.A'''

with open('former_name.tsv', 'w') as f:
    f.write(former_name)

# Check what dictionary looks like
!cat former_name.tsv

former_name
Wachovia Bank , N.A

In [None]:
# Create JSON file
former_name= {
  "entity": "former_name",
  "ruleScope": "document", 
  "matchScope":"sub-token",
  "completeMatchRegex": "true"
}

import json
with open('former_name.json', 'w') as f:
    json.dump(former_name, f)
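
The dictionary and JSON cells above all follow the same pattern, so they could be folded into one helper. The function below is a hypothetical convenience and not part of the library; it writes a vertical dictionary and a rule file with the same properties used in the cells above, here into a temporary directory.

```python
import json
import os
import tempfile

def write_dictionary_rule(directory, name, entity, terms):
    """Write <name>.tsv (vertical dictionary) and <name>.json (rule); return both paths."""
    tsv_path = os.path.join(directory, f"{name}.tsv")
    json_path = os.path.join(directory, f"{name}.json")
    # Vertical dictionary: the first line is the label, then one term per line.
    with open(tsv_path, "w") as f:
        f.write("\n".join([name, *terms]))
    rule = {
        "entity": entity,
        "ruleScope": "document",
        "matchScope": "sub-token",
        "completeMatchRegex": "true",
    }
    with open(json_path, "w") as f:
        json.dump(rule, f)
    return tsv_path, json_path

work_dir = tempfile.mkdtemp()  # hypothetical scratch location
tsv_path, json_path = write_dictionary_rule(work_dir, "alias", "ALIAS", ["Borrower"])
with open(tsv_path) as f:
    print(f.read())
```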

In [None]:
# Build pipeline
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# text_splitter = legal.TextSplitter() \
#     .setInputCols(["document"]) \
#     .setOutputCol("sentence")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

date_contextual_parser = legal.ContextualParserApproach() \
    .setInputCols(["document", "token"])\
    .setOutputCol("entity_date")\
    .setJsonPath("date.json")\
    .setDictionary('date.tsv', options={"orientation":"vertical"})\
    .setPrefixAndSuffixMatch(False)\
    .setShortestContextMatch(True)\
    .setOptionalContextRules(False) 

doc_contextual_parser = legal.ContextualParserApproach() \
    .setInputCols(["document", "token"])\
    .setOutputCol("entity_doc")\
    .setJsonPath("doc.json")\
    .setDictionary('doc.tsv', options={"orientation":"vertical"})\
    .setPrefixAndSuffixMatch(False)\
    .setShortestContextMatch(True)\
    .setOptionalContextRules(False)\
    .setCaseSensitive(True)

alias_contextual_parser = legal.ContextualParserApproach() \
    .setInputCols(["document", "token"])\
    .setOutputCol("entity_alias")\
    .setJsonPath("alias.json")\
    .setDictionary('alias.tsv', options={"orientation":"vertical"})\
    .setPrefixAndSuffixMatch(False)\
    .setShortestContextMatch(True)\
    .setOptionalContextRules(False)\
    .setCaseSensitive(True)

title_parser = legal.ContextualParserApproach() \
    .setInputCols(["document", "token"]) \
    .setOutputCol("title")\
    .setJsonPath("sub_header.json") \
    .setCaseSensitive(True) \
    .setPrefixAndSuffixMatch(False)\
    .setOptionalContextRules(False)

party_contextual_parser = legal.ContextualParserApproach() \
    .setInputCols(["document", "token"])\
    .setOutputCol("entity_party")\
    .setJsonPath("party.json")\
    .setDictionary('party.tsv', options={"orientation":"vertical"})\
    .setPrefixAndSuffixMatch(False)\
    .setShortestContextMatch(True)\
    .setOptionalContextRules(False)\
    .setCaseSensitive(True)

former_name_contextual_parser = legal.ContextualParserApproach() \
    .setInputCols(["document", "token"])\
    .setOutputCol("entity_former_name")\
    .setJsonPath("former_name.json")\
    .setDictionary('former_name.tsv', options={"orientation":"vertical"})\
    .setPrefixAndSuffixMatch(False)\
    .setShortestContextMatch(True)\
    .setOptionalContextRules(False)\
    .setCaseSensitive(True)

chunk_converter = legal.ChunkMergeApproach() \
    .setInputCols(["entity_date", "entity_doc","entity_alias",'title','entity_party','entity_former_name']) \
    .setOutputCol("ner_chunk")

parserPipeline = nlp.Pipeline(stages=[
        document_assembler, 
        tokenizer,
        doc_contextual_parser,
        date_contextual_parser,
        alias_contextual_parser,
        title_parser,
        party_contextual_parser,
        former_name_contextual_parser,
        chunk_converter,
        ])

In [None]:

# Fit the pipeline on an empty DataFrame and wrap the result in a LightPipeline
empty_data = spark.createDataFrame([[""]]).toDF("text")

parserModel = parserPipeline.fit(empty_data)

light_model = nlp.LightPipeline(parserModel)

In [None]:
# Annotate the sample text
annotations = light_model.fullAnnotate(sample_text)[0]

In [None]:
# Check outputs
annotations.get('ner_chunk')

[Annotation(chunk, 1, 29, 1.1 RESTATED CREDIT AGREEMENT, {'tokenIndex': '0', 'entity': 'SUBHEADER', 'field': 'SUBHEADER', 'chunk': '0', 'normalized': '', 'sentence': '0', 'confidenceValue': '0.50'}),
 Annotation(chunk, 36, 93, TWELFTH AMENDMENT TO AMENDED AND RESTATED CREDIT AGREEMENT, {'tokenIndex': '5', 'entity': 'Doc', 'field': 'Doc', 'chunk': '1', 'normalized': 'doc', 'sentence': '0', 'confidenceValue': '0.50'}),
 Annotation(chunk, 137, 162, 27th day of december, 2007, {'tokenIndex': '23', 'entity': 'EFFDATE', 'field': 'EFFDATE', 'chunk': '2', 'normalized': 'date', 'sentence': '0', 'confidenceValue': '0.50'}),
 Annotation(chunk, 181, 222, CULP , INC. , a North Carolina corporation, {'tokenIndex': '33', 'entity': 'party', 'field': 'party', 'chunk': '3', 'normalized': 'party', 'sentence': '0', 'confidenceValue': '0.50'}),
 Annotation(chunk, 282, 289, Borrower, {'tokenIndex': '53', 'entity': 'ALIAS', 'field': 'ALIAS', 'chunk': '4', 'normalized': 'alias', 'sentence': '0', 'confidenceVa
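
To read results like these more easily, the chunks can be flattened into plain tuples. The sketch below assumes each annotation exposes `begin`, `end`, `result`, and `metadata` attributes, as the printed output suggests; `chunks_to_rows` is a hypothetical helper, and stand-in objects built from the first two chunks are used so that it runs without a Spark session.

```python
from types import SimpleNamespace

def chunks_to_rows(chunks):
    """Turn annotation-like objects into (begin, end, text, entity, confidence) tuples."""
    return [
        (c.begin, c.end, c.result, c.metadata.get("entity"), c.metadata.get("confidenceValue"))
        for c in chunks
    ]

# Stand-ins mirroring the first two chunks printed above.
demo_chunks = [
    SimpleNamespace(begin=1, end=29, result="1.1 RESTATED CREDIT AGREEMENT",
                    metadata={"entity": "SUBHEADER", "confidenceValue": "0.50"}),
    SimpleNamespace(begin=36, end=93,
                    result="TWELFTH AMENDMENT TO AMENDED AND RESTATED CREDIT AGREEMENT",
                    metadata={"entity": "Doc", "confidenceValue": "0.50"}),
]

for row in chunks_to_rows(demo_chunks):
    print(row)
```

In the notebook you would pass `annotations.get('ner_chunk')` instead of `demo_chunks`.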

In [None]:
# Visualize outputs
# from sparknlp_display import NerVisualizer

visualiser = nlp.viz.NerVisualizer()

visualiser.display(annotations, label_col='ner_chunk', document_col='document', save_path="display_result.html")

Feel free to experiment with the annotator parameters and JSON properties to see how the output might change.