![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)


# **ContextualParserApproach**

This notebook covers the parameters and usage of the `ContextualParserApproach` annotator.

**📖 Learning Objectives:**

1. Understand how to use `ContextualParserApproach`.

2. Become comfortable using the different parameters of the annotator.

3. Train a `ContextualParserModel` based on pattern matching.


**🔗 Helpful Links:**

- Documentation : [ContextualParserApproach](https://nlp.johnsnowlabs.com/docs/en/licensed_annotators#contextualparser)

- Python Docs : [ContextualParserApproach](https://nlp.johnsnowlabs.com/licensed/api/python/reference/autosummary/sparknlp_jsl/annotator/context/contextual_parser/index.html#sparknlp_jsl.annotator.context.contextual_parser.ContextualParserApproach)

- Scala Docs : [ContextualParserApproach](https://nlp.johnsnowlabs.com/licensed/api/com/johnsnowlabs/nlp/annotators/context/ContextualParserApproach)

- For extended examples of usage, see the [Spark NLP Workshop repository](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/healthcare-nlp/).

## **📜 Background**


The `ContextualParser` annotator extracts entities from texts based on pattern matching. It provides more functionality than its open-source counterpart `EntityRuler` by allowing users to customize specific characteristics for pattern matching.

It allows setting regex rules for full and partial matches, a dictionary with normalizing options, and context parameters that take into account specific conditions such as token distances.

The `ContextualParserApproach` annotator learns the patterns given in a JSON, TSV, or CSV file to define a new `ContextualParserModel`.

## **🎬 Colab Setup**

In [2]:
!pip install -q johnsnowlabs


In [3]:
from johnsnowlabs import nlp


nlp.install(force_browser=True)


Downloading license...
Licenses extracted successfully
📋 Stored John Snow Labs License in /root/.johnsnowlabs/licenses/license_number_0_for_Spark-Healthcare_Spark-OCR.json
👷 Setting up  John Snow Labs home in /root/.johnsnowlabs, this might take a few minutes.
Downloading 🐍+🚀 Python Library spark_nlp-4.4.1-py2.py3-none-any.whl
Downloading 🐍+💊 Python Library spark_nlp_jsl-4.4.1-py3-none-any.whl
Downloading 🫘+🚀 Java Library spark-nlp-assembly-4.4.1.jar
Downloading 🫘+💊 Java Library spark-nlp-jsl-4.4.1.jar
🙆 JSL Home setup in /root/.johnsnowlabs
Installing /root/.johnsnowlabs/py_installs/spark_nlp_jsl-4.4.1-py3-none-any.whl to /usr/bin/python3
Installed 1 products:
💊 Spark-Healthcare==4.4.1 installed! ✅ Heal the planet with NLP! 


In [1]:
from johnsnowlabs import nlp, medical

spark = nlp.start()

📋 Loading license number 0 from /root/.johnsnowlabs/licenses/license_number_0_for_Spark-Healthcare_Spark-OCR.json
👌 Launched cpu optimized session with with: 🚀Spark-NLP==4.4.1, 💊Spark-Healthcare==4.4.1, running on ⚡ PySpark==3.1.2


## **🖨️ Input/Output Annotation Types**

- Input: `DOCUMENT`, `TOKEN`

- Output: `CHUNK`

## **🔎 Parameters**


- `inputCols`: The name of the columns containing the input annotations. It can read either a String column or an Array.
- `outputCol`: The name of the generated output column, which holds `CHUNK` annotations. Only one column can be specified here.
- `jsonPath`: Path to json file containing regex patterns and rules to match the entities.
- `dictionary`: Path to dictionary file in tsv or csv format.
- `caseSensitive`: Whether matching is case sensitive.
- `prefixAndSuffixMatch`: Whether to match both prefix and suffix to annotate the match.
- `optionalContextRules`: When set to true, it will output regex match regardless of context matches.
- `shortestContextMatch`: When set to true, it will stop looking for matches once prefix/suffix data is found in the text.
- `completeContextMatch`: Whether to do an exact match of prefix and suffix.

All the parameters can be set using the corresponding set method in camel case. For example, `.setInputCols()`.

### `inputCols` and `outputCol`

Define the column names containing the `DOCUMENT` and `TOKEN` annotations needed as input to the `ContextualParser` and the name of the new column containing the identified entities.

Let's define a pipeline to process raw texts into `DOCUMENT` and `TOKEN` annotations:

In [2]:
document_assembler = (
    nlp.DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")
)

tokenizer = (
    nlp.Tokenizer().setInputCols("document").setOutputCol("token")
)


Then, we use the defined column names of the previous stages to define the `ContextualParserApproach` input columns and define a name for the output column:

In [3]:
contextual_parser = (
    medical.ContextualParserApproach()
    .setInputCols(["document", "token"])
    .setOutputCol("entity")
)

### `jsonPath`

Defines the path to the JSON file containing the rules to match the entities on the text. The file needs to define the following information:

- `entity`: type of the entity that will be matched
- `ruleScope`: Scope to search the pattern. Can be one of the following:
  - `document`: Match the entities on the entire document (useful for matching multi-token entities or phrases).
  - `sentence`: Match the entities within sentences at the token level (match words).
- `regex`: The pattern to match (backslashes are escape characters in JSON, so for regex pattern "\d+" we need to write it out as "\\d+").
- `completeMatchRegex`: Whether to consider only the exact matches on full tokens. If set to `True`, the parameter `matchScope` is ignored.
- `matchScope`: The return level of the match:
  - `token`: Returns the entire token containing the matched rule.
  - `sub-token`: Returns the part of the token where the rule matches.
- `prefix`: List of prefixes to be considered in the matches.
- `suffix`: List of suffixes to be considered in the matches.
- `contextLength`: Maximum length to be used as context.
- `contextException`: List of exceptions on the context.
- `exceptionDistance`: Maximum distance of the exception (defaults to 40).
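
The note on backslashes in the `regex` field can be checked with plain Python. The sketch below uses a hypothetical rule (the `Number` entity is not part of this notebook) and only the standard `json` and `re` modules. It shows that `json.dump`/`json.dumps` writes the doubled backslash for you, so the escaping only has to be done by hand when the JSON file is edited manually.

```python
import json
import re

# Hypothetical rule for illustration only; in Python the pattern has ONE backslash.
rule = {"entity": "Number", "ruleScope": "sentence", "regex": r"\d+"}

# Serializing to JSON doubles the backslash, which is what the file must contain.
serialized = json.dumps(rule)
print(serialized)

# Reading the JSON back restores the original single-backslash pattern.
loaded = json.loads(serialized)
matches = re.findall(loaded["regex"], "Take 2 tablets for 10 days")
print(matches)
```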



Let's see how to use a JSON file in an example:

In [6]:
sample_text = """Peter Parker is a nice guy and lives in New York . Bruce Wayne is also a nice guy and lives in San Antonio and Gotham City . """

sample_df = spark.createDataFrame([[sample_text]]).toDF("text")
sample_df.show()


+--------------------+
|                text|
+--------------------+
|Peter Parker is a...|
+--------------------+



We will preprocess the example sentences with the required input annotations.

In [9]:
document_assembler = (
    nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
)

sentence_detector = (
    nlp.SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
)

tokenizer = nlp.Tokenizer().setInputCols(["sentence"]).setOutputCol("token")


preprocessPipeline = nlp.Pipeline(
    stages=[document_assembler, sentence_detector, tokenizer]
)


preprocessModel = preprocessPipeline.fit(sample_df)

In [10]:
processed_df = preprocessModel.transform(sample_df)
processed_df.show()

+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|
+--------------------+--------------------+--------------------+--------------------+
|Peter Parker is a...|[{document, 0, 12...|[{document, 0, 49...|[{token, 0, 4, Pe...|
+--------------------+--------------------+--------------------+--------------------+



Create a sample JSON file:

In [21]:
import json


cities = {
    "entity": "City",
    "ruleScope": "document",
    "matchScope": "sub-token",
    "completeMatchRegex": "false",
    "regex": "([A-Z]{1}[a-z]+ [A-Z]{1}[a-z]+)" # Find two consecutive words in title case.
}

with open("cities.json", "w") as f:
    json.dump(cities, f)

In [22]:
contextual_parser = (
    medical.ContextualParserApproach()
    .setInputCols(["sentence", "token"])
    .setOutputCol("entity")
    .setJsonPath("cities.json")
)

contextual_parser.fit(processed_df).transform(processed_df).select("entity.result").show(truncate=False)

+---------------------------------------------------------------+
|result                                                         |
+---------------------------------------------------------------+
|[Peter Parker, New York, Bruce Wayne, San Antonio, Gotham City]|
+---------------------------------------------------------------+
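
To make the difference between the two `matchScope` values more concrete, here is a minimal pure-Python sketch of the behavior described in the list above. It is not the annotator's implementation: the helper `match_scope`, the whitespace tokenization, and the example text are all assumptions made for illustration.

```python
import re


def match_scope(text, regex, scope):
    """Return matches per whitespace token, at 'token' or 'sub-token' level."""
    pattern = re.compile(regex)
    results = []
    for token in text.split():
        found = pattern.search(token)
        if found:
            # 'token' keeps the whole token; 'sub-token' keeps only the matched part.
            results.append(token if scope == "token" else found.group(0))
    return results


text = "Patient weighs 70kg and takes 500mg daily"
token_level = match_scope(text, r"\d+", "token")
sub_token_level = match_scope(text, r"\d+", "sub-token")
print(token_level)      # ['70kg', '500mg']
print(sub_token_level)  # ['70', '500']
```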



Using the context to find matches:

In [80]:
context_example = "At birth, the typical boy is growing slightly faster than the typical girl, but growth rates become equal at about seven months."
context_df = spark.createDataFrame([[context_example]]).toDF("text")
processed_context = preprocessModel.transform(context_df)


context_rules = {
  "entity": "Gender",
  "ruleScope": "sentence",
  "regex": "girl|boy",
  "contextLength": 50,
  "prefix": ["birth"],
  "suffix": ["faster", "rates"]
}

with open("context.json", "w") as f:
    json.dump(context_rules, f)

In [81]:
contextual_parser = (
    medical.ContextualParserApproach()
    .setInputCols(["sentence", "token"])
    .setOutputCol("entity")
    .setJsonPath("context.json")
)

contextual_parser.fit(processed_context).transform(processed_context).select("entity.result").show(truncate=False)

+-----------+
|result     |
+-----------+
|[boy, girl]|
+-----------+



### `dictionary`

The `dictionary` parameter can be used to define entities as a list of terms to be found in the text.

When setting the dictionary file, we can use the parameter `orientation` that indicates whether the file is to be read horizontally or vertically.

Horizontal:

| normalize | word1 | word2 | word3     |
|-----------|-------|-------|-----------|
| female    | woman | girl  | lady      |
| male      | man   | boy   | gentleman |


Vertical:

| female    | normalize |
|-----------|-----------|
| woman     | word1     |
| girl      | word2     |
| lady      | word3     | 

<br/>

The JSON path still needs to be set when a dictionary is used.
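
One way to read the two layouts above is sketched below in plain Python. This is an interpretation of the tables, not the library's loader: it assumes that in a horizontal file the first field of each row is the normalized value, and that in a vertical file the first entry of each column is the normalized value. The helper `load_dictionary` is hypothetical.

```python
import csv
import io


def load_dictionary(text, orientation="vertical", delimiter="\t"):
    """Map each dictionary term to its normalized value (assumed layout)."""
    rows = [row for row in csv.reader(io.StringIO(text), delimiter=delimiter) if row]
    if orientation == "vertical":
        # Transpose so that every column is handled like a horizontal row.
        rows = [list(column) for column in zip(*rows)]
    mapping = {}
    for row in rows:
        normalized, variants = row[0], row[1:]
        for variant in variants:
            if variant:
                mapping[variant] = normalized
    return mapping


horizontal = load_dictionary(
    "female\twoman\tgirl\tlady\nmale\tman\tboy\tgentleman", orientation="horizontal"
)
vertical = load_dictionary("City\nNew York\nGotham City\nSan Antonio", orientation="vertical")
print(horizontal["girl"])         # female
print(vertical["Gotham City"])    # City
```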

In [23]:
# Create a dictionary to detect cities
cities = """City\nNew York\nGotham City\nSan Antonio\nSalt Lake City"""


# TSV or CSV
with open('cities.tsv', 'w') as f:
    f.write(cities)

# Check what dictionary looks like
!cat cities.tsv

City
New York
Gotham City
San Antonio
Salt Lake City

In [25]:
contextual_parser = (
    medical.ContextualParserApproach()
    .setInputCols(["sentence", "token"])
    .setOutputCol("entity")
    .setJsonPath("cities.json")
    .setDictionary('cities.tsv', options={"orientation": "vertical"})
)

contextual_parser.fit(processed_df).transform(processed_df).select("entity.result").show(truncate=False)

+------------------------------------+
|result                              |
+------------------------------------+
|[new york, san antonio, gotham city]|
+------------------------------------+



### `caseSensitive`

Defines whether the matches should be case sensitive or not.

In [43]:
contextual_parser = (
    medical.ContextualParserApproach()
    .setInputCols(["sentence", "token"])
    .setOutputCol("entity")
    .setJsonPath("cities.json")
    .setCaseSensitive(False)
    .setDictionary('cities.tsv', options={"orientation": "vertical"})
)

contextual_parser.fit(processed_df).transform(processed_df).select("entity.result").show(truncate=False)

+------------------------------------+
|result                              |
+------------------------------------+
|[new york, san antonio, gotham city]|
+------------------------------------+



In [44]:
contextual_parser = (
    medical.ContextualParserApproach()
    .setInputCols(["sentence", "token"])
    .setOutputCol("entity")
    .setJsonPath("cities.json")
    .setCaseSensitive(True)
    .setDictionary('cities.tsv', options={"orientation": "vertical"})
)

contextual_parser.fit(processed_df).transform(processed_df).select("entity.result").show(truncate=False)

+------------------------------------+
|result                              |
+------------------------------------+
|[New York, San Antonio, Gotham City]|
+------------------------------------+
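
The effect of case sensitivity on term lookup can be illustrated with the standard `re` module. This is only a sketch of the general idea, with a hypothetical helper `find_terms` and example sentence. Unlike the annotator output shown above, which lowercases results in case-insensitive mode, this sketch returns the text exactly as it appears in the input.

```python
import re


def find_terms(text, terms, case_sensitive):
    """Find dictionary terms in text, optionally ignoring case."""
    flags = 0 if case_sensitive else re.IGNORECASE
    pattern = re.compile("|".join(re.escape(term) for term in terms), flags)
    return [found.group(0) for found in pattern.finditer(text)]


terms = ["New York", "Gotham City"]
text = "He moved from new york to Gotham City"
sensitive = find_terms(text, terms, case_sensitive=True)
insensitive = find_terms(text, terms, case_sensitive=False)
print(sensitive)    # ['Gotham City']
print(insensitive)  # ['new york', 'Gotham City']
```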



### `prefixAndSuffixMatch`

Whether both a prefix and a suffix must be matched to annotate the match. When set to `False`, a prefix or a suffix alone is enough.

In [83]:
contextual_parser = (
    medical.ContextualParserApproach()
    .setInputCols(["sentence", "token"])
    .setOutputCol("entity")
    .setPrefixAndSuffixMatch(True) 
    .setJsonPath("context.json")
)

contextual_parser.fit(processed_context).transform(processed_context).select("entity.result").show(truncate=False)

+------+
|result|
+------+
|[boy] |
+------+



In [84]:
contextual_parser = (
    medical.ContextualParserApproach()
    .setInputCols(["sentence", "token"])
    .setOutputCol("entity")
    .setPrefixAndSuffixMatch(False) 
    .setJsonPath("context.json")
)

contextual_parser.fit(processed_context).transform(processed_context).select("entity.result").show(truncate=False)

+-----------+
|result     |
+-----------+
|[boy, girl]|
+-----------+
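
The two results above can be reproduced with a small pure-Python sketch of context matching. It rests on assumptions that are not confirmed by this notebook: that `contextLength` is counted in characters on each side of the match, and that prefixes and suffixes are found by substring search. The helper `match_with_context` is hypothetical and is not the annotator's algorithm.

```python
import re


def match_with_context(text, regex, prefixes, suffixes, context_length, require_both=False):
    """Keep regex matches that have a prefix before and/or a suffix after them."""
    results = []
    for found in re.finditer(regex, text):
        # Assumed: the context window is measured in characters.
        left = text[max(0, found.start() - context_length):found.start()]
        right = text[found.end():found.end() + context_length]
        has_prefix = any(prefix in left for prefix in prefixes)
        has_suffix = any(suffix in right for suffix in suffixes)
        if require_both:
            keep = has_prefix and has_suffix
        else:
            keep = has_prefix or has_suffix
        if keep:
            results.append(found.group(0))
    return results


text = (
    "At birth, the typical boy is growing slightly faster than the typical girl, "
    "but growth rates become equal at about seven months."
)
either = match_with_context(text, "girl|boy", ["birth"], ["faster", "rates"], 50)
both = match_with_context(text, "girl|boy", ["birth"], ["faster", "rates"], 50, require_both=True)
print(either)  # ['boy', 'girl']
print(both)    # ['boy']
```

Under these assumptions, "girl" is dropped when both sides are required because "birth" lies more than 50 characters before it, while "rates" still falls inside its right-hand window.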



### `optionalContextRules`

When set to true, it will output regex match regardless of context matches.

In [91]:
contextual_parser = (
    medical.ContextualParserApproach()
    .setInputCols(["sentence", "token"])
    .setOutputCol("entity")
    .setPrefixAndSuffixMatch(True) 
    .setJsonPath("context.json")
    .setOptionalContextRules(True)
)

contextual_parser.fit(processed_context).transform(processed_context).select("entity.result").show(truncate=False)

+-----------+
|result     |
+-----------+
|[boy, girl]|
+-----------+



In [92]:
contextual_parser = (
    medical.ContextualParserApproach()
    .setInputCols(["sentence", "token"])
    .setOutputCol("entity")
    .setPrefixAndSuffixMatch(True) 
    .setJsonPath("context.json")
    .setOptionalContextRules(False)
)

contextual_parser.fit(processed_context).transform(processed_context).select("entity.result").show(truncate=False)

+------+
|result|
+------+
|[boy] |
+------+



### `shortestContextMatch`

When set to true, it will stop looking for matches once prefix/suffix data is found in the text.

In [95]:
contextual_parser = (
    medical.ContextualParserApproach()
    .setInputCols(["sentence", "token"])
    .setOutputCol("entity")
    .setPrefixAndSuffixMatch(False) 
    .setJsonPath("context.json")
    .setShortestContextMatch(False)
)

contextual_parser.fit(processed_context).transform(processed_context).select("entity.result").show(truncate=False)

+-----------+
|result     |
+-----------+
|[boy, girl]|
+-----------+



In [96]:
contextual_parser = (
    medical.ContextualParserApproach()
    .setInputCols(["sentence", "token"])
    .setOutputCol("entity")
    .setPrefixAndSuffixMatch(False) 
    .setJsonPath("context.json")
    .setShortestContextMatch(True)
)

contextual_parser.fit(processed_context).transform(processed_context).select("entity.result").show(truncate=False)

+-----------+
|result     |
+-----------+
|[boy, girl]|
+-----------+



### `completeContextMatch`

Whether to do an exact match of prefix and suffix on the entire context or not.

In [98]:
context_rules_complete = {
  "entity": "Gender",
  "ruleScope": "sentence",
  "regex": "girl|boy",
  "contextLength": 50,
  "prefix": ["birth"],
  "suffix": ["fast"]
}

with open("context_complete.json", "w") as f:
    json.dump(context_rules_complete, f)

In [99]:
contextual_parser = (
    medical.ContextualParserApproach()
    .setInputCols(["sentence", "token"])
    .setOutputCol("entity")
    .setPrefixAndSuffixMatch(True) 
    .setJsonPath("context_complete.json")
    .setCompleteContextMatch(False)
)

contextual_parser.fit(processed_context).transform(processed_context).select("entity.result").show(truncate=False)

+------+
|result|
+------+
|[boy] |
+------+



In [100]:
contextual_parser = (
    medical.ContextualParserApproach()
    .setInputCols(["sentence", "token"])
    .setOutputCol("entity")
    .setPrefixAndSuffixMatch(True) 
    .setJsonPath("context_complete.json")
    .setCompleteContextMatch(True)
)

contextual_parser.fit(processed_context).transform(processed_context).select("entity.result").show(truncate=False)

+------+
|result|
+------+
|[]    |
+------+
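
The difference between the two runs above comes down to partial versus exact matching of the context word: the suffix `fast` is a substring of `faster`, but it is not a complete word in the sentence. The sketch below illustrates that distinction in plain Python. The helper `context_found` and the word-level comparison are assumptions inferred from the outputs, not the annotator's actual logic.

```python
import re


def context_found(window, keyword, complete_match):
    """Check a context window for a keyword, as a whole word or as a substring."""
    if complete_match:
        # Exact match: the keyword must equal one of the words in the window.
        return keyword in re.findall(r"\w+", window)
    # Partial match: the keyword may appear inside a longer word.
    return keyword in window


window = " is growing slightly faster than the typical girl,"
partial = context_found(window, "fast", complete_match=False)
exact = context_found(window, "fast", complete_match=True)
print(partial)  # True
print(exact)    # False
```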

