![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# **NerChunker**

This notebook covers the usage of `NerChunker` and its `setRegexParsers` parameter. This annotator extracts phrases that fit a known pattern, using the NER tags.




**📖 Learning Objectives:**

1. Understand how `NerChunker` works.

2. Become comfortable using the `setRegexParsers` parameter of the annotator.


**🔗 Helpful Links:**

- Documentation : [NerChunker](https://nlp.johnsnowlabs.com/docs/en/licensed_annotators#nerchunker)

- Python Docs : [NerChunker](https://nlp.johnsnowlabs.com/licensed/api/python/reference/autosummary/sparknlp_jsl/annotator/ner/ner_chunker/index.html#sparknlp_jsl.annotator.ner.ner_chunker.NerChunker)

- Scala Docs : [NerChunker](https://nlp.johnsnowlabs.com/licensed/api/com/johnsnowlabs/nlp/annotators/ner/NerChunker.html)

- For extended examples of usage, see [Spark NLP Workshop repository](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/01.0.Clinical_Named_Entity_Recognition_Model.ipynb).


## **📜 Background**

The `NerChunker` annotator is a component of the Spark NLP library that chunks named entities that fit a pattern defined with the `setRegexParsers` parameter.

Named Entity Recognition (NER) is the process of identifying named entities such as people, organizations, locations, and other entities in unstructured text data. Chunking is the process of grouping together contiguous tokens in a sentence based on their relationships.
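To make the two ideas concrete, here is a minimal plain-Python sketch (not Spark NLP code) that groups contiguous IOB-tagged tokens into entity spans. The function name `group_iob` and the toy input are assumptions made for illustration only.

```python
from typing import List, Optional, Tuple


def group_iob(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group contiguous IOB-tagged tokens into (label, text) spans."""
    spans: List[Tuple[str, str]] = []
    label: Optional[str] = None  # label of the span being built, if any
    words: List[str] = []
    for token, tag in zip(tokens, tags):
        prefix, _, name = tag.partition("-")
        if prefix == "I" and name == label:
            words.append(token)  # continue the open span
            continue
        if label is not None:
            spans.append((label, " ".join(words)))  # close the open span
        # "B-X" (or a stray "I-X") opens a new span; "O" opens nothing.
        label, words = (name, [token]) if prefix in ("B", "I") else (None, [])
    if label is not None:
        spans.append((label, " ".join(words)))
    return spans


tokens = ["1", "capsule", "of", "Advil", "for", "5", "days", "."]
tags = ["B-DOSAGE", "B-FORM", "O", "B-DRUG",
        "B-DURATION", "I-DURATION", "I-DURATION", "O"]
entities = group_iob(tokens, tags)
print(entities)
# [('DOSAGE', '1'), ('FORM', 'capsule'), ('DRUG', 'Advil'), ('DURATION', 'for 5 days')]
```

Here NER supplies the per-token tags, and chunking is the step that merges `for`, `5`, and `days` into one `DURATION` span.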

The `NerChunker` annotator in Spark NLP builds on both tasks: it takes the NER tags produced by an upstream model trained on annotated data, and then groups the tagged tokens into chunks based on their entity type and position in the sentence.

The output of the `NerChunker` annotator is a set of `CHUNK` annotations, one for each phrase that matches a pattern, together with the start/end offsets and metadata that annotations carry. This can be useful for a variety of natural language processing tasks such as information extraction, entity linking, and text classification.

## **🎬 Colab Setup**

In [1]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs 


In [2]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

Please Upload your John Snow Labs License using the button below


Saving 4.4.3.spark_nlp_for_healthcare.json to 4.4.3.spark_nlp_for_healthcare.json


In [None]:
from johnsnowlabs import nlp, medical, visual

# After uploading your license, run this to install all licensed Python wheels and pre-download the jars for the Spark session JVM
nlp.install()

In [None]:
from johnsnowlabs import nlp, medical, visual
import pandas as pd

# Automatically load license data and start a session with all jars user has access to
spark = nlp.start()

👌 Detected license file /content/4.4.1.spark_nlp_for_healthcare.json
👌 Launched cpu optimized session with with: 🚀Spark-NLP==4.4.1, 💊Spark-Healthcare==4.4.1, running on ⚡ PySpark==3.1.2


In [None]:
from pyspark.sql import DataFrame
import pyspark.sql.functions as F
import pyspark.sql.types as T
import pyspark.sql as SQL
from pyspark import keyword_only

## **🖨️ Input/Output Annotation Types**

- Input: `DOCUMENT`, `NAMED_ENTITY`

- Output: `CHUNK`

## **🔎 Parameters**

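- `setRegexParsers`: Array of grammar-based chunk parsers, i.e. regex patterns written over NER labels in angle brackets (for example, `<DRUG>.*<FREQUENCY>`).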




## **💻 Pipeline**

Let us define a pipeline for extracting posology-related entities using the [ner_posology](https://nlp.johnsnowlabs.com/2020/04/15/ner_posology_en.html) model.

This model will extract the following entities:

`DOSAGE`, `DRUG`, `DURATION`, `FORM`, `FREQUENCY`, `ROUTE`, `STRENGTH`

In [None]:
# Annotator that transforms a text column from dataframe into an Annotation ready for NLP
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")
        
sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")
 
# Tokenizer splits words in a relevant format for NLP
tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Clinical word embeddings trained on the PubMed dataset
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")\
    .setInputCols(["sentence","token"])\
    .setOutputCol("embeddings")

# NER model trained for extracting entities related to posology
posology_ner = medical.NerModel.pretrained("ner_posology", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler, 
    sentenceDetector,
    tokenizer,
    word_embeddings,
    posology_ner])

empty_data = spark.createDataFrame([[""]]).toDF("text")

ner_model = nlpPipeline.fit(empty_data)

sentence_detector_dl_healthcare download started this may take some time.
Approximate size to download 367.3 KB
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_posology download started this may take some time.
[OK!]


Define a sample text about the usage of drugs, convert the text to a PySpark DataFrame, and get predictions for posology-related entity extraction with `.transform`.

In [None]:
sample_text = """The patient was prescribed 1 capsule of Advil for 5 days. 
He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night, 12 units of insulin lispro with meals, metformin 1000 mg two times a day."""

data = spark.createDataFrame([[sample_text]]).toDF("text")

result = ner_model.transform(data)

result.select('text', 'token.result', 'ner.result').show(truncate = 60)

+------------------------------------------------------------+------------------------------------------------------------+------------------------------------------------------------+
|                                                        text|                                                      result|                                                      result|
+------------------------------------------------------------+------------------------------------------------------------+------------------------------------------------------------+
|The patient was prescribed 1 capsule of Advil for 5 days....|[The, patient, was, prescribed, 1, capsule, of, Advil, fo...|[O, O, O, O, B-DOSAGE, B-FORM, O, B-DRUG, B-DURATION, I-D...|
+------------------------------------------------------------+------------------------------------------------------------+------------------------------------------------------------+



The result in the dataframe below shows the entities extracted by the posology model.

In [None]:
result.select('ner.result').show(truncate = False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                                                                                                                                             |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Explode the results so that each row shows a token, the label predicted by the model, and the confidence score. This gives a clearer picture of the predictions.

In [4]:
result.select(F.explode(F.arrays_zip(result.token.result, 
                                              result.ner.result, 
                                              result.ner.metadata)).alias("cols")) \
               .select(F.expr("cols['0']").alias("token"),
                       F.expr("cols['1']").alias("ner_label"),
                       F.expr("cols['2']['confidence']").alias("confidence"))\
               .filter("ner_label!='O'")\
               .show(30, truncate=100)
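
As a rough mental model of the cell above, `arrays_zip` pairs the i-th elements of the token, label, and metadata arrays, `explode` emits one row per pair, and the filter drops tokens tagged `O`. The plain-Python sketch below mirrors that logic on a few tokens from the sample text; the confidence values are placeholders invented for this sketch and are not model output.

```python
tokens = ["1", "capsule", "of", "Advil"]
labels = ["B-DOSAGE", "B-FORM", "O", "B-DRUG"]
# Placeholder confidences invented for this sketch; they are not model output.
metadata = [
    {"confidence": "0.9"},
    {"confidence": "0.8"},
    {"confidence": "0.7"},
    {"confidence": "0.6"},
]

# zip pairs the i-th elements (like arrays_zip); the comprehension
# yields one dict per pair (like explode followed by select).
rows = [
    {"token": tok, "ner_label": lab, "confidence": meta["confidence"]}
    for tok, lab, meta in zip(tokens, labels, metadata)
    if lab != "O"  # mirrors .filter("ner_label!='O'")
]

for row in rows:
    print(row)
```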


### `setRegexParsers`

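This parameter defines a list of regex patterns used to match chunks. In the example below, the pattern is written over NER labels enclosed in angle brackets, such as `<DRUG>`.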


Let's say we want to extract `DRUG` and `FREQUENCY` together as a single chunk even if there are some unwanted tokens between them. 

In [None]:
# To extract drug and frequency together as a single chunk even if there are some unwanted tokens between them.
ner_chunker = medical.NerChunker()\
    .setInputCols(["sentence","ner"])\
    .setOutputCol("ner_chunk")\
    .setRegexParsers(["<DRUG>.*<FREQUENCY>"])

nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler, 
    sentenceDetector,
    tokenizer,
    word_embeddings,
    posology_ner,
    ner_chunker])

ner_chunker_model = nlpPipeline.fit(empty_data)
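
One way to picture what a pattern such as `<DRUG>.*<FREQUENCY>` does is to imagine every token replaced by its label in angle brackets and the regex applied to that string. The sketch below is a plain-Python approximation under that assumption. It is not the annotator's actual implementation, and `chunk_by_tag_pattern` is a hypothetical helper name.

```python
import re
from bisect import bisect_left, bisect_right
from typing import List


def chunk_by_tag_pattern(tokens: List[str], tags: List[str], pattern: str) -> List[str]:
    """Match `pattern` against a string of <LABEL> items, one per token."""
    # Strip the B-/I- prefix so "B-DRUG" and "I-DRUG" both become "<DRUG>".
    items = [f"<{tag.split('-', 1)[-1]}>" for tag in tags]
    tag_string = "".join(items)

    # starts[i] is the character offset of token i's item in tag_string.
    starts, offset = [], 0
    for item in items:
        starts.append(offset)
        offset += len(item)

    chunks = []
    for match in re.finditer(pattern, tag_string):
        if match.end() == match.start():
            continue  # ignore empty matches
        first = bisect_right(starts, match.start()) - 1
        last = bisect_left(starts, match.end()) - 1
        # Joining with spaces is a simplification of recovering the original text.
        chunks.append(" ".join(tokens[first:last + 1]))
    return chunks


# A shortened version of the second sample sentence, with its predicted tags.
tokens = ["40", "units", "of", "insulin", "glargine", "at", "night", ",",
          "metformin", "1000", "mg", "two", "times", "a", "day", "."]
tags = ["B-DOSAGE", "I-DOSAGE", "O", "B-DRUG", "I-DRUG", "B-FREQUENCY",
        "I-FREQUENCY", "O", "B-DRUG", "B-STRENGTH", "I-STRENGTH",
        "B-FREQUENCY", "I-FREQUENCY", "I-FREQUENCY", "I-FREQUENCY", "O"]

chunks = chunk_by_tag_pattern(tokens, tags, r"<DRUG>.*<FREQUENCY>")
print(chunks)
# ['insulin glargine at night , metformin 1000 mg two times a day']
```

In this sketch the greedy `.*` stretches the match from the first `DRUG` token to the last `FREQUENCY` token, so everything in between ends up in one chunk.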

In [None]:
sample_text = """The patient was prescribed 1 capsule of Advil for 5 days . He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night, 12 units of insulin lispro with meals , metformin 1000 mg two times a day."""

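# Note: `data` was built from the earlier `sample_text`; the redefined text above is used with LightPipeline below.
result = ner_chunker_model.transform(data)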

In [None]:
result.select('ner_chunk.result').show(truncate = False)

+-----------------------------------------------------------------------------------------------------+
|result                                                                                               |
+-----------------------------------------------------------------------------------------------------+
|[insulin glargine at night, 12 units of insulin lispro with meals, metformin 1000 mg two times a day]|
+-----------------------------------------------------------------------------------------------------+



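The result shows the chunk matched by the pattern passed to `setRegexParsers`, including all the tokens between the two entity types, in this case `DRUG` and `FREQUENCY`.

Although the commas in the display may suggest several items, this is a single chunk: it runs from the first `DRUG` to the last `FREQUENCY` in the sentence.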
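
The reach of the match comes down to ordinary regex greediness. The sketch below uses Python's `re` on a hand-written tag string (an assumption for illustration, with one `<LABEL>` item per token) to compare a greedy and a lazy quantifier. Whether `NerChunker` treats lazy quantifiers the same way is not shown in this notebook, so read it only as a demonstration of regex behaviour.

```python
import re

# Hand-written tag string for illustration: one <LABEL> item per token.
tag_string = (
    "<DRUG><DRUG><FREQUENCY><FREQUENCY><O>"
    "<DRUG><STRENGTH><STRENGTH><FREQUENCY>"
)

# Greedy: `.*` consumes as much as possible, so one match covers everything
# from the first <DRUG> to the last <FREQUENCY>.
greedy = re.findall(r"<DRUG>.*<FREQUENCY>", tag_string)

# Lazy: `.*?` stops at the first <FREQUENCY> item it can reach.
lazy = re.findall(r"<DRUG>.*?<FREQUENCY>", tag_string)

print(len(greedy))  # 1
print(lazy)
# ['<DRUG><DRUG><FREQUENCY>', '<DRUG><STRENGTH><STRENGTH><FREQUENCY>']
```

Note that the lazy variant also cuts each match after the first `<FREQUENCY>` item, so a multi-token frequency would be truncated in this model.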

### LightPipeline

Let’s use `LightPipeline` here to extract the entities. 

[LightPipeline](https://nlp.johnsnowlabs.com/docs/en/concepts#using-spark-nlps-lightpipeline) is a Spark NLP specific Pipeline class equivalent to the Spark ML Pipeline, which achieves fast results when dealing with small amounts of data.

In [None]:
light_model = nlp.LightPipeline(ner_chunker_model)

light_result = light_model.annotate(sample_text)

list(zip(light_result['token'], light_result['ner']))

[('The', 'O'),
 ('patient', 'O'),
 ('was', 'O'),
 ('prescribed', 'O'),
 ('1', 'B-DOSAGE'),
 ('capsule', 'B-FORM'),
 ('of', 'O'),
 ('Advil', 'B-DRUG'),
 ('for', 'B-DURATION'),
 ('5', 'I-DURATION'),
 ('days', 'I-DURATION'),
 ('.', 'O'),
 ('He', 'O'),
 ('was', 'O'),
 ('seen', 'O'),
 ('by', 'O'),
 ('the', 'O'),
 ('endocrinology', 'O'),
 ('service', 'O'),
 ('and', 'O'),
 ('she', 'O'),
 ('was', 'O'),
 ('discharged', 'O'),
 ('on', 'O'),
 ('40', 'B-DOSAGE'),
 ('units', 'I-DOSAGE'),
 ('of', 'O'),
 ('insulin', 'B-DRUG'),
 ('glargine', 'I-DRUG'),
 ('at', 'B-FREQUENCY'),
 ('night', 'I-FREQUENCY'),
 (',', 'O'),
 ('12', 'B-DOSAGE'),
 ('units', 'I-DOSAGE'),
 ('of', 'O'),
 ('insulin', 'B-DRUG'),
 ('lispro', 'I-DRUG'),
 ('with', 'B-FREQUENCY'),
 ('meals', 'I-FREQUENCY'),
 (',', 'O'),
 ('metformin', 'B-DRUG'),
 ('1000', 'B-STRENGTH'),
 ('mg', 'I-STRENGTH'),
 ('two', 'B-FREQUENCY'),
 ('times', 'I-FREQUENCY'),
 ('a', 'I-FREQUENCY'),
 ('day', 'I-FREQUENCY'),
 ('.', 'O')]

In [None]:
light_result["ner_chunk"]

['insulin glargine at night, 12 units of insulin lispro with meals , metformin 1000 mg two times a day']

`LightPipeline` returns the same single chunk: it starts at the first `DRUG` and ends at the last `FREQUENCY` of the second sentence, including all the tokens in between.