![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# **ChunkFilterer**

This notebook will cover the different parameters and usages of `ChunkFilterer`. This annotator provides the ability to filter entities coming from CHUNK annotations.

**📖 Learning Objectives:**

1. Understand how to set filters via a white list of terms or a regular expression.

2. Become comfortable using the different parameters of the `ChunkFilterer`.


**🔗 Helpful Links:**

- Documentation : [ChunkFilterer](https://nlp.johnsnowlabs.com/docs/en/licensed_annotators#chunkfilterer)

- Python Docs : [ChunkFilterer](https://nlp.johnsnowlabs.com/licensed/api/python/reference/autosummary/sparknlp_jsl/annotator/chunker/chunker_filterer/index.html#sparknlp_jsl.annotator.chunker.chunker_filterer.ChunkFilterer)

- Scala Docs : [ChunkFilterer](https://nlp.johnsnowlabs.com/docs/en/licensed_annotators#chunkfilterer)

- For extended usage examples, see the [Clinical Named Entity Recognition Model notebook](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb#scrollTo=YgSZ4ghNlDbV) in the Spark NLP Workshop repository.

## **📜 Background**


`ChunkFilterer` filters named-entity chunks by a condition (a regular expression) or by a predefined look-up list (a white list or a black list), so that only the chunks you care about are passed on to downstream annotators such as Assertion Status or Entity Resolvers.

## **🎬 Colab Setup**

In [None]:
!pip install -q johnsnowlabs


In [None]:
from johnsnowlabs import nlp


nlp.install(force_browser=True)

Downloading license...
Licenses extracted successfully
📋 Stored John Snow Labs License in /root/.johnsnowlabs/licenses/license_number_0_for_Spark-Healthcare_Spark-OCR.json
👷 Setting up  John Snow Labs home in /root/.johnsnowlabs, this might take a few minutes.
Downloading 🐍+🚀 Python Library spark_nlp-5.1.1-py2.py3-none-any.whl
Downloading 🐍+💊 Python Library spark_nlp_jsl-5.1.1-py3-none-any.whl
Downloading 🫘+🚀 Java Library spark-nlp-assembly-5.1.1.jar
Downloading 🫘+💊 Java Library spark-nlp-jsl-5.1.1.jar
🙆 JSL Home setup in /root/.johnsnowlabs
Installing /root/.johnsnowlabs/py_installs/spark_nlp_jsl-5.1.1-py3-none-any.whl to /usr/bin/python3
Installed 1 products:
💊 Spark-Healthcare==5.1.1 installed! ✅ Heal the planet with NLP! 


In [None]:
from johnsnowlabs import nlp, medical

spark = nlp.start()

👌 Launched cpu optimized session with with: 🚀Spark-NLP==5.1.1, 💊Spark-Healthcare==5.1.1, running on ⚡ PySpark==3.1.2


## **🖨️ Input/Output Annotation Types**

- Input: `DOCUMENT` , `CHUNK`

- Output: `CHUNK`

## **🔎 Parameters**


- `whiteList` : (list) If defined, list of entities to process. The rest will be ignored.

- `blackList` : (list) If defined, list of entities to ignore. The rest will be processed.

- `regex` : (list) If defined, list of regex to process the chunks (Default: []).

- `criteria`: (str) Tag that selects how the chunks are filtered. Possible values are:
    - `isin`: filter by the chunk text, using `whiteList` or `blackList`
    - `regex`: filter using a regular expression

- `entitiesConfidence` : (str) Path to a CSV file with (entity, confidenceThreshold) pairs. Chunks whose entity confidence is lower than the threshold for that entity are filtered out.
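To make the `entitiesConfidence` rule concrete, here is a minimal plain-Python sketch of the thresholding described above. It does not call Spark NLP. The CSV contents, the `load_thresholds` and `filter_by_confidence` helpers, the made-up confidence values, and the choice to keep entities that have no threshold are all assumptions for illustration.

```python
import csv
import io

# Hypothetical CSV contents: one (entity, confidenceThreshold) pair per line.
THRESHOLDS_CSV = "DRUG,0.90\nDURATION,0.80\n"


def load_thresholds(csv_text):
    """Parse (entity, threshold) pairs into a dict."""
    reader = csv.reader(io.StringIO(csv_text))
    return {entity.strip(): float(threshold) for entity, threshold in reader}


def filter_by_confidence(chunks, thresholds):
    """Drop chunks whose confidence is below the threshold for their entity.

    Assumption: entities without a threshold are kept unchanged.
    """
    kept = []
    for chunk in chunks:
        threshold = thresholds.get(chunk["entity"])
        if threshold is None or chunk["confidence"] >= threshold:
            kept.append(chunk)
    return kept


# Illustrative chunks; the confidence values are made up.
chunks = [
    {"text": "Advil", "entity": "DRUG", "confidence": 0.997},
    {"text": "for 5 days", "entity": "DURATION", "confidence": 0.71},
    {"text": "capsule", "entity": "FORM", "confidence": 0.99},
]

thresholds = load_thresholds(THRESHOLDS_CSV)
kept_texts = [c["text"] for c in filter_by_confidence(chunks, thresholds)]
print(kept_texts)
```

Here `for 5 days` is dropped because 0.71 is below the 0.80 threshold for `DURATION`, while `capsule` is kept because `FORM` has no threshold in the file.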

### `criteria()`

###### `whiteList()`

`whiteList` can be used with either criterion: `isin` or `regex`.
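Before building the Spark pipeline, the `isin` logic can be sketched in plain Python. This is only an approximation of the behaviour described in the parameter list: the `filter_isin` helper is hypothetical, and exact, case-sensitive string comparison is an assumption, not something confirmed for the annotator.

```python
def filter_isin(chunks, white_list=None, black_list=None):
    """Keep chunks that are in white_list (if given) and not in black_list (if given).

    Assumption: exact, case-sensitive comparison on the chunk text.
    """
    kept = []
    for chunk in chunks:
        if white_list is not None and chunk not in white_list:
            continue
        if black_list is not None and chunk in black_list:
            continue
        kept.append(chunk)
    return kept


# Example chunk texts, as a posology NER model might produce them.
ner_chunks = ["1", "capsule", "Advil", "40 units", "insulin glargine",
              "12 units", "insulin lispro", "metformin", "1000 mg"]

white = filter_isin(ner_chunks, white_list=["Advil", "metformin", "insulin lispro"])
black = filter_isin(ner_chunks, black_list=["12 units", "40 units"])

print(white)
print(black)
```

The white list keeps only the listed chunks, in their original order; the black list keeps everything except the listed chunks.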

In [None]:
documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")\
    .setCustomBounds(['\n'])

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")\
    .setContextChars([".", ",", ";", ":", "!", "?", "*", "(", ")", "\"", "'","+","%","-",'='])\
    .setSplitChars(['\\[', '\\]', '\n'])

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

# Posology NER model is used
posology_ner = medical.NerModel.pretrained("ner_posology", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = medical.NerConverterInternal()\
    .setInputCols(["sentence","token","ner"])\
    .setOutputCol("ner_chunk")

chunk_filterer = medical.ChunkFilterer()\
    .setInputCols("sentence","ner_chunk")\
    .setOutputCol("chunk_filtered")\
    .setCriteria("isin")\
    .setWhiteList(['Advil','metformin', 'insulin lispro']) #list of entities to process

nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    posology_ner,
    ner_converter,
    chunk_filterer])

empty_data = spark.createDataFrame([[""]]).toDF("text")

chunk_filter_model = nlpPipeline.fit(empty_data)

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_posology download started this may take some time.
[OK!]


`RegexTokenizer` splits the text on `\s+` when no pattern is given. When a pattern is passed to `setPattern`, it splits on that pattern instead.

In [None]:
text = 'The patient was prescribed 1 capsule of Advil for 5 days . He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , metformin 1000 mg two times a day . It was determined that all SGLT2 inhibitors should be discontinued indefinitely fro 3 months .'

filter_df = spark.createDataFrame([[text]]).toDF("text")

result = chunk_filter_model.transform(filter_df)

In [None]:
import pyspark.sql.functions as F

In [None]:
result_df = result.select(F.explode(F.arrays_zip(result.token.result,result.ner.result, result.ner.metadata)).alias("cols"))\
                  .select(F.expr("cols['0']").alias("token"),
                          F.expr("cols['1']").alias("ner_label"),
                          F.expr("cols['2']['confidence']").alias("confidence"))

result_df.show(50, truncate=100)

+-------------+-----------+----------+
|        token|  ner_label|confidence|
+-------------+-----------+----------+
|          The|          O|    0.9995|
|      patient|          O|    0.9214|
|          was|          O|     0.975|
|   prescribed|          O|    0.9513|
|            1|   B-DOSAGE|    0.9992|
|      capsule|     B-FORM|    0.9897|
|           of|          O|    0.9982|
|        Advil|     B-DRUG|     0.997|
|          for| B-DURATION|    0.5002|
|            5| I-DURATION|    0.6714|
|         days| I-DURATION|    0.9699|
|            .|          O|       1.0|
|           He|          O|    0.9998|
|          was|          O|    0.9908|
|         seen|          O|    0.9744|
|           by|          O|    0.9991|
|          the|          O|    0.9499|
|endocrinology|          O|    0.9725|
|      service|          O|    0.5585|
|          and|          O|    0.9899|
|          she|          O|     0.992|
|          was|          O|     0.991|
|   discharged|          

In [None]:
chunk_filterer.extractParamMap()

{Param(parent='ChunkFilterer_6e479c2a3a08', name='lazyAnnotator', doc='Whether this AnnotatorModel acts as lazy in RecursivePipelines'): False,
 Param(parent='ChunkFilterer_6e479c2a3a08', name='inputCols', doc='previous annotations columns, if renamed'): ['sentence',
  'ner_chunk'],
 Param(parent='ChunkFilterer_6e479c2a3a08', name='outputCol', doc='output annotation column. can be left default.'): 'chunk_filtered',
 Param(parent='ChunkFilterer_6e479c2a3a08', name='criteria', doc='Select mode'): 'isin',
 Param(parent='ChunkFilterer_6e479c2a3a08', name='whiteList', doc='If defined, list of entities to process. The rest will be ignored.'): ['Advil',
  'metformin',
  'insulin lispro']}

In [None]:
light_model = nlp.LightPipeline(chunk_filter_model)

light_result = light_model.annotate(text)

light_result.keys()

dict_keys(['document', 'ner_chunk', 'chunk_filtered', 'token', 'ner', 'embeddings', 'sentence'])

In [None]:
# Output of ner_converter: all chunks, before any filtering
light_result['ner_chunk']

['1',
 'capsule',
 'Advil',
 'for 5 days',
 '40 units',
 'insulin glargine',
 'at night',
 '12 units',
 'insulin lispro',
 'with meals',
 'metformin',
 '1000 mg',
 'two times a day',
 'SGLT2 inhibitors']

In [None]:
# Output of chunk_filterer: only the chunks that are in the white list
light_result['chunk_filtered']

['Advil', 'insulin lispro', 'metformin']

### `criteria()`

###### `blackList()`

In [None]:
chunk_filterer = medical.ChunkFilterer()\
    .setInputCols("sentence","ner_chunk")\
    .setOutputCol("chunk_filtered")\
    .setCriteria("isin")\
    .setBlackList(['12 units','40 units'])  #list of entities to ignore

nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    posology_ner,
    ner_converter,
    chunk_filterer])

empty_data = spark.createDataFrame([[""]]).toDF("text")

chunk_filter_model = nlpPipeline.fit(empty_data)

In [None]:
light_model = nlp.LightPipeline(chunk_filter_model)

text ='The patient was prescribed 1 capsule of Advil for 5 days . He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , metformin 1000 mg two times a day . It was determined that all SGLT2 inhibitors should be discontinued indefinitely fro 3 months .'

light_result = light_model.annotate(text)

light_result.keys()

dict_keys(['document', 'ner_chunk', 'chunk_filtered', 'token', 'ner', 'embeddings', 'sentence'])

In [None]:
light_result['chunk_filtered']

['1',
 'capsule',
 'Advil',
 'for 5 days',
 'insulin glargine',
 'at night',
 'insulin lispro',
 'with meals',
 'metformin',
 '1000 mg',
 'two times a day',
 'SGLT2 inhibitors']
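To check which chunks a filter removed, the two output lists can be compared directly. The sketch below uses a hypothetical `dropped_chunks` helper and the chunk lists printed earlier in this notebook; in a live session the same call could be made on `light_result['ner_chunk']` and `light_result['chunk_filtered']`.

```python
from collections import Counter


def dropped_chunks(before, after):
    """Return the chunks present in `before` but missing from `after`.

    A Counter is used so that repeated chunk texts are handled correctly.
    """
    remaining = Counter(after)
    dropped = []
    for chunk in before:
        if remaining[chunk] > 0:
            remaining[chunk] -= 1
        else:
            dropped.append(chunk)
    return dropped


ner_chunk = ["1", "capsule", "Advil", "for 5 days", "40 units",
             "insulin glargine", "at night", "12 units", "insulin lispro",
             "with meals", "metformin", "1000 mg", "two times a day",
             "SGLT2 inhibitors"]

chunk_filtered = ["1", "capsule", "Advil", "for 5 days", "insulin glargine",
                  "at night", "insulin lispro", "with meals", "metformin",
                  "1000 mg", "two times a day", "SGLT2 inhibitors"]

removed = dropped_chunks(ner_chunk, chunk_filtered)
print(removed)
```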

### `regex()`

In [None]:
chunk_filterer = medical.ChunkFilterer()\
    .setInputCols("sentence","ner_chunk")\
    .setOutputCol("chunk_filtered")\
    .setCriteria("regex")\
    .setRegex([r"(\d+)\s*units"])  # raw string avoids invalid-escape warnings

nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    posology_ner,
    ner_converter,
    chunk_filterer])

empty_data = spark.createDataFrame([[""]]).toDF("text")

chunk_filter_model = nlpPipeline.fit(empty_data)

In [None]:
light_model = nlp.LightPipeline(chunk_filter_model)

text ='The patient was prescribed 1 capsule of Advil for 5 days . He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , metformin 1000 mg two times a day . It was determined that all SGLT2 inhibitors should be discontinued indefinitely fro 3 months .'

light_result = light_model.annotate(text)

light_result.keys()

dict_keys(['document', 'ner_chunk', 'chunk_filtered', 'token', 'ner', 'embeddings', 'sentence'])

In [None]:
light_result['chunk_filtered']

['40 units', '12 units']
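The pattern itself can be tried out with Python's `re` module before it is passed to `setRegex`. This is a rough check only: `ChunkFilterer` runs on the JVM, so using `re.search` here (rather than the annotator's own matching) is an assumption, and the chunk list is copied from the `ner_chunk` output shown earlier.

```python
import re

pattern = re.compile(r"(\d+)\s*units")

ner_chunks = ["1", "capsule", "Advil", "for 5 days", "40 units",
              "insulin glargine", "at night", "12 units", "insulin lispro",
              "with meals", "metformin", "1000 mg", "two times a day",
              "SGLT2 inhibitors"]

# Keep the chunks in which the pattern is found.
matched = [chunk for chunk in ner_chunks if pattern.search(chunk)]

# The capture group (\d+) exposes the numeric amount of each matched chunk.
amounts = [int(pattern.search(chunk).group(1)) for chunk in matched]

print(matched)
print(amounts)
```

Note that `1000 mg` is not matched, because the pattern requires the literal word `units` after the digits.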

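### `setFilterEntity()`

The pipeline below calls `setFilterEntity("result")` without a white list, a black list, or a regex. In its output, `chunk_filtered` is identical to `ner_chunk`: no chunk is dropped.

In [None]: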
ner_model = medical.NerModel.pretrained("ner_clinical", "en", "clinical/models")\
    .setInputCols("sentence","token","embeddings")\
    .setOutputCol("ner")

ner_converter = medical.NerConverterInternal()\
    .setInputCols(["sentence","token","ner"])\
    .setOutputCol("ner_chunk")

chunk_filterer = medical.ChunkFilterer()\
    .setInputCols("sentence","ner_chunk")\
    .setOutputCol("chunk_filtered")\
    .setFilterEntity("result")

nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    ner_model,
    ner_converter,
    chunk_filterer])

empty_data = spark.createDataFrame([[""]]).toDF("text")

chunk_filter_model = nlpPipeline.fit(empty_data)

ner_clinical download started this may take some time.
[OK!]


In [None]:
text = 'Patient with severe fever, severe cough, sore throat, stomach pain, and a headache.'

filter_df = spark.createDataFrame([[text]]).toDF("text")

chunk_filter_result = chunk_filter_model.transform(filter_df)

In [None]:
chunk_filter_result.select('ner_chunk.result','chunk_filtered.result').show(truncate=False)

+-------------------------------------------------------------------+-------------------------------------------------------------------+
|result                                                             |result                                                             |
+-------------------------------------------------------------------+-------------------------------------------------------------------+
|[severe fever, severe cough, sore throat, stomach pain, a headache]|[severe fever, severe cough, sore throat, stomach pain, a headache]|
+-------------------------------------------------------------------+-------------------------------------------------------------------+
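A rough sketch of what the filter target could mean, assuming that `"result"` compares the list against the chunk text and that an `"entity"` setting would compare it against the NER label. That reading is an assumption and is not confirmed by this notebook. The code is plain Python, not the annotator itself, and the chunk dictionaries, labels, and `filter_chunks` helper are hypothetical.

```python
def filter_chunks(chunks, white_list, filter_entity="result"):
    """Keep chunks whose chosen field ('result' or 'entity') is in white_list."""
    if filter_entity not in ("result", "entity"):
        raise ValueError("filter_entity must be 'result' or 'entity'")
    return [chunk for chunk in chunks if chunk[filter_entity] in white_list]


# Hypothetical chunks: 'result' is the chunk text, 'entity' is the NER label.
chunks = [
    {"result": "severe fever", "entity": "PROBLEM"},
    {"result": "a headache", "entity": "PROBLEM"},
    {"result": "paracetamol", "entity": "TREATMENT"},
]

by_text = filter_chunks(chunks, ["a headache"], filter_entity="result")
by_label = filter_chunks(chunks, ["PROBLEM"], filter_entity="entity")

print([c["result"] for c in by_text])
print([c["result"] for c in by_label])
```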



### `setMaxLength()`

`setMaxLength` is a `RegexTokenizer` parameter. Set it when you want to keep only tokens up to a given maximum length.

In [None]:
documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Keep only tokens that are at most 3 characters long
regexTokenizer = nlp.RegexTokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("regexToken") \
    .setPattern("\\s+|(?=[-:;*__+,$&\\[\\]])|(?<=[-:;*__+,$&\\[\\]])")\
    .setMaxLength(3)

pipeline = nlp.Pipeline().setStages([
      documentAssembler,
      regexTokenizer
    ])

data = spark.createDataFrame([["1. The investments made reached a value of £4.5Million, gaining __85.6% on DATE**[24/12/2022]."]]).toDF("text")
result = pipeline.fit(data).transform(data)

result.selectExpr("regexToken.result").show(truncate=False)

With `setMaxLength(3)`, only tokens of at most 3 characters are kept.

### `setMinLength()`

`setMinLength` is a `RegexTokenizer` parameter. Set it when you want to keep only tokens of at least a given minimum length.

In [None]:
documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Keep only tokens that are at least 5 characters long
regexTokenizer = nlp.RegexTokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("regexToken") \
    .setPattern("\\s+|(?=[-:;*__+,$&\\[\\]])|(?<=[-:;*__+,$&\\[\\]])")\
    .setMinLength(5)

pipeline = nlp.Pipeline().setStages([
      documentAssembler,
      regexTokenizer
    ])

data = spark.createDataFrame([["1. The investments made reached a value of £4.5Million, gaining __85.6% on DATE**[24/12/2022]."]]).toDF("text")
result = pipeline.fit(data).transform(data)

result.selectExpr("regexToken.result").show(truncate=False)

With `setMinLength(5)`, only tokens of at least 5 characters are kept.
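The two length bounds can be illustrated without Spark. The sketch below is a simplification: it splits on whitespace only (not on the lookaround pattern used above), treats both bounds as inclusive, and uses a hypothetical `filter_by_length` helper, so its tokens will differ from what `RegexTokenizer` actually returns.

```python
def filter_by_length(tokens, min_length=0, max_length=None):
    """Keep tokens whose length lies within [min_length, max_length].

    Assumption: both bounds are inclusive; max_length=None means no upper bound.
    """
    return [
        token for token in tokens
        if len(token) >= min_length
        and (max_length is None or len(token) <= max_length)
    ]


text = ("1. The investments made reached a value of £4.5Million, "
        "gaining __85.6% on DATE**[24/12/2022].")

# Simplified tokenization: whitespace only.
tokens = text.split()

short_tokens = filter_by_length(tokens, max_length=3)
long_tokens = filter_by_length(tokens, min_length=5)

print(short_tokens)
print(long_tokens)
```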