![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/Spark_NLP_Udemy_MOOC/Healthcare_NLP/ChunkFiltererApproach.ipynb)

# **ChunkFiltererApproach**

This notebook will cover the different parameters and usages of `ChunkFiltererApproach`. This annotator provides the ability to filter entities coming from CHUNK annotations.

**📖 Learning Objectives:**

1. Understand how to set filters via a white list of terms or a regular expression.

2. Become comfortable using the different parameters of the `ChunkFiltererApproach`.


**🔗 Helpful Links:**

- Documentation : [ChunkFiltererApproach](https://nlp.johnsnowlabs.com/docs/en/licensed_annotators#chunkfilterer)

- Python Docs : [ChunkFiltererApproach](https://nlp.johnsnowlabs.com/licensed/api/python/reference/autosummary/sparknlp_jsl/annotator/chunker/chunker_filterer/index.html#sparknlp_jsl.annotator.chunker.chunker_filterer.ChunkFiltererApproach)

- Scala Docs : [ChunkFiltererApproach](https://nlp.johnsnowlabs.com/licensed/api/com/johnsnowlabs/nlp/annotators/chunker/ChunkFiltererApproach.html)

- For extended examples of usage, see the [Spark NLP Workshop repository-1.Clinical_Named_Entity_Recognition_Model_Notebook](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb#scrollTo=YgSZ4ghNlDbV).

## **📜 Background**


`ChunkFiltererApproach` lets you filter named entity chunks by a condition (such as a regular expression) or by predefined look-up lists, so that only the entities you need are fed to other annotators like Assertion Status or Entity Resolvers.

## **🎬 Colab Setup**

In [None]:
!pip install -q johnsnowlabs

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

Please Upload your John Snow Labs License using the button below


Saving spark_nlp_for_healthcare_spark_ocr_8734_532.json to spark_nlp_for_healthcare_spark_ocr_8734_532.json


In [None]:
from johnsnowlabs import nlp, medical

nlp.install()

👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_8734_532.json
📋 Stored John Snow Labs License in /root/.johnsnowlabs/licenses/license_number_0_for_Spark-Healthcare_Spark-OCR.json
👷 Setting up  John Snow Labs home in /root/.johnsnowlabs, this might take a few minutes.
Downloading 🐍+🚀 Python Library spark_nlp-5.3.2-py2.py3-none-any.whl
Downloading 🐍+💊 Python Library spark_nlp_jsl-5.3.2-py3-none-any.whl
Downloading 🫘+🚀 Java Library spark-nlp-assembly-5.3.2.jar
Downloading 🫘+💊 Java Library spark-nlp-jsl-5.3.2.jar
🙆 JSL Home setup in /root/.johnsnowlabs
👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_8734_532.json
Installing /root/.johnsnowlabs/py_installs/spark_nlp_jsl-5.3.2-py3-none-any.whl to /usr/bin/python3
Installed 1 products:
💊 Spark-Healthcare==5.3.2 installed! ✅ Heal the planet with NLP! 


In [None]:
import pyspark.sql.functions as F
import pandas as pd

spark = nlp.start()

👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_8734_532.json
👌 Launched cpu optimized session with with: 🚀Spark-NLP==5.3.2, 💊Spark-Healthcare==5.3.2, running on ⚡ PySpark==3.4.0


## **🖨️ Input/Output Annotation Types**

- Input: `DOCUMENT`, `CHUNK`

- Output: `CHUNK`

## **🔎 Parameters**


- `whiteList` : (list) If defined, list of entities to process. The rest will be ignored.

- `blackList` : (list) If defined, list of entities to ignore. The rest will be processed.

- `regex` : (list) If defined, list of regex to process the chunks (Default: []).

- `criteria`: (str) Tag representing the criterion used to filter the chunks. Possible values are:
  - `isin`: Filter by the chunk.
  - `regex`: Filter using a regex.

- `filterEntity`: (str) Possible values are 'result' and 'entity'.

- `entitiesConfidence` : (str) Path to csv with pairs (entity,confidenceThreshold). Filter the chunks with entities which have confidence lower than the confidence threshold.

- `EntitiesConfidenceResourceAsJsonString` : (json) Allows fine-tuning of entity confidence levels using a JSON configuration.

- `doExceptionHandling`: (Bool) If true, exceptions are handled.

- `caseSensitive`: (Bool) Determines whether the definitions of the white listed and black listed entities are case sensitive or not.

- `setMaxLength()`: (int) Determines to get only tokens with a certain maximum length.

- `setMinLength()`: (int) Determines to get only tokens with a certain minimum length.

### `whiteList()` & `criteria()`

The white list can be used with either of the two criteria: `isin` or `regex`.

In [None]:
documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol( "document")

sentenceDetector = nlp.SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")\
    .setCustomBounds(['\n'])

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")\
    .setContextChars([".", ",", ";", ":", "!", "?", "*", "(", ")", "\"", "'","+","%","-",'='])\
    .setSplitChars([r'\[', r'\]', '\n'])

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

# Posology NER model is used
posology_ner = medical.NerModel.pretrained("ner_posology", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = medical.NerConverterInternal()\
    .setInputCols(["sentence","token","ner"])\
    .setOutputCol("ner_chunk")

chunk_filterer = medical.ChunkFiltererApproach()\
    .setInputCols("sentence","ner_chunk")\
    .setOutputCol("chunk_filtered")\
    .setCriteria("isin")\
    .setWhiteList(['Advil','metformin', 'insulin lispro']) #list of entities to process

nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    posology_ner,
    ner_converter,
    chunk_filterer])

empty_data = spark.createDataFrame([[""]]).toDF("text")

chunk_filter_model = nlpPipeline.fit(empty_data)

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_posology download started this may take some time.
[OK!]


With the pipeline fitted, we can transform a sample clinical note and inspect the token-level NER output before looking at the filtered chunks.

In [None]:
text = 'The patient was prescribed 1 capsule of Advil for 5 days . He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , metformin 1000 mg two times a day . It was determined that all SGLT2 inhibitors should be discontinued indefinitely fro 3 months .'

filter_df = spark.createDataFrame([[text]]).toDF("text")

result = chunk_filter_model.transform(filter_df)

In [None]:
result_df = result.select(F.explode(F.arrays_zip(result.token.result,result.ner.result, result.ner.metadata)).alias("cols"))\
                  .select(F.expr("cols['0']").alias("token"),
                          F.expr("cols['1']").alias("ner_chunk"),
                          F.expr("cols['2']['confidence']").alias("confidence"))

result_df.show(50, truncate=100)

+-------------+-----------+----------+
|        token|  ner_chunk|confidence|
+-------------+-----------+----------+
|          The|          O|    0.9995|
|      patient|          O|    0.9214|
|          was|          O|     0.975|
|   prescribed|          O|    0.9513|
|            1|   B-DOSAGE|    0.9992|
|      capsule|     B-FORM|    0.9897|
|           of|          O|    0.9982|
|        Advil|     B-DRUG|     0.997|
|          for| B-DURATION|    0.5002|
|            5| I-DURATION|    0.6714|
|         days| I-DURATION|    0.9699|
|            .|          O|       1.0|
|           He|          O|    0.9998|
|          was|          O|    0.9908|
|         seen|          O|    0.9744|
|           by|          O|    0.9991|
|          the|          O|    0.9499|
|endocrinology|          O|    0.9725|
|      service|          O|    0.5585|
|          and|          O|    0.9899|
|          she|          O|     0.992|
|          was|          O|     0.991|
|   discharged|          

In [None]:
chunk_filterer.extractParamMap()

{Param(parent='ChunkFiltererApproach_781a5b54e29b', name='lazyAnnotator', doc='Whether this AnnotatorModel acts as lazy in RecursivePipelines'): False,
 Param(parent='ChunkFiltererApproach_781a5b54e29b', name='inputCols', doc='previous annotations columns, if renamed'): ['sentence',
  'ner_chunk'],
 Param(parent='ChunkFiltererApproach_781a5b54e29b', name='outputCol', doc='output annotation column. can be left default.'): 'chunk_filtered',
 Param(parent='ChunkFiltererApproach_781a5b54e29b', name='criteria', doc='It is used to compare black and white listed values with the result of the Annotation.'): 'isin',
 Param(parent='ChunkFiltererApproach_781a5b54e29b', name='whiteList', doc='If defined, list of entities to process. The rest will be ignored. Do not include IOB prefix on labels'): ['Advil',
  'metformin',
  'insulin lispro']}

In [None]:
light_model = nlp.LightPipeline(chunk_filter_model)

light_result = light_model.annotate(text)

light_result.keys()

dict_keys(['document', 'ner_chunk', 'chunk_filtered', 'token', 'ner', 'embeddings', 'sentence'])

In [None]:
# Returns every chunk produced by ner_converter, before any filtering
light_result['ner_chunk']

['1',
 'capsule',
 'Advil',
 'for 5 days',
 '40 units',
 'insulin glargine',
 'at night',
 '12 units',
 'insulin lispro',
 'with meals',
 'metformin',
 '1000 mg',
 'two times a day',
 'SGLT2 inhibitors']

In [None]:
# Returns the output of chunk_filterer (only the white-listed chunks)
light_result['chunk_filtered']

['Advil', 'insulin lispro', 'metformin']

### `blackList()` & `criteria()`

In [None]:
chunk_filterer = medical.ChunkFiltererApproach()\
    .setInputCols("sentence","ner_chunk")\
    .setOutputCol("chunk_filtered")\
    .setCriteria("isin")\
    .setBlackList(['12 units','40 units'])  #list of entities to ignore

nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    posology_ner,
    ner_converter,
    chunk_filterer])

empty_data = spark.createDataFrame([[""]]).toDF("text")

chunk_filter_model = nlpPipeline.fit(empty_data)

In [None]:
light_model = nlp.LightPipeline(chunk_filter_model)

text ='The patient was prescribed 1 capsule of Advil for 5 days . He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , metformin 1000 mg two times a day . It was determined that all SGLT2 inhibitors should be discontinued indefinitely fro 3 months .'

light_result = light_model.annotate(text)

light_result.keys()

dict_keys(['document', 'ner_chunk', 'chunk_filtered', 'token', 'ner', 'embeddings', 'sentence'])

In [None]:
light_result['chunk_filtered']

['1',
 'capsule',
 'Advil',
 'for 5 days',
 'insulin glargine',
 'at night',
 'insulin lispro',
 'with meals',
 'metformin',
 '1000 mg',
 'two times a day',
 'SGLT2 inhibitors']
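
To make the `isin` behaviour concrete, the following pure-Python sketch mimics what the white list and the black list did to the `ner_chunk` output shown above. It is only an illustration under stated assumptions, not the annotator's implementation: `filter_chunks` and its `case_sensitive` argument are hypothetical names, and the sketch assumes exact string matching against each chunk's text.

```python
# Illustrative sketch only: mimics "isin" filtering on plain strings.
# filter_chunks is a hypothetical helper, not part of Spark NLP.

ner_chunks = [
    '1', 'capsule', 'Advil', 'for 5 days', '40 units', 'insulin glargine',
    'at night', '12 units', 'insulin lispro', 'with meals', 'metformin',
    '1000 mg', 'two times a day', 'SGLT2 inhibitors',
]


def filter_chunks(chunks, white_list=None, black_list=None, case_sensitive=True):
    """Keep chunks found in white_list (if given) and not found in black_list."""
    def norm(s):
        return s if case_sensitive else s.lower()

    white = {norm(w) for w in white_list} if white_list else None
    black = {norm(b) for b in (black_list or [])}

    kept = []
    for chunk in chunks:
        key = norm(chunk)
        if white is not None and key not in white:
            continue  # a white list is defined and this chunk is not on it
        if key in black:
            continue  # explicitly black-listed
        kept.append(chunk)
    return kept


whitelisted = filter_chunks(ner_chunks, white_list=['Advil', 'metformin', 'insulin lispro'])
blacklisted = filter_chunks(ner_chunks, black_list=['12 units', '40 units'])

print(whitelisted)  # the three white-listed drugs, in document order
print(blacklisted)  # everything except the two black-listed dosages
```

Note that the result keeps the order in which the chunks occur in the text, not the order of the list passed to the filter, which matches the outputs shown in this notebook.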

### `regex()` & `criteria()`

In [None]:
chunk_filterer = medical.ChunkFiltererApproach()\
    .setInputCols("sentence","ner_chunk")\
    .setOutputCol("chunk_filtered")\
    .setCriteria("regex")\
    .setRegex([r"(\d+)\s*units"])  # raw string avoids invalid-escape warnings

nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    posology_ner,
    ner_converter,
    chunk_filterer])

empty_data = spark.createDataFrame([[""]]).toDF("text")

chunk_filter_model = nlpPipeline.fit(empty_data)

In [None]:
light_model = nlp.LightPipeline(chunk_filter_model)

text ='The patient was prescribed 1 capsule of Advil for 5 days . He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , metformin 1000 mg two times a day . It was determined that all SGLT2 inhibitors should be discontinued indefinitely fro 3 months .'

light_result = light_model.annotate(text)

light_result.keys()

dict_keys(['document', 'ner_chunk', 'chunk_filtered', 'token', 'ner', 'embeddings', 'sentence'])

In [None]:
light_result['chunk_filtered']

['40 units', '12 units']
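
Before wiring a pattern into the pipeline, it can help to try it on a few chunk strings with Python's built-in `re` module. The sketch below is a rough approximation only: it assumes a chunk is kept when the pattern is found anywhere in its text, which may differ from how the annotator (running on the JVM) applies the regex.

```python
import re

# Chunk texts copied from the ner_chunk output earlier in this notebook.
ner_chunks = [
    '1', 'capsule', 'Advil', 'for 5 days', '40 units', 'insulin glargine',
    'at night', '12 units', 'insulin lispro', 'with meals', 'metformin',
    '1000 mg', 'two times a day', 'SGLT2 inhibitors',
]

# Raw string so that \d and \s reach the regex engine unchanged.
pattern = re.compile(r"(\d+)\s*units")

# Assumption: keep a chunk when the pattern matches anywhere in its text.
matched = [chunk for chunk in ner_chunks if pattern.search(chunk)]

print(matched)  # ['40 units', '12 units']
```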

### `filterEntity()`

If set to “entity”, the filter is applied to the NER label of each chunk. If set to “result”, the filter is applied to the `result` attribute of the annotation, that is, the chunk text.

In [None]:
ner_model = medical.NerModel.pretrained("ner_clinical", "en", "clinical/models")\
    .setInputCols("sentence","token","embeddings")\
    .setOutputCol("ner")

ner_converter = medical.NerConverterInternal()\
    .setInputCols(["sentence","token","ner"])\
    .setOutputCol("ner_chunk")

chunk_filterer = medical.ChunkFiltererApproach()\
    .setInputCols("sentence","ner_chunk")\
    .setOutputCol("chunk_filtered")\
    .setFilterEntity("entity")\
    .setBlackList(['PROBLEM'])

nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    ner_model,
    ner_converter,
    chunk_filterer])

empty_data = spark.createDataFrame([[""]]).toDF("text")

chunk_filter_model = nlpPipeline.fit(empty_data)

ner_clinical download started this may take some time.
[OK!]


In [None]:
text = 'Patient with severe fever, severe cough, sore throat, stomach pain, and a headache.'

filter_df = spark.createDataFrame([[text]]).toDF("text")

chunk_filter_result = chunk_filter_model.transform(filter_df)

In [None]:
result_df = chunk_filter_result.select(F.explode(F.arrays_zip(chunk_filter_result.chunk_filtered.result,
                                     chunk_filter_result.chunk_filtered.begin,
                                     chunk_filter_result.chunk_filtered.end,
                                     chunk_filter_result.chunk_filtered.metadata)).alias("cols"))\
        .select(F.expr("cols['0']").alias("chunk"),
                F.expr("cols['1']").alias("begin"),
                F.expr("cols['2']").alias("end"),
                F.expr("cols['3']['entity']").alias("entity"),
                F.expr("cols['3']['ner_source']").alias("ner_source")).toPandas()

In [None]:
result_df

| | chunk | begin | end | entity | ner_source |
|---|---|---|---|---|---|


In [None]:
result_df = chunk_filter_result.select(F.explode(F.arrays_zip(chunk_filter_result.ner_chunk.result,
                                     chunk_filter_result.ner_chunk.begin,
                                     chunk_filter_result.ner_chunk.end,
                                     chunk_filter_result.ner_chunk.metadata)).alias("cols"))\
        .select(F.expr("cols['0']").alias("chunk"),
                F.expr("cols['1']").alias("begin"),
                F.expr("cols['2']").alias("end"),
                F.expr("cols['3']['entity']").alias("entity"),
                F.expr("cols['3']['ner_source']").alias("ner_source")).toPandas()

In [None]:
result_df

| | chunk | begin | end | entity | ner_source |
|---|---|---|---|---|---|
| 0 | severe fever | 13 | 24 | PROBLEM | ner_chunk |
| 1 | severe cough | 27 | 38 | PROBLEM | ner_chunk |
| 2 | sore throat | 41 | 51 | PROBLEM | ner_chunk |
| 3 | stomach pain | 54 | 65 | PROBLEM | ner_chunk |
| 4 | a headache | 72 | 81 | PROBLEM | ner_chunk |


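As you can see, every chunk in `ner_chunk` carries the `PROBLEM` label. Because `PROBLEM` is black-listed with `filterEntity` set to `entity`, none of these chunks exist in `chunk_filtered.result`, which is why the first DataFrame above is empty.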
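
The difference between the two `filterEntity` settings can be sketched in plain Python. This is a simplified model and not the annotator's code: `apply_black_list` is a hypothetical helper, and each chunk is reduced to a dictionary holding its text (`result`) and its label (`entity`), taken from the table above.

```python
# Simplified model of the chunks from the ner_chunk table above.
chunks = [
    {"result": "severe fever", "entity": "PROBLEM"},
    {"result": "severe cough", "entity": "PROBLEM"},
    {"result": "sore throat", "entity": "PROBLEM"},
    {"result": "stomach pain", "entity": "PROBLEM"},
    {"result": "a headache", "entity": "PROBLEM"},
]


def apply_black_list(chunks, black_list, filter_entity="result"):
    """Drop chunks whose selected field ('result' or 'entity') is black-listed."""
    return [c for c in chunks if c[filter_entity] not in black_list]


# Filtering on the label removes every PROBLEM chunk.
by_entity = apply_black_list(chunks, ["PROBLEM"], filter_entity="entity")

# Filtering on the text removes nothing, since no chunk text equals "PROBLEM".
by_result = apply_black_list(chunks, ["PROBLEM"], filter_entity="result")

print(len(by_entity), len(by_result))  # 0 5
```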

### `EntitiesConfidenceResourceAsJsonString()`

In [None]:
# Posology NER model is used
posology_ner = medical.NerModel.pretrained("ner_posology", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = medical.NerConverterInternal()\
    .setInputCols(["sentence","token","ner"])\
    .setOutputCol("ner_chunk")

chunk_filterer = medical.ChunkFiltererApproach()\
    .setInputCols("sentence","ner_chunk")\
    .setOutputCol("chunk_filtered")\
    .setEntitiesConfidenceResourceAsJsonString("""{'DURATION':'0.9',
                                                  'DOSAGE':'0.9',
                                                  'FREQUENCY':'0.9',
                                                  'STRENGTH':'0.9',
                                                  'DRUG':'0.9'}""")

nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    posology_ner,
    ner_converter,
    chunk_filterer])

empty_data = spark.createDataFrame([[""]]).toDF("text")

chunk_filter_model = nlpPipeline.fit(empty_data)

ner_posology download started this may take some time.
[OK!]


In [None]:
text = 'The patient was prescribed 1 capsule of Advil for 5 days . He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , metformin 1000 mg two times a day . It was determined that all SGLT2 inhibitors should be discontinued indefinitely fro 3 months .'

filter_df = spark.createDataFrame([[text]]).toDF("text")

result = chunk_filter_model.transform(filter_df)

In [None]:
result.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|          embeddings|                 ner|           ner_chunk|      chunk_filtered|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|The patient was p...|[{document, 0, 33...|[{document, 0, 57...|[{token, 0, 2, Th...|[{word_embeddings...|[{named_entity, 0...|[{chunk, 27, 27, ...|[{chunk, 27, 27, ...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+



In [None]:
result_df = result.select(F.explode(F.arrays_zip(result.chunk_filtered.result,result.chunk_filtered.metadata)).alias("cols"))\
                  .select(F.expr("cols['0']").alias("chunk_filtered"),
                          F.expr("cols['1']['confidence']").alias("confidence"))

result_df.show(50, truncate=100)

+--------------+----------+
|chunk_filtered|confidence|
+--------------+----------+
|             1|    0.9992|
|       capsule|    0.9897|
|         Advil|     0.997|
|     metformin|    0.9995|
+--------------+----------+



As you can see from the results, no chunks with a confidence lower than 0.9 remain in `chunk_filtered`.
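
The per-entity thresholding can be approximated with a few lines of plain Python. Treat this as a sketch under several assumptions rather than a description of the annotator's internals: the labels come from the token-level table earlier in the notebook, the confidence for `for 5 days` is a hypothetical value (the mean of its three token confidences shown earlier), a chunk is kept when its confidence is greater than or equal to the threshold, and labels missing from the configuration (such as `FORM`) are assumed to pass through unfiltered.

```python
import json

# Thresholds written as standard JSON; values are strings, as in the cell above.
config = '{"DURATION": "0.9", "DOSAGE": "0.9", "FREQUENCY": "0.9", "STRENGTH": "0.9", "DRUG": "0.9"}'
thresholds = {label: float(value) for label, value in json.loads(config).items()}

# (chunk text, label, confidence)
chunks = [
    ("1", "DOSAGE", 0.9992),
    ("capsule", "FORM", 0.9897),
    ("Advil", "DRUG", 0.997),
    ("for 5 days", "DURATION", 0.7138),  # hypothetical chunk confidence
]

# Assumption: labels without a configured threshold are never filtered out.
kept = [
    text
    for text, label, confidence in chunks
    if confidence >= thresholds.get(label, 0.0)
]

print(kept)  # ['1', 'capsule', 'Advil']
```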