![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/Spark_NLP_Udemy_MOOC/Healthcare_NLP/DocumentFiltererByNER.ipynb)

#   **📜 DocumentFiltererByNER**


The **`DocumentFiltererByNER`** annotator keeps only the sentences that contain entity chunks matching your filter, so you can focus on the sentences that mention the entities you care about.
It is particularly useful for extracting and organizing the results of Spark NLP pipelines.

**📖 Learning Objectives:**

1. Understand how to use the annotator.

2. Become comfortable using the different parameters of the annotator.

**🔗 Helpful Links:**

- Reference Documentation: [DocumentFiltererByNER](https://nlp.johnsnowlabs.com/docs/en/licensed_annotators#documentfiltererbyner)



## **🎬 Colab Setup**

In [None]:
# Install the johnsnowlabs library to access Spark-NLP for Healthcare
! pip install -q johnsnowlabs

In [None]:
from google.colab import files
print('Please upload your John Snow Labs license using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp, medical

# After uploading your license, run this to install all licensed Python wheels and pre-download the JARs for the Spark session JVM
nlp.settings.enforce_versions=False
nlp.install(refresh_install=True)

👌 Detected license file /content/license_keys.json
🚨 Outdated Medical Secrets in license file. Version=5.4.0.PR but should be Version=5.4.0
🚨 Outdated OCR Secrets in license file. Version=5.3.2 but should be Version=5.4.0
📋 Stored John Snow Labs License in /root/.johnsnowlabs/licenses/license_number_0_for_Spark-Healthcare_Spark-OCR.json
👷 Setting up  John Snow Labs home in /root/.johnsnowlabs, this might take a few minutes.
Downloading 🐍+🚀 Python Library spark_nlp-5.4.0-py2.py3-none-any.whl
Downloading 🐍+💊 Python Library spark_nlp_jsl-5.4.0-py3-none-any.whl
Downloading 🫘+🚀 Java Library spark-nlp-assembly-5.4.0.jar
Downloading 🫘+💊 Java Library spark-nlp-jsl-5.4.0.jar
🙆 JSL Home setup in /root/.johnsnowlabs
Running "/usr/bin/python3 -m pip install https://pypi.johnsnowlabs.com/[LIB_SECRET]/spark-nlp-jsl/spark_nlp_jsl-5.4.0-py3-none-any.whl --force-reinstall"
Installed 1 products:
💊 Spark-Healthcare==5.4.0 installed! ✅ Heal the planet with NLP! 


In [None]:
# Automatically load the license data and start a session with all JARs the user has access to
spark = nlp.start()

👌 Detected license file /content/license_keys.json
👌 Launched cpu optimized session with: 🚀Spark-NLP==5.4.0, 💊Spark-Healthcare==5.4.0, running on ⚡ PySpark==3.4.0


In [None]:
spark

## **🖨️ Input/Output Annotation Types**

- Input: `DOCUMENT`, `CHUNK`

- Output: `DOCUMENT`

## **🔎 Parameters**


- `blackList`: If defined, the list of entities to ignore; all other entities are processed.
- `whiteList`: If defined, the list of entities to process; all other entities are ignored.
- `caseSensitive`: Whether the white-listed and black-listed entity names are matched case-sensitively.
- `outputAsDocument`: Whether to return all sentences joined into a single document (default: `False`).
- `joinString`: The string inserted between sentences when they are combined into a single document; it applies only when `outputAsDocument` is `True` (default: `" "`).
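
The rule described by `whiteList` and `caseSensitive` can be pictured with a small pure-Python sketch. It is not the annotator's implementation: `keep_whitelisted` and the `(sentence, labels)` tuples are hypothetical names, the sample labels are illustrative rather than real model output, and the sketch assumes that the names being matched are NER labels and that a sentence is kept when at least one of its chunks carries a white-listed label. `blackList` is left out because its exact keep/drop rule is not spelled out here.

```python
from typing import List, Tuple

# Hypothetical stand-in for one sentence and the labels of its NER chunks.
Sentence = Tuple[str, List[str]]


def keep_whitelisted(
    sentences: List[Sentence], white_list: List[str], case_sensitive: bool
) -> List[str]:
    """Keep sentences that have at least one chunk with a white-listed label."""
    # Assumption: case-insensitive matching compares lower-cased label names.
    normalize = (lambda s: s) if case_sensitive else str.lower
    wanted = {normalize(label) for label in white_list}
    return [
        text
        for text, labels in sentences
        if any(normalize(label) in wanted for label in labels)
    ]


# Illustrative input: the labels below are examples, not real model output.
sample: List[Sentence] = [
    ("Protect yourself and others from infection.", ["Disease_Syndrome_Disorder"]),
    ("Stay home if you feel unwell.", ["Symptom"]),
]

print(keep_whitelisted(sample, ["disease_syndrome_disorder"], case_sensitive=False))
print(keep_whitelisted(sample, ["disease_syndrome_disorder"], case_sensitive=True))
```

With case-insensitive matching, the lower-case white list entry still selects the first sentence; with case-sensitive matching it selects nothing.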
      
  

### Pipeline

In [None]:
documentAssembler = nlp.DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")\
  .setInputCols("document")\
  .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
  .setInputCols("sentence")\
  .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
  .setInputCols(["sentence", "token"])\
  .setOutputCol("embeddings")

ner_jsl = medical.NerModel.pretrained("ner_jsl", "en", "clinical/models") \
  .setInputCols(["sentence", "token", "embeddings"])\
  .setOutputCol("ner")

ner_converter = medical.NerConverterInternal() \
  .setInputCols(["sentence", "token", "ner"]) \
  .setOutputCol("ner_chunk")

sentence_detector_dl_healthcare download started this may take some time.
Approximate size to download 367.3 KB
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_jsl download started this may take some time.
[OK!]


In [None]:
df = spark.createDataFrame([
    ["Coronavirus disease (COVID-19) is an infectious disease caused by the SARS-CoV-2 virus."],
    ["Most people infected with the virus will experience mild to moderate respiratory illness and recover without requiring special treatment."],
    ["However, some will become seriously ill and require medical attention. "],
    ["Older people and those with underlying medical conditions like cardiovascular disease, diabetes, chronic respiratory disease, or cancer are more likely to develop serious illness."],
    ["Anyone can get sick with COVID-19 and become seriously ill or die at any age."],
    ["The best way to prevent and slow down transmission is to be well informed about the disease and how the virus spreads."],
    ["Protect yourself and others from infection by staying at least 1 metre apart from others, wearing a properly fitted mask, and washing your hands or using an alcohol-based rub frequently."],
    ["Get vaccinated when it’s your turn and follow local guidance."],
    ["Stay home if you feel unwell."],
    ["If you have a fever, cough and difficulty breathing, seek medical attention."],
    ["The virus can spread from an infected person’s mouth or nose in small liquid particles when they cough, sneeze, speak, sing or breathe. "],
    ["These particles range from larger respiratory droplets to smaller aerosols. It is important to practice respiratory etiquette, for example by coughing into a flexed elbow, and to stay home and self-isolate until you recover if you feel unwell."]
    ]).toDF("text")

In [None]:
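from pyspark.sql.window import Window as W
from pyspark.sql import functions as F

# Collapse to a single partition so monotonically_increasing_id() yields consecutive ids (0, 1, 2, ...)
spark_df = df.coalesce(1).withColumn("idx", F.monotonically_increasing_id())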

In [None]:
spark_df.show()

+--------------------+---+
|                text|idx|
+--------------------+---+
|Coronavirus disea...|  0|
|Most people infec...|  1|
|However, some wil...|  2|
|Older people and ...|  3|
|Anyone can get si...|  4|
|The best way to p...|  5|
|Protect yourself ...|  6|
|Get vaccinated wh...|  7|
|Stay home if you ...|  8|
|If you have a fev...|  9|
|The virus can spr...| 10|
|These particles r...| 11|
+--------------------+---+



### `setWhiteList`

In [None]:
filterer = medical.DocumentFiltererByNER() \
  .setInputCols(["sentence", "ner_chunk"]) \
  .setOutputCol("filterer") \
  .setWhiteList(["Disease_Syndrome_Disorder"])

pipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    ner_jsl,
    ner_converter,
    filterer])

res = pipeline.fit(spark_df).transform(spark_df)

In [None]:
res.selectExpr("idx as doc_id","explode(filterer) as filter").show(truncate=80)

+------+--------------------------------------------------------------------------------+
|doc_id|                                                                          filter|
+------+--------------------------------------------------------------------------------+
|     0|{document, 0, 86, Coronavirus disease (COVID-19) is an infectious disease cau...|
|     1|{document, 0, 136, Most people infected with the virus will experience mild t...|
|     3|{document, 0, 178, Older people and those with underlying medical conditions ...|
|     6|{document, 0, 185, Protect yourself and others from infection by staying at l...|
|    10|{document, 0, 134, The virus can spread from an infected person’s mouth or no...|
+------+--------------------------------------------------------------------------------+
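
To read the rows above: each exploded `filterer` value is an annotation whose leading fields are the annotation type, the begin and end character offsets, and the sentence text. The sketch below uses a hypothetical `AnnotationSketch` dataclass (not a Spark NLP class) to rebuild the offsets of the first row. It assumes the sentence starts at offset 0 of its document and that the end offset is inclusive, which is what the printed values suggest.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class AnnotationSketch:
    """Hypothetical mirror of the leading fields printed for each row."""

    annotator_type: str
    begin: int
    end: int
    result: str
    metadata: Dict[str, str] = field(default_factory=dict)


def from_sentence(text: str, sentence_idx: int) -> AnnotationSketch:
    # Assumption: the sentence starts at offset 0; the end offset is inclusive,
    # so it is the index of the last character.
    return AnnotationSketch(
        annotator_type="document",
        begin=0,
        end=len(text) - 1,
        result=text,
        metadata={"sentence": str(sentence_idx)},
    )


first = from_sentence(
    "Coronavirus disease (COVID-19) is an infectious disease caused by the SARS-CoV-2 virus.",
    0,
)
print(first.annotator_type, first.begin, first.end)
```

The sentence is 87 characters long, so the inclusive end offset is 86, matching the `0, 86` shown for `doc_id` 0.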



In [None]:
res.select('idx',F.explode(F.arrays_zip(res.ner_chunk.result,
                                     res.ner_chunk.metadata)).alias("cols")) \
      .select('idx',F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label"),
              F.expr("cols['1']['confidence']").alias("confidence")) \
      .filter(F.col("ner_label") == "Disease_Syndrome_Disorder") \
      .show(truncate=False)

+---+-----------------------+-------------------------+----------+
|idx|chunk                  |ner_label                |confidence|
+---+-----------------------+-------------------------+----------+
|0  |Coronavirus disease    |Disease_Syndrome_Disorder|0.65905   |
|0  |infectious disease     |Disease_Syndrome_Disorder|0.5482    |
|1  |infected               |Disease_Syndrome_Disorder|0.9054    |
|1  |virus                  |Disease_Syndrome_Disorder|0.2245    |
|1  |respiratory illness    |Disease_Syndrome_Disorder|0.34315002|
|3  |respiratory disease    |Disease_Syndrome_Disorder|0.38300002|
|6  |infection              |Disease_Syndrome_Disorder|0.9878    |
|10 |infected person’s mouth|Disease_Syndrome_Disorder|0.43490002|
+---+-----------------------+-------------------------+----------+
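
The query above pairs each chunk with its metadata by position and then keeps one label. A plain-Python sketch of the same idea follows, assuming the two arrays are parallel lists. The first two entries reuse values from the table above; the third entry and the `Other_Label` name are invented only to give the filter something to drop.

```python
# Parallel lists, playing the role of ner_chunk.result and ner_chunk.metadata for one row.
chunks = ["Coronavirus disease", "infectious disease", "example chunk"]
metadata = [
    {"entity": "Disease_Syndrome_Disorder", "confidence": "0.65905"},
    {"entity": "Disease_Syndrome_Disorder", "confidence": "0.5482"},
    {"entity": "Other_Label", "confidence": "0.5"},  # invented entry, not model output
]

# zip() pairs the i-th chunk with the i-th metadata map, like arrays_zip + explode.
rows = [(chunk, meta["entity"], meta["confidence"]) for chunk, meta in zip(chunks, metadata)]

# Keep only the label of interest, like the .filter(...) step.
disease_rows = [row for row in rows if row[1] == "Disease_Syndrome_Disorder"]

for row in disease_rows:
    print(row)
```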



### `setBlackList`

In [None]:
filterer = medical.DocumentFiltererByNER() \
  .setInputCols(["sentence", "ner_chunk"]) \
  .setOutputCol("filterer") \
  .setBlackList(["Disease_Syndrome_Disorder"])

pipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    ner_jsl,
    ner_converter,
    filterer])

res = pipeline.fit(spark_df).transform(spark_df)
res.selectExpr("idx as doc_id","explode(filterer) as filter").show(truncate=False)

+------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|doc_id|filter                                                                                                                                                                                                           |
+------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|2     |{document, 0, 69, However, some will become seriously ill and require medical attention., {sentence -> 0}, []}                                                                                                   |
|4     |{document, 0, 76, Anyone can get sick with COVID-19 and become seriously ill or die at any age., {sentence -> 0}, []

### `setJoinString`

In [None]:
filterer = medical.DocumentFiltererByNER() \
  .setInputCols(["sentence", "ner_chunk"]) \
  .setOutputCol("filterer") \
  .setBlackList(["Disease_Syndrome_Disorder"])\
  .setOutputAsDocument(True)\
  .setJoinString(" ")

pipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    ner_jsl,
    ner_converter,
    filterer])

res = pipeline.fit(spark_df).transform(spark_df)
res.selectExpr("idx as doc_id","explode(filterer) as filter").show(truncate=False)

+------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|doc_id|filter                                                                                                                                                                                                                                                                                                     |
+------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|2     |{document, 0, 69, However, some will become seriously ill and req
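
The effect of `outputAsDocument` and `joinString` is easiest to see on a document with more than one sentence. The sketch below is a plain-Python illustration, not the annotator's code: `collect_output` is a hypothetical helper, the two sentences come from the last input row and are assumed to both survive the filter, and returning nothing for an empty input is an assumption as well.

```python
from typing import List


def collect_output(
    kept: List[str], output_as_document: bool, join_string: str = " "
) -> List[str]:
    """Return the kept sentences one by one, or merged into a single document."""
    if not output_as_document:
        return list(kept)
    # Assumption: nothing is emitted when no sentence survived the filter.
    return [join_string.join(kept)] if kept else []


# Assumed to be the sentences kept for the last input row.
kept = [
    "These particles range from larger respiratory droplets to smaller aerosols.",
    "It is important to practice respiratory etiquette, for example by coughing "
    "into a flexed elbow, and to stay home and self-isolate until you recover "
    "if you feel unwell.",
]

print(len(collect_output(kept, output_as_document=False)))
print(collect_output(kept, output_as_document=True, join_string=" | ")[0])
```

A visible separator such as `" | "` makes the sentence boundary easy to spot in the merged text, whereas a single space reads like the original paragraph.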

### `setCaseSensitive`

In [None]:
df = spark.createDataFrame([
    ["Coronavirus disease (COVID-19) is an infectious DISAESE caused by the SARS-CoV-2 virus."],
    ["Most people infected with the virus will experience mild to moderate respiratory illness and recover without requiring special treatment."],
    ["However, some will become seriously ill and require medical attention. "],
    ["Older people and those with underlying medical conditions like cardiovascular disease, diabetes, chronic respiratory disease, or cancer are more likely to develop serious illness."],
    ["Anyone can get sick with COVID-19 and become seriously ill or die at any age."],
    ["The best way to prevent and slow down transmission is to be well informed about the disease and how the virus spreads."],
    ["Protect yourself and others from infection by staying at least 1 metre apart from others, wearing a properly fitted mask, and washing your hands or using an alcohol-based rub frequently."],
    ["Get vaccinated when it’s your turn and follow local guidance."],
    ["Stay home if you feel unwell."],
    ["If you have a fever, cough and difficulty breathing, seek medical attention."],
    ["The virus can spread from an infected person’s mouth or nose in small liquid particles when they cough, sneeze, speak, sing or breathe. "],
    ["These particles range from larger respiratory droplets to smaller aerosols. It is important to practice respiratory etiquette, for example by coughing into a flexed elbow, and to stay home and self-isolate until you recover if you feel unwell."]
    ]).toDF("text")

spark_df = df.coalesce(1).withColumn("idx", F.monotonically_increasing_id())

In [None]:
filterer = medical.DocumentFiltererByNER() \
  .setInputCols(["sentence", "ner_chunk"]) \
  .setOutputCol("filterer") \
  .setWhiteList(["Disease_Syndrome_Disorder"])\
  .setCaseSensitive(False)

pipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    ner_jsl,
    ner_converter,
    filterer])

res = pipeline.fit(spark_df).transform(spark_df)
res.selectExpr("idx as doc_id","explode(filterer) as filter").show(truncate=False)

+------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|doc_id|filter                                                                                                                                                                                                                             |
+------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|0     |{document, 0, 86, Coronavirus disease (COVID-19) is an infectious DISAESE caused by the SARS-CoV-2 virus., {sentence -> 0}, []}                                                                                                    |
|1     |{document, 0, 136, Most people infected with