![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)


# **ZeroShotNerModel**

This notebook covers the parameters and usage of the `ZeroShotNerModel` annotator.

**📖 Learning Objectives:**

1. Understand how to use `ZeroShotNerModel`.

2. Become comfortable using the different parameters of the annotator.

3. Identify clinical entities in text without training data.


**🔗 Helpful Links:**

- Documentation : [ZeroShotNerModel](https://nlp.johnsnowlabs.com/docs/en/licensed_annotators#zeroshotnermodel)

- Python Docs : [ZeroShotNerModel](https://nlp.johnsnowlabs.com/licensed/api/python/reference/autosummary/sparknlp_jsl/annotator/ner/zero_shot_ner/index.html#sparknlp_jsl.annotator.ner.zero_shot_ner.ZeroShotNerModel)

- Scala Docs : [ZeroShotNerModel](https://nlp.johnsnowlabs.com/licensed/api/com/johnsnowlabs/finance/token_classification/ner/ZeroShotNerModel.html)

- For extended examples of usage, see the [Spark NLP Workshop repository](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/healthcare-nlp/).

## **📜 Background**


`ZeroShotNerModel` implements zero-shot named entity recognition by utilizing `BERT`-family transformer models; the pretrained model loaded in this notebook is `zero_shot_ner_roberta`.

As a zero-shot model, it does not need to be trained on a task-specific dataset, and the entity types do not have to be fixed in advance: they are defined when the annotator is configured.

## **🎬 Colab Setup**

In [None]:
!pip install -q johnsnowlabs

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

Please Upload your John Snow Labs License using the button below


Saving spark_nlp_for_healthcare_spark_ocr_8734_532.json to spark_nlp_for_healthcare_spark_ocr_8734_532.json


In [None]:
from johnsnowlabs import nlp

nlp.install()

👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_8734_532.json
📋 Stored John Snow Labs License in /root/.johnsnowlabs/licenses/license_number_0_for_Spark-Healthcare_Spark-OCR.json
👷 Setting up  John Snow Labs home in /root/.johnsnowlabs, this might take a few minutes.
Downloading 🐍+🚀 Python Library spark_nlp-5.3.2-py2.py3-none-any.whl
Downloading 🐍+💊 Python Library spark_nlp_jsl-5.3.2-py3-none-any.whl
Downloading 🫘+🚀 Java Library spark-nlp-assembly-5.3.2.jar
Downloading 🫘+💊 Java Library spark-nlp-jsl-5.3.2.jar
🙆 JSL Home setup in /root/.johnsnowlabs
👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_8734_532.json
Installing /root/.johnsnowlabs/py_installs/spark_nlp_jsl-5.3.2-py3-none-any.whl to /usr/bin/python3
Installed 1 products:
💊 Spark-Healthcare==5.3.2 installed! ✅ Heal the planet with NLP! 


In [None]:
from johnsnowlabs import nlp, medical
import pyspark.sql.functions as F

spark = nlp.start(hardware_target="gpu")
spark

👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_8734_532.json
🤓 Looks like you are missing some jars, trying fetching them ...
👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_8734_532.json
Downloading 🫘+🚀 Java Library spark-nlp-gpu-assembly-5.3.2.jar
🙆 JSL Home setup in /root/.johnsnowlabs
👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_8734_532.json
👌 Launched gpu optimized session with with: 🚀Spark-NLP==5.3.2, 💊Spark-Healthcare==5.3.2, running on ⚡ PySpark==3.4.0


## **🖨️ Input/Output Annotation Types**

- Input: `DOCUMENT`, `TOKEN`

- Output: `NAMED_ENTITY`

## **🔎 Parameters**


- `entityDefinitions`: A dictionary with the definitions of the named entities. The keys of the dictionary are the entity types and the values are lists of hypothesis templates (question-style prompts).
- `predictionThreshold`: Minimum confidence score required to keep an entity (Default: `0.01`).
- `ignoreEntities`: A list of entities to be discarded from the output.


All the parameters can be set using the corresponding set method in camel case. For example, `.setPredictionThreshold()`.
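
Because `entityDefinitions` is a plain dictionary, it can be worth sanity-checking its shape before handing it to the annotator. The sketch below is a minimal, library-independent check that assumes only the structure described above (entity type mapped to a non-empty list of prompt strings). The helper name `validate_entity_definitions` is hypothetical and is not part of Spark NLP.

```python
def validate_entity_definitions(definitions):
    """Return a list of problems found in an entityDefinitions-style dict.

    Hypothetical helper: it checks only the shape described in this notebook
    (entity type -> non-empty list of prompt strings), nothing library-specific.
    """
    if not isinstance(definitions, dict) or not definitions:
        return ["definitions must be a non-empty dict"]

    problems = []
    for label, prompts in definitions.items():
        if not isinstance(label, str) or not label.strip():
            problems.append(f"{label!r}: entity type must be a non-empty string")
        if not isinstance(prompts, list) or not prompts:
            problems.append(f"{label!r}: value must be a non-empty list of prompts")
            continue
        for prompt in prompts:
            if not isinstance(prompt, str) or not prompt.strip():
                problems.append(f"{label!r}: prompt {prompt!r} must be a non-empty string")
    return problems


good = {
    "DRUG": ["Which drug?", "What is the drug?"],
    "PATIENT_AGE": ["How old is the patient?"],
}
# Two common mistakes: a bare string instead of a list, and an empty list.
bad = {"DRUG": "Which drug?", "PROBLEM": []}

print(validate_entity_definitions(good))
for problem in validate_entity_definitions(bad):
    print(problem)
```

A dictionary that passes this check can then be passed to `.setEntityDefinitions()`.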

### `entityDefinitions`

In [None]:
documentAssembler = nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")

sentenceDetector = (
    nlp.SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
)

tokenizer = nlp.Tokenizer().setInputCols(["sentence"]).setOutputCol("token")

zero_shot_ner = (
    medical.ZeroShotNerModel.pretrained(
        "zero_shot_ner_roberta", "en", "clinical/models"
    )
    .setEntityDefinitions(
        {
            "PROBLEM": [
                "What is the disease?",
                "What is his symptom?",
                "What is her disease?",
                "What is his disease?",
                "What is the problem?",
                "What does a patient suffer",
                "What was the reason that the patient is admitted to the clinic?",
            ],
            "DRUG": [
                "Which drug?",
                "Which is the drug?",
                "What is the drug?",
                "Which drug does he use?",
                "Which drug does she use?",
                "Which drug do I use?",
                "Which drug is prescribed for a symptom?",
            ],
            "ADMISSION_DATE": ["When did patient admitted to a clinic?"],
            "PATIENT_AGE": [
                "How old is the patient?",
                "What is the gae of the patient?",
            ],
        }
    )
    .setInputCols(["sentence", "token"])
    .setOutputCol("zero_shot_ner")
    .setPredictionThreshold(0.1)
)

ner_converter = (
    medical.NerConverterInternal()
    .setInputCols(["sentence", "token", "zero_shot_ner"])
    .setOutputCol("ner_chunk")
)
pipeline = nlp.Pipeline(
    stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        zero_shot_ner,
        ner_converter,
    ]
)

zero_shot_ner_model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

zero_shot_ner_roberta download started this may take some time.
[OK!]


In [None]:
zero_shot_ner.getClasses()

['DRUG', 'PATIENT_AGE', 'ADMISSION_DATE', 'PROBLEM']

In [None]:
text_list = [
    "The doctor pescribed Majezik for my severe headache.",
    "The patient was admitted to the hospital for his colon cancer.",
    "27 years old patient was admitted to clinic on Sep 1st by Dr. X for a right-sided pleural effusion for thoracentesis.",
]

data = spark.createDataFrame(text_list, nlp.StringType()).toDF("text")

results = zero_shot_ner_model.transform(data)

In [None]:
results.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|       zero_shot_ner|           ner_chunk|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|The doctor pescri...|[{document, 0, 51...|[{document, 0, 51...|[{token, 0, 2, Th...|[{named_entity, 0...|[{chunk, 21, 27, ...|
|The patient was a...|[{document, 0, 61...|[{document, 0, 61...|[{token, 0, 2, Th...|[{named_entity, 0...|[{chunk, 49, 60, ...|
|27 years old pati...|[{document, 0, 11...|[{document, 0, 11...|[{token, 0, 1, 27...|[{named_entity, 0...|[{chunk, 0, 7, 27...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+



In [None]:
results.select(
    F.explode(
        F.arrays_zip(
            results.token.result,
            results.zero_shot_ner.result,
            results.zero_shot_ner.metadata,
            results.zero_shot_ner.begin,
            results.zero_shot_ner.end,
        )
    ).alias("cols")
).select(
    F.expr("cols['0']").alias("token"),
    F.expr("cols['1']").alias("ner_label"),
    F.expr("cols['2']['sentence']").alias("sentence"),
    F.expr("cols['3']").alias("begin"),
    F.expr("cols['4']").alias("end"),
    F.expr("cols['2']['confidence']").alias("confidence"),
).show(
    50, truncate=100
)

+-------------+----------------+--------+-----+---+----------+
|        token|       ner_label|sentence|begin|end|confidence|
+-------------+----------------+--------+-----+---+----------+
|          The|               O|       0|    0|  2|      null|
|       doctor|               O|       0|    4|  9|      null|
|    pescribed|               O|       0|   11| 19|      null|
|      Majezik|          B-DRUG|       0|   21| 27| 0.6233715|
|          for|               O|       0|   29| 31|      null|
|           my|               O|       0|   33| 34|      null|
|       severe|       B-PROBLEM|       0|   36| 41|0.53198636|
|     headache|       I-PROBLEM|       0|   43| 50|0.53198636|
|            .|               O|       0|   51| 51|      null|
|          The|               O|       0|    0|  2|      null|
|      patient|               O|       0|    4| 10|      null|
|          was|               O|       0|   12| 14|      null|
|     admitted|               O|       0|   16| 23|    
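
The `F.explode(F.arrays_zip(...))` pattern used in the cell above can be hard to read at first. As a rough plain-Python analogy (a sketch, not how Spark executes it), it pairs up the parallel arrays position by position and then emits one row per position. The helper `zip_and_explode` is a hypothetical name, and the sample values are copied from the first sentence of the output table. One difference to keep in mind: Spark's `arrays_zip` pads shorter arrays with nulls, while Python's `zip` stops at the shortest input.

```python
# One DataFrame row, reduced to the parallel arrays that get zipped together.
row = {
    "token": ["Majezik", "for", "my", "severe", "headache"],
    "label": ["B-DRUG", "O", "O", "B-PROBLEM", "I-PROBLEM"],
    "begin": [21, 29, 33, 36, 43],
    "end": [27, 31, 34, 41, 50],
}


def zip_and_explode(rows, columns):
    """Yield one flat dict per array position, similar to explode(arrays_zip(...))."""
    for r in rows:
        # zip(*arrays) pairs the i-th element of every selected column.
        for values in zip(*(r[c] for c in columns)):
            yield dict(zip(columns, values))


flat = list(zip_and_explode([row], ["token", "label", "begin", "end"]))
for item in flat:
    print(item)
```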

### `predictionThreshold`

Some of the confidence scores above are fairly low (for example, about 0.53 for `severe headache`). Let's raise the threshold to `0.8` and see which entities survive.

In [None]:
zero_shot_ner.setPredictionThreshold(0.8)

pipeline = nlp.Pipeline(
    stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        zero_shot_ner,
        ner_converter,
    ]
)

zero_shot_ner_model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
results = zero_shot_ner_model.transform(data)

results.select(
    F.explode(
        F.arrays_zip(
            results.token.result,
            results.zero_shot_ner.result,
            results.zero_shot_ner.metadata,
            results.zero_shot_ner.begin,
            results.zero_shot_ner.end,
        )
    ).alias("cols")
).select(
    F.expr("cols['0']").alias("token"),
    F.expr("cols['1']").alias("ner_label"),
    F.expr("cols['2']['sentence']").alias("sentence"),
    F.expr("cols['3']").alias("begin"),
    F.expr("cols['4']").alias("end"),
    F.expr("cols['2']['confidence']").alias("confidence"),
).show(
    50, truncate=100
)

+-------------+----------------+--------+-----+---+----------+
|        token|       ner_label|sentence|begin|end|confidence|
+-------------+----------------+--------+-----+---+----------+
|          The|               O|       0|    0|  2|      null|
|       doctor|               O|       0|    4|  9|      null|
|    pescribed|               O|       0|   11| 19|      null|
|      Majezik|               O|       0|   21| 27|      null|
|          for|               O|       0|   29| 31|      null|
|           my|               O|       0|   33| 34|      null|
|       severe|               O|       0|   36| 41|      null|
|     headache|               O|       0|   43| 50|      null|
|            .|               O|       0|   51| 51|      null|
|          The|               O|       0|    0|  2|      null|
|      patient|               O|       0|    4| 10|      null|
|          was|               O|       0|   12| 14|      null|
|     admitted|               O|       0|   16| 23|    

### `ignoreEntities`

In [None]:
zero_shot_ner.setPredictionThreshold(0.45).setIgnoreEntities(["PATIENT_AGE"])

pipeline = nlp.Pipeline(
    stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        zero_shot_ner,
        ner_converter,
    ]
)

zero_shot_ner_model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
results = zero_shot_ner_model.transform(data)

results.select(
    F.explode(
        F.arrays_zip(
            results.token.result,
            results.zero_shot_ner.result,
            results.zero_shot_ner.metadata,
            results.zero_shot_ner.begin,
            results.zero_shot_ner.end,
        )
    ).alias("cols")
).select(
    F.expr("cols['0']").alias("token"),
    F.expr("cols['1']").alias("ner_label"),
    F.expr("cols['2']['sentence']").alias("sentence"),
    F.expr("cols['3']").alias("begin"),
    F.expr("cols['4']").alias("end"),
    F.expr("cols['2']['confidence']").alias("confidence"),
).show(
    50, truncate=100
)

+-------------+----------------+--------+-----+---+----------+
|        token|       ner_label|sentence|begin|end|confidence|
+-------------+----------------+--------+-----+---+----------+
|          The|               O|       0|    0|  2|      null|
|       doctor|               O|       0|    4|  9|      null|
|    pescribed|               O|       0|   11| 19|      null|
|      Majezik|          B-DRUG|       0|   21| 27| 0.6233715|
|          for|               O|       0|   29| 31|      null|
|           my|               O|       0|   33| 34|      null|
|       severe|       B-PROBLEM|       0|   36| 41|0.53198636|
|     headache|       I-PROBLEM|       0|   43| 50|0.53198636|
|            .|               O|       0|   51| 51|      null|
|          The|               O|       0|    0|  2|      null|
|      patient|               O|       0|    4| 10|      null|
|          was|               O|       0|   12| 14|      null|
|     admitted|               O|       0|   16| 23|    
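
To make the interplay between `predictionThreshold` and `ignoreEntities` concrete, here is a small plain-Python sketch that applies both filters to the two predictions whose confidence scores appear in the tables above. It is an illustration of the filtering idea, not the annotator's implementation: the helper `filter_predictions` is hypothetical, the "at or above the threshold" comparison is an assumption, and ignoring `DRUG` in the last call is an illustrative variation (the notebook itself ignores `PATIENT_AGE`).

```python
# Chunk-level predictions copied from the output tables in this notebook.
predictions = [
    {"chunk": "Majezik", "label": "DRUG", "confidence": 0.6233715},
    {"chunk": "severe headache", "label": "PROBLEM", "confidence": 0.53198636},
]


def filter_predictions(preds, threshold, ignore=()):
    """Keep predictions scoring at or above `threshold` whose label is not ignored.

    Hypothetical helper; the exact boundary behavior of the real annotator
    is not shown in this notebook and is assumed here.
    """
    ignored = set(ignore)
    return [
        p for p in preds
        if p["confidence"] >= threshold and p["label"] not in ignored
    ]


for threshold, ignore in [(0.1, []), (0.8, []), (0.45, []), (0.45, ["DRUG"])]:
    kept = [p["chunk"] for p in filter_predictions(predictions, threshold, ignore)]
    print(f"threshold={threshold}, ignore={ignore} -> {kept}")
```

With a threshold of `0.8` nothing is kept, which matches the all-`O` labels seen for the first sentence in the `predictionThreshold` section, while `0.45` keeps both entities.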

## Fast inference with [LightPipelines](https://nlp.johnsnowlabs.com/docs/en/concepts#using-spark-nlps-lightpipeline)

We can use Spark NLP's `LightPipeline` to run fast inference directly on a string (or a list of strings) instead of on Spark DataFrames.

Let's check how to do that.

In [None]:
lp = nlp.LightPipeline(zero_shot_ner_model)

result = lp.annotate("27 years old patient was admitted to clinic on Sep 1st by Dr. X for a right-sided pleural effusion for thoracentesis.")
result

{'zero_shot_ner': ['O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'B-ADMISSION_DATE',
  'I-ADMISSION_DATE',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'B-PROBLEM',
  'I-PROBLEM',
  'I-PROBLEM',
  'I-PROBLEM',
  'I-PROBLEM',
  'O'],
 'document': ['27 years old patient was admitted to clinic on Sep 1st by Dr. X for a right-sided pleural effusion for thoracentesis.'],
 'ner_chunk': ['Sep 1st', 'right-sided pleural effusion for thoracentesis'],
 'token': ['27',
  'years',
  'old',
  'patient',
  'was',
  'admitted',
  'to',
  'clinic',
  'on',
  'Sep',
  '1st',
  'by',
  'Dr',
  '.',
  'X',
  'for',
  'a',
  'right-sided',
  'pleural',
  'effusion',
  'for',
  'thoracentesis',
  '.'],
 'sentence': ['27 years old patient was admitted to clinic on Sep 1st by Dr. X for a right-sided pleural effusion for thoracentesis.']}

In [None]:
for token, ner_label in zip(result["token"], result["zero_shot_ner"]):
  print(f"{token} => {ner_label}")

27 => O
years => O
old => O
patient => O
was => O
admitted => O
to => O
clinic => O
on => O
Sep => B-ADMISSION_DATE
1st => I-ADMISSION_DATE
by => O
Dr => O
. => O
X => O
for => O
a => O
right-sided => B-PROBLEM
pleural => I-PROBLEM
effusion => I-PROBLEM
for => I-PROBLEM
thoracentesis => I-PROBLEM
. => O
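
The token-level `B-`/`I-`/`O` labels can also be grouped into chunks by hand, which helps when reading the `ner_chunk` output. The sketch below is a simplified plain-Python illustration of that grouping, using the tokens and labels printed above. The helper `bio_to_chunks` is hypothetical, and it joins tokens with single spaces, whereas the pipeline's converter works from character offsets in the original text.

```python
def bio_to_chunks(tokens, labels):
    """Group BIO-tagged tokens into (chunk_text, entity_type) pairs."""
    chunks = []
    current_tokens, current_label = [], None

    def flush():
        # Close the chunk being built, if any.
        if current_tokens:
            chunks.append((" ".join(current_tokens), current_label))

    for token, label in zip(tokens, labels):
        entity = label[2:] if label != "O" else None
        if label.startswith("I-") and entity == current_label:
            current_tokens.append(token)  # continue the open chunk
        else:
            flush()
            if entity is None:
                current_tokens, current_label = [], None  # outside any entity
            else:
                current_tokens, current_label = [token], entity  # start a new chunk
    flush()
    return chunks


tokens = [
    "27", "years", "old", "patient", "was", "admitted", "to", "clinic", "on",
    "Sep", "1st", "by", "Dr", ".", "X", "for", "a",
    "right-sided", "pleural", "effusion", "for", "thoracentesis", ".",
]
labels = (
    ["O"] * 9
    + ["B-ADMISSION_DATE", "I-ADMISSION_DATE"]
    + ["O"] * 6
    + ["B-PROBLEM"] + ["I-PROBLEM"] * 4
    + ["O"]
)

print(bio_to_chunks(tokens, labels))
# [('Sep 1st', 'ADMISSION_DATE'), ('right-sided pleural effusion for thoracentesis', 'PROBLEM')]
```

The two chunks match the `ner_chunk` entries returned by `lp.annotate` above.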
