![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/Spark_NLP_Udemy_MOOC/Healthcare_NLP/PretrainedZeroShotNer.ipynb)

# **Zero-Shot Named Entity Recognition (NER) in Spark NLP**


# **PretrainedZeroShotNER**


This notebook covers the parameters and usage of the `PretrainedZeroShotNER` annotator.

**📖 Learning Objectives:**

1. Understand how to use `PretrainedZeroShotNER`.

2. Become comfortable using the different parameters of the annotator.

3. Identify clinical entities in text without training data.

**🔗 Helpful Links:**

- Documentation : [PretrainedZeroShotNER](https://nlp.johnsnowlabs.com/docs/en/licensed_annotators#pretrainedzeroshotner)

- Python Docs : [PretrainedZeroShotNER](https://nlp.johnsnowlabs.com/licensed/api/python/reference/autosummary/sparknlp_jsl/annotator/ner/pretrained_zero_shot_ner/index.html#sparknlp_jsl.annotator.ner.pretrained_zero_shot_ner.PretrainedZeroShotNER)

- Scala Docs : [PretrainedZeroShotNER](https://nlp.johnsnowlabs.com/licensed/api/com/johnsnowlabs/nlp/annotators/ner/PretrainedZeroShotNER.html)

- For extended examples of usage see the [Spark NLP Workshop repository](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/healthcare-nlp/).

## **📜 Background**


`PretrainedZeroShotNER` identifies entities in text without requiring a pre-labeled dataset. It relies on pretrained language models that can recognize entities across domains and languages. The approach is flexible: you define your own entity labels instead of relying on a fixed set. For the best results, choose labels similar to the predicted entities listed for each model, as they guide the model's understanding.

## **🖨️ Input/Output Annotation Types**

- Input: `DOCUMENT`, `TOKEN`

- Output: `NAMED_ENTITY`

## **🔎 Parameters**


`labels`: A list of labels describing the entities to extract. For example: `["person", "location"]`

`predictionThreshold`: Minimal confidence score to encode an entity (Default: `0.01`)

`batchSize`: The number of inputs processed together in a single batch during inference. A higher batch size can improve throughput and reduce overall inference time on supported hardware, but may increase memory usage.

All parameters can be set using the corresponding setter method in camel case, for example `setLabels()`, `setPredictionThreshold()`, or `setBatchSize()`.

## 🎯 **Colab Setup**

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs

In [None]:
from google.colab import files

print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp, medical
# After uploading your license, run this to install the licensed Python wheels and pre-download the JARs for the Spark session JVM
nlp.settings.enforce_versions=False
nlp.install(refresh_install=True)

In [None]:
from johnsnowlabs import nlp, medical

# Automatically load license data and start a session with all jars user has access to
spark = nlp.start()

In [None]:
spark

In [None]:
from pyspark.sql import DataFrame
import pyspark.sql.functions as F
import pyspark.sql.types as T
import pyspark.sql as SQL
from pyspark import keyword_only

## 🔎 **Pretrained Zero-Shot Named Entity Recognition Models**

| Index | Model | Predicted Entities |
|------:|:------|:-------------------|
| 1 | [zeroshot_ner_generic_large](https://nlp.johnsnowlabs.com/2024/11/28/zeroshot_ner_generic_large_en.html) | `AGE`,`DATE`,`DISEASE`,`DISORDER`,`DRUG`,`LOCATION`,`NAME`,`PHONE`,`RESULT`,`SYMPTOM`,`SYNDROME`,`TEST`,`TREATMENT` |
| 2 | [zeroshot_ner_generic_medium](https://nlp.johnsnowlabs.com/2024/11/28/zeroshot_ner_generic_medium_en.html) | `AGE`,`DATE`,`DISEASE`,`DISORDER`,`DRUG`,`LOCATION`,`NAME`,`PHONE`,`RESULT`,`SYMPTOM`,`SYNDROME`,`TEST`,`TREATMENT` |
| 3 | [zeroshot_ner_clinical_large](https://nlp.johnsnowlabs.com/2024/11/27/zeroshot_ner_clinical_large_en.html) | `PROBLEM`, `TREATMENT`, `TEST` |
| 4 | [zeroshot_ner_clinical_medium](https://nlp.johnsnowlabs.com/2024/11/27/zeroshot_ner_clinical_medium_en.html) | `PROBLEM`, `TREATMENT`, `TEST` |
| 5 | [zeroshot_ner_oncology_large](https://nlp.johnsnowlabs.com/2024/11/28/zeroshot_ner_oncology_large_en.html) | `Adenopathy`, `Age`, `Biomarker`, `Biomarker_Result`, `Body_Part`, `Cancer_Dx`, `Cancer_Surgery`, `Cycle_Count`, `Cycle_Day`, `Date`, `Death_Entity`, `Direction`, `Dosage`, `Duration`, `Frequency`, `Gender`, `Grade`, `Histological_Type`, `Imaging_Test`, `Invasion`, `Metastasis`, `Oncogene`, `Pathology_Test`, `Race_Ethnicity`, `Radiation_Dose`, `Relative_Date`, `Response_To_Treatment`, `Route`, `Smoking_Status`, `Staging`, `Therapy`, `Tumor_Finding`, `Tumor_Size` |
| 6 | [zeroshot_ner_oncology_medium](https://nlp.johnsnowlabs.com/2024/11/27/zeroshot_ner_oncology_medium_en.html) | `Adenopathy`, `Age`, `Biomarker`, `Biomarker_Result`, `Body_Part`, `Cancer_Dx`, `Cancer_Surgery`, `Cycle_Count`, `Cycle_Day`, `Date`, `Death_Entity`, `Direction`, `Dosage`, `Duration`, `Frequency`, `Gender`, `Grade`, `Histological_Type`, `Imaging_Test`, `Invasion`, `Metastasis`, `Oncogene`, `Pathology_Test`, `Race_Ethnicity`, `Radiation_Dose`, `Relative_Date`, `Response_To_Treatment`, `Route`, `Smoking_Status`, `Staging`, `Therapy`, `Tumor_Finding`, `Tumor_Size` |
| 7 | [zeroshot_ner_deid_subentity_docwise_large](https://nlp.johnsnowlabs.com/2024/11/29/zeroshot_ner_deid_subentity_docwise_large_en.html) | `DATE`, `PATIENT`, `COUNTRY`, `PROFESSION`, `AGE`, `CITY`, `STATE`, `DOCTOR`, `HOSPITAL`, `IDNUM`, `ORGANIZATION`, `PHONE`, `STREET`, `ZIP` |
| 8 | [zeroshot_ner_deid_subentity_docwise_medium](https://nlp.johnsnowlabs.com/2024/11/28/zeroshot_ner_deid_subentity_docwise_medium_en.html) | `DATE`, `PATIENT`, `COUNTRY`, `PROFESSION`, `AGE`, `CITY`, `STATE`, `DOCTOR`, `HOSPITAL`, `IDNUM`, `ORGANIZATION`, `PHONE`, `STREET`, `ZIP` |
| 9 | [zeroshot_ner_deid_subentity_merged_medium](https://nlp.johnsnowlabs.com/2024/11/27/zeroshot_ner_deid_subentity_merged_medium_en.html) | `DOCTOR`, `PATIENT`, `AGE`, `DATE`, `HOSPITAL`, `CITY`, `STREET`, `STATE`, `COUNTRY`, `PHONE`, `IDNUM`, `EMAIL`, `ZIP`, `ORGANIZATION`, `PROFESSION`, `USERNAME` |
| 10 | [zeroshot_ner_deid_generic_docwise_large](https://nlp.johnsnowlabs.com/2024/11/28/zeroshot_ner_deid_generic_docwise_large_en.html) | `DATE`, `NAME`, `LOCATION`, `PROFESSION`, `AGE`, `ID`, `CONTACT` |
| 11 | [zeroshot_ner_deid_generic_docwise_medium](https://nlp.johnsnowlabs.com/2024/11/28/zeroshot_ner_deid_generic_docwise_medium_en.html) | `DATE`, `NAME`, `LOCATION`, `PROFESSION`, `AGE`, `ID`, `CONTACT` |
| 12 | [zeroshot_ner_oncology_biomarker_large](https://nlp.johnsnowlabs.com/2024/12/13/zeroshot_ner_oncology_biomarker_large_en.html) | `Biomarker`, `Biomarker_Result` |
| 13 | [zeroshot_ner_oncology_biomarker_medium](https://nlp.johnsnowlabs.com/2024/12/13/zeroshot_ner_oncology_biomarker_medium_en.html) | `Biomarker`, `Biomarker_Result` |
| 14 | [zeroshot_ner_deid_generic_multi_large](https://nlp.johnsnowlabs.com/2024/12/21/zeroshot_ner_deid_generic_multi_large_xx.html) | `AGE`, `CONTACT`, `DATE`, `ID`, `LOCATION`, `NAME`, `PROFESSION` |
| 15 | [zeroshot_ner_deid_generic_multi_medium](https://nlp.johnsnowlabs.com/2024/12/21/zeroshot_ner_deid_generic_multi_medium_xx.html) | `AGE`, `CONTACT`, `DATE`, `ID`, `LOCATION`, `NAME`, `PROFESSION` |
| 16 | [zeroshot_ner_deid_subentity_merged_large](https://nlp.johnsnowlabs.com/2024/12/17/zeroshot_ner_deid_subentity_merged_large_en.html) | `DOCTOR`, `PATIENT`, `AGE`, `DATE`, `HOSPITAL`, `CITY`, `STREET`, `STATE`, `COUNTRY`, `PHONE`, `IDNUM`, `EMAIL`, `ZIP`, `ORGANIZATION`, `PROFESSION`, `USERNAME` |
| 17 | [zeroshot_ner_jsl_large](https://nlp.johnsnowlabs.com/2025/01/01/zeroshot_ner_jsl_large_en.html) | `Admission_Discharge`, `Age`, `Alcohol`, `Body_Part`, `Clinical_Dept`, `Direction`, `Disease_Syndrome_Disorder`, `Dosage_Strength`, `Drug`, `Duration`, `Employment`, `Form`, `Frequency`, `Gender`, `Injury_or_Poisoning`, `Medical_Device`, `Modifier`, `Oncological`, `Procedure`, `Race_Ethnicity`, `Relationship_Status`, `Route`, `Section_Header`, `Smoking`, `Symptom`, `Test`, `Test_Result`, `Treatment`, `Vaccine` |
| 18 | [zeroshot_ner_jsl_medium](https://nlp.johnsnowlabs.com/2025/01/01/zeroshot_ner_jsl_medium_en.html) | `Admission_Discharge`, `Age`, `Alcohol`, `Body_Part`, `Clinical_Dept`, `Direction`, `Disease_Syndrome_Disorder`, `Dosage_Strength`, `Drug`, `Duration`, `Employment`, `Form`, `Frequency`, `Gender`, `Injury_or_Poisoning`, `Medical_Device`, `Modifier`, `Oncological`, `Procedure`, `Race_Ethnicity`, `Relationship_Status`, `Route`, `Section_Header`, `Smoking`, `Symptom`, `Test`, `Test_Result`, `Treatment`, `Vaccine` |
| 19 | [zeroshot_ner_ade_clinical_large](https://nlp.johnsnowlabs.com/2024/12/02/zeroshot_ner_ade_clinical_large_en.html) | `DRUG`, `ADE`, `PROBLEM` |
| 20 | [zeroshot_ner_sdoh_medium](https://nlp.johnsnowlabs.com/2025/01/06/zeroshot_ner_sdoh_medium_en.html) | `Access_To_Care`, `Age`, `Alcohol`, `Childhood_Development`, `Diet`, `Disability`, `Eating_Disorder`, `Education`, `Employment`, `Environmental_Condition`, `Exercise`, `Family_Member`, `Financial_Status`, `Gender`, `Geographic_Entity`, `Healthcare_Institution`, `Housing`, `Hypertension`, `Income`, `Insurance_Status`, `Language`, `Legal_Issues`, `Marital_Status`, `Mental_Health`, `Obesity`, `Other_Disease`, `Other_SDoH_Keywords`, `Quality_Of_Life`, `Race_Ethnicity`, `Sexual_Activity`, `Sexual_Orientation`, `Smoking`, `Social_Exclusion`, `Social_Support`, `Spiritual_Beliefs`, `Transportation`, `Violence_Or_Abuse` |
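| 21 | `zeroshot_ner_sdoh_large` | `Access_To_Care`, `Age`, `Alcohol`, `Childhood_Development`, `Diet`, `Disability`, `Eating_Disorder`, `Education`, `Employment`, `Environmental_Condition`, `Exercise`, `Family_Member`, `Financial_Status`, `Gender`, `Geographic_Entity`, `Healthcare_Institution`, `Housing`, `Hypertension`, `Income`, `Insurance_Status`, `Language`, `Legal_Issues`, `Marital_Status`, `Mental_Health`, `Obesity`, `Other_Disease`, `Other_SDoH_Keywords`, `Quality_Of_Life`, `Race_Ethnicity`, `Sexual_Activity`, `Sexual_Orientation`, `Smoking`, `Social_Exclusion`, `Social_Support`, `Spiritual_Beliefs`, `Transportation`, `Violence_Or_Abuse` |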

## **Build the Pipeline**

In this example we use the `zeroshot_ner_oncology_large` model.

In [None]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")
# You can choose just the entities you need from the reference list
labels = ['Biomarker', 'Biomarker_Result', 'Body_Part', 'Cancer_Dx', 'Cancer_Surgery', 'Imaging_Test']

pretrained_zero_shot_ner = medical.PretrainedZeroShotNER().pretrained("zeroshot_ner_oncology_large", "en", "clinical/models")\
    .setInputCols("sentence", "token")\
    .setOutputCol("entities")\
    .setPredictionThreshold(0.5)\
    .setLabels(labels)

ner_converter = medical.NerConverterInternal()\
    .setInputCols("sentence", "token", "entities")\
    .setOutputCol("ner_chunk")


pipeline = nlp.Pipeline().setStages([
    document_assembler,
    sentence_detector,
    tokenizer,
    pretrained_zero_shot_ner,
    ner_converter
])

zeroshot_ner_oncology_large download started this may take some time.
Approximate size to download 1.5 GB
[OK!]
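
Each stage reads columns produced by an earlier stage, so a misspelled column name only surfaces when the pipeline runs. The following is a minimal plain-Python sketch of that wiring, not part of Spark NLP: `find_missing_inputs` is a hypothetical helper, and the stage tuples simply restate the column names used in the cell above.

```python
# (stage name, input columns, output column), restating the pipeline cell above.
stages = [
    ("document_assembler", ["text"], "document"),
    ("sentence_detector", ["document"], "sentence"),
    ("tokenizer", ["sentence"], "token"),
    ("pretrained_zero_shot_ner", ["sentence", "token"], "entities"),
    ("ner_converter", ["sentence", "token", "entities"], "ner_chunk"),
]


def find_missing_inputs(stages, initial_columns=("text",)):
    """Return (stage, column) pairs where a stage reads a column nobody produced yet."""
    available = set(initial_columns)
    problems = []
    for name, inputs, output in stages:
        for column in inputs:
            if column not in available:
                problems.append((name, column))
        available.add(output)
    return problems


print(find_missing_inputs(stages))  # []
```

An empty list means every input column is produced before it is consumed. Here the `sentence` column plays the role of the `DOCUMENT` input and `token` the `TOKEN` input listed for the annotator.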


In [None]:
# Sample text. A clinical note.
text = """
Chief Complaint:
Palpable lump in the left breast for 3 weeks.

History of Present Illness:
The patient reports noticing a firm, non-tender mass in the upper outer quadrant of the left breast. No associated nipple discharge or pain. Denies weight loss, fever, or night sweats. Family history is positive for breast cancer in her mother at age 60.

Past Medical History:

Hypertension, controlled with medication

No prior malignancy

Physical Exam:

Left breast: 2.5 cm firm, irregular, non-mobile mass in upper outer quadrant.

No skin dimpling or nipple retraction.

Axilla: palpable, mobile lymph node (~1.0 cm).

Right breast: normal.

Imaging:

Mammogram: irregular spiculated mass, BI-RADS 5.

Ultrasound: hypoechoic lesion, 2.4 cm, irregular borders.

Procedure:
Core Needle Biopsy performed on 09/20/2025.

Pathology Report:

Invasive ductal carcinoma, grade II.

Estrogen receptor (ER): positive (80%).

Progesterone receptor (PR): positive (60%).

HER2/neu: negative.

Ki-67: 25%.

Assessment:
52-year-old female with left breast invasive ductal carcinoma, ER/PR positive, HER2 negative.

Plan:

Refer to surgical oncology for discussion of lumpectomy vs mastectomy.

Oncology consult for adjuvant therapy planning (endocrine therapy ± chemotherapy).

Baseline staging scans ordered.

Genetic counseling referral due to family history.
"""

In [None]:
# Create a Spark DataFrame with the sample text
data = spark.createDataFrame([[text]]).toDF("text")
# Fit and Transform
result = pipeline.fit(data).transform(data)

In [None]:
# Print the results
result.select(F.explode(F.arrays_zip(result.ner_chunk.result,
                                     result.ner_chunk.begin,
                                     result.ner_chunk.end,
                                     result.ner_chunk.metadata)).alias("cols")) \
               .select( F.expr("cols['0']").alias("chunk"),
                       F.expr("cols['1']").alias("begin"),
                       F.expr("cols['2']").alias("end"),
                      F.expr("cols['3']['entity']").alias("ner_label"),
                      F.expr("cols['3']['confidence']").alias("confidence"))\
                       .filter("ner_label!='O'")\
                       .show(50,truncate=False)

+-------------------------+-----+----+----------------+----------+
|chunk                    |begin|end |ner_label       |confidence|
+-------------------------+-----+----+----------------+----------+
|breast                   |44   |49  |Body_Part       |0.9392416 |
|breast                   |186  |191 |Body_Part       |0.87266624|
|nipple                   |208  |213 |Body_Part       |0.9686622 |
|positive                 |296  |303 |Biomarker_Result|0.96460474|
|breast cancer            |309  |321 |Cancer_Dx       |0.9976246 |
|malignancy               |423  |432 |Cancer_Dx       |0.9891188 |
|breast                   |456  |461 |Body_Part       |0.7661266 |
|skin                     |533  |536 |Body_Part       |0.92960954|
|nipple                   |550  |555 |Body_Part       |0.9233175 |
|Axilla                   |570  |575 |Body_Part       |0.9729599 |
|lymph node               |595  |604 |Body_Part       |0.9553588 |
|breast                   |624  |629 |Body_Part       |0.85575
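
The `begin` and `end` columns appear to be character offsets into the original `text`, with `end` inclusive. The notebook does not state this; it is inferred from the first output row (`breast`, 44 to 49). A small plain-Python check on the opening lines of the sample note, where `chunk_text` is a hypothetical helper:

```python
# Opening lines of the sample note, exactly as they appear in `text`
# (the triple-quoted string starts with a newline).
snippet = "\nChief Complaint:\nPalpable lump in the left breast for 3 weeks.\n"


def chunk_text(document, begin, end):
    """Recover a chunk from its offsets, treating `end` as inclusive."""
    return document[begin:end + 1]


print(chunk_text(snippet, 44, 49))  # breast
```

Because the offsets refer to the whole document rather than to individual sentences, the same slice should work on the full `text` string for any row.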

### `predictionThreshold` parameter

This parameter controls the minimum confidence score that a predicted entity must reach in order to be returned.
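
To illustrate the effect, here is a plain-Python sketch that applies a threshold to a few rows copied from the output obtained with a threshold of 0.5. It is not the annotator's implementation; `apply_threshold` is a hypothetical helper, and the inclusive comparison is an assumption.

```python
# Rows copied from the 0.5-threshold output: (chunk, begin, label, confidence).
predictions = [
    ("breast", 44, "Body_Part", 0.9392416),
    ("breast", 186, "Body_Part", 0.87266624),
    ("breast cancer", 309, "Cancer_Dx", 0.9976246),
    ("breast", 456, "Body_Part", 0.7661266),
    ("skin", 533, "Body_Part", 0.92960954),
]


def apply_threshold(rows, threshold):
    """Keep rows whose confidence is at least `threshold` (assumed inclusive)."""
    return [row for row in rows if row[3] >= threshold]


kept = apply_threshold(predictions, 0.8)
dropped = [row for row in predictions if row not in kept]
print(dropped)  # [('breast', 456, 'Body_Part', 0.7661266)]
```

Only the `breast` mention at offset 456 falls below 0.8, so it is the one row of this sample expected to disappear when the threshold is raised.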

Let's set it to return only entities with confidence >= 0.8.



In [None]:
pretrained_zero_shot_ner.setPredictionThreshold(0.8)

pipeline = nlp.Pipeline().setStages([
    document_assembler,
    sentence_detector,
    tokenizer,
    pretrained_zero_shot_ner,
    ner_converter
])


data = spark.createDataFrame([[text]]).toDF("text")

result = pipeline.fit(data).transform(data)

result.select(F.explode(F.arrays_zip(result.ner_chunk.result,
                                     result.ner_chunk.begin,
                                     result.ner_chunk.end,
                                     result.ner_chunk.metadata)).alias("cols")) \
               .select( F.expr("cols['0']").alias("chunk"),
                       F.expr("cols['1']").alias("begin"),
                       F.expr("cols['2']").alias("end"),
                      F.expr("cols['3']['entity']").alias("ner_label"),
                      F.expr("cols['3']['confidence']").alias("confidence"))\
                       .filter("ner_label!='O'")\
                       .show(50,truncate=False)

+-------------------------+-----+----+----------------+----------+
|chunk                    |begin|end |ner_label       |confidence|
+-------------------------+-----+----+----------------+----------+
|breast                   |44   |49  |Body_Part       |0.9392416 |
|breast                   |186  |191 |Body_Part       |0.87266624|
|nipple                   |208  |213 |Body_Part       |0.9686622 |
|positive                 |296  |303 |Biomarker_Result|0.96460474|
|breast cancer            |309  |321 |Cancer_Dx       |0.9976246 |
|malignancy               |423  |432 |Cancer_Dx       |0.9891188 |
|skin                     |533  |536 |Body_Part       |0.92960954|
|nipple                   |550  |555 |Body_Part       |0.9233175 |
|Axilla                   |570  |575 |Body_Part       |0.9729599 |
|lymph node               |595  |604 |Body_Part       |0.9553588 |
|breast                   |624  |629 |Body_Part       |0.8557541 |
|normal                   |632  |637 |Biomarker_Result|0.98224

## Customize Entity Labels

**Customizable Prediction Labels**

You’re not limited to a fixed set of labels — zero-shot NER lets you define the labels that match your use case. Simply provide the terms you want to extract, and the model will adapt its predictions accordingly.
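
When mixing custom labels with labels from a model's reference list, it can help to see which is which and to catch casing differences such as `Histological_type` versus `Histological_Type`. The sketch below is plain Python with a hypothetical helper, `split_labels`, and only a subset of the oncology reference labels. Whether casing affects the model's predictions is not stated in this notebook.

```python
# A subset of the reference labels listed for the oncology models.
ONCOLOGY_REFERENCE_SUBSET = [
    "Biomarker", "Biomarker_Result", "Body_Part", "Cancer_Dx", "Cancer_Surgery",
    "Grade", "Histological_Type", "Imaging_Test", "Staging", "Tumor_Size",
]


def split_labels(requested, reference):
    """Separate requested labels into reference matches and fully custom labels.

    Matching is case-insensitive; the returned dict maps each requested label
    to the reference spelling it corresponds to.
    """
    reference_by_lower = {label.lower(): label for label in reference}
    matched, custom = {}, []
    for label in requested:
        canonical = reference_by_lower.get(label.lower())
        if canonical is None:
            custom.append(label)
        else:
            matched[label] = canonical
    return matched, custom


matched, custom = split_labels(
    ["Proliferation_Index", "Margin_Status", "Histological_type"],
    ONCOLOGY_REFERENCE_SUBSET,
)
print(matched)  # {'Histological_type': 'Histological_Type'}
print(custom)   # ['Proliferation_Index', 'Margin_Status']
```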

In [None]:
# New, customized entities
labels = ['Proliferation_Index', 'Margin_Status', 'Histological_type']

pretrained_zero_shot_ner = medical.PretrainedZeroShotNER().pretrained("zeroshot_ner_oncology_large", "en", "clinical/models")\
    .setInputCols("sentence", "token")\
    .setOutputCol("entities")\
    .setPredictionThreshold(0.5)\
    .setLabels(labels)

ner_converter = medical.NerConverterInternal()\
    .setInputCols("sentence", "token", "entities")\
    .setOutputCol("ner_chunk")


pipeline = nlp.Pipeline().setStages([
    document_assembler,
    sentence_detector,
    tokenizer,
    pretrained_zero_shot_ner,
    ner_converter
])

result = pipeline.fit(data).transform(data)

result.select(F.explode(F.arrays_zip(result.ner_chunk.result,
                                     result.ner_chunk.begin,
                                     result.ner_chunk.end,
                                     result.ner_chunk.metadata)).alias("cols")) \
               .select( F.expr("cols['0']").alias("chunk"),
                       F.expr("cols['1']").alias("begin"),
                       F.expr("cols['2']").alias("end"),
                      F.expr("cols['3']['entity']").alias("ner_label"))\
                       .filter("ner_label!='O'")\
                       .show(50,truncate=False)

zeroshot_ner_oncology_large download started this may take some time.
Approximate size to download 1.5 GB
[OK!]
+-----------------+-----+----+-------------------+
|chunk            |begin|end |ner_label          |
+-----------------+-----+----+-------------------+
|irregular borders|740  |756 |Margin_Status      |
|Invasive         |835  |842 |Margin_Status      |
|ductal           |844  |849 |Histological_type  |
|Ki-67            |980  |984 |Proliferation_Index|
|invasive         |1041 |1048|Histological_type  |
|ductal           |1050 |1055|Histological_type  |
+-----------------+-----+----+-------------------+



Use `getLabels()` to check which entity labels are currently set:

In [None]:
pretrained_zero_shot_ner.getLabels()

['Proliferation_Index',
 'Margin_Status',
 'Histological_type']