![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/01.4.ZeroShot_Clinical_NER.ipynb)

# 📌 **Zero-Shot Named Entity Recognition (NER) in Spark NLP**


## 📌 **Healthcare NLP for Data Scientists Course**

If you are not familiar with the components in this notebook, see the [Healthcare NLP for Data Scientists Udemy Course](https://www.udemy.com/course/healthcare-nlp-for-data-scientists/) and the [MOOC Notebooks](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/Spark_NLP_Udemy_MOOC/Healthcare_NLP) for each component.

## 🎯 **Colab Setup**

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs

In [None]:
from google.colab import files

print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp, medical
# After uploading your license, run this to install all licensed Python wheels and pre-download the JARs for the Spark session JVM
nlp.settings.enforce_versions=False
nlp.install(refresh_install=True)

In [None]:
from johnsnowlabs import nlp, medical

# Automatically load license data and start a session with all jars user has access to
spark = nlp.start()

In [None]:
spark

In [None]:
from pyspark.sql import DataFrame
import pyspark.sql.functions as F
import pyspark.sql.types as T
import pyspark.sql as SQL
from pyspark import keyword_only

# 📍 **Pretrained Zero-Shot NER**


📚 `Pretrained Zero-shot Named Entity Recognition (NER)` makes it easy to identify specific entities in text without needing pre-labeled datasets. It uses advanced pre-trained language models to recognize entities in different fields and languages, saving time and effort.

📚 This method is flexible, letting you define your own entity labels instead of relying on a fixed set of examples. For the best results, it’s helpful to choose labels similar to the provided examples, as they guide the model’s understanding.

📚 In this notebook, you’ll see how to use `Pretrained Zero-shot NER` in Spark NLP to extract valuable information from your data quickly and easily.

- 🪅 **Pretrained Zero-Shot Named Entity Recognition Models**


| Index | Model | Predicted Entities |
|------:|:------|:-------------------|
| 1 | [zeroshot_ner_generic_large](https://nlp.johnsnowlabs.com/2024/11/28/zeroshot_ner_generic_large_en.html) | `AGE`,`DATE`,`DISEASE`,`DISORDER`,`DRUG`,`LOCATION`,`NAME`,`PHONE`,`RESULT`,`SYMPTOM`,`SYNDROME`,`TEST`,`TREATMENT` |
| 2 | [zeroshot_ner_generic_medium](https://nlp.johnsnowlabs.com/2024/11/28/zeroshot_ner_generic_medium_en.html) | `AGE`,`DATE`,`DISEASE`,`DISORDER`,`DRUG`,`LOCATION`,`NAME`,`PHONE`,`RESULT`,`SYMPTOM`,`SYNDROME`,`TEST`,`TREATMENT` |
| 3 | [zeroshot_ner_clinical_large](https://nlp.johnsnowlabs.com/2024/11/27/zeroshot_ner_clinical_large_en.html) | `PROBLEM`, `TREATMENT`, `TEST` |
| 4 | [zeroshot_ner_clinical_medium](https://nlp.johnsnowlabs.com/2024/11/27/zeroshot_ner_clinical_medium_en.html) | `PROBLEM`, `TREATMENT`, `TEST` |
| 5 | [zeroshot_ner_oncology_large](https://nlp.johnsnowlabs.com/2024/11/28/zeroshot_ner_oncology_large_en.html) | `Adenopathy`, `Age`, `Biomarker`, `Biomarker_Result`, `Body_Part`, `Cancer_Dx`, `Cancer_Surgery`, `Cycle_Count`, `Cycle_Day`, `Date`, `Death_Entity`, `Direction`, `Dosage`, `Duration`, `Frequency`, `Gender`, `Grade`, `Histological_Type`, `Imaging_Test`, `Invasion`, `Metastasis`, `Oncogene`, `Pathology_Test`, `Race_Ethnicity`, `Radiation_Dose`, `Relative_Date`, `Response_To_Treatment`, `Route`, `Smoking_Status`, `Staging`, `Therapy`, `Tumor_Finding`, `Tumor_Size` |
| 6 | [zeroshot_ner_oncology_medium](https://nlp.johnsnowlabs.com/2024/11/27/zeroshot_ner_oncology_medium_en.html) | `Adenopathy`, `Age`, `Biomarker`, `Biomarker_Result`, `Body_Part`, `Cancer_Dx`, `Cancer_Surgery`, `Cycle_Count`, `Cycle_Day`, `Date`, `Death_Entity`, `Direction`, `Dosage`, `Duration`, `Frequency`, `Gender`, `Grade`, `Histological_Type`, `Imaging_Test`, `Invasion`, `Metastasis`, `Oncogene`, `Pathology_Test`, `Race_Ethnicity`, `Radiation_Dose`, `Relative_Date`, `Response_To_Treatment`, `Route`, `Smoking_Status`, `Staging`, `Therapy`, `Tumor_Finding`, `Tumor_Size` |
| 7 | [zeroshot_ner_deid_subentity_docwise_large](https://nlp.johnsnowlabs.com/2024/11/29/zeroshot_ner_deid_subentity_docwise_large_en.html) | `DATE`, `PATIENT`, `COUNTRY`, `PROFESSION`, `AGE`, `CITY`, `STATE`, `DOCTOR`, `HOSPITAL`, `IDNUM`, `ORGANIZATION`, `PHONE`, `STREET`, `ZIP` |
| 8 | [zeroshot_ner_deid_subentity_docwise_medium](https://nlp.johnsnowlabs.com/2024/11/28/zeroshot_ner_deid_subentity_docwise_medium_en.html) | `DATE`, `PATIENT`, `COUNTRY`, `PROFESSION`, `AGE`, `CITY`, `STATE`, `DOCTOR`, `HOSPITAL`, `IDNUM`, `ORGANIZATION`, `PHONE`, `STREET`, `ZIP` |
| 9 | [zeroshot_ner_deid_subentity_merged_medium](https://nlp.johnsnowlabs.com/2024/11/27/zeroshot_ner_deid_subentity_merged_medium_en.html) | `DOCTOR`, `PATIENT`, `AGE`, `DATE`, `HOSPITAL`, `CITY`, `STREET`, `STATE`, `COUNTRY`, `PHONE`, `IDNUM`, `EMAIL`, `ZIP`, `ORGANIZATION`, `PROFESSION`, `USERNAME` |
| 10 | [zeroshot_ner_deid_generic_docwise_large](https://nlp.johnsnowlabs.com/2024/11/28/zeroshot_ner_deid_generic_docwise_large_en.html) | `DATE`, `NAME`, `LOCATION`, `PROFESSION`, `AGE`, `ID`, `CONTACT` |
| 11 | [zeroshot_ner_deid_generic_docwise_medium](https://nlp.johnsnowlabs.com/2024/11/28/zeroshot_ner_deid_generic_docwise_medium_en.html) | `DATE`, `NAME`, `LOCATION`, `PROFESSION`, `AGE`, `ID`, `CONTACT` |
| 12 | [zeroshot_ner_oncology_biomarker_large](https://nlp.johnsnowlabs.com/2024/12/13/zeroshot_ner_oncology_biomarker_large_en.html) | `Biomarker`, `Biomarker_Result` |
| 13 | [zeroshot_ner_oncology_biomarker_medium](https://nlp.johnsnowlabs.com/2024/12/13/zeroshot_ner_oncology_biomarker_medium_en.html) | `Biomarker`, `Biomarker_Result` |
| 14 | [zeroshot_ner_deid_generic_multi_large](https://nlp.johnsnowlabs.com/2024/12/21/zeroshot_ner_deid_generic_multi_large_xx.html) | `AGE`, `CONTACT`, `DATE`, `ID`, `LOCATION`, `NAME`, `PROFESSION` |
| 15 | [zeroshot_ner_deid_generic_multi_medium](https://nlp.johnsnowlabs.com/2024/12/21/zeroshot_ner_deid_generic_multi_medium_xx.html) | `AGE`, `CONTACT`, `DATE`, `ID`, `LOCATION`, `NAME`, `PROFESSION` |
| 16 | [zeroshot_ner_deid_subentity_merged_large](https://nlp.johnsnowlabs.com/2024/12/17/zeroshot_ner_deid_subentity_merged_large_en.html) | `DOCTOR`, `PATIENT`, `AGE`, `DATE`, `HOSPITAL`, `CITY`, `STREET`, `STATE`, `COUNTRY`, `PHONE`, `IDNUM`, `EMAIL`, `ZIP`, `ORGANIZATION`, `PROFESSION`, `USERNAME` |
| 17 | [zeroshot_ner_jsl_large](https://nlp.johnsnowlabs.com/2025/01/01/zeroshot_ner_jsl_large_en.html) | `Admission_Discharge`, `Age`, `Alcohol`, `Body_Part`, `Clinical_Dept`, `Direction`, `Disease_Syndrome_Disorder`, `Dosage_Strength`, `Drug`, `Duration`, `Employment`, `Form`, `Frequency`, `Gender`, `Injury_or_Poisoning`, `Medical_Device`, `Modifier`, `Oncological`, `Procedure`, `Race_Ethnicity`, `Relationship_Status`, `Route`, `Section_Header`, `Smoking`, `Symptom`, `Test`, `Test_Result`, `Treatment`, `Vaccine` |
| 18 | [zeroshot_ner_jsl_medium](https://nlp.johnsnowlabs.com/2025/01/01/zeroshot_ner_jsl_medium_en.html) | `Admission_Discharge`, `Age`, `Alcohol`, `Body_Part`, `Clinical_Dept`, `Direction`, `Disease_Syndrome_Disorder`, `Dosage_Strength`, `Drug`, `Duration`, `Employment`, `Form`, `Frequency`, `Gender`, `Injury_or_Poisoning`, `Medical_Device`, `Modifier`, `Oncological`, `Procedure`, `Race_Ethnicity`, `Relationship_Status`, `Route`, `Section_Header`, `Smoking`, `Symptom`, `Test`, `Test_Result`, `Treatment`, `Vaccine` |
| 19 | [zeroshot_ner_ade_clinical_large](https://nlp.johnsnowlabs.com/2024/12/02/zeroshot_ner_ade_clinical_large_en.html) | `DRUG`, `ADE`, `PROBLEM` |
| 20 | [zeroshot_ner_sdoh_medium](https://nlp.johnsnowlabs.com/2025/01/06/zeroshot_ner_sdoh_medium_en.html) | `Access_To_Care`, `Age`, `Alcohol`, `Childhood_Development`, `Diet`, `Disability`, `Eating_Disorder`, `Education`, `Employment`, `Environmental_Condition`, `Exercise`, `Family_Member`, `Financial_Status`, `Gender`, `Geographic_Entity`, `Healthcare_Institution`, `Housing`, `Hypertension`, `Income`, `Insurance_Status`, `Language`, `Legal_Issues`, `Marital_Status`, `Mental_Health`, `Obesity`, `Other_Disease`, `Other_SDoH_Keywords`, `Quality_Of_Life`, `Race_Ethnicity`, `Sexual_Activity`, `Sexual_Orientation`, `Smoking`, `Social_Exclusion`, `Social_Support`, `Spiritual_Beliefs`, `Transportation`, `Violence_Or_Abuse` |
| 21 | [zeroshot_ner_sdoh_large](https://nlp.johnsnowlabs.com/2024/12/02/zeroshot_ner_ade_clinical_large_en.html) | `DRUG`, `ADE`, `PROBLEM` |

In [None]:
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

labels = ['DOCTOR', 'PATIENT', 'AGE', 'DATE', 'HOSPITAL', 'CITY', 'STREET', 'STATE', 'COUNTRY', 'PHONE', 'IDNUM', 'EMAIL','ZIP', 'ORGANIZATION', 'PROFESSION', 'USERNAME']

pretrained_zero_shot_ner = medical.PretrainedZeroShotNER().pretrained("zeroshot_ner_deid_subentity_merged_medium", "en", "clinical/models")\
    .setInputCols("sentence", "token")\
    .setOutputCol("entities")\
    .setPredictionThreshold(0.5)\
    .setLabels(labels)

ner_converter = medical.NerConverterInternal()\
    .setInputCols("sentence", "token", "entities")\
    .setOutputCol("ner_chunk")


pipeline = nlp.Pipeline().setStages([
    document_assembler,
    sentence_detector,
    tokenizer,
    pretrained_zero_shot_ner,
    ner_converter
])

zeroshot_ner_deid_subentity_merged_medium download started this may take some time.
Approximate size to download 674 MB
[OK!]


In [None]:
text = """Dr. John Lee, from Royal Medical Clinic in Chicago,  attended to the patient on 11/05/2024.
The patient’s medical record number is 56467890. The patient, Emma Wilson, is 50 years old,  her Contact number: 444-456-7890 .
Dr. John Taylor, ID: 982345, a cardiologist at St. Mary's Hospital in Boston, was contacted on 05/10/2023 regarding a 45-year-old.
"""

data = spark.createDataFrame([[text]]).toDF("text")

result = pipeline.fit(data).transform(data)

In [None]:
result.select(F.explode(F.arrays_zip(result.ner_chunk.result,
                                     result.ner_chunk.begin,
                                     result.ner_chunk.end,
                                     result.ner_chunk.metadata)).alias("cols")) \
               .select( F.expr("cols['0']").alias("chunk"),
                       F.expr("cols['1']").alias("begin"),
                       F.expr("cols['2']").alias("end"),
                      F.expr("cols['3']['entity']").alias("ner_label"))\
                       .filter("ner_label!='O'")\
                       .show(1000,truncate=False)

+--------------------+-----+---+----------+
|chunk               |begin|end|ner_label |
+--------------------+-----+---+----------+
|John Lee            |4    |11 |DOCTOR    |
|Royal Medical Clinic|19   |38 |HOSPITAL  |
|Chicago             |43   |49 |CITY      |
|11/05/2024          |80   |89 |DATE      |
|56467890            |131  |138|IDNUM     |
|Emma Wilson         |154  |164|PATIENT   |
|50                  |170  |171|AGE       |
|444-456-7890        |205  |216|PHONE     |
|John Taylor         |224  |234|DOCTOR    |
|982345              |241  |246|IDNUM     |
|cardiologist        |251  |262|PROFESSION|
|St. Mary's Hospital |267  |285|HOSPITAL  |
|Boston              |290  |295|CITY      |
|05/10/2023          |315  |324|DATE      |
|45-year-old         |338  |348|AGE       |
+--------------------+-----+---+----------+
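
The `begin` and `end` columns are character offsets into the input text, and `end` is inclusive (for example, `John Lee` spans 4 to 11). The following is a minimal sketch of recovering a chunk from its offsets, using a shortened copy of the first sentence. `chunk_text` is a hypothetical helper introduced here for illustration, not part of Spark NLP.

```python
def chunk_text(text: str, begin: int, end: int) -> str:
    """Return the substring for an annotation whose `end` offset is inclusive."""
    return text[begin:end + 1]


# Shortened copy of the first sentence; offsets are taken from the table above.
sample = "Dr. John Lee, from Royal Medical Clinic in Chicago"
offsets = [(4, 11, "DOCTOR"), (19, 38, "HOSPITAL"), (43, 49, "CITY")]

for begin, end, label in offsets:
    print(f"{label:<8} {chunk_text(sample, begin, end)!r}")
```

Because `end` is inclusive, a plain Python slice needs `end + 1`; forgetting this drops the last character of every chunk.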



**You can customize the prediction labels**

It is important to highlight that users are not limited to these labels. You have the flexibility to define and use any labels that suit your specific use case. Simply provide the labels you need, and the model will adapt to predict them.


In [None]:
# You can change the labels, for example by grouping them: DOCTOR -> NAME, PATIENT -> NAME, ...
labels = ['NAME', 'AGE', 'DATE', 'LOCATION', 'IDNUM','ORGANIZATION', 'PROFESSION']

pretrained_zero_shot_ner = medical.PretrainedZeroShotNER()\
    .pretrained("zeroshot_ner_deid_subentity_merged_medium", "en", "clinical/models")\
    .setInputCols("sentence", "token")\
    .setOutputCol("ner")\
    .setPredictionThreshold(0.5)\
    .setLabels(labels)

ner_converter = medical.NerConverterInternal()\
    .setInputCols("sentence", "token", "ner")\
    .setOutputCol("ner_chunk")

pipeline = nlp.Pipeline().setStages([
    document_assembler,
    sentence_detector,
    tokenizer,
    pretrained_zero_shot_ner,
    ner_converter
])

result = pipeline.fit(data).transform(data)

result.selectExpr("explode(ner_chunk) as ner_chunk")\
      .selectExpr("ner_chunk.result as chunk",
                  "ner_chunk.begin",
                  "ner_chunk.end",
                  "ner_chunk.metadata['entity'] as ner_label").show(100, truncate=False)

zeroshot_ner_deid_subentity_merged_medium download started this may take some time.
Approximate size to download 674 MB
[OK!]
+--------------------+-----+---+------------+
|chunk               |begin|end|ner_label   |
+--------------------+-----+---+------------+
|John Lee            |4    |11 |NAME        |
|Royal Medical Clinic|19   |38 |ORGANIZATION|
|Chicago             |43   |49 |LOCATION    |
|11/05/2024          |80   |89 |DATE        |
|56467890            |131  |138|IDNUM       |
|Emma Wilson         |154  |164|NAME        |
|50                  |170  |171|AGE         |
|444-456-7890        |205  |216|IDNUM       |
|John Taylor         |224  |234|NAME        |
|982345              |241  |246|IDNUM       |
|cardiologist        |251  |262|PROFESSION  |
|St. Mary's Hospital |267  |285|ORGANIZATION|
|Boston              |290  |295|LOCATION    |
|05/10/2023          |315  |324|DATE        |
|45-year-old         |338  |348|AGE         |
+--------------------+-----+---+------------+
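
To see which custom labels go beyond the label set listed for a model in the table of pretrained models, a small comparison can help. This is only a sketch: `split_labels` is a hypothetical helper, and the documented list is copied by hand from the `zeroshot_ner_deid_subentity_merged_medium` row rather than read from the model.

```python
# Labels listed for zeroshot_ner_deid_subentity_merged_medium (copied from the table).
DOCUMENTED_LABELS = [
    "DOCTOR", "PATIENT", "AGE", "DATE", "HOSPITAL", "CITY", "STREET", "STATE",
    "COUNTRY", "PHONE", "IDNUM", "EMAIL", "ZIP", "ORGANIZATION", "PROFESSION",
    "USERNAME",
]

# Custom label set used in the customized-labels example.
custom_labels = ["NAME", "AGE", "DATE", "LOCATION", "IDNUM", "ORGANIZATION", "PROFESSION"]


def split_labels(custom, documented):
    """Split custom labels into those listed for the model and those that are new."""
    documented_set = set(documented)
    listed = [label for label in custom if label in documented_set]
    new = [label for label in custom if label not in documented_set]
    return listed, new


listed, new = split_labels(custom_labels, DOCUMENTED_LABELS)
print("listed for the model:", listed)
print("not listed (custom): ", new)
```

Here `NAME` and `LOCATION` are not in the documented list, yet the output above shows the model still predicting them, which is the flexibility this section describes.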


## 📍 **Light Pipeline**

In [None]:
empty_data = spark.createDataFrame([[""]]).toDF("text")
light_model = nlp.LightPipeline(pipeline.fit(empty_data))
light_result = light_model.fullAnnotate(text)

chunks = []
entities = []
sentence= []
begin = []
end = []

for n in light_result[0]['ner_chunk']:

    begin.append(n.begin)
    end.append(n.end)
    chunks.append(n.result)
    entities.append(n.metadata['entity'])
    sentence.append(n.metadata['sentence'])

import pandas as pd

df_clinical = pd.DataFrame({'chunks':chunks,
                            'begin': begin,
                            'end':end,
                            'sentence_id':sentence,
                            'entities':entities})

df_clinical.head(50)

Unnamed: 0,chunks,begin,end,sentence_id,entities
0,John Lee,4,11,0,NAME
1,Royal Medical Clinic,19,38,0,ORGANIZATION
2,Chicago,43,49,0,LOCATION
3,11/05/2024,80,89,0,DATE
4,56467890,131,138,1,IDNUM
5,Emma Wilson,154,164,2,NAME
6,50,170,171,2,AGE
7,444-456-7890,205,216,2,IDNUM
8,John Taylor,224,234,3,NAME
9,982345,241,246,3,IDNUM
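
The list-building loop above can also be folded into one helper that returns a row per chunk. The sketch below assumes only that each annotation exposes `begin`, `end`, `result`, and `metadata`, as the loop above reads them. `ChunkStub` and `chunks_to_rows` are hypothetical names, and the stub stands in for real annotations so the example runs without a Spark session.

```python
from dataclasses import dataclass, field


@dataclass
class ChunkStub:
    """Hypothetical stand-in exposing the attributes read from each ner_chunk annotation."""
    begin: int
    end: int
    result: str
    metadata: dict = field(default_factory=dict)


def chunks_to_rows(annotations):
    """Convert chunk annotations into a list of dicts, one per chunk."""
    return [
        {
            "chunks": a.result,
            "begin": a.begin,
            "end": a.end,
            "sentence_id": a.metadata["sentence"],
            "entities": a.metadata["entity"],
        }
        for a in annotations
    ]


# Two rows from the output above, rebuilt as stubs.
stub_chunks = [
    ChunkStub(4, 11, "John Lee", {"entity": "NAME", "sentence": "0"}),
    ChunkStub(43, 49, "Chicago", {"entity": "LOCATION", "sentence": "0"}),
]
rows = chunks_to_rows(stub_chunks)
print(rows[0])
```

In the notebook, the same helper could be applied as `pd.DataFrame(chunks_to_rows(light_result[0]['ner_chunk']))`.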


## 📍 **NER Visualizer**

To save the visualization as HTML, pass the `save_path` parameter to the `display` function.

In [None]:
visualiser = nlp.viz.NerVisualizer()

visualiser.display(light_result[0], label_col='ner_chunk', document_col='document',
                   #save_path="display_result.html"
                   )

# 📌 **Zero-Shot Clinical NER**

📚 In this part, you will find an example of a Zero-Shot NER model (`zero_shot_ner_roberta`), the first of its kind, which can detect named entities without any annotated dataset being used to train a model.

📚 `ZeroShotNerModel` annotator also allows extracting entities by crafting appropriate prompts to query **any RoBERTa Question Answering model**.


💥 You can check the model card here: [Models Hub](https://nlp.johnsnowlabs.com/2022/08/29/zero_shot_ner_roberta_en.html)






🩸 Now we will create a pipeline for the Zero-Shot NER model with only the `documentAssembler`, `sentenceDetector`, `tokenizer`, `zero_shot_ner` and `ner_converter` stages. As you can see, we don't use any embeddings model, because it is already included in the model.

🩸 The only thing you need to do is create meaningful definitions for the entities that you want to extract. For example, we want to detect `PROBLEM`, `DRUG`, `PATIENT_AGE` and `ADMISSION_DATE` entities, so we create a dictionary that maps each label we want to see in the result to the questions used to detect it. Then we provide this dictionary to the model with the `setEntityDefinitions` method.

In [None]:
documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

zero_shot_ner = medical.ZeroShotNerModel.pretrained("zero_shot_ner_roberta", "en", "clinical/models")\
    .setEntityDefinitions(
        {
            "PROBLEM": ["What is the disease?", "What is his symptom?", "What is her disease?", "What is his disease?",
                        "What is the problem?" ,"What does a patient suffer", 'What was the reason that the patient is admitted to the clinic?'],
            "DRUG": ["Which drug?", "Which is the drug?", "What is the drug?", "Which drug does he use?", "Which drug does she use?", "Which drug do I use?", "Which drug is prescribed for a symptom?"],
            "ADMISSION_DATE": ["When did patient admitted to a clinic?"],
            "PATIENT_AGE": ["How old is the patient?",'What is the gae of the patient?']
        })\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("zero_shot_ner")\
    .setPredictionThreshold(0.1) # default 0.01

ner_converter = medical.NerConverterInternal()\
    .setInputCols(["sentence", "token", "zero_shot_ner"])\
    .setOutputCol("ner_chunk")

pipeline = nlp.Pipeline(stages = [
    documentAssembler,
    sentenceDetector,
    tokenizer,
    zero_shot_ner,
    ner_converter])

zero_shot_ner_model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

zero_shot_ner_roberta download started this may take some time.
Approximate size to download 438.9 MB
[OK!]
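
Since the entity definitions are hand-written prompts, a quick sanity check before passing them to the model can catch empty lists or malformed questions. The sketch below is an assumption-laden illustration: `check_entity_definitions` is a hypothetical helper, and requiring a trailing `?` is a stylistic convention chosen here, not a documented requirement of `ZeroShotNerModel`.

```python
def check_entity_definitions(definitions):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    for label, questions in definitions.items():
        if not questions:
            problems.append(f"{label}: no questions defined")
        for question in questions:
            # Assumed convention: every prompt is phrased as a question ending in '?'.
            if not question.strip().endswith("?"):
                problems.append(f"{label}: question lacks '?': {question!r}")
    return problems


# Small illustrative dictionary in the same shape as the one given to setEntityDefinitions.
sample_definitions = {
    "DRUG": ["Which drug?", "What is the drug?"],
    "PROBLEM": ["What is the problem?", "What does a patient suffer"],
    "PATIENT_AGE": [],
}

problems = check_entity_definitions(sample_definitions)
for problem in problems:
    print(problem)
```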


In [None]:
zero_shot_ner.getClasses()

['DRUG', 'PATIENT_AGE', 'ADMISSION_DATE', 'PROBLEM']

In [None]:
zero_shot_ner.extractParamMap()

{Param(parent='ZeroShotRobertaNer_5d06c0297d21', name='engine', doc='Deep Learning engine used for this model'): 'tensorflow',
 Param(parent='ZeroShotRobertaNer_5d06c0297d21', name='ignoreEntities', doc='List of entities to ignore'): [],
 Param(parent='ZeroShotRobertaNer_5d06c0297d21', name='predictionThreshold', doc='Minimal confidence score to encode an entity (default is 0.1)'): 0.1,
 Param(parent='ZeroShotRobertaNer_5d06c0297d21', name='lazyAnnotator', doc='Whether this AnnotatorModel acts as lazy in RecursivePipelines'): False,
 Param(parent='ZeroShotRobertaNer_5d06c0297d21', name='batchSize', doc='Size of every batch'): 8,
 Param(parent='ZeroShotRobertaNer_5d06c0297d21', name='maxSentenceLength', doc='Max sentence length to process'): 512,
 Param(parent='ZeroShotRobertaNer_5d06c0297d21', name='caseSensitive', doc='whether to ignore case in tokens for embeddings matching'): True,
 Param(parent='ZeroShotRobertaNer_5d06c0297d21', name='inputCols', doc='previous annotations columns, 

In [None]:
zero_shot_ner.getPredictionThreshold()

0.1

In [None]:
text_list = ["The doctor pescribed Majezik for my severe headache.",
             "The patient was admitted to the hospital for his colon cancer.",
             "27 years old patient was admitted to clinic on Sep 1st by Dr. X for a right-sided pleural effusion for thoracentesis."
            ]

data = spark.createDataFrame(text_list, nlp.StringType()).toDF("text")

results = zero_shot_ner_model.transform(data)

In [None]:
results.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|       zero_shot_ner|           ner_chunk|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|The doctor pescri...|[{document, 0, 51...|[{document, 0, 51...|[{token, 0, 2, Th...|[{named_entity, 0...|[{chunk, 21, 27, ...|
|The patient was a...|[{document, 0, 61...|[{document, 0, 61...|[{token, 0, 2, Th...|[{named_entity, 0...|[{chunk, 49, 60, ...|
|27 years old pati...|[{document, 0, 11...|[{document, 0, 11...|[{token, 0, 1, 27...|[{named_entity, 0...|[{chunk, 0, 7, 27...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+



Let's check the NER model results.

In [None]:
results\
    .selectExpr("explode(zero_shot_ner) AS entity")\
    .select(
        "entity.metadata.word",
        "entity.result",
        "entity.metadata.sentence",
        "entity.begin",
        "entity.end",
        "entity.metadata.confidence",
        "entity.metadata.question")\
    .show(100, truncate=False)

+-------------+----------------+--------+-----+---+----------+---------------------------------------------------------------+
|word         |result          |sentence|begin|end|confidence|question                                                       |
+-------------+----------------+--------+-----+---+----------+---------------------------------------------------------------+
|The          |O               |0       |0    |2  |null      |null                                                           |
|doctor       |O               |0       |4    |9  |null      |null                                                           |
|pescribed    |O               |0       |11   |19 |null      |null                                                           |
|Majezik      |B-DRUG          |0       |21   |27 |0.6233715 |Which drug is prescribed for a symptom?                        |
|for          |O               |0       |29   |31 |null      |null                                             

In [None]:
results.select(F.explode(F.arrays_zip(results.token.result,
                                      results.zero_shot_ner.result,
                                      results.zero_shot_ner.metadata,
                                      results.zero_shot_ner.begin,
                                      results.zero_shot_ner.end)).alias("cols"))\
       .select(F.expr("cols['0']").alias("token"),
               F.expr("cols['1']").alias("ner_label"),
               F.expr("cols['2']['sentence']").alias("sentence"),
               F.expr("cols['3']").alias("begin"),
               F.expr("cols['4']").alias("end"),
               F.expr("cols['2']['confidence']").alias("confidence")).show(50, truncate=100)

+-------------+----------------+--------+-----+---+----------+
|        token|       ner_label|sentence|begin|end|confidence|
+-------------+----------------+--------+-----+---+----------+
|          The|               O|       0|    0|  2|      null|
|       doctor|               O|       0|    4|  9|      null|
|    pescribed|               O|       0|   11| 19|      null|
|      Majezik|          B-DRUG|       0|   21| 27| 0.6233715|
|          for|               O|       0|   29| 31|      null|
|           my|               O|       0|   33| 34|      null|
|       severe|       B-PROBLEM|       0|   36| 41|0.53198636|
|     headache|       I-PROBLEM|       0|   43| 50|0.53198636|
|            .|               O|       0|   51| 51|      null|
|          The|               O|       0|    0|  2|      null|
|      patient|               O|       0|    4| 10|      null|
|          was|               O|       0|   12| 14|      null|
|     admitted|               O|       0|   16| 23|    
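
The token-level labels above follow the IOB scheme: `B-` opens an entity, `I-` continues it, and `O` marks tokens outside any entity. The sketch below shows, in plain Python, how such tags can be merged into chunks. It is a simplified illustration of the idea behind the converter stage, not the implementation used by `NerConverterInternal`; `merge_iob` is a hypothetical helper that joins tokens with spaces instead of using character offsets.

```python
def merge_iob(tokens, tags):
    """Group consecutive B-/I- tagged tokens into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != label)
        if starts_new:
            if current:  # close the chunk that was open
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-"):
            current.append(token)  # continue the open chunk
        else:  # "O" closes any open chunk
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks


# First sentence of the example, with the tags shown in the table above.
tokens = ["The", "doctor", "pescribed", "Majezik", "for", "my", "severe", "headache", "."]
tags = ["O", "O", "O", "B-DRUG", "O", "O", "B-PROBLEM", "I-PROBLEM", "O"]

print(merge_iob(tokens, tags))
# [('Majezik', 'DRUG'), ('severe headache', 'PROBLEM')]
```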

Now we will check the NER chunks.

In [None]:
results.selectExpr("explode(ner_chunk)").show(100, truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|col                                                                                                                                                                  |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{chunk, 21, 27, Majezik, {entity -> DRUG, confidence -> 0.6233715, ner_source -> ner_chunk, chunk -> 0, sentence -> 0}, []}                                          |
|{chunk, 36, 50, severe headache, {entity -> PROBLEM, confidence -> 0.53198636, ner_source -> ner_chunk, chunk -> 1, sentence -> 0}, []}                              |
|{chunk, 49, 60, colon cancer, {entity -> PROBLEM, confidence -> 0.9406247, ner_source -> ner_chunk, chunk -> 0, sentence -> 0}, []}                            

In [None]:
results.select(F.explode(F.arrays_zip(results.ner_chunk.result,
                                      results.ner_chunk.metadata)).alias("cols"))\
       .select(F.expr("cols['0']").alias("chunk"),
               F.expr("cols['1']['entity']").alias("ner_label"),
               F.expr("cols['1']['confidence']").alias("confidence")).show(50, truncate=100)

+----------------------------------------------+--------------+----------+
|                                         chunk|     ner_label|confidence|
+----------------------------------------------+--------------+----------+
|                                       Majezik|          DRUG| 0.6233715|
|                               severe headache|       PROBLEM|0.53198636|
|                                  colon cancer|       PROBLEM| 0.9406247|
|                                      27 years|   PATIENT_AGE| 0.7028021|
|                                       Sep 1st|ADMISSION_DATE| 0.9757786|
|right-sided pleural effusion for thoracentesis|       PROBLEM|  0.582167|
+----------------------------------------------+--------------+----------+
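
The `confidence` column makes it possible to tighten results after the fact, without rerunning the model at a higher `setPredictionThreshold`. The following is a minimal sketch of such post-filtering in plain Python, with the rows copied from the table above. `keep_confident` and the cutoff of 0.6 are illustrative assumptions, not recommended settings.

```python
# (chunk, ner_label, confidence) rows copied from the table above.
scored_chunks = [
    ("Majezik", "DRUG", 0.6233715),
    ("severe headache", "PROBLEM", 0.53198636),
    ("colon cancer", "PROBLEM", 0.9406247),
    ("27 years", "PATIENT_AGE", 0.7028021),
    ("Sep 1st", "ADMISSION_DATE", 0.9757786),
    ("right-sided pleural effusion for thoracentesis", "PROBLEM", 0.582167),
]


def keep_confident(rows, threshold):
    """Keep only the rows whose confidence is at least `threshold`."""
    return [row for row in rows if row[2] >= threshold]


confident = keep_confident(scored_chunks, 0.6)
for chunk, label, confidence in confident:
    print(f"{label:<15} {confidence:.3f}  {chunk}")
```

With a cutoff of 0.6, the two lower-scoring `PROBLEM` chunks are dropped and four chunks remain.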



## 📍 **LightPipelines**

In [None]:
import pandas as pd

# fullAnnotate in LightPipeline
print (text_list[-1], "\n")

light_model = nlp.LightPipeline(zero_shot_ner_model)
light_result = light_model.fullAnnotate(text_list[-1])

chunks = []
entities = []
sentence= []
begin = []
end = []

for n in light_result[0]['ner_chunk']:

    begin.append(n.begin)
    end.append(n.end)
    chunks.append(n.result)
    entities.append(n.metadata['entity'])
    sentence.append(n.metadata['sentence'])



df_clinical = pd.DataFrame({'chunks':chunks, 'begin': begin, 'end':end,
                   'sentence_id':sentence, 'entities':entities})

df_clinical.head(20)

27 years old patient was admitted to clinic on Sep 1st by Dr. X for a right-sided pleural effusion for thoracentesis. 



Unnamed: 0,chunks,begin,end,sentence_id,entities
0,27 years,0,7,0,PATIENT_AGE
1,Sep 1st,47,53,0,ADMISSION_DATE
2,right-sided pleural effusion for thoracentesis,70,115,0,PROBLEM


In [None]:
light_result[0]

{'zero_shot_ner': [Annotation(named_entity, 0, 1, B-PATIENT_AGE, {'sentence': '0', 'word': '27', 'confidence': '0.7028021', 'question': 'How old is the patient?'}, []),
  Annotation(named_entity, 3, 7, I-PATIENT_AGE, {'sentence': '0', 'word': 'years', 'confidence': '0.7028021', 'question': 'How old is the patient?'}, []),
  Annotation(named_entity, 9, 11, O, {'sentence': '0', 'word': 'old'}, []),
  Annotation(named_entity, 13, 19, O, {'sentence': '0', 'word': 'patient'}, []),
  Annotation(named_entity, 21, 23, O, {'sentence': '0', 'word': 'was'}, []),
  Annotation(named_entity, 25, 32, O, {'sentence': '0', 'word': 'admitted'}, []),
  Annotation(named_entity, 34, 35, O, {'sentence': '0', 'word': 'to'}, []),
  Annotation(named_entity, 37, 42, O, {'sentence': '0', 'word': 'clinic'}, []),
  Annotation(named_entity, 44, 45, O, {'sentence': '0', 'word': 'on'}, []),
  Annotation(named_entity, 47, 49, B-ADMISSION_DATE, {'sentence': '0', 'word': 'Sep', 'confidence': '0.9757786', 'question': 'Wh

## 📍 **NER Visualizer**

To save the visualization as HTML, pass the `save_path` parameter to the `display` function.

In [None]:
visualiser = nlp.viz.NerVisualizer()

for i in text_list:

    light_result = light_model.fullAnnotate(i)
    visualiser.display(light_result[0], label_col='ner_chunk', document_col='document')

    # Change color of an entity label
    # visualiser.set_label_colors({'PROBLEM':'#008080', 'DRUG':'#800080', 'PATIENT_AGE':'#808080'})
    # visualiser.display(light_result[0], label_col='ner_chunk')


    # Set label filter
    # visualiser.display(light_result[0], label_col='ner_chunk', document_col='document',labels=['PROBLEM'])

## 🧨 **Save the Model and Load from Disk**

Now we will save the Zero-Shot NER model. The saved model keeps the entity definitions, so it can be loaded and used without providing them again, and it predicts the same labels that we defined before.

In [None]:
# save model

zero_shot_ner.write().overwrite().save("zero_shot_ner_model")

In [None]:
# load from disk and create a new pipeline

zero_shot_ner_local = medical.ZeroShotNerModel.load("zero_shot_ner_model")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("zero_shot_ner")

ner_converter_local = medical.NerConverterInternal()\
    .setInputCols(["sentence", "token", "zero_shot_ner"])\
    .setOutputCol("ner_chunk")

pipeline_local = nlp.Pipeline(stages = [
    documentAssembler,
    sentenceDetector,
    tokenizer,
    zero_shot_ner_local,
    ner_converter_local])

zero_shot_ner_model_local = pipeline_local.fit(spark.createDataFrame([[""]]).toDF("text"))

In [None]:
# check the results

local_results = zero_shot_ner_model_local.transform(data)

local_results.select(F.explode(F.arrays_zip(local_results.ner_chunk.result,
                                            local_results.ner_chunk.metadata)).alias("cols"))\
             .select(F.expr("cols['0']").alias("chunk"),
                     F.expr("cols['1']['entity']").alias("ner_label"),
                     F.expr("cols['1']['confidence']").alias("confidence")).show(50, truncate=100)

+----------------------------------------------+--------------+----------+
|                                         chunk|     ner_label|confidence|
+----------------------------------------------+--------------+----------+
|                                       Majezik|          DRUG| 0.6233715|
|                               severe headache|       PROBLEM|0.53198636|
|                                  colon cancer|       PROBLEM| 0.9406247|
|                                      27 years|   PATIENT_AGE| 0.7028021|
|                                       Sep 1st|ADMISSION_DATE| 0.9757786|
|right-sided pleural effusion for thoracentesis|       PROBLEM|  0.582167|
+----------------------------------------------+--------------+----------+



# 📌 **NER Question Generator**

The `NerQuestionGenerator` annotator helps you build questions on the fly using two entities from different labels (preferably a subject and a verb). For example, suppose you have an NER model that can detect `PATIENT` and `ADMISSION` in the following text:

`John Smith was admitted Sep 3rd to Mayo Clinic`
- PATIENT: `John Smith`
- ADMISSION: `was admitted`

You can add the following annotator to construct questions using PATIENT and ADMISSION:

```python
# setEntities1 says which entity from NER goes first in the question
# setEntities2 says which entity from NER goes second in the question
# setQuestionMark to True adds a '?' at the end of the sentence (after entity 2)
# To sum up, the pattern is     [QUESTIONPRONOUN] [ENTITY1] [ENTITY2] [QUESTIONMARK]

qagenerator = medical.NerQuestionGenerator()\
  .setInputCols(["ner_chunk"])\
  .setOutputCol("question")\
  .setQuestionMark(True)\
  .setQuestionPronoun("When")\
  .setStrategyType("Paired")\
  .setEntities1(["PATIENT"])\
  .setEntities2(["ADMISSION"])
```
In the column `question` you will find: `When John Smith was admitted?`. Likewise, you could use `Where` or any other question pronoun you may need.

You can use those questions in a QuestionAnsweringModel or ZeroShotNER (any model which requires a question as an input). Let's see the case of QA.

```python
qa = nlp.BertForQuestionAnswering.pretrained("bert_qa_spanbert_finetuned_squadv1","en") \
  .setInputCols(["question", "document"]) \
  .setOutputCol("answer") \
  .setCaseSensitive(True)
```
The result will be:

```bash
+--------------------------------------------------------+-----------------------------+
|question                                                |answer                       |
+--------------------------------------------------------+-----------------------------+
|[{document, 0, 25, When John Smith was admitted ? ...}] |[{chunk, 0, 8, Sep 3rd ...}] |
+--------------------------------------------------------+-----------------------------+
```
Strategies:
- Paired: The first chunk of Entity 1 is grouped with the first chunk of Entity 2, the second with the second, the third with the third, and so on (one-vs-one).
- Combined: A more flexible strategy for cases where the number of chunks in Entity 1 is not aligned with the number of chunks in Entity 2. The first chunk from Entity 1 is grouped with all chunks in Entity 2, the second chunk in Entity 1 is again grouped with all chunks in Entity 2, and so on (one-vs-all).
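
The difference between the two strategies can be sketched in plain Python: pairing corresponds to `zip`, combining to a Cartesian product. This is an illustration of the grouping logic only, under the pattern `[QUESTIONPRONOUN] [ENTITY1] [ENTITY2] [QUESTIONMARK]` described above. `build_questions` is a hypothetical function, not the annotator's implementation, and its handling of unequal list lengths in the paired case (extra chunks are dropped by `zip`) is an assumption of this sketch.

```python
from itertools import product


def build_questions(pronoun, entities1, entities2, strategy="Paired", question_mark=True):
    """Build questions from two lists of chunks using a one-vs-one or one-vs-all grouping."""
    if strategy == "Paired":
        pairs = zip(entities1, entities2)       # first with first, second with second, ...
    elif strategy == "Combined":
        pairs = product(entities1, entities2)   # every chunk of list 1 with every chunk of list 2
    else:
        raise ValueError(f"Unknown strategy: {strategy!r}")
    suffix = "?" if question_mark else ""
    return [f"{pronoun} {e1} {e2}{suffix}" for e1, e2 in pairs]


patients = ["John Smith", "Jane Doe"]   # "Jane Doe" is an invented second chunk
admissions = ["was admitted"]

print(build_questions("When", patients, admissions, strategy="Paired"))
# ['When John Smith was admitted?']
print(build_questions("When", patients, admissions, strategy="Combined"))
# ['When John Smith was admitted?', 'When Jane Doe was admitted?']
```

With two `PATIENT` chunks and a single `ADMISSION` chunk, the paired grouping in this sketch yields one question while the combined grouping yields one per patient, which is why the combined strategy suits unaligned chunk counts.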