![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# Zero-Shot Named Entity Recognition in Spark NLP

This notebook walks through an example of the Zero-Shot NER model (`zero_shot_ner_roberta`), the first of its kind, which can detect named entities without an annotated dataset for training.

The `ZeroShotNerModel` annotator also lets you extract entities by crafting appropriate prompts to query **any RoBERTa Question Answering model**.


You can check the model card here: [Models Hub](https://nlp.johnsnowlabs.com/2022/08/29/zero_shot_ner_roberta_en.html)

In [0]:
import sparknlp
import sparknlp_jsl
import pandas as pd
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from pyspark.sql import functions as F
from pyspark.ml import Pipeline,PipelineModel
from pyspark.sql.types import StringType

print("Spark NLP Version :", sparknlp.version())
print("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark

## Zero-Shot Clinical NER Pipeline

Now we will create a Zero-Shot NER pipeline with only the `documentAssembler`, `sentenceDetector`, `tokenizer`, `zero_shot_ner` and `ner_converter` stages. No separate embeddings stage is needed, because the embeddings are already included in the model.

The only thing you need to do is write meaningful definitions for the entities you want to extract. For example, to detect `PROBLEM`, `DRUG`, `PATIENT_AGE` and `ADMISSION_DATE` entities, we create a dictionary that maps each label we want to see in the result to a list of questions for detecting it. We then pass this dictionary to the model with `setEntityDefinitions`.
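
Because the definitions are hand-written prompts, a quick sanity check before passing them to the model can catch slips early. The sketch below is plain Python and does not call Spark NLP. `check_entity_definitions` is a hypothetical helper, and its rules (a non-empty question list, a trailing `?`, no duplicates) are assumptions about what makes a reasonable prompt, not requirements of `ZeroShotNerModel`.

```python
from typing import Dict, List


def check_entity_definitions(definitions: Dict[str, List[str]]) -> List[str]:
    """Return human-readable warnings; an empty list means nothing was flagged."""
    warnings = []
    for label, questions in definitions.items():
        if not questions:
            warnings.append(f"{label}: no questions defined")
        seen = set()
        for question in questions:
            normalized = question.strip().lower()
            # Heuristic only: prompts are expected to be phrased as questions.
            if not normalized.endswith("?"):
                warnings.append(f"{label}: question does not end with '?': {question!r}")
            if normalized in seen:
                warnings.append(f"{label}: duplicate question: {question!r}")
            seen.add(normalized)
    return warnings


# A small subset of the definitions used in this notebook.
definitions = {
    "DRUG": ["Which drug?", "What is the drug?"],
    "PROBLEM": ["What is the disease?", "What does a patient suffer"],
}

for warning in check_entity_definitions(definitions):
    print(warning)
# prints: PROBLEM: question does not end with '?': 'What does a patient suffer'
```

A flagged prompt is not necessarily wrong; the check only points at entries worth a second look.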

In [0]:
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")
    
zero_shot_ner = ZeroShotNerModel.pretrained("zero_shot_ner_roberta", "en", "clinical/models")\
    .setEntityDefinitions(
        {
            "PROBLEM": ["What is the disease?", "What is his symptom?", "What is her disease?", "What is his disease?", 
                        "What is the problem?" ,"What does a patient suffer", 'What was the reason that the patient is admitted to the clinic?'],
            "DRUG": ["Which drug?", "Which is the drug?", "What is the drug?", "Which drug does he use?", "Which drug does she use?", "Which drug do I use?", "Which drug is prescribed for a symptom?"],
            "ADMISSION_DATE": ["When did patient admitted to a clinic?"],
            "PATIENT_AGE": ["How old is the patient?",'What is the age of the patient?']
        })\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("zero_shot_ner")\
    .setPredictionThreshold(0.1) # default 0.01

ner_converter = NerConverterInternal()\
    .setInputCols(["sentence", "token", "zero_shot_ner"])\
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages = [
    documentAssembler, 
    sentenceDetector, 
    tokenizer, 
    zero_shot_ner, 
    ner_converter])

zero_shot_ner_model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

In [0]:
zero_shot_ner.extractParamMap()

In [0]:
zero_shot_ner.getClasses()

In [0]:
zero_shot_ner.getPredictionThreshold()
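
To illustrate what raising the prediction threshold from the default `0.01` to `0.1` does, here is a minimal plain-Python sketch of confidence-based filtering. The `Candidate` type, the `filter_by_threshold` helper, and the scores are all made up for illustration. Whether the library compares with `>=` or `>` is an assumption here, not something this notebook confirms.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    label: str
    text: str
    confidence: float


def filter_by_threshold(candidates: List[Candidate], threshold: float) -> List[Candidate]:
    """Keep only candidates whose confidence is at least `threshold`."""
    return [c for c in candidates if c.confidence >= threshold]


# Made-up scores for illustration only; these are not model output.
candidates = [
    Candidate("DRUG", "Majezik", 0.65),
    Candidate("PROBLEM", "severe headache", 0.32),
    Candidate("PATIENT_AGE", "doctor", 0.04),
]

kept_default = filter_by_threshold(candidates, 0.01)  # keeps all three
kept_strict = filter_by_threshold(candidates, 0.1)    # drops the 0.04 candidate

print([c.text for c in kept_strict])
# prints: ['Majezik', 'severe headache']
```

A higher threshold trades recall for precision: fewer low-confidence answers survive, at the risk of dropping some correct ones.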

In [0]:
text_list = ["The doctor prescribed Majezik for my severe headache.",
             "The patient was admitted to the hospital for his colon cancer.",
             "27 years old patient was admitted to clinic on Sep 1st by Dr. X for a right-sided pleural effusion for thoracentesis."
            ]

data = spark.createDataFrame(text_list, StringType()).toDF("text")

results = zero_shot_ner_model.transform(data)

In [0]:
results.show()

Let's check the NER model results.

In [0]:
results.selectExpr("explode(zero_shot_ner) AS entity")\
       .select(
           "entity.metadata.word",    
           "entity.result",    
           "entity.metadata.sentence",
           "entity.begin",
           "entity.end",
           "entity.metadata.confidence",
           "entity.metadata.question")\
       .show(100, truncate=False)

In [0]:
results.select(F.explode(F.arrays_zip(results.token.result,
                                      results.zero_shot_ner.result, 
                                      results.zero_shot_ner.metadata,
                                      results.zero_shot_ner.begin, 
                                      results.zero_shot_ner.end)).alias("cols"))\
       .select(F.expr("cols['0']").alias("token"),
               F.expr("cols['1']").alias("ner_label"),
               F.expr("cols['2']['sentence']").alias("sentence"),
               F.expr("cols['3']").alias("begin"),
               F.expr("cols['4']").alias("end"),
               F.expr("cols['2']['confidence']").alias("confidence")).show(50, truncate=100)

Now we will check the NER chunks.

In [0]:
results.selectExpr("explode(ner_chunk)").show(100, truncate=False)

In [0]:
results.select(F.explode(F.arrays_zip(results.ner_chunk.result,
                                      results.ner_chunk.metadata)).alias("cols"))\
       .select(F.expr("cols['0']").alias("chunk"),
               F.expr("cols['1']['entity']").alias("ner_label"),
               F.expr("cols['1']['confidence']").alias("confidence")).show(50, truncate=100)
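
The converter stage turns token-level tags into the chunks shown above. Below is a minimal plain-Python sketch of that grouping, assuming the tags follow the IOB scheme (`B-`, `I-`, `O`). `iob_to_chunks` is a hypothetical helper and the tags are written by hand; it illustrates the idea and is not the implementation of `NerConverterInternal`.

```python
from typing import List, Optional, Tuple


def iob_to_chunks(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group IOB-tagged tokens into (chunk_text, label) pairs."""
    chunks: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label: Optional[str] = None

    for token, tag in zip(tokens, tags):
        # "B-DRUG" -> ("B", "DRUG"); "O" -> ("O", "")
        prefix, _, label = tag.partition("-")
        continues = prefix == "I" and label == current_label and bool(current_tokens)
        if continues:
            current_tokens.append(token)
            continue
        # Close the chunk in progress, if any.
        if current_tokens:
            chunks.append((" ".join(current_tokens), current_label))
        current_tokens, current_label = [], None
        # Start a new chunk on "B-", or on an "I-" that has nothing to continue.
        if prefix in ("B", "I"):
            current_tokens, current_label = [token], label

    if current_tokens:
        chunks.append((" ".join(current_tokens), current_label))
    return chunks


# Hand-written tags for part of the third example sentence (not model output).
tokens = ["27", "years", "old", "patient", "was", "admitted", "to", "clinic", "on", "Sep", "1st"]
tags = ["B-PATIENT_AGE", "I-PATIENT_AGE", "I-PATIENT_AGE", "O", "O", "O", "O", "O", "O",
        "B-ADMISSION_DATE", "I-ADMISSION_DATE"]

print(iob_to_chunks(tokens, tags))
# prints: [('27 years old', 'PATIENT_AGE'), ('Sep 1st', 'ADMISSION_DATE')]
```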

## LightPipelines

In [0]:
# fullAnnotate in LightPipeline
print(text_list[-1], "\n")

light_model = LightPipeline(zero_shot_ner_model)
light_result = light_model.fullAnnotate(text_list[-1])

chunks = []
entities = []
sentence = []
begin = []
end = []

# Collect one row per detected chunk
for n in light_result[0]['ner_chunk']:
    begin.append(n.begin)
    end.append(n.end)
    chunks.append(n.result)
    entities.append(n.metadata['entity'])
    sentence.append(n.metadata['sentence'])

df_clinical = pd.DataFrame({'chunks': chunks, 'begin': begin, 'end': end,
                            'sentence_id': sentence, 'entities': entities})

df_clinical.head(20)

| | chunks | begin | end | sentence_id | entities |
|---|---|---|---|---|---|
| 0 | 27 years old | 0 | 11 | 0 | PATIENT_AGE |
| 1 | Sep 1st | 47 | 53 | 0 | ADMISSION_DATE |
| 2 | a right-sided pleural effusion for thoracentesis | 68 | 115 | 0 | PROBLEM |


In [0]:
light_result[0]

### NER Visualizer

To save the visualization result as HTML, pass the `save_path` parameter to the `display` function.

In [0]:
from sparknlp_display import NerVisualizer

visualiser = NerVisualizer()

for i in text_list:

    light_result = light_model.fullAnnotate(i)
    ner_vis = visualiser.display(light_result[0], label_col='ner_chunk', document_col='document', return_html=True)

    # Change color of an entity label
    # visualiser.set_label_colors({'PROBLEM':'#008080', 'DRUG':'#800080', 'PATIENT_AGE':'#808080'})
    # ner_vis = visualiser.display(light_result[0], label_col='ner_chunk', return_html=True)


    # Set label filter
    # ner_vis = visualiser.display(light_result[0], label_col='ner_chunk', document_col='document',labels=['PROBLEM'], return_html=True)
    
    displayHTML(ner_vis)

# Save the Model and Load from Disk

Now we will save the Zero-Shot NER model. The saved model keeps its entity definitions, so after loading it we can use it without providing them again, and it will have the same labels that we defined before.

In [0]:
# save model

zero_shot_ner.write().overwrite().save("dbfs:/databricks/driver/zero_shot_ner_model")

In [0]:
# load from disk and create a new pipeline

zero_shot_ner_local = ZeroShotNerModel.load("dbfs:/databricks/driver/zero_shot_ner_model")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("zero_shot_ner")

ner_converter_local = sparknlp.annotators.NerConverter()\
    .setInputCols(["sentence", "token", "zero_shot_ner"])\
    .setOutputCol("ner_chunk")

pipeline_local = Pipeline(stages = [
    documentAssembler, 
    sentenceDetector, 
    tokenizer, 
    zero_shot_ner_local, 
    ner_converter_local])

zero_shot_ner_model_local = pipeline_local.fit(spark.createDataFrame([[""]]).toDF("text"))

In [0]:
zero_shot_ner_local.getClasses()

In [0]:
# check the results

local_results = zero_shot_ner_model_local.transform(data)

local_results.select(F.explode(F.arrays_zip(local_results.ner_chunk.result,
                                            local_results.ner_chunk.metadata)).alias("cols"))\
             .select(F.expr("cols['0']").alias("chunk"),
                     F.expr("cols['1']['entity']").alias("ner_label"),
                     F.expr("cols['1']['confidence']").alias("confidence")).show(50, truncate=100)

# NER Question Generator

The `NerQuestionGenerator` annotator helps you build questions on the fly using 2 entities from different labels (preferably a subject and a verb). For example, suppose you have an NER model that can detect `PATIENT` and `ADMISSION` in the following text:

`John Smith was admitted Sep 3rd to Mayo Clinic`
- PATIENT: `John Smith`
- ADMISSION: `was admitted`

You can add the following annotator to construct questions using PATIENT and ADMISSION:

```python
# setEntities1 says which entity from NER goes first in the question
# setEntities2 says which entity from NER goes second in the question
# setQuestionMark to True adds a '?' at the end of the sentence (after entity 2)
# To sum up, the pattern is     [QUESTIONPRONOUN] [ENTITY1] [ENTITY2] [QUESTIONMARK]

qagenerator = NerQuestionGenerator()\
  .setInputCols(["ner_chunk"])\
  .setOutputCol("question")\
  .setQuestionMark(True)\
  .setQuestionPronoun("When")\
  .setStrategyType("Paired")\
  .setEntities1(["PATIENT"])\
  .setEntities2(["ADMISSION"])
```
In the column `question` you will find: `When John Smith was admitted?`. Likewise, you could use `Where` or any other question pronoun you need.

You can use those questions in a QuestionAnsweringModel or ZeroShotNER (any model that requires a question as input). Let's look at the QA case.

```python
qa = BertForQuestionAnswering.pretrained("bert_qa_spanbert_finetuned_squadv1","en") \
  .setInputCols(["question", "document"]) \
  .setOutputCol("answer") \
  .setCaseSensitive(True)
```
The result will be:

```bash
+--------------------------------------------------------+-----------------------------+
|question                                                |answer                       |
+--------------------------------------------------------+-----------------------------+
|[{document, 0, 25, When John Smith was admitted ? ...}] |[{chunk, 0, 8, Sep 3rd ...}] |
+--------------------------------------------------------+-----------------------------+
```
Strategies:
- Paired: The first chunk of Entity 1 is grouped with the first chunk of Entity 2, the second with the second, the third with the third, and so on (one-vs-one).
- Combined: A more flexible strategy for cases where the number of chunks in Entity 1 does not match the number of chunks in Entity 2. The first chunk from Entity 1 is grouped with all chunks in Entity 2, the second chunk from Entity 1 is again grouped with all chunks in Entity 2, and so on (one-vs-all).
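
The difference between the two strategies can be sketched in plain Python, without Spark NLP. `build_question`, `paired` and `combined` are hypothetical helpers that follow the `[QUESTIONPRONOUN] [ENTITY1] [ENTITY2] [QUESTIONMARK]` pattern described above, and `Jane Doe` is a made-up second patient. The exact spacing around the question mark, and what the annotator does in Paired mode when the chunk counts differ, are assumptions here.

```python
from itertools import product
from typing import List


def build_question(pronoun: str, entity1: str, entity2: str, question_mark: bool = True) -> str:
    """Assemble '[pronoun] [entity1] [entity2]' with an optional trailing '?'."""
    question = f"{pronoun} {entity1} {entity2}"
    return question + "?" if question_mark else question


def paired(pronoun: str, chunks1: List[str], chunks2: List[str]) -> List[str]:
    """One-vs-one: i-th chunk of Entity 1 with i-th chunk of Entity 2.

    Python's zip stops at the shorter list, so unmatched chunks are dropped here.
    """
    return [build_question(pronoun, a, b) for a, b in zip(chunks1, chunks2)]


def combined(pronoun: str, chunks1: List[str], chunks2: List[str]) -> List[str]:
    """One-vs-all: every chunk of Entity 1 with every chunk of Entity 2."""
    return [build_question(pronoun, a, b) for a, b in product(chunks1, chunks2)]


patients = ["John Smith", "Jane Doe"]  # two PATIENT chunks (second one is hypothetical)
admissions = ["was admitted"]          # one ADMISSION chunk

print(paired("When", patients, admissions))
# prints: ['When John Smith was admitted?']
print(combined("When", patients, admissions))
# prints: ['When John Smith was admitted?', 'When Jane Doe was admitted?']
```

With mismatched counts, the paired sketch silently loses `Jane Doe`, which is the situation the Combined strategy is meant to handle.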