![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/open-source-nlp/04.4.ZeroShot_NER.ipynb)

# Zero-Shot for Named Entity Recognition

In this notebook, you will use the ZeroShotNerModel annotator to define simple questions whose answers are mapped to NER labels such as PERSON and NORP.

## Colab Setup

In [None]:
! pip install -q pyspark==3.3.0 spark-nlp==5.0.0

In [None]:
import sparknlp

from sparknlp.base import *
from sparknlp.annotator import *

from pyspark.ml import Pipeline, PipelineModel
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
import pandas as pd

spark = sparknlp.start()

print("Spark NLP version: ", sparknlp.version())
print("Apache Spark version: ", spark.version)

spark

Spark NLP version:  5.0.0
Apache Spark version:  3.3.0


## Zero-Shot NER Pipeline

ZeroShotNerModel implements zero-shot named entity recognition by using RoBERTa transformer models fine-tuned on a question answering task.

Its input is a list of document annotations, and it automatically generates questions that are used to recognize entities. Entity definitions are given as a dictionary that specifies a set of questions for each entity label. The model is based on RoBertaForQuestionAnswering.
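To make the dictionary structure concrete, the following plain-Python sketch flattens an entity-definition dictionary into (question, label) pairs, which is roughly the set of prompts a question-answering tagger would have to ask of each sentence. The helper name `expand_questions` is hypothetical and is not part of Spark NLP; the annotator's internal logic may differ.

```python
from typing import Dict, List, Tuple


def expand_questions(entity_definitions: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Flatten {label: [questions]} into a list of (question, label) pairs.

    Hypothetical helper for illustration only; not a Spark NLP function.
    """
    pairs = []
    for label, questions in entity_definitions.items():
        for question in questions:
            pairs.append((question, label))
    return pairs


# Same definitions as the ones passed to setEntityDefinitions in this notebook
entity_definitions = {
    "NAME": ["What is his name?", "What is my name?", "What is her name?"],
    "CITY": ["Which city?", "Which is the city?"],
}

question_label_pairs = expand_questions(entity_definitions)
for question, label in question_label_pairs:
    print(f"{label}: {question}")
```

Each label can carry several phrasings of the same question, so adding a phrasing is a one-line change to the dictionary.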

In [None]:
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

zero_shot_ner = ZeroShotNerModel.pretrained() \
    .setEntityDefinitions(
        {
            "NAME": ["What is his name?", "What is my name?", "What is her name?"],
            "CITY": ["Which city?", "Which is the city?"]
        })\
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("zero_shot_ner")

ner_converter = NerConverter()\
    .setInputCols(["sentence", "token", "zero_shot_ner"])\
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages = [
    document_assembler,
    sentence_detector,
    tokenizer,
    zero_shot_ner,
    ner_converter])

zero_shot_ner_model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

zero_shot_ner_roberta download started this may take some time.
Approximate size to download 442.3 MB
[OK!]


In [None]:
zero_shot_ner.getClasses()

['CITY', 'NAME']

In [None]:
zero_shot_ner.extractParamMap()

{Param(parent='ZeroShotNerModel_90a37731927d', name='ignoreEntities', doc='List of entities to ignore'): [],
 Param(parent='ZeroShotNerModel_90a37731927d', name='lazyAnnotator', doc='Whether this AnnotatorModel acts as lazy in RecursivePipelines'): False,
 Param(parent='ZeroShotNerModel_90a37731927d', name='batchSize', doc='Size of every batch'): 8,
 Param(parent='ZeroShotNerModel_90a37731927d', name='maxSentenceLength', doc='Max sentence length to process'): 512,
 Param(parent='ZeroShotNerModel_90a37731927d', name='caseSensitive', doc='whether to ignore case in tokens for embeddings matching'): True,
 Param(parent='ZeroShotNerModel_90a37731927d', name='predictionThreshold', doc='Minimal confidence score to encode an entity (default is 0.1)'): 0.1,
 Param(parent='ZeroShotNerModel_90a37731927d', name='engine', doc='Deep Learning engine used for this model'): 'tensorflow',
 Param(parent='ZeroShotNerModel_90a37731927d', name='inputCols', doc='previous annotations columns, if renamed'): ['

In [None]:
zero_shot_ner.getPredictionThreshold()

0.1
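The threshold is the minimal confidence an answer needs before it is encoded as an entity. As a rough illustration of its effect, the sketch below applies a stricter cutoff as a post-processing step to the chunk confidences this notebook reports for the example sentence. `filter_by_confidence` is a hypothetical helper, not a Spark NLP function, and filtering after the fact only approximates what the annotator does internally.

```python
# Chunk confidences as reported for the example sentence in this notebook
chunks = [
    {"chunk": "Clara", "label": "NAME", "confidence": 0.93601274},
    {"chunk": "New York", "label": "CITY", "confidence": 0.83294815},
    {"chunk": "Hellen", "label": "NAME", "confidence": 0.4536752},
    {"chunk": "Paris", "label": "CITY", "confidence": 0.53289855},
]


def filter_by_confidence(chunks, threshold):
    """Keep only chunks whose confidence is at least `threshold`.

    Hypothetical post-processing helper, for illustration only.
    """
    return [c for c in chunks if c["confidence"] >= threshold]


kept = filter_by_confidence(chunks, threshold=0.5)
print([c["chunk"] for c in kept])  # ['Clara', 'New York', 'Paris']
```

With the default of 0.1 all four chunks pass; at 0.5 the lowest-confidence chunk, Hellen, would be dropped.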

In [None]:
from pyspark.sql.types import StringType

text_list = ["My name is Clara, I live in New York and Hellen lives in Paris."]

data = spark.createDataFrame(text_list, StringType()).toDF("text")

results = zero_shot_ner_model.transform(data)

In [None]:
results.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|       zero_shot_ner|           ner_chunk|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|My name is Clara,...|[{document, 0, 62...|[{document, 0, 62...|[{token, 0, 1, My...|[{named_entity, 0...|[{chunk, 11, 15, ...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+



In [None]:
results.selectExpr("document", "explode(zero_shot_ner) AS entity") \
    .select(
        "document.result",
        "entity.result",
        "entity.metadata.word",
        "entity.metadata.confidence",
        "entity.metadata.question") \
    .show(truncate=False)

+-----------------------------------------------------------------+------+------+----------+------------------+
|result                                                           |result|word  |confidence|question          |
+-----------------------------------------------------------------+------+------+----------+------------------+
|[My name is Clara, I live in New York and Hellen lives in Paris.]|O     |My    |null      |null              |
|[My name is Clara, I live in New York and Hellen lives in Paris.]|O     |name  |null      |null              |
|[My name is Clara, I live in New York and Hellen lives in Paris.]|O     |is    |null      |null              |
|[My name is Clara, I live in New York and Hellen lives in Paris.]|B-NAME|Clara |0.93601274|What is my name?  |
|[My name is Clara, I live in New York and Hellen lives in Paris.]|O     |,     |null      |null              |
|[My name is Clara, I live in New York and Hellen lives in Paris.]|O     |I     |null      |null        

Now we will check the NER chunks.

In [None]:
results.selectExpr("explode(ner_chunk)").show(100, truncate=False)

+----------------------------------------------------------------------------------------------------+
|col                                                                                                 |
+----------------------------------------------------------------------------------------------------+
|{chunk, 11, 15, Clara, {entity -> NAME, sentence -> 0, chunk -> 0, confidence -> 0.93601274}, []}   |
|{chunk, 28, 35, New York, {entity -> CITY, sentence -> 0, chunk -> 1, confidence -> 0.83294815}, []}|
|{chunk, 41, 46, Hellen, {entity -> NAME, sentence -> 0, chunk -> 2, confidence -> 0.4536752}, []}   |
|{chunk, 57, 61, Paris, {entity -> CITY, sentence -> 0, chunk -> 3, confidence -> 0.53289855}, []}   |
+----------------------------------------------------------------------------------------------------+
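The chunk rows above are the result of merging consecutive `B-` and `I-` tags into spans. A minimal sketch of that grouping in plain Python follows, assuming IOB2 tags and `(word, tag, begin, end)` tuples. It uses the first ten tokens of the example sentence, with the tags and offsets reported in this notebook. `merge_bio` is a hypothetical helper; NerConverter's actual implementation may handle more cases.

```python
def merge_bio(tokens):
    """Group IOB2-tagged tokens into (text, label, begin, end) chunks.

    Hypothetical helper for illustration; not the NerConverter implementation.
    """
    chunks = []
    current = None
    for word, tag, begin, end in tokens:
        if tag.startswith("B-"):
            # A B- tag always opens a new chunk, closing any open one
            if current is not None:
                chunks.append(current)
            current = {"label": tag[2:], "words": [word], "begin": begin, "end": end}
        elif tag.startswith("I-") and current is not None and current["label"] == tag[2:]:
            # An I- tag with the same label extends the open chunk
            current["words"].append(word)
            current["end"] = end
        else:
            # O tags (and stray I- tags) close the open chunk and are dropped
            if current is not None:
                chunks.append(current)
                current = None
    if current is not None:
        chunks.append(current)
    return [(" ".join(c["words"]), c["label"], c["begin"], c["end"]) for c in chunks]


tokens = [
    ("My", "O", 0, 1), ("name", "O", 3, 6), ("is", "O", 8, 9),
    ("Clara", "B-NAME", 11, 15), (",", "O", 16, 16), ("I", "O", 18, 18),
    ("live", "O", 20, 23), ("in", "O", 25, 26),
    ("New", "B-CITY", 28, 30), ("York", "I-CITY", 32, 35),
]

merged = merge_bio(tokens)
print(merged)  # [('Clara', 'NAME', 11, 15), ('New York', 'CITY', 28, 35)]
```

The merged offsets match the `begin` and `end` values of the first two chunk rows.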



In [None]:
results.select(F.explode(F.arrays_zip(results.ner_chunk.result,
                                      results.ner_chunk.metadata)).alias("cols"))\
       .select(F.expr("cols['0']").alias("chunk"),
               F.expr("cols['1']['entity']").alias("ner_label"),
               F.expr("cols['1']['confidence']").alias("confidence")).show(50, truncate=100)

+--------+---------+----------+
|   chunk|ner_label|confidence|
+--------+---------+----------+
|   Clara|     NAME|0.93601274|
|New York|     CITY|0.83294815|
|  Hellen|     NAME| 0.4536752|
|   Paris|     CITY|0.53289855|
+--------+---------+----------+



### LightPipelines

In [None]:
# Run fullAnnotate with a LightPipeline on a single string
print(text_list[-1], "\n")

light_model = LightPipeline(zero_shot_ner_model)
light_result = light_model.fullAnnotate(text_list[-1])

chunks = []
entities = []
sentence = []
begin = []
end = []

# Collect span offsets, text, label, and sentence index for each chunk
for n in light_result[0]['ner_chunk']:
    begin.append(n.begin)
    end.append(n.end)
    chunks.append(n.result)
    entities.append(n.metadata['entity'])
    sentence.append(n.metadata['sentence'])

df = pd.DataFrame({'chunks': chunks, 'begin': begin, 'end': end,
                   'sentence_id': sentence, 'entities': entities})

df.head(20)

My name is Clara, I live in New York and Hellen lives in Paris. 



     chunks  begin  end sentence_id entities
0     Clara     11   15           0     NAME
1  New York     28   35           0     CITY
2    Hellen     41   46           0     NAME
3     Paris     57   61           0     CITY
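The five parallel lists in the loop above can also be collapsed into a single list of row dictionaries. The sketch below shows that pattern. It uses a `namedtuple` called `FakeAnnotation` as a stand-in for the annotation objects returned by `fullAnnotate` so that it runs without Spark. Both `FakeAnnotation` and `chunk_rows` are hypothetical names, and the sample values are the chunks reported in this notebook.

```python
from collections import namedtuple

# Stand-in exposing only the attributes read in the loop above (assumption)
FakeAnnotation = namedtuple("FakeAnnotation", ["begin", "end", "result", "metadata"])


def chunk_rows(annotations):
    """Turn chunk annotations into one dictionary per row."""
    return [
        {
            "chunks": a.result,
            "begin": a.begin,
            "end": a.end,
            "sentence_id": a.metadata["sentence"],
            "entities": a.metadata["entity"],
        }
        for a in annotations
    ]


sample = [
    FakeAnnotation(11, 15, "Clara", {"entity": "NAME", "sentence": "0"}),
    FakeAnnotation(28, 35, "New York", {"entity": "CITY", "sentence": "0"}),
    FakeAnnotation(41, 46, "Hellen", {"entity": "NAME", "sentence": "0"}),
    FakeAnnotation(57, 61, "Paris", {"entity": "CITY", "sentence": "0"}),
]

rows = chunk_rows(sample)
for row in rows:
    print(row)
```

In the notebook itself, `pd.DataFrame(chunk_rows(light_result[0]['ner_chunk']))` should build the same table, since `pd.DataFrame` accepts a list of dictionaries.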


In [None]:
light_result[0]

{'zero_shot_ner': [Annotation(named_entity, 0, 1, O, {'sentence': '0', 'word': 'My'}, []),
  Annotation(named_entity, 3, 6, O, {'sentence': '0', 'word': 'name'}, []),
  Annotation(named_entity, 8, 9, O, {'sentence': '0', 'word': 'is'}, []),
  Annotation(named_entity, 11, 15, B-NAME, {'sentence': '0', 'word': 'Clara', 'confidence': '0.93601274', 'question': 'What is my name?'}, []),
  Annotation(named_entity, 16, 16, O, {'sentence': '0', 'word': ','}, []),
  Annotation(named_entity, 18, 18, O, {'sentence': '0', 'word': 'I'}, []),
  Annotation(named_entity, 20, 23, O, {'sentence': '0', 'word': 'live'}, []),
  Annotation(named_entity, 25, 26, O, {'sentence': '0', 'word': 'in'}, []),
  Annotation(named_entity, 28, 30, B-CITY, {'sentence': '0', 'word': 'New', 'confidence': '0.83294815', 'question': 'Which city?'}, []),
  Annotation(named_entity, 32, 35, I-CITY, {'sentence': '0', 'word': 'York', 'confidence': '0.83294815', 'question': 'Which city?'}, []),
  Annotation(named_entity, 37, 39, O

### NER Visualizer

In [None]:
! pip install -q spark-nlp-display

In [None]:
from sparknlp_display import NerVisualizer

visualiser = NerVisualizer()

for i in text_list:

    light_result = light_model.fullAnnotate(i)
    visualiser.display(light_result[0], label_col='ner_chunk', document_col='document')