![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/annotation/text/english/zero-shot%20text%20classification/Zero_Shot_Text_Classification_by_BERT.ipynb)

# Zero-Shot Learning in Modern NLP
### State-of-the-art NLP models for text classification without annotated data

>Natural language processing is a very exciting field right now. In recent years, the community has begun to figure out some pretty effective methods of learning from the enormous amounts of unlabeled data available on the internet. The success of transfer learning from unsupervised models has allowed us to surpass virtually all existing benchmarks on downstream supervised learning tasks. As we continue to develop new model architectures and unsupervised learning objectives, "state of the art" continues to be a rapidly moving target for many tasks where large amounts of labeled data are available.



In [None]:
# Only needed to set up PySpark and Spark NLP on Colab
!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash

In [None]:
import sparknlp
spark = sparknlp.start()

In [3]:
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline, PipelineModel

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer().setInputCols("document").setOutputCol("token")

zero_shot_classifier = BertForZeroShotClassification \
    .pretrained() \
    .setInputCols(["document", "token"]) \
    .setOutputCol("class") \
    .setCandidateLabels(["urgent", "mobile", "travel", "movie", "music", "sport", "weather", "technology"])

pipeline = Pipeline(stages=[
    document_assembler,
    tokenizer,
    zero_shot_classifier
])

bert_base_cased_zero_shot_classifier_xnli download started this may take some time.
Approximate size to download 387.7 MB
[OK!]


In [4]:
text = [["I have a problem with my iphone that needs to be resolved asap!!"],
        ["Last week I upgraded my iOS version and ever since then my phone has been overheating whenever I use your app."],
        ["I have a phone and I love it!"],
        ["I really want to visit Germany and I am planning to go there next year."],
        ["Let's watch some movies tonight! I am in the mood for a horror movie."],
        ["Have you watched the match yesterday? It was a great game!"],
        ["We need to harry up and get to the airport. We are going to miss our flight!"]]

# create a DataFrame in PySpark
inputDataset = spark.createDataFrame(text, ["text"])
model = pipeline.fit(inputDataset)
predictionDF = model.transform(inputDataset)

In [5]:
predictionDF.select("document.result", "class.result").show(10, False)

+----------------------------------------------------------------------------------------------------------------+--------+
|result                                                                                                          |result  |
+----------------------------------------------------------------------------------------------------------------+--------+
|[I have a problem with my iphone that needs to be resolved asap!!]                                              |[mobile]|
|[Last week I upgraded my iOS version and ever since then my phone has been overheating whenever I use your app.]|[mobile]|
|[I have a phone and I love it!]                                                                                 |[mobile]|
|[I really want to visit Germany and I am planning to go there next year.]                                       |[travel]|
|[Let's watch some movies tonight! I am in the mood for a horror movie.]                                         |[movie] |
|[Have y

In [6]:
sample_text = "Last week I upgraded my iOS version and ever since then my phone has been overheating whenever I use your app."

light_pipeline = LightPipeline(model)

results = light_pipeline.annotate(sample_text)
results


{'document': ['Last week I upgraded my iOS version and ever since then my phone has been overheating whenever I use your app.'],
 'token': ['Last',
  'week',
  'I',
  'upgraded',
  'my',
  'iOS',
  'version',
  'and',
  'ever',
  'since',
  'then',
  'my',
  'phone',
  'has',
  'been',
  'overheating',
  'whenever',
  'I',
  'use',
  'your',
  'app',
  '.'],
 'class': ['mobile']}

In [7]:
for tx in text:
  res = light_pipeline.annotate(tx[0])
  print(f"document: {res['document']} prediction: {res['class']}")


document: ['I have a problem with my iphone that needs to be resolved asap!!'] prediction: ['mobile']
document: ['Last week I upgraded my iOS version and ever since then my phone has been overheating whenever I use your app.'] prediction: ['mobile']
document: ['I have a phone and I love it!'] prediction: ['mobile']
document: ['I really want to visit Germany and I am planning to go there next year.'] prediction: ['travel']
document: ["Let's watch some movies tonight! I am in the mood for a horror movie."] prediction: ['movie']
document: ['Have you watched the match yesterday? It was a great game!'] prediction: ['sport']
document: ['We need to harry up and get to the airport. We are going to miss our flight!'] prediction: ['urgent']


## Multi-Label vs. Multi-Class

The `activation` parameter sets whether the result is multi-class (the probabilities of all labels sum to `1.0`) or multi-label (each label gets an independent probability between `0.0` and `1.0`):

- multi-class: `softmax` (default)
- multi-label: `sigmoid`

Since Spark NLP 4.4.3, the `multilabel` parameter can be used instead to set whether the result is multi-label:

- multilabel: `False` (default) i.e. `softmax` activation
- multilabel: `True` i.e. `sigmoid` activation

Note: the `activation` parameter is still available for backward compatibility.
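To make the difference between the two activations concrete, here is a minimal pure-Python sketch. The raw scores (`logits`) are made-up values for illustration, and the helper functions `softmax` and `sigmoid` are written here by hand; this is not how Spark NLP computes its scores internally, only a picture of what each activation does to a list of numbers.

```python
import math

def softmax(logits):
    """Normalize scores so they compete with each other and sum to 1.0."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(logits):
    """Squash each score independently into (0, 1); the results need not sum to 1.0."""
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]

# Hypothetical raw scores for three candidate labels (illustrative values only)
labels = ["urgent", "mobile", "travel"]
logits = [2.0, 1.5, -3.0]

multi_class = dict(zip(labels, softmax(logits)))  # multi-class view
multi_label = dict(zip(labels, sigmoid(logits)))  # multi-label view

print("softmax:", {k: round(v, 3) for k, v in multi_class.items()})
print("sigmoid:", {k: round(v, 3) for k, v in multi_label.items()})
```

With softmax the labels share a single probability budget, so only the top label is returned. With sigmoid each label is scored on its own, so in this toy example both `urgent` and `mobile` end up above `0.5` and could be returned together.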

In [15]:
zero_shot_classifier\
    .setCandidateLabels(["urgent", "mobile", "travel", "movie", "music", "sport", "weather", "technology"])\
    .setMultilabel(True)

pipeline = Pipeline(stages=[
    document_assembler,
    tokenizer,
    zero_shot_classifier
])

In [16]:
text = [["I have a problem with my iphone that needs to be resolved asap!!"],
        ["Last week I upgraded my iOS version and ever since then my phone has been overheating whenever I use your app."],
        ["I have a phone and I love it!"],
        ["I really want to visit Germany and I am planning to go there next year."],
        ["Let's watch some movies tonight! I am in the mood for a horror movie."],
        ["Have you watched the match yesterday? It was a great game!"],
        ["We need to harry up and get to the airport. We are going to miss our flight!"]]

# create a DataFrame in PySpark
inputDataset = spark.createDataFrame(text, ["text"])
model = pipeline.fit(inputDataset)
predictionDF = model.transform(inputDataset)

In [17]:
predictionDF.select("document.result", "class.result").show(10, False)

+----------------------------------------------------------------------------------------------------------------+-----------------------------------+
|result                                                                                                          |result                             |
+----------------------------------------------------------------------------------------------------------------+-----------------------------------+
|[I have a problem with my iphone that needs to be resolved asap!!]                                              |[urgent, mobile, movie, technology]|
|[Last week I upgraded my iOS version and ever since then my phone has been overheating whenever I use your app.]|[urgent, technology]               |
|[I have a phone and I love it!]                                                                                 |[mobile]                           |
|[I really want to visit Germany and I am planning to go there next year.]                    

In [18]:
# check the scores
predictionDF.select("class.metadata").show(10, False)

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|metadata                                                                                                                                                                                                                    
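Once the scores are available, they can be filtered with a custom cutoff. The sketch below is pure Python and makes an assumption about the metadata layout: that each row's metadata map holds one entry per candidate label, with the score stored as a string, next to bookkeeping keys such as `sentence`. The helper `labels_above` and the values in `example_metadata` are hypothetical; check the real output of `predictionDF.select("class.metadata")` before relying on this shape.

```python
def labels_above(metadata, candidate_labels, threshold=0.5):
    """Return (label, score) pairs whose score meets the threshold, highest first."""
    scored = []
    for label in candidate_labels:
        raw = metadata.get(label)
        if raw is None:
            continue  # label missing from this row's metadata
        score = float(raw)  # assumption: scores are stored as strings
        if score >= threshold:
            scored.append((label, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical metadata for one row (values made up for illustration)
example_metadata = {
    "sentence": "0",
    "urgent": "0.91",
    "mobile": "0.12",
    "technology": "0.77",
}

selected = labels_above(example_metadata, ["urgent", "mobile", "technology"], threshold=0.5)
print(selected)
```

Raising `threshold` keeps only the most confident labels, which can help when the multi-label output returns more labels than expected.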

### Zero-Shot Learning (ZSL)
> Traditionally, zero-shot learning (ZSL) most often referred to a fairly specific type of task: learn a classifier on one set of labels and then evaluate on a different set of labels that the classifier has never seen before. Recently, especially in NLP, it's been used much more broadly to mean get a model to do something that it wasn't explicitly trained to do. A well-known example of this is in the [GPT-2 paper](https://pdfs.semanticscholar.org/9405/cc0d6169988371b2755e573cc28650d14dfe.pdf) where the authors evaluate a language model on downstream tasks like machine translation without fine-tuning on these tasks directly.

Let's see how easy it is to use any set of labels our trained model has never seen, via the `setCandidateLabels()` method:

In [27]:
zero_shot_classifier\
    .setCandidateLabels(["space & cosmos", "scientific discovery", "microbiology", "robots", "archeology", "politics"])\
    .setActivation("sigmoid") # multi-label

pipeline = Pipeline(stages=[
    document_assembler,
    tokenizer,
    zero_shot_classifier
])

input_text3 = [
    ["Learn about the presidential election process, including the Electoral College, caucuses and primaries, and the national conventions."],
    ["""In a new book, Sean Carroll brings together physics and philosophy while advocating for "poetic naturalism." Ramin Skibba, Contributor. Space ..."""],
    ["Who are you voting for in 2024?"]]

# create a DataFrame in PySpark
inputDataset = spark.createDataFrame(input_text3, ["text"])
model = pipeline.fit(inputDataset)
predictionDF = model.transform(inputDataset)

predictionDF.select("document.result", "class.result").show(3, False)

+---------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+
|result                                                                                                                                             |result                                |
+---------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+
|[Learn about the presidential election process, including the Electoral College, caucuses and primaries, and the national conventions.]            |[politics]                            |
|[In a new book, Sean Carroll brings together physics and philosophy while advocating for "poetic naturalism." Ramin Skibba, Contributor. Space ...]|[space & cosmos, scientific discovery]|
|[Who are you voting for in 2024?]                     