<a href="https://colab.research.google.com/github/Dipeshpal/zero_shot_classification_GPT-2/blob/main/zero_shot_classification_GPT_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [4]:
!pip install transformers

Collecting transformers
[?25l  Downloading https://files.pythonhosted.org/packages/b5/d5/c6c23ad75491467a9a84e526ef2364e523d45e2b0fae28a7cbe8689e7e84/transformers-4.8.1-py3-none-any.whl (2.5MB)
[K     |▏                               | 10kB 15.0MB/s eta 0:00:01[K     |▎                               | 20kB 22.1MB/s eta 0:00:01[K     |▍                               | 30kB 27.0MB/s eta 0:00:01[K     |▌                               | 40kB 27.7MB/s eta 0:00:01[K     |▋                               | 51kB 29.7MB/s eta 0:00:01[K     |▉                               | 61kB 32.3MB/s eta 0:00:01[K     |█                               | 71kB 28.0MB/s eta 0:00:01[K     |█                               | 81kB 29.2MB/s eta 0:00:01[K     |█▏                              | 92kB 30.2MB/s eta 0:00:01[K     |█▎                              | 102kB 31.7MB/s eta 0:00:01[K     |█▌                              | 112kB 31.7MB/s eta 0:00:01[K     |█▋                              | 

Source: https://joeddav.github.io/blog/2020/05/29/ZSL.html

Demo: http://35.208.71.201:8000/

## What is zero-shot learning?

Traditionally, zero-shot learning (ZSL) most often referred to a fairly specific type of task: learn a classifier on one set of labels and then evaluate on a different set of labels that the classifier has never seen before. Recently, especially in NLP, it's been used much more broadly to mean get a model to do som
ething that it wasn't explicitly trained to do. A well-known example of this is in the GPT-2 paper where the authors evaluate a language model on downstream tasks like machine translation without fine-tuning on these tasks directly.

The definition is not all that important, but it is useful to understand that the term is used in various ways and that we should therefore take care to understand the experimental setting when comparing different methods. For example, traditional zero-shot learning requires providing some kind of descriptor (Romera-Paredes et al. 2015) for an unseen class (such as a set of visual attributes or simply the class name) in order for a model to be able to predict that class without training data. Understanding that different zero-shot methods may adopt different rules for what kind of class descriptors are allowed provides relevant context when communicating about these techniques.

# A latent embedding approach

A common approach to zero shot learning in the computer vision setting is to use an existing featurizer to embed an image and any possible class names into their corresponding latent representations (e.g. Socher et al. 2013). They can then take some training set and use only a subset of the available labels to learn a linear projection to align the image and label embeddings. At test time, this framework allows one to embed any label (seen or unseen) and any image into the same latent space and measure their distance.

In the text domain, we have the advantage that we can trivially use a single model to embed both the data and the class names into the same space, eliminating the need for the data-hungry alignment step. This is not a new technique – researchers and practitioners have used pooled word vectors in similar ways for some time (such as Veeranna et al. 2016). But recently we have seen a dramatic increase in the quality of sentence embedding models. We therefore decided to run some experiments with Sentence-BERT, a recent technique which fine-tunes the pooled BERT sequence representations for increased semantic richness, as a method for obtaining sequence and label embeddings.

To formalize this, suppose we have a sequence embedding model 
Φ
sent
Φ 
sent
​
  and set of possible class names 
C
C. We classify a given sequence 
x
x according to,

ˆ
c
=
arg
max
c
∈
C
 
cos
(
Φ
sent
(
x
)
,
Φ
sent
(
c
)
)
c
^
 =arg 
c∈C
max
​
 cos(Φ 
sent
​
 (x),Φ 
sent
​
 (c))
where 
cos
cos is the cosine similarity. Here's an example code snippet showing how this can be done using Sentence-BERT as our embedding model 
Φ
sent
Φ 
sent
​
 :

In [5]:
from transformers import AutoTokenizer, AutoModel
from torch.nn import functional as F
import json

tokenizer = AutoTokenizer.from_pretrained('deepset/sentence_bert')
model = AutoModel.from_pretrained('deepset/sentence_bert')


HBox(children=(FloatProgress(value=0.0, description='Downloading', max=385.0, style=ProgressStyle(description_…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=231508.0, style=ProgressStyle(descripti…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=2.0, style=ProgressStyle(description_wi…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=112.0, style=ProgressStyle(description_…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=438006864.0, style=ProgressStyle(descri…




Some weights of the model checkpoint at deepset/sentence_bert were not used when initializing BertModel: ['classifier.weight', 'classifier.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [6]:
file_path = "laptop.json"

f = open(file_path,)
data = json.load(f)
  
name_li = []
for i in data:
  name_li.append(i['product_name'])
  
# Closing file
f.close()

sentence = name_li[:20]
labels = ['64 GB', '128 GB', '256 GB', '512 GB', '1 TB']
print(sentence)

[['Asus R465JA Core i3-1005G1 4GB 128GB 14 Inch Full HD Windows 10 S Laptop - R465JA-EK058T'], ['Dell Latitude 3510 Core i5-10210U 8GB 256GB SSD 15.6 Inch Windows 10 Pro Laptop - VCFVM'], ['Lenovo V15-IIL Core i5-1035G1 8GB 512GB SSD 15.6 Inch Full HD Windows 10 Laptop - 82C500G4UK'], ['Lenovo ThinkPad E15 Core i7-10510U 16GB 512GB SSD 15.6 Inch FHD Windows 10 Pro Laptop - 20RD0011UK'], ['HP 250 G7 Core i5-1035G1 8GB 256GB SSD 15.6 Inch Windows 10 Pro Laptop - 14Z88EA'], ['HP 250 G7 Core i5-1035G1 8GB 256GB SSD 15.6 Inch FHD Windows 10 Home Laptop - 15L03ES'], ['Lenovo V14-ADA Athlon Gold 3150U 8GB 256GB SSD 14 Inch Full HD Windows 10 Home Laptop - 82C6005CUK'], ['Lenovo V15-ADA AMD Ryzen 5-3500U 8GB 256GB SSD 15.6 Inch FHD Windows 10 Pro Laptop - 82C70006UK'], ['Lenovo V15 Althlon Silver 3050U 4GB 128GB SSD 15.6 Inch FHD Windows 10 Laptop - 82C700E4UK'], ['Asus C523 Intel Celeron N3350 4GB 64GB eMMC 15.6 Inch Chromebook - C523NA-BR0067'], ['Asus VivoBook R429MA-BV286TS Celeron N4000 4

When using transformer architectures like BERT, NLI datasets are typically modeled via sequence-pair classification. That is, we feed both the premise and the hypothesis through the model together as distinct segments and learn a classification head predicting one of [contradiction, neutral, entailment].

The approach, proposed by Yin et al. (2019), uses a pre-trained MNLI sequence-pair classifier as an out-of-the-box zero-shot text classifier that actually works pretty well. The idea is to take the sequence we're interested in labeling as the "premise" and to turn each candidate label into a "hypothesis." If the NLI model predicts that the premise "entails" the hypothesis, we take the label to be true. See the code snippet below which demonstrates how easily this can be done with 🤗 Transformers.

In [7]:
def get_output(sent, labl):
  sentence = sent
  labels = labl

  # run inputs through model and mean-pool over the sequence
  # dimension to get sequence-level representations
  inputs = tokenizer.batch_encode_plus([sentence] + labels,
                                      return_tensors='pt',
                                      pad_to_max_length=True)
  input_ids = inputs['input_ids']
  attention_mask = inputs['attention_mask']
  output = model(input_ids, attention_mask=attention_mask)[0]
  sentence_rep = output[:1].mean(dim=1)
  label_reps = output[1:].mean(dim=1)

  # now find the labels with the highest cosine similarities to
  # the sentence
  similarities = F.cosine_similarity(sentence_rep, label_reps)
  closest = similarities.argsort(descending=True)
  for ind in closest:
      print(f'label: {labels[ind]} \t similarity: {similarities[ind]}')

In [8]:
for i in sentence:
  print(i[0])
  get_output(i[0], ['32 GB', '64 GB', '128GB', '256GB', '512GB', '1TB'])
  print("-------------")

Asus R465JA Core i3-1005G1 4GB 128GB 14 Inch Full HD Windows 10 S Laptop - R465JA-EK058T




label: 128GB 	 similarity: 0.5040742754936218
label: 64 GB 	 similarity: 0.4159872531890869
label: 256GB 	 similarity: 0.38052645325660706
label: 512GB 	 similarity: 0.31228169798851013
label: 32 GB 	 similarity: 0.29870760440826416
label: 1TB 	 similarity: 0.22793249785900116
-------------
Dell Latitude 3510 Core i5-10210U 8GB 256GB SSD 15.6 Inch Windows 10 Pro Laptop - VCFVM
label: 256GB 	 similarity: 0.3909125328063965
label: 64 GB 	 similarity: 0.3374137580394745
label: 512GB 	 similarity: 0.3313817083835602
label: 128GB 	 similarity: 0.2930394411087036
label: 32 GB 	 similarity: 0.2722538113594055
label: 1TB 	 similarity: 0.061147719621658325
-------------
Lenovo V15-IIL Core i5-1035G1 8GB 512GB SSD 15.6 Inch Full HD Windows 10 Laptop - 82C500G4UK
label: 512GB 	 similarity: 0.45660123229026794
label: 128GB 	 similarity: 0.4063299298286438
label: 256GB 	 similarity: 0.35200148820877075
label: 64 GB 	 similarity: 0.3311328887939453
label: 32 GB 	 similarity: 0.289680540561676
label:

## Classification as Natural Language Inference

We will now explore an alternative method which not only embeds sequences and labels into the same latent space where their distance can be measured, but that can actually tell us something about the compatibility of two distinct sequences out of the box.

As a quick review, natural language inference (NLI) considers two sentences: a "premise" and a "hypothesis". The task is to determine whether the hypothesis is true (entailment) or false (contradiction) given the premise.

In [9]:
from transformers import BartForSequenceClassification, BartTokenizer
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-mnli')
model = BartForSequenceClassification.from_pretrained('facebook/bart-large-mnli')

# pose sequence as a NLI premise and label (politics) as a hypothesis
premise = 'Who are you voting for in 2020?'
hypothesis = 'This text is about politics.'

# run through model pre-trained on MNLI
input_ids = tokenizer.encode(premise, hypothesis, return_tensors='pt')
logits = model(input_ids)[0]

# we throw away "neutral" (dim 1) and take the probability of
# "entailment" (2) as the probability of the label being true 
entail_contradiction_logits = logits[:,[0,2]]
probs = entail_contradiction_logits.softmax(dim=1)
true_prob = probs[:,1].item() * 100
print(f'Probability that the label is true: {true_prob:0.2f}%')

HBox(children=(FloatProgress(value=0.0, description='Downloading', max=898822.0, style=ProgressStyle(descripti…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=456318.0, style=ProgressStyle(descripti…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=26.0, style=ProgressStyle(description_w…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=1355863.0, style=ProgressStyle(descript…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=1154.0, style=ProgressStyle(description…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=1629486723.0, style=ProgressStyle(descr…


Probability that the label is true: 98.08%


# Zero Shot Classification

In the paper, the authors report a label-weighted F1 of 
37.9
37.9 on Yahoo Answers using the smallest version of BERT fine-tuned only on the Multi-genre NLI (MNLI) corpus. By simply using the larger and more recent Bart model pre-trained on MNLI, we were able to bring this number up to 
53.7
53.7.

See our live demo here to try it out for yourself! Enter a sequence you want to classify and any labels of interest and watch Bart do its magic in real time.

In [19]:
from pprint import pprint
from transformers import pipeline

In [21]:
classifier = pipeline('zero-shot-classification',
                      model='joeddav/bart-large-mnli-yahoo-answers')

def predict_zero_shot(text, labels):
  a = classifier(text, labels)
  pprint(a)

#### Test on our dataset "laptop.json"

In [22]:
for i in sentence:
  print(i[0])
  predict_zero_shot(i[0], ['32 GB', '64 GB', '128GB', '256GB', '512GB', '1TB'])
  print("---------------------------------------------------")

Asus R465JA Core i3-1005G1 4GB 128GB 14 Inch Full HD Windows 10 S Laptop - R465JA-EK058T
{'labels': ['128GB', '64 GB', '512GB', '32 GB', '256GB', '1TB'],
 'scores': [0.41368767619132996,
            0.15011776983737946,
            0.12291817367076874,
            0.1208706945180893,
            0.11573821306228638,
            0.07666756212711334],
 'sequence': 'Asus R465JA Core i3-1005G1 4GB 128GB 14 Inch Full HD Windows 10 '
             'S Laptop - R465JA-EK058T'}
---------------------------------------------------
Dell Latitude 3510 Core i5-10210U 8GB 256GB SSD 15.6 Inch Windows 10 Pro Laptop - VCFVM
{'labels': ['256GB', '512GB', '128GB', '64 GB', '32 GB', '1TB'],
 'scores': [0.4381909668445587,
            0.13760486245155334,
            0.11547959595918655,
            0.11235415190458298,
            0.11121848225593567,
            0.08515190333127975],
 'sequence': 'Dell Latitude 3510 Core i5-10210U 8GB 256GB SSD 15.6 Inch '
             'Windows 10 Pro Laptop - VCFVM'}
----