## What is Zero-Shot Learning?

- <span style="color:green"> **Zero-Shot Learning** </span> is the idea that a model pretrained on enough general data (often unlabeled, or labeled for a different task) can generalize at inference time to classes it was never explicitly trained on. It can be applied to NLP, images, etc.
- <span style="color:green"> **Zero-Shot Learning** </span> is a setup in which a model can learn to recognize things that it hasn’t explicitly seen before in training.

### Load useful libraries and data

In [1]:
from transformers import pipeline
import pandas as pd
import numpy as np
from tqdm import tqdm


In [2]:
data = pd.read_csv(
    "data/SMSSpamCollection.txt",
    encoding="utf-8",
    header=None,
    delimiter="\t",
    names=["target", "text"],
)

In [3]:
data.head(5)

|   | target | text |
|---|--------|------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [4]:
print(f'There are {data.shape[0]} rows in the dataset')

There are 5572 rows in the dataset


### Preparing the pipeline in one line of code!

In [5]:
classifier = pipeline("zero-shot-classification", device=0)

No model was supplied, defaulted to FacebookAI/roberta-large-mnli and revision 130fb28 (https://huggingface.co/FacebookAI/roberta-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


All PyTorch model weights were used when initializing TFRobertaForSequenceClassification.

All the weights of TFRobertaForSequenceClassification were initialized from the PyTorch model.



### Making Predictions

This model works best with informative labels, and spam/ham are not very informative. Using spam/ham as labels leads to a Hamming loss of 53%, whereas using click bait/written by humans leads to 19%.
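
To see why label wording matters, it helps to look at what the NLI model actually reads. The sketch below assumes the pipeline's default hypothesis template, `"This example is {}."`, and shows the sentence each candidate label turns into. The helper name `build_hypotheses` is hypothetical and not part of the Transformers API.

```python
# Assumption: default hypothesis template of the zero-shot pipeline.
DEFAULT_TEMPLATE = "This example is {}."


def build_hypotheses(candidate_labels, template=DEFAULT_TEMPLATE):
    """Turn each candidate label into the hypothesis sentence the NLI model sees."""
    return [template.format(label) for label in candidate_labels]


# Compare the terse labels with the more descriptive ones.
for labels in (["spam", "ham"], ["click bait", "written by humans"]):
    for hypothesis in build_hypotheses(labels):
        print(hypothesis)
```

Each message is paired with every hypothesis, and the model judges whether the message entails it. "This example is ham." is a hard sentence to entail, which may explain part of the gap between the two label sets.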

Can you find better label descriptions?

In [6]:
category_map = {"spam":"click bait", "ham":"written by humans"}

In [11]:
candidate_labels = list(category_map.values())
predictedCategories = []
trueCategories = []
for i in tqdm(range(100)):  # classify only the first 100 messages
    text = data.iloc[i]['text']
    cat = [data.iloc[i]['target']]
    res = classifier(text, candidate_labels, multi_label=False)
    labels = res['labels']
    scores = res['scores']  # scores associated with the labels
    res_dict = {label: score for label, score in zip(labels, scores)}
    # sort the labels in descending order of score
    sorted_dict = dict(sorted(res_dict.items(), key=lambda x: x[1], reverse=True))
    categories = next(iter(sorted_dict))  # top-scoring label

    predictedCategories.append(categories)
    trueCats = [category_map[x] for x in cat]  # map ham/spam to the label descriptions
    trueCategories.append(trueCats)

100%|██████████| 100/100 [00:32<00:00,  3.11it/s]
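
The `scores` used in the loop above sum to one across the candidate labels when `multi_label=False`. The following is a minimal sketch of that normalization, assuming the pipeline applies a softmax over the per-label entailment logits. The function name and the logit values are made up for illustration and are not real model output.

```python
import math


def scores_from_entailment_logits(candidate_labels, entailment_logits):
    """Softmax the entailment logits and return labels sorted by descending score."""
    largest = max(entailment_logits)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in entailment_logits]
    total = sum(exps)
    scored = sorted(
        ((label, e / total) for label, e in zip(candidate_labels, exps)),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return {
        "labels": [label for label, _ in scored],
        "scores": [score for _, score in scored],
    }


# Hypothetical logits, for illustration only.
res = scores_from_entailment_logits(["click bait", "written by humans"], [2.0, 0.5])
top_label = res["labels"][0]  # the first label is the prediction
print(top_label, res["scores"])
```

Because the result is already sorted, taking the first entry of `labels` is enough to get the predicted category.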


In [12]:
for y_true, y_pred in zip(trueCategories[:3], predictedCategories[:3]):
    print(f'True Categories {y_true}')
    print(f'Predicted Categories {y_pred}')
    print('#'*50)

True Categories ['normal personal SMS message between friends or family, not advertising']
Predicted Categories unwanted SMS spam with promotions, prizes, or advertisements
##################################################
True Categories ['normal personal SMS message between friends or family, not advertising']
Predicted Categories normal personal SMS message between friends or family, not advertising
##################################################
True Categories ['unwanted SMS spam with promotions, prizes, or advertisements']
Predicted Categories unwanted SMS spam with promotions, prizes, or advertisements
##################################################


### Hamming Loss
The Hamming loss is the fraction of labels that are incorrectly predicted.
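
For a single-label task like this one, that definition reduces to the share of messages whose predicted label differs from the true label. Here is a minimal pure-Python sketch (the helper name is hypothetical), applied to the three predictions printed above:

```python
def hamming_loss_single_label(y_true, y_pred):
    """Fraction of positions where the predicted label differs from the true one."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")
    mismatches = sum(t != p for t, p in zip(y_true, y_pred))
    return mismatches / len(y_true)


SPAM = "unwanted SMS spam with promotions, prizes, or advertisements"
HAM = "normal personal SMS message between friends or family, not advertising"

# The three examples shown above: the first is wrong, the other two are right.
y_true = [HAM, HAM, SPAM]
y_pred = [SPAM, HAM, SPAM]
loss = hamming_loss_single_label(y_true, y_pred)
print(f"{loss:.4f}")  # 0.3333
```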

In [13]:
from sklearn.metrics import hamming_loss
print(f'The hamming loss is {hamming_loss(trueCategories,predictedCategories):.4f} compared to 0.0237 from the last trained model in Notebook 1')

The hamming loss is 0.6000 compared to 0.0237 from the last trained model in Notebook 1


Comparing label styles:
With the "click bait" / "written by humans" labels, the Hamming loss was 0.1900. After replacing them with these very explicit, dataset‑style labels:

category_map = {
    "spam": "unwanted SMS spam with promotions, prizes, or advertisements",
    "ham":  "normal personal SMS message between friends or family, not advertising"
}

the Hamming loss rose to 0.6000. For reference, the last trained model in Notebook 1 reached 0.0237.

## Check your understanding

Can you find better labels for the category descriptions to improve the Hamming loss?

In [10]:
category_map = {
    "spam": "unwanted SMS spam with promotions, prizes, or advertisements",
    "ham":  "normal personal SMS message between friends or family, not advertising"
}

Changing to the very explicit, dataset‑style labels actually made the zero‑shot model worse, not better (Hamming loss from ~0.19 → ~0.60).

Our trained model in Notebook 1 (Hamming loss ≈ 0.0237) has actually seen the dataset and learned dataset‑specific patterns.
A generic zero‑shot model hasn’t seen this dataset; 0.19 is already quite reasonable for true zero‑shot. Beating 0.0237 with zero‑shot is very unlikely.

In other words:
- Notebook 1 model (0.0237): trained, task‑specific, much better, as expected.
- Zero‑shot with “click bait / written by humans” (≈0.19): quite a decent zero‑shot result.
- Zero‑shot with very explicit dataset‑style labels (≈0.60): the label wording is misaligned with how the NLI model understands categories.
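
To compare several label wordings without copying the prediction loop each time, a small helper can take the classifier as an argument. This is a sketch under assumptions: `evaluate_label_map` and `stub_classifier` are hypothetical names, the stub merely imitates the shape of the pipeline's result dict, and the three messages are made-up toy inputs rather than rows from the dataset.

```python
def evaluate_label_map(classify, texts, targets, category_map):
    """Return the Hamming loss of a zero-shot classifier for one label wording.

    `classify(text, candidate_labels)` must return a dict whose "labels" list
    is sorted by descending score, as the zero-shot pipeline does.
    """
    candidate_labels = list(category_map.values())
    errors = 0
    for text, target in zip(texts, targets):
        predicted = classify(text, candidate_labels)["labels"][0]
        errors += predicted != category_map[target]
    return errors / len(texts)


def stub_classifier(text, candidate_labels):
    """Stand-in for the real pipeline: picks the first label if 'free' appears."""
    first, second = candidate_labels
    ordered = [first, second] if "free" in text.lower() else [second, first]
    return {"sequence": text, "labels": ordered, "scores": [0.9, 0.1]}


# Toy inputs for illustration only.
texts = ["Free entry to win a prize", "Ok, see you at lunch", "Are you free tonight?"]
targets = ["spam", "ham", "ham"]

label_maps = {
    "terse": {"spam": "spam", "ham": "not spam"},
    "descriptive": {"spam": "click bait", "ham": "written by humans"},
}
for name, cmap in label_maps.items():
    print(name, evaluate_label_map(stub_classifier, texts, targets, cmap))
```

With the real model, the stub could be swapped for something like `lambda t, labels: classifier(t, labels, multi_label=False)`, and `texts` and `targets` for the first 100 rows of `data["text"]` and `data["target"]`.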


### Optional: make the task more “SMS‑aware”
If you want to go one step further, you can help the model by changing the hypothesis template so that it is explicitly about SMS:

from transformers import pipeline

classifier = pipeline(
    "zero-shot-classification",
    model="facebook/bart-large-mnli",
    device=0,
    hypothesis_template="This SMS message is {}."
)

Then you can even try simpler labels again:

category_map = {
    "spam": "spam",
    "ham":  "not spam",
}

Sometimes just telling the model the domain (“This SMS message …”) helps it interpret “spam / not spam” better.