# 02-01: Zero-Shot Text Classification

## References

- [Unlocking Zero-Shot Text Classification with Hugging Face’s Transformers](https://medium.com/@s.sadathosseini/unlocking-zero-shot-text-classification-with-hugging-faces-transformers-9e30de5c8455)
- [Aspect Mining Using Zero-Shot Classification](https://aiswaryaramachandran.medium.com/aspect-mining-using-zero-shot-classification-3190e8a89d68)
- [Exploring Hugging Face: Zero-Shot Classification](https://pub.aimind.so/exploring-hugging-face-zero-shot-classification-781ef3a18510)
- [Zero Shot Classification with Huggingface 🤗 + Sentence Transformers](https://sachin-abeywardana.medium.com/zero-shot-classification-with-huggingface-sentence-transformers-c6cd732de0e0)
- [Analyzing QAnon on Twitter with Zero-Shot Classification](https://towardsdatascience.com/analyzing-qanon-on-twitter-with-zero-shot-classification-13ad73d324fc)
- [MoritzLaurer/deberta-v3-large-zeroshot-v2.0](https://huggingface.co/MoritzLaurer/deberta-v3-large-zeroshot-v2.0)
- [facebook/bart-large-mnli](https://huggingface.co/facebook/bart-large-mnli)

### Interesting Models

- [FacebookAI/roberta-large-mnli](https://huggingface.co/FacebookAI/roberta-large-mnli) - fine-tuned on the Multi-Genre Natural Language Inference (MNLI) corpus.
- [MoritzLaurer/deberta-v3-large-zeroshot-v2.0](https://huggingface.co/MoritzLaurer/deberta-v3-large-zeroshot-v2.0)
- [facebook/bart-large-mnli](https://huggingface.co/facebook/bart-large-mnli)

In [1]:
import os
os.environ["TF_USE_LEGACY_KERAS"] = "1"

In [2]:
import pandas as pd
import numpy as np
from functools import partial
from typing import Dict, List
from pprint import pprint
from pqdm.threads import pqdm
from tqdm.notebook import tqdm
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import classification_report, jaccard_score, accuracy_score, f1_score
from transformers import pipeline



In [22]:
data_path = '../../data'
input_path = f'{data_path}/input/labelled_tweets/csv_labels'
train_input_file = f'{input_path}/train.csv'
test_input_file = f'{input_path}/test.csv'
val_input_file = f'{input_path}/val.csv'
output_path = f'{data_path}/output/02_zero_shot'

## 1. Load Data

In [4]:
df_train = pd.read_csv(train_input_file)
df_val = pd.read_csv(val_input_file)
df_test = pd.read_csv(test_input_file)

# show the data frame shapes
print(f'Train shape: {df_train.shape}')
print(f'Val shape: {df_val.shape}')
print(f'Test shape: {df_test.shape}')

Train shape: (6957, 3)
Val shape: (987, 3)
Test shape: (1977, 3)


In [5]:
df_train.head()

| | ID | text | labels |
|---|---|---|---|
| 0 | 1311981051720409089t | @sandraburgess3 They have no idea , they cant ... | ineffective |
| 1 | 1361403925845401601t | @stepheniscowboy Nvm I ’ ve had covid I ’ ve g... | unnecessary |
| 2 | 1293488278361055233t | Coronavirus updates : Government partners with... | pharma |
| 3 | 1305252218526990338t | @OANN U . K . Glaxo Smith Klein whistleblower ... | rushed |
| 4 | 1376135683400687618t | 3 / horse " AstraZeneca , not so much for the ... | ineffective pharma |


## 2. Preprocessing

### 2.1. Labels to List

In [6]:
df_train['labels_list'] = df_train['labels'].str.split(' ')
df_test['labels_list'] = df_test['labels'].str.split(' ')
df_val['labels_list'] = df_val['labels'].str.split(' ')

### 2.2. Multi-label Binarization

In [7]:
# get the list of label values
labels = pd.concat([df_train.labels_list, 
                    df_val.labels_list, 
                    df_test.labels_list])

# initialize MultiLabelBinarizer
labels_lookup = MultiLabelBinarizer()

# learn the vocabulary
labels_lookup = labels_lookup.fit(labels)

# show the vocabulary
vocab = labels_lookup.classes_
print(f'Vocabulary size: {len(vocab)}')
print(f'Vocabulary: {vocab}')


Vocabulary size: 12
Vocabulary: ['conspiracy' 'country' 'ineffective' 'ingredients' 'mandatory' 'none'
 'pharma' 'political' 'religious' 'rushed' 'side-effect' 'unnecessary']
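
To make the encoding concrete, here is a minimal pure-Python sketch of the multi-hot vector that corresponds to this vocabulary. It is only an illustration of the idea, not scikit-learn's implementation; `VOCAB` and `to_multi_hot` are hypothetical names introduced for this example, and the vocabulary is copied from the output above.

```python
from typing import List

# The 12-label vocabulary printed above, in the sorted order
# that MultiLabelBinarizer stores in `classes_`.
VOCAB: List[str] = [
    'conspiracy', 'country', 'ineffective', 'ingredients', 'mandatory', 'none',
    'pharma', 'political', 'religious', 'rushed', 'side-effect', 'unnecessary',
]

def to_multi_hot(labels: List[str], vocabulary: List[str]) -> List[int]:
    """Return a 0/1 vector with a 1 at each position whose label is present."""
    present = set(labels)
    return [int(label in present) for label in vocabulary]

# Space-separated label string, as stored in the `labels` column.
row_labels = 'ineffective pharma'.split(' ')
encoded = to_multi_hot(row_labels, VOCAB)
print(encoded)
```

A tweet labelled `ineffective pharma` therefore gets a 1 at the positions of `ineffective` and `pharma` and a 0 everywhere else.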


In [8]:
# update the data frame with a `labels_encoded` column
df_train['labels_encoded'] = labels_lookup.transform(df_train.labels_list).tolist()
df_val['labels_encoded'] = labels_lookup.transform(df_val.labels_list).tolist()
df_test['labels_encoded'] = labels_lookup.transform(df_test.labels_list).tolist()

In [9]:
# add the one-hot encoded labels as columns to the data frames
df_train = df_train.join(pd.DataFrame(labels_lookup.transform(df_train.labels_list), 
                                     columns=labels_lookup.classes_, 
                                     index=df_train.index))

df_val = df_val.join(pd.DataFrame(labels_lookup.transform(df_val.labels_list),
                                    columns=labels_lookup.classes_,
                                    index=df_val.index))

df_test = df_test.join(pd.DataFrame(labels_lookup.transform(df_test.labels_list),
                                    columns=labels_lookup.classes_,
                                    index=df_test.index))

In [10]:
df_train.head()

| | ID | text | labels | labels_list | labels_encoded | conspiracy | country | ineffective | ingredients | mandatory | none | pharma | political | religious | rushed | side-effect | unnecessary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1311981051720409089t | @sandraburgess3 They have no idea , they cant ... | ineffective | [ineffective] | [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0] | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1361403925845401601t | @stepheniscowboy Nvm I ’ ve had covid I ’ ve g... | unnecessary | [unnecessary] | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 1293488278361055233t | Coronavirus updates : Government partners with... | pharma | [pharma] | [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0] | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1305252218526990338t | @OANN U . K . Glaxo Smith Klein whistleblower ... | rushed | [rushed] | [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 1376135683400687618t | 3 / horse " AstraZeneca , not so much for the ... | ineffective pharma | [ineffective, pharma] | [0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0] | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |


## 3. Classification

### 3.1. Create Classifier

In [11]:
# the model that will be used for classification
model_name = 'facebook/bart-large-mnli'

# create the classifier
classifier = pipeline("zero-shot-classification",
                      model=model_name)
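
As a rough sketch of how an NLI-based zero-shot classifier turns a label into a score: each candidate label is inserted into a hypothesis template, the model scores the (text, hypothesis) pair, and in multi-label mode the label's score is a softmax over only the entailment and contradiction logits. Scores are therefore independent per label and need not sum to 1. The logits below are made-up numbers for illustration, not model output, and `entailment_probability` is a hypothetical helper; the template string is the one this notebook passes to the classifier.

```python
import math
from typing import Dict, Tuple

hypothesis_template = 'This concern with the vaccine is about {}.'

def entailment_probability(entailment_logit: float, contradiction_logit: float) -> float:
    """Softmax over the (contradiction, entailment) pair; returns P(entailment)."""
    e = math.exp(entailment_logit)
    c = math.exp(contradiction_logit)
    return e / (e + c)

# Made-up (entailment, contradiction) logits per label, for illustration only.
fake_logits: Dict[str, Tuple[float, float]] = {
    'side-effect': (3.0, -1.0),
    'rushed': (0.5, 0.0),
    'religious': (-3.0, 3.0),
}

for label, (entail, contra) in fake_logits.items():
    hypothesis = hypothesis_template.format(label)
    score = entailment_probability(entail, contra)
    print(f'{hypothesis!r}: {score:.3f}')
```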



### 3.2. Test Classifier

In [12]:
# select a row for testing
sample_row = df_train.iloc[146][['text', 'labels_list', 'labels_encoded']]
pprint(sample_row.to_dict())

{'labels_encoded': [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0],
 'labels_list': ['rushed', 'side-effect'],
 'text': '@eleonorasfalcon @Phoebejoy1611 And what about the long term effects '
         "of an untrialled vaccine ? You can stop taking pills but you can't "
         'take a vaccine out of your body .'}


In [13]:
# perform classification
result = classifier(
    sequences=sample_row.text,
    candidate_labels=vocab,
    hypothesis_template='This concern with the vaccine is about {}.',
    multi_label=True)

pprint(result)

{'labels': ['side-effect',
            'ineffective',
            'rushed',
            'unnecessary',
            'conspiracy',
            'country',
            'pharma',
            'political',
            'mandatory',
            'ingredients',
            'religious',
            'none'],
 'scores': [0.9871069192886353,
            0.7577099204063416,
            0.5949831008911133,
            0.4207984209060669,
            0.09220512211322784,
            0.07710351794958115,
            0.07344336807727814,
            0.028462089598178864,
            0.0051172408275306225,
            0.002520242240279913,
            0.0013842779444530606,
            0.0006486524362117052],
 'sequence': '@eleonorasfalcon @Phoebejoy1611 And what about the long term '
             'effects of an untrialled vaccine ? You can stop taking pills but '
             "you can't take a vaccine out of your body ."}


### 3.3. Get Standardized Predictions

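The pipeline returns labels sorted by descending score, so their order differs from text to text. Reorder the scores to follow the vocabulary order used for the encoded labels, which gives one fixed-length score vector per text.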

In [14]:
def standardize_prediction(prediction: Dict, vocabulary: List[str]) -> List[float]:
    """
    Standardize the prediction output to a fixed length list.
    """
    return [prediction['scores'][prediction['labels'].index(label)]
            for label in vocabulary]

## test the function
#standardize_prediction(result, vocab.tolist())
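
The reordering can be checked on a toy result without loading the model. The sketch below is a variant of the function above that builds a label-to-score dictionary once instead of calling `list.index` for every label; `standardize_with_lookup`, `toy_result`, and `toy_vocab` are hypothetical names, and the scores are invented.

```python
from typing import Dict, List

def standardize_with_lookup(prediction: Dict, vocabulary: List[str]) -> List[float]:
    """Reorder pipeline scores to follow `vocabulary`, using a dict lookup."""
    score_by_label = dict(zip(prediction['labels'], prediction['scores']))
    return [score_by_label[label] for label in vocabulary]

# Toy result in the pipeline's output shape: labels sorted by descending score.
toy_result = {
    'sequence': 'example text',
    'labels': ['side-effect', 'rushed', 'none'],
    'scores': [0.99, 0.59, 0.01],
}
toy_vocab = ['none', 'rushed', 'side-effect']

reordered = standardize_with_lookup(toy_result, toy_vocab)
print(reordered)  # [0.01, 0.59, 0.99]
```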

### 3.4. Get Predictions

In [15]:
def get_prediction(text: str, classifier, vocabulary: List[str]) -> List[float]:
    """
    Get the prediction for a given text.
    """
    result = classifier(
        sequences=text,
        candidate_labels=vocabulary,
        hypothesis_template='This concern with the vaccine is about {}.',
        multi_label=True)
    
    return standardize_prediction(result, vocabulary)

## test the function
#get_prediction(sample_row.text, classifier, vocab.tolist())

In [16]:
def predict(X: List[str], vocabulary: List[str], classifier, n_jobs: int = 1) -> List[List[float]]:
    """
    Predict the labels for a list of texts.
    """
    if n_jobs == 1:
        result = []
        for text in tqdm(X):
            result.append(get_prediction(text, classifier, vocabulary))
            
        return result
    else:
        # create the partial function for parallel processing
        get_prediction_partial = partial(get_prediction, classifier=classifier, vocabulary=vocabulary)
    
        # perform parallel processing 
        return pqdm(X, get_prediction_partial, n_jobs=n_jobs)
        
## test the function
# predict(
#     X=df_train[:5].text.tolist(), 
#     vocabulary=vocab.tolist(),
#     classifier=classifier,
#     n_jobs=2)
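
An alternative to threads is to pass several texts per call. The sketch below assumes the classifier accepts a list of sequences and returns one result dictionary per sequence, which the Hugging Face zero-shot pipeline does; `predict_batched`, `chunked`, and `fake_classifier` are hypothetical names, and `fake_classifier` is only a stand-in so the example runs without downloading a model.

```python
from typing import Callable, Iterator, List

TEMPLATE = 'This concern with the vaccine is about {}.'

def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def predict_batched(X: List[str], vocabulary: List[str],
                    classifier: Callable, batch_size: int = 16) -> List[List[float]]:
    """Score texts batch by batch; each row follows the order of `vocabulary`."""
    predictions = []
    for batch in chunked(X, batch_size):
        results = classifier(sequences=batch,
                             candidate_labels=vocabulary,
                             hypothesis_template=TEMPLATE,
                             multi_label=True)
        # Guard: a single-item batch may come back as one dict, not a list.
        if isinstance(results, dict):
            results = [results]
        for result in results:
            score_by_label = dict(zip(result['labels'], result['scores']))
            predictions.append([score_by_label[label] for label in vocabulary])
    return predictions

def fake_classifier(sequences, candidate_labels, hypothesis_template, multi_label):
    """Hypothetical stand-in: scores 1.0 if the label occurs in the text, else 0.0."""
    results = []
    for text in sequences:
        scored = sorted(((float(label in text), label) for label in candidate_labels),
                        reverse=True)
        results.append({'sequence': text,
                        'labels': [label for _, label in scored],
                        'scores': [score for score, _ in scored]})
    return results

demo = predict_batched(['rushed vaccine', 'no concern', 'pharma and rushed'],
                       ['pharma', 'rushed'], fake_classifier, batch_size=2)
print(demo)
```

With the real pipeline, `classifier` would be the object created in section 3.1; whether batching is faster than the threaded version depends on the hardware and is not measured here.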

## 4. Evaluating the model

In [17]:
class Evaluation:

    @staticmethod
    def f1_score_macro(y_true, y_pred):
        """Calculate F1-score (Macro-Average)."""
        return f1_score(y_true, y_pred, average='macro', zero_division=0)

    @staticmethod
    def f1_score_weighted(y_true, y_pred):
        """Calculate F1-score (Weighted-Average)."""
        return f1_score(y_true, y_pred, average='weighted', zero_division=0)

    @staticmethod
    def jaccard_similarity(y_true, y_pred):
        """Calculate average Jaccard Similarity."""
        return jaccard_score(y_true, y_pred, average='samples')

    @staticmethod
    def subset_accuracy(y_true, y_pred):
        """Calculate Subset Accuracy (Exact Match Accuracy)."""
        return accuracy_score(y_true, y_pred)

    @staticmethod
    def evaluate_all(y_true,
                     y_pred,
                     threshold: float = 0.5):
        """Evaluate all metrics and display a summary."""
        # Convert predicted scores to binary labels using the threshold
        y_pred_bin = [[int(prob > threshold) for prob in pred] for pred in y_pred]

        f1_macro = Evaluation.f1_score_macro(y_true, y_pred_bin)
        f1_weighted = Evaluation.f1_score_weighted(y_true, y_pred_bin)
        jaccard_similarity = Evaluation.jaccard_similarity(y_true, y_pred_bin)
        subset_accuracy = Evaluation.subset_accuracy(y_true, y_pred_bin)

        # Display a summary of the evaluation
        print(f"F1 Score (Macro-Average)   \t{f1_macro:.3f}")
        print(f"F1 Score (Weighted-Average)\t{f1_weighted:.3f}")
        print(f"Average Jaccard Similarity \t{jaccard_similarity:.3f}")
        print(f"Subset Accuracy            \t{subset_accuracy:.3f}")
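
As a worked example of the thresholding and the two sample-level metrics, the sketch below applies them by hand to the sample tweet from section 3.2. The scores are the pipeline scores shown there, rounded to four decimals and arranged in vocabulary order; `binarize` and `jaccard` are hypothetical helpers written for this illustration, not the scikit-learn functions.

```python
from typing import List

def binarize(scores: List[float], threshold: float = 0.5) -> List[int]:
    """Turn scores into 0/1 predictions: 1 where the score exceeds the threshold."""
    return [int(score > threshold) for score in scores]

def jaccard(y_true: List[int], y_pred: List[int]) -> float:
    """Size of the intersection over size of the union of the two label sets."""
    intersection = sum(t and p for t, p in zip(y_true, y_pred))
    union = sum(t or p for t, p in zip(y_true, y_pred))
    return intersection / union if union else 0.0

# Scores for the sample tweet, in vocabulary order (conspiracy ... unnecessary).
sample_scores = [0.0922, 0.0771, 0.7577, 0.0025, 0.0051, 0.0006,
                 0.0734, 0.0285, 0.0014, 0.5950, 0.9871, 0.4208]
# True labels of the sample tweet: rushed and side-effect.
sample_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

sample_pred = binarize(sample_scores)
print(sample_pred)                        # ineffective, rushed, side-effect are set
print(jaccard(sample_true, sample_pred))  # 2 shared labels out of 3 in the union
print(sample_true == sample_pred)         # exact match, as counted by subset accuracy
```

The extra `ineffective` prediction lowers the Jaccard similarity to 2/3 and makes the sample count as a miss for subset accuracy, which only rewards exact matches.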

### 4.1. Perform Predictions

In [18]:
data = df_train

# get the true values
y_true = data[vocab].values.tolist()

# get the predictions
y_pred = predict(X=data.text.tolist(),
        vocabulary=vocab.tolist(),
        classifier=classifier,
        n_jobs=2)


In [23]:
# make sure the output directory exists, then save the predictions to a file
os.makedirs(output_path, exist_ok=True)
np.save(f'{output_path}/02-01_bart-large-mnli_train.npy', y_pred)

In [26]:
# temp_read = np.load(f'{output_path}/02-01_bart-large-mnli_train.npy', allow_pickle=True)
# temp_read

### 4.2. Classification Report

In [19]:
def show_classification_report(data:pd.DataFrame,
                               y_pred:np.ndarray,
                               threshold:float=0.5):
    # get the true labels
    y_true = data[vocab].values
    
    # Convert predictions to binary
    y_pred_bin = [[int(prob > threshold) for prob in pred] for pred in y_pred]
    
    # show the classification report
    print(classification_report(y_true, y_pred_bin, target_names=vocab))    

In [20]:
# show the classification report for the training split
show_classification_report(data, y_pred)

              precision    recall  f1-score   support

  conspiracy       0.12      0.70      0.20       341
     country       0.03      0.59      0.06       140
 ineffective       0.42      0.69      0.52      1171
 ingredients       0.55      0.47      0.51       304
   mandatory       0.40      0.59      0.48       548
        none       0.10      0.06      0.08       440
      pharma       0.30      0.74      0.42       889
   political       0.17      0.85      0.28       437
   religious       0.25      0.64      0.36        45
      rushed       0.22      0.88      0.35      1032
 side-effect       0.59      0.83      0.69      2663
 unnecessary       0.12      0.87      0.22       503

   micro avg       0.26      0.73      0.39      8513
   macro avg       0.27      0.66      0.35      8513
weighted avg       0.37      0.73      0.46      8513
 samples avg       0.30      0.73      0.40      8513
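
The macro and weighted averages in the report can be reproduced approximately from the per-class rows. The sketch below copies the rounded `f1-score` and `support` columns from the table above, so the results agree with the reported averages only up to rounding; `per_class` is a hypothetical name for this illustration.

```python
# (f1-score, support) per class, copied from the report above.
per_class = {
    'conspiracy': (0.20, 341),
    'country': (0.06, 140),
    'ineffective': (0.52, 1171),
    'ingredients': (0.51, 304),
    'mandatory': (0.48, 548),
    'none': (0.08, 440),
    'pharma': (0.42, 889),
    'political': (0.28, 437),
    'religious': (0.36, 45),
    'rushed': (0.35, 1032),
    'side-effect': (0.69, 2663),
    'unnecessary': (0.22, 503),
}

# Macro average: every class counts equally, regardless of its size.
f1_values = [f1 for f1, _ in per_class.values()]
macro_f1 = sum(f1_values) / len(f1_values)

# Weighted average: each class counts in proportion to its support.
total_support = sum(support for _, support in per_class.values())
weighted_f1 = sum(f1 * support for f1, support in per_class.values()) / total_support

print(f'total support: {total_support}')
print(f'macro F1    ~ {macro_f1:.3f}')
print(f'weighted F1 ~ {weighted_f1:.3f}')
```

The weighted average is higher than the macro average because `side-effect`, the largest class, also has the best F1, while small classes such as `country` and `none` pull the macro average down.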





### 4.3. Full Report

In [21]:
Evaluation.evaluate_all(y_true, y_pred)

F1 Score (Macro-Average)   	0.347
F1 Score (Weighted-Average)	0.464
Average Jaccard Similarity 	0.293
Subset Accuracy            	0.068
