# Introduction
In this notebook, I explore zero-shot classification with the Hugging Face Transformers library. Zero-shot classification assigns text to predefined categories even though the model has not been explicitly trained on those categories. At the end, I generate predictions for the test set.
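
For intuition, a zero-shot pipeline built on a natural language inference (NLI) model turns each candidate label into a hypothesis sentence and scores how strongly the text entails it. The sketch below mimics only the final scoring step in plain Python. The entailment logits are made-up numbers rather than model output, the helper names are hypothetical, and the template string is, as far as I know, the pipeline's default.

```python
import math

# Template that (to my knowledge) the Hugging Face zero-shot pipeline uses by default.
HYPOTHESIS_TEMPLATE = "This example is {}."


def build_hypotheses(candidate_labels):
    """Turn each candidate label into an NLI hypothesis sentence."""
    return [HYPOTHESIS_TEMPLATE.format(label) for label in candidate_labels]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["disaster", "normal"]
hypotheses = build_hypotheses(labels)

# Made-up entailment logits, one per hypothesis (NOT real model output).
entailment_logits = [3.1, -1.2]

# In single-label mode the entailment logits are normalized across the labels.
scores = softmax(entailment_logits)
ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)

print(hypotheses)
print(ranked[0][0])
```

The label whose hypothesis receives the highest normalized entailment score is reported first, which is why the prediction can later be read from the first entry of the returned labels.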
### Installation

In [1]:
!pip install datasets evaluate transformers[sentencepiece]

Collecting evaluate
  Downloading evaluate-0.4.1-py3-none-any.whl.metadata (9.4 kB)
Downloading evaluate-0.4.1-py3-none-any.whl (84 kB)
Installing collected packages: evaluate
Successfully installed evaluate-0.4.1


In [2]:
import re
import numpy as np
import pandas as pd
from transformers import pipeline
from tqdm.notebook import tqdm

2024-03-14 07:29:53.611095: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-03-14 07:29:53.611226: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-03-14 07:29:53.749261: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


In [3]:
classifier = pipeline("zero-shot-classification")

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.
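
The warning above recommends naming the model explicitly. A minimal sketch of doing so follows, assuming the model name and revision reported in the log are the ones wanted; `build_classifier` is a hypothetical helper name, and the import is deferred so that defining the helper does not load the library.

```python
# Model name and revision copied from the warning printed above.
MODEL_NAME = "facebook/bart-large-mnli"
MODEL_REVISION = "c626438"


def build_classifier(model_name=MODEL_NAME, revision=MODEL_REVISION):
    """Return a zero-shot pipeline pinned to an explicit model and revision."""
    # Imported here so that defining the helper stays cheap.
    from transformers import pipeline

    return pipeline(
        "zero-shot-classification",
        model=model_name,
        revision=revision,
    )
```

Calling `classifier = build_classifier()` should then behave like the bare `pipeline("zero-shot-classification")` call, but without relying on the library's default model.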



### Load the dataset

In [4]:
df_train = pd.read_csv('/kaggle/input/nlp-getting-started/train.csv')
df_test = pd.read_csv('/kaggle/input/nlp-getting-started/test.csv')
print('Training set shape = {}'.format(df_train.shape))
print('Test set shape = {}'.format(df_test.shape))

Training set shape = (7613, 5)
Test set shape = (3263, 4)


### Simple data preprocessing

In [5]:
# Text cleaning

# Ordered (pattern, replacement) pairs. Specific fixes come before the
# generic removals that would otherwise consume part of their match.
REPLACEMENTS = [
    (r"https?://t\.co/[A-Za-z0-9]+", ""),  # URLs
    (r"#", ""),  # hashtag symbol (the tag text is kept)
    (r"@", ""),  # mention symbol (the handle is kept)
    # Mis-encoded special characters
    (r"\x89ûïWhen", "When"),
    (r"China\x89ûªs", "China's"),
    (r"let\x89ûªs", "let's"),
    (r"\x89û_", ""),
    (r"\x89ûª", ""),
    (r"\x89ûò", ""),
    (r"\x89ûó", ""),
    (r"\x89ûï", ""),
    (r"\x89û÷", ""),
    (r"\x89û\x9d", ""),
    (r"\x89û¢", ""),
    (r"fromåÊwounds", "from wounds"),
    (r"JapÌ_n", "Japan"),
    (r"SuruÌ¤", "Suruc"),
    (r"å£3million", "3 million"),
    (r"å_", ""),
    (r"åÊ", ""),
    (r"åÈ", ""),
    (r"Ì©", "e"),
    (r"å¨", ""),
    (r"åÇ", ""),
    (r"åÀ", ""),
]

def clean(tweet):
    # Match case-insensitively and lowercase last, so that patterns containing
    # uppercase characters (e.g. "China", "Ê") are still applied.
    for pattern, replacement in REPLACEMENTS:
        tweet = re.sub(pattern, replacement, tweet, flags=re.IGNORECASE)
    return tweet.lower()

df_train['text_cleaned'] = df_train['text'].apply(clean)
df_test['text_cleaned'] = df_test['text'].apply(clean)
print('The original: {}'.format(df_train.iloc[892]['text']))
print('The cleaned: {}'.format(df_train.iloc[892]['text_cleaned']))

The original: I can't bloody wait!! Sony Sets a Date For Stephen KingÛªs Û÷The Dark TowerÛª #stephenking #thedarktower http://t.co/J9LPdRXCDE  @bdisgusting
The cleaned: i can't bloody wait!! sony sets a date for stephen kings the dark tower stephenking thedarktower   bdisgusting
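
The cleaned tweet above still contains runs of spaces where the URL and symbols were removed. One optional extra step, which is not part of the `clean` function, is to collapse that whitespace; `normalize_whitespace` is a hypothetical helper name.

```python
import re


def normalize_whitespace(text):
    """Collapse runs of whitespace into single spaces and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


# The cleaned tweet printed above, including the leftover gaps.
cleaned = (
    "i can't bloody wait!! sony sets a date for stephen kings the dark tower "
    "stephenking thedarktower   bdisgusting"
)
print(normalize_whitespace(cleaned))
```

Whether this changes the classifier's predictions is untested here; it mainly makes the cleaned text easier to read.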


In [6]:
# Target in the training set: 0 = no disaster, 1 = disaster
text = df_train.iloc[0]["text_cleaned"]
target = df_train.iloc[0]["target"]
print(f"Text:{text} label: {target}")
result = classifier(
    text,
    candidate_labels=["disaster", "normal"], 
)
print(result)

Text:our deeds are the reason of this earthquake may allah forgive us all label: 1
{'sequence': 'our deeds are the reason of this earthquake may allah forgive us all', 'labels': ['disaster', 'normal'], 'scores': [0.9942195415496826, 0.005780473817139864]}
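
The result dictionary above lists the labels in descending order of score, so the first label is the prediction. A small sketch of turning such a result into a 0/1 target follows; `to_target` and its `threshold` parameter are assumptions added for illustration and are not used elsewhere in this notebook.

```python
def to_target(result, positive_label="disaster", threshold=0.5):
    """Map a zero-shot result dict to 1 (disaster) or 0 (no disaster)."""
    scores = dict(zip(result["labels"], result["scores"]))
    return int(scores[positive_label] >= threshold)


# The result printed above for the first training tweet.
result = {
    "sequence": "our deeds are the reason of this earthquake may allah forgive us all",
    "labels": ["disaster", "normal"],
    "scores": [0.9942195415496826, 0.005780473817139864],
}
print(to_target(result))  # → 1
```

With two candidate labels whose scores sum to one, a threshold of 0.5 is equivalent to taking the top label; raising it trades recall for precision on the "disaster" class.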


### Create Submission

The code below first adds a new column called "target" to df_test and initializes it to 0. It then uses the pre-trained classifier to predict a label for each cleaned tweet in the test set, with "disaster" and "normal" as the candidate labels. If the top-ranked label is "disaster", the row's "target" value is set to 1; if it is "normal", the value is set to 0.

In [7]:
df_test['target'] = 0
for i in tqdm(range(len(df_test))):
    text = df_test.iloc[i]["text_cleaned"]
    
    # Perform zero-shot classification on the text
    results = classifier(text, candidate_labels=["disaster",  "normal"])
    
    # Get the predicted labels and assign a value to the target column
    labels = results["labels"]
    prediction = 1 if labels[0] == "disaster" else 0
    df_test.loc[i, "target"] = prediction

  0%|          | 0/3263 [00:00<?, ?it/s]
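
The loop above calls the pipeline once per tweet. Pipelines also accept a list of texts together with a `batch_size` argument, which may be faster on a GPU. The sketch below assumes that a list of several inputs yields a list of result dictionaries; `predict_targets` is a hypothetical helper, and `fake_classifier` is a keyword-based stand-in used only so the example runs without downloading a model.

```python
def predict_targets(classifier, texts, batch_size=32):
    """Classify all texts in one pipeline call and return 0/1 targets."""
    results = classifier(
        list(texts),
        candidate_labels=["disaster", "normal"],
        batch_size=batch_size,
    )
    # Labels come back sorted by descending score, so labels[0] is the prediction.
    return [1 if r["labels"][0] == "disaster" else 0 for r in results]


def fake_classifier(texts, candidate_labels, batch_size=1):
    """Stand-in for the real pipeline (simple keyword rule), for this demo only."""
    outputs = []
    for text in texts:
        score = 0.9 if "earthquake" in text else 0.1
        ranked = sorted(
            zip(candidate_labels, [score, 1 - score]),
            key=lambda pair: pair[1],
            reverse=True,
        )
        outputs.append({
            "sequence": text,
            "labels": [label for label, _ in ranked],
            "scores": [s for _, s in ranked],
        })
    return outputs


demo = predict_targets(fake_classifier, ["earthquake hits the city", "lovely sunny day"])
print(demo)
```

With the real pipeline, the call would look like `df_test["target"] = predict_targets(classifier, df_test["text_cleaned"].tolist())`.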

In [8]:
submission = df_test[["id", "target"]]
submission.to_csv("zero_shot_submission.csv", index=False)
submission.head()

   id  target
0   0       1
1   2       1
2   3       1
3   9       1
4  11       1
