<a href="https://colab.research.google.com/github/KelvinLam05/Zero-Shot-Text-Classification/blob/main/Zero_Shot_Text_Classification_with_Hugging_Face.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Zero-Shot Text Classification**

In zero-shot text classification, a model assigns a text to one of a set of candidate labels without having been trained on any labeled examples for those labels.

**Goal of the project**

Let’s build a zero-shot classifier that assigns Scotch whisky reviews to one of three categories: *Single Malt Scotch*, *Blended Scotch Whisky* and *Blended Malt Scotch Whisky*.

In [None]:
# Importing libraries
import pandas as pd
import numpy as np
import ktrain
import tensorflow as tf
from ktrain import text
from sklearn.model_selection import train_test_split

In [None]:
# Load dataset
df = pd.read_csv('/content/scotch_whisky_review.csv')

In [None]:
# Examine the data
df.head()

Unnamed: 0,name,category,review.point,price,currency,description.1.2247.
0,Caol Ila 18 year old (Diageo Special Releases ...,Single Malt Scotch,88,100,$,"If you like sherried malts, you’ll love this! ..."
1,"Carlyle, 40%",Single Malt Scotch,88,13,$,"A dense, suffocating fog of peat smoke, sea sa..."
2,"Big Peat Christmas Edition 2017, 54.1%",Blended Malt Scotch Whisky,88,70,$,The style and class of the youngest was inspir...
3,"J. Mossman Gold Crown 12 year old, 40%",Blended Scotch Whisky,88,43,$,"After six ‘Work in Progress’ releases, Kilkerr..."
4,Chapter 7 2008 (distilled at Allt-a-Bhainne) 9...,Single Malt Scotch,88,65,$,"As the name implies, this non-chill filtered e..."


In [None]:
# Overview of all variables, their datatypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   name                 1000 non-null   object
 1   category             1000 non-null   object
 2   review.point         1000 non-null   int64 
 3   price                1000 non-null   object
 4   currency             1000 non-null   object
 5   description.1.2247.  983 non-null    object
dtypes: int64(1), object(5)
memory usage: 47.0+ KB


**Preprocessing**

In [None]:
# Rename column header
df.rename(columns = {'description.1.2247.': 'description'}, inplace = True)

In [None]:
# Checking for missing values
df.isnull().sum().sort_values(ascending = False)

description     17
currency         0
price            0
review.point     0
category         0
name             0
dtype: int64

In [None]:
# Drop rows with NaN values
df = df[pd.notnull(df['description'])]

In [None]:
df.isnull().sum().sort_values(ascending = False)

description     0
currency        0
price           0
review.point    0
category        0
name            0
dtype: int64

In [None]:
# Drop columns that are not needed
df = df[['description', 'category']]

In [None]:
# Checking the distribution of classes
df['category'].value_counts() 

Single Malt Scotch            830
Blended Scotch Whisky          96
Blended Malt Scotch Whisky     57
Name: category, dtype: int64

The dataset is highly imbalanced: about 84% of the reviews (830 of 983) are *Single Malt Scotch*.
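
To make the imbalance concrete, the class shares can be computed from the counts printed above. The sketch below also derives the accuracy of a trivial classifier that always predicts the most frequent class, which is a useful reference point when reading accuracy figures on imbalanced data. The variable names are illustrative.

```python
# Class counts copied from the value_counts() output above
class_counts = {
    'Single Malt Scotch': 830,
    'Blended Scotch Whisky': 96,
    'Blended Malt Scotch Whisky': 57,
}

total = sum(class_counts.values())

# Share of each class in the dataset
class_shares = {label: count / total for label, count in class_counts.items()}

# Accuracy of always predicting the most frequent class
majority_label = max(class_counts, key=class_counts.get)
majority_baseline = class_counts[majority_label] / total

for label, share in class_shares.items():
    print(f'{label}: {share:.1%}')
print(f'Majority-class baseline ({majority_label}): {majority_baseline:.1%}')
```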

In [None]:
# Find all unique characters and symbols 
all_text = str()

for sentence in df['description'].values:
    all_text += sentence
    
''.join(set(all_text))

'DnA\'–fQNX6”xg9sùU:;8t&LSi2(W0Y’C,“.â"dÌaP3?BJluh$ò1me4V/—HGEqp\u2028oZ!à\xa05I£bKçyéTvMwô%‘c -R7\nzkè#Oür€)\r…jûF'

Review text like this is usually unstructured. It contains accented characters, typographic quotes, currency symbols and control characters that we clean up before passing the text to a machine learning model.

In [None]:
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer 

In [None]:
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [None]:
stop_words = set(stopwords.words('english'))
lemma = WordNetLemmatizer()

We will now set up our cleaning function.

In [None]:
def clean_review(review_text):

  # Replace everything except ASCII letters (digits, punctuation, accented characters) with a space
  review_text = re.sub('[^a-zA-Z]', ' ', review_text)                           
  # Replace one or more spaces with single space
  review_text = re.sub(r'\s+', ' ', review_text)                                
  # Convert all characters into lowercase
  review_text = str(review_text).lower()                                        
  # Tokenization
  review_text = word_tokenize(review_text)
  # Removing Stopwords                                      
  review_text = [item for item in review_text if item not in stop_words]        
  # Lemmatization
  review_text = [lemma.lemmatize(word = w, pos = 'v') for w in review_text]     
  # Remove the words having length <= 2
  review_text = [i for i in review_text if len(i) > 2]                          
  # Join the list of tokens back into a single string
  review_text = ' '.join (review_text)                                          
  
  return review_text 
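
To see what the regular-expression steps do on their own, here is a minimal sketch that uses only the standard library. It deliberately leaves out the NLTK tokenization, stop-word removal and lemmatization, so its output differs from that of `clean_review`. The helper name `strip_and_filter` is hypothetical.

```python
import re

def strip_and_filter(review_text):
    # Replace every character that is not an ASCII letter with a space
    review_text = re.sub('[^a-zA-Z]', ' ', review_text)
    # Collapse runs of whitespace into a single space
    review_text = re.sub(r'\s+', ' ', review_text)
    # Lowercase, split on whitespace, and drop tokens of length <= 2
    tokens = [t for t in review_text.lower().split() if len(t) > 2]
    return ' '.join(tokens)

print(strip_and_filter('If you like sherried malts, you’ll love this!'))
# prints: you like sherried malts you love this
```

The typographic apostrophe in "you’ll" is replaced by a space, which leaves the fragment "ll"; the length filter then removes it along with "if".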

In [None]:
df['clean_review'] = df['description'].apply(clean_review)

In [None]:
all_text = str()

for sentence in df['clean_review'].values:
    all_text += sentence
    
''.join(set(all_text))

'nqzpkfobgxsdraltyhuivjwmec '

In [None]:
# Get the first five rows
df['description'].head()

0    If you like sherried malts, you’ll love this! ...
1    A dense, suffocating fog of peat smoke, sea sa...
2    The style and class of the youngest was inspir...
3    After six ‘Work in Progress’ releases, Kilkerr...
4    As the name implies, this non-chill filtered e...
Name: description, dtype: object

In [None]:
# Get the first five rows
df['clean_review'].head()

0    like sherried malt love bottle respectable str...
1    dense suffocate fog peat smoke sea salt dry se...
2    style class youngest inspire classic elegance ...
3    six work progress release kilkerran glengyle d...
4    name imply non chill filter expression finish ...
Name: clean_review, dtype: object

In [None]:
# Display full strings
with pd.option_context('display.max_colwidth', None):
  display(df['clean_review'])

0                                                                                                                 like sherried malt love bottle respectable strength red apple cherry skin strawberry raspberry eccles cake malt loaf warm spice lot get nose finely structure dram soft leather rhubarb bramley apple cherryade fresh victoria plum pepper mute ginger deliver sustain flavor long spicy peel fruit finish give distillery closure could interest components bottle
1                                                                                                                                                       dense suffocate fog peat smoke sea salt dry seaweed high tide lemon scent candle remember come air supple silky texture lemon mousse bake apple vanilla cinnamon ginger massive rush pepper hold mouth long possible flavor delivery impressively long constantly evolve hot dry finish frankly relief peppery assault palate
2                                                           

**Testing for GPU**

In [None]:
import torch

In [None]:
torch.cuda.is_available()

True

In [None]:
device = torch.cuda.current_device() if torch.cuda.is_available() else -1

In [None]:
print(device)

0


In [None]:
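# Device for the summarization models: use the GPU when available, otherwise the CPU
# (the name Load_on_CPU is kept because later cells refer to it)
Load_on_CPU = torch.device('cuda' if torch.cuda.is_available() else 'cpu')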

**Transformer Pipeline**

We’ll use the appropriate transformers.pipeline to compute the predicted class for each Scotch whisky.

In [None]:
from transformers import pipeline

In [None]:
task = 'zero-shot-classification'
zero_shot_model = 'vicgalle/xlm-roberta-large-xnli-anli'
zero_shot_classifier = pipeline(task, zero_shot_model, device = device)

We can use this pipeline by passing in a sequence and a list of candidate labels. The pipeline assumes by default that only one of the candidate labels is true, returning a list of scores for each label which add up to 1.
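
Under the hood, the zero-shot pipeline turns each candidate label into a hypothesis (by default with the template "This example is {}.") and asks a natural language inference model how strongly the sequence entails it. In single-label mode the entailment logits are passed through a softmax across the labels, which is why the scores add up to 1. A minimal sketch of that last step, with made-up logits standing in for real model outputs:

```python
import math

candidate_labels = ['Single Malt Scotch', 'Blended Scotch Whisky', 'Blended Malt Scotch Whisky']

# One hypothesis per label, using the pipeline's default template
hypothesis_template = 'This example is {}.'
hypotheses = [hypothesis_template.format(label) for label in candidate_labels]

# Hypothetical entailment logits, one per hypothesis (not real model outputs)
entailment_logits = [1.2, 0.6, 0.5]

def softmax(values):
    # Subtract the maximum for numerical stability
    highest = max(values)
    exps = [math.exp(v - highest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

scores = softmax(entailment_logits)

for hypothesis, score in zip(hypotheses, scores):
    print(f'{hypothesis} -> {score:.3f}')
```

Because the softmax is taken across labels, adding or removing a candidate label changes every score, so the values are best read as relative preferences among the given labels.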

In [None]:
sequences = 'like sherried malt love bottle respectable strength red apple cherry skin strawberry raspberry eccles cake malt loaf warm spice lot get nose finely structure dram soft leather rhubarb bramley apple cherryade fresh victoria plum pepper mute ginger deliver sustain flavor long spicy peel fruit finish give distillery closure could interest components bottle'

In [None]:
candidate_labels = ['Single Malt Scotch', 'Blended Scotch Whisky', 'Blended Malt Scotch Whisky']

In [None]:
outputs = zero_shot_classifier(sequences = sequences, candidate_labels = candidate_labels, multi_class = False)

Let’s take a look at the outputs.

In [None]:
for label, score in zip(outputs['labels'], outputs['scores']):
    print(f'{label}: {score:.3f}')

Single Malt Scotch: 0.473
Blended Scotch Whisky: 0.267
Blended Malt Scotch Whisky: 0.260


The model correctly ranks Single Malt Scotch first, with a score of 0.473. The other two labels, Blended Scotch Whisky and Blended Malt Scotch Whisky, score lower (0.267 and 0.260), although the margin is moderate rather than decisive.

In [None]:
task = 'zero-shot-classification'
zero_shot_model = 'vicgalle/xlm-roberta-large-xnli-anli'
classifier = pipeline(task, zero_shot_model, device = device) 

In [None]:
candidate_labels = ['Single Malt Scotch', 'Blended Scotch Whisky', 'Blended Malt Scotch Whisky']

In [None]:
# Compute the predicted label for each whisky
df['label_pred_zero_shot'] = df['clean_review'].apply(lambda x: classifier(x, candidate_labels = candidate_labels)['labels'][0])

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
# Accuracy
accuracy_score(df['category'], df['label_pred_zero_shot'])

0.6907426246185148
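
Overall accuracy can hide how each class fares on an imbalanced dataset. One way to dig further is per-class recall, the fraction of each true class that was predicted correctly. The sketch below uses a handful of made-up labels purely for illustration; in this notebook the inputs would be `df['category']` and `df['label_pred_zero_shot']`.

```python
from collections import Counter

def per_class_recall(y_true, y_pred):
    # Count how often each true class appears and how often it is predicted correctly
    totals = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / totals[label] for label in totals}

# Toy labels for illustration only (not results from the dataset)
y_true = ['Single Malt Scotch', 'Single Malt Scotch', 'Blended Scotch Whisky', 'Blended Malt Scotch Whisky']
y_pred = ['Single Malt Scotch', 'Blended Scotch Whisky', 'Blended Scotch Whisky', 'Single Malt Scotch']

for label, recall in per_class_recall(y_true, y_pred).items():
    print(f'{label}: {recall:.2f}')
```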

The zero-shot classifier reaches about 69% accuracy without any task-specific training. Glancing at a few random reviews that it labeled incorrectly, no class stands out as particularly problematic, although such an assertion would require further investigation. The length of the reviews could also hurt performance. This issue is discussed on the Hugging Face forum, where Joe Davison, Hugging Face developer and creator of the zero-shot pipeline, says the following:

*For long documents, I don’t think there’s an ideal solution right now. If truncation isn’t satisfactory, then the best thing you can do is probably split the document into smaller segments and ensemble the scores somehow.*
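
The segment-and-ensemble idea from the quote could look roughly like this. The sketch splits a text into fixed-size word chunks, scores each chunk, and averages the per-label scores. `score_fn` is a hypothetical stand-in for a call to the zero-shot classifier that returns a label-to-score mapping, and simple averaging is only one possible way to ensemble.

```python
def split_into_chunks(text, max_words):
    # Split on whitespace and group the words into chunks of at most max_words
    words = text.split()
    return [' '.join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def ensemble_scores(text, labels, score_fn, max_words=50):
    # Score every chunk separately, then average the scores per label
    chunks = split_into_chunks(text, max_words)
    if not chunks:
        raise ValueError('text contains no words')
    totals = {label: 0.0 for label in labels}
    for chunk in chunks:
        chunk_scores = score_fn(chunk, labels)
        for label in labels:
            totals[label] += chunk_scores[label]
    return {label: total / len(chunks) for label, total in totals.items()}

# Dummy scorer for illustration: uniform scores, regardless of the text
def uniform_score_fn(chunk, labels):
    return {label: 1.0 / len(labels) for label in labels}

labels = ['Single Malt Scotch', 'Blended Scotch Whisky', 'Blended Malt Scotch Whisky']
averaged = ensemble_scores('peat smoke sea salt dry seaweed', labels, uniform_score_fn, max_words=2)
print(averaged)
```

With the real pipeline, `score_fn` could wrap the classifier call and return `dict(zip(outputs['labels'], outputs['scores']))`.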

We’ll try another solution: summarizing each review first, then classifying the summary with the zero-shot classifier.

**Length of the cleaned reviews**

In [None]:
# Find the length of strings
df['clean_review_length'] = df['clean_review'].apply(len)

In [None]:
# Generate descriptive statistics 
df['clean_review_length'].describe()

count    983.000000
mean     285.916582
std       63.868860
min      116.000000
25%      250.000000
50%      286.000000
75%      320.000000
max      668.000000
Name: clean_review_length, dtype: float64

In [None]:
# Find the shortest string
min(df['clean_review'], key = len)

'lemonade hint aniseed putty nose tropical fruit spice milk chocolate palate finish medium length spicy hint licorice'
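
Note that `len` on a string counts characters, not words or model tokens, so the statistics above are character lengths. A quick check on the 116-character review printed above:

```python
review = ('lemonade hint aniseed putty nose tropical fruit spice milk chocolate '
          'palate finish medium length spicy hint licorice')

# Character count, which is what apply(len) measures
print(len(review))          # 116

# Word count, a closer (though still rough) proxy for the number of model tokens
print(len(review.split()))  # 17
```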

In [None]:
# Find the longest string
max(df['clean_review'], key = len)

'color antique gold aroma dry creamy note vanilla marshmallow honey tropical fruit pineapple coconut palate malty creamy front vanilla marshmallow hint honey briefly become fruity tropical fruit turn dry oaky big long dry finish especially triple distil lowlander lowland whiskies know mature nicely younger age people know especially auchentoshan delicious older age mention still column older vintages auchentoshan offer individual retailers one cask time one casks one many vintage auchentoshan whiskies enjoy past years auchentoshan balance complexity perspective seem best year old range still say one hold fairly well age available exclusively park avenue liquors'

**Text Summarization with BERT**

To see how the summarization model works, we’ll do a quick dry run. 



In [None]:
from transformers import BertTokenizerFast, EncoderDecoderModel

In [None]:
review_text = max(df['clean_review'], key = len)

In [None]:
summarization_model = 'mrm8488/bert-small2bert-small-finetuned-cnn_daily_mail-summarization'

In [None]:
model = EncoderDecoderModel.from_pretrained(summarization_model).to(Load_on_CPU)
tokenizer = BertTokenizerFast.from_pretrained(summarization_model)

inputs = tokenizer(review_text, padding = 'max_length', truncation = True, max_length = 512, return_tensors = 'pt')
input_ids = inputs.input_ids.to(Load_on_CPU)
attention_mask = inputs.attention_mask.to(Load_on_CPU)

output = model.generate(input_ids, attention_mask = attention_mask, max_length = 286, no_repeat_ngram_size = 2, num_beams = 4, do_sample = False, early_stopping = True)
summary_text = tokenizer.decode(output[0], skip_special_tokens = True)

print(summary_text)

the color antique gold aroma dry creamy note vanilla marshmallow honey tropical fruit turns dry oaky big long dry finish especially triple distil lowlander lowland whiskies know mature nicely younger age people know especially auchentoshan delicious older age mention still column older vintages auchnoshan offer individual retailers one cask time one.


**Initializing and Summarizing with BERT Summarizer + Zero-Shot Classification**

In [None]:
model = EncoderDecoderModel.from_pretrained(summarization_model).to(Load_on_CPU)
tokenizer = BertTokenizerFast.from_pretrained(summarization_model)

# Custom summarization pipeline (to handle long reviews)
def generate_summary(text):
    
    inputs = tokenizer(text, padding = 'max_length', truncation = True, max_length = 512, return_tensors = 'pt')
    input_ids = inputs.input_ids.to(Load_on_CPU)
    attention_mask = inputs.attention_mask.to(Load_on_CPU)

    output = model.generate(input_ids, attention_mask = attention_mask, max_length = 286, no_repeat_ngram_size = 2, num_beams = 4, do_sample = False, early_stopping = True)
    
    return tokenizer.decode(output[0], skip_special_tokens = True)

In [None]:
task = 'zero-shot-classification'
zero_shot_model = 'vicgalle/xlm-roberta-large-xnli-anli'
classifier = pipeline(task, zero_shot_model, device = device) 

In [None]:
candidate_labels = ['Single Malt Scotch', 'Blended Scotch Whisky', 'Blended Malt Scotch Whisky']

In [None]:
# Apply summarization then zero-shot classification to the dataset
df['label_pred_bert-small_sum_zs'] = df['clean_review'].apply(lambda x: classifier(generate_summary(x), candidate_labels = candidate_labels, multi_class = False)['labels'][0])

In [None]:
# Accuracy
accuracy_score(df['category'], df['label_pred_bert-small_sum_zs'])

0.728382502543235

Adding summarization before zero-shot classification raised the accuracy from 69.1% to 72.8%, **a gain of about 3.8 percentage points!**

**Text Summarization with T5**

To see how the summarization model works, we’ll do a quick dry run.

In [None]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

In [None]:
review_text = max(df['clean_review'], key = len)

In [None]:
model = T5ForConditionalGeneration.from_pretrained('t5-small').to(Load_on_CPU)
tokenizer = T5Tokenizer.from_pretrained('t5-small')

input_ids = tokenizer.encode(review_text, padding = 'max_length', truncation = True, max_length = 512, return_tensors = 'pt').to(Load_on_CPU)
summary_ids = model.generate(input_ids, max_length = 286, no_repeat_ngram_size = 2, num_beams = 4, do_sample = False, early_stopping = True)

summary_text = tokenizer.decode(summary_ids[0], skip_special_tokens = True)

print(summary_text)

gold aroma dry creamy note vanilla marshmallow honey tropical fruit pineapple coconut palate malty creamy front vanilla  marshmallow hint honey briefly become fruity tropical fruits turn dry oaky big long dry finish especially triple distil lowlander lowland whiskies know mature nicely younger age people know especially auchentoshan delicious older age mention still column older vintages Auchen to offer individual retailers one cask time one cazks one many vintage aussientasan whisky enjoy past years a balance complexity perspective seem best year old range still say one hold


**Initializing and Summarizing with T5 Summarizer + Zero-Shot Classification**

In [None]:
model = T5ForConditionalGeneration.from_pretrained('t5-small').to(Load_on_CPU)
tokenizer = T5Tokenizer.from_pretrained('t5-small')

# Custom summarization pipeline (to handle long reviews)
def generate_summary(text):

    input_ids = tokenizer.encode(text, padding = 'max_length', truncation = True, max_length = 512, return_tensors = 'pt').to(Load_on_CPU)
    summary_ids = model.generate(input_ids, max_length = 286, no_repeat_ngram_size = 2, num_beams = 4, do_sample = False, early_stopping = True)

    return tokenizer.decode(summary_ids[0], skip_special_tokens = True)

In [None]:
task = 'zero-shot-classification'
zero_shot_model = 'vicgalle/xlm-roberta-large-xnli-anli'
classifier = pipeline(task, zero_shot_model, device = device) 

In [None]:
candidate_labels = ['Single Malt Scotch', 'Blended Scotch Whisky', 'Blended Malt Scotch Whisky']

In [None]:
# Apply summarization then zero-shot classification to the dataset
df['label_pred_t5-small_sum_zs'] = df['clean_review'].apply(lambda x: classifier(generate_summary(x), candidate_labels = candidate_labels, multi_class = False)['labels'][0])

In [None]:
# Accuracy
accuracy_score(df['category'], df['label_pred_t5-small_sum_zs'])

0.7293997965412004

Adding summarization before zero-shot classification raised the accuracy from 69.1% to 72.9%, **a gain of about 3.9 percentage points!**
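
The three accuracy figures can be put side by side. The sketch below converts them back to counts of correctly labeled reviews (out of the 983 rows kept after dropping missing descriptions) and reports the gain both in percentage points and relative to the plain zero-shot result. The dictionary keys are illustrative names.

```python
n_reviews = 983

# Accuracy values copied from the outputs above
accuracies = {
    'zero-shot': 0.6907426246185148,
    'bert-small summary + zero-shot': 0.728382502543235,
    't5-small summary + zero-shot': 0.7293997965412004,
}

baseline = accuracies['zero-shot']

for name, acc in accuracies.items():
    n_correct = round(acc * n_reviews)
    gain_points = (acc - baseline) * 100
    relative_gain = (acc - baseline) / baseline * 100
    print(f'{name}: {n_correct} correct, +{gain_points:.2f} points, +{relative_gain:.2f}% relative')
```

In absolute terms the two summarizers differ by a single review, so the gap between them is too small to say which one is better.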