#### Zero-shot classification with bart-large-mnli

The bart-large-mnli model is a version of the BART model that has been fine-tuned on the Multi-Genre Natural Language Inference (MultiNLI) dataset (https://paperswithcode.com/dataset/multinli). Fine-tuning on natural language inference is what makes zero-shot classification possible: each candidate label is turned into a hypothesis, and the model scores how strongly the text entails it. We therefore only need to specify the category labels we want to apply to our text; no task-specific training data is required.

In [1]:
from transformers import pipeline



In [2]:
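# Name the task explicitly rather than relying on it being inferred from the model's Hub metadata
model_1 = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")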

In [3]:
model_1("This movie is terrible!!",
    candidate_labels=["negative", "positive", "neutral"])

{'sequence': 'This movie is terrible!!',
 'labels': ['negative', 'positive', 'neutral'],
 'scores': [0.9972519874572754, 0.0015112911351025105, 0.0012367221061140299]}
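
As a rough illustration of how labels alone can drive the classification, the sketch below mimics the scoring step in plain Python. It assumes the pipeline's default hypothesis template `This example is {}.` and single-label mode, where the entailment logits for all labels are passed through a softmax. The helper names and the logits are hypothetical and chosen only for illustration; they are not output from bart-large-mnli.

```python
import math


def build_hypotheses(labels, template="This example is {}."):
    """Turn each candidate label into an NLI hypothesis sentence."""
    return [template.format(label) for label in labels]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def rank_labels(labels, entailment_logits):
    """Pair each label with its softmax score, highest score first."""
    scores = softmax(entailment_logits)
    return sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)


labels = ["negative", "positive", "neutral"]
hypotheses = build_hypotheses(labels)

# Hypothetical entailment logits, made up for illustration (NOT real model output)
example_logits = [4.0, -2.5, -2.7]
ranked = rank_labels(labels, example_logits)

for hypothesis in hypotheses:
    print(hypothesis)
for label, score in ranked:
    print(label, round(score, 4))
```

In the real pipeline the logits come from running the model once per (text, hypothesis) pair; only the ranking logic is sketched here.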

#### RoBERTa model zero-shot classification

RoBERTa is a version of BERT that keeps essentially the same architecture but improves the pre-training procedure: more training data, longer training with larger batches, dynamic masking, and no next-sentence-prediction objective. The roberta-large-mnli model is RoBERTa fine-tuned, as in the previous example, on the Multi-Genre Natural Language Inference dataset (https://paperswithcode.com/dataset/multinli), which lets it be used for zero-shot classification in the same way.

In [4]:
model_2 = pipeline("zero-shot-classification", model="roberta-large-mnli")


Some weights of the model checkpoint at roberta-large-mnli were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [5]:
text = "The service at the restaurant was amazing, but the food was just average."
labels = ["positive", "negative", "neutral"]

In [6]:
# Classify sentiment
result = model_2(text, labels)
print(result)

{'sequence': 'The service at the restaurant was amazing, but the food was just average.', 'labels': ['neutral', 'negative', 'positive'], 'scores': [0.48902052640914917, 0.3037945628166199, 0.20718494057655334]}


In [8]:
# Alternative labels
labels2 = ["food", "theatre", "movies", "books"]


In [9]:
# Classify topic
result = model_2(text, labels2)
print(result)

{'sequence': 'The service at the restaurant was amazing, but the food was just average.', 'labels': ['food', 'movies', 'theatre', 'books'], 'scores': [0.947460949420929, 0.01917973719537258, 0.016935639083385468, 0.016423670575022697]}
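
The two results above differ a lot in how decisive they are: the topic prediction is near-certain, while the sentiment prediction for this mixed review is spread across all three labels. One possible way to handle that, sketched below, is to accept the top label only when its score clears a threshold. The function name `confident_label`, the fallback value, and the threshold of 0.6 are assumptions made for this example, not part of the pipeline API. The sketch relies only on the output format shown above, in which labels are sorted by descending score.

```python
def confident_label(result, threshold=0.6, fallback="uncertain"):
    """Return the top-ranked label, or `fallback` if its score is below `threshold`."""
    top_label = result["labels"][0]
    top_score = result["scores"][0]
    return top_label if top_score >= threshold else fallback


# Outputs copied from the two cells above (scores rounded to three decimals)
sentiment_result = {
    "labels": ["neutral", "negative", "positive"],
    "scores": [0.489, 0.304, 0.207],
}
topic_result = {
    "labels": ["food", "movies", "theatre", "books"],
    "scores": [0.947, 0.019, 0.017, 0.016],
}

print(confident_label(sentiment_result))  # uncertain
print(confident_label(topic_result))      # food
```

A suitable threshold depends on the task and the number of labels, so it would need to be tuned on labelled data.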


#### DistilBERT sentiment classification

Next, we again use a pre-trained model without training anything ourselves. This time, however, the model has already been fine-tuned specifically for sentiment analysis, so it is not a zero-shot classifier: its labels (POSITIVE and NEGATIVE) are fixed and we do not supply our own. The model (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english) is a DistilBERT model fine-tuned on the Stanford Sentiment Treebank (https://huggingface.co/datasets/stanfordnlp/sst2), a dataset of over 200,000 phrases labelled as positive or negative by human labellers. DistilBERT is a smaller, faster version of BERT produced by knowledge distillation.

In [10]:
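# Name the task explicitly; "sentiment-analysis" is an alias for text classification
model_3 = pipeline("sentiment-analysis", model="distilbert/distilbert-base-uncased-finetuned-sst-2-english")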

In [11]:
result = model_3("I love using Transformers. It's so easy and powerful!")
print(result)

[{'label': 'POSITIVE', 'score': 0.9998376369476318}]


## Testing a classifier on a pre-labelled dataset

In [12]:
import pandas as pd

In [13]:
df_test = pd.read_csv("imdb_shortened.csv")

In [18]:
df_test.head()

Unnamed: 0,short_reviews,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. The filming tec...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [19]:
df_test.describe()

Unnamed: 0,short_reviews,sentiment
count,100,100
unique,100,2
top,I have been a Mario fan for as long as I can r...,negative
freq,1,58


In [20]:
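# An unconditional drop raises a KeyError when no "Unnamed: 0" column is present
# (for example, if this cell is re-run after the column has already been removed).
# errors="ignore" makes the cell safe to run either way.
df_test = df_test.drop(columns="Unnamed: 0", errors="ignore")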

In [21]:
df_test.head()

Unnamed: 0,short_reviews,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. The filming tec...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [22]:
df_test["model_sentiment"] = None
df_test["model_score"] = None



In [23]:
df_test.head()

Unnamed: 0,short_reviews,sentiment,model_sentiment,model_score
0,One of the other reviewers has mentioned that ...,positive,,
1,A wonderful little production. The filming tec...,positive,,
2,I thought this was a wonderful way to spend ti...,positive,,
3,Basically there's a family where a little boy ...,negative,,
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,,


In [24]:
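sentiments = []
scores = []

# Classify each review and collect the predicted label and its confidence score
for index, row in df_test.iterrows():
    phrase = row["short_reviews"]
    # The pipeline returns a one-element list: [{'label': ..., 'score': ...}]
    output = model_3(phrase)
    sentiments.append(output[0]["label"])
    scores.append(output[0]["score"])

# Store the predictions alongside the human labels
df_test["model_sentiment"] = sentiments
df_test["model_score"] = scores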

In [25]:
df_test.head()

Unnamed: 0,short_reviews,sentiment,model_sentiment,model_score
0,One of the other reviewers has mentioned that ...,positive,NEGATIVE,0.839039
1,A wonderful little production. The filming tec...,positive,POSITIVE,0.999253
2,I thought this was a wonderful way to spend ti...,positive,POSITIVE,0.999064
3,Basically there's a family where a little boy ...,negative,NEGATIVE,0.994191
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,POSITIVE,0.999079
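
The row-by-row loop above works, but a pipeline can also be called once with a list of texts. The sketch below wraps that pattern in a helper; the name `classify_texts` and the default batch size are assumptions for illustration. It passes `truncation=True` so that inputs longer than the model's maximum length are cut off rather than raising an error, which may or may not matter for reviews that have already been shortened.

```python
def classify_texts(classifier, texts, batch_size=8):
    """Run a text-classification pipeline over many texts in a single call.

    `classifier` is expected to behave like a Hugging Face text-classification
    pipeline: given a list of strings it returns a list of
    {'label': ..., 'score': ...} dicts in the same order.
    """
    outputs = classifier(list(texts), truncation=True, batch_size=batch_size)
    labels = [out["label"] for out in outputs]
    scores = [out["score"] for out in outputs]
    return labels, scores


# Intended use with the objects defined earlier in this notebook:
# labels, scores = classify_texts(model_3, df_test["short_reviews"])
# df_test["model_sentiment"] = labels
# df_test["model_score"] = scores
```

For 100 short reviews the speed difference is likely small; the main benefit is shorter code and explicit truncation.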


In [None]:
# index=False avoids writing the row index as an extra unnamed column
df_test.to_csv("comparison.csv", index=False)

In [None]:
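# Count the reviews where the model's label disagrees with the human label
conflict = 0

for index, row in df_test.iterrows():
    # Model labels are upper case ("POSITIVE"/"NEGATIVE"); the dataset's are lower case
    if row["sentiment"] != row["model_sentiment"].lower():
        conflict += 1

print(conflict)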
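
A raw conflict count does not show which kind of mistake the model makes. As a small sketch of one way to extend it, the helper below reports accuracy together with a count of each (human label, model label) pair. The function name `agreement_summary` is an assumption for this example, and the demonstration uses only the five rows visible in the `df_test.head()` output above rather than the full dataset.

```python
from collections import Counter


def agreement_summary(true_labels, predicted_labels):
    """Return (accuracy, Counter of (true, predicted) pairs), ignoring case."""
    pairs = [(t.lower(), p.lower()) for t, p in zip(true_labels, predicted_labels)]
    confusion = Counter(pairs)
    correct = sum(count for (t, p), count in confusion.items() if t == p)
    accuracy = correct / len(pairs) if pairs else 0.0
    return accuracy, confusion


# The five rows shown in the df_test.head() output above
human = ["positive", "positive", "positive", "negative", "positive"]
model = ["NEGATIVE", "POSITIVE", "POSITIVE", "NEGATIVE", "POSITIVE"]

accuracy, confusion = agreement_summary(human, model)
print(accuracy)                               # 0.8
print(confusion[("positive", "negative")])    # 1
```

On the full DataFrame the same helper could be called as `agreement_summary(df_test["sentiment"], df_test["model_sentiment"])`.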
