# Zero-Shot Sentiment Analysis with Hugging Face Models

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Imports

from transformers import pipeline

import pandas as pd
import numpy as np

from tqdm import tqdm # a package for creating progress bars
import math

import torch
torch.cuda.is_available() # this command will return True if a compatible GPU is available

## Zero-shot classification with RoBERTa

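RoBERTa is a version of the BERT model that keeps the same architecture but improves the pre-training procedure, using more training data, longer training, and dynamic masking. The roberta-large-mnli model is RoBERTa fine-tuned on the Multi-Genre Natural Language Inference dataset (https://paperswithcode.com/dataset/multinli). A natural language inference model judges whether one sentence entails another, which is what makes zero-shot classification possible: each candidate label is turned into a hypothesis, and the model scores how strongly the text entails it.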
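
To make the mechanism concrete, here is a minimal sketch of how entailment scores can be turned into label probabilities. It does not load any model: the entailment logits are made-up numbers for illustration, and `build_hypotheses` and `softmax` are hypothetical helper names. The template "This example is {}." is the pipeline's default hypothesis template as far as we are aware, but treat that as an assumption.

```python
import math

def build_hypotheses(labels, template="This example is {}."):
    """Turn each candidate label into a hypothesis sentence for the NLI model."""
    return [template.format(label) for label in labels]

def softmax(logits):
    """Normalise a list of logits into probabilities that sum to 1."""
    largest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["positive", "negative", "neutral"]
hypotheses = build_hypotheses(labels)

# Hypothetical entailment logits, one per hypothesis (NOT real model output)
entailment_logits = [2.1, 0.3, 1.2]

scores = softmax(entailment_logits)
ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)

for label, score in ranked:
    print(f"{label}: {score:.3f}")
```

Because the scores are normalised across the labels you supply, adding or removing a label changes every score, not just one.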

In [None]:
# Loading a model from huggingface, specifying the zero-shot-classification pipeline from Hugging Face Transformers

model_2 = pipeline("zero-shot-classification", model="roberta-large-mnli")

In [None]:
# Providing an example text, and the labels we want to use for classification

text = "The service at the restaurant was amazing, but the food was just average."
labels = ["positive", "negative", "neutral"]

In [None]:
# Applying the model to the example sentence

result = model_2(text, labels)
print(result)
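
The zero-shot pipeline returns a dictionary with the keys `sequence`, `labels`, and `scores`, with the labels ordered from highest to lowest score. The sketch below shows one way to pull out the top label and ignore low-confidence predictions. The dictionary is a hand-written stand-in with made-up scores, and `top_label` and its `min_score` threshold are hypothetical names chosen for illustration.

```python
# Hand-written stand-in with the same keys as the pipeline output (scores are made up)
example_result = {
    "sequence": "The service at the restaurant was amazing, but the food was just average.",
    "labels": ["positive", "neutral", "negative"],
    "scores": [0.62, 0.27, 0.11],
}

def top_label(result, min_score=0.5):
    """Return the highest-scoring label, or None if its score is below min_score."""
    best_label, best_score = max(
        zip(result["labels"], result["scores"]), key=lambda pair: pair[1]
    )
    return best_label if best_score >= min_score else None

print(top_label(example_result))
print(top_label(example_result, min_score=0.9))
```

With the real output you could call `top_label(result)` directly.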

In [None]:
# Trying a different classification task, with different labels

labels2 = ["food", "theatre", "movies", "books"]

In [None]:
# Applying the model to the example sentence with the new labels

result = model_2(text, labels2)
print(result)

## Zero-shot classification with DistilBERT SST-2

Now we will apply a pre-trained model in a similar way, without any further training of our own. This time, however, we will use a model that has already been fine-tuned specifically for sentiment analysis. This model (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english) is a DistilBERT model fine-tuned on the Stanford Sentiment Treebank (https://huggingface.co/datasets/stanfordnlp/sst2), a dataset of over 200,000 phrases labelled as positive or negative by human labellers. DistilBERT is a smaller, faster version of the BERT language model.

In [None]:
# Loading the model (stored as model_3 so that it does not overwrite the RoBERTa pipeline in model_2)

model_3 = pipeline("sentiment-analysis", model="distilbert/distilbert-base-uncased-finetuned-sst-2-english")

In [None]:
# Applying the model to an example sentence (note, we do not specify labels here, as the model has sentiment labels built in)

text_2 = "I love using Transformers. It's so easy and powerful!"

result = model_3(text_2)

print(result)
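
The sentiment pipeline returns a list with one dictionary per input, each holding a `label` ("POSITIVE" or "NEGATIVE" for this model) and a `score`. When comparing texts it can be handy to fold the two into a single signed number. The following is a small sketch of that idea: `signed_score` is a hypothetical helper, and the example output is made up rather than produced by the model.

```python
# Made-up stand-in shaped like the sentiment pipeline's output for one text
example_output = [{"label": "POSITIVE", "score": 0.98}]

def signed_score(output):
    """Map a single-text pipeline output to a value between -1 and 1.

    Positive predictions keep their score; negative predictions get a minus sign.
    """
    top = output[0]
    sign = 1.0 if top["label"].upper() == "POSITIVE" else -1.0
    return sign * top["score"]

print(signed_score(example_output))
print(signed_score([{"label": "NEGATIVE", "score": 0.75}]))
```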

## Testing DistilBERT SST-2 on a pre-labelled dataset

In [None]:
# Load in the example dataset of IMDB movie reviews

df_test = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/TRIADS_workshops/ml_text_analysis/IMDB_short_reviews.csv")

In [None]:
df_test.head()

In [None]:
df_test.info()

In [None]:
# Print first review just to take a look!

print(df_test["review"][0])

In [None]:
# Creating empty lists to store our sentiment labels and scores

sentiments = []
scores = []

# Looping through the dataframe and applying our model to each review, collecting the label and score so we can save them to new columns

for index, row in df_test.iterrows():
    phrase = row["review"]
    output = model_3(phrase)
    sentiment = output[0]["label"]
    score = output[0]["score"]
    sentiments.append(sentiment)
    scores.append(score)


df_test['model_sentiment'] = sentiments
df_test['model_score'] = scores
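
Hugging Face pipelines also accept a list of texts and return one result per text, so a larger dataset can be processed in chunks instead of row by row. The sketch below shows the chunking logic only. `classify_in_batches` is a hypothetical helper, and `stub_classifier` is a toy stand-in for the real pipeline so that the example runs without downloading a model.

```python
import math

def classify_in_batches(texts, classifier, batch_size=16):
    """Call classifier on successive slices of texts and collect all results."""
    results = []
    n_batches = math.ceil(len(texts) / batch_size)
    for i in range(n_batches):
        batch = texts[i * batch_size:(i + 1) * batch_size]
        results.extend(classifier(batch))
    return results

def stub_classifier(batch):
    """Toy stand-in for the pipeline: 'POSITIVE' if the text contains 'good'."""
    return [
        {"label": "POSITIVE" if "good" in text.lower() else "NEGATIVE", "score": 1.0}
        for text in batch
    ]

reviews = ["A good film.", "Dull and slow.", "Really good fun.", "Not for me.", "Good cast."]
outputs = classify_in_batches(reviews, stub_classifier, batch_size=2)

batch_sentiments = [o["label"] for o in outputs]
batch_scores = [o["score"] for o in outputs]
print(batch_sentiments)
```

With the real model, the call would look like `classify_in_batches(df_test["review"].tolist(), model_3)`, and wrapping the `range(n_batches)` loop in `tqdm` would add a progress bar.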

In [None]:
df_test.head()

In [None]:
# Doing some basic math to calculate the accuracy rate compared to the human labels

correct_predictions = 0

for index, row in df_test.iterrows():
    if row['sentiment'] == row['model_sentiment'].lower():
        correct_predictions += 1

print(correct_predictions)


accuracy = (correct_predictions * 100) / len(df_test)
print(accuracy)
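
A single accuracy figure hides which kind of mistake the model makes. As a rough sketch, the counts of each (human label, model label) pair can be tallied with `collections.Counter`. The two short lists below are toy data, not results from the IMDB file, and `confusion_counts` is a hypothetical helper name.

```python
from collections import Counter

def confusion_counts(true_labels, predicted_labels):
    """Count each (human label, model label) pair, ignoring letter case."""
    return Counter(
        (true.lower(), pred.lower())
        for true, pred in zip(true_labels, predicted_labels)
    )

# Toy data for illustration only
human = ["positive", "negative", "negative", "positive"]
model = ["POSITIVE", "POSITIVE", "NEGATIVE", "POSITIVE"]

counts = confusion_counts(human, model)
toy_correct = sum(n for (true, pred), n in counts.items() if true == pred)
toy_accuracy = 100 * toy_correct / len(human)

for (true, pred), n in sorted(counts.items()):
    print(f"human={true:<8} model={pred:<8} count={n}")
print(toy_accuracy)
```

On the real data, the same function could be called as `confusion_counts(df_test["sentiment"], df_test["model_sentiment"])`.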


# Zero-Shot Machine Translation with Hugging Face Models




### Now we will switch from sentiment analysis to machine translation. Machine translation is a text-generation task rather than a classification task: the model generates new text in the target language that captures the meaning of the source text.



### The first model we will try is an OPUS-MT machine translation model from the Helsinki-NLP group. Here we will use their model for translating French to English, but their Hugging Face repository hosts more than a thousand models covering different language pairs.

In [None]:
# Load in a sample dataset from GitHub and save it as a dataframe

europarl = pd.read_csv('https://github.com/bestvater/misc/raw/master/europarl_en-fr_5000.csv')



In [None]:
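# Shorten the dataset to the first 100 rows to make it quicker to work with!
# .copy() gives us an independent dataframe, so adding columns later does not trigger a SettingWithCopyWarning

europarl = europarl.loc[0:99].copy()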

europarl.head()

In [None]:
# Load translation pipeline for French to English

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")

In [None]:
# Translate a French sentence
result = translator("Je suis ravi de vous rencontrer.")

print(result)


In [None]:
# Create an empty list to store machine translations, then apply the model to every French text in the dataset and save the machine translations to a new column
# This might take a minute!

english_translations = []

for index, row in europarl.iterrows():
    phrase = row["fr_text"]
    translation = translator(phrase)
    english_translations.append(translation[0]['translation_text'])

europarl['machine_translation'] = english_translations

europarl.head()

In [None]:
# Compare human translations to machine translations

print("Human translated English Text:")
print(europarl["en_text"][0])
print("Machine translated English Text:")
print(europarl["machine_translation"][0])

### Now we will try the same translation task on the same French texts with a different model. Here we will use M2M-100, the many-to-many machine translation model developed at Facebook. Unlike the OPUS-MT models, which have a distinct model for each language pair, M2M-100 is designed to translate between any pair of its 100 languages.

In [None]:
# Load the new model, specifying the languages, since this is a multilingual model!

translator2 = pipeline("translation", model="facebook/m2m100_418M", src_lang="fr", tgt_lang="en")

In [None]:
# Apply the new model to the French texts, and save the new translations to the dataframe
# Again, might take a minute!

english_translations_2 = []

for index, row in europarl.iterrows():
    phrase = row["fr_text"]
    translation = translator2(phrase)
    english_translations_2.append(translation[0]['translation_text'])

europarl['machine_translation_2'] = english_translations_2

europarl.head()

In [None]:
# Print the human translated English text, and compare it to our two machine translated texts

print("Human translated English Text:")
print(europarl["en_text"][1])
print("Machine translated English Text 1:")
print(europarl["machine_translation"][1])
print("Machine translated English Text 2:")
print(europarl["machine_translation_2"][1])
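
Reading translations side by side works for a handful of rows. For a rough numeric comparison, one option is a word-level similarity ratio from Python's built-in `difflib`. This is a crude proxy and not a standard translation metric such as BLEU; `word_similarity` is a hypothetical helper, and the sentences are invented for illustration.

```python
from difflib import SequenceMatcher

def word_similarity(reference, candidate):
    """Return a 0-1 similarity ratio between two sentences, compared word by word."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    return SequenceMatcher(None, ref_tokens, cand_tokens).ratio()

# Invented sentences standing in for a human translation and two machine translations
reference = "I am delighted to meet you."
candidate_1 = "I am delighted to meet you."
candidate_2 = "I am happy to meet you."

similarity_1 = word_similarity(reference, candidate_1)
similarity_2 = word_similarity(reference, candidate_2)
print(similarity_1, similarity_2)
```

On the dataframe, the same function could be applied row by row, for example `word_similarity(europarl["en_text"][1], europarl["machine_translation"][1])`. A low ratio does not necessarily mean a bad translation, since two correct translations can use different words.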

## Exercise:

### See if you can find another machine translation model on huggingface.co, and apply it to the French texts in our sample dataset. You can translate it to any language you like!