## Credit

Notes are taken from the NLPlanet Practical NLP with Python course, section 2.7: Evaluating a Sentiment Analysis Model
* https://www.nlplanet.org/course-practical-nlp/02-practical-nlp-first-tasks/07-evaluate-sentiment-analysis

Authored by Fabio Chiusano
* https://medium.com/@chiusanofabio94

**All passages in single quotes ('') are quoted from the NLPlanet course.**

## Evaluating a Model over IMDb 

<u>IMDb:</u>
* IMDb (Internet Movie Database) is an online database of information about movies, shows, actors, directors, and more.
* Its user-written reviews are the basis of the IMDb dataset used here for sentiment analysis, where each review is labeled 0 (negative) or 1 (positive).

In [None]:
# Install the transformers and datasets libraries
!pip install transformers datasets

In [1]:
# Imports
from datasets import load_dataset, load_metric
# load_dataset loads datasets from the Hugging Face Hub
# load_metric loads evaluation metrics for measuring the performance of NLP models
    # note: newer releases of datasets deprecate load_metric in favor of the separate evaluate library
from transformers import pipeline
# pipeline function allows you to create a pipeline for a specific task
import pandas as pd
# used for data manipulation

In [2]:
# Download IMDb Dataset

dataset = load_dataset("imdb", split="test")
# "imdb" = name of the dataset being loaded from the Hugging Face Hub via the 'datasets' library
# split parameter specifies which part of the dataset to load
    # 'test' is the held-out subset used to evaluate model performance
print(dataset)

Dataset({
    features: ['text', 'label'],
    num_rows: 25000
})


In [3]:
# Convert dataset to pandas DataFrame
df = pd.DataFrame(dataset)
df.head()

| | text | label |
|---|---|---|
| 0 | I love sci-fi and am willing to put up with a ... | 0 |
| 1 | Worth the entertainment value of a rental, esp... | 0 |
| 2 | its a totally average film with a few semi-alr... | 0 |
| 3 | STAR RATING: \*\*\*\*\* Saturday Night \*\*\*\* Friday ... | 0 |
| 4 | First off let me say, If you haven't enjoyed a... | 0 |


In [5]:
# Load pipeline and pre-trained sentiment model
model = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")

In [None]:
# Compute the sentiment of each review using the model.
all_texts = df["text"].values.tolist()
# df["text"] selects the "text" column
    # .values converts the selected column into a NumPy array
        # .tolist() converts the NumPy array into a Python list

all_sentiments = model(all_texts, truncation=True, max_length=512)
# Would take ~16.5 hrs for me to run
# model(...) runs sentiment prediction on every text in the all_texts list
# truncation=True cuts off texts that are longer than max_length tokens
# max_length sets the maximum number of tokens allowed per text

df["prediction"] = [0 if d["label"] == "NEGATIVE" else 1 for d in all_sentiments]
# adds a "prediction" column to the DataFrame (0 = negative, 1 = positive)
df.head()
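
Since the full run is slow, one option is to feed the texts to the model in chunks so progress is visible and partial results can be inspected. The following is a minimal sketch, not part of the course: `predict_in_chunks` and `fake_classify` are hypothetical names, and `fake_classify` is a stand-in for the real pipeline call so the example runs without downloading a model.

```python
def predict_in_chunks(texts, classify, chunk_size=1000):
    """Run `classify` over `texts` one chunk at a time and collect the results."""
    results = []
    for start in range(0, len(texts), chunk_size):
        chunk = texts[start:start + chunk_size]
        results.extend(classify(chunk))
        done = min(start + chunk_size, len(texts))
        print(f"processed {done}/{len(texts)} texts")
    return results


def fake_classify(chunk):
    """Hypothetical stand-in for the pipeline: returns dicts shaped like its output."""
    return [{"label": "POSITIVE" if "good" in t else "NEGATIVE"} for t in chunk]


# Made-up texts for illustration only
toy_texts = ["good film", "bad film", "good cast", "dull plot", "good score"]
toy_results = predict_in_chunks(toy_texts, fake_classify, chunk_size=2)

# Same label-to-integer mapping as the cell above
chunk_predictions = [0 if d["label"] == "NEGATIVE" else 1 for d in toy_results]
print(chunk_predictions)
```

With the real model, `classify` could be something like `lambda chunk: model(chunk, truncation=True, max_length=512)`, assuming `model` is the pipeline created earlier.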

In [None]:
# Compute Accuracy (SST-2 Model)

# load 'accuracy' metric from datasets library
metric = load_metric('accuracy')

# compute accuracy over test set
predictions = df["prediction"]
references = df["label"]
score = metric.compute(predictions=predictions, references=references)
# .compute() calculates the accuracy score
    # by comparing the predicted values to the reference (true) labels
print(score)  # {'accuracy': 0.89072}
# Accuracy is the fraction of correct predictions: 0.89072 means about 89.07% of the reviews were classified correctly
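
To make the accuracy number concrete, here is a minimal sketch of the same calculation done by hand on plain Python lists. The `accuracy` helper and the toy labels are made up for illustration; this is not how the metric object is implemented internally, only the same arithmetic.

```python
def accuracy(predictions, references):
    """Fraction of predictions that equal their reference label."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if len(references) == 0:
        raise ValueError("cannot compute accuracy over an empty list")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Toy labels (made up): 0 = negative, 1 = positive
toy_predictions = [0, 1, 1, 0, 1]
toy_references = [0, 1, 0, 0, 1]

toy_score = accuracy(toy_predictions, toy_references)
print(toy_score)  # 4 of the 5 predictions match, so 0.8
```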

In [None]:
# Compute Accuracy (Tweets Model)

# load pipeline and pre-trained sentiment model
model = pipeline("sentiment-analysis", model="cardiffnlp/twitter-roberta-base-sentiment-latest")

# Compute the sentiment of each review using the model

all_texts = df["text"].values.tolist()
all_sentiments = model(all_texts, truncation=True, max_length=512)
# Would take ~16.5 hrs for me to run
# This model outputs "negative", "neutral", or "positive"; the mapping below counts "neutral" as 1 (positive)
df["prediction"] = [0 if d["label"] == "negative" else 1 for d in all_sentiments]

# Compute accuracy over test set
metric = load_metric('accuracy')

predictions = df["prediction"]
references = df["label"]
score = metric.compute(predictions=predictions, references=references)

print(score)  # {'accuracy': 0.80772}
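
The tweets model has three output labels ("negative", "neutral", "positive"), while IMDb labels are binary, so the mapping used in this cell sends every non-negative prediction to 1. The sketch below uses made-up pipeline outputs to show how that mapping behaves and how one might count the neutral predictions it absorbs; `to_binary` is a hypothetical helper name.

```python
from collections import Counter


def to_binary(label):
    """Mirror the cell's mapping: 'negative' -> 0, anything else -> 1."""
    return 0 if label == "negative" else 1


# Made-up outputs, shaped like the dicts the pipeline returns
toy_outputs = [
    {"label": "negative", "score": 0.91},
    {"label": "neutral", "score": 0.55},
    {"label": "positive", "score": 0.88},
    {"label": "neutral", "score": 0.60},
]

label_counts = Counter(d["label"] for d in toy_outputs)
binary = [to_binary(d["label"]) for d in toy_outputs]

print(label_counts["neutral"])  # 2 neutral predictions in the toy data
print(binary)                   # [0, 1, 1, 1]
```

Checking how many predictions are neutral on the real data would indicate how much this mapping choice could be influencing the accuracy score.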

In [None]:
# Compute Accuracy (IMDb Model)

# load pipeline and pre-trained sentiment model
model = pipeline("sentiment-analysis", model="lvwerra/distilbert-imdb", device=0)

# Compute the sentiment of each review using the model

all_texts = df["text"].values.tolist()
all_sentiments = model(all_texts, truncation=True, max_length=512)
# Would take ~16.5 hrs for me to run
df["prediction"] = [0 if d["label"] == "NEGATIVE" else 1 for d in all_sentiments]

# Compute accuracy over test set
metric = load_metric('accuracy')

predictions = df["prediction"]
references = df["label"]
score = metric.compute(predictions=predictions, references=references)

print(score)  # {'accuracy': 0.928}

## Using Pre-Trained Models vs Fine-Tuning Your Own Model

<u>Question:</u>
* When computing the sentiment of movie reviews published on a personal site, should an off-the-shelf pre-trained model be used, or would it be better to fine-tune a model on your own movie reviews?

<u>Answer:</u>
* It depends on:
    * 'If you estimate that a ~4% improvement in accuracy on your data will bring more benefits than the costs of building your dataset and finetuning a model, then it is better to proceed with the finetuning.'
    * 'Otherwise, it is best to use a pre-trained model and have a good-enough solution right away, without the extra expenses.'
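
The "~4%" figure can be checked against the scores recorded in the cells above. This is a small arithmetic sketch: the dictionary keys are shorthand labels chosen here, and the scores and test-set size are the ones reported in these notes.

```python
# Accuracy scores as recorded in the notebook cells above
scores = {"sst2": 0.89072, "tweets": 0.80772, "imdb": 0.928}
n_reviews = 25000  # size of the IMDb test split

# Gap between the IMDb-finetuned model and the general SST-2 model
gap = scores["imdb"] - scores["sst2"]
extra_correct = round(gap * n_reviews)

print(f"accuracy gap: {gap * 100:.1f} percentage points")
print(f"extra correct predictions on the test set: {extra_correct}")
```

Whether that gap justifies building a dataset and fine-tuning depends on costs and benefits that are specific to each project, which is the trade-off the quoted answer describes.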