<div style="text-align: center; background-color: #f0f0f0; padding: 15px;">
    <h1 style="color: #333;">Final Project - Sentiment Analysis with Hugging Face Pipelines</h1>
    <h2 style="color: #666;">Advanced Topics in Artificial Intelligence and Machine Learning</h2>
    <h3 style="color: #999;">Jonathan Denoon, Matthew Persaud, Colin Smith</h3>
    <h4 style="color: #aaa;">August 12, 2024</h4>
</div>

<a id = '0'></a>
<h2>Table of Contents</h2>

* [Environment Setup](#1.0)
* [Importing the Pre-trained Model and Preparing the IMDb Dataset](#2.0)
* [Hyperparameters and Model Training](#3.0)
* [Model Classification](#4.0)
* [Model Evaluation](#5.0)



<a id='1.0'></a>
<h3>Environment Setup</h3>

<p>
This first section sets up the environment by installing and importing the libraries needed to build our sentiment analysis pipeline:</p>

In [1]:
# Import the necessary libraries
!pip install datasets
import pandas as pd
from transformers import pipeline, AutoModelForSequenceClassification, Trainer, TrainingArguments, AutoTokenizer
from datasets import load_dataset



<a id='2.0'></a>
<h3>Importing the Pre-trained Model and Preparing the IMDb Dataset</h3>

<p>
With the libraries imported, the next step is to load the IMDb dataset and the pre-trained DistilBERT model from "transformers". The dataset pairs movie reviews with sentiment labels and is used to fine-tune the model. Pre-processing consists of tokenizing the reviews and selecting 500-example subsets for training and testing:</p>

In [2]:
# Load the IMDb dataset and a pre-trained model from "transformers"
dataset = load_dataset("imdb")
model_name = "distilbert-base-uncased"
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Preparing the dataset (Tokenize)
def tokenize_function(examples):
    return tokenizer(examples['text'], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(500))  # 500-example training subset
small_test_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(500))  # 500-example test subset


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
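
<p>
The tokenization step above pads or truncates every review to a fixed length (512 tokens for this checkpoint). The following is a minimal, dependency-free sketch of that behaviour, not the tokenizer's actual implementation. The helper <code>pad_or_truncate</code> and the token ids are hypothetical and used only for illustration:</p>

```python
# Toy illustration of padding="max_length" combined with truncation=True.
# PAD_ID = 0 matches the [PAD] id in BERT-style vocabularies; the token ids
# below are placeholders, not real tokenizer output. The real tokenizer also
# keeps the closing [SEP] token when truncating, which this sketch does not.
PAD_ID = 0

def pad_or_truncate(token_ids, max_length):
    """Return (input_ids, attention_mask), each exactly max_length long."""
    ids = list(token_ids)[:max_length]   # truncation: drop the overflow
    mask = [1] * len(ids)                # 1 marks a real token
    padding = max_length - len(ids)      # 0 when the input was truncated
    return ids + [PAD_ID] * padding, mask + [0] * padding

short_ids, short_mask = pad_or_truncate([101, 2023, 3185, 102], max_length=8)
long_ids, long_mask = pad_or_truncate(range(1, 13), max_length=8)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

<p>
Padding every example to the maximum length keeps the code simple but spends computation on padding tokens. Dynamic per-batch padding (for example with <code>DataCollatorWithPadding</code> from "transformers") is a common alternative.</p>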



<a id='3.0'></a>
<h3>Hyperparameters and Model Training</h3>

<p>
With the data pre-processed and the environment prepared, the next step is to set the hyperparameters and training arguments. Three training epochs are used to improve accuracy while keeping computation time reasonable. The learning rate and batch sizes are likewise chosen to give accurate results while keeping training reproducible and efficient:</p>

In [3]:
# Hyperparameters
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
)


# Model training
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_test_dataset,
)
trainer.train()

# Save the fine-tuned model and tokenizer, then reload them as a pipeline
model.save_pretrained("./sentiment_model")
tokenizer.save_pretrained("./sentiment_model")
sentiment_pipeline = pipeline("sentiment-analysis", model="./sentiment_model", tokenizer="./sentiment_model")



Epoch   Training Loss   Validation Loss
1       No log          0.41849
2       No log          0.370437
3       No log          0.367569
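
<p>
The Training Loss column reads "No log" because the run never reaches the first logging step. A short calculation makes this concrete. It assumes a single device, no gradient accumulation, and the <code>TrainingArguments</code> default of <code>logging_steps=500</code>, none of which the training cell overrides:</p>

```python
import math

# Values taken from the training cell above.
train_examples = 500
batch_size = 8
epochs = 3

# Assumptions: one device, gradient_accumulation_steps=1, and the
# TrainingArguments default logging_steps=500.
logging_steps = 500

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)   # 63 189
print(total_steps >= logging_steps)   # False: training loss is never logged
```

<p>
Under these assumptions, passing a smaller value such as <code>logging_steps=10</code> to <code>TrainingArguments</code> should populate the column.</p>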


Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
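
<p>
The training run above reports only validation loss, so accuracy is not measured directly. One way to add it is a <code>compute_metrics</code> callback, which <code>Trainer</code> accepts as a keyword argument and calls with the evaluation logits and labels. The sketch below is an illustration rather than part of the original notebook, and the demo arrays are made up:</p>

```python
import numpy as np

def compute_metrics(eval_pred):
    """Accuracy from (logits, labels), as passed by Trainer at evaluation."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)   # pick the higher-scoring class
    return {"accuracy": float((predictions == labels).mean())}

# Synthetic check with made-up logits (not output from the trained model).
demo_logits = np.array([[2.0, -1.0], [0.1, 0.3], [1.5, 0.2], [-0.5, 0.9]])
demo_labels = np.array([0, 1, 1, 1])
demo_metrics = compute_metrics((demo_logits, demo_labels))
print(demo_metrics)
```

<p>
With <code>compute_metrics=compute_metrics</code> passed to <code>Trainer</code>, an accuracy value should appear next to the validation loss for each epoch.</p>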


<a id='4.0'></a>
<h3>Model Classification</h3>

<p>
Now that the model is trained, we load our collected reviews so they can be passed to the sentiment analysis pipeline: </p>

In [5]:
# Classification

# Load the CSV file containing reviewer names and reviews
df = pd.read_csv('spiderman_reviews.csv')

# Extract the reviews for sentiment analysis
texts = df['review'].tolist()
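
<p>
The cell above expects a CSV file with a <code>name</code> column and a <code>review</code> column. The file itself is not shown, so the sample below is a hypothetical reconstruction based on two of the reviews printed in the evaluation output. It uses the standard-library <code>csv</code> module to show that reviews containing commas must be quoted:</p>

```python
import csv
import io

# Hypothetical contents of spiderman_reviews.csv, reconstructed from the
# names and reviews printed in the evaluation output; the real file may differ.
sample_csv = '''name,review
Jacob Fischer,"I really enjoyed this movie, it was cool seeing a cartoon spiderman."
Derek Smith,"I hated the movie, spiderman should be the same actor and not a cartoon"
'''

rows = list(csv.DictReader(io.StringIO(sample_csv)))
sample_texts = [row["review"] for row in rows]

print(len(rows), list(rows[0].keys()))
print(sample_texts[1])
```

<p>
Without the quotes, each comma inside a review would be read as a column separator and the review would be split across extra fields.</p>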

<a id='5.0'></a>
<h3>Model Evaluation</h3>

<p>
With the reviews loaded, we run the trained model on the surveyed data and review its predictions. The expected results are a positive label for Jacob and Sam, a negative label for Derek, and a positive label for Colin. Colin's review is deliberately more neutral in tone, to test how well the pipeline handles a borderline case:</p>

In [6]:
# Perform sentiment analysis on the reviews
results = sentiment_pipeline(texts)

# Print the results with the names and sentiments and evaluate the model
for name, review, result in zip(df['name'], texts, results):
    print(f"Name: {name}\nReview: {review}\nSentiment: {result['label']} ({result['score']:.2f})\n")

Name: Jacob Fischer
Review: I really enjoyed this movie, it was cool seeing a cartoon spiderman.
Sentiment: LABEL_1 (0.97)

Name: Sam Weavers
Review: I think the movie was great, espicially how they used a black character for the first time ever.
Sentiment: LABEL_1 (0.98)

Name: Derek Smith
Review: I hated the movie, spiderman should be the same actor and not a cartoon 
Sentiment: LABEL_0 (0.94)

Name: Colin Smith
Review: I enjoyed watching the movie, it was not the best, but not the worst
Sentiment: LABEL_1 (0.87)
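
<p>
The pipeline prints generic <code>LABEL_0</code> and <code>LABEL_1</code> names because the model was loaded without label names. IMDb encodes 0 as negative and 1 as positive, so the output can be translated and checked against the expectations stated above. The sketch below copies the labels and scores from the printed results; the <code>LABEL_NAMES</code> mapping and the <code>expected</code> dictionary are illustrative helpers, not part of the original notebook:</p>

```python
# IMDb encodes 0 as negative and 1 as positive.
LABEL_NAMES = {"LABEL_0": "NEGATIVE", "LABEL_1": "POSITIVE"}

# Pipeline results copied from the output above.
printed_results = [
    {"name": "Jacob Fischer", "label": "LABEL_1", "score": 0.97},
    {"name": "Sam Weavers", "label": "LABEL_1", "score": 0.98},
    {"name": "Derek Smith", "label": "LABEL_0", "score": 0.94},
    {"name": "Colin Smith", "label": "LABEL_1", "score": 0.87},
]

# Expected sentiments as stated in this section.
expected = {
    "Jacob Fischer": "POSITIVE",
    "Sam Weavers": "POSITIVE",
    "Derek Smith": "NEGATIVE",
    "Colin Smith": "POSITIVE",
}

matches = 0
for item in printed_results:
    predicted = LABEL_NAMES[item["label"]]
    is_match = predicted == expected[item["name"]]
    matches += is_match
    print(f"{item['name']}: {predicted} ({item['score']:.2f}) match={is_match}")

print(f"{matches}/{len(printed_results)} predictions match the expectations")
```

<p>
All four predictions agree with the expected labels, and Colin's more neutral review receives the lowest confidence score (0.87). Passing <code>id2label</code> and <code>label2id</code> when loading the model should make the pipeline emit readable names directly.</p>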

