## Basic Settings

**Because the dataset is large, we used three machines to train on different proportions of it (1%, 10%, 50%, and 100%).**

Approximate training times on our machines:

* 1% of the dataset: about 12 minutes
* 10% of the dataset: about 2 hours
* 50% of the dataset: about 11 hours
* 100% of the dataset: about 18 hours
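
As a rough sanity check on the timings above, the sketch below converts each reported duration into minutes per 1% of data. The dictionary only restates the reported figures, and the helper name `minutes_per_percent` is hypothetical. Because the runs were spread across different machines, the ratios are indicative at best.

```python
# Reported wall-clock training times in minutes, keyed by dataset percentage.
reported_minutes = {1: 12, 10: 2 * 60, 50: 11 * 60, 100: 18 * 60}

def minutes_per_percent(times):
    """Return training minutes per 1% of data for each reported run."""
    return {pct: minutes / pct for pct, minutes in times.items()}

for pct, rate in minutes_per_percent(reported_minutes).items():
    print(f"{pct:>3}% of data: {rate:.1f} min per 1% of data")
```

The per-percent cost stays in the same range across runs, which is roughly what one would expect when the number of epochs is fixed and only the number of examples changes.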

### Hardware settings

2 Desktops with:
* CPU: 13th Gen Intel(R) Core(TM) i9-13900KF
* RAM: 32 GB
* GPU: NVIDIA GeForce RTX 4090 (VRAM 24 GB)

1 Server with:
* CPU: Intel Xeon w5-3435
* RAM: 128 GB
* GPU: 2 x NVIDIA RTX A5500 (VRAM 24 GB each)

The figures below show GPU memory allocation in bytes and as a percentage. Training uses about 24 GB (the full VRAM of one GPU) throughout the entire process.
<img src="./figures/gpu_byte.png" alt="drawing" width="800"/>
<img src="./figures/gpu_percentage.png" alt="drawing" width="800"/>

In [1]:
# Download libraries
! pip install torch wandb transformers -q
! pip install -U datasets -q
! pip install accelerate -U -q
print("Done downloading libraries")

Done downloading libraries


In [2]:
import torch
import warnings
import pickle
import numpy as np
import tqdm

from datasets import load_dataset
from transformers import (
    RobertaForSequenceClassification, 
    Trainer, 
    TrainingArguments,
    RobertaTokenizer,
    pipeline
)

In [3]:
# Environment setting
warnings.filterwarnings("ignore")
torch.cuda.empty_cache()
# getting device for training
device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps"
    if torch.backends.mps.is_available()
    else "cpu"
)
print(f"Using {device} device")

Using cuda device


In [4]:
# Paths
dataset_name = "yelp_review_full"
dataset_save_path = "data/datasets"
pkl_save_path = "./data/pkl"
pretrained_model_path = "./pretrained_models/model"
pretrained_tokenizer_path = "./pretrained_models/tokenizer"
saved_model_path = "./yelp_sentiment_model"
finetuned_model = "./finetuned_model"
output_dir = "./outputs"
model_id = "roberta-base"

## Process data

In [5]:
### 1. Load the dataset
dataset = load_dataset(
    dataset_name,
    cache_dir=dataset_save_path
)

In [6]:
### 2. Load the tokenizer
# tokenizer
tokenizer = RobertaTokenizer.from_pretrained(
    model_id,
    cache_dir=pretrained_tokenizer_path,
    trust_remote_code=True,
)

In [7]:
### 3. Preprocess the dataset
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

In [8]:
### 4. Save the tokenized datasets
with open(pkl_save_path, 'wb') as file:
    pickle.dump(tokenized_datasets, file)

## Fine-tune RoBERTa

In [9]:
### 1. Load the model
# model
model = RobertaForSequenceClassification.from_pretrained(
    model_id, 
    num_labels=5,
    cache_dir=pretrained_model_path, 
    trust_remote_code=True
)

The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [10]:
### 2. Load datasets
with open(pkl_save_path, 'rb') as file:
    tokenized_datasets = pickle.load(file)

# Slice the dataset to use only 1% of it
def slice_dataset(dataset):
    indices = np.random.permutation(len(dataset))
    subset_size = len(dataset) // 100  # 1% of the dataset
    subset_indices = indices[:subset_size]
    return dataset.select(subset_indices)

# Apply slicing to both train and test datasets
sliced_datasets = {split: slice_dataset(tokenized_datasets[split]) for split in tokenized_datasets.keys()}
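
The slicing above is unseeded and fixed at 1%. Below is a minimal sketch of one way to make the subset reproducible and the proportion configurable. `subset_indices`, `percent`, and `seed` are hypothetical names, and the standard-library `random` module stands in for `np.random.permutation`. The returned indices could be passed to `dataset.select(...)` as in the cell above.

```python
import random

def subset_indices(n_examples, percent, seed=42):
    """Return a reproducible random sample of row indices.

    n_examples: total number of rows in the split
    percent:    share of rows to keep, as an integer from 1 to 100
    seed:       fixed seed so repeated runs select the same rows
    """
    if not 1 <= percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    subset_size = n_examples * percent // 100
    rng = random.Random(seed)
    return rng.sample(range(n_examples), subset_size)

# Example with a stand-in split of 50,000 rows (size chosen for illustration).
indices = subset_indices(50_000, 1)
print(len(indices))  # 500
```

Fixing the seed means the training and evaluation subsets stay the same between runs, which makes results from different sessions easier to compare.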

In [11]:
### 3. Training preparation
# training arguments
training_args = TrainingArguments(
    output_dir=output_dir,
    auto_find_batch_size=True,    # Avoids CUDA out-of-memory errors
    evaluation_strategy="epoch",  # Evaluate at the end of each epoch
    save_strategy="epoch",        # Save at the end of each epoch
    learning_rate=1e-05,
    per_device_train_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    lr_scheduler_type="linear",
    report_to="none"
)


# trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=sliced_datasets["train"],
    eval_dataset=sliced_datasets["test"],
    compute_metrics=lambda eval_pred: {
        "accuracy": ((eval_pred[0].argmax(-1) == eval_pred[1]).mean()).item()
    }
)

In [12]:
### 4. Train and save the model
trainer.train()

model.save_pretrained(saved_model_path)
tokenizer.save_pretrained(saved_model_path)

| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | No log | 0.902748 | 0.616 |
| 2 | 1.001300 | 0.833609 | 0.638 |
| 3 | 0.719800 | 0.836921 | 0.642 |
| 4 | 0.603200 | 0.895575 | 0.652 |
| 5 | 0.519000 | 0.88943 | 0.64 |


('./yelp_sentiment_model/tokenizer_config.json',
 './yelp_sentiment_model/special_tokens_map.json',
 './yelp_sentiment_model/vocab.json',
 './yelp_sentiment_model/merges.txt',
 './yelp_sentiment_model/added_tokens.json')
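
As we understand the `Trainer` options used above, `load_best_model_at_end=True` together with `metric_for_best_model="accuracy"` means the checkpoint with the highest evaluation accuracy is reloaded when training finishes, not the one with the lowest validation loss. The sketch below restates the training log as plain Python to show which epoch each criterion would pick. `eval_log` and `best_epoch` are hypothetical names used only for illustration.

```python
# Evaluation results per epoch, copied from the training log above.
eval_log = [
    {"epoch": 1, "eval_loss": 0.902748, "accuracy": 0.616},
    {"epoch": 2, "eval_loss": 0.833609, "accuracy": 0.638},
    {"epoch": 3, "eval_loss": 0.836921, "accuracy": 0.642},
    {"epoch": 4, "eval_loss": 0.895575, "accuracy": 0.652},
    {"epoch": 5, "eval_loss": 0.88943, "accuracy": 0.640},
]

def best_epoch(log, metric, greater_is_better):
    """Return the epoch whose `metric` is best under the given direction."""
    pick = max if greater_is_better else min
    return pick(log, key=lambda row: row[metric])["epoch"]

print(best_epoch(eval_log, "accuracy", greater_is_better=True))    # 4
print(best_epoch(eval_log, "eval_loss", greater_is_better=False))  # 2
```

Validation loss rises after epoch 2 while training loss keeps falling, which suggests the model starts to overfit the 1% subset even though accuracy still improves slightly until epoch 4.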

## Inference

If you would rather not run the training process, you can run inference with our fine-tuned models from the Hugging Face Hub.

In [13]:
# baseline model
total_length = len(sliced_datasets["test"])

# Load baseline model (RoBERTa without fine-tuning)
base_classifier = pipeline("sentiment-analysis", model_id, max_length=512)
base_TP = base_FP = base_FN = 0

# Calculate metrics on test dataset
for item in sliced_datasets["test"]:
    truth = int(item['label'])
    
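    # Prediction of the baseline model; the last character of the label is the class index
    base_result = int(base_classifier(item['text'])[0]['label'][-1])
    if base_result == truth:
        base_TP += 1  # correct prediction, whatever the class
    elif base_result == 1:
        base_FP += 1  # wrong prediction of class 1
    else:
        base_FN += 1  # wrong prediction of any other class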

# Calculate accuracy, precision, recall, and F1-score
def calculate_metrics(TP, FP, FN):
    accuracy = TP / total_length
    precision = TP / (TP + FP) if TP + FP > 0 else 0
    recall = TP / (TP + FN) if TP + FN > 0 else 0
    f1_score = 2 * (precision * recall) / (precision + recall) if precision + recall > 0 else 0
    return accuracy, precision, recall, f1_score

base_metrics = calculate_metrics(base_TP, base_FP, base_FN)

print("Base Model Metrics:\nAccuracy: {:.2%}, Precision: {:.2%}, Recall: {:.2%}, F1-score: {:.2%}".format(*base_metrics))

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


Base Model Metrics:
Accuracy: 17.80%, Precision: 21.71%, Recall: 49.72%, F1-score: 30.22%
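
The counters in the cell above collapse a five-class problem into a single TP/FP/FN triple, so the precision and recall they produce are not the usual multi-class definitions. For comparison, here is a minimal sketch of macro-averaged precision, recall, and F1, computed per class and then averaged. `macro_metrics` is a hypothetical helper, and the label lists are made-up toy data, not results from this notebook.

```python
def macro_metrics(y_true, y_pred, num_labels=5):
    """Compute accuracy and macro-averaged precision, recall, and F1."""
    precisions, recalls, f1s = [], [], []
    for label in range(num_labels):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return (
        accuracy,
        sum(precisions) / num_labels,
        sum(recalls) / num_labels,
        sum(f1s) / num_labels,
    )

# Toy labels for illustration only (not taken from the Yelp test set).
y_true = [0, 1, 2, 3, 4, 4]
y_pred = [0, 1, 2, 3, 4, 0]
accuracy, precision, recall, f1 = macro_metrics(y_true, y_pred)
print("Accuracy: {:.2%}, Precision: {:.2%}, Recall: {:.2%}, F1-score: {:.2%}".format(
    accuracy, precision, recall, f1))
```

Macro averaging weights every star rating equally, which suits a roughly balanced five-class dataset. Other averaging schemes (micro or weighted) would be equally valid choices.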


In [14]:
# fine-tuned model on 1% dataset
total_length = len(sliced_datasets["test"])

# Choose the fine-tuned model you would like to run
repository_id = "HanzhiZhang/CSCE5218_01percent"

# Load fine-tuned model
finetuned_classifier = pipeline("sentiment-analysis", repository_id, max_length=512)
finetuned_TP = finetuned_FP = finetuned_FN = 0

# Calculate metrics on test dataset
for item in sliced_datasets["test"]:
    truth = int(item['label'])
    
    # Prediction of fine-tuned model
    finetuned_result = int(finetuned_classifier(item['text'])[0]['label'][-1])
    if finetuned_result == truth:
        finetuned_TP += 1
    elif finetuned_result == 1:
        finetuned_FP += 1
    else:
        finetuned_FN += 1

finetuned_metrics = calculate_metrics(finetuned_TP, finetuned_FP, finetuned_FN)

print("Fine-tuned Model (1% dataset) Metrics:\nAccuracy: {:.2%}, Precision: {:.2%}, Recall: {:.2%}, F1-score: {:.2%}".format(*finetuned_metrics))

Fine-tuned Model (1% dataset) Metrics:
Accuracy: 65.00%, Precision: 90.03%, Recall: 70.04%, F1-score: 78.79%


In [15]:
# fine-tuned model on 10% dataset
total_length = len(sliced_datasets["test"])

# Choose the fine-tuned model you would like to run
repository_id = "HanzhiZhang/CSCE5218_10percent"

# Load fine-tuned model
finetuned_classifier = pipeline("sentiment-analysis", repository_id, max_length=512)
finetuned_TP = finetuned_FP = finetuned_FN = 0

# Calculate metrics on test dataset
for item in sliced_datasets["test"]:
    truth = int(item['label'])
    
    # Prediction of fine-tuned model
    finetuned_result = int(finetuned_classifier(item['text'])[0]['label'][-1])
    if finetuned_result == truth:
        finetuned_TP += 1
    elif finetuned_result == 1:
        finetuned_FP += 1
    else:
        finetuned_FN += 1

finetuned_metrics = calculate_metrics(finetuned_TP, finetuned_FP, finetuned_FN)

print("Fine-tuned Model (10% dataset) Metrics:\nAccuracy: {:.2%}, Precision: {:.2%}, Recall: {:.2%}, F1-score: {:.2%}".format(*finetuned_metrics))

Fine-tuned Model (10% dataset) Metrics:
Accuracy: 71.20%, Precision: 90.82%, Recall: 76.72%, F1-score: 83.18%


In [16]:
# fine-tuned model on 50% dataset
total_length = len(sliced_datasets["test"])

# Choose the fine-tuned model you would like to run
repository_id = "HanzhiZhang/CSCE5218_50percent"

# Load fine-tuned model
finetuned_classifier = pipeline("sentiment-analysis", repository_id, max_length=512)
finetuned_TP = finetuned_FP = finetuned_FN = 0

# Calculate metrics on test dataset
for item in sliced_datasets["test"]:
    truth = int(item['label'])
    
    # Prediction of fine-tuned model
    finetuned_result = int(finetuned_classifier(item['text'])[0]['label'][-1])
    if finetuned_result == truth:
        finetuned_TP += 1
    elif finetuned_result == 1:
        finetuned_FP += 1
    else:
        finetuned_FN += 1

finetuned_metrics = calculate_metrics(finetuned_TP, finetuned_FP, finetuned_FN)

print("Fine-tuned Model (50% dataset) Metrics:\nAccuracy: {:.2%}, Precision: {:.2%}, Recall: {:.2%}, F1-score: {:.2%}".format(*finetuned_metrics))

Fine-tuned Model (50% dataset) Metrics:
Accuracy: 73.00%, Precision: 92.17%, Recall: 77.83%, F1-score: 84.39%


In [17]:
# fine-tuned model on 100% dataset
total_length = len(sliced_datasets["test"])

# Choose the fine-tuned model you would like to run
repository_id = "HanzhiZhang/CSCE5218_100percent"

# Load fine-tuned model
finetuned_classifier = pipeline("sentiment-analysis", repository_id, max_length=512)
finetuned_TP = finetuned_FP = finetuned_FN = 0

# Calculate metrics on test dataset
for item in sliced_datasets["test"]:
    truth = int(item['label'])
    
    # Prediction of fine-tuned model
    finetuned_result = int(finetuned_classifier(item['text'])[0]['label'][-1])
    if finetuned_result == truth:
        finetuned_TP += 1
    elif finetuned_result == 1:
        finetuned_FP += 1
    else:
        finetuned_FN += 1

finetuned_metrics = calculate_metrics(finetuned_TP, finetuned_FP, finetuned_FN)

print("Fine-tuned Model (100% dataset) Metrics:\nAccuracy: {:.2%}, Precision: {:.2%}, Recall: {:.2%}, F1-score: {:.2%}".format(*finetuned_metrics))

Fine-tuned Model (100% dataset) Metrics:
Accuracy: 75.40%, Precision: 91.73%, Recall: 80.90%, F1-score: 85.97%
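
The four fine-tuned evaluation cells differ only in `repository_id`. One possible refactor, sketched below, wraps the loop in a helper that accepts any classifier callable and returns the accuracy. `evaluate_accuracy`, `stub_classifier`, and `toy_rows` are hypothetical names, and the `LABEL_<k>` output format is an assumption based on how the cells above parse the label. A stand-in classifier is used so the sketch runs without downloading a model.

```python
REPOSITORY_IDS = [
    "HanzhiZhang/CSCE5218_01percent",
    "HanzhiZhang/CSCE5218_10percent",
    "HanzhiZhang/CSCE5218_50percent",
    "HanzhiZhang/CSCE5218_100percent",
]

def evaluate_accuracy(classifier, examples):
    """Return the share of examples whose predicted label matches the truth.

    classifier: callable mapping a text to [{"label": "LABEL_<k>", ...}]
                (assumed output shape, as relied on by the cells above)
    examples:   iterable of {"text": str, "label": int} rows
    """
    correct = total = 0
    for item in examples:
        label = classifier(item["text"])[0]["label"]
        predicted = int(label.rsplit("_", 1)[-1])
        correct += predicted == int(item["label"])
        total += 1
    return correct / total if total else 0.0

# Stand-in classifier and rows, for illustration only.
def stub_classifier(text):
    return [{"label": "LABEL_4" if "great" in text else "LABEL_0", "score": 1.0}]

toy_rows = [
    {"text": "great food", "label": 4},
    {"text": "terrible service", "label": 0},
    {"text": "it was fine", "label": 2},
]
print(f"{evaluate_accuracy(stub_classifier, toy_rows):.2%}")

# With the real models, the loop would look roughly like:
# for repository_id in REPOSITORY_IDS:
#     classifier = pipeline("sentiment-analysis", repository_id, max_length=512)
#     print(repository_id, evaluate_accuracy(classifier, sliced_datasets["test"]))
```

Keeping the repository IDs in one list makes it harder for the four evaluations to drift apart when the loop body changes.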
