#### Section 2.2.1 Tutorial on Hugging Face Transformers Trainer for BERT (25')

Please run through the following tutorial. 
 https://huggingface.co/docs/transformers/en/tasks/sequence_classification 

When you run the notebook from the tutorial above, you will be prompted to log in to two accounts:

1. Create a Hugging Face account and an [access token](https://huggingface.co/docs/hub/en/security-tokens) to log in from the notebook.
2. Create a wandb (Weights & Biases) account for experiment logging. wandb is automatically integrated into the Hugging Face Trainer to track your experiments. See the details here: https://docs.wandb.ai/guides/integrations/huggingface/
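
The wandb integration can also be configured through environment variables set before the Trainer is created. The following is a minimal sketch, not a required step: the project name `sst5-bert` is a hypothetical placeholder, and `WANDB_PROJECT` and `WANDB_LOG_MODEL` are the variables described in the wandb integration guide linked above.

```python
import os

# Hypothetical project name (an assumption for illustration); use any name you like.
os.environ["WANDB_PROJECT"] = "sst5-bert"
# "false" means model checkpoints are not uploaded to wandb; metrics are still logged.
os.environ["WANDB_LOG_MODEL"] = "false"

# Show the wandb-related settings that are now visible to this process.
wandb_settings = {k: v for k, v in os.environ.items() if k.startswith("WANDB_")}
print(wandb_settings)
```

Setting these in the first notebook cell keeps all runs for this assignment grouped under one project.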


The above tutorial takes about 40 minutes to run on an older [Nvidia Tesla T4 GPU](https://colab.research.google.com/github/d2l-ai/d2l-tvm-colab/blob/master/chapter_gpu_schedules/arch.ipynb) with the free version of Colab, and it is sometimes slower. Try OSCER first, then a paid Colab GPU. Through this tutorial, you will learn how to use the Trainer, evaluator, pipeline, and accelerator in the Hugging Face libraries.

Your Task:
* Replace the dataset with SST-5 and the model/tokenizer/config with "bert-base-cased", get familiar with this framework for your sentiment classifier, and report the final performance.

(HINT: you mostly only need to change the arguments to AutoTokenizer and AutoModelXX. The learning rate for fine-tuning is usually small (around 10^-3 to 10^-5), and the number of epochs is typically around 3 to 10.)

What to Report: 

* Your training hyperparameters.
* Performance Metrics

In [1]:
# Transformers installation
! pip install transformers datasets evaluate accelerate
# To install from source instead of the last release, comment the command above and uncomment the following one.
# ! pip install git+https://github.com/huggingface/transformers.git

Collecting evaluate
  Downloading evaluate-0.4.6-py3-none-any.whl.metadata (9.5 kB)
Downloading evaluate-0.4.6-py3-none-any.whl (84 kB)
Installing collected packages: evaluate
Successfully installed evaluate-0.4.6



## Text Classification with SST-5 Dataset and BERT-base-cased

This tutorial demonstrates how to fine-tune BERT-base-cased on the SST-5 (Stanford Sentiment Treebank) dataset for 5-class sentiment classification.

In [2]:
from huggingface_hub import notebook_login

notebook_login()


## Load SST-5 dataset

Start by loading the SST-5 dataset from the 🤗 Datasets library. This dataset contains movie reviews with 5 sentiment classes:

In [3]:
from datasets import load_dataset

# Load SST-5 dataset (5-class sentiment classification)
sst5 = load_dataset("SetFit/sst5")
print("Dataset structure:", sst5)
print("Number of training examples:", len(sst5["train"]))
print("Number of test examples:", len(sst5["test"]))

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Dataset structure: DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'label_text'],
        num_rows: 8544
    })
    validation: Dataset({
        features: ['text', 'label', 'label_text'],
        num_rows: 1101
    })
    test: Dataset({
        features: ['text', 'label', 'label_text'],
        num_rows: 2210
    })
})
Number of training examples: 8544
Number of test examples: 2210


Then take a look at an example:

In [4]:
print("Example from test set:")
print(sst5["test"][0])
print("\nExample from train set:")
print(sst5["train"][0])

# Show the label distribution
from collections import Counter
train_labels = [example["label"] for example in sst5["train"]]
test_labels = [example["label"] for example in sst5["test"]]

print("\nTraining label distribution:")
print(Counter(train_labels))
print("\nTest label distribution:")
print(Counter(test_labels))

Example from test set:
{'text': 'no movement , no yuks , not much of anything .', 'label': 1, 'label_text': 'negative'}

Example from train set:
{'text': 'a stirring , funny and finally transporting re-imagining of beauty and the beast and 1930s horror films', 'label': 4, 'label_text': 'very positive'}

Training label distribution:
Counter({3: 2322, 1: 2218, 2: 1624, 4: 1288, 0: 1092})

Test label distribution:
Counter({1: 633, 3: 510, 4: 399, 2: 389, 0: 279})


There are several fields in this dataset:

- `text`: the movie review text.
- `label`: a value from 0-4 representing sentiment intensity:
  - 0: very negative
  - 1: negative  
  - 2: neutral
  - 3: positive
  - 4: very positive
- `label_text`: human-readable label names
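
The label distribution printed above is imbalanced, so it helps to know what a trivial classifier would score. The sketch below hard-codes the counts from that output and computes a majority-class baseline (always predicting the most frequent training label). It is only an illustration for interpreting later accuracy numbers.

```python
from collections import Counter

# Label counts copied from the notebook output above.
train_counts = Counter({3: 2322, 1: 2218, 2: 1624, 4: 1288, 0: 1092})
test_counts = Counter({1: 633, 3: 510, 4: 399, 2: 389, 0: 279})

n_train = sum(train_counts.values())
n_test = sum(test_counts.values())

# Share of each class in the training split.
for label in sorted(train_counts):
    print(f"class {label}: {train_counts[label] / n_train:.3f}")

# Majority-class baseline: always predict the most frequent training label.
majority_label, _ = train_counts.most_common(1)[0]
baseline_acc = test_counts[majority_label] / n_test
print(f"majority label = {majority_label}, baseline test accuracy = {baseline_acc:.4f}")
```

Any fine-tuned model should comfortably exceed this baseline (about 0.23 on the test split).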

## Preprocess

The next step is to load a BERT-base-cased tokenizer to preprocess the `text` field:

In [5]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
print("Tokenizer loaded:", tokenizer.name_or_path)
print("Max length:", tokenizer.model_max_length)

Tokenizer loaded: bert-base-cased
Max length: 512


Create a preprocessing function to tokenize `text` and truncate sequences to be no longer than BERT's maximum input length:

In [6]:
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

In [7]:
tokenized_sst5 = sst5.map(preprocess_function, batched=True)
print("Tokenization completed!")
print("Sample tokenized example:")
print(tokenized_sst5["train"][0])

Tokenization completed!
Sample tokenized example:
{'text': 'a stirring , funny and finally transporting re-imagining of beauty and the beast and 1930s horror films', 'label': 4, 'label_text': 'very positive', 'input_ids': [101, 170, 20329, 117, 6276, 1105, 1921, 19920, 1231, 118, 18632, 1104, 5295, 1105, 1103, 8839, 1105, 4970, 5367, 2441, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
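
To make the tokenized sample easier to read, the sketch below re-uses the `input_ids` from the output above. In the bert-base-cased vocabulary, ids 101 and 102 are the `[CLS]` and `[SEP]` special tokens that the tokenizer adds around every sequence; the remaining ids are WordPiece tokens, which can outnumber whitespace-separated words because a word such as `re-imagining` is split into several pieces.

```python
# Values copied from the sample tokenized example above.
text = ("a stirring , funny and finally transporting re-imagining of beauty "
        "and the beast and 1930s horror films")
input_ids = [101, 170, 20329, 117, 6276, 1105, 1921, 19920, 1231, 118, 18632,
             1104, 5295, 1105, 1103, 8839, 1105, 4970, 5367, 2441, 102]

CLS_ID, SEP_ID = 101, 102  # [CLS] and [SEP] in the bert-base-cased vocabulary

n_words = len(text.split())          # whitespace-separated words
n_wordpieces = len(input_ids) - 2    # everything except [CLS] and [SEP]

print("starts with [CLS]:", input_ids[0] == CLS_ID)
print("ends with [SEP]:", input_ids[-1] == SEP_ID)
print("words:", n_words, "wordpieces:", n_wordpieces, "total ids:", len(input_ids))
```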


Now create a batch of examples using [DataCollatorWithPadding](https://huggingface.co/docs/transformers/main/en/main_classes/data_collator#transformers.DataCollatorWithPadding). It's more efficient to *dynamically pad* the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.

In [8]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

## Evaluate

In [9]:
import evaluate

accuracy = evaluate.load("accuracy")

Then create a function that passes your predictions and labels to [compute](https://huggingface.co/docs/evaluate/main/en/package_reference/main_classes#evaluate.EvaluationModule.compute) to calculate the accuracy:

In [10]:
import numpy as np


def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)

Your `compute_metrics` function is ready to go now, and you'll return to it when you set up your training.
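
As a quick sanity check of the logic inside `compute_metrics`, the sketch below applies the same argmax-then-compare steps to made-up logits using plain NumPy. It does not call the `evaluate` library, and the logits and labels are invented for illustration.

```python
import numpy as np

# Made-up logits for 3 examples over 5 classes.
logits = np.array([
    [0.1, 2.0, 0.3, 0.0, -1.0],  # highest score at class 1
    [0.0, 0.1, 0.2, 3.0, 0.5],   # highest score at class 3
    [1.5, 0.2, 0.1, 0.0, 0.3],   # highest score at class 0
])
labels = np.array([1, 4, 0])

# Same step as in compute_metrics: pick the highest-scoring class per row.
predictions = np.argmax(logits, axis=1)

# Accuracy is the fraction of predictions that match the labels.
toy_accuracy = float((predictions == labels).mean())
print(predictions.tolist(), round(toy_accuracy, 4))
```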

## Train

Before you start training your model, create a map of the expected ids to their labels with `id2label` and `label2id`:

In [11]:
# Define label mappings for SST-5 (5 classes)
id2label = {0: "very negative", 1: "negative", 2: "neutral", 3: "positive", 4: "very positive"}
label2id = {"very negative": 0, "negative": 1, "neutral": 2, "positive": 3, "very positive": 4}

print("Label mappings:")
print("ID to Label:", id2label)
print("Label to ID:", label2id)

Label mappings:
ID to Label: {0: 'very negative', 1: 'negative', 2: 'neutral', 3: 'positive', 4: 'very positive'}
Label to ID: {'very negative': 0, 'negative': 1, 'neutral': 2, 'positive': 3, 'very positive': 4}
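
The two mappings must be exact inverses of each other. One way to avoid typos is to write `id2label` once and derive `label2id` from it, as in this small sketch:

```python
id2label = {0: "very negative", 1: "negative", 2: "neutral", 3: "positive", 4: "very positive"}

# Derive the inverse mapping instead of typing it by hand.
label2id = {label: idx for idx, label in id2label.items()}

# Round-trip check: every id maps to a label and back to the same id.
round_trip_ok = all(label2id[id2label[i]] == i for i in id2label)
print(label2id, round_trip_ok)
```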


In [12]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

# Load BERT-base-cased model for 5-class classification
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-cased",
    num_labels=5,  # 5 classes for SST-5
    id2label=id2label,
    label2id=label2id
)

print(f"Model loaded: {model.config.name_or_path}")
print(f"Number of labels: {model.config.num_labels}")
print(f"Model parameters: {model.num_parameters():,}")

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model loaded: bert-base-cased
Number of labels: 5
Model parameters: 108,314,117
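
The parameter count printed above can be reproduced by hand. The sketch below assumes the standard bert-base-cased configuration values (vocabulary size 28996, hidden size 768, 12 layers, feed-forward size 3072, 512 positions, 2 token types); these are not printed in this notebook, so treat them as assumptions. Only the 5-way classifier head at the end is newly initialized.

```python
# Assumed bert-base-cased configuration values (not shown in the notebook output).
vocab, hidden, layers, ffn, max_pos, type_vocab = 28996, 768, 12, 3072, 512, 2
num_labels = 5  # SST-5

layer_norm = 2 * hidden  # LayerNorm weight + bias

# Word, position, and token-type embeddings plus one LayerNorm.
embeddings = (vocab + max_pos + type_vocab) * hidden + layer_norm

# Per encoder layer: Q, K, V, and output projections (weights + biases) plus LayerNorm.
attention = 4 * (hidden * hidden + hidden) + layer_norm
# Per encoder layer: two feed-forward linear layers plus LayerNorm.
feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden) + layer_norm
encoder = layers * (attention + feed_forward)

pooler = hidden * hidden + hidden              # dense layer applied to [CLS]
classifier = hidden * num_labels + num_labels  # newly initialized head

total = embeddings + encoder + pooler + classifier
print(f"classifier head: {classifier:,} parameters")
print(f"total: {total:,} parameters")
```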


## Define hyperparameters

In [13]:
# Training hyperparameters (common starting values for BERT fine-tuning)
training_args = TrainingArguments(
    output_dir="bert-sst5-sentiment-classifier",
    learning_rate=2e-5,  # A commonly used learning rate for BERT fine-tuning
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,  # 3 epochs is usually sufficient for BERT
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    greater_is_better=True,
    push_to_hub=True,
    logging_dir="./logs",
    logging_steps=100,
    warmup_steps=500,  # Warmup for better convergence
    save_total_limit=2,  # Keep at most 2 checkpoints on disk
    report_to="wandb",  # Enable wandb logging
)

print("Training Arguments:")
print(f"Learning Rate: {training_args.learning_rate}")
print(f"Batch Size: {training_args.per_device_train_batch_size}")
print(f"Epochs: {training_args.num_train_epochs}")
print(f"Weight Decay: {training_args.weight_decay}")
print(f"Warmup Steps: {training_args.warmup_steps}")

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_sst5["train"],
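    # NOTE: this run evaluates on the test split every epoch; the "validation"
    # split is the more standard choice for selecting the best checkpoint.
    eval_dataset=tokenized_sst5["test"],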
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

print("Trainer initialized successfully!")
print("Starting training...")

Training Arguments:
Learning Rate: 2e-05
Batch Size: 16
Epochs: 3
Weight Decay: 0.01
Warmup Steps: 500
Trainer initialized successfully!
Starting training...
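
It is worth checking how the warmup setting relates to the length of the run. The arithmetic below uses the values printed above and assumes a single device with no gradient accumulation (neither is stated in the notebook).

```python
import math

# Values from the notebook output above.
n_train, batch_size, epochs, warmup_steps = 8544, 16, 3, 500

# Assumes one device and no gradient accumulation.
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs
warmup_fraction = warmup_steps / total_steps

print(f"steps per epoch: {steps_per_epoch}")
print(f"total optimizer steps: {total_steps}")
print(f"warmup covers {warmup_fraction:.1%} of training")
```

Under these assumptions, the learning rate is still ramping up for almost the entire first epoch, so a smaller `warmup_steps` value may be worth trying when tuning.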


In [14]:
# Start training
trainer.train()

print("Training completed!")


| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | 1.1784        | 1.125291        | 0.512217 |
| 2     | 1.0273        | 1.108752        | 0.505882 |
| 3     | 0.7001        | 1.198388        | 0.523529 |


Training completed!


In [15]:
# Evaluate the model and get comprehensive metrics
print("=" * 50)
print("FINAL EVALUATION RESULTS")
print("=" * 50)

# Get predictions on test set
predictions = trainer.predict(tokenized_sst5["test"])
y_pred = predictions.predictions.argmax(-1)
y_true = predictions.label_ids

# Import additional metrics
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Calculate accuracy (named test_accuracy so it does not shadow the
# `accuracy` metric object used by compute_metrics)
test_accuracy = accuracy_score(y_true, y_pred)
print(f"Final Test Accuracy: {test_accuracy:.4f}")

# Detailed classification report
print("\nDetailed Classification Report:")
report = classification_report(
    y_true,
    y_pred,
    target_names=list(id2label.values()),
    digits=4
)
print(report)

# Confusion Matrix
print("\nConfusion Matrix:")
cm = confusion_matrix(y_true, y_pred)
print("Labels: [very negative, negative, neutral, positive, very positive]")
print(cm)

# Pretty print confusion matrix
print("\nConfusion Matrix (with labels):")
print(f"{'':>15}", end="")
labels = list(id2label.values())
for label in labels:
    print(f"{label:>15}", end="")
print()

for i, true_label in enumerate(labels):
    print(f"{true_label:>15}", end="")
    for j in range(len(labels)):
        print(f"{cm[i][j]:>15}", end="")
    print()

print("\n" + "=" * 50)
print("HYPERPARAMETERS SUMMARY")
print("=" * 50)
print("Model: bert-base-cased")
print("Dataset: SST-5 (5-class sentiment)")
print(f"Learning Rate: {training_args.learning_rate}")
print(f"Batch Size: {training_args.per_device_train_batch_size}")
print(f"Epochs: {training_args.num_train_epochs}")
print(f"Weight Decay: {training_args.weight_decay}")
print(f"Warmup Steps: {training_args.warmup_steps}")
print(f"Training Examples: {len(tokenized_sst5['train'])}")
print(f"Test Examples: {len(tokenized_sst5['test'])}")
print("=" * 50)

FINAL EVALUATION RESULTS


Final Test Accuracy: 0.5235

Detailed Classification Report:
               precision    recall  f1-score   support

very negative     0.5415    0.4444    0.4882       279
     negative     0.5781    0.5671    0.5726       633
      neutral     0.3510    0.3573    0.3541       389
     positive     0.4967    0.5941    0.5411       510
very positive     0.6554    0.5815    0.6162       399

     accuracy                         0.5235      2210
    macro avg     0.5245    0.5089    0.5144      2210
 weighted avg     0.5287    0.5235    0.5241      2210


Confusion Matrix:
Labels: [very negative, negative, neutral, positive, very positive]
[[124 119  27   9   0]
 [ 93 359 135  44   2]
 [ 12 127 139 104   7]
 [  0  13  81 303 113]
 [  0   3  14 150 232]]
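
The confusion matrix above shows that most errors fall on neighboring sentiment classes. The sketch below, which simply re-uses the printed matrix, computes exact accuracy, per-class recall, and a "within one class" accuracy (predictions at most one step away from the true label). The within-one metric is an extra illustration and is not part of the assignment's required report.

```python
# Confusion matrix copied from the output above (rows: true label, columns: predicted).
cm = [
    [124, 119,  27,   9,   0],
    [ 93, 359, 135,  44,   2],
    [ 12, 127, 139, 104,   7],
    [  0,  13,  81, 303, 113],
    [  0,   3,  14, 150, 232],
]
names = ["very negative", "negative", "neutral", "positive", "very positive"]
n = len(cm)

total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(n))
# Predictions that are exact or off by a single sentiment step.
within_one = sum(cm[i][j] for i in range(n) for j in range(n) if abs(i - j) <= 1)
# Recall per class: correct predictions divided by the row total.
recalls = [cm[i][i] / sum(cm[i]) for i in range(n)]

print(f"exact accuracy: {correct / total:.4f}")
print(f"within-one accuracy: {within_one / total:.4f}")
for name, recall in zip(names, recalls):
    print(f"{name:>15} recall: {recall:.4f}")
```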

Confusion Matrix (with labels):
                 very negative       negative        neutral       positive  very positive
  very negative            124            119             27              9              0
       negative        