### NLP Example using Fine-Tuning

Key characteristics of the SST-2 dataset:

- Source: The dataset is based on the Stanford Sentiment Treebank, a corpus of sentences from movie reviews.
- Task: It is a binary classification task: each sentence is classified as either positive or negative sentiment. Neutral sentences from the original Treebank are discarded.
- Input/Output:
    - Input: A single sentence from a movie review, such as "The movie was a masterpiece" or "A very long movie, dull in stretches".
    - Output: The corresponding sentiment label, either positive or negative.
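
As a rough illustration of the input/output format, the sketch below builds two records shaped like rows of the GLUE version of SST-2. The field names sentence, label, and idx and the encoding 0 = negative, 1 = positive follow the GLUE dataset card. The sentences are the two examples quoted above, and their labels are assigned here for illustration rather than copied from dataset rows.

```python
# Illustrative records only: real rows come from load_dataset("glue", "sst2").
LABEL_NAMES = ["negative", "positive"]  # position in the list = integer label

examples = [
    {"idx": 0, "sentence": "The movie was a masterpiece", "label": 1},
    {"idx": 1, "sentence": "A very long movie, dull in stretches", "label": 0},
]

def describe(record):
    """Return the sentence paired with its human-readable label."""
    return f'{record["sentence"]!r} -> {LABEL_NAMES[record["label"]]}'

for record in examples:
    print(describe(record))
```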

In [None]:
# Import required libraries
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer, DataCollatorWithPadding
import torch
import evaluate

import numpy as np

In [None]:
# Load small dataset (binary sentiment)
ds = load_dataset("glue", "sst2")

print(ds)

In [None]:
# Get tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tok(batch):
    return tokenizer(batch["sentence"], truncation=True)
ds_enc = ds.map(tok, batched=True)
print(ds_enc)

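# Report the available accelerator: Apple MPS, then CUDA, else CPU
# (informational only; Trainer selects its device automatically)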
device = "mps" if torch.backends.mps.is_available() else ("cuda" if torch.cuda.is_available() else "cpu")
print("Using:", device)

In [None]:
# Model + metric
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
acc_metric = evaluate.load("accuracy")
f1_metric  = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = logits.argmax(-1)
    return {
        "accuracy": acc_metric.compute(predictions=preds, references=labels)["accuracy"],
        "f1": f1_metric.compute(predictions=preds, references=labels, average="macro")["f1"],
    }


In [None]:
# Instantiate DataCollator
collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [None]:
# Trainer
args = TrainingArguments(
    output_dir="./bert-sst2",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=1,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    logging_steps=50,      
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=ds_enc["train"].shuffle(seed=42).select(range(15000)),  # keep it lighter
    eval_dataset=ds_enc["validation"],
    tokenizer=tokenizer,  # newer transformers releases call this argument processing_class
    data_collator=collator,
    compute_metrics=compute_metrics
)


In [None]:
# Train + evaluate
trainer.train()
eval_res = trainer.evaluate()
print(eval_res)

In [None]:
# Quick predictions
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification

# Path to the fine-tuned checkpoint and the task the model was trained for
# (e.g., 'text-classification', 'summarization', 'ner').
# The checkpoint suffix is the global step at save time. With the settings above
# on a single device, one epoch over 15,000 examples at batch size 16 ends at
# step 938, so check which directory actually exists under ./bert-sst2.
model_path = "./bert-sst2/checkpoint-1876/"
task = "text-classification"

# Human-readable names for the class ids (SST-2: 0 = negative, 1 = positive)
id2label = {0: "negative", 1: "positive"}
label2id = {label: idx for idx, label in id2label.items()}

# Load the model with the label mapping, plus the tokenizer saved with the checkpoint
model = AutoModelForSequenceClassification.from_pretrained(model_path, id2label=id2label, label2id=label2id)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Build the inference pipeline
classifier = pipeline(task, model=model, tokenizer=tokenizer)

In [None]:
custom_data = [
    "This movie was really good!",
    "The product did not meet my expectations.",
    "I'm not sure how I feel about this.",
]

In [None]:
# Run predictions on the list
results = classifier(custom_data)

for text, result in zip(custom_data, results):
    print(f"Text: '{text}', Prediction: {result}")

### Some key points

#### <u>A. DataCollator's role:</u>

The DataCollatorWithPadding class in Hugging Face Transformers pads dynamically at the batch level, which differs from padding applied at tokenization time.
The two components play complementary roles:

    - Tokenizer's role:
    A tokenizer converts raw text into numerical input IDs and generates an attention_mask indicating which tokens are real and which are padding. Calling tokenizer(...) with truncation=True cuts each sequence to the model's maximum length. Adding padding="max_length" would also pad every sequence to that fixed length, while padding=True pads only to the longest sequence in that call. The tokenization step in this notebook truncates only and leaves padding to the collator.
    - Data Collator's role (Dynamic Padding):
    The DataCollatorWithPadding takes a batch of already tokenized samples and dynamically pads them to the length of the longest sequence within that specific batch.

This offers several advantages:
1. Efficiency: It avoids padding all sequences to a fixed, potentially very long max_length if many sequences in a batch are short. This reduces redundant computations on padding tokens and saves memory.
2. Flexibility: It allows for variable-length inputs without requiring a global max_length to be set beforehand, which might not be optimal for all datasets or tasks.
3. Batching Requirement: Deep learning models typically require inputs of uniform shape for efficient matrix operations. The data collator ensures all sequences within a batch have the same length for this purpose. 

In essence, while the tokenizer prepares individual sequences, the data collator optimizes the batching process by applying dynamic padding and generating the corresponding attention mask at the batch level for efficient model training.
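
The dynamic padding described above can be sketched in plain Python. This is a minimal illustration of the idea, not the library's implementation: pad_batch is a hypothetical helper, the token ids are made up, and the pad id of 0 is an assumption (it happens to match the [PAD] id of bert-base-uncased).

```python
def pad_batch(features, pad_token_id=0):
    """Pad every sequence in the batch to the longest one in that batch."""
    longest = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        n_pad = longest - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks padding
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
    return batch

# Hypothetical token ids; real ones come from the tokenizer.
short_batch = [{"input_ids": [101, 2023, 102]},
               {"input_ids": [101, 2023, 3185, 2001, 102]}]
long_batch = [{"input_ids": [101] + [2023] * 10 + [102]},
              {"input_ids": [101, 102]}]

padded_short = pad_batch(short_batch)
padded_long = pad_batch(long_batch)

# Each batch is only as wide as its own longest sequence.
print(len(padded_short["input_ids"][0]), len(padded_long["input_ids"][0]))
```

Because the width is chosen per batch, the first batch never pays for the long sequence that appears only in the second one.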


#### <u>B. Compute Metrics role:</u>

In Hugging Face (HF) training, compute_metrics is a user-defined function that the Trainer calls at each evaluation to calculate and report custom metrics. It is an optional but powerful feature of the Trainer class that gives deeper insight into your model's performance than the loss alone.
Instead of a single loss value, compute_metrics lets you track task-specific performance metrics, such as:

- Accuracy for classification tasks
- F1-score for classification with imbalanced data
- ROUGE for text summarization
- SQuAD F1 and Exact Match for question answering
- Word Error Rate (WER) for automatic speech recognition

Here is a breakdown of its role in the training process:

- Receives predictions and labels: During evaluation, the Trainer passes the model's predictions and the true labels from the evaluation dataset to your compute_metrics function.
- Performs calculations: Your function then processes these predictions and labels to calculate your desired metrics. This often involves converting the model's raw output (logits) into a more usable format, like class predictions or decoded text.
- Returns results: The function returns a dictionary where the keys are the names of the metrics (e.g., 'accuracy') and the values are their computed scores. The Trainer then logs and reports these metrics.
- Offers flexibility: While the 🤗 evaluate library provides many standard metrics, compute_metrics allows you to define complex, custom-tailored metrics for your specific use case.

How to define compute_metrics
You don't instantiate compute_metrics like a class. You define a function and pass it to the Trainer through its compute_metrics argument. The function must have a specific signature: it takes an EvalPrediction object as input (which unpacks into predictions and labels) and returns a dictionary of metrics.
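
A minimal sketch of that signature, under two stated assumptions: the logits and labels are plain Python lists rather than the NumPy arrays the Trainer supplies, and accuracy and macro F1 are computed by hand instead of through the evaluate library. The numbers are made up for illustration.

```python
def compute_metrics(eval_pred):
    """Turn (logits, labels) into a dictionary of named metric scores."""
    logits, labels = eval_pred
    # argmax over the class dimension: index of the largest logit in each row
    preds = [max(range(len(row)), key=lambda i: row[i]) for row in logits]

    accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)

    # Macro F1: F1 per class, then the unweighted mean over classes
    f1_scores = []
    for cls in sorted(set(labels) | set(preds)):
        tp = sum(p == cls and y == cls for p, y in zip(preds, labels))
        fp = sum(p == cls and y != cls for p, y in zip(preds, labels))
        fn = sum(p != cls and y == cls for p, y in zip(preds, labels))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)

    return {"accuracy": accuracy, "f1": sum(f1_scores) / len(f1_scores)}

# Made-up logits for four sentences; columns are (negative, positive)
logits = [[2.0, -1.0], [0.1, 0.9], [1.5, 0.2], [-0.3, 0.4]]
labels = [0, 1, 1, 1]
metrics = compute_metrics((logits, labels))
print(metrics)
```

The returned keys are the names that show up in the evaluation logs, so short, descriptive keys such as accuracy and f1 are a sensible choice.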

#### Label to id

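Interpretation:

The id2label and label2id mappings are stored in the model's configuration and saved with every checkpoint. This allows the model to automatically translate the numeric output predictions into meaningful, human-readable labels during inference with a pipeline() or a saved model checkpoint. If the mappings are not set before training, the checkpoint keeps the generic names LABEL_0 and LABEL_1, which is why the prediction cell passes id2label when loading the model.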
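
The translation from numeric output to a readable label can be sketched as follows. This is an illustrative approximation of the kind of post-processing a text-classification pipeline performs, not its actual code: to_prediction is a hypothetical helper and the logits are made up.

```python
import math

id2label = {0: "negative", 1: "positive"}
label2id = {label: idx for idx, label in id2label.items()}  # inverse mapping

def to_prediction(logits):
    """Convert one row of logits into a label name and a softmax score."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=lambda i: probs[i])
    return {"label": id2label[best], "score": probs[best]}

prediction = to_prediction([-1.2, 2.3])  # made-up logits for one sentence
print(prediction)
```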