# 🤗 and 3LC example on the IMDb dataset

This notebook demonstrates fine-tuning a pretrained DistilBERT model from `transformers` on the `IMDb` dataset, using the 3LC integrations with `Trainer` and `datasets` from Hugging Face. 3LC metrics are collected before and after one epoch of training.

The notebook covers:

- Getting a `TLCDataset` from a `datasets` dataset, highlighting key differences between `TLCDataset` and `datasets.Dataset`.
- Fine-tuning a pretrained `transformers` model on the IMDb dataset with `TLCTrainer`.
- Using a custom function for metrics collection.

In [None]:
EPOCHS = 1
TRAIN_BATCH_SIZE = 16
EVAL_BATCH_SIZE = 256
TRAIN_DATASET_NAME = "hf-imdb-train"
EVAL_DATASET_NAME = "hf-imdb-test"
TRANSIENT_DATA_PATH = "./transient_data"
DEVICE = "cuda:0"
PROJECT_NAME = "hf-imdb"
INSTALL_DEPENDENCIES = False
TLC_PUBLIC_EXAMPLES_DEVELOPER_MODE = True

In [None]:
if INSTALL_DEPENDENCIES:
    %pip --quiet install ipykernel ipywidgets
    %pip --quiet install torch --index-url https://download.pytorch.org/whl/cu118
    %pip --quiet install torchvision --index-url https://download.pytorch.org/whl/cu118
    %pip --quiet install datasets transformers
    %pip --quiet install accelerate
    %pip --quiet install tlc

In [None]:
import evaluate
import numpy as np
import tlc
import torch

from transformers import AutoModelForSequenceClassification, AutoTokenizer, DataCollatorWithPadding, TrainingArguments

In [None]:
### HIDDEN CELL ###

## Data & Alias management
# See comments in ../mnist.ipynb for details on data and alias management.

# Set TLC_PUBLIC_EXAMPLES_DEVELOPER_MODE to True to run this notebook for local testing only
if not TLC_PUBLIC_EXAMPLES_DEVELOPER_MODE:
    from tlc.client.utils import (
        TLC_PUBLIC_EXAMPLES_RUN_ROOT,
        TLC_PUBLIC_EXAMPLES_TABLE_ROOT,
    )
    from tlc.core.objects.mutable_objects import Configuration

    print(f"Runs and Tables will be written to remote location: '{TLC_PUBLIC_EXAMPLES_RUN_ROOT}' and '{TLC_PUBLIC_EXAMPLES_TABLE_ROOT}'")
    Configuration.instance().run_root_url = TLC_PUBLIC_EXAMPLES_RUN_ROOT
    Configuration.instance().table_root_url = TLC_PUBLIC_EXAMPLES_TABLE_ROOT

With the 3LC integration, `load_dataset` from `tlc.integration.huggingface` works as a drop-in replacement that creates a `TLCDataset`. Calling `.latest()` on the result gets the latest version of the 3LC dataset.

In [None]:
from tlc.integration.huggingface import load_dataset

train_dataset = load_dataset("imdb", split="train", project_name=PROJECT_NAME, dataset_name=TRAIN_DATASET_NAME, write_row_cache=True)
eval_dataset = load_dataset("imdb", split="test", project_name=PROJECT_NAME, dataset_name=EVAL_DATASET_NAME, write_row_cache=True)

To compare the two, let's look at the first sample of the training split as returned by `datasets` and by the 3LC integration.

In [None]:
import datasets

train_dataset_hf = datasets.load_dataset("imdb", split="train")
train_dataset_hf[0]

In [None]:
train_dataset[0]

The two samples differ, which may be surprising. The reason is that `TLCDataset` samples examples randomly by default, weighted by the editable `Sampling Weight` column in the 3LC Dashboard. To get the example stored at a given index, either call `.get_sample_at_index()` or use the `sequential` context manager. The latter is used internally in 3LC when collecting metrics.

In [None]:
train_dataset[0]
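To make the sampling behaviour concrete, here is a minimal, library-independent sketch. It is not the 3LC implementation: `WeightedView`, its `weights` argument and its methods are hypothetical names that only mimic the idea of weight-driven random indexing with a positional override.

```python
import random
from contextlib import contextmanager


class WeightedView:
    """Hypothetical stand-in for a dataset whose indexing is weight-driven."""

    def __init__(self, rows, weights, seed=0):
        self.rows = rows
        self.weights = weights
        self._rng = random.Random(seed)
        self._sequential = False

    def get_sample_at_index(self, index):
        # Always returns the row stored at `index`, ignoring the weights.
        return self.rows[index]

    def __getitem__(self, index):
        if self._sequential:
            return self.rows[index]
        # In sampling mode the requested index is ignored and a row is drawn
        # with probability proportional to its weight.
        return self._rng.choices(self.rows, weights=self.weights, k=1)[0]

    @contextmanager
    def sequential(self):
        # Temporarily switch to plain positional indexing.
        self._sequential = True
        try:
            yield self
        finally:
            self._sequential = False


view = WeightedView(rows=["a", "b", "c"], weights=[0.0, 0.0, 1.0])
sampled = view[0]                     # weight-driven: only "c" has non-zero weight
direct = view.get_sample_at_index(0)  # positional lookup
with view.sequential():
    ordered = [view[i] for i in range(3)]
```

In this toy setup `view[0]` does not return the first row, while the positional lookup and the sequential block do, which mirrors the difference observed between the two datasets.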

`TLCDataset` provides a `map` method to apply both preprocessing and on-the-fly transforms to your data before it is sent to the model. The mapped function takes a sample and returns the transformed sample. Passing `cache=True` persists the result of tokenization for each sample, so that it is only computed once.

In [None]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
def tokenize(sample):
    # Merge the tokenizer output into the original sample fields.
    return {**sample, **tokenizer(sample["text"], truncation=True)}

In [None]:
train_tokenized = train_dataset.map(tokenize)
eval_tokenized = eval_dataset.map(tokenize)
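The idea of applying a per-sample transform once and reusing the result can be sketched without 3LC. Everything below is an assumption for illustration: `CachedMap`, `fake_tokenizer` and `tokenize_sketch` are hypothetical names, and the cache lives in memory, whereas a real implementation may persist it differently.

```python
def fake_tokenizer(text):
    # Stand-in for a real tokenizer: one integer "id" per whitespace-separated word.
    words = text.split()
    return {"input_ids": [len(w) for w in words], "attention_mask": [1] * len(words)}


class CachedMap:
    """Hypothetical helper: applies `transform` once per index and reuses the result."""

    def __init__(self, rows, transform):
        self.rows = rows
        self.transform = transform
        self.calls = 0
        self._cache = {}

    def __getitem__(self, index):
        if index not in self._cache:
            self.calls += 1
            self._cache[index] = self.transform(self.rows[index])
        return self._cache[index]


def tokenize_sketch(sample):
    # Same merge pattern as in the notebook: original fields plus tokenizer output.
    return {**sample, **fake_tokenizer(sample["text"])}


rows = [{"text": "a fine film", "label": 1}]
mapped = CachedMap(rows, tokenize_sketch)
first = mapped[0]
second = mapped[0]  # served from the cache, the transform is not called again
```

The transformed sample keeps its original `text` and `label` fields and gains the tokenizer fields, and the second access does not invoke the transform again.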

In [None]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
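With its default settings, `DataCollatorWithPadding` pads every batch to the length of the longest sequence in that batch rather than to a fixed global length. The following standalone sketch illustrates that idea in plain Python; `pad_batch` and `pad_token_id=0` are assumptions for illustration, not the collator's actual code.

```python
def pad_batch(features, pad_token_id=0):
    """Pad `input_ids` and `attention_mask` to the longest sequence in the batch."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        # Padded positions get the pad token and an attention mask of 0.
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        batch["attention_mask"].append(f["attention_mask"] + [0] * n_pad)
    return batch


features = [
    {"input_ids": [101, 2023, 102], "attention_mask": [1, 1, 1]},
    {"input_ids": [101, 102], "attention_mask": [1, 1]},
]
batch = pad_batch(features)
```

Because padding is decided per batch, short reviews grouped together produce small tensors, which is why tokenization above only truncates and leaves padding to the collator.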

In [None]:
id2label = {0: "neg", 1: "pos"}
label2id = {"neg": 0, "pos": 1}

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id
)

Metrics are computed by a function that returns every per-sample metric you would like to see in the 3LC Dashboard. We also keep the standard Hugging Face `compute_metrics` function, so that intermediate aggregate metrics are still reported.

For special metrics such as the predicted category, we specify a schema so that the value is shown as a `CategoricalLabel`.

In [None]:

accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)

def compute_tlc_metrics(logits, labels):
    predictions = logits.argmax(dim=-1)
    loss = torch.nn.functional.cross_entropy(logits, labels, reduction="none")

    probabilities = torch.nn.functional.softmax(logits, dim=-1)
    # squeeze(-1) keeps the batch dimension even when the batch holds a single sample
    confidence = probabilities.gather(dim=-1, index=predictions.unsqueeze(-1)).squeeze(-1)

    return {
        "predicted": predictions,
        "loss": loss,
        "confidence": confidence,
    }

compute_tlc_metrics.column_schemas = {
    "predicted": tlc.CategoricalLabelSchema(display_name="Predicted Label", class_names=list(id2label.values()), display_importance=4005),
    "loss": tlc.Schema(display_name="Loss", writable=False, value=tlc.Float32Value()),
    "confidence": tlc.Schema(display_name="Confidence", writable=False, value=tlc.Float32Value()),
}
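To see what the three per-sample quantities mean numerically, the following standalone sketch recomputes them for a single pair of logits using only the standard library. The helper name `per_sample_metrics` is hypothetical, and the code is a scalar illustration of the same formulas rather than a replacement for the batched function above.

```python
import math


def per_sample_metrics(logits, label):
    # Numerically stable softmax over the class logits.
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Predicted class is the index of the highest probability.
    predicted = max(range(len(probs)), key=probs.__getitem__)
    # Cross-entropy for one sample: negative log-probability of the true class.
    loss = -math.log(probs[label])
    # Confidence is the probability assigned to the predicted class.
    confidence = probs[predicted]
    return {"predicted": predicted, "loss": loss, "confidence": confidence}


result = per_sample_metrics([0.0, 2.0], label=1)
```

When the prediction matches the label, as here, the loss equals the negative log of the confidence; for a misclassified sample the two diverge, which is what makes the pair useful for finding hard or mislabeled reviews.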

## Train the model with TLCTrainer

To perform model training, we replace the usual `Trainer` with `TLCTrainer` and provide the per-sample metrics collection function. We also specify that we would like to collect metrics prior to training.

In [None]:
tlc.init(project_name=PROJECT_NAME)

In [None]:
from tlc.integration.huggingface import TLCTrainer

training_args = TrainingArguments(
    output_dir=TRANSIENT_DATA_PATH,
    learning_rate=2e-5,
    per_device_train_batch_size=TRAIN_BATCH_SIZE,
    per_device_eval_batch_size=EVAL_BATCH_SIZE,
    num_train_epochs=EPOCHS,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    use_cpu=DEVICE == "cpu",
)

trainer = TLCTrainer(
    model=model,
    args=training_args,
    train_dataset=train_tokenized,
    eval_dataset=eval_tokenized,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    compute_tlc_metrics=compute_tlc_metrics,
    collect_tlc_metrics_before_training=True,
)

In [None]:
trainer.train()