User Manual:

1. Restart the session before running anything.
2. All the GLUE tasks are listed in the glue_tasks array. We don't have enough compute time to run all of them in one session, so the tasks have to be split up.
3. Put the tasks you are fine-tuning on in the test_tasks array.
4. Run all the cells and report your eval score in the shared Google Sheet.
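
One way to do the split in step 2 is a simple round-robin over the people sharing the work. The sketch below is only an illustration: `split_tasks` and the number of participants are hypothetical and not part of this notebook. Each person would copy their own share into test_tasks.

```python
# Hypothetical helper: divide the GLUE tasks among several people.
glue_tasks = [
    "cola", "sst2", "mrpc", "qqp", "mnli",
    "qnli", "rte", "stsb", "wnli"
]

def split_tasks(tasks, num_people):
    """Return num_people lists, assigning tasks round-robin."""
    if num_people < 1:
        raise ValueError("num_people must be at least 1")
    return [tasks[i::num_people] for i in range(num_people)]

# Assumed example: four people share the nine tasks.
shares = split_tasks(glue_tasks, 4)
for person, share in enumerate(shares):
    print(f"person {person}: {share}")
```

An equal number of tasks does not mean equal compute: SST2 has 67349 training examples while COLA has 8551, so it may be worth adjusting the shares by hand.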

In [None]:
import random
import numpy as np
import torch
from transformers import set_seed

In [2]:
seed = random.randrange(2**32)
print(f"🔢 Using random seed: {seed}")

# Seed all RNGs
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(seed)
set_seed(seed)  # also seeds Hugging Face’s Trainer internals

🔢 Using random seed: 1316945282
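
Because the seed is drawn at random and only printed, reproducing a run means feeding the printed value back in instead of drawing a new one. A minimal sketch of that idea, using only Python's random module; the same reasoning is assumed to carry over to the NumPy, PyTorch, and set_seed calls in the cell above. `draw` is a hypothetical helper, and 1316945282 is the seed printed by the recorded run.

```python
import random

def draw(seed, n=3):
    """Seed Python's RNG and return the first n draws."""
    random.seed(seed)
    return [random.random() for _ in range(n)]

seed = 1316945282  # seed printed by the recorded run; replace with your own

first = draw(seed)
second = draw(seed)
print(first == second)  # prints True: same seed, same sequence
```

To rerun with a known seed, replace `seed = random.randrange(2**32)` with the printed value.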


In [3]:
# Cell 1: Install dependencies (don’t upgrade CUDA‑linked packages)
!pip install transformers datasets evaluate box



In [4]:
# Cell 2: Imports
from datasets import load_dataset
import evaluate
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    TrainingArguments,
    Trainer,
)
import numpy as np
import pandas as pd

In [5]:
glue_tasks = [
    "cola", "sst2", "mrpc", "qqp", "mnli",
    "qnli", "rte", "stsb", "wnli"
]

test_tasks = [
    "cola", "sst2"
]

base_args = {
    "model_name_or_path":          "distilbert-base-uncased",
    "max_seq_length":              128,
    "per_device_train_batch_size": 32,
    "per_device_eval_batch_size":  64,
    "learning_rate":               2e-5,
    "num_train_epochs":            3,
    "logging_steps":               50,
    "weight_decay":                0.01,
    "save_steps":                  500,
    "output_dir":                  "./glue-results",  # subfolders per task
}
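
A typo in test_tasks would otherwise only surface when the dataset fails to load partway through the run. As a sketch, a small check like the following could be run before the training loop; `check_tasks` is a hypothetical helper and not part of this notebook.

```python
glue_tasks = [
    "cola", "sst2", "mrpc", "qqp", "mnli",
    "qnli", "rte", "stsb", "wnli"
]
test_tasks = ["cola", "sst2"]

def check_tasks(selected, known):
    """Raise ValueError if a selected task is unknown or listed twice."""
    unknown = [t for t in selected if t not in known]
    if unknown:
        raise ValueError(f"Unknown GLUE task(s): {unknown}")
    if len(set(selected)) != len(selected):
        raise ValueError(f"Duplicate task(s) in selection: {selected}")
    return list(selected)

checked = check_tasks(test_tasks, glue_tasks)
print(checked)
```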

In [6]:
all_results = {}
best_metrics = {
    "cola": "matthews_correlation",
    "sst2": "accuracy",
    "mrpc": "f1",
    "qqp": "f1",
    "mnli": "accuracy",
    "qnli": "accuracy",
    "rte": "accuracy",
    "wnli": "accuracy",
    "stsb": "pearson",
}


for task in test_tasks:
    print(f"\n===== TASK: {task.upper()} =====")
    args = base_args.copy()
    args["task_name"]  = task
    args["output_dir"] = f"{base_args['output_dir']}/{task}"

    # 1) Load data & metric
    ds     = load_dataset("glue", args["task_name"])
    metric = evaluate.load("glue", args["task_name"])

    # 2) Tokenizer & collator
    tokenizer     = AutoTokenizer.from_pretrained(args["model_name_or_path"])
    data_collator = DataCollatorWithPadding(tokenizer)

    # 3) Preprocess
    # Text column(s) of each GLUE task; single-sentence tasks have no second column.
    text_columns = {
        "cola": ("sentence", None),
        "sst2": ("sentence", None),
        "mnli": ("premise", "hypothesis"),
        "qnli": ("question", "sentence"),
        "qqp":  ("question1", "question2"),
    }
    # mrpc, rte, stsb and wnli all use sentence1/sentence2
    key1, key2 = text_columns.get(args["task_name"], ("sentence1", "sentence2"))

    def preprocess_fn(ex):
        texts = (ex[key1],) if key2 is None else (ex[key1], ex[key2])
        return tokenizer(
            *texts,
            truncation=True,
            padding="max_length",
            max_length=args["max_seq_length"]
        )

    encoded = ds.map(preprocess_fn, batched=True)

    # 4) Model
    num_labels = 1 if args["task_name"] == "stsb" else ds["train"].features["label"].num_classes
    model      = AutoModelForSequenceClassification.from_pretrained(
                     args["model_name_or_path"],
                     num_labels=num_labels
                 )

    # 5) TrainingArguments
    metric_name = best_metrics[task]

    training_args = TrainingArguments(
        output_dir=args["output_dir"],
        seed=seed,
        per_device_train_batch_size=args["per_device_train_batch_size"],
        per_device_eval_batch_size=args["per_device_eval_batch_size"],
        learning_rate=args["learning_rate"],
        num_train_epochs=args["num_train_epochs"],
        logging_steps=args["logging_steps"],
        save_steps=args["save_steps"],  # only used with save_strategy="steps"; ignored here
        eval_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model = metric_name,
        overwrite_output_dir=True,
    )

    # 6) Metrics function
    def compute_metrics(p):
        logits, labels = p
        if task == "stsb":
            preds = np.squeeze(logits)
        else:
            preds = np.argmax(logits, axis=-1)
        return metric.compute(predictions=preds, references=labels)

    # 7) Trainer setup
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=encoded["train"],
        eval_dataset=(
            encoded["validation_matched"] if task == "mnli"
            else encoded["validation"]
        ),
        processing_class=tokenizer,  # `tokenizer=` is deprecated in recent transformers releases
        data_collator=data_collator,
        compute_metrics=compute_metrics
    )

    # 8) Train & evaluate
    trainer.train()
    result = trainer.evaluate()
    all_results[task] = result



===== TASK: COLA =====


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
wandb: Currently logged in as: sunnysolomon8880 (sunnysolomon8880-cornell-university) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


Epoch  Training Loss  Validation Loss  Matthews Correlation
1      0.4745         0.476063         0.455969
2      0.3567         0.48456          0.481079
3      0.2289         0.521252         0.501637



===== TASK: SST2 =====


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss


KeyboardInterrupt: 

In [None]:
# Cell 5: Summarize all task results
df = pd.DataFrame(all_results).T
display(df)
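
For the shared sheet, usually only the headline metric per task is needed. Trainer.evaluate() returns its metrics with an `eval_` prefix, so the COLA entry is `eval_matthews_correlation`. The sketch below pulls those values out, assuming all_results and best_metrics have the shapes used above. `headline_scores` is a hypothetical helper, and the sample dictionary is a stand-in that reuses the epoch-3 COLA numbers from the training log rather than real trainer output.

```python
best_metrics = {"cola": "matthews_correlation", "sst2": "accuracy"}

# Stand-in for all_results; real entries contain further keys such as runtimes.
all_results = {
    "cola": {"eval_loss": 0.521252, "eval_matthews_correlation": 0.501637},
}

def headline_scores(results, metric_names):
    """Map each finished task to the value of its headline metric."""
    scores = {}
    for task, result in results.items():
        key = f"eval_{metric_names[task]}"
        scores[task] = result.get(key)  # None if the metric is missing
    return scores

scores = headline_scores(all_results, best_metrics)
print(scores)
```

A task whose run was interrupted before trainer.evaluate() finished never gets an entry in all_results, so it does not appear in the summary.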