<a href="https://colab.research.google.com/github/Rocking-Priya/703-fall-coding-homeworks-2025/blob/main/Homework_09.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Homework 9: Text Classification with Fine-Tuned BERT

### Due: Midnight on November 5th (with 2-hour grace period) — Worth 85 points

In this final homework, we’ll explore **fine-tuning a pre-trained Transformer model (BERT)** for text classification using the **IMDB Movie Review** dataset. You’ll begin with a working baseline notebook and then conduct a series of controlled experiments to understand how data size, context length, and model architecture affect performance.

You’ll complete three problems:

* **Problem 1:** Evaluate how **sequence length** and **learning rate** jointly influence validation loss and generalization.
* **Problem 2:** Measure how **training data size** affects both model performance and total training time.
* **Problem 3:** Compare **two additional models** from the BERT family to analyze the trade-offs between model size and accuracy on this dataset.

In each problem, you’ll report your key metrics, summarize what you observed, and reflect on what you learned.

> **Note:** This homework was developed and tested on **Google Colab**, due to version conflicts when running locally. It is **strongly recommended** that you complete your work on Colab as well.

There are 6 problems, each worth 14 points, and you get one point free if you complete the entire homework.


In [6]:
# Install once per new Colab runtime
%pip -q install -U keras keras-hub tensorflow tensorflow-text datasets evaluate

[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/1.5 MB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.5/1.5 MB[0m [31m49.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.1/1.1 MB[0m [31m71.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m645.0/645.0 MB[0m [31m2.0 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m506.8/506.8 kB[0m [31m47.2 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m84.1/84.1 kB[0m [31m10.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m47.7/47.7 MB[0m [31m49.2 MB/s[0m eta [36m0:00:00[0m
[?25h[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency c

In [7]:

import os
os.environ["KERAS_BACKEND"] = "tensorflow"

import time
import random
import numpy as np
import keras
import keras_hub as kh
import evaluate
from datasets import load_dataset, Dataset, Features, Value, ClassLabel

from keras import mixed_precision                    # generally faster
mixed_precision.set_global_policy("mixed_float16")

In [8]:
# Attempt to resolve pyarrow version conflict
%pip uninstall -y pyarrow datasets
%pip install -q -U pyarrow datasets evaluate

Found existing installation: pyarrow 22.0.0
Uninstalling pyarrow-22.0.0:
  Successfully uninstalled pyarrow-22.0.0
Found existing installation: datasets 4.3.0
Uninstalling datasets-4.3.0:
  Successfully uninstalled datasets-4.3.0
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
pylibcudf-cu12 25.6.0 requires pyarrow<20.0.0a0,>=14.0.0; platform_machine == "x86_64", but you have pyarrow 22.0.0 which is incompatible.
cudf-cu12 25.6.0 requires pyarrow<20.0.0a0,>=14.0.0; platform_machine == "x86_64", but you have pyarrow 22.0.0 which is incompatible.[0m[31m
[0m

### Here is where you can set global hyperparameters for this homework

In [9]:
# ---------------- Config ----------------
SEED        = 42
MAX_LEN     = 128
EPOCHS      = 3
BATCH       = 32
EVAL_BATCH  = 64
SUBSET_FRAC = 0.25   # <-- 0.25 to train and test on 25% of whole dataset during development;  set to 1.0 for full dataset

keras.utils.set_random_seed(SEED)

### Load and Preprocess the IMDB Movie Review Dataset

In [10]:
# ---- Load IMDb (raw), join train+test ----
imdb   = load_dataset("imdb")
texts  = list(imdb["train"]["text"]) + list(imdb["test"]["text"])
labels = np.array(list(imdb["train"]["label"]) + list(imdb["test"]["label"]), dtype="int32")

# ---- Build DS with explicit features (label=ClassLabel) ----
features = Features({"text": Value("string"),
                     "label": ClassLabel(num_classes=2, names=["NEG","POS"])})
all_ds = Dataset.from_dict({"text": texts, "label": labels.tolist()}, features=features)

# ---- Optional: take a stratified subset of the FULL dataset ----
if 0.0 < SUBSET_FRAC < 1.0:
    sub = all_ds.train_test_split(train_size=SUBSET_FRAC, seed=SEED, stratify_by_column="label")
    ds_pool = sub["train"]
else:
    ds_pool = all_ds

# ---- Stratified 80/10/10 split on the (possibly smaller) pool ----
# First: 80/20 train+val pool / test
splits = ds_pool.train_test_split(test_size=0.20, seed=SEED, stratify_by_column="label")
train_val_pool, test_ds = splits["train"], splits["test"]
# Then: carve 10% of full (i.e., 0.125 of the 80% pool) as validation
splits2 = train_val_pool.train_test_split(test_size=0.125, seed=SEED, stratify_by_column="label")
train_ds, val_ds = splits2["train"], splits2["test"]

# ---- Numpy arrays for Keras fit/predict ----
X_tr = np.array(train_ds["text"], dtype=object); y_tr = np.array(train_ds["label"], dtype="int32")
X_va = np.array(val_ds["text"],   dtype=object); y_va = np.array(val_ds["label"],   dtype="int32")
X_te = np.array(test_ds["text"],  dtype=object); y_te = np.array(test_ds["label"],  dtype="int32")

# ---- Quick summary ----
def _counts(ds):
    arr = np.array(ds["label"], dtype=int)
    return len(arr), np.bincount(arr, minlength=2).tolist()
print(f"Pool after SUBSET_FRAC={SUBSET_FRAC}: {len(ds_pool)} (of {len(all_ds)})")
print("Train:", _counts(train_ds), " Val:", _counts(val_ds), " Test:", _counts(test_ds))


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md: 0.00B [00:00, ?B/s]

plain_text/train-00000-of-00001.parquet:   0%|          | 0.00/21.0M [00:00<?, ?B/s]

plain_text/test-00000-of-00001.parquet:   0%|          | 0.00/20.5M [00:00<?, ?B/s]

plain_text/unsupervised-00000-of-00001.p(…):   0%|          | 0.00/42.0M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating unsupervised split:   0%|          | 0/50000 [00:00<?, ? examples/s]

Pool after SUBSET_FRAC=0.25: 12500 (of 50000)
Train: (8750, [4375, 4375])  Val: (1250, [625, 625])  Test: (2500, [1250, 1250])


### Build and train a baseline Distil-Bert Text Classifier

In [11]:
# ---- Keras Hub preprocessor + classifier ----
preproc = kh.models.DistilBertTextClassifierPreprocessor.from_preset(
    "distil_bert_base_en_uncased", sequence_length=MAX_LEN
)
model = kh.models.DistilBertTextClassifier.from_preset(
    "distil_bert_base_en_uncased", num_classes=2, preprocessor=preproc
)

model.compile(
    optimizer=keras.optimizers.Adam(1e-5),
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[keras.metrics.SparseCategoricalAccuracy(name="acc")],
)

start = time.time()

# ---- Train with early stopping (restore best val weights) ----
cb = [keras.callbacks.EarlyStopping(monitor="val_loss", patience=2, restore_best_weights=True)]
history = model.fit(
    X_tr, y_tr,
    validation_data=(X_va, y_va),
    epochs=EPOCHS,
    batch_size=BATCH,
    callbacks=cb,
    verbose=1,
)

# ---- Evaluate (accuracy + F1 via `evaluate`) ----
logits = model.predict(X_te, batch_size=EVAL_BATCH, verbose=0)
y_pred = logits.argmax(axis=-1)

acc_metric = evaluate.load("accuracy")
f1_metric  = evaluate.load("f1")
acc = acc_metric.compute(predictions=y_pred, references=y_te)["accuracy"]
f1  = f1_metric.compute(predictions=y_pred, references=y_te)["f1"]

# Tiny confusion matrix helper (no sklearn needed)
def confusion_matrix_np(y_true, y_pred, num_classes=2):
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

print(f"\nValidation acc (best epoch): {history.history['val_acc'][np.argmin(history.history['val_loss'])]:.3f}")
print(f"\nTest accuracy: {acc:.3f}   Test F1: {f1:.3f}")
print("\nConfusion matrix:\n", confusion_matrix_np(y_te, y_pred))

end = time.time() - start
print("\nElapsed time:", time.strftime("%H:%M:%S", time.gmtime(end)))

Downloading from https://www.kaggle.com/api/v1/models/keras/distil_bert/keras/distil_bert_base_en_uncased/3/download/config.json...


100%|██████████| 462/462 [00:00<00:00, 865kB/s]


Downloading from https://www.kaggle.com/api/v1/models/keras/distil_bert/keras/distil_bert_base_en_uncased/3/download/tokenizer.json...


100%|██████████| 794/794 [00:00<00:00, 1.55MB/s]


Downloading from https://www.kaggle.com/api/v1/models/keras/distil_bert/keras/distil_bert_base_en_uncased/3/download/assets/tokenizer/vocabulary.txt...


100%|██████████| 226k/226k [00:00<00:00, 672kB/s]


Downloading from https://www.kaggle.com/api/v1/models/keras/distil_bert/keras/distil_bert_base_en_uncased/3/download/model.weights.h5...


100%|██████████| 253M/253M [00:07<00:00, 34.0MB/s]


Epoch 1/3
[1m274/274[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m137s[0m 278ms/step - acc: 0.7819 - loss: 0.4529 - val_acc: 0.8376 - val_loss: 0.3447
Epoch 2/3
[1m274/274[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m11s[0m 39ms/step - acc: 0.8785 - loss: 0.2896 - val_acc: 0.8584 - val_loss: 0.3401
Epoch 3/3
[1m274/274[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m11s[0m 40ms/step - acc: 0.9160 - loss: 0.2206 - val_acc: 0.8600 - val_loss: 0.3556


Downloading builder script: 0.00B [00:00, ?B/s]

Downloading builder script: 0.00B [00:00, ?B/s]


Validation acc (best epoch): 0.858

Test accuracy: 0.854   Test F1: 0.851

Confusion matrix:
 [[1095  155]
 [ 209 1041]]

Elapsed time: 00:03:09


# Problem 1 — Mini sweep: context length × learning rate (6 runs)

In this problem we'll see how much **context length** (`MAX_LEN`) helps, and how sensitive fine-tuning is to **learning rate**—without running a huge grid.

## Setup (keep these fixed)

* `SUBSET_FRAC = 0.25`               # use only this percentage of the whole dataset
* `EPOCHS = 3`
* `BATCH = 32` (but see note for 256 below)
* **EarlyStopping** with `restore_best_weights=True`
* Same random `SEED` for all runs
* Same data split for all runs (don’t reshuffle between runs)

### Run these 6 configurations

**For each** `MAX_LEN ∈ {128, 256, 512}`, try **two** learning rates:

* **MAX_LEN = 128**

  * `(LR = 2e-5, BATCH = 32)` – healthy default for shorter contexts.
  * `(LR = 1e-5, BATCH = 32)` – conservative LR; often a touch stabler.

* **MAX_LEN = 256**

  * `(LR = 1e-5, BATCH = 16)` – longer context → lower batch.
  * `(LR = 7.5e-6, BATCH = 16)` – even steadier if loss is noisy.

* **MAX_LEN = 512**  *(heavier quadratic attention cost)*

  * `(LR = 7.5e-6, BATCH = 8)` – safe starting point.
  * `(LR = 5e-6, BATCH = 8)` – extra caution for stability.

**If you hit an Out Of Memory error:**

* At **256** with `BATCH = 16`, drop to `BATCH = 8`.
* At **512** with `BATCH = 8`, drop to `BATCH = 4`.


Then answer the graded questions.


In [12]:
# Your code here; add as many cells as you need
# This script assumes the global constants (SEED, EPOCHS, SUBSET_FRAC) and
# the data arrays (X_tr, y_tr, X_va, y_va, X_te, y_te) have been loaded
# from the previous cells ([21], [22], [23]).

import time # Import the time module


# --- 1. Define all 6 configurations ---
# Note: These batch sizes reflect the OOM handling instructions.
CONFIGURATIONS = [
    # MAX_LEN = 128 (Shorter Context)
    {"max_len": 128, "lr": 2e-5, "batch": 32, "name": "Run_1_L128_LR2e5_B32"},
    {"max_len": 128, "lr": 1e-5, "batch": 32, "name": "Run_2_L128_LR1e5_B32"},

    # MAX_LEN = 256 (Medium Context)
    # If OOM, reduce BATCH to 8
    {"max_len": 256, "lr": 1e-5, "batch": 16, "name": "Run_3_L256_LR1e5_B16"},
    {"max_len": 256, "lr": 7.5e-6, "batch": 16, "name": "Run_4_L256_LR75e6_B16"},

    # MAX_LEN = 512 (Long Context - Heavy Cost)
    # If OOM, reduce BATCH to 4
    {"max_len": 512, "lr": 7.5e-6, "batch": 8, "name": "Run_5_L512_LR75e6_B8"},
    {"max_len": 512, "lr": 5e-6, "batch": 8, "name": "Run_6_L512_LR5e6_B8"},
]

# --- 2. Start the 6-Run Experiment Loop ---

results = [] # To store all final results for later comparison

for i, config in enumerate(CONFIGURATIONS):
    current_max_len = config["max_len"]
    current_lr      = config["lr"]
    current_batch   = config["batch"]
    run_name        = config["name"]

    print(f"\n==================================================================")
    print(f"STARTING RUN {i+1}/6: {run_name}")
    print(f"MAX_LEN: {current_max_len} | LR: {current_lr} | BATCH: {current_batch}")
    print(f"==================================================================")

    start_time = time.time()

    # --- A. Model Initialization (CRUCIAL: Must be done for EVERY run) ---
    # The preprocessor MUST be recreated for the current MAX_LEN
    preproc = kh.models.DistilBertTextClassifierPreprocessor.from_preset(
        "distil_bert_base_en_uncased",
        sequence_length=current_max_len # <--- Dynamic MAX_LEN
    )

    # The classifier MUST be recreated to ensure fresh, untrained weights
    model = kh.models.DistilBertTextClassifier.from_preset(
        "distil_bert_base_en_uncased",
        num_classes=2,
        preprocessor=preproc
    )

    # --- B. Compile with the current Learning Rate ---
    model.compile(
        optimizer=keras.optimizers.Adam(current_lr), # <--- Dynamic LR
        loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=[keras.metrics.SparseCategoricalAccuracy(name="acc")],
    )

    # --- C. Callbacks (EarlyStopping with restore_best_weights=True) ---
    # Patience=2 is used in the original context, let's keep it robust.
    # The requirement is EarlyStopping with restore_best_weights=True.
    cb = [keras.callbacks.EarlyStopping(
        monitor="val_loss",
        patience=2, # Using patience=2 as set in your base code [10]
        restore_best_weights=True
    )]

    # --- D. Training ---
    try:
        history = model.fit(
            X_tr, y_tr,
            validation_data=(X_va, y_va),
            epochs=EPOCHS, # Use global EPOCHS=3
            batch_size=current_batch, # <--- Dynamic BATCH
            callbacks=cb,
            verbose=1,
        )

        # --- E. Evaluation ---
        # Note: EVAL_BATCH (64) is kept constant as it doesn't affect training.
        logits = model.predict(X_te, batch_size=EVAL_BATCH, verbose=0)
        y_pred = logits.argmax(axis=-1)

        acc_metric = evaluate.load("accuracy")
        f1_metric  = evaluate.load("f1")
        acc = acc_metric.compute(predictions=y_pred, references=y_te)["accuracy"]
        f1  = f1_metric.compute(predictions=y_pred, references=y_te)["f1"]

        # Get the validation loss from the best epoch
        best_val_loss = min(history.history['val_loss'])

        end_time = time.time() - start_time

        # --- F. Print and Store Results ---
        run_data = {
            "Run": i + 1,
            "Name": run_name,
            "MAX_LEN": current_max_len,
            "LR": current_lr,
            "Batch": current_batch,
            "Best_Val_Loss": f"{best_val_loss:.5f}",
            "Test_Acc": f"{acc:.3f}",
            "Test_F1": f"{f1:.3f}",
            "Time": time.strftime("%H:%M:%S", time.gmtime(end_time)),
        }
        results.append(run_data)

        print("\n--- RUN SUMMARY ---")
        for key, value in run_data.items():
            print(f"{key:<15}: {value}")

    except Exception as e:
        print(f"\n!!! FAILED RUN {run_name} !!!")
        print(f"Error: {e}")
        # This is where an OOM error would be caught. The user would then adjust
        # the 'batch' size in the CONFIGURATIONS list above and restart.

    # Clean up model references to free up GPU memory for the next run
    del model
    del preproc
    keras.backend.clear_session()

# --- Calculate and print the best validation loss and corresponding test accuracy ---
if results:
    min_val_loss_run = min(results, key=lambda x: float(x['Best_Val_Loss']))
    best_val_loss = float(min_val_loss_run['Best_Val_Loss'])
    # Use Test_Acc from the best run as a proxy for validation accuracy at min val_loss,
    # as val_acc at the best epoch isn't directly stored in the results list.
    a1a = float(min_val_loss_run['Test_Acc'])

    print("\n--- Best Run Summary (for Problem 1 Graded Question a1a) ---")
    print("Run with minimum validation loss:", min_val_loss_run['Name'])
    print("min val_loss:", best_val_loss)
    print("Corresponding Test accuracy (used for a1a):", a1a)
else:
    print("\nNo results were recorded.")
    a1a = 0.0 # Default value if no runs completed


# --- 3. Final Summary Table (for easy analysis) ---
print("\n\n==================================================================")
print("FINAL PROBLEM 1 SWEEP RESULTS")
print("==================================================================")

# Use pandas to nicely format the results table if available
try:
    import pandas as pd
    df = pd.DataFrame(results)
    # Re-order columns for clarity
    df = df[["Run", "MAX_LEN", "LR", "Batch", "Best_Val_Loss", "Test_Acc", "Test_F1", "Time"]]
    print(df.to_markdown(index=False))
except ImportError:
    print("Install pandas (`!pip install pandas`) for a nice table.")
    for res in results:
        print(res)

print("\n*** The results above will allow you to answer the graded questions on context length and learning rate sensitivity. ***")


STARTING RUN 1/6: Run_1_L128_LR2e5_B32
MAX_LEN: 128 | LR: 2e-05 | BATCH: 32
Epoch 1/3
[1m274/274[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m93s[0m 171ms/step - acc: 0.8024 - loss: 0.4149 - val_acc: 0.8480 - val_loss: 0.3464
Epoch 2/3
[1m274/274[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m10s[0m 36ms/step - acc: 0.9010 - loss: 0.2481 - val_acc: 0.8336 - val_loss: 0.3929
Epoch 3/3
[1m274/274[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m10s[0m 36ms/step - acc: 0.9353 - loss: 0.1710 - val_acc: 0.8512 - val_loss: 0.3801

--- RUN SUMMARY ---
Run            : 1
Name           : Run_1_L128_LR2e5_B32
MAX_LEN        : 128
LR             : 2e-05
Batch          : 32
Best_Val_Loss  : 0.34638
Test_Acc       : 0.850
Test_F1        : 0.857
Time           : 00:02:15

STARTING RUN 2/6: Run_2_L128_LR1e5_B32
MAX_LEN: 128 | LR: 1e-05 | BATCH: 32
Epoch 1/3
[1m274/274[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m132s[0m 268ms/step - acc: 0.7791 - loss: 0.4527 - val_acc: 0.8488 - val_loss:

### Graded Questions

In [13]:
# Set a1a to the validation accuracy at min validation loss for your best configuration found in this problem

a1a =  0.9104             # Replace 0.0 with your answer

In [14]:
# Graded Answer
# DO NOT change this cell in any way

print(f'a1a = {a1a:.4f}')

a1a = 0.9104


#### Question a1b:

* Does **more context** (128 → 256 → 512) consistently help?
* How much effect did the learning rate have on the validation accuracy?


#### Your Answer Here:

Increasing the context length from 128 → 256 → 512 consistently improved validation accuracy and reduced validation loss.
Accuracy rose from about 0.85 (128) to 0.90 (256) to 0.91 (512), so longer context helped, though the gain from 256 to 512 was small.
The learning rate had only a minor effect: in each context length, changing LR (e.g., 2e-5 → 1e-5 or 7.5e-6 → 5e-6) changed accuracy by less than 1%.
Overall, context length mattered more than learning rate in this mini-sweep.

## Problem 2 — How much data is enough?

In this problem, you’ll investigate how model performance scales with dataset size.

**Setup.**
Use the best `MAX_LEN` and `LR` values you found in **Problem 1**.

**What to do:**

1. For each value of `SUBSET_FRAC ∈ {0.25, 0.50, 0.75, 1.00}`, train your model once and observe the displayed performance metrics.
2. Answer the discussion question below.




In [16]:
# Problem 2 experiment runner — safe to paste and run
import time
import numpy as np
import pandas as pd
import keras

# --- Settings (from Problem 1 best) ---
BEST_MAX_LEN = 512
BEST_LR = 5e-6
EPOCHS = 3
BATCH_FOR_512 = 8
EARLYSTOP_PATIENCE = 2
SUBSET_FRACS = [0.25, 0.50, 0.75, 1.00]

# --- Detect which training arrays exist and choose full-training source ---
# This uses only variables that may already be in your notebook.
try:
    # prefer full-named arrays if available
    X_full = X_tr_all
    y_full = y_tr_all
    print("Using X_tr_all / y_tr_all as full training data.")
except NameError:
    try:
        # fallback to X_tr / y_tr if that's what you have (treat as full)
        X_full = X_tr
        y_full = y_tr
        print("X_tr_all not found — using X_tr / y_tr as full training data.")
    except NameError:
        raise RuntimeError("No training arrays found. Please ensure X_tr_all or X_tr exist.")

n_total = len(X_full)
if n_total == 0:
    raise RuntimeError("Training array is empty.")

# store results
meta = []
histories = []

for frac in SUBSET_FRACS:
    n_use = int(round(frac * n_total))
    # keep deterministic subset: use first n_use examples (do not reshuffle)
    idx = np.arange(n_use)
    X_sub = X_full[idx]
    y_sub = y_full[idx]

    print("\n" + "="*60)
    print(f"SUBSET_FRAC = {frac}  (using {n_use}/{n_total} examples)")
    print("="*60)

    start = time.time()

    # recreate preprocessor and model for this run
    preproc = kh.models.DistilBertTextClassifierPreprocessor.from_preset(
        "distil_bert_base_en_uncased",
        sequence_length=BEST_MAX_LEN
    )
    model = kh.models.DistilBertTextClassifier.from_preset(
        "distil_bert_base_en_uncased",
        num_classes=2,
        preprocessor=preproc
    )
    model.compile(
        optimizer=keras.optimizers.Adam(BEST_LR),
        loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=[keras.metrics.SparseCategoricalAccuracy(name="acc")]
    )

    cb = [keras.callbacks.EarlyStopping(monitor="val_loss", patience=EARLYSTOP_PATIENCE, restore_best_weights=True)]

    history = model.fit(
        X_sub, y_sub,
        validation_data=(X_va, y_va),
        epochs=EPOCHS,
        batch_size=BATCH_FOR_512,
        callbacks=cb,
        verbose=1
    )

    elapsed = time.time() - start

    # evaluate test metrics (optional but kept consistent with Problem1)
    logits = model.predict(X_te, batch_size=64, verbose=0)
    y_pred = logits.argmax(axis=-1)
    acc_metric = evaluate.load("accuracy")
    acc_test = acc_metric.compute(predictions=y_pred, references=y_te)["accuracy"]

    # find epoch with minimal val_loss and record val_acc at that epoch
    val_losses = np.array(history.history["val_loss"])
    val_accs = np.array(history.history["val_acc"])
    idx_min = int(np.argmin(val_losses))
    best_val_loss = float(val_losses[idx_min])
    val_acc_at_best = float(val_accs[idx_min])

    histories.append(history)
    meta.append({
        "SUBSET_FRAC": frac,
        "N_train": n_use,
        "Best_Val_Loss": best_val_loss,
        "Val_Acc_at_Best": val_acc_at_best,
        "Test_Acc": float(acc_test),
        "Time_s": elapsed
    })

    # cleanup to free GPU RAM
    del model
    del preproc
    keras.backend.clear_session()

# show results table
df_p2 = pd.DataFrame(meta)
df_p2["Time_min"] = (df_p2["Time_s"] / 60).round(2)
print("\nResults by subset:")
print(df_p2[["SUBSET_FRAC", "N_train", "Best_Val_Loss", "Val_Acc_at_Best", "Test_Acc", "Time_min"]].to_markdown(index=False))

# set a2a to val_acc at min val_loss for the best run in this problem (smallest Best_Val_Loss)
best_idx = df_p2["Best_Val_Loss"].idxmin()
a2a = float(df_p2.loc[best_idx, "Val_Acc_at_Best"])
print(f"\na2a (validation accuracy at min val_loss for best subset): {a2a:.4f}")


X_tr_all not found — using X_tr / y_tr as full training data.

SUBSET_FRAC = 0.25  (using 2188/8750 examples)
Epoch 1/3
[1m274/274[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m104s[0m 228ms/step - acc: 0.7080 - loss: 0.5526 - val_acc: 0.8832 - val_loss: 0.3177
Epoch 2/3
[1m274/274[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m13s[0m 45ms/step - acc: 0.9022 - loss: 0.2577 - val_acc: 0.8936 - val_loss: 0.2672
Epoch 3/3
[1m274/274[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m14s[0m 48ms/step - acc: 0.9401 - loss: 0.1758 - val_acc: 0.8872 - val_loss: 0.3143

SUBSET_FRAC = 0.5  (using 4375/8750 examples)
Epoch 1/3
[1m547/547[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m114s[0m 132ms/step - acc: 0.7993 - loss: 0.4190 - val_acc: 0.8952 - val_loss: 0.2596
Epoch 2/3
[1m547/547[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m22s[0m 40ms/step - acc: 0.9223 - loss: 0.2110 - val_acc: 0.9032 - val_loss: 0.2578
Epoch 3/3
[1m547/547[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m22s[

### Graded Questions

In [17]:
# Set a2a to the validation accuracy at min validation loss for your best configuration found in this problem
# (Yes, it is probably at 1.0!)

a2a =  0.9096             # Replace 0.0 with your answer

In [18]:
# Graded Answer
# DO NOT change this cell in any way

print(f'a2a = {a2a:.4f}')

a2a = 0.9096


#### Question a2b:

Summarize what you observed as dataset size increased. Given that validation metrics are typically reliable to only about two decimal places, do the performance gains justify using the entire dataset? What trade-offs between accuracy and computation time did you notice?

#### Your Answer Here:

As dataset size increased validation accuracy improved but with diminishing returns: Val_Acc_at_Best rose from 0.8936 (25%) → 0.9032 (50%) → 0.9040 (75%) → 0.9096 (100%). The largest jump was 25%→50% (+0.0096); subsequent gains were very small (≤0.006) and within the usual two-decimal noise, so they are not clearly meaningful. Training time rose roughly with data size (2.35 → 3.43 minutes for 25%→100%), so 75% offers a good accuracy/computation trade-off while 100% is only worth it if you need the tiny extra gain.

# Problem 3 — Model swap: speed vs. accuracy (why: capacity matters)

In this problem we will compare encoder-only backbones of different sizes.

**Setup.** Keep the best `MAX_LEN`, `LR`, and `SUBSET_FRAC` from Problems 1–2. Only change the model/preset:

* **DistilBERT** (current baseline)
* **BERT-base** (larger/usually stronger)

**How to switch (two lines each).**

* DistilBERT:

  ```python
  preproc = kh.models.DistilBertTextClassifierPreprocessor.from_preset("distil_bert_base_en_uncased", sequence_length=MAX_LEN)
  model  = kh.models.DistilBertTextClassifier.from_preset("distil_bert_base_en_uncased", num_classes=2, preprocessor=preproc)
  ```

* BERT-base:

  ```python
  preproc = kh.models.BertTextClassifierPreprocessor.from_preset("bert_base_en_uncased", sequence_length=MAX_LEN)
  model  = kh.models.BertTextClassifier.from_preset("bert_base_en_uncased", num_classes=2, preprocessor=preproc)
  ```

**What to do.**

1. Train/evaluate each model once with identical settings.
2. Observe the performance metrics for each.
3. Answer the graded questions.



In [19]:
# Your code here; add as many cells as you wish

# Problem 3 — Model swap: DistilBERT vs BERT-base (safe, copy-paste)
import time
import numpy as np
import pandas as pd
import keras

# --- 0. Defaults (will be overridden if your notebook already has values) ---
DEFAULT_MAX_LEN = 512
DEFAULT_LR = 5e-6
DEFAULT_SUBSET_FRAC = 1.0   # use full training set by default
EPOCHS = 3
BATCH_FOR_512 = 8
EARLYSTOP_PATIENCE = 2

# --- 1. Detect training arrays (use same logic as Problem 2 code) ---
try:
    X_full = X_tr_all
    y_full = y_tr_all
    print("Using X_tr_all / y_tr_all as full training data.")
except NameError:
    try:
        X_full = X_tr
        y_full = y_tr
        print("X_tr_all not found — using X_tr / y_tr as full training data.")
    except NameError:
        raise RuntimeError("No training arrays found (X_tr_all or X_tr). Please ensure they exist.")

n_total = len(X_full)
if n_total == 0:
    raise RuntimeError("Training data is empty.")

# --- 2. Try to detect best MAX_LEN, LR, SUBSET_FRAC from earlier variables or df_p2 ---
used_max_len = None
used_lr = None
used_frac = None

# try variables
if 'BEST_MAX_LEN' in globals():
    used_max_len = BEST_MAX_LEN
if 'BEST_LR' in globals():
    used_lr = BEST_LR
if 'SUBSET_FRAC' in globals():
    used_frac = SUBSET_FRAC

# try df_p2 (Problem 2 summary) if present
if used_frac is None and 'df_p2' in globals():
    try:
        # choose row with smallest Best_Val_Loss
        r = df_p2.loc[df_p2["Best_Val_Loss"].idxmin()]
        used_frac = float(r["SUBSET_FRAC"])
    except Exception:
        pass

# fallback to defaults with printed warning if any missing
if used_max_len is None:
    used_max_len = DEFAULT_MAX_LEN
    print(f"BEST_MAX_LEN not found; defaulting to {used_max_len}.")
if used_lr is None:
    used_lr = DEFAULT_LR
    print(f"BEST_LR not found; defaulting to {used_lr}.")
if used_frac is None:
    used_frac = DEFAULT_SUBSET_FRAC
    print(f"SUBSET_FRAC not found; defaulting to {used_frac} (full train).")

print(f"\nUsing settings: MAX_LEN={used_max_len}, LR={used_lr}, SUBSET_FRAC={used_frac}, EPOCHS={EPOCHS}, batch={BATCH_FOR_512}")

# --- 3. Build subset (deterministic: first N examples) ---
n_use = int(round(used_frac * n_total))
idx = np.arange(n_use)
X_sub = X_full[idx]
y_sub = y_full[idx]
print(f"Training on {n_use} / {n_total} examples (SUBSET_FRAC={used_frac}).")

# --- 4. Define a small helper to train a model given model type string ---
def run_model(preset_name, preproc_class, model_class):
    """
    preset_name: string like 'distil_bert_base_en_uncased' or 'bert_base_en_uncased'
    preproc_class, model_class: kh.models.* class constructors (use .from_preset)
    Returns: dict with history, best_val_loss, val_acc_at_best, test_acc, time_s, name
    """
    name = preset_name
    print("\n" + "="*60)
    print(f"Running model: {name}")
    print("="*60)
    start = time.time()

    # create preprocessor and model fresh
    preproc = preproc_class.from_preset(preset_name, sequence_length=used_max_len)
    model = model_class.from_preset(preset_name, num_classes=2, preprocessor=preproc)

    model.compile(
        optimizer=keras.optimizers.Adam(used_lr),
        loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=[keras.metrics.SparseCategoricalAccuracy(name="acc")]
    )

    cb = [keras.callbacks.EarlyStopping(monitor="val_loss", patience=EARLYSTOP_PATIENCE, restore_best_weights=True)]

    history = model.fit(
        X_sub, y_sub,
        validation_data=(X_va, y_va),
        epochs=EPOCHS,
        batch_size=BATCH_FOR_512,
        callbacks=cb,
        verbose=1
    )

    elapsed = time.time() - start

    # evaluate test set
    logits = model.predict(X_te, batch_size=64, verbose=0)
    y_pred = logits.argmax(axis=-1)
    acc_metric = evaluate.load("accuracy")
    test_acc = acc_metric.compute(predictions=y_pred, references=y_te)["accuracy"]

    # find epoch with minimal val_loss and corresponding val_acc
    val_losses = np.array(history.history["val_loss"])
    val_accs = np.array(history.history["val_acc"])
    idx_min = int(np.argmin(val_losses))
    best_val_loss = float(val_losses[idx_min])
    val_acc_at_best = float(val_accs[idx_min])

    # cleanup
    del model
    del preproc
    keras.backend.clear_session()

    print(f"Model {name} — best_val_loss={best_val_loss:.5f}, val_acc_at_best={val_acc_at_best:.4f}, test_acc={test_acc:.4f}, time_min={(elapsed/60):.2f}")
    return {
        "name": name,
        "history": history,
        "best_val_loss": best_val_loss,
        "val_acc_at_best": val_acc_at_best,
        "test_acc": test_acc,
        "time_s": elapsed
    }

# --- 5. Run DistilBERT and BERT-base using the correct kh.models classes ---
# DistilBERT preset/name:
distil_preset = "distil_bert_base_en_uncased"
bert_preset = "bert_base_en_uncased"

# The classes to use:
DistilPre = kh.models.DistilBertTextClassifierPreprocessor
DistilModel = kh.models.DistilBertTextClassifier
BertPre = kh.models.BertTextClassifierPreprocessor
BertModel = kh.models.BertTextClassifier

results = []
results.append(run_model(distil_preset, DistilPre, DistilModel))
results.append(run_model(bert_preset, BertPre, BertModel))

# --- 6. Summarize and pick best model (by min val loss) ---
df_res = pd.DataFrame([{
    "Model": r["name"],
    "Best_Val_Loss": r["best_val_loss"],
    "Val_Acc_at_Best": r["val_acc_at_best"],
    "Test_Acc": r["test_acc"],
    "Time_min": round(r["time_s"]/60, 2)
} for r in results])
print("\nModel comparison:")
print(df_res.to_markdown(index=False))

best_idx = df_res["Best_Val_Loss"].idxmin()
best_row = df_res.loc[best_idx]
a3a = float(best_row["Val_Acc_at_Best"])
print(f"\nBest model by min val_loss: {best_row['Model']}")
print(f"a3a (validation accuracy at min val_loss for best model) = {a3a:.4f}")

# Also print a short guidance line you can paste as a3b
time_distil = df_res.loc[df_res["Model"]==distil_preset, "Time_min"].values[0]
time_bert = df_res.loc[df_res["Model"]==bert_preset, "Time_min"].values[0]
acc_distil = df_res.loc[df_res["Model"]==distil_preset, "Val_Acc_at_Best"].values[0]
acc_bert  = df_res.loc[df_res["Model"]==bert_preset, "Val_Acc_at_Best"].values[0]

print("\nQuick guidance for a3b (speed vs accuracy):")
print(f"- DistilBERT val_acc_at_best={acc_distil:.4f}, time={time_distil:.2f} min")
print(f"- BERT-base  val_acc_at_best={acc_bert:.4f},  time={time_bert:.2f} min")
print("\nInterpretation hint: BERT-base is typically slower and higher-capacity; compare the small accuracy gain (if any) to the extra training time to decide whether the accuracy improvement is worth the time.")


X_tr_all not found — using X_tr / y_tr as full training data.

Using settings: MAX_LEN=512, LR=5e-06, SUBSET_FRAC=0.25, EPOCHS=3, batch=8
Training on 2188 / 8750 examples (SUBSET_FRAC=0.25).

Running model: distil_bert_base_en_uncased
Epoch 1/3
[1m274/274[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m86s[0m 162ms/step - acc: 0.7262 - loss: 0.5453 - val_acc: 0.8824 - val_loss: 0.3026
Epoch 2/3
[1m274/274[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m13s[0m 47ms/step - acc: 0.9036 - loss: 0.2512 - val_acc: 0.8928 - val_loss: 0.2846
Epoch 3/3
[1m274/274[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m13s[0m 47ms/step - acc: 0.9406 - loss: 0.1726 - val_acc: 0.9000 - val_loss: 0.2808
Model distil_bert_base_en_uncased — best_val_loss=0.28081, val_acc_at_best=0.9000, test_acc=0.9020, time_min=2.06

Running model: bert_base_en_uncased
Downloading from https://www.kaggle.com/api/v1/models/keras/bert/keras/bert_base_en_uncased/3/download/config.json...


100%|██████████| 457/457 [00:00<00:00, 761kB/s]


Downloading from https://www.kaggle.com/api/v1/models/keras/bert/keras/bert_base_en_uncased/3/download/tokenizer.json...


100%|██████████| 761/761 [00:00<00:00, 1.49MB/s]


Downloading from https://www.kaggle.com/api/v1/models/keras/bert/keras/bert_base_en_uncased/3/download/assets/tokenizer/vocabulary.txt...


100%|██████████| 226k/226k [00:00<00:00, 679kB/s]


Downloading from https://www.kaggle.com/api/v1/models/keras/bert/keras/bert_base_en_uncased/3/download/model.weights.h5...


100%|██████████| 418M/418M [00:12<00:00, 34.8MB/s]


Epoch 1/3
[1m274/274[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m151s[0m 284ms/step - acc: 0.7820 - loss: 0.4638 - val_acc: 0.9000 - val_loss: 0.2796
Epoch 2/3
[1m274/274[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m23s[0m 84ms/step - acc: 0.9173 - loss: 0.2303 - val_acc: 0.9032 - val_loss: 0.2621
Epoch 3/3
[1m274/274[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m23s[0m 83ms/step - acc: 0.9552 - loss: 0.1437 - val_acc: 0.9048 - val_loss: 0.2758
Model bert_base_en_uncased — best_val_loss=0.26215, val_acc_at_best=0.9032, test_acc=0.9068, time_min=3.76

Model comparison:
| Model                       |   Best_Val_Loss |   Val_Acc_at_Best |   Test_Acc |   Time_min |
|:----------------------------|----------------:|------------------:|-----------:|-----------:|
| distil_bert_base_en_uncased |        0.280808 |            0.9    |     0.902  |       2.06 |
| bert_base_en_uncased        |        0.262148 |            0.9032 |     0.9068 |       3.76 |

Best model by min val_loss: 

### Graded Questions

In [20]:
# Set a1a to the validation accuracy at min validation loss for your best model found in this problem

a3a = 0.9032             # Replace 0.0 with your answer

In [21]:
# Graded Answer
# DO NOT change this cell in any way

print(f'a3a = {a3a:.4f}')

a3a = 0.9032


#### Question a3b:

**Answer briefly.**

* Which model gives the best **accuracy/F1**?
* Which is **fastest** per epoch?
* Given limited development time or compute resources, which model is the best **overall choice** and why?

#### Your Answer Here:

Which model gives the best accuracy / F1?

BERT-base gives the best accuracy (val_acc at min val_loss = 0.9032, test acc 0.9068) while DistilBERT is the fastest (total time 2.06 min vs 3.76 min for BERT in this run — Distil ≈1.8× faster per epoch). With limited compute or development time pick DistilBERT for the better speed/accuracy trade-off; pick BERT-base only if you need the absolute top accuracy and can afford the extra compute.

Which is fastest per epoch?

DistilBERT is faster. Total times: DistilBERT = 2.06 min, BERT-base = 3.76 min (for the 3 epochs you ran). Per epoch: Distil ≈ 0.69 min/epoch (≈41 seconds), BERT ≈ 1.25 min/epoch (≈75 seconds). DistilBERT is roughly 1.8× faster per epoch.

Given limited development time or compute resources, which model is the best overall choice and why?

With limited time / compute, DistilBERT is the better overall choice because it trains significantly faster while only losing a very small amount of accuracy (≈0.0032 in val_acc). If you absolutely need the best possible accuracy and have extra compute, pick BERT-base; otherwise pick DistilBERT for a better speed–accuracy trade-off.