## Group Project LLM (IMDB)

- LoRA ranks r = 2, 4, 8, 16; epochs = 10
- seed = 42
- evaluation:
    - accuracy, F1, precision, recall
    - efficiency (training time, trainable parameters, trainable-parameter ratio, convergence)
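
As a quick illustration of how the four classification metrics relate, the sketch below computes them from confusion-matrix counts. The counts are not logged by the notebook; they are reconstructed (an assumption) so that they reproduce the epoch-1 validation row of the baseline run on the 2,500-example validation split. The function name `binary_metrics` is hypothetical.

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall and F1 for the positive class (label 1)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"accuracy": accuracy, "f1": f1, "precision": precision, "recall": recall}


# Counts reconstructed from the baseline's epoch-1 validation metrics (assumed, not logged).
m = binary_metrics(tp=1170, fp=167, fn=98, tn=1065)
for name, value in m.items():
    print(f"{name}: {value:.6f}")
```

This is the same quantity that `average="binary"` reports in `compute_metrics`: precision, recall and F1 are computed for the positive class only, while accuracy counts both classes.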

In [1]:
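import warnings
# filterwarnings() prepends each rule, so the rule added last is checked first.
# Register the blanket "ignore" first, then the "__main__" exception on top of it.
warnings.filterwarnings("ignore", module=".*")
warnings.filterwarnings("default", module="__main__")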

## Base Model: DistilBERT

In [3]:
# ================== BASELINE DISTILBERT ================

import os, time, random
import numpy as np
import torch

from datasets import load_dataset
from transformers import (
    DistilBertTokenizerFast,
    DistilBertForSequenceClassification,
    DataCollatorWithPadding,
    TrainingArguments,
    Trainer,
    set_seed
)
import evaluate


SEED = 42
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
DATASET = "imdb"
TEXT_COL = "text"
LABEL_COL = "label"
NUM_EPOCHS = 10
BATCH_SIZE = 16
LR = 2e-5


def set_all_seeds(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    set_seed(seed)

set_all_seeds(SEED)

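# -------- Load dataset and split (8:1:1) --------
# Only IMDB's 25,000-example "train" split is used: 20,000 train / 2,500 val / 2,500 test.
# The official "test" split (raw["test"]) is not used in this notebook.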
raw = load_dataset(DATASET)
train_full = raw["train"]

train_temp = train_full.train_test_split(test_size=0.2, seed=SEED)
train_ds = train_temp["train"]
temp = train_temp["test"]

val_test = temp.train_test_split(test_size=0.5, seed=SEED)
val_ds = val_test["train"]
test_ds = val_test["test"]


# -------- Tokenization --------
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
def preprocess(x):
    return tokenizer(x[TEXT_COL], truncation=True, max_length=256)

train_ds = train_ds.map(preprocess, batched=True)
val_ds   = val_ds.map(preprocess, batched=True)
test_ds  = test_ds.map(preprocess, batched=True)

train_ds = train_ds.rename_column(LABEL_COL, "labels")
val_ds   = val_ds.rename_column(LABEL_COL, "labels")
test_ds  = test_ds.rename_column(LABEL_COL, "labels")

cols = ["input_ids","attention_mask","labels"]
train_ds.set_format(type="torch", columns=cols)
val_ds.set_format(type="torch", columns=cols)
test_ds.set_format(type="torch", columns=cols)

collator = DataCollatorWithPadding(tokenizer)

# -------- Metrics --------
acc = evaluate.load("accuracy")
f1 = evaluate.load("f1")
prec = evaluate.load("precision")
rec = evaluate.load("recall")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": acc.compute(predictions=preds, references=labels)["accuracy"],
        "f1": f1.compute(predictions=preds, references=labels, average="binary")["f1"],
        "precision": prec.compute(predictions=preds, references=labels, average="binary")["precision"],
        "recall": rec.compute(predictions=preds, references=labels, average="binary")["recall"],
    }

# -------- Model --------
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
).to(DEVICE)

total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
ratio = trainable_params / total_params

print(f"Baseline: total={total_params}, trainable={trainable_params}, ratio={ratio:.4%}")

# -------- Train --------
args = TrainingArguments(
    output_dir="./baseline_distilbert_imdb",
    num_train_epochs=NUM_EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    learning_rate=LR,
    eval_strategy="epoch",
    save_strategy="epoch",
    logging_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True,
    seed=SEED,
    report_to="none",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    data_collator=collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

start = time.time()
trainer.train()
end = time.time()

print(f"Baseline training time: {end-start:.2f}s")
print("Eval:", trainer.evaluate(test_ds))
print("Convergence history:", trainer.state.log_history)


Generating train split: 100%|██████████| 25000/25000 [00:00<00:00, 53605.71 examples/s]
Generating test split: 100%|██████████| 25000/25000 [00:00<00:00, 65670.62 examples/s]
Generating unsupervised split: 100%|██████████| 50000/50000 [00:00<00:00, 63399.59 examples/s] 
Map: 100%|██████████| 20000/20000 [00:04<00:00, 4190.15 examples/s]
Map: 100%|██████████| 2500/2500 [00:00<00:00, 4265.27 examples/s]
Map: 100%|██████████| 2500/2500 [00:00<00:00, 3836.64 examples/s]
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Baseline: total=66955010, trainable=66955010, ratio=100.0000%


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|---|---|
| 1 | 0.3091 | 0.246099 | 0.894 | 0.898273 | 0.875093 | 0.922713 |
| 2 | 0.1826 | 0.255972 | 0.904 | 0.904382 | 0.913849 | 0.89511 |
| 3 | 0.1088 | 0.319014 | 0.9052 | 0.908388 | 0.890826 | 0.926656 |
| 4 | 0.065 | 0.447382 | 0.9028 | 0.904743 | 0.899454 | 0.910095 |
| 5 | 0.0407 | 0.526969 | 0.8968 | 0.901901 | 0.870778 | 0.935331 |
| 6 | 0.0259 | 0.51662 | 0.9052 | 0.90753 | 0.898069 | 0.917192 |
| 7 | 0.019 | 0.575031 | 0.9008 | 0.903801 | 0.889313 | 0.91877 |
| 8 | 0.0128 | 0.557584 | 0.906 | 0.907298 | 0.907656 | 0.90694 |
| 9 | 0.0085 | 0.608972 | 0.9024 | 0.902087 | 0.918301 | 0.886435 |
| 10 | 0.0075 | 0.612088 | 0.9048 | 0.905481 | 0.912 | 0.899054 |
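
Because the run uses `load_best_model_at_end=True` with `metric_for_best_model="f1"`, the checkpoint kept at the end should be the one with the highest validation F1, not the last epoch. The sketch below re-derives that epoch from the logged values; the variable names are hypothetical and the numbers are copied from the baseline validation log above.

```python
# (epoch, validation loss, validation F1) copied from the baseline training log
baseline_val = [
    (1, 0.246099, 0.898273),
    (2, 0.255972, 0.904382),
    (3, 0.319014, 0.908388),
    (4, 0.447382, 0.904743),
    (5, 0.526969, 0.901901),
    (6, 0.516620, 0.907530),
    (7, 0.575031, 0.903801),
    (8, 0.557584, 0.907298),
    (9, 0.608972, 0.902087),
    (10, 0.612088, 0.905481),
]

# Epoch selected when the best-model metric is F1 (higher is better)
best_f1_epoch = max(baseline_val, key=lambda row: row[2])[0]
# Epoch that would be selected if validation loss were the criterion instead
lowest_loss_epoch = min(baseline_val, key=lambda row: row[1])[0]

print(f"best validation F1 at epoch {best_f1_epoch}")
print(f"lowest validation loss at epoch {lowest_loss_epoch}")
```

Validation loss is lowest after the first epoch and rises afterwards while training loss keeps falling, which is the usual overfitting pattern for full fine-tuning. Validation F1 stays within roughly one point across epochs, so on a single seed the choice of "best" epoch is likely sensitive to noise.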


Baseline training time: 1231.16s


Eval: {'eval_loss': 0.3191632926464081, 'eval_accuracy': 0.9068, 'eval_f1': 0.9085916045508042, 'eval_precision': 0.8832951945080092, 'eval_recall': 0.9353796445880452, 'eval_runtime': 5.713, 'eval_samples_per_second': 437.602, 'eval_steps_per_second': 13.828, 'epoch': 10.0}
Convergence history: [{'loss': 0.3091, 'grad_norm': 4.842110633850098, 'learning_rate': 1.80032e-05, 'epoch': 1.0, 'step': 625}, {'eval_loss': 0.2460990995168686, 'eval_accuracy': 0.894, 'eval_f1': 0.8982725527831094, 'eval_precision': 0.87509349289454, 'eval_recall': 0.9227129337539433, 'eval_runtime': 5.7236, 'eval_samples_per_second': 436.788, 'eval_steps_per_second': 13.803, 'epoch': 1.0, 'step': 625}, {'loss': 0.1826, 'grad_norm': 2.859170436859131, 'learning_rate': 1.60032e-05, 'epoch': 2.0, 'step': 1250}, {'eval_loss': 0.2559722065925598, 'eval_accuracy': 0.904, 'eval_f1': 0.9043824701195219, 'eval_precision': 0.9138486312399355, 'eval_recall': 0.8951104100946372, 'eval_runtime': 5.7662, 'eval_samples_per_se

In [None]:
import json
import pandas as pd
# -------- Final Evaluation --------
final_metrics = trainer.evaluate(test_ds)

# -------- Save metrics --------
os.makedirs("./baseline_distilbert_imdb", exist_ok=True)
with open("./baseline_distilbert_imdb/final_metrics.json", "w") as f:
    json.dump(final_metrics, f, indent=4)

print("Saved final metrics to baseline_distilbert_imdb/final_metrics.json")

# -------- Save model --------
trainer.save_model("./baseline_distilbert_imdb/final_model")
print("Saved model to baseline_distilbert_imdb/final_model")

# -------- Convergence history --------
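log_history = trainer.state.log_history
df_logs = pd.DataFrame(log_history)
# Split into training rows (have "loss") and evaluation rows (have "eval_loss").
# Note: calls such as trainer.evaluate(test_ds) are logged too, so df_eval may
# contain test-set rows after the per-epoch validation rows.
df_train = df_logs[df_logs["loss"].notnull()].reset_index(drop=True)
df_eval  = df_logs[df_logs["eval_loss"].notnull()].reset_index(drop=True)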

df_train.to_csv("./baseline_distilbert_imdb/train_log.csv", index=False)
df_eval.to_csv("./baseline_distilbert_imdb/eval_log.csv", index=False)

## Sparse LoRA
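
The cell in this section fine-tunes LoRA adapters with an added L1 penalty. As a sketch of the objective that `SparseLoraTrainer.compute_loss` appears to implement (the notation here is ours, not the notebook's):

```latex
\mathcal{L}(\theta)
  \;=\; \mathcal{L}_{\mathrm{CE}}(\theta)
  \;+\; \lambda \sum_{W \in \Theta_{\mathrm{LoRA}}} \lVert W \rVert_{1},
\qquad \lambda = 10^{-5}
```

Here $\mathcal{L}_{\mathrm{CE}}$ is the classification loss returned by the model (`outputs.loss`), $\Theta_{\mathrm{LoRA}}$ is the set of trainable tensors whose name contains `lora_`, and $\lambda$ is `L1_LAMBDA`. The penalty is a sum rather than a mean, so its magnitude grows with the rank `r`. A gradient-based optimizer shrinks weights under this penalty but rarely drives them to exactly zero, which is presumably why sparsity is reported as the fraction of weights with $|w| < 10^{-3}$ rather than the fraction of exact zeros.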

In [4]:
# ================== SPARSE LoRA MODEL =================

from typing import Dict, Any, List, Optional
import json
import math
from peft import LoraConfig, get_peft_model

# -------- Sparse LoRA config --------
RANKS: List[int] = [2, 4, 8, 16]
L1_LAMBDA = 1e-5   # sparsity strength for LoRA weights


def count_trainable_params(model: torch.nn.Module) -> int:
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


def count_total_params(model: torch.nn.Module) -> int:
    return sum(p.numel() for p in model.parameters())


def compute_lora_sparsity(model: torch.nn.Module, threshold: float = 1e-3) -> float:
    """
    Approximate sparsity: fraction of LoRA parameters with |w| < threshold.
    """
    total = 0
    near_zero = 0
    for name, param in model.named_parameters():
        if "lora_" in name and param.requires_grad:
            data = param.detach().abs()
            total += data.numel()
            near_zero += (data < threshold).sum().item()
    return near_zero / total if total > 0 else math.nan


class SparseLoraTrainer(Trainer):
    """
    Trainer with L1 penalty only on LoRA parameters.
    """
    def __init__(self, *args, l1_lambda: float = 0.0, **kwargs):
        # l1_lambda is keyword-only so positional Trainer arguments are never
        # captured by it by mistake.
        super().__init__(*args, **kwargs)
        self.l1_lambda = l1_lambda

    def compute_loss(
        self,
        model,
        inputs,
        return_outputs: bool = False,
        num_items_in_batch: Optional[int] = None,
    ):
        outputs = model(**inputs)
        loss = outputs.loss

        if self.l1_lambda > 0:
            l1_reg = 0.0
            for name, param in model.named_parameters():
                if "lora_" in name and param.requires_grad:
                    l1_reg = l1_reg + param.abs().sum()
            loss = loss + self.l1_lambda * l1_reg

        return (loss, outputs) if return_outputs else loss


results_per_rank: List[Dict[str, Any]] = []

for r in RANKS:
    print("\n" + "=" * 80)
    print(f"Training Sparse LoRA DistilBERT with rank = {r}, epochs = {NUM_EPOCHS}")
    print("=" * 80)

    set_all_seeds(SEED)

    # Base DistilBERT for this rank
    base_model = DistilBertForSequenceClassification.from_pretrained(
        "distilbert-base-uncased",
        num_labels=2,
    )

    # LoRA config: attention projections in DistilBERT
    lora_config = LoraConfig(
        r=r,
        lora_alpha=2 * r,
        lora_dropout=0.1,
        bias="none",
        task_type="SEQ_CLS",    # sequence classification
        target_modules=["q_lin", "k_lin", "v_lin", "out_lin"],
    )

    lora_model = get_peft_model(base_model, lora_config)
    lora_model.to(DEVICE)

    total_params = count_total_params(lora_model)
    trainable_params = count_trainable_params(lora_model)
    param_ratio = trainable_params / total_params

    print(f"[Rank {r}] total params: {total_params:,}")
    print(f"[Rank {r}] trainable params: {trainable_params:,}")
    print(f"[Rank {r}] trainable params ratio (trainable / total): {param_ratio:.4%}")

    output_dir = f"./sparse_lora_rank{r}_imdb"
    os.makedirs(output_dir, exist_ok=True)

    training_args_lora = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=NUM_EPOCHS,
        per_device_train_batch_size=BATCH_SIZE,
        per_device_eval_batch_size=BATCH_SIZE,
        learning_rate=LR,
        weight_decay=0.01,
        eval_strategy="epoch",
        save_strategy="epoch",
        logging_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model="f1",
        greater_is_better=True,
        seed=SEED,
        report_to="none",
    )

    trainer = SparseLoraTrainer(
        l1_lambda=L1_LAMBDA,
        model=lora_model,
        args=training_args_lora,
        train_dataset=train_ds,
        eval_dataset=val_ds,
        tokenizer=tokenizer,
        data_collator=collator,
        compute_metrics=compute_metrics,
    )

    start_time = time.time()
    trainer.train()
    end_time = time.time()
    train_time = end_time - start_time
    print(f"[Rank {r}] Training time: {train_time:.2f} seconds")

    # --- final evals ---
    val_metrics = trainer.evaluate(eval_dataset=val_ds)
    test_metrics = trainer.evaluate(eval_dataset=test_ds)
    lora_sparsity = compute_lora_sparsity(lora_model, threshold=1e-3)

    print(f"[Rank {r}] Validation metrics: {val_metrics}")
    print(f"[Rank {r}] Test metrics: {test_metrics}")
    print(f"[Rank {r}] LoRA sparsity (<1e-3): {lora_sparsity:.2%}")

    # ==========================
    # SAVE METRICS / MODEL / LOG
    # ==========================
    # 1) save metrics
    metrics_payload = {
        "rank": r,
        "train_time_sec": train_time,
        "total_params": int(total_params),
        "trainable_params": int(trainable_params),
        "param_ratio": float(param_ratio),
        "lora_sparsity_<1e-3": float(lora_sparsity),
        "val_metrics": val_metrics,
        "test_metrics": test_metrics,
    }
    with open(os.path.join(output_dir, "final_metrics.json"), "w") as f:
        json.dump(metrics_payload, f, indent=4)

    # 2) save final model (best checkpoint)
    final_model_dir = os.path.join(output_dir, "final_model")
    trainer.save_model(final_model_dir)  # saves model + config
    tokenizer.save_pretrained(final_model_dir)  # save tokenizer too
    print(f"[Rank {r}] Saved model to {final_model_dir}")

    # 3) save convergence history
    log_history = trainer.state.log_history
    with open(os.path.join(output_dir, "log_history.json"), "w") as f:
        json.dump(log_history, f, indent=4)
    print(f"[Rank {r}] Saved log history to log_history.json")

    # --- store in-memory summary for printing ---
    results_per_rank.append(
        {
            "rank": r,
            "total_params": total_params,
            "trainable_params": trainable_params,
            "param_ratio": param_ratio,
            "train_time_sec": train_time,
            "val_metrics": val_metrics,
            "test_metrics": test_metrics,
            "lora_sparsity(<1e-3)": lora_sparsity,
        }
    )

print("\n\n=== Summary over ranks (Sparse LoRA) ===")
for res in results_per_rank:
    r = res["rank"]
    print(f"\nRank {r}:")
    print(f"  Params: {res['trainable_params']:,} / {res['total_params']:,} "
          f"({res['param_ratio']:.2%})")
    print(f"  Train time: {res['train_time_sec']:.2f} s")
    print(f"  Val F1:  {res['val_metrics'].get('eval_f1', float('nan')):.4f}, "
          f"Acc: {res['val_metrics'].get('eval_accuracy', float('nan')):.4f}")
    print(f"  Test F1: {res['test_metrics'].get('eval_f1', float('nan')):.4f}, "
          f"Acc: {res['test_metrics'].get('eval_accuracy', float('nan')):.4f}")
    print(f"  LoRA sparsity (<1e-3): {res['lora_sparsity(<1e-3)']:.2%}")


Training Sparse LoRA DistilBERT with rank = 2, epochs = 10


[Rank 2] total params: 67,620,868
[Rank 2] trainable params: 665,858
[Rank 2] trainable params ratio (trainable / total): 0.9847%
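
The parameter counts printed for each rank can be reproduced by hand. The sketch below assumes DistilBERT-base dimensions (hidden size 768, 6 transformer layers) and assumes that PEFT keeps a trainable copy of the classification head (`pre_classifier` and `classifier`) next to the frozen original. Both assumptions are inferred from the printed numbers, not stated in the notebook, and the helper names are hypothetical.

```python
HIDDEN = 768             # DistilBERT-base hidden size (assumed)
NUM_LAYERS = 6           # transformer layers in DistilBERT-base (assumed)
TARGETS_PER_LAYER = 4    # q_lin, k_lin, v_lin, out_lin
NUM_LABELS = 2
BASE_TOTAL = 66_955_010  # total parameters printed for the baseline model


def head_params() -> int:
    """Weights + biases of pre_classifier (768 -> 768) and classifier (768 -> 2)."""
    pre_classifier = HIDDEN * HIDDEN + HIDDEN
    classifier = HIDDEN * NUM_LABELS + NUM_LABELS
    return pre_classifier + classifier


def lora_params(r: int) -> int:
    """Each adapted 768x768 linear layer gets A (r x 768) and B (768 x r)."""
    return NUM_LAYERS * TARGETS_PER_LAYER * r * (HIDDEN + HIDDEN)


def expected_counts(r: int) -> tuple:
    trainable = lora_params(r) + head_params()
    total = BASE_TOTAL + trainable  # frozen base + adapters + trainable head copy
    return trainable, total


for r in (2, 4, 8, 16):
    trainable, total = expected_counts(r)
    print(f"r={r}: trainable={trainable:,} total={total:,} ratio={trainable / total:.4%}")
```

Under these assumptions the classification head accounts for 592,130 of the trainable parameters at every rank, so the adapters themselves are only a small part of the reported ratio at low ranks (73,728 parameters at r = 2).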


| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|---|---|
| 1 | 0.4988 | 0.354335 | 0.8536 | 0.856245 | 0.852895 | 0.859621 |
| 2 | 0.3368 | 0.326381 | 0.8644 | 0.866902 | 0.863174 | 0.870662 |
| 3 | 0.3143 | 0.312004 | 0.8756 | 0.877318 | 0.877664 | 0.876972 |
| 4 | 0.3034 | 0.302004 | 0.8776 | 0.879717 | 0.876959 | 0.882492 |
| 5 | 0.2965 | 0.299173 | 0.8732 | 0.876893 | 0.86381 | 0.890379 |
| 6 | 0.2898 | 0.292821 | 0.8824 | 0.883241 | 0.8896 | 0.876972 |
| 7 | 0.2882 | 0.289809 | 0.8804 | 0.882237 | 0.881196 | 0.883281 |
| 8 | 0.2858 | 0.288128 | 0.8824 | 0.884615 | 0.880469 | 0.888801 |
| 9 | 0.2828 | 0.286901 | 0.8812 | 0.88219 | 0.88747 | 0.876972 |
| 10 | 0.2829 | 0.286499 | 0.8828 | 0.8846 | 0.883556 | 0.885647 |


[Rank 2] Training time: 1115.94 seconds


[Rank 2] Validation metrics: {'eval_loss': 0.2881282567977905, 'eval_accuracy': 0.8824, 'eval_f1': 0.8846153846153846, 'eval_precision': 0.88046875, 'eval_recall': 0.888801261829653, 'eval_runtime': 6.6931, 'eval_samples_per_second': 373.519, 'eval_steps_per_second': 11.803, 'epoch': 10.0}
[Rank 2] Test metrics: {'eval_loss': 0.2873707413673401, 'eval_accuracy': 0.8836, 'eval_f1': 0.8841099163679809, 'eval_precision': 0.8719560094265515, 'eval_recall': 0.8966074313408724, 'eval_runtime': 6.5961, 'eval_samples_per_second': 379.013, 'eval_steps_per_second': 11.977, 'epoch': 10.0}
[Rank 2] LoRA sparsity (<1e-3): 13.77%
[Rank 2] Saved model to ./sparse_lora_rank2_imdb/final_model
[Rank 2] Saved log history to log_history.json

Training Sparse LoRA DistilBERT with rank = 4, epochs = 10


[Rank 4] total params: 67,694,596
[Rank 4] trainable params: 739,586
[Rank 4] trainable params ratio (trainable / total): 1.0925%


| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|---|---|
| 1 | 0.4756 | 0.348947 | 0.858 | 0.861382 | 0.853055 | 0.869874 |
| 2 | 0.33 | 0.322691 | 0.8676 | 0.871156 | 0.860108 | 0.882492 |
| 3 | 0.308 | 0.307732 | 0.8768 | 0.878068 | 0.881558 | 0.874606 |
| 4 | 0.2987 | 0.297457 | 0.88 | 0.881047 | 0.885965 | 0.876183 |
| 5 | 0.2913 | 0.297204 | 0.8788 | 0.882966 | 0.865254 | 0.90142 |
| 6 | 0.2852 | 0.28992 | 0.8824 | 0.882588 | 0.894013 | 0.871451 |
| 7 | 0.2834 | 0.286207 | 0.8876 | 0.888801 | 0.891978 | 0.885647 |
| 8 | 0.2809 | 0.28441 | 0.8892 | 0.890902 | 0.889851 | 0.891956 |
| 9 | 0.2779 | 0.283639 | 0.8872 | 0.887917 | 0.895032 | 0.880915 |
| 10 | 0.2775 | 0.283132 | 0.89 | 0.891175 | 0.894361 | 0.888013 |


[Rank 4] Training time: 1114.57 seconds


[Rank 4] Validation metrics: {'eval_loss': 0.28313153982162476, 'eval_accuracy': 0.89, 'eval_f1': 0.891175306687772, 'eval_precision': 0.8943606036536934, 'eval_recall': 0.88801261829653, 'eval_runtime': 6.2777, 'eval_samples_per_second': 398.235, 'eval_steps_per_second': 12.584, 'epoch': 10.0}
[Rank 4] Test metrics: {'eval_loss': 0.2854563295841217, 'eval_accuracy': 0.8904, 'eval_f1': 0.8904876099120703, 'eval_precision': 0.8813291139240507, 'eval_recall': 0.8998384491114702, 'eval_runtime': 6.4066, 'eval_samples_per_second': 390.224, 'eval_steps_per_second': 12.331, 'epoch': 10.0}
[Rank 4] LoRA sparsity (<1e-3): 17.68%
[Rank 4] Saved model to ./sparse_lora_rank4_imdb/final_model
[Rank 4] Saved log history to log_history.json

Training Sparse LoRA DistilBERT with rank = 8, epochs = 10


[Rank 8] total params: 67,842,052
[Rank 8] trainable params: 887,042
[Rank 8] trainable params ratio (trainable / total): 1.3075%


| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|---|---|
| 1 | 0.4684 | 0.353341 | 0.8612 | 0.866075 | 0.848073 | 0.884858 |
| 2 | 0.3317 | 0.326179 | 0.8692 | 0.873501 | 0.857251 | 0.890379 |
| 3 | 0.3093 | 0.31054 | 0.8824 | 0.883241 | 0.8896 | 0.876972 |
| 4 | 0.3001 | 0.299509 | 0.8856 | 0.886778 | 0.890302 | 0.883281 |
| 5 | 0.2933 | 0.302488 | 0.886 | 0.890595 | 0.867614 | 0.914826 |
| 6 | 0.2866 | 0.293456 | 0.8888 | 0.888532 | 0.903752 | 0.873817 |
| 7 | 0.285 | 0.289793 | 0.8904 | 0.891528 | 0.895072 | 0.888013 |
| 8 | 0.2822 | 0.288134 | 0.8924 | 0.894634 | 0.888716 | 0.900631 |
| 9 | 0.2783 | 0.287695 | 0.8904 | 0.89101 | 0.898876 | 0.883281 |
| 10 | 0.2782 | 0.287262 | 0.892 | 0.893196 | 0.896032 | 0.890379 |


[Rank 8] Training time: 1117.01 seconds


[Rank 8] Validation metrics: {'eval_loss': 0.28813374042510986, 'eval_accuracy': 0.8924, 'eval_f1': 0.8946337641989816, 'eval_precision': 0.888715953307393, 'eval_recall': 0.9006309148264984, 'eval_runtime': 6.5995, 'eval_samples_per_second': 378.815, 'eval_steps_per_second': 11.971, 'epoch': 10.0}
[Rank 8] Test metrics: {'eval_loss': 0.29389551281929016, 'eval_accuracy': 0.8872, 'eval_f1': 0.8877388535031847, 'eval_precision': 0.8751962323390895, 'eval_recall': 0.9006462035541195, 'eval_runtime': 6.407, 'eval_samples_per_second': 390.196, 'eval_steps_per_second': 12.33, 'epoch': 10.0}
[Rank 8] LoRA sparsity (<1e-3): 23.14%


[Rank 8] Saved model to ./sparse_lora_rank8_imdb/final_model
[Rank 8] Saved log history to log_history.json

Training Sparse LoRA DistilBERT with rank = 16, epochs = 10


[Rank 16] total params: 68,136,964
[Rank 16] trainable params: 1,181,954
[Rank 16] trainable params ratio (trainable / total): 1.7347%


| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|---|---|
| 1 | 0.4712 | 0.36356 | 0.8648 | 0.869195 | 0.853343 | 0.885647 |
| 2 | 0.3439 | 0.339625 | 0.8744 | 0.879509 | 0.856502 | 0.903785 |
| 3 | 0.3209 | 0.324419 | 0.8824 | 0.882306 | 0.895935 | 0.869085 |
| 4 | 0.3121 | 0.311804 | 0.8856 | 0.886328 | 0.893429 | 0.879338 |
| 5 | 0.3045 | 0.320701 | 0.884 | 0.889734 | 0.859031 | 0.922713 |
| 6 | 0.2971 | 0.307862 | 0.8896 | 0.889246 | 0.905229 | 0.873817 |
| 7 | 0.2948 | 0.30366 | 0.8924 | 0.893212 | 0.899281 | 0.887224 |
| 8 | 0.2911 | 0.302278 | 0.8932 | 0.895825 | 0.886486 | 0.905363 |
| 9 | 0.2867 | 0.301418 | 0.8932 | 0.894006 | 0.90008 | 0.888013 |
| 10 | 0.2866 | 0.301111 | 0.8932 | 0.894257 | 0.89817 | 0.890379 |


[Rank 16] Training time: 1120.20 seconds


[Rank 16] Validation metrics: {'eval_loss': 0.3022783696651459, 'eval_accuracy': 0.8932, 'eval_f1': 0.8958252048380804, 'eval_precision': 0.8864864864864865, 'eval_recall': 0.9053627760252366, 'eval_runtime': 6.5918, 'eval_samples_per_second': 379.262, 'eval_steps_per_second': 11.985, 'epoch': 10.0}
[Rank 16] Test metrics: {'eval_loss': 0.3085694909095764, 'eval_accuracy': 0.8904, 'eval_f1': 0.8911834789515488, 'eval_precision': 0.8765625, 'eval_recall': 0.9063004846526656, 'eval_runtime': 6.4227, 'eval_samples_per_second': 389.247, 'eval_steps_per_second': 12.3, 'epoch': 10.0}
[Rank 16] LoRA sparsity (<1e-3): 29.31%
[Rank 16] Saved model to ./sparse_lora_rank16_imdb/final_model
[Rank 16] Saved log history to log_history.json
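
To put the efficiency numbers side by side, the sketch below compares each Sparse LoRA run with the full fine-tuning baseline, using only the training times and test F1 scores printed in the logs above. The dictionary layout and the `compare` helper are hypothetical conveniences, not part of the notebook.

```python
# Values copied from the printed logs of this notebook
BASELINE = {"train_time_sec": 1231.16, "test_f1": 0.9085916045508042}
SPARSE_LORA = {
    2: {"train_time_sec": 1115.94, "test_f1": 0.8841099163679809},
    4: {"train_time_sec": 1114.57, "test_f1": 0.8904876099120703},
    8: {"train_time_sec": 1117.01, "test_f1": 0.8877388535031847},
    16: {"train_time_sec": 1120.20, "test_f1": 0.8911834789515488},
}


def compare(baseline: dict, runs: dict) -> dict:
    """Return {rank: (relative time saved, absolute test-F1 gap)} vs. the baseline."""
    rows = {}
    for rank, run in runs.items():
        time_saved = 1.0 - run["train_time_sec"] / baseline["train_time_sec"]
        f1_gap = baseline["test_f1"] - run["test_f1"]
        rows[rank] = (time_saved, f1_gap)
    return rows


comparison = compare(BASELINE, SPARSE_LORA)
for rank, (time_saved, f1_gap) in comparison.items():
    print(f"rank {rank:>2}: time saved {time_saved:.1%}, test F1 gap {f1_gap:.4f}")
```

On these logs every rank trains in roughly 9% less wall-clock time than the baseline while scoring about 2 F1 points lower on the test split, even though only 1 to 2% of the parameters are trainable. All runs use a single seed, so the small differences between ranks (for example rank 8 scoring below rank 4 on test) may be within run-to-run noise.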


=== Summary over ranks (Sparse LoRA) ===

Rank 2:
  Params: 665,858 / 67,620,868 (0.98%)
  Train time: 1115.94 s
  Val F1:  0.8846, Acc: 0.8824
  Test F1: 0.8841, Acc: 0.8836
  LoRA sparsity (<1e-3): 13.77%

Rank 4:
  Params: 739,586 / 67,694,596 (1.09%)
  Train