## **Deep Learning CS 6953 - Spring 2025**   
**Esteban D Lopez**   
**Shruti Karkamar**   
**Steven Granaturov**   
  
**Project 2**

AI Disclaimer: OpenAI's and Google's models were consulted for this assignment to obtain simple explanations of the RoBERTa and LoRA architectures, to understand the process conceptually, and to optimize and review code for errors.

# Starter Notebook

Install and import required libraries

In [1]:
!pip install transformers datasets evaluate accelerate peft trl bitsandbytes
!pip install nvidia-ml-py3

In [2]:
import os
import pandas as pd
import torch
from transformers import RobertaModel, RobertaTokenizer, TrainingArguments, Trainer, DataCollatorWithPadding, RobertaForSequenceClassification
from peft import LoraConfig, get_peft_model, PeftModel
from datasets import load_dataset, Dataset, ClassLabel
import pickle
torch.backends.cudnn.benchmark = True

## Load Tokenizer and Preprocess Data

In [3]:
base_model = 'roberta-base'

dataset = load_dataset('ag_news', split='train')
tokenizer = RobertaTokenizer.from_pretrained(base_model)

def preprocess(examples):
    tokenized = tokenizer(examples['text'], truncation=True, padding=True)
    return tokenized

tokenized_dataset = dataset.map(preprocess, batched=True,  remove_columns=["text"])
tokenized_dataset = tokenized_dataset.rename_column("label", "labels")

In [4]:
# Extract the number of classes and their names
num_labels = dataset.features['label'].num_classes
class_names = dataset.features["label"].names
print(f"number of labels: {num_labels}")
print(f"the labels: {class_names}")

# Create an id2label mapping
# We will need this for our classifier.
id2label = {i: label for i, label in enumerate(class_names)}

data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="pt")


number of labels: 4
the labels: ['World', 'Sports', 'Business', 'Sci/Tech']


## Load Pre-trained Model
Set up the configuration for the pretrained model and download it from Hugging Face.

In [5]:
model = RobertaForSequenceClassification.from_pretrained(
    base_model,
    id2label=id2label)
model

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


RobertaForSequenceClassification(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
         

## Anything from here on can be modified

In [6]:
# Split the original training set
split_datasets = tokenized_dataset.train_test_split(test_size=640, seed=42)
train_dataset = split_datasets['train']
eval_dataset = split_datasets['test']

## Setup LoRA Config
Set up the PEFT config and get the PEFT model for fine-tuning.

We apply LoRA to efficiently fine-tune the model while significantly reducing the number of trainable parameters:
- **Rank (`r=7`)**: Balances model adaptability with limited complexity.
- **Alpha (`α=15`)**: Scales the LoRA update by `α/r` (here 15/7 ≈ 2.14), controlling how strongly the adapters adjust the original weights.
- **Dropout (`0.09`)**: Helps prevent overfitting.
- **Target Modules (`query`, `value`, `key`)**: Crucial attention layers where fine-tuning is most beneficial.
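
To make the roles of the rank and alpha concrete, here is a minimal pure-Python sketch of a LoRA-adapted linear layer on toy dimensions. It assumes the standard LoRA formulation, `h = W x + (alpha / r) * B (A x)`, without dropout or bias; the helper names (`matvec`, `lora_forward`) and the toy matrices are illustrative and not part of the notebook or the PEFT library.

```python
# Minimal sketch of a LoRA-adapted linear layer (toy sizes, no dropout, no bias).
# Assumption: standard LoRA scaling of alpha / r applied to the low-rank update.

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, scaling):
    """Return W x + scaling * B (A x)."""
    base = matvec(W, x)                # frozen pretrained projection
    update = matvec(B, matvec(A, x))   # low-rank path: d_in -> r -> d_out
    return [b + scaling * u for b, u in zip(base, update)]

# Scaling implied by the notebook's configuration (lora_alpha=15, r=7).
notebook_scaling = 15 / 7

# Toy example: d_in = d_out = 3 and rank 2 (the notebook uses 768 and rank 7).
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
A = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.1]]    # shape (r, d_in)
B_zero = [[0.0, 0.0] for _ in range(3)]    # shape (d_out, r), all zeros
x = [1.0, 2.0, 3.0]

# With B at zero the adapter contributes nothing, so the output equals W x.
out_at_init = lora_forward(W, A, B_zero, x, notebook_scaling)
print(out_at_init, round(notebook_scaling, 3))
```

Because the `B` matrix starts at zero by default in PEFT, the adapted model initially reproduces the pretrained projections, and training only has to learn the low-rank correction.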

In [22]:
# PEFT Config
peft_config = LoraConfig(
    r=7,
    lora_alpha=15,
    lora_dropout=0.09,
    bias='none',
    target_modules=['query', 'value','key'],
    task_type="SEQ_CLS",
)

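We combine our LoRA configuration with the RoBERTa model to create a fine-tunable model (`peft_model`). The pretrained RoBERTa weights are frozen; the LoRA adapter weights and the classification head remain trainable, which together fit our parameter budget.

In [23]:
peft_model = get_peft_model(model, peft_config)
# NOTE: parameter names in the PEFT-wrapped model start with "base_model.model.",
# so the startswith("classifier") check below never matches and the classification
# head stays trainable (consistent with the 980,740 trainable parameters reported later).
for name, param in peft_model.named_parameters():
    if name.startswith("classifier"):        # both dense & out_proj
        param.requires_grad = False
peft_model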

PeftModelForSequenceClassification(
  (base_model): LoraModel(
    (model): RobertaForSequenceClassification(
      (roberta): RobertaModel(
        (embeddings): RobertaEmbeddings(
          (word_embeddings): Embedding(50265, 768, padding_idx=1)
          (position_embeddings): Embedding(514, 768, padding_idx=1)
          (token_type_embeddings): Embedding(1, 768)
          (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (encoder): RobertaEncoder(
          (layer): ModuleList(
            (0-11): 12 x RobertaLayer(
              (attention): RobertaAttention(
                (self): RobertaSdpaSelfAttention(
                  (query): lora.Linear(
                    (base_layer): Linear(in_features=768, out_features=768, bias=True)
                    (lora_dropout): ModuleDict(
                      (default): Dropout(p=0.09, inplace=False)
                    )
                    (lora_A): Mo

To confirm our LoRA integration, we inspect and list out the model parameters explicitly set as trainable. This step ensures compliance with the project's parameter limit (≤1 million parameters).

In [24]:
print("Trainable parameters:")
for name, param in peft_model.named_parameters():
   if param.requires_grad:
      print(name)

Trainable parameters:
base_model.model.roberta.encoder.layer.0.attention.self.query.lora_A.default.weight
base_model.model.roberta.encoder.layer.0.attention.self.query.lora_B.default.weight
base_model.model.roberta.encoder.layer.0.attention.self.key.lora_A.default.weight
base_model.model.roberta.encoder.layer.0.attention.self.key.lora_B.default.weight
base_model.model.roberta.encoder.layer.0.attention.self.value.lora_A.default.weight
base_model.model.roberta.encoder.layer.0.attention.self.value.lora_B.default.weight
base_model.model.roberta.encoder.layer.1.attention.self.query.lora_A.default.weight
base_model.model.roberta.encoder.layer.1.attention.self.query.lora_B.default.weight
base_model.model.roberta.encoder.layer.1.attention.self.key.lora_A.default.weight
base_model.model.roberta.encoder.layer.1.attention.self.key.lora_B.default.weight
base_model.model.roberta.encoder.layer.1.attention.self.value.lora_A.default.weight
base_model.model.roberta.encoder.layer.1.attention.self.value.

We print the total number of trainable parameters and their percentage compared to the full model, verifying our adherence to the project constraint (≤1 million parameters).

In [26]:
print('PEFT Model')
peft_model.print_trainable_parameters()

PEFT Model
trainable params: 980,740 || all params: 125,629,448 || trainable%: 0.7807
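
The reported count can be reproduced by hand. The sketch below is our own arithmetic, not PEFT output: it assumes the LoRA matrices have shapes `(r, 768)` and `(768, r)` for each of the three targeted projections in all 12 layers, and that the classification head (a 768x768 dense layer plus a 768x4 output projection, both with biases) is trainable.

```python
# Reproduce the trainable-parameter count from the model dimensions.
hidden_size = 768      # RoBERTa-base hidden width
num_layers = 12        # encoder layers
rank = 7               # LoRA rank r
num_targets = 3        # query, key, value
num_labels = 4         # AG News classes

# Each adapted projection adds A (rank x hidden) and B (hidden x rank).
lora_per_module = rank * hidden_size + hidden_size * rank
lora_total = lora_per_module * num_targets * num_layers

# Classification head: dense (768 -> 768) and out_proj (768 -> 4), with biases.
head_dense = hidden_size * hidden_size + hidden_size
head_out_proj = hidden_size * num_labels + num_labels
head_total = head_dense + head_out_proj

trainable_total = lora_total + head_total
trainable_pct = 100 * trainable_total / 125_629_448  # "all params" from the output above

print(lora_total, head_total, trainable_total, round(trainable_pct, 4))
```

Under these assumptions the LoRA adapters account for 387,072 parameters and the classification head for 593,668, which sums to the 980,740 reported by `print_trainable_parameters()`.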


## Training Setup

We define accuracy as our primary evaluation metric since the task (text classification) emphasizes correctness of predictions. This metric guides our model training, selection, and hyperparameter tuning decisions.

In [27]:
# To track evaluation accuracy during training
from sklearn.metrics import accuracy_score

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    # Calculate accuracy
    accuracy = accuracy_score(labels, preds)
    return {
        'accuracy': accuracy
    }

We set the training parameters based on preliminary experimentation:
- **Learning Rate:** Default of 3e-5, typical for transformer fine-tuning (the sweep in the training cell overrides this value).
- **Epochs:** Set at 6 for effective yet efficient training.
- **Batch Sizes:** Moderate sizes (train: 256, eval: 128) to make full use of the A100 GPU.
- **Mixed Precision (bf16=True):** Speeds up training and reduces memory use on the A100.
- **Optimizer:** AdamW optimizer with weight decay (0.01) to reduce overfitting.
- **Scheduler:** Linear learning rate scheduler with 500 warmup steps to improve training stability.
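
The scheduler settings can be sanity-checked with a small sketch. The function below is our own re-implementation of a linear warmup followed by linear decay, not the library's scheduler, and the step counts assume a single device, no gradient accumulation, and that the final partial batch of each epoch is kept.

```python
import math

def linear_schedule_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate after `step` optimizer steps: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Step counts implied by the notebook's settings (assumptions stated above).
train_examples = 120_000 - 640                      # training split after holding out 640
steps_per_epoch = math.ceil(train_examples / 256)   # per_device_train_batch_size=256
total_steps = steps_per_epoch * 6                   # num_train_epochs=6

for step in (0, 250, 500, 1500, total_steps):
    print(step, linear_schedule_lr(step, 3e-5, 500, total_steps))
```

Under these assumptions one epoch is 467 steps, so the 500 warmup steps span slightly more than the first epoch of the 2,802 total steps.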

In [28]:
# Setup Training args
output_dir = "results"
training_args = TrainingArguments(
    output_dir=output_dir,
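    report_to=None,  # NOTE: None leaves the default logging integrations enabled (e.g. W&B if installed); use "none" to disable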
    eval_steps=200,
    eval_strategy='steps',
    save_strategy='steps',
    save_steps=200,
    logging_steps=100,
    learning_rate=3e-5,
    num_train_epochs=6,
    use_cpu=False,
    dataloader_num_workers=12,
    per_device_train_batch_size=256,
    per_device_eval_batch_size=128,
    optim="adamw_torch",
    weight_decay = 0.01, #regularization
    fp16 = False, #mixed precision
    bf16=True,
    lr_scheduler_type="linear", #linear decay with warmup
    warmup_steps = 500, #warmup
    gradient_checkpointing=False,
    gradient_checkpointing_kwargs={'use_reentrant':True},
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    greater_is_better=True
)

def get_trainer(model):
      return  Trainer(
          model=model,
          args=training_args,
          compute_metrics=compute_metrics,
          train_dataset=train_dataset,
          tokenizer=tokenizer,
          eval_dataset=eval_dataset,
          data_collator=data_collator,
      )

### Start Training

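We sweep over three learning rates (1e-5, 3e-5, 5e-5). Note that the loop reuses the same `peft_model`, so each run continues from the weights left by the previous one. The later runs therefore start from an already fine-tuned model (validation accuracy is about 0.91 at step 200 of the second run), and the three runs are sequential training stages rather than independent comparisons of learning rates.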

In [29]:
#Hyperparameter sweep over learning rates
for lr in [1e-5, 3e-5, 5e-5]:
    # update the LR in your TrainingArguments object
    training_args.learning_rate = lr

    # re‑instantiate a fresh Trainer with that new LR
    trainer = get_trainer(peft_model)

    print(f"\n=== Training with learning_rate = {lr} ===")
    trainer.train()
    print("Done.\n")

No label_names provided for model class `PeftModelForSequenceClassification`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.



=== Training with learning_rate = 1e-05 ===


| Step | Training Loss | Validation Loss | Accuracy |
|------|---------------|-----------------|----------|
| 200 | 1.3815 | 1.375867 | 0.439063 |
| 400 | 1.3249 | 1.2625 | 0.864062 |
| 600 | 0.4312 | 0.32201 | 0.898438 |
| 800 | 0.3106 | 0.307428 | 0.904687 |
| 1000 | 0.304 | 0.298131 | 0.90625 |
| 1200 | 0.2837 | 0.297219 | 0.909375 |
| 1400 | 0.2763 | 0.294465 | 0.910937 |
| 1600 | 0.2784 | 0.293399 | 0.907813 |
| 1800 | 0.269 | 0.288782 | 0.909375 |
| 2000 | 0.2701 | 0.288606 | 0.909375 |



Done.


=== Training with learning_rate = 3e-05 ===


| Step | Training Loss | Validation Loss | Accuracy |
|------|---------------|-----------------|----------|
| 200 | 0.2777 | 0.290244 | 0.910937 |
| 400 | 0.2674 | 0.288532 | 0.907813 |
| 600 | 0.2613 | 0.285434 | 0.910937 |
| 800 | 0.2516 | 0.281797 | 0.907813 |
| 1000 | 0.2533 | 0.275518 | 0.910937 |
| 1200 | 0.2422 | 0.269236 | 0.907813 |
| 1400 | 0.2357 | 0.267763 | 0.909375 |
| 1600 | 0.2395 | 0.266469 | 0.90625 |
| 1800 | 0.2291 | 0.258716 | 0.909375 |
| 2000 | 0.227 | 0.256091 | 0.90625 |



Done.


=== Training with learning_rate = 5e-05 ===


| Step | Training Loss | Validation Loss | Accuracy |
|------|---------------|-----------------|----------|
| 200 | 0.2718 | 0.287228 | 0.910937 |
| 400 | 0.2614 | 0.284725 | 0.909375 |
| 600 | 0.2537 | 0.280179 | 0.90625 |
| 800 | 0.2415 | 0.275224 | 0.907813 |
| 1000 | 0.2416 | 0.265042 | 0.909375 |
| 1200 | 0.2294 | 0.24849 | 0.907813 |
| 1400 | 0.2177 | 0.230032 | 0.915625 |
| 1600 | 0.2161 | 0.223845 | 0.920312 |
| 1800 | 0.2057 | 0.215532 | 0.923438 |
| 2000 | 0.2001 | 0.216095 | 0.920312 |


Done.
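
For an independent comparison of learning rates, each run would need to start from a freshly initialized model. The sketch below shows that pattern with toy stand-ins so that it runs on its own; `sweep`, `make_toy_model`, and `toy_train_and_eval` are hypothetical names. In this notebook, the model factory would presumably reload `RobertaForSequenceClassification`, call `get_peft_model` again, and build a new `Trainer`, which we have not run here.

```python
def sweep(make_model, train_and_eval, learning_rates):
    """Train one fresh model per learning rate and collect the final metric."""
    results = {}
    for lr in learning_rates:
        model = make_model()                  # fresh model, no state from earlier runs
        results[lr] = train_and_eval(model, lr)
    return results

# Toy stand-ins: a single weight trained by gradient descent on (w - 1)^2.
def make_toy_model():
    return {"weight": 0.0}

def toy_train_and_eval(model, lr, steps=100):
    for _ in range(steps):
        grad = 2 * (model["weight"] - 1.0)
        model["weight"] -= lr * grad
    return (model["weight"] - 1.0) ** 2       # final loss, lower is better

results = sweep(make_toy_model, toy_train_and_eval, [1e-5, 3e-5, 5e-5])
for lr, loss in results.items():
    print(lr, loss)
```

Because every run starts from the same initial state, differences in the final metric can be attributed to the learning rate rather than to training carried over from earlier runs.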



## Evaluate Finetuned Model


### Performing Inference on Custom Input


We create an easy-to-use function (`classify`) for making predictions on arbitrary text inputs. This function helps qualitatively verify that our model predictions align with expectations.

In [30]:
def classify(model, tokenizer, text):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    model.eval()  # disable dropout for inference
    inputs = tokenizer(text, truncation=True, padding=True, return_tensors="pt").to(device)
    with torch.no_grad():  # gradients are not needed for prediction
        output = model(**inputs)

    prediction = output.logits.argmax(dim=-1).item()

    print(f'\n Class: {prediction}, Label: {id2label[prediction]}, Text: {text}')
    return id2label[prediction]

In [31]:
classify(peft_model, tokenizer, "Kederis proclaims innocence Olympic champion Kostas Kederis today left hospital ahead of his date with IOC inquisitors claiming his ...")
# Raw string so that "\b" stays a literal backslash-b instead of a backspace character
classify(peft_model, tokenizer, r"Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again.")


 Class: 0, Label: World, Text: Kederis proclaims innocence Olympic champion Kostas Kederis today left hospital ahead of his date with IOC inquisitors claiming his ...

 Class: 2, Label: Business, Text: Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindlingand of ultra-cynics, are seeing green again.


'Business'

### Run Inference on eval_dataset

We define `evaluate_model()` to run inference on our entire evaluation dataset efficiently. This function computes accuracy systematically across all evaluation samples, providing robust performance validation.

In [32]:
from torch.utils.data import DataLoader
import evaluate
from tqdm import tqdm

def evaluate_model(inference_model, dataset, labelled=True, batch_size=128, data_collator=None):
    """
    Evaluate a PEFT model on a dataset.

    Args:
        inference_model: The model to evaluate.
        dataset: The dataset (Hugging Face Dataset) to run inference on.
        labelled (bool): If True, the dataset includes labels and metrics will be computed.
                         If False, only predictions will be returned.
        batch_size (int): Batch size for inference.
        data_collator: Function to collate batches. If None, the default collate_fn is used.

    Returns:
        If labelled is True, returns a tuple (metrics, predictions)
        If labelled is False, returns the predictions.
    """
    # Create the DataLoader
    eval_dataloader = DataLoader(dataset, batch_size=batch_size, collate_fn=data_collator)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    inference_model.to(device)
    inference_model.eval()

    all_predictions = []
    if labelled:
        metric = evaluate.load('accuracy')

    # Loop over the DataLoader
    for batch in tqdm(eval_dataloader):
        # Move each tensor in the batch to the device
        batch = {k: v.to(device) for k, v in batch.items()}
        with torch.no_grad():
            outputs = inference_model(**batch)
        predictions = outputs.logits.argmax(dim=-1)
        all_predictions.append(predictions.cpu())

        if labelled:
            # Expecting that labels are provided under the "labels" key.
            references = batch["labels"]
            metric.add_batch(
                predictions=predictions.cpu().numpy(),
                references=references.cpu().numpy()
            )

    # Concatenate predictions from all batches
    all_predictions = torch.cat(all_predictions, dim=0)

    if labelled:
        eval_metric = metric.compute()
        print("Evaluation Metric:", eval_metric)
        return eval_metric, all_predictions
    else:
        return all_predictions

**Evaluate Final Model Performance**

We run the full evaluation function on our evaluation dataset to get an accurate and comprehensive measurement of our model’s performance. This step ensures our results meet or exceed the desired baseline accuracy (≥80%).

In [33]:
# Check evaluation accuracy
_, _ = evaluate_model(peft_model, eval_dataset, True, 128, data_collator)

100%|██████████| 5/5 [00:00<00:00,  6.89it/s]

Evaluation Metric: {'accuracy': 0.93125}
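
Since the evaluation split contains only 640 examples, it helps to see how precise this number is. The sketch below is a rough estimate that treats each prediction as an independent Bernoulli trial and uses the normal approximation; these are our assumptions, not part of the evaluation code.

```python
import math

n_eval = 640          # size of the held-out evaluation split
accuracy = 0.93125    # reported by evaluate_model above

# Number of correctly classified examples implied by the accuracy.
n_correct = round(accuracy * n_eval)

# Standard error of a proportion and an approximate 95% interval half-width.
std_err = math.sqrt(accuracy * (1 - accuracy) / n_eval)
half_width_95 = 1.96 * std_err

print(n_correct, round(std_err, 4), round(half_width_95, 4))
```

Under these assumptions the accuracy corresponds to 596 of 640 correct predictions, with an uncertainty of roughly two percentage points, so differences of about one point between checkpoints in the training logs are within the noise of this small evaluation set.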





### Run Inference on unlabelled dataset

We load additional unlabeled data (`test_unlabelled.pkl`) provided for final predictions. This data undergoes the same preprocessing pipeline as the training data to ensure consistent input formatting.

In [34]:
#Load your unlabelled data
unlabelled_dataset = pd.read_pickle("test_unlabelled.pkl")
test_dataset = unlabelled_dataset.map(preprocess, batched=True, remove_columns=["text"])
unlabelled_dataset

Dataset({
    features: ['text'],
    num_rows: 8000
})

We use our trained model to predict labels on the unlabeled dataset. The predictions are written to `inference_output.csv` inside the `results` output directory, ready for submission or further analysis.

In [35]:
# Run inference and save predictions
preds = evaluate_model(peft_model, test_dataset, False, 128, data_collator)
df_output = pd.DataFrame({
    'ID': range(len(preds)),
    'Label': preds.numpy()  # or preds.tolist()
})
df_output.to_csv(os.path.join(output_dir,"inference_output.csv"), index=False)
print("Inference complete. Predictions saved to inference_output.csv")

100%|██████████| 63/63 [00:06<00:00,  9.89it/s]

Inference complete. Predictions saved to inference_output.csv



