
# Homework 8

In this homework, you will train a multi-class text classifier on a subset of the [AG News](https://huggingface.co/datasets/r-three/ag_news_subset) dataset using a pre-trained BERT model. The AG News dataset consists of news articles, each categorized into one of four topics (0 - World, 1 - Sports, 2 - Business, 3 - Sci/Tech).

**In part 1**, you will fine-tune a BERT-style model on the AG News dataset and evaluate its performance. You can find a tutorial for loading BERT and fine-tuning [here](https://huggingface.co/docs/transformers/training). For simplicity, I recommend using the [Hugging Face Transformers library](https://huggingface.co/docs/transformers/index). You're welcome to use a different framework if you prefer.

**In part 2**, instead of fine-tuning a BERT-style model directly, you will use the representations from the BERT-style model as input to a linear classifier. Does this approach perform better or worse?

For both part 1 and part 2, your goal is to achieve a test accuracy above the specified thresholds. You won’t have access to the test labels—just like in real-world applications!

**Tips about fine-tuning**

* Data preprocessing: raw text data should be tokenized before being fed to the model in batches during training.
* Hyperparameter choices: Experiment with settings such as learning rate, warmup ratio, optimizer, number of training steps, and batch size.
* Avoid overfitting: remember that your fine-tuned model will be evaluated on the test set!
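
To make the hyperparameter tip more concrete, the sketch below works out how a warmup ratio translates into a per-step learning rate. It assumes a linear warmup followed by a linear decay to zero, and the dataset size, batch size, epoch count, and peak learning rate are illustrative values rather than requirements. The helper `linear_warmup_decay` is a hypothetical name written for this example, not a library function, and rounding the warmup step count is a simplification.

```python
import math

def linear_warmup_decay(step, total_steps, warmup_steps, peak_lr):
    """Learning rate at `step`: linear ramp up to peak_lr, then linear decay to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)

# Illustrative settings (assumptions for this sketch).
num_train_examples = 4000
batch_size = 32
num_epochs = 10
warmup_ratio = 0.1
peak_lr = 5e-5

steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
warmup_steps = round(total_steps * warmup_ratio)

print(f"{steps_per_epoch} steps/epoch, {total_steps} total, {warmup_steps} warmup")
for step in (0, warmup_steps // 2, warmup_steps, total_steps // 2, total_steps):
    lr = linear_warmup_decay(step, total_steps, warmup_steps, peak_lr)
    print(f"step {step:4d}: lr = {lr:.2e}")
```

With these numbers the learning rate only reaches its peak after the first epoch, which is worth keeping in mind when comparing runs with different batch sizes or epoch counts.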


**!! IMPORTANT NOTE !!**

You are free to explore and implement the training code however you want to maximize the model performance. However, please put the code you're running under `if __name__ == '__main__':` so that the particular training step is not run when we later evaluate your final script! Otherwise you may fail the Markus tests due to timeout.

```
if __name__ == '__main__':
    # your training code to fine-tune the model
    ...
```

# Part 1 (4 points)



In [1]:
if __name__ == '__main__':
    # !pip install datasets
    # !pip install evaluate
    # !pip install -U sentence-transformers

    from datasets import load_dataset, DatasetDict
    from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer, DataCollatorWithPadding
    from torch.utils.data import DataLoader
    import torch
    import evaluate

    from sentence_transformers import SentenceTransformer
    from sklearn.linear_model import LogisticRegression
    import joblib

    torch.cuda.empty_cache()
    device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

    ################# Import additional packages you need #################
    #####################################################################################
    import numpy as np
    import pandas as pd

In [2]:
################## HELPER CODE FOR SAVING RELEVANT FILES ##################
if __name__ == '__main__':
    def in_colab():
        try:
            import google.colab
            return True
        except ImportError:
            return False

    if in_colab():
        from google.colab import drive
        drive.mount('/content/drive')
        SAVE_PATH = '/content/drive/MyDrive/CSC413'
    else:
        SAVE_PATH = '.'

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Part 1.a

Fine-tune [TinyBERT](https://huggingface.co/huawei-noah/TinyBERT_General_4L_312D) on AG News and evaluate the results. You can find a tutorial for loading BERT and fine-tuning [here](https://huggingface.co/docs/transformers/training). In that tutorial, you will need to change the dataset from `"yelp_review_full"` to the correct dataset path and the model from `"bert-base-uncased"` to `"huawei-noah/TinyBERT_General_4L_312D"`. You'll also need to modify the code since AG News is a four-class classification dataset (unlike the Yelp Reviews dataset, which is a five-class classification dataset).
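
As a small illustration of what "four-class" means for the model output, the sketch below turns a batch of logits with four columns into class ids and topic names. The logit values are made up for the example; only the label order (0 - World, 1 - Sports, 2 - Business, 3 - Sci/Tech) comes from the dataset description.

```python
import numpy as np

LABEL_NAMES = ["World", "Sports", "Business", "Sci/Tech"]  # class ids 0-3

# Made-up logits for three articles; shape is (batch_size, num_labels).
logits = np.array([
    [2.1, -0.3, 0.4, 0.0],
    [-1.0, 3.2, 0.1, -0.5],
    [0.2, -0.7, 0.9, 1.5],
])

# The number of logit columns must match the number of classes.
assert logits.shape[1] == len(LABEL_NAMES)

# The predicted class is the index of the largest logit in each row.
pred_ids = np.argmax(logits, axis=1)
pred_names = [LABEL_NAMES[i] for i in pred_ids]
print(pred_ids.tolist(), pred_names)
```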

**TODO**
* After fine-tuning the model, save the model predictions on the test set to *part1_tiny_bert_model_test_prediction.csv*. The CSV file should contain an "index" column, holding the unique sample index, and a "pred" column, holding the model prediction for that sample. Your model should achieve a test accuracy of at least 80% to receive full marks.

```
index,pred
0,model_pred_value_0
1,model_pred_value_1
2,model_pred_value_2
...
```
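
Before submitting, it may help to sanity-check the saved file. The following is a minimal sketch using only the standard library; `check_prediction_csv` is a hypothetical helper, and the checks (an `index,pred` header as pandas writes it, unique indices, labels in 0-3, an expected row count) are assumptions about what a well-formed file looks like, not the grader's actual tests.

```python
import csv
import io

def check_prediction_csv(csv_text, expected_rows, num_labels=4):
    """Return a list of problems found in a prediction CSV (empty if none)."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != ["index", "pred"]:
        return [f"unexpected header: {reader.fieldnames}"]
    rows = list(reader)
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")
    indices = [row["index"] for row in rows]
    if len(set(indices)) != len(indices):
        problems.append("duplicate values in the index column")
    valid = {str(label) for label in range(num_labels)}
    bad = [row["index"] for row in rows if row["pred"] not in valid]
    if bad:
        problems.append(f"{len(bad)} predictions are outside 0..{num_labels - 1}")
    return problems

# Toy file contents (made up for illustration).
good_file = "index,pred\n0,3\n1,0\n2,1\n"
bad_file = "index,pred\n0,3\n1,7\n"
print(check_prediction_csv(good_file, expected_rows=3))
print(check_prediction_csv(bad_file, expected_rows=3))
```

For a real file, the text can be read with `open(path).read()` and `expected_rows` set to the size of the test split.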

In [3]:
######################## DO NOT MODIFY THE CODE ########################
if __name__ == '__main__':
    dataset = load_dataset('r-three/ag_news_subset')
    model = AutoModelForSequenceClassification.from_pretrained("huawei-noah/TinyBERT_General_4L_312D", num_labels=4)
    tokenizer = AutoTokenizer.from_pretrained("huawei-noah/TinyBERT_General_4L_312D")
    print(dataset["train"][100])

    # Tokenization function
    def tokenize_function(examples):
        texts = [f"{title} {desc}" for title, desc in zip(examples['title'], examples['description'])]
        return tokenizer(texts, padding="max_length", truncation=True, max_length=128)

    # Tokenize datasets
    tokenized_datasets = dataset.map(tokenize_function, batched=True)
    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

    # Metrics
    accuracy_metric = evaluate.load("accuracy")
    def compute_metrics(eval_pred):
        predictions, labels = eval_pred
        predictions = np.argmax(predictions, axis=1)
        return accuracy_metric.compute(predictions=predictions, references=labels)

    # Training arguments
    training_args = TrainingArguments(
        output_dir=f"{SAVE_PATH}/tinybert_results",
        eval_strategy="epoch",
        learning_rate=5e-5,
        per_device_train_batch_size=32,
        per_device_eval_batch_size=64,
        num_train_epochs=10,
        weight_decay=0.01,
        warmup_ratio=0.1,
        load_best_model_at_end=True,
        metric_for_best_model="accuracy",
        save_strategy = "epoch"
    )

    # Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_datasets["train"],
        eval_dataset=tokenized_datasets["validation"],
        tokenizer=tokenizer,
        data_collator=data_collator,
        compute_metrics=compute_metrics,
    )

    # Train and save predictions
    trainer.train()
    test_predictions = trainer.predict(tokenized_datasets["test"].remove_columns(["label"]))
    test_preds = np.argmax(test_predictions.predictions, axis=1)

    pd.DataFrame({'index': range(len(test_preds)), 'pred': test_preds}).to_csv(
        f"{SAVE_PATH}/part1_tiny_bert_model_test_prediction.csv", index=False
    )
#########################################################################

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at huawei-noah/TinyBERT_General_4L_312D and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


{'label': 3, 'title': 'Microsoft Takes Giant Step against Spyware', 'description': 'Microsoft has beefed up its security portfolio by purchasing anti-spyware specialist Giant Company Software. A test version of a spyware protection, detection and removal tool based on Giant #39;s ', 'index': 100}


Map:   0%|          | 0/2500 [00:00<?, ? examples/s]

wandb: Currently logged in as: afra-azad (afra-azad-university-of-toronto) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | No log | 0.720703 | 0.844 |
| 2 | No log | 0.424975 | 0.878 |
| 3 | No log | 0.390964 | 0.874 |
| 4 | 0.570900 | 0.387315 | 0.88 |
| 5 | 0.570900 | 0.412214 | 0.879 |
| 6 | 0.570900 | 0.422637 | 0.88 |
| 7 | 0.570900 | 0.462722 | 0.879 |
| 8 | 0.134900 | 0.483801 | 0.877 |
| 9 | 0.134900 | 0.543843 | 0.869 |
| 10 | 0.134900 | 0.521218 | 0.873 |


If your predictions are saved in a pandas DataFrame, you can do something like:
```
if __name__ == '__main__':
    part1_tiny_bert_pred.to_csv(f"{SAVE_PATH}/part1_tiny_bert_model_test_prediction.csv", index=False)
```

## Part 1.b

For this section, choose a different pre-trained BERT-style model from the [Hugging Face Model Hub](https://huggingface.co/models) and fine-tune it. There are tons of options - part of the homework is navigating the hub to find different models! I recommend picking a model that is smaller than BERT-Base (as TinyBERT is) just to make things computationally cheaper. Is the final validation accuracy higher or lower with this other model?

**TODO**
* As in part 1.a, save the model predictions on the test set to *part1_hf_bert_model_test_prediction.csv*. The CSV file should contain an "index" column, holding the unique sample index, and a "pred" column, holding the model prediction for that sample. Your model should achieve a test accuracy of at least 80% to receive full marks.

In [4]:
if __name__ == '__main__':
    ############### YOUR CODE ###############
    # TODO: find a new HF BERT based model from HuggingFace and load it.
    HF_BERT_BASED_MODEL = 'distilbert-base-uncased'

    # Load model and tokenizer
    print(f"Loading model: {HF_BERT_BASED_MODEL}")
    model_hf = AutoModelForSequenceClassification.from_pretrained(HF_BERT_BASED_MODEL, num_labels=4)
    tokenizer_hf = AutoTokenizer.from_pretrained(HF_BERT_BASED_MODEL)

    # Tokenization function for the new model
    def tokenize_function_hf(examples):
        texts = [f"{title} {desc}" for title, desc in zip(examples['title'], examples['description'])]
        return tokenizer_hf(texts, padding="max_length", truncation=True, max_length=128)

    # Tokenize datasets
    print("Tokenizing datasets with new tokenizer...")
    tokenized_datasets_hf = dataset.map(tokenize_function_hf, batched=True)

    # Data collator
    data_collator_hf = DataCollatorWithPadding(tokenizer=tokenizer_hf)

    # Training arguments
    training_args_hf = TrainingArguments(
        output_dir=f"{SAVE_PATH}/hf_bert_results",  # separate folder so TinyBERT checkpoints are not overwritten
        eval_strategy="epoch",
        learning_rate=5e-5,
        per_device_train_batch_size=32,
        per_device_eval_batch_size=64,
        num_train_epochs=10,
        weight_decay=0.05,
        warmup_ratio=0.1,
        load_best_model_at_end=True,
        metric_for_best_model="accuracy",
        save_strategy="epoch"
    )

    # Trainer
    trainer_hf = Trainer(
        model=model_hf,
        args=training_args_hf,
        train_dataset=tokenized_datasets_hf["train"],
        eval_dataset=tokenized_datasets_hf["validation"],
        tokenizer=tokenizer_hf,
        data_collator=data_collator_hf,
        compute_metrics=compute_metrics,  # Reuse from Part 1.a
    )

    # Train
    print(f"Training {HF_BERT_BASED_MODEL}...")
    trainer_hf.train()

    # Evaluate
    eval_results_hf = trainer_hf.evaluate()
    print(f"Validation Accuracy: {eval_results_hf['eval_accuracy']:.4f}")

    # Predict on test set
    print("Generating predictions on test set...")
    test_predictions_hf = trainer_hf.predict(tokenized_datasets_hf["test"].remove_columns(["label"]))
    test_preds_hf = np.argmax(test_predictions_hf.predictions, axis=1)

    # Save predictions to CSV
    part1_hf_bert_pred = pd.DataFrame({
        'index': range(len(test_preds_hf)),
        'pred': test_preds_hf
    })
    part1_hf_bert_pred.to_csv(f"{SAVE_PATH}/part1_hf_bert_model_test_prediction.csv", index=False)
    print(f"Saved predictions to {SAVE_PATH}/part1_hf_bert_model_test_prediction.csv")
    #########################################

Loading model: distilbert-base-uncased


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Tokenizing datasets with new tokenizer...


Map:   0%|          | 0/4000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Map:   0%|          | 0/2500 [00:00<?, ? examples/s]



Training distilbert-base-uncased...


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | No log | 0.431167 | 0.849 |
| 2 | No log | 0.341013 | 0.887 |
| 3 | No log | 0.404367 | 0.888 |
| 4 | 0.342200 | 0.447731 | 0.899 |
| 5 | 0.342200 | 0.464443 | 0.886 |
| 6 | 0.342200 | 0.50013 | 0.895 |
| 7 | 0.342200 | 0.544195 | 0.897 |
| 8 | 0.033200 | 0.570817 | 0.902 |
| 9 | 0.033200 | 0.579999 | 0.896 |
| 10 | 0.033200 | 0.582139 | 0.895 |


Validation Accuracy: 0.9020
Generating predictions on test set...
Saved predictions to /content/drive/MyDrive/CSC413/part1_hf_bert_model_test_prediction.csv
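
The reported validation accuracy of 0.9020 equals the highest per-epoch accuracy in the DistilBERT log above, which is what one would expect when `load_best_model_at_end=True` and `metric_for_best_model="accuracy"` restore the best checkpoint before evaluation. The sketch below only re-reads the logged numbers; it also shows that the lowest validation loss occurs much earlier than the best accuracy, one possible sign of overfitting in later epochs.

```python
# (epoch, validation loss, validation accuracy), copied from the DistilBERT log above.
history = [
    (1, 0.431167, 0.849), (2, 0.341013, 0.887), (3, 0.404367, 0.888),
    (4, 0.447731, 0.899), (5, 0.464443, 0.886), (6, 0.500130, 0.895),
    (7, 0.544195, 0.897), (8, 0.570817, 0.902), (9, 0.579999, 0.896),
    (10, 0.582139, 0.895),
]

# Epoch that a "best accuracy" criterion would select.
best_by_accuracy = max(history, key=lambda row: row[2])
# Epoch that a "lowest validation loss" criterion would select instead.
best_by_loss = min(history, key=lambda row: row[1])

print(f"best accuracy: epoch {best_by_accuracy[0]} ({best_by_accuracy[2]:.3f})")
print(f"lowest val loss: epoch {best_by_loss[0]} ({best_by_loss[1]:.6f})")
```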


**Your training code here...**

Similarly, you can consider something like:

```
if __name__ == '__main__':
    part1_hf_bert_pred.to_csv(f"{SAVE_PATH}/part1_hf_bert_model_test_prediction.csv", index=False)
```

# Part 2 (2.5 points)

Instead of fine-tuning the full model on a target dataset, it's also possible to use the output representations from a BERT-style model as input to a linear classifier and *only* train the classifier (leaving the rest of the pre-trained parameters fixed). You can do this easily using the [`sentence-transformers`](https://www.sbert.net/) library. Using `sentence-transformers` gives you back a fixed-length representation of a given text sequence. To achieve this, you need to:
1. Pick a pre-trained sentence Transformer.
2. Load the AG News dataset and feed the text from each example into the model.
3. Train a linear classifier on the representations.
4. Evaluate performance on the validation set.

For the second step, you can learn more about how to use Hugging Face datasets [here](https://huggingface.co/docs/datasets/index). For the third and fourth step, it's possible to either do this directly in PyTorch, or collect the learned representations and use them as feature vectors to train a linear classifier in any other library (e.g. [scikit-learn](https://scikit-learn.org/stable/modules/linear_model.html)). For this homework, you will implement the second approach.
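
The idea of training only a linear classifier on fixed features can be sketched without any encoder at all. In the example below, synthetic Gaussian clusters stand in for sentence embeddings (an assumption made so the sketch runs offline); the dimensions, sample counts, and the `make_split` helper are all hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
num_classes, dim = 4, 16

# Synthetic stand-in for sentence embeddings: one Gaussian cluster per class.
centers = rng.normal(scale=3.0, size=(num_classes, dim))

def make_split(n_per_class):
    """Draw n_per_class noisy feature vectors around each class center."""
    labels = np.repeat(np.arange(num_classes), n_per_class)
    features = centers[labels] + rng.normal(size=(labels.size, dim))
    return features, labels

demo_X_train, demo_y_train = make_split(100)
demo_X_val, demo_y_val = make_split(25)

# Only the linear classifier is trained; the features themselves stay fixed.
probe = LogisticRegression(max_iter=1000)
probe.fit(demo_X_train, demo_y_train)
demo_val_accuracy = probe.score(demo_X_val, demo_y_val)
print(f"validation accuracy on synthetic features: {demo_val_accuracy:.3f}")
```

With real embeddings the clusters are less cleanly separated, so the accuracy on this toy data says nothing about what to expect on AG News.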

After you complete the above steps, is the accuracy on the validation set higher or lower using a fixed sentence Transformer?

**TODO**:
* Complete the `encode_data` function: the function embeds each text sample into an output representation using the provided sentence encoder. The function is called to map a text data sample to the model representation, as shown below:
```
dataset.map(lambda x: encode_data(sen_model, x), batched=True)
```
* Train a Logistic Regression classifier: use `sklearn.linear_model.LogisticRegression` to fit the model on the encoded text data.
* Save your trained model: after training, save the fitted logistic regression model as `sentence_encoder_classification.pkl`. Your model should achieve a test accuracy of at least 85% to receive full marks.
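
With `batched=True`, the mapped function receives a dictionary that maps each column name to a list of values, and it should return a dictionary of lists of the same length. The sketch below mimics that contract in plain Python; `toy_encoder` and `encode_batch` are hypothetical stand-ins for the sentence encoder and for `encode_data`, and the two articles are made up.

```python
def toy_encoder(texts):
    """Hypothetical encoder: one fixed-length vector (chars, words) per text."""
    return [[float(len(t)), float(len(t.split()))] for t in texts]

def encode_batch(encoder, batch):
    """Dict of columns in, dict of columns out, one entry per example."""
    texts = [f"{title} {desc}" for title, desc in zip(batch["title"], batch["description"])]
    return {"encoded_input": encoder(texts), "label": batch["label"]}

# A batch shaped the way a batched map call passes it: column name -> list of values.
batch = {
    "title": ["Stocks rally", "Cup final tonight"],
    "description": ["Markets closed higher.", "Two teams meet in the final."],
    "label": [2, 1],
}

out = encode_batch(toy_encoder, batch)
print(out)
```

Each output list has one entry per input example, which is what lets the new `encoded_input` column line up with the existing rows.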

In [5]:
def encode_data(model, x):
    """Takes the model and the dataset object
        Returns a dictionary with "encoded_input" and "label" as keys.
        - "encoded_input" contains the fixed-length embeddings produced by the sentence transformer.
        - "label" is the target class label for each example.
        NOTE: Please assume the dataset object is the original one loaded via
              load_dataset('r-three/') for reproducibility.
              Which means if you want to create additional features to create the encoded_input,
              do so within this function.
    """
    ####################### YOUR CODE ##########################
    # TODO: encoded_input
    # Combine title and description for richer text representation
    texts = [f"{title} {desc}" for title, desc in zip(x['title'], x['description'])]

    # Encode the texts using the sentence transformer
    embeddings = model.encode(texts, show_progress_bar=False)

    # Return dictionary with encoded input and labels
    d = {
        'encoded_input': embeddings,
        'label': x['label']
    }
    return d
    ############################################################

In [6]:
########### PUT YOUR MODEL HERE ###########
SENTENCE_TRANSFORMER_MODEL = 'all-MiniLM-L6-v2'
###########################################

In [7]:
########### DO NOT CHANGE THIS CODE ###########
if __name__ == "__main__":

    sen_model = SentenceTransformer(SENTENCE_TRANSFORMER_MODEL)
    # Prepare the dataset
    tokenized_dataset = dataset.map(lambda x: encode_data(sen_model, x), batched=True)
    print(tokenized_dataset['train'][100])
    X_train = np.stack([np.array(x['encoded_input']) for x in tokenized_dataset['train']])
    X_val = np.stack([np.array(x['encoded_input']) for x in tokenized_dataset['validation']])
    y_train = np.stack([np.array(x['label']) for x in tokenized_dataset['train']])
    y_val = np.stack([np.array(x['label']) for x in tokenized_dataset['validation']])

    print(X_train.shape)
    print(X_val.shape)
    print(y_train.shape)
    print(y_val.shape)


Map:   0%|          | 0/4000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Map:   0%|          | 0/2500 [00:00<?, ? examples/s]

{'label': 3, 'title': 'Microsoft Takes Giant Step against Spyware', 'description': 'Microsoft has beefed up its security portfolio by purchasing anti-spyware specialist Giant Company Software. A test version of a spyware protection, detection and removal tool based on Giant #39;s ', 'index': 100, 'encoded_input': [-0.10181164741516113, 0.06562512367963791, 0.05116798356175423, -0.02823948860168457, 0.10688553750514984, -0.060344621539115906, 0.09523025900125504, 0.011601549573242664, -0.06362607330083847, 0.05193374305963516, -0.009374563582241535, 0.1125771552324295, 0.02367042750120163, -0.03609614074230194, 0.004996059462428093, 0.03435731306672096, 0.013490402139723301, -0.04580135643482208, -0.02351507917046547, -0.05158085376024246, -0.05795718729496002, -0.08008846640586853, -0.011982634663581848, 0.06801612675189972, -0.029635997489094734, 0.0007547626737505198, -0.05283653363585472, 0.01741046831011772, -0.031632259488105774, -0.02352731116116047, 0.08138889074325562, 0.002221

In [8]:
########### COMPLETE THE FOLLOWING LOGISTIC REGRESSION CODE ###########
if __name__ == "__main__":
    # Train the logistic regression classifier on encoded training data
    print("Training Logistic Regression classifier...")
    classifier = LogisticRegression(
        max_iter=1000,
        C=1.0,
        solver='lbfgs',  # fits a multinomial (softmax) model for multi-class targets
        random_state=42,
        verbose=1
    )

    # Fit the classifier
    classifier.fit(X_train, y_train)

    # Evaluate on validation set
    val_accuracy = classifier.score(X_val, y_val)
    print(f"\nValidation Accuracy: {val_accuracy:.4f}")

    # Generate predictions on the validation set (the split scored above)
    val_preds = classifier.predict(X_val)


Training Logistic Regression classifier...





Validation Accuracy: 0.8740


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.7s finished


In [9]:
######################## TO SUBMIT ########################
if __name__ == "__main__":
    # Save the trained model
    model_path = f"{SAVE_PATH}/sentence_encoder_classification.pkl"
    tiny_bert_prediction_path = f"{SAVE_PATH}/part1_tiny_bert_model_test_prediction.csv"
    hf_bert_prediction_path = f"{SAVE_PATH}/part1_hf_bert_model_test_prediction.csv"

    joblib.dump(classifier, model_path)

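    # Save the Part 1 test-set predictions as CSV files (NOT pickle files!),
    # taken directly from the Part 1 outputs so each file holds its own model's predictions
    tiny_bert_test_preds = np.argmax(test_predictions.predictions, axis=1)
    pd.DataFrame({'index': range(len(tiny_bert_test_preds)), 'pred': tiny_bert_test_preds}).to_csv(
        tiny_bert_prediction_path, index=False
    )
    part1_hf_bert_pred.to_csv(hf_bert_prediction_path, index=False)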

    print(f"Saved model to {model_path}")
    print(f"Saved predictions to {tiny_bert_prediction_path}")
    print(f"Saved predictions to {hf_bert_prediction_path}")

Saved model to /content/drive/MyDrive/CSC413/sentence_encoder_classification.pkl
Saved predictions to /content/drive/MyDrive/CSC413/part1_tiny_bert_model_test_prediction.csv
Saved predictions to /content/drive/MyDrive/CSC413/part1_hf_bert_model_test_prediction.csv
