# Fine-tuning a Pre-trained Transformer Model on a Custom Dataset

## Introduction
In modern Natural Language Processing (NLP), **fine-tuning** a pre-trained transformer model is a standard and effective way to obtain strong results on downstream tasks. Instead of training a model from scratch, we take a model (such as BERT or RoBERTa) that has already learned rich language representations from massive text corpora, and adapt it to a specific task (e.g., text classification, question answering, named entity recognition) using a smaller, task-specific dataset.

This process is a form of **transfer learning**: because the model starts from pre-trained weights, it can reach high performance with significantly less labeled data and compute than training from scratch.

In this assignment, you will fine-tune a pre-trained BERT model for a simple binary text classification task on a small custom dataset, using the **HuggingFace Transformers library** and its `Trainer` API.

---

## Learning Objectives
Upon completion of this assignment, you should be able to:
- Understand the fine-tuning paradigm for transformer models.
- Prepare a custom text classification dataset using HuggingFace `Datasets`.
- Load and apply a pre-trained tokenizer to custom text data.
- Initialize a pre-trained model for sequence classification.
- Define and compute relevant evaluation metrics for classification.
- Use the HuggingFace `Trainer` API to fine-tune a model efficiently.
- Evaluate the performance of a fine-tuned model.
- Perform inference with the fine-tuned model on new data.
- Discuss the advantages, challenges, and strategies for improving fine-tuning.

---

## Setup and Prerequisites
Ensure you have the necessary libraries installed. If not, uncomment and run the following command:

```bash
# pip install transformers datasets accelerate evaluate scikit-learn torch
```

---

```python
import torch
import datasets      # imported as a module so datasets.__version__ is available
import transformers  # imported as a module so transformers.__version__ is available
import evaluate
import numpy as np
from datasets import Dataset, DatasetDict
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding,
)
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Report library versions to make the notebook easier to reproduce
print(f"PyTorch Version: {torch.__version__}")
print(f"Transformers Version: {transformers.__version__}")
print(f"Datasets Version: {datasets.__version__}")
print(f"Evaluate Version: {evaluate.__version__}")

# Check if GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
```

---

## Assignment Questions

---

### Question 1: Dataset Preparation
For this assignment, we'll create a *very small* custom dataset in-memory for a binary text classification task (e.g., positive/negative reviews). In a real-world scenario, you would load this from a CSV, JSON, or other file format.

1.  **Define Custom Data:** Create a Python list of dictionaries, where each dictionary represents an example with at least two keys: `"text"` (for the input sentence) and `"label"` (for the class ID, e.g., 0 for negative, 1 for positive).
    * Aim for around 10-20 examples total.
    * Ensure some examples for both classes.
2.  **Create HuggingFace `Dataset`:** Convert your list of dictionaries into a `datasets.Dataset` object.
3.  **Split Data:** Split the `Dataset` into `train` and `validation` sets using `dataset.train_test_split()`. Use a `test_size` of 0.2-0.3 and a `seed` for reproducibility. Note that the method names the held-out split `test`; use (or rename) it as your validation set.
4.  **Inspect:** Print the number of examples in the training and validation splits. Print one example from the training split to show its structure.

---

### Question 2: Tokenization and Data Collator
Transformer models require numerical inputs (token IDs, attention masks, token type IDs). The `AutoTokenizer` handles this.

1.  **Load Tokenizer:** Load a pre-trained tokenizer compatible with BERT (e.g., `'bert-base-uncased'`).
2.  **Define Tokenization Function:** Create a function `tokenize_function(examples)` that takes a dictionary of examples (as provided by `dataset.map()`) and returns the tokenized inputs. Set `truncation=True`. For padding, either pad every sequence up front with `padding='max_length'`, or leave padding off here and let `DataCollatorWithPadding` pad each batch dynamically to its longest sequence.
3.  **Apply Tokenization:** Apply this function to both your training and validation `Dataset` splits using the `.map()` method.
4.  **Data Collator:** Initialize a `DataCollatorWithPadding` using your tokenizer. Explain why `DataCollatorWithPadding` is preferred over padding all sequences to `max_length` upfront for training efficiency.
5.  **Inspect Tokenized Data:** Print a single tokenized example from the training set to see its `input_ids`, `attention_mask`, and `label` (the data collator renames this column to `labels` when it builds a batch).
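
To make the efficiency argument behind dynamic padding concrete, here is a small library-free sketch that counts how many token positions a model would process under each strategy. The token lengths, the batch size of 4, and the helper name `padded_token_count` are hypothetical; the sketch only mimics the arithmetic, not the collator itself.

```python
def padded_token_count(lengths, batch_size, max_length=None):
    """Count token positions processed after padding.

    If max_length is given, every sequence is padded to that fixed length.
    Otherwise each batch is padded to its own longest sequence, which is
    the behaviour dynamic padding aims for.
    """
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        target = max_length if max_length is not None else max(batch)
        total += target * len(batch)
    return total


# Hypothetical tokenized sentence lengths (all shorter than 512)
lengths = [12, 7, 30, 9, 15, 11, 8, 22]

static_total = padded_token_count(lengths, batch_size=4, max_length=512)
dynamic_total = padded_token_count(lengths, batch_size=4)

print(f"Fixed padding to 512: {static_total} token positions")
print(f"Per-batch padding:    {dynamic_total} token positions")
```

For short sentences like these, padding per batch processes a small fraction of the positions that fixed-length padding would, which is the main reason it is usually faster and lighter on memory.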

---

### Question 3: Model Loading
Now, load the pre-trained model for your specific task: sequence classification.

1.  **Load Model:** Load `AutoModelForSequenceClassification` from the same pre-trained checkpoint as your tokenizer (e.g., `'bert-base-uncased'`).
2.  **Specify Number of Labels:** Pass the `num_labels` argument to `from_pretrained()`, matching the number of unique classes in your dataset (e.g., 2 for binary classification). This determines the output size of the classification head.
3.  **Move to Device:** Move the model to your `device` (GPU/CPU).
4.  **Inspect Model:** Print the model's configuration (e.g., `model.config`) and verify that `num_labels` is set correctly.
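
The effect of `num_labels` can be inspected without downloading any weights. The sketch below is an assumption-laden stand-in: it builds a tiny, randomly initialized BERT from a hand-written `BertConfig` (the small layer sizes are arbitrary), so it is not a pre-trained model. In the assignment itself you would call `AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)` instead.

```python
import torch
from transformers import BertConfig, BertForSequenceClassification

# Tiny, randomly initialized configuration used only for illustration.
# These sizes are arbitrary and far smaller than bert-base-uncased.
config = BertConfig(
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
    num_labels=2,  # binary classification
)
model = BertForSequenceClassification(config)

# Move the model to GPU if one is available, otherwise stay on CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# num_labels is stored in the config and sizes the classification head
print(f"num_labels: {model.config.num_labels}")
print(f"classifier head: {model.classifier}")
```

The printed classifier is a linear layer whose output dimension equals `num_labels`; this head is newly initialized even when the encoder is pre-trained, which is why fine-tuning is required before the predictions mean anything.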

---

### Question 4: Define Evaluation Metrics
For classification tasks, beyond just accuracy, metrics like precision, recall, and F1-score are vital, especially for imbalanced datasets. The `Trainer` API expects a `compute_metrics` function.

1.  **Load Metrics:** Use `evaluate.load()` to load the necessary metrics (e.g., `'accuracy'`, `'f1'`, `'precision'`, `'recall'`).
2.  **`compute_metrics` Function:** Create a function `compute_metrics(eval_pred)` that:
    * Takes an `EvalPrediction` object (which contains the predictions and the labels).
    * Converts predictions to class IDs (e.g., `np.argmax(logits, axis=1)`).
    * Calculates accuracy, precision, recall, and F1-score.
    * Returns a dictionary where keys are metric names and values are their scores.
    * *Hint:* For precision, recall, F1, you might need to specify `average='weighted'` or `average='binary'` depending on your dataset and problem.
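
One possible shape for such a function is sketched below. As an assumption for the sake of an offline example, it uses the scikit-learn functions already imported in the setup cell rather than `evaluate.load()`, and it picks `average="binary"`; the sample logits and labels are made up.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support


def compute_metrics(eval_pred):
    """Turn (logits, labels) into a dictionary of classification metrics."""
    logits, labels = eval_pred
    # The highest logit in each row is the predicted class ID
    preds = np.argmax(logits, axis=1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="binary", zero_division=0
    )
    return {
        "accuracy": float(accuracy_score(labels, preds)),
        "precision": float(precision),
        "recall": float(recall),
        "f1": float(f1),
    }


# Hypothetical logits for four examples and their true labels
logits = np.array([[2.0, -1.0], [0.1, 0.9], [1.5, 0.2], [-0.3, 0.8]])
labels = np.array([0, 1, 1, 1])

metrics = compute_metrics((logits, labels))
print(metrics)
```

With `average="binary"` the precision, recall, and F1 values describe the positive class only; `average="weighted"` or `"macro"` would summarize both classes instead.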

---

### Question 5: Fine-tuning with `Trainer` API
The `Trainer` class in HuggingFace simplifies the training loop significantly.

1.  **`TrainingArguments`:** Define `TrainingArguments`:
    * `output_dir`: A path to save checkpoints and logs.
    * `num_train_epochs`: A small number (e.g., 3-5), since fine-tuning typically converges quickly.
    * `per_device_train_batch_size`, `per_device_eval_batch_size`: Small batch sizes (e.g., 8-16) for custom datasets.
    * `learning_rate`: A small learning rate (e.g., 2e-5 or 5e-5) is common for fine-tuning.
    * `evaluation_strategy`: `'epoch'` (evaluate at the end of each epoch). Recent Transformers releases rename this argument to `eval_strategy`.
    * `logging_dir`, `logging_strategy`: For tracking training progress.
    * `save_strategy`: `'epoch'`.
    * `load_best_model_at_end`: `True`.
    * `metric_for_best_model`: Choose a metric to monitor for best model saving (e.g., `'f1'` or `'accuracy'`).
    * `greater_is_better`: `True`.
2.  **Initialize `Trainer`:** Create a `Trainer` instance, passing in:
    * `model`
    * `args` (your `TrainingArguments`)
    * `train_dataset`, `eval_dataset` (your tokenized datasets)
    * `tokenizer` (recent Transformers releases take this as `processing_class` instead)
    * `data_collator`
    * `compute_metrics`
    
3.  **Train Model:** Start the training process using `trainer.train()`.
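
The `TrainingArguments` described above might be set up along these lines. This is a sketch under several assumptions: the output path is hypothetical, the hyperparameter values are just the examples suggested in the list, `report_to="none"` is added to keep external loggers out of the way, and the lookup of the evaluation argument name is only there so the snippet tolerates different Transformers versions.

```python
import dataclasses
from transformers import TrainingArguments

# The evaluation-schedule argument has different names across Transformers
# releases, so use whichever name this installation defines.
field_names = {f.name for f in dataclasses.fields(TrainingArguments)}
eval_key = "eval_strategy" if "eval_strategy" in field_names else "evaluation_strategy"

training_kwargs = {
    "output_dir": "./bert-finetune-demo",  # hypothetical checkpoint directory
    "num_train_epochs": 3,
    "per_device_train_batch_size": 8,
    "per_device_eval_batch_size": 8,
    "learning_rate": 2e-5,
    eval_key: "epoch",             # evaluate at the end of every epoch
    "save_strategy": "epoch",      # must match the evaluation schedule
    "logging_strategy": "epoch",
    "load_best_model_at_end": True,
    "metric_for_best_model": "f1",
    "greater_is_better": True,
    "report_to": "none",           # assumption: no external experiment tracker
}
training_args = TrainingArguments(**training_kwargs)

print(training_args.learning_rate, training_args.num_train_epochs)
```

The resulting `training_args` object is what gets passed as `args` when constructing the `Trainer`. `load_best_model_at_end=True` relies on the save and evaluation schedules matching, which is why both are set to `'epoch'`.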

---

### Question 6: Model Evaluation
After training, evaluate the model's performance on the validation set.

1.  **Evaluate:** Use `trainer.evaluate()` to get the final evaluation metrics on the validation set.
2.  **Print Results:** Print the evaluation results.
3.  **Discussion:** Based on your metrics, how well did your model perform on this small custom dataset? Given the size of the dataset, are these results expected? What do the precision, recall, and F1-score tell you about the model's performance on each class?

---

### Question 7: Inference with Fine-tuned Model
Test your fine-tuned model on a brand new, unseen sentence.

1.  **New Sentence:** Define a new sentence that was *not* part of your training or validation set.
2.  **Prepare Input:** Tokenize the new sentence. Ensure it's returned as PyTorch tensors (`return_tensors="pt"`) and moved to the `device`.
3.  **Predict:** Pass the tokenized input through your `model` (which should now be the fine-tuned one). Remember to use `model.eval()` and `torch.no_grad()`.
4.  **Interpret Output:** The model will output `logits`. Convert these logits into probabilities (e.g., using `torch.softmax`) and then determine the predicted class label (0 or 1).
5.  **Print Prediction:** Print the original new sentence and its predicted class label.
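
The interpretation step (logits to probabilities to a class label) can be sketched in isolation. The logits tensor below is a made-up placeholder for what `model(**inputs).logits` would return for one sentence, and the `id2label` mapping and sample sentence are assumptions that follow the 0 = negative, 1 = positive convention from Question 1.

```python
import torch

# Hypothetical new sentence and hypothetical logits of shape (1, 2).
# In the notebook the logits would come from the fine-tuned model, e.g.
# inside `model.eval()` and `with torch.no_grad():`.
sentence = "The acting was superb and the story kept me hooked."
logits = torch.tensor([[-1.2, 2.3]])

# Softmax over the class dimension turns logits into probabilities
probs = torch.softmax(logits, dim=-1)

# The class with the highest probability is the prediction
pred = int(torch.argmax(probs, dim=-1).item())

# Assumed label mapping for readability
id2label = {0: "negative", 1: "positive"}

print(f"Sentence: {sentence}")
print(f"Probabilities: {probs.squeeze().tolist()}")
print(f"Predicted class: {pred} ({id2label[pred]})")
```

Because softmax preserves the ordering of the logits, taking `argmax` on the logits directly gives the same label; the probabilities are mainly useful for reporting confidence.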

---

### Question 8: Discussion and Challenges
1.  **Advantages of Fine-tuning:** What are the primary advantages of fine-tuning a pre-trained transformer model compared to training a traditional machine learning model (like Logistic Regression + TF-IDF) or a neural network from scratch for text classification?
2.  **Challenges/Considerations:** What are some common challenges or important considerations when fine-tuning transformer models on custom datasets? (Think about data size, quality, hyperparameter tuning, computational resources, and potential overfitting).
3.  **Improving Performance:** If your model's performance was not satisfactory, what steps would you take to try and improve it? Suggest at least three actionable strategies.

---

## Submission Guidelines
- Ensure your notebook runs without errors from top to bottom.
- Save your notebook as `your_name_finetuning_assignment.ipynb`.
- Clearly answer all questions and provide explanations where requested in Markdown cells.
- Feel free to add additional code cells or markdown cells for clarity or experimentation.

---