# Assignment: Fine-tune BERT on a Custom Classification Dataset

## Objective
This assignment focuses on the practical application of transfer learning in Natural Language Processing by fine-tuning a pre-trained BERT model for a custom text classification task. You will learn to prepare a dataset, configure a BERT model for classification, train it, and evaluate its performance.

## Part 1: Environment Setup and Dataset Preparation (30 Marks)

1.  **Environment Setup:**
    * Create a new Python virtual environment.
    * Install the necessary libraries: `transformers`, `torch` (or `tensorflow` if preferred, though the `Trainer` API used in Part 3 is PyTorch-only), `scikit-learn`, `pandas`, `numpy`, `matplotlib`, `seaborn`.
    * Provide a `requirements.txt` file listing all dependencies.

2.  **Custom Dataset Acquisition and Loading:**
    * Select a custom text classification dataset suitable for this assignment. Recommended options:
        * **Sentiment Analysis:** (e.g., IMDB movie reviews, Twitter sentiment, Amazon reviews; raw data may need to be combined and labeled first).
        * **News Classification:** (e.g., AG News, BBC News summary dataset).
        * **Spam Detection:** (e.g., SMS Spam Collection Dataset).
        * **Topic Classification:** (e.g., a subset of Reddit posts by subreddit, or general articles by topic).
    * **Minimum Requirements:** The dataset should have at least **2 classes** and a minimum of **1000 samples** in total (aim for more if possible for better results).
    * Load your chosen dataset into a Pandas DataFrame. The DataFrame should have at least two columns: one for the text content and one for the corresponding label.
    * Describe your chosen dataset, its source, the number of classes, and the distribution of samples per class.

3.  **Data Preprocessing and Splitting:**
    * Perform any necessary basic text preprocessing (e.g., handling missing values, consistent casing, basic cleaning). *Note: BERT tokenizers handle much of this internally (uncased checkpoints lowercase the text themselves), so heavy cleaning is often not required, but basic sanity checks are still worthwhile.*
    * Split your dataset into training, validation, and test sets (e.g., 70% train, 15% validation, 15% test). Ensure stratification if your classes are imbalanced.
    * Print the shape of each split and the distribution of classes within each split.
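
The 70/15/15 stratified split from step 3 can be sketched as two chained calls to scikit-learn's `train_test_split`. This is a minimal illustration rather than a required implementation: the toy DataFrame, the column names `text` and `label`, and `random_state=42` are assumptions standing in for your own data and choices.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for your own data: 100 rows, imbalanced 80/20 between two labels.
df = pd.DataFrame({
    "text": [f"sample text {i}" for i in range(100)],
    "label": ["ham"] * 80 + ["spam"] * 20,
})

# Step 1: hold out 30% of the rows, preserving the class ratio.
train_df, temp_df = train_test_split(
    df, test_size=0.30, stratify=df["label"], random_state=42
)
# Step 2: split the held-out 30% in half, giving 15% validation and 15% test.
val_df, test_df = train_test_split(
    temp_df, test_size=0.50, stratify=temp_df["label"], random_state=42
)

# Report the shape and class distribution of each split.
for name, split in [("train", train_df), ("validation", val_df), ("test", test_df)]:
    ratios = split["label"].value_counts(normalize=True).round(2).to_dict()
    print(name, split.shape, ratios)
```

Splitting in two stages keeps the class ratio roughly constant across all three sets, which is what makes validation and test metrics comparable on imbalanced data.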

In [None]:
# Your code for environment setup, dataset loading, and preprocessing here.
# Include explanations, dataset description, and split statistics.
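
For the basic sanity checks mentioned in the preprocessing step, something along the following lines is usually enough. It is only a sketch: the five-row DataFrame and the `text`/`label` column names are made up for illustration, and which checks you actually need depends on your dataset.

```python
import pandas as pd

# Made-up raw data containing a missing value, an empty string, and a duplicate.
raw = pd.DataFrame({
    "text": ["  Great movie! ", None, "", "Terrible plot.", "Great movie!"],
    "label": ["pos", "neg", "pos", "neg", "pos"],
})

clean = raw.dropna(subset=["text", "label"]).copy()   # drop rows with missing values
clean["text"] = clean["text"].str.strip()             # trim stray whitespace
clean = clean[clean["text"].str.len() > 0]            # drop empty strings
clean = clean.drop_duplicates(subset=["text", "label"]).reset_index(drop=True)

print(clean)
print(clean["label"].value_counts())
```

Run the cleaning before splitting, so that duplicates cannot end up in both the training and test sets.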

## Part 2: BERT Tokenization and Model Setup (30 Marks)

1.  **Load BERT Tokenizer:**
    * Load a pre-trained BERT tokenizer (e.g., `bert-base-uncased` or `distilbert-base-uncased`).
    * Explain your choice of tokenizer and its corresponding model.

2.  **Tokenization and Input Formatting:**
    * Tokenize your training, validation, and test texts using the loaded tokenizer.
    * Ensure the tokenization includes:
        * `truncation=True`
        * `padding='max_length'`
        * `return_tensors='pt'` (for PyTorch) or `'tf'` (for TensorFlow).
    * Specify a `max_length` that is appropriate for your dataset (e.g., 128 or 256).
    * Convert your labels into a suitable numerical format (e.g., using `LabelEncoder` or directly mapping to integers).
    * Print the shape of the tokenized outputs (input IDs, attention masks) for a sample of your dataset (e.g., the first 5 samples) and verify their format.

3.  **Load Pre-trained BERT Model for Sequence Classification:**
    * Load a pre-trained BERT model suitable for sequence classification (e.g., `BertForSequenceClassification` from `transformers`, or `AutoModelForSequenceClassification` if you chose a non-BERT checkpoint such as DistilBERT).
    * Initialize the model with the correct number of labels for your classification task (`num_labels`).
    * Print the model architecture or a summary to understand its layers.
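
To make the effect of `truncation=True` and `padding='max_length'` concrete, the snippet below mimics what they do to a single sequence. It deliberately does not call the tokenizer: `pad_or_truncate` is a hypothetical helper written for this illustration, and the token IDs in the two examples are made up. Only the special-token IDs (101 for `[CLS]`, 102 for `[SEP]`, 0 for `[PAD]`) are taken from the `bert-base-uncased` vocabulary.

```python
# Hand-rolled illustration only: in the assignment, the tokenizer does this for you.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0   # special-token IDs in bert-base-uncased

def pad_or_truncate(token_ids, max_length):
    """Mimic truncation=True with padding='max_length' for a single sequence."""
    body = token_ids[: max_length - 2]        # leave room for [CLS] and [SEP]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)     # 1 marks a real token
    n_pad = max_length - len(input_ids)
    input_ids += [PAD_ID] * n_pad             # pad on the right
    attention_mask += [0] * n_pad             # 0 marks padding, ignored by attention
    return input_ids, attention_mask

short_ids = [2023, 2003, 2307]                # made-up IDs for a 3-token text
long_ids = list(range(1000, 1020))            # made-up IDs for a 20-token text

for ids in (short_ids, long_ids):
    input_ids, attention_mask = pad_or_truncate(ids, max_length=8)
    print(input_ids, attention_mask)
```

Every sequence comes out with exactly `max_length` entries, so tokenizing N texts should give `input_ids` and `attention_mask` tensors of shape `(N, max_length)`. That is the shape to verify when you print the first five samples.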

In [None]:
# Your code for loading tokenizer, tokenization, and model setup here.
# Include explanations and verifications.
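
For the label conversion step, one possible approach uses scikit-learn's `LabelEncoder`. The label strings below are invented for illustration. The `id2label` and `label2id` dictionaries are optional extras that can be passed to `from_pretrained` alongside `num_labels`, so that predictions are reported with readable class names.

```python
from sklearn.preprocessing import LabelEncoder

# Invented labels standing in for the label column of your splits.
train_labels = ["sports", "business", "sports", "world"]
test_labels = ["world", "business"]

encoder = LabelEncoder()
train_y = encoder.fit_transform(train_labels)   # fit on the training labels only
test_y = encoder.transform(test_labels)         # reuse the same mapping elsewhere

# Classes are sorted alphabetically, so the integer IDs are reproducible.
id2label = {i: str(name) for i, name in enumerate(encoder.classes_)}
label2id = {name: i for i, name in id2label.items()}
num_labels = len(encoder.classes_)

print(train_y, test_y)
print(id2label, num_labels)
```

Fitting the encoder once and reusing it for validation and test data guarantees that the same integer always refers to the same class.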

## Part 3: Training and Evaluation (40 Marks)

1.  **Define Training Arguments:**
    * Use the `TrainingArguments` class from `transformers` to define your training configuration.
    * Set parameters such as:
        * `output_dir`
        * `num_train_epochs` (e.g., 2-4)
        * `per_device_train_batch_size` (e.g., 8-32, depends on GPU memory)
        * `per_device_eval_batch_size`
        * `warmup_steps`, `weight_decay`
        * `logging_dir`, `logging_steps`
        * `evaluation_strategy='epoch'` (named `eval_strategy` in recent `transformers` releases)
        * `save_strategy='epoch'`
        * `load_best_model_at_end=True`
    * Justify your choices for key training arguments.

2.  **Define Metrics:**
    * Create a `compute_metrics` function that takes `EvalPrediction` as input and returns a dictionary of metrics.
    * Include metrics such as `accuracy`, `precision`, `recall`, and `f1-score` (use `sklearn.metrics`).
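    * Set the `average` argument of the scikit-learn metrics to match your task: `'binary'` suits two-class problems, while `'macro'` or `'weighted'` is needed for multi-class ones.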

3.  **Create Trainer and Train:**
    * Instantiate the `Trainer` class, passing your model, training arguments, train dataset, eval dataset, and `compute_metrics` function.
    * Start the training process using `trainer.train()`.
    * Show the training loss and evaluation metrics during training.

4.  **Evaluate on Test Set:**
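    * After training, evaluate your fine-tuned model on the held-out test set, e.g. with `trainer.evaluate(eval_dataset=test_dataset)` or `trainer.predict(test_dataset)`, where `test_dataset` is your tokenized test split. Called without arguments, `trainer.evaluate()` scores the validation set instead.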
    * Print the final test set metrics (accuracy, precision, recall, f1-score).
    * Analyze the results: How well did the model perform? Are there signs of overfitting or underfitting? How does the performance compare to potential baselines?

5.  **Qualitative Analysis (Bonus - 5 Marks):**
    * Choose 3-5 challenging samples from your test set.
    * Predict their labels using your fine-tuned model.
    * Compare the predicted label with the true label and the model's confidence. Discuss why the model might have made correct or incorrect predictions for these specific examples.
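
For the `compute_metrics` function in step 2, a minimal sketch might look like the following. It assumes the model outputs one logit per class and uses macro averaging by default. Both are choices you should adapt and justify, not requirements. The toy logits at the bottom are hand-written so the function can be checked without training a model.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred, average="macro"):
    """Turn (logits, labels) into a metrics dict; `average` is a design choice."""
    logits, labels = eval_pred                # EvalPrediction unpacks like a 2-tuple
    preds = np.argmax(logits, axis=-1)        # predicted class = highest logit
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average=average, zero_division=0
    )
    return {
        "accuracy": float(accuracy_score(labels, preds)),
        "precision": float(precision),
        "recall": float(recall),
        "f1": float(f1),
    }

# Toy check with hand-written logits for a 3-class problem.
logits = np.array([[2.0, 0.1, 0.3], [0.2, 1.5, 0.1], [0.1, 0.2, 3.0], [1.2, 0.9, 0.1]])
labels = np.array([0, 1, 2, 1])
metrics = compute_metrics((logits, labels))
print(metrics)
```

Macro averaging weights every class equally, which makes weak performance on a minority class visible. Weighted averaging instead tracks overall sample-level performance more closely.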

In [None]:
# Your code for defining training arguments, metrics, Trainer, training, and evaluation.
# Include the final test set metrics and analysis.
# (For bonus, add qualitative analysis code and discussion.)
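
For the qualitative analysis, the model's confidence can be read off by applying a softmax to its logits. The sketch below uses hypothetical logits for three samples of a binary task. In your notebook they would come from your own model, for instance from the `predictions` field returned by `trainer.predict(...)`.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Hypothetical logits for three test samples in a binary task.
logits = np.array([[4.0, -1.0], [0.2, 0.1], [-2.0, 3.0]])
true_labels = np.array([0, 1, 1])

probs = softmax(logits)
pred_labels = probs.argmax(axis=-1)   # most probable class per sample
confidence = probs.max(axis=-1)       # probability assigned to that class

for i in range(len(logits)):
    status = "correct" if pred_labels[i] == true_labels[i] else "WRONG"
    print(f"sample {i}: predicted={pred_labels[i]} true={true_labels[i]} "
          f"confidence={confidence[i]:.3f} ({status})")
```

Low-confidence errors, like the second sample here, often point to genuinely ambiguous text, whereas high-confidence errors are more likely to reveal label noise or a systematic blind spot.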

## Part 4: Reflection and Future Work (Bonus - 10 Marks)

1.  **Challenges Faced:**
    * Describe any challenges you encountered during this assignment (e.g., GPU memory issues, hyperparameter tuning, data imbalance) and how you addressed them.

2.  **Potential Improvements:**
    * Suggest ways to further improve the model's performance. Consider:
        * Different BERT variants (e.g., RoBERTa, ELECTRA, larger models).
        * Advanced text preprocessing (e.g., handling emojis, slang).
        * Hyperparameter optimization techniques (e.g., grid search, random search).
        * Data augmentation.
        * Class imbalance handling techniques.
        * Different classification heads.

3.  **Learnings:**
    * Summarize your key learnings from fine-tuning BERT for text classification.
    * How does transfer learning benefit NLP tasks compared to training from scratch?

## Submission Guidelines

* Submit this Jupyter Notebook (.ipynb file) with all cells executed and outputs visible.
* Ensure your code is well-commented and easy to understand.
* Provide a `requirements.txt` file listing all dependencies.
* Include a brief `README.md` file (optional but recommended) explaining how to run your code and any specific instructions, especially if your dataset needs specific download steps.
* Make sure your notebook runs without errors in the specified environment.