# HW 0 - Part 2: Language Model (LM) Fine-tuning with Hugging Face

In this assignment, you will implement PyTorch code to train a language model (LM) using the [🤗 Transformers](https://github.com/huggingface/transformers) library. You will fine-tune a pre-trained GPT-2 model on a [Harry Potter corpus](https://huggingface.co/datasets/WutYee/HarryPotter_books_1to7) and evaluate it on a language modeling task (a.k.a. next-token prediction). If you are familiar with 🤗 Transformers and 🤗 Datasets, feel free to skip steps 0 through 2.

### Step 0: Installation
If you are using Google Colab or a fresh Python environment, you will need to install the required libraries. Uncomment and run the following cell to install 🤗 Transformers and 🤗 Datasets:

In [None]:
#! pip install transformers datasets

- `transformers`: Provides pre-trained models like GPT-2 for fine-tuning.
- `datasets`: Offers easy access to datasets.
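
If you are unsure whether the libraries are already present, a quick check like the following may help. It is only a sketch: `missing_packages` is a hypothetical helper written for this notebook, and including `torch` in the list is an assumption based on the PyTorch requirement stated above.

```python
import importlib.util


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# An empty list means every package was found and no installation is needed.
missing = missing_packages(["transformers", "datasets", "torch"])
print("Missing packages:", missing)
```

If the list is not empty, run the installation cell above and restart the runtime if prompted.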

### Step 1: Preparing the dataset
We will use the [Harry Potter corpus](https://huggingface.co/datasets/WutYee/HarryPotter_books_1to7) dataset to fine-tune GPT-2. The 🤗 Datasets library makes it simple to load datasets.

Run the following code to load the dataset using `load_dataset`:

In [None]:
from datasets import load_dataset

# Load the Harry Potter corpus dataset
datasets = load_dataset('WutYee/HarryPotter_books_1to7')

# Preview the dataset structure
print(datasets)


As the printed `DatasetDict` shows, a dataset is typically divided into splits such as `train`, `validation`, or `test`. To access a specific example, you must choose a split and an index.

In [None]:
# Access an example from the 'train' split
example = datasets["train"][10]

# Print the example
print(example)

### Step 2: Preprocessing the Dataset
To fine-tune GPT-2, we need to tokenize the dataset text into a format the model can process. To tokenize all our texts with the same vocabulary that was used when training the model, we have to download a pretrained tokenizer. This is all done by the `AutoTokenizer` class:


In [None]:
from transformers import AutoTokenizer

model_name = 'gpt2'
tokenizer = AutoTokenizer.from_pretrained(model_name)

The tokenizer will split the text into tokens and convert them to numerical IDs.

We can now call the tokenizer on all our texts. This is very simple using the `map` method from the 🤗 Datasets library. First, we define a function that calls the tokenizer on our texts:

In [None]:
def tokenize_function(examples):
    return tokenizer(examples["text"])

Then we apply it to all the splits in our `datasets` object, using `batched=True` and 4 processes to speed up the preprocessing. We won't need the `text` column afterward, so we discard it.

In [None]:
# Apply the tokenizer to the dataset
tokenized_datasets = datasets.map(tokenize_function, batched=True, num_proc=4, remove_columns=["text"])

If we now look at an element of our datasets, we will see that the text has been replaced by the `input_ids` the model will need:

In [None]:
tokenized_datasets["train"][1]

Now for the harder part: we need to concatenate all our texts together and then split the result into small chunks of a certain `block_size`. To do this, we will use the `map` method again, with the option `batched=True`. This option lets us change the number of examples in the datasets by returning a different number of examples than we received. This way, we can create our new samples from a batch of examples.

First, we choose the block size. The maximum length our model was pretrained with (1,024 tokens for GPT-2) might be a bit too big to fit in your GPU RAM, so here we use a smaller value of 128.

In [None]:
block_size = 128

Then we write the preprocessing function that will group our texts:

In [None]:
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # Drop the small remainder so the length is a multiple of block_size.
    # We could pad instead if the model supported it; customize this part as needed.
    total_length = (total_length // block_size) * block_size
    # Split into chunks of block_size.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

First, note that we duplicate the inputs to create the labels. This is because the models in the 🤗 Transformers library shift the labels by one position internally, so we don't need to do it manually.
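
To picture what that internal shift does, here is a minimal sketch in plain Python. It assumes a made-up sequence of four token IDs and only mimics the alignment; the real model applies the same one-position offset to its logits and labels tensors when it computes the loss.

```python
# Hypothetical token IDs for a single chunk (values are made up for illustration).
input_ids = [10, 11, 12, 13]
labels = input_ids.copy()  # labels start out identical to the inputs

# The prediction made at position i is scored against the token at position i + 1,
# so the last position has no target and the first label is never used.
positions = input_ids[:-1]  # last token seen when each prediction is made
targets = labels[1:]        # token the model should predict next

pairs = list(zip(positions, targets))
print(pairs)  # [(10, 11), (11, 12), (12, 13)]
```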

Also note that by default, the `map` method sends batches of 1,000 examples to the preprocessing function. As a result, the remainder is dropped once per batch of 1,000 examples, so that each batch's concatenated tokens form a multiple of `block_size`. You can reduce the amount dropped by passing a larger batch size (which will also be slower to process). You can also speed up the preprocessing by using multiprocessing:

In [None]:
lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=4,
)
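
How much text is lost to the dropped remainder depends on the batch size. The sketch below estimates it under simplifying assumptions: the example lengths are made up, `tokens_dropped` is a hypothetical helper (not part of 🤗 Datasets), and any extra splitting introduced by `num_proc` is ignored.

```python
def tokens_dropped(example_lengths, batch_size, block_size):
    """Count tokens discarded when each batch is truncated to a multiple of block_size."""
    dropped = 0
    for start in range(0, len(example_lengths), batch_size):
        batch_total = sum(example_lengths[start : start + batch_size])
        dropped += batch_total % block_size
    return dropped


# Ten hypothetical examples of 50 tokens each (500 tokens in total).
lengths = [50] * 10

# Batches of 2 examples hold 100 tokens, fewer than one block, so everything is dropped.
print(tokens_dropped(lengths, batch_size=2, block_size=128))   # 500
# A single batch of 10 examples yields 3 full blocks and drops only the tail.
print(tokens_dropped(lengths, batch_size=10, block_size=128))  # 116
```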

And we can check our datasets have changed: now the samples contain chunks of `block_size` contiguous tokens, potentially spanning over several of our original texts.

In [None]:
tokenizer.decode(lm_datasets["train"][1]["input_ids"])
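
To make the grouping step concrete, the following toy sketch runs the same concatenate-then-chunk logic on a tiny hand-made batch. The token IDs are made up, and `group_texts_toy` is an illustrative variant that takes `block_size` as an argument; it is not a replacement for the function defined above.

```python
def group_texts_toy(examples, block_size):
    """Concatenate every column, then cut it into chunks of block_size tokens."""
    concatenated = {k: [tok for seq in v for tok in seq] for k, v in examples.items()}
    total_length = len(next(iter(concatenated.values())))
    # Drop the remainder so that every chunk has exactly block_size tokens.
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = [chunk.copy() for chunk in result["input_ids"]]
    return result


# Three hypothetical tokenized texts holding 10 tokens in total.
batch = {"input_ids": [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10]]}
grouped = group_texts_toy(batch, block_size=4)
print(grouped["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Three input examples become two output chunks, the chunks cross the original text boundaries, and the last two tokens (9 and 10) are dropped.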

Now that the data has been preprocessed, we're ready to train our model. 🤗 Transformers provides APIs and tools to easily download and train pretrained language models. First, we load the pre-trained GPT-2 model using `AutoModelForCausalLM.from_pretrained`.

In [None]:
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(model_name)

Now we need to implement fine-tuning for a pre-trained GPT-2 model and evaluate the model on a language modeling task.

### Step 3: Fine-Tuning the GPT-2 Model

1. **Implement the Training Loop and Perplexity Evaluation**:  
   - Write a training loop to fine-tune the GPT-2 model
   - You may experiment with different optimizers, learning rates, and batch sizes. Here are the default values to start with:
     - Learning rate: 2e-5
     - Optimizer: AdamW
     - Batch size: 8
   - Include an evaluation function to calculate **perplexity** on the validation set at the end of each epoch.  
   - You may refer to open-source trainer implementations such as [minGPT](https://github.com/karpathy/minGPT/blob/master/mingpt/trainer.py#L81) for guidance.

2. **Validation and Test Evaluation**:  
   - After each epoch, evaluate your model on the **validation set** and record the perplexity.  
   - Once training is complete (after 3 epochs), evaluate the final model on the **test set**.

Your goal is to achieve a **perplexity** in the range of **30–50** after **3 epochs** of training.
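
Perplexity is the exponential of the average per-token cross-entropy (negative log-likelihood). As a minimal sketch of just that final step, assuming you have already collected a mean loss and a token count for every evaluation batch (the function name and the numbers below are hypothetical):

```python
import math


def perplexity_from_losses(batch_losses, batch_token_counts):
    """Turn per-batch mean cross-entropy losses into a corpus-level perplexity."""
    # Weight each batch's mean loss by its token count to recover the total loss.
    total_nll = sum(loss * n for loss, n in zip(batch_losses, batch_token_counts))
    total_tokens = sum(batch_token_counts)
    return math.exp(total_nll / total_tokens)


# Two hypothetical batches with the same mean loss but different sizes.
ppl = perplexity_from_losses([3.0, 3.0], [100, 50])
print(round(ppl, 2))  # 20.09
```

Weighting by token count matters only when batches hold different numbers of tokens, for example when the last batch is smaller than the others.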

To receive full credit, you must report the following:

- Training loss and perplexity on the **validation set** for each of the 3 epochs.  
- The final perplexity score on the **test set**.

e.g.,

---

### **Example Output**

| Epoch | Training Loss | Perplexity on Validation Set |
|-------|---------------|-----------------------------|
|   1   |     3.14    |          18.37             |
|   2   |     2.98    |          17.83             |
|   3   |     2.91    |          17.73             |

**Final Perplexity on the Test Set**: **43.46**

---

In [None]:
# Write your code here !


### Step 4: Submit your code and PDF

### Submit your code

Zip the `hw0` directory, and submit the zip file to `HW 0 - Code` on Gradescope.

### Submit your PDF

1. Re-run all cells in order. Make sure that the outputs of all cells are displayed correctly.
2. Convert the notebook to PDF:
   - Download the .ipynb file to your local machine.
   - Use a tool like `nbconvert` to convert the notebook to PDF.
   - Alternatively, if you encounter issues with `nbconvert`, you can save the notebook webpage as a PDF.
3. Review the PDF file: make sure all your code and outputs are displayed correctly.
4. Submit your PDF and the notebook file on Gradescope: `HW 0 - Part 2 PDF`

