# HW 0 - Part 2: Language Model (LM) Fine-tuning with Huggingface

In this assignment, you will implement Pytorch code to train a language model (LM) using the [🤗 Transformers](https://github.com/huggingface/transformers) library. You will fine-tune a pre-trained GPT-2 model on a [Harry Potter corpus](https://huggingface.co/datasets/WutYee/HarryPotter_books_1to7), and evaluate the model on a lauguage modeling task (a.k.a. next token prediction). If you are familiar with 🤗 Transformers and 🤗 Datasets, feel free to skip steps 0 through 2.

### Step 0: Installation
If you are using Google Colab or a fresh Python environment, you will need to install the required libraries:

Uncomment and run the following cell to install 🤗 Transformers and 🤗 Datasets:

In [2]:
! pip install transformers datasets

Collecting datasets
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.2.0-py3-none-any.whl (480 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m480.6/480.6 kB[0m [31m20.0 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading dill-0.3.8-py3-none-any.whl (116 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m10.6 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading fsspec-2024.9.0-py3-none-any.whl 

- `transformers`: Provides pre-trained models like GPT-2 for fine-tuning.
- `datasets`: Offers easy access to datasets.

### Step 1: Preparing the dataset
We will use the [Harry Potter corpus](https://huggingface.co/datasets/WutYee/HarryPotter_books_1to7) dataset to fine-tune the GPT-2. The 🤗 Datasets library makes it simple to load datasets.

Run the following code to load the dataset using `load_dataset`:

In [3]:
from datasets import load_dataset

# Load the Harry Potter corpus dataset
datasets = load_dataset('WutYee/HarryPotter_books_1to7')

# Preview the dataset structure
print(datasets)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


(…)ry Potter-Book 1-The Sorcerers Stone.txt:   0%|          | 0.00/472k [00:00<?, ?B/s]

(…)Potter-Book 2-The Chamber of Secrets.txt:   0%|          | 0.00/548k [00:00<?, ?B/s]

(…)rry Potter-Book 3-The Goblet of Fire.txt:   0%|          | 0.00/1.19M [00:00<?, ?B/s]

(…)otter-Book 4-The Prisoner of Azkaban.txt:   0%|          | 0.00/609k [00:00<?, ?B/s]

(…)tter-Book 5-The Order of the Phoenix.txt:   0%|          | 0.00/1.61M [00:00<?, ?B/s]

(…) Potter-Book 6-The Half-Blood Prince.txt:   0%|          | 0.00/1.09M [00:00<?, ?B/s]

(…)ry Potter-Book 7-The Deathly Hallows.txt:   0%|          | 0.00/1.19M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/81349 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/23118 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/23620 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 81349
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 23118
    })
    test: Dataset({
        features: ['text'],
        num_rows: 23620
    })
})



As shown in `DatasetDict`, the dataset is typically split into subsets like `train`, `test`, or `validation`. To access a specific example, you must choose a split and an index.

In [4]:
# Access an example from the 'train' split
example = datasets["train"][1]

# Print the example
print(example)

{'text': "Sorcerer's Stone"}


### Step 2: Preprocessing the Dataset
To fine-tune GPT-2, we need to tokenize the dataset text into a format the model can process. To tokenize all our texts with the same vocabulary that was used when training the model, we have to download a pretrained tokenizer. This is all done by the `AutoTokenizer` class:


In [5]:
from transformers import AutoTokenizer

model_name = 'gpt2'
tokenizer = AutoTokenizer.from_pretrained(model_name)

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.


0it [00:00, ?it/s]

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

The tokenizer will split the text into tokens and convert them to numerical IDs.

We can now call the tokenizer on all our texts. This is very simple, using the `map` method from the Datasets library. First we define a function that call the tokenizer on our texts:

In [6]:
def tokenize_function(examples):
    return tokenizer(examples["text"])

Then we apply it to all the splits in our `datasets` object, using `batched=True` and 4 processes to speed up the preprocessing. We won't need the `text` column afterward, so we discard it.

In [7]:
# Apply the tokenizer to the dataset
tokenized_datasets = datasets.map(tokenize_function, batched=True, num_proc=4, remove_columns=["text"])

Map (num_proc=4):   0%|          | 0/81349 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/23118 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/23620 [00:00<?, ? examples/s]

If we now look at an element of our datasets, we will see the text have been replaced by the `input_ids` the model will need:

In [8]:
tokenized_datasets["train"][1]

{'input_ids': [50, 8387, 11751, 338, 8026], 'attention_mask': [1, 1, 1, 1, 1]}

Now for the harder part: we need to concatenate all our texts together then split the result in small chunks of a certain `block_size`. To do this, we will use the `map` method again, with the option `batched=True`. This option actually lets us change the number of examples in the datasets by returning a different number of examples than we got. This way, we can create our new samples from a batch of examples.

First, we grab the maximum length our model was pretrained with. This might be a big too big to fit in your GPU RAM, so here we take a bit less at just 128.

In [9]:
block_size = 128

Then we write the preprocessing function that will group our texts:

In [10]:
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder, we could add padding if the model supported it
    # instead of this drop, you can customize this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

First note that we duplicate the inputs for our labels. This is because the model of the 🤗 Transformers library apply the shifting to the right, so we don't need to do it manually.

Also note that by default, the `map` method will send a batch of 1,000 examples to be treated by the preprocessing function. So here, we will drop the remainder to make the concatenated tokenized texts a multiple of `block_size` every 1,000 examples. You can adjust this behavior by passing a higher batch size (which will also be processed slower). You can also speed-up the preprocessing by using multiprocessing:

In [11]:
lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=4,
)

Map (num_proc=4):   0%|          | 0/81349 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/23118 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/23620 [00:00<?, ? examples/s]

And we can check our datasets have changed: now the samples contain chunks of `block_size` contiguous tokens, potentially spanning over several of our original texts.

In [12]:
tokenizer.decode(lm_datasets["train"][1]["input_ids"])

'�t hold with such nonsense.      Mr. Dursley was the director of a firm called Grunnings, which madedrills. He was a big, beefy man with hardly any neck, although he did have avery large mustache. Mrs. Dursley was thin and blonde and had nearly twice theusual amount of neck, which came in very useful as she spent so much of hertime craning over garden fences, spying on the neighbors. The Dursleys had asmall son called Dudley and in their opinion there was no finer boy anywhere.      The Dursleys'

Now that the data has been cleaned, we're ready to train our model. 🤗 Transformers provides APIs and tools to easily download and train pretrained LM models. First we load the pre-trained GPT-2 model using `AutoModelForCausalLM.from_pretrained`.

In [13]:
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(model_name)

model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

Now we need to implement fine-tuning for a pre-trained GPT-2 model and evaluate the model on a language modeling task.

### Step 3: Fine-Tuning the GPT-2 Model.

1. **Implement the Training Loop and Perplexity Evaluation**:  
   - Write a training loop to fine-tune the GPT-2 model
   - You may experiment with different optimizers, learning rates, and batch sizes. Here are the default values to start with:
         - Learning rate: 2e-5
         - Optimzer: AdamW
         - Batch size: 8
   - Include an evaluation function to calculate **perplexity** on the validation set at the end of each epoch.  
   - You may refer to open-source trainer implementations such as [miniGPT](https://github.com/karpathy/minGPT/blob/master/mingpt/trainer.py#L81) for guidance.

2. **Validation and Test Evaluation**:  
   - After each epoch, evaluate your model on the **validation set** and record the perplexity.  
   - Once training is complete (after 3 epochs), evaluate the final model on the **test set**.

Your goal is to achieve a **perplexity** in the range of **30–50** after **3 epochs** of training.

To receive full credit, you must report the following:

- Training loss and perplexity on the **validation set** for each of the 3 epochs.  
- The final perplexity score on the **test set**.

e.g.,

---

### **Example Output**

| Epoch | Training Loss | Perplexity on Validation Set |
|-------|---------------|-----------------------------|
|   1   |     3.14    |          18.37             |
|   2   |     2.98    |          17.83             |
|   3   |     2.91    |          17.73             |

**Final Perplexity on the Test Set**: **43.46**

---

In [14]:
import torch
from torch.utils.data import DataLoader
from transformers import AdamW, get_scheduler
from transformers import DataCollatorForLanguageModeling
# from datasets import load_metric
from tqdm import tqdm

In [15]:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

device

device(type='cuda')

In [16]:
model.to(device)

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2SdpaAttention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

In [17]:
# Write your code here !
learning_rate = 2e-5
batch_size = 8
epochs = 3

optimizer = AdamW(model.parameters(), lr=learning_rate)
tokenizer.pad_token = tokenizer.eos_token



In [18]:
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=False
)

# Prepare dataloaders
train_dataloader = DataLoader(
    lm_datasets["train"], shuffle=True, batch_size=batch_size, collate_fn=data_collator
)
val_dataloader = DataLoader(
    lm_datasets["validation"], batch_size=batch_size, collate_fn=data_collator
)

In [23]:
validation_loss = []
perplexity_list = []
for epoch in range(epochs):
    print(f"Epoch {epoch + 1}/{epochs}")
    model.train()
    total_loss = 0

    for batch in tqdm(train_dataloader):
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        total_loss += loss.item()

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print(f"Average Training Loss: {total_loss / len(train_dataloader):.4f}")

    # Validation
    model.eval()
    total_loss = 0
    with torch.no_grad():
        for batch in val_dataloader:
            batch = {k: v.to(device) for k, v in batch.items()}
            outputs = model(**batch)
            total_loss += outputs.loss.item()

    avg_val_loss = total_loss / len(val_dataloader)
    perplexity = torch.exp(torch.tensor(avg_val_loss))
    print(f"Validation Loss: {avg_val_loss:.4f}, Perplexity: {perplexity:.4f}")

    validation_loss.append(avg_val_loss)
    perplexity_list.append(perplexity)

Epoch 1/3


100%|██████████| 1173/1173 [05:42<00:00,  3.43it/s]


Average Training Loss: 2.7662
Validation Loss: 2.8472, Perplexity: 17.2393
Epoch 2/3


100%|██████████| 1173/1173 [05:42<00:00,  3.42it/s]


Average Training Loss: 2.6988
Validation Loss: 2.8487, Perplexity: 17.2650
Epoch 3/3


100%|██████████| 1173/1173 [05:43<00:00,  3.42it/s]


Average Training Loss: 2.6394
Validation Loss: 2.8635, Perplexity: 17.5236


In [27]:
from prettytable import PrettyTable
table = PrettyTable()

table.add_column("Epoch", [1, 2, 3])
table.add_column("Training Loss", validation_loss)
table.add_column("Perplexity", perplexity_list)

print(table)

+-------+--------------------+-----------------+
| Epoch |   Training Loss    |    Perplexity   |
+-------+--------------------+-----------------+
|   1   | 2.847190071458686  | tensor(17.2393) |
|   2   | 2.8486833523397577 | tensor(17.2650) |
|   3   | 2.863548262478554  | tensor(17.5236) |
+-------+--------------------+-----------------+


In [28]:
test_dataloader = DataLoader(
    lm_datasets["test"], batch_size=batch_size, collate_fn=data_collator
)

total_loss = 0
model.eval()
with torch.no_grad():
    for batch in test_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        total_loss += outputs.loss.item()

avg_test_loss = total_loss / len(test_dataloader)
test_perplexity = torch.exp(torch.tensor(avg_test_loss))
print(f"Test Loss: {avg_test_loss:.4f}, Test Perplexity: {test_perplexity:.4f}")

Test Loss: 3.8344, Test Perplexity: 46.2664


### Step 4: Submit your code and PDF

See the instruction in `hw0/README.md`