In this notebook, you will train a decoder-only LLM (GPT-2) with a **character** tokenizer on data from Shakespeare and generate sentences.

You will use Hugging Face to train the models.

**Important**: you will need to use a GPU for training. To change to a GPU, select Runtime > Change runtime type from the menu bar above. Select 'T4'.

# Load English training data
First, upload the `shakespeare_input.txt` file from Homework 3 to the Colab file manager. To do this, click the folder icon in the left-hand sidebar, then click the upload icon (the one with the arrow pointing up) and select `shakespeare_input.txt`.

Once the file is available in the Colab notebook, open it and read each line into a Python list named `training_data`.
The code below lowercases the text and drops empty lines. You can add any other preprocessing you want here as well.

In [32]:
with open('shakespeare_input.txt') as f:
  training_data = [line for line in f.read().lower().splitlines() if len(line) > 0]

training_data[:10] # to check the first 10 lines

['first citizen:',
 'before we proceed any further, hear me speak.',
 'all:',
 'speak, speak.',
 'first citizen:',
 'you are all resolved rather to die than to famish?',
 'all:',
 'resolved. resolved.',
 'first citizen:',
 'first, you know caius marcius is chief enemy to the people.']
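
Before designing the tokenizer, it can help to check which characters actually occur in the corpus. The following is a minimal sketch using only the standard library; `character_inventory` and `sample_lines` are illustrative names that are not part of the assignment, and in the notebook you would pass `training_data` instead of the sample list.

```python
from collections import Counter

def character_inventory(lines):
    """Count how often each character occurs across a list of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(line)  # updating a Counter with a string counts its characters
    return counts

# Hypothetical stand-in for `training_data`.
sample_lines = ["first citizen:", "speak, speak."]
inventory = character_inventory(sample_lines)

print(sorted(inventory))         # distinct characters in sorted order
print(inventory.most_common(3))  # the three most frequent characters
```

Any character that shows up here but is missing from the tokenizer's character list would not be guaranteed to become its own token.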

# "Train" a tokenizer

Hugging Face models use specified tokenizers which define the possible tokens.
Here we want to modify the existing `GPT2TokenizerFast` class to tokenize on characters.

In [33]:
# Run this to make sure you have the necessary packages
! pip install transformers[torch]
! pip install datasets



Define a new Hugging Face tokenizer here that only accepts characters and save it to an object named `char_tokenizer`.

You can reference the following:
* https://discuss.huggingface.co/t/character-level-tokenizer/12450/3
* https://huggingface.co/learn/nlp-course/en/chapter6/
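
To make the idea of character-level tokenization concrete, here is a toy tokenizer in plain Python. It is only a sketch of the concept, not the Hugging Face tokenizer the assignment asks for: the class `SimpleCharTokenizer`, the `[UNK]` token, and the choice of id 0 for unknown characters are all assumptions made for illustration.

```python
import string

class SimpleCharTokenizer:
    """Toy character-level tokenizer: one integer id per character."""

    def __init__(self, alphabet, unk_token="[UNK]"):
        # Id 0 is reserved for characters outside the alphabet.
        self.id_to_token = [unk_token] + sorted(set(alphabet))
        self.token_to_id = {tok: i for i, tok in enumerate(self.id_to_token)}
        self.unk_id = 0

    def encode(self, text):
        """Map each character to its id, falling back to the unknown id."""
        return [self.token_to_id.get(ch, self.unk_id) for ch in text]

    def decode(self, ids):
        """Map ids back to characters and join them into a string."""
        return "".join(self.id_to_token[i] for i in ids)

# Lowercase letters, punctuation, space, and newline.
alphabet = string.ascii_lowercase + string.punctuation + " \n"
toy_tokenizer = SimpleCharTokenizer(alphabet)

ids = toy_tokenizer.encode("hello world")
print(ids)
print(toy_tokenizer.decode(ids))
```

The vocabulary here is tiny (one entry per character plus the unknown token), which is the main contrast with a subword vocabulary such as GPT-2's.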

In [34]:
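import string
import torch
import os
import multiprocessing
from transformers import GPT2TokenizerFast, GPT2LMHeadModel, Trainer, TrainingArguments, DataCollatorForLanguageModeling
from datasets import Dataset
from math import exp

# Start from the pretrained GPT-2 tokenizer, which keeps its original BPE vocabulary
char_tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# Register lowercase letters, punctuation, space, and newline as standalone tokens
# so that text made of these characters is split one character at a time
char_vocab = list(string.ascii_lowercase) + list(string.punctuation) + [" ", "\n"]
char_tokenizer.add_tokens(char_vocab, special_tokens=True)

# GPT-2 has no padding token by default, so add one for fixed-length batches
char_tokenizer.add_special_tokens({"pad_token": "[PAD]"})
char_tokenizer.save_pretrained("char_tokenizer_shakespeare")

# Use the GPU when one is available
device = "cuda" if torch.cuda.is_available() else "cpu"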

Test your new tokenizer with the following cell. It should provide each token as a character. You may get unexpected behavior for the space character, and that's ok.

In [35]:
char_tokenizer.tokenize("hello world")

['h', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd']
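
One likely source of odd-looking spaces is GPT-2's byte-level BPE, which does not store a raw space in its base vocabulary. Instead, every byte is remapped to a printable character, and the space byte is shown as `Ġ`. The sketch below re-implements that byte-to-character mapping in plain Python, modeled on the `bytes_to_unicode` helper from the original GPT-2 code; treat it as an illustration rather than the library's exact source.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable unicode character."""
    # Bytes that already print cleanly keep their own character.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Remaining bytes (space, control characters, ...) are shifted
            # to code points starting at 256 so they become visible.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return dict(zip(byte_values, (chr(c) for c in code_points)))

mapping = bytes_to_unicode()
print(repr(mapping[ord(" ")]))   # how a space is displayed
print(repr(mapping[ord("\n")]))  # how a newline is displayed
print(repr(mapping[ord("a")]))   # ordinary letters are unchanged
```

If a space shows up as `Ġ` in the tokenizer output, it is the same space, only displayed through this mapping.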

# Train GPT-2 model with character tokenizer

Here's where you will train your GPT-2 model on the Shakespeare data using your new character tokenizer. Specifically, train the `GPT2LMHeadModel` from the `transformers` package.

Here are some references for the code for this part:
* https://colab.research.google.com/github/huggingface/blog/blob/main/notebooks/01_how_to_train.ipynb
* https://huggingface.co/docs/transformers/en/tasks/language_modeling (this guide covers fine-tuning, not training from scratch, but it is still useful for explanations of the Hugging Face classes)

You will want to define a model, load in the Shakespeare dataset in a format that Hugging Face can work with, define training parameters, and then train the model.
This training may take 30 minutes or longer.

**You will also need to save the model** with a name like `char_gpt2_shakespeare` to be able to generate from it later.

In [36]:
# Check for GPU
print(torch.cuda.is_available())
print(torch.__version__)
print(torch.version.cuda)

True
2.5.1+cu121
12.1


In [37]:
char_tokenizer = GPT2TokenizerFast.from_pretrained("char_tokenizer_shakespeare")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [38]:
def tokenize_function(examples):
    return char_tokenizer(examples["text"], truncation=True, padding="max_length", max_length=128)

dataset = Dataset.from_dict({"text": training_data})

tokenized_dataset = dataset.map(tokenize_function, batched=True)

print(tokenized_dataset[0])



{'text': 'first citizen:', 'input_ids': [69, 72, 81, 82, 83, 50257, 66, 72, 83, 72, 89, 68, 77, 25, 50259, 50259, 50259, ...], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, ...]}
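
Most of each padded example is the padding id (50259 in the output above), and those positions should not count toward the training loss. With `mlm=False`, `DataCollatorForLanguageModeling` builds the labels from `input_ids` and sets padded positions to -100, the value PyTorch's cross-entropy loss ignores by default. The plain-Python sketch below mimics that masking on a shortened example; `mask_padding_labels` is a hypothetical helper written for illustration, not a library function.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def mask_padding_labels(input_ids, pad_token_id):
    """Copy input ids to labels, replacing padding positions with IGNORE_INDEX."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]

pad_id = 50259  # id of "[PAD]" as seen in the tokenized output
# "first citizen:" followed by three padding tokens (shortened from 128 for display)
input_ids = [69, 72, 81, 82, 83, 50257, 66, 72, 83, 72, 89, 68, 77, 25] + [pad_id] * 3
labels = mask_padding_labels(input_ids, pad_id)

print(labels)
print(sum(1 for label in labels if label != IGNORE_INDEX), "positions contribute to the loss")
```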

In [39]:
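# Load the pretrained GPT-2 weights (so training continues from them rather than
# from a random initialization) and grow the embedding matrix to cover the added tokens
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.resize_token_embeddings(len(char_tokenizer))
model.to(device)

print(model)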

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50260, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2SdpaAttention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50260, bias=False)
)


In [40]:
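# Use up to 4 worker processes for data loading
dataloader_num_workers = min(4, multiprocessing.cpu_count())

training_args = TrainingArguments(
    output_dir="./char_gpt2_shakespeare",
    eval_strategy="epoch",            # evaluate once per epoch
    logging_steps=500,
    per_device_train_batch_size=64,
    gradient_accumulation_steps=8,    # accumulate 8 batches per optimizer step
    num_train_epochs=1,
    learning_rate=3e-4,
    weight_decay=0.01,
    save_strategy="epoch",            # write a checkpoint once per epoch
    fp16=True,                        # mixed-precision training
    fp16_full_eval=True,
    dataloader_num_workers=dataloader_num_workers,
    auto_find_batch_size=True,        # retry with a smaller batch size on out-of-memory errors
)

# With mlm=False the collator prepares labels for causal (next-token) language modeling
data_collator = DataCollatorForLanguageModeling(
    tokenizer=char_tokenizer,
    mlm=False
)

# Note: the evaluation set here is the training set itself, so the reported
# validation loss does not measure generalization to unseen text
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    eval_dataset=tokenized_dataset,
    data_collator=data_collator,
)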
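
It is worth checking how many optimizer steps these settings imply. The arithmetic below is a rough sketch that assumes a single GPU and that `auto_find_batch_size` does not end up lowering the batch size; the example count comes from the `Map` progress output (136177 examples).

```python
import math

num_examples = 136177   # number of tokenized lines
per_device_batch = 64   # per_device_train_batch_size
grad_accum = 8          # gradient_accumulation_steps
num_gpus = 1            # assumption: one T4
logging_steps = 500

# Examples consumed per optimizer step
effective_batch = per_device_batch * grad_accum * num_gpus
# Mini-batches the dataloader yields in one epoch
batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_gpus))
# Optimizer steps in one epoch
optimizer_steps = batches_per_epoch // grad_accum

print("effective batch size:", effective_batch)
print("mini-batches per epoch:", batches_per_epoch)
print("optimizer steps per epoch:", optimizer_steps)
print("reaches first logging step:", optimizer_steps >= logging_steps)
```

Under these assumptions, one epoch finishes before step 500, so the periodic training-loss log would likely never fire. A smaller `logging_steps` would make progress visible during the run.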

In [41]:
trainer.train()

model.save_pretrained("char_gpt2_shakespeare")
char_tokenizer.save_pretrained("char_gpt2_shakespeare")

Epoch,Training Loss,Validation Loss


KeyboardInterrupt: 

# Generate from the trained model

In [31]:
# Reload the saved model and tokenizer
model = GPT2LMHeadModel.from_pretrained("char_gpt2_shakespeare")
char_tokenizer = GPT2TokenizerFast.from_pretrained("char_gpt2_shakespeare")

model.to(device)

# Reuse the end-of-text token as the padding token
model.config.pad_token_id = model.config.eos_token_id

def generate_text(prompt, max_new_tokens=50):
    """Sample a continuation of `prompt` from the trained model."""
    input_ids = char_tokenizer.encode(prompt, return_tensors="pt").to(device)
    attention_mask = torch.ones_like(input_ids, device=device)

    output = model.generate(
        input_ids,
        attention_mask=attention_mask,
        max_new_tokens=max_new_tokens,  # number of tokens generated after the prompt
        do_sample=True,
        temperature=0.7,
        top_k=50,
        top_p=0.9
    )

    # Keep special tokens when decoding: the character vocabulary was added with
    # special_tokens=True, so skipping special tokens would drop those characters too
    raw_decoded = char_tokenizer.decode(output[0], skip_special_tokens=False)
    filtered_decoded = raw_decoded.replace("<|endoftext|>", "").strip()

    return filtered_decoded

prompt = "Whomst"
generated_text = generate_text(prompt)
print("\nGenerated Text:", generated_text)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



Generated Text: Whomst i pray you, and shall be company, for the weight,
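
The `temperature`, `top_k`, and `top_p` arguments control how the next token is sampled. The plain-Python sketch below applies the three steps to a toy list of logits: scale by temperature, keep the `top_k` highest-scoring tokens, then keep the smallest group of those whose cumulative probability reaches `top_p`. It is a simplified illustration of the idea, not the `transformers` implementation, and the function names are made up for this example.

```python
import math
import random

def filtered_distribution(logits, temperature=0.7, top_k=50, top_p=0.9):
    """Return {token_index: probability} after temperature, top-k, and top-p filtering."""
    scaled = [logit / temperature for logit in logits]

    # Top-k: keep the indices of the k largest scaled logits, best first.
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]

    # Softmax over the surviving tokens (subtract the max for numerical stability).
    peak = max(scaled[i] for i in ranked)
    weights = [math.exp(scaled[i] - peak) for i in ranked]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Top-p: keep the shortest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for idx, p in zip(ranked, probs):
        kept.append((idx, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize so the kept probabilities sum to 1.
    norm = sum(p for _, p in kept)
    return {idx: p / norm for idx, p in kept}

def sample_token(logits, rng=random, **kwargs):
    """Draw one token index from the filtered distribution."""
    dist = filtered_distribution(logits, **kwargs)
    indices = list(dist)
    return rng.choices(indices, weights=[dist[i] for i in indices], k=1)[0]

toy_logits = [2.0, 1.0, 0.0, -1.0]
print(filtered_distribution(toy_logits))
print(sample_token(toy_logits))
```

Lower temperatures and smaller `top_k` or `top_p` values concentrate probability on the most likely tokens, which tends to make generations more repetitive but less erratic.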


# Calculate perplexity for test documents

In this section, load the test documents from Homework 3.
Calculate perplexity for both models.
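
As a reminder, perplexity is the exponential of the average negative log-likelihood per token: $\mathrm{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log p(x_i \mid x_{<i})\right)$. The short sketch below computes it from a list of per-token probabilities; the function name and the example probabilities are made up for illustration.

```python
import math

def perplexity_from_probs(token_probs):
    """Perplexity of a sequence, given the probability the model assigned to each token."""
    negative_log_likelihood = -sum(math.log(p) for p in token_probs)
    return math.exp(negative_log_likelihood / len(token_probs))

# A model that always spreads probability evenly over 4 options
uniform = perplexity_from_probs([0.25] * 8)
# A model that is usually confident about the correct token
confident = perplexity_from_probs([0.9, 0.8, 0.95, 0.7])

print(uniform)
print(confident)
```

A lower perplexity means the model assigned higher probability to the text it was evaluated on.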

In [7]:
# NOTE : I simply could not run the perplexity calculations on my GPU.
# These lines come from the recommendations made by the error message from CUDA, Stack Overflow, and finally ChatGPT 4o.
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
torch.cuda.empty_cache()
torch.cuda.reset_peak_memory_stats()
del model
device = "cpu"
model = GPT2LMHeadModel.from_pretrained("char_gpt2_shakespeare")
model = model.float()
model = model.to(device)
print("✅ Model successfully reloaded and moved to CPU.")

✅ Model successfully reloaded and moved to CPU.


In [8]:
def calculate_perplexity(model, tokenizer, text, device="cpu", max_length=1024, stride=512):
    """Approximate the perplexity of `text` by averaging the loss over overlapping windows."""
    model = model.to(device)  # keep the model and inputs on the same device to avoid "CUDA assert" errors
    model.eval()
    total_nll = 0.0  # accumulated negative log-likelihood
    total_tokens = 0

    with torch.no_grad():
        input_ids = tokenizer.encode(text, return_tensors="pt").to(device)

        # Slide a window of up to max_length tokens forward by stride tokens.
        # Because stride < max_length, neighboring windows overlap and the tokens
        # in the overlap contribute to the total more than once.
        for i in range(0, input_ids.shape[1], stride):
            chunk = input_ids[:, i : i + max_length]
            if chunk.shape[1] < 2:  # need at least two tokens to predict one
                continue

            outputs = model(chunk, labels=chunk)
            loss = outputs.loss.item()  # mean loss over the chunk's predicted tokens

            # Weight each chunk's mean loss by its length
            total_nll += loss * chunk.shape[1]
            total_tokens += chunk.shape[1]

    perplexity = exp(total_nll / total_tokens) if total_tokens > 0 else float("inf")
    return perplexity

# Load test documents properly
with open("nytimes_article.txt", encoding="utf-8", errors="replace") as f:
    nyt_text = f.read()

with open("shakespeare_sonnets.txt", encoding="utf-8", errors="replace") as f:
    shakespeare_text = f.read()

device = "cpu"
nyt_perplexity = calculate_perplexity(model, char_tokenizer, nyt_text, device=device)
shakespeare_perplexity = calculate_perplexity(model, char_tokenizer, shakespeare_text, device=device)

print(f"Perplexity on NYT Article: {nyt_perplexity:.2f}")
print(f"Perplexity on Shakespeare Sonnets: {shakespeare_perplexity:.2f}")

Token indices sequence length is longer than the specified maximum sequence length for this model (4651 > 1024). Running this sequence through the model will result in indexing errors


Perplexity on NYT Article: 56.23
Perplexity on Shakespeare Sonnets: 39.90
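
In `calculate_perplexity`, windows of 1024 tokens advance by 512 tokens, so neighboring windows overlap. A common refinement, modeled on the strided evaluation in the Hugging Face perplexity guide, uses the overlap only as context and scores just the tokens that no earlier window has covered. The sketch below does only the bookkeeping for that scheme in plain Python (no model involved); `window_spans` is a hypothetical helper name.

```python
def window_spans(seq_len, max_length, stride):
    """Return (begin, end, num_scored) for each window over a sequence of seq_len tokens.

    Each window covers tokens [begin, end). Only the last `num_scored` tokens of a
    window are scored; the earlier ones serve as context.
    """
    spans = []
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        num_scored = end - prev_end  # tokens not covered by any earlier window
        spans.append((begin, end, num_scored))
        prev_end = end
        if end == seq_len:
            break
    return spans

# 4651 is the token count reported in the warning for one of the test documents
spans = window_spans(seq_len=4651, max_length=1024, stride=512)
for begin, end, num_scored in spans:
    print(begin, end, num_scored)
print("tokens scored in total:", sum(n for _, _, n in spans))
```

In a real evaluation, the label positions outside the last `num_scored` tokens of each window would be set to -100 so that every token is counted once.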
