Fine-tuning GPT-2 on OpenWebText (with Google Colab)

## Install Required Dataset Package  
We install the `datasets` library from HuggingFace to access the OpenWebText dataset.


In [1]:
!pip install datasets

Collecting datasets
  Downloading datasets-3.5.1-py3-none-any.whl.metadata (19 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2025.3.0,>=2023.1.0 (from fsspec[http]<=2025.3.0,>=2023.1.0->datasets)
  Downloading fsspec-2025.3.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.5.1-py3-none-any.whl (491 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading fsspec-2025.3.0-py3-none-any.whl (1

## Import Libraries  
We import the necessary modules for model loading, tokenization, training, and data handling.


In [2]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel, Trainer, TrainingArguments
from datasets import load_dataset
import torch

Load GPT-2 tokenizer and model

## Load Pretrained GPT-2 Model and Tokenizer  
We load the base GPT-2 model and tokenizer using HuggingFace's `from_pretrained` method.


In [3]:
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

## Load and Subsample OpenWebText Dataset  
We load 1% of the OpenWebText training split and then select only 200 samples to reduce training time.


Load an OpenWebText subset and use only the first 200 samples (because of the limited resources available)

In [4]:
dataset = load_dataset("openwebtext", split='train[:1%]', trust_remote_code=True)


dataset = dataset.select(range(200))


README.md:   0%|          | 0.00/7.35k [00:00<?, ?B/s]

openwebtext.py:   0%|          | 0.00/2.73k [00:00<?, ?B/s]

Downloading data:   0%|          | 0/21 [00:00<?, ?files/s]

urlsf_subset00.tar:   0%|          | 0.00/633M [00:00<?, ?B/s]

urlsf_subset01.tar:   0%|          | 0.00/629M [00:00<?, ?B/s]

urlsf_subset02.tar:   0%|          | 0.00/629M [00:00<?, ?B/s]

urlsf_subset03.tar:   0%|          | 0.00/628M [00:00<?, ?B/s]

urlsf_subset04.tar:   0%|          | 0.00/627M [00:00<?, ?B/s]

urlsf_subset05.tar:   0%|          | 0.00/630M [00:00<?, ?B/s]

urlsf_subset06.tar:   0%|          | 0.00/626M [00:00<?, ?B/s]

urlsf_subset07.tar:   0%|          | 0.00/625M [00:00<?, ?B/s]

urlsf_subset08.tar:   0%|          | 0.00/625M [00:00<?, ?B/s]

urlsf_subset09.tar:   0%|          | 0.00/626M [00:00<?, ?B/s]

urlsf_subset10.tar:   0%|          | 0.00/625M [00:00<?, ?B/s]

urlsf_subset11.tar:   0%|          | 0.00/625M [00:00<?, ?B/s]

urlsf_subset12.tar:   0%|          | 0.00/624M [00:00<?, ?B/s]

urlsf_subset13.tar:   0%|          | 0.00/629M [00:00<?, ?B/s]

urlsf_subset14.tar:   0%|          | 0.00/627M [00:00<?, ?B/s]

urlsf_subset15.tar:   0%|          | 0.00/621M [00:00<?, ?B/s]

urlsf_subset16.tar:   0%|          | 0.00/619M [00:00<?, ?B/s]

urlsf_subset17.tar:   0%|          | 0.00/619M [00:00<?, ?B/s]

urlsf_subset18.tar:   0%|          | 0.00/618M [00:00<?, ?B/s]

urlsf_subset19.tar:   0%|          | 0.00/619M [00:00<?, ?B/s]

urlsf_subset20.tar:   0%|          | 0.00/377M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/8013769 [00:00<?, ? examples/s]

Tokenizing the dataset and fixing the padding by adding a pad token

## Tokenize the Text Data  
GPT-2 has no padding token by default, so we reuse the end-of-sequence token as the pad token, then tokenize the dataset with truncation and padding to a fixed length of 512 tokens.


In [5]:
tokenizer.pad_token = tokenizer.eos_token
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=512)

tokenized_dataset = dataset.map(tokenize_function, batched=True, remove_columns=["text"])


Map:   0%|          | 0/200 [00:00<?, ? examples/s]
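
To make the effect of `truncation=True, padding="max_length"` concrete, here is a toy sketch in plain Python. It is not the tokenizer's implementation: the helper `pad_or_truncate` is hypothetical, the token ids in the example are arbitrary, and right-side padding is assumed. The pad id 50256 is GPT-2's end-of-sequence id, which this notebook reuses for padding.

```python
EOS_TOKEN_ID = 50256  # GPT-2 end-of-sequence id, reused here as the pad id


def pad_or_truncate(token_ids, max_length, pad_id=EOS_TOKEN_ID):
    """Toy stand-in for truncation plus padding to a fixed length (right-padding assumed)."""
    kept = token_ids[:max_length]          # truncation: drop tokens past max_length
    n_pad = max_length - len(kept)         # padding: fill the remainder with pad_id
    return {
        "input_ids": kept + [pad_id] * n_pad,
        # 1 marks a real token, 0 marks padding
        "attention_mask": [1] * len(kept) + [0] * n_pad,
    }


short = pad_or_truncate([464, 3290, 318], max_length=5)   # arbitrary example ids
long = pad_or_truncate(list(range(8)), max_length=5)
print(short)
print(long)
```

With `max_length=512`, every one of the 200 examples ends up exactly 512 tokens long, and the attention mask is the only record of which positions are real text.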

Adding a labels field so the model can compute the loss

## Add Labels for Causal Language Modeling  
We set the labels equal to the `input_ids`. For causal language modeling the model shifts the labels internally, so each position is trained to predict the next token in the sequence.


In [6]:
def add_labels(example):
    example["labels"] = example["input_ids"]
    return example

tokenized_dataset = tokenized_dataset.map(add_labels)

Map:   0%|          | 0/200 [00:00<?, ? examples/s]
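
Because every example is padded to 512 tokens, copying `input_ids` into `labels` also asks the model to predict the padding tokens. A common alternative, sketched below, sets the label to -100 wherever the attention mask is 0, since -100 is the index that PyTorch's cross-entropy loss ignores by default. This is not what the notebook ran: the function name `add_masked_labels` is hypothetical, and the loss and perplexity reported later come from the unmasked labels above.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss


def add_masked_labels(example):
    """Copy input_ids into labels, but ignore padded positions in the loss."""
    example["labels"] = [
        token_id if mask == 1 else IGNORE_INDEX
        for token_id, mask in zip(example["input_ids"], example["attention_mask"])
    ]
    return example


# Tiny hand-made example: two real tokens followed by two pad tokens
sample = {"input_ids": [10, 11, 50256, 50256], "attention_mask": [1, 1, 0, 0]}
masked = add_masked_labels(dict(sample))
print(masked["labels"])
```

The attention mask is used rather than a comparison against the pad id because the pad token here is the end-of-sequence token, so a genuine end-of-sequence token would otherwise be masked as well.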

The training parameters

## Define Training Arguments  
We specify training configurations such as batch size, number of epochs, learning rate, logging, and output directories.


In [7]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./gpt2-openwebtext-finetuned",
    overwrite_output_dir=True,
    num_train_epochs=2,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    save_steps=500,
    save_total_limit=2,
    logging_dir="./logs",
    logging_steps=100,
    learning_rate=5e-5,
    warmup_steps=500,
    weight_decay=0.01,
    report_to="none"
)
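
As a rough check on these settings, the arithmetic below estimates how many optimizer steps the run takes and how far the learning-rate warmup gets. It assumes a single device and a linear warmup in which the learning rate grows in proportion to the step count; the variable names are only for illustration.

```python
import math

# Values taken from the cells in this notebook
num_samples = 200
per_device_train_batch_size = 4
gradient_accumulation_steps = 2
num_train_epochs = 2
learning_rate = 5e-5
warmup_steps = 500
logging_steps = 100
num_devices = 1  # assumption: a single CPU or GPU

# Batches per epoch, then optimizer updates per epoch after gradient accumulation
batches_per_epoch = math.ceil(num_samples / (per_device_train_batch_size * num_devices))
updates_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)
total_steps = updates_per_epoch * num_train_epochs

# Under linear warmup, the rate reached is proportional to the steps completed
approx_peak_lr = learning_rate * min(total_steps, warmup_steps) / warmup_steps

print(total_steps)                    # 50
print(approx_peak_lr)                 # roughly 5e-06
print(total_steps >= logging_steps)   # False
```

Under these assumptions the run has 50 optimizer steps in total. With `warmup_steps=500` the learning rate would only reach about a tenth of the configured `5e-5`, and with `logging_steps=100` and `save_steps=500` no intermediate loss is logged and no checkpoint is written during training.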

## Initialize Trainer  
We use Hugging Face's `Trainer` class to manage training and evaluation using the model, dataset, and arguments defined above. Note that the same 200-sample dataset is passed as both `train_dataset` and `eval_dataset`.


In [8]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    eval_dataset=tokenized_dataset,
)

Fine-tuning the model

## Start Fine-Tuning  
We begin training the GPT-2 model on the small OpenWebText sample using the Trainer.


In [9]:

trainer.train()

`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


Step,Training Loss


TrainOutput(global_step=50, training_loss=3.6818557739257813, metrics={'train_runtime': 3613.7644, 'train_samples_per_second': 0.111, 'train_steps_per_second': 0.014, 'total_flos': 104516812800000.0, 'train_loss': 3.6818557739257813, 'epoch': 2.0})

Saving the final model and the tokenizer

## 💾 Save the Fine-Tuned Model  
After training, we save the model and tokenizer locally for later inference.


In [10]:
model.save_pretrained("./gpt2-openwebtext-finetuned")
tokenizer.save_pretrained("./gpt2-openwebtext-finetuned")

('./gpt2-openwebtext-finetuned/tokenizer_config.json',
 './gpt2-openwebtext-finetuned/special_tokens_map.json',
 './gpt2-openwebtext-finetuned/vocab.json',
 './gpt2-openwebtext-finetuned/merges.txt',
 './gpt2-openwebtext-finetuned/added_tokens.json')

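Finding the perplexity of the tuned model (the base model's perplexity usually falls within the range of 32 to 37)

## Evaluate the Model Perplexity  
We compute the perplexity of the fine-tuned model on the evaluation set. Because that set is the same data the model was trained on, the result reflects fit to the training data rather than performance on unseen text.  
Perplexity is a common metric for language models; lower is better.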


In [13]:
import math

eval_results = trainer.evaluate()
perplexity = math.exp(eval_results["eval_loss"])
print(f"Perplexity: {perplexity:.2f}")

Perplexity: 18.55
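
The relationship used in the cell above is that perplexity is the exponential of the mean per-token negative log-likelihood (in nats). The sketch below illustrates it with made-up values and then inverts it for the reported figure. The helper name `perplexity` is hypothetical, and it is only an approximation of how `eval_loss` is aggregated across batches.

```python
import math


def perplexity(token_nlls):
    """Exponential of the mean per-token negative log-likelihood (natural log)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


# A model that assigns probability 1/4 to every correct token has perplexity 4
uniform = perplexity([math.log(4)] * 10)
print(round(uniform, 6))

# Inverting the formula gives the evaluation loss implied by a perplexity of 18.55
implied_loss = math.log(18.55)
print(round(implied_loss, 2))
```

Read this way, a perplexity of 18.55 means the model is on average about as uncertain as if it were choosing uniformly among roughly 18 to 19 tokens at each position.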


Now testing the fine-tuned model against the base model

## Load Fine-Tuned Model for Text Generation  
We load the saved fine-tuned model using the `pipeline` utility for quick inference.


In [20]:
from transformers import pipeline

# fine-tuned model
generator_finetuned = pipeline(
    'text-generation',
    model="./gpt2-openwebtext-finetuned",
    tokenizer="./gpt2-openwebtext-finetuned"
)

Device set to use cpu


## 📝 Generate Text using Fine-Tuned Model  
We provide prompts and generate text to evaluate the model's ability to produce coherent, topic-relevant content.


In [22]:
prompts = [
    "In the future what will artificial intelligence do?",
    "The benefits of healthy eating are?",
    "the most important skill for success is?"
]

for prompt in prompts:
    print(f"\n--- Prompt: {prompt} ---")

    print("\n[Fine-Tuned GPT-2 Output]")
    ft_output = generator_finetuned(prompt, max_length=100, num_return_sequences=1)
    print(ft_output[0]["generated_text"])


--- Prompt: In the future what will artificial intelligence do? ---

[Fine-Tuned GPT-2 Output]
In the future what will artificial intelligence do?

It does seem to be making predictions that will likely make the end result of quantum computing more complex than most people realize. At first glance this appears to be a good idea. If you are a scientist who wants to find the solution to an equation of the form X + Y, say, then the average quantum computer will have an equivalent in the neighborhood of what a computer can. If you wanted to check the accuracy or safety of a modern quantum

--- Prompt: The benefits of healthy eating are? ---

[Fine-Tuned GPT-2 Output]
The benefits of healthy eating are? They get you out of trouble and make you healthier. They have a tremendous safety net, helping keep you on the road for years."

More from Wonkblog:

Families who have diabetes become worried about future health insurance plans

--- Prompt: the most important skill for success is? ---

[Fin

## Load Base GPT-2 Model for Comparison  
We load the original GPT-2 model to compare its output with the fine-tuned version.


In [23]:
#base model
generator_base = pipeline(
    'text-generation',
    model="gpt2",
    tokenizer="gpt2"
)


Device set to use cpu


## ⚖️ Generate Text using Base Model for Comparison  
We generate responses from the base GPT-2 model using the same prompts for side-by-side evaluation.


In [24]:
prompts = [
    "In the future what will artificial intelligence do?",
    "The benefits of healthy eating are?",
    "the most important skill for success is?"
]

for prompt in prompts:
    print(f"\n--- Prompt: {prompt} ---")

    print("\n[Base GPT-2 Output]")
    base_output = generator_base(prompt, max_length=100, num_return_sequences=1)
    print(base_output[0]["generated_text"])

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



--- Prompt: In the future what will artificial intelligence do? ---

[Base GPT-2 Output]


In the future what will artificial intelligence do? Will we ever know?

The answer is this: AI doesn't know much. We know that when we need one, they will help us with all of it. We know that we should avoid any kind of uncertainty regarding how AI's methods will make sense for those involved, and we know that artificial intelligence will come in handy in certain scenarios.

What is especially important, even though the future lies in a future where humans aren't expected

--- Prompt: The benefits of healthy eating are? ---

[Base GPT-2 Output]


The benefits of healthy eating are?

While it's great for the body to tolerate toxins on foods such as raw milk, fruits and vegetables and can be used for exercise, it can also be detrimental to the health of a person or group of people. In fact, as more people adopt healthy eating and exercise habits, some experts suggest that weight loss may become a more popular strategy for people pursuing their wellness.

For example, some studies have found that weight control programs can prevent obesity,

--- Prompt: the most important skill for success is? ---

[Base GPT-2 Output]
the most important skill for success is? Can she find her feet quickly or do she take it by force?

Rice: No she is able to gain new ideas in this sport. She may have learned to play this sport from her parents.

Lloyd: She was really talented in that she was able to adapt to a variety of environments in those circumstances, which was a nice compliment to her abilities.

Lloyd: The biggest advantage that Rouse has ha