Fine-Tuning GPT-2 for Creative Story Generation

In [1]:
!pip install transformers datasets torch accelerate




In [2]:
import torch
from transformers import (
    GPT2Tokenizer,
    GPT2LMHeadModel,
    Trainer,
    TrainingArguments,
    DataCollatorForLanguageModeling
)
from datasets import Dataset




In [3]:
stories = [
    "Once upon a time, a lonely robot learned how to feel emotions.",
    "In a distant galaxy, humans discovered a planet full of intelligent machines.",
    "The young engineer built an AI that could dream of electric sheep.",
    "A forgotten algorithm suddenly became self-aware one night.",
    "The city was silent after artificial intelligence took control."
]

dataset = Dataset.from_dict({"text": stories})


In [4]:
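model_name = "gpt2"

tokenizer = GPT2Tokenizer.from_pretrained(model_name)
# GPT-2 ships without a pad token, so reuse the EOS token for padding.
tokenizer.pad_token = tokenizer.eos_token

model = GPT2LMHeadModel.from_pretrained(model_name)
# No tokens were added to the vocabulary, so this call leaves the
# embedding matrix at its original size.
model.resize_token_embeddings(len(tokenizer))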



Embedding(50257, 768)

In [5]:
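def tokenize_function(example):
    # Pad or truncate every story to exactly 64 tokens so that
    # all examples in a batch have the same length.
    return tokenizer(
        example["text"],
        truncation=True,
        padding="max_length",
        max_length=64
    )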

tokenized_dataset = dataset.map(tokenize_function, batched=True)
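
What `truncation=True` with `padding="max_length"` does to a single example can be sketched in plain Python. This is a simplified illustration, not the tokenizer's actual implementation: `pad_or_truncate` is a hypothetical helper, the four token ids are made up, and right-side padding is assumed (the GPT-2 tokenizer's default).

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Return (input_ids, attention_mask), each exactly max_length long."""
    ids = list(token_ids)[:max_length]             # truncation=True
    n_pad = max_length - len(ids)
    attention_mask = [1] * len(ids) + [0] * n_pad  # 1 = real token, 0 = padding
    input_ids = ids + [pad_id] * n_pad             # padding="max_length", right side
    return input_ids, attention_mask


# Hypothetical token ids; 50256 is GPT-2's EOS id, reused as the pad id here.
ids, mask = pad_or_truncate([7454, 2402, 257, 640], max_length=8, pad_id=50256)
print(ids)   # [7454, 2402, 257, 640, 50256, 50256, 50256, 50256]
print(mask)  # [1, 1, 1, 1, 0, 0, 0, 0]
```

With `max_length=64` and one-sentence stories, most positions in each example are padding.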



In [6]:
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False
)
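
With `mlm=False` the collator prepares labels for causal language modeling: the labels are a copy of `input_ids`, with padding positions replaced by `-100` so the loss skips them. The model itself shifts the labels by one position when computing the loss. The sketch below is a simplified stand-in for that step; `make_causal_lm_labels` is a hypothetical name and the token ids are made up.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default


def make_causal_lm_labels(input_ids, pad_token_id):
    """Copy input_ids into labels, masking every padding position."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]


# Hypothetical padded example; 50256 is GPT-2's EOS id, used as the pad id.
labels = make_causal_lm_labels([7454, 2402, 257, 50256, 50256], pad_token_id=50256)
print(labels)  # [7454, 2402, 257, -100, -100]
```

Because the pad token is the EOS token in this notebook, any genuine EOS token in the text would be masked in the same way.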


In [7]:
training_args = TrainingArguments(
    output_dir="./gpt2-story",
    overwrite_output_dir=True,
    num_train_epochs=5,
    per_device_train_batch_size=2,
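    save_steps=500,       # larger than the total step count of this tiny run, so no intermediate checkpoint is written
    save_total_limit=1,
    logging_steps=10,     # log the training loss every 10 optimizer steps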
    learning_rate=5e-5,
    weight_decay=0.01,
    fp16=torch.cuda.is_available()
)
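
In Colab, `Trainer` may stop and ask for a Weights & Biases login when `wandb` is installed. One way to avoid the prompt is to pass `report_to="none"` to `TrainingArguments`. The dictionary below is a hypothetical helper, not part of the original notebook; it only collects the extra keyword argument.

```python
# Hypothetical helper dict: extra keyword arguments for TrainingArguments.
# report_to="none" turns off all logging integrations, including W&B.
reporting_kwargs = {"report_to": "none"}

# Usage sketch (assumes the TrainingArguments call from the cell above):
# training_args = TrainingArguments(output_dir="./gpt2-story", **reporting_kwargs)
print(reporting_kwargs)
```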


In [8]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator
)

trainer.train()






Step    Training Loss
10      3.5171


TrainOutput(global_step=15, training_loss=3.0891908009847007, metrics={'train_runtime': 520.4954, 'train_samples_per_second': 0.048, 'train_steps_per_second': 0.029, 'total_flos': 816537600000.0, 'train_loss': 3.0891908009847007, 'epoch': 5.0})
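
The `global_step=15` in the output follows from the dataset size and the training arguments. The arithmetic below is a small sketch that assumes a single device and no gradient accumulation; `total_optimizer_steps` is a hypothetical helper.

```python
import math


def total_optimizer_steps(num_examples, batch_size, num_epochs):
    """Optimizer steps for a run on one device without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)  # last batch may be smaller
    return steps_per_epoch * num_epochs


# 5 stories, per_device_train_batch_size=2, num_train_epochs=5
steps = total_optimizer_steps(num_examples=5, batch_size=2, num_epochs=5)

# With logging_steps=10, the loss is reported only at multiples of 10.
logged_steps = [s for s in range(1, steps + 1) if s % 10 == 0]

print(steps)         # 15
print(logged_steps)  # [10]
```

This is why the loss table has a single row at step 10.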

In [9]:
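def generate_story(prompt, max_length=100):
    """Sample a continuation of `prompt` from the fine-tuned model."""
    # Move the encoded prompt to the same device as the model.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_length=max_length,  # total length in tokens, prompt included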
        do_sample=True,
        temperature=0.8,
        top_k=50,
        top_p=0.95
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
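
The effect of `temperature`, `top_k`, and `top_p` on one sampling step can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the library's implementation: `sample_next_token` is a hypothetical function, the logits are made up, and boundary handling in the top-p cut may differ from what `generate` does.

```python
import math
import random


def sample_next_token(logits, temperature=0.8, top_k=50, top_p=0.95, rng=random):
    """Pick one token index from raw logits using temperature, top-k, then top-p."""
    # 1. Temperature: values below 1 sharpen the distribution.
    scaled = [x / temperature for x in logits]

    # 2. Top-k: keep only the k highest-scoring token indices.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    kept = order[:top_k]

    # Softmax over the kept tokens (subtract the max for numerical stability).
    peak = max(scaled[i] for i in kept)
    exps = [math.exp(scaled[i] - peak) for i in kept]
    total = sum(exps)
    probs = [e / total for e in exps]

    # 3. Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    nucleus_ids, nucleus_probs, cumulative = [], [], 0.0
    for idx, p in zip(kept, probs):
        nucleus_ids.append(idx)
        nucleus_probs.append(p)
        cumulative += p
        if cumulative >= top_p:
            break

    # 4. Sample from the surviving tokens; random.choices renormalizes the weights.
    return rng.choices(nucleus_ids, weights=nucleus_probs, k=1)[0]


# Made-up logits for a 4-token vocabulary: token 0 dominates, so the
# top-p cut keeps only that token.
print(sample_next_token([5.0, 1.0, 0.0, -1.0], top_k=2))  # 0
```

Lowering `top_p` or `top_k` narrows the candidate set and makes the output more predictable; raising `temperature` spreads probability toward less likely tokens.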


In [10]:
prompt = "Once upon a time in the future,"
print(generate_story(prompt))


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Once upon a time in the future, a robot had toiled in the future. One day, a young robot was on a mission to find an alien species that was different from the one he had longed for. As he traveled with the robots, his mission became a disaster. The machine became an orphaned, one whose only hope was an artificial intelligence. Humans were created here. They had to find one, an intelligent robot. Humans were machines. And with their lives on the line,


In [11]:
model.save_pretrained("./gpt2-story")
tokenizer.save_pretrained("./gpt2-story")


('./gpt2-story/tokenizer_config.json',
 './gpt2-story/special_tokens_map.json',
 './gpt2-story/vocab.json',
 './gpt2-story/merges.txt',
 './gpt2-story/added_tokens.json')