# Fine-tuning GPT2 on Anime synopsis data

This notebook shows how to fine-tune GPT2 on a domain-specific text corpus, in this case a collection of anime synopses.

We will use HuggingFace's `datasets` and `transformers` libraries to load the data and fine-tune the model.

> Note: Run this notebook on Google Colab with a GPU runtime enabled.

In [None]:
!pip install datasets
!pip install transformers



`GPT2LMHeadModel` is the model class we will fine-tune: GPT2 with a language-modeling head on top.
`GPT2Tokenizer` is the corresponding tokenizer, which splits the text into subword tokens and maps them to integer IDs. The embeddings themselves are looked up inside the model.

HuggingFace also provides the `Trainer` API, so we don't have to write a custom training loop.

In [None]:
import pandas as pd
import numpy as np
from transformers import GPT2LMHeadModel, GPT2Tokenizer
from transformers import Trainer, TrainingArguments
from transformers import DataCollatorForLanguageModeling
from datasets import Dataset

dataset_path = "/content/anime_gpt_data.csv"
output_dir = "/content/models/"

In [None]:
anime_df = pd.read_csv(dataset_path)
anime_df

Unnamed: 0,synopsis
0,<|startoftext|>Following their participation a...
1,<|startoftext|>Music accompanies the path of t...
2,<|startoftext|>The Abyss—a gaping chasm stretc...
3,"<|startoftext|>""In order for something to be o..."
4,<|startoftext|>After helping revive the legend...
...,...
15189,<|startoftext|>All-new animation offered throu...
15190,<|startoftext|>High school student Sora Kashiw...
15191,<|startoftext|>After regaining her squid-like ...
15192,"<|startoftext|>For years, the Niflheim Empire ..."


Every synopsis starts with the `"<|startoftext|>"` token and ends with the `"<|endoftext|>"` token. The original GPT2 model uses only the latter, which marks both the beginning and the end of a text and separates documents in the training data.

Here, we will use a separate "beginning of text" token as well.
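The CSV loaded above already contains the wrapped synopses, so the notebook does not show how they were produced. As a minimal sketch, assuming the raw synopses are plain strings, the wrapping could look like this (`wrap_synopsis` is a hypothetical helper, not part of the original notebook):

```python
BOS = "<|startoftext|>"
EOS = "<|endoftext|>"


def wrap_synopsis(text: str) -> str:
    """Surround a raw synopsis with the start and end markers (hypothetical helper)."""
    return f"{BOS}{text.strip()}{EOS}"


# Example with a made-up synopsis string
wrapped = wrap_synopsis("  A short example synopsis. ")
print(wrapped)
```

Because the markers are part of the text itself, the tokenizer only needs to know that they are special tokens. It does not have to add them during tokenization.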

In [None]:
anime_dataset = Dataset.from_pandas(anime_df)

In [None]:
# load the tokenizer with BOS, EOS, and pad token

tokenizer = GPT2Tokenizer.from_pretrained(
    "gpt2",
    bos_token = "<|startoftext|>",
    eos_token = "<|endoftext|>",
    pad_token = "<|pad|>"
)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
def preprocess_dataset(data):
  return tokenizer(data['synopsis'], truncation=True)

In [None]:
anime_dataset_preprocessed = anime_dataset.map(preprocess_dataset, batched=False)

In [None]:
anime_dataset_preprocessed

Dataset({
    features: ['synopsis', 'input_ids', 'attention_mask'],
    num_rows: 15194
})

In [None]:
# GPT2 is a causal (not masked) language model, so masked-LM collation is disabled
data_collator = DataCollatorForLanguageModeling(
    tokenizer = tokenizer,
    mlm = False
)
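With `mlm=False`, the collator pads each batch to its longest sequence and copies the input IDs into `labels`, replacing padded positions with `-100` so the loss ignores them. The shift by one token happens inside the model. The following pure-Python sketch illustrates that behaviour under simplifying assumptions: right padding, plain lists instead of tensors, and an assumed pad ID of 50258. `collate_causal_lm` is a hypothetical name, not a library function.

```python
PAD_ID = 50258        # assumed ID of "<|pad|>" after it is added to the vocabulary
IGNORE_INDEX = -100   # label value that the cross-entropy loss skips


def collate_causal_lm(examples, pad_id=PAD_ID):
    """Pad token-ID lists to equal length and build labels for causal LM training."""
    max_len = max(len(ids) for ids in examples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for ids in examples:
        n_pad = max_len - len(ids)
        padded = ids + [pad_id] * n_pad
        batch["input_ids"].append(padded)
        # 1 marks real tokens, 0 marks padding
        batch["attention_mask"].append([1] * len(ids) + [0] * n_pad)
        # Labels mirror the inputs, except that padding is ignored by the loss
        batch["labels"].append(
            [tok if tok != pad_id else IGNORE_INDEX for tok in padded]
        )
    return batch


batch = collate_causal_lm([[10, 11, 12], [20]])
print(batch["labels"])
```

Using a dedicated pad token matters here: if padding reused the `<|endoftext|>` ID, the end-of-text positions would be masked out of the loss as well.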

In [None]:
import torch

In [None]:
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer))

Embedding(50259, 768)
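The size of 50259 follows from GPT2's base vocabulary of 50257 tokens, which already includes `<|endoftext|>`, plus the two tokens that are new here. A small arithmetic sketch makes this explicit (the variable names are illustrative only):

```python
BASE_VOCAB_SIZE = 50257                  # size of the pretrained GPT2 vocabulary
existing_special = {"<|endoftext|>"}     # already present in the base vocabulary

requested = ["<|startoftext|>", "<|endoftext|>", "<|pad|>"]
# Only tokens that are not yet in the vocabulary get new embedding rows
new_tokens = [tok for tok in requested if tok not in existing_special]

new_vocab_size = BASE_VOCAB_SIZE + len(new_tokens)
print(new_tokens, new_vocab_size)
```

The rows added for the new tokens are freshly initialized, which is why the tokenizer warning above asks for their embeddings to be trained.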

In [None]:
tokenizer.save_pretrained(output_dir)
model.save_pretrained(output_dir)

In [None]:
device = torch.device("cuda")
device

device(type='cuda')

In [None]:
model.to(device)

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50259, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50259, bias=False)
)

In [None]:
def finetune(dataset, model, tokenizer, data_collator,
             output_dir, overwrite_output_dir, num_train_epochs,
             save_steps, per_device_train_batch_size):
  training_args = TrainingArguments(
      output_dir=output_dir,
      overwrite_output_dir=overwrite_output_dir,
      per_device_train_batch_size=per_device_train_batch_size,
      num_train_epochs=num_train_epochs,
      save_steps=save_steps,  # previously accepted by finetune() but never used
  )

  trainer = Trainer(
      model=model,
      args=training_args,
      data_collator=data_collator,
      train_dataset=dataset,
  )

  trainer.train()
  trainer.save_model()

In [None]:
overwrite_output_dir = False
per_device_train_batch_size = 8
num_train_epochs = 5.0
save_steps = 500
output_model_dir = "/content/animegptsan"

In [None]:
finetune(
    anime_dataset_preprocessed,
    model,
    tokenizer,
    data_collator,
    output_model_dir,
    overwrite_output_dir,
    num_train_epochs,
    save_steps,
    per_device_train_batch_size
)



Step,Training Loss
500,4.3799
1000,3.6374
1500,3.5812
2000,3.5169
2500,3.3635
3000,3.3593
3500,3.3531
4000,3.2821
4500,3.2112
5000,3.2148
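To put these numbers in context, the sketch below estimates the number of optimizer steps for the settings used above and converts a logged loss into perplexity. It assumes a single GPU, no gradient accumulation, and that the last partial batch of each epoch is kept. The loss is treated as mean per-token cross-entropy in nats.

```python
import math

num_examples = 15194   # rows in the preprocessed dataset
batch_size = 8         # per_device_train_batch_size
num_epochs = 5         # num_train_epochs

# One optimizer step per batch under the assumptions stated above
steps_per_epoch = math.ceil(num_examples / batch_size)   # 1900
total_steps = steps_per_epoch * num_epochs               # 9500


def perplexity(loss: float) -> float:
    """Perplexity is the exponential of the mean cross-entropy loss."""
    return math.exp(loss)


print(steps_per_epoch, total_steps)
print(round(perplexity(4.3799), 1), round(perplexity(3.2148), 1))
```

Since these are training losses, the resulting perplexities describe the fit to the training data only. Estimating generalization would need a held-out set.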


# Inference

Load the fine-tuned model from the output directory and generate text with it.

In [None]:
def load_model(model_path):
  my_model = GPT2LMHeadModel.from_pretrained(model_path)
  return my_model

def load_tokenizer(tok_path):
  my_tokenizer = GPT2Tokenizer.from_pretrained(tok_path)
  return my_tokenizer

In [None]:
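# The fine-tuned weights were written by the Trainer to output_model_dir;
# the tokenizer was saved separately to output_dir before training.
anime_model = load_model("/content/animegptsan/")
anime_tokenizer = load_tokenizer("/content/models/")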

In [None]:
def generate_text(mod, tok, sequence, max_length, opts: dict):
  # Encode the prompt into a tensor of token IDs
  ids = tok.encode(sequence, return_tensors="pt")
  final_outputs = mod.generate(
      ids,
      do_sample=opts["do_sample"],
      max_length=max_length,
      top_k=opts["top_k"],
      top_p=opts["top_p"],
      temperature=opts["temperature"],
      num_return_sequences=opts["num_return_sequences"]
  )

  # Decode every generated sequence back to text, dropping special tokens
  return [tok.decode(out, skip_special_tokens=True) for out in final_outputs]

Adjust the sampling options in the `opts` dictionary to control how the text is generated.
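To give an intuition for `temperature`, `top_k`, and `top_p`, here is a simplified pure-Python sketch of how these options reshape the next-token distribution before sampling. It is an illustration under stated assumptions, not the `transformers` implementation: `filter_probs` is a hypothetical helper, and the logits are made up.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def filter_probs(logits, top_k, top_p, temperature):
    """Apply temperature, then top-k, then top-p; return {token_index: probability}."""
    # Temperature rescales the logits: values below 1 sharpen, above 1 flatten
    probs = softmax([x / temperature for x in logits])

    # Top-k keeps only the k most likely tokens and renormalizes
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = order[:top_k]
    mass = sum(probs[i] for i in kept)
    kept_probs = [(i, probs[i] / mass) for i in kept]

    # Top-p keeps the smallest prefix whose cumulative probability reaches top_p
    nucleus, cumulative = [], 0.0
    for i, p in kept_probs:
        nucleus.append((i, p))
        cumulative += p
        if cumulative >= top_p:
            break

    mass = sum(p for _, p in nucleus)
    return {i: p / mass for i, p in nucleus}


toy_logits = [2.0, 1.0, 0.0, -1.0]   # made-up scores for a 4-token vocabulary
result = filter_probs(toy_logits, top_k=3, top_p=0.9, temperature=1.0)
print(result)
```

Lower `top_k` or `top_p` values restrict sampling to fewer candidates and tend to give safer, more repetitive text. Higher values and higher temperatures give more varied output.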

In [None]:
seq = "Shadow realm"
opts = {
    "do_sample": True,
    "top_k": 40,
    "top_p": 0.9,
    "temperature": 1.0,
    "num_return_sequences": 10
}
generated = generate_text(anime_model, anime_tokenizer, seq, 200, opts)

for v in generated:
  print(v)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Shadow realm, the land of chaos, and the realm of light. The gods of the universe have decided that the world is the center of the world's struggle for dominance and control. To the human eye, the world seems chaotic, and humans have been living in a peaceful life since ancient times. But the gods have also decided to move forward and bring the world back to its former peacefulness. This is the story of Aoyagi and the others who join forces with the gods in the search of peace. (Source: AniDB)
Shadow realm of darkness. There exist the Demon Lords. Their leader is the Demon King, Demon Lord Grendizer. He is the master of many demon masters, but he has a secret: he can use magic for his own evil purposes. Demon Lord Ruri, along with her allies, must protect Ruri from her Demon Lord's dark desires by fighting her master and take her down to the demon realm.
Shadow realm, where the inhabitants are called "Toys." They have great power and special abilities, but these toys are not always abl