# Introduction to GPT2

[Open this notebook on Colab in Playground Mode](https://colab.research.google.com/drive/1Qzch1S3_7OaqxOwgoP52w0MkumuONe6T#scrollTo=PCJJLiSTdTW2&forceEdit=true&sandboxMode=true)

_**What is GPT2?**_

GPT2 (Generative Pre-trained Transformer 2) is a language model that was created by OpenAI in February 2019.

The largest GPT2 model (extra large) is a transformer-based language model with 1.5 billion parameters that was trained on a dataset of 8 million web pages. GPT2 also comes in smaller sizes: small (117M), medium (345M) and large (762M) parameters.

![GPT2 Image Sizes](https://jalammar.github.io/images/gpt2/gpt2-sizes.png)

<br />

_**What was it trained on?**_

GPT2 was trained to predict the next word on 40GB of "Internet text". This text was not taken from Reddit posts themselves: it was scraped from the web pages behind outbound links shared on Reddit that received at least 3 karma. Because the dataset was built from web links and contains only text, it was called WebText.
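
As a rough illustration of what "predict the next word" means as a training objective, the sketch below pairs every prefix of a sentence with the word that follows it. Splitting on whitespace is an assumption made for readability (GPT2 itself works on byte-pair-encoded tokens), and `next_token_pairs` is a hypothetical helper name, not part of any library.

```python
def next_token_pairs(tokens):
    """Pair each prefix of the sequence with the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


# Whitespace splitting stands in for a real tokenizer (assumption for this sketch).
tokens = "the cat sat on the mat".split()

for context, target in next_token_pairs(tokens):
    print(f"{' '.join(context):<20} -> {target}")
```

Each printed row is one prediction the model is asked to make: given the context on the left, assign a high probability to the word on the right.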

<br />

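_**What is the difference between GPT2 and BERT?**_

GPT2 is built from transformer decoder blocks, while BERT uses transformer encoder blocks. A decoder block lets each token attend only to the tokens before it, which suits generating text one token at a time (for example, writing essays). An encoder block lets each token attend to the whole input, which suits fill-in-the-blank tasks.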
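
The decoder/encoder distinction largely comes down to which positions a token may attend to. The sketch below builds both attention masks as plain nested lists; `attention_mask` is a hypothetical helper written for this illustration, not a function from `transformers`.

```python
def attention_mask(seq_len, causal):
    """Return a matrix where mask[i][j] == 1 if position i may attend to position j."""
    return [
        [1 if (not causal or j <= i) else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]


decoder_mask = attention_mask(4, causal=True)   # GPT2-style: only earlier positions
encoder_mask = attention_mask(4, causal=False)  # BERT-style: every position

print("Decoder (causal) mask:")
for row in decoder_mask:
    print(row)

print("Encoder (bidirectional) mask:")
for row in encoder_mask:
    print(row)
```

The lower-triangular decoder mask is what allows GPT2 to be trained on next-token prediction without seeing the answer in advance.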

<br />

_**References**_

* Learn about GPT2
    * [The Illustrated GPT-2 (Visualizing Transformer Language Models)](https://jalammar.github.io/illustrated-gpt2/)
    * [How to Build OpenAI's GPT-2: "The AI That Was Too Dangerous to Release"](https://blog.floydhub.com/gpt2/)
    * [Better Language Models and Their Implications](https://openai.com/blog/better-language-models/)

* Code
    * [Official PyTorch Example](https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_clm.py)
    * [Fine-tuning GPT2 for Text Generation Using Pytorch](https://towardsdatascience.com/fine-tuning-gpt2-for-text-generation-using-pytorch-2ee61a4f1ba7)
    * [Fine-Tuning GPT2 on Colab GPU… For Free!](https://towardsdatascience.com/fine-tuning-gpt2-on-colab-gpu-for-free-340468c92ed)

# Download Libraries

In [None]:
!pip install transformers
!pip install datasets



# Imports

In [None]:
import pandas as pd
import numpy as np
import re
import os
import logging
import torch
import transformers
import math

from datasets import load_dataset
from dataclasses import dataclass
from sklearn.model_selection import train_test_split

In [None]:
logger = logging.getLogger(__name__)

Configuration is as follows:

* Model Configuration -
    * model_name: Name of the language model.
    * dataset_name: The name of the dataset to use (via the datasets library).
    * dataset_config_name: The configuration name of the dataset to use (via the datasets library).
    * output_dir: The output directory where the model predictions and checkpoints will be written.
    * overwrite_output_dir: Overwrite the content of the output directory.

* Training Configuration -
    * per_device_train_batch_size: The batch size per GPU/TPU core/CPU for training.
    * save_steps: Number of update steps between two checkpoint saves.
    * num_train_epochs: Total number of training epochs to perform.

* Preprocessing Configuration -
    * fast_tokenizer: Whether or not to try to load the fast version of the tokenizer.
    * num_workers: The number of processes to use for the preprocessing.
    * overwrite_cache: Overwrite the cached training and evaluation sets.
    * max_train_samples: For debugging purposes or quicker training, truncate the number of training examples to this.
    * max_val_samples: For debugging purposes or quicker training, truncate the number of validation examples to this.
    * block_size: Optional input sequence length after tokenization.



In [None]:
from typing import Optional

@dataclass
class Config:

    # Model configuration
    model_name: str = 'gpt2'
    dataset_name: str = 'wikitext'
    dataset_config_name: str = 'wikitext-2-raw-v1'
    output_dir: str = '/tmp/test-clm'
    overwrite_output_dir: bool = False

    # Training configuration
    per_device_train_batch_size: int = 1
    save_steps: int = -1
    num_train_epochs: int = 1

    # Preprocessing configuration (None means "use the default")
    fast_tokenizer: bool = True
    num_workers: Optional[int] = None
    overwrite_cache: bool = False
    max_train_samples: Optional[int] = None
    max_val_samples: Optional[int] = None
    block_size: Optional[int] = None

In [None]:
config = Config(
    per_device_train_batch_size=2,
    num_train_epochs=2
)
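
For a quick debugging run, the sample limits and block size described above can be overridden without touching the defaults. The sketch below uses `dataclasses.replace` on a trimmed stand-in for the notebook's `Config`, so that it runs on its own; in the notebook itself the existing `Config` would be reused. The specific debug values are arbitrary assumptions.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass
class Config:
    # Trimmed stand-in for the notebook's Config, kept small so the sketch is self-contained.
    per_device_train_batch_size: int = 1
    num_train_epochs: int = 1
    max_train_samples: Optional[int] = None
    max_val_samples: Optional[int] = None
    block_size: Optional[int] = None


config = Config(per_device_train_batch_size=2, num_train_epochs=2)

# Hypothetical debug settings: one epoch, a handful of samples, short blocks.
debug_config = replace(
    config,
    num_train_epochs=1,
    max_train_samples=100,
    max_val_samples=20,
    block_size=128,
)

print(config)
print(debug_config)
```

`replace` returns a new instance, so the original `config` stays available for the full training run.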

# Fine-Tune GPT2

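Prepare the dataset with the HuggingFace `datasets` library: tokenize the text, then concatenate it and split it into blocks of `block_size` tokens.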

Snippets taken from the official HuggingFace example: https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_clm.py

In [None]:
def get_dataset(config):
    dataset = load_dataset(config.dataset_name, config.dataset_config_name)

    conf = transformers.AutoConfig.from_pretrained(config.model_name)
    tokenizer = transformers.AutoTokenizer.from_pretrained(
        config.model_name, use_fast=config.fast_tokenizer
    )

    # Tokenize text
    column_names = dataset["train"].column_names
    text_column_name = "text" if "text" in column_names else column_names[0]

    def tokenize_function(example):
        return tokenizer(example[text_column_name])

    tokenized_dataset = dataset.map(
        tokenize_function,
        batched=True,
        num_proc=config.num_workers,
        remove_columns=column_names,
        load_from_cache_file=not config.overwrite_cache
    )

    if config.block_size is None:
        block_size = tokenizer.model_max_length
        if block_size > 1024:
            logger.warning(
                f"The tokenizer picked seems to have a very large `model_max_length` ({tokenizer.model_max_length}). "
                "Picking 1024 instead. You can change that default value by setting block_size in config."
            )
            # Cap the block size only when the tokenizer's maximum exceeds 1024.
            block_size = 1024
    else:
        if config.block_size > tokenizer.model_max_length:
            logger.warning(
                f"The block_size passed ({config.block_size}) is larger than the maximum length for the model "
                f"({tokenizer.model_max_length}). Using block_size={tokenizer.model_max_length}."
            )
        block_size = min(config.block_size, tokenizer.model_max_length)

    # Main data processing function that will concatenate all texts from our dataset and generate chunks of block_size.
    def group_texts(examples):
        # Concatenate all texts.
        concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
        total_length = len(concatenated_examples[list(examples.keys())[0]])
        # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can
        # customize this part to your needs.
        total_length = (total_length // block_size) * block_size
        # Split by chunks of max_len.
        result = {
            k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
            for k, t in concatenated_examples.items()
        }
        result["labels"] = result["input_ids"].copy()
        return result

    lm_dataset = tokenized_dataset.map(
        group_texts,
        batched=True,
        num_proc=config.num_workers,
        load_from_cache_file=not config.overwrite_cache,
    )

    return tokenized_dataset, lm_dataset, tokenizer, conf
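
To make the chunking step concrete, here is a small standalone sketch of `group_texts` applied to made-up token ids. It follows the same logic as the function above, except that `block_size` is passed as an explicit argument instead of being captured from the enclosing scope, which is a change made only so the example runs on its own.

```python
def group_texts(examples, block_size):
    """Concatenate the lists under each key, drop the remainder, and cut into blocks."""
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the tail that does not fill a whole block.
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # For causal language modeling the labels are a copy of the inputs.
    result["labels"] = result["input_ids"].copy()
    return result


# Three "tokenized documents" with made-up token ids (10 tokens in total).
toy_batch = {"input_ids": [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10]]}
blocks = group_texts(toy_batch, block_size=4)

print(blocks["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
print(blocks["labels"])     # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Document boundaries are ignored when concatenating, and the last two tokens (9 and 10) are dropped because they do not fill a block. The labels can be identical to the inputs because the causal language model head shifts them by one position internally when it computes the loss.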

Train method -

This fetches the dataset configured in the `Config` dataclass and selects its train and validation splits. The `Trainer` then runs the training loop, followed by evaluation on the validation split.

In [None]:
def train(config):
    print("Getting Dataset...")
    tokenized_dataset, lm_dataset, tokenizer, conf = get_dataset(config)

    if "train" not in tokenized_dataset:
        raise ValueError("Training requires a train dataset")
    train_dataset = lm_dataset["train"]
    if config.max_train_samples is not None:
        train_dataset = train_dataset.select(range(config.max_train_samples))

    if "validation" not in tokenized_dataset:
        raise ValueError("Validation requires a validation dataset")
    eval_dataset = lm_dataset["validation"]
    if config.max_val_samples is not None:
        eval_dataset = eval_dataset.select(range(config.max_val_samples))
    print("Dataset Prepared!")

    print("Loading Model...")
    model = transformers.AutoModelForCausalLM.from_pretrained(
            config.model_name, config=conf
    )
    print("Model Loaded!")

    
    # Initialize our Trainer
    print("Starting Training...")
    training_args = transformers.TrainingArguments(
        output_dir=config.output_dir,
        overwrite_output_dir=config.overwrite_output_dir,
        per_device_train_batch_size=config.per_device_train_batch_size,
        num_train_epochs=config.num_train_epochs,
        save_steps=config.save_steps
    )
    
    trainer = transformers.Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        tokenizer=tokenizer,
        # Data collator will default to DataCollatorWithPadding, so we change it.
        data_collator=transformers.default_data_collator,
    )

    train_result = trainer.train()
    trainer.save_model()  # Saves the tokenizer too for easy upload

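    # Keep the training metrics under their own name; `metrics` is reassigned by the evaluation step.
    train_metrics = train_result.metrics

    max_train_samples = (
        config.max_train_samples if config.max_train_samples is not None else len(train_dataset)
    )
    train_metrics["train_samples"] = min(max_train_samples, len(train_dataset))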

    # Evaluation
    print("Starting Evaluation...")

    metrics = trainer.evaluate()

    max_val_samples = config.max_val_samples if config.max_val_samples is not None else len(eval_dataset)
    metrics["eval_samples"] = min(max_val_samples, len(eval_dataset))
    perplexity = math.exp(metrics["eval_loss"])
    metrics["perplexity"] = perplexity

    print("Evaluation Metrics:")
    print(f"Eval Loss: {metrics['eval_loss']}\nPerplexity: {metrics['perplexity']}")


In [None]:
train(config)

Getting Dataset...


Reusing dataset wikitext (/root/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/47c57a6745aa5ce8e16a5355aaa4039e3aa90d1adad87cef1ad4e0f29e74ac91)
Loading cached processed dataset at /root/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/47c57a6745aa5ce8e16a5355aaa4039e3aa90d1adad87cef1ad4e0f29e74ac91/cache-aabc5be2a17e5dd4.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/47c57a6745aa5ce8e16a5355aaa4039e3aa90d1adad87cef1ad4e0f29e74ac91/cache-085fbe7a91131577.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/47c57a6745aa5ce8e16a5355aaa4039e3aa90d1adad87cef1ad4e0f29e74ac91/cache-7d63db416c5a4f8e.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/47c57a6745aa5ce8e16a5355aaa4039e3aa90d1adad87cef1ad4e0f29e74ac91/cache-3b5f36296ad45978.arrow
Loading cached processed dataset at /root/.cache/hugg

Dataset Prepared!
Loading Model...
Model Loaded!
Starting Training...


| Step | Training Loss |
|------|---------------|
| 500  | 3.2873        |
| 1000 | 3.1893        |
| 1500 | 3.0693        |
| 2000 | 3.0145        |


Starting Evaluation...


Evaluation Metrics:
Eval Loss: 3.044163942337036
Perplexity: 20.992472946173525
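
The perplexity reported above is the exponential of the evaluation loss, where the loss is the average cross-entropy (negative log-probability) per token. The sketch below reproduces that relationship with the standard library only; the helper names are hypothetical and the toy probabilities are made up for illustration.

```python
import math


def mean_cross_entropy(token_probs):
    """Average negative log-probability the model assigned to the correct tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def perplexity_from_loss(loss):
    """Perplexity is exp(mean cross-entropy)."""
    return math.exp(loss)


# Eval loss printed by the run above; the result is about 20.99.
print(perplexity_from_loss(3.044163942337036))

# Toy case: a model that always gives the correct token probability 1/4
# is as uncertain as a uniform choice among 4 options.
toy_probs = [0.25] * 8
print(perplexity_from_loss(mean_cross_entropy(toy_probs)))
```

Read this way, a perplexity of roughly 21 means the fine-tuned model is, on average, about as uncertain as if it were choosing uniformly among 21 tokens at each step.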


# Generate Text

Load the fine-tuned model and tokenizer

In [None]:
device = 'cpu'
if torch.cuda.is_available():
    device = 'cuda'

tokenizer = transformers.GPT2Tokenizer.from_pretrained(config.output_dir)
model = transformers.GPT2LMHeadModel.from_pretrained(config.output_dir)
model = model.to(device)

Snippet for generation taken from: https://towardsdatascience.com/fine-tuning-gpt2-on-colab-gpu-for-free-340468c92ed

`choose_from_top` keeps the `n` most probable tokens, renormalizes their probabilities so that they sum to 1, and samples one token id from them.

`generate` extends an input text one token at a time using the given model and tokenizer. `length` is the number of tokens to generate, while `n` is the number of top candidates to sample from at each step.

In [None]:
def choose_from_top(probs, n=5):
    ind = np.argpartition(probs, -n)[-n:]
    top_prob = probs[ind]
    top_prob = top_prob / np.sum(top_prob)
    choice = np.random.choice(n, 1, p=top_prob)
    token_id = ind[choice][0]
    return int(token_id)

def generate(input_str, model, tokenizer, length=250, n=5):
    cur_ids = torch.tensor(tokenizer.encode(input_str)).unsqueeze(0).long().to(device)
    model.eval()
    with torch.no_grad():
        for i in range(length):
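            # Only the logits are needed for sampling, so no labels are passed and no loss is computed.
            outputs = model(cur_ids[:, -1024:])
            logits = outputs[0]
            # Turn the logits at the last position into a probability distribution over the vocabulary.
            softmax_logits = torch.softmax(logits[0, -1], dim=0)
            next_token_id = choose_from_top(softmax_logits.to('cpu').numpy(), n=n)
            # Append the sampled token so it becomes part of the context for the next step.
            cur_ids = torch.cat([cur_ids, torch.tensor([[next_token_id]], device=device)], dim=1)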
        output_list = list(cur_ids.squeeze().to('cpu').numpy())
        output_text = tokenizer.decode(output_list)
        return output_text
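
To see the top-`n` sampling step in isolation, the sketch below re-implements it with the standard library on a made-up five-token distribution. `sample_top_n` is a hypothetical name for this illustration; it is meant to mirror the behavior of `choose_from_top` above, not to replace it.

```python
import random


def sample_top_n(probs, n=5, rng=random):
    """Sample an index from the n most probable entries, weighted by their probabilities."""
    top_indices = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:n]
    top_probs = [probs[i] for i in top_indices]
    # random.choices normalizes the weights, which plays the role of the
    # explicit renormalization in choose_from_top.
    return rng.choices(top_indices, weights=top_probs, k=1)[0]


# Made-up distribution over a 5-token vocabulary.
toy_probs = [0.05, 0.40, 0.10, 0.30, 0.15]

rng = random.Random(0)  # seeded so the run is repeatable
samples = [sample_top_n(toy_probs, n=2, rng=rng) for _ in range(1000)]

# Only the two most probable tokens (indices 1 and 3) can ever be chosen.
print({index: samples.count(index) for index in sorted(set(samples))})
```

A small `n` keeps the output close to the most likely continuation, while a larger `n` admits less likely tokens and makes the generated text more varied.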

In [None]:
generated_text = generate("= Interstellar =", model, tokenizer)
print(generated_text)

= Interstellar = 
 The film's title refers to the first time in history that humans have made a space trip. It also describes a time where humans traveled through space, as well as other worlds. The story of the film follows a group of astronauts who travel to Mars, where they meet the extraterrestrial race known as the " Voyage Express, a race of extraterrestrials that has been searching for humans since the very beginning. " 
 = = Production = = 
 The first draft script was written by John Travolta and directed by David Koeppner. The script was written by John Travolta and directed by David Koeppner and was directed by David Koeppner and Michael Cera. The script was written by John Travolta and directed by Michael Cera. The film was shot in a 3 @,@ 000 x 3 @,@ 000 @,@ inch format and had a budget of $ 1 @.@ 8 million. It also had a crew consisting of John Travolta, John Travolta, Michael Cera, and James Franco. 
 The film was shot during the summer in Los Angeles. The location had be