<a href="https://colab.research.google.com/github/affjljoo3581/GPT2/blob/master/GPT2_Interactive_Notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# GPT2 Interactive Notebook

## Introduction
Welcome! In this notebook, you can play with your own trained GPT2 model. The notebook is based on [affjljoo3581/GPT2](https://github.com/affjljoo3581/GPT2) and works with models trained by that project.

## Preparation

First of all, set *Runtime Type* to **GPU**. Then check the current GPU device. We recommend running this notebook on a **Tesla T4** or **Tesla P100**.

In [None]:
import warnings
warnings.filterwarnings('ignore')

import IPython, torch
IPython.display.HTML(f'<p style="font-size: 12pt">Current GPU: <b>{torch.cuda.get_device_name()}</b></p>')

Next, clone the GPT2 repository from GitHub. [affjljoo3581/GPT2](https://github.com/affjljoo3581/GPT2) provides not only training but also text generation and model evaluation.

In [3]:
!rm -rf GPT2
!git clone --quiet https://github.com/affjljoo3581/GPT2

Before playing with GPT2, you need to download the trained model file and the vocabulary file. To evaluate the model, an evaluation corpus file is needed as well. This notebook downloads files from [Google Cloud Storage](https://cloud.google.com/storage), so upload the required files to your own bucket and specify their paths below.

In [None]:
#@title Download resources from Google Cloud Storage

model = 'gs://my-bucket/my-model' #@param {type:"string"}
vocab = 'gs://my-bucket/my-vocab' #@param {type:"string"}
eval_corpus = 'gs://my-bucket/my-eval-corpus' #@param {type:"string"}

!gcloud auth login
!gsutil -q cp $model model.pth
!gsutil -q cp $vocab vocab.txt
!gsutil -q cp $eval_corpus corpus.txt

Finally, configure the details of the GPT2 model.

In [6]:
#@title Model Configuration
seq_len = 128 #@param {type:"integer"}
layers = 24 #@param {type:"integer"}
heads = 16 #@param {type:"integer"}
dims = 1024 #@param {type:"integer"}
rate = 4 #@param {type:"integer"}

* `seq_len` : maximum sequence length
* `layers` : number of transformer layers
* `heads` : number of attention heads in each attention layer
* `dims` : dimension of the representation in each layer
* `rate` : increase rate of the dimensionality in the bottleneck
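
As a rough illustration of how these settings determine model size, the sketch below estimates the parameter count. It assumes a standard GPT-2 style block (four attention projections, a two-layer feed-forward network that widens to `rate * dims`, two layer norms per block, and an output layer tied to the token embedding); the actual layout in affjljoo3581/GPT2 may differ. The function names and the `vocab_size` value are hypothetical.

```python
def head_dim(dims: int, heads: int) -> int:
    """Per-head width, assuming dims is split evenly across the heads."""
    if dims % heads != 0:
        raise ValueError("dims must be divisible by heads")
    return dims // heads


def estimate_gpt2_params(layers: int, dims: int, rate: int,
                         vocab_size: int, seq_len: int) -> int:
    """Rough parameter count for an assumed standard GPT-2 style decoder."""
    # Self-attention: query, key, value and output projections (weights + biases).
    attention = 4 * (dims * dims + dims)
    # Feed-forward: dims -> rate * dims -> dims (weights + biases).
    hidden = rate * dims
    feed_forward = (dims * hidden + hidden) + (hidden * dims + dims)
    # Two layer norms per block, each with a scale and a shift vector.
    layer_norms = 2 * 2 * dims
    per_layer = attention + feed_forward + layer_norms
    # Token and positional embeddings, plus the final layer norm.
    # The output projection is assumed to reuse the token embedding.
    embeddings = vocab_size * dims + seq_len * dims
    final_norm = 2 * dims
    return layers * per_layer + embeddings + final_norm


# Values from the configuration cell; vocab_size is a made-up placeholder.
total = estimate_gpt2_params(layers=24, dims=1024, rate=4,
                             vocab_size=30000, seq_len=128)
print(f"head width: {head_dim(1024, 16)}, approx. parameters: {total:,}")
```

An estimate like this is mainly useful for judging whether a configuration is likely to fit in GPU memory before loading the model file.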

## Generate Sentences!
According to [The Curious Case of Neural Text Degeneration](https://arxiv.org/pdf/1904.09751.pdf), ***Top-k Sampling***, a popular sampling procedure, is problematic both when the distribution is flat and when it is peaked. The authors argue that with a small $k$ there is a risk of generating bland or generic text in some contexts, while with a large $k$ the top-k vocabulary can include inappropriate candidates. So they proposed ***Nucleus Sampling***. In nucleus sampling, the candidates are the top-p tokens rather than the top-k ones. That is, the smallest set of highest-probability tokens whose cumulative probability mass exceeds the pre-chosen threshold $p$ is selected.

In this notebook, *nucleus sampling* is used for text generation. As mentioned above, it requires the hyperparameter `nucleus_prob`, which is the threshold $p$.
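
To make the selection rule concrete, here is a minimal pure-Python sketch of nucleus sampling over a single next-token distribution. It is not the implementation used by affjljoo3581/GPT2; the function names are hypothetical, and the cutoff is taken as the first point where the cumulative mass reaches `p`.

```python
import random


def nucleus_filter(probs, p):
    """Keep the smallest set of top tokens whose mass reaches p, renormalized."""
    # Rank token indices by probability, highest first.
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    kept, cumulative = [], 0.0
    for index, prob in ranked:
        kept.append((index, prob))
        cumulative += prob
        if cumulative >= p:
            break
    # Rescale the kept probabilities so they sum to one.
    total = sum(prob for _, prob in kept)
    return {index: prob / total for index, prob in kept}


def sample_nucleus(probs, p, rng=random):
    """Draw one token index from the renormalized nucleus."""
    nucleus = nucleus_filter(probs, p)
    indices = list(nucleus)
    weights = [nucleus[i] for i in indices]
    return rng.choices(indices, weights=weights, k=1)[0]


# Toy distribution over four tokens; with p = 0.7 only tokens 0 and 1 survive.
toy_probs = [0.5, 0.25, 0.125, 0.125]
print(nucleus_filter(toy_probs, 0.7))
print(sample_nucleus(toy_probs, 0.7))
```

With a peaked distribution the nucleus shrinks to a few tokens, and with a flat one it grows, which is the adaptivity that a fixed $k$ lacks.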

In [None]:
#@title Generation Options
nucleus_prob = 0.8 #@param {type:"slider", min:0, max:1, step:0.01}

import IPython
display(IPython.display.HTML('''<style> div.output_text pre {
    white-space: pre-line; max-width: 1000px; display: inline-block;
} </style>'''))

!export PYTHONPATH=GPT2/src; python -m gpt2 generate \
        --vocab_path    vocab.txt \
        --model_path    model.pth \
        --seq_len       $seq_len \
        --layers        $layers \
        --heads         $heads \
        --dims          $dims \
        --rate          $rate \
        --nucleus_prob  $nucleus_prob \
        --use_gpu

## Evaluate Model!
After training the model, you may want to evaluate its performance. The most popular and objective way to evaluate a model is to calculate metrics on a test dataset that was not used during training. First, let's check the number of sequences in the evaluation corpus.


In [None]:
import IPython
lines = !wc -l corpus.txt | awk '{print $1}'
IPython.display.HTML(f'<p style="font-size: 12pt">Total Evaluation Sequences: <b>{lines[0]}</b></p>')

To improve performance, evaluation is done in batches. That is, the total number of iterations should be `total sequences / batch size`. Usually, a larger batch size leads to higher efficiency, but one that is too large causes an out-of-memory error. So it is important to choose a proper batch size.
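
The iteration count can be worked out as in the sketch below. It rounds up so that a final, partially filled batch is still counted, which is an assumption about how leftover sequences should be handled; the helper name and the corpus size are hypothetical.

```python
def evaluation_steps(total_sequences: int, batch_size: int) -> int:
    """Number of batches needed to cover every sequence once (rounded up)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # Integer ceiling division avoids floating-point rounding.
    return (total_sequences + batch_size - 1) // batch_size


# Hypothetical corpus of 25,600 sequences with a batch size of 256.
print(evaluation_steps(25600, 256))
```

The result is the kind of value you would enter for `total_steps` in the evaluation options.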

In [None]:
%%time
#@title Evaluation Options
batch_eval = 256 #@param {type: "integer"}
total_steps = 100 #@param {type: "integer"}

!export PYTHONPATH=GPT2/src; python -m gpt2 evaluate \
        --model_path    model.pth \
        --eval_corpus   corpus.txt \
        --vocab_path    vocab.txt \
        --seq_len       $seq_len \
        --layers        $layers \
        --heads         $heads \
        --dims          $dims \
        --rate          $rate \
        --batch_eval    $batch_eval \
        --total_steps   $total_steps \
        --use_gpu
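
This notebook does not state which metrics the evaluation reports. One metric commonly used for language models is perplexity, the exponential of the mean negative log-likelihood per token. The sketch below shows that calculation on made-up values; the function name and inputs are hypothetical and are not taken from the `gpt2 evaluate` command.

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood (natural log) per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# A model that assigns probability 0.25 to every token has perplexity near 4.
print(perplexity([math.log(0.25)] * 8))
```

Lower perplexity means the model assigns higher probability to the held-out text.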