# Stig Introduction Generator
This notebook shows how to fine-tune OpenAI's GPT-2 model so that it generates dialogue in the style of your favorite character. The training data here is the set of lines that the hosts of BBC's Top Gear (Jeremy Clarkson, James May and Richard Hammond) use to introduce another character on the show: **The Stig**. Here's an example of how they introduce "The Stig":


```
Some say he naturally faces magnetic north, and that all his legs are hydraulic. All we know is he's called the Stig.

Some say that he's terrified of ducks, and that there's an airport in Russia named after him. All we know is he's called the Stig.

Some say that his tears are adhesive, and that if he caught fire he'd burn for 1000 days. All we know is he's called the Stig.
```
The data was gathered from this [link](https://www.topgearbox.com/stig-quotes/).

You can use this notebook as a reference to create your own language model that can talk like your favorite character.

```
Some say this language model has helped Stan Lee to generate plots for new Marvel Comics, and by using this model, we can enjoy Marvel Comics for centuries. All we know is it is open source.
```

Enable the GPU by going to Runtime > Change runtime type > GPU.  
Clone nshepperd's fork of the GPT-2 GitHub [repository](https://github.com/nshepperd/gpt-2).

In [None]:
!git clone https://github.com/nshepperd/gpt-2.git

Cloning into 'gpt-2'...
remote: Enumerating objects: 435, done.
remote: Counting objects: 100% (64/64), done.
remote: Compressing objects: 100% (51/51), done.
remote: Total 435 (delta 19), reused 48 (delta 13), pack-reused 371
Receiving objects: 100% (435/435), 4.48 MiB | 22.15 MiB/s, done.
Resolving deltas: 100% (220/220), done.


In [None]:
cd gpt-2

/content/gpt-2


Install requirements

In [None]:
!pip3 install -q -r requirements.txt

Collecting fire>=0.1.3
  Downloading https://files.pythonhosted.org/packages/11/07/a119a1aa04d37bc819940d95ed7e135a7dcca1c098123a3764a6dcace9e7/fire-0.4.0.tar.gz (87kB)
Collecting regex==2017.4.5
  Downloading https://files.pythonhosted.org/packages/36/62/c0c0d762ffd4ffaf39f372eb8561b8d491a11ace5a7884610424a8b40f95/regex-2017.04.05.tar.gz (601kB)
Collecting requests==2.21.0
  Downloading https://files.pythonhosted.org/packages/7d/e3/20f3d364d6c8e5d2353c72a67778eb189176f08e873c9900e10c0287b84b/requests-2.21.0-py2.py3-none-any.whl (57kB)
Collecting tqdm==4.31.1
  Downloading https://file

Mount Google Drive so that checkpoints can be saved and restored later. You will be asked to log in to your Google account.

In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

Download the model data

In [None]:
!python3 download_model.py 117M

Fetching checkpoint: 1.00kit [00:00, 1.22Mit/s]                                                     
Fetching encoder.json: 1.04Mit [00:00, 6.63Mit/s]                                                   
Fetching hparams.json: 1.00kit [00:00, 1.15Mit/s]                                                   
Fetching model.ckpt.data-00000-of-00001: 498Mit [00:46, 10.6Mit/s]                                  
Fetching model.ckpt.index: 6.00kit [00:00, 6.30Mit/s]                                               
Fetching model.ckpt.meta: 472kit [00:00, 3.93Mit/s]                                                 
Fetching vocab.bpe: 457kit [00:00, 3.54Mit/s]                                                       


Set Encoding

In [None]:
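# "!export" only changes a throwaway subshell; %env sets the variable for the kernel and later "!" commands
%env PYTHONIOENCODING=UTF-8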
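
However the variable is set, it only helps if the processes started later can actually see it. The sketch below is one way to check that from plain Python: it sets `PYTHONIOENCODING` through `os.environ` and asks a fresh child interpreter what it inherited. The helper name `child_sees_encoding` is made up for this illustration.

```python
import os
import subprocess
import sys

# Set the variable in the current (kernel) process; child processes inherit it.
os.environ["PYTHONIOENCODING"] = "UTF-8"

def child_sees_encoding() -> str:
    """Ask a fresh Python child process which value it inherited."""
    result = subprocess.run(
        [sys.executable, "-c",
         "import os; print(os.environ.get('PYTHONIOENCODING', ''))"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()

print(child_sees_encoding())
```

If this prints an empty line, the variable did not reach the child process and the training script would not see it either.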

Fetch your checkpoints if you have already saved them in Google Drive

In [None]:
#!cp -r /content/drive/My\ Drive/checkpoint/ /content/gpt-2/ 

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Copy the dataset into the repository folder to train the model. Because a dataset like this one is small, the model can easily overfit if fine-tuning runs for too long, so keep the run short. I fine-tuned this model for about 20 minutes.

In [None]:
!cp /content/drive/My\ Drive/LinkedIn_Articles/Stig\ Introductions/stig-introductions.txt /content/gpt-2/
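
If you are assembling your own dataset file, a minimal sketch of the preparation step might look like the following. It assumes a plain-text layout with one introduction per paragraph, separated by blank lines, as in the example at the top of this notebook; the actual layout of `stig-introductions.txt` is not shown here, and `build_dataset` is a hypothetical helper name.

```python
import tempfile
from pathlib import Path

def build_dataset(quotes, out_path):
    """Write unique, whitespace-normalized quotes to out_path, one per paragraph."""
    cleaned = []
    seen = set()
    for quote in quotes:
        text = " ".join(quote.split())  # collapse line breaks and repeated spaces
        if text and text not in seen:   # skip blanks and exact duplicates
            seen.add(text)
            cleaned.append(text)
    Path(out_path).write_text("\n\n".join(cleaned) + "\n", encoding="utf-8")
    return len(cleaned)

quotes = [
    "Some say he naturally faces magnetic north, and that all his legs are hydraulic. All we know is he's called the Stig.",
    "Some say that he's terrified of ducks,\nand that there's an airport in Russia named after him. All we know is he's called the Stig.",
    # Same as the first quote apart from a doubled space, so it is dropped.
    "Some say he naturally faces magnetic north, and that all his legs are hydraulic.  All we know is he's called the Stig.",
]

# Written to a temporary folder here; in Colab you would point this at /content/gpt-2/.
with tempfile.TemporaryDirectory() as tmp:
    out_file = Path(tmp) / "stig-introductions.txt"
    written = build_dataset(quotes, out_file)
    contents = out_file.read_text(encoding="utf-8")

print(written)
```

Removing duplicates matters more than usual with a tiny dataset, since repeated lines make it even easier for the model to memorize them.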

Begin training the model. GPT-2 comes in several sizes, and each model name refers to its number of trainable parameters; this notebook fine-tunes the smallest one, `117M`.

In [None]:
!PYTHONPATH=src ./train.py --dataset /content/gpt-2/stig-introductions.txt --model_name '117M'

2021-05-21 18:33:06.246190: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-05-21 18:33:08.018920: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-05-21 18:33:08.020020: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-05-21 18:33:08.074140: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-05-21 18:33:08.074762: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:00:04.0 name: Tesla T4 computeCapability: 7.5
coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.75GiB deviceMemoryBandwidth: 298.08GiB/s
2021-05-21 18:33:08.074806: I tensorflow/stream_executor/platform/default/dso_loade
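
To make the link between a model name and its parameter count concrete, here is a rough sketch that counts the weights of a GPT-2 style transformer from its hyperparameters. The values in `SMALL` are what I believe `hparams.json` contains for the smallest model; treat them as an assumption and compare them with the file in `models/117M/`. The function name `gpt2_param_count` is made up for this illustration.

```python
def gpt2_param_count(n_vocab, n_ctx, n_embd, n_layer):
    """Count trainable parameters of a GPT-2 style model (output layer tied to the token embedding)."""
    d = n_embd
    embeddings = n_vocab * d + n_ctx * d             # token + position embeddings
    attention = (d * 3 * d + 3 * d) + (d * d + d)    # fused q/k/v projection + output projection
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)      # expand to 4d, then project back to d
    layer_norms = 2 * (2 * d)                        # two layer norms per block (gain + bias each)
    final_norm = 2 * d                               # layer norm after the last block
    return embeddings + n_layer * (attention + mlp + layer_norms) + final_norm

# Assumed hyperparameters of the smallest GPT-2 model (check against hparams.json).
SMALL = {"n_vocab": 50257, "n_ctx": 1024, "n_embd": 768, "n_layer": 12}

print(f"{gpt2_param_count(**SMALL):,}")  # 124,439,808
```

Under these assumptions the smallest model has roughly 124 million parameters, which would be consistent with the `124M` default that appears in the help text further down even though the download script here is called with `117M`.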

Save the checkpoints to Google Drive so that training can be resumed later

In [None]:
!cp -r /content/gpt-2/checkpoint/ /content/drive/My\ Drive/
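
The same backup can be done from Python, which makes it easier to fail loudly when the checkpoint folder is missing. This is only a sketch: `backup_checkpoints` is a hypothetical helper, the demo uses a temporary folder with a placeholder file instead of real checkpoints, and `dirs_exist_ok` needs Python 3.8 or newer.

```python
import shutil
import tempfile
from pathlib import Path

def backup_checkpoints(src, dst):
    """Copy the checkpoint tree at src into dst, overwriting files that already exist."""
    src, dst = Path(src), Path(dst)
    if not src.is_dir():
        raise FileNotFoundError(f"no checkpoint directory at {src}")
    shutil.copytree(src, dst, dirs_exist_ok=True)
    # Return the copied files relative to dst so the result is easy to inspect.
    return sorted(p.relative_to(dst).as_posix() for p in dst.rglob("*") if p.is_file())

# Demo with placeholder data; in Colab, src would be /content/gpt-2/checkpoint
# and dst a "checkpoint" folder inside the mounted drive.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "checkpoint" / "run1"
    run_dir.mkdir(parents=True)
    (run_dir / "model-100.index").write_text("placeholder")
    copied = backup_checkpoints(Path(tmp) / "checkpoint", Path(tmp) / "drive" / "checkpoint")

print(copied)  # ['run1/model-100.index']
```

Unlike `cp -r` with a trailing destination folder, `dst` here is the exact folder that will hold the copy, not its parent.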

Copy the fine-tuned checkpoint over the base model files so that the sampling scripts below use it (`117M` here; adjust the target folder if you trained `345M`)

In [None]:
!cp -r /content/gpt-2/checkpoint/run1/* /content/gpt-2/models/117M/

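Generate conditional samples from the model, given a prompt you provide. Change the `--top_k` hyperparameter if desired (40 is used here; according to the help text, the script's own default is 0, which applies no restriction). If you're using 345M, pass `--model_name 345M` instead.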

In [None]:
!python3 src/interactive_conditional_samples.py --top_k 40 --model_name "117M"

2021-05-21 18:40:49.311793: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-05-21 18:40:50.890850: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-05-21 18:40:50.891734: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-05-21 18:40:50.934675: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-05-21 18:40:50.935300: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:00:04.0 name: Tesla T4 computeCapability: 7.5
coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.75GiB deviceMemoryBandwidth: 298.08GiB/s
2021-05-21 18:40:50.935348: I tensorflow/stream_executor/platform/default/dso_loade

To check flag descriptions, use:

In [None]:
!python3 src/interactive_conditional_samples.py -- --help

2021-05-21 18:39:22.613498: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
NAME
    interactive_conditional_samples.py - Interactively run the model :model_name=124M : String, which model to use :seed=None : Integer seed for random number generators, fix seed to reproduce results :nsamples=1 : Number of samples to return total :batch_size=1 : Number of batches (only affects speed/memory).  Must divide nsamples. :length=None : Number of tokens in generated text, if None (default), is determined by model hyperparameters :temperature=1 : Float value controlling randomness in boltzmann distribution. Lower temperature results in less random completions. As the temperature approaches zero, the model will become deterministic and repetitive. Higher temperature results in more random completions. :top_k=0 : Integer value controlling diversity. 1 means only 1 word is considered for each step (token), resulting in det
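
The `temperature` and `top_k` flags described in the help text can be illustrated with a small, dependency-free sketch. This is not the repository's TensorFlow implementation, only the same idea on a toy list of scores; the function names and the example logits are made up for the illustration.

```python
import math
import random

def top_k_probs(logits, k, temperature=1.0):
    """Return sampling probabilities after temperature scaling and top-k truncation."""
    scaled = [x / temperature for x in logits]  # lower temperature sharpens the distribution
    if k > 0:                                   # k == 0 means "no restriction"
        cutoff = sorted(scaled, reverse=True)[min(k, len(scaled)) - 1]
        # Anything below the k-th largest score gets zero probability
        # (scores tied with the cutoff are all kept).
        scaled = [x if x >= cutoff else -math.inf for x in scaled]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]  # subtract the max for numerical stability
    total = sum(weights)
    return [w / total for w in weights]

def sample(logits, k, temperature=1.0, rng=random):
    """Draw one token index from the truncated distribution."""
    probs = top_k_probs(logits, k, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]  # toy scores for a four-token vocabulary
print(top_k_probs(logits, k=2))  # only the two best tokens keep probability mass
print(top_k_probs(logits, k=0))  # every token stays possible
print(sample(logits, k=1))       # k=1 always picks the best token, here index 0
```

With `k=1` generation becomes deterministic, while larger values let less likely tokens through; raising the temperature has a similar loosening effect on the tokens that remain.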

Generate unconditional samples from the model. If you're using 345M, pass `--model_name 345M` instead.

In [None]:
!python3 src/generate_unconditional_samples.py --model_name "117M" --nsamples 5 | tee /content/drive/My\ Drive/LinkedIn_Articles/Stig\ Introductions/samples.txt

2021-05-21 19:00:18.000695: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-05-21 19:00:19.517739: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-05-21 19:00:19.518602: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-05-21 19:00:19.547872: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-05-21 19:00:19.548453: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:00:04.0 name: Tesla T4 computeCapability: 7.5
coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.75GiB deviceMemoryBandwidth: 298.08GiB/s
2021-05-21 19:00:19.548496: I tensorflow/stream_executor/platform/default/dso_loade
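
With a dataset this small, one rough sign of overfitting is that generated samples repeat training lines word for word. The sketch below measures that on toy data; `memorized_fraction` is a hypothetical helper, the second sample is invented for the demo, and in practice you would split `samples.txt` and the training file into lines yourself before calling it.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial differences do not hide a match."""
    return " ".join(text.lower().split())

def memorized_fraction(samples, training_lines):
    """Fraction of non-empty samples that reproduce a training line verbatim."""
    known = {normalize(line) for line in training_lines if line.strip()}
    candidates = [normalize(s) for s in samples if s.strip()]
    if not candidates:
        return 0.0
    return sum(s in known for s in candidates) / len(candidates)

training_lines = [
    "Some say that he's terrified of ducks, and that there's an airport in Russia named after him. All we know is he's called the Stig.",
    "Some say that his tears are adhesive, and that if he caught fire he'd burn for 1000 days. All we know is he's called the Stig.",
]

samples = [
    # Copied from the training data, apart from spacing and case.
    "some say that his tears are adhesive,  and that if he caught fire he'd burn for 1000 days. All we know is he's called the Stig.",
    # Invented line standing in for a genuinely new sample.
    "Some say that he is allergic to spreadsheets, and that his shadow arrives a day early. All we know is he's called the Stig.",
    "",
]

print(memorized_fraction(samples, training_lines))  # 0.5
```

A value close to 1.0 suggests the model is mostly reciting its training data, in which case a shorter fine-tuning run is worth trying.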

To check flag descriptions, use:

In [None]:
!python3 src/generate_unconditional_samples.py -- --help

2021-05-21 18:48:10.138954: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
NAME
    generate_unconditional_samples.py - Run the sample_model :model_name=124M : String, which model to use :seed=None : Integer seed for random number generators, fix seed to reproduce results :nsamples=0 : Number of samples to return, if 0, continues to generate samples indefinately. :batch_size=1 : Number of batches (only affects speed/memory). :length=None : Number of tokens in generated text, if None (default), is determined by model hyperparameters :temperature=1 : Float value controlling randomness in boltzmann distribution. Lower temperature results in less random completions. As the temperature approaches zero, the model will become deterministic and repetitive. Higher temperature results in more random completions. :top_k=0 : Integer value controlling diversity. 1 means only 1 word is considered for each step (token), re

Copy the whole working folder to Google Drive for permanent storage

In [None]:
!cp -r /content/gpt-2 /content/drive/My\ Drive/LinkedIn_Articles/Stig\ Introductions/gpt-2/