<a href="https://colab.research.google.com/github/ShaneZhong/train-gpt-2-model/blob/master/Train_the_GPT_2_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Training the GPT-2 model from an input text

Setup:

1) Make sure the GPU is enabled: go to Edit -> Notebook settings and set Hardware accelerator to GPU.

2) Make a copy in your Google Drive: click "Copy to Drive" in the toolbar.
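
As a quick sanity check for step 1, the sketch below asks the NVIDIA driver whether a GPU is visible. It assumes the `nvidia-smi` tool is on the `PATH` of a GPU runtime (the helper name `gpu_visible` is made up for this example), so treat a `False` result as a hint to recheck the notebook settings rather than as proof.

```python
import shutil
import subprocess


def gpu_visible() -> bool:
    """Return True if `nvidia-smi -L` lists at least one GPU.

    Assumption: GPU runtimes ship the `nvidia-smi` command-line tool.
    """
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # No driver tooling found, so most likely a CPU-only runtime.
        return False
    result = subprocess.run(
        [exe, "-L"],  # -L lists the GPUs, one per line
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return result.returncode == 0 and "GPU" in result.stdout


if __name__ == "__main__":
    print("GPU visible:", gpu_visible())
```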

Note: Colab resets the runtime after 12 hours. Save your model checkpoints to Google Drive at the 10-11 hour mark or earlier, then go to Runtime -> Reset all runtimes. After the reset, copy your trained model back into Colab and resume training from the previous checkpoint.
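
The save-and-restore round trip described in the note can be sketched in plain Python. This is a minimal illustration, not part of the repository: `copy_checkpoints` is a hypothetical helper that mirrors `cp -r src dst_parent/`, and it relies on `shutil.copytree(..., dirs_exist_ok=True)`, which needs Python 3.8 or newer.

```python
import shutil
from pathlib import Path


def copy_checkpoints(src, dst_parent):
    """Copy directory `src` into `dst_parent`, like `cp -r src dst_parent/`.

    Returns the target path, or None when `src` is missing or empty.
    """
    src = Path(src)
    if not src.is_dir() or not any(src.iterdir()):
        return None  # nothing to copy
    target = Path(dst_parent) / src.name
    # dirs_exist_ok lets repeated backups overwrite into the same folder.
    shutil.copytree(src, target, dirs_exist_ok=True)
    return target


# Backup before the runtime resets (paths as used in this notebook):
# copy_checkpoints("/content/train-gpt-2-model/checkpoint", "/content/drive/My Drive")
# Restore after the reset:
# copy_checkpoints("/content/drive/My Drive/checkpoint", "/content/train-gpt-2-model")
```

Returning `None` for an empty source keeps the restore step safe to run on a fresh Drive that has no checkpoints yet.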

## Environment Setup



### Git clone and install dependencies

In [22]:
#@title Clone or pull train-gpt-2-model from Github

%cd /content
Mode = "pull" #@param ["clone", "pull"]

if Mode == "clone":
  !git clone https://github.com/ShaneZhong/train-gpt-2-model.git
else:
  %cd /content/train-gpt-2-model
  !git pull

!pip3 install -r /content/train-gpt-2-model/requirements.txt

/content
/content/train-gpt-2-model
Your configuration specifies to merge with the ref 'refs/heads/finetuning'
from the remote, but no such ref was fetched.


In [3]:
cd /content/train-gpt-2-model

/content/train-gpt-2-model/train-gpt-2-model


In [17]:
ls

98-0.txt         Dockerfile.cpu     LICENSE           src/
checkpoint/      Dockerfile.gpu     models/           train-horovod.py
CONTRIBUTORS.md  download_model.py  README.md         train.py*
DEVELOPERS.md    encode.py*         requirements.txt


### GDrive Mount

In [5]:
# Run this cell to mount your Google Drive.
from google.colab import drive
drive.mount('/content/drive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=email%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdocs.test%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive.photos.readonly%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fpeopleapi.readonly&response_type=code

Enter your authorization code:
··········
Mounted at /content/drive


In [6]:
# download the models
!python3 download_model.py 117M
!python3 download_model.py 345M

Fetching checkpoint: 1.00kit [00:00, 532kit/s]
Fetching encoder.json: 1.04Mit [00:00, 35.8Mit/s]                                                   
Fetching hparams.json: 1.00kit [00:00, 562kit/s]                                                    
Fetching model.ckpt.data-00000-of-00001: 498Mit [00:14, 35.1Mit/s]                                  
Fetching model.ckpt.index: 6.00kit [00:00, 3.07Mit/s]                                               
Fetching model.ckpt.meta: 472kit [00:00, 25.0Mit/s]                                                 
Fetching vocab.bpe: 457kit [00:00, 33.6Mit/s]                                                       
Fetching checkpoint: 1.00kit [00:00, 635kit/s]                                                      
Fetching encoder.json: 1.04Mit [00:00, 36.0Mit/s]                                         

In [0]:
# Use %env: a variable set with `!export` is lost when its subshell exits.
%env PYTHONIOENCODING=UTF-8

### Fetch previous model checkpoints from Google Drive (optional)

In [9]:
# check if you have any content in your GDrive directory
ls /content/drive/My\ Drive/checkpoint/

[0m[01;34mrun1[0m/


In [0]:
# If the above is not empty, you can copy the previously
# saved model to your project directory.
!cp -r /content/drive/My\ Drive/checkpoint/ /content/train-gpt-2-model/ 

## Get the training dataset

Let's get our text to train on, in this case from Project Gutenberg: A Tale of Two Cities, by Charles Dickens.

In [10]:
!wget https://www.gutenberg.org/files/98/98-0.txt

--2019-06-29 14:46:09--  https://www.gutenberg.org/files/98/98-0.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 804335 (785K) [text/plain]
Saving to: ‘98-0.txt’


2019-06-29 14:46:12 (655 KB/s) - ‘98-0.txt’ saved [804335/804335]
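
Project Gutenberg files usually wrap the book in a license header and footer, which the model would otherwise learn along with Dickens. The sketch below trims them before training. The `*** START OF ...` and `*** END OF ...` marker wording is an assumption about the file's layout, and `strip_gutenberg_boilerplate` is a hypothetical helper, so check the result against your copy of `98-0.txt`.

```python
import re

# Assumed marker lines; the exact wording varies between Gutenberg files.
START_RE = re.compile(
    r"^\*\*\* ?START OF (THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)
END_RE = re.compile(
    r"^\*\*\* ?END OF (THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)


def strip_gutenberg_boilerplate(text):
    """Return the text between the START and END marker lines.

    If a marker is not found, that side of the text is left untouched.
    """
    start = START_RE.search(text)
    end = END_RE.search(text)
    begin = start.end() if start else 0
    finish = end.start() if end and end.start() >= begin else len(text)
    return text[begin:finish].strip()


# Usage in the notebook (the utf-8-sig encoding is an assumption):
# raw = open("98-0.txt", encoding="utf-8-sig").read()
# open("98-0-clean.txt", "w", encoding="utf-8").write(strip_gutenberg_boilerplate(raw))
```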



In [11]:
ls /content/train-gpt-2-model/

CONTRIBUTORS.md  download_model.py  requirements.txt    train.py*
DEVELOPERS.md    encode.py*         src/
Dockerfile.cpu   LICENSE            train-gpt-2-model/
Dockerfile.gpu   README.md          train-horovod.py


### Alternative: Trump tweets
The Trump tweets dataset is already saved in the `data` directory of the repository.

In [20]:
cd /content/train-gpt-2-model/

/content/train-gpt-2-model


In [21]:
ls /content/train-gpt-2-model/data

ls: cannot access '/content/train-gpt-2-model/data': No such file or directory


## Training the model

Select either the 117M or 345M model to train. 

**IMPORTANT**: The cell below does not stop automatically once it is running. To stop training, click the stop button. The saved model is written to the checkpoint directory (e.g. the log line `Saving checkpoint/run1/model-289`).


Many parameters can be tuned in this model. You can find the reference in [train.py](https://github.com/ShaneZhong/train-gpt-2-model/blob/finetuning/train.py).

In [18]:
#@title Train the model with the following parameters
input_data = '/content/train-gpt-2-model/data/trump_tweets.txt' #@param
model = "345M" #@param ["117M","345M"]
Samples_per_N_steps = 10 #@param

!PYTHONPATH=src /content/train-gpt-2-model/train.py --dataset $input_data --model_name $model --sample_every $Samples_per_N_steps

W0629 14:50:09.858506 140239881455488 deprecation_wrapper.py:119] From /content/train-gpt-2-model/train-gpt-2-model/src/model.py:147: The name tf.AUTO_REUSE is deprecated. Please use tf.compat.v1.AUTO_REUSE instead.

W0629 14:50:09.869672 140239881455488 deprecation_wrapper.py:119] From /content/train-gpt-2-model/train-gpt-2-model/src/memory_saving_gradients.py:13: The name tf.GraphKeys is deprecated. Please use tf.compat.v1.GraphKeys instead.

W0629 14:50:09.981479 140239881455488 deprecation_wrapper.py:119] From /content/train-gpt-2-model/train.py:87: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

W0629 14:50:09.981886 140239881455488 deprecation_wrapper.py:119] From /content/train-gpt-2-model/train.py:90: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

2019-06-29 14:50:09.988601: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2300000000 Hz
2019-06-29 14:50:09.988916: I tensorflow/compiler/xl

In [16]:
# Train the model from "A Tale of Two Cities":

#!PYTHONPATH=src ./train.py --dataset /content/train-gpt-2-model/98-0.txt --model_name '117M'
!PYTHONPATH=src ./train.py --dataset /content/train-gpt-2-model/98-0.txt --model_name '345M'

W0629 14:48:39.724495 140388479313792 deprecation_wrapper.py:119] From /content/train-gpt-2-model/train-gpt-2-model/src/model.py:147: The name tf.AUTO_REUSE is deprecated. Please use tf.compat.v1.AUTO_REUSE instead.

W0629 14:48:39.735804 140388479313792 deprecation_wrapper.py:119] From /content/train-gpt-2-model/train-gpt-2-model/src/memory_saving_gradients.py:13: The name tf.GraphKeys is deprecated. Please use tf.compat.v1.GraphKeys instead.

W0629 14:48:39.848906 140388479313792 deprecation_wrapper.py:119] From ./train.py:87: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

W0629 14:48:39.849313 140388479313792 deprecation_wrapper.py:119] From ./train.py:90: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

2019-06-29 14:48:39.855903: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2300000000 Hz
2019-06-29 14:48:39.856234: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x32bdc00 ex
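
Because training only stops when you interrupt it, it helps to confirm which step was saved last before copying anything. The sketch below scans a run directory for the newest checkpoint. It assumes TensorFlow wrote files named `model-<step>.index`, which matches the `model-289` example above but is not confirmed by this notebook, and `latest_checkpoint_step` is a hypothetical helper.

```python
import re
from pathlib import Path

# Assumed naming scheme: model-<step>.index next to the .data and .meta files.
CKPT_RE = re.compile(r"^model-(\d+)\.index$")


def latest_checkpoint_step(run_dir):
    """Return the highest saved step in `run_dir`, or None if there is none."""
    steps = []
    for path in Path(run_dir).glob("model-*.index"):
        match = CKPT_RE.match(path.name)
        if match:
            # Compare as integers so that step 1000 sorts after step 289.
            steps.append(int(match.group(1)))
    return max(steps) if steps else None


# Usage in the notebook:
# print(latest_checkpoint_step("/content/train-gpt-2-model/checkpoint/run1"))
```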

### Save the trained model to GDrive
By default, the trained model is saved in the `checkpoint` folder under your GDrive root directory.

In [0]:
!cp -r /content/train-gpt-2-model/checkpoint/ /content/drive/My\ Drive/

## Apply the trained model

### Fetch the trained model
The trained model (117M or 345M) is copied into the corresponding `models` directory.

In [0]:
#!cp -r /content/train-gpt-2-model/checkpoint/run1/* /content/train-gpt-2-model/models/117M/
!cp -r /content/train-gpt-2-model/checkpoint/run1/* /content/train-gpt-2-model/models/345M/

In [0]:
# show the usage help for the sampling script
!python3 src/interactive_conditional_samples.py -- --help

W0628 22:14:07.029035 139712883050368 deprecation_wrapper.py:119] From /content/train-gpt-2-model/src/model.py:147: The name tf.AUTO_REUSE is deprecated. Please use tf.compat.v1.AUTO_REUSE instead.

Type:        function
String form: <function interact_model at 0x7f116e973d08>
File:        /content/train-gpt-2-model/src/interactive_conditional_samples.py
Line:        11
Docstring:   Interactively run the model
:model_name=117M : String, which model to use
:seed=None : Integer seed for random number generators, fix seed to reproduce
 results
:nsamples=1 : Number of samples to return total
:batch_size=1 : Number of batches (only affects speed/memory).  Must divide nsamples.
:length=None : Number of tokens in generated text, if None (default), is
 determined by model hyperparameters
:temperature=1 : Float value controlling randomness in boltzmann
 distribution. Lower temperature results in less random completions. As the
 temperature approaches zero, the model will become deterministic an

### Conditional samples

In [0]:
#!python3 src/interactive_conditional_samples.py --model_name='117M' --nsamples=2 --top_k=40 --temperature=0.7
!python3 src/interactive_conditional_samples.py --model_name='345M' --nsamples=2 --top_k=40 --temperature=0.7

W0628 22:16:48.055194 140603117918080 deprecation_wrapper.py:119] From /content/train-gpt-2-model/src/model.py:147: The name tf.AUTO_REUSE is deprecated. Please use tf.compat.v1.AUTO_REUSE instead.

W0628 22:16:48.399002 140603117918080 deprecation_wrapper.py:119] From src/interactive_conditional_samples.py:55: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

2019-06-28 22:16:48.400708: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2019-06-28 22:16:48.442906: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-06-28 22:16:48.443303: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties: 
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:00:04.0
2019-06-28 22:16:48.443626: I tensorflow/stream
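
To see what `--temperature` and `--top_k` do to the next-token distribution, here is a small plain-Python illustration. It is a sketch of the general technique, not the repository's TensorFlow code: `sampling_probs` is a hypothetical helper, and treating `top_k=0` as "no restriction" is an assumption of this example.

```python
import math


def sampling_probs(logits, temperature=1.0, top_k=0):
    """Turn raw logits into sampling probabilities.

    Logits are divided by `temperature` (lower means less random), then all
    but the `top_k` largest are masked out before the softmax.
    Ties at the cutoff are all kept, so slightly more than k can survive.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    if top_k > 0:
        cutoff = sorted(scaled, reverse=True)[min(top_k, len(scaled)) - 1]
        scaled = [x if x >= cutoff else -math.inf for x in scaled]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.1]
    print(sampling_probs(logits))                            # baseline
    print(sampling_probs(logits, temperature=0.7, top_k=2))  # sharper, 2 candidates
```

With a lower temperature the largest logit takes a bigger share of the probability mass, and with `top_k=2` the third token can never be sampled.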

### Unconditional samples

In [0]:
!python3 src/generate_unconditional_samples.py --model_name='117M' --nsamples=2 --top_k=40 --temperature=0.7
!python3 src/generate_unconditional_samples.py --model_name='345M' --nsamples=2 --top_k=40 --temperature=0.7

W0628 14:07:37.595685 140452906493824 deprecation_wrapper.py:119] From /content/train-gpt-2-model/src/model.py:147: The name tf.AUTO_REUSE is deprecated. Please use tf.compat.v1.AUTO_REUSE instead.

W0628 14:07:38.008329 140452906493824 deprecation_wrapper.py:119] From src/generate_unconditional_samples.py:52: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

2019-06-28 14:07:38.010619: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2019-06-28 14:07:38.031302: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-06-28 14:07:38.031820: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties: 
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:04.0
2019-06-28 14:07:38.032214: I tensorflow/stre