<a href="https://colab.research.google.com/github/KCL-Machine-Learning/cocktail-recipe-gen/blob/notebook/gpt_2_cocktail_recipe_gen.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Train GPT-2 To Generate Text On Google Colab

This notebook focuses on using `gpt-2-simple` to interact with GPT-2. More information about `gpt-2-simple` can be found on [this GitHub repository](https://github.com/minimaxir/gpt-2-simple).

To get started:
1. Copy this notebook to your Google Drive to keep it and save your changes. (File -> Save a Copy in Drive)
2. Make sure you're running the notebook in Google Chrome.
3. Run the cell below to set the TensorFlow version and install the `gpt-2-simple` library, which we will use to interact with GPT-2.


In [0]:
%tensorflow_version 1.x
!pip install -q gpt-2-simple
import gpt_2_simple as gpt2
from datetime import datetime
from google.colab import files

# GPU

This notebook has access to the Nvidia Tesla P100. Creating a copy of this notebook retains access to this GPU.

You can verify the GPU being used by this notebook by running the cell below.

In [0]:
!nvidia-smi
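
If you would rather check for a GPU from Python (for example, to stop early when none is attached), a minimal sketch follows. The helper name `gpu_summary` is hypothetical; it only shells out to the same `nvidia-smi` tool and returns `None` when the tool is missing or reports an error.

```python
import shutil
import subprocess


def gpu_summary():
    """Return the GPU name(s) reported by nvidia-smi, or None if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None  # tool not on PATH, so no NVIDIA driver is visible here
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    if result.returncode != 0:
        return None  # tool present, but it could not talk to a GPU
    return result.stdout.strip() or None


print(gpu_summary())
```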

# Downloading GPT-2

To interact with GPT-2 you must first download one of the GPT-2 models.

There are four released sizes of GPT-2:

* `124M` (default): the "small" model, 0.5GB on disk.
* `355M`: the "medium" model, 1.3GB on disk.
* `774M`: the "large" model, 2.9GB on disk.
* `1558M`: the "extra large" true model, 5.8GB on disk. Cannot currently be finetuned with Colaboratory but can be used to generate text without finetuning.

Larger models have more knowledge, but take longer to finetune and longer to generate text.

The next cell downloads all 3 finetunable models from Google Cloud Storage and saves them in the Colaboratory VM at `/models/<model_name>`.

If you don't need all of them and want to save time/space then comment out the lines for the models you don't need before running the cell.

These models aren't permanently saved in the Colaboratory VM; you'll have to redownload them if you want to finetune them at a later time.

In [0]:
gpt2.download_gpt2(model_name="124M")
gpt2.download_gpt2(model_name="355M")
gpt2.download_gpt2(model_name="774M")
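
Before downloading, it can help to estimate how much disk space a selection of models needs. The sketch below is only an illustration: the sizes are the approximate figures quoted in the list above, and `MODEL_SIZES_GB` and `total_download_gb` are hypothetical names, not part of `gpt-2-simple`.

```python
# Approximate on-disk sizes in GB, taken from the list above.
MODEL_SIZES_GB = {"124M": 0.5, "355M": 1.3, "774M": 2.9, "1558M": 5.8}


def total_download_gb(model_names):
    """Sum the approximate disk footprint of the requested models."""
    unknown = [name for name in model_names if name not in MODEL_SIZES_GB]
    if unknown:
        raise ValueError("Unknown model name(s): {}".format(unknown))
    # Round to one decimal place to hide floating-point noise.
    return round(sum(MODEL_SIZES_GB[name] for name in model_names), 1)


wanted = ["124M", "355M", "774M"]
print(total_download_gb(wanted))  # 4.7
```

The same `wanted` list could then drive a loop that calls `gpt2.download_gpt2(model_name=name)` for each entry.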

# Mounting Google Drive

The best way to get input text to-be-trained into the Colaboratory VM, and to get the trained model *out* of Colaboratory, is to route it through Google Drive *first*.

Running this cell (which will only work in Colaboratory) will mount your personal Google Drive in the VM, which later cells can use to get data in/out. (It will ask for an auth code; that auth is not saved anywhere.)

In [0]:
gpt2.mount_gdrive()

# Uploading Text Files to Train on Colab

In the Colaboratory Notebook sidebar on the left of the screen, select *Files*. From there you can upload files:

![alt text](https://i.imgur.com/TGcZT4h.png)

You may upload **any small text file** (<10 MB) and update the file name in the cell below, then run the cell.

The recommended method to upload a file is to upload it to Google Drive first, then copy that file from Google Drive to the Colaboratory VM.





In [0]:
file_name1 = "cocktails.txt"
file_name2 = "cocktails-full.txt"

file_name = file_name2

In [0]:
gpt2.copy_file_from_gdrive(file_name)
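
Once the file is in the VM, a quick sanity check can catch a missing, empty, or oversized file before training starts. This is a sketch under assumptions: `check_training_file` is a hypothetical helper, and the limit is just the ~10 MB guideline from the text above, not something the library enforces.

```python
import os

MAX_BYTES = 10 * 1024 * 1024  # the ~10 MB guideline mentioned above


def check_training_file(path, max_bytes=MAX_BYTES):
    """Return the file size in bytes; raise if the file is missing, empty, or too large."""
    if not os.path.isfile(path):
        raise FileNotFoundError("Training file not found: {}".format(path))
    size = os.path.getsize(path)
    if size == 0:
        raise ValueError("Training file is empty: {}".format(path))
    if size > max_bytes:
        raise ValueError(
            "{} is {} bytes, above the {} byte limit".format(path, size, max_bytes)
        )
    return size
```

Calling `check_training_file(file_name)` after the copy step would then either return the size or explain what is wrong.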

# Setting The Run Name

The cell below sets the `run_name` (essentially the name of the model) which is used when finetuning a model or generating text based on an existing model.

In [0]:
run_name1 = '124M_cocktails'
run_name2 = '355M_cocktails'
run_name3 = '774M_cocktails'
run_name4 = '124M_cocktails-full'
run_name5 = '355M_cocktails-full'
run_name6 = '774M_cocktails-full'

run_name = run_name6
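
The run names above follow a `<model size>_<dataset name>` pattern. A small sketch of building them programmatically is shown below; both helpers are hypothetical conveniences, not part of `gpt-2-simple`, and they assume the model size itself contains no underscore.

```python
import os


def make_run_name(model_name, dataset_file):
    """Build a run name such as '774M_cocktails-full' from a model size and dataset file."""
    stem = os.path.splitext(os.path.basename(dataset_file))[0]
    return "{}_{}".format(model_name, stem)


def model_from_run_name(name):
    """Recover the base model size (the part before the first underscore)."""
    return name.split("_", 1)[0]


example = make_run_name("774M", "cocktails-full.txt")
print(example)                       # 774M_cocktails-full
print(model_from_run_name(example))  # 774M
```

Deriving the model size from the run name this way is one option for keeping `model_name` and `run_name` consistent when finetuning.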

# Finetuning GPT-2

If you already have a finetuned model and wish to finetune it further, then use the cell below to copy it from Google Drive. Make sure the `run_name` is set to that of the model you want to copy.

In [0]:
gpt2.copy_checkpoint_from_gdrive(run_name=run_name)

The next cell will start the actual finetuning of GPT-2. It creates a persistent TensorFlow session which stores the training config, then runs the training for the specified number of `steps`. (To have the finetuning run indefinitely, set `steps = -1`.)

The model checkpoints will be saved in `/checkpoint/<run_name>`. The checkpoints are saved every 500 steps (by default) and when the cell is stopped.

The training might time out after roughly 4 hours; make sure you end training, save the results, and copy the model to your Google Drive so you don't lose it!

Other optional but helpful parameters for `gpt2.finetune`:

*  **`restore_from`**: Set to `fresh` to start training from the base GPT-2, or set to `latest` to restart training from an existing checkpoint. (If it is set to `latest` and there is no existing checkpoint, it will use the base GPT-2 instead.)
* **`sample_every`**: Number of steps between printing example output.
* **`print_every`**: Number of steps to print training progress.
* **`learning_rate`**:  Learning rate for the training. (The default is `1e-4`, but this can be lowered to `1e-5` if you have <1MB input data)
*  **`run_name`**: subfolder within `checkpoint` to save the model. This is useful if you want to work with multiple models (will also need to specify  `run_name` when loading the model)
* **`overwrite`**: Set to `True` if you want to continue finetuning an existing model (and `restore_from='latest'`) without creating duplicate copies. 

In [0]:
try:
  sess = gpt2.start_tf_sess()
except ValueError:
  sess = gpt2.reset_session(sess)

gpt2.finetune(sess,
              dataset=file_name,
              model_name='774M',  # This dictates which base model to use
              steps=1000,  # This dictates the number of steps to finetune for
              restore_from='latest',
              run_name=run_name,  # For all intents and purposes this is the name of the model
              print_every=10,  # This will print the current step, how many seconds the model has finetuned for, the current loss, and the average loss
              sample_every=200,  # This will print a sample from the model with no prompt
              save_every=500,  # Make sure steps is a multiple of this in order to save after the last step
              learning_rate=1e-5,
              overwrite=True  # This will overwrite the model on disk when saving
              )
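
Two of the settings above can be checked before a long run: the learning rate suggested for small inputs, and whether `steps` is a multiple of `save_every`. The sketch below encodes only the rules of thumb stated in this notebook; the function names are hypothetical.

```python
def suggest_learning_rate(dataset_bytes):
    """Rule of thumb from the notes above: 1e-5 for inputs under 1 MB, else the 1e-4 default."""
    return 1e-5 if dataset_bytes < 1024 * 1024 else 1e-4


def saves_on_last_step(steps, save_every):
    """True when the final step coincides with a periodic checkpoint."""
    if steps == -1:
        return False  # indefinite training has no final step
    return steps % save_every == 0


print(suggest_learning_rate(300 * 1024))  # 1e-05
print(saves_on_last_step(1000, 500))      # True
print(saves_on_last_step(1200, 500))      # False
```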

# Saving a Trained Model to Google Drive

After the model is trained, you can copy the checkpoint folder to your own Google Drive by running the cell below.

If you want to download it to your personal computer, it's strongly recommended that you copy it to Google Drive first, then download it from there. The checkpoint folder is copied as a `.tar` archive; you can download it and extract it locally.

In [0]:
gpt2.copy_checkpoint_to_gdrive(run_name=run_name)
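
For working with the downloaded archive on your own machine, Python's standard `tarfile` module is enough. The sketch below illustrates packing and extracting a folder as a plain `.tar`; it is not the library's own implementation, and the function names and paths are hypothetical.

```python
import os
import tarfile


def pack_checkpoint(checkpoint_dir, tar_path):
    """Write checkpoint_dir into an uncompressed .tar archive."""
    folder_name = os.path.basename(os.path.normpath(checkpoint_dir))
    with tarfile.open(tar_path, "w") as tar:
        tar.add(checkpoint_dir, arcname=folder_name)
    return tar_path


def unpack_checkpoint(tar_path, dest_dir):
    """Extract a checkpoint archive into dest_dir and return the member names."""
    with tarfile.open(tar_path, "r") as tar:
        names = tar.getnames()
        tar.extractall(dest_dir)
    return names
```

Only extract archives you created yourself, since `extractall` trusts the paths stored inside the archive.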

# Loading a Trained Model

Running the cell below will copy the `.tar` checkpoint file from your Google Drive into the Colab VM.

In [0]:
gpt2.copy_checkpoint_from_gdrive(run_name=run_name)

The next cell allows you to load the model and necessary metadata to generate text.

In [0]:
try:
  sess = gpt2.start_tf_sess()
except ValueError:
  sess = gpt2.reset_session(sess)

gpt2.load_gpt2(sess, run_name=run_name)  # pass run_name by keyword so it cannot be mistaken for another positional argument

# Generating Text From A Trained Model

After you've finetuned a model, you can now generate text. `generate` generates a single text from the loaded model.

In [0]:
gpt2.generate(sess, run_name)

There are many parameters that control the text generation and can augment how the text is generated. 

Some optional but helpful parameters for `gpt2.generate` include:

* **`return_as_list`**: Set to `True` if you want to return the samples generated as a list of samples instead of printing them to console. This is useful if you need to pass the generated text elsewhere.
* **`truncate`**: Truncates the input text until a given sequence, excluding that sequence (e.g. if `truncate='<|endoftext|>'`, the returned text will include everything before the first `<|endoftext|>`). It may be useful to combine this with a smaller `length` if the input texts are short.
* **`prefix`**: Force the text to start with a given string and generate text from there.
* **`seed`**: An unsigned integer that is the seed for the random number generator.
* **`nsamples`**: The total number of samples to generate.
* **`batch_size`**: Number of samples to generate in parallel per batch (only affects speed/memory). Must divide `nsamples`.
*  **`length`**: Number of tokens to generate (default 1023, the maximum)
* **`temperature`**: The higher the temperature, the more 'random' the text (default 0.7, recommended to keep between 0.7 and 1.0)
* **`top_k`**: Limits each generated token to the top *k* guesses (default 0 which disables the behavior; if the generated output is super crazy, you may want to set `top_k=40`)
* **`top_p`**: Nucleus sampling: limits the generated guesses to a cumulative probability. (gets good results on a dataset with `top_p=0.9`)
*  **`include_prefix`**: If using `truncate` and `include_prefix=False`, the specified `prefix` will not be included in the returned text. (default `True`)
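
To make `temperature`, `top_k`, and `top_p` more concrete, here is a conceptual sketch on a toy four-token vocabulary. It is not how `gpt-2-simple` implements sampling (that happens inside TensorFlow, and the library may combine the options differently); the function names and logits are made up for illustration.

```python
import math
import random


def filtered_distribution(logits, temperature=0.7, top_k=0, top_p=0.0):
    """Return {token: probability} after temperature scaling and top-k / top-p filtering."""
    # Temperature: dividing logits by a value below 1 sharpens the distribution.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    weights = {tok: math.exp(v - peak) for tok, v in scaled.items()}
    total = sum(weights.values())
    probs = {tok: w / total for tok, w in weights.items()}

    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]  # keep only the k most likely tokens
    if top_p > 0.0:
        kept, cumulative = [], 0.0
        for tok, p in ranked:
            kept.append((tok, p))
            cumulative += p
            if cumulative >= top_p:
                break  # smallest set whose probability mass reaches top_p
        ranked = kept

    kept_total = sum(p for _, p in ranked)
    return {tok: p / kept_total for tok, p in ranked}  # renormalize survivors


def sample_token(logits, seed=None, **kwargs):
    """Draw one token from the filtered distribution."""
    dist = filtered_distribution(logits, **kwargs)
    rng = random.Random(seed)
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


toy_logits = {"gin": 2.0, "rum": 1.0, "vodka": 0.5, "milk": -1.0}
print(filtered_distribution(toy_logits, top_k=2))
print(sample_token(toy_logits, seed=42, top_p=0.9))
```

Lowering `temperature` or tightening `top_k` / `top_p` concentrates probability on the likeliest tokens, which is why output becomes less "random".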

In [0]:
prefix = ""

gpt2.generate(sess,
              run_name=run_name,  # without this, the default run name is used instead of the loaded model's
              prefix=prefix,
              length=10,
              temperature=0.7,
              top_p=0.9,
              nsamples=1,
              batch_size=1
              )
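
The interaction between `truncate`, `prefix`, and `include_prefix` is easiest to see on a plain string. The sketch below approximates the behavior described in the parameter list; `postprocess` is a hypothetical stand-in, not the library's code, and the `<|startoftext|>` marker in the example is an assumed prefix.

```python
def postprocess(text, prefix="", truncate=None, include_prefix=True):
    """Approximate how truncate and include_prefix shape a returned sample."""
    if not truncate:
        return text  # include_prefix only matters when truncating
    end = text.find(truncate)
    if end == -1:
        return text  # marker never generated, so nothing to cut
    text = text[:end]  # keep everything before the first marker
    if prefix and not include_prefix and text.startswith(prefix):
        text = text[len(prefix):]  # drop the leading prefix as well
    return text


raw = "<|startoftext|>Negroni: gin, vermouth, Campari<|endoftext|>Extra text"
print(postprocess(raw, prefix="<|startoftext|>",
                  truncate="<|endoftext|>", include_prefix=False))
# Negroni: gin, vermouth, Campari
```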

For bulk generation, you can generate a large amount of text to a file and sort out the samples locally on your computer. The next cell will write the generated text to a file whose name includes a unique timestamp.

You can rerun the cells as many times as you want for even more generated texts!

In [0]:
gen_file = 'gpt2_gentext_{:%Y%m%d_%H%M%S}.txt'.format(datetime.utcnow())

gpt2.generate_to_file(sess,
                      run_name=run_name,  # without this, the default run name is used instead of the loaded model's
                      destination_path=gen_file,
                      length=500,
                      temperature=0.7,
                      nsamples=100,
                      batch_size=20
                      )
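
Because `batch_size` must divide `nsamples`, it can be convenient to compute a valid value rather than guess. A minimal sketch, with a hypothetical helper name, that picks the largest divisor of `nsamples` not exceeding a memory-driven cap:

```python
def largest_valid_batch_size(nsamples, max_batch_size):
    """Return the largest batch size <= max_batch_size that divides nsamples evenly."""
    if nsamples < 1 or max_batch_size < 1:
        raise ValueError("nsamples and max_batch_size must be positive")
    for candidate in range(min(nsamples, max_batch_size), 0, -1):
        if nsamples % candidate == 0:
            return candidate  # 1 always divides, so the loop always returns


print(largest_valid_batch_size(100, 20))  # 20
print(largest_valid_batch_size(100, 24))  # 20
print(largest_valid_batch_size(7, 4))     # 1
```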

In [0]:
# may have to run twice to get file to download
files.download(gen_file)