<a href="https://colab.research.google.com/github/TD1138/GPT2---Twilight-Zone-Narrations/blob/main/notebooks/GPT_2_Retrain_Twilight_Zone.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Retraining GPT-2 to output Twilight Zone Narrations!

by Tom Devine


The Twilight Zone is one of my favourite shows!

Created by Rod Serling, this 1950s show was groundbreaking for its era, mixing science fiction and fantasy with tales of morality. It went on to influence thousands of creators; recent shows such as Black Mirror owe a huge debt to this anthology show!

Rod Serling, alongside show-running and plot duties, narrated each episode, providing an introduction to each story, as well as a closing narration, often highlighting the moral aspect of the story.

It is these narrations that I'll be retraining GPT-2 to output!

Let's import the standard packages:

In [1]:
import pandas as pd
import numpy as np
import requests

# Text Pre-processing

Originally I was going to scrape these narrations from Wikipedia, but I managed to obtain a copy of them from an Angelfire website:
https://www.angelfire.com/fl/twilightzonelist/openingnarrations.html

Let's import the dataset:

In [2]:
url = 'https://raw.githubusercontent.com/TD1138/GPT2---Twilight-Zone-Narrations/main/text_files/all_opening_narrations.txt'
resp = requests.get(url)
resp.raise_for_status()  # fail early if the download did not succeed
tz_file = resp.text
tz_file

"1) Where is Everybody?\r\nThe place is here, the time is now, and the journey into the shadows that we're about to watch could be our journey.\r\n\r\n2) One for the Angels\r\nStreet scene: summer. The present. Man on a sidewalk named Lew Bookman, age sixtyish. Occupation: pitchman. Lew Bookman, a fixture of the summer, a rather minor component to a hot July, a nondescript, commonplace little man whose life is a treadmill built out of sidewalks. In just a moment, Lew Bookman will have to concern himself with survival, because as of three o'clock this hot July afternoon he'll be stalked by Mr. Death.\r\n\r\n3) Mr. Denton on Doomsday\r\nPortrait of a town drunk named Al Denton. This is a man who's begun his dying early - a long, agonizing route through a maze of bottles. Al Denton, who would probably give an arm or a leg or a part of his soul to have another chance, to be able to rise up and shake the dirt from his body and the bad dreams that infest his consciousness. In the parlance of

We need to split this file into separate episodes.
It looks like each episode entry is followed by an extra line break, so we can split on the double return '\r\n\r\n'.
Let's test this:

In [3]:
tz_file.split('\r\n\r\n')[:20]

["1) Where is Everybody?\r\nThe place is here, the time is now, and the journey into the shadows that we're about to watch could be our journey.",
 "2) One for the Angels\r\nStreet scene: summer. The present. Man on a sidewalk named Lew Bookman, age sixtyish. Occupation: pitchman. Lew Bookman, a fixture of the summer, a rather minor component to a hot July, a nondescript, commonplace little man whose life is a treadmill built out of sidewalks. In just a moment, Lew Bookman will have to concern himself with survival, because as of three o'clock this hot July afternoon he'll be stalked by Mr. Death.",
 "3) Mr. Denton on Doomsday\r\nPortrait of a town drunk named Al Denton. This is a man who's begun his dying early - a long, agonizing route through a maze of bottles. Al Denton, who would probably give an arm or a leg or a part of his soul to have another chance, to be able to rise up and shake the dirt from his body and the bad dreams that infest his consciousness. In the parlance of the 

Looks like this gave us what we need!
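
The split above relies on the file using Windows-style `\r\n` line endings. As a minimal sketch, assuming the same "blank line between episodes" layout, the helper below (the name `split_entries` is my own, not part of the notebook) tolerates either line-ending style and drops any empty trailing entry:

```python
import re

def split_entries(raw_text):
    """Split raw text into one string per episode, tolerating \r\n or \n line endings."""
    # A blank line (possibly holding stray spaces or tabs) separates episodes.
    chunks = re.split(r'(?:\r?\n)[ \t]*(?:\r?\n)', raw_text)
    # Drop empty chunks, e.g. the one left behind by a trailing blank line.
    return [chunk.strip() for chunk in chunks if chunk.strip()]

sample = "1) A\r\nText one.\r\n\r\n2) B\r\nText two.\r\n\r\n"
print(split_entries(sample))
```

The line break between each title and its narration is left untouched, so the later steps work the same way on the result.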

Let's also remove the episode numbers in this step.

Note that we're leaving the line break after the episode title. Hopefully our model will be able to come up with its own titles as well as narrations!

In [4]:
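tz_list = tz_file.split('\r\n\r\n')
# Strip only the leading "N) " episode number; maxsplit=1 keeps any later ') ' in the narration intact
tz_list = [entry.split(') ', 1)[1] for entry in tz_list if entry.strip()]
tz_list[:10]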

["Where is Everybody?\r\nThe place is here, the time is now, and the journey into the shadows that we're about to watch could be our journey.",
 "One for the Angels\r\nStreet scene: summer. The present. Man on a sidewalk named Lew Bookman, age sixtyish. Occupation: pitchman. Lew Bookman, a fixture of the summer, a rather minor component to a hot July, a nondescript, commonplace little man whose life is a treadmill built out of sidewalks. In just a moment, Lew Bookman will have to concern himself with survival, because as of three o'clock this hot July afternoon he'll be stalked by Mr. Death.",
 "Mr. Denton on Doomsday\r\nPortrait of a town drunk named Al Denton. This is a man who's begun his dying early - a long, agonizing route through a maze of bottles. Al Denton, who would probably give an arm or a leg or a part of his soul to have another chance, to be able to rise up and shake the dirt from his body and the bad dreams that infest his consciousness. In the parlance of the times, th

Let's add an identifier to the start and end of each narration, so the model knows where to start and end:

In [5]:
start_text = '[narration_start]'
end_text = '[narration_end]'

tz_list = [start_text+title+end_text for title in tz_list]
tz_list[:10]

["[narration_start]Where is Everybody?\r\nThe place is here, the time is now, and the journey into the shadows that we're about to watch could be our journey.[narration_end]",
 "[narration_start]One for the Angels\r\nStreet scene: summer. The present. Man on a sidewalk named Lew Bookman, age sixtyish. Occupation: pitchman. Lew Bookman, a fixture of the summer, a rather minor component to a hot July, a nondescript, commonplace little man whose life is a treadmill built out of sidewalks. In just a moment, Lew Bookman will have to concern himself with survival, because as of three o'clock this hot July afternoon he'll be stalked by Mr. Death.[narration_end]",
 "[narration_start]Mr. Denton on Doomsday\r\nPortrait of a town drunk named Al Denton. This is a man who's begun his dying early - a long, agonizing route through a maze of bottles. Al Denton, who would probably give an arm or a leg or a part of his soul to have another chance, to be able to rise up and shake the dirt from his body a

Finally, let's write it out to the local drive to be read into GPT-2:

In [6]:
file_name = 'tz_opening_narrations.txt'
tz_out = '\n'.join(tz_list)
with open(file_name, 'w') as f:
  f.write(tz_out)
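
Before training, it can be worth checking that every narration has both markers. This is a small sketch of such a check; the function name `count_narrations` and the sample string are illustrative assumptions, and in the notebook you would pass `tz_out` instead:

```python
def count_narrations(text, start='[narration_start]', end='[narration_end]'):
    """Return the number of narrations, raising if the markers are unbalanced."""
    n_start, n_end = text.count(start), text.count(end)
    if n_start != n_end:
        raise ValueError(f'{n_start} start markers but {n_end} end markers')
    return n_start

# Illustrative sample in the same shape as the file written above.
sample = ('[narration_start]Title A\r\nText.[narration_end]\n'
          '[narration_start]Title B\r\nMore text.[narration_end]')
print(count_narrations(sample))  # prints 2
```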

Next, let's install the GPT-2 model.

I'm using a great implementation of the model called GPT-2 Simple, made by Max Woolf (aka minimaxir on GitHub).

See details on the implementation here:
https://github.com/minimaxir/gpt-2-simple

In [7]:
%tensorflow_version 1.x
!pip install -q gpt-2-simple
import gpt_2_simple as gpt2

TensorFlow 1.x selected.
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.



# Downloading GPT-2

Now we need to download the actual GPT-2 model.

I'm using the medium version of the model (355M)

In [8]:
model_version = '355M'

In [9]:
gpt2.download_gpt2(model_name=model_version)

Fetching checkpoint: 1.05Mit [00:00, 315Mit/s]                                                      
Fetching encoder.json: 1.05Mit [00:00, 124Mit/s]                                                    
Fetching hparams.json: 1.05Mit [00:00, 986Mit/s]                                                    
Fetching model.ckpt.data-00000-of-00001: 1.42Git [00:05, 280Mit/s]                                  
Fetching model.ckpt.index: 1.05Mit [00:00, 583Mit/s]                                                
Fetching model.ckpt.meta: 1.05Mit [00:00, 158Mit/s]                                                 
Fetching vocab.bpe: 1.05Mit [00:00, 123Mit/s]                                                       


## Finetune GPT-2

The next cell will start the actual finetuning of GPT-2. It creates a persistent TensorFlow session which stores the training config, then runs the training for the specified number of `steps`. (To have the finetuning run indefinitely, set `steps = -1`.)

The model checkpoints will be saved in `/checkpoint/run1` by default. The checkpoints are saved every 500 steps (set by `save_every`) and when the cell is stopped.

The training might time out after 4ish hours; make sure you end training and save the results so you don't lose them!

**IMPORTANT NOTE:** If you want to rerun this cell, **restart the VM first** (Runtime -> Restart Runtime). You will need to rerun imports but not recopy files.

Other optional-but-helpful parameters for `gpt2.finetune`:


* **`restore_from`**: Set to `fresh` to start training from the base GPT-2, or set to `latest` to restart training from an existing checkpoint.
* **`sample_every`**: Number of steps between printing example output.
* **`print_every`**: Number of steps between printing training progress.
* **`learning_rate`**: Learning rate for the training (default `1e-4`, can lower to `1e-5` if you have <1MB input data).
* **`run_name`**: Subfolder within `checkpoint` to save the model. This is useful if you want to work with multiple models (you will also need to specify `run_name` when loading the model).
* **`overwrite`**: Set to `True` if you want to continue finetuning an existing model (w/ `restore_from='latest'`) without creating duplicate copies.

In [None]:
sess = gpt2.start_tf_sess()

gpt2.finetune(sess,
              dataset=file_name,
              model_name=model_version,
              steps=1000,
              restore_from='fresh',
              run_name='run1',
              print_every=10,
              sample_every=200,
              save_every=500
              )

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Instructions for updating:
Please use tensorflow.python.ops.op_selector.get_backward_walk_ops.
Loading checkpoint models/355M/model.ckpt
INFO:tensorflow:Restoring parameters from models/355M/model.ckpt


  0%|          | 0/1 [00:00<?, ?it/s]

Loading dataset...


100%|██████████| 1/1 [00:00<00:00,  4.70it/s]


dataset has 20581 tokens
Training...
[10 | 23.89] loss=3.16 avg=3.16
[20 | 39.53] loss=2.24 avg=2.69
[30 | 55.39] loss=2.95 avg=2.78
[40 | 71.46] loss=2.10 avg=2.61
[50 | 87.70] loss=0.99 avg=2.28
[60 | 104.17] loss=1.94 avg=2.22
[70 | 120.60] loss=0.75 avg=2.00
[80 | 136.95] loss=0.30 avg=1.78
[90 | 153.31] loss=0.43 avg=1.63
[100 | 169.71] loss=0.09 avg=1.47
[110 | 186.14] loss=0.22 avg=1.35
[120 | 202.53] loss=1.54 avg=1.36
[130 | 218.92] loss=0.28 avg=1.27
[140 | 235.30] loss=0.09 avg=1.18
[150 | 251.71] loss=0.13 avg=1.11
[160 | 268.09] loss=0.41 avg=1.06
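
The dataset here is only 20,581 tokens, well under 1MB, so the lower learning rate mentioned in the parameter list may be worth trying. Below is a minimal sketch of continuing `run1` from its latest checkpoint with `restore_from='latest'` and `overwrite=True`. The specific values (200 steps, `1e-5`, the sampling and saving intervals) are illustrative assumptions rather than tuned settings, and `resume_finetuning` is a hypothetical helper name. Remember to restart the VM before calling it.

```python
# Hypothetical settings for a gentler follow-up run; values are illustrative only.
resume_kwargs = dict(
    dataset='tz_opening_narrations.txt',
    model_name='355M',
    steps=200,
    restore_from='latest',   # pick up from the existing run1 checkpoint
    overwrite=True,          # avoid creating duplicate checkpoint copies
    run_name='run1',
    learning_rate=1e-5,      # lower rate suggested for <1MB of input data
    print_every=10,
    sample_every=100,
    save_every=100,
)

def resume_finetuning(kwargs=resume_kwargs):
    """Continue finetuning run1 (needs gpt-2-simple and a TensorFlow 1.x runtime)."""
    # Imported lazily so the settings can be inspected without TensorFlow installed.
    import gpt_2_simple as gpt2
    sess = gpt2.start_tf_sess()
    gpt2.finetune(sess, **kwargs)
    return sess
```

Calling `sess = resume_finetuning()` in a fresh runtime would then continue training and return the session for generation.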


After the model is trained, you can copy the checkpoint folder to your own Google Drive.

If you want to download it to your personal computer, it's strongly recommended that you copy it to Google Drive first, then download it from there. The checkpoint folder is copied as a `.tar` archive; you can download it and extract it locally.

In [None]:
gpt2.mount_gdrive()  # Google Drive must be mounted before the checkpoint can be copied to it
gpt2.copy_checkpoint_to_gdrive(run_name='run1')

You're done! Feel free to go to the **Generate Text From The Trained Model** section to generate text based on your retrained model.

## Load a Trained Model Checkpoint

Running the next cell will copy the `.tar` checkpoint file from your Google Drive into the Colaboratory VM.

In [None]:
gpt2.mount_gdrive()  # Google Drive must be mounted (e.g. in a fresh VM) before copying from it
gpt2.copy_checkpoint_from_gdrive(run_name='run1')

The next cell will allow you to load the retrained model checkpoint + metadata necessary to generate text.

**IMPORTANT NOTE:** If you want to rerun this cell, **restart the VM first** (Runtime -> Restart Runtime). You will need to rerun imports but not recopy files.

In [None]:
sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess, run_name='run1')

## Generate Text From The Trained Model

After you've trained the model or loaded a retrained model from checkpoint, you can now generate text. `generate` generates a single text from the loaded model.

In [12]:
gpt2.generate(sess, run_name='run1')

He was a fixture in his youth, a fixture of quiet desperation in an age of violence and gloom in an age of uncertainty. Now Henry Bledsoe looks for an escape - any escape, any way, anything, anybody - to get out of the rut. And this little old man is just what Mr. Bledsoe is waiting for something different. Something different, apparently, from what he's seen on TV, something different, apparently, from what any other man would expect from a wife or a mother. Something different, apparently, from what a wife or a mother should expect from a wife or a girlfriend. Something different, apparently, from what a girlfriend should expect from a life partner. Something different, apparently, from what a wife or a girlfriend should expect from a life companion. Jane Eyre, London. Mid-morning. A child's voice announces the start of the world. It is Mr. Jane Eyre, age thirty-six. And this is the Earth.

25) The Empire State Building
Her name: Statue of Liberty. Her motto: 'Give me your tired, you

If you're creating an API based on your model and need to pass the generated text elsewhere, you can do `text = gpt2.generate(sess, return_as_list=True)[0]`

You can also pass in a `prefix` to the generate function to force the text to start with a given character sequence and generate text from there (good if you add an indicator when the text starts).

You can also generate multiple texts at a time by specifying `nsamples`. Unique to GPT-2, you can pass a `batch_size` to generate multiple samples in parallel, giving a massive speedup (in Colaboratory, set a maximum of 20 for `batch_size`).

Other optional-but-helpful parameters for `gpt2.generate` and friends:

*  **`length`**: Number of tokens to generate (default 1023, the maximum)
* **`temperature`**: The higher the temperature, the crazier the text (default 0.7, recommended to keep between 0.7 and 1.0)
* **`top_k`**: Limits the generated guesses to the top *k* guesses (default 0 which disables the behavior; if the generated output is super crazy, you may want to set `top_k=40`)
* **`top_p`**: Nucleus sampling: limits the generated guesses to a cumulative probability. (gets good results on a dataset with `top_p=0.9`)
* **`truncate`**: Truncates the input text until a given sequence, excluding that sequence (e.g. if `truncate='<|endoftext|>'`, the returned text will include everything before the first `<|endoftext|>`). It may be useful to combine this with a smaller `length` if the input texts are short.
*  **`include_prefix`**: If using `truncate` and `include_prefix=False`, the specified `prefix` will not be included in the returned text.
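
To make `temperature`, `top_k` and `top_p` more concrete, here is a small NumPy sketch of the general ideas applied to a toy distribution. It is an illustration under my own assumptions (function names and example numbers are hypothetical), not the code gpt-2-simple runs internally:

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; lower temperature gives a more peaked distribution."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled = scaled - scaled.max()          # subtract the max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalise; k=0 disables the filter."""
    probs = np.asarray(probs, dtype=float)
    if k <= 0 or k >= probs.size:
        return probs / probs.sum()
    keep = np.argsort(probs)[-k:]           # indices of the k largest probabilities
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]         # most probable first
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1   # number of tokens to keep
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

toy_probs = [0.5, 0.3, 0.15, 0.05]
print(top_k_filter(toy_probs, 2))    # only the two likeliest tokens survive
print(top_p_filter(toy_probs, 0.9))  # the least likely token is dropped
```

With the toy distribution, `top_k=2` leaves two candidate tokens, while `top_p=0.9` keeps three because the first two only cover 0.8 of the probability mass.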

In [None]:
gpt2.generate(sess,
              length=250,
              temperature=0.7,
              prefix="LORD",
              nsamples=5,
              batch_size=5
              )

LORD WILLOUGHBY:
That, by the way, Clarence and I have done good side by side;
And yet side we, and he side we have done ill.

KING RICHARD II:
Why then 'tis done ill. O, how should I ease it?
Side with him and my brother, my sovereign!
Side wither away, and as night falls,
Like to the farthest morning to my last,
Side wither away, and as morning comes,
Like to the furthest afternoon to my last!
Side wither away, and as our fortunes turn,
Like to the furthest afternoon to our last!

QUEEN MARGARET:
What is this? counsel? counsel!

KING RICHARD II:
My queen and my heir, for half a mile and a half
She will glide this way, to be or no.

QUEEN MARGARET:
So stands the orchard here, for half a mile and a!

KING RICHARD II:
So stands the orchard here, to fence it, to!
Fashion it in her, like the hedgehog's net
LORD STANLEY:
What if I told you, in the hope of succor,
That I had lain a little while in your arms?

DUKE OF YORK:
No doubt, my lord.

QUEEN ELIZABETH:
'Tis a pity I should be coil'd 
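
For the Twilight Zone model, the `[narration_start]` and `[narration_end]` markers added during pre-processing are natural choices for `prefix` and `truncate`. The sketch below is a hedged example of that idea: `generate_narrations` and `split_title_and_narration` are hypothetical helper names, the parameter values are illustrative, and the parsing helper accepts text with or without the markers still attached.

```python
START, END = '[narration_start]', '[narration_end]'

def split_title_and_narration(sample):
    """Split one generated sample into (title, narration)."""
    body = sample
    if body.startswith(START):
        body = body[len(START):]
    body = body.split(END, 1)[0]            # discard anything after the end marker
    # The training data puts the title on the first line, then the narration.
    title, _, narration = body.strip().partition('\n')
    return title.strip(), narration.strip()

def generate_narrations(sess, n=5):
    """Generate n narrations from run1 (needs gpt-2-simple and a loaded session)."""
    import gpt_2_simple as gpt2  # imported lazily; only needed when generating
    samples = gpt2.generate(sess,
                            run_name='run1',
                            prefix=START,
                            truncate=END,
                            include_prefix=False,
                            length=250,
                            temperature=0.7,
                            nsamples=n,
                            batch_size=n,
                            return_as_list=True)
    return [split_title_and_narration(s) for s in samples]

example = '[narration_start]The Long Road\r\nA man walks.[narration_end]leftover text'
print(split_title_and_narration(example))
```

Starting each sample at the start marker should nudge the model to begin with a title, and truncating at the end marker keeps a sample from running on into the next episode.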

For bulk generation, you can generate a large amount of text to a file and sort out the samples locally on your computer. The next cell will write the generated text to a file with a unique timestamp in its name.

You can rerun the cells as many times as you want for even more generated texts!

In [None]:
from datetime import datetime

gen_file = 'gpt2_gentext_{:%Y%m%d_%H%M%S}.txt'.format(datetime.utcnow())

gpt2.generate_to_file(sess,
                      destination_path=gen_file,
                      length=500,
                      temperature=0.7,
                      nsamples=100,
                      batch_size=20
                      )

In [None]:
from google.colab import files  # Colab-only helper for browser downloads
# may have to run twice to get file to download
files.download(gen_file)

## Generate Text From The Pretrained Model

If you want to generate text from the pretrained model, not a finetuned model, pass `model_name` to `gpt2.load_gpt2()` and `gpt2.generate()`.

This is currently the only way to generate text from the 774M or 1558M models with this notebook.

In [None]:
model_name = "774M"

gpt2.download_gpt2(model_name=model_name)

Fetching checkpoint: 1.05Mit [00:00, 354Mit/s]                                                      
Fetching encoder.json: 1.05Mit [00:00, 131Mit/s]                                                    
Fetching hparams.json: 1.05Mit [00:00, 279Mit/s]                                                    
Fetching model.ckpt.data-00000-of-00001: 3.10Git [00:23, 131Mit/s]                                  
Fetching model.ckpt.index: 1.05Mit [00:00, 380Mit/s]                                                
Fetching model.ckpt.meta: 2.10Mit [00:00, 226Mit/s]                                                 
Fetching vocab.bpe: 1.05Mit [00:00, 199Mit/s]                                                       


In [None]:
sess = gpt2.start_tf_sess()

gpt2.load_gpt2(sess, model_name=model_name)

W0828 18:37:58.571830 139905369159552 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py:1276: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.


Loading pretrained model models/774M/model.ckpt


In [None]:
gpt2.generate(sess,
              model_name=model_name,
              prefix="The secret of life is",
              length=100,
              temperature=0.7,
              top_p=0.9,
              nsamples=5,
              batch_size=5
              )

The secret of life is that it's really easy to make it complicated," said Bill Nye, the host of the popular science show "Bill Nye the Science Guy." "And this is one of the reasons why we all need to be smarter about science, because we can't keep up with the amazing things that are going on all the time."

While Nye is correct that "everything that's going on all the time" is making the world a better place, he misses the point. This is not
The secret of life is in the rhythm of the universe. It's not a mystery. It's not a mystery to me. It's the nature of the universe. It's the beauty of the universe. It's the way the universe works. It's the way the universe is. It's the way the universe is going to work. It's the way the universe is. It's the way the universe is. It's the way the universe is. It's the way the universe is. It's the way
The secret of life is in the universe.


-

The Red Devil

It's the end of the world as we know it, and the only thing that can save us is a band of 

# Etcetera

If the notebook has errors (e.g. GPU Sync Fail), force-kill the Colaboratory virtual machine and restart it with the command below:

In [None]:
!kill -9 -1

# LICENSE

MIT License

Copyright (c) 2019 Max Woolf

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.