<a href="https://colab.research.google.com/github/TD1138/GPT2---Twilight-Zone-Narrations/blob/main/notebooks/GPT_2_Retrain_Twilight_Zone.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Retraining GPT-2 to output Twilight Zone Narrations!

by Tom Devine


The Twilight Zone is one of my favourite shows!

Created by Rod Serling, this 1950s show was groundbreaking for its era, mixing science fiction and fantasy with tales of morality. It has gone on to influence thousands of creators since, and recent shows such as Black Mirror owe a huge debt to this anthology series!

Rod Serling, alongside show-running and plot duties, narrated each episode, providing an introduction to each story, as well as a closing narration, often highlighting the moral aspect of the story.

It is these narrations that I'll be retraining GPT-2 to output!

Let's import the standard packages:

In [1]:
import pandas as pd
import numpy as np
import requests

# Text Pre-processing

Originally I was going to scrape these narrations from Wikipedia, but I managed to obtain a copy of them from an Angelfire website!
https://www.angelfire.com/fl/twilightzonelist/openingnarrations.html

Let's import the dataset:

In [2]:
url = 'https://raw.githubusercontent.com/TD1138/GPT2---Twilight-Zone-Narrations/main/text_files/all_opening_narrations.txt'
resp = requests.get(url)
tz_file = resp.text
tz_file[:666]

"1) Where is Everybody?\r\nThe place is here, the time is now, and the journey into the shadows that we're about to watch could be our journey.\r\n\r\n2) One for the Angels\r\nStreet scene: summer. The present. Man on a sidewalk named Lew Bookman, age sixtyish. Occupation: pitchman. Lew Bookman, a fixture of the summer, a rather minor component to a hot July, a nondescript, commonplace little man whose life is a treadmill built out of sidewalks. In just a moment, Lew Bookman will have to concern himself with survival, because as of three o'clock this hot July afternoon he'll be stalked by Mr. Death.\r\n\r\n3) Mr. Denton on Doomsday\r\nPortrait of a town drunk named Al Dent"

We need to split this file into different episodes.
It looks like each episode is followed by an extra line break, so we can split on the double line break '\r\n\r\n'.
Let's test this:

In [3]:
tz_file.split('\r\n\r\n')[:10]

["1) Where is Everybody?\r\nThe place is here, the time is now, and the journey into the shadows that we're about to watch could be our journey.",
 "2) One for the Angels\r\nStreet scene: summer. The present. Man on a sidewalk named Lew Bookman, age sixtyish. Occupation: pitchman. Lew Bookman, a fixture of the summer, a rather minor component to a hot July, a nondescript, commonplace little man whose life is a treadmill built out of sidewalks. In just a moment, Lew Bookman will have to concern himself with survival, because as of three o'clock this hot July afternoon he'll be stalked by Mr. Death.",
 "3) Mr. Denton on Doomsday\r\nPortrait of a town drunk named Al Denton. This is a man who's begun his dying early - a long, agonizing route through a maze of bottles. Al Denton, who would probably give an arm or a leg or a part of his soul to have another chance, to be able to rise up and shake the dirt from his body and the bad dreams that infest his consciousness. In the parlance of the 

Looks like this gave us what we need!
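
The split above relies on the file using Windows-style `\r\n` line endings. As a minimal sketch, assuming the same blank-line-separated layout, a regular expression can split on blank lines regardless of the line-ending style. The helper name `split_episodes` and the sample string are illustrative and not part of the original notebook.

```python
import re

def split_episodes(raw_text):
    """Split raw text into episode blocks separated by one or more blank lines."""
    # \r?\n matches both Unix and Windows line endings; {2,} requires at least
    # two consecutive line breaks, i.e. at least one blank line.
    blocks = re.split(r'(?:\r?\n){2,}', raw_text.strip())
    # Drop any blocks that are empty after stripping whitespace.
    return [block for block in blocks if block.strip()]

# Made-up sample in the same shape as the narrations file.
sample = "1) First Title\r\nFirst narration.\r\n\r\n2) Second Title\r\nSecond narration.\r\n"
episodes = split_episodes(sample)
print(episodes)
```

The single line break between a title and its narration is left alone, because the pattern only matches two or more consecutive line breaks.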

Let's also remove the episode numbers in this step.

Note that we're leaving the line break after the episode title - hopefully our model will be able to come up with its own titles as well as narrations!
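
The number removal can also be sketched with a regular expression anchored at the start of each entry, which leaves any parentheses later in a narration untouched. This is only an illustrative alternative to the cell that follows; `strip_episode_number` and the second sample entry are made up for the example.

```python
import re

def strip_episode_number(entry):
    """Remove a leading 'N) ' episode number, leaving the rest untouched."""
    # ^\d+\)\s* matches digits, a closing parenthesis and any following
    # whitespace, but only at the very start of the entry.
    return re.sub(r'^\d+\)\s*', '', entry, count=1)

examples = [
    "1) Where is Everybody?\r\nThe place is here, the time is now.",
    # Hypothetical entry, used only to show that later parentheses survive.
    "12) A Made-Up Title\r\nA narration (with an aside) in the middle.",
]
cleaned = [strip_episode_number(e) for e in examples]
print(cleaned)
```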

In [4]:
tz_list = tz_file.split('\r\n\r\n')
# Split only on the first ') ' so that parentheses later in a narration are kept
tz_list = [entry.split(') ', 1)[1] for entry in tz_list]
tz_list[:10]

["Where is Everybody?\r\nThe place is here, the time is now, and the journey into the shadows that we're about to watch could be our journey.",
 "One for the Angels\r\nStreet scene: summer. The present. Man on a sidewalk named Lew Bookman, age sixtyish. Occupation: pitchman. Lew Bookman, a fixture of the summer, a rather minor component to a hot July, a nondescript, commonplace little man whose life is a treadmill built out of sidewalks. In just a moment, Lew Bookman will have to concern himself with survival, because as of three o'clock this hot July afternoon he'll be stalked by Mr. Death.",
 "Mr. Denton on Doomsday\r\nPortrait of a town drunk named Al Denton. This is a man who's begun his dying early - a long, agonizing route through a maze of bottles. Al Denton, who would probably give an arm or a leg or a part of his soul to have another chance, to be able to rise up and shake the dirt from his body and the bad dreams that infest his consciousness. In the parlance of the times, th

Let's add an identifier to the start and end of each narration, so the model knows where to start and end:

In [5]:
start_text = '[narration_start]'
end_text = '[narration_end]'

tz_list = [start_text+title+end_text for title in tz_list]
tz_list[:10]

["[narration_start]Where is Everybody?\r\nThe place is here, the time is now, and the journey into the shadows that we're about to watch could be our journey.[narration_end]",
 "[narration_start]One for the Angels\r\nStreet scene: summer. The present. Man on a sidewalk named Lew Bookman, age sixtyish. Occupation: pitchman. Lew Bookman, a fixture of the summer, a rather minor component to a hot July, a nondescript, commonplace little man whose life is a treadmill built out of sidewalks. In just a moment, Lew Bookman will have to concern himself with survival, because as of three o'clock this hot July afternoon he'll be stalked by Mr. Death.[narration_end]",
 "[narration_start]Mr. Denton on Doomsday\r\nPortrait of a town drunk named Al Denton. This is a man who's begun his dying early - a long, agonizing route through a maze of bottles. Al Denton, who would probably give an arm or a leg or a part of his soul to have another chance, to be able to rise up and shake the dirt from his body a

Finally, let's write it out to the local drive to be read into GPT-2:

In [6]:
file_name = 'tz_opening_narrations.txt'
tz_out = '\n'.join(tz_list)
with open(file_name, 'w') as f:
  f.write(tz_out)
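
Once the file is written, the start and end markers make it straightforward to recover the individual records again. The following is a minimal sketch and not part of the original notebook: `parse_records`, `demo_list`, and the temporary file path are illustrative names, and the sample records are made up.

```python
import os
import re
import tempfile

START, END = '[narration_start]', '[narration_end]'

def parse_records(text):
    """Return (title, narration) pairs found between the start and end markers."""
    # re.escape is needed because '[' and ']' are special in regular expressions;
    # re.DOTALL lets '.' span the line break between title and narration.
    pattern = re.escape(START) + r'(.*?)' + re.escape(END)
    records = []
    for body in re.findall(pattern, text, flags=re.DOTALL):
        # The first line is the title, everything after it is the narration.
        title, _, narration = body.partition('\n')
        records.append((title.strip(), narration.strip()))
    return records

# Illustrative records only, in the same shape as tz_list.
demo_list = [
    START + 'Title One\r\nFirst narration.' + END,
    START + 'Title Two\r\nSecond narration.' + END,
]

path = os.path.join(tempfile.mkdtemp(), 'demo_narrations.txt')
# newline='' disables newline translation, so the text is written and read
# back exactly as it is held in memory.
with open(path, 'w', encoding='utf-8', newline='') as f:
    f.write('\n'.join(demo_list))
with open(path, encoding='utf-8', newline='') as f:
    parsed = parse_records(f.read())

print(parsed)
```

A check like `len(parsed) == len(demo_list)` is a cheap way to confirm that no record lost one of its markers on the way to disk.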

Next, let's install the library we'll use to work with GPT-2.

I'm using a great implementation of the model called GPT-2 Simple, made by Max Woolf (aka minimaxir on GitHub).

See details on the implementation here:
https://github.com/minimaxir/gpt-2-simple

In [7]:
%tensorflow_version 1.x
!pip install -q gpt-2-simple
import gpt_2_simple as gpt2

TensorFlow 1.x selected.
  Building wheel for gpt-2-simple (setup.py) ... done
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.



# Downloading GPT-2

Now we need to download the actual GPT-2 model.

I'm using the medium version of the model (355M)

In [8]:
model_version = '355M'

In [9]:
gpt2.download_gpt2(model_name=model_version)

Fetching checkpoint: 1.05Mit [00:00, 310Mit/s]                                                      
Fetching encoder.json: 1.05Mit [00:00, 84.4Mit/s]                                                   
Fetching hparams.json: 1.05Mit [00:00, 389Mit/s]                                                    
Fetching model.ckpt.data-00000-of-00001: 1.42Git [00:05, 250Mit/s]                                  
Fetching model.ckpt.index: 1.05Mit [00:00, 334Mit/s]                                                
Fetching model.ckpt.meta: 1.05Mit [00:00, 136Mit/s]                                                 
Fetching vocab.bpe: 1.05Mit [00:00, 140Mit/s]                                                       


## Finetuning the GPT-2 model

Here we take the pre-trained GPT-2 model and refine it with our Twilight Zone narrations!

The model checkpoints are saved in `checkpoint/run1` by default, relative to the working directory. A checkpoint is written every 500 steps (configurable via `save_every`), when training finishes, and when the cell is stopped.

In [10]:
sess = gpt2.start_tf_sess()

gpt2.finetune(sess,
              dataset=file_name,
              model_name=model_version,
              steps=100,
              restore_from='fresh',
              run_name='run1',
              print_every=10,
              sample_every=200,
              save_every=500
              )

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Instructions for updating:
Please use tensorflow.python.ops.op_selector.get_backward_walk_ops.
Loading checkpoint models/355M/model.ckpt
INFO:tensorflow:Restoring parameters from models/355M/model.ckpt


100%|██████████| 1/1 [00:00<00:00,  5.17it/s]

Loading dataset...
dataset has 20581 tokens
Training...





[10 | 23.75] loss=3.20 avg=3.20
[20 | 38.63] loss=2.99 avg=3.10
[30 | 53.62] loss=2.02 avg=2.74
[40 | 68.74] loss=1.53 avg=2.43
[50 | 83.98] loss=0.93 avg=2.12
[60 | 99.34] loss=0.67 avg=1.88
[70 | 114.82] loss=2.22 avg=1.93
[80 | 130.39] loss=0.34 avg=1.72
[90 | 146.09] loss=0.31 avg=1.56
[100 | 161.83] loss=0.21 avg=1.42
Saving checkpoint/run1/model-100
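
The progress lines above follow the pattern `[step | time] loss=... avg=...`, where the second field appears to be the elapsed time in seconds. As a rough sketch, assuming the output has been captured as plain text, the numbers can be pulled out for plotting or inspection. `parse_training_log` is a hypothetical helper, and the sample lines are copied from the output above.

```python
import re

# One log line looks like: [10 | 23.75] loss=3.20 avg=3.20
LOG_LINE = re.compile(r'\[(\d+) \| ([\d.]+)\] loss=([\d.]+) avg=([\d.]+)')

def parse_training_log(log_text):
    """Extract (step, seconds, loss, avg) tuples from the printed progress lines."""
    rows = []
    for match in LOG_LINE.finditer(log_text):
        step, seconds, loss, avg = match.groups()
        rows.append((int(step), float(seconds), float(loss), float(avg)))
    return rows

log_text = """
[10 | 23.75] loss=3.20 avg=3.20
[60 | 99.34] loss=0.67 avg=1.88
[70 | 114.82] loss=2.22 avg=1.93
[100 | 161.83] loss=0.21 avg=1.42
"""
rows = parse_training_log(log_text)

# Flag logged steps where the loss rose compared with the previous logged step.
spikes = [cur[0] for prev, cur in zip(rows, rows[1:]) if cur[2] > prev[2]]
print(rows)
print(spikes)
```

Individual step losses are noisy, as the jump at step 70 shows, so the running `avg` column is usually the better signal of overall progress.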


Now that our model is retrained, let's save it to Google Drive:

In [11]:
gpt2.mount_gdrive()

gpt2.copy_checkpoint_to_gdrive(run_name='run1')

Mounted at /content/drive


## Loading our Model

This is the reverse of the last step: it copies the saved checkpoint back from Google Drive. Now that the model is trained, we can start from here in future sessions!

In [4]:
gpt2.mount_gdrive()
gpt2.copy_checkpoint_from_gdrive(run_name='run1')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Let's start up the TensorFlow session and load our retrained GPT-2 model:

In [5]:
sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess, run_name='run1')

Loading checkpoint checkpoint/run1/model-1000
INFO:tensorflow:Restoring parameters from checkpoint/run1/model-1000


## Narration Generation!

Let's generate some Twilight Zone narrations!

If you're creating an API based on your model and need to pass the generated text elsewhere, you can do `text = gpt2.generate(sess, return_as_list=True)[0]`

You can also pass in a `prefix` to the generate function to force the text to start with a given character sequence and generate text from there (good if you add an indicator when the text starts).

You can also generate multiple texts at a time by specifying `nsamples`. Unique to GPT-2, you can pass a `batch_size` to generate multiple samples in parallel, giving a massive speedup (in Colaboratory, set a maximum of 20 for `batch_size`).

Other optional-but-helpful parameters for `gpt2.generate` and friends:

*  **`length`**: Number of tokens to generate (default 1023, the maximum)
* **`temperature`**: The higher the temperature, the crazier the text (default 0.7, recommended to keep between 0.7 and 1.0)
* **`top_k`**: Limits the generated guesses to the top *k* guesses (default 0 which disables the behavior; if the generated output is super crazy, you may want to set `top_k=40`)
* **`top_p`**: Nucleus sampling: limits the generated guesses to a cumulative probability. (gets good results on a dataset with `top_p=0.9`)
* **`truncate`**: Truncates the input text until a given sequence, excluding that sequence (e.g. if `truncate='<|endoftext|>'`, the returned text will include everything before the first `<|endoftext|>`). It may be useful to combine this with a smaller `length` if the input texts are short.
*  **`include_prefix`**: If using `truncate` and `include_prefix=False`, the specified `prefix` will not be included in the returned text.
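
To make `temperature`, `top_k` and `top_p` more concrete, here is a small NumPy sketch of the general ideas applied to a toy set of scores. It is an illustration only and not gpt-2-simple's actual implementation; `filter_probs` and the toy `logits` are made up for the example.

```python
import numpy as np

def filter_probs(logits, temperature=0.7, top_k=0, top_p=0.0):
    """Turn raw scores into sampling probabilities after temperature, top-k and top-p filtering."""
    # Lower temperatures sharpen the distribution, higher ones flatten it.
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    # Softmax, shifted by the max for numerical stability.
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    if top_k > 0:
        # Zero out everything below the k-th largest probability.
        k = min(top_k, len(probs))
        kth_largest = np.sort(probs)[-k]
        probs = np.where(probs >= kth_largest, probs, 0.0)

    if top_p > 0.0:
        # Keep the smallest set of tokens whose cumulative probability reaches top_p.
        order = np.argsort(probs)[::-1]
        cumulative = np.cumsum(probs[order])
        cutoff = np.searchsorted(cumulative, top_p * probs.sum()) + 1
        mask = np.zeros(len(probs), dtype=bool)
        mask[order[:cutoff]] = True
        probs = np.where(mask, probs, 0.0)

    # Renormalise so the surviving tokens sum to 1.
    return probs / probs.sum()

logits = [2.0, 1.0, 0.5, -1.0]  # toy scores for a four-token vocabulary
probs = filter_probs(logits, temperature=0.7, top_k=2)

# Draw one token index from the filtered distribution.
rng = np.random.default_rng(0)
token = rng.choice(len(probs), p=probs)
print(probs, token)
```

With `top_k=2`, only the two highest-scoring tokens can ever be drawn, which is why a small `top_k` tames very erratic output.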

In [16]:
gpt2.generate(sess,
              length=500,
              temperature=0.7,
              prefix='[narration_start]',
              truncate='[narration_end]',
              include_prefix=False,
              nsamples=5,
              batch_size=5
              )

It's You, Johnny
You're sitting at a bar in the dark, a lonely, wind-whipped watering hole in the country of Twilight. And this is Mr. Johnny Valianto, a man of few words and few places to be found. But you've found him. In just a moment, he'll show you what he's really made of. Johnny Valianto, the outlaw, trapped in the Twilight Zone.
Narration No. 5: The Last Flight
Submitted for your approval or at least your analysis: the disappearance of forty-five minutes on a desert planet, vast and unknown to man. Forty-five minutes on which the odds are against man, probably against himself. In just a moment we'll begin to read the books, because these two men are Neil Armstrong and Buzz Aldrin. They're on their way home from the Moon. And they're about to discover that not everything that is seen is as it appears. Not everything that is meant to be is as it appears. Not everything that comes to mind is as it appears. As you've perceived, this is the Twilight Zone.
The Mighty Casey
You're wat