<a href="https://colab.research.google.com/github/PhilChodrow/PIC16B/blob/master/lectures/tf/tf-5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Generation with Recurrent Neural Networks

In this set of lecture notes, we'll consider a new kind of machine learning task. Previously, we've focused on *classification* problems. In classification problems, the goal is to assign a given piece of data to one of several categories. Today, we'll instead consider a simple  *generation* problem. A *generative* model can be used to create "realistic" examples after it's been trained. Generative models are at the heart of machine learning topics like deepfakes, language modeling, and [style transfer](https://www.tensorflow.org/tutorials/generative/style_transfer).  



*Parts of these lecture notes were based on [this tutorial](https://keras.io/examples/generative/lstm_character_level_text_generation/). It is recommended to run the code contained in these notes in a Google Colab instance with GPU acceleration enabled.* 

In [1]:
import tensorflow as tf
from tensorflow.keras.layers.experimental import preprocessing
from tensorflow.keras import layers

import numpy as np
from matplotlib import pyplot as plt
import pandas as pd


## Our Task

Today, we are going to see whether we can teach an algorithm to understand and reproduce the pinnacle of cultural achievement; the benchmark against which all art is to be judged; the mirror that reveals to humanity its truest self. I speak, of course, about *Star Trek: Deep Space Nine.*

<figure class="image" style="width:100px">
  <img src="https://raw.githubusercontent.com/PhilChodrow/PIC16B/master/_images/DS9.jpg" alt="">
  <figcaption><i></i></figcaption>
</figure>

In particular, we are going to attempt to teach a neural network to generate *episode scripts*. This is a text generation task: after training, our hope is that our model will be able to create scripts that are reasonably realistic in their appearance. 


In [2]:
start_episode = 20 # Start in Season 2, Season 1 is not very good
num_episodes = 50  # only pick this many episodes to train on

url = "https://github.com/PhilChodrow/PIC16B/blob/master/datasets/star_trek_scripts.json?raw=true"
star_trek_scripts = pd.read_json(url)

cleaned = star_trek_scripts["DS9"].str.replace("\n\n\n\n\n\nThe Deep Space Nine Transcripts -", "")
cleaned = cleaned.str.split("\n\n\n\n\n\n\n").str.get(-2)
text = "\n\n".join(cleaned[start_episode:(start_episode + num_episodes)])
for char in ['\xa0', 'à', 'é', "}", "{"]:
    text = text.replace(char, "")

In [3]:
print(text[0:500])

  Last
time on Deep Space Nine.  
SISKO: This is the emblem of the Alliance for Global Unity. They call
themselves the Circle. 
O'BRIEN: What gives them the right to mess up our station? 
ODO: They're an extremist faction who believe in Bajor for the
Bajorans. 
SISKO: I can't loan you a Starfleet runabout without knowing where you
plan on taking it. 
KIRA: To Cardassia Four to rescue a Bajoran prisoner of war. 
(The prisoners are rescued.) 
KIRA: Come on. We have a ship waiting. 
JARO: What you 


Our first step, as usual, is data preparation. What we need to do is format the data in such a way that we can treat the situation as a classification problem after all. That is: 

> Given a string of text, predict the next character in that string. 

Doing this repeatedly will allow the model to generate large bodies of text. 

To do this, we want to split our data like so: 

```
predictor = "to boldly g"
target    = "o"
```

The following function will do this for us. The `max_len` argument gives the number of characters that should be in the predictor string, and the `step_size` argument lets us skip indices if we want to in order to decrease the size of the data. 

In [4]:
def split(raw_text, max_len, step_size = 1):

    lines = []
    next_chars = []

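    # slide a window of max_len characters across raw_text, step_size at a time
    for i in range(0, len(raw_text) - max_len, step_size):
        lines.append(raw_text[i:i+max_len])       # predictor string
        next_chars.append(raw_text[i+max_len])    # the character that follows it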
    
    return lines, next_chars

In [5]:
max_len = 20

lines, next_chars =  split(text, max_len = max_len, step_size = 5)
for i in range(10, 15):
    print(lines[i] + "     =>    " + next_chars[i])

he emblem of the All     =>    i
blem of the Alliance     =>     
of the Alliance for      =>    G
e Alliance for Globa     =>    l
iance for Global Uni     =>    t
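To make the windowing concrete, here is a minimal sketch on a toy string. The helper name `sliding_windows` is hypothetical; it mirrors the logic of `split` but returns the predictor and target together as pairs.

```python
def sliding_windows(raw_text, max_len, step_size=1):
    """Return (predictor, target) pairs: each predictor is max_len characters
    long, and each target is the single character that follows it."""
    pairs = []
    for i in range(0, len(raw_text) - max_len, step_size):
        pairs.append((raw_text[i:i + max_len], raw_text[i + max_len]))
    return pairs

# With an 11-character window there is exactly one pair in this toy string.
toy_pairs = sliding_windows("to boldly go", max_len=11)

# A shorter window with a larger step gives several overlapping examples.
for predictor, target in sliding_windows("to boldly go", max_len=4, step_size=3):
    print(repr(predictor), "=>", repr(target))
```

A larger `step_size` yields fewer, less redundant training examples, at the cost of never showing the model some of the possible windows.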


Our next step is to vectorize the characters. This is similar to the word vectorization task, but it's simple enough in this case that it's arguably more convenient to handle it outside of TensorFlow. It is also possible to handle vectorization using TensorFlow tools, as demonstrated in [this tutorial](https://www.tensorflow.org/tutorials/text/text_generation). 

In [6]:
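chars = sorted(set(text))
# map each character to its position in the sorted vocabulary
char_indices = {char : i for i, char in enumerate(chars)}
# X holds one one-hot row per character of each predictor string
X = np.zeros((len(lines), max_len, len(chars)))
# y holds the integer code of the character to be predicted
y = np.zeros((len(lines), 1), dtype = np.int32)
for i, line in enumerate(lines):
    for t, char in enumerate(line):
        X[i, t, char_indices[char]] = 1
    y[i] = char_indices[next_chars[i]]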

Let's take a look at what happened here: 

In [7]:
X.shape, y.shape

((314163, 20, 78), (314163, 1))

In [8]:
X[0]

array([[0., 1., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.]])

In [9]:
y[0]

array([42], dtype=int32)
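One way to convince ourselves that the encoding is lossless is to decode a one-hot array back into text. The sketch below does this on a toy vocabulary; the names `toy_text`, `one_hot`, and `decode` are hypothetical helpers and not part of the notebook.

```python
import numpy as np

toy_text = "abracadabra"
toy_chars = sorted(set(toy_text))                      # ['a', 'b', 'c', 'd', 'r']
toy_index = {c: i for i, c in enumerate(toy_chars)}    # character -> integer code

def one_hot(line, index):
    """Encode a string as a (len(line), vocabulary size) array of 0s and 1s."""
    encoded = np.zeros((len(line), len(index)))
    for t, c in enumerate(line):
        encoded[t, index[c]] = 1
    return encoded

def decode(encoded, vocabulary):
    """Invert one_hot by locating the single 1 in each row."""
    return "".join(vocabulary[j] for j in encoded.argmax(axis=1))

window = one_hot("abra", toy_index)
print(window.shape)               # (4, 5)
print(decode(window, toy_chars))  # abra
```

Applied to the arrays above, `decode(X[0], chars)` should reproduce `lines[0]`.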

Now we're ready to split the data into training and validation sets: 

In [10]:
train_len = int(0.7*X.shape[0])
X_train = X[0:train_len]
X_val = X[train_len:]

y_train = y[0:train_len]
y_val  = y[train_len:]

Model time! We'll use a simple *Long Short-Term Memory* (LSTM) model for this example. LSTMs are one example of *recurrent* neural network layers. Here's a diagram illustrating the schematic functioning of a recurrent layer. 

![](https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/RNN-unrolled.png)
*Image credit: [Chris Olah](https://colah.github.io/posts/2015-08-Understanding-LSTMs/), OpenAI*

On the lefthand side, we have a "zoomed out" picture of a recurrent neural network layer. On the righthand side, we see the "zoomed in" version. The key point here is that output $h_2$ depends not only on input $x_2$, but also, indirectly, on inputs $x_0$ and $x_1$. This means that recurrent neural networks are highly suitable for modeling processes that have temporal structure. Text is an example: the last few characters are the "history" of the text. Timeseries data are another clear example, and indeed, we can use a very similar workflow to the one we'll use today in order to do forecasting in timeseries. 
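To see this dependence on history in code, here is a minimal NumPy sketch of a plain recurrent layer. It is deliberately simpler than an LSTM and is not what Keras does internally: the weights `W_x`, `W_h`, and `b` are random stand-ins for learned parameters, and all of the names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, T = 3, 4, 5

# Random weights stand in for parameters that training would learn.
W_x = 0.5 * rng.normal(size=(hidden_dim, input_dim))
W_h = 0.5 * rng.normal(size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

def run_rnn(xs):
    """Apply h_t = tanh(W_x x_t + W_h h_{t-1} + b) along the sequence."""
    h = np.zeros(hidden_dim)
    outputs = []
    for x in xs:
        h = np.tanh(W_x @ x + W_h @ h + b)
        outputs.append(h)
    return np.array(outputs)

xs = rng.normal(size=(T, input_dim))
hs = run_rnn(xs)

# Perturb only the first input x_0, leaving x_1, x_2, ... untouched.
xs_changed = xs.copy()
xs_changed[0] += 1.0
hs_changed = run_rnn(xs_changed)

# h_2 moves even though x_2 is identical in both runs.
print(np.abs(hs[2] - hs_changed[2]).max())
```

The hidden state `h` is the only channel through which earlier inputs reach later outputs, which is the loop drawn on the lefthand side of the diagram.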

Since training for this kind of task gets expensive fast, we'll use just one LSTM layer followed by a `Dense` output layer. 

In [11]:
model = tf.keras.models.Sequential([
    layers.LSTM(128, name = "LSTM", input_shape=(max_len, len(chars))),
    layers.Dense(len(chars), activation = "softmax")        
])

In [12]:
model.compile(loss = tf.keras.losses.SparseCategoricalCrossentropy(), 
              optimizer = "adam")
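Even this small model has a fair number of parameters. As a rough check, we can count them by hand, assuming the standard Keras LSTM layout (four internal transforms, each with an input kernel, a recurrent kernel, and a bias) and the vocabulary size of 78 read off from `X.shape` above. The function names are hypothetical.

```python
def lstm_param_count(input_dim, units):
    # Four internal transforms (three gates plus the candidate cell state),
    # each with an input kernel, a recurrent kernel, and a bias vector.
    return 4 * (units * (input_dim + units) + units)

def dense_param_count(input_dim, units):
    # One weight per (input, output) pair plus one bias per output.
    return input_dim * units + units

vocab_size = 78    # len(chars), as reported by X.shape
lstm_units = 128

n_lstm = lstm_param_count(vocab_size, lstm_units)
n_dense = dense_param_count(lstm_units, vocab_size)
print(n_lstm, n_dense, n_lstm + n_dense)  # 105984 10062 116046
```

Calling `model.summary()` in the notebook should report matching counts if these assumptions hold.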

Time for training. The first call below is the code I used to train the model for 200 epochs and save it. For a quick check that the model is set up correctly, the shorter commented-out run is enough. 

In [14]:
# code I used to train and save the model
model.fit(X_train, 
          y_train,
          validation_data= (X_val, y_val),
          batch_size=128, epochs = 200)
model.save('DS9_model') 

# model.fit(X_train, 
#           y_train,
#           validation_data= (X_val, y_val),
#           batch_size=128, epochs = 20)

Epoch 1/200
Epoch 2/200
...
Epoch 77/200
Epoch 78



INFO:tensorflow:Assets written to: DS9_model/assets


Rather than train the entire model live during lecture, I'm going to load a saved model that I previously trained for 200 epochs on Google Colab. On Colab, each epoch takes around 10 seconds, so 200 epochs corresponds to roughly 30 minutes. 

In [15]:
model = tf.keras.models.load_model('DS9_model')

Generative models define *probability distributions* over the space of possible outputs. So, our overall algorithm is going to generate new text in a partially randomized way. To make this happen, we define a `sample` function which will take the model outputs, turn them into probabilities, and then sample from the probabilities to produce a single character (well, technically, an integer corresponding to a single character). 

An important parameter here is the so-called *temperature* (this terminology comes from statistical physics). When the temperature is high, the model will more frequently choose low-probability characters. This is sometimes interpreted as "creativity," and leads to more unpredictable outputs. When the temperature is low, on the other hand, the model will "play it safe" and tend to stick to known patterns. In the extreme limiting case as the temperature approaches 0, the model will ultimately get stuck in "loops" in which it repeats common phrases over and over again. 

In [16]:
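def sample(preds, temp):
    # preds: the model's softmax output, one probability per character
    preds = np.asarray(preds).astype("float64")
    # reweight by temperature: small temp sharpens the distribution, large temp flattens it
    probs = np.exp(preds/temp)
    probs = probs / probs.sum()
    # draw a single one-hot sample and return the index of the chosen character
    samp = np.random.multinomial(1, probs, 1)
    return np.argmax(samp)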
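To get a feel for what the temperature does, it helps to look at a made-up distribution. The sketch below uses a common alternative formulation that rescales the *log* of the probabilities, so that a temperature of 1 leaves the distribution unchanged. This differs from `sample` above, which exponentiates the probabilities directly. The name `rescale` and the toy probabilities are assumptions for illustration.

```python
import numpy as np

def rescale(probs, temp):
    """Return probabilities proportional to probs ** (1 / temp)."""
    logits = np.log(np.asarray(probs, dtype="float64")) / temp
    logits -= logits.max()            # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

toy_probs = np.array([0.5, 0.3, 0.2])   # made-up next-character probabilities

for temp in [0.1, 1.0, 10.0]:
    print(temp, np.round(rescale(toy_probs, temp), 3))
```

At low temperature, almost all of the mass moves onto the most likely character. At high temperature, the three probabilities become nearly equal.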

Now that we know how to sample from the model predictions to create a new character, let's define a convenient function that creates entire strings of a specified length using this process. 

In [17]:
def generate_string(seed_index, temp, gen_length, model): 

    # one-hot array holding the seed followed by room for the generated characters
    gen_seq = np.zeros((max_len + gen_length, len(chars)))
    seed = X[seed_index]
    gen_seq[0:max_len] = seed
    
    gen_text = lines[seed_index]

    for i in range(0, gen_length):
        # predict from the most recent max_len characters
        window = gen_seq[i: i + max_len]
        preds = model.predict(np.array([window]))[0]
        next_index = sample(preds, temp)
        # record the sampled character so that later windows include it
        gen_seq[max_len + i, next_index] = 1

        next_char = chars[next_index]
        gen_text += next_char

    return gen_text

Let's try it out! 

In [18]:
gen_length = 500
seed_index = 10000

for temp in [0.01, 0.02, 0.03, 0.04, 0.05]:

    gen = generate_string(seed_index, temp, gen_length, model)

    print(4*"-")
    print("TEMPERATURE: " + str(temp))
    print(gen[:-gen_length], end="")
    print(" => ", end = "")
    print(gen[-gen_length:], "")

----
TEMPERATURE: 0.01
tioning. 
KIRA: No p => reficies and not for the first time to be respect thing? 
O'BRIEN: Yes, I don't you are no inside of the Federation. 
KIRA [OC]: Now gives there. 
DAX: It wanted to talk down the wormhole. 
DAX: I do be a missed the station want to do their programme was at the saying isnevering their componeeration and for the death. The right in the put me too. 
QUARK: Well, you're all the patting to have any of your facerion. 
KIRA: I was all minutes again. 
ODO: I stouth it is the first family too of look. 
 
----
TEMPERATURE: 0.02
tioning. 
KIRA: No p => refitian systems who asked to like a trictle the oll find of a way I guess weapons. 
O'BRIEN: Yes, I was see. He's a do ranaboly was again. 
ROM: No. Now, you wouldn't realise you was an a homanal offecture. 
KIRA: No, it could take reaching around to think the oursers him out of the sent has been coming matiat on the same trans of the shot to Bajor. 
KIRA: I'm sorry, I don't know what you're tellod h

Let's make a few observations. 

1. First of all, generating text with our model can be surprisingly slow. This is because we have to call the `predict()` method *for each character*, so that the model appropriately takes its recent predictions into account. 
2. Second, determining a good value for the temperature can take some experimentation. Note that low temperatures don't necessarily correspond to "more realistic" text -- they just correspond to highlighting common patterns in the text, possibly in excess. Higher temperatures also don't necessarily correspond to a "creative" algorithm in any normal sense of the word -- set the temperature too high, and you'll just get gibberish. 

## Specialization

In this case, we were able to create a model for generating Star Trek scripts using an instance of Google Colab in roughly 30 minutes. This model is highly limited. Although it has clearly learned some relevant features of Star Trek scripts, there's no way that you'd mistake the output of the model for an actual script by a screenwriter. Considering how hard this was, imagine how much effort and computational resources are required to create more general language models! Indeed, as highlighted in a [recent and controversial paper](https://faculty.washington.edu/ebender/papers/Stochastic_Parrots.pdf), training large language models in this day and age can require energy expenditure comparable to a trans-American flight! 