# Chapter 6: Text

## Long Short-Term Memory Networks

A _long short-term memory_ (LSTM) layer is a particular type of layer used in _recurrent neural networks_ (RNNs), which are neural networks that process sequential data divided into discrete steps along a "time" dimension. An LSTM network is an RNN that contains at least one LSTM layer.

A vanilla recurrent layer computes its output for each timestep, $\mathbf{h}_{(t)}$, using the values of the input sequence at the current timestep, $\mathbf{x}_{(t)}$, and the output of the layer at the previous timestep, $\mathbf{h}_{(t-1)}$. An LSTM cell also maintains a state vector for each timestep, $\mathbf{c}_{(t)}$. It combines the previous state, $\mathbf{c}_{(t-1)}$, with $\mathbf{h}_{(t-1)}$ and $\mathbf{x}_{(t)}$ to compute the new cell state and output at each timestep. Below is a diagram of an LSTM cell.

<img width="600" src="https://camo.githubusercontent.com/c433cc6abd96207bc5e01ad4253026152785b9f9/68747470733a2f2f692e696d6775722e636f6d2f434f32554e4c5a2e706e67">
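
For reference, the standard LSTM update can be written out as equations. The gate symbols $\mathbf{i}_{(t)}$, $\mathbf{f}_{(t)}$, $\mathbf{o}_{(t)}$, the candidate vector $\mathbf{g}_{(t)}$, and the weight and bias names are notation assumed here for illustration and may not match the labels in the diagram above. $\sigma$ is the logistic function and $\otimes$ is element-wise multiplication.

$$
\begin{aligned}
\mathbf{i}_{(t)} &= \sigma\left(\mathbf{W}_{xi}^\top \mathbf{x}_{(t)} + \mathbf{W}_{hi}^\top \mathbf{h}_{(t-1)} + \mathbf{b}_i\right) \\
\mathbf{f}_{(t)} &= \sigma\left(\mathbf{W}_{xf}^\top \mathbf{x}_{(t)} + \mathbf{W}_{hf}^\top \mathbf{h}_{(t-1)} + \mathbf{b}_f\right) \\
\mathbf{o}_{(t)} &= \sigma\left(\mathbf{W}_{xo}^\top \mathbf{x}_{(t)} + \mathbf{W}_{ho}^\top \mathbf{h}_{(t-1)} + \mathbf{b}_o\right) \\
\mathbf{g}_{(t)} &= \tanh\left(\mathbf{W}_{xg}^\top \mathbf{x}_{(t)} + \mathbf{W}_{hg}^\top \mathbf{h}_{(t-1)} + \mathbf{b}_g\right) \\
\mathbf{c}_{(t)} &= \mathbf{f}_{(t)} \otimes \mathbf{c}_{(t-1)} + \mathbf{i}_{(t)} \otimes \mathbf{g}_{(t)} \\
\mathbf{h}_{(t)} &= \mathbf{o}_{(t)} \otimes \tanh\left(\mathbf{c}_{(t)}\right)
\end{aligned}
$$

The forget gate $\mathbf{f}_{(t)}$ decides how much of the previous cell state to keep, the input gate $\mathbf{i}_{(t)}$ decides how much of the candidate $\mathbf{g}_{(t)}$ to add, and the output gate $\mathbf{o}_{(t)}$ decides how much of the cell state is exposed as $\mathbf{h}_{(t)}$.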

For an in-depth discussion of RNNs, see [`RecurrentNeuralNetworks.ipynb`](https://github.com/DCtheTall/hands-on-machine-learning/blob/master/chapter14/RecurrentNeuralNetworks.ipynb) in my GitHub repository, which contains my implementations of _Hands-On Machine Learning with Scikit-Learn and TensorFlow_ by Aurélien Géron.

## Downloading the Data

## Tokenization

The first step is to download the data and to _tokenize_ the text, i.e. split it into individual units. In this case, we will separate the text into lowercase words, with each punctuation mark treated as its own token.



In [1]:
!wget http://www.gutenberg.org/cache/epub/11339/pg11339.txt && \
  mv pg11339.txt aesop.txt

--2020-04-26 18:40:56--  http://www.gutenberg.org/cache/epub/11339/pg11339.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 243023 (237K) [text/plain]
Saving to: ‘pg11339.txt’


2020-04-26 18:41:10 (498 KB/s) - ‘pg11339.txt’ saved [243023/243023]



In [0]:
import re
from tensorflow.keras.preprocessing.text import Tokenizer


filename = 'aesop.txt'
with open(filename, encoding='utf-8-sig') as f:
  text = f.read()

seq_length = 20
start_story = '| ' * seq_length

start = text.find("THE FOX AND THE GRAPES\n\n\n")
end = text.find("ILLUSTRATIONS\n\n\n[")

text = text[start:end]
text = text.lower()
text = start_story + text
text = text.replace('\n\n\n\n\n', start_story)
text = text.replace('\n', ' ')
text = re.sub('  +', '. ', text).strip()
text = text.replace('..', '.')
text = text.replace('..', '.')
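# Pad every punctuation mark with spaces so it becomes its own token.
text = re.sub(r'([!"#$%&()*+,-./:;<=>?@[\]^_`{|}~])', r' \1 ', text)
# Collapse runs of whitespace into a single space.
text = re.sub(r'\s{2,}', ' ', text)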

tokenizer = Tokenizer(char_level=False, filters='')
tokenizer.fit_on_texts([text])
total_words = len(tokenizer.word_index) + 1
tokens_list = tokenizer.texts_to_sequences([text])[0]

In [3]:
len(text)

213714

In [4]:
total_words

4169

In [5]:
text[:1000]

' | | | | | | | | | | | | | | | | | | | | the fox and the grapes . a hungry fox saw some fine bunches of grapes hanging from a vine that was trained along a high trellis , and did his best to reach them by jumping as high as he could into the air . but it was all in vain , for they were just out of reach : so he gave up trying , and walked away with an air of dignity and unconcern , remarking , " i thought those grapes were ripe , but i see now they are quite sour . " | | | | | | | | | | | | | | | | | | | | the goose that laid the golden eggs . a man and his wife had the good fortune to possess a goose which laid a golden egg every day . lucky though they were , they soon began to think they were not getting rich fast enough , and , imagining the bird must be made of gold inside , they decided to kill it in order to secure the whole store of precious metal at once . but when they cut it open they found it was just like any other goose . thus , they neither got rich all at once , as the

In [6]:
', '.join(str(t) for t in tokens_list[:1000])

'1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 56, 4, 3, 940, 5, 6, 382, 56, 94, 77, 216, 1557, 9, 940, 941, 62, 6, 581, 20, 12, 2226, 162, 6, 359, 2227, 2, 4, 158, 11, 250, 7, 383, 35, 29, 1176, 25, 359, 25, 10, 88, 55, 3, 582, 5, 19, 16, 12, 37, 14, 785, 2, 17, 23, 47, 96, 43, 9, 383, 30, 28, 10, 170, 36, 425, 2, 4, 426, 89, 21, 57, 582, 9, 1558, 4, 2228, 2, 1559, 2, 8, 18, 144, 260, 940, 47, 1177, 2, 19, 18, 90, 115, 23, 63, 360, 2229, 5, 8, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1178, 20, 519, 3, 659, 660, 5, 6, 66, 4, 11, 520, 32, 3, 91, 384, 7, 2230, 6, 1178, 48, 519, 6, 659, 1179, 159, 75, 5, 942, 385, 23, 47, 2, 23, 171, 126, 7, 204, 23, 47, 45, 386, 474, 361, 177, 2, 4, 2, 2231, 3, 229, 186, 34, 109, 9, 786, 427, 2, 23, 943, 7, 237, 16, 14, 309, 7, 944, 3, 583, 584, 9, 2232, 2233, 24, 78, 5, 19, 26, 23, 387, 16, 787, 23, 122, 16, 12, 96, 103, 132, 95, 1178, 5, 261, 2, 23, 945, 86, 474, 37, 24, 78, 2, 25, 23, 32, 1560, 2, 788, 946, 132, 
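
To make the mapping from words to integers concrete, here is a minimal pure-Python sketch of frequency-ranked word indexing on a toy sentence. It only mirrors the general idea behind `Tokenizer` (index 0 is left unused and more frequent tokens receive smaller indices, which is why `|` and `.` have small indices above). It is not the Keras implementation, and `build_word_index`, `to_sequence`, and the `demo_*` names are hypothetical.

```python
import re
from collections import Counter


def build_word_index(text):
  """Map each whitespace-separated token to an integer, most frequent first."""
  counts = Counter(text.split())
  # Index 0 is left unused so that it can serve as a padding value.
  return {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}


def to_sequence(text, word_index):
  """Convert text to a list of integer tokens, skipping unknown words."""
  return [word_index[w] for w in text.split() if w in word_index]


demo_text = 'The fox saw the grapes. The grapes were sour.'.lower()
# Pad punctuation with spaces so each mark becomes its own token.
demo_text = re.sub(r'([.,!?;:"])', r' \1 ', demo_text)
demo_text = re.sub(r'\s{2,}', ' ', demo_text).strip()

demo_index = build_word_index(demo_text)
demo_tokens = to_sequence(demo_text, demo_index)
print(demo_text)
print(demo_index)
print(demo_tokens)
```

Here the most frequent token, `the`, receives index 1, and the full stop is indexed like any other word.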

### Building the Dataset

We want to use the LSTM network to predict the next word in the sequence. In order to do so, we will train the network with sequences of 20 words. The output of each sequence is the subsequent word.

Below is code which generates the dataset from the tokens list constructed above.

In [7]:
import numpy as np
from tensorflow.keras.utils import to_categorical


def generate_sequences(tokens_list, step, total_words):
  """Generate a dataset from a tokenized body of text."""
  X, y = [], []
  # Slide a window of seq_length tokens over the text; the token that
  # follows each window is its target.
  for i in range(0, len(tokens_list) - seq_length, step):
    X.append(tokens_list[i:i + seq_length])
    y.append(tokens_list[i + seq_length])
  # One-hot encode the targets over the whole vocabulary.
  y = to_categorical(y, num_classes=total_words)
  num_seq = len(X)
  print('Number of sequences:', num_seq)
  return np.array(X), np.array(y), num_seq


In [8]:
step = 1
seq_length = 20
X, y, num_seq = generate_sequences(tokens_list, step, total_words)

Number of sequences: 50415


In [9]:
X.shape

(50415, 20)

In [10]:
y.shape

(50415, 4169)
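
The windowing is easier to see on a toy example. The sketch below applies the same sliding-window logic to six made-up tokens with a window of three. `sliding_windows` and the `window_*` names are illustrative, and the one-hot encoding step is left out.

```python
def sliding_windows(tokens, seq_length, step=1):
  """Pair each window of seq_length tokens with the token that follows it."""
  inputs, targets = [], []
  for i in range(0, len(tokens) - seq_length, step):
    inputs.append(tokens[i:i + seq_length])
    targets.append(tokens[i + seq_length])
  return inputs, targets


window_tokens = [10, 11, 12, 13, 14, 15]
window_X, window_y = sliding_windows(window_tokens, seq_length=3)
for window, target in zip(window_X, window_y):
  print(window, '->', target)
```

With `step=1`, a list of `n` tokens yields `n - seq_length` training pairs, because the last `seq_length` tokens have no following token to predict.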

## The Embedding Layer

The _embedding layer_ functions as a lookup table that converts each token to a vector of a specified `embedding_size`. The number of weights learned is equal to the size of the vocabulary multiplied by the `embedding_size`. Transforming each token into a continuously valued vector enables the model to learn a way to represent each word using backpropagation.
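
As a rough illustration of the lookup-table view, the sketch below builds a small table of random weights and indexes into it. The sizes, the `toy_*` names, and `embed` are assumptions made for this example. A real embedding layer starts from random weights in a similar way, then updates them during training.

```python
import random

toy_vocab_size = 5      # toy value; the model in this notebook uses total_words
toy_embedding_size = 3  # toy value; plays the role of embedding_size

rng = random.Random(0)
# One row of toy_embedding_size weights per token index.
toy_table = [[rng.uniform(-0.05, 0.05) for _ in range(toy_embedding_size)]
             for _ in range(toy_vocab_size)]


def embed(token_ids, table):
  """Look up the embedding vector for each token index."""
  return [table[t] for t in token_ids]


toy_vectors = embed([1, 3, 1], toy_table)
toy_num_weights = toy_vocab_size * toy_embedding_size
print('weights in the table:', toy_num_weights)
print('same token, same vector:', toy_vectors[0] == toy_vectors[2])
```

A sequence of three token indices becomes three vectors of length `toy_embedding_size`, and repeated tokens always map to the same row.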

## Building the LSTM Network

Below is code for creating an LSTM model using TensorFlow and Keras. For this model we use a _stacked LSTM_ architecture. In a stacked LSTM, the first LSTM layer returns its hidden state at every timestep, and this sequence of hidden states is used as the input to a second LSTM layer. The `return_sequences` parameter of a Keras `LSTM` layer controls whether the layer returns the hidden state for every timestep or only for the last one.

In [0]:
from tensorflow.keras.layers import Input, Embedding, LSTM, Dropout, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import RMSprop


n_units = 256
embedding_size = 100

text_in = Input(shape=(None,))
x = Embedding(total_words, embedding_size)(text_in)
x = LSTM(n_units, return_sequences=True)(x)
x = LSTM(n_units)(x)
x = Dropout(rate=0.2)(x)
text_out = Dense(total_words, activation='softmax')(x)

model = Model(text_in, text_out)
opt = RMSprop(learning_rate=0.001)
model.compile(loss='categorical_crossentropy', optimizer=opt)

In [37]:
model.summary()

Model: "model_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_3 (InputLayer)         [(None, None)]            0         
_________________________________________________________________
embedding_2 (Embedding)      (None, None, 100)         416900    
_________________________________________________________________
lstm_3 (LSTM)                (None, None, 256)         365568    
_________________________________________________________________
lstm_4 (LSTM)                (None, 256)               525312    
_________________________________________________________________
dropout_2 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 4169)              1071433   
Total params: 2,379,213
Trainable params: 2,379,213
Non-trainable params: 0
_________________________________________________
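
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the standard parameterization: an LSTM layer has four transforms (three gates plus the candidate state), each with input weights, recurrent weights, and a bias vector, and a dense layer has a weight matrix plus a bias vector. The helper names are illustrative.

```python
def embedding_params(vocab_size, embedding_size):
  """One vector of embedding_size weights per vocabulary entry."""
  return vocab_size * embedding_size


def lstm_params(input_dim, units):
  """Four transforms, each with input weights, recurrent weights, and biases."""
  return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
  """Weight matrix plus one bias per output unit."""
  return input_dim * units + units


param_counts = {
    'embedding': embedding_params(4169, 100),
    'lstm_1': lstm_params(100, 256),
    'lstm_2': lstm_params(256, 256),
    'dense': dense_params(256, 4169),
}
param_total = sum(param_counts.values())
print(param_counts)
print(param_total)
```

The second LSTM layer has more parameters than the first because its input is the 256-dimensional hidden state of the first layer rather than the 100-dimensional embedding.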

Now we will train the model, saving the weights after every epoch. The `fit` call below is configured for 1000 epochs.

In [0]:
from google.colab import drive

drive.mount('/content/gdrive/')
checkpoint_path = '/content/gdrive/My Drive/gdl_models/lstm/'

In [0]:
from tensorflow.keras.callbacks import ModelCheckpoint

callbacks = [ModelCheckpoint(filepath=checkpoint_path + 'weights.hdf5',
                             verbose=1, save_weights_only=True),
             ModelCheckpoint(
                 filepath=checkpoint_path + 'weights_{epoch:04d}.hdf5',
                 verbose=1, save_weights_only=True)]

model.fit(X, y, epochs=1000, batch_size=32, shuffle=True, callbacks=callbacks)

## Generating New Text

Now we will have the model generate a new sequence of words. We do so by first giving the model an input sequence and letting it predict the next word. We then append the new word to the input sequence and repeat.

In [0]:
model.load_weights(checkpoint_path + 'weights.hdf5')

For each sequence, the model predicts the probability that each word in the vocabulary follows the sequence. We can introduce a _temperature_ parameter to rescale these probabilities before sampling. A lower temperature makes the model more deterministic, i.e. more likely to pick the word with the highest predicted probability, while a higher temperature flattens the distribution and makes the output more varied.
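
To see what temperature does to a distribution, here is a small pure-Python sketch on made-up probabilities. `apply_temperature` and `temp_probs` are illustrative names, and the sketch assumes every probability is strictly positive so that the logarithm is defined.

```python
import math


def apply_temperature(probs, temperature):
  """Raise each probability to the power 1 / temperature, then renormalize."""
  scaled = [math.exp(math.log(p) / temperature) for p in probs]
  norm = sum(scaled)
  return [s / norm for s in scaled]


temp_probs = [0.5, 0.3, 0.2]
for t in (2.0, 1.0, 0.5, 0.1):
  print(t, [round(p, 3) for p in apply_temperature(temp_probs, t)])
```

A temperature of 1 leaves the distribution unchanged, values below 1 concentrate probability on the most likely word, and values above 1 move the distribution toward uniform.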

In [0]:
def sample_with_temperature(preds, temperature=1.0):
  """Sample from the probabilities predicted by the model with temperature."""
  preds = np.log(np.asarray(preds).astype('float64')) / temperature
  exp_preds = np.exp(preds)
  preds = exp_preds / np.sum(exp_preds)
  return np.argmax(np.random.multinomial(1, preds, 1))


def generate_text(seed_text, next_words, model, max_sequence_len, temperature):
  """Generate new text using the trained model."""
  output = seed_text
  seed_text = start_story + seed_text
  for _ in range(next_words):
    tokens_list = tokenizer.texts_to_sequences([seed_text])[0]
    token_list = tokens_list[-max_sequence_len:]
    # Wrap the tokens in an array of shape (1, sequence_length) for predict().
    y_class = sample_with_temperature(
        model.predict(np.array([token_list]), verbose=0)[0], temperature)
    if y_class == 0:
      output_word = ''
    else:
      output_word = tokenizer.index_word[y_class]
    if output_word == '|':
      break
    output += output_word + ' '
    seed_text += output_word + ' '
  return output

In [40]:
seed_text = 'the frog and the snake '
gen_words = 500
temp = 0.1

generate_text(seed_text, gen_words, model, seq_length, temp)

'the frog and the snake . a hungry man of an meadow however , he felt by a tanner , and the mouse had there was the matter ; and the bramble , said , " ah him . you may wake me out in the same friend , but i see no to your own taken to the suddenly good - looking for the sound of a good deal . " '

Below is code for generating human-led text. The model picks the top 10 words with the highest probability of coming next and a human user decides what the model writes. This is similar to how automatic suggestions work in messaging apps.

In [0]:
from IPython.display import clear_output

def generate_human_led_text(model, max_sequence_len):
  """Generate a human-led sequence of text."""
  output = ''
  seed_text = start_story
  while True:
    tokens_list = tokenizer.texts_to_sequences([seed_text])[0]
    token_list = tokens_list[-max_sequence_len:]
    # Wrap the tokens in an array of shape (1, sequence_length) for predict().
    probs = model.predict(np.array([token_list]), verbose=0)[0]
    top_10_idx = np.flip(np.argsort(probs)[-10:])
    top_10_probs = probs[top_10_idx]
    top_10_words = tokenizer.sequences_to_texts([[x] for x in top_10_idx])
    for prob, word in zip(top_10_probs, top_10_words):
      print('{:<6.1%} : {}'.format(prob, word))
    chosen_word = input()
    if chosen_word == '|':
      break
    seed_text += chosen_word + ' '
    output += chosen_word + ' '
    clear_output()
  return output

In [0]:
generate_human_led_text(model, seq_length)