# Deep Learning for Text Generation 
> A Practitioner's Guide: Part II

Temperature is a scaling factor applied to the outputs of our dense layer before applying the softmax activation function: the outputs are divided by the temperature. In a nutshell, it defines how conservative or "creative" the model's guesses are for the next character in a sequence. Lower values of temperature (e.g., 0.2) will generate "safe" guesses, whereas values of temperature above 1.0 will start to generate "riskier" guesses. Think of it as the amount of surprise you'd have at seeing an English word start with "st" versus "sg". When temperature is low, we may get lots of "the"s and "and"s; when temperature is high, things get more unpredictable.
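To make the effect concrete, here is a minimal sketch in plain Python (no TensorFlow needed). The scores in `toy_logits` and the helper name `softmax_with_temperature` are made up for illustration; they are not taken from the trained model.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate next characters
toy_logits = [2.0, 1.0, 0.1]

for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, t)
    print("temperature={:.1f} -> {}".format(t, [round(p, 3) for p in probs]))
```

At a temperature of 0.2 nearly all of the probability mass sits on the highest-scoring character, while at 2.0 the three probabilities move much closer together.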

# Training a Text Generator from Scratch

<img src="illustrations/tf_logo.png" >

We discussed **RNNs** and **language models** in the previous notebook. Let's get our hands dirty and train our very own language model from scratch.

We will train a language model using TensorFlow 2.0, a major update to the already popular deep learning framework. TF 2.0 provides Keras-based high-level APIs alongside its core functionality, with eager execution available for more complex workflows. We will rely on the TF + Keras setup for this session, as it is easy to understand and deploy.


This notebook will leverage **GRUs** in place of vanilla RNNs for two main reasons: they are better at handling vanishing and exploding gradients, and they can handle longer contexts. As the corpus for training our language model, we use the famous book _**The Adventures of Sherlock Holmes**_ by _Sir Arthur Conan Doyle_. The book is made available through _Project Gutenberg_; check the references section for details.

## Import Packages

In [32]:
import os
import numpy as np
import tensorflow as tf

In [33]:
print("Tensorflow version={}".format(tf.__version__))

Tensorflow version=2.0.0


## Load Dataset

In [2]:
datafile_path = r'data/the_adventures_of_sherlock_holmes_1661-0.txt'

In [34]:
# Load the text file
text = open(datafile_path, 'rb').read().decode(encoding='utf-8')
print ('Book contains a total of {} characters'.format(len(text)))

Book contains a total of 594197 characters


### A quick snippet of the book

In [4]:
print(text[1300:1500])

I. A SCANDAL IN BOHEMIA


I.

To Sherlock Holmes she is always _the_ woman. I have seldom heard him
mention her under any other name. In his eyes she eclipses and
predominates the whole of her 


## Prepare Text

We shall perform only the bare minimum of cleanup on the text. The aim is to help our model understand the usage of words and their context. Typical preprocessing steps such as stopword removal, stemming, lowercasing etc. are not required in this case.

In [5]:
# We drop the first 1300 characters, which contain
# Project Gutenberg front matter rather than the story
text = text[1300:]

### Unique Character Count | Vocab Size

In [6]:
vocab = sorted(set(text))
print ('{} unique characters'.format(len(vocab)))

96 unique characters


### Character to Integer Mapping

In [7]:
char2idx = {u:i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)

text_as_int = np.array([char2idx[c] for c in text])

In [42]:
print('{')
for char,_ in zip(char2idx, range(20)):
    print('  {:4s}: {:3d},'.format(repr(char), char2idx[char]))
print('  ...\n}')

{
  '\n':   0,
  '\r':   1,
  ' ' :   2,
  '!' :   3,
  '"' :   4,
  '$' :   5,
  '%' :   6,
  '&' :   7,
  "'" :   8,
  '(' :   9,
  ')' :  10,
  '*' :  11,
  ',' :  12,
  '-' :  13,
  '.' :  14,
  '/' :  15,
  '0' :  16,
  '1' :  17,
  '2' :  18,
  '3' :  19,
  ...
}


### Text to Integer Sample

In [43]:
print ('{} ---- char-2-int ----  {}'.format(repr(text[40:60]), text_as_int[40:60]))

'Sherlock Holmes, by ' ---- char-2-int ----  [61 74 68 71 59 67  2 37 71 68 69 61 75  2 75 64 61  2 65 75]
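The two lookups are inverses of each other, which is what lets us turn model outputs back into text later. The sketch below shows the round trip on a toy string so that it runs on its own; the names `encode`, `decode` and the `toy_*` variables are hypothetical helpers, not part of the notebook.

```python
# Toy corpus standing in for the book text (assumption for illustration)
toy_text = "To Sherlock Holmes she is always the woman."

# Same construction as above: sorted unique characters define the vocabulary
toy_vocab = sorted(set(toy_text))
toy_char2idx = {ch: i for i, ch in enumerate(toy_vocab)}
toy_idx2char = toy_vocab  # a list works as an index -> character lookup

def encode(s):
    """Map each character to its integer id."""
    return [toy_char2idx[ch] for ch in s]

def decode(ids):
    """Map integer ids back to a string."""
    return "".join(toy_idx2char[i] for i in ids)

ids = encode("Holmes")
print(ids)
print(decode(ids))
```

A character that never appeared in the corpus has no entry in the mapping, so encoding it raises a `KeyError`. The same holds for any context string passed to the model at generation time.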


## Prepare Dataset

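We leverage a sliding window approach to train our model. We first set the maximum sequence length to 100 characters, which fixes the size of each training example and makes batching straightforward. Each chunk of 101 characters then yields an input (the first 100 characters) and a target (the same text shifted one character ahead).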
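Before building this with `tf.data`, the idea can be sketched in plain Python. This is only an illustration on a toy string with a sequence length of 4; `make_examples` is a hypothetical helper, and the cells that follow do the same work with `batch` and `map`.

```python
def make_examples(ids, seq_length):
    """Split ids into chunks of seq_length + 1, then form (input, target) pairs."""
    chunk_size = seq_length + 1
    examples = []
    # Step through the data one full chunk at a time; a trailing partial chunk is dropped
    for start in range(0, len(ids) - chunk_size + 1, chunk_size):
        chunk = ids[start:start + chunk_size]
        examples.append((chunk[:-1], chunk[1:]))  # target = input shifted by one
    return examples

toy = list("Sherlock Holmes")  # characters stand in for integer ids
for inp, tgt in make_examples(toy, seq_length=4):
    print(repr("".join(inp)), "->", repr("".join(tgt)))
```

Every position in the input has a matching target, so a single chunk provides `seq_length` next-character predictions to learn from.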

In [10]:
# Maximum length, in characters, of a single input sequence
seq_length = 100
examples_per_epoch = len(text)//(seq_length+1)

# Create training examples / targets
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)

for i in char_dataset.take(10):
    print(idx2char[i.numpy()])

I
.
 
A
 
S
C
A
N
D


### Prepare Batch

In [50]:
sequences = char_dataset.batch(seq_length+1, drop_remainder=True)

for item in sequences.take(10):
    print(repr(''.join(idx2char[item.numpy()])))
    print("-"*110)

'I. A SCANDAL IN BOHEMIA\r\n\r\n\r\nI.\r\n\r\nTo Sherlock Holmes she is always _the_ woman. I have seldom heard '
--------------------------------------------------------------------------------------------------------------
'him\r\nmention her under any other name. In his eyes she eclipses and\r\npredominates the whole of her se'
--------------------------------------------------------------------------------------------------------------
'x. It was not that he felt any emotion\r\nakin to love for Irene Adler. All emotions, and that one part'
--------------------------------------------------------------------------------------------------------------
'icularly,\r\nwere abhorrent to his cold, precise but admirably balanced mind. He\r\nwas, I take it, the m'
--------------------------------------------------------------------------------------------------------------
'ost perfect reasoning and observing machine that\r\nthe world has seen, but as a lover he would have pl'
--------------

In [51]:
def split_input_target(chunk):
    """
    Utility which splits a chunk into an input sequence and a target sequence,
    where the target is the input shifted one position ahead.
    Parameters:
        chunk: tensor of character indices of length seq_length+1
    Returns:
        Tuple-> input_text (chunk minus the last character), target_text (chunk minus the first character)
    """
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text

dataset = sequences.map(split_input_target)

In [13]:
for input_example, target_example in  dataset.take(1):
    print ('Input data: ', repr(''.join(idx2char[input_example.numpy()])))
    print ('Target data:', repr(''.join(idx2char[target_example.numpy()])))

Input data:  'I. A SCANDAL IN BOHEMIA\r\n\r\n\r\nI.\r\n\r\nTo Sherlock Holmes she is always _the_ woman. I have seldom heard'
Target data: '. A SCANDAL IN BOHEMIA\r\n\r\n\r\nI.\r\n\r\nTo Sherlock Holmes she is always _the_ woman. I have seldom heard '


In [14]:
for i, (input_idx, target_idx) in enumerate(zip(input_example[:5], target_example[:5])):
    print("Step {:4d}".format(i))
    print("  input: {} ({:s})".format(input_idx, repr(idx2char[input_idx])))
    print("  expected output: {} ({:s})".format(target_idx, repr(idx2char[target_idx])))

Step    0
  input: 38 ('I')
  expected output: 14 ('.')
Step    1
  input: 14 ('.')
  expected output: 2 (' ')
Step    2
  input: 2 (' ')
  expected output: 30 ('A')
Step    3
  input: 30 ('A')
  expected output: 2 (' ')
Step    4
  input: 2 (' ')
  expected output: 48 ('S')


### Prepare Training Batch

In [15]:
# Batch size
BATCH_SIZE = 64
# Buffer size to shuffle the dataset
BUFFER_SIZE = 10000

In [52]:
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
print("Dataset Shape={}".format(dataset))

Dataset Shape=<BatchDataset shapes: ((64, 100), (64, 100)), types: (tf.int64, tf.int64)>
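Because both the chunking step and the batching step use `drop_remainder=True`, the number of batches per epoch follows from simple integer division. The sketch below works it out; `batches_per_epoch` is a hypothetical helper, and the character count assumes the figure printed when the file was loaded.

```python
def batches_per_epoch(num_chars, seq_length, batch_size):
    """Count full batches when partial chunks and partial batches are dropped."""
    num_sequences = num_chars // (seq_length + 1)
    return num_sequences // batch_size

# 594197 characters were reported for the raw file; the first 1300 were removed
num_chars = 594197 - 1300
print(batches_per_epoch(num_chars, seq_length=100, batch_size=64))
```

Under these assumptions the text yields 5870 sequences and 91 full batches per epoch.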


## Prepare Model

We prepare a utility function to generate the architecture of our deep learning based language model. We leverage the high-level ```tf.keras``` API for creating this model. We use only one hidden (GRU) layer between the embedding and the output layer; you may experiment with additional layers as well.

In [19]:
def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
    """
    Utility to create a model object.
    Parameters:
        vocab_size: number of unique characters
        embedding_dim: size of embedding vector. This is typically a power of 2, i.e. 64, 128, 256 and so on
        rnn_units: number of GRU units to be used
        batch_size: batch size for training the model
    Returns:
        tf.keras model object
    """
    model = tf.keras.Sequential([
        # Maps each character index to a dense vector
        tf.keras.layers.Embedding(vocab_size, embedding_dim,
                                  batch_input_shape=[batch_size, None]),
        # stateful=True carries the hidden state across successive calls
        tf.keras.layers.GRU(rnn_units,
                            return_sequences=True,
                            stateful=True,
                            recurrent_initializer='glorot_uniform'),
        # One logit per character in the vocabulary (no softmax here)
        tf.keras.layers.Dense(vocab_size)
    ])
    return model

In [18]:
# Length of the vocabulary in chars
vocab_size = len(vocab)

# The embedding dimension
embedding_dim = 256

# Number of RNN units
rnn_units = 1024

In [20]:
model = build_model(
    vocab_size=len(vocab),
    embedding_dim=embedding_dim,
    rnn_units=rnn_units,
    batch_size=BATCH_SIZE)

In [21]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (64, None, 256)           24576     
_________________________________________________________________
gru (GRU)                    (64, None, 1024)          3938304   
_________________________________________________________________
dense (Dense)                (64, None, 96)            98400     
Total params: 4,061,280
Trainable params: 4,061,280
Non-trainable params: 0
_________________________________________________________________
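The parameter counts in the summary can be reproduced by hand, which is a handy check that the layers are wired as intended. The sketch below is plain Python arithmetic; the helper names are hypothetical, and the GRU formula assumes the Keras default `reset_after=True`, which keeps two bias vectors per gate.

```python
def embedding_params(vocab_size, embedding_dim):
    # One embedding vector per character
    return vocab_size * embedding_dim

def gru_params(input_dim, units):
    # Three weight sets (update gate, reset gate, candidate state), each with
    # an input kernel, a recurrent kernel and two bias vectors.
    # The two bias vectors assume reset_after=True.
    return 3 * (input_dim * units + units * units + 2 * units)

def dense_params(input_dim, output_dim):
    # Weight matrix plus one bias per output character
    return input_dim * output_dim + output_dim

vocab_size, embedding_dim, rnn_units = 96, 256, 1024

emb = embedding_params(vocab_size, embedding_dim)
gru = gru_params(embedding_dim, rnn_units)
dense = dense_params(rnn_units, vocab_size)

print(emb, gru, dense, emb + gru + dense)
```

The four values agree with the summary: 24,576 for the embedding, 3,938,304 for the GRU, 98,400 for the dense layer and 4,061,280 in total.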


In [22]:
def loss(labels, logits):
    return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)
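For a single time step, this loss is the negative log-probability that the softmax of the logits assigns to the true next character. The plain-Python sketch below illustrates that computation; `sparse_crossentropy_from_logits` is a hypothetical helper written for this note, not the Keras implementation, which also handles batches and sequences.

```python
import math

def sparse_crossentropy_from_logits(label, logits):
    """Negative log-probability of the true class under softmax(logits)."""
    peak = max(logits)  # log-sum-exp trick for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[label]

# A model that scores all 96 characters equally, i.e. has learnt nothing yet
uniform_logits = [0.0] * 96
print(sparse_crossentropy_from_logits(label=14, logits=uniform_logits))
print(math.log(96))
```

Both lines print ln(96), roughly 4.56. That gives a useful baseline when reading the training log: a loss well below this value suggests the model has learnt something beyond uniform guessing.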

In [23]:
model.compile(optimizer='adam', loss=loss)

### Setup Callbacks
- We set up a single callback to store training checkpoints.
- You may leverage other callbacks such as TensorBoard, EarlyStopping etc. as needed.

In [24]:
# Directory where the checkpoints will be saved
checkpoint_dir = r'data/training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)

## Time to Train the ~Dragon🐉~ Language Model 

Now that we have prepared our training dataset along with our model, let us train it. We train it for a few epochs and observe the loss to understand whether it is learning or not.

In [26]:
EPOCHS = 12
history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])

Epoch 1/12
Epoch 2/12
Epoch 3/12
Epoch 4/12
Epoch 5/12
Epoch 6/12
Epoch 7/12
Epoch 8/12
Epoch 9/12
Epoch 10/12
Epoch 11/12
Epoch 12/12


---

## Generate Text

We trained our model on the text from _The Adventures of Sherlock Holmes_. Note that we performed almost no preprocessing on the text apart from removing some metadata and the table of contents. The model is trained with a vocab size of **96** unique characters, which includes numbers and special characters apart from lower and upper case letters.

We should also note that we have trained a character-level language model to keep the vocab size small. Imagine the vocab size for training at the word level: it would be orders of magnitude larger than this. Also imagine the amount of training data required to help the model understand the different contexts in which a specific word might be used.

Let us generate some text and see what our model has learnt.

In [27]:
# fetch the latest checkpoint from the model directory
tf.train.latest_checkpoint(checkpoint_dir)

'data/training_checkpoints/ckpt_12'

### Model Load
> Notice that we trained the model with a certain batch size. Using ```model.summary``` we saw how the batch size shows up as one of the parameters which determine the input's shape.

> For inference, we use a single input sentence/context to generate text. Thus we build the model again using the ```build_model``` utility we prepared earlier, but with a ```batch_size``` of 1 this time. Once we have the model object with the desired batch size, we use ```load_weights``` to load the latest checkpoint weights for inference.

In [28]:
model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)

model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))

model.build(tf.TensorShape([1, None]))

In [29]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (1, None, 256)            24576     
_________________________________________________________________
gru_1 (GRU)                  (1, None, 1024)           3938304   
_________________________________________________________________
dense_1 (Dense)              (1, None, 96)             98400     
Total params: 4,061,280
Trainable params: 4,061,280
Non-trainable params: 0
_________________________________________________________________


In [57]:
def generate_text(model, context_string, num_generate=1000,temperature=1.0):
    """
    Utility to generate text given a trained model and context
    Parameters:
        model: tf.keras object trained on a sufficiently sized corpus
        context_string: input string which acts as context for the model
        num_generate: number of characters to be generated
        temperature: parameter to control randomness of outputs
    Returns:
        string : context_string+text_generated
    """

    # vectorizing: convert context string into character indices
    input_eval = [char2idx[s] for s in context_string]
    input_eval = tf.expand_dims(input_eval, 0)

    # List of generated characters
    text_generated = []

    model.reset_states()
    # Loop till required number of characters are generated
    for i in range(num_generate):
        predictions = model(input_eval)
        predictions = tf.squeeze(predictions, 0)

        # temperature helps control the character returned by the model.
        predictions = predictions / temperature
        # Sampling over a categorical distribution
        predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()

        # predicted character acts as input for next step
        input_eval = tf.expand_dims([predicted_id], 0)

        text_generated.append(idx2char[predicted_id])

    return (context_string + ''.join(text_generated))
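The core of the loop above is the sampling step: scale the logits by the temperature, then draw one index from the resulting distribution. The sketch below isolates that step in plain Python, using `random.choices` as a stand-in for `tf.random.categorical`. The logits and the helper name `sample_with_temperature` are hypothetical.

```python
import math
import random

def sample_with_temperature(logits, temperature, rng):
    """Draw one index from softmax(logits / temperature)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]  # unnormalised probabilities
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

rng = random.Random(0)         # fixed seed so runs are repeatable
toy_logits = [2.0, 1.0, 0.1]   # hypothetical scores for three characters

for t in (0.5, 1.0, 2.0):
    draws = [sample_with_temperature(toy_logits, t, rng) for _ in range(1000)]
    share = draws.count(0) / len(draws)
    print("temperature={:.1f}: top character chosen {:.0%} of the time".format(t, share))
```

Sampling, rather than always taking the highest-scoring character, is what lets the same context produce different continuations on each call.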

### Let us generate some samples

In [58]:
print(generate_text(model, context_string=u"Watson you are",num_generate=100))

Watson you are certainly leave
your husband’s read, and then two vareak illsted of bory, it was young McCarthy’t 


In [60]:
# We increase the temperature, i.e. increase randomness
print(generate_text(model, context_string=u"Watson you are",num_generate=100,temperature=2))

Watson you are that prsilUS4œlder, he
int always,
heabveirnies brokeve at Togheatudeprégail I; ‘Ych—_ OR2Uà,D“W


In [61]:
# We decrease the temperature, i.e. decrease randomness
print(generate_text(model, context_string=u"Watson you are",num_generate=100,temperature=0.5))

Watson you are the news
all open. It was a man who was a man with a sudden brightly more than I cannot means to w


## Decoding Strategies

## References

+ Project Gutenberg: [The Adventures of Sherlock Holmes](https://www.gutenberg.org/ebooks/1661)
+ Andrej Karpathy: [The Unreasonable Effectiveness of Recurrent Neural Networks](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)
+ freeCodeCamp: [Applied Introduction to LSTMs for Text Generation](https://www.freecodecamp.org/news/applied-introduction-to-lstms-for-text-generation-380158b29fb3/)