# Text Generation with Neural Networks

In this notebook we build a network that generates text character by character. For an excellent write-up on this approach, see: http://karpathy.github.io/2015/05/21/rnn-effectiveness/

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
import tensorflow as tf

## The Data

In [3]:
path_to_file = "shakespeare.txt"

In [4]:
with open(path_to_file, mode='r') as f:  # close the file once it has been read
    text = f.read()

In [5]:
#print(text[:1000]) # shows the repeated structure of the text

In [6]:
vocab = sorted(set(text)) # unique characters in the text

In [7]:
len(vocab)

84

## Text Processing

In [8]:
# enumerate(vocab) pairs each unique character with its index
#for pair in enumerate(vocab):
#    print(pair)

In [9]:
char_to_ind = {char:ind for ind,char in enumerate(vocab)}

In [10]:
char_to_ind['H']

33

In [11]:
ind_to_char = np.array(vocab)

In [12]:
ind_to_char[33]

'H'

In [13]:
encoded_text = np.array([char_to_ind[c] for c in text])

In [14]:
encoded_text

array([ 0,  1,  1, ..., 30, 39, 29])

In [15]:
encoded_text.shape

(5445609,)
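
The two lookups above are inverses of each other. A minimal sketch of the round trip, using a short made-up string rather than the Shakespeare corpus (toy_text, encode, and decode are illustrative names, not part of this notebook):

```python
# Toy corpus; the notebook builds the same kind of tables from shakespeare.txt.
toy_text = "hello world"

toy_vocab = sorted(set(toy_text))                    # unique characters, sorted
toy_char_to_ind = {ch: i for i, ch in enumerate(toy_vocab)}
toy_ind_to_char = toy_vocab                          # list position is the reverse lookup


def encode(s):
    """Map each character to its integer index."""
    return [toy_char_to_ind[ch] for ch in s]


def decode(indices):
    """Map integer indices back to a string."""
    return "".join(toy_ind_to_char[i] for i in indices)


encoded = encode("hello")
print(encoded)
print(decode(encoded))
```

Decoding the encoded indices returns the original string, which is the property the generation step relies on later.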

In [16]:
sample = text[:400]

In [17]:
print(sample)


                     1
  From fairest creatures we desire increase,
  That thereby beauty's rose might never die,
  But as the riper should by time decease,
  His tender heir might bear his memory:
  But thou contracted to thine own bright eyes,
  Feed'st thy light's flame with self-substantial fuel,
  Making a famine where abundance lies,
  Thy self thy foe, to thy sweet self too cruel:
  Thou t


In [18]:
encoded_text[:50]

array([ 0,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1, 12,  0,  1,  1, 31, 73, 70, 68,  1, 61, 56, 64,
       73, 60, 74, 75,  1, 58, 73, 60, 56, 75, 76, 73, 60, 74,  1, 78])

## Creating Batches

We would like each training sequence to capture the basic structure of the text.
In this text, almost every second line rhymes,
so we can grab 3 lines to capture one rhyme or 4 lines to capture two.

In [19]:
line = '''From fairest creatures we desire increase,'''

In [20]:
len(line)

42

In [21]:
lines = '''  From fairest creatures we desire increase,
  That thereby beauty's rose might never die,
  But as the riper should by time decease,
  His tender heir might bear his memory:
  But thou contracted to thine own bright eyes,
  Feed'st thy light's flame with self-substantial fuel,
  Making a famine where abundance lies,
  Thy self thy foe, to thy sweet self too cruel:'''

In [22]:
len(lines)/8

45.875

In [23]:
seq_len = 45*4 # sequence length: about 4 lines of ~45 characters each
print(seq_len)

180


In [24]:
total_num_seq = len(text) // (seq_len+1) # +1: each chunk holds seq_len inputs plus one extra character for the shifted target

In [25]:
total_num_seq

30086
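
The count above follows from integer division: each training example needs seq_len + 1 characters, because the input and the target are the same window shifted by one position. A quick check using the numbers printed in this notebook (the variable names below are illustrative):

```python
n_chars = 5445609        # encoded_text.shape[0], as printed earlier
seq_len = 180            # 45 characters per line * 4 lines
chunk = seq_len + 1      # one extra character so the target can be shifted by one

n_sequences, leftover = divmod(n_chars, chunk)
print(n_sequences)       # number of full chunks
print(leftover)          # trailing characters that do not fill a chunk
```

The leftover characters are the ones that drop_remainder=True discards when the character stream is batched into sequences.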

https://www.tensorflow.org/api_docs/python/tf/data/Dataset

In [26]:
char_dataset = tf.data.Dataset.from_tensor_slices(encoded_text)

The encoded text is now wrapped in a tf.data.Dataset object, which yields one character index at a time.

In [27]:
sequences = char_dataset.batch(seq_len+1, drop_remainder=True)

In [28]:
def create_seq_targets(seq):
    input_txt = seq[:-1]
    target_txt = seq[1:]
    return input_txt, target_txt
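
To see what this split does, here is a small sketch on a plain string instead of a tensor (split_input_target is a hypothetical stand-in for create_seq_targets). At every position, the target is the character that follows the input character:

```python
def split_input_target(seq):
    """Return (input, target), where target is the input shifted left by one."""
    return seq[:-1], seq[1:]


inp, tgt = split_input_target("Hello")
for x, y in zip(inp, tgt):
    print(f"input {x!r} -> target {y!r}")
```

Both halves have the same length, one less than the original sequence, which is why each chunk is seq_len + 1 characters long.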

In [29]:
dataset = sequences.map(create_seq_targets)

In [30]:
for input_txt, target_txt in dataset.take(1):
    print(input_txt.numpy())
    print("".join(ind_to_char[input_txt.numpy()]))
    print("\n")
    print(target_txt.numpy())
    print("".join(ind_to_char[target_txt.numpy()]))

[ 0  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1 12  0
  1  1 31 73 70 68  1 61 56 64 73 60 74 75  1 58 73 60 56 75 76 73 60 74
  1 78 60  1 59 60 74 64 73 60  1 64 69 58 73 60 56 74 60  8  0  1  1 45
 63 56 75  1 75 63 60 73 60 57 80  1 57 60 56 76 75 80  5 74  1 73 70 74
 60  1 68 64 62 63 75  1 69 60 77 60 73  1 59 64 60  8  0  1  1 27 76 75
  1 56 74  1 75 63 60  1 73 64 71 60 73  1 74 63 70 76 67 59  1 57 80  1
 75 64 68 60  1 59 60 58 60 56 74 60  8  0  1  1 33 64 74  1 75 60 69 59
 60 73  1 63 60 64 73  1 68 64 62 63]

                     1
  From fairest creatures we desire increase,
  That thereby beauty's rose might never die,
  But as the riper should by time decease,
  His tender heir migh


[ 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1 12  0  1
  1 31 73 70 68  1 61 56 64 73 60 74 75  1 58 73 60 56 75 76 73 60 74  1
 78 60  1 59 60 74 64 73 60  1 64 69 58 73 60 56 74 60  8  0  1  1 45 63
 56 75  1 75 63 60 73 60 57 80  1 57 60 56 76 75 

In [31]:
batch_size = 2**7
print(batch_size)

128


In [32]:
buffer_size = 10000
dataset = dataset.shuffle(buffer_size).batch(batch_size,drop_remainder=True)

In [33]:
dataset

<BatchDataset shapes: ((128, 180), (128, 180)), types: (tf.int64, tf.int64)>

## Creating the Model

In [34]:
vocab_size = len(vocab)
print(vocab_size)

84


In [35]:
embd_dim = 64

In [36]:
rnn_neurons = 1026

In [37]:
from tensorflow.keras.losses import sparse_categorical_crossentropy

From the Keras documentation for the SparseCategoricalCrossentropy class:
https://keras.io/api/losses/probabilistic_losses/#sparse_categorical_crossentropy-function

tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=False, reduction="auto", name="sparse_categorical_crossentropy"
)
Computes the crossentropy loss between the labels and predictions.

Use this crossentropy loss function when there are two or more label classes. We expect labels to be provided as integers. If you want to provide labels using one-hot representation, please use CategoricalCrossentropy loss. There should be # classes floating point values per feature for y_pred and a single floating point value per feature for y_true.

from_logits: Whether y_pred is expected to be a logits tensor. By default, we assume that y_pred encodes a probability distribution.
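
With from_logits=True, the loss applies a softmax to the raw scores and then takes the negative log-probability of the true class. As a rough illustration for a single prediction, here is a pure-Python sketch (this is not the Keras implementation, and sparse_ce_from_logits is a made-up name):

```python
import math


def sparse_ce_from_logits(true_index, logits):
    """Negative log-probability of the true class after a softmax over logits."""
    m = max(logits)                                   # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[true_index]


# Uniform logits over 84 classes give a loss of ln(84), roughly 4.43.
uniform_loss = sparse_ce_from_logits(0, [0.0] * 84)
print(uniform_loss)

# A confident, correct prediction gives a loss close to zero.
confident_loss = sparse_ce_from_logits(2, [0.0, 0.0, 10.0])
print(confident_loss)
```

An untrained model whose logits are all close to zero should therefore start near ln(84), about 4.43 for this vocabulary, which is a handy sanity check for the first epoch.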

In [38]:
#help(sparse_categorical_crossentropy)

In [39]:
def sparse_cat_loss(y_true,y_pred):
    return sparse_categorical_crossentropy(y_true, y_pred, from_logits=True)

In [40]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, GRU, Dense

Embedding Layer input size error:
https://stackoverflow.com/questions/54557468/in-tf-keras-layers-embedding-why-it-is-important-to-know-the-size-of-dictionary

In [41]:
def create_model(vocab_size,embd_dim,rnn_neurons,batch_size):
    
    model = Sequential()
    model.add(Embedding(vocab_size,  embd_dim, batch_input_shape = (batch_size,None)))
    model.add(GRU(rnn_neurons, stateful=True,return_sequences=True,
                  recurrent_initializer='glorot_uniform'))
    # stateful=True keeps the GRU state between calls; call model.reset_states() to clear it
    model.add(Dense(vocab_size))
    model.compile(optimizer='adam', loss=sparse_cat_loss)
    
    return model

In [42]:
model = create_model(vocab_size,embd_dim,rnn_neurons,batch_size)

In [43]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (128, None, 64)           5376      
_________________________________________________________________
gru (GRU)                    (128, None, 1026)         3361176   
_________________________________________________________________
dense (Dense)                (128, None, 84)           86268     
Total params: 3,452,820
Trainable params: 3,452,820
Non-trainable params: 0
_________________________________________________________________
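
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the TensorFlow 2 default GRU variant (reset_after=True), which keeps two bias vectors per gate; that formula is an assumption about the default rather than something stated in this notebook.

```python
vocab_size, embd_dim, rnn_neurons = 84, 64, 1026

# Embedding: one embd_dim-sized vector per vocabulary entry.
embedding_params = vocab_size * embd_dim

# GRU: 3 gates, each with input weights, recurrent weights and two bias vectors
# (the two biases assume reset_after=True).
gru_params = 3 * (rnn_neurons * (embd_dim + rnn_neurons) + 2 * rnn_neurons)

# Dense: weight matrix plus one bias per output unit.
dense_params = rnn_neurons * vocab_size + vocab_size

total = embedding_params + gru_params + dense_params
print(embedding_params, gru_params, dense_params, total)
```

Almost all of the parameters sit in the GRU layer, so rnn_neurons is the setting that dominates model size.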


## Training the Model

In [44]:
[batch_size,None]

[128, None]

Before training, we run the model on a single batch to check the output shape and catch any issues early.

In [45]:
for input_example_batch, target_example_batch in dataset.take(1):

  example_batch_predictions = model(input_example_batch)

In [46]:
example_batch_predictions.shape

TensorShape([128, 180, 84])

In [47]:
example_batch_predictions[0]

<tf.Tensor: shape=(180, 84), dtype=float32, numpy=
array([[-0.00313731,  0.00130075,  0.00269769, ..., -0.00196088,
        -0.00358368,  0.00117394],
       [ 0.00146231,  0.00368364,  0.00237321, ..., -0.00033735,
         0.00222059, -0.0004087 ],
       [-0.00168441,  0.00321426,  0.00164211, ..., -0.00426133,
        -0.00303025, -0.00395624],
       ...,
       [-0.007145  ,  0.00733935,  0.00203389, ...,  0.00828057,
        -0.00762024,  0.00704344],
       [ 0.00082475,  0.00584078,  0.00168824, ...,  0.00324385,
        -0.0008019 ,  0.00289006],
       [ 0.00197069,  0.00593258, -0.00142932, ...,  0.00097622,
        -0.00181871,  0.00134782]], dtype=float32)>

In [48]:
sample_indices = tf.random.categorical(example_batch_predictions[0],num_samples=1)

In [49]:
sample_indices.shape

TensorShape([180, 1])

In [50]:
sample_indices = tf.squeeze(sample_indices, axis=-1).numpy()

In [51]:
#ind_to_char[sample_indices]

In [52]:
epochs = 20

In [53]:
model.fit(dataset, epochs=epochs)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x7f1d956abe50>

## Generating Text

In [54]:
from tensorflow.keras.models import load_model

In [55]:
model.save('shakespeare_gen.h5') 

In [56]:
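# A stateful model has a fixed batch size, so rebuild it with batch_size=1
# for single-seed generation and load the trained weights into it.
model = create_model(vocab_size,embd_dim,rnn_neurons,batch_size=1)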

model.load_weights('shakespeare_gen.h5')

model.build(tf.TensorShape([1,None]))

In [57]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (1, None, 64)             5376      
_________________________________________________________________
gru_1 (GRU)                  (1, None, 1026)           3361176   
_________________________________________________________________
dense_1 (Dense)              (1, None, 84)             86268     
Total params: 3,452,820
Trainable params: 3,452,820
Non-trainable params: 0
_________________________________________________________________


In [58]:
def generate_text(model, start_seed,gen_size=500,temp=1.0):

  num_generate = gen_size

  input_eval = [char_to_ind[s] for s in start_seed]

  input_eval = tf.expand_dims(input_eval, 0)

  text_generated = []

  temperature = temp

  # https://stackoverflow.com/questions/42763928/how-to-use-model-reset-states-in-keras
  model.reset_states()

  for i in range(num_generate):

    predictions = model(input_eval)
    predictions = tf.squeeze(predictions,0)
    # Lower temperature -> more predictable text; higher -> more surprising text
    predictions = predictions/temperature

    predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()

    input_eval = tf.expand_dims([predicted_id],0)

    text_generated.append(ind_to_char[predicted_id])

  return (start_seed+"".join(text_generated))
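
The division by the temperature in generate_text controls how adventurous the sampling is. tf.random.categorical treats its input as unnormalized log-probabilities, so scaling the logits reshapes the distribution it samples from. A small pure-Python sketch of that effect (softmax_with_temperature is an illustrative helper, not part of this notebook):

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                                  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Temperatures below 1 concentrate probability on the most likely character, while temperatures above 1 flatten the distribution and produce more varied output.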


In [59]:
print(generate_text(model,'ALEX',gen_size=1000))

ALEXDDY. Well, my good lord- Indeed, I will not take a common time;
    Therefore, two ships.
  YORK. Whe posts the golden death of winters follow thoughs,
    As fell to another weary a many of his face
      After his lance for 't. Now I am commanded
  Come on to seek the cap; of leading loves they die.
  The substake their affairs, and from our fight  
    And slay thine own mouth. Yea, not a foolish homicius,
    Offering none of mickleman 'nhappele,
    And catch it by. Be it confounded then?
    Urle sleeping! yes, my lord, within your dowrul,
    And his own birds.                             [Advancing]  I know your master's brand,
    Nor fair mistress be employ'd withal
    As water-loss about the guard of grace;
    So went illy Lucius, I found. Ay, sir, is 'ald will keep other off
    Satisfyray it by callett Prospero.
  ROSS. Be it to me. Mine ear I keep your bosom  
    As hoon did gather, by your free and officer or him must needs be blunt.
  Lear. Romeo, present; pardon