# LSTM Based Language Model
A language model looks at the preceding context to generate the next set of words. This context is also called a sliding window, which moves across the input sentence from left to right (or right to left for languages that are written in that direction). The model in this notebook works at the character level, so each prediction is the next character.
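As a rough illustration of the sliding-window idea, the sketch below walks a fixed-size window over a short string and pairs each context with the character that follows it. The helper name `sliding_windows` and the window size of 4 are assumptions made for this example only; they are not part of the notebook.

```python
def sliding_windows(text, window_size):
    """Yield (context, next_char) pairs from a left-to-right sliding window."""
    for start in range(len(text) - window_size):
        context = text[start:start + window_size]
        next_char = text[start + window_size]
        yield context, next_char


# Hypothetical toy input; the notebook itself uses the full book text.
pairs = list(sliding_windows("hello world", window_size=4))
for context, next_char in pairs[:3]:
    print(repr(context), "->", repr(next_char))
```

Each pair is one training signal: given the context, predict the next character.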


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/PacktPublishing/Hands-On-Generative-AI-with-Python-and-TensorFlow-2/blob/master/Chapter_9/language_model_lstm.ipynb)

## Import Required Libraries

In [None]:
import os
import math
import numpy as np
import tensorflow as tf

In [None]:
print("Tensorflow version={}".format(tf.__version__))

Tensorflow version=2.3.0


## Load Dataset

In [None]:
# https://www.gutenberg.org/ebooks/2600
datafile_path = r'warpeace_2600-0.txt'

In [None]:
# Load the text file and decode it as UTF-8
with open(datafile_path, 'rb') as f:
    text = f.read().decode(encoding='utf-8')
print('Book contains a total of {} characters'.format(len(text)))

Book contains a total of 3293673 characters


In [None]:
idx = 8091
print(text[idx:idx+500])


BOOK ONE: 1805





CHAPTER I

“Well, Prince, so Genoa and Lucca are now just family estates of the
Buonapartes. But I warn you, if you don’t tell me that this means war,
if you still try to defend the infamies and horrors perpetrated by that
Antichrist—I really believe he is Antichrist—I will have nothing
more to do with you and you are no longer my friend, no longer my
‘faithful slave,’ as you call yourself! But how do you do? I see I
have frightened you—sit down and tell me al


In [None]:
# Drop the first 8,091 characters, which contain
# Project Gutenberg front matter rather than the novel itself
text = text[8091:]

In [None]:
vocab = sorted(set(text))
print ('{} unique characters'.format(len(vocab)))

108 unique characters


## Prepare Dataset
+ Dictionary mapping each character to an integer index
+ Inverse mapping from each index back to its character

In [None]:
char2idx = {u:i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)

text_as_int = np.array([char2idx[c] for c in text])

In [None]:
print('{')
for char,_ in zip(char2idx, range(20)):
    print('  {:4s}: {:3d},'.format(repr(char), char2idx[char]))
print('  ...\n}')

{
  '\n':   0,
  '\r':   1,
  ' ' :   2,
  '!' :   3,
  '$' :   4,
  '%' :   5,
  '(' :   6,
  ')' :   7,
  '*' :   8,
  ',' :   9,
  '-' :  10,
  '.' :  11,
  '/' :  12,
  '0' :  13,
  '1' :  14,
  '2' :  15,
  '3' :  16,
  '4' :  17,
  '5' :  18,
  '6' :  19,
  ...
}
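A quick way to sanity-check the two mappings is a round trip: encode a string to indices, decode it back, and confirm nothing is lost. The sketch below does this on a small made-up sample; the `_demo` names are hypothetical and are kept separate from the notebook's `char2idx` and `idx2char`.

```python
# Hypothetical toy sample; the notebook builds its vocabulary from the whole book.
sample = "war and peace"

vocab_demo = sorted(set(sample))
char2idx_demo = {ch: i for i, ch in enumerate(vocab_demo)}
idx2char_demo = list(vocab_demo)

# Encode to integer ids, then decode back to text
encoded = [char2idx_demo[ch] for ch in sample]
decoded = "".join(idx2char_demo[i] for i in encoded)

print(vocab_demo)
print(encoded)
print(decoded == sample)
```

Because the vocabulary is sorted, the index assigned to each character is deterministic across runs.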


### Sample Output

In [None]:
print ('{} ---- char-2-int ----  {}'.format(repr(text[40:60]), text_as_int[40:60]))

'\n“Well, Prince, so G' ---- char-2-int ----  [  0 106  50  58  65  65   9   2  43  71  62  67  56  58   9   2  72  68
   2  34]


### Prepare Batch of Training Samples
+ Limit the sequence length to 100 characters
+ Use the ``tf.data`` API to prepare batches

In [None]:
seq_length = 100
examples_per_epoch = len(text)//(seq_length+1)

# Create training examples / targets
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)

for i in char_dataset.take(10):
    print(idx2char[i.numpy()])




B
O
O
K
 
O
N
E


In [None]:
sequences = char_dataset.batch(seq_length+1, drop_remainder=True)

for item in sequences.take(10):
    print(repr(''.join(idx2char[item.numpy()])))
    print("-"*110)

'\r\nBOOK ONE: 1805\r\n\r\n\r\n\r\n\r\n\r\nCHAPTER I\r\n\r\n“Well, Prince, so Genoa and Lucca are now just family estate'
--------------------------------------------------------------------------------------------------------------
's of the\r\nBuonapartes. But I warn you, if you don’t tell me that this means war,\r\nif you still try to'
--------------------------------------------------------------------------------------------------------------
' defend the infamies and horrors perpetrated by that\r\nAntichrist—I really believe he is Antichrist—I '
--------------------------------------------------------------------------------------------------------------
'will have nothing\r\nmore to do with you and you are no longer my friend, no longer my\r\n‘faithful slave'
--------------------------------------------------------------------------------------------------------------
',’ as you call yourself! But how do you do? I see I\r\nhave frightened you—sit down and tell me all the'
------
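The chunking step above can be pictured in plain Python: the id stream is cut into consecutive pieces of `seq_length + 1`, and a short leftover tail is discarded, which is what `drop_remainder=True` requests. The helper `chunk` below is a hypothetical stand-in for illustration, not the `tf.data` implementation.

```python
def chunk(ids, chunk_len):
    """Split ids into consecutive chunks of chunk_len, dropping any short tail."""
    n_full = len(ids) // chunk_len
    return [ids[i * chunk_len:(i + 1) * chunk_len] for i in range(n_full)]


# Ten toy ids cut into chunks of four: the last two ids do not fill a chunk.
ids = list(range(10))
chunks = chunk(ids, chunk_len=4)
print(chunks)
```

Each chunk carries one extra element so that it can later be split into an input and a target of equal length.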

### Prepare Input->Target samples

In [None]:
def split_input_target(chunk):
    """
    Utility which splits a chunk into an input and a target, where the
    target is the input shifted by one position.
    Parameters:
        chunk: sequence of character indices
    Returns:
        Tuple-> input_text (chunk minus its last character),
                target_text (chunk minus its first character)
    """
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text

dataset = sequences.map(split_input_target)

In [None]:
for input_example, target_example in  dataset.take(1):
    print ('Input data: ', repr(''.join(idx2char[input_example.numpy()])))
    print ('Target data:', repr(''.join(idx2char[target_example.numpy()])))

Input data:  '\r\nBOOK ONE: 1805\r\n\r\n\r\n\r\n\r\n\r\nCHAPTER I\r\n\r\n“Well, Prince, so Genoa and Lucca are now just family estat'
Target data: '\nBOOK ONE: 1805\r\n\r\n\r\n\r\n\r\n\r\nCHAPTER I\r\n\r\n“Well, Prince, so Genoa and Lucca are now just family estate'
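To make the one-position shift concrete, the sketch below applies the same slicing to a five-character string and lists which target character is expected at each time step. The variable names are assumptions for this toy example.

```python
# Hypothetical toy chunk of length seq_length + 1 (here 4 + 1)
chunk_demo = "Hello"

input_text = chunk_demo[:-1]   # everything except the last character
target_text = chunk_demo[1:]   # everything except the first character

# At step t the model sees input_text[t] and should predict target_text[t]
steps = list(zip(input_text, target_text))
for t, (seen, expected) in enumerate(steps):
    print("step {}: input {!r} -> target {!r}".format(t, seen, expected))
```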


In [None]:
# Batch size
BATCH_SIZE = 128
# Buffer size to shuffle the dataset
BUFFER_SIZE = 10000

In [None]:
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
print("Dataset Shape={}".format(dataset))

Dataset Shape=<BatchDataset shapes: ((128, 100), (128, 100)), types: (tf.int64, tf.int64)>
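The sizes reported so far are enough to estimate how many batches one epoch contains. The arithmetic below is a sketch that assumes the character count printed earlier (3,293,673) and the 8,091 characters trimmed from the start; both `drop_remainder=True` steps are modelled with integer division.

```python
total_chars = 3293673      # size of the file as printed above
header_chars = 8091        # characters removed from the start
seq_length = 100
batch_size = 128

usable_chars = total_chars - header_chars
# Each training example consumes seq_length + 1 characters
num_sequences = usable_chars // (seq_length + 1)
# Full batches only; a partial final batch is dropped
steps_per_epoch = num_sequences // batch_size

print(usable_chars, num_sequences, steps_per_epoch)
```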


## Prepare Language Model

In [None]:
def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
    """
    Utility to create a model object.
    Parameters:
        vocab_size: number of unique characters
        embedding_dim: size of the embedding vector, typically a power of 2
                       (64, 128, 256 and so on)
        rnn_units: number of LSTM units to be used
        batch_size: batch size the model is built for (fixed, because the
                    LSTM layer is stateful)
    Returns:
        tf.keras model object
    """
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embedding_dim,
                                  batch_input_shape=[batch_size, None]),
        tf.keras.layers.LSTM(rnn_units,
                             return_sequences=True,
                             stateful=True,
                             recurrent_initializer='glorot_uniform'),
        tf.keras.layers.Dense(vocab_size)
    ])
    return model

### Define the Model Parameters

In [None]:
# Length of the vocabulary in chars
vocab_size = len(vocab)

# The embedding dimension
embedding_dim = 256

# Number of RNN units
rnn_units = 1024

In [None]:
model = build_model(
    vocab_size=vocab_size,
    embedding_dim=embedding_dim,
    rnn_units=rnn_units,
    batch_size=BATCH_SIZE)

In [None]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (128, None, 256)          27648     
_________________________________________________________________
lstm (LSTM)                  (128, None, 1024)         5246976   
_________________________________________________________________
dense (Dense)                (128, None, 108)          110700    
Total params: 5,385,324
Trainable params: 5,385,324
Non-trainable params: 0
_________________________________________________________________
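The parameter counts in the summary can be reproduced by hand. The sketch below assumes the usual layout of each layer: an embedding table of `vocab_size x embedding_dim`, an LSTM with four gates that each have input weights, recurrent weights and a bias, and a dense layer with a weight matrix plus a bias.

```python
vocab_size = 108
embedding_dim = 256
rnn_units = 1024

# Embedding: one vector of size embedding_dim per character
embedding_params = vocab_size * embedding_dim

# LSTM: 4 gates, each with input weights, recurrent weights and a bias
lstm_params = 4 * (rnn_units * (embedding_dim + rnn_units) + rnn_units)

# Dense: weight matrix plus one bias per output character
dense_params = rnn_units * vocab_size + vocab_size

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

The LSTM layer dominates the total, which is why `rnn_units` has the largest effect on model size.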


In [None]:
def loss(labels, logits):
    return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)
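For a single time step, sparse categorical cross-entropy with `from_logits=True` amounts to applying a softmax to the raw scores and taking the negative log-probability of the true class. The function below is a plain-Python sketch of that calculation under this assumption, written for one label and one logit vector; it is not the Keras implementation, and its name is hypothetical.

```python
import math


def sparse_ce_from_logits(label, logits):
    """Negative log-softmax probability of `label`, computed stably."""
    m = max(logits)
    # log(sum(exp(z))) with the maximum subtracted for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


# Hypothetical logits over a 3-character vocabulary
logits = [2.0, 1.0, 0.1]
loss_likely = sparse_ce_from_logits(0, logits)    # true class has the top score
loss_unlikely = sparse_ce_from_logits(2, logits)  # true class has the lowest score
print(loss_likely, loss_unlikely)
```

The label is an integer index rather than a one-hot vector, which is why the integer-encoded targets prepared earlier can be used directly.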

In [None]:
model.compile(optimizer='adam', loss=loss)

### Setup Callbacks

In [None]:
# Directory where the checkpoints will be saved
checkpoint_dir = r'data/training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)
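The `{epoch}` placeholder in the checkpoint prefix is filled in once per epoch. The sketch below assumes it is substituted with ordinary string formatting using a 1-based epoch number, and uses `posixpath` so the result has forward slashes regardless of operating system.

```python
import posixpath

checkpoint_dir = "data/training_checkpoints"
checkpoint_prefix = posixpath.join(checkpoint_dir, "ckpt_{epoch}")

# Assumed naming for the first and the 25th epoch
first_name = checkpoint_prefix.format(epoch=1)
last_name = checkpoint_prefix.format(epoch=25)
print(first_name)
print(last_name)
```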

In [None]:
EPOCHS = 25
history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])

Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25


## Generate Fake Text

### Load Latest Checkpoint

In [None]:
# fetch the latest checkpoint from the model directory
tf.train.latest_checkpoint(checkpoint_dir)

'data/training_checkpoints/ckpt_25'

In [None]:
model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)

model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))

model.build(tf.TensorShape([1, None]))

In [None]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (1, None, 256)            27648     
_________________________________________________________________
lstm_1 (LSTM)                (1, None, 1024)           5246976   
_________________________________________________________________
dense_1 (Dense)              (1, None, 108)            110700    
Total params: 5,385,324
Trainable params: 5,385,324
Non-trainable params: 0
_________________________________________________________________


### Utility Function to Generate Text

In [None]:
def generate_text(model, mode='greedy', context_string='Hello', num_generate=1000, 
                  temperature=1.0):
    """
    Utility to generate text given a trained model and a context string.
    Parameters:
        model: trained tf.keras model built with batch_size=1
        mode: decoding mode, either 'greedy' (default) or
              'sampling' (controlled by temperature)
        context_string: input string which acts as context for the model
        num_generate: number of characters to be generated
        temperature: controls randomness in 'sampling' mode; lower values
                     give more predictable text
    Returns:
        string : context_string + generated text
    """

    # Vectorize: convert the context string into character indices
    input_eval = [char2idx[s] for s in context_string]
    input_eval = tf.expand_dims(input_eval, 0)

    # Generated characters are collected here
    text_generated = []
    model.reset_states()
    # Loop until the required number of characters is generated
    for i in range(num_generate):
        predictions = model(input_eval)
        # Drop the batch dimension: (seq_len, vocab_size)
        predictions = tf.squeeze(predictions, 0)
        if mode == 'greedy':
            # Most likely next character, read from the last time step
            predicted_id = np.argmax(predictions[-1])
        elif mode == 'sampling':
            # Temperature rescales the logits before sampling
            predictions = predictions / temperature
            # Sample from the categorical distribution at the last time step
            predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()
        else:
            raise ValueError("mode must be 'greedy' or 'sampling'")

        # The predicted character becomes the input for the next step
        input_eval = tf.expand_dims([predicted_id], 0)
        text_generated.append(idx2char[predicted_id])
    return (context_string + ''.join(text_generated))
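The two decoding modes can be illustrated without a trained model. The sketch below uses made-up logits to show that dividing by a temperature below 1 concentrates probability on the top-scoring character, while greedy decoding simply takes the arg-max. The helper names and the logits are assumptions for this example; sampling uses Python's `random` module rather than `tf.random.categorical`.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits divided by temperature (computed stably)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_id(logits, temperature, rng):
    """Draw one index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical logits for a 3-character vocabulary
logits = [2.0, 1.0, 0.5]

# Greedy decoding: always pick the highest-scoring index
greedy_id = max(range(len(logits)), key=lambda i: logits[i])

rng = random.Random(0)  # seeded for repeatability
for t in (0.3, 0.6, 0.9):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs], sample_id(logits, t, rng))
```

Lower temperatures push sampling toward the greedy choice, while values closer to 1 keep more of the model's original uncertainty.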

### Greedy Decoding

In [None]:
print(generate_text(model, mode= 'greedy', context_string=u"It was in July, 1805",num_generate=50))

It was in July, 1805-




CHAPTER XII

The former conditions of


### Sampled Decoding

In [None]:
print(generate_text(model, mode= 'sampling', context_string=u"It was in July, 1805",num_generate=100,temperature=0.3))

It was in July, 1805,

“Yes, I say, sir, and so it is the same thing!” said the countess, with a smile of the same tim


In [None]:
print(generate_text(model, mode= 'sampling', context_string=u"It was in July, 1805",num_generate=100,temperature=0.6))

It was in July, 1805, and the country former
adjutants were no longer than as it was done herself with his stories and 


In [None]:
print(generate_text(model, mode= 'sampling', context_string=u"It was in July, 1805",num_generate=100,temperature=0.9))

It was in July, 1805, I spoke to them,
and Bonaparte was foreshed the effect one
another; intelligent, or by asking to


In [None]:
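def dummy(ctr, max_depth, p_list, s):
    """
    Scratch helper that recursively expands the first `s` entries of
    `p_list` until `ctr` reaches `max_depth`, returning a nested list.
    """
    # Debug trace of the current depth and candidate list
    print(ctr, p_list)
    print("*********")
    if ctr == max_depth:
        return -1
    rt = []
    for i in range(s):
        rt.append([p_list[i], dummy(ctr+1, max_depth, p_list[1:], s)])
    return rt

In [None]:
ctr = 0
max_depth = 3  # named max_depth to avoid shadowing the built-in max()
p_list = [1,2,3,4,5,6,7,8,9,10]
x = dummy(ctr, max_depth, p_list, 2)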