# Language Modeling with Recurrent Neural Networks using Keras
checked 28.02.24 GPaaß

This notebook uses code from [here](http://www.cs.virginia.edu/~vicente/vislang/notebooks/language_generation_lab.html). It uses [Keras](https://keras.io/), a Python deep learning framework that lets you quickly put together neural network models with a minimal amount of code. Keras runs on top of [TensorFlow](https://www.tensorflow.org/), so you can build models without working with the underlying framework directly. It provides implementations of many of the layer architectures, objective functions, and optimization algorithms you need for building a model.

Prediction Task: **Language Modelling**
* predict next words in a text given a history of previous words.
* Dataset: A set of 400000 captions for images.

This model can be used to compute the probability of a sequence, as well as generate new sequences.
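
As a small illustration of how a language model assigns a probability to a whole sequence, the sketch below multiplies per-word conditional probabilities (the chain rule). The numbers in `step_probs` are made up for illustration and do not come from the trained model.

```python
import math

# Hypothetical conditional probabilities p(w_t | w_1 .. w_{t-1}) for the
# four words of one caption (illustrative values, not model output).
step_probs = [0.20, 0.05, 0.30, 0.10]

# Chain rule: p(w_1 .. w_n) is the product of the conditional probabilities.
# Summing logarithms avoids numerical underflow for long sequences.
log_prob = sum(math.log(p) for p in step_probs)
sequence_prob = math.exp(log_prob)

print("log-probability:", log_prob)
print("probability:", sequence_prob)
```

In practice one usually compares log-probabilities directly, because the product of many small probabilities quickly becomes tiny.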


In [None]:
import glob
import json
import math
import os, sys

import numpy as np
import tensorflow as tf
from matplotlib import pyplot
%matplotlib inline


`print_mat`: pretty-print a matrix or dataframe

In [None]:
#@title
def print_mat(x, title="", prtDim=True, max_rows=10, max_columns=10, precision=3, doRound=True, index=None, rowNames=None, colNames=None):
    """ use pandas display to print a matrix or dataframe
        title: to be printed
        prtDim: True  print the shape after the title
        max_rows: number or None
        max_columns: number or None
        precision: number of decimals
        doRound: True  perform rounding (avoid E notation)
        index: unused
        rowNames: None  row names
        colNames: None  column names
    """
    import pandas as pd
    import tensorflow as tf
    import numpy as np
    from IPython.display import display
    with pd.option_context('display.max_rows', max_rows, 'display.max_columns', max_columns, 'display.precision', precision):
        if tf.is_tensor(x):
            x = x.numpy()
        if doRound:
            x = np.round(x, decimals=precision)
        if title != "":
            if prtDim:
                print(title, x.shape)   # title followed by the shape
            else:
                print(title)            # title only
        display(pd.DataFrame(x, index=rowNames, columns=colNames))


## Dataset of Image Captions

We will first read the sentences from the MS COCO dataset. The file was downloaded from http://mscoco.org/dataset/#download. It contains about 5 descriptions for each of roughly 80,000 images, for a total of about 400k descriptions.

## Reading and Preprocessing
Each word is translated to a numerical index.
When we apply the model to generation later, it will output words as indices, so we'll need to map each numerical index back to its corresponding string representation. We'll reverse the lexicon dictionary so that a word can be looked up by its index.
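
The mapping between words and indices, and its reversal, can be sketched with a toy lexicon. The names `toy_word2id` and `toy_id2word` and the four words are assumptions for illustration only; the real lexicon is built from the captions later in this notebook.

```python
# Toy lexicon: word -> index (hypothetical values for illustration).
toy_word2id = {"a": 1, "man": 2, "riding": 3, "horse": 4}

# Reverse the dictionary so that a word can be looked up by its index.
toy_id2word = {idx: word for word, idx in toy_word2id.items()}

# Encode a sentence as indices and decode it back to a string.
encoded = [toy_word2id[w] for w in "a man riding a horse".split()]
decoded = " ".join(toy_id2word[i] for i in encoded)

print(encoded)
print(decoded)
```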


In [None]:
modelType="big"
if modelType=="small":
    use_perc = 30       # only read this percentage of the data
elif modelType=="big":
    use_perc = 100      # read all of the data
else:
    raise ValueError("modelType must be 'small' or 'big'")

In [None]:
!wget http://images.cocodataset.org/annotations/annotations_trainval2014.zip

In [None]:
!unzip annotations_trainval2014.zip

In [None]:
vocabularySize = 1000  # vocabulary size.
assert 0 < use_perc <= 100
with open('annotations/captions_train2014.json') as f:   # close the file after reading
    mscoco = json.load(f)

captionStrings = []
for entry in mscoco['annotations']:
    if 'caption' in entry and len(str(entry['caption']))>0:
        captionStrings.append(str(entry['caption']))
print('Number of sentences', len(captionStrings))
lng = math.floor(len(captionStrings)*use_perc*0.01)

print('use_perc',use_perc)
captionStrings = captionStrings[:lng]
print('Kept number of sentences', len(captionStrings))
print('First 2 sentences in the list:\n', captionStrings[0:2])

## Defining a word vocabulary
Next, we define a vocabulary and assign a word id to each unique word in the dataset. We keep only the most common words: with `num_words = vocabularySize = 1000` the Keras tokenizer uses the ids 1 to 999, because id 0 is reserved. Words outside this vocabulary are dropped from the sequences. Each sentence can then be transformed into an array of word ids. This preprocessing is already implemented in the Keras `Tokenizer` class:


In [None]:
# Split sentences into words, and define a vocabulary with the most common words.
tokenizer = tf.keras.preprocessing.text.Tokenizer(
    num_words = vocabularySize,
    filters = '!"#$%&()*+,-./:;<=>?@\\^_`{|}~\t\n')
tokenizer.fit_on_texts(captionStrings)

# Convert the sentences into sequences of word ids using our vocabulary.
captionSequences = tokenizer.texts_to_sequences(captionStrings)

# Keep dictionaries that map ids -> words, and words -> ids.
word2id = tokenizer.word_index
id2word = {idx: word for (word, idx) in word2id.items()}
maxSeqLen = max(
    [len(seq)
     for seq in captionSequences])  # Find the sentence with most words.


print('Max Sequence Length', maxSeqLen)

In [None]:
# Print some output to verify the above.
for i in range(2):
  print('Original string:\t', captionStrings[i])
  print('Sequence of Word Ids:\t', captionSequences[i])
  print('Word Ids back to Words:\t',
      " ".join([id2word[idx] for idx in captionSequences[i]]))

## Padding to Maximum Length
Another piece of pre-processing that we might need is padding the sequences with zeroes so that all sequences have the same length and we can put them in a single matrix. This is implemented in Keras using the pad_sequences function.

In [None]:
# By default pad_sequences pads with zeroes at the beginning (why would that be preferable?),
# but we override that default behavior by using padding = 'post'.
pad_sequences = tf.keras.preprocessing.sequence.pad_sequences
padded_seqs = pad_sequences(
    captionSequences,
    maxlen=(maxSeqLen + 1),
    padding='post',
    truncating='post')

id2word[0] = 'END'  # id2word[0] is empty before
word2id['END'] = 0

# Let's print some output.
print(padded_seqs.shape)  # This is num_sentences x (maxSeqLen + 1).
# Let's try converting back the first sequence into words again.
print(" ".join([id2word[idx] for idx in padded_seqs[0]]))
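
To make the effect of `padding='post'` and `truncating='post'` concrete, here is a plain-Python sketch of the same idea on toy data. The helper `pad_post` is a hypothetical name and is not part of Keras; it only mimics the behavior used above.

```python
def pad_post(sequences, maxlen, value=0):
    """Cut each sequence at maxlen items, then fill up with `value` at the end."""
    padded = []
    for seq in sequences:
        kept = list(seq[:maxlen])                      # truncate at the end
        padded.append(kept + [value] * (maxlen - len(kept)))  # pad at the end
    return padded

toy_sequences = [[5, 3, 8], [7], [2, 9, 4, 1, 6]]
toy_padded = pad_post(toy_sequences, maxlen=4)
for row in toy_padded:
    print(row)
```

All rows now have the same length, so they can be stacked into a single matrix.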

The outputs to be predicted are the inputs shifted by one position: at each time step the target is the next word.

In [None]:
inputData = padded_seqs[:, :-1]  # words 1, 2, 3, ... , (n-1)
outputData = padded_seqs[:, 1:]  # words 2, 3, 4, ... , (n)
print_mat(inputData,"inputData",doRound=False,max_rows=5,max_columns=None)
print_mat(outputData,"outputData",doRound=False,max_rows=5,max_columns=None)
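
The shift can be checked on a single toy caption. The ids below are made up; 0 stands for the 'END'/padding id as defined above.

```python
# One toy padded caption (hypothetical ids, 0 = END / padding).
toy_caption = [4, 9, 2, 7, 0, 0]

toy_inputs = toy_caption[:-1]    # words 1 .. n-1
toy_targets = toy_caption[1:]    # words 2 .. n

# Each pair is (current word, word to be predicted).
pairs = list(zip(toy_inputs, toy_targets))
for current, target in pairs:
    print(current, "->", target)
```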

Create training and test data (80 % / 20 % split)

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(
    inputData, outputData, test_size=0.20, random_state=42)
print("x_train.shape",x_train.shape,"y_train.shape",y_train.shape)
print("x_test.shape",x_test.shape,"y_test.shape",y_test.shape)

## Building our model using a Recurrent Neural Network

Next we will create a recurrent neural network using Keras.
- It takes a batch of word-id sequences of size `(batch_size, maxSeqLen)` as input.
- The output of this network is a tensor of size `(batch_size, maxSeqLen, vocabularySize)`. <br>
Notice that the output has a different size than the input: it contains a pseudo-probability distribution (the output of a softmax layer) for every time step in the sequence. That is, at each time step it outputs the probability of each word in the vocabulary being the next word.
<img src="img/RNN.png" style="max-width:70%">
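
As a reminder of what the softmax layer computes at each time step, the following plain-Python sketch turns a vector of scores into a probability distribution. The four scores are arbitrary illustrative values for a vocabulary of four words.

```python
import math

def softmax(logits):
    """Map arbitrary scores to positive values that sum to 1."""
    m = max(logits)                               # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.1, -1.0]                # hypothetical scores for 4 words
toy_probs = softmax(toy_logits)
print([round(p, 3) for p in toy_probs])
```

The word with the largest score receives the largest probability, but every word keeps a non-zero probability, which is what makes random sampling possible later on.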

In [None]:
from tensorflow import keras
from tensorflow.keras import layers
if modelType=="small":  # for model with 30% data
    emb_size = 128
    hid_size = 256
    dropout = 0.3
    batch_size=256
else:                  # for model with 100% data
    emb_size = 128    # original: 300
    hid_size = 256    # original: 512
    dropout = 0.3
    batch_size=256
rnnType = 'lstm'

In [None]:
def create_model(rnnType, maxSeqLen, vocabularySize, emb_size,hid_size,dropout):
    rnnLayers = {'rnn': layers.SimpleRNN, 'gru': layers.GRU, 'lstm': layers.LSTM}
    if rnnType not in rnnLayers:
        raise ValueError("rnnType must be 'rnn', 'gru' or 'lstm'")
    RNN = rnnLayers[rnnType]

    print('Building training model...')
    # Remember that in libraries like Keras/Tensorflow, you only need to implement the forward pass.
    # Here we show how to do that for our model.

    # Define the shape of the inputs: batchSize x maxSeqLen.
    words = keras.Input(batch_shape=(None, maxSeqLen),
                       name="input")

    # Build a matrix of size vocabularySize x emb_size where each row corresponds to a "word embedding" vector.
    # This layer replaces each word-id with a word-vector of size emb_size.
    embeddings = layers.Embedding(
        vocabularySize, emb_size, name="embeddings")(words)

    # Pass the word-vectors to the recurrent layer (SimpleRNN, GRU or LSTM).
    # The hidden-state size is hid_size.
    # The output will be batchSize x maxSeqLen x hid_size.
    hiddenStates = RNN(
        hid_size,
        return_sequences=True,  # return hidden vector for each position, not only the final
        dropout=dropout,        # use dropout for regularization
        name="rnn")(embeddings)

    # Apply a linear (Dense) layer of size hid_size x vocabularySize to the outputs of the recurrent layer at each time step.
    denseOutput = layers.TimeDistributed(
        layers.Dense(vocabularySize), name="linear")(hiddenStates)
    # generate probabilities for words
    predictions = layers.TimeDistributed(
        layers.Activation("softmax"), name="softmax")(denseOutput)

    # Build the computational graph by specifying the input, and output of the network.
    model = keras.Model(inputs=words, outputs=predictions)

    model.summary()   # summary() prints the layer overview itself

    return model

model = create_model(rnnType, maxSeqLen, vocabularySize, emb_size,hid_size,dropout)

model.compile(
    loss='sparse_categorical_crossentropy',
    metrics = ['accuracy'],
    optimizer=keras.optimizers.RMSprop(learning_rate=0.001))   # `lr` is a deprecated alias
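
The parameter counts reported by `model.summary()` can be verified by hand. The sketch below assumes the settings that are in effect above (`vocabularySize = 1000`, `emb_size = 128`, `hid_size = 256`, `rnnType = 'lstm'`) and the standard LSTM layout with four gates and one bias vector per gate; other settings or layer types give different numbers.

```python
vocabulary_size, emb_size, hid_size = 1000, 128, 256   # values assumed from the cells above

# Embedding: one vector of length emb_size per word id.
embedding_params = vocabulary_size * emb_size

# LSTM: 4 gates, each with input weights, recurrent weights and a bias.
lstm_params = 4 * ((emb_size + hid_size) * hid_size + hid_size)

# Dense output layer: weights plus bias for every vocabulary entry.
dense_params = hid_size * vocabulary_size + vocabulary_size

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

The softmax activation has no parameters of its own, so it does not appear in the sum.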

Sample 10 inputs from the training data and verify everything works.

In [None]:
sample_inputs = padded_seqs[0:10, :-1]    # exclude last element
sample_outputs = model.predict(sample_inputs)
print('input size', sample_inputs.shape)
print('predicted output size', sample_outputs.shape)

In [None]:
sample_outputs


## Training the Model

Keras already implements a generic training loop in the `model.fit` function. It also provides `model.train_on_batch`, which we might need to save memory (e.g. if we want to avoid loading the whole dataset into memory at once). For more information about the Keras model functionality see: https://keras.io/models/model/

If you installed Tensorflow with GPU support, this will automatically run on the GPU!


In [None]:
modelSavePathSmall = os.getcwd()+'/small_'+rnnType + "_best.keras"
modelSavePathBig = os.getcwd()+'/big_'+rnnType + "_best.keras"
if modelType=="small":  # for model with 30% data
    epochs = 5
    modelSavePath = modelSavePathSmall
else:
    epochs = 15
    modelSavePath = modelSavePathBig

# the model file is written to the current working directory
print("Will save best model to",modelSavePath)

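Save only the best model (here the one with the lowest training loss, since the checkpoint monitors `loss`) and stop early once the validation loss has not improved for 3 epochs.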

In [None]:
ModelCheckpoint = tf.keras.callbacks.ModelCheckpoint
checkpointer = ModelCheckpoint(
    filepath=modelSavePath,
    save_best_only=True,
    monitor='loss')
# configure early stopping
estop = keras.callbacks.EarlyStopping(monitor='val_loss',
                                      patience=3)  # Stop after this number of epochs with no improvement.
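
The idea behind the `patience` setting can be sketched in plain Python. This is a simplified illustration, not the Keras implementation (which has further options such as `min_delta`); the function `early_stop_epoch` and the list of validation losses are hypothetical.

```python
def early_stop_epoch(val_losses, patience=3):
    """Return the (1-based) epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0                              # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

toy_val_losses = [0.90, 0.70, 0.65, 0.66, 0.67, 0.68, 0.60]   # made-up values
stopped_at = early_stop_epoch(toy_val_losses)
print("training would stop after epoch", stopped_at)
```

Here the best value is reached in epoch 3, the next three epochs bring no improvement, and training stops after epoch 6 even though epoch 7 would have been better. A larger `patience` trades training time for robustness against such fluctuations.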

V100: ~ 22 sec/epoch if 100% of data is used (emb_size = 300 ,hid_size = 512 )


In [None]:
epochs=1  #21 sec/epoch

In [None]:
history = model.fit(
    x_train,
    y_train,
    validation_data = (x_test,y_test),
    batch_size=batch_size,
    epochs=epochs,
    callbacks=[checkpointer,estop])

!ls

#model.save_weights(modelSavePath)
#print("Saved parameters to",modelSavePath)

In [None]:

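# The '.keras' file extension selects the Keras format.
# Note: this overwrites the best checkpoint written by ModelCheckpoint with the final model.
model.save(modelSavePath)
print("Saved model to",modelSavePath)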

Example log line from a run on 100 % of the data:
```
Epoch 36/50
331290/331290 [==============================] - 23s 70us/sample - loss: 0.5747 - accuracy: 0.8751 - val_loss: 0.5744 - val_accuracy: 0.8758
```

**Monitoring** <br>
You can check CPU activity with `htop` in a terminal. <br>
You can check GPU activity with `watch nvidia-smi` in a terminal.

### Explicit Training Loop
We could also go batch by batch ourselves. However, the above
function worked well, so we do not follow this route here.
```
import random

trainSize = inputData.shape[0]
batchSize = 100
nBatches = trainSize // batchSize    # integer division: range() needs an int
for b in range(nBatches):
    # Draw a random batch (with replacement) of inputs and labels.
    idx = [random.randint(0, trainSize - 1) for _ in range(batchSize)]
    batchInputs = inputData[idx, :]
    # The model is compiled with sparse_categorical_crossentropy,
    # so the labels stay integer word ids (no one-hot encoding needed).
    batchLabels = outputData[idx, :]

    model.train_on_batch(batchInputs, batchLabels)
```

In [None]:
# plot history
pyplot.plot(history.history['loss'], label='train loss')
pyplot.plot(history.history['val_loss'], label='validation loss')
pyplot.legend()
pyplot.show()


## Building the Inference Model.

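Now we load the saved model for inference. It has exactly the same architecture as the model used for training:
* It still expects an input of length `maxSeqLen`. To predict from **a single word** we put that word at the first position and fill the rest with zeros; the prediction for the next word is then read from the first time step of the output.
* The loaded model is not stateful: it only sees the words contained in its input and does not remember earlier calls.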


In [None]:
!ls # list files in output directory

In [None]:
readBigModel = True
if readBigModel:
    inference_model = keras.models.load_model(modelSavePathBig)
else:
    inference_model = keras.models.load_model(modelSavePathSmall)
inference_model

In [None]:
shortSeqLen=1  # only 1 word as start

Given the token 'the', predict the most likely next words.

In [None]:
startWord = np.zeros((1, maxSeqLen))
startWord[0, 0] = word2id['the']
nextWordProbabilities = inference_model.predict(startWord)

# Indices of the 10 most probable next words (read at position 0 of the output).
top_inds = (-nextWordProbabilities[0, 0, :]).argsort()[:10]

# Print the most probable next words given the previous word.
for iw in top_inds:
    print("{:10.3f}".format(nextWordProbabilities[0,0,iw]), id2word[iw])

## Sampling a Complete New Sentence

Now that we have our `inference_model` working, we can produce new sentences by randomly sampling from the next-word probabilities, one step at a time. We rely on the `np.random.multinomial` function from NumPy; please check its documentation to make sure you understand what it does: http://docs.scipy.org/doc/numpy/reference/generated/numpy.random.multinomial.html

In [None]:
word1 = 'the'
# The loaded model is not stateful, so reset_states() is not needed. Instead we keep
# the whole history in `sequence` and read the prediction at the position of the last word.
sequence = np.zeros((1, maxSeqLen))
sequence[0, 0] = word2id[word1]
print(word1,"         = given first word")
for i in range(maxSeqLen - 1):
    nextWordProbs = inference_model.predict(sequence, verbose=0)[0,i,:]
    # Renormalize in float64: multinomial requires the probabilities to sum to at most 1.
    nextWordProbs = np.asarray(nextWordProbs).astype('float64')
    nextWordProbs = nextWordProbs / nextWordProbs.sum()
    nextWordId = np.random.multinomial(1, nextWordProbs).argmax()
    print("{:10.3f}".format(nextWordProbs[nextWordId]), id2word[nextWordId])
    sequence[0, i + 1] = nextWordId   # append the sampled word to the history
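
To see what sampling from a categorical distribution means, here is a small standard-library sketch. It uses `random.choices` instead of `np.random.multinomial`, and the four-word vocabulary with its probabilities is made up for illustration.

```python
import random
from collections import Counter

random.seed(0)                                   # fixed seed for a repeatable run

toy_vocab = ["END", "a", "man", "dog"]           # hypothetical vocabulary
toy_probs = [0.1, 0.5, 0.3, 0.1]                 # hypothetical next-word probabilities

# Draw many samples: each word appears roughly in proportion to its probability.
draws = random.choices(toy_vocab, weights=toy_probs, k=10000)
counts = Counter(draws)
for word in toy_vocab:
    print(word, counts[word] / len(draws))
```

Unlike always picking the most probable word, sampling occasionally selects less likely words, which is why repeated runs of the generation loop yield different sentences.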

In [None]:
# access the parameters of the model
for ww in model.weights:
    print(ww.shape)

In [None]:
model.weights[0]

Notice how the model learns to keep predicting 'END' once it has predicted the first 'END' and does not produce any other word after that. We can therefore stop the for loop as soon as 'END' is sampled. This produces sentences of variable length, meaning the model has learned when to finish a sentence. At this point in training the sentences may not be perfect: the model has probably learned to produce basic sentences, but it still generates incoherent output from time to time. Training the model for longer should improve the results.
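
Stopping at 'END' could look like the following sketch. To keep it self-contained it does not call the trained model: `fake_next_word` is a hypothetical stand-in that returns a fixed series of word ids, and `generate` is an illustrative helper, not part of the notebook.

```python
END_ID = 0   # id of the 'END' token, as defined in the padding step above

def generate(next_word_fn, start_id, max_len):
    """Append words until END is produced or max_len is reached."""
    sequence = [start_id]
    while len(sequence) < max_len:
        next_id = next_word_fn(sequence)
        if next_id == END_ID:
            break                       # the sentence is finished
        sequence.append(next_id)
    return sequence

canned_ids = [7, 3, 9, 0, 0, 0]         # made-up "model output"

def fake_next_word(history):
    # Stand-in for sampling from the model given the history so far.
    return canned_ids[len(history) - 1]

generated = generate(fake_next_word, start_id=1, max_len=10)
print(generated)
```

With the real model, `next_word_fn` would sample from the predicted next-word distribution as in the loop above.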

## tf.data for fast Transfer of Data
GPUs and TPUs can radically reduce the time required to execute a single training step. Achieving peak performance requires an efficient input pipeline that delivers data for the next step before the current step has finished. The tf.data API helps to build flexible and efficient input pipelines.

The tf.data API introduces a tf.data.Dataset abstraction that represents a sequence of elements, in which each element consists of one or more components. For example, in an image pipeline, an element might be a single training example, with a pair of tensor components representing the image and its label.

We have the following functionalities:
* [`tf.data.Dataset.from_tensors()`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#from_tensors) and [`tf.data.Dataset.from_tensor_slices()`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#from_tensor_slices) to construct a `Dataset` from data in memory; the latter, used below, creates one element per row.<br/>
`tf.data.TFRecordDataset()` to construct a `Dataset` from a file in the recommended `TFRecord` format.
* [`Dataset.cache()`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#cache) caches elements either in the specified file or in memory.
* [`Dataset.shuffle()`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#shuffle) randomly shuffles the elements of this dataset.
* [`Dataset.map()`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#map) applies per-element transformations.
* [`Dataset.batch()`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#batch) creates batches.
* [`Dataset.repeat(count=?)`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#repeat) repeats this dataset so each original value is seen `count` times (default: indefinite repetition).

See the [documentation](https://www.tensorflow.org/api_docs/python/tf/data/Dataset) for  a complete list of transformations.
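
The effect of a shuffle buffer followed by batching can be illustrated without TensorFlow. The generators below are a simplified plain-Python sketch of the idea, not the `tf.data` implementation; `shuffle_buffer` and `batch` are hypothetical helper names.

```python
import random

def shuffle_buffer(items, buffer_size, rng):
    """Yield items in random order, using only a buffer of limited size."""
    buffer = []
    for item in items:
        buffer.append(item)
        if len(buffer) > buffer_size:
            yield buffer.pop(rng.randrange(len(buffer)))
    while buffer:                                   # drain the remaining items
        yield buffer.pop(rng.randrange(len(buffer)))

def batch(items, batch_size):
    """Group consecutive items into lists of batch_size (last one may be shorter)."""
    current = []
    for item in items:
        current.append(item)
        if len(current) == batch_size:
            yield current
            current = []
    if current:
        yield current

rng = random.Random(0)
toy_batches = list(batch(shuffle_buffer(range(10), buffer_size=4, rng=rng), batch_size=3))
print(toy_batches)
```

A small buffer only shuffles locally: an element can never appear much earlier than its original position, which is why the buffer size should be large relative to the dataset.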

Let's now use [tf.data](https://www.tensorflow.org/api_docs/python/tf/data) to shuffle, batch, and cache the dataset.


In [None]:
def pdata(nam, dat,n=2):
    """ to print data from a tf.Dataset"""
    print(nam)
    itm =0
    for elem in dat:
        print(elem)
        itm +=1
        if itm>=n:
            break

In [None]:
train_data = tf.data.Dataset.from_tensor_slices((x_train, y_train))
pdata("from_slices",train_data,n=1)

In [None]:
BUFFER_SIZE = 10000
train_data = train_data.cache()
train_data = train_data.shuffle(BUFFER_SIZE)
train_data = train_data.batch(batch_size)
pdata("\nbatch with BATCH_SIZE="+str(batch_size),train_data,n=1)

In [None]:
# repeat() without a count repeats the dataset indefinitely, so fit() needs steps_per_epoch.
train_data = train_data.repeat()

test_data = tf.data.Dataset.from_tensor_slices((x_test, y_test))
test_data = test_data.batch(batch_size).repeat()

In [None]:
model1 = create_model(rnnType, maxSeqLen, vocabularySize, emb_size,hid_size,dropout)

model1.compile(
    loss='sparse_categorical_crossentropy',
    metrics = ['accuracy'],
    optimizer=keras.optimizers.RMSprop(learning_rate=0.001))   # `lr` is a deprecated alias

20% data: 23 sec/epoch
~ 116 sec/epoch if 100% of data is used (emb_size = 300 ,hid_size = 512 )

In [None]:
steps_per_epoch = int(len(x_train)/batch_size)
history = model1.fit(
    train_data,
    steps_per_epoch = steps_per_epoch,
    validation_data = test_data,
    validation_steps = 50,
    epochs=epochs,
    callbacks=[checkpointer])
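
As a worked example of the `steps_per_epoch` computation: the training log shown earlier reports 331290 training samples for 100 % of the data. Assuming that number and `batch_size = 256`, the arithmetic is:

```python
num_train = 331290      # sample count taken from the training log above
batch_size = 256        # value set in the hyperparameter cell

steps_per_epoch = num_train // batch_size           # full batches per epoch
leftover = num_train - steps_per_epoch * batch_size # samples beyond the full batches

print("steps per epoch:", steps_per_epoch)
print("samples not covered by full batches:", leftover)
```

Because the dataset is repeated indefinitely, the leftover samples are not lost; they are simply consumed at the start of the next "epoch" as counted by `fit`.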
