# Language modeling with recurrent neural networks using Keras


## Overview

We will see how to build a recurrent neural network (RNN) language model that learns the relation between words in text, using the Keras library for machine learning. We will then see how this model can be used to compute the probability of a sequence, as well as generate new sequences.

### Language Modeling

A language model is a model of the probability of word sequences. These models are useful for a variety of tasks, such as selecting the most likely output from a set of candidates provided by a speech recognition or machine translation system. Here, we will see how a language model can be used to generate sequences, in particular the endings of stories. Language generation is a difficult research problem which is generally addressed by more complex models than the one shown here.

Traditionally, the most well-known approach to language modeling relies on n-grams. The limitation of n-gram language models is that they only explicitly model the probability of a sequence of *n* words. In contrast, RNNs can model longer sequences and thus typically are better at predicting which words will appear in a sequence. See the [chapter in Jurafsky & Martin's *Speech and Language Processing*](https://web.stanford.edu/~jurafsky/slp3/4.pdf) to learn more about traditional approaches to language modeling. 
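
To make the n-gram limitation concrete, here is a minimal sketch of a bigram (*n* = 2) model built from raw counts. The toy corpus and the `BOS`/`EOS` boundary markers are assumptions made for illustration, and no smoothing is applied, so any unseen word pair gets probability zero.

```python
from collections import Counter

# Hypothetical toy corpus; each sentence is already tokenized.
corpus = [
    ["BOS", "the", "dog", "barked", "EOS"],
    ["BOS", "the", "cat", "slept", "EOS"],
    ["BOS", "the", "dog", "slept", "EOS"],
]

# Count how often each word appears as a context, and how often each pair occurs.
context_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    for prev_word, word in zip(sentence[:-1], sentence[1:]):
        context_counts[prev_word] += 1
        bigram_counts[(prev_word, word)] += 1

def bigram_probability(sentence):
    """Multiply P(word | previous word) over the sentence (raw counts, no smoothing)."""
    prob = 1.0
    for prev_word, word in zip(sentence[:-1], sentence[1:]):
        if context_counts[prev_word] == 0:
            return 0.0  # the context word was never seen
        prob *= bigram_counts[(prev_word, word)] / context_counts[prev_word]
    return prob

p_seen = bigram_probability(["BOS", "the", "dog", "slept", "EOS"])
p_unseen = bigram_probability(["BOS", "the", "cat", "barked", "EOS"])
print(p_seen, p_unseen)
```

Each factor looks only one word back, so the model cannot use anything earlier in the sentence. That is the restriction an RNN is meant to relax.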

### Recurrent Neural Networks (RNNs)

RNNs are a general framework for modeling sequence data and are particularly useful for natural language processing tasks. At a high level, RNNs encode sequences via a set of parameters (weights) that are optimized to predict some output variable. The focus of this notebook is on the code needed to assemble a model in Keras, as well as some data processing tools that facilitate building the model. 

If you understand how to structure the input and output of the model, and know the fundamental concepts in machine learning, then a high-level understanding of how an RNN works is sufficient for using Keras. You'll see that most of the code here is actually just data manipulation, and I'll visualize each step in this process. The code used to assemble the RNN itself is more minimal. It is of course useful to know the technical details of the RNN, so you can reason about the results and modify the model to improve it. For a better understanding of RNNs and neural networks in general, see the resources at the bottom of the notebook.

Here an RNN will be used as a language model, which can predict which word is likely to occur next in a text given the words before it.

### Installation of spaCy

Two libraries are widely used in the NLP world: spaCy and NLTK. We use spaCy here because it is easy to set up and use.
> If you want to know the differences between spaCy and NLTK, consult this article:
> https://medium.com/@akankshamalhotra24/introduction-to-libraries-of-nlp-in-python-nltk-vs-spacy-42d7b2f128f2


***Installation via pip***

````
pip3 install -U spacy
````

***Installation via conda***
````
conda install -c conda-forge spacy
````

**Then you must install the English language model!**

````
python3 -m spacy download en_core_web_sm
````

## Dataset

Our research is on story generation, so we've selected a dataset of stories as the text to be modeled by the RNN. They come from the [ROCStories](http://cs.rochester.edu/nlp/rocstories/)  dataset, which consists of thousands of five-sentence stories about everyday life events. Here the model will be trained to predict each word in the story based on the preceding words. Then we will use the trained model to generate the final sentence in a set of stories not observed during training. The full dataset is available at the above link and just requires filling out a form to get access. Here, we will use a sample of 100 stories.


In [None]:
from __future__ import print_function # Python 2/3 compatibility for print statements; must be the first statement in the cell
# I had to set this environment variable because Keras was generating errors for me;
# it may work for you without it.
import os
os.environ["MKL_THREADING_LAYER"] = "GNU"
import pandas as pd
pd.set_option('display.max_colwidth', 300) # widen pandas rows display

The data is stored in a CSV file, which we'll load with **pandas**. You should know how to do it!  

**Exercise :** Load the dataset from the path ``dataset/example_train_stories.csv`` and save it in a variable ``train_stories``. Load only the first 100 stories to keep the workload light for your computer.

In [None]:
'''Load the training dataset'''
### ENTER YOUR CODE HERE ( 1 line)

### END
train_stories[:10]

## Preparing the data

The model we'll create is a word-based language model, which means each input unit is a single word (alternatively, some language models learn subword units like characters).

###  Tokenization

The first pre-processing step is to tokenize each of the stories into (lowercased) individual words, since the RNN will encode the stories word by word. For this we will use [spaCy](https://spacy.io/), which is a fast and extremely user-friendly library that performs various language processing tasks. Once you load a spaCy model for a particular language, you can provide any text as input to the model (e.g. encoder(text)) and access its linguistic features.

**Exercise:** Load the English model with spaCy (the ``en_core_web_sm`` model installed above). Check the documentation if necessary. Save it in the ``encoder`` variable.

In [None]:
'''Split texts into lists of words (tokens)'''

import spacy

###ENTER YOUR CODE HERE 1 line
### Load the English model (en_core_web_sm) with spacy

### END

We will create a text_to_tokens function because we will reuse this piece of code for the tests.

In [None]:
def text_to_tokens(text_seqs):
    token_seqs = [[word.lower_ for word in encoder(text_seq)] for text_seq in text_seqs]
    return token_seqs

As you can see, each story is split into a list of its lowercased words (tokens).

In [None]:
example = pd.DataFrame(['Hello, my name is Ludo. I learn deep learning . And i like it !', 
                     'This is the second story. There is no beginning, no end.'])
example = text_to_tokens(example[0])
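
To get a feel for what a token list looks like without loading spaCy, here is a rough stand-in based on a regular expression. It is only an approximation: the hypothetical `rough_tokenize` function is not part of the notebook, and spaCy's tokenizer treats cases such as contractions, abbreviations and URLs differently.

```python
import re

def rough_tokenize(text):
    """Lowercase the text, then split it into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

tokens = rough_tokenize("Hello, my name is Ludo. I learn deep learning .")
print(tokens)
```

Punctuation marks become tokens of their own, which is why they are counted as lexicon entries later on.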

**Exercise :** Tokenize the stories in ``train_stories['Story']``. Save the tokens by creating a new column in ``train_stories`` called ``Tokenized_Story``.

In [None]:
### ENTER YOUR CODE HERE ( 1 line)

### END

Let's check the result by displaying the first 10 stories. 

In [None]:
train_stories[['Story','Tokenized_Story']][:10]

You should have something like this: 

![dataframe](../img/capt01.png)

###  Lexicon

Then we need to assemble a lexicon (aka vocabulary) of words that the model needs to know. Each tokenized word in the stories is added to the lexicon, and then each word is mapped to a numerical index that can be read by the model. Since large datasets may contain a huge number of unique words, it's common to filter out all words occurring fewer than a certain number of times, and replace them with a generic ``<UNK>`` token. The min_freq parameter in the function below defines this threshold. In the example code, the min_freq parameter is set to 1, so the lexicon will contain all unique words in the training set. When assigning the indices, the number 1 will represent unknown words. The number 0 will represent "empty" word slots, which is explained below. Therefore "real" words will have indices of 2 or higher.

**Exercise :** Create a token_count(token_seqs) function that counts the number of times each word appears across the token sequences. This function should return a dictionary that looks like this: ``{'hello': 1, ',': 2, 'my': 1, 'name': 1}``. Each word is a key, and the number of times it appears is the value.

In [None]:
### ENTER YOUR CODE HERE (between 5 and 10 lines)
def token_count(token_seqs):
  


    pass
### END

Okay, let's make sure your operation is working fine. 

In [None]:
token_count(example)

You should have this :
````
{'hello': 1,
 ',': 2,
 'my': 1,
 'name': 1,
 'is': 3,
 'ludo': 1,
 '.': 4,
 'i': 2,
 'learn': 1,
 'deep': 1,
 'learning': 1,
 'and': 1,
 'like': 1,
 'it': 1,
 '!': 1,
 'this': 1,
 'the': 1,
 'second': 1,
 'story': 1,
 'there': 1,
 'no': 2,
 'beginning': 1,
 'end': 1}
 ````

We will now create a function that builds our lexicon. It takes two parameters:
1. ``token_seqs``, a list containing the tokenized stories.
2. ``min_freq``, the minimum number of times a word must appear in the stories to be included in the lexicon. We will set the default to 1, but we may need to change it later.  

It is interesting to know the number of times each word appears in the text. But contrary to what one might think, words that appear often are not necessarily important. It is often the opposite: words such as "and", "I", "the", etc. are repeated frequently but carry little information.

The function will return a dictionary whose indices start at 2, because index 0 is reserved for empty (padding) slots and index 1 for unknown words.

**Exercise :** 
1. Create a token_counts variable. Determine the number of times a word appears in ``token_seqs``. Use the function previously created
2. Create a lexicon variable. Then, assign each word to a numerical index. Filter words that occur less than min_freq times.

In [None]:
'''Count tokens (words) in texts and add them to the lexicon'''

def make_lexicon(token_seqs, min_freq=1):
    # First, count the frequency with which each word appears in the text. 
    # Use the token_count function created previously
    ### ENTER YOUR CODE HERE (1 line)

    ### END
    
    # Then, assign each word to a numerical index. Filter words that occur less than min_freq times.
    ### ENTER YOUR CODE HERE (between 1 and 5 lines)

    ###END
        
    # Indices start at 2. 0 is reserved for padding, and 1 for unknown words.
    lexicon = {token:idx + 2 for idx,token in enumerate(lexicon)}
    lexicon['<UNK>'] = 1 # Unknown words are those that occur fewer than min_freq times
    lexicon_size = len(lexicon)

    print("LEXICON SAMPLE ({} total items).".format(len(lexicon)))
    
    return lexicon


Okay, let's check that your function works. You should get this result by executing the code below.
````
{'hello': 2, ',': 3, 'my': 4, 'name': 5, 'is': 6, 'ludo': 7, '.': 8, 'i': 9, 'learn': 10, 'deep': 11, 'learning': 12, 'and': 13, 'like': 14, 'it': 15, '!': 16, 'this': 17, 'the': 18, 'second': 19, 'story': 20, 'there': 21, 'no': 22, 'beginning': 23, 'end': 24, '<UNK>': 1}
````

In [None]:
make_lexicon(example)

If the result matches, you can save your lexicon in the ``lexicon`` variable.   

**Exercise :** Create a lexicon variable that will contain the lexicon of ``train_stories['Tokenized_Story'] ``

In [None]:
### ENTER YOUR CODE HERE (1 line)

### END

We can save our lexicon for later use. You already know the library that allows you to do that: pickle!

In [None]:
import pickle

with open('example_model/lexicon.pkl', 'wb') as f: # Save the lexicon by pickling it
    pickle.dump(lexicon, f)

When we apply the model to generation later, it will output words as indices, so we'll need to map each numerical index back to its corresponding string representation. We'll reverse the lexicon dictionary so that a word can be looked up by its index. To do this, we will create a get_lexicon_lookup() function that takes the lexicon as its parameter.

**Exercise :**  Create a function that will return a dictionary where the string representation of a lexicon item can be retrieved from its numerical index. Your dictionary must also map the index 0 to the empty string ``""``.

Example : ``{2: 'hello', 3: ',', 4: 'my', 5: 'name', 0 :""}``

In [None]:
'''Make a dictionary where the string representation of a lexicon item can be retrieved from its numerical index'''

def get_lexicon_lookup(lexicon):
    ### ENTER YOUR CODE HERE (+- 3 lines)
 


    ###END 



Let's check if your function is correct. You should get something like this after you execute this code (the order of the entries may differ): 
````
{2: 'hello', 3: ',', 4: 'my', 5: 'name', 6: 'is', 7: 'ludo', 8: '.', 9: 'i', 10: 'learn', 11: 'deep', 12: 'learning', 13: 'and', 14: 'like', 15: 'it', 16: '!', 17: 'this', 18: 'the', 19: 'second', 20: 'story', 21: 'there', 22: 'no', 23: 'beginning', 24: 'end', 1: '<UNK>', 0: ''}
````

In [None]:
get_lexicon_lookup(make_lexicon(example))

**Exercise :** Apply this function to ``lexicon`` and save the result in the variable ``lexicon_lookup``.

In [None]:
### ENTER YOUR CODE HERE (1 line)

### END

###  From strings to numbers

Once the lexicon is built, we can use it to transform each story from a list of string tokens into a list of numerical indexes. This will allow us to do mathematical calculations on a string of characters!

In [None]:
'''Convert each text from a list of tokens to a list of numbers (indices)'''

def tokens_to_idxs(token_seqs, lexicon):
    idx_seqs = [[lexicon[token] if token in lexicon else lexicon['<UNK>'] for token in token_seq]  
                                                                     for token_seq in token_seqs]
    return idx_seqs

train_stories['Story_Idxs'] = tokens_to_idxs(token_seqs=train_stories['Tokenized_Story'],
                                             lexicon=lexicon)
                                   
train_stories[['Tokenized_Story', 'Story_Idxs']][:10]
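
The fallback to the unknown-word index is easiest to see on a word that is missing from the lexicon. The sketch below uses a made-up five-entry lexicon and a single-sequence helper, `toy_tokens_to_idxs`, both of which are assumptions for illustration only.

```python
# Hypothetical toy lexicon following the convention above:
# 0 = padding (not stored), 1 = <UNK>, real words from 2 upward.
toy_lexicon = {"<UNK>": 1, "the": 2, "dog": 3, "barked": 4, ".": 5}

def toy_tokens_to_idxs(token_seq, lexicon):
    """Look up each token, falling back to the <UNK> index for words not in the lexicon."""
    return [lexicon.get(token, lexicon["<UNK>"]) for token in token_seq]

idxs = toy_tokens_to_idxs(["the", "dog", "growled", "."], toy_lexicon)
print(idxs)  # "growled" is not in the lexicon, so it is mapped to 1
```

This matters mostly at test time, when stories can contain words that were never seen during training.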

###  Creating a matrix

Finally, we need to put all the training stories into a single matrix, where each row is a story and each column is a word index in that story. This enables the model to process the stories in batches as opposed to one at a time, which significantly speeds up training. However, each story has a different number of words. So we create a padded matrix whose width is equal to the length of the longest story in the training set. For all stories with fewer words, we prepend the row with zeros, each representing an empty word position. This is why the number 0 was not assigned as a word index in the lexicon. Then we can actually tell Keras to ignore these zeros during training.
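
As an illustration of what left-padding does, here is a plain-Python sketch. The `left_pad` helper is hypothetical and is only meant to mimic padding and truncating at the start of each row, which is the default behavior of Keras' `pad_sequences`; it is not a replacement for it.

```python
def left_pad(idx_seqs, max_len):
    """Prepend zeros so that every row has exactly max_len entries.

    Rows longer than max_len keep only their last max_len entries.
    """
    padded = []
    for seq in idx_seqs:
        seq = seq[-max_len:]
        padded.append([0] * (max_len - len(seq)) + seq)
    return padded

toy_matrix = left_pad([[2, 3, 4], [5, 6], [7]], max_len=4)
for row in toy_matrix:
    print(row)
```

Every row now has the same length, so the rows can be stacked into one matrix and processed as a batch.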

**Exercise :** Create a max_seq_len variable that contains the number of words in the longest story, i.e. the length of the longest sequence in ``train_stories['Story_Idxs']``.

In [None]:
### ENTER YOUR CODE HERE 
# Get length of longest sequence 

###END 

Its value should be 73 if you use the same dataset !

Let's create the padded sequences. The pad_sequences function of Keras will allow us to do that. 


In [None]:
from keras.preprocessing.sequence import pad_sequences

train_padded_idxs = pad_sequences(train_stories['Story_Idxs'], 
                                  maxlen=max_seq_len + 1) #Add one to max length for offsetting sequence by 1

print(train_padded_idxs) #same example story as above
print("SHAPE:", train_padded_idxs.shape)

All right, but we're going to wrap this in a pad_idx_seqs function that we can reuse later.

**Exercise:** Create a pad_idx_seqs function that returns the padded matrix. This function takes two parameters: the index sequences and the length of the longest sequence.

In [None]:
'''create a padded matrix of stories'''
def pad_idx_seqs(idx_seqs, max_seq_len):
    # Keras provides a convenient padding function; 
    ### ENTER YOUR CODE HERE  ( +- 2 lines)
 
    pass
    ### END

### Defining the input and output

In an RNN language model, the data is set up so that each word in the text is mapped to the word that follows it. In a given story, for each input word x[idx], the output label y[idx] is just x[idx+1]. In other words, the output sequences (y) matrix will be offset by one. The example below displays this alignment with the string tokens for the first story in the dataset.

In [None]:
pd.DataFrame(list(zip(["-"] + train_stories['Tokenized_Story'].loc[0],
                      train_stories['Tokenized_Story'].loc[0])),
             columns=['Input Word', 'Output Word'])


To keep the padded matrices the same length, the input word matrix will also be offset by one in the opposite direction: the input matrix drops the last column and the output matrix drops the first. So the length of both the input and output matrices will be reduced by one.

In [None]:
print(pd.DataFrame(list(zip(train_padded_idxs[0,:-1], train_padded_idxs[0, 1:])),
                columns=['Input Words', 'Output Words']))
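
The same offset can be checked on a small hand-made row without any library. The index values below are invented for illustration.

```python
# One hypothetical padded row: two padding zeros followed by five word indices.
padded_row = [0, 0, 12, 7, 31, 4, 9]

inputs = padded_row[:-1]   # everything except the last position
targets = padded_row[1:]   # everything except the first position

# At every position, the target is simply the word that follows the input word.
for x, y in zip(inputs, targets):
    print(x, "->", y)
```

Both lists have one element fewer than the padded row, which is why the matrix was padded to `max_seq_len + 1` earlier.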

##  Building the model

To assemble the model, we'll use Keras' [Functional API](https://keras.io/getting-started/functional-api-guide/), which is one of two ways to use Keras to assemble models (the alternative is the [Sequential API](https://keras.io/getting-started/sequential-model-guide/), which is a bit simpler but has more constraints). A model consists of a series of layers. As shown in the code below, we initialize instances for each layer. Each layer can be called with another layer as input, e.g. Embedding()(input_layer). A model instance is initialized with the Model() object, which defines the initial input and final output layers for that model. Before the model can be trained, the compile() function must be called with the loss function and optimization algorithm specified (see below).

### Model 

Until now, we have used the Sequential class to create our models.

````
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(2, input_dim=2))
model.add(Dense(1, activation="softmax"))
````

But there is another way to create the same model, using the Model() class of the Functional API:
````
from keras.models import Model
from keras.layers import Input
from keras.layers import Dense

input_layer = Input(shape=(2,))  
hidden_layer = Dense(2)(input_layer)  
output_layer = Dense(1, activation="softmax")(hidden_layer )

model = Model(inputs=input_layer, outputs=output_layer)

````
We will now use the second method because it is more flexible.


### Layers

We'll build an RNN with five layers:

**1. Input**: The input layer takes in the matrix of word indices.

**2. Embedding**: An [embedding input layer](https://keras.io/layers/embeddings/) that converts integer word indices into distributed vector representations (embeddings). The mask_zero=True parameter indicates that values of 0 in the matrix (the padding) will be ignored by the model. Check [this link for more information](https://towardsdatascience.com/deep-learning-4-embedding-layers-f9a02d55ac12).

**3. GRU**: A [recurrent (GRU) hidden layer](https://keras.io/layers/recurrent/), the central component of the model. As it observes each word in the story, it integrates the word embedding representation with what it's observed so far to compute a representation (hidden state) of the story at that timepoint. There are a few architectures for this layer. I use the GRU variation, but Keras also provides LSTM and the simple vanilla recurrent layer (see the materials at the bottom for an explanation of the difference). By setting return_sequences=True for this layer, it will output the hidden states for every timepoint in the model, i.e. for every word in the story.

**4. GRU**: A second recurrent layer that takes the first as input and operates the same way, since adding more layers generally improves the model.

**5. (Time Distributed) Dense**: A [dense output layer](https://keras.io/layers/core/#dense) that outputs a probability for each word in the lexicon, where each probability indicates the chance of that word being the next word in the sequence. The 'softmax' activation is what transforms the values of this layer into scores from 0 to 1 that can be treated as probabilities. The Dense layer produces the probability scores for one particular timepoint (word). By wrapping this in a TimeDistributed() layer, the model outputs a probability distribution for every timepoint in the sequence.

The term "layer" is just an abstraction, when really all these layers are just matrices. The "weights" that connect the layers are also matrices. The process of training a neural network is a series of matrix multiplications. The weight matrices are the values that are adjusted during training in order for the model to learn to predict the next word.
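
To illustrate the "layers are matrices" view, the sketch below pushes one word index through a tiny embedding matrix and a dense softmax output using plain Python lists. All sizes and weight values are made up, and the two GRU layers and the bias terms are left out to keep it short, so this is a simplification rather than the model built in this notebook.

```python
import math

# Hypothetical tiny model: 4 lexicon entries (index 0 is padding), 2 embedding dimensions.
embedding_matrix = [
    [0.0, 0.0],   # index 0: padding
    [0.1, -0.3],  # index 1: <UNK>
    [0.5, 0.2],   # index 2
    [-0.4, 0.9],  # index 3
]
# Output weights: one row per embedding dimension, one column per lexicon entry.
output_weights = [
    [0.2, -0.1, 0.7, 0.3],
    [0.0, 0.4, -0.5, 0.8],
]

def softmax(scores):
    """Exponentiate and normalize so the values are positive and sum to 1."""
    exps = [math.exp(s - max(scores)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

word_idx = 2
embedding = embedding_matrix[word_idx]  # the embedding layer is a row lookup
scores = [
    sum(embedding[d] * output_weights[d][j] for d in range(len(embedding)))
    for j in range(len(output_weights[0]))
]  # the dense layer is a matrix multiplication
probs = softmax(scores)
print(probs)
```

The result has one probability per lexicon entry, which is the shape the output layer produces at every timepoint.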

###  Parameters

Our function for creating the model takes the following parameters:

**seq_input_len:** the length of the input and output matrices. This is equal to the length of the longest story in the training data. 

**n_input_nodes**: the number of unique words in the lexicon, plus one to account for the padding represented by 0 values. This indicates the number of rows in the embedding layer, where each row corresponds to a word. It is also the dimensionality of the probability vectors given as the model output.

**n_embedding_nodes**: the number of dimensions (units) in the embedding layer, which can be freely defined. Here, it is set to 300.

**n_hidden_nodes**: the number of dimensions in the hidden layers. Like the embedding layer, this can be freely chosen. Here, it is set to 500.

**stateful**: By default, the GRU hidden layer will reset its state (i.e. its values will be 0s) each time a new set of sequences is read into the model.  However, when stateful=True is given, this parameter indicates that the GRU hidden layer should "remember" its state until it is explicitly told to forget it. In other words, the values in this layer will be carried over between separate calls to the training function. This is useful when processing long sequences, so that the model can iterate through chunks of the sequences rather than loading the entire matrix at the same time, which is memory-intensive. I'll show below how this setting is also useful when the model is used for word prediction after training. During training, the model will observe all words in a story at once, so stateful will be set to False. At prediction time, it will be set to True.

**batch_size**: It is not always necessary to specify the batch size when setting up a Keras model. The fit() function will apply batch processing by default and the batch size can be given as a parameter. However, when a model is stateful, the batch size does need to be specified in the Input() layers. Here, for training, batch_size=None, so Keras will use its default batch size (which is 32). During prediction, the batch size will be set to 1.
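
The effect of the stateful setting can be illustrated with a toy recurrent cell. The `ToyRecurrentCell` class below is hypothetical: it has a single unit with fixed, invented weights, it is not a GRU, and it ignores how Keras ties states to positions within a batch. It only shows what carrying state across calls means.

```python
import math

class ToyRecurrentCell:
    """A one-unit recurrent cell with fixed, made-up weights (not a GRU)."""

    def __init__(self, stateful):
        self.stateful = stateful
        self.state = 0.0

    def run(self, inputs):
        if not self.stateful:
            self.state = 0.0  # non-stateful: start from zeros on every call
        for x in inputs:
            self.state = math.tanh(0.5 * x + 0.8 * self.state)
        return self.state

    def reset_states(self):
        self.state = 0.0

sequence = [1.0, -2.0, 0.5, 3.0]

# Reading the whole sequence in a single call.
whole = ToyRecurrentCell(stateful=False).run(sequence)

# Reading it in two chunks with a stateful cell: the second call continues from the first.
stateful_cell = ToyRecurrentCell(stateful=True)
stateful_cell.run(sequence[:2])
chunked_stateful = stateful_cell.run(sequence[2:])

# Reading it in two chunks with a non-stateful cell: the first chunk is forgotten.
stateless_cell = ToyRecurrentCell(stateful=False)
stateless_cell.run(sequence[:2])
chunked_stateless = stateless_cell.run(sequence[2:])

print(whole, chunked_stateful, chunked_stateless)
```

The stateful cell reaches the same final state as the single call, while the non-stateful one does not. This is why a stateful model can read a story one word at a time, and why its state has to be reset before the next story.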

### Procedure

The output of the model is a sequence of vectors, each with the same number of dimensions as the number of unique words (n_input_nodes). Each vector contains the predicted probability of each possible word appearing in that position in the sequence. Like all neural networks, RNNs learn by updating the parameters (weights) to optimize an objective (loss) function applied to the output. For this model, the objective is to minimize the cross-entropy (named as "sparse_categorical_crossentropy" in the code) between the predicted word probabilities and the probabilities observed from the words that appear in the training data, resulting in probabilities that more accurately predict when a particular word will appear. This is the general procedure used for multi-class classification tasks. Updates to the weights of the model are performed using an optimization algorithm, such as Adam used here. The details of this process are extensive; see the resources at the bottom of the notebook if you want a deeper understanding. One huge benefit of Keras is that it implements many of these details for you. Not only does it already have implementations of the types of layer architectures, it also has many of the [loss functions](https://keras.io/losses/) and [optimization methods](https://keras.io/optimizers/) you need for training various models.
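
As a rough illustration of the loss, the sketch below computes a sparse cross-entropy by hand: for each position, take the probability the model assigned to the word that actually came next, and average the negative logarithms. The probabilities and targets are invented, and the handling of masked padding positions is not modeled here.

```python
import math

# Hypothetical predicted distributions over a 4-entry lexicon, one per position in a sequence.
predicted_probs = [
    [0.1, 0.2, 0.6, 0.1],
    [0.05, 0.05, 0.1, 0.8],
    [0.25, 0.25, 0.25, 0.25],
]
# Index of the word that actually came next at each position.
target_idxs = [2, 3, 1]

def sparse_cross_entropy(probs, targets):
    """Average negative log-probability assigned to the correct next word."""
    losses = [-math.log(p[t]) for p, t in zip(probs, targets)]
    return sum(losses) / len(losses)

loss = sparse_cross_entropy(predicted_probs, target_idxs)
print(loss)
```

The labels are "sparse" because each target is a single integer index rather than a one-hot vector over the whole lexicon.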

### Exercise 

Create the 4th layer which must also be a GRU. Create a variable gru_layer2 that will contain exactly the same parameters as for layer 3. Here the layer will have layer 3 as input.

In [None]:
'''Create the model'''
import tensorflow as tf

# Use the public tf.keras API; Model is needed below to assemble the network
Model = tf.keras.models.Model
Dense = tf.keras.layers.Dense
Input = tf.keras.layers.Input
TimeDistributed = tf.keras.layers.TimeDistributed
Embedding = tf.keras.layers.Embedding
GRU = tf.keras.layers.GRU

def create_model(seq_input_len, n_input_nodes, n_embedding_nodes, 
                 n_hidden_nodes, stateful=False, batch_size=None):
    
    # Layer 1 : Create input_layer variable. Don't forget to give a name to your layer !
    input_layer = Input(batch_shape=(batch_size, seq_input_len), name='input_layer')

    # Layer 2
    embedding_layer = Embedding(input_dim=n_input_nodes, 
                                output_dim=n_embedding_nodes, 
                                mask_zero=True, name='embedding_layer')(input_layer) #mask_zero=True will ignore padding
    # Output shape = (batch_size, seq_input_len, n_embedding_nodes)

    #Layer 3
    gru_layer1 = GRU(n_hidden_nodes,
                     return_sequences=True, #return hidden state for each word, not just last one
                     stateful=stateful, name='hidden_layer1')(embedding_layer)
    # Output shape = (batch_size, seq_input_len, n_hidden_nodes)

    #Layer 4
    ### ENTER YOUR CODE HERE (3 lines )
 


    ### END
    # Output shape = (batch_size, seq_input_len, n_hidden_nodes)

    #Layer 5
    output_layer = TimeDistributed(Dense(n_input_nodes, activation="softmax"), 
                                   name='output_layer')(gru_layer2)
    # Output shape = (batch_size, seq_input_len, n_input_nodes)
    
    model = Model(inputs=input_layer, outputs=output_layer)

    #Specify loss function and optimization algorithm, compile model
    model.compile(loss="sparse_categorical_crossentropy",
                  optimizer='adam')
    
    
    return model

In [None]:
model = create_model(seq_input_len=train_padded_idxs.shape[-1] - 1, #subtract 1 from matrix length because of offset 
                     n_input_nodes = len(lexicon) + 1, # Add 1 to account for 0 padding
                     n_embedding_nodes = 300,
                     n_hidden_nodes = 500)

###  Training

Now we're ready to train the model. We'll call the fit() function to train the model for 5 iterations through the dataset (epochs), with a batch size of 20 stories. Keras reports the cross-entropy loss after each epoch. If the model is learning correctly, the loss should progressively decrease.

In [None]:
'''Train the model'''
# output matrix (y) has extra 3rd dimension added because sparse cross-entropy function requires one label per row
model.fit(x=train_padded_idxs[:,:-1], y=train_padded_idxs[:, 1:, None], epochs=5, batch_size=20)
model.save_weights('example_model/model_weights.h5') #Save model

### Evaluation

It is useful to know how well the cross entropy metric applied to the training set generalizes to stories not observed during training. The typical intrinsic evaluation metric for language modeling is perplexity, which is derived from cross entropy. Perplexity is a measure of how much the model "expects" or "anticipates" the words in a given set of texts, where lower perplexity values mean the model is better at guessing the words that appear. The closer perplexity is to 1 (its minimum possible value, reached when the model assigns probability 1 to every word that appears), the better the model is at predicting the given sequences. This number does not indicate the performance of a model when applied to a specific task, but it is still useful for showing relative differences between models in terms of how well they fit the data. We'll evaluate perplexity on an example test set of 100 sentences that were not observed during training.

In [None]:
'''Load test set and apply same processing used for training stories'''

test_stories = pd.read_csv('dataset/example_test_stories.csv', encoding='utf-8')

test_stories['Tokenized_Story'] = [[word.lower_ for word in encoder(text_seq)] for text_seq in test_stories['Story']]
test_stories['Story_Idxs'] = tokens_to_idxs(token_seqs=test_stories['Tokenized_Story'],
                                            lexicon=lexicon)
test_padded_idxs = pad_sequences(test_stories['Story_Idxs'], 
                                 maxlen=max_seq_len + 1)

Keras has an evaluate() function that will return the cross-entropy loss of the model on a given set of instances, the same thing that was reported during training. We can provide the test instances as input to this function to get the cross-entropy. Perplexity is equal to the exponentiated cross-entropy:

In [None]:
import numpy 

perplexity = numpy.exp(model.evaluate(x=test_padded_idxs[:,:-1], 
                                      y=test_padded_idxs[:, 1:, None]))

print("PERPLEXITY ON TEST SET: {:.3f}".format(perplexity))
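
To build some intuition for the scale of perplexity, here is a small sketch that computes it directly from the probabilities a model assigns to the observed words. The `perplexity_from_probs` helper and all the numbers are assumptions for illustration.

```python
import math

def perplexity_from_probs(target_probs):
    """Exponentiate the average negative log-probability of the observed words."""
    cross_entropy = -sum(math.log(p) for p in target_probs) / len(target_probs)
    return math.exp(cross_entropy)

lexicon_size = 1000  # hypothetical
uniform = perplexity_from_probs([1.0 / lexicon_size] * 5)   # guesses uniformly over the lexicon
confident = perplexity_from_probs([0.9, 0.8, 0.95, 0.85, 0.9])
perfect = perplexity_from_probs([1.0] * 5)                  # always certain and always right
print(uniform, confident, perfect)
```

A model that guesses uniformly has a perplexity equal to the lexicon size, and a perfect model has a perplexity of 1, so real models fall somewhere in between.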

## Prediction Tasks

Now that the model is trained, we can apply it to prediction tasks using the test set. I'll show two tasks: computing a probability score for a story, and generating a new ending for a story. To demonstrate both of these, I'll load a saved model previously trained on the ~100,000 stories in the training set. As opposed to training where we processed multiple stories at the same time, it will be more straightforward to demonstrate prediction on a single story at a time, especially since prediction is fast relative to training. In Keras, you can duplicate a model by loading the parameters from a saved model into a new model. Here, this new model will have a batch size of 1. It will also process a story one word at a time (seq_input_len=1), using the stateful=True parameter to remember the story that has occurred up to that word. The other parameters of this prediction model are exactly the same as the trained model, which is why the weights can be readily transferred.

In [None]:
'''Create a new test model, setting batch_size = 1, seq_input_len = 1, and stateful = True'''

# Load lexicon from the saved model 
with open('pretrained_model/lexicon.pkl', 'rb') as f:
    lexicon = pickle.load(f)
lexicon_lookup = get_lexicon_lookup(lexicon)

predictor_model = create_model(seq_input_len=1,
                               n_input_nodes=len(lexicon) + 1,
                               n_embedding_nodes = 300,
                               n_hidden_nodes = 500,
                               stateful=True, 
                               batch_size = 1)

predictor_model.load_weights('pretrained_model/model_weights.h5') #Load weights from saved model

In [None]:
'''Re-encode the test stories with lexicon we just loaded'''

test_stories['Story_Idxs'] = tokens_to_idxs(token_seqs=test_stories['Tokenized_Story'],
                                            lexicon=lexicon)

### Computing story probabilities

Since the model outputs a probability distribution for each word in the story, indicating the probability of each possible next word in the story, we can use these values to get a single probability score for the story. To do this, we iterate through each word in a story, call the predict() function to get the full list of probabilities for the next word, and then extract the probability predicted for the actual next word in the story. We can average these probabilities across all words in the story to get a single value. The stateful=True parameter is what enables the model to remember the previous words in the story when predicting the probability of the next word. Because of this, the reset_states() function must be called at the end of reading the story in order to clear its memory for the next story.

We do this below to compare the probability of each story to one with an ending randomly selected from another story in the test set. Of course, a good language model should overall score the randomly selected endings as less probable than the correct endings. 

In [None]:
'''Compute the probability of a story according to the language model'''

import numpy

def get_probability(idx_seq):
    idx_seq = [0] + idx_seq #Prepend 0 so first call to predict() computes prob of first word from zero padding
    probs = []
    for word, next_word in zip(idx_seq[:-1], idx_seq[1:]):
        # Word is an integer, but the model expects an input array
        # with the shape (batch_size, seq_input_len), so prepend two dimensions
        p_next_word = predictor_model.predict(numpy.array(word)[None,None])[0,0] #Output shape= (lexicon_size + 1,)
        #Select predicted prob of the next word, which appears in the corresponding idx position of the probability vector
        p_next_word = p_next_word[next_word]
        probs.append(p_next_word)
    predictor_model.reset_states()
    return numpy.mean(probs) #return average probability of words in sequence

for _, test_story in test_stories[:10].iterrows():

    # Split out initial four sentences in story and ending sentence
    len_initial_story = len([word for sent in list(encoder(test_story['Story']).sents)[:-1] for word in sent])
    token_initial_story = test_story['Tokenized_Story'][:len_initial_story]
    idx_initial_story = test_story['Story_Idxs'][:len_initial_story]
    token_ending = test_story['Tokenized_Story'][len_initial_story:]
    
    # Randomly select another story and get its ending
    rand_story = test_stories.iloc[numpy.random.choice(len(test_stories))] # iloc selects by position, whatever the index labels are
    len_rand_ending = len(list(encoder(rand_story['Story']).sents)[-1])
    token_rand_ending = rand_story['Tokenized_Story'][-len_rand_ending:]
    idx_rand_ending = rand_story['Story_Idxs'][-len_rand_ending:]

    print("INITIAL STORY:", " ".join(token_initial_story))
    prob_given_ending = get_probability(test_story['Story_Idxs'])
    print("GIVEN ENDING: {} (P = {:.3f})".format(" ".join(token_ending), prob_given_ending))

    prob_rand_ending = get_probability(idx_initial_story + idx_rand_ending)
    print("RANDOM ENDING: {} (P = {:.3f})".format(" ".join(token_rand_ending), prob_rand_ending), "\n")


### Generating sentences

The language model can also be used to generate new text. Here, I'll give the same predictor model the first four sentences of a story in the test set and have it generate the fifth sentence. To do this, we first "load" the first four sentences into the model by calling predict() on each word. Because the model is stateful, predict() saves the representation of the story internally, even though we don't need the output of this function while just reading the story.

Once the final word in the fourth sentence has been read, we can start using the resulting probability distribution to predict the first word in the fifth sentence. We call numpy.random.choice() to randomly sample a word according to its probability. We then call predict() again with this new word as input, which returns a probability distribution for the second word. Again, we sample from this distribution, append the newly sampled word to the previously generated words, and call predict() with this new word as input. We continue doing this until an end-of-sentence punctuation mark (".", "!", "?") has been selected. Just as before, reset_states() is called after the whole sentence has been generated.

Then we can decode the generated ending into a string using the lexicon lookup dictionary. You can see that the generated endings are generally not as coherent and well-formed as the human-authored endings, but they do capture some components of the story and they are often entertaining.
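
To make the sampling loop concrete without loading the trained model, here is a minimal sketch in which a hand-written table of probabilities stands in for predict(). Everything in it (`toy_lexicon_lookup`, `toy_transitions`, `sample_sentence`, and the `max_words` cap) is hypothetical. Unlike the RNN, this toy table only looks at the single previous word, so it illustrates the sample-then-feed-back loop rather than the model itself.

```python
import numpy

# Hypothetical 3-word lexicon; index 0 is reserved for padding
toy_lexicon_lookup = {1: "she", 2: "smiled", 3: "."}

# Row i holds made-up probabilities for the word that follows word i
toy_transitions = numpy.array([
    [0.0, 1.0, 0.0, 0.0],  # after padding: always "she"
    [0.0, 0.0, 0.9, 0.1],  # after "she"
    [0.0, 0.1, 0.0, 0.9],  # after "smiled"
    [0.0, 1.0, 0.0, 0.0],  # after "."
])

def sample_sentence(first_input, max_words=50):
    generated = []
    p_next_word = toy_transitions[first_input]
    # max_words is a safety cap so the loop always terminates
    while len(generated) < max_words:
        # Randomly sample a word index according to the current distribution
        next_word = int(numpy.random.choice(a=len(p_next_word), p=p_next_word))
        generated.append(next_word)
        # Stop once an end-of-sentence token has been sampled
        if toy_lexicon_lookup[next_word] == ".":
            break
        # Feed the sampled word back in to get the next distribution
        p_next_word = toy_transitions[next_word]
    return generated

sentence = sample_sentence(first_input=0)
print(" ".join(toy_lexicon_lookup[word] for word in sentence))
```

Because the words are sampled rather than chosen by taking the most probable one, running this repeatedly can produce different sentences.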

In [None]:
'''Use the model to generate new endings for stories'''

def generate_ending(idx_seq):
    
    end_of_sent_tokens = [".", "!", "?"]
    generated_ending = []
    
    # First just read initial story, no output needed
    idx_seq = [0] + idx_seq #Prepend 0 so model observes 0 padding
    for word in idx_seq:
        p_next_word = predictor_model.predict(numpy.array(word)[None,None])[0,0]
        
    # Now start predicting new words
    # (.get() avoids a KeyError if the sampled index has no entry in lexicon_lookup)
    while not generated_ending or lexicon_lookup.get(next_word) not in end_of_sent_tokens:
        #Randomly sample a word from the current probability distribution
        next_word = numpy.random.choice(a=p_next_word.shape[-1], p=p_next_word)
        # Append sampled word to generated ending
        generated_ending.append(next_word)
        # Get probabilities for next word by inputting sampled word
        p_next_word = predictor_model.predict(numpy.array(next_word)[None,None])[0,0]
    
    predictor_model.reset_states() #reset hidden state after generating ending
    
    return generated_ending

for _, test_story in test_stories[:20].iterrows():
    # Use spacy to segment the story into sentences, so we can separate the ending sentence
    # Find out where in the story the ending starts (number of words from end of story)
    ending_story_idx = len(list(encoder(test_story['Story']).sents)[-1])
    print("INITIAL STORY:", " ".join(test_story['Tokenized_Story'][:-ending_story_idx]))
    print("GIVEN ENDING:", " ".join(test_story['Tokenized_Story'][-ending_story_idx:]))
    
    generated_ending = generate_ending(test_story['Story_Idxs'][:-ending_story_idx])
    generated_ending = " ".join([lexicon_lookup[word] if word in lexicon_lookup else ""
                                 for word in generated_ending]) #decode from numbers back into words
    print("GENERATED ENDING:", generated_ending, "\n")

### Visualizing data inside the model

To help visualize the data representation inside the model, we can look at the output of each layer individually. Keras' Functional API lets you derive a new model from the layers of an existing model, so you can define its output to be a layer below the output layer in the original model. Calling predict() on this new model will produce the output of that layer for a given input. Of course, glancing at the numbers by themselves doesn't provide any interpretation of what the model has learned (although there are opportunities to [interpret these values](https://www.civisanalytics.com/blog/interpreting-visualizing-neural-networks-text-processing/)), but seeing them verifies that the model is just a series of transformations from one matrix to another. The model stores its layers in the list model.layers, and you can retrieve a specific layer by its position index in that list. Below is an example of the word embedding output for the first word in the first story of the test set. You can do the same thing to view any layer.

In [None]:
'''Show the output of the word embedding layer for the first word of the first story'''

embedding_layer = Model(inputs=predictor_model.layers[0].input,
                        outputs=predictor_model.layers[1].output)
embedding_output = embedding_layer.predict(numpy.array(test_stories['Story_Idxs'][0][0])[None,None])
print("EMBEDDING OUTPUT SHAPE:", embedding_output.shape)
print(embedding_output[0]) # Print embedding vectors for first word of first story
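
To see why the embedding output has three dimensions, it can help to think of the embedding layer as a row lookup in a weight matrix. The sketch below mimics that with plain numpy; `toy_embedding_weights` and the tiny sizes are hypothetical stand-ins for the trained weights, and treating the layer as pure row indexing is a simplifying assumption.

```python
import numpy

# Tiny stand-ins for len(lexicon) + 1 and the 300 embedding nodes
n_words, n_embedding_nodes = 5, 3
toy_embedding_weights = numpy.arange(n_words * n_embedding_nodes,
                                     dtype=float).reshape(n_words, n_embedding_nodes)

# A single word index with two prepended dimensions: shape (batch_size=1, seq_input_len=1)
word_idx = numpy.array(2)[None, None]

# Each index is replaced by the corresponding row of the weight matrix
toy_embedding_output = toy_embedding_weights[word_idx]
print(toy_embedding_output.shape)  # (1, 1, 3)
print(toy_embedding_output[0])     # [[6. 7. 8.]]
```

The output shape is (batch_size, seq_input_len, n_embedding_nodes), which matches the pattern of the shape printed for the real embedding layer above.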

It is also easy to look at the weight matrices that connect the layers. The get_weights() function will show the incoming weights for a particular layer.

In [None]:
'''Show weights that connect the hidden layer to the output layer'''

hidden_to_output_weights = predictor_model.layers[-1].get_weights()[0]
print("HIDDEN-TO-OUTPUT WEIGHTS SHAPE:", hidden_to_output_weights.shape)
print(hidden_to_output_weights)
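
As a rough illustration of what these weights do, the sketch below applies a random matrix with the same layout to a random hidden vector and normalizes the result into a probability distribution. It assumes the output layer is a dense layer followed by a softmax, with weights of shape (n_hidden_nodes, lexicon_size + 1). All names and sizes here are hypothetical, and the random values are not the trained weights.

```python
import numpy

# Tiny stand-ins for the 500 hidden nodes and len(lexicon) + 1 output nodes
n_hidden_nodes, lexicon_size = 4, 6
rng = numpy.random.RandomState(0)

toy_weights = rng.normal(size=(n_hidden_nodes, lexicon_size))  # assumed layout of get_weights()[0]
toy_bias = numpy.zeros(lexicon_size)                            # assumed to correspond to get_weights()[1]
hidden_state = rng.normal(size=n_hidden_nodes)                  # stand-in for the hidden layer output

# One score per word in the lexicon
scores = hidden_state.dot(toy_weights) + toy_bias

# Softmax: exponentiate (shifted by the max for numerical stability) and normalize
p_next_word = numpy.exp(scores - scores.max())
p_next_word /= p_next_word.sum()

print(p_next_word.shape)
print(p_next_word.sum())
```

Under these assumptions, each column of the weight matrix corresponds to one word in the lexicon, which is why the second dimension of the printed shape equals the size of the output distribution.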

## Conclusion

There are a good number of tutorials on RNN language models, particularly applied to text generation. This notebook shows how to leverage Keras with batch training when the length of the sequences is variable. There are many ways this language model can be made more sophisticated. Here are a few interesting papers from the NLP community that extend this basic model for different generation tasks:

*Recipe generation:* [Globally Coherent Text Generation with Neural Checklist Models](https://homes.cs.washington.edu/~yejin/Papers/emnlp16_neuralchecklist.pdf). Chloé Kiddon, Luke Zettlemoyer, and Yejin Choi. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2016.

*Emotional text generation:* [Affect-LM: A Neural Language Model for Customizable Affective Text Generation](https://arxiv.org/pdf/1704.06851.pdf). Sayan Ghosh, Mathieu Chollet, Eugene Laksana, Louis-Philippe Morency, Stefan Scherer. Annual Meeting of the Association for Computational Linguistics (ACL), 2017.

*Poetry generation:* [Generating Topical Poetry](https://www.isi.edu/natural-language/mt/emnlp16-poetry.pdf). Marjan Ghazvininejad, Xing Shi, Yejin Choi, and Kevin Knight. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2016.

*Dialogue generation:* [A Neural Network Approach to Context-Sensitive Generation of Conversational Responses](http://www-etud.iro.umontreal.ca/~sordonia/pdf/naacl15.pdf). Alessandro Sordoni, Michel Galley, Michael Auli, Chris Brockett, Yangfeng Ji, Margaret Mitchell, Jian-Yun Nie, Jianfeng Gao, Bill Dolan. North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2015.

## More resources

Yoav Goldberg's book [Neural Network Methods for Natural Language Processing](http://www.morganclaypool.com/doi/abs/10.2200/S00762ED1V01Y201703HLT037) is a thorough introduction to neural networks for NLP tasks in general.

If you'd like to learn more about what Keras is doing under the hood, the [Theano tutorials](http://deeplearning.net/tutorial/) are useful. There are two specifically on RNNs for NLP: [semantic parsing](http://deeplearning.net/tutorial/rnnslu.html#rnnslu) and [sentiment analysis](http://deeplearning.net/tutorial/lstm.html#lstm).

TensorFlow also has an RNN language model [tutorial](https://www.tensorflow.org/versions/r0.12/tutorials/recurrent/index.html) using the Penn Treebank dataset.

Andrej Karpathy's blog post [The Unreasonable Effectiveness of Recurrent Neural Networks](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) is very helpful for understanding the underlying details of the same language model I've demonstrated here. It also provides raw Python code with an implementation of the backpropagation algorithm.

Chris Olah provides a good [explanation](http://colah.github.io/posts/2015-08-Understanding-LSTMs/) of how LSTM RNNs work (this explanation also applies to the GRU model used here).

Denny Britz's [tutorial](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/) documents well both the technical details of RNNs and their implementation in Python.
