## Sequence generation with LSTM

Recurrent neural networks (RNNs) struggle to carry information across many time steps, so they have difficulty learning long-term dependencies in long sequences.
Long short-term memory networks, generally referred to as “LSTMs”, are an extension of RNNs that can learn long-term dependencies and are widely used in deep learning to mitigate the vanishing gradient problem.
LSTMs combat vanishing gradients through a gating mechanism: the gates regulate which information is removed from or added to the cell state.

LSTMs have three kinds of gates: the input gate, the output gate, and the forget gate. The forget gate defines how much of the previous cell state is allowed to pass through. The input gate defines how much of the newly computed state for the current input x_t is let through, and the output gate defines how much of the internal state is exposed to the next time step and the next layer.
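
To make the gating mechanism concrete, here is a minimal base R sketch of a single LSTM time step. It follows the commonly used LSTM formulation; the function name `lstm_step`, the random weights, and the layer sizes are illustrative assumptions, not what keras runs internally.

```r
# One LSTM time step in base R (illustrative sketch with random weights).
sigmoid <- function(z) 1 / (1 + exp(-z))

lstm_step <- function(x_t, h_prev, c_prev, W, U, b) {
    # W, U and b are lists holding one set of weights per gate: f, i, o, g
    f_t <- sigmoid(W$f %*% x_t + U$f %*% h_prev + b$f)  # forget gate
    i_t <- sigmoid(W$i %*% x_t + U$i %*% h_prev + b$i)  # input gate
    o_t <- sigmoid(W$o %*% x_t + U$o %*% h_prev + b$o)  # output gate
    g_t <- tanh(W$g %*% x_t + U$g %*% h_prev + b$g)     # candidate state
    c_t <- f_t * c_prev + i_t * g_t                     # updated cell state
    h_t <- o_t * tanh(c_t)                              # exposed hidden state
    list(h = h_t, c = c_t)
}

set.seed(1)
n_in <- 3   # assumed input size
n_hid <- 4  # assumed number of LSTM units
gates <- c("f", "i", "o", "g")
W <- setNames(lapply(gates, function(k) matrix(rnorm(n_hid * n_in), n_hid, n_in)), gates)
U <- setNames(lapply(gates, function(k) matrix(rnorm(n_hid * n_hid), n_hid, n_hid)), gates)
b <- setNames(lapply(gates, function(k) matrix(0, n_hid, 1)), gates)

state <- lstm_step(
    x_t = matrix(rnorm(n_in), n_in, 1),
    h_prev = matrix(0, n_hid, 1),
    c_prev = matrix(0, n_hid, 1),
    W, U, b
)
dim(state$h)
```

Because the forget and input gates take values between 0 and 1, the cell state is updated by scaling and adding rather than by repeated matrix multiplication, which is what helps gradients survive over long sequences.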


In this recipe, we will implement an LSTM for sequence prediction (many-to-one in this example). We will build an LSTM neural network that predicts the next word based on the preceding sequence of words. This is also known as text generation.

### Getting ready...

In this example, we will use the "Jack and Jill" nursery rhyme as the source text to build a language model. The model will take two words as input and predict the next word. We start by importing the required libraries and reading our file.

In [29]:
library(keras)
library(readr)
library(stringr)
data <- read_file("data/rhyme.txt") %>% str_to_lower()

In NLP, we refer to our data as a corpus. A corpus is a large collection of texts. Now let us look at the corpus.

In [30]:
data

### How to do it...

**Step 1:**

We have imported the corpus into the R environment. To build a language model, we need to convert it into a sequence of integers. Let us define our tokenizer and fit it. We will use it later to convert text to an integer sequence.

In [31]:
tokenizer <- text_tokenizer(num_words = 35, char_level = FALSE)
tokenizer %>% fit_text_tokenizer(data)

Let us look at the number of unique words in our corpus.

In [32]:
cat("Number of unique words", length(tokenizer$word_index))

Number of unique words 37

Let's look at the vocabulary.

In [33]:
head(tokenizer$word_index)

Let us convert our corpus into an integer sequence using the tokenizer defined above.

In [34]:
text_seqs <- texts_to_sequences(tokenizer, data)
str(text_seqs)

List of 1
 $ : int [1:48] 2 1 4 5 6 9 10 3 11 12 ...


We see that texts_to_sequences() returns a list. Let us convert it into a vector and print out its length.

In [35]:
text_seqs <- text_seqs[[1]]
length(text_seqs)

In [36]:
max(text_seqs)

Let us convert the text sequence into input (feature) and output (label) sequences, where each input is a sequence of two consecutive words and the output is the word that follows them.

In [9]:
input_sequence_length <- 2
feature <- matrix(ncol = input_sequence_length)
label <- matrix(ncol = 1)

# slide a window of input_sequence_length + 1 tokens over the corpus;
# stop one position before the end so every window has a label
for(i in seq(input_sequence_length, length(text_seqs) - 1)){
    start_idx <- (i - input_sequence_length) + 1
    end_idx <- i + 1
    new_seq <- text_seqs[start_idx:end_idx]
    feature <- rbind(feature, new_seq[1:input_sequence_length])
    label <- rbind(label, new_seq[input_sequence_length + 1])
}
# drop the NA placeholder row created by matrix()
feature <- feature[-1, ]
label <- label[-1, ]

In [10]:
paste("Feature")
head(feature)
paste("label")
head(label)

2,1
1,4
4,5
5,6
6,9
9,10
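
As an aside, the same sliding windows can be built without a loop. The sketch below uses base R's `embed()` on a short toy vector; `toy_seqs`, `window_size`, `toy_feature`, and `toy_label` are hypothetical names used only for illustration and are not part of the recipe.

```r
# Toy stand-in for text_seqs (the first seven token ids shown earlier)
toy_seqs <- c(2L, 1L, 4L, 5L, 6L, 9L, 10L)
window_size <- 2  # number of input words per sample

# embed() returns every window of length window_size + 1, but with the
# columns in reverse order, so we flip them back
windows <- embed(toy_seqs, window_size + 1)[, (window_size + 1):1]

toy_feature <- windows[, 1:window_size, drop = FALSE]  # first two words
toy_label <- windows[, window_size + 1]                # the word that follows

toy_feature
toy_label
```

For a corpus this small the loop is perfectly adequate; the vectorized form mainly avoids growing a matrix row by row with `rbind()`.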


Let's one-hot encode our labels and show the dimensions of our features and labels.

In [11]:
label <- to_categorical(label,num_classes = tokenizer$num_words )

In [12]:
cat("Shape of features",dim(feature),"\n")
cat("Shape of label",dim(label))

Shape of features 46 2 
Shape of label 46 35
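
To show what one-hot encoding does to the labels, here is a small base R sketch. It mimics the zero-based class indexing that to_categorical() uses; the helper `one_hot` and the toy indices are assumptions made for illustration.

```r
# Base R sketch of zero-based one-hot encoding (toy values only)
one_hot <- function(idx, num_classes) {
    out <- matrix(0, nrow = length(idx), ncol = num_classes)
    # class k is stored in column k + 1 because R columns start at 1
    out[cbind(seq_along(idx), idx + 1)] <- 1
    out
}

toy_idx <- c(4, 5, 6)                        # three example word indices
encoded <- one_hot(toy_idx, num_classes = 35)

dim(encoded)               # one row per label, one column per class
which(encoded[1, ] == 1)   # position of the single 1 in the first row
```

Each row contains exactly one 1, so a set of 46 labels with 35 classes becomes a 46 x 35 matrix.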

**Step 2:** 

Let us create our neural network for text generation and print its summary.

In [14]:
model <- keras_model_sequential()
model %>%
    layer_embedding(input_dim = tokenizer$num_words,output_dim = 10,input_length = input_sequence_length) %>%
    layer_lstm(units = 50) %>%
    layer_dense(tokenizer$num_words) %>%
    layer_activation("softmax")

summary(model)

________________________________________________________________________________
Layer (type)                        Output Shape                    Param #     
embedding (Embedding)               (None, 2, 10)                   350         
________________________________________________________________________________
lstm (LSTM)                         (None, 50)                      12200       
________________________________________________________________________________
dense (Dense)                       (None, 35)                      1785        
________________________________________________________________________________
activation (Activation)             (None, 35)                      0           
Total params: 14,335
Trainable params: 14,335
Non-trainable params: 0
________________________________________________________________________________
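
The parameter counts in the summary can be checked by hand. The following sketch recomputes them from the layer sizes used above, assuming the standard LSTM layout of four weight sets (three gates plus the candidate state); the variable names are illustrative.

```r
vocab_size <- 35   # tokenizer$num_words
embed_dim <- 10    # output_dim of the embedding layer
lstm_units <- 50   # units in the LSTM layer

# one embed_dim-long vector per vocabulary index
embedding_params <- vocab_size * embed_dim

# four weight sets, each with input weights, recurrent weights and a bias
lstm_params <- 4 * ((embed_dim + lstm_units) * lstm_units + lstm_units)

# weights plus one bias per output class
dense_params <- lstm_units * vocab_size + vocab_size

total_params <- embedding_params + lstm_params + dense_params
c(embedding_params, lstm_params, dense_params, total_params)
```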


Next, we compile the model and train it.

In [15]:
# compile
model %>% compile(
    loss = "categorical_crossentropy", 
    optimizer = optimizer_rmsprop(lr = 0.001),
    metrics = c('accuracy')
)

# train
model %>% fit(
  feature, label,
#   batch_size = 128,
  epochs = 500
)

**Step 3:**

In the following code block, we implement a function to generate a sequence from a language model.

In [16]:
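generate_sequence <- function(model, tokenizer, input_length, seed_text, predict_next_n_words){
    input_text <- seed_text
    for(i in seq_len(predict_next_n_words)){
        # encode the text generated so far as integers
        encoded <- texts_to_sequences(tokenizer, input_text)[[1]]
        # keep only the last input_length tokens (left-pad if the text is shorter)
        encoded <- pad_sequences(sequences = list(encoded), maxlen = input_length, padding = 'pre')
        # predicted class index, which matches the tokenizer's word index
        yhat <- predict_classes(model, encoded, verbose = 0)
        # map the index back to a word and append it to the text
        next_word <- tokenizer$index_word[[as.character(yhat)]]
        input_text <- paste(input_text, next_word)
    }
    return(input_text)
}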


We use our previously written custom function, generate_sequence(), to generate text.

In [17]:
seed_1 = "Jack and"
cat("Text generated from seed 1: " ,generate_sequence(model,tokenizer,input_sequence_length,seed_1,11),"\n ")
seed_2 = "Jack fell"
cat("Text generated from seed 2: ",generate_sequence(model,tokenizer,input_sequence_length,seed_2,11))

Text generated from seed 1:  Jack and jill went up the hill to fetch a pail of water 
 Text generated from seed 2:  Jack fell down and broke his crown and jill went up the hill

### How it works...

**Step 1:**

To build any language model, we clean the input text and break it into tokens. Tokens are individual words, and breaking text into its words is called tokenization. By default, the keras tokenizer splits the corpus into a list of tokens (" " is used for splitting sentences into words), removes all punctuation, converts the words to lowercase, and builds an internal vocabulary based on the input text.

The vocabulary generated by the tokenizer is an indexed list in which words are ranked by their overall frequency in the dataset. In the nursery rhyme, "and" is the most frequent word (index 1), "up" has index 6, and there are 37 unique words in total. Next, we convert our corpus into an integer sequence. Note that the "num_words" argument of text_tokenizer() caps the vocabulary based on word frequency: only words whose index is below "num_words" are kept in the encoded sequence, which here means the 34 most frequent words. Finally, we prepare the features and labels from our corpus.
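
The tokenization steps described above can be sketched in base R. This is only an approximation of what the keras tokenizer does (the exact handling of punctuation and of ties between equally frequent words is an assumption here), and `toy_text`, `vocab`, and `toy_num_words` are hypothetical names.

```r
toy_text <- "Jack and Jill went up the hill, and Jack fell down."

# lowercase, strip punctuation, split on whitespace
tokens <- strsplit(gsub("[[:punct:]]", "", tolower(toy_text)), "\\s+")[[1]]

# count each word, keeping the order of first appearance
counts <- table(factor(tokens, levels = unique(tokens)))

# rank by descending frequency; order() leaves ties in their original order
ranked <- names(counts)[order(-as.integer(counts))]
vocab <- setNames(seq_along(ranked), ranked)

# encode the text as integers
ids <- unname(vocab[tokens])

# keep only indices below toy_num_words, as the num_words argument does
toy_num_words <- 5
kept <- ids[ids < toy_num_words]

vocab
kept
```

With `toy_num_words` set to 5, only the four highest-ranked words survive in the encoded sequence; the rarer words are silently dropped.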

**Step 2:**

In this step, we define our LSTM neural network. We first initialize a sequential model and then add an embedding layer to it. The embedding layer maps each word index to a dense vector of "n" latent features; in our example, n is 10 (output_dim = 10). Next, we add an LSTM layer with 50 units. Word prediction is a classification problem in which we predict the next word from the vocabulary, so we add a dense layer with as many units as tokenizer$num_words (35), followed by a softmax activation function.

**Step 3:**

We define a function to generate text from a given initial set of two words. Our model predicts the next word from the two seed words. In our example, the initial seed is "Jack and" and the predicted word is "jill", which creates a three-word sequence. In the next iteration, we take the last two words of the sentence, that is, "and jill", and predict the next word, "went". The function continues until the number of generated words equals the value of the argument "predict_next_n_words".
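
The iteration described above can be illustrated without a trained network. In the sketch below, a named character vector stands in for the model; `toy_model` and `toy_generate` are hypothetical and only demonstrate how the two-word context slides forward.

```r
# Hypothetical stand-in for the trained model:
# maps the last two words to the word that follows them
toy_model <- c(
    "jack and" = "jill",
    "and jill" = "went",
    "jill went" = "up",
    "went up" = "the",
    "up the" = "hill"
)

toy_generate <- function(lookup, seed_text, n_words, input_length = 2) {
    words <- strsplit(tolower(seed_text), "\\s+")[[1]]
    for (i in seq_len(n_words)) {
        # use only the last input_length words as context
        context <- paste(tail(words, input_length), collapse = " ")
        next_word <- lookup[[context]]
        words <- c(words, next_word)
    }
    paste(words, collapse = " ")
}

generated <- toy_generate(toy_model, "Jack and", 3)
generated
```

The real generate_sequence() works the same way, except that the lookup is replaced by a call to the trained network and the text is encoded to integers before each prediction.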


### There is more...

While working on NLP applications, we construct meaningful features from the text data. There are many techniques for constructing these features, such as count vectorization, binary vectorization, tf-idf (term frequency-inverse document frequency), and word embeddings. The following code block demonstrates how to build a tf-idf feature matrix for various NLP applications using the keras library in R.

In [24]:
texts_to_matrix(tokenizer, data, mode = c("tfidf"))

0,1.131961,0.8509141,0.8509141,0.6865121,0.6865121,0.6865121,0.6865121,0.6865121,0.4054651,⋯,0.4054651,0.4054651,0.4054651,0.4054651,0.4054651,0.4054651,0.4054651,0.4054651,0.4054651,0.4054651


The other available modes are "binary", "count", and "freq".
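
For readers curious where the tf-idf values come from, the sketch below reproduces them in base R. It assumes the weighting used by the Keras tokenizer, tf = 1 + log(count) and idf = log(1 + n_docs / (1 + doc_freq)); the helper `keras_tfidf` is a hypothetical name, and the word counts of 6, 3, 2, and 1 are inferred from the output rather than stated in the recipe.

```r
# Assumed Keras-style tf-idf weighting for a single term
keras_tfidf <- function(count, doc_freq, n_docs) {
    tf <- 1 + log(count)
    idf <- log(1 + n_docs / (1 + doc_freq))
    # terms that do not occur get a weight of 0
    ifelse(count > 0, tf * idf, 0)
}

# our corpus is a single document, so n_docs = 1 and every word has doc_freq = 1
tfidf_values <- keras_tfidf(c(6, 3, 2, 1), doc_freq = 1, n_docs = 1)
round(tfidf_values, 7)
```

Under these assumptions, counts of 6, 3, 2, and 1 give the four distinct non-zero values that appear in the matrix, which suggests the most frequent word occurs six times in the rhyme.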