# <font color='#6629b2'>Language modeling with recurrent neural networks using Keras</font>
### https://github.com/roemmele/keras-rnn-notebooks
by Melissa Roemmele, 10/30/17, roemmele @ usc.edu

## <font color='#6629b2'>Overview</font>

I am going to show how to build a recurrent neural network (RNN) language model that learns the relation between words in text, using the Keras library for machine learning. I will then show how this model can be used to compute the probability of a sequence, as well as generate new sequences.

### <font color='#6629b2'>Language Modeling</font>

A language model is a model of the probability of word sequences. These models are useful for a variety of tasks, such as selecting the most likely output from a set of candidates provided by a speech recognition or machine translation system. Here, I'll show how a language model can be used to generate sequences, in particular the endings of stories. Language generation is a difficult research problem that is generally addressed by more complex models than the one shown here.
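
As a point of reference, the probability that a language model assigns to a sequence of words $w_1, \dots, w_T$ is usually written with the chain rule of probability (the symbols $w_t$ and $T$ are notation introduced here for illustration and do not appear in the code):

$$P(w_1, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})$$

Each factor is the probability of the next word given the words before it, which is the quantity the RNN in this notebook is trained to predict.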

Traditionally, the most well-known approach to language modeling relies on n-grams. The limitation of n-gram language models is that they only explicitly model the probability of a sequence of *n* words. In contrast, RNNs can model longer sequences and thus typically are better at predicting which words will appear in a sequence. See the [chapter in Jurafsky & Martin's *Speech and Language Processing*](https://web.stanford.edu/~jurafsky/slp3/4.pdf) to learn more about traditional approaches to language modeling. 
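
To make the n-gram idea concrete, here is a minimal sketch of a bigram (n=2) model estimated from raw counts. The toy corpus, the `<s>` start marker, and the function names are all made up for illustration; real n-gram models also apply smoothing, which is left out here.

```python
from collections import Counter

# Toy corpus (made up for illustration; not taken from ROCStories)
corpus = [
    ["dan", "was", "overweight", "."],
    ["dan", "was", "happy", "."],
    ["sally", "was", "happy", "."],
]

bigram_counts = Counter()
context_counts = Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence  # <s> marks the start of a sentence
    for prev, word in zip(tokens[:-1], tokens[1:]):
        bigram_counts[(prev, word)] += 1
        context_counts[prev] += 1

def bigram_prob(prev, word):
    """Maximum-likelihood estimate of P(word | prev); 0.0 for unseen contexts."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]

def sequence_prob(sentence):
    """Probability of a sentence as the product of its bigram probabilities."""
    prob = 1.0
    tokens = ["<s>"] + sentence
    for prev, word in zip(tokens[:-1], tokens[1:]):
        prob *= bigram_prob(prev, word)
    return prob

print(bigram_prob("was", "happy"))                   # 2 of the 3 uses of "was" are followed by "happy"
print(sequence_prob(["dan", "was", "happy", "."]))   # 2/3 * 1 * 2/3 * 1
```

Note that each prediction here depends only on the single preceding word. This is the limitation that an RNN avoids by carrying a hidden state across the whole sequence.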

### <font color='#6629b2'>Recurrent Neural Networks (RNNs)</font>

RNNs are a general framework for modeling sequence data and are particularly useful for natural language processing tasks. At a high level, RNNs encode sequences via a set of parameters (weights) that are optimized to predict some output variable. The focus of this notebook is on the code needed to assemble a model in Keras, as well as some data processing tools that facilitate building the model.

If you understand how to structure the input and output of the model, and know the fundamental concepts in machine learning, then a high-level understanding of how an RNN works is sufficient for using Keras. You'll see that most of the code here is actually just data manipulation, and I'll visualize each step in this process. The code used to assemble the RNN itself is more minimal. It is of course useful to know the technical details of the RNN, so you can reason about the results and improve on the model. For a better understanding of RNNs and neural networks in general, see the resources at the bottom of the notebook.

Here an RNN will be used as a language model, which can predict which word is likely to occur next in a text given the words before it.
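
For intuition only, here is a minimal NumPy sketch of how a recurrent layer walks through a sequence. It uses the simple "vanilla" recurrence rather than the GRU used later in this notebook, and all names, sizes, and random weights are assumptions made up for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)

n_embedding, n_hidden = 4, 3  # tiny sizes chosen only for readability

# Randomly initialized weights; in a real model these are learned during training
W_input = rng.normal(scale=0.1, size=(n_embedding, n_hidden))
W_hidden = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
bias = np.zeros(n_hidden)

def rnn_step(word_vector, prev_state):
    """Combine the current word vector with the previous hidden state."""
    return np.tanh(word_vector.dot(W_input) + prev_state.dot(W_hidden) + bias)

# A made-up "story" of 5 words, each already represented as a vector
word_vectors = rng.normal(size=(5, n_embedding))

state = np.zeros(n_hidden)  # the state starts out empty
states = []
for word_vector in word_vectors:
    state = rnn_step(word_vector, state)  # the new state depends on all words so far
    states.append(state)

states = np.array(states)
print(states.shape)  # one hidden state per word: (5, 3)
```

The hidden state at each position summarizes the words seen so far, and a language model uses it to predict the next word.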

### <font color='#6629b2'>Keras</font>

[Keras](https://keras.io/) is a Python deep learning framework that lets you quickly put together neural network models with a minimal amount of code. It can be run on top of [Theano](http://deeplearning.net/software/theano/) or [TensorFlow](https://www.tensorflow.org/) without you needing to know either of these underlying frameworks. It provides implementations of many of the layer architectures, objective functions, and optimization algorithms you need for building a model.

## <font color='#6629b2'>Dataset</font>

My research is on story generation, so I've selected a dataset of stories as the text to be modeled by the RNN. They come from the [ROCStories](http://cs.rochester.edu/nlp/rocstories/) dataset, which consists of thousands of five-sentence stories about everyday life events. Here the model will be trained to predict each word in the story based on the preceding words. Then we'll use the trained model to generate the final sentence in a set of stories not observed during training. The full dataset is available at the above link and just requires filling out a form to get access. Here, I'll use a sample of 100 stories.

In [1]:
from __future__ import print_function #Python 2/3 compatibility for print statements
import pandas
pandas.set_option('display.max_colwidth', 300) #widen pandas rows display

I'll load the datasets using the [pandas library](https://pandas.pydata.org/), which is extremely useful for any task involving data storage and manipulation. This library puts a dataset into a readable table format, and makes it easy to retrieve specific columns and rows.

In [2]:
'''Load the training dataset'''

train_stories = pandas.read_csv('dataset/example_train_stories.csv', encoding='utf-8')[:100]

train_stories[:10]

Unnamed: 0,Story
0,Dan's parents were overweight. Dan was overweight as well. The doctors told his parents it was unhealthy. His parents understood and decided to make a change. They got themselves and Dan on a diet.
1,Carrie had just learned how to ride a bike. She didn't have a bike of her own. Carrie would sneak rides on her sister's bike. She got nervous on a hill and crashed into a wall. The bike frame bent and Carrie got a deep gash on her leg.
2,"Morgan enjoyed long walks on the beach. She and her boyfriend decided to go for a long walk. After walking for over a mile, something happened. Morgan decided to propose to her boyfriend. Her boyfriend was upset he didn't propose to her first."
3,"Jane was working at a diner. Suddenly, a customer barged up to the counter. He began yelling about how long his food was taking. Jane didn't know how to react. Luckily, her coworker intervened and calmed the man down."
4,"I was talking to my crush today. She continued to complain about guys flirting with her. I decided to agree with what she says and listened to her patiently. After I got home, I got a text from her. She asked if we can hang out tomorrow."
5,"Frank had been drinking beer. He got a call from his girlfriend, asking where he was. Frank suddenly realized he had a date that night. Since Frank was already a bit drunk, he could not drive. Frank spent the rest of the night drinking more beers."
6,"Dave was in the Bahamas on vacation. He decided to go snorkeling on his second day. While snorkeling, he saw a cave up ahead. He went into the cave, and he was terrified when he found a shark! Dave swam away as fast as he could, but the shark caught and ate Dave."
7,"Sunny enjoyed going to the beach. As she stepped out of her car, she realized she forgot something. It was quite sunny and she forgot her sunglasses. Sunny got back into her car and heading towards the mall. Sunny found some sunglasses and headed back to the beach."
8,"Sally was happy when her widowed mom found a new man. She discovered her siblings didn't feel the same. Sally flew to visit her mom and her mom's new husband. Although her mom was obviously in love, he was nothing like her dad. Sally went home and wondered about her parents' marriage."
9,Dan hit his golf ball and watched it go. The ball bounced on the grass and into the sand trap. Dan pretended that his ball actually landed on the green. His friends were not paying attention so they believed him. Dan snuck a ball on the green and made his putt from 10 feet.


## <font color='#6629b2'>Preparing the data</font>

The model we'll create is a word-based language model, which means each input unit is a single word (alternatively, some language models learn subword units like characters).

###  <font color='#6629b2'>Tokenization</font>

The first pre-processing step is to tokenize each of the stories into (lowercased) individual words, since the RNN will encode the stories word by word. For this I'll use [spaCy](https://spacy.io/), which is a fast and extremely user-friendly library that performs various language processing tasks. Once you load a spaCy model for a particular language, you can provide any text as input to the model (e.g. encoder(text)) and access its linguistic features.

In [3]:
'''Split texts into lists of words (tokens)'''

import spacy

encoder = spacy.load('en')

def text_to_tokens(text_seqs):
    token_seqs = [[word.lower_ for word in encoder(text_seq)] for text_seq in text_seqs]
    return token_seqs

train_stories['Tokenized_Story'] = text_to_tokens(train_stories['Story'])
    
train_stories[['Story','Tokenized_Story']][:10]

Unnamed: 0,Story,Tokenized_Story
0,Dan's parents were overweight. Dan was overweight as well. The doctors told his parents it was unhealthy. His parents understood and decided to make a change. They got themselves and Dan on a diet.,"[dan, 's, parents, were, overweight, ., dan, was, overweight, as, well, ., the, doctors, told, his, parents, it, was, unhealthy, ., his, parents, understood, and, decided, to, make, a, change, ., they, got, themselves, and, dan, on, a, diet, .]"
1,Carrie had just learned how to ride a bike. She didn't have a bike of her own. Carrie would sneak rides on her sister's bike. She got nervous on a hill and crashed into a wall. The bike frame bent and Carrie got a deep gash on her leg.,"[carrie, had, just, learned, how, to, ride, a, bike, ., she, did, n't, have, a, bike, of, her, own, ., carrie, would, sneak, rides, on, her, sister, 's, bike, ., she, got, nervous, on, a, hill, and, crashed, into, a, wall, ., the, bike, frame, bent, and, carrie, got, a, deep, gash, on, her, leg, .]"
2,"Morgan enjoyed long walks on the beach. She and her boyfriend decided to go for a long walk. After walking for over a mile, something happened. Morgan decided to propose to her boyfriend. Her boyfriend was upset he didn't propose to her first.","[morgan, enjoyed, long, walks, on, the, beach, ., she, and, her, boyfriend, decided, to, go, for, a, long, walk, ., after, walking, for, over, a, mile, ,, something, happened, ., morgan, decided, to, propose, to, her, boyfriend, ., her, boyfriend, was, upset, he, did, n't, propose, to, her, firs..."
3,"Jane was working at a diner. Suddenly, a customer barged up to the counter. He began yelling about how long his food was taking. Jane didn't know how to react. Luckily, her coworker intervened and calmed the man down.","[jane, was, working, at, a, diner, ., suddenly, ,, a, customer, barged, up, to, the, counter, ., he, began, yelling, about, how, long, his, food, was, taking, ., jane, did, n't, know, how, to, react, ., luckily, ,, her, coworker, intervened, and, calmed, the, man, down, .]"
4,"I was talking to my crush today. She continued to complain about guys flirting with her. I decided to agree with what she says and listened to her patiently. After I got home, I got a text from her. She asked if we can hang out tomorrow.","[i, was, talking, to, my, crush, today, ., she, continued, to, complain, about, guys, flirting, with, her, ., i, decided, to, agree, with, what, she, says, and, listened, to, her, patiently, ., after, i, got, home, ,, i, got, a, text, from, her, ., she, asked, if, we, can, hang, out, tomorrow, .]"
5,"Frank had been drinking beer. He got a call from his girlfriend, asking where he was. Frank suddenly realized he had a date that night. Since Frank was already a bit drunk, he could not drive. Frank spent the rest of the night drinking more beers.","[frank, had, been, drinking, beer, ., he, got, a, call, from, his, girlfriend, ,, asking, where, he, was, ., frank, suddenly, realized, he, had, a, date, that, night, ., since, frank, was, already, a, bit, drunk, ,, he, could, not, drive, ., frank, spent, the, rest, of, the, night, drinking, mor..."
6,"Dave was in the Bahamas on vacation. He decided to go snorkeling on his second day. While snorkeling, he saw a cave up ahead. He went into the cave, and he was terrified when he found a shark! Dave swam away as fast as he could, but the shark caught and ate Dave.","[dave, was, in, the, bahamas, on, vacation, ., he, decided, to, go, snorkeling, on, his, second, day, ., while, snorkeling, ,, he, saw, a, cave, up, ahead, ., he, went, into, the, cave, ,, and, he, was, terrified, when, he, found, a, shark, !, dave, swam, away, as, fast, as, he, could, ,, but, t..."
7,"Sunny enjoyed going to the beach. As she stepped out of her car, she realized she forgot something. It was quite sunny and she forgot her sunglasses. Sunny got back into her car and heading towards the mall. Sunny found some sunglasses and headed back to the beach.","[sunny, enjoyed, going, to, the, beach, ., as, she, stepped, out, of, her, car, ,, she, realized, she, forgot, something, ., it, was, quite, sunny, and, she, forgot, her, sunglasses, ., sunny, got, back, into, her, car, and, heading, towards, the, mall, ., sunny, found, some, sunglasses, and, he..."
8,"Sally was happy when her widowed mom found a new man. She discovered her siblings didn't feel the same. Sally flew to visit her mom and her mom's new husband. Although her mom was obviously in love, he was nothing like her dad. Sally went home and wondered about her parents' marriage.","[sally, was, happy, when, her, widowed, mom, found, a, new, man, ., she, discovered, her, siblings, did, n't, feel, the, same, ., sally, flew, to, visit, her, mom, and, her, mom, 's, new, husband, ., although, her, mom, was, obviously, in, love, ,, he, was, nothing, like, her, dad, ., sally, wen..."
9,Dan hit his golf ball and watched it go. The ball bounced on the grass and into the sand trap. Dan pretended that his ball actually landed on the green. His friends were not paying attention so they believed him. Dan snuck a ball on the green and made his putt from 10 feet.,"[dan, hit, his, golf, ball, and, watched, it, go, ., the, ball, bounced, on, the, grass, and, into, the, sand, trap, ., dan, pretended, that, his, ball, actually, landed, on, the, green, ., his, friends, were, not, paying, attention, so, they, believed, him, ., dan, snuck, a, ball, on, the, gree..."


###  <font color='#6629b2'>Lexicon</font>

Then we need to assemble a lexicon (aka vocabulary) of words that the model needs to know. Each tokenized word in the stories is added to the lexicon, and then each word is mapped to a numerical index that can be read by the model. Since large datasets may contain a huge number of unique words, it's common to filter out all words occurring fewer than a certain number of times, and replace them with a generic &lt;UNK&gt; token. The min_freq parameter in the function below defines this threshold. In the example code, min_freq is set to 1, so the lexicon will contain all unique words in the training set. When assigning the indices, the number 1 will represent unknown words. The number 0 will represent "empty" word slots, which is explained below. Therefore "real" words will have indices of 2 or higher.

In [4]:
'''Count tokens (words) in texts and add them to the lexicon'''

import pickle

def make_lexicon(token_seqs, min_freq=1):
    # First, count how often each word appears in the text.
    token_counts = {}
    for seq in token_seqs:
        for token in seq:
            if token in token_counts:
                token_counts[token] += 1
            else:
                token_counts[token] = 1

    # Then, assign each word to a numerical index. Filter words that occur less than min_freq times.
    lexicon = [token for token, count in token_counts.items() if count >= min_freq]
    # Indices start at 2. 0 is reserved for padding, and 1 for unknown words.
    lexicon = {token:idx + 2 for idx,token in enumerate(lexicon)}
    lexicon[u'<UNK>'] = 1 # Unknown words are those that occur fewer than min_freq times
    lexicon_size = len(lexicon)

    print("LEXICON SAMPLE ({} total items):".format(lexicon_size))
    print(dict(list(lexicon.items())[:20]))
    
    return lexicon

lexicon = make_lexicon(token_seqs=train_stories['Tokenized_Story'], min_freq=1)

with open('example_model/lexicon.pkl', 'wb') as f: # Save the lexicon by pickling it
    pickle.dump(lexicon, f)

LEXICON SAMPLE (1274 total items):
{'dan': 2, "'s": 3, 'parents': 4, 'were': 5, 'overweight': 6, '.': 7, 'was': 8, 'as': 9, 'well': 10, 'the': 11, 'doctors': 12, 'told': 13, 'his': 14, 'it': 15, 'unhealthy': 16, 'understood': 17, 'and': 18, 'decided': 19, 'to': 20, 'make': 21}
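
Since min_freq=1 keeps every word, the effect of the threshold is not visible in the output above. As a rough illustration, the sketch below applies the same indexing scheme to a made-up pair of sentences with min_freq=2. The helper `make_toy_lexicon` and the toy data are hypothetical and only mirror the logic of make_lexicon without the printing and pickling.

```python
from collections import Counter

# Made-up token sequences for illustration
toy_token_seqs = [["the", "dog", "ran", "."],
                  ["the", "cat", "ran", "."]]

def make_toy_lexicon(token_seqs, min_freq):
    """Same indexing scheme as make_lexicon above: 0 = padding, 1 = unknown."""
    token_counts = Counter(token for seq in token_seqs for token in seq)
    kept = [token for token, count in token_counts.items() if count >= min_freq]
    toy_lexicon = {token: idx + 2 for idx, token in enumerate(kept)}
    toy_lexicon['<UNK>'] = 1
    return toy_lexicon

toy_lexicon = make_toy_lexicon(toy_token_seqs, min_freq=2)
print(sorted(toy_lexicon))  # ['.', '<UNK>', 'ran', 'the']

# 'dog' fell below the threshold and 'bird' was never seen; both map to index 1
idxs = [toy_lexicon.get(token, toy_lexicon['<UNK>'])
        for token in ["the", "dog", "ran", "bird"]]
print(idxs)
```

With a higher threshold the lexicon shrinks, which reduces the size of the model's embedding and output layers at the cost of treating rare words as interchangeable.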


When we apply the model to generation later, it will output words as indices, so we'll need to map each numerical index back to its corresponding string representation. We'll reverse the lexicon dictionary so that a word can be looked up by its index.

In [5]:
'''Make a dictionary where the string representation of a lexicon item can be retrieved from its numerical index'''

def get_lexicon_lookup(lexicon):
    lexicon_lookup = {idx: lexicon_item for lexicon_item, idx in lexicon.items()}
    lexicon_lookup[0] = "" #map 0 padding to empty string
    print("LEXICON LOOKUP SAMPLE:")
    print(dict(list(lexicon_lookup.items())[:20]))
    return lexicon_lookup

lexicon_lookup = get_lexicon_lookup(lexicon)

LEXICON LOOKUP SAMPLE:
{2: 'dan', 3: "'s", 4: 'parents', 5: 'were', 6: 'overweight', 7: '.', 8: 'was', 9: 'as', 10: 'well', 11: 'the', 12: 'doctors', 13: 'told', 14: 'his', 15: 'it', 16: 'unhealthy', 17: 'understood', 18: 'and', 19: 'decided', 20: 'to', 21: 'make'}


###  <font color='#6629b2'>From strings to numbers</font>

Once the lexicon is built, we can use it to transform each story from a list of string tokens into a list of numerical indices.

In [6]:
'''Convert each text from a list of tokens to a list of numbers (indices)'''

def tokens_to_idxs(token_seqs, lexicon):
    idx_seqs = [[lexicon[token] if token in lexicon else lexicon['<UNK>'] for token in token_seq]  
                                                                     for token_seq in token_seqs]
    return idx_seqs

train_stories['Story_Idxs'] = tokens_to_idxs(token_seqs=train_stories['Tokenized_Story'],
                                             lexicon=lexicon)
                                   
train_stories[['Tokenized_Story', 'Story_Idxs']][:10]

Unnamed: 0,Tokenized_Story,Story_Idxs
0,"[dan, 's, parents, were, overweight, ., dan, was, overweight, as, well, ., the, doctors, told, his, parents, it, was, unhealthy, ., his, parents, understood, and, decided, to, make, a, change, ., they, got, themselves, and, dan, on, a, diet, .]","[2, 3, 4, 5, 6, 7, 2, 8, 6, 9, 10, 7, 11, 12, 13, 14, 4, 15, 8, 16, 7, 14, 4, 17, 18, 19, 20, 21, 22, 23, 7, 24, 25, 26, 18, 2, 27, 22, 28, 7]"
1,"[carrie, had, just, learned, how, to, ride, a, bike, ., she, did, n't, have, a, bike, of, her, own, ., carrie, would, sneak, rides, on, her, sister, 's, bike, ., she, got, nervous, on, a, hill, and, crashed, into, a, wall, ., the, bike, frame, bent, and, carrie, got, a, deep, gash, on, her, leg, .]","[29, 30, 31, 32, 33, 20, 34, 22, 35, 7, 36, 37, 38, 39, 22, 35, 40, 41, 42, 7, 29, 43, 44, 45, 27, 41, 46, 3, 35, 7, 36, 25, 47, 27, 22, 48, 18, 49, 50, 22, 51, 7, 11, 35, 52, 53, 18, 29, 25, 22, 54, 55, 27, 41, 56, 7]"
2,"[morgan, enjoyed, long, walks, on, the, beach, ., she, and, her, boyfriend, decided, to, go, for, a, long, walk, ., after, walking, for, over, a, mile, ,, something, happened, ., morgan, decided, to, propose, to, her, boyfriend, ., her, boyfriend, was, upset, he, did, n't, propose, to, her, firs...","[57, 58, 59, 60, 27, 11, 61, 7, 36, 18, 41, 62, 19, 20, 63, 64, 22, 59, 65, 7, 66, 67, 64, 68, 22, 69, 70, 71, 72, 7, 57, 19, 20, 73, 20, 41, 62, 7, 41, 62, 8, 74, 75, 37, 38, 73, 20, 41, 76, 7]"
3,"[jane, was, working, at, a, diner, ., suddenly, ,, a, customer, barged, up, to, the, counter, ., he, began, yelling, about, how, long, his, food, was, taking, ., jane, did, n't, know, how, to, react, ., luckily, ,, her, coworker, intervened, and, calmed, the, man, down, .]","[77, 8, 78, 79, 22, 80, 7, 81, 70, 22, 82, 83, 84, 20, 11, 85, 7, 75, 86, 87, 88, 33, 59, 14, 89, 8, 90, 7, 77, 37, 38, 91, 33, 20, 92, 7, 93, 70, 41, 94, 95, 18, 96, 11, 97, 98, 7]"
4,"[i, was, talking, to, my, crush, today, ., she, continued, to, complain, about, guys, flirting, with, her, ., i, decided, to, agree, with, what, she, says, and, listened, to, her, patiently, ., after, i, got, home, ,, i, got, a, text, from, her, ., she, asked, if, we, can, hang, out, tomorrow, .]","[99, 8, 100, 20, 101, 102, 103, 7, 36, 104, 20, 105, 88, 106, 107, 108, 41, 7, 99, 19, 20, 109, 108, 110, 36, 111, 18, 112, 20, 41, 113, 7, 66, 99, 25, 114, 70, 99, 25, 22, 115, 116, 41, 7, 36, 117, 118, 119, 120, 121, 122, 123, 7]"
5,"[frank, had, been, drinking, beer, ., he, got, a, call, from, his, girlfriend, ,, asking, where, he, was, ., frank, suddenly, realized, he, had, a, date, that, night, ., since, frank, was, already, a, bit, drunk, ,, he, could, not, drive, ., frank, spent, the, rest, of, the, night, drinking, mor...","[124, 30, 125, 126, 127, 7, 75, 25, 22, 128, 116, 14, 129, 70, 130, 131, 75, 8, 7, 124, 81, 132, 75, 30, 22, 133, 134, 135, 7, 136, 124, 8, 137, 22, 138, 139, 70, 75, 140, 141, 142, 7, 124, 143, 11, 144, 40, 11, 135, 126, 145, 146, 7]"
6,"[dave, was, in, the, bahamas, on, vacation, ., he, decided, to, go, snorkeling, on, his, second, day, ., while, snorkeling, ,, he, saw, a, cave, up, ahead, ., he, went, into, the, cave, ,, and, he, was, terrified, when, he, found, a, shark, !, dave, swam, away, as, fast, as, he, could, ,, but, t...","[147, 8, 148, 11, 149, 27, 150, 7, 75, 19, 20, 63, 151, 27, 14, 152, 153, 7, 154, 151, 70, 75, 155, 22, 156, 84, 157, 7, 75, 158, 50, 11, 156, 70, 18, 75, 8, 159, 160, 75, 161, 22, 162, 163, 147, 164, 165, 9, 166, 9, 75, 140, 70, 167, 11, 162, 168, 18, 169, 147, 7]"
7,"[sunny, enjoyed, going, to, the, beach, ., as, she, stepped, out, of, her, car, ,, she, realized, she, forgot, something, ., it, was, quite, sunny, and, she, forgot, her, sunglasses, ., sunny, got, back, into, her, car, and, heading, towards, the, mall, ., sunny, found, some, sunglasses, and, he...","[170, 58, 171, 20, 11, 61, 7, 9, 36, 172, 122, 40, 41, 173, 70, 36, 132, 36, 174, 71, 7, 15, 8, 175, 170, 18, 36, 174, 41, 176, 7, 170, 25, 177, 50, 41, 173, 18, 178, 179, 11, 180, 7, 170, 161, 181, 176, 18, 182, 177, 20, 11, 61, 7]"
8,"[sally, was, happy, when, her, widowed, mom, found, a, new, man, ., she, discovered, her, siblings, did, n't, feel, the, same, ., sally, flew, to, visit, her, mom, and, her, mom, 's, new, husband, ., although, her, mom, was, obviously, in, love, ,, he, was, nothing, like, her, dad, ., sally, wen...","[183, 8, 184, 160, 41, 185, 186, 161, 22, 187, 97, 7, 36, 188, 41, 189, 37, 38, 190, 11, 191, 7, 183, 192, 20, 193, 41, 186, 18, 41, 186, 3, 187, 194, 7, 195, 41, 186, 8, 196, 148, 197, 70, 75, 8, 198, 199, 41, 200, 7, 183, 158, 114, 18, 201, 88, 41, 4, 202, 203, 7]"
9,"[dan, hit, his, golf, ball, and, watched, it, go, ., the, ball, bounced, on, the, grass, and, into, the, sand, trap, ., dan, pretended, that, his, ball, actually, landed, on, the, green, ., his, friends, were, not, paying, attention, so, they, believed, him, ., dan, snuck, a, ball, on, the, gree...","[2, 204, 14, 205, 206, 18, 207, 15, 63, 7, 11, 206, 208, 27, 11, 209, 18, 50, 11, 210, 211, 7, 2, 212, 134, 14, 206, 213, 214, 27, 11, 215, 7, 14, 216, 5, 141, 217, 218, 219, 24, 220, 221, 7, 2, 222, 22, 206, 27, 11, 215, 18, 223, 14, 224, 116, 225, 226, 7]"


###  <font color='#6629b2'>Creating a matrix</font>

Finally, we need to put all the training stories into a single matrix, where each row is a story and each column is a word position in that story. This enables the model to process the stories in batches as opposed to one at a time, which significantly speeds up training. However, each story has a different number of words. So we create a padded matrix whose number of columns is equal to the length of the longest story in the training set (plus one extra column, which is needed for the offset between input and output described below). For all stories with fewer words, we prepend the row with zeros, each representing an empty word position. This is why the number 0 was not assigned as a word index in the lexicon. We can then tell Keras to ignore these zeros during training.

In [7]:
'''create a padded matrix of stories'''

from keras.preprocessing.sequence import pad_sequences

def pad_idx_seqs(idx_seqs, max_seq_len):
    # Keras provides a convenient padding function; by default it prepends zeros
    padded_idxs = pad_sequences(sequences=idx_seqs, maxlen=max_seq_len)
    return padded_idxs

max_seq_len = max([len(idx_seq) for idx_seq in train_stories['Story_Idxs']]) # Get length of longest sequence

train_padded_idxs = pad_idx_seqs(train_stories['Story_Idxs'], 
                                 max_seq_len + 1) #Add one to max length for offsetting sequence by 1
print(train_padded_idxs) #one row per story, left-padded with zeros

print("SHAPE:", train_padded_idxs.shape)

Using TensorFlow backend.


[[   0    0    0 ...   22   28    7]
 [   0    0    0 ...   41   56    7]
 [   0    0    0 ...   41   76    7]
 ...
 [   0    0    0 ...  108   41    7]
 [   0    0    0 ...   11 1268    7]
 [   0    0    0 ...  177  122    7]]
SHAPE: (100, 74)
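
To make the padding concrete, here is a plain-Python sketch of the left-padding that pad_sequences performs with its default settings (padding='pre' and truncating='pre'). This is an illustrative re-implementation, not the library's code; `left_pad` and the toy sequences are made up.

```python
def left_pad(idx_seqs, max_len, pad_value=0):
    """Prepend pad_value to each sequence until it has max_len items.

    Sequences longer than max_len keep only their last max_len items.
    """
    padded = []
    for seq in idx_seqs:
        seq = list(seq)[-max_len:]
        padded.append([pad_value] * (max_len - len(seq)) + seq)
    return padded

toy_idx_seqs = [[2, 3, 4], [5, 6], [7, 8, 9, 10]]
for row in left_pad(toy_idx_seqs, max_len=4):
    print(row)
# [0, 2, 3, 4]
# [0, 0, 5, 6]
# [7, 8, 9, 10]
```

Every row ends with the last word of its story, so the final columns of the matrix always hold real words.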


### <font color='#6629b2'>Defining the input and output</font>

In an RNN language model, the data is set up so that each word in the text is mapped to the word that follows it. In a given story, for each input word x[idx], the output label y[idx] is just x[idx+1]. In other words, the output sequences (y) matrix will be offset by one. The example below displays this alignment with the string tokens for the first story in the dataset.

In [8]:
pandas.DataFrame(list(zip(["-"] + train_stories['Tokenized_Story'].loc[0],
                          train_stories['Tokenized_Story'].loc[0])),
                 columns=['Input Word', 'Output Word'])


| | Input Word | Output Word |
|---|---|---|
| 0 | - | dan |
| 1 | dan | 's |
| 2 | 's | parents |
| 3 | parents | were |
| 4 | were | overweight |
| 5 | overweight | . |
| 6 | . | dan |
| 7 | dan | was |
| 8 | was | overweight |
| 9 | overweight | as |


To keep the input and output matrices the same length, the input matrix drops the last column of the padded matrix, while the output matrix drops the first column. So the length of both the input and output matrices is reduced by one.

In [9]:
print(pandas.DataFrame(list(zip(train_padded_idxs[0,:-1], train_padded_idxs[0, 1:])),
                columns=['Input Words', 'Output Words']))

    Input Words  Output Words
0             0             0
1             0             0
2             0             0
3             0             0
4             0             0
5             0             0
6             0             0
7             0             0
8             0             0
9             0             0
10            0             0
11            0             0
12            0             0
13            0             0
14            0             0
15            0             0
16            0             0
17            0             0
18            0             0
19            0             0
20            0             0
21            0             0
22            0             0
23            0             0
24            0             0
25            0             0
26            0             0
27            0             0
28            0             0
29            0             0
..          ...           ...
43            9            10
44        
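
The same offset can be applied to every story at once. The sketch below uses a small made-up padded matrix written as plain Python lists; with a NumPy matrix like train_padded_idxs, the equivalent slices are `[:, :-1]` and `[:, 1:]`, as in the cell above.

```python
# Made-up padded matrix: 2 stories, padded to length 5 (longest story + 1)
toy_padded = [[0, 0, 2, 3, 4],
              [0, 5, 6, 7, 8]]

toy_inputs = [row[:-1] for row in toy_padded]   # drop the last column
toy_outputs = [row[1:] for row in toy_padded]   # drop the first column

for x_row, y_row in zip(toy_inputs, toy_outputs):
    print(x_row, "->", y_row)
# [0, 0, 2, 3] -> [0, 2, 3, 4]
# [0, 5, 6, 7] -> [5, 6, 7, 8]
```

At every position, the output row holds the index of the word that follows the corresponding input position.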

##  <font color='#6629b2'>Building the model</font>

To assemble the model, we'll use Keras' [Functional API](https://keras.io/getting-started/functional-api-guide/), which is one of two ways to use Keras to assemble models (the alternative is the [Sequential API](https://keras.io/getting-started/sequential-model-guide/), which is a bit simpler but has more constraints). A model consists of a series of layers. As shown in the code below, we initialize instances for each layer. Each layer can be called with another layer as input, e.g. Embedding()(input_layer). A model instance is initialized with the Model() object, which defines the initial input and final output layers for that model. Before the model can be trained, the compile() function must be called with the loss function and optimization algorithm specified (see below).

###  <font color='#6629b2'>Layers</font>

We'll build an RNN with five layers:

**1. Input**: The input layer takes in the matrix of word indices.

**2. Embedding**: An [embedding input layer](https://keras.io/layers/embeddings/) that converts integer word indices into distributed vector representations (embeddings). The mask_zero=True parameter indicates that values of 0 in the matrix (the padding) will be ignored by the model.

**3. GRU**: A [recurrent (GRU) hidden layer](https://keras.io/layers/recurrent/), the central component of the model. As it observes each word in the story, it integrates the word embedding representation with what it's observed so far to compute a representation (hidden state) of the story at that timepoint. There are a few architectures for this layer. I use the GRU variation; Keras also provides LSTM and the simple vanilla recurrent layer (see the materials at the bottom for an explanation of the difference). By setting return_sequences=True for this layer, it will output the hidden states for every timepoint in the model, i.e. for every word in the story.

**4. GRU**: A second recurrent layer that takes the output of the first as its input and operates the same way. Stacking recurrent layers like this often improves the model.

**5. (Time Distributed) Dense**: A [dense output layer](https://keras.io/layers/core/#dense) that outputs a probability for each word in the lexicon, where each probability indicates the chance of that word being the next word in the sequence. The 'softmax' activation is what transforms the values of this layer into scores from 0 to 1 that can be treated as probabilities. The Dense layer produces the probability scores for one particular timepoint (word). By wrapping this in a TimeDistributed() layer, the model outputs a probability distribution for every timepoint in the sequence.

The term "layer" is just an abstraction, when really all these layers are just matrices. The "weights" that connect the layers are also matrices. The process of training a neural network is a series of matrix multiplications. The weight matrices are the values that are adjusted during training in order for the model to learn to predict the next word.
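
To illustrate the point that layers are matrices, here is a small NumPy sketch of an embedding lookup followed by a softmax output layer. It deliberately skips the recurrent layers and feeds the embeddings straight into the output layer, so it is a simplification rather than the model built below. All sizes, names, and random weights are made up.

```python
import numpy as np

rng = np.random.RandomState(0)
n_words, n_embedding = 6, 4  # tiny made-up sizes

# Embedding layer: one row of weights per word index
embedding_weights = rng.normal(size=(n_words, n_embedding))
word_idxs = np.array([2, 5, 3])
embedded = embedding_weights[word_idxs]  # looking up a word = selecting a row
print(embedded.shape)  # (3, 4)

# Dense softmax output layer: a matrix multiplication, then normalization
dense_weights = rng.normal(size=(n_embedding, n_words))
dense_bias = np.zeros(n_words)

def softmax(scores):
    """Turn each row of scores into probabilities that sum to 1."""
    exp_scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return exp_scores / exp_scores.sum(axis=-1, keepdims=True)

probs = softmax(embedded.dot(dense_weights) + dense_bias)
print(probs.shape)         # (3, 6): one distribution over all words per timepoint
print(probs.sum(axis=-1))  # each row sums to 1 (up to floating-point error)
```

Applying the same dense weights to every timepoint, as done here for all three rows at once, is the role that TimeDistributed() plays in the Keras model.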

###  <font color='#6629b2'>Parameters</font>

Our function for creating the model takes the following parameters:

**seq_input_len:** the length of the input and output matrices. This is equal to the length of the longest story in the training data. 

**n_input_nodes**: the number of unique words in the lexicon, plus one to account for the padding represented by 0 values. This indicates the number of rows in the embedding layer, where each row corresponds to a word. It is also the dimensionality of the probability vectors given as the model output.

**n_embedding_nodes**: the number of dimensions (units) in the embedding layer, which can be freely defined. Here, it is set to 300.

**n_hidden_nodes**: the number of dimensions in the hidden layers. Like the embedding layer, this can be freely chosen. Here, it is set to 500.

**stateful**: By default, the GRU hidden layer will reset its state (i.e. its values will be 0s) each time a new set of sequences is read into the model.  However, when stateful=True is given, this parameter indicates that the GRU hidden layer should "remember" its state until it is explicitly told to forget it. In other words, the values in this layer will be carried over between separate calls to the training function. This is useful when processing long sequences, so that the model can iterate through chunks of the sequences rather than loading the entire matrix at the same time, which is memory-intensive. I'll show below how this setting is also useful when the model is used for word prediction after training. During training, the model will observe all words in a story at once, so stateful will be set to False. At prediction time, it will be set to True.

**batch_size**: It is not always necessary to specify the batch size when setting up a Keras model. The fit() function will apply batch processing by default and the batch size can be given as a parameter. However, when a model is stateful, the batch size does need to be specified in the Input() layers. Here, for training, batch_size=None, so Keras will use its default batch size (which is 32). During prediction, the batch size will be set to 1.
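
As a quick sanity check on these sizes, the arithmetic below works out n_input_nodes and the number of weights in the embedding and output layers for the example lexicon. It assumes that the output layer reads directly from a hidden layer with n_hidden_nodes units and has one bias per output, which is the Keras default for Dense. The GRU layers are left out because their exact parameter count depends on implementation details.

```python
lexicon_size = 1274               # items in the lexicon printed above, including <UNK>
n_input_nodes = lexicon_size + 1  # plus one for the padding index 0
n_embedding_nodes = 300
n_hidden_nodes = 500

# Embedding layer: one n_embedding_nodes-dimensional row per index
embedding_params = n_input_nodes * n_embedding_nodes

# Output layer: a weight from every hidden unit to every index, plus one bias per index
output_params = n_hidden_nodes * n_input_nodes + n_input_nodes

print(n_input_nodes)     # 1275
print(embedding_params)  # 382500
print(output_params)     # 638775
```

Both counts grow linearly with the size of the lexicon, which is one practical reason to filter rare words with min_freq on larger datasets.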

### <font color='#6629b2'>Procedure</font>

The output of the model is a sequence of vectors, each with n_input_nodes dimensions (the number of unique words, plus one for padding). Each vector contains the predicted probability of each possible word appearing as the next word at that position in the sequence. Like all neural networks, RNNs learn by updating the parameters (weights) to optimize an objective (loss) function applied to the output. For this model, the objective is to minimize the cross-entropy (named "sparse_categorical_crossentropy" in the code) between the predicted word probabilities and the probabilities observed from the words that appear in the training data, resulting in probabilities that more accurately predict when a particular word will appear. This is the general procedure used for multi-class classification tasks, where exactly one label (here, one word) is correct for each prediction. Updates to the weights of the model are performed using an optimization algorithm, such as Adam used here. The details of this process are extensive; see the resources at the bottom of the notebook if you want a deeper understanding. One huge benefit of Keras is that it implements many of these details for you. Not only does it already have implementations of the types of layer architectures, it also has many of the [loss functions](https://keras.io/losses/) and [optimization methods](https://keras.io/optimizers/) you need for training various models.
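
As a rough illustration of what the cross-entropy objective measures, the sketch below computes it by hand for a toy sequence. The probabilities and word indices are made up, and batching and the masking of padding are left out, so this is only the core quantity rather than the full Keras loss.

```python
import numpy

# Toy predicted distributions for 3 positions over a 4-word lexicon
# (made-up numbers; each row sums to 1, like the softmax output of the model)
predicted_probs = numpy.array([[0.1, 0.6, 0.2, 0.1],
                               [0.3, 0.3, 0.3, 0.1],
                               [0.05, 0.05, 0.1, 0.8]])
# Index of the word that actually appears at each position
observed_words = numpy.array([1, 2, 3])

# Probability the model assigned to each observed word
p_observed = predicted_probs[numpy.arange(len(observed_words)), observed_words]

# Cross-entropy: mean negative log probability of the observed words
cross_entropy = -numpy.mean(numpy.log(p_observed))
print("P(observed words):", p_observed)
print("Cross-entropy: {:.3f}".format(cross_entropy))
```

The loss only goes down when the model puts more probability on the words that actually occur, which is why a decreasing loss indicates the model is learning.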

In [10]:
'''Create the model'''

from keras.models import Model
# Embedding and GRU can be imported directly from keras.layers
from keras.layers import Input, Dense, TimeDistributed, Embedding, GRU

def create_model(seq_input_len, n_input_nodes, n_embedding_nodes, 
                 n_hidden_nodes, stateful=False, batch_size=None):
    
    # Layer 1
    input_layer = Input(batch_shape=(batch_size, seq_input_len), name='input_layer')

    # Layer 2
    embedding_layer = Embedding(input_dim=n_input_nodes, 
                                output_dim=n_embedding_nodes, 
                                mask_zero=True, name='embedding_layer')(input_layer) #mask_zero=True will ignore padding
    # Output shape = (batch_size, seq_input_len, n_embedding_nodes)

    #Layer 3
    gru_layer1 = GRU(n_hidden_nodes,
                     return_sequences=True, #return hidden state for each word, not just last one
                     stateful=stateful, name='hidden_layer1')(embedding_layer)
    # Output shape = (batch_size, seq_input_len, n_hidden_nodes)

    #Layer 4
    gru_layer2 = GRU(n_hidden_nodes,
                     return_sequences=True,
                     stateful=stateful, name='hidden_layer2')(gru_layer1)
    # Output shape = (batch_size, seq_input_len, n_hidden_nodes)

    #Layer 5
    output_layer = TimeDistributed(Dense(n_input_nodes, activation="softmax"), 
                                   name='output_layer')(gru_layer2)
    # Output shape = (batch_size, seq_input_len, n_input_nodes)
    
    model = Model(inputs=input_layer, outputs=output_layer)

    #Specify loss function and optimization algorithm, compile model
    model.compile(loss="sparse_categorical_crossentropy",
                  optimizer='adam')
    
    return model

In [11]:
model = create_model(seq_input_len=train_padded_idxs.shape[-1] - 1, #subtract 1 from matrix length because of offset
                     n_input_nodes = len(lexicon) + 1, # Add 1 to account for 0 padding
                     n_embedding_nodes = 300,
                     n_hidden_nodes = 500)

###  <font color='#6629b2'>Training</font>

Now we're ready to train the model. We'll call the fit() function to train the model for 5 iterations through the dataset (epochs), with a batch size of 20 stories. Keras reports the cross-entropy loss after each epoch; if the model is learning correctly, the loss should progressively decrease.

In [12]:
'''Train the model'''

# output matrix (y) has extra 3rd dimension added because sparse cross-entropy function requires one label per row
model.fit(x=train_padded_idxs[:,:-1], y=train_padded_idxs[:, 1:, None], epochs=5, batch_size=20)
model.save_weights('example_model/model_weights.h5') #Save model weights

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
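
The slicing in the fit() call above pairs every word with the word that follows it. A small sketch on a toy matrix makes the offset and the resulting shapes easier to see (the index values are made up, and 0 stands for padding):

```python
import numpy

# Toy padded matrix of word indices for 2 stories (0 = padding)
toy_padded_idxs = numpy.array([[0, 0, 5, 8, 3],
                               [2, 7, 4, 9, 6]])

x = toy_padded_idxs[:, :-1]        # every word except the last is an input
y = toy_padded_idxs[:, 1:, None]   # every word except the first is a target

print("x shape:", x.shape)
print("y shape:", y.shape)

# Each input word in the second story is paired with the word that follows it
for word, next_word in zip(x[1], y[1, :, 0]):
    print(word, "->", next_word)
```

Because one column is lost to the offset, the model's seq_input_len is one less than the width of the padded matrix.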


### <font color='#6629b2'>Evaluation</font>

It is useful to know how well the cross-entropy measured on the training set generalizes to stories not observed during training. The typical intrinsic evaluation metric for language modeling is perplexity, which is derived from cross-entropy. Perplexity is a measure of how much the model "expects" or "anticipates" the words in a given set of texts, where lower perplexity values mean the model is better at guessing the words that appear. The closer perplexity is to 1 (its minimum, reached only if the model assigns probability 1 to every observed word), the better the model is at predicting the given sequences. This number does not indicate the performance of a model when applied to a specific task, but it is still useful for showing relative differences between models in terms of how well they fit the data. We'll evaluate perplexity on an example test set of 100 sentences that were not observed during training.

In [13]:
'''Load test set and apply same processing used for training stories'''

test_stories = pandas.read_csv('dataset/example_test_stories.csv', encoding='utf-8')
test_stories['Tokenized_Story'] = text_to_tokens(test_stories['Story'])
test_stories['Story_Idxs'] = tokens_to_idxs(token_seqs=test_stories['Tokenized_Story'],
                                            lexicon=lexicon)
test_padded_idxs = pad_sequences(test_stories['Story_Idxs'], 
                                 maxlen=max_seq_len + 1)

Keras has an evaluate() function that will return the cross-entropy loss of the model on a given set of instances, the same thing that was reported during training. We can provide the test instances as input to this function to get the cross-entropy. Perplexity is equal to the exponentiated cross-entropy:

In [14]:
import numpy 

perplexity = numpy.exp(model.evaluate(x=test_padded_idxs[:,:-1], 
                                      y=test_padded_idxs[:, 1:, None]))
print("PERPLEXITY ON TEST SET: {:.3f}".format(perplexity))

PERPLEXITY ON TEST SET: 620.021
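
To get a feel for the scale of perplexity values, here is a small sketch that computes perplexity directly from the probabilities assigned to the observed words. The helper name perplexity_from_probs and the toy inputs are assumptions for illustration, not part of the notebook's pipeline.

```python
import numpy

def perplexity_from_probs(p_observed):
    '''Perplexity = exp(mean negative log probability of the observed words)'''
    cross_entropy = -numpy.mean(numpy.log(p_observed))
    return numpy.exp(cross_entropy)

# A model that is certain of every observed word reaches the minimum perplexity
certain = perplexity_from_probs(numpy.array([1.0, 1.0, 1.0]))
print("Certain model: {:.3f}".format(certain))

# A model that guesses uniformly over a 1,000-word lexicon is as "perplexed"
# as if it were choosing among 1,000 equally likely words
lexicon_size = 1000
uniform = perplexity_from_probs(numpy.full(5, 1.0 / lexicon_size))
print("Uniform model: {:.3f}".format(uniform))
```

Read this way, a perplexity of 620 means the model is, on average, about as uncertain as if it were picking uniformly among 620 words at each position.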


## <font color='#6629b2'>Prediction Tasks</font>

Now that the model is trained, we can apply it to prediction tasks using the test set. I'll show two tasks: computing a probability score for a story, and generating a new ending for a story. To demonstrate both of these, I'll load a saved model previously trained on the ~100,000 stories in the training set. As opposed to training where we processed multiple stories at the same time, it will be more straightforward to demonstrate prediction on a single story at a time, especially since prediction is fast relative to training. In Keras, you can duplicate a model by loading the parameters from a saved model into a new model. Here, this new model will have a batch size of 1. It will also process a story one word at a time (seq_input_len=1), using the stateful=True parameter to remember the story that has occurred up to that word. The other parameters of this prediction model are exactly the same as the trained model, which is why the weights can be readily transferred.

In [15]:
'''Create a new test model, setting batch_size = 1, seq_input_len = 1, and stateful = True'''

# Load lexicon from the saved model 
with open('pretrained_model/lexicon.pkl', 'rb') as f:
    lexicon = pickle.load(f)
lexicon_lookup = get_lexicon_lookup(lexicon)

predictor_model = create_model(seq_input_len=1,
                               n_input_nodes=len(lexicon) + 1,
                               n_embedding_nodes = 300,
                               n_hidden_nodes = 500,
                               stateful=True, 
                               batch_size = 1)

predictor_model.load_weights('pretrained_model/model_weights.h5') #Load weights from saved model

LEXICON LOOKUP SAMPLE:
{2: 'fawn', 13: 'lachelle', 5: 'woods', 6: 'spiders', 7: 'hanging', 8: 'francesca', 9: 'disobeying', 10: 'canes', 16670: 'aileen', 12: 'scold', 3: 'sonja', 14: 'screaming', 104: 'wooded', 16: 'grueling', 17: 'wooden', 18: 'wednesday', 19: 'crotch', 20: 'stereotypical', 20919: 'coordinate', 21: 'pizzle'}


In [16]:
'''Re-encode the test stories with lexicon we just loaded'''

test_stories['Story_Idxs'] = tokens_to_idxs(token_seqs=test_stories['Tokenized_Story'],
                                            lexicon=lexicon)

### <font color='#6629b2'>Computing story probabilities</font>

Since the model outputs a probability distribution for each word in the story, indicating the probability of each possible next word in the story, we can use these values to get a single probability score for the story. To do this, we iterate through each word in a story, call the predict() function to get the full list of probabilities for the next word, and then extract the probability predicted for the actual next word in the story. We can average these probabilities across all words in the story to get a single value. The stateful=True parameter is what enables the model to remember the previous words in the story when predicting the probability of the next word. Because of this, the reset_states() function must be called at the end of reading the story in order to clear its memory for the next story.

We do this below to compare the probability of each story to one with an ending randomly selected from another story in the test set. Of course, a good language model should overall score the randomly selected endings as less probable than the correct endings. 

In [17]:
'''Compute the probability of stories according to the language model'''

import numpy

def get_probability(idx_seq):
    idx_seq = [0] + idx_seq #Prepend 0 so first call to predict() computes prob of first word from zero padding
    probs = []
    for word, next_word in zip(idx_seq[:-1], idx_seq[1:]):
        # Word is an integer, but the model expects an input array
        # with the shape (batch_size, seq_input_len), so prepend two dimensions
        p_next_word = predictor_model.predict(numpy.array(word)[None,None])[0,0] #Output shape= (lexicon_size + 1,)
        #Select predicted prob of the next word, which appears in the corresponding idx position of the probability vector
        p_next_word = p_next_word[next_word]
        probs.append(p_next_word)
    predictor_model.reset_states()
    return numpy.mean(probs) #return average probability of words in sequence

for _, test_story in test_stories[:10].iterrows():

    # Split out initial four sentences in story and ending sentence
    len_initial_story = len([word for sent in list(encoder(test_story['Story']).sents)[:-1] for word in sent])
    token_initial_story = test_story['Tokenized_Story'][:len_initial_story]
    idx_initial_story = test_story['Story_Idxs'][:len_initial_story]
    token_ending = test_story['Tokenized_Story'][len_initial_story:]
    
    # Randomly select another story and get its ending
    rand_story = test_stories.loc[numpy.random.choice(len(test_stories))]
    len_rand_ending = len(list(encoder(rand_story['Story']).sents)[-1])
    token_rand_ending = rand_story['Tokenized_Story'][-len_rand_ending:]
    idx_rand_ending = rand_story['Story_Idxs'][-len_rand_ending:]

    print("INITIAL STORY:", " ".join(token_initial_story))
    prob_given_ending = get_probability(test_story['Story_Idxs'])
    print("GIVEN ENDING: {} (P = {:.3f})".format(" ".join(token_ending), prob_given_ending))

    #print("PROBABILITY:", get_probability(test_story['Story_Idxs']))
    prob_rand_ending = get_probability(idx_initial_story + idx_rand_ending)
    print("RANDOM ENDING: {} (P = {:.3f})".format(" ".join(token_rand_ending), prob_rand_ending), "\n")


INITIAL STORY: lars went out skateboarding , today . he skateboarded to the skate park . his friends taught him how to do a new trick . it is a difficult trick , but he 's going to keep practicing .
GIVEN ENDING: tomorrow , he 'll teach his friends something new , too . (P = 0.226)
RANDOM ENDING: he is glad he drank some coffee . (P = 0.227) 

INITIAL STORY: abby is an avid scuba diver . abby 's dream has always been to scuba dive at the great barrier reef . one day , abby got a letter from her father in the mail . as abby opened the letter , she began crying with joy .
GIVEN ENDING: abby 's dad was sending her to the great barrier reef to dive in may. (P = 0.255)
RANDOM ENDING: anne dodged the beverage just in time , sparing her dress . (P = 0.236) 

INITIAL STORY: maggie was 100 years old . she knew her time was coming to an end soon . she gathered all of her family around her bed side . she told them her final goodbyes .
GIVEN ENDING: maggie passed away minutes later . (P = 0.281)
R

### <font color='#6629b2'>Generating sentences</font>

The language model can also be used to generate new text. Here, I'll give the same predictor model the first four sentences of a story in the test set and have it generate the fifth sentence. To do this, we "load" the first four sentences into the model. This can be done using the predict() function. Because the model is stateful, predict() saves the representation of the story internally even though we don't need the output of this function when just reading the story. Once the final word in the fourth sentence has been read, we can start using the resulting probability distribution to predict the first word in the fifth sentence. We can call numpy.random.choice() to randomly sample a word according to its probability. Now we again call predict() with this new word as input, which returns a probability distribution for the second word. Again, we sample from this distribution, append the newly sampled word to the previously generated word, and call predict() with this new word as input. We continue doing this until an end-of-sentence punctuation mark (".", "!", "?") has been selected. Just as before, reset_states() is called after the whole sentence has been generated. Then we can decode the generated ending into a string using the lexicon lookup dictionary. You can see that the generated endings are generally not as coherent and well-formed as the human-authored endings, but they do capture some components of the story and they are often entertaining.
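
The sampling loop described above can be tried without a trained model. In the sketch below, toy_predict is a hypothetical stand-in for predictor_model.predict(): it looks up a fixed, made-up distribution based only on the previous word, so unlike the stateful GRU it has no memory of earlier words. The four-word lexicon is also invented for illustration.

```python
import numpy

rng = numpy.random.default_rng(0)
toy_lookup = {1: "she", 2: "smiled", 3: "again", 4: "."}  # hypothetical lexicon lookup

def toy_predict(word):
    '''Stand-in for predictor_model.predict(): return a distribution over indices 0-4.
    Index 0 (padding) always gets probability 0 in this toy table.'''
    table = {0: [0.0, 0.7, 0.1, 0.1, 0.1],
             1: [0.0, 0.1, 0.6, 0.2, 0.1],
             2: [0.0, 0.1, 0.1, 0.3, 0.5],
             3: [0.0, 0.2, 0.2, 0.1, 0.5],
             4: [0.0, 0.4, 0.2, 0.2, 0.2]}
    return numpy.array(table[word])

generated = []
p_next_word = toy_predict(0)   # distribution for the first word, given padding
while not generated or toy_lookup[generated[-1]] != ".":
    # Sample a word index in proportion to its predicted probability
    next_word = int(rng.choice(len(p_next_word), p=p_next_word))
    generated.append(next_word)
    # Feed the sampled word back in to get the distribution for the following word
    p_next_word = toy_predict(next_word)

print(" ".join(toy_lookup[word] for word in generated))
```

Sampling, rather than always taking the most probable word, is what makes each generated ending different from run to run.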

In [22]:
'''Use the model to generate new endings for stories'''

def generate_ending(idx_seq):
    
    end_of_sent_tokens = [".", "!", "?"]
    generated_ending = []
    
    # First just read initial story, no output needed
    idx_seq = [0] + idx_seq #Prepend 0 so first call to predict() observes 0 padding
    for word in idx_seq:
        p_next_word = predictor_model.predict(numpy.array(word)[None,None])[0,0]
        
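    # Now start predicting new words
    # Use .get() so a sampled index missing from the lookup (e.g. 0 for padding) doesn't raise a KeyError
    while not generated_ending or lexicon_lookup.get(next_word) not in end_of_sent_tokens: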
        #Randomly sample a word from the current probability distribution
        next_word = numpy.random.choice(a=p_next_word.shape[-1], p=p_next_word)
        # Append sampled word to generated ending
        generated_ending.append(next_word)
        # Get probabilities for next word by inputting sampled word
        p_next_word = predictor_model.predict(numpy.array(next_word)[None,None])[0,0]
    
    predictor_model.reset_states() #reset hidden state after generating ending
    
    return generated_ending

for _, test_story in test_stories[:20].iterrows():
    # Use spacy to segment the story into sentences, so we can separate the ending sentence
    # Find out where in the story the ending starts (number of words from end of story)
    ending_story_idx = len(list(encoder(test_story['Story']).sents)[-1])
    print("INITIAL STORY:", " ".join(test_story['Tokenized_Story'][:-ending_story_idx]))
    print("GIVEN ENDING:", " ".join(test_story['Tokenized_Story'][-ending_story_idx:]))
    
    generated_ending = generate_ending(test_story['Story_Idxs'][:-ending_story_idx])
    generated_ending = " ".join([lexicon_lookup[word] if word in lexicon_lookup else ""
                                 for word in generated_ending]) #decode from numbers back into words
    print("GENERATED ENDING:", generated_ending, "\n")
    

INITIAL STORY: lars went out skateboarding , today . he skateboarded to the skate park . his friends taught him how to do a new trick . it is a difficult trick , but he 's going to keep practicing .
GIVEN ENDING: tomorrow , he 'll teach his friends something new , too .
GENERATED ENDING: it still ca n't control me during the ice skating zone ! 

INITIAL STORY: abby is an avid scuba diver . abby 's dream has always been to scuba dive at the great barrier reef . one day , abby got a letter from her father in the mail . as abby opened the letter , she began crying with joy .
GIVEN ENDING: abby 's dad was sending her to the great barrier reef to dive in may.
GENERATED ENDING: abby will never call her brother again to tell him not to study . 

INITIAL STORY: maggie was 100 years old . she knew her time was coming to an end soon . she gathered all of her family around her bed side . she told them her final goodbyes .
GIVEN ENDING: maggie passed away minutes later .
GENERATED ENDING: their pa

### <font color='#6629b2'>Visualizing data inside the model</font>

To help visualize the data representation inside the model, we can look at the output of each layer individually. Keras' Functional API lets you derive a new model with the layers from an existing model, so you can define the output to be a layer below the output layer in the original model. Calling predict() on this new model will produce the output of that layer for a given input. Of course, glancing at the numbers by themselves doesn't provide any interpretation of what the model has learned (although there are opportunities to [interpret these values](https://www.civisanalytics.com/blog/interpreting-visualizing-neural-networks-text-processing/)), but seeing them verifies the model is just a series of transformations from one matrix to another. The model stores its layers as the list model.layers, and you can retrieve a specific layer by its position index in the model. Below is an example of the word embedding output for the first word in the first story of the test set. You can do the same thing to view any layer.

In [19]:
'''Show the output of the word embedding layer for the first word of the first story'''

embedding_layer = Model(inputs=predictor_model.layers[0].input,
                        outputs=predictor_model.layers[1].output)
embedding_output = embedding_layer.predict(numpy.array(test_stories['Story_Idxs'][0][0])[None,None])
print("EMBEDDING OUTPUT SHAPE:", embedding_output.shape)
print(embedding_output[0]) # Print embedding vectors for first word of first story

EMBEDDING OUTPUT SHAPE: (1, 1, 300)
[[-1.38987884e-01  8.48659724e-02  4.91646171e-01 -6.54342631e-03
  -2.85563409e-01 -2.74401139e-02 -9.69286170e-03 -3.78113415e-04
   5.88026680e-02 -3.69859517e-01  5.96126541e-02  2.81094592e-02
  -2.67947197e-01  2.70303756e-01 -2.39816919e-01  7.30297789e-02
  -1.58825010e-01 -3.84417474e-02 -5.57651417e-03  6.54271543e-02
  -9.21142325e-02  8.57164860e-02 -2.29064282e-02 -6.99254647e-02
   1.72443300e-01 -2.84303248e-01  6.02143891e-02 -1.95725426e-01
  -1.40484378e-01  1.31462470e-01  1.60761863e-01 -2.53426611e-01
  -2.65083779e-02  2.64941044e-02  2.27801532e-01 -2.11654648e-01
   1.65292099e-01  4.04135184e-03 -2.57555127e-01 -1.71172783e-01
   5.93047976e-01  2.59072542e-01 -8.60209540e-02 -8.15737396e-02
  -1.51765138e-01 -3.61861557e-01  2.53538322e-02 -9.90964565e-03
   1.05063029e-01 -2.80712783e-01  4.86194223e-01  1.08294666e-01
   2.10506707e-01 -4.82211597e-02  1.42647669e-01 -8.88424441e-02
  -3.22649330e-01 -1.00344390e-01  2.720
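
Since each row of the embedding layer corresponds to a word, the output shape above can be reproduced with plain array indexing. This is a sketch with toy sizes and random weights, not the trained model's values:

```python
import numpy

n_input_nodes, n_embedding_nodes = 6, 3   # toy sizes, not the real model's
rng = numpy.random.default_rng(0)
# One row of weights per word index (row 0 would correspond to padding)
embedding_weights = rng.normal(size=(n_input_nodes, n_embedding_nodes))

word_idxs = numpy.array([[4]])                   # shape (batch_size, seq_input_len) = (1, 1)
embedding_output = embedding_weights[word_idxs]  # look up one row per word index

print("EMBEDDING OUTPUT SHAPE:", embedding_output.shape)
print(embedding_output[0])
```

With the real model's 300 embedding dimensions, the same lookup gives the (1, 1, 300) shape printed above.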

It is also easy to look at the weight matrices that connect the layers. The get_weights() function will show the incoming weights for a particular layer.

In [20]:
'''Show weights that connect the hidden layer to the output layer'''

hidden_to_output_weights = predictor_model.layers[-1].get_weights()[0]
print("HIDDEN-TO_OUTPUT WEIGHTS SHAPE:", hidden_to_output_weights.shape)
print(hidden_to_output_weights)

HIDDEN-TO_OUTPUT WEIGHTS SHAPE: (500, 25043)
[[ 0.03925077  0.12099706 -0.23900633 ...  0.02966825  0.04271667
   0.03323554]
 [-0.09675466 -0.02291218  0.3609857  ... -0.10591131 -0.09102738
  -0.10392082]
 [ 0.10144623  0.01986339  0.06466851 ...  0.11147965  0.09949677
   0.10737196]
 ...
 [-0.13619427 -0.01907774 -0.53343266 ... -0.11855997 -0.15162987
  -0.13245827]
 [ 0.03687518 -0.10616766  0.01242885 ...  0.02534318  0.02902994
   0.02383525]
 [ 0.04188101  0.11175645  0.16715644 ...  0.04160435  0.0315874
   0.05033163]]
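
The shape of this matrix, (n_hidden_nodes, n_input_nodes), reflects how the output layer turns one hidden state into one probability per word. Here is a sketch of that transformation with toy sizes and random values. Treating the second array returned by get_weights() as the bias vector is an assumption here, since only the first array is inspected above.

```python
import numpy

rng = numpy.random.default_rng(0)
n_hidden_nodes, n_input_nodes = 5, 8   # toy sizes; the model above uses 500 and 25043

hidden_state = rng.normal(size=n_hidden_nodes)              # hidden layer output for one word
weights = rng.normal(size=(n_hidden_nodes, n_input_nodes))  # plays the role of get_weights()[0]
biases = numpy.zeros(n_input_nodes)                         # assumed bias vector, set to 0 here

scores = hidden_state @ weights + biases        # one score per word in the lexicon
exp_scores = numpy.exp(scores - scores.max())   # subtract the max for numerical stability
probs = exp_scores / exp_scores.sum()           # softmax turns scores into probabilities

print("OUTPUT SHAPE:", probs.shape)
print("SUM OF PROBABILITIES: {:.3f}".format(probs.sum()))
```

Each column of the weight matrix therefore belongs to one word: it determines how strongly each hidden dimension pushes the model toward predicting that word next.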


## <font color='#6629b2'>Conclusion</font>

There are a good number of tutorials on RNN language models, particularly applied to text generation. This notebook shows how to leverage Keras with batch training when the length of the sequences is variable. There are many ways this language model can be made more sophisticated. Here are a few interesting papers from the NLP community that innovate on this basic model for different generation tasks:

*Recipe generation:* [Globally Coherent Text Generation with Neural Checklist Models](https://homes.cs.washington.edu/~yejin/Papers/emnlp16_neuralchecklist.pdf). Chloé Kiddon, Luke Zettlemoyer, and Yejin Choi. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2016.

*Emotional text generation:* [Affect-LM: A Neural Language Model for Customizable Affective Text Generation](https://arxiv.org/pdf/1704.06851.pdf). Sayan Ghosh, Mathieu Chollet, Eugene Laksana, Louis-Philippe Morency, Stefan Scherer. Annual Meeting of the Association for Computational Linguistics (ACL), 2017.

*Poetry generation:* [Generating Topical Poetry](https://www.isi.edu/natural-language/mt/emnlp16-poetry.pdf). Marjan Ghazvininejad, Xing Shi, Yejin Choi, and Kevin Knight. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2016.

*Dialogue generation:* [A Neural Network Approach to Context-Sensitive Generation of Conversational Responses](http://www-etud.iro.umontreal.ca/~sordonia/pdf/naacl15.pdf). Alessandro Sordoni, Michel Galley, Michael Auli, Chris Brockett, Yangfeng Ji, Margaret Mitchell, Jian-Yun Nie, Jianfeng Gao, Bill Dolan. North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2015.

## <font color='#6629b2'>More resources</font>

Yoav Goldberg's book [Neural Network Methods for Natural Language Processing](http://www.morganclaypool.com/doi/abs/10.2200/S00762ED1V01Y201703HLT037) is a thorough introduction to neural networks for NLP tasks in general.

If you'd like to learn more about what Keras is doing under the hood, the [Theano tutorials](http://deeplearning.net/tutorial/) are useful. There are two specifically on RNNs for NLP: [semantic parsing](http://deeplearning.net/tutorial/rnnslu.html#rnnslu) and [sentiment analysis](http://deeplearning.net/tutorial/lstm.html#lstm).

TensorFlow also has an RNN language model [tutorial](https://www.tensorflow.org/versions/r0.12/tutorials/recurrent/index.html) using the Penn Treebank dataset.

Andrej Karpathy's blog post [The Unreasonable Effectiveness of Recurrent Neural Networks](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) is very helpful for understanding the underlying details of the same language model I've demonstrated here. It also provides raw Python code with an implementation of the backpropagation algorithm.

Chris Olah provides a good [explanation](http://colah.github.io/posts/2015-08-Understanding-LSTMs/) of how LSTM RNNs work (this explanation also applies to the GRU model used here).

Denny Britz's [tutorial](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/) documents well both the technical details of RNNs and their implementation in Python.
