# Part-of-speech tagging with recurrent neural networks using Keras

## Overview

We will see how to use the Keras library to build a recurrent neural network (RNN) model that assigns part-of-speech (POS) tags to the words in sentences.

### Part-of-speech (POS) tagging

A part-of-speech tag is the syntactic category associated with a particular word in a sentence, such as a noun, verb, preposition, determiner, adjective or adverb. Part-of-speech tagging is a fundamental task in natural language processing; see the [chapter in Jurafsky & Martin's *Speech and Language Processing*](https://web.stanford.edu/~jurafsky/slp3/10.pdf) for more background. POS tagging is a common pre-processing step in many NLP pipelines. For example, words with certain POS tags are more important than other words for capturing the content of a text (e.g. nouns and verbs carry more semantic meaning than grammatical words like prepositions and determiners), so models often take this into account when predicting the topic, sentiment, or some other categorical dimension of a text. State-of-the-art models are quite successful, reaching near-perfect accuracy in the tags assigned to words. This notebook will show how to put together a simple tagger that uses a Recurrent Neural Network, though it does not perform as well as more advanced models.

### Recurrent Neural Networks (RNNs)

RNNs are a general framework for modeling sequence data and are particularly useful for natural language processing tasks. At a high level, RNNs encode sequences via a set of parameters (weights) that are optimized to predict some output variable. This notebook demonstrates the code needed to assemble an RNN model using the Keras library, as well as some data processing tools that facilitate building the model.

If you understand how to structure the input and output of the model, and know the fundamental concepts in machine learning, then a high-level understanding of how an RNN works is sufficient for using Keras. You'll see that most of the code here is actually just data manipulation, and I'll visualize each step in this process. The code used to assemble the RNN itself is comparatively minimal. It is of course useful to know the technical details of the RNN, so that you can reason about the results and modify the model to improve it. For a better understanding of RNNs and neural networks in general, see the resources at the bottom of the notebook.

Here an RNN will be used to encode a sentence and assign a POS tag to each word. The model shown here is applicable to any dataset with a one-to-one mapping between the inputs and outputs. This includes any task in which each sequential unit (here, a word) should be assigned some output unit (here, a POS tag).
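
For readers who want a concrete picture of "one output per input", the following is a minimal sketch of a vanilla recurrent step in plain Python. It is not the GRU used later in this notebook and it is not Keras code; the scalar weights `w_in` and `w_rec` are arbitrary values chosen for illustration, whereas a real model learns weight matrices during training.

```python
import math

def rnn_step(x_t, h_prev, w_in=0.5, w_rec=0.8, bias=0.0):
    """One step of a scalar vanilla RNN: new state from current input and previous state."""
    return math.tanh(w_in * x_t + w_rec * h_prev + bias)

inputs = [1.0, 0.0, -1.0]   # a toy sequence of three scalar inputs
h = 0.0                     # the hidden state starts at zero
states = []
for x_t in inputs:
    h = rnn_step(x_t, h)
    states.append(h)        # one hidden state per input position

print([round(s, 3) for s in states])
```

Each hidden state depends on the current input and on everything that came before it, which is what lets the tagger use the preceding context when labeling a word.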


## Dataset

The [Brown Corpus](http://www.hit.uib.no/icame/brown/bcm.html) (download through NLTK [here](http://www.nltk.org/nltk_data/)) is a popular NLP resource that consists of 500 texts from a variety of sources, including news reports, academic essays, and fiction. Every word in the texts has been annotated with a POS tag. There are different POS annotation schemes provided in the corpus, which differ in the number of tags assigned. Here I use the coarse-grained scheme, which has eleven unique tags; finer-grained schemes might, for example, split the coarse-grained "VERB" tag into separate tags based on the specific tense of the verb. I set up the dataset so that each entry is a single sentence. The code below loads a sample of 100 sentences from the corpus; see this [link](http://www.helsinki.fi/varieng/CoRD/corpora/BROWN/) to get the full dataset.

In [1]:
from __future__ import print_function #Python 2/3 compatibility for print statements

import pandas as pd #must be imported before pd.set_option is called
pd.set_option('display.max_colwidth', 170) #widen pandas rows display

**Exercise :** Load the dataset from the path dataset/example_train_brown_corpus.csv and save it in a variable named train_sents. Load only the first 100 sentences to go easy on your computer.

In [6]:
import pandas as pd
'''Load the dataset'''
### ENTER YOUR CODE HERE (1 line)
train_sents = pd.read_csv('dataset/example_train_brown_corpus.csv', nrows=100) # nrows limits loading to the first 100 sentences
### END
train_sents.head()

Unnamed: 0,Tokenized_Sentence,Tagged_Sentence
0,The\tFulton\tCounty\tGrand\tJury\tsaid\tFriday...,DET\tNOUN\tNOUN\tADJ\tNOUN\tVERB\tNOUN\tDET\tN...
1,The\tjury\tfurther\tsaid\tin\tterm-end\tpresen...,DET\tNOUN\tADV\tVERB\tADP\tNOUN\tNOUN\tADP\tDE...
2,The\tSeptember-October\tterm\tjury\thad\tbeen\...,DET\tNOUN\tNOUN\tNOUN\tVERB\tVERB\tVERB\tADP\t...
3,``\tOnly\ta\trelative\thandful\tof\tsuch\trepo...,.\tADV\tDET\tADJ\tNOUN\tADP\tADJ\tNOUN\tVERB\t...
4,The\tjury\tsaid\tit\tdid\tfind\tthat\tmany\tof...,DET\tNOUN\tVERB\tPRON\tVERB\tVERB\tADP\tADJ\tA...


### Preprocessing 
We split each Tokenized_Sentence on tab characters and convert its words to lower case.
For Tagged_Sentence, we simply split on tab characters.

In [7]:
#Get the word tokens and tags into a readable list format
train_sents['Tokenized_Sentence'] = train_sents['Tokenized_Sentence'].apply(lambda sent: sent.lower().split("\t"))
train_sents['Tagged_Sentence'] = train_sents['Tagged_Sentence'].apply(lambda sent: sent.split("\t"))

train_sents[:10]

Unnamed: 0,Tokenized_Sentence,Tagged_Sentence
0,"[the, fulton, county, grand, jury, said, frida...","[DET, NOUN, NOUN, ADJ, NOUN, VERB, NOUN, DET, ..."
1,"[the, jury, further, said, in, term-end, prese...","[DET, NOUN, ADV, VERB, ADP, NOUN, NOUN, ADP, D..."
2,"[the, september-october, term, jury, had, been...","[DET, NOUN, NOUN, NOUN, VERB, VERB, VERB, ADP,..."
3,"[``, only, a, relative, handful, of, such, rep...","[., ADV, DET, ADJ, NOUN, ADP, ADJ, NOUN, VERB,..."
4,"[the, jury, said, it, did, find, that, many, o...","[DET, NOUN, VERB, PRON, VERB, VERB, ADP, ADJ, ..."
5,"[it, recommended, that, fulton, legislators, a...","[PRON, VERB, ADP, NOUN, NOUN, VERB, ., PRT, VE..."
6,"[the, grand, jury, commented, on, a, number, o...","[DET, ADJ, NOUN, VERB, ADP, DET, NOUN, ADP, AD..."
7,"[merger, proposed]","[NOUN, VERB]"
8,"[however, ,, the, jury, said, it, believes, ``...","[ADV, ., DET, NOUN, VERB, PRON, VERB, ., DET, ..."
9,"[the, city, purchasing, department, ,, the, ju...","[DET, NOUN, VERB, NOUN, ., DET, NOUN, VERB, .,..."


## Preparing the data

The sentences have already been tokenized into words, so both the words in each sentence and the corresponding tags are represented as lists.

###  Lexicons

We need to assemble lexicons for both the words and tags. The term "lexicon" usually refers specifically to the words in a model, but here I use it generally to mean a mapping between strings and numerical indices, which applies to the POS tags as well (I'll distinguish between the "words lexicon" and "tags lexicon"). Each word/tag is assigned a numerical index that can be read by the model. For the words lexicon, since large datasets may contain a huge number of unique words, it's common to filter all words occurring less than a certain number of times and replace them with some generic &lt;UNK&gt; token. The min_freq parameter in the function below defines this threshold. For the tags, we'll include all of them in the model since these are the output classes we are trying to predict. There are only 11 tags in this dataset. An &lt;UNK&gt; tag is included, even though it doesn't actually appear in the dataset; this isn't a problem, because the model will learn not to predict it.
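
To make the effect of `min_freq` concrete, here is a minimal sketch on a made-up three-sentence corpus. The helper name `build_toy_lexicon` and the toy sentences are illustrative assumptions, not part of the notebook's dataset; the indexing convention (0 for padding, 1 for `<UNK>`) follows the description above.

```python
from collections import Counter

def build_toy_lexicon(token_seqs, min_freq=1):
    """Map each token seen at least min_freq times to an index >= 2."""
    counts = Counter(token for seq in token_seqs for token in seq)
    kept = [token for token, count in counts.items() if count >= min_freq]
    lexicon = {token: idx + 2 for idx, token in enumerate(kept)}  # 0 is left free for padding
    lexicon['<UNK>'] = 1  # shared index for rare or unseen tokens
    return lexicon

toy_sents = [['the', 'jury', 'said'], ['the', 'jury', 'met'], ['the', 'city', 'said']]
toy_lexicon = build_toy_lexicon(toy_sents, min_freq=2)
print(toy_lexicon)

# Tokens below the threshold fall back to the <UNK> index.
encoded = [toy_lexicon.get(token, toy_lexicon['<UNK>']) for token in ['the', 'city', 'met']]
print(encoded)
```

With `min_freq=2`, the words "met" and "city" each occur only once, so they are left out of the lexicon and are later encoded with the `<UNK>` index.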

**Exercise :** Create a make_lexicon function. This function returns a lexicon, and it is used for both the words in the sentences and the tags. Refer to the previous chapter if you don't know how to create a lexicon; the method is exactly the same.

In [8]:
'''Create a lexicon for the words in the sentences as well as the tags'''

def make_lexicon(token_seqs, min_freq=1):
    ### ENTER YOUR CODE HERE (+- 10 lines)
    # First, count how often each token appears in the text.
    token_counts = {}
    for seq in token_seqs:
        for token in seq:
            if token in token_counts:
                token_counts[token] += 1
            else:
                token_counts[token] = 1
    # Then, keep only the tokens that occur at least min_freq times.
    lexicon = [token for token, count in token_counts.items() if count >= min_freq]
    # Assign each token a numerical index. Indices start at 2: 0 is reserved for padding, and 1 for unknown words.
    lexicon = {token: idx + 2 for idx, token in enumerate(lexicon)}
    lexicon[u'<UNK>'] = 1 # Unknown words are those that occur fewer than min_freq times

    print("LEXICON SAMPLE ({} total items):".format(len(lexicon)))
    print(dict(list(lexicon.items())[:20]))

    return lexicon
    ### END


In [9]:
print(make_lexicon(token_seqs=train_sents['Tokenized_Sentence'], min_freq=1))

LEXICON SAMPLE (812 total items):
{'the': 2, 'fulton': 3, 'county': 4, 'grand': 5, 'jury': 6, 'said': 7, 'friday': 8, 'an': 9, 'investigation': 10, 'of': 11, "atlanta's": 12, 'recent': 13, 'primary': 14, 'election': 15, 'produced': 16, '``': 17, 'no': 18, 'evidence': 19, "''": 20, 'that': 21}


In [10]:
print(make_lexicon(token_seqs=train_sents['Tagged_Sentence'], min_freq=1))

LEXICON SAMPLE (12 total items):
{'DET': 2, 'NOUN': 3, 'ADJ': 4, 'VERB': 5, 'ADP': 6, '.': 7, 'ADV': 8, 'CONJ': 9, 'PRT': 10, 'PRON': 11, 'NUM': 12, '<UNK>': 1}
{'DET': 2, 'NOUN': 3, 'ADJ': 4, 'VERB': 5, 'ADP': 6, '.': 7, 'ADV': 8, 'CONJ': 9, 'PRT': 10, 'PRON': 11, 'NUM': 12, '<UNK>': 1}


Let's make a lexicon for the words and another for the tags. You should get something like this:

````
WORDS: {'the': 2, 'fulton': 3, 'county': 4, 'grand': 5, 'jury': 6, 'said': 7, 'friday': 8, 'an': 9, 'investigation': 10, 'of': 11, "atlanta's": 12, 'recent': 13, ... }
````

and for the tags:

````
TAGS: {'DET': 2, 'NOUN': 3, 'ADJ': 4, 'VERB': 5, 'ADP': 6, '.': 7, 'ADV': 8, 'CONJ': 9, 'PRT': 10, 'PRON': 11, 'NUM': 12, '<UNK>': 1}
````


In [11]:
words_lexicon = make_lexicon(train_sents['Tokenized_Sentence'])
print("WORDS:", words_lexicon)

tags_lexicon = make_lexicon(train_sents['Tagged_Sentence'])
print("TAGS:", tags_lexicon)

LEXICON SAMPLE (812 total items):
{'the': 2, 'fulton': 3, 'county': 4, 'grand': 5, 'jury': 6, 'said': 7, 'friday': 8, 'an': 9, 'investigation': 10, 'of': 11, "atlanta's": 12, 'recent': 13, 'primary': 14, 'election': 15, 'produced': 16, '``': 17, 'no': 18, 'evidence': 19, "''": 20, 'that': 21}
LEXICON SAMPLE (12 total items):
{'DET': 2, 'NOUN': 3, 'ADJ': 4, 'VERB': 5, 'ADP': 6, '.': 7, 'ADV': 8, 'CONJ': 9, 'PRT': 10, 'PRON': 11, 'NUM': 12, '<UNK>': 1}
TAGS: {'DET': 2, 'NOUN': 3, 'ADJ': 4, 'VERB': 5, 'ADP': 6, '.': 7, 'ADV': 8, 'CONJ': 9, 'PRT': 10, 'PRON': 11, 'NUM': 12, '<UNK>': 1}


**Exercise :** Save your lexicons! If you don't remember how to do it, look at the previous lesson.

In [12]:
### ENTER YOUR CODE HERE ( +- 5 lines )
import pickle

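with open('example_model/lexicon.pkl', 'wb') as f: # Save the words lexicon by pickling it
    pickle.dump(words_lexicon, f)

# Use a separate file for the tags so the words lexicon is not overwritten (this file name is an assumption)
with open('example_model/tags_lexicon.pkl', 'wb') as f: # Save the tags lexicon by pickling it
    pickle.dump(tags_lexicon, f)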

### END 

Because the model will output tags as indices, we'll obviously need to map each tag number back to its corresponding string representation in order to later interpret the output. We'll reverse the tags lexicon to create a lookup table to get each tag from its index.

**Exercise :**  Create a get_lexicon_lookup function that returns a dictionary where the string representation of a lexical element can be retrieved from its numerical index. If you don't remember how to do it, look in the previous lesson. 

In [13]:
'''Make a dictionary where the string representation of a lexicon item can be retrieved from its numerical index'''

def get_lexicon_lookup(lexicon):
    ### ENTER YOUR CODE HERE
    lexicon_lookup = {idx: lexicon_item for lexicon_item, idx in lexicon.items()}
    lexicon_lookup[0] = "" #map 0 padding to empty string
    print("LEXICON LOOKUP SAMPLE:")
    print(dict(list(lexicon_lookup.items())[:20]))
    return lexicon_lookup
    ### END

get_lexicon_lookup(words_lexicon)


LEXICON LOOKUP SAMPLE:
{2: 'the', 3: 'fulton', 4: 'county', 5: 'grand', 6: 'jury', 7: 'said', 8: 'friday', 9: 'an', 10: 'investigation', 11: 'of', 12: "atlanta's", 13: 'recent', 14: 'primary', 15: 'election', 16: 'produced', 17: '``', 18: 'no', 19: 'evidence', 20: "''", 21: 'that'}


{2: 'the',
 3: 'fulton',
 4: 'county',
 5: 'grand',
 6: 'jury',
 7: 'said',
 8: 'friday',
 9: 'an',
 10: 'investigation',
 11: 'of',
 12: "atlanta's",
 13: 'recent',
 14: 'primary',
 15: 'election',
 16: 'produced',
 17: '``',
 18: 'no',
 19: 'evidence',
 20: "''",
 21: 'that',
 22: 'any',
 23: 'irregularities',
 24: 'took',
 25: 'place',
 26: '.',
 27: 'further',
 28: 'in',
 29: 'term-end',
 30: 'presentments',
 31: 'city',
 32: 'executive',
 33: 'committee',
 34: ',',
 35: 'which',
 36: 'had',
 37: 'over-all',
 38: 'charge',
 39: 'deserves',
 40: 'praise',
 41: 'and',
 42: 'thanks',
 43: 'atlanta',
 44: 'for',
 45: 'manner',
 46: 'was',
 47: 'conducted',
 48: 'september-october',
 49: 'term',
 50: 'been',
 51: 'charged',
 52: 'by',
 53: 'superior',
 54: 'court',
 55: 'judge',
 56: 'durwood',
 57: 'pye',
 58: 'to',
 59: 'investigate',
 60: 'reports',
 61: 'possible',
 62: 'hard-fought',
 63: 'won',
 64: 'mayor-nominate',
 65: 'ivan',
 66: 'allen',
 67: 'jr.',
 68: 'only',
 69: 'a',
 ...

Let's check that everything is okay. You should get this:
````
{2: 'DET', 3: 'NOUN', 4: 'ADJ', 5: 'VERB', 6: 'ADP', 7: '.', 8: 'ADV', 9: 'CONJ', 10: 'PRT', 11: 'PRON', 12: 'NUM', 1: '<UNK>', 0: ''}
````

In [14]:
tags_lexicon_lookup = get_lexicon_lookup(tags_lexicon)
print(tags_lexicon_lookup)

LEXICON LOOKUP SAMPLE:
{2: 'DET', 3: 'NOUN', 4: 'ADJ', 5: 'VERB', 6: 'ADP', 7: '.', 8: 'ADV', 9: 'CONJ', 10: 'PRT', 11: 'PRON', 12: 'NUM', 1: '<UNK>', 0: ''}
{2: 'DET', 3: 'NOUN', 4: 'ADJ', 5: 'VERB', 6: 'ADP', 7: '.', 8: 'ADV', 9: 'CONJ', 10: 'PRT', 11: 'PRON', 12: 'NUM', 1: '<UNK>', 0: ''}
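
As a quick sanity check of the lookup table, the sketch below decodes a short sequence of tag indices back into strings. The lookup dictionary is copied from the output above; the `predicted_idxs` values are made up for illustration and are not model output.

```python
# Lookup table copied from the printed output above (0 is the padding index).
tags_lookup = {2: 'DET', 3: 'NOUN', 4: 'ADJ', 5: 'VERB', 6: 'ADP', 7: '.', 8: 'ADV',
               9: 'CONJ', 10: 'PRT', 11: 'PRON', 12: 'NUM', 1: '<UNK>', 0: ''}

# Hypothetical index sequence, left-padded with two zeros.
predicted_idxs = [0, 0, 2, 3, 5, 7]

# Skip padding positions and map each remaining index to its tag string.
decoded = [tags_lookup[idx] for idx in predicted_idxs if idx != 0]
print(decoded)
```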


###  From strings to numbers

We use the lexicons to transform the word and tag sequences into lists of numerical indices.  
**Exercise :** Build a tokens_to_idxs() function that transforms token sequences into sequences of numerical indices. If you don't remember how to do it, look at the previous lesson.

In [15]:
 ### ENTER YOUR CODE HERE 

def tokens_to_idxs(token_seqs, lexicon):
    idx_seqs = [[lexicon[token] if token in lexicon else lexicon['<UNK>'] for token in token_seq]  
                                                                     for token_seq in token_seqs]
    return idx_seqs


### END

Let's test this !

In [16]:
train_sents['Sentence_Idxs'] = tokens_to_idxs(train_sents['Tokenized_Sentence'], words_lexicon)
train_sents['Tag_Idxs'] = tokens_to_idxs(train_sents['Tagged_Sentence'], tags_lexicon)
train_sents[['Tokenized_Sentence', 'Sentence_Idxs', 'Tagged_Sentence', 'Tag_Idxs']][:10]

Unnamed: 0,Tokenized_Sentence,Sentence_Idxs,Tagged_Sentence,Tag_Idxs
0,"[the, fulton, county, grand, jury, said, frida...","[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1...","[DET, NOUN, NOUN, ADJ, NOUN, VERB, NOUN, DET, ...","[2, 3, 3, 4, 3, 5, 3, 2, 3, 6, 3, 4, 3, 3, 5, ..."
1,"[the, jury, further, said, in, term-end, prese...","[2, 6, 27, 7, 28, 29, 30, 21, 2, 31, 32, 33, 3...","[DET, NOUN, ADV, VERB, ADP, NOUN, NOUN, ADP, D...","[2, 3, 8, 5, 6, 3, 3, 6, 2, 3, 4, 3, 7, 2, 5, ..."
2,"[the, september-october, term, jury, had, been...","[2, 48, 49, 6, 36, 50, 51, 52, 3, 53, 54, 55, ...","[DET, NOUN, NOUN, NOUN, VERB, VERB, VERB, ADP,...","[2, 3, 3, 3, 5, 5, 5, 6, 3, 4, 3, 3, 3, 3, 10,..."
3,"[``, only, a, relative, handful, of, such, rep...","[17, 68, 69, 70, 71, 11, 72, 60, 46, 73, 20, 3...","[., ADV, DET, ADJ, NOUN, ADP, ADJ, NOUN, VERB,...","[7, 8, 2, 4, 3, 6, 4, 3, 5, 5, 7, 7, 2, 3, 5, ..."
4,"[the, jury, said, it, did, find, that, many, o...","[2, 6, 7, 81, 82, 83, 21, 84, 11, 85, 86, 41, ...","[DET, NOUN, VERB, PRON, VERB, VERB, ADP, ADJ, ...","[2, 3, 5, 11, 5, 5, 6, 4, 6, 3, 3, 9, 3, 3, 7,..."
5,"[it, recommended, that, fulton, legislators, a...","[81, 94, 21, 3, 95, 96, 17, 58, 97, 98, 87, 99...","[PRON, VERB, ADP, NOUN, NOUN, VERB, ., PRT, VE...","[11, 5, 6, 3, 3, 5, 7, 10, 5, 2, 3, 5, 9, 5, 6..."
6,"[the, grand, jury, commented, on, a, number, o...","[2, 5, 6, 105, 106, 69, 77, 11, 107, 108, 34, ...","[DET, ADJ, NOUN, VERB, ADP, DET, NOUN, ADP, AD...","[2, 4, 3, 5, 6, 2, 3, 6, 4, 3, 7, 6, 11, 2, 3,..."
7,"[merger, proposed]","[122, 123]","[NOUN, VERB]","[3, 5]"
8,"[however, ,, the, jury, said, it, believes, ``...","[124, 34, 2, 6, 7, 81, 125, 17, 98, 126, 127, ...","[ADV, ., DET, NOUN, VERB, PRON, VERB, ., DET, ...","[8, 7, 2, 3, 5, 11, 5, 7, 2, 12, 3, 5, 5, 5, 1..."
9,"[the, city, purchasing, department, ,, the, ju...","[2, 31, 110, 137, 34, 2, 6, 7, 34, 17, 138, 13...","[DET, NOUN, VERB, NOUN, ., DET, NOUN, VERB, .,...","[2, 3, 5, 3, 7, 2, 3, 5, 7, 7, 5, 5, 6, 5, 4, ..."


If all goes well, you should get this:  
![capture](../img/capt02.png)

###  Numerical lists to matrices

Finally, we need to put the input sequences into matrices for training. There will be separate matrices for the word and tag sequences, where each row is a sentence and each column is a word (or tag) index in that sentence. This matrix format is necessary for the model to process the sentences in batches as opposed to one at a time, which significantly speeds up training. However, each sentence has a different number of words, so we create a padded matrix whose number of columns equals the length of the longest sentence in the training set. For all sentences with fewer words, we prepend zeros to the row, each representing an empty word (and tag) position. This is why the number 0 was not assigned as an index in the lexicons. We can tell Keras to ignore these zeros during training.
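
The left-padding described above can be sketched in plain Python. The helper `pad_left` is a hypothetical stand-in written for illustration; it mimics what Keras's `pad_sequences` does with its default settings (zeros added at the front, and overlong sequences truncated from the front), but it returns nested lists rather than a NumPy array.

```python
def pad_left(idx_seqs, max_seq_len, pad_value=0):
    """Left-pad (or front-truncate) each sequence to exactly max_seq_len items."""
    padded = []
    for seq in idx_seqs:
        seq = list(seq)[-max_seq_len:]             # keep at most the last max_seq_len items
        n_pad = max_seq_len - len(seq)
        padded.append([pad_value] * n_pad + seq)   # zeros go in front
    return padded

toy_idx_seqs = [[2, 6, 7], [122, 123], [2, 31, 110, 137, 34]]
toy_padded = pad_left(toy_idx_seqs, max_seq_len=4)
for row in toy_padded:
    print(row)
```

In this notebook the padded length is derived from the longest sentence, so no sentence is actually truncated; the third toy sequence is only there to show what would happen otherwise.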

**Exercise :** Create a pad_idx_seqs function that returns a padded matrix of sequences. The function takes two parameters: the index sequences, and the length of the longest sequence.

In [17]:
from keras.preprocessing.sequence import pad_sequences
### ENTER YOUR CODE HERE (+- 3 lines)
'''Create a padded matrix of sentences'''
def pad_idx_seqs(idx_seqs, max_seq_len):
    # Keras provides a convenient padding function; by default it pads with zeros at the front
    padded_idxs = pad_sequences(sequences=idx_seqs, maxlen=max_seq_len)
    return padded_idxs
### END

**Exercise :** Create a max_seq_len variable that contains the number of words in the longest sentence. You should get 59.

In [18]:
###Enter Your code here (1 line)
max_seq_len = max([len(idx_seq) for idx_seq in train_sents['Sentence_Idxs']]) # Get length of longest sequence

### End 
print(max_seq_len)

59


Let's create the sequences !

In [19]:
train_padded_words = pad_idx_seqs(train_sents['Sentence_Idxs'], 
                                  max_seq_len + 1) #Add one to max length for offsetting sequence by 1
train_padded_tags = pad_idx_seqs(train_sents['Tag_Idxs'],
                                 max_seq_len + 1)  #Add one to max length for offsetting sequence by 1

print("WORDS:\n", train_padded_words)
print("SHAPE:", train_padded_words.shape, "\n")

print("TAGS:\n", train_padded_tags)
print("SHAPE:", train_padded_tags.shape, "\n")

WORDS:
 [[  0   0   0 ...  24  25  26]
 [  0   0   0 ...  46  47  26]
 [  0   0   0 ...  66  67  26]
 ...
 [  0   0   0 ... 758  20  26]
 [  0   0   0 ... 802  34 447]
 [  0   0   0 ... 447 812  26]]
SHAPE: (100, 60) 

TAGS:
 [[0 0 0 ... 5 3 7]
 [0 0 0 ... 5 5 7]
 [0 0 0 ... 3 3 7]
 ...
 [0 0 0 ... 3 7 7]
 [0 0 0 ... 3 7 3]
 [0 0 0 ... 3 3 7]]
SHAPE: (100, 60) 



### Defining the input and output

In this approach, for each word in a sentence, we predict the tag for that word based on two types of input: 

1. all the words in the sentence up to that point, including that current word, and 
2. all the previous tags in the sentence. 

So for a given position *idx* in a sentence, the input is the word at that position (train_padded_words[:, idx]) together with the tag at the previous position (train_padded_tags[:, idx-1]), and the output is the tag at that position (train_padded_tags[:, idx]). The example below shows this alignment for the first sentence in the dataset.


In [20]:
import numpy
pd.DataFrame(list(zip(train_sents['Tokenized_Sentence'].loc[0],
                          ["-"] + train_sents['Tagged_Sentence'].loc[0],
                          train_sents['Tagged_Sentence'].loc[0])),
                 columns=['Input Word', 'Input Tag', 'Output Tag'])


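| | Input Word | Input Tag | Output Tag |
|---|---|---|---|
| 0 | the | - | DET |
| 1 | fulton | DET | NOUN |
| 2 | county | NOUN | NOUN |
| 3 | grand | NOUN | ADJ |
| 4 | jury | ADJ | NOUN |
| 5 | said | NOUN | VERB |
| 6 | friday | VERB | NOUN |
| 7 | an | NOUN | DET |
| 8 | investigation | DET | NOUN |
| 9 | of | NOUN | ADP |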


This same alignment is shown below for a sentence in the padded matrices. Because of the offsetting in the alignment, the length of the padded matrices will be reduced by one. 

In [21]:
print(pd.DataFrame(list(zip(train_padded_words[0,1:], train_padded_tags[0,:-1], train_padded_tags[0, 1:])),
                columns=['Input Words', 'Input Tags', 'Output Tags']))

    Input Words  Input Tags  Output Tags
0             0           0            0
1             0           0            0
2             0           0            0
3             0           0            0
4             0           0            0
5             0           0            0
6             0           0            0
7             0           0            0
8             0           0            0
9             0           0            0
10            0           0            0
11            0           0            0
12            0           0            0
13            0           0            0
14            0           0            0
15            0           0            0
16            0           0            0
17            0           0            0
18            0           0            0
19            0           0            0
20            0           0            0
21            0           0            0
22            0           0            0
...
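
The same slicing can be checked on a tiny hand-made example. The two rows below are hypothetical padded sequences (not taken from the corpus), stored as plain lists so that the snippet runs without NumPy; `[1:]` and `[:-1]` play the role of `[0, 1:]` and `[0, :-1]` in the cell above.

```python
# One padded sentence of length 5 (longest sentence + 1), left-padded with zeros.
padded_words = [0, 0, 2, 6, 7]   # e.g. indices for "the jury said"
padded_tags = [0, 0, 2, 3, 5]    # e.g. indices for DET NOUN VERB

input_words = padded_words[1:]   # word at each position
input_tags = padded_tags[:-1]    # tag of the previous position
output_tags = padded_tags[1:]    # tag to predict at each position

aligned = list(zip(input_words, input_tags, output_tags))
for word, prev_tag, tag in aligned:
    print(word, prev_tag, tag)
```

Note that the first real word is paired with a previous tag of 0, which corresponds to the "-" placeholder in the unpadded alignment shown earlier.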

##  Building the model

### Functional API

To set up the model, we'll use Keras [Functional API](https://keras.io/getting-started/functional-api-guide/), which is one of two ways to assemble models in Keras (the alternative is the [Sequential API](https://keras.io/getting-started/sequential-model-guide/), which is a bit simpler but has more constraints). For the POS tagger model, new tags will be predicted from the combination of two input sequences, the words in the sentence and the corresponding tags in the sentence. The Functional API is specifically useful when a model has multiple inputs and/or outputs. A model consists of a series of layers. As shown in the code below, we initialize instances for each layer. Each layer can be called with another layer as input, e.g. Embedding()(input_layer). A model instance is initialized with the Model() object, which defines the initial input and final output layers for that model. Before the model can be trained, the compile() function must be called with the loss function and optimization algorithm specified (see below).

### Layers

We'll build an RNN with the following layers, numbered according to the level on which they are stacked:

**1. Input (words)**: This input layer takes in a sequence of word indices.

**1. Input (tags)**: This is the other input layer alongside the first, and it takes in a sequence of tag indices. It is on the same level as the word input layer, so both input sequences are read in parallel by the model.

**2. Embedding (words)**: There are two embedding layers, one for the words and a different one for the tags. Both of them function the same way: they convert the indices into distributed vector representations (embeddings). The mask_zero=True parameter indicates that values of 0 in the matrix (the padding) will be ignored by the model.

**2. Embedding (tags)**: Same as the word embedding layer, but for the tags.

**3. Concatenate**: This layer merges each embedded word sequence and corresponding embedded tag sequence into a single sequence. This means that for a given word and the tag for that word, their vectors will be concatenated into a single vector.

**4. GRU**: The recurrent (GRU) hidden layer reads the merged embedded sequence and computes a representation (hidden state) of the sequence. The result is a new vector for each word/tag in the sequence. There are a few architectures for this layer. I use the GRU variation; Keras also provides LSTM and the simple vanilla recurrent layer. By specifying return_sequences=True in the function below, this layer will output the entire sequence of vectors (hidden states) for the sequence, rather than just the most recent hidden state that is returned by default.

**5. (Time Distributed) Dense**: An output layer that produces a probability distribution for each possible tag for each word in the sequence. The 'softmax' activation is what transforms the values of this layer into scores from 0 to 1 that can be treated as probabilities. The Dense layer produces the probability scores for one particular timepoint (word). By wrapping this in a TimeDistributed() layer, the model outputs a probability distribution for every timepoint in the sequence. 

The term "layer" is just an abstraction, when really all these layers are just matrices. Each layer is connected to the layer above it via a set of weights (also matrices), which are the parameters that are adjusted during training in order for the model to learn to predict tags. The process of training a neural network is a series of matrix multiplications. 
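
To see what the softmax activation does, here is a minimal sketch in plain Python. The function `softmax` and the example scores are illustrative assumptions; the output layer applies the same kind of transformation to the Dense layer's values at every timepoint.

```python
import math

def softmax(scores):
    """Turn arbitrary real-valued scores into probabilities that sum to 1."""
    shifted = [s - max(scores) for s in scores]   # subtract the max for numerical stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

raw_scores = [2.0, 1.0, 0.1]        # hypothetical scores for three tags
probs = softmax(raw_scores)
print([round(p, 3) for p in probs])
print(sum(probs))
```

The highest raw score receives the highest probability, and the predicted tag for a word is simply the index with the largest probability.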

### Parameters

Our function for creating the model takes the following parameters:

**seq_input_len**: the length of the padded matrices for the word and tag sentence inputs, which will be the same since there is a one-to-one mapping between words and tags. This is equal to the length of the longest sentence in the training data.

**n_word_input_nodes**: the number of unique words in the lexicon, plus one to account for matrix padding represented by 0 values. This indicates the number of rows in the word embedding layer, where each row corresponds to a word.

**n_tag_input_nodes**: the number of unique tags in the dataset, plus one to account for padding. This indicates the number of rows in the tag embedding layer, where each row corresponds to a tag.

**n_word_embedding_nodes**: the number of dimensions in the word embedding layer, which can be freely defined. Here, it is set to 300.

**n_tag_embedding_nodes**: the number of dimensions in the tag embedding layer, which can be freely defined. Here, it is set to 100.

**n_hidden_nodes**: the number of dimensions in the hidden layer. Like the embedding layers, this can be freely chosen. Here, it is set to 500.

**stateful**: By default, the GRU hidden layer will reset its state (i.e. its values will be 0s) each time a new set of sequences is read into the model.  However, when stateful=True is given, this parameter indicates that the GRU hidden layer should "remember" its state until it is explicitly told to forget it. In other words, the values in this layer will be carried over between separate calls to the training function. This is useful when processing long sequences, so that the model can iterate through chunks of the sequences rather than loading the entire matrix at the same time, which is memory-intensive. I'll show below how this setting is also useful when tagging new sequences. Here, because the training sequences only consist of one sentence, stateful will be set to False during training. At prediction time, it will be set to True.

**batch_size**: It is not always necessary to specify the batch size when setting up a Keras model. The fit() function will apply batch processing by default and the batch size can be given as a parameter. However, when a model is stateful, the batch size does need to be specified in the Input() layers. Here, for training, batch_size=None, which leaves the batch dimension of the Input() layers unspecified; the batch size is then whatever is passed to fit() (Keras falls back to its default of 32 if none is given). During prediction, the batch size will be set to 1.
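
As a rough check on these parameters, the sketch below traces the tensor shape after each layer using the values quoted in this notebook (lexicon sizes of 812 and 12, a longest sentence of 59 words, and the batch of 20 sentences used for training). It is plain arithmetic rather than Keras code, and the variable names simply mirror the parameter names above.

```python
batch_size = 20                 # batch size used for training in this notebook
seq_input_len = 59              # longest sentence in the sample
n_word_input_nodes = 812 + 1    # word lexicon size plus the padding index
n_tag_input_nodes = 12 + 1      # tag lexicon size plus the padding index
n_word_embedding_nodes = 300
n_tag_embedding_nodes = 100
n_hidden_nodes = 500

shapes = {
    'word_input': (batch_size, seq_input_len),
    'tag_input': (batch_size, seq_input_len),
    'word_embeddings': (batch_size, seq_input_len, n_word_embedding_nodes),
    'tag_embeddings': (batch_size, seq_input_len, n_tag_embedding_nodes),
    'concatenated': (batch_size, seq_input_len, n_word_embedding_nodes + n_tag_embedding_nodes),
    'gru_output': (batch_size, seq_input_len, n_hidden_nodes),
    'tag_probabilities': (batch_size, seq_input_len, n_tag_input_nodes),
}
for name, shape in shapes.items():
    print(name, shape)

# Each embedding table has one row per index and one column per embedding dimension.
word_embedding_weights = n_word_input_nodes * n_word_embedding_nodes
tag_embedding_weights = n_tag_input_nodes * n_tag_embedding_nodes
print(word_embedding_weights, tag_embedding_weights)
```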

### Procedure

The output of the model is a sequence of vectors, each with the same number of dimensions as the number of unique tags (n_tag_input_nodes). Each vector contains the predicted probability of each possible tag for the corresponding word in that position in the sequence. Like all neural networks, RNNs learn by updating the parameters (weights) to optimize an objective (loss) function applied to the output. For this model, the objective is to minimize the cross entropy (named "sparse_categorical_crossentropy" in the code) between the predicted tag probabilities and the probabilities observed from the tags in training data, resulting in probabilities that more accurately predict when a particular tag will appear. This is the general procedure used for all multi-class classification tasks. Updates to the weights of the model are performed using an optimization algorithm, such as Adam used here. The details of this process are extensive; see the resources at the bottom of the notebook if you want a deeper understanding. One huge benefit of Keras is that it implements many of these details for you. Not only does it already have implementations of the types of layer architectures, it also has many of the [loss functions](https://keras.io/losses/) and [optimization methods](https://keras.io/optimizers/) you need for training various models. The specific loss function and optimization method you use are specified when compiling the model with the model.compile() function.
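
The loss for a single word can be worked out by hand. The sketch below assumes a made-up probability vector over four tag indices; `sparse_cross_entropy` is a hypothetical helper that computes the per-position quantity this loss is built from, namely the negative log of the probability assigned to the correct tag. The value Keras reports is an aggregate of such terms over the positions in a batch.

```python
import math

def sparse_cross_entropy(probs, true_idx):
    """Negative log-probability of the correct class, given its integer index."""
    return -math.log(probs[true_idx])

predicted_probs = [0.05, 0.05, 0.7, 0.2]   # hypothetical model output for one word

confident_loss = sparse_cross_entropy(predicted_probs, true_idx=2)   # correct tag was judged likely
uncertain_loss = sparse_cross_entropy(predicted_probs, true_idx=3)   # correct tag was judged unlikely
print(round(confident_loss, 3), round(uncertain_loss, 3))
```

The loss is small when the model puts most of its probability on the correct tag and grows as that probability shrinks, which is why minimizing it pushes the model toward better predictions. The label is given as an integer index rather than a one-hot vector, which is what "sparse" refers to.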


**Exercise :** Create a model that respects this architecture. Apart from concatenate, you know these layers!
1. Input (words)
1. Input (tags)
2. Embedding (words)
2. Embedding (tags)
3. Concatenate ([words, tags])
4. GRU: The recurrent (GRU) hidden layer 
5. (Time Distributed) Dense: An output layer.

In [22]:
'''Create the model'''

from keras.models import Model
# Embedding and GRU are available directly from keras.layers
from keras.layers import Input, Concatenate, TimeDistributed, Dense, Embedding, GRU

def create_model(seq_input_len, n_word_input_nodes, n_tag_input_nodes, n_word_embedding_nodes,
                 n_tag_embedding_nodes, n_hidden_nodes, stateful=False, batch_size=None):
    
    #Layers 1
    word_input = Input(batch_shape=(batch_size, seq_input_len), name='word_input_layer')
    tag_input = Input(batch_shape=(batch_size, seq_input_len), name='tag_input_layer')

    #Layers 2
    word_embeddings = Embedding(input_dim=n_word_input_nodes,
                                output_dim=n_word_embedding_nodes, 
                                mask_zero=True, name='word_embedding_layer')(word_input) #mask_zero will ignore 0 padding
    #Output shape = (batch_size, seq_input_len, n_word_embedding_nodes)
    tag_embeddings = Embedding(input_dim=n_tag_input_nodes,
                               output_dim=n_tag_embedding_nodes,
                               mask_zero=True, name='tag_embedding_layer')(tag_input) 
    #Output shape = (batch_size, seq_input_len, n_tag_embedding_nodes)
    
    #Layer 3
    merged_embeddings = Concatenate(axis=-1, name='concat_embedding_layer')([word_embeddings, tag_embeddings])
    #Output shape =  (batch_size, seq_input_len, n_word_embedding_nodes + n_tag_embedding_nodes)
    
    #Layer 4
    hidden_layer = GRU(units=n_hidden_nodes, return_sequences=True, 
                       stateful=stateful, name='hidden_layer')(merged_embeddings)
    #Output shape = (batch_size, seq_input_len, n_hidden_nodes)
    
    #Layer 5
    output_layer = TimeDistributed(Dense(units=n_tag_input_nodes, 
                                         activation='softmax'), name='output_layer')(hidden_layer)
    # Output shape = (batch_size, seq_input_len, n_tag_input_nodes)
    
    #Specify which layers are input and output, compile model with loss and optimization functions
    model = Model(inputs=[word_input, tag_input], outputs=output_layer)
    model.compile(loss="sparse_categorical_crossentropy",
                  optimizer='adam')
    
    return model

In [23]:
model = create_model(seq_input_len=train_padded_words.shape[-1] - 1, #subtract 1 from matrix length because of offset
                     n_word_input_nodes=len(words_lexicon) + 1, #Add one for 0 padding
                     n_tag_input_nodes=len(tags_lexicon) + 1, #Add one for 0 padding
                     n_word_embedding_nodes=300,
                     n_tag_embedding_nodes=100,
                     n_hidden_nodes=500)


### Training

Now we're ready to train the model. We'll call the fit() function to train the model for 5 iterations through the dataset (epochs), using a batch size of 20 sentences. Keras reports the cross-entropy loss after each epoch, which should continue to decrease if the model is learning correctly.
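The fit() call in the next cell feeds the model the current word together with the previous tag, and asks it to predict the current tag. Here is a minimal NumPy sketch of that slicing. The arrays `toy_words` and `toy_tags` are made-up stand-ins for `train_padded_words` and `train_padded_tags`, and I'm assuming (as the offset implies) that every row starts with at least one padding 0.

```python
import numpy as np

# Hypothetical padded batch of 2 sentences; 0 is the padding index
toy_words = np.array([[0, 5, 8, 3],
                      [0, 0, 7, 2]])
toy_tags = np.array([[0, 2, 4, 1],
                     [0, 0, 3, 1]])

word_inputs = toy_words[:, 1:]       # current word at each timestep
prev_tag_inputs = toy_tags[:, :-1]   # tag of the previous position (0 at the start)
targets = toy_tags[:, 1:, None]      # tag of the current word, with an extra last axis

print(word_inputs.shape, prev_tag_inputs.shape, targets.shape)
print(prev_tag_inputs[0], targets[0, :, 0])
```

Dropping the first column of the words and the last column of the input tags keeps the three arrays the same length while shifting the tag input one step behind the target.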

In [24]:
'''Train the model'''

# output matrix (y) has extra 3rd dimension added because sparse cross-entropy function requires one label per row
model.fit(x=[train_padded_words[:,1:], train_padded_tags[:,:-1]], 
          y=train_padded_tags[:, 1:, None], 
          batch_size=20, epochs=5)
model.save_weights('example_model/model_weights.h5') #Save model






Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


## Tagging new sentences

Now that the model is trained, it can be used to predict tags for new sentences in the test set. Unlike training, where we processed multiple sentences at the same time, it will be more straightforward to demonstrate tagging on a single sentence at a time. In Keras, you can duplicate a model by loading the parameters from a saved model into a new model. Here, this new model will have a batch size of 1. It will also process a sentence one word/tag at a time (seq_input_len=1) and predict the next tag, using the stateful=True parameter to carry its hidden state forward from one call to the next within that sentence. The other parameters of this prediction model are exactly the same as those of the trained model, which is why the weights can be readily transferred. To demonstrate prediction performance, I'll load the weights from a saved model previously trained on the full training set of 51606 sentences (as opposed to 100 sentences in the example dataset used above). I'll apply the model to an example test set of 100 sentences that were not observed during training.
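To see why a stateful model with seq_input_len=1 can reuse weights trained on whole sequences, here is a rough NumPy sketch. It uses a plain tanh RNN rather than a GRU, and all the weights and sizes are invented for illustration. It only shows that stepping through a sequence one item at a time while carrying the hidden state gives the same states as processing the sequence in one pass.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, seq_len = 4, 3, 5
W_x = rng.normal(size=(n_in, n_hidden))      # input-to-hidden weights (made up)
W_h = rng.normal(size=(n_hidden, n_hidden))  # hidden-to-hidden weights (made up)
inputs = rng.normal(size=(seq_len, n_in))    # one toy "sentence"

def run_rnn(xs, h):
    """Run a plain tanh RNN over xs starting from state h; return all states and the last one."""
    states = []
    for x in xs:
        h = np.tanh(x @ W_x + h @ W_h)
        states.append(h)
    return np.array(states), h

# Whole sequence in a single call (analogous to the training model)
full_states, _ = run_rnn(inputs, np.zeros(n_hidden))

# One timestep per call, carrying the state forward (analogous to stateful=True)
h = np.zeros(n_hidden)
step_states = []
for x in inputs:
    out, h = run_rnn(x[None, :], h)
    step_states.append(out[0])
step_states = np.array(step_states)

print(np.allclose(full_states, step_states))
```

Resetting `h` to zeros before a new sequence plays the role that reset_states() plays for the Keras model.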


In [25]:
'''Create predictor model with weights from saved model, with batch_size = 1, seq_input_len = 1 and stateful=True'''

# Load word and tag lexicons from the saved model 
with open('pretrained_model/words_lexicon.pkl', 'rb') as f:
    words_lexicon = pickle.load(f)

with open('pretrained_model/tags_lexicon.pkl', 'rb') as f:
    tags_lexicon = pickle.load(f)
tags_lexicon_lookup = get_lexicon_lookup(tags_lexicon)

predictor_model = create_model(seq_input_len=1,
                               n_word_input_nodes=len(words_lexicon) + 1,
                               n_tag_input_nodes=len(tags_lexicon) + 1,
                               n_word_embedding_nodes=300,
                               n_tag_embedding_nodes=100,
                               n_hidden_nodes=500,
                               stateful=True,
                               batch_size=1)

#Transfer the weights from the trained model
predictor_model.load_weights('pretrained_model/model_weights.h5')

LEXICON LOOKUP SAMPLE:
{2: 'ADV', 3: 'NOUN', 11: 'NUM', 4: 'ADP', 8: 'PRT', 6: 'DET', 7: '.', 5: 'PRON', 9: 'VERB', 10: 'X', 12: 'CONJ', 13: 'ADJ', 1: '<UNK>', 0: ''}


**Exercise:** Load the test set from ``dataset/example_test_brown_corpus.csv`` and apply the same processing steps performed above for the training set.

In [27]:
'''Load the test set and apply same processing steps performed above for training set'''
test_sents = pd.read_csv('dataset/example_test_brown_corpus.csv')
test_sents['Tokenized_Sentence'] = test_sents['Tokenized_Sentence'].apply(lambda sent: sent.lower().split("\t"))
test_sents['Tagged_Sentence'] = test_sents['Tagged_Sentence'].apply(lambda sent: sent.split("\t"))
test_sents['Sentence_Idxs'] = tokens_to_idxs(test_sents['Tokenized_Sentence'], words_lexicon)
test_sents['Tag_Idxs'] = tokens_to_idxs(test_sents['Tagged_Sentence'], tags_lexicon)


We'll iterate through the sentences in the test set and tag each of them. For each sentence, we start with an empty list for the predicted tags. For the first word in the sentence, there is no previous tag, so the model reads that word and the empty tag 0 (the padding value). The predict() function returns a probability distribution over the tags, and we pick the tag with the highest probability as the one to assign that word. This tag is appended to our list of predicted tags, and we continue to the next word in the sentence. Because the model is stateful, we can simply provide the current word and most recent tag as input to the predict() function, since its hidden layer has memorized the sequence of words/tags observed so far. After the entire sentence has been tagged, we call reset_states() to clear the values for this sentence so we can process a new sentence. The tag indices are mapped back to their string forms, which I show in the sample below, alongside the correct (gold) tags for comparison.
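The loop described above is a greedy decoder: at each word it commits to the single most probable tag and feeds that choice back in as the next "previous tag". As a rough sketch of just that control flow, here is a version with a made-up stand-in for predict(). `toy_predict` and its scoring rule are hypothetical and carry no hidden state, whereas the real model also relies on its stateful hidden layer.

```python
import numpy as np

n_tags = 4  # hypothetical tag inventory size; index 0 is reserved for padding

def toy_predict(cur_word, prev_tag):
    """Stand-in for predictor_model.predict(): returns a probability distribution over tags."""
    scores = np.zeros(n_tags)
    scores[1 + (cur_word + prev_tag) % (n_tags - 1)] = 1.0  # arbitrary rule, never favors padding
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

def greedy_tag(sent_idxs):
    """Tag a sentence left to right, always keeping the most probable tag."""
    prev_tag = 0  # no previous tag before the first word, so start with padding
    pred = []
    for cur_word in sent_idxs:
        p_next_tag = toy_predict(cur_word, prev_tag)
        prev_tag = int(np.argmax(p_next_tag))
        pred.append(prev_tag)
    return pred

print(greedy_tag([7, 2, 9]))
```

Greedy decoding is simple and fast, but an early mistake is fed forward to later predictions, since the model never revisits a tag once it has been chosen.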


In [28]:
'''Predict tags for sentences in test set'''

import numpy

pred_tags = []
for _, sent in test_sents.iterrows():
    tok_sent = sent['Tokenized_Sentence']
    sent_idxs = sent['Sentence_Idxs']
    sent_gold_tags = sent['Tagged_Sentence']
    sent_pred_tags = []
    prev_tag = 0  #initialize predicted tag sequence with padding
    for cur_word in sent_idxs:
        # cur_word and prev_tag are just integers, but the model expects an input array
        # with the shape (batch_size, seq_input_len), so prepend two dimensions to these values
        p_next_tag = predictor_model.predict(x=[numpy.array(cur_word)[None, None],
                                                numpy.array(prev_tag)[None, None]])[0]
        prev_tag = numpy.argmax(p_next_tag, axis=-1)[0]
        sent_pred_tags.append(prev_tag)
    predictor_model.reset_states()

    #Map tags back to string labels
    sent_pred_tags = [tags_lexicon_lookup[tag] for tag in sent_pred_tags]
    pred_tags.append(sent_pred_tags) #collect the predicted tags for this sentence

test_sents['Predicted_Tagged_Sentence'] = pred_tags

#print sample
for _, sent in test_sents[:10].iterrows():
    print("SENTENCE:\t{}".format("\t".join(sent['Tokenized_Sentence'])), "\n\n")
    print("PREDICTED:\t{}".format("\t".join(sent['Predicted_Tagged_Sentence'])), "\n\n")
    print("GOLD:\t\t{}".format("\t".join(sent['Tagged_Sentence'])), "\n\n")

    

SENTENCE:	he	was	about	50	years	old	. 


PREDICTED:	NOUN	VERB	ADV	NUM	NOUN	ADJ	. 


GOLD:		PRON	VERB	ADV	NUM	NOUN	ADJ	. 


SENTENCE:	``	another	young	man	,	my	dear	?	? 


PREDICTED:	NOUN	DET	ADJ	NOUN	.	DET	NOUN	.	. 


GOLD:		.	DET	ADJ	NOUN	.	DET	NOUN	.	. 


SENTENCE:	really	,	you	are	most	indiscreet	to	drive	him	here	yourself	''	,	he	said	,	frowning	with	displeasure	. 


PREDICTED:	NOUN	.	PRON	VERB	ADV	ADJ	ADP	VERB	PRON	ADV	PRON	.	.	PRON	VERB	.	VERB	ADP	NOUN	. 


GOLD:		ADV	.	PRON	VERB	ADV	ADJ	PRT	VERB	PRON	ADV	PRON	.	.	PRON	VERB	.	VERB	ADP	NOUN	. 


SENTENCE:	delphine	presented	her	cheek	for	a	kiss	,	and	the	physician	pecked	it	like	a	timid	rooster	. 


PREDICTED:	NOUN	VERB	PRON	NOUN	ADP	DET	NOUN	.	CONJ	DET	NOUN	VERB	PRON	ADP	DET	ADJ	NOUN	. 


GOLD:		NOUN	VERB	DET	NOUN	ADP	DET	NOUN	.	CONJ	DET	NOUN	VERB	PRON	ADP	DET	ADJ	NOUN	. 


SENTENCE:	``	dandy	is	to	be	our	house	guest	,	louis	. 


PREDICTED:	NOUN	NOUN	VERB	PRT	VERB	DET	NOUN	NOUN	.	NOUN	. 


GOLD:		.	NOUN	VERB	PRT	VERB	DET	NOUN	NOU

### Evaluation

We can evaluate our model with some of the standard metrics for classification: *precision*, *recall*, and *F1 score*. In the context of this task, precision is the proportion of the predicted tags for a particular class that were correct predictions (i.e. of all the words that were assigned a NOUN tag by the tagger, what percentage of these were actually nouns according to the test set?). Recall is the proportion of correct tags for a particular class that the tagger also predicted correctly (i.e. of all the words in the test set that should have been assigned a NOUN tag, what percentage of these were actually tagged as a NOUN?). F1 score is the harmonic mean of precision and recall. Because there are several tag classes, the code below uses average='weighted', which averages the per-class scores with weights proportional to how often each class appears in the gold tags. The scikit-learn package has several of these [evaluation metrics](http://scikit-learn.org/stable/modules/classes.html#sklearn-metrics-metrics) available.
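To make these definitions concrete, here is a small hand-rolled sketch that computes precision, recall, and F1 for a single tag class. The `gold` and `pred` lists are invented for illustration, and `prf_for_class` is a hypothetical helper, not part of scikit-learn.

```python
gold = ["NOUN", "VERB", "NOUN", "DET", "NOUN", "VERB"]
pred = ["NOUN", "NOUN", "NOUN", "DET", "NOUN", "VERB"]

def prf_for_class(gold, pred, label):
    """Return (precision, recall, F1) for one tag label."""
    true_pos = sum(g == label and p == label for g, p in zip(gold, pred))
    n_pred = sum(p == label for p in pred)   # how often the tagger predicted this label
    n_gold = sum(g == label for g in gold)   # how often this label is actually correct
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

for label in ["NOUN", "VERB", "DET"]:
    p, r, f = prf_for_class(gold, pred, label)
    print("{}: precision={:.3f} recall={:.3f} f1={:.3f}".format(label, p, r, f))
```

In this toy data the tagger over-predicts NOUN, so NOUN has perfect recall but lower precision, while VERB shows the opposite pattern.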

In [29]:
'''Evaluate the model by accuracy, precision, recall, and F1'''

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

if __name__ == '__main__':
    all_gold_tags = [tag for sent_tags in test_sents['Tagged_Sentence'] for tag in sent_tags]
    all_pred_tags = [tag for sent_tags in test_sents['Predicted_Tagged_Sentence'] for tag in sent_tags]
    accuracy = accuracy_score(y_true=all_gold_tags, y_pred=all_pred_tags)
    precision = precision_score(y_true=all_gold_tags, y_pred=all_pred_tags, average='weighted')
    recall = recall_score(y_true=all_gold_tags, y_pred=all_pred_tags, average='weighted')
    f1 = f1_score(y_true=all_gold_tags, y_pred=all_pred_tags, average='weighted')

    print("ACCURACY: {:.3f}".format(accuracy))
    print("PRECISION: {:.3f}".format(precision))
    print("RECALL: {:.3f}".format(recall))
    print("F1: {:.3f}".format(f1))


ACCURACY: 0.898
PRECISION: 0.908
RECALL: 0.898
F1: 0.899


### Visualizing data inside the model

To help visualize the data representation inside the model, we can look at the output of each layer individually. Keras' Functional API lets you derive a new model with the layers from an existing model, so you can define the output to be a layer below the output layer in the original model. Calling predict() on this new model will produce the output of that layer for a given input. Of course, glancing at the numbers by themselves doesn't provide any interpretation of what the model has learned (although there are opportunities to [interpret these values](https://www.civisanalytics.com/blog/interpreting-visualizing-neural-networks-text-processing/)), but seeing them verifies the model is just a series of transformations from one matrix to another. The get_layer() function lets you retrieve any layer by the name that was assigned to it when creating the model. Below is an example of the output for the tag embedding layer for the first word in the first sentence of the test set. You can do this same thing to view any layer.
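As a point of reference for what an embedding layer's output looks like, the sketch below mimics the lookup in plain NumPy: an embedding layer can be thought of as selecting rows from a weight matrix by index. The matrix here is random and purely illustrative (not the trained weights); the sizes 14 and 100 are taken from the shapes printed in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tags, n_tag_embedding_nodes = 14, 100
# Hypothetical embedding weights: one row per tag index
embedding_matrix = rng.uniform(-0.05, 0.05, size=(n_tags, n_tag_embedding_nodes))

tag_input = np.array(0)[None, None]          # shape (batch_size=1, seq_input_len=1)
tag_embedding = embedding_matrix[tag_input]  # row lookup by index

print("INPUT SHAPE:", tag_input.shape)
print("EMBEDDING SHAPE:", tag_embedding.shape)
```

The two prepended dimensions on the input carry through to the output, which is why a single tag index yields a three-dimensional array.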

In [30]:
'''Show the output of the tag embedding layer for the first word in the first sentence'''

tag_embedding_layer = Model(inputs=[predictor_model.get_layer('word_input_layer').input,
                                    predictor_model.get_layer('tag_input_layer').input], 
                            outputs=predictor_model.get_layer('tag_embedding_layer').output)
#Show tag embedding used to predict first tag in sequence (word input is first word, tag input is 0)
tag_embedding_output = tag_embedding_layer.predict([numpy.array(test_sents['Sentence_Idxs'][0][0])[None,None], 
                                                    numpy.array(0)[None,None]])
print("TAG EMBEDDINGS OUTPUT SHAPE:", tag_embedding_output.shape)
print(tag_embedding_output[0])

TAG EMBEDDINGS OUTPUT SHAPE: (1, 1, 100)
[[ 0.04030772 -0.01521903 -0.01122327  0.04559968 -0.04060402 -0.02125445
  -0.04569159  0.04556891  0.03382897  0.04436642 -0.02249571  0.01446874
   0.03785415  0.02063607  0.0112173  -0.04073405 -0.00684955  0.03533303
   0.0119527   0.01346546  0.00389823  0.04908235 -0.04945602 -0.04106168
   0.01780717  0.03226806 -0.00636449  0.01396188  0.01970855  0.00895597
  -0.01151429  0.01169025  0.03164865  0.04846541 -0.02814269  0.04745128
   0.00717139 -0.00496722  0.04698488  0.00407878 -0.00893237 -0.01000279
   0.04122505 -0.02401526  0.02011186 -0.0495501  -0.01492489  0.04072745
   0.01137014 -0.0051544   0.00337183  0.01313441 -0.03503003  0.04318009
   0.02802358 -0.03258054 -0.01480098 -0.01453636  0.01580858 -0.03383125
  -0.01462271 -0.03466325 -0.0376377   0.04488823  0.032319   -0.02157353
  -0.0352294   0.04680395 -0.02944789 -0.04404376  0.02607415 -0.04325482
  -0.00075638 -0.01290952  0.03074952 -0.00708887  0.04597093 -0.010952

It is also easy to look at the weight matrices that connect the layers. The get_weights() function returns the weight arrays of a particular layer; for the output layer, the first array holds the incoming weights from the hidden layer.
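To give a sense of how such a weight matrix is used, here is a rough NumPy sketch of a softmax output layer applied to one hidden state vector. All values are random stand-ins; the sizes 500 and 14 match the hidden layer and tag inventory in this notebook, and treating the second array from get_weights() as a bias vector is my assumption about the default Dense layer.

```python
import numpy as np

rng = np.random.default_rng(0)
n_hidden_nodes, n_tags = 500, 14
hidden_state = rng.normal(size=(n_hidden_nodes,))               # stand-in for one GRU output vector
kernel = rng.normal(scale=0.1, size=(n_hidden_nodes, n_tags))   # stand-in for get_weights()[0]
bias = np.zeros(n_tags)                                         # stand-in for get_weights()[1] (assumed)

logits = hidden_state @ kernel + bias   # one score per tag
probs = np.exp(logits - logits.max())   # subtract the max for numerical stability
probs /= probs.sum()                    # normalize into a probability distribution

print(probs.shape, probs.sum())
```

Each column of the weight matrix corresponds to one tag, so the matrix product yields one score per tag before the softmax turns the scores into probabilities.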

In [31]:
'''Show weights that connect hidden layer to output layer'''

hidden_to_output_weights = predictor_model.get_layer('output_layer').get_weights()[0]
print("HIDDEN-TO_OUTPUT WEIGHTS SHAPE:", hidden_to_output_weights.shape)
print(hidden_to_output_weights)

HIDDEN-TO_OUTPUT WEIGHTS SHAPE: (500, 14)
[[ 0.10821158 -0.03787097  0.15339486 ...  0.13131066  0.16468468
  -0.03933048]
 [-0.02245991  0.00259483 -0.08687688 ...  0.35621825 -0.20203626
  -0.26063833]
 [ 0.12871414  0.00333645 -0.08383496 ...  0.10571454  0.08997553
  -0.25365505]
 ...
 [ 0.04896822  0.11347028 -0.07975759 ... -0.0096857   0.15499653
  -0.1007444 ]
 [-0.1702166  -0.21876162 -0.1921142  ...  0.00251363 -0.0828248
  -0.14519574]
 [ 0.04146985  0.10812975 -0.1861306  ...  0.00607138 -0.22922915
  -0.03184065]]


## Conclusion

Even though this model can accurately predict many POS tags, state-of-the-art taggers use more sophisticated techniques. For example, whereas here we predicted a tag based only on the preceding words and tags, [bidirectional layers](https://keras.io/layers/wrappers/#bidirectional) also model the sequence that appears after the given word to additionally inform the prediction. POS tagging can be seen as a shallow version of syntactic parsing, which is a more difficult NLP problem. Where POS tagging outputs a flat sequence with a one-to-one mapping between words and tags, syntactic parsing produces a hierarchical structure where categories consist of multiple-word phrases and phrase categories are embedded inside other phrases. Check out the [chapter from Jurafsky & Martin's book](https://web.stanford.edu/~jurafsky/slp3/14.pdf) if you're interested in learning more about these deeper models of linguistic structure.

## More resources

Yoav Goldberg's book [Neural Network Methods for Natural Language Processing](http://www.morganclaypool.com/doi/abs/10.2200/S00762ED1V01Y201703HLT037) is a thorough introduction to neural networks for NLP tasks in general.

If you'd like to learn more about what Keras is doing under the hood, the [Theano tutorials](http://deeplearning.net/tutorial/) are useful. There is one specifically on [semantic parsing](http://deeplearning.net/tutorial/rnnslu.html#rnnslu), which is related to the POS tagging task.

TensorFlow also has an RNN language model [tutorial](https://www.tensorflow.org/versions/r0.12/tutorials/recurrent/index.html) using the Penn Treebank dataset.

Andrej Karpathy's blog post [The Unreasonable Effectiveness of Recurrent Neural Networks](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) is very helpful for understanding the underlying details of the kind of recurrent model I've demonstrated here. It also provides raw Python code with an implementation of the backpropagation algorithm.

Chris Olah provides a good [explanation](http://colah.github.io/posts/2015-08-Understanding-LSTMs/) of how LSTM RNNs work (this explanation also applies to the GRU model used here).

Denny Britz's [tutorial](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/) documents well both the technical details of RNNs and their implementation in Python.
