# Recurrent Neural Networks for Language Modeling in `Python`

<br>

## Recurrent Neural Networks and `Keras`

<br>

### Introduction to the Course

We will look at four applications of machine learning for text data  

* Sentiment Analysis
* Multi-class classification
* Text generation
* Neural machine translation

**Recurrent Neural Networks** - reduce the number of parameters by avoiding one-hot encoding  
**Sequence to sequence models** - many to one (e.g. for classification); many to many (e.g. text generation)
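
The difference between many-to-one and many-to-many can be sketched with a scalar recurrent cell written in plain Python. The weights `w_x` and `w_h` and the `return_sequences` flag are illustrative assumptions, not values from the course; the point is only that the same weights are reused at every time step and that the caller chooses between the last state and all states.

```python
import math

def simple_rnn(inputs, w_x=0.5, w_h=0.8, return_sequences=False):
    """Run a scalar RNN over a list of numbers.

    w_x and w_h are made-up weights chosen only for illustration.
    """
    h = 0.0  # initial hidden state
    states = []
    for x in inputs:
        # The same two weights are reused at every time step
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    # Many-to-many keeps every state; many-to-one keeps only the last
    return states if return_sequences else states[-1]

sequence = [1.0, 0.0, -1.0, 2.0]
last_state = simple_rnn(sequence)                         # many to one
all_states = simple_rnn(sequence, return_sequences=True)  # many to many
print(last_state)
print(all_states)
```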

<br>

### Introduction to Language Models

Language Models determine sentence probability

* Probability of 'I loved this movie'
* Unigram $P(\mbox{sentence}) = P(\mbox{I})P(\mbox{loved})P(\mbox{this})P(\mbox{movie})$
* Bigram $P(\mbox{sentence})=P(\mbox{I})P(\mbox{loved|I})P(\mbox{this|loved})P(\mbox{movie|this})$
* Trigram $P(\mbox{sentence})=P(\mbox{I})P(\mbox{loved|I})P(\mbox{this|I loved})P(\mbox{movie|loved this})$
* Skipgram $P(\mbox{sentence})=P(\mbox{context of I|I})P(\mbox{context of loved|loved})P(\mbox{context of this|this})P(\mbox{context of movie|movie})$
* Neural Network models
    - $P(\mbox{sentence})$ is given by a softmax function on the output layer of the network
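
As a rough illustration of the bigram formula, the probabilities can be estimated from word counts. The three-sentence corpus below is invented for this example, and no smoothing is applied, so any unseen word pair gives a probability of zero.

```python
from collections import Counter

# Toy corpus, invented for this example
corpus = [
    "I loved this movie",
    "I loved this book",
    "I hated this movie",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    words = sentence.split(" ")
    unigram_counts.update(words)
    bigram_counts.update(zip(words, words[1:]))

total_words = sum(unigram_counts.values())

def bigram_probability(sentence):
    """P(w1) * P(w2|w1) * ... estimated from raw counts (no smoothing)."""
    words = sentence.split(" ")
    prob = unigram_counts[words[0]] / total_words
    for prev, word in zip(words, words[1:]):
        if unigram_counts[prev] == 0:
            return 0.0  # the previous word was never seen
        prob *= bigram_counts[(prev, word)] / unigram_counts[prev]
    return prob

p = bigram_probability("I loved this movie")
print(p)
```

Here $P(\mbox{I}) = 3/12$, $P(\mbox{loved|I}) = 2/3$, $P(\mbox{this|loved}) = 1$ and $P(\mbox{movie|this}) = 2/3$, so the product is $1/9$.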

<br>

In [None]:
#Build vocabulary dictionaries

#Get unique words
unique_words = list( set( text.split(' ') ) )

#Create a dictionary: word is key, index is the value
word_to_index = { k:v for (v,k) in enumerate( unique_words ) }

#Create a dictionary: index is key, word is value
index_to_word = { k:v for (k,v) in enumerate( unique_words ) }

In [None]:
#Preprocessing input
X = []
y = []

#Loop over the text: length 'sentence_size' per time with step equal to 'step'
for i in range( 0, len( text ) - sentence_size, step ):
    X.append( text[ i:i + sentence_size ] )
    y.append( text[ i + sentence_size ] )
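
To see what the sliding window produces, here is the same loop on a short string. The text and the values of `sentence_size` and `step` are chosen only for illustration.

```python
text = "hello world"   # toy text, invented for this example
sentence_size = 4      # characters used as input
step = 3               # how far the window moves each time

X, y = [], []
for i in range(0, len(text) - sentence_size, step):
    X.append(text[i:i + sentence_size])   # input window
    y.append(text[i + sentence_size])     # character right after the window

for window, target in zip(X, y):
    print(repr(window), "->", repr(target))
```

Each window of four characters is paired with the single character that follows it, which is the label the model learns to predict.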

In [None]:
#Transforming new texts
new_text_split = []
#Loop and get the indexes from the dictionary
for sentence in new_text:
    sent_split = []
    for wd in sentence.split(' '):
        ix = word_to_index[ wd ]
        sent_split.append( ix )
    new_text_split.append( sent_split )

In [1]:
#extracted the vocabulary unique_words of the raw texts and created dictionaries to 
#go from words to numerical indexes and vice versa.

sheldon_quotes = ["You're afraid of insects and women, Ladybugs must render you catatonic.",
 'Scissors cuts paper, paper covers rock, rock crushes lizard, lizard poisons Spock, Spock smashes scissors, scissors decapitates lizard, lizard eats paper, paper disproves Spock, Spock vaporizes rock, and as it always has, rock crushes scissors.',
 'For example, I cry because others are stupid, and that makes me sad.',
 "I'm not insane, my mother had me tested.",
 'Two days later, Penny moved in and so much blood rushed to your genitals, your brain became a ghost town.',
 "Amy's birthday present will be my genitals.",
 '(3 knocks) Penny! (3 knocks) Penny! (3 knocks) Penny!',
 'Thankfully all the things my girlfriend used to do can be taken care of with my right hand.',
 'I would have been here sooner but the bus kept stopping for other people to get on it.',
 'Oh gravity, thou art a heartless bitch.',
 'I am aware of the way humans usually reproduce which is messy, unsanitary and based on living next to you for three years, involves loud and unnecessary appeals to a deity.',
 'Well, today we tried masturbating for money.',
 'I think that you have as much of a chance of having a sexual relationship with Penny as the Hubble telescope does of discovering at the center of every black hole is a little man with a flashlight searching for a circuit breaker.',
 "Well, well, well, if it isn't Wil Wheaton! The Green Goblin to my Spider-Man, the Pope Paul V to my Galileo, the Internet Explorer to my Firefox.",
 "What computer do you have? And please don't say a white one.",
 "She calls me moon-pie because I'm nummy-nummy and she could just eat me up.",
 'Ah, memory impairment; the free prize at the bottom of every vodka bottle.']

# Transform the list of sentences into a list of words
all_words = ' '.join(sheldon_quotes).split(' ')

# Get the unique words
unique_words = list(set(all_words))

# Dictionary of indexes as keys and words as values
index_to_word = {i:wd for i, wd in enumerate(sorted(unique_words))}

print(index_to_word)

# Dictionary of words as keys and indexes as values
word_to_index = {wd:i for i, wd in enumerate(sorted(unique_words))}

print(word_to_index)

{0: '(3', 1: 'Ah,', 2: "Amy's", 3: 'And', 4: 'Explorer', 5: 'Firefox.', 6: 'For', 7: 'Galileo,', 8: 'Goblin', 9: 'Green', 10: 'Hubble', 11: 'I', 12: "I'm", 13: 'Internet', 14: 'Ladybugs', 15: 'Oh', 16: 'Paul', 17: 'Penny', 18: 'Penny!', 19: 'Pope', 20: 'Scissors', 21: 'She', 22: 'Spider-Man,', 23: 'Spock', 24: 'Spock,', 25: 'Thankfully', 26: 'The', 27: 'Two', 28: 'V', 29: 'Well,', 30: 'What', 31: 'Wheaton!', 32: 'Wil', 33: "You're", 34: 'a', 35: 'afraid', 36: 'all', 37: 'always', 38: 'am', 39: 'and', 40: 'appeals', 41: 'are', 42: 'art', 43: 'as', 44: 'at', 45: 'aware', 46: 'based', 47: 'be', 48: 'became', 49: 'because', 50: 'been', 51: 'birthday', 52: 'bitch.', 53: 'black', 54: 'blood', 55: 'bottle.', 56: 'bottom', 57: 'brain', 58: 'breaker.', 59: 'bus', 60: 'but', 61: 'calls', 62: 'can', 63: 'care', 64: 'catatonic.', 65: 'center', 66: 'chance', 67: 'circuit', 68: 'computer', 69: 'could', 70: 'covers', 71: 'crushes', 72: 'cry', 73: 'cuts', 74: 'days', 75: 'decapitates', 76: 'deity.', 7

In [2]:
# Create lists to keep the sentences and the next character
sentences = []   # ~ Training data
next_chars = []  # ~ Training labels

# Define hyperparameters
step = 2          # ~ Step to take when reading the texts in characters
chars_window = 10 # ~ Number of characters to use to predict the next one  

# Join the quotes into one string so that the loop works on characters
sheldon_text = ' '.join(sheldon_quotes)

# Loop over the text: length `chars_window` per time with step equal to `step`
for i in range(0, len(sheldon_text) - chars_window, step):
    sentences.append(sheldon_text[i:i + chars_window])
    next_chars.append(sheldon_text[i + chars_window])


In [3]:
new_text = ['A man either lives life as it happens to him meets it head-on and licks it or he turns his back on it and starts to wither away',
 'To the brave crew and passengers of the Kobayshi Maru sucks to be you',
 'Beware of more powerful weapons They often inflict as much damage to your soul as they do to you enemies',
 'They are merely scars not mortal wounds and you must use them to propel you forward',
 'You cannot explain away a wantonly immoral act because you think that it is connected to some higher purpose']

# Loop through the sentences and get indexes
new_text_split = []
for sentence in new_text:
    sent_split = []
    for wd in sentence.split(' '):
        index = word_to_index.get(wd, 0)
        sent_split.append(index)
    new_text_split.append(sent_split)

# Print the first sentence's indexes
print(new_text_split[0])

# Print the sentence converted using the dictionary
print(' '.join([index_to_word[index] for index in new_text_split[0]]))

[0, 125, 0, 0, 0, 43, 113, 0, 181, 0, 0, 113, 0, 39, 0, 113, 0, 0, 0, 0, 0, 141, 113, 39, 0, 181, 0, 0]
(3 man (3 (3 (3 as it (3 to (3 (3 it (3 and (3 it (3 (3 (3 (3 (3 on it and (3 to (3 (3
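
Note that falling back to index `0` makes every unknown word decode as `'(3'`, because that token happens to hold index 0 in the vocabulary. One possible way around this, sketched below, is to reserve index 0 for a placeholder token. The `<UNK>` name and the tiny vocabulary are assumptions made for this example, not part of the course code.

```python
# Tiny training vocabulary, invented for this example
train_words = ["it", "and", "to", "as", "on", "man"]

# Reserve index 0 for unknown words; real words start at 1
UNK = "<UNK>"  # hypothetical placeholder token
word_to_index = {UNK: 0}
for i, wd in enumerate(sorted(set(train_words)), start=1):
    word_to_index[wd] = i
index_to_word = {i: wd for wd, i in word_to_index.items()}

new_sentence = "A man either lives life as it happens"
indexes = [word_to_index.get(wd, word_to_index[UNK])
           for wd in new_sentence.split(" ")]
decoded = " ".join(index_to_word[i] for i in indexes)

print(indexes)
print(decoded)
```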


<br>

### Introduction to RNN inside `Keras`

* `keras.models`
    - `keras.models.Sequential` - each layer is input to the following
    - `keras.models.Model` - allows for more flexible model architecture
* `keras.layers`
    - `LSTM`
    - `GRU`
    - `Dense`
    - `Dropout`
    - `Embedding`
    - `Bidirectional`
* `keras.preprocessing`
    - `keras.preprocessing.sequence.pad_sequences( text, maxlen=3 )` - make fixed length vectors
* `keras.datasets` - IMDB, Reuters & more
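
The effect of padding to a fixed length can be sketched in plain Python. The helper `pad_pre` below is a hypothetical stand-in written for this example; it mimics the default behaviour of `pad_sequences`, which pads and truncates at the start of each sequence.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad (or left-truncate) each list of indexes to length maxlen."""
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]  # keep only the last maxlen items
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

print(pad_pre([[5], [1, 2, 3, 4], [7, 8]], maxlen=3))
```

Short sequences gain leading zeros and long ones lose their first items, so every row ends up with exactly `maxlen` entries.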

<br>

In [4]:
from keras.models import Sequential
from keras.layers import LSTM, Dense

<br>

Building the Model

    #Instantiate the model class
    model = Sequential()

    #Add the layers
    model.add( Dense( 64, activation='relu', input_dim=100 ) )
    model.add( Dense( 1, activation='sigmoid' ) )

    #Compile the model
    model.compile( optimizer='adam', loss='mean_squared_error', metrics=['accuracy'] )
    
<br>

Training the Model

    model.fit( X_train, y_train, epochs = 10, batch_size = 32 )
    
where:  

1. **epochs** - number of full passes over the training data
2. **batch_size** - number of samples used for each weight update
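
Together, the two arguments fix how many weight updates are performed: each epoch makes one update per batch. A small calculation, assuming a training set of 1,000 samples (a figure picked only for illustration):

```python
import math

def weight_updates(n_samples, batch_size, epochs):
    """Number of gradient steps taken by mini-batch training."""
    steps_per_epoch = math.ceil(n_samples / batch_size)
    return steps_per_epoch * epochs

# 1,000 training samples is an assumed figure for illustration
updates = weight_updates(n_samples=1000, batch_size=32, epochs=10)
print(updates)  # 32 steps per epoch * 10 epochs = 320
```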


Evaluate the Model

    model.evaluate( X_test, y_test )
    model.predict( new_data )
    
<br>

In [8]:
# Sequential Model

# Instantiate the class
model = Sequential()

# One LSTM layer (defining the input shape because it is the 
# initial layer)
model.add(LSTM(128, input_shape=(None, 10), name="LSTM"))

# Add a dense layer with one unit
model.add(Dense(1, activation="sigmoid", name="output"))

# The summary shows the layers and the number of parameters 
# that will be trained
model.summary()

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
LSTM (LSTM)                  (None, 128)               71168     
_________________________________________________________________
output (Dense)               (None, 1)                 129       
Total params: 71,297
Trainable params: 71,297
Non-trainable params: 0
_________________________________________________________________
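
The parameter counts in the summary can be checked by hand. An LSTM layer has four sets of weights (three gates plus the candidate state), each with input weights, recurrent weights and a bias. The helper names below are made up for this check.

```python
def lstm_params(units, input_dim):
    # 4 weight sets (3 gates + candidate state), each with
    # input weights, recurrent weights and a bias vector
    return 4 * (units * input_dim + units * units + units)

def dense_params(units, input_dim):
    # one weight per input per unit, plus one bias per unit
    return units * input_dim + units

lstm = lstm_params(units=128, input_dim=10)
dense = dense_params(units=1, input_dim=128)
print(lstm, dense, lstm + dense)  # 71168 129 71297
```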


In [None]:
# Model class (functional API)
from keras.models import Model
from keras.layers import Input, LSTM, Dense

# Define the input layer
main_input = Input(shape=(None, 10), name="input")

# One LSTM layer (input shape is already defined)
lstm_layer = LSTM(128, name="LSTM")(main_input)

# Add a dense layer with one unit
main_output = Dense(1, activation="sigmoid", name="output")(lstm_layer)

# Instantiate the class at the end
model = Model(inputs=main_input, outputs=main_output, name="modelclass_model")

# Same amount of parameters to train as before (71,297)
model.summary()

In [10]:
import numpy as np

texts = np.array(['So if a photon is directed through a plane with two slits in it and either slit is observed it will not go through both slits. If it’s unobserved it will, however, if it’s observed after it’s left the plane but before it hits its target, it will not have gone through both slits.',
       'Hello, female children. Allow me to inspire you with a story about a great female scientist. Polish-born, French-educated Madame Curie. Co-discoverer of radioactivity, she was a hero of science, until her hair fell out, her vomit and stool became filled with blood, and she was poisoned to death by her own discovery. With a little hard work, I see no reason why that can’t happen to any of you. Are we done? Can we go?'],
      dtype='<U419')

texts

array(['So if a photon is directed through a plane with two slits in it and either slit is observed it will not go through both slits. If it’s unobserved it will, however, if it’s observed after it’s left the plane but before it hits its target, it will not have gone through both slits.',
       'Hello, female children. Allow me to inspire you with a story about a great female scientist. Polish-born, French-educated Madame Curie. Co-discoverer of radioactivity, she was a hero of science, until her hair fell out, her vomit and stool became filled with blood, and she was poisoned to death by her own discovery. With a little hard work, I see no reason why that can’t happen to any of you. Are we done? Can we go?'],
      dtype='<U419')

In [11]:
# Preprocess text

# Import relevant classes/functions
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# Build the dictionary of indexes
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)

# Change texts into sequence of indexes
texts_numeric = tokenizer.texts_to_sequences(texts)
print("Number of words in the sample texts: ({0}, {1})".format(len(texts_numeric[0]), len(texts_numeric[1])))

# Pad the sequences
texts_pad = pad_sequences(texts_numeric, 60)
print("Now the texts have fixed length: 60. Let's see the first one: \n{0}".format(texts_pad[0]))

Number of words in the sample texts: (54, 78)
Now the texts have fixed length: 60. Let's see the first one: 
[ 0  0  0  0  0  0 24  4  1 25 13 26  5  1 14  3 27  6 28  2  7 29 30 13
 15  2  8 16 17  5 18  6  4  9 31  2  8 32  4  9 15 33  9 34 35 14 36 37
  2 38 39 40  2  8 16 41 42  5 18  6]


<br>

## RNN Architecture

### Vanishing and Exploding Gradients

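**Exploding gradient problem** - gradients can grow exponentially as they are propagated back through many time steps. This can be addressed with simple techniques such as gradient clipping.  

**Vanishing gradient problem** - gradients shrink towards zero as they are propagated back through many time steps, so the earliest steps barely influence the weight updates.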
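
Both problems come from multiplying by the same factor at every time step, which a few lines of arithmetic can show. The factors 0.5 and 1.5, the 50 steps and the clipping threshold are arbitrary choices for illustration, and `clip_by_norm` is a hand-written sketch of norm-based gradient clipping rather than a library function.

```python
import math

def repeated_factor(w, steps):
    """Size of a gradient term that is multiplied by w at every time step."""
    return w ** steps

print(repeated_factor(0.5, 50))   # shrinks towards zero (vanishing)
print(repeated_factor(1.5, 50))   # grows very large (exploding)

def clip_by_norm(gradient, max_norm):
    """Rescale the gradient vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= max_norm:
        return list(gradient)
    return [g * max_norm / norm for g in gradient]

clipped = clip_by_norm([30.0, 40.0], max_norm=5.0)
print(clipped)
```

Clipping keeps the direction of the gradient and only reduces its length, which is why it is a cheap fix for exploding gradients but does nothing for vanishing ones.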

### GRU & LSTM Cells

**GRU cells** - add a gate to the memory cell that determines whether the state is updated or keeps its previous value. This helps prevent the gradient through the RNN cell from going to zero (vanishing).  

**LSTM (Long Short-Term Memory) cells** - add 3 gates to the RNN cell. The forget gate decides if the previous state should be forgotten. The update gate determines if the candidate state should be considered. The output gate determines whether the new hidden state should be considered.  
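
The three gates can be sketched as one step of a scalar LSTM cell. This is a simplified illustration using the common textbook formulation, not the Keras implementation, and the weights are hypothetical values chosen so that the forget gate is almost fully open and the update gate almost closed.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One step of a scalar LSTM cell; w holds made-up (w_x, w_h, bias) per gate."""
    def gate(name, activation):
        w_x, w_h, b = w[name]
        return activation(w_x * x + w_h * h_prev + b)

    f = gate("forget", sigmoid)   # how much of the old cell state to keep
    i = gate("update", sigmoid)   # how much of the candidate to let in
    o = gate("output", sigmoid)   # how much of the cell state to expose
    c_candidate = gate("candidate", math.tanh)

    c = f * c_prev + i * c_candidate   # new cell state
    h = o * math.tanh(c)               # new hidden state
    return h, c

# Hypothetical weights: large forget bias, very negative update bias
weights = {
    "forget": (0.0, 0.0, 10.0),
    "update": (0.0, 0.0, -10.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 1.0, 0.0),
}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=0.7, w=weights)
print(h, c)
```

With these settings the cell state stays close to its previous value of 0.7, which is the mechanism that lets information (and gradient) survive across many time steps.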

No more vanishing gradients  

* the `SimpleRNN` cell can have gradient problems
    - the weight matrix raised to the power $t$ multiplies the other terms
* `GRU` and `LSTM` cells are much less prone to vanishing gradient problems
    - because of their gates
    - they don't have the weight matrix terms multiplying the rest
    - exploding gradient problems are easier to solve
    
Usage in `keras`:

    # Import the layers
    from keras.layers import GRU, LSTM
    
    # Add the layers to a model
    model.add( GRU( units=128, return_sequences=True, name='GRU_layer' ) )
    model.add( LSTM( units=64, return_sequences=False, name='LSTM_layer' ) )
    

<br>

### The Embedding Layer

Embeddings reduce the dimensionality of the data in comparison with one-hot encoding. An embedding is a dense representation of a word. However, training can take a lot longer because the layer adds many parameters.  
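
An embedding lookup gives the same result as multiplying a one-hot vector by the embedding matrix, without ever building the large sparse vector. The sketch below uses a four-word vocabulary and 2-dimensional vectors invented for this example; `one_hot` and `matvec` are small helpers written by hand.

```python
# Toy vocabulary and embedding matrix, invented for this example
vocab = {"movie": 0, "loved": 1, "this": 2, "I": 3}
embedding_matrix = [      # one row per word, 2 dimensions each
    [0.1, 0.9],
    [0.8, 0.3],
    [0.2, 0.2],
    [0.5, 0.7],
]

def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]

def matvec(vector, matrix):
    """Row vector times matrix, written out by hand."""
    n_cols = len(matrix[0])
    return [sum(vector[r] * matrix[r][c] for r in range(len(matrix)))
            for c in range(n_cols)]

idx = vocab["loved"]
via_one_hot = matvec(one_hot(idx, len(vocab)), embedding_matrix)
via_lookup = embedding_matrix[idx]
print(via_one_hot, via_lookup)

# The layer still has vocab_size * embedding_dim trainable weights,
# e.g. a 100,000-word vocabulary with 300 dimensions:
n_params = 100000 * 300
print(n_params)
```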

in `keras`:

    from keras.layers import Embedding
    model = Sequential()
    
    #use the embedding as the first layer
    model.add( Embedding( input_dim=100000, #size of vocab
                          output_dim=300, #dims of embedding space
                          trainable=True, #update weights during training
                          embeddings_initializer='uniform', #default; pass Constant(...) to use pretrained weights
                          input_length=120 ) ) #length of sequences to be modelled
                          
Transfer learning for language models (GloVe, word2vec, BERT)  
in `keras`:  

    from keras.initializers import Constant
    model.add( Embedding( input_dim = vocabulary_size,
                          output_dim = embedding_dim,
                          embeddings_initializer=Constant(pre_trained_vectors) ) )
                          
Using GloVe pre-trained vectors:  
(https://nlp.stanford.edu/projects/glove/)  

    import numpy as np

    #Get the GloVe vectors
    def get_glove_vectors( filename="glove.6B.300d.txt" ):
        # Get all word vectors from pre-trained model
        glove_vector_dict = {}
        with open( filename, encoding="utf-8" ) as f:
            for line in f:
                values = line.split()
                word = values[0]
                coefs = values[1:]
                glove_vector_dict[word] = np.asarray( coefs, dtype='float32' )
        return glove_vector_dict
        
        
Using GloVe on a specific task:  

    #Filter GloVe vectors to be specific for the task vocab
    def filter_glove( vocabulary_dict, glove_dict, wordvec_dim=300 ):
        # Create a matrix to store the vectors
        embedding_matrix = np.zeros( ( len( vocabulary_dict ) + 1, wordvec_dim ) )
        for word,i in vocabulary_dict.items():
            embedding_vector = glove_dict.get( word )
            if embedding_vector is not None:
                # words not found in the glove_dict will be all zeros
                embedding_matrix[i] = embedding_vector
        return embedding_matrix
        
<br>        

### Sentiment Classification Revisited

Improving a model's performance:  

* Add the embedding layer
* Increase the number of layers
* Tune the parameters
* Increase vocabulary size
* Accept longer sentences with more memory cells

Avoiding overfitting:  

* Test different batch sizes
* Add `Dropout` layers
* Add `dropout` and `recurrent_dropout` parameters to RNN layers  

Adding a Convolution Layer:  

* Convolution layer will do the feature selection on the embedding layer
* Achieves state-of-the-art results in many NLP problems

An example model:  

    model = Sequential()
    model.add( Embedding( vocabulary_size, wordvec_dim, trainable=True,
                          embeddings_initializer=Constant(glove_matrix),
                          input_length=max_text_len, name='Embedding' ) )
    model.add( Dense( wordvec_dim, activation='relu', name='Dense1' ) )
    model.add( Dropout( rate=0.25 ) ) #add noise by removing inputs
    model.add( LSTM( 64, return_sequences=True, dropout=0.15, name='LSTM' ) )
    model.add( GRU( 64, return_sequences=False, dropout=0.15, name='GRU' ) )
    model.add( Dense( 64, name='Dense2' ) )
    model.add( Dropout( rate=0.25 ) )
    model.add( Dense( 32, name='Dense3' ) )
    model.add( Dense( 1, activation='sigmoid', name='Output' ) )

<br>

## Multi-class Classification

### Data pre-processing

Differences between binary and multi-class classification problems:  

* Shape of the output variable $y$
    - one-hot encoding will make each class equidistant
    - softmax function will return the probability of each class
* Number of units on the output layer
* Activation function on the output layer
    - softmax used instead of a sigmoid (logistic) function
* Loss function
    - use the `categorical_crossentropy` function

<br>

In [14]:
#Preparing text categories for keras (numerical representation)

import pandas as pd 
y = [ 'sports', 'economy', 'data_science', 'sports', 'finance' ]
y_series = pd.Series( y, dtype='category' )
print( y_series.cat.codes )
print( y_series.cat.categories )

0    3
1    1
2    0
3    3
4    2
dtype: int8
Index(['data_science', 'economy', 'finance', 'sports'], dtype='object')


In [16]:
#Preparing the data as 1-hot encoded

from keras.utils import to_categorical 
y_prep = to_categorical( y_series.cat.codes )
print( y_prep )

[[0. 0. 0. 1.]
 [0. 1. 0. 0.]
 [1. 0. 0. 0.]
 [0. 0. 0. 1.]
 [0. 0. 1. 0.]]


<br>

### Transfer Learning for Language Models

**Transfer learning** - start with better-than-random initial weights taken from a model that was previously trained on very large datasets  

Some available architectures:  

* Word2Vec (Google)
    - Continuous Bag of Words
    - Skip-gram
* FastText (Facebook)
    - uses words and n-grams of chars
* ELMo
    - uses words, embeddings per context
    - uses Deep bidirectional language models
    
Word2Vec and FastText are available on the package `gensim` and ELMo on `tensorflow_hub`

example using Word2Vec

    from gensim.models import word2vec
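    # Train the model (`size` and `iter` are the gensim 3.x argument names;
    # gensim 4 renamed them to `vector_size` and `epochs`)
    w2v_model = word2vec.Word2Vec( tokenized_corpus, size = embedding_dim,
                                   window=neighbor_words_num, iter = 100 )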
    # Get the top 3 similar words to 'captain'
    w2v_model.wv.most_similar( ['captain'], topn=3 )
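
Similarity between word vectors is usually measured with cosine similarity, which is also what `most_similar` ranks by as far as I understand. The sketch below reimplements that ranking by hand; the 3-dimensional vectors are hypothetical values made up for this example, not output from a trained model.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional word vectors, made up for this example
vectors = {
    "captain":   [0.9, 0.1, 0.3],
    "commander": [0.8, 0.2, 0.4],
    "ship":      [0.5, 0.5, 0.1],
    "banana":    [-0.2, 0.9, -0.5],
}

def most_similar(word, topn=3):
    """Rank all other words by cosine similarity to the given word."""
    scores = [(other, cosine_similarity(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

ranking = most_similar("captain", topn=3)
print(ranking)
```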
    
example using FastText

    from gensim.models import fasttext
    # instantiate the model
    ft_model = fasttext.FastText( size = embedding_dim, window=neighbor_words_num )
    # build vocabulary
    ft_model.build_vocab( sentences=tokenized_corpus )
    # train the model
    ft_model.train( sentences=tokenized_corpus,
                    total_examples=len(tokenized_corpus),
                    epochs=100 )

<br>

### Multi-class Classification Models

Building a multi-class classification model:  

    # Build the model
    model = Sequential()
    model.add( Embedding( 10000, 128 ) )
    model.add( LSTM( 128, dropout = 0.2 ) )
    # output layer has 'num_classes' units and uses 'softmax'
    model.add( Dense( num_classes, activation='softmax' ) )
    # compile the model
    model.compile( loss = 'categorical_crossentropy', optimizer='adam', metrics=['accuracy'] )
    ...
    
<br>

In [17]:
from sklearn.datasets import fetch_20newsgroups
news_train = fetch_20newsgroups( subset = 'train' )
news_test = fetch_20newsgroups( subset = 'test' )

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


In [19]:
print( news_train.DESCR )

.. _20newsgroups_dataset:

The 20 newsgroups text dataset
------------------------------

The 20 newsgroups dataset comprises around 18000 newsgroups posts on
20 topics split in two subsets: one for training (or development)
and the other one for testing (or for performance evaluation). The split
between the train and test set is based upon a messages posted before
and after a specific date.

This module contains two loaders. The first one,
:func:`sklearn.datasets.fetch_20newsgroups`,
returns a list of the raw texts that can be fed to text feature
extractors such as :class:`sklearn.feature_extraction.text.CountVectorizer`
with custom parameters so as to extract feature vectors.
The second one, :func:`sklearn.datasets.fetch_20newsgroups_vectorized`,
returns ready-to-use features, i.e., it is not necessary to use a feature
extractor.

**Data Set Characteristics:**

    Classes                     20
    Samples total            18846
    Dimensionality               1
    Features       

In [22]:
# import modules
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

# create and fit the tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts( news_train.data )

# transform the text to vector representations
X_train = tokenizer.texts_to_sequences( news_train.data )
X_train = pad_sequences( X_train, maxlen=400 )
Y_train = to_categorical( news_train.target )

In [None]:
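# train the model
model.fit( X_train, Y_train, batch_size=64, epochs=100 )

# prepare the test data with the same tokenizer and padding
X_test = tokenizer.texts_to_sequences( news_test.data )
X_test = pad_sequences( X_test, maxlen=400 )
Y_test = to_categorical( news_test.target )

# evaluate on test data
model.evaluate( X_test, Y_test )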

<br>

### Assessing the Model's Performance

Accuracy alone is not very informative, especially when the classes are unbalanced. A confusion matrix provides richer information  

**Precision** - is the model accurately predicting each class? $\frac{\mbox{Correct}_{\mbox{class}}}{\mbox{Predicted}_{\mbox{class}}}$

**Recall** - are the classes being correctly classified? $\frac{\mbox{Correct}_{\mbox{class}}}{\mbox{N}_{\mbox{class}}}$

**F1-score** - the harmonic mean of precision and recall.  
$$\mbox{F1 score} = 2\cdot \frac{\mbox{Precision}_{\mbox{class}} \cdot \mbox{Recall}_{\mbox{class}}}{\mbox{Precision}_{\mbox{class}} + \mbox{Recall}_{\mbox{class}}}$$
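
The three formulas can be applied per class directly from the true and predicted labels. The function below is a hand-written sketch for illustration; the labels are made up, and returning 0.0 when a denominator is zero is a convention assumed here.

```python
def per_class_metrics(y_true, y_pred):
    """Precision, recall and F1 for every class, computed from raw labels."""
    metrics = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        correct = sum(1 for t, p in zip(y_true, y_pred) if t == p == cls)
        predicted = sum(1 for p in y_pred if p == cls)
        actual = sum(1 for t in y_true if t == cls)
        precision = correct / predicted if predicted else 0.0
        recall = correct / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[cls] = (precision, recall, f1)
    return metrics

# Made-up labels for a 3-class problem
y_true = ["sports", "sports", "economy", "finance", "economy", "sports"]
y_pred = ["sports", "economy", "economy", "finance", "sports", "sports"]

results = per_class_metrics(y_true, y_pred)
for cls, (p, r, f1) in results.items():
    print(cls, round(p, 2), round(r, 2), round(f1, 2))
```

Overall accuracy here is 4/6, yet the per-class view shows that `finance` is classified perfectly while `economy` is right only half of the time.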


using `sklearn`:

    from sklearn.metrics import confusion_matrix
    # build the confusion matrix
    confusion_matrix( y_true, y_pred )
    
other performance metrics:

    from sklearn.metrics import precision_score
    from sklearn.metrics import recall_score
    from sklearn.metrics import f1_score
    from sklearn.metrics import accuracy_score
    from sklearn.metrics import classification_report
    
    # add average=None to precision, recall and f1 score functions as follows:
    print( precision_score( y_true, y_pred, average=None ) )
    
<br>    