# Recurrent Neural Networks for Language Modeling in `Python`

<br>

## Recurrent Neural Networks and `Keras`

<br>

### Introduction to the Course

We will look at four applications of machine learning to text data:

* Sentiment Analysis
* Multi-class classification
* Text generation
* Neural machine translation

**Recurrent Neural Networks** - reduce the number of parameters by avoiding one-hot encoding  
**Sequence to sequence models** - many to one (e.g. for classification); many to many (e.g. text generation)

<br>

### Introduction to Language Models

Language models assign a probability to a sentence:

* Probability of 'I loved this movie'
* Unigram $P(\mbox{sentence}) = P(\mbox{I})P(\mbox{loved})P(\mbox{this})P(\mbox{movie})$
* Bigram $P(\mbox{sentence})=P(\mbox{I})P(\mbox{loved|I})P(\mbox{this|loved})P(\mbox{movie|this})$
* Trigram $P(\mbox{sentence})=P(\mbox{I})P(\mbox{loved|I})P(\mbox{this|I loved})P(\mbox{movie|loved this})$
* Skipgram $P(\mbox{sentence})=P(\mbox{context of I|I})P(\mbox{context of loved|loved})P(\mbox{context of this|this})P(\mbox{context of movie|movie})$
* Neural Network models
    - $P(\mbox{sentence})$ is given by a softmax function on the output layer of the network
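
The unigram and bigram formulas above can be checked on a tiny example. The sketch below estimates the probabilities from raw counts; the three-sentence corpus and the function names are assumptions made for illustration, and no smoothing is applied, so a word that never appears in the corpus would cause a division by zero in the bigram case.

```python
from collections import Counter

# Toy corpus (hypothetical, for illustration only)
corpus = ["I loved this movie", "I loved this book", "I hated this movie"]
tokenized = [sentence.split(" ") for sentence in corpus]

# Count single words and adjacent word pairs
unigram_counts = Counter(w for sent in tokenized for w in sent)
bigram_counts = Counter((a, b) for sent in tokenized for a, b in zip(sent, sent[1:]))
total_words = sum(unigram_counts.values())

def unigram_probability(sentence):
    """P(w1) * P(w2) * ... with each word treated independently."""
    prob = 1.0
    for w in sentence.split(" "):
        prob *= unigram_counts[w] / total_words
    return prob

def bigram_probability(sentence):
    """P(w1) * P(w2|w1) * P(w3|w2) * ... estimated from counts."""
    words = sentence.split(" ")
    prob = unigram_counts[words[0]] / total_words
    for prev, cur in zip(words, words[1:]):
        prob *= bigram_counts[(prev, cur)] / unigram_counts[prev]
    return prob

print(unigram_probability("I loved this movie"))  # 1/576 for this corpus
print(bigram_probability("I loved this movie"))   # 1/9 for this corpus
```

The bigram estimate is much larger than the unigram one because it rewards word orders that actually occur in the corpus.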

<br>

In [None]:
#Build vocabulary dictionaries

#Get unique words
unique_words = list( set( text.split(' ') ) )

#Create a dictionary: word is key, index is the value
word_to_index = { k:v for (v,k) in enumerate( unique_words ) }

#Create a dictionary: index is key, word is value
index_to_word = { k:v for (k,v) in enumerate( unique_words ) }

In [None]:
#Preprocessing input
X = []
y = []

#Loop over the text: length 'sentence_size' per time with step equal to 'step'
for i in range( 0, len( text ) - sentence_size, step ):
    X.append( text[ i:i + sentence_size ] )
    y.append( text[ i + sentence_size ] )

In [None]:
#Transforming new texts
new_text_split = []
#Loop and get the indexes from the dictionary
for sentence in new_text:
    sent_split = []
    for wd in sentence.split(' '):
        ix = word_to_index[ wd ]
        sent_split.append( ix )
    new_text_split.append( sent_split )

In [1]:
# Extract the vocabulary (unique words) of the raw texts and create dictionaries to
# go from words to numerical indexes and vice versa.

sheldon_quotes = ["You're afraid of insects and women, Ladybugs must render you catatonic.",
 'Scissors cuts paper, paper covers rock, rock crushes lizard, lizard poisons Spock, Spock smashes scissors, scissors decapitates lizard, lizard eats paper, paper disproves Spock, Spock vaporizes rock, and as it always has, rock crushes scissors.',
 'For example, I cry because others are stupid, and that makes me sad.',
 "I'm not insane, my mother had me tested.",
 'Two days later, Penny moved in and so much blood rushed to your genitals, your brain became a ghost town.',
 "Amy's birthday present will be my genitals.",
 '(3 knocks) Penny! (3 knocks) Penny! (3 knocks) Penny!',
 'Thankfully all the things my girlfriend used to do can be taken care of with my right hand.',
 'I would have been here sooner but the bus kept stopping for other people to get on it.',
 'Oh gravity, thou art a heartless bitch.',
 'I am aware of the way humans usually reproduce which is messy, unsanitary and based on living next to you for three years, involves loud and unnecessary appeals to a deity.',
 'Well, today we tried masturbating for money.',
 'I think that you have as much of a chance of having a sexual relationship with Penny as the Hubble telescope does of discovering at the center of every black hole is a little man with a flashlight searching for a circuit breaker.',
 "Well, well, well, if it isn't Wil Wheaton! The Green Goblin to my Spider-Man, the Pope Paul V to my Galileo, the Internet Explorer to my Firefox.",
 "What computer do you have? And please don't say a white one.",
 "She calls me moon-pie because I'm nummy-nummy and she could just eat me up.",
 'Ah, memory impairment; the free prize at the bottom of every vodka bottle.']

# Transform the list of sentences into a list of words
all_words = ' '.join(sheldon_quotes).split(' ')

# Get the unique words
unique_words = list(set(all_words))

# Dictionary of indexes as keys and words as values
index_to_word = {i:wd for i, wd in enumerate(sorted(unique_words))}

print(index_to_word)

# Dictionary of words as keys and indexes as values
word_to_index = {wd:i for i, wd in enumerate(sorted(unique_words))}

print(word_to_index)

{0: '(3', 1: 'Ah,', 2: "Amy's", 3: 'And', 4: 'Explorer', 5: 'Firefox.', 6: 'For', 7: 'Galileo,', 8: 'Goblin', 9: 'Green', 10: 'Hubble', 11: 'I', 12: "I'm", 13: 'Internet', 14: 'Ladybugs', 15: 'Oh', 16: 'Paul', 17: 'Penny', 18: 'Penny!', 19: 'Pope', 20: 'Scissors', 21: 'She', 22: 'Spider-Man,', 23: 'Spock', 24: 'Spock,', 25: 'Thankfully', 26: 'The', 27: 'Two', 28: 'V', 29: 'Well,', 30: 'What', 31: 'Wheaton!', 32: 'Wil', 33: "You're", 34: 'a', 35: 'afraid', 36: 'all', 37: 'always', 38: 'am', 39: 'and', 40: 'appeals', 41: 'are', 42: 'art', 43: 'as', 44: 'at', 45: 'aware', 46: 'based', 47: 'be', 48: 'became', 49: 'because', 50: 'been', 51: 'birthday', 52: 'bitch.', 53: 'black', 54: 'blood', 55: 'bottle.', 56: 'bottom', 57: 'brain', 58: 'breaker.', 59: 'bus', 60: 'but', 61: 'calls', 62: 'can', 63: 'care', 64: 'catatonic.', 65: 'center', 66: 'chance', 67: 'circuit', 68: 'computer', 69: 'could', 70: 'covers', 71: 'crushes', 72: 'cry', 73: 'cuts', 74: 'days', 75: 'decapitates', 76: 'deity.', 7

In [2]:
# Create lists to keep the sentences and the next character
sentences = []   # ~ Training data
next_chars = []  # ~ Training labels

# Define hyperparameters
step = 2          # ~ Step to take when reading the texts in characters
chars_window = 10 # ~ Number of characters to use to predict the next one  

# Join the quotes into one string so that slicing yields characters, not whole quotes
text = ' '.join(sheldon_quotes)

# Loop over the text: length `chars_window` per time with step equal to `step`
for i in range(0, len(text) - chars_window, step):
    sentences.append(text[i:i + chars_window])
    next_chars.append(text[i + chars_window])
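
To see what the sliding window produces, here is a minimal sketch on a short string. The helper name `make_windows` and the sample text are assumptions for illustration; the loop is the same as in the cell above, applied to a plain string.

```python
def make_windows(text, chars_window, step):
    """Return (windows, next_chars) for a character-level model."""
    windows, next_chars = [], []
    # Stop early enough that text[i + chars_window] always exists
    for i in range(0, len(text) - chars_window, step):
        windows.append(text[i:i + chars_window])
        next_chars.append(text[i + chars_window])
    return windows, next_chars

windows, next_chars = make_windows("hello world", chars_window=4, step=2)
print(windows)     # ['hell', 'llo ', 'o wo', 'worl']
print(next_chars)  # ['o', 'w', 'r', 'd']
```

Each window is one training example and the character that follows it is the label.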


In [3]:
new_text = ['A man either lives life as it happens to him meets it head-on and licks it or he turns his back on it and starts to wither away',
 'To the brave crew and passengers of the Kobayshi Maru sucks to be you',
 'Beware of more powerful weapons They often inflict as much damage to your soul as they do to you enemies',
 'They are merely scars not mortal wounds and you must use them to propel you forward',
 'You cannot explain away a wantonly immoral act because you think that it is connected to some higher purpose']

# Loop through the sentences and get indexes
new_text_split = []
for sentence in new_text:
    sent_split = []
    for wd in sentence.split(' '):
        # Words missing from the vocabulary fall back to index 0, which here is the real token '(3'
        index = word_to_index.get(wd, 0)
        sent_split.append(index)
    new_text_split.append(sent_split)

# Print the first sentence's indexes
print(new_text_split[0])

# Print the sentence converted using the dictionary
print(' '.join([index_to_word[index] for index in new_text_split[0]]))

[0, 125, 0, 0, 0, 43, 113, 0, 181, 0, 0, 113, 0, 39, 0, 113, 0, 0, 0, 0, 0, 141, 113, 39, 0, 181, 0, 0]
(3 man (3 (3 (3 as it (3 to (3 (3 it (3 and (3 it (3 (3 (3 (3 (3 on it and (3 to (3 (3
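
In the output above, every unknown word is decoded as `(3` because index 0 already belongs to a real token. One common way around this is to reserve index 0 for a dedicated unknown-word marker. The sketch below assumes a hypothetical `<UNK>` token and uses its own small vocabulary and names (`token_to_index`, `index_to_token`) so that it does not overwrite the dictionaries built earlier.

```python
# Hypothetical training sentences and unknown-word marker
train_sentences = ["I loved this movie", "this movie was great"]
UNK = "<UNK>"

# Reserve index 0 for unknown words; real words start at 1
vocab = sorted(set(" ".join(train_sentences).split(" ")))
token_to_index = {UNK: 0}
token_to_index.update({wd: i for i, wd in enumerate(vocab, start=1)})
index_to_token = {i: wd for wd, i in token_to_index.items()}

def encode(sentence):
    """Map each word to its index, using 0 for words not in the vocabulary."""
    return [token_to_index.get(wd, 0) for wd in sentence.split(" ")]

def decode(indexes):
    """Map indexes back to words."""
    return " ".join(index_to_token[i] for i in indexes)

encoded = encode("I hated this movie")
print(encoded)          # [1, 0, 5, 4]
print(decode(encoded))  # I <UNK> this movie
```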


<br>

### Introduction to RNN inside `Keras`

* `keras.models`
    - `keras.models.Sequential` - each layer is input to the following
    - `keras.models.Model` - allows for more flexible model architecture
* `keras.layers`
    - `LSTM`
    - `GRU`
    - `Dense`
    - `Dropout`
    - `Embedding`
    - `Bidirectional`
* `keras.preprocessing`
    - `keras.preprocessing.sequence.pad_sequences( text, maxlen=3 )` - pads or truncates each sequence to a fixed length
* `keras.datasets` - IMDB, Reuters & more
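
To make the effect of `pad_sequences` concrete, here is a pure-Python sketch of its default behaviour (zeros added at the start, extra items dropped from the start). The helper `pad_pre` is a hypothetical stand-in written for illustration, not part of Keras, and it assumes `maxlen` is at least 1.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad with `value` and left-truncate so every sequence has length `maxlen`."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep only the last `maxlen` items
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

result = pad_pre([[1, 2], [3, 4, 5, 6]], maxlen=3)
print(result)  # [[0, 1, 2], [4, 5, 6]]
```

Padding at the start keeps the most recent words next to the end of the sequence, which is where a recurrent layer produces its final state.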

<br>

In [4]:
from keras.models import Sequential
from keras.layers import LSTM, Dense

<br>

Building the Model

    #Instantiate the model class
    model = Sequential()

    #Add the layers
    model.add( Dense( 64, activation='relu', input_dim=100 ) )
    model.add( Dense( 1, activation='sigmoid' ) )

    #Compile the model
    model.compile( optimizer='adam', loss='mean_squared_error', metrics=['accuracy'] )
    
<br>

Training the Model

    model.fit( X_train, y_train, epochs = 10, batch_size = 32 )
    
where:  

1. **epochs** - number of full passes over the training data; the weights are updated once per batch, so more epochs mean more updates
2. **batch_size** - number of samples used for each weight update
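
The two arguments together determine how many weight updates happen during training. The sketch below is simple arithmetic; the training set size of 1000 samples and the helper name `weight_updates` are assumptions, and it assumes the last, smaller batch of each epoch is still used.

```python
import math

def weight_updates(n_samples, epochs, batch_size):
    """Total number of gradient steps taken during training."""
    steps_per_epoch = math.ceil(n_samples / batch_size)
    return epochs * steps_per_epoch

# Hypothetical training set of 1000 samples with the settings used above
total_updates = weight_updates(n_samples=1000, epochs=10, batch_size=32)
print(total_updates)  # 320 (32 steps per epoch, 10 epochs)
```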


Evaluate the Model

    model.evaluate( X_test, y_test )
    model.predict( new_data )
    
<br>

In [8]:
# Sequential Model

# Instantiate the class
model = Sequential()

# One LSTM layer (defining the input shape because it is the 
# initial layer)
model.add(LSTM(128, input_shape=(None, 10), name="LSTM"))

# Add a dense layer with one unit
model.add(Dense(1, activation="sigmoid", name="output"))

# The summary shows the layers and the number of parameters 
# that will be trained
model.summary()

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
LSTM (LSTM)                  (None, 128)               71168     
_________________________________________________________________
output (Dense)               (None, 1)                 129       
Total params: 71,297
Trainable params: 71,297
Non-trainable params: 0
_________________________________________________________________
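
The parameter counts in the summary can be reproduced by hand. An LSTM layer has four gates, each with weights for the input features, weights for the previous hidden state, and a bias. The sketch below applies that count to the layer sizes used above; the helper names are made up for illustration.

```python
def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * units * (input_dim + units + 1)

def dense_params(input_dim, units):
    """One weight per input plus one bias, for each unit."""
    return units * (input_dim + 1)

lstm_count = lstm_params(input_dim=10, units=128)
dense_count = dense_params(input_dim=128, units=1)
print(lstm_count)                # 71168
print(dense_count)               # 129
print(lstm_count + dense_count)  # 71297
```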


In [None]:
# Model class (functional API)
from keras.models import Model
from keras.layers import Input, LSTM, Dense

# Define the input layer
main_input = Input(shape=(None, 10), name="input")

# One LSTM layer (input shape is already defined)
lstm_layer = LSTM(128, name="LSTM")(main_input)

# Add a dense layer with one unit
main_output = Dense(1, activation="sigmoid", name="output")(lstm_layer)

# Instantiate the class at the end
model = Model(inputs=main_input, outputs=main_output, name="modelclass_model")

# Same amount of parameters to train as before (71,297)
model.summary()

In [10]:
import numpy as np

texts = np.array(['So if a photon is directed through a plane with two slits in it and either slit is observed it will not go through both slits. If it’s unobserved it will, however, if it’s observed after it’s left the plane but before it hits its target, it will not have gone through both slits.',
       'Hello, female children. Allow me to inspire you with a story about a great female scientist. Polish-born, French-educated Madame Curie. Co-discoverer of radioactivity, she was a hero of science, until her hair fell out, her vomit and stool became filled with blood, and she was poisoned to death by her own discovery. With a little hard work, I see no reason why that can’t happen to any of you. Are we done? Can we go?'],
      dtype='<U419')

texts

array(['So if a photon is directed through a plane with two slits in it and either slit is observed it will not go through both slits. If it’s unobserved it will, however, if it’s observed after it’s left the plane but before it hits its target, it will not have gone through both slits.',
       'Hello, female children. Allow me to inspire you with a story about a great female scientist. Polish-born, French-educated Madame Curie. Co-discoverer of radioactivity, she was a hero of science, until her hair fell out, her vomit and stool became filled with blood, and she was poisoned to death by her own discovery. With a little hard work, I see no reason why that can’t happen to any of you. Are we done? Can we go?'],
      dtype='<U419')

In [11]:
# Preprocess text

# Import relevant classes/functions
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# Build the dictionary of indexes
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)

# Change texts into sequence of indexes
texts_numeric = tokenizer.texts_to_sequences(texts)
print("Number of words in the sample texts: ({0}, {1})".format(len(texts_numeric[0]), len(texts_numeric[1])))

# Pad the sequences: shorter texts get zeros at the start, longer ones lose their first words
texts_pad = pad_sequences(texts_numeric, maxlen=60)
print("Now the texts have fixed length: 60. Let's see the first one: \n{0}".format(texts_pad[0]))

Number of words in the sample texts: (54, 78)
Now the texts have fixed length: 60. Let's see the first one: 
[ 0  0  0  0  0  0 24  4  1 25 13 26  5  1 14  3 27  6 28  2  7 29 30 13
 15  2  8 16 17  5 18  6  4  9 31  2  8 32  4  9 15 33  9 34 35 14 36 37
  2 38 39 40  2  8 16 41 42  5 18  6]
