In [1]:
# For the Data
import numpy as np
import pandas as pd
import nltk
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# For the Model
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Dropout, Embedding

# For saving
from path import path

/Users/tonimo/Documents/Learning-Developing-Software/Python/Machine Learning/Supervised Learning/Recurrent Neural Network


In [2]:
with open('data/wonderful_wizard_of_oz.txt', 'r') as f: # (any file of text that can be converted to a string)
    data = f.read()

In [3]:
print(data[70:200])

k Baum

This eBook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with al


### Preprocessing


In [4]:

"""Before we build and train the model, we have to clean the text. As in sentiment analysis, we want to get rid of non-words. 
Unlike sentiment analysis, we keep stop words in the text, as they are informative about what the next word will be. We first split 
the text into tokens, which are smaller chunks of information (here, whitespace-separated words) rather than the entire sequence of text. 
We then drop every token that is not in a list of known English words. Tokens with punctuation attached (such as "Oz,") do not match 
the word list, so they are dropped as well. What remains is what matters for predicting the next word from the words that precede it.
Basic built-in Python functions cover most of this, but we also use external libraries: the nltk word corpus here and, later, the 
Tokenizer from keras.preprocessing.text"""

def preprocess(text):
    nltk.download("words") # fetch the NLTK word list (skipped when it is already up to date)
    words = set(nltk.corpus.words.words()) # words set
    tokenized = text.split() # creating tokens by splitting on whitespace
    # keep only tokens found in the word list (the check is case-sensitive), then lowercase them
    tokenized = [word.lower() for word in tokenized if word in words]
    return tokenized

print(preprocess(data)[:10]) # tokens

['of', 'the', 'of', 'by', 'frank', 'is', 'for', 'the', 'use', 'of']


[nltk_data] Downloading package words to /Users/tonimo/nltk_data...
[nltk_data]   Package words is already up-to-date!
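The docstring above talks about stripping symbols before filtering, while `preprocess` itself only splits on whitespace and checks each token against the word list. As a rough sketch of the stricter variant, the snippet below removes everything except letters, digits, and whitespace with a regular expression and lowercases before the membership test. `KNOWN_WORDS` and `clean_and_tokenize` are hypothetical names, and the tiny word set is only a stand-in for the nltk corpus so the example runs offline.

```python
import re

# Hypothetical stand-in for the nltk word list, kept tiny so this runs offline.
KNOWN_WORDS = {"the", "wizard", "of", "is", "for", "use", "anyone"}

def clean_and_tokenize(text, known_words=KNOWN_WORDS):
    # Replace every character that is not a letter, digit, or whitespace with a space.
    stripped = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    # Lowercase before the membership test so capitalized words are not lost.
    tokens = stripped.lower().split()
    # Keep only known words; stop words stay in because they help next-word prediction.
    return [tok for tok in tokens if tok in known_words]

sample = "The Wizard of Oz, by L. Frank Baum. This eBook is for the use of anyone!"
tokens = clean_and_tokenize(sample)
print(tokens)
```

This variant would produce a different token stream than the one used in the rest of the notebook, so the vocabulary size and outputs shown later would not carry over to it.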


### Creating Sequences



In [5]:

"""For our model to be able to predict the next word, we need to create sequences of words. We choose a sequence length, which is 
how many words are processed before the next word is predicted. Each sequence holds that many tokens for the input, plus one extra 
token for the output (the next word). For example, take the text "I went to play basketball today" with sequence length = 2. 
1st iteration: ["I went to"] 2nd iteration: ["went to play"] 3rd iteration: ["to play basketball"]... and so on. 
The elements from 0 up to our sequence length (exclusive) are our inputs, and the last element is the target or expected output. 
We will later embed these tokens into vectors, which helps the model learn the relationships between the tokens (words) in a sequence."""


def sequenize(tokens, sequence_length, size=None):
    length = sequence_length + 1 # include the output (+1)
    sequences = []
    size = size if size else len(tokens) # upper bound on the token index to scan (defaults to all tokens)

    for i in range(length, size): 
        # getting the sequence along with the output for every possible sequence in our tokens
        sequence = " ".join(tokens[i-length:i]).strip()
        # valid sequences only
        if sequence and len(sequence.split()) == sequence_length + 1:
            sequences.append(sequence) 
    
    return sequences

print(sequenize(preprocess(data), 3)[:10]) # sequences of our text with a sequence length of 3 + 1 (for output)

[nltk_data] Downloading package words to /Users/tonimo/nltk_data...
[nltk_data]   Package words is already up-to-date!


['of the of by', 'the of by frank', 'of by frank is', 'by frank is for', 'frank is for the', 'is for the use', 'for the use of', 'the use of anyone', 'use of anyone anywhere', 'of anyone anywhere in']
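To make the input/target split concrete, here is a small sketch of the same sliding-window idea on the example sentence from the docstring, with a sequence length of 2. `sliding_windows` is a hypothetical helper written for this illustration. It works on token lists rather than joined strings, so it is not a drop-in replacement for `sequenize`.

```python
def sliding_windows(tokens, sequence_length):
    # Each window holds sequence_length input tokens plus 1 target token.
    window = sequence_length + 1
    return [tokens[i:i + window] for i in range(len(tokens) - window + 1)]

tokens = "I went to play basketball today".split()
windows = sliding_windows(tokens, 2)

for w in windows:
    inputs, target = w[:-1], w[-1]  # everything but the last token -> last token
    print(inputs, "->", target)
```

Six tokens with a window of three give four windows, and each window shifts by one word, which is why consecutive sequences in the output above overlap so heavily.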


### Tokenize sequences



In [6]:

"""With the data in sequences we can see the inputs and outputs. The inputs are the first k (k = sequence length) words of the sequence and 
the output is the word that comes after the kth word in the sequence. For our model to learn anything, we must turn these 
sequences into numerical vectors. The Keras API provides the Tokenizer class for this. It can turn each text either into a sequence of 
integers, where each position holds the dictionary index of the token at that position, or into a fixed-size vector where each entry 
is binary, a word count, or a tf-idf value. What we're going to do is fit the Tokenizer to our sequences to build a dictionary of words 
in which tokens of higher frequency get the lowest indices. Once we do that, we turn our text sequences into numerical sequences, 
where position i holds the dictionary index of the ith word in the sequence.

Example: text = "The earth is an awesome place live" -> (tokenize) -> 
dictionary: {'e': 1, 'a': 2, 't': 3, 'h': 4, 'i': 5, 's': 6, 'l': 7, 'r': 8, 'n': 9, 'w': 10, 'o': 11, 'm': 12, 'p': 13, 'c': 14, 'v': 15}     
test text = "The earth is an great place live" -> (text to sequence) -> 
sequence: [[3], [4], [1], [], [1], [2], [8], [3], [4], [], [5], [6], [], [2], [9], [], [], [8], [1], [2], [3], [], [13], [7], [2], [14], [1], [], [7], [5], [15], [1]] 
Small note: in this example the tokens are characters, not words. In our case the tokens are words, so the dictionary keys will be words 
instead of characters."""

def vectorize(sequences):
    tokenizer = Tokenizer(oov_token="<OOV>") # (oov_token is out of vocabulary token)
    tokenizer.fit_on_texts(sequences) # fitting to create dictionary
    vectors = np.array(tokenizer.texts_to_sequences(sequences)) # creating our numerical vectors for learning


    return vectors, tokenizer # give back vectors array and the tokenizer for future predictions


In [7]:
sequence_length = 10

text = sequenize(preprocess(data), sequence_length=sequence_length, size=50000) # scan at most 50,000 tokens; each sequence is 10 words + 1 for output
print(f"sequences: {text[:5]}")

vectors, tokenizer = vectorize(text)
count = len(tokenizer.index_word.keys()) # vocab of the tokenizer
vocab_size = count + 1 # vocab size (for the embedding layer)
print(f"Vectors example: {vectors[0]}\nTotal unique words: {count}\nVocabulary size: {vocab_size}")

[nltk_data] Downloading package words to /Users/tonimo/nltk_data...
[nltk_data]   Package words is already up-to-date!


sequences: ['of the of by frank is for the use of anyone', 'the of by frank is for the use of anyone anywhere', 'of by frank is for the use of anyone anywhere in', 'by frank is for the use of anyone anywhere in the', 'frank is for the use of anyone anywhere in the and']
Vectors example: [  5   2   5  48 753  21  18   2 183   5 280]
Total unique words: 2142
Vocabulary size: 2143
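As a rough picture of what fitting the tokenizer and converting texts to sequences does at the word level, the sketch below builds a frequency-ranked dictionary in plain Python. `build_word_index` and `to_sequences` are hypothetical helpers, not Keras functions. They only mimic the general behaviour: index 1 is reserved for the out-of-vocabulary token and more frequent words get lower indices.

```python
from collections import Counter

def build_word_index(texts, oov_token="<OOV>"):
    # Count every word across all texts.
    counts = Counter(word for text in texts for word in text.lower().split())
    # Reserve index 1 for unknown words, then rank the rest by frequency.
    index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        index[word] = rank
    return index

def to_sequences(texts, word_index, oov_token="<OOV>"):
    oov = word_index[oov_token]
    # Replace each word by its index; unseen words fall back to the OOV index.
    return [[word_index.get(w, oov) for w in t.lower().split()] for t in texts]

texts = ["the use of the", "use of the anyone"]
word_index = build_word_index(texts)
encoded = to_sequences(["the use of dorothy"], word_index)
print(word_index)
print(encoded)
```

This lines up with the vectors printed above, where "the" maps to 2: it is the most frequent word, and index 1 is taken by `<OOV>`. Index 0 is never assigned to a word, which is why the vocabulary size for the embedding layer is the word count plus one.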


### Embedding

In [8]:

"""Embedding is an important part of natural language processing and, in this case, of predicting the next word. An embedding layer
is a trainable "look-up" table: each word index selects one row of a weight matrix, and that row is the vector that represents the word.
This is much more efficient than the typical approach of giving each word a one-hot vector up front, because a one-hot vector is as 
long as the vocabulary, so as the vocabulary grows, the amount of computation and space grows with it and training becomes inefficient.
A short dense vector per word is much better for processing. In a sense, it is a way of learning features and reducing 
dimensionality. 

The Embedding layer from the Keras API has 3 important parameters: the input dimension, the output dimension, and the input length. 
The input dimension is the size of the vocabulary, that is, the number of rows in the look-up table (one row per word index). 
The output dimension is how long we want the vector for each word to be. We can start small and 
say something like 5, or we could go large and bump it up to 500. This is a tunable hyperparameter that affects learning. Lastly, the 
input length is the length of our input sequences. If the sequences are not all the same length, it's wise to pad them (fill each shorter 
sequence with zeros until it is as long as the longest one). In our case, every sequence has a fixed length of ten, so padding is not 
needed. From here, over epochs our model can learn the context of words, so that similar words get similar embeddings and, 
in theory, lead to similar predictions after applying softmax."""

embedded_model = Sequential()
embedded_model.add(Embedding(input_dim=10, output_dim=5, input_length=4))
embedded_model.compile(optimizer="adam", loss="mse", metrics=None)
test_input = np.array([1, 2, 5, 7])
test_input_vectors = embedded_model.predict(test_input.reshape(1, test_input.size))
weights_for_vocab = embedded_model.weights
print(f"Vectors for each element in the input respectively:\n{test_input_vectors}")
print(f"Weights for each vocab word in vocabulary (should be 10 (input dimension)):\n{weights_for_vocab}")


Vectors for each element in the input respectively:
[[[ 0.03296374  0.03500713  0.03752087 -0.0340186   0.01835303]
  [-0.04588386  0.01072135  0.00405924 -0.04999904  0.03682213]
  [ 0.01428384  0.0389634  -0.00085402  0.00861896 -0.02453052]
  [ 0.04725624  0.02515594  0.04757052 -0.01327684 -0.02088813]]]
Weights for each vocab word in vocabulary (should be 10 (input dimension)):
[<tf.Variable 'embedding/embeddings:0' shape=(10, 5) dtype=float32, numpy=
array([[ 0.0298381 , -0.04962946,  0.03873776, -0.01433212,  0.04859641],
       [ 0.03296374,  0.03500713,  0.03752087, -0.0340186 ,  0.01835303],
       [-0.04588386,  0.01072135,  0.00405924, -0.04999904,  0.03682213],
       [ 0.04516921, -0.01392338, -0.04011289, -0.01987324,  0.04825015],
       [-0.00624734,  0.00655264,  0.02335049,  0.04804944,  0.04087234],
       [ 0.01428384,  0.0389634 , -0.00085402,  0.00861896, -0.02453052],
       [-0.00988655, -0.00684834,  0.02439785,  0.00412842,  0.02199061],
       [ 0.04725624, 

2022-05-16 21:05:06.988483: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
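In the output above, the vectors returned for the inputs 1, 2, 5, 7 are exactly rows 1, 2, 5, and 7 of the weight matrix. The NumPy sketch below shows the same idea without Keras: an embedding is row indexing into a table, and it gives the same result as multiplying a one-hot matrix by that table, just without building the one-hot matrix. The random table is only a placeholder for learned weights.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 10, 5

# Placeholder for learned weights: one row per word index.
table = rng.uniform(-0.05, 0.05, size=(vocab_size, embed_dim))

ids = np.array([1, 2, 5, 7])

# Embedding lookup: pick the rows that belong to the word indices.
looked_up = table[ids]

# Equivalent but wasteful: build one-hot rows and multiply by the table.
one_hot = np.eye(vocab_size)[ids]
via_matmul = one_hot @ table

print(looked_up.shape)
print(np.allclose(looked_up, via_matmul))
```

The one-hot route needs a `(4, 10)` matrix here and would need a `(sequence_length, vocab_size)` matrix per sequence in the real model, which is the cost the look-up avoids.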


### Model

### LSTM (Long Short-Term Memory)


In [9]:

"""The LSTM is a more complex recurrent layer that makes training more effective by learning which inputs are important enough 
to carry forward through the recurrent steps. It uses gates to control the flow of state from one time step to the next. 
This matters because the base recurrent neural network suffers from vanishing gradients, which leave it with only short-term memory. 

To start off, the base LSTM has 3 gates (one of them involves two operations, so 4 in a way) that control the flow of the information passed 
through. These gates help the cell learn what to keep and what to forget. The first gate is the forget gate, which applies a sigmoid 
to the concatenation of the previous hidden state and the input vector. The sigmoid function squashes the 
values between 0 and 1, so values to keep end up closer to 1 and values to forget end up closer to 0. The previous cell state is then 
multiplied element-wise by this output, which lets the network retain important values and ignore (forget) unimportant ones. 
Next, the input gate updates the cell state. A sigmoid is applied to the concatenation of the previous hidden state and the input 
vector to decide which values to write, and a tanh is applied to the same concatenation to produce candidate values between -1 and 1. 
Those two results are multiplied, then added to the result of the forget step to get the new cell state. Lastly, the output gate determines 
what the next hidden state will be. It applies a sigmoid to the concatenation of the input and previous hidden state, then 
multiplies that by the tanh of the new cell state to create the new hidden state. 
In short, this process is complex, but it tells the network which values to keep or forget as it learns
through backpropagation. At each time step the LSTM passes the new cell state and the new hidden state on to the next time step."""

inputs = tf.random.normal([4, 2, 2])
lstm1 = LSTM(3, return_sequences=True)
lstm2 = LSTM(3)
output1 = lstm1(inputs) # lstm1 receiving inputs and passing them as sequential inputs (return_sequences=True)
output2 = lstm2(output1) # lstm2 taking outputs from lstm1 as inputs to output data
print(f"inputs -> lstm1 ->:\n {output1}\n")
print(f"output(lstm1) -> input -> lstm2 ->:\n {output2}")

inputs -> lstm1 ->:
 [[[ 0.00497654 -0.11485896 -0.10108453]
  [-0.06515366 -0.13458483 -0.23143965]]

 [[-0.05492815  0.06400284 -0.2400621 ]
  [-0.05787418  0.06483869 -0.01606833]]

 [[-0.02756388  0.10773108  0.04353565]
  [-0.09107605  0.12948447 -0.12584193]]

 [[ 0.0103193   0.06424125  0.06334271]
  [-0.05525967  0.0982845  -0.11114873]]]

output(lstm1) -> input -> lstm2 ->:
 [[ 0.04109912  0.02249972  0.05420007]
 [ 0.04302067  0.04966669  0.01984686]
 [ 0.03541845  0.0534464   0.00093455]
 [ 0.02065295  0.03397502 -0.00445839]]
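The gate description above can be sketched as a single time step in NumPy. This is a minimal illustration under simplifying assumptions: `lstm_step` is a hypothetical helper, all four gate weight blocks are packed into one matrix `W` for readability, and the weights are random. The layout is not meant to match how Keras stores its LSTM weights.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step. W: (hidden + features, 4 * hidden), b: (4 * hidden,)."""
    n = h_prev.shape[-1]
    # Concatenate previous hidden state and current input, then project once for all gates.
    z = np.concatenate([h_prev, x_t], axis=-1) @ W + b
    f = sigmoid(z[..., 0 * n:1 * n])   # forget gate: what to keep from the old cell state
    i = sigmoid(z[..., 1 * n:2 * n])   # input gate: which candidate values to write
    g = np.tanh(z[..., 2 * n:3 * n])   # candidate values in (-1, 1)
    o = sigmoid(z[..., 3 * n:4 * n])   # output gate: what to expose as hidden state
    c_new = f * c_prev + i * g         # new cell state
    h_new = o * np.tanh(c_new)         # new hidden state
    return h_new, c_new

rng = np.random.default_rng(0)
batch, steps, features, hidden = 4, 2, 2, 3   # same shapes as the Keras demo above
inputs = rng.normal(size=(batch, steps, features))
W = rng.normal(scale=0.1, size=(hidden + features, 4 * hidden))
b = np.zeros(4 * hidden)

h = np.zeros((batch, hidden))
c = np.zeros((batch, hidden))
outputs = []
for t in range(steps):
    h, c = lstm_step(inputs[:, t, :], h, c, W, b)
    outputs.append(h)

all_steps = np.stack(outputs, axis=1)  # like return_sequences=True
last_step = h                          # like the default return_sequences=False
print(all_steps.shape, last_step.shape)
```

The two results have the same shapes as `output1` and `output2` in the cell above: one hidden state per time step when every step is kept, and a single hidden state per sample otherwise.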


### Model Breakdown

In [10]:

"""The first layer is an embedding layer that learns a vector for each word during training, which is a driving factor for predicting 
the next word. We use 32 as the size of the vector that represents a word.

We then have two LSTM layers with 512 units each. These layers address the vanishing gradient issue of basic RNNs and are a huge 
factor in learning our features effectively and efficiently. Notice that the first one sets return_sequences to True, as it's False by default.
This is what allows these layers to be stacked, because we want the next LSTM layer to receive the output of the previous LSTM layer at 
every time step rather than one output vector for the whole sequence. 

We then have a Dense layer (512 neurons) with relu activation, applied to the last time step of the previous LSTM layer (just a hidden layer 
before the output). Stacked on top, we have a dropout layer to reduce overfitting.

Lastly, we have another Dense layer as our output layer. The number of nodes needs to be the vocab size, as we will later use argmax on the output vector to find which word the model predicted based on our vocab from the tokenizer. Since this is multi-class classification, we use the softmax activation."""

model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=32, input_length=sequence_length))
model.add(LSTM(512, return_sequences=True))
model.add(LSTM(512))
model.add(Dense(512, activation="relu"))
model.add(Dropout(0.3))
model.add(Dense(vocab_size, activation="softmax"))

model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
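To illustrate how the output layer and the loss fit together, here is a small NumPy sketch of softmax, argmax, and sparse categorical cross-entropy on made-up scores for a 4-word vocabulary. The helper functions are written for this illustration and are simplified versions of what Keras computes internally.

```python
import numpy as np

def softmax(logits):
    # Subtract the row maximum for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def sparse_categorical_crossentropy(probs, targets):
    # "Sparse" means targets are integer word indices, not one-hot vectors.
    picked = probs[np.arange(len(targets)), targets]
    return -np.log(picked)

# Made-up raw scores for two examples over a 4-word vocabulary.
logits = np.array([[2.0, 0.5, 0.1, 0.1],
                   [0.0, 0.0, 3.0, 0.0]])
targets = np.array([0, 1])  # true next-word indices

probs = softmax(logits)
predicted = probs.argmax(axis=-1)  # most likely word index per example
losses = sparse_categorical_crossentropy(probs, targets)

print(predicted)
print(losses)
```

The first example predicts its target and gets a small loss. The second puts most of its probability on index 2 while the target is index 1, so its loss is larger. Integer targets are also why `y` can stay a plain vector of word indices during training.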



### Training

In [11]:

print(vectors.shape)

X, y = vectors[:, :-1], vectors[:, -1] # features are every word val up to last word, label is last word in a given vector

print(f"Example input & shape: {X[0]}, {X[0].shape}") # vector of 10 values (sequence_length) which are indices into the vocab
print(f"Example output & shape: {y[0]}, {y[0].shape}") # a single value (scalar) which is an index into the vocab
print(f"Goal: {X[0]} -> {y[0]}")

(30960, 11)
Example input & shape: [  5   2   5  48 753  21  18   2 183   5], (10,)
Example output & shape: 280, ()
Goal: [  5   2   5  48 753  21  18   2 183   5] -> 280


In [12]:
model.fit(X, y, batch_size=128, epochs=25) # training (epochs vary for dataset and model build)
model.save(f"{path}/models/lstm.h5") # saving model

Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25


### Predicting

In [19]:
# just gets inputs for predicting next words
def input_sequences(sequence_length):
    print("Press enter to quit")
    sequences = []
    while True:
        user_input = input("Enter a sequence of words to predict (any size): ").strip()
        if user_input:
            sequences.append(user_input)
        else:
            return sequences

In [20]:
# predicts the next word off a given sequence
def predict(tokenizer, inputs, model):
    vectors = tokenizer.texts_to_sequences(inputs) # tokenize the inputs (unknown words map to the <OOV> index)
    # zero-pad short inputs at the end; longer inputs keep only their last sequence_length tokens
    vectors = pad_sequences(vectors, maxlen=sequence_length, padding="post")
    pred = model.predict(vectors) # one probability vector over the vocabulary per input
    italic, end = "\033[3m", "\033[0m"
    # index_word maps index -> word; index 0 is the padding value and has no word, so fall back to a "<PAD>" label
    words = [tokenizer.index_word.get(int(np.argmax(p)), "<PAD>") for p in pred]
    return [f"{s}{italic} {w}{end}" for s, w in zip(inputs, words)] # append what the model thinks the next word is
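Since the inputs here can be any length, it helps to see what the padding step does to them. The sketch below mimics, in plain Python, the behaviour of `pad_sequences` with `padding="post"` and its default `truncating="pre"` for a single sequence. `pad_post_truncate_pre` is a hypothetical helper for illustration, not part of Keras.

```python
def pad_post_truncate_pre(seq, maxlen, value=0):
    # Truncating "pre": drop tokens from the front, keep the last maxlen tokens.
    kept = seq[-maxlen:]
    # Padding "post": append the fill value until the sequence reaches maxlen.
    return kept + [value] * (maxlen - len(kept))

short = [5, 2, 48]
long = list(range(1, 13))  # 12 token indices

padded_short = pad_post_truncate_pre(short, 10)
padded_long = pad_post_truncate_pre(long, 10)
print(padded_short)
print(padded_long)
```

Long inputs keep the words closest to the prediction point, which is what we want. Short inputs end in zeros, and since the model was trained only on full ten-word sequences, it never saw zero-padded inputs. Predictions for short inputs are therefore likely to be less reliable.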

In [29]:
model = tf.keras.models.load_model(f"{path}/models/lstm.h5")

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 10, 32)            68576     
                                                                 
 lstm_2 (LSTM)               (None, 10, 512)           1116160   
                                                                 
 lstm_3 (LSTM)               (None, 512)               2099200   
                                                                 
 dense (Dense)               (None, 512)               262656    
                                                                 
 dropout (Dropout)           (None, 512)               0         
                                                                 
 dense_1 (Dense)             (None, 2143)              1099359   
                                                                 
Total params: 4,645,951
Trainable params: 4,645,951
No
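The parameter counts in the summary above can be checked by hand. The sketch below uses the standard formulas for these layer types with the sizes from this notebook (vocabulary 2143, embedding size 32, 512 units). The helper function names are hypothetical.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized row per word index.
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # 4 weight blocks (forget, input, candidate, output), each with
    # input weights, recurrent weights, and a bias.
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit.
    return input_dim * units + units

vocab_size, embed_dim, units = 2143, 32, 512

counts = {
    "embedding": embedding_params(vocab_size, embed_dim),
    "lstm_1": lstm_params(embed_dim, units),
    "lstm_2": lstm_params(units, units),
    "dense": dense_params(units, units),
    "output": dense_params(units, vocab_size),
}
total = sum(counts.values())

for name, n in counts.items():
    print(f"{name}: {n:,}")
print(f"total: {total:,}")
```

Dropout has no parameters, so it does not appear in the sum. The second LSTM is almost twice the size of the first because its input is the 512-wide output of the previous layer rather than the 32-wide embedding.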

In [27]:
inputs = input_sequences(sequence_length) # get filtered sequences
predictions = predict(tokenizer, inputs, model=model) # get predictions
for pred in predictions:
    print(pred) # italicized words are the prediction

Press enter to quit
When Aunt Em came there to live she was a *pretty*
When Dorothy, who was an orphan, first came to her *great*
with long silky hair and small black eyes that twinkled *running*
He worked hard from morning till night and did not *know*
The cyclone had set the house down very gently for *the*
Dorothy carried the shoes into the house and placed them *they*
She took a little basket and filled it with bread *from*
some still open contradictions in my thoughts and my values, that *you*
This automobile executive has a better idea than digging tunnels *all*
We were very close and remain close. We just hadn’t *he*
has no wires and total freedom. Explore new worlds with *the*
wow that tv looks pretty nice *like*
I went to play basketball and the *and*
Are these the children of  Michael Jordan? *once*
