# Long Short-Term Memory Next Word Prediction Model

We want to build a model that, given a string of text, can reliably predict the following *n* words. The model will be a recurrent neural network (RNN) with a Long Short-Term Memory (LSTM) architecture.

## Preprocessing

In [1]:
import sys
sys.path.append('../')
from util.process import Process

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\amira\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [13]:
FILE_PATH = "sherlock_holmes_text.txt"

sentences = Process.file_to_sentences(FILE_PATH)

In [14]:
# Drop the first four entries, then preview the first ten remaining sentences
sentences = sentences[4:]
sentences[:10]

['I have seldom heard him mention her under any other name.',
 'In his eyes she eclipses and predominates the whole of her sex.',
 'It was not that he felt any emotion akin to love for Irene Adler.',
 'All emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind.',
 'He was, I take it, the most perfect reasoning and observing machine that the world has seen, but as a lover he would have placed himself in a false position.',
 'He never spoke of the softer passions, save with a gibe and a sneer.',
 'They were admirable things for the observer—excellent for drawing the veil from mens motives and actions.',
 'But for the trained reasoner to admit such intrusions into his own delicate and finely adjusted temperament was to introduce a distracting factor which might throw a doubt upon all his mental results.',
 'Grit in a sensitive instrument, or a crack in one of his own high-power lenses, would not be more disturbing than a strong emotion in a na

Now that we have a list of all the sentences, we split each sentence into its individual words using word_tokenize() from NLTK. Note that punctuation marks such as '.' become tokens of their own.

In [7]:
from nltk.tokenize import word_tokenize

In [15]:
sentences = [word_tokenize(sentence) for sentence in sentences]
sentences[0]

['I',
 'have',
 'seldom',
 'heard',
 'him',
 'mention',
 'her',
 'under',
 'any',
 'other',
 'name',
 '.']

Next, we create the input sequences. We first map every word to an integer index, and then build n-gram sequences: for each sentence, every prefix of at least two tokens becomes one input sequence.

In [18]:
all_words = [word for sentence in sentences for word in sentence]
all_words[:10]

['I',
 'have',
 'seldom',
 'heard',
 'him',
 'mention',
 'her',
 'under',
 'any',
 'other']

In [19]:
vocabulary = set(all_words)
word_to_index = {word: idx for idx, word in enumerate(vocabulary, 1)}
index_to_word = {idx: word for word, idx in word_to_index.items()}
# The size of the vocabulary will be one larger because 
# we reserve integer 0 for the padding token
vocab_size = len(vocabulary) + 1
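
As a small illustration of the mapping above, here is a sketch on a toy corpus. All `toy_*` names are hypothetical and not part of the notebook. One assumption differs from the cell above: the vocabulary is sorted before indexing. The iteration order of a set of strings can differ between Python sessions, so sorting is one way to keep the word indices reproducible, for example when a trained model is reloaded later.

```python
# Toy corpus standing in for the tokenized sentences (hypothetical data).
toy_sentences = [["I", "have", "seldom", "heard", "him"], ["I", "heard", "him", "."]]
toy_words = [word for sentence in toy_sentences for word in sentence]

# Sorting fixes the iteration order, so a given word always receives the same index.
toy_vocabulary = sorted(set(toy_words))

# Indices start at 1 because 0 is reserved for the padding token.
toy_word_to_index = {word: idx for idx, word in enumerate(toy_vocabulary, 1)}
toy_index_to_word = {idx: word for word, idx in toy_word_to_index.items()}
toy_vocab_size = len(toy_vocabulary) + 1

print(toy_word_to_index)
# {'.': 1, 'I': 2, 'have': 3, 'heard': 4, 'him': 5, 'seldom': 6}
print(toy_vocab_size)
# 7
```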

In [31]:
input_sequences = []
for sentence in sentences:
    token_list = [word_to_index[word] for word in sentence]
    for i in range(2, len(token_list) + 1):
        ngram = token_list[:i]
        input_sequences.append(ngram)
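
To make the loop above concrete, the sketch below applies the same prefix logic to a single short sentence. The helper name `prefix_sequences` and the token indices are made up for illustration. A sentence of *k* tokens yields *k* - 1 sequences, and in each one the last token is the word the model should learn to predict from the tokens before it.

```python
def prefix_sequences(token_list):
    """Return every prefix of token_list that contains at least two tokens."""
    return [token_list[:i] for i in range(2, len(token_list) + 1)]

# Hypothetical indices for a four-word sentence.
toy_tokens = [2, 3, 6, 4]
toy_sequences = prefix_sequences(toy_tokens)

for seq in toy_sequences:
    # Everything but the last token is context; the last token is the target.
    print(seq[:-1], "->", seq[-1])
# [2] -> 3
# [2, 3] -> 6
# [2, 3, 6] -> 4
```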

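Now we pad the input sequences so that they all have the same length. With padding='pre', zeros are added on the left, so the last element of every padded sequence is still the word to predict. After padding, we split each sequence into a predictor vector (all tokens but the last) and a label (the last token).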

In [25]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

In [32]:
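max_sequence_len = max(len(seq) for seq in input_sequences)
input_sequences = pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre')

# Every token except the last is a predictor; the last token is the label
X, y = input_sequences[:, :-1], input_sequences[:, -1]
# One-hot encode the labels over the full vocabulary
y = to_categorical(y, num_classes=vocab_size)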
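
The effect of pre-padding and the predictor/label split is easier to see on a tiny input. The sketch below reimplements both steps in plain Python for illustration only; `pad_pre` and `one_hot` are hypothetical helpers that imitate what `pad_sequences(..., padding='pre')` and `to_categorical` do on this small case, not the Keras implementations.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad each sequence with `value` up to maxlen (truncation is not handled)."""
    return [[value] * (maxlen - len(seq)) + list(seq) for seq in sequences]

def one_hot(index, num_classes):
    """Return a list of floats with 1.0 at position `index` and 0.0 elsewhere."""
    vec = [0.0] * num_classes
    vec[index] = 1.0
    return vec

toy_sequences = [[2, 3], [2, 3, 6], [2, 3, 6, 4]]  # hypothetical n-gram sequences
toy_vocab_size = 7                                 # six words plus padding index 0

toy_max_len = max(len(seq) for seq in toy_sequences)
padded = pad_pre(toy_sequences, toy_max_len)
print(padded)
# [[0, 0, 2, 3], [0, 2, 3, 6], [2, 3, 6, 4]]

# Split each padded row into predictors (all but last) and label (last).
toy_X = [row[:-1] for row in padded]
toy_y = [one_hot(row[-1], toy_vocab_size) for row in padded]
print(toy_X[0], "->", toy_y[0])
# [0, 0, 2] -> [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
```

Because the zeros go on the left, the label always sits in the final column, which is why a single slice can separate X from y.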

In [37]:
print(X[453], '\n->\n', y[453])

[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0 8733 3550 5178 2061] 
->
 [0. 0. 0. ... 0. 0. 0.]


Preprocessing is now complete. X holds the predictors (the padded context tokens of each sequence), and y holds the one-hot encoded labels (the word that follows each context).
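
A quick sanity check is to turn one predictor/label pair back into words. The sketch below assumes an index-to-word dictionary like the one built earlier; `decode_example` is a hypothetical helper, and the toy mapping and rows are made up for illustration.

```python
def decode_example(x_row, y_row, index_to_word):
    """Map a padded predictor row and a one-hot label row back to words."""
    # Skip index 0, which is the padding token and has no word attached.
    context = [index_to_word[int(i)] for i in x_row if i != 0]
    # The label is the position of the largest entry in the one-hot vector.
    label_index = max(range(len(y_row)), key=lambda i: y_row[i])
    return context, index_to_word[label_index]

# Hypothetical toy mapping and one training pair.
toy_index_to_word = {1: ".", 2: "I", 3: "have", 4: "heard", 5: "him", 6: "seldom"}
toy_x_row = [0, 2, 3]
toy_y_row = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]

print(decode_example(toy_x_row, toy_y_row, toy_index_to_word))
# (['I', 'have'], 'seldom')
```

In the notebook, a call such as `decode_example(X[453], y[453], index_to_word)` should work the same way, provided the row index is the same for X and y.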

## Building the Model