## Language modelling with LSTMs

Now we will use LSTMs for language modelling: the network learns from sequences of words (windows) and predicts the word that follows each window.

Note:
the first time you run this notebook, you will need to install ``tensorflow`` and ``keras`` (as well as ``nltk`` and ``scikit-learn``, if they are missing).

In [1]:
import nltk
import numpy as np
from keras.callbacks import ModelCheckpoint
from keras.layers import BatchNormalization, Dense, Dropout, Embedding, Input, LSTM
from keras.models import Model, Sequential
from keras.preprocessing import sequence
from keras.utils import to_categorical
from sklearn.model_selection import train_test_split

In [3]:
nltk.download('gutenberg')

[nltk_data] Downloading package gutenberg to
[nltk_data]     C:\Users\abodunde.ojo_kuda\AppData\Roaming\nltk_data..
[nltk_data]     .
[nltk_data]   Unzipping corpora\gutenberg.zip.


True

We use Emma by Jane Austen from the Gutenberg corpus again:

In [4]:
emma_sents = nltk.corpus.gutenberg.sents('austen-emma.txt')

In [5]:
# print the first 5 sentences
print(emma_sents[:5])

[['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']'], ['VOLUME', 'I'], ['CHAPTER', 'I'], ['Emma', 'Woodhouse', ',', 'handsome', ',', 'clever', ',', 'and', 'rich', ',', 'with', 'a', 'comfortable', 'home', 'and', 'happy', 'disposition', ',', 'seemed', 'to', 'unite', 'some', 'of', 'the', 'best', 'blessings', 'of', 'existence', ';', 'and', 'had', 'lived', 'nearly', 'twenty', '-', 'one', 'years', 'in', 'the', 'world', 'with', 'very', 'little', 'to', 'distress', 'or', 'vex', 'her', '.'], ['She', 'was', 'the', 'youngest', 'of', 'the', 'two', 'daughters', 'of', 'a', 'most', 'affectionate', ',', 'indulgent', 'father', ';', 'and', 'had', ',', 'in', 'consequence', 'of', 'her', 'sister', "'", 's', 'marriage', ',', 'been', 'mistress', 'of', 'his', 'house', 'from', 'a', 'very', 'early', 'period', '.']]


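Next, we map every word to a unique integer id, using one dictionary for words -> integers and one for the reverse lookup. These ids play the role of positions in a one-hot vector; ids start at 1, so 0 stays free for padding: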

In [6]:
# stores the words -> integers
word_dict = {}
# stores the integers -> words
reverse_dict = {}

# next free integer id; ids start at 1 so that 0 stays free for padding
no_words = 1

int_sents = []
longest = 0
for i, sentence in enumerate(emma_sents[:1000]):
    ints = []

    for word in sentence:
        if word not in word_dict:
            word_dict[word] = no_words
            reverse_dict[no_words] = word
            no_words += 1
        # add word to the integer list
        ints.append(word_dict[word])
    
    # Print to illustrate the conversion
    if i<5:
        print(sentence)
        print(ints)
    # Keep track of the longest sentence
    if len(sentence) > longest:
        longest = len(sentence)
        
    # store the integer sentence
    int_sents.append(ints)

['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']
[1, 2, 3, 4, 5, 6, 7]
['VOLUME', 'I']
[8, 9]
['CHAPTER', 'I']
[10, 9]
['Emma', 'Woodhouse', ',', 'handsome', ',', 'clever', ',', 'and', 'rich', ',', 'with', 'a', 'comfortable', 'home', 'and', 'happy', 'disposition', ',', 'seemed', 'to', 'unite', 'some', 'of', 'the', 'best', 'blessings', 'of', 'existence', ';', 'and', 'had', 'lived', 'nearly', 'twenty', '-', 'one', 'years', 'in', 'the', 'world', 'with', 'very', 'little', 'to', 'distress', 'or', 'vex', 'her', '.']
[2, 11, 12, 13, 12, 14, 12, 15, 16, 12, 17, 18, 19, 20, 15, 21, 22, 12, 23, 24, 25, 26, 27, 28, 29, 30, 27, 31, 32, 15, 33, 34, 35, 36, 37, 38, 39, 40, 28, 41, 17, 42, 43, 24, 44, 45, 46, 47, 48]
['She', 'was', 'the', 'youngest', 'of', 'the', 'two', 'daughters', 'of', 'a', 'most', 'affectionate', ',', 'indulgent', 'father', ';', 'and', 'had', ',', 'in', 'consequence', 'of', 'her', 'sister', "'", 's', 'marriage', ',', 'been', 'mistress', 'of', 'his', 'house', 'from', 'a', 'very'

In [7]:
print('#Words: ',len(word_dict))
print('#Sentences: ',len(int_sents))
print('Length: ', longest)

#Words:  2728
#Sentences:  1000
Length:  172
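
As a quick sanity check on the mapping, the sketch below rebuilds the same kind of dictionaries on a toy corpus and decodes the integers back into words. The toy sentences and the helpers `encode` and `decode` are hypothetical names introduced only for this illustration; they are not part of the notebook.

```python
# Toy corpus (an assumption for illustration, not the Emma data)
toy_sents = [['Emma', 'was', 'clever'], ['Emma', 'was', 'rich']]

word_dict = {}     # word -> integer
reverse_dict = {}  # integer -> word

def encode(sentence):
    """Map words to integers, assigning a new id to each unseen word."""
    ints = []
    for word in sentence:
        if word not in word_dict:
            new_id = len(word_dict) + 1  # ids start at 1; 0 is kept for padding
            word_dict[word] = new_id
            reverse_dict[new_id] = word
        ints.append(word_dict[word])
    return ints

def decode(ints):
    """Map integers back to words, skipping the padding id 0."""
    return [reverse_dict[i] for i in ints if i != 0]

encoded = [encode(s) for s in toy_sents]
print(encoded)                       # [[1, 2, 3], [1, 2, 4]]
print(decode([0, 0] + encoded[1]))   # ['Emma', 'was', 'rich']
```

Repeated words such as 'Emma' reuse their id, which is why the vocabulary is much smaller than the total number of tokens.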


Again, we pad the sequences that are shorter than the longest sentence by adding 0s at the beginning of each row of ``int_sents`` (``pad_sequences`` pads at the front by default), although padding is less essential in this case:

In [8]:
int_sents = sequence.pad_sequences(int_sents, maxlen=longest)

In [9]:
# an example
int_sents[2]

array([ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
       10,  9])
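
To make the padding behaviour concrete, here is a minimal NumPy sketch of front padding. It only approximates what ``pad_sequences`` does with its default settings (pad and truncate at the front); `pad_pre` is a hypothetical helper written for this illustration, not a Keras function.

```python
import numpy as np

def pad_pre(seqs, maxlen):
    """Left-pad each integer sequence with 0s up to maxlen.
    Sequences longer than maxlen keep only their last maxlen items."""
    out = np.zeros((len(seqs), maxlen), dtype=int)
    for row, seq in enumerate(seqs):
        seq = seq[-maxlen:]                      # truncate from the front if too long
        out[row, maxlen - len(seq):] = seq       # real ids go at the end of the row
    return out

padded = pad_pre([[10, 9], [1, 2, 3, 4], [1, 2, 3, 4, 5]], maxlen=4)
print(padded.tolist())  # [[0, 0, 10, 9], [1, 2, 3, 4], [2, 3, 4, 5]]
```

Because the real words end up at the end of each row, the last windows of a padded sentence are the ones that contain actual text.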

Each input vector contains a window of 30 words, and the corresponding output is the 31st word. The toy example below shows the same idea with a window of 3:

In [10]:
tt = [0,1,2,3,4,5,6]
print('input is', tt[1:1+3])
print('output is',tt[1+3])

input is [1, 2, 3]
output is 4


Let's create our data:

In [11]:
X = []
y = []
window = 30

# Go through the sentences and store the window (sequence of words) as X
# y contains the word at the end of the window
for sent in int_sents:
    for i in range(0, len(sent) - window, 1):
        window_x = sent[i:i + window]
        window_y = sent[i + window]
        X.append(window_x)
        y.append(window_y)  

In [12]:
# For example
sent = int_sents[10]
i = len(sent) - window - 1  # last start index produced by range(0, len(sent) - window)
print('input is', sent[i:i + window])
print('output is',sent[i + window])

input is [  0   0   0   0   0   0   0 145 166  12 167  12  50 168 169 170 171  12
 172 129 173 174   3 121 175 176  88 177  17  47]
output is 48
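
The same windowing can be tried on a short padded sequence to see how padding ends up inside the windows. This is a sketch under toy assumptions: `make_windows` is a hypothetical helper that mirrors the loop above, and the ids in `toy` are made up.

```python
import numpy as np

def make_windows(sent, window):
    """Return (inputs, targets): each input holds `window` consecutive ids,
    and its target is the id that follows the window."""
    X, y = [], []
    for i in range(len(sent) - window):
        X.append(sent[i:i + window])
        y.append(sent[i + window])
    return np.array(X), np.array(y)

# A padded toy sentence: three padding 0s followed by four word ids
toy = np.array([0, 0, 0, 5, 6, 7, 8])
X_toy, y_toy = make_windows(toy, window=3)
print(X_toy.tolist())  # [[0, 0, 0], [0, 0, 5], [0, 5, 6], [5, 6, 7]]
print(y_toy.tolist())  # [5, 6, 7, 8]
```

A sequence of length n yields n - window pairs, and with front padding the early windows consist partly or entirely of 0s.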


Now we store the data in the shape the LSTM expects (number of windows x window length x number of features; here there is 1 feature, the word's integer id):

In [13]:
X = np.reshape(X, (len(X), window, 1))

# For the y-value, we use one-hot encoding
y = to_categorical(y)    

print(np.shape(X))
print(np.shape(y))

(142000, 30, 1)
(142000, 2729)
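
Both shapes can be checked by hand. The sketch below redoes the arithmetic with the numbers printed earlier (1000 sentences, longest sentence 172, window 30, 2728 distinct words) and shows one-hot encoding with plain NumPy on three made-up labels; ``np.eye`` is used here only as an illustrative stand-in for ``to_categorical``.

```python
import numpy as np

n_sentences, longest, window = 1000, 172, 30
n_words = 2728

# Each padded sentence yields (longest - window) windows
n_samples = n_sentences * (longest - window)
print(n_samples)  # 142000

# One-hot width is the largest id + 1, because id 0 (padding) also gets a column
n_classes = n_words + 1
print(n_classes)  # 2729

# One-hot encoding of a few integer labels
labels = np.array([0, 2, 1])
one_hot = np.eye(3)[labels]
print(one_hot.tolist())  # [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
```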


In [14]:
# For example: the leading 0s in X are padding, and y is almost entirely 0s because it is one-hot over the whole vocabulary
ind = 12345
print(X[ind])
print(y[ind])

[[  0]
 [  0]
 [  0]
 [  0]
 [681]
 [578]
 [157]
 [332]
 [ 27]
 [136]
 [682]
 [ 27]
 [680]
 [490]
 [334]
 [102]
 [254]
 [683]
 [ 12]
 [  9]
 [ 77]
 [ 63]
 [ 40]
 [209]
 [684]
 [ 17]
 [501]
 [685]
 [ 32]
 [103]]
[0. 0. 0. ... 0. 0. 0.]


Creating training and test sets from the first 2,000 windows, holding out 33% for testing:

In [15]:
X = X[:2000]
y = y[:2000]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [16]:
np.shape(y_test)

(660, 2729)

Build the model:

Note:

- See the ``keras.layers.LSTM`` configuration options at https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM

- ``Dense`` implements the operation ``output = activation(dot(input, kernel) + bias)``
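
The ``Dense`` operation can be written out in a few lines of NumPy. This is only a sketch with toy sizes and random weights (all names and shapes here are assumptions for illustration); it uses softmax as the activation, as the output layer of the model does.

```python
import numpy as np

def softmax(z):
    """Turn each row of scores into probabilities that sum to 1."""
    z = z - z.max(axis=-1, keepdims=True)   # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dense(inputs, kernel, bias, activation):
    """output = activation(dot(input, kernel) + bias)"""
    return activation(np.dot(inputs, kernel) + bias)

rng = np.random.default_rng(0)
inputs = rng.normal(size=(2, 4))   # batch of 2 samples, 4 features each
kernel = rng.normal(size=(4, 3))   # 4 inputs -> 3 output classes
bias = np.zeros(3)

probs = dense(inputs, kernel, bias, softmax)
print(probs.shape)        # (2, 3)
print(probs.sum(axis=1))  # each row sums to 1, up to rounding
```

In the model, the 4 features correspond to the 128 LSTM outputs and the 3 classes to the 2729 word ids.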

In [17]:
no_dim = 128  # number of LSTM units; more units means longer training

model = Sequential()
# Input is a window of time steps with 1 feature each: the integer representing the word
model.add(LSTM(no_dim, input_shape=(window, 1)))
model.add(Dropout(0.2))
# The output layer predicts the word, one-hot encoded (i.e. one entry per word id, including the padding id 0)
model.add(Dense(y.shape[1], activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy', 'mae'])
# batch_size reuses `longest` (172) purely as a convenient number
model.fit(X_train, y_train, validation_split=0.2, batch_size=longest, epochs=20)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.src.callbacks.History at 0x229679d84d0>

In [18]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 lstm (LSTM)                 (None, 128)               66560     
                                                                 
 dropout (Dropout)           (None, 128)               0         
                                                                 
 dense (Dense)               (None, 2729)              352041    
                                                                 
Total params: 418601 (1.60 MB)
Trainable params: 418601 (1.60 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
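
The parameter counts in the summary can be reproduced by hand. The sketch below assumes a standard LSTM layer with biases (four weight sets: the input, forget and output gates plus the cell candidate); `lstm_params` and `dense_params` are hypothetical helper names.

```python
def lstm_params(input_dim, units):
    """Four weight sets, each with input weights, recurrent weights and a bias."""
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    """One weight per input-output pair, plus one bias per output."""
    return input_dim * units + units

lstm = lstm_params(input_dim=1, units=128)
dense = dense_params(input_dim=128, units=2729)
print(lstm)          # 66560
print(dense)         # 352041
print(lstm + dense)  # 418601
```

Most of the parameters sit in the output layer, because it has one column per word id.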


Evaluation:

In [19]:
evaluation = model.evaluate(X_test, y_test,return_dict = True)
print(evaluation)

{'loss': 1.4337700605392456, 'accuracy': 0.8181818127632141, 'mae': 0.00015202061331365258}
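
When reading this accuracy, it helps to keep in mind that many targets are simply the padding id 0: every sentence is front-padded to 172 ids, so most of its windows are followed by another 0. The sketch below estimates that share for the first four sentences printed earlier (7, 2, 2 and 49 tokens); `padding_target_fraction` is a hypothetical helper, and the result is an illustration, not a measurement on the actual test set.

```python
def padding_target_fraction(sentence_lengths, longest, window):
    """Fraction of windows whose target is the padding id 0,
    for sentences that are front-padded to `longest`."""
    total = 0
    padding_targets = 0
    for n in sentence_lengths:
        windows = longest - window
        # Targets sit at positions window .. longest-1; real words fill the last n positions
        real_targets = min(n, windows)
        total += windows
        padding_targets += windows - real_targets
    return padding_targets / total

frac = padding_target_fraction([7, 2, 2, 49], longest=172, window=30)
print(round(frac, 3))  # 0.894
```

A model that always predicted 0 would therefore already score well on such data, so the accuracy on real words is likely lower than the headline figure suggests. The very small ``mae`` is also expected, since one-hot targets are almost entirely 0s.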


Predicting the words:

In [20]:
n_shown = 0
for x_i, y_i in zip(X_test, y_test):

    # The model expects a batch dimension: (1, window, 1)
    x = np.reshape(x_i, (1, len(x_i), 1))

    # Rebuild the input text, skipping the padding id 0
    sentence = ''
    for word in x[0]:
        if word[0] > 0:
            sentence += reverse_dict[word[0]] + " "
    if sentence != '':
        print('\nTo predict: ', sentence)

    prediction = model.predict(x)

    # The model returns a probability for every word id; we take the most probable one (argmax)
    i_x = np.argmax(prediction)
    i_y = np.argmax(y_i)
    # Only print when the predicted id is a real word (0 is padding)
    if i_x > 0:
        print('Prediction: ', reverse_dict[i_x])
        # The target may be the padding id 0, which has no entry in reverse_dict
        print('Actual word: ', reverse_dict.get(i_y, '<padding>'))
    n_shown += 1
    if n_shown > 100:
        break


To predict:  had been living together as friend and friend very mutually attached , and Emma doing just what she liked ; highly esteeming Miss Taylor ' s judgment , but directed 
Prediction:  of
Actual word:  chiefly

To predict:  She was the 

To predict:  It was on the wedding - day of this beloved friend that Emma first sat 
Prediction:  of
Actual word:  in

To predict:  Sixteen years had 

To predict:  Her mother had died too long ago for her to have more than an indistinct remembrance 
Prediction:  of
Actual word:  of

To predict:  of governess , the mildness of her temper had hardly allowed her to impose any restraint ; and the shadow of authority being now long passed away , they had 
Prediction:  of
Actual word:  been

To predict:  Sorrow came -- a gentle sorrow -- but not at all in the shape of 
Prediction:  of
Actual word:  any

To predict:  She was the youngest of the two 
Prediction:  ,
Actual word:  daughters

To predict:  It was on the wedding - day of this beloved frien