Importing the libraries 

In [1]:
import tensorflow as tf 
import string 
import requests 

Get the data

In [2]:
file = open("corpus.txt", "r", encoding = "utf8")


In [3]:
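# Read every line of the corpus, then join them into one string.
lines = []
for line in file:
    lines.append(line)
file.close()

# A single join is enough; repeating it inside a loop only redoes the same work.
data = ' '.join(lines)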
    

In [4]:
data[:1500]

"There were little things that she simply could not stand. The sound of someone tapping their nails on the table. A person chewing with their mouth open. Another human imposing themselves into her space. She couldn't stand any of these things but none of them compared to the number one thing she couldn't stand which topped all of them combined.\n It went through such rapid contortions that the little bear was forced to change his hold on it so many times he became confused in the darkness and could not for the life of him tell whether he held the sheep right side up or upside down. But that point was decided for him a moment later by the animal itself who with a sudden twist jabbed its horns so hard into his lowest ribs that he gave a grunt of anger and disgust.\n The cab arrived late. The inside was in as bad of shape as the outside which was concerning and it didn't appear that it had been cleaned in months. The green tree air-freshener hanging from the rearview mirror was either exh

Split the data set into lines 

In [5]:
data = data.split('\n') 
data[0] 

"There were little things that she simply could not stand. The sound of someone tapping their nails on the table. A person chewing with their mouth open. Another human imposing themselves into her space. She couldn't stand any of these things but none of them compared to the number one thing she couldn't stand which topped all of them combined."

In [6]:
data = data[253:] 
data[0] 

' According to the caption on the bronze marker placed by the Multnomah Chapter of the Daughters of the American Revolution on May 12 1939 “College Hall (is) the oldest building in continuous use for Educational purposes west of the Rocky Mountains. Here were educated men and women who have won recognition throughout the world in all the learned professions.”'

In [7]:
len(data)

748

At this point data is a list of lines. We now join the lines back together into one long continuous string.

In [8]:
data = " ".join(data) 
data[:1000] 

' According to the caption on the bronze marker placed by the Multnomah Chapter of the Daughters of the American Revolution on May 12 1939 “College Hall (is) the oldest building in continuous use for Educational purposes west of the Rocky Mountains. Here were educated men and women who have won recognition throughout the world in all the learned professions.”  She looked at her little girl who was about to become a teen. She tried to think back to when the girl had been younger but failed to pinpoint the exact moment when she had become a little too big to pick up and carry. It hit her all at once. She was no longer a little girl and she stood there speechless with fear sadness and pride all running through her at the same time.  Turning away from the ledge he started slowly down the mountain deciding that he would that very night satisfy his curiosity about the man-house. In the meantime he would go down into the canyon and get a cool drink after which he would visit some berry patche

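Passing data to clean_text() gives the text in the required format: lowercase word tokens with ASCII punctuation removed. Any token that still contains a non-alphabetic character after stripping is dropped entirely, which is why '12', '1939' and '“College' (with its curly quote) are missing from the output below.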

In [9]:
def clean_text(doc):
    tokens = doc.split()
    # Strip ASCII punctuation from every token.
    table = str.maketrans('', '', string.punctuation)
    tokens = [w.translate(table) for w in tokens]
    # Keep purely alphabetic tokens and lowercase them.
    tokens = [word for word in tokens if word.isalpha()]
    tokens = [word.lower() for word in tokens]
    return tokens

tokens = clean_text(data)
print(tokens[:50])

['according', 'to', 'the', 'caption', 'on', 'the', 'bronze', 'marker', 'placed', 'by', 'the', 'multnomah', 'chapter', 'of', 'the', 'daughters', 'of', 'the', 'american', 'revolution', 'on', 'may', 'hall', 'is', 'the', 'oldest', 'building', 'in', 'continuous', 'use', 'for', 'educational', 'purposes', 'west', 'of', 'the', 'rocky', 'mountains', 'here', 'were', 'educated', 'men', 'and', 'women', 'who', 'have', 'won', 'recognition', 'throughout', 'the']
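To make the filtering rule concrete, here is a small self-contained sketch that repeats the same three steps on a short sample. The function name clean_text_demo and the sample string are made up for illustration; the sample borrows a few words from the corpus output above.

```python
import string

def clean_text_demo(doc):
    # Same steps as clean_text(), repeated here so the sketch runs on its own.
    table = str.maketrans('', '', string.punctuation)
    tokens = [w.translate(table) for w in doc.split()]
    # Tokens that are not purely alphabetic after stripping are dropped.
    return [w.lower() for w in tokens if w.isalpha()]

sample = "on May 12 1939 “College Hall (is) the oldest man-house she couldn't"
cleaned = clean_text_demo(sample)
print(cleaned)
```

Numbers disappear, the curly quote (which is not in string.punctuation) takes 'College' with it, and hyphens and apostrophes are removed so 'man-house' and "couldn't" become 'manhouse' and 'couldnt'.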


In [10]:
len(tokens)

45335

We are going to use a set of previous words to predict the next word in the sentence. To be precise, we use 50 words to predict the 51st. Hence we slide a window of 51 words over the tokens, moving one word at a time, and later separate the last word of every window as the target. The loop stops once the index passes 200000, but this corpus has only 45335 tokens, so that limit is never reached here.

In [11]:
length = 50 + 1 
lines = [] 
for i in range(length, len(tokens)): 
 seq = tokens[i-length:i] 
 line = ' '.join(seq) 
 lines.append(line) 
 if i > 200000: 
   break 
print(len(lines)) 

45284
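The windowing can be checked on a toy list. The sketch below is an illustration only: make_windows and its include_last flag are hypothetical names, not part of the notebook. It shows that a loop over range(length, len(tokens)) yields len(tokens) - length windows and stops one short of the final possible window.

```python
def make_windows(tokens, length, include_last=False):
    # range(length, len(tokens)) mirrors the notebook loop;
    # adding 1 to the stop value would also emit the final window.
    stop = len(tokens) + 1 if include_last else len(tokens)
    return [' '.join(tokens[i - length:i]) for i in range(length, stop)]

toy_tokens = ['a', 'b', 'c', 'd', 'e', 'f']
as_in_notebook = make_windows(toy_tokens, 4)
with_last = make_windows(toy_tokens, 4, include_last=True)
print(as_in_notebook)
print(with_last)

# Same arithmetic at notebook scale: 45335 tokens, windows of 51.
n_sequences = len(range(51, 45335))
print(n_sequences)
```

The last line reproduces the 45284 printed above (45335 - 51).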


# Build LSTM Model and Prepare X and y

Import the libraries used to pre-process the data and build the layers of the neural network.

In [12]:
import numpy as np 
from tensorflow.keras.preprocessing.text import Tokenizer 
from tensorflow.keras.utils import to_categorical 
from tensorflow.keras.models import Sequential 
from tensorflow.keras.layers import Dense, LSTM, Embedding 
from tensorflow.keras.preprocessing.sequence import pad_sequences 

We are going to create a unique integer token for each unique word in the dataset. fit_on_texts() updates the internal vocabulary based on a list of texts. texts_to_sequences() transforms each text in texts to a sequence of integers.

In [13]:
tokenizer = Tokenizer() 
tokenizer.fit_on_texts(lines) 
sequences = tokenizer.texts_to_sequences(lines) 
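As a rough picture of what these two calls do, the sketch below builds a frequency-ranked word index in plain Python. It is a simplified stand-in, not the Keras implementation: build_word_index and to_sequences are hypothetical helpers, and they ignore Tokenizer options such as filters, lowercasing and out-of-vocabulary handling. Indices start at 1, with the most frequent word first.

```python
from collections import Counter

def build_word_index(texts):
    # Count every whitespace-separated word across all texts.
    counts = Counter(word for text in texts for word in text.split())
    # Rank by frequency; index 0 is left unused, as in the Keras Tokenizer.
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index):
    # Replace each known word with its integer index.
    return [[word_index[w] for w in text.split() if w in word_index]
            for text in texts]

toy_texts = ['the cat sat', 'the dog sat on the mat']
toy_index = build_word_index(toy_texts)
toy_sequences = to_sequences(toy_texts, toy_index)
print(toy_index)
print(toy_sequences)
```

This matches the pattern visible in the notebook, where very common words such as 'the' receive the smallest indices.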

sequences contains a list of integer sequences created by the tokenizer. Each line in sequences has 51 integers, one per word. Now we will split each line such that the first 50 words are in X and the last word is in y.


In [14]:
sequences = np.array(sequences) 
X, y = sequences[:, :-1], sequences[:,-1] 
X[0] 

array([1757,    2,    1, 1756,   30,    1, 1755, 1754, 1753,   67,    1,
       1752, 1751,    9,    1, 1750,    9,    1, 1749, 1748,   30,  214,
       1747,   37,    1, 1746,  480,   12, 1745,  384,   20, 1744,  445,
       1743,    9,    1,  866, 1742,  210,   33, 1741, 1740,    4, 1739,
         94,   40, 1738, 1737,  588,    1])

vocab_size is the number of unique words in the dataset plus one. tokenizer.word_index maps each unique word to its integer index, starting at 1; index 0 is reserved for padding. Hence len(tokenizer.word_index) + 1 gives the vocab_size.

In [15]:
vocab_size = len(tokenizer.word_index) + 1

to_categorical() converts a class vector (integers) to a binary class matrix, i.e. one one-hot row per target word. num_classes is the total number of classes, which is vocab_size.

In [16]:
y = to_categorical(y, num_classes=vocab_size) 
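For intuition, one-hot encoding can be written in a few lines of NumPy. This is a minimal sketch of the idea rather than the to_categorical() source; the helper name one_hot and the toy labels are assumptions for illustration.

```python
import numpy as np

def one_hot(labels, num_classes):
    labels = np.asarray(labels)
    # One row per label, one column per class, all zeros to start.
    out = np.zeros((labels.shape[0], num_classes), dtype='float32')
    # Set a single 1.0 in each row at the column given by the label.
    out[np.arange(labels.shape[0]), labels] = 1.0
    return out

toy_y = one_hot([1, 3, 0], num_classes=4)
print(toy_y)
```

In the notebook each row of y therefore has vocab_size columns, with a single 1 at the index of the target word.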

In [17]:
seq_length = X.shape[1] 
seq_length 


50

# LSTM Model 

A Sequential model is appropriate for a plain stack of layers where each layer has exactly one input tensor and one output tensor. 

In [18]:
model = Sequential() 
model.add(Embedding(vocab_size, 50, input_length=seq_length)) 
model.add(LSTM(100, return_sequences=True)) 
model.add(LSTM(100)) 
model.add(Dense(100, activation='relu')) 
model.add(Dense(vocab_size, activation='softmax')) 

In [19]:
model.summary() 

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 50, 50)            91500     
_________________________________________________________________
lstm (LSTM)                  (None, 50, 100)           60400     
_________________________________________________________________
lstm_1 (LSTM)                (None, 100)               80400     
_________________________________________________________________
dense (Dense)                (None, 100)               10100     
_________________________________________________________________
dense_1 (Dense)              (None, 1830)              184830    
Total params: 427,230
Trainable params: 427,230
Non-trainable params: 0
_________________________________________________________________
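The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard formulas for embedding, LSTM (four gates, each with an input kernel, a recurrent kernel and a bias) and dense layers; the helper names are made up for this check, and the sizes are read off the summary above.

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary entry.
    return vocab_size * dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias.
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit.
    return input_dim * units + units

counts = [
    embedding_params(1830, 50),   # embedding
    lstm_params(50, 100),         # lstm
    lstm_params(100, 100),        # lstm_1
    dense_params(100, 100),       # dense
    dense_params(100, 1830),      # dense_1
]
print(counts, sum(counts))
```

The output layer alone accounts for 184830 of the 427230 parameters, because its size scales with vocab_size.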


In [20]:
model.compile(loss = 'categorical_crossentropy', optimizer = 'adam', metrics = ['accuracy'])

After compiling the model we train it using model.fit() on the training dataset. We will use 100 epochs to train the model. An epoch is one pass over the entire X and y data provided. batch_size is the number of samples per gradient update, i.e. the weights are updated after every 256 training examples.
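The batch arithmetic in the paragraph above can be worked out directly. This is a back-of-the-envelope sketch using the 45284 sequences counted earlier; the variable names are illustrative only.

```python
import math

n_samples = 45284   # number of training sequences built earlier
batch_size = 256
epochs = 100

# The final batch of each epoch is smaller when the sizes do not divide evenly.
steps_per_epoch = math.ceil(n_samples / batch_size)
last_batch = n_samples - (steps_per_epoch - 1) * batch_size
total_updates = steps_per_epoch * epochs
print(steps_per_epoch, last_batch, total_updates)
```

So each epoch performs 177 weight updates, the last of them on a partial batch of 228 samples.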

In [21]:
model.fit(X, y, batch_size = 256, epochs = 100) 

Epoch 1/100
...
Epoch 78

<tensorflow.python.keras.callbacks.History at 0x7fda27d62710>

We are now going to generate words using the model. For this we need a set of 50 words to predict the 51st word. So we are taking a random line.

In [22]:
seed_text=lines[12343] 
seed_text 

'her husband lamenting at fate which had directed her footsteps to the path which they had taken she was just having a good cry all to herself the mosquitoes made merry over her biting her firm round arms and nipping at her bare insteps it was easy to spot her all'

generate_text_seq() generates n_words words after the given seed_text. The seed_text is pre-processed before each prediction: it is encoded with the same tokenizer used for the training data, then pad_sequences() pads or truncates it to 50 tokens, keeping the most recent words. model.predict() returns a probability for every word in the vocabulary and np.argmax() picks the most likely index. (Older TensorFlow releases offered model.predict_classes() for this step, but it has since been removed.) The index is then looked up in tokenizer.index_word to get the word. Finally the predicted word is appended to seed_text and text and the process repeats.

In [23]:
def generate_text_seq(model, tokenizer, text_seq_length, seed_text, n_words):
    text = []
    for _ in range(n_words):
        # Encode the running text and keep only the last text_seq_length tokens.
        encoded = tokenizer.texts_to_sequences([seed_text])[0]
        encoded = pad_sequences([encoded], maxlen=text_seq_length, truncating='pre')
        # Pick the most probable next-word index (replaces predict_classes()).
        y_predict = int(np.argmax(model.predict(encoded, verbose=0), axis=-1)[0])
        # Map the index back to its word; index 0 (padding) has no word.
        predicted_word = tokenizer.index_word.get(y_predict, '')
        seed_text = seed_text + ' ' + predicted_word
        text.append(predicted_word)
    return ' '.join(text)
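The windowing inside the generation loop can be illustrated without a trained model. In the sketch below, pad_pre imitates what pad_sequences() does with its default pre-padding and truncating='pre', and toy_predict is a hypothetical stand-in for the network that simply returns the last index plus one. Both names are assumptions made for this example.

```python
def pad_pre(seq, maxlen, value=0):
    # Keep the most recent maxlen items, then left-pad with `value`.
    seq = list(seq)[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

def generate_ids(predict_next, seed_ids, maxlen, n_words):
    ids = list(seed_ids)
    generated = []
    for _ in range(n_words):
        window = pad_pre(ids, maxlen)      # fixed-length model input
        next_id = predict_next(window)     # greedy choice of the next token
        ids.append(next_id)                # the prediction joins the context
        generated.append(next_id)
    return generated

def toy_predict(window):
    # Hypothetical predictor: not a real model, just last index + 1.
    return window[-1] + 1

short = pad_pre([7, 8], 4)
long_ = pad_pre([1, 2, 3, 4, 5], 3)
generated = generate_ids(toy_predict, [5, 6], maxlen=4, n_words=3)
print(short, long_, generated)
```

Each predicted token is fed back into the context, so once the text exceeds the window length the oldest words fall out of view.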

We can see that the next 100 words are predicted by the model for the seed_text.

In [24]:
generate_text_seq(model, tokenizer, seq_length, seed_text, 100) 



'you needed to do was look at her socks they were never a matching pair one would be green while the other would be blue one would reach her knee while the other barely touched her ankle every other part of her was perfect but never the socks they were her micro act of rebellion she sat across from her trying to imagine it was the first time it wasnt had it been a expert or much away with him the lone lamp post of the onestreet town flickered not quite dead but definitely on its way out suitcase by'

We have got an accuracy of 96%. This figure is measured on the training data, since model.fit() was called without a validation set. To increase the accuracy we can increase the number of epochs or use the entire dataset for training. For this model we have only considered 1/4th of the data for training.