# Primitive Text Prediction using TensorFlow and Keras

**Dataset Used:** https://www.kaggle.com/c/spooky-author-identification

## Execution

In [6]:
# Standard Data Science Libraries
import pickle
import math
import pandas as pd
import numpy as np
from numpy import array

# Neural Net Preprocessing
from sklearn.feature_extraction.text import CountVectorizer
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.preprocessing.sequence import pad_sequences
# Neural Net Layers
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Embedding

# Neural Net Training
from tensorflow.keras.models import load_model
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.callbacks import EarlyStopping


We are going to train a model to emulate the writing style of the text it is trained on.

The dataset contains works by several authors. We extract the sentences written by one author and train the model on them, so that it learns to predict the next word in that author's style to some extent.

In this case, we train the model on the works of Edgar Allan Poe.

In [7]:
# Import the data
train_df = pd.read_csv('C:\\Users\\HP\\Desktop\\Datasets\\Sentence Prediction\\train.csv')
# Select Edgar Allan Poe (label 'EAP') as the author style to emulate
author = train_df[train_df['author'] == 'EAP']["text"]
print('Number of training sentences: ',author.shape[0])
#print('Number of training sentences: ',train_df.shape[0])

Number of training sentences:  7900


Next we tokenize the training set. The Tokenizer removes punctuation, converts the text to lowercase, splits it into words, assigns a unique integer to each word and replaces every occurrence of that word with its integer.

In [8]:
max_words = 50000 # Max size of the dictionary
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(author.values)
sequences = tokenizer.texts_to_sequences(author.values)
print(sequences[:5])

[[19, 2397, 80, 1001, 29, 31, 177, 2, 4073, 1, 1960, 2, 11, 3024, 15, 7, 110, 157, 41, 2146, 3, 481, 4, 1, 149, 2147, 7, 393, 74, 114, 101, 439, 2, 1, 162, 32, 913, 6453, 136, 1, 380], [6, 21, 142, 150, 10, 5, 551, 2148, 319, 28, 16, 15, 20, 8999, 128, 1, 3025, 2398, 30, 171, 2, 1797, 697, 20, 180, 2148, 6454, 12, 33, 188, 2, 1, 869, 243, 522, 1264], [1, 6455, 203, 14, 19, 149, 180, 6456, 6, 1, 1357, 2, 1358, 9000, 3, 83, 2149, 10, 355, 140, 794], [1, 4074, 491, 6, 9001, 28, 11, 158], [7, 287, 9, 36, 48, 22, 73, 4, 644, 9002, 114, 101, 346, 4, 271, 2, 9003, 3, 81, 2, 1, 3026, 2, 6457, 3, 282, 53, 34, 6458, 19, 339, 22, 43, 97, 608, 7, 450, 4, 36, 133, 1191, 88, 12, 133, 71, 914, 1, 759, 3027, 2, 9, 1445, 1359, 18, 760, 12, 4973, 6, 1, 421, 9004, 9005, 7, 214, 9, 36, 48, 22, 3449, 3028, 98, 124, 1192, 4, 1, 92, 9006, 6, 3450, 3, 7, 761, 870, 9, 36, 55, 111, 32]]
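
To make the word-to-integer mapping concrete, here is a minimal plain-Python sketch of the same idea. It is not the Keras implementation: the helper names (simple_split, fit_word_index, texts_to_ids) and the toy sentences are made up for illustration, and the punctuation filter only approximates the Keras default.

```python
import re
from collections import Counter

def simple_split(text):
    # Lowercase, turn anything that is not a letter, digit or apostrophe
    # into a space, then split on whitespace (an approximation only)
    return re.sub(r"[^a-z0-9']+", " ", text.lower()).split()

def fit_word_index(texts):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter()
    for text in texts:
        counts.update(simple_split(text))
    # most_common() sorts by descending count; ties keep first-seen order
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_ids(texts, word_index):
    # Words that are missing from the index are dropped
    return [[word_index[w] for w in simple_split(t) if w in word_index] for t in texts]

toy_texts = ["The raven sat.", "The raven spoke, and the night fell."]
toy_index = fit_word_index(toy_texts)
toy_ids = texts_to_ids(toy_texts, toy_index)
print(toy_index)
print(toy_ids)
```

Index 0 is never assigned to a word, which leaves it free to act as the padding value later on.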


Next we flatten the list of token sequences into a single list, so that a fixed-size sliding window can be moved across the whole corpus for word prediction.

In [9]:
# Flatten the list of lists resulting from the tokenization. This will reduce the list
# to one dimension, allowing us to apply the sliding window technique to predict the next word
text = [item for sublist in sequences for item in sublist]
vocab_size = len(tokenizer.word_index)

In [10]:
print('Vocabulary size in this corpus: ', vocab_size)

Vocabulary size in this corpus:  15713


We use a sliding window to predict the 20th word of each window from the preceding 19 words. This is a fairly rudimentary strategy and leaves room for improvement.

In [11]:
sentence_len = 20
pred_len = 1
train_len = sentence_len - pred_len
seq = []
for i in range(len(text)-sentence_len):
    seq.append(text[i:i+sentence_len])
# Reverse dictionary to decode tokenized sequences back to words
reverse_word_map = dict(map(reversed, tokenizer.word_index.items()))

Here trainX holds the input sequences (the first 19 words of each window) and trainy holds the targets (the 20th word of each window).

In [12]:
# Each row in seq is a 20-word window. We use the first 19 words as the input to predict the 20th word
trainX = []
trainy = []
for i in seq:
    trainX.append(i[:train_len])
    trainy.append(i[-1])
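
The windowing and the input/target split are easier to follow on a small example. The sketch below uses a hypothetical 8-token corpus and a window of 4 instead of 20; the names toy_text, toy_X and toy_y are illustrative only.

```python
# Hypothetical toy corpus of 8 token ids, with a window of 4 instead of 20
toy_text = [11, 12, 13, 14, 15, 16, 17, 18]
window_len = 4
input_len = window_len - 1

# Same loop bound as the cell above: range(len(text) - sentence_len)
windows = [toy_text[i:i + window_len] for i in range(len(toy_text) - window_len)]

toy_X = [w[:input_len] for w in windows]  # first 3 tokens of each window
toy_y = [w[-1] for w in windows]          # last token of each window

for x, y in zip(toy_X, toy_y):
    print(x, "->", y)
```

With this loop bound the final window ([15, 16, 17, 18] here) is left out. Using len(text) - sentence_len + 1 as the bound would include it, although a single window makes little difference on a corpus of this size.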

*Next comes the creation of the model. It will contain the following layers:*

*An Embedding layer, two stacked LSTM layers, a Dense layer with ReLU activation and a final Dense layer with softmax activation.*

In [31]:
Model = Sequential([
    Embedding(vocab_size+1, 50, input_length=train_len),
    LSTM(100),
    LSTM(100),
    Dense(100, activation='relu'),
    Dense(vocab_size, activation='softmax')
])

ValueError: Input 0 of layer lstm_3 is incompatible with the layer: expected ndim=3, found ndim=2. Full shape received: [None, 100]

On reading further about how LSTM layers work, we found that we had not stacked them correctly. We wanted stacked LSTM layers because stacking adds depth, as opposed to adding more cells to a single LSTM layer.

For a stacked setup, the first LSTM layer must be created with return_sequences=True. It then passes its output for every time step (a 3D tensor) to the second LSTM layer, instead of only its final output, which is the 2D tensor reported in the error above.
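
The shape difference can be illustrated without TensorFlow. The following is a toy recurrent layer, not an LSTM, and toy_recurrent is a made-up helper. It works on a single sample, so the batch dimension is left out: what Keras reports as 3D and 2D shows up here as nested and flat lists.

```python
import math

def toy_recurrent(sequence, return_sequences=False):
    """Toy recurrent layer (not an LSTM): the state is updated once per time step.

    sequence is a list of time steps, each a list of feature values.
    """
    state = 0.0
    states = []
    for features in sequence:
        state = math.tanh(state + sum(features))
        states.append([state])  # one output vector per time step
    # Either every per-step output, or only the output of the last step
    return states if return_sequences else states[-1]

sample = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # 3 time steps, 2 features

all_steps = toy_recurrent(sample, return_sequences=True)
last_only = toy_recurrent(sample)

print(len(all_steps), len(all_steps[0]))  # time steps, values per step
print(len(last_only))                     # one vector, time axis gone

# A second recurrent layer can consume all_steps because it still has
# one input vector per time step; last_only no longer has a time axis
stacked = toy_recurrent(all_steps)
print(stacked)
```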

In [32]:
# define model
model = Sequential([
    Embedding(vocab_size+1, 50, input_length=train_len),
    LSTM(100, return_sequences=True),
    LSTM(100),
    Dense(100, activation='relu'),
    Dense(vocab_size, activation='softmax')
])

In [33]:
model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 19, 50)            785700    
_________________________________________________________________
lstm_4 (LSTM)                (None, 19, 100)           60400     
_________________________________________________________________
lstm_5 (LSTM)                (None, 100)               80400     
_________________________________________________________________
dense_4 (Dense)              (None, 100)               10100     
_________________________________________________________________
dense_5 (Dense)              (None, 15713)             1587013   
Total params: 2,523,613
Trainable params: 2,523,613
Non-trainable params: 0
_________________________________________________________________
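
The parameter counts in the summary can be reproduced by hand, which is a useful check that the layers are wired as intended. The sketch below assumes the standard LSTM layout of four gates, each with input weights, recurrent weights and a bias; the helper names are illustrative.

```python
vocab_size = 15713  # vocabulary size printed earlier
embed_dim = 50
units = 100

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias vector
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus bias vector
    return input_dim * units + units

counts = {
    "embedding": (vocab_size + 1) * embed_dim,
    "lstm_1": lstm_params(embed_dim, units),
    "lstm_2": lstm_params(units, units),
    "dense_relu": dense_params(units, units),
    "dense_softmax": dense_params(units, vocab_size),
}
for name, n in counts.items():
    print(f"{name:14s}{n:>10,}")
print(f"{'total':14s}{sum(counts.values()):>10,}")
```

The output layer holds more than half of all parameters, because it needs one row of weights per word in the vocabulary.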


The next step, training the model we have defined, is one of the most crucial and the most time-consuming.

We configure the model with a loss and metrics using the compile() method of the Model class, and train it using the fit() method. We pass trainX and trainy along with the batch size and the number of epochs.

batch_size is the number of samples fed to the network before each weight update, while an epoch is one complete pass over the training data.
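
As a rough illustration of how batch size and epochs combine, the sketch below counts weight updates. The sample count is a hypothetical placeholder, not the real number of training windows, and updates_per_epoch is an illustrative helper.

```python
import math

def updates_per_epoch(num_samples, batch_size):
    # The last batch may be smaller than the others, so round up
    return math.ceil(num_samples / batch_size)

num_samples = 200_000  # hypothetical count, not the real number of windows
batch_size = 128
epochs = 90

per_epoch = updates_per_epoch(num_samples, batch_size)
print(per_epoch)           # weight updates in one epoch
print(per_epoch * epochs)  # weight updates over the whole run
```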

In [None]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
model.fit(np.asarray(trainX), pd.get_dummies(np.asarray(trainy)), batch_size=128, epochs=90)
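
The targets are one-hot encoded before training, so it helps to see how a word index maps to an output column. The sketch below is a plain-Python stand-in that assumes columns are created in ascending order of the distinct target values; one_hot and the toy targets are illustrative only.

```python
def one_hot(targets):
    """One column per distinct target value, columns in ascending order of value."""
    classes = sorted(set(targets))
    column_of = {c: j for j, c in enumerate(classes)}
    rows = [[1 if column_of[t] == j else 0 for j in range(len(classes))]
            for t in targets]
    return classes, rows

toy_targets = [3, 1, 2, 3]  # hypothetical word indices, starting at 1
classes, rows = one_hot(toy_targets)
print(classes)
print(rows)

# Decoding: argmax column j corresponds to word index classes[j]. That equals
# j + 1 only when every index from 1 up to the vocabulary size occurs as a target
predicted_column = rows[0].index(max(rows[0]))
print(classes[predicted_column])
```

Because word indices start at 1 while columns start at 0, a predicted column has to be shifted by one to recover the word index, provided no index is missing from the targets.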

We save the tokenizer to a pickle file and the trained model to an HDF5 file. Both can then be loaded later without re-running the code that built them. This is especially useful for the model, which takes hours to train.

In [37]:
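with open('tokenizer.pkl', 'wb') as f:  # context manager closes the file
    pickle.dump(tokenizer, f)
# Despite the file name, this saves the whole model, not only the weights
model.save('model_weights.hdf5')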
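
Loading the tokenizer back follows the same pattern in reverse. The sketch below shows the round trip with a small dictionary standing in for the fitted tokenizer; the file name tokenizer_demo.pkl and the temporary directory are made up for the example.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the fitted tokenizer: any picklable object
# is saved and restored the same way
word_index = {"the": 1, "raven": 2}

path = os.path.join(tempfile.mkdtemp(), "tokenizer_demo.pkl")

with open(path, "wb") as f:
    pickle.dump(word_index, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == word_index)
```

Pickle files can execute code when loaded, so only unpickle files that come from a trusted source.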

With training complete, we now have a model that can generate text. However, it needs a starting point. We write a function that takes an input string, tokenizes it, and pads it with zeroes (or truncates it to its last 19 words) so that it fits the 19-word input window. The function then repeatedly predicts the next word and appends it to the sequence.

In [2]:
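def gen(model,seq,max_len = 20):
    # Tokenize the input string; the result is a list holding one list of word indices
    tokenized_sent = tokenizer.texts_to_sequences([seq])
    # max_len is the number of words to generate, so stop at input length + max_len
    max_len = max_len+len(tokenized_sent[0])
    while len(tokenized_sent[0]) < max_len:
        # Keep the last 19 tokens and left-pad with zeroes to match the input window
        padded_sentence = pad_sequences([tokenized_sent[0][-19:]],maxlen=19)
        op = model.predict(np.asarray(padded_sentence).reshape(1,-1))
        # The one-hot target columns start at word index 1, so column j maps to index j+1
        tokenized_sent[0].append(op.argmax()+1)
        
    return " ".join(map(lambda x : reverse_word_map[x],tokenized_sent[0]))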

In [3]:
def test_model(test_string,sequence_length= 50):
    print('Input String: ', test_string)
    print("Model :")
    print(gen(model,test_string,sequence_length))
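
The generation loop can be traced without a trained model. In the sketch below, fake_predict is a hypothetical stand-in for model.predict that always favours the next word index, and pad_left imitates the default left-padding and left-truncation of pad_sequences. The tiny vocabulary of 5 words is also an assumption made for the example.

```python
def pad_left(tokens, length):
    # Keep the last `length` tokens and left-pad with zeroes
    tokens = tokens[-length:]
    return [0] * (length - len(tokens)) + tokens

def fake_predict(window, vocab_size):
    # Hypothetical stand-in for model.predict: puts all probability on the
    # word after the last token, wrapping around at the vocabulary size
    probs = [0.0] * vocab_size
    probs[window[-1] % vocab_size] = 1.0  # column j stands for word index j + 1
    return probs

def generate(seed_tokens, num_new, window=19, vocab_size=5):
    tokens = list(seed_tokens)
    for _ in range(num_new):
        probs = fake_predict(pad_left(tokens, window), vocab_size)
        best_column = probs.index(max(probs))  # greedy choice (argmax)
        tokens.append(best_column + 1)         # shift back to a word index
    return tokens

print(pad_left([3, 4], 5))
print(generate([3, 4], num_new=4))
```

Always taking the argmax makes the output deterministic for a given input, which is why the same prompt produces the same continuation each time it is run against the same model.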

## Result

**Outputs for the model trained for 100 epochs, which reached an accuracy of around 0.43 (43%).**

In [28]:
test_model(author.iloc[58],50)

Input String:  By these means for they were ignorant men I found little difficulty in gaining them over to my purpose.
Model :
by these means for they were ignorant men i found little difficulty in gaining them over to my purpose and i am aware that i was not surprised at once as well as 'glory of the greek and i intrench our multitude of opinions for a livelihood was almost indispensable that the habitual horrendum informe ingens cui lumen et sepultus resurrexit certum est quia impossibile est occupied the old


In [29]:
test_model('My name is',10)

Input String:  My name is
Model :
my name is a compound of stern unutterable width which have been made


In [30]:
test_model('There is a',20)

Input String:  There is a
Model :
there is a fool of business not a little concave it crowned the personal figure of the unhappy extraordinary house overthrown clambered forth


In [34]:
test_model('What are you', 10)

Input String:  What are you
Model :
what are you heroded conversing conversing political intangible intangible swarming swarming tremendous somnolency


**Outputs for the model trained for 90 epochs, which reached an accuracy of around 0.35 (35%).**

In [40]:
test_model('My name is', 10)

Input String:  My name is
Model :
my name is not assassinated that by the wreckers and in the meantime


In [41]:
test_model('There is a', 20)

Input String:  There is a
Model :
there is a very capital degree of constructing balloons from the earth the latter of the automaton by the mesmeric phenomena de l'omelette


In [22]:
model = load_model('model_weights.hdf5')

In [23]:
model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 19, 50)            785700    
_________________________________________________________________
lstm_4 (LSTM)                (None, 19, 100)           60400     
_________________________________________________________________
lstm_5 (LSTM)                (None, 100)               80400     
_________________________________________________________________
dense_4 (Dense)              (None, 100)               10100     
_________________________________________________________________
dense_5 (Dense)              (None, 15713)             1587013   
Total params: 2,523,613
Trainable params: 2,523,613
Non-trainable params: 0
_________________________________________________________________


In [24]:
test_model('There is a', 10)

Input String:  There is a
Model :
there is a very capital degree of constructing balloons from the earth the


