<a href="https://colab.research.google.com/github/ArvindRajen/text_prediction_LSTM/blob/main/language_model_LSTM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
from numpy import array
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Embedding


### TASK

A language model estimates $P(X_{t+1} \mid X_{1:t})$: given the words seen so far, it assigns a probability to each candidate next word. Feeding each predicted word back in as input and repeating the step generates text one word at a time.
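
The recursive step can be sketched without a neural network. The snippet below is a minimal illustration, assuming a hypothetical hand-written table `next_word` that stands in for the model's most likely next word; only the feedback loop is the point.

```python
# Hypothetical lookup table standing in for argmax P(X_{t+1} | X_{1:t}).
# It conditions on the last word only, which a real model would not.
next_word = {"jack": "and", "and": "jill", "jill": "went", "went": "up"}

def generate(seed, n_words):
    words = seed.lower().split()
    for _ in range(n_words):
        # Predict from the context so far, then feed the prediction back in.
        prediction = next_word.get(words[-1])
        if prediction is None:  # no known continuation: stop early
            break
        words.append(prediction)
    return " ".join(words)

print(generate("Jack", 3))  # jack and jill went
```

The LSTM below plays the role of `next_word`, but it conditions on the whole padded prefix rather than on a single word.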

In [2]:
# small toy corpus; the next cell overwrites it with a longer text
data = """ Jack and Jill went up the hill\n
         To fetch a pail of water\n
         Jack fell down and broke his crown\n
         And Jill came tumbling after\n """

In [3]:
data = """To be, or not to be, that is the question\n
Whether it is nobler in the mind to suffer\n
The slings and arrows of outrageous fortune\n
Or to take arms against a sea of troubles\n
And by opposing end them. To die—to sleep \n
No more; and by a sleep to say we end \n
The heart-ache and the thousand natural shocks \n
That flesh is heir to: 'tis a consummation \n
Devoutly to be wish'd. To die, to sleep \n
To sleep, perchance to dream—ay, there's the rub \n
For in that sleep of death what dreams may come \n 
When we have shuffled off this mortal coil\n
Must give us pause—there's the respect\n
That makes calamity of so long life\n"""


#### SEQUENCE GENERATOR

In [4]:
# generate a sequence from a language model
def generate_seq(model, tokenizer, max_length, seed_text, n_words):
    in_text = seed_text
    # generate a fixed number of words
    for _ in range(n_words):
        # encode the text as integers
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        # pre-pad sequences to a fixed length
        encoded = pad_sequences([encoded], maxlen=max_length, padding='pre')
        # predict class probabilities and take the most likely word index
        # (predict_classes was removed from recent TensorFlow/Keras releases)
        probs = model.predict(encoded, verbose=0)
        yhat = int(probs.argmax(axis=-1)[0])
        # map predicted word index to word ('' if the index is the padding id 0)
        out_word = tokenizer.index_word.get(yhat, '')
        # append to input
        in_text += ' ' + out_word
    return in_text

#### TOKENIZE

In [5]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts([data])
# determine the vocabulary size
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
# create line-based sequences
sequences = list()
for line in data.split('\n'):
    encoded = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(encoded)):
        sequence = encoded[:i+1]
        sequences.append(sequence)
print('Total Sequences: %d' % len(sequences))

Vocabulary Size: 76
Total Sequences: 102
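
To make the loop above concrete, here is a small sketch of how one encoded line expands into training sequences. The integer ids are made up for illustration and do not come from the fitted tokenizer, and `prefix_sequences` is a hypothetical helper name.

```python
def prefix_sequences(encoded):
    # Every prefix of length >= 2: the last id is the target, the rest is context.
    return [encoded[:i + 1] for i in range(1, len(encoded))]

line_ids = [5, 2, 9, 7]  # hypothetical ids for a four-word line
for seq in prefix_sequences(line_ids):
    print(seq[:-1], "->", seq[-1])
# [5] -> 2
# [5, 2] -> 9
# [5, 2, 9] -> 7
```

A line of n words therefore contributes n - 1 sequences, and empty or one-word lines contribute none.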


#### PAD SENTENCES TO SAME LENGTH

In [6]:
# pad input sequences
max_length = max([len(seq) for seq in sequences])
sequences = pad_sequences(sequences, maxlen=max_length, padding='pre')
print('Max Sequence Length: %d' % max_length)

Max Sequence Length: 10
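
Pre-padding puts zeros on the left so that the most recent words always sit at the end of the input. The helper below is a plain-Python sketch of that behavior for illustration (`pre_pad` is a hypothetical name, not the Keras implementation); it also mirrors the Keras default of keeping the last `maxlen` items of a sequence that is too long.

```python
def pre_pad(seqs, maxlen, value=0):
    padded = []
    for seq in seqs:
        seq = seq[-maxlen:]  # keep only the most recent maxlen ids
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

print(pre_pad([[5, 2], [5, 2, 9]], maxlen=4))
# [[0, 0, 5, 2], [0, 5, 2, 9]]
```

The tokenizer never assigns index 0 to a word, so 0 is safe to use as the padding value; this is also why `vocab_size` is `len(tokenizer.word_index) + 1`.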


#### SPLIT INPUTS

In [7]:
# split into input and output elements
sequences = array(sequences)
X, y = sequences[:,:-1],sequences[:,-1]
y = to_categorical(y, num_classes=vocab_size)
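
The split takes the last column as the target and everything before it as context; the target is then one-hot encoded so that it matches the softmax output. A plain-Python sketch on two made-up padded rows, with a hypothetical vocabulary size of 6:

```python
rows = [[0, 0, 5, 2], [0, 5, 2, 3]]  # made-up padded sequences
vocab = 6                            # hypothetical vocabulary size

X = [row[:-1] for row in rows]       # same idea as sequences[:, :-1]
targets = [row[-1] for row in rows]  # same idea as sequences[:, -1]

# One-hot: a vector of zeros with a 1 at the target index.
y = [[1 if i == t else 0 for i in range(vocab)] for t in targets]

print(X)  # [[0, 0, 5], [0, 5, 2]]
print(y)  # [[0, 0, 1, 0, 0, 0], [0, 0, 0, 1, 0, 0]]
```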


#### MODEL

In [8]:
# define model
model = Sequential()
model.add(Embedding(vocab_size, 10, input_length=max_length-1))
model.add(LSTM(50))
model.add(Dense(vocab_size, activation='softmax'))
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, 9, 10)             760       
_________________________________________________________________
lstm (LSTM)                  (None, 50)                12200     
_________________________________________________________________
dense (Dense)                (None, 76)                3876      
=================================================================
Total params: 16,836
Trainable params: 16,836
Non-trainable params: 0
_________________________________________________________________
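
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for these layer types with the sizes from this notebook (vocabulary 76, embedding width 10, 50 LSTM units).

```python
vocab_size, embed_dim, units = 76, 10, 50

# Embedding: one embed_dim-wide vector per vocabulary entry.
embedding = vocab_size * embed_dim
# LSTM: 4 gates, each with input weights, recurrent weights, and a bias.
lstm = 4 * ((embed_dim + units) * units + units)
# Dense: one weight per (unit, class) pair plus one bias per class.
dense = units * vocab_size + vocab_size

print(embedding, lstm, dense, embedding + lstm + dense)
# 760 12200 3876 16836
```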


#### TRAIN

In [9]:
# compile network
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit network
model.fit(X, y, epochs=500, verbose=2)

Epoch 1/500
4/4 - 0s - loss: 4.3308 - accuracy: 0.0000e+00
Epoch 2/500
4/4 - 0s - loss: 4.3263 - accuracy: 0.0980
Epoch 3/500
4/4 - 0s - loss: 4.3222 - accuracy: 0.0980
Epoch 4/500
4/4 - 0s - loss: 4.3180 - accuracy: 0.0980
Epoch 5/500
4/4 - 0s - loss: 4.3129 - accuracy: 0.0980
Epoch 6/500
4/4 - 0s - loss: 4.3067 - accuracy: 0.0980
Epoch 7/500
4/4 - 0s - loss: 4.2988 - accuracy: 0.0980
Epoch 8/500
4/4 - 0s - loss: 4.2875 - accuracy: 0.0980
Epoch 9/500
4/4 - 0s - loss: 4.2714 - accuracy: 0.0980
Epoch 10/500
4/4 - 0s - loss: 4.2462 - accuracy: 0.0980
Epoch 11/500
4/4 - 0s - loss: 4.2062 - accuracy: 0.0980
Epoch 12/500
4/4 - 0s - loss: 4.1298 - accuracy: 0.0980
Epoch 13/500
4/4 - 0s - loss: 4.0584 - accuracy: 0.0980
Epoch 14/500
4/4 - 0s - loss: 4.0518 - accuracy: 0.0980
Epoch 15/500
4/4 - 0s - loss: 4.0263 - accuracy: 0.0980
Epoch 16/500
4/4 - 0s - loss: 4.0132 - accuracy: 0.0980
Epoch 17/500
4/4 - 0s - loss: 4.0138 - accuracy: 0.0980
Epoch 18/500
4/4 - 0s - loss: 4.0067 - accuracy: 0.09

<tensorflow.python.keras.callbacks.History at 0x7fadc52c9b70>
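
Two numbers in the log are worth a sanity check. An untrained softmax over 76 classes is close to uniform, so the expected starting loss is about ln(76); and with 102 training sequences, an accuracy of 0.0980 corresponds to 10 correct predictions. A quick check:

```python
import math

vocab_size, n_sequences = 76, 102

# Cross-entropy of a uniform guess over vocab_size classes.
uniform_loss = -math.log(1 / vocab_size)
print(round(uniform_loss, 4))      # 4.3307

# Accuracy when exactly 10 of the 102 targets are predicted correctly.
print(round(10 / n_sequences, 4))  # 0.098
```

The first-epoch loss of 4.3308 matches the uniform baseline, which suggests the model starts from an essentially uninformed prediction.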

#### EVALUATE

In [11]:
# evaluate model
print(generate_seq(model, tokenizer, max_length-1, 'The slings and', 4))
print(generate_seq(model, tokenizer, max_length-1, 'Or to', 4))
print(generate_seq(model, tokenizer, max_length-1, 'Must give', 4))

The slings and arrows of outrageous fortune
Or to be not to die
Must give us pause—there's the respect
