<a href="https://colab.research.google.com/github/Amulya-Anurag/Text-generation-from-Moby-Dick-Novel-using-a-LSTM-model/blob/master/Text_generation_from_Moby_Dick_Novel_using_a_LSTM_model_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Import relevant libraries
import numpy as np
import spacy
import tensorflow as tf


In [2]:
# Define a function to read a text file
def read_txt(path):
  with open(path) as file:
    text= file.read()
  return text

Only four chapters of the book are used here because of limited RAM.

In [3]:
# Path to the book file
path='/content/sample_data/moby_dick_four_chapters.txt'
txt=read_txt(path)
# Preview the first 500 characters of the text
print(txt[:500])

Call me Ishmael.  Some years ago--never mind how long
precisely--having little or no money in my purse, and nothing
particular to interest me on shore, I thought I would sail about a
little and see the watery part of the world.  It is a way I have of
driving off the spleen and regulating the circulation.  Whenever I
find myself growing grim about the mouth; whenever it is a damp,
drizzly November in my soul; whenever I find myself involuntarily
pausing before coffin warehouses, and bringing up t


In [4]:
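# Load the English model from spaCy and disable ner, parser, and tagger for speed
# Note: the 'en' shortcut works in spaCy 2.x; spaCy 3.x needs the full package name, e.g. 'en_core_web_sm'
nlp=spacy.load('en',disable=['ner','parser','tagger'])

# Raise max_length so that long texts do not hit spaCy's input length limit
nlp.max_length=2000000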

Create a function to remove punctuation

In [5]:
def remove_punc(txt):
  # This punctuation set was taken from the Keras module and from inspecting the text file
  return [token.text for token in nlp(txt) if token.text not in '\n\n \n\n\n!"-#$%&()--.*+,-/:;<=>?@[\\]^_`{|}~\t\n ']

In [6]:
# Remove Punctuation from text
tokens =remove_punc(txt)
len(tokens)

11338
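One detail of `remove_punc` worth illustrating: `token.text not in '...'` tests whether the token is a substring of the punctuation string, not whether each of its characters is punctuation. The sketch below compares the two behaviours without spaCy. The helper names `drop_by_substring` and `drop_by_charset` and the sample tokens are hypothetical and not part of the notebook.

```python
# Same punctuation string as in remove_punc above
PUNCT_STRING = '\n\n \n\n\n!"-#$%&()--.*+,-/:;<=>?@[\\]^_`{|}~\t\n '
PUNCT_SET = set(PUNCT_STRING)

def drop_by_substring(tokens):
    # Mirrors the notebook: a token is dropped if it occurs as a substring
    return [t for t in tokens if t not in PUNCT_STRING]

def drop_by_charset(tokens):
    # Alternative (an assumption, not the notebook's method):
    # drop a token if every character in it is a punctuation character
    return [t for t in tokens if not all(ch in PUNCT_SET for ch in t)]

sample = ['Call', 'me', 'Ishmael', '.', '--', 'never', '...', '\n\n']
print(drop_by_substring(sample))
print(drop_by_charset(sample))
```

With the substring test, a token such as `'...'` survives because that exact run of characters does not appear in the string, while the character-set variant removes it.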

Let's create training sequences of 30 input words plus one target word

In [7]:
# Each training sequence holds 30 input words plus 1 target word
train_len = 30 +1
xtrain_sentence=[ ]

for i in range(train_len,len(tokens)):
   xtrain_sentence.append(tokens[i-train_len : i])
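To make the windowing concrete, here is a minimal sketch of the same loop on a toy token list. The function name `make_windows` and the six toy tokens are made up for illustration.

```python
def make_windows(tokens, window_len):
    # Same indexing as the loop above: each window ends just before position i
    return [tokens[i - window_len:i] for i in range(window_len, len(tokens))]

toy_tokens = ['call', 'me', 'ishmael', 'some', 'years', 'ago']
windows = make_windows(toy_tokens, 4)
for w in windows:
    print(w)
```

Because `range` stops before `len(tokens)`, the final possible window (the one ending with the last token) is not produced, so the loop yields `len(tokens) - window_len` sequences.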


Tokenize the training sequences

In [8]:
# import tokenizer from tensorflow and convert text into sequences
from tensorflow.keras.preprocessing.text import Tokenizer
tokenizer= Tokenizer()
tokenizer.fit_on_texts(xtrain_sentence)
train_seq= tokenizer.texts_to_sequences(xtrain_sentence)

In [9]:
# Take a peek at vocab
list(tokenizer.word_index.items())[:10]

[('the', 1),
 ('a', 2),
 ('and', 3),
 ('of', 4),
 ('i', 5),
 ('to', 6),
 ('in', 7),
 ('it', 8),
 ('that', 9),
 ('he', 10)]
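The vocabulary above is ordered by frequency, with ids starting at 1. The following is a rough pure-Python approximation of how such an index can be built. It is not the Keras implementation, and `build_word_index` and the toy sequences are hypothetical.

```python
from collections import Counter

def build_word_index(sequences):
    # Count lowercased words across all sequences
    counts = Counter(word.lower() for seq in sequences for word in seq)
    # Most frequent word gets id 1; id 0 is left unused
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

toy_sequences = [['the', 'whale', 'and', 'the', 'sea'],
                 ['the', 'sea', 'and', 'a', 'ship']]
toy_word_index = build_word_index(toy_sequences)
toy_vocab_size = len(toy_word_index) + 1  # +1 for the unused id 0
print(toy_word_index, toy_vocab_size)
```

Note that in the notebook the tokenizer is fitted on the overlapping 31-word windows, so the counts it ranks by are taken over those windows rather than over the raw token list.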

In [10]:
# Word ids start at 1 and id 0 is reserved (e.g. for padding), so add 1 to the vocab size
vocab_size=len(tokenizer.word_counts)+1

# Convert the training sequences to a NumPy array so they can be sliced into inputs and targets
train_seq=np.array(train_seq)

In [11]:
# Split into input and targets
xtrain=train_seq[:,:-1]
ytrain=train_seq[:,-1]

# Create LSTM model 

In [12]:
# Import relevant libraries
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM ,Embedding

In [13]:
# One-hot encode ytrain
ytrain=to_categorical(ytrain,num_classes=vocab_size)

# Check shape of xtrain 
xtrain.shape

(11307, 30)
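The split and the one-hot encoding can be sketched on toy data with plain lists. `split_inputs_targets`, `one_hot`, and the toy ids are hypothetical stand-ins for the array slicing and `to_categorical` calls above.

```python
def split_inputs_targets(rows):
    # All but the last id are inputs; the last id is the target
    inputs = [row[:-1] for row in rows]
    targets = [row[-1] for row in rows]
    return inputs, targets

def one_hot(index, num_classes):
    # A vector of zeros with a single 1.0 at the target id
    row = [0.0] * num_classes
    row[index] = 1.0
    return row

toy_rows = [[4, 2, 7, 1], [2, 7, 1, 3]]
toy_num_classes = 8  # assumed toy vocabulary size, including id 0
toy_x, toy_y = split_inputs_targets(toy_rows)
toy_y_one_hot = [one_hot(t, toy_num_classes) for t in toy_y]
print(toy_x, toy_y, toy_y_one_hot[0])
```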

LSTM model

In [14]:
# Model will include one Embedding layer, 2 LSTM layers, 2 Dense Layers
model= Sequential([ 
                
                  Embedding( input_dim= vocab_size, output_dim= 100, input_length=xtrain.shape[1] ),
                   
                  LSTM(units=100,return_sequences= True),
                  LSTM(units=100),
                   
                  Dense(units=100,activation='relu'),
                  Dense(units=vocab_size, activation='softmax')                   
])

# Use the Adam optimizer with categorical cross-entropy loss
model.compile(loss='categorical_crossentropy',optimizer='adam',metrics=['accuracy'])

# Summary of the model
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 30, 100)           271800    
_________________________________________________________________
lstm (LSTM)                  (None, 30, 100)           80400     
_________________________________________________________________
lstm_1 (LSTM)                (None, 100)               80400     
_________________________________________________________________
dense (Dense)                (None, 100)               10100     
_________________________________________________________________
dense_1 (Dense)              (None, 2718)              274518    
Total params: 717,218
Trainable params: 717,218
Non-trainable params: 0
_________________________________________________________________
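The parameter counts in the summary can be checked by hand with the standard formulas for each layer type. The helper names below are hypothetical; the sizes (vocabulary of 2718, 100 units per layer) are read off the summary.

```python
def embedding_params(vocab, dim):
    # One dim-sized vector per vocabulary id
    return vocab * dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit
    return input_dim * units + units

vocab = 2718
counts = [
    embedding_params(vocab, 100),
    lstm_params(100, 100),
    lstm_params(100, 100),
    dense_params(100, 100),
    dense_params(100, vocab),
]
print(counts, sum(counts))
```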


Train the model

In [15]:
model.fit(xtrain,ytrain,batch_size=128,epochs=300,verbose=2)

Epoch 1/300
89/89 - 1s - loss: 6.8774 - accuracy: 0.0502
Epoch 2/300
89/89 - 1s - loss: 6.3807 - accuracy: 0.0529
Epoch 3/300
89/89 - 1s - loss: 6.2727 - accuracy: 0.0529
Epoch 4/300
89/89 - 1s - loss: 6.1444 - accuracy: 0.0529
Epoch 5/300
89/89 - 1s - loss: 6.0326 - accuracy: 0.0532
Epoch 6/300
89/89 - 1s - loss: 5.9224 - accuracy: 0.0626
Epoch 7/300
89/89 - 1s - loss: 5.8340 - accuracy: 0.0656
Epoch 8/300
89/89 - 1s - loss: 5.7558 - accuracy: 0.0685
Epoch 9/300
89/89 - 1s - loss: 5.6815 - accuracy: 0.0705
Epoch 10/300
89/89 - 1s - loss: 5.6191 - accuracy: 0.0740
Epoch 11/300
89/89 - 1s - loss: 5.5578 - accuracy: 0.0768
Epoch 12/300
89/89 - 1s - loss: 5.5011 - accuracy: 0.0769
Epoch 13/300
89/89 - 1s - loss: 5.4374 - accuracy: 0.0805
Epoch 14/300
89/89 - 1s - loss: 5.3733 - accuracy: 0.0821
Epoch 15/300
89/89 - 1s - loss: 5.3062 - accuracy: 0.0847
Epoch 16/300
89/89 - 1s - loss: 5.2383 - accuracy: 0.0907
Epoch 17/300
89/89 - 1s - loss: 5.1680 - accuracy: 0.0938
Epoch 18/300
89/89 - 1s

<tensorflow.python.keras.callbacks.History at 0x7f727016dac8>
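Two numbers in the log can be sanity-checked with a little arithmetic: the 89 steps per epoch, and how the early loss compares with a uniform guess over the vocabulary. This is only a back-of-the-envelope sketch using the shapes reported above.

```python
import math

num_samples = 11307   # rows in xtrain
batch_size = 128
vocab = 2718          # output classes

# The last batch is smaller, so round up
steps_per_epoch = math.ceil(num_samples / batch_size)

# Cross-entropy of a model that assigns equal probability to every word
uniform_loss = math.log(vocab)

print(steps_per_epoch, round(uniform_loss, 3))
```

The loss after the first epoch (6.8774) is already below the uniform baseline of roughly 7.9.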

Create a function to generate text

In [16]:
# seed_text is the input text from which generation starts
# num_words is the number of words to generate


def generate_text(model,tokenizer, seq_length, seed_text,num_words):
  text=[]

  for _ in range (num_words):

    # Convert the text into a sequence of word ids
    encoded_text = tokenizer.texts_to_sequences([seed_text])[0]

    # Pad (or truncate from the front) to seq_length
    padded_text= pad_sequences([encoded_text], maxlen=seq_length, truncating='pre')

    # The model predicts the id of the next word; take the scalar from the batch of one
    pred_index=int(np.argmax(model.predict(padded_text), axis=-1)[0])

    # Look up the predicted word; index_word is the inverse of tokenizer.word_index
    pred_word = tokenizer.index_word.get(pred_index, ' ')

    # Update the seed_text for the next prediction
    seed_text= seed_text + ' ' + pred_word

    text.append(pred_word)

  #Return  generated sentence
  return ' '.join(text)
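The `pad_sequences` call with `maxlen=seq_length` and `truncating='pre'` keeps only the most recent `seq_length` ids and left-pads shorter inputs with zeros (the default padding side is also `'pre'`). A minimal pure-Python sketch of that behaviour for a single sequence follows; `pad_pre` is a hypothetical helper, not a Keras function.

```python
def pad_pre(ids, maxlen, pad_value=0):
    # truncating='pre': drop ids from the front, keeping the last maxlen
    ids = ids[-maxlen:]
    # padding='pre': fill the front with pad_value up to maxlen
    return [pad_value] * (maxlen - len(ids)) + ids

short = pad_pre([5, 6], 4)
long = pad_pre([1, 2, 3, 4, 5, 6], 4)
print(short, long)
```

This is why the growing `seed_text` still works on every iteration: each newly predicted word pushes the oldest word out of the 30-word window seen by the model.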


In [17]:
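# Use randint to pick a random training sequence as the seed
from numpy.random import randint

seq_len=xtrain.shape[1]
# randint excludes the upper bound, so this is always a valid index into xtrain_sentence
seed= randint(0,len(xtrain_sentence))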

# Create a random seed_text
seed_text=' ' .join(xtrain_sentence[seed])

print(' Seed Text are :' ,'\n', seed_text,'\n')
print('Generated text are:',end='\n')

generate_text(model,tokenizer,seq_len,seed_text,10)

 Seed Text are : 
 not being much accustomed to boots his pair of damp wrinkled cowhide ones probably not made to order either rather pinched and tormented him at the first go off of a 

Generated text are:


'bitter cold morning seeing now that there were no curtains'