<a href="https://colab.research.google.com/github/Aditya-shahh/Supervised-Text-Generation/blob/master/Text_Generation_using_wikipedia_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Word-based text generation using BiLSTMs and GloVe embeddings**
---

This Colab notebook is part of my applied research on text generation using LSTMs.

I use the Wikipedia sentences dataset, a collection of 7.8 million sentences (one per line) from the August 2018 English Wikipedia. It contains only sentences found in the opening text of content pages, with further filtering applied to remove junk sentences.


To build and train a model on Colab, I have reduced the dataset to far fewer sentences.

The original dataset can be downloaded from [here](https://www.kaggle.com/mikeortman/wikipedia-sentences).

#### Warning: The text data is large enough to exhaust the 12 GB of RAM provided by Colab.
It is therefore advisable to reduce the dataset (to at most 100k sentences).


In [None]:
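# Reduce the data file by keeping every 3500th sentence
# [run this cell once, after wikisent2.txt has been unzipped into /content]
with open('/content/wikisent2.txt') as src, \
     open('/content/drive/My Drive/reducedwikisent2.txt', 'w') as dst:
  # Stream line by line so the full 7.8M-sentence file is never held in memory
  for j, line in enumerate(src):
    if j % 3500 == 0:
      dst.write(line.lower())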

In [None]:
!unzip '/content/drive/My Drive/wikisent2.txt.zip'	


Archive:  /content/drive/My Drive/wikisent2.txt.zip
  inflating: wikisent2.txt           


In [None]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, Bidirectional
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
from tensorflow.keras import regularizers
import tensorflow.keras.utils as ku 
import numpy as np 

In [None]:
tokenizer = Tokenizer()

In [None]:

# Read the reduced corpus; the with-block closes the file handle afterwards
with open('/content/drive/My Drive/reducedwikisent2.txt') as f:
    data = f.read()
corpus = data.lower().split("\n")
tokenizer.fit_on_texts(corpus)
word_index = tokenizer.word_index
total_words = len(word_index) + 1



del data


In [None]:
# Create input sequences: every prefix of length >= 2 of each tokenized line
input_sequences = []
for line in corpus:
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)

print(len(input_sequences))

# Free the raw text; only the integer sequences are needed from here on
del corpus


36100
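
To make the prefix construction in the cell above concrete, here is a minimal sketch on a toy sentence. The mapping `toy_index` and the helper `prefix_sequences` are hypothetical stand-ins for the fitted `Tokenizer`; only the slicing logic mirrors the notebook.

```python
# Hypothetical word-to-index mapping standing in for tokenizer.word_index
toy_index = {"the": 1, "cat": 2, "sat": 3, "down": 4}

def prefix_sequences(token_list):
    """Return every prefix of token_list that has at least two tokens."""
    return [token_list[:i + 1] for i in range(1, len(token_list))]

tokens = [toy_index[w] for w in "the cat sat down".split()]
sequences = prefix_sequences(tokens)
for seq in sequences:
    print(seq)

# A sentence of n tokens yields n - 1 training sequences
print(len(sequences))
```

Each prefix later becomes one training example: its last token is the word to predict and the tokens before it are the context.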


In [None]:

max_sequence_len = 30

print(max_sequence_len)

30


In [None]:
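# Left-pad every sequence with zeros to max_sequence_len
# (longer sequences keep only their last max_sequence_len tokens)
input_sequences = pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre')

# Split into predictors (all tokens but the last) and label (the last token)
predictors, label = input_sequences[:, :-1], input_sequences[:, -1]

# One-hot encode the label over the vocabulary
label = ku.to_categorical(label, num_classes=total_words)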
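
The padding and predictor/label split can be illustrated without Keras. The sketch below is a plain-Python approximation of what `pad_sequences` does with `padding='pre'` and its default truncation; the helper `pad_pre` and the toy values are assumptions made for illustration only.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad seq with `value` to maxlen; if seq is longer, keep its last maxlen tokens."""
    seq = seq[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

max_len = 5  # toy value; the notebook uses 30
sequences = [[1, 2], [1, 2, 3], [1, 2, 3, 4, 5, 6]]
padded = [pad_pre(s, max_len) for s in sequences]

# The last token of each row is the label; everything before it is the predictor
predictors = [row[:-1] for row in padded]
labels = [row[-1] for row in padded]

for x, y in zip(predictors, labels):
    print(x, "->", y)
```

Padding on the left keeps the label in the final column for every row, which is why a single slice separates predictors from labels.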

The embedding layer is initialized with 100-dimensional GloVe vectors.

In [None]:
# Note: this is the 100-dimensional version of GloVe from Stanford
!wget --no-check-certificate \
    https://storage.googleapis.com/laurencemoroney-blog.appspot.com/glove.6B.100d.txt \
    -O /tmp/glove.6B.100d.txt

# Map each GloVe word to its 100-dimensional vector
embeddings_index = {}
with open('/tmp/glove.6B.100d.txt') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

# Row i holds the vector of the word with tokenizer index i;
# words missing from GloVe (and the padding index 0) stay all-zero
embeddings_matrix = np.zeros((total_words, 100))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embeddings_matrix[i] = embedding_vector


print(len(embeddings_matrix))
print(embeddings_matrix[1])

--2020-06-28 20:21:46--  https://storage.googleapis.com/laurencemoroney-blog.appspot.com/glove.6B.100d.txt
Resolving storage.googleapis.com (storage.googleapis.com)... 173.194.79.128, 108.177.119.128, 108.177.126.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|173.194.79.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 347116733 (331M) [text/plain]
Saving to: ‘/tmp/glove.6B.100d.txt’


2020-06-28 20:21:50 (82.1 MB/s) - ‘/tmp/glove.6B.100d.txt’ saved [347116733/347116733]

9500
[-0.038194   -0.24487001  0.72812003 -0.39961001  0.083172    0.043953
 -0.39140999  0.3344     -0.57545     0.087459    0.28786999 -0.06731
  0.30906001 -0.26383999 -0.13231    -0.20757     0.33395001 -0.33848
 -0.31742999 -0.48335999  0.1464     -0.37303999  0.34577     0.052041
  0.44946    -0.46970999  0.02628    -0.54154998 -0.15518001 -0.14106999
 -0.039722    0.28277001  0.14393     0.23464    -0.31020999  0.086173
  0.20397     0.52623999  0.17163999 -0.
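
Words that GloVe does not contain keep an all-zero row in the embedding matrix, so it can be worth checking how much of the vocabulary is covered. The following is a minimal sketch with made-up 3-dimensional vectors; `toy_vectors` and `toy_word_index` are hypothetical stand-ins for `embeddings_index` and `word_index`.

```python
# Hypothetical 3-dimensional "pretrained" vectors standing in for GloVe
toy_vectors = {"the": [0.1, 0.2, 0.3], "cat": [0.4, 0.5, 0.6]}
# Hypothetical tokenizer vocabulary; index 0 is reserved for padding
toy_word_index = {"the": 1, "cat": 2, "zyzzyva": 3}

dim = 3
matrix = [[0.0] * dim for _ in range(len(toy_word_index) + 1)]
missing = []
for word, i in toy_word_index.items():
    vector = toy_vectors.get(word)
    if vector is not None:
        matrix[i] = list(vector)
    else:
        missing.append(word)  # this row stays all-zero

coverage = 1 - len(missing) / len(toy_word_index)
print(f"coverage: {coverage:.2f}, missing: {missing}")
```

Because the notebook freezes the embedding layer (`trainable=False`), every uncovered word is fed to the model as the same zero vector, so low coverage would limit what the model can learn about those words.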

In [None]:
# Define the model architecture
model = Sequential()
model.add(Embedding(total_words, 100, input_length=max_sequence_len-1, weights=[embeddings_matrix], trainable=False))
model.add(Bidirectional(LSTM(128, return_sequences = True)))
model.add(Dropout(0.2))
model.add(Bidirectional(LSTM(128)))
model.add(Dropout(0.2))
# Use integer division: the number of units must be an int, not a float
model.add(Dense(total_words // 2, activation='relu', kernel_regularizer=regularizers.l2(0.01)))
model.add(Dense(total_words, activation='softmax'))

# Initial learning rate
optimizer = Adam(learning_rate=0.001)

model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
# summary() prints the table itself; wrapping it in print() would add a stray "None"
model.summary()


Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 29, 100)           950000    
_________________________________________________________________
bidirectional (Bidirectional (None, 29, 256)           234496    
_________________________________________________________________
dropout (Dropout)            (None, 29, 256)           0         
_________________________________________________________________
bidirectional_1 (Bidirection (None, 256)               394240    
_________________________________________________________________
dropout_1 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense (Dense)                (None, 4750)              1220750   
_________________________________________________________________
dense_1 (Dense)              (None, 9500)              4
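
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for embedding, LSTM, and dense layers; the helper names are hypothetical, and the sizes are the ones shown in the summary (vocabulary 9500, embedding dimension 100, 128 LSTM units).

```python
def embedding_params(vocab, dim):
    # One dim-sized vector per vocabulary entry
    return vocab * dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias
    return 4 * (input_dim + units + 1) * units

def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit
    return input_dim * units + units

vocab, dim, units = 9500, 100, 128
counts = {
    "embedding": embedding_params(vocab, dim),
    "bidirectional": 2 * lstm_params(dim, units),          # forward + backward LSTM
    "bidirectional_1": 2 * lstm_params(2 * units, units),  # input is the 256-wide BiLSTM output
    "dense": dense_params(2 * units, vocab // 2),
    "dense_1": dense_params(vocab // 2, vocab),
}
for name, n in counts.items():
    print(name, n)
```

The first four values agree with the rows visible in the summary. The final softmax layer follows the same dense formula and holds by far the most weights, since its size scales with the square of the vocabulary.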

In [None]:
import tensorflow as tf
checkpoint_filepath = '/content/drive/My Drive/checkpoint'


# Stop training early if the training loss does not improve for 4 successive epochs (patience = 4)


early_stop_callback = tf.keras.callbacks.EarlyStopping(
    monitor='loss', min_delta=0, patience=4, verbose=1, mode='auto',
    baseline=None, restore_best_weights=False
)

# Every 5th epoch, save the complete model (SavedModel format, .pb file)
# if the training accuracy has improved since the last save.
# Note: `period` is deprecated in later TensorFlow releases in favor of `save_freq`.

checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_filepath,
    save_weights_only=False,
    monitor='accuracy',
    verbose=1,
    mode='auto',
    period=5,
    save_best_only=True)
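
The patience rule behind early stopping can be sketched in a few lines of plain Python. This is a simplified model of the rule, not Keras's implementation; the function `early_stop_epoch` and the loss values are made up for illustration.

```python
def early_stop_epoch(losses, patience=4, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None if it never stops."""
    best = float("inf")
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(losses, start=1):
        if loss < best - min_delta:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Made-up loss curve: improves for three epochs, then plateaus
toy_losses = [2.0, 1.5, 1.2, 1.3, 1.25, 1.21, 1.2, 1.1]
print(early_stop_epoch(toy_losses))
```

With `patience=4`, the run stops at the fourth consecutive epoch that fails to beat the best loss seen so far, so the later improvement in the toy curve is never reached.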





In [None]:
# Start by training the model for up to 500 epochs with learning rate = 0.001
history = model.fit(predictors, label, epochs=500, verbose=1, callbacks=[early_stop_callback, checkpoint_callback])

Epoch 1/500
Epoch 2/500
Epoch 3/500
Epoch 4/500
Epoch 5/500
Epoch 00005: accuracy improved from -inf to 0.14058, saving model to /content/drive/My Drive/checkpoint
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
INFO:tensorflow:Assets written to: /content/drive/My Drive/checkpoint/assets
Epoch 6/500
Epoch 7/500
Epoch 8/500
Epoch 9/500
Epoch 10/500
Epoch 00010: accuracy improved from 0.14058 to 0.17319, saving model to /content/drive/My Drive/checkpoint
INFO:tensorflow:Assets written to: /content/drive/My Drive/checkpoint/assets
Epoch 11/500
Epoch 12/500
Epoch 13/500
Epoch 14/500
Epoch 15/500
Epoch 00015: accuracy improved from 0.17319 to 0.19566, saving model to /content/drive/My Drive/checkpoint
INFO:tensorflow:Assets written to: /content/drive/My Drive/checkpoint/assets
Epoch 16/500
Epoch 17/500
Epoch 18/500
Epoch 19/500
Epoch 20/500
Epoch 00020: accuracy improved from 0.19566 to 0.21957, saving model to /content/drive/My Drive/checkpoint
INFO:tensorf

If training stops early, we can reload the model from the saved checkpoint and continue training with a lower learning rate for better convergence.

In [None]:
# Load the model from previous saved checkpoint
model = tf.keras.models.load_model(checkpoint_filepath)

In [None]:
# Reduce the learning rate to 0.0005
optimizer = Adam(learning_rate=0.0005)

model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
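
The manual cycle of reloading and lowering the learning rate follows a "reduce on plateau" idea, which Keras can also automate through its `ReduceLROnPlateau` callback. The sketch below only simulates that rule in plain Python; the function `plateau_schedule`, its default values, and the loss curve are assumptions chosen for illustration, not settings from this notebook.

```python
def plateau_schedule(losses, initial_lr=0.001, factor=0.5, patience=2, min_lr=0.00005):
    """Return the learning rate used at each epoch under a reduce-on-plateau rule."""
    lr = initial_lr
    best = float("inf")
    wait = 0  # epochs since the last improvement
    used = []
    for loss in losses:
        used.append(lr)
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                # No improvement for `patience` epochs: shrink the rate, but not below min_lr
                lr = max(lr * factor, min_lr)
                wait = 0
    return used

# Made-up loss curve with two plateaus
toy_losses = [2.0, 1.6, 1.6, 1.7, 1.4, 1.4, 1.4, 1.3]
print(plateau_schedule(toy_losses))
```

Compared with restarting from a checkpoint by hand, a rule like this lowers the rate during a single `fit` call, at the cost of less control over when each reduction happens.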

In [None]:
# Retrain the model from last checkpoint saved
history = model.fit(predictors, label, epochs=50, verbose=1, callbacks=[early_stop_callback, checkpoint_callback])

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 00005: accuracy improved from 0.62470 to 0.69872, saving model to /content/drive/My Drive/checkpoint
INFO:tensorflow:Assets written to: /content/drive/My Drive/checkpoint/assets
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 00010: accuracy improved from 0.69872 to 0.72186, saving model to /content/drive/My Drive/checkpoint
INFO:tensorflow:Assets written to: /content/drive/My Drive/checkpoint/assets
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 00015: accuracy improved from 0.72186 to 0.73640, saving model to /content/drive/My Drive/checkpoint
INFO:tensorflow:Assets written to: /content/drive/My Drive/checkpoint/assets
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 00020: accuracy improved from 0.73640 to 0.74930, saving model to /content/drive/My Drive/checkpoint
INFO:tensorflow:Assets written to: /content/drive/My Drive/checkpoint/assets
Epoch 21/50
Epoch 22/50
Epoch 23/

In [None]:
model = tf.keras.models.load_model(checkpoint_filepath)

# Reduce the learning rate further to 0.0001 and retrain the model
optimizer = Adam(learning_rate=0.0001)

model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])

In [None]:
history = model.fit(predictors, label, epochs=30, verbose=1, callbacks=[early_stop_callback, checkpoint_callback])

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 00005: accuracy improved from 0.79574 to 0.85486, saving model to /content/drive/My Drive/checkpoint
INFO:tensorflow:Assets written to: /content/drive/My Drive/checkpoint/assets
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 00010: accuracy improved from 0.85486 to 0.86613, saving model to /content/drive/My Drive/checkpoint
INFO:tensorflow:Assets written to: /content/drive/My Drive/checkpoint/assets
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 00015: accuracy improved from 0.86613 to 0.87375, saving model to /content/drive/My Drive/checkpoint
INFO:tensorflow:Assets written to: /content/drive/My Drive/checkpoint/assets
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 00020: accuracy improved from 0.87375 to 0.87870, saving model to /content/drive/My Drive/checkpoint
INFO:tensorflow:Assets written to: /content/drive/My Drive/checkpoint/assets
Epoch 21/30
Epoch 22/30
Epoch 23/

In [None]:
# Reload the model again
model = tf.keras.models.load_model(checkpoint_filepath)

# Reduce the learning rate further to 0.00005 
optimizer = Adam(learning_rate=0.00005)

model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])

In [None]:
history = model.fit(predictors, label, epochs=20, verbose=1, callbacks=[early_stop_callback, checkpoint_callback])

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 00004: accuracy improved from 0.88495 to 0.89061, saving model to /content/drive/My Drive/checkpoint
INFO:tensorflow:Assets written to: /content/drive/My Drive/checkpoint/assets
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 00009: accuracy improved from 0.89061 to 0.89157, saving model to /content/drive/My Drive/checkpoint
INFO:tensorflow:Assets written to: /content/drive/My Drive/checkpoint/assets
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 00014: accuracy improved from 0.89157 to 0.89460, saving model to /content/drive/My Drive/checkpoint
INFO:tensorflow:Assets written to: /content/drive/My Drive/checkpoint/assets
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 00019: accuracy improved from 0.89460 to 0.89682, saving model to /content/drive/My Drive/checkpoint
INFO:tensorflow:Assets written to: /content/drive/My Drive/checkpoint/assets
Epoch 20/20


The model achieved a training accuracy of about 89%, which is pretty decent for text generation, although no validation set was held out, so this number reflects fit to the training data.
There is still considerable scope for improving the generated text.
For example, instead of greedy decoding (always selecting the single most probable next word), beam search can be used: it keeps the k most probable partial sequences at each step and returns the best complete one.
Other options include ELMo embeddings, a better model architecture, or transformers.
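
To show how beam search differs from greedy decoding, here is a minimal sketch on a toy next-word table. The table `toy_model` and the helpers `next_probs` and `beam_search` are hypothetical; in the notebook, the probabilities would come from `model.predict` instead.

```python
import math

# Hypothetical next-word distributions standing in for model.predict;
# keys are the previous word, values map candidate next words to probabilities
toy_model = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.5, "y": 0.5},
    "b": {"z": 0.9, "x": 0.1},
}

def next_probs(sequence):
    return toy_model.get(sequence[-1], {})

def beam_search(start, steps, beam_width):
    """Keep the beam_width highest-scoring partial sequences at every step."""
    beams = [([start], 0.0)]  # (sequence, log-probability)
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            for word, p in next_probs(seq).items():
                candidates.append((seq + [word], score + math.log(p)))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]

# beam_width=1 is exactly greedy decoding
greedy_seq, greedy_score = beam_search("<s>", steps=2, beam_width=1)
beam_seq, beam_score = beam_search("<s>", steps=2, beam_width=2)
print(greedy_seq, round(math.exp(greedy_score), 2))
print(beam_seq, round(math.exp(beam_score), 2))
```

In this toy table, greedy decoding commits to the locally best first word and ends with a less probable sequence, while a beam of width 2 keeps the second candidate alive long enough to find the more probable continuation.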


In [None]:
#prediction
import tensorflow as tf
checkpoint_filepath = '/content/drive/My Drive/checkpoint'
import numpy as np
model = tf.keras.models.load_model(checkpoint_filepath)




In [None]:
seed_text = 'he loved to'
next_words = 15

for _ in range(next_words):
    token_list = tokenizer.texts_to_sequences([seed_text])[0]
    token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
    # Greedy decoding: pick the most probable next word.
    # (model.predict_classes was removed in TensorFlow 2.6; argmax over predict is equivalent)
    predicted = int(np.argmax(model.predict(token_list, verbose=0), axis=-1)[0])
    # Map the predicted index back to its word (empty string if the index is unknown)
    output_word = tokenizer.index_word.get(predicted, "")
    seed_text += " " + output_word
print(seed_text)

he loved to recorder known team villages however systems would divisional studies an philadelphia and the dan military
