##Practice tasks to understand word embeddings and their use in RNNs
- The model can learn embeddings during training, or
- The model can use pretrained embeddings

We use a movie review dataset with 50000 reviews, 50% positive and 50% negative. The data is split into two parts, one for training and one for testing. Each part has an equal proportion of positive and negative reviews.

####Loading libraries

In [11]:
#Pandas libraries
import pandas as pds
import numpy as np

#keras libraries
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Flatten, Embedding

In [12]:
#method to load file
def load_data(file):
  sheet1 = pds.read_excel(file, sheet_name = 0)  
  sheet2 = pds.read_excel(file, sheet_name = 1)  
  data = pds.concat([sheet1, sheet2], axis=0)
  return data
#Loading Training Data
newData = load_data('/content/drive/MyDrive/LSTM-Dataset/aclImdb/train/mergetest.xlsx')
newData.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 25000 entries, 0 to 12499
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Label   25000 non-null  object
 1   Review  25000 non-null  object
dtypes: object(2)
memory usage: 585.9+ KB


####Separating reviews and labels

In [13]:
docs = newData.iloc[:, newData.columns!='Label']
labels = newData.iloc[:, newData.columns=='Label']

In [14]:
docs.head(5)

Unnamed: 0,Review
0,Story of a man who has unnatural feelings for ...
1,Airport '77 starts as a brand new luxury 747 p...
2,This film lacked something I couldn't put my f...
3,"Sorry everyone,,, I know this is supposed to b..."
4,When I was little my parents took me along to ...


####Cleaning and tokenizing the review

In [26]:
from collections import Counter
import string
def tokenize_reviews(docs):
  # lowercase every review and join them into one long string
  review_text = docs['Review'].str.lower()
  all_text2 = ' '.join(review_text)
  # create a list of words, strip punctuation, keep alphabetic words only
  words = all_text2.split()
  table = str.maketrans('', '', string.punctuation)
  words = [w.translate(table) for w in words]
  words = [word for word in words if word.isalpha()]
  # count all the words using Counter
  count_words = Counter(words)
  # keep the 10000 most frequent words; ids start at 1 so that 0 stays free
  sorted_words = count_words.most_common(10000)
  vocab_to_int = {w:i+1 for i, (w,c) in enumerate(sorted_words)}
  # encode each review; words missing from the vocabulary map to 0
  # note: reviews are split on raw whitespace here, without the lowercasing and
  # punctuation stripping used above, so tokens such as 'Story' or 'film.' map to 0
  encoded_docs = []
  for review in docs['Review']:
    r = []
    for w in review.split():
      if w in vocab_to_int:
        r.append(vocab_to_int[w])
      else:
        r.append(0)
    encoded_docs.append(r)
  return encoded_docs, vocab_to_int
encoded_docs, vocab_to_int = tokenize_reviews(docs)
print (len(encoded_docs[0]))
print (encoded_docs[1])
print (encoded_docs[2])

112
[0, 0, 494, 14, 3, 3414, 154, 8383, 0, 1641, 6, 4765, 57, 16, 4355, 5719, 0, 135, 0, 5, 993, 4858, 0, 0, 0, 0, 36, 6, 1495, 97, 0, 3, 730, 4, 0, 5, 24, 3477, 7, 0, 4, 8, 106, 2998, 5, 1, 1038, 14, 3, 0, 78, 20, 2042, 6, 0, 573, 0, 0, 0, 0, 39, 0, 0, 8383, 0, 290, 126, 14, 4217, 18, 0, 1, 1641, 6, 0, 32, 1, 0, 0, 0, 0, 0, 24, 104, 0, 0, 0, 0, 0, 0, 0, 0, 36, 3666, 1, 5977, 0, 1020, 45, 16, 2661, 0, 34, 1272, 5, 1997, 1, 4355, 0, 0, 1496, 20, 3, 0, 1641, 3459, 20, 33, 4255, 1079, 18, 130, 244, 24, 4629, 0, 209, 1823, 33, 3199, 0, 7, 1, 0, 0, 1878, 1143, 4, 1, 1641, 5541, 8, 6480, 77, 1, 2023, 112, 8, 7837, 5, 1, 1319, 204, 4630, 7, 1, 743, 4, 1, 0, 0, 0, 986, 7, 345, 0, 1091, 0, 7, 0, 251, 0, 125, 0, 1927, 126, 258, 1, 686, 0, 15, 1, 0, 14, 34, 0, 325, 16, 61, 820, 604, 0, 0, 0, 611, 479, 1, 1037, 265, 0, 0, 0, 10, 323, 759, 5, 1, 0, 1704, 765, 0, 0, 13, 514, 32, 0, 0, 0, 130, 271, 172, 38, 0, 8092, 0, 0, 129, 0, 0, 6, 98, 422, 4, 1554, 347, 8, 6, 438, 254, 21, 2580, 15, 1, 204, 0, 0
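
The vocabulary above is built from lowercased, punctuation-stripped words, while the reviews are encoded from raw whitespace-split tokens, which is one reason so many positions come out as 0. A minimal sketch of one way to keep both steps consistent follows. The helpers clean_tokens(), build_vocab(), and encode() are hypothetical names, not part of the notebook, and the sample reviews are made up. Passing the training vocabulary to encode() for the test set keeps the word ids aligned with the ids the embedding layer was trained on.

```python
import string
from collections import Counter

_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def clean_tokens(review):
    """Lowercase, strip punctuation, and keep alphabetic tokens only."""
    tokens = [w.translate(_PUNCT_TABLE) for w in review.lower().split()]
    return [w for w in tokens if w.isalpha()]

def build_vocab(reviews, max_words=10000):
    """Map the most frequent words to ids 1..max_words (0 is reserved)."""
    counts = Counter(w for review in reviews for w in clean_tokens(review))
    return {w: i + 1 for i, (w, _) in enumerate(counts.most_common(max_words))}

def encode(reviews, vocab):
    """Encode reviews with an existing vocabulary; unknown words become 0."""
    return [[vocab.get(w, 0) for w in clean_tokens(review)] for review in reviews]

train_reviews = ["A great film. Great cast!", "Not a great story"]  # made-up examples
vocab = build_vocab(train_reviews)
encoded_train = encode(train_reviews, vocab)
encoded_test = encode(["Great story, weak ending"], vocab)  # reuses the training vocabulary
print(vocab)
print(encoded_train)
print(encoded_test)
```

Because the same cleaning runs in both steps, a capitalized or punctuated word in a review maps to the same id as its lowercase form in the vocabulary.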

###Encoding Labels

In [27]:
encoded_labels = [1 if label =='positive' else 0 for label in newData['Label']]
encoded_labels = np.array(encoded_labels)
encoded_labels

array([0, 0, 0, ..., 0, 0, 0])

###Preparing Input for Model
We pad the encoded reviews so that every review has the same number of tokens. Each review is fixed at 500 tokens: shorter reviews are padded with zeros at the end, and longer reviews are truncated.

In [28]:
max_length = 500
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
print(padded_docs)

[[   0    4    3 ...    0    0    0]
 [   0   16  153 ...    0    0    0]
 [   0   19 3528 ...    0    0    0]
 ...
 [   0  236 2597 ...    0    0    0]
 [   0    0    0 ...    0    0    0]
 [   0  710  470 ...    0    0    0]]


In [29]:
padded_docs.shape

(25000, 500)
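
With padding='post', pad_sequences() appends zeros to short reviews; with its default truncating='pre', it drops tokens from the start of reviews longer than maxlen. The plain-Python sketch below mimics that behavior on two made-up sequences. The helper pad_post() is a hypothetical name used only for illustration and assumes maxlen is at least 1.

```python
def pad_post(seq, maxlen, value=0):
    """Keep the last maxlen tokens, then pad with value at the end."""
    kept = seq[-maxlen:]  # drops tokens from the front when the sequence is too long
    return kept + [value] * (maxlen - len(kept))

print(pad_post([5, 8, 2], maxlen=5))
print(pad_post([1, 2, 3, 4, 5, 6, 7], maxlen=5))
```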

###Define Model - Embedding layer learns along the model

In [30]:
# define the model: an embedding layer followed by a single sigmoid output, so training mostly learns the embeddings
vocab_size=10001 # 10000 vocabulary words plus index 0, which is used for padding and unknown words
model = Sequential()
model.add(Embedding(vocab_size, 32, input_length=max_length))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# summarize the model
print(model.summary())

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 500, 32)           320032    
_________________________________________________________________
flatten_1 (Flatten)          (None, 16000)             0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 16001     
Total params: 336,033
Trainable params: 336,033
Non-trainable params: 0
_________________________________________________________________
None
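
The parameter counts in the summary can be checked by hand. The short calculation below is a sketch that reproduces them from the vocabulary size, embedding dimension, and padded length used in this notebook.

```python
vocab_size, embed_dim, max_length = 10001, 32, 500

embedding_params = vocab_size * embed_dim   # one 32-dimensional vector per word id
flatten_units = max_length * embed_dim      # 500 tokens x 32 dimensions
dense_params = flatten_units * 1 + 1        # one weight per flattened unit plus one bias

print(embedding_params, flatten_units, dense_params, embedding_params + dense_params)
```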


###Training the Model

In [31]:
# fit the model
model.fit(padded_docs, encoded_labels, epochs=50, verbose=0, validation_split=0.2)
# evaluate the model on the same data it was trained on (this is training accuracy)
loss, accuracy = model.evaluate(padded_docs, encoded_labels, verbose=0)
print('Accuracy: %f' % (accuracy*100))

Accuracy: 100.000000


###Preparing a Model using pretrained Embeddings
The pretrained vectors come from the file glove.6B.100d.txt of the GloVe embeddings.

In [38]:
from numpy import asarray
from numpy import zeros
# map each GloVe word to its 100-dimensional vector
embeddings_index = dict()
with open('/content/drive/MyDrive/LSTM-Dataset/aclImdb/train/glove.6B.100d.txt', encoding='utf8') as f:
	for line in f:
		values = line.split()
		word = values[0]
		coefs = asarray(values[1:], dtype='float32')
		embeddings_index[word] = coefs
print('Loaded %s word vectors.' % len(embeddings_index))
# create a weight matrix for words in training docs
# row 0 and words missing from GloVe stay all-zero
embedding_matrix = zeros((10001, 100))
for word, i in vocab_to_int.items():
	embedding_vector = embeddings_index.get(word)
	if embedding_vector is not None:
		embedding_matrix[i] = embedding_vector


Loaded 400000 word vectors.


In [33]:
embedding_matrix.shape

(10001, 100)
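
To make the lookup step concrete, here is a small sketch of the same idea on made-up data: two fake lines in GloVe text format (shortened to 3 dimensions) and a hypothetical three-word vocabulary. It uses plain lists instead of NumPy arrays and also counts how many vocabulary words have a pretrained vector, a check that can be useful before training.

```python
import io

# two made-up lines in GloVe text format, shortened to 3 dimensions for illustration
fake_glove = io.StringIO("film 0.1 0.2 0.3\ngreat 0.4 0.5 0.6\n")

embeddings_index = {}
for line in fake_glove:
    values = line.split()
    embeddings_index[values[0]] = [float(v) for v in values[1:]]

vocab_to_int = {"great": 1, "film": 2, "zzz": 3}  # hypothetical vocabulary
dim = 3
# row 0 (padding/unknown) and words without a pretrained vector stay all-zero
embedding_matrix = [[0.0] * dim for _ in range(len(vocab_to_int) + 1)]
for word, i in vocab_to_int.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector

covered = sum(1 for w in vocab_to_int if w in embeddings_index)
print(embedding_matrix)
print('coverage: %d of %d words' % (covered, len(vocab_to_int)))
```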

###Define Model with pre-trained Embeddings

In [34]:
# define model
vocab_size=10001
model = Sequential()
#vocab_size = 10000 most frequent training words + 1 for index 0, 100 = dimensions of each GloVe vector, input_length = number of tokens in each padded review
e = Embedding(vocab_size, 100, weights=[embedding_matrix], input_length=500, trainable=False)
model.add(e)
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# summarize the model
print(model.summary())

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 500, 100)          1000100   
_________________________________________________________________
flatten_2 (Flatten)          (None, 50000)             0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 50001     
Total params: 1,050,101
Trainable params: 50,001
Non-trainable params: 1,000,100
_________________________________________________________________
None


###Training the model with pre-trained embeddings

In [35]:
# fit the model
model.fit(padded_docs, encoded_labels, epochs=50, verbose=0, validation_split=0.2)
# evaluate the model on the same data it was trained on (this is training accuracy)
loss, accuracy = model.evaluate(padded_docs, encoded_labels, verbose=0)
print('Accuracy: %f' % (accuracy*100))

Accuracy: 100.000000


###Checking Model performance on Test data

In [36]:
#Loading Test Data
testData = load_data('/content/drive/MyDrive/LSTM-Dataset/aclImdb/train/MovieRating_Test.xlsx')
testData.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 24999 entries, 0 to 12499
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Label   24999 non-null  object
 1   Review  24999 non-null  object
dtypes: object(2)
memory usage: 585.9+ KB


In [40]:
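# note: tokenize_reviews builds a new vocabulary from the data it receives, so this call
# overwrites vocab_to_int and the test ids no longer match the training ids
encoded_test_docs, vocab_to_int = tokenize_reviews(testData)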
print (len(encoded_test_docs))
print (encoded_test_docs[1])
print (encoded_test_docs[2])

24999
[0, 6, 33, 468, 4, 133, 1, 2224, 4, 224, 94, 23, 1, 0, 0, 3, 0, 0, 62, 163, 259, 145, 0, 0, 571, 455, 4, 1, 90, 0, 2058, 4, 0, 3, 0, 0, 0, 239, 0, 104, 214, 130, 11, 34, 23, 2062, 4, 0, 3, 113, 0, 0, 1094, 15, 10, 0, 141, 61, 0, 0, 0, 0, 39, 103, 0, 0, 0, 16, 0, 39, 0, 3107, 1, 0, 0, 0, 39, 0, 16, 0, 0, 3, 61, 1, 144, 0, 0, 2076, 7307, 405, 660, 155, 10, 19, 0, 30, 1, 0, 3, 0, 128, 1507, 48, 1, 1931, 0, 0, 13, 399, 7, 10, 0, 0, 133, 1, 1931, 122, 28, 208, 279, 1, 2396, 162, 0, 0, 0, 0, 164, 19, 0, 105, 15, 0, 0, 44, 88, 392, 1, 2396, 162, 1881, 0, 3, 30, 217, 7, 0, 24, 108, 0, 64, 96, 9, 668, 0, 0, 0, 10, 6, 0, 224, 0, 0, 23, 3788, 123, 94, 5, 0, 3, 41, 21, 62, 175, 5, 61, 10, 0, 103, 0, 0, 64, 6, 2621, 2, 0, 857, 18, 44, 123, 113, 3, 2, 123, 0, 0, 63, 143, 11, 96, 10, 30, 31, 259, 145, 13, 2, 523, 608, 20, 1, 369, 0, 1, 617, 13, 207, 0, 64, 271, 585, 5, 257, 56, 16, 1, 461, 19, 379, 0, 18, 22, 0, 0]
[0, 4, 31, 0, 697, 138, 5470, 0, 36, 0, 485, 41, 34, 67, 2, 982, 9029, 459, 60, 

In [41]:
encoded_test_labels = [1 if label =='positive' else 0 for label in testData['Label']]
encoded_test_labels = np.array(encoded_test_labels)
encoded_test_labels

array([0, 0, 0, ..., 1, 1, 1])

In [42]:
max_length = 500
padded_test_docs = pad_sequences(encoded_test_docs, maxlen=max_length, padding='post')
print(padded_test_docs)

[[  0 173   0 ...   0   0   0]
 [  0   6  33 ...   0   0   0]
 [  0   4  31 ...   0   0   0]
 ...
 [  0  47   0 ...  36 305   0]
 [  0   0  15 ...   0   0   0]
 [  0 105  10 ...   0   0   0]]


###Predicting on test data

In [43]:
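#ynew = model.predict_classes(padded_test_docs)
# the model has a single sigmoid output, so threshold the probability at 0.5
# (argmax over a single column would always return 0)
ynew = (model.predict(padded_test_docs) > 0.5).astype('int32').ravel()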
# show the inputs and predicted outputs
#for i in range(len(padded_test_docs)):
#	print("X=%s, Labels=%s, Predicted=%s" % (padded_test_docs[i], encoded_test_labels[i], ynew[i]))

In [44]:
testData['Predicted_Label'] = ['negative' if i==0 else 'positive' for i in ynew]
testData[testData['Label'] != testData['Predicted_Label']].shape, testData.shape

((12500, 3), (24999, 3))
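
The classifier ends in a single sigmoid unit, so model.predict() returns one probability per review. Taking the argmax over an axis of length 1 always yields index 0, which would label every review negative. Thresholding the probability at 0.5 is the usual way to obtain class labels. The sketch below illustrates the difference in plain Python with made-up probabilities; with NumPy the equivalent thresholding would be (probs > 0.5).astype('int32').

```python
# made-up sigmoid outputs with shape (4, 1): one probability per review
probs = [[0.91], [0.08], [0.55], [0.30]]

def argmax(row):
    """Index of the largest value in a row."""
    return max(range(len(row)), key=lambda j: row[j])

argmax_labels = [argmax(row) for row in probs]  # always 0 when there is one column
threshold_labels = [1 if row[0] > 0.5 else 0 for row in probs]

print(argmax_labels)
print(threshold_labels)
```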

##How to save and reuse models in Keras

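Model weights are saved in HDF5 format, a grid format that is ideal for storing multi-dimensional arrays of numbers.

The model structure can be described and saved using two different formats: JSON and YAML. Note that to_yaml() and model_from_yaml() were removed in TensorFlow 2.6, so the YAML cells further down require an older release.

To save weights in HDF5 format, we need to install __h5py__.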

In [45]:
!sudo pip install h5py



Saving the model architecture as JSON and the weights in HDF5 format (HDF5 handles multi-dimensional weight arrays).

In [46]:
from keras.models import model_from_json
# serialize model to JSON
model_json = model.to_json()
with open("Embeddings_Conceptmodel.json", "w") as json_file:
    json_file.write(model_json)
# serialize weights to HDF5
model.save_weights("Embeddings_Conceptmodel.h5")
print("Saved model to disk")


Saved model to disk


To create a new model from the saved JSON and weights, we can do the following:

In [47]:
# load json and create model
with open('/content/Embeddings_Conceptmodel.json', 'r') as json_file:
    loaded_model_json = json_file.read()
loaded_model = model_from_json(loaded_model_json)
# load weights into new model
loaded_model.load_weights("/content/Embeddings_Conceptmodel.h5")
print("Loaded model from disk")
 
# compile the loaded model, train it further, and evaluate it on the training data
loaded_model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
loaded_model.fit(padded_docs, encoded_labels, epochs=50, verbose=0, validation_split=0.2)
score = loaded_model.evaluate(padded_docs, encoded_labels, verbose=0)
print("%s: %.2f%%" % (loaded_model.metrics_names[1], score[1]*100))

Loaded model from disk
accuracy: 100.00%


###Saving the Model in YAML

In [48]:
!pip install PyYAML



In [49]:
from keras.models import model_from_yaml
# serialize model to YAML
model_yaml = model.to_yaml()
with open("Embeddings_Conceptmodel.yaml", "w") as yaml_file:
    yaml_file.write(model_yaml)
# serialize weights to HDF5
model.save_weights("Embeddings_Conceptmodel.h5")
print("Saved model to disk")


Saved model to disk


###Loading the model from the saved YAML file

In [50]:
# load yaml and create model
with open('/content/Embeddings_Conceptmodel.yaml', 'r') as yaml_file:
    loaded_model_yaml = yaml_file.read()
loaded_model = model_from_yaml(loaded_model_yaml)
# load weights into new model
loaded_model.load_weights("/content/Embeddings_Conceptmodel.h5")
print("Loaded model from disk")
 
# compile the loaded model, train it further, and evaluate it on the training data
loaded_model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
loaded_model.fit(padded_docs, encoded_labels, epochs=50, verbose=0, validation_split=0.2)
score = loaded_model.evaluate(padded_docs, encoded_labels, verbose=0)
print("%s: %.2f%%" % (loaded_model.metrics_names[1], score[1]*100))

Loaded model from disk
accuracy: 100.00%


###Save Model Weights and Architecture Together

Keras also supports a simpler interface to save both the model weights and model architecture together into a single H5 file.

Saving the model in this way includes everything we need to know about the model, including:

 - Model weights.
 - Model architecture.
 - Model compilation details (loss and metrics).
 - Model optimizer state.
 
This means that we can load and use the model directly, without having to re-compile it as we did above.

In [51]:
model.save("Embedding_Conceptmodel_Save.h5")

To load the saved model and use it, perform the following steps:

In [52]:
from keras.models import load_model 
# load model
loaded_model = load_model('Embedding_Conceptmodel_Save.h5')
# summarize model.
loaded_model.summary()
loaded_model.fit(padded_docs, encoded_labels, epochs=50, verbose=0, validation_split=0.2)
score = loaded_model.evaluate(padded_docs, encoded_labels, verbose=0)
print("%s: %.2f%%" % (loaded_model.metrics_names[1], score[1]*100))

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 500, 100)          1000100   
_________________________________________________________________
flatten_2 (Flatten)          (None, 50000)             0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 50001     
Total params: 1,050,101
Trainable params: 50,001
Non-trainable params: 1,000,100
_________________________________________________________________
accuracy: 100.00%


##Text Analysis - Prediction on Republic

A language model can predict the probability of the next word in the sequence, based on the words already observed in the sequence.

Neural network models are a preferred method for developing statistical language models because they can use a distributed representation where different words with similar meanings have similar representation and because they can use a large context of recently observed words when making predictions.


We will perform following:
- Data Preparation
- Train Language Model
- Use Language Model

Here is a direct link to the clean version of the data file:

###Data Preparation
We will start by preparing the data for modeling.

A quick look at the data shows the following:

 - Book/Chapter headings (e.g. “BOOK I.”).
 - British English spelling (e.g. “honoured”).
 - Lots of punctuation (e.g. “--”, “;--”, “?--”, and more).
 - Strange names (e.g. “Polemarchus”).
 - Some long monologues that go on for hundreds of lines.
 - Some quoted dialog (e.g. ‘…’).

These observations, and more, suggest ways that we may wish to prepare the text data.


###Language Model Design
We will develop a model of the text that we can then use to generate new sequences of text.

The language model will be statistical and will predict the probability of each word given an input sequence of text. The predicted word will be fed in as input to in turn generate the next word.

__A key design decision is how long the input sequences should be. They need to be long enough to allow the model to learn the context for the words to predict. This input length will also define the length of seed text used to generate new sequences when we use the model.__

There is no correct answer. With enough time and resources, we could explore the ability of the model to learn with differently sized input sequences. We will pick a length of 50 words for the length of the input sequences, somewhat arbitrarily.

To keep the example brief, we will let all of the text flow together and train the model to predict the next word across sentences, paragraphs, and even books or chapters in the text.

Now that we have a model design, we can look at transforming the raw text into sequences of 50 input words to 1 output word, ready to fit a model.

####Load Text
The first step is to load the text into memory.

We can develop a small function to load the entire text file into memory and return it. The function is called load_doc() and is listed below. Given a filename, it returns the loaded text as a single string.

In [53]:
#/content/drive/MyDrive/LSTM-Dataset/republic_clean.txt
def load_doc(filename):
	# open the file read-only, read all text, and close it on exit
	with open(filename, 'r') as file:
		text = file.read()
	return text

In [54]:
# load document
in_filename = '/content/drive/MyDrive/LSTM-Dataset/republic_clean.txt'
doc = load_doc(in_filename)
print(doc[:200])

BOOK I.

I went down yesterday to the Piraeus with Glaucon the son of Ariston,
that I might offer up my prayers to the goddess (Bendis, the Thracian
Artemis.); and also because I wanted to see in what


###Clean Text
We need to transform the raw text into a sequence of tokens or words that we can use as a source to train the model.

Based on reviewing the raw text (above), below are some specific operations we will perform to clean the text. 

 - Replace ‘--’ with a white space so we can split words better.
 - Split words based on white space.
 - Remove all punctuation from words to reduce the vocabulary size (e.g. ‘What?’ becomes ‘What’).
 - Remove all words that are not alphabetic to remove standalone punctuation tokens.
 - Normalize all words to lowercase to reduce the vocabulary size.

Vocabulary size is a big deal with language modeling. A smaller vocabulary results in a smaller model that trains faster.

We can implement each of these cleaning operations in this order in a function. Below is the function clean_doc() that takes a loaded document as an argument and returns a list of clean tokens.

In [55]:
import string

# turn a doc into clean tokens
def clean_doc(doc):
	# replace '--' with a space ' '
	doc = doc.replace('--', ' ')
	# split into tokens by white space
	tokens = doc.split()
	# remove punctuation from each token
	table = str.maketrans('', '', string.punctuation)
	tokens = [w.translate(table) for w in tokens]
	# remove remaining tokens that are not alphabetic
	tokens = [word for word in tokens if word.isalpha()]
	# make lower case
	tokens = [word.lower() for word in tokens]
	return tokens
# clean document
tokens = clean_doc(doc)
print(tokens[:200])
print('Total Tokens: %d' % len(tokens))
print('Unique Tokens: %d' % len(set(tokens)))

['book', 'i', 'i', 'went', 'down', 'yesterday', 'to', 'the', 'piraeus', 'with', 'glaucon', 'the', 'son', 'of', 'ariston', 'that', 'i', 'might', 'offer', 'up', 'my', 'prayers', 'to', 'the', 'goddess', 'bendis', 'the', 'thracian', 'artemis', 'and', 'also', 'because', 'i', 'wanted', 'to', 'see', 'in', 'what', 'manner', 'they', 'would', 'celebrate', 'the', 'festival', 'which', 'was', 'a', 'new', 'thing', 'i', 'was', 'delighted', 'with', 'the', 'procession', 'of', 'the', 'inhabitants', 'but', 'that', 'of', 'the', 'thracians', 'was', 'equally', 'if', 'not', 'more', 'beautiful', 'when', 'we', 'had', 'finished', 'our', 'prayers', 'and', 'viewed', 'the', 'spectacle', 'we', 'turned', 'in', 'the', 'direction', 'of', 'the', 'city', 'and', 'at', 'that', 'instant', 'polemarchus', 'the', 'son', 'of', 'cephalus', 'chanced', 'to', 'catch', 'sight', 'of', 'us', 'from', 'a', 'distance', 'as', 'we', 'were', 'starting', 'on', 'our', 'way', 'home', 'and', 'told', 'his', 'servant', 'to', 'run', 'and', 'bid',


Next, we can look at shaping the tokens into sequences and saving them to file.

####Save Clean Text
We can organize the long list of tokens into sequences of 50 input words and 1 output word.

That is, sequences of 51 words.

We can do this by iterating over the list of tokens from token 51 onwards and taking the prior 51 tokens (50 inputs plus 1 output) as a sequence, then repeating this process to the end of the list of tokens.

We will transform the tokens into space-separated strings for later storage in a file.


In [56]:
# organize into sequences of tokens
length = 50 + 1
sequences = list()
for i in range(length, len(tokens)):
	# select sequence of tokens
	seq = tokens[i-length:i]
	# convert into a line
	line = ' '.join(seq)
	# store
	sequences.append(line)
print('Total Sequences: %d' % len(sequences))

Total Sequences: 118634
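
A toy run of the same sliding window makes the indexing easier to follow. The sketch below uses a made-up list of six tokens and a window of 2 input words plus 1 output word. As written, the loop yields len(tokens) - length sequences and never uses the final token as an output; range(length, len(tokens) + 1) would include that last window as well.

```python
tokens = ['a', 'b', 'c', 'd', 'e', 'f']  # made-up token list
length = 2 + 1                           # 2 input words + 1 output word

sequences = []
for i in range(length, len(tokens)):
    # the window covers the `length` tokens that end just before position i
    sequences.append(' '.join(tokens[i - length:i]))

print(sequences)
print(len(sequences) == len(tokens) - length)
```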


####Saving the Data after cleaning

In [57]:
# save sequences to file, one sequence per line
def save_doc(lines, filename):
	data = '\n'.join(lines)
	with open(filename, 'w') as file:
		file.write(data)
# save sequences to file
out_filename = '/content/drive/MyDrive/LSTM-Dataset/republic_sequences.txt'
save_doc(sequences, out_filename)

###Train Language Model
We can now train a statistical language model from the prepared data.

The model we will train is a neural language model. It has a few unique characteristics:

 - It uses a distributed representation for words so that different words with similar meanings will have a similar representation.
 - It learns the representation at the same time as learning the model.
 - It learns to predict the probability for the next word using the context of the last 50 words.
 
Specifically, we will use an Embedding Layer to learn the representation of words, and a Long Short-Term Memory (LSTM) recurrent neural network to learn to predict words based on their context.

Let’s start by loading our training data.

####Load Sequences
We can load our training data using the load_doc() function we developed in the previous section.

Once loaded, we can split the data into separate training sequences by splitting based on new lines.

In [58]:
# load
in_filename = '/content/drive/MyDrive/LSTM-Dataset/republic_sequences.txt'
doc = load_doc(in_filename)
lines = doc.split('\n')
lines[:3]

['book i i went down yesterday to the piraeus with glaucon the son of ariston that i might offer up my prayers to the goddess bendis the thracian artemis and also because i wanted to see in what manner they would celebrate the festival which was a new thing i was',
 'i i went down yesterday to the piraeus with glaucon the son of ariston that i might offer up my prayers to the goddess bendis the thracian artemis and also because i wanted to see in what manner they would celebrate the festival which was a new thing i was delighted',
 'i went down yesterday to the piraeus with glaucon the son of ariston that i might offer up my prayers to the goddess bendis the thracian artemis and also because i wanted to see in what manner they would celebrate the festival which was a new thing i was delighted with']

####Encode Sequences
The word embedding layer expects input sequences to be comprised of integers.

We can map each word in our vocabulary to a unique integer and encode our input sequences. Later, when we make predictions, we can convert the prediction to numbers and look up their associated words in the same mapping.

To do this encoding, we will use the __Tokenizer__ class in the Keras API.

First, the Tokenizer must be trained on the entire training dataset, which means it finds all of the unique words in the data and assigns each a unique integer.

We can then use the fit Tokenizer to encode all of the training sequences, converting each sequence from a list of words to a list of integers.

In [59]:
from keras.preprocessing.text import Tokenizer
# integer encode sequences of words
tokenizer = Tokenizer()
tokenizer.fit_on_texts(lines)
sequences = tokenizer.texts_to_sequences(lines)

In [60]:
#to get the mapping of words to integer
tokenizer.word_index
#to know the length of vocabulary as we would need in embedding layer
vocab_size = len(tokenizer.word_index) + 1
vocab_size

7412

####Sequence Inputs and Output
Now that we have encoded the input sequences, we need to separate them into input (X) and output (y) elements.

We can do this with array slicing.

After separating, we need to one hot encode the output word. This means converting it from an integer to a vector of 0 values, one for each word in the vocabulary, with a 1 at the index of the word's integer value.

This is so that the model learns to predict the probability distribution for the next word, and the ground truth to learn from is 0 for all words except the actual word that comes next.

Keras provides the to_categorical() function, which can be used to one hot encode the output words for each input-output sequence pair.

Finally, we need to specify to the Embedding layer how long input sequences are. We know that there are 50 words because we designed the model, but a good generic way to specify that is to use the second dimension (number of columns) of the input data’s shape. That way, if you change the length of sequences when preparing data, you do not need to change this data loading code; it is generic.

In [61]:
# separate into input and output
from numpy import array
from keras.utils import to_categorical
sequences = array(sequences)
X, y = sequences[:,:-1], sequences[:,-1]
y = to_categorical(y, num_classes=vocab_size)
seq_length = X.shape[1]
print(X.shape, y.shape)

(118634, 50) (118634, 7412)
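
The sketch below shows what one hot encoding does to a single target index, using a hypothetical one_hot() helper in plain Python, and then estimates the size of the full target matrix from the shapes printed above, assuming 4-byte float32 values. If memory becomes a constraint, one alternative (not used in this notebook) is to keep integer targets and train with the sparse_categorical_crossentropy loss.

```python
def one_hot(index, num_classes):
    """Vector of zeros with a single 1 at position index."""
    vec = [0] * num_classes
    vec[index] = 1
    return vec

print(one_hot(2, 5))

# approximate size of the one-hot target matrix, assuming 4-byte float32 values
n_sequences, vocab_size = 118634, 7412
size_gb = n_sequences * vocab_size * 4 / 1e9
print('%.2f GB' % size_gb)
```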


###Fit Model
We can now define and fit our language model on the training data.

The learned embedding needs to know the size of the vocabulary and the length of input sequences as previously discussed. It also has a parameter to specify how many dimensions will be used to represent each word. That is, the size of the embedding vector space.

Common values are 50, 100, and 300. We will use 50 here, but consider testing smaller or larger values.

We will use two LSTM hidden layers with 100 memory cells each. More memory cells and a deeper network may achieve better results.

A dense fully connected layer with 100 neurons connects to the LSTM hidden layers to interpret the features extracted from the sequence. The output layer predicts the next word as a single vector the size of the vocabulary with a probability for each word in the vocabulary. A softmax activation function is used to ensure the outputs have the characteristics of normalized probabilities.

In [62]:
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Embedding
# define model
model = Sequential()
model.add(Embedding(vocab_size, 50, input_length=seq_length))
model.add(LSTM(100, return_sequences=True))
model.add(LSTM(100))
model.add(Dense(100, activation='relu'))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 50, 50)            370600    
_________________________________________________________________
lstm (LSTM)                  (None, 50, 100)           60400     
_________________________________________________________________
lstm_1 (LSTM)                (None, 100)               80400     
_________________________________________________________________
dense_3 (Dense)              (None, 100)               10100     
_________________________________________________________________
dense_4 (Dense)              (None, 7412)              748612    
Total params: 1,270,112
Trainable params: 1,270,112
Non-trainable params: 0
_________________________________________________________________
None
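
The parameter counts in the summary follow from the layer sizes. An LSTM layer has four gates, each with input weights, recurrent weights, and a bias. The sketch below reproduces the numbers above; lstm_params() and dense_params() are hypothetical helper names.

```python
def lstm_params(input_dim, units):
    """4 gates, each with input weights, recurrent weights, and a bias."""
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    """One weight per input-unit pair plus one bias per unit."""
    return input_dim * units + units

vocab_size, embed_dim = 7412, 50
counts = [
    vocab_size * embed_dim,          # embedding
    lstm_params(embed_dim, 100),     # first LSTM
    lstm_params(100, 100),           # second LSTM
    dense_params(100, 100),          # hidden dense layer
    dense_params(100, vocab_size),   # softmax output
]
print(counts, sum(counts))
```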


####Training the model

In [63]:
# compile model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
model.fit(X, y, batch_size=128, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fcf922ddc10>

###Save Model
At the end of the run, the trained model is saved to file.

Here, we use the Keras model API to save the model to the file ‘languagemodel.h5’ in the LSTM-Dataset folder on Google Drive.

Later, when we load the model to make predictions, we will also need the mapping of words to integers. This is in the Tokenizer object, and we can save that too using Pickle.

In [64]:
# save the model to file
from pickle import dump
model.save('/content/drive/MyDrive/LSTM-Dataset/languagemodel.h5')
# save the tokenizer
with open('/content/drive/MyDrive/LSTM-Dataset/tokenizer.pkl', 'wb') as f:
	dump(tokenizer, f)

###Use Language Model
Now that we have a trained language model, we can use it.

In this case, we can use it to generate new sequences of text that have the same statistical properties as the source text.

This is not practical, at least not for this example, but it gives a concrete example of what the language model has learned.

We will start by loading the training sequences again.

In [65]:
# load cleaned text sequences
in_filename = '/content/drive/MyDrive/LSTM-Dataset/republic_sequences.txt'
doc = load_doc(in_filename)
lines = doc.split('\n')

We need the text so that we can choose a source sequence as input to the model for generating a new sequence of text.

The model will require 50 words as input.

Later, we will need to specify the expected length of input. We can determine this from the input sequences by calculating the length of one line of the loaded data and subtracting 1 for the expected output word that is also on the same line.

In [66]:
seq_length = len(lines[0].split()) - 1
seq_length

50

###Load Model
We can now load the model from file.

Keras provides the load_model() function for loading the model, ready for use.

In [67]:
from random import randint
from pickle import load
from keras.models import load_model
loaded_model = load_model('/content/drive/MyDrive/LSTM-Dataset/languagemodel.h5')

Load the tokenizer from file using the Pickle API.

In [68]:
# load the tokenizer
with open('/content/drive/MyDrive/LSTM-Dataset/tokenizer.pkl', 'rb') as f:
	tokenizer = load(f)

###Generate Text
The first step in generating text is preparing a seed input.

We will select a random line of text from the input text for this purpose. Once selected, we will print it so that we have some idea of what was used.

In [69]:
print(lines[0])
print(lines[118633])

book i i went down yesterday to the piraeus with glaucon the son of ariston that i might offer up my prayers to the goddess bendis the thracian artemis and also because i wanted to see in what manner they would celebrate the festival which was a new thing i was
one another and to the gods both while remaining here and when like conquerors in the games who go round to gather gifts we receive our reward and it s hall be well with us both in this life and in the pilgrimage of a thousand years which we have been


In [70]:
# select a seed text (randint includes both ends, so the upper bound is len(lines)-1)
seed_text = lines[randint(0,len(lines)-1)]
print(seed_text + '\n')
print(len(seed_text), len(lines))
print(randint(0,len(lines)-1))

many are their friends and for all these reasons they will be unwilling to waste their lands and rase their houses their enmity to them will only last until the many innocent sufferers have compelled the guilty few to give satisfaction i agree he said that our citizens should thus deal

286 118634
7975


Next, we can generate new words, one at a time.

First, the seed text must be encoded to integers using the same tokenizer that we used when training the model.

In [71]:
encoded = tokenizer.texts_to_sequences([seed_text])[0]

In [72]:
from keras.preprocessing.sequence import pad_sequences
encoded = pad_sequences([encoded], maxlen=seq_length)


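The model predicts a probability for every word in the vocabulary. Older Keras versions offered model.predict_classes() to return the index of the most probable word directly; it has been removed from recent releases, so we take np.argmax() over the output of model.predict() instead.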

In [73]:
# predict probabilities for each word and take the index of the most probable one
import numpy as np
#yhat = loaded_model.predict_classes(encoded, verbose=0)
yhat = np.argmax(loaded_model.predict(encoded), axis=-1)

We can then look up the index in the Tokenizer's mapping to get the associated word.

In [74]:
out_word = ''
for word, index in tokenizer.word_index.items():
	if index == yhat:
		out_word = word
		break
out_word

'to'
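
The loop above scans the whole vocabulary for every predicted word. A reverse dictionary built once gives the same answer with a single lookup. The sketch below uses a made-up three-word mapping in place of tokenizer.word_index; the Keras Tokenizer also exposes an index_word dictionary in the versions I am aware of, which could be used directly if your version has it. With the real model output, yhat is an array of shape (1,), so int(yhat[0]) would be the lookup key.

```python
word_index = {'the': 1, 'to': 2, 'and': 3}  # made-up stand-in for tokenizer.word_index
index_to_word = {index: word for word, index in word_index.items()}

yhat = 2  # hypothetical predicted index
out_word = index_to_word.get(yhat, '')
print(out_word)
```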

Putting it all together:

In [75]:
def generate_seq(model, tokenizer, seq_length, seed_text, n_words):
  result = list()
  in_text = seed_text
  #print(in_text)
  for _ in range(n_words):
    encoded = tokenizer.texts_to_sequences([in_text])[0]
    encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')
    
    yhat = np.argmax(model.predict(encoded), axis=-1)    
    out_word = ''
    for word, index in tokenizer.word_index.items():
      if index == yhat:
        out_word = word
        #print(yhat, out_word)
        break
    in_text += ' ' + out_word
    #print(in_text)
    result.append(out_word)
  return ' '.join(result)  
# generate new text
print('Original text:')
print(seed_text)
generated = generate_seq(loaded_model, tokenizer, seq_length, seed_text, 50)
print('Generated text:')
print(generated)

Original text:
many are their friends and for all these reasons they will be unwilling to waste their lands and rase their houses their enmity to them will only last until the many innocent sufferers have compelled the guilty few to give satisfaction i agree he said that our citizens should thus deal
Generated text:
to the same and the same and the same and the same and the same and the same and the same and the same and the same and the same and the same and the same and the same and the same and the same and the same and the
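
The repeated phrase is typical of greedy decoding, where the single most probable word is chosen at every step. One common alternative, not used in this notebook, is to sample the next word from the predicted distribution, optionally rescaled by a temperature. The sketch below shows the idea in plain Python on made-up probabilities; sample_with_temperature() is a hypothetical helper, and in generate_seq() it would replace the np.argmax() call, applied to the row of probabilities returned by model.predict().

```python
import math
import random

def sample_with_temperature(probs, temperature=1.0, rng=random):
    """Sample an index from probs after rescaling by temperature."""
    logits = [math.log(p + 1e-9) / temperature for p in probs]
    top = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(l - top) for l in logits]
    return rng.choices(range(len(probs)), weights=weights, k=1)[0]

probs = [0.5, 0.3, 0.2]  # made-up next-word probabilities
rng = random.Random(0)
draws = [sample_with_temperature(probs, temperature=1.0, rng=rng) for _ in range(1000)]
print([draws.count(i) for i in range(len(probs))])
```

A temperature close to 0 approaches greedy decoding, while values above 1 flatten the distribution and produce more varied, less predictable text.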
