# APPROACHES:

1. Load the libraries and data.
2. Clean the data.
3. Tokenize the data.
4. Convert to Sequence.
5. Input Sequence and Output Sequence.
6. Create the Sequential model.
7. LSTM Layers.
8. Compile the model.
9. Fit the model.
10. Evaluate the model.

### Importing:

In [2]:
from random import randint
from pickle import load, dump
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.sequence import pad_sequences
from numpy import array
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Embedding, Dropout
from tensorflow.keras.callbacks import ModelCheckpoint
import string

# Used to fetch the data from the URL
import urllib.request
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [3]:
response = urllib.request.urlopen('https://raw.githubusercontent.com/insaid2018/DeepLearning/master/Data/republic_clean.txt?_sm_au_=iVV10f0f2kPt2J07')
doc = response.read().decode('utf8')

In [4]:
print(doc[:100])

﻿BOOK I.

I went down yesterday to the Piraeus with Glaucon the son of Ariston,
that I might 


### CLEANING:

In [5]:
def clean_doc(doc):
  #replace '--' with space ' '
  doc = doc.replace('--', ' ')
  #split into tokens by white space
  tokens = doc.split()
  #remove punctuation from each token
  table = str.maketrans('', '', string.punctuation)
  tokens = [w.translate(table) for w in tokens]
  #remove remaining tokens that are not alphabetic
  tokens = [word for word in tokens if word.isalpha()]
  #make lower case
  tokens = [word.lower() for word in tokens]
  return tokens

tokens = clean_doc(doc)

In [6]:
tokens[:100]

['book',
 'i',
 'i',
 'went',
 'down',
 'yesterday',
 'to',
 'the',
 'piraeus',
 'with',
 'glaucon',
 'the',
 'son',
 'of',
 'ariston',
 'that',
 'i',
 'might',
 'offer',
 'up',
 'my',
 'prayers',
 'to',
 'the',
 'goddess',
 'bendis',
 'the',
 'thracian',
 'artemis',
 'and',
 'also',
 'because',
 'i',
 'wanted',
 'to',
 'see',
 'in',
 'what',
 'manner',
 'they',
 'would',
 'celebrate',
 'the',
 'festival',
 'which',
 'was',
 'a',
 'new',
 'thing',
 'i',
 'was',
 'delighted',
 'with',
 'the',
 'procession',
 'of',
 'the',
 'inhabitants',
 'but',
 'that',
 'of',
 'the',
 'thracians',
 'was',
 'equally',
 'if',
 'not',
 'more',
 'beautiful',
 'when',
 'we',
 'had',
 'finished',
 'our',
 'prayers',
 'and',
 'viewed',
 'the',
 'spectacle',
 'we',
 'turned',
 'in',
 'the',
 'direction',
 'of',
 'the',
 'city',
 'and',
 'at',
 'that',
 'instant',
 'polemarchus',
 'the',
 'son',
 'of',
 'cephalus',
 'chanced',
 'to',
 'catch',
 'sight']

In [7]:
print('The total tokens: ', len(tokens))
print('The unique tokens: ', len(set(tokens)))
# Corpus size: total number of tokens
# Vocabulary size: number of unique tokens

The total tokens:  118684
The unique tokens:  7409
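
Beyond the two totals, a frequency count is a quick sanity check on the vocabulary. The sketch below uses `collections.Counter` on a short made-up list; `sample_tokens` is a stand-in chosen for illustration, not the real corpus.

```python
from collections import Counter

# Hypothetical stand-in for the `tokens` list returned by clean_doc
sample_tokens = ['the', 'son', 'of', 'ariston', 'the', 'son', 'of', 'cephalus', 'the']

counts = Counter(sample_tokens)
corpus_size = len(sample_tokens)   # total tokens
vocabulary_size = len(counts)      # unique tokens

print('Total tokens:', corpus_size)
print('Unique tokens:', vocabulary_size)
print('Most common:', counts.most_common(3))
```

Running the same three lines on the real `tokens` list should reproduce the totals printed above.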


### CREATE SEQUENCES:

In [8]:
# Each sequence holds 50 tokens: 49 input words plus the word to predict.
# Each subsequent sequence drops the first token of the previous one and appends the next word.

length = 50
sequences = list()
for i in range(length, len(tokens)):
  #select sequence of tokens
  seq = tokens[i-length:i]
  #convert into a line
  line = ' '.join(seq)
  #store in sequences list
  sequences.append(line)
print('Total Sequences: %d' % len(sequences))

Total Sequences: 118634
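
The sliding window is easier to follow on a handful of words. This is a minimal sketch with a made-up token list and a window of 4 instead of 50; `toy_tokens` and `window` are names chosen for illustration.

```python
# Toy token list; the notebook uses the full cleaned corpus instead
toy_tokens = ['i', 'went', 'down', 'yesterday', 'to', 'the', 'piraeus']
window = 4  # the notebook uses length = 50

windows = []
for i in range(window, len(toy_tokens)):
    # tokens[i - window:i] excludes index i, exactly as in the cell above
    windows.append(' '.join(toy_tokens[i - window:i]))

for w in windows:
    print(w)
```

The loop yields `len(tokens) - window` windows, which is consistent with 118684 - 50 = 118634 above. Because the slice stops before index `i`, the final token of the corpus never appears in any window.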


In [9]:
sequences[:10]

['book i i went down yesterday to the piraeus with glaucon the son of ariston that i might offer up my prayers to the goddess bendis the thracian artemis and also because i wanted to see in what manner they would celebrate the festival which was a new thing i',
 'i i went down yesterday to the piraeus with glaucon the son of ariston that i might offer up my prayers to the goddess bendis the thracian artemis and also because i wanted to see in what manner they would celebrate the festival which was a new thing i was',
 'i went down yesterday to the piraeus with glaucon the son of ariston that i might offer up my prayers to the goddess bendis the thracian artemis and also because i wanted to see in what manner they would celebrate the festival which was a new thing i was delighted',
 'went down yesterday to the piraeus with glaucon the son of ariston that i might offer up my prayers to the goddess bendis the thracian artemis and also because i wanted to see in what manner they would cele

In [10]:
# Save the sequences to a file on disk so they can be reloaded later
# without recomputing them if the in-memory data is lost.

# save sequences to file, one sequence per line
def save_doc(lines, filename):
  data = '\n'.join(lines)
  with open(filename, 'w') as file:
    file.write(data)

#save sequences to file
out_filename = 'republic_sequences.txt'
save_doc(sequences, out_filename)

In [11]:
def load_doc(filename):
    # open the file as read only and read all text;
    # the file is closed automatically on exit
    with open(filename, 'r') as file:
        text = file.read()
    return text

#load
in_filename = 'republic_sequences.txt'
doc = load_doc(in_filename)
lines = doc.split('\n')

This reloads the sequences that were saved to disk above.

### Tokenize and convert the data into sequences.

In [12]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(lines)
sequences = tokenizer.texts_to_sequences(lines)

This assigns a unique integer to each token.
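
As a rough picture of what that mapping looks like, the sketch below builds a frequency-ranked index in plain Python. It approximates the idea and is not the Keras implementation; `toy_lines` is a made-up input.

```python
from collections import Counter

# Two toy lines standing in for the 50-word lines in the notebook
toy_lines = ['the son of ariston', 'the son of the city']

counts = Counter(word for line in toy_lines for word in line.split())

# Rank words by frequency, starting at 1; index 0 is left unused
# (Keras reserves it for padding)
word_index = {word: rank
              for rank, (word, _) in enumerate(counts.most_common(), start=1)}

encoded = [[word_index[w] for w in line.split()] for line in toy_lines]
print(word_index)
print(encoded)
```

The most frequent word receives the smallest ID, which fits the arrays shown later, where "the" is encoded as 1.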

### Convert the sequences to an array:

In [13]:
sequences = array(sequences)

In [14]:
sequences

array([[1046,   11,   11, ...,  549,  151,   11],
       [  11,   11, 1045, ...,  151,   11,   57],
       [  11, 1045,  329, ...,   11,   57, 1147],
       ...,
       [ 467,    4,   33, ...,  414,   13,   21],
       [   4,   33,   79, ...,   13,   21,   23],
       [  33,   79,    2, ...,   21,   23,   85]])

In [15]:
# View the first sequence (50 token IDs).

sequences[0]

array([1046,   11,   11, 1045,  329, 7409,    4,    1, 2873,   35,  213,
          1,  261,    3, 2251,    9,   11,  179,  817,  123,   92, 2872,
          4,    1, 2249, 7408,    1, 7407, 7406,    2,   75,  120,   11,
       1266,    4,  110,    6,   30,  168,   16,   49, 7405,    1, 1609,
         13,   57,    8,  549,  151,   11])

          X                       Y

Jack and jill went up the _____ |  hill

*In the line above we want to predict "hill", so "hill" is the target (Y). In the next line, "hill" becomes part of the input (X).*

and jill went up the hill _____ | to

jill went up the hill to  _____ | fetch

went up the hill to fetch _____ |   a
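
The same table can be produced in a few lines of Python. This sketch uses the nursery-rhyme words from the illustration and a window of 7 (6 input words plus 1 target); the variable names are chosen for illustration.

```python
words = 'jack and jill went up the hill to fetch a'.split()
window = 7  # 6 input words + 1 target word

pairs = []
# Iterating to len(words) + 1 lets the last word serve as a target too
for i in range(window, len(words) + 1):
    chunk = words[i - window:i]
    pairs.append((chunk[:-1], chunk[-1]))  # (X, Y)

for x_words, y_word in pairs:
    print(' '.join(x_words), '|', y_word)
```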

In [16]:
 sequences[0]

array([1046,   11,   11, 1045,  329, 7409,    4,    1, 2873,   35,  213,
          1,  261,    3, 2251,    9,   11,  179,  817,  123,   92, 2872,
          4,    1, 2249, 7408,    1, 7407, 7406,    2,   75,  120,   11,
       1266,    4,  110,    6,   30,  168,   16,   49, 7405,    1, 1609,
         13,   57,    8,  549,  151,   11])

### X & Y

In [17]:
x,y = sequences[:,:-1], sequences[:,-1]

In [18]:
x[0]

array([1046,   11,   11, 1045,  329, 7409,    4,    1, 2873,   35,  213,
          1,  261,    3, 2251,    9,   11,  179,  817,  123,   92, 2872,
          4,    1, 2249, 7408,    1, 7407, 7406,    2,   75,  120,   11,
       1266,    4,  110,    6,   30,  168,   16,   49, 7405,    1, 1609,
         13,   57,    8,  549,  151])

In [19]:
y[0]

11

### Create an LSTM model:

In [20]:
vocab_size = len(tokenizer.word_index) + 1
print(vocab_size)
y = to_categorical(y, num_classes=vocab_size)

7410


In [21]:
x.shape[1]

49

            X                          Y
                                      51             7409 - voc
                                       |              |
[1 2 3 4 ............50]-------------[0 0 0 0 0 1 0 0 0 0 0 0  0 ]

[2 3 4 5 ............51]-------------[0 0 0 0 0 0 1 0 0 0 0 0  0 ]

*The token IDs are plain integers, so if the target were left as a single number (for example 51), the task would look like a regression problem. Next-word prediction is a classification problem over the vocabulary, so the target is one-hot encoded with `to_categorical`.*
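
One-hot encoding itself is simple enough to sketch in plain Python. The helper `one_hot` and the toy values below are assumptions for illustration; in the notebook this step is done by `to_categorical`.

```python
def one_hot(index, num_classes):
    """Return a list of zeros with a single 1 at position `index`."""
    vector = [0] * num_classes
    vector[index] = 1
    return vector

toy_targets = [2, 0, 3]   # hypothetical token IDs
toy_vocab_size = 5        # hypothetical; the notebook uses 7410

encoded = [one_hot(t, toy_vocab_size) for t in toy_targets]
for row in encoded:
    print(row)
```

With 118,634 targets and 7,410 classes, the one-hot matrix has roughly 879 million entries. If memory becomes a concern, one option is to keep the integer targets and use Keras's `sparse_categorical_crossentropy` loss instead.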

In [22]:
model = Sequential()
model.add(Embedding(vocab_size, 100, input_length=x.shape[1])) # Embedding Layer
model.add(LSTM(100, return_sequences=True)) # LSTM Layer 1
model.add(LSTM(100)) #LSTM Layer2
model.add(Dense(100, activation='relu')) # Fully connected layer 1 (classification layer)
model.add(Dense(vocab_size, activation='softmax')) #Output Layer
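
To get a feel for the size of this network, the parameter count can be worked out by hand. The sketch below assumes the standard LSTM layout (four gates, each with input weights, recurrent weights, and one bias vector); `model.summary()` remains the authoritative check.

```python
vocab_size = 7410   # printed earlier in the notebook
embed_dim = 100
units = 100

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # weight matrix plus bias vector
    return input_dim * units + units

layers = {
    'embedding': vocab_size * embed_dim,
    'lstm_1': lstm_params(embed_dim, units),
    'lstm_2': lstm_params(units, units),
    'dense': dense_params(units, 100),
    'output': dense_params(100, vocab_size),
}
total = sum(layers.values())

for name, n in layers.items():
    print(name, n)
print('total', total)
```

Under these assumptions the embedding and output layers account for most of the parameters, since both scale with the vocabulary size.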



### Compile and fit the model:

In [23]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
hist = model.fit(x,y, batch_size = 128, epochs = 100)

Epoch 1/100
927/927 ━━━━━━━━━━━━━━━━━━━━ 22s 16ms/step - accuracy: 0.0627 - loss: 6.4585
Epoch 2/100
927/927 ━━━━━━━━━━━━━━━━━━━━ 14s 16ms/step - accuracy: 0.1057 - loss: 5.6817
Epoch 3/100
927/927 ━━━━━━━━━━━━━━━━━━━━ 21s 16ms/step - accuracy: 0.1297 - loss: 5.4468
Epoch 4/100
927/927 ━━━━━━━━━━━━━━━━━━━━ 21s 16ms/step - accuracy: 0.1481 - loss: 5.2679
Epoch 5/100
927/927 ━━━━━━━━━━━━━━━━━━━━ 20s 16ms/step - accuracy: 0.1550 - loss: 5.1466
Epoch 6/100
927/927 ━━━━━━━━━━━━━━━━━━━━ 21s 16ms/step - accuracy: 0.1656 - loss: 5.0250
Epoch 7/100
927/927 ━━━━━━━━━━━━━━━━━━━━ 20s 16ms/step - accuracy: 0.1733 - loss: 4.9231
Epoch 8/100
927/927 ━━━━━━━━━━━━━━━━━━━━ 14s 16ms/step - accuracy: 0.1772 - loss: 4.8291
Epoch 9/100
...
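
Two numbers in this log can be checked with simple arithmetic: the 927 steps per epoch, and how the loss compares with uniform guessing over the vocabulary. This is a small sketch using values printed earlier in the notebook.

```python
import math

n_sequences = 118634   # "Total Sequences" printed earlier
batch_size = 128
vocab_size = 7410

# Number of batches needed to cover every sequence once
steps_per_epoch = math.ceil(n_sequences / batch_size)
print(steps_per_epoch)

# Cross-entropy of a uniform guess over the vocabulary,
# for comparison with the logged loss
uniform_loss = math.log(vocab_size)
print(round(uniform_loss, 2))
```

The uniform-guess loss is about 8.9, so the first-epoch loss of 6.46 already shows the model has learned something about word frequencies.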

### Save a Copy of the Model:

In [24]:
#Save the model
model.save('model.h5')



In [25]:
# save a copy of the tokenizer
with open('tokenizer.pkl', 'wb') as f:
    dump(tokenizer, f)

### INFERENCE PIPELINE:


1. Pick a random seed text
2. Encode it with the tokenizer
3. Pad or truncate to a fixed length
4. Predict the next word
5. Map the predicted index back to a word with the tokenizer
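
Step 3 is the least obvious of these. The sketch below mimics pre-padding and pre-truncation in plain Python; `pad_pre` is a hypothetical helper written for illustration, while the notebook itself uses Keras's `pad_sequences`.

```python
def pad_pre(ids, maxlen, pad_value=0):
    """Keep the last `maxlen` IDs; left-pad with `pad_value` if too short."""
    ids = ids[-maxlen:]
    return [pad_value] * (maxlen - len(ids)) + ids

short = pad_pre([5, 8, 2], maxlen=5)             # too short: padded on the left
long_ = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)  # too long: oldest IDs dropped
print(short)
print(long_)
```

Truncating from the front keeps the most recent words, which are the ones the model should condition on as the generated text grows.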

In [26]:
len(lines[0].split())

50

In [27]:
seq_length = len(lines[0].split())-1

In [28]:
# generate a sequence from a language model
def generate_seq(model, tokenizer, seq_length, seed_text, n_words):
    result = list()
    in_text = seed_text
    # generate a fixed number of words
    for _ in range(n_words):
        # encode the text as integers
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        # pad or truncate (from the front) to a fixed length
        encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')
        # predict probabilities for each word
        yhat = model.predict(encoded, verbose=0)
        # map the most probable index back to a word ('' if not found)
        out_word = tokenizer.index_word.get(int(np.argmax(yhat)), '')
        # append to input
        in_text += ' ' + out_word
        result.append(out_word)
    return ' '.join(result)


# load the model
model = load_model('model.h5')

# load the tokenizer
with open('tokenizer.pkl', 'rb') as f:
    tokenizer = load(f)

# select a seed text
# randint is inclusive at both ends, so the upper bound must be len(lines) - 1
seed_text = lines[randint(0, len(lines) - 1)]
print("seed_text:" + '\n')
print(seed_text + '\n')

# generate new text
generated = generate_seq(model, tokenizer, seq_length, seed_text, 50)
print("generated_text:" + '\n')
print(generated)



seed_text:

from a distance and it occurred to some one else that they might be found in another place which was larger and in which the letters were larger if they were the same and he could read the larger letters first and then proceed to the lesser this would have

generated_text:

been a circumcision of such natures are to be assigned to them but the same is true of the absence of thing which he is not to be allowed to do heaven and are not the same control itself to be sure he said and i should like to know
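
Each generated word comes from picking the highest-probability index and looking it up in the vocabulary. The sketch below shows that single step with a made-up vocabulary and probability vector; all values are hypothetical.

```python
# Hypothetical toy vocabulary and softmax output; index 0 is the padding slot
word_index = {'the': 1, 'son': 2, 'of': 3, 'ariston': 4}
probabilities = [0.01, 0.10, 0.15, 0.60, 0.14]

# Invert the mapping once instead of scanning word_index for every prediction
index_word = {index: word for word, index in word_index.items()}

# Greedy choice: the index with the highest probability
best_index = max(range(len(probabilities)), key=probabilities.__getitem__)
next_word = index_word.get(best_index, '')  # '' if the padding index wins
print(best_index, next_word)
```

Always taking the top index makes generation deterministic for a given seed, which may explain why greedy output tends to drift into repetitive phrasing.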


In [29]:
# length of the generated text in characters (not words)
print(len(generated))

231
