**IMPORTING REQUIRED LIBRARIES**

In [3]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.models import Sequential
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.optimizers import Adam
import pickle
import os
import numpy as np

**UPLOADING TEXT FILE**

In [4]:
from google.colab import files
uploaded = files.upload()

Saving Romeo and Juliet.txt to Romeo and Juliet.txt


**STORING THE FILE CONTENTS IN A LIST AND PREPROCESSING**



In [5]:
#reading the file line by line; the with-block closes the file afterwards
with open("Romeo and Juliet.txt","r",encoding="utf8") as file:
  lines=[]
  for i in file:
    lines.append(i)

#converting list to string
data = ' '.join(lines)

#removing unnecessary characters
data = data.replace('\n','').replace("\r",'').replace('\ufeff','').replace('"','').replace('*','').replace(',','')

#removing unnecessary spaces
data = data.split()
data = ' '.join(data)
print(data[:500])
print('Length: ',len(data))

The Project Gutenberg eBook of Romeo and Juliet This ebook is for the use of anyone anywhere in the United States and most other parts of the world at no cost and with almost no restrictions whatsoever. You may copy it give it away or re-use it under the terms of the Project Gutenberg License included with this ebook or online at www.gutenberg.org. If you are not located in the United States you will have to check the laws of the country where you are located before using this eBook. Title: Rome
Length:  157137


**CREATING TOKENIZER OBJECT AND FITTING TO "data" VARIABLE.**

**1.   The tokenizer object learns the vocabulary of the data, i.e., the set of all unique words in the data.**

**2.   The tokenizer object is saved to a file called "token.pkl" using the pickle library, so that it can be loaded later and used to make predictions on new data.**

**3.   Each word in the data variable is converted into an integer (a mapping of words to integers).**



In [6]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts([data])

#saving the tokenizer for predict function
with open('token.pkl','wb') as f:
  pickle.dump(tokenizer,f)

sequence_data = tokenizer.texts_to_sequences([data])[0]
print("Length: ",len(sequence_data))

Length:  29251
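
To make the word-to-integer mapping concrete, here is a minimal pure-Python sketch of how a frequency-ranked word index can be built. It is an illustration only, not the Keras implementation: the helper name `build_word_index` and the toy sentence are assumptions, and the sketch skips the punctuation filtering that the Keras Tokenizer applies. It does follow the same two conventions: more frequent words get smaller integers, and numbering starts at 1.

```python
from collections import Counter

def build_word_index(text):
    """Map each unique lowercased word to an integer, most frequent first (hypothetical helper)."""
    words = text.lower().split()
    counts = Counter(words)
    # most_common() sorts by count; words with equal counts keep first-seen order
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

toy_text = "the cat sat on the mat the end"
word_index = build_word_index(toy_text)
toy_sequence = [word_index[w] for w in toy_text.lower().split()]

print(word_index)
print(toy_sequence)
```

In this toy example "the" is the most frequent word, so it receives index 1, and every occurrence of "the" in the text becomes a 1 in the integer sequence.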


**Calculating the size of the vocabulary that the tokenizer object has learned. One is added because the tokenizer's word indices start at 1; index 0 is reserved for padding.**

In [7]:
vocab_size = len(tokenizer.word_index)+1
print(vocab_size)

4307


**The code snippet below creates a list of word sequences from the sequence_data variable. Each sequence is 4 words long: the first three words act as the input for predicting the fourth word.**

In [8]:
sequences = []
for i in range(3,len(sequence_data)):
  words = sequence_data[i-3:i+1]
  sequences.append(words)

print("Length of sequences: ",len(sequences))
sequences = np.array(sequences)
sequences[:10]

Length of sequences:  29248


array([[  1,  54, 129, 302],
       [ 54, 129, 302,   6],
       [129, 302,   6,  12],
       [302,   6,  12,   2],
       [  6,  12,   2,  22],
       [ 12,   2,  22,  16],
       [  2,  22,  16, 302],
       [ 22,  16, 302,   8],
       [ 16, 302,   8,  18],
       [302,   8,  18,   1]])
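
The sliding-window step can be sketched on its own as follows. The helper name `make_windows` and its `context` parameter are assumptions introduced for illustration; the toy ids are the first six values visible in the output above.

```python
def make_windows(token_ids, context=3):
    """Return every run of context + 1 consecutive ids (hypothetical helper)."""
    size = context + 1
    return [token_ids[i:i + size] for i in range(len(token_ids) - size + 1)]

toy_ids = [1, 54, 129, 302, 6, 12]
windows = make_windows(toy_ids)

print(windows)
print(len(windows))
```

A list of N ids yields N - 3 windows of length 4, which is consistent with the 29251 tokens producing 29248 sequences in this notebook.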

**The x NumPy array contains the first three words in each sequence in the sequences NumPy array. The y NumPy array contains the fourth word in each sequence in the sequences NumPy array.**

In [26]:
x=[]
y=[]
for i in sequences:
  x.append(i[0:3])
  y.append(i[3])

#x data acts as input which is used for prediction.
x = np.array(x)
#y is response data which is predicted based on x
y = np.array(y)

In [27]:
print("Data: \n",x[:10])
print("Response: \n",y[:10])

Data: 
 [[  1  54 129]
 [ 54 129 302]
 [129 302   6]
 [302   6  12]
 [  6  12   2]
 [ 12   2  22]
 [  2  22  16]
 [ 22  16 302]
 [ 16 302   8]
 [302   8  18]]
Response: 
 [302   6  12   2  22  16 302   8  18   1]
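
As an alternative sketch, the same split can be written with NumPy column slicing instead of a Python loop. The names `toy_sequences`, `x_toy` and `y_toy` are assumptions used only for this example; the values are the first three windows shown earlier.

```python
import numpy as np

# first three windows from the sequences array shown earlier
toy_sequences = np.array([[1, 54, 129, 302],
                          [54, 129, 302, 6],
                          [129, 302, 6, 12]])

x_toy = toy_sequences[:, :3]   # every row, first three columns
y_toy = toy_sequences[:, 3]    # every row, last column

print(x_toy.shape, y_toy.shape)
print(y_toy)
```

Slicing gives the same arrays as the loop and avoids building intermediate Python lists.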


**The to_categorical() function from the Keras library converts a vector of integers into a binary class matrix.**


*   **The categorical_crossentropy loss used when compiling the model requires the target data (y) to be in a one-hot encoded format, where each row in the matrix represents a single example and each column represents a different class (here, a word in the vocabulary).**



In [11]:
y = to_categorical(y, num_classes=vocab_size)
y[:5]

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)
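
What one-hot encoding does can be sketched in a few lines of NumPy. This is an illustrative stand-in, not the Keras function itself; the helper name `one_hot` and the toy labels are assumptions.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return a (len(labels), num_classes) float32 matrix with a single 1 per row."""
    labels = np.asarray(labels)
    encoded = np.zeros((labels.size, num_classes), dtype=np.float32)
    encoded[np.arange(labels.size), labels] = 1.0   # set column labels[i] in row i
    return encoded

toy_y = one_hot([3, 0, 2], num_classes=5)
print(toy_y)
```

This also explains the output above: the fourth target value is 2, so the fourth row has its 1 in column 2, while the other rows shown have their 1 in a column hidden by the ellipsis.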



1.   **Creating a Sequential neural network object.**
2.   **Adding an Embedding layer to the model, specifying the vocab_size, an embedding dimension of 10, and an input length of 3 (the three input words described above).**
3.   **Adding two LSTM (Long Short-Term Memory) layers with 1000 units each. The first uses return_sequences=True, so it returns the full sequence of outputs instead of only the final output; the second LSTM layer needs that sequence as its input.**
4.   **Adding two dense layers. The first has 1000 units and uses the ReLU activation function. The last has vocab_size units and uses the softmax activation function, which ensures that the output of the model is a probability distribution over all possible next words.**


In [12]:
model = Sequential()
model.add(Embedding(vocab_size,10,input_length=3))
model.add(LSTM(1000,return_sequences=True))
model.add(LSTM(1000))
model.add(Dense(1000,activation="relu"))
model.add(Dense(vocab_size,activation="softmax"))

In [13]:
print(model.summary())

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 3, 10)             43070     
                                                                 
 lstm (LSTM)                 (None, 3, 1000)           4044000   
                                                                 
 lstm_1 (LSTM)               (None, 1000)              8004000   
                                                                 
 dense (Dense)               (None, 1000)              1001000   
                                                                 
 dense_1 (Dense)             (None, 4307)              4311307   
                                                                 
Total params: 17403377 (66.39 MB)
Trainable params: 17403377 (66.39 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
None
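
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for embedding, LSTM and dense layers; the helper names `lstm_params` and `dense_params` are assumptions introduced for this check.

```python
vocab_size, embed_dim, units = 4307, 10, 1000

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # one weight per input-output pair plus one bias per output
    return input_dim * units + units

layer_params = {
    "embedding": vocab_size * embed_dim,
    "lstm": lstm_params(embed_dim, units),
    "lstm_1": lstm_params(units, units),
    "dense": dense_params(units, units),
    "dense_1": dense_params(units, vocab_size),
}
total_params = sum(layer_params.values())

print(layer_params)
print(total_params)
```

The per-layer values and the total agree with the summary above. Most of the parameters sit in the second LSTM layer and the final dense layer.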


1.    **Creating a new ModelCheckpoint callback object. The filepath argument specifies the path to the file where the best model will be saved.**
2.    **The save_best_only argument specifies that a checkpoint is written only when the monitored quantity improves.**
3.    **Compiling the model with the categorical cross-entropy loss and the Adam optimizer (learning rate 0.001), then training it on the x and y data for 30 epochs with a batch size of 64. The callbacks argument specifies the ModelCheckpoint callback.**
4.    **The ModelCheckpoint callback will save the best model to the next_words.h5 file during training.**

**Because monitor="loss" is used and no validation data is passed to fit(), the best model here is the model with the lowest training loss, not the lowest validation loss.**

**RUNNING 30-EPOCHS...**

In [15]:
from tensorflow.keras.callbacks import ModelCheckpoint

checkpoint = ModelCheckpoint("next_words.h5",monitor="loss",verbose=1,save_best_only=True)
model.compile(loss="categorical_crossentropy",optimizer=Adam(learning_rate=0.001))
model.fit(x,y,epochs=30,batch_size=64,callbacks=[checkpoint])

Epoch 1/30
Epoch 1: loss improved from inf to 6.69002, saving model to next_words.h5




Epoch 2/30
Epoch 2: loss improved from 6.69002 to 6.38133, saving model to next_words.h5
Epoch 3/30
Epoch 3: loss improved from 6.38133 to 6.11457, saving model to next_words.h5
Epoch 4/30
Epoch 4: loss improved from 6.11457 to 5.85011, saving model to next_words.h5
Epoch 5/30
Epoch 5: loss improved from 5.85011 to 5.60982, saving model to next_words.h5
Epoch 6/30
Epoch 6: loss improved from 5.60982 to 5.38533, saving model to next_words.h5
Epoch 7/30
Epoch 7: loss improved from 5.38533 to 5.17472, saving model to next_words.h5
Epoch 8/30
Epoch 8: loss improved from 5.17472 to 4.96413, saving model to next_words.h5
Epoch 9/30
Epoch 9: loss improved from 4.96413 to 4.75095, saving model to next_words.h5
Epoch 10/30
Epoch 10: loss improved from 4.75095 to 4.52414, saving model to next_words.h5
Epoch 11/30
Epoch 11: loss improved from 4.52414 to 4.29342, saving model to next_words.h5
Epoch 12/30
Epoch 12: loss improved from 4.29342 to 4.04732, saving model to next_words.h5
Epoch 13/30
Epo

<keras.src.callbacks.History at 0x78d34a882680>
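
The "save only when the loss improves" rule can be sketched in plain Python. This is a simplified stand-in for the callback's behaviour, not the Keras source; the helper name `epochs_that_save` is an assumption, `logged` holds the first four losses from the log above, and `made_up` holds invented values that include one regression.

```python
def epochs_that_save(losses):
    """Return the 1-based epochs at which a 'save best only' rule would write a checkpoint."""
    best = float("inf")
    saved = []
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:          # strictly lower than anything seen so far
            best = loss
            saved.append(epoch)
    return saved

logged = [6.69002, 6.38133, 6.11457, 5.85011]   # first four epochs from the log above
made_up = [2.0, 1.5, 1.7, 1.4]                   # invented values with one regression

print(epochs_that_save(logged))
print(epochs_that_save(made_up))
```

With a steadily falling training loss, as in this run, every epoch triggers a save. In the invented series the third epoch is skipped because 1.7 is worse than the best value seen so far.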

1.    **The load_model() function loads the model from the next_words.h5 file. The pickle.load() function loads the tokenizer from the token.pkl file.**
2.    **The tokenizer.texts_to_sequences() function converts a list of texts to a list of integer sequences, in which each integer represents a word. The np.array() function converts the list of integer sequences to a NumPy array.**
3.    **Predicting the next word in the sequence using the model. The model.predict() function returns the probability distribution over all possible next words.**
4.    **The np.argmax() function returns the index of the word with the highest probability.**
5.    **Iterating over the tokenizer's word index to find the word that corresponds to the predicted integer.**

In [20]:
from tensorflow.keras.models import load_model


#load the model and tokenizer
model = load_model('next_words.h5')
with open('token.pkl','rb') as f:
  tokenizer = pickle.load(f)

def Predict_Next_Words(model,tokenizer,text):
  sequence = tokenizer.texts_to_sequences([text])
  sequence = np.array(sequence)
  preds = np.argmax(model.predict(sequence))
  #default in case the predicted index is not in the word index (e.g. index 0)
  predicted_word = ""

  for key,value in tokenizer.word_index.items():
    if value == preds:
      predicted_word = key
      break

  print(predicted_word)
  return predicted_word
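
The decoding step (probabilities to index to word) can also be sketched without the model. The example below inverts the word index once instead of scanning it, and returns the few most likely words rather than only the best one. The helper name `decode_prediction`, the toy word index and the probability values are all assumptions made up for illustration.

```python
def decode_prediction(probs, word_index, top_k=3):
    """Return the top_k (word, probability) pairs for one probability vector."""
    index_word = {idx: word for word, idx in word_index.items()}  # inverse lookup table
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    # index 0 is never assigned to a word, so it is skipped
    return [(index_word[i], probs[i]) for i in ranked if i in index_word][:top_k]

toy_word_index = {"the": 1, "project": 2, "gutenberg": 3, "ebook": 4}
toy_probs = [0.01, 0.10, 0.04, 0.25, 0.60]   # made-up softmax output, one entry per index 0..4

top = decode_prediction(toy_probs, toy_word_index)
print(top)
```

The Keras Tokenizer keeps a similar inverse mapping in its index_word attribute, which could replace the loop over word_index in the function above.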

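1.    **The text.split(" ") line splits the user input into a list of words. The text[-3:] line takes the last three items in that list, because the model was trained on input sequences of three words. Note that splitting on a single space leaves an empty string at the end of the list when the input ends with a space, as two of the examples in the output below show.**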

2.    **The Predict_Next_Words() function is then called to predict the next word in the sequence. The Predict_Next_Words() function takes three arguments: the trained model, the tokenizer, and the sequence of three words. The function returns the predicted next word.**

3.    **The predicted next word is then printed to the console.**

In [30]:
while (True):
  text = input("Enter your line: ")

  if text=="0":
    print("Execution Terminated...")
    break
  else:
    try:
      text = text.split(" ")
      text = text[-3:]
      print(text)
      Predict_Next_Words(model,tokenizer,text)
    except Exception as e:
      print("Error Occurred: ",e)
      continue

Enter your line: The Project Gutenberg 
['Project', 'Gutenberg', '']
ebook
Enter your line: Scene IV. A Street. Scene V. A Hall in Capulet’s
['Hall', 'in', 'Capulet’s']
house
Enter your line: You may copy it, give it away or re-use it under the terms of the Project 
['the', 'Project', '']
gutenberg
Enter your line: 0
Execution Terminated...
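
The empty strings in the printed lists above come from splitting on a single space when the input ends with a space. One possible way to avoid this, sketched here with a hypothetical helper `last_n_words`, is to call split() without an argument, which splits on any run of whitespace and drops empty items.

```python
def last_n_words(text, n=3):
    """Split on any run of whitespace and keep the final n words (hypothetical helper)."""
    return text.split()[-n:]

print(last_n_words("The Project Gutenberg "))
print(last_n_words("under the terms of the Project "))
```

With this change the model would always receive real words as context, even when the user types a trailing space.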
