# **Prediction Of The Next Word**

Predicting the next word in a sentence is a common Natural Language Processing (NLP) task that can be framed as a supervised machine learning problem. Here's a general outline of how you can build a next-word prediction model with an LSTM network in Keras.

**Description:** Data science projects give you the freedom to build predictive models, and next-word prediction is one you have probably already used. Google Docs, WhatsApp, and even the Google search bar all suggest a likely next word after each word you type.

## **Importing the Libraries**

In [1]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.models import Sequential
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.optimizers import Adam
import pickle
import numpy as np
import os

## **Open and Pre-Process the Data**

In [3]:
# Read the whole file; the context manager closes the handle afterwards
with open("next word prediction.txt", "r", encoding="utf8") as file:
    lines = file.readlines()

# Join the lines into a single string
data = ' '.join(lines)

# Strip newlines, carriage returns, the BOM character, and curly quotes
data = data.replace('\n', '').replace('\r', '').replace('\ufeff', '').replace('“', '').replace('”', '')

# Collapse runs of whitespace into single spaces
data = ' '.join(data.split())
data[:500]

'The Project Gutenberg eBook of Pride and prejudice, by Jane Austen This eBook is for the use of anyone anywhere in the United States and most other parts of the world at no cost and with almost no restrictions whatsoever. You may copy it, give it away or re-use it under the terms of the Project Gutenberg License included with this eBook or online at www.gutenberg.org. If you are not located in the United States, you will have to check the laws of the country where you are located before using th'

In [4]:
len(data)

733851

## **Apply tokenization and some other changes**

In [5]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts([data])

# Save the tokenizer so the prediction function can reload it
with open('token.pkl', 'wb') as f:
    pickle.dump(tokenizer, f)

sequence_data = tokenizer.texts_to_sequences([data])[0]
sequence_data[:15]

[1, 182, 164, 1001, 3, 299, 4, 946, 30, 72, 710, 41, 1001, 23, 21]

In [6]:
len(sequence_data)

131237

In [7]:
vocab_size = len(tokenizer.word_index) + 1
print(vocab_size)

7250
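
To make the tokenization step concrete, the sketch below mimics what `Tokenizer` does by default: lowercase the text, split it into words, rank the words by frequency, and number them from 1. It is a simplified pure-Python stand-in, not the Keras implementation. The helper names `build_word_index` and `to_sequence` are hypothetical, and the sketch skips Keras's punctuation filtering. Its purpose is to show why `vocab_size` is `len(word_index) + 1`: index 0 is never assigned to a word.

```python
from collections import Counter

def build_word_index(text):
    """Rank words by frequency; the most frequent word gets id 1."""
    counts = Counter(text.lower().split())
    # most_common() sorts by descending count; ids start at 1, 0 stays unused
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequence(text, word_index):
    """Map each known word to its id; unknown words are skipped."""
    return [word_index[w] for w in text.lower().split() if w in word_index]

sample = "the cat sat on the mat the end"
word_index = build_word_index(sample)
sequence = to_sequence(sample, word_index)
vocab_size = len(word_index) + 1  # +1 for the unused index 0

print(word_index)
print(sequence)
print(vocab_size)
```

This matches the notebook output, where the first token id is 1: "the" is the most frequent word in the text.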


In [8]:
sequences = []

for i in range(3, len(sequence_data)):
    words = sequence_data[i-3:i+1]
    sequences.append(words)
    
print("Number of sequences: ", len(sequences))
sequences = np.array(sequences)
sequences[:10]

Number of sequences:  131234


array([[   1,  182,  164, 1001],
       [ 182,  164, 1001,    3],
       [ 164, 1001,    3,  299],
       [1001,    3,  299,    4],
       [   3,  299,    4,  946],
       [ 299,    4,  946,   30],
       [   4,  946,   30,   72],
       [ 946,   30,   72,  710],
       [  30,   72,  710,   41],
       [  72,  710,   41, 1001]])
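
The windowing loop can be checked on a small input. The sketch below uses a hypothetical helper, `make_windows`, that applies the same sliding window to the first seven token ids printed earlier and splits each window into three context ids and one target id. It also shows why 131237 token ids give 131237 - 3 = 131234 sequences.

```python
def make_windows(ids, window=4):
    """Return every run of `window` consecutive ids, stepping one id at a time."""
    return [ids[i - window + 1 : i + 1] for i in range(window - 1, len(ids))]

# First seven token ids from the notebook output
ids = [1, 182, 164, 1001, 3, 299, 4]
windows = make_windows(ids)

for w in windows:
    print(w[:3], "->", w[3])  # three context ids -> target id

# A list of n ids yields n - (window - 1) windows
print(len(windows) == len(ids) - 3)
```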

In [9]:
# The first three ids of each window are the input; the fourth is the target
X = sequences[:, :3]
y = sequences[:, 3]

In [10]:
print("Data: ", X[:10])
print("Response: ", y[:10])

Data:  [[   1  182  164]
 [ 182  164 1001]
 [ 164 1001    3]
 [1001    3  299]
 [   3  299    4]
 [ 299    4  946]
 [   4  946   30]
 [ 946   30   72]
 [  30   72  710]
 [  72  710   41]]
Response:  [1001    3  299    4  946   30   72  710   41 1001]


In [11]:
y = to_categorical(y, num_classes=vocab_size)
y[:5]

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)
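
The printed rows look all-zero only because NumPy elides the middle columns; each row holds a single 1 at the index of its target word (column 1001 for the first row). A minimal pure-Python sketch of the encoding follows, with a hypothetical `one_hot` helper standing in for `to_categorical`, plus a rough estimate of how much memory the full target matrix needs.

```python
def one_hot(label, num_classes):
    """Return a list of zeros with a single 1.0 at position `label`."""
    row = [0.0] * num_classes
    row[label] = 1.0
    return row

row = one_hot(3, num_classes=8)
print(row)

# Rough size of the full one-hot target matrix as float32 (4 bytes per value)
num_samples, vocab_size = 131234, 7250
size_bytes = num_samples * vocab_size * 4
print(f"{size_bytes / 1e9:.1f} GB")
```

At this size the one-hot matrix takes several gigabytes. If that becomes a problem, one alternative (not used in this notebook) is to keep `y` as integer ids and train with Keras's `sparse_categorical_crossentropy` loss.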

## **Creating the model**

In [12]:
model = Sequential()
model.add(Embedding(vocab_size, 10, input_length=3))
model.add(LSTM(1000, return_sequences=True))
model.add(LSTM(1000))
model.add(Dense(1000, activation="relu"))
model.add(Dense(vocab_size, activation="softmax"))

In [13]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 3, 10)             72500     
                                                                 
 lstm (LSTM)                 (None, 3, 1000)           4044000   
                                                                 
 lstm_1 (LSTM)               (None, 1000)              8004000   
                                                                 
 dense (Dense)               (None, 1000)              1001000   
                                                                 
 dense_1 (Dense)             (None, 7250)              7257250   
                                                                 
Total params: 20,378,750
Trainable params: 20,378,750
Non-trainable params: 0
_________________________________________________________________
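
The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard formulas for embedding, LSTM, and dense layers; the helper function names are hypothetical. An LSTM has four gates, each with input weights, recurrent weights, and a bias, which is where the factor of 4 comes from.

```python
def embedding_params(vocab, dim):
    # One dim-sized vector per vocabulary entry
    return vocab * dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit
    return input_dim * units + units

vocab_size = 7250
counts = {
    "embedding": embedding_params(vocab_size, 10),
    "lstm": lstm_params(10, 1000),
    "lstm_1": lstm_params(1000, 1000),
    "dense": dense_params(1000, 1000),
    "dense_1": dense_params(1000, vocab_size),
}
for name, n in counts.items():
    print(f"{name:10s} {n:>10,}")
print(f"{'total':10s} {sum(counts.values()):>10,}")
```

More than a third of the parameters sit in the final softmax layer, because its width equals the vocabulary size.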


## **Train the model**

In [15]:
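from tensorflow.keras.callbacks import ModelCheckpoint
# Keep the weights with the lowest training loss; file name and monitor match the log below
checkpoint = ModelCheckpoint("next_words.h5", monitor="loss", verbose=1, save_best_only=True)
# Assumed compile step (missing from the export); Adam is imported above, defaults used here
model.compile(loss="categorical_crossentropy", optimizer=Adam())
model.fit(X, y, epochs=70, batch_size=64, callbacks=[checkpoint])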

Epoch 1/70
Epoch 1: loss improved from inf to 6.22915, saving model to next_words.h5
Epoch 2/70
Epoch 2: loss improved from 6.22915 to 5.61051, saving model to next_words.h5
Epoch 3/70
Epoch 3: loss improved from 5.61051 to 5.29121, saving model to next_words.h5
Epoch 4/70
Epoch 4: loss improved from 5.29121 to 5.05734, saving model to next_words.h5
Epoch 5/70
Epoch 5: loss improved from 5.05734 to 4.84500, saving model to next_words.h5
Epoch 6/70
Epoch 6: loss improved from 4.84500 to 4.63641, saving model to next_words.h5
Epoch 7/70
Epoch 7: loss improved from 4.63641 to 4.42813, saving model to next_words.h5
Epoch 8/70
Epoch 8: loss improved from 4.42813 to 4.22247, saving model to next_words.h5
Epoch 9/70
Epoch 9: loss improved from 4.22247 to 4.01277, saving model to next_words.h5
Epoch 10/70
Epoch 10: loss improved from 4.01277 to 3.80062, saving model to next_words.h5
Epoch 11/70
Epoch 11: loss improved from 3.80062 to 3.58331, saving model to next_words.h5
Epoch 12/70
Epoch 12:

<keras.callbacks.History at 0x7fb69ceb1750>
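
A few quick calculations help put the training log in context. The sketch below, which uses only numbers already shown in this notebook, computes the number of batches per epoch, the loss a uniform random guess would score, and the perplexity implied by the epoch 11 loss. The variable names are illustrative.

```python
import math

num_samples, batch_size, vocab_size = 131234, 64, 7250

# Batches per epoch: the last batch is smaller, so round up
steps_per_epoch = math.ceil(num_samples / batch_size)

# Cross-entropy of a uniform guess over the vocabulary: a baseline for the loss
uniform_loss = math.log(vocab_size)

# Perplexity: exp(loss), roughly the number of words the model is choosing between
perplexity_epoch_11 = math.exp(3.58331)

print(steps_per_epoch)
print(round(uniform_loss, 2))
print(round(perplexity_epoch_11, 1))
```

Note that the logged value is the training loss, since `fit` is called without validation data. It measures how well the model fits this text, not how well it generalizes.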

## **Let's Predict**

In [16]:
from tensorflow.keras.models import load_model
import numpy as np
import pickle

# Load the model and tokenizer
model = load_model('next_words.h5')
with open('token.pkl', 'rb') as f:
    tokenizer = pickle.load(f)

def Predict_Next_Words(model, tokenizer, text):
  # text is a list of up to three words; words the tokenizer has never seen are dropped
  sequence = np.array(tokenizer.texts_to_sequences([text]))
  # Index of the most probable word in the softmax output
  preds = int(np.argmax(model.predict(sequence), axis=-1)[0])
  # index_word is the tokenizer's reverse mapping (id -> word)
  predicted_word = tokenizer.index_word.get(preds, "")

  print(predicted_word)
  return predicted_word
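
Keyboards and search bars usually offer several suggestions rather than one. As a possible extension, the sketch below ranks a probability vector and returns the top `k` words. The function `top_k_words` is hypothetical and the probabilities are made up for illustration; it is not part of the notebook's prediction function.

```python
def top_k_words(probs, index_word, k=3):
    """Return the k most probable (word, probability) pairs, best first."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    # Skip indices with no word, such as index 0, which the tokenizer never assigns
    return [(index_word[i], probs[i]) for i in ranked if i in index_word][:k]

# Made-up probabilities over a toy vocabulary (index 0 is unused)
index_word = {1: "the", 2: "of", 3: "pride", 4: "and"}
probs = [0.01, 0.10, 0.24, 0.50, 0.15]

print(top_k_words(probs, index_word, k=3))
```

With the trained model, `probs` would be `model.predict(sequence)[0]` and `index_word` would be `tokenizer.index_word`.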

In [18]:
while True:
  text = input("Enter your line: ")

  if text == "0":
      print("Execution completed.....")
      break

  try:
      # Keep only the last three words, matching the model's input length
      text = text.split(" ")[-3:]
      print(text)

      Predict_Next_Words(model, tokenizer, text)

  except Exception as e:
      print("Error occurred: ", e)
      continue

Enter your line: The Project Gutenberg
['The', 'Project', 'Gutenberg']
literary
Enter your line: The project gutenberg eBook of
['gutenberg', 'eBook', 'of']
pride
Enter your line: how can you abuse your own
['abuse', 'your', 'own']
daughter
Enter your line: He wa quite
['He', 'wa', 'quite']
all
Enter your line: he was quite
['he', 'was', 'quite']
young
Enter your line: He could not help seeing that you were about five times as
['five', 'times', 'as']
pretty
Enter your line: and her sister
['and', 'her', 'sister']
just
Enter your line: however, it may all come to
['all', 'come', 'to']
nothing
Enter your line: 0
Execution completed.....
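
In the session above, "He wa quite" still produces a prediction because the tokenizer silently drops the unknown word "wa" and the model sees only two ids. The sketch below shows one way to make that visible before predicting: a hypothetical helper, `prepare_context`, that lowercases the input, strips ASCII punctuation, and reports which of the last three words are out of vocabulary. Unlike the loop above, it backfills the context from earlier known words. The vocabulary here is a made-up subset.

```python
import string

def prepare_context(line, known_words, n=3):
    """Return (last n known words in the line, unknown words among the last n typed)."""
    table = str.maketrans("", "", string.punctuation)
    words = line.lower().translate(table).split()
    unknown = [w for w in words[-n:] if w not in known_words]
    context = [w for w in words if w in known_words][-n:]
    return context, unknown

# Hypothetical vocabulary; with the real tokenizer this would be tokenizer.word_index
known_words = {"he", "was", "quite", "however", "it", "may", "all", "come", "to"}

print(prepare_context("He wa quite", known_words))
print(prepare_context("however, it may all come to", known_words))
```

Note that `string.punctuation` covers ASCII characters only, so curly quotes would need to be removed separately, as the pre-processing step does.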
