# **Introduction**
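Natural Language Processing (NLP) enables computers to process and interpret human language.
This project implements a next-word prediction system using a deep learning model built on LSTM (Long Short-Term Memory) networks: given three consecutive words, the model predicts the word most likely to follow.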


#### **Dataset Description**
The dataset consists of a short English story titled **"The Farmer and His Sons"**.
It contains approximately 200 words and is used to train a word-level language model.


#### **Connect Google Drive**

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


#### **Import Modules**

In [None]:
import tensorflow
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.models import Sequential
from tensorflow.keras.utils import to_categorical
from sklearn.model_selection import train_test_split
from tensorflow.keras.layers import Dropout
from tensorflow.keras.optimizers import Adam
import pickle
import numpy as np
import os

In [None]:
file = open('/content/drive/MyDrive/Colab Notebooks/NLP Projects/NLP Dataset/document.txt','r',encoding="utf8")

In [None]:
lines = []
for line in file :
  lines.append(line)
  print(line)

THE FARMER AND HIS SONS

A farmer had five sons. They were strong and hardworking. But they always quarrelled with

one another. Sometimes, they even fought. The farmer wanted his sons to stop quarrelling

and fighting. He wanted them to live in peace. Plain words of advice or scolding did not have

much effect on these young people.

The farmer always thought what to do to keep his sons united. One day he found an answer

to the problem. So he called all his sons together. He showed them a bundle of sticks and

said, “I want any of you to break these sticks without separating them from the bundle.”

Each of the five sons tried one by one. They used their full strength and skill. But none of

them could break the sticks. Then the old man separated the sticks and gave each of them

just a single stick to break. They broke the sticks easily.

The farmer said, “A single stick by itself is weak. It is strong as long as it is tied up in a

bundle. Likewise, you will be strong if you are uni

#### **Text Preprocessing**

In [None]:
data = ' '.join(lines)

In [None]:
data

'THE FARMER AND HIS SONS\n A farmer had five sons. They were strong and hardworking. But they always quarrelled with\n one another. Sometimes, they even fought. The farmer wanted his sons to stop quarrelling\n and fighting. He wanted them to live in peace. Plain words of advice or scolding did not have\n much effect on these young people.\n The farmer always thought what to do to keep his sons united. One day he found an answer\n to the problem. So he called all his sons together. He showed them a bundle of sticks and\n said, “I want any of you to break these sticks without separating them from the bundle.”\n Each of the five sons tried one by one. They used their full strength and skill. But none of\n them could break the sticks. Then the old man separated the sticks and gave each of them\n just a single stick to break. They broke the sticks easily.\n The farmer said, “A single stick by itself is weak. It is strong as long as it is tied up in a\n bundle. Likewise, you will be strong i

In [None]:
data = data.split()
data = ' '.join(data)
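The two steps above collapse every run of whitespace, including the newlines left over from joining the lines, into a single space. A minimal sketch of the same idiom on a made-up sample string (not the actual dataset):

```python
# Hypothetical sample text standing in for the joined lines of the story.
sample = "THE FARMER AND HIS SONS\n A farmer had   five sons.\n"

# str.split() with no arguments splits on any run of whitespace
# (spaces, tabs, newlines) and drops leading/trailing whitespace.
words = sample.split()

# Re-joining with a single space yields one clean line of text.
normalized = " ".join(words)
print(normalized)
```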

In [None]:
data

'THE FARMER AND HIS SONS A farmer had five sons. They were strong and hardworking. But they always quarrelled with one another. Sometimes, they even fought. The farmer wanted his sons to stop quarrelling and fighting. He wanted them to live in peace. Plain words of advice or scolding did not have much effect on these young people. The farmer always thought what to do to keep his sons united. One day he found an answer to the problem. So he called all his sons together. He showed them a bundle of sticks and said, “I want any of you to break these sticks without separating them from the bundle.” Each of the five sons tried one by one. They used their full strength and skill. But none of them could break the sticks. Then the old man separated the sticks and gave each of them just a single stick to break. They broke the sticks easily. The farmer said, “A single stick by itself is weak. It is strong as long as it is tied up in a bundle. Likewise, you will be strong if you are united. You wi

#### **Tokenization**

In [None]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts([data])

In [None]:
# Use a context manager so the file is flushed and closed after writing.
with open('/content/drive/MyDrive/Colab Notebooks/NLP Projects/Models/token.pkl', 'wb') as f:
    pickle.dump(tokenizer, f)

In [None]:
tokenized_data = tokenizer.texts_to_sequences([data])[0]
tokenized_vocab = tokenizer.word_index

In [None]:
print(tokenized_data)

[1, 6, 3, 11, 4, 12, 6, 41, 20, 4, 7, 42, 15, 3, 43, 21, 7, 22, 44, 45, 13, 46, 47, 7, 48, 49, 1, 6, 23, 11, 4, 2, 50, 51, 3, 52, 14, 23, 8, 2, 53, 24, 54, 55, 56, 5, 57, 58, 59, 60, 61, 62, 63, 64, 65, 25, 66, 67, 1, 6, 22, 68, 69, 2, 70, 2, 71, 11, 4, 16, 13, 72, 14, 73, 74, 75, 2, 1, 76, 77, 14, 78, 79, 11, 4, 80, 14, 81, 8, 12, 17, 5, 9, 3, 26, 82, 83, 84, 5, 10, 2, 18, 25, 9, 85, 86, 8, 87, 1, 17, 27, 28, 5, 1, 20, 4, 88, 13, 29, 13, 7, 89, 90, 91, 92, 3, 93, 21, 94, 5, 8, 95, 18, 1, 9, 96, 1, 97, 98, 99, 1, 9, 3, 100, 28, 5, 8, 101, 12, 30, 31, 2, 18, 7, 102, 1, 9, 103, 1, 6, 26, 104, 30, 31, 29, 105, 19, 32, 33, 19, 15, 34, 106, 34, 33, 19, 107, 108, 24, 12, 17, 109, 10, 35, 36, 15, 37, 10, 38, 16, 10, 35, 36, 32, 37, 10, 38, 39, 27, 110, 16, 40, 111, 39, 40, 112]


In [None]:
print(tokenized_vocab)

{'the': 1, 'to': 2, 'and': 3, 'sons': 4, 'of': 5, 'farmer': 6, 'they': 7, 'them': 8, 'sticks': 9, 'you': 10, 'his': 11, 'a': 12, 'one': 13, 'he': 14, 'strong': 15, 'united': 16, 'bundle': 17, 'break': 18, 'is': 19, 'five': 20, 'but': 21, 'always': 22, 'wanted': 23, 'in': 24, 'these': 25, 'said': 26, '”': 27, 'each': 28, 'by': 29, 'single': 30, 'stick': 31, 'weak': 32, 'it': 33, 'as': 34, 'will': 35, 'be': 36, 'if': 37, 'are': 38, 'divided': 39, 'we': 40, 'had': 41, 'were': 42, 'hardworking': 43, 'quarrelled': 44, 'with': 45, 'another': 46, 'sometimes': 47, 'even': 48, 'fought': 49, 'stop': 50, 'quarrelling': 51, 'fighting': 52, 'live': 53, 'peace': 54, 'plain': 55, 'words': 56, 'advice': 57, 'or': 58, 'scolding': 59, 'did': 60, 'not': 61, 'have': 62, 'much': 63, 'effect': 64, 'on': 65, 'young': 66, 'people': 67, 'thought': 68, 'what': 69, 'do': 70, 'keep': 71, 'day': 72, 'found': 73, 'an': 74, 'answer': 75, 'problem': 76, 'so': 77, 'called': 78, 'all': 79, 'together': 80, 'showed': 81,

In [None]:
print(f"len of tokenized_data : {len(tokenized_data)}")
print(f"len of tokenized_vocab : {len(tokenized_vocab)}")

len of tokenized_data : 206
len of tokenized_vocab : 112


In [None]:
vocab_size = len(tokenized_vocab)+1
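
The `+1` accounts for index 0, which the Keras `Tokenizer` never assigns to a word. The plain-Python sketch below imitates how a frequency-ranked word index is built. It is a simplified stand-in: `build_word_index` and the toy sentence are illustrative names and data, and the sketch does not reproduce the `Tokenizer`'s punctuation filtering.

```python
from collections import Counter

def build_word_index(text):
    """Map each word to an integer rank, most frequent word first (rank 1)."""
    words = text.lower().split()
    counts = Counter(words)
    # Sort by descending frequency; ties keep first-appearance order
    # because sorted() is stable and Counter preserves insertion order.
    ranked = sorted(counts.items(), key=lambda item: -item[1])
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

# Hypothetical toy sentence, not the project dataset.
toy_text = "the farmer and the sons and the sticks"
toy_index = build_word_index(toy_text)
toy_sequence = [toy_index[w] for w in toy_text.lower().split()]

# Index 0 is never used for a word, so the output layer needs
# len(word_index) + 1 units to cover indices 0..len(word_index).
toy_vocab_size = len(toy_index) + 1

print(toy_index)
print(toy_sequence)
print(toy_vocab_size)
```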

#### **Sequence Generation**

In [None]:
document_sequence_list = []
for i in range(3,len(tokenized_data)) :
  sequence_set = tokenized_data[i-3:i+1]
  document_sequence_list.append(sequence_set)
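
The loop above slides a window of four tokens over the text one position at a time, so a sequence of N tokens yields N - 3 windows. A minimal sketch of the same idea with the context length as a parameter; `make_windows` and `context` are illustrative names, not part of the notebook:

```python
def make_windows(tokens, context=3):
    """Return (inputs, targets): each input holds `context` tokens and the
    target is the token that immediately follows them."""
    inputs, targets = [], []
    for i in range(context, len(tokens)):
        inputs.append(tokens[i - context:i])
        targets.append(tokens[i])
    return inputs, targets

# First seven token ids of the tokenized story printed earlier.
tokens = [1, 6, 3, 11, 4, 12, 6]
inputs, targets = make_windows(tokens, context=3)

print(inputs)
print(targets)
```

Separating inputs and targets inside the window function is only a convenience; the notebook reaches the same result by building four-token windows first and slicing them afterwards.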

In [None]:
document_sequence_list

[[1, 6, 3, 11],
 [6, 3, 11, 4],
 [3, 11, 4, 12],
 [11, 4, 12, 6],
 [4, 12, 6, 41],
 [12, 6, 41, 20],
 [6, 41, 20, 4],
 [41, 20, 4, 7],
 [20, 4, 7, 42],
 [4, 7, 42, 15],
 [7, 42, 15, 3],
 [42, 15, 3, 43],
 [15, 3, 43, 21],
 [3, 43, 21, 7],
 [43, 21, 7, 22],
 [21, 7, 22, 44],
 [7, 22, 44, 45],
 [22, 44, 45, 13],
 [44, 45, 13, 46],
 [45, 13, 46, 47],
 [13, 46, 47, 7],
 [46, 47, 7, 48],
 [47, 7, 48, 49],
 [7, 48, 49, 1],
 [48, 49, 1, 6],
 [49, 1, 6, 23],
 [1, 6, 23, 11],
 [6, 23, 11, 4],
 [23, 11, 4, 2],
 [11, 4, 2, 50],
 [4, 2, 50, 51],
 [2, 50, 51, 3],
 [50, 51, 3, 52],
 [51, 3, 52, 14],
 [3, 52, 14, 23],
 [52, 14, 23, 8],
 [14, 23, 8, 2],
 [23, 8, 2, 53],
 [8, 2, 53, 24],
 [2, 53, 24, 54],
 [53, 24, 54, 55],
 [24, 54, 55, 56],
 [54, 55, 56, 5],
 [55, 56, 5, 57],
 [56, 5, 57, 58],
 [5, 57, 58, 59],
 [57, 58, 59, 60],
 [58, 59, 60, 61],
 [59, 60, 61, 62],
 [60, 61, 62, 63],
 [61, 62, 63, 64],
 [62, 63, 64, 65],
 [63, 64, 65, 25],
 [64, 65, 25, 66],
 [65, 25, 66, 67],
 [25, 66, 67, 1],
 [6

In [None]:
X = []
y = []

for i in document_sequence_list :
  X.append(i[:3])
  y.append(i[3])

In [None]:
print(X)

[[1, 6, 3], [6, 3, 11], [3, 11, 4], [11, 4, 12], [4, 12, 6], [12, 6, 41], [6, 41, 20], [41, 20, 4], [20, 4, 7], [4, 7, 42], [7, 42, 15], [42, 15, 3], [15, 3, 43], [3, 43, 21], [43, 21, 7], [21, 7, 22], [7, 22, 44], [22, 44, 45], [44, 45, 13], [45, 13, 46], [13, 46, 47], [46, 47, 7], [47, 7, 48], [7, 48, 49], [48, 49, 1], [49, 1, 6], [1, 6, 23], [6, 23, 11], [23, 11, 4], [11, 4, 2], [4, 2, 50], [2, 50, 51], [50, 51, 3], [51, 3, 52], [3, 52, 14], [52, 14, 23], [14, 23, 8], [23, 8, 2], [8, 2, 53], [2, 53, 24], [53, 24, 54], [24, 54, 55], [54, 55, 56], [55, 56, 5], [56, 5, 57], [5, 57, 58], [57, 58, 59], [58, 59, 60], [59, 60, 61], [60, 61, 62], [61, 62, 63], [62, 63, 64], [63, 64, 65], [64, 65, 25], [65, 25, 66], [25, 66, 67], [66, 67, 1], [67, 1, 6], [1, 6, 22], [6, 22, 68], [22, 68, 69], [68, 69, 2], [69, 2, 70], [2, 70, 2], [70, 2, 71], [2, 71, 11], [71, 11, 4], [11, 4, 16], [4, 16, 13], [16, 13, 72], [13, 72, 14], [72, 14, 73], [14, 73, 74], [73, 74, 75], [74, 75, 2], [75, 2, 1], [2, 

In [None]:
print(y)

[11, 4, 12, 6, 41, 20, 4, 7, 42, 15, 3, 43, 21, 7, 22, 44, 45, 13, 46, 47, 7, 48, 49, 1, 6, 23, 11, 4, 2, 50, 51, 3, 52, 14, 23, 8, 2, 53, 24, 54, 55, 56, 5, 57, 58, 59, 60, 61, 62, 63, 64, 65, 25, 66, 67, 1, 6, 22, 68, 69, 2, 70, 2, 71, 11, 4, 16, 13, 72, 14, 73, 74, 75, 2, 1, 76, 77, 14, 78, 79, 11, 4, 80, 14, 81, 8, 12, 17, 5, 9, 3, 26, 82, 83, 84, 5, 10, 2, 18, 25, 9, 85, 86, 8, 87, 1, 17, 27, 28, 5, 1, 20, 4, 88, 13, 29, 13, 7, 89, 90, 91, 92, 3, 93, 21, 94, 5, 8, 95, 18, 1, 9, 96, 1, 97, 98, 99, 1, 9, 3, 100, 28, 5, 8, 101, 12, 30, 31, 2, 18, 7, 102, 1, 9, 103, 1, 6, 26, 104, 30, 31, 29, 105, 19, 32, 33, 19, 15, 34, 106, 34, 33, 19, 107, 108, 24, 12, 17, 109, 10, 35, 36, 15, 37, 10, 38, 16, 10, 35, 36, 32, 37, 10, 38, 39, 27, 110, 16, 40, 111, 39, 40, 112]


#### **Encoding**

In [None]:
y = to_categorical(y, num_classes=vocab_size)

In [None]:
y

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 1.]])

In [None]:
X = np.array(X)
y = np.array(y)
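
For readers unfamiliar with one-hot encoding, the following plain-Python sketch shows the kind of rows `to_categorical` produces: each target index becomes a vector of zeros with a single 1.0 at that index. `one_hot` is a hypothetical helper written for illustration, not the Keras implementation.

```python
def one_hot(indices, num_classes):
    """Return one row per index, with a single 1.0 at that index's position."""
    rows = []
    for index in indices:
        if not 0 <= index < num_classes:
            raise ValueError(f"index {index} is outside 0..{num_classes - 1}")
        row = [0.0] * num_classes
        row[index] = 1.0
        rows.append(row)
    return rows

# Toy example with 4 classes instead of the notebook's vocab_size.
encoded = one_hot([2, 0, 3], num_classes=4)
for row in encoded:
    print(row)
```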

#### **Train–Test Split**

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
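
The sizes of the two subsets follow from the numbers already reported. The sketch below works through the arithmetic, assuming scikit-learn rounds the test share up to a whole sample and that training later uses a batch size of 32 (as in the `model.fit` call further down); the variable names are illustrative.

```python
import math

n_tokens = 206      # len(tokenized_data) reported above
context = 3         # words used as input for each prediction
test_size = 0.2
batch_size = 32     # value passed to model.fit later in the notebook

n_samples = n_tokens - context                  # one window per target word
n_test = math.ceil(test_size * n_samples)       # assumption: test share is rounded up
n_train = n_samples - n_test
steps_per_epoch = math.ceil(n_train / batch_size)

print(n_samples, n_train, n_test, steps_per_epoch)
```

The six steps per epoch agree with the `6/6` progress counter in the training log. Note that neighbouring windows overlap in two or three words, so a random split places test windows right next to training windows from the same sentences.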

#### **Model Architecture**

In [None]:
model = Sequential()
model.add(Embedding(vocab_size, 10, input_shape=(3,)))
model.add(LSTM(256, return_sequences=True))
model.add(Dropout(0.3))
model.add(LSTM(256))
model.add(Dropout(0.3))
model.add(Dense(256, activation='relu'))
model.add(Dense(vocab_size, activation='softmax'))




In [None]:
model.summary()
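
The output of `model.summary()` is not shown here. As a rough cross-check, the number of trainable parameters can be estimated by hand with the standard formulas for embedding, LSTM, and dense layers. This sketch assumes default Keras settings (biases enabled); the dictionary keys are labels chosen for this example, not Keras's auto-generated layer names.

```python
def embedding_params(vocab, dim):
    # One dim-sized vector per vocabulary index.
    return vocab * dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias.
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit.
    return input_dim * units + units

vocab_size = 113  # 112 words + 1 reserved index, as computed earlier

layer_params = {
    "embedding": embedding_params(vocab_size, 10),
    "lstm_1": lstm_params(10, 256),
    "lstm_2": lstm_params(256, 256),
    "dense_hidden": dense_params(256, 256),
    "dense_output": dense_params(256, vocab_size),
}
total_params = sum(layer_params.values())

for name, count in layer_params.items():
    print(f"{name}: {count}")
print("total:", total_params)
```

Under these assumptions the model has close to 0.9 million parameters, while the dataset provides only about 200 training windows, so the network has ample capacity to memorize the story. Dropout layers add no parameters.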

In [None]:
checkpoint = ModelCheckpoint(
    "/content/drive/MyDrive/Colab Notebooks/NLP Projects/Models/next_words.keras",
    monitor="loss",
    verbose=1,
    save_best_only=True
)
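
With `save_best_only=True`, the callback writes the model only when the monitored quantity improves. Here that quantity is the training loss (`monitor="loss"`), not the validation loss. A simplified sketch of the decision rule; `BestLossTracker` is a made-up name, and the real callback has further options such as `mode`:

```python
class BestLossTracker:
    """Toy stand-in for the save_best_only logic: remember the lowest
    monitored value seen so far and report whether a new value beats it."""

    def __init__(self):
        self.best = float("inf")

    def update(self, value):
        if value < self.best:
            self.best = value
            return True   # the real callback would save the model here
        return False

tracker = BestLossTracker()
# Losses of the first three epochs from the training log, followed by one
# hypothetical worse value to show a skipped save.
losses = [4.72692, 4.71828, 4.70556, 4.80000]
decisions = [tracker.update(loss) for loss in losses]
print(decisions, tracker.best)
```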


#### **Model Compilation**

In [None]:
model.compile(loss="categorical_crossentropy", optimizer = Adam(learning_rate = 0.001))
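
Categorical cross-entropy is the negative log of the probability the model assigns to the correct word. A useful reference point is an untrained model that spreads its probability evenly over all classes: its loss equals the natural log of the number of classes. The sketch below computes that baseline for 113 classes; the function is a plain-Python illustration for a single sample, not the Keras implementation.

```python
import math

def categorical_crossentropy(y_true, y_pred):
    """Cross-entropy between a one-hot target and a predicted distribution."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)

vocab_size = 113
uniform = [1.0 / vocab_size] * vocab_size   # model with no preference yet
target = [0.0] * vocab_size
target[11] = 1.0                            # any single class; 11 is arbitrary

uniform_loss = categorical_crossentropy(target, uniform)
print(uniform_loss, math.log(vocab_size))
```

The baseline is about 4.73, which is very close to the first training loss in the log (4.7271). In other words, the model starts out at roughly uniform guessing.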

#### **Model Training**

In [None]:
model.fit(
    X_train,
    y_train,
    epochs=200,
    batch_size=32,
    callbacks=[checkpoint],
    validation_data=(X_test, y_test)
)


Epoch 1/200
5/6 ━━━━━━━━━━━━━━━━━━━━ 0s 35ms/step - loss: 4.7271
Epoch 1: loss improved from inf to 4.72692, saving model to /content/drive/MyDrive/Colab Notebooks/NLP Projects/Models/next_words.keras
6/6 ━━━━━━━━━━━━━━━━━━━━ 6s 319ms/step - loss: 4.7271 - val_loss: 4.7260
Epoch 2/200
6/6 ━━━━━━━━━━━━━━━━━━━━ 0s 98ms/step - loss: 4.7185
Epoch 2: loss improved from 4.72692 to 4.71828, saving model to /content/drive/MyDrive/Colab Notebooks/NLP Projects/Models/next_words.keras
6/6 ━━━━━━━━━━━━━━━━━━━━ 2s 267ms/step - loss: 4.7185 - val_loss: 4.7244
Epoch 3/200
6/6 ━━━━━━━━━━━━━━━━━━━━ 0s 67ms/step - loss: 4.7084
Epoch 3: loss improved from 4.71828 to 4.70556, saving model to /content/drive/MyDrive/Colab Notebooks/NLP Projects/Models/next_words.keras
6/6 ━━━━━━━━━━━━━━━━━━━━ 1s 109ms/step

<keras.src.callbacks.history.History at 0x797a41e955b0>

#### **Model Evaluation**

In [None]:
# calculate_accuracy is defined in a cell near the end of this notebook; run that cell first.
test_acc = calculate_accuracy(model, X_test, y_test)
print("Test Accuracy:", test_acc)

Test Accuracy: 0.024390243902439025


In [None]:
acc = calculate_accuracy(model, X, y)
print("Next-word accuracy:", acc)

Next-word accuracy: 0.7487684729064039
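
Because `X` contains both the training and the held-out windows, the second figure mixes the two. Assuming both numbers come from the same model and that the split sizes are 162 and 41 (203 windows with `test_size=0.2`), the underlying counts can be recovered as a rough sanity check:

```python
n_all, n_test = 203, 41          # sample counts implied by 206 tokens and test_size=0.2
n_train = n_all - n_test

acc_all = 0.7487684729064039     # "Next-word accuracy" printed above
acc_test = 0.024390243902439025  # "Test Accuracy" printed above

correct_all = round(acc_all * n_all)
correct_test = round(acc_test * n_test)
correct_train = correct_all - correct_test
acc_train = correct_train / n_train

print(correct_all, correct_test, correct_train, round(acc_train, 3))
```

Under these assumptions the model gets 1 of 41 held-out windows right but 151 of 162 training windows, which suggests it has largely memorized the training text rather than learned to generalize. That is expected for a story of roughly 200 words.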


In [None]:
from tensorflow.keras.models import load_model

model = load_model("/content/drive/MyDrive/Colab Notebooks/NLP Projects/Models/next_words.keras")

In [None]:
import pickle

# Use a context manager so the file handle is closed after loading.
with open('/content/drive/MyDrive/Colab Notebooks/NLP Projects/Models/token.pkl', 'rb') as f:
    tokenizer_stored = pickle.load(f)


In [None]:
import numpy as np

def predict_next_word(seed_text):
    # tokenize with the stored tokenizer; words outside the vocabulary are dropped
    token_list = tokenizer_stored.texts_to_sequences([seed_text])[0]

    # take last 3 words; the model expects exactly 3 tokens
    token_list = token_list[-3:]
    if len(token_list) < 3:
        return None

    # reshape for model
    token_list = np.array(token_list).reshape(1, 3)

    # predict
    predicted_probs = model.predict(token_list, verbose=0)
    predicted_index = int(np.argmax(predicted_probs))

    # convert index to word with the same tokenizer that encoded the seed text
    # (index 0 is reserved and has no word, so .get returns None for it)
    return tokenizer_stored.index_word.get(predicted_index)
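
`predict_next_word` returns only the single most likely word. The sketch below shows how a probability vector could yield several ranked suggestions instead. `top_k_words` and the toy data are illustrative and not part of the notebook; with the real model one would pass `predicted_probs[0]` and `tokenizer_stored.index_word`.

```python
def top_k_words(probabilities, index_word, k=3):
    """Return the k most probable (word, probability) pairs.

    `probabilities` is a flat sequence where position i is the probability
    of word index i; `index_word` maps indices to words (index 0 is unused).
    """
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: probabilities[i], reverse=True)
    results = []
    for i in ranked:
        if i in index_word:          # skips the reserved index 0
            results.append((index_word[i], probabilities[i]))
        if len(results) == k:
            break
    return results

# Hypothetical probabilities over a toy 4-word vocabulary.
toy_index_word = {1: "the", 2: "to", 3: "and", 4: "sons"}
toy_probs = [0.05, 0.10, 0.15, 0.50, 0.20]
suggestions = top_k_words(toy_probs, toy_index_word, k=2)
print(suggestions)
```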


In [None]:
file.seek(0)

0

In [None]:
for line in file :
  print(line)


In [None]:
print(predict_next_word("the farmer had"))


his


In [None]:
print(predict_next_word("They were strong"))


and


In [None]:
print(predict_next_word(" They broke the sticks"))

easily


In [None]:
print(predict_next_word("I want any of"))

you


In [None]:
print(predict_next_word("He wanted them to"))

live


In [None]:
print(predict_next_word("Plain words of"))

advice


In [None]:
print(predict_next_word("The farmer wanted his"))

sons


In [None]:
print(predict_next_word("Sometimes, they even"))

fought


In [None]:
print(predict_next_word("But they always"))

could


In [None]:
print(predict_next_word("They used their full"))

strength


In [None]:
def calculate_accuracy(model, X, y):
    """Return the fraction of samples whose most probable predicted word
    matches the one-hot target in y."""
    # Predict all samples in one call instead of one call per sample.
    y_pred = np.argmax(model.predict(X, verbose=0), axis=1)
    y_true = np.argmax(y, axis=1)
    return float(np.mean(y_pred == y_true))


