
#**Project Summary: Next-Word Prediction Using GRU-Based Language Model**

##**Overview**

This project trains a recurrent neural network built on a Gated Recurrent Unit (GRU) layer to predict the next word in the classic novel "War and Peace", sourced from Project Gutenberg. After cleaning and tokenizing the text, the token stream is cut into fixed-length sliding windows of four tokens (three context words plus the target word), so no padding is needed.
The GRU model reached about 80% accuracy when evaluated on its own training data, and it outperformed a previous LSTM implementation in both speed and predictive consistency.


---

##**Aim of the Project**

To build a lightweight and effective language model capable of understanding literary text structure.

To compare GRU performance against LSTM for next-word prediction tasks.

To explore how a GRU network handles text dependencies from short, fixed-length context windows, without any padding.



##**IMPORTING NECESSARY LIBRARIES FOR MY PROJECT**

In [5]:
from google.colab import drive
drive.mount("/content/drive")
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import Sequential
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Dense,GRU, Embedding

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


##**LOADING FILE FROM GOOGLE DRIVE**

In [6]:
file_path = "/content/drive/My Drive/Text Data/War and Peace.txt"

with open(file_path, "r", encoding="utf-8") as f:
  text = f.read()

In [20]:
text[:500]

' The Project Gutenberg eBook of War and Peace\n    \nThis ebook is for the use of anyone anywhere in the United States and\nmost other parts of the world at no cost and with almost no restrictions\nwhatsoever. You may copy it  give it away or re use it under the terms\nof the Project Gutenberg License included with this ebook or online\nat www.gutenberg.org. If you are not located in the United States \nyou will have to check the laws of the country where you are located\nbefore using this eBook.\n\nTitle'

##**CLEAN AND TOKENIZE TEXT**

In [21]:
import re
text = re.sub(r'[^a-zA-Z0-9.\s]', ' ', text)
clean_text = re.sub(r"\s+", " ", text)
clean_text = clean_text.strip()
clean_text[:500]

'The Project Gutenberg eBook of War and Peace This ebook is for the use of anyone anywhere in the United States and most other parts of the world at no cost and with almost no restrictions whatsoever. You may copy it give it away or re use it under the terms of the Project Gutenberg License included with this ebook or online at www.gutenberg.org. If you are not located in the United States you will have to check the laws of the country where you are located before using this eBook. Title War and '
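The two substitutions above can be tried on a short string to see exactly what they keep and what they drop. This is a minimal sketch: the helper name `clean` and the sample sentences are made up for illustration, while the two regular expressions are the ones used in the cell above.

```python
import re

def clean(raw: str) -> str:
    """Apply the same two regex passes as the cleaning cell (hypothetical helper)."""
    # Pass 1: anything that is not an ASCII letter, digit, period or whitespace becomes a space
    no_symbols = re.sub(r"[^a-zA-Z0-9.\s]", " ", raw)
    # Pass 2: collapse runs of whitespace (including newlines) into single spaces
    return re.sub(r"\s+", " ", no_symbols).strip()

sample = "You may copy it, give it away or re-use it\nunder the terms."
print(clean(sample))

# Side effect worth knowing: accented letters are not in a-zA-Z, so they split words
print(clean("Pávlovna"))
```

Because the character class only lists ASCII letters, any accented name is broken into fragments. If that matters, one option would be to widen the character class, though that is a design choice the notebook does not make.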

In [9]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts([clean_text])
vocal_size = len(tokenizer.word_index)+1
vocal_size

14428
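To make the `+1` in the vocabulary size less mysterious, here is a rough pure-Python approximation of how a frequency-ranked word index can be built. It is a sketch under stated assumptions, not the Keras implementation: `build_word_index` is a hypothetical helper, and it only mimics the lowercasing, period filtering and rank-by-frequency behaviour that the real `Tokenizer` applies by default.

```python
from collections import Counter

def build_word_index(text: str) -> dict[str, int]:
    """Map each word to a rank, most frequent word first, starting at 1 (illustrative only)."""
    # Lowercase and treat periods as separators, roughly like the default Tokenizer filters
    words = text.lower().replace(".", " ").split()
    counts = Counter(words)
    # most_common() orders by descending count; ties keep first-seen order
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

toy = "The war and the peace. The war."
word_index = build_word_index(toy)
vocab_size = len(word_index) + 1  # index 0 is never assigned to a word

print(word_index)
print(vocab_size)
```

Index 0 is left unused so that it can serve as a padding value, which is why the embedding and output layers need `len(word_index) + 1` slots.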

##**CONVERTING MY TEXT TO A NUMERIC REPRESENTATION**

In [10]:
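# Turn the whole cleaned text into one long list of token ids
sequence = tokenizer.texts_to_sequences([clean_text])[0]
input_sequence = []
# Slide a window of four tokens over the text: three context ids plus the id to predict
for i in range(3, len(sequence)):
  ngram = sequence[i-3:i+1]
  input_sequence.append(ngram)
print(f"Length of sequence: {len(sequence)}")
input_sequence[:20]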

Length of sequence: 372367


[[1, 2492, 4950, 3957],
 [2492, 4950, 3957, 4],
 [4950, 3957, 4, 241],
 [3957, 4, 241, 2],
 [4, 241, 2, 510],
 [241, 2, 510, 38],
 [2, 510, 38, 3957],
 [510, 38, 3957, 27],
 [38, 3957, 27, 25],
 [3957, 27, 25, 1],
 [27, 25, 1, 796],
 [25, 1, 796, 4],
 [1, 796, 4, 305],
 [796, 4, 305, 2353],
 [4, 305, 2353, 7],
 [305, 2353, 7, 1],
 [2353, 7, 1, 2493],
 [7, 1, 2493, 4951],
 [1, 2493, 4951, 2],
 [2493, 4951, 2, 272]]
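The windowing loop above can be written as a small reusable function. The sketch below is an illustration only: `make_windows` and its `context` parameter are names introduced here, and the toy ids are simply the first six token ids from the output above.

```python
def make_windows(token_ids, context=3):
    """Return every run of `context + 1` consecutive ids (hypothetical helper)."""
    size = context + 1  # three context ids plus one target id
    return [token_ids[i:i + size] for i in range(len(token_ids) - size + 1)]

toy_ids = [1, 2492, 4950, 3957, 4, 241]
windows = make_windows(toy_ids)
for window in windows:
    print(window)
```

A sequence of `n` tokens yields `n - context` windows, so the 372367 tokens here produce 372364 training examples. Every window has the same length, which is why no padding step is required.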

##**CREATE TRAINING DATA:**

In [11]:
input_sequence = np.array(input_sequence)

x = input_sequence[:, :-1]  # first three ids of each window: the context
y = input_sequence[:, -1]   # last id of each window: the word to predict

print(x)
print(y)

[[   1 2492 4950]
 [2492 4950 3957]
 [4950 3957    4]
 ...
 [ 113  250 4243]
 [ 250 4243  209]
 [4243  209   23]]
[3957    4  241 ...  209   23   41]
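All windows go into training here, so there is no unseen data left to measure generalization. One possible way to set some aside is sketched below. The helper `chronological_split` and the 10% fraction are assumptions for illustration, not part of the notebook. The split is taken from the end of the text rather than at random because neighbouring windows overlap by three tokens, and a shuffled split would put near-duplicates on both sides.

```python
def chronological_split(windows, val_fraction=0.1):
    """Keep the last `val_fraction` of the windows for validation (hypothetical helper)."""
    n_val = int(len(windows) * val_fraction)
    cut = len(windows) - n_val
    return windows[:cut], windows[cut:]

# Toy windows standing in for the real list of four-token windows
toy_windows = [[i, i + 1, i + 2, i + 3] for i in range(10)]
train_windows, val_windows = chronological_split(toy_windows, val_fraction=0.2)

print(len(train_windows), len(val_windows))
print(val_windows[0])
```

The validation part could then be converted to arrays in the same way as `x` and `y` and passed to `model.fit` through its `validation_data` argument.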


##**BUILDING MY GRU MODEL**

In [12]:
model = Sequential([Embedding(vocal_size, 500, input_length = 3),
                    GRU(500),
                    Dense(vocal_size, activation="softmax")])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

model.fit(x, y, epochs=40,batch_size=1000)

Epoch 1/40
373/373 ━━━━━━━━━━━━━━━━━━━━ 24s 49ms/step - accuracy: 0.0651 - loss: 7.1640
Epoch 2/40
373/373 ━━━━━━━━━━━━━━━━━━━━ 18s 49ms/step - accuracy: 0.1469 - loss: 5.5582
Epoch 3/40
373/373 ━━━━━━━━━━━━━━━━━━━━ 18s 49ms/step - accuracy: 0.1784 - loss: 5.0462
Epoch 4/40
373/373 ━━━━━━━━━━━━━━━━━━━━ 19s 50ms/step - accuracy: 0.2012 - loss: 4.6504
Epoch 5/40
373/373 ━━━━━━━━━━━━━━━━━━━━ 19s 51ms/step - accuracy: 0.2236 - loss: 4.2748
Epoch 6/40
373/373 ━━━━━━━━━━━━━━━━━━━━ 19s 50ms/step - accuracy: 0.2548 - loss: 3.9126
Epoch 7/40
373/373 ━━━━━━━━━━━━━━━━━━━━ 19s 50ms/step - accuracy: 0.2923 - loss: 3.5797
Epoch 8/40
373/373 ━━━━━━━━━━━━━━━━━━━━ 19s 51ms/step - accuracy: 0.3330 - loss: 3.2786
Epoch 9/40
373/373 ━━━

<keras.src.callbacks.history.History at 0x7ad5150393d0>
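For a sense of how large this three-layer model is, the parameter count can be worked out by hand. The sketch below is plain arithmetic rather than a Keras call. The helper names are made up, and the GRU formula assumes the Keras default `reset_after=True`, which keeps two bias vectors per gate. The result can be cross-checked against `model.summary()`.

```python
def embedding_params(vocab, dim):
    # One dense vector of length `dim` per vocabulary index
    return vocab * dim

def gru_params(input_dim, units, reset_after=True):
    # Three gates, each with an input kernel, a recurrent kernel and bias terms
    biases = 2 * units if reset_after else units
    return 3 * (units * (input_dim + units) + biases)

def dense_params(inputs, outputs):
    # Weight matrix plus one bias per output unit
    return inputs * outputs + outputs

vocab, dim, units = 14428, 500, 500  # values taken from the cells above
total = (embedding_params(vocab, dim)
         + gru_params(dim, units)
         + dense_params(units, vocab))
print(f"{total:,}")
```

Under these assumptions, the embedding and the softmax output layer each hold over seven million weights, far more than the GRU itself, because both scale with the vocabulary size.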

##**MODEL PERFORMANCE | TRAINING ACCURACY 80%**

In [13]:
model.evaluate(x, y)

11637/11637 ━━━━━━━━━━━━━━━━━━━━ 59s 5ms/step - accuracy: 0.8023 - loss: 0.6978


[0.7021440863609314, 0.8010038733482361]
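The step counts and the loss in these logs can be sanity-checked with a few lines of arithmetic. This is a sketch using only numbers printed above, plus one assumption stated in the comments: that `model.evaluate` ran with the Keras default batch size of 32. Perplexity is computed as the exponential of the cross-entropy loss.

```python
import math

n_tokens = 372367        # "Length of sequence" printed earlier
context = 3              # context words per window
n_windows = n_tokens - context

# Steps per pass = number of batches, rounding the last partial batch up
train_steps = math.ceil(n_windows / 1000)  # batch_size=1000 in model.fit
eval_steps = math.ceil(n_windows / 32)     # assumes the default batch size of 32

loss = 0.7021440863609314                  # first value returned by model.evaluate
perplexity = math.exp(loss)

print(train_steps, eval_steps)
print(round(perplexity, 2))
```

The two step counts match the `373/373` and `11637/11637` progress bars. A perplexity of about 2 on the training windows means that, on text it has already seen, the model is on average about as uncertain as a choice between two equally likely words. It says nothing about unseen text.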

In [14]:
model.save("/content/drive/My Drive/Text Data/GRU_model.keras")

In [15]:
import joblib
joblib.dump(tokenizer, "/content/drive/My Drive/Text Data/GRU_tokenizer.plk")


['/content/drive/My Drive/Text Data/GRU_tokenizer.plk']

##**LOADING MY SAVED MODEL AND TOKENIZER**

In [16]:
from keras.models import load_model
my_model = load_model("/content/drive/My Drive/Text Data/GRU_model.keras")
tokenizer = joblib.load("/content/drive/My Drive/Text Data/GRU_tokenizer.plk")

##**PREDICTION: Next 4 words from 3 input words**


In [17]:
def prediction(input_word, len_of_words, tokenizer, model):
  """Greedily append `len_of_words` predicted words to `input_word` using `model`."""
  generated_text = input_word  # grows by one word per iteration

  for _ in range(len_of_words):
    # Re-tokenize the running text so the newest words become the context
    seq = tokenizer.texts_to_sequences([generated_text])[0]

    if len(seq) < 3:
      print("Input words not in vocabulary or sequence is too short. Stopping prediction.")
      break

    # The model was trained on exactly three context tokens
    seq_to_predict = np.array(seq[-3:]).reshape(1, 3)

    # Greedy decoding: take the most probable token id
    pred_index = int(np.argmax(model.predict(seq_to_predict, verbose=0), axis=1)[0])

    # index_word is the tokenizer's id-to-word mapping; id 0 is reserved and has no word
    next_word = tokenizer.index_word.get(pred_index)
    if next_word is None:
      print(f"Predicted index {pred_index} not found in tokenizer vocabulary. Stopping prediction.")
      break

    generated_text += " " + next_word

  return generated_text

print(prediction(input_word="you may copy", len_of_words=4, tokenizer=tokenizer, model=model))

you may copy it give it to


##**PREDICTING THE NEXT 50 WORDS**

In [18]:
print(prediction(input_word= "you may copy", len_of_words=50, tokenizer=tokenizer, model=model))

you may copy it give it to the committee i do not think so think what about asked prince andrew with a characteristic desire to foment his own grief decided that he must retreat as quickly as possible and flying away like this take care you ll fall out he heard the voice


##**Conclusion / Deployment Summary**

When deployed, this model can:

Generate coherent next-word predictions for text composition and auto-completion tasks.

Support creative writing aids, typing assistants, or text generation pipelines.

Enhance NLP applications that require fast, context-aware suggestions.


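Its roughly 80% accuracy on a complex literary dataset is encouraging, but that figure was measured on the training data, so performance on unseen text still needs to be confirmed before relying on the model in real-world next-word prediction scenarios.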
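For the auto-completion and typing-assistant uses listed above, it is common to offer a few candidate words rather than the single most likely one. The sketch below shows one way to pick the top candidates from a probability vector. The helper `top_k_words`, the toy probabilities and the toy vocabulary are all made up for illustration. In the notebook, the probabilities would come from one row of `model.predict(...)` and the mapping from `tokenizer.index_word`.

```python
import heapq

def top_k_words(probabilities, index_word, k=3):
    """Return the k most probable (word, probability) pairs (hypothetical helper)."""
    # Rank token ids by their probability, highest first
    best_ids = heapq.nlargest(k, range(len(probabilities)), key=probabilities.__getitem__)
    # Skip ids with no word attached, such as the reserved id 0
    return [(index_word[i], probabilities[i]) for i in best_ids if i in index_word]

# Toy softmax output over a five-slot vocabulary; slot 0 is the reserved id
toy_probs = [0.0, 0.5, 0.05, 0.3, 0.15]
toy_index_word = {1: "it", 2: "the", 3: "away", 4: "and"}

suggestions = top_k_words(toy_probs, toy_index_word, k=3)
print(suggestions)
```

Showing several candidates also makes the model's uncertainty visible to the user, which a single greedy prediction hides.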