## Project Description: Next Word Prediction Using LSTM
#### Project Overview:

This project aims to develop a deep learning model for predicting the next word in a given sequence of words. The model is built using Long Short-Term Memory (LSTM) networks, which are well-suited for sequence prediction tasks. The project includes the following steps:

1- Data Collection: We use the text of Jane Austen's "Emma", loaded from the NLTK Gutenberg corpus, as our dataset. This rich, complex text provides a good challenge for our model.

2- Data Preprocessing: The text data is tokenized, converted into sequences, and padded to ensure uniform input lengths. The sequences are then split into training and testing sets.

3- Model Building: An LSTM model is constructed with an embedding layer, two LSTM layers, and a dense output layer with a softmax activation function to predict the probability of the next word.

4- Model Training: The model is trained using the prepared sequences, with early stopping implemented to prevent overfitting. Early stopping monitors the validation loss and stops training when the loss stops improving.

5- Model Evaluation: The model is evaluated using a set of example sentences to test its ability to predict the next word accurately.

6- Deployment: A Streamlit web application is developed to allow users to input a sequence of words and get the predicted next word in real time.
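
The early-stopping rule in step 4 can be illustrated with a minimal, framework-free sketch. The function name `early_stopping_epoch`, the `patience` value, and the loss numbers are assumptions made for illustration; in Keras the same idea is available through the `EarlyStopping` callback with `monitor='val_loss'` and a `patience` argument.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the 1-based epoch at which training would stop, or None if it never triggers."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # New best validation loss: remember it and reset the counter
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

# Made-up validation losses, for illustration only
losses = [6.9, 6.4, 6.1, 6.2, 6.3, 6.15, 6.0]
stop_epoch = early_stopping_epoch(losses, patience=3)
print(stop_epoch)
```

With these numbers the best loss is reached at epoch 3, and the three following epochs fail to beat it, so training would stop before ever seeing the later improvement. A larger `patience` trades longer training for a lower risk of stopping too early.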

In [None]:
### Data Collection
import nltk
nltk.download('gutenberg')
from nltk.corpus import gutenberg
import pandas as pd 

## Load the dataset
data = gutenberg.raw('austen-emma.txt')
## save to a file
with open('emma.txt', 'w') as file:
    file.write(data)

[nltk_data] Downloading package gutenberg to
[nltk_data]     C:\Users\ninaw/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


In [None]:
### Data Preprocessing

import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split

## Load the dataset

with open('emma.txt', 'r') as file:
    text = file.read().lower()

## Tokenize the text - creating indexes for words
## Note: this rebinds the name `Tokenizer` from the class to an instance
Tokenizer = Tokenizer()
Tokenizer.fit_on_texts([text])
total_words = len(Tokenizer.word_index)+1
total_words

7233
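
To make the `+1` in `total_words` concrete, here is a rough, pure-Python approximation of how a frequency-ranked word index is built. It is a sketch, not the Keras implementation: the helper `build_word_index`, the simplified regex, and the toy sentence are all assumptions. The key point is that indexes start at 1, leaving 0 free for padding, so the vocabulary size is one more than the number of distinct words.

```python
import re
from collections import Counter

def build_word_index(text):
    """Map each word to its frequency rank, starting at 1 (0 is left for padding)."""
    # Simplified tokenization: lowercase, keep runs of letters and apostrophes
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    # Stable sort by descending count, so ties keep first-appearance order
    ranked = sorted(counts.items(), key=lambda kv: -kv[1])
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

# Toy sentence, for illustration only
sample = "Emma was happy and Emma was clever"
word_index = build_word_index(sample)
vocab_size = len(word_index) + 1  # +1 for the reserved padding index 0
print(word_index, vocab_size)
```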

In [None]:
Tokenizer.word_index

{'to': 1,
 'the': 2,
 'and': 3,
 'of': 4,
 'i': 5,
 'a': 6,
 'it': 7,
 'her': 8,
 'was': 9,
 'she': 10,
 'in': 11,
 'not': 12,
 'you': 13,
 'be': 14,
 'he': 15,
 'that': 16,
 'had': 17,
 'but': 18,
 'as': 19,
 'for': 20,
 'have': 21,
 'is': 22,
 'with': 23,
 'very': 24,
 'mr': 25,
 'his': 26,
 'at': 27,
 'so': 28,
 'all': 29,
 'could': 30,
 'would': 31,
 'emma': 32,
 'him': 33,
 'been': 34,
 'no': 35,
 'my': 36,
 'mrs': 37,
 'on': 38,
 'any': 39,
 'do': 40,
 'miss': 41,
 'were': 42,
 'me': 43,
 'will': 44,
 'by': 45,
 'must': 46,
 'which': 47,
 'there': 48,
 'from': 49,
 'they': 50,
 'what': 51,
 'this': 52,
 'or': 53,
 'such': 54,
 'much': 55,
 'if': 56,
 'said': 57,
 'more': 58,
 'an': 59,
 'are': 60,
 'one': 61,
 'them': 62,
 'every': 63,
 'than': 64,
 'harriet': 65,
 'am': 66,
 'well': 67,
 'thing': 68,
 'weston': 69,
 'think': 70,
 'how': 71,
 'should': 72,
 'your': 73,
 'when': 74,
 'little': 75,
 'being': 76,
 'never': 77,
 'good': 78,
 'we': 79,
 'knightley': 80,
 'did': 81,
 ...}

In [None]:
## Creating the input sequences
input_sequences = []
for line in text.split('\n'):
    token_list = Tokenizer.texts_to_sequences([line])[0]
    for i in range(1,len(token_list)):
        # prefix of the first i+1 tokens: i=1 -> 2 tokens, i=2 -> 3 tokens, ...
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)

In [24]:
input_sequences

[[32, 45],
 [32, 45, 92],
 [32, 45, 92, 4410],
 [32, 45, 92, 4410, 4411],
 [2794, 5],
 [346, 5],
 [32, 96],
 [32, 96, 493],
 [32, 96, 493, 633],
 [32, 96, 493, 633, 3],
 [32, 96, 493, 633, 3, 1024],
 [32, 96, 493, 633, 3, 1024, 23],
 [32, 96, 493, 633, 3, 1024, 23, 6],
 [32, 96, 493, 633, 3, 1024, 23, 6, 532],
 [32, 96, 493, 633, 3, 1024, 23, 6, 532, 163],
 [3, 171],
 [3, 171, 697],
 [3, 171, 697, 156],
 [3, 171, 697, 156, 1],
 [3, 171, 697, 156, 1, 2795],
 [3, 171, 697, 156, 1, 2795, 97],
 [3, 171, 697, 156, 1, 2795, 97, 4],
 [3, 171, 697, 156, 1, 2795, 97, 4, 2],
 [3, 171, 697, 156, 1, 2795, 97, 4, 2, 238],
 [3, 171, 697, 156, 1, 2795, 97, 4, 2, 238, 1853],
 [4, 1551],
 [4, 1551, 3],
 [4, 1551, 3, 17],
 [4, 1551, 3, 17, 675],
 [4, 1551, 3, 17, 675, 1025],
 [4, 1551, 3, 17, 675, 1025, 588],
 [4, 1551, 3, 17, 675, 1025, 588, 61],
 [4, 1551, 3, 17, 675, 1025, 588, 61, 364],
 [4, 1551, 3, 17, 675, 1025, 588, 61, 364, 11],
 [4, 1551, 3, 17, 675, 1025, 588, 61, 364, 11, 2],
 ...]
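
The prefix expansion can be checked on a single line in isolation. The sketch below uses a hypothetical helper, `prefix_sequences`, and a token list inferred from the first four rows of the output above. Each prefix later becomes one training example: everything but the last token is the context, and the last token is the word to predict.

```python
def prefix_sequences(token_list):
    """Return every prefix of token_list that has at least two tokens."""
    return [token_list[: i + 1] for i in range(1, len(token_list))]

# Token ids inferred from the first rows of the printed output
first_line = [32, 45, 92, 4410, 4411]
prefixes = prefix_sequences(first_line)

for seq in prefixes:
    # context tokens -> target token
    print(seq[:-1], "->", seq[-1])
```

A line with a single token produces no sequences at all, since there is no context to predict from.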

In [None]:
## Pad Sequences
max_sequence_len = max([len(x) for x in input_sequences])
max_sequence_len

17

In [38]:
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))
input_sequences

array([[   0,    0,    0, ...,    0,   32,   45],
       [   0,    0,    0, ...,   32,   45,   92],
       [   0,    0,    0, ...,   45,   92, 4410],
       ...,
       [   0,    0,    0, ...,  534,  260,    4],
       [   0,    0,    0, ...,  260,    4,    2],
       [   0,    0,    0, ...,    4,    2, 2784]])
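
As a rough illustration of what `padding='pre'` does, the following pure-Python sketch left-pads short sequences with zeros. The helper `pad_pre` is hypothetical and only mimics the behaviour used here, including keeping the last `maxlen` tokens of an over-long sequence, which corresponds to the `truncating='pre'` default of `pad_sequences`.

```python
def pad_pre(sequences, maxlen):
    """Left-pad each sequence with zeros to length maxlen; keep the last maxlen tokens if longer."""
    padded = []
    for seq in sequences:
        kept = seq[-maxlen:]
        padded.append([0] * (maxlen - len(kept)) + kept)
    return padded

rows = pad_pre([[32, 45], [32, 45, 92], [32, 45, 92, 4410]], maxlen=5)
for row in rows:
    print(row)
```

Padding on the left keeps the target word in the final column of every row, which is what makes it possible to split inputs from labels with a single slice.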

In [39]:
### Create predictors (x) and labels (y)

import tensorflow as tf
x,y = input_sequences[:, :-1], input_sequences[:,-1]

In [40]:
x

array([[  0,   0,   0, ...,   0,   0,  32],
       [  0,   0,   0, ...,   0,  32,  45],
       [  0,   0,   0, ...,  32,  45,  92],
       ...,
       [  0,   0,   0, ...,   2, 534, 260],
       [  0,   0,   0, ..., 534, 260,   4],
       [  0,   0,   0, ..., 260,   4,   2]])

In [41]:
y

array([  45,   92, 4410, ...,    4,    2, 2784])

In [None]:
## One-hot encode the labels (1 at the target word's index, 0 everywhere else)

y = tf.keras.utils.to_categorical(y,num_classes=total_words)
y

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)
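
One-hot encoding itself is simple enough to sketch by hand. The helper `one_hot` below is hypothetical and works on a single label, whereas `to_categorical` does the same for a whole array at once.

```python
def one_hot(label, num_classes):
    """Return a list of num_classes floats with 1.0 at position label and 0.0 elsewhere."""
    vec = [0.0] * num_classes
    vec[label] = 1.0
    return vec

# Toy vocabulary of 6 classes, for illustration only
encoded = one_hot(2, num_classes=6)
print(encoded)
```

Each label becomes a vector as long as the vocabulary, so the full label matrix has one column per word. If memory becomes a concern, one option the notebook does not use is to keep the integer labels and train with Keras's `sparse_categorical_crossentropy` loss instead.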

In [43]:
## Split the data into training and testing sets

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
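
The split above is random, so it changes on every run unless a seed is fixed; `train_test_split` accepts a `random_state` argument for that purpose. The pure-Python sketch below shows the underlying idea with a hypothetical helper, `split_indices`; the seed and sample count are arbitrary.

```python
import math
import random

def split_indices(n_samples, test_size=0.2, seed=42):
    """Shuffle sample indices with a fixed seed and return (train_indices, test_indices)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_samples * test_size)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(10, test_size=0.2, seed=42)
print(len(train_idx), len(test_idx))
```

Note that the prefixes generated from one line of text overlap heavily, so a purely random split may place near-duplicate examples in both sets and make the test score look better than it would on unseen text.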

In [44]:
# Train our LSTM RNN