## Next Word Prediction

- **Next word prediction** is the task of predicting the word or phrase most likely to appear next in a sentence or text.
- It is the same idea as the autocomplete feature that suggests your next word as you type or speak.
- Messaging apps, search engines, virtual assistants, and smartphone keyboards all rely on Next Word Prediction models.
- If you want to learn how to build one, this post is for you.
- In this tutorial, I'll walk you through creating a Next Word Prediction model with Python and Deep Learning.

### What is the Next Word Prediction Model & How to Build it?

- Next word prediction, a task in Machine Learning language modeling, seeks to anticipate the word or sequence of words most likely to follow a provided input context. 
- This process relies on statistical patterns and linguistic structures to make precise predictions according to the given context.
- Next Word Prediction models find utility in diverse sectors. 
- For instance, in mobile messaging, they offer word suggestions to hasten typing. 
- Likewise, search engines propose search terms as users type. 
- This technology expedites communication and search processes by anticipating user inputs accurately.

**To construct a Next Word Prediction model**

- begin by gathering a varied dataset of textual materials,
- cleanse and tokenize the data for preprocessing,
- organize the data into input-output pairs,
- create word embeddings as part of feature engineering,
- choose a suitable model such as LSTM or GPT,
- train the model on the dataset while fine-tuning hyperparameters,
- enhance the model by exploring diverse methods and structures.

- **This iterative approach makes it possible to build accurate Next Word Prediction models that adapt to different contexts.**
- **Building the model starts with gathering text data; the words in that data form the model's vocabulary.**
- **For instance, the text typed on a smartphone keyboard serves as the vocabulary for its predictive text feature.**
- **For this tutorial, I use the plain text of a Sherlock Holmes book as the dataset.**

### Next Word Prediction Model using Python

- I hope you now know what a Next Word Prediction model is. 
- In this section, I’ll take you through how to build a Next Word Prediction model using Python and Deep Learning. 
- So, let’s start this task by importing the necessary Python libraries.

In [1]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

# Read the text file
with open('C:/Users/asus/OneDrive/Desktop/ML_Datasets/project/More_Projects/sherlock-holm.es_stories_plain-text_advs.txt', 
          'r', encoding='utf-8') as file:
    text = file.read()

#### Now let’s tokenize the text to create a sequence of words:

In [2]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts([text])
total_words = len(tokenizer.word_index) + 1

- The code above tokenizes the text, splitting it into individual words.
- A 'Tokenizer' object is created, and its 'fit_on_texts' method is called with the text as its argument.
- This method scans the text, builds a vocabulary of unique words, and assigns each word an integer index starting at 1, with lower indices going to more frequent words.
- Finally, 'total_words' is set to the length of the word index plus one. The extra one accounts for index 0, which Keras reserves for padding and never assigns to a word.
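
To make the indexing concrete, here is a minimal pure-Python sketch of what 'fit_on_texts' does conceptually. The helper 'build_word_index' and the sample sentence are my own illustrations, not part of Keras, and the real 'Tokenizer' handles punctuation with its own filter list.

```python
from collections import Counter
import re

def build_word_index(text):
    """Simplified stand-in for Tokenizer.fit_on_texts: lowercase, split, rank by frequency."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter(words)
    # Index 1 goes to the most frequent word; index 0 is left unused for padding.
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

sample = "The cat sat on the mat. The end."
word_index = build_word_index(sample)
total_words = len(word_index) + 1  # +1 for the reserved padding index 0

print(word_index)   # 'the' appears three times, so it gets index 1
print(total_words)  # 6 unique words + 1 = 7
```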

#### Now let’s create input-output pairs by splitting the text into sequences of tokens and forming n-grams from the sequences:

In [3]:
input_sequences = []
for line in text.split('\n'):
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)

In the code above, the text is split into lines using the newline character ('\n') as the separator.
- Each line is converted into a sequence of integer tokens with the tokenizer's 'texts_to_sequences' method, which uses the vocabulary built in the previous step.
- The inner loop then builds every prefix of the line that contains at least two tokens: on iteration 'i', it takes the tokens from the start of the list up to and including index 'i'. In each prefix, the last token is the target word and the tokens before it are the input context.
- These sequences are collected in the 'input_sequences' list, so a single line yields several training examples for the next word prediction model.
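
As a quick illustration of the prefixes this loop produces, here is a small sketch on one made-up line. The token ids and the helper name 'prefix_ngrams' are hypothetical.

```python
def prefix_ngrams(token_list):
    """Return every prefix of length >= 2, mirroring token_list[:i+1] in the loop above."""
    return [token_list[:i + 1] for i in range(1, len(token_list))]

# Hypothetical token ids for one four-word line.
tokens = [4, 7, 2, 9]
for seq in prefix_ngrams(tokens):
    context, target = seq[:-1], seq[-1]
    print(context, "->", target)
# [4] -> 7
# [4, 7] -> 2
# [4, 7, 2] -> 9
```

A line with fewer than two tokens produces no training examples at all, which is why empty lines in the text are skipped silently.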

#### Now let’s pad the input sequences to have equal length:

In [4]:
max_sequence_len = max([len(seq) for seq in input_sequences])
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))

In the code above, the input sequences are padded so that they all have the same length.
- 'max_sequence_len' is the length of the longest sequence in 'input_sequences'.
- 'pad_sequences' pads every shorter sequence with zeros up to this length. Because 'maxlen' equals the length of the longest sequence, nothing is truncated.
- The argument padding='pre' adds the zeros at the beginning of each sequence, so the real tokens stay at the end. The result is then converted into a NumPy array for further processing.
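
Here is a minimal pure-Python sketch of pre-padding, just to show the shape of the result. 'pad_pre' is a hypothetical helper that imitates the default behaviour of 'pad_sequences' with padding='pre'; it is not the Keras implementation.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad each sequence with `value`; keep the last `maxlen` items if it is too long."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # drop items from the front if the sequence is too long
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

sequences = [[4, 7], [4, 7, 2], [4, 7, 2, 9]]
max_len = max(len(seq) for seq in sequences)
for row in pad_pre(sequences, max_len):
    print(row)
# [0, 0, 4, 7]
# [0, 4, 7, 2]
# [4, 7, 2, 9]
```

Padding on the left keeps the target word in the last column of every row, which is what makes the next step a simple column split.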

#### Now let’s split the sequences into input and output:

In [5]:
X = input_sequences[:, :-1]
y = input_sequences[:, -1]

- In the code above, the padded sequences are split into two arrays, 'X' and 'y', which serve as the input and output for training the next word prediction model.
- 'X' contains every column of 'input_sequences' except the last, which is the input context. 'y' contains only the last column, which is the target word.
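
The same split can be sketched with plain Python lists on a tiny padded example. The values and the names 'X_demo' and 'y_demo' are made up for illustration.

```python
padded = [
    [0, 0, 4, 7],
    [0, 4, 7, 2],
    [4, 7, 2, 9],
]

# Same idea as input_sequences[:, :-1] and input_sequences[:, -1], written with plain lists.
X_demo = [row[:-1] for row in padded]
y_demo = [row[-1] for row in padded]

print(X_demo)  # [[0, 0, 4], [0, 4, 7], [4, 7, 2]]
print(y_demo)  # [7, 2, 9]
```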

#### Now let’s convert the output to one-hot encoded vectors:

In [6]:
y = np.array(tf.keras.utils.to_categorical(y, num_classes=total_words))

In the code above, the output array is converted into a format suitable for training: each target word becomes a binary vector of length 'total_words' with a 1 at the word's index and 0 everywhere else.
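
To show why one-hot targets work with a cross-entropy loss, here is a small pure-Python sketch. The probabilities and helper names are hypothetical; the point is only that the loss with a one-hot target reduces to the negative log of the probability assigned to the correct word.

```python
import math

def one_hot(index, num_classes):
    """Binary vector with a single 1.0 at `index`."""
    vec = [0.0] * num_classes
    vec[index] = 1.0
    return vec

def categorical_crossentropy(one_hot_target, probs):
    """Cross-entropy between a one-hot target and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probs))

def sparse_crossentropy(target_index, probs):
    """Same loss computed directly from the integer target."""
    return -math.log(probs[target_index])

probs = [0.1, 0.2, 0.6, 0.1]  # hypothetical softmax output over 4 words
target = 2

print(one_hot(target, 4))  # [0.0, 0.0, 1.0, 0.0]
print(categorical_crossentropy(one_hot(target, 4), probs))
print(sparse_crossentropy(target, probs))
```

Both functions print the same value. Keras offers a 'sparse_categorical_crossentropy' loss that accepts integer targets directly, which may be worth considering if the one-hot matrix becomes too large for memory; this tutorial keeps the one-hot version.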

#### Now let’s build a neural network architecture to train the model:

In [7]:
import warnings
warnings.simplefilter(action='ignore', category=(FutureWarning, DeprecationWarning))

model = Sequential()
model.add(Embedding(total_words, 100, input_shape=(max_sequence_len-1,)))
model.add(LSTM(150))
model.add(Dense(total_words, activation='softmax'))
model.summary()  # summary() prints the table itself and returns None


- The provided code outlines the architecture of the next word prediction model. 
- It employs a 'Sequential' model, representing a linear stack of layers. 
- The first layer, 'Embedding', transforms input sequences into fixed-size dense vectors. 
- It takes 'total_words' as the vocabulary size, '100' as the embedding dimension, and 'input_shape=(max_sequence_len-1,)' as the length of each input sequence.
- Following this, an 'LSTM' layer is added with 150 units to capture sequential patterns. 
- Finally, a 'Dense' layer with 'total_words' units and a 'softmax' activation function generates output predictions by assigning probabilities to each word.
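
To get a feel for the size of this network, the trainable parameters of each layer can be estimated by hand. The sketch below uses the standard parameter-count formulas; the vocabulary size of 8000 is a made-up placeholder, so substitute your own 'total_words'.

```python
def embedding_params(vocab_size, embed_dim):
    """One embedding vector per vocabulary index."""
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    """Weight matrix plus one bias per output unit."""
    return input_dim * units + units

vocab_size = 8000  # hypothetical; use total_words from the tokenizer in practice
total = (embedding_params(vocab_size, 100)
         + lstm_params(100, 150)
         + dense_params(150, vocab_size))

print(lstm_params(100, 150))  # 150600
print(total)                  # 2158600 for the hypothetical vocabulary size
```

The LSTM count does not depend on the vocabulary, while the 'Embedding' and 'Dense' layers both grow linearly with it, so the vocabulary size dominates the model size.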

#### Now let’s compile and train the model:

In [8]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, y, epochs=50, verbose=1)

Epoch 1/50
3010/3010 ━━━━━━━━━━━━━━━━━━━━ 87s 28ms/step - accuracy: 0.0600 - loss: 6.5601
Epoch 2/50
3010/3010 ━━━━━━━━━━━━━━━━━━━━ 90s 30ms/step - accuracy: 0.1168 - loss: 5.5713
Epoch 3/50
3010/3010 ━━━━━━━━━━━━━━━━━━━━ 78s 26ms/step - accuracy: 0.1446 - loss: 5.1301
Epoch 4/50
3010/3010 ━━━━━━━━━━━━━━━━━━━━ 74s 25ms/step - accuracy: 0.1630 - loss: 4.7880
Epoch 5/50
3010/3010 ━━━━━━━━━━━━━━━━━━━━ 75s 25ms/step - accuracy: 0.1850 - loss: 4.4373
Epoch 6/50
3010/3010 ━━━━━━━━━━━━━━━━━━━━ 75s 25ms/step - accuracy: 0.2022 - loss: 4.1590
Epoch 7/50
3010/3010 ━━━━━━━━━━━━━━━━━━━━ 90s 30ms/step - accuracy: 0.2326 - loss: 3.8742
Epoch 8/50
3010/3010 ━━━━━━━━━━━━━━━━━━━━ 88s 29ms/step - accuracy: 0.2630 - loss: 3.6026
Epoch 9/


- The provided code compiles and trains the model.
- The 'compile' method configures the model for training. 'loss' is set to 'categorical_crossentropy', the usual choice for multi-class classification with one-hot targets, and 'optimizer' is set to 'adam', which adapts the learning rate during training.
- 'metrics' is set to 'accuracy' so that the share of correctly predicted next words is reported after each epoch.
- The 'fit' method trains the model on the inputs 'X' and targets 'y'. 'epochs' is the number of full passes over the training data, and 'verbose=1' shows a progress bar for each epoch.
- Upon completion, the model can be used for generating next word predictions.
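
The step count in the training log also tells you roughly how many training examples were generated. The sketch below assumes the default 'batch_size' of 32, since 'fit' is called without one; 'steps_per_epoch' is a hypothetical helper, not a Keras function.

```python
import math

def steps_per_epoch(num_samples, batch_size=32):
    """Number of batches per epoch when the last batch may be partial."""
    return math.ceil(num_samples / batch_size)

# The log shows 3010 steps, so with batch_size=32 the sample count n
# satisfies 3009 * 32 < n <= 3010 * 32.
low, high = 3009 * 32 + 1, 3010 * 32
print(low, high)  # 96289 96320
assert steps_per_epoch(low) == steps_per_epoch(high) == 3010
```

So the n-gram step produced a little over 96,000 input-output pairs from the book.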

In [10]:
seed_text = input()
next_words = 3

for _ in range(next_words):
    token_list = tokenizer.texts_to_sequences([seed_text])[0]
    token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
    # predict returns one row of probabilities; take the index of the largest value
    predicted = int(np.argmax(model.predict(token_list), axis=-1)[0])
    # index_word is the tokenizer's reverse mapping (index -> word)
    output_word = tokenizer.index_word.get(predicted, "")
    seed_text += " " + output_word

print(seed_text)

hey there
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 16ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 16ms/step
hey there is one thing


The code above generates the next words for a given seed text.
- 'seed_text' holds the initial text read from the user with 'input()', and 'next_words' sets how many words to generate.
- Inside the loop, 'seed_text' is converted into a token sequence with the tokenizer and pre-padded to 'max_sequence_len-1', the input length the model was trained on.
- The model's 'predict' method returns a probability for every word in the vocabulary, and 'np.argmax' selects the index with the highest probability.
- The predicted word is appended to 'seed_text', so it becomes part of the context for the next prediction. This repeats 'next_words' times.
- Finally, 'seed_text' is printed, showing the initial text followed by the generated words. In the run above, the seed "hey there" is extended to "hey there is one thing".
- This is how you can build a Next Word Prediction model using Deep Learning and Python.
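
The selection step is a greedy choice: always take the single most probable word. Here is a minimal sketch of that step in isolation, using a made-up three-word vocabulary and a made-up probability vector. 'greedy_next_word' is a hypothetical helper; inverting the word index once avoids scanning the whole vocabulary for every prediction.

```python
def greedy_next_word(probs, index_to_word):
    """Pick the highest-probability index and map it back to a word ('' if unknown)."""
    best_index = max(range(len(probs)), key=probs.__getitem__)
    return index_to_word.get(best_index, "")

# Hypothetical vocabulary and model output; index 0 is the padding slot.
word_index = {"is": 1, "one": 2, "thing": 3}
index_to_word = {index: word for word, index in word_index.items()}
probs = [0.05, 0.15, 0.70, 0.10]

print(greedy_next_word(probs, index_to_word))  # one
```

Because the choice is deterministic, the same seed text always produces the same continuation.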

### Summary

**Next word prediction, a task in Machine Learning, seeks to forecast the most likely word or word sequence following a provided input context. It leverages statistical patterns and linguistic structures to make precise predictions based on the context. I trust you found this article informative on constructing a Next Word Prediction model with Python.**