In [None]:
text = """In the realm of technology and artificial intelligence, the word predictor project stands as a testament to innovation. It's all about predicting words. Simple, right? Yet, beneath this simplicity lies a world of complexity and potential. Imagine typing on your smartphone, and as you start to type, the device anticipates what you'll say next. It's not just convenient; it's efficient and almost magical. This project dives deep into the world of Natural Language Processing (NLP) to make this magic happen.

NLP, a field at the intersection of computer science and linguistics, is what enables your device to understand and generate human language. It's the technology that powers chatbots, virtual assistants, and even the autocorrect feature on your phone. It's a field that constantly evolves, fueled by large datasets, powerful algorithms, and the quest to make human-computer interactions more seamless.

To build an effective word predictor, you need data—lots of it. The more, the better. So, this project starts by collecting vast amounts of text from diverse sources. It could be news articles, books, social media posts, or just about anything with words. This corpus of text serves as the project's playground, the place where the word predictor learns the ropes.

But it's not enough to throw text at the predictor and hope for the best. You need to clean it, like panning for gold in a river. Punctuation, special characters, and irrelevant information must be removed, leaving behind only the raw material of language. Once the text is cleaned, it's time to roll up the sleeves and start the training.

But wait, there's more to it than just guessing. A good word predictor doesn't just look at the last word you typed; it considers the entire sentence. It's like having a conversation where each word flows naturally from the one before it. So, the model doesn't work in isolation. It uses a technique called 'recurrent' neural networks or fancy transformers that help it keep track of the context.

After hours, days, or even weeks of training (depending on the size of your dataset and the power of your computer), the word predictor is ready to show its skills. You type, and it predicts. You pause, and it waits. It's almost like a digital partner in your writing endeavors, suggesting the next word as you compose an email, write a novel, or chat with friends.

But there's a twist here. This word predictor isn't just about predicting any word; it's about predicting the right word—the word that fits your writing style, the word that aligns with your thoughts, the word that makes your communication clearer and more efficient. It's about personalization. As you use it more, it gets to know you better. It learns your quirks, your preferences, and your idiosyncrasies.

This project isn't just about convenience; it's about empowerment. It's about enabling people to communicate more effectively, whether they're crafting a research paper, sending a heartfelt message, or writing code. A word predictor isn't just a tool; it's a companion in the journey of words.

Imagine a future where technology seamlessly blends into our lives, not as a separate entity but as an extension of ourselves. A future where typing becomes a fluid act of thought, where the boundary between human and machine blurs, and where communication is as effortless as thinking. This project is a step in that direction, a step towards a world where words flow freely and where technology is not just smart but also intuitive.

As the project evolves, it holds the promise of touching various aspects of our lives. From aiding individuals with disabilities in communication to assisting writers in generating ideas, from making customer service chatbots more responsive to helping students with their essays, the applications are boundless. It's not just about predicting words; it's about transforming how we interact with technology and with each other through the medium of text.

In the grand tapestry of technology, this word predictor project represents a thread—one that weaves together the threads of NLP, machine learning, and human ingenuity. It's a testament to what's possible when we combine data, algorithms, and creativity. It's about making technology work for us, not the other way around.

So, as you type away on your devices, remember that there's more than meets the eye. Behind those letters and spaces, behind each word you type, there's a world of algorithms and neural networks working tirelessly to make your words count, to make your thoughts flow, and to make your communication a breeze"""

In [None]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer


In [None]:
tokenizer = Tokenizer()

In [None]:
tokenizer.fit_on_texts([text])

In [None]:
vocab_size = len(tokenizer.word_index) + 1

In [None]:
len(tokenizer.word_index)

343
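
As a rough illustration of where a word index like this comes from, the sketch below rebuilds one in plain Python for a toy sentence. It assumes the Tokenizer defaults (lower-casing, punctuation replaced by spaces, index 1 for the most frequent word, index 0 reserved for padding). `build_word_index` is a hypothetical helper, not part of Keras, and its punctuation handling is simplified: it also strips apostrophes, which the Keras default filter keeps.

```python
from collections import Counter
import string


def build_word_index(texts):
    """Hypothetical helper: map each word to a 1-based rank by frequency."""
    # Replace punctuation with spaces so that "mat." and "mat" count as one word
    table = str.maketrans({ch: " " for ch in string.punctuation})
    counts = Counter()
    for t in texts:
        counts.update(t.lower().translate(table).split())
    # most_common() sorts by descending count; ties keep first-seen order
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}


toy_index = build_word_index(["the cat sat on the mat. The end!"])
toy_vocab_size = len(toy_index) + 1  # +1 because index 0 is reserved for padding

print(toy_index)
print(toy_vocab_size)
```

The `+ 1` mirrors the `vocab_size` cell above: the index starts at 1, so a model that outputs one class per index needs one extra slot for 0.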

In [None]:
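input_sequences = []
for sentence in text.split('\n'):
  tokenized_sentence = tokenizer.texts_to_sequences([sentence])[0]
  # Nested inside the outer loop so that every line of the text contributes
  # its prefixes, not only the last line
  for i in range(1, len(tokenized_sentence)):
    input_sequences.append(tokenized_sentence[:i+1])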
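
The cell above is meant to turn a tokenized line into a list of growing prefixes, so that each prefix can later be split into context words and a next-word label. A minimal sketch of that step on a toy list of token IDs (the IDs are made up for illustration, and `prefix_sequences` is a hypothetical helper):

```python
def prefix_sequences(token_ids):
    """Return every prefix of length >= 2 of a list of token IDs."""
    return [token_ids[:i + 1] for i in range(1, len(token_ids))]


toy_ids = [7, 3, 9, 4]  # made-up token IDs for a four-word line
toy_prefixes = prefix_sequences(toy_ids)

for seq in toy_prefixes:
    print(seq)
```

A line with n tokens yields n - 1 prefixes, so lines with fewer than two tokens (such as the blank lines between paragraphs) contribute nothing.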

In [None]:


input_sequences

[[44, 10],
 [44, 10, 12],
 [44, 10, 12, 33],
 [44, 10, 12, 33, 332],
 [44, 10, 12, 33, 332, 32],
 [44, 10, 12, 33, 332, 32, 8],
 [44, 10, 12, 33, 332, 32, 8, 333],
 [44, 10, 12, 33, 332, 32, 8, 333, 334],
 [44, 10, 12, 33, 332, 32, 8, 333, 334, 17],
 [44, 10, 12, 33, 332, 32, 8, 333, 334, 17, 36],
 [44, 10, 12, 33, 332, 32, 8, 333, 334, 17, 36, 18],
 [44, 10, 12, 33, 332, 32, 8, 333, 334, 17, 36, 18, 72],
 [44, 10, 12, 33, 332, 32, 8, 333, 334, 17, 36, 18, 72, 335],
 [44, 10, 12, 33, 332, 32, 8, 333, 334, 17, 36, 18, 72, 335, 1],
 [44, 10, 12, 33, 332, 32, 8, 333, 334, 17, 36, 18, 72, 335, 1, 336],
 [44, 10, 12, 33, 332, 32, 8, 333, 334, 17, 36, 18, 72, 335, 1, 336, 47],
 [44, 10, 12, 33, 332, 32, 8, 333, 334, 17, 36, 18, 72, 335, 1, 336, 47, 337],
 [44, 10, 12, 33, 332, 32, 8, 333, 334, 17, 36, 18, 72, 335, 1, 336, 47, 337, 338],
 [44, 10, 12, 33, 332, 32, 8, 333, 334, 17, 36, 18, 72, 335, 1, 336, 47, 337, 338,
  ...

In [None]:

# Find the longest sequence so that every input can be padded to the same length
max_len = max(len(x) for x in input_sequences)

In [None]:
max_len

53

In [None]:

from tensorflow.keras.preprocessing.sequence import pad_sequences
padded_input_sequences = pad_sequences(input_sequences, maxlen = max_len, padding='pre')

In [None]:
padded_input_sequences

array([[  0,   0,   0, ...,   0,  44,  10],
       [  0,   0,   0, ...,  44,  10,  12],
       [  0,   0,   0, ...,  10,  12,  33],
       ...,
       [  0,   0,  44, ...,  26,   8,  37],
       [  0,  44,  10, ...,   8,  37,   2],
       [ 44,  10,  12, ...,  37,   2, 343]], dtype=int32)
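
To make the effect of `padding='pre'` concrete, here is a small pure-Python sketch of the same idea. `pad_pre` is a hypothetical stand-in, not the Keras function; it assumes a positive `maxlen` and, like the `pad_sequences` defaults, pads with zeros on the left and drops items from the front of sequences that are too long.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad (and left-truncate) each sequence to exactly maxlen items."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep only the last maxlen items
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded


toy_padded = pad_pre([[7, 3], [7, 3, 9], [7, 3, 9, 4]], maxlen=4)

for row in toy_padded:
    print(row)
```

Padding on the left keeps the most recent words at the end of every row, which is what lets the last column serve as the label in the next step.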

In [None]:
# Every column except the last is the context; the last column is the word to predict
X = padded_input_sequences[:,:-1]
Y = padded_input_sequences[:,-1]

In [None]:
from tensorflow.keras.utils import to_categorical
# One-hot encode the labels; use vocab_size instead of a hard-coded class count
Y = to_categorical(Y, num_classes=vocab_size)

In [None]:
Y.shape

(52, 344)
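
The split into context and label, followed by one-hot encoding, can be sketched without any library. The rows and the vocabulary size below are made-up toy values, and `one_hot` is a hypothetical helper that mimics what `to_categorical` does for a single label.

```python
demo_rows = [[0, 0, 1, 2], [0, 1, 2, 3], [1, 2, 3, 4]]  # toy padded sequences
demo_vocab_size = 5  # toy value: token IDs 0..4

# Context is everything but the last column; the label is the last column
demo_X = [row[:-1] for row in demo_rows]
demo_labels = [row[-1] for row in demo_rows]


def one_hot(label, num_classes):
    """Return a vector of zeros with a single 1.0 at position `label`."""
    vec = [0.0] * num_classes
    vec[label] = 1.0
    return vec


demo_Y = [one_hot(label, demo_vocab_size) for label in demo_labels]

print(demo_X)
print(demo_Y)
```

The resulting label matrix has one row per training sequence and one column per vocabulary index, which matches the `(52, 344)` shape printed above.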

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

In [None]:
# Embedding -> LSTM -> softmax over the vocabulary
model = Sequential()
# Each input holds max_len - 1 context tokens (the last token of a sequence is the label)
model.add(Embedding(input_dim=vocab_size, output_dim=100, input_length=max_len - 1))
model.add(LSTM(150))
# One output probability per vocabulary index
model.add(Dense(vocab_size, activation='softmax'))

In [None]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 52, 100)           34400     
                                                                 
 lstm_1 (LSTM)               (None, 150)               150600    
                                                                 
 dense_1 (Dense)             (None, 344)               51944     
                                                                 
Total params: 236944 (925.56 KB)
Trainable params: 236944 (925.56 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
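
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for an embedding table, an LSTM layer (four gates, each with input weights, recurrent weights, and a bias), and a dense layer; the variable names are illustrative and the sizes are read off the summary above.

```python
vocab = 344       # vocab_size reported by the notebook
embed_dim = 100   # Embedding output_dim
lstm_units = 150  # LSTM units

# Embedding: one embed_dim-sized vector per vocabulary index
embedding_params = vocab * embed_dim

# LSTM: 4 gates, each with (input + recurrent) weights plus a bias per unit
lstm_params = 4 * ((embed_dim + lstm_units) * lstm_units + lstm_units)

# Dense: weight matrix from the LSTM output to the vocabulary, plus a bias
dense_params = lstm_units * vocab + vocab

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

These values agree with the 34400, 150600, 51944, and 236944 shown in the summary.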


In [None]:
model.fit(X,Y,epochs=100)

Epoch 1/100
Epoch 2/100
...
Epoch 77/100
Epoch 78

<keras.src.callbacks.History at 0x7a72d7fd7df0>

In [None]:
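import numpy as np

# Use a separate name so the training corpus in `text` is not overwritten
seed_text = 'text'
for _ in range(6):  # greedily append six predicted words
  token_text = tokenizer.texts_to_sequences([seed_text])[0]
  # Pad to the same context length the model was trained on
  padded_token_text = pad_sequences([token_text], maxlen=max_len - 1, padding='pre')
  pos = int(np.argmax(model.predict(padded_token_text, verbose=0)))
  # index_word is the reverse of word_index; index 0 (padding) has no word
  word = tokenizer.index_word.get(pos)
  if word is not None:
    seed_text = seed_text + " " + word
    print(seed_text)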


text away
text away on
text away on your
text away on your devices
text away on your devices remember
text away on your devices remember that
