<a href="https://colab.research.google.com/github/SHEHAN-120/next-word-prediction-lstm/blob/main/nlp_text_generator_keras.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
text="""Data plays a vital role in our everyday life.
Directly or indirectly, for daily life decisions, we depend on some data, be it choosing a novel to read from a list of books,buying a thing after considering the budget, and so on.
Have you ever imagined searching for something on Google or Yahoo generates a lot of data?
This data is essential to analyze user experiences.
Getting recommendations on various e-commerce websites after buying a product and tracking parcels during delivery are part of Data Analytics which involves analyzing the raw data to make informed decisions.
But this raw data does not help make decisions if it has some redundancy, inconsistency, or inaccuracy.
Therefore, this data needs to be cleaned before considering for analysis."""

In [2]:
import tensorflow as tf
import numpy as np

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding,LSTM,Dense


In [3]:
tokenizer=Tokenizer()

In [4]:
tokenizer.fit_on_texts([text])

In [5]:
len(tokenizer.word_index)

87
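
The 87 above is the number of distinct words in the text. Ids start at 1 because id 0 is kept free for padding. As a rough illustration of how such an index can be built, the sketch below lowercases the text, replaces punctuation with spaces, and ranks words by frequency. `build_word_index` is a hypothetical helper written for this note, not the Keras implementation, and its tie-breaking rule is an assumption.

```python
from collections import Counter
import string

def build_word_index(text):
    """Map each word to a 1-based id, most frequent word first (illustrative only)."""
    # Replace punctuation with spaces so "books,buying" splits into two words.
    table = str.maketrans({ch: " " for ch in string.punctuation.replace("'", "")})
    words = text.lower().translate(table).split()
    counts = Counter(words)
    # most_common() orders by count; words with equal counts keep first-seen order.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

sample = "Data plays a vital role. This data is essential, e-commerce data too."
word_index = build_word_index(sample)
print(word_index["data"], len(word_index))  # 1 11
```

Note that the hyphen in "e-commerce" is treated as a separator, which matches the separate `'e'` and `'commerce'` entries in the notebook's word index.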

In [6]:
for sentence in text.split('\n'):
  print(sentence)

Data plays a vital role in our everyday life. 
Directly or indirectly, for daily life decisions, we depend on some data, be it choosing a novel to read from a list of books,buying a thing after considering the budget, and so on. 
Have you ever imagined searching for something on Google or Yahoo generates a lot of data? 
This data is essential to analyze user experiences. 
Getting recommendations on various e-commerce websites after buying a product and tracking parcels during delivery are part of Data Analytics which involves analyzing the raw data to make informed decisions. 
But this raw data does not help make decisions if it has some redundancy, inconsistency, or inaccuracy. 
Therefore, this data needs to be cleaned before considering for analysis.


In [7]:
for sentence in text.split('\n'):
  print(tokenizer.texts_to_sequences([sentence])[0])

[1, 21, 2, 22, 23, 24, 25, 26, 10]
[27, 5, 28, 6, 29, 10, 7, 30, 31, 3, 11, 1, 12, 13, 32, 2, 33, 4, 34, 35, 2, 36, 8, 37, 14, 2, 38, 15, 16, 17, 39, 18, 40, 3]
[41, 42, 43, 44, 45, 6, 46, 3, 47, 5, 48, 49, 2, 50, 8, 1]
[9, 1, 51, 52, 4, 53, 54, 55]
[56, 57, 3, 58, 59, 60, 61, 15, 14, 2, 62, 18, 63, 64, 65, 66, 67, 68, 8, 1, 69, 70, 71, 72, 17, 19, 1, 4, 20, 73, 7]
[74, 9, 19, 1, 75, 76, 77, 20, 7, 78, 13, 79, 11, 80, 81, 5, 82]
[83, 9, 1, 84, 4, 12, 85, 86, 16, 6, 87]


In [8]:
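input_sequence=[]
for sentence in text.split('\n'):
  # Convert the sentence to its list of word ids
  tokenized_sentence=tokenizer.texts_to_sequences([sentence])[0]

  # Collect every prefix of length >= 2; the last id of each prefix becomes the target word
  for i in range(1,len(tokenized_sentence)):
    input_sequence.append(tokenized_sentence[:i+1])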

In [9]:
input_sequence

[[1, 21],
 [1, 21, 2],
 [1, 21, 2, 22],
 [1, 21, 2, 22, 23],
 [1, 21, 2, 22, 23, 24],
 [1, 21, 2, 22, 23, 24, 25],
 [1, 21, 2, 22, 23, 24, 25, 26],
 [1, 21, 2, 22, 23, 24, 25, 26, 10],
 [27, 5],
 [27, 5, 28],
 [27, 5, 28, 6],
 [27, 5, 28, 6, 29],
 [27, 5, 28, 6, 29, 10],
 [27, 5, 28, 6, 29, 10, 7],
 [27, 5, 28, 6, 29, 10, 7, 30],
 [27, 5, 28, 6, 29, 10, 7, 30, 31],
 [27, 5, 28, 6, 29, 10, 7, 30, 31, 3],
 [27, 5, 28, 6, 29, 10, 7, 30, 31, 3, 11],
 [27, 5, 28, 6, 29, 10, 7, 30, 31, 3, 11, 1],
 [27, 5, 28, 6, 29, 10, 7, 30, 31, 3, 11, 1, 12],
 [27, 5, 28, 6, 29, 10, 7, 30, 31, 3, 11, 1, 12, 13],
 [27, 5, 28, 6, 29, 10, 7, 30, 31, 3, 11, 1, 12, 13, 32],
 [27, 5, 28, 6, 29, 10, 7, 30, 31, 3, 11, 1, 12, 13, 32, 2],
 [27, 5, 28, 6, 29, 10, 7, 30, 31, 3, 11, 1, 12, 13, 32, 2, 33],
 [27, 5, 28, 6, 29, 10, 7, 30, 31, 3, 11, 1, 12, 13, 32, 2, 33, 4],
 [27, 5, 28, 6, 29, 10, 7, 30, 31, 3, 11, 1, 12, 13, 32, 2, 33, 4, 34],
 [27, 5, 28, 6, 29, 10, 7, 30, 31, 3, 11, 1, 12, 13, 32, 2, 33, 4, 34, 35],
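
Each sentence with n tokens contributes n - 1 growing prefixes, and the last id of every prefix is the word the model will learn to predict. A minimal sketch of that step on the first sentence alone, where `prefix_sequences` is a hypothetical helper name:

```python
def prefix_sequences(token_ids):
    """Return every prefix of length >= 2; the last id of each prefix is the target."""
    return [token_ids[:i + 1] for i in range(1, len(token_ids))]

# Ids of "Data plays a vital role in our everyday life." from the output above.
first_sentence = [1, 21, 2, 22, 23, 24, 25, 26, 10]
prefixes = prefix_sequences(first_sentence)

print(len(prefixes))  # 8 prefixes from a 9-token sentence
print(prefixes[0])    # [1, 21]
print(prefixes[-1])   # the full sentence
```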


In [10]:
max_len=max([len(x) for x in input_sequence])

In [11]:
max_len

34

In [12]:
padded_input_sequences=pad_sequences(input_sequence,maxlen=max_len,padding='pre')

In [13]:
padded_input_sequences

array([[ 0,  0,  0, ...,  0,  1, 21],
       [ 0,  0,  0, ...,  1, 21,  2],
       [ 0,  0,  0, ..., 21,  2, 22],
       ...,
       [ 0,  0,  0, ..., 85, 86, 16],
       [ 0,  0,  0, ..., 86, 16,  6],
       [ 0,  0,  0, ..., 16,  6, 87]], dtype=int32)
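
Padding on the left (`padding='pre'`) keeps the target word in the last column of every row, which is what makes the later column slicing work. The sketch below reproduces the idea in plain Python. `pad_pre` is a hypothetical stand-in for `pad_sequences`, and a width of 5 is used instead of 34 to keep the output readable.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad each list with `value` up to `maxlen`; longer lists keep their last `maxlen` items."""
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

rows = pad_pre([[1, 21], [1, 21, 2], [1, 21, 2, 22]], maxlen=5)
for row in rows:
    print(row)
# [0, 0, 0, 1, 21]
# [0, 0, 1, 21, 2]
# [0, 1, 21, 2, 22]
```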

In [14]:
# All columns except the last are the context; the last column is the word to predict
X=padded_input_sequences[:,:-1]
y=padded_input_sequences[:,-1]

In [15]:
X

array([[ 0,  0,  0, ...,  0,  0,  1],
       [ 0,  0,  0, ...,  0,  1, 21],
       [ 0,  0,  0, ...,  1, 21,  2],
       ...,
       [ 0,  0,  0, ..., 12, 85, 86],
       [ 0,  0,  0, ..., 85, 86, 16],
       [ 0,  0,  0, ..., 86, 16,  6]], dtype=int32)

In [16]:
y

array([21,  2, 22, 23, 24, 25, 26, 10,  5, 28,  6, 29, 10,  7, 30, 31,  3,
       11,  1, 12, 13, 32,  2, 33,  4, 34, 35,  2, 36,  8, 37, 14,  2, 38,
       15, 16, 17, 39, 18, 40,  3, 42, 43, 44, 45,  6, 46,  3, 47,  5, 48,
       49,  2, 50,  8,  1,  1, 51, 52,  4, 53, 54, 55, 57,  3, 58, 59, 60,
       61, 15, 14,  2, 62, 18, 63, 64, 65, 66, 67, 68,  8,  1, 69, 70, 71,
       72, 17, 19,  1,  4, 20, 73,  7,  9, 19,  1, 75, 76, 77, 20,  7, 78,
       13, 79, 11, 80, 81,  5, 82,  9,  1, 84,  4, 12, 85, 86, 16,  6, 87],
      dtype=int32)

In [17]:
tokenizer.word_index

{'data': 1,
 'a': 2,
 'on': 3,
 'to': 4,
 'or': 5,
 'for': 6,
 'decisions': 7,
 'of': 8,
 'this': 9,
 'life': 10,
 'some': 11,
 'be': 12,
 'it': 13,
 'buying': 14,
 'after': 15,
 'considering': 16,
 'the': 17,
 'and': 18,
 'raw': 19,
 'make': 20,
 'plays': 21,
 'vital': 22,
 'role': 23,
 'in': 24,
 'our': 25,
 'everyday': 26,
 'directly': 27,
 'indirectly': 28,
 'daily': 29,
 'we': 30,
 'depend': 31,
 'choosing': 32,
 'novel': 33,
 'read': 34,
 'from': 35,
 'list': 36,
 'books': 37,
 'thing': 38,
 'budget': 39,
 'so': 40,
 'have': 41,
 'you': 42,
 'ever': 43,
 'imagined': 44,
 'searching': 45,
 'something': 46,
 'google': 47,
 'yahoo': 48,
 'generates': 49,
 'lot': 50,
 'is': 51,
 'essential': 52,
 'analyze': 53,
 'user': 54,
 'experiences': 55,
 'getting': 56,
 'recommendations': 57,
 'various': 58,
 'e': 59,
 'commerce': 60,
 'websites': 61,
 'product': 62,
 'tracking': 63,
 'parcels': 64,
 'during': 65,
 'delivery': 66,
 'are': 67,
 'part': 68,
 'analytics': 69,
 'which': 70,
 'invo

In [18]:
# 87 words plus the reserved padding id 0 give 88 classes
y=to_categorical(y,num_classes=len(tokenizer.word_index)+1)

In [19]:
y

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 1.]])

In [20]:
y.shape

(119, 88)

In [21]:
X.shape

(119, 33)
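
Both shapes can be checked by hand from the sentence lengths printed in the In [7] output. The short calculation below is only that arithmetic, with the lengths typed in as constants:

```python
# Token counts of the seven sentences, read off the In [7] output.
sentence_lengths = [9, 34, 16, 8, 31, 17, 11]

n_samples = sum(n - 1 for n in sentence_lengths)  # each sentence gives n - 1 prefixes
max_len = max(sentence_lengths)                   # the longest prefix is the longest sentence
n_features = max_len - 1                          # the last column is split off as the target
n_classes = 87 + 1                                # 87 words plus the padding id 0

print((n_samples, n_features))  # (119, 33)
print((n_samples, n_classes))   # (119, 88)
```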

## Model Building

In [22]:
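vocab_size = len(tokenizer.word_index) + 1  # 87 words + padding id 0 = 88
model = Sequential()
model.add(tf.keras.Input(shape=(max_len - 1,)))     # 33 context tokens per sample
model.add(Embedding(vocab_size, 100))               # each word id -> 100-dim vector
model.add(LSTM(150))
model.add(Dense(vocab_size, activation='softmax'))  # probability for every word id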



In [23]:
model.compile(loss='categorical_crossentropy',optimizer='adam',metrics=['accuracy'])

In [24]:
model.summary()
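
The summary output is not shown here, but the parameter counts can be worked out by hand. The calculation below assumes the default layer settings (bias terms enabled, the standard four-gate LSTM) and is a cross-check rather than the recorded output.

```python
vocab_size, embed_dim, lstm_units = 88, 100, 150

embedding_params = vocab_size * embed_dim
# Four gates, each with input weights, recurrent weights, and a bias vector.
lstm_params = 4 * ((embed_dim + lstm_units) * lstm_units + lstm_units)
dense_params = lstm_units * vocab_size + vocab_size

total = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total)  # 8800 150600 13288 172688
```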

In [25]:
X.shape

(119, 33)

In [26]:
y.shape

(119, 88)

In [27]:
model.fit(X,y,epochs=100)

Epoch 1/100
4/4 ━━━━━━━━━━━━━━━━━━━━ 3s 63ms/step - accuracy: 0.0259 - loss: 4.4761
Epoch 2/100
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 56ms/step - accuracy: 0.1055 - loss: 4.4578
Epoch 3/100
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 57ms/step - accuracy: 0.0644 - loss: 4.4244
Epoch 4/100
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 57ms/step - accuracy: 0.0496 - loss: 4.3450
Epoch 5/100
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 58ms/step - accuracy: 0.0592 - loss: 4.3008
Epoch 6/100
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 57ms/step - accuracy: 0.0717 - loss: 4.2312
Epoch 7/100
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 63ms/step - accuracy: 0.0644 - loss: 4.1676
Epoch 8/100
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 60ms/step - accuracy: 0.0696 - loss: 4.1356
Epoch 9/100
4/4 ━━━━━━━━━━━━━━━━━━━━ ...

<keras.src.callbacks.history.History at 0x785713521850>
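
A quick sanity check on the log: before any learning, a model that spreads its probability evenly over all 88 classes has a cross-entropy of ln(88), so the first-epoch loss should start close to that value. The snippet below only computes that baseline.

```python
import math

n_classes = 88
uniform_loss = -math.log(1 / n_classes)  # cross-entropy of a uniform guess
print(round(uniform_loss, 4))            # 4.4773
```

The loss of 4.4761 reported for epoch 1 sits right at this baseline, which suggests the network starts from a near-uniform prediction and improves from there.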

## Testing the model

In [28]:
text_1="Data"

token_text=tokenizer.texts_to_sequences([text_1])[0]
padded_text=pad_sequences([token_text],maxlen=33,padding='pre')
model.predict(padded_text)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 217ms/step


array([[1.03822858e-05, 8.70685373e-03, 2.14119423e-02, 1.29598321e-03,
        4.77626594e-03, 4.34044562e-02, 2.48853699e-03, 4.06207429e-04,
        1.10075453e-04, 4.34377268e-02, 7.81444483e-04, 2.90784646e-05,
        3.27301421e-03, 1.89945604e-05, 4.03067417e-04, 2.08943762e-04,
        2.44686526e-04, 3.65571672e-04, 3.76881871e-05, 1.55683113e-02,
        5.72925142e-04, 5.39237857e-01, 7.59400008e-03, 1.28992950e-03,
        6.80851925e-04, 9.59064637e-05, 1.08057022e-04, 1.69702980e-05,
        2.01133620e-02, 7.46663602e-04, 4.99130751e-04, 7.16366107e-04,
        1.10351146e-04, 7.98654437e-05, 8.02086215e-05, 5.03056472e-05,
        3.33509925e-05, 1.03940212e-04, 5.05630669e-05, 8.21109497e-05,
        2.74923168e-05, 8.61311310e-06, 4.99895327e-02, 2.02346966e-02,
        4.88597574e-03, 2.27926043e-03, 3.76853859e-04, 9.07368012e-05,
        1.74072647e-05, 5.33805978e-05, 1.83913493e-04, 1.26479015e-01,
        7.01007945e-03, 2.49033299e-04, 1.31269626e-04, 3.499199

In [29]:
pos=np.argmax(model.predict(padded_text))

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 41ms/step


In [31]:
for word,index in tokenizer.word_index.items():
  if index==pos:
    print(word)

plays
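
Scanning the whole dictionary for every prediction works, but an inverted dictionary makes each lookup a single access. The sketch below uses a handful of entries copied from `tokenizer.word_index`. The fitted Keras `Tokenizer` also exposes an `index_word` attribute that should serve the same purpose, though it is not used in this notebook.

```python
# A few entries copied from tokenizer.word_index.
word_index = {"data": 1, "a": 2, "plays": 21, "vital": 22, "role": 23}

# Invert once, then every lookup is a single dictionary access.
index_word = {index: word for word, index in word_index.items()}

pos = 21  # the argmax for the prompt "Data", which the loop above prints as "plays"
print(index_word.get(pos, "(padding or unknown)"))  # plays
```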


In [32]:
text_2="Data is a vital"

token_text_2=tokenizer.texts_to_sequences([text_2])[0]
padded_text_2=pad_sequences([token_text_2],maxlen=33,padding='pre')
model.predict(padded_text_2)

pos_2=np.argmax(model.predict(padded_text_2))

for word,index in tokenizer.word_index.items():
  if index==pos_2:
    print(word)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 54ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 43ms/step
role
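
The two tests above predict one word at a time. To generate several words, the predicted id can be appended to the prompt and the step repeated. The sketch below shows that greedy loop in isolation. `generate` and `toy_predict` are hypothetical names, and the lookup table is a stand-in for the trained model so the example runs on its own.

```python
def generate(prompt_ids, predict_next, n_words):
    """Greedily extend `prompt_ids` by calling `predict_next` once per new word."""
    ids = list(prompt_ids)
    for _ in range(n_words):
        next_id = predict_next(ids)
        if next_id == 0:  # id 0 is padding, so there is nothing sensible to append
            break
        ids.append(next_id)
    return ids

# Stand-in predictor (an assumption for illustration): it looks up the last id in a
# tiny table built from the first sentence instead of calling the trained model.
toy_table = {1: 21, 21: 2, 2: 22, 22: 23}

def toy_predict(ids):
    return toy_table.get(ids[-1], 0)

print(generate([1], toy_predict, n_words=3))  # [1, 21, 2, 22]
```

With the notebook's model, `predict_next` would pad the ids to length 33 with `pad_sequences` and return `np.argmax` of `model.predict`, as in the cells above.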
