# Problem statement 

- Next word prediction / Sentence auto correct

The goal is to develop an efficient next-word prediction algorithm for natural language processing applications. By accurately suggesting the most probable next word from context, while taking grammar, semantics, and user preferences into account, such an algorithm can improve the user experience in text-based interfaces.

# Import Libraries

In [1]:
import csv
import nltk
import string
import pandas as pd
import numpy as np
import tensorflow as tf

from tensorflow.keras import backend as K
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, SimpleRNN, LSTM, Dense, Bidirectional, Dropout
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam

In [2]:
tokenizer = Tokenizer()

# Data Loading

In [3]:
df = pd.read_csv('ArticlesMarch2018.csv')

In [4]:
df

Unnamed: 0,articleID,byline,documentType,headline,keywords,multimedia,newDesk,printPage,pubDate,sectionName,snippet,source,typeOfMaterial,webURL,articleWordCount
0,5a974697410cf7000162e8a4,By BINYAMIN APPELBAUM,article,"Virtual Coins, Real Resources","['Bitcoin (Currency)', 'Electric Light and Pow...",1,Business,1,2018-03-01 00:17:22,Economy,America has a productivity problem. One explan...,The New York Times,News,https://www.nytimes.com/2018/02/28/business/ec...,1207
1,5a974be7410cf7000162e8af,By HELENE COOPER and ERIC SCHMITT,article,U.S. Advances Military Plans for North Korea,"['United States Defense and Military Forces', ...",1,Washington,11,2018-03-01 00:40:01,Asia Pacific,The American military is looking at everything...,The New York Times,News,https://www.nytimes.com/2018/02/28/world/asia/...,1215
2,5a9752a2410cf7000162e8ba,By THE EDITORIAL BOARD,article,Mr. Trump and the ‘Very Bad Judge’,"['Trump, Donald J', 'Curiel, Gonzalo P', 'Unit...",1,Editorial,26,2018-03-01 01:08:46,Unknown,Can you guess which man is the model public se...,The New York Times,Editorial,https://www.nytimes.com/2018/02/28/opinion/tru...,1043
3,5a975310410cf7000162e8bd,By JAVIER C. HERNÁNDEZ,article,"To Erase Dissent, China Bans Pooh Bear and ‘N’","['China', 'Xi Jinping', 'Term Limits (Politica...",1,Foreign,1,2018-03-01 01:10:35,Asia Pacific,Censors swung into action after Mr. Xi’s bid t...,The New York Times,News,https://www.nytimes.com/2018/02/28/world/asia/...,1315
4,5a975406410cf7000162e8c3,"By JESSE DRUCKER, KATE KELLY and BEN PROTESS",article,Loans Flowed to Kushner Cos. After Visits to t...,"['Kushner, Jared', 'Kushner Cos', 'United Stat...",1,Business,1,2018-03-01 01:14:41,Unknown,"Apollo, the private equity firm, and Citigroup...",The New York Times,News,https://www.nytimes.com/2018/02/28/business/ja...,1566
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1380,5ac1647647de81a90121adaa,By ROBERT LEONARD,article,Will Trump Start a Farm Crisis?,"['Agriculture and Farming', 'International Tra...",1,OpEd,23,2018-04-01 23:00:04,Unknown,Much of rural America will abandon the preside...,The New York Times,Op-Ed,https://www.nytimes.com/2018/04/01/opinion/tru...,849
1381,5ac1654f47de81a90121adb2,By DENENE MILLNER,article,A New Black American Dream,"['Blacks', 'Parenting', 'Careers and Professio...",1,OpEd,23,2018-04-01 23:03:39,Unknown,We all want our children to have a better life...,The New York Times,Op-Ed,https://www.nytimes.com/2018/04/01/opinion/ame...,865
1382,5ac1700447de81a90121ade2,By KATIE VAN SYCKLE,article,When a Subject Refuses to Pose,"['Photography', 'New York Times', 'Mattis, Jam...",1,Insider,2,2018-04-01 23:49:21,Unknown,"Mark Peterson, the photographer who worked on ...",The New York Times,News,https://www.nytimes.com/2018/04/01/insider/jim...,611
1383,5ac1720c47de81a90121adf0,By THE EDITORIAL BOARD,article,America Needs Better Privacy Rules,"['Privacy', 'Social Media', 'United States Pol...",1,Editorial,22,2018-04-01 23:58:02,Unknown,If we learn anything from the recent controver...,The New York Times,Editorial,https://www.nytimes.com/2018/04/01/opinion/fac...,862


In [1]:
# Shape of the data

In [5]:
df.shape

(1385, 15)

In [6]:
snippet = '\n'.join(df['snippet'])

In [7]:
corpus = snippet.lower().split('\n')
print(len(corpus))

1385


In [8]:
corpus[:2]

['america has a productivity problem. one explanation may be the growing use of real resources to make virtual products.',
 'the american military is looking at everything from troop rotations to surveillance to casualty evacuations should it be ordered to take action against north korea.']

In [9]:
tokenizer.fit_on_texts(corpus)  # build the vocabulary from the corpus
tokenizer.word_index  # dict mapping each word to its integer index
total_unique_words = len(tokenizer.word_index) + 1  # word_index starts at 1, so +1 accounts for index 0 (used for padding)
print("Total unique words count numbers:",total_unique_words)
print('-----------------------------------------------------------------------------------------------------------------------')
print(tokenizer.word_index)

Total unique words count numbers: 6863
-----------------------------------------------------------------------------------------------------------------------


In [10]:
seqs = tokenizer.texts_to_sequences([corpus[0]])  # convert each word to its index
print(corpus[0])
print(seqs)

america has a productivity problem. one explanation may be the growing use of real resources to make virtual products.
[[193, 14, 2, 2796, 699, 39, 1177, 63, 23, 1, 397, 194, 3, 158, 2797, 4, 105, 1646, 700]]


In [11]:
print(tokenizer.word_index['america'],tokenizer.word_index['has'],tokenizer.word_index['a'],
      tokenizer.word_index['productivity'],tokenizer.word_index['problem'])

193 14 2 2796 699
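
The indices above reflect how often each word occurs: the Keras `Tokenizer` ranks words by descending frequency and starts numbering at 1, which is why very common words such as "the" and "a" get small indices. The following is a minimal pure-Python sketch of that idea, not the library's implementation; `build_word_index` and the toy corpus are hypothetical, and the sketch skips the punctuation filtering that `Tokenizer` performs.

```python
from collections import Counter

def build_word_index(texts):
    """Hypothetical helper: rank words by frequency, most frequent first."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sorting is stable, so words with equal counts keep first-seen order.
    ranked = sorted(counts.items(), key=lambda kv: -kv[1])
    # Numbering starts at 1; index 0 stays free to act as the padding value.
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

toy_corpus = ["the cat sat", "the dog sat", "the end"]  # made-up example data
word_index = build_word_index(toy_corpus)
vocab_size = len(word_index) + 1  # +1 accounts for the reserved index 0

print(word_index)
print(vocab_size)
```

The `+ 1` mirrors the `total_unique_words` computation in the notebook: the output layer needs one slot per index, including the unused index 0.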


In [12]:
input_sequences = []
for line in corpus:
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)): 
        n_gram_seqs = token_list[:i+1]
        input_sequences.append(n_gram_seqs)
print(len(input_sequences))
print('-----------------------------------------------------------------------------------------------------------------------')
print(input_sequences)

26937
-----------------------------------------------------------------------------------------------------------------------
[[193, 14], [193, 14, 2], [193, 14, 2, 2796], [193, 14, 2, 2796, 699], [193, 14, 2, 2796, 699, 39], [193, 14, 2, 2796, 699, 39, 1177], [193, 14, 2, 2796, 699, 39, 1177, 63], [193, 14, 2, 2796, 699, 39, 1177, 63, 23], [193, 14, 2, 2796, 699, 39, 1177, 63, 23, 1], [193, 14, 2, 2796, 699, 39, 1177, 63, 23, 1, 397], [193, 14, 2, 2796, 699, 39, 1177, 63, 23, 1, 397, 194], [193, 14, 2, 2796, 699, 39, 1177, 63, 23, 1, 397, 194, 3], [193, 14, 2, 2796, 699, 39, 1177, 63, 23, 1, 397, 194, 3, 158], [193, 14, 2, 2796, 699, 39, 1177, 63, 23, 1, 397, 194, 3, 158, 2797], [193, 14, 2, 2796, 699, 39, 1177, 63, 23, 1, 397, 194, 3, 158, 2797, 4], [193, 14, 2, 2796, 699, 39, 1177, 63, 23, 1, 397, 194, 3, 158, 2797, 4, 105], [193, 14, 2, 2796, 699, 39, 1177, 63, 23, 1, 397, 194, 3, 158, 2797, 4, 105, 1646], [193, 14, 2, 2796, 699, 39, 1177, 63, 23, 1, 397, 194, 3, 158, 2797, 4, 105,
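
The loop above turns every snippet into all of its prefixes of length two or more, so a line with n tokens contributes n - 1 training sequences. A small standalone sketch, using the first five token ids of the first snippet and a hypothetical helper name `prefix_sequences`:

```python
def prefix_sequences(token_ids):
    """Return every prefix of length >= 2 from a list of token ids."""
    return [token_ids[: i + 1] for i in range(1, len(token_ids))]

# Token ids for "america has a productivity problem" (taken from the output above)
tokens = [193, 14, 2, 2796, 699]
sequences = prefix_sequences(tokens)

for seq in sequences:
    print(seq)
print(len(sequences))  # one sequence fewer than the number of tokens
```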

In [13]:
max_seq_length = max(len(x) for x in input_sequences)  # length of the longest n-gram sequence
input_seqs = np.array(pad_sequences(input_sequences, maxlen=max_seq_length, padding='pre'))  # left-pad with zeros so every sequence has the same length
print(max_seq_length)
print(input_seqs)

41
[[   0    0    0 ...    0  193   14]
 [   0    0    0 ...  193   14    2]
 [   0    0    0 ...   14    2 2796]
 ...
 [   0    0    0 ...  443  193  563]
 [   0    0    0 ...  193  563   89]
 [   0    0    0 ...  563   89 6862]]
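
Pre-padding places the zeros on the left, so the most recent words always sit at the end of each row, next to the label. The sketch below imitates that behaviour in plain Python for illustration only; `pad_pre` is a hypothetical stand-in for `pad_sequences(..., padding='pre')`, not the Keras code.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad each sequence to exactly maxlen items, dropping the oldest items if too long."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep only the last maxlen tokens
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

rows = pad_pre([[193, 14], [193, 14, 2], [193, 14, 2, 2796]], maxlen=5)
for row in rows:
    print(row)
```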


In [14]:
x_values, labels = input_seqs[:, :-1], input_seqs[:, -1]
y_values = tf.keras.utils.to_categorical(labels, num_classes=total_unique_words)
print(x_values[:3])
print(labels[:3])

[[  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0 193]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0 193  14]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0 193  14   2]]
[  14    2 2796]


Example:

- america has a productivity problem.

x_values => america | labels => has

x_values => america has | labels => a

x_values => america has a | labels => productivity

x_values => america has a productivity | labels => problem

In [15]:
x_values[0],np.argmax(y_values[0])

(array([  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
        193]),
 14)
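
The split and the one-hot encoding can be sketched without any library, which also makes the size of `y_values` easy to estimate. The helper `one_hot` and the toy rows are hypothetical; the sequence count and vocabulary size are the ones printed earlier in this notebook, and the memory figure assumes 4 bytes per value (float32), which is an assumption about the array's dtype.

```python
def one_hot(label, num_classes):
    """Return a list of zeros with a single 1.0 at position `label`."""
    vec = [0.0] * num_classes
    vec[label] = 1.0
    return vec

# Toy padded rows: the last item is the label, everything before it is the input.
padded = [[0, 0, 193, 14], [0, 193, 14, 2]]
x_rows = [row[:-1] for row in padded]
labels = [row[-1] for row in padded]
y_rows = [one_hot(label, num_classes=20) for label in labels]

# Rough size of the full one-hot matrix for this notebook's data.
num_sequences, vocab_size = 26937, 6863
elements = num_sequences * vocab_size
approx_gb = elements * 4 / 1e9  # assumes float32 (4 bytes per value)
print(elements, round(approx_gb, 2))
```

Since the dense one-hot matrix is large, it may be worth considering the `sparse_categorical_crossentropy` loss in Keras, which takes the integer `labels` directly.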

##### GloVe

In [16]:
path = 'glove.txt'
embeddings_index = {}
with open(path,encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word   = values[0]
        coeffs = np.array(values[1:], dtype='float32')
        embeddings_index[word] = coeffs

In [17]:
len(embeddings_index)

400000

In [18]:
dict(list(embeddings_index.items())[0:5])

{'the': array([ 4.1800e-01,  2.4968e-01, -4.1242e-01,  1.2170e-01,  3.4527e-01,
        -4.4457e-02, -4.9688e-01, -1.7862e-01, -6.6023e-04, -6.5660e-01,
         2.7843e-01, -1.4767e-01, -5.5677e-01,  1.4658e-01, -9.5095e-03,
         1.1658e-02,  1.0204e-01, -1.2792e-01, -8.4430e-01, -1.2181e-01,
        -1.6801e-02, -3.3279e-01, -1.5520e-01, -2.3131e-01, -1.9181e-01,
        -1.8823e+00, -7.6746e-01,  9.9051e-02, -4.2125e-01, -1.9526e-01,
         4.0071e+00, -1.8594e-01, -5.2287e-01, -3.1681e-01,  5.9213e-04,
         7.4449e-03,  1.7778e-01, -1.5897e-01,  1.2041e-02, -5.4223e-02,
        -2.9871e-01, -1.5749e-01, -3.4758e-01, -4.5637e-02, -4.4251e-01,
         1.8785e-01,  2.7849e-03, -1.8411e-01, -1.1514e-01, -7.8581e-01],
       dtype=float32),
 ',': array([ 0.013441,  0.23682 , -0.16899 ,  0.40951 ,  0.63812 ,  0.47709 ,
        -0.42852 , -0.55641 , -0.364   , -0.23938 ,  0.13001 , -0.063734,
        -0.39575 , -0.48162 ,  0.23291 ,  0.090201, -0.13324 ,  0.078639,
        -0.4

In [19]:
embeddings_matrix = np.zeros((total_unique_words, 50))  # one 50-dimensional row per vocabulary index
for word, i in tokenizer.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:  # words missing from GloVe keep an all-zero row
        embeddings_matrix[i] = embedding_vector

In [20]:
len(embeddings_matrix) 

6863

In [21]:
embeddings_matrix

array([[ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.41800001,  0.24968   , -0.41242   , ..., -0.18411   ,
        -0.11514   , -0.78580999],
       [ 0.21705   ,  0.46515   , -0.46757001, ..., -0.043782  ,
         0.41012999,  0.1796    ],
       ...,
       [ 1.05420005,  0.75160998, -0.42938   , ..., -0.44940001,
        -0.23126   , -1.1846    ],
       [ 0.42986   , -1.42449999, -0.08004   , ...,  1.16069996,
        -0.53895998,  0.18253   ],
       [-0.11311   , -0.10973   , -0.26403999, ...,  1.34500003,
         0.56571001,  0.92388999]])
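
Words that do not appear in the GloVe file keep an all-zero row in `embeddings_matrix`, so it can be useful to check how much of the vocabulary is actually covered. The sketch below shows one way to do that; `embedding_coverage` is a hypothetical helper, and the toy dictionaries contain made-up words and values rather than real GloVe data.

```python
def embedding_coverage(word_index, embeddings_index):
    """Split the vocabulary into words with and without a pretrained vector."""
    found = [w for w in word_index if w in embeddings_index]
    missing = [w for w in word_index if w not in embeddings_index]
    return found, missing

# Made-up stand-ins for tokenizer.word_index and the loaded GloVe dictionary.
toy_word_index = {"the": 1, "america": 2, "zzzunknown": 3}
toy_embeddings = {"the": [0.1, 0.2], "america": [0.3, -0.4]}

found, missing = embedding_coverage(toy_word_index, toy_embeddings)
coverage = len(found) / len(toy_word_index)
print(f"{coverage:.0%} of the vocabulary has a pretrained vector; missing: {missing}")
```

In the notebook, the same function could be called with `tokenizer.word_index` and `embeddings_index`.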

In [23]:
#glove_vect_df = pd.DataFrame(embeddings_matrix.T, columns=[''] + list(tokenizer.word_index.keys()))
#glove_vect_df

# RNN - RECURRENT NEURAL NETWORK

In [23]:
K.clear_session()

model = Sequential()

# Embedding layer (commented-out variant that loads the GloVe weights)
# model.add(Embedding(input_dim=total_unique_words, output_dim=50, \
#                    weights=[embeddings_matrix],input_length = 41 - 1, trainable=False))

# Embedding layer without the 'input_length' argument.
# Note: embeddings_matrix is not passed here, so the embeddings keep their
# random initial values and, with trainable=False, are never updated.
model.add(Embedding(input_dim=total_unique_words, output_dim=50, trainable=False))

# Bidirectional SimpleRNN layer with return_sequences=True
model.add(Bidirectional(SimpleRNN(256, return_sequences=True)))

# Dropout layer
model.add(Dropout(0.2))

# Bidirectional SimpleRNN layer without return_sequences (last recurrent layer)
model.add(Bidirectional(SimpleRNN(256)))

# Dropout layer
model.add(Dropout(0.2))

# Dense layer with 128 units and 'relu' activation
model.add(Dense(128, activation='relu'))

# Dense layer with total_unique_words units and 'softmax' activation
model.add(Dense(total_unique_words, activation='softmax'))

# Compile the model
model.compile(optimizer=Adam(learning_rate=0.001), loss='categorical_crossentropy', metrics=['accuracy'])

In [45]:
history = model.fit(x_values, y_values, epochs=10, validation_split=0.2, verbose=1, batch_size=256)

Epoch 1/10
85/85 ━━━━━━━━━━━━━━━━━━━━ 4s 49ms/step - accuracy: 0.2700 - loss: 3.4047 - val_accuracy: 0.0392 - val_loss: 14.8241
Epoch 2/10
85/85 ━━━━━━━━━━━━━━━━━━━━ 4s 43ms/step - accuracy: 0.2735 - loss: 3.4053 - val_accuracy: 0.0414 - val_loss: 14.8871
Epoch 3/10
85/85 ━━━━━━━━━━━━━━━━━━━━ 4s 42ms/step - accuracy: 0.2626 - loss: 3.4238 - val_accuracy: 0.0412 - val_loss: 14.8561
Epoch 4/10
85/85 ━━━━━━━━━━━━━━━━━━━━ 4s 43ms/step - accuracy: 0.2690 - loss: 3.3569 - val_accuracy: 0.0401 - val_loss: 14.8676
Epoch 5/10
85/85 ━━━━━━━━━━━━━━━━━━━━ 4s 42ms/step - accuracy: 0.2707 - loss: 3.3921 - val_accuracy: 0.0416 - val_loss: 14.9038
Epoch 6/10
85/85 ━━━━━━━━━━━━━━━━━━━━ 4s 42ms/step - accuracy: 0.2731 - loss: 3.3674 - val_accuracy: 0.0408 - val_loss: 14.8861
Epoch 7/10
85/85 ...

##### Save the trained model
model_path = r"C:\Users\kumar\Data science\8.NLP\4.NLP Class Workouts\.ipynb_checkpoints.h5"
model.save(model_path)

from tensorflow.keras.models import load_model

##### Load the saved model from the same path
loaded_model = load_model(model_path)
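
To put the validation loss from the training log above in context, it helps to compare it with a trivial baseline. A model that assigns the same probability to every one of the V vocabulary entries has a categorical cross-entropy of ln(V), because the loss for each example is -ln(1/V). The sketch below computes that baseline for this notebook's vocabulary size and compares it with the first logged validation loss; it is a back-of-the-envelope check, not part of the original pipeline.

```python
import math

vocab_size = 6863      # total_unique_words printed earlier in the notebook
val_loss = 14.8241     # validation loss logged after epoch 1 of the SimpleRNN run

# Cross-entropy of a uniform guess over the whole vocabulary (natural log).
uniform_loss = math.log(vocab_size)

print(f"uniform-guess loss: {uniform_loss:.2f}")
print(f"logged val_loss is above the uniform baseline: {val_loss > uniform_loss}")
```

A validation loss above this baseline suggests the model is confidently wrong on held-out sequences, which is consistent with the gap between training and validation accuracy in the log.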

# Model Summary

In [40]:
model.summary()

In [41]:
def prediction(seed_text, next_words):
    for _ in range(next_words):
        token_list  = tokenizer.texts_to_sequences([seed_text])[0]
        token_list  = pad_sequences([token_list], maxlen=max_seq_length-1, padding='pre')
        predicted   = np.argmax(model.predict(token_list, verbose=0), axis=-1)
        output_word = tokenizer.sequences_to_texts([[predicted[0]]])
        print(output_word)
        seed_text += ' '+output_word[0]
    print(seed_text)

In [42]:
input_text = "america has a productivity problem."
remaining_part = "one explanation may be the growing use of real resources to make virtual products."
# next_words = len(remaining_part.split())
next_words = 14
prediction(input_text, next_words)

['into']
['sophistication']
['in']
['when']
['the']
['1960s']
['it']
['fought']
['to']
['make']
['a']
['land']
['slides']
['can']
america has a productivity problem. into sophistication in when the 1960s it fought to make a land slides can
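
The `prediction` function decodes greedily: at every step it keeps only the single most probable word. For an autocomplete-style interface it may be more useful to show several candidates. The sketch below ranks the top k entries of a probability vector; `top_k_words` is a hypothetical helper, and the probabilities and the index-to-word mapping are made up for illustration (in the notebook they would come from `model.predict` and `tokenizer.index_word`).

```python
import heapq

def top_k_words(probabilities, index_word, k=3):
    """Return up to k (word, probability) pairs, most probable first."""
    best = heapq.nlargest(k, range(len(probabilities)), key=probabilities.__getitem__)
    # Index 0 is the padding slot and has no word, so it is skipped if selected.
    return [(index_word[i], probabilities[i]) for i in best if i in index_word]

toy_index_word = {1: "the", 2: "a", 3: "of", 4: "to"}  # made-up mapping
toy_probs = [0.05, 0.40, 0.10, 0.15, 0.30]             # position 0 is padding

suggestions = top_k_words(toy_probs, toy_index_word, k=3)
for word, prob in suggestions:
    print(f"{word}: {prob:.2f}")
```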


# LSTM (Long Short-Term Memory) - Model

In [24]:
K.clear_session()

model = Sequential()

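# Embedding layer (commented-out variant that loads the GloVe weights)
# model.add(Embedding(input_dim=total_unique_words, output_dim=50, \
#                     weights=[embeddings_matrix],input_length = 41 - 1, trainable=False))

# Embedding layer without the 'input_length' argument.
# Note: embeddings_matrix is not passed here, so the embeddings keep their
# random initial values and, with trainable=False, are never updated.
model.add(Embedding(input_dim=total_unique_words, output_dim=50, trainable=False))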

# Bidirectional LSTM layer with return_sequences=True
model.add(Bidirectional(LSTM(8, return_sequences=True)))

# Dropout layer
model.add(Dropout(0.2))

# Bidirectional LSTM layer without return_sequences (last LSTM layer)
model.add(Bidirectional(LSTM(8)))

# Dropout layer
model.add(Dropout(0.2))

# Dense layer with 128 units and 'relu' activation
model.add(Dense(128, activation='relu'))

# Dense layer with total_unique_words units and 'softmax' activation
model.add(Dense(total_unique_words, activation='softmax'))

# Compile the model
model.compile(optimizer=Adam(learning_rate=0.001), loss='categorical_crossentropy', metrics=['accuracy'])




In [26]:
history = model.fit(x_values, y_values, epochs=20, validation_split=0.2, verbose=1, batch_size=256)

Epoch 1/20
85/85 ━━━━━━━━━━━━━━━━━━━━ 6s 75ms/step - accuracy: 0.0744 - loss: 5.9526 - val_accuracy: 0.0488 - val_loss: 9.7516
Epoch 2/20
85/85 ━━━━━━━━━━━━━━━━━━━━ 5s 59ms/step - accuracy: 0.0719 - loss: 5.9170 - val_accuracy: 0.0512 - val_loss: 9.8954
Epoch 3/20
85/85 ━━━━━━━━━━━━━━━━━━━━ 5s 57ms/step - accuracy: 0.0753 - loss: 5.8365 - val_accuracy: 0.0471 - val_loss: 10.0542
Epoch 4/20
85/85 ━━━━━━━━━━━━━━━━━━━━ 5s 55ms/step - accuracy: 0.0730 - loss: 5.8048 - val_accuracy: 0.0479 - val_loss: 10.1464
Epoch 5/20
85/85 ━━━━━━━━━━━━━━━━━━━━ 5s 61ms/step - accuracy: 0.0750 - loss: 5.7790 - val_accuracy: 0.0453 - val_loss: 10.3539
Epoch 6/20
85/85 ━━━━━━━━━━━━━━━━━━━━ 5s 54ms/step - accuracy: 0.0761 - loss: 5.7328 - val_accuracy: 0.0475 - val_loss: 10.4732
Epoch 7/20
85/85 ...

# Model Summary

In [27]:
model.summary()

In [28]:
def prediction(seed_text, next_words):
    for _ in range(next_words):
        token_list  = tokenizer.texts_to_sequences([seed_text])[0]
        token_list  = pad_sequences([token_list], maxlen=max_seq_length-1, padding='pre')
        predicted   = np.argmax(model.predict(token_list, verbose=0), axis=-1)
        output_word = tokenizer.sequences_to_texts([[predicted[0]]])
        print(output_word)
        seed_text += ' '+output_word[0]
    print(seed_text)

##### Testing

In [29]:
input_text = "america has a productivity problem."
remaining_part = "one explanation may be the growing use of real resources to make virtual products."
# next_words = len(remaining_part.split())
next_words = 14
prediction(input_text, next_words)

['hawking']
['has']
['make']
['have']
['be']
['make']
['a']
['own']
['tariff']
['violence']
['has']
['end']
['the']
['united']
america has a productivity problem. hawking has make have be make a own tariff violence has end the united
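
Looking back at the LSTM training log, the validation loss rises in every epoch shown while the training loss falls slowly, which is the usual pattern that early stopping is meant to catch. Keras provides an `EarlyStopping` callback for this; the pure-Python function below is only a hypothetical illustration of the patience rule, applied to the six validation losses visible in the log.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (stop_epoch, best_epoch), both 1-based.

    Training would stop at the first epoch that is `patience` epochs past the
    best validation loss; stop_epoch is None if that never happens.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return None, best_epoch

# val_loss values from epochs 1-6 of the LSTM run above
lstm_val_losses = [9.7516, 9.8954, 10.0542, 10.1464, 10.3539, 10.4732]
stop_epoch, best_epoch = early_stopping_epoch(lstm_val_losses, patience=2)
print(f"best epoch: {best_epoch}, would stop at epoch: {stop_epoch}")
```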


# Conclusion

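An effective next-word prediction system can improve user engagement, streamline text-based interactions, and make text entry more efficient by offering accurate, contextually relevant word suggestions. The two models trained here do not yet reach that level: in the epochs shown, validation accuracy stays around 4% for the bidirectional SimpleRNN and around 5% for the bidirectional LSTM, validation loss is well above training loss, and the text generated from the seed sentence does not match the original continuation.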