<a href="https://colab.research.google.com/github/Daksh024/HindiNext/blob/Colab/next_word_prediction_bi_lstm_tutorial_easy_way.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Table of Contents
------------------

- [Introduction](#intro)
- [Import libraries and packages](#ilp)
- [Dataset Information](#di)
- [Separate 'Title' field and preprocess it](#preprocess)
    - [Removing unwanted characters and words](#remv)
    - [Tokenization and word_index (vocabulary)](#token)
    - [Convert titles into sequences and Make n_gram model](#ngram)
    - [Make all titles with same length and padding them](#pad)
- [Prepare features (X) and labels (Y)](#xy)
- [Architecture of Bidirectional LSTM neural network](#blstm)
- [Train Bi-LSTM neural network](#train)
- [Plotting accuracy and loss graph](#acc)
- [Predict new title (Testing)](#new)

----------------




<a name="intro"></a>

# Introduction

**Next Word Prediction (also called Language Modeling) is the task of predicting what word comes next. It is one of the fundamental tasks of NLP.**

Image reference: https://medium.com/@antonio.lopardo/the-basics-of-language-modeling-1c8832f21079

![gg.png](attachment:426089b0-5844-4928-a797-40e0015c1a93.png)

#### Applications of Language Modelling

**1) Mobile keyboard text recommendation**

![fff.jpg](attachment:0cd813a1-ea03-40b9-86d7-0585d994a36e.jpg)

**2) Whenever we search for something on a search engine, we get many suggestions, and as we type more words, the recommendations improve according to the context of our search. So, how does this happen?**

![Screenshot (21).png](attachment:72ee772e-4ef9-4e79-a364-5dcf8f558e4a.png)


This is possible through natural language processing (NLP) techniques. Here, we will use NLP to build a prediction model based on a Bidirectional LSTM (Long Short-Term Memory) network that predicts the next words of a sentence.


<a name="ilp"></a>
# Import necessary libraries and packages

In [2]:
import pandas as pd
import os
import numpy as np

import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, LSTM, Dense, Bidirectional
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam

<a name="di"></a>
# Dataset information

**Import Medium-articles-dataset:**

This dataset contains information about randomly chosen Medium articles published in 2019 from these 7 publications:

- Towards Data Science
- UX Collective
- The Startup
- The Writing Cooperative
- Data Driven Investor
- Better Humans
- Better Marketing


Here, we have **10 different fields and 6508 records**, but we will use only the **title field** for predicting the next word.

In [3]:
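# Read the corpus with an explicit UTF-8 encoding (the text is Devanagari) and close the file afterwards
with open("/content/drive/MyDrive/Colab Notebooks/Project/hi_big78.txt", 'r', encoding='utf-8') as file:
    sentences = file.readlines()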


In [4]:
# Remove the danda ("।") sentence delimiter from every line of the corpus
text_corpus = [sentence.replace("।", "") for sentence in sentences]

# Trailing newlines are left in place; the Tokenizer's default filters strip them later

text_corpus


['झारखंड के धनबाद में छठ पर्व में अर्ध्य के दौरान तालाब में डूबने से एक व्यक्ति मौत हो गई बताया जा रहा कि छठ पर्व के दौरान सुदामडीह थाना क्षेत्र के परघाबाद तालाब पर अर्ध्य देने के दौरान ये हादसा हुआ हादसे में मृतक की शिनाख्त लक्ष्मी नारायण दास के रूप में हुई जो बीसीसीएल के सुदामडीह इंक्लाइंड में कार् \n',
 ' \n',
 'मौसम विभाग का अनुमान है कि 9 से 12 जुलाई के बीच बिहार में बहुत ज्यादा बारिश हो सकती है वहीं उत्तराखंड और हिमाचल प्रदेश में भी 10 से 12 जुलाई को भारी बरसात होगी इसके लिए पहले से आगाह कर दिया गया है 9 जुलाई तक के आंकड़ों के मुताबिक, पूरे भारत की बात करें तो वर्षा की कमी में 2 प्रतिशत की गिरावट आ चुकी है 9 जुलाई तक उत्तर भारत में अनुमान से 31 प्रतिशत ज्यादा बारिश हुई लेकिन मध्य भारत, दक्षिण भारत, पूर्वी भारत और नॉर्थ ईस्ट में 7 से 9 प्रतिशत की कमी है \n',
 ' \n',
 'दही या चावल का सेव \n',
 ' \n',
 'सबसे खास बात है कि इस बार गरीब कल्याण अन्न योजना के तहत उन लोगों को भी मुफ्त अनाज मिलेगा, जिनके पास राशन कार्ड नहीं है योजना का लाभ लेने के लिए उन्हें आधार कार्ड के जरिए सिर्फ रजिस्ट

In [5]:
tokenizer = Tokenizer(oov_token='<oov>') # For those words which are not found in word_index
tokenizer.fit_on_texts(text_corpus)
total_words = len(tokenizer.word_index) + 1

print("Total number of words: ", total_words)
print("Word: ID")
print("------------")
print("<oov>: ", tokenizer.word_index['<oov>'])
print("जीडीपी: ", tokenizer.word_index['जीडीपी'])
print("जान: ", tokenizer.word_index['जान'])
print("विकसित: ", tokenizer.word_index['विकसित'])

Total number of words:  88362
Word: ID
------------
<oov>:  1
जीडीपी:  4428
जान:  601
विकसित:  1686


<a name="ngram"></a>
#### Convert titles into sequences and make an n-gram model

Suppose we have a sentence like **"I am Yash"**. The tokenizer maps each word to its ID, for example **{'I': 1, 'am': 2, 'Yash': 3}**, so the sentence becomes the sequence **[1, 2, 3]**.

Likewise, all of our titles are converted into sequences.

Next, we build n-gram sequences: for every title we emit each prefix of two or more tokens, so that the model can learn to predict the last token of a prefix from the tokens before it.

The image below illustrates the process.

![Capture.PNG](attachment:48ad80b3-90bf-4cf6-99f8-7dcfd467d1f8.PNG)
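
The prefix expansion can be sketched without Keras. The snippet below is a minimal illustration under stated assumptions: `build_prefix_sequences` is a hypothetical helper and `word_index` is a hand-written toy vocabulary, not the output of the notebook's tokenizer.

```python
def build_prefix_sequences(token_list, max_len=None):
    """Return every prefix of token_list that has at least two tokens.

    Each prefix later yields one training example: all tokens but the
    last are the input and the last token is the label.
    """
    prefixes = []
    for i in range(1, len(token_list)):
        prefix = token_list[:i + 1]
        # Optionally skip prefixes that are too long (the notebook keeps len < 50)
        if max_len is None or len(prefix) < max_len:
            prefixes.append(prefix)
    return prefixes


# Hypothetical toy vocabulary, for illustration only
word_index = {"i": 1, "am": 2, "yash": 3}
tokens = [word_index[w] for w in "I am Yash".lower().split()]

prefix_sequences = build_prefix_sequences(tokens)
print(prefix_sequences)  # [[1, 2], [1, 2, 3]]
```

A sentence of n tokens therefore produces n - 1 training sequences, which is why the number of input sequences is much larger than the number of lines in the corpus.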


In [6]:
input_sequences = []
for line in text_corpus:
    token_list = tokenizer.texts_to_sequences([line])[0]
    # print(token_list)

    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        if len(n_gram_sequence) < 50:
          input_sequences.append(n_gram_sequence)

In [7]:
print(input_sequences[:20])
print("Total input sequences: ", len(input_sequences))

[[875, 2], [875, 2, 3266], [875, 2, 3266, 3], [875, 2, 3266, 3, 2393], [875, 2, 3266, 3, 2393, 1618], [875, 2, 3266, 3, 2393, 1618, 3], [875, 2, 3266, 3, 2393, 1618, 3, 22087], [875, 2, 3266, 3, 2393, 1618, 3, 22087, 2], [875, 2, 3266, 3, 2393, 1618, 3, 22087, 2, 69], [875, 2, 3266, 3, 2393, 1618, 3, 22087, 2, 69, 2344], [875, 2, 3266, 3, 2393, 1618, 3, 22087, 2, 69, 2344, 3], [875, 2, 3266, 3, 2393, 1618, 3, 22087, 2, 69, 2344, 3, 5399], [875, 2, 3266, 3, 2393, 1618, 3, 22087, 2, 69, 2344, 3, 5399, 7], [875, 2, 3266, 3, 2393, 1618, 3, 22087, 2, 69, 2344, 3, 5399, 7, 18], [875, 2, 3266, 3, 2393, 1618, 3, 22087, 2, 69, 2344, 3, 5399, 7, 18, 365], [875, 2, 3266, 3, 2393, 1618, 3, 22087, 2, 69, 2344, 3, 5399, 7, 18, 365, 184], [875, 2, 3266, 3, 2393, 1618, 3, 22087, 2, 69, 2344, 3, 5399, 7, 18, 365, 184, 23], [875, 2, 3266, 3, 2393, 1618, 3, 22087, 2, 69, 2344, 3, 5399, 7, 18, 365, 184, 23, 33], [875, 2, 3266, 3, 2393, 1618, 3, 22087, 2, 69, 2344, 3, 5399, 7, 18, 365, 184, 23, 33, 54], [8

<a name="pad"></a>
#### Make all sequences the same length using padding

Every sequence must have the same length. To achieve this, we find the longest sequence and pad all the others with zeros at the front (`padding='pre'`) until they match that length.
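
As a rough illustration of what pre-padding does, here is a pure-Python sketch on toy data. The helper name `pad_pre` is hypothetical; it only mimics the effect of `padding='pre'` and is not a replacement for `pad_sequences`.

```python
def pad_pre(sequences, pad_value=0):
    """Left-pad every sequence with pad_value up to the longest length."""
    max_len = max(len(seq) for seq in sequences)
    return [[pad_value] * (max_len - len(seq)) + list(seq) for seq in sequences]


toy_sequences = [[1, 2], [1, 2, 3], [1, 2, 3, 4]]
padded = pad_pre(toy_sequences)
for row in padded:
    print(row)
# [0, 0, 1, 2]
# [0, 1, 2, 3]
# [1, 2, 3, 4]
```

Padding at the front keeps the most recent words at the end of each row, so the last column always holds the token that will serve as the label.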

In [8]:
# pad sequences
max_sequence_len = max([len(x) for x in input_sequences])
print(max_sequence_len)
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))

49


In [26]:
input_sequences[23]

array([    0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,   875,     2,  3266,
           3,  2393,  1618,     3, 22087,     2,    69,  2344,     3,
        5399,     7,    18,   365,   184,    23,    33,    54,    38,
          34,    12,  2393,  1618], dtype=int32)

<a name="xy"></a>
# Prepare features and labels

Here, we treat the **last element of each sequence as the label** and the preceding elements as the features. Then we **one-hot encode the labels over `total_words` classes.**
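
The split and the encoding can be sketched on toy data as follows. This is a pure-Python illustration with hypothetical helpers (`split_features_labels`, `one_hot`) and an assumed vocabulary size of 5; the notebook itself uses NumPy slicing and `to_categorical`.

```python
def split_features_labels(padded_sequences):
    """Split each padded sequence into (all but last token, last token)."""
    features = [seq[:-1] for seq in padded_sequences]
    labels = [seq[-1] for seq in padded_sequences]
    return features, labels


def one_hot(label, num_classes):
    """Return a list of zeros with a single 1.0 at position `label`."""
    vector = [0.0] * num_classes
    vector[label] = 1.0
    return vector


toy_padded = [[0, 0, 1, 2], [0, 1, 2, 3], [1, 2, 3, 4]]
toy_total_words = 5  # hypothetical vocabulary size (4 words + padding index 0)

features, labels = split_features_labels(toy_padded)
one_hot_labels = [one_hot(label, toy_total_words) for label in labels]

print(features)           # [[0, 0, 1], [0, 1, 2], [1, 2, 3]]
print(labels)             # [2, 3, 4]
print(one_hot_labels[0])  # [0.0, 0.0, 1.0, 0.0, 0.0]
```

Each one-hot vector has as many entries as the vocabulary, so with a large vocabulary the label matrix grows quickly.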

In [12]:
# create features and label
xs, labels = input_sequences[:,:-1],input_sequences[:,-1]
ys = tf.keras.utils.to_categorical(labels, num_classes=total_words)

In [13]:
print(xs[5])
print(labels[5])
print(ys[5][14])

[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
  875    2 3266    3 2393 1618]
3
0.0


<a name="blstm"></a>
# Architecture of Bidirectional LSTM Neural Network

A Long Short-Term Memory (LSTM) network is an advanced recurrent neural network that can retain information from earlier time steps through its cell state.

Image reference: https://www.researchgate.net/figure/The-structure-of-the-Long-Short-Term-Memory-LSTM-neural-network-Reproduced-from-Yan_fig8_334268507
![lstm.png](attachment:c34341f6-d243-478a-b4bd-bf242759cd50.png)
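
To make the role of the cell state concrete, here is a minimal scalar sketch of the standard LSTM gate equations. It is only an illustration: the weights in `toy_params` are hypothetical hand-picked values rather than learned ones, and one set of weights is reused for both directions for brevity, whereas a real bidirectional layer learns separate weights per direction.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell (standard gate equations)."""
    # Each gate has an input weight w, a recurrent weight u and a bias b
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate value
    c = f * c_prev + i * g   # cell state: keep part of the old, add part of the new
    h = o * math.tanh(c)     # hidden state passed on to the next step
    return h, c


def run_lstm(inputs, p):
    """Run the cell over a sequence and return the final hidden state."""
    h, c = 0.0, 0.0
    for x in inputs:
        h, c = lstm_step(x, h, c, p)
    return h


# Hypothetical weights chosen by hand for illustration, not learned values
toy_params = {k: 0.5 for k in ("wf", "uf", "bf", "wi", "ui", "bi",
                               "wo", "uo", "bo", "wg", "ug", "bg")}
sequence = [1.0, -1.0, 0.5]

forward_state = run_lstm(sequence, toy_params)         # left to right
backward_state = run_lstm(sequence[::-1], toy_params)  # right to left
# Concatenate the final states of the two directions
bidirectional_output = [forward_state, backward_state]
print(bidirectional_output)
```

The two final states differ because each direction sees the inputs in a different order, which is the extra context a bidirectional layer provides.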

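**Bidirectional LSTM**

A bidirectional LSTM runs one LSTM over the sequence from left to right and another from right to left, then combines their outputs.

Image reference: https://paperswithcode.com/method/bilstm

![bi.png](attachment:d26c6b0c-cbdf-45a5-b88b-2b352d7b7d63.png)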

<a name="train"></a>
# Bi-LSTM neural network model training

In [9]:
gpus = tf.config.list_physical_devices('GPU')
if gpus:
  # Restrict TensorFlow to allocate at most 5120 MB (5 GB) of memory on the first GPU
  try:
    tf.config.set_logical_device_configuration(
        gpus[0],
        [tf.config.LogicalDeviceConfiguration(memory_limit=5120)])
    logical_gpus = tf.config.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    # Virtual devices must be set before GPUs have been initialized
    print(e)


1 Physical GPUs, 1 Logical GPUs


In [None]:
# Build the model: embedding -> bidirectional LSTM -> softmax over the vocabulary
model = Sequential()
model.add(Embedding(total_words, 100, input_length=max_sequence_len-1))
model.add(Bidirectional(LSTM(150)))
model.add(Dense(total_words, activation='softmax'))
adam = Adam(learning_rate=0.01)  # `lr` is a deprecated alias of `learning_rate`
model.compile(loss='categorical_crossentropy', optimizer=adam, metrics=['accuracy'])

# TensorFlow runs training on the GPU automatically when one is available
history = model.fit(xs, ys, epochs=50, verbose=1)

model.summary()



In [None]:
model.save("your_model.h5")

<a name="acc"></a>
# Plotting model accuracy and loss

In [None]:
import matplotlib.pyplot as plt


def plot_graphs(history, string):
    plt.plot(history.history[string])
    plt.xlabel("Epochs")
    plt.ylabel(string)
    plt.show()

In [None]:
plot_graphs(history, 'accuracy')

In [None]:
plot_graphs(history, 'loss')

<a name="new"></a>
# Predicting the next words of a title

In [None]:
seed_text = "मेरो प्यारी माँ"
next_words = 10

for _ in range(next_words):
    token_list = tokenizer.texts_to_sequences([seed_text])[0]
    token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
    predicted_probs = model.predict(token_list, verbose=0)
    predicted_index = np.argmax(predicted_probs)

    # Map the predicted ID back to its word; index 0 (padding) has no entry
    output_word = tokenizer.index_word.get(int(predicted_index), "")
    seed_text += " " + output_word
print(seed_text)
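
The generation loop above is greedy: at every step it appends the single most probable word and feeds the extended text back in. The sketch below reproduces that loop in pure Python under loud assumptions: `toy_word_index` is a hand-written vocabulary and `toy_predict` is a hypothetical stand-in for `model.predict` that always favours the next index. It also inverts the vocabulary once, instead of searching `word_index` on every step.

```python
# Hypothetical toy vocabulary; index 0 is reserved for padding
toy_word_index = {"<oov>": 1, "good": 2, "morning": 3, "everyone": 4}
# Invert the mapping once so each lookup is a single dictionary access
toy_index_word = {index: word for word, index in toy_word_index.items()}


def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])


def toy_predict(token_ids, vocab_size=5):
    """Stand-in for model.predict: one probability per vocabulary index.

    Puts all probability on the index after the last token (capped at
    the last index). A real model would return learned probabilities.
    """
    probs = [0.0] * vocab_size
    probs[min(token_ids[-1] + 1, vocab_size - 1)] = 1.0
    return probs


def generate(seed_text, next_words):
    """Greedy decoding: repeatedly append the most probable next word."""
    text = seed_text
    for _ in range(next_words):
        token_ids = [toy_word_index.get(w, toy_word_index["<oov>"])
                     for w in text.lower().split()]
        predicted_index = argmax(toy_predict(token_ids))
        text += " " + toy_index_word.get(predicted_index, "")
    return text


print(generate("good", 2))  # good morning everyone
```

Because greedy decoding always takes the top word, the same seed text always yields the same continuation.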