# Literature Survey

Next-word prediction is a key task in natural language processing: given the preceding words of a sentence, predict the word that follows. This project develops a next-word prediction model using deep learning. The model is built on Recurrent Neural Networks (RNNs) and their variant, Long Short-Term Memory (LSTM) networks, which are known for their ability to capture long-term dependencies in text. By training on a text dataset, we aim to predict the next word accurately enough to be useful for applications such as text completion, chatbots, and virtual assistants. The project covers data preprocessing, model implementation, training, evaluation, and fine-tuning to optimize performance.

# Dataset Description

The dataset used in this project is sourced from Kaggle and contains news headlines from the month of March 2018.
The dataset provides a rich source of textual data suitable for training a next-word prediction model.
Each record in the dataset represents a single news headline.

In [None]:
!pip install keras tensorflow

import pandas as pd
import os
import numpy as np

import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, LSTM, Dense, Bidirectional
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam

### Install opendatasets

In [None]:
!pip install opendatasets -q
import opendatasets as od

### Download the dataset

In [None]:
od.download("https://www.kaggle.com/datasets/manishguptads/news-headlines/code")

### Load the dataset

In [None]:
news_data = pd.read_csv("/content/news-headlines/ArticlesMarch2018.csv")
news_data.head()

In [None]:
print("Number of records: ", news_data.shape[0])
print("Number of fields: ", news_data.shape[1])

# Exploratory Data Analysis

Exploratory Data Analysis (EDA) involves analyzing the dataset to summarize its main characteristics, often using visual methods.
For this dataset, we can explore the distribution of headline lengths, the most common words, and other relevant statistics.
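
The two summaries mentioned above can be sketched with the standard library alone. This is a minimal illustration, not part of the original analysis: `sample_headlines` is a made-up list standing in for `news_data['headline']`, and the helper names are hypothetical.

```python
from collections import Counter
from statistics import mean

# Hypothetical sample; in the notebook this would be news_data['headline'].
sample_headlines = [
    "Markets rally as trade fears ease",
    "The city votes on a new transit plan",
    "Trade talks stall as markets slide",
]

def headline_lengths(headlines):
    """Return the number of whitespace-separated words in each headline."""
    return [len(h.split()) for h in headlines]

def most_common_words(headlines, n=3):
    """Return the n most frequent lowercase words with their counts."""
    counts = Counter(word for h in headlines for word in h.lower().split())
    return counts.most_common(n)

lengths = headline_lengths(sample_headlines)
print("Lengths:", lengths)
print("Mean length:", mean(lengths))
print("Top words:", most_common_words(sample_headlines))
```

The length distribution matters later because the longest sequence determines how much padding every other sequence receives.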


### Display the first few records and some basic statistics

In [None]:
print(news_data['headline'].head())
print(news_data['headline'].describe())

# Data Preprocessing

Data preprocessing is a crucial step in preparing the data for model training.
It includes tasks such as removing unwanted characters, tokenizing the text, and creating sequences of words for model input.

### Data cleaning

In [None]:
news_data['headline'] = news_data['headline'].apply(lambda x: x.replace(u'\xa0',u' '))
news_data['headline'] = news_data['headline'].apply(lambda x: x.replace('\u200a',' '))

### Tokenization


In [None]:
tokenizer = Tokenizer(oov_token='<oov>') # For those words which are not found in word_index
tokenizer.fit_on_texts(news_data['headline'])

tokenizer_json = tokenizer.to_json()
with open('tokenizer.json', 'w') as file:
    file.write(tokenizer_json)

total_words = len(tokenizer.word_index) + 1

print("Total number of words: ", total_words)
print("Word: ID")
print("------------")
print("<oov>: ", tokenizer.word_index['<oov>'])
print("Strong: ", tokenizer.word_index['strong'])
print("And: ", tokenizer.word_index['and'])
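
To make the word-to-ID mapping concrete, here is a simplified pure-Python sketch of the idea. It only approximates the Keras `Tokenizer`: it lowercases and splits on whitespace, skips punctuation filtering, and the function names `build_word_index` and `to_sequence` are hypothetical. As in Keras, more frequent words get smaller IDs and the OOV token takes ID 1.

```python
from collections import Counter

def build_word_index(texts, oov_token="<oov>"):
    """Map each word to an integer ID, most frequent first; ID 1 is the OOV token."""
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def to_sequence(text, word_index, oov_token="<oov>"):
    """Convert text to IDs, falling back to the OOV ID for unseen words."""
    oov_id = word_index[oov_token]
    return [word_index.get(word, oov_id) for word in text.lower().split()]

texts = ["rain and wind", "wind and snow and ice"]
word_index = build_word_index(texts)
print(word_index)
print(to_sequence("wind and hail", word_index))  # "hail" is unseen, so it maps to the OOV ID
```

ID 0 is never assigned to a word because it is reserved for padding, which is why the vocabulary size is computed as `len(tokenizer.word_index) + 1`.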

### Creating sequences of words

In [None]:
input_sequences = []
for line in news_data['headline']:
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)

print("Total input sequences: ", len(input_sequences))
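
The loop above turns each headline into every prefix of length two or more, so a headline with n tokens yields n - 1 training sequences. A small sketch with made-up token IDs shows the pattern; `prefix_sequences` is a hypothetical helper that mirrors the inner loop.

```python
def prefix_sequences(token_list):
    """Return every prefix of token_list that has at least two tokens."""
    return [token_list[: i + 1] for i in range(1, len(token_list))]

tokens = [12, 7, 45, 3]  # hypothetical token IDs for a four-word headline
for seq in prefix_sequences(tokens):
    # Everything but the last token is the context; the last token is the target
    print(seq[:-1], "->", seq[-1])
```

Each printed line pairs a context with the word the model should learn to predict after it.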

### Pad sequences

In [None]:
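max_sequence_len = max(len(x) for x in input_sequences)
print("Max sequence length: ", max_sequence_len)
# Pre-pad with zeros so all sequences share one length and the label stays in the last position
input_sequences = pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre')
input_sequences[1]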

In [None]:
max_length = max([len(input_sequence) for input_sequence in input_sequences])
max_length
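
For readers who want to see what pre-padding does without running TensorFlow, the following is a rough pure-Python equivalent. `pad_pre` is a hypothetical helper, not the Keras function; it assumes a positive `maxlen` and, like the default `pad_sequences` behaviour, drops tokens from the front when a sequence is too long.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad each sequence with `value`, keeping only the last `maxlen` items."""
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]  # truncate from the front if the sequence is too long
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

sequences = [[12, 7], [12, 7, 45], [12, 7, 45, 3]]  # hypothetical token IDs
for row in pad_pre(sequences, maxlen=4):
    print(row)
```

Padding on the left keeps the target word in the final column of every row, which is what the feature and label split relies on.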

### Create features and label

In [None]:
xs, labels = input_sequences[:,:-1],input_sequences[:,-1]
ys = tf.keras.utils.to_categorical(labels, num_classes=total_words)

In [None]:
print(xs[5])
print(labels[5])
print(ys[5][labels[5]])  # the one-hot vector is 1.0 at the label's index
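
The split and the one-hot encoding can be illustrated on plain lists. This is a sketch with made-up values; `split_features_label` and `one_hot` are hypothetical helpers that mimic the slicing and `to_categorical` steps.

```python
def split_features_label(padded):
    """Use all but the last token of each row as features and the last as the label."""
    xs = [row[:-1] for row in padded]
    labels = [row[-1] for row in padded]
    return xs, labels

def one_hot(label, num_classes):
    """Return a vector of zeros with a single 1.0 at position `label`."""
    vec = [0.0] * num_classes
    vec[label] = 1.0
    return vec

padded = [[0, 0, 4, 2], [0, 4, 2, 5]]  # hypothetical padded sequences
xs_demo, labels_demo = split_features_label(padded)
print(xs_demo, labels_demo)
print(one_hot(labels_demo[0], num_classes=6))
```

Each one-hot vector has `total_words` entries, so the label matrix grows with the vocabulary. Keras also provides a `sparse_categorical_crossentropy` loss that accepts integer labels directly, which may be worth considering if memory becomes a constraint.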

# Model Implementation

In this section, we implement the model for next-word prediction.
We use a Bidirectional LSTM model with an Embedding layer and a Dense layer with a softmax activation function.

### Define the model

In [None]:
model = Sequential()
model.add(Embedding(total_words, 100, input_length=max_sequence_len-1))
model.add(Bidirectional(LSTM(150)))
model.add(Dense(total_words, activation='softmax'))

### Compile the model

In [None]:
adam = Adam(learning_rate=0.01)
model.compile(loss='categorical_crossentropy', optimizer=adam, metrics=['accuracy'])
model.summary()  # prints the summary itself; wrapping it in print() would also print "None"
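
The parameter counts reported by the summary can be cross-checked by hand. The sketch below uses the layer sizes from this notebook (embedding dimension 100, 150 LSTM units per direction) and the standard LSTM count of four gates, each with input weights, recurrent weights, and a bias. The vocabulary size of 3000 is a made-up value for illustration; substitute `total_words` in practice.

```python
def embedding_params(vocab_size, embed_dim):
    """One embed_dim-sized vector per vocabulary entry."""
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, output_dim):
    """Weight matrix plus one bias per output."""
    return input_dim * output_dim + output_dim

def model_params(vocab_size, embed_dim=100, lstm_units=150):
    """Total for Embedding -> Bidirectional(LSTM) -> Dense(softmax)."""
    bilstm = 2 * lstm_params(embed_dim, lstm_units)  # forward and backward LSTMs
    dense = dense_params(2 * lstm_units, vocab_size)  # BiLSTM concatenates both directions
    return embedding_params(vocab_size, embed_dim) + bilstm + dense

vocab_size = 3000  # hypothetical; use total_words from the tokenizer
print("BiLSTM parameters:", 2 * lstm_params(100, 150))
print("Total parameters:", model_params(vocab_size))
```

The embedding and output layers both scale with the vocabulary size, so for a large vocabulary they account for most of the model's parameters.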

### Train the model



In [None]:
history = model.fit(xs, ys, epochs=50, verbose=1)

# Model Evaluation and Discussion

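Model evaluation here assesses the performance of the model on the training data.
We track the accuracy and loss that Keras reports during training; because no validation split is held out, these metrics describe how well the model fits the training set rather than how well it generalizes.
We also visualize the training progress using plots.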
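
To show what the two metrics measure, here is a hand computation for a tiny batch. It is a sketch under simplifying assumptions: the probabilities are made up, and Keras may apply numerical safeguards that this version omits.

```python
import math

def categorical_crossentropy(probs, true_index):
    """Negative log of the probability assigned to the correct class."""
    return -math.log(probs[true_index])

def batch_metrics(batch_probs, true_indices):
    """Return (mean loss, accuracy) over a batch of predicted distributions."""
    losses = [categorical_crossentropy(p, t) for p, t in zip(batch_probs, true_indices)]
    correct = [max(range(len(p)), key=p.__getitem__) == t
               for p, t in zip(batch_probs, true_indices)]
    return sum(losses) / len(losses), sum(correct) / len(correct)

# Hypothetical softmax outputs over a three-word vocabulary
batch_probs = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2]]
true_indices = [1, 2]
loss, acc = batch_metrics(batch_probs, true_indices)
print(f"loss={loss:.4f} accuracy={acc:.2f}")
```

Accuracy only counts whether the top prediction is right, while the loss also penalizes a correct word that received low probability, which is why the two curves can move differently.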

In [None]:
import matplotlib.pyplot as plt

In [None]:
def plot_graphs(history, string):
    plt.plot(history.history[string])
    plt.xlabel("Epochs")
    plt.ylabel(string)
    plt.show()

In [None]:
plot_graphs(history, 'accuracy')
plot_graphs(history, 'loss')

### Saving the model

In [None]:
model.save("next_word_prediction_model.h5")

### Text Generation Function

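This function takes a seed text and extends it by `next_words` words, one at a time. At each step it feeds the current text to the trained model and appends the most probable next word (greedy decoding).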

In [None]:
def generate_text(model, tokenizer, seed_text, max_sequence_len, next_words):
    for _ in range(next_words):
        # Encode the current text and pad it to the input length the model expects
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen=max_sequence_len - 1, padding='pre')
        # Take the most probable word ID as a plain int
        predicted = model.predict(token_list, verbose=0)
        predicted_id = int(np.argmax(predicted, axis=1)[0])
        # index_word is the tokenizer's reverse mapping (ID -> word); ID 0 is padding
        output_word = tokenizer.index_word.get(predicted_id, "")
        seed_text += " " + output_word
    return seed_text
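
The decoding loop can be tried without a trained model. In the sketch below, a made-up lookup table stands in for `model.predict`, and `greedy_generate` and `toy_predict` are hypothetical names; unlike the real model, the stand-in only looks at the last word.

```python
def greedy_generate(predict_next, seed_words, next_words):
    """Repeatedly append the most probable next word to the seed."""
    words = list(seed_words)
    for _ in range(next_words):
        probs = predict_next(words)  # dict mapping word -> probability
        words.append(max(probs, key=probs.get))
    return words

# Hypothetical stand-in for the trained model
toy_table = {
    "stocks": {"rise": 0.6, "fall": 0.4},
    "rise": {"as": 0.7, "again": 0.3},
    "as": {"stocks": 0.2, "fears": 0.8},
    "fears": {"ease": 0.9, "rise": 0.1},
    "ease": {"again": 1.0},
    "again": {"stocks": 1.0},
}

def toy_predict(words):
    """Return the next-word distribution for the last word in the sequence."""
    return toy_table[words[-1]]

print(" ".join(greedy_generate(toy_predict, ["stocks"], 4)))
```

Because greedy decoding always takes the top word, the same seed always produces the same continuation.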

### Example usage

In [None]:
seed_text = "I am"
next_words = 5
generated_text = generate_text(model, tokenizer, seed_text, max_sequence_len, next_words)
print(generated_text)

# Conclusion

In this project, we developed a next-word prediction model using a Bidirectional LSTM.
The model was trained on a dataset of news headlines, and we tracked its accuracy and loss during training.
These metrics indicate how well the model predicts the next word for sequences drawn from the training headlines; because no held-out set was used, its accuracy on unseen text remains to be measured.

# References

1. Gupta, M. (2021). News Headlines. Retrieved from https://www.kaggle.com/datasets/manishguptads/news-headlines/code