<h1 align="center"><font size="5">Predict Next Word</font></h1>

<h2>Project Overview</h2>

__This is a course project for the Data Science: Statistics and Machine Learning Specialization, taught by Johns Hopkins University.__

This project is part of the specialization's capstone, specifically Week 4 (Task 5). An LSTM model is built to predict the next word.

This project follows the tutorial at https://www.geeksforgeeks.org/next-word-prediction-with-deep-learning-in-nlp/

Implementing this code in R would be more complicated because the model relies on TensorFlow and Keras, so Python is used here.


<h2>Table of Contents</h2>
<ol>
    <li><a href="#1">Read in Data</a></li>
    <li><a href="#2">Preprocessing the dataset</a></li>
    <li><a href="#3">Build the Model</a></li>
    <li><a href="#4">Train the Model</a></li>
    <li><a href="#5">Predict the next word</a></li>
</ol>
<p></p>
</div>
<br>

## 1. Read in Data

In [1]:
import tensorflow as tf 
from tensorflow.keras.models import Sequential 
from tensorflow.keras.layers import Embedding, LSTM, Dense 
from tensorflow.keras.preprocessing.text import Tokenizer 
from tensorflow.keras.preprocessing.sequence import pad_sequences 
import numpy as np 
import re  # the standard-library module is enough for the pattern used below

In [2]:
def file_to_sentence_list(file_path): 
    with open(file_path, 'r',encoding='UTF-8') as file: 
        text = file.read().lower() 
  
    # Splitting the text into sentences using 
    # delimiters like '.', '?', and '!' 
    sentences = [sentence.strip() for sentence in re.split( 
        r'(?<=[.!?])\s+', text) if sentence.strip()] 
  
    return sentences 
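
To make the splitting rule concrete, here is a minimal sketch that applies the same pattern to a made-up string. The sample text is hypothetical and not taken from the corpus, and the sketch uses the standard-library `re` module so that it runs on its own.

```python
import re

# Hypothetical sample text, not taken from the corpus.
sample = "Hello there. How are you? Fine!  Thanks"

# Split on whitespace that follows '.', '!' or '?'. The lookbehind keeps
# the punctuation attached to the sentence it ends.
sentences = [s.strip()
             for s in re.split(r'(?<=[.!?])\s+', sample.lower())
             if s.strip()]
print(sentences)
```

Note that the rule is purely punctuation-based, so abbreviations such as "U.S. economy" would also be split after the period.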

In [3]:
%%time
file_path = "D:/Core/Google/Data Science--Statistics and ML/5. Capstone/final/en_US/en_US.blogs.txt"
blog_data = file_to_sentence_list(file_path)
file_path = "D:/Core/Google/Data Science--Statistics and ML/5. Capstone/final/en_US/en_US.news.txt"
news_data = file_to_sentence_list(file_path) 
file_path = "D:/Core/Google/Data Science--Statistics and ML/5. Capstone/final/en_US/en_US.twitter.txt"
twitter_data = file_to_sentence_list(file_path) 

CPU times: total: 22.2 s
Wall time: 23.1 s


In [4]:
print("Total length of blog data is ", len(blog_data))
print("Total length of news data is ", len(news_data))
print("Total length of Twitter data is ", len(twitter_data))

Total length of blog data is  2092948
Total length of news data is  1842069
Total length of Twitter data is  2926549


## 2. Preprocess the data

In [5]:
# combine all three datasets
text_data = blog_data + news_data + twitter_data
print("The total length of data is ", len(text_data))

The total length of data is  6861566


In [6]:
import random

def sample_list_by_percentage(items, rate):
    """Randomly sample a fraction of a list without replacement.

    Args:
        items: The list to sample.
        rate: The fraction of the list to sample, between 0 and 1.

    Returns:
        A list of the sampled elements.
    """
    # Calculate the number of elements to sample.
    num_elements_to_sample = int(len(items) * rate)

    # Sample the list without replacement.
    return random.sample(items, num_elements_to_sample)

The full data set is too large to train on in a reasonable time, so only a small random sample of it is used here.

In [7]:
# sample the data
sample_size = 0.001
text_data = sample_list_by_percentage(text_data, sample_size)
print("The total length of data after sampling is ", len(text_data))

The total length of data after sampling is  6861
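
The sampling above is unseeded, so each run draws a different subset and the trained model will differ between runs. If reproducibility matters, one option is to sample through a seeded generator. The helper name `sample_with_seed` and the seed value are assumptions made for this sketch, not part of the original notebook.

```python
import random

def sample_with_seed(items, rate, seed=42):
    """Sample a fraction of items without replacement, reproducibly."""
    # A private Random instance avoids touching the global random state.
    rng = random.Random(seed)
    return rng.sample(items, int(len(items) * rate))

# Hypothetical stand-in for text_data.
toy_items = list(range(1000))
first = sample_with_seed(toy_items, 0.01)
second = sample_with_seed(toy_items, 0.01)
print(first == second, len(first))
```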


In [8]:
%%time
# Tokenize the text data 
tokenizer = Tokenizer() 
tokenizer.fit_on_texts(text_data) 
total_words = len(tokenizer.word_index) + 1

CPU times: total: 266 ms
Wall time: 506 ms
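
The `+ 1` in `total_words` is there because the tokenizer numbers words from 1, leaving index 0 free to act as the padding value. The sketch below imitates that numbering in plain Python on a toy corpus. It is a simplified stand-in, not the Keras implementation, and the toy sentences are made up.

```python
from collections import Counter

# Hypothetical toy corpus standing in for text_data.
toy_texts = ["the cat sat", "the dog sat", "the end"]

# Count words, then rank them by frequency. Ranks start at 1 so that
# index 0 stays unused and can serve as the padding value.
counts = Counter(word for line in toy_texts for word in line.split())
toy_word_index = {word: rank
                  for rank, (word, _) in enumerate(counts.most_common(), start=1)}
toy_total_words = len(toy_word_index) + 1

print(toy_word_index)
print(toy_total_words)
```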


In [9]:
# Create input sequences 
input_sequences = [] 
for line in text_data: 
    token_list = tokenizer.texts_to_sequences([line])[0] 
    for i in range(1, len(token_list)): 
        n_gram_sequence = token_list[:i+1] 
        input_sequences.append(n_gram_sequence) 
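
As an illustration of what the loop above produces, the sketch below expands one sentence into its prefix sequences. The token ids are hypothetical. Each prefix of length two or more becomes one training example, in which the last token is the word to predict.

```python
# Hypothetical token ids for one five-word sentence.
toy_tokens = [4, 17, 9, 2, 31]

# Every prefix of length >= 2 becomes one training sequence.
toy_sequences = [toy_tokens[:i + 1] for i in range(1, len(toy_tokens))]

for seq in toy_sequences:
    # Context on the left, word to predict on the right.
    print(seq[:-1], "->", seq[-1])
```

A sentence of n tokens therefore yields n - 1 training sequences, which is why even a small sample of sentences produces a much larger training set.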

In [10]:
%%time
# Pad sequences and split into predictors and label 
max_sequence_len = max([len(seq) for seq in input_sequences]) 
input_sequences = np.array(pad_sequences( 
    input_sequences, maxlen=max_sequence_len, padding='pre')) 
X, y = input_sequences[:, :-1], input_sequences[:, -1] 
  
# Convert target data to one-hot encoding 
y = tf.keras.utils.to_categorical(y, num_classes=total_words) 

CPU times: total: 547 ms
Wall time: 776 ms
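
The effect of pre-padding and of the `X`/`y` split can be traced by hand on a few short sequences. The helper `pad_pre` below is a hand-rolled stand-in written for this sketch, not the Keras `pad_sequences` function, and the token ids are hypothetical.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad seq with value up to maxlen, keeping the last maxlen tokens."""
    return [value] * (maxlen - len(seq)) + seq[-maxlen:]

# Hypothetical prefix sequences of different lengths.
toy_sequences = [[4, 17], [4, 17, 9], [4, 17, 9, 2]]
toy_max_len = max(len(s) for s in toy_sequences)

padded = [pad_pre(s, toy_max_len) for s in toy_sequences]

# The last column is the label; everything before it is the context.
toy_X = [row[:-1] for row in padded]
toy_y = [row[-1] for row in padded]

print(padded)
print(toy_X)
print(toy_y)
```

Padding on the left keeps the word to predict in the final column of every row, which is what makes the single slice `[:, -1]` work for all sequences.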


## 3. Build the Model

In [11]:
%%time
# Define the model 
model = Sequential() 
model.add(Embedding(total_words, 10, 
                    input_length=max_sequence_len-1)) 
model.add(LSTM(128)) 
model.add(Dense(total_words, activation='softmax')) 
model.compile(loss='categorical_crossentropy', 
              optimizer='adam', metrics=['accuracy']) 

CPU times: total: 547 ms
Wall time: 732 ms
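
To get a feel for where the model's parameters go, the sketch below counts them per layer for the architecture above. It assumes the standard LSTM layout of four gates, each with input weights, recurrent weights and one bias vector. The vocabulary size of 20000 is hypothetical, since the notebook does not print `total_words`.

```python
def count_parameters(vocab_size, embed_dim=10, lstm_units=128):
    """Rough per-layer parameter counts for Embedding -> LSTM -> Dense."""
    embedding = vocab_size * embed_dim
    # Four gates, each with input weights, recurrent weights and a bias.
    lstm = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)
    # Dense softmax layer: one weight per (unit, word) pair plus biases.
    dense = lstm_units * vocab_size + vocab_size
    return {"embedding": embedding, "lstm": lstm, "dense": dense,
            "total": embedding + lstm + dense}

# Hypothetical vocabulary size.
sizes = count_parameters(vocab_size=20000)
print(sizes)
```

Under these assumptions the output layer dominates, so the vocabulary size has far more influence on model size than the LSTM width does.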


## 4. Train the Model

In [12]:
%%time
# Train the model 
model.fit(X, y, epochs=10, verbose=1) 

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
CPU times: total: 1h 27min 50s
Wall time: 58min 45s


<keras.src.callbacks.History at 0x1f92c6d0d60>

## 5. Predict Next Word

In [13]:
# Generate next word predictions 
seed_text = "When you breathe, I want to be the air for you. I'll be there for you, I'd live and I'd"
next_words = 5
  
for _ in range(next_words): 
    token_list = tokenizer.texts_to_sequences([seed_text])[0] 
    token_list = pad_sequences( 
        [token_list], maxlen=max_sequence_len-1, padding='pre') 
    predicted_probs = model.predict(token_list) 
    predicted_word = tokenizer.index_word[np.argmax(predicted_probs)] 
    seed_text += " " + predicted_word 
  
print("Next predicted words:", seed_text) 

Next predicted words: When you breathe, I want to be the air for you. I'll be there for you, I'd live and I'd have a good bit but
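
The generation loop is repeated for each seed text, so it may be convenient to wrap it in a function. The sketch below shows one possible shape for such a helper. The name `generate_greedy` and its parameters are assumptions made for illustration, and the toy encoder and toy model exist only so the example runs without TensorFlow.

```python
def generate_greedy(seed_text, n_words, encode, predict_probs, index_word):
    """Append n_words words, always choosing the most probable next word."""
    text = seed_text
    for _ in range(n_words):
        probs = predict_probs(encode(text))
        best = max(range(len(probs)), key=probs.__getitem__)
        word = index_word.get(best)
        if word is None:  # index 0 is padding and has no word
            break
        text += " " + word
    return text

# Toy stand-ins for the tokenizer and the trained model.
toy_index_word = {1: "a", 2: "b", 3: "c"}
toy_word_index = {w: i for i, w in toy_index_word.items()}

def toy_encode(text):
    return [toy_word_index[w] for w in text.split() if w in toy_word_index]

def toy_predict(token_ids):
    # Fake model: puts all probability on the id after the last one.
    last = token_ids[-1] if token_ids else 0
    nxt = last % 3 + 1
    return [1.0 if i == nxt else 0.0 for i in range(4)]

result = generate_greedy("a", 4, toy_encode, toy_predict, toy_index_word)
print(result)
```

With the notebook's objects, `encode` would wrap `tokenizer.texts_to_sequences` and `pad_sequences`, and `predict_probs` would return the first row of `model.predict(...)`.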


In [15]:
# Generate next word predictions 
seed_text = "He likes to eat"
next_words = 5
  
for _ in range(next_words): 
    token_list = tokenizer.texts_to_sequences([seed_text])[0] 
    token_list = pad_sequences( 
        [token_list], maxlen=max_sequence_len-1, padding='pre') 
    predicted_probs = model.predict(token_list) 
    predicted_word = tokenizer.index_word[np.argmax(predicted_probs)] 
    seed_text += " " + predicted_word 
  
print("Next predicted words:", seed_text) 

Next predicted words: He likes to eat in the meantime of the


In [16]:
# Generate next word predictions 
seed_text = "The prime minister"
next_words = 5
  
for _ in range(next_words): 
    token_list = tokenizer.texts_to_sequences([seed_text])[0] 
    token_list = pad_sequences( 
        [token_list], maxlen=max_sequence_len-1, padding='pre') 
    predicted_probs = model.predict(token_list) 
    predicted_word = tokenizer.index_word[np.argmax(predicted_probs)] 
    seed_text += " " + predicted_word 
  
print("Next predicted words:", seed_text) 

Next predicted words: The prime minister twin has crawled the legislation
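
Taking only the `argmax` hides how close the runner-up words were. As a minimal sketch, the helper below lists the top few candidates from a probability vector. The function name `top_k_words`, the probabilities and the vocabulary are all hypothetical.

```python
def top_k_words(probs, index_word, k=3):
    """Return the k most probable (word, probability) pairs."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    # Skip indices with no word attached, such as the padding index 0.
    return [(index_word[i], probs[i]) for i in ranked if i in index_word][:k]

# Hypothetical softmax output over a four-entry vocabulary (index 0 = padding).
toy_probs = [0.05, 0.10, 0.60, 0.25]
toy_index_word = {1: "has", 2: "said", 3: "will"}

candidates = top_k_words(toy_probs, toy_index_word, k=2)
print(candidates)
```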
