## Machine Learning - Exercise 3 (WS 2022)
**Group (31):** Petkova Violeta (01636660), Upadhyaya Bishal (12119246), Gabor Toaso (12127079)

#### Selected topic: 3.2.3 Next-word prediction (Language Modelling) using Deep Learning

**Description:**
We implemented a next-word prediction model, which predicts the most likely next word given the preceding words (e.g. the last word of a particular sentence).
We used methods from natural language processing, language modelling, and deep learning.


**Data source:**
----


**High level process:**
- download the data from repository XXX,
- pre-process the data from the dataset,
  - remove all unnecessary data,
  - delete the beginning and end of the dataset (?),
  - save the pre-processed data as a txt file (open the file with UTF-8 encoding),
  - replace all (i) unnecessary extra newlines, (ii) carriage returns and (iii) the Unicode character,
  - keep only unique words (consider each word once and remove additional repetitions) to avoid confusion,
- start analysing the data downloaded from the xxx repository,
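
The clean-up steps listed above could look roughly like the following sketch. It uses only the standard library; the helper name `clean_corpus` is hypothetical, and it assumes that "the Unicode character" refers to the byte-order mark `\ufeff`, which the notes do not state explicitly.

```python
def clean_corpus(raw: str) -> str:
    """Collapse line breaks and drop the byte-order mark from a raw text dump."""
    # Assumption: "the Unicode character" is the byte-order mark '\ufeff'.
    text = raw.replace("\ufeff", "")
    # Treat carriage returns and newlines as ordinary spaces.
    text = text.replace("\r", " ").replace("\n", " ")
    # split() without arguments drops runs of whitespace, so joining with a
    # single space removes the gaps left behind by the line breaks.
    return " ".join(text.split())


sample_raw = "\ufeffOne morning,\r\n\r\nwhen Gregor Samsa\nwoke"
print(clean_corpus(sample_raw))
```

The cleaned string can then be written to a txt file opened with `encoding="utf-8"`, as described in the list.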

- tokenize the data (split the text corpus into smaller segments),
  - the Keras Tokenizer vectorizes a text corpus by turning each text into either a sequence of integers (each integer being the index of a token in a dictionary) or a vector whose coefficient for each token can be binary, based on word count, or based on tf-idf,
  - convert the texts to sequences (map the text data to numbers),
  - create the training dataset ('X'),
  - define the output for the training data ('y'); 'y' contains the next word for each input 'X',
  - calculate "vocab_size" as the length of "tokenizer.word_index" plus 1 ("0" is reserved for padding, so the count starts at "1"),
  - convert the targets 'y' to categorical data of size "vocab_size", i.e. convert the class vector (integers) to a binary class matrix, which is the format the categorical_crossentropy loss expects,
  - the pre-processing can still be improved to achieve a better loss and accuracy in fewer epochs,
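
To make these steps concrete, here is a minimal plain-Python sketch of what they compute. It deliberately does not call Keras: the toy corpus, the helper `to_one_hot`, and the first-appearance ordering of the word index are assumptions made for illustration (the Keras Tokenizer orders its index by word frequency instead).

```python
import re

corpus = "the cat sat on the mat and the cat slept"
words = re.findall(r"\w+", corpus.lower())

# Word index in order of first appearance, starting at 1 because
# index 0 is reserved for padding.
word_index = {}
for word in words:
    if word not in word_index:
        word_index[word] = len(word_index) + 1

vocab_size = len(word_index) + 1  # +1 for the reserved padding index

# Text -> sequence of integers.
sequence = [word_index[w] for w in words]

# Each input is a single word; each target is the word that follows it.
X = sequence[:-1]
y = sequence[1:]


def to_one_hot(index: int, size: int) -> list:
    """Binary class vector with a single 1 at position `index`."""
    row = [0] * size
    row[index] = 1
    return row


# Integer class vector -> binary class matrix with vocab_size columns.
y_categorical = [to_one_hot(i, vocab_size) for i in y]

print("vocab_size:", vocab_size)
print("first inputs:", X[:3], "first targets:", y[:3])
```

Each row of `y_categorical` has exactly one non-zero entry, which is why it pairs naturally with a softmax output layer and a categorical cross-entropy loss.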


- **Building a sequential model**
  - create an embedding layer and specify the input dimensions and output dimensions,
  - specify the input length as 1, since the prediction is made on exactly one word and we receive a response for that word,
  - add an LSTM layer (#1) with 1000 units and return sequences set to true, so that its output can be passed to another LSTM layer,
  - add the next LSTM layer (#2), also with 1000 units (return sequences is false by default),
  - pass the result through a hidden layer with 1000 units using the "dense layer" function with "relu" set as the activation,
  - pass
  - ...
  - ...

Finally, we pass the result through an output layer with "vocab_size" units and a softmax activation. The softmax activation ensures that the outputs form a probability distribution over the vocabulary, with one probability per word.
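
As a small illustration of what the softmax output layer does, the sketch below turns a handful of raw scores into probabilities. The scores and the four-word vocabulary are made up; in the real model this computation is done by the Keras layer.

```python
import math


def softmax(logits):
    """Turn raw output scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing; the result is unchanged.
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for a vocabulary of four words.
scores = [2.0, 1.0, 0.1, -1.0]
probs = softmax(scores)

print(probs)
print("sum:", sum(probs))
```

The word with the highest raw score also receives the highest probability, so picking the largest output is equivalent to picking the most likely next word.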


- build a deep learning model (using LSTM),
  - train model,

## Sources

- https://www.ris-ai.com/predict-next-word-with-python
- https://www.nltk.org/
- https://towardsdatascience.com/next-word-prediction-with-nlp-and-deep-learning-48b9fe0a17bf

## Importing packages

In [11]:
!pip3 install nltk
!pip3 install keras
import numpy as np
from nltk.tokenize import RegexpTokenizer
from keras.models import Sequential, load_model
from keras.layers import LSTM, Dense, Activation
from keras.optimizers import RMSprop
import matplotlib.pyplot as plt
import pickle
import heapq



## Importing data (corpus length)

In [12]:
with open('metamorphosis_clean.txt', encoding='utf-8') as f:
    text = f.read().lower()
print('corpus length:', len(text))

corpus length: 119164


## Pre-processing (tokenization)

In [13]:
tokenizer = RegexpTokenizer(r'\w+')
words = tokenizer.tokenize(text)

# For the feature engineering part we need the sorted list of unique words (np.unique returns it sorted)
# and a dictionary that maps each word from the unique_words list to its position in that list.
unique_words = np.unique(words)
unique_word_index = {word: i for i, word in enumerate(unique_words)}

## Feature engineering

In [14]:
WORD_LENGTH = 5 # Number of words considered in sequence
prev_words = []
next_words = []
for i in range(len(words) - WORD_LENGTH):
    prev_words.append(words[i:i + WORD_LENGTH])
    next_words.append(words[i + WORD_LENGTH])
print(prev_words[0])
print(next_words[0])

# Here, we create two NumPy arrays: X (for storing the one-hot features) and Y (for storing the corresponding one-hot labels).
X = np.zeros((len(prev_words), WORD_LENGTH, len(unique_words)), dtype=bool)
Y = np.zeros((len(next_words), len(unique_words)), dtype=bool)

# We iterate over the samples and set the position of each word that is present to 1.
for i, each_words in enumerate(prev_words):
    for j, each_word in enumerate(each_words):
        X[i, j, unique_word_index[each_word]] = 1
    Y[i, unique_word_index[next_words[i]]] = 1
    
print(X[0][0])

['one', 'morning', 'when', 'gregor', 'samsa']
woke
[False False False ... False False False]
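
The sliding window used above can be traced on a short toy sentence. This is only an illustrative sketch in plain Python: `WINDOW` plays the role of `WORD_LENGTH`, and the nine-word sentence stands in for the full corpus.

```python
WINDOW = 5  # same role as WORD_LENGTH in the cell above

toy_words = "one morning when gregor samsa woke from troubled dreams".split()

# Every run of WINDOW consecutive words is an input; the word right after it is the label.
pairs = [
    (toy_words[i:i + WINDOW], toy_words[i + WINDOW])
    for i in range(len(toy_words) - WINDOW)
]

for context, target in pairs:
    print(context, "->", target)

# With N words there are N - WINDOW samples, so the one-hot arrays have the shapes
# X: (N - WINDOW, WINDOW, V) and Y: (N - WINDOW, V) for a vocabulary of V unique words.
n_samples = len(pairs)
vocab = sorted(set(toy_words))
print("X shape:", (n_samples, WINDOW, len(vocab)))
print("Y shape:", (n_samples, len(vocab)))
```

Because `X` stores a full one-hot vector for every word of every window, its size grows with both the corpus length and the vocabulary size, which is why the notebook uses `dtype=bool`.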


## Building and training the model

In [15]:
# We use a single-layer LSTM model with 128 neurons, a fully connected layer, and a softmax function for activation.
model = Sequential()
model.add(LSTM(128, input_shape=(WORD_LENGTH, len(unique_words))))
model.add(Dense(len(unique_words)))
model.add(Activation('softmax'))

# Train - the model is trained for 20 epochs with the RMSprop optimizer.
optimizer = RMSprop(learning_rate=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
history = model.fit(X, Y, validation_split=0.05, batch_size=128, epochs=20, shuffle=True).history

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [16]:
model.save('keras_next_word_model.h5')
with open("history.p", "wb") as f:
    pickle.dump(history, f)

model = load_model('keras_next_word_model.h5')
with open("history.p", "rb") as f:
    history = pickle.load(f)

## Predicting with the model

In [22]:
# Now, we need to predict new words using this model. 
# To do that we input the sample as a feature vector. 
# We convert the input string to a single feature vector.

def prepare_input(text):
    x = np.zeros((1, WORD_LENGTH, len(unique_words)))
    for t, word in enumerate(text.split()):
        print(word)
        x[0, t, unique_word_index[word]] = 1
    return x
prepare_input("He slid back into his".lower())



he
slid
back
into
his


array([[[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]]])
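
Note that `prepare_input` looks every word up in `unique_word_index`, so a word that does not occur in the corpus raises a `KeyError`, and an input longer than `WORD_LENGTH` words does not fit into the array. One possible way to guard against both, sketched here with a hypothetical helper and a made-up vocabulary, is to filter the input before encoding it. Dropping unknown words changes the context the model sees, so this is only one of several reasonable policies.

```python
def select_known_window(text, vocabulary, window=5):
    """Return the last `window` words of `text` that are present in `vocabulary`."""
    known = [w for w in text.lower().split() if w in vocabulary]
    return known[-window:]


# Hypothetical vocabulary; in the notebook this would be unique_word_index.
toy_vocabulary = {"he": 0, "slid": 1, "back": 2, "into": 3, "his": 4, "bed": 5}

# "quietly" is unknown and gets dropped.
print(select_known_window("He slid quietly back into his", toy_vocabulary))
# More than five known words: only the last five are kept.
print(select_known_window("and then he slid back into his bed", toy_vocabulary))
```

The returned list can be joined with spaces and passed to `prepare_input` unchanged.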

In [23]:
# The sample function picks the indices of the top_n most probable words from the model's prediction.

def sample(preds, top_n=3):
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds)
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    return heapq.nlargest(top_n, range(len(preds)), preds.take)

# Finally, for prediction, we use the function predict_completions, which uses
# the model to predict and returns the list of the n most likely next words.

def predict_completions(text, n=3):
    if text == "":
        return("0")
    x = prepare_input(text)
    preds = model.predict(x, verbose=0)[0]
    next_indices = sample(preds, n)
    return [unique_words[idx] for idx in next_indices]

# Now let’s see how it predicts. We use tokenizer.tokenize to remove the punctuation and
# take the first 5 words, because the model predicts from the 5 previous words.

q =  "I'd get kicked out on the spot"
print("correct sentence: ",q)
seq = " ".join(tokenizer.tokenize(q.lower())[0:5])
print("Sequence: ",seq)
print("next possible words: ", predict_completions(seq, 5))

correct sentence:  I'd get kicked out on the spot
Sequence:  i d get kicked out
i
d
get
kicked
out
next possible words:  ['on', 'out', 'of', 'part', 'away']
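
The selection step inside `sample` can be checked in isolation. The sketch below uses plain lists instead of NumPy arrays and a made-up probability vector; `top_n_indices` and `renormalize` are hypothetical helpers that mirror the two parts of `sample`.

```python
import heapq
import math


def top_n_indices(probs, top_n=3):
    """Indices of the top_n largest probabilities, most probable first."""
    return heapq.nlargest(top_n, range(len(probs)), key=probs.__getitem__)


def renormalize(probs):
    """The log -> exp -> normalize round trip used in sample()."""
    exps = [math.exp(math.log(p)) for p in probs]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical model output over a six-word vocabulary.
toy_probs = [0.05, 0.40, 0.10, 0.25, 0.15, 0.05]

print(top_n_indices(toy_probs, 3))
print(top_n_indices(renormalize(toy_probs), 3))
```

Both calls select the same indices: taking the logarithm and then the exponential returns the original values up to rounding, so the ranking of the words is decided entirely by the model's probabilities.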


## Conclusions