## The code below trains an LSTM model to generate new text. The model has two parts, an encoder and a decoder: a bidirectional LSTM followed by a unidirectional LSTM form the encoder, and two dense layers form the decoder.

# Creating a tokenized corpus

The first step is to read a text file and build a corpus. A corpus is a large, structured set of texts used for natural language processing (NLP) tasks such as text mining, sentiment analysis, and machine translation. I will use this corpus to train my model. The tokenizer's fit_on_texts method builds a word index, a dictionary that maps each unique word in the provided texts to a unique integer. This mapping is used to convert string input into the numerical form that machine learning models require, since they operate only on numbers. The vocabulary size is the length of the word index plus 1, because index 0 is reserved and never assigned to a word.

In [1]:
from tensorflow.keras.preprocessing.text import Tokenizer


tokenizer = Tokenizer()
text = open("employee.txt", encoding="utf-8").read()
corpus = text.lower().splitlines()
tokenizer.fit_on_texts(corpus)
wordDict = len(tokenizer.word_index)+1
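
To make the word index concrete, here is a minimal plain-Python sketch of the idea behind fit_on_texts. It is not the Keras implementation (it skips punctuation filtering, for example), and build_word_index and toy_corpus are hypothetical names used only for illustration.

```python
from collections import Counter


def build_word_index(texts):
    """Map each word to an integer, most frequent word first (hypothetical helper)."""
    counts = Counter(word for line in texts for word in line.lower().split())
    # most_common() sorts by descending count; ties keep first-seen order.
    # Indices start at 1 so that 0 stays free for padding.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


toy_corpus = ["employees need training", "training helps employees grow"]
word_index = build_word_index(toy_corpus)
vocab_size = len(word_index) + 1  # +1 accounts for the reserved index 0

print(word_index)
print(vocab_size)
```

The "+1" mirrors the wordDict computation above: the largest index equals the number of unique words, so an array indexed by word id needs one extra slot for index 0.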


# Creating n-gram sequences

We need to convert the text input into numbers, and texts_to_sequences is used for this purpose. It converts a list of strings into a list of integer sequences: each word is replaced by the integer index assigned to it by the fit_on_texts method. Next we build the input sequences. Each line is expanded into n-gram sequences, where the last word of each sequence is treated as the label and all the words preceding it as the input. For example, the sentence 'Shivam loves building large language models' is converted to ["Shivam", "Shivam loves", "Shivam loves building", "Shivam loves building large", "Shivam loves building large language", "Shivam loves building large language models"]. For training, we ignore the n-gram sequence of length 1, since it cannot have both an input and a label.

In [2]:
ngramSequencesList = []
for line in corpus:
    listOfTokens = tokenizer.texts_to_sequences([line])[0]
    # print(listOfTokens)  # uncomment to inspect the token ids of each line
    for i in range(1, len(listOfTokens)):
        ngramSequence = listOfTokens[:i+1]
        ngramSequencesList.append(ngramSequence)
sequenceMaxLength = max([len(i) for i in ngramSequencesList])
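
The prefix expansion can be checked on a tiny example. The sketch below uses made-up token ids rather than the real tokenizer output, and prefix_sequences is a hypothetical helper name.

```python
def prefix_sequences(token_ids):
    """Return every prefix of length >= 2, so each has an input part and a label."""
    return [token_ids[: i + 1] for i in range(1, len(token_ids))]


# Made-up ids standing in for a four-word line
tokens = [4, 7, 2, 9]
sequences = prefix_sequences(tokens)
print(sequences)

# A one-word line yields no training sequences at all
print(prefix_sequences([5]))
```

A line with n tokens contributes n-1 sequences, which is why starting the loop at 1 drops the length-1 prefix.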


# Padding

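The n-gram sequences used as input must all have the same length, so shorter sequences are padded with zeros at the front (padding='pre'). Pre-padding keeps the label as the last element of every sequence.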

In [3]:
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

ngramSequencesList = np.array(pad_sequences(ngramSequencesList, maxlen=sequenceMaxLength, padding='pre'))
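
As a rough illustration of what pre-padding does, here is a plain-Python sketch. pad_pre is a hypothetical helper, not the Keras function, and it only imitates the behaviour relevant here: zeros added at the front, and overly long sequences cut from the front.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad each sequence to maxlen, keeping the last maxlen tokens if too long."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # drop tokens from the front when too long
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded


sequences = [[4, 7], [4, 7, 2], [4, 7, 2, 9]]
padded = pad_pre(sequences, maxlen=4)
for row in padded:
    print(row)
```

Because the zeros go in front, the final column of the padded array always holds the label token.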

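# Separating input from labels and one-hot encoding

The input and label need to be separated for every n-gram sequence: the last token is the label and everything before it is the input. Labels are one-hot encoded so that the model treats each word as a separate class rather than reading an ordering into the integer indices.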

In [4]:
import tensorflow.keras.utils as keras_utils

input, label = ngramSequencesList[:,:-1],ngramSequencesList[:,-1]
label = keras_utils.to_categorical(label, num_classes=wordDict)
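
The slicing and one-hot step can be reproduced with NumPy alone. This sketch uses a made-up padded array and an assumed vocabulary size of 10; np.eye is used here as a stand-in for to_categorical.

```python
import numpy as np

# Made-up padded sequences; the last column is the label
padded = np.array([[0, 0, 4, 7],
                   [0, 4, 7, 2],
                   [4, 7, 2, 9]])
vocab_size = 10  # assumed for illustration

features, labels = padded[:, :-1], padded[:, -1]

# Row k of the identity matrix is the one-hot vector for class k
one_hot = np.eye(vocab_size, dtype="float32")[labels]

print(features.shape, labels.tolist())
print(one_hot.shape)
```

The one-hot matrix has one column per vocabulary entry, so its size grows with the vocabulary. If that becomes a memory concern, Keras also provides a sparse_categorical_crossentropy loss that accepts integer labels directly.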

# Creating the model

The model contains two parts, an encoder and a decoder.

## Encoder

The encoder starts with an Embedding layer, which converts each token (represented as an integer) into a dense vector whose dimension is given by the second parameter (100 here). This lets the model capture the semantics and context of words, and it keeps each token's representation at a fixed size instead of growing with the vocabulary as a one-hot vector would.

A bidirectional LSTM layer is added next. The LSTM input shape must be compatible with the embedding output shape; since the LSTM is not the first layer, Keras infers this and we don't need to specify the input shape explicitly. return_sequences=True makes the layer return the full sequence of outputs, which the next LSTM layer needs as its input.

A dropout layer is added to reduce overfitting.

A unidirectional LSTM is added as the last encoder layer. Its output is fed into the decoder.

## Decoder

The decoder has two dense layers. The first dense layer receives the encoder output and applies a ReLU activation. An L2 kernel regularizer is used to reduce overfitting.

The last dense layer is the output layer for the predicted word. Softmax turns its outputs into a probability distribution over the vocabulary, and at prediction time the word with the highest probability is chosen.

The optimizer used is Adam, with categorical cross-entropy as the loss.
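
The softmax step of the output layer can be illustrated in a few lines of plain Python. The logits below are made up, and softmax is written out by hand here only to show the arithmetic; Keras applies it inside the layer.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)
    # Subtracting the maximum avoids overflow without changing the result
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [1.0, 3.0, 0.5]  # made-up scores for a three-word vocabulary
probs = softmax(logits)

# Picking the most probable class is a separate argmax step
predicted = max(range(len(probs)), key=probs.__getitem__)

print(probs)
print(predicted)
```

Softmax itself does not pick a word; it only normalizes the scores. The choice happens later, when the prediction loop takes the argmax.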

In [5]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, Bidirectional
from tensorflow.keras import regularizers


model = Sequential()
#------------------Encoder------------------------------------
model.add(Embedding(wordDict, 100, input_length=sequenceMaxLength-1))
model.add(Bidirectional(LSTM(150, return_sequences=True)))
model.add(Dropout(0.2))
model.add(LSTM(150))
#-------------------Decoder----------------------------------------------
model.add(Dense(wordDict, activation='relu', kernel_regularizer=regularizers.l2(0.0001)))
model.add(Dense(wordDict, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
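
To get a feel for where the weights of this architecture sit, the sketch below counts trainable parameters by hand using the standard formulas for Embedding, LSTM, and Dense layers. The vocabulary size of 1000 is an assumption for illustration (the real value depends on employee.txt), and lstm_params and dense_params are hypothetical helper names.

```python
def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """Weight matrix plus one bias per unit."""
    return input_dim * units + units


vocab_size = 1000  # assumed; stands in for wordDict
embed_dim, lstm_units = 100, 150

counts = {
    "embedding": vocab_size * embed_dim,
    "bidirectional_lstm": 2 * lstm_params(embed_dim, lstm_units),
    # The second LSTM sees the concatenated forward and backward outputs
    "lstm": lstm_params(2 * lstm_units, lstm_units),
    "dense_relu": dense_params(lstm_units, vocab_size),
    "dense_softmax": dense_params(vocab_size, vocab_size),
}

for name, n in counts.items():
    print(f"{name}: {n:,}")
print(f"total: {sum(counts.values()):,}")
```

Under this assumption the final dense layer, whose weight matrix is vocabulary size by vocabulary size, holds more than half of all parameters, so the width of the first dense layer is the main knob for model size.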


# Training

The model is trained for 50 epochs on the input sequences and their one-hot labels.

In [6]:
history = model.fit(input, label, epochs=50, verbose=1)

Epoch 1/50
...
Epoch 50/50


# Prediction

A prompt is given to the model as the starting input for predicting the next words. I predict up to 100 words here, so the final text is the prompt plus 100 generated words.

In [8]:
prompt = "Tell me Something about employee training"
predictionWordCount = 100

for _ in range(predictionWordCount):
    # Encode the running text; pad_sequences keeps the last sequenceMaxLength-1 tokens
    listOfTokens = tokenizer.texts_to_sequences([prompt])[0]
    listOfTokens = pad_sequences([listOfTokens], maxlen=sequenceMaxLength-1, padding='pre')
    predictedProbabilities = model.predict(listOfTokens, verbose=0)
    # Greedy choice: index of the most probable word
    predictedClass = int(np.argmax(predictedProbabilities, axis=-1)[0])
    # index_word is the reverse of word_index; index 0 (padding) maps to no word
    outWord = tokenizer.index_word.get(predictedClass, "")
    prompt += " " + outWord

print(prompt)

Tell me Something about employee training skills results from employees patient in working for one organization for a considerable amount of time it enables employees to appreciate the process of organizational growth they become patient to grow with it employee training enables workers to appreciate that the more one remains in an organization the less likely that he or she will turn over that are occasionally trained on various topics and issues gain the required skills for promotion and reward such employees will always be waiting for the next training programs in the employees who worked as office assistants during the manual era however those who
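
The structure of the generation loop (encode, predict, take the most probable index, map it back to a word, append) can be tried without a trained model. In this sketch fake_predict is a hypothetical stand-in for model.predict that simply favours the next id in the vocabulary, and the four-word vocabulary is made up.

```python
word_index = {"employee": 1, "training": 2, "builds": 3, "skills": 4}
# Reverse mapping, built once instead of searching word_index on every step
index_word = {i: w for w, i in word_index.items()}


def fake_predict(token_ids):
    """Hypothetical stand-in for model.predict: favours the id after the last token."""
    scores = [0.0] * (len(word_index) + 1)
    last = token_ids[-1] if token_ids else 0
    scores[last % len(word_index) + 1] = 1.0
    return scores


def generate(prompt, steps):
    """Greedy generation: always append the highest-scoring word."""
    words = prompt.lower().split()
    for _ in range(steps):
        token_ids = [word_index[w] for w in words if w in word_index]
        scores = fake_predict(token_ids)
        best = max(range(len(scores)), key=scores.__getitem__)
        words.append(index_word.get(best, ""))
    return " ".join(words)


result = generate("employee training", 2)
print(result)
```

Greedy selection is deterministic: the same prompt always produces the same continuation.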
