
#Building a Word Level Neural Language Model



In this paper I will build a neural language model from three of Shakespeare's most famous tragedies: Julius Caesar, Macbeth, and Hamlet.

##EDA and Data Cleaning

In [1]:
!pip install nltk
!pip install spacy
!python -m spacy download en

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✔ Linking successful
/usr/local/lib/python3.7/dist-packages/en_core_web_sm -->
/usr/local/lib/python3.7/dist-packages/spacy/data/en
You can now load the model via spacy.load('en')


In [2]:
import numpy as np
from numpy import array
import pandas as pd
import nltk
from nltk.corpus import gutenberg
import spacy
import re
import string
from random import randint
from pickle import dump, load
import tensorflow
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import LSTM, Dense, Embedding, BatchNormalization, Dropout
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical
import warnings
warnings.filterwarnings('ignore')

nltk.download('gutenberg')
!python -m spacy download en

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✔ Linking successful
/usr/local/lib/python3.7/dist-packages/en_core_web_sm -->
/usr/local/lib/python3.7/dist-packages/spacy/data/en
You can now load the model via spacy.load('en')


In [3]:
#we will use the Shakespearean tragedies. Download and concatenate the three into a single text object

hamlet = gutenberg.raw('shakespeare-hamlet.txt')
macbeth = gutenberg.raw('shakespeare-macbeth.txt')
caesar = gutenberg.raw('shakespeare-caesar.txt')

tragedies = caesar + macbeth + hamlet

In [4]:
# turn a doc into clean tokens
def clean_doc(doc):
	# replace '--' with a space ' '
	doc = doc.replace('--', ' ')
	# split into tokens by white space
	tokens = doc.split()
	# remove punctuation from each token
	table = str.maketrans('', '', string.punctuation)
	tokens = [w.translate(table) for w in tokens]
	# remove remaining tokens that are not alphabetic
	tokens = [word for word in tokens if word.isalpha()]
	# make lower case
	tokens = [word.lower() for word in tokens]
	return tokens

In [5]:
#clean our document and inspect the first few hundred tokens
tokens = clean_doc(tragedies)
print(tokens[:200])
print('total tokens: %d'% len(tokens))
print('unique tokens: %d' % len(set(tokens)))

['the', 'tragedie', 'of', 'julius', 'caesar', 'by', 'william', 'shakespeare', 'actus', 'primus', 'scoena', 'prima', 'enter', 'flauius', 'murellus', 'and', 'certaine', 'commoners', 'ouer', 'the', 'stage', 'flauius', 'hence', 'home', 'you', 'idle', 'creatures', 'get', 'you', 'home', 'is', 'this', 'a', 'holiday', 'what', 'know', 'you', 'not', 'being', 'mechanicall', 'you', 'ought', 'not', 'walke', 'vpon', 'a', 'labouring', 'day', 'without', 'the', 'signe', 'of', 'your', 'profession', 'speake', 'what', 'trade', 'art', 'thou', 'car', 'why', 'sir', 'a', 'carpenter', 'mur', 'where', 'is', 'thy', 'leather', 'apron', 'and', 'thy', 'rule', 'what', 'dost', 'thou', 'with', 'thy', 'best', 'apparrell', 'on', 'you', 'sir', 'what', 'trade', 'are', 'you', 'cobl', 'truely', 'sir', 'in', 'respect', 'of', 'a', 'fine', 'workman', 'i', 'am', 'but', 'as', 'you', 'would', 'say', 'a', 'cobler', 'mur', 'but', 'what', 'trade', 'art', 'thou', 'answer', 'me', 'directly', 'cob', 'a', 'trade', 'sir', 'that', 'i', 'h

In [6]:
#organize into sequences of tokens
length = 50 + 1
sequences = list()
for i in range(length, len(tokens)):
    #select sequence of tokens
    seq = tokens[i - length:i]
    #convert to a line
    line = ' '.join(seq)
    #store it
    sequences.append(line)
print('Total sequences: %d' % len(sequences))

Total sequences: 67546
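
To make the windowing step concrete, here is a minimal sketch on a toy token list. The helper name `make_windows` and the toy data are illustrative assumptions, not part of the notebook; the loop bounds mirror the cell above.

```python
def make_windows(tokens, length):
    """Return every run of `length` consecutive tokens, joined into one line each."""
    windows = []
    # Same bounds as the notebook loop: i is the exclusive end of each window.
    for i in range(length, len(tokens)):
        windows.append(' '.join(tokens[i - length:i]))
    return windows


toy_tokens = ['to', 'be', 'or', 'not', 'to', 'be']
# 2 input words + 1 target word per window
toy_windows = make_windows(toy_tokens, 3)
for line in toy_windows:
    print(line)
```

With these bounds the final window (`not to be` in the toy data) is never emitted; `range(length, len(tokens) + 1)` would include it. On the full corpus this costs only one sequence.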


In [7]:
#organize our tokens into a file
def save_doc(lines, filename):
    data = '\n'.join(lines)
    file = open(filename, 'w')
    file.write(data)
    file.close()

In [8]:
#save our sequences to a file
out_filename = 'tragedies_sequences.txt'
save_doc(sequences, out_filename)

In [9]:
#we will now load our doc into memory
def load_doc(filename):
    #open file as read only
    file = open(filename, 'r')
    #read all text
    text = file.read()
    #close file
    file.close()
    return text

#load
in_filename = 'tragedies_sequences.txt'
doc = load_doc(in_filename)
lines = doc.split('\n')

In [10]:
#now we can integer encode our sequences to use in our model
tokenizer = Tokenizer()
tokenizer.fit_on_texts(lines)
sequences = tokenizer.texts_to_sequences(lines)

#set vocabulary size
vocab_size = len(tokenizer.word_index) + 1
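
The `+ 1` in `vocab_size` is there because word indices start at 1 and index 0 is never assigned to a word. The following is a rough pure-Python sketch of that idea, not the Keras implementation (which also lowercases and filters punctuation); `build_word_index` and the toy lines are hypothetical.

```python
from collections import Counter


def build_word_index(lines):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter(word for line in lines for word in line.split())
    # most_common() sorts by descending count; index 0 is left unused
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


toy_lines = ['to be or not to be', 'to sleep']
word_index = build_word_index(toy_lines)
encoded = [[word_index[w] for w in line.split()] for line in toy_lines]
toy_vocab_size = len(word_index) + 1  # +1 because index 0 is never assigned to a word
print(word_index)
print(encoded)
print(toy_vocab_size)
```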

In [11]:
#separate our model inputs and outputs
sequences = np.array(sequences)
X, y = sequences[:,:-1], sequences[:,-1]
y = to_categorical(y, num_classes=vocab_size)
seq_length = X.shape[1]
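
As a small illustration of the split above, this pure-Python sketch separates each encoded row into its input words and a one-hot target. The function `split_and_one_hot` and the toy rows are assumptions made for illustration; the notebook itself uses NumPy slicing and `to_categorical`.

```python
def split_and_one_hot(sequences, num_classes):
    """Split each row into (all but last, last) and one-hot encode the last."""
    X, y = [], []
    for row in sequences:
        X.append(row[:-1])
        target = [0] * num_classes
        target[row[-1]] = 1  # the final word of the row is the prediction target
        y.append(target)
    return X, y


toy_sequences = [[1, 2, 3], [2, 3, 4]]
toy_X, toy_y = split_and_one_hot(toy_sequences, num_classes=5)
print(toy_X)
print(toy_y)
```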

##Build Model I

Now that we have cleaned and encoded our data, we are ready to build our model. After a 50-dimensional embedding layer, we will use two hidden LSTM layers with 100 memory cells each to start. A fully connected Dense layer with 100 neurons follows the LSTM layers to interpret the features extracted from the sequence. The output layer uses a softmax activation so that its outputs form a normalized probability distribution over the vocabulary. We will use categorical cross-entropy loss, since predicting the next word is technically a multi-class classification problem. The model is optimized with Adam, a variant of mini-batch gradient descent, and accuracy will be our metric. The model will train for 100 epochs with a batch size of 128 to start.
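
To make the softmax and cross-entropy choices concrete, here is a minimal pure-Python sketch on a three-word toy vocabulary. The function names and scores are illustrative assumptions; Keras computes the same quantities internally on tensors.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)
    # subtract the max before exponentiating for numerical stability
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(one_hot_target, probs):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probs))


toy_probs = softmax([2.0, 1.0, 0.1])
toy_loss = categorical_crossentropy([1, 0, 0], toy_probs)
print(toy_probs, toy_loss)
```

The loss shrinks toward zero as the probability assigned to the true next word approaches 1.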

In [12]:
#Define model
model = Sequential()
model.add(Embedding(vocab_size, 50, input_length=seq_length))
model.add(LSTM(100, return_sequences=True))
model.add(LSTM(100))
model.add(Dense(100, activation='relu'))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 50, 50)            391300    
_________________________________________________________________
lstm (LSTM)                  (None, 50, 100)           60400     
_________________________________________________________________
lstm_1 (LSTM)                (None, 100)               80400     
_________________________________________________________________
dense (Dense)                (None, 100)               10100     
_________________________________________________________________
dense_1 (Dense)              (None, 7826)              790426    
Total params: 1,332,626
Trainable params: 1,332,626
Non-trainable params: 0
_________________________________________________________________
None
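
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for embedding, LSTM, and dense layers; the helper names are hypothetical, and the sizes are read off the summary.

```python
def embedding_params(vocab, dim):
    # one dim-sized vector per vocabulary entry
    return vocab * dim


def lstm_params(input_dim, units):
    # four gates, each with input weights, recurrent weights, and a bias
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim, units):
    # weight matrix plus one bias per unit
    return input_dim * units + units


vocab, embed_dim, hidden = 7826, 50, 100
counts = [
    embedding_params(vocab, embed_dim),  # embedding
    lstm_params(embed_dim, hidden),      # lstm
    lstm_params(hidden, hidden),         # lstm_1
    dense_params(hidden, hidden),        # dense
    dense_params(hidden, vocab),         # dense_1
]
print(counts, sum(counts))
```

More than half of the parameters sit in the final softmax layer, because it needs one weight vector per vocabulary word.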


In [13]:
#compile our model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
#fit model
model.fit(X, y, batch_size=128, epochs=100)

Epoch 1/100
...
Epoch 78

<tensorflow.python.keras.callbacks.History at 0x7f0103661190>

In [14]:
#save model to file
model.save('model.g1')
#save the tokenizer 
dump(tokenizer, open('tokenizer.pkl', 'wb'))



INFO:tensorflow:Assets written to: model.g1/assets




##Load the Data and Model I

In [15]:
#load doc into memory
def load_doc(filename):
    #open file
    file = open(filename, 'r')
    #read text
    text = file.read()
    #close file
    file.close()
    return text

#load cleaned text sequences
in_filename = 'tragedies_sequences.txt'
doc = load_doc(in_filename)
lines = doc.split('\n')

#change sequence length
seq_length = len(lines[0].split()) - 1

#load model
model = load_model('model.g1')

#load tokenizer
tokenizer = load(open('tokenizer.pkl', 'rb'))

##Generate Text with Model I

In [16]:
#select a random seed text
#randint is inclusive at both ends, so the upper bound must be len(lines) - 1
seed_text = lines[randint(0, len(lines) - 1)]
print(seed_text)

queene would speak with you and presently ham do you see that clowd thats almost in shape like a camell polon byth masse and its like a camell indeed ham me thinkes it is like a weazell polon it is backd like a weazell ham or like a whale polon verie


In [17]:
#generate a sequence from a language model
def generate_seq(model, tokenizer, seq_length, seed_text, n_words):
    result = list()
    in_text = seed_text
    #generate a fixed number of words
    for _ in range(n_words):
        #encode text as integers
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        #truncate the sequence
        encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')
        #predict the index of the most probable next word
        #(predict_classes is not available in newer TensorFlow releases, so take the argmax of predict)
        yhat = int(np.argmax(model.predict(encoded, verbose=0), axis=-1)[0])
        #map predicted word index to its word
        out_word = tokenizer.index_word.get(yhat, '')
        #append to our input
        in_text += ' ' + out_word
        result.append(out_word)
    return ' '.join(result)
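
The `pad_sequences` call in `generate_seq` is what keeps the model's context at a fixed size: with `truncating='pre'` and the default pre-padding, only the most recent `seq_length` word ids are kept, and shorter inputs are left-padded with zeros. A pure-Python sketch of that behavior for a single sequence follows; the helper `pad_or_truncate_pre` is a hypothetical stand-in, not the Keras function.

```python
def pad_or_truncate_pre(encoded, maxlen):
    """Keep the last `maxlen` ids, left-padding with zeros if there are fewer."""
    kept = encoded[-maxlen:]
    return [0] * (maxlen - len(kept)) + kept


# A long input drops its oldest ids; a short one is padded on the left.
print(pad_or_truncate_pre([1, 2, 3, 4, 5], 3))
print(pad_or_truncate_pre([7], 3))
```

Each generated word is appended to the input, so the oldest word slides out of the window at every step.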

In [18]:
#let's generate some text
generated = generate_seq(model, tokenizer, seq_length, seed_text, 50)
print(generated)

friends ophe i am glad to see you by your health and shalt not presume but to me indifferent oath to signifie it i am i shall be bethinke me i had not quoted him i will not thinke and i shall finde them crownd in the pit of tyber


We can see the model ran for 100 epochs and gave us fairly unintelligible text. To improve its performance, we can try adding dropout or batch normalization, as well as training for more epochs.
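
For intuition about what a dropout layer does during training, here is a minimal sketch of inverted dropout in pure Python. The function, seed, and toy activations are illustrative assumptions; the Keras `Dropout` layer applies the same idea to tensors and is a no-op at inference time.

```python
import random


def dropout(values, rate, rng):
    """Zero each value with probability `rate`; scale survivors by 1 / (1 - rate)."""
    keep = 1.0 - rate
    return [v / keep if rng.random() >= rate else 0.0 for v in values]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
activations = [1.0] * 10
dropped = dropout(activations, rate=0.3, rng=rng)
print(dropped)
```

Scaling the surviving activations keeps their expected sum unchanged, so nothing needs to be rescaled when dropout is switched off for prediction.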

##Build Model II

In [19]:
model = Sequential()
model.add(Embedding(vocab_size, 50, input_length=seq_length))
model.add(LSTM(100, return_sequences=True))
model.add(LSTM(100))
model.add(Dropout(0.3))
model.add(BatchNormalization())
model.add(Dense(100, activation='relu'))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 50, 50)            391300    
_________________________________________________________________
lstm_2 (LSTM)                (None, 50, 100)           60400     
_________________________________________________________________
lstm_3 (LSTM)                (None, 100)               80400     
_________________________________________________________________
dropout (Dropout)            (None, 100)               0         
_________________________________________________________________
batch_normalization (BatchNo (None, 100)               400       
_________________________________________________________________
dense_2 (Dense)              (None, 100)               10100     
_________________________________________________________________
dense_3 (Dense)              (None, 7826)             

In [20]:
#compile new model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
#fit model
model.fit(X, y, batch_size=128, epochs=100)

Epoch 1/100
...
Epoch 78

<tensorflow.python.keras.callbacks.History at 0x7f004e6d4a90>

In [21]:
#Save model
model.save('model.g2')
#save the tokenizer
dump(tokenizer, open('tokenizer.pkl', 'wb'))



INFO:tensorflow:Assets written to: model.g2/assets




##Load Data and Model II

In [22]:
#load doc into memory
def load_doc(filename):
    #open file 
    file = open(filename, 'r')
    #read text
    text = file.read()
    #close file
    file.close()
    return text

#Load cleaned text
in_filename = 'tragedies_sequences.txt'
doc = load_doc(in_filename)
lines = doc.split('\n')

#change sequence length
seq_length = len(lines[0].split()) -1

#load model
model = load_model('model.g2')

#load tokenizer
tokenizer = load(open('tokenizer.pkl', 'rb'))

##Generate Text with Model II

In [23]:
#Start with a random seed text
#randint is inclusive at both ends, so the upper bound must be len(lines) - 1
seed_text = lines[randint(0, len(lines) - 1)]
print(seed_text)

tis very hot ham no beleeue mee tis very cold the winde is northerly osr it is indifferent cold my lord indeed ham mee thinkes it is very soultry and hot for my complexion osr exceedingly my lord it is very soultry as twere i cannot tell how but my lord


In [24]:
#call function to generate 10 words 
text_generate10 = generate_seq(model, tokenizer, seq_length, seed_text, 10)
print(text_generate10)

ham eene as i slewe my sad poysond bru and


We can see that our model wasn't half bad at producing semi-intelligible text this time around. It is also worth noting that as the number of generated words increases, sentence coherence decreases, with the model ending up just spewing words.