### Rodrigo González Linares

# Language Modeling: text generation

`LM.py` includes two classes, `ToyLM_ngram` and `ToyLM_LSTM`. The former implements an n-gram model, while the latter implements an Long Short-Term Memory (LSTM) network-based model; both for the purpose of text generation.  
First of all it is necessary to import the `LM.py`.

In [1]:
import LM as lm 

Metal device set to: Apple M2


2023-05-05 00:48:59.537145: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2023-05-05 00:48:59.537237: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)


## n-gram models 
### Queen of England 3-gram model
Next we need to create an object for the `ToyLM_ngram` class. As this first model will be trained on Queen Elizabeth II's speeches under a 3-gram schema, we will appropriately name it `lizzie_3gram`

In [2]:
lizzie_3gram = lm.ToyLM_ngram()

Before creating the model, we need to process the raw text from the speeches to get the train and test sets. This can be done using the `GetSentences` method.

In [3]:
trainSentences = lizzie_3gram.GetSentences('./materials/HerMajestySpeechesDataset/train.txt')
testSentences = lizzie_3gram.GetSentences('./materials/HerMajestySpeechesDataset/test.txt')

Once that is done we create the model, indicating the value of `n`, using `CreateModel`. We can also save the model with `Save` indicating the path where it should be saved. 

In [4]:
lizzie_3gram.CreateModel(trainSentences,3)
lizzie_3gram.Save('models/lizzie_3gram.pickle')

When this is done we can calculate the average perplexity for the test set. If a model produces a perplexity `x`, this roughly equates to the model being as confused as trying to choose from `x` random tokens. Then, the lower the perplexity the better. As a rule of thumb, if the vocabulary size is much larger than perplexity, then the model is performing well.  
There is a big caveat regarding perplexity and how the n-gram model was set up. If a given m-gram (where m=n-1) is not contained within the corpus, then the model will generate the next token piking the last token of any n-gram in the corpus, based on its total number of counts (i.e. the higher the count the most probable the last token of a particular n-gram will be picked as the next token to be generated). This is a palliative mechanism to avoid the model halting. If this happen and the ground truth label is not present in the corpus, with regards to perplexity these are deemed as "invalid tests". On the other hand, if the label is contained within a corpus these will be regarded as "valid tests". The perplexity is only calculated for valid tests, as the model would be "infinitely perplex" by tokens not present in the vocabulary.  

In [5]:
lizzie_3gram.Test(testSentences)

The average perplexity is:  8.099249
* Note: The percentage of valid tests is:  25.41532573714882 %


At this point we are ready to predict. We can do this using the `Predict` method, specifying the initial context and the maximum number of tokens to be generated. The generation will stop when either the maximum number of tokens have been generated (30 for all the examples in the notebook), or when an end-of-sentence symbol is reached. The dot (`.`) has been defined as the end-of-sentence symbol.  
During the text processing all words were uncapitalized; so all capitalized words will surely not be contained in any n-gram. It is therefore recommended that every word in the provided context is uncapitalized.  
We will start by predicting using only the most probable n-gram by setting `numberOfConsideredWords` to 1.

In [6]:
contexts = ['',
           'hello, i am the',
           'i love the commonwealth',
           'where are you philip',
           'hi there folks',
           'a gin for me',
           'ireland is',
           'the scottich',
           'i look at our country',
            'the houses of parliament']

for context in contexts:
    print(lizzie_3gram.Predict(context,maxLength=30,numberOfConsideredWords=1))
    print(' ')

  0 2 2 , 0 0 2 2 .
 
hello , i am the very first visit fifty - two years ago , and the commonwealth , and the commonwealth , and the commonwealth , and the commonwealth , and the commonwealth
 
i love the commonwealth , and the commonwealth , and the commonwealth , and the commonwealth , and the commonwealth , and the commonwealth , and the commonwealth , and the commonwealth
 
where are you philip 0 0 2 2 , and the commonwealth , and the commonwealth , and the commonwealth , and the commonwealth , and the commonwealth , and the commonwealth
 
hi there folks 0 0 2 2 .
 
a gin for me , this is a pleasure to be back in australia and all those who have been affected by yesterday ' s christmas broadcast 2 0 0 2 2
 
 ireland is 0 0 2 2 , when i was last in australia and britain stood side by side in two of the commonwealth , and the commonwealth , and
 
 the scottich 0 0 2 2 , when i addressed the theme of the commonwealth , and the commonwealth , and the commonwealth , and the commonwealt

Next we use all the available n-grams for prediction based on its probability distribution (which is the default).  

In [7]:
for context in contexts:
    print(lizzie_3gram.Predict(context,maxLength=30))
    print(' ')

  gentlemen thank of on world history .
 
hello , i am the very welcome progress in our national character .
 
i love the commonwealth - canada , 1 1 march 2 0 0 2 2 is not a year of the diversity that has necessarily kept people apart has , quite simply
 
where are you philip more , a vivid passage in her diary the first time , and cambridge have all been affected continue to encourage our young people - safe , environmentally
 
hi there folks games coronation golden it from states in from its citizens from every background and experience .
 
a gin for me this year , and saint paul reminded parents to be done , soon be reunited with their families .
 
 ireland is , of which you fulfil .
 
 the scottich ' members our that of your homeland as the regiment during those years ago , he lost friends in the northern irish countryside .
 
i look at our country .
 
the houses of parliament .
 


Finally, we will set a cutoff and only consider the 50 top n-grams, and then make a choice, again, given their probability distribution.

In [8]:
for context in contexts:
    print(lizzie_3gram.Predict(context,maxLength=30,numberOfConsideredWords=50))
    print(' ')

  i will open my home .
 
hello , i am the very first broadcast i made my first christmas broadcast 1 9 4 0 0 2 2 , and now the commonwealth , as we continue to offer my
 
i love the commonwealth treasures and respects this wealth of natural resources with greater care , professionalism and sensitivity often in places that are now among the members of my accession with
 
where are you philip and me a right and proper way of describing this parliament to give care to those who died in the structure of society .
 
hi there folks you who work for them .
 
a gin for me .
 
 ireland is to you all .
 
 the scottich pleased , the queen ' s bombings in london from persecution .
 
i look at our country .
 
the houses of parliament democracy ' 2 is not only follow these developments more closely than ever .
 


If we want to use the same model again, we can simply create a new `ToyLM_ngram` object and load it. We can start generating text right away.

In [9]:
new_lizzi_3gram = lm.ToyLM_ngram()

new_lizzi_3gram.Load('models/lizzie_3gram.pickle')

new_lizzi_3gram.Predict('the queen is back',maxLength=30)

'the queen is back africa samaritan it have 5 when and our country is immune from these dangers and we see your principles being put into context the invaluable public and voluntary'

### Queen of England 4-gram model
We will now try the same with a 4-gram model this time.

In [10]:
lizzie_4gram = lm.ToyLM_ngram()

trainSentences = lizzie_4gram.GetSentences('./materials/HerMajestySpeechesDataset/train.txt')
testSentences = lizzie_4gram.GetSentences('./materials/HerMajestySpeechesDataset/test.txt')

lizzie_4gram.CreateModel(trainSentences,4)
lizzie_4gram.Save('models/lizzie_4gram.pickle')

In [11]:
lizzie_4gram.Test(testSentences)

The average perplexity is:  17.745726
* Note: The percentage of valid tests is:  12.400072674418606 %


In [12]:
contexts = ['',
           'hello, i am the',
           'i love the commonwealth',
           'where are you philip',
           'hi there folks',
           'a gin for me',
           'ireland is',
           'the scottich',
           'i look at our country',
            'the houses of parliament']

for context in contexts:
    print(lizzie_4gram.Predict(context,maxLength=30,numberOfConsideredWords=1))
    print(' ')

  i to i i i i i i i i i i i i i i i i i i i i i i i i i i i i
 
hello , i am the very i i i i i i i i i i i i i i i i i i i i i i i i i i
 
i love the commonwealth i i i i i i i i i i i i i i i i i i i i i i i i i i i
 
where are you philip i i i i i i i i i i i i i i i i i i i i i i i i i i i
 
 hi there folks i i i i i i i i i i i i i i i i i i i i i i i i i i i
 
a gin for me i i i i i i i i i i i i i i i i i i i i i i i i i i i
 
 ireland is i i i i i i i i i i i i i i i i i i i i i i i i i i i i
 
 the scottich i i i i i i i i i i i i i i i i i i i i i i i i i i i i
 
i look at our country i i i i i i i i i i i i i i i i i i i i i i i i i i i
 
the houses of parliament , 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 


In [13]:
for context in contexts:
    print(lizzie_4gram.Predict(context,maxLength=30))
    print(' ')

  which bind all world us , is the date of the wedding anniversary of my accession is giving so many people , adults and children , and children to appreciate
 
hello , i am the very , timeless there welcomed at , has are do 2 - who society have have have in and ignored ’ at festival broadcasting than .
 
i love the commonwealth ' s leaders , as evident in australia last week ; and to share in the ideals of this unique gathering of nations , to celebrate an
 
where are you philip years global reply and .
 
 hi there folks them great mark of , westminster laid some region young experience in i everywhere memories proud meeting building even a remarkable effort to the pounds tragedy ,
 
a gin for me to see so many cadets from the commonwealth and to all those who lost their lives , and the commonwealth and around the world , australians are
 
 ireland is of grateful the rise determine to family , of friendship , of language and education , of the continuity of our national spirit ; and i

In [14]:
for context in contexts:
    print(lizzie_4gram.Predict(context,maxLength=30,numberOfConsideredWords=50))
    print(' ')

  s , i kingdom i , have , , , , for s to , , 0 9 0 been been 4 force s i you wales wales , you
 
hello , i am the very 4 0 gentlemen 2 s queen have gentlemen to as kingdom 9 to 0 0 wales are government gentlemen your been gentlemen the 9 i of
 
i love the commonwealth s 9 wales to 9 4 your 9 2 s queen first for the been wales and 2 9 queen the 2 9 9 march 2 0
 
where are you philip parliament i gentlemen of 2 the are meeting , 2 force have christmas i as 4 , to to 2 century have 0 gentlemen wales .
 
 hi there folks to meeting 2 are , , 0 s , of force 4 , thank i 9 wales s as you for have me of thank , 2
 
a gin for me wales to , 2 the gentlemen 9 and to the courage of those who have been affected by events in afghanistan and saddened by the casualties suffered
 
 ireland is of to , thank 9 s 0 thank of been of 2 for queen kingdom your 0 s 0 2 you to gentlemen i century scottish s s
 
 the scottich 5 century 9 , 2 me , for 2 the s been your 2 i and and , , 9 wales 9 4 i 2 to gentlemen

### James Joyce 3-gram model

We will now hop over the other side of the Irish sea, and create a model that speaks like James Joyce by training on Dubliners (available at: https://www.gutenberg.org/ebooks/2814).  
After downloading the plain text version and saving as `dubliners.txt`, and extracting the sentences, we should create our own train, and test sets; assigning 80% of the sentences to the train set, and 20% to the test set.

In [15]:
joyce_3gram = lm.ToyLM_ngram()

sentences = joyce_3gram.GetSentences('./materials/dubliners.txt')
sentences = [elem for elem in sentences if elem.strip() != ''] # Eliminate empty elements

import random

random.shuffle(sentences)

trainSentences = sentences[:int(len(sentences)*0.8)]
testSentences = sentences[int(len(sentences)*0.8):]

Once that is done, it is business as usual.

In [16]:
joyce_3gram.CreateModel(trainSentences,3)
joyce_3gram.Save('models/joyce_3gram.pickle')

joyce_3gram.Test(testSentences)

The average perplexity is:  9.016742
* Note: The percentage of valid tests is:  24.222155418608146 %


In [17]:
contexts = ['',
           'hello, i am the',
           'i love the dublin',
           'where are you father flynn',
           'hi there folks',
           'a ginger beer for me',
           'ireland is',
           'the english',
           'the laws of the country',
            'they walked along']

for context in contexts:
    print(joyce_3gram.Predict(context,maxLength=30,numberOfConsideredWords=1))
    print(' ')

  mr ’ mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr
 
hello , i am the mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr
 
i love the dublin musical world .” mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr
 
where are you father flynn .” mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr
 
hi there folks mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr
 
a ginger beer for me ,” said mr cunningham .
 
 ireland is mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr
 
 the english ’ mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr
 
the laws of the country .
 
they walked along the shaft of grey light : mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr mr
 


In [18]:
for context in contexts:
    print(joyce_3gram.Predict(context,maxLength=30))
    print(' ')

  whom paid the driver of the project gutenberg - tm , to make an .
 
hello , i am the the both the you to break out again a few minutes the women out of the car .
 
i love the dublin musical world .” accepted ivors of men and went towards her nephew ’ s eve .
 
where are you father flynn .” and ever the woe she a later the the front of her .” purposed waiting to be thought that in her eyes he would never see again
 
hi there folks joy old the himself world and was about to knit his brows and , when she spoke to me , gabriel himself had taken up beyond the river
 
a ginger beer for me .
 
 ireland is said music henchy !” three to in the united states and three ladies , with a soft wet substance with her ) do anxiously studied project .
 
 the english ’ , has many to she .
 
the laws of the country where you are , crofton ?” cried mr kernan sensibly , what an end to herself , her aunts what or show the difference between us ,” said
 
they walked along the route , and a minute or so in t

In [19]:
for context in contexts:
    print(joyce_3gram.Predict(context,maxLength=30,numberOfConsideredWords=50))
    print(' ')

  aunt ,” the , mr conroy ?” asked little chandler , “ each in his own , insisting on the following sentence , but covertly proud of their friends ,
 
hello , i am the a s ’ the connor “ t arcy mr mr the ll s ’ mr had had the night air .
 
i love the dublin musical world .” s been drinking since friday .” a know - s mr browne got into a society , mr conroy ?” asked mr o ’ clock
 
where are you father flynn .” tm mr a , tm kate ’ s a stroke cunningham e ’ was a decent sort so long as anyone could remember the way you chaps
 
hi there folks a ’ m right ,” said mr o ’ clock from mr ryan .
 
a ginger beer for me ,” he said coldly .
 
 ireland is ’ tm said a i connor ’ cunningham tm was s was ’ ou frightened , love ?... there mr connor ’ t you remind him ?” said
 
 the english ’ cunningham arcy had s was man , who was regarding us with offers to donate royalties under this agreement , the big hat who had come and
 
the laws of the country .
 
they walked along nassau street and walked up to

## LSTM models

### Queen of England 4-size window LSTM model

To work with an LSTM model, we first need to create a `ToyLM_LSTM` object, and extract the train, validation and test sets as we have done before.

In [24]:
lizzie_4LSTM = lm.ToyLM_LSTM()

trainSentences = lizzie_4LSTM.GetSentences('./materials/HerMajestySpeechesDataset/train.txt')
valSentences = lizzie_4LSTM.GetSentences('./materials/HerMajestySpeechesDataset/dev.txt')
testSentences = lizzie_4LSTM.GetSentences('./materials/HerMajestySpeechesDataset/test.txt')

Next we need to process the sentences using the `ProcessData` method. By default this method creates windows of size 4. We also need to set `trainSet` to `True` when processing the corresponding set, as the method will construct the tokenizer only with the test set. When set to `True`, the tokenizer will not only be created but also saved, so we need to specify a path for it.

In [25]:
xTrain, yTrain = lizzie_4LSTM.ProcessData(trainSentences,trainSet=True,tokenizerPath='models/lizzie_4LSTM.pickle')
xVal, yVal = lizzie_4LSTM.ProcessData(valSentences)
xTest, yTest = lizzie_4LSTM.ProcessData(testSentences)

We can now create the model. By default the embedding size is calculated using an heuristic, 64 LSTM units are used, and the dropout and recurrent dropout rates are set to 0.2 to avoid over fitting. Moreover the a model summary is not printed, and the vocabulary size is set to the total number of tokens in the corpus. Any of these parameters can be modified though.  

The `Train` method should receive the train features and labels, but can receive additional parameters like a validation tuple, number of epochs, optimizer, metrics, batch size, and metrics.  
Because an early-stopping callback has been set for the training loop so that only the best model (validation loss-wise) is saved, we need to specify a path to save the model if a validation tuple was provided and `saveBest` was not set to `False`. Related to this, the `patience` parameter is set to 5 by default, but it can be modified. The training process is halted when the validation loss has not improved in the number of cycles specified by this parameter. 

In [26]:
lizzie_4LSTM.CreateModel(printSummary=True)
lizzie_4LSTM.Train(xTrain, yTrain, valTuple=(xVal, yVal), epochs=50, modelName='models/lizzie_4LSTM.h5')

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 4)]               0         
                                                                 
 embedding (Embedding)       (None, 4, 116)            612828    
                                                                 
 lstm (LSTM)                 (None, 64)                46336     
                                                                 
 dense (Dense)               (None, 5283)              343395    
                                                                 
Total params: 1,002,559
Trainable params: 1,002,559
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/50


2023-05-04 20:37:12.976474: W tensorflow/core/platform/profile_utils/cpu_utils.cc:128] Failed to get CPU frequency: 0 Hz
2023-05-04 20:37:14.114653: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.




2023-05-04 20:39:20.566546: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.


Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50


Next we calculate the perplexity by calling `Test`. It is important to mention that these LSTM models do not suffer from the aforementioned problems regarding the calculation of perplexity as n-gram models do.

In [27]:
lizzie_4LSTM.Test(xTest,yTest)

The average perplexity is:  359.27008056640625


We can finally generate texts, first grabbing the most probable word (i.e. setting the number of considered words to 1). Heads-up, she really seems to like the commonwealth.

In [28]:
contexts = ['',
           'hello, i am the',
           'i love the commonwealth',
           'where are you philip',
           'hi there folks',
           'a gin for me',
           'ireland is',
           'the scottich',
           'i look at our country',
            'the houses of parliament']

for context in contexts:
    print(lizzie_4LSTM.Predict(context,maxLength=30,numberOfConsideredWords=1))
    print(' ')

2023-05-04 20:57:38.687546: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
2023-05-04 20:57:38.912399: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.


i am pleased to be able to be able to be able to be able to be able to be able to be able to be able to be able
 
hello, i am the pleased to be able to be able to be able to be able to be able to be able to be able to be able to be able to be
 
i love the commonwealth commonwealth , and i am pleased to be able to be able to be able to be able to be able to be able to be able to be able
 
where are you philip have been able to be in the commonwealth , and i am pleased to be able to be able to be able to be able to be able to be
 
hi there folks is a great pleasure for the commonwealth , and i am pleased to be able to be able to be able to be able to be able to be able
 
a gin for me the commonwealth , and i am pleased to be able to be able to be able to be able to be able to be able to be able to be
 
ireland is premier , i am pleased to be able to be able to be able to be able to be able to be able to be able to be able to
 
the scottich queen ' s commonwealth day message , the commonw

Next grabbing any word based on a probability distribution (which is the default).

In [29]:
for context in contexts:
    print(lizzie_4LSTM.Predict(context,maxLength=30))
    print(' ')

edinburgh was a strong industry of vice democracy , elsewhere , dynamic commitment and work down that we are turn forward on the frauenkirche overseas in education , the economic
 
hello, i am the to know that your newspapers is a new canadian mounted illness .
 
i love the commonwealth w .
 
where are you philip share with their famous singer .
 
hi there folks .
 
a gin for me life was my grandchildren because of your lord - and staying .
 
ireland is drives this year will county anxiously - neither regiment we saw , it in canada , general 5 .
 
the scottich course of your two twentieth time also then that are serving up and renewal .
 
i look at our country two and welsh opposition of your national assembly or in the generosity and admirers of overseas has not done many of the air world represented ; what today has been
 
the houses of parliament my creative nations are intensely wanted in my families , in london and bewilderment , i would particularly be for struck all how your ele

And finally, again based in a probability distribution, but taking into account only the 50 most probable tokens. 

In [30]:
for context in contexts:
    print(lizzie_4LSTM.Predict(context,maxLength=30,numberOfConsideredWords=50))
    print(' ')

this is so many hundred years for australia , i know it is no difficult for we celebrate in that times - of new ways such the future .
 
hello, i am the speaking to that there are no opportunity to be together in the wider world .
 
i love the commonwealth year i have travelled back over the world , is as a toast to me their own energy , and care and your family come here , as a deep
 
where are you philip are often about doing all their best - bearers to a rich city whose prairies it , as they are very best to play and those who have died before
 
hi there folks are great history , and is to the warmth and you are able to offer around people in both - a time to achieve a pleasure for our good ,
 
a gin for me the nation at the new city , and those from its spirit .
 
ireland is premier and i remember the commonwealth , and the past is now only to help the commonwealth , and i have continued so many year will be a time of
 
the scottich queen and , my country and the united kingdom hav

As with the n-gram models, we can load the model. This time however, we need to also specify the path for the tokenizer. 

In [31]:
new_lizzie_4LSTM = lm.ToyLM_LSTM()

new_lizzie_4LSTM.Load('models/lizzie_4LSTM.h5','models/lizzie_4LSTM.pickle')

new_lizzie_4LSTM.Predict('the queen is back',maxLength=30,numberOfConsideredWords=50)



2023-05-04 20:57:59.798127: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
2023-05-04 20:58:00.007270: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.


'the queen is back to the commonwealth , and all have already only too much to find british ways our reputation to meet us on and us in a very special assembly for these'

### Queen of England 6-size window LSTM model

We will repeat the same procedure but this time using a windows of size 6.

In [32]:
lizzie_6LSTM = lizzie = lm.ToyLM_LSTM()

trainSentences = lizzie_6LSTM.GetSentences('./materials/HerMajestySpeechesDataset/train.txt')
valSentences = lizzie_6LSTM.GetSentences('./materials/HerMajestySpeechesDataset/dev.txt')
testSentences = lizzie_6LSTM.GetSentences('./materials/HerMajestySpeechesDataset/test.txt')

xTrain, yTrain = lizzie_6LSTM.ProcessData(trainSentences,trainSet=True,tokenizerPath='models/lizzie_6LSTM.pickle')
xVal, yVal = lizzie_6LSTM.ProcessData(valSentences)
xTest, yTest = lizzie_6LSTM.ProcessData(testSentences)

lizzie_6LSTM.CreateModel(printSummary=True)
lizzie_6LSTM.Train(xTrain, yTrain, valTuple=(xVal, yVal), epochs=50, modelName='models/lizzie_6LSTM.h5')

Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_2 (InputLayer)        [(None, 4)]               0         
                                                                 
 embedding_1 (Embedding)     (None, 4, 116)            612828    
                                                                 
 lstm_1 (LSTM)               (None, 64)                46336     
                                                                 
 dense_1 (Dense)             (None, 5283)              343395    
                                                                 
Total params: 1,002,559
Trainable params: 1,002,559
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/50


2023-05-04 20:58:02.294337: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.




2023-05-04 21:00:19.460720: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.


Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50


In [33]:
lizzie_6LSTM.Test(xTest,yTest)

The average perplexity is:  349.2349853515625


In [34]:
contexts = ['',
           'hello, i am the',
           'i love the commonwealth',
           'where are you philip',
           'hi there folks',
           'a gin for me',
           'ireland is',
           'the scottich',
           'i look at our country',
            'the houses of parliament']

for context in contexts:
    print(lizzie_6LSTM.Predict(context,maxLength=30,numberOfConsideredWords=1))
    print(' ')

2023-05-04 21:16:39.200176: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
2023-05-04 21:16:39.410896: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.


i am pleased to be a place to be able to be able to be able to be able to be able to be able to be able to be
 
hello, i am the pleased to be a place to be able to be able to be able to be able to be able to be able to be able to be able to
 
i love the commonwealth commonwealth , and the queen ' s commonwealth day message , 2 0 0 2 2 , i have been struck by the queen ' s commonwealth day message ,
 
where are you philip have been able to be able to be able to be able to be able to be able to be able to be able to be able to be able
 
hi there folks is a pleasure to be able to be able to be able to be able to be able to be able to be able to be able to be able
 
a gin for me the world .
 
ireland is , the queen ' s commonwealth day message , 2 0 0 2 2 , i have been struck by the queen ' s commonwealth day message , 2 0
 
the scottich queen ' s commonwealth day message , 2 0 0 2 2 , i have been struck by the queen ' s commonwealth day message , 2 0 0 2
 
i look at our country own lives 

In [35]:
for context in contexts:
    print(lizzie_6LSTM.Predict(context,maxLength=30))
    print(' ')

deep atlantic stage .
 
hello, i am the delighted to be back made before on gone in london , and down our life on the royal jubilee horribilis over the birth fields .
 
i love the commonwealth old play given us they all here today .
 
where are you philip are very been here in ’, during the year like developing its whilst energy to western support .
 
hi there folks is no doubt that all those and theory .
 
a gin for me the world , service to do just with our two ways with urgent glorious levels at the highly twentieth century , i would like for these and british powers to
 
ireland is , the we draw across your communities , and welcoming meetings of the peace will emerge for the rock of carnivals throughout the years .
 
the scottich scottish parliament , 3 3 .
 
i look at our country first countries , whose guests has blessed on become friendship and bear .
 
the houses of parliament my two and recent freedom between 1 9 4 4 4 , was loved , 2 0 0 , runs , for your peace .
 


In [36]:
for context in contexts:
    print(lizzie_6LSTM.Predict(context,maxLength=30,numberOfConsideredWords=50))
    print(' ')

last year we have been encouraged by the first century .
 
hello, i am the delighted to the more fundamental century .
 
i love the commonwealth more fortune that the scottish parliament – the opening of this great advances of london , has been doing only at the new welcome state years , and my present
 
where are you philip , mr .
 
hi there folks and the first who will find to make british investment to see , to their accession , but all too much for us , for me and a shared future
 
a gin for me many people in this day .
 
ireland is will be a reminder of us all .
 
the scottich people of this country ' s message as a strong in great care and prosperity to ireland to the netherlands , our last fifty years ago .
 
i look at our country young and old , in a new building to the future , you continue to celebrate the world , and i shall be seen , i would like to mark
 
the houses of parliament the commonwealth , but not me for so much to the benefit of the challenges .
 


### James Joyce 6-size window LSTM model

And finally, it is now time to train an LSTM models with Dubliners.

In [44]:
joyce_6LSTM = lizzie = lm.ToyLM_LSTM()

sentences = joyce_6LSTM.GetSentences('./materials/dubliners.txt')
sentences = [elem for elem in sentences if elem.strip() != ''] # Eliminate empty elements

import random
random.shuffle(sentences)

trainSentences = sentences[:int(len(sentences)*0.7)]
valSentences = sentences[int(len(sentences)*0.7):int(len(sentences)*0.9)]
testSentences = sentences[:int(len(sentences)*0.9)]

xTrain, yTrain = joyce_6LSTM.ProcessData(trainSentences,trainSet=True,tokenizerPath='models/joyce_6LSTM.pickle')
xVal, yVal = joyce_6LSTM.ProcessData(valSentences)
xTest, yTest = lizzie_6LSTM.ProcessData(testSentences)

joyce_6LSTM.CreateModel(printSummary=True)
joyce_6LSTM.Train(xTrain, yTrain, valTuple=(xVal, yVal), epochs=50, modelName='models/joyce_6LSTM.h5')

Model: "model_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_4 (InputLayer)        [(None, 4)]               0         
                                                                 
 embedding_3 (Embedding)     (None, 4, 127)            806450    
                                                                 
 lstm_3 (LSTM)               (None, 64)                49152     
                                                                 
 dense_3 (Dense)             (None, 6350)              412750    
                                                                 
Total params: 1,268,352
Trainable params: 1,268,352
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/50


2023-05-05 00:27:34.275815: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.




2023-05-05 00:29:42.692154: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.


Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50


In [45]:
joyce_6LSTM.Test(xTest,yTest)

The average perplexity is:  43428.453125


In [46]:
contexts = ['',
           'hello, i am the',
           'i love the dublin',
           'where are you father flynn',
           'hi there folks',
           'a ginger beer for me',
           'ireland is',
           'the english',
           'the laws of the country',
            'they walked along']

for context in contexts:
    print(joyce_6LSTM.Predict(context,maxLength=30,numberOfConsideredWords=1))
    print(' ')

2023-05-05 00:45:29.203105: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
2023-05-05 00:45:29.449740: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.


“ i ’ m not going to be a little man .
 
hello, i am the not much at him .
 
i love the dublin registered and , as if he had been a bit of the room .
 
where are you father flynn ?” said mr o ’ connor .
 
hi there folks was a great sum with him .
 
a ginger beer for me the room .
 
ireland is of the room .
 
the english man had grown out of the room .
 
the laws of the country room .
 
they walked along into the room .
 


In [47]:
for context in contexts:
    print(joyce_6LSTM.Predict(context,maxLength=30))
    print(' ')

jerry pride in which pushed back and told me for a young body was joy into clouds .
 
hello, i am the not is signs to him .
 
i love the dublin chief concert .
 
where are you father flynn heard .
 
hi there folks he received a rebuke hat and stood two bodies dog or far or set on dublin but joe is standing at the fire at him preserve and read .
 
a ginger beer for me him in all together .
 
ireland is german .
 
the english lady waited in the window .
 
the laws of the country house reward it won ’ t indeed ,” she wrote never always like his own right in the lion , because he say that , as he might kathleen once
 
they walked along again on the victim - sliding .
 


In [48]:
for context in contexts:
    print(joyce_6LSTM.Predict(context,maxLength=30,numberOfConsideredWords=50))
    print(' ')

“ yes !” said the man was to say that the way .
 
hello, i am the ?” _ mr o ’ connor again .
 
i love the dublin prelude .” but the car , shining with the river .
 
where are you father flynn mayor , take all the night of mrs mooney , “ it ’ s daughter .” when the invalid of aughrim_ : _ “ i ’ m my copy ,
 
hi there folks were and called from up and corley ’ s head .
 
a ginger beer for me a moment ’ s works .
 
ireland is they was in the counter and seemed to take the window in the man , which she gave their pen or in a moment from her out of a dairyman
 
the english hall - chair while the room .
 
the laws of the country street .
 
they walked along as after his companions , including getting out in the other door .
 


## Conclusions 

Although of course the quality of the n-gram and LSTM-based toy models in not comparable to a state the art large language model trained on a huge corpus, and using supervised and reinforcement learning like chatGPT, they at least do quite a good job at capturing the writing style of the set they where trained on.  
They also managed to produce some fully coherent (albeit short) phrases like `the english lady waited in the window .` (James Joyce 6-size window LSTM model, with 50 words considered).  
A few longer phrases are also relatively coherent, for example: `a gin for me to see so many cadets from the commonwealth and to all those who lost their lives , and the commonwealth and around the world , australians are`, taken from the queen of England 4-gram model, and considering all words.  
The models also managed to capture or learn (in the case of LSTM models) things like digits generally appearing one after another (as in `i love the commonwealth - canada , 1 1 march 2 0 0 2 2 is not a year of the diversity that has necessarily kept people apart has , quite simply`, for the queen of England 3-gram model considering all words in the vocabulary), or to add question marks when prompted with a context structured like a question; for example, when prompted with `where are you father flynn` the James Joyce 6-size window LSTM model with all word considered, produced: `where are you father flynn ?” said mr o ’ connor .`   
All models produced poor results very lacking in creativity when the taking only the most probable token for generation. Results got much better when picking among the 50 most probable, or from all tokens within the vocabulary.  
All in all the models produced some interesting phrases like `ireland is german .`, `the scottich pleased , the queen ' s bombings in london from persecution .` or `they walked along nassau street and walked up to dublin and holland ; and , carrying them to the man ’ s , and thanks ever so much the better for`. 

Regarding the differences between the n-gram and LSTM models, the n-gram models are very quick to create as, unlike LSTMs, they do not require training. On the other hand, neural network-based models have the potential to capture much more sophisticated language patterns.   
Perplexity-wise, when this metric could actually be computed, n-gram models scored significantly better than LSTMs ones. As mentioned before however, the n-gram models are completely perplexed a good amount of the times. The first n-gram model constructed in the notebook, for example, got a perplexity of around 8 for 25% of the tests, while being utterly perplexed for around 75% of the tests. In contrast, the LSTM model scored an average perplexity of around 350 across the board; significantly higher, but more consistent.  
The James Joyce 6-size window LSTM model scored by far the highest perplexity of close to 43,500. This could be explained by its significantly bigger vocabulary size when compared to that of the queen's models. Despite this, the model managed to create results of comparable quality to other models, highlighting the importance of not taking perplexity as the sole metric for judging the performance of a language model. It should be regarded contextually, taking into account the inherit complexity of the corpus and model.  

Qualitatively, all models produced results of similar caliber, and it would be hard to tell with which model trained on the same corpus a particular sentence was created. 