John Nipp

Project 4 from CPSC585, 7/18/22
https://docs.google.com/document/d/1McdbIfDMYbumtCDITisMAtppJGlJostgjufOnBA5lhw/edit

The first thing to do is to familiarize ourselves with the character-level RNN text-generation model, so that we have a better sense of how to prepare our text corpus.

The emphasis of this project is that the text is generated character by character. I am going to make a text generator that outputs Lao Tzu quotes based on the Tao Te Ching. I have an online resource with over 20 English translations of each of the 81 sections of the Tao Te Ching. I'm still considering whether to use multiple translations, and if so, how many.

I've found that there are models that do text generation for slogans, trained on only 600 slogans. With that in mind, it might be safe to try a model that uses only one translation of the Tao Te Ching. This saves me from researching how to handle multiple translations, and from worrying about how word flow might differ between translations.

I am using Stephen Mitchell's translation of the Tao Te Ching. Line spacing is unaltered, and the number of each section is still labeled above that section. Each section is separated from the next by one line.

I'm going to use Francois Chollet's code and modify it for the sake of this experiment! It is built to work for models that have less data to train on.

In [None]:
from tensorflow import keras
from tensorflow.keras import layers

import numpy as np
import random
import io

from google.colab import files
!cp /content/drive/MyDrive/Colab\ Notebooks/TaoTeChingStephenMitchell.txt /content/

In [None]:
import os
fullPath = os.path.abspath("./" + "TaoTeChingStephenMitchell.txt")
data_for_processing = keras.utils.get_file('TaoTeChingStephenMitchell.txt', 
                                           'file://'+fullPath)

Downloading data from file:///content/TaoTeChingStephenMitchell.txt


That was the setup. Now for some preparation. I added comments to explain the code.

In [None]:
# Read the entire file into a single string, converted to lower case.
with io.open(data_for_processing, encoding="utf-8") as f:
    text = f.read().lower()

# Replace every newline character with a space, for nicer display.
text = text.replace("\n", " ")
# Report how many characters the corpus contains.
print("Corpus length:", len(text))

# Collect the distinct characters in the corpus. Their count sets the size of
# the model's final layer.
chars = sorted(list(set(text)))
print("Total chars:", len(chars))

# Build a dictionary that maps each character to its position in the sorted
# list above (chars), and a second one for the reverse lookup. Compact indices
# should save time and space compared with raw ASCII values: there are no
# large gaps between one character and the next, so there is no need to
# worry about how such gaps would show up in next-character generation.
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

# cut the text in semi-redundant sequences of maxlen characters
maxlen = 40
step = 3
sentences = []
next_chars = []

# Create the x and y relationship. Each 40-character slice of the text goes
# into the sentences list, and the character that immediately follows that
# slice goes into next_chars.
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i : i + maxlen])
    next_chars.append(text[i + maxlen])
print("Number of sequences:", len(sentences))

# One-hot encode the data. x is a 3D array: one row per sequence, one column
# per character position, and a one-hot vector over chars at each position.
# y is a 2D array holding the one-hot encoding of the character that follows
# each sequence.
# Note: np.bool was deprecated in NumPy 1.20 and later removed; the builtin
# bool is the equivalent dtype.
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1

Corpus length: 36045
Total chars: 46
Number of sequences: 12002


Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations


The corpus has 36,045 characters. This may pose an issue later, since Chollet recommends at least ~100k characters, and ~1MB for better performance. For our first try, we are going to use our current data and see if we can get good text generation by tweaking the model we already have.

Now we build a single-layer LSTM model. LSTM stands for Long Short-Term Memory. It is a type of RNN (Recurrent Neural Network).
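
To make the link between corpus length and the number of training sequences concrete, here is a small sketch of the windowing arithmetic. The helper names `count_windows` and `make_windows` and the toy string are my own for illustration; the loop itself mirrors the preparation code.

```python
def count_windows(corpus_length, maxlen=40, step=3):
    """Number of start positions visited by range(0, corpus_length - maxlen, step)."""
    return len(range(0, corpus_length - maxlen, step))


def make_windows(text, maxlen, step):
    """Return (sequences, next_chars) built the same way as the preparation loop."""
    sequences, next_chars = [], []
    for i in range(0, len(text) - maxlen, step):
        sequences.append(text[i : i + maxlen])
        next_chars.append(text[i + maxlen])
    return sequences, next_chars


# Toy example: the string and the small maxlen are illustrative only.
seqs, nxt = make_windows("the tao that can be told", maxlen=5, step=3)
for s, c in zip(seqs, nxt):
    print(repr(s), "->", repr(c))

# The reported corpus length of 36045 characters gives 12002 sequences.
print(count_windows(36045))
```

With step = 3, the number of training examples is only about a third of the character count, so a small corpus shrinks even further once it is cut into sequences.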

In [None]:
model = keras.Sequential(
    [
        keras.Input(shape=(maxlen, len(chars))),
        layers.LSTM(128),
        layers.Dense(len(chars), activation="softmax"),
    ]
)
optimizer = keras.optimizers.RMSprop(learning_rate=0.01)
model.compile(loss="categorical_crossentropy", optimizer=optimizer)
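
As a rough sense of scale, the number of trainable weights in a model like this can be worked out by hand. This is a sketch that assumes the standard LSTM formulation (four gates, each with an input kernel, a recurrent kernel and a bias), which to my knowledge is what Keras uses by default; calling `model.summary()` would confirm the real figures. The helper names are hypothetical.

```python
def lstm_param_count(input_dim, units):
    # Four gates, each with an input kernel (input_dim x units),
    # a recurrent kernel (units x units) and a bias vector (units).
    return 4 * (units * (input_dim + units) + units)


def dense_param_count(input_dim, units):
    # One weight per input/output pair, plus one bias per output unit.
    return input_dim * units + units


vocab_size = 46  # "Total chars" reported for the Tao Te Ching corpus
lstm = lstm_param_count(vocab_size, 128)
dense = dense_param_count(128, vocab_size)
print(lstm, dense, lstm + dense)
```

Under these assumptions the model has 95,534 weights to fit from 12,002 training sequences, which is one way to see why the small corpus could be a problem.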

We also need a sampling function to generate text from the model's predictions while it trains.

In [None]:
def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype("float64")
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
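
To see what the temperature (called diversity in the training loop) does, here is a small sketch that applies the same log, divide and renormalize steps to a made-up distribution. It uses only the standard library, and `reweight` is a name I chose for illustration.

```python
import math


def reweight(probs, temperature):
    """Apply the same log / temperature / renormalize steps as sample()."""
    scaled = [math.log(p) / temperature for p in probs]
    exps = [math.exp(s) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up distribution over three characters, for illustration only.
probs = [0.6, 0.3, 0.1]
for t in [0.2, 0.5, 1.0, 1.2]:
    print(t, [round(p, 3) for p in reweight(probs, t)])
```

A temperature below 1 pushes probability toward the most likely character, which makes the output more repetitive. A temperature above 1 flattens the distribution, which makes the output more varied and more error-prone.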

Now we train the model.

In [None]:
epochs = 40
batch_size = 128

for epoch in range(epochs):
    model.fit(x, y, batch_size=batch_size, epochs=1)
    print()
    print("Generating text after epoch: %d" % epoch)

    start_index = random.randint(0, len(text) - maxlen - 1)
    for diversity in [0.2, 0.5, 1.0, 1.2]:
        print("...Diversity:", diversity)

        generated = ""
        sentence = text[start_index : start_index + maxlen]
        print('...Generating with seed: "' + sentence + '"')

        for i in range(400):
            x_pred = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(sentence):
                x_pred[0, t, char_indices[char]] = 1.0
            preds = model.predict(x_pred, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = indices_char[next_index]
            sentence = sentence[1:] + next_char
            generated += next_char

        print("...Generated: ", generated)
        print()


Generating text after epoch: 0
...Diversity: 0.2
...Generating with seed: "the lower the results. try to make peopl"
...Generated:  e the the the the the the withe the thang the the the the the the the the the the the the the the the the the the the the the the the the the the the the be the toule the the the the the the the the the be the be the the the the the the thit the the the lowhe the the the the the thang the the the the the the be the the the the the the the the the be the the the the the the the coule the the the th

...Diversity: 0.5
...Generating with seed: "the lower the results. try to make peopl"
...Generated:  e thorlt our uow ol belis all whath is other thos fou ko tous is and thing wing comlece thes ither gole the thet thace the ther thall ollethe the thing the gowithes thee it  hitpe theng  be cou the eease the withe wilg be be it the pothe illit  ither the the cou the the be that the thomles com whe thet it the thes ole itgout ill mor the tou th le lither the wito
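
The generation loop slides a fixed-length window over the text: it predicts one character, drops the oldest character from the window, and appends the new one. Here is a minimal sketch of that idea on its own, with a stand-in predictor in place of the model. The names `generate` and `echo_predictor` are hypothetical; in the notebook, the predictor would wrap `model.predict` and `sample`.

```python
def generate(predict_next, seed, length):
    """Generate `length` characters by sliding a fixed-size window over the text."""
    window = seed
    generated = ""
    for _ in range(length):
        next_char = predict_next(window)
        window = window[1:] + next_char  # drop the oldest char, append the newest
        generated += next_char
    return generated


def echo_predictor(window):
    # Stand-in for a trained model: repeats the character three positions back.
    return window[-3]


print(generate(echo_predictor, "tao", 6))  # prints "taotao"
```

Putting the loop in a function like this would also avoid copying it for every new model.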

I'm rather disappointed with my results. I'm going to try combining multiple translations into a single document. But first, I'm going to try training the same model with a batch size of 64 instead.

In [None]:
model2 = keras.Sequential(
    [
        keras.Input(shape=(maxlen, len(chars))),
        layers.LSTM(128),
        layers.Dense(len(chars), activation="softmax"),
    ]
)
optimizer = keras.optimizers.RMSprop(learning_rate=0.01)
model2.compile(loss="categorical_crossentropy", optimizer=optimizer)

epochs = 40
batch_size = 64

for epoch in range(epochs):
    model2.fit(x, y, batch_size=batch_size, epochs=1)
    print()
    print("Generating text after epoch: %d" % epoch)

    start_index = random.randint(0, len(text) - maxlen - 1)
    for diversity in [0.2, 0.5, 1.0, 1.2]:
        print("...Diversity:", diversity)

        generated = ""
        sentence = text[start_index : start_index + maxlen]
        print('...Generating with seed: "' + sentence + '"')

        for i in range(400):
            x_pred = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(sentence):
                x_pred[0, t, char_indices[char]] = 1.0
            preds = model2.predict(x_pred, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = indices_char[next_index]
            sentence = sentence[1:] + next_char
            generated += next_char

        print("...Generated: ", generated)
        print()


Generating text after epoch: 0
...Diversity: 0.2
...Generating with seed: " artless.  the master allows things to h"
...Generated:  e cowowe ta ta can the coure the coure ta cat can wall beale ta count the can can the cour the wall be cao the can the cat ca ta cacen the co cale to to can the cas ale can the ca tale wat ca to can the ca can be can the cas and cas the co the ca count the caone ta tao can the cale ta can ale cour the count se can be ta can the cowo the cas the can the coure the can the can the can the cale ca tan

...Diversity: 0.5
...Generating with seed: " artless.  the master allows things to h"
...Generated:  a dong staut came was whe it we ale and fourncang bee ale caoun whal your wat wat you tolt be court ce to ald whe  awe wall we cerer the cowith be soreo wall be your te can uf to male peopes the ca caler tit can the cas the co the care, peasd be tha to cat it lan be cale can can leing be tan an the woll tou womlit tae tat ther and the cowatt wath and be can sare

Let's try with 96 LSTM units.

In [None]:
model3 = keras.Sequential(
    [
        keras.Input(shape=(maxlen, len(chars))),
        layers.LSTM(96),
        layers.Dense(len(chars), activation="softmax"),
    ]
)
optimizer = keras.optimizers.RMSprop(learning_rate=0.01)
model3.compile(loss="categorical_crossentropy", optimizer=optimizer)

epochs = 40
batch_size = 64

for epoch in range(epochs):
    model3.fit(x, y, batch_size=batch_size, epochs=1)
    print()
    print("Generating text after epoch: %d" % epoch)

    start_index = random.randint(0, len(text) - maxlen - 1)
    for diversity in [0.2, 0.5, 1.0, 1.2]:
        print("...Diversity:", diversity)

        generated = ""
        sentence = text[start_index : start_index + maxlen]
        print('...Generating with seed: "' + sentence + '"')

        for i in range(400):
            x_pred = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(sentence):
                x_pred[0, t, char_indices[char]] = 1.0
            preds = model3.predict(x_pred, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = indices_char[next_index]
            sentence = sentence[1:] + next_char
            generated += next_char

        print("...Generated: ", generated)
        print()


Generating text after epoch: 0
...Diversity: 0.2
...Generating with seed: " she doesn't cling to her own comfort; t"
...Generated:  he the the the the the the sous man the the the toous the the the the the the and the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the mand the the the the the the the the the the the the the the the the the the sous and the the the the the the the the the the the the the the the the the the the the 

...Diversity: 0.5
...Generating with seed: " she doesn't cling to her own comfort; t"
...Generated:  he the the sheong.  an and so the cand this the thou the tho the tho the mond the the the best to the mint be beont an  an bos be the and whas bous ar aris and so thant thes is is the mard whe  is bere stire the  oo the goong the ther in and fove the the beepr. soe the th it and the bront is in thint this male sore tho ther con in the the eope an in this the the

Neither using fewer nodes nor using a smaller batch size made much difference. I'm going to conclude that the issue is the lack of data.

I want to try the original model with more data, so I'm going to compile 4 more translations with this one. These translations were selected for the similarity of their language style to the Stephen Mitchell translation.

However, to save the time of organizing all of that data again, I'm going to try the original model on St. Augustine's Confessions instead. It is the edition translated by E. B. Pusey (Edward Bouverie). I removed the preface and took out some of the spacing between books. The rest is intact.

In [None]:
!cp /content/drive/MyDrive/Colab\ Notebooks/AugustinesConfessions.txt /content/
fullPath = os.path.abspath("./" + "AugustinesConfessions.txt")
data_for_processing = keras.utils.get_file('AugustinesConfessions.txt', 
                                           'file://'+fullPath)

Downloading data from file:///content/AugustinesConfessions.txt


In [None]:
# Read the entire file into a single string, converted to lower case.
with io.open(data_for_processing, encoding="utf-8") as f:
    text = f.read().lower()

# Replace every newline character with a space, for nicer display.
text = text.replace("\n", " ")
# Report how many characters the corpus contains.
print("Corpus length:", len(text))

# Collect the distinct characters in the corpus. Their count sets the size of
# the model's final layer.
chars = sorted(list(set(text)))
print("Total chars:", len(chars))

# Build a dictionary that maps each character to its position in the sorted
# list above (chars), and a second one for the reverse lookup. Compact indices
# should save time and space compared with raw ASCII values: there are no
# large gaps between one character and the next, so there is no need to
# worry about how such gaps would show up in next-character generation.
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

# cut the text in semi-redundant sequences of maxlen characters
maxlen = 40
step = 3
sentences = []
next_chars = []

# Create the x and y relationship. Each 40-character slice of the text goes
# into the sentences list, and the character that immediately follows that
# slice goes into next_chars.
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i : i + maxlen])
    next_chars.append(text[i + maxlen])
print("Number of sequences:", len(sentences))

# One-hot encode the data. x is a 3D array: one row per sequence, one column
# per character position, and a one-hot vector over chars at each position.
# y is a 2D array holding the one-hot encoding of the character that follows
# each sequence.
# Note: np.bool was deprecated in NumPy 1.20 and later removed; the builtin
# bool is the equivalent dtype.
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1

model4 = keras.Sequential(
    [
        keras.Input(shape=(maxlen, len(chars))),
        layers.LSTM(128),
        layers.Dense(len(chars), activation="softmax"),
    ]
)
optimizer = keras.optimizers.RMSprop(learning_rate=0.01)
model4.compile(loss="categorical_crossentropy", optimizer=optimizer)

epochs = 40
batch_size = 128

for epoch in range(epochs):
    model4.fit(x, y, batch_size=batch_size, epochs=1)
    print()
    print("Generating text after epoch: %d" % epoch)

    start_index = random.randint(0, len(text) - maxlen - 1)
    for diversity in [0.2, 0.5, 1.0, 1.2]:
        print("...Diversity:", diversity)

        generated = ""
        sentence = text[start_index : start_index + maxlen]
        print('...Generating with seed: "' + sentence + '"')

        for i in range(400):
            x_pred = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(sentence):
                x_pred[0, t, char_indices[char]] = 1.0
            preds = model4.predict(x_pred, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = indices_char[next_index]
            sentence = sentence[1:] + next_char
            generated += next_char

        print("...Generated: ", generated)
        print()

Corpus length: 602700
Total chars: 39
Number of sequences: 200887


Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations



Generating text after epoch: 0
...Diversity: 0.2
...Generating with seed: ", that they may increase and multiply, a"
...Generated:  nd there was and somer to be all them of the lighted to be i was be in them to be to be to the life, and come to be in the besing to them to them to be them said all therefore to be measure to them all them all there would not in the beried of the beried to be to all there with the beried of the besing to them my god, and there was they were to be somered to be of the berion of the care to be of t

...Diversity: 0.5
...Generating with seed: ", that they may increase and multiply, a"
...Generated:  nd for them to took there thou discervers with these for and besieved in him is not it with me might to be mans and to that seek to be in the light of the contess of the same all of the life, they would not how they to long them and nother when i so one have to measure when they called to be hold of my god, i such we then beferved all toor there are, there would



...Generated:   i because it is the senses of the words, and thy word, which by the soul be against the soul, and a creation of the same in satisfocious both, and there was no earth. and i was subject in him all the creation of my more things with the human sportions of the life, or with him then see that they are in thy goodness, that i had gont out of thee, whereof she was a one will of the word who have not s

...Diversity: 1.0
...Generating with seed: "er wonted place, and i to rome.  and lo,"
...Generated:   not out of my mind? behold i bands neghise to be hons? and by pallishion, and bears in memoryfulness, had as in mercious more, if both without youth was neither to way, if will he lose. which by them, what eas, and lord, and said to light, that i cannot inited thy mind. ford grievous lay darkness othery alone them restless-power of the error, of my will upon my god, in a drink and did or so that 

...Diversity: 1.2
...Generating with seed: "er wonted place, and i to rome.  and

I want to save the model. The best generated text (from the seed "n unto such as i remembered myself to ha") is:

"ve not after the same to be a light, and the same to be a strond to thee, and there was in the man of the man, and what i had not been a man of the sense and the same space of the same spirit, and that is it with the more soul to be a soul to me to be made; and the same words was the same spirit, where i am wise, because they were then the same man of the firmament of the desire of the more soul t"

The model is reflecting that everything it has said accounts for who it is, that it is wise to know better, and that it will make a better decision in the eyes of the Lord. Such a devout model.

In [None]:
model4.save('model4Confessions')



INFO:tensorflow:Assets written to: model4Confessions/assets
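
The saved model only outputs character indices, so generating text from it later also requires the same character vocabulary. Here is a minimal sketch for storing it alongside the model. The helper names and the `vocab.json` file name are assumptions of mine, not part of the original notebook. The model itself could presumably be reloaded with `keras.models.load_model('model4Confessions')`.

```python
import json
import os
import tempfile


def save_vocab(chars, path):
    """Write the sorted character list to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(chars, f, ensure_ascii=False)


def load_vocab(path):
    """Read the character list back and rebuild both lookup dictionaries."""
    with open(path, encoding="utf-8") as f:
        chars = json.load(f)
    char_indices = {c: i for i, c in enumerate(chars)}
    indices_char = {i: c for i, c in enumerate(chars)}
    return chars, char_indices, indices_char


# Toy vocabulary written to a temporary directory, for illustration only.
chars = sorted(set("the tao"))
path = os.path.join(tempfile.mkdtemp(), "vocab.json")
save_vocab(chars, path)
restored, char_indices, indices_char = load_vocab(path)
print(restored == chars)
```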




Questions:
1. Why do we replace the \n's with spaces? - One possible reason is to reduce the number of possible next characters. Another is so that the generated text isn't littered with newlines. Plus, in our case, the text remains grammatically correct without them.
2. How does using validation data play into using this model?
3. Should I try combining multiple translations in order to have more data?
4. How do I know when my text-generation model is over-fitting the data?
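
For questions 2 and 4, one common approach is to hold out part of the text, track the loss on it after every epoch, and treat a validation loss that stops improving while the training loss keeps falling as a sign of over-fitting. The sketch below assumes a contiguous hold-out, since the 40-character windows overlap and a random split would leak text between the two sets. The helper names and the loss values are made up for illustration. In Keras, passing `validation_data=(x_val, y_val)` to `model.fit` reports `val_loss` each epoch, and the `keras.callbacks.EarlyStopping` callback can stop training automatically.

```python
def split_corpus(text, val_fraction=0.1):
    """Hold out the final part of the text as validation data."""
    cut = len(text) - int(len(text) * val_fraction)
    return text[:cut], text[cut:]


def epochs_since_best(val_losses):
    """Number of epochs that have passed since the lowest validation loss."""
    best = min(range(len(val_losses)), key=val_losses.__getitem__)
    return len(val_losses) - 1 - best


# Toy text: 75 characters for training, 25 held out for validation.
train_text, val_text = split_corpus("a" * 75 + "b" * 25, val_fraction=0.25)
print(len(train_text), len(val_text))

# Hypothetical validation losses, NOT results from this notebook.
hypothetical_val_losses = [2.9, 2.4, 2.1, 2.0, 2.05, 2.2]
print(epochs_since_best(hypothetical_val_losses))  # prints 2
```

If this count keeps growing from epoch to epoch, the model is probably memorizing the training text rather than learning patterns that carry over.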