*p.s.: I apologize for the weird PDF. Due to a problem in my VSCode, I first need to export the ipynb to HTML, and then convert to PDF.*

# Objective

Writing is a hobby I've had for more than ten years now. I've written a lot during this time, from large novels to tiny tales, covering tutorials, essays, and formal reports.

In the first assignment, I analyzed whether we could use Naive Bayes to distinguish pieces I wrote for different purposes. Now, I take a step further and ask: can I train a simple model that mimics my way of writing?

# Dataset Description

The dataset consists of writing samples from my Tumblr and Medium blogs, which were downloaded as HTML files from both websites, and notes I had on iCloud. In total, 52 HTML files were gathered, each one representing a single piece of writing. The files were manually labeled into six categories:

1. Social criticism/opinion piece
2. Poem
3. Tale
4. Tutorial
5. Notes
6. Novel

Finally, the writing samples have different lengths, come from different periods, and are all in Portuguese, which is my native language.

# Importing Dataset to Python

In the previous assignment, I built the structure to parse the HTMLs and convert the text into a Pandas dataframe. I had exported this data as a CSV, which I am now re-uploading below.

In [1]:
# Loads all libraries that will be used in this assignment
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import LSTM
from tensorflow.keras.callbacks import ModelCheckpoint
import sys
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')


Mounted at /content/drive


In [2]:
df = pd.read_csv('/content/drive/MyDrive/Minerva/Academic/CS156/assignment 2/10years.csv')
print(df)

     Category                                            Content
0        poem                  Se o fim entrasse em minha vista.
1        poem        Entre risos perguntando “do que se trata?”.
2        poem                           Puto, velho, vigarista!.
3        poem                  Por que não logo que tu me mata?.
4        poem            Queima-me a pele e escalda-me a cabeça.
...       ...                                                ...
4643     poem                       Porém, fácil mesmo é morrer.
4644     poem        Assim como uma semente plantada no inverno.
4645     poem             Assim como um anjo nascido no Inferno.
4646     poem       Assim como o amor que não se consegue viver.
4647     poem  Talvez morrerei sem ter a chance, da verdade, ...

[4648 rows x 2 columns]


# Data Processing

## Data preparation

There are a few pre-processing steps that we took in the previous assignment and will take here again. Namely:

1. Remove "\n" marks, which are present in some of the poems
2. Convert everything to lowercase
3. Remove punctuation marks
4. Remove signatures (" — Felipe, March 1985")

While steps 1 and 4 are meant to enhance the quality of the data, steps 2 and 3 try to make it simpler, reducing the space of options the model has to learn.

In [3]:
import string

# Create an empty list to store the updated data
updated_data = []

# Iterate through each row in the DataFrame
for idx, row in df.iterrows():
    content = row['Content']

    # Split the content by '\n'
    lines = content.split('\n')

    # Add each line as a separate entry
    for line in lines:
        if line.strip():  # Check if line is not empty (to avoid adding empty entries)
            updated_data.append({'Category': row['Category'], 'Content': line.strip()})

# Create a new DataFrame with the updated data
updated_df = pd.DataFrame(updated_data)

# Convert all entries to lowercase
updated_df['Content'] = updated_df['Content'].str.lower()

# Remove punctuation signs from the entries
def remove_punctuation(text):
    # Keep hyphens because in Portuguese they matter (e.g. "queima-me").
    # Note: string.punctuation only covers ASCII, so curly quotes such as “ ” are kept.
    punctuation_to_remove = string.punctuation.replace('-', '')
    return text.translate(str.maketrans('', '', punctuation_to_remove))

updated_df['Content'] = updated_df['Content'].apply(remove_punctuation)

# Remove entries containing signatures or
updated_df = updated_df[~updated_df['Content'].str.contains('bandeira')]
updated_df = updated_df[~updated_df['Content'].str.contains('poema')]
updated_df = updated_df[~updated_df['Content'].str.contains('tumblr')]

# Print the updated DataFrame
print(updated_df)

     Category                                            Content
0        poem                   se o fim entrasse em minha vista
1        poem          entre risos perguntando “do que se trata”
2        poem                               puto velho vigarista
3        poem                    por que não logo que tu me mata
4        poem             queima-me a pele e escalda-me a cabeça
...       ...                                                ...
4933     poem                         porém fácil mesmo é morrer
4934     poem         assim como uma semente plantada no inverno
4935     poem              assim como um anjo nascido no inferno
4936     poem        assim como o amor que não se consegue viver
4937     poem  talvez morrerei sem ter a chance da verdade co...

[4856 rows x 2 columns]


## Exploratory analysis

We can start by checking the size of the dataset and whether it is imbalanced. Below, we can see that there are roughly 2.5x more novel entries than poems or tales, which in turn are about 3x more frequent than entries from notes and tutorials; opinion pieces are the rarest, with only 64 entries. With such an imbalanced dataset, we ought to be mindful of the implications this might have for whatever model we build.

In [4]:
label_counts = updated_df['Category'].value_counts()
print(label_counts)

novel       2281
poem         953
tale         915
notes        335
tutorial     308
opinion       64
Name: Category, dtype: int64
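
To make the imbalance concrete, the counts printed above can be turned into shares and ratios. This is a minimal sketch with the counts copied by hand from the output; the variable names are illustrative only.

```python
# Counts copied from the value_counts() output above.
counts = {"novel": 2281, "poem": 953, "tale": 915,
          "notes": 335, "tutorial": 308, "opinion": 64}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# One row per category: label, count, share of the dataset
for label, n in counts.items():
    print(f"{label:<9} {n:>5}  {shares[label]:6.1%}")

print(f"novel / poem ratio: {counts['novel'] / counts['poem']:.2f}")
print(f"poem / notes ratio: {counts['poem'] / counts['notes']:.2f}")
```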


We can also look at how big the dataset is from a word-based perspective, given that entries can vary in size. The results are below, but to illustrate it clearly: if we were to condense the entire dataset into a single document, it would fill 83 pages in Arial 11 font. Not bad for a hobby!

In [None]:
def count_words(dataframe, column_name):
    total_words = 0
    unique_words = set()
    for text in dataframe[column_name]:
        words = text.split()
        total_words += len(words)
        unique_words.update(set(words))
    return total_words, len(unique_words)

# Usage:
total_words, unique = count_words(updated_df, 'Content')

print(f"Total number of words in the dataset: {total_words}")
print(f"Number of unique words: {unique}")


Total number of words in the dataset: 57997
Number of unique words: 8860


Last, by printing a few random samples of data, we can better observe how each entry looks:

In [None]:
# Randomly select observations
random_observations = updated_df.sample(n=10)

# Print the content of the selected observations
for idx, row in random_observations.iterrows():
    print(f"Content: {row['Content']}")

Content: poucas sensações se igualavam àquilo
Content: ao entrar sentiu o cheiro da carcaça de carne que apodrecia ali dentro
Content: parasita maníaco e solitário
Content: seus olhos ficam molhados a garganta dói mas você não chora
Content: e não existe nem uma remota chance de você sair andando daqui se me der de muito longe como resposta- venho diretamente do inferno seth- e me diga por que é que você está fugindo da sua esposa
Content: já conseguisse passar algum dia da tua vida sem ter que apelar pra essa noia tua de esperança
Content: eu posso sentir
Content: faça-me um carinho e me dê um beijo depois
Content: via-se dentro do próprio caixão
Content: as origens confidenciadas a fat jack não muito tempo antes agora eram usadas contra si próprio


# Task Explanation and Data Split

## Task explanation

After years of writing, I want to have the experience of being a reader of myself. Of course, reading something I actually wrote would not be enough because I know what's coming next, which makes it impossible for me to feel surprise or enjoy the novelty of a text. This is why I want to train a model that can *write something new* like me.

Technically speaking, this means training a model to understand the subtle patterns in my writing style to an extent that it is capable of generalizing them.

## Model selection

To accomplish this task, I chose to build a Long Short-Term Memory (LSTM) network. LSTMs are a type of RNN (Recurrent Neural Network) designed to capture long-term dependencies in sequential data.

The way LSTMs work can be illustrated with the analogy of reading a book and trying to understand the plot: as we read the pages, we continuously update our understanding based on the current sentence and what we've read previously. An LSTM follows a similar process, but using numerical data instead of words. As in any neural network, each layer takes in some input, applies a set of weights, and produces an output. However, in an RNN (and consequently in an LSTM), there's a hidden state that's passed along from one step to the next. This hidden state acts like a memory, allowing the network to consider past information while processing the current input. The difference between plain RNNs and LSTMs is that the latter are better at handling memory, suffering less from vanishing gradients when the input sequences become long.

Each time a new input is given to the model, the following process happens:
<br><br>

#### **1) Deciding how much of the long-term memory to forget**
The first part of the LSTM (named the Forget Gate) determines how much of the long-term memory should be kept for the current calculation. To do so, it uses the short-term memory ($h_{t-1}$) and the current input ($x_t$), returning values $f_t$ between 0 and 1 (which can be read as percentages) that will be applied to the long-term memory later.

$$
f_t = \sigma(W_{f} \cdot [h_{t-1}, x_t] + b_f)
$$

in which any $W$ is a weight matrix and any $b$ is a bias vector.

#### **2) Deciding what to add to the long-term memory**
Next, the LSTM combines the short-term memory with the input to create a potential long-term memory, $\tilde{C}_t$.

$$
\tilde{C}_t = \text{tanh}(W_C \cdot [h_{t-1}, x_t] + b_C)
$$

Then, it determines what percentage of this potential memory should actually be incorporated into the long-term memory. This entire process happens in what is called the Input Gate.

$$
i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)
$$

Following these steps, we update the long-term memory, $C_t$, based on the previous memory (and the amount of it we decided to forget), and the candidate new memory (along with the amount we decided to remember). Here, $\star$ denotes element-wise multiplication:

$$
C_t = f_t \star C_{t-1} + i_t \star \tilde{C}_t
$$

#### **3) Deciding what to output**

Last, we decide what to output by first combining the short-term memory and the input, which gives us the output gate $o_t$:

$$
o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)
$$

And then we factor in our long-term memory, thus obtaining the final output. Given that this output will be the short-term memory for the next input, we call it $h_t$:

$$
h_t = o_t \star \text{tanh}(C_t)
$$


<br>
<i>(Note: In an LSTM, everything we have just described takes place inside a single LSTM cell, the LSTM counterpart of a neuron.)</i>

The weights and biases are randomly initialized and updated through backpropagation.
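
The gate equations above can be traced in a few lines of NumPy. The sketch below only illustrates one time step with small random weights and made-up sizes; it is not how Keras implements the layer internally, and the names `lstm_step`, `params`, `hidden_size`, and `input_size` are chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step, following the equations in this section."""
    concat = np.concatenate([h_prev, x_t])                      # [h_{t-1}, x_t]
    f_t = sigmoid(params["W_f"] @ concat + params["b_f"])       # forget gate
    i_t = sigmoid(params["W_i"] @ concat + params["b_i"])       # input gate
    c_tilde = np.tanh(params["W_C"] @ concat + params["b_C"])   # candidate memory
    c_t = f_t * c_prev + i_t * c_tilde                          # new long-term memory
    o_t = sigmoid(params["W_o"] @ concat + params["b_o"])       # output gate
    h_t = o_t * np.tanh(c_t)                                    # new short-term memory
    return h_t, c_t

# Illustrative sizes only; the models in this notebook use 256 or 768 units.
hidden_size, input_size = 4, 1
rng = np.random.default_rng(0)
params = {}
for gate in ("f", "i", "C", "o"):
    params[f"W_{gate}"] = rng.normal(scale=0.1, size=(hidden_size, hidden_size + input_size))
    params[f"b_{gate}"] = np.zeros(hidden_size)

# Run a toy sequence of three normalized inputs through the cell
h, c = np.zeros(hidden_size), np.zeros(hidden_size)
for value in [0.1, 0.5, 0.9]:
    h, c = lstm_step(np.array([value]), h, c, params)

print("h_t:", h)
print("C_t:", c)
```

Note how `h` and `c` are fed back into the next call: that loop is the "memory passed from one step to the next" described earlier.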

## Data preparation

To create a model that can write like me, I am assuming that the piece category (poem, tale...) doesn't matter, which means all of the data can be grouped together. We thus start by converting all of the text into a single string.

In [5]:
raw_text = updated_df['Content'].str.cat(sep=' ')

Next, we map the characters of the vocabulary to integers. Given that LSTMs are made to work with numerical data, each character in the text needs to be represented as a numerical value. This mapping allows us to process characters through the model, and later to reverse the process and convert numerical outputs into text again.

In [None]:
# Create mapping of unique chars to integers
chars = sorted(list(set(raw_text)))
char_to_int = dict((c, i) for i, c in enumerate(chars))

n_chars = len(raw_text)
n_vocab = len(chars)
print("Total Characters:", n_chars)
print("Total Vocab:", n_vocab)
print(f"Characters that compose the vocabulary: {chars}")

Total Characters: 322745
Total Vocab: 58
Characters that compose the vocabulary: [' ', '-', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '\xa0', 'à', 'á', 'â', 'ã', 'ç', 'é', 'ê', 'í', 'ó', 'ô', 'õ', 'ú', '\u200a', '–', '—', '’', '“', '”', '…']
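
As a quick sanity check on the mapping idea, here is a toy version on a short string rather than the full corpus. The string is one of the sample lines shown earlier, and the `toy_` names are made up for illustration. Encoding and then decoding should give back the original text.

```python
toy_text = "queima-me a pele"

# Same construction as above: sorted unique characters mapped to integers
toy_chars = sorted(set(toy_text))
toy_char_to_int = {c: i for i, c in enumerate(toy_chars)}
toy_int_to_char = {i: c for c, i in toy_char_to_int.items()}

encoded = [toy_char_to_int[c] for c in toy_text]
decoded = "".join(toy_int_to_char[i] for i in encoded)

print(toy_chars)
print(encoded)
print(decoded)
```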


Now, we split the data into input-output pairs. We want the model to predict one character at a time based on the previous 100 characters. Therefore, each input is a sequence of 100 characters starting at position $i$ and ending at $i+99$, and the corresponding output is the single character at position $i+100$.

In [None]:
# Prepare the dataset of input to output pairs encoded as integers
seq_length = 100
dataX = []
dataY = []

for i in range(0, n_chars - seq_length, 1):
    seq_in = raw_text[i:i + seq_length]
    seq_out = raw_text[i + seq_length]
    dataX.append([char_to_int[char] for char in seq_in])
    dataY.append(char_to_int[seq_out])

n_patterns = len(dataX)
print("Total Patterns: ", n_patterns)

Total Patterns:  322645
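
To see what these pairs look like, the same windowing can be run on a short string with a window of 5 characters instead of 100. The string is one of the sample lines shown earlier, and the helper name `make_windows` is illustrative.

```python
def make_windows(text, seq_length):
    """Return (input window, next character) pairs, one per position."""
    pairs = []
    for i in range(len(text) - seq_length):
        pairs.append((text[i:i + seq_length], text[i + seq_length]))
    return pairs

toy_pairs = make_windows("eu posso sentir", seq_length=5)
for seq_in, seq_out in toy_pairs[:3]:
    print(repr(seq_in), "->", repr(seq_out))
print("Total pairs:", len(toy_pairs))
```

The number of pairs is always the text length minus the window length, which is why 322745 characters with a window of 100 give 322645 patterns.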


Last, we reshape the input to the format expected by Keras, normalize it by dividing by the vocabulary size, and one-hot encode the output into 58-dimensional vectors (the size of the vocabulary). Correspondingly, the LSTM's final softmax layer will output a 58-dimensional vector of probabilities for the next character.

In [None]:
# reshape X to be [samples, time steps, features]
X = np.reshape(dataX, (n_patterns, seq_length, 1))

# normalize
X = X / float(n_vocab)

# one hot encode the output variable
y = to_categorical(dataY)
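
For intuition about the shapes involved, the sketch below repeats these three steps on a toy example using NumPy only. The vocabulary size of 5 and the arrays are made up, and `np.eye` stands in for `to_categorical`, which produces the same kind of one-hot rows.

```python
import numpy as np

toy_vocab = 5
toy_X = [[0, 3, 1], [3, 1, 4]]   # two input windows of three integer-encoded characters
toy_y = [4, 2]                   # the character that follows each window

# [samples, time steps, features], then scale the integers into the range [0, 1)
X_toy = np.reshape(toy_X, (len(toy_X), 3, 1)) / float(toy_vocab)

# One row per sample, with a single 1 at the index of the target character
y_toy = np.eye(toy_vocab)[toy_y]

print(X_toy.shape, y_toy.shape)
print(y_toy)
```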

# Model Initialization and Training

Below we initialize our LSTM model. We have two stacked LSTM layers with 256 units each, each followed by a dropout layer (rate 0.2) to reduce overfitting, and a dense softmax layer at the end. Stacking LSTMs should also increase the model's capacity to represent more complex patterns.

In [None]:
# Creates LSTM model
model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(256))
model.add(Dropout(0.2))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
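
To get a feel for the size of this network, the number of trainable parameters can be worked out by hand. The sketch below assumes the standard LSTM parameterization (four weight blocks per layer, matching the four equations with a $W$ and a $b$ in the previous section) and the 58-character vocabulary. The helper names are illustrative, and `model.summary()` remains the authoritative source for the real figures.

```python
def lstm_params(units, input_dim):
    # Four blocks (forget, input, candidate, output), each with a weight
    # matrix over [h_{t-1}, x_t] and a bias vector.
    return 4 * (units * (units + input_dim) + units)

def dense_params(inputs, outputs):
    # One weight per input-output pair plus one bias per output
    return inputs * outputs + outputs

layer_sizes = {
    "lstm_1": lstm_params(256, 1),     # input is one normalized number per time step
    "lstm_2": lstm_params(256, 256),   # input is the first LSTM layer's output
    "dense": dense_params(256, 58),    # one output per character in the vocabulary
}
total_params = sum(layer_sizes.values())

for name, n in layer_sizes.items():
    print(f"{name}: {n:,}")
print(f"total: {total_params:,}")
```

Dropout layers add no trainable parameters, so they do not appear in the count.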

We then train the model for 70 epochs with a batch size of 60 (which means 60 training samples are passed through the network before each weight update via backpropagation). With the resources from Colab Free, the training took 1h56min, achieving a minimum training loss of 1.4405.

In [None]:
filepath = "/content/drive/MyDrive/Minerva/Academic/CS156/assignment 2/model/weights-improvement-{epoch:02d}-{loss:.4f}-bigger.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]

# fit the model
model.fit(X, y, epochs=70, batch_size=60, callbacks=callbacks_list)

Epoch 1/70
Epoch 1: loss improved from inf to 1.95305, saving model to /content/drive/MyDrive/Minerva/Academic/CS156/assignment 2/weights-improvement-01-1.9530-bigger.hdf5
Epoch 2/70
Epoch 2: loss improved from 1.95305 to 1.88847, saving model to /content/drive/MyDrive/Minerva/Academic/CS156/assignment 2/weights-improvement-02-1.8885-bigger.hdf5
Epoch 3/70
Epoch 3: loss improved from 1.88847 to 1.83753, saving model to /content/drive/MyDrive/Minerva/Academic/CS156/assignment 2/weights-improvement-03-1.8375-bigger.hdf5
Epoch 4/70
Epoch 4: loss improved from 1.83753 to 1.79718, saving model to /content/drive/MyDrive/Minerva/Academic/CS156/assignment 2/weights-improvement-04-1.7972-bigger.hdf5
Epoch 5/70
Epoch 5: loss improved from 1.79718 to 1.75898, saving model to /content/drive/MyDrive/Minerva/Academic/CS156/assignment 2/weights-improvement-05-1.7590-bigger.hdf5
Epoch 6/70
Epoch 6: loss improved from 1.75898 to 1.72987, saving model to /content/drive/MyDrive/Minerva/Academic/CS156/assignment 2/weights-improvement-06-1.7299-bigger.hdf5
Epoch 7/70
Epoch 7: loss improved from 1.72987 to 1.70338, saving model to /content/drive/MyDrive/Minerva/Academic/CS156/assignment

<keras.src.callbacks.History at 0x7ca5293a58d0>

# Model Predictions

With the model trained and its best version saved, we can now load the weights and ask the model to generate new text based on an initial sample from the dataset. The functions below do this.

In [None]:
# load the network weights
filename = "/content/drive/MyDrive/Minerva/Academic/CS156/assignment 2/model/weights-improvement-70-1.4405-bigger.hdf5"
model.load_weights(filename)
model.compile(loss='categorical_crossentropy', optimizer='adam')

In [None]:
# Similar to how we initially converted letters to numbers, we now do the opposite process
int_to_char = dict((i, c) for i, c in enumerate(chars))

In [None]:
def auto_generate():
    # Pick a random seed from the training windows
    start = np.random.randint(0, len(dataX))
    pattern = list(dataX[start])  # copy, so generation does not modify dataX
    print("Seed:")
    print("\"", ''.join([int_to_char[value] for value in pattern]), "\"")
    print("\nModel generation:")

    print_results(pattern)

def custom_generate(text):
    # Encode the custom seed text with the same character mapping
    pattern = [char_to_int[char] for char in text]
    print("Seed:")
    print("\"", ''.join([int_to_char[value] for value in pattern]), "\"")
    print("\nModel generation:")

    print_results(pattern)


def print_results(pattern):
    # Generate 400 characters, one at a time
    for i in range(400):
        x = np.reshape(pattern, (1, len(pattern), 1))
        x = x / float(n_vocab)
        prediction = model.predict(x, verbose=0)
        index = np.argmax(prediction)  # greedy choice: the most probable character
        result = int_to_char[index]
        sys.stdout.write(result)
        # Slide the window: append the new character and drop the oldest one
        pattern = pattern[1:] + [index]

    print("\nDone.")

## Quantitative performance metrics

Unlike classification or regression tasks, there apparently are few metrics to evaluate the performance of generative models. Some of the most common ones are:

1. **BLEU Score**: commonly used in translation tasks, it computes the similarity between the generated text and a set of reference (human-generated) texts.<br>
2. **Perplexity**: measures how well a model predicts a sample of text.
3. **ROUGE**: commonly used in text summarization, it evaluates the quality of summaries or generated text by measuring the overlap in n-grams (sequences of words) between the generated and the reference texts.

None of these metrics seem to be applicable in our case, as our goal is to mimic the writing style of the original dataset while producing novel work (which is very difficult to evaluate).
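
That said, perplexity can at least be read off the training loss, since it is the exponential of the average cross-entropy (Keras computes `categorical_crossentropy` with natural logarithms). The sketch below is a small illustration using the minimum training loss of 1.4405 reported earlier; `perplexity` is an illustrative helper. Because this is a training loss, the number only says how well the model fits text it has already seen, not how good or novel its generated text is.

```python
import math

def perplexity(cross_entropy_nats):
    """Perplexity is exp(average cross-entropy), with the loss in nats."""
    return math.exp(cross_entropy_nats)

n_vocab = 58
uniform_loss = math.log(n_vocab)   # loss of a model that guesses all characters uniformly
best_loss = 1.4405                 # minimum training loss reported earlier

print(f"uniform guess: {perplexity(uniform_loss):.2f}")
print(f"trained model: {perplexity(best_loss):.2f}")
```

A perplexity of 58 would mean the model is as uncertain as a uniform guess over the vocabulary; lower values mean it is, on average, hesitating between fewer characters.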

## Qualitative performance metrics
Given the lack of a quantitative metric, we will evaluate the model qualitatively, using the best possible evaluation method: the author's opinion of the model's output. The sampled seed and output from the next code cell are translated below. <br><br>



In [None]:
auto_generate()

Seed:
" brilhar teu raciocínio rápido esperto sempre fizera eu me esforçar para poder de alguma forma fazer  "

Model generation:
a carne a lhe perfurar si próprio olhava buscando a lembrança como a cabeça do corpo de seu contexto de seu contexto que estava a conteguir a cada posta de seu posto a primeira vez a mente se por mais que a contenção de seu posto a primeira vez a mente se por mais que a contenção de seu posto a primeira vez a mente se por mais que a contenção de seu posto a primeira vez a mente se por mais que a c
Done.


> Sampled seed text:

> * "[..] shine your smart quick thinking has always made me strive to be able to somehow do"

> Generated output:

> - "your flesh pierced himself looked at seeking the memory as the head of the body of his context of his context that he was comtaining at each slice of his post the first time the mind if for even more than the containment of his post the first time mind no matter how much the contention of its post the first time the mind no matter what the contention of its post the first time the mind no matter what the c"

**Evaluation:** while the seed text comes from a love poem I once wrote, the output seems to be a mix of words from a horror tale and generative hallucinations. There is also a lot of repetitiveness from the middle to the end of the output.

Next, we try customizing the input:

In [None]:
custom_generate("ontem foi um lindo dia o mar brilhava em olinda enquanto o sol iluminava as colinas e ")

Seed:
" ontem foi um lindo dia o mar brilhava em olinda enquanto o sol iluminava as colinas e  "

Model generation:
o context que estavam por causa de seu posto não se poderia estar e o context que estava a conteguir a cada posta de seu posto a primeira vez a mente se por mais que a contenção de seu posto a primeira vez a mente se por mais que a contenção de seu posto a primeira vez a mente se por mais que a contenção de seu posto a primeira vez a mente se por mais que a contenção de seu posto a primeira vez a 
Done.


> Custom seed:
> - yesterday was a beautiful day, the sea shone in olinda while the sun illuminated the hills and

> Output:
> - the context they were in because of their position could not be and the context that they were containing at each position of their position the first time the mind no matter how much the contention of their position the first time the mind no matter what the containment of its post the first time the mind no matter how much more than the containment of its post the first time

**Evaluation:** while the input is about a beautiful day, the output seems to mix words from different pieces that do not make sense together. There is also a repetitive pattern in this output that closely resembles the repetition in the previous one, suggesting a bias in the model that shows up in most outputs (I won't include extra outputs here, but this is indeed what I observed).
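
One plausible contributor to these loops is that `print_results` always takes the `argmax` of the prediction, so the same context always produces the same next character. A common alternative, which is not used in this notebook, is to sample from the predicted distribution with a temperature. The sketch below shows the idea on a made-up probability vector; `sample_with_temperature` is a hypothetical helper, not part of Keras.

```python
import numpy as np

def sample_with_temperature(probs, temperature=1.0, rng=None):
    """Sample an index from `probs` after rescaling by `temperature`.

    Low temperatures approach argmax; higher ones give more varied output.
    """
    if rng is None:
        rng = np.random.default_rng()
    logits = np.log(np.asarray(probs, dtype=float) + 1e-12) / temperature
    scaled = np.exp(logits - logits.max())   # subtract the max for numerical stability
    scaled /= scaled.sum()
    return int(rng.choice(len(scaled), p=scaled))

toy_probs = [0.5, 0.3, 0.15, 0.05]   # stand-in for one row of model.predict(...)
rng = np.random.default_rng(42)
greedy_like = [sample_with_temperature(toy_probs, 0.01, rng) for _ in range(10)]
varied = [sample_with_temperature(toy_probs, 1.0, rng) for _ in range(10)]
print(greedy_like)
print(varied)
```

In `print_results`, this would mean replacing the `np.argmax(prediction)` line with a call on `prediction[0]`. Whether it actually improves the text here is untested.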

# Discussion of Results

Overall, the LSTM is generating words correctly and hardly ever misspells any. This is, first of all, surprising, given that our model is simply predicting one character at a time. What we see is that it picks letters in a way that makes sense (for example, it doesn't produce 20 consecutive letters without a space, nor does it produce things like "yzgsfat"), and these sequences turn out to have meaning to us, being words we actually understand. Additionally, it sometimes even presents words that could make sense together, such as "because of their position" or "the mind becomes more than".

However, the sentences the LSTM builds are not really logical, and we are left with an output that resembles real language but is not. One might argue that the performance could have been better if we had trained for longer, but there most likely isn't much room for improvement with this LSTM. Some possible directions that might yield better outputs are training a model with more neurons, or a model that predicts words rather than characters.

# Executive Summary

In this assignment, I explored the extent to which an LSTM can mimic my way of writing. I started by importing the data from my last 10 years of writing, cleaning it, and preparing it to be processed by the LSTM. Then, I explained why LSTMs were a good choice, describing their advantage over plain RNNs and providing a step-by-step explanation of how they work. Next, I initialized my own LSTM and trained it on my dataset, keeping track of the loss and saving the weights of the best-performing model. Finally, I used the trained model to make inferences on both samples from the dataset and customized inputs.

Evaluating generative models like this seems to be a topic of debate in current research. Unlike classification models, for which there are well-established metrics for performance evaluation, this does not seem to be the case here. Consequently, my analysis of its performance was based on a personal assessment of how closely the model's output resembled my own writing. While it was fascinating to see that the model could output words correctly, it was not capable of generating sentences that made sense.

# References

Tutorial for LSTM for text generation:
- https://machinelearningmastery.com/text-generation-lstm-recurrent-neural-networks-python-keras/

How LSTMs work:
- https://colah.github.io/posts/2015-08-Understanding-LSTMs/
- https://www.youtube.com/watch?v=YCzL96nL7j0

How to evaluate generative models:
- https://saturncloud.io/glossary/evaluating-generative-models/
- https://arxiv.org/abs/2206.10935
- ChatGPT

# Extra Section

Given that the original model wasn't surprising, I decided to try a few alternatives and see how they perform.

## 1) More neurons?

What happens if we have the same network, but with more neurons? Theoretically, the greater number of neurons should allow the model to capture more information and patterns.

### Model definition

The model below was set to train for 70 epochs, but only completed 20 before Colab Free shut itself down, achieving a minimum training loss of 1.2908.

In [None]:
# Creates LSTM model
larger_model = Sequential()
larger_model.add(LSTM(768, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
larger_model.add(Dropout(0.25))
larger_model.add(LSTM(768))
larger_model.add(Dropout(0.25))
larger_model.add(Dense(y.shape[1], activation='softmax'))
larger_model.compile(loss='categorical_crossentropy', optimizer='adam')

In [None]:
filepath = "/content/drive/MyDrive/Minerva/Academic/CS156/assignment 2/larger_model/weights-improvement-{epoch:02d}-{loss:.4f}-bigger.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]

# fit the model
larger_model.fit(X, y, epochs=70, batch_size=60, callbacks=callbacks_list)

Epoch 1/70
Epoch 1: loss improved from inf to 2.89929, saving model to /content/drive/MyDrive/Minerva/Academic/CS156/assignment 2/larger_model/weights-improvement-01-2.8993-bigger.hdf5


Epoch 2/70
Epoch 2: loss improved from 2.89929 to 2.87000, saving model to /content/drive/MyDrive/Minerva/Academic/CS156/assignment 2/larger_model/weights-improvement-02-2.8700-bigger.hdf5
Epoch 3/70
Epoch 3: loss improved from 2.87000 to 2.83611, saving model to /content/drive/MyDrive/Minerva/Academic/CS156/assignment 2/larger_model/weights-improvement-03-2.8361-bigger.hdf5
Epoch 4/70
Epoch 4: loss improved from 2.83611 to 2.71915, saving model to /content/drive/MyDrive/Minerva/Academic/CS156/assignment 2/larger_model/weights-improvement-04-2.7192-bigger.hdf5
Epoch 5/70
Epoch 5: loss improved from 2.71915 to 2.61171, saving model to /content/drive/MyDrive/Minerva/Academic/CS156/assignment 2/larger_model/weights-improvement-05-2.6117-bigger.hdf5
Epoch 6/70
Epoch 6: loss improved from 2.61171 to 2.47918, saving model to /content/drive/MyDrive/Minerva/Academic/CS156/assignment 2/larger_model/weights-improvement-06-2.4792-bigger.hdf5
Epoch 7/70
Epoch 7: loss improved from 2.47918 to 2.265

### Predictions

In [None]:
int_to_char = dict((i, c) for i, c in enumerate(chars))

filename_larger = "/content/drive/MyDrive/Minerva/Academic/CS156/assignment 2/larger_model/weights-improvement-20-1.2908-bigger.hdf5"
larger_model.load_weights(filename_larger)
larger_model.compile(loss='categorical_crossentropy', optimizer='adam')


def auto_generate():
    # Pick a random seed from the training windows
    start = np.random.randint(0, len(dataX))
    pattern = list(dataX[start])  # copy, so generation does not modify dataX
    print("Seed:")
    print("\"", ''.join([int_to_char[value] for value in pattern]), "\"")
    print("\nModel generation:")

    print_results(pattern)

def custom_generate(text):
    # Encode the custom seed text with the same character mapping
    pattern = [char_to_int[char] for char in text]
    print("Seed:")
    print("\"", ''.join([int_to_char[value] for value in pattern]), "\"")
    print("\nModel generation:")

    print_results(pattern)


def print_results(pattern):
    # Generate 300 characters, one at a time, with the larger model
    for i in range(300):
        x = np.reshape(pattern, (1, len(pattern), 1))
        x = x / float(n_vocab)
        prediction = larger_model.predict(x, verbose=0)
        index = np.argmax(prediction)  # greedy choice: the most probable character
        result = int_to_char[index]
        sys.stdout.write(result)
        # Slide the window: append the new character and drop the oldest one
        pattern = pattern[1:] + [index]

    print("\nDone.")

In [None]:
auto_generate()

Seed:
" pode encontrar o código desse meu aplicativo aqui enfim terminamos espero muito que o artigo tenha s "

Model generation:
ido de madeira ertava entrar no casal para o contrabandista não era mais puro srabalho de seu app e a menos de se alguma manhira pue estava em seu rosto de tm celes  fmi a única coisa que estava en cada novo trabalho na parte de corrado para o casal para o contrabandista não era mais puro srabalho d
Done.


> Sampled seed text:

> * "you can find the code for my application here finally we're done I really hope this article has s"

> Generated output:

> - "gone from wood to eter the couple for the smuggler was no longer pure swork of his app and unless some maner pue was in his face of tm celes fmi the only thing that was inn each new work in the part of corrado for the couple For the smuggler it was no longer pure work of art."

In [None]:
custom_generate("ontem foi um lindo dia o mar brilhava em olinda enquanto o sol iluminava as colinas e ")

Seed:
" ontem foi um lindo dia o mar brilhava em olinda enquanto o sol iluminava as colinas e  "

Model generation:
a porta de seu app e a menos de se alguma manhira pue estava em seu rosto de tm celes  fme souris por completo perceber que o cara eez o contrabandista não era mais puro srabalho de seu app e a menos de se alguma manhira pue estava em seu rosto de tm celes  fme souris por completo perceber que o car
Done.


> Sampled seed text:

> * "yesterday was a beautiful day the sea shone in olinda while the sun illuminated the hills and"

> Generated output:

> - "the door or your app and unless of some maner pue was in your face of wit celes fme souris completely realizing the guy iid the smuggler wasn't more pure swork of your app and unless that some maner pue was in your face wit celes fme  souris completely realize the guy"

### Analysis of results

This model performed worse than the previous one. It is capable of outputting characters in sequences that resemble words (placing spaces correctly, alternating between vowels and consonants...), but it often misspells them (for example, "srabalho" and "manhira"). Furthermore, the words it generates correctly do not make sense together; unlike with the previous model, neighboring words have little connection with each other.

Last, during training, the loss improved very little from epoch to epoch, suggesting that longer training would not contribute much to improving the quality of the output.

## 2) Words instead of characters?

Another possibility is adjusting the model to predict words instead of characters. Although this considerably increases the number of different inputs the model can take (and the patterns it needs to learn), it might make it easier for the model to connect words together in a coherent way.

To make it work, I had to change the pipeline that processed the text, which resulted in the one below.

### Data processing and model definition

In [None]:
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical
import numpy as np

# Concatenate text data
raw_text = updated_df['Content'].str.cat(sep=' ')

# Tokenize the text into words
tokenizer = Tokenizer()
tokenizer.fit_on_texts([raw_text])
sequences = tokenizer.texts_to_sequences([raw_text])[0]
total_words = len(tokenizer.word_index) + 1  # Adding 1 because word indices start at 1 (index 0 is reserved)

# Prepare sequences of 30 words as input and one word as output
seq_length = 30
dataX = []
dataY = []

for i in range(seq_length, len(sequences)):
    seq_in = sequences[i - seq_length:i]
    seq_out = sequences[i]
    dataX.append(seq_in)
    dataY.append(seq_out)

n_patterns = len(dataX)
print("Total Sequences: ", n_patterns)

# reshape X to be [samples, time steps, features]
X = np.reshape(dataX, (n_patterns, seq_length, 1))

# normalize
X = X / float(total_words)

# one hot encode the output variable; num_classes keeps y aligned with the Dense(total_words) output
y = to_categorical(dataY, num_classes=total_words)

Total Sequences:  58233
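
The sliding-window step in the pipeline above can be illustrated on a toy sequence. This is a minimal sketch, assuming a made-up list of ten word indices and a window of three instead of the real `sequences` and `seq_length`; every name containing `toy` is hypothetical.

```python
import numpy as np

# Hypothetical word indices standing in for `sequences`
toy_sequences = [4, 1, 7, 2, 9, 3, 5, 8, 6, 1]
toy_seq_length = 3
toy_total_words = 10  # largest index + 1, mirroring len(word_index) + 1

toy_X, toy_y = [], []
for i in range(toy_seq_length, len(toy_sequences)):
    toy_X.append(toy_sequences[i - toy_seq_length:i])  # the previous 3 words
    toy_y.append(toy_sequences[i])                      # the word to predict

# Shape [samples, time steps, features], scaled as in the pipeline above
X_toy = np.reshape(toy_X, (len(toy_X), toy_seq_length, 1)) / float(toy_total_words)

# One-hot targets built by indexing rows of an identity matrix
y_toy = np.eye(toy_total_words)[toy_y]

print(X_toy.shape, y_toy.shape)
print(toy_X[0], "->", toy_y[0])
```

A sequence of N tokens yields N - `seq_length` windows, so the 58233 windows reported above correspond to a corpus of 58263 tokens.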


I then defined a model that was identical to the first one and trained it for 3 hours, covering 300 epochs with a batch size of 15.

In [None]:
from keras.models import Sequential
from keras.layers import LSTM, Dropout, Dense
from keras.callbacks import ModelCheckpoint
import numpy as np

# Adjusting the model for word-level prediction
words_model = Sequential()
words_model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
words_model.add(Dropout(0.2))
words_model.add(LSTM(256))  # No return_sequences needed in the last LSTM layer
words_model.add(Dropout(0.2))
words_model.add(Dense(total_words, activation='softmax'))  # Changed y.shape[1] to total_words
words_model.compile(loss='categorical_crossentropy', optimizer='adam')

filepath = "/content/drive/MyDrive/156 materials/weights-improvement-{epoch:02d}-{loss:.4f}-bigger.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]

In [None]:
# Fit the model
words_model.fit(X, y, epochs=300, batch_size=15, callbacks=callbacks_list)

Epoch 1/300
Epoch 1: loss improved from inf to 7.22084, saving model to /content/drive/MyDrive/156 materials/weights-improvement-01-7.2208-bigger.hdf5
Epoch 2/300
Epoch 2: loss improved from 7.22084 to 7.04317, saving model to /content/drive/MyDrive/156 materials/weights-improvement-02-7.0432-bigger.hdf5
Epoch 3/300
Epoch 3: loss improved from 7.04317 to 7.01351, saving model to /content/drive/MyDrive/156 materials/weights-improvement-03-7.0135-bigger.hdf5
Epoch 4/300
Epoch 4: loss improved from 7.01351 to 6.99408, saving model to /content/drive/MyDrive/156 materials/weights-improvement-04-6.9941-bigger.hdf5
Epoch 5/300
Epoch 5: loss improved from 6.99408 to 6.96856, saving model to /content/drive/MyDrive/156 materials/weights-improvement-05-6.9686-bigger.hdf5
Epoch 6/300
Epoch 6: loss improved from 6.96856 to 6.93041, saving model to /content/drive/MyDrive/156 materials/weights-improvement-06-6.9304-bigger.hdf5
Epoch 7/300
Epoch 7: loss improved from 6.93041 to 6.88298, saving model t

<keras.src.callbacks.History at 0x7c96ea9f2ec0>

### Model predictions

The final set of weights from epoch 300 was highly overfit, so for prediction I instead used the weights from epoch 91, which had captured some patterns from my writing but were not yet copying the training data.
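
Picking a checkpoint by hand is easier with the epoch and loss pulled out of each filename. The sketch below assumes the naming pattern set in `filepath` above (`weights-improvement-{epoch:02d}-{loss:.4f}-bigger.hdf5`); the helper `parse_checkpoint` is hypothetical, and the example names are taken from the training log and from the file loaded in the next cell.

```python
import re

# Pattern matching the ModelCheckpoint filename template used above
CHECKPOINT_RE = re.compile(r"weights-improvement-(\d+)-(\d+\.\d+)-bigger\.hdf5$")

def parse_checkpoint(name):
    """Return (epoch, loss) for a checkpoint filename, or None if it does not match."""
    match = CHECKPOINT_RE.search(name)
    if match is None:
        return None
    return int(match.group(1)), float(match.group(2))

# Filenames that appear in this notebook
names = [
    "weights-improvement-91-1.7459-bigger.hdf5",
    "weights-improvement-01-7.2208-bigger.hdf5",
    "weights-improvement-02-7.0432-bigger.hdf5",
]

# Keep only matching names and list them in training order
parsed = sorted(p for p in map(parse_checkpoint, names) if p is not None)
for epoch, loss in parsed:
    print(f"epoch {epoch:3d}  training loss {loss:.4f}")
```

Because the callback monitors the training loss, a lower value here does not by itself indicate better text, which is why the checkpoint was chosen by inspecting the generated output.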

In [None]:
# Loads weights
filename_words = "/content/drive/MyDrive/Minerva/Academic/CS156/assignment 2/words_model/weights-improvement-91-1.7459-bigger.hdf5"
words_model.load_weights(filename_words)
words_model.compile(loss='categorical_crossentropy', optimizer='adam')

In [None]:
import sys
import numpy as np

# Reverse lookup: integer index -> word
word_index = tokenizer.word_index
int_to_word = {index: word for word, index in word_index.items()}

def generate_text_words(words_model, sequences, tokenizer, seq_length, total_words, int_to_word, num_words=50):
    # Pick a random seed; start at seq_length so the slice below is always full
    start = np.random.randint(seq_length, len(sequences))
    pattern = sequences[start-seq_length:start]
    print("Seed:")
    print(" ".join([int_to_word[value] for value in pattern]))
    print("\nModel generation:")

    # Generate words one at a time, feeding each prediction back into the input
    for i in range(num_words):
        x = np.reshape(pattern, (1, len(pattern), 1))
        x = x / float(total_words)
        prediction = words_model.predict(x, verbose=0)
        # Greedy decoding: always take the most likely next word
        index = int(np.argmax(prediction))
        # Index 0 is reserved by the tokenizer and maps to no word
        result = int_to_word.get(index, "")
        sys.stdout.write(result + " ")
        # Slide the window: append the new word and drop the oldest one
        pattern = np.append(pattern, index)
        pattern = pattern[1:len(pattern)]

    print("\nDone.")


# Calling the function to generate text using the word-based model
generate_text_words(words_model, sequences, tokenizer, seq_length, total_words, int_to_word)


Seed:
células com conteúdo para isso usaremos outro protocolo de collection view mas antes precisamos de uma breve explicação imagine que tenhamos 10000 itens para exibir na cv se continuássemos implementando

Model generation:
os código normalmente iríamos criar uma célula de cada mais de 10000 itens mesmo ter nem células de dor e olhos que jeito o noite de deixar e deixar estava uma não três não o não estridente não ruas do bar de o vai que rosto que o tempo futuro 
Done.


> Sampled seed text:

> * "cells with content for this we will use another collection view protocol but first we need a brief explanation imagine we have 10000 items to display in the cv if we continued implementing"

> Generated output:

> - "the codes normally we would create a cell of each more than 10000 items even though there are no pain cells and eyes that way the night of leaving and leaving was one no three no the no strident no streets of the bar of the go what face that the future time"

### Analysis of results

There are no misspelled words here, which is expected, since the model predicts whole words rather than characters. However, the output is a mix of overfit text and hallucinations. The beginning of the output is the exact continuation of the seed, which comes from an iOS tutorial I once wrote. At some point, however, it switches to a nearly random set of words. This set resembles some of the novels I wrote, but it doesn't make sense.

It is worth mentioning again that this result comes from the weights the model had around epoch 90. Weights from earlier epochs produced text with no meaning, and weights from later epochs produced copies of the training data due to overfitting.
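
One way to make "copying the training data" measurable is to count how many generated words match the true continuation of the seed before the first divergence. This is a minimal sketch: `copied_prefix_length` is a hypothetical helper, and the token lists are made-up indices rather than real model output.

```python
def copied_prefix_length(generated, reference):
    """Count how many leading tokens of `generated` equal those of `reference`."""
    count = 0
    for gen_token, ref_token in zip(generated, reference):
        if gen_token != ref_token:
            break
        count += 1
    return count

# Hypothetical word indices: the true continuation in the corpus vs. model output
reference = [12, 5, 88, 3, 41, 7, 19]
generated = [12, 5, 88, 3, 60, 7, 2]

n_copied = copied_prefix_length(generated, reference)
print(f"{n_copied} of {len(generated)} generated words copy the training text")
```

In the notebook, the reference would be the slice of `sequences` that follows the seed, which assumes the generation function is changed to return the seed position and the predicted indices.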

The model trained on words seems to be unable to find the balance between learning my writing style, learning to generate text that makes sense, and not overfitting the training data.

# Final conclusion

It seems that LSTMs can only scratch the surface of text generation. These models output content that individually makes sense (such as characters that make up actual words), but they struggle to arrange these successful units in a meaningful way, apparently being unable to create useful sentences without overfitting the training data.

In order to bridge this gap, it seems like we need a model that can understand the relevance of each word relative to each other. This likely means a model that contains attention mechanisms, which I will explore in the next assignment.