# Setup

In [44]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/quotes-by-philosophers/Plato.txt
/kaggle/input/quotes-by-philosophers/Immanuel-Kant.txt
/kaggle/input/quotes-by-philosophers/Arthur-Schopenhauer-Quotes.txt
/kaggle/input/quotes-by-philosophers/Jean-Paul-Sartre.txt
/kaggle/input/quotes-by-philosophers/Spinoza.txt
/kaggle/input/quotes-by-philosophers/Sigmund-Freud.txt
/kaggle/input/quotes-by-philosophers/Aristotle.txt
/kaggle/input/quotes-by-philosophers/Friedrich-Nietzsche.txt
/kaggle/input/quotes-by-philosophers/Hegel.txt


Import the necessary libraries.

In [45]:
from keras.preprocessing import sequence
import keras
import tensorflow as tf
import os
import numpy as np

# Dataset

The Kaggle dataset contains quotes from famous philosophers (Bozkurt, 2022a):

- Arthur Schopenhauer: 400+ quotes
- Friedrich Nietzsche: 200+ quotes
- Immanuel Kant: 300+ quotes
- Aristotle: 350+ quotes
- Plato: 70+ quotes
- Sigmund Freud: 400+ quotes
- Hegel: 120+ quotes
- Jean Paul Sartre: 320+ quotes
- Spinoza: 120+ quotes

I chose to focus on Arthur Schopenhauer's quotes, since there was plenty of data to work with.

First, it is necessary to import the dataset. The file is read as raw bytes and decoded with the 'unicode_escape' codec, since it may contain Unicode characters that were escaped when the file was written. Decoding turns those escape sequences back into the actual characters of a string.

In [46]:
path = "../input/quotes-by-philosophers/Arthur-Schopenhauer-Quotes.txt"
with open(path, 'rb') as f:  # the context manager closes the file after reading
    text = f.read().decode(encoding='unicode_escape')
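
To make the decoding step concrete, here is a minimal sketch of what the 'unicode_escape' codec does. The sample bytes are hypothetical and are not taken from the dataset. The second half illustrates a caveat that may or may not matter for this file: the codec reads non-escaped bytes as Latin-1, so raw UTF-8 bytes would not survive it intact.

```python
# Hypothetical sample bytes (not from the dataset). The backslashes are
# literal characters here, as they would be in a file with escaped text.
raw = b"Life\\u2019s short.\\nSecond line"

# The escape sequences become real characters: a curly apostrophe and a newline.
decoded = raw.decode("unicode_escape")
print(decoded)

# Caveat: bytes that are not escape sequences are interpreted as Latin-1,
# so a UTF-8 encoded character turns into two unrelated characters.
mangled = "é".encode("utf-8").decode("unicode_escape")
print(repr(mangled))
```

If the printed sample of the real file shows odd two-character sequences where accented letters should be, decoding with 'utf-8' instead would be worth trying.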

Now, a sample of the text can be printed to check that the decoding went smoothly and to get a better idea of how the text looks before preprocessing.

In [47]:
print(text[:240])

It is difficult to find happiness within oneself, but it is impossible to find it anywhere else.
All truth passes through three stages. First, it is ridiculed. Second, it is violently opposed. Third, it is accepted as being self-evident.



Now it is necessary to build the vocabulary of the text. The vocabulary is the sorted list of all unique characters that appear in the text.

In [48]:
vocab = sorted(set(text))
print(f'{len(vocab)} unique characters')

75 unique characters


Converting characters to numbers is an essential step in building an RNN for a text dataset, because neural networks can only work with numerical data. The next step is therefore to map each character to an integer index:

In [49]:
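# Same sorted character set as `vocab` above; despite its name, `words` holds characters
words = sorted(set(text))

char2idx = {u: i for i, u in enumerate(words)}  # character -> integer index
idx2char = np.array(words)                      # integer index -> character

def text2num(text):
  """Encode a string as an array of character indices."""
  return np.array([char2idx[c] for c in text])

num_text = text2num(text)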

Later on, it will also be necessary to convert back to text, so the following function can take care of that:

In [50]:
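def num2text(nums):
  """Decode an array (or tensor) of character indices back into a string."""
  try:
    nums = nums.numpy()  # tensors expose .numpy(); lists and arrays do not
  except AttributeError:
    pass
  return ''.join(idx2char[nums])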
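
The two conversions are inverses of each other. As a minimal illustration in plain Python, using a hypothetical toy string rather than the dataset (the `toy_` names are made up for this sketch), the round trip looks like this:

```python
sample = "truth"                      # hypothetical toy text
toy_vocab = sorted(set(sample))       # ['h', 'r', 't', 'u']
toy_char2idx = {ch: i for i, ch in enumerate(toy_vocab)}

# Encode: one integer per character
encoded = [toy_char2idx[ch] for ch in sample]
print(encoded)                        # [2, 1, 3, 2, 0]

# Decode: look each integer up in the vocabulary again
decoded = "".join(toy_vocab[i] for i in encoded)
print(decoded)                        # truth
```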

# Prepare for Training

The data can now be prepared for the specific task of text generation from sequences of characters. The goal is to train a model that predicts the next character at every position of a sequence, so each input sequence is paired with a target that is the same sequence shifted by one character.

In [51]:
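seq_length = 100  # characters per input sequence
examples_per_epoch = len(text)//(seq_length+1)
char_dataset = tf.data.Dataset.from_tensor_slices(num_text)
# Chunks of seq_length+1 characters, so each chunk yields an input and a shifted target
sequences = char_dataset.batch(seq_length+1, drop_remainder=True)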

In [52]:
def split_input_target(chunk):
    input_text = chunk[:-1]   # every character except the last
    target_text = chunk[1:]   # the same characters shifted one step ahead
    return input_text, target_text

dataset = sequences.map(split_input_target)
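
The chunking and splitting can be illustrated without TensorFlow. The sketch below uses a hypothetical short string and a sequence length of 4 instead of 100; `make_chunks` is a made-up helper that mimics `batch(..., drop_remainder=True)` for this purpose only.

```python
def make_chunks(items, size):
    """Consecutive chunks of exactly `size` items; an incomplete tail is dropped."""
    return [items[i:i + size] for i in range(0, len(items) - size + 1, size)]

def split_input_target(chunk):
    return chunk[:-1], chunk[1:]

toy_text = "All truth passes"   # 16 characters, hypothetical sample
toy_seq_length = 4

pairs = [split_input_target(c) for c in make_chunks(toy_text, toy_seq_length + 1)]
for input_text, target_text in pairs:
    print(repr(input_text), "->", repr(target_text))
# 'All ' -> 'll t'
# 'ruth' -> 'uth '
# 'pass' -> 'asse'
```

At every position the target holds the character that follows the corresponding input character, and the final "s" is dropped because it does not fill a whole chunk.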

Now the training batches can be allocated and the training parameters specified. I set EMBEDDING_DIM to 100 to start with, since it can be relatively small for models such as this one. I initially specified a batch size for the model as well, but kept encountering errors, so I chose to omit it there; BATCH_SIZE below is used only to batch the dataset.

In [53]:
BATCH_SIZE = 64
VOCAB_SIZE = len(words) 
EMBEDDING_DIM = 100
RNN_UNITS = 1024

BUFFER_SIZE = 10000

data = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
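
As a rough check on these settings, the number of training steps per epoch follows from two integer divisions, because both `batch` calls drop their remainder. The sketch below is plain arithmetic; `steps_per_epoch` is a hypothetical helper and the character counts are illustrative, not measured from the file.

```python
def steps_per_epoch(n_chars, seq_length=100, batch_size=64):
    n_sequences = n_chars // (seq_length + 1)  # chunks of seq_length+1 characters
    return n_sequences // batch_size           # full batches only

# The training log further down reports 10 steps per epoch, which is
# consistent with any text length in this range:
print(steps_per_epoch(64_640))   # 10 (smallest length that gives 10 steps)
print(steps_per_epoch(71_103))   # 10 (largest length that gives 10 steps)
print(steps_per_epoch(71_104))   # 11
```

With at most about 700 sequences, a BUFFER_SIZE of 10000 is larger than the dataset, so each epoch is shuffled over all sequences.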

# Building the Model

I originally planned the model with two LSTM layers, one embedding layer, a dropout layer for regularization, and a Dense output layer to predict the next character in the sequence. After encountering many errors, I resorted to just one LSTM layer; the second one is left commented out below. I also encountered errors related to the input shape of the embedding layer. I was able to avoid them by explicitly declaring the input with an Input layer, but I believe that may have altered my results down the line. I may be misunderstanding how things are supposed to work there.

In [54]:
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(None,)),
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM),
    tf.keras.layers.LSTM(RNN_UNITS,
                         return_sequences=True),
    tf.keras.layers.Dropout(0.2),
    # tf.keras.layers.LSTM(RNN_UNITS),
    tf.keras.layers.Dense(VOCAB_SIZE)
])
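
For a sense of scale, the number of trainable parameters can be estimated by hand. This is a back-of-the-envelope sketch that assumes the standard LSTM layout (four gates, each with input weights, recurrent weights, and a bias) and the 75-character vocabulary reported earlier; `model.summary()` would give the authoritative figure.

```python
def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and one bias vector
    return 4 * (units * (input_dim + units) + units)

vocab_size, embedding_dim, rnn_units = 75, 100, 1024

embedding = vocab_size * embedding_dim             # one vector per character
lstm = lstm_params(embedding_dim, rnn_units)
dense = rnn_units * vocab_size + vocab_size        # weights plus biases

total = embedding + lstm + dense
print(embedding, lstm, dense, total)               # 7500 4608000 76875 4692375
```

Almost all of the roughly 4.7 million parameters sit in the LSTM layer. That is large relative to a single author's quotes, so a smaller RNN_UNITS value might be worth experimenting with.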

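Then, compile the model. The Adam optimizer is used with its default learning rate, together with a sparse categorical crossentropy loss. Because the Dense output layer has no activation, the model outputs raw logits rather than a probability distribution over the possible characters, so the loss is created with from_logits=True and applies the softmax internally.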

In [55]:
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

model.compile(optimizer='adam', loss=loss)
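
To show what this loss computes for a single character, here is a minimal pure-Python sketch of sparse categorical crossentropy on logits. It is an illustration of the formula (negative log of the softmax probability assigned to the true class), not the Keras implementation, and the logits are made up.

```python
import math

def sparse_ce_from_logits(logits, target):
    """-log(softmax(logits)[target]), computed in a numerically stable way."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

# A model with no preference among 75 characters has loss ln(75)
uniform_loss = sparse_ce_from_logits([0.0] * 75, target=0)
print(round(uniform_loss, 3))   # 4.317

# A confident, correct prediction gives a loss close to zero
confident_loss = sparse_ce_from_logits([5.0, 0.0, 0.0], target=0)
print(confident_loss < 0.02)    # True
```

The uniform value of about 4.32 is a useful reference point: the loss of 4.1256 reported for the first epoch in the training log is close to it, as expected for a model that has barely started learning.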

# Train the Model

First, configure a ModelCheckpoint callback that saves the model weights after every epoch, to avoid having to retrain from scratch.

In [56]:
checkpoint_dir = './training_checkpoints'

checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}.weights.h5")

checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)
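
The `{epoch}` placeholder in the file path is filled in with the epoch number, so 150 epochs produce 150 weight files. When picking one to restore, sorting the names as plain strings puts `ckpt_2` after `ckpt_10`. The sketch below shows one way to select the latest file by its numeric epoch. `epoch_of` is a hypothetical helper and the file names are generated here for illustration rather than read from disk.

```python
import re

pattern = "ckpt_{epoch}.weights.h5"
names = [pattern.format(epoch=e) for e in (1, 2, 10)]   # illustrative names

def epoch_of(name):
    """Extract the epoch number from a checkpoint file name, or -1 if absent."""
    match = re.fullmatch(r"ckpt_(\d+)\.weights\.h5", name)
    return int(match.group(1)) if match else -1

print(max(names))                  # ckpt_2.weights.h5 (string comparison)
latest = max(names, key=epoch_of)
print(latest)                      # ckpt_10.weights.h5 (numeric comparison)
```

In the notebook, the selected file could then be passed to `model.load_weights` after joining it with `checkpoint_dir`.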

Then, training can finally commence. I initially trained for 100 epochs, but after seeing poor generated text, I raised the number to 150:

In [57]:
history = model.fit(data, epochs=150, callbacks=[checkpoint_callback])

Epoch 1/150
[1m10/10[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 85ms/step - loss: 4.1256
Epoch 2/150
[1m10/10[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 84ms/step - loss: 3.1395
Epoch 3/150
[1m10/10[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 84ms/step - loss: 3.0239
Epoch 4/150
[1m10/10[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 85ms/step - loss: 2.9384
Epoch 5/150
[1m10/10[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 83ms/step - loss: 2.8431
Epoch 6/150
[1m10/10[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 81ms/step - loss: 2.7216
Epoch 7/150
[1m10/10[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 92ms/step - loss: 2.6106
Epoch 8/150
[1m10/10[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 78ms/step - loss: 2.5447
Epoch 9/150
[1m10/10[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 87ms/step - loss: 2.4860
Epoch 10/150
[1m10/10[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 87ms/step - lo

# Generate Text

The function provided by Bozkurt (2022b) can then generate text. Unfortunately, my text came out pretty jumbled. The temperature was initially set to 0.5; lowering it to 0.4 made a small improvement, but the output was still jumbled in the end.

In [60]:
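def generate_text(model, start_string):
  """Sample num_generate characters from the model, starting from start_string."""
  num_generate = 100

  # Encode the start string and add a batch dimension
  input_eval = [char2idx[s] for s in start_string]
  input_eval = tf.expand_dims(input_eval, 0)

  text_generated = []

  # Lower temperatures make the sampling more conservative
  temperature = 0.4

  for i in range(num_generate):
    predictions = model(input_eval)

    # Remove the batch dimension
    predictions = tf.squeeze(predictions, 0)

    # Scale the logits by the temperature, then sample the next character
    predictions = predictions / temperature
    predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()

    # Only the newly sampled character is fed back in; since the LSTM layer is
    # not stateful, earlier characters are not seen on the next call
    input_eval = tf.expand_dims([predicted_id], 0)

    text_generated.append(idx2char[predicted_id])

  return (start_string + ''.join(text_generated))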
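
The effect of the temperature can be seen on a small example. Dividing the logits by a temperature below 1 and then sampling is equivalent to sampling from a sharpened softmax distribution. The sketch below uses made-up logits and a hypothetical helper name; it only illustrates the scaling and is not part of the notebook's model.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]   # hypothetical logits for three characters
for t in (1.0, 0.5, 0.4):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature drops from 1.0 to 0.4, more of the probability mass moves to the highest-scoring character, which makes the samples more predictable and less varied.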

In [61]:
inp = "Life is "
print(generate_text(model, inp))

Life is t ton sin an wo tindis by the by the angore o buth ondinde offes ng te t uno tore be an my o nde as 
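
One possible contributor to the jumbled output, offered as a hypothesis rather than a confirmed diagnosis: after the first step, the loop passes only the most recent character to the model, and the LSTM layer is created without `stateful=True`, so every call starts from a fresh state and sees a single character of context. A variant worth trying is to keep the whole generated sequence as context. The sketch below shows that loop in plain Python. `next_logits` is a hypothetical stand-in for "run the model on the context and take the logits at the last position", and `toy_next_logits` is a made-up rule used only to make the example runnable.

```python
import math
import random

def sample_id(logits, temperature, rng):
    """Sample an index from softmax(logits / temperature)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    weights = [math.exp(z - m) for z in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

def generate_ids(next_logits, start_ids, num_generate, temperature=0.4, seed=0):
    """Generate ids one at a time, always conditioning on the full context."""
    rng = random.Random(seed)
    context = list(start_ids)
    for _ in range(num_generate):
        logits = next_logits(context)   # whole context, not just the last id
        context.append(sample_id(logits, temperature, rng))
    return context

# Made-up stand-in for a trained model: it strongly prefers continuing the
# cycle 0 -> 1 -> 2 -> 0, which it can only do by looking at the context length.
def toy_next_logits(context):
    preferred = len(context) % 3
    return [10.0 if i == preferred else 0.0 for i in range(3)]

ids = generate_ids(toy_next_logits, start_ids=[0], num_generate=5)
print(ids)   # [0, 1, 2, 0, 1, 2]
```

In the notebook this would correspond to appending each `predicted_id` to `input_eval` instead of replacing it, at the cost of re-running the model on a growing sequence at every step.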


# References

Bozkurt, M. (2022a, October 15). Quotes by philosophers. Kaggle. https://www.kaggle.com/datasets/mertbozkurt5/quotes-by-philosophers/data?select=Arthur-Schopenhauer-Quotes.txt 

Bozkurt, M. (2022b, October 15). Simple text generation with an RNN. Kaggle. https://www.kaggle.com/code/mertbozkurt5/simple-text-generation-with-an-rnn 

Géron, A. (2017). Hands-on machine learning with Scikit-Learn and TensorFlow: Concepts, tools, and techniques to build intelligent systems. O’Reilly Media.