# Generating Text with Neural Networks


In this notebook, a generative AI model is built, trained, and put to use. Its purpose is to generate new text in the style of Shakespeare, based on a dataset of Shakespeare's writing.
It might also be possible to apply this network to other datasets, using other authors as a basis.
This project could, for example, find use in an interactive digital archive, helping people get a new view on famous authors. If developed further, it could become an interactive installation that lets people pretend to talk to, for example, Shakespeare.

The notebook is a step-by-step guide: it starts with getting and preparing the data, then builds and trains the network, and finally tries it out on an example.

# Getting the Data

To start generating our own text with a neural network, we need a dataset. For this example, a dataset of Shakespeare's text has been chosen.
In the following code cell, the dataset is imported so that we can work with it.
The library TensorFlow is imported as well, since it is needed to download the dataset file and for the machine learning process that follows.
The print statement in the second code cell helps to explore the data: it prints the first few lines of the imported text, which shows what the data looks like.

### 1. Getting the data for the network 

In [1]:
import tensorflow as tf 

shakespeare_url = "https://homl.info/shakespeare"  # shortcut URL
filepath = tf.keras.utils.get_file("shakespeare.txt", shakespeare_url)
with open(filepath) as f:
    shakespeare_text = f.read()

2023-11-29 12:04:46.152153: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


### 2. Visualizing data to help explore it 

In [2]:
print(shakespeare_text[:80]) # not relevant to machine learning but relevant to exploring the data

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.


# Preparing the Data

In the following, the data is prepared. For a neural network to work, the data often needs to have a specific form and adhere to certain rules. This is why it is important to prepare the data before feeding it into the network, to avoid errors in the machine learning process. In this case, for example, the text is first split by character and transformed into lowercase letters before being encoded. Encoding is necessary because neural networks can only work with numbers, not letters, so the text has to be transformed into numbers.

### 1. Encoding data

To encode the data, the text is first converted to lowercase and split into characters. The `TextVectorization` layer then maps each character to an integer ID and returns the result as a tensor, so that TensorFlow can work with it.

In [3]:
text_vec_layer = tf.keras.layers.TextVectorization(split="character",
                                                   standardize="lower")
text_vec_layer.adapt([shakespeare_text])
encoded = text_vec_layer([shakespeare_text])[0]

In [4]:
print(text_vec_layer([shakespeare_text]))

tf.Tensor([[21  7 10 ... 22 28 12]], shape=(1, 1115394), dtype=int64)
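
To make the encoding step more concrete, here is a minimal plain-Python sketch of character-level encoding. It is not how `TextVectorization` is implemented: the helper names `build_vocabulary` and `encode` are hypothetical, there are no reserved padding or unknown tokens, and the resulting IDs only mirror the general idea (lowercase, split by character, most frequent characters first).

```python
from collections import Counter

def build_vocabulary(text):
    """Return the distinct lowercase characters, most frequent first."""
    counts = Counter(text.lower())
    # Sort by descending count, then alphabetically so that ties are deterministic.
    return [ch for ch, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]

def encode(text, vocabulary):
    """Map every character of the text to its index in the vocabulary."""
    char_to_id = {ch: i for i, ch in enumerate(vocabulary)}
    return [char_to_id[ch] for ch in text.lower()]

sample = "Speak, speak."
vocabulary = build_vocabulary(sample)
ids = encode(sample, vocabulary)
decoded = "".join(vocabulary[i] for i in ids)  # decoding reverses the mapping

print(vocabulary)
print(ids)
print(decoded)
```

Decoding the IDs gives back the lowercase text, which shows that the encoding only changes the representation and loses nothing except the capitalisation.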


### 2. Tokenizing the encoded data

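In this step, 2 is subtracted from every token ID so that the IDs 0 (padding) and 1 (unknown), which are not used here, are dropped. The number of distinct characters and the total size of the dataset are then computed.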

In [5]:
encoded -= 2  # drop tokens 0 (pad) and 1 (unknown), which we will not use
n_tokens = text_vec_layer.vocabulary_size() - 2  # number of distinct chars = 39
dataset_size = len(encoded)  # total number of chars = 1,115,394

In [6]:
print(n_tokens, dataset_size)

39 1115394
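
The shift by 2 can be illustrated with a toy vocabulary. The sketch assumes, as in the Keras layer, that index 0 is reserved for padding and index 1 for unknown characters. The remaining entries are a shortened, hypothetical vocabulary, not the real 39-character one.

```python
# Toy vocabulary: the first two slots mimic the reserved padding and unknown tokens.
toy_vocabulary = ["", "[UNK]", " ", "e", "t", "o"]

layer_ids = [4, 5, 2, 3]                 # IDs as the layer would emit them for "to e"
model_ids = [i - 2 for i in layer_ids]   # IDs after a shift like `encoded -= 2`
n_toy_tokens = len(toy_vocabulary) - 2   # count of real characters only

# Going back from a shifted ID to a character needs the opposite shift (+2).
decoded = "".join(toy_vocabulary[i + 2] for i in model_ids)
print(model_ids, n_toy_tokens, decoded)
```

The same `+ 2` offset is needed whenever a predicted ID is looked up in `text_vec_layer.get_vocabulary()`.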


### 3. Preparation of data for network

In [7]:
def to_dataset(sequence, length, shuffle=False, seed=None, batch_size=32):
    ds = tf.data.Dataset.from_tensor_slices(sequence)
    ds = ds.window(length + 1, shift=1, drop_remainder=True)
    ds = ds.flat_map(lambda window_ds: window_ds.batch(length + 1))
    if shuffle:
        ds = ds.shuffle(100_000, seed=seed)
    ds = ds.batch(batch_size)
    return ds.map(lambda window: (window[:, :-1], window[:, 1:])).prefetch(1)
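
What `to_dataset` does to the sequence can be sketched in plain Python. The sketch below assumes the same windowing as above (windows of `length + 1` items, shifted by one, incomplete windows dropped) but leaves out shuffling, batching and prefetching. The helper name `make_windows` is hypothetical.

```python
def make_windows(sequence, length):
    """Return (input, target) pairs; the target is the input shifted by one step."""
    pairs = []
    # Each window holds length + 1 items; incomplete windows at the end are dropped.
    for start in range(len(sequence) - length):
        window = sequence[start:start + length + 1]
        pairs.append((window[:-1], window[1:]))
    return pairs

pairs = make_windows(list("to be"), length=3)
for inputs, targets in pairs:
    print(inputs, "->", targets)
```

Each target is the input moved one character to the right, so at every position the network learns to predict the next character.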

### 4. Definition of training, validation and test set 

In [8]:
length = 100
tf.random.set_seed(42)
train_set = to_dataset(encoded[:1_000_000], length=length, shuffle=True,
                       seed=42)
valid_set = to_dataset(encoded[1_000_000:1_060_000], length=length)
test_set = to_dataset(encoded[1_060_000:], length=length)
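
As a quick sanity check on the split sizes, the number of windows and batches per split can be worked out by hand. The sketch assumes the settings of `to_dataset` above (windows of `length + 1`, shift 1, incomplete windows dropped, last partial batch kept). The helper `count_batches` is hypothetical.

```python
import math

def count_batches(n_chars, length=100, batch_size=32):
    """Number of windows and batches that one pass over n_chars characters yields."""
    n_windows = max(n_chars - length, 0)            # windows of length + 1, shift 1
    n_batches = math.ceil(n_windows / batch_size)   # the last batch may be smaller
    return n_windows, n_batches

dataset_size = 1_115_394
splits = {
    "train": 1_000_000,
    "valid": 60_000,
    "test": dataset_size - 1_060_000,
}
for name, n_chars in splits.items():
    print(name, count_batches(n_chars))
```

For the training split this gives 31,247 batches per epoch, which matches the step count shown in the training output.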

# Building and Training the Model

In the following, the actual model is built and trained. To do so, several layers are added, each with its own purpose.
The model is then compiled, and the loss and accuracy are printed during training, so that it can be seen whether the model is performing well or not.
In the last line, the number of epochs is set to 10. An epoch is defined as one pass over the whole dataset, so in this case the model goes over the whole dataset 10 times. (Source: https://keras.io/getting_started/faq/#what-do-sample-batch-and-epoch-mean)

### 1. Building the model and training it 

In [9]:
tf.random.set_seed(42)  # extra code – ensures reproducibility on CPU
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=n_tokens, output_dim=16),
    tf.keras.layers.GRU(128, return_sequences=True),
    tf.keras.layers.Dense(n_tokens, activation="softmax")
])  # build the layers of the network: embedding, recurrent GRU layer, softmax output
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam",
              metrics=["accuracy"]) 
model_ckpt = tf.keras.callbacks.ModelCheckpoint(
    "my_shakespeare_model", monitor="val_accuracy", save_best_only=True)
history = model.fit(train_set, validation_data=valid_set, epochs=10,
                    callbacks=[model_ckpt])  # the checkpoint callback saves the model with the best validation accuracy

Epoch 1/10


2023-11-29 12:05:13.341855: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] ShuffleDatasetV3:19: Filling up shuffle buffer (this may take a while): 75997 of 100000
2023-11-29 12:05:15.977842: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:452] Shuffle buffer filled.


  31247/Unknown - 1876s 60ms/step - loss: 1.3934 - accuracy: 0.5731
INFO:tensorflow:Assets written to: my_shakespeare_model/assets


Epoch 2/10


2023-11-29 12:37:02.491466: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] ShuffleDatasetV3:19: Filling up shuffle buffer (this may take a while): 93625 of 100000


    2/31247 [..............................] - ETA: 38:16 - loss: 1.5932 - accuracy: 0.5150   

2023-11-29 12:37:03.192408: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:452] Shuffle buffer filled.




INFO:tensorflow:Assets written to: my_shakespeare_model/assets


Epoch 3/10


2023-11-29 13:08:44.274484: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] ShuffleDatasetV3:19: Filling up shuffle buffer (this may take a while): 70258 of 100000


    3/31247 [..............................] - ETA: 34:38 - loss: 1.6381 - accuracy: 0.5042   

2023-11-29 13:08:47.838295: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:452] Shuffle buffer filled.




INFO:tensorflow:Assets written to: my_shakespeare_model/assets


Epoch 4/10


2023-11-29 13:41:32.515837: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] ShuffleDatasetV3:19: Filling up shuffle buffer (this may take a while): 76712 of 100000


    2/31247 [..............................] - ETA: 38:05 - loss: 1.6262 - accuracy: 0.5025    

2023-11-29 13:41:37.235630: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:452] Shuffle buffer filled.




INFO:tensorflow:Assets written to: my_shakespeare_model/assets


Epoch 5/10


2023-11-29 14:12:54.579760: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] ShuffleDatasetV3:19: Filling up shuffle buffer (this may take a while): 95445 of 100000


    2/31247 [..............................] - ETA: 38:21 - loss: 1.6216 - accuracy: 0.5109   

2023-11-29 14:12:55.114123: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:452] Shuffle buffer filled.




### 2. Conclusion about training

Since it is a very large dataset, the training takes quite a long time. It is therefore helpful to use a more powerful device (if possible) and to make sure that the device does not run out of power, since that would mean starting the process all over again. It is also important to take the training time into account when planning a project that includes machine learning. Additionally, the notebook tends to throw resizing buffer errors when the training is run again, which costs even more time.
![image of error message](screenshots/Error.png)

Some research into the error led to the conclusion that when the error message appears in the terminal, the training is actually still running, but the Jupyter notebook no longer shows any output. The problem therefore probably lies with Jupyter Notebook rather than with the actual code. Several people in a GitHub thread have experienced similar issues, which is also where this conclusion stems from: https://github.com/tensorflow/tensorflow/issues/60309
Running the training on a laptop, for example, took 5.5 hours, but about 10 additional hours were needed because of failed training attempts.
While the time needed for each epoch can differ, it is fairly similar in this training run, always between roughly 1850 and 2000 seconds. In the training excerpt, the epochs and the time needed for them are circled in pink as an example of the time needed to train the model. Shown are the times of epochs 1, 2 and 3.
![time needed for epochs](screenshots/Training_time.png)
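
The reported training time can be roughly reconstructed from the numbers in the training output. This is only a back-of-the-envelope estimate: it assumes the rounded value of 60 ms per step holds for every step and every epoch, which the logs show is only approximately true.

```python
steps_per_epoch = 31_247     # step count from the training output
seconds_per_step = 0.060     # "60ms/step" in the training output (rounded)
epochs = 10

seconds_per_epoch = steps_per_epoch * seconds_per_step
total_hours = seconds_per_epoch * epochs / 3600
print(f"about {seconds_per_epoch:.0f} s per epoch, about {total_hours:.1f} h in total")
```

The estimate of a little over five hours is in line with the measured 5.5 hours, given that some epochs took closer to 2000 seconds.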

In general, it can be observed that the accuracy increases from epoch to epoch. It starts at around 0.57, which I found very surprising, since I had expected it to start much higher. The accuracy can be seen in the following excerpt of the training output:

![accuracy of batch 1](screenshots/accuracy.png)
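
To put an accuracy of about 0.57 into perspective, it can be compared with a model that guesses uniformly among the 39 characters. This is only a rough baseline under that assumption (always predicting the most frequent character would do somewhat better than uniform guessing). The loss and accuracy values are taken from the first epoch of the training output.

```python
import math

n_tokens = 39
first_epoch_loss = 1.3934        # cross-entropy reported for epoch 1
first_epoch_accuracy = 0.5731    # accuracy reported for epoch 1

uniform_accuracy = 1 / n_tokens          # chance that a uniform guess is correct
uniform_loss = math.log(n_tokens)        # cross-entropy of a uniform prediction
perplexity = math.exp(first_epoch_loss)  # "effective" number of equally likely choices

print(f"uniform guessing: accuracy {uniform_accuracy:.3f}, loss {uniform_loss:.3f}")
print(f"trained model:    accuracy {first_epoch_accuracy:.3f}, perplexity {perplexity:.2f}")
```

Seen this way, predicting the exact next character correctly more than half of the time is far above chance, even though the number itself may look low.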

### 3. Defining the Shakespeare model

In [22]:
shakespeare_model = tf.keras.Sequential([
    text_vec_layer,
    tf.keras.layers.Lambda(lambda X: X - 2),  # no <PAD> or <UNK> tokens
    model
])

### 4. Peer discussion about time factor

The experience with the time needed to execute all the code and train the model led to a peer discussion about the duration of the training and its practicality.
In the arts and humanities, projects might have limited resources: computing power, money, time, or all of the above. It is therefore important to think about how models might be made more time-efficient, or whether it would be useful for institutions to invest in the needed resources, be it computing power or time.
Whenever AI is used in the arts and humanities, it is important to weigh the use of resources against the performance of the model. Decreasing the training time, for example, might reduce the accuracy of the model. This leads to the conclusion that each project must define a threshold for the required balance between resources and performance. Projects do not have infinite resources, but at the same time a certain accuracy is needed to produce usable results.
It is recommended to think about this balance before starting an AI project in the field.

In this specific case, for example, the accuracy was 0.57, which was lower than expected. It might be increased by investing more time, for example by increasing the number of epochs. But this would also mean a higher time effort and therefore a higher cost in resources.

# Generating Text

In the following part, the built and trained model is put to use and made to output text. First, some variables need to be defined and then the model is tested, using varying inputs for the temperature.

### 1. Defining variables for testing

In [23]:
y_proba = shakespeare_model.predict(["To be or not to b"])[0, -1]
y_pred = tf.argmax(y_proba)  # choose the most probable character ID
text_vec_layer.get_vocabulary()[y_pred + 2]



'e'

In [24]:
# help with visualization 
print(y_proba, y_pred)

[2.1129138e-06 8.5561556e-01 2.2224115e-11 2.1224868e-02 3.9113373e-02
 3.6693189e-02 4.4674958e-10 4.6583000e-06 9.9596856e-03 2.9044875e-07
 4.9154664e-07 1.4862280e-02 5.6615965e-09 1.4430948e-02 3.1604199e-07
 8.0804434e-03 8.7489385e-09 1.6814655e-08 2.3241416e-11 1.0369784e-08
 2.8587053e-07 4.2391011e-06 4.2892214e-08 4.4083540e-10 1.4786215e-12
 2.5570140e-11 1.4088186e-08 3.6063730e-06 1.9268016e-09 4.4550044e-10
 1.8006867e-08 3.2127401e-09 3.5330561e-06 8.2488816e-12 1.3907974e-11
 6.3178342e-13 2.0576040e-14 1.7938722e-15 2.4389968e-20] tf.Tensor(1, shape=(), dtype=int64)


In [25]:
log_probas = tf.math.log([[0.5, 0.4, 0.1]])  # probas = 50%, 40%, and 10%
tf.random.set_seed(42)
tf.random.categorical(log_probas, num_samples=8)  # draw 8 samples

<tf.Tensor: shape=(1, 8), dtype=int64, numpy=array([[0, 1, 0, 2, 1, 0, 0, 1]])>

In [26]:
def next_char(text, temperature=1):
    y_proba = shakespeare_model.predict([text])[0, -1:]
    rescaled_logits = tf.math.log(y_proba) / temperature
    char_id = tf.random.categorical(rescaled_logits, num_samples=1)[0, 0]
    return text_vec_layer.get_vocabulary()[char_id + 2]

In [27]:
def extend_text(text, n_chars=50, temperature=1):
    for _ in range(n_chars):
        text += next_char(text, temperature)
    return text
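
The effect of `temperature` in `next_char` can be reproduced without the model. The sketch below applies the same rescaling (log-probabilities divided by the temperature, then renormalised) to the toy distribution from the sampling cell above (50%, 40%, 10%). The helper name `apply_temperature` is hypothetical, and the sketch assumes that all probabilities are strictly positive.

```python
import math

def apply_temperature(probas, temperature):
    """Rescale a probability distribution: softmax(log(p) / temperature)."""
    logits = [math.log(p) / temperature for p in probas]
    biggest = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(logit - biggest) for logit in logits]
    total = sum(weights)
    return [w / total for w in weights]

probas = [0.5, 0.4, 0.1]
for temperature in (0.01, 0.5, 1, 5, 100):
    rescaled = apply_temperature(probas, temperature)
    print(temperature, [round(p, 3) for p in rescaled])
```

A temperature of 1 leaves the distribution unchanged, values close to 0 put almost all of the probability on the most likely entry, and large values flatten the distribution towards uniform.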

### 2. Testing the model

In [28]:
tf.random.set_seed(42)  # extra code – ensures reproducibility on CPU

In [29]:
# example 1
print(extend_text("To be or not to be", temperature=0.01))

To be or not to be so longer than you shall be so of the strange fro


In [30]:
# example 2
print(extend_text("To be or not to be", temperature=1))

To be or not to betweeper than up?
what's my honour frame the fire, 


In [37]:
# example 3
print(extend_text("To be or not to be", temperature=100))

To be or not to be.s?,uf:mya,blqttobcaii.r;d:ngat!r!!ronv:z:z.hyd$s:


# Reflection:


### 1. Evaluating the result

The results are interesting, as they vary a lot depending on the value of the temperature. This leads to the conclusion that `temperature` is an important parameter for the generation of new text. While a temperature of 0.01 produces sensible results, a temperature of 100 only results in a string of random characters, as shown in the output of example 3.
![output of example 3](screenshots/Random_keys.png)

To investigate this further, some more examples with other temperature values can be found below:

In [32]:
# example 4
print(extend_text("To be or not to be", temperature=0.5))

To be or not to be so
of our love in the king of her business.

luci


In [33]:
# example 6
print(extend_text("To be or not to be", temperature=5))

To be or not to beju
cocio?
axs? nujtentiogati swhm, gicfet
wa meal,


In [34]:
# example 7
print(extend_text("To be or not to be", temperature=0.0000000001))

To be or not to be so longer than you shall be so of the strange fro


From these examples, it can be concluded that the lower the temperature, the more coherent the result. But at some point, the result probably does not change anymore, which can be seen in the last example (example 7). Its temperature is set to 0.0000000001, and it produces exactly the same result as example 1, whose temperature is set to 0.01. Both produce the sentence shown in the screenshot:

![output example 1 and 7](screenshots/example_output.png)

This means that there is a threshold somewhere below which the result no longer changes. There is probably also a threshold where the output changes from mainly making sense to mainly being nonsensical, but this can be very subjective, since some people might find creative ways to make sense of some of the sentences while others do not. Still, at some point the output can only be called random and the model no longer produces any valuable information. All of this depends on the temperature.
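
Why very low temperatures all give the same text can be checked with the probabilities printed for `y_proba` earlier. The sketch below takes the eight largest values from that output (all others are below 1e-5) and computes how likely the top character is to be sampled at different temperatures. It only covers this single prediction step, and the helper name `top_share` is hypothetical.

```python
# Eight largest probabilities from the printed `y_proba` output.
top_probas = [0.85561556, 0.039113373, 0.036693189, 0.021224868,
              0.014862280, 0.014430948, 0.0099596856, 0.0080804434]

def top_share(probas, temperature):
    """Probability of sampling the most likely entry after temperature rescaling."""
    # p ** (1 / T) equals exp(log(p) / T); normalising gives the rescaled distribution.
    powered = [p ** (1 / temperature) for p in probas]
    return max(powered) / sum(powered)

shares = [top_share(top_probas, t) for t in (1, 0.5, 0.1, 0.01)]
for temperature, share in zip((1, 0.5, 0.1, 0.01), shares):
    print(temperature, share)
```

For this step, the most likely character is already chosen in more than 99% of draws at a temperature of 0.5, and practically always at 0.1. Below that point the sampling behaves like always picking the most probable character, so lowering the temperature further cannot change the output. Positions where two characters are almost equally likely would need a lower temperature to reach the same point, so the threshold is gradual rather than a single fixed value.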

### 2. Using own data

I would be interested in using data from other authors, for example Jane Austen, to train the network. It would then be interesting to feed the exact same prompt to the different models and compare the resulting texts with each other, which would (hopefully) give a good example of what different authors might have written in response to specific prompts.
When using other data, the respective dataset would need to be prepared for the network. When using text, the data would need to be vectorized, encoded and tokenized for the machine to be able to work with it.

### 3. Ethical concerns

There are two main concerns when it comes to this code or similar projects. On the one hand, there is the issue of copyright. In the case of Shakespeare, his works have long been in the public domain, meaning that everyone can use them, but when using one's own data or texts by other authors, copyright might become an ethical and also a legal concern. Additionally, when it comes to dead people who cannot be asked for their consent anymore, it is always an ethical question to what extent their style (for example Shakespeare's writing style) or their voice and looks can or cannot be used nowadays. It therefore very much depends on the kind of project and on whether there are, for example, built-in limiting factors. When generating texts in the style of Shakespeare, for example, a limit could be built in that prevents the machine from producing offensive language when asked for it. People and their works need to be treated with respect, which is a big ethical concern that always needs to be evaluated when implementing a project such as this one.
The other big ethical concern about generative AI and AI in general is the environmental impact. Training and using models consumes a lot of electrical power and thereby a lot of resources. While AI in general can also help with solving environmental issues, its impact on the environment always needs to be taken into consideration when building, training and using a model. Recommended further reading is the article "The carbon impact of artificial intelligence" by Payal Dhar, which can be found under this link: https://www.nature.com/articles/s42256-020-0219-9#citeas (Source: Dhar, P. The carbon impact of artificial intelligence. Nat Mach Intell 2, 423–425 (2020). https://doi.org/10.1038/s42256-020-0219-9)

# Final conclusion:

In conclusion, the model is successful in generating new texts, but there are ways to improve its performance. On the one hand, the time it takes to train the model is not ideal for some projects; on the other hand, the accuracy of the model could also be better. It might therefore make sense to experiment with some parts of the training setup, for example the layers of the network, the batch size or the number of epochs, to achieve even better results. Still, the code works and is a good and easy way to get into machine learning in the arts and humanities, for example.