GUID: 2944376H

GitHub repository: https://github.com/Ge0rgie12/AI-Arts-Humanities

# Generating Text with Neural Networks


In this notebook, a generative AI model is built, trained and put to use. Its purpose is to generate new text in the style of Shakespeare, based on a dataset of his works.
The same network could probably also be applied to other datasets, using other authors as a basis. The notebook leads through the steps of the machine learning process. At the end, the trained model is tested by giving it the beginning of a Shakespeare quote and letting it continue the text.

This project could, for example, find use in an interactive digital archive project, giving people a new view on famous authors. If developed further, it could become an interactive installation that lets people pretend to talk to, for example, Shakespeare.

The notebook is a step-by-step guide: it starts with getting and preparing the data, then builds and trains the network, and finally tries the model out on an example.

![picture of William Shakespeare](Shakespeare.png)

Source: the picture is free to use without attribution under the Pixabay content licence


Image of William Shakespeare - will the model be able to imitate his writing style?
Let's find out!

# Getting the Data

To start generating our own text with a neural network, we need a dataset. For this example, a dataset of Shakespeare texts has been chosen. The data comes as a txt file, meaning it is in plain text format.
In the following code snippet, the dataset is downloaded and read in, so we can work with it.
The library TensorFlow is imported as well, since it is needed to get the dataset file and for the machine learning process that follows. TensorFlow is a very common library for machine learning and building neural networks.
The print statement in the second code cell helps with exploring the data: it prints the first 80 characters of the imported text, which helps users understand what the data looks like.

### 1. Getting the data for the network 

In [None]:
import tensorflow as tf 

shakespeare_url = "https://homl.info/shakespeare"  # shortcut URL
filepath = tf.keras.utils.get_file("shakespeare.txt", shakespeare_url)
with open(filepath) as f:
    shakespeare_text = f.read()

### 2. Visualizing data to help explore it 

In [None]:
print(shakespeare_text[:80]) # not relevant to machine learning but relevant to exploring the data

Visualizing data not only helps with exploring and better understanding it, but also with checking whether it was imported correctly. Sometimes the first errors can be found this way. When working with tabular data, for example, it is also useful to see whether there are any missing values.
In this case the data is in text form. The example above shows the first 80 characters of the dataset, as indicated by the slice [:80]; they are simply the first lines of a Shakespeare play.

# Preparing the Data

In the following, the data is prepared. For a neural network to work, the data often needs to have a specific form and adhere to certain rules. This is why it is important to prepare the data before feeding it into the network, to avoid errors in the machine learning process. In this case, for example, the text is first converted to lowercase and split into characters, before being encoded. Encoding is necessary because neural networks can only work with numbers, not letters, which is why the text has to be transformed into numbers.

The print statements are useful for validating the results, as they show the form of the transformed data and therefore whether the results are as expected.

### 1. Encoding data

To encode the data, the text is first converted to lowercase and split into characters. Each character is then mapped to a number, and the result is returned as a tensor, so TensorFlow can work with it.
Here, split="character" is one of the predefined ways of splitting the input data: the input text is split at each Unicode character. The explanation for this can be found in the TensorFlow documentation https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization.

The tf.keras.layers.TextVectorization layer is a Keras preprocessing layer that transforms natural language input into numerical data. More information on how it works and how to use the layer can be found in the same documentation page.

In [None]:
text_vec_layer = tf.keras.layers.TextVectorization(split="character",
                                                   standardize="lower")
text_vec_layer.adapt([shakespeare_text])
encoded = text_vec_layer([shakespeare_text])[0]

In [None]:
print(text_vec_layer([shakespeare_text]))
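
To make the encoding step more concrete, here is a minimal plain-Python sketch of character-level encoding. It is not how TextVectorization is implemented; the helper names `build_vocabulary` and `encode` and the sample string are made up for illustration. The real layer additionally reserves the IDs 0 and 1 for padding and unknown characters, which this sketch leaves out.

```python
from collections import Counter

def build_vocabulary(text):
    """Return the distinct lowercase characters, most frequent first."""
    counts = Counter(text.lower())
    # Sort by descending frequency, breaking ties alphabetically for stability.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [char for char, _ in ordered]

def encode(text, vocabulary):
    """Map every character of the text to its integer ID in the vocabulary."""
    char_to_id = {char: index for index, char in enumerate(vocabulary)}
    return [char_to_id[char] for char in text.lower()]

sample = "First Citizen"  # hypothetical sample, not read from the dataset
vocabulary = build_vocabulary(sample)
encoded_sample = encode(sample, vocabulary)
# Decoding reverses the mapping, which is a quick way to verify the encoding.
decoded_sample = "".join(vocabulary[index] for index in encoded_sample)

print(vocabulary)
print(encoded_sample)
print(decoded_sample)
```

Decoding the IDs back into text, as in the last lines, is a simple check that no information other than the letter case was lost.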

### 2. Tokenizing the encoded data

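In this step, the token IDs are shifted by 2, so that the two special tokens 0 (padding) and 1 (unknown characters), which are not needed here, are dropped. The number of distinct characters and the total number of characters in the dataset are also stored for later use.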

In [None]:
encoded -= 2  # drop tokens 0 (pad) and 1 (unknown), which we will not use
n_tokens = text_vec_layer.vocabulary_size() - 2  # number of distinct chars = 39
dataset_size = len(encoded)  # total number of chars = 1,115,394

In [None]:
print(n_tokens, dataset_size)

### 3. Preparation of data for network

In this section, a helper function is defined that turns a sequence of token IDs into a dataset of input windows and targets, where each target is the input shifted by one character. The batch size is set to 32. According to a thread on Stack Overflow, 32 seems to be considered a standard batch size, especially when starting out with training a model. https://stackoverflow.com/questions/35050753/how-big-should-batch-size-and-number-of-epochs-be-when-fitting-a-model

In [None]:
def to_dataset(sequence, length, shuffle=False, seed=None, batch_size=32):
    ds = tf.data.Dataset.from_tensor_slices(sequence)
    ds = ds.window(length + 1, shift=1, drop_remainder=True)
    ds = ds.flat_map(lambda window_ds: window_ds.batch(length + 1))
    if shuffle:
        ds = ds.shuffle(100_000, seed=seed)
    ds = ds.batch(batch_size)
    return ds.map(lambda window: (window[:, :-1], window[:, 1:])).prefetch(1)
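
The windowing logic in to_dataset can be hard to picture. The following plain-Python sketch mimics it on a tiny list, without TensorFlow, shuffling or batching. The helper name `make_windows` is hypothetical and only serves the illustration.

```python
def make_windows(sequence, length):
    """Slide a window of length + 1 over the sequence, one step at a time,
    and split each window into an input and a target shifted by one."""
    pairs = []
    for start in range(len(sequence) - length):
        window = sequence[start:start + length + 1]
        # Input: everything but the last item. Target: everything but the first.
        pairs.append((window[:-1], window[1:]))
    return pairs

pairs = make_windows([0, 1, 2, 3, 4, 5], length=3)
for inputs, targets in pairs:
    print(inputs, "->", targets)
```

Each target is the input moved one position to the right, so at every position the network learns to predict the character that comes next.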

### 4. Definition of training, validation and test set 

In this part, the dataset is split into a training, a validation and a test set. This is done using the indices of the encoded data: the dataset is simply cut at two points to make the three sets.
The three sets are needed for different steps in the machine learning process. The training set is used to actually train the network, after which it is able to produce an output. The validation set is used to check the accuracy of the network during and after training. Depending on the result, the network might have to be adjusted and trained again on the training set, and this cycle can be repeated a couple of times. After several rounds of adjusting, however, the network may have become tuned to the validation set, meaning it performs well on it partly because the adjustments were chosen with that data in mind and not because it generalizes well. Therefore, once the network is considered finished, with good accuracy and performance, the test set is used to evaluate it again on unseen data. The test set is usually only used once, because its whole purpose is to contain previously unseen data that shows whether the network is actually performing well or has simply adapted to the validation data.

In [None]:
length = 100
tf.random.set_seed(42)
train_set = to_dataset(encoded[:1_000_000], length=length, shuffle=True,
                       seed=42)
valid_set = to_dataset(encoded[1_000_000:1_060_000], length=length)
test_set = to_dataset(encoded[1_060_000:], length=length)
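
As a quick check of the split, the sizes of the three slices can be worked out from the indices above. The sketch below assumes the total of 1,115,394 characters stated in the tokenizing step; it only does arithmetic and does not touch the actual dataset.

```python
dataset_size = 1_115_394  # total number of characters, as stated in the notebook

# Sizes that follow from the two split points 1,000,000 and 1,060,000
train_size = 1_000_000
valid_size = 1_060_000 - 1_000_000
test_size = dataset_size - 1_060_000

for name, size in [("train", train_size), ("valid", valid_size), ("test", test_size)]:
    print(f"{name}: {size:>9,} characters ({size / dataset_size:.1%})")
```

Roughly nine tenths of the text are used for training, and the remaining tenth is shared about equally between validation and testing.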

# Building and Training the Model

In the following, the actual model is built and trained. To do so, several layers are added, each with its own use. While understanding each layer is not necessary to understand machine learning in the context of the arts and humanities, further information on them is provided in the TensorFlow documentation.
The model is then compiled, and the loss and accuracy are printed while the model is training, so it can be seen whether the model is performing well or not.
In the last line, the number of epochs is set to 10. An epoch is defined as one pass over the whole dataset, so in this case the model goes over the whole training set 10 times. (Source: https://keras.io/getting_started/faq/#what-do-sample-batch-and-epoch-mean)
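
To get a feeling for how much work one epoch is, the number of batches per epoch can be estimated from the values used in this notebook: a training slice of 1,000,000 characters, sequences of 100 characters and a batch size of 32. This is a back-of-the-envelope sketch, not output produced by the training itself.

```python
import math

train_chars = 1_000_000  # size of the training slice
length = 100             # characters per input sequence
batch_size = 32

# A window holds length + 1 characters and moves one character at a time.
n_windows = train_chars - (length + 1) + 1
# The last batch may be smaller than batch_size, hence the rounding up.
steps_per_epoch = math.ceil(n_windows / batch_size)

print(n_windows, steps_per_epoch)
```

Under these assumptions, every epoch processes close to a million overlapping sequences in more than 31,000 batches, which helps explain the long training time.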

### 1. Building the model and training it 

In [None]:
tf.random.set_seed(42)  # extra code – ensures reproducibility on CPU
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=n_tokens, output_dim=16),
    tf.keras.layers.GRU(128, return_sequences=True),
    tf.keras.layers.Dense(n_tokens, activation="softmax")
]) #building the layers of the network
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam",
              metrics=["accuracy"]) 
model_ckpt = tf.keras.callbacks.ModelCheckpoint(
    "my_shakespeare_model", monitor="val_accuracy", save_best_only=True)
# train for 10 epochs; the checkpoint callback saves the model with the best
# validation accuracy
history = model.fit(train_set, validation_data=valid_set, epochs=10,
                    callbacks=[model_ckpt])

### 2. Conclusion about training

Since it is a very large dataset, the training takes quite a long time. It is therefore helpful to use a more powerful device (if possible) and to make sure that the device does not run out of power, since that would mean starting the process all over again. It is also important to take the training time into account when planning a project that includes machine learning. Additionally, the notebook tends to throw resizing buffer errors when the training is run again, which costs even more time.
![image of error message](screenshots/Error.png)

Some research into the error led to the conclusion that when the error message occurs in the terminal, the training is actually still running, but the Jupyter notebook is not showing any output anymore. The problem therefore probably lies with Jupyter Notebook rather than with the actual code. Several people in a GitHub thread have experienced similar issues, which is also where this conclusion stems from: https://github.com/tensorflow/tensorflow/issues/60309. In the terminal, it can still be observed when the next epoch starts; the Jupyter notebook interface simply does not show the output anymore. After the training has finished, the changes have to be saved, and then the notebook needs to be shut down and reopened. After that, the trained model can be used in the further process and the following cells show output again, which was not possible before restarting the notebook.
Running the training on a laptop, for example, took 5.5 hours. About 20 additional hours were needed because of failed training attempts. In the last attempt, by coincidence, the complete training output was actually shown in the notebook. Through this it was possible to observe that the accuracy went up to 0.6. Being able to observe the development of the accuracy and loss was the reason to attempt the training several times, in spite of the time needed to do so, because the accuracy is very telling for how well a model will perform.
While the time needed for each epoch can differ, it is quite similar in this training example, always between about 1850 and 2000 seconds. In the training excerpt, the epochs and the time needed for them are circled in pink, as an example of the time needed for training the model. Shown are the times of epochs 1, 2 and 3.
![time needed for epochs](screenshots/Training_time.png)
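
The total training time can be estimated from the time per epoch. The short calculation below uses the observed range of about 1850 to 2000 seconds per epoch and the 10 epochs set in the training cell; it is only a rough estimate for the laptop used here.

```python
epochs = 10
seconds_low, seconds_high = 1850, 2000  # observed range per epoch

hours_low = epochs * seconds_low / 3600
hours_high = epochs * seconds_high / 3600
print(f"Estimated training time: {hours_low:.1f} to {hours_high:.1f} hours")
```

The estimate is in line with the roughly 5.5 hours that one complete training run took.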

In general, it can be observed that the accuracy increases from epoch to epoch. It starts at around 0.57, which for me was very surprising: I had expected it to start much higher, since in the "machine learning by example from start to end" exercise it had already started at around 0.94. This leads to the conclusion that the model could use some work to make it more accurate, but at the same time the accuracy is increasing during the training process, meaning the model is already a good start. How well it actually works will be shown later in the notebook, when testing it on some example cases. The accuracy can be seen in the following excerpt of the training output:

![accuracy of batch 1](screenshots/accuracy.png)
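
For a character-level model, the accuracy is the share of positions at which the most probable predicted character matches the actual next character. The following sketch illustrates this with two made-up strings; `accuracy` is a hypothetical helper and the "predicted" text is invented, not real model output.

```python
def accuracy(predicted, actual):
    """Share of positions where the predicted character matches the actual one."""
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)

actual = "to be or not to be"
predicted = "to he or nat ta be"  # hypothetical model output

print(round(accuracy(predicted, actual), 2))
```

An accuracy of 0.6 therefore means that the model's top guess for the next character is right in about six out of ten cases.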

### 3. Defining the Shakespeare model

In [None]:
shakespeare_model = tf.keras.Sequential([
    text_vec_layer,
    tf.keras.layers.Lambda(lambda X: X - 2),  # no <PAD> or <UNK> tokens
    model
])

### 4. Peer discussion about time factor

The experience with the time needed to execute all the code and train the model led to a peer discussion about the duration of the training and its practicality.
In the arts and humanities, projects might have limited resources: computing power, money, time, or all of the above. It is therefore important to think about how models might be made more time-efficient, or whether it would be useful for institutions to invest in the needed resources, be it computing power or time.
Whenever AI is used in the arts and humanities, it is important to weigh the use of resources against the performance of the model. Decreasing the training time, for example, might impact the accuracy of the model negatively. This leads to the conclusion that for each project, a threshold must be set for the required balance between resources and performance. Projects do not have infinite resources, but at the same time, a certain accuracy of the model is needed to produce usable results. Another option would be to reduce the size of the data fed to the model: each epoch means going over each data point in the training set, so a smaller training set means each epoch takes less time. But the model would then have less data to train on, which could also result in a lower accuracy and worse performance.
It is recommended to think about this balance before starting an AI project in the field.

In this specific case, for example, the accuracy started at 0.57, which was lower than expected. The highest accuracy observed was 0.6 after the tenth epoch. It might be increased by investing more time, for example by increasing the number of epochs. But this would also mean a higher time effort and therefore cost more resources.

Another factor could be the batch size. In 2016, user Lucas Ramadan listed some general rules about batch size in a thread on Stack Overflow. According to these, a larger batch size might help to make the training process faster, but not in all cases. https://stackoverflow.com/questions/35050753/how-big-should-batch-size-and-number-of-epochs-be-when-fitting-a-model

# Generating Text

In the following part, the built and trained model is put to use and made to output text. First, some variables need to be defined and then the model is tested, using varying inputs for the temperature.

### 1. Defining variables for testing

In [None]:
y_proba = shakespeare_model.predict(["To be or not to b"])[0, -1]
y_pred = tf.argmax(y_proba)  # choose the most probable character ID
text_vec_layer.get_vocabulary()[y_pred + 2]

In [None]:
# help with visualization 
print(y_proba, y_pred)

As in the previous steps, print statements like the one above help with visualizing and verifying the results of code that has already been executed.

In the following code cell, the logarithm of three example probabilities (50%, 40% and 10%) is computed, and 8 samples are drawn at random according to these probabilities.

In [None]:
log_probas = tf.math.log([[0.5, 0.4, 0.1]])  # probas = 50%, 40%, and 10%
tf.random.set_seed(42)
tf.random.categorical(log_probas, num_samples=8)  # draw 8 samples
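
What drawing samples from such a distribution means can also be shown with the Python standard library alone. The sketch below is an illustration under that assumption; it does not reproduce the exact random draws of tf.random.categorical, only the idea that each class index comes up about as often as its probability says.

```python
import random
from collections import Counter

probas = [0.5, 0.4, 0.1]  # same probabilities as in the cell above
rng = random.Random(42)   # seeded generator so the run is repeatable

# Draw many class indices; index 0 should come up about half of the time.
samples = rng.choices(range(len(probas)), weights=probas, k=10_000)
counts = Counter(samples)
frequencies = {index: counts[index] / len(samples) for index in range(len(probas))}

print(frequencies)
```

With only 8 samples, as in the cell above, the observed shares can deviate a lot from 50%, 40% and 10%; with many samples they get close.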

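In the following part of this step, the function next_char is defined. It takes a text and a temperature as arguments, predicts the probabilities for the next character, rescales them with the temperature, samples one character ID and returns the corresponding character from the vocabulary.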

In [None]:
def next_char(text, temperature=1):
    y_proba = shakespeare_model.predict([text])[0, -1:]
    rescaled_logits = tf.math.log(y_proba) / temperature
    char_id = tf.random.categorical(rescaled_logits, num_samples=1)[0, 0]
    return text_vec_layer.get_vocabulary()[char_id + 2]
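
The effect of the temperature can be made visible without the trained model. In next_char, the log-probabilities are divided by the temperature, and tf.random.categorical then treats the result as unnormalized log-probabilities. The plain-Python sketch below does the same rescaling and normalizes the result, so the probabilities can be compared directly. The function name `apply_temperature` and the three example probabilities are assumptions made for this illustration.

```python
import math

def apply_temperature(probas, temperature):
    """Divide the log-probabilities by the temperature and renormalize."""
    scaled = [math.log(p) / temperature for p in probas]
    highest = max(scaled)  # subtracting the maximum avoids overflow in exp
    weights = [math.exp(s - highest) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

probas = [0.5, 0.4, 0.1]  # hypothetical probabilities for three characters
for temperature in (0.01, 0.5, 1, 5, 100):
    rescaled = apply_temperature(probas, temperature)
    print(temperature, [round(p, 3) for p in rescaled])
```

A temperature of 1 leaves the probabilities unchanged. Very low temperatures push almost all probability onto the most likely character, while very high temperatures make all characters nearly equally likely.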

In the last step of this part, the function extend_text is defined. It takes the arguments text, n_chars=50 and temperature=1. It calls next_char n_chars times, appending each new character to text, and returns the extended text.

In [None]:
def extend_text(text, n_chars=50, temperature=1):
    for _ in range(n_chars):
        text += next_char(text, temperature)
    return text
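
The important idea in extend_text is that the text generated so far is fed back in to predict the next character. The toy sketch below shows only this loop. The function `toy_next_char` is a hypothetical stand-in that continues a fixed pattern instead of asking the trained model.

```python
def toy_next_char(text):
    """Hypothetical stand-in for the model: always continue a fixed pattern."""
    pattern = "abc"
    return pattern[len(text) % len(pattern)]

def toy_extend_text(text, n_chars=5):
    for _ in range(n_chars):
        # The text grown so far is the input for the next prediction.
        text += toy_next_char(text)
    return text

print(toy_extend_text("ab", n_chars=4))
```

In the real function, the stand-in is replaced by a prediction of the trained model, so every new character depends on everything generated before it.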

### 2. Testing the model

In [None]:
tf.random.set_seed(42)  # extra code – ensures reproducibility on CPU

In [None]:
# example 1
print(extend_text("To be or not to be", temperature=0.01))

In [None]:
# example 2
print(extend_text("To be or not to be", temperature=1))

In [None]:
# example 3
print(extend_text("To be or not to be", temperature=100))

# Reflection:


### 1. Evaluating the result

The results are interesting, as they vary a lot depending on the value of the temperature. This leads to the conclusion that "temperature" is an important argument for the generation of new text. While a temperature of 0.01 produces sensible results, a temperature of 100 only results in a string of random keys, as shown in the output of example 3.
![output of example 3](screenshots/Random_keys.png)

To investigate this further, some more examples with temperatures between and beyond the values above can be found below:

In [None]:
# example 4
print(extend_text("To be or not to be", temperature=0.5))

In [None]:
# example 6
print(extend_text("To be or not to be", temperature=5))

In [None]:
# example 7
print(extend_text("To be or not to be", temperature=0.0000000001))

From these examples, it can be concluded that the lower the temperature, the better the result. But at some point, the result probably does not change anymore, which can be seen in the last example (example 7). Its temperature is set to 0.0000000001 and it produces the exact same result as example 1, whose temperature is set to 0.01. Both produce the sentence shown in the screenshot:

![output example 1 and 7](screenshots/example_output.png)

This means that there is a threshold somewhere, below which the result does not change anymore. A likely explanation is that at very low temperatures the model practically always picks the most probable character, so lowering the temperature further makes no difference. There is probably also a threshold where the output changes from mainly making sense to mainly being nonsensical, but this can be very subjective, since some people might find creative ways to make sense of some of the sentences while others do not. Still, at some point the output can only be called random and the model does not produce any valuable information anymore. This all depends on the temperature.

### 2. Using own data

I would be interested in using data from other authors, for example Jane Austen, to train the network. It would then be interesting to feed the exact same prompt to the different models, compare the resulting texts with each other and (hopefully) get a good example of what different authors would have written in response to specific prompts. In the larger context of possible projects this network could be useful for, an interactive exhibition with several different authors could be a great experience. This project would therefore be useful for cultural heritage institutions, especially literary ones such as the British Library.
When using other data, the respective dataset would need to be prepared for the network. When using text, the data would need to be vectorized, encoded and tokenized for the machine to be able to work with it. Since every dataset is different, this is the part that will probably need the most adjustment.

Furthermore, this project could be extended beyond the area of authors and literature. To do so, though, more parts of the training might have to be adjusted.

### 3. Ethical concerns

There are two main concerns when it comes to this code or similar projects. On the one hand, there is the issue of copyright. In the case of Shakespeare, his works are by now in the public domain, meaning that everyone can use them, but when using one's own data or texts by other authors, copyright might become an ethical and also legal concern. Additionally, when it comes to dead people who can no longer be asked for their consent, it is always an ethical question to what extent their style (for example Shakespeare's writing style) or their voice and looks can or cannot be used nowadays. It therefore very much depends on the kind of project and whether there are, for example, built-in limiting factors. When thinking of Shakespeare and generating texts in his style, for instance, a limit could be built in that prevents the machine from putting out bad language when being asked for it. People and their works need to be treated with respect, which is a big ethical concern that always needs to be evaluated when implementing a project such as this one.
The other big ethical concern about generative AI and AI in general is the environmental impact. Training and using models uses up a lot of electrical power and thereby a lot of resources. While AI in general can also help with solving environmental issues, its impact on the environment always needs to be taken into consideration when building, training and using a model. Recommended further reading is the article "The carbon impact of artificial intelligence" by Payal Dhar, which can be found under this link: https://www.nature.com/articles/s42256-020-0219-9#citeas (Source: Dhar, P. The carbon impact of artificial intelligence. Nat Mach Intell 2, 423–425 (2020). https://doi.org/10.1038/s42256-020-0219-9)

# Final conclusion:

In conclusion, the model is successful in generating new texts, but there are ways to improve the performance. On the one hand, the time it takes for the model to train is not ideal for some projects, and on the other hand, the accuracy of the model could also be better. It might therefore make sense to play around with some parts of the model and its training, for example the layers of the network, the batch size or the number of epochs, to achieve even better results. Still, the code works and is a good and easy way to, for example, get into machine learning in the arts and humanities.

This concludes the exercise on generating text with a neural network!

![image of blackboard with well done written on it](well_done.png)

Source: the picture is free to use without attribution under the Pixabay content licence

Recommended resources for further reading:

- http://stackoverflow.com
- The tensorflow documentation https://www.tensorflow.org
- And a personal favorite, the website https://www.geeksforgeeks.org
