# 3 Assignment RNN / LSTM DMML
Thomas Delissen - November 2024

This assignment is based on Andrej Karpathy's blog post: https://karpathy.github.io/2015/05/21/rnn-effectiveness/

That post might give you some useful hints. Note that you cannot simply copy the code from his GitHub, since the assignment is to create an LSTM using Keras rather than to implement an RNN by hand.

The goal is to create an LSTM using Keras that can generate long sequences of characters, hopefully resembling the input text. So this will be an NLP sequence model that generates text one character at a time.

This exercise uses the file tinyshakespeare_karpathy.txt.

In [1]:
# The following libraries might come in handy
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, GRU, SimpleRNN, Dense
from tensorflow.keras.utils import to_categorical

# data I/O - code adapted from Andrej Karpathy's GitHub
with open('tinyshakespeare_karpathy.txt', 'r') as f:
    data = f.read() # should be simple plain text file
chars = list(set(data))
data_size, vocab_size = len(data), len(chars)
char_to_ix = { ch:i for i,ch in enumerate(chars) }
ix_to_char = { i:ch for i,ch in enumerate(chars) }

print("Data size: " + str(data_size))
print("Vocab size: " + str(vocab_size))
print("Available Characters: " + str(chars))

Data size: 1115394
Vocab size: 65
Available Characters: ['r', 'i', 'S', 'd', '?', 'u', 'V', 'b', 'Y', 'T', 'D', 'Q', 'E', 'J', 'c', '$', 'H', ',', 'N', 'o', 'p', 'l', 'f', 'X', 'B', 'a', 'F', 'K', '&', '\n', 'n', ';', 'k', 'q', 'h', '-', 'A', 'W', 'L', 'P', 'U', ' ', ':', 't', 'j', 'v', 'C', '!', 'g', 'm', 'y', 'z', "'", 'e', 'R', 'I', 'O', 'G', '3', 'x', 's', 'w', '.', 'M', 'Z']


Dictionaries char_to_ix and ix_to_char can be used to convert the character to a number and vice versa. 

In [2]:
print(char_to_ix["E"])
print(ix_to_char[2])

12
S
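
As a small illustration of how the two dictionaries work together, here is a minimal sketch of an encode/decode round trip. The toy text and the helper names `encode` and `decode` are assumptions made for this example and are not part of the assignment code. The sketch builds the vocabulary with `sorted(set(...))` rather than `list(set(...))`: the iteration order of a set of strings can change between Python sessions, so indices such as 12 for "E" above are not guaranteed to be the same on every run.

```python
# Toy text standing in for the Shakespeare file (assumption for illustration)
sample = "First Citizen:\nSpeak, speak."

# sorted() makes the index assignment reproducible between runs
chars = sorted(set(sample))
char_to_ix = {ch: i for i, ch in enumerate(chars)}
ix_to_char = {i: ch for i, ch in enumerate(chars)}

def encode(text):
    """Map each character to its integer index."""
    return [char_to_ix[c] for c in text]

def decode(indices):
    """Map integer indices back to a string."""
    return "".join(ix_to_char[i] for i in indices)

encoded = encode("speak")
decoded = decode(encoded)
print(encoded)
print(decoded)
```

A stable ordering matters mostly when a trained model is saved and reloaded later, because the model only knows characters by their index.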


So, the data has now been read into your notebook. Your assignment is the following:

1. Prepare the data for training an LSTM

We will take an input sequence length of 40 characters. Proposed solution: Create an X and a y variable of type numpy array, where each row in X contains 40 characters, encoded as numbers, and each row in y contains the next character, also encoded as a number. Once the numpy arrays have been created, use the to_categorical function from Keras to one-hot encode the arrays. We do not need to split off a test set, since we will not evaluate the LSTM, just use it to generate text.

2. Design an LSTM for the training data you just created, compile it and train it. 

You can decide on the architecture yourself; it can be as simple or as complex as you like. Proposal: Your final layer will most likely be a multiclass classifier of size vocab_size, so you should use the appropriate activation function for that.

3. Create a function (def) that can iteratively call your RNN / LSTM

The function should expect as its first input an initial string (of 40 characters), so that you can generate the first next character. The second parameter of the function should be the number of tokens you would like to generate. Inside the function there should be a for loop that iteratively calls your model. Append each generated token to the string and feed the string back into your LSTM to generate the next token. Output the complete generated text.

4. Demonstrate your function

Call your function with a string you defined yourself, and let it generate 200 characters. Hopefully, it resembles syntactically correct English. 

For reference, here is a text that my model generated:

____________________________
First Citizen:
Before we proceed any further with the cours.

PETRUCHIO:
What shall be the death the strength the strength.

KING RICHARD III:
The shall be the death the strength the strength.

KING RICHARD III:
The shall be the death the s

_____________________________

That is it! You are allowed to use parts of the GitHub code I recommended, and you are also allowed to use ChatGPT. But, very important: you must be able to explain your code, so only use material from ChatGPT that you understand. The model must be created using Keras and TensorFlow. You may deviate from the detailed instructions above, as long as you can explain why you did so and your model is still able to generate text. If you do not manage to complete the entire assignment (perhaps only parts 1 and 2), please still submit your work; you will get points for it.

#### 1) Prepare the data for training an LSTM
We will take an input sequence length of 40 characters. Proposed solution: Create an X and a y variable of type numpy array, where each row in X contains 40 characters, encoded as numbers, and each row in y contains the next character, also encoded as a number. Once the numpy arrays have been created, use the to_categorical function from Keras to one-hot encode the arrays. We do not need to split off a test set, since we will not evaluate the LSTM, just use it to generate text.

In [3]:
X, y = [], []

seq_length = 40
# Slide a window of seq_length characters over the text;
# the character right after each window is the prediction target.
for i in range(seq_length, len(data) - 1):
    row = data[i - seq_length: i]
    predict_char = data[i]
    predict_char_encoded = char_to_ix[predict_char]
    row_encoded = [char_to_ix[c] for c in row]

    X.append(row_encoded)
    y.append(predict_char_encoded)

X = np.array(X)
y = np.array(y)
print(X.shape)
print(X)

# One-hot encode the targets only; X stays as integer indices
y = to_categorical(y, num_classes=vocab_size)



(1115353, 40)
[[26  1  0 ... 22  5  0]
 [ 1  0 60 ...  5  0 43]
 [ 0 60 43 ...  0 43 34]
 ...
 [ 1 53 17 ... 25 32  1]
 [53 17 41 ... 32  1 30]
 [17 41  0 ...  1 30 48]]
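
The task description suggests one-hot encoding both arrays, while the cell above one-hot encodes only y and keeps X as integer indices. One-hot encoding the full X would produce 1115353 × 40 × 65 values, roughly 11.6 GB in float32, so the following is only a sketch on a toy string. The text, the shortened `seq_length`, and the use of `np.eye` in place of `to_categorical` (to keep the example free of a TensorFlow dependency) are all assumptions made for illustration.

```python
import numpy as np

text = "hello world"            # toy stand-in for the full corpus
seq_length = 4                  # 40 in the assignment; shortened here
chars = sorted(set(text))
vocab_size = len(chars)
char_to_ix = {ch: i for i, ch in enumerate(chars)}

encoded = np.array([char_to_ix[c] for c in text])

# Each row of X is a window of seq_length indices; y is the index that follows it
X = np.array([encoded[i - seq_length:i] for i in range(seq_length, len(encoded))])
y = encoded[seq_length:]

# One-hot encode with an identity matrix: row k of np.eye is the one-hot vector for index k
X_onehot = np.eye(vocab_size, dtype=np.float32)[X]
y_onehot = np.eye(vocab_size, dtype=np.float32)[y]

print(X.shape, X_onehot.shape, y_onehot.shape)
```

With one-hot inputs, the LSTM's input shape would be (seq_length, vocab_size) instead of (seq_length, 1).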


#### 2) Design an LSTM for the training data you just created, compile it and train it.

You can decide on the architecture yourself; it can be as simple or as complex as you like. Proposal: Your final layer will most likely be a multiclass classifier of size vocab_size, so you should use the appropriate activation function for that.

In [64]:
# Add a feature axis: the LSTM expects input of shape (samples, timesteps, features)

X = X.reshape((X.shape[0], X.shape[1], 1))

model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2])))
model.add(Dense(vocab_size, activation='softmax'))

# we use categorical_crossentropy as loss function, since we are dealing with a multiclass classification problem and not a regression problem
model.compile(loss='categorical_crossentropy', optimizer='adam')
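
To make the choice of softmax and categorical cross-entropy more concrete, here is a minimal NumPy sketch of what the two compute for a single prediction. The three-class scores and the function names are hypothetical, chosen only for illustration; Keras has its own implementations of both.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)   # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def categorical_crossentropy(y_true_onehot, y_pred):
    """Negative log-probability assigned to the true class."""
    return -np.sum(y_true_onehot * np.log(y_pred))

logits = np.array([2.0, 1.0, 0.1])      # hypothetical scores for a 3-character vocabulary
probs = softmax(logits)
target = np.array([1.0, 0.0, 0.0])      # the true next character is class 0
loss = categorical_crossentropy(target, probs)
print(probs, loss)
```

The loss is small when the model puts high probability on the correct character and grows as that probability shrinks.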


In [68]:
model.fit(X, y, epochs=20, batch_size=32, validation_split=0.2)

Epoch 1/20
27884/27884 ━━━━━━━━━━━━━━━━━━━━ 675s 24ms/step - loss: 1.6686 - val_loss: 2.0622
Epoch 2/20
27884/27884 ━━━━━━━━━━━━━━━━━━━━ 669s 24ms/step - loss: 1.6585 - val_loss: 2.0492
Epoch 3/20
27884/27884 ━━━━━━━━━━━━━━━━━━━━ 666s 24ms/step - loss: 1.6464 - val_loss: 2.0674
Epoch 4/20
27884/27884 ━━━━━━━━━━━━━━━━━━━━ 661s 24ms/step - loss: 1.6413 - val_loss: 2.0431
Epoch 5/20
27884/27884 ━━━━━━━━━━━━━━━━━━━━ 662s 24ms/step - loss: 1.6353 - val_loss: 2.0451
Epoch 6/20
27884/27884 ━━━━━━━━━━━━━━━━━━━━ 660s 24ms/step - loss: 1.6561 - val_loss: 2.0574
Epoch 7/20
27884/27884 ━━━━━━━━━━━━━━━━━━━━ 662s 24ms/step - loss: 1.6317 - val_loss: 2.0645
Epoch 8/20
27884/27884 ━━━━━━━━━━━━━━━━━━━━ 672s 24ms/step - loss: 1.6212 - v

<keras.src.callbacks.history.History at 0x1ada7951af0>
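
The numbers in the training log can be checked with a little arithmetic. The sketch below recomputes the 27884 steps per epoch from the data size, batch size and validation split, and compares the logged losses with the loss of a model that guesses uniformly over all 65 characters. The variable names are chosen for this example only.

```python
import math

n_samples = 1115353        # rows in X, from the shape printed earlier
validation_split = 0.2
batch_size = 32
vocab_size = 65

# Keras holds out the last 20% of the rows for validation and trains on the rest
n_train = math.floor(n_samples * (1 - validation_split))
steps_per_epoch = math.ceil(n_train / batch_size)

# A model that guesses uniformly over all 65 characters would have loss ln(65)
uniform_loss = math.log(vocab_size)

# exp(loss) is the perplexity: the effective number of equally likely choices per character
train_perplexity = math.exp(1.6686)   # epoch 1 training loss from the log
val_perplexity = math.exp(2.0622)     # epoch 1 validation loss from the log

print(steps_per_epoch, round(uniform_loss, 3))
print(round(train_perplexity, 2), round(val_perplexity, 2))
```

Both logged losses are well below ln(65), so the model has learned something, but the validation loss stays around 2.05 while the training loss slowly falls, which suggests that further epochs mainly fit the training portion of the text.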

#### 3) Create a function (def) that can iteratively call your RNN / LSTM

The function should expect as its first input an initial string (of 40 characters), so that you can generate the first next character. The second parameter of the function should be the number of tokens you would like to generate. Inside the function there should be a for loop that iteratively calls your model. Append each generated token to the string and feed the string back into your LSTM to generate the next token. Output the complete generated text.

In [70]:
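def generate_text(text: str, tokens: int):
    x = text
    output = ""

    for i in range(tokens):
        # keep only the last seq_length characters as the model's input window
        x = x[-seq_length:]
        x_encoded = [char_to_ix[c] for c in x]
        # shape (1, timesteps, 1), matching the (samples, timesteps, features) layout used for training
        x_encoded = np.array(x_encoded).reshape(1, len(x_encoded), 1)
        prediction = model.predict(x_encoded)
        # greedy decoding: always pick the most probable next character
        index = np.argmax(prediction)
        x += ix_to_char[index]
        output += ix_to_char[index]

    return text + output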
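
Picking the most probable character at every step (greedy decoding with `np.argmax`) tends to get stuck in loops such as the repeated "the strength the strength" in the reference text. A common alternative, also discussed in Karpathy's post, is to sample from the predicted distribution with a temperature. The following is a minimal sketch of that idea, not part of the required solution; the helper name `sample_index` and the three-class probability vector are hypothetical.

```python
import numpy as np

def sample_index(probs, temperature=1.0, rng=None):
    """Draw a character index from a probability vector, reshaped by temperature."""
    if rng is None:
        rng = np.random.default_rng()
    probs = np.asarray(probs, dtype=np.float64)
    # Dividing log-probabilities by the temperature sharpens (<1) or flattens (>1) the distribution
    logits = np.log(probs + 1e-12) / temperature
    logits -= logits.max()              # numerical stability before exponentiating
    scaled = np.exp(logits)
    scaled /= scaled.sum()
    return int(rng.choice(len(scaled), p=scaled))

probs = np.array([0.6, 0.3, 0.1])       # hypothetical model output for a 3-character vocabulary
rng = np.random.default_rng(0)
draws = [sample_index(probs, temperature=1.0, rng=rng) for _ in range(1000)]
counts = np.bincount(draws, minlength=3)
print(counts)
```

Assuming the model's prediction has shape (1, vocab_size), this could be used inside the loop by replacing `np.argmax(prediction)` with `sample_index(prediction[0], temperature=0.5)`. Lower temperatures behave more like argmax, while higher ones give more varied but more error-prone text.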

#### 4) Demonstrate your function

In [71]:
text_output = generate_text("All:\nCome, come.\n\nFirst Citizen:\nSoft! who comes here?", 200)
print(text_output)
