# Exercises

1. What are the pros & cons of using a stateful RNN versus a stateless RNN?
2. Why do people use encoder-decoder RNNs rather than plain sequence-to-sequence RNNs for automatic translation?
3. How can you deal with variable-length input sequences? What about variable-length output sequences?
4. What is beam search & why would you use it? What tool can you use to implement it?
5. What is an attention mechanism? How does it help?
6. What is the most important layer in the transformer architecture? What is its purpose?
7. When would you need to use sampled softmax?
8. *Embedded Reber grammars* were used by Hochreiter & Schmidhuber in their paper about LSTMs. They are artificial grammars that produce strings such as "BPBTSXXVPSEPE". Check out Jenny Orr's introduction to this topic. Choose a particular embedded Reber grammar (such as the one represented on Jenny Orr's page), then train an RNN to identify whether a string respects that grammar or not. You will first need to write a function capable of generating a training batch containing about 50% strings that respect the grammar, & 50% that don't.
9. Train an encoder-decoder model that can convert a date string from one format to another (e.g., from "April 22, 2019" to "2019-04-22").
10. Go through TensorFlow's neural machine translation with attention tutorial.
11. Use one of the recent language models (e.g., BERT) to generate more convincing Shakespearean text.

---

1. A stateless RNN can only capture patterns whose length is at most the size of the windows it is trained on. A stateful RNN preserves its hidden state across batches, so it can capture longer-term patterns, but the data preparation is more difficult: each sequence in a batch must continue where the corresponding sequence in the previous batch left off. Consecutive batches are also not independent & identically distributed (IID), which gradient descent does not handle well.
2. If you translated a sentence one word at a time as you read it, the result would likely contain many grammatical errors or make no sense at all. This is how a plain sequence-to-sequence RNN would translate: it would start outputting the translation after reading only the first word. An encoder-decoder RNN reads the whole sentence before translating it, which leads to more accurate translations.
3. To handle variable-length input sequences, pad the shorter sequences so that all sequences in a batch have the same length, & make sure the model masks the padding tokens so they are ignored. If the lengths vary widely, you can first bin the sequences by length so that each batch needs little padding. For output sequences, if the length is known in advance (e.g., it always equals the input length), you only need to configure the loss function to ignore tokens that come after the end of the sequence. Since, in general, the output length is not known ahead of time, you would train the model to output an \<eos> token at the end of each sequence.
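4. Beam search is a technique used to improve the performance of a trained encoder-decoder model. It keeps a short list of the *k* most promising output sequences, & at each decoder step it tries to extend each of them by one word; then it keeps only the *k* most likely sequences. *k* is the beam width, a tunable hyperparameter: a larger beam explores more candidates at a higher compute & memory cost. One tool for implementing it is TensorFlow Addons.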
5. An attention mechanism gives the decoder of an encoder-decoder model direct access to all of the encoder's outputs, which helps with longer input sequences. At each decoder time step, the decoder's current hidden state & the encoder's outputs are processed by an alignment model that outputs a score for each input time step. The scores indicate which parts of the input are most relevant to the current decoder time step. The weighted sum of the encoder outputs (weighted by the softmax-normalized scores) is then fed to the decoder, which produces its next hidden state & the output for this time step. The benefit of this extra work is that the model can process longer input sequences, & it can also make the model easier to debug, since the attention weights show which part of the input the model is paying attention to.
6. The most important layer in the transformer architecture is the multi-head attention layer. It allows the model to identify which words are most aligned with each other (e.g., smart & smartest, or grief, grieve, & grieving), & then to improve each word's representation using that context.
7. You use sampled softmax when training a classification model with many classes (e.g., thousands). It approximates the cross-entropy loss using the logits predicted for the correct class & for a random sample of incorrect classes, which speeds up training considerably because the model does not need to compute a logit for every class. After training, use the regular softmax function (not sampled softmax) to compute all the class probabilities, since predictions need a probability for every class.
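
The beam search procedure described in answer 4 can be sketched in plain Python. This is a minimal illustration, not a production decoder: `step_log_probs` and `TOY_MODEL` are hypothetical stand-ins for a trained decoder, and the probabilities are made up so that greedy decoding and beam search disagree.

```python
import math

def beam_search(step_log_probs, start_token, end_token, beam_width=3, max_len=10):
    """Keep the `beam_width` most likely sequences at each decoding step.

    `step_log_probs(sequence)` is a hypothetical stand-in for the decoder: it
    returns a dict mapping each candidate next token to its log-probability.
    """
    beams = [([start_token], 0.0)]  # (sequence, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_token:
                candidates.append((seq, score))  # finished: carry over unchanged
                continue
            for token, log_p in step_log_probs(seq).items():
                candidates.append((seq + [token], score + log_p))
        # Keep only the k most likely sequences
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(seq[-1] == end_token for seq, _ in beams):
            break
    return beams

# Toy next-token probabilities (made up for illustration)
TOY_MODEL = {
    ("<sos>",): {"a": 0.6, "b": 0.4},
    ("<sos>", "a"): {"x": 0.5, "y": 0.5},
    ("<sos>", "b"): {"x": 0.9, "y": 0.1},
}

def toy_step(seq):
    # Any sequence not listed in the table simply ends
    probs = TOY_MODEL.get(tuple(seq), {"<eos>": 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}

best_seq, best_score = beam_search(toy_step, "<sos>", "<eos>", beam_width=2)[0]
greedy_seq, greedy_score = beam_search(toy_step, "<sos>", "<eos>", beam_width=1)[0]
print(best_seq, round(math.exp(best_score), 2))
print(greedy_seq, round(math.exp(greedy_score), 2))
```

With a beam width of 1 (greedy decoding), the search commits to "a" and ends with a sequence of probability 0.3, while a beam width of 2 keeps "b" alive long enough to find the sequence with probability 0.36.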

# 8.

In [None]:
import numpy as np

reber_grammar = [[("B", 1)],
                 [("T", 2), ("P", 3)],
                 [("S", 2), ("X", 4)],
                 [("T", 3), ("V", 5)],
                 [("X", 3), ("S", 6)],
                 [("P", 4), ("V", 6)],
                 [("E", None)]]

embedded_reber_grammar = [[("B", 1)],
                          [("T", 2), ("P", 3)], 
                          [(reber_grammar, 4)],
                          [(reber_grammar, 5)],
                          [("T", 6)],
                          [("P", 6)],
                          [("E", None)]]

def generate_string(grammar):
    node = 0
    output = []
    while node is not None:
        index = np.random.randint(len(grammar[node]))
        production, node = grammar[node][index]
        if isinstance(production, list):
            production = generate_string(grammar = production)
        output.append(production)
    return "".join(output)

In [None]:
print(generate_string(reber_grammar))

In [None]:
def generate_bad_string(grammar, chars = "BEPSTVX"):
    good_string = generate_string(grammar)
    index = np.random.randint(len(good_string))
    good_char = good_string[index]
    bad_char = np.random.choice(sorted(set(chars) - set(good_char)))
    return good_string[:index] + bad_char + good_string[index + 1:]

In [None]:
print(generate_bad_string(embedded_reber_grammar))
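
To sanity-check the generators, it can help to have a ground-truth checker that walks the same grammar tables. The sketch below is an addition, not part of the original solution: `match` and `respects_grammar` are hypothetical helper names, and the greedy walk relies on the assumption that, at every node, the productions start with different characters (which holds for the two tables above, repeated here so the block runs on its own).

```python
reber_grammar = [[("B", 1)],
                 [("T", 2), ("P", 3)],
                 [("S", 2), ("X", 4)],
                 [("T", 3), ("V", 5)],
                 [("X", 3), ("S", 6)],
                 [("P", 4), ("V", 6)],
                 [("E", None)]]

embedded_reber_grammar = [[("B", 1)],
                          [("T", 2), ("P", 3)],
                          [(reber_grammar, 4)],
                          [(reber_grammar, 5)],
                          [("T", 6)],
                          [("P", 6)],
                          [("E", None)]]

def match(grammar, s, pos=0):
    """Return the index just past a string of `grammar` starting at s[pos], or None."""
    node = 0
    while node is not None:
        for production, next_node in grammar[node]:
            if isinstance(production, list):
                # Nested grammar: try to consume a whole sub-string
                end = match(production, s, pos)
                if end is not None:
                    pos, node = end, next_node
                    break
            elif s.startswith(production, pos):
                pos, node = pos + len(production), next_node
                break
        else:
            return None  # no production fits at this node
    return pos

def respects_grammar(grammar, s):
    # Valid only if the walk consumes the entire string
    return match(grammar, s) == len(s)

print(respects_grammar(embedded_reber_grammar, "BPBTSXXVPSEPE"))
print(respects_grammar(embedded_reber_grammar, "BPBTSXXVPSETE"))
```

The first string is the example from the exercise statement and is accepted; the second differs only in its second-to-last character and is rejected, because the outer "P" near the start must be matched by a "P" near the end.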

We can't feed strings directly to an RNN, so we need to encode them. Here we convert each character to an integer ID & let the model learn an embedding for each ID. One-hot encoding would work too.

In [None]:
def str_to_ids(s, chars = "BEPSTVX"):
    return [chars.index(c) for c in s]

In [None]:
str_to_ids("BPBTXSEPE")
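
For comparison, the one-hot alternative mentioned above could look roughly like this. It is a plain-Python sketch; `str_to_one_hot` is a hypothetical helper that is not used in the rest of this solution.

```python
def str_to_ids(s, chars="BEPSTVX"):
    return [chars.index(c) for c in s]

def str_to_one_hot(s, chars="BEPSTVX"):
    """Encode each character as a one-hot row of length len(chars)."""
    ids = str_to_ids(s, chars)
    return [[1.0 if i == idx else 0.0 for i in range(len(chars))] for idx in ids]

one_hot = str_to_one_hot("BTXSE")
for row in one_hot:
    print(row)
```

Each row contains a single 1.0 at the position of the character's ID. With this encoding the model would read the vectors directly, so the `Embedding` layer would no longer be needed.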

We'll generate a dataset in which half of the strings respect the embedded Reber grammar & half do not.

In [None]:
import tensorflow as tf
from tensorflow import keras

def generate_dataset(size):
    good_strings = [str_to_ids(generate_string(embedded_reber_grammar)) 
                               for _ in range(size // 2)]
    bad_strings = [str_to_ids(generate_bad_string(embedded_reber_grammar))
                              for _ in range(size // 2)]
    all_strings = good_strings + bad_strings
    X = tf.ragged.constant(all_strings, ragged_rank = 1)
    y = np.array([[1.0] for _ in range(len(good_strings))] + 
                 [[0.0] for _ in range(len(bad_strings))])
    return X, y

In [None]:
X_train, y_train = generate_dataset(10000)
X_val, y_val = generate_dataset(2000)

In [None]:
embedding_size = 5

model = keras.models.Sequential([
    keras.layers.InputLayer(input_shape = [None], dtype = tf.int32, ragged = True),
    keras.layers.Embedding(input_dim = len("BEPSTVX"), output_dim = embedding_size),
    keras.layers.GRU(30),
    keras.layers.Dense(1, activation = "sigmoid")
])
optimizer = keras.optimizers.SGD(learning_rate = 0.02, momentum = 0.95, nesterov = True)
model.compile(loss = "binary_crossentropy", optimizer = optimizer, metrics = ["accuracy"])
history = model.fit(X_train, y_train, epochs = 20, validation_data = (X_val, y_val))

# 9.

In [1]:
from datetime import date
import numpy as np
import tensorflow as tf
from tensorflow import keras

months = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]

def random_dates(n_dates):
    min_date = date(1000, 1, 1).toordinal()
    max_date = date(9999, 12, 31).toordinal()

    ordinals = np.random.randint(max_date - min_date, size = n_dates) + min_date
    dates = [date.fromordinal(ordinal) for ordinal in ordinals]

    x = [months[dt.month - 1] + " " + dt.strftime("%d, %Y") for dt in dates]
    y = [dt.isoformat() for dt in dates]
    return x, y

In [2]:
input_chars = "".join(sorted(set("".join(months) + "0123456789, ")))
input_chars

' ,0123456789ADFJMNOSabceghilmnoprstuvy'

In [3]:
output_chars = "0123456789-"

In [4]:
def date_str_to_ids(date_str, chars = input_chars):
    return [chars.index(c) for c in date_str]

def prepare_date_strs(date_strs, chars = input_chars):
    x_ids = [date_str_to_ids(dt, chars) for dt in date_strs]
    x = tf.ragged.constant(x_ids, ragged_rank = 1)
    return (x + 1).to_tensor()

def create_dataset(n_dates):
    x, y = random_dates(n_dates)
    return prepare_date_strs(x, input_chars), prepare_date_strs(y, output_chars)
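
The `(x + 1).to_tensor()` line deserves a note: shifting every ID up by one frees ID 0, and `to_tensor()` fills the shorter rows with zeros by default, so 0 acts as the padding ID. The same idea in plain Python, as a sketch with a hypothetical `pad_ids` helper:

```python
def pad_ids(sequences, pad_id=0):
    """Shift every ID up by 1 so that 0 is free, then right-pad to equal length."""
    max_len = max(len(seq) for seq in sequences)
    return [[i + 1 for i in seq] + [pad_id] * (max_len - len(seq))
            for seq in sequences]

padded = pad_ids([[3, 0, 7], [5]])
print(padded)  # [[4, 1, 8], [6, 0, 0]]
```

This shift is also why the vocabulary sizes used later are `len(input_chars) + 1` & `len(output_chars) + 1`.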

In [5]:
X_train, y_train = create_dataset(10000)
X_val, y_val = create_dataset(2000)
X_test, y_test = create_dataset(2000)

In [7]:
embedding_size = 32
max_output_length = y_train.shape[1]

encoder = keras.models.Sequential([keras.layers.Embedding(input_dim = len(input_chars) + 1,
                                                          output_dim = embedding_size,
                                                          input_shape = [None]),
                                   keras.layers.LSTM(128)
])

decoder = keras.models.Sequential([
    keras.layers.LSTM(128, return_sequences = True),
    keras.layers.Dense(len(output_chars) + 1, activation = "softmax")
])

model = keras.models.Sequential([encoder,
                                 keras.layers.RepeatVector(max_output_length),
                                 decoder])
model.compile(loss = "sparse_categorical_crossentropy", optimizer = "nadam",
              metrics = ["accuracy"])
model.fit(X_train, y_train, epochs = 15, validation_data = (X_val, y_val))

Epoch 1/15
313/313 ━━━━━━━━━━━━━━━━━━━━ 11s 22ms/step - accuracy: 0.3068 - loss: 1.9636 - val_accuracy: 0.6122 - val_loss: 1.1097
Epoch 2/15
313/313 ━━━━━━━━━━━━━━━━━━━━ 7s 23ms/step - accuracy: 0.6060 - loss: 1.1239 - val_accuracy: 0.6851 - val_loss: 0.8552
Epoch 3/15
313/313 ━━━━━━━━━━━━━━━━━━━━ 6s 20ms/step - accuracy: 0.6940 - loss: 0.8447 - val_accuracy: 0.7489 - val_loss: 0.6408
Epoch 4/15
313/313 ━━━━━━━━━━━━━━━━━━━━ 6s 19ms/step - accuracy: 0.7741 - loss: 0.5816 - val_accuracy: 0.8396 - val_loss: 0.4253
Epoch 5/15
313/313 ━━━━━━━━━━━━━━━━━━━━ 7s 22ms/step - accuracy: 0.8609 - loss: 0.3702 - val_accuracy: 0.9047 - val_loss: 0.2694
Epoch 6/15
313/313 ━━━━━━━━━━━━━━━━━━━━ 6s 20ms/step - accuracy: 0.9271 - loss: 0.2259 - val_accuracy: 0.9624 - val_loss: 0.1483
Epoch 7/15
313/31

<keras.src.callbacks.history.History at 0x152429be0>

In [10]:
def ids_to_date_strs(ids, chars = output_chars):
    return ["".join([("?" + chars)[index] for index in sequence]) for sequence in ids]

In [16]:
X_new = prepare_date_strs(["September 17, 2009", "July 14, 1789"])

max_input_length = X_train.shape[1]

def prepare_date_strs_padded(date_strs):
    X = prepare_date_strs(date_strs)
    if X.shape[1] < max_input_length:
        X = tf.pad(X, [[0, 0], [0, max_input_length - X.shape[1]]])
    return X

def convert_date_strs(date_strs):
    X = prepare_date_strs_padded(date_strs)
    ids = np.argmax(model.predict(X), axis = -1)
    return ids_to_date_strs(ids)

In [17]:
convert_date_strs(["May 02, 2020", "July 14, 1789"])

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 24ms/step


['2020-05-02', '1789-07-14']

# 10.

["NMT with Attention"](https://www.tensorflow.org/text/tutorials/nmt_with_attention)

# 11.

In [23]:
import transformers
from transformers import TFOpenAIGPTLMHeadModel

model = TFOpenAIGPTLMHeadModel.from_pretrained("openai-gpt")

RuntimeError: Failed to import transformers.models.openai.modeling_tf_openai because of the following error (look up to see its traceback):
Your currently installed version of Keras is Keras 3, but this is not yet supported in Transformers. Please install the backwards-compatible tf-keras package with `pip install tf-keras`.

In [None]:
from transformers import OpenAIGPTTokenizer

tokenizer = OpenAIGPTTokenizer.from_pretrained("openai-gpt")

In [None]:
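num_sequences = 5
length = 40

# Example prompt (an assumption; any text works), encoded as a batch of one sequence
prompt_text = "This royal throne of kings, this sceptred isle"
encoded_prompt = tokenizer.encode(prompt_text, add_special_tokens = False,
                                  return_tensors = "tf")

generated_sequences = model.generate(input_ids = encoded_prompt,
                                     do_sample = True,
                                     max_length = length + len(encoded_prompt[0]),
                                     temperature = 1.0,
                                     top_k = 0,
                                     top_p = 0.9,
                                     repetition_penalty = 1.0,
                                     num_return_sequences = num_sequences)
generated_sequences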

In [None]:
for sequence in generated_sequences:
    text = tokenizer.decode(sequence, clean_up_tokenization_spaces = True)
    print(text)
    print("-" * 80)
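
The `top_p = 0.9` argument in the `generate()` call requests nucleus (top-p) sampling: at each step, only the most likely tokens whose cumulative probability reaches 0.9 are kept, & the next token is sampled from that reduced set. The sketch below illustrates the idea in plain Python. It is not the library's implementation; `nucleus_filter` & `sample_top_p` are hypothetical names, and the probabilities are made up.

```python
import random

def nucleus_filter(probs, top_p=0.9):
    """Return {token_index: renormalized_prob} for the smallest set of
    most likely tokens whose cumulative probability reaches top_p."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    kept, cumulative = [], 0.0
    for index, p in ranked:
        kept.append((index, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {index: p / total for index, p in kept}

def sample_top_p(probs, top_p=0.9, rng=random):
    """Sample one token index from the nucleus."""
    filtered = nucleus_filter(probs, top_p)
    indices, weights = zip(*filtered.items())
    return rng.choices(indices, weights=weights, k=1)[0]

filtered = nucleus_filter([0.5, 0.3, 0.15, 0.05], top_p=0.9)
print(sorted(filtered))  # [0, 1, 2]
print(sample_top_p([0.5, 0.3, 0.15, 0.05], top_p=0.9))
```

In this example the least likely token (index 3) is never sampled, which trims the unlikely tail of the distribution while keeping some variety in the generated text.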