# Character-level RNN for Text Generation

This sample code is taken from [GitHub](https://github.com/jctestud/char-rnn) and demonstrates how to train a stacked LSTM on a set of tweets and then generate text that resembles the input tweets in style. We work with Trump tweets as collected by the [Trump Twitter Archive](http://www.trumptwitterarchive.com/archive). This form of learning does not need a manually annotated dataset, because the target for each input sequence is simply the next character in the text.


![Text Generation Training](https://www.tensorflow.org/text/tutorials/images/text_generation_training.png)


## Training
The following part demonstrates the training of such an LSTM model. Pay special attention to

- how the data batches are generated
- what the training data looks like (input and target label)
- the model architecture
- why we work at the character level.

In [10]:
!wget -nc https://raw.githubusercontent.com/ADSLab-Salzburg/workshop-chatbot-trump-tweet-generator/main/trump_train.txt
!wget -nc https://raw.githubusercontent.com/ADSLab-Salzburg/workshop-chatbot-trump-tweet-generator/main/trump_val.txt
!wget -nc https://raw.githubusercontent.com/ADSLab-Salzburg/workshop-chatbot-trump-tweet-generator/main/tweet_preprocessor.py

File ‘trump_train.txt’ already there; not retrieving.

File ‘trump_val.txt’ already there; not retrieving.

File ‘tweet_preprocessor.py’ already there; not retrieving.



Some parameters to set and tune:

In [6]:
data_directory = "."

SEQUENCE_LEN = 60
BATCH_SIZE = 128
EPOCHS = 20
HIDDEN_LAYERS_DIM = 512
LAYER_COUNT = 4
DROPOUT = 0.2

MAX_DATA_PERCENT = 0.01  # fraction of the data to use (0.01 = 1 %)

Let's create the vocabulary! Depending on the dataset and your locale, you may need to remove further characters.

In [7]:
import string
characters = list(string.printable)
characters.remove('\x0b')
characters.remove('\x0c')

VOCABULARY_SIZE = len(characters)

characters_to_idx = {c:i for i,c in enumerate(characters)}
print(f'vocabulary len = {VOCABULARY_SIZE}')
print(characters)

vocabulary len = 98
['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~', ' ', '\t', '\n', '\r']
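
To make the role of the vocabulary concrete, here is a minimal sketch of how a string can be turned into one-hot vectors and back. The helpers `one_hot_encode` and `one_hot_decode` are hypothetical names introduced only for this illustration; the vocabulary is built exactly as in the cell above.

```python
import string
import numpy as np

# Same vocabulary construction as in the cell above.
characters = list(string.printable)
characters.remove('\x0b')
characters.remove('\x0c')
characters_to_idx = {c: i for i, c in enumerate(characters)}


def one_hot_encode(text):
    """Hypothetical helper: one row per character, one column per vocabulary entry."""
    encoded = np.zeros((len(text), len(characters)))
    for position, char in enumerate(text):
        encoded[position, characters_to_idx[char]] = 1
    return encoded


def one_hot_decode(encoded):
    """Hypothetical helper: recover the string by taking the argmax of each row."""
    return "".join(characters[i] for i in encoded.argmax(axis=1))


encoded = one_hot_encode("Hello!")
print(encoded.shape)            # (number of characters, vocabulary size)
print(one_hot_decode(encoded))  # round trip back to the original string
```

Any character that is not part of `characters` would raise a `KeyError` here, which is why the vocabulary has to cover everything that appears in the dataset.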


Here are two helper functions. Just take them as they are. :-)

In [8]:
import numpy as np

def batch_generator(text, count):
    """Generate batches for training"""
    while True:  # Keras expects the generator to loop indefinitely
        for batch_idx in range(count):
            X = np.zeros((BATCH_SIZE, SEQUENCE_LEN, VOCABULARY_SIZE))
            y = np.zeros((BATCH_SIZE, VOCABULARY_SIZE))

            batch_offset = BATCH_SIZE * batch_idx

            for sample_idx in range(BATCH_SIZE):
                sample_start = batch_offset + sample_idx
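                for s in range(SEQUENCE_LEN):
                    # one-hot encode the s-th character of the input window
                    X[sample_idx, s, characters_to_idx[text[sample_start + s]]] = 1
                # the target is the character right after the window
                y[sample_idx, characters_to_idx[text[sample_start + SEQUENCE_LEN]]] = 1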

            yield X, y


def describe_batch(X, y, samples=3):
    """Describe in a human-readable format some samples from a batch"""
    for i in range(samples):
        sentence = ""
        for s in range(SEQUENCE_LEN):
            sentence += characters[X[i,s,:].argmax()]
        next_char = characters[y[i,:].argmax()]

        print("sample #%d: ...%s -> '%s'" % (
            i,
            sentence[-20:],
            next_char
        ))
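
The batch generator slides a window of `SEQUENCE_LEN` characters over the text, one character at a time, and uses the character right after each window as the target. The following sketch shows the same idea on a toy string without any one-hot encoding; `sliding_windows` is a hypothetical helper and not part of the original code.

```python
def sliding_windows(text, sequence_len, count):
    """Hypothetical helper: return the first `count` (input window, target char) pairs."""
    pairs = []
    for start in range(count):
        window = text[start:start + sequence_len]  # the input sequence
        target = text[start + sequence_len]        # the character to predict
        pairs.append((window, target))
    return pairs


pairs = sliding_windows("hello world", sequence_len=5, count=3)
for window, target in pairs:
    print(repr(window), "->", repr(target))
# 'hello' -> ' '
# 'ello ' -> 'w'
# 'llo w' -> 'o'
```

Consecutive samples overlap in all but one character, which is why no manual labelling is needed: every position in the text provides its own target.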

Your turn!

Let's build a model. The model has `LAYER_COUNT` LSTM layers, each followed by a Dropout layer. All LSTM layers are given the same `input_shape` and the same number of hidden units (`HIDDEN_LAYERS_DIM`). A final Dense layer with a softmax activation classifies the next character.
Compile the model with categorical cross-entropy loss and the Adam optimizer.

In [9]:
from keras.models import Sequential
from keras.layers import LSTM, Dropout, Dense

def build_model():
    """Build a Keras sequential model for training the char-rnn"""
    model = Sequential()

    for layer_idx in range(LAYER_COUNT):
        model.add(
            LSTM(
                HIDDEN_LAYERS_DIM,
                return_sequences=(layer_idx != LAYER_COUNT - 1),  # only the last LSTM returns a single vector
                input_shape=(SEQUENCE_LEN, VOCABULARY_SIZE),
            )
        )
        model.add(Dropout(DROPOUT))

    model.add(Dense(VOCABULARY_SIZE, activation='softmax'))

    model.compile(loss='categorical_crossentropy', optimizer="adam")

    return model

Here, the text is loaded, reduced in size (see `MAX_DATA_PERCENT`), and briefly described. What does it look like?

In [15]:
import os

# loading the text
with open(os.path.join(data_directory, "trump_train.txt"), "r", encoding="utf8") as f:
    text_train = f.read()
with open(os.path.join(data_directory, "trump_val.txt"), "r", encoding="utf8") as f:
    text_val = f.read()

text_train = text_train[:int(MAX_DATA_PERCENT*len(text_train))]
text_val = text_val[:int(MAX_DATA_PERCENT*len(text_val))]

text_train_len = len(text_train)
text_val_len = len(text_val)
print("Total of %d characters" % (text_train_len + text_val_len))

Total of 119126 characters


Let's see how the `batch_generator` prepares the data. Try out the generator and describe one batch using the `describe_batch(...)` function.

In [16]:
for ix, (X,y) in enumerate(batch_generator(text_train, count=3)):
    # describe some samples from each of the first three batches
    describe_batch(X, y, samples=5)
    if ix > 1:
        break

sample #0: ...nd Mississippi for t -> 'h'
sample #1: ...d Mississippi for th -> 'e'
sample #2: ... Mississippi for the -> ' '
sample #3: ...Mississippi for the  -> 'R'
sample #4: ...ississippi for the R -> 'e'
sample #0: ...ttps://t.co/bvQLOmwl -> 'b'
sample #1: ...tps://t.co/bvQLOmwlb -> 'd'
sample #2: ...ps://t.co/bvQLOmwlbd -> ' '
sample #3: ...s://t.co/bvQLOmwlbd  -> ' '
sample #4: ...://t.co/bvQLOmwlbd   -> '"'
sample #0: ...s. I wish I had the  -> 'm'
sample #1: .... I wish I had the m -> 'o'
sample #2: ... I wish I had the mo -> 'n'
sample #3: ...I wish I had the mon -> 'e'
sample #4: ... wish I had the mone -> 'y'


Let's create a model and have a look at its summary.

In [17]:
training_model = build_model()
training_model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 lstm (LSTM)                 (None, 60, 512)           1251328   
                                                                 
 dropout (Dropout)           (None, 60, 512)           0         
                                                                 
 lstm_1 (LSTM)               (None, 60, 512)           2099200   
                                                                 
 dropout_1 (Dropout)         (None, 60, 512)           0         
                                                                 
 lstm_2 (LSTM)               (None, 60, 512)           2099200   
                                                                 
 dropout_2 (Dropout)         (None, 60, 512)           0         
                                                                 
 lstm_3 (LSTM)               (None, 512)               2
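
The parameter counts in the summary can be checked by hand. A standard LSTM layer has four gates, and each gate has an input weight matrix, a recurrent weight matrix, and a bias vector. The sketch below uses a hypothetical helper `lstm_param_count` (not a Keras function) and reproduces the counts shown above for `lstm` and `lstm_1`.

```python
def lstm_param_count(input_dim, units):
    """Hypothetical helper: 4 gates x (input weights + recurrent weights + bias)."""
    return 4 * ((input_dim + units) * units + units)


# First layer: input is a one-hot vector of size VOCABULARY_SIZE = 98.
first_layer = lstm_param_count(input_dim=98, units=512)
# Stacked layer: input is the 512-dimensional output of the previous LSTM.
stacked_layer = lstm_param_count(input_dim=512, units=512)

print(first_layer)    # 1251328
print(stacked_layer)  # 2099200
```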

How many batches do we have?

In [18]:
train_batch_count = (text_train_len - SEQUENCE_LEN) // BATCH_SIZE
val_batch_count = (text_val_len - SEQUENCE_LEN) // BATCH_SIZE
print("training batch count: %d" % train_batch_count)
print("validation batch count: %d" % val_batch_count)

training batch count: 464
validation batch count: 464
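
The subtraction of `SEQUENCE_LEN` in the batch count guarantees that the target character of the very last sample still lies inside the text. Here is a small sketch of that arithmetic. The text length used below is an assumed value, chosen only so that the result matches the batch count printed above; it is not taken from the dataset.

```python
SEQUENCE_LEN = 60
BATCH_SIZE = 128


def batch_count(text_len):
    """Same formula as in the cell above."""
    return (text_len - SEQUENCE_LEN) // BATCH_SIZE


def last_target_index(text_len):
    """Index of the target character of the last sample in the last batch."""
    last_sample_start = batch_count(text_len) * BATCH_SIZE - 1
    return last_sample_start + SEQUENCE_LEN


assumed_text_len = 59_563  # hypothetical length, for illustration only
print(batch_count(assumed_text_len))        # 464
print(last_target_index(assumed_text_len))  # 59451, a valid index into the text
```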


**Callbacks:** today we use two standard callbacks. What does each of them do?

In [19]:
from keras.callbacks import ModelCheckpoint, EarlyStopping

filepath = "./BS-%d_%d-%s_dp%.2f_%dS_epoch{epoch:02d}-loss{loss:.4f}-val-loss{val_loss:.4f}_weights" % (
    BATCH_SIZE,
    LAYER_COUNT,
    HIDDEN_LAYERS_DIM,
    DROPOUT,
    SEQUENCE_LEN
)
checkpoint = ModelCheckpoint(
    filepath,
    save_weights_only=True
)

early_stopping = EarlyStopping(monitor='val_loss', patience=5)

callbacks_list = [checkpoint, early_stopping]
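
As a hint for the early-stopping part of the question, the patience logic can be sketched in plain Python. This is a simplified illustration of the idea, not the Keras implementation; `stopping_epoch` and the loss values are made up for this example.

```python
def stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None if it never stops."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # New best validation loss: reset the counter.
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Made-up validation losses: the best value is reached in epoch 3.
losses = [1.5, 1.3, 1.2, 1.25, 1.22, 1.21, 1.3, 1.4]
print(stopping_epoch(losses, patience=5))  # 8
```

With `patience=5`, training continues for five more epochs after the best one and stops only if none of them improves on it.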

Let's get the training started!!

In [20]:
history = training_model.fit(
    batch_generator(text_train, count=train_batch_count),
    steps_per_epoch=train_batch_count,
    max_queue_size=4,     # keep at most four batches queued in RAM
    epochs=EPOCHS,
    callbacks=callbacks_list,
    validation_data=batch_generator(text_val, count=val_batch_count),
    validation_steps=val_batch_count,
    initial_epoch=0
)

Epoch 1/20
  1/464 [..............................] - ETA: 2:28:23 - loss: 4.5852

KeyboardInterrupt: 

## Testing

![Text Generation Sampling](https://www.tensorflow.org/text/tutorials/images/text_generation_sampling.png)

The following part demonstrates the testing of the model, in other words, how to generate text that sounds like Trump. To keep the prediction step simple, a batch size of 1 is used. In TensorFlow v1, the batch size cannot be changed once the model is built. This means that we have to build a new model and restore the weights from the training checkpoint.

Let's start generating text...


### Model
In the following, create a model similar to the one in the training section. This time, you do not need to add dropout layers. Additionally, instead of `input_shape`, you need to define `batch_input_shape` in such a way that only one character is passed along per step.

Good thing you don't need to compile the model. :-)

In [4]:
from keras.models import Sequential
from keras.layers import LSTM, Dense

test_model = Sequential()

for i in range(LAYER_COUNT):
    test_model.add(
            LSTM(
                HIDDEN_LAYERS_DIM,
                return_sequences=(i!=(LAYER_COUNT-1)),
                batch_input_shape=(1, 1, VOCABULARY_SIZE),  # batch size = 1
                stateful=True
            )
        )
test_model.add(Dense(VOCABULARY_SIZE, activation='softmax'))

Load the weights that achieved the best validation results during training.

In [5]:
### BEGIN SOLUTION
!wget -nc https://raw.githubusercontent.com/ADSLab-Salzburg/workshop-chatbot-trump-tweet-generator/main/stored_model.index
!wget -nc https://raw.githubusercontent.com/ADSLab-Salzburg/workshop-chatbot-trump-tweet-generator/main/stored_model.data-00000-of-00001

test_model.load_weights(
    "stored_model"
    #"weights_4-512-lstm_loss1.2550_val-loss1.2443"
)
### END SOLUTION

Imagine a normal prediction setup. _Which result is taken from the output distribution?_

What can we do to get variability in our text generation?

Additionally, the distribution itself can be reshaped. Try to implement a temperature coefficient [1]:
$$q_i = \frac{e^{\frac{z_i}{T}}}{\sum_j e^{\frac{z_j}{T}}}$$ where $T > 0$ is the temperature. Values in $(0, 1)$ sharpen the distribution, and $T = 1$ leaves it unchanged.

For the random sampling, have a look at [numpy.random.multinomial](https://numpy.org/doc/stable/reference/random/generated/numpy.random.multinomial.html).

In [11]:
import numpy as np

def sample(preds, temperature=1.0):
    """Helper function to sample an index from a probability array"""
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds)  # ... z

    preds = preds / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)

    # n     ... number of experiments
    # pvals ... probability values
    # size  ... output shape
    probas = np.random.multinomial(n=1, pvals=preds/np.sum(preds), size=1)

    return np.argmax(probas)
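
To see what the temperature does, the following sketch applies the formula above to a small, made-up distribution and compares it with the greedy choice (always taking the most likely index). `apply_temperature` is a hypothetical helper that repeats the rescaling step of `sample` without drawing a random sample.

```python
import numpy as np


def apply_temperature(probs, temperature):
    """Rescale a probability vector: softmax of log(probs) / temperature."""
    logits = np.log(np.asarray(probs, dtype="float64"))
    scaled = np.exp(logits / temperature)
    return scaled / scaled.sum()


probs = [0.5, 0.3, 0.2]  # made-up output distribution over three characters

# Greedy decoding always picks the same index, so the text never varies.
greedy_idx = int(np.argmax(probs))

for temperature in (1.0, 0.5, 0.1):
    print(temperature, np.round(apply_temperature(probs, temperature), 3))
```

At `temperature=1.0` the distribution is unchanged. As the temperature approaches zero, almost all probability mass moves to the most likely character, so sampling behaves more and more like the greedy choice.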

You can take these two functions as they are.

`predict_next_char` predicts a probability distribution for the next character, based on the current input character, and then samples a character from this distribution.

The `generate_text` function generates text with a model, starting from the `seed`.

In [12]:
import sys


def predict_next_char(model, current_char, diversity=1.0):
    """Predict the next character from the current one"""
    x = np.zeros((1, 1, VOCABULARY_SIZE))
    x[:,:,characters_to_idx[current_char]] = 1
    y = model.predict(x, batch_size=1,verbose=0,)
    next_char_idx = sample(y[0,:], temperature=diversity)
    next_char = characters[next_char_idx]
    return next_char

def generate_text(model, seed="I am", count=140):
    """Generate characters from a given seed"""
    model.reset_states()
    for s in seed[:-1]:
        next_char = predict_next_char(model, s)
    current_char = seed[-1]

    sys.stdout.write("["+seed+"]")

    # no more reset, preserve context
    for i in range(count - len(seed)):
        next_char = predict_next_char(model, current_char, diversity=0.7)
        current_char = next_char
        sys.stdout.write(next_char)
    print("...\n")

Now, let's write some tweets!

In [13]:
for i in range(5):
    generate_text(
        test_model,
        seed="Despite the constant negative press "
    )

### Try it with your seed sentence.

The question that remains unanswered to this day: what did he want to tell us?

[![covefe](https://pbs.twimg.com/media/DBIXi67V0AAzS7x?format=jpg&name=900x900)](https://twitter.com/GavinNewsom/status/869783572390883328/photo/1)

# Homework
Retry this with your own dataset. You can use whatever you want: the Linux kernel or other source code, novels, or even your former school assignments to create a personalized assignment generator!

# Literature
[1] Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. "Distilling the knowledge in a neural network." arXiv preprint arXiv:1503.02531 (2015). [arXiv](https://arxiv.org/abs/1503.02531)