Ash Rai and Raniery Mendes <br>
CSC 674: Machine Learning <br>
WFU, Spring 2023


# Project 3: Text Generation Using Transformers

This notebook is our implementation of Project 3. The text generator uses a miniature Transformer (GPT-style) model trained to imitate the linguistic style of Donald Trump. The training data consists of transcripts of 35 rally speeches given by the former American president in 2019 and 2020.

The model consists of a single Transformer block with causal masking in its attention layer. We experimented with several hyperparameters, including the number of epochs, the number of attention heads, and the word embedding dimension.

### Creating the model

Importing all the packages

In [12]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers import TextVectorization
import numpy as np
import os
import re
import string
import random

We create a single Transformer block with causal masking in its attention layer.

In [13]:
def causal_attention_mask(batch_size, n_dest, n_src, dtype):
    """
    Mask the upper half of the dot product matrix in self attention.
    This prevents flow of information from future tokens to current token.
    1's in the lower triangle, counting from the lower right corner.
    """
    i = tf.range(n_dest)[:, None]
    j = tf.range(n_src)
    m = i >= j - n_src + n_dest
    mask = tf.cast(m, dtype)
    mask = tf.reshape(mask, [1, n_dest, n_src])
    mult = tf.concat(
        [tf.expand_dims(batch_size, -1), tf.constant([1, 1], dtype=tf.int32)], 0
    )
    return tf.tile(mask, mult)


class TransformerBlock(layers.Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):
        super().__init__()
        self.att = layers.MultiHeadAttention(num_heads, embed_dim)
        self.ffn = keras.Sequential(
            [layers.Dense(ff_dim, activation="relu"), layers.Dense(embed_dim),]
        )
        self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = layers.Dropout(rate)
        self.dropout2 = layers.Dropout(rate)

    def call(self, inputs):
        input_shape = tf.shape(inputs)
        batch_size = input_shape[0]
        seq_len = input_shape[1]
        causal_mask = causal_attention_mask(batch_size, seq_len, seq_len, tf.bool)
        attention_output = self.att(inputs, inputs, attention_mask=causal_mask)
        attention_output = self.dropout1(attention_output)
        out1 = self.layernorm1(inputs + attention_output)
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output)
        return self.layernorm2(out1 + ffn_output)
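
To make the masking rule concrete, here is a minimal pure-Python sketch of the comparison used in `causal_attention_mask` (`i >= j - n_src + n_dest`), without TensorFlow and without the batch dimension. The helper name `causal_mask_rows` is ours and is used only for illustration.

```python
def causal_mask_rows(n_dest, n_src):
    """Return an n_dest x n_src grid of 0/1 flags; 1 means 'may attend'."""
    return [
        [1 if i >= j - n_src + n_dest else 0 for j in range(n_src)]
        for i in range(n_dest)
    ]


# With n_dest == n_src the rule reduces to i >= j: token i sees tokens 0..i.
for row in causal_mask_rows(4, 4):
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```

Each row is a destination token and each column a source token, so the zeros above the diagonal are the "future" positions that attention is not allowed to read.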

We implement two separate embedding layers: one for tokens and one for token positions (indices). The embeddings convert each token and each position into a vector of dimension `embed_dim`, and the two vectors are added together.


In [14]:
class TokenAndPositionEmbedding(layers.Layer):
    def __init__(self, maxlen, vocab_size, embed_dim):
        super().__init__()
        self.token_emb = layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)
        self.pos_emb = layers.Embedding(input_dim=maxlen, output_dim=embed_dim)

    def call(self, x):
        maxlen = tf.shape(x)[-1]
        positions = tf.range(start=0, limit=maxlen, delta=1)
        positions = self.pos_emb(positions)
        x = self.token_emb(x)
        return x + positions
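
As a rough illustration of what `x + positions` does, the sketch below replaces the two trainable `Embedding` layers with tiny hand-written lookup tables (`embed_dim = 2`). The table values and the helper name `embed` are made up for this example; in the real layer they are learned during training.

```python
# Toy lookup tables with embed_dim = 2 (values are invented for illustration).
token_table = {0: [0.0, 0.0], 1: [0.5, 0.25], 2: [0.75, 0.125]}
position_table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]


def embed(token_ids):
    """Add each token's vector to the vector for its position in the sequence."""
    return [
        [t + p for t, p in zip(token_table[tok], position_table[pos])]
        for pos, tok in enumerate(token_ids)
    ]


print(embed([2, 1, 2]))
# [[1.75, 0.125], [0.5, 1.25], [1.75, 1.125]]
```

Token `2` appears at positions 0 and 2 and receives a different vector each time, which is how the model can tell apart identical words at different places in the sequence.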

Next we create the miniature GPT model (the mini-Trump text generator). The model's hyperparameters are set here.

During our experiments we varied the vocabulary size, the embedding and feed-forward dimensions, and the number of attention heads.

**configuration 1**
vocab_size = 50000
embed_dim = 256
feed_forward_dim = 256
num_heads = 6

**configuration 2**
vocab_size = 75000
embed_dim = 512
feed_forward_dim = 512
num_heads = 8

**configuration 3**
vocab_size = 100000
embed_dim = 512
feed_forward_dim = 512
num_heads = 8

In [35]:
vocab_size = 100000  # Upper bound on vocabulary size (most frequent words are kept)
maxlen = 80  # Max sequence size
embed_dim = 512  # Embedding size for each token
num_heads = 8  # Number of attention heads
feed_forward_dim = 512  # Hidden layer size in feed forward network inside transformer


def create_model():
    inputs = layers.Input(shape=(maxlen,), dtype=tf.int32)
    embedding_layer = TokenAndPositionEmbedding(maxlen, vocab_size, embed_dim)
    x = embedding_layer(inputs)
    transformer_block = TransformerBlock(embed_dim, num_heads, feed_forward_dim)
    x = transformer_block(x)
    outputs = layers.Dense(vocab_size)(x)
    model = keras.Model(inputs=inputs, outputs=[outputs, x])
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    model.compile(
        "adam", loss=[loss_fn, None],
    )  # No loss and optimization based on word embeddings from transformer block
    return model

### Dataset Preprocessing

The original dataset is divided into 35 files, one for each rally city. We concatenated them into a single file for ease of use.

Reference: https://www.kaggle.com/datasets/christianlillelund/donald-trumps-rallies?resource=download


After concatenating all the speeches, we grouped the sentences into paragraphs of ten sentences each. Our reasoning is that, even though the average paragraph is about five sentences long, speakers in rally settings tend to use shorter sentences to hold supporters' attention and to be persuasive. We therefore assume that ten sentences is a suitable length to capture one of the speaker's "thoughts". In our implementation, a sentence is any sequence of words delimited by a period.

In [36]:
filename = '/Users/mendrc18/Documents/ash/combined_speeches.txt'
with open(filename) as file:
    lines = [line.rstrip() for line in file]

# Split into individual sentences (this reads only the first line, so it
# assumes the combined file stores all speeches on a single line)
sentences = lines[0].split(".")

# Strip surrounding whitespace and restore the period removed by split()
processed_sentences = []
for sentence in sentences:
    sentence = sentence.strip() + "."
    processed_sentences.append(sentence)

# Concatenate 10 sentences for each data point (i.e. cohesive paragraph);
# a trailing group of fewer than 10 sentences is dropped
running_sentence = ''
paragraphs = []
i = 0
for sentence in processed_sentences:
    running_sentence += ' ' + sentence
    i += 1
    if i == 10:
        paragraphs.append(running_sentence.lstrip())
        running_sentence = ''
        i = 0
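
The grouping loop above can also be written as a small reusable function, which makes it easy to try other paragraph sizes. This is only a sketch: the name `group_sentences` is hypothetical, and unlike the cell above it skips empty fragments (for example the one left after the final period) instead of turning them into a lone ".".

```python
def group_sentences(text, size=10):
    """Split text on periods and join every `size` sentences into one paragraph."""
    sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
    # Step through the sentences in blocks of `size`; an incomplete last block
    # is dropped, as in the loop-based version.
    return [
        " ".join(sentences[start:start + size])
        for start in range(0, len(sentences) - size + 1, size)
    ]


print(group_sentences("A. B. C. D. E.", size=2))
# ['A. B.', 'C. D.']
```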

In [37]:
len(paragraphs)

3378

The dataset, which contains 3378 paragraphs of 10 sentences each, is then vectorized and split into model inputs and labels.

In [38]:
batch_size = 128

# Create a dataset
text_ds = tf.data.Dataset.from_tensor_slices(paragraphs)
text_ds = text_ds.shuffle(buffer_size=256)
text_ds = text_ds.batch(batch_size)


def custom_standardization(input_string):
    """ Remove html line-break tags and handle punctuation """
    lowercased = tf.strings.lower(input_string)
    stripped_html = tf.strings.regex_replace(lowercased, "<br />", " ")
    return tf.strings.regex_replace(stripped_html, f"([{string.punctuation}])", r" \1")


# Create a vectorization layer and adapt it to the text
vectorize_layer = TextVectorization(
    standardize=custom_standardization,
    max_tokens=vocab_size - 1,
    output_mode="int",
    output_sequence_length=maxlen + 1,
)
vectorize_layer.adapt(text_ds)
vocab = vectorize_layer.get_vocabulary()  # To get words back from token indices


def prepare_lm_inputs_labels(text):
    """
    Shift word sequences by 1 position so that the target for position (i) is
    word at position (i+1). The model will use all words up till position (i)
    to predict the next word.
    """
    text = tf.expand_dims(text, -1)
    tokenized_sentences = vectorize_layer(text)
    x = tokenized_sentences[:, :-1]
    y = tokenized_sentences[:, 1:]
    return x, y


text_ds = text_ds.map(prepare_lm_inputs_labels)
text_ds = text_ds.prefetch(tf.data.AUTOTUNE)
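
The shift performed in `prepare_lm_inputs_labels` is easiest to see on a short list. The sketch below uses made-up token ids and plain Python slicing in place of the tensor slicing `[:, :-1]` and `[:, 1:]`.

```python
tokens = [12, 7, 45, 3, 9]  # made-up token ids for one sequence

x = tokens[:-1]  # model input: every token except the last
y = tokens[1:]   # target: every token except the first

for position, (inp, target) in enumerate(zip(x, y)):
    print(f"position {position}: input {inp} -> target {target}")
# position 0: input 12 -> target 7
# position 1: input 7 -> target 45
# position 2: input 45 -> target 3
# position 3: input 3 -> target 9
```

This is also why the vectorization layer is configured with `output_sequence_length=maxlen + 1`: dropping one token from either end leaves inputs and targets of length `maxlen`.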

### Implement a Keras callback for generating text

The text generator is implemented as a Keras callback that uses the Transformer we have set up so far. The generation settings, such as the starting prompt (i.e. the first words of the generated text) and the number of tokens to generate, are also set here.

In [41]:
class TextGenerator(keras.callbacks.Callback):
    """A callback to generate text from a trained model.
    1. Feed some starting prompt to the model
    2. Predict probabilities for the next token
    3. Sample the next token and add it to the next input

    Arguments:
        max_tokens: Integer, the number of tokens to be generated after prompt.
        start_tokens: List of integers, the token indices for the starting prompt.
        index_to_word: List of strings, obtained from the TextVectorization layer.
        top_k: Integer, sample from the `top_k` token predictions.
        print_every: Integer, print after this many epochs.
    """

    def __init__(
        self, max_tokens, start_tokens, index_to_word, top_k=10, print_every=1
    ):
        self.max_tokens = max_tokens
        self.start_tokens = start_tokens
        self.index_to_word = index_to_word
        self.print_every = print_every
        self.k = top_k

    def sample_from(self, logits):
        logits, indices = tf.math.top_k(logits, k=self.k, sorted=True)
        indices = np.asarray(indices).astype("int32")
        preds = keras.activations.softmax(tf.expand_dims(logits, 0))[0]
        preds = np.asarray(preds).astype("float32")
        return np.random.choice(indices, p=preds)

    def detokenize(self, number):
        return self.index_to_word[number]

    def on_epoch_end(self, epoch, logs=None):
        start_tokens = [_ for _ in self.start_tokens]
        if (epoch + 1) % self.print_every != 0:
            return
        num_tokens_generated = 0
        tokens_generated = []
        while num_tokens_generated <= self.max_tokens:
            pad_len = maxlen - len(start_tokens)
            sample_index = len(start_tokens) - 1
            if pad_len < 0:
                x = start_tokens[:maxlen]
                sample_index = maxlen - 1
            elif pad_len > 0:
                x = start_tokens + [0] * pad_len
            else:
                x = start_tokens
            x = np.array([x])
            y, _ = self.model.predict(x)
            sample_token = self.sample_from(y[0][sample_index])
            tokens_generated.append(sample_token)
            start_tokens.append(sample_token)
            num_tokens_generated = len(tokens_generated)
        txt = " ".join(
            [self.detokenize(_) for _ in self.start_tokens + tokens_generated]
        )
        print(f"generated text:\n{txt}\n")


# Tokenize starting prompt
word_to_index = {}
for index, word in enumerate(vocab):
    word_to_index[word] = index

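start_prompt = "Democrats are bad people."
# Note: the prompt is split on whitespace only, without the lowercasing and
# punctuation splitting done by custom_standardization, so words such as
# "Democrats" or "people." fall back to index 1 ([UNK]).
start_tokens = [word_to_index.get(_, 1) for _ in start_prompt.split()]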
num_tokens_generated = 100
text_gen_callback = TextGenerator(num_tokens_generated, start_tokens, vocab)
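
The sampling step in `sample_from` keeps the `top_k` highest logits, turns them into probabilities with a softmax, and draws one index at random. Below is a minimal standard-library sketch of the same idea; the function name `sample_top_k` and the example logits are ours, and `random.choices` stands in for `np.random.choice`.

```python
import math
import random


def sample_top_k(logits, k, rng=random):
    """Sample an index from the k largest logits, weighted by their softmax."""
    # Indices of the k largest logits, highest first.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax numerators; subtracting the maximum avoids overflow in exp().
    peak = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - peak) for i in top]
    # random.choices normalizes the weights itself.
    return rng.choices(top, weights=weights, k=1)[0]


logits = [2.0, -1.0, 0.5, 3.0, 0.0]
picked = sample_top_k(logits, k=2)
print(picked)  # either 0 or 3, the two highest-scoring indices
```

With `k=1` this reduces to greedy decoding (always the arg-max), while a larger `k` admits more low-probability words and makes the generated text more varied.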

### Text generation results

The model can now be used to generate text. The number of epochs for each run is set in the call to `fit`.

**configuration 1** 

epochs = 25


**configuration 2** 

epochs = 100

**configuration 3** 

epochs = 200

In [20]:
model = create_model()

model.fit(text_ds, verbose=2, epochs=25, callbacks=[text_gen_callback])

Epoch 1/25
generated text:
[UNK] you for coming to my grand [UNK] and the . . , and a you and . it we and we i the you we , we you . the it the and we , , and , i i the . i a . , . . it we i you . i a . i we , the . it and you and i . and and , a and the a , we and . , , and and they i i they we i the . a i we i we . we , they . a a they the and the the i

27/27 - 211s - loss: 6.9688 - dense_5_loss: 6.9688 - 211s/epoch - 8s/step
Epoch 2/25
generated text:
[UNK] you for coming to my grand [UNK] . i and we the 's . . the . the you you a it a . the , i . a , , the and . , . it it . , . , , you , , . 's , , you . i . . , , . it 's , , . , we it . you and , , , i and . , you , a . , the it 's and a i i it and we . a and the a the and it . . a 's the . . and .

27/27 - 212s - loss: 5.9321 - dense_5_loss: 5.9321 - 212s/epoch - 8s/step
Epoch 3/25
generated text:
[UNK] you for coming to my grand [UNK] . . . to the , , a , . the it and , . 's the and to a it , to it 's the , i the it 's we and .

<keras.callbacks.History at 0x166bef3d0>

In [26]:
model_1 = create_model()
model_1.fit(text_ds, verbose=2, epochs=100, callbacks=[text_gen_callback])

Epoch 1/100
generated text:
[UNK] you for coming to my grand [UNK] . , , you it i , . 's 's , a and . and , i , a i , it . it a , . we , it you i , we you it and i 's i the and , the , i , a you the and we a , i , you and . it and , you . the . , a and the a and , , , . and , we 's you i . . it , i i a the . it 's . . the . and the and .

27/27 - 276s - loss: 7.1306 - dense_8_loss: 7.1306 - 276s/epoch - 10s/step
Epoch 2/100
generated text:
[UNK] you for coming to my grand [UNK] , it , we . a , i you . . you a . you you and . it . it , it , , . and , . i the the i the . , , 's , we , , and . we the . . . the a , . we , , . i we it the you it . the . , a and i . it , , i the i i a and the you . i . . , . i , and it it . , , . , , , a

27/27 - 288s - loss: 5.9350 - dense_8_loss: 5.9350 - 288s/epoch - 11s/step
Epoch 3/100
generated text:
[UNK] you for coming to my grand [UNK] we . . . . , . . it . . 's . and you 's we and you a and a . 's . it , , the . and . and . 's you i . , , , , , , t

<keras.callbacks.History at 0x17bd7e200>

In [None]:
model_1.summary()

Model: "model_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_3 (InputLayer)        [(None, 80)]              0         
                                                                 
 token_and_position_embeddin  (None, 80, 512)          38440960  
 g_2 (TokenAndPositionEmbedd                                     
 ing)                                                            
                                                                 
 transformer_block_2 (Transf  (None, 80, 512)          8928768   
 ormerBlock)                                                     
                                                                 
 dense_8 (Dense)             (None, 80, 75000)         38475000  
                                                                 
Total params: 85,844,728
Trainable params: 85,844,728
Non-trainable params: 0
_______________________________________________

In [30]:
model.summary()

Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_2 (InputLayer)        [(None, 80)]              0         
                                                                 
 token_and_position_embeddin  (None, 80, 512)          25640960  
 g_1 (TokenAndPositionEmbedd                                     
 ing)                                                            
                                                                 
 transformer_block_1 (Transf  (None, 80, 512)          8928768   
 ormerBlock)                                                     
                                                                 
 dense_5 (Dense)             (None, 80, 50000)         25650000  
                                                                 
Total params: 60,219,728
Trainable params: 60,219,728
Non-trainable params: 0
_______________________________________________
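
The parameter counts in the two summaries above can be reproduced by hand, which is a useful check on which hyperparameters a given model was built with. The sketch below assumes the standard Keras layouts (weights plus biases for every dense projection, and a gain and offset per `LayerNormalization`); the helper names are ours. Note that `layers.MultiHeadAttention(num_heads, embed_dim)` passes `embed_dim` as the per-head `key_dim`, so each head is `embed_dim` wide rather than `embed_dim / num_heads`.

```python
def embedding_params(vocab_size, maxlen, embed_dim):
    # One vector per vocabulary entry plus one per position.
    return (vocab_size + maxlen) * embed_dim


def attention_params(embed_dim, num_heads, key_dim):
    # Query, key and value projections: embed_dim -> num_heads * key_dim, with bias.
    qkv = 3 * (embed_dim * num_heads * key_dim + num_heads * key_dim)
    # Output projection: num_heads * key_dim -> embed_dim, with bias.
    out = num_heads * key_dim * embed_dim + embed_dim
    return qkv + out


def block_params(embed_dim, num_heads, ff_dim):
    attn = attention_params(embed_dim, num_heads, key_dim=embed_dim)
    ffn = (embed_dim * ff_dim + ff_dim) + (ff_dim * embed_dim + embed_dim)
    norms = 2 * (2 * embed_dim)  # two LayerNormalization layers
    return attn + ffn + norms


def total_params(vocab_size, maxlen=80, embed_dim=512, num_heads=8, ff_dim=512):
    output_layer = embed_dim * vocab_size + vocab_size  # final Dense(vocab_size)
    return (
        embedding_params(vocab_size, maxlen, embed_dim)
        + block_params(embed_dim, num_heads, ff_dim)
        + output_layer
    )


print(block_params(512, 8, 512))  # 8928768
print(total_params(50000))        # 60219728
print(total_params(75000))        # 85844728
```

Both printed summaries are matched by `embed_dim = 512`, `feed_forward_dim = 512` and 8 heads, with vocabulary sizes of 50,000 and 75,000. In both cases the embedding and output layers, which scale with the vocabulary size, hold most of the parameters.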

Reference

1.  Nandan, Apoorv. “Keras Documentation: Text Generation with a Miniature GPT.” Keras, https://keras.io/examples/generative/text_generation_with_miniature_gpt/. 
