<a href="https://colab.research.google.com/github/Chan3377/Deep-learning-for-GenAI-Text-Generation-with-Transformer-Model/blob/main/Deep_learning_for_GenAI_Text_Generation_with_Transformer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Generative deep learning

## Text generation

### How do you generate sequence data?

- The universal way to generate sequence data in deep learning is to ***train a model*** (usually a Transformer or an RNN) ***to predict the next token or next few tokens in a sequence, using the previous tokens as input***.
    - For instance, given the input “the cat is on the,” the model is trained to **predict the target “mat,”** the next word.
- As usual when **working with text data**, tokens are typically words or characters, and any **network that can model the probability of the next token given the previous ones** is called a ***language model***.

### The importance of the sampling strategy

- The **sampling strategy** is the way you ***choose the next token***, and it is crucially important for generating text.

- ***When sampling from generative models***, it’s always good to ***explore different amounts of randomness in the generation process***. But **how can we control the amount of randomness?** With the ***softmax temperature*** parameter.
    - **Softmax temperature** is a parameter used to ***control the amount of stochasticity (randomness) in the sampling process***. It ***characterizes the entropy of the probability distribution*** used for sampling, that is, ***how surprising or predictable the choice of the next word will be***: a higher temperature gives a higher-entropy distribution and more surprising text, while a lower temperature gives more predictable text.
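To make the role of the sampling strategy concrete, here is a minimal sketch contrasting greedy selection with stochastic sampling. The toy vocabulary, the probability values, and the helper names `greedy_choice` and `stochastic_choice` are illustrative assumptions, not part of the notebook.

```python
import numpy as np

# Hypothetical next-token distribution over a toy vocabulary (illustrative values only).
vocab = ["mat", "sofa", "roof", "moon"]
probs = np.array([0.6, 0.25, 0.1, 0.05])

def greedy_choice(probs):
    # Always pick the single most likely token: deterministic, so the output never varies.
    return int(np.argmax(probs))

def stochastic_choice(probs, rng):
    # Draw a token index in proportion to its probability.
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(0)
greedy_tokens = [vocab[greedy_choice(probs)] for _ in range(5)]
sampled_tokens = [vocab[stochastic_choice(probs, rng)] for _ in range(5)]
print(greedy_tokens)   # the same token five times
print(sampled_tokens)  # a mix of tokens, weighted by probability
```

Greedy selection always returns the most likely token, while sampling lets less likely tokens appear occasionally; temperature adjusts how often that happens.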

**Reweighting a probability distribution to a different temperature**

In [None]:
import numpy as np

# original_distribution is a 1D NumPy array of probability values that must sum to 1.
# temperature is a factor quantifying the entropy of the output distribution.
def reweight_distribution(original_distribution, temperature=0.5):
    distribution = np.log(original_distribution) / temperature
    distribution = np.exp(distribution)
    # The sum of the distribution may no longer be 1, so divide it by its sum
    # to obtain the new distribution, and return this reweighted version of the original.
    return distribution / np.sum(distribution)
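As a quick check of how temperature affects entropy, the sketch below applies the same reweighting to a made-up distribution at several temperatures and measures the Shannon entropy of each result. The helper names `reweight` and `entropy` and the probability values are assumptions chosen for illustration.

```python
import numpy as np

def reweight(distribution, temperature):
    # Same reweighting as reweight_distribution, restated so this sketch runs on its own.
    logits = np.log(distribution) / temperature
    exp_logits = np.exp(logits)
    return exp_logits / np.sum(exp_logits)

def entropy(distribution):
    # Shannon entropy in nats; higher means a less predictable choice.
    return float(-np.sum(distribution * np.log(distribution)))

original = np.array([0.6, 0.25, 0.1, 0.05])  # illustrative values
entropies = {}
for temperature in (0.2, 0.5, 1.0, 1.5):
    reweighted = reweight(original, temperature)
    entropies[temperature] = entropy(reweighted)
    print(temperature, np.round(reweighted, 3), round(entropies[temperature], 3))
```

A temperature of 1.0 leaves the distribution unchanged, lower values concentrate the mass on the most likely token, and higher values flatten it.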

### Implementing text generation with Keras

#### Preparing the data

**Downloading and uncompressing the IMDB movie reviews dataset**

In [None]:
!wget https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz

--2024-01-14 05:10:56--  https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘aclImdb_v1.tar.gz’


2024-01-14 05:10:58 (49.9 MB/s) - ‘aclImdb_v1.tar.gz’ saved [84125825/84125825]



**Creating a dataset from text files (one file = one sample)**

In [None]:
import tensorflow as tf
from tensorflow import keras
dataset = keras.utils.text_dataset_from_directory(
    directory="aclImdb", label_mode=None, batch_size=256)
# Strip the <br /> HTML tag that occurs in many of the reviews.
# This did not matter much for text classification, but we wouldn’t want to generate <br /> tags in this project
dataset = dataset.map(lambda x: tf.strings.regex_replace(x, "<br />", " "))

Found 100006 files belonging to 1 classes.


**Preparing a `TextVectorization` layer**

- Use a `TextVectorization` layer to compute the vocabulary we’ll be working with.
- We’ll only use the first `sequence_length` words of each review: the `TextVectorization` layer cuts off anything beyond that when vectorizing a text.

In [None]:
from tensorflow.keras.layers import TextVectorization

sequence_length = 100
# Only consider the top 15,000 most common words;
# anything else will be treated as the out-of-vocabulary token, "[UNK]".
vocab_size = 15000
# output_mode="int" - return sequences of integer word indices
# output_sequence_length - we’ll work with inputs and targets of length 100
# (but since we’ll offset the targets by 1, the model will actually see sequences of length 99).
text_vectorization = TextVectorization(
    max_tokens=vocab_size,
    output_mode="int",
    output_sequence_length=sequence_length,
)
text_vectorization.adapt(dataset)
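As a rough picture of what the layer does with `max_tokens` and `output_sequence_length`, here is a simplified pure-Python stand-in. It is not the Keras implementation: it skips punctuation stripping, and the helper names `build_vocabulary` and `vectorize` and the toy texts are assumptions for illustration.

```python
from collections import Counter

def build_vocabulary(texts, max_tokens):
    # Index 0 is reserved for padding ("") and index 1 for out-of-vocabulary words,
    # so only max_tokens - 2 slots remain for actual words.
    counts = Counter(word for text in texts for word in text.lower().split())
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

def vectorize(text, vocabulary, output_sequence_length):
    word_to_index = {word: i for i, word in enumerate(vocabulary)}
    ids = [word_to_index.get(word, 1) for word in text.lower().split()]
    ids = ids[:output_sequence_length]                       # truncate long texts
    return ids + [0] * (output_sequence_length - len(ids))   # pad short texts

toy_texts = ["the movie was great", "the movie was bad", "great plot"]
toy_vocab = build_vocabulary(toy_texts, max_tokens=6)
short_ids = vectorize("the movie was surprisingly great", toy_vocab, 8)
long_ids = vectorize("the movie was great " * 5, toy_vocab, 8)
print(toy_vocab)
print(short_ids)  # unknown word maps to 1, trailing zeros are padding
print(long_ids)   # cut off after 8 tokens
```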

**Setting up a language modeling dataset**

- Use the layer to create a language modeling dataset where input samples are vectorized texts, and corresponding targets are the same texts offset by one word.

In [None]:
def prepare_lm_dataset(text_batch):
    # Convert a batch of texts (strings) to a batch of integer sequences.
    vectorized_sequences = text_vectorization(text_batch)
    # Create inputs by cutting off the last word of the sequences.
    x = vectorized_sequences[:, :-1]
    # Create targets by offsetting the sequences by 1.
    y = vectorized_sequences[:, 1:]
    return x, y

lm_dataset = dataset.map(prepare_lm_dataset, num_parallel_calls=4)
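To see what the one-step offset produces, the sketch below applies the same slicing to a small NumPy array. The token IDs are made up for illustration; 0 stands for padding.

```python
import numpy as np

# A hypothetical batch of two already-vectorized sequences (0 is padding).
vectorized_sequences = np.array([
    [12, 7, 43, 5, 9, 0],
    [3, 88, 21, 0, 0, 0],
])
x = vectorized_sequences[:, :-1]   # everything except the last token
y = vectorized_sequences[:, 1:]    # everything except the first token

# At each step the target is simply the token that follows the inputs seen so far.
for step in range(3):
    print("sees", x[0, : step + 1], "-> should predict", y[0, step])
```

Each position in `y` holds the token that comes right after the same position in `x`, which is why a single sequence supplies many training examples at once.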

#### A Transformer-based sequence-to-sequence model

- How does a Transformer-based sequence-to-sequence model work?
    - A sequence-to-sequence model is trained by feeding sequences of ***N words (indexed from 0 to N - 1)*** into our model, and we’ll ***predict the sequence offset by one (from 1 to N).***
    - We’ll use causal masking to make sure that, for any ***i***, the model will only be using words from ***0 to i*** in order to ***predict the word i + 1***.
    - This means that the model is trained simultaneously to solve N mostly overlapping but different problems: ***predicting the next word given a sequence of 1 <= i <= N prior words***.
    - ***At generation time***, even if you only ***prompt the model with a single word***, it will be able to ***give you a probability distribution for the next possible words.***
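The causal masking described above can be sketched in NumPy as a lower-triangular matrix. The helper name `causal_mask` and the all-zero attention scores are assumptions chosen to keep the example small.

```python
import numpy as np

def causal_mask(sequence_length):
    # Entry [i, j] is 1 when position i may attend to position j (that is, j <= i).
    i = np.arange(sequence_length)[:, np.newaxis]
    j = np.arange(sequence_length)
    return (i >= j).astype("int32")

mask = causal_mask(4)
print(mask)
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]

# Apply the mask to (here, uniform) attention scores before the softmax:
# masked-out positions get -inf, so they receive exactly zero weight.
scores = np.zeros((4, 4))
masked_scores = np.where(mask == 1, scores, -np.inf)
weights = np.exp(masked_scores) / np.exp(masked_scores).sum(axis=-1, keepdims=True)
print(np.round(weights, 2))
```

Row `i` of `weights` spreads attention only over positions `0` to `i`, so no position can look at words that come after it.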

Reuse the building blocks: **PositionalEmbedding** and **TransformerDecoder**.

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

class PositionalEmbedding(layers.Layer):
    def __init__(self, sequence_length, input_dim, output_dim, **kwargs):
        super().__init__(**kwargs)
        self.token_embeddings = layers.Embedding(
            input_dim=input_dim, output_dim=output_dim)
        self.position_embeddings = layers.Embedding(
            input_dim=sequence_length, output_dim=output_dim)
        self.sequence_length = sequence_length
        self.input_dim = input_dim
        self.output_dim = output_dim

    def call(self, inputs):
        length = tf.shape(inputs)[-1]
        positions = tf.range(start=0, limit=length, delta=1)
        embedded_tokens = self.token_embeddings(inputs)
        embedded_positions = self.position_embeddings(positions)
        return embedded_tokens + embedded_positions

    def compute_mask(self, inputs, mask=None):
        return tf.math.not_equal(inputs, 0)

    def get_config(self):
        config = super(PositionalEmbedding, self).get_config()
        config.update({
            "output_dim": self.output_dim,
            "sequence_length": self.sequence_length,
            "input_dim": self.input_dim,
        })
        return config


class TransformerDecoder(layers.Layer):
    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim
        self.dense_dim = dense_dim
        self.num_heads = num_heads
        self.attention_1 = layers.MultiHeadAttention(
          num_heads=num_heads, key_dim=embed_dim)
        self.attention_2 = layers.MultiHeadAttention(
          num_heads=num_heads, key_dim=embed_dim)
        self.dense_proj = keras.Sequential(
            [layers.Dense(dense_dim, activation="relu"),
             layers.Dense(embed_dim),]
        )
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()
        self.layernorm_3 = layers.LayerNormalization()
        self.supports_masking = True

    def get_config(self):
        config = super(TransformerDecoder, self).get_config()
        config.update({
            "embed_dim": self.embed_dim,
            "num_heads": self.num_heads,
            "dense_dim": self.dense_dim,
        })
        return config

    def get_causal_attention_mask(self, inputs):
        input_shape = tf.shape(inputs)
        batch_size, sequence_length = input_shape[0], input_shape[1]
        i = tf.range(sequence_length)[:, tf.newaxis]
        j = tf.range(sequence_length)
        mask = tf.cast(i >= j, dtype="int32")
        mask = tf.reshape(mask, (1, input_shape[1], input_shape[1]))
        mult = tf.concat(
            [tf.expand_dims(batch_size, -1),
             tf.constant([1, 1], dtype=tf.int32)], axis=0)
        return tf.tile(mask, mult)

    def call(self, inputs, encoder_outputs, mask=None):
        causal_mask = self.get_causal_attention_mask(inputs)
        if mask is not None:
            padding_mask = tf.cast(
                mask[:, tf.newaxis, :], dtype="int32")
            padding_mask = tf.minimum(padding_mask, causal_mask)
        else:
            padding_mask = mask
        attention_output_1 = self.attention_1(
            query=inputs,
            value=inputs,
            key=inputs,
            attention_mask=causal_mask)
        attention_output_1 = self.layernorm_1(inputs + attention_output_1)
        attention_output_2 = self.attention_2(
            query=attention_output_1,
            value=encoder_outputs,
            key=encoder_outputs,
            attention_mask=padding_mask,
        )
        attention_output_2 = self.layernorm_2(
            attention_output_1 + attention_output_2)
        proj_output = self.dense_proj(attention_output_2)
        return self.layernorm_3(attention_output_2 + proj_output)

**A simple Transformer-based language model**

In [None]:
from tensorflow.keras import layers
embed_dim = 256
latent_dim = 2048
num_heads = 2

inputs = keras.Input(shape=(None,), dtype="int64")
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(inputs)
x = TransformerDecoder(embed_dim, latent_dim, num_heads)(x, x)
# Softmax over possible vocabulary words, computed for each output sequence timestep
outputs = layers.Dense(vocab_size, activation="softmax")(x)
model = keras.Model(inputs, outputs)
model.compile(loss="sparse_categorical_crossentropy", optimizer="rmsprop")

### A text-generation callback with variable-temperature sampling

**The text-generation callback**

Use a callback to generate text using a range of different temperatures after every epoch.

* This allows you to see how the generated text evolves as the model begins to converge,
* as well as the impact of temperature in the sampling strategy.

***Note***: Use [UNK] to catch an "out of vocabulary" index
- When decoding a sequence of integers back into words, you’ll replace 1 with something like “[UNK]” (which you’d call an “OOV token”).
- “Why use 1 and not 0?” you may ask. That’s because 0 is already taken. There are two special tokens that you will commonly use: ***the OOV token (index 1)***, and ***the mask token (index 0)***.
- While the OOV token means “here was a word we did not recognize,” the mask token tells us “ignore me, I’m not a word.”
- The mask token is used in particular to pad sequence data: because data batches need to be contiguous, all sequences in a batch must have the same length, so shorter sequences are padded to the length of the longest sequence.
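The two special indices can be illustrated with a small pure-Python sketch. The toy index and the helper names `pad_batch` and `decode` are hypothetical, chosen only to show the roles of index 0 (mask) and index 1 (OOV).

```python
def pad_batch(sequences):
    # Pad every sequence with the mask token (0) up to the length of the longest one.
    longest = max(len(sequence) for sequence in sequences)
    return [sequence + [0] * (longest - len(sequence)) for sequence in sequences]

def decode(token_ids, index_to_token):
    words = []
    for token_id in token_ids:
        if token_id == 0:
            continue  # mask token: "ignore me, I'm not a word"
        # Any index missing from the mapping is shown as the OOV token.
        words.append(index_to_token.get(token_id, "[UNK]"))
    return " ".join(words)

toy_index = {0: "", 1: "[UNK]", 2: "this", 3: "movie", 4: "rocks"}
batch = pad_batch([[2, 3, 4], [2, 1]])
print(batch)                        # [[2, 3, 4], [2, 1, 0]]
print(decode(batch[0], toy_index))  # this movie rocks
print(decode(batch[1], toy_index))  # this [UNK]
```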

In [None]:
import numpy as np

# Dict that maps word indices back to strings, to be used for text decoding
tokens_index = dict(enumerate(text_vectorization.get_vocabulary()))

# Implements variable temperature sampling from a probability distribution
def sample_next(predictions, temperature=1.0):
    predictions = np.asarray(predictions).astype("float64")
    predictions = np.log(predictions) / temperature
    exp_preds = np.exp(predictions)
    predictions = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, predictions, 1)
    return np.argmax(probas)

class TextGenerator(keras.callbacks.Callback):
    # prompt - Prompt that we use to seed text generation
    # generate_length - How many words to generate
    # model_input_length - Sequence length the model works with
    # temperatures - Range of temperatures to use for sampling
    # print_freq - Generate text every print_freq epochs
    def __init__(self,
                 prompt,
                 generate_length,
                 model_input_length,
                 temperatures=(1.,),
                 print_freq=1):
        super().__init__()
        self.prompt = prompt
        self.generate_length = generate_length
        self.model_input_length = model_input_length
        self.temperatures = temperatures
        self.print_freq = print_freq
        vectorized_prompt = text_vectorization([prompt])[0].numpy()
        self.prompt_length = np.nonzero(vectorized_prompt == 0)[0][0]

    def on_epoch_end(self, epoch, logs=None):
        if (epoch + 1) % self.print_freq != 0:
            return
        for temperature in self.temperatures:
            print("== Generating with temperature", temperature)
            # when generating text, we start from our prompt
            sentence = self.prompt
            for i in range(self.generate_length):
                # Feed the current sequence into model
                tokenized_sentence = text_vectorization([sentence])
                predictions = self.model(tokenized_sentence)
                # Retrieve the predictions for the last timestep, and use them to sample a new word
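                # Pass the current temperature; otherwise every run samples at the default of 1.0
                next_token = sample_next(
                    predictions[0, self.prompt_length - 1 + i, :],
                    temperature=temperature,
                )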
                # Append the new word to the current sequence and repeat
                sampled_token = tokens_index[next_token]
                sentence += " " + sampled_token
            print(sentence)

prompt = "This movie"
text_gen_callback = TextGenerator(
    prompt,
    generate_length=50,
    model_input_length=sequence_length,
    # Use a diverse range of temperatures to sample text, to demonstrate the effect of temperature on text generation
    temperatures=(0.2, 0.5, 0.7, 1., 1.5))

**Fitting the language model**

In [None]:
model.fit(lm_dataset, epochs=200, callbacks=[text_gen_callback])

Epoch 1/200
This movie looks for anything to albeit sold and if youve got read it like most tango movie it the plot and a drew and you said her music needs to be the surprises and we they renamed which even known what can be only laugh and even get the directors were
== Generating with temperature 0.5
This movie does were much better plot thanks to say its a sweet here for it and that [UNK] movie seems true director [UNK] cary talky is a [UNK] to this is close film it can work more recommend this movie had great thing about drilling she was to do not a
== Generating with temperature 0.7
This movie is a really many important that i [UNK] comedies this you have never then there was put away from sex but mostly how bad spoke goes stay at the case of course exactly the movie to person walked the parents if they should say that the plot ever read the
== Generating with temperature 1.0
This movie bette fictional the movie is that is a big ourselves this is a million for like a few looking fo