# Transformer Tutorial Generated by GPT 4

Both the descriptive explanations and the code samples for this tutorial were generated entirely with chatGPT using the GPT 4 model. In some cases the initial code had minor errors, these errors were also fixed by GPT 4 by feeding the errors back into GPT 4 and GPT 4 would generate new code.  This debugging process was repeated at most 3 times.  The last example, with multi-headed attention and token and position embedding, was the most complicated and took GPT 4 3 iterations to get it right.

This is an basic tutorial which uses built in layers from Tensorflow for the self attention mechanism and token and position embedding

## IMDB Sentiment Analysis

The Keras IMDB dataset is a popular dataset for sentiment analysis tasks in natural language processing (NLP). It contains 50,000 movie reviews from the Internet Movie Database (IMDB) labeled as either positive (1) or negative (0) based on the sentiment expressed in the review. The dataset is divided into 25,000 reviews for training and 25,000 reviews for testing.

The reviews in the dataset have been preprocessed, and each review is encoded as a sequence of word indices (integers). The indices represent the overall frequency rank of the words in the entire dataset. For instance, the integer "3" encodes the 3rd most frequent word in the data. This encoding allows for faster processing and less memory usage compared to working with raw text data.

The Keras IMDB dataset is typically used for binary classification tasks, where the goal is to build a machine learning model that can predict whether a given movie review is positive or negative based on the text content. The dataset is accessible through the tensorflow.keras.datasets module in the TensorFlow library.

In [1]:
import numpy as np
import tensorflow as tf
from keras.layers import Input, Dense, MultiHeadAttention, LayerNormalization, Dropout, Embedding, GlobalAveragePooling1D, Add
from keras.models import Model
from keras.optimizers import Adam
from keras.losses import BinaryCrossentropy
from keras.datasets import imdb

2023-05-09 19:20:22.644788: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-05-09 19:20:22.694539: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-05-09 19:20:22.695518: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


# Single headed attention 
This code will create a single-headed attention transformer model and train it on the IMDB dataset. The model has an input layer, an embedding layer, a layer normalization, a multi-head attention layer with a single head, another layer normalization, and finally a dense layer with a softmax activation function. The model is compiled with the SparseCategoricalCrossentropy loss function and the Adam optimizer. It is then trained for 10 epochs and evaluated on the test set.


Since we are working on a classification task, we should not output a probability distribution over the entire vocabulary. Instead, we should output a single probability for each class.

In [2]:
# Load and preprocess the data
vocab_size = 10000
max_length = 200

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)
x_train = tf.keras.preprocessing.sequence.pad_sequences(x_train, maxlen=max_length)
x_test = tf.keras.preprocessing.sequence.pad_sequences(x_test, maxlen=max_length)

# Define the single-headed attention transformer model
def transformer_model(vocab_size, d_model, input_length):
    inputs = Input(shape=(input_length,))
    embeddings = Embedding(vocab_size, d_model)(inputs)

    normalized_embeddings = LayerNormalization(epsilon=1e-6)(embeddings)
    attention = MultiHeadAttention(num_heads=1, key_dim=d_model)(normalized_embeddings, normalized_embeddings)
    attention = Dropout(0.1)(attention)
    attention = LayerNormalization(epsilon=1e-6)(attention + normalized_embeddings)

    pooled = GlobalAveragePooling1D()(attention)
    outputs = Dense(1, activation='sigmoid')(pooled)
    model = Model(inputs=inputs, outputs=outputs)
    return model

# Create and compile the model
d_model = 128
model = transformer_model(vocab_size, d_model, max_length)
model.compile(loss=BinaryCrossentropy(from_logits=False), optimizer=Adam(), metrics=['accuracy'])

# Train the model
batch_size = 64
epochs = 6
model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs, validation_split=0.2)

# Evaluate the model
loss, accuracy = model.evaluate(x_test, y_test)
print(f'Test Loss: {loss}, Test Accuracy: {accuracy}')

2023-05-09 19:20:29.082765: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1956] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...


Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6
Test Loss: 1.0261664390563965, Test Accuracy: 0.8234800100326538


# Multi-Headed attention

This code creates a multi-headed attention transformer model and trains it on the IMDB dataset. The model has an input layer, an embedding layer, a layer normalization, a multi-head attention layer, another layer normalization, a global average pooling layer, and finally a dense layer with a sigmoid activation function. The model is compiled with the BinaryCrossentropy loss function and the Adam optimizer. It is then trained for 10 epochs and evaluated on the test set.

In [3]:
# Load and preprocess the data
vocab_size = 20000
max_length = 200

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)
x_train = tf.keras.preprocessing.sequence.pad_sequences(x_train, maxlen=max_length)
x_test = tf.keras.preprocessing.sequence.pad_sequences(x_test, maxlen=max_length)

# Define the multi-headed attention transformer model
def transformer_model(vocab_size, d_model, input_length, num_heads):
    inputs = Input(shape=(input_length,))
    embeddings = Embedding(vocab_size, d_model)(inputs)

    normalized_embeddings = LayerNormalization(epsilon=1e-6)(embeddings)
    attention = MultiHeadAttention(num_heads=num_heads, key_dim=d_model)(normalized_embeddings, normalized_embeddings)
    attention = Dropout(0.1)(attention)
    attention = LayerNormalization(epsilon=1e-6)(attention + normalized_embeddings)

    pooled = GlobalAveragePooling1D()(attention)
    outputs = Dense(1, activation='sigmoid')(pooled)
    model = Model(inputs=inputs, outputs=outputs)
    return model

# Create and compile the model
d_model = 128
num_heads = 8
model = transformer_model(vocab_size, d_model, max_length, num_heads)
model.compile(loss=BinaryCrossentropy(from_logits=False), optimizer=Adam(), metrics=['accuracy'])

# Train the model
batch_size = 64
epochs = 6
model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs, validation_split=0.2)

# Evaluate the model
loss, accuracy = model.evaluate(x_test, y_test)
print(f'Test Loss: {loss}, Test Accuracy: {accuracy}')

Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6
Test Loss: 0.8519366383552551, Test Accuracy: 0.834879994392395


# Multi-Headed Attention with Token and Position Embedding

_Generate python code using the tensorflow module to build and train a transformer model with a multi headed attention mechanism and token and position embedding, using keras.layers.MultiHeadAttention_

This code creates a multi-headed attention transformer model with token and position embeddings and trains it on the IMDB dataset. The model has an input layer, an embedding layer, an added position encoding, a layer normalization, a multi-head attention layer, another layer normalization, a global average pooling layer, and finally a dense layer with a sigmoid activation function. The model is compiled with the BinaryCrossentropy loss function and the Adam optimizer. It is then trained for 10 epochs and evaluated on the test set.

In [4]:
# Load and preprocess the data
vocab_size = 20000
max_length = 200

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)
x_train = tf.keras.preprocessing.sequence.pad_sequences(x_train, maxlen=max_length)
x_test = tf.keras.preprocessing.sequence.pad_sequences(x_test, maxlen=max_length)

# Positional encoding
def positional_encoding(position, d_model):
    angle_rads = np.arange(0, d_model, 2) / np.array([10000 ** (i / d_model) for i in range(0, d_model, 2)])
    angle_rads = np.repeat(np.arange(position)[:, np.newaxis], d_model // 2, axis=1) * angle_rads

    pos_encoding = np.zeros((1, position, d_model))
    pos_encoding[:, :, 0::2] = np.sin(angle_rads)
    pos_encoding[:, :, 1::2] = np.cos(angle_rads)
    
    return tf.cast(pos_encoding, dtype=tf.float32)

# Define the multi-headed attention transformer model with token and position embeddings
def transformer_model(vocab_size, d_model, input_length, num_heads):
    inputs = Input(shape=(input_length,))
    token_embeddings = Embedding(vocab_size, d_model)(inputs)

    position_embeddings = positional_encoding(input_length, d_model)
    embeddings = Add()([token_embeddings, position_embeddings])

    normalized_embeddings = LayerNormalization(epsilon=1e-6)(embeddings)
    attention = MultiHeadAttention(num_heads=num_heads, key_dim=d_model)(normalized_embeddings, normalized_embeddings)
    attention = Dropout(0.1)(attention)
    attention = LayerNormalization(epsilon=1e-6)(attention + normalized_embeddings)

    pooled = GlobalAveragePooling1D()(attention)
    outputs = Dense(1, activation='sigmoid')(pooled)
    model = Model(inputs=inputs, outputs=outputs)
    return model

# Create and compile the model
d_model = 128
num_heads = 8
model = transformer_model(vocab_size, d_model, max_length, num_heads)
model.compile(loss=BinaryCrossentropy(from_logits=False), optimizer=Adam(), metrics=['accuracy'])

# Train the model
batch_size = 64
epochs = 6
model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs, validation_split=0.2)

# Evaluate the model
loss, accuracy = model.evaluate(x_test, y_test)
print(f'Test Loss: {loss}, Test Accuracy: {accuracy}')

Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6
Test Loss: 0.9520815014839172, Test Accuracy: 0.8418800234794617
