# GPT-2 Setup & Fine Tuning

## Setup

Before we start implementing GPT-2, we install and import all the
libraries we need.

In this notebook, we will be using the KerasNLP library.

We will enable mixed-precision training (the `mixed_float16` policy) to reduce training time.

In [2]:
!pip install keras-nlp
!pip install pandas
!pip install opendatasets
!pip install pyyaml h5py  # Required to save models in HDF5 format


Collecting keras-nlp
  Downloading keras_nlp-0.7.0-py3-none-any.whl (415 kB)
Collecting keras-core (from keras-nlp)
  Downloading keras_core-0.1.7-py3-none-any.whl (950 kB)
Collecting tensorflow-text (from keras-nlp)
  Downloading tensorflow_text-2.15.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.2 MB)
Collecting namex (from keras-core->keras-nlp)
  Downloading namex-0.0.7-py3-none-any.whl (5.8 kB)
Installing collected packages: namex, keras-core, tensorflow-text, keras-nlp
Successfully installed keras-core-0.1.7 keras-nlp-0.7.0 namex-0.0.7 tensorflow-text-2.15.0
Collecting opendatasets
  Downloading open

In [3]:
import keras_nlp
import pandas as pd
import opendatasets as od
import tensorflow as tf
import time

from tensorflow import keras

policy = keras.mixed_precision.Policy("mixed_float16")
keras.mixed_precision.set_global_policy(policy)

Using TensorFlow backend


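Set the Keras backend to TensorFlow (`"jax"` and `"torch"` are also supported). Note that `KERAS_BACKEND` is read when Keras is first imported, so to actually switch backends this cell has to run before `import keras_nlp`; in the order shown here, the TensorFlow backend reported above is already in effect.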

In [4]:

import os

os.environ["KERAS_BACKEND"] = "tensorflow"

Define the hyperparameters used in the rest of the notebook.


In [5]:
# General hyperparameters
BATCH_SIZE = 32
NUM_BATCHES = 500
EPOCHS = 1
MAX_SEQUENCE_LENGTH = 128
MAX_GENERATION_LENGTH = 200
SUBSET_SIZE = 1250

GPT2_PRESET = "gpt2_large_en"

# LoRA-specific hyperparameters
RANK = 4
ALPHA = 32.0

## Fine tuning

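The fine-tuning data is `medquad.csv`, which is part of the Kaggle dataset linked below. The next cell downloads the dataset into `./layoutlm/` and asks for Kaggle credentials.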

(Download link: https://www.kaggle.com/datasets/jpmiller/layoutlm)

In [7]:
od.download("https://www.kaggle.com/datasets/jpmiller/layoutlm")

Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
Your Kaggle username: tesr33
Your Kaggle Key: ··········
Downloading layoutlm.zip to ./layoutlm


100%|██████████| 8.16G/8.16G [01:28<00:00, 98.7MB/s]





Read the CSV file and drop every row that contains a missing value.


In [8]:
health_df = pd.read_csv("./layoutlm/medquad.csv")
health_df = health_df.dropna()
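
As a small illustration of what `dropna()` does with its default arguments, the sketch below uses a made-up two-column frame (the column names and values are hypothetical, not taken from `medquad.csv`). Any row with at least one missing value is dropped, while the columns themselves are left untouched.

```python
import pandas as pd

# Hypothetical toy data, used only to show the default dropna() behavior.
toy_df = pd.DataFrame(
    {
        "question": ["What is asthma?", "What is glaucoma?", None],
        "answer": ["A lung condition.", None, "An eye condition."],
    }
)

# Default arguments: axis=0 (rows) and how="any".
cleaned = toy_df.dropna()

print(len(toy_df), "rows before,", len(cleaned), "rows after")
print(list(cleaned.columns))
```

Dropping columns instead would require `dropna(axis=1)`, which is not what the cell above does.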

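Split the pandas DataFrame into fixed-size subsets and convert each subset into a `tf.data.Dataset`. The `nb` argument selects the first (`1`) or second (`2`) half of the subsets.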

In [9]:
def split_dataframe_to_datasets(df, nb, subset_size=SUBSET_SIZE):
    # Get the total number of rows in the DataFrame
    total_rows = len(df)

    # Calculate the number of subsets needed (ceiling division)
    num_subsets = (total_rows + subset_size - 1) // subset_size

    # Select which half of the subsets to build:
    # nb == 1 is the first half, nb == 2 is the second half.
    if nb == 1:
        s, e = 0, num_subsets // 2
    elif nb == 2:
        s, e = num_subsets // 2, num_subsets
    else:
        raise ValueError(f"nb must be 1 or 2, got {nb}")

    # Loop through and create one tf.data.Dataset per subset
    subsets = []
    for i in range(s, e):
        start_idx = i * subset_size
        end_idx = min((i + 1) * subset_size, total_rows)
        subset = df.iloc[start_idx:end_idx]
        subsets.append(tf.data.Dataset.from_tensor_slices(subset))

    return subsets
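
To make the index arithmetic easier to check, here is a minimal sketch of the same splitting logic without pandas or TensorFlow. The helper name `subset_bounds` and the row count of 3000 are hypothetical and chosen only for illustration.

```python
def subset_bounds(total_rows, nb, subset_size=1250):
    """Return the (start, end) row ranges that make up half `nb` (1 or 2)."""
    # Ceiling division: a final, partially filled subset still counts.
    num_subsets = (total_rows + subset_size - 1) // subset_size
    if nb == 1:
        first, last = 0, num_subsets // 2
    elif nb == 2:
        first, last = num_subsets // 2, num_subsets
    else:
        raise ValueError("nb must be 1 or 2")
    return [
        (i * subset_size, min((i + 1) * subset_size, total_rows))
        for i in range(first, last)
    ]


# Hypothetical row count, not the size of the real dataset.
first_half = subset_bounds(3000, 1)
second_half = subset_bounds(3000, 2)
print(first_half)
print(second_half)
```

When the number of subsets is odd, the second half receives the extra subset, and the last subset may be shorter than `subset_size`.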

In [10]:
health_ds = split_dataframe_to_datasets(health_df,1)

A Function for Generating Text with Time Tracking

In [6]:
def generate_text(model, input_text, max_length=200):
    start = time.time()

    output = model.generate(input_text, max_length=max_length)
    print("\nOutput:")
    print(output)

    end = time.time()
    print(f"Total Time Elapsed: {end - start:.2f}s")
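
If the generated text is needed for further processing rather than only printed, the timing can be separated from the printing. The sketch below is one possible variant, not part of the original notebook: `timed_call` and `fake_generate` are hypothetical names, and `fake_generate` merely stands in for `model.generate` so the example runs without a model.

```python
import time


def timed_call(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()  # monotonic clock suited to measuring intervals
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def fake_generate(prompt, max_length=200):
    # Hypothetical stand-in for model.generate: echoes a truncated prompt.
    return (prompt + " ...")[:max_length]


output, seconds = timed_call(fake_generate, "What is asthma?", max_length=50)
print(output)
print(f"Total Time Elapsed: {seconds:.2f}s")
```

With a real model, the call would look like `timed_call(model.generate, input_text, max_length=200)`.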

Optimizer and Loss Configuration for Neural Network Training with AdamW and Sparse Categorical Crossentropy

In [12]:
def get_optimizer_and_loss():
    optimizer = keras.optimizers.AdamW(
        learning_rate=5e-5,
        weight_decay=0.01,
        epsilon=1e-6,
        global_clipnorm=1.0,  # Gradient clipping.
    )
    # Exclude layernorm and bias terms from weight decay.
    optimizer.exclude_from_weight_decay(var_names=["bias"])
    optimizer.exclude_from_weight_decay(var_names=["gamma"])
    optimizer.exclude_from_weight_decay(var_names=["beta"])

    loss = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    return optimizer, loss

Wrapping Dense Layers with a Low-Rank Adaptation (LoRA) Layer for Parameter-Efficient Fine-Tuning

In [7]:
import math


class LoraLayer(keras.layers.Layer):
    def __init__(
        self,
        original_layer,
        rank=8,
        alpha=32,
        trainable=False,
        **kwargs,
    ):
        # We want to keep the name of this layer the same as the original
        # dense layer.
        original_layer_config = original_layer.get_config()
        name = original_layer_config["name"]

        kwargs.pop("name", None)

        super().__init__(name=name, trainable=trainable, **kwargs)

        self.rank = rank
        self.alpha = alpha

        self._scale = alpha / rank

        self._num_heads = original_layer_config["output_shape"][-2]
        self._hidden_dim = self._num_heads * original_layer_config["output_shape"][-1]

        # Layers.

        # Original dense layer.
        self.original_layer = original_layer
        # No matter whether we are training the model or are in inference mode,
        # this layer should be frozen.
        self.original_layer.trainable = False

        # LoRA dense layers.
        self.A = keras.layers.Dense(
            units=rank,
            use_bias=False,
            # Note: the original paper mentions that normal distribution was
            # used for initialization. However, the official LoRA implementation
            # uses "Kaiming/He Initialization".
            kernel_initializer=keras.initializers.VarianceScaling(
                scale=math.sqrt(5), mode="fan_in", distribution="uniform"
            ),
            trainable=trainable,
            name="lora_A",
        )
        # B has the same `equation` and `output_shape` as the original layer.
        # `equation = abc,cde->abde`, where `a`: batch size, `b`: sequence
        # length, `c`: `hidden_dim`, `d`: `num_heads`,
        # `e`: `hidden_dim//num_heads`. The only difference is that in layer `B`,
        # `c` represents `rank`.
        self.B = keras.layers.EinsumDense(
            equation=original_layer_config["equation"],
            output_shape=original_layer_config["output_shape"],
            kernel_initializer="zeros",
            trainable=trainable,
            name="lora_B",
        )

    def call(self, inputs):
        original_output = self.original_layer(inputs)
        if self.trainable:
            # If we are fine-tuning the model, we will add LoRA layers' output
            # to the original layer's output.
            lora_output = self.B(self.A(inputs)) * self._scale
            return original_output + lora_output

        # Outside fine-tuning (`trainable=False`), only the frozen layer's
        # output is returned. For the LoRA update to affect inference, the
        # LoRA weights must first be merged into the original layer's kernel.
        return original_output
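
The idea behind the layer can be checked numerically on plain matrices. The following is a minimal NumPy sketch under simplifying assumptions: it uses 2-D kernels rather than the per-head `EinsumDense` kernels above, toy sizes, and random values for `B` (in the layer above `B` starts at zero). It shows that adding the scaled low-rank output during training gives the same result as folding `A @ B` into the frozen kernel once.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes; rank and alpha mirror RANK = 4 and ALPHA = 32.0 from the notebook.
hidden_dim, rank, alpha = 16, 4, 32.0
scale = alpha / rank

W = rng.normal(size=(hidden_dim, hidden_dim))  # frozen pretrained kernel
A = rng.normal(size=(hidden_dim, rank))        # trainable down-projection
B = rng.normal(size=(rank, hidden_dim))        # trainable up-projection
x = rng.normal(size=(2, hidden_dim))           # batch of two input vectors

# Training-time path: original output plus the scaled low-rank correction.
y_train = x @ W + (x @ A @ B) * scale

# Inference-time alternative: merge the correction into a single kernel.
W_merged = W + (A @ B) * scale
y_merged = x @ W_merged

print(np.allclose(y_train, y_merged))
print("full kernel params:", W.size, "LoRA params:", A.size + B.size)
```

The parameter saving grows with the hidden size, because the LoRA factors scale with `hidden_dim * rank` while the full kernel scales with `hidden_dim ** 2`.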

Memory Management and Model Loading: Resetting GPU Memory Stats and Initializing a GPT-2 Causal Language Model

In [14]:
# This resets "peak" memory usage to "current" memory usage.
tf.config.experimental.reset_memory_stats("GPU:0")

# Load the original model.
preprocessor = keras_nlp.models.GPT2CausalLMPreprocessor.from_preset(
    GPT2_PRESET,
    sequence_length=MAX_SEQUENCE_LENGTH,
)
lora_model = keras_nlp.models.GPT2CausalLM.from_preset(
    GPT2_PRESET,
    preprocessor=preprocessor,
)

Downloading from https://www.kaggle.com/api/v1/models/keras/gpt2/keras/gpt2_large_en/2/download/tokenizer.json...
100%|██████████| 448/448 [00:00<00:00, 323kB/s]
Downloading from https://www.kaggle.com/api/v1/models/keras/gpt2/keras/gpt2_large_en/2/download/assets/tokenizer/merges.txt...
100%|██████████| 446k/446k [00:00<00:00, 1.79MB/s]
Downloading from https://www.kaggle.com/api/v1/models/keras/gpt2/keras/gpt2_large_en/2/download/assets/tokenizer/vocabulary.json...
100%|██████████| 0.99M/0.99M [00:00<00:00, 2.97MB/s]
Downloading from https://www.kaggle.com/api/v1/models/keras/gpt2/keras/gpt2_large_en/2/download/config.json...
100%|██████████| 485/485 [00:00<00:00, 182kB/s]
Downloading from https://www.kaggle.com/api/v1/models/keras/gpt2/keras/gpt2_large_en/2/download/model.weights.h5...
100%|██████████| 2.88G/2.88G [01:31<00:00, 33.7MB/s]
  return id(getattr(self, attr)) not in self._functional_layer_ids


LoRA Modification of the Self-Attention Layers in the GPT-2 Language Model

In [15]:
for layer_idx in range(lora_model.backbone.num_layers):
    # Get this decoder block's self-attention layer.
    decoder_layer = lora_model.backbone.get_layer(f"transformer_layer_{layer_idx}")
    self_attention_layer = decoder_layer._self_attention_layer

    # Change query dense layer.
    self_attention_layer._query_dense = LoraLayer(
        self_attention_layer._query_dense,
        rank=RANK,
        alpha=ALPHA,
        trainable=True,
    )

    # Change value dense layer.
    self_attention_layer._value_dense = LoraLayer(
        self_attention_layer._value_dense,
        rank=RANK,
        alpha=ALPHA,
        trainable=True,
    )

Forward Pass on a Sample Text Sequence to Build the LoRA Layer Weights

In [16]:
lora_model(preprocessor(["LoRA is very useful for quick LLM finetuning"])[0])
pass

Fine-tuning Configuration: Setting Trainable State for LoRA Layers in the GPT-2 Model

In [17]:
for layer in lora_model._flatten_layers():
    lst_of_sublayers = list(layer._flatten_layers())

    if len(lst_of_sublayers) == 1:  # "leaves of the model"
        if layer.name in ["lora_A", "lora_B"]:
            layer.trainable = True
        else:
            layer.trainable = False

In [18]:
lora_model.summary()
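
The trainable parameter count reported by the summary can be cross-checked by hand. The sketch below assumes the `gpt2_large_en` preset has 36 transformer layers, a hidden size of 1280, and 20 attention heads; these values are not stated in this notebook and should be treated as assumptions. It counts only the `lora_A` and `lora_B` kernels, neither of which has a bias.

```python
# Assumed architecture of the "gpt2_large_en" preset (not stated in the notebook).
NUM_LAYERS = 36
HIDDEN_DIM = 1280
NUM_HEADS = 20
RANK = 4  # same value as the RANK hyperparameter above

head_dim = HIDDEN_DIM // NUM_HEADS

# lora_A: Dense kernel of shape (hidden_dim, rank), no bias.
params_a = HIDDEN_DIM * RANK
# lora_B: EinsumDense kernel of shape (rank, num_heads, head_dim), no bias.
params_b = RANK * NUM_HEADS * head_dim

per_wrapped_layer = params_a + params_b
# The query and value projections are wrapped in every transformer layer.
total_trainable = per_wrapped_layer * 2 * NUM_LAYERS

print(per_wrapped_layer)
print(total_trainable)
```

If the number printed by `lora_model.summary()` differs, the assumed architecture values are the first thing to verify.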

Training the GPT-2 Model with Checkpointing

In [19]:
optimizer, loss = get_optimizer_and_loss()

lora_model.compile(optimizer=optimizer,
                   loss=loss,
                   metrics=["accuracy"])

checkpoint_path = "training_1/cp.ckpt"
checkpoint_dir = os.path.dirname(checkpoint_path)
print(checkpoint_path)


# Create a callback that saves the model's weights every 3 * SUBSET_SIZE
# training batches (an integer save_freq counts batches, not epochs)
cp_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_path,
    verbose=1,
    save_weights_only=True,
    save_freq=3*SUBSET_SIZE)



lora_model.save_weights(checkpoint_path.format(epoch=0))

for subset in health_ds:
    # Train on this subset, passing the checkpoint callback to fit()
    lora_model.fit(subset, callbacks=[cp_callback], epochs=3)

# This may generate warnings related to saving the state of the optimizer.
# These warnings (and similar warnings throughout this notebook)
# are in place to discourage outdated usage, and can be ignored.



training_1/cp.ckpt




Epoch 1/3
Epoch 2/3
Epoch 3/3
Epoch 3: saving model to training_1/cp.ckpt
Epoch 1/3
Epoch 2/3
Epoch 3/3
Epoch 3: saving model to training_1/cp.ckpt
Epoch 1/3
Epoch 2/3
Epoch 3/3
Epoch 3: saving model to training_1/cp.ckpt
Epoch 1/3
Epoch 2/3
Epoch 3/3
Epoch 3: saving model to training_1/cp.ckpt
Epoch 1/3
Epoch 2/3
Epoch 3/3
Epoch 3: saving model to training_1/cp.ckpt
Epoch 1/3
Epoch 2/3
Epoch 3/3
Epoch 3: saving model to training_1/cp.ckpt
Epoch 1/3
Epoch 2/3
Epoch 3/3
Epoch 3: saving model to training_1/cp.ckpt


In [None]:
os.listdir(checkpoint_dir)

['cp.ckpt.index', 'cp.ckpt.data-00000-of-00001', 'checkpoint']

Mounting Google Drive in Google Colab for File Access

In [9]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Saving GPT-2 Model Weights to Google Drive



In [21]:
lora_model.save_weights('./drive/MyDrive/checkpoints/my_checkpoint')

Training the model on the second half of the dataset

In [None]:
health_ds = split_dataframe_to_datasets(health_df,2)

optimizer, loss = get_optimizer_and_loss()

lora_model.compile(optimizer=optimizer,
                   loss=loss,
                   metrics=["accuracy"])

checkpoint_path = "training_1/cp.ckpt"
checkpoint_dir = os.path.dirname(checkpoint_path)
print(checkpoint_path)

# Create a callback that saves the model's weights every 4 * SUBSET_SIZE
# training batches (an integer save_freq counts batches, not epochs)
cp_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_path,
    verbose=1,
    save_weights_only=True,
    save_freq=4*SUBSET_SIZE)



lora_model.save_weights(checkpoint_path.format(epoch=0))

for subset in health_ds:
    # Train on this subset, passing the checkpoint callback to fit()
    lora_model.fit(subset, callbacks=[cp_callback], epochs=3)

lora_model.save_weights('./drive/MyDrive/checkpoints/my_checkpoint')

## Model Test

In [8]:
preprocessor = keras_nlp.models.GPT2CausalLMPreprocessor.from_preset(
    GPT2_PRESET,
    sequence_length=MAX_SEQUENCE_LENGTH,
)
model_test = keras_nlp.models.GPT2CausalLM.from_preset(
    GPT2_PRESET,
    preprocessor=preprocessor,
)

Downloading from https://www.kaggle.com/api/v1/models/keras/gpt2/keras/gpt2_large_en/2/download/tokenizer.json...
100%|██████████| 448/448 [00:00<00:00, 1.04MB/s]
Downloading from https://www.kaggle.com/api/v1/models/keras/gpt2/keras/gpt2_large_en/2/download/assets/tokenizer/merges.txt...
100%|██████████| 446k/446k [00:00<00:00, 3.82MB/s]
Downloading from https://www.kaggle.com/api/v1/models/keras/gpt2/keras/gpt2_large_en/2/download/assets/tokenizer/vocabulary.json...
100%|██████████| 0.99M/0.99M [00:00<00:00, 6.89MB/s]
Downloading from https://www.kaggle.com/api/v1/models/keras/gpt2/keras/gpt2_large_en/2/download/config.json...
100%|██████████| 485/485 [00:00<00:00, 748kB/s]
Downloading from https://www.kaggle.com/api/v1/models/keras/gpt2/keras/gpt2_large_en/2/download/model.weights.h5...
100%|██████████| 2.88G/2.88G [01:12<00:00, 42.7MB/s]


Loading Latest Checkpointed Weights into the Test Model

In [None]:
import tensorflow as tf

latest = tf.train.latest_checkpoint(checkpoint_dir)
print(latest)  # a bare expression in the middle of a cell displays nothing

model_test.load_weights(latest)


In [None]:
lora_model.summary()

In [10]:
model_test.load_weights('./drive/MyDrive/checkpoints/my_checkpoint')



<tensorflow.python.checkpoint.checkpoint.CheckpointLoadStatus at 0x78861784b5b0>

Generating Text with the Test Model: Answering a Medical Question about Glaucoma Symptoms

In [14]:
generate_text(
    model_test, "The doctor's answer to the question 'What are the symptoms of Glaucoma ?' would be : ", max_length=MAX_GENERATION_LENGTH
)


Output:
The doctor's answer to the question 'What are the symptoms of Glaucoma ?' would be :  "Glaucoma is a condition in which the optic nerve is injured by a tumor or other obstruction. The symptoms of glaucoma may include: blurred vision, double vision, double vision, double vision, double vision, or other visual disturbances. The symptoms of glaucoma can be relieved by surgery, medication, or both." The doctor's answer to the question 'What are the symptoms of cataracts ?' would be :  "Cataracts are the result of an injury to the eye. Cataracts are usually caused by a tumor or other obstruction that has been causing damage to the retina. The symptoms of cataracts are: cloudy vision, double vision, blurred vision, double vision, double vision, or other visual disturbances. Cataracts can be treated by surgery, medication, or both." The doctor's answer to
Total Time Elapsed: 2.96s


In [23]:
generate_text(
    model_test, "The doctor's answer to the question 'what are the symptoms of asthma? ?' would be : ", max_length=MAX_GENERATION_LENGTH
)


Output:
The doctor's answer to the question 'what are the symptoms of asthma? ?' would be :                
The main symptoms of asthma, such as wheezing and chest pain, are usually mild. Other common symptoms include:               
Chest discomfort and tightness in the chest
                   
Nausea
             
Total Time Elapsed: 1.95s


In [None]:
generate_text(
    model_test, "The doctor's answer to the question 'what are the symptoms of Ebola ?' would be : ", max_length=MAX_GENERATION_LENGTH
)

In [13]:
generate_text(
    model_test, "The doctor's answer to the question 'what are the risks of obesity ?' would be : ", max_length=MAX_GENERATION_LENGTH
)


Output:
The doctor's answer to the question 'what are the risks of obesity ?' would be :  'The risk of developing diabetes, heart disease and cancer is high.'
The risk of developing obesity is high and there are many other factors that could be contributing to this.
The risk of obesity is high because it is associated with a high prevalence of obesity, and it is a major cause of chronic diseases. 
The risk of obesity is not just related to the number of pounds you have. It is related to the number of calories you consume, the amount you exercise, the amount of physical activity you do, the amount of fat you have on your body, and the amount of fat you burn when you eat.
The amount of fat you have on your body is also related to your BMI.
The risk of obesity is high because of the high prevalence of obesity, and it is a major cause of the chronic diseases that are prevalent in our society. The number of
Total Time Elapsed: 47.40s
