<a href="https://colab.research.google.com/github/Itskedarjasud/Data-Dreamers-5201/blob/main/Gpt2_text_generation_with_kerasnlp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# GPT2 Text Generation with KerasNLP

**Author:** Chen Qian<br>
**Date created:** 2024/04/1<br>
**Last modified:** 2024/06/21<br>
**Description:** Use KerasNLP GPT2 model and `samplers` to do text generation.

In this tutorial, you will learn to use [KerasNLP](https://keras.io/keras_nlp/) to load a
pre-trained Large Language Model (LLM) - [GPT-2 model](https://openai.com/research/better-language-models)
(originally invented by OpenAI), finetune it to a specific text style, and
generate text based on users' input (also known as prompt). You will also learn
how GPT2 adapts quickly to non-English languages, such as Chinese.

##  Before we begin

Colab offers different kinds of runtimes. Make sure to go to **Runtime ->
Change runtime type** and choose the GPU Hardware Accelerator runtime
(which should have >12G host RAM and ~15G GPU RAM) since you will finetune the
GPT-2 model. Running this tutorial on CPU runtime will take hours.

## Install KerasNLP, Choose Backend and Import Dependencies

This example uses [Keras 3](https://keras.io/keras_3/) to work in any of
`"tensorflow"`, `"jax"` or `"torch"`. Support for Keras 3 is baked into
KerasNLP, so you only need to change the `"KERAS_BACKEND"` environment variable
to select the backend of your choice. We select the JAX backend below.

In [17]:
!pip install git+https://github.com/keras-team/keras-nlp.git -q

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


In [18]:
import os

os.environ["KERAS_BACKEND"] = "jax"  # or "tensorflow" or "torch"

import keras_nlp
import keras
import tensorflow as tf
import time

keras.mixed_precision.set_global_policy("mixed_float16")

## Introduction to Generative Large Language Models (LLMs)

Large language models (LLMs) are a type of machine learning model that is
trained on a large corpus of text data to generate outputs for various natural
language processing (NLP) tasks, such as text generation, question answering,
and machine translation.

Generative LLMs are typically based on deep learning neural networks, such as
the [Transformer architecture](https://arxiv.org/abs/1706.03762) invented by
Google researchers in 2017, and are trained on massive amounts of text data,
often involving billions of words. These models, such as Google [LaMDA](https://blog.google/technology/ai/lamda/)
and [PaLM](https://ai.googleblog.com/2022/04/pathways-language-model-palm-scaling-to.html),
are trained with a large dataset from various data sources which allows them to
generate output for many tasks. The core of Generative LLMs is predicting the
next word in a sentence, often referred to as **Causal LM Pretraining**. In this
way LLMs can generate coherent text based on user prompts. For a more
pedagogical discussion on language models, you can refer to the
[Stanford CS324 LLM class](https://stanford-cs324.github.io/winter2022/lectures/introduction/).
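
To make next-word prediction concrete, here is a minimal sketch that uses a toy bigram count table in place of a neural network. The corpus and the helper names (`train_bigram`, `generate`) are made up for illustration; a real LLM learns far richer statistics, but the generation loop has the same shape: predict one token, append it, repeat.

```python
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def generate(counts, prompt, max_length):
    """Repeatedly append the most frequent follower of the last token."""
    output = list(prompt)
    while len(output) < max_length:
        followers = counts.get(output[-1])
        if not followers:
            break  # no known continuation for this token
        output.append(followers.most_common(1)[0][0])
    return output


# Hypothetical toy corpus, already split into word tokens.
corpus = "the cat sat on the mat . the cat sat on the rug .".split()
model = train_bigram(corpus)
print(generate(model, ["the"], max_length=5))
# prints ['the', 'cat', 'sat', 'on', 'the']
```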

## Introduction to KerasNLP

Large Language Models are complex to build and expensive to train from scratch.
Luckily there are pretrained LLMs available for use right away. [KerasNLP](https://keras.io/keras_nlp/)
provides a large number of pre-trained checkpoints that allow you to experiment
with SOTA models without needing to train them yourself.

KerasNLP is a natural language processing library that supports users through
their entire development cycle. KerasNLP offers both pretrained models and
modularized building blocks, so developers can easily reuse pretrained models
or stack their own LLM.

In a nutshell, for generative LLMs, KerasNLP offers:

- Pretrained models with a `generate()` method, e.g.,
    `keras_nlp.models.GPT2CausalLM` and `keras_nlp.models.OPTCausalLM`.
- `Sampler` classes that implement generation algorithms such as Top-K, beam and
    contrastive search. These samplers can be used to generate text with
    custom models.

## Load a pre-trained GPT-2 model and generate some text

KerasNLP provides a number of pre-trained models, such as [Google
BERT](https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html)
and [GPT-2](https://openai.com/research/better-language-models). You can see
the list of models available in the [KerasNLP repository](https://github.com/keras-team/keras-nlp/tree/master/keras_nlp/models).

It's very easy to load the GPT-2 model as you can see below:

In [19]:
# To speed up training and generation, we use preprocessor of length 128
# instead of full length 1024.
preprocessor = keras_nlp.models.GPT2CausalLMPreprocessor.from_preset(
    "gpt2_base_en",
    sequence_length=128,
)
gpt2_lm = keras_nlp.models.GPT2CausalLM.from_preset(
    "gpt2_base_en", preprocessor=preprocessor
)

Once the model is loaded, you can use it to generate some text right away. Run
the cells below to give it a try. It's as simple as calling a single function
*generate()*:

In [20]:
start = time.time()

output = gpt2_lm.generate("My trip to Yosemite was", max_length=200)
print("\nGPT-2 output:")
print(output)

end = time.time()
print(f"TOTAL TIME ELAPSED: {end - start:.2f}s")


GPT-2 output:
My trip to Yosemite was one of my favorite things in my adult life. It was a day I spent in a beautiful valley, surrounded by beautiful mountains that made me want to explore.

I was in Yosemite at the time, but I was not alone in my passion for exploring the mountains. There are many places that offer a unique experience, and I love the feeling of being able to explore those.

It's a good feeling, but it can also feel like a little bit of a distraction. There is a reason why I love Yosemite.

The first time I visited Yosemite, I was in my 20s, which was a pretty big change for me. I wasn't a teenager, but I was still in high school, so I was a little bit nervous.

When I was in college, I was really into hiking and exploring the mountains. The only way I could really enjoy the view was to hike in the back of the car. I loved it
TOTAL TIME ELAPSED: 12.52s


Try another one:

In [21]:
start = time.time()

output = gpt2_lm.generate("That Italian restaurant is", max_length=200)
print("\nGPT-2 output:")
print(output)

end = time.time()
print(f"TOTAL TIME ELAPSED: {end - start:.2f}s")


GPT-2 output:
That Italian restaurant is called "The Italian Grill" is located on the corner of South and East Street in downtown San Francisco, where you'll find a lot of Italian food, a lot of wine, and some great food. It has a large selection of Italian dishes and some of them are quite good.

You can see a video of this Italian restaurant below.
TOTAL TIME ELAPSED: 2.64s


Notice how much faster the second call is. This is because the computational
graph is [XLA compiled](https://www.tensorflow.org/xla) in the 1st run and
re-used in the 2nd behind the scenes.

The quality of the generated text looks OK, but we can improve it via
fine-tuning.

## More on the GPT-2 model from KerasNLP

Next up, we will actually fine-tune the model to update its parameters, but
before we do, let's take a look at the full set of tools we have for working
with GPT2.

The code of GPT2 can be found
[here](https://github.com/keras-team/keras-nlp/blob/master/keras_nlp/models/gpt2/).
Conceptually the `GPT2CausalLM` can be hierarchically broken down into several
modules in KerasNLP, all of which have a *from_preset()* function that loads a
pretrained model:

- `keras_nlp.models.GPT2Tokenizer`: the tokenizer used by the GPT2 model, which is a
    [byte-pair encoder](https://huggingface.co/course/chapter6/5?fw=pt).
- `keras_nlp.models.GPT2CausalLMPreprocessor`: the preprocessor used for GPT2
    causal LM training. It tokenizes the text and performs other preprocessing
    steps, such as creating the labels and appending the end token.
- `keras_nlp.models.GPT2Backbone`: the GPT2 model, which is a stack of
    `keras_nlp.layers.TransformerDecoder` layers. This is usually just referred
    to as `GPT2`.
- `keras_nlp.models.GPT2CausalLM`: wraps `GPT2Backbone` and multiplies the
    output of `GPT2Backbone` by the embedding matrix to generate logits over
    vocab tokens.
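
The two less obvious steps in this breakdown, label creation and the logit projection, can be sketched in plain Python. This is a simplified illustration, not the KerasNLP implementation: the function names, the token ids, and the tiny embedding matrix are all hypothetical, and the real preprocessor works on batched tensors.

```python
def make_causal_lm_example(token_ids, sequence_length, start_id, end_id, pad_id=0):
    """Return (inputs, labels, sample_weight) for next-token training."""
    # Add start/end markers, then truncate and pad to sequence_length + 1.
    ids = [start_id] + list(token_ids) + [end_id]
    ids = ids[: sequence_length + 1]
    mask = [1] * len(ids)
    pad = sequence_length + 1 - len(ids)
    ids += [pad_id] * pad
    mask += [0] * pad
    # Labels are the inputs shifted left by one position.
    inputs, labels = ids[:-1], ids[1:]
    # Padded label positions get weight 0 so they do not affect the loss.
    sample_weight = mask[1:]
    return inputs, labels, sample_weight


def tied_logits(hidden, embedding_matrix):
    """Score each vocab token as the dot product of the hidden state and its embedding."""
    return [sum(h * e for h, e in zip(hidden, row)) for row in embedding_matrix]


inputs, labels, weights = make_causal_lm_example(
    [11, 12, 13], sequence_length=6, start_id=1, end_id=2
)
print(inputs)   # [1, 11, 12, 13, 2, 0]
print(labels)   # [11, 12, 13, 2, 0, 0]
print(weights)  # [1, 1, 1, 1, 0, 0]

# Hypothetical 3-token vocabulary with 2-dimensional embeddings.
logits = tied_logits([1.0, 2.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(logits)   # [1.0, 2.0, 3.0]
```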

## Finetune on imdb_reviews dataset

Now that you know the GPT-2 model from KerasNLP, you can go one step further
and finetune the model so that it generates text in a specific style, short or
long, strict or casual. In this tutorial, we will use the imdb_reviews dataset
as an example.

Let's take a look at sample data from the imdb_reviews TensorFlow Dataset. There
are two features:

- **__text__**: the text of the movie review.
- **__label__**: the sentiment label (0 for negative, 1 for positive).

In [22]:
import tensorflow_datasets as tfds

# Load the IMDB dataset
imdb_ds, info = tfds.load("imdb_reviews", with_info=True, as_supervised=True)
train_ds, test_ds = imdb_ds['train'], imdb_ds['test']


Preprocess the Dataset:
You need to preprocess the text data: tokenize each review into integer ids, then pad or truncate every sequence to the same fixed length.

In [23]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Tokenizer parameters
vocab_size = 10000
oov_token = '<OOV>'
max_length = 256
trunc_type = 'post'
padding_type = 'post'

# Initialize the tokenizer
tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_token)

# Extract texts and labels
train_texts = []
train_labels = []
test_texts = []
test_labels = []

for text, label in tfds.as_numpy(train_ds):
    train_texts.append(text.decode('utf-8'))
    train_labels.append(label)

for text, label in tfds.as_numpy(test_ds):
    test_texts.append(text.decode('utf-8'))
    test_labels.append(label)

# Fit the tokenizer on training texts
tokenizer.fit_on_texts(train_texts)

# Convert texts to sequences
train_sequences = tokenizer.texts_to_sequences(train_texts)
test_sequences = tokenizer.texts_to_sequences(test_texts)

# Pad the sequences
train_padded = pad_sequences(train_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)
test_padded = pad_sequences(test_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)
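
To see what `padding='post'` and `truncating='post'` mean for a single sequence, here is a small pure-Python sketch. The helper `pad_or_truncate_post` is hypothetical and only mirrors the behavior for one list of ids; `pad_sequences` itself operates on the whole batch and returns a NumPy array.

```python
def pad_or_truncate_post(sequence, maxlen, pad_value=0):
    """Keep the first maxlen ids, then right-pad with pad_value up to maxlen."""
    trimmed = list(sequence[:maxlen])
    return trimmed + [pad_value] * (maxlen - len(trimmed))


print(pad_or_truncate_post([5, 6, 7], maxlen=5))           # [5, 6, 7, 0, 0]
print(pad_or_truncate_post([1, 2, 3, 4, 5, 6], maxlen=4))  # [1, 2, 3, 4]
```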


Create TensorFlow Datasets from the Preprocessed Data

In [24]:
train_dataset = tf.data.Dataset.from_tensor_slices((train_padded, train_labels))
test_dataset = tf.data.Dataset.from_tensor_slices((test_padded, test_labels))

# Shuffle and batch the dataset
batch_size = 32
train_dataset = train_dataset.shuffle(buffer_size=10000).batch(batch_size, drop_remainder=True)
test_dataset = test_dataset.batch(batch_size, drop_remainder=True)
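
As a quick sanity check on the batching above, the number of steps per epoch follows from the split size and `drop_remainder=True`. The sketch below assumes the standard imdb_reviews split of 25,000 training examples; the variable names are illustrative only.

```python
num_examples = 25_000  # assumed size of the imdb_reviews train split
batch_size = 32

# With drop_remainder=True, the final partial batch is discarded.
steps_per_epoch = num_examples // batch_size
dropped_examples = num_examples % batch_size

print(steps_per_epoch)   # 781
print(dropped_examples)  # 8
```

The result agrees with the `781/781` step counter in the training log.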


Build and Train a Keras Model:
Here's an example of a simple sentiment analysis model using an embedding layer, LSTM, and a dense output layer.

In [25]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

# Model parameters
embedding_dim = 64
lstm_units = 64

# Build the model: embedding -> LSTM -> sigmoid output for binary sentiment
model = Sequential([
    Embedding(vocab_size, embedding_dim),
    LSTM(lstm_units),
    Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(train_dataset, validation_data=test_dataset, epochs=10)


Epoch 1/10




[1m781/781[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m48s[0m 60ms/step - accuracy: 0.5030 - loss: nan - val_accuracy: 0.5030 - val_loss: nan
Epoch 2/10
[1m781/781[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m55s[0m 70ms/step - accuracy: 0.5053 - loss: nan - val_accuracy: 0.5030 - val_loss: nan
Epoch 3/10
[1m781/781[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m43s[0m 55ms/step - accuracy: 0.5025 - loss: nan - val_accuracy: 0.5001 - val_loss: nan
Epoch 4/10
[1m781/781[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m55s[0m 70ms/step - accuracy: 0.5016 - loss: nan - val_accuracy: 0.5001 - val_loss: nan
Epoch 5/10
[1m781/781[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m36s[0m 46ms/step - accuracy: 0.5040 - loss: nan - val_accuracy: 0.5001 - val_loss: nan
Epoch 6/10
[1m781/781[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m43s[0m 55ms/step - accuracy: 0.4982 - loss: nan - val_accuracy: 0.5001 - val_loss: nan
Epoch 7/10
[1m781/781[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[

<keras.src.callbacks.history.History at 0x7e1383f29450>

Note that the model trained above is a separate LSTM sentiment classifier, and its log shows a `nan` loss with accuracy near 0.50, so it did not learn in this run. The weights of `gpt2_lm` were not updated by this step. The `generate()` calls below therefore still sample from the pretrained GPT-2 model, this time with movie-review style prompts.

In [26]:
# `gpt2_lm.generate()` takes raw strings directly; the preprocessor attached to
# the model handles tokenization and detokenization.

prompt1 = "The movie was fantastic because"
generated_text1 = gpt2_lm.generate(prompt1, max_length=100)
print("Generated text 1:")
print(generated_text1)

# Generate another text with the same or a different prompt
prompt2 = "The storyline was gripping and"
generated_text2 = gpt2_lm.generate(prompt2, max_length=100)
print("Generated text 2:")
print(generated_text2)


Generated text 1:
The movie was fantastic because it was about two guys in their 20s, and they were trying to make a movie about them. They were trying to get a movie about the guys who were doing it for the first time in a long time and they didn't know what to expect. And I was thinking, this movie is so funny and so funny. And it was a really good movie. And it's funny to think that this is a movie about two guys who are really good at it
Generated text 2:
The storyline was gripping and the characters were all over the map. But the main character was not. It was not just the character who was being portrayed.

I think it is because of this story that I feel like it's been a lot of work. I don't think that the story has gotten better in the last year. The story is still going on. The characters have been getting better, but the story is still going on. I don't know how many of the characters


## Into the Sampling Method

In KerasNLP, we offer a few sampling methods, e.g., contrastive search,
Top-K and beam sampling. By default, our `GPT2CausalLM` uses Top-K search, but
you can choose your own sampling method.

Much like optimizers and activations, there are two ways to specify your custom
sampler:

- Use a string identifier, such as "greedy". This uses the sampler's default
configuration.
- Pass a `keras_nlp.samplers.Sampler` instance. This lets you use a custom
configuration.

In [27]:
# Use a string identifier.
gpt2_lm.compile(sampler="top_k")
output = gpt2_lm.generate("I like basketball", max_length=200)
print("\nGPT-2 output:")
print(output)

# Use a `Sampler` instance. `GreedySampler` tends to repeat itself.
greedy_sampler = keras_nlp.samplers.GreedySampler()
gpt2_lm.compile(sampler=greedy_sampler)

output = gpt2_lm.generate("I like basketball", max_length=200)
print("\nGPT-2 output:")
print(output)


GPT-2 output:
I like basketball, and I'm a fan of the game. But I don't know how to play it, because it doesn't work for me.

The only time I can play basketball in a big city in the U.S., I'm on the court for about three hours. I'm not playing in front of the media, but I do watch games. I'm not going out in front of the media. I'm playing for my team. It's a big deal.

I like the way the team plays. They are very smart. They have good players. They are good on both ends of the floor. They are smart. They know how to defend and how to get the shot.

They are very smart. I like how they play the game. I like the way they play the game.

They have good defense. They have good shooters, and they have good shooters.

It's a big difference from what you saw

GPT-2 output:
I like basketball. I like to play it. I like to watch it. I like to watch it. I like to watch it. I like to watch it. I like to watch it. I like to watch it. I like to watch it. I like to watch it. I like to watch it. I 

For more details on KerasNLP `Sampler` class, you can check the code
[here](https://github.com/keras-team/keras-nlp/tree/master/keras_nlp/samplers).
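
The difference between the two outputs above comes down to how the next token is picked at each step. The following is a minimal pure-Python sketch of the two strategies, not the KerasNLP code: the logits are invented for a 4-token vocabulary, and `greedy_pick` and `top_k_pick` are hypothetical helper names.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(logits):
    """Always return the index of the largest logit (deterministic)."""
    return max(range(len(logits)), key=lambda i: logits[i])


def top_k_pick(logits, k, rng):
    """Sample an index from the k highest-scoring tokens only."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    probs = softmax([logits[i] for i in top])
    return rng.choices(top, weights=probs, k=1)[0]


logits = [2.0, 1.5, 0.3, -1.0]  # hypothetical scores for a 4-token vocabulary
rng = random.Random(0)

greedy_samples = {greedy_pick(logits) for _ in range(20)}
top_k_samples = {top_k_pick(logits, k=2, rng=rng) for _ in range(200)}

print(greedy_samples)  # greedy always picks token 0
print(top_k_samples)   # top-k only ever picks from tokens 0 and 1
```

Because greedy decoding always takes the same choice in the same context, it can fall into loops like the repeated "I like to watch it." above, while the randomness in Top-K lets generation move on to a different continuation.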

## Finetune on Chinese Poem Dataset

We can also finetune GPT2 on non-English datasets. For readers who know Chinese,
this part illustrates how to fine-tune GPT2 on a Chinese poem dataset to teach
our model to become a poet!

Because GPT2 uses a byte-level byte-pair encoder, and the original pretraining
dataset contains some Chinese characters, we can use the original vocab to
finetune on a Chinese dataset.
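
The reason a byte-level vocabulary can cover Chinese text is that every string reduces to a sequence of UTF-8 bytes, each with a value below 256. The short sketch below shows only that byte view; it does not run the GPT2 tokenizer, and the number of tokens actually produced depends on the learned merges.

```python
text = "昨夜雨"
raw = text.encode("utf-8")

print(len(text))  # 3 characters
print(len(raw))   # 9 bytes: each of these characters takes 3 bytes in UTF-8
print(list(raw[:3]))  # the byte values that make up the first character

# Every byte is in the range 0..255, so a vocabulary containing all 256
# byte symbols can represent this text without an out-of-vocabulary token.
all_bytes_in_range = all(0 <= b < 256 for b in raw)
round_trip = raw.decode("utf-8")
```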

In [28]:
# Load the Chinese poetry dataset.
!git clone https://github.com/chinese-poetry/chinese-poetry.git

fatal: destination path 'chinese-poetry' already exists and is not an empty directory.


Load text from the JSON files. We only use《全唐诗》(Complete Tang Poems) for demo purposes.

In [29]:
import os
import json

poem_collection = []
for file in os.listdir("chinese-poetry/全唐诗"):
    if ".json" not in file or "poet" not in file:
        continue
    full_filename = "%s/%s" % ("chinese-poetry/全唐诗", file)
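    # Read as UTF-8 explicitly so the Chinese text decodes the same on every platform.
    with open(full_filename, "r", encoding="utf-8") as f: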
        content = json.load(f)
        poem_collection.extend(content)

paragraphs = ["".join(data["paragraphs"]) for data in poem_collection]

Let's take a look at sample data.

In [30]:
print(paragraphs[0])

空谷跫然聞足音，殷勤通守致家禽。陰晴不爽司晨□，□暮全無起舞心。夢短那堪眠警枕，脛長稍已怯□衾。可憐一念尤迂闊，朗誦西山夜氣箴。


As with the IMDB example, we convert the text to a TF dataset, and only use part
of the data to train.

In [31]:
train_ds = (
    tf.data.Dataset.from_tensor_slices(paragraphs)
    .batch(16)
    .cache()
    .prefetch(tf.data.AUTOTUNE)
)

# Running through the whole dataset takes a long time, so only take 500 batches
# and run 1 epoch for demo purposes.
train_ds = train_ds.take(500)
num_epochs = 1

learning_rate = keras.optimizers.schedules.PolynomialDecay(
    5e-4,
    decay_steps=train_ds.cardinality() * num_epochs,
    end_learning_rate=0.0,
)
loss = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
gpt2_lm.compile(
    optimizer=keras.optimizers.Adam(learning_rate),
    loss=loss,
    weighted_metrics=["accuracy"],
)

gpt2_lm.fit(train_ds, epochs=num_epochs)

500/500 ━━━━━━━━━━━━━━━━━━━━ 103s 175ms/step - accuracy: 0.2498 - loss: 2.6126


<keras.src.callbacks.history.History at 0x7e1384b79d20>
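
For intuition about the learning rate schedule used in this run, here is a pure-Python sketch of polynomial decay. It assumes the schedule's default `power` of 1.0, which makes the decay linear, and the function name `polynomial_decay` is only illustrative; check the Keras documentation for the exact behavior of `PolynomialDecay`.

```python
def polynomial_decay(step, initial_lr, decay_steps, end_lr=0.0, power=1.0):
    """Decay from initial_lr to end_lr over decay_steps, then hold at end_lr."""
    step = min(step, decay_steps)
    fraction = 1.0 - step / decay_steps
    return (initial_lr - end_lr) * fraction ** power + end_lr


# Same settings as above: 5e-4 decaying to 0 over 500 steps (500 batches x 1 epoch).
for step in (0, 250, 500):
    print(step, polynomial_decay(step, initial_lr=5e-4, decay_steps=500))
```

Under these assumptions the learning rate is halved by step 250 and reaches zero at the last training step.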

Let's check the result!

In [32]:
output = gpt2_lm.generate("昨夜雨疏风骤", max_length=200)
print(output)

昨夜雨疏风骤，終翰知山落至。知若面細自絮，翻曲場游綠。非風落景石，詩塞紫標馳。清須香頻綠，


Not bad 😀