In this notebook, we pretrain a transformer on the SimpleBooks dataset using the causal language modelling objective. There are two common pretraining objectives for this kind of task: causal language modelling and masked language modelling. Causal language modelling is the autoregressive objective, in which the network learns to predict the next token given the previous tokens in the sequence. In masked language modelling, the objective is to predict a masked token given the context of the surrounding tokens.
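
To make the difference between the two objectives concrete, here is a minimal sketch in plain Python of how inputs and targets could be paired under each one. The token IDs, the helper names and the `mask_id` value are made up for illustration; real pipelines work on batches of tensors and mask positions at random.

```python
# Toy token IDs standing in for a tokenized sentence (values are arbitrary).
tokens = [11, 22, 33, 44, 55]


def causal_lm_pairs(ids):
    """Return (context, next_token) pairs: every prefix predicts the token after it."""
    return [(ids[:i], ids[i]) for i in range(1, len(ids))]


def masked_lm_example(ids, position, mask_id=0):
    """Replace one position with a hypothetical mask ID and return (inputs, target)."""
    inputs = list(ids)
    target = inputs[position]
    inputs[position] = mask_id
    return inputs, target


# Causal: the model only ever sees tokens to the left of the target.
for context, target in causal_lm_pairs(tokens):
    print(context, "->", target)

# Masked: the model sees tokens on both sides of the hidden position.
masked_inputs, masked_target = masked_lm_example(tokens, position=2)
print(masked_inputs, "->", masked_target)
```

In the causal case a single sequence of length n yields n - 1 training signals, which is why the whole sequence can be trained on in one forward pass.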

For a high-level overview of the text-generation task and the pretraining objectives, see [the introductory notebook](https://github.com/Akorex/Natural-Language-Processing/blob/main/Text%20Generation/introduction-to-text-generation.ipynb).

## Preparing the Environment

In [1]:
import os
import numpy as np
import tensorflow as tf
from tensorflow import keras
from datasets import load_dataset

# Train in mixed-precision float16
tf.keras.mixed_precision.set_global_policy("mixed_float16")

In [2]:
# set up multi-GPU/TPU use
try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver.connect()
    strategy = tf.distribute.TPUStrategy(tpu)
except ValueError:
    strategy = tf.distribute.MirroredStrategy()
print('Number of devices: {}'.format(strategy.num_replicas_in_sync))

Number of devices: 2


### Some Hyperparameters

In [3]:
# dataset
BATCH_SIZE = 64
BUFFER_SIZE = 256
SEQ_LEN = 128
MIN_TRAINING_SEQ_LEN = 450

## 1.0 The Dataset

The dataset we'll use is SimpleBooks. It consists of 1,573 Gutenberg books and has one of the smallest ratios of vocabulary size to word-level tokens. Its vocabulary of approximately 98k words is about a third the size of WikiText-103's, while the number of tokens is roughly the same (approximately 100M). This makes it easy to fit a small model.
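
As a rough back-of-the-envelope check, the ratio mentioned above can be computed from the two approximate figures quoted in the text. These are word-level numbers; the subword tokenizer used later in this notebook will produce different counts.

```python
# Figures quoted in the text above (both approximate).
vocab_size = 98_000        # distinct word-level tokens
num_tokens = 100_000_000   # total word-level tokens

# Vocabulary-to-token ratio: lower means each word is seen more often.
ratio = vocab_size / num_tokens
avg_occurrences = num_tokens / vocab_size

print(f"vocab/token ratio: {ratio:.5f}")
print(f"average occurrences per vocabulary entry: {avg_occurrences:.0f}")
```

On average each vocabulary entry appears roughly a thousand times, which is one way to read the claim that a small model can fit this data.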

We download the dataset with a Keras utility and load it into a HuggingFace `Datasets` object. This is the approach used in the rest of this notebook.

In [4]:
keras.utils.get_file(
    origin="https://dldata-public.s3.us-east-2.amazonaws.com/simplebooks.zip",
    extract=True,
)
# `data_dir` avoids shadowing the built-in `dir`
data_dir = os.path.expanduser("~/.keras/datasets/simplebooks/")

train_path = os.path.join(data_dir, "simplebooks-92-raw", "train.txt")
val_path = os.path.join(data_dir, "simplebooks-92-raw", "valid.txt")

Downloading data from https://dldata-public.s3.us-east-2.amazonaws.com/simplebooks.zip


In [5]:
raw_train_ds = load_dataset('text', data_files = train_path)

raw_val_ds = load_dataset('text', data_files = val_path)


In [6]:
print(raw_train_ds)
print("\n")
print(raw_val_ds)

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 3876796
    })
})


DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 13384
    })
})


For comparison, the function below uses Keras and `tf.data` to load the same dataset into TensorFlow dataset objects. It is not used in the rest of this notebook.

In [7]:
def load_dataset_tf():
    """Load SimpleBooks into batched `tf.data.Dataset` objects.

    This is a more suitable approach when using the TensorFlow/Keras libraries.
    Returns a `(raw_train_ds, raw_val_ds)` tuple.
    """
    # download and extract the archive
    keras.utils.get_file(
        origin="https://dldata-public.s3.us-east-2.amazonaws.com/simplebooks.zip",
        extract=True,
    )

    # set the paths
    data_dir = os.path.expanduser("~/.keras/datasets/simplebooks/")
    train_path = os.path.join(data_dir, "simplebooks-92-raw", "train.txt")
    val_path = os.path.join(data_dir, "simplebooks-92-raw", "valid.txt")

    # keep only lines longer than MIN_TRAINING_SEQ_LEN bytes, then batch;
    # shuffling after batching reorders whole batches, not individual lines
    raw_train_ds = (
        tf.data.TextLineDataset(train_path)
        .filter(lambda x: tf.strings.length(x) > MIN_TRAINING_SEQ_LEN)
        .batch(BATCH_SIZE)
        .shuffle(buffer_size=BUFFER_SIZE)
    )

    raw_val_ds = (
        tf.data.TextLineDataset(val_path)
        .filter(lambda x: tf.strings.length(x) > MIN_TRAINING_SEQ_LEN)
        .batch(BATCH_SIZE)
    )

    # preview one example from each split
    print(raw_train_ds.unbatch().batch(1).take(1).get_single_element())
    print(raw_val_ds.unbatch().batch(1).take(1).get_single_element())

    return raw_train_ds, raw_val_ds
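
The order of operations in the function above (filter, then batch, then shuffle) can be illustrated with a small plain-Python sketch. The helper `filter_batch_shuffle` is hypothetical and only mimics the behaviour on a list of strings: it counts characters rather than bytes and does a full shuffle instead of a buffered one.

```python
import random


def filter_batch_shuffle(lines, min_len, batch_size, seed=0):
    """Mimic filter -> batch -> shuffle on a plain list of strings."""
    # 1. keep only lines longer than `min_len` characters
    kept = [line for line in lines if len(line) > min_len]
    # 2. group consecutive lines into batches (the last one may be smaller)
    batches = [kept[i:i + batch_size] for i in range(0, len(kept), batch_size)]
    # 3. shuffle the order of the batches; lines inside a batch stay together
    random.Random(seed).shuffle(batches)
    return batches


# Toy "lines" of different lengths; only those longer than 4 characters survive.
lines = ["a" * n for n in (1, 5, 6, 7, 2, 8, 9, 10)]
batches = filter_batch_shuffle(lines, min_len=4, batch_size=2)
for batch in batches:
    print([len(line) for line in batch])
```

Because the shuffle comes after batching, neighbouring lines always end up in the same batch. Shuffling before batching would mix individual lines instead, at the cost of a larger shuffle buffer.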

## 2.0 Tokenization

In [8]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('gpt2')


In [9]:
# let's see some things about the tokenizer
print(f"Vocab Size: {tokenizer.vocab_size}")
print(f"Model Input names: {tokenizer.model_input_names}")
print(f"Special tokens: {tokenizer.special_tokens_map}")
print(f"Model max seq length: {tokenizer.model_max_length}")

Vocab Size: 50257
Model Input names: ['input_ids', 'attention_mask']
Special tokens: {'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>'}
Model max seq length: 1024


In [10]:
text = "This is a sample text"

tokenizer(text, add_special_tokens = False)

{'input_ids': [1212, 318, 257, 6291, 2420], 'attention_mask': [1, 1, 1, 1, 1]}

In [11]:
def tokenize(batch):
    return tokenizer(batch['text'], truncation = True, max_length = SEQ_LEN)

In [12]:
train_ds = raw_train_ds.map(tokenize, batched=True, remove_columns=raw_train_ds["train"].column_names)
val_ds = raw_val_ds.map(tokenize, batched=True, remove_columns=raw_val_ds["train"].column_names)


In [13]:
print(train_ds)
print("\n")
print(val_ds)

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 3876796
    })
})


DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 13384
    })
})


In [14]:
from transformers import DataCollatorForLanguageModeling

tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False, return_tensors="tf")
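
With `mlm=False`, the collator prepares batches for causal language modelling. To the best of our understanding, it pads each batch to a common length, copies `input_ids` into `labels`, and sets the labels at padded positions to -100 so that the loss ignores them; the one-position shift between inputs and targets is applied inside the model. The sketch below mimics that behaviour in plain Python with made-up token IDs and logits. The names `collate`, `causal_lm_loss` and `PAD_ID` are hypothetical, and this is not the library's implementation.

```python
import math

PAD_ID = 0           # hypothetical padding ID for this sketch
IGNORE_INDEX = -100  # label value skipped by the loss


def collate(sequences):
    """Pad to the longest sequence; labels copy inputs, with padding ignored."""
    max_len = max(len(s) for s in sequences)
    input_ids = [s + [PAD_ID] * (max_len - len(s)) for s in sequences]
    labels = [[t if t != PAD_ID else IGNORE_INDEX for t in row] for row in input_ids]
    return input_ids, labels


def causal_lm_loss(logits, labels):
    """Mean cross-entropy where position i predicts the label at position i + 1."""
    total, count = 0.0, 0
    for row_logits, row_labels in zip(logits, labels):
        # drop the last logit and the first label so that they line up
        for scores, target in zip(row_logits[:-1], row_labels[1:]):
            if target == IGNORE_INDEX:
                continue
            log_norm = math.log(sum(math.exp(s) for s in scores))
            total += log_norm - scores[target]
            count += 1
    return total / count


input_ids, labels = collate([[3, 1, 2], [2, 3]])
print(input_ids)
print(labels)

# Uniform logits over a 4-token vocabulary give a loss of log(4).
vocab_size = 4
logits = [[[0.0] * vocab_size for _ in row] for row in input_ids]
loss = causal_lm_loss(logits, labels)
print(loss)
```

Note that this notebook sets the pad token to the end-of-text token, so under the same rule any end-of-text token in the inputs would also be excluded from the loss.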

## 3.0 Training the Model From Scratch

We train with the causal language modelling objective. We start from the GPT-2 configuration and use it to initialise a model with random weights.

In [15]:
from transformers import AutoConfig, TFGPT2LMHeadModel

config = AutoConfig.from_pretrained(
    "gpt2",
    vocab_size=len(tokenizer),
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    n_ctx=SEQ_LEN,
)

In [16]:
config

GPT2Config {
  "_name_or_path": "gpt2",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 128,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "transformers_version": "4.27.4",
  "use_cache": true,
  "vocab_size": 50257
}

In [17]:
with strategy.scope():
    model = TFGPT2LMHeadModel(config)
    model(model.dummy_inputs)  # Builds the model

model.summary()

Model: "tfgpt2lm_head_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 transformer (TFGPT2MainLaye  multiple                 124439808 
 r)                                                              
                                                                 
Total params: 124,439,808
Trainable params: 124,439,808
Non-trainable params: 0
_________________________________________________________________


In [18]:
with strategy.scope():
    train_dataset = model.prepare_tf_dataset(
        train_ds["train"],
        collate_fn=data_collator,
        shuffle=True,
        batch_size=32,
    )

    eval_dataset = model.prepare_tf_dataset(
        val_ds["train"],
        collate_fn=data_collator,
        shuffle=False,
        batch_size=32,
    )

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


In [19]:
save_path = 'transformer/'
callbacks = tf.keras.callbacks.ModelCheckpoint(save_path, save_best_only = False, save_freq = 5000)

In [20]:
from transformers import create_optimizer

with strategy.scope():
    num_train_steps = len(train_dataset)
    optimizer, schedule = create_optimizer(
        init_lr=5e-5,
        num_warmup_steps=1_000,
        num_train_steps=num_train_steps,
        weight_decay_rate=0.01
    )
    
    model.compile(optimizer=optimizer)


No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.


In [21]:
# the datasets are already batched by prepare_tf_dataset, so no batch_size is passed here
model.fit(train_dataset, validation_data=eval_dataset, callbacks=[callbacks])

  7170/121149 [>.............................] - ETA: 13:01:39 - loss: 5.6499

KeyboardInterrupt: 
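
The step count in the progress bar follows from the dataset size and the batch size passed to `prepare_tf_dataset`. The check below reproduces it from the numbers printed earlier in this notebook, assuming the final incomplete batch of the training set is dropped (which is what the reported 121,149 steps suggest).

```python
num_rows = 3_876_796     # training rows reported by the dataset printout
batch_size = 32          # batch size passed to prepare_tf_dataset
warmup_steps = 1_000     # num_warmup_steps passed to create_optimizer
steps_done = 7_170       # steps completed before the interrupt

# Integer division assumes the last partial batch is dropped.
steps_per_epoch = num_rows // batch_size
print(steps_per_epoch)

print(f"warmup share of the epoch: {warmup_steps / steps_per_epoch:.2%}")
print(f"epoch completed at interrupt: {steps_done / steps_per_epoch:.2%}")
print(f"sequences seen at interrupt: {steps_done * batch_size:,}")
```

The run was therefore stopped after only a small fraction of one epoch, which is worth keeping in mind when reading the generated samples later in this notebook.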

In [22]:
# save the config file

config.to_json_file('transformer/config.json')

### Using the trained model

In [23]:
from transformers import pipeline

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, device=0)

In [24]:
text = "Transformers are the most"
print(pipe(text, num_return_sequences=1)[0]["generated_text"])

  "You have modified the pretrained model configuration to control generation. This is a"
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Transformers are the most little answer. In this moment in the end had been heard of the first of the evening. There was a beautiful little young boy. "Oh, my child, so, we can't do your little child." "I


In [25]:
text = "I believe in the power of "
print(pipe(text, num_return_sequences=1)[0]["generated_text"])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


I believe in the power of  There was an end of the top of the ship, and, with the same great in the dark, which was a part of the wood, the three, they had gone in a tree without work, for I


## Resources

1. [HuggingFace Load Text](https://huggingface.co/docs/datasets/nlp_load)
2. [HuggingFace Load Datasets](https://huggingface.co/docs/datasets/dataset_script)
3. [KerasNLP Word Generation](https://keras.io/examples/nlp/text_generation_gpt/)
4. [Using Multiple GPUs](https://www.kaggle.com/code/gusthema/multigpu-with-tensorflow-and-keras)
5. [HuggingFace course](https://huggingface.co/course/chapter7/6?fw=tf)