In [2]:
from datasets import load_dataset

# Load only the first 1% of the training split to keep the run small
dataset = load_dataset("textminr/simplebooks", split="train[:1%]")
texts = dataset["text"]  # raw text column


The repository for textminr/simplebooks contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/textminr/simplebooks.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N]  y



## Text Preprocessing

In [3]:
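from transformers import GPT2Tokenizer, TFGPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("distilgpt2")

# GPT-2 has no padding token, so reuse the end-of-sequence token for padding
tokenizer.pad_token = tokenizer.eos_token

# Helper that tokenizes to a fixed length of 512 (kept for reference; not used below)
def tokenize_function(examples):
    return tokenizer(examples, padding="max_length", truncation=True, max_length=512)

# Tokenize all texts at once, padding to the longest sequence (capped at 512 tokens)
inputs = tokenizer(texts, return_tensors="tf", padding=True, truncation=True, max_length=512)

# For causal language modeling, the labels are the input IDs themselves
labels = inputs["input_ids"]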

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.
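
Because `pad_token` is set to `eos_token`, the padded positions in `labels` hold real EOS IDs (50256) and therefore count toward the training loss. A common convention in the Hugging Face loss functions is to skip label positions set to `-100`. The sketch below illustrates the idea on plain Python lists. `mask_padding_labels` is a hypothetical helper that is not part of this notebook, and the non-EOS token IDs are made up.

```python
IGNORE_INDEX = -100  # label value that the loss is assumed to skip


def mask_padding_labels(input_ids, attention_mask, ignore_index=IGNORE_INDEX):
    """Return a copy of input_ids with padded positions replaced by ignore_index."""
    return [
        [tok if keep == 1 else ignore_index for tok, keep in zip(ids, mask)]
        for ids, mask in zip(input_ids, attention_mask)
    ]


# Toy batch: the second row is right-padded with the EOS id (50256).
batch_ids = [[10, 11, 12, 13], [20, 21, 50256, 50256]]
batch_mask = [[1, 1, 1, 1], [1, 1, 0, 0]]

masked = mask_padding_labels(batch_ids, batch_mask)
print(masked)
```

With the TensorFlow tensors used in this notebook, the same effect could probably be achieved by selecting between `inputs["input_ids"]` and `-100` according to `inputs["attention_mask"]`. That variant is untested here.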



## Model Training

In [7]:
import os
import tensorflow as tf
from tensorflow.keras.optimizers import Adam

model = TFGPT2LMHeadModel.from_pretrained("distilgpt2")

optimizer = Adam(learning_rate=5e-5)

# One optimization step on a single batch
@tf.function
def train_step(input_ids, labels):
    with tf.GradientTape() as tape:
        # Forward pass: passing labels makes the model return the language-modeling loss
        outputs = model(input_ids, labels=labels)
        loss = outputs.loss

    # Backward pass: compute gradients and update the weights
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))

    return loss


# Pair each input sequence with its labels and split into small batches
batch_size = 2
dataset = tf.data.Dataset.from_tensor_slices((inputs["input_ids"], labels)).batch(batch_size)

All PyTorch model weights were used when initializing TFGPT2LMHeadModel.

All the weights of TFGPT2LMHeadModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.
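
When `labels` is passed to the model, the returned loss is a next-token cross-entropy: the logits at position `t` are scored against the label at position `t + 1`. The plain-Python sketch below shows that shift for a single sequence. It assumes the `-100` ignore convention and a simple mean over the remaining positions; the library's exact reduction may differ between versions. `causal_lm_loss` and the toy numbers are illustrative only.

```python
import math


def causal_lm_loss(logits, labels, ignore_index=-100):
    """Mean next-token cross-entropy for one sequence.

    logits: list of T score lists, each of vocabulary size V
    labels: list of T token ids; positions equal to ignore_index are skipped
    """
    total, count = 0.0, 0
    # Position t predicts the token at position t + 1, hence the shift.
    for scores, target in zip(logits[:-1], labels[1:]):
        if target == ignore_index:
            continue
        # Numerically stable log-sum-exp over the vocabulary.
        peak = max(scores)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_norm - scores[target]
        count += 1
    return total / count if count else 0.0


# Toy example: vocabulary of 3 tokens, sequence of 3 positions.
toy_logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
toy_labels = [0, 0, 1]
print(causal_lm_loss(toy_logits, toy_labels))
```

Note that the last position's logits are never scored, because there is no following token to predict.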


In [23]:
epochs = 5
for epoch in range(epochs):
    total_loss = 0  # running sum of batch losses for this epoch
    print(f"Epoch {epoch+1}/{epochs}", flush=True)
    for batch in dataset:
        input_ids, labels = batch
        loss = train_step(input_ids, labels)

        total_loss += loss.numpy()

# Save the fine-tuned model and tokenizer to the same directory
model.save_pretrained("distilgpt2_finetuned_simplebooks")
tokenizer.save_pretrained("distilgpt2_finetuned_simplebooks")

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


('distilgpt2_finetuned_simplebooks/tokenizer_config.json',
 'distilgpt2_finetuned_simplebooks/special_tokens_map.json',
 'distilgpt2_finetuned_simplebooks/vocab.json',
 'distilgpt2_finetuned_simplebooks/merges.txt',
 'distilgpt2_finetuned_simplebooks/added_tokens.json')
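
The training loop accumulates `total_loss` but only prints the epoch counter, so the output gives no indication of whether the loss is going down. A minimal sketch of reporting a per-epoch mean is shown below. `mean_epoch_loss` is a hypothetical helper and the loss values are made up; in the loop above they would be the per-batch values returned by `train_step`.

```python
def mean_epoch_loss(batch_losses):
    """Average a sequence of per-batch loss values; returns None when empty."""
    losses = [float(x) for x in batch_losses]
    return sum(losses) / len(losses) if losses else None


# Made-up per-batch losses for two epochs, for illustration only.
history = {1: [3.2, 3.0, 2.8], 2: [2.6, 2.5, 2.4]}
for epoch, losses in history.items():
    print(f"Epoch {epoch}: mean loss {mean_epoch_loss(losses):.4f}")
```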

## Model Inference

In [44]:
input_text = "Once upon a time"
inputs = tokenizer(input_text, return_tensors="tf")

generated_ids = model.generate(inputs['input_ids'], max_length=200, num_return_sequences=1)

generated_text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)


print(generated_text)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Once upon a time a king gave a holiday to all the people in one of his cities .


In [51]:
input_text = "There is a"
inputs = tokenizer(input_text, return_tensors="tf")

generated_ids = model.generate(inputs['input_ids'], max_length=200, num_return_sequences=1)

generated_text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)


print(generated_text)


There is a little trick here . The trick was finding the Striped Cat . He took a big bag of toys , and put down the Cat . He took a big bag of toys , and put it on , tying it to the tree .


In [58]:
input_text = "There were"
inputs = tokenizer(input_text, return_tensors="tf")

generated_ids = model.generate(inputs['input_ids'], max_length=200, num_return_sequences=1)

generated_text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)


print(generated_text)


There were some of the jolly workmen in the shop . They would fix the glass and glass jugs so that they would never do anything wrong .
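
The inference cells above repeat the same three steps and each trigger the attention-mask warning. One way to tidy this up is to wrap the steps in a helper that forwards the tokenizer's `attention_mask` and sets `pad_token_id` explicitly, as the warning suggests. `generate_text` is a hypothetical helper, not part of the original notebook, and the generated text may differ from the outputs shown above.

```python
def generate_text(model, tokenizer, prompt, max_length=200):
    """Generate a continuation for prompt and return the decoded string."""
    enc = tokenizer(prompt, return_tensors="tf")
    generated_ids = model.generate(
        enc["input_ids"],
        attention_mask=enc["attention_mask"],  # pass the mask the warning asks for
        pad_token_id=tokenizer.eos_token_id,   # explicit instead of the implicit fallback
        max_length=max_length,
        num_return_sequences=1,
    )
    return tokenizer.decode(generated_ids[0], skip_special_tokens=True)


# Usage with the fine-tuned model and tokenizer from the cells above:
# print(generate_text(model, tokenizer, "Once upon a time"))
```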


In [2]:
import transformers

print(transformers.__version__)

4.47.0
