# Text Generation Demo

This demo shows you how to fine-tune a pretrained language model for text generation by using _Causal language modeling_.
Fine-tuning is the process of training a pretrained AI model on a specific task or dataset to adapt the model to your needs.
In this way, you refine the performance of the model for your specific use case, without having to retrain the model from scratch.

Causal language modeling is a natural language processing technique that predicts the next token of a sequence of tokens, and it's typically used for text generation.
Large language models (LLMs) such as Llama 2 and GPT-4 have shown impressive results in text generation.
However, the size of these models requires vast amounts of computational and memory resources for training, and even for fine-tuning the pretrained models.

Instead, this demo fine-tunes the DistilGPT-2 model, which is a smaller model developed by Hugging Face.
Note that, although the model is smaller, the fine-tuning step still might take hours if you do not have access to a GPU.
For that reason, this notebook deactivates the training phase by default, and instead provides the final fine-tuned model.
If you wish to change this behaviour, change the following variable to `True`.


In [1]:
DO_TRAIN = False

First, install the dependencies that are required for this demo and are not installed in the PyTorch workbench.
This demo uses the `transformers` library ecosystem, which is a common choice when training language models.

In [2]:
%pip install transformers[torch]==4.34.0 datasets==2.14.5 evaluate==0.4.0

Collecting transformers[torch]==4.34.0
  Downloading transformers-4.34.0-py3-none-any.whl (7.7 MB)
Collecting datasets==2.14.5
  Downloading datasets-2.14.5-py3-none-any.whl (519 kB)
Collecting evaluate==0.4.0
  Downloading evaluate-0.4.0-py3-none-any.whl (81 kB)
Collecting regex!=2019.12.17
  Downloading regex-2023.10.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (773 kB)
Collecting safetensors>=0.3.1
  Downloading safetensors-0.4.0-cp39-cp39-manylinux_2_17_x86_6

Import the dependencies for the exercise:

In [3]:
import math
from datasets import Dataset
from datasets import load_dataset
from transformers import (
    pipeline, AutoTokenizer, DataCollatorForLanguageModeling, 
    AutoModelForCausalLM, TrainingArguments, Trainer
)

## Data Loading

The demo fine-tunes the DistilGPT-2 model to better generate text related to Open Data Hub.
To this end, the demo provides a subset of the asciidoc source code of the Open Data Hub Documentation in the `odh-merged-docs.adoc` file.
The complete documentation is available at https://github.com/opendatahub-io/opendatahub-documentation.

You can use this data to _teach_ the model how to write more Open Data Hub content.

Load the data with the `datasets` library:

In [4]:
ds = load_dataset("text", data_files={"data": "odh-merged-docs.adoc"}, split="data")
ds

Dataset({
    features: ['text'],
    num_rows: 3793
})

## Create the Tokenizer

Create a tokenizer.
A tokenizer is a key component of language models.
It converts raw text into numerical ids (tokens) that can be processed by the neural network inside the model.

In this case, use the tokenizer that is specific to the DistilGPT-2 model:

In [5]:
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")

You can test the tokenizer on a short string:

In [6]:
tokenizer("Hello world!")

{'input_ids': [15496, 995, 0], 'attention_mask': [1, 1, 1]}
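
As a rough mental model of the output above, the following sketch uses a made-up word-level vocabulary (`toy_vocab` and `toy_encode` are hypothetical names, not part of `transformers`). The real DistilGPT-2 tokenizer works on subwords with byte-pair encoding, so its ids differ, but `input_ids` and `attention_mask` play the same roles.

```python
# Toy illustration only: a hypothetical word-level vocabulary,
# not the byte-pair encoding that DistilGPT-2 actually uses.
toy_vocab = {"<pad>": 0, "hello": 1, "world": 2, "!": 3}


def toy_encode(words, max_len):
    """Map words to ids and pad the sequence to max_len."""
    ids = [toy_vocab[w] for w in words]
    n_pad = max_len - len(ids)
    return {
        "input_ids": ids + [toy_vocab["<pad>"]] * n_pad,
        # 1 marks a real token, 0 marks padding that the model should ignore.
        "attention_mask": [1] * len(ids) + [0] * n_pad,
    }


encoded = toy_encode(["hello", "world", "!"], max_len=5)
print(encoded)
```

In the real output above the mask contains only ones because a single, unpadded sequence was encoded.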

## Data Preparation

Preprocess the data by tokenizing the text and grouping the samples in batches.
You must also divide the data into training and testing splits.

In [7]:
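def preprocess_function(samples):
    # NOTE: each `x` is one line of text (a str), so `join` iterates over its
    # characters and inserts f"{x}\n" between them, which lengthens every
    # sample. To only append a newline to each line, use f"{x}\n" instead.
    return tokenizer([f"{x}\n".join(x) for x in samples["text"]])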

ds = ds.train_test_split(test_size=0.2)

tokenized_ds = ds.map(
    preprocess_function,
    batched=True,
    num_proc=4,
    remove_columns=ds["train"].column_names,
)

Map (num_proc=4):   0%|          | 0/3034 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (7068 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1915 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1790 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (7814 > 1024). Running this sequence through the model will result in indexing errors


Map (num_proc=4):   0%|          | 0/759 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (5457 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1687 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (3206 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (4385 > 1024). Running this sequence through the model will result in indexing errors
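
The warnings above report sequences that are much longer than a single line of documentation. One contributor is the `join` call in `preprocess_function`: because each sample is a string, `str.join` iterates over its characters. The sketch below reproduces that expression on a short string, without the tokenizer; `joined_like_preprocess` and `with_newline` are illustrative names.

```python
def joined_like_preprocess(x):
    # Same expression as in preprocess_function: the separator f"{x}\n"
    # is inserted between every pair of characters of x.
    return f"{x}\n".join(x)


def with_newline(x):
    # Alternative that only appends a newline to the line.
    return f"{x}\n"


line = "abc"
print(repr(joined_like_preprocess(line)))
print(len(joined_like_preprocess(line)), len(with_newline(line)))
```

The joined string grows roughly with the square of the line length, which is consistent with long lines exceeding the 1024-token limit.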


Inspect the dataset and verify that two subsets are included now.

In [8]:
tokenized_ds

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 3034
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 759
    })
})
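
The row counts above follow from `test_size=0.2` applied to the 3793 loaded rows. A minimal sketch of that arithmetic, assuming the test share is rounded up and the remainder goes to training (this matches the counts shown, although the library's exact rounding rule is not stated in this notebook):

```python
import math

num_rows = 3793   # rows reported by load_dataset above
test_size = 0.2   # fraction passed to train_test_split

n_test = math.ceil(num_rows * test_size)  # assumption: rounded up
n_train = num_rows - n_test

print(n_train, n_test)
```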

Concatenate all the token sequences and chunk them into blocks of `block_size` tokens.

This ensures that every block used for training fits in memory and stays below the 1024-token maximum sequence length reported in the earlier warnings.

In [9]:
block_size = 256

def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # Drop the small remainder. You could pad instead if the model supported it;
    # customize this part to your needs.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    # Split by chunks of block_size.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

lm_dataset = tokenized_ds.map(group_texts, batched=True, num_proc=4)
lm_dataset

Map (num_proc=4):   0%|          | 0/3034 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/759 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 21154
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 6194
    })
})
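
To see the effect of `group_texts` on a small input, the sketch below applies the same logic with a block size of 4 instead of 256. `group_texts_demo` is a hypothetical variant that takes the block size as a parameter, and the token ids are made up.

```python
def group_texts_demo(examples, block_size):
    # Concatenate the per-sample lists of each column into one long list.
    concatenated = {k: sum(examples[k], []) for k in examples}
    total_length = len(concatenated["input_ids"])
    # Drop the remainder that does not fill a whole block.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    # Split every column into chunks of block_size.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result


batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1], [1, 1, 1]],
}
blocks = group_texts_demo(batch, block_size=4)
print(blocks["input_ids"])
```

Three samples become two blocks, and the last two tokens are dropped. This is also why the number of rows changes after the `map` call: rows now correspond to fixed-size blocks, not to lines of the source file.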

Finally, define the data collation and the padding strategy:

In [10]:
tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
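
With `mlm=False`, the collator builds the `labels` from the `input_ids` and replaces padding positions with `-100`, the value that the loss ignores. The model then shifts the labels by one position internally, so that each position is trained to predict the next token. The following pure-Python sketch illustrates both steps on made-up input; `make_labels` and `next_token_pairs` are hypothetical helpers, not library functions.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss


def make_labels(input_ids, pad_token_id):
    # Copy the inputs and mask out padding positions.
    return [IGNORE_INDEX if t == pad_token_id else t for t in input_ids]


def next_token_pairs(input_ids, labels):
    # Position i is trained to predict the label at position i + 1.
    return [
        (context, target)
        for context, target in zip(input_ids[:-1], labels[1:])
        if target != IGNORE_INDEX
    ]


pad_id = 50256  # end-of-sequence id, reused as padding in this notebook
ids = [15496, 995, 0, pad_id, pad_id]
labels = make_labels(ids, pad_id)
print(labels)
print(next_token_pairs(ids, labels))
```

Because this notebook reuses the end-of-sequence token as the padding token, end-of-sequence tokens in the data are masked in the same way.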

## Training (fine-tuning)

Load the pretrained base DistilGPT-2 model.

In [11]:
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

Train the model with your data:

In [14]:
if DO_TRAIN:
    training_args = TrainingArguments(
        output_dir="my_model",
        evaluation_strategy="epoch",
        learning_rate=2e-5,
        num_train_epochs=1,
        weight_decay=0.01
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=lm_dataset["train"],
        eval_dataset=lm_dataset["test"],
        data_collator=data_collator,
    )

    trainer.train()
    trainer.save_model()
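
To estimate how long a run takes, it helps to know how many optimizer steps one epoch contains. A minimal sketch of that arithmetic, assuming a single device, no gradient accumulation, and the `TrainingArguments` default of 8 samples per device batch (none of these are set explicitly in the cell above):

```python
import math

train_blocks = 21154      # rows in lm_dataset["train"], as shown earlier
per_device_batch = 8      # assumption: TrainingArguments default
num_devices = 1           # assumption: one GPU, or CPU only
num_train_epochs = 1      # value used in this notebook

# The last, partially filled batch still counts as a step.
steps_per_epoch = math.ceil(train_blocks / (per_device_batch * num_devices))
total_steps = steps_per_epoch * num_train_epochs
print(total_steps)
```

The exact number of blocks can differ between runs, because the train/test split is random.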

Evaluate the model by computing its perplexity on the test split:

In [None]:
if DO_TRAIN:
    eval_results = trainer.evaluate()
    print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
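
Perplexity is the exponential of the average cross-entropy loss, so lower is better. As a reference point, a model that spreads its probability uniformly over a vocabulary of V tokens has a loss of ln(V) and a perplexity of V. The sketch below checks this with a vocabulary size of 50257 (the GPT-2 vocabulary size, assumed here) and a made-up loss value:

```python
import math


def perplexity(eval_loss):
    # eval_loss is the mean cross-entropy per token, in nats.
    return math.exp(eval_loss)


vocab_size = 50257                   # assumption: GPT-2 vocabulary size
uniform_loss = math.log(vocab_size)  # loss of a uniform guess
print(f"{perplexity(uniform_loss):.0f}")

example_loss = 3.0                   # made-up value for illustration
print(f"{perplexity(example_loss):.2f}")
```

A fine-tuned model is only useful if its perplexity is far below the uniform baseline.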

# Testing


## Download Model (ONLY if `DO_TRAIN` is False)
If `DO_TRAIN` is False, you need to download the model before testing it.

Download the [pytorch_model.bin](https://drive.google.com/file/d/142H5pfiw7JKN29xv9rav1ZTDF2a8MCoZ/view?usp=sharing) file from Google Drive to your computer.
Then, upload the file into the `my_model/` directory of this workspace.

Wait for the file to upload.
You can verify the upload progress in the tool bar at the bottom of the screen.
After you have uploaded the file, verify that the file is in the `my_model` directory:

In [17]:
%ls -l my_model/pytorch_model.bin

-rw-r--r--. 1 1003310000 1003310000 327674773 Oct  9 08:54 my_model/pytorch_model.bin


## Run the Tests
Generate text given the following prompt:

In [18]:
prompt = "Use Elyra to"

First, verify the text produced by the base DistilGPT-2 model:

In [19]:
base_generator = pipeline("text-generation", model="distilgpt2")
print(base_generator(prompt)[0]["generated_text"].strip())

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Use Elyra to the Moon

There's going to be a lot of time before we get to that point. I've been going on for a couple of seasons before, and here's the main point: the most important thing about that story


Now, test the text generated by the fine-tuned model.
The output might sound closer to the Open Data Hub documentation.

In [21]:
generator = pipeline("text-generation", model="./my_model", tokenizer=tokenizer)
print(generator(prompt)[0]["generated_text"].strip())

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Use Elyra to create visual end-to-end pipelines that easily run pipelines across your notebook server. Elyra is an extension for JupyterLab that provides you with a Pipeline Editor to create pipeline workflows that can be executed in {
