In this tutorial, we'll use a transformer model to generate text from an input prompt. This is commonly called text generation, and models such as GPT (Generative Pre-trained Transformer) and T5 are well suited to it. We'll use the popular `t5-base` model, a general-purpose text-to-text transformer that can perform tasks such as summarization, translation, and text generation.

# Import the necessary libraries:

In [8]:
from transformers import pipeline


# Create a text generation pipeline:
We'll initialize a text-to-text generation pipeline using the `t5-base` model.

In [None]:
generator = pipeline("text2text-generation", model="t5-base")


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.





# Provide input text:
Next, provide the input text. In this part of the tutorial we use the pretrained T5 model as-is, without fine-tuning, and state the task directly in the prompt. You can experiment with other tasks, such as answering questions or summarizing content.

In [None]:
input_text = "Translate the following English text to French: Hugging Face is creating amazing tools for the NLP community."
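
T5 was trained with short task prefixes, so the wording at the start of the prompt matters. The sketch below builds prompts from the prefixes listed in the `t5-base` configuration. The helper `build_prompt` and the `T5_PREFIXES` dictionary are our own illustrative names, not part of the `transformers` API.

```python
# Task prefixes T5 saw during training (as listed in the t5-base config).
T5_PREFIXES = {
    "summarize": "summarize: ",
    "en_to_fr": "translate English to French: ",
    "en_to_de": "translate English to German: ",
    "en_to_ro": "translate English to Romanian: ",
}


def build_prompt(task: str, text: str) -> str:
    """Prepend the T5 prefix for `task` to `text` (hypothetical helper)."""
    try:
        prefix = T5_PREFIXES[task]
    except KeyError:
        raise ValueError(
            f"unknown task {task!r}; choose from {sorted(T5_PREFIXES)}"
        ) from None
    return prefix + text.strip()


prompt = build_prompt(
    "en_to_fr", "Hugging Face is creating amazing tools for the NLP community."
)
print(prompt)
```

The resulting string can be passed to `generator(prompt)` in place of `input_text`. The free-form instruction used in this tutorial also produced a French translation in the run shown here, so treat the prefixes as a starting point for experiments rather than a requirement.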


# Generate the output:
Now call the `generator` pipeline on the input. It returns a list with one dictionary per generated sequence, and the text is stored under the `generated_text` key.

In [None]:
output = generator(input_text)
print("Generated text:", output[0]['generated_text'])




Generated text: Hugging Face crée des outils extraordinaires pour la communauté de la LNP.



# Fine-Tuning the T5 Model for Text-to-Text Generation

In this section, we'll demonstrate how to fine-tune the `t5-base` model for text-to-text generation tasks on a custom dataset. Fine-tuning allows the model to adapt to specific text generation tasks such as translation, summarization, or other natural language processing tasks.

We will use the Hugging Face `Trainer` API to handle the fine-tuning process.


In [2]:
!pip install transformers datasets


In [3]:

# Step 1: Load a custom text-to-text dataset (e.g., a summarization dataset)
# For demonstration purposes, we’ll use the CNN/DailyMail dataset for summarization. You can replace this with your own dataset.

from datasets import load_dataset

dataset = load_dataset('cnn_dailymail', '3.0.0', split='train[:1%]')

# Step 2: Load the pretrained model and tokenizer
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained('t5-base')
model = T5ForConditionalGeneration.from_pretrained('t5-base')

# Step 3: Preprocess the dataset for fine-tuning
def preprocess_data(examples):
    # T5 expects a task prefix; "summarize: " selects summarization
    inputs = ["summarize: " + doc for doc in examples["article"]]
    model_inputs = tokenizer(inputs, max_length=512, truncation=True, padding='max_length')

    # Tokenize the summaries (labels). `text_target` replaces the deprecated
    # `tokenizer.as_target_tokenizer()` context manager.
    labels = tokenizer(text_target=examples["highlights"], max_length=128, truncation=True, padding='max_length')

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

train_data = dataset.map(preprocess_data, batched=True, remove_columns=["article", "highlights"])

# Step 4: Define training arguments and initialize Trainer
from transformers import Trainer, TrainingArguments

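training_args = TrainingArguments(
    output_dir='./results',
    # Evaluation is disabled by default (strategy "no"), so no evaluation
    # argument is needed here. Its name differs across transformers releases
    # (`evaluation_strategy` in older ones, `eval_strategy` in newer ones).
    learning_rate=3e-5,
    per_device_train_batch_size=4,
    num_train_epochs=1
)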

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data
)

# Step 5: Fine-tune the model
trainer.train()







| Step | Training Loss |
|------|---------------|
| 500  | 1.1058        |
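
Only one loss value is logged because `Trainer` reports the training loss every 500 steps by default (`logging_steps=500`) and this run is short. The sketch below reproduces the step count from the settings used in this section, assuming a single device and no gradient accumulation. The variable names are ours, and the example count is taken from the 1% slice of the 287,113-example training split.

```python
import math

train_split_size = 287_113               # full CNN/DailyMail train split
num_examples = train_split_size // 100   # the 'train[:1%]' slice: 2871 examples
batch_size = 4                           # per_device_train_batch_size
num_epochs = 1                           # num_train_epochs
logging_steps = 500                      # TrainingArguments default

# One optimizer step per batch; the last batch may be smaller than batch_size.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

# Steps at which a training-loss row is logged.
logged_steps = list(range(logging_steps, total_steps + 1, logging_steps))

print("total optimizer steps:", total_steps)
print("steps with a logged loss:", logged_steps)
```

To see more loss values on a run this size, a smaller `logging_steps` (for example 50) can be passed to `TrainingArguments`.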


TrainOutput(global_step=718, training_loss=1.0267081513046221, metrics={'train_runtime': 544.3503, 'train_samples_per_second': 5.274, 'train_steps_per_second': 1.319, 'total_flos': 1748318103797760.0, 'train_loss': 1.0267081513046221, 'epoch': 1.0})
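
Step 3 pads every summary to 128 tokens and keeps the pad ids in `labels`, so the loss reported above is averaged over padding positions as well as real summary tokens. A common refinement, not used in the run shown here, is to replace pad ids in the labels with `-100`, the value the loss function skips. A minimal sketch follows; `mask_padding` is a hypothetical helper and the token ids in the toy batch are made up for illustration.

```python
IGNORE_INDEX = -100  # label value that the loss in PyTorch / transformers skips


def mask_padding(label_ids, pad_token_id=0):
    """Replace pad token ids with IGNORE_INDEX in a batch of label sequences.

    `pad_token_id` defaults to 0, which is the pad id of the T5 tokenizer.
    """
    return [
        [IGNORE_INDEX if token_id == pad_token_id else token_id for token_id in seq]
        for seq in label_ids
    ]


# Toy batch: two label sequences padded to length 5 with pad id 0.
batch = [[71, 1712, 1, 0, 0], [27, 1, 0, 0, 0]]
masked = mask_padding(batch)
print(masked)
```

Inside `preprocess_data`, this would become `model_inputs["labels"] = mask_padding(labels["input_ids"], tokenizer.pad_token_id)`. Expect loss values that differ from those shown above, since padding no longer contributes to the average.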

# Check the model's performance after fine-tuning

In [7]:
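# Trainer wrote the final checkpoint under output_dir ('./results', i.e. /content/results on Colab)
model_path = "/content/results/checkpoint-718"
# The tokenizer was not passed to Trainer, so it is not stored in the checkpoint; reload it from the base model
tokenizer = T5Tokenizer.from_pretrained('t5-base')
model = T5ForConditionalGeneration.from_pretrained(model_path)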
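
The checkpoint path above is hard-coded to the step count of this particular run. As a sketch of one way to avoid that, the helper below picks the `checkpoint-<step>` directory with the highest step number inside an output directory. `latest_checkpoint` is our own illustrative function, and the demo uses a temporary directory that only mimics the layout `Trainer` produces.

```python
import re
import tempfile
from pathlib import Path


def latest_checkpoint(output_dir):
    """Return the `checkpoint-<step>` subdirectory with the highest step, or None."""
    pattern = re.compile(r"checkpoint-(\d+)$")
    candidates = []
    for path in Path(output_dir).iterdir():
        match = pattern.match(path.name)
        if path.is_dir() and match:
            candidates.append((int(match.group(1)), path))
    if not candidates:
        return None
    # Compare by step number only.
    return max(candidates, key=lambda item: item[0])[1]


# Demo on a throwaway directory with two fake checkpoint folders.
with tempfile.TemporaryDirectory() as tmp:
    for step in (500, 718):
        (Path(tmp) / f"checkpoint-{step}").mkdir()
    found_name = latest_checkpoint(tmp).name
    print(found_name)
```

With the training run from this section, `latest_checkpoint("./results")` could replace the literal `model_path`. The `transformers` library also provides `get_last_checkpoint` in `transformers.trainer_utils` for the same purpose.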



In [9]:
text2text_generator = pipeline("text2text-generation", model=model, tokenizer=tokenizer)

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
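
The warning above means the pipeline runs on the CPU even though a GPU is present. The `pipeline` function accepts a `device` argument, where `-1` means CPU and `0` means the first GPU. The sketch below chooses a value at runtime; `pick_device` is a hypothetical helper, and it falls back to the CPU when PyTorch is not installed.

```python
def pick_device() -> int:
    """Return 0 (first GPU) when CUDA is usable, otherwise -1 (CPU)."""
    try:
        import torch
    except ImportError:
        return -1
    return 0 if torch.cuda.is_available() else -1


device = pick_device()
print("pipeline device index:", device)

# With the objects from this tutorial, the pipeline would then be created as:
# text2text_generator = pipeline(
#     "text2text-generation", model=model, tokenizer=tokenizer, device=device
# )
```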


# Exercise Questions:

*   Experiment with different tasks, such as summarization or translation into various languages. How does the model perform on more complex tasks (e.g., summarizing long paragraphs)?
*   Modify the script so the user can keep entering new tasks and generating text without restarting the program.
*   Try other text2text-generation models from Hugging Face (e.g., t5-small, t5-large) and compare their output quality and speed.
*   Adjust the max_length and min_length parameters of the generation call to control the length of the output text.
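
For the last exercise, generation settings can be passed as keyword arguments when calling the pipeline. The sketch below wraps that call in a small function. `generate_with_limits` and `fake_generator` are our own illustrative names, and the stand-in generator exists only so the example runs without downloading a model. Both limits are counted in tokens, not characters.

```python
def generate_with_limits(generator, text, min_length=5, max_length=40):
    """Call a text2text pipeline with explicit output-length bounds.

    `generator` is any callable that behaves like the pipeline in this
    tutorial: it takes a string plus generation keyword arguments and
    returns a list of dicts with a "generated_text" key.
    """
    if min_length > max_length:
        raise ValueError("min_length must not exceed max_length")
    outputs = generator(text, min_length=min_length, max_length=max_length)
    return outputs[0]["generated_text"]


# Stand-in for the real pipeline: it echoes the arguments it received.
def fake_generator(text, **generate_kwargs):
    return [{"generated_text": f"{text} | {generate_kwargs}"}]


result = generate_with_limits(fake_generator, "summarize: some long article")
print(result)
```

With the real pipeline, the call would be `generate_with_limits(text2text_generator, input_text, min_length=10, max_length=60)`, where the two numbers are arbitrary values to experiment with.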





In [10]:

input_text = "Translate the following English text to French: Hugging Face is creating amazing tools for the NLP community."

output = text2text_generator(input_text)
print("Generated text:", output[0]['generated_text'])



Generated text: Hugging Face crée des outils extraordinaires pour la communauté de la LNP.


In [11]:
while True:
    input_text = input("Enter a task or text to generate: ")

    if input_text.lower() == "exit":
        break

    output = text2text_generator(input_text)
    print("Generated text:", output[0]['generated_text'])

Enter a task or text to generate: Translate the following English text to French: Hugging Face is creating amazing tools for the NLP community
Generated text: Hugging Face crée des outils extraordinaires pour la communauté de la LNL
Enter a task or text to generate: exit
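
The loop above stops only on `exit` and sends blank lines to the model. A slightly more defensive variant is sketched below, with the input and output functions passed in so the loop can be exercised without a keyboard or a model. `run_loop` and its parameters are hypothetical names, and the demo uses a stand-in generator that upper-cases its input.

```python
def run_loop(generate, read=input, write=print):
    """Read prompts until 'exit' (or end of input) and print each generation."""
    while True:
        try:
            text = read("Enter a task or text to generate: ").strip()
        except (EOFError, KeyboardInterrupt):
            break  # treat Ctrl-D / Ctrl-C like "exit"
        if text.lower() == "exit":
            break
        if not text:
            continue  # skip blank lines instead of calling the model
        write("Generated text:", generate(text)[0]["generated_text"])


# Scripted demo: canned inputs and a stand-in generator.
inputs = iter(["", "summarize: hello", "EXIT"])
captured = []
run_loop(
    generate=lambda text: [{"generated_text": text.upper()}],
    read=lambda prompt: next(inputs),
    write=lambda *parts: captured.append(" ".join(parts)),
)
print(captured)
```

In the notebook, `run_loop(text2text_generator)` would use the real pipeline with the default `input` and `print`.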
