If you're opening this notebook on Colab, you will probably need to install 🤗 Transformers and 🤗 Datasets as well as the other dependencies. Run the following cell to install them.

In [5]:
!pip install datasets evaluate transformers rouge-score nltk

Collecting datasets
  Using cached datasets-2.20.0-py3-none-any.whl (547 kB)
Collecting evaluate
  Using cached evaluate-0.4.2-py3-none-any.whl (84 kB)
Collecting transformers
  Using cached transformers-4.42.3-py3-none-any.whl (9.3 MB)
Processing /root/.cache/pip/wheels/24/55/6f/ebfc4cb176d1c9665da4e306e1705496206d08215c1acd9dde/rouge_score-0.1.2-py3-none-any.whl
Collecting nltk
  Using cached nltk-3.8.1-py3-none-any.whl (1.5 MB)
Collecting pyarrow>=15.0.0
  Using cached pyarrow-16.1.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (41.0 MB)
Collecting fsspec[http]<=2024.5.0,>=2023.1.0
  Using cached fsspec-2024.5.0-py3-none-any.whl (316 kB)
Collecting huggingface-hub>=0.21.2
  Using cached huggingface_hub-0.23.4-py3-none-any.whl (402 kB)
Collecting pyarrow-hotfix
  Using cached pyarrow_hotfix-0.6-py3-none-any.whl (7.9 kB)
Collecting tqdm>=4.66.3
  Using cached tqdm-4.66.4-py3-none-any.whl (78 kB)
Collecting pandas
  Using cached pandas-2.0.3-cp38-cp38-manylinux_2_17_x86_64.

In [24]:
!pip install tensorboardX

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Collecting tensorboardX
  Downloading tensorboardX-2.6.2.2-py2.py3-none-any.whl (101 kB)
Collecting protobuf>=3.20
  Downloading protobuf-5.27.2-cp38-abi3-manylinux2014_x86_64.whl (309 kB)
Installing collected packages: protobuf, tensorboardX
Successfully installed protobuf-5.27.2 tensorboardX-2.6.2.2


# Fine-tuning a model on a summarization task

In [1]:
model_checkpoint = "t5-small"

Note: the model checkpoint was selected from the [Model Hub](https://huggingface.co/models).

## Loading the dataset

In [2]:
from datasets import load_dataset
from evaluate import load

raw_datasets = load_dataset("gen_sft_dataset.py", trust_remote_code=True)
metric = load("rouge")

In [3]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['weibo', 'resp'],
        num_rows: 25140
    })
    validation: Dataset({
        features: ['weibo', 'resp'],
        num_rows: 8670
    })
})

In [4]:
raw_datasets["train"][0]

{'weibo': '#WTT冠军赛布达佩斯站#\xa0男单1/4决赛林高远3-0宇田幸矢11-3，11-4，11-7',
 'resp': '别把我帅死林高远一直这么坚定下去吧！！！！别有太大压力，战胜自己就够了！！！！我永远相信小林将军'}

Pick a few random samples to get a feel for the data:

In [5]:
import datasets
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=5):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))
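
The loop above keeps redrawing until it has enough distinct indices. As a shorter sketch of the same idea, the standard library's `random.sample` draws distinct indices in one call. The helper name `pick_distinct_indices` and its `seed` parameter are illustrative assumptions, not part of the notebook.

```python
import random

def pick_distinct_indices(dataset_length, num_examples=5, seed=None):
    """Return `num_examples` distinct row indices in [0, dataset_length)."""
    if num_examples > dataset_length:
        raise ValueError("Can't pick more elements than there are in the dataset.")
    rng = random.Random(seed)  # a private generator, so the global state is untouched
    return rng.sample(range(dataset_length), num_examples)

# 25140 is the number of training rows reported above.
picks = pick_distinct_indices(25140, num_examples=5, seed=0)
print(picks)
```

The resulting list can be used the same way as `picks` in `show_random_elements`, for example `dataset[picks]`.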

In [6]:
show_random_elements(raw_datasets["train"])

Unnamed: 0,weibo,resp
0,#浙江男生获9所世界名校全奖直博offer#浙江男生朱科航本科就读中科大，毕业时，收到哈佛、斯坦福、耶鲁、加州理工等9所世界名校全奖直博offer。最终选择去哈佛研究量子传感和量子新奇材料。朱科航父亲告诉橙柿互动，他们从不强迫孩子去做不喜欢的事，朱科航到高中毕业都没上过任何辅导班。大学时，朱科航用两年学完大学四年主要课程，名列物理学院第一。朱科航聊自己的学习经验时说，“做喜欢做的事情，保持开心最重要。”OL都市快报,《关于我在人间凑数的二十年》
1,#梦华录刘亦菲角色关注度第一#,#刘亦菲梦华录# 好厉害啊！
2,#清华大学CUBA总冠军#队史第4冠！恭喜清华大学男篮，夺得第24届CUBA中国大学生男篮一级联赛全国总冠军！#CUBA总决赛##清华大学CUBA三连冠#,广工牛逼
3,#TFBOYS电影连播云合体#也许你曾被风浪拍得颓废失意，是时候重回赛道乘风破浪了！小人物也能触底反弹，平凡人也能成为黑马！《解忧杂货店》《地久天长》《送你一朵小红花》TFBOYS电影连播，三小只银幕同台云合体。《银河补习班》《误杀》《人潮汹涌》《我不是药神》《旋风女队》《夺冠》《乘风破浪》火热连映，激励奋斗人生勇往直前。#百部电影直播过大年#除夕至初六，每日打开1905电影网APP，24小时不间断直播+新春好礼+精彩电影大放映，陪你欢乐过大年！,《解忧杂货店》温暖动人！一起期待王俊凯6.2端午档新电影《断桥》
4,#过年别用李易峰照片糊弄家人#,我也想跟苏苏做朋友


The metrics we use are `ROUGE-1`, `ROUGE-2` and `ROUGE-L`. `metric`, loaded above from the 🤗 Evaluate library, is an instance of `evaluate.EvaluationModule`:

In [7]:
metric

EvaluationModule(name: "rouge", module_type: "metric", features: [{'predictions': Value(dtype='string', id='sequence'), 'references': Sequence(feature=Value(dtype='string', id='sequence'), length=-1, id=None)}, {'predictions': Value(dtype='string', id='sequence'), 'references': Value(dtype='string', id='sequence')}], usage: """
Calculates average rouge scores for a list of hypotheses and references
Args:
    predictions: list of predictions to score. Each prediction
        should be a string with tokens separated by spaces.
    references: list of reference for each prediction. Each
        reference should be a string with tokens separated by spaces.
    rouge_types: A list of rouge types to calculate.
        Valid names:
        `"rouge{n}"` (e.g. `"rouge1"`, `"rouge2"`) where: {n} is the n-gram based scoring,
        `"rougeL"`: Longest common subsequence based scoring.
        `"rougeLsum"`: rougeLsum splits text using `"\n"`.
        See details in https://github.com/huggingface/
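
To make the n-gram and longest-common-subsequence scoring described above more concrete, here is a minimal pure-Python sketch of ROUGE-N and ROUGE-L F1 scores. It assumes whitespace tokenization, a single reference and no stemming, so it is a simplification for illustration and not the implementation behind `metric`. All function names are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def f1(overlap, pred_total, ref_total):
    """F1 from an overlap count and the prediction/reference totals."""
    if overlap == 0 or pred_total == 0 or ref_total == 0:
        return 0.0
    precision = overlap / pred_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)

def rouge_n(prediction, reference, n=1):
    """ROUGE-N F1: clipped n-gram overlap between prediction and reference."""
    pred = ngrams(prediction.split(), n)
    ref = ngrams(reference.split(), n)
    overlap = sum((pred & ref).values())  # Counter intersection clips the counts
    return f1(overlap, sum(pred.values()), sum(ref.values()))

def lcs_length(a, b):
    """Length of the longest common subsequence (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def rouge_l(prediction, reference):
    """ROUGE-L F1: based on the longest common subsequence of tokens."""
    p, r = prediction.split(), reference.split()
    return f1(lcs_length(p, r), len(p), len(r))

prediction = "the cat sat on the mat"
reference = "the cat is on the mat"
print(rouge_n(prediction, reference, 1))
print(rouge_n(prediction, reference, 2))
print(rouge_l(prediction, reference))
```

In this example five of the six unigrams match, three of the five bigrams match, and the longest common subsequence has five tokens.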

## Preprocessing the data

In [9]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

You can directly call this tokenizer on one sentence or a pair of sentences:

In [10]:
tokenizer("Hello, this one sentence!")

{'input_ids': [8774, 6, 48, 80, 7142, 55, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}

Depending on the model you selected, you will see different keys in the dictionary returned by the cell above. They don't matter much for what we're doing here (just know they are required by the model we will instantiate later); you can learn more about them in [this tutorial](https://huggingface.co/transformers/preprocessing.html) if you're interested.

Instead of one sentence, we can pass along a list of sentences:

In [11]:
tokenizer(["Hello, this one sentence!", "This is another sentence."])

{'input_ids': [[8774, 6, 48, 80, 7142, 55, 1], [100, 19, 430, 7142, 5, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}

To prepare the targets for our model, we need to tokenize them using the `text_target` parameter. This will make sure the tokenizer uses the special tokens corresponding to the targets:

In [12]:
print(tokenizer(text_target=["Hello, this one sentence!", "This is another sentence."]))

{'input_ids': [[8774, 6, 48, 80, 7142, 55, 1], [100, 19, 430, 7142, 5, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}


If you are using one of the T5 checkpoints listed in the cell below, we have to prefix the inputs with "summarize: " (the model can also translate, and it needs the prefix to know which task it has to perform).

In [13]:
if model_checkpoint in ["t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b", "google/flan-t5-base"]:
    prefix = "summarize: "
else:
    prefix = ""

We can then write the function that will preprocess our samples. We just feed them to the `tokenizer` with the arguments `max_length` and `truncation=True`. This ensures that an input longer than `max_input_length` tokens (or a target longer than `max_target_length` tokens) is truncated to that length. Padding is dealt with later on (in a data collator), so that examples are padded to the longest length in the batch and not in the whole dataset.
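
To make the effect of truncation concrete, here is a rough sketch of what happens to a list of token ids. It assumes the end-of-sequence id is 1 (the trailing `1` in the tokenizer outputs above) and that the length limit counts that final token. The helper `truncate_with_eos` is hypothetical and only mimics the behaviour; it is not the tokenizer's own code.

```python
EOS_ID = 1  # assumed end-of-sequence id, matching the trailing 1 in the outputs above

def truncate_with_eos(token_ids, max_length, eos_id=EOS_ID):
    """Keep at most `max_length` ids in total, always ending with `eos_id`.

    `token_ids` is the sequence without its end-of-sequence token.
    """
    body = token_ids[: max_length - 1]  # leave room for the final token
    return body + [eos_id]

# A short input is left untouched apart from the appended end-of-sequence id.
short = truncate_with_eos([8774, 6, 48, 80, 7142, 55], max_length=128)

# A 300-token input is cut down to exactly 128 ids.
long = truncate_with_eos(list(range(100, 400)), max_length=128)
print(len(short), len(long))
```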

In [14]:
max_input_length = 128
max_target_length = 32

def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["weibo"]]
    # print(inputs)
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Setup the tokenizer for targets
    labels = tokenizer(text_target=examples["resp"], max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

This function works with one or several examples. In the case of several examples, the tokenizer will return a list of lists for each key:

In [15]:
preprocess_function(raw_datasets['train'][:2])

{'input_ids': [[21603, 10, 1713, 518, 9697, 2, 4663, 3, 2, 536, 13572, 2, 22773, 2, 2596, 3486, 6, 2596, 4278, 6, 2596, 6832, 1], [21603, 10, 1713, 518, 9697, 2, 4663, 3, 2, 536, 13572, 2, 22773, 2, 2596, 3486, 6, 2596, 4278, 6, 2596, 6832, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 'labels': [[3, 2, 12887, 2, 6, 2, 12887, 2, 1], [3, 2, 1]]}

To apply this function to all the examples in our dataset, we just use the `map` method of the `raw_datasets` object we created earlier. This will apply the function to all the elements of all the splits in `raw_datasets`, so our training and validation data will be preprocessed in one single command.

In [16]:
tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)

Even better, the results are automatically cached by the 🤗 Datasets library to avoid spending time on this step the next time you run your notebook. The 🤗 Datasets library is normally smart enough to detect when the function you pass to `map` has changed (and thus that the cached data should not be used). For instance, it will properly detect if you change the model checkpoint in the first cell and rerun the notebook. 🤗 Datasets warns you when it uses cached files; you can pass `load_from_cache_file=False` in the call to `map` to ignore the cached files and force the preprocessing to be applied again.

Note that we passed `batched=True` to encode the texts by batches together. This is to leverage the full benefit of the fast tokenizer we loaded earlier, which will use multi-threading to treat the texts in a batch concurrently.
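
With `batched=True`, the function receives a dictionary of lists (one list per column) rather than a single example. The following sketch imitates that calling convention on a plain list of rows. `batched_map` and `add_lengths` are hypothetical helpers written for illustration and are not how 🤗 Datasets is implemented.

```python
def batched_map(rows, function, batch_size=1000):
    """Apply `function` to column-wise batches of `rows` and collect the new rows."""
    output = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Column-wise view of the chunk: {"weibo": [...], "resp": [...]}
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        result = function(batch)
        # Returned columns are added to (or overwrite) the original ones.
        merged = {**batch, **result}
        for i in range(len(chunk)):
            output.append({key: values[i] for key, values in merged.items()})
    return output

def add_lengths(batch):
    """Example batched function: one new column computed from a whole batch."""
    return {"weibo_len": [len(text) for text in batch["weibo"]]}

rows = [
    {"weibo": "abc", "resp": "x"},
    {"weibo": "de", "resp": "yz"},
    {"weibo": "f", "resp": ""},
]
mapped = batched_map(rows, add_lengths, batch_size=2)
print(mapped)
```

This is why `preprocess_function` iterates over `examples["weibo"]` and returns lists: each call handles a whole batch at once.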

In [17]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['weibo', 'resp', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 25140
    })
    validation: Dataset({
        features: ['weibo', 'resp', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 8670
    })
})

## Fine-tuning the model

Now that our data is ready, we can download the pretrained model and fine-tune it. Since our task is of the sequence-to-sequence kind, we use the `AutoModelForSeq2SeqLM` class. Like with the tokenizer, the `from_pretrained` method will download and cache the model for us.

In [29]:
!pip install accelerate

Collecting accelerate
  Using cached accelerate-0.31.0-py3-none-any.whl (309 kB)
Collecting torch>=1.10.0
  Using cached torch-2.3.1-cp38-cp38-manylinux1_x86_64.whl (779.1 MB)
Installing collected packages: torch, accelerate
Successfully installed accelerate-0.31.0 torch-2.3.1


In [18]:
from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)


In [19]:
import os
pid = os.getpid()
print("Process ID:", pid)

Process ID: 930


Note that we don't get a warning like in the classification example. This means we used all the weights of the pretrained model and there is no randomly initialized head in this case.

To instantiate a `Seq2SeqTrainer`, we will need to define three more things. The most important is the [`Seq2SeqTrainingArguments`](https://huggingface.co/transformers/main_classes/trainer.html#transformers.Seq2SeqTrainingArguments), which is a class that contains all the attributes to customize the training. It requires one folder name, which will be used to save the checkpoints of the model, and all other arguments are optional:

In [67]:
batch_size = 16
model_name = model_checkpoint.split("/")[-1]
args = Seq2SeqTrainingArguments(
    output_dir="SFT_Model_T5-Small",
    evaluation_strategy="epoch",
    learning_rate=1e-4,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,  # keep at most 3 checkpoints (to save disk space)
    num_train_epochs=10,
    predict_with_generate=True,
    fp16=True,
    logging_dir="./results/t5-small-formal_10epochs",
    report_to="tensorboard",
    # push_to_hub=True,
)

Here we set the evaluation to run at the end of each epoch, tweak the learning rate, use the `batch_size` defined at the top of the cell and customize the weight decay. Since the `Seq2SeqTrainer` will save the model regularly and our dataset is quite large, we tell it to keep at most three checkpoints. Lastly, we use the `predict_with_generate` option (to properly generate summaries), activate mixed precision training (to go a bit faster) and send the logs to TensorBoard.

The last argument, `push_to_hub=True`, is commented out in the cell above. When enabled, it sets everything up so that the model is pushed to the [Hub](https://huggingface.co/models) regularly during training; leave it commented out if you have not set up access to the Hub. If you want to save your model locally under a name that is different from the name of the repository it will be pushed to, or if you want to push your model under an organization and not your own namespace, use the `hub_model_id` argument to set the repo name (it needs to be the full name, including your namespace: for instance `"sgugger/t5-finetuned-xsum"` or `"huggingface/t5-finetuned-xsum"`).

Then, we need a special kind of data collator, which will not only pad the inputs to the maximum length in the batch, but also the labels:

##### Note 2:

`DataCollatorForSeq2Seq` **dynamically** pads both the **inputs and the labels** to the **maximum length in the batch**. This is particularly useful in sequence-to-sequence tasks, where both the input sequences (source text) and the output sequences (target text) can have varying lengths.

The `DataCollatorForSeq2Seq` takes the `tokenizer` and the `model` and uses them to pad the inputs and labels of each batch during training. This ensures that all sequences *in a batch have the same length*, which is a requirement for training the model.

In [68]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
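
As a rough illustration of dynamic padding, the sketch below pads a batch of two tokenized examples by hand. It assumes a pad token id of 0 for the inputs and uses -100 for the labels, the value that `compute_metrics` later replaces before decoding. `pad_batch` is a hypothetical helper: the real collator returns tensors and can also prepare decoder inputs, which this sketch leaves out.

```python
PAD_ID = 0           # assumed pad token id for the inputs
LABEL_PAD_ID = -100  # label positions with this value are ignored by the loss

def pad_batch(features, pad_id=PAD_ID, label_pad_id=LABEL_PAD_ID):
    """Pad inputs and labels to the longest sequence in this batch only."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        in_pad = max_in - len(f["input_ids"])
        lab_pad = max_lab - len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [pad_id] * in_pad)
        batch["attention_mask"].append(f["attention_mask"] + [0] * in_pad)
        batch["labels"].append(f["labels"] + [label_pad_id] * lab_pad)
    return batch

features = [
    {"input_ids": [8774, 6, 48, 80, 7142, 55, 1],
     "attention_mask": [1] * 7,
     "labels": [3, 2, 12887, 2, 1]},
    {"input_ids": [100, 19, 430, 7142, 5, 1],
     "attention_mask": [1] * 6,
     "labels": [3, 2, 1]},
]
batch = pad_batch(features)
print(batch)
```

Padding per batch rather than to a global maximum keeps short batches short, which saves computation.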

The last thing to define for our `Seq2SeqTrainer` is how to compute the metrics from the predictions. We need to define a function for this, which will just use the `metric` we loaded earlier, and we have to do a bit of pre-processing to **decode the predictions into texts**:

In [69]:
import nltk
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Rouge expects a newline after each sentence
    decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]

    # Note that other metrics may not have a `use_aggregator` parameter
    # and thus will return a list, computing a metric for each sentence.
    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True, use_aggregator=True)
    # Extract a few results
    result = {key: value * 100 for key, value in result.items()}

    # Add mean generated length
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}
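
Two steps of `compute_metrics` can be checked on small lists without any model: swapping -100 back to the pad id before decoding, and averaging the number of non-pad tokens for `gen_len`. The sketch below uses plain Python lists instead of NumPy arrays and assumes a pad token id of 0; the helper names are hypothetical.

```python
PAD_ID = 0  # assumed pad token id

def restore_pad(labels, pad_id=PAD_ID):
    """Replace -100 (ignored label positions) by the pad id so rows can be decoded."""
    return [[pad_id if token == -100 else token for token in row] for row in labels]

def mean_generated_length(predictions, pad_id=PAD_ID):
    """Average number of non-pad tokens per generated sequence."""
    lengths = [sum(1 for token in row if token != pad_id) for row in predictions]
    return sum(lengths) / len(lengths)

labels = [[3, 2, 12887, 2, 1], [3, 2, 1, -100, -100]]
predictions = [[0, 3, 2, 1, 0, 0], [0, 3, 2, 12887, 2, 1]]

print(restore_pad(labels))
print(mean_generated_length(predictions))
```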

Then we just need to pass all of this along with our datasets to the `Seq2SeqTrainer`:

In [70]:
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

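Before training, we download the NLTK `punkt` data that `nltk.sent_tokenize` needs in `compute_metrics`. We can then fine-tune our model by just calling the `train` method: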

In [71]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [72]:
trainer.train()

Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Rougelsum,Gen Len
1,0.5325,1.552044,0.2807,0.0115,0.2776,0.283,9.3368
2,0.508,1.599372,0.2995,0.0077,0.2941,0.3018,7.1723
3,0.4685,1.639693,0.2576,0.0077,0.258,0.2614,8.0023
4,0.4636,1.655328,0.2147,0.0077,0.2141,0.2195,8.0826
5,0.4537,1.681561,0.2955,0.0077,0.2937,0.3012,7.8381
6,0.4341,1.705429,0.2636,0.0115,0.2625,0.2661,7.9446
7,0.5835,1.579337,0.2165,0.0,0.2161,0.2203,8.6159
8,0.6657,1.525661,0.2361,0.0,0.2357,0.2399,8.6817
9,0.6534,1.505022,0.2361,0.0,0.2357,0.2399,8.526
10,0.6544,1.501982,0.2361,0.0,0.2357,0.2399,8.5017




TrainOutput(global_step=15720, training_loss=0.5343201838680199, metrics={'train_runtime': 3029.6996, 'train_samples_per_second': 82.979, 'train_steps_per_second': 5.189, 'total_flos': 1.1199377582850048e+16, 'train_loss': 0.5343201838680199, 'epoch': 10.0})
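
The `global_step` value reported above can be reproduced from the dataset size and the training arguments. This is a small worked calculation that assumes training on a single device, no gradient accumulation, and a final partial batch that is kept.

```python
import math

train_rows = 25140  # number of training examples reported earlier
batch_size = 16     # per_device_train_batch_size
num_epochs = 10     # num_train_epochs

# 25140 / 16 = 1571.25, so the last batch of each epoch is only partly full.
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, total_steps)
```

Under these assumptions one epoch takes 1572 optimizer steps and ten epochs take 15720, which matches `global_step=15720`.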

You can now upload the result of the training to the Hub by executing this instruction:

In [None]:
trainer.push_to_hub()

You can now share this model with all your friends, family, favorite pets: they can all load it with the identifier `"your-username/the-name-you-picked"` so for instance:

```python
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("sgugger/my-awesome-model")
```

In [44]:
trainer.train() # 1 epoch initial

Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Rougelsum,Gen Len
1,1.1511,1.132078,0.0,0.0,0.0,0.0,10.8178




TrainOutput(global_step=1572, training_loss=1.1973761395947018, metrics={'train_runtime': 318.237, 'train_samples_per_second': 78.998, 'train_steps_per_second': 4.94, 'total_flos': 1130857938419712.0, 'train_loss': 1.1973761395947018, 'epoch': 1.0})

In [51]:
trainer.train()  # 3 epochs (stopped to change the max input/target lengths to 128/32)

Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Rougelsum,Gen Len
1,1.104,1.124649,0.0423,0.0,0.0423,0.0461,7.4302
2,1.0799,1.119966,0.0269,0.0,0.0269,0.0269,7.5054
3,1.0021,1.12236,0.0923,0.0,0.0923,0.0923,6.7352



KeyboardInterrupt



In [None]:
### A formal trial, stopped early (10 epochs, lr = 1e-3, max input length = 128, max target length = 32)

In [58]:
trainer.train()

Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Rougelsum,Gen Len
1,0.9883,1.152742,0.1457,0.0,0.1476,0.1465,9.9373
2,0.9743,1.187302,0.022,0.0,0.0225,0.022,9.0208
3,0.9145,1.185638,0.1038,0.0,0.1038,0.1038,8.3033
4,0.9079,1.163994,0.1038,0.0,0.1038,0.1077,10.7869
5,0.8818,1.192633,0.0923,0.0,0.0923,0.0923,10.9585
6,0.8383,1.228651,0.1836,0.0,0.1807,0.1807,6.9333
7,0.8125,1.239564,0.1269,0.0,0.1269,0.1269,11.5303



KeyboardInterrupt
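
This run was stopped by hand. One way such a decision could be automated is a patience rule: stop once the validation loss has failed to improve for a fixed number of consecutive epochs. The sketch below applies that rule to the validation losses from the table above. The function `early_stop_epoch` and the choice `patience=3` are assumptions for illustration, not what the notebook did.

```python
def early_stop_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    `stop_epoch` is None if the rule never triggers.
    """
    best_loss = float("inf")
    best_epoch = None
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return None, best_epoch

# Validation losses of the seven completed epochs in the run above.
val_losses = [1.152742, 1.187302, 1.185638, 1.163994, 1.192633, 1.228651, 1.239564]
stop_epoch, best_epoch = early_stop_epoch(val_losses, patience=3)
print(stop_epoch, best_epoch)
```

With these numbers the best validation loss occurs at epoch 1, and a patience of three would have stopped training after epoch 4. 🤗 Transformers provides an `EarlyStoppingCallback` for this purpose, which may be worth considering instead of interrupting the kernel.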



In [65]:
### Formal trial 2 (lr = 5e-4, batch size = 16, max input length = 128, max target length = 32)

In [66]:
trainer.train()

Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Rougelsum,Gen Len
1,0.7494,1.307577,0.2691,0.0,0.2672,0.2691,7.4291
2,0.723,1.349614,0.1357,0.0,0.1384,0.1384,8.4484
3,0.6818,1.369715,0.1269,0.0,0.1269,0.1288,7.9539
4,0.6691,1.384475,0.2263,0.0,0.2236,0.2265,9.5781
5,0.6564,1.423687,0.1807,0.0,0.1807,0.1817,8.5998
6,0.6234,1.447989,0.2111,0.0,0.2072,0.2115,8.0768



KeyboardInterrupt

