If you're opening this notebook on Colab, you will probably need to install 🤗 Transformers and 🤗 Datasets, as well as a few other dependencies. Run the following cell to install them.

In [1]:
!pip install datasets evaluate transformers rouge-score nltk

Collecting evaluate
  Downloading evaluate-0.4.2-py3-none-any.whl (84 kB)
Collecting rouge-score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
Collecting absl-py
  Downloading absl_py-2.1.0-py3-none-any.whl (133 kB)
Building wheels for collected packages: rouge-score
  Building wheel for rouge-score (setup.py) ... done
  Created wheel for rouge-score: filename=rouge_score-0.1.2-py3-none-any.whl size=24954 sha256=1afffaca5090efb0c4f9c1e89fc79a51f7ce4adb0ea4802f88625ff4dbffe898
  Stored in directory: /root/.cache/pip/wheels/24/55/6f/ebfc4cb176d1c9665da4e306e1705496206d08215c1acd9dde
Successfully built rouge-score
Installing collected packages: evaluate, absl-py, rouge-score
Successfully installed absl-py-2.1.0 evaluate-0.4.2 rouge-score-0.1.2


In [3]:
!pip install tensorboardX



In [4]:
model_checkpoint = "t5-small"  # alternative: "google/flan-t5-base"

## Loading the dataset

In [5]:
from datasets import load_dataset
from evaluate import load

raw_datasets = load_dataset("gen_sft_dataset.py", trust_remote_code=True)
metric = load("rouge")

In [6]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['weibo', 'resp'],
        num_rows: 25140
    })
    validation: Dataset({
        features: ['weibo', 'resp'],
        num_rows: 8670
    })
})

In [7]:
raw_datasets["train"][0]

{'weibo': '#WTT冠军赛布达佩斯站#\xa0男单1/4决赛林高远3-0宇田幸矢11-3，11-4，11-7',
 'resp': '别把我帅死林高远一直这么坚定下去吧！！！！别有太大压力，战胜自己就够了！！！！我永远相信小林将军'}

In [8]:
import datasets
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=5):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [9]:
show_random_elements(raw_datasets["train"])

Unnamed: 0,weibo,resp
0,#西安一男子在司机持刀伤人现场趁乱行窃#警方：已抓获嫌疑人1月19日17时许，陕西西安东郊两名司机因剐蹭发生争执，一名身着黑色上衣的男子迅速侧身从旁边的轿车内偷取物品，黑衣男子的行为被网友发现后举报给了西安公安新城分局，23时许新城分局回应称，盗窃嫌疑人已被抓获，被盗财物已追回。#西安一出租车司机持刀伤人#L百姓关注,西安捅了热搜的窝嘛
1,#经典咏流传#大美中华首期阵容发布古人笔下的中华，是如此多娇。如今众多经典传唱人将和诗以歌，用音乐诉说“中国情”，展现祖国的秀丽壮美。CCTV-1今晚20:53首播，CCTV-3今晚22点档重播，请君同赏大美中华！,诗意中国源远流长 期待歌手张杰再次传唱经典
2,#VWmagazine##硬糖少女303郑乃馨我没有PlanB#「我没有想过自己一下子就能有多红、多火，我就是慢慢往前走。」郑乃馨毫不掩饰自己的野心，成为superstar是她一直以来最大的目标。用音乐传递态度，用作品影响他人，从而收获属于自己的闪耀人生。LVWmagazine,郑乃馨怎么这么美
3,#我国一人户家庭超1.25亿##中国单人家庭超1.25亿##中国新观察#一人住、一人食、一人游，当下，独居群体不断增多。中新财经记者查询2021年中国统计年鉴发现，2020年，全国共有家庭户49416万户，其中“一人户”家庭有125490007，超过1.25亿，占比超过25%。二人户超1.46亿，三人户超1亿。你是独居吗？O中国一人户数量超1.25亿！独居者为何越来越多？,所以呢？五保户降低年龄领取不？
4,#目击者回应女子被男子拖进厕所#7月19日，网传一段女子被一名男子拖进厕所隔间的视频。#警方回应女子被男子强行拖进卫生间#红星新闻联系到视频中被女子拉倒的男子马某，马某告诉记者，当天晚上他和朋友一起去郑州市中牟县maxclub酒吧给朋友过生日，在上厕所洗手时，突然从厕所隔间里冲出一名女孩，女孩拽住他大喊“救命”，他被拽翻在地。这时，从该隔间里还冲出一名男子，拽着女孩的头发，将其快速拉回隔间。　“当时都懵了，也不知道什么情况。”旁边的朋友提醒报警，随即他拿出手机报警。之后，从旁边的隔间里出来一名穿着黑色衣服的女孩，她用脚踹了几下门，马某与几个朋友一起把门推开，将那名女子拽了出来。　马某表示，他当时去卫生间时，那名女子和男子已经在隔间里面了。那名男子出来时边提裤子边对在场人员谩骂，“他一直骂骂咧咧。”马某告诉记者，事发后警察出现，对该名男子进行盘问。　郑州市中牟县东风路派出所工作人员回应称，当天确实接到这个报警，但由于他并非当日出警人员，对此事细节并不了解。（红星新闻)L微资讯,不严惩真的会越来越多的人干这种事


In [10]:
metric

EvaluationModule(name: "rouge", module_type: "metric", features: [{'predictions': Value(dtype='string', id='sequence'), 'references': Sequence(feature=Value(dtype='string', id='sequence'), length=-1, id=None)}, {'predictions': Value(dtype='string', id='sequence'), 'references': Value(dtype='string', id='sequence')}], usage: """
Calculates average rouge scores for a list of hypotheses and references
Args:
    predictions: list of predictions to score. Each prediction
        should be a string with tokens separated by spaces.
    references: list of reference for each prediction. Each
        reference should be a string with tokens separated by spaces.
    rouge_types: A list of rouge types to calculate.
        Valid names:
        `"rouge{n}"` (e.g. `"rouge1"`, `"rouge2"`) where: {n} is the n-gram based scoring,
        `"rougeL"`: Longest common subsequence based scoring.
        `"rougeLsum"`: rougeLsum splits text using `"\n"`.
        See details in https://github.com/huggingface/

You can call its `compute` method with your predictions and labels, which both need to be lists of decoded strings:

In [11]:
fake_preds = ["hello there", "general kenobi"]
fake_labels = ["hello there", "general kenobi"]
metric.compute(predictions=fake_preds, references=fake_labels)

{'rouge1': 1.0, 'rouge2': 1.0, 'rougeL': 1.0, 'rougeLsum': 1.0}
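
To make the `rouge1` number above less of a black box, here is a minimal sketch of unigram-overlap F1 in plain Python. It is a simplification, not the `rouge` metric's actual implementation: the helper name `rouge1_f1` is hypothetical, tokens are split on whitespace only, and no stemming or aggregation over a dataset is applied.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 between two whitespace-tokenized strings (simplified)."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: a token counts at most as often as it occurs in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(rouge1_f1("hello there", "hello there"))     # identical strings
print(rouge1_f1("hello there", "general kenobi"))  # no shared tokens
```

Identical strings score 1.0 and strings with no tokens in common score 0.0, which matches the behaviour of the `fake_preds` example above.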

## Preprocessing the data

In [12]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [13]:
tokenizer("Hello, this one sentence!")

{'input_ids': [8774, 6, 48, 80, 7142, 55, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}

In [16]:
if model_checkpoint in ["t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b", "google/flan-t5-base"]:
    prefix = "summarize: "
else:
    prefix = ""

In [17]:
max_input_length = 128
max_target_length = 32

def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["weibo"]]
    # print(inputs)
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Setup the tokenizer for targets
    labels = tokenizer(text_target=examples["resp"], max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

In [18]:
preprocess_function(raw_datasets['train'][:2])

{'input_ids': [[21603, 10, 1713, 518, 9697, 2, 4663, 3, 2, 536, 13572, 2, 22773, 2, 2596, 3486, 6, 2596, 4278, 6, 2596, 6832, 1], [21603, 10, 1713, 518, 9697, 2, 4663, 3, 2, 536, 13572, 2, 22773, 2, 2596, 3486, 6, 2596, 4278, 6, 2596, 6832, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 'labels': [[3, 2, 12887, 2, 6, 2, 12887, 2, 1], [3, 2, 1]]}

In [19]:
tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)

In [20]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['weibo', 'resp', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 25140
    })
    validation: Dataset({
        features: ['weibo', 'resp', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 8670
    })
})

## Fine-tuning the model

In [21]:
!pip install accelerate

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)




In [22]:
from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

In [23]:
import os
pid = os.getpid()
print("Process ID:", pid)

Process ID: 573


In [24]:
batch_size = 32
model_name = model_checkpoint.split("/")[-1]
args = Seq2SeqTrainingArguments(
    output_dir="SFT_Model_T5-Small",
    evaluation_strategy="epoch",  # evaluate on the validation set after every epoch
    learning_rate=5e-4,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,  # keep at most 3 checkpoints (to save disk space)
    num_train_epochs=20,
    predict_with_generate=True,  # generate text during evaluation so ROUGE can be computed
    fp16=True,
    logging_dir="./results/t5-small-July2",
    report_to="tensorboard",
    # push_to_hub=True,
)


##### Note 2:

`DataCollatorForSeq2Seq` **dynamically** pads both the **inputs and the labels** to the **maximum length in the batch**. This is particularly useful in sequence-to-sequence tasks, where the input sequences (source text) and the output sequences (target text) both vary in length.

The collator is given the `tokenizer` and the `model`: it uses the tokenizer's padding token for the inputs, pads the labels with `-100` so that padded positions are ignored by the loss, and uses the model to prepare the decoder inputs. This ensures that all sequences *in a batch have the same length*, which is required to stack them into tensors for training.
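
As a rough illustration of what dynamic padding means, the sketch below pads a small batch by hand in plain Python. It is not the collator's real code: `pad_batch` is a hypothetical helper, the input ids are made up, and the defaults assume a padding id of `0` (as in T5) and a label padding value of `-100`.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Pad inputs and labels to the longest sequence in this batch only."""
    max_input_len = max(len(f["input_ids"]) for f in features)
    max_label_len = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_input_pad = max_input_len - len(f["input_ids"])
        n_label_pad = max_label_len - len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_input_pad)
        # The attention mask is 1 for real tokens and 0 for padding.
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_input_pad)
        # Labels are padded with -100 so the loss skips those positions.
        batch["labels"].append(f["labels"] + [label_pad_id] * n_label_pad)
    return batch


features = [
    {"input_ids": [21603, 10, 1713, 1], "labels": [3, 2, 12887, 2, 1]},
    {"input_ids": [21603, 10, 1], "labels": [3, 2, 1]},
]
batch = pad_batch(features)
print(batch["input_ids"])
print(batch["labels"])
```

Because the target length is recomputed for every batch, short batches are not padded all the way to `max_input_length` or `max_target_length`.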

In [25]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

In [26]:
import nltk
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Rouge expects a newline after each sentence
    decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]

    # Note that other metrics may not have a `use_aggregator` parameter
    # and thus will return a list, computing a metric for each sentence.
    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True, use_aggregator=True)
    # Convert the scores from fractions to percentages
    result = {key: value * 100 for key, value in result.items()}

    # Add mean generated length
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}

In [27]:
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)



In [28]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [31]:
trainer.train()

Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Rougelsum,Gen Len
1,1.0647,1.140377,0.0,0.0,0.0,0.0,11.0411
2,1.0242,1.147904,0.1938,0.0,0.193,0.1922,7.2157
3,0.9887,1.157635,0.1153,0.0,0.1153,0.1153,7.534
4,0.9487,1.158618,0.1745,0.0,0.1738,0.173,9.5029
5,0.9373,1.177618,0.0807,0.0,0.0807,0.0807,13.3765
6,0.905,1.1852,0.2326,0.0,0.2307,0.2345,7.6044
7,0.8807,1.200435,0.0884,0.0,0.0846,0.0884,11.5725
8,0.8582,1.224542,0.1461,0.0,0.1461,0.1461,8.9391
9,0.8411,1.242187,0.2052,0.0,0.2032,0.2027,9.0392
10,0.8286,1.255606,0.1749,0.0,0.1711,0.1749,9.2826




TrainOutput(global_step=15720, training_loss=0.8407957130412715, metrics={'train_runtime': 2411.2207, 'train_samples_per_second': 208.525, 'train_steps_per_second': 6.52, 'total_flos': 1.6078012837134336e+16, 'train_loss': 0.8407957130412715, 'epoch': 20.0})

#### Below are just some previous trials:

In [44]:
trainer.train() # 1 epoch initial

Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Rougelsum,Gen Len
1,1.1511,1.132078,0.0,0.0,0.0,0.0,10.8178




TrainOutput(global_step=1572, training_loss=1.1973761395947018, metrics={'train_runtime': 318.237, 'train_samples_per_second': 78.998, 'train_steps_per_second': 4.94, 'total_flos': 1130857938419712.0, 'train_loss': 1.1973761395947018, 'epoch': 1.0})

In [51]:
trainer.train()  # 3 epochs (interrupted to switch the max input/target lengths to 128, 32)

Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Rougelsum,Gen Len
1,1.104,1.124649,0.0423,0.0,0.0423,0.0461,7.4302
2,1.0799,1.119966,0.0269,0.0,0.0269,0.0269,7.5054
3,1.0021,1.12236,0.0923,0.0,0.0923,0.0923,6.7352



KeyboardInterrupt



In [None]:
### A formal trial - Early-stop (10 epochs, lr = 1e-3, 128, 32)

In [58]:
trainer.train()

Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Rougelsum,Gen Len
1,0.9883,1.152742,0.1457,0.0,0.1476,0.1465,9.9373
2,0.9743,1.187302,0.022,0.0,0.0225,0.022,9.0208
3,0.9145,1.185638,0.1038,0.0,0.1038,0.1038,8.3033
4,0.9079,1.163994,0.1038,0.0,0.1038,0.1077,10.7869
5,0.8818,1.192633,0.0923,0.0,0.0923,0.0923,10.9585
6,0.8383,1.228651,0.1836,0.0,0.1807,0.1807,6.9333
7,0.8125,1.239564,0.1269,0.0,0.1269,0.1269,11.5303


Exception ignored in: <bound method IPythonKernel._clean_thread_parent_frames of <ipykernel.ipkernel.IPythonKernel object at 0x7fbf6ea60b80>>
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/ipykernel/ipkernel.py", line 775, in _clean_thread_parent_frames
    def _clean_thread_parent_frames(
KeyboardInterrupt: 

KeyboardInterrupt
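
The run above was stopped by hand. A patience-based rule is one way to automate that decision; the sketch below is only illustrative and is not what this notebook ran. The function name `first_stop_epoch` and the `patience` value are hypothetical, and the losses are the validation losses from the table above.

```python
def first_stop_epoch(val_losses, patience=3):
    """Return the 1-based epoch at which a patience rule would stop, or None."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # New best validation loss: reset the counter.
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Validation losses logged for epochs 1-7 of this trial.
val_losses = [1.152742, 1.187302, 1.185638, 1.163994, 1.192633, 1.228651, 1.239564]
print(first_stop_epoch(val_losses, patience=3))  # prints 4
```

With these numbers the best validation loss occurs at epoch 1, so a patience of 3 would stop after epoch 4. In 🤗 Transformers, `EarlyStoppingCallback` offers similar behaviour inside the `Trainer`; whether it suits this setup is left as an assumption to verify.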



In [65]:
### Formal Trial 2 (lr = 5e-4, bs = 16, 128, 32)

In [66]:
trainer.train()

Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Rougelsum,Gen Len
1,0.7494,1.307577,0.2691,0.0,0.2672,0.2691,7.4291
2,0.723,1.349614,0.1357,0.0,0.1384,0.1384,8.4484
3,0.6818,1.369715,0.1269,0.0,0.1269,0.1288,7.9539
4,0.6691,1.384475,0.2263,0.0,0.2236,0.2265,9.5781
5,0.6564,1.423687,0.1807,0.0,0.1807,0.1817,8.5998
6,0.6234,1.447989,0.2111,0.0,0.2072,0.2115,8.0768



KeyboardInterrupt

