# T5-small Fine-tuning

### **The model checkpoint can be downloaded from: [Google Drive](https://drive.google.com/file/d/1rPxy0jeDDT5JkYI1q5ntlOROJvt2Jpdo/view?usp=sharing)*

## Imports, Device Setting and Weight and Bias Display

In [1]:
! pip install transformers
!pip3 install wandb
import wandb
import os
import torch
import re
from torch import cuda, nn, optim
from transformers import BertTokenizer, T5ForConditionalGeneration, Text2TextGenerationPipeline
from transformers import TrainingArguments, Trainer, logging
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.28.1-py3-none-any.whl (7.0 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.0/7.0 MB[0m [31m96.4 MB/s[0m eta [36m0:00:00[0m
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.8/7.8 MB[0m [31m98.1 MB/s[0m eta [36m0:00:00[0m
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m224.5/224.5 kB[0m [31m30.0 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.14.1 tokenizers-0.13.3 transformers-4.28.1
Looking in indexes: https://pypi.org/simple, https://

In [2]:
from google.colab import drive
drive.mount('/content/gdrive')
manual_seed = 585
torch.manual_seed(manual_seed)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

Mounted at /content/gdrive
cuda


In [3]:
wandb.login()
wandb.init(project="Zootopia", entity="qmygrace")


<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33mqmygrace[0m. Use [1m`wandb login --relogin`[0m to force relogin


## Load the Pre-trained Model

In [4]:
# tokenizer = AutoTokenizer.from_pretrained("t5-small")
# model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
# No Chinese was used for pre-train

# tokenizer = AutoTokenizer.from_pretrained("mxmax/Chinese_Chat_T5_Base")
# model = AutoModelForSeq2SeqLM.from_pretrained("mxmax/Chinese_Chat_T5_Base")

# https://huggingface.co/uer/t5-small-chinese-cluecorpussmall
tokenizer = BertTokenizer.from_pretrained("uer/t5-small-chinese-cluecorpussmall")
model = T5ForConditionalGeneration.from_pretrained("uer/t5-small-chinese-cluecorpussmall")


Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/110k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/321 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/458 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/220M [00:00<?, ?B/s]

## Preprocess data

In [5]:
# path = '../data/'    # change the path as needed
path = '/content/gdrive/My Drive/585data/'

def read_data(file):
    with open (path+file) as t:
        data = t.readlines()
    return data

train_set = read_data('train_data.txt')[:20000]
dev_set = read_data('dev_data.txt')[:3000]
# test_set = read_data('test_data.txt')[:3000]

# type(train_set)
print(train_set[:2], '\n', len(train_set), len(dev_set))

['{"groundTruth": ["发扬光大", "平易近人", "温文尔雅"], "candidates": [["意气风发", "街谈巷议", "人才辈出", "一脉相传", "后继有人", "发扬光大", "腥风血雨"], ["平易近人", "落落大方", "八仙过海", "彬彬有礼", "史无前例", "盛气凌人", "好自为之"], ["不拘小节", "风流潇洒", "无病呻吟", "言谈举止", "壮志凌云", "关门闭户", "温文尔雅"]], "content": "由实力派演员刘威饰演的清华第三任校长蒋南翔，是我国著名的青年运动家和教育家，他跟清华终身校长梅贻琦一样，都是由清华人自己培养出来的校长。历史上的蒋南翔是著名的“一二九”学生救亡运动的领导人之一，他在清华校长之位14年期间，不但很好的继承了清华建校之初的优秀传统与理念，而且更加的#idiom#，他把清华的教师队伍扩大了将近5倍，将清华本科人数破万，为新中国培养了大量的有用人才。在《天行健》中饰演蒋南翔的刘威是观众所熟悉的著名实力派演员，早在1987年刘威就在《关东大侠》中饰演豪爽仗义的关云天一角而获得了金鸡奖最佳男主角的提名，后来更是因在《唐明皇》中精湛的表演而一举夺得金鹰奖最佳男演员奖。此次《天行健》选定刘威来出演正是看中了他#idiom#的表演方式和对人物深入内心的刻画。至此，《天行健》中涉及的三位清华校长的人选都已经曝光，#idiom#的第一任校长赵文?、稳重坚毅的第二任校长孙逊、亲切务实的第三任校长刘威，再加上梁思成、林徽因、朱自清、闻一多等一批“大师”的加盟，相信作为清华百年校庆重点项目之一的《天行健》一定会带领观众重温那段不能抹去的历史。", "realCount": 3}\n', '{"groundTruth": ["肥头大耳"], "candidates": [["超凡入圣", "骨瘦如柴", "青面獠牙", "虎背熊腰", "成人之美", "肥头大耳", "神不守舍"]], "content": "#idiom#的掌柜只穿一件衬衫，坐在柜台里。几个堂倌穿着脏得发黑的白工作服，因为没有顾客，都散坐在桌子旁。这当儿看到这位不寻常的客人，都露出好奇的神色列宁曾批评他理论上的错误，同时认为他“所写的全部哲学，赶紧迎上前来伺候。聂赫留朵夫要了一瓶矿泉水，在离窗较远的地方挨着一张

In [6]:
# preprocess_idx = -1
# def replace(match):
#     global preprocess_idx
#     preprocess_idx += 1
#     return 'extra {}'.format(preprocess_idx)

# text = '由实力派演员刘威饰演的清华第三任校长蒋南翔，是我国著名的青年运动家和教育家，他跟清华终身校长梅贻琦一样，都是由清华人自己培养出来的校长。历史上的蒋南翔是著名的“一二九”学生救亡运动的领导人之一，他在清华校长之位14年期间，不但很好的继承了清华建校之初的优秀传统与理念，而且更加的#idiom#，他把清华的教师队伍扩大了将近5倍，将清华本科人数破万，为新中国培养了大量的有用人才。在《天行健》中饰演蒋南翔的刘威是观众所熟悉的著名实力派演员，早在1987年刘威就在《关东大侠》中饰演豪爽仗义的关云天一角而获得了金鸡奖最佳男主角的提名，后来更是因在《唐明皇》中精湛的表演而一举夺得金鹰奖最佳男演员奖。此次《天行健》选定刘威来出演正是看中了他#idiom#的表演方式和对人物深入内心的刻画。至此，《天行健》中涉及的三位清华校长的人选都已经曝光，#idiom#的第一任校长赵文?、稳重坚毅的第二任校长孙逊、亲切务实的第三任校长刘威，再加上梁思成、林徽因、朱自清、闻一多等一批“大师”的加盟，相信作为清华百年校庆重点项目之一的《天行健》一定会带领观众重温那段不能抹去的历史。'
# re.sub(r'#idiom#', replace, text)

In [7]:
def preprocess(data):
    text_input = []
    idiom_output = []
    for i in range(len(data)):
        data[i] = eval(data[i])
        input_text = data[i]['content']
        ground_truth = data[i]['groundTruth']
        candidates = data[i]['candidates']

        candidate_str = ''
        for candidate in candidates:
            candidate_str += '('+'|'.join(candidate)+')'
        
        preprocess_idx = -1
        def replace(match):
            nonlocal preprocess_idx
            preprocess_idx += 1
            return 'extra{}'.format(preprocess_idx)
        input_text = re.sub(r'#idiom#', replace, input_text)

        instruction = '请从下列括号中分别选择合适的成语填入空缺处：{}'.format(candidate_str)
        # input_text = input_text.replace('#idiom#', '_')
        output_text = ','.join(ground_truth)
        
        text_input.append(instruction+'\n'+input_text)
        idiom_output.append(output_text)
    
    print(text_input[0], idiom_output[0])    
    input_tok = tokenizer.batch_encode_plus(text_input,
                                            add_special_tokens=False, 
                                            return_token_type_ids=False)
    output_tok = tokenizer.batch_encode_plus(idiom_output, 
                                             add_special_tokens=False,
                                             return_token_type_ids=False)
    return input_tok, output_tok

In [8]:
train_input, train_output = preprocess(train_set)
dev_input, dev_output = preprocess(dev_set)
# test_input, test_output = preprocess(test_set)

请从下列括号中分别选择合适的成语填入空缺处：(意气风发|街谈巷议|人才辈出|一脉相传|后继有人|发扬光大|腥风血雨)(平易近人|落落大方|八仙过海|彬彬有礼|史无前例|盛气凌人|好自为之)(不拘小节|风流潇洒|无病呻吟|言谈举止|壮志凌云|关门闭户|温文尔雅)
由实力派演员刘威饰演的清华第三任校长蒋南翔，是我国著名的青年运动家和教育家，他跟清华终身校长梅贻琦一样，都是由清华人自己培养出来的校长。历史上的蒋南翔是著名的“一二九”学生救亡运动的领导人之一，他在清华校长之位14年期间，不但很好的继承了清华建校之初的优秀传统与理念，而且更加的extra0，他把清华的教师队伍扩大了将近5倍，将清华本科人数破万，为新中国培养了大量的有用人才。在《天行健》中饰演蒋南翔的刘威是观众所熟悉的著名实力派演员，早在1987年刘威就在《关东大侠》中饰演豪爽仗义的关云天一角而获得了金鸡奖最佳男主角的提名，后来更是因在《唐明皇》中精湛的表演而一举夺得金鹰奖最佳男演员奖。此次《天行健》选定刘威来出演正是看中了他extra1的表演方式和对人物深入内心的刻画。至此，《天行健》中涉及的三位清华校长的人选都已经曝光，extra2的第一任校长赵文?、稳重坚毅的第二任校长孙逊、亲切务实的第三任校长刘威，再加上梁思成、林徽因、朱自清、闻一多等一批“大师”的加盟，相信作为清华百年校庆重点项目之一的《天行健》一定会带领观众重温那段不能抹去的历史。 发扬光大,平易近人,温文尔雅
请从下列括号中分别选择合适的成语填入空缺处：(深恶痛绝|人人自危|恨入骨髓|不胜枚举|嗤之以鼻|走马看花|不屑一顾)
另据了解，北京一个对垃圾短信extra0的老人，利用该软件总共呼死了近2000个号码。20分钟呼上万号码记者昨天在百度里输入“呼死你软件”，出现了7000多个相关网页，随机登录几个网站，发现软件均需花钱购买，价格从200元至500元不等。 深恶痛绝


In [9]:
print(train_input.keys(), train_output.keys())

dict_keys(['input_ids', 'attention_mask']) dict_keys(['input_ids', 'attention_mask'])


In [10]:
print(train_input['input_ids'][0], '\n', train_input['attention_mask'][0])

[6435, 794, 678, 1154, 2886, 1384, 704, 1146, 1166, 6848, 2885, 1394, 6844, 4638, 2768, 6427, 1856, 1057, 4958, 5375, 1905, 8038, 113, 2692, 3698, 7599, 1355, 170, 6125, 6448, 2350, 6379, 170, 782, 2798, 6777, 1139, 170, 671, 5549, 4685, 837, 170, 1400, 5326, 3300, 782, 170, 1355, 2813, 1045, 1920, 170, 5581, 7599, 6117, 7433, 114, 113, 2398, 3211, 6818, 782, 170, 5862, 5862, 1920, 3175, 170, 1061, 803, 6814, 3862, 170, 2509, 2509, 3300, 4851, 170, 1380, 3187, 1184, 891, 170, 4670, 3698, 1119, 782, 170, 1962, 5632, 711, 722, 114, 113, 679, 2872, 2207, 5688, 170, 7599, 3837, 4045, 3818, 170, 3187, 4567, 1460, 1412, 170, 6241, 6448, 715, 3632, 170, 1896, 2562, 1119, 756, 170, 1068, 7305, 7308, 2787, 170, 3946, 3152, 2209, 7414, 114, 4507, 2141, 1213, 3836, 4028, 1447, 1155, 2014, 7652, 4028, 4638, 3926, 1290, 5018, 676, 818, 3413, 7270, 5882, 1298, 5425, 8024, 3221, 2769, 1744, 5865, 1399, 4638, 7471, 2399, 6817, 1220, 2157, 1469, 3136, 5509, 2157, 8024, 800, 6656, 3926, 1290, 5303, 6716

In [11]:
print(train_output['input_ids'][0], '\n', train_output['attention_mask'][0])

[1355, 2813, 1045, 1920, 117, 2398, 3211, 6818, 782, 117, 3946, 3152, 2209, 7414] 
 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


In [12]:
class IdiomDataset(Dataset):
    def __init__(self, inputs, outputs):
        self.inputs = inputs
        self.outputs = outputs

    def __len__(self):
        return len(self.inputs["input_ids"])

    def __getitem__(self, idx):
        input_ids = self.inputs['input_ids'][idx]
        attention_mask = self.inputs['attention_mask'][idx]

        target_ids = self.outputs['input_ids'][idx]
        target_attention_mask = self.outputs['attention_mask'][idx]
        return {"input_ids": input_ids, "attention_mask":attention_mask, "output_ids":target_ids}


def collate_fn(batch):
    batch_input = [torch.LongTensor(example['input_ids']) for example in batch]
    batch_output = [torch.LongTensor(example['output_ids']) for example in batch]
    batch_mask = [torch.LongTensor(example['attention_mask']) for example in batch]

    padded_batch_input_ids = pad_sequence(batch_input, batch_first=True, padding_value=tokenizer.pad_token_id)
    padded_batch_label = pad_sequence(batch_output, batch_first=True, padding_value=tokenizer.pad_token_id)
    padded_batch_att_mask = pad_sequence(batch_mask, batch_first=True, padding_value=0)

    return {"input_ids": padded_batch_input_ids, "attention_mask": padded_batch_att_mask, "labels": padded_batch_label}

def to_device(data, device):
    new_data = {}
    for k in data:
        # k = k.to(device)
        new_data[k] = data[k].to(device)
    return new_data

In [13]:
train_dataset = IdiomDataset(train_input, train_output)
train_loader = DataLoader(train_dataset, batch_size=8, collate_fn=collate_fn, shuffle=True)

dev_dataset = IdiomDataset(dev_input, dev_output)
dev_loader = DataLoader(dev_dataset, batch_size=8, collate_fn=collate_fn, shuffle=False)


## Training

In [14]:
@torch.no_grad()
def evaluate(model:nn.Module, eval_loader:DataLoader):
    eval_loss = 0.0
    correct = 0
    total = 0
    model.eval()
    print("eval_loader len:", len(eval_loader))
    for batch in eval_loader:
        batch = to_device(batch, device)
        output = model(**batch)
        loss = output.loss
        eval_loss += loss.item()
        pred = output.logits.argmax(-1)
        label = batch["labels"]
        correct += torch.where(label!=0, pred==label, 0).sum().item()
        total += torch.sum(label!=0).item()

    eval_acc = correct / total
    eval_loss = eval_loss / len(eval_loader) 
    print(total, correct)
    return eval_acc, eval_loss

In [15]:
epoches = 5       
optimizer = optim.Adam(model.parameters(), lr=5e-5)
model.to(device)

model.train()
for epoch in range(epoches):
    epoch_loss = 0.0
    log_loss = 0.0
    for idx, batch in enumerate(train_loader):
        model.zero_grad()
        batch = to_device(batch, device)
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
        log_loss += loss.item()

        wandb.log({'batch':idx, 'train_loss': loss.item()})
        wandb.log({'batch':idx, 'accumulated_train_loss_in_this_100_batches': log_loss})

        if idx % 100 == 0:
            print(f"Train Step: {idx} Loss: {log_loss / 100}")
            log_loss = 0.0
    print(f"Epoch: {epoch+1} Loss is: {epoch_loss}")
    eval_acc, eval_loss = evaluate(model, dev_loader)
    print(f"Epoch {epoch+1} Eval Acc: {eval_acc}; Eval Loss: {eval_loss}")

Train Step: 0 Loss: 0.09620771408081055
Train Step: 100 Loss: 8.64742648601532
Train Step: 200 Loss: 7.374978351593017
Train Step: 300 Loss: 6.931719088554383
Train Step: 400 Loss: 6.691545848846435
Train Step: 500 Loss: 6.328795137405396
Train Step: 600 Loss: 6.13201672077179
Train Step: 700 Loss: 6.0014139890670775
Train Step: 800 Loss: 5.9826555347442625
Train Step: 900 Loss: 5.679838309288025
Train Step: 1000 Loss: 5.499179644584656
Train Step: 1100 Loss: 5.40323822259903
Train Step: 1200 Loss: 5.168613016605377
Train Step: 1300 Loss: 5.1516958379745486
Train Step: 1400 Loss: 5.144575681686401
Train Step: 1500 Loss: 4.87628068447113
Train Step: 1600 Loss: 4.77899142742157
Train Step: 1700 Loss: 4.652412295341492
Train Step: 1800 Loss: 4.667818729877472
Train Step: 1900 Loss: 4.345775380134582
Train Step: 2000 Loss: 4.290574998855591
Train Step: 2100 Loss: 4.329298982620239
Train Step: 2200 Loss: 4.215694088935852
Train Step: 2300 Loss: 4.101018986701965
Train Step: 2400 Loss: 4.241

In [16]:
torch.save(model.state_dict(), path+"T5-small_model_5epoch.pt")

In [17]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 54,925,824 trainable parameters


## Evaluation

In [18]:
@torch.no_grad()
def fill_idiom(model, loader):

    all_preds = []
    all_labels = []
    model.eval()
    for batch in loader:
        batch = to_device(batch, device)
        input_ids = batch["input_ids"]
        attention_mask = batch["attention_mask"]
        labels = batch["labels"]
        outputs = model.generate(input_ids=input_ids, 
                                 attention_mask=attention_mask, 
                                 return_dict_in_generate=True, 
                                 pad_token_id=tokenizer.pad_token_id, 
                                 max_length=512, 
                                 top_k=15)
        truncated_outputs = []

        decode_texts = tokenizer.batch_decode([l[l != 0] for l in outputs['sequences']])
        gold_texts = tokenizer.batch_decode([l[l != 0] for l in labels])
        # print(decode_texts, gold_texts)
        for gold, decode in zip(gold_texts, decode_texts):
            l = set(gold.replace(' ', '').replace('[CLS]', '').split(','))
            p = set(decode.replace(' ', '').replace('[CLS]', '').split(','))
            # print(l, p)
            all_labels.append(l)
            all_preds.append(p)
        # print(decode_texts)
        # print(gold_texts)
        # break
    
    return all_preds, all_labels

def f1_score(sys, gold):
    tp = 0
    total = 0
    pos = 0
    for s, g in zip(sys, gold):
        total += len(g)
        pos += len(s)
        tp += len(g & s)
    precision = tp / pos if pos != 0 else 0
    recall = tp / total if total != 0 else 0
    f1 = (2 * precision * recall) / (precision + recall) if (precision + recall) != 0 else 0
    return precision, recall, f1, tp

In [19]:
sys, gold = fill_idiom(model, dev_loader)
p, r, f1, tp = f1_score(sys, gold)

In [20]:
total = 0
for s, g in zip(sys, gold):
    total += len(g)

In [21]:
print(f"Accurate amount for Validation set is {tp} out of {total}")
print(f"Accuracy for Validation set is {tp/total}")
print(f"F1 score for Validation set is {f1}")

Accurate amount for Validation set is 1496 out of 3668
Accuracy for Validation set is 0.4078516902944384
F1 score for Validation set is 0.4100315198026586


In [22]:
sys[:10]

[{'恨入骨髓'},
 {'乱七八糟'},
 {'应付裕如'},
 {'独一无二'},
 {'一语道破', '不胜其烦', '评头题足'},
 {'一模一样'},
 {'罪魁祸首'},
 {'聪明才智'},
 {'百年不遇'},
 {'酸甜苦辣'}]

In [23]:
gold[:10]

[{'深恶痛绝'},
 {'杂乱无章'},
 {'磨刀霍霍'},
 {'独一无二'},
 {'一语道破', '不厌其烦', '品头题足'},
 {'大同小异'},
 {'罪魁祸首'},
 {'聪明才智'},
 {'千载难逢'},
 {'酸甜苦辣'}]

In [24]:
sys[-10:]

[{'无可置疑'},
 {'入不敷出'},
 {'近在咫尺'},
 {'虎视眈眈'},
 {'循序渐进', '按部就班'},
 {'鸡零狗碎'},
 {'轻举妄动'},
 {'世态炎凉'},
 {'因地制宜'},
 {'有条有理'}]

In [25]:
gold[-10:]

[{'实至名归'},
 {'入不敷出'},
 {'大惑不解'},
 {'盛气凌人'},
 {'例行公事', '新陈代谢'},
 {'无关痛痒'},
 {'按兵不动'},
 {'世态炎凉'},
 {'因地制宜'},
 {'神气活现'}]

In [35]:
with open ('T5-small_outputs.txt', 'w', encoding='utf-8') as t5:
    for s in sys:
        line = ','.join(s)
        t5.write(str(line)+'\n')