# T5 Text2Text Distractor Generation Inference (sentence-level)
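Trained with t5-base. <br>
Each passage is split so that every question keeps the 3 sentences before and after its blank, and the model is trained on that context to generate a distractor; distractors label: ```<s>d1</s>```
The dataset is the filtered data, with 69009/9696/10233 examples (train/valid/test). <br>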
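
The preprocessing that produced the sentence-level context is not part of this notebook. As a rough illustration of the idea, the sketch below keeps the sentence containing the blank plus up to 3 sentences on each side. The helper name `sentence_window`, the regex-based sentence splitter, and the `_` blank marker are assumptions made for this example, not the code used to build CLOTH-F.

```python
import re


def sentence_window(passage: str, blank: str = "_", k: int = 3) -> str:
    """Return the sentence holding the first blank plus up to k sentences on each side."""
    # Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", passage.strip()) if s]
    # Index of the first sentence that contains the blank marker.
    idx = next((i for i, s in enumerate(sentences) if blank in s), None)
    if idx is None:
        # No blank found: return the passage unchanged.
        return passage.strip()
    start = max(0, idx - k)
    end = min(len(sentences), idx + k + 1)
    return " ".join(sentences[start:end])


passage = "A. B. C. D. E _ here. F. G. H. I."
print(sentence_window(passage))
```

A real pipeline would likely need a proper sentence tokenizer, since the regex above misfires on abbreviations and on sentences that lack a space after the period.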

### GPU

In [105]:
!nvidia-smi

Tue Jul  4 06:52:41 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA GeForce RTX 4090         On | 00000000:01:00.0 Off |                  Off |
|  0%   30C    P8               20W / 450W|  24205MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090         On | 00000000:04:00.0 Off |  

### Weights & Biases (Assisting Metrics, Optional)

In [106]:
!pip install wandb
!wandb login
project_name = "clean T5 three DG training sent ans to triple distractor , split"
import os

os.environ["WANDB_PROJECT"] = project_name

[34m[1mwandb[0m: Currently logged in as: [33mreily[0m ([33mblurr[0m). Use [1m`wandb login --relogin`[0m to force relogin


### Imports and GPU device setup

In [107]:
from transformers import T5Tokenizer, T5ForConditionalGeneration
import torch
num_gpus = torch.cuda.device_count() 
print(f'Detect {num_gpus} GPUS')
device = torch.device('cuda' if torch.cuda.is_available() else "cpu")

Detect 2 GPUS


### Loading the dataset

In [108]:
import json
from pprint import pprint
def read_data(item):
    path = '../data/CLOTH-F/clean_cloth-f_{}.json'.format(item)
    with open(path) as f:
        data = json.load(f)
    return data

In [109]:
train = read_data('train')
valid = read_data('valid')
test = read_data('test')

In [110]:
len(train), len(valid), len(test)

(69009, 9696, 10233)

## The loaded data is already preprocessed

In [111]:
pprint(train[0])

{'answer': 'purposes',
 'distractors': ['scores', 'programs', 'methods'],
 'index': 0,
 'sentence': 'University students are different. However, a closer look at '
             'their _ of learning at university will enable us to classify '
             'them roughly into three groups: those who learn out of instinct, '
             'those who learn for a promising future, and those who learn with '
             'no definite objects.Firstly, there are a handful of students who '
             'learn day and night simply because they like to learn. '}


### Prepare data

In [112]:
# Build the model input as sentence + '</s>' + answer
def make_model_input(data):
    list_distractors = []
    model_input_sentences = []
    answers = []
    sentences = []
    labels = []
    d_num = 3
    for d in data:
        sentence = d['sentence']
        distractors = d['distractors']
        answer = d['answer']
        model_input_sentence = sentence + '</s>' + answer
        str_distractors = ', '.join(distractors)
        print(str_distractors)
        
        sentences.append(sentence) # the original sentence
        labels.append(str_distractors) # the distractor string the model should generate
        answers.append(answer) # the answer to this question
        list_distractors.append(distractors) # the three distractors for this question, possibly needed for scoring
        model_input_sentences.append(model_input_sentence) # the input fed to the model
        
    
    return sentences, list_distractors, answers, model_input_sentences, labels

train_sentences, train_distractors, train_answers, train_sent, train_labels = make_model_input(train)
valid_sentences, valid_distractors, valid_answers, valid_sent, valid_labels = make_model_input(valid)
test_sentences, test_distractors, test_answers, test_sent, test_labels = make_model_input(test)

scores, programs, methods
dreaming, directing, leading
partly, mostly, sharply
take, drag, keep
countless, harmless, careless
Besides, Therefore, Thus
exploration, direction, attention
something, anything, everything
bend, predict, seek
stop, chose, come
if, when, thought
big, latest, useless
ever, once, just
anger, curiosity, gratitude
hid, praised, forgave
need, must, might
imagination, room, discussion
confused, annoyed, afraid
thought, laugh, guess
throw, change, return
done, watched, regretted
abandoned, disappointed, scolded
better, higher, older
before, into, against
painful, frightened, relaxed
hurt, found, broken
stopped, stood, placed
as, for, alike
happiest, biggest, weakest
seeing, hearing, staring
whom, how, where
night, evening, afternoon
joy, interest, satisfaction
gave, covered, put
tall, strong, weak
teacher, driver, professor
suggestion, friend, program
peaceful, powerful, hopeless
idea, suggestion, plan
left, found, joined
refused, hired, admitted
cinema, bookshop, s

In [113]:
len(train_sentences), len(train_distractors), len(train_answers), len(train_sent), len(train_labels)

(69009, 69009, 69009, 69009, 69009)

In [114]:
len(valid_sentences), len(valid_distractors), len(valid_answers), len(valid_sent), len(valid_labels)

(9696, 9696, 9696, 9696, 9696)

In [115]:
len(test_sentences), len(test_distractors), len(test_answers), len(test_sent), len(test_labels)

(10233, 10233, 10233, 10233, 10233)

In [None]:
for i in range(6):
    print(train_sentences[i])
    print(train_distractors[i])
    print(train_answers[i])
    print(train_sent[i])
    print(train_labels[i])
    print("*"*50)

In [117]:
tokenizer = T5Tokenizer.from_pretrained("t5-base")

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


In [118]:
train_encodings = tokenizer(train_sent, truncation="only_first", padding=True, text_target=train_labels, return_tensors="pt").to(device)
valid_encodings = tokenizer(valid_sent, truncation="only_first", padding=True, text_target=valid_labels, return_tensors="pt").to(device)
test_encodings = tokenizer(test_sent, truncation="only_first", padding=True, text_target=test_labels, return_tensors="pt").to(device)

In [119]:
train_encodings.keys()

dict_keys(['input_ids', 'attention_mask', 'labels'])

In [120]:
print(train_encodings.input_ids[0])

tensor([  636,   481,    33,   315,     5,   611,     6,     3,     9,  4645,
          320,    44,    70,     3,   834,    13,  1036,    44,  3819,    56,
         2956,   178,    12,   853,  4921,   135, 10209,   139,   386,  1637,
           10,   273,   113,   669,    91,    13, 16563,     6,   273,   113,
          669,    21,     3,     9, 12894,   647,     6,    11,   273,   113,
          669,    28,   150,     3, 14339,  4820,     5, 23559,     6,   132,
           33,     3,     9, 12114,    13,   481,   113,   669,   239,    11,
          706,   914,   250,    79,   114,    12,   669,     5,     1,  3659,
            1,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0, 

In [121]:
len(train_encodings.input_ids[0])

512
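
Every row comes out at length 512 because `padding=True` pads to the longest sequence in the call, and with truncation the longest rows are cut at the tokenizer limit. The first example has only 81 real tokens, so most of the row is padding. One common alternative is to tokenize without padding and pad each batch to its own longest row, which is what `DataCollatorForSeq2Seq` does when it receives unpadded features. The plain-Python sketch below only illustrates that idea; `pad_to_longest` is a hypothetical helper, not part of the notebook.

```python
from typing import List


def pad_to_longest(batch: List[List[int]], pad_token_id: int = 0) -> List[List[int]]:
    """Right-pad every sequence in one batch to the longest sequence in that batch."""
    longest = max(len(seq) for seq in batch)
    return [seq + [pad_token_id] * (longest - len(seq)) for seq in batch]


# Two short token-id sequences standing in for one batch (illustrative values).
batch = [[636, 481, 1], [3, 9, 4645, 320, 1]]
padded = pad_to_longest(batch)
print(padded)
```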

In [122]:
tokenizer.decode(train_encodings.input_ids[0])

'University students are different. However, a closer look at their _ of learning at university will enable us to classify them roughly into three groups: those who learn out of instinct, those who learn for a promising future, and those who learn with no definite objects.Firstly, there are a handful of students who learn day and night simply because they like to learn.</s> purposes</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><

In [123]:
len(train_encodings.input_ids)

69009

In [125]:
train_encodings.keys()

dict_keys(['input_ids', 'attention_mask', 'labels'])

In [126]:
print(train_encodings.labels[0])

tensor([7586,    6, 1356,    6, 2254,    1,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0], device='cuda:0')


In [127]:
for i in range(6):
    print(tokenizer.decode(train_encodings.labels[i]))

scores, programs, methods</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>
dreaming, directing, leading</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>
partly, mostly, sharply</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>
take, drag, keep</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>
countless, harmless, careless</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>
Besides, Therefore, Thus</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>
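
The decoded labels still contain `<pad>` tokens (id 0). Hugging Face seq2seq models ignore label positions equal to -100 when computing the loss, and `compute_metrics` later in this notebook already maps -100 back to the pad id before decoding. Because the labels here were padded by the tokenizer, the pad positions appear to be included in the training loss. A minimal sketch of masking them, assuming pad id 0 as shown in the tensors above; `mask_padding` is a hypothetical helper, and its effect on this model's scores has not been measured here.

```python
from typing import List

IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss in Hugging Face models


def mask_padding(label_ids: List[List[int]], pad_token_id: int = 0) -> List[List[int]]:
    """Replace pad token ids in the labels with IGNORE_INDEX so they do not contribute to the loss."""
    return [
        [IGNORE_INDEX if token == pad_token_id else token for token in row]
        for row in label_ids
    ]


# First training label from above, shortened to two pad positions.
masked = mask_padding([[7586, 6, 1356, 6, 2254, 1, 0, 0]])
print(masked)
```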


In [128]:
class ClothDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        return {key: val[idx].cpu().clone().detach() for key, val in self.encodings.items()}

    def __len__(self):
        return len(self.encodings.input_ids)

train_dataset = ClothDataset(train_encodings)
valid_dataset = ClothDataset(valid_encodings)
test_dataset = ClothDataset(test_encodings)

In [129]:
len(train_dataset), len(valid_dataset), len(test_dataset)

(69009, 9696, 10233)

In [130]:
type(train_dataset)

__main__.ClothDataset

In [131]:
train_dataset[0]["input_ids"]

tensor([  636,   481,    33,   315,     5,   611,     6,     3,     9,  4645,
          320,    44,    70,     3,   834,    13,  1036,    44,  3819,    56,
         2956,   178,    12,   853,  4921,   135, 10209,   139,   386,  1637,
           10,   273,   113,   669,    91,    13, 16563,     6,   273,   113,
          669,    21,     3,     9, 12894,   647,     6,    11,   273,   113,
          669,    28,   150,     3, 14339,  4820,     5, 23559,     6,   132,
           33,     3,     9, 12114,    13,   481,   113,   669,   239,    11,
          706,   914,   250,    79,   114,    12,   669,     5,     1,  3659,
            1,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0, 

In [132]:
train_dataset[0]

{'input_ids': tensor([  636,   481,    33,   315,     5,   611,     6,     3,     9,  4645,
           320,    44,    70,     3,   834,    13,  1036,    44,  3819,    56,
          2956,   178,    12,   853,  4921,   135, 10209,   139,   386,  1637,
            10,   273,   113,   669,    91,    13, 16563,     6,   273,   113,
           669,    21,     3,     9, 12894,   647,     6,    11,   273,   113,
           669,    28,   150,     3, 14339,  4820,     5, 23559,     6,   132,
            33,     3,     9, 12114,    13,   481,   113,   669,   239,    11,
           706,   914,   250,    79,   114,    12,   669,     5,     1,  3659,
             1,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,   

### Fine-tuning

In [133]:
from transformers import T5ForConditionalGeneration, Seq2SeqTrainingArguments, Seq2SeqTrainer
model = T5ForConditionalGeneration.from_pretrained("t5-base")

In [134]:
batch_size = 16
args = Seq2SeqTrainingArguments(
    output_dir = "results",
    save_strategy = "epoch",
    evaluation_strategy = "epoch",
    learning_rate=1e-4,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    load_best_model_at_end=True,
    metric_for_best_model="P@1",
    num_train_epochs=30,
    predict_with_generate=True,
    eval_accumulation_steps = 1,
    report_to="wandb" if os.getenv("WANDB_PROJECT") else "none"
)
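
As a sanity check on the training length implied by these arguments, the sketch below estimates the number of optimizer steps. It assumes both detected GPUs are used, so the effective batch size is 16 × 2 = 32, and that the last partial batch is kept. `optimizer_steps` is a hypothetical helper written for this example.

```python
import math


def optimizer_steps(n_examples: int, per_device_batch: int, n_devices: int, epochs: int):
    """Return (steps per epoch, total steps) when the last partial batch is kept."""
    effective_batch = per_device_batch * n_devices
    steps_per_epoch = math.ceil(n_examples / effective_batch)
    return steps_per_epoch, steps_per_epoch * epochs


steps_per_epoch, total_steps = optimizer_steps(69009, 16, 2, 30)
print(steps_per_epoch, total_steps)
```

Under those assumptions this gives 2157 steps per epoch and 64710 steps over 30 epochs; on a single GPU the counts would roughly double.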

In [135]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

## compute_metrics: the checkpoint with the highest P@1 is kept as the best model

In [136]:
def get_sent_cnt(labels):
    sent_cnt_dic = {len(train_labels): train_sent, len(valid_labels): valid_sent, len(test_labels): test_sent}
    return sent_cnt_dic[len(labels)]

In [137]:
import numpy as np
def compute_metrics(p):
    predictions, labels = p
    
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Running sums of the per-question metrics
    p1 = 0
    p3 = 0
    r3 = 0
    f3 = 0

    n_question = len(decoded_labels)
    for k in range(n_question):
        pred = decoded_preds[k]
        label = decoded_labels[k]
        pred_list = pred.split(', ')
        label_list = label.split(', ')


        act_set = set(label_list)
        pred1_set = set(pred_list[:1])
        pred3_set = set(pred_list[:3])

        p_1 = len(act_set & pred1_set) / float(1)
        p_3 = len(act_set & pred3_set) / float(3)
        r_3 = len(act_set & pred3_set) / float(len(act_set))

        if p_3 == 0 and r_3 == 0:
            f1_3 = 0
        else:
            f1_3 = 2 * (p_3 * r_3 / (p_3 + r_3))

        p1+=p_1
        p3+=p_3
        r3+=r_3
        f3+=f1_3

    p1 = p1 / n_question
    p3 = p3 / n_question
    r3 = r3 / n_question
    f3 = f3 / n_question

    result = {'P@1': p1,
              'P@3': p3,
              'R@3': r3,
              'F1@3': f3}
    
    return result
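
To make the metrics concrete, here is the same per-question arithmetic as in `compute_metrics`, applied to a single prediction string. The function name `score_one` and the example strings are made up for illustration.

```python
def score_one(pred: str, label: str) -> dict:
    """Compute P@1, P@3, R@3 and F1@3 for one question, mirroring compute_metrics."""
    pred_list = pred.split(", ")
    label_set = set(label.split(", "))

    hits_at_1 = len(label_set & set(pred_list[:1]))
    hits_at_3 = len(label_set & set(pred_list[:3]))

    p1 = hits_at_1 / 1
    p3 = hits_at_3 / 3
    r3 = hits_at_3 / len(label_set)
    f1 = 0.0 if p3 + r3 == 0 else 2 * p3 * r3 / (p3 + r3)
    return {"P@1": p1, "P@3": p3, "R@3": r3, "F1@3": f1}


# Two of the three generated distractors match the gold set, including the first one.
example = score_one("scores, plans, methods", "scores, programs, methods")
print(example)
```

Here P@1 is 1.0 because the first generated word is a gold distractor, and P@3, R@3 and F1@3 are all 2/3. Note that P@3 always divides by 3, so a generation with fewer than three words can never reach a P@3 of 1.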

In [138]:
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [None]:
trainer.train()

In [None]:
trainer.evaluate()

In [141]:
trainer.save_model('../model/t5_vanilla-DG/t5-base-clean-sent-ans-tripleD-,split')
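
At inference time the model emits one string such as `scores, programs, methods`. A possible post-processing step, sketched below, splits that string into a list, drops duplicates and anything equal to the answer, and keeps at most three candidates. `parse_distractors` is a hypothetical helper. Unlike `compute_metrics`, which splits on `', '` exactly, it splits on commas and strips whitespace.

```python
from typing import List


def parse_distractors(generated: str, answer: str, k: int = 3) -> List[str]:
    """Turn a generated 'd1, d2, d3' string into at most k unique distractors."""
    seen = set()
    result = []
    for candidate in generated.split(","):
        candidate = candidate.strip()
        key = candidate.lower()
        # Skip empty pieces, repeats, and candidates identical to the answer.
        if not candidate or key in seen or key == answer.lower():
            continue
        seen.add(key)
        result.append(candidate)
    return result[:k]


cleaned = parse_distractors("scores, programs, scores, purposes, methods", answer="purposes")
print(cleaned)
```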