## Check the GPU

In [1]:
!nvidia-smi

Sat Jan  7 12:12:50 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA TITAN RTX    Off  | 00000000:01:00.0 Off |                  N/A |
| 39%   53C    P0    67W / 280W |      0MiB / 24220MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA TITAN RTX    Off  | 00000000:02:00.0 Off |                  N/A |
| 36%   41C    P0    58W / 280W |      0MiB / 24220MiB |      0%      Default |
|       

## Basic imports and environment setup

In [2]:
from transformers import BartTokenizer, BartForConditionalGeneration
import torch
import json
import os, sys

2023-01-07 12:12:51.514574: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-01-07 12:12:51.627972: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-01-07 12:12:52.116969: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvrtc.so.11.0: cannot open shared object file: No such file or directory
2023-01-07 12:12:52.117128: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvrtc.so.11.0: cannot open shared object file: No such file or direc

## Load the dataset

In [3]:
def read_data(item):
    path = '/user_data/MedQA_DG/data/data_clean/questions/US/{}.json'.format(item)
    with open(path) as f:
        data = json.load(f)
    return data

In [4]:
train = read_data('train')
test = read_data('test')

In [5]:
len(train), len(test)

(10178, 1273)

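### Data format

- question : <String> the question text
- answer : <String> the text of the correct answer
- options : <Dict> the answer options, keyed by option ID
  - key : <String> one of the five option IDs A, B, C, D, E; one of them is the correct answer
  - value : <String> the option text
- meta_info : <String> metadata tag (e.g. 'step2&3')
- answer_idx : <String> ID of the correct option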
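As a quick sanity check on this layout, here is a minimal sketch that validates a single record. The helper name `check_record` is hypothetical and not part of the original notebook, and the sample record is an abbreviated copy of the first training example.

```python
def check_record(record):
    """Return True if a MedQA record has the expected keys and a consistent answer."""
    required = {"question", "answer", "options", "answer_idx"}
    if not required.issubset(record):
        return False
    # The answer text must match the option that answer_idx points to.
    return record["options"].get(record["answer_idx"]) == record["answer"]


# Abbreviated sample record (question text shortened for illustration).
sample = {
    "question": "Which of the following is the best treatment for this patient?",
    "answer": "Nitrofurantoin",
    "options": {"A": "Ampicillin", "B": "Ceftriaxone", "C": "Ciprofloxacin",
                "D": "Doxycycline", "E": "Nitrofurantoin"},
    "meta_info": "step2&3",
    "answer_idx": "E",
}
print(check_record(sample))
```

Running a check like this over the whole split before tokenization can surface records whose `answer` and `answer_idx` disagree.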

In [6]:
train[0]

{'question': 'A 23-year-old pregnant woman at 22 weeks gestation presents with burning upon urination. She states it started 1 day ago and has been worsening despite drinking more water and taking cranberry extract. She otherwise feels well and is followed by a doctor for her pregnancy. Her temperature is 97.7°F (36.5°C), blood pressure is 122/77 mmHg, pulse is 80/min, respirations are 19/min, and oxygen saturation is 98% on room air. Physical exam is notable for an absence of costovertebral angle tenderness and a gravid uterus. Which of the following is the best treatment for this patient?',
 'answer': 'Nitrofurantoin',
 'options': {'A': 'Ampicillin',
  'B': 'Ceftriaxone',
  'C': 'Ciprofloxacin',
  'D': 'Doxycycline',
  'E': 'Nitrofurantoin'},
 'meta_info': 'step2&3',
 'answer_idx': 'E'}

### Prepare Data

In [7]:
from sklearn.model_selection import train_test_split

train, valid = train_test_split(train, random_state=777, train_size=0.9)
len(train), len(valid)

(9160, 1018)
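The split sizes can be reproduced by hand. The sketch below assumes the training share is rounded down and the remainder goes to validation, which is consistent with the output above; it does not call scikit-learn.

```python
import math

n_total = 10178        # len(train) before splitting
train_fraction = 0.9   # the train_size passed to train_test_split

n_train = math.floor(train_fraction * n_total)  # fractional size rounded down
n_valid = n_total - n_train                     # remainder goes to validation
print(n_train, n_valid)
```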

In [8]:
tokenizer = BartTokenizer.from_pretrained('facebook/bart-base')

### Add a special token to separate distractors | EOD = End of Distractor

In [9]:
eod_toks = tokenizer.add_tokens(['[EOD]'], special_tokens=True) ##This line is updated

In [10]:
def processData(data):
    questions = []
    labels = []
    answers = []
    for d in data:
        question = d['question']
        options = d['options']
        answer_idx = d['answer_idx']

        answer = d['answer']

        distractors = []
        for value in options.values():
            if value != answer:
                distractors.append(value)

        labels.append('[EOD]'.join(distractors))
        answers.append(answer)
        questions.append(question)
    
    return questions, answers, labels

In [11]:
train_question, train_answer, train_label = processData(train)
valid_question, valid_answer, valid_label = processData(valid)
test_question, test_answer, test_label = processData(test)

In [12]:
train_label[0]

'Ethics committee consultation[EOD]Cerebral angiography[EOD]Court order for further management[EOD]Repeat CT scan of the head'
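The label string is just the distractors joined with `[EOD]`, so it can be split back into a list with the same separator. A minimal sketch, using hypothetical helper names `join_distractors` and `split_distractors` that are not part of the notebook:

```python
EOD = "[EOD]"


def join_distractors(options, answer):
    """Join every option that is not the correct answer with the [EOD] separator."""
    return EOD.join(value for value in options.values() if value != answer)


def split_distractors(label):
    """Recover the individual distractors from a joined label string."""
    return [part.strip() for part in label.split(EOD) if part.strip()]


options = {"A": "Ampicillin", "B": "Ceftriaxone", "C": "Ciprofloxacin",
           "D": "Doxycycline", "E": "Nitrofurantoin"}
label = join_distractors(options, "Nitrofurantoin")
print(label)
print(split_distractors(label))
```

Stripping whitespace in `split_distractors` also handles decoded text in which the separator is surrounded by spaces.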

In [13]:
for idx in range(2):
    print("\nQuestion : ",train_question[idx],'\n')
    print("Answer : ",train_answer[idx],'\n')
    print("Distractors : ",train_label[idx],'\n')
    print('*'*15)


Question :  Four days after being hospitalized, intubated, and mechanically ventilated, a 30-year-old man has no cough response during tracheal suctioning. He was involved in a motor vehicle collision and was obtunded on arrival in the emergency department. The ventilator is at a FiO2 of 100%, tidal volume is 920 mL, and positive end-expiratory pressure is 5 cm H2O. He is currently receiving vasopressors. His vital signs are within normal limits. The pupils are dilated and nonreactive to light. Corneal, gag, and oculovestibular reflexes are absent. There is no facial or upper extremity response to painful stimuli; the lower extremities show a triple flexion response to painful stimuli. Serum concentrations of electrolytes, urea, creatinine, and glucose are within the reference range. Arterial blood gas shows:
pH 7.45
pCO2 41 mm Hg
pO2 99 mm Hg
O2 saturation 99%
Two days ago, a CT scan of the head showed a left intracerebral hemorrhage with mass effect. The apnea test is positive. There are 

## Data Tokenization

In [14]:
train_encodings = tokenizer(train_question, train_answer, truncation=True, padding=True)
valid_encodings = tokenizer(valid_question, valid_answer, truncation=True, padding=True)
test_encodings = tokenizer(test_question, test_answer, truncation=True, padding=True)

In [15]:
train_encodings.keys()

dict_keys(['input_ids', 'attention_mask'])

In [16]:
def add_labels(encodings, distractors):
    
    distractors_encodings = tokenizer(distractors, padding=True)
    labels = []
    for i in range(len(distractors_encodings.input_ids)):
        labels.append(distractors_encodings.input_ids[i])
    
    encodings["labels"] = labels
    return encodings

In [17]:
train_encodings = add_labels(train_encodings, train_label)
valid_encodings = add_labels(valid_encodings, valid_label)
test_encodings = add_labels(test_encodings, test_label)

In [18]:
train_label[0]

'Ethics committee consultation[EOD]Cerebral angiography[EOD]Court order for further management[EOD]Repeat CT scan of the head'

### token_id = 50265  --> token = EOD

In [19]:
print(train_encodings.labels[0])

[0, 42301, 2857, 1540, 9434, 50265, 347, 2816, 44283, 5667, 118, 10486, 50265, 37349, 645, 13, 617, 1052, 50265, 45764, 12464, 14194, 9, 5, 471, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


In [20]:
tokenizer.decode(train_encodings.labels[0])

'<s>Ethics committee consultation [EOD] Cerebral angiography [EOD] Court order for further management [EOD] Repeat CT scan of the head</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>'
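The decoded row shows that the labels are padded with `<pad>` (id 1). A common convention, which `compute_metrics` later in this notebook also anticipates, is to replace label padding with -100 so that padded positions are ignored by the cross-entropy loss. The sketch below illustrates that convention on plain lists; `mask_padding` is a hypothetical helper, and the notebook itself does not apply this step.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def mask_padding(label_ids, pad_token_id, ignore_index=IGNORE_INDEX):
    """Replace padding ids with ignore_index so they do not contribute to the loss."""
    return [
        [tok if tok != pad_token_id else ignore_index for tok in row]
        for row in label_ids
    ]


# Shortened version of the label row printed above (pad id 1, [EOD] id 50265).
example = [[0, 42301, 2857, 1540, 9434, 50265, 471, 2, 1, 1, 1]]
print(mask_padding(example, pad_token_id=1))
```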

## Load Data into Dataset

In [21]:
class MedQADataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}

    def __len__(self):
        return len(self.encodings.input_ids)

train_dataset = MedQADataset(train_encodings)
valid_dataset = MedQADataset(valid_encodings)
test_dataset = MedQADataset(test_encodings)

In [22]:
len(train_dataset), len(valid_dataset), len(test_dataset)

(9160, 1018, 1273)

## Fine-tuning

In [23]:
from transformers import BartForConditionalGeneration, Seq2SeqTrainingArguments, Seq2SeqTrainer
import torch

model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
model.resize_token_embeddings(len(tokenizer))

Embedding(50266, 768)

In [24]:
batch_size = 2
args = Seq2SeqTrainingArguments(
    output_dir = "./model",
    save_strategy = "epoch",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=10,
    save_total_limit=2,
    load_best_model_at_end=True,
    metric_for_best_model="P@1",
    weight_decay=0.01,
    predict_with_generate=True,
    eval_accumulation_steps = 1
)

In [25]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

In [26]:
import numpy as np
def compute_metrics(p):
    predictions, labels = p
    
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    
    # Collect the parsed predictions and labels for every example
    predicted = []
    true_label = []
    
    for k in range(len(decoded_labels)):
        pred = decoded_preds[k]
        label = decoded_labels[k]

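        # NOTE: the labels were built by joining distractors with '[EOD]', not ', ',
        # so this split only separates items when the decoded text contains ', '.
        pred_list = pred.split(', ')
        label_list = label.split(', ')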
        
        pred_list[0] = pred_list[0].split(' ')[-1]
        label_list[0] = label_list[0].split(' ')[-1]

        predicted.append(pred_list)
        true_label.append(label_list)

    # evaluation metrics
    p1 = 0
    p3 = 0
    r3 = 0
    f3 = 0
    for idx in range(len(true_label)):
        distractors = predicted[idx]
        labels = true_label[idx]

        act_set = set(labels)
        pred1_set = set(distractors[:1])
        pred3_set = set(distractors[:3])

        p_1 = len(act_set & pred1_set) / float(1)
        p_3 = len(act_set & pred3_set) / float(3)
        r_3 = len(act_set & pred3_set) / float(len(act_set))

        if p_3 == 0 and r_3 == 0:
            f1_3 = 0
        else:
            f1_3 = 2 * (p_3 * r_3 / (p_3 + r_3))

        p1+=p_1
        p3+=p_3
        r3+=r_3
        f3+=f1_3

    avg_p1 = p1 / len(true_label)
    avg_p3 = p3 / len(true_label)
    avg_r3 = r3 / len(true_label)
    avg_f3 = f3 / len(true_label)

    result = {'P@1': avg_p1,
              'P@3': avg_p3,
              'R@3': avg_r3,
              'F1@3': avg_f3}
    
    return result
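For comparison, here is a minimal sketch of the same P@k, R@k, and F1@k definitions applied to distractor lists split on `[EOD]`. It assumes the decoded strings still contain the `[EOD]` separator (for example, after decoding with `skip_special_tokens=False` and stripping `<s>`, `</s>`, and `<pad>`). The helper names are hypothetical, and the prediction string is made up for illustration; it is not output from the trained model.

```python
def split_eod(text, sep="[EOD]"):
    """Split a decoded string into a list of distractors."""
    return [part.strip() for part in text.split(sep) if part.strip()]


def precision_recall_at_k(predicted, actual, k):
    """Precision@k and recall@k for one list of predicted distractors."""
    hits = len(set(predicted[:k]) & set(actual))
    precision = hits / k
    recall = hits / len(set(actual)) if actual else 0.0
    return precision, recall


def f1_score(precision, recall):
    """Harmonic mean of precision and recall, defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical prediction; the gold string is train_label[0] from above.
pred = split_eod("Cerebral angiography [EOD] MRI of the brain [EOD] Repeat CT scan of the head")
gold = split_eod("Ethics committee consultation[EOD]Cerebral angiography[EOD]"
                 "Court order for further management[EOD]Repeat CT scan of the head")

p1, _ = precision_recall_at_k(pred, gold, 1)
p3, r3 = precision_recall_at_k(pred, gold, 3)
print(p1, p3, r3, f1_score(p3, r3))
```

Here two of the three predicted distractors match the four gold distractors, so P@3 is 2/3 and R@3 is 2/4.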

In [27]:
# import os
# os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

In [28]:
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [29]:
trainer.train()

***** Running training *****
  Num examples = 9160
  Num Epochs = 10
  Instantaneous batch size per device = 2
  Total train batch size (w. parallel, distributed & accumulation) = 4
  Gradient Accumulation steps = 1
  Total optimization steps = 22900
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
[34m[1mwandb[0m: Currently logged in as: [33mhankystyle[0m. Use [1m`wandb login --relogin`[0m to force relogin




| Epoch | Training Loss | Validation Loss | P@1 | P@3 | R@3 | F1@3 |
|---|---|---|---|---|---|---|
| 1 | 0.5312 | 0.416468 | 0.031434 | 0.016372 | 0.023625 | 0.015591 |
| 2 | 0.4598 | 0.387879 | 0.024558 | 0.013752 | 0.01714 | 0.01228 |
| 3 | 0.406 | 0.37463 | 0.023576 | 0.013098 | 0.014685 | 0.011014 |
| 4 | 0.3808 | 0.365723 | 0.030452 | 0.016045 | 0.022029 | 0.014976 |
| 5 | 0.3568 | 0.361149 | 0.02947 | 0.014735 | 0.020438 | 0.013765 |
| 6 | 0.341 | 0.357996 | 0.031434 | 0.015717 | 0.023689 | 0.015391 |
| 7 | 0.32 | 0.355934 | 0.02947 | 0.015062 | 0.020742 | 0.01407 |
| 8 | 0.3063 | 0.356089 | 0.026523 | 0.01408 | 0.017632 | 0.012488 |
| 9 | 0.2919 | 0.355944 | 0.034381 | 0.016699 | 0.024835 | 0.016253 |
| 10 | 0.2858 | 0.356806 | 0.03831 | 0.018009 | 0.028765 | 0.018218 |


***** Running Evaluation *****
  Num examples = 1018
  Batch size = 4
Saving model checkpoint to ./model/checkpoint-2290
Configuration saved in ./model/checkpoint-2290/config.json
Model weights saved in ./model/checkpoint-2290/pytorch_model.bin
tokenizer config file saved in ./model/checkpoint-2290/tokenizer_config.json
Special tokens file saved in ./model/checkpoint-2290/special_tokens_map.json
added tokens file saved in ./model/checkpoint-2290/added_tokens.json
***** Running Evaluation *****
  Num examples = 1018
  Batch size = 4
Saving model checkpoint to ./model/checkpoint-4580
Configuration saved in ./model/checkpoint-4580/config.json
Model weights saved in ./model/checkpoint-4580/pytorch_model.bin
tokenizer config file saved in ./model/checkpoint-4580/tokenizer_config.json
Special tokens file saved in ./model/checkpoint-4580/special_tokens_map.json
added tokens file saved in ./model/checkpoint-4580/added_tokens.json
***** Running Evaluation *****
  Num examples = 1018
  Batch siz

TrainOutput(global_step=22900, training_loss=0.3992818330989654, metrics={'train_runtime': 4941.4459, 'train_samples_per_second': 18.537, 'train_steps_per_second': 4.634, 'total_flos': 5.7706315849728e+16, 'train_loss': 0.3992818330989654, 'epoch': 10.0})
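The step counts in the log follow from the batch configuration. The sketch below reproduces the arithmetic, assuming that both GPUs listed by `nvidia-smi` are used, which matches the total train batch size of 4 reported in the log.

```python
import math

num_examples = 9160    # "Num examples" in the training log
per_device_batch = 2   # per_device_train_batch_size
num_devices = 2        # assumption: both GPUs shown by nvidia-smi are used
grad_accum = 1         # "Gradient Accumulation steps" in the log
epochs = 10            # num_train_epochs

total_batch = per_device_batch * num_devices * grad_accum
steps_per_epoch = math.ceil(num_examples / total_batch)
total_steps = steps_per_epoch * epochs
print(total_batch, steps_per_epoch, total_steps)
```

The per-epoch step count also explains the checkpoint names in the log: `checkpoint-2290` after the first epoch and `checkpoint-4580` after the second.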

In [30]:
trainer.evaluate()

***** Running Evaluation *****
  Num examples = 1018
  Batch size = 4


{'eval_loss': 0.3568064868450165,
 'eval_P@1': 0.03831041257367387,
 'eval_P@3': 0.018009168303863784,
 'eval_R@3': 0.02876459283888712,
 'eval_F1@3': 0.018217893217893213,
 'eval_runtime': 65.8772,
 'eval_samples_per_second': 15.453,
 'eval_steps_per_second': 3.871,
 'epoch': 10.0}

In [31]:
trainer.save_model('./model/bart-base-finetuned-pubmed-text2text-sentence-medqa')

Saving model checkpoint to ./model/bart-base-finetuned-pubmed-text2text-sentence-medqa
Configuration saved in ./model/bart-base-finetuned-pubmed-text2text-sentence-medqa/config.json
Model weights saved in ./model/bart-base-finetuned-pubmed-text2text-sentence-medqa/pytorch_model.bin
tokenizer config file saved in ./model/bart-base-finetuned-pubmed-text2text-sentence-medqa/tokenizer_config.json
Special tokens file saved in ./model/bart-base-finetuned-pubmed-text2text-sentence-medqa/special_tokens_map.json
added tokens file saved in ./model/bart-base-finetuned-pubmed-text2text-sentence-medqa/added_tokens.json


In [32]:
predictions, labels, metrics = trainer.predict(valid_dataset)
print('valid: ')
metrics

***** Running Prediction *****
  Num examples = 1018
  Batch size = 4


valid: 


{'eval_loss': 0.3568064868450165,
 'eval_P@1': 0.03831041257367387,
 'eval_P@3': 0.018009168303863784,
 'eval_R@3': 0.02876459283888712,
 'eval_F1@3': 0.018217893217893213,
 'eval_runtime': 65.7013,
 'eval_samples_per_second': 15.494,
 'eval_steps_per_second': 3.881}

In [33]:
predictions, labels, metrics = trainer.predict(test_dataset)
print('test: ')
metrics

***** Running Prediction *****
  Num examples = 1273
  Batch size = 4


test: 


{'eval_loss': 0.470745712518692,
 'eval_P@1': 0.03220738413197172,
 'eval_P@3': 0.013877978528410582,
 'eval_R@3': 0.02586512073943339,
 'eval_F1@3': 0.01564749730499927,
 'eval_runtime': 92.3145,
 'eval_samples_per_second': 13.79,
 'eval_steps_per_second': 3.456}

In [34]:
predictions_tokens = tokenizer.batch_decode(predictions,skip_special_tokens = True)

In [35]:
predictions_tokens = tokenizer.batch_decode(predictions,skip_special_tokens = False)

In [38]:
# skipkeys only affects dict keys and does not remove '<pad>' here; the list is written as-is.
with open('pred.json','w') as f:
    json.dump(predictions_tokens,f)

In [39]:
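# skipkeys is a boolean that only affects dict keys, so it cannot filter tokens; strip them explicitly.
cleaned = [p.replace('<pad>', '').replace('</s>', '').replace('<s>', '') for p in predictions_tokens]
with open('pred.json','w') as f:
    json.dump(cleaned,f)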

In [37]:
with open('test.json','w') as f:
    json.dump(test_label,f)