## Baseline Proposal
- Vanilla BERT as our baseline
- Only consider the conversations; exclude prompts
- Use Adam as our optimizer

## Setup

In [1]:
import os
from google.colab import drive

drive.mount('/content/drive')
os.chdir('/content/drive/MyDrive/NYCU NLP Final/')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
!pip install transformers datasets > /dev/null

In [3]:
import numpy as np
import pandas as pd
import torch

In [4]:
# parameters
SEED = 42
# N_SAMPLES_PER_LABEL = 377  # the smallest label count
MODEL_NAME = 'prajjwal1/bert-medium'
DROPOUT = 0.4
EPOCHS = 50
TRAIN_BATCH_SIZE = 16
VALID_BATCH_SIZE = 64

In [5]:
import random

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

<torch._C.Generator at 0x7febbf889ad0>

## Read data

In [6]:
traindf = pd.read_csv('data/new_train.csv')
validdf = pd.read_csv('data/new_valid.csv')
testdf = pd.read_csv('data/new_test.csv')

In [7]:
print(f'# train: {len(traindf)}')
print(f'# valid: {len(validdf)}')
print(f'# test: {len(testdf)}')

# train: 19533
# valid: 2770
# test: 2547


In [8]:
classes = traindf['label'].unique()
n_labels = len(classes)

## Tokenization & Dataset

In [9]:
from transformers import AutoTokenizer

class PromptConvDataset(torch.utils.data.Dataset):
    def __init__(self, df, tokenizer):
        self.size = len(df)
        self.features = tokenizer(df['prompt'].values.tolist(), df['conv'].values.tolist(), truncation=True, padding=True)
        self.labels = df['label'].values.tolist() if ('label' in df.columns) else None

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.features.items()}
        # Compare against None explicitly so the check does not depend on list truthiness
        if self.labels is not None:
            item['labels'] = torch.tensor(self.labels[idx])

        return item

    def __len__(self):
        return self.size


tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

special_tokens_dict = {'additional_special_tokens': ['[SPEAKER_A]', '[SPEAKER_B]']}
tokenizer.add_special_tokens(special_tokens_dict)

train_dataset = PromptConvDataset(traindf, tokenizer)
valid_dataset = PromptConvDataset(validdf, tokenizer)
test_dataset = PromptConvDataset(testdf, tokenizer)


## Model

In [10]:
# Placeholder for the Master Proposal model; not implemented yet
class SentimentClassifier(torch.nn.Module):
    def __init__(self, backbone, classifier):
        super().__init__()
        self.backbone = backbone
        self.classifier = classifier

    def forward(self, input_ids, token_type_ids=None, attention_mask=None, labels=None):
        raise NotImplementedError

In [11]:
from transformers import AutoModelForSequenceClassification, AutoConfig

config = AutoConfig.from_pretrained(MODEL_NAME, 
                                    hidden_dropout_prob=0.2, 
                                    num_labels=n_labels, 
                                    classifier_dropout=DROPOUT)

model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, config=config)
model.resize_token_embeddings(len(tokenizer))


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_transform.bias', 'vocab_projector.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.weight', 'classifi

Embedding(30524, 768)

In [12]:
from datasets import load_metric

metric_precision = load_metric('precision')
metric_recall = load_metric('recall')
metric_f1 = load_metric('f1')

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    precision = metric_precision.compute(predictions=predictions, references=labels, average='macro')['precision']
    recall = metric_recall.compute(predictions=predictions, references=labels, average='macro')['recall']
    f1_score = metric_f1.compute(predictions=predictions, references=labels, average='macro')['f1']
    return {'Precision': precision, 'Recall': recall, 'F1': f1_score}
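
To make explicit what `average='macro'` computes in the cell above, here is a minimal NumPy sketch. It assumes labels are integer ids in `0..n_classes-1` and averages over all `n_classes`; the helper name `macro_prf` and the toy labels are hypothetical, not part of `datasets`. Classes with an undefined score contribute 0.

```python
import numpy as np

def macro_prf(y_true, y_pred, n_classes):
    """Unweighted mean of per-class precision, recall and F1 (hypothetical helper)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    precisions, recalls, f1s = [], [], []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))  # correctly predicted as c
        fp = np.sum((y_pred == c) & (y_true != c))  # wrongly predicted as c
        fn = np.sum((y_pred != c) & (y_true == c))  # missed samples of c
        p = tp / (tp + fp) if (tp + fp) > 0 else 0.0
        r = tp / (tp + fn) if (tp + fn) > 0 else 0.0
        f = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    return {
        'Precision': float(np.mean(precisions)),
        'Recall': float(np.mean(recalls)),
        'F1': float(np.mean(f1s)),
    }

# Made-up labels for three classes
scores = macro_prf([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0], n_classes=3)
print(scores)
```

Because every class gets the same weight, a rare class that is predicted poorly lowers the macro score as much as a frequent one would.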

In [13]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir='0522_distilbert',
    logging_dir='0522_distilbert_logs',
    logging_strategy='epoch',
    evaluation_strategy='epoch',
    save_strategy='epoch',
    num_train_epochs=EPOCHS,
    per_device_train_batch_size=TRAIN_BATCH_SIZE,
    per_device_eval_batch_size=VALID_BATCH_SIZE,
    lr_scheduler_type='cosine',
    warmup_steps=1000,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model="F1",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    compute_metrics=compute_metrics
)
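
The arguments above combine `warmup_steps=1000` with `lr_scheduler_type='cosine'`. As a rough sketch of the resulting learning-rate curve, the snippet below assumes the `TrainingArguments` default peak `learning_rate=5e-5` (this notebook does not set it), half a cosine cycle after warmup, and a single device without gradient accumulation. `lr_at` is a hypothetical helper, not a `transformers` function.

```python
import math

N_TRAIN = 19533          # training examples, as printed earlier in the notebook
TRAIN_BATCH_SIZE = 16
EPOCHS = 50
WARMUP_STEPS = 1000
PEAK_LR = 5e-5           # assumption: default learning_rate of TrainingArguments

# The last, partially filled batch of each epoch still counts as one step
steps_per_epoch = math.ceil(N_TRAIN / TRAIN_BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS

def lr_at(step):
    """Linear warmup to PEAK_LR, then cosine decay to 0 at total_steps."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    return PEAK_LR * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))

for epoch in (1, 6, 25, 50):
    print(epoch, lr_at(epoch * steps_per_epoch))
```

Under these assumptions the learning rate is still close to its peak at the end of epoch 6, so with 50 scheduled epochs the cosine decay has barely begun at that point.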

In [14]:
trainer.train()
# Precision, recall and F1 are increasing, but validation loss is getting higher.
# Is the model correcting wrong answers while lowering the logit scores of answers it originally got right?

***** Running training *****
  Num examples = 19533
  Num Epochs = 50
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 61050


| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 1 | 2.3516 | 1.479983 | 0.537462 | 0.531189 | 0.507238 |
| 2 | 1.2582 | 1.313279 | 0.59844 | 0.586548 | 0.577602 |
| 3 | 0.8955 | 1.348135 | 0.595033 | 0.57937 | 0.570547 |
| 4 | 0.6445 | 1.509236 | 0.589038 | 0.577857 | 0.575387 |
| 5 | 0.4509 | 1.729411 | 0.582243 | 0.572675 | 0.572507 |
| 6 | 0.3257 | 1.881127 | 0.581227 | 0.566587 | 0.56756 |


***** Running Evaluation *****
  Num examples = 2770
  Batch size = 32
  _warn_prf(average, modifier, msg_start, len(result))
Saving model checkpoint to 0522_distilbert/checkpoint-1221
Configuration saved in 0522_distilbert/checkpoint-1221/config.json
Model weights saved in 0522_distilbert/checkpoint-1221/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 2770
  Batch size = 32
Saving model checkpoint to 0522_distilbert/checkpoint-2442
Configuration saved in 0522_distilbert/checkpoint-2442/config.json
Model weights saved in 0522_distilbert/checkpoint-2442/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 2770
  Batch size = 32
Saving model checkpoint to 0522_distilbert/checkpoint-3663
Configuration saved in 0522_distilbert/checkpoint-3663/config.json
Model weights saved in 0522_distilbert/checkpoint-3663/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 2770
  Batch size = 32
Saving model checkpoint to 0522_distilbert/checkpoint-4884
Confi

KeyboardInterrupt: ignored
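
The comment in the training cell asks why validation loss can climb while the classification metrics do not get worse. One common explanation, offered here as a toy illustration and not as a diagnosis of this run, is growing confidence: scaling up the logits leaves every argmax unchanged, but the few wrong predictions are penalized much more heavily by cross-entropy. All logits and labels below are made up.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean cross-entropy computed from raw logits with a stable log-softmax."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

def accuracy(logits, labels):
    return float((np.asarray(logits).argmax(axis=1) == labels).mean())

labels = np.array([0, 1, 1])
# "Early" model: two correct predictions and one wrong one, all with low confidence
early_logits = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.0]])
# "Late" model: same decisions, but six times more confident
late_logits = early_logits * 6.0

early_loss = cross_entropy(early_logits, labels)
late_loss = cross_entropy(late_logits, labels)
print(accuracy(early_logits, labels), early_loss)
print(accuracy(late_logits, labels), late_loss)
```

Accuracy is identical for both sets of logits, while the loss of the more confident one is higher because the single wrong prediction dominates the average.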

## Prediction & Evaluation

In [None]:
from datasets import load_metric

metric_acc = load_metric('accuracy')
metric_f1 = load_metric('f1')

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    acc = metric_acc.compute(predictions=predictions, references=labels)['accuracy']
    f1_score = metric_f1.compute(predictions=predictions, references=labels, average='macro')['f1']
    return {'accuracy': acc, 'F1': f1_score}

In [None]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

# load local model
model = AutoModelForSequenceClassification.from_pretrained('roberta_baseline/checkpoint-10000')

training_args = TrainingArguments(
    output_dir='roberta_baseline',
    logging_dir='roberta_baseline_logs',
    logging_strategy='epoch',
    evaluation_strategy='epoch',
    num_train_epochs=EPOCHS,
    per_device_train_batch_size=TRAIN_BATCH_SIZE,
    per_device_eval_batch_size=VALID_BATCH_SIZE
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    compute_metrics=compute_metrics
)

In [None]:
eval_preds = trainer.predict(valid_dataset)

***** Running Prediction *****
  Num examples = 2770
  Batch size = 64


In [None]:
compute_metrics((eval_preds.predictions, eval_preds.label_ids))

{'F1': 0.5827597979066158, 'accuracy': 0.5895306859205777}

In [None]:
test_preds = trainer.predict(test_dataset)

***** Running Prediction *****
  Num examples = 2547
  Batch size = 64


In [None]:
test_ans = np.argmax(test_preds.predictions, axis=-1)
testdf['pred'] = test_ans

In [None]:
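submission = pd.read_csv('data/fixed_test.csv')
# Default to -1 so utterances without a matching conversation are easy to spot
submission['pred'] = -1
# Copy each conversation-level prediction to every utterance of that conversation
for _, row in testdf.iterrows():
    submission.loc[submission['conv_id'] == row['conv_id'], 'pred'] = row['pred']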
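
The loop above copies one prediction per conversation onto all utterances of that conversation. A vectorized alternative is sketched below on made-up frames; it assumes each `conv_id` appears only once in the prediction frame, which has not been checked for `testdf`. The names `toy_testdf`, `toy_submission` and `conv_to_pred` are hypothetical.

```python
import pandas as pd

# Stand-in for `testdf`: one row per conversation
toy_testdf = pd.DataFrame({'conv_id': ['c0', 'c1'], 'pred': [25, 8]})
# Stand-in for `submission`: one row per utterance; 'c2' has no prediction
toy_submission = pd.DataFrame({
    'conv_id': ['c0', 'c0', 'c1', 'c2'],
    'utterance_idx': [1, 2, 1, 1],
})

# Look-up table conv_id -> pred (requires unique conv_id values)
conv_to_pred = toy_testdf.set_index('conv_id')['pred']
# Unmatched conversations become NaN, then fall back to -1 as in the loop version
toy_submission['pred'] = (
    toy_submission['conv_id'].map(conv_to_pred).fillna(-1).astype(int)
)
print(toy_submission)
```

This avoids one boolean scan of the whole frame per conversation, which can matter once the submission file has thousands of rows.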

In [None]:
submission

| Unnamed: 0 | conv_id | utterance_idx | prompt | utterance | pred |
|---|---|---|---|---|---|
| 0 | hit:0_conv:0 | 1 | I felt guilty when I was driving home one nigh... | Yeah about 10 years ago I had a horrifying exp... | 25 |
| 1 | hit:0_conv:0 | 2 | I felt guilty when I was driving home one nigh... | Did you suffer any injuries? | 25 |
| 2 | hit:0_conv:0 | 3 | I felt guilty when I was driving home one nigh... | No I wasn't hit. It turned out they were drunk... | 25 |
| 3 | hit:0_conv:0 | 4 | I felt guilty when I was driving home one nigh... | Why did you feel guilty? People really shouldn... | 25 |
| 4 | hit:0_conv:0 | 5 | I felt guilty when I was driving home one nigh... | I don't know I was new to driving and hadn't e... | 25 |
| ... | ... | ... | ... | ... | ... |
| 10968 | hit:12416_conv:24832 | 4 | I saw a huge cockroach outside my house today.... | I live in Texas to so i know those feels | 8 |
| 10969 | hit:12423_conv:24847 | 1 | I have a big test on Monday. I am so nervous_c... | I have a big test on Monday_comma_ I am so ner... | 18 |
| 10970 | hit:12423_conv:24847 | 2 | I have a big test on Monday. I am so nervous_c... | What is the test on? | 18 |
| 10971 | hit:12423_conv:24847 | 3 | I have a big test on Monday. I am so nervous_c... | It's for my Chemistry class. I haven't slept m... | 18 |


In [None]:
submission[['pred']].to_csv('output/20220519_submission.csv', encoding='utf8')

## Master Proposal
- Use BERT to infer `prompt` & `utterance` representations, then concatenate the two.
- Add a `LayerNorm` layer to receive the concatenated result.
- Use a `Linear` layer to do classification.
- Maybe we can use `SAM` (Sharpness-Aware Minimization) to smooth the loss landscape.
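
The first three bullets can be sketched as a small PyTorch module. This is one possible reading of the proposal, not a finished design: the class names `ConcatClassifier` and `MeanPoolEncoder`, the shared encoder, the dropout placement and all sizes are assumptions, and the stand-in encoder replaces BERT only so the example runs without downloading weights. `SAM` is not covered here.

```python
import torch
from torch import nn

class ConcatClassifier(nn.Module):
    """Encode prompt and utterance separately, concatenate, LayerNorm, Linear."""

    def __init__(self, encoder, hidden_size, n_labels, dropout=0.4):
        super().__init__()
        # Any module mapping token ids (batch, seq) -> vectors (batch, hidden_size)
        self.encoder = encoder
        self.norm = nn.LayerNorm(2 * hidden_size)
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(2 * hidden_size, n_labels)

    def forward(self, prompt_ids, utterance_ids, labels=None):
        prompt_vec = self.encoder(prompt_ids)
        utterance_vec = self.encoder(utterance_ids)
        # Concatenate the two representations and normalize the result
        fused = self.norm(torch.cat([prompt_vec, utterance_vec], dim=-1))
        logits = self.classifier(self.dropout(fused))
        loss = None
        if labels is not None:
            loss = nn.functional.cross_entropy(logits, labels)
        return loss, logits


class MeanPoolEncoder(nn.Module):
    """Stand-in for BERT: embeds token ids and mean-pools over the sequence."""

    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)

    def forward(self, input_ids):
        return self.embed(input_ids).mean(dim=1)


torch.manual_seed(0)
toy_model = ConcatClassifier(MeanPoolEncoder(100, 8), hidden_size=8, n_labels=5)
prompt_ids = torch.randint(0, 100, (4, 10))      # 4 prompts, 10 tokens each
utterance_ids = torch.randint(0, 100, (4, 12))   # 4 utterances, 12 tokens each
loss, logits = toy_model(prompt_ids, utterance_ids, labels=torch.tensor([0, 1, 2, 3]))
print(logits.shape, loss.item())
```

With a real BERT backbone, `encoder` would be a thin wrapper that runs the model and returns one vector per sequence, for example the representation of the `[CLS]` token; that wrapper is left out of this sketch.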