## Baseline Proposal
- Vanilla BERT (here `distilbert-base-uncased`) as our baseline
- Only consider the conversations; exclude prompts
- Use Adam as our optimizer (the `Trainer` default, AdamW)

## Setup

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
import os

os.chdir('/content/drive/MyDrive/NYCU NLP Final/')

In [4]:
!pip install transformers datasets > /dev/null

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
datascience 0.10.6 requires folium==0.2.1, but you have folium 0.8.3 which is incompatible.


In [5]:
import numpy as np
import pandas as pd
import torch

In [6]:
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

torch.cuda.is_available()

True

In [7]:
device = torch.device(0)
torch.cuda.set_device(device)
print(f'{device} is now being set.')

cuda:0 is now being set.


In [8]:
# parameters
SEED = 42
N_SAMPLES_PER_LABEL = 377  # the smallest label count
MODEL_NAME='distilbert-base-uncased'
EPOCHS=50
TRAIN_BATCH_SIZE=16
VALID_BATCH_SIZE=64

In [9]:
import random

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

<torch._C.Generator at 0x7fa2cacd5a30>

## Read data

In [11]:
traindf = pd.read_csv('data/new_train.csv')
validdf = pd.read_csv('data/new_valid.csv')
testdf = pd.read_csv('data/new_test.csv')

In [12]:
print(f'# train: {len(traindf)}')
print(f'# valid: {len(validdf)}')
print(f'# test: {len(testdf)}')

# train: 19533
# valid: 2770
# test: 2547


In [13]:
classes = traindf['label'].unique()
n_labels = len(classes)

In [14]:
# data selection: randomly select ${N_SAMPLES_PER_LABEL} samples for each label from training data
train_samples = []
for c in classes:
    samples = traindf[traindf['label'] == c]
    if len(samples) >= N_SAMPLES_PER_LABEL: 
        samples = samples.sample(N_SAMPLES_PER_LABEL, random_state=SEED)
    train_samples.append(samples)

traindf = pd.concat(train_samples)

In [15]:
len(traindf)

12064
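
The count above is consistent with 32 labels at 377 samples each (377 × 32 = 12064). As a library-free illustration of the same per-label downsampling rule, here is a minimal sketch; the helper name `balanced_sample` and the toy rows are hypothetical and not part of the notebook.

```python
import random
from collections import defaultdict


def balanced_sample(rows, n_per_label, seed=42):
    """Keep at most n_per_label rows per label; smaller groups are kept whole."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row["label"]].append(row)

    selected = []
    for group in by_label.values():
        # Mirror the notebook's rule: only downsample groups that are large enough.
        if len(group) >= n_per_label:
            group = rng.sample(group, n_per_label)
        selected.extend(group)
    return selected


# Toy data: label 0 has 5 rows, label 1 has 3 rows, label 2 has 2 rows.
toy_labels = [0] * 5 + [1] * 3 + [2] * 2
rows = [{"label": lab, "conv": f"text {i}"} for i, lab in enumerate(toy_labels)]

sample = balanced_sample(rows, n_per_label=3)
counts = {lab: sum(r["label"] == lab for r in sample) for lab in (0, 1, 2)}
print(counts)  # {0: 3, 1: 3, 2: 2}
```

Note that a label with fewer than `n_per_label` rows stays under-represented, so the result is only balanced when the threshold is at most the smallest label count, as `N_SAMPLES_PER_LABEL` is here.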

In [33]:
traindf[traindf['label'] == 17]

| Unnamed: 0 | conv_id | prompt | conv | label | sent |
|---|---|---|---|---|---|
| 19340 | hit:9895_conv:19791 | the new printer at work is awful | we got a new printer installed at work today [... | 17 | annoyed |
| 6014 | hit:2358_conv:4717 | this tax war that us has started is borthering... | this tax war that us has started is borthering... | 17 | annoyed |
| 18859 | hit:960_conv:1921 | i was returning an item at a store and the per... | i was returning an item at a store and the per... | 17 | annoyed |
| 8497 | hit:3813_conv:7626 | i m waiting on my friend to confirm our plans ... | i m waiting on my friend to confirm our plans ... | 17 | annoyed |
| 18870 | hit:9619_conv:19239 | came home the other day and my dogs were happy... | so i came home yesterday and my lovely dogs gr... | 17 | annoyed |
| ... | ... | ... | ... | ... | ... |
| 5884 | hit:2290_conv:4581 | i was kind of bothered when i got passed over ... | i was pretty disappointed when i found out i w... | 17 | annoyed |
| 8518 | hit:3827_conv:7655 | when mosquitos are eating me alive | it finally rained today and i know whats com... | 17 | annoyed |
| 18938 | hit:9668_conv:19336 | my cat wont stop knocking over my plates | my new cat wont stop knocking over the plates ... | 17 | annoyed |
| 12352 | hit:6003_conv:12007 | it s election season and i keep getting roboca... | man these candidate robo calls are driving me... | 17 | annoyed |


In [34]:
validdf.loc[0, 'conv']

'my upstairs neighbors make a ton of noise at all hours of the night  it makes it difficult for me to sleep   [SEP] that really sucks  maybe you should try egging their door  or just break in and pretend you re bigfoot while they re trying to sleep  [SEP] i m not trying to get arrested  i think i ll just wait things out until i move in two months  [SEP] i would go with the bigfoot option  you can get a costume on the cheap on ebay nowadays  i ve used that tactic countless times and it has never failed '
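
The sample above suggests that the turns of a conversation are joined with a literal `[SEP]` marker. Assuming that holds for the whole `conv` column, the turns can be recovered with a small helper; `split_turns` is a hypothetical name, and the text is a shortened copy of the example.

```python
# Shortened copy of the validation example shown above.
conv = (
    "my upstairs neighbors make a ton of noise at all hours of the night "
    "[SEP] that really sucks  maybe you should try egging their door "
    "[SEP] i m not trying to get arrested "
)


def split_turns(conv: str, sep: str = "[SEP]") -> list[str]:
    """Split a conversation on the separator and collapse repeated whitespace."""
    return [" ".join(turn.split()) for turn in conv.split(sep) if turn.strip()]


turns = split_turns(conv)
for i, turn in enumerate(turns):
    print(i, turn)
```

This is only meant for inspecting the data; the baseline itself feeds the full `conv` string to the tokenizer.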

## Tokenization & Dataset

In [16]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

train_enc = tokenizer(traindf['conv'].values.tolist(), truncation=True, padding=True)
valid_enc = tokenizer(validdf['conv'].values.tolist(), truncation=True, padding=True)
test_enc = tokenizer(testdf['conv'].values.tolist(), truncation=True, padding=True)
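
Because each split is tokenized in one call, `padding=True` pads every example to the longest sequence in that split, and `truncation=True` cuts anything beyond the model's maximum input length. The toy function below illustrates that behaviour on plain lists. It is not the tokenizer's implementation: the token ids are made up, `pad_id=0` is an assumption, and unlike the real tokenizer it does not keep the closing special token when truncating.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Toy version of truncation=True, padding=True for lists of token ids."""
    truncated = [ids[:max_length] for ids in batch]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = [[101, 7592, 102], [101, 2023, 2003, 1037, 2936, 6251, 102]]
enc = pad_and_truncate(batch, max_length=5)
print(enc["input_ids"])       # [[101, 7592, 102, 0, 0], [101, 2023, 2003, 1037, 2936]]
print(enc["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

Padding a whole split to its longest example is simple but can waste memory when most conversations are short; padding per batch is a common alternative.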

In [17]:
class NLPFinalDataset(torch.utils.data.Dataset):
    def __init__(self, features, labels):
        self.features = features
        self.labels = labels

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.features.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = NLPFinalDataset(train_enc, traindf['label'].values)
valid_dataset = NLPFinalDataset(valid_enc, validdf['label'].values)

## Model Training

In [24]:
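from datasets import load_metric

# Note: `load_metric` is deprecated in later `datasets` releases in favor of the `evaluate` library.
metric_acc = load_metric('accuracy')
metric_f1 = load_metric('f1')

def compute_metrics(eval_pred):
    # eval_pred is a (logits, labels) pair; the predicted class is the argmax over the logits
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    acc = metric_acc.compute(predictions=predictions, references=labels)['accuracy']
    # macro average: unweighted mean of the per-class F1 scores
    f1_score = metric_f1.compute(predictions=predictions, references=labels, average='macro')['f1']
    return {'accuracy': acc, 'F1': f1_score}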

In [None]:
from transformers import AutoModelForSequenceClassification

# Fresh start from the pretrained weights:
# model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=n_labels)
# Resume from a checkpoint saved by an earlier run:
model = AutoModelForSequenceClassification.from_pretrained('./baseline_trainer/checkpoint-17000')

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_projector.bias', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier

In [25]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir='baseline_trainer',
    logging_dir='baseline_logs',
    logging_strategy='epoch',
    evaluation_strategy='epoch',
    num_train_epochs=EPOCHS,
    per_device_train_batch_size=TRAIN_BATCH_SIZE,
    per_device_eval_batch_size=VALID_BATCH_SIZE
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    compute_metrics=compute_metrics
)

In [None]:
trainer.train()

***** Running training *****
  Num examples = 12064
  Num Epochs = 50
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 37700
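
The step count in this log follows from the dataset size, batch size, and epoch count. A quick check, assuming a single device and no gradient accumulation as the log reports:

```python
import math

num_examples = 12064
batch_size = 16
epochs = 50

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 754 37700

# Position of a checkpoint in epochs, e.g. the checkpoint-17000 directory loaded earlier.
checkpoint_epoch = round(17000 / steps_per_epoch, 2)
print(checkpoint_epoch)  # 22.55
```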


| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 2.2865 | 1.782082 |
| 2 | 1.4166 | 1.680649 |
| 3 | 1.0018 | 1.778241 |
| 4 | 0.6934 | 1.899405 |
| 5 | 0.4628 | 2.17955 |
| 6 | 0.3086 | 2.396156 |
| 7 | 0.2093 | 2.705095 |
| 8 | 0.149 | 3.166678 |
| 9 | 0.1039 | 3.399624 |
| 10 | 0.0808 | 3.676774 |
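
In the log above, the training loss keeps falling while the validation loss bottoms out early, which points to overfitting. The sketch below finds the best epoch from the logged values and simulates a simple patience rule; `early_stop_epoch` is a hypothetical helper and the patience of 3 is an arbitrary choice, not something the notebook uses.

```python
# (epoch, training loss, validation loss) copied from the log above
history = [
    (1, 2.2865, 1.782082), (2, 1.4166, 1.680649), (3, 1.0018, 1.778241),
    (4, 0.6934, 1.899405), (5, 0.4628, 2.17955), (6, 0.3086, 2.396156),
    (7, 0.2093, 2.705095), (8, 0.149, 3.166678), (9, 0.1039, 3.399624),
    (10, 0.0808, 3.676774),
]

best_epoch, _, best_val = min(history, key=lambda row: row[2])
print(best_epoch, best_val)  # 2 1.680649


def early_stop_epoch(val_losses, patience):
    """Return the epoch at which training stops after `patience` epochs without improvement."""
    best, bad_epochs = float("inf"), 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, bad_epochs = loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    return len(val_losses)


stop_epoch = early_stop_epoch([row[2] for row in history], patience=3)
print(stop_epoch)  # 5
```

Under this rule, training would have stopped after epoch 5 instead of running all 50 epochs. `transformers` ships an `EarlyStoppingCallback` that applies a similar patience rule during training; wiring it into the `Trainer` is left out here.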


Saving model checkpoint to baseline_trainer/checkpoint-500
Configuration saved in baseline_trainer/checkpoint-500/config.json
Model weights saved in baseline_trainer/checkpoint-500/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 2770
  Batch size = 64
Saving model checkpoint to baseline_trainer/checkpoint-1000
Configuration saved in baseline_trainer/checkpoint-1000/config.json
Model weights saved in baseline_trainer/checkpoint-1000/pytorch_model.bin
Saving model checkpoint to baseline_trainer/checkpoint-1500
Configuration saved in baseline_trainer/checkpoint-1500/config.json
Model weights saved in baseline_trainer/checkpoint-1500/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 2770
  Batch size = 64
Saving model checkpoint to baseline_trainer/checkpoint-2000
Configuration saved in baseline_trainer/checkpoint-2000/config.json
Model weights saved in baseline_trainer/checkpoint-2000/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 2770


## Prediction & Evaluation

In [26]:
eval_pred = trainer.predict(train_dataset)

***** Running Prediction *****
  Num examples = 12064
  Batch size = 64


In [27]:
compute_metrics((eval_pred.predictions, eval_pred.label_ids))

{'F1': 0.998504962019102, 'accuracy': 0.9985079575596817}

In [29]:
eval_pred = trainer.predict(valid_dataset)

***** Running Prediction *****
  Num examples = 2770
  Batch size = 64


In [31]:
compute_metrics((eval_pred.predictions, eval_pred.label_ids))

{'F1': 0.4889328143906021, 'accuracy': 0.4967509025270758}
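
The two numbers reported by `compute_metrics` are accuracy and macro-averaged F1. For reference, here is a from-scratch sketch of both on a toy example, under the usual definition of macro F1 as the unweighted mean of per-class F1 scores. The toy labels are made up, and this is not the code the `datasets` metrics run internally.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(round(accuracy(y_true, y_pred), 4))  # 0.6667
print(round(macro_f1(y_true, y_pred), 4))  # 0.6556
```

Because macro F1 weights every class equally, it stays close to accuracy when the evaluation classes are roughly balanced, as the validation scores above suggest.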

## Master Proposal
- Use BERT to encode the `prompt` and the `utterance` separately, then concatenate the two representations.
- Add a `LayerNorm` layer on top of the concatenated result.
- Use a `Linear` layer for classification.
- Maybe use `SAM` (Sharpness-Aware Minimization) to steer training toward flatter regions of the loss landscape.
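
The classification head in this proposal could look roughly like the PyTorch sketch below. It is one possible reading of the bullets, not a settled design. The class name `ConcatClassifierHead` is hypothetical, the hidden size of 768 matches `distilbert-base-uncased`, the 32 labels are inferred from 12064 / 377, and random tensors stand in for the BERT sentence representations. `SAM` is not covered here.

```python
import torch
from torch import nn


class ConcatClassifierHead(nn.Module):
    """Hypothetical head: concatenate two sentence vectors, then LayerNorm and Linear."""

    def __init__(self, hidden_size: int, n_labels: int):
        super().__init__()
        self.norm = nn.LayerNorm(2 * hidden_size)
        self.classifier = nn.Linear(2 * hidden_size, n_labels)

    def forward(self, prompt_vec: torch.Tensor, utterance_vec: torch.Tensor) -> torch.Tensor:
        # Concatenate along the feature dimension: (batch, 2 * hidden_size)
        combined = torch.cat([prompt_vec, utterance_vec], dim=-1)
        return self.classifier(self.norm(combined))


torch.manual_seed(42)
head = ConcatClassifierHead(hidden_size=768, n_labels=32)

# Stand-ins for the encoder outputs of a batch of 4 prompts and 4 utterances.
prompt_vec = torch.randn(4, 768)
utterance_vec = torch.randn(4, 768)

logits = head(prompt_vec, utterance_vec)
print(logits.shape)  # torch.Size([4, 32])
```

In a full model, `prompt_vec` and `utterance_vec` would come from the encoder (for example, the first-token hidden state of each input), and the logits would feed a cross-entropy loss.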