### Baselines:
- BERT-based classifier trained on the data
- Some form of Siamese neural network

### Ideas:
- ...

In [1]:
import json
import pandas as pd
import numpy as np

pd.set_option('display.max_colwidth', None)

In [2]:
from datasets import load_dataset, load_metric, Dataset, Split
from transformers import (
    AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer,
    AutoTokenizer, AutoModelForSequenceClassification, TextClassificationPipeline,
    DebertaForSequenceClassification,
)
from transformers import TrainingArguments, Trainer
import wandb
import torch
from tqdm import tqdm

In [3]:
from sklearn.metrics import precision_recall_fscore_support

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    return {
        'recall' : recall,
        'precision': precision,
        'f1': f1,   
    }

In [4]:
taska_training_df = pd.read_csv('../data/TaskA_train.csv')
taska_valid_df = pd.read_csv('../data/TaskA_dev.csv')

In [5]:
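# Drop rows with Novelty == 0; only the labels 1 and -1 are mapped to classes below.
# .copy() avoids pandas' SettingWithCopyWarning when new columns are added later.
taska_training_df = taska_training_df[taska_training_df.Novelty != 0].copy()
taska_valid_df = taska_valid_df[taska_valid_df.Novelty != 0].copy()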

In [6]:
taska_training_df['input_txt'] = taska_training_df.apply(lambda x: '[CLS] {} [SEP] {} [SEP] {} [SEP]'.format(x['topic'], x['Premise'], x['Conclusion']), axis=1)
taska_valid_df['input_txt'] = taska_valid_df.apply(lambda x: '[CLS] {} [SEP] {} [SEP] {} [SEP]'.format(x['topic'], x['Premise'], x['Conclusion']), axis=1)
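
The cell above joins topic, premise, and conclusion into one string with explicit `[CLS]`/`[SEP]` markers. A minimal sketch of the same template as a standalone function follows. `build_input` and its `with_topic` flag are hypothetical names, and the topic-free variant is only an assumption about how a run without the topic might format its input.

```python
def build_input(topic, premise, conclusion, with_topic=True):
    """Join the argument parts with explicit BERT-style special tokens."""
    # Hypothetical helper; with_topic=True mirrors the notebook's format string.
    parts = [topic, premise, conclusion] if with_topic else [premise, conclusion]
    return '[CLS] ' + ' [SEP] '.join(parts) + ' [SEP]'


# Made-up example strings, not taken from the dataset.
example = build_input('TV viewing', 'Screens reduce sleep.', 'TV is harmful.')
print(example)
```

Because the markers are already part of the text, the tokenizer is later called with `add_special_tokens=False` so that they are not added a second time.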

## Fine-tune the NLI model on the training data:

In [7]:
nli_tokenizer = AutoTokenizer.from_pretrained('microsoft/deberta-base-mnli')
nli_model     = AutoModelForSequenceClassification.from_pretrained('microsoft/deberta-base-mnli')

Some weights of the model checkpoint at microsoft/deberta-base-mnli were not used when initializing DebertaForSequenceClassification: ['config']
- This IS expected if you are initializing DebertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [8]:
train_dataset = Dataset.from_pandas(taska_training_df)
eval_dataset = Dataset.from_pandas(taska_valid_df)

In [9]:
nli_model.config.id2label

{0: 'CONTRADICTION', 1: 'NEUTRAL', 2: 'ENTAILMENT'}

In [10]:
novelty_map = dict([
    (1, 1), # if novel -> neutral label
    (-1, 2) # not novel -> entailment label
])

In [11]:
inverse_novelty_map = dict([
    (2, -1), # entailment -> not novel
    (1, 1),  # neutral -> novel
    (0, 1)   # contradiction -> novel
])
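
The two maps are not exact inverses: training only ever uses the NEUTRAL and ENTAILMENT classes, while a CONTRADICTION prediction is still decoded as novel. A small sanity check, with the dictionaries copied from the cells above, makes that explicit. It is a sketch for illustration, not part of the original pipeline.

```python
# Label maps and class names copied from the notebook cells above.
novelty_map = {1: 1, -1: 2}
inverse_novelty_map = {2: -1, 1: 1, 0: 1}
id2label = {0: 'CONTRADICTION', 1: 'NEUTRAL', 2: 'ENTAILMENT'}

# Every novelty label must survive the round trip label -> class id -> label.
for novelty in (1, -1):
    class_id = novelty_map[novelty]
    assert inverse_novelty_map[class_id] == novelty
    print(novelty, '->', class_id, id2label[class_id])

# Class ids the model can predict but that never occur as a training label.
unused_ids = sorted(set(id2label) - set(novelty_map.values()))
print('never used as a training label:', [id2label[i] for i in unused_ids])
```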

In [12]:
def preprocess(example):
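    # input_txt already contains [CLS]/[SEP], hence add_special_tokens=False;
    # truncation=True is required for max_length to take effect.
    inputs = nli_tokenizer(example["input_txt"], add_special_tokens=False, padding=True, truncation=True, max_length=512)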
    inputs['labels'] = list(map(novelty_map.get, example['Novelty']))
    return inputs

In [13]:
train_dataset = train_dataset.map(preprocess, batched=True)
eval_dataset = eval_dataset.map(preprocess, batched=True)


In [14]:
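def compute_metrics(pred):
    # Map NLI class ids back to novelty labels (1 = novel, -1 = not novel) before scoring.
    labels = list(map(inverse_novelty_map.get, pred.label_ids))
    preds = list(map(inverse_novelty_map.get, pred.predictions.argmax(-1)))
    # average='binary' scores the positive class, i.e. novelty label 1 (sklearn's default pos_label).
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    return {
        'recall' : recall,
        'precision': precision,
        'f1': f1,
    }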
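
To see what this metric function does, the same steps can be traced on a handful of made-up logits. The sketch below replaces scikit-learn with a small hand-written `binary_prf` (a hypothetical helper) so that it runs without dependencies; the logits and label ids are invented for illustration.

```python
inverse_novelty_map = {2: -1, 1: 1, 0: 1}  # copied from the notebook


def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=row.__getitem__)


def binary_prf(labels, preds, pos_label=1):
    """Precision, recall and F1 for one positive class (hand-written stand-in)."""
    tp = sum(1 for y, p in zip(labels, preds) if y == pos_label and p == pos_label)
    predicted = sum(1 for p in preds if p == pos_label)
    actual = sum(1 for y in labels if y == pos_label)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


# Invented 3-class logits (CONTRADICTION, NEUTRAL, ENTAILMENT) and gold class ids.
logits = [[0.1, 2.0, 0.3], [0.2, 0.1, 3.0], [1.5, 0.2, 0.1], [0.0, 0.5, 2.5]]
label_ids = [1, 2, 1, 1]

labels = [inverse_novelty_map[i] for i in label_ids]          # [1, -1, 1, 1]
preds = [inverse_novelty_map[argmax(row)] for row in logits]  # [1, -1, 1, -1]
precision, recall, f1 = binary_prf(labels, preds)
print(precision, recall, f1)
```

The third row is predicted as CONTRADICTION but still counts as a correct "novel" prediction, because class 0 decodes to novelty label 1.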

In [15]:
training_args = TrainingArguments(
    output_dir="nov_nli_trainer", 
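    report_to=None,  # in transformers 4.x, None falls back to all installed integrations (W&B here); pass "none" to disable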
    num_train_epochs=5,
    learning_rate=2e-05
)

In [16]:
trainer = Trainer(
    model=nli_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics
)

In [17]:
trainer.train()

The following columns in the training set  don't have a corresponding argument in `DebertaForSequenceClassification.forward` and have been ignored: topic, Validity, Premise, Validity-Confidence, Novelty, Novelty-Confidence, __index_level_0__, Conclusion, input_txt.
***** Running training *****
  Num examples = 718
  Num Epochs = 5
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 450
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
wandb: Currently logged in as: majastahl. Use `wandb login --relogin` to force relogin


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=450, training_loss=0.23082022772894964, metrics={'train_runtime': 49.2084, 'train_samples_per_second': 72.955, 'train_steps_per_second': 9.145, 'total_flos': 268724087437500.0, 'train_loss': 0.23082022772894964, 'epoch': 5.0})
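
The step count and throughput figures in the log follow directly from the dataset size, batch size, and runtime. The arithmetic below reproduces them, assuming a single device and no gradient accumulation, as the log reports; the variable names are illustrative.

```python
import math

num_examples = 718        # "Num examples" in the training log
batch_size = 8            # "Instantaneous batch size per device"
num_epochs = 5            # "Num Epochs"
train_runtime = 49.2084   # seconds, from TrainOutput

# The last batch of each epoch is partial, hence the ceiling.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

samples_per_second = num_examples * num_epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(steps_per_epoch, total_steps)
print(round(samples_per_second, 3), round(steps_per_second, 3))
```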

In [18]:
trainer.evaluate()

The following columns in the evaluation set  don't have a corresponding argument in `DebertaForSequenceClassification.forward` and have been ignored: topic, Validity, Premise, Validity-Confidence, Novelty, Novelty-Confidence, __index_level_0__, Conclusion, input_txt.
***** Running Evaluation *****
  Num examples = 200
  Batch size = 8


{'eval_loss': 2.037123203277588,
 'eval_recall': 0.25609756097560976,
 'eval_precision': 0.875,
 'eval_f1': 0.39622641509433965,
 'eval_runtime': 1.2727,
 'eval_samples_per_second': 157.144,
 'eval_steps_per_second': 19.643,
 'epoch': 5.0}

Without the topic in the input:

{'eval_loss': 3.0899693965911865,
 'eval_recall': 0.1951219512195122,
 'eval_precision': 0.7619047619047619,
 'eval_f1': 0.31067961165048547,
 'eval_runtime': 0.5043,
 'eval_samples_per_second': 396.608,
 'eval_steps_per_second': 49.576,
 'epoch': 5.0}
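
In both runs F1 sits far below precision because it is the harmonic mean of precision and recall, which is pulled toward the smaller value. As a quick check, the reported F1 scores can be recomputed from the reported precision and recall; `f1_from` is a hypothetical helper name.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall values copied from the evaluation outputs above.
with_topic = f1_from(0.875, 0.25609756097560976)
without_topic = f1_from(0.7619047619047619, 0.1951219512195122)

print(with_topic)
print(without_topic)
```

Since recall is the smaller value in both runs, it is the main factor holding F1 down.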

In [None]:
preds = trainer.predict(eval_dataset)

## Fine-tune simple BERT model on the training data:

In [19]:
bert_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
bert_model     = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

loading configuration file https://huggingface.co/bert-base-uncased/resolve/main/config.json from cache at /mnt/ceph/storage/data-tmp/current//majaa/.cache/huggingface/transformers/3c61d016573b14f7f008c02c4e51a366c67ab274726fe2910691e2a761acf43e.37395cee442ab11005bcd270f3c34464dc1704b715b5d7d52b1a461abe3b9e4e
Model config BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.9.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading file https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt from cache at /mnt/cep

In [20]:
train_dataset = Dataset.from_pandas(taska_training_df)
eval_dataset = Dataset.from_pandas(taska_valid_df)

In [21]:
novelty_map = dict([ # avoid negative labels
    (1, 1), 
    (-1, 0)
])

In [22]:
def preprocess(example):
    inputs = bert_tokenizer(example["input_txt"], add_special_tokens=False, padding=True, truncation=True, max_length=512)
    inputs['labels'] = list(map(novelty_map.get, example['Novelty']))
    return inputs

In [23]:
train_dataset = train_dataset.map(preprocess, batched=True)
eval_dataset = eval_dataset.map(preprocess, batched=True)


In [24]:
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    return {
        'recall' : recall,
        'precision': precision,
        'f1': f1,
    }

In [25]:
training_args = TrainingArguments(
    output_dir="vali_bert_trainer", 
    report_to="none",
    num_train_epochs=5,
    learning_rate=2e-05
)

PyTorch: setting up devices


In [26]:
trainer = Trainer(
    model=bert_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics
)

In [27]:
trainer.train()

The following columns in the training set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: topic, Validity, Premise, Validity-Confidence, Novelty, Novelty-Confidence, __index_level_0__, Conclusion, input_txt.
***** Running training *****
  Num examples = 718
  Num Epochs = 5
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 450


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=450, training_loss=0.364324951171875, metrics={'train_runtime': 33.0159, 'train_samples_per_second': 108.735, 'train_steps_per_second': 13.63, 'total_flos': 228762729304800.0, 'train_loss': 0.364324951171875, 'epoch': 5.0})

In [28]:
trainer.evaluate()

The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: topic, Validity, Premise, Validity-Confidence, Novelty, Novelty-Confidence, __index_level_0__, Conclusion, input_txt.
***** Running Evaluation *****
  Num examples = 200
  Batch size = 8


  _warn_prf(average, modifier, msg_start, len(result))


{'eval_loss': 1.3311876058578491,
 'eval_recall': 0.0,
 'eval_precision': 0.0,
 'eval_f1': 0.0,
 'eval_runtime': 0.4433,
 'eval_samples_per_second': 451.211,
 'eval_steps_per_second': 56.401,
 'epoch': 5.0}

Without the topic in the input:

{'eval_loss': 1.3336939811706543,
 'eval_recall': 0.024390243902439025,
 'eval_precision': 1.0,
 'eval_f1': 0.047619047619047616,
 'eval_runtime': 0.3022,
 'eval_samples_per_second': 661.783,
 'eval_steps_per_second': 82.723,
 'epoch': 5.0}
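
An `eval_recall` of 0.0 means that no novel example was predicted as novel, and the `_warn_prf` warning fragment suggests (the truncated output does not confirm it) that class 1 was never predicted at all. One way to check for this kind of collapse is to count the predicted classes. The sketch below uses invented logits and a hypothetical helper `predicted_class_counts`; with the real model, the rows would come from `trainer.predict(eval_dataset).predictions`.

```python
from collections import Counter


def predicted_class_counts(logits):
    """Count how often each class index wins the argmax."""
    return Counter(max(range(len(row)), key=row.__getitem__) for row in logits)


# Invented 2-class logits in which class 0 (not novel) always wins.
toy_logits = [[2.1, -1.0], [1.7, 0.2], [0.9, 0.3]]
counts = predicted_class_counts(toy_logits)
print(counts)

if counts[1] == 0:
    print('class 1 (novel) is never predicted: precision is undefined, recall is 0')
```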

In [61]:
preds = trainer.predict(eval_dataset)

The following columns in the test set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: __index_level_0__, input_txt, topic, Premise, Novelty, Validity-Confidence, Novelty-Confidence, Validity, Conclusion.
***** Running Prediction *****
  Num examples = 200
  Batch size = 8
