### Baselines:
- BERT-based classifier trained on the data

### Ideas:
- Use an NLI model (https://huggingface.co/microsoft/deberta-base-mnli) and fine-tune it on our data.
- Assess validity at the premise (sentence) level.

In [1]:
import wandb
wandb.init(project="argsvalidnovel")

wandb: Currently logged in as: miladalsh. Use `wandb login --relogin` to force relogin


In [2]:
%load_ext autoreload

In [3]:
import json
import pandas as pd
import numpy as np
import sys

pd.set_option('display.max_colwidth', None)
sys.path.append('./src-py')

In [4]:
%autoreload
import sbert_training
from utils import *

In [5]:
from datasets import load_dataset, load_metric, Dataset, Split
from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer, AutoTokenizer, AutoModelForSequenceClassification, TextClassificationPipeline, DebertaForSequenceClassification
from transformers import TrainingArguments, Trainer
import torch
from tqdm import tqdm
from sklearn.metrics import precision_recall_fscore_support


In [6]:
output_path = "../../data-ceph/arguana/argmining22-sharedtask/models/"

In [7]:
taska_training_df = pd.read_csv('../data/TaskA_train.csv')
taska_valid_df = pd.read_csv('../data/TaskA_dev.csv')

taska_training_df = taska_training_df[taska_training_df.Validity != 0]
taska_valid_df = taska_valid_df[taska_valid_df.Validity != 0]

In [7]:
nli_tokenizer = AutoTokenizer.from_pretrained('roberta-large-mnli')
nli_model     = AutoModelForSequenceClassification.from_pretrained('roberta-large-mnli')
arg_stance_pipeline = TextClassificationPipeline(model=nli_model, tokenizer=nli_tokenizer, framework='pt', task='validity_classifier', device=0)

Some weights of the model checkpoint at roberta-large-mnli were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [9]:
#without topic
taska_valid_df['input_txt'] = taska_valid_df.apply(lambda x: '{} </s></s> {} '.format(x['Premise'], x['Conclusion']), axis=1)

results = arg_stance_pipeline(taska_valid_df['input_txt'].tolist())
taska_valid_df['pred_label'] = [1 if x['label'] == 'ENTAILMENT' else -1 for x in results]

precision, recall, f1, _ = precision_recall_fscore_support(taska_valid_df.Validity.tolist(), taska_valid_df.pred_label.tolist(), average='macro')

print('Precision: {}, Recall {}, F1: {}'.format(precision, recall, f1))

Precision: 0.715097588978186, Recall 0.7025405405405405, F1: 0.7070359156690091
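The `average='macro'` setting used above computes precision, recall, and F1 separately for each class and then takes their unweighted mean, so the smaller class counts as much as the larger one. A minimal pure-Python sketch of that computation follows; the helper name `macro_prf` is hypothetical and is not part of scikit-learn or of this repository.

```python
def macro_prf(y_true, y_pred):
    """Unweighted mean of per-class precision, recall and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # Undefined ratios are set to 0, mirroring scikit-learn's default.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy labels in the same {1, -1} encoding as the Validity column.
scores = macro_prf([1, 1, -1, -1], [1, -1, -1, -1])
print(tuple(round(v, 3) for v in scores))  # (0.833, 0.75, 0.733)
```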


In [10]:
#with topic
taska_valid_df['input_txt'] = taska_valid_df.apply(lambda x: '{}:{}  </s></s> {} '.format(x['topic'], x['Premise'], x['Conclusion']), axis=1)

results = arg_stance_pipeline(taska_valid_df['input_txt'].tolist())
taska_valid_df['pred_label'] = [1 if x['label'] == 'ENTAILMENT' else -1 for x in results]

precision, recall, f1, _ = precision_recall_fscore_support(taska_valid_df.Validity.tolist(), taska_valid_df.pred_label.tolist(), average='macro')

print('Precision: {}, Recall {}, F1: {}'.format(precision, recall, f1))

Precision: 0.7880518678604507, Recall 0.7017297297297297, F1: 0.712613304655093


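Adding the topic gives a slightly higher macro F1 (0.713 vs. 0.707) and clearly higher precision (0.788 vs. 0.715), while recall stays at about 0.70, so the topic-prefixed format is used from here on.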
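The two input formats compared above differ only in the `topic:` prefix placed before the premise. The helper below makes that difference explicit. `build_nli_input` is a hypothetical name, and the sketch drops the stray extra spaces of the original format strings, which may change tokenization slightly.

```python
def build_nli_input(premise, conclusion, topic=None, sep="</s></s>"):
    """Join premise and conclusion with the separator used in the cells above.

    If a topic is given, it is prepended to the premise as 'topic:premise'.
    """
    first = f"{topic}:{premise}" if topic else premise
    return f"{first} {sep} {conclusion}"


# Illustrative strings only, not taken from the shared-task data.
print(build_nli_input("Premise text", "Conclusion text"))
print(build_nli_input("Premise text", "Conclusion text", topic="Some topic"))
```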

In [11]:
taska_training_df['input_txt'] = taska_training_df.apply(lambda x: '{}:{}  </s></s> {} '.format(x['topic'], x['Premise'], x['Conclusion']), axis=1)
taska_valid_df['input_txt'] = taska_valid_df.apply(lambda x: '{}:{}  </s></s> {} '.format(x['topic'], x['Premise'], x['Conclusion']), axis=1)

## Fine-tune the NLI model on the training data:

In [25]:
nli_tokenizer = AutoTokenizer.from_pretrained('microsoft/deberta-base-mnli')
nli_model     = AutoModelForSequenceClassification.from_pretrained('microsoft/deberta-base-mnli')

loading configuration file https://huggingface.co/microsoft/deberta-base-mnli/resolve/main/config.json from cache at /mnt/ceph/storage/data-tmp/current//sile2804/.cache/huggingface/transformers/f7b31c39c192044791f5fdbf3d688249c69e527477a29901c7b1c3529cfd2d2b.486b7fcfb74d817138771852e9a12ae2309a5895b952e819a646622b0a75ecc0
Model config DebertaConfig {
  "_name_or_path": "microsoft/deberta-base-mnli",
  "architectures": [
    "DebertaForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "CONTRADICTION",
    "1": "NEUTRAL",
    "2": "ENTAILMENT"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "CONTRADICTION": 0,
    "ENTAILMENT": 2,
    "NEUTRAL": 1
  },
  "layer_norm_eps": 1e-07,
  "max_position_embeddings": 512,
  "max_relative_positions": -1,
  "model_type": "deberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_toke

In [10]:
train_dataset = Dataset.from_pandas(taska_training_df)
eval_dataset = Dataset.from_pandas(taska_valid_df)

In [11]:
nli_model.config.id2label

{0: 'CONTRADICTION', 1: 'NEUTRAL', 2: 'ENTAILMENT'}

In [12]:
validity_map = dict([
    (1, 2), # if valid -> entailment label
    (-1, 0) # not valid -> contradiction label (id 0); NEUTRAL (id 1) is never used as a training target
])

In [13]:
inverse_validity_map = dict([
    (2,1),
    (1,-1),
    (0,-1) 
])
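The two maps above are not strict inverses of each other: `validity_map` only ever produces the ids 0 and 2 as training labels, whereas `inverse_validity_map` also has to handle a predicted id of 1 (NEUTRAL), which it folds into "not valid". The sketch below restates both maps and checks this behaviour. The helper `to_validity` is hypothetical and only illustrates how predicted class ids could be converted back.

```python
validity_map = {1: 2, -1: 0}                  # validity label -> NLI class id
inverse_validity_map = {2: 1, 1: -1, 0: -1}   # NLI class id -> validity label


def to_validity(pred_ids):
    """Map predicted NLI class ids back to validity labels in {1, -1}."""
    return [inverse_validity_map[i] for i in pred_ids]


# Every validity label survives the round trip through the NLI ids.
assert all(inverse_validity_map[validity_map[v]] == v for v in (1, -1))

# ENTAILMENT -> valid; NEUTRAL and CONTRADICTION -> not valid.
print(to_validity([2, 1, 0]))  # [1, -1, -1]
```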

In [14]:
def preprocess(example):
    # truncation=True is needed for max_length to take effect
    inputs = nli_tokenizer(example["input_txt"], add_special_tokens=False, padding=True, truncation=True, max_length=512)
    inputs['labels'] = list(map(validity_map.get, example['Validity']))
    return inputs

In [15]:
train_dataset = train_dataset.map(preprocess, batched=True)
eval_dataset = eval_dataset.map(preprocess, batched=True)




In [26]:
training_args = TrainingArguments(
    output_dir= output_path + "/task-A/validity/classification/nli-model",
    report_to="wandb",
    overwrite_output_dir=True,
    metric_for_best_model = 'f1',
    evaluation_strategy = 'steps',         # evaluate every `eval_steps` steps
    learning_rate = 5e-6,                  # we can customize learning rate
    max_steps = 200,                       # about 4.35 epochs at 46 steps per epoch
    logging_steps = 50,                    # log every 50 steps, roughly one epoch given 721 examples and a total batch size of 16
    eval_steps = 50,                       # evaluate every 50 steps
    save_steps = 50,
    load_best_model_at_end = True,
    do_eval=True
)

PyTorch: setting up devices


In [27]:
trainer = Trainer(
    model=nli_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=lambda x: compute_nli_metrics(x, average='macro')
)

max_steps is given, it will override any value given in num_train_epochs


In [28]:
trainer.evaluate()

The following columns in the evaluation set  don't have a corresponding argument in `DebertaForSequenceClassification.forward` and have been ignored: Premise, Validity-Confidence, Conclusion, Novelty, __index_level_0__, topic, pred_label, Validity, input_txt, Novelty-Confidence. If Premise, Validity-Confidence, Conclusion, Novelty, __index_level_0__, topic, pred_label, Validity, input_txt, Novelty-Confidence are not expected by `DebertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 199
  Batch size = 8


Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"


{'eval_loss': 1.787210464477539,
 'eval_recall': 0.6978378378378378,
 'eval_precision': 0.6888544891640866,
 'eval_f1': 0.6910344464619352,
 'eval_runtime': 0.503,
 'eval_samples_per_second': 395.637,
 'eval_steps_per_second': 25.846}

In [29]:
trainer.train()

The following columns in the training set  don't have a corresponding argument in `DebertaForSequenceClassification.forward` and have been ignored: Premise, Validity-Confidence, Novelty, __index_level_0__, topic, Conclusion, Validity, input_txt, Novelty-Confidence. If Premise, Validity-Confidence, Novelty, __index_level_0__, topic, Conclusion, Validity, input_txt, Novelty-Confidence are not expected by `DebertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 721
  Num Epochs = 5
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 200


| Step | Training Loss | Validation Loss | Recall | Precision | F1 |
|---|---|---|---|---|---|
| 50 | 0.7115 | 0.674265 | 0.620378 | 0.797887 | 0.605941 |
| 100 | 0.3472 | 0.798137 | 0.620378 | 0.797887 | 0.605941 |
| 150 | 0.3547 | 0.710128 | 0.647405 | 0.813448 | 0.643694 |
| 200 | 0.2476 | 0.720677 | 0.640649 | 0.809762 | 0.63449 |


The following columns in the evaluation set  don't have a corresponding argument in `DebertaForSequenceClassification.forward` and have been ignored: Premise, Validity-Confidence, Conclusion, Novelty, __index_level_0__, topic, pred_label, Validity, input_txt, Novelty-Confidence. If Premise, Validity-Confidence, Conclusion, Novelty, __index_level_0__, topic, pred_label, Validity, input_txt, Novelty-Confidence are not expected by `DebertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 199
  Batch size = 8
Saving model checkpoint to ../../data-ceph/arguana/argmining22-sharedtask/models//task-A/validity/classification/nli-model/checkpoint-50
Configuration saved in ../../data-ceph/arguana/argmining22-sharedtask/models//task-A/validity/classification/nli-model/checkpoint-50/config.json
Model weights saved in ../../data-ceph/arguana/argmining22-sharedtask/models//task-A/validity/classification/nli-model/checkpoint-50/pyt

TrainOutput(global_step=200, training_loss=0.4152486181259155, metrics={'train_runtime': 31.3267, 'train_samples_per_second': 102.149, 'train_steps_per_second': 6.384, 'total_flos': 233159689548000.0, 'train_loss': 0.4152486181259155, 'epoch': 4.35})
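The fractional epoch count in the training output follows from the step budget. With 721 training examples and a total batch size of 16 (both taken from the log above), one epoch takes ceil(721 / 16) = 46 optimizer steps, so 200 steps correspond to about 4.35 epochs. A quick check of that arithmetic, with `epochs_for_steps` as a hypothetical helper name:

```python
import math


def epochs_for_steps(num_examples, total_batch_size, max_steps):
    """Return (steps per epoch, number of epochs covered by max_steps)."""
    steps_per_epoch = math.ceil(num_examples / total_batch_size)
    return steps_per_epoch, max_steps / steps_per_epoch


steps_per_epoch, epochs = epochs_for_steps(721, 16, 200)
print(steps_per_epoch, round(epochs, 2))  # 46 4.35
```

Logging and evaluating every 50 steps therefore happens slightly less often than once per epoch.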

In [28]:
trainer.evaluate()

The following columns in the evaluation set  don't have a corresponding argument in `DebertaForSequenceClassification.forward` and have been ignored: Validity, __index_level_0__, input_txt, Novelty, Novelty-Confidence, Validity-Confidence, Premise, topic, Conclusion. If Validity, __index_level_0__, input_txt, Novelty, Novelty-Confidence, Validity-Confidence, Premise, topic, Conclusion are not expected by `DebertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 199
  Batch size = 8


{'eval_loss': 1.5748487710952759,
 'eval_recall': 0.6514054054054054,
 'eval_precision': 0.833452380952381,
 'eval_f1': 0.6480272108843537,
 'eval_runtime': 0.5444,
 'eval_samples_per_second': 365.545,
 'eval_steps_per_second': 23.88,
 'epoch': 4.35}

## Fine-tune a plain RoBERTa model on the training data:

In [8]:
bert_tokenizer = AutoTokenizer.from_pretrained('roberta-base')

In [9]:
taska_training_df['input_txt'] = taska_training_df.apply(lambda x: '<s> {}:{} </s></s> {} </s>'.format(x['topic'], x['Premise'], x['Conclusion']), axis=1)
taska_valid_df['input_txt'] = taska_valid_df.apply(lambda x: '<s> {}:{} </s></s> {} </s>'.format(x['topic'], x['Premise'], x['Conclusion']), axis=1)

In [10]:
train_dataset = Dataset.from_pandas(taska_training_df)
eval_dataset = Dataset.from_pandas(taska_valid_df)

In [11]:
validity_map = dict([ # avoid negative labels
    (1, 1), 
    (-1, 0)
])

In [12]:
def preprocess(example):
    inputs = bert_tokenizer(example["input_txt"], add_special_tokens=False, padding=True, truncation=True, max_length=512)
    inputs['labels'] = list(map(validity_map.get, example['Validity']))
    return inputs

In [13]:
train_dataset = train_dataset.map(preprocess, batched=True)
eval_dataset = eval_dataset.map(preprocess, batched=True)




In [18]:
bert_model     = AutoModelForSequenceClassification.from_pretrained('roberta-base', num_labels=2)

training_args = TrainingArguments(
    output_dir= output_path + "/task-A/validity/classification/roberta", 
    #report_to="wandb",
    logging_dir='/var/argmining-sharedtask/roberta-baseline-validity',
    overwrite_output_dir=True,
    metric_for_best_model = 'f1',
    evaluation_strategy = 'steps',         # evaluate every `eval_steps` steps
    learning_rate = 5e-6,                  # we can customize learning rate
    max_steps = 500,                       # about 10.87 epochs at 46 steps per epoch
    logging_steps = 50,                    # log every 50 steps, roughly one epoch given 721 examples and a total batch size of 16
    eval_steps = 50,                       # evaluate every 50 steps
    save_steps = 50,
    load_best_model_at_end = True,
)

loading configuration file https://huggingface.co/roberta-base/resolve/main/config.json from cache at /mnt/ceph/storage/data-tmp/current//sile2804/.cache/huggingface/transformers/733bade19e5f0ce98e6531021dd5180994bb2f7b8bd7e80c7968805834ba351e.35205c6cfc956461d8515139f0f8dd5d207a2f336c0c3a83b4bc8dca3518e37b
Model config RobertaConfig {
  "_name_or_path": "roberta-base",
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.18.0",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 50265
}

loading weights fil

In [19]:
trainer = Trainer(
    model=bert_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=lambda x: compute_metrics(x, average='macro')
)

max_steps is given, it will override any value given in num_train_epochs


In [20]:
trainer.train()

The following columns in the training set  don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: Conclusion, input_txt, __index_level_0__, Validity, Novelty-Confidence, Validity-Confidence, topic, Novelty, Premise. If Conclusion, input_txt, __index_level_0__, Validity, Novelty-Confidence, Validity-Confidence, topic, Novelty, Premise are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 721
  Num Epochs = 11
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 500
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"


| Step | Training Loss | Validation Loss | Recall | Precision | F1 |
|---|---|---|---|---|---|
| 50 | 0.6875 | 0.664073 | 0.5 | 0.31407 | 0.385802 |
| 100 | 0.6861 | 0.662371 | 0.5 | 0.31407 | 0.385802 |
| 150 | 0.6665 | 0.613459 | 0.5 | 0.31407 | 0.385802 |
| 200 | 0.5895 | 0.682816 | 0.521784 | 0.631872 | 0.447052 |
| 250 | 0.5348 | 0.671221 | 0.562324 | 0.722587 | 0.518548 |
| 300 | 0.4469 | 0.640873 | 0.635405 | 0.761591 | 0.630896 |
| 350 | 0.3758 | 0.629697 | 0.638162 | 0.752071 | 0.635531 |
| 400 | 0.3609 | 0.689154 | 0.631405 | 0.746552 | 0.626691 |
| 450 | 0.3486 | 0.70914 | 0.617892 | 0.734819 | 0.608594 |
| 500 | 0.3383 | 0.71618 | 0.611135 | 0.728546 | 0.599329 |
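The identical rows for steps 50 to 150 (recall 0.5, precision 0.31407, F1 0.385802) are what macro averaging yields when a classifier predicts the same class for every example. The numbers are consistent with the larger class covering 125 of the 199 evaluation examples; that count is inferred from the metrics, not read from the data. A sketch of the arithmetic, with `constant_predictor_macro` as a hypothetical helper name:

```python
def constant_predictor_macro(n_majority, n_total):
    """Macro precision/recall/F1 of a two-class model that always predicts the majority class."""
    p_major = n_majority / n_total   # every prediction is the majority class
    r_major = 1.0                    # all majority examples are recovered
    f1_major = 2 * p_major * r_major / (p_major + r_major)
    # The never-predicted class scores 0 on all three metrics,
    # so each macro average is half of the majority-class value.
    return p_major / 2, r_major / 2, f1_major / 2


precision, recall, f1 = constant_predictor_macro(125, 199)
print(round(precision, 5), recall, round(f1, 6))  # 0.31407 0.5 0.385802
```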


The following columns in the evaluation set  don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: Conclusion, input_txt, __index_level_0__, Validity, Novelty-Confidence, Validity-Confidence, topic, Novelty, Premise. If Conclusion, input_txt, __index_level_0__, Validity, Novelty-Confidence, Validity-Confidence, topic, Novelty, Premise are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 199
  Batch size = 8
  _warn_prf(average, modifier, msg_start, len(result))
Saving model checkpoint to ../../data-ceph/arguana/argmining22-sharedtask/models//task-A/validity/classification/roberta/checkpoint-50
Configuration saved in ../../data-ceph/arguana/argmining22-sharedtask/models//task-A/validity/classification/roberta/checkpoint-50/config.json
Model weights saved in ../../data-ceph/arguana/argmining22-sharedtask/models//task-A/validity/classification/

TrainOutput(global_step=500, training_loss=0.5035018081665039, metrics={'train_runtime': 57.1815, 'train_samples_per_second': 139.905, 'train_steps_per_second': 8.744, 'total_flos': 504253365375000.0, 'train_loss': 0.5035018081665039, 'epoch': 10.87})

In [21]:
trainer.evaluate()

The following columns in the evaluation set  don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: Conclusion, input_txt, __index_level_0__, Validity, Novelty-Confidence, Validity-Confidence, topic, Novelty, Premise. If Conclusion, input_txt, __index_level_0__, Validity, Novelty-Confidence, Validity-Confidence, topic, Novelty, Premise are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 199
  Batch size = 8


{'eval_loss': 0.6296971440315247,
 'eval_recall': 0.6381621621621621,
 'eval_precision': 0.7520710059171598,
 'eval_f1': 0.6355311355311355,
 'eval_runtime': 0.5388,
 'eval_samples_per_second': 369.367,
 'eval_steps_per_second': 24.129,
 'epoch': 10.87}
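With `load_best_model_at_end = True` and `metric_for_best_model = 'f1'`, the `Trainer` reloads the checkpoint with the highest evaluation F1 once training ends. This explains why the final `eval_f1` of 0.6355 matches the step-350 row of the training log and not the last step. The sketch below reproduces that selection from the logged values; it does not call the `Trainer` and only illustrates the rule.

```python
# Evaluation F1 per checkpoint, copied from the RoBERTa training log above.
logged_f1 = {
    50: 0.385802, 100: 0.385802, 150: 0.385802, 200: 0.447052,
    250: 0.518548, 300: 0.630896, 350: 0.635531, 400: 0.626691,
    450: 0.608594, 500: 0.599329,
}

# Pick the step whose checkpoint has the highest F1.
best_step = max(logged_f1, key=logged_f1.get)
print(best_step, logged_f1[best_step])  # 350 0.635531
```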