### Baselines:
- BERT-based classifier trained on the data

### Ideas:
- Using NLI model: https://huggingface.co/microsoft/deberta-base-mnli and fine-tuning it on the data we have.
- Validity on the premise (sentence) level.

In [1]:
import json
import pandas as pd
import numpy as np

pd.set_option('display.max_colwidth', None)

In [2]:
from datasets import load_dataset, load_metric, Dataset, Split
from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer, AutoTokenizer, AutoModelForSequenceClassification, TextClassificationPipeline, DebertaForSequenceClassification
from transformers import TrainingArguments, Trainer
import wandb
import torch
from tqdm import tqdm

In [3]:
from sklearn.metrics import precision_recall_fscore_support

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    return {
        'recall' : recall,
        'precision': precision,
        'f1': f1,
    }

In [4]:
nli_tokenizer = AutoTokenizer.from_pretrained('microsoft/deberta-base-mnli')
nli_model     = AutoModelForSequenceClassification.from_pretrained('microsoft/deberta-base-mnli')
arg_stance_pipeline = TextClassificationPipeline(model=nli_model, tokenizer=nli_tokenizer, framework='pt', task='validity_classifier', device=0)

Some weights of the model checkpoint at microsoft/deberta-base-mnli were not used when initializing DebertaForSequenceClassification: ['config']
- This IS expected if you are initializing DebertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [5]:
taska_training_df = pd.read_csv('../data/TaskA_train.csv')
taskb_training_df = pd.read_csv('../data/TaskB_train.csv')

taska_valid_df = pd.read_csv('../data/TaskA_dev.csv')
taskb_valid_df = pd.read_csv('../data/TaskB_dev.csv')

In [6]:
taska_training_df = taska_training_df[taska_training_df.Validity != 0]
taska_training_df['input_txt'] = taska_training_df.apply(lambda x: '[CLS] {} [SEP] {} [SEP]'.format(x['Premise'], x['Conclusion']), axis=1)

In [7]:
resutls = arg_stance_pipeline(taska_training_df['input_txt'].tolist())
taska_training_df['pred_label'] = [1 if x['label'] == 'ENTAILMENT' else -1 for x in resutls]

In [8]:
taska_training_df[['topic', 'Validity', 'pred_label']].head(n=10)

Unnamed: 0,topic,Validity,pred_label
0,TV viewing is harmful to children,1,-1
1,TV viewing is harmful to children,1,-1
2,TV viewing is harmful to children,1,-1
3,TV viewing is harmful to children,1,1
4,TV viewing is harmful to children,-1,-1
5,TV viewing is harmful to children,1,1
6,TV viewing is harmful to children,1,1
7,TV viewing is harmful to children,1,1
8,TV viewing is harmful to children,-1,1
9,TV viewing is harmful to children,-1,-1


In [9]:
taska_training_df.Validity.value_counts()

 1    401
-1    320
Name: Validity, dtype: int64

In [10]:
precision, recall, f1, _ = precision_recall_fscore_support(taska_training_df.Validity.tolist(), taska_training_df.pred_label.tolist(), average='binary')

print('Precision: {}, Recall {}, F1: {}'.format(precision, recall, f1))

Precision: 0.8571428571428571, Recall 0.47880299251870323, F1: 0.6144


In [11]:
taska_valid_df = taska_valid_df[taska_valid_df.Validity != 0]
taska_valid_df['input_txt'] = taska_valid_df.apply(lambda x: '[CLS] {} [SEP] {} [SEP]'.format(x['Premise'], x['Conclusion']), axis=1)

## Fine-tune the NLI model on the training data:

In [15]:
nli_tokenizer = AutoTokenizer.from_pretrained('microsoft/deberta-base-mnli')
nli_model     = AutoModelForSequenceClassification.from_pretrained('microsoft/deberta-base-mnli')

Some weights of the model checkpoint at microsoft/deberta-base-mnli were not used when initializing DebertaForSequenceClassification: ['config']
- This IS expected if you are initializing DebertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [16]:
train_dataset = Dataset.from_pandas(taska_training_df)
eval_dataset = Dataset.from_pandas(taska_valid_df)

In [17]:
nli_model.config.id2label

{0: 'CONTRADICTION', 1: 'NEUTRAL', 2: 'ENTAILMENT'}

In [18]:
validity_map = dict([
    (1, 2), # if valid -> entailment label
    (-1, 0) # not valid -> contradiction (neutral) label
])

In [46]:
inverse_validity_map = dict([
    (2,1),
    (1,-1),
    (0,-1) 
])

In [20]:
def preprocess(example):
    inputs = nli_tokenizer(example["input_txt"], add_special_tokens=False, padding=True, max_length=512)
    inputs['labels'] = list(map(validity_map.get, example['Validity']))
    return inputs

In [21]:
train_dataset = train_dataset.map(preprocess, batched=True)
eval_dataset = eval_dataset.map(preprocess, batched=True)

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

In [58]:
def compute_metrics(pred):
    labels = list(map(inverse_validity_map.get, pred.label_ids))
    preds = list(map(inverse_validity_map.get, pred.predictions.argmax(-1)))
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    return {
        'recall' : recall,
        'precision': precision,
        'f1': f1,
    }

In [78]:
training_args = TrainingArguments(
    output_dir="vali_nli_trainer", 
    report_to=None,
    num_train_epochs=5,
    learning_rate=2e-05
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [72]:
trainer = Trainer(
    model=nli_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics
)

In [73]:
trainer.train()

The following columns in the training set  don't have a corresponding argument in `DebertaForSequenceClassification.forward` and have been ignored: input_txt, Novelty, Conclusion, Validity-Confidence, Novelty-Confidence, Premise, pred_label, topic, Validity, __index_level_0__.
***** Running training *****
  Num examples = 721
  Num Epochs = 5
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 455
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"


Step,Training Loss




Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=455, training_loss=0.04366605989225618, metrics={'train_runtime': 32.7035, 'train_samples_per_second': 110.233, 'train_steps_per_second': 13.913, 'total_flos': 252576689069250.0, 'train_loss': 0.04366605989225618, 'epoch': 5.0})

In [74]:
trainer.evaluate()

The following columns in the evaluation set  don't have a corresponding argument in `DebertaForSequenceClassification.forward` and have been ignored: input_txt, Novelty, Conclusion, Validity-Confidence, Novelty-Confidence, Premise, topic, Validity, __index_level_0__.
***** Running Evaluation *****
  Num examples = 199
  Batch size = 8


{'eval_loss': 2.8470425605773926,
 'eval_recall': 0.952,
 'eval_precision': 0.7391304347826086,
 'eval_f1': 0.8321678321678321,
 'eval_runtime': 0.5098,
 'eval_samples_per_second': 390.326,
 'eval_steps_per_second': 49.036,
 'epoch': 5.0}

In [77]:
preds = trainer.predict(eval_dataset)

The following columns in the test set  don't have a corresponding argument in `DebertaForSequenceClassification.forward` and have been ignored: input_txt, Novelty, Conclusion, Validity-Confidence, Novelty-Confidence, Premise, topic, Validity, __index_level_0__.
***** Running Prediction *****
  Num examples = 199
  Batch size = 8


## Fine-tune simple BERT model on the training data:

In [12]:
bert_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
bert_model     = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

In [13]:
train_dataset = Dataset.from_pandas(taska_training_df)
eval_dataset = Dataset.from_pandas(taska_valid_df)

In [14]:
validity_map = dict([ # avoid negative labels
    (1, 1), 
    (-1, 0)
])

In [15]:
def preprocess(example):
    inputs = nli_tokenizer(example["input_txt"], add_special_tokens=False, padding=True, truncation=True, max_length=512)
    inputs['labels'] = list(map(validity_map.get, example['Validity']))
    return inputs

In [16]:
train_dataset = train_dataset.map(preprocess, batched=True)
eval_dataset = eval_dataset.map(preprocess, batched=True)

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

In [17]:
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    return {
        'recall' : recall,
        'precision': precision,
        'f1': f1,
    }

In [18]:
training_args = TrainingArguments(
    output_dir="vali_bert_trainer", 
    report_to="none",
    num_train_epochs=5,
    learning_rate=2e-05
)

In [19]:
trainer = Trainer(
    model=bert_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics
)

In [20]:
trainer.train()

The following columns in the training set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: Validity, topic, Validity-Confidence, Novelty, pred_label, Conclusion, __index_level_0__, Premise, input_txt, Novelty-Confidence.
***** Running training *****
  Num examples = 721
  Num Epochs = 5
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 455


RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

In [None]:
trainer.evaluate()

In [None]:
preds = trainer.predict(eval_dataset)