# Japanese NLI model fine-tuning notebook
- We used Google Colaboratory (Tesla T4; fine-tuning took ~30 minutes)
- The table below summarizes the evaluation results
    - metric: overall accuracy \[%\]

| Model                                    | JNLI valid | JSICK test |
|:----------------------------------------:|:----------:|:----------:|
| ours JNLI+JSICK vanilla 1 epoch          | 87.8       | 83.8       |
| + epochs=3, lr=5e-5                      | 89.2       | 87.0       |
| + accumulation_steps=4                   | 90.9       | 89.0       |
| (Kurihara et al. 2022)                   | 91.9       | N/A        |
| (Yanaka and Mineshima 2022)              | N/A        | 89.1       |
| cross-lingual transfer w/ existing model | 59.6       | 44.7       |

## Setup

In [None]:
!nvidia-smi

Mon Oct 24 06:07:18 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   52C    P8    10W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
!pip install -q sentencepiece
!pip install -q transformers
!pip install -q datasets


In [None]:
import torch
torch.manual_seed(0)
import random
random.seed(0)
import numpy as np
np.random.seed(0)

## Datasets

In [None]:
!git clone https://github.com/verypluming/JSICK.git
!git clone https://github.com/yahoojapan/JGLUE.git

Cloning into 'JSICK'...
remote: Enumerating objects: 42, done.
remote: Counting objects: 100% (42/42), done.
remote: Compressing objects: 100% (19/19), done.
remote: Total 42 (delta 15), reused 28 (delta 13), pack-reused 0
Unpacking objects: 100% (42/42), done.
Cloning into 'JGLUE'...
remote: Enumerating objects: 66, done.
remote: Counting objects: 100% (66/66), done.
remote: Compressing objects: 100% (46/46), done.
remote: Total 66 (delta 13), reused 64 (delta 11), pack-reused 0
Unpacking objects: 100% (66/66), done.


In [None]:
import csv
import itertools
import json
import math

import torch
from tqdm.notebook import tqdm


def load(dataset, split):
  """Read premises, hypotheses, and gold labels for one split of JSICK or JGLUE (JNLI)."""
  print(f"# {dataset, split}")
  premises, hypotheses, gold_labels = [], [], []
  if dataset == "JSICK":
    # JSICK is a TSV file with one sentence pair per row.
    with open(f'{dataset}/jsick/{split}.tsv', encoding='utf-8') as file:
      for example in csv.DictReader(file, delimiter='\t'):
        premises.append(example["sentence_A_Ja"])
        hypotheses.append(example["sentence_B_Ja"])
        gold_labels.append(example["entailment_label_Ja"])
  elif dataset == "JGLUE":
    # JNLI is stored as JSON Lines: one JSON object per non-empty line.
    with open(f'{dataset}/datasets/jnli-v1.0/{split}-v1.0.json', encoding='utf-8') as file:
      for line in file:
        line = line.strip()
        if line == "":
          continue
        example = json.loads(line)
        premises.append(example["sentence1"])
        hypotheses.append(example["sentence2"])
        gold_labels.append(example["label"])
  return premises, hypotheses, gold_labels


def batch(iterable, size):
  """Return an iterator over tuples of `size` items; the last tuple is padded with None."""
  args = [iter(iterable)] * size
  return itertools.zip_longest(*args)


def accuracy(premises, hypotheses, gold_labels, batch_size, label_map_model):
  """Print overall accuracy; uses the global `tokenizer` and `model` set in the evaluation cells."""
  prediction_labels = []
  with torch.no_grad(), tqdm(total=math.ceil(len(gold_labels) / batch_size)) as pbar:
    for premise, hypothesis in zip(batch(premises, batch_size), batch(hypotheses, batch_size)):
      # Drop the None padding that `batch` adds to the last chunk.
      premise = [i for i in premise if i is not None]
      hypothesis = [i for i in hypothesis if i is not None]
      features = tokenizer(premise, hypothesis, padding=True, truncation=True, return_tensors="pt").to(device='cuda')
      scores = model(**features).logits
      # Map each predicted class index to its label string.
      prediction_labels += [label_map_model[int(score_max)] for score_max in scores.argmax(dim=1)]
      pbar.update()
  ncorrect, nsamples = 0, 0
  for pred, gold in zip(prediction_labels, gold_labels):
    ncorrect += int(pred == gold)
    nsamples += 1
  print(f"- accuracy: {ncorrect / nsamples}")


def evaluate(batch_size, label_map_model):
  """Evaluate the current global model on the JSICK test set and the JNLI validation set."""
  premises, hypotheses, gold_labels = load("JSICK", "test")
  accuracy(premises, hypotheses, gold_labels, batch_size, label_map_model)
  premises, hypotheses, gold_labels = load("JGLUE", "valid")
  accuracy(premises, hypotheses, gold_labels, batch_size, label_map_model)
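
To make the behaviour of the `batch` helper concrete, here is a small standalone sketch. The sample strings are made up for illustration and are not taken from JSICK or JNLI. It shows that the last tuple is padded with `None`, which is why `accuracy` filters out `None` before calling the tokenizer.

```python
import itertools

def batch(iterable, size):
  # Same helper as in the notebook: one shared iterator repeated `size` times.
  args = [iter(iterable)] * size
  return itertools.zip_longest(*args)

# Hypothetical toy input.
sentences = ["s1", "s2", "s3", "s4", "s5"]
chunks = list(batch(sentences, 2))
print(chunks)  # [('s1', 's2'), ('s3', 's4'), ('s5', None)]

# Drop the padding, as `accuracy` does before tokenizing.
cleaned = [[s for s in chunk if s is not None] for chunk in chunks]
print(cleaned)  # [['s1', 's2'], ['s3', 's4'], ['s5']]
```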

## Inference and Evaluation

### ours w/ ablation study

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained('/content/drive/My Drive/exp/nli/xlm-roberta-large-jsick-jglue-3epochs-accum4-lr5e5')
model = AutoModelForSequenceClassification.from_pretrained('/content/drive/My Drive/exp/nli/xlm-roberta-large-jsick-jglue-3epochs-accum4-lr5e5').cuda()
model.eval()
evaluate(512, {0:'contradiction', 1:'entailment', 2:'neutral'})

# ('JSICK', 'test')


  0%|          | 0/10 [00:00<?, ?it/s]

- accuracy: 0.8899939111020905
# ('JGLUE', 'valid')


  0%|          | 0/5 [00:00<?, ?it/s]

- accuracy: 0.9087921117502055


In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained('/content/drive/My Drive/exp/nli/xlm-roberta-large-jsick-jglue-3epochs-lr5e5')
model = AutoModelForSequenceClassification.from_pretrained('/content/drive/My Drive/exp/nli/xlm-roberta-large-jsick-jglue-3epochs-lr5e5').cuda()
model.eval()
evaluate(512, {0:'contradiction', 1:'entailment', 2:'neutral'})

# ('JSICK', 'test')


  0%|          | 0/10 [00:00<?, ?it/s]

- accuracy: 0.8699005480008118
# ('JGLUE', 'valid')


  0%|          | 0/5 [00:00<?, ?it/s]

- accuracy: 0.8923582580115037


In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained('/content/drive/My Drive/exp/nli/xlm-roberta-large-jsick-jglue-1epoch')
model = AutoModelForSequenceClassification.from_pretrained('/content/drive/My Drive/exp/nli/xlm-roberta-large-jsick-jglue-1epoch').cuda()
model.eval()
evaluate(512, {0:'contradiction', 1:'entailment', 2:'neutral'})

# ('JSICK', 'test')


  0%|          | 0/10 [00:00<?, ?it/s]

- accuracy: 0.8376293890805764
# ('JGLUE', 'valid')


  0%|          | 0/5 [00:00<?, ?it/s]

- accuracy: 0.8779786359901397


### cross-lingual transfer using existing multilingual model

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained('salesken/xlm-roberta-base-finetuned-mnli-cross-lingual-transfer')
model = AutoModelForSequenceClassification.from_pretrained('salesken/xlm-roberta-base-finetuned-mnli-cross-lingual-transfer').cuda()
model.eval()
evaluate(512, {0:'entailment', 1:'neutral', 2:'contradiction'})

# ('JSICK', 'test')


  0%|          | 0/10 [00:00<?, ?it/s]

- accuracy: 0.44651918002841484
# ('JGLUE', 'valid')


  0%|          | 0/5 [00:00<?, ?it/s]

- accuracy: 0.596138044371405
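
The evaluation cells print accuracy as a fraction, while the summary table at the top reports percentages rounded to one decimal. The following sketch redoes that conversion for the values printed above. The dictionary keys and the `to_percent` helper are names chosen here for illustration.

```python
# Accuracies copied from the cell outputs above: (JNLI valid, JSICK test).
printed = {
    "1 epoch": (0.8779786359901397, 0.8376293890805764),
    "3 epochs, lr=5e-5": (0.8923582580115037, 0.8699005480008118),
    "3 epochs, lr=5e-5, accum=4": (0.9087921117502055, 0.8899939111020905),
    "cross-lingual transfer": (0.596138044371405, 0.44651918002841484),
}

def to_percent(fraction):
  # Convert a fraction in [0, 1] to a percentage with one decimal place.
  return round(fraction * 100, 1)

table = {name: (to_percent(jnli), to_percent(jsick))
         for name, (jnli, jsick) in printed.items()}
for name, (jnli, jsick) in table.items():
  print(f"{name}: JNLI valid {jnli}, JSICK test {jsick}")
```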


## Fine-tuning

In [None]:
import csv, json
# sentence_transformers must be installed first (see the pip cells under "w/ gradient accumulation" and "w/o gradient accumulation").
from sentence_transformers import InputExample
# Class indices used for training; the evaluation cells use the inverse mapping.
label2int = {"contradiction": 0, "entailment": 1, "neutral": 2}
data = []
# JSICK training split (TSV).
with open('JSICK/jsick/train.tsv') as file:
  for example in csv.DictReader(file, delimiter='\t'):
    data.append(InputExample(texts=[example["sentence_A_Ja"], example["sentence_B_Ja"]], label=label2int[example["entailment_label_Ja"]]))
# JNLI training split (JSON Lines).
with open('JGLUE/datasets/jnli-v1.0/train-v1.0.json') as file:
  for line in file:
    line = line.strip()
    if line == "":
      continue
    example = json.loads(line)
    data.append(InputExample(texts=[example["sentence1"], example["sentence2"]], label=label2int[example["label"]]))
print(len(data))

25073
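
The evaluation cells pass `{0:'contradiction', 1:'entailment', 2:'neutral'}` as `label_map_model` for the models trained here. That mapping is the inverse of `label2int`. A minimal sketch of deriving it instead of retyping it follows; the name `int2label` is introduced here and does not appear in the notebook.

```python
label2int = {"contradiction": 0, "entailment": 1, "neutral": 2}

# Invert the training-time mapping to get the index-to-label map used at evaluation time.
int2label = {index: label for label, index in label2int.items()}
print(int2label)  # {0: 'contradiction', 1: 'entailment', 2: 'neutral'}

# The third-party cross-lingual model uses a different label order,
# so its map still has to be given explicitly, as in its evaluation cell.
```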


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### w/ gradient accumulation

In [None]:
!pip install -q git+https://github.com/hugoabonizio/sentence-transformers.git@feature/add-gradient-accumulation-crossencoder

In [None]:
from sentence_transformers.cross_encoder import CrossEncoder
from torch.utils.data import DataLoader
import math
model = CrossEncoder('xlm-roberta-large', num_labels=3)
train_dataloader = DataLoader(data, batch_size=32, shuffle=True)
model.fit(
    train_dataloader=train_dataloader,
    epochs=3,
    accumulation_steps=4,
    optimizer_params={'lr': 5e-5},
    warmup_steps=math.ceil(0.1 * len(data)),
    use_amp=True,
)
model.save('output/model')

Some weights of the model checkpoint at xlm-roberta-large were not used when initializing XLMRobertaForSequenceClassification: ['lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'roberta.pooler.dense.bias', 'lm_head.bias', 'roberta.pooler.dense.weight', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing XLMRobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at xlm-roberta-large and are newly initialized: ['classifier.den

Epoch:   0%|          | 0/3 [00:00<?, ?it/s]

Iteration:   0%|          | 0/196 [00:00<?, ?it/s]

Iteration:   0%|          | 0/196 [00:00<?, ?it/s]

Iteration:   0%|          | 0/196 [00:00<?, ?it/s]

In [None]:
model.save('/content/drive/My Drive/exp/nli/xlm-roberta-large-jsick-jglue-3epochs-accum4-lr5e5')
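
As a sanity check on the log above, the arithmetic behind gradient accumulation can be sketched as follows. It assumes that `accumulation_steps=4` in this fork means one optimizer update per four mini-batches. That reading is consistent with the 196 iterations per epoch in the log, but it has not been verified against the fork's source. The variable names are illustrative.

```python
import math

num_examples = 25073     # len(data) printed earlier
batch_size = 32          # DataLoader batch size
accumulation_steps = 4   # value passed to model.fit

# Mini-batches per epoch (the last one is smaller than batch_size).
mini_batches = math.ceil(num_examples / batch_size)

# Assumed: one optimizer update per `accumulation_steps` mini-batches.
updates_per_epoch = math.ceil(mini_batches / accumulation_steps)
effective_batch_size = batch_size * accumulation_steps

print(mini_batches, updates_per_epoch, effective_batch_size)  # 784 196 128
```

Under that assumption, each update averages gradients over roughly 128 examples instead of 32, without increasing peak GPU memory.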

### w/o gradient accumulation

In [None]:
!pip install -q sentence-transformers

  Building wheel for sentence-transformers (setup.py) ... done


In [None]:
from sentence_transformers.cross_encoder import CrossEncoder
from torch.utils.data import DataLoader
import math
model = CrossEncoder('xlm-roberta-large', num_labels=3)
train_dataloader = DataLoader(data, batch_size=32, shuffle=True)
model.fit(
    train_dataloader=train_dataloader,
    epochs=3,
    optimizer_params={'lr': 5e-5},
    warmup_steps=math.ceil(0.1 * len(data)),
    show_progress_bar=True,
    use_amp=True,
)
model.save('output/model')

Downloading config.json:   0%|          | 0.00/616 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/2.09G [00:00<?, ?B/s]

Some weights of the model checkpoint at xlm-roberta-large were not used when initializing XLMRobertaForSequenceClassification: ['lm_head.bias', 'lm_head.layer_norm.weight', 'roberta.pooler.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'roberta.pooler.dense.bias', 'lm_head.dense.weight']
- This IS expected if you are initializing XLMRobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at xlm-roberta-large and are newly initialized: ['classifier.den

Downloading sentencepiece.bpe.model:   0%|          | 0.00/4.83M [00:00<?, ?B/s]

Downloading tokenizer.json:   0%|          | 0.00/8.68M [00:00<?, ?B/s]

Epoch:   0%|          | 0/3 [00:00<?, ?it/s]

Iteration:   0%|          | 0/784 [00:00<?, ?it/s]

Iteration:   0%|          | 0/784 [00:00<?, ?it/s]

Iteration:   0%|          | 0/784 [00:00<?, ?it/s]

In [None]:
model.save('/content/drive/My Drive/exp/nli/xlm-roberta-large-jsick-jglue-3epochs-lr5e5')

In [None]:
from sentence_transformers.cross_encoder import CrossEncoder
from torch.utils.data import DataLoader
import math
model = CrossEncoder('xlm-roberta-large', num_labels=3)
train_dataloader = DataLoader(data, batch_size=32, shuffle=True)
model.fit(
    train_dataloader=train_dataloader,
    epochs=1,
    warmup_steps=math.ceil(0.1 * len(data)),
    show_progress_bar=True,
    use_amp=True,
)
model.save('output/model')

Some weights of the model checkpoint at xlm-roberta-large were not used when initializing XLMRobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'lm_head.layer_norm.weight', 'roberta.pooler.dense.weight', 'lm_head.bias', 'lm_head.layer_norm.bias', 'lm_head.dense.weight', 'lm_head.decoder.weight', 'lm_head.dense.bias']
- This IS expected if you are initializing XLMRobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at xlm-roberta-large and are newly initialized: ['classifier.out

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Iteration:   0%|          | 0/784 [00:00<?, ?it/s]

In [None]:
model.save('/content/drive/My Drive/exp/nli/xlm-roberta-large-jsick-jglue-1epoch')
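
One detail worth checking in the `fit` calls is `warmup_steps=math.ceil(0.1 * len(data))`, which takes 10% of the number of training examples rather than of the training steps. The sketch below compares that value with the step counts shown in the logs for the runs without gradient accumulation. The variable names and the alternative "10% of total steps" rule are illustrative assumptions; they are not what produced the reported results.

```python
import math

num_examples = 25073   # len(data) printed earlier
batch_size = 32

steps_per_epoch = math.ceil(num_examples / batch_size)  # 784, as in the logs
warmup_as_written = math.ceil(0.1 * num_examples)       # value used in the notebook

for epochs in (1, 3):
  total_steps = steps_per_epoch * epochs
  # Hypothetical alternative: warm up over 10% of the total training steps.
  warmup_by_steps = math.ceil(0.1 * total_steps)
  print(epochs, total_steps, warmup_as_written, warmup_by_steps)
```

With these numbers the warmup length as written exceeds the total number of steps in both runs, so under a linear warmup schedule the learning rate would still be ramping up when training ends.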