### Task: Multi-genre natural language inference

Does the premise entail the hypothesis? Three classes: entailment, neutral, contradiction.

- Example 1
  - Premise: The Old One always comforted Ca'daan, except today.
  - Hypothesis: Ca'daan is turning 36 years old tomorrow.
  - Label (entailment, neutral, contradiction)? ${\color{white} {Neutral.}}$
- Example 2
  - Premise: He turned and saw Jon sleeping in his half-tent.
  - Hypothesis: He saw Jon was asleep.
  - Label (entailment, neutral, contradiction)? ${\color{white} {Entailment.}}$
- Example 3
  - Premise (incomplete telephone speech): yes now you know if if everybody like in August when everybody's on vacation or something we can dress a little more casual or
  - Hypothesis: August is a black out month for vacations in the company.
  - Label (entailment, neutral, contradiction)? ${\color{white} {Contradiction.}}$

Notes
- MNLI is difficult; only recently did machine performance approach human performance.
  - Sometimes requires complex logical reasoning.
  - Sometimes requires commonsense reasoning.
  - Sometimes ambiguous; even humans disagree.
- MNLI covers many domains.
- MNLI is a good pretraining/intermediate-training task for transfer learning on some natural language understanding (NLU) tasks.

- References
  - Data loading is partly based on Imran's (CDS MS '20) research codebase.
  - MNLI dataset paper: https://cims.nyu.edu/~sbowman/multinli/paper.pdf.
  - MNLI as a transfer learning intermediate-training task: https://arxiv.org/pdf/2005.00628.pdf.
  - MNLI's current top machine performances as well as human performance:  https://gluebenchmark.com/leaderboard.

Today
- Using simple sequence modeling to do this task!

### Prereqs

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
import os
import sys

import time
import datetime

import logging
import torch
import torch.optim as optim
import torch.nn as nn
from torch.optim import lr_scheduler
from torchtext import data as torchtext_data
from torchtext import datasets as torchtext_datasets

In [4]:
!nvidia-smi

Wed Sep 30 22:54:15 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.23.05    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P0    23W / 300W |      0MiB / 16130MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

### Loading and processing the MNLI dataset using torchtext

Link: https://torchtext.readthedocs.io/en/latest/index.html

What do we need to do, really? Many steps!

Obtain sentences
- Load the dataset
- Tokenize the sentences

Obtain vocab and optionally, embeddings
- Obtain the vocabulary
- Optional: load pretrained embeddings
- Turn all the tokens into vectors

Obtain mini-batches
- Turn examples into mini-batches (what's the batch size?)
  - This step requires shuffling
  - This step probably requires padding (to be discussed further, soon) 
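
The steps above can be sketched without torchtext. This is a minimal, hypothetical illustration in plain Python: the toy sentences are made up, whitespace splitting stands in for the spaCy tokenizer, and the `<unk>`/`<pad>` indices (0 and 1) are chosen to resemble what a torchtext vocabulary typically looks like rather than read from the real one.

```python
from collections import Counter

# Toy sentences (hypothetical; not taken from MNLI).
sentences = [
    "he saw jon sleeping",
    "jon was asleep",
    "he turned",
]

# 1. Tokenize (whitespace split stands in for the spaCy tokenizer).
tokenized = [s.split() for s in sentences]

# 2. Build the vocabulary: special tokens first, then words by frequency
#    (ties broken alphabetically so the result is deterministic).
counts = Counter(tok for sent in tokenized for tok in sent)
itos = ["<unk>", "<pad>"] + [
    word for word, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
]
stoi = {word: index for index, word in enumerate(itos)}

# 3. Numericalize: map each token to its integer id.
ids = [[stoi.get(tok, stoi["<unk>"]) for tok in sent] for sent in tokenized]

# 4. Pad every sequence to the longest one so the batch is rectangular.
max_len = max(len(seq) for seq in ids)
batch = [seq + [stoi["<pad>"]] * (max_len - len(seq)) for seq in ids]

for row in batch:
    print(row)
```

Each row of `batch` plays the role of one row of `batch.premise` or `batch.hypothesis` later in the notebook; the embedding lookup then turns every id into a vector.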

In [5]:
args = {
    'batch_size': 128,
    'dataset': 'mnli',
    'dropout': 0.3,
    'device': 'cuda',
    'embed_dim': 100,
    'hidden_size': 256,
    'model': 'bilstm',
    'num_epochs': 20,
    'num_layers': 2,
    'lr': 0.001,
    'results_dir': '/content/drive/My Drive/colabs/prep-lab3-nli/checkpoints',
}

Data loading and processing
- No need to examine in detail, unless you want to learn about torchtext

## STEP 1 part 1: Convert input texts to batched inputs.
- Now, we first convert input texts to batched inputs.
- We will convert from batched inputs to embedded inputs **later**.

In [8]:

# Dev set origin in this colab: https://www.kaggle.com/c/multinli-matched-open-evaluation
# Test set origin in this colab: https://www.kaggle.com/c/multinli-mismatched-open-evaluation
# Not using self.text_field.vocab.load_vectors('glove.840B.300d'), but we're
# using a smaller "embedding matrix" instead.

class DatasetMNLI():
  def __init__(self):
    self.text_field = torchtext_data.Field(
      lower=True, tokenize='spacy', batch_first=True)
    self.label_field = torchtext_data.Field(
      sequential=False, unk_token = None, is_target=True)
    self.train, self.dev, self.test = torchtext_datasets.MultiNLI.splits(
      self.text_field, self.label_field,
      root='/content/drive/My Drive/colabs/prep-lab3-nli/.data')
    self.text_field.build_vocab(self.train, self.dev)
    self.label_field.build_vocab(self.train)

    # NOTE: update the path, after you create copies in your own google drive
    vector_cache_path = '/content/drive/My Drive/colabs/prep-lab3-nli/.vector_cache/multinli_vectors.pt'
    if os.path.isfile(vector_cache_path):
      self.text_field.vocab.vectors = torch.load(vector_cache_path)
    else:
      self.text_field.vocab.load_vectors('glove.6B.100d')
      # exist_ok=True avoids an error if the cache directory already exists.
      os.makedirs(os.path.dirname(vector_cache_path), exist_ok=True)
      torch.save(self.text_field.vocab.vectors, vector_cache_path)
		
    (self.train_iter, self.dev_iter,
     self.test_iter) = torchtext_data.Iterator.splits(
      (self.train, self.dev, self.test),
      batch_size=args['batch_size'], device=args['device'],
      sort_key=lambda x: len(x.premise), sort_within_batch=False, shuffle=True)

  def vocab_size(self):
    return len(self.text_field.vocab)

  def out_dim(self):
    return len(self.label_field.vocab)

  def labels(self):
    return self.label_field.vocab.stoi

In [9]:
mnli_dataset = DatasetMNLI()

In [10]:
mnli_dataset

<__main__.DatasetMNLI at 0x7f881c60c5c0>

In [11]:
len(mnli_dataset.train_iter)

3068

In [None]:
k = 0
for x in mnli_dataset.dev_iter:
  print(x)
  k += 1
  if k > 10:
    break

Finally, we have our dataset iterators
- mnli_dataset.train_iter
- mnli_dataset.dev_iter
- mnli_dataset.test_iter

Let's examine batch number 36 of the dev set.
- What attributes does the batch contain?
- What are the shapes of batch_36.premise, batch_36.hypothesis, batch_36.label? 

In [None]:
import itertools
# https://docs.python.org/3/library/itertools.html#itertools.islice
batch_36 = next(itertools.islice(mnli_dataset.dev_iter, 36, None))  # to get batch #36
batch_36


[torchtext.data.batch.Batch of size 128 from MULTINLI]
	[.premise]:[torch.cuda.LongTensor of size 128x18 (GPU 0)]
	[.hypothesis]:[torch.cuda.LongTensor of size 128x26 (GPU 0)]
	[.label]:[torch.cuda.LongTensor of size 128 (GPU 0)]

In [None]:
batch_36 = next(itertools.islice(mnli_dataset.dev_iter, 36, None))
print('premise:', batch_36.premise)
print('label:', batch_36.label)

premise: tensor([[   70,    62,    18,  ...,  5764,  1945,     3],
        [  405,     9,    11,  ...,    13, 13340,     3],
        [   29,   153,    29,  ...,    24,  1292,     3],
        ...,
        [   10,  1077,    12,  ...,    15,    84,     2],
        [   56,   115,   118,  ...,   548,    39,    13],
        [  126,    28,    15,  ...,    41,   370,     3]], device='cuda:0')
label: tensor([0, 2, 1, 2, 2, 1, 1, 1, 2, 0, 0, 1, 2, 1, 2, 1, 2, 2, 1, 2, 1, 0, 2, 0,
        2, 2, 2, 0, 0, 2, 1, 0, 0, 1, 1, 2, 2, 0, 1, 2, 2, 0, 0, 1, 1, 0, 2, 2,
        1, 1, 0, 1, 0, 2, 0, 1, 1, 2, 1, 0, 1, 0, 0, 0, 1, 2, 1, 2, 1, 0, 0, 1,
        0, 2, 2, 2, 2, 0, 2, 1, 1, 2, 2, 0, 1, 0, 1, 0, 0, 2, 1, 0, 0, 1, 2, 0,
        2, 1, 1, 0, 2, 0, 2, 2, 0, 2, 0, 2, 2, 2, 0, 1, 2, 2, 2, 1, 1, 0, 0, 1,
        2, 1, 0, 1, 1, 1, 1, 1], device='cuda:0')


In [None]:
id2token = mnli_dataset.text_field.vocab.itos

In [None]:
def convert_to_text_label(tensor_label):
  int_label = int(tensor_label)
  if int_label == 0:
    return 'contradiction'
  elif int_label == 1:
    return 'neutral'
  else:
    assert int_label == 2
    return 'entailment'

for i in range(len(batch_36)):
  print('premise: ', ' '.join(
      [id2token[tokenid] for tokenid in (list(batch_36.premise))[i]]))
  print('hypothesis: ', ' '.join(
      [id2token[tokenid] for tokenid in (list(batch_36.hypothesis))[i]]))
  print('label: ', convert_to_text_label(batch_36.label[i]), '\n')

premise:  when people are late , it makes it hard to keep things working in a rational fashion .
hypothesis:  it does n't matter at what time do people arrive . <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad>
label:  contradiction 

premise:  maybe in that sense , the behavior of the pippens and iversons of the world is defensible .
hypothesis:  in one sense their antics are justifiable . <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad>
label:  entailment 

premise:  as long as assad lives , he can manage these troubles and keep an agreement with israel .
hypothesis:  as long as assad does n't die in the hurricane , he will still be able to take care of the trouble with the soldiers .
label:  neutral 

premise:  6see also internal control management and evaluation tool ( gao-01 - 1008 g , august 2001 ) .
hypothesis:  the tool for control management . <pad> <pad> <pad> <pad> <pad> <pad> 

Observations
- Why are there paddings in hypotheses, but no paddings in premises?
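
One way to probe this question is to simulate the batching on toy data. The sketch below assumes that the dev iterator orders examples by `sort_key` (premise length) before cutting them into batches; that is my reading of the legacy torchtext `Iterator` behaviour for non-training splits, not something this notebook states. The examples and the helpers `pad_to_longest` and `count_pads` are hypothetical.

```python
# Hypothetical (premise, hypothesis) token lists; not real MNLI examples.
examples = [
    (["a", "b", "c"], ["x"]),
    (["a", "b"], ["x", "y", "z"]),
    (["a", "b", "c"], ["x", "y", "z", "w"]),
    (["a", "b"], ["x"]),
]
PAD = "<pad>"


def pad_to_longest(seqs):
    """Right-pad every sequence to the length of the longest one."""
    width = max(len(s) for s in seqs)
    return [s + [PAD] * (width - len(s)) for s in seqs]


def count_pads(seqs):
    """Count pad tokens in a list of padded sequences."""
    return sum(tok == PAD for s in seqs for tok in s)


# Sort by premise length only, then cut into consecutive batches.
ordered = sorted(examples, key=lambda ex: len(ex[0]))
batch_size = 2
batches = [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]

premise_pads, hypothesis_pads = [], []
for b in batches:
    premise_pads.append(count_pads(pad_to_longest([p for p, _ in b])))
    hypothesis_pads.append(count_pads(pad_to_longest([h for _, h in b])))

print("pads per batch (premise):   ", premise_pads)
print("pads per batch (hypothesis):", hypothesis_pads)
```

Because only the premise length is used for ordering, premises in a batch tend to share a length, while hypothesis lengths still vary and need padding.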

## STEP 1 part 2 (batched input -> embedded input)
## STEP 2 (feed embedded inputs into RNN/LSTM/GRU)
## STEP 3 (classification layers)

### Model

Refer to Google doc.

Now, we are defining our models using a class. 
- Q: how do we know what the model looks like? A: read the forward() function!



In [None]:
# model = nn.Sequential(nn.Linear(5,10), nn.Tanh(), nn.Linear(...))
# y = model(x)

In [None]:
# From now on, consider writing models in this format.
class ModelBiLSTM(nn.Module):
  def __init__(self, options):
    pass
  def forward(self, batch):
    pass

In [None]:
class ModelBiLSTM(nn.Module):
  def __init__(self, options):
    # All the parameters to the model class & all the layers, go into __init__.
    # We define the layers here, and we **call** the layers later, in the
    # forward() function.
    super(ModelBiLSTM, self).__init__()
    self.device = args['device']
    self.embed_dim = args['embed_dim']  # using GloVe
    self.hidden_size = args['hidden_size']
    self.num_layers = args['num_layers']
    self.embedding = nn.Embedding.from_pretrained(
      torch.load('/content/drive/My Drive/colabs/prep-lab3-nli/.vector_cache/multinli_vectors.pt'))
    self.directions = 2

    # Layers below: LeakyReLU, dropout, LSTM, linear layers.
    # (There is no separate projection layer; embeddings go straight into the LSTM.)
    self.relu = nn.LeakyReLU()  # non-linear layer
    self.dropout = nn.Dropout(p=args['dropout'])  # prob of an elt to be zeroed

    self.lstm = nn.LSTM(  # LSTM layer; Q: why put it here, within __init__?
      self.embed_dim, self.hidden_size, self.num_layers,
      dropout=args['dropout'], bidirectional=True,
      batch_first=True)  # BATCH FIRST!!!!!!!!!!!!!!!!!!!!!!!!!!!

    self.linear_first = nn.Linear(
      self.hidden_size * self.directions * 4, self.hidden_size)  # why 4?
    self.linear_second = nn.Linear(self.hidden_size, self.hidden_size)
    self.linear_third = nn.Linear(self.hidden_size, options['out_dim'])
    for layer in [self.linear_first, self.linear_second, self.linear_third]:
      nn.init.xavier_uniform_(layer.weight)
      nn.init.zeros_(layer.bias)
    # These are the linear classification layers,
    # with nonlinearities in between.
    self.layers = nn.Sequential(
      self.linear_first, self.relu, self.dropout,
      self.linear_second, self.relu, self.dropout,
      self.linear_third,
    )

  def forward(self, batch):
    # This is the forward pass of our model.

    # STEP 1 (part 2): batched inputs -> embedded input (which will be fed into LSTM)

    # Q: shape of batch.premise (which is the token inputs)? batch_size * input_len_premise
    # Q: what does batch.premise[i][j] mean?

    # Q: shape of premise_embed (which will be fed into LSTM layers)?

    premise_embed = self.embedding(batch.premise)
    hypothesis_embed = self.embedding(batch.hypothesis)


    # premise_embed.shape: batch size * premise input len * embed dim

    # STEP 2: LSTM
    # Q: what are the parameters to LSTM (when we're defining the LSTM)?
    # Q: what are the inputs to the LSTM layers?
    # Q: what are the outputs from the LSTM layers?
    # https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html

    # Q: what is self.lstm?
    # If we write nn.LSTM(...)(premise_embed) and
    # nn.LSTM(...)(hypothesis_embed), what will be different?
    premise_out, (premise_ht, _) = self.lstm(premise_embed, None)
    hypothesis_out, (hypothesis_ht, _) = self.lstm(hypothesis_embed, None)
    
    # Q: what is the shape of premise_out and premise_ht?

    # Equivalently, the above code can be written as
    # h0 = torch.zeros((self.num_layers * self.directions, batch.batch_size,
    #                   self.hidden_size)).to(self.device)
    # c0 = torch.zeros((self.num_layers * self.directions, batch.batch_size,
    #                   self.hidden_size)).to(self.device)
    # premise_out, (premise_ht, _) = self.lstm(premise_embed, (h0, c0))
    # hypothesis_out, (hypothesis_ht, _) = self.lstm(hypothesis_embed, (h0, c0))    



    # STEP 3: linear layers for classification

    # Q: how do we convert premise_out and hypothesis_out into the inputs to linear layers?
    # Q: how do we design the linear layers?

    premise = premise_out[:, -1, :]
    hypothesis = hypothesis_out[:, -1, :]

    combined = torch.cat(
      (premise,  # Q: shape? A: batch_size * (num_directions * hidden_size)
       hypothesis,  # Q: shape?
       torch.abs(premise - hypothesis),
       premise * hypothesis),  # Q: shape?
      1)  # Q: what are we doing here and what's 1?

    # Q: what's the shape of combined? 

    return self.layers(combined)
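
To check the shape questions in the comments above, here is a small standalone sketch. The sizes (batch 4, length 7, embedding 8, hidden 16) are made up for speed and are not the notebook's values, and the random tensors stand in for `premise_embed` and the hypothesis vector. It relies only on the documented `nn.LSTM` return values.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, seq_len, embed_dim, hidden_size, num_layers = 4, 7, 8, 16, 2

lstm = nn.LSTM(embed_dim, hidden_size, num_layers,
               bidirectional=True, batch_first=True)
lstm.eval()

x = torch.randn(batch_size, seq_len, embed_dim)  # stands in for premise_embed
with torch.no_grad():
    out, (h_n, c_n) = lstm(x)

# out: one vector per time step; forward and backward halves are concatenated.
print(out.shape)  # (batch_size, seq_len, 2 * hidden_size)
# h_n: final hidden state per layer and direction; this one is NOT batch-first.
print(h_n.shape)  # (num_layers * 2, batch_size, hidden_size)

# The forward half at the last step is the forward final state of the top layer;
# the backward half at the first step is the backward final state.
same_forward = torch.allclose(out[:, -1, :hidden_size], h_n[-2])
same_backward = torch.allclose(out[:, 0, hidden_size:], h_n[-1])
print(same_forward, same_backward)

# Combining two sentence vectors the same way as forward() does.
premise = out[:, -1, :]
hypothesis = torch.randn_like(premise)  # stand-in for the hypothesis vector
combined = torch.cat(
    (premise, hypothesis, torch.abs(premise - hypothesis), premise * hypothesis), 1)
print(combined.shape)  # (batch_size, 4 * 2 * hidden_size)
```

With `hidden_size` 256, the concatenation has 4 * 2 * 256 = 2048 features, which is the input size of `linear_first`. Note that at the last time step the backward half of `out` has processed only one token, and with right-padded inputs that last position can be a pad token.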

In [None]:
class ModelRNN(nn.Module):
  def __init__(self, options):
    super(ModelRNN, self).__init__()
    self.device = args['device']
    self.embed_dim = args['embed_dim']  # using GloVe
    self.hidden_size = args['hidden_size']
    self.num_layers = args['num_layers']
    self.embedding = nn.Embedding.from_pretrained(
      torch.load('/content/drive/My Drive/colabs/prep-lab3-nli/.vector_cache/multinli_vectors.pt'))
    self.directions = 1  # the nn.RNN below is unidirectional

    # Layers below: LeakyReLU, dropout, RNN, linear layers.
    self.relu = nn.LeakyReLU()  # non-linear layer
    self.dropout = nn.Dropout(p=args['dropout'])  # prob of an elt to be zeroed

    self.rnn = nn.RNN(
      self.embed_dim, self.hidden_size, self.num_layers,
      dropout=args['dropout'], batch_first=True)

    self.linear_first = nn.Linear(
      self.hidden_size * 4, self.hidden_size)  # why 4?
    self.linear_second = nn.Linear(self.hidden_size, self.hidden_size)
    self.linear_third = nn.Linear(self.hidden_size, options['out_dim'])
    for layer in [self.linear_first, self.linear_second, self.linear_third]:
      nn.init.xavier_uniform_(layer.weight)
      nn.init.zeros_(layer.bias)

    # Stacking all the layers together.
    self.layers = nn.Sequential(
      self.linear_first, self.relu, self.dropout,
      self.linear_second, self.relu, self.dropout,
      self.linear_third,
    )

  def forward(self, batch):
    # STEP 1: token inputs -> embedded input
  
    premise_embed = self.embedding(batch.premise)
    hypothesis_embed = self.embedding(batch.hypothesis)


    # STEP 2: RNN
    
    h0 = torch.zeros((self.num_layers, batch.batch_size,
                      self.hidden_size)).to(self.device)
    premise_out, premise_ht = self.rnn(premise_embed, h0)
    hypothesis_out, hypothesis_ht = self.rnn(hypothesis_embed, h0)


    # STEP 3: linear layers for classification 

    premise = premise_out[:, -1, :]
    hypothesis = hypothesis_out[:, -1, :]

    combined = torch.cat((premise, hypothesis, torch.abs(premise - hypothesis), premise * hypothesis), 1)
    return self.layers(combined)

### Training

In [None]:
def get_device(gpu_id):
  if torch.cuda.is_available():
    torch.cuda.set_device(gpu_id)
    return torch.device('cuda:{}'.format(gpu_id))
  else:
    return torch.device('cpu')

If you switch between models (RNN vs. BiLSTM), change the if True / if False line in the cell below and rerun the cell.

In [None]:

device = get_device(0)

model_options = {
    'out_dim': mnli_dataset.out_dim(),
    'dropout': args['dropout'],
    'hidden_size': args['hidden_size'],
    'device': device,
    'dataset': args['dataset'],
}


best_val_acc = -999.0
if False:  # to change
  model = ModelBiLSTM(model_options)
  args['model'] = 'bilstm'
  args['lr'] = 0.001
else:  # to change
  model = ModelRNN(model_options)
  args['model'] = 'rnn'
  args['lr'] = 0.0005
optimizer = optim.Adam(model.parameters(), lr=args['lr'])
scheduler = lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)
criterion = nn.CrossEntropyLoss(reduction='sum')

model.to(device)


def get_logger(phase):
  dir_path = "{}/{}/{}/".format(
      args['results_dir'], args['model'], args['dataset'])
  file_path = "{}{}.log".format(dir_path, phase)
  if not os.path.exists(dir_path):
    os.makedirs(dir_path)
  logging.basicConfig(
    level=logging.INFO,
    filename=file_path,
    format='%(asctime)s - %(message)s', datefmt='%d-%b-%y %H:%M:%S')
  return logging.getLogger(phase)

logger = get_logger("train")
logger.info("Arguments: {}".format(args))


def train():
  model.train()
  mnli_dataset.train_iter.init_epoch()
  n_correct, n_total, n_loss = 0, 0, 0.0
  for batch_id, batch in enumerate(mnli_dataset.train_iter):
    prediction = model(batch)  # one of the important steps 
    loss = criterion(prediction, batch.label) # one of the important steps

    # Computing training accuracy below.
    n_correct += (torch.max(prediction, 1)[1].view(
        batch.label.size()) == batch.label).sum().item()
    n_total += batch.batch_size
    n_loss += loss.item()
		
    optimizer.zero_grad()  # one of the important steps
    loss.backward()  # one of the important steps
    optimizer.step()  # one of the important steps

    if batch_id % 300 == 0:
      print('| Epoch {:3d} | step {:5d} | epoch-cumulative train loss {:5.2f} | train acc {:5.4f} |'.format(
          epoch, batch_id, n_loss/n_total, n_correct/n_total))

  train_loss = n_loss/n_total
  train_acc = n_correct/n_total
  return train_loss, train_acc

def validate():
  model.eval()
  mnli_dataset.dev_iter.init_epoch()
  n_correct, n_total, n_loss = 0, 0, 0.0
  with torch.no_grad():
    for batch_id, batch in enumerate(mnli_dataset.dev_iter):
      prediction = model(batch)
      loss = criterion(prediction, batch.label)

      # Computing dev accuracy below.
      n_correct += (torch.max(prediction, 1)[1].view(
          batch.label.size()) == batch.label).sum().item()
      n_total += batch.batch_size
      n_loss += loss.item()
    val_loss = n_loss / n_total
    val_acc = n_correct / n_total
    return val_loss, val_acc
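
The `StepLR` settings above (`step_size=10`, `gamma=0.5`) halve the learning rate every 10 epochs. The sketch below reproduces that schedule with plain arithmetic instead of calling PyTorch; `step_lr` is a hypothetical helper, and it assumes `scheduler.step()` is called once at the end of every epoch.

```python
def step_lr(initial_lr, epoch, step_size=10, gamma=0.5):
    """Learning rate in effect during `epoch` (0-indexed) under step decay."""
    return initial_lr * gamma ** (epoch // step_size)


# BiLSTM setting from this notebook: lr=0.001, 20 epochs.
schedule = [step_lr(0.001, epoch) for epoch in range(20)]

for epoch in (0, 9, 10, 19):
    print("epoch", epoch, "lr", schedule[epoch])
```

With 20 epochs the rate changes only once, at epoch 10, so the second half of training runs at half the initial rate.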

In [None]:
device

device(type='cuda', index=0)

In [None]:
# RNN
# Again, this is RNN!!!

print("Training starts!")

best_val_acc = -999.0

for epoch in range(args['num_epochs']):
  start = time.time()

  train_loss, train_acc = train()
  val_loss, val_acc = validate()
  scheduler.step()
    
  time_duration = time.time() - start

  # Save checkpoints and update best accuracy.
  if best_val_acc < 0 or val_acc > best_val_acc:
    best_val_acc = val_acc
    torch.save({
      'accuracy': best_val_acc,
      'options': model_options,
      'model_state_dict': model.state_dict(),
      'optimizer_state_dict': optimizer.state_dict(),  # if you want to resume training / do fine-tuning later
    }, '{}/{}/{}/best-params.pt'.format(
        args['results_dir'], args['model'], args['dataset']))
    logger.info('| Epoch {:3d} | train loss {:5.2f} | train acc {:5.2f} | val loss {:5.2f} | val acc {:5.2f} | time: {:5.2f}s |'.format(
        epoch, train_loss, train_acc, val_loss, val_acc, time_duration))

  print('| Epoch {:3d} | train loss {:5.2f} | train acc {:5.2f} | val loss {:5.2f} | val acc {:5.4f} | time: {:5.2f}s |'.format(
    epoch, train_loss, train_acc, val_loss, val_acc, time_duration))

logger.info("Training completed!\n\n")

Training starts!
| Epoch   0 | step     0 | epoch-cumulative train loss  1.09 | train acc 0.3906 |
| Epoch   0 | step   300 | epoch-cumulative train loss  1.10 | train acc 0.3330 |
| Epoch   0 | step   600 | epoch-cumulative train loss  1.10 | train acc 0.3405 |
| Epoch   0 | step   900 | epoch-cumulative train loss  1.10 | train acc 0.3437 |
| Epoch   0 | step  1200 | epoch-cumulative train loss  1.10 | train acc 0.3440 |
| Epoch   0 | step  1500 | epoch-cumulative train loss  1.10 | train acc 0.3452 |
| Epoch   0 | step  1800 | epoch-cumulative train loss  1.10 | train acc 0.3467 |
| Epoch   0 | step  2100 | epoch-cumulative train loss  1.10 | train acc 0.3466 |
| Epoch   0 | step  2400 | epoch-cumulative train loss  1.10 | train acc 0.3469 |
| Epoch   0 | step  2700 | epoch-cumulative train loss  1.10 | train acc 0.3479 |
| Epoch   0 | step  3000 | epoch-cumulative train loss  1.10 | train acc 0.3481 |
| Epoch   0 | train loss  1.10 | train acc  0.35 | val loss  1.10 | val acc 0.310
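
A reference point for reading the log above: a classifier that assigns probability 1/3 to each of the three classes has a cross-entropy of ln 3 per example. The check below is plain arithmetic and is not part of the notebook's training code.

```python
import math

num_classes = 3

# Cross-entropy of a uniform prediction over three classes.
chance_loss = -math.log(1.0 / num_classes)
# Accuracy of random guessing, if the classes are roughly balanced.
chance_accuracy = 1.0 / num_classes

print(round(chance_loss, 4), round(chance_accuracy, 4))
```

Since the reported loss is a per-example average (summed loss divided by `n_total`), a value of about 1.10 with accuracy near 0.35 is close to chance, which suggests this vanilla RNN had learned little by the logged steps.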

In [None]:
# BiLSTM

print("Training starts!")

best_val_acc = -999.0

for epoch in range(args['num_epochs']):
  start = time.time()

  train_loss, train_acc = train()
  val_loss, val_acc = validate()
  scheduler.step()
    
  time_duration = time.time() - start

  # Save checkpoints and update best accuracy.
  if best_val_acc < 0 or val_acc > best_val_acc:
    best_val_acc = val_acc
    torch.save({
      'accuracy': best_val_acc,
      'options': model_options,
      'model_state_dict': model.state_dict(),
      'optimizer_state_dict': optimizer.state_dict(),  # if you want to resume training / do fine-tuning later
    }, '{}/{}/{}/best-params.pt'.format(
        args['results_dir'], args['model'], args['dataset']))
    logger.info('| Epoch {:3d} | train loss {:5.2f} | train acc {:5.2f} | val loss {:5.2f} | val acc {:5.2f} | time: {:5.2f}s |'.format(
        epoch, train_loss, train_acc, val_loss, val_acc, time_duration))

  print('| Epoch {:3d} | train loss {:5.2f} | train acc {:5.2f} | val loss {:5.2f} | val acc {:5.4f} | time: {:5.2f}s |'.format(
    epoch, train_loss, train_acc, val_loss, val_acc, time_duration))

logger.info("Training completed!\n\n")

Training starts!
| Epoch   0 | step     0 | epoch-cumulative train loss  1.10 | train acc 0.3281 |
| Epoch   0 | step   300 | epoch-cumulative train loss  1.10 | train acc 0.3412 |
| Epoch   0 | step   600 | epoch-cumulative train loss  1.10 | train acc 0.3503 |
| Epoch   0 | step   900 | epoch-cumulative train loss  1.10 | train acc 0.3583 |
| Epoch   0 | step  1200 | epoch-cumulative train loss  1.09 | train acc 0.3692 |
| Epoch   0 | step  1500 | epoch-cumulative train loss  1.08 | train acc 0.3881 |
| Epoch   0 | step  1800 | epoch-cumulative train loss  1.06 | train acc 0.4088 |
| Epoch   0 | step  2100 | epoch-cumulative train loss  1.05 | train acc 0.4260 |
| Epoch   0 | step  2400 | epoch-cumulative train loss  1.03 | train acc 0.4408 |
| Epoch   0 | step  2700 | epoch-cumulative train loss  1.02 | train acc 0.4532 |
| Epoch   0 | step  3000 | epoch-cumulative train loss  1.01 | train acc 0.4645 |
| Epoch   0 | train loss  1.01 | train acc  0.47 | val loss  0.92 | val acc 0.557

In [None]:
!nvidia-smi

Fri Sep 18 23:00:57 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.66       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   40C    P0    37W / 300W |   7109MiB / 16130MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

### Loading model and evaluating on test set

In [None]:
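def load_from_checkpoint():
  # Returns the saved checkpoint dict (accuracy, options, state dicts),
  # not an nn.Module; the model is rebuilt from 'model_state_dict' below.
  checkpoint = torch.load('{}/{}/{}/best-params.pt'.format(
    args['results_dir'], args['model'], args['dataset']))
  return checkpoint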

In [None]:
model_lstm = load_from_checkpoint()

In [None]:
model_lstm

{'accuracy': 0.6431991849210392,
 'model_state_dict': OrderedDict([('embedding.weight',
               tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
                       [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
                       [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
                       ...,
                       [-0.6644, -0.3045,  0.6151,  ...,  0.1404,  0.5788, -0.0333],
                       [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
                       [-0.2211, -0.3787, -0.4532,  ..., -0.2616,  0.0806, -1.2265]],
                      device='cuda:0')),
              ('lstm.weight_ih_l0',
               tensor([[ 0.2913,  0.0550, -0.3073,  ...,  0.2546, -0.1119, -0.1759],
                       [ 0.0045,  0.2742, -0.1749,  ...,  0.5885, -0.1291,  0.0368],
                       [ 0.1682,  0.0177, -0.2452,  ...,  0.0537, -0.0911, -0.3067],
                       ...,
             

In [None]:
for x in model_lstm['model_state_dict']:
  print(x, '\t', model_lstm['model_state_dict'][x].shape)

embedding.weight 	 torch.Size([75323, 100])
lstm.weight_ih_l0 	 torch.Size([1024, 100])
lstm.weight_hh_l0 	 torch.Size([1024, 256])
lstm.bias_ih_l0 	 torch.Size([1024])
lstm.bias_hh_l0 	 torch.Size([1024])
lstm.weight_ih_l0_reverse 	 torch.Size([1024, 100])
lstm.weight_hh_l0_reverse 	 torch.Size([1024, 256])
lstm.bias_ih_l0_reverse 	 torch.Size([1024])
lstm.bias_hh_l0_reverse 	 torch.Size([1024])
lstm.weight_ih_l1 	 torch.Size([1024, 512])
lstm.weight_hh_l1 	 torch.Size([1024, 256])
lstm.bias_ih_l1 	 torch.Size([1024])
lstm.bias_hh_l1 	 torch.Size([1024])
lstm.weight_ih_l1_reverse 	 torch.Size([1024, 512])
lstm.weight_hh_l1_reverse 	 torch.Size([1024, 256])
lstm.bias_ih_l1_reverse 	 torch.Size([1024])
lstm.bias_hh_l1_reverse 	 torch.Size([1024])
linear_first.weight 	 torch.Size([256, 2048])
linear_first.bias 	 torch.Size([256])
linear_second.weight 	 torch.Size([256, 256])
linear_second.bias 	 torch.Size([256])
linear_third.weight 	 torch.Size([3, 256])
linear_third.bias 	 torch.Size([
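
The shapes printed above follow from the LSTM's structure: each weight matrix stacks four gate blocks (4 * 256 = 1024 rows), and the second layer takes the concatenated forward and backward outputs of the first layer as input (2 * 256 = 512 columns). The sketch below recomputes the parameter count with plain arithmetic; `lstm_param_count` is a hypothetical helper, not a PyTorch function.

```python
def lstm_param_count(input_size, hidden_size, num_layers, bidirectional=True):
    """Count LSTM parameters, mirroring the weight_ih/weight_hh/bias layout."""
    directions = 2 if bidirectional else 1
    total = 0
    for layer in range(num_layers):
        # Layers after the first read the concatenated outputs of all directions.
        layer_input = input_size if layer == 0 else hidden_size * directions
        per_direction = (
            4 * hidden_size * layer_input    # weight_ih
            + 4 * hidden_size * hidden_size  # weight_hh
            + 4 * hidden_size                # bias_ih
            + 4 * hidden_size                # bias_hh
        )
        total += directions * per_direction
    return total


# Settings from this notebook: embed_dim=100, hidden_size=256, num_layers=2.
lstm_params = lstm_param_count(input_size=100, hidden_size=256, num_layers=2)
print(lstm_params)
```

For these settings the LSTM has 2,310,144 parameters, compared with 75323 * 100 = 7,532,300 entries in the embedding matrix.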

In [None]:
device = get_device(0)

model_options = {
    'out_dim': mnli_dataset.out_dim(),
    'dropout': args['dropout'],
    'hidden_size': args['hidden_size'],
    'device': device,
    'dataset': args['dataset'],
}

best_val_acc = -999.0

# building LSTM model and loading state_dict here
model = ModelBiLSTM(model_options)
model.load_state_dict(model_lstm['model_state_dict'])  # extremely important line
args['model'] = 'bilstm'
args['lr'] = 0.001

criterion = nn.CrossEntropyLoss(reduction='sum')

model.to(device)

ModelBiLSTM(
  (embedding): Embedding(75323, 100)
  (relu): LeakyReLU(negative_slope=0.01)
  (dropout): Dropout(p=0.3, inplace=False)
  (lstm): LSTM(100, 256, num_layers=2, batch_first=True, dropout=0.3, bidirectional=True)
  (linear_first): Linear(in_features=2048, out_features=256, bias=True)
  (linear_second): Linear(in_features=256, out_features=256, bias=True)
  (linear_third): Linear(in_features=256, out_features=3, bias=True)
  (layers): Sequential(
    (0): Linear(in_features=2048, out_features=256, bias=True)
    (1): LeakyReLU(negative_slope=0.01)
    (2): Dropout(p=0.3, inplace=False)
    (3): Linear(in_features=256, out_features=256, bias=True)
    (4): LeakyReLU(negative_slope=0.01)
    (5): Dropout(p=0.3, inplace=False)
    (6): Linear(in_features=256, out_features=3, bias=True)
  )
)

In [None]:
def convert_tensor_to_text(x):
  return ' '.join([id2token[tokenid] for tokenid in list(x)])

In [None]:
model.eval()

mnli_dataset.test_iter.init_epoch()

n_correct, n_total, n_loss = 0, 0, 0.0
confusion_mat = torch.zeros((3, 3))  # Q: why 3?


with torch.no_grad():
  for batch_id, batch in enumerate(mnli_dataset.test_iter):
    logits = model(batch)
    loss = criterion(logits, batch.label)
    prediction = torch.max(logits, 1)[1].view(batch.label.size())

    # this for-loop below is optional; just computing the confusion matrix and
    # printing out incorrect prediction examples
    for i in range(len(prediction)):  
      one_label = batch.label[i].int()
      one_prediction = prediction[i].int()
      confusion_mat[one_label][one_prediction] += 1
      if 40 < batch_id < 50 and one_label != one_prediction:
        # printing out incorrect predictions!
        print('premise:', convert_tensor_to_text(batch.premise[i]))
        print('hypothesis:', convert_tensor_to_text(batch.hypothesis[i]))
        print('label:', convert_to_text_label(one_label))
        print('prediction:', convert_to_text_label(one_prediction))
        print('\n')
    
    n_correct += (prediction == batch.label).sum().item()
    n_total += batch.batch_size
    n_loss += loss.item()
  
  test_loss = n_loss / n_total
  test_acc = n_correct / n_total


premise: however , even for a basic product like blue jeans , it is unlikely that all these assumptions will hold .
hypothesis: assumptions rarely hold . <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad>
label: entailment
prediction: contradiction


premise: <unk> not as foot disease nor the bad breath of children , but just plain ( and literally ) <unk> .
hypothesis: a child with a diseased foot has <unk> . <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad>
label: contradiction
prediction: entailment


premise: from that point of view the differing interpretations mr. anson and i read into the passage are of secondary importance .
hypothesis: the differing in interpretations are of only secondary importance . <pad> <pad> <pad> <pad> <pad> <pad> <pa

In [None]:
confusion_mat

tensor([[2065.,  414.,  761.],
        [ 473., 1521., 1135.],
        [ 245.,  450., 2768.]])

In [None]:
test_acc

0.6462571196094385
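
The confusion matrix above supports a per-class breakdown as well as the overall accuracy. The sketch below copies the printed counts into plain Python lists; rows are gold labels and columns are predictions, in the label order used by `convert_to_text_label` (0 contradiction, 1 neutral, 2 entailment).

```python
labels = ["contradiction", "neutral", "entailment"]

# Counts copied from the confusion_mat output (rows: gold, columns: predicted).
confusion = [
    [2065, 414, 761],
    [473, 1521, 1135],
    [245, 450, 2768],
]

total = sum(sum(row) for row in confusion)
accuracy = sum(confusion[i][i] for i in range(len(labels))) / total

precision, recall = {}, {}
for i, name in enumerate(labels):
    gold_count = sum(confusion[i])                      # examples with this gold label
    predicted_count = sum(row[i] for row in confusion)  # times this label was predicted
    recall[name] = confusion[i][i] / gold_count
    precision[name] = confusion[i][i] / predicted_count

for name in labels:
    print(f"{name:13s} precision {precision[name]:.3f} recall {recall[name]:.3f}")
print(f"accuracy {accuracy:.4f} over {total} examples")
```

In this run, neutral has the lowest recall, and its most frequent error is being predicted as entailment (1135 examples).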

### Discussion

More options or hyperparameters to explore.
- The hyperparameters in args.
  - For example, dropout rate, learning rate, batch size, etc.
  - Hidden size is also interesting to experiment with. Will a larger hidden size improve performance?
- Only training on a subset of training data. Plot a curve of accuracy or loss vs. training set size?
- We experimented with vanilla RNN, as well as LSTM. Attempt GRU (which was actually created by Prof. Cho!)?
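
For the GRU suggestion, `nn.GRU` takes the same constructor arguments as `nn.RNN` and returns the same `(output, h_n)` pair, so it could replace `self.rnn` in `ModelRNN` with a small change. The sizes below are made up for a quick shape check; whether a GRU improves accuracy on MNLI is left for you to test.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, seq_len, embed_dim, hidden_size, num_layers = 4, 7, 8, 16, 2

# Same arguments that ModelRNN passes to nn.RNN.
gru = nn.GRU(embed_dim, hidden_size, num_layers, dropout=0.3, batch_first=True)
gru.eval()  # disables the between-layer dropout for this check

x = torch.randn(batch_size, seq_len, embed_dim)  # stands in for premise_embed
h0 = torch.zeros(num_layers, batch_size, hidden_size)
with torch.no_grad():
    out, h_n = gru(x, h0)

print(out.shape)  # (batch_size, seq_len, hidden_size)
print(h_n.shape)  # (num_layers, batch_size, hidden_size)

# For a unidirectional GRU, the last time step equals the top layer's final state.
last_matches = torch.allclose(out[:, -1, :], h_n[-1])
print(last_matches)
```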

Are you satisfied with the accuracy? How can we improve the accuracy?
- By improving the pretrained GloVe embedding?
- By editing the networks?

Other interesting observations?
- Spurious correlation?
  - For example, might the presence of the word "never" or "not" in the hypothesis lead the model to predict "contradiction" regardless of the premise?
- Consider different test sets?
  - Adversarial NLI (Nie et al., 2019)
  - HANS (McCoy et al., 2019)
  - Etc.
- How to perform better on out-of-domain or very different test data?
  - Well, by collecting better training examples.
  - By training the model differently?
- More thoughts?
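
One possible way to probe the spurious-correlation question is to compare label distributions for hypotheses with and without a negation word, ignoring the premise entirely. The examples and the negation word list below are hypothetical placeholders; in practice the pairs would come from the training or dev split.

```python
from collections import Counter, defaultdict

# Assumed list of negation cues; adjust to the tokenizer's output.
NEGATION_WORDS = {"not", "never", "no", "n't"}

# Hypothetical (hypothesis tokens, gold label) pairs; not real MNLI data.
examples = [
    (["he", "did", "not", "sleep"], "contradiction"),
    (["she", "never", "left"], "contradiction"),
    (["it", "is", "not", "raining"], "neutral"),
    (["he", "saw", "jon"], "entailment"),
    (["they", "are", "on", "vacation"], "neutral"),
]

# Label counts, split by whether the hypothesis contains a negation word.
label_counts = defaultdict(Counter)
for tokens, label in examples:
    has_negation = any(tok in NEGATION_WORDS for tok in tokens)
    label_counts[has_negation][label] += 1

for has_negation in (True, False):
    print("negation" if has_negation else "no negation",
          dict(label_counts[has_negation]))
```

A strongly skewed distribution in the negated group would point to a cue that a model can exploit without reading the premise.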

Finally, don't forget to double-check the Google doc for today's section. If you found this lab challenging, please go through the other notebook provided in the doc.