# This is a demo on how to use checkpoints of trained models

# Load a checkpoint
##### Model choices are uni-lstm, bi-lstm, bi-max-lstm, mean

In [1]:
import argparse
from models import SNLIModel
from train import parse_args
import torch


In [2]:
checkpoint_path = "modelsaves/bi_max_lstm_model.pth"
modeltype = 'bi-max-lstm' # Make sure this matches the checkpoint you are loading in!
# choices are uni-lstm, bi-lstm, bi-max-lstm, mean
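
A typo in `modeltype` only surfaces later, when the model is built, so a small guard can catch it early. The sketch below is an illustration and not part of the repository: `MODEL_CHOICES` and `check_modeltype` are hypothetical names, and the list of choices is copied from the comment above.

```python
# Hypothetical helper, not part of the repository.
# The choices are copied from the comment in the cell above.
MODEL_CHOICES = ("uni-lstm", "bi-lstm", "bi-max-lstm", "mean")


def check_modeltype(modeltype: str) -> str:
    """Return modeltype unchanged if it is a known choice, else raise ValueError."""
    if modeltype not in MODEL_CHOICES:
        raise ValueError(
            f"Unknown model type {modeltype!r}; expected one of {MODEL_CHOICES}"
        )
    return modeltype


modeltype = check_modeltype("bi-max-lstm")
```

This only validates the name. It cannot tell whether the checkpoint file was actually trained with that encoder, so the comment about matching the two still applies.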

In [3]:
# Default parameters, needed only to initialize the model.
# The training settings do not matter here because we only evaluate.
params = parse_args()
params.checkpoint_path = checkpoint_path
params.encoder_model = modeltype

In [4]:
checkpoint_info = torch.load(checkpoint_path)
print(checkpoint_info.keys())
print(f"Model: {modeltype}")
print(f"Epoch: {checkpoint_info['epoch']}")
print(f"Dev accuracy: {checkpoint_info['dev_accuracy'].item()}")

dict_keys(['model_state_dict', 'optimizer_state_dict', 'epoch', 'dev_accuracy'])
Model: bi-max-lstm
Epoch: 5
Dev accuracy: 0.8474903702735901


In [5]:
checkpoint_model = SNLIModel(params)

Setting up all imports and downloads


[nltk_data] Downloading package punkt to /home/david/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /home/david/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


Preprocessing the data
Setting up the classifier (and the encoder within)
Setting up the optimizer and loss function
Checkpoint loaded from modelsaves/bi_max_lstm_model.pth


# Example on how to predict using the checkpoint model

In [6]:
premise = "Two men sitting in the sun"
hypothesis = "Nobody is sitting in the shade"
checkpoint_model.predict([premise], [hypothesis])

(tensor([1], device='cuda:0'), ['neutral'])

In [7]:
premise = "A man is walking a dog"
hypothesis = "No cat is outside"
checkpoint_model.predict([premise], [hypothesis])

(tensor([0], device='cuda:0'), ['entailment'])
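
The outputs in this notebook suggest that `predict` returns class indices together with label strings, with 0, 1 and 2 standing for entailment, neutral and contradiction. A minimal sketch of that decoding step follows. The mapping is inferred from the printed outputs, not read from the model code, and `ID_TO_LABEL` and `decode_labels` are hypothetical names.

```python
# Assumed mapping, inferred from the outputs printed in this notebook.
ID_TO_LABEL = {0: "entailment", 1: "neutral", 2: "contradiction"}


def decode_labels(class_ids):
    """Map an iterable of integer class ids to their label strings."""
    return [ID_TO_LABEL[int(i)] for i in class_ids]


print(decode_labels([1, 0]))  # ['neutral', 'entailment']
```

For a tensor of predictions, calling `.tolist()` first gives plain integers that can be passed to the helper.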

# Example on evaluating a dataset and obtaining an accuracy

In [8]:
dev_accuracy = checkpoint_model.evaluate_accuracy(checkpoint_model.dev_data).item()
test_accuracy = checkpoint_model.evaluate_accuracy(checkpoint_model.test_data).item()

In [9]:
print(f"The dev accuracy is {round(dev_accuracy, 5)} and the test accuracy is {round(test_accuracy, 5)}")

The dev accuracy is 0.84749 and the test accuracy is 0.85006
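
For reference, accuracy here is taken to mean the fraction of sentence pairs whose predicted label matches the gold label. The sketch below shows that computation on plain Python lists. It assumes this is what `evaluate_accuracy` reports, since its implementation is not shown in this notebook. The function name `accuracy` and the toy labels are made up for illustration.

```python
def accuracy(predicted, gold):
    """Fraction of positions where the predicted label equals the gold label."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty dataset")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Toy labels for illustration only, not taken from SNLI.
print(accuracy([0, 1, 2, 2], [0, 1, 1, 2]))  # 0.75
```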


# Error Analysis
To see how each model performs on these examples, change `checkpoint_path` and `modeltype` in the cells above and rerun the notebook.

This should be a contradiction

Why is it a hard task?

Lexical overlap is low, and the model may struggle to tie "apple" to "fruit" or to detect that "no one" contradicts "a woman".

In [10]:
premise = "A woman is eating an apple"
hypothesis = "No one is eating fruit"
checkpoint_model.predict([premise], [hypothesis])

(tensor([0], device='cuda:0'), ['entailment'])
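
The claim about low lexical overlap in this example can be made concrete with a rough measure. The sketch below computes the Jaccard overlap between the lowercased word sets of the two sentences. It uses whitespace splitting as a simplification and is not the tokenization the model uses. `token_overlap` is a hypothetical helper name.

```python
def token_overlap(premise: str, hypothesis: str) -> float:
    """Jaccard overlap between the lowercased, whitespace-split word sets."""
    a = set(premise.lower().split())
    b = set(hypothesis.lower().split())
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


score = token_overlap("A woman is eating an apple", "No one is eating fruit")
print(round(score, 3))  # 0.222
```

Only "is" and "eating" are shared, out of nine distinct words, so the words that decide the label ("apple" and "fruit", "a woman" and "no one") do not overlap at all.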

This should be a contradiction

Why is it a hard task?

Both sentences involve a baby, but it takes commonsense to realize that babies who cry in cribs probably aren’t talking.

In [11]:
premise = "A baby is crying in a crib"
hypothesis = "The baby is talking to its parents"
checkpoint_model.predict([premise], [hypothesis])

(tensor([2], device='cuda:0'), ['contradiction'])

This should be a contradiction

Why is it a hard task?

The model must track event order and realize that "was eating" implies the action was still in progress when the phone rang, so the man cannot have finished already.

In [12]:
premise = "The man was eating dinner when the phone rang"
hypothesis = "The man had already finished eating"
checkpoint_model.predict([premise], [hypothesis])

(tensor([1], device='cuda:0'), ['neutral'])

This should be neutral

Why is it a hard task?

The near-synonyms "children" and "kids" link the two sentences, but the activities are unrelated and shouldn’t lead to entailment or contradiction.

In [13]:
premise = "Children are playing in a park"
hypothesis = "Kids are studying at school"
checkpoint_model.predict([premise], [hypothesis])

(tensor([0], device='cuda:0'), ['entailment'])

This should be an entailment

Why is it a hard task?

It tests the model's ability to recognize paraphrasing and soft entailment across different phrasings.

In [14]:
premise = "A man is giving a presentation to a small audience"
hypothesis = "A man is speaking in front of a group"
checkpoint_model.predict([premise], [hypothesis])

(tensor([0], device='cuda:0'), ['entailment'])
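
The five error-analysis examples can be tallied to compare checkpoints at a glance. The sketch below hard-codes the expected labels stated in the text and the predictions printed above for the `bi-max-lstm` checkpoint. The variable names are made up for illustration, and the list needs to be refilled by hand after rerunning the notebook with another model.

```python
from collections import Counter

# (expected, predicted) pairs copied from the five examples above,
# as produced by the bi-max-lstm checkpoint.
results = [
    ("contradiction", "entailment"),     # apple / fruit
    ("contradiction", "contradiction"),  # baby crying / talking
    ("contradiction", "neutral"),        # was eating / had finished
    ("neutral", "entailment"),           # children in park / kids at school
    ("entailment", "entailment"),        # presentation / speaking
]

n_correct = sum(expected == predicted for expected, predicted in results)
confusions = Counter(
    (expected, predicted) for expected, predicted in results if expected != predicted
)

print(f"{n_correct}/{len(results)} correct")  # 2/5 correct
for (expected, predicted), count in confusions.items():
    print(f"expected {expected}, predicted {predicted}: {count}")
```

With only five hand-picked sentences this is an informal check, not a measurement of model quality.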