# 16.4 Natural Language Inference and the Dataset

In Section 16.1, we discussed the problem of sentiment analysis. This task aims to classify a single text sequence into predefined categories, such as a set of sentiment polarities. However, when we need to decide whether one sentence can be inferred from another, or to eliminate redundancy by identifying semantically equivalent sentences, knowing how to classify a single text sequence is insufficient. Instead, we need to be able to reason over pairs of text sequences.

## 16.4.1 Natural Language Inference

Natural language inference studies whether a hypothesis can be inferred from a premise, where both are text sequences. In other words, natural language inference determines the logical relationship between a pair of text sequences. Such relationships usually fall into three types:

 - Entailment: the hypothesis can be inferred from the premise.
 - Contradiction: the negation of the hypothesis can be inferred from the premise.
 - Neutral: all the other cases.

Natural language inference is also known as the recognizing textual entailment task. For example, the following pair will be labeled as entailment because "showing affection" in the hypothesis can be inferred from "hugging each other" in the premise.

 - Premise: Two women are hugging each other.
 - Hypothesis: Two women are showing affection.

The following is an example of contradiction, as "running the coding example" indicates "not sleeping" rather than "sleeping".

 - Premise: A man is running the coding example from Dive into Deep Learning.
 - Hypothesis: The man is sleeping.

The third example shows a neutral relationship because neither "famous" nor "not famous" can be inferred from the fact that the musicians "are performing for us".

 - Premise: The musicians are performing for us.
 - Hypothesis: The musicians are famous.
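
The three relationships illustrated above can be written down as labeled sentence pairs. The following is a minimal sketch that encodes this section's examples as data; the names `NLIExample` and `LABELS` are illustrative assumptions and do not come from any library.

```python
from typing import NamedTuple

# Hypothetical label names used only for this illustration
LABELS = ('entailment', 'contradiction', 'neutral')


class NLIExample(NamedTuple):
    """One natural language inference example: a text pair plus its label."""
    premise: str
    hypothesis: str
    label: str


examples = [
    NLIExample('Two women are hugging each other.',
               'Two women are showing affection.', 'entailment'),
    NLIExample('A man is running the coding example from Dive into Deep Learning.',
               'The man is sleeping.', 'contradiction'),
    NLIExample('The musicians are performing for us.',
               'The musicians are famous.', 'neutral'),
]

for ex in examples:
    # Every example must carry one of the three relationship types
    assert ex.label in LABELS
    print(f'{ex.label}: {ex.premise} -> {ex.hypothesis}')
```

Unlike sentiment analysis, each input here is a pair of sequences, so a model has to consume both the premise and the hypothesis to predict the label.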

Natural language inference has been a central topic for understanding natural language. It enjoys wide applications ranging from information retrieval to open-domain question answering. To study this problem, we will begin by investigating a popular natural language inference benchmark dataset.

## 16.4.2 The Stanford Natural Language Inference (SNLI) Dataset

In [1]:
import os
import re
import torch
from torch import nn
from d2l import torch as d2l

#@save
d2l.DATA_HUB['SNLI'] = (
    'https://nlp.stanford.edu/projects/snli/snli_1.0.zip',
    '9fcde07509c7e87ec61c640c1b2753d9041758e4')

data_dir = d2l.download_extract('SNLI')

### Reading the Dataset

In [2]:
#@save
def read_snli(data_dir, is_train):
    """Read the SNLI dataset into premises, hypotheses, and labels."""
    def extract_text(s):
        # Remove information that will not be used by us
        s = re.sub('\\(', '', s)
        s = re.sub('\\)', '', s)
        # Substitute two or more consecutive whitespace with space
        s = re.sub('\\s{2,}', ' ', s)
        return s.strip()
    label_set = {'entailment': 0, 'contradiction': 1, 'neutral': 2}
    file_name = os.path.join(data_dir, 'snli_1.0_train.txt'
            if is_train else 'snli_1.0_test.txt')
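    with open(file_name, 'r') as f:
        # Skip the header line; every other row is tab-separated
        rows = [row.split('\t') for row in f.readlines()[1:]]
    # Column 0 holds the label; rows whose label is not in `label_set` are
    # skipped. Columns 1 and 2 hold the premise and the hypothesis
    premises = [extract_text(row[1]) for row in rows if row[0] in label_set]
    hypotheses = [extract_text(row[2]) for row in rows if row[0] in label_set]
    labels = [label_set[row[0]] for row in rows if row[0] in label_set]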
    return premises, hypotheses, labels
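
To see what the nested `extract_text` helper does, the sketch below copies it out as a standalone function and applies it to a parenthesized string. The input string is a made-up illustration in the bracketed style that the helper is written to clean up, not an actual row from the dataset.

```python
import re


def extract_text(s):
    """Standalone copy of the helper nested inside `read_snli`."""
    # Drop the parentheses that mark the bracketed structure
    s = re.sub(r'\(', '', s)
    s = re.sub(r'\)', '', s)
    # Collapse two or more consecutive whitespace characters into one space
    s = re.sub(r'\s{2,}', ' ', s)
    return s.strip()


# Hypothetical bracketed input, for illustration only
raw = '( ( Two women ) ( ( are ( hugging ( each other ) ) ) . ) )'
print(extract_text(raw))  # Two women are hugging each other .
```

Note that punctuation stays separated from words by a space, which matches the tokenized look of the examples printed in this section.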

In [3]:
train_data = read_snli(data_dir, is_train=True)

In [4]:
for x0, x1, y in zip(train_data[0][:3], train_data[1][:3], train_data[2][:3]):
    print('premise:', x0)
    print('hypothesis:', x1)
    print('label:', y)

premise: A person on a horse jumps over a broken down airplane .
hypothesis: A person is training his horse for a competition .
label: 2
premise: A person on a horse jumps over a broken down airplane .
hypothesis: A person is at a diner , ordering an omelette .
label: 1
premise: A person on a horse jumps over a broken down airplane .
hypothesis: A person is outdoors , on a horse .
label: 0


In [5]:
test_data = read_snli(data_dir, is_train=False)
for data in [train_data, test_data]:
    print([data[2].count(i) for i in range(3)])

[183416, 183187, 182764]
[3368, 3237, 3219]
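
The counts above suggest that the three labels are roughly balanced in both splits. As a quick check, the following sketch turns the printed counts into per-class fractions. The numbers are copied from the output above, and their order follows the `label_set` mapping in `read_snli` (entailment, contradiction, neutral).

```python
# Label counts copied from the printed output, ordered as
# [entailment, contradiction, neutral]
train_counts = [183416, 183187, 182764]
test_counts = [3368, 3237, 3219]

for name, counts in [('train', train_counts), ('test', test_counts)]:
    total = sum(counts)
    # Fraction of examples that belong to each class
    fractions = [count / total for count in counts]
    print(name, total, [f'{p:.3f}' for p in fractions])
```

Since every class accounts for about one third of each split, plain accuracy is a reasonable metric for this dataset.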


### Defining a Class for Loading the Dataset

In [6]:
#@save
class SNLIDataset(torch.utils.data.Dataset):
    """A customized dataset to load the SNLI dataset."""
    def __init__(self, dataset, num_steps, vocab=None):
        self.num_steps = num_steps
        all_premise_tokens = d2l.tokenize(dataset[0])
        all_hypothesis_tokens = d2l.tokenize(dataset[1])
        if vocab is None:
            # Build the vocabulary from both premises and hypotheses
            self.vocab = d2l.Vocab(all_premise_tokens + all_hypothesis_tokens,
                                   min_freq=5, reserved_tokens=['<pad>'])
        else:
            # Reuse an existing vocabulary, e.g., the one from the training set
            self.vocab = vocab
        self.premises = self._pad(all_premise_tokens)
        self.hypotheses = self._pad(all_hypothesis_tokens)
        self.labels = torch.tensor(dataset[2])
        print('read ' + str(len(self.premises)) + ' examples')

    def _pad(self, lines):
        # Map tokens to indices, then truncate or pad each line to `num_steps`
        return torch.tensor([d2l.truncate_pad(
            self.vocab[line], self.num_steps, self.vocab['<pad>'])
                             for line in lines])

    def __getitem__(self, idx):
        return (self.premises[idx], self.hypotheses[idx]), self.labels[idx]

    def __len__(self):
        return len(self.premises)
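
The padding step is what gives every sequence in a minibatch the same length. The following pure-Python sketch illustrates the truncate-or-pad behavior that `_pad` relies on. The function `truncate_pad_sketch` and the padding index `PAD` are hypothetical names for illustration; the sketch assumes `d2l.truncate_pad` cuts sequences longer than `num_steps` and right-pads shorter ones.

```python
def truncate_pad_sketch(line, num_steps, padding_token):
    """Truncate or right-pad a list of token indices to length `num_steps`."""
    if len(line) > num_steps:
        return line[:num_steps]  # Keep only the first `num_steps` tokens
    # Append padding tokens until the length reaches `num_steps`
    return line + [padding_token] * (num_steps - len(line))


PAD = 1  # Hypothetical index of '<pad>' in the vocabulary

short = truncate_pad_sketch([7, 8, 9], 5, PAD)
long = truncate_pad_sketch([7, 8, 9, 10, 11, 12], 5, PAD)
print(short)  # [7, 8, 9, 1, 1]
print(long)   # [7, 8, 9, 10, 11]
```

Because every padded line has exactly `num_steps` entries, the list of lines can be converted into a single rectangular tensor.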

### Putting It All Together

In [7]:
#@save
def load_data_snli(batch_size, num_steps=50):
    """Download the SNLI dataset and return data iterators and vocabulary."""
    num_workers = d2l.get_dataloader_workers()
    data_dir = d2l.download_extract('SNLI')
    train_data = read_snli(data_dir, True)
    test_data = read_snli(data_dir, False)
    train_set = SNLIDataset(train_data, num_steps)
    test_set = SNLIDataset(test_data, num_steps, train_set.vocab)
    train_iter = torch.utils.data.DataLoader(train_set, batch_size,
                                             shuffle=True,
                                             num_workers=num_workers)
    test_iter = torch.utils.data.DataLoader(test_set, batch_size,
                                            shuffle=False,
                                            num_workers=num_workers)
    return train_iter, test_iter, train_set.vocab

In [8]:
train_iter, test_iter, vocab = load_data_snli(128, 50)
len(vocab)

read 549367 examples
read 9824 examples
18678

In [9]:
for X, Y in train_iter:
    print(X[0].shape)
    print(X[1].shape)
    print(Y.shape)
    break

torch.Size([128, 50])
torch.Size([128, 50])
torch.Size([128])
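
The structure of a minibatch can also be explored without downloading SNLI. The sketch below is a toy stand-in, not the dataset class from this section: `ToyPairDataset` is a hypothetical name, and its random token indices and labels only mimic the `((premise, hypothesis), label)` layout returned by `__getitem__`.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ToyPairDataset(Dataset):
    """Hypothetical dataset of random sequence pairs with three labels."""
    def __init__(self, num_examples, num_steps, vocab_size=100):
        self.premises = torch.randint(0, vocab_size, (num_examples, num_steps))
        self.hypotheses = torch.randint(0, vocab_size, (num_examples, num_steps))
        self.labels = torch.randint(0, 3, (num_examples,))

    def __getitem__(self, idx):
        # Same layout as in this section: a pair of inputs plus one label
        return (self.premises[idx], self.hypotheses[idx]), self.labels[idx]

    def __len__(self):
        return len(self.premises)


toy_iter = DataLoader(ToyPairDataset(num_examples=10, num_steps=5),
                      batch_size=4, shuffle=False)
X, Y = next(iter(toy_iter))
# X holds two tensors of shape (batch size, number of steps);
# Y holds one label per example
print(X[0].shape, X[1].shape, Y.shape)
```

In contrast to the single-input iterators used for sentiment analysis, `X` carries two inputs here, one for the premises and one for the hypotheses.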