**INITIALIZATION:**
- I run these three lines at the top of every notebook. The first two enable autoreload, so edited modules are re-imported automatically and I avoid problems when reloading the same project. The third line renders Matplotlib visualizations inline in the notebook.

In [1]:
#@ INITIALIZATION:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

**DOWNLOADING LIBRARIES AND DEPENDENCIES:**
- I install (if needed) and import all the libraries and dependencies required for the project in a single cell.

In [3]:
#@ DOWNLOADING THE LIBRARIES AND DEPENDENCIES:
# !pip install -U d2l
from d2l import torch as d2l

import os, re
import torch     
from torch import nn                                
from IPython import display

**GETTING THE DATASET:**
- I used Google Colab for this notebook, so the steps for downloading and reading the data may differ on other platforms. I will use the **Stanford Natural Language Inference (SNLI) Corpus**, a collection of over 500,000 labeled English sentence pairs.

In [5]:
#@ GETTING THE DATASET: 
d2l.DATA_HUB["SNLI"] = ('https://nlp.stanford.edu/projects/snli/snli_1.0.zip',
                        '9fcde07509c7e87ec61c640c1b2753d9041758e4')               # Registering the Dataset URL and SHA-1 Hash. 
data_dir = d2l.download_extract("SNLI")                                           # Downloading and Extracting the Dataset. 

**READING THE DATASET:**
- I will define a function that extracts only the parts of the dataset I need and returns lists of premises, hypotheses, and their labels.

In [6]:
#@ READING THE DATASET: 
def read_snli(data_dir, is_train):                                # Reading Dataset into Premises, Hypothesis and Labels. 
  def extract_text(s):                                            # Cleaning the Parenthesized Sentence Text. 
    s = re.sub("\\(", "", s)                                      # Removing Opening Parentheses. 
    s = re.sub("\\)", "", s)                                      # Removing Closing Parentheses. 
    s = re.sub("\\s{2,}", " ", s)                                 # Collapsing Repeated Whitespace into One Space. 
    return s.strip()
  
  label_set = {"entailment": 0, "contradiction": 1, 
               "neutral": 2}                                      # Initializing Labels. 
  file_name = os.path.join(data_dir, "snli_1.0_train.txt" if \
                           is_train else "snli_1.0_test.txt")
  with open(file_name, "r") as f: 
    rows = [row.split("\t") for row in f.readlines()[1:]]
  premises = [extract_text(row[1]) for row in rows if row[0] in 
              label_set]                                          # Initializing Premises. 
  hypothesis = [extract_text(row[2]) for row in rows if row[0] \
                in label_set]                                     # Initializing Hypothesis. 
  labels = [label_set[row[0]] for row in rows if row[0] in 
            label_set]                                            # Initializing Labels. 
  return premises, hypothesis, labels
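
- A minimal sketch of what the function above does to each row, run on two made-up rows instead of the real file. It assumes, as `read_snli` does, that column 0 holds the label and columns 1 and 2 hold parenthesized versions of the two sentences. The sample rows and the `parse_rows` helper are hypothetical and only for illustration.

```python
import re

LABEL_SET = {"entailment": 0, "contradiction": 1, "neutral": 2}

def extract_text(s):
    """Strip parentheses and collapse repeated whitespace."""
    s = re.sub(r"\(", "", s)
    s = re.sub(r"\)", "", s)
    s = re.sub(r"\s{2,}", " ", s)
    return s.strip()

def parse_rows(lines):
    """Hypothetical helper: turn tab-separated lines into three parallel lists."""
    rows = [line.split("\t") for line in lines]
    rows = [row for row in rows if row[0] in LABEL_SET]  # drop rows without a usable label
    premises = [extract_text(row[1]) for row in rows]
    hypotheses = [extract_text(row[2]) for row in rows]
    labels = [LABEL_SET[row[0]] for row in rows]
    return premises, hypotheses, labels

# Made-up rows in the assumed layout: label, premise, hypothesis.
sample = [
    "neutral\t( ( A person ) ( jumps . ) )\t( ( A person ) ( trains . ) )",
    "-\t( ( A dog ) ( runs . ) )\t( ( A cat ) ( sleeps . ) )",
]
premises, hypotheses, labels = parse_rows(sample)
print(premises, hypotheses, labels)
```

The second row is skipped because its label is not one of the three keys in the label set, which is the same filter `read_snli` applies with `row[0] in label_set`.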

In [7]:
#@ IMPLEMENTATION: 
train_data = read_snli(data_dir, is_train=True)                   # Implementation of Function. 
for x0, x1, y in zip(train_data[0][:3], train_data[1][:3], 
                     train_data[2][:3]):
  print("premise:", x0)                                           # Inspecting Premises. 
  print("hypothesis:", x1)                                        # Inspecting Hypothesis. 
  print("label:", y)                                              # Inspecting Labels. 

premise: A person on a horse jumps over a broken down airplane .
hypothesis: A person is training his horse for a competition .
label: 2
premise: A person on a horse jumps over a broken down airplane .
hypothesis: A person is at a diner , ordering an omelette .
label: 1
premise: A person on a horse jumps over a broken down airplane .
hypothesis: A person is outdoors , on a horse .
label: 0


In [8]:
#@ READING THE DATASET: 
test_data = read_snli(data_dir, is_train=False)                   # Implementation of Function. 
for data in [train_data, test_data]:
  print([data[2].count(i) for i in range(3)])                     # Inspecting the Label Counts. 

[183416, 183187, 182764]
[3368, 3237, 3219]
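
- The counts above show that the three classes are roughly balanced in both splits. A small sketch of the same tally with `collections.Counter`, on a made-up label list (`toy_labels` is hypothetical, not SNLI data):

```python
from collections import Counter

# Hypothetical labels using the notebook's encoding:
# 0 = entailment, 1 = contradiction, 2 = neutral.
toy_labels = [0, 1, 2, 2, 0, 1, 0]

counts = Counter(toy_labels)
per_class = [counts[i] for i in range(3)]              # counts ordered by class id
fractions = [c / len(toy_labels) for c in per_class]   # share of each class
print(per_class)
print(fractions)
```

Dividing by the total makes the balance easier to compare across splits of very different sizes.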


**LOADING THE DATASET:**
- I will define a class for loading the SNLI dataset. The `num_steps` argument of the class constructor specifies the length of a text sequence so that every minibatch of sequences has the same shape. Token sequences longer than `num_steps` are truncated, while the special `<pad>` token is appended to shorter sequences until they reach that length.

In [53]:
#@ LOADING THE DATASET: 
class SNLIDataset(torch.utils.data.Dataset):                     # Loading SNLI Dataset. 
  def __init__(self, dataset, num_steps, vocab=None):            # Initializing Constructor Function. 
    self.num_steps = num_steps                                   # Initialization. 
    all_premise_tokens = d2l.tokenize(dataset[0])                # Initializing Tokenization. 
    all_hypothesis_tokens = d2l.tokenize(dataset[1])             # Initializing Tokenization. 
    if vocab is None: 
      self.vocab = d2l.Vocab(all_premise_tokens + \
                             all_hypothesis_tokens, min_freq=5, 
                             reserved_tokens=["<pad>"])          # Initializing Vocabulary of Tokens. 
    else: 
      self.vocab = vocab
    self.premises = self._pad(all_premise_tokens)                # Implementation of Padding and Truncation. 
    self.hypotheses = self._pad(all_hypothesis_tokens)           # Implementation of Padding and Truncation. 
    self.labels = torch.tensor(dataset[2])                       # Initializing Labels. 
    print("read " + str(len(self.premises)) + " examples")
  
  def _pad(self, lines):
    return torch.tensor([d2l.truncate_pad(self.vocab[line], 
                                          self.num_steps, 
                                          self.vocab["<pad>"])\
                         for line in lines])
    
  def __getitem__(self, idx):                                    # Accessing Premise, Hypothesis and Labels. 
    return (self.premises[idx], self.hypotheses[idx]), \
            self.labels[idx]
  
  def __len__(self):
    return len(self.premises)
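
- The padding and truncation step inside `_pad` can be sketched in plain Python. This is a stand-in for `d2l.truncate_pad`, assuming it cuts long sequences at the end and pads short ones on the right. The padding id `PAD = 0` and the token ids are made up.

```python
def truncate_pad(token_ids, num_steps, padding_id):
    """Return exactly num_steps ids: cut long sequences, pad short ones."""
    if len(token_ids) > num_steps:
        return token_ids[:num_steps]                   # truncate
    return token_ids + [padding_id] * (num_steps - len(token_ids))  # pad

PAD = 0  # hypothetical padding id
short_seq = truncate_pad([5, 6, 7], num_steps=5, padding_id=PAD)
long_seq = truncate_pad([5, 6, 7, 8, 9, 10], num_steps=5, padding_id=PAD)
print(short_seq)
print(long_seq)
```

Because every sequence comes out with the same length, a list of them can be stacked into one rectangular tensor, which is what `torch.tensor(...)` in `_pad` relies on.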

In [54]:
#@ LOADING THE DATASET: 
def load_data_snli(batch_size, num_steps=50):                       # Initializing Data Iterators and Vocabulary. 
  num_workers = 2                                                   # Number of DataLoader Worker Processes. 
  data_dir = d2l.download_extract("SNLI")                           # Extracting the Dataset. 
  train_data = read_snli(data_dir, True)                            # Initializing Training Dataset. 
  test_data = read_snli(data_dir, False)                            # Initializing Test Dataset. 
  train_set = SNLIDataset(train_data, num_steps)                    # Initializing Training Set. 
  test_set = SNLIDataset(test_data, num_steps, 
                         train_set.vocab)                           # Initializing Test Set. 
  train_iter = torch.utils.data.DataLoader(train_set, batch_size, 
                                           shuffle=True, 
                                           num_workers=num_workers) # Initializing Training Iterator. 
  test_iter = torch.utils.data.DataLoader(test_set, batch_size, 
                                          shuffle=False, 
                                          num_workers=num_workers)  # Initializing Test Iterator. 
  return train_iter, test_iter, train_set.vocab

In [55]:
#@ IMPLEMENTATION:
train_iter, test_iter, vocab = load_data_snli(128, 50)              # Implementation of Function. 
print(len(vocab))                                                   # Inspecting Vocabulary. 

read 549367 examples
read 9824 examples
18678
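
- The vocabulary size printed above depends on the `min_freq=5` cutoff, since tokens seen fewer than five times do not get their own index. A simplified sketch of a frequency-filtered vocabulary follows. It is not `d2l.Vocab` itself: the `build_vocab` helper, the `<unk>` fallback at index 0, and the toy corpus are assumptions for illustration.

```python
from collections import Counter

def build_vocab(token_lists, min_freq, reserved_tokens=()):
    """Hypothetical helper: map tokens seen at least min_freq times to integer ids."""
    counter = Counter(tok for line in token_lists for tok in line)
    idx_to_token = ["<unk>"] + list(reserved_tokens)
    # Most frequent tokens first; ties broken alphabetically for determinism.
    for tok, freq in sorted(counter.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq and tok not in idx_to_token:
            idx_to_token.append(tok)
    return {tok: i for i, tok in enumerate(idx_to_token)}

toy = [["a", "dog", "runs"], ["a", "cat", "runs"], ["a", "bird"]]
vocab_sketch = build_vocab(toy, min_freq=2, reserved_tokens=["<pad>"])
# Rare tokens fall back to the <unk> id.
ids = [vocab_sketch.get(tok, vocab_sketch["<unk>"]) for tok in ["a", "bird", "runs"]]
print(vocab_sketch)
print(ids)
```

Reusing the training vocabulary for the test set, as `load_data_snli` does, keeps token ids consistent between the two splits.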


In [56]:
#@ IMPLEMENTATION: 
for X, Y in train_iter: 
  print(X[0].shape)
  print(X[1].shape)
  print(Y.shape)
  break

torch.Size([128, 50])
torch.Size([128, 50])
torch.Size([128])
