__This seminar__ teaches you about metric learning for NLP.

In [1]:
import json
import pandas as pd
import numpy as np
import tqdm
import matplotlib.pyplot as plt
%matplotlib inline
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\1\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

# Stanford question answering dataset (SQuAD)

_This seminar is based on the original notebook by [Oleg Vasilev](https://github.com/Omrigan/)._

Today we are going to work with a popular NLP dataset.

Here is the description of the original problem:

```
Stanford Question Answering Dataset (SQuAD) is a new reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage. With 100,000+ question-answer pairs on 500+ articles, SQuAD is significantly larger than previous reading comprehension datasets.
```


We are not going to solve it :) Instead, we will answer questions in a different way: given a question, we will find a **sentence** that contains the answer, searching not only within the question's own context but across the **whole databank**.
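
To make the retrieval setup concrete, here is a minimal sketch in plain Python. The toy 3-dimensional vectors and the `dot` and `retrieve` helpers are hypothetical illustrations; in the seminar the vectors come from trained vectorizers.

```python
def dot(u, v):
    """Dot-product similarity between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def retrieve(question_vec, databank, top_k=2):
    """Return the top_k (score, sentence) pairs ranked by similarity.

    databank is a list of (sentence, vector) pairs covering ALL sentences,
    not only those from the question's own paragraph.
    """
    scored = [(dot(question_vec, vec), sent) for sent, vec in databank]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]

# toy databank with made-up vectors (assumption: real vectors come from a model)
toy_databank = [
    ("The Grotto is a replica of the grotto at Lourdes.", [0.9, 0.1, 0.0]),
    ("Many samurai were literate and well-educated.", [0.0, 0.8, 0.3]),
    ("Hinduism developed as a synthesis of Indian traditions.", [0.1, 0.2, 0.9]),
]
toy_question_vec = [1.0, 0.0, 0.1]

best = retrieve(toy_question_vec, toy_databank, top_k=1)
print(best[0][1])
```

The key point is that the candidate set is the entire collection of sentences, so the quality of the similarity function matters far more than in the original span-extraction task.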

In [2]:
# download the data
!wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json

'wget' is not recognized as an internal or external
command, operable program or batch file.


In [3]:
# use a context manager so the file handle is closed after loading
with open('train-v1.1.json') as f:
    data = json.load(f)

In [4]:
data['data'][0]['paragraphs'][0]

{'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'qas': [{'answers': [{'answer_start': 515,
     'text': 'Saint Bernadette Soubirous'}],
   'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
   'id': '5733be284776f41900661182'},
  {'answers': [{'answer_start': 188, 'text': 'a copper statue of Christ

### The NLP part

The code here is very similar to `week10/`: preprocess text into tokens, create dictionaries, etc.
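
As a reminder of what these preprocessing steps do, here is a small self-contained sketch on a made-up corpus. All `toy_*` names are hypothetical, and the `>=` threshold is a choice made for this example only.

```python
import re
from collections import Counter

def toy_tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

toy_corpus = ["The cat sat.", "The cat ran!", "A dog sat."]

# count how often each token occurs in the corpus
toy_counts = Counter()
for line in toy_corpus:
    toy_counts.update(toy_tokenize(line))

# keep frequent tokens only; rare ones will map to _UNK_
toy_min_count = 2
toy_vocab = ["_PAD_", "_UNK_"] + sorted(
    w for w, c in toy_counts.items() if c >= toy_min_count)
toy_token_to_id = {t: i for i, t in enumerate(toy_vocab)}

# encode a new sentence; "dog" is rare, so it becomes _UNK_ (id 1)
toy_encoded = [toy_token_to_id.get(w, toy_token_to_id["_UNK_"])
               for w in toy_tokenize("The dog sat")]
print(toy_vocab)
print(toy_encoded)
```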

In [5]:
from nltk.tokenize import RegexpTokenizer
from collections import Counter,defaultdict
tokenizer = RegexpTokenizer(r"\w+|\d+")

#Dictionary of tokens
token_counts = Counter()

def tokenize(value):
    return tokenizer.tokenize(value.lower())

for q in tqdm.tqdm_notebook(data['data']):
    for p in q['paragraphs']:
        token_counts.update(tokenize(p['context']))

In [6]:
min_count = 4

tokens = [w for w, c in token_counts.items() if c > min_count] 
tokens = ["_PAD_", "_UNK_"] + tokens

token_to_id = {t : i for i, t in enumerate(tokens)}


In [7]:
assert token_to_id['me'] != token_to_id['woods']
assert token_to_id[tokens[42]]==42
assert len(token_to_id)==len(tokens)

In [8]:
PAD_ix = token_to_id["_PAD_"]
UNK_ix = token_to_id['_UNK_']

#good old as_matrix for the third time
def as_matrix(sequences, max_len=None):
    if isinstance(sequences[0], (str, bytes)):
        sequences = [tokenize(s) for s in sequences]
        
    max_len = max_len or max(map(len,sequences))
    
    matrix = np.zeros((len(sequences), max_len), dtype='int32') + PAD_ix
    for i, seq in enumerate(sequences):
        row_ix = [token_to_id.get(word, UNK_ix) for word in seq[:max_len]]
        matrix[i, :len(row_ix)] = row_ix
    
    return matrix

In [9]:
test = as_matrix(["Definitely, thOsE tokens areN'T LowerCASE!!", "I'm the monument to all your sins."])
print(test)
assert test.shape==(2,8)
print("Correct!")

[[11162   849     1 12434  2799 19196     0     0]
 [ 1030   311     3  5523    35   164  2605 20297]]
Correct!


### Build the dataset

In [10]:
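from nltk.tokenize import sent_tokenize

def build_dataset(train_data):
    '''Takes SQuAD data
    Returns a list of (question, sentence containing the answer) pairs, i.e. (q, a_+)
    '''
    dataset = []
    for row in tqdm.tqdm_notebook(train_data):
        for paragraph in row['paragraphs']:
            context = paragraph['context']
            # (end offset of the sentence within the context, sentence)
            offsets = []
            current_index = 0
            for sent in sent_tokenize(context):
                # locate the sentence in the context rather than assuming
                # a fixed-width separator between sentences
                start = context.find(sent, current_index)
                if start == -1:
                    start = current_index
                current_index = start + len(sent)
                offsets.append((current_index, sent))

            for qa in paragraph['qas']:
                question, answer = qa['question'], qa['answers'][0]

                # find the first sentence that ends after the answer starts
                for offset, sent in offsets:
                    if answer['answer_start'] < offset:
                        dataset.append((question, sent))
                        break
    return dataset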

In [11]:
from sklearn.model_selection import train_test_split
train_data, val_data = train_test_split(data['data'], test_size=0.1)

train_data = build_dataset(train_data)
val_data = build_dataset(val_data)

In [12]:
for i in range(2, 18, 6):
    print("Q: %s\nA: %s\n" % val_data[i])

Q: Of what is Hinduism a combination?
A: The history of India includes the prehistoric settlements and societies in the Indian subcontinent; the blending of the Indus Valley Civilization and Indo-Aryan culture into the Vedic Civilization; the development of Hinduism as a synthesis of various Indian cultures and traditions; the rise of the Śramaṇa movement; the decline of Śrauta sacrifices and the birth of the initiatory traditions of Jainism, Buddhism, Shaivism, Vaishnavism and Shaktism; the onset of a succession of powerful dynasties and empires for more than two millennia throughout various geographic areas of the subcontinent, including the growth of Muslim dynasties during the Medieval period intertwined with Hindu powers; the advent of European traders resulting in the establishment of the British rule; and the subsequent independence movement that led to the Partition of India and the creation of the Republic of India.

Q: What was the first major civilization in South Asia?
A: T

# Building the model

Any self-respecting DSSM must have one or several vectorizers. In our case,
* Question vectorizer
* Answer vectorizer

It is perfectly legal to share some layers between them, but make sure they are at least a little different.

In [13]:
import torch, torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable

class GlobalMaxPooling(nn.Module):
    def __init__(self, dim=-1):
        super().__init__()  # super(self.__class__, self) recurses forever if the class is subclassed
        self.dim = dim
        
    def forward(self, x):
        return x.max(dim=self.dim)[0]

In [14]:
# we might as well create a global embedding layer here

GLOBAL_EMB = nn.Embedding(len(tokens), 64, padding_idx=PAD_ix)

In [15]:
class QuestionVectorizer(nn.Module):
    def __init__(self, n_tokens=len(tokens), out_size=64, use_global_emb=True):
        """ 
        A simple sequential encoder for questions.
        Use any combination of layers you want to encode a variable-length input 
        to a fixed-size output vector
        
        If use_global_emb is True, use GLOBAL_EMB as your embedding layer
        """
        super().__init__()
        if use_global_emb:
            self.emb = GLOBAL_EMB
        else:
            self.emb = nn.Embedding(n_tokens, 64, padding_idx=PAD_ix)
            
        self.conv = nn.Conv1d(64, 128, kernel_size=3, padding=1)
        self.pool = GlobalMaxPooling()
        self.dense = nn.Linear(128, out_size)
        
    def forward(self, text_ix):
        """
        :param text_ix: int64 Variable of shape [batch_size, max_len]
        :returns: float32 Variable of shape [batch_size, out_size]
        """
        h = torch.transpose(self.emb(text_ix), 2, 1)
        h = F.relu(self.pool(self.conv(h)))
        return self.dense(h)

In [16]:
class AnswerVectorizer(nn.Module):
    def __init__(self, n_tokens=len(tokens), out_size=64, use_global_emb=True):
        """ 
        A simple sequential encoder for answers.
        x -> emb -> conv -> global_max -> relu -> dense
        
        If use_global_emb is True, use GLOBAL_EMB as your embedding layer
        """
        super().__init__()
        if use_global_emb:
            self.emb = GLOBAL_EMB
        else:
            self.emb = nn.Embedding(n_tokens, 64, padding_idx=PAD_ix)
            
        self.conv = nn.Conv1d(64, 128, kernel_size=3, padding=1)
        self.pool = GlobalMaxPooling()
        self.dense = nn.Linear(128, out_size)
        
    def forward(self, text_ix):
        """
        :param text_ix: int64 Variable of shape [batch_size, max_len]
        :returns: float32 Variable of shape [batch_size, out_size]
        """
        h = torch.transpose(self.emb(text_ix), 2, 1)
        h = F.relu(self.pool(self.conv(h)))
        return self.dense(h)

In [17]:
for vectorizer in [QuestionVectorizer(out_size=100), AnswerVectorizer(out_size=100)]:
    print("Testing %s ..." % vectorizer.__class__.__name__)
    dummy_x = Variable(torch.LongTensor(test))
    dummy_v = vectorizer(dummy_x)

    assert isinstance(dummy_v, Variable)
    assert tuple(dummy_v.shape) == (dummy_x.shape[0], 100)

    del vectorizer
    print("Seems fine")

Testing QuestionVectorizer ...
Seems fine
Testing AnswerVectorizer ...
Seems fine


In [18]:
from itertools import chain

question_vectorizer = QuestionVectorizer()
answer_vectorizer = AnswerVectorizer()

opt = torch.optim.Adam(chain(question_vectorizer.parameters(),
                             answer_vectorizer.parameters()))

We are going to use the same encoder architecture for questions and answers, but with different weights. You can also use different encoders for anchors and for negatives/positives.

Negative sampling can be either `in-graph` or `out-graph`. We start with out-graph. In the home assignment you are going to use in-graph.
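
One common reading of the distinction (an assumption here, since the terms are not defined above): out-graph sampling picks negatives while the batch is being assembled, before the model runs, whereas in-graph sampling picks them from the model's own outputs, for example by treating the other answers in the batch as negatives. Below is a minimal plain-Python sketch of the latter with hardest-negative selection. The function name and the toy vectors are hypothetical; a real implementation would do the same on tensors so that gradients flow.

```python
def in_batch_triplet_loss(anchor_vecs, positive_vecs, delta=1.0):
    """Triplet loss where each anchor's negative is picked inside the batch.

    For anchor i, every positive j != i is a candidate negative; this sketch
    takes the hardest one (highest similarity to the anchor).
    Requires at least two examples in the batch.
    """
    n = len(anchor_vecs)
    # sim[i][j] = dot product between anchor i and positive j
    sim = [[sum(a * p for a, p in zip(anchor_vecs[i], positive_vecs[j]))
            for j in range(n)] for i in range(n)]
    losses = []
    for i in range(n):
        hardest_negative = max(sim[i][j] for j in range(n) if j != i)
        losses.append(max(0.0, delta + hardest_negative - sim[i][i]))
    return sum(losses) / n

# made-up 2-dimensional vectors for two (question, answer) pairs
toy_anchors = [[1.0, 0.0], [0.0, 1.0]]
toy_positives = [[1.0, 0.0], [0.5, 1.0]]
toy_loss = in_batch_triplet_loss(toy_anchors, toy_positives)
print(toy_loss)
```

Here the first anchor is penalized because the second answer is too similar to it (0.5 against a positive score of 1.0 with a margin of 1), while the second anchor already satisfies the margin.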

In [19]:
def generate_batch(data, batch_size=None, replace=False, volatile=False, max_len=None):
    """ Samples training/validation batch with random negatives """
    if batch_size is not None:
        batch_ix = np.random.choice(len(data), batch_size, replace=replace)
        negative_ix = np.random.choice(len(data), batch_size, replace=True)
    else:
        batch_ix = range(len(data))
        negative_ix = np.random.permutation(np.arange(len(data)))

    
    anchors, positives = zip(*[data[i] for i in batch_ix])
    
    # sample random rows as negatives.
    # Note: you can do better by sampling "hard" negatives
    negatives = [data[i][1] for i in negative_ix]
    
    anchors, positives, negatives = map(lambda x: Variable(torch.LongTensor(as_matrix(x, max_len=max_len)),
                                                           volatile=volatile), 
                                        [anchors, positives, negatives])
    return anchors, positives, negatives

In [20]:
_dummy_anchors, _dummy_positives, _dummy_negatives = generate_batch(train_data, 2)

print("Q:")
print(_dummy_anchors)
print("A+:")
print(_dummy_positives)
print("A-:")
print(_dummy_negatives)

Q:
tensor([[  1877,   3309,     18,      3,   2025,   3212,    719,      6,
            659,   2535,  14236,    117,   1165,     22,      1,     22,
           7541,     35,      3,  15979,  14236],
        [  1877,    152,      3,   1051,     18,      3,  14490,     58,
            298,   6087,    112,      3,   1589,      0,      0,      0,
              0,      0,      0,      0,      0]])
A+:
tensor([[   112,    777,    547,   1382,      3,  22898,   1230,    130,
           5786,    126,      3,  22722,     35,   6791,      6,   2535,
          14236,   2579,      1,    659,  15267,     22,     42,     18,
              3,  15979,  14236,   2579,      1,     24,     52,    540,
            659,  15267,    188,    719,    117,   1165,    869,   1516,
           1843,     18,      3,   3212,   1165,    126,      3,   6617,
          15979,  14236],
        [    22,      3,    347,   7218,      3,   1589,     18,      3,
           2506,   5004,   3292,    230,     18,      3,  14490

In [21]:
def compute_loss(anchors, positives, negatives, delta=1):
    """ 
    Compute the triplet loss:
    
    max(0, delta + sim(anchors, negatives) - sim(anchors, positives))
    
    where sim is a dot-product between vectorized inputs
    
    """
    v_a = question_vectorizer(anchors)
    v_p = answer_vectorizer(positives)
    v_n = answer_vectorizer(negatives)
    sim_n = (v_a * v_n).sum(dim = 1)
    sim_p = (v_a * v_p).sum(dim = 1)
    return F.relu(delta + sim_n - sim_p).mean()
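
A quick worked example of the formula in the docstring, using made-up similarity values rather than model outputs. The helper `toy_triplet_losses` is hypothetical and only mirrors the arithmetic in plain Python.

```python
def toy_triplet_losses(sim_pos, sim_neg, delta=1.0):
    """Per-triplet hinge: max(0, delta + sim(a, n) - sim(a, p))."""
    return [max(0.0, delta + n - p) for p, n in zip(sim_pos, sim_neg)]

toy_sim_pos = [3.0, 0.5, 1.0]    # sim(anchor, positive), made-up values
toy_sim_neg = [1.0, 0.25, 2.0]   # sim(anchor, negative), made-up values

per_triplet = toy_triplet_losses(toy_sim_pos, toy_sim_neg)
mean_loss = sum(per_triplet) / len(per_triplet)
print(per_triplet, mean_loss)
```

The first triplet contributes nothing because the positive already beats the negative by more than `delta`. The second is ranked correctly but not by a wide enough margin, and the third is ranked incorrectly, so both are penalized.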

In [22]:
def compute_recall(anchors, positives, negatives, delta=1):
    """
    Compute the fraction of triplets for which sim(anchors, positives) is greater than sim(anchors, negatives)
    """
    v_a = question_vectorizer(anchors)
    v_p = answer_vectorizer(positives)
    v_n = answer_vectorizer(negatives)
    sim_n = (v_a * v_n).sum(dim = 1)
    sim_p = (v_a * v_p).sum(dim = 1)
    return (sim_p > sim_n).type(torch.FloatTensor).mean()

In [23]:
print(compute_loss(_dummy_anchors, _dummy_positives, _dummy_negatives))
print(compute_recall(_dummy_anchors, _dummy_positives, _dummy_negatives))

tensor(2.5151)
tensor(0.)


### Training loop

In [24]:
num_epochs = 100
max_len = 100
batch_size = 32
batches_per_epoch = 100

In [25]:
from tqdm import tnrange
def iterate_minibatches(data, batch_size=32, max_len=None,
                        max_batches=None, shuffle=True, verbose=True):
    indices = np.arange(len(data))
    if shuffle:
        indices = np.random.permutation(indices)
    if max_batches is not None:
        indices = indices[: batch_size * max_batches]
        
    irange = tnrange if verbose else range
    
    for start in irange(0, len(indices), batch_size):
        yield generate_batch([data[i] for i in indices[start : start + batch_size]], max_len=max_len)

For a change, we'll ask __you__ to implement the training loop this time.

Here's a sketch of one epoch:
1. iterate over __`batches_per_epoch`__ batches from __`train_data`__
    * Compute loss, backprop, optimize
    * Compute and accumulate recall
    
2. iterate over __`batches_per_epoch`__ batches from __`val_data`__
    * Compute and accumulate recall
    
3. print stuff :)
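
The three steps above can be laid out as a skeleton like the one below. This is only a sketch of the control flow: `run_epoch`, `train_step` and `eval_recall` are hypothetical names, and the lambdas are dummy stand-ins so the snippet runs on its own.

```python
def run_epoch(train_batches, val_batches, train_step, eval_recall):
    """One epoch: train with metrics, validate, print.

    train_step(batch) -> (loss, recall) is assumed to do the backward pass
    and the optimizer step; eval_recall(batch) -> recall only evaluates.
    """
    train_loss = train_recall = n_train = 0
    for batch in train_batches:
        loss, recall = train_step(batch)
        train_loss += loss
        train_recall += recall
        n_train += 1

    val_recall = n_val = 0
    for batch in val_batches:
        val_recall += eval_recall(batch)
        n_val += 1

    stats = {
        "train_loss": train_loss / max(n_train, 1),
        "train_recall": train_recall / max(n_train, 1),
        "val_recall": val_recall / max(n_val, 1),
    }
    print("loss %.5f | train recall %.3f | val recall %.3f"
          % (stats["train_loss"], stats["train_recall"], stats["val_recall"]))
    return stats

# dummy stand-ins: integers play the role of batches
stats = run_epoch(
    train_batches=[1, 2, 3, 4],
    val_batches=[1, 2],
    train_step=lambda batch: (1.0 / batch, 0.5),
    eval_recall=lambda batch: 0.25 * batch,
)
```

In this notebook, `train_step` would wrap `compute_loss`, `loss.backward()`, `opt.step()`, `opt.zero_grad()` and `compute_recall`, while the batches would come from `iterate_minibatches`.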

In [None]:
from IPython.display import clear_output

In [26]:
for epoch_i in range(num_epochs):
    
    print("Training:")
    train_loss = train_batches = 0
    # put both vectorizers into training mode
    question_vectorizer.train(True)
    answer_vectorizer.train(True)
    
    for batch in iterate_minibatches(train_data, max_batches=512):
        loss = compute_loss(*batch)
        loss.backward()
        opt.step()
        opt.zero_grad()

        train_loss += loss.item()
        train_batches += 1
    
    print("\tLoss:\t%.5f" % (train_loss / train_batches))
    print('\n\n')

Training:
	Loss:	0.82672

Training:
	Loss:	0.69097

Training:
	Loss:	0.63133

Training:
	Loss:	0.58873

Training:
	Loss:	0.55738

Training:
	Loss:	0.53853

Training:
	Loss:	0.50307

Training:
	Loss:	0.50015

Training:
	Loss:	0.47774

Training:
	Loss:	0.46864

Training:
	Loss:	0.44636

Training:
	Loss:	0.43096

Training:
	Loss:	0.44157

Training:
	Loss:	0.41642

Training:
	Loss:	0.40748

Training:
	Loss:	0.39624

Training:
	Loss:	0.38050

Training:
	Loss:	0.38799

Training:
	Loss:	0.38765

Training:
	Loss:	0.38778

Training:
	Loss:	0.35967

Training:
	Loss:	0.35906

Training:
	Loss:	0.36295

Training:
	Loss:	0.35497

Training:
	Loss:	0.34423

Training:
	Loss:	0.33270

Training:
	Loss:	0.32945

Training:
	Loss:	0.33869

Training:

KeyboardInterrupt: 

In [None]:
for epoch_i in range(num_epochs):
    
    print("Training:")
    train_loss = train_batches = 0
    # put both vectorizers into training mode
    question_vectorizer.train(True)
    answer_vectorizer.train(True)
    
    for batch in iterate_minibatches(train_data, max_batches=batches_per_epoch):
        loss = compute_loss(*batch)
        loss.backward()
        opt.step()
        opt.zero_grad()

        train_loss += loss.item()
        train_batches += 1
    
    print("\tLoss:\t%.5f" % (train_loss / train_batches))
    print('\n\n')
    
    print("Validation:")
    val_loss = val_batches = 0
    # switch to evaluation mode; no gradients are needed for validation
    question_vectorizer.train(False)
    answer_vectorizer.train(False)
    
    with torch.no_grad():
        for batch in iterate_minibatches(val_data, shuffle=False):
            loss = compute_loss(*batch)

            val_loss += loss.item()
            val_batches += 1
        
    print("\tLoss:\t%.5f" % (val_loss / val_batches))
    print('\n\n')

In [45]:
Q = train_data[0][0]
print(f'Question: {Q}')
V_Q = question_vectorizer(Variable(torch.LongTensor(as_matrix([Q]))))
answers = [train_data[i][1] for i in range(100)]
V_A = answer_vectorizer(Variable(torch.LongTensor(as_matrix(answers))))

Question: What was William Scott Wilson's occupation?


In [49]:
sim = (V_A * V_Q).sum(1).data.numpy()
for i in np.argsort(-sim)[:5]:
    print(sim[i])
    print(answers[i])

-0.56246567
Zen meditation became an important teaching due to it offering a process to calm one's mind.
-0.9209545
Japan mustered a mere 10,000 samurai to meet this threat.
-0.95281345
Samurai were many of the early exchange students, not directly because they were samurai, but because many samurai were literate and well-educated scholars.
-1.0128446
Bushido was formalized by several influential leaders and families before the Edo Period.
-1.0128446
Bushido was formalized by several influential leaders and families before the Edo Period.
