# Tuning the no-answer performance for SQ2

We saw that the compound model's inferior performance is caused by weaker classification accuracy. In the paper we follow BERT's approach for the independent model (using score = start_logit[0] + end_logit[0] as the no-answer logit), and use score = logit\[0,0\] as the no-answer logit for the joint/compound models. Note that BERT's post-processing finds the optimal threshold on dev data over the difference between the lowest no-answer score and the best span answer score (no_ans_score = score - best_span_score) across all windows (subparts of the input that satisfy the model's input length constraint) of the example.
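
The threshold search described above can be sketched as follows. This is a minimal illustration, not BERT's actual code: the helper name `best_threshold`, the toy scores, and the use of plain classification accuracy as the selection criterion are all assumptions made here for illustration.

```python
def best_threshold(no_ans_scores, is_unanswerable):
    """Return (threshold, accuracy) maximizing the accuracy of the rule
    `no_ans_score > threshold  =>  predict no-answer`."""
    # Candidate thresholds: each observed score, plus one below all of them
    # (which makes the rule predict no-answer for every example).
    candidates = [min(no_ans_scores) - 1.0] + sorted(set(no_ans_scores))
    best_t, best_acc = None, -1.0
    for t in candidates:
        correct = sum((s > t) == y for s, y in zip(no_ans_scores, is_unanswerable))
        acc = correct / len(no_ans_scores)
        if acc > best_acc:  # strict comparison keeps the lowest threshold among ties
            best_t, best_acc = t, acc
    return best_t, best_acc


# Toy data, made up for illustration: per-example no-answer score and
# best span score, already reduced over all windows of each example.
score = [5.0, 1.0, 4.0, 0.5]
best_span_score = [2.0, 3.0, 1.5, 4.0]
labels = [True, False, True, False]  # True = question is unanswerable

no_ans_score = [s - b for s, b in zip(score, best_span_score)]
threshold, accuracy = best_threshold(no_ans_score, labels)
print(threshold, accuracy)
```

The search is exhaustive over observed scores, which is cheap because the decision rule has a single free parameter.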

Here, we investigate what happens if we fuse the no-answer scores from the independent heads (both the start and end probability spaces are extended with a no-answer option), the joint head's score, and the best span answer score via a trivial logistic regression, and check whether the result differs from the official BERT approach.
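
Written out, the fusion is a single linear layer followed by a sigmoid. In the notation below (introduced here for illustration, not taken from the paper), $x \in \mathbb{R}^4$ collects the four scores of one example, and $w$, $b$ are the learned weights and bias:

$$
p(\text{no-answer} \mid x) = \sigma(w^\top x + b), \qquad \hat{y} = \mathbb{1}\left[\, w^\top x + b > 0 \,\right]
$$

Thresholding the difference of two scores is the special case in which two weights are fixed to $+1$ and $-1$ and the remaining ones to $0$, so the comparison in this notebook asks whether freeing the weights helps.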

In [1]:
%load_ext autoreload
%autoreload 2
import torch
import math
import os
os.chdir("../")

from IPython.display import Markdown, display
def printmd(string):
    display(Markdown(string))
    
def colorize(string,color="red"):
     return f"<span style=\"color:{color}\">{string}</span>"
print(os.getcwd())

/home/anonymized_name/research/JointSpanExtraction


## Let's try it out for the best and worst models
(the same code is repeated below for the worst model)


In [2]:
import pickle
# This checkpoint: EM_74.14_F1_76.85_L_2.63
with open("squad2_scores_and_gts_best_model.pkl","rb") as f:
    data = pickle.load(f)

In [3]:
scores,labels = [],[]
for _id, l in data['labels'].items():
    labels.append(l)
    scores.append(data['scores'][_id])

In [4]:
import torch
X = torch.FloatTensor(scores)
Y = torch.BoolTensor(labels)

In [5]:
print(X,Y)
print(X.shape, Y.shape)

tensor([[ 0.6124, -2.5779, -3.0910,  9.5602],
        [ 1.2289, -1.8597, -1.9223,  9.6860],
        [ 9.1972,  4.7305,  4.6422,  7.0565],
        ...,
        [ 8.6794,  5.2043,  5.4920,  4.8284],
        [ 6.4353,  3.7706,  4.1866,  2.2437],
        [ 7.9825,  4.8924,  5.5472,  3.2727]]) tensor([False, False, False,  ...,  True,  True,  True])
torch.Size([11873, 4]) torch.Size([11873])


In [6]:
class ConstrainedLR(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4,1,bias=True)
        
    def forward(self,X):
        return self.linear(X).squeeze(-1)

In [7]:
@torch.no_grad()
def evaluate_model(M,X,Y):
    M.eval()
    # The model outputs log-odds (logits); log-odds > 0 means P(no-answer) > 0.5.
    log_odds = M(X)
    predictions = log_odds > 0.
    accuracy = (predictions==Y).sum()/float(len(Y))
    return accuracy

In [8]:
evaluate_model(ConstrainedLR(),X,Y)

tensor(0.2337)

In [9]:
import torch.nn.functional as F
from tqdm import tqdm
from transformers import get_linear_schedule_with_warmup
def run_training():
    STEPS=100
    model =  ConstrainedLR()
    model = model.train()
    opt = torch.optim.SGD(model.parameters(), lr=0.3)
    scheduler = get_linear_schedule_with_warmup(
                    opt,
                    num_warmup_steps=10,
                    num_training_steps=STEPS
                )

    best_acc= 0
    best_r = None
    iterator = tqdm(range(STEPS))
    labels=Y.float()
    for i in iterator:
        log_odds = model(X)
        l_list = F.binary_cross_entropy_with_logits(log_odds, labels, reduction='none')
        l = l_list.mean()
        l.backward()
        opt.step()
        scheduler.step()
        opt.zero_grad()
        if i % 1 == 0:
            r = { k: v.tolist() for k,v in dict(model.named_parameters()).items()}
            model.eval()
            with torch.no_grad():
                acc = evaluate_model(model, X, Y)
            model.train()
            if acc>best_acc:
                iterator.set_description(str(acc))
                best_acc = acc
                best_r = r
    return best_acc, best_r
total_best_acc, total_best_r = 0.,None
for _ in range(20):
    acc,r = run_training()
    if acc>total_best_acc:
        print(r)
        total_best_acc=acc
        total_best_r=r

tensor(0.8011): 100%|██████████| 100/100 [00:00<00:00, 514.69it/s]
tensor(0.8028): 100%|██████████| 100/100 [00:00<00:00, 520.81it/s]
tensor(0.3003):   0%|          | 0/100 [00:00<?, ?it/s]

{'linear.weight': [[0.5758213996887207, -0.12292639166116714, -0.1670505702495575, -0.4772215783596039]], 'linear.bias': [0.08077636361122131]}
{'linear.weight': [[0.43374213576316833, 0.03123548999428749, -0.14522607624530792, -0.40512537956237793]], 'linear.bias': [0.1149904802441597]}


tensor(0.8014): 100%|██████████| 100/100 [00:00<00:00, 460.04it/s]
tensor(0.8035): 100%|██████████| 100/100 [00:00<00:00, 406.44it/s]
tensor(0.8025):  22%|██▏       | 22/100 [00:00<00:00, 217.19it/s]

{'linear.weight': [[0.3941035866737366, -0.145783931016922, 0.07665596902370453, -0.3855445086956024]], 'linear.bias': [0.08933649957180023]}


tensor(0.8026): 100%|██████████| 100/100 [00:00<00:00, 406.51it/s]
tensor(0.8014): 100%|██████████| 100/100 [00:00<00:00, 477.57it/s]
tensor(0.8032): 100%|██████████| 100/100 [00:00<00:00, 616.83it/s]
tensor(0.8018): 100%|██████████| 100/100 [00:00<00:00, 701.04it/s]
tensor(0.8023): 100%|██████████| 100/100 [00:00<00:00, 401.65it/s]
tensor(0.8021): 100%|██████████| 100/100 [00:00<00:00, 592.86it/s]
tensor(0.8038): 100%|██████████| 100/100 [00:00<00:00, 709.71it/s]
tensor(0.8024):  25%|██▌       | 25/100 [00:00<00:00, 247.85it/s]

{'linear.weight': [[0.42036378383636475, -0.20511984825134277, 0.10401123017072678, -0.40278103947639465]], 'linear.bias': [0.12815922498703003]}


tensor(0.8027): 100%|██████████| 100/100 [00:00<00:00, 355.90it/s]
tensor(0.8033): 100%|██████████| 100/100 [00:00<00:00, 396.62it/s]
tensor(0.8027): 100%|██████████| 100/100 [00:00<00:00, 532.10it/s]
tensor(0.8036): 100%|██████████| 100/100 [00:00<00:00, 678.78it/s]
tensor(0.8023): 100%|██████████| 100/100 [00:00<00:00, 420.22it/s]
tensor(0.8032): 100%|██████████| 100/100 [00:00<00:00, 411.34it/s]
tensor(0.8022): 100%|██████████| 100/100 [00:00<00:00, 615.76it/s]
tensor(0.8028): 100%|██████████| 100/100 [00:00<00:00, 503.38it/s]
tensor(0.8032): 100%|██████████| 100/100 [00:00<00:00, 483.20it/s]


In [10]:
print("Total best:")
print(total_best_acc)
print(total_best_r)

Total best:
tensor(0.8038)
{'linear.weight': [[0.42036378383636475, -0.20511984825134277, 0.10401123017072678, -0.40278103947639465]], 'linear.bias': [0.12815922498703003]}


## Solution from BERT's source (`get_predictions` method)

In [11]:
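found_model = ConstrainedLR()
# Hand-set the weights to "first feature minus last feature", mirroring
# no_ans_score = score - best_span_score. In-place writes to parameters
# that require grad must happen under no_grad.
with torch.no_grad():
    found_model.linear.weight[0] = torch.FloatTensor([1., 0., 0., -1.])
    found_model.linear.bias[0] = 0.

found_best_threshold = -0.752
# Caveat: the threshold is added to every feature column, so with weights
# [1, 0, 0, -1] it cancels out and does not shift the decision boundary.
evaluate_model(found_model, X + found_best_threshold, Y)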

tensor(0.8022)
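
One detail of the cell above is worth checking: `X+found_best_threshold` shifts all four features by the same amount, and because the hand-set weights `[1, 0, 0, -1]` sum to zero, that shift cancels in the linear layer. The sketch below verifies this on the first row of `X` printed earlier and shows one way to make a threshold take effect, by folding it into the bias. The helper `log_odds` is hypothetical, and the sign convention (predict no-answer when `score - best_span_score > threshold`) is an assumption.

```python
def log_odds(x, w, b=0.0):
    """Plain-Python equivalent of a 4-input linear layer: w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


w = [1.0, 0.0, 0.0, -1.0]               # first feature minus last feature
x = [0.6124, -2.5779, -3.0910, 9.5602]  # first row of X printed above
threshold = -0.752

plain = log_odds(x, w)
shifted_all = log_odds([xi + threshold for xi in x], w)
# Shifting every feature changes nothing, since the weights sum to zero.
assert abs(shifted_all - plain) < 1e-9

# Folding the threshold into the bias does move the decision boundary:
# log-odds > 0  <=>  x[0] - x[3] > threshold  (assumed sign convention).
with_bias = log_odds(x, w, b=-threshold)
assert abs(with_bias - (plain - threshold)) < 1e-9
print(plain, shifted_all, with_bias)
```

Whether the reported accuracy changes once the threshold is applied this way would need to be re-run on the actual data.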

__Conclusion 1__: For the best model, the two solutions are approximately the same (0.8038 for the fused model vs. 0.8022 for BERT's rule)!

In [12]:
import pickle
# This checkpoint: EM_72.62_F1_75.05_L_2.72

with open("squad2_scores_and_gts_worst_model.pkl","rb") as f:
    data = pickle.load(f)

In [13]:
scores,labels = [],[]
for _id, l in data['labels'].items():
    labels.append(l)
    scores.append(data['scores'][_id])

In [14]:
import torch
X = torch.FloatTensor(scores)
Y = torch.BoolTensor(labels)

In [15]:
print(X,Y)
print(X.shape, Y.shape)

tensor([[ 1.7163, -2.4542, -2.9251, 10.8823],
        [ 0.4709, -2.9283, -3.2450,  8.8130],
        [ 2.2096, -1.1581, -1.6156,  8.7181],
        ...,
        [10.4564,  5.8908,  5.2725,  7.3239],
        [ 6.2576,  2.5990,  2.7125,  4.5220],
        [ 8.7057,  4.2362,  4.1163,  4.1985]]) tensor([False, False, False,  ...,  True,  True,  True])
torch.Size([11873, 4]) torch.Size([11873])


In [16]:
class ConstrainedLR(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4,1,bias=True)
        
    def forward(self,X):
        return self.linear(X).squeeze(-1)

In [17]:
@torch.no_grad()
def evaluate_model(M,X,Y):
    M.eval()
    # The model outputs log-odds (logits); log-odds > 0 means P(no-answer) > 0.5.
    log_odds = M(X)
    predictions = log_odds > 0.
    accuracy = (predictions==Y).sum()/float(len(Y))
    return accuracy

In [18]:
evaluate_model(ConstrainedLR(),X,Y)

tensor(0.4995)

In [19]:
import torch.nn.functional as F
from tqdm import tqdm
from transformers import get_linear_schedule_with_warmup
def run_training():
    STEPS=100
    model =  ConstrainedLR()
    model = model.train()
    opt = torch.optim.SGD(model.parameters(), lr=0.3)
    scheduler = get_linear_schedule_with_warmup(
                    opt,
                    num_warmup_steps=10,
                    num_training_steps=STEPS
                )

    best_acc= 0
    best_r = None
    iterator = tqdm(range(STEPS))
    labels=Y.float()
    for i in iterator:
        log_odds = model(X)
        l_list = F.binary_cross_entropy_with_logits(log_odds, labels, reduction='none')
        l = l_list.mean()
        l.backward()
        opt.step()
        scheduler.step()
        opt.zero_grad()
        if i % 1 == 0:
            r = { k: v.tolist() for k,v in dict(model.named_parameters()).items()}
            model.eval()
            with torch.no_grad():
                acc = evaluate_model(model, X, Y)
            model.train()
            if acc>best_acc:
                iterator.set_description(str(acc))
                best_acc = acc
                best_r = r
    return best_acc, best_r
total_best_acc, total_best_r = 0.,None
for _ in range(20):
    acc,r = run_training()
    if acc>total_best_acc:
        print(r)
        total_best_acc=acc
        total_best_r=r

tensor(0.7862): 100%|██████████| 100/100 [00:00<00:00, 551.49it/s]
tensor(0.7859): 100%|██████████| 100/100 [00:00<00:00, 696.45it/s]
tensor(0.6734):   0%|          | 0/100 [00:00<?, ?it/s]

{'linear.weight': [[0.5943999290466309, -0.20816883444786072, -0.1340014934539795, -0.4769420027732849]], 'linear.bias': [0.021049603819847107]}


tensor(0.7846): 100%|██████████| 100/100 [00:00<00:00, 511.23it/s]
tensor(0.7855): 100%|██████████| 100/100 [00:00<00:00, 643.86it/s]
tensor(0.7860): 100%|██████████| 100/100 [00:00<00:00, 639.13it/s]
tensor(0.7864): 100%|██████████| 100/100 [00:00<00:00, 729.83it/s]
tensor(0.7856): 100%|██████████| 100/100 [00:00<00:00, 646.69it/s]
tensor(0.7762):   0%|          | 0/100 [00:00<?, ?it/s]

{'linear.weight': [[0.5754302144050598, -0.14204084873199463, -0.16680869460105896, -0.44428592920303345]], 'linear.bias': [-0.07883349061012268]}


tensor(0.7867): 100%|██████████| 100/100 [00:00<00:00, 592.41it/s]
tensor(0.7858): 100%|██████████| 100/100 [00:00<00:00, 724.38it/s]
tensor(0.7591):   0%|          | 0/100 [00:00<?, ?it/s]

{'linear.weight': [[0.3304724097251892, -0.03998826816678047, -0.004543735645711422, -0.4211843013763428]], 'linear.bias': [0.7147210836410522]}


tensor(0.7866): 100%|██████████| 100/100 [00:00<00:00, 586.22it/s]
tensor(0.7860): 100%|██████████| 100/100 [00:00<00:00, 755.22it/s]
tensor(0.7854): 100%|██████████| 100/100 [00:00<00:00, 707.60it/s]
tensor(0.7858): 100%|██████████| 100/100 [00:00<00:00, 816.33it/s]
tensor(0.7866): 100%|██████████| 100/100 [00:00<00:00, 687.89it/s]
tensor(0.7853): 100%|██████████| 100/100 [00:00<00:00, 757.34it/s]
tensor(0.7863): 100%|██████████| 100/100 [00:00<00:00, 645.24it/s]
tensor(0.7862): 100%|██████████| 100/100 [00:00<00:00, 732.73it/s]
tensor(0.7863): 100%|██████████| 100/100 [00:00<00:00, 655.45it/s]
tensor(0.7861): 100%|██████████| 100/100 [00:00<00:00, 619.61it/s]
tensor(0.7864): 100%|██████████| 100/100 [00:00<00:00, 651.45it/s]


In [20]:
print("Total best:")
print(total_best_acc)
print(total_best_r)

Total best:
tensor(0.7867)
{'linear.weight': [[0.3304724097251892, -0.03998826816678047, -0.004543735645711422, -0.4211843013763428]], 'linear.bias': [0.7147210836410522]}


## Solution from BERT's source (`get_predictions` method)

In [21]:
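found_model = ConstrainedLR()
# Hand-set the weights to "first feature minus last feature", mirroring
# no_ans_score = score - best_span_score. In-place writes to parameters
# that require grad must happen under no_grad.
with torch.no_grad():
    found_model.linear.weight[0] = torch.FloatTensor([1., 0., 0., -1.])
    found_model.linear.bias[0] = 0.

found_best_threshold = -0.945
# Caveat: the threshold is added to every feature column, so with weights
# [1, 0, 0, -1] it cancels out and does not shift the decision boundary.
evaluate_model(found_model, X + found_best_threshold, Y)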

tensor(0.7824)

__Conclusion 2__: For the worst model, the fused solution is slightly better (0.7867 vs. 0.7824), but still much weaker than even the worst no-answer accuracy of the independent model! These are all 10 of our results for the no-answer accuracy of the __independent__ models:
```
0.7930598838
0.7954181757
0.7950812768
0.7969342205
0.7965973217
0.7988713889
0.8015665796
0.7997978607
0.8077992083
0.8042617704
```
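__average__: 0.7989387686

__std__: 0.004543938582

The gap is large relative to this spread: the worst model's fused accuracy (0.7867) lies below every one of the 10 independent runs and about 2.7 standard deviations below their average, so the difference is very unlikely to be due to chance.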
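
The summary statistics above can be reproduced with the standard library. The reported std matches the sample standard deviation (n - 1 in the denominator); the z-like distance computed at the end is an informal gauge added here for illustration, not a formal significance test.

```python
import statistics

# No-answer accuracies of the 10 independent models, copied from the list above.
independent_acc = [
    0.7930598838, 0.7954181757, 0.7950812768, 0.7969342205, 0.7965973217,
    0.7988713889, 0.8015665796, 0.7997978607, 0.8077992083, 0.8042617704,
]
mean = statistics.mean(independent_acc)
std = statistics.stdev(independent_acc)  # sample std (n - 1 in the denominator)

fused_worst_acc = 0.7867  # best fused accuracy found for the worst model
z = (mean - fused_worst_acc) / std  # distance from the mean in std units
print(f"mean={mean:.10f} std={std:.12f} z={z:.2f}")
print("below every independent run:", fused_worst_acc < min(independent_acc))
```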