### Anaphora resolution

1) Download the pretrained FastText English word vectors from https://fasttext.cc/docs/en/english-vectors.html

2) In PyTorch, build a feed-forward neural network with three linear layers: an input of size 600 feeding a first layer of size 300, a second layer of size 80, and an output layer with two units. All layers use regularization and dropout, and the activation function on all layers is ReLU.

<img src="scheme.jpg">

In [1]:
import pandas as pd

In [2]:
df_dev = pd.read_csv('gap-development.tsv',sep='\t')

The task is to identify the referent of a pronoun within a text passage. The source text is taken from Wikipedia articles. Each example in the dataset labels the pronoun and two candidate names to which the pronoun could refer. An algorithm should decide whether the pronoun refers to name A, name B, or neither.  
The following columns are available for analysis:
* ID - Unique identifier for an example (matches Id in the output file format);
* Text - Text containing the ambiguous pronoun and two candidate names (about a paragraph in length);
* Pronoun - The target pronoun (text);
* Pronoun-offset - The character offset of Pronoun in Text;
* A - The first name candidate (text);
* A-offset - The character offset of name A in Text;
* A-coref - Whether A corefers with the pronoun (True or False);
* B - The second name candidate;
* B-offset - The character offset of name B in Text;
* B-coref - Whether B corefers with the pronoun (True or False);
* URL - The URL of the source Wikipedia page for the example;
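
The offset columns can be sanity-checked by slicing Text at each offset and comparing the slice with the mention. The sketch below uses a hand-written row (not taken from the dataset) and a hypothetical helper `mention_matches`; it assumes the offsets are zero-based character positions.

```python
def mention_matches(text, mention, offset):
    """Return True if `mention` appears in `text` exactly at `offset`."""
    return text[offset:offset + len(mention)] == mention

# Hypothetical example row, written by hand for illustration only.
row = {
    "Text": "Alice met Beth before she left.",
    "Pronoun": "she", "Pronoun-offset": 22,
    "A": "Alice", "A-offset": 0,
    "B": "Beth", "B-offset": 10,
}

# Check the pronoun and both candidates against their offsets.
checks = {
    name: mention_matches(row["Text"], row[name], row[name + "-offset"])
    for name in ("Pronoun", "A", "B")
}
print(checks)
```

A mismatch on real data would usually point to an encoding or off-by-one problem in how the file was read.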

In [3]:
df_dev

Unnamed: 0,ID,Text,Pronoun,Pronoun-offset,A,A-offset,A-coref,B,B-offset,B-coref,URL
0,development-1,Zoe Telford -- played the police officer girlf...,her,274,Cheryl Cassidy,191,True,Pauline,207,False,http://en.wikipedia.org/wiki/List_of_Teachers_...
1,development-2,"He grew up in Evanston, Illinois the second ol...",His,284,MacKenzie,228,True,Bernard Leach,251,False,http://en.wikipedia.org/wiki/Warren_MacKenzie
2,development-3,"He had been reelected to Congress, but resigne...",his,265,Angeloz,173,False,De la Sota,246,True,http://en.wikipedia.org/wiki/Jos%C3%A9_Manuel_...
3,development-4,The current members of Crime have also perform...,his,321,Hell,174,False,Henry Rosenthal,336,True,http://en.wikipedia.org/wiki/Crime_(band)
4,development-5,Her Santa Fe Opera debut in 2005 was as Nuria ...,She,437,Kitty Oppenheimer,219,False,Rivera,294,True,http://en.wikipedia.org/wiki/Jessica_Rivera
...,...,...,...,...,...,...,...,...,...,...,...
1995,development-1996,"Faye's third husband, Paul Resnick, reported t...",her,433,Nicole,255,False,Faye,328,True,http://en.wikipedia.org/wiki/Faye_Resnick
1996,development-1997,The plot of the film focuses on the life of a ...,her,246,Doris Chu,111,False,Mei,215,True,http://en.wikipedia.org/wiki/Two_Lies
1997,development-1998,Grant played the part in Trevor Nunn's movie a...,she,348,Maria,259,True,Imelda Staunton,266,False,http://en.wikipedia.org/wiki/Sir_Andrew_Aguecheek
1998,development-1999,The fashion house specialised in hand-printed ...,She,284,Helen,145,True,Suzanne Bartsch,208,False,http://en.wikipedia.org/wiki/Helen_David


In [4]:
df_val = pd.read_csv('gap-validation.tsv',sep='\t')

In [5]:
df_val

Unnamed: 0,ID,Text,Pronoun,Pronoun-offset,A,A-offset,A-coref,B,B-offset,B-coref,URL
0,validation-1,He admitted making four trips to China and pla...,him,256,Jose de Venecia Jr,208,False,Abalos,241,False,http://en.wikipedia.org/wiki/Commission_on_Ele...
1,validation-2,"Kathleen Nott was born in Camberwell, London. ...",She,185,Ellen,110,False,Kathleen,150,True,http://en.wikipedia.org/wiki/Kathleen_Nott
2,validation-3,"When she returns to her hotel room, a Liberian...",his,435,Jason Scott Lee,383,False,Danny,406,True,http://en.wikipedia.org/wiki/Hawaii_Five-0_(20...
3,validation-4,"On 19 March 2007, during a campaign appearance...",he,333,Reucassel,300,True,Debnam,325,False,http://en.wikipedia.org/wiki/Craig_Reucassel
4,validation-5,"By this time, Karen Blixen had separated from ...",she,427,Finch Hatton,290,False,Beryl Markham,328,True,http://en.wikipedia.org/wiki/Denys_Finch_Hatton
...,...,...,...,...,...,...,...,...,...,...,...
449,validation-450,"He then agrees to name the gargoyle Goldie, af...",He,305,Lucien,252,False,Abel,264,False,http://en.wikipedia.org/wiki/Goldie_(DC_Comics)
450,validation-451,"Disgusted with the family's ``mendacity'', Bri...",she,365,Maggie,242,False,Mae,257,False,http://en.wikipedia.org/wiki/Cat_on_a_Hot_Tin_...
451,validation-452,She manipulates Michael into giving her custod...,she,306,Scarlett,255,False,Alice,291,True,http://en.wikipedia.org/wiki/Michael_Moon_(Eas...
452,validation-453,"On April 4, 1986, Donal Henahan wrote in the N...",her,330,Aida,250,False,Miss Millo,294,True,http://en.wikipedia.org/wiki/Aprile_Millo


In [6]:
from gensim.models import KeyedVectors

# Load the pretrained FastText vectors (word2vec text format).
# Loading can take a long time, depending on the size of the vector file.
en_model = KeyedVectors.load_word2vec_format('wiki-news-300d-1M.vec')

In [10]:
en_model.index2word[16]

'*'
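
The lookup above goes from index to word; the tokenizer later in the notebook needs the opposite direction and has to decide what to do with words that are not in the vocabulary. The following is a minimal sketch of both options (skip the word, or map it to the `'*'` entry) using a toy vocabulary. The word list and the helper `word_to_index` are assumptions for illustration, not the real FastText vocabulary or a gensim API.

```python
# Toy index-to-word list standing in for the real vocabulary (assumption).
index2word = [",", "the", ".", "and", "*", "she", "he"]
word2index = {word: i for i, word in enumerate(index2word)}

# Reuse the '*' entry as a placeholder for unknown words.
UNKNOWN_INDEX = word2index["*"]

def word_to_index(word, skip_unknown=True):
    """Map a word to its vocabulary index.

    Unknown words give None when skip_unknown is True,
    otherwise the index of the '*' placeholder.
    """
    if word in word2index:
        return word2index[word]
    return None if skip_unknown else UNKNOWN_INDEX

tokens = ["she", "and", "zoe"]
skipped = [i for i in (word_to_index(t) for t in tokens) if i is not None]
replaced = [word_to_index(t, skip_unknown=False) for t in tokens]
print(skipped)   # indices of known words only
print(replaced)  # unknown words mapped to the '*' index
```

Skipping shortens the sequence, while the placeholder keeps word positions aligned with the original text.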

In [7]:
import torch
from torch import nn
import torch.nn.functional as F

class MyAnaphoraNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        
        # define the layers
        self.layers = nn.Sequential(
            nn.Linear(600, 300),
            nn.Dropout(0.3),
            nn.ReLU(),
            nn.Linear(300, 80),
            nn.Dropout(0.3),
            nn.ReLU(),
            nn.Linear(80, 2),
            nn.Dropout(0.3),
            nn.ReLU()
        )
    
    def forward(self, x):
        # forward pass
        x = self.layers(x)
        return x

# instantiate the model
anmodel = MyAnaphoraNetwork()

# print model architecture
print(anmodel)

MyAnaphoraNetwork(
  (layers): Sequential(
    (0): Linear(in_features=600, out_features=300, bias=True)
    (1): Dropout(p=0.3, inplace=False)
    (2): ReLU()
    (3): Linear(in_features=300, out_features=80, bias=True)
    (4): Dropout(p=0.3, inplace=False)
    (5): ReLU()
    (6): Linear(in_features=80, out_features=2, bias=True)
    (7): Dropout(p=0.3, inplace=False)
    (8): ReLU()
  )
)
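
A quick way to check an architecture like this is to push a random batch through it and look at the output shape. The sketch below rebuilds the same layer stack under a hypothetical name, `TinyAnaphoraNet`, so that it runs on its own; the random input is a stand-in, not real features. It also shows that in evaluation mode dropout is switched off, so repeated passes over the same input agree.

```python
import torch
from torch import nn

class TinyAnaphoraNet(nn.Module):
    """Stand-alone copy of the 600-300-80-2 stack, for illustration only."""
    def __init__(self, p_drop=0.3):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(600, 300), nn.Dropout(p_drop), nn.ReLU(),
            nn.Linear(300, 80), nn.Dropout(p_drop), nn.ReLU(),
            nn.Linear(80, 2), nn.Dropout(p_drop), nn.ReLU(),
        )

    def forward(self, x):
        return self.layers(x)

torch.manual_seed(0)
net = TinyAnaphoraNet()
batch = torch.randn(4, 600)  # 4 random examples with 600 features each

net.eval()  # evaluation mode: dropout layers act as the identity
with torch.no_grad():
    out_eval_1 = net(batch)
    out_eval_2 = net(batch)

print(out_eval_1.shape)                     # torch.Size([4, 2])
print(torch.equal(out_eval_1, out_eval_2))  # True
```

Because the last layer is followed by ReLU, every output is non-negative, which matters when choosing a loss function and a decision threshold.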


In [8]:
from nltk import tokenize
import re
from nltk.corpus import stopwords
STOPWORDS = stopwords.words('english')

def clean_text(text):
    text = text.lower()
    text = re.sub(r'\n|\r', ' ', text)
    text = re.sub(r' +', ' ', text)
    text = text.strip()
    
    text = tokenize.word_tokenize(text)
    # Remove stop words, but leave very short texts (two tokens or fewer) untouched
    if len(text) > 2:
        return [word for word in text if word not in STOPWORDS]
    return text

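def simple_tokenizer(text, pronoun, candidate_a, candidate_b):
    """Encode one example as a (1, 600) tensor of vocabulary indices.

    The pronoun and both candidates come first, followed by the cleaned text.
    Words that are missing from the FastText vocabulary are skipped.
    """
    words = [pronoun, candidate_a, candidate_b] + clean_text(text)

    tokenized_text = []
    for word in words:
        try:
            tokenized_text.append(en_model.wv.vocab[word].index)
        except KeyError:
            continue  # alternative: tokenized_text.append(16), the index of '*'

    # Truncate or zero-pad to the 600 inputs expected by the network
    tokenized_text = tokenized_text[:600]
    while len(tokenized_text) < 600:
        tokenized_text.append(0)

    return torch.Tensor([tokenized_text]).view(1, -1)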



def make_target(label, label_to_ix):
    # Encode the A-coref / B-coref flags as a (1, 2) tensor of 0/1 labels
    return torch.LongTensor([[label_to_ix[str(label['A'])], label_to_ix[str(label['B'])]]])

def decode_target(pred, label_to_ix):
    # Not implemented yet
    pass

label_to_ix = {"True": 1, "False": 0}
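
The first linear layer expects exactly 600 inputs, so the list of indices has to be brought to a fixed length before it is turned into a tensor. A minimal sketch of padding and truncation follows; `pad_or_truncate` is a hypothetical helper, and padding with 0 mirrors the loop in `simple_tokenizer`.

```python
def pad_or_truncate(indices, length=600, pad_value=0):
    """Return a list of exactly `length` items.

    Longer inputs are cut off at `length`; shorter ones are
    right-padded with `pad_value`.
    """
    clipped = list(indices[:length])
    return clipped + [pad_value] * (length - len(clipped))

short = pad_or_truncate([7, 8, 9], length=5)
long_ = pad_or_truncate(range(10), length=5)
print(short)  # [7, 8, 9, 0, 0]
print(long_)  # [0, 1, 2, 3, 4]
```

Note that 0 is also a valid vocabulary index, so the network cannot tell padding apart from the word stored at index 0.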

In [9]:
class ConfMatrix():
    def __init__(self):
        self.true_positive = 0
        self.false_positive = 0
        self.true_negative = 0
        self.false_negative = 0
    
    def precision(self):
        pr = 0
        if (self.true_positive + self.false_positive) != 0:
            pr = self.true_positive / (self.true_positive + self.false_positive)
        return round(pr * 100, 2) 
    
    def recall(self):
        rec = 0
        if (self.true_positive + self.false_negative) != 0:
            rec = self.true_positive / (self.true_positive + self.false_negative)
        return round(rec * 100, 2) 
    
    def f1_measure(self):
        pr = self.precision()
        rec = self.recall()
        f1m = 0
        if (pr + rec) != 0:
            f1m = 2 * (pr * rec) / (pr + rec)
        return round(f1m, 2)
    
    def accuracy(self):
        acc = 0
        if (self.true_positive + self.true_negative + self.false_positive + self.false_negative) != 0:
            acc = (self.true_positive + self.true_negative) / (self.true_positive + self.true_negative + self.false_positive + self.false_negative)
        return round(acc *100, 2)
    
    def compare_row(self, predicted, target):
        
        # Scores of 1 or more count as a positive prediction
        if predicted >= 1 and target >= 1:
            self.true_positive += 1
        elif predicted == 0 and target == 0:
            self.true_negative += 1
        elif predicted >= 1 and target == 0:
            self.false_positive += 1
        elif predicted == 0 and target >= 1:
            self.false_negative += 1
    
    def __repr__(self):
        return "TN: {0}, FP: {1}, FN: {2}, TP: {3}, precision: {4}, recall: {5}, f1: {6}, accuracy: {7}\n".format(self.true_negative, 
                                            self.false_positive, 
                                            self.false_negative, 
                                            self.true_positive,
                                            self.precision(),
                                            self.recall(),
                                            self.f1_measure(),
                                            self.accuracy())
        
    # str() produces the same text as repr()
    __str__ = __repr__

In [10]:
a_conmat = ConfMatrix()
b_conmat = ConfMatrix()
a_conmat

TN: 0, FP: 0, FN: 0, TP: 0, precision: 0, recall: 0, f1: 0, accuracy: 0
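
The metrics follow the usual definitions: precision = TP / (TP + FP), recall = TP / (TP + FN), F1 is the harmonic mean of precision and recall, and accuracy = (TP + TN) / total. The stand-alone sketch below uses a hypothetical `metrics` function, separate from `ConfMatrix`, and applies these formulas to the counts that the notebook reports for candidate A in the first training epoch.

```python
def metrics(tp, fp, tn, fn):
    """Return precision, recall, F1 and accuracy as percentages."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    values = {"precision": precision, "recall": recall,
              "f1": f1, "accuracy": accuracy}
    # Convert fractions to percentages rounded to two decimals.
    return {name: round(value * 100, 2) for name, value in values.items()}

# Counts reported for candidate A in epoch 0 of the training output.
result = metrics(tp=297, fp=333, tn=800, fn=570)
print(result)
```

Each denominator is checked with the same sum that is used in the division, which avoids both division by zero and a wrongly reported zero.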

In [11]:
import torch
import torch.nn as nn
import torch.optim as optim

# Multi-label margin loss over the two outputs (candidate A, candidate B).
# Note: this loss reads the target as class indices padded with -1.
lossf = nn.MultiLabelMarginLoss()
optimizer = optim.SGD(anmodel.parameters(), lr=0.1)


X = df_dev

total_loss = 0
total_hits = 0
# Pass over the training data several times; 15 epochs are used here.
# Usually, somewhere between 5 and 30 epochs is reasonable.
for epoch in range(15):
    print('Epoch: {0}\n'.format(epoch))
    a_conmat = ConfMatrix()
    b_conmat = ConfMatrix()
    for row in df_dev.loc[: , :].values:
        txt = row[1] #row['Text']
        pronoun = row[2] #row['Pronoun']
        ca = row[4] #row['A']
        cb = row[7]#row['B']
        label = {}
        label['A'] = row[6] #row['A-coref']
        label['B'] = row[9] #row['B-coref']
        
        # Step 1. Remember that PyTorch accumulates gradients.
        # We need to clear them out before each instance
        anmodel.zero_grad()

        # Step 2. Encode the example as a vector of vocabulary indices and
        # wrap the two coreference labels (A-coref, B-coref) in a tensor
        # of 0/1 values.
        bow_vec = simple_tokenizer(txt, pronoun, ca, cb)
        target = make_target(label, label_to_ix)
        #print('target: ', target)
        
        
        # Step 3. Run our forward pass.
        log_probs = anmodel(bow_vec)
        # Add row to confusion matrix
        a_conmat.compare_row(log_probs[0][0].float(), target[0][0].float())
        b_conmat.compare_row(log_probs[0][1].float(), target[0][1].float()) 
        
        # Step 4. Compute the loss, gradients, and update the parameters by
        # calling optimizer.step()
        loss = lossf(log_probs, target)
        loss.backward()
        optimizer.step()
    print('a_CM: {0}\n'.format(a_conmat))
    print('b_CM: {0}\n'.format(b_conmat))
    

Epoch: 0





a_CM: TN: 800, FP: 333, FN: 570, TP: 297, precision: 47.14, recall: 34.26, f1: 39.68, accuracy: 54.85


b_CM: TN: 1088, FP: 1, FN: 909, TP: 2, precision: 66.67, recall: 0.22, f1: 0.44, accuracy: 54.5


Epoch: 1

a_CM: TN: 756, FP: 377, FN: 580, TP: 287, precision: 43.22, recall: 33.1, f1: 37.49, accuracy: 52.15


b_CM: TN: 1089, FP: 0, FN: 911, TP: 0, precision: 0, recall: 0.0, f1: 0, accuracy: 54.45


Epoch: 2

a_CM: TN: 764, FP: 369, FN: 574, TP: 293, precision: 44.26, recall: 33.79, f1: 38.32, accuracy: 52.85


b_CM: TN: 1089, FP: 0, FN: 911, TP: 0, precision: 0, recall: 0.0, f1: 0, accuracy: 54.45


Epoch: 3

a_CM: TN: 771, FP: 362, FN: 583, TP: 284, precision: 43.96, recall: 32.76, f1: 37.54, accuracy: 52.75


b_CM: TN: 1089, FP: 0, FN: 911, TP: 0, precision: 0, recall: 0.0, f1: 0, accuracy: 54.45


Epoch: 4

a_CM: TN: 790, FP: 343, FN: 566, TP: 301, precision: 46.74, recall: 34.72, f1: 39.84, accuracy: 54.55


b_CM: TN: 1089, FP: 0, FN: 911, TP: 0, precision: 0, recall: 0.0, f1: 
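
`nn.MultiLabelMarginLoss` reads its target as a list of class indices terminated by -1, not as a vector of 0/1 flags, so a target such as `[[1, 0]]` is interpreted as "classes 1 and 0 are both correct". If the two labels are meant as independent yes/no flags for A and B, one commonly used alternative is `nn.BCEWithLogitsLoss` applied to raw scores with a float 0/1 target. The sketch below illustrates both target formats; it is not the loss used in the run above, and the scores are made up.

```python
import math
import torch
from torch import nn

# Made-up raw scores (logits) for candidates A and B of one example.
logits = torch.tensor([[0.0, 0.0]])

# Flag-style float target: A is the referent, B is not.
flag_target = torch.tensor([[1.0, 0.0]])
bce_loss = nn.BCEWithLogitsLoss()(logits, flag_target)

# With both logits at 0 the predicted probability is 0.5 for each label,
# so the mean binary cross-entropy equals ln(2).
print(bce_loss.item(), math.log(2))

# Index-style target for MultiLabelMarginLoss: only class 0 (A) is correct,
# and -1 marks the end of the list of correct classes.
index_target = torch.tensor([[0, -1]])
margin_loss = nn.MultiLabelMarginLoss()(logits, index_target)
print(margin_loss.item())
```

`BCEWithLogitsLoss` applies the sigmoid itself, so it would be fed the output of the last linear layer without a ReLU on top.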

In [77]:
b_conmat

TN: 90, FP: 0, FN: 70, TP: 0

In [39]:
print('total_loss {0}\n'.format(total_loss))
print('total_hits {0}\n'.format(total_hits))

total_loss 150.0

total_hits 10



In [12]:
# Switch to evaluation mode so that dropout is disabled during validation
anmodel.eval()
with torch.no_grad():
    loss_val_total = 0
    print('Validate\n')
    a_conmat = ConfMatrix()
    b_conmat = ConfMatrix()
    for row in df_val.loc[: , :].values:
        txt = row[1] #row['Text']
        pronoun = row[2] #row['Pronoun']
        ca = row[4] #row['A']
        cb = row[7]#row['B']
        label = {}
        #label = pd.DataFrame()
        label['A'] = row[6] #row['A-coref']
        label['B'] = row[9] #row['B-coref']
        
        target = make_target(label, label_to_ix)
        
        bow_vec = simple_tokenizer(txt, pronoun, ca, cb)
        log_probs = anmodel(bow_vec)
        a_conmat.compare_row(log_probs[0][0].float(), target[0][0].float())
        b_conmat.compare_row(log_probs[0][1].float(), target[0][1].float()) 
        
        
    print('a_CM: {0}\n'.format(a_conmat))
    print('b_CM: {0}\n'.format(b_conmat))
    #print('total loss: ', loss_val_total)

Validate





a_CM: TN: 182, FP: 88, FN: 114, TP: 70, precision: 44.3, recall: 38.04, f1: 40.93, accuracy: 55.51


b_CM: TN: 249, FP: 0, FN: 205, TP: 0, precision: 0, recall: 0.0, f1: 0, accuracy: 54.85
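
The two per-candidate outputs still have to be turned into one of the three answers the task asks for: A, B, or neither. A minimal sketch of such a decoding step follows. The function `decode_scores` is hypothetical, the threshold of 1.0 is chosen to match the `>= 1` test in `compare_row`, and resolving a double hit in favour of the larger score is an assumption.

```python
def decode_scores(score_a, score_b, threshold=1.0):
    """Map two candidate scores to 'A', 'B' or 'NEITHER'."""
    a_hit = score_a >= threshold
    b_hit = score_b >= threshold
    if a_hit and b_hit:
        # Both candidates fire: keep the one with the larger score.
        return "A" if score_a >= score_b else "B"
    if a_hit:
        return "A"
    if b_hit:
        return "B"
    return "NEITHER"

print(decode_scores(3.2, 0.0))  # A
print(decode_scores(0.0, 0.0))  # NEITHER
print(decode_scores(1.5, 4.0))  # B
```

With tensors from the network, the scores would first be converted with `.item()`.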




In [136]:
df_val.loc[:15 , ['Text', 'A', 'A-coref', 'B', 'B-coref']]

Unnamed: 0,Text,A,A-coref,B,B-coref
0,He admitted making four trips to China and pla...,Jose de Venecia Jr,False,Abalos,False
1,"Kathleen Nott was born in Camberwell, London. ...",Ellen,False,Kathleen,True
2,"When she returns to her hotel room, a Liberian...",Jason Scott Lee,False,Danny,True
3,"On 19 March 2007, during a campaign appearance...",Reucassel,True,Debnam,False
4,"By this time, Karen Blixen had separated from ...",Finch Hatton,False,Beryl Markham,True
5,No amount of logic can shatter a faith conscio...,James Randi,False,Jos* Alvarez,True
6,Lieutenant General Weber Pasha wanted Faik Pas...,von Sanders,False,Faik Pasha,True
7,He went on to enter mainstream journalism as a...,Colin,False,Jake Burns,True
8,"In 1940 Lester Cowan, an independent film prod...",Scott,False,Cowan,True
9,"They have a stormy marriage, caused by his hot...",Beverley Callard,True,Liz,False
