#**Lab 9 - Coreference Resolution**


March 21st



In this lab, we are going to build a coreference system based on the mention-ranking algorithm proposed by Lee et al. (2017). You will be given part of the code required to build the system, and you are required to fill in three code blocks. Hints are provided to guide you through.

The first part of the notebook will show how to apply coreference resolution to English using a few examples. Then you have to apply that to a real dataset.  

In total, you will be given two python files (`*.py`), three JSON files (`*.jsonl`) and one embedding file (`*.txt`):

*   **metrics.py**: used to compute the CoNLL scores; you don’t need to change it.
*   **[train/test/dev].jsonl**: the training, test and development sets used for training and evaluating the model; they are ready to use.
*   **word_embeddings.filtered.txt**: pre-trained 300-dimensional FastText word embeddings. The original file is large, so we’ve removed all the words that do not appear in the datasets to make it much smaller.

These files are contained in the folder coreference_lab_files provided with the lab.


##**0. Mount your google drive to google colab.**

First, we will mount our Google Drive folders to Colab so we can access the filtered word embeddings and the CoNLL scorer. (Please upload them to your drive first.)

To mount your drive, run the block of code below and complete the instructions as prompted.

In [None]:
from google.colab import drive, output
drive.mount('/content/drive/')

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


##**1. Add relevant paths**

Next, we will append the path of the Google Drive folder containing the aforementioned files to Python's module search path, so that the notebook can import from it.



In [None]:
# IMPORTANT change this to the path to your folder. Remember to start from the home directory, 'My Drive'
# PATH_TO_FOLDER = "/content/drive/MyDrive/Colab_uni/7001p_ass_2/week9/coref_files/"
PATH_TO_FOLDER = "/content/drive/MyDrive/7001p_ass_2/week9/coref_files/"

In [None]:
import sys
sys.path.append(PATH_TO_FOLDER)

Now we can also add the paths to our dev/test/train files and our filtered embeddings


In [None]:
DEV_PATH = PATH_TO_FOLDER + 'dev.jsonl'
TEST_PATH = PATH_TO_FOLDER + 'test.jsonl'
TRAIN_PATH = PATH_TO_FOLDER + 'train.jsonl'

EMBEDDING_PATH = PATH_TO_FOLDER + 'word_embeddings.filtered.txt'


##**2. Import files**

We can now import metrics.py along with other python modules

In [None]:
%%capture

from keras import Input,Model
from keras import backend as K
from keras.layers import Dropout,Dense,LSTM,Bidirectional,Lambda,Reshape
import numpy as np
import tensorflow as tf
import random
import json, time, collections, metrics

#seed everything
seed_value = 42
random.seed(seed_value)
np.random.seed(seed_value)
tf.random.set_seed(seed_value)

## **3. Creating an embedding dictionary**

Using the embedding file, we will create an embedding dictionary,
for easy access while preparing our data

In [None]:
# the dimension of the pretrained embeddings
EMBEDDING_SIZE = 300

In [None]:
def load_embeddings(embedding_path=EMBEDDING_PATH, embedding_size=EMBEDDING_SIZE):
    print("Loading word embeddings from {}...".format(embedding_path))
    # words missing from the file fall back to an all-zero vector
    embeddings = collections.defaultdict(lambda: np.zeros(embedding_size))
    # use a context manager so the file is closed once it has been read
    with open(embedding_path, encoding='utf-8') as f:
        for line in f:
            # each line is "<word> <v1> <v2> ... <vN>"
            splitter = line.find(' ')
            emb = np.fromstring(line[splitter + 1:], np.float32, sep=' ')
            assert len(emb) == embedding_size
            embeddings[line[:splitter]] = emb
    print("Finished loading word embeddings")
    print("Number of words: " + str(len(embeddings)))
    return embeddings

In [None]:
EMBEDDING_DICT = load_embeddings() 

Loading word embeddings from /content/drive/MyDrive/7001p_ass_2/week9/coref_files/word_embeddings.filtered.txt...
Finished loading word embeddings
Number of words: 4675
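
To make the fallback behaviour of the embedding dictionary concrete, here is a minimal sketch that uses a made-up three-dimensional embedding table held in memory. The `toy_lines` entries and the names `TOY_SIZE` and `toy_embeddings` are assumptions for illustration only; they are not part of the lab files. It shows that a word missing from the table yields an all-zero vector, and that looking such a word up also stores it in the `defaultdict`, so the dictionary can grow beyond the word count printed at load time.

```python
import collections
import numpy as np

TOY_SIZE = 3  # toy dimension for illustration; the lab uses 300

# hypothetical embedding lines in the "<word> <v1> ... <vN>" format
toy_lines = ["cat 0.1 0.2 0.3", "rat 0.4 0.5 0.6"]

toy_embeddings = collections.defaultdict(lambda: np.zeros(TOY_SIZE))
for line in toy_lines:
    word, *values = line.split()
    toy_embeddings[word] = np.array([float(v) for v in values], dtype=np.float32)

size_before = len(toy_embeddings)      # two known words
oov_vector = toy_embeddings["yawned"]  # unseen word: the default factory returns zeros
size_after = len(toy_embeddings)       # the lookup itself inserted "yawned"

print(oov_vector, size_before, size_after)
```

If the growing dictionary is unwanted, `toy_embeddings.get(word)` can be used for a lookup that does not insert, at the cost of handling the `None` result explicitly.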


##**4. Preparing Documents for Coreference**

In this section, we will show how to prepare the dataset for coreference resolution using a few examples in English. Then you will have to prepare the Arabic dataset in the JSON files.
<br>

Each line in a given JSON file contains the information for a single document. The “doc_key” element stores the name of the document; “sentences” holds the tokenized sentences of the document; “clusters” stores the coreference clusters. Each cluster contains a number of mentions, and each mention is encoded as a pair of start and end indices denoting the positions of its first and last tokens. Both indices are inclusive and count tokens over the whole document rather than within a sentence.
As an illustration, consider the dummy dataset in the block of code below containing one document. Run the code cell to see the clusters of coreferent mentions.


In [None]:
dummy_dataset = [{'doc_key': 'large_cat',
                  'sentences':[['The', 'large', 'cat', 'yawned', '.'],
                               ['He','was', 'very', 'hungry', 'as', 'he', 'had', 'not', 'eaten', 'since', 'breakfast','.'],
                               ['An', 'unfortunate', 'rat', 'came', 'along', '.'],
                               ['The', 'cat', 'gobbled', 'him', 'up', '.']],
                  'clusters': [[[0, 2], [5, 5], [10, 10], [23, 24]], [[17, 19], [26, 26]]]
                }]


sents = [w for sent in dummy_dataset[0]['sentences'] for w in sent]
print('These are the clusters in %s' %dummy_dataset[0]['doc_key'])
for cl_idx, cl in enumerate(dummy_dataset[0]['clusters']):
    print('Cluster ' + str(cl_idx) + ':', [' '.join(sents[s: e+1])  for s, e in cl])

These are the clusters in large_cat
Cluster 0: ['The large cat', 'He', 'he', 'The cat']
Cluster 1: ['An unfortunate rat', 'him']


To prepare each dataset for the coreference resolution model, we will need to create the following variables from each document:

1.   Embedded Sentences: A 1 X num_sents X num_words X embedding size array for each document. 
2.   Mention Pairs: A 1 X num_pairs X 4 array like so [anaphor_start, anaphor_end, antecedent_start, antecedent_end]
3. Mention Pair Labels: A 1 X num_pairs array containing the corresponding label for each mention pair (i.e. 1 if the pair of mentions is coreferent, 0 otherwise). 

The functions in the subsections below contain the code for extracting these variables. Study them and test their functionality using the dummy dataset.
In section 4.4, you'll use these functions to create the dev, test and train datasets.

###**4.1 Getting the mentions from the clusters**

The following block of code gets the mentions from the clusters of a given document. 

In [None]:
def get_mentions(clusters):

    # get a list of mentions (as tuples) sorted by start indices.
    gold_mentions = sorted([tuple(m) for cl in clusters for m in cl])

    # number of mentions
    num_mentions = len(gold_mentions)

    # assign unique indices to each mention in the mention list based on its position in the list
    gold_mention_map = {m: i for i, m in enumerate(gold_mentions)}

    # assign cluster ids to each mention in order, e.g. cluster_ids = [4, 11, 5, 4, ..] => mention 0 is in cluster 4
    # along with mention 3.
    cluster_ids = [0]*num_mentions
    for cid, cluster in enumerate(clusters):
        for mention in cluster:
            cluster_ids[gold_mention_map[tuple(mention)]] = cid

    return gold_mentions, gold_mention_map, cluster_ids, num_mentions


In [None]:
dmentions, dment_map, dcluster_ids, dnum_mentions = get_mentions(dummy_dataset[0]['clusters'])
print('These are all the coreferent mentions in the sample document ', dmentions)
print('These are the mentions mapped to unique ids denoting their order in the document ', dment_map)
print('These are the cluster ids of the ordered mentions', dcluster_ids)
print('There are %d mentions in the document titled \'%s\'' %(dnum_mentions, dummy_dataset[0]['doc_key']))

These are all the coreferent mentions in the sample document  [(0, 2), (5, 5), (10, 10), (17, 19), (23, 24), (26, 26)]
These are the mentions mapped to unique ids denoting their order in the document  {(0, 2): 0, (5, 5): 1, (10, 10): 2, (17, 19): 3, (23, 24): 4, (26, 26): 5}
These are the cluster ids of the ordered mentions [0, 0, 0, 1, 0, 1]
There are 6 mentions in the document titled 'large_cat'


###**4.2 Turning the sentences into embeddings and the mention indices into vectors**

Using the next block of code, you can generate the padded document embeddings and a copy of the mention start and end indices adjusted for padding. 

In [None]:
def tensorize_doc_sentences(sentences, mentions):
    starts, ends = [],[]
    sent_lengths = [len(sent) for sent in sentences]  # the actual, unpadded length of each sentence
    max_sent_length = max(sent_lengths)

    # by padding each sentence to the maximum length, the embedded document gains a new (sentence) dimension
    embedded_sentences = np.zeros([1, len(sentences), max_sent_length, EMBEDDING_SIZE])

    # in this block, we adjust the mention indices to reflect the added padding.
    sent_start = 0  # index of the sentence's first token in the unpadded document
    offset = 0      # number of padding positions inserted before this sentence
    for i, sent in enumerate(sentences):
        for m_start, m_end in mentions:
            # only mentions that lie entirely within the current sentence are shifted here
            if sent_start <= m_start and m_end < sent_start + len(sent):
                starts.append(m_start + offset)
                ends.append(m_end + offset)
        sent_start += len(sent)
        offset += max_sent_length - len(sent)

        # populate the embedding tensor with the appropriate word embeddings.
        for j, word in enumerate(sent):
            embedded_sentences[0, i, j] = EMBEDDING_DICT[word]

    return embedded_sentences, starts, ends

In [None]:
dsents_embedded, dstarts, dends = tensorize_doc_sentences(dummy_dataset[0]['sentences'], dmentions)

In [None]:
print('%d document with %d sentences, each with a maximum of %d words, encoded as %d dimensional vectors' %(dsents_embedded.shape[0], dsents_embedded.shape[1], dsents_embedded.shape[2], dsents_embedded.shape[3])) 
print('Mention starts: ', dstarts)
print('Mention ends: ', dends)

1 document with 4 sentences, each with a maximum of 12 words, encoded as 300 dimensional vectors
Mention starts:  [0, 12, 17, 24, 36, 39]
Mention ends:  [2, 12, 17, 26, 37, 39]
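
One way to read the adjusted indices: after padding, a token at position `p` of sentence `i` sits at flat index `i * max_sent_length + p`. The sketch below recomputes the padded starts and ends of the dummy document with that formula. The helper `to_padded_index` is a hypothetical name introduced for this illustration and is not part of the lab code; unlike `tensorize_doc_sentences`, it maps each index on its own and does not check that a mention stays within one sentence.

```python
def to_padded_index(flat_index, sent_lengths):
    """Map an index in the unpadded document to its index after every
    sentence has been padded to the length of the longest sentence."""
    max_len = max(sent_lengths)
    sent_start = 0
    for sent_idx, length in enumerate(sent_lengths):
        if flat_index < sent_start + length:
            return sent_idx * max_len + (flat_index - sent_start)
        sent_start += length
    raise IndexError("index lies outside the document")

# sentence lengths and gold mentions of the dummy document
lengths = [5, 12, 6, 6]
mentions = [(0, 2), (5, 5), (10, 10), (17, 19), (23, 24), (26, 26)]

padded_starts = [to_padded_index(s, lengths) for s, _ in mentions]
padded_ends = [to_padded_index(e, lengths) for _, e in mentions]
print(padded_starts)  # → [0, 12, 17, 24, 36, 39]
print(padded_ends)    # → [2, 12, 17, 26, 37, 39]
```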


###**4.3. Generating Mention Pairs**

This next function generates the example pairs for training or evaluation. For each mention (anaphor), the candidate antecedents are the mentions preceding it.

<br>

Here, during training we choose up to 250 antecedents (i.e. MAX_ANT = 250) and maintain a 2:1 negative to positive example ratio, i.e. NEG_RATIO = 2. (Choosing this ratio can be challenging: you want ample examples to learn from, but you do not want the positive examples to be overshadowed by the negative ones.)

<br>

At test time, we generate up to MAX_ANT examples without paying attention to the example ratio. We also do not generate training labels for the pairs.
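
The training-time balancing in `generate_pairs` can be pictured as a running budget per anaphor: it starts at 1, every positive pair adds `neg_ratio` to it, and a negative pair is kept only while the budget is positive, spending 1. Below is a minimal sketch of that bookkeeping on a plain list of labels. `keep_with_budget` is a hypothetical helper written for this illustration, not a function from the lab. Under this scheme the ratio is only approximate, since which negatives survive depends on where the positives fall among the candidates.

```python
def keep_with_budget(labels, neg_ratio):
    """Return the positions of the candidate pairs that would be kept.

    labels: 1 (coreferent) or 0 (not coreferent) for each candidate
    antecedent of a single anaphor, in document order.
    """
    budget = 1  # allows one negative before any positive has been seen
    kept = []
    for position, label in enumerate(labels):
        if label == 1:
            budget += neg_ratio   # each positive buys more negatives
            kept.append(position)
        elif budget > 0:
            budget -= 1           # each kept negative spends one unit
            kept.append(position)
    return kept

# one anaphor with four candidates, of which only the third is coreferent
print(keep_with_budget([0, 0, 1, 0], neg_ratio=1))  # → [0, 2, 3]
# no positives at all: only the first negative is kept
print(keep_with_budget([0, 0, 0, 0], neg_ratio=2))  # → [0]
```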

In [None]:
# the maximum number of candidate antecedents we will give to each of the candidate mentions.
MAX_ANT = 250

# the ratio of negative to positive examples
NEG_RATIO = 2



Study the function below and see the sample outputs.

In [None]:
def generate_pairs(num_mentions, cluster_ids, starts, ends, raw_starts, raw_ends, is_training, neg_ratio=NEG_RATIO, max_ant=MAX_ANT):
    mention_pairs = [[]]
    mention_pair_labels = [[]]
    raw_mention_pairs = []

    # for the training set, we want labels. We also want to pay heed to the positive:negative example ratio
    if is_training:
        for ana in range(num_mentions):
            pos = 1
            # each anaphor must not have more than MAX_ANT candidate antecedents
            s = 0 if ana < max_ant else (ana - max_ant)
            for ant in range(s, ana):
                # two mentions are coreferent if they are in the same cluster
                l = cluster_ids[ana] == cluster_ids[ant]
                # if it's a positive example, add it
                if l:
                    pos += neg_ratio
                    mention_pairs[0].append([starts[ana],ends[ana],starts[ant],ends[ant]])
                    mention_pair_labels[0].append(1)
                # if it's a negative example, check that we don't already have twice as 
                # many negative examples as positive ones before adding it
                elif pos > 0:
                    pos -=1
                    mention_pairs[0].append([starts[ana],ends[ana],starts[ant],ends[ant]])
                    mention_pair_labels[0].append(0)

    # for the test set, add the pairs without balancing or labels
    else:
        for ana in range(num_mentions):
            s = 0 if ana < max_ant else (ana - max_ant)
            for ant in range(s,ana):
                mention_pairs[0].append([starts[ana], ends[ana], starts[ant], ends[ant]])
                # here we also add the original mention indices for unpadded evaluation.
                raw_mention_pairs.append([(raw_starts[ana], raw_ends[ana]), (raw_starts[ant], raw_ends[ant])])
    

    return mention_pairs, mention_pair_labels, raw_mention_pairs

In [None]:
# A sample for training: at most 4 candidate antecedents per mention and a 1:1 negative:positive example ratio (neg_ratio=1). No need to save the raw starts/ends
dmpairs, dpair_labels, draw_pairs = generate_pairs(dnum_mentions, dcluster_ids, dstarts, dends, None, None, True, 1, 4)

from tabulate import tabulate
print(tabulate(zip(dmpairs[0], dpair_labels[0]), headers=['Ana_Ant pair', 'Pair label (padded)', ]))

Ana_Ant pair        Pair label (padded)
----------------  ---------------------
[12, 12, 0, 2]                        1
[17, 17, 0, 2]                        1
[17, 17, 12, 12]                      1
[24, 26, 0, 2]                        0
[36, 37, 0, 2]                        1
[36, 37, 12, 12]                      1
[36, 37, 17, 17]                      1
[36, 37, 24, 26]                      0
[39, 39, 12, 12]                      0
[39, 39, 24, 26]                      1
[39, 39, 36, 37]                      0


In [None]:
# A sample for evaluation. No labels necessary. Here we pair each mention with all its antecedents
draw_starts, draw_ends = zip(*dmentions)
dmpairs, dpair_labels, draw_pairs = generate_pairs(dnum_mentions, dcluster_ids, dstarts, dends, draw_starts, draw_ends, False)

from tabulate import tabulate
print(tabulate(zip(draw_pairs, dmpairs[0]), headers=['Ana_Ant pair (unpadded)', 'Ana_Ant pair (padded)', ]))

Ana_Ant pair (unpadded)    Ana_Ant pair (padded)
-------------------------  -----------------------
[(5, 5), (0, 2)]           [12, 12, 0, 2]
[(10, 10), (0, 2)]         [17, 17, 0, 2]
[(10, 10), (5, 5)]         [17, 17, 12, 12]
[(17, 19), (0, 2)]         [24, 26, 0, 2]
[(17, 19), (5, 5)]         [24, 26, 12, 12]
[(17, 19), (10, 10)]       [24, 26, 17, 17]
[(23, 24), (0, 2)]         [36, 37, 0, 2]
[(23, 24), (5, 5)]         [36, 37, 12, 12]
[(23, 24), (10, 10)]       [36, 37, 17, 17]
[(23, 24), (17, 19)]       [36, 37, 24, 26]
[(26, 26), (0, 2)]         [39, 39, 0, 2]
[(26, 26), (5, 5)]         [39, 39, 12, 12]
[(26, 26), (10, 10)]       [39, 39, 17, 17]
[(26, 26), (17, 19)]       [39, 39, 24, 26]
[(26, 26), (23, 24)]       [39, 39, 36, 37]


### **4.4. Preprocessing and loading the dataset**

Now, you will prepare the dataset for the coreference resolution model. Preprocessing is an important step and depends on the target language. For Arabic, removing diacritics (marks written above or below certain letters) may improve the overall performance.
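
As a small illustration of what diacritic removal does, the sketch below strips the marks from a single vocalised word using the same character ranges as this lab's preprocessing. The example word and the names `DIACRITICS`, `with_marks` and `without_marks` are assumptions chosen for the illustration. Because a dictionary lookup is an exact string match, the vocalised and the bare form of a word are different keys, which is one reason stripping the marks can help a token match an entry in the embedding file.

```python
import re

# the same character ranges as used in this lab's preprocessing
DIACRITICS = re.compile(r'[\u0617-\u061A\u064B-\u0652]')

# an Arabic word written with diacritics (damma, fatha, shadda + fatha);
# escape sequences are used so that every mark is visible in the source
with_marks = "\u0645\u064F\u062D\u064E\u0645\u0651\u064E\u062F"
without_marks = DIACRITICS.sub("", with_marks)

print(len(with_marks), len(without_marks))          # → 8 4
print(without_marks == "\u0645\u062D\u0645\u062F")  # → True
```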

In [None]:
import re, json

def preprocess_arabic_text(text):
  # match the diacritic code points with a regular expression
  diacritics_unicode = re.compile(r'[\u0617-\u061A\u064B-\u0652]')
  # remove the diacritics
  text = re.sub(diacritics_unicode, "", text)
  return text

def get_data(json_file, is_training, preprocess_text):
    processed_docs = []

    for line in open(json_file):

      # read the document in
      doc = json.loads(line)
      
      # get the gold coreference clusters and the tokenized sentences
      clusters = doc['clusters']
      sentences = doc['sentences']

      # optionally strip diacritics; reassign `sentences` so that the cleaned
      # tokens are the ones embedded below
      if preprocess_text:
          sentences = [[preprocess_arabic_text(t) for t in sent] for sent in sentences]
          doc['sentences'] = sentences

      # skip documents that contain no coreferent mentions
      if len(clusters) == 0:
          continue

      #  get the mentions and their cluster information.
      gold_mentions, gold_mention_map, cluster_ids, num_mentions = get_mentions(clusters) # TASK 1.1 YOUR CODE HERE

      # splits the mentions into two arrays, one representing the start indices, 
      # and the other for the end indices
      raw_starts, raw_ends = zip(*gold_mentions)

      # pad sentences, create the word embeddings of each sentence, create mention starts and ends for padded document
      word_emb, starts, ends = tensorize_doc_sentences(sentences, gold_mentions) # TASK 1.2 YOUR CODE HERE

      # generate (anaphor, antecedent) pairs and their labels
      mention_pairs, mention_pair_labels, raw_mention_pairs = generate_pairs(num_mentions, cluster_ids, starts, ends, raw_starts, raw_ends, is_training) # TASK 1.3 YOUR CODE HERE
      mention_pairs, mention_pair_labels = np.array(mention_pairs),np.array(mention_pair_labels)

      # add the processed document to the list
      processed_docs.append((word_emb, mention_pairs, mention_pair_labels, clusters, raw_mention_pairs))

    return processed_docs

In [None]:
# get_data(json_file, is_training, preprocess_text) receives three inputs:
# json_file (str): the path to the JSON file
# is_training (boolean): passed to generate_pairs(...) to balance the generated pairs and create their labels
# preprocess_text (boolean): whether to preprocess the text or not

DEV_DATA = get_data(DEV_PATH, False, True)
TEST_DATA = get_data(TEST_PATH, False, True)
TRAIN_DATA = get_data(TRAIN_PATH, True, True)

##**5. Building the Coreference Model**

In this section, we will build the coreference resolution model. There are many ways to learn coreference; in this lab, we will build a simplified version of a mention pair classification model. 

<br>

Given a pair of mentions, (anaphor, antecedent), a mention pair classifier produces a single score between 0 and 1, representing the probability that the given pair is coreferent. We will use keras to take in the processed data we prepared in section 4 and produce mention pair scores for the given pairs.

###**5.1 First, we will initialize model parameters.**

In [None]:
# the dimension of the pretrained embeddings
EMBEDDING_SIZE = 300

# dropout rate for word embeddings
EMBEDDING_DROPOUT_RATE = 0.5

# the size of the hidden layers, for both the LSTM and the feedforward NN
HIDDEN_SIZE = 50

# the number of hidden layers used for the feedforward NN
NUM_FFN_LAYER = 2

# the dropout rate for the hidden layers of LSTM and feedforward NN
HIDDEN_DROPOUT_RATE = 0.2


###**5.2 Building the model**

In the next cell, you will complete the build function by completing the steps that follow.

1. **Initialize the model inputs**

(a.) Using the keras `Input` layer (already imported), initialize the two input layers
 `word_embeddings` and `mention_pairs`.
 
**Hints:**
*  The dimension of the `word_embeddings` input is `(batch_size X num_sents X num_words X embedding_size)`, where batch_size is the number of inputs at each time step. Our batch size is 1 document.
*  As always, we do not need to specify the batch_size to the model (i.e. the first dimension is not included in the `shape` parameter of `Input()`).
*  The shape parameter is thus a tuple of 3 elements `(num_sents, num_words, embedding_size)`. The first two vary across documents, so set them to `None`; you only need to specify the third (last) element of the tuple.
*  The dimension of the `mention_pairs` input is `(batch_size X num_mention_pairs X 4)`. Because it holds token indices, give it an integer dtype.

    A line of code has been written for you to squeeze the word_embeddings after they have been created to remove the document dimension as the LSTMs only take 3 dimensional inputs

    `word_embeddings_no_batch = Lambda(lambda x: K.squeeze(x,0))(word_embeddings)`

    <br>

(b.) Apply `EMBEDDING_DROPOUT_RATE` dropout to this no-batch word embeddings

<br>

2. **Encode the document using Bidirectional LSTMs**


For this task we will be working near the beginning of the `build_model()` function. The task is to create a bidirectional LSTM to encode the sentences from both directions, which provides context information to the coreference system.

**Hints:**
* You'll need:
    * The dropout rate of the hidden layers: `HIDDEN_DROPOUT_RATE`
    * The size of the LSTM hidden layers: `HIDDEN_SIZE`

* You need to create a two-layer bidirectional LSTM (BiLSTM) by stacking two LSTM() layers, each wrapped in a Bidirectional() layer. The BiLSTM needs to return the output for all the tokens in the sentences, not just the final one. The output of the BiLSTM should be called word_output. Here is an example of how to create a BiLSTM using keras: https://keras.io/examples/imdb_bidirectional_lstm/.

* In order to return the output for all the tokens in the sentences, i.e. a `(num_sents X num_words X 2*HIDDEN_SIZE)` tensor (the forward and backward outputs are concatenated), you will need to set the return_sequences argument to True. The default setting is to return only the final output of the LSTM. And don’t forget to apply the recurrent_dropout.

<br>

We then flatten the output of the LSTMs to get a `(num_sents*num_words X 2*HIDDEN_SIZE)` tensor using the Lambda function. This will help us gather the right indices for the mention pairs. We also apply further dropout.

`flatten_word_output = Lambda(lambda x:K.reshape(x, [-1, 2 * HIDDEN_SIZE]))(word_output)`

`flatten_word_output = Dropout(HIDDEN_DROPOUT_RATE)(flatten_word_output)`


<br>

Then, we get the mention pair representations by first collecting the learned embeddings of the words at [anaphor_start, anaphor_end, antecedent_start, antecedent_end] for each pair. We retain the document dimension (i.e. batch_size) for this input.

`mention_pair_emb = Lambda(lambda x: K.gather(x[0], x[1]))([flatten_word_output, mention_pairs])`

We then concatenate the four embeddings so that each mention pair is represented by a single vector of size `4 X 2*HIDDEN_SIZE = 8*HIDDEN_SIZE`.

`ffnn_input = Reshape((-1,8*HIDDEN_SIZE))(mention_pair_emb)`

`ffnn_input` is thus a `batch_size X num_mention_pairs X 400` tensor
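
The shape bookkeeping of the gather-and-reshape step can be checked outside Keras. The sketch below uses NumPy integer-array indexing as a stand-in for `K.gather` and `Reshape` (an assumption made for illustration; the model itself uses the Keras layers), with random numbers in place of the BiLSTM outputs. The names `flat_output`, `pairs`, `gathered` and `ffnn_in` are illustrative.

```python
import numpy as np

HIDDEN = 50                      # same value as HIDDEN_SIZE in this lab
num_sents, num_words = 4, 12     # the padded dummy document

# stand-in for the flattened BiLSTM output: one 2*HIDDEN vector per token slot
rng = np.random.default_rng(0)
flat_output = rng.normal(size=(num_sents * num_words, 2 * HIDDEN))

# three padded mention pairs: [ana_start, ana_end, ant_start, ant_end]
pairs = np.array([[[12, 12, 0, 2],
                   [17, 17, 0, 2],
                   [17, 17, 12, 12]]])          # shape (1, 3, 4)

gathered = flat_output[pairs]                   # shape (1, 3, 4, 100)
ffnn_in = gathered.reshape(1, -1, 8 * HIDDEN)   # shape (1, 3, 400)

print(gathered.shape, ffnn_in.shape)
```

Each row of `ffnn_in` is simply the four token vectors laid end to end, so its first `2 * HIDDEN` entries are the vector of the anaphor's first token.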

<br>

3. **Create a multilayer feed-forward neural network to compute the mention-pair scores.**

Then you are required to create a FFNN that contains 2 hidden layers and an output layer. The outputs of the FFNN are the mention_pair_scores. Here are some requirements:

* The hidden layers need to have a size of `HIDDEN_SIZE`.
* You need to apply dropout after each of the hidden layers (but not the output layer).

**Hint:** 

Each hidden layer of the FFNN is a simple Dense() layer with a ReLU activation function. Layers are simply stacked together: the output of the previous layer is the input for the next layer. To apply the dropout you can simply use the Dropout() layer. The output layer is slightly different, since it has an output size of 1. Also, in order to compute the binary cross-entropy loss, we need to give this final layer a sigmoid activation function.

After computing the mention_pair_scores you will need to remove its last dimension, since the last dimension is always 1. But be careful: this time we need to retain the batch dimension (for computing the loss and training accuracy). Again you can use the K.squeeze() method wrapped in a Lambda layer.
The final output `mention_pair_scores` should be a `batch_size X num_mention_pairs` tensor.


In [None]:
def build_model():
    # 1 (a.) Initialize the model inputs
    word_embeddings = Input((None,None,EMBEDDING_SIZE)) # YOUR CODE HERE
    mention_pairs = Input((None,4), dtype="int32") # TASK 2.1a YOUR CODE HERE
    # squeeze the (batch_size X num_sents X num_words X embedding_size) into a 
    # (num_sents X num_words X embedding_size) tensor
    word_embeddings_no_batch = Lambda(lambda x: K.squeeze(x,0))(word_embeddings)

    # 1 (b.). Apply embedding dropout to the squeezed embeddings.
    word_embeddings_dropped = Dropout(EMBEDDING_DROPOUT_RATE)(word_embeddings_no_batch) # TASK 2.1b YOUR CODE HERE

    # TASK 2.2. YOU CREATE A TWO LAYER BIDIRECTIONAL LSTM
    word_output = Bidirectional(LSTM(HIDDEN_SIZE, recurrent_dropout=HIDDEN_DROPOUT_RATE, return_sequences=True))(word_embeddings_dropped)
    word_output = Bidirectional(LSTM(HIDDEN_SIZE, recurrent_dropout=HIDDEN_DROPOUT_RATE, return_sequences=True))(word_output)

    # flattening the lstms output and apply dropout.
    flatten_word_output = Lambda(lambda x:K.reshape(x, [-1, 2 * HIDDEN_SIZE]))(word_output)
    flatten_word_output = Dropout(HIDDEN_DROPOUT_RATE)(flatten_word_output)

    # we gather the embeddings represented by [anaphor_start, anaphor_end, antecedent_start, antecedent_end] for each pair.
    mention_pair_emb = Lambda(lambda x: K.gather(x[0], x[1]))([flatten_word_output, mention_pairs])

    # we flatten them such that each mention_pair is represented by a 400D tensor.
    ffnn_input = Reshape((-1,8*HIDDEN_SIZE))(mention_pair_emb)

    # TASK 2.3. CREATE THE MULTILAYER PERCEPTRONS THEN SQUEEZE OUT THE LAST DIMENSION USING LAMBDA 
    ffnn = Dense(HIDDEN_SIZE, activation="relu")(ffnn_input)
    ffnn = Dropout(HIDDEN_DROPOUT_RATE)(ffnn)
    ffnn = Dense(HIDDEN_SIZE, activation="relu")(ffnn)
    ffnn = Dropout(HIDDEN_DROPOUT_RATE)(ffnn)

    ffnn_out = Dense(1, activation="sigmoid")(ffnn)

    mention_pair_scores = Lambda(lambda x: K.squeeze(x, -1))(ffnn_out)

    model = Model(inputs=[word_embeddings,mention_pairs], outputs=mention_pair_scores)
    model.compile(optimizer='adam',loss='binary_crossentropy')
    model.summary()
    return model

In [None]:
model = build_model()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_1 (InputLayer)           [(None, None, None,  0           []                               
                                 300)]                                                            
                                                                                                  
 lambda (Lambda)                (None, None, 300)    0           ['input_1[0][0]']                
                                                                                                  
 dropout (Dropout)              (None, None, 300)    0           ['lambda[0][0]']                 
                                                                                                  
 bidirectional (Bidirectional)  (None, None, 100)    140400      ['dropout[0][0]']            

##**6. Coreference Resolution Evaluation**

Coreference resolution models are not evaluated using regular accuracy or F1, as one would evaluate a text classification model. Rather, using the pairwise scores produced by the system, we build coreference clusters. These clusters are
then evaluated using the CoNLL score (https://www.aclweb.org/anthology/W12-4501/).
In this section, we write the functions that build such clusters.

###**6.1 Getting the Predicted clusters**

First, we will write a function that takes a list of mention pairs and produces two variables:

1. `predicted_clusters`: a list of tuples. Each tuple is a cluster, i.e. its elements are the mentions predicted to belong to that cluster, where each mention is a (start_index, end_index) tuple.

2. `mention_to_predicted`: a dictionary whose keys are mentions and whose values are the predicted clusters containing those mentions

The input to the function `mention_pairs` is the list of predicted mention pairs
[[(anaphor_start, anaphor_end), (antecedent_start, antecedent_end)], ...] similar to draw_pairs in section 4.
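
To see how pairwise links turn into clusters, here is a condensed, self-contained sketch of the linking idea on three links from the dummy document. `link_into_clusters` and `toy_pairs` are hypothetical names used only for this illustration. The sketch assumes the pairs arrive in document order of their anaphors, so that a mention is always seen as an anaphor before it is used as an antecedent; with a different order a mention could end up in two clusters.

```python
def link_into_clusters(pairs):
    """Group mentions into clusters from (anaphor, antecedent) links."""
    cluster_of = {}   # mention -> index of its cluster
    clusters = []     # list of clusters, each a list of mentions
    for anaphor, antecedent in pairs:
        if antecedent not in cluster_of:
            # first time we see this antecedent: open a new cluster for it
            cluster_of[antecedent] = len(clusters)
            clusters.append([antecedent])
        # the anaphor joins the cluster of its antecedent
        cluster_of[anaphor] = cluster_of[antecedent]
        clusters[cluster_of[anaphor]].append(anaphor)
    return [tuple(c) for c in clusters]

# "He" -> "The large cat", "he" -> "He", "him" -> "An unfortunate rat"
toy_pairs = [[(5, 5), (0, 2)], [(10, 10), (5, 5)], [(26, 26), (17, 19)]]
toy_clusters = link_into_clusters(toy_pairs)
print(toy_clusters)
# → [((0, 2), (5, 5), (10, 10)), ((17, 19), (26, 26))]
```

Note that "he" ends up in the same cluster as "The large cat" even though it was only linked to "He": links are followed transitively.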

In [None]:
def get_predicted_clusters(mention_pairs):
    mention_to_predicted = {}
    predicted_clusters = []
    
    # for each mention and its predicted antecedent
    for anaphora, predicted_antecedent in mention_pairs:
        # if the predicted antecedent has been processed before as an anaphor
        if predicted_antecedent in mention_to_predicted:
            # then the predicted cluster for the anaphor is the same as the one for its predicted antecedent
            predicted_cluster = mention_to_predicted[predicted_antecedent]
        # otherwise, 
        else:
            # create a new cluster, with the antecedent as the first mention in that cluster
            predicted_cluster = len(predicted_clusters) # the cluster number (its order in the list of clusters)
            predicted_clusters.append([predicted_antecedent])
            mention_to_predicted[predicted_antecedent] = predicted_cluster

        # now we know the right cluster for the anaphor, add it to that cluster
        predicted_clusters[predicted_cluster].append(anaphora)
        mention_to_predicted[anaphora] = predicted_cluster

    # make each cluster list a tuple. Lists cannot be dictionary keys because they are mutable (unhashable); tuples can.
    predicted_clusters = [tuple(pc) for pc in predicted_clusters]
    # get the {mention: complete cluster} map for the predictions.
    mention_to_predicted = {m: predicted_clusters[i] for m, i in mention_to_predicted.items()}

    return predicted_clusters, mention_to_predicted

###**6.2 Coreference evaluation for a given document**

In this subsection you will complete the `evaluate_coref()` function for coref evaluation on a single document. 

<br>

The `evaluate_coref()` function takes 3 parameters:
* `predicted_mention_pairs`: the list of predicted mention pairs
* `gold_clusters`: the gold clusters from the original document
* `evaluator`: a reference to an instance of metrics.CorefEvaluator()

You will use the first 2 parameters to create variables to run the `evaluator.update()` method.

<br>

The `evaluator.update()` method takes 4 parameters:
* `predicted_clusters`: from 6.1
*  `gold_clusters`: the gold clusters from the original document, each cluster transformed from a list to a tuple.
* `mention_to_predicted`: from 6.1
* `mention_to_gold`: the gold equivalent of  `mention_to_predicted`
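
To illustrate the shape of `mention_to_gold`, the sketch below builds it for the gold clusters of the dummy document with plain comprehensions. This is one possible way to construct the map, shown under the assumption that clusters keep the order in which they appear in the JSON; the names `toy_gold`, `gold_as_tuples` and `toy_mention_to_gold` are illustrative.

```python
# gold clusters of the dummy document, as read from JSON (lists of lists)
toy_gold = [[[0, 2], [5, 5], [10, 10], [23, 24]], [[17, 19], [26, 26]]]

# lists are unhashable, so convert every mention and every cluster to a tuple
gold_as_tuples = [tuple(tuple(m) for m in cluster) for cluster in toy_gold]

# every mention points to the complete cluster that contains it
toy_mention_to_gold = {m: cluster for cluster in gold_as_tuples for m in cluster}

print(toy_mention_to_gold[(5, 5)])
# → ((0, 2), (5, 5), (10, 10), (23, 24))
```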

<br>

Some of the code has been written for you. Complete the code below to generate the rest of it.

In [None]:
def evaluate_coref(predicted_mention_pairs, gold_clusters, evaluator):

    # get_mentions() returns (sorted mentions, {mention: index}, cluster ids, number of mentions)
    _, gold_mention_map, cluster_ids, _ = get_mentions(gold_clusters)

    # group the gold mentions (as tuples) by the id of the cluster they belong to
    mentions_by_cluster = {}
    for mention, index in gold_mention_map.items():
        mentions_by_cluster.setdefault(cluster_ids[index], []).append(mention)

    # turn each cluster in the list of gold clusters into a tuple (rather than a list)
    gold_clusters = tuple(tuple(v) for v in mentions_by_cluster.values()) # TASK 3.1 CODE HERE

    # mention_to_gold is a {mention: cluster it belongs to, including the mention itself} map
    # TASK 3.2 WRITE CODE HERE TO GENERATE mention_to_gold from gold_clusters
    mention_to_gold = {}
    for cluster in gold_clusters:
        for mention in cluster:
            mention_to_gold[mention] = cluster

    # get the predicted_clusters and mention_to_predict using get_predicted_clusters()
    predicted_clusters, mention_to_predicted = get_predicted_clusters(predicted_mention_pairs) # TASK 3.3 CODE HERE

    # run the evaluator using the parameters you've gotten
    evaluator.update(predicted_clusters, gold_clusters, mention_to_predicted, mention_to_gold)

###**6.3 Evaluating the model on all the data**

In [None]:
def eval(model, eval_docs):
    coref_evaluator = metrics.CorefEvaluator()

    for word_embeddings, mention_pairs, _, gold_clusters, raw_mention_pairs in eval_docs:

        # get the mention pair scores from the model
        mention_pair_scores = model.predict_on_batch([word_embeddings, mention_pairs])

        predicted_antecedents = {}
        best_antecedent_scores = {}
        # for a given anaphor 
        for (ana, ant), score in zip(raw_mention_pairs, mention_pair_scores[0]):
            # only candidate antecedents with (ana, ante) above 0.5 are considered as valid system proposed candidates
            if score >= 0.5 and score > best_antecedent_scores.get(ana,0):
                # we chose the best among these to be the predicted antecedent for that anaphor
                predicted_antecedents[ana] = ant
                best_antecedent_scores[ana] = score

        # getting the [anaphor, antecedent] pairs.
        predicted_mention_pairs = [[k,v] for k,v in predicted_antecedents.items()]

        # evaluate the predicted mention pairs 
        evaluate_coref(predicted_mention_pairs, gold_clusters, coref_evaluator)

    # after evaluating each document, get the CoNLL precision, recall and F1
    p, r, f = coref_evaluator.get_prf()
    print("Average F1 (py): {:.2f}%".format(f * 100))
    print("Average precision (py): {:.2f}%".format(p * 100))
    print("Average recall (py): {:.2f}%".format(r * 100))
    return p,r,f
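
The antecedent selection inside the loop can be tried in isolation. The sketch below is a simplified stand-in, not the lab's code: `pick_antecedents`, the string mention ids and the scores are all made up for illustration. It keeps, for each anaphor, the highest-scoring candidate whose score is at least 0.5.

```python
def pick_antecedents(pairs, scores, threshold=0.5):
    """Keep, for each anaphor, the best antecedent scoring at or above the threshold."""
    best = {}  # anaphor -> (antecedent, score)
    for (ana, ant), score in zip(pairs, scores):
        # A candidate must pass the threshold and beat the current best for this anaphor.
        if score >= threshold and score > best.get(ana, (None, 0.0))[1]:
            best[ana] = (ant, score)
    return {ana: ant for ana, (ant, _) in best.items()}


# Hypothetical (anaphor, antecedent) pairs and model scores.
toy_pairs = [("m3", "m1"), ("m3", "m2"), ("m4", "m1"), ("m4", "m3")]
toy_scores = [0.62, 0.91, 0.30, 0.45]

print(pick_antecedents(toy_pairs, toy_scores))
# {'m3': 'm2'}
```

Here `m4` gets no antecedent because none of its candidates reaches the threshold, so it would be left out of the predicted mention pairs.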

##**7. Training and Evaluating the Coreference Model**

In [None]:
def time_used(start_time):
    curr_time = time.time()
    used_time = curr_time - start_time
    m = used_time // 60
    s = used_time - 60 * m
    return "%d m %d s" % (m, s)

In [None]:
def batch_generator(processed_data):
    random.shuffle(processed_data)
    for word_embeddings, mention_pairs, mention_pair_labels, _, _ in processed_data:
      yield [word_embeddings, mention_pairs], mention_pair_labels
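
A Python generator can only be consumed once, which is presumably why a fresh `batch_generator(...)` is created for every epoch. The sketch below shows this with a simplified stand-in; `toy_generator` and the toy data are hypothetical.

```python
import random


def toy_generator(data):
    """Shuffle the data in place, then yield (inputs, labels) one document at a time."""
    random.shuffle(data)
    for inputs, labels in data:
        yield inputs, labels


toy_data = [("doc_a", 1), ("doc_b", 0), ("doc_c", 1)]

gen = toy_generator(toy_data)
first_pass = list(gen)   # consumes all three items
second_pass = list(gen)  # the generator is now exhausted

print(len(first_pass), len(second_pass))
# 3 0
```

Because `random.shuffle` works in place, the caller's list is reordered as a side effect each time a new generator starts.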

In [None]:
def train(epochs, report=False):
    start_time = time.time()
    for epoch in range(epochs):
        print("\nStarting training epoch {}/{}".format(epoch + 1, epochs))
        epoch_time = time.time()

        model.fit(batch_generator(TRAIN_DATA), steps_per_epoch=2775)

        print("Time used for epoch {}: {}".format(epoch + 1, time_used(epoch_time)))
        dev_time = time.time()
        print("Evaluating on dev set after epoch {}/{}:".format(epoch + 1, epochs))
        eval(model, DEV_DATA)
        print("Time used for evaluate on dev set: {}".format(time_used(dev_time)))

    print("\nTraining finished!")
    print("Time used for training: {}".format(time_used(start_time)))

    print("\nEvaluating on test set:")
    test_time = time.time()
    p, r, f = eval(model, TEST_DATA)
    print("Time used for evaluate on test set: {}".format(time_used(test_time)))
    if report:
        return p, r, f

In [None]:
import warnings
warnings.filterwarnings('ignore')

# train the model for 10 epochs
train(10)


Starting training epoch 1/10
  20/2775 [..............................] - ETA: 12:33 - loss: 0.6409



Time used for epoch 1: 0 m 23 s
Evaluating on dev set after epoch 1/10:
Average F1 (py): 0.00%
Average precision (py): 0.00%
Average recall (py): 0.00%
Time used for evaluate on dev set: 0 m 6 s

Starting training epoch 2/10
  20/2775 [..............................] - ETA: 12:14 - loss: 0.5732



Time used for epoch 2: 0 m 10 s
Evaluating on dev set after epoch 2/10:
Average F1 (py): 29.50%
Average precision (py): 34.25%
Average recall (py): 64.15%
Time used for evaluate on dev set: 0 m 2 s

Starting training epoch 3/10
  20/2775 [..............................] - ETA: 12:41 - loss: 0.5148



Time used for epoch 3: 0 m 10 s
Evaluating on dev set after epoch 3/10:
Average F1 (py): 29.68%
Average precision (py): 33.76%
Average recall (py): 65.56%
Time used for evaluate on dev set: 0 m 2 s

Starting training epoch 4/10
  20/2775 [..............................] - ETA: 12:17 - loss: 0.4758



Time used for epoch 4: 0 m 5 s
Evaluating on dev set after epoch 4/10:
Average F1 (py): 29.63%
Average precision (py): 33.36%
Average recall (py): 65.26%
Time used for evaluate on dev set: 0 m 3 s

Starting training epoch 5/10
  20/2775 [..............................] - ETA: 16:31 - loss: 0.4479



Time used for epoch 5: 0 m 7 s
Evaluating on dev set after epoch 5/10:
Average F1 (py): 31.24%
Average precision (py): 37.12%
Average recall (py): 58.84%
Time used for evaluate on dev set: 0 m 2 s

Starting training epoch 6/10
  20/2775 [..............................] - ETA: 13:01 - loss: 0.4396



Time used for epoch 6: 0 m 10 s
Evaluating on dev set after epoch 6/10:
Average F1 (py): 35.73%
Average precision (py): 45.46%
Average recall (py): 48.68%
Time used for evaluate on dev set: 0 m 2 s

Starting training epoch 7/10
  20/2775 [..............................] - ETA: 12:17 - loss: 0.4204



Time used for epoch 7: 0 m 10 s
Evaluating on dev set after epoch 7/10:
Average F1 (py): 35.13%
Average precision (py): 41.83%
Average recall (py): 50.80%
Time used for evaluate on dev set: 0 m 2 s

Starting training epoch 8/10
  20/2775 [..............................] - ETA: 12:22 - loss: 0.4177



Time used for epoch 8: 0 m 10 s
Evaluating on dev set after epoch 8/10:
Average F1 (py): 37.64%
Average precision (py): 48.57%
Average recall (py): 48.40%
Time used for evaluate on dev set: 0 m 2 s

Starting training epoch 9/10
  20/2775 [..............................] - ETA: 12:08 - loss: 0.4121



Time used for epoch 9: 0 m 10 s
Evaluating on dev set after epoch 9/10:
Average F1 (py): 35.46%
Average precision (py): 41.52%
Average recall (py): 52.75%
Time used for evaluate on dev set: 0 m 2 s

Starting training epoch 10/10
  20/2775 [..............................] - ETA: 12:16 - loss: 0.4002



Time used for epoch 10: 0 m 5 s
Evaluating on dev set after epoch 10/10:
Average F1 (py): 32.22%
Average precision (py): 36.04%
Average recall (py): 58.00%
Time used for evaluate on dev set: 0 m 3 s

Training finished!
Time used for training: 2 m 12 s

Evaluating on test set:
Average F1 (py): 30.01%
Average precision (py): 33.82%
Average recall (py): 55.78%
Time used for evaluate on test set: 0 m 6 s


##**8. Questions:**



*   Would the performance decrease if we do not preprocess the text? If yes (or no), then why?
*   Experiment with different values for max antecedent (MAX_ANT) and negative ratio (NEG_RATIO), what do you observe?
*   How would you improve the accuracy? 
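
For the second question, one way to compare runs is to record each run's scores as a dict and sort by F1. The sketch below assumes the keys `MAX_ANT`, `NEG_RATIO`, `precision`, `recall` and `f1`; the helper `rank_runs` is hypothetical and the numbers are placeholders, not results from this lab.

```python
def rank_runs(results, key="f1"):
    """Return the runs sorted from best to worst by the given score."""
    return sorted(results, key=lambda run: run[key], reverse=True)


# Placeholder scores for illustration only, NOT real results.
example_results = [
    {"MAX_ANT": 200, "NEG_RATIO": 1, "precision": 0.40, "recall": 0.50, "f1": 0.30},
    {"MAX_ANT": 200, "NEG_RATIO": 2, "precision": 0.45, "recall": 0.48, "f1": 0.35},
    {"MAX_ANT": 250, "NEG_RATIO": 1, "precision": 0.38, "recall": 0.55, "f1": 0.32},
]

for run in rank_runs(example_results):
    print("MAX_ANT={MAX_ANT:<4} NEG_RATIO={NEG_RATIO}  "
          "P={precision:.2f} R={recall:.2f} F1={f1:.2f}".format(**run))
```

Since scores vary between runs of the same configuration, small differences in F1 are better read as noise than as an effect of the setting.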




In [None]:
MAX_ANTs = [200, 250, 300]
NEG_RATIOs = [1, 2, 3]

results = []

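for MAX_ANT in MAX_ANTs:
  for NEG_RATIO in NEG_RATIOs:
    # The loop variables rebind the global MAX_ANT and NEG_RATIO directly.
    # NOTE: if TRAIN_DATA was built with the earlier values, rebuild it here so the new settings take effect.
    model = build_model()
    p, r, f = train(10, report=True)
    results.append({"MAX_ANT": MAX_ANT, "NEG_RATIO": NEG_RATIO,
                    "precision": p, "recall": r, "f1": f})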

Model: "model_1"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_3 (InputLayer)           [(None, None, None,  0           []                               
                                 300)]                                                            
                                                                                                  
 lambda_4 (Lambda)              (None, None, 300)    0           ['input_3[0][0]']                
                                                                                                  
 dropout_4 (Dropout)            (None, None, 300)    0           ['lambda_4[0][0]']               
                                                                                                  
 bidirectional_2 (Bidirectional  (None, None, 100)   140400      ['dropout_4[0][0]']        



Time used for epoch 1: 0 m 23 s
Evaluating on dev set after epoch 1/10:
Average F1 (py): 29.11%
Average precision (py): 33.77%
Average recall (py): 58.45%
Time used for evaluate on dev set: 0 m 5 s

Starting training epoch 2/10
  20/2775 [..............................] - ETA: 12:29 - loss: 0.5565



Time used for epoch 2: 0 m 10 s
Evaluating on dev set after epoch 2/10:
Average F1 (py): 29.60%
Average precision (py): 34.20%
Average recall (py): 65.49%
Time used for evaluate on dev set: 0 m 2 s

Starting training epoch 3/10
  20/2775 [..............................] - ETA: 12:15 - loss: 0.5069



Time used for epoch 3: 0 m 10 s
Evaluating on dev set after epoch 3/10:
Average F1 (py): 29.66%
Average precision (py): 33.61%
Average recall (py): 65.97%
Time used for evaluate on dev set: 0 m 2 s

Starting training epoch 4/10
  20/2775 [..............................] - ETA: 12:10 - loss: 0.4733



Time used for epoch 4: 0 m 10 s
Evaluating on dev set after epoch 4/10:
Average F1 (py): 29.73%
Average precision (py): 33.49%
Average recall (py): 65.43%
Time used for evaluate on dev set: 0 m 3 s

Starting training epoch 5/10
  20/2775 [..............................] - ETA: 12:30 - loss: 0.4541



Time used for epoch 5: 0 m 10 s
Evaluating on dev set after epoch 5/10:
Average F1 (py): 29.51%
Average precision (py): 33.92%
Average recall (py): 65.76%
Time used for evaluate on dev set: 0 m 3 s

Starting training epoch 6/10
  20/2775 [..............................] - ETA: 12:46 - loss: 0.4331



Time used for epoch 6: 0 m 5 s
Evaluating on dev set after epoch 6/10:
Average F1 (py): 30.34%
Average precision (py): 35.87%
Average recall (py): 63.85%
Time used for evaluate on dev set: 0 m 2 s

Starting training epoch 7/10
  20/2775 [..............................] - ETA: 18:42 - loss: 0.4236



Time used for epoch 7: 0 m 10 s
Evaluating on dev set after epoch 7/10:
Average F1 (py): 31.28%
Average precision (py): 36.90%
Average recall (py): 59.22%
Time used for evaluate on dev set: 0 m 2 s

Starting training epoch 8/10
  20/2775 [..............................] - ETA: 14:44 - loss: 0.4159



Time used for epoch 8: 0 m 6 s
Evaluating on dev set after epoch 8/10:
Average F1 (py): 32.70%
Average precision (py): 39.14%
Average recall (py): 56.45%
Time used for evaluate on dev set: 0 m 3 s

Starting training epoch 9/10
  20/2775 [..............................] - ETA: 11:37 - loss: 0.4053



Time used for epoch 9: 0 m 10 s
Evaluating on dev set after epoch 9/10:
Average F1 (py): 35.45%
Average precision (py): 44.96%
Average recall (py): 50.84%
Time used for evaluate on dev set: 0 m 4 s

Starting training epoch 10/10
  20/2775 [..............................] - ETA: 12:20 - loss: 0.4038



Time used for epoch 10: 0 m 5 s
Evaluating on dev set after epoch 10/10:
Average F1 (py): 32.48%
Average precision (py): 38.55%
Average recall (py): 57.00%
Time used for evaluate on dev set: 0 m 2 s

Training finished!
Time used for training: 2 m 15 s

Evaluating on test set:
Average F1 (py): 30.08%
Average precision (py): 35.57%
Average recall (py): 56.19%
Time used for evaluate on test set: 0 m 7 s
Model: "model_2"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_5 (InputLayer)           [(None, None, None,  0           []                               
                                 300)]                                                            
                                                                                                  
 lambda_8 (Lambda)              (None, None, 300)    0           ['input_5[0][0]']  



Time used for epoch 1: 0 m 22 s
Evaluating on dev set after epoch 1/10:
Average F1 (py): 0.00%
Average precision (py): 0.00%
Average recall (py): 0.00%
Time used for evaluate on dev set: 0 m 6 s

Starting training epoch 2/10
  20/2775 [..............................] - ETA: 13:26 - loss: 0.5832



Time used for epoch 2: 0 m 6 s
Evaluating on dev set after epoch 2/10:
Average F1 (py): 29.57%
Average precision (py): 33.71%
Average recall (py): 63.35%
Time used for evaluate on dev set: 0 m 2 s

Starting training epoch 3/10
  20/2775 [..............................] - ETA: 17:33 - loss: 0.5231



Time used for epoch 3: 0 m 10 s
Evaluating on dev set after epoch 3/10:
Average F1 (py): 29.64%
Average precision (py): 34.04%
Average recall (py): 65.94%
Time used for evaluate on dev set: 0 m 2 s

Starting training epoch 4/10
  20/2775 [..............................] - ETA: 12:11 - loss: 0.4824



Time used for epoch 4: 0 m 10 s
Evaluating on dev set after epoch 4/10:
Average F1 (py): 29.72%
Average precision (py): 33.78%
Average recall (py): 65.39%
Time used for evaluate on dev set: 0 m 2 s

Starting training epoch 5/10
  20/2775 [..............................] - ETA: 12:26 - loss: 0.4531



Time used for epoch 5: 0 m 10 s
Evaluating on dev set after epoch 5/10:
Average F1 (py): 29.76%
Average precision (py): 34.02%
Average recall (py): 65.26%
Time used for evaluate on dev set: 0 m 2 s

Starting training epoch 6/10
  20/2775 [..............................] - ETA: 12:12 - loss: 0.4392



Time used for epoch 6: 0 m 5 s
Evaluating on dev set after epoch 6/10:
Average F1 (py): 33.68%
Average precision (py): 41.76%
Average recall (py): 53.92%
Time used for evaluate on dev set: 0 m 3 s

Starting training epoch 7/10
  20/2775 [..............................] - ETA: 14:30 - loss: 0.4242



Time used for epoch 7: 0 m 6 s
Evaluating on dev set after epoch 7/10:
Average F1 (py): 36.37%
Average precision (py): 48.01%
Average recall (py): 46.78%
Time used for evaluate on dev set: 0 m 2 s

Starting training epoch 8/10
  20/2775 [..............................] - ETA: 15:57 - loss: 0.4135



Time used for epoch 8: 0 m 6 s
Evaluating on dev set after epoch 8/10:
Average F1 (py): 34.91%
Average precision (py): 44.73%
Average recall (py): 52.65%
Time used for evaluate on dev set: 0 m 3 s

Starting training epoch 9/10
  20/2775 [..............................] - ETA: 12:08 - loss: 0.4067



Time used for epoch 9: 0 m 10 s
Evaluating on dev set after epoch 9/10:
Average F1 (py): 36.67%
Average precision (py): 48.64%
Average recall (py): 50.27%
Time used for evaluate on dev set: 0 m 3 s

Starting training epoch 10/10
  20/2775 [..............................] - ETA: 12:53 - loss: 0.4014



Time used for epoch 10: 0 m 5 s
Evaluating on dev set after epoch 10/10:
Average F1 (py): 37.78%
Average precision (py): 50.68%
Average recall (py): 48.76%
Time used for evaluate on dev set: 0 m 3 s

Training finished!
Time used for training: 2 m 7 s

Evaluating on test set:
Average F1 (py): 35.86%
Average precision (py): 49.33%
Average recall (py): 46.17%
Time used for evaluate on test set: 0 m 9 s
Model: "model_3"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_7 (InputLayer)           [(None, None, None,  0           []                               
                                 300)]                                                            
                                                                                                  
 lambda_12 (Lambda)             (None, None, 300)    0           ['input_7[0][0]']   



Time used for epoch 1: 0 m 18 s
Evaluating on dev set after epoch 1/10:
Average F1 (py): 0.00%
Average precision (py): 0.00%
Average recall (py): 0.00%
Time used for evaluate on dev set: 0 m 9 s

Starting training epoch 2/10
  20/2775 [..............................] - ETA: 14:26 - loss: 0.5870



Time used for epoch 2: 0 m 6 s
Evaluating on dev set after epoch 2/10:
Average F1 (py): 29.59%
Average precision (py): 34.20%
Average recall (py): 64.80%
Time used for evaluate on dev set: 0 m 2 s

Starting training epoch 3/10
  20/2775 [..............................] - ETA: 20:47 - loss: 0.5276



Time used for epoch 3: 0 m 10 s
Evaluating on dev set after epoch 3/10:
Average F1 (py): 29.65%
Average precision (py): 33.73%
Average recall (py): 65.56%
Time used for evaluate on dev set: 0 m 2 s

Starting training epoch 4/10
  20/2775 [..............................] - ETA: 18:41 - loss: 0.4852



Time used for epoch 4: 0 m 10 s
Evaluating on dev set after epoch 4/10:
Average F1 (py): 29.69%
Average precision (py): 34.13%
Average recall (py): 66.02%
Time used for evaluate on dev set: 0 m 2 s

Starting training epoch 5/10
  20/2775 [..............................] - ETA: 14:59 - loss: 0.4541



Time used for epoch 5: 0 m 10 s
Evaluating on dev set after epoch 5/10:
Average F1 (py): 32.63%
Average precision (py): 41.06%
Average recall (py): 55.30%
Time used for evaluate on dev set: 0 m 2 s

Starting training epoch 6/10
  20/2775 [..............................] - ETA: 14:27 - loss: 0.4431



Time used for epoch 6: 0 m 6 s
Evaluating on dev set after epoch 6/10:
Average F1 (py): 35.47%
Average precision (py): 46.05%
Average recall (py): 51.09%
Time used for evaluate on dev set: 0 m 4 s

Starting training epoch 7/10
  20/2775 [..............................] - ETA: 15:13 - loss: 0.4281



Time used for epoch 7: 0 m 6 s
Evaluating on dev set after epoch 7/10:
Average F1 (py): 36.69%
Average precision (py): 49.00%
Average recall (py): 49.44%
Time used for evaluate on dev set: 0 m 2 s

Starting training epoch 8/10
  20/2775 [..............................] - ETA: 20:21 - loss: 0.4174



Time used for epoch 8: 0 m 8 s
Evaluating on dev set after epoch 8/10:
Average F1 (py): 37.07%
Average precision (py): 49.07%
Average recall (py): 48.01%
Time used for evaluate on dev set: 0 m 2 s

Starting training epoch 9/10
  20/2775 [..............................] - ETA: 13:44 - loss: 0.4153



Time used for epoch 9: 0 m 6 s
Evaluating on dev set after epoch 9/10:
Average F1 (py): 36.17%
Average precision (py): 45.85%
Average recall (py): 49.71%
Time used for evaluate on dev set: 0 m 4 s

Starting training epoch 10/10
  20/2775 [..............................] - ETA: 14:43 - loss: 0.4088



Time used for epoch 10: 0 m 6 s
Evaluating on dev set after epoch 10/10:
Average F1 (py): 35.95%
Average precision (py): 46.25%
Average recall (py): 50.60%
Time used for evaluate on dev set: 0 m 2 s

Training finished!
Time used for training: 2 m 7 s

Evaluating on test set:
Average F1 (py): 34.26%
Average precision (py): 44.56%
Average recall (py): 47.83%
Time used for evaluate on test set: 0 m 7 s
Model: "model_4"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_9 (InputLayer)           [(None, None, None,  0           []                               
                                 300)]                                                            
                                                                                                  
 lambda_16 (Lambda)             (None, None, 300)    0           ['input_9[0][0]']   



Time used for epoch 1: 0 m 22 s
Evaluating on dev set after epoch 1/10:
Average F1 (py): 0.00%
Average precision (py): 0.00%
Average recall (py): 0.00%
Time used for evaluate on dev set: 0 m 6 s

Starting training epoch 2/10
  20/2775 [..............................] - ETA: 12:17 - loss: 0.5857



Time used for epoch 2: 0 m 5 s
Evaluating on dev set after epoch 2/10:
Average F1 (py): 29.35%
Average precision (py): 33.23%
Average recall (py): 63.31%
Time used for evaluate on dev set: 0 m 2 s

Starting training epoch 3/10
  20/2775 [..............................] - ETA: 18:21 - loss: 0.5310



Time used for epoch 3: 0 m 10 s
Evaluating on dev set after epoch 3/10:
Average F1 (py): 29.65%
Average precision (py): 34.47%
Average recall (py): 65.84%
Time used for evaluate on dev set: 0 m 2 s

Starting training epoch 4/10
  20/2775 [..............................] - ETA: 13:18 - loss: 0.4795



Time used for epoch 4: 0 m 10 s
Evaluating on dev set after epoch 4/10:
Average F1 (py): 29.67%
Average precision (py): 32.91%
Average recall (py): 65.51%
Time used for evaluate on dev set: 0 m 2 s

Starting training epoch 5/10
  20/2775 [..............................] - ETA: 12:25 - loss: 0.4565



Time used for epoch 5: 0 m 10 s
Evaluating on dev set after epoch 5/10:
Average F1 (py): 29.78%
Average precision (py): 34.07%
Average recall (py): 65.24%
Time used for evaluate on dev set: 0 m 2 s

Starting training epoch 6/10
  20/2775 [..............................] - ETA: 12:33 - loss: 0.4405



Time used for epoch 6: 0 m 10 s
Evaluating on dev set after epoch 6/10:
Average F1 (py): 30.25%
Average precision (py): 35.11%
Average recall (py): 61.67%
Time used for evaluate on dev set: 0 m 2 s

Starting training epoch 7/10
  20/2775 [..............................] - ETA: 12:17 - loss: 0.4303



Time used for epoch 7: 0 m 5 s
Evaluating on dev set after epoch 7/10:
Average F1 (py): 32.10%
Average precision (py): 39.82%
Average recall (py): 56.21%
Time used for evaluate on dev set: 0 m 3 s

Starting training epoch 8/10
  20/2775 [..............................] - ETA: 15:33 - loss: 0.4195



Time used for epoch 8: 0 m 6 s
Evaluating on dev set after epoch 8/10:
Average F1 (py): 31.18%
Average precision (py): 37.45%
Average recall (py): 58.61%
Time used for evaluate on dev set: 0 m 2 s

Starting training epoch 9/10
  20/2775 [..............................] - ETA: 15:18 - loss: 0.4082



Time used for epoch 9: 0 m 6 s
Evaluating on dev set after epoch 9/10:
Average F1 (py): 32.87%
Average precision (py): 40.76%
Average recall (py): 55.63%
Time used for evaluate on dev set: 0 m 3 s

Starting training epoch 10/10
  20/2775 [..............................] - ETA: 12:37 - loss: 0.4032



Time used for epoch 10: 0 m 5 s
Evaluating on dev set after epoch 10/10:
Average F1 (py): 33.41%
Average precision (py): 41.10%
Average recall (py): 54.27%
Time used for evaluate on dev set: 0 m 2 s

Training finished!
Time used for training: 2 m 5 s

Evaluating on test set:
Average F1 (py): 31.17%
Average precision (py): 38.74%
Average recall (py): 52.43%
Time used for evaluate on test set: 0 m 7 s
Model: "model_5"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_11 (InputLayer)          [(None, None, None,  0           []                               
                                 300)]                                                            
                                                                                                  
 lambda_20 (Lambda)             (None, None, 300)    0           ['input_11[0][0]']  



Time used for epoch 1: 0 m 22 s
Evaluating on dev set after epoch 1/10:
Average F1 (py): 0.00%
Average precision (py): 0.00%
Average recall (py): 0.00%
Time used for evaluate on dev set: 0 m 6 s

Starting training epoch 2/10
  20/2775 [..............................] - ETA: 12:21 - loss: 0.5690



Time used for epoch 2: 0 m 10 s
Evaluating on dev set after epoch 2/10:
Average F1 (py): 29.63%
Average precision (py): 33.91%
Average recall (py): 65.62%
Time used for evaluate on dev set: 0 m 4 s

Starting training epoch 3/10
  20/2775 [..............................] - ETA: 12:42 - loss: 0.5094



Time used for epoch 3: 0 m 5 s
Evaluating on dev set after epoch 3/10:
Average F1 (py): 29.70%
Average precision (py): 33.69%
Average recall (py): 65.93%
Time used for evaluate on dev set: 0 m 2 s

Starting training epoch 4/10
  20/2775 [..............................] - ETA: 18:43 - loss: 0.4744



Time used for epoch 4: 0 m 10 s
Evaluating on dev set after epoch 4/10:
Average F1 (py): 29.70%
Average precision (py): 33.70%
Average recall (py): 65.54%
Time used for evaluate on dev set: 0 m 2 s

Starting training epoch 5/10
  20/2775 [..............................] - ETA: 13:51 - loss: 0.4529



Time used for epoch 5: 0 m 6 s
Evaluating on dev set after epoch 5/10:
Average F1 (py): 29.82%
Average precision (py): 34.59%
Average recall (py): 64.57%
Time used for evaluate on dev set: 0 m 4 s

Starting training epoch 6/10
  20/2775 [..............................] - ETA: 12:09 - loss: 0.4358



Time used for epoch 6: 0 m 10 s
Evaluating on dev set after epoch 6/10:
Average F1 (py): 34.21%
Average precision (py): 43.79%
Average recall (py): 52.17%
Time used for evaluate on dev set: 0 m 4 s

Starting training epoch 7/10
  20/2775 [..............................] - ETA: 12:25 - loss: 0.4244



Time used for epoch 7: 0 m 10 s
Evaluating on dev set after epoch 7/10:
Average F1 (py): 35.85%
Average precision (py): 47.29%
Average recall (py): 50.70%
Time used for evaluate on dev set: 0 m 4 s

Starting training epoch 8/10
  20/2775 [..............................] - ETA: 13:13 - loss: 0.4082



Time used for epoch 8: 0 m 10 s
Evaluating on dev set after epoch 8/10:
Average F1 (py): 33.54%
Average precision (py): 41.93%
Average recall (py): 54.58%
Time used for evaluate on dev set: 0 m 3 s

Starting training epoch 9/10
  20/2775 [..............................] - ETA: 15:16 - loss: 0.4084



Time used for epoch 9: 0 m 10 s
Evaluating on dev set after epoch 9/10:
Average F1 (py): 37.42%
Average precision (py): 48.61%
Average recall (py): 47.87%
Time used for evaluate on dev set: 0 m 2 s

Starting training epoch 10/10
  20/2775 [..............................] - ETA: 18:56 - loss: 0.4030



Time used for epoch 10: 0 m 8 s
Evaluating on dev set after epoch 10/10:
Average F1 (py): 37.81%
Average precision (py): 49.59%
Average recall (py): 47.31%
Time used for evaluate on dev set: 0 m 2 s

Training finished!
Time used for training: 2 m 22 s

Evaluating on test set:
Average F1 (py): 35.52%
Average precision (py): 47.01%
Average recall (py): 45.38%
Time used for evaluate on test set: 0 m 4 s
Model: "model_6"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_13 (InputLayer)          [(None, None, None,  0           []                               
                                 300)]                                                            
                                                                                                  
 lambda_24 (Lambda)             (None, None, 300)    0           ['input_13[0][0]'] 



Time used for epoch 1: 0 m 20 s
Evaluating on dev set after epoch 1/10:
Average F1 (py): 0.00%
Average precision (py): 0.00%
Average recall (py): 0.00%
Time used for evaluate on dev set: 0 m 4 s

Starting training epoch 2/10
  20/2775 [..............................] - ETA: 12:07 - loss: 0.5835



Time used for epoch 2: 0 m 5 s
Evaluating on dev set after epoch 2/10:
Average F1 (py): 31.62%
Average precision (py): 40.65%
Average recall (py): 43.99%
Time used for evaluate on dev set: 0 m 4 s

Starting training epoch 3/10
  20/2775 [..............................] - ETA: 13:55 - loss: 0.5303



Time used for epoch 3: 0 m 6 s
Evaluating on dev set after epoch 3/10:
Average F1 (py): 29.75%
Average precision (py): 33.95%
Average recall (py): 65.61%
Time used for evaluate on dev set: 0 m 2 s

Starting training epoch 4/10
  20/2775 [..............................] - ETA: 16:17 - loss: 0.4759



Time used for epoch 4: 0 m 7 s
Evaluating on dev set after epoch 4/10:
Average F1 (py): 29.83%
Average precision (py): 33.42%
Average recall (py): 63.74%
Time used for evaluate on dev set: 0 m 3 s

Starting training epoch 5/10
  20/2775 [..............................] - ETA: 12:18 - loss: 0.4488



Time used for epoch 5: 0 m 10 s
Evaluating on dev set after epoch 5/10:
Average F1 (py): 31.94%
Average precision (py): 39.50%
Average recall (py): 56.94%
Time used for evaluate on dev set: 0 m 3 s

Starting training epoch 6/10
  20/2775 [..............................] - ETA: 12:35 - loss: 0.4324



Time used for epoch 6: 0 m 5 s
Evaluating on dev set after epoch 6/10:
Average F1 (py): 35.50%
Average precision (py): 47.34%
Average recall (py): 50.23%
Time used for evaluate on dev set: 0 m 2 s

Starting training epoch 7/10
  20/2775 [..............................] - ETA: 18:50 - loss: 0.4222



Time used for epoch 7: 0 m 8 s
Evaluating on dev set after epoch 7/10:
Average F1 (py): 36.06%
Average precision (py): 48.53%
Average recall (py): 50.23%
Time used for evaluate on dev set: 0 m 2 s

Starting training epoch 8/10
  20/2775 [..............................] - ETA: 12:44 - loss: 0.4162



Time used for epoch 8: 0 m 5 s
Evaluating on dev set after epoch 8/10:
Average F1 (py): 34.02%
Average precision (py): 43.92%
Average recall (py): 52.65%
Time used for evaluate on dev set: 0 m 3 s

Starting training epoch 9/10
  20/2775 [..............................] - ETA: 15:29 - loss: 0.4102



Time used for epoch 9: 0 m 10 s
Evaluating on dev set after epoch 9/10:
Average F1 (py): 36.07%
Average precision (py): 48.09%
Average recall (py): 50.48%
Time used for evaluate on dev set: 0 m 2 s

Starting training epoch 10/10
  20/2775 [..............................] - ETA: 18:31 - loss: 0.4054



Time used for epoch 10: 0 m 8 s
Evaluating on dev set after epoch 10/10:
Average F1 (py): 37.43%
Average precision (py): 50.57%
Average recall (py): 48.86%
Time used for evaluate on dev set: 0 m 2 s

Training finished!
Time used for training: 1 m 59 s

Evaluating on test set:
Average F1 (py): 34.93%
Average precision (py): 47.55%
Average recall (py): 46.81%
Time used for evaluate on test set: 0 m 4 s
Model: "model_7"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_15 (InputLayer)          [(None, None, None,  0           []                               
                                 300)]                                                            
                                                                                                  
 lambda_28 (Lambda)             (None, None, 300)    0           ['input_15[0][0]'] 



Time used for epoch 1: 0 m 25 s
Evaluating on dev set after epoch 1/10:
Average F1 (py): 0.00%
Average precision (py): 0.00%
Average recall (py): 0.00%
Time used for evaluate on dev set: 0 m 4 s

Starting training epoch 2/10
  20/2775 [..............................] - ETA: 19:14 - loss: 0.5814



Time used for epoch 2: 0 m 8 s
Evaluating on dev set after epoch 2/10:
Average F1 (py): 29.59%
Average precision (py): 35.05%
Average recall (py): 64.54%
Time used for evaluate on dev set: 0 m 2 s

Starting training epoch 3/10
  20/2775 [..............................] - ETA: 18:40 - loss: 0.5186



Time used for epoch 3: 0 m 8 s
Evaluating on dev set after epoch 3/10:
Average F1 (py): 29.67%
Average precision (py): 33.85%
Average recall (py): 65.44%
Time used for evaluate on dev set: 0 m 3 s

Starting training epoch 4/10
  20/2775 [..............................] - ETA: 14:39 - loss: 0.4793



Time used for epoch 4: 0 m 10 s
Evaluating on dev set after epoch 4/10:
Average F1 (py): 29.95%
Average precision (py): 34.95%
Average recall (py): 63.42%
Time used for evaluate on dev set: 0 m 3 s

Starting training epoch 5/10
  20/2775 [..............................] - ETA: 14:37 - loss: 0.4540



Time used for epoch 5: 0 m 10 s
Evaluating on dev set after epoch 5/10:
Average F1 (py): 30.04%
Average precision (py): 34.38%
Average recall (py): 62.80%
Time used for evaluate on dev set: 0 m 3 s

Starting training epoch 6/10
  20/2775 [..............................] - ETA: 13:59 - loss: 0.4356



Time used for epoch 6: 0 m 10 s
Evaluating on dev set after epoch 6/10:
Average F1 (py): 31.80%
Average precision (py): 39.19%
Average recall (py): 57.45%
Time used for evaluate on dev set: 0 m 3 s

Starting training epoch 7/10
  20/2775 [..............................] - ETA: 14:47 - loss: 0.4260



Time used for epoch 7: 0 m 10 s
Evaluating on dev set after epoch 7/10:
Average F1 (py): 34.74%
Average precision (py): 44.48%
Average recall (py): 51.53%
Time used for evaluate on dev set: 0 m 3 s

Starting training epoch 8/10
  20/2775 [..............................] - ETA: 14:14 - loss: 0.4195



Time used for epoch 8: 0 m 10 s
Evaluating on dev set after epoch 8/10:
Average F1 (py): 35.54%
Average precision (py): 47.16%
Average recall (py): 49.42%
Time used for evaluate on dev set: 0 m 3 s

Starting training epoch 9/10
  20/2775 [..............................] - ETA: 14:22 - loss: 0.4112



Time used for epoch 9: 0 m 10 s
Evaluating on dev set after epoch 9/10:
Average F1 (py): 36.75%
Average precision (py): 49.70%
Average recall (py): 48.68%
Time used for evaluate on dev set: 0 m 3 s

Starting training epoch 10/10
  20/2775 [..............................] - ETA: 14:13 - loss: 0.4031



Time used for epoch 10: 0 m 6 s
Evaluating on dev set after epoch 10/10:
Average F1 (py): 31.89%
Average precision (py): 37.07%
Average recall (py): 58.97%
Time used for evaluate on dev set: 0 m 2 s

Training finished!
Time used for training: 2 m 26 s

Evaluating on test set:
Average F1 (py): 29.41%
Average precision (py): 35.01%
Average recall (py): 57.25%
Time used for evaluate on test set: 0 m 7 s
Model: "model_8"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_17 (InputLayer)          [(None, None, None,  0           []                               
                                 300)]                                                            
                                                                                                  
 lambda_32 (Lambda)             (None, None, 300)    0           ['input_17[0][0]'] 



Time used for epoch 1: 0 m 18 s
Evaluating on dev set after epoch 1/10:
Average F1 (py): 0.00%
Average precision (py): 0.00%
Average recall (py): 0.00%
Time used for evaluate on dev set: 0 m 5 s

Starting training epoch 2/10
  20/2775 [..............................] - ETA: 15:58 - loss: 0.5723



Time used for epoch 2: 0 m 7 s
Evaluating on dev set after epoch 2/10:
Average F1 (py): 29.53%
Average precision (py): 34.12%
Average recall (py): 64.08%
Time used for evaluate on dev set: 0 m 2 s

Starting training epoch 3/10
  20/2775 [..............................] - ETA: 16:16 - loss: 0.5139



Time used for epoch 3: 0 m 7 s
Evaluating on dev set after epoch 3/10:
Average F1 (py): 29.67%
Average precision (py): 34.63%
Average recall (py): 65.91%
Time used for evaluate on dev set: 0 m 3 s

Starting training epoch 4/10
  20/2775 [..............................] - ETA: 12:42 - loss: 0.4708



Time used for epoch 4: 0 m 10 s
Evaluating on dev set after epoch 4/10:
Average F1 (py): 30.66%
Average precision (py): 35.47%
Average recall (py): 59.66%
Time used for evaluate on dev set: 0 m 3 s

Starting training epoch 5/10
  20/2775 [..............................] - ETA: 13:15 - loss: 0.4448



Time used for epoch 5: 0 m 5 s
Evaluating on dev set after epoch 5/10:
Average F1 (py): 29.83%
Average precision (py): 34.09%
Average recall (py): 64.92%
Time used for evaluate on dev set: 0 m 2 s

Starting training epoch 6/10
  20/2775 [..............................] - ETA: 19:00 - loss: 0.4301



Time used for epoch 6: 0 m 8 s
Evaluating on dev set after epoch 6/10:
Average F1 (py): 33.33%
Average precision (py): 41.20%
Average recall (py): 54.20%
Time used for evaluate on dev set: 0 m 2 s

Starting training epoch 7/10
  20/2775 [..............................] - ETA: 12:41 - loss: 0.4189



Time used for epoch 7: 0 m 5 s
Evaluating on dev set after epoch 7/10:
Average F1 (py): 38.15%
Average precision (py): 53.44%
Average recall (py): 45.58%
Time used for evaluate on dev set: 0 m 4 s

Starting training epoch 8/10
  20/2775 [..............................] - ETA: 13:39 - loss: 0.4176



Time used for epoch 8: 0 m 6 s
Evaluating on dev set after epoch 8/10:
Average F1 (py): 35.39%
Average precision (py): 46.79%
Average recall (py): 50.91%
Time used for evaluate on dev set: 0 m 2 s

Starting training epoch 9/10
  20/2775 [..............................] - ETA: 18:44 - loss: 0.4054



Time used for epoch 9: 0 m 8 s
Evaluating on dev set after epoch 9/10:
Average F1 (py): 35.71%
Average precision (py): 45.72%
Average recall (py): 49.26%
Time used for evaluate on dev set: 0 m 2 s

Starting training epoch 10/10
  20/2775 [..............................] - ETA: 12:53 - loss: 0.4059



Time used for epoch 10: 0 m 5 s
Evaluating on dev set after epoch 10/10:
Average F1 (py): 36.74%
Average precision (py): 49.51%
Average recall (py): 49.00%
Time used for evaluate on dev set: 0 m 2 s

Training finished!
Time used for training: 1 m 56 s

Evaluating on test set:
Average F1 (py): 34.72%
Average precision (py): 46.45%
Average recall (py): 46.68%
Time used for evaluate on test set: 0 m 6 s
Model: "model_9"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_19 (InputLayer)          [(None, None, None,  0           []                               
                                 300)]                                                            
                                                                                                  
 lambda_36 (Lambda)             (None, None, 300)    0           ['input_19[0][0]'] 



Time used for epoch 1: 0 m 18 s
Evaluating on dev set after epoch 1/10:
Average F1 (py): 0.49%
Average precision (py): 31.46%
Average recall (py): 0.25%
Time used for evaluate on dev set: 0 m 5 s

Starting training epoch 2/10
  20/2775 [..............................] - ETA: 15:45 - loss: 0.5824



Time used for epoch 2: 0 m 10 s
Evaluating on dev set after epoch 2/10:
Average F1 (py): 29.50%
Average precision (py): 34.60%
Average recall (py): 63.92%
Time used for evaluate on dev set: 0 m 2 s

Starting training epoch 3/10
  20/2775 [..............................] - ETA: 18:45 - loss: 0.5214



Time used for epoch 3: 0 m 10 s
Evaluating on dev set after epoch 3/10:
Average F1 (py): 29.64%
Average precision (py): 34.04%
Average recall (py): 65.83%
Time used for evaluate on dev set: 0 m 2 s

Starting training epoch 4/10
  20/2775 [..............................] - ETA: 15:25 - loss: 0.4740



Time used for epoch 4: 0 m 10 s
Evaluating on dev set after epoch 4/10:
Average F1 (py): 29.89%
Average precision (py): 34.29%
Average recall (py): 63.96%
Time used for evaluate on dev set: 0 m 2 s

Starting training epoch 5/10
  20/2775 [..............................] - ETA: 12:39 - loss: 0.4529



Time used for epoch 5: 0 m 5 s
Evaluating on dev set after epoch 5/10:
Average F1 (py): 30.43%
Average precision (py): 34.61%
Average recall (py): 61.10%
Time used for evaluate on dev set: 0 m 4 s

Starting training epoch 6/10
  20/2775 [..............................] - ETA: 13:03 - loss: 0.4332



Time used for epoch 6: 0 m 10 s
Evaluating on dev set after epoch 6/10:
Average F1 (py): 31.36%
Average precision (py): 38.20%
Average recall (py): 58.99%
Time used for evaluate on dev set: 0 m 4 s

Starting training epoch 7/10
  20/2775 [..............................] - ETA: 14:09 - loss: 0.4175



Time used for epoch 7: 0 m 10 s
Evaluating on dev set after epoch 7/10:
Average F1 (py): 34.50%
Average precision (py): 44.92%
Average recall (py): 52.53%
Time used for evaluate on dev set: 0 m 2 s

Starting training epoch 8/10
  20/2775 [..............................] - ETA: 16:41 - loss: 0.4088



Time used for epoch 8: 0 m 7 s
Evaluating on dev set after epoch 8/10:
Average F1 (py): 33.55%
Average precision (py): 42.38%
Average recall (py): 53.26%
Time used for evaluate on dev set: 0 m 2 s

Starting training epoch 9/10
  20/2775 [..............................] - ETA: 14:05 - loss: 0.4148



Time used for epoch 9: 0 m 10 s
Evaluating on dev set after epoch 9/10:
Average F1 (py): 36.27%
Average precision (py): 47.34%
Average recall (py): 50.59%
Time used for evaluate on dev set: 0 m 2 s

Starting training epoch 10/10
  20/2775 [..............................] - ETA: 12:22 - loss: 0.3994



Time used for epoch 10: 0 m 5 s
Evaluating on dev set after epoch 10/10:
Average F1 (py): 36.52%
Average precision (py): 48.80%
Average recall (py): 50.02%
Time used for evaluate on dev set: 0 m 4 s

Training finished!
Time used for training: 2 m 12 s

Evaluating on test set:
Average F1 (py): 34.40%
Average precision (py): 46.26%
Average recall (py): 47.89%
Time used for evaluate on test set: 0 m 5 s
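
A note on reading these logs: the reported "Average F1" is lower than both the average precision and the average recall, which cannot happen for an F1 computed from a single precision/recall pair, since a harmonic mean always lies between its two inputs. The most likely explanation, assuming `metric.py` follows the usual CoNLL convention, is that precision, recall and F1 are each averaged separately over several coreference metrics. The sketch below illustrates the effect; the per-metric numbers and the names `metric_a`, `metric_b`, `metric_c` are hypothetical and chosen only for illustration.

```python
def f1(p, r):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

# Test-set averages reported for the last run above (in percent).
avg_p, avg_r, avg_f1 = 46.26, 47.89, 34.40

# F1 recomputed from the averaged precision and recall: much higher
# than the reported average F1, so the log is not computing it this way.
naive_f1 = f1(avg_p, avg_r)
print(f"F1 from averaged P and R: {naive_f1:.2f}  (reported: {avg_f1:.2f})")

# Hypothetical per-metric (precision, recall) pairs, for illustration only.
per_metric = {
    "metric_a": (0.80, 0.20),
    "metric_b": (0.40, 0.50),
    "metric_c": (0.20, 0.75),
}

mean_p = sum(p for p, _ in per_metric.values()) / len(per_metric)
mean_r = sum(r for _, r in per_metric.values()) / len(per_metric)
mean_f1 = sum(f1(p, r) for p, r in per_metric.values()) / len(per_metric)

# Averaging per-metric F1 scores gives a lower value than taking the
# F1 of the averaged precision and recall when the metrics disagree.
print(f"mean P={mean_p:.3f}  mean R={mean_r:.3f}")
print(f"mean of F1s={mean_f1:.3f}  F1 of means={f1(mean_p, mean_r):.3f}")
```

When comparing runs, it is therefore safer to compare the reported average F1 directly rather than recomputing it from the precision and recall columns.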


In [None]:
import pandas as pd

pd.DataFrame(results).T

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| MAX_ANT | 200.0 | 200.0 | 200.0 | 250.0 | 250.0 | 250.0 | 300.0 | 300.0 | 300.0 |
| NEG_RATIO | 1.0 | 2.0 | 3.0 | 1.0 | 2.0 | 3.0 | 1.0 | 2.0 | 3.0 |
| precision | 0.355722 | 0.49335 | 0.445585 | 0.387425 | 0.470066 | 0.475547 | 0.350052 | 0.464494 | 0.462602 |
| recall | 0.561894 | 0.461719 | 0.478289 | 0.524332 | 0.453837 | 0.468117 | 0.572513 | 0.466757 | 0.478937 |
| f1 | 0.300808 | 0.358599 | 0.342587 | 0.311716 | 0.355155 | 0.34933 | 0.294055 | 0.347228 | 0.343982 |
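
To read the grid above more easily, one option is to pick out the run with the highest F1 and to average F1 per `NEG_RATIO`. The sketch below does this in plain Python on the values copied from the table; the dictionary `runs` is a stand-in introduced here for illustration, since the exact structure of `results` is not shown.

```python
# Test-set scores copied from the results table above, one entry per run.
runs = {
    "MAX_ANT":   [200, 200, 200, 250, 250, 250, 300, 300, 300],
    "NEG_RATIO": [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "precision": [0.355722, 0.49335, 0.445585, 0.387425, 0.470066,
                  0.475547, 0.350052, 0.464494, 0.462602],
    "recall":    [0.561894, 0.461719, 0.478289, 0.524332, 0.453837,
                  0.468117, 0.572513, 0.466757, 0.478937],
    "f1":        [0.300808, 0.358599, 0.342587, 0.311716, 0.355155,
                  0.34933, 0.294055, 0.347228, 0.343982],
}

# Index of the run with the highest F1, and its full configuration.
best = max(range(len(runs["f1"])), key=lambda i: runs["f1"][i])
best_config = {name: values[best] for name, values in runs.items()}
print(f"best run: {best} -> {best_config}")

# Mean F1 for each NEG_RATIO, averaged over the three MAX_ANT settings.
mean_f1_by_ratio = {}
for ratio in sorted(set(runs["NEG_RATIO"])):
    scores = [f for f, r in zip(runs["f1"], runs["NEG_RATIO"]) if r == ratio]
    mean_f1_by_ratio[ratio] = sum(scores) / len(scores)

for ratio, score in mean_f1_by_ratio.items():
    print(f"NEG_RATIO={ratio}: mean F1={score:.4f}")
```

If the DataFrame has the layout displayed above (metrics as rows, runs as columns), `df.loc["f1"].idxmax()` should return the same run label. Two caveats apply: these are test-set scores, so choosing hyperparameters from them is less sound than choosing on the dev set, and each configuration was run only once, while the dev F1 in the logs moves by several points between consecutive epochs, so small gaps between configurations may not be meaningful.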
