# Coreference Resolution

In this problem set, you will venture into the challenging NLP task of **coreference resolution**. You will:

* Implement a simple rule-based system that achieves results which are surprisingly difficult to beat.
* Get acquainted with the trickiness of evaluating coref systems, and the current solutions in the field.
* Experiment with two neural approaches for coref to be implemented in PyTorch:
  * A feedforward network that only looks at boolean mention-pair features
  * A fully-neural architecture with embeddings all the way down
* Get a glimpse of domain adaptation in the wild, by running a system trained on news against a narrative corpus and vice versa.


# 0. Setup

In order to develop this assignment, you will need [python 3](https://www.python.org/downloads/) and the following libraries. Most if not all of these are part of [conda](https://docs.conda.io/en/latest/miniconda.html), so a good starting point would be to install that.

* [jupyter](http://jupyter.readthedocs.org/en/latest/install.html)
* [scipy](https://www.scipy.org/install.html)
* numpy (this is installed along with scipy if you follow the link above; if not, install it separately)
* [nosetests](https://nose.readthedocs.org/en/latest/)
* [torch](https://pytorch.org/get-started/locally/)
* [nltk](https://www.nltk.org/install.html) (You might also want to download `averaged_perceptron_tagger` with the NLTK downloader: activate your conda environment, open a python console, and `import nltk; nltk.download('averaged_perceptron_tagger')` if necessary)
* [xmltodict](https://github.com/martinblech/xmltodict#using-pypi)

Here is some help on installing packages in python: https://packaging.python.org/installing/. You can use ```pip install --user``` to install locally without sudo.

## About this assignment

* Most of your coding will be in the python source files in the directory `mynlplib`.
* The directory `tests` contains unit tests that will be used to grade your assignment, using `nosetests`. You should run them as you work on the assignment to see that you're on the right track. You are free to look at their source code, if that helps -- though most of the relevant code is also here in this notebook. **You should run the tests and make sure they can pass before submitting your assignment.**
* **To submit this assignment, run the script `make-submission.sh`, and submit the tarball `pset4-submission.tgz` on Canvas. Make sure that the tarball contains all the files specified in `manifest.txt`.**

In [1]:
! pip install nose

Collecting nose
  Downloading https://files.pythonhosted.org/packages/15/d8/dd071918c040f50fa1cf80da16423af51ff8ce4a0f2399b7bf8de45ac3d9/nose-1.3.7-py3-none-any.whl (154kB)

In [2]:
! pip install xmltodict

Collecting xmltodict
  Downloading https://files.pythonhosted.org/packages/28/fd/30d5c1d3ac29ce229f6bdc40bbc20b28f716e8b363140c26eff19122d8a5/xmltodict-0.12.0-py2.py3-none-any.whl
Installing collected packages: xmltodict
Successfully installed xmltodict-0.12.0


In [3]:
import nose
import numpy as np
import os
import pickle
import scipy
import xmltodict
from collections import Counter, defaultdict
from nltk.tag import pos_tag

import torch
from torch.nn import functional as F
import torch.optim as optim

%load_ext autoreload
%autoreload 2

In [4]:
print('My library versions')

print('numpy: {}'.format(np.__version__))
print('xmltodict: {}'.format(xmltodict.__version__))
print('nose: {}'.format(nose.__version__))
print('scipy: {}'.format(scipy.__version__))
print('torch: {}'.format(torch.__version__))

My library versions
numpy: 1.18.5
xmltodict: 0.12.0
nose: 1.3.7
scipy: 1.4.1
torch: 1.7.0+cu101


To test whether your libraries are the right version, run:

`nosetests tests/test_environment.py`

In [5]:
# use ! to run shell commands in notebook
! nosetests tests/test_environment.py

.
----------------------------------------------------------------------
Ran 1 test in 0.001s

OK


In [16]:
from mynlplib import coref, coref_rules, coref_features, coref_learning, neural_net, utils

# constants for notebook use
ETA_0 = 0.01

# 1. Exploring the data (6 points)

The core data is in the form of "markables", or "referring expressions", which refer to token sequences that can participate in coreference relations.

Each markable is a namedtuple with five elements:
- ```string```, which is a list of tokens
- ```entity```, which defines the ground truth assignments
- ```start_token```, the index of the first token in the markable with respect to the entire document
- ```end_token```, one plus the index of the last token in the markable
- ```tags```, POS tags corresponding to the tokens in ```string``` which will remain NULL for now

The ```read_data``` function also returns a list of tokens.
You can use this to incorporate the linguistic context around each markable.
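
To make the field layout concrete, here is a minimal sketch of such a record. The `Markable` definition below is reconstructed from the five fields listed above and is an assumption, not necessarily the definition used in `mynlplib`; the token list is a toy example rather than a file from the dataset.

```python
from collections import namedtuple

# Hypothetical stand-in for the course's Markable type, built from the five listed fields.
Markable = namedtuple('Markable', ['string', 'entity', 'start_token', 'end_token', 'tags'])

# Toy document (not taken from the WSJ data).
words = ['Fujitsu', 'Ltd.', 'said', 'it', 'won', 'the', 'bid', '.']
m0 = Markable(['Fujitsu', 'Ltd.'], 'set_1', 0, 2, ['NULL', 'NULL'])
m1 = Markable(['it'], 'set_1', 3, 4, ['NULL'])

# end_token is one past the last token, so a plain slice recovers the markable's tokens.
span = words[m0.start_token:m0.end_token]

# Two tokens of linguistic context on either side of the second markable.
left_context = words[max(0, m1.start_token - 2):m1.start_token]
right_context = words[m1.end_token:m1.end_token + 2]

print(span, left_context, right_context)
```

Because `end_token` is exclusive, span lengths and slices need no off-by-one adjustment.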

### Loading the dataset

For most of this problem set, we will explore a dataset of articles from the Wall Street Journal (WSJ) extracted and annotated from the Penn Treebank (PTB).

In [17]:
dv_dir = os.path.join('data','wsj','dev')
tr_dir = os.path.join('data','wsj','train')
te_dir = os.path.join('data','wsj','test') # all markables here are annotated as the same entity

In [18]:
markables, words = coref.read_data('06_wsj_0051.sty',basedir=tr_dir)

In [19]:
print('Markable object:', markables[0])
print('Words for markable extracted from text:', words[markables[0].start_token:markables[0].end_token])

Markable object: Markable(string=['Fujitsu', 'Ltd.'], entity='set_3082', start_token=0, end_token=2, tags=['NULL', 'NULL'])
Words for markable extracted from text: ['Fujitsu', 'Ltd.']


**Deliverable 1.1** (3 points): Write a function that returns all the markable **strings** associated with a given entity. Specifically, fill in the function `get_markables_for_entity()` in `coref.py`.

* **Test:** `tests\test_coref.py:test_get_markables_d1_1()`

In [20]:
sorted(coref.get_markables_for_entity(markables,'set_3082'))

['Fujitsu',
 'Fujitsu',
 'Fujitsu',
 'Fujitsu',
 'Fujitsu',
 'Fujitsu',
 'Fujitsu',
 "Fujitsu , Japan 's No. 1 computer maker",
 'Fujitsu Ltd.',
 'It',
 'The company',
 'The company',
 'We',
 'his company',
 'his company',
 'it',
 'it',
 'it',
 'it',
 'it',
 'it',
 'its',
 'its',
 'the company']

In [21]:
import nltk
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [22]:
! nosetests tests/test_coref.py:test_get_markables_d1_1

.
----------------------------------------------------------------------
Ran 1 test in 0.942s

OK


**Deliverable 1.2** (3 points): Write a function that takes as input a string, and returns a list of distances to the most recent ground truth antecedent for every time the (case-insensitive) input string appears. For example, if the input is "they", it should make a list with one element for each time the word "they" appears in the list of markables. Each element should be the distance of the word "they" to the nearest previous mention of the entity that "they" references.

Fill in the function `get_distances()` in `coref.py`. If the input string is not anaphoric, the distance should be zero. Note that input strings may contain spaces. You may use any other function in `coref.py` to help you.

* **Test:** `tests\test_coref.py:test_get_antecedents_d1_2()`
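
As a rough illustration of the idea (not the reference solution), the sketch below tracks the most recent mention index of every entity while scanning the markables once. It assumes distance is counted in mention positions, and it uses a trimmed-down, hypothetical `Markable` with only two fields and a toy mention list.

```python
from collections import namedtuple

# Hypothetical two-field stand-in; the real markables carry more fields.
Markable = namedtuple('Markable', ['string', 'entity'])

def distances_to_antecedent(markables, target):
    """For each (case-insensitive) occurrence of `target`, return the distance in
    mention positions to the nearest earlier mention of the same entity, or 0 if none."""
    target = target.lower()
    last_seen = {}   # entity -> index of its most recent mention so far
    distances = []
    for i, m in enumerate(markables):
        if ' '.join(m.string).lower() == target:
            prev = last_seen.get(m.entity)
            distances.append(0 if prev is None else i - prev)
        last_seen[m.entity] = i
    return distances

toy = [
    Markable(['The', 'analysts'], 'e1'),
    Markable(['Fujitsu'], 'e2'),
    Markable(['they'], 'e1'),   # nearest e1 mention is 2 positions back
    Markable(['They'], 'e1'),   # nearest e1 mention is 1 position back
    Markable(['they'], 'e3'),   # first mention of e3: not anaphoric
]
result = distances_to_antecedent(toy, 'they')
print(result)
```

Joining the tokens with spaces before comparing is what lets multi-word inputs such as "the company" work.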

In [23]:
coref.get_distances(markables,'they')

[1, 1, 1, 2, 2]

Now let's compare the typical distances for various mention types.

You can see the most frequent mention types by using the `Counter` class.

In [24]:
Counter([' '.join(markable.string) for markable in markables]).most_common(5)

[('it', 9), ('Fujitsu', 7), ('Japan', 5), ('they', 5), ('NEC', 4)]

In [25]:
coref.get_distances(markables, 'Fujitsu')

[15, 8, 49, 12, 7, 4, 2]

In [26]:
coref.get_distances(markables, 'the company')

[4, 4, 6]

In [27]:
coref.get_distances(markables, 'it') # there are 10 because our counter was case-sensitive

[1, 2, 2, 1, 6, 1, 6, 2, 6, 1]

In [28]:
! nosetests tests/test_coref.py:test_get_antecedents_d1_2

.
----------------------------------------------------------------------
Ran 1 test in 0.971s

OK


# 2. Rule-based coreference resolution (18 points)

We have written a simple coreference classifier, which predicts that each markable is linked to the most recent antecedent which is an exact string match.

The code block below applies this method to the dev set.

In [29]:
exact_matcher = coref_rules.make_resolver(coref_rules.exact_match)

The code above has two pieces:

- ```coref_rules.exact_match()``` is a function that takes two markables, and returns `True` iff they are an exact (case-insensitive) string match
- ```make_resolver()``` is a function that takes a matching function, and returns a function that computes an antecedent list for a list of markables.
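
The two pieces can be sketched as follows. This is an illustrative reconstruction, not the code in `coref_rules.py`: the names `make_resolver_sketch` and `exact_match_sketch` are hypothetical, the toy markables are plain token lists rather than `Markable` tuples, and the sketch assumes that a markable with no match is recorded as its own antecedent.

```python
def exact_match_sketch(m_a, m_i):
    """True iff the two token lists are identical, ignoring case."""
    return [t.lower() for t in m_a] == [t.lower() for t in m_i]

def make_resolver_sketch(matcher):
    """Turn a pairwise matcher into a function that maps a list of markables to antecedent indices."""
    def resolve(markables):
        antecedents = []
        for i, m_i in enumerate(markables):
            c_i = i  # default: no antecedent, the markable points at itself
            # Scan backwards so that the most recent match wins.
            for a in range(i - 1, -1, -1):
                if matcher(markables[a], m_i):
                    c_i = a
                    break
            antecedents.append(c_i)
        return antecedents
    return resolve

toy = [['Fujitsu'], ['the', 'company'], ['fujitsu'], ['The', 'company'], ['NEC']]
antecedents = make_resolver_sketch(exact_match_sketch)(toy)
print(antecedents)
```

Separating the matching rule from the search loop means each later rule only needs a new pairwise function.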

Let's run it.

In [30]:
ant_exact = exact_matcher(markables)
print(ant_exact[:20])

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]


The output is a list of antecedent numbers, $c_i$. 
When $c_i = i$, the markable $i$ has no antecedent: it is the first mention of its entity. In this case, the first 20 mentions are all predicted to be new, with no antecedents. Try modifying the cell to look further down the list for actual matches, which we know should occur given the repeated strings (such as 'Fujitsu') in the `get_markables_for_entity()` output above.

We can test whether these predictions are correct by comparing against the key.

In [31]:
ant_true = coref.get_true_antecedents(markables)

In [32]:
num_correct = sum([c_true == c_predict for c_true, c_predict in zip(ant_true, ant_exact)])
acc = num_correct / len(markables)
print(f'correct: {num_correct}\taccuracy: {acc:.3f}')

correct: 128	accuracy: 0.660


## Evaluation

Coreference can be evaluated in terms of recall, precision, and F-measure. Here is how we will define these terms:

- **True positive**: The system predicts $\hat{c}_i < i$, and $\hat{c}_i$ and $i$ are references to the same entity.
- **False positive**: The system predicts $\hat{c}_i < i$, but $\hat{c}_i$ and $i$ are not references to the same entity.
- **False negative**: There exists some $c_i < i$ such that $c_i$ and $i$ are references to the same entity, but the system predicts either $\hat{c}_i = i$, or some $\hat{c}_i$ which is not really a reference to the same entity that $i$ references.
- Recall = $\frac{tp}{tp + fn}$
- Precision = $\frac{tp}{tp + fp}$
- F-measure = $\frac{2RP}{R+P}$

A couple of things to notice here:

- There is no reward for correctly identifying a markable as non-anaphoric (not having any antecedent), but you do avoid committing a false positive by doing this.
- You cannot compute the evaluation by directly matching the predicted antecedents to the true antecedents. Suppose the truth is $a \leftarrow b, b \leftarrow c$, but the system predicts $a \leftarrow b, a \leftarrow c$: the system should receive two true positives, since $a$ and $c$ are references to the same entity in the ground truth.
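
The definitions above can be sketched on a toy example. This is only one reading of the definitions, not the reference implementation: `entities` is assumed to be a list holding the gold entity label of each mention, which may differ from what `coref.get_entities()` returns, and `prf` is a hypothetical helper name.

```python
def prf(entities, predicted):
    """Count tp/fp/fn from gold entity labels and predicted antecedent indices."""
    tp = fp = fn = 0
    for i, c_hat in enumerate(predicted):
        has_true_antecedent = any(entities[a] == entities[i] for a in range(i))
        linked = c_hat < i
        correct = linked and entities[c_hat] == entities[i]
        if correct:
            tp += 1
        elif linked:
            fp += 1   # linked to a mention of a different entity
        if has_true_antecedent and not correct:
            fn += 1   # an antecedent existed but was missed or mis-linked
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f = 2 * recall * precision / (recall + precision) if recall + precision else 0.0
    return tp, fp, fn, recall, precision, f

# Mentions a, b, c share entity e1. The system links both b and c to a, which still
# yields two true positives. Mention 3 (e2) is wrongly linked to c, and mention 4 (e2)
# is wrongly left without an antecedent.
entities = ['e1', 'e1', 'e1', 'e2', 'e2']
predicted = [0, 0, 0, 2, 4]
tp, fp, fn, recall, precision, f = prf(entities, predicted)
print(tp, fp, fn, round(recall, 3), round(precision, 3), round(f, 3))
```

Note that a link to the wrong entity, when a true antecedent exists, counts as both a false positive and a false negative under these definitions.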

**Deliverable 2.1** (6 points): Implement `get_tp()`, `get_fp()`, and `get_fn()` in `coref.py`. You will want to use the function `coref.get_entities()`.

* **Test:** `tests\test_coref.py:test_recall_d2_1(), test_precision_d2_1(), test_fmeasure_d2_1()`

**NOTE!** You **must** successfully complete this deliverable. Otherwise, some of the unit tests won't work and you won't be able to complete the rest of the assignment.

In [33]:
f,r,p = coref.evaluate_f(exact_matcher, markables)
print(f'{f:.4f}\t{r:.4f}\t{p:.4f}')

0.5231	0.4096	0.7234


In [34]:
all_markables, all_words = coref.read_dataset(tr_dir)

In [35]:
coref.eval_on_dataset(exact_matcher, all_markables);

F: 0.5452	R: 0.4130	P:0.8018


Before optimizing on this simple F-measure (sometimes called F1) and its components, one should be aware that in the real world coreference is evaluated over three other metrics, namely $B^3$, **`CEAF`**, and **`MUC`**. We have implemented them for you, and will check our performance on them from time to time. You can read more about them [here](http://www.anthology.aclweb.org/W/W10/W10-4305.pdf).

In [36]:
def coref_metrics(matcher, dataset):
    ants = [matcher(m) for m in dataset]
    b3, ceaf, muc = coref.evaluate_bcm(dataset, ants)
    avg_f1 = np.average([b3, ceaf, muc])
    print(f'B-Cubed: {b3:.4f}\tCEAF: {ceaf:.4f}\tMUC: {muc:.4f}\tAverage: {avg_f1:.4f}')

In [37]:
coref_metrics(exact_matcher, all_markables); #"Average" is the commonly used main metric in state-of-the-art systems.

B-Cubed: 0.5756	CEAF: 0.4551	MUC: 0.5605	Average: 0.5304


In [38]:
! nosetests tests/test_coref.py:test_recall_d2_1

.
----------------------------------------------------------------------
Ran 1 test in 0.957s

OK


In [39]:
! nosetests tests/test_coref.py:test_precision_d2_1

.
----------------------------------------------------------------------
Ran 1 test in 0.964s

OK


In [40]:
! nosetests tests/test_coref.py:test_fmeasure_d2_1

.
----------------------------------------------------------------------
Ran 1 test in 0.953s

OK


There are several reasons for having multiple measures for coref evaluation. One of them, discussed in the paper linked above, has to do with the pre-resolution task of *identifying markables*. Since we're only working with pre-extracted markables, that needn't worry us.

The other reason is that different perspectives on coreference matching are equally plausible: for example, we can focus on individual correct predictions, or on finding the correct cluster for each entity. Each of these is "gameable" by different trivial classifiers.

**Deliverable 2.2** (3 points): To witness this problem, you will implement `coref_rules.singleton_matcher()`, which produces an assignment where each markable has its own entity, and `coref_rules.full_cluster_matcher()`, which assigns all markables to the same entity. Running the metrics against them will demonstrate the problem.

* **Test:** `tests\test_coref.py:test_singleton_matcher_d2_2(), test_full_cluster_matcher_d2_2()`
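
To see what the two degenerate assignments look like, here is a small sketch under the assumption that both are pairwise matchers passed to a "most recent match" resolver. All function names are hypothetical, and the toy markables are plain strings.

```python
def singleton_matcher_sketch(m_a, m_i):
    return False   # never link: every markable starts its own entity

def full_cluster_matcher_sketch(m_a, m_i):
    return True    # always link: every markable joins the previous one

def resolve(matcher, markables):
    """Most recent matching antecedent for each markable, or the markable itself."""
    antecedents = []
    for i in range(len(markables)):
        matches = [a for a in range(i) if matcher(markables[a], markables[i])]
        antecedents.append(matches[-1] if matches else i)
    return antecedents

def to_clusters(antecedents):
    """Follow antecedent links so each mention is labelled with the first mention of its chain."""
    cluster_of = []
    for i, c in enumerate(antecedents):
        cluster_of.append(i if c == i else cluster_of[c])
    return cluster_of

toy = ['Fujitsu', 'NEC', 'it', 'the company']
singleton_ants = resolve(singleton_matcher_sketch, toy)
full_ants = resolve(full_cluster_matcher_sketch, toy)
print(singleton_ants, to_clusters(singleton_ants))
print(full_ants, to_clusters(full_ants))
```

The full-cluster chain links each mention to its immediate predecessor, so all mentions end up in one cluster.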

In [41]:
singleton_resolver = coref_rules.make_resolver(coref_rules.singleton_matcher)
coref_metrics(singleton_resolver, all_markables);

B-Cubed: 0.3411	CEAF: 0.0012	MUC: 0.0000	Average: 0.1141


MUC has an inherent problem evaluating singleton entities. $B^3$, on the other hand, is extra-generous with them.
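
For intuition about that generosity, here is a sketch of the standard mention-level $B^3$ definition on a toy example. It is an assumption that `coref.evaluate_bcm` follows this exact formulation; the function name `b_cubed` and the data are hypothetical.

```python
from collections import defaultdict

def b_cubed(gold, pred):
    """Mention-averaged B-cubed precision, recall and F from per-mention cluster labels."""
    gold_clusters, pred_clusters = defaultdict(set), defaultdict(set)
    for i, (g, p) in enumerate(zip(gold, pred)):
        gold_clusters[g].add(i)
        pred_clusters[p].add(i)
    precision = recall = 0.0
    for g, p in zip(gold, pred):
        overlap = len(gold_clusters[g] & pred_clusters[p])
        precision += overlap / len(pred_clusters[p])
        recall += overlap / len(gold_clusters[g])
    n = len(gold)
    precision, recall = precision / n, recall / n
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f

gold = ['e1', 'e1', 'e1', 'e2', 'e3']   # one three-mention entity, two singletons
singleton_pred = [0, 1, 2, 3, 4]        # every mention in its own predicted cluster
precision, recall, f = b_cubed(gold, singleton_pred)
print(round(precision, 3), round(recall, 3), round(f, 3))
```

Every singleton prediction is trivially "pure", so precision is perfect, and each gold singleton also contributes full recall.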

In [42]:
full_cluster_resolver = coref_rules.make_resolver(coref_rules.full_cluster_matcher)
coref_metrics(full_cluster_resolver, all_markables);

B-Cubed: 0.0739	CEAF: 0.0304	MUC: 0.5552	Average: 0.2199


In this case MUC, which is focused on detecting incompatible clusters, is fairly comfortable with the fact that there's only one predicted cluster.
CEAF, a metric which gives low precision very easily, is difficult to score high on.

In [43]:
! nosetests tests/test_coref.py:test_singleton_matcher_d2_2

.
----------------------------------------------------------------------
Ran 1 test in 0.930s

OK


In [44]:
! nosetests tests/test_coref.py:test_full_cluster_matcher_d2_2

.
----------------------------------------------------------------------
Ran 1 test in 0.923s

OK


## Increasing precision

The `exact_match()` function matches everything, including pronouns. This can lead to mistakes:

"Umashanthi ate pizza until she was full. Parvati kept eating until she had a stomach ache."

In this example, both pronouns likely refer to the names that immediately precede them, and not to each other.

**Deliverable 2.3** (1.5 points): The file `coref_rules.py` contains the signature for a function `exact_match_no_pronouns()`, which solves this problem by only predicting matches between markables that are not pronouns. Implement and test this function. For now, you may use the list of pronouns provided in the code file `coref_rules.py`.

* **Test:** `tests\test_coref.py:test_match_nopro_d2_3(), tests\test_coref.py:test_match_nopro_f1_d2_3()`

In [45]:
no_pro_matcher = coref_rules.make_resolver(coref_rules.exact_match_no_pronouns)

In [46]:
f,r,p = coref.eval_on_dataset(no_pro_matcher,all_markables);

F: 0.4551	R: 0.3028	P:0.9158


In [47]:
coref_metrics(no_pro_matcher, all_markables);

B-Cubed: 0.5678	CEAF: 0.4269	MUC: 0.4568	Average: 0.4839


Precision has increased, but recall decreased, dragging down the overall F-measure as well as our favorite metrics.

In [48]:
! nosetests tests/test_coref.py:test_match_nopro_d2_3

.
----------------------------------------------------------------------
Ran 1 test in 0.906s

OK


In [49]:
! nosetests tests/test_coref.py:test_match_nopro_f1_d2_3

.
----------------------------------------------------------------------
Ran 1 test in 1.179s

OK


## Increasing recall

Our current matcher is very conservative. Let's try to increase recall. One solution is to match on the **head word** of each markable.

As you know, in a CFG parse, the head word is defined by a set of rules: for example, the head of a determiner-noun construction is the noun. In a dependency parse, the head word would be the root of the subtree governing the markable span. But this assumes that the markables correspond to syntactic constituents or dependency subtrees. This is not guaranteed to be true - particularly when there are parsing errors.

**Deliverable 2.4** (1.5 points): Let's start with a much simpler head-finding heuristic: simply select the *last word* in the markable. This handles many cases - but as we will see, not all. To do this, implement the function `match_last_token()` in ```coref_rules.py```. This function should match all cases where the final tokens match.

* **Test:** `tests\test_coref.py:test_match_last_tok_d2_4()`

In [50]:
last_tok_matcher = coref_rules.make_resolver(coref_rules.match_last_token)

In [51]:
coref.eval_on_dataset(last_tok_matcher,all_markables);

F: 0.3994	R: 0.4385	P:0.3666


In [52]:
! nosetests tests/test_coref.py:test_match_last_tok_d2_4

.
----------------------------------------------------------------------
Ran 1 test in 0.920s

OK


Recall is up, but precision is back down. To try to increase precision, let's add one more rule: two markables cannot corefer if their spans overlap. This can happen with nested mentions, such as "(the president (of the United States))". Under our last-token rule, these two mentions would corefer, but logically, overlapping markables cannot refer to the same entity.

**Deliverable 2.5** (1.5 points): Fill in the function `match_last_token_no_overlap()`, which should match any two markables that share the same last token, unless their spans overlap. Use the `start_token` and `end_token` members of each markable to determine whether they overlap.

* **Test:** `tests\test_coref.py:test_match_no_overlap_f1_d2_5()`
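
The overlap check itself is a standard half-open interval test. The sketch below is illustrative only: the three-field `Markable`, the helper names, and the case-insensitive comparison of last tokens are assumptions, and the spans are toy values.

```python
from collections import namedtuple

# Hypothetical trimmed-down markable: tokens plus a half-open [start_token, end_token) span.
Markable = namedtuple('Markable', ['string', 'start_token', 'end_token'])

def spans_overlap(m_a, m_b):
    """Two half-open intervals overlap iff each one starts before the other ends."""
    return m_a.start_token < m_b.end_token and m_b.start_token < m_a.end_token

def match_last_token_no_overlap_sketch(m_a, m_i):
    same_last = m_a.string[-1].lower() == m_i.string[-1].lower()
    return same_last and not spans_overlap(m_a, m_i)

outer = Markable(['the', 'president', 'of', 'the', 'United', 'States'], 0, 6)
inner = Markable(['the', 'United', 'States'], 3, 6)    # nested inside `outer`
later = Markable(['the', 'United', 'States'], 10, 13)  # a separate, later mention

nested_match = match_last_token_no_overlap_sketch(outer, inner)
later_match = match_last_token_no_overlap_sketch(inner, later)
print(nested_match, later_match)
```

Adjacent spans such as `[0, 2)` and `[2, 4)` do not overlap under this test, because `end_token` is exclusive.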

In [53]:
mltno_matcher = coref_rules.make_resolver(coref_rules.match_last_token_no_overlap)
coref.eval_on_dataset(mltno_matcher,all_markables);

F: 0.4911	R: 0.4965	P:0.4858


Both recall and precision increase. Why would recall increase? The restriction does not create any new coreference links, but it changes some incorrect links to correct links. This increases the number of true positives and reduces the number of false negatives.

In [54]:
coref_metrics(mltno_matcher, all_markables);

B-Cubed: 0.5571	CEAF: 0.4610	MUC: 0.5473	Average: 0.5218


Almost back to the results from the exact matcher.

In [55]:
! nosetests tests/test_coref.py:test_match_no_overlap_f1_d2_5

.
----------------------------------------------------------------------
Ran 1 test in 1.281s

OK


## Error analysis

To see whether we can do even better, let's try some error analysis on a specific file.

In [56]:
# predicted antecedent series
markables_17, _ = coref.read_data('17_wsj_0072.sty',basedir=tr_dir)
ant = coref_rules.make_resolver(coref_rules.match_last_token_no_overlap)(markables_17)

In [57]:
# let's look at large entities
m2e, e2m = coref.markables_to_entities(markables_17,ant)
big_entities = [ent for ent, vals in e2m.items() if len(vals) > 10]

In [58]:
for entity in big_entities:
    print(f'Entity {entity}: {len(e2m[entity])} mentions')
    print([' '.join(markables_17[idx].string) for idx in e2m[entity]])
    print()

Entity 8: 11 mentions
['Fed', 'Kansas City Fed', 'the Fed', 'the Fed', 'The Fed', 'The report from the Fed', 'the Fed', 'The Philadelphia Fed', 'Fed', 'Fed', 'regional Fed']



## Incorporating parts of speech

One clear mistake is that we are matching ''Kansas City Fed'' to ''The Philadelphia Fed'' and other ''Fed''s. The last token heuristic is the culprit: in this case, the first token is a key disambiguator. Let's try a more syntactically-motivated approach. 

Instead of matching the last token (low precision) or matching on all tokens (low recall), let's try matching on all *content* words. Let's start by including only the following grammatical categories:

- Nouns (proper, common, singular, plural)
- Pronouns (including possessive)
- Adjectives (including comparative and superlative)
- Cardinal numbers

To get these categories, we can call `read_dataset()` again with the optional `tagger` argument, a part of speech tagger. We'll use NLTK for this project, which has a structured perceptron tagger on the [PTB tagset](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html). 
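
As a sketch of what "content words" means in terms of PTB tags, the filter below keeps tokens whose tag falls in the four categories listed above. The tag set, the lowercasing, and the helper name `content_words` are assumptions for illustration; the tags on the "Fed" examples are assigned by hand rather than produced by the tagger.

```python
# PTB tags for nouns, pronouns, adjectives and cardinal numbers (assumed mapping).
CONTENT_TAGS = {'NN', 'NNS', 'NNP', 'NNPS',   # nouns
                'PRP', 'PRP$',                # pronouns, including possessive
                'JJ', 'JJR', 'JJS',           # adjectives
                'CD'}                         # cardinal numbers

def content_words(tokens, tags):
    """Lowercased tokens whose POS tag is one of the content categories."""
    return [tok.lower() for tok, tag in zip(tokens, tags) if tag in CONTENT_TAGS]

deficit = content_words(['a', 'trade', 'deficit', 'of', '$', '101', 'million'],
                        ['DT', 'NN', 'NN', 'IN', '$', 'CD', 'CD'])

# Hand-tagged examples: the determiner is dropped, the city name is kept.
the_fed = content_words(['the', 'Fed'], ['DT', 'NNP'])
The_Fed = content_words(['The', 'Fed'], ['DT', 'NNP'])
kc_fed = content_words(['Kansas', 'City', 'Fed'], ['NNP', 'NNP', 'NNP'])
philly_fed = content_words(['The', 'Philadelphia', 'Fed'], ['DT', 'NNP', 'NNP'])

print(deficit)
print(the_fed == The_Fed, kc_fed == philly_fed)
```

Under this filter "the Fed" and "The Fed" still match, while "Kansas City Fed" and "The Philadelphia Fed" no longer do.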

In [59]:
all_markables, _ = coref.read_dataset(tr_dir, tagger=pos_tag)
all_markables_dev, all_words_dev = coref.read_dataset(dv_dir, tagger=pos_tag)
all_markables_te, all_words_test = coref.read_dataset(te_dir, tagger=pos_tag)

In [60]:
all_markables[1][1]

Markable(string=['a', 'trade', 'deficit', 'of', '$', '101', 'million'], entity='set_972', start_token=3, end_token=10, tags=['DT', 'NN', 'NN', 'IN', '$', 'CD', 'CD'])

As you can see, the markables now have the `tags` member populated with the part-of-speech tags for each token in the `string` field.

**Deliverable 2.6** (3 points): Now implement a new matcher, `coref_rules.match_on_content()`. Your code should match $m_a$ and $m_i$ iff all content words are identical. It should also enforce the "no overlap" restriction defined above.

* **Test:** `tests\test_coref.py:test_match_content_f1_d2_6()`

In [61]:
content_matcher = coref_rules.make_resolver(coref_rules.match_on_content)
coref.eval_on_dataset(content_matcher, all_markables);

F: 0.5565	R: 0.4374	P:0.7647


In [62]:
coref_metrics(content_matcher, all_markables);

B-Cubed: 0.5872	CEAF: 0.4725	MUC: 0.5727	Average: 0.5441


In [63]:
! nosetests tests/test_coref.py:test_match_content_f1_d2_6

.
----------------------------------------------------------------------
Ran 1 test in 1.586s

OK


Finally making some headway on those metrics!

**Deliverable 2.7** (1.5 points): Run the code blocks below to output predictions for the dev and test data.

* **Test:** `tests\test_coref.py:test_dev_acc_f1_d2_7(), test_test_acc_f1_d2_7()`

In [64]:
!mkdir -p predictions

In [65]:
coref.write_predictions(coref_rules.make_resolver(coref_rules.match_on_content),
                        all_markables_dev,
                        'predictions/rules-dev.preds')

In [66]:
f,r,p = coref.eval_predictions('predictions/rules-dev.preds',all_markables_dev);

F: 0.5463	R: 0.3960	P:0.8806


In [67]:
coref_metrics(content_matcher, all_markables_dev);

B-Cubed: 0.5768	CEAF: 0.4045	MUC: 0.5463	Average: 0.5092


In [68]:
coref.write_predictions(coref_rules.make_resolver(coref_rules.match_on_content),
                        all_markables_te,
                        'predictions/rules-test.preds')

In [None]:
# students can't run this (it'll match against a full-cluster match)
coref.eval_predictions('predictions/rules-test.preds', all_markables_te);
coref_metrics(content_matcher, all_markables_te);

F: 0.3797	R: 0.2344	P:1.0000
B-Cubed: 0.0228	CEAF: 0.0071	MUC: 0.3797	Average: 0.1365


In [70]:
! nosetests tests/test_coref.py:test_dev_acc_f1_d2_7

.
----------------------------------------------------------------------
Ran 1 test in 0.926s

OK


# 3. Machine learning for coreference resolution (19.5 points)

You will now implement coreference resolution using the mention-ranking model. Let's start by implementing some features.

**Deliverable 3.1** (1.5 points): Implement `coref_features.minimal_features`, using the rules you wrote in `coref_rules`. This should be a function that takes a list of markables and indices for two mentions, and returns a dict with features and counts. Include the following features:

- `exact-match`
- `last-token-match`
- `content-match`
- `crossover`: value of 1 iff the mentions overlap
- `new-entity`: value of 1 iff i=j

For the first four features, you should call your code from coref_rules directly.

* **Test:** `tests\test_coref.py:test_minimal_features_d3_1()`

In [71]:
min_features = ['exact-match', 'last-token-match', 'content-match', 'crossover', 'new-entity']
for i, markable in enumerate(all_markables[1][:15]):
    print(i, markable)

0 Markable(string=['South', 'Korea'], entity='set_971', start_token=0, end_token=2, tags=['NNP', 'NNP'])
1 Markable(string=['a', 'trade', 'deficit', 'of', '$', '101', 'million'], entity='set_972', start_token=3, end_token=10, tags=['DT', 'NN', 'NN', 'IN', '$', 'CD', 'CD'])
2 Markable(string=['October'], entity='set_973', start_token=11, end_token=12, tags=['NNP'])
3 Markable(string=['the', 'country', "'s", 'economic', 'sluggishness'], entity='set_974', start_token=14, end_token=19, tags=['DT', 'NN', 'POS', 'JJ', 'NN'])
4 Markable(string=['country'], entity='set_971', start_token=15, end_token=16, tags=['NN'])
5 Markable(string=['government', 'figures'], entity='set_975', start_token=22, end_token=24, tags=['NN', 'NNS'])
6 Markable(string=['Wednesday'], entity='set_976', start_token=25, end_token=26, tags=['NNP'])
7 Markable(string=['Preliminary', 'tallies'], entity='set_977', start_token=27, end_token=29, tags=['JJ', 'NNS'])
8 Markable(string=['the', 'Trade', 'and', 'Industry', 'Minist

In [72]:
print(coref_features.minimal_features(all_markables[1],0,1))
print(coref_features.minimal_features(all_markables[1],0,13))
print(coref_features.minimal_features(all_markables[1],13,14))
print(coref_features.minimal_features(all_markables[1],6,6))
print(coref_features.minimal_features(all_markables[1],2,10))

defaultdict(<class 'float'>, {})
defaultdict(<class 'float'>, {'exact-match': 1, 'last-token-match': 1, 'content-match': 1})
defaultdict(<class 'float'>, {'crossover': 1})
defaultdict(<class 'float'>, {'new-entity': 1})
defaultdict(<class 'float'>, {'exact-match': 1, 'last-token-match': 1, 'content-match': 1})


In [73]:
! nosetests tests/test_coref.py:test_minimal_features_d3_1

.
----------------------------------------------------------------------
Ran 1 test in 0.926s

OK


**Deliverable 3.2** (6 points): You will now use these features in a simple feedforward neural net. Using PyTorch, implement the `__init__()` and `forward()` functions in `coref_learning.FFCoref`, which will be composed of two linear layers separated by a tanh nonlinearity, producing a score for each possible antecedent.

Later we will use this scoring function to select the most probable antecedent.

* **Test:** `tests\test_neural_coref.py:test_ffcoref_d3_2()`

In [74]:
COREF_FF_HIDDEN = 5 # dimension for hidden layer

In [75]:
torch.manual_seed(1984) # DO NOT CHANGE
coref_ff = coref_learning.FFCoref(min_features, COREF_FF_HIDDEN)

# scores for single mention pairs, no backprop
print(coref_ff(coref_features.minimal_features(all_markables[1],0,1)))
print(coref_ff(coref_features.minimal_features(all_markables[1],0,13)))
print(coref_ff(coref_features.minimal_features(all_markables[1],2,10)))

tensor([-0.1148], grad_fn=<AddBackward0>)
tensor([-0.3763], grad_fn=<AddBackward0>)
tensor([-0.3763], grad_fn=<AddBackward0>)


In [76]:
! nosetests tests/test_neural_coref.py:test_ffcoref_d3_2

.
----------------------------------------------------------------------
Ran 1 test in 0.004s

OK


**Deliverable 3.3** (3 points): Implement `FFCoref.score_instance()` to score all the possible antecedents given a markable.

* **Test:** `tests\test_neural_coref.py:test_ffcoref_score_instance_d3_3()`

In [77]:
i_scores = coref_ff.score_instance(all_markables[1], coref_features.minimal_features, 13)
print(i_scores)

tensor([[-0.3763, -0.1148, -0.1148, -0.1148, -0.1148, -0.1148, -0.1148, -0.1148,
         -0.1148, -0.1148, -0.1148, -0.1148, -0.1148, -0.0424]],
       grad_fn=<ViewBackward>)


In [78]:
! nosetests tests/test_neural_coref.py:test_ffcoref_score_instance_d3_3

.
----------------------------------------------------------------------
Ran 1 test in 0.004s

OK


At inference time, all we need is to use the above function and report the highest-scoring antecedent for each markable. At training time, we will use an objective based on the **hinge margin loss**. Our variant requires the highest-scoring false candidate (according to the true entity annotations) to score lower than the highest-scoring true candidate *by a margin*:

$$L_{m_i} = \{ \text{max}_{a:m_a\notin A(m_i)} s(m_i,m_a) + M - \text{max}_{a:m_a\in A(m_i)} s(m_i,m_a) \}_{+}$$

Where $s$ is the score from the previous deliverable, $A(m)$ denotes the set of true antecedents for $m$, $M$ is our margin, and the $+$ subscript indicates that negative values are replaced by $0$ (since this is a hinge loss).
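
The loss can be worked through in plain Python floats. In the assignment the scores would be PyTorch tensors so that gradients can flow, which this sketch omits. The scores are copied from the `score_instance` output above; the choice of positions 0 and 4 as the true antecedents is an assumption for illustration, and `hinge_margin_loss` is a hypothetical helper that does not handle the case where one of the two candidate sets is empty.

```python
def hinge_margin_loss(scores, true_positions, margin=1.0):
    """max(0, best false score + margin - best true score) over one markable's candidates."""
    best_true = max(s for a, s in enumerate(scores) if a in true_positions)
    best_false = max(s for a, s in enumerate(scores) if a not in true_positions)
    return max(0.0, best_false + margin - best_true)

# Candidate scores for markable 13: positions 0..12 are earlier mentions,
# position 13 is the "new entity" option.
scores = [-0.3763] + [-0.1148] * 12 + [-0.0424]
loss = hinge_margin_loss(scores, true_positions={0, 4}, margin=1.0)

# A case where the true candidate already wins by more than the margin.
satisfied = hinge_margin_loss([2.0, 0.5], true_positions={0}, margin=1.0)

print(round(loss, 4), satisfied)
```

Here the best true score is $-0.1148$ and the best false score is $-0.0424$, so the loss is $-0.0424 + 1 - (-0.1148) = 1.0724$. The loss drops to zero only once the best true candidate outscores every false candidate by at least $M$.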

**Deliverable 3.4** (3 points):
Implement the helper function `FFCoref.instance_top_scores()` which supplies the arguments for this loss function.
**Note** the special cases where:
- Only true candidates exist (we're in the first cluster) - the trainer will have to skip these. Return `None`s.
- Only false candidates exist - this actually means we're in a new cluster.


* **Test:** `tests\test_neural_coref.py:test_ffcoref_score_instance_top_scores_d3_4()`

In [79]:
best_true_score, best_false_score = coref_ff.instance_top_scores(all_markables[1], coref_features.minimal_features, 13, 4)
print(torch.cat([best_true_score.view(1), best_false_score.view(1)], 0))

tensor([-0.1148, -0.0424], grad_fn=<CatBackward>)


Looks like we're ready to train our classifier!

In [80]:
torch.manual_seed(1984) # DO NOT CHANGE
coref_ff = coref_learning.FFCoref(min_features, COREF_FF_HIDDEN)
optimizer = optim.SGD(coref_ff.parameters(), lr=ETA_0)
coref_learning.train(coref_ff, optimizer, all_markables, coref_features.minimal_features, margin=1.0, epochs=2)

Loss = 0.6110637256505783
Loss = 0.4984031049442634


In [81]:
# training set results
coref_learning.evaluate(coref_ff, all_markables, coref_features.minimal_features)

F: 0.5148	R: 0.3944	P:0.7407


In [82]:
# dev set
coref_learning.evaluate(coref_ff, all_markables_dev, coref_features.minimal_features)

F: 0.5258	R: 0.3758	P:0.8750


In [83]:
# standard metrics on training set
ff_matcher = coref_learning.make_resolver(coref_features.minimal_features, coref_ff)
coref_metrics(ff_matcher, all_markables);

B-Cubed: 0.5735	CEAF: 0.4589	MUC: 0.5572	Average: 0.5299


In [None]:
# test set, students can't run
coref_metrics(ff_matcher, all_markables_te);

B-Cubed: 0.0214	CEAF: 0.0072	MUC: 0.3508	Average: 0.1265


In [84]:
! nosetests tests/test_neural_coref.py:test_ffcoref_score_instance_top_scores_d3_4

.
----------------------------------------------------------------------
Ran 1 test in 0.004s

OK


**Deliverable 3.5** (3 points): We can add more features to try and better capture relations between markables.

Implement distance features in `coref_features.distance_features()`, measuring the mention distance and the token distance. Specifically:

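- **Mention distance** is the difference between the mention indices of i and j, $i-j$, so adjacent mentions have a distance of 1.
- **Token distance** is the number of tokens between the start of i and the end of j, i.e., i's `start_token` minus j's `end_token`.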

These should be binary features, up to a maximum distance of 10 for tokens / 5 for mentions, with the final feature indicating distance of 10/5 and above, respectively. The desired behavior is shown below.

* **Test:** `tests\test_coref.py:test_distance_features_d3_5()`

In [91]:
for i, markable_i in enumerate(all_markables[1][:4]):
    print(i, markable_i)

0 Markable(string=['South', 'Korea'], entity='set_971', start_token=0, end_token=2, tags=['NNP', 'NNP'])
1 Markable(string=['a', 'trade', 'deficit', 'of', '$', '101', 'million'], entity='set_972', start_token=3, end_token=10, tags=['DT', 'NN', 'NN', 'IN', '$', 'CD', 'CD'])
2 Markable(string=['October'], entity='set_973', start_token=11, end_token=12, tags=['NNP'])
3 Markable(string=['the', 'country', "'s", 'economic', 'sluggishness'], entity='set_974', start_token=14, end_token=19, tags=['DT', 'NN', 'POS', 'JJ', 'NN'])


In [92]:
print(coref_features.distance_features(all_markables[1],0,0))
print(coref_features.distance_features(all_markables[1],0,1))
print(coref_features.distance_features(all_markables[1],0,2))
print(coref_features.distance_features(all_markables[1],1,3))
print(coref_features.distance_features(all_markables[1],0,30))

defaultdict(<class 'float'>, {})
defaultdict(<class 'float'>, {'mention-distance-1': 1, 'token-distance-1': 1})
defaultdict(<class 'float'>, {'mention-distance-2': 1, 'token-distance-9': 1})
defaultdict(<class 'float'>, {'mention-distance-2': 1, 'token-distance-4': 1})
defaultdict(<class 'float'>, {'mention-distance-5': 1, 'token-distance-10': 1})
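
One way to read the outputs above is sketched below. This is not the reference implementation: `distance_features_sketch` and the `Markable` namedtuple are hypothetical stand-ins, and the rule that non-positive distances produce no feature is an assumption inferred from the empty result for the pair `(0, 0)`.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in for the assignment's Markable type (field names copied from the printout above).
Markable = namedtuple("Markable", ["string", "entity", "start_token", "end_token", "tags"])

MAX_MENTION_DISTANCE = 5   # the "5 and above" bucket
MAX_TOKEN_DISTANCE = 10    # the "10 and above" bucket


def distance_features_sketch(markables, a, i):
    """Binary distance features for antecedent index `a` and mention index `i`."""
    feats = defaultdict(float)
    mention_dist = i - a
    token_dist = markables[i].start_token - markables[a].end_token
    # Assumption: non-positive distances (e.g. a == i) produce no feature at all.
    if mention_dist > 0:
        feats[f"mention-distance-{min(mention_dist, MAX_MENTION_DISTANCE)}"] = 1
    if token_dist > 0:
        feats[f"token-distance-{min(token_dist, MAX_TOKEN_DISTANCE)}"] = 1
    return feats


# The first four markables of the document printed above.
doc = [
    Markable(["South", "Korea"], "set_971", 0, 2, ["NNP", "NNP"]),
    Markable(["a", "trade", "deficit", "of", "$", "101", "million"], "set_972", 3, 10,
             ["DT", "NN", "NN", "IN", "$", "CD", "CD"]),
    Markable(["October"], "set_973", 11, 12, ["NNP"]),
    Markable(["the", "country", "'s", "economic", "sluggishness"], "set_974", 14, 19,
             ["DT", "NN", "POS", "JJ", "NN"]),
]

for a, i in [(0, 0), (0, 1), (0, 2), (1, 3)]:
    print(a, i, dict(distance_features_sketch(doc, a, i)))
```

Capping with `min` keeps the feature space fixed at 5 mention buckets and 10 token buckets, which matters later when the feature names are listed explicitly for the network's input layer.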


In [93]:
! nosetests tests/test_coref.py:test_distance_features_d3_5

.
----------------------------------------------------------------------
Ran 1 test in 0.899s

OK


**Deliverable 3.6** (3 points): Implement `coref_features.make_feature_union()`, which should take a list of feature functions, and return a function that computes the union of all features in the list. You can assume the feature functions don't use the same name for any feature.

* **Test:** `tests\test_coref.py:test_feature_union_d3_6()`
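
The idea of a feature union can be sketched as a closure over the list of feature functions. The names `make_feature_union_sketch`, `toy_match`, and `toy_new_entity` are hypothetical, and the toy functions only mimic the calling convention `(markables, a, i)` used in this notebook.

```python
from collections import defaultdict


def make_feature_union_sketch(feat_funcs):
    """Return a function that merges the outputs of every function in feat_funcs."""
    def union(markables, a, i):
        feats = defaultdict(float)
        for func in feat_funcs:
            # Feature names are assumed to be disjoint, so a plain update is enough.
            feats.update(func(markables, a, i))
        return feats
    return union


# Toy feature functions, for illustration only.
def toy_match(markables, a, i):
    return {"exact-match": 1} if a != i and markables[a] == markables[i] else {}


def toy_new_entity(markables, a, i):
    return {"new-entity": 1} if a == i else {}


joint = make_feature_union_sketch([toy_match, toy_new_entity])
toy_doc = ["the bank", "it", "the bank"]
print(dict(joint(toy_doc, 0, 2)))
print(dict(joint(toy_doc, 1, 1)))
```

Returning a function rather than a feature dictionary means the union can be passed anywhere a single feature function is expected, such as the training and evaluation calls.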

In [94]:
joint_feats = coref_features.make_feature_union([coref_features.minimal_features,
                                                 coref_features.distance_features])

In [95]:
print(joint_feats(all_markables[1],1,3))
print(joint_feats(all_markables[1],0,3))
print(joint_feats(all_markables[1],0,7))
print(joint_feats(all_markables[1],10,10))

defaultdict(<class 'float'>, {'mention-distance-2': 1, 'token-distance-4': 1})
defaultdict(<class 'float'>, {'mention-distance-3': 1, 'token-distance-10': 1})
defaultdict(<class 'float'>, {'mention-distance-5': 1, 'token-distance-10': 1})
defaultdict(<class 'float'>, {'new-entity': 1})


In [96]:
min_features_and_distances = min_features\
                                   + [f'mention-distance-{i}' for i in range(1,6)]\
                                   + [f'token-distance-{i}' for i in range(1,11)]

In [97]:
! nosetests tests/test_coref.py:test_feature_union_d3_6

.
----------------------------------------------------------------------
Ran 1 test in 0.908s

OK


In [98]:
torch.manual_seed(1984) # DO NOT CHANGE
coref_ff_w_distances = coref_learning.FFCoref(min_features_and_distances, 50) # we need more hidden units now
optimizer = optim.SGD(coref_ff_w_distances.parameters(), lr=ETA_0)
coref_learning.train(coref_ff_w_distances, optimizer, all_markables, joint_feats, epochs=2)

Loss = 0.5670151704351643
Loss = 0.5090080360410884


In [99]:
coref_learning.evaluate(coref_ff_w_distances, all_markables, joint_feats)

F: 0.5283	R: 0.4002	P:0.7770


In [100]:
ff_w_dist_matcher = coref_learning.make_resolver(joint_feats, coref_ff_w_distances)
coref_metrics(ff_w_dist_matcher, all_markables);

B-Cubed: 0.5756	CEAF: 0.4551	MUC: 0.5605	Average: 0.5304


Note that our basic F metric got slightly better, while the average standard metric stayed about the same.

In [101]:
coref.write_predictions(ff_w_dist_matcher,
                        all_markables_dev,
                        'predictions/ff-dev.preds')
coref.eval_predictions('predictions/ff-dev.preds', all_markables_dev);

F: 0.5308	R: 0.3758	P:0.9032


In [102]:
coref_metrics(ff_w_dist_matcher, all_markables_dev);

B-Cubed: 0.5653	CEAF: 0.3966	MUC: 0.5403	Average: 0.5007


In [103]:
coref.write_predictions(ff_w_dist_matcher,
                        all_markables_te,
                        'predictions/ff-test.preds')

In [None]:
# students can't run this
coref.eval_predictions('predictions/ff-test.preds', all_markables_te);
coref_metrics(ff_w_dist_matcher, all_markables_te);

F: 0.4621	R: 0.3378	P:0.7310
B-Cubed: 0.5329	CEAF: 0.4206	MUC: 0.4769	Average: 0.4768


# 4. Sequential Text Representation (22.5 points)

In this section, we will find out whether neural representations of our text can help find coreferents.

The main idea is to run a bidirectional LSTM model, which you already have implemented from previous problem sets, and use the resulting hidden states to form representations of the markables. These will be fed into a feedforward classifier similar to the one from the previous section, except that the match features will also be embedded.

In [104]:
# Preparing the vocabulary for a word-to-index dictionary necessary for the initial embeddings table
vocab = set()
for doc in all_words + all_words_dev + all_words_test:
    vocab.update(doc)
vocab = sorted(list(vocab))
word_to_ix = {w:i for i,w in enumerate(vocab)}
print(len(vocab))

3530


**Deliverable 4.1** (6 points):
The first part will be very similar to code from the last homework, which you may reuse. Implement `neural_net.BiLSTMWordEmbedding` as a word embedding lookup table followed by a bi-directional LSTM which runs on a text (here, the entire document) and outputs the hidden state from the LSTM as a contextual embedding for each word in it.

* **Test:** `tests\test_neural_coref.py:test_bilstm_embedding_d4_1()`

In [105]:
WORD_EMB_DIM = 64
WORD_LSTM_EMB_DIM = 128

In [106]:
torch.manual_seed(1984)
word_lstm = neural_net.BiLSTMWordEmbedding(word_to_ix, WORD_EMB_DIM, WORD_LSTM_EMB_DIM, 1, 0.5)
embs = word_lstm(all_words[0])
print(' '.join(all_words[0][:17] + ['...']), '\n')
print(all_words[0][4], '\n', embs[4][0][:5])
print(all_words[0][13], '\n', embs[13][0][:5])

McDermott International Inc. said its Babcock & Wilcox unit completed the sale of its Bailey Controls Operations ... 

its 
 tensor([-0.0129, -0.1065,  0.1406, -0.4480, -0.0298], grad_fn=<SliceBackward>)
its 
 tensor([-3.4436e-04, -6.9696e-02,  8.9008e-02, -4.5312e-01, -2.8312e-02],
       grad_fn=<SliceBackward>)


In [107]:
! nosetests tests/test_neural_coref.py:test_bilstm_embedding_d4_1

.
----------------------------------------------------------------------
Ran 1 test in 0.011s

OK


We see how the same word type (*its*) is assigned different embeddings based on its context in the document.

## Attention Model

Our markable embeddings will be trained using an **Attention Model**, which accepts the embeddings for the words $\{w_i\}$ in a markable and outputs a single vector that "attends" to the individual word vectors according to what it believes is their importance.
This concept, [originally used](https://arxiv.org/abs/1409.0473) for sequence-to-sequence models such as Machine Translation, is applied to our task as yet another attempt to find the crucial part of a markable, as we did with the head-finding heuristic and with content-word matching. While those were "hard" techniques, making a yes-or-no decision for each token, here we apply a "soft" weighting that still assigns every word in the markable some significance.

Practically, our model will train a vector parameter of the same embedding size as the BiLSTM output, $\vec{u}$. Each word's embedding in the input span $\vec{e_i}$ will be multiplied (dot-product) with $\vec{u}$ to assign it a scalar weight, $a_i$ (i.e., passing $\vec{e_i}$ through a linear layer with 1-d output). Finally each embedding will be multiplied by its normalized (softmaxed) weight $\alpha_i$, and the sum of these weighted vectors will be our markable's output embedding, $\vec{e_m}$:

$$a_i = \vec{u} \cdot \vec{e_i} + u_b$$

$$\vec{\alpha} = \text{Softmax}(\vec{a})$$

$$\vec{e_m} = \sum_i \alpha_i \vec{e_i}$$
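
The three equations can be traced by hand with a small framework-free sketch. It uses plain Python lists instead of tensors, so it shows only the arithmetic and none of the learning; the function name `attend` and the toy vectors are assumptions made for illustration.

```python
import math


def attend(word_embs, u, u_b=0.0):
    """Return (e_m, alphas) for a list of word embeddings and an attention vector u."""
    # a_i = u . e_i + u_b
    scores = [sum(uk * ek for uk, ek in zip(u, e)) + u_b for e in word_embs]
    # alpha = Softmax(a), shifted by the maximum score for numerical stability
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alphas = [x / total for x in exps]
    # e_m = sum_i alpha_i * e_i
    dim = len(word_embs[0])
    e_m = [sum(alpha * e[k] for alpha, e in zip(alphas, word_embs)) for k in range(dim)]
    return e_m, alphas


# Two words that score equally under u are simply averaged.
e_m, alphas = attend([[1.0, 2.0], [3.0, 4.0]], u=[0.0, 0.0])
print(e_m, alphas)

# A one-word markable gets weight 1, so its embedding passes through unchanged.
print(attend([[0.5, -0.5]], u=[2.0, 1.0]))
```

The one-word case is a useful sanity check for your own implementation: the attended embedding of a single-token markable should equal that token's BiLSTM embedding.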

**Deliverable 4.2** (6 points):
Implement `neural_net.AttentionBasedMarkableEmbedding`.

* **Test:** `tests\test_neural_coref.py:test_embedding_attention_d4_2()`

In [108]:
torch.manual_seed(1984)
attn_layer = neural_net.AttentionBasedMarkableEmbedding(WORD_LSTM_EMB_DIM)
mark_embs = [attn_layer(embs, m) for m in all_markables[0]]
for j in [0, 1, 4]:
    print(all_markables[0][j].string, all_markables[0][j].entity, '\n', mark_embs[j][:3])

['McDermott', 'International', 'Inc.'] set_356 
 tensor([0.1076, 0.1651, 0.1864], grad_fn=<SliceBackward>)
['its'] set_356 
 tensor([-0.0129, -0.1065,  0.1406], grad_fn=<SliceBackward>)
['its'] set_357 
 tensor([-0.0003, -0.0697,  0.0890], grad_fn=<SliceBackward>)


In [109]:
! nosetests tests/test_neural_coref.py:test_embedding_attention_d4_2

.
----------------------------------------------------------------------
Ran 1 test in 0.003s

OK


We will score a markable pair based on the following features:

1. Each markable's attended embedding
1. A low-dimensional embedding for each of the pairwise features we've extracted in the previous sections. Since they are boolean, each will have one embedding for its "false" state and one for its "true" state.

First, let's implement a quick extractor for positive-valued features from a mention pair.

In [110]:
def get_positive_feats(doc, i, a, feats=coref_features.minimal_features):
    return [k for k,v in feats(doc,i,a).items() if v > 0.0]

In [111]:
get_positive_feats(all_markables[1], 0, 13)

['exact-match', 'last-token-match', 'content-match']

Now we will concatenate all of these embeddings together and use them as input in a two-layer feedforward network (with ReLU nonlinearity) which will produce a scalar score for markable match.
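
To picture the layout of the scorer's input, the sketch below builds the concatenated vector with plain lists. Everything here is hypothetical: the helper `build_scorer_input`, the four feature names, and the tiny hand-written "embeddings" (in the real model these are learned parameters and the vectors are tensors).

```python
def build_scorer_input(emb_i, emb_a, pos_feats, feat_names, feat_tables):
    """Concatenate two markable embeddings with one small embedding per pairwise feature."""
    pieces = list(emb_i) + list(emb_a)
    active = set(pos_feats)
    for name in feat_names:
        false_vec, true_vec = feat_tables[name]
        # Every feature contributes a vector, whether it fired or not.
        pieces.extend(true_vec if name in active else false_vec)
    return pieces


# Hypothetical feature inventory and 2-dimensional feature embeddings.
feat_names = ["exact-match", "last-token-match", "content-match", "new-entity"]
feat_tables = {name: ([0.0, float(k)], [1.0, float(k)]) for k, name in enumerate(feat_names)}

emb_i = [0.1, 0.2, 0.3]   # attended embedding of the current markable
emb_a = [0.4, 0.5, 0.6]   # attended embedding of the candidate antecedent
x = build_scorer_input(emb_i, emb_a, ["exact-match", "content-match"], feat_names, feat_tables)
print(len(x))
```

Under this layout the input size is twice the markable embedding dimension plus the number of features times the feature embedding dimension, which is the number the first linear layer of the feedforward network has to accept.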

**Deliverable 4.3** (4.5 points):
Implement `__init__()` and `forward()` in `neural_net.SequentialScorer`.

* **Test:** `tests\test_neural_coref.py:test_sequential_scorer_d4_3()`

In [112]:
BOOLEAN_FEATURE_DIM = 6
SCORER_HIDDEN_DIM = 164

# starting with just one document
torch.manual_seed(1984)
word_lstm1 = neural_net.BiLSTMWordEmbedding(word_to_ix, WORD_EMB_DIM, WORD_LSTM_EMB_DIM, 1, 0.5)
embs1 = word_lstm1(all_words[1])
attn_layer1 = neural_net.AttentionBasedMarkableEmbedding(WORD_LSTM_EMB_DIM)
mkbls1 = all_markables[1]
mark_embs1 = [attn_layer(embs1, m) for m in mkbls1]
scorer = neural_net.SequentialScorer(WORD_LSTM_EMB_DIM, min_features, BOOLEAN_FEATURE_DIM, SCORER_HIDDEN_DIM)
scorer(mark_embs1[13], mark_embs1[0], get_positive_feats(mkbls1, 13, 0))


tensor([-0.1786], grad_fn=<AddBackward0>)

In [113]:
! nosetests tests/test_neural_coref.py:test_sequential_scorer_d4_3

.
----------------------------------------------------------------------
Ran 1 test in 0.004s

OK


**Deliverable 4.4** (3 points):
Implement `score_instance()` and `instance_top_scores()` in `neural_net.SequentialScorer`. Their purpose is the same as that of their counterparts in `FFCoref`, but they require the extra embeddings parameter. The former will require some changes to adapt to the different `forward()`, but the latter can be identical to its counterpart in `FFCoref` if implemented correctly.

* **Test:** `tests\test_neural_coref.py:test_sequential_scorer_score_instance_d4_4(), test_sequential_scorer_instance_top_scores_d4_4()`

In [114]:
scorer.score_instance(mark_embs1, all_markables[1], 13, coref_features.minimal_features)

tensor([[-0.1786, -0.1996, -0.2356, -0.2084, -0.1870, -0.1862, -0.1949, -0.2230,
         -0.2247, -0.2132, -0.2400, -0.1906, -0.2505, -0.2377]],
       grad_fn=<CopySlices>)

In [115]:
! nosetests tests/test_neural_coref.py:test_sequential_scorer_score_instance_d4_4

.
----------------------------------------------------------------------
Ran 1 test in 0.005s

OK


In [116]:
! nosetests tests/test_neural_coref.py:test_sequential_scorer_instance_top_scores_d4_4

.
----------------------------------------------------------------------
Ran 1 test in 0.005s

OK


Due to the length of our documents and number of parameters, torch may not work properly on the entire text. We'll truncate the files before we train and use only the minimal pairwise feature space, causing our performance to be suboptimal.
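
As a rough picture of what truncation involves, the sketch below cuts a document to its first `word_limit` tokens and keeps only the markables that fit entirely inside that window. This is an assumption about the behavior, not a description of `neural_net.train()`; the helper `truncate_document` and the `Markable` namedtuple are hypothetical.

```python
from collections import namedtuple

# Hypothetical stand-in for the assignment's Markable type.
Markable = namedtuple("Markable", ["string", "entity", "start_token", "end_token", "tags"])


def truncate_document(words, markables, word_limit):
    """Keep the first word_limit tokens and the markables that end inside them."""
    kept_words = words[:word_limit]
    # end_token is treated as exclusive, as in the markables printed earlier.
    kept_markables = [m for m in markables if m.end_token <= word_limit]
    return kept_words, kept_markables


words = [f"w{k}" for k in range(20)]
markables = [
    Markable(words[0:2], "set_971", 0, 2, []),
    Markable(words[3:10], "set_972", 3, 10, []),
    Markable(words[11:12], "set_973", 11, 12, []),
    Markable(words[14:19], "set_974", 14, 19, []),
]

short_words, short_markables = truncate_document(words, markables, word_limit=12)
print(len(short_words), [m.entity for m in short_markables])
```

Dropping markables that cross the boundary avoids asking for BiLSTM states that were never computed, at the cost of losing some training pairs.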

In [117]:
torch.manual_seed(1984)
tr_word_lstm = neural_net.BiLSTMWordEmbedding(word_to_ix, WORD_EMB_DIM, WORD_LSTM_EMB_DIM, 1, 0.2)
tr_attn_layer = neural_net.AttentionBasedMarkableEmbedding(WORD_LSTM_EMB_DIM)
tr_scorer = neural_net.SequentialScorer(WORD_LSTM_EMB_DIM, min_features, BOOLEAN_FEATURE_DIM, SCORER_HIDDEN_DIM)
optimizer = optim.SGD(list(tr_word_lstm.parameters()) + list(tr_attn_layer.parameters()) + list(tr_scorer.parameters()), lr=ETA_0)
neural_net.train(tr_word_lstm, tr_attn_layer, tr_scorer,\
                 optimizer, all_words, all_markables, coref_features.minimal_features, word_limit=150, epochs=3)


Epoch 1 complete.
Document losses = 0.94565, 0.72179, 0.60166, 0.71488, 0.90926, 0.91727, 0.79121, 0.47944, 0.44778, 0.58415, 0.58744, 0.67793, 0.55884, 0.60834, 0.43940, 0.47952, 0.37664, 0.50332, 0.49127, 0.15465, 0.77573, 0.83405
Overall loss = 0.60977
Epoch 2 complete.
Document losses = 0.54433, 0.08258, 0.31583, 0.53469, 0.72477, 0.80510, 0.80130, 0.72107, 0.37884, 0.60794, 0.56670, 0.54593, 0.54602, 0.59385, 0.38264, 0.47128, 0.40817, 0.68890, 0.38611, 0.53756, 0.92980, 0.81550
Overall loss = 0.56286
Epoch 3 complete.
Document losses = 0.51926, 0.23572, 0.52952, 0.62548, 0.73789, 0.85352, 0.79971, 0.49335, 0.36500, 0.50039, 0.59024, 0.45263, 0.54059, 0.56411, 0.38056, 0.48042, 0.39043, 0.51316, 0.48574, 0.19093, 0.80251, 0.87012
Overall loss = 0.54387


Let's evaluate on the entire dataset.

In [118]:
tr_resolver = neural_net.evaluate(tr_word_lstm, tr_attn_layer, tr_scorer, all_words, all_markables, coref_features.minimal_features)

F: 0.5329	R: 0.4037	P:0.7838


In [119]:
coref_metrics(tr_resolver, all_markables);

B-Cubed: 0.5756	CEAF: 0.4551	MUC: 0.5605	Average: 0.5304


In [120]:
dv_resolver = neural_net.evaluate(tr_word_lstm, tr_attn_layer, tr_scorer, all_words_dev, all_markables_dev, coref_features.minimal_features)

F: 0.5308	R: 0.3758	P:0.9032


In [121]:
coref_metrics(dv_resolver, all_markables_dev);

B-Cubed: 0.5653	CEAF: 0.3966	MUC: 0.5403	Average: 0.5007


In [None]:
# students can't run this
te_resolver = neural_net.evaluate(tr_word_lstm, tr_attn_layer, tr_scorer, all_words_test, all_markables_te, coref_features.minimal_features)
coref_metrics(te_resolver, all_markables_te);

F: 0.4519	R: 0.3297	P:0.7176
B-Cubed: 0.5334	CEAF: 0.4225	MUC: 0.4778	Average: 0.4779


## Pretrained Word Embeddings

**Deliverable 4.5** (3 points):
Implement `utils.initialize_with_pretrained()`. Start by copying from your implementation in the last homework.

* **Test:** `tests\test_neural_coref.py:test_pretrain_embeddings_d4_5()`

**Note** that there is a new pretrained file in the `data` folder. Although from the same original source, it is trimmed to the vocabulary in our new dataset, so don't use the same file from the last homework. In addition to this attribute, it includes a special token called **&lt;UNK&gt;**. You should use its assigned vector to initialize vectors for all unknown words in the dataset.

Later, if you're interested in improving your model's performance, you may want to know that there are more special tokens with trained vectors in the data file:
* **&lt;S&gt;** signifies the beginning of a sentence.
* **&lt;/S&gt;** signifies the end of a sentence.
* **&lt;PAD&gt;** is used to pad short sentences (this one probably won't be useful).
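
The fallback to the `<UNK>` vector can be sketched independently of PyTorch. The helper below only assembles the rows that would then be copied into the embedding table; its name, `build_embedding_rows`, and the toy two-dimensional vectors are hypothetical.

```python
def build_embedding_rows(word_to_ix, pretrained, unk_token="<UNK>"):
    """Return one vector per vocabulary index, using the <UNK> vector for missing words."""
    rows = [None] * len(word_to_ix)
    for word, ix in word_to_ix.items():
        rows[ix] = list(pretrained.get(word, pretrained[unk_token]))
    return rows


# Toy data, for illustration only.
toy_pretrained = {"Fujitsu": [0.25, -0.5], "<UNK>": [9.0, 9.0]}
toy_word_to_ix = {"Fujitsu": 0, "sluggishness": 1}

rows = build_embedding_rows(toy_word_to_ix, toy_pretrained)
print(rows)
```

Indexing the rows by `word_to_ix` keeps them aligned with the lookup table, so row `k` always initializes the embedding of the word whose index is `k`.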

In [122]:
pret_embs = pickle.load(open('data/pretrained-embeds-coref.pkl', 'rb'))

In [123]:
print(pret_embs['Fujitsu'][:5])

[ 0.34177107 -0.1528549  -0.06971747  0.12536518  0.31670848]


In [124]:
print(pret_embs['<UNK>'][:5])

[-0.2400684   0.0170402  -0.5328812   0.16161029 -0.03450763]


In [125]:
torch.manual_seed(1984)
tr_pt_word_lstm = neural_net.BiLSTMWordEmbedding(word_to_ix, WORD_EMB_DIM, WORD_LSTM_EMB_DIM, 1, 0.2)
utils.initialize_with_pretrained(pret_embs, tr_pt_word_lstm)
tr_pt_attn_layer = neural_net.AttentionBasedMarkableEmbedding(WORD_LSTM_EMB_DIM)
tr_pt_scorer = neural_net.SequentialScorer(WORD_LSTM_EMB_DIM, min_features, BOOLEAN_FEATURE_DIM, SCORER_HIDDEN_DIM)
optimizer = optim.SGD(list(tr_pt_word_lstm.parameters()) + list(tr_pt_attn_layer.parameters()) + list(tr_pt_scorer.parameters()), lr=ETA_0)
neural_net.train(tr_pt_word_lstm, tr_pt_attn_layer, tr_pt_scorer,\
                 optimizer, all_words, all_markables, coref_features.minimal_features, word_limit=150, epochs=5)


Epoch 1 complete.
Document losses = 0.93532, 0.72020, 0.63929, 0.79194, 0.72040, 0.89231, 0.79309, 0.58134, 0.34572, 0.48719, 0.64850, 0.44828, 0.50067, 0.64606, 0.36389, 0.49039, 0.42083, 0.79277, 0.55337, 0.66727, 0.81885, 0.84035
Overall loss = 0.63736
Epoch 2 complete.
Document losses = 0.55916, 0.05312, 0.35828, 0.45841, 1.10227, 0.80828, 0.74485, 0.50687, 0.39998, 0.58693, 0.55876, 0.53564, 0.53653, 0.58264, 0.36642, 0.47296, 0.40334, 0.62587, 0.45034, 0.18482, 0.94645, 0.86991
Overall loss = 0.55066
Epoch 3 complete.
Document losses = 0.57564, 0.10515, 0.61917, 0.67002, 0.79576, 0.83516, 0.78449, 0.47714, 0.37494, 0.50172, 0.56711, 0.46329, 0.54898, 0.54713, 0.36121, 0.48616, 0.40745, 0.49241, 0.54198, 0.17694, 0.74800, 0.86620
Overall loss = 0.54376
Epoch 4 complete.
Document losses = 0.56411, 0.05279, 0.32762, 0.46465, 0.81769, 0.82422, 0.77197, 0.45029, 0.36006, 0.49921, 0.55093, 0.45243, 0.53600, 0.54701, 0.35354, 0.48786, 0.40707, 0.49983, 0.50426, 0.18343, 0.74404, 0.86567

In [126]:
tr_pt_resolver = neural_net.evaluate(tr_pt_word_lstm, tr_pt_attn_layer, tr_pt_scorer, all_words, all_markables, coref_features.minimal_features)

F: 0.5420	R: 0.4188	P:0.7681


In [127]:
coref_metrics(tr_pt_resolver, all_markables);

B-Cubed: 0.5787	CEAF: 0.4673	MUC: 0.5661	Average: 0.5374


In [128]:
tr_pt_resolver_dev = neural_net.evaluate(tr_pt_word_lstm, tr_pt_attn_layer, tr_pt_scorer, all_words_dev, all_markables_dev, coref_features.minimal_features)
coref.write_predictions(tr_pt_resolver_dev,
                        all_markables_dev,
                        'predictions/nn-dev.preds');
coref_metrics(tr_pt_resolver_dev, all_markables_dev);

F: 0.5540	R: 0.3960	P:0.9219
B-Cubed: 0.5837	CEAF: 0.4213	MUC: 0.5540	Average: 0.5196


In [129]:
# students can't run this
tr_pt_resolver_te = neural_net.evaluate(tr_pt_word_lstm, tr_pt_attn_layer, tr_pt_scorer, all_words_test, all_markables_te, coref_features.minimal_features)
coref.write_predictions(tr_pt_resolver_te,
                        all_markables_te,
                        'predictions/nn-test.preds');
coref_metrics(tr_pt_resolver_te, all_markables_te);

F: 0.3541	R: 0.2151	P:1.0000
B-Cubed: 0.0216	CEAF: 0.0072	MUC: 0.3541	Average: 0.1276


In [130]:
! nosetests tests/test_neural_coref.py:test_pretrain_embeddings_d4_5

.
----------------------------------------------------------------------
Ran 1 test in 0.003s

OK


# 5. Domain Adaptation (no deliverables)

Our dataset has a second part, which is a corpus of fairy tales, rather than news stories.

Let's take advantage of this setup to see how well our WSJ-trained model does over the fairy tale data, as opposed to a model trained on the fairy tales themselves.

In [131]:
ft_dv_dir = os.path.join('data','tales','dev')
ft_tr_dir = os.path.join('data','tales','train')
ft_te_dir = os.path.join('data','tales','test')

In [132]:
ft_all_markables, ft_all_words = coref.read_dataset(ft_tr_dir, tagger=pos_tag)
ft_all_markables_dev, ft_all_words_dev = coref.read_dataset(ft_dv_dir, tagger=pos_tag)
ft_all_markables_te, ft_all_words_te = coref.read_dataset(ft_te_dir, tagger=pos_tag)

In [133]:
coref.eval_on_dataset(exact_matcher, ft_all_markables);

F: 0.6511	R: 0.5545	P:0.7884


The exact matcher is getting much higher numbers on this dataset than on WSJ. Let's train an ML system and see the differences. We'll use `FFCoref` from section 3.

In [134]:
torch.manual_seed(1984) # DO NOT CHANGE
coref_ff_fairy = coref_learning.FFCoref(min_features_and_distances, 50)
optimizer = optim.SGD(coref_ff_fairy.parameters(), lr=ETA_0)
coref_learning.train(coref_ff_fairy, optimizer, ft_all_markables, joint_feats, epochs=2)

Loss = 0.7183681557135433
Loss = 0.677033908750819


In [135]:
# in-domain trained, tested on fairy tale data
coref_learning.evaluate(coref_ff_fairy, ft_all_markables_dev, joint_feats)

F: 0.5910	R: 0.5161	P:0.6914


In [136]:
ft_ff_matcher = coref_learning.make_resolver(joint_feats, coref_ff_fairy)
coref_metrics(ft_ff_matcher, ft_all_markables_dev);

B-Cubed: 0.5220	CEAF: 0.4773	MUC: 0.6755	Average: 0.5582


These are the **in-domain** dev set scores for the fairy tale dataset.

Recall that the in-domain numbers for the WSJ portion were the following:

In [122]:
coref_metrics(ff_w_dist_matcher, all_markables_dev);

B-Cubed: 0.5653	CEAF: 0.3966	MUC: 0.5403	Average: 0.5007


Now we can see how the **cross-domain** results look. Can a model trained on news data be reliably used in a fairy-tale setting? How about the converse?

In [137]:
# trained on fairy, tested on WSJ
coref_metrics(ft_ff_matcher, all_markables_dev);

B-Cubed: 0.5668	CEAF: 0.4798	MUC: 0.5442	Average: 0.5303


In [138]:
# trained on WSJ, tested on fairy
coref_metrics(ff_w_dist_matcher, ft_all_markables_dev);

B-Cubed: 0.4632	CEAF: 0.3597	MUC: 0.6082	Average: 0.4770


The fairy tale numbers are lower than in-domain (but not drastically so), and the WSJ numbers are even slightly higher than in-domain (though this might not be the case with better tuning).

If you're interested, you may try and join the two training sets together and see where that gets you. You can also use this extra data (only training sets!) for your bakeoff.

# 6. Final Bakeoff! (16 points)

Ideas for improvements:

- Cost-sensitive training to balance precision and recall
- Syntax (you can parse all the markables as a preprocessing step)
  - Tree distance
  - Syntactic parallelism
  - Better head matching
- Add layers
- Ensemble (average) multiple models together
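
For the ensembling idea, one simple reading is to average the antecedent scores that several trained models assign to the same mention and then pick the best candidate from the averaged scores. The sketch below assumes each model produces one score per candidate, in the same order; `average_scores` and `pick_antecedent` are hypothetical helpers, and the numbers are made up.

```python
def average_scores(score_lists):
    """Average per-candidate scores across models (one list of scores per model)."""
    n_models = len(score_lists)
    return [sum(column) / n_models for column in zip(*score_lists)]


def pick_antecedent(scores):
    """Return the index of the highest-scoring candidate."""
    return max(range(len(scores)), key=scores.__getitem__)


# Made-up scores from three models for three candidate antecedents.
model_scores = [
    [0.0, 1.0, -1.0],
    [1.0, 0.0, -1.0],
    [2.0, 0.5, -1.0],
]
avg = average_scores(model_scores)
print(avg, pick_antecedent(avg))
```

Averaging raw scores only makes sense if the models' scores are on comparable scales; models trained with different margins or losses may need their scores normalized first.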

Feel free to search the research literature to get ideas. If you use an idea from another paper, mention the paper (authors, title, and URL) in your writeup.

As usual, sometimes improvement can also come from tweaking parameters in the neural nets.

In section 4, recall that we truncated the training documents - maybe more data (or different portions of each document) can help.

To use CUDA, pass `use_cuda=True` to `neural_net.train()` and `utils.initialize_with_pretrained()`.

In [139]:
pret_embs = pickle.load(open('data/pretrained-embeds-coref.pkl', 'rb'))

In [151]:
# temp_mark = all_markables.copy()
# temp_word = all_words.copy()
# temp_mark.extend(ft_all_markables)
# temp_word.extend(ft_all_words)
# for ele in range(len(ft_all_markables)):
#   temp_mark.append(ft_all_markables[ele])
#   temp_word.append(ft_all_words[ele])

In [179]:
vocab_combined = set()
for doc in all_words + all_words_dev + all_words_test + ft_all_words + ft_all_words_dev + ft_all_words_te:
    vocab_combined.update(doc)
vocab_combined = sorted(list(vocab_combined))
word_to_ix_comb = {w:i for i,w in enumerate(vocab_combined)}
print(len(vocab_combined))

4725


In [180]:
len(ft_all_words)

20

In [181]:
# Combine the WSJ and fairy tale training sets into new lists (the originals are left untouched)
temp_mark = all_markables + ft_all_markables
temp_words = all_words + ft_all_words




In [182]:
len(temp_words)

42

In [171]:
type(all_markables)

list

In [167]:
temp_mark[0][0]

Markable(string=['McDermott', 'International', 'Inc.'], entity='set_356', start_token=0, end_token=3, tags=['NNP', 'NNP', 'NNP'])

In [173]:
import random

In [174]:
#random.shuffle(temp_mark)

In [175]:
temp_mark[0][0]

Markable(string=['Newsweek'], entity='set_1611', start_token=0, end_token=1, tags=['NNP'])

In [207]:
torch.manual_seed(1984)#1984
tr_pt_word_lstm = neural_net.BiLSTMWordEmbedding(word_to_ix_comb, WORD_EMB_DIM, WORD_LSTM_EMB_DIM, 1, 0.2)
utils.initialize_with_pretrained(pret_embs, tr_pt_word_lstm)
tr_pt_attn_layer = neural_net.AttentionBasedMarkableEmbedding(WORD_LSTM_EMB_DIM)
tr_pt_scorer = neural_net.SequentialScorer(WORD_LSTM_EMB_DIM, min_features, BOOLEAN_FEATURE_DIM, SCORER_HIDDEN_DIM)
optimizer = optim.SGD(list(tr_pt_word_lstm.parameters()) + list(tr_pt_attn_layer.parameters()) + list(tr_pt_scorer.parameters()), lr=0.002)#0.005ETA_0 SGD
neural_net.train(tr_pt_word_lstm, tr_pt_attn_layer, tr_pt_scorer,\
                 optimizer, temp_words, temp_mark, coref_features.minimal_features, word_limit=150, epochs=50)#20


Epoch 1 complete.
Document losses = 0.98316, 0.95209, 0.92554, 0.91273, 0.90314, 0.83302, 0.93350, 0.76336, 0.76597, 0.70839, 0.90354, 0.84734, 0.70620, 0.67235, 0.45028, 0.23927, 0.50218, 0.84730, 0.27545, 0.75855, 0.45032, 0.32969, 0.49980, 0.59795, 0.78163, 0.50723, 0.75095, 0.38667, 0.73013, 0.40505, 0.73602, 0.13153, 0.56790, 0.71028, 0.81196, 0.69448, 0.57888, 0.57064, 0.67893, 0.55839, 0.26957, 0.70476
Overall loss = 0.65025
Epoch 2 complete.
Document losses = 0.48138, 0.62336, 0.45208, 0.76778, 0.58190, 0.26365, 0.71507, 0.85504, 0.50984, 0.86378, 0.42082, 0.68242, 0.40391, 0.52524, 0.42322, 0.68582, 0.75239, 0.65656, 0.47898, 0.58105, 0.28565, 0.74883, 0.66872, 0.66872, 0.68132, 0.80914, 0.30326, 0.74509, 0.92715, 0.37695, 0.76670, 0.90151, 0.53730, 0.07993, 0.47123, 0.78031, 0.49799, 0.45191, 0.70526, 0.26171, 0.65227, 0.26525
Overall loss = 0.57505
Epoch 3 complete.
Document losses = 0.44410, 0.13645, 0.53143, 0.84984, 0.42880, 0.74947, 0.25895, 0.96477, 0.27067, 0.80022, 0.

In [208]:
tr_pt_resolver_bakeoff = neural_net.evaluate(tr_pt_word_lstm, tr_pt_attn_layer, tr_pt_scorer, all_words, all_markables, coref_features.minimal_features)

F: 0.5696	R: 0.5174	P:0.6335


In [209]:
coref_metrics(tr_pt_resolver_bakeoff, all_markables);

B-Cubed: 0.5915	CEAF: 0.4933	MUC: 0.6450	Average: 0.5766


In [210]:
tr_pt_resolver_dev = neural_net.evaluate(tr_pt_word_lstm, tr_pt_attn_layer, tr_pt_scorer, all_words_dev, all_markables_dev, coref_features.minimal_features)
coref.write_predictions(tr_pt_resolver_dev,
                        all_markables_dev,
                        'predictions/nn-dev-bakeoff.preds');
coref_metrics(tr_pt_resolver_dev, all_markables_dev);

F: 0.6000	R: 0.4832	P:0.7912
B-Cubed: 0.6166	CEAF: 0.4710	MUC: 0.6083	Average: 0.5653


In [211]:
# students can't run this
tr_pt_resolver_te = neural_net.evaluate(tr_pt_word_lstm, tr_pt_attn_layer, tr_pt_scorer, all_words_test, all_markables_te, coref_features.minimal_features)
coref.write_predictions(tr_pt_resolver_te,
                        all_markables_te,
                        'predictions/nn-test-bakeoff.preds');
coref_metrics(tr_pt_resolver_te, all_markables_te);

F: 0.4845	R: 0.3197	P:1.0000
B-Cubed: 0.0506	CEAF: 0.0229	MUC: 0.4845	Average: 0.1860


**Deliverable 6** (16 points):

Copy the applicable cells from section 3 or 4, whichever you choose, to output predictions for both the dev and test sets of WSJ.

Write your predictions to `predictions/bakeoff-dev.preds` and `predictions/bakeoff-test.preds`.

Scoring:

- Dev F1 > .55: 2 points
- Dev F1 > .57: 4 points
- Dev F1 > .59: 6 points
- Dev F1 > .61: 8 points
- Test F1 > .53: 2 points
- Test F1 > .54: 4 points
- Test F1 > .55: 6 points
- Test F1 > .56: 8 points

- Top 3 in class (Test F1): + 4 points (bonus)
- We'll also give 4 bonus points to particularly unique / creative / well-motivated solutions (with motivation to be included in the write-up).

# 7. Writeup (18 points)

You can start your write-up in any format you prefer (e.g., LaTeX, Markdown), but please remember to export to `pset4-writeup.pdf` upon submission. Also, you will be asked to post your writeup on Piazza after the due date (plus late days).

**Deliverable 7.1** (6 points):

Describe your bakeoff design. What worked and what didn't? Give a possible reason behind it.

**Deliverable 7.2** (8 points):

Select a research paper from ACL, EMNLP, or NAACL that focuses on **coreference resolution**. Summarize the paper, answering the following questions:

1. What are the main ideas of the paper?
2. What worked and what didn't?
3. What rationales, either given in the paper or your own, might explain the results?
4. How would you further improve over this work if you have the opportunity?

You must choose a paper in the main conference (not workshops). The paper must be at least four pages long. All papers from these conferences are available for free online: https://www.aclweb.org/anthology/.

**Deliverable 7.3** (4 points):

Pick any topic from this course that you liked or learned a lot from. Propose a related quiz question and give an answer to it. Please feel free to be creative here, and remember that we don't have final exams for this iteration of the course :)