# Word Sense Disambiguation: from start to finish
A well-known NLP task is [Word Sense Disambiguation (WSD)](https://en.wikipedia.org/wiki/Word-sense_disambiguation). The goal is to identify the sense of a word in a sentence. Here is an example of the output of one of the best systems, [Babelfy](http://babelfy.org/index):

![Babelfy output](images/babelfy_output.png "Babelfy output")

Since 1998, there have been WSD competitions: [Senseval and SemEval](https://en.wikipedia.org/wiki/SemEval). The idea is very simple. A few people annotate words in a sentence with the correct meaning, and systems try to do the same. Because we have the manual annotations, we can score how well each system performs. In this exercise, we are going to compete in [SemEval-2013 task 12: Multilingual Word Sense Disambiguation](https://www.cs.york.ac.uk/semeval-2013/task12.html).

The main steps in this exercise are:
* Introduction of the data and goals
* Performing WSD
* Loading manual annotations (which we will call **gold data**)
* Combining system input, system output, and gold data
* Write an XML file containing both the gold data and our system output
* Read the XML file, evaluate our performance, and perform error analysis

## Introduction of the data and goals

We will use the following data (originating from [SemEval-2013 task 12 test data](https://www.cs.york.ac.uk/semeval-2013/task12/data/uploads/datasets/semeval-2013-task12-test-data.zip)):

* **system input**: data/multilingual-all-words.en.xml 
* **gold data**: data/sem2013-aw.key

Given a word in a sentence, the goal of our system is to determine the correct meaning of that word. For example, look at the **system input** file (data/multilingual-all-words.en.xml) at lines 1724-1740.
Please note that the *sentence* element has *wf* and *instance* children. The *instance* elements are the ones for which we have to provide a meaning.


```xml
<sentence id="d003.s005">
    <wf lemma="frankly" pos="RB">Frankly</wf>
    <wf lemma="," pos=",">,</wf>
    <wf lemma="the" pos="DT">the</wf>
    <instance id="d003.s005.t001" lemma="market" pos="NN">market</instance>
    <wf lemma="be" pos="VBZ">is</wf>
    <wf lemma="very" pos="RB">very</wf>
    <wf lemma="calm" pos="JJ">calm</wf>
    <wf lemma="," pos=",">,</wf>
    <wf lemma="observe" pos="VVZ">observes</wf>
    <wf lemma="Mace" pos="NP">Mace</wf>
    <wf lemma="Blicksilver" pos="NP">Blicksilver</wf>
    <wf lemma="of" pos="IN">of</wf>
    <wf lemma="Marblehead" pos="NP">Marblehead</wf>
    <instance id="d003.s005.t002" lemma="asset_management" pos="NE">Asset_Management</instance>
    <wf lemma="." pos="SENT">.</wf>
  </sentence>
```

To determine the possible meanings of a word, we will use [WordNet](https://wordnet.princeton.edu/). For example, for the lemma **market**, WordNet lists the following meanings:

In [1]:
from nltk.corpus import wordnet as wn

In [2]:
for synset in wn.synsets('market', pos='n'):
    print(synset, synset.definition())

Synset('market.n.01') the world of commercial activity where goods and services are bought and sold
Synset('market.n.02') the customers for a particular product or service
Synset('grocery_store.n.01') a marketplace where groceries are sold
Synset('market.n.04') the securities markets in the aggregate
Synset('marketplace.n.02') an area in a town where a public mercantile establishment is set up


In order to know which meaning the manual annotators chose, we go to the **gold data** (data/sem2013-aw.key). For the identifier *d003.s005.t001*, we find:

d003 d003.s005.t001 market%1:14:01:: 

In order to know to which synset *market%1:14:01::* belongs, we can do the following:

In [25]:
lemma = wn.lemma_from_key('market%1:14:01::')
synset = lemma.synset()
print(synset, synset.definition())

Synset('market.n.04') the securities markets in the aggregate


Hence, the manual annotators chose **market.n.04**. 
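
The sense key itself already carries some information. The following is a minimal sketch, assuming the sense-key layout `lemma%ss_type:lex_filenum:lex_id:head_word:head_id` described in the WordNet documentation. The helper name `parse_sensekey` is hypothetical and not part of NLTK.

```python
# WordNet ss_type digits and the part-of-speech letters they stand for
SS_TYPE_TO_POS = {'1': 'n', '2': 'v', '3': 'a', '4': 'r', '5': 's'}


def parse_sensekey(sensekey):
    """Split a WordNet sense key into named fields (hypothetical helper)."""
    lemma, _, lex_sense = sensekey.partition('%')
    ss_type, lex_filenum, lex_id, head_word, head_id = lex_sense.split(':')
    return {
        'lemma': lemma,
        'pos': SS_TYPE_TO_POS[ss_type],
        'lex_filenum': lex_filenum,
        'lex_id': lex_id,
        'head_word': head_word,  # only filled for adjective satellites
        'head_id': head_id,
    }


fields = parse_sensekey('market%1:14:01::')
print(fields['lemma'], fields['pos'], fields['lex_filenum'])
```

This only splits the string; looking up the synset still requires `wn.lemma_from_key` as shown above.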

## Performing WSD
As a first step, we will perform WSD. For this, we will use the [**lesk** WSD algorithm](http://www.d.umn.edu/~tpederse/Pubs/banerjee.pdf) as implemented in [NLTK](http://www.nltk.org/howto/wsd.html). One application of the Lesk algorithm is to determine which senses of words are related. Imagine that **pine** has two senses and **cone** has three senses (example from the [paper](http://www.d.umn.edu/~tpederse/Pubs/banerjee.pdf)):

**Pine**
* Sense 1: kind of *evergreen tree* with needle-shaped leaves
* Sense 2: waste away through sorrow or illness

**Cone**
* Sense 1: solid body which narrows to a point
* Sense 2: something of this shape whether solid or hollow
* Sense 3: fruit of certain *evergreen tree*

As you can see, the definitions of **sense 1 of pine** and **sense 3 of cone** overlap, which indicates that these senses are related. This idea can then be used to perform WSD: the words in the sentence that contains the target word are compared against the definition of each sense of that word. The algorithm chooses the sense whose definition shares the most words with the sentence.
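
The overlap idea for WSD can be sketched in a few lines. This is a simplified illustration, not NLTK's implementation: the function name `simplified_lesk` and the two toy definitions are made up for this example, and no stop-word filtering or stemming is applied.

```python
def simplified_lesk(context_words, sense_definitions):
    """Return the sense whose definition shares the most words with the context."""
    context = {word.lower() for word in context_words}
    best_sense, best_overlap = None, -1
    for sense, definition in sense_definitions.items():
        # count the words that occur both in the sentence and in the definition
        overlap = len(context & set(definition.lower().split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense, best_overlap


# Toy sense inventory (hypothetical definitions, not taken from WordNet)
toy_senses = {
    'bank_river': 'sloping land beside a body of water',
    'bank_money': 'a financial institution that accepts deposits of money',
}
sentence = ['I', 'went', 'to', 'the', 'bank', 'to', 'deposit', 'money', '.']
chosen, overlap = simplified_lesk(sentence, toy_senses)
print(chosen, overlap)
```

Note that `deposit` and `deposits` do not match here, which shows how sensitive a plain word-overlap count is to surface forms.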

In [1]:
from nltk.wsd import lesk

Below is a function that performs WSD on a sentence. The output is a **WordNet sensekey**, that is, an identifier of a sense.
#### The function is given, but it is important that you understand how to call it.

In [2]:
def perform_wsd(sent, lemma, pos):
    '''
    perform WSD using the lesk algorithm as implemented in the nltk
    
    :param list sent: list of words
    :param str lemma: a lemma
    :param str pos: a pos (n | v | a | r)
    
    :rtype: str
    :return: wordnet sensekey or not_found
    '''
    sensekey = 'not_found'
    wsd_result = lesk(sent, lemma, pos)
    
    if wsd_result is not None:
        for lemma_obj in wsd_result.lemmas():
            if lemma_obj.name() == lemma:
                sensekey = lemma_obj.key()
    
    return sensekey


sent = ['I', 'went', 'to', 'the', 'bank', 'to', 'deposit', 'money', '.']
assert perform_wsd(sent, 'bank', 'n')  == 'bank%1:06:01::', 'key is %s' % perform_wsd(sent, 'bank', 'n')
assert perform_wsd(sent, 'dfsdf', 'n')  == 'not_found', 'key is %s' % perform_wsd(sent, 'dfsdf', 'n')
print(perform_wsd(sent, 'bank', 'n'))

bank%1:06:01::


## Loading manual annotations
Your job now is to load the manual annotations from 'data/sem2013-aw.key'.
Tip: you can use [**repr**](https://docs.python.org/3/library/functions.html#repr) to check which delimiter (space, tab, etc.) was used.
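
To see what the tip means in practice, here is a small sketch on two made-up lines. Both are hypothetical: they reuse the identifier shown earlier, and which delimiter the real file uses is for you to check.

```python
# Two hypothetical lines that look similar when printed but use different delimiters
line_with_spaces = 'd003 d003.s005.t001 market%1:14:01::\n'
line_with_tabs = 'd003\td003.s005.t001\tmarket%1:14:01::\n'

# repr makes the whitespace characters visible
print(repr(line_with_spaces))
print(repr(line_with_tabs))

# str.split() without arguments splits on any run of whitespace,
# so it handles both delimiters and drops the trailing newline;
# the starred name collects one or more sensekeys
doc_id, instance_id, *sensekeys = line_with_tabs.split()
print(instance_id, sensekeys)
```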

In [3]:
def load_gold_data(path_to_gold_key):
    '''
    given the path to gold data of semeval2013 task 12,
    this function creates a dictionary mapping the identifier to the
    gold answers
    
    HINT: sometimes, there is more than one sensekey for an identifier
    
    :param str path_to_gold_key: path to where gold data file is stored
    
    :rtype: dict
    :return: identifier (str) -> goldkeys (set)
    '''
    gold = {} 
    with open(path_to_gold_key) as infile:
        pass  # complete this part: read each line and fill `gold`
    
    return gold

Please check that your function works correctly by running the cell below.

In [4]:
gold = load_gold_data('data/sem2013-aw.key')
assert len(gold) == 1644, 'number of gold items is %s' % len(gold)

## Combining system input + system output + gold data
We are going to create a dictionary that looks like this:
```python
{10: {'sent_id' : 1,
      'text': 'banks',
      'lemma' : 'bank',
      'pos' : 'n',
      'instance_id' : 'd003.s005.t001',
      'gold_keys' : {'bank%1:14:00::'},
      'system_key' : 'bank%1:14:00::'}
    }
```

Combining all relevant information in one dictionary will help us to create the NAF XML file.
In order to do this, we will write several functions. To work with XML, we will first import the lxml module.

In [5]:
from lxml import etree

In [2]:
def load_sentences(semeval_2013_input):
    '''
    given the path to the semeval input xml,
    this function creates a dictionary mapping sentence identifier
    to the sentence (list of words)
    
    HINT: you need the text of all:
    text/sentence/instance and text/sentence/wf elements
    
    :param str semeval_2013_input: path to semeval 2013 input xml
    
    :rtype: dict
    :return: mapping sentence identifier -> list of words
    '''
    sentences = dict()
    
    doc = etree.parse(semeval_2013_input)
    
    # insert code here
    
    return sentences

Please check that your function works by running the cell below.

In [31]:
sentences = load_sentences('data/multilingual-all-words.en.xml')
assert len(sentences) == 306, 'number of sentences is different from needed 306: namely %s' % len(sentences)
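
The hint in `load_sentences` points at the `text/sentence/wf` and `text/sentence/instance` paths. Below is a minimal sketch of walking such paths on a made-up miniature document. It uses the standard library's `xml.etree.ElementTree` instead of lxml so that it runs without extra packages; the `iterfind` and `get` calls used here are available in lxml as well. The root element name `corpus` and the contents are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature input; the root element name 'corpus' is an assumption
xml_string = '''
<corpus>
  <text id="d003">
    <sentence id="d003.s005">
      <wf lemma="the" pos="DT">the</wf>
      <instance id="d003.s005.t001" lemma="market" pos="NN">market</instance>
      <wf lemma="be" pos="VBZ">is</wf>
      <wf lemma="calm" pos="JJ">calm</wf>
    </sentence>
  </text>
</corpus>
'''
root = ET.fromstring(xml_string)

words_per_sentence = {}
for sent_el in root.iterfind('text/sentence'):
    # iterating over an element yields its children in document order,
    # so wf and instance elements stay in sentence order
    words_per_sentence[sent_el.get('id')] = [child_el.text for child_el in sent_el]

print(words_per_sentence)
```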

In [32]:
def load_input_data(semeval_2013_input):
    '''
    given the path to input xml file, we will create a dictionary that looks like this:
    
    :rtype: dict
    :return: {10: {
                    'sent_id' : 1,
                    'text': 'banks',
                    'lemma' : 'bank',
                    'pos' : 'n',
                    'instance_id' : 'd003.s005.t001',
                    'gold_keys' : set(),
                    'system_key' : ''}
            }
    '''
    data = dict()
    doc = etree.parse(semeval_2013_input)
    identifier = 1
    
    for sent_el in doc.iterfind('text/sentence'):
        # insert code here
        
        # iterate directly over the wf and instance children
        # (getchildren() is deprecated in lxml)
        for child_el in sent_el:
            # insert code here
            
            info = {
                'sent_id' : None,  # to fill
                'text': None,  # to fill
                'lemma' : None,  # to fill
                'pos' : None,  # to fill
                'instance_id' : None,  # to fill if instance element else empty string
                'gold_keys' : set(),  # this is ok for now
                'system_key' : ''  # this is ok for now
            }
            
            data[identifier] = info
            identifier += 1
                    
    return data


In [33]:
data = load_input_data('data/multilingual-all-words.en.xml')
assert len(data) == 8142, 'number of tokens is not the needed 8142: namely %s' % len(data)
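
The input file stores tags such as `NN` in the *pos* attribute, whereas `perform_wsd` expects one of `n | v | a | r`. One possible way to bridge the two is sketched below. The helper `to_wordnet_pos` is hypothetical, and the first-letter rule is a rough heuristic that assumes Penn-Treebank-style tags; check it against the tags that actually occur in the data before relying on it.

```python
def to_wordnet_pos(tag):
    """Map a corpus POS tag to a WordNet POS letter, or None if there is no match.

    Hypothetical helper: it assumes that tags starting with N, V, J, and R
    denote nouns, verbs, adjectives, and adverbs respectively.
    """
    first_letter_to_pos = {'N': 'n', 'V': 'v', 'J': 'a', 'R': 'r'}
    return first_letter_to_pos.get(tag[:1].upper())


for tag in ['NN', 'VBZ', 'JJ', 'RB', 'DT']:
    print(tag, to_wordnet_pos(tag))
```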

In [3]:
def add_gold_and_wsd_output(data, gold, sentences): 
    '''
    the goal of this function is to fill the keys 'system_key'
    and 'gold_keys' for the entries in which the 'instance_id' is not an empty string.
    
    :param dict data: see output function 'load_input_data'
    :param dict gold: see output function 'load_gold_data' 
    :param dict sentences: see output function 'load_sentences' 
    
    NOTE: not all instance_ids have a gold answer! 
    
    :rtype: dict
    :return: {10: {'sent_id' : 1,
      'text': 'banks',
      'lemma' : 'bank',
      'pos' : 'n',
      'instance_id' : 'd003.s005.t001',
      'gold_keys' : {'bank%1:14:00::'},
      'system_key' : 'bank%1:14:00::'}
    }
    '''
    for identifier, info in data.items():
        # get the instance id
        instance_id = info['instance_id']
        
        if instance_id:
            # perform wsd and get sensekey that lesk proposes
            
            # add system key to our dictionary
            # info['system_key'] = sensekey
            
            if instance_id in gold:
                info['gold_keys'] = gold[instance_id]
    
    return data

Call the function to combine all information.

In [14]:
add_gold_and_wsd_output(data, gold, sentences)

## Create NAF with system run and gold information
We are going to create one [NAF XML](http://www.newsreader-project.eu/files/2013/01/techreport.pdf) file containing both the gold information and our system run. We will guide you through this process step by step.

### Step a: create an xml object
**NAF** will be our root element.

In [15]:
new_root = etree.Element('NAF')
new_tree = etree.ElementTree(new_root)
new_root = new_tree.getroot()

We can inspect what we have created by using the **etree.dump** function. As you can see, our document currently contains only the root node **NAF**.

In [16]:
etree.dump(new_root)

<NAF/>


### Step b: add children
We will now add the elements in which we will place the **wf** and **term** elements.

In [17]:
text_el = etree.Element('text')
terms_el = etree.Element('terms')

new_root.append(text_el)
new_root.append(terms_el)

In [18]:
etree.dump(new_root)

<NAF>
  <text/>
  <terms/>
</NAF>


### Step c: functions to create wf and term elements
#### TIP: check the subsection *Creating your own XML elements* from Topic 5

In [19]:
def create_wf_element(identifier, sent_id, text):
    '''
    create NAF wf element, such as:
    <wf id="11" sent_id="d001.s002">conference</wf>
    
    :param int identifier: our own identifier (convert this to string)
    :param str sent_id: the sentence id of the competition
    :param str text: the text
    '''
    # complete from here
    wf_el = etree.Element('wf')  # to complete: add the attributes and the text
    
    return wf_el

In [None]:
def create_term_element(identifier, instance_id, system_key, gold_keys):
    '''
    create NAF xml element, such as:
    <term id="3885">
      <externalRef instance_id="d007.s013.t004" provenance="lesk" wordnetkey="player%1:18:04::"/>
      <externalRef instance_id="d007.s013.t004" provenance="gold" wordnetkey="player%1:18:01::"/>
    </term>
    
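    :param int identifier: our own identifier (convert this to string)
    :param str instance_id: the instance id of the competition (empty string for wf elements)
    :param str system_key: system output
    :param set gold_keys: goldkeys
    '''
    # complete code here
    term_el = etree.Element('term')  # to complete: add the externalRef children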
    
    return term_el
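
As a reminder of the element-building calls involved, here is a small sketch on unrelated, made-up elements (`token`, `note`, and `ref` are hypothetical names, not NAF elements). It uses the standard library's `xml.etree.ElementTree` so that it runs without extra packages; `Element`, `SubElement`, `set`, and `get` can be used in the same way with lxml.

```python
import xml.etree.ElementTree as ET

# Element with attributes and text; attribute values must be strings
token_el = ET.Element('token', attrib={'id': str(1), 'lang': 'en'})
token_el.text = 'hello'

# Parent element with one child per annotation source
note_el = ET.Element('note')
note_el.set('id', str(7))
for source, value in [('system', 'x'), ('human', 'y')]:
    ET.SubElement(note_el, 'ref', attrib={'source': source, 'value': value})

print(ET.tostring(token_el, encoding='unicode'))
print(len(note_el), [ref_el.get('source') for ref_el in note_el])
```

Passing an integer as an attribute value raises an error when the element is serialized, which is why the identifiers are converted with `str` first.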

### Step d: add wf and term elements 

In [21]:
for identifier, info in data.items():
    wf_el = create_wf_element(identifier, info['sent_id'], info['text'])
    text_el.append(wf_el)
    
    term_el = create_term_element(identifier, 
                                  info['instance_id'],
                                  info['system_key'], 
                                  info['gold_keys'])
    terms_el.append(term_el)


In [22]:
with open('semeval2013_run1.naf', 'wb') as outfile:
    new_tree.write(outfile,
                   pretty_print=True,
                   xml_declaration=True,
                   encoding='utf-8')

## Score our system run
Read the NAF file and extract relevant statistics, such as:
* overall performance (how many are correct?)
* [optional]: anything that you find interesting
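
One possible way to compute the overall performance is sketched below on a made-up three-term fragment that follows the term format shown in `create_term_element`. The fragment contents are hypothetical, and the choice to count a term as correct when the system key is among the gold keys, and to skip terms without gold keys, is an assumption rather than the official SemEval scorer. The sketch uses the standard library's `xml.etree.ElementTree`; for your own file you would parse `semeval2013_run1.naf` instead of a string.

```python
import xml.etree.ElementTree as ET

# Hypothetical NAF fragment; term 3 stands for a token without annotations
naf_string = '''
<NAF>
  <terms>
    <term id="1">
      <externalRef instance_id="d007.s013.t004" provenance="lesk" wordnetkey="player%1:18:04::"/>
      <externalRef instance_id="d007.s013.t004" provenance="gold" wordnetkey="player%1:18:01::"/>
    </term>
    <term id="2">
      <externalRef instance_id="d003.s005.t001" provenance="lesk" wordnetkey="market%1:14:01::"/>
      <externalRef instance_id="d003.s005.t001" provenance="gold" wordnetkey="market%1:14:01::"/>
    </term>
    <term id="3"/>
  </terms>
</NAF>
'''
root = ET.fromstring(naf_string)

correct = 0
total = 0
for term_el in root.iterfind('terms/term'):
    refs = term_el.findall('externalRef')
    system_keys = {ref.get('wordnetkey') for ref in refs if ref.get('provenance') == 'lesk'}
    gold_keys = {ref.get('wordnetkey') for ref in refs if ref.get('provenance') == 'gold'}
    if not gold_keys:
        continue  # skip terms without a gold annotation
    total += 1
    if system_keys & gold_keys:
        correct += 1

accuracy = correct / total if total else 0.0
print(correct, total, accuracy)
```

Collecting the mismatches (system key versus gold keys) in a list instead of only counting them is a natural starting point for the error analysis.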