# Word Sense Disambiguation


## Objectives

- Understanding
    - Lexical Relations
    - Word senses in WordNet
    - Semantic Similarity (in WordNet)
    
- Learning how to disambiguate word senses
    - Dictionary-based Word Sense Disambiguation with WordNet
        - Lesk Algorithm
        - Graph-based Methods
    - Supervised Word Sense Disambiguation
        - Feature Extractions for Word Sense Classification
            - Bag-of-Words
            - Collocational Features
        - Training and Evaluation

### Recommended Reading
- Dan Jurafsky and James H. Martin. [__Speech and Language Processing__ (SLP)](https://web.stanford.edu/~jurafsky/slp3/) (3rd ed. draft)
- Steven Bird, Ewan Klein, and Edward Loper. [__Natural Language Processing with Python__ (NLTK)](https://www.nltk.org/book/)
    

### Covered Material

- SLP
    - [Chapter 23: Word Senses and WordNet](https://web.stanford.edu/~jurafsky/slp3/23.pdf)
- NLTK
    - [Chapter 2: Accessing Text Corpora and Lexical Resources](https://www.nltk.org/book/ch02.html)
        - Section 5: WordNet


### Requirements

- [NLTK](https://www.nltk.org/)


## 1. Word Sense Disambiguation

In natural language processing, word sense disambiguation (WSD) is the problem of determining which "sense" (meaning) of a word is activated by the use of the word in a particular context, a process which appears to be largely unconscious in people. 

WSD is a natural classification problem: 
Given a word and its possible senses, as defined by a dictionary, the objective of WSD is to classify an occurrence of the word in context into one or more of its sense classes. The features of the context (such as neighboring words) provide are used as features for classification.

- Human Language is ambiguous
    - Syntacting ambiguity
        - Resolved by POS-tagging
        - Syntactic Parsing
    - Lexical ambiguity
        - Resolved by Word Sense Disambiguation
        - Semantics work at level of word __senses__, not __words__

__Example__:
- NOUN
    - 'they pulled the canoe up on the __bank__'
    - 'he cashed a check at the __bank__'
- VERB
    - 'the plane __banked__ steeply'
    - '__bank__ on your good education'

### 1.1. Task Variants
- __Lexical sample subtask__: only a small selection of words has to be disambiguated
    - Supervised machine learning: train a classifier for each word
- __All words subtask__: each and every content word in the test corpus has to be disambiguated.
    - Data sparseness issue, can't train a classifier for each word

### 1.2. Evaluation
Precision, recall, F1-measure against gold standard data

### 1.3. Lexical Relations
Relation between word senses.

- __Homonymy__: senses are not related
- __Polysemy__: senses are related
- __Metonymy__: a thing or concept is referred to by the name of something closely associated with that thing or concept. (e.g. *Rome* for Italian Government)
    - It is a subtype of polysemy


- __Synonymy__: senses are identical
- __Antonymy__: senses are opposite
- __Hyponymy__ (specific) (*car is hyponym of vehicle*) and __Hypernymy__ (generic): class-inclusion relationships (*vehicle is hypernymy of car*)
- __Meronymy__ (part)(*wheel is part of car*) and __Holonymy__ (whole): the part-whole relation (*car ia holonymy of wheel*)

## 2. WordNet

[WordNet](https://wordnet.princeton.edu/) is a lexical database of semantic relations between words that links words into semantic relations including synonyms, hyponyms, and meronyms. 

Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. 


__Summary__

WordNet is a:
- Graph (4 graphs for each of nouns, verbs, adjectives, and adverbs)
- Nodes are Synsets (synonyms)
- Labeled Edges are Relations between Synsets

    - PART-OF
    - KIND-OF (IS-A)
    - ENTAILMENT
    - ANTONYMY
    
> Senses in WordNet are generally ordered from most to least frequently used, with the most common sense numbered 1.

[WordNet Site](https://wordnet.princeton.edu/documentation/wndb5wn)

In [1]:
import nltk
from pprint import pprint
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to /home/thomas/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/thomas/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [2]:
# Let's import WordNet
from nltk.corpus import wordnet

In [3]:
# printing senses of a word (including honomymy & polysemy)
senses = wordnet.synsets('bank')
print(senses)

[Synset('bank.n.01'), Synset('depository_financial_institution.n.01'), Synset('bank.n.03'), Synset('bank.n.04'), Synset('bank.n.05'), Synset('bank.n.06'), Synset('bank.n.07'), Synset('savings_bank.n.02'), Synset('bank.n.09'), Synset('bank.n.10'), Synset('bank.v.01'), Synset('bank.v.02'), Synset('bank.v.03'), Synset('bank.v.04'), Synset('bank.v.05'), Synset('deposit.v.02'), Synset('bank.v.07'), Synset('trust.v.01')]


### 2.1. Synset 
The entity `bank.n.01` is called a __synset__, or "synonym set", a collection of synonymous words (or "lemmas").

The name is composed as `<lemma>.<pos>.<number>` string where: 
- `<lemma>` is the word's morphological stem 
- `<pos>` is one of the module attributes `ADJ`, `ADJ_SAT`, `ADV`, `NOUN` or `VERB` 
- `<number>` is the sense number, counting from `0`

Part-of-speech tags appear as below:

| POS | in Synset Name |
|:----|:---------------|
| `wn.NOUN`    | `n`
| `wn.VERB`    | `v`
| `wn.ADV`     | `r`
| `wn.ADJ`     | `a`
| `wn.ADJ_SAT` | `s` (satelite adjective, ignore)


In [4]:
# it's possible to provide part of speech to filter senses as well
senses = wordnet.synsets('bank', wordnet.NOUN)
pprint(senses)
print('')
print("POS:",senses[0].pos())  # part-of-speech tag of a synset

[Synset('bank.n.01'),
 Synset('depository_financial_institution.n.01'),
 Synset('bank.n.03'),
 Synset('bank.n.04'),
 Synset('bank.n.05'),
 Synset('bank.n.06'),
 Synset('bank.n.07'),
 Synset('savings_bank.n.02'),
 Synset('bank.n.09'),
 Synset('bank.n.10')]

POS: n


Each word of a synset can have several meanings, synset represents the single meaning that is common to all words in it. 
Each synset has a __definition__ and __example__ sentences, that can be accessed using `definition()` and `examples()` methods.

In [5]:
print(senses[0].definition())
print(senses[0].examples())

sloping land (especially the slope beside a body of water)
['they pulled the canoe up on the bank', 'he sat on the bank of the river and watched the currents']


### 2.2. Lemmatization
`wordnet.synsets()` method expects a word to be a __lemma__, i.e. canonical (dictionary) form of a word. In case it does find a word in WordNet, it internally applies morphological transformation rules to strip off affixes untill it finds the form.

```
MORPHOLOGICAL_SUBSTITUTIONS = {
    NOUN: [("s", ""), ("ses", "s"), ("ves", "f"), ("xes", "x"), ("zes", "z"), 
           ("ches", "ch"), ("shes", "sh"), ("men", "man"), ("ies", "y"), ],
    VERB: [("s", ""), ("ies", "y"), ("es", "e"), ("es", ""), 
           ("ed", "e"), ("ed", ""), 
           ("ing", "e"), ("ing", ""), ],
    ADJ: [("er", ""), ("est", ""), ("er", "e"), ("est", "e")],
    ADV: [],
}
```

Those could be applied calling `wordnet.morphy()`.

In [6]:
wordnet.morphy('banked')

'bank'

In [7]:
# Note that only verb synsets are listed
wordnet.synsets('banked') 

[Synset('bank.v.01'),
 Synset('bank.v.02'),
 Synset('bank.v.03'),
 Synset('bank.v.04'),
 Synset('bank.v.05'),
 Synset('deposit.v.02'),
 Synset('bank.v.07'),
 Synset('trust.v.01')]

`wordnet.morphy()` is the basis of the WordNet-based Lemmatizer in NLTK. The Lemmatizer can be used as follows, optionally providing a part-of-speech (default is NOUN).

In [8]:
from nltk.stem import WordNetLemmatizer
lem = WordNetLemmatizer()
print(lem.lemmatize('banks'))
print(lem.lemmatize('banked', pos=wordnet.VERB))
print(lem.lemmatize('bnked', pos=wordnet.VERB))  # returns the word itself if it cannot find it

bank
bank
bnked


#### 2.2.1. Lemmas in WordNet
In WordNet __Lemma__ is a pairing of words with a synset: `bank.n.01` + `bank`.

From a __synset__ we can get:
- all its lemmas (`lemmas()`)
- all its lemma names (`lemma_names()`)

From a __lemma__ we can get:
- its name (`name()`)
- synset it belongs to (`synset()`)

Similar to synsets, we can get all lemmas for a word as well using `lemmas()`.

In [9]:
lemmas = wordnet.lemmas('bank')
pprint(lemmas)

[Lemma('bank.n.01.bank'),
 Lemma('depository_financial_institution.n.01.bank'),
 Lemma('bank.n.03.bank'),
 Lemma('bank.n.04.bank'),
 Lemma('bank.n.05.bank'),
 Lemma('bank.n.06.bank'),
 Lemma('bank.n.07.bank'),
 Lemma('savings_bank.n.02.bank'),
 Lemma('bank.n.09.bank'),
 Lemma('bank.n.10.bank'),
 Lemma('bank.v.01.bank'),
 Lemma('bank.v.02.bank'),
 Lemma('bank.v.03.bank'),
 Lemma('bank.v.04.bank'),
 Lemma('bank.v.05.bank'),
 Lemma('deposit.v.02.bank'),
 Lemma('bank.v.07.bank'),
 Lemma('trust.v.01.bank')]


In [10]:
# Look up lemma directly
lemma = wordnet.lemma('bank.n.01.bank')
print(lemma.name())
print(lemma.synset())

bank
Synset('bank.n.01')


In [11]:
# Get Lemmas of a synset
print(senses[0].lemmas())
print(senses[0].lemma_names())

[Lemma('bank.n.01.bank')]
['bank']


### 2.3. Lexical Relations beween Synsets

WordNet synsets correspond to abstract concepts that are linked together in a hierarchy from very general (such as `Entity`, `State`, `Event` a.k.a *unique beginners* or *root synsets*) to very specific. 

Hypernymy/Hyponymy relations are used to navigate the taxonomy using `hypernyms()` and `hyponyms()` methods.

- `hypernym_paths()` gets the lists of the hypernym synsets to the root (several paths are possible)
- `root_hypernyms()` gets the root synset
- `hypernym_distances()` get the path(s) from the synset to the root, counting the distance of each node from the initial node on the way

- `max_depth()` returns the length of the longest hypernym path from the synset to the root.
- `min_depth()` returns the length of the shortest hypernym path from the synset to the root.

In [12]:
pprint(senses[0].hyponyms())
pprint(senses[0].hypernyms())

[Synset('riverbank.n.01'), Synset('waterside.n.01')]
[Synset('slope.n.01')]


In [13]:
# getting paths to the root of the taxonomy
pprint(senses[0].hypernym_paths())
# getting hypernyms with distances
pprint(senses[0].hypernym_distances())
# getting the root node
pprint(senses[0].root_hypernyms())
print(senses[0].max_depth())
print(senses[0].min_depth())

[[Synset('entity.n.01'),
  Synset('physical_entity.n.01'),
  Synset('object.n.01'),
  Synset('geological_formation.n.01'),
  Synset('slope.n.01'),
  Synset('bank.n.01')]]
{(Synset('bank.n.01'), 0),
 (Synset('entity.n.01'), 5),
 (Synset('geological_formation.n.01'), 2),
 (Synset('object.n.01'), 3),
 (Synset('physical_entity.n.01'), 4),
 (Synset('slope.n.01'), 1)}
[Synset('entity.n.01')]
5
5


Read about other relations defined for synsets and lemmas in the [NLTK documentation](http://www.nltk.org/api/nltk.corpus.reader.html#module-nltk.corpus.reader.wordnet).

__Whole description of WordNet methods and structure is out of the scope of the lab.__

## 3. Lesk Algorithm

> "What we try is to guess the correct word sense by counting overlaps between dictionary definitions of the various senses." 

(Lesk, Michael. "Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone." Proceedings of the 5th Annual International Conference on Systems Documentation. ACM, 1986.)

### 3.1. Simplified Lesk Algorithm

Kilgarriff and Rosenzweig (2000) [English SENSEVAL](http://www.lrec-conf.org/proceedings/lrec2000/pdf/8.pdf)

```
For each sense s of that word,
    set weight(s) to zero.

Identify set of unique words W in surrounding sentence.

For each word w in W,
    for each sense s,
        if w occurs in the definition or example sentences of s,
            add weight(w) to weight(s).
Choose sense with greatest weight(s)
```

> `weight(w)` is defined as the [inverse document frequency](https://en.wikipedia.org/wiki/Tf–idf) (IDF) of the word `w` over the definitions and example sentences in the dictionary. The IDF of a word `w` is computed as `-log(p(w))`, where `p(w)` is estimated as the fraction of dictionary "documents" -- definition or examples -- which contain the word. 

$$ IDF = -\log {|\{d \in D : w \in d\}| \over |D|}$$

where `w` is the word and `D` is the set of documents


### 3.2. Lesk Plus Corpus

> LESK-PLUS-CORPUS is as LESK, but also considers the tagged training data, so can be compared with supervised
systems. For each word in the sentence containing the test item, it tests whether `w` occurs in the dictionary entry or corpus instances for each candidate sense.


### 3.3. Simple Lesk with Equal Weights

If all words are equally weighted, we compute an overlap.
The algorithm becomes simpler.

```
function SIMPLIFIED LESK(word, sentence) returns best sense of word
    best-sense := most frequent sense for word (i.e. first in WordNet)
    max-overlap := 0
    context := set of words in sentence
    for each sense in senses of word do
        signature := set of words in gloss and examples of sense
        overlap := COMPUTE_OVERLAP(signature, context)
        if overlap > max-overlap then
            max-overlap := overlap
            best-sense := sense
    end
return(best-sense)
```

```
COMPUTE OVERLAP returns the number of words in common between two sets.
```

#### Improvements

- Removing stop words
    - IDF makes them weight less in Simplified Lesk by Kilgarriff and Rosenzweig (2000)

### 3.4. Using Lesk in NLTK
NLTK provide the implementation of the Lesk Algorithm is [`wsd` module](https://www.nltk.org/_modules/nltk/wsd.html).

In [14]:
from nltk.wsd import lesk

sense = lesk('Jane sat on the sloping bank of a river beside the water'.split(), 'bank')
print(sense)
print(sense.definition())

# possible to specify the POS
print(lesk('Jane sat on the sloping bank of a river beside the water'.split(), 
           'bank', 
           pos=wordnet.NOUN))

# possible to specify the synsets to choose from
print(lesk('Jane sat on the sloping bank of a river beside the water'.split(), 
           'bank', 
           synsets=wordnet.synsets('riverbank')))

Synset('bank.n.01')
sloping land (especially the slope beside a body of water)
Synset('bank.n.01')
Synset('riverbank.n.01')


### 3.5. Alternative Implementations of Lesk in `pywsd`

[`pywsd` library](https://github.com/alvations/pywsd) provides several variants of the Lesk algorithm.



- Original Lesk (Lesk, 1986) -- also *simplified*
- Adapted/Extended Lesk (Banerjee and Pederson, 2002/2003)
- Simple Lesk (with definition, example(s) and hyper+hyponyms)
- Cosine Lesk (use cosines to calculate overlaps instead of using raw counts)

Unfortunatelly, it has some compatibility issues. However, can be consulted for implementations.

### Exercises
Even though NLTK states that it implements Original Lesk Algorithm, in fact it is a Simplified Lesk Algorithm, that doesn't consider examples, and computes overlaps like the original. 

In the original algorithm context is computed differently. <mark style="background-color: rgba(0, 255, 0, 0.2)">Instead of comparing a target word's signature with the context words, the target signature is compared with the signatures of each of the context words. </mark>

Implement the Original Lesk Algorithm (modifying NLTK's, see pseudocode above)
Todo list:
- Complete lesk simplified
- Preprocessing:
    - compute pos-tag with `nltk.pos_tag`
    - remove stopwords
        - `from nltk.corpus import stopwords`
        - `stopwords.words('english')`

- take the majority decision (the sense predicted most frequently)

POS tags reminder:

| POS | in Synset Name |
|:----|:---------------|
| `wn.NOUN`    | `n`
| `wn.VERB`    | `v`
| `wn.ADV`     | `r`
| `wn.ADJ`     | `a`
| `wn.ADJ_SAT` | `s` (satelite adjective, ignore)

In [15]:
# Lesk simplified
def lesk(context_sentence, ambiguous_word, pos=None, synsets=None):

    context = set(context_sentence)
    
    if synsets is None:
        synsets = wordnet.synsets(ambiguous_word)
    # Filter by pos-tag
    if pos:
        synsets = [ss for ss in synsets if str(ss.pos()) == pos]

    if not synsets:
        return None
    
    #print(context)
    #for ss in synsets:
    #    print(len(context & set(nltk.word_tokenize(ss.definition()))))
    len_overlap_list = [len(context & set(nltk.word_tokenize(ss.definition()))) for ss in synsets]
    sense = len_overlap_list.index(max(len_overlap_list))

    return synsets[sense]


In [10]:
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
# A bit of preprocessing 
def preprocess(text):
    mapping = {"NOUN": wordnet.NOUN, "VERB": wordnet.VERB, "ADJ": wordnet.ADJ, "ADV": wordnet.ADV}
    sw_list = stopwords.words('english')
    lem = WordNetLemmatizer()
    # tokenize, if input is text
    tokens = nltk.word_tokenize(text) if type(text) is str else text
    # compute pos-tag
    tagged = nltk.pos_tag(tokens, tagset="universal")
    # lowercase
    tagged = [(w.lower(), p) for w, p in tagged]
    # optional: remove all words that are not NOUN, VERB, ADJ, or ADV (i.e. no sense in WordNet)
    tagged = [(w, p) for w, p in tagged if p in mapping]
    # re-map tags to WordNet (return orignal if not in-mapping, if above is not used)
    tagged = [(w, mapping.get(p, p)) for w, p in tagged]
    # remove stopwords
    tagged = [(w, p) for w, p in tagged if w not in sw_list]
    # lemmatize
    tagged = [(w, lem.lemmatize(w, pos=p), p) for w, p in tagged]
    # unique the list
    tagged = list(set(tagged))
    
    return tagged

In [8]:
def get_sense_definitions(context):
    # input is text or list of strings
    lemma_tags = preprocess(context)

    # let's get senses for each
    senses = [(w, wordnet.synsets(l, p)) for w, l, p in lemma_tags]

    # let's get their definitions
    definitions = []
    for raw_word, sense_list in senses:
        if len(sense_list) > 1:
            # let's tokenize, lowercase & remove stop words 
            def_list = []
            for s in sense_list:
                defn = s.definition()
                # let's use the same preprocessing
                tags = preprocess(defn)
                toks = [l for w, l, p in tags]
                def_list.append((s, toks))
            definitions.append((raw_word, def_list))
    return definitions
    

In [9]:
def get_top_sense(words, sense_list):
    # get top sense from the list of sense-definition tuples
    # assumes that words and definitions are preprocessed identically
    val, sense = max((len(set(words).intersection(set(defn))), ss) for ss, defn in sense_list)
    return val, sense

In [11]:
from collections import Counter

def original_lesk(context_sentence, ambiguous_word, pos=None, synsets=None, majority=False):
    
    context_senses = get_sense_definitions(context_sentence)
    
    if synsets is None:
        synsets = get_sense_definitions(ambiguous_word)[0][1]

    if pos:
        synsets = [ss for ss in synsets if str(ss.pos()) == pos]

    if not synsets:
        return None
    
    scores = []
    for senses in context_senses:
        for sense in senses[1]:
            score, sense = get_top_sense(sense[1], synsets)
            scores.append((score, sense))
        
    if majority:
        # We remove 0 scores, senses without overlapping
        filtered_scores = [x[1] for x in scores if x[0] != 0]
        if len(filtered_scores) > 0:
            best_sense = Counter(filtered_scores).most_common(1)[0][0]
        else:
            # Almost random selection
            best_sense = Counter(scores).most_common(1)[0][0]
    else:
        _, best_sense = sorted(scores)[0]
    return best_sense

In [12]:
text = "Jane sat on the sloping bank of a river beside the water"
word = "bank"
print("Sense from lesk original", original_lesk(text, word, majority=True))
print("Sense from lesk simplified", lesk(text, word))


Sense from lesk original Synset('savings_bank.n.02')
Sense from lesk simplified Synset('savings_bank.n.02')


## 4. Graph-based Methods on WordNet for WSD

### 4.1. Maximum Relatedness Disambiguation

Pedersen et al. (2003) [Maximizing Semantic Relatedness to Perform Word Sense Disambiguation](https://www.d.umn.edu/~tpederse/Pubs/max-sem-relate.pdf)


```
w = words

foreach sense s[t][i] of target word w[t]$
    set score[i] = 0
    foreach word w[j] in window of context
        skip to next word if j == t

        foreach sense s[j][k] of w[j]
            temp_score[j] = relatedness(s[t][i], s[j][k])

        winning_score = highest score in array temp_score[]

        if (winning_score > threshold)
            set score[i] = score[i] + winning_score
            
return i, such that score[i] >= score[j] , for all j, 1 <= j <= n, n = number of words in sentence
```

#### 4.1.1. How do we define relatedness?

- Similar words are near-synonyms: e.g. *car*, *motorcycle*
- Related words can be related any way: e.g. *car*, *fuel*

- Thesaurus-based similarity
    - words have similar definitions (Lesk)
    - words are close to each other in hypernym hierarchy (graph-based)
- Distributional similarity
    - do words apprear in similar distributional contexts
    - __distributional (vector) semantics__

Compute the similarity between *dime* and *nickel* and between *nickel* and *credit card*: 

![](https://i.postimg.cc/tJn0NMgm/Screenshot-2023-01-03-at-10-26-20.png)

[*Original source (Resnik, 1995)*](https://arxiv.org/pdf/cmp-lg/9511007)

#### 4.1.2. Path-based Similarity

Two concepts (senses/synsets) are similar if they are near each other in the thesaurus hierarchy
- have a __short path__ between them (1 + number of edges between nodes)
- path to themselves has distance `1`

##### NLTK Path Based Metrics

- `synset1.path_similarity(synset2)`: Return a score denoting how similar two word senses are, based on the __shortest path__ that connects the senses in the is-a (hypernym/hypnoym) taxonomy. The score is in the range 0 to 1, computed as `1/path_length`
- `synset1.lch_similarity(synset2)`: __Leacock-Chodorow Similarity__: Return a score denoting how similar two word senses are, based on the shortest path that connects the senses and the maximum depth of the taxonomy in which the senses occur. The relationship is given as `-log(p/2d)` where `p` is the shortest path length and `d` the taxonomy depth.
- `synset1.wup_similarity(synset2)`: __Wu-Palmer Similarity__: Return a score denoting how similar two word senses are, based on the depth of the two senses in the taxonomy and that of their __Least Common Subsumer__ (most specific ancestor node).

In [21]:
bank_r = wordnet.synsets('bank')[0]
bank_f = wordnet.synsets('bank')[1]
river = wordnet.synsets('river')[0]

print(river.definition())
print(bank_r.definition())
print(bank_f.definition())

print(bank_r.path_similarity(river))
print(bank_f.path_similarity(river))

a large natural stream of water (larger than a creek)
sloping land (especially the slope beside a body of water)
a financial institution that accepts deposits and channels the money into lending activities
0.1111111111111111
0.07692307692307693


In [22]:
print(bank_r.lch_similarity(river))
print(bank_f.lch_similarity(river))

print(bank_r.wup_similarity(river))
print(bank_f.wup_similarity(river))

1.4403615823901665
1.072636802264849
0.3333333333333333
0.14285714285714285


#### 4.1.3. Information Content Similarity

- Path-based similarity issues
    - each edge is has equal distance; however nodes high in hierarchy are more abstract
- Better metric
    - each edge has independent cost
    - nodes connected through higher-level (abstract) nodes are less similar

##### Information Content
- Trained on a corpus
- `P(c)` the probability of a concept `c` in a corpus
    $$ P(c) = \frac{\sum_{w \in \text{words}(c)}\text{count}(c)}{N}$$
    where $\text{words}(c)$ is set of all words that are children of concept $c$. $N$ is the total number of nouns observed. 
- All words are members of the root node (e.g. `Entity`); thus, `P(root) = 1`
- The lower a node in hierarchy, the lower its probability

- Information Content $$IC(c) = -log(P(c))$$
- Most Informative Subsumer (Lowest Common Subsumer) $LCS(c_1, c_2)$ is the lowest node in the hierarchy subsuming both $c_1$ and $c_2$

If you are further interested in this you should read the paper of [Resnik](https://arxiv.org/pdf/cmp-lg/9511007)

##### NLTK Information Content Based Metrics
- `res_similarity(other, ic)`: __Resnik Similarity__: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node). Computed as `IC(lcs) = -log(P(lcs))`. Lower is more similar.
- `lin_similarity(other, ic)`: __Lin Similarity__: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets. The relationship is given by the equation `2 * IC(lcs) / (IC(s1) + IC(s2))`.
- `jcn_similarity(other, ic)`: __Jiang-Conrath Similarity__: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets. The relationship is given by the equation `1 / (IC(s1) + IC(s2) - 2 * IC(lcs))`.

In [23]:
# getting pre-computed ic of the semcor corpus (large sense tagged corpus)
from nltk.corpus import wordnet_ic
nltk.download('wordnet_ic')
semcor_ic = wordnet_ic.ic('ic-semcor.dat')

[nltk_data] Downloading package wordnet_ic to
[nltk_data]     /home/thomas/nltk_data...
[nltk_data]   Package wordnet_ic is already up-to-date!


In [24]:
print(bank_r.res_similarity(river, semcor_ic))
print(bank_f.res_similarity(river, semcor_ic))

0.6143639493869085
-0.0


In [25]:
print(bank_r.lin_similarity(bank_r, semcor_ic))
print(bank_f.lin_similarity(river, semcor_ic))

1.0
-0.0


In [26]:
print(bank_r.jcn_similarity(bank_r, semcor_ic))
print(bank_f.jcn_similarity(river, semcor_ic))

1e+300
0.06248754962684728


### Exercise
Extend Lesk algorithm (function) to use similarity metrics instead of just overlaps
- make it a keyword argument to allow different metrics

In [5]:
semcor_ic = wordnet_ic.ic('ic-semcor.dat')

def get_top_sense_sim(context_sense, sense_list, similarity):
    # get top sense from the list of sense-definition tuples
    # assumes that words and definitions are preprocessed identically
    scores = []
    for sense in sense_list:
        ss = sense[0]
        if similarity == "path":
            try:
                scores.append((ss.path_similarity(context_sense, semcoir_ic), ss))
            except:
                scores.append((0, ss))    
        elif similarity == "lch":
            try:
                scores.append((ss.lch_similarity(context_sense, semcoir_ic), ss))
            except:
                scores.append((0, ss))
        elif similarity == "wup":
            try:
                scores.append((ss.wup_similarity(context_sense), ss))
            except:
                scores.append((0, ss))
        elif similarity == "resnik":
            try:
                scores.append((ss.res_similarity(context_sense, semcoir_ic), ss))
            except:
                scores.append((0, ss))
        elif similarity == "lin":
            try:
                scores.append((ss.lin_similarity(context_sense, semcoir_ic), ss))
            except:
                scores.append((0, ss))
        elif similarity == "jiang":
            try:
                scores.append((ss.jcn_similarity(context_sense, semcoir_ic), ss))
            except:
                scores.append((0, ss))
        else:
            print("Similarity metric not found")
            return None
    val, sense = max(scores)
    return val, sense

    
def lesk_similarity(context_sentence, ambiguous_word, similarity="resnik", pos=None, synsets=None, majority=True):
    context_senses = get_sense_definitions(set(context_sentence) - set([ambiguous_word]))
    
    if synsets is None:
        synsets = get_sense_definitions(ambiguous_word)[0][1]

    if pos:
        synsets = [ss for ss in synsets if str(ss[0].pos()) == pos]

    if not synsets:
        return None
    
    scores = []
    
    for senses in context_senses:
        for sense in senses[1]:
            scores.append(get_top_sense_sim(sense[0], synsets, similarity))
                    
    if len(scores) == 0:
        return synsets[0][0]
                    
    # Majority voting as before    
    if majority:
        # We remove 0 scores, senses without overlapping
        filtered_scores = [x[1] for x in scores if x[0] != 0]
        if len(filtered_scores) > 0:
            best_sense = Counter(filtered_scores).most_common(1)[0][0][1]
        else:
            # Almost random selection
            best_sense = Counter(scores).most_common(1)[0][0][1]
    else:
        best_sense = sorted(scores)[0]
    
    return best_sense
        
'''
def pedersen(context_sentence, ambiguous_word, similarity="resnik", pos=None, 
                    synsets=None, threshold=0.1):
    
    context_senses = get_sense_definitions(set(context_sentence) - set([ambiguous_word]))

    if synsets is None:
        synsets = get_sense_definitions(ambiguous_word)[0][1]

    if pos:
        synsets = [ss for ss in synsets if str(ss[0].pos()) == pos]

    if not synsets:
        return None
    
    synsets_scores = {}
    for ss_tup in synsets:
        ss = ss_tup[0]
        if ss not in synsets_scores:
            synsets_scores[ss] = 0
        for senses in context_senses:
            scores = []
            for sense in senses[1]:
                if similarity == "path":
                    try:
                        # Append path similarity similarity between ambiguous word and senses from the context
                        scores.append((#Add similarity, ss))
                    except:
                        scores.append((0, ss))    
                elif similarity == "lch":
                    try:
                        # Append LCH similarity similarity between ambiguous word and senses from the context
                        scores.append((#Add similarity, ss))
                    except:
                        scores.append((0, ss))
                elif similarity == "wup":
                    try:
                        # Append WUP similarity similarity between ambiguous word and senses from the context
                        scores.append((#Add similarity, ss))
                    except:
                        scores.append((0, ss))
                elif similarity == "resnik":
                    try:
                        # Append Resnik similarity similarity between ambiguous word and senses from the context
                        # Don't forget semicor_ic
                        scores.append((#Add similarity, ss))
                    except:
                        scores.append((0, ss))
                elif similarity == "lin":
                    try:
                        # Append lin similarity similarity between ambiguous word and senses from the context
                        # Don't forget semicor_ic
                        scores.append((#Add similarity, ss))
                    except:
                        scores.append((0, ss))
                elif similarity == "jiang":
                    try:
                        # Append Jiang similarity similarity between ambiguous word and senses from the context
                        # Don't forget semicor_ic
                        scores.append((#Add similarity, ss))
                    except:
                        scores.append((0, ss))
                else:
                    print("Similarity metric not found")
                    return None
            value, sense = max(scores)
            if value > threshold:
                synsets_scores[sense] = synsets_scores[sense] + value

    values = list(synsets_scores.values())
    if sum(values) == 0:
        print('Warning: all the scores are 0')
    senses = list(synsets_scores.keys())
    best_sense_id = values.index(max(values))
                            
    return senses[best_sense_id]
'''



In [7]:
text = "Jane sat on the sloping bank of a river beside the water".split()
word = "bank"
sense = original_lesk(text, word, majority=True)
print('Original lesk', sense, sense.definition())
sense = lesk(text, word)
print('Symplified lesk', sense, sense.definition())
sense = lesk_similarity(text, word, "resnik")
print('Graph-based lesk', sense, sense.definition())
#sense = pedersen(text, word, similarity="path", threshold=0.1)
#print("Pedersen", sense, sense.definition())

Original lesk Synset('savings_bank.n.02') a container (usually with a slot in the top) for keeping money at home
Symplified lesk Synset('bank.n.01') sloping land (especially the slope beside a body of water)
Graph-based lesk Synset('savings_bank.n.02') savings_bank.n.02


## 5. Evaluation on Senseval 2

### 5.1. Senseval Corpus
The Senseval 2 Corpus contains data intended to train word-sense disambiguation classifiers. 
It contains data for four words: `hard`, `interest`, `line`, and `serve`. Let's use `interest` portion to illustrate evaluation.

In [None]:
nltk.download('senseval')

Corpus instances are stored as:
- `context` - POS-tagged context sentence
- `position` - index of the target word in a context sentence
- `senses` - labels

In [None]:
from nltk.corpus import senseval

inst = senseval.instances('interest.pos')[0]

print(inst.position, inst.context, inst.senses)

#### 5.1.1. Mapping Senseval Senses to WordNet

Senseval labels are not compatible with WordNet 3.0; thus, let's manually create a mapping.

__Senses for *interest* in Longman Dictionary__
- Sense 1 =  361 occurrences (15%) - readiness to give attention
- Sense 2 =   11 occurrences (01%) - quality of causing attention to be given to
- Sense 3 =   66 occurrences (03%) - activity, etc. that one gives attention to
- Sense 4 =  178 occurrences (08%) - advantage, advancement or favor
- Sense 5 =  500 occurrences (21%) - a share in a company or business
- Sense 6 = 1252 occurrences (53%) - money paid for the use of money

In [None]:
# definitions of "interest"'s synsets in WordNet
iss = wordnet.synsets('interest', pos='n')
for ss in iss:
    print(ss, ss.definition())
    

In [None]:
# Let's create mapping from convenience
mapping = {
    'interest_1': 'interest.n.01',
    'interest_2': 'interest.n.03',
    'interest_3': 'pastime.n.01',
    'interest_4': 'sake.n.01',
    'interest_5': 'interest.n.05',
    'interest_6': 'interest.n.04',
}

#### 5.1.2. Evaluation

- Let's use accuracy for simplicity
- Also demonstrating per-class precision, recall, and f-measure

In [None]:
from nltk.metrics.scores import precision, recall, f_measure, accuracy

refs = {k: set() for k in mapping.values()}
hyps = {k: set() for k in mapping.values()}
refs_list = []
hyps_list = []

# since WordNet defines more senses, let's restrict predictions
synsets = [ss for ss in wordnet.synsets('interest', pos='n') if ss.name() in mapping.values()]

for i, inst in enumerate(senseval.instances('interest.pos')):
    txt = [t[0] for t in inst.context]
    raw_ref = inst.senses[0] # let's get first sense
    hyp = lesk(txt, txt[inst.position], synsets=synsets).name()
    
    ref = mapping.get(raw_ref)
    
    # for precision, recall, f-measure        
    refs[ref].add(i)
    hyps[hyp].add(i)
    
    # for accuracy
    refs_list.append(ref)
    hyps_list.append(hyp)

print("Acc:", round(accuracy(refs_list, hyps_list), 3))

for cls in hyps.keys():
    p = precision(refs[cls], hyps[cls])
    r = recall(refs[cls], hyps[cls])
    f = f_measure(refs[cls], hyps[cls], alpha=1)
    
    print("{:15s}: p={:.3f}; r={:.3f}; f={:.3f}; s={}".format(cls, p, r, f, len(refs[cls])))

### Exercise
- Evaluate Original Lesk (your implementation on Senseval's `interest`)
- You can also easily evalutate Lesk similarity that we have seen before

In [None]:
from nltk.metrics.scores import precision, recall, f_measure, accuracy

refs = {k: set() for k in mapping.values()}
hyps = {k: set() for k in mapping.values()}
refs_list = []
hyps_list = []

# since WordNet defines more senses, let's restrict predictions

synsets = []
for ss in wordnet.synsets('interest', pos='n'):
    if ss.name() in mapping.values():
        # You need to preporecess the definitions
        # Give a look at the preprocessing function that we defined above 
        defn = # estract the defitions
        tags = # Preproccess the definition
        toks = # From tags extract the tokens
        synsets.append((ss,toks))

for i, inst in enumerate(senseval.instances('interest.pos')):
    txt = [t[0] for t in inst.context]
    raw_ref = inst.senses[0] # let's get first sense
    hyp = # Use original LESK or similarity LESK, for input parameters copy paste from above.
    ref = mapping.get(raw_ref)
    
    # for precision, recall, f-measure        
    refs[ref].add(i)
    hyps[hyp].add(i)
    
    # for accuracy
    refs_list.append(ref)
    hyps_list.append(hyp)

print("Acc:", round(accuracy(refs_list, hyps_list), 3))

for cls in hyps.keys():
    p = precision(refs[cls], hyps[cls])
    r = recall(refs[cls], hyps[cls])
    f = f_measure(refs[cls], hyps[cls], alpha=1)
    
    print("{:15s}: p={:.3f}; r={:.3f}; f={:.3f}; s={}".format(cls, p, r, f, len(refs[cls])))

## 6. Supervised Learning for WSD

### 6.1. Features for WSD
- Bag-of-Words (already covered)
- Collocational features

#### 6.1.1. Bag-of-Words (BOW) Classification (recap)

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_validate
from sklearn.metrics import classification_report
from sklearn.model_selecinsttion import StratifiedKFold

data = [" ".join([t[0] for t in inst.context]) for inst in senseval.instances('interest.pos')]
lbls = [inst.senses[0] for inst in senseval.instances('interest.pos')]

print(data[0])
print(lbls[0])


In [None]:
vectorizer = CountVectorizer()
classifier = MultinomialNB()
lblencoder = LabelEncoder()

stratified_split = StratifiedKFold(n_splits=5, shuffle=True)

vectors = vectorizer.fit_transform(data)

# encoding labels for multi-calss
lblencoder.fit(lbls)
labels = lblencoder.transform(lbls)

scores = cross_validate(classifier, vectors, labels, cv=stratified_split, scoring=['f1_micro'])

print(sum(scores['test_f1_micro'])/len(scores['test_f1_micro']))


#### 6.1.2. Collocational Features
- Assume +/-n words window from target

e.g. n=2

`... managers expect further [declines in] [interest] [rates .]`

- $w_{-1}$ : `declines`
- $w_{-2}$ : `in`
- $w_0$ __target__ : `interest`
- $w_{+1}$ : `rates`
- $w_{+2}$ : `.`

- POS-tags of these words
- word ngrams in window +/-3 are common
    - ngram(-3): declines in interest
    - ngram(-2): in interest
    - ngram(1): interest
    - ngram(2): interest rates
    - ngram(3): interest rates .


##### Using Collocational Features in scikit-learn
- represent features as dict
- use `DictVectorizer`

In [None]:
def collocational_features(inst):
    p = inst.position
    return {
        "w-2_word": 'NULL' if p < 2 else inst.context[p-2][0],
        "w-1_word": 'NULL' if p < 1 else inst.context[p-1][0],
        "w+1_word": 'NULL' if len(inst.context) - 1 < p+1 else inst.context[p+1][0],
        "w+2_word": 'NULL' if len(inst.context) - 1 < p+2 else inst.context[p+2][0]
    }

In [None]:
data_col = [collocational_features(inst) for inst in senseval.instances('interest.pos')]
print(data_col[0])

In [None]:
from sklearn.feature_extraction import DictVectorizer
dvectorizer = DictVectorizer(sparse=False)
dvectors = dvectorizer.fit_transform(data_col)

scores = cross_validate(classifier, dvectors, labels, cv=stratified_split, scoring=['f1_micro'])

print(sum(scores['test_f1_micro'])/len(scores['test_f1_micro']))

#### 6.1.3. Concatenating Feature Vectors

In [None]:
import numpy as np

# let's check shape's for sanity & types (for illustration)
print(vectors.shape, type(vectors))
print(dvectors.shape, type(dvectors))

# types of CountVectorizer and DictVectorizer outputs are different 
# we need to convert them to the same format
uvectors = np.concatenate((vectors.toarray(), dvectors), axis=1)

print(uvectors.shape, type(uvectors))

In [None]:
# cross-validating classifier the usual way
scores = cross_validate(classifier, uvectors, labels, cv=stratified_split, scoring=['f1_micro'])

print(sum(scores['test_f1_micro'])/len(scores['test_f1_micro']))

## Lab Exercise
- Extend collocational features with
    - POS-tags
    - Ngrams within window
- Concatenate BOW and new collocational feature vectors & evaluate
- Evaluate Lesk Original and Graph-based (Lesk Similarity or Pedersen) metrics on the same split & compare

In [None]:
#FINISHED: PRINT METRICS AS ONE AND NOT FOR EVERY BATCH. 
#THEN, MAKE IT EXECUTE BOTH ORIGINAL_LESK AND PEDERSEN TOGETHER

In [7]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_validate
from sklearn.metrics import classification_report
from sklearn.model_selection import StratifiedKFold
from nltk.metrics.scores import precision, recall, f_measure, accuracy
from nltk.corpus import senseval
from nltk.util import ngrams
from nltk.wsd import lesk
from nltk.corpus import wordnet
from nltk.corpus import wordnet_ic
import nltk
import numpy as np
from pprint import pprint
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from collections import Counter

nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('wordnet_ic')
semcor_ic = wordnet_ic.ic('ic-semcor.dat')

NGRAM_WINDOW = 3

mapping = {
    'interest_1': 'interest.n.01',
    'interest_2': 'interest.n.03',
    'interest_3': 'pastime.n.01',
    'interest_4': 'sake.n.01',
    'interest_5': 'interest.n.05',
    'interest_6': 'interest.n.04',
}

def preprocess(text):
    mapping = {"NOUN": wordnet.NOUN, "VERB": wordnet.VERB, "ADJ": wordnet.ADJ, "ADV": wordnet.ADV}
    sw_list = stopwords.words('english')
    
    lem = WordNetLemmatizer()
    
    # tokenize, if input is text
    tokens = nltk.word_tokenize(text) if type(text) is str else text
    # pos-tag
    tagged = nltk.pos_tag(tokens, tagset="universal")
    # lowercase
    tagged = [(w.lower(), p) for w, p in tagged]
    # optional: remove all words that are not NOUN, VERB, ADJ, or ADV (i.e. no sense in WordNet)
    tagged = [(w, p) for w, p in tagged if p in mapping]
    # re-map tags to WordNet (return orignal if not in-mapping, if above is not used)
    tagged = [(w, mapping.get(p, p)) for w, p in tagged]
    # remove stopwords
    tagged = [(w, p) for w, p in tagged if w not in sw_list]
    # lemmatize
    tagged = [(w, lem.lemmatize(w, pos=p), p) for w, p in tagged]
    # unique the list
    tagged = list(set(tagged))
    return tagged

def get_sense_definitions(context):
    # input is text or list of strings
    lemma_tags = preprocess(context)
    # let's get senses for each
    senses = [(w, wordnet.synsets(l, p)) for w, l, p in lemma_tags]
    
    # let's get their definitions
    definitions = []
    for raw_word, sense_list in senses:
        if len(sense_list) > 0:
            # let's tokenize, lowercase & remove stop words 
            def_list = []
            for s in sense_list:
                defn = s.definition()
                # let's use the same preprocessing
                tags = preprocess(defn)
                toks = [l for w, l, p in tags]
                def_list.append((s, toks))
            definitions.append((raw_word, def_list))
    return definitions


def get_top_sense(words, sense_list):
    # get top sense from the list of sense-definition tuples
    # assumes that words and definitions are preprocessed identically
    val, sense = max((len(set(words).intersection(set(defn))), ss) for ss, defn in sense_list)
    return val, sense
'''
def lesk_simplified(context_sentence, ambiguous_word, pos=None, synsets=None):
    context = set(context_sentence)
    
    if synsets is None:
        synsets = wordnet.synsets(ambiguous_word)
    if pos:
        synsets = [ss for ss in synsets if str(ss.pos()) == pos]

    if not synsets:
        return None
    # Measure the overlap between context and definitions
    _, sense = max(
        (len(context.intersection(ss.definition().split())), ss) for ss in synsets
    )

    return sense
'''

def original_lesk(context_sentence, ambiguous_word, pos=None, synsets=None, majority=False):

    context_senses = get_sense_definitions(set(context_sentence)-set([ambiguous_word]))
    if synsets is None:
        synsets = get_sense_definitions(ambiguous_word)[0][1]

    if pos:
        synsets = [ss for ss in synsets if str(ss[0].pos()) == pos]

    if not synsets:
        return None
    scores = []
    # print(synsets)
    for senses in context_senses:
        for sense in senses[1]:
            scores.append(get_top_sense(sense[1], synsets))
            
    if len(scores) == 0:
        return synsets[0][0]
    
    if majority:
        # We remove 0 scores senses without overlapping
        filtered_scores = [x[1] for x in scores if x[0] != 0]
        if len(filtered_scores) > 0:
            best_sense = Counter(filtered_scores).most_common(1)[0][0]
        else:
            # Almost random selection
            best_sense = Counter([x[1] for x in scores]).most_common(1)[0][0]
    else:
        _, best_sense = max(scores)
    return best_sense

##GRAPH BASED
def get_top_sense_sim(context_sense, sense_list, similarity):
    # get top sense from the list of sense-definition tuples
    # assumes that words and definitions are preprocessed identically
    scores = []
    for sense in sense_list:
        ss = sense[0]
        if similarity == "path":
            try:
                scores.append((context_sense.path_similarity(ss), ss))
            except:
                scores.append((0, ss))    
        elif similarity == "lch":
            try:
                scores.append((context_sense.lch_similarity(ss), ss))
            except:
                scores.append((0, ss))
        elif similarity == "wup":
            try:
                scores.append((context_sense.wup_similarity(ss), ss))
            except:
                scores.append((0, ss))
        elif similarity == "resnik":
            try:
                scores.append((context_sense.res_similarity(ss, semcor_ic), ss))
            except:
                scores.append((0, ss))
        elif similarity == "lin":
            try:
                scores.append((context_sense.lin_similarity(ss, semcor_ic), ss))
            except:
                scores.append((0, ss))
        elif similarity == "jiang":
            try:
                scores.append((context_sense.jcn_similarity(ss, semcor_ic), ss))
            except:
                scores.append((0, ss))
        else:
            print("Similarity metric not found")
            return None
    val, sense = max(scores)
    return val, sense

def lesk_similarity(context_sentence, ambiguous_word, similarity="resnik", pos=None, 
                    synsets=None, majority=True):
    context_senses = get_sense_definitions(set(context_sentence) - set([ambiguous_word]))
    
    if synsets is None:
        synsets = get_sense_definitions(ambiguous_word)[0][1]

    if pos:
        synsets = [ss for ss in synsets if str(ss[0].pos()) == pos]

    if not synsets:
        return None
    
    scores = []
    
    # Here you may have some room for improvement
    # For instance instead of using all the definitions from the context
    # you pick the most common one of each word (i.e. the first)
    for senses in context_senses:
        for sense in senses[1]:
            scores.append(get_top_sense_sim(sense[0], synsets, similarity))
            
    if len(scores) == 0:
        return synsets[0][0]
    
    if majority:
        filtered_scores = [x[1] for x in scores if x[0] != 0]
        if len(filtered_scores) > 0:
            best_sense = Counter(filtered_scores).most_common(1)[0][0]
        else:
            # Almost random selection
            best_sense = Counter([x[1] for x in scores]).most_common(1)[0][0]
    else:
        _, best_sense = max(scores)
    
    return best_sense

def pedersen(context_sentence, ambiguous_word, similarity="resnik", pos=None, 
                    synsets=None, threshold=0.1):
                        
                        
    context_senses = get_sense_definitions(set(context_sentence) - set([ambiguous_word]))

    if synsets is None:
        synsets = get_sense_definitions(ambiguous_word)[0][1]

    if pos:
        synsets = [ss for ss in synsets if str(ss[0].pos()) == pos]

    if not synsets:
        return None
    
    synsets_scores = {}
    for ss_tup in synsets:
        ss = ss_tup[0]
        if ss not in synsets_scores:
            synsets_scores[ss] = 0
        for senses in context_senses:
            scores = []
            for sense in senses[1]:
                if similarity == "path":
                    try:
                        scores.append((sense[0].path_similarity(ss), ss))
                    except:
                        scores.append((0, ss))    
                elif similarity == "lch":
                    try:
                        scores.append((sense[0].lch_similarity(ss), ss))
                    except:
                        scores.append((0, ss))
                elif similarity == "wup":
                    try:
                        scores.append((sense[0].wup_similarity(ss), ss))
                    except:
                        scores.append((0, ss))
                elif similarity == "resnik":
                    try:
                        scores.append((sense[0].res_similarity(ss, semcor_ic), ss))
                    except:
                        scores.append((0, ss))
                elif similarity == "lin":
                    try:
                        scores.append((sense[0].lin_similarity(ss, semcor_ic), ss))
                    except:
                        scores.append((0, ss))
                elif similarity == "jiang":
                    try:
                        scores.append((sense[0].jcn_similarity(ss, semcor_ic), ss))
                    except:
                        scores.append((0, ss))
                else:
                    print("Similarity metric not found")
                    return None
            value, sense = max(scores)
            if value > threshold:
                synsets_scores[sense] = synsets_scores[sense] + value
    
    values = list(synsets_scores.values())
    senses = list(synsets_scores.keys())
    best_sense_id = values.index(max(values))
    return senses[best_sense_id]

def collocational_features(inst, ngram_window=NGRAM_WINDOW):
    p = inst.position
    feats_dict = {
        "w-2_word": 'NULL' if p < 2 else inst.context[p-2][0],
        "w-1_word": 'NULL' if p < 1 else inst.context[p-1][0],
        "w+1_word": 'NULL' if len(inst.context) - 1 < p+1 else inst.context[p+1][0],
        "w+2_word": 'NULL' if len(inst.context) - 1 < p+2 else inst.context[p+2][0],
        "POS-tags": inst.context[p][1],
    }
    #Computing raw string 
    sent_before = [inst.context[p-i-1][0] for i in reversed(range(ngram_window-1)) if p>i+1]
    sent_after = [inst.context[p+i+1][0] for i in (range(ngram_window-1)) if len(inst.context) - 1 > p+1]
    word = [inst.context[p][0]]
    sent_for_ngrams = ' '.join(sent_before+word+sent_after)
    add_dict = {}
    
    #Computing ngrams from raw string 
    for i in range(ngram_window):
        values = []
        key_name = str(i+1)+'-gram'
        value_with_tuples = ngrams(nltk.word_tokenize(sent_for_ngrams), i+1)
        for item in value_with_tuples:
            value_str = ' '.join(item)
            values.append(value_str)
        add_dict.update({key_name: values})

    #Updating the features dict with the ngram dictionary
    feats_dict.update(add_dict)
    return feats_dict
      
    

data = [" ".join([t[0] for t in inst.context]) for inst in senseval.instances('interest.pos')]
lbls = [inst.senses[0] for inst in senseval.instances('interest.pos')]

#Supervised approach with BOW to solve word-sense disambiguation  

vectorizer = CountVectorizer()
classifier = MultinomialNB()
lblencoder = LabelEncoder()

stratified_split = StratifiedKFold(n_splits=5, shuffle=True)

vectors = vectorizer.fit_transform(data)

# encoding labels for multi-calss
lblencoder.fit(lbls)
labels = lblencoder.transform(lbls)


#Supervised approach using dictionary of features to solve word-sense disambiguation

data_col = [collocational_features(inst, NGRAM_WINDOW) for inst in senseval.instances('interest.pos')]
dvectorizer = DictVectorizer(sparse=False)
dvectors = dvectorizer.fit_transform(data_col)

concatenated_vectors = np.concatenate((vectors.toarray(), dvectors), axis=1)

scores = cross_validate(classifier, concatenated_vectors, labels, cv=stratified_split, scoring=['f1_micro'])

print(sum(scores['test_f1_micro'])/len(scores['test_f1_micro']))
print('')


    
# Evaluate lesk and lesk graph on same split


def run_experiment(data, lbls, stratified_split, mapping, synsets, method = 'pedersen'):
    exp_scores = {}
    exp_scores['precision'] = []
    exp_scores['accuracy'] = []
    exp_scores['recall'] = []
    exp_scores['f_measure'] = []
    
    for train_index, test_index in stratified_split.split(data, lbls):
        print(test_index)
        refs, hyps, refs_list, hyps_list = get_hyps(test_index, data, lbls, mapping, synsets, method)
        
        acc = round(accuracy(refs_list, hyps_list), 3)
        exp_scores['accuracy'].append(acc)
        for cls in hyps.keys():
            if refs[cls] == set():
                refs[cls].add(-1)
            if hyps[cls] == set():
                hyps[cls].add(-1)
            p= round(precision(refs[cls], hyps[cls]),3)
            r = round(recall(refs[cls], hyps[cls]),3)
            f = round(f_measure(refs[cls], hyps[cls], alpha=1),3)
            
            exp_scores['precision'].append(p)
            exp_scores['recall'].append(r)
            exp_scores['f_measure'].append(f)
        
    print(f"{method} precision: {sum(exp_scores['precision'])/len(exp_scores['precision'])}")
    print(f"{method} recall: {sum(exp_scores['recall'])/len(exp_scores['recall'])}")
    print(f"{method} f_measure: {sum(exp_scores['f_measure'])/len(exp_scores['f_measure'])}")
    print(f"{method} accuracy: {sum(exp_scores['accuracy'])/len(exp_scores['accuracy'])}")
    print('')
    
    

def get_hyps(test_index, data, lbls, mapping ,synsets, method):
    refs = {k: set() for k in mapping.values()}
    hyps = {k: set() for k in mapping.values()}
    refs_list = []
    hyps_list = []
    for index in test_index:
        if method == 'pedersen':
            hyp = pedersen(data[index].split(), 'interest', similarity='path',synsets = synsets).name()
        elif method == 'lesk':
            hyp = original_lesk(data[index].split(), 'interest',synsets = synsets, majority =True).name()
        else:
            print('specify another method, options: [pedersen, lesk]')

        ref = mapping[lbls[index]]

        # for precision, recall, f-measure        
        refs[ref].add(index)
        hyps[hyp].add(index)

        # for accuracy
        refs_list.append(ref)
        hyps_list.append(hyp)
    
    return refs, hyps, refs_list, hyps_list
  


# since WordNet defines more senses, let's restrict predictions
synsets = []
for ss in wordnet.synsets('interest', pos='n'):
    if ss.name() in mapping.values():
        defn = ss.definition()
        tags = preprocess(defn)
        toks = [l for w, l, p in tags]
        synsets.append((ss,toks))

run_experiment(data, lbls, stratified_split, mapping, synsets, 'pedersen')
run_experiment(data, lbls, stratified_split, mapping, synsets, 'lesk')


[nltk_data] Downloading package wordnet to /home/thomas/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/thomas/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package wordnet_ic to
[nltk_data]     /home/thomas/nltk_data...
[nltk_data]   Package wordnet_ic is already up-to-date!


0.9210221139864944

[   1    4    5    6   24   33   47   48   59   62   65   67   68   69
   71   81   87   92   96  100  106  107  115  124  125  127  135  150
  169  174  184  202  205  211  213  219  232  235  243  246  257  274
  285  291  292  294  295  296  304  309  315  317  329  333  339  349
  354  356  358  375  392  396  397  398  407  411  418  422  424  443
  445  446  455  460  465  475  476  479  480  483  484  498  509  515
  517  525  528  532  535  541  543  547  548  554  560  562  570  574
  577  580  588  589  593  598  609  610  612  618  620  623  625  634
  641  645  655  662  689  691  698  710  711  712  716  717  718  726
  732  736  737  738  746  748  758  762  774  781  782  785  788  790
  796  799  811  812  814  819  825  831  835  842  845  848  852  860
  865  876  878  884  885  889  902  909  910  912  914  919  924  932
  942  947  952  958  962  970  973  981  986  992  999 1001 1004 1005
 1006 1011 1019 1022 1023 1025 1033 1034 1046 1052 1055 1

hyps cls is {486} and ref cls is {130, 772, 1286, 1290, 1783, 1036, 16, 1427, 1172, 1684, 2068, 2040, 153, 25, 2331, 1949, 287, 673, 1441, 1826, 2210, 2337, 553, 1065, 299, 300, 45, 684, 559, 1324, 1453, 1202, 1579, 2348, 437, 565, 2104, 1977, 1978, 2234, 1084, 1085, 1469, 449, 1474, 1219, 837, 840, 337, 338, 1748, 983, 1880, 1372, 1628, 2271, 352, 2145, 1251, 614, 1511, 2150, 619, 1134, 1902, 1782, 1527, 1400, 2294, 250, 890, 1791}
hyps cls is {2, 8, 2056, 12, 14, 16, 17, 2066, 2068, 25, 31, 34, 2083, 2086, 2089, 44, 45, 49, 2101, 2102, 56, 2104, 2109, 63, 2111, 2115, 74, 77, 2127, 2135, 2142, 2144, 101, 2150, 109, 2166, 123, 126, 130, 2181, 143, 2191, 159, 2210, 164, 2212, 166, 2215, 2220, 173, 2221, 175, 2228, 182, 2234, 189, 190, 2237, 2239, 2240, 2242, 2244, 197, 198, 2245, 200, 2250, 2256, 2260, 214, 2264, 217, 223, 2271, 2278, 2285, 238, 242, 2294, 247, 2295, 2296, 250, 255, 258, 2308, 2309, 2311, 266, 2314, 2317, 272, 2322, 278, 2327, 280, 287, 2337, 2339, 298, 299, 300, 2348, 

hyps cls is {930, 326, 2218, 1133, 880, 86} and ref cls is {128, 514, 1282, 1539, 1670, 904, 2185, 10, 1294, 1423, 1551, 1934, 151, 1049, 1689, 1692, 797, 158, 1948, 1312, 2209, 930, 1442, 2336, 1574, 426, 1964, 179, 180, 563, 439, 440, 569, 58, 1463, 572, 573, 1212, 2105, 320, 833, 2301, 1351, 328, 456, 457, 458, 1224, 1355, 1486, 1359, 1613, 212, 853, 221, 478, 1630, 2016, 2360, 229, 489, 1133, 366, 495, 880, 1264, 1517, 2292, 1142, 249, 378, 2045}
hyps cls is {0, 2052, 10, 19, 21, 26, 2077, 2080, 36, 2084, 38, 2096, 53, 54, 58, 2106, 2107, 2108, 2112, 70, 2119, 2121, 76, 90, 93, 2141, 97, 99, 2148, 2155, 111, 2162, 2163, 2168, 2170, 2176, 129, 134, 2182, 138, 141, 2190, 151, 152, 154, 155, 156, 2204, 158, 2209, 162, 163, 179, 180, 2227, 2229, 2231, 187, 195, 2246, 199, 2247, 2253, 2259, 212, 215, 218, 225, 2273, 227, 229, 237, 239, 240, 2288, 2292, 248, 249, 2297, 251, 2300, 2301, 2304, 260, 262, 265, 269, 270, 276, 2324, 281, 282, 2333, 288, 2336, 290, 297, 302, 2351, 306, 307, 235

hyps cls is {2049, 1796, 1549, 672, 1826, 551, 552, 1449, 1584, 1585, 2353, 833, 838, 455, 968, 2249, 1487, 1106, 1746, 724, 1874, 1750, 731, 220, 234, 235, 880, 630} and ref cls is {640, 513, 1536, 1273, 1286, 16, 1296, 18, 914, 150, 23, 790, 1047, 1049, 1687, 1942, 1947, 158, 1948, 672, 673, 1312, 1441, 1826, 2209, 1574, 2336, 300, 301, 558, 559, 684, 1585, 2353, 179, 564, 1460, 55, 58, 1340, 573, 1085, 1468, 833, 67, 68, 837, 838, 1475, 968, 841, 1613, 1487, 2331, 338, 1106, 852, 597, 1750, 2258, 2016, 1251, 234, 235, 1131, 1517, 880, 1142, 1143, 249, 1916, 2045, 1791}
hyps cls is {1441, 1155, 4, 1765, 134, 1608, 1864, 394, 395, 971, 1515, 561, 1331, 831, 2038, 23, 1854, 479} and ref cls is {220, 2230}
hyps cls is {2050, 2054, 523, 15, 21, 1047, 1559, 1049, 2072, 28, 540, 542, 1057, 550, 2087, 2089, 2090, 558, 564, 1589, 2103, 80, 592, 1624, 1113, 1629, 606, 608, 98, 104, 2154, 1131, 1643, 1141, 1143, 2168, 634, 638, 640, 129, 138, 650, 1167, 1169, 150, 1687, 152, 2205, 158, 1183, 1

hyps cls is {1025, 1423, 277, 664, 538, 1179, 1180, 928, 2080, 418, 554, 568, 1849, 1852, 1853, 962, 972, 1631, 623, 1904, 2289, 2291, 2036, 1660, 2301} and ref cls is {1793, 130, 1527, 1539, 1036, 525, 1934, 1039, 1423, 146, 1426, 532, 277, 1046, 2040, 2073, 538, 1692, 543, 1056, 1660, 553, 42, 426, 1834, 2346, 2223, 2352, 1202, 563, 437, 439, 2104, 569, 825, 2105, 2360, 1087, 449, 1218, 451, 580, 836, 1092, 1219, 2241, 457, 203, 1355, 1485, 337, 2002, 1627, 1372, 478, 2146, 229, 1510, 2151, 106, 2157, 1902, 1265, 2290, 2291, 1781, 1782, 119, 1528, 1529, 1020, 2301}
hyps cls is {770, 324, 1092, 1510, 393, 1001, 939, 1002, 1257, 1614, 2076, 1680, 2042, 155, 1276, 1693, 1502} and ref cls is {2118, 1473, 1790}
hyps cls is {1, 1539, 1541, 2059, 525, 1562, 1053, 1565, 543, 40, 553, 2098, 563, 2104, 1081, 66, 1093, 76, 1616, 2133, 2135, 1114, 1626, 1632, 2146, 1635, 106, 1642, 109, 119, 2177, 137, 649, 654, 146, 1686, 2207, 2208, 1703, 681, 171, 2219, 181, 189, 705, 707, 199, 1738, 2251, 22

hyps cls is {1804, 1295, 2075, 1828, 678, 687, 307, 180, 1209, 185, 570, 2365, 2254, 467, 1875, 342, 983, 993, 1507, 487, 1128, 1389, 1264, 1139, 629, 2043, 2302, 255} and ref cls is {514, 1540, 1927, 264, 2057, 10, 139, 2315, 1551, 2320, 1042, 153, 1946, 2075, 796, 797, 287, 2335, 930, 1315, 1700, 1827, 1830, 2085, 2211, 1579, 556, 45, 1453, 2222, 560, 2347, 2348, 180, 821, 185, 1977, 572, 1084, 574, 1212, 1216, 1469, 1472, 1351, 456, 1098, 1486, 1105, 1748, 87, 983, 1497, 1880, 988, 1630, 2143, 1250, 1507, 1252, 2018, 107, 619, 1133, 495, 1264, 500, 1783, 505, 378, 2299, 511}
hyps cls is {1414, 1173, 666, 26, 1830, 1965, 2222, 689, 823, 1977, 1979, 1469, 62, 576, 714, 723, 1621, 740, 1009, 882, 372, 886, 2170, 1915, 509} and ref cls is {1612, 1182}
hyps cls is {2048, 1026, 515, 516, 1540, 2055, 1544, 522, 1034, 1037, 1038, 24, 2074, 36, 549, 2085, 2095, 1078, 1596, 2108, 1094, 1095, 1098, 1610, 1618, 2141, 1637, 105, 1641, 1649, 2164, 121, 125, 1664, 2180, 133, 2181, 647, 2182, 139, 