# Lab 1: Semantic Similarity

In this lab, you will investigate measures of semantic similarity based on WordNet and on distributional similarity. In particular, you will consider how closely they correlate with human judgements of synonymy. Students who have recently done Natural Language Engineering or Applied Natural Language Processing should be able to get through this relatively quickly and have time to move on to the extension material on statistical significance.

1. [Getting Started](#getting-started)
    * `ic-brown` explained
2. [Using WordNet (WN) Functions](#using-wordnet-wn-functions)
    * Write a function to return the path similarity of two nouns
    * Generalise the function to use IC measures
3. [Human Synonymy Judgements](#3-human-synonymy-judgements)

## Getting Started

In [17]:
# import nltk
# nltk.download()

import operator
from nltk.corpus import wordnet as wn, wordnet_ic as wn_ic, lin_thesaurus as lin

In [16]:
import nltk
nltk.download('wordnet_ic')

[nltk_data] Downloading package wordnet_ic to
[nltk_data]     /Users/lukebirkett/nltk_data...
[nltk_data]   Unzipping corpora/wordnet_ic.zip.


True

## Using WordNet (WN) Functions

In [22]:
wn.synsets("book")  # returns all the senses of a word
wn.synsets("book", wn.NOUN)  # returns only the senses that are nouns
synsetA = wn.synsets("book", wn.NOUN)[0]  # store the first noun sense in a variable
synsetA.definition()  # get the definition of that sense
synsetA.hyponyms()  # get the hyponyms (more specific children) of the sense
synsetA.hypernyms()  # get the hypernym(s) (more general parents) of the sense
synsetB = wn.synsets("book", wn.NOUN)[1]  # grab another sense
synsetB
synsetA.path_similarity(synsetB)  # compute the WordNet path similarity between the two senses
brown_ic = wn_ic.ic("ic-brown.dat")  # load the information content counts derived from the Brown corpus
brown_ic
synsetA.res_similarity(synsetB, brown_ic)  # Resnik similarity
synsetA.lin_similarity(synsetB, brown_ic)  # Lin similarity (only this last value is displayed)

0.7098990245459575

### `ic-brown` Explained

The Brown data imported here is a dictionary of frequency counts keyed by WordNet synset ID. The counts were collected from the Brown Corpus, and NLTK derives each synset's IC score from them.

The structure of `ic-brown` is given as:
`{ 'part_of_speech': defaultdict(float, {synset_id: frequency_count}) }`

The `part_of_speech` key is, for example, `n` for nouns. Nested under each part of speech is a `dict` of `synset_id: frequency_count` (key, value) pairs.

The `synset_id` is in "offset" form, which is an 8-digit number. A synset is usually accessed by name, as in `syn = wn.synset('bank.n.01')`, but its offset can be retrieved with `syn.offset()`. WordNet can also be searched directly by offset with `wn.synset_from_pos_and_offset(pos, offset)`.

Methods that take the `ic-brown` dictionary, such as `lion.lin_similarity(cat, brown_ic)`, use synset offsets to look up the entries they need. In this example, the offsets of `lion` and `cat` are used to obtain both IC scores, and `brown_ic` is also used to look up the IC of their lowest common subsumer (LCS). The three IC scores are then combined to compute the Lin similarity.

$$Sim_{Lin}(s_1, s_2) = \frac{2 \times IC(LCS)}{IC(s_1) + IC(s_2)}$$

The Brown dataset is essentially a lookup table from which WordNet and NLTK obtain IC scores.
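As a rough illustration of how a table of this shape could be turned into IC scores, the sketch below builds a toy dictionary with the structure shown above. The offsets and counts are invented, the helper name `information_content` is hypothetical, and storing the total count under key `0` is an assumption made for this sketch.

```python
import math
from collections import defaultdict

# Toy table in the documented shape: {pos: defaultdict(float, {offset: count})}.
# All offsets and counts are invented; keeping the total under key 0 is an assumption.
toy_ic = {"n": defaultdict(float, {0: 1000.0, 11111111: 200.0, 22222222: 5.0})}

def information_content(ic_table, pos, offset):
    """Return -log(count / total), or infinity when the offset was never counted."""
    counts = ic_table[pos]
    count = counts.get(offset, 0.0)  # .get avoids inserting missing keys into the defaultdict
    if count == 0.0:
        return math.inf
    return -math.log(count / counts[0])

print(information_content(toy_ic, "n", 11111111))  # frequent concept, low IC
print(information_content(toy_ic, "n", 22222222))  # rare concept, high IC
print(information_content(toy_ic, "n", 33333333))  # offset with no count
```

The unseen offset shows why rare words are a problem for IC measures: with no count there is no finite IC to look up.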

## 2.1 Tasks

#### Write a function to return the path similarity of two nouns

Remember that this is the maximum similarity over all possible pairings of the senses of the two nouns. Make sure you test it. For (chicken, car) the correct answer is 0.0909 (3sf).
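As a quick sanity check on the expected answer, assume path similarity is the reciprocal of one plus the shortest path length $d$ between two senses, which is how NLTK's `path_similarity` behaves. Under that assumption, a score of 0.0909 corresponds to a shortest path of 10 edges between the closest pair of senses:

$$Sim_{path}(s_1, s_2) = \frac{1}{d(s_1, s_2) + 1} \quad\Rightarrow\quad \frac{1}{10 + 1} = \frac{1}{11} \approx 0.0909$$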

#### Generalise it so that you have an extra (optional) parameter which you use to select the WordNet similarity measure e.g., res similarity and lin similarity

**Reminder on Res and Lin Similarity:** Both are metrics that rely on Information Content (IC), which represents how "specific" or "informative" a concept is based on its frequency in a large body of text (a corpus).

Resnik's measure is based entirely on the Information Content (IC) of the Lowest Common Subsumer (LCS), so it needs only that one piece of information to be calculated. The intuition is that if the most specific ancestor shared by the two words is rare, then the two words are highly similar: they share a specific subset of meaning, as two breed names do under the LCS "dog". However, if the LCS is a common, general concept, then the two words are very distinct, because no common ancestor specifies their category.

$$Sim_{resnik}(s_1, s_2) = IC(LCS(s_1, s_2))$$

Lin's measure extends Resnik's through normalization: twice the $IC(LCS)$ is divided by the sum of the two words' ICs. It can be summarized as the proportion of information that the two words share. The downside is that Lin needs three IC values. If the two words are very rare, they may not appear in the corpus at all, in which case the calculation fails.

$$Sim_{lin}(s_1, s_2) = \frac{2 \times IC(LCS(s_1, s_2))}{IC(s_1) + IC(s_2)}$$
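To make the contrast between the two formulas concrete, here is a minimal sketch that applies them to a handful of hypothetical IC values. The concept names, the IC numbers, and the helper names `resnik` and `lin` are invented for illustration and do not come from the Brown corpus or from NLTK.

```python
def resnik(ic, lcs):
    """Resnik similarity: the IC of the lowest common subsumer alone."""
    return ic[lcs]

def lin(ic, s1, s2, lcs):
    """Lin similarity: twice the LCS's IC, normalised by the two concepts' ICs."""
    denominator = ic[s1] + ic[s2]
    if denominator == 0:
        return 0.0  # sketch-only convention when both concepts have zero IC
    return 2 * ic[lcs] / denominator

# Hypothetical IC values (invented, not taken from any corpus).
toy_ic_values = {"entity": 0.0, "dog": 6.0, "poodle": 10.0, "beagle": 10.0, "car": 7.0}

# Two breeds share the fairly specific ancestor "dog".
print(resnik(toy_ic_values, "dog"))
print(lin(toy_ic_values, "poodle", "beagle", "dog"))

# A breed and a car share only the most general ancestor "entity".
print(resnik(toy_ic_values, "entity"))
print(lin(toy_ic_values, "poodle", "car", "entity"))
```

In this toy setting Resnik returns an unbounded IC value, whereas Lin rescales it to a score between 0 and 1, reaching 1 when a concept is compared with itself.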

Additionally, whereas path length can be calculated using WordNet alone, IC measures also require a corpus. WordNet provides the hierarchy and the corpus provides the statistics. The more frequently a concept appears, the less information it carries, because IC is the negative logarithm of the concept's probability.

$$IC(s) = -\log P(s)$$

Where the probability $P(s)$ is estimated by:

1. Counting how many times the word (and all its more specific "children" in the hierarchy) appears in the corpus.
2. Dividing by the total number of words in the corpus.
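The two estimation steps can be sketched on a toy hierarchy as follows. The concepts, links, and counts are all invented, and each concept is given a single parent to keep the example short, which is a simplification because a WordNet synset can have several hypernyms.

```python
import math

# Hypothetical child -> parent links and raw corpus counts (all invented).
parent = {"poodle": "dog", "beagle": "dog", "dog": "animal", "cat": "animal", "animal": None}
raw_counts = {"poodle": 2, "beagle": 3, "dog": 15, "cat": 20, "animal": 10}

def cumulative_counts(parent, raw_counts):
    """Add each concept's raw count to itself and to every one of its ancestors."""
    totals = {concept: 0 for concept in parent}
    for concept, count in raw_counts.items():
        node = concept
        while node is not None:
            totals[node] += count
            node = parent[node]
    return totals

totals = cumulative_counts(parent, raw_counts)
corpus_size = sum(raw_counts.values())  # total number of counted words

# IC(s) = -log P(s), written as log(corpus_size / count) so the root gives 0.0.
ic = {concept: math.log(corpus_size / totals[concept]) for concept in totals}

for concept in ("animal", "dog", "poodle"):
    print(concept, totals[concept], round(ic[concept], 3))
```

Because counts accumulate upwards, the general concept `animal` covers every counted word and carries no information, while the specific `poodle` has the highest IC of the three.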

In [96]:
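from nltk.corpus import wordnet as wn, wordnet_ic

# Information content counts derived from the Brown corpus
brown_ic = wordnet_ic.ic('ic-brown.dat')

def distance_from_ancestor(entry, ancestor):
    """Count the hypernym steps from entry up to ancestor."""
    distance = 0
    current = entry

    while current != ancestor:
        hypers = current.hypernyms()
        if not hypers:
            break  # reached the top without meeting the ancestor
        # Only the first hypernym is followed, so for synsets with several
        # hypernyms this may not be the shortest route, or may miss the ancestor.
        current = hypers[0]
        distance += 1

    return distance

def similarity(noun1, noun2):
    """Return the path, Resnik and Lin similarities of two noun synsets."""
    payload = {}
    payload["pair"] = (noun1, noun2)
    # Use the first lowest common hypernym as the shared ancestor
    payload["anc"] = noun1.lowest_common_hypernyms(noun2)[0]
    payload["noun1_distance"] = distance_from_ancestor(noun1, payload["anc"])
    payload["noun2_distance"] = distance_from_ancestor(noun2, payload["anc"])
    payload["total_distance"] = payload["noun1_distance"] + payload["noun2_distance"]
    payload["path_similarity"] = 1 / (payload["total_distance"] + 1)
    payload["res_similarity"] = noun1.res_similarity(noun2, brown_ic)
    payload["lin_similarity"] = noun1.lin_similarity(noun2, brown_ic)

    return payload


def max_similarity(n1, n2, measure="path_similarity"):
    """Return the payload of the sense pair that maximises the chosen measure."""
    # `measure` is one of "path_similarity", "res_similarity", "lin_similarity"
    max_s = 0
    payl = {}  # stays empty if no pair scores above 0

    sys1 = wn.synsets(n1, pos=wn.NOUN)
    sys2 = wn.synsets(n2, pos=wn.NOUN)

    # Compare every noun sense of n1 with every noun sense of n2
    for s1 in sys1:
        for s2 in sys2:
            _payl = similarity(s1, s2)
            ps = _payl[measure]
            if ps > max_s:
                max_s = ps
                payl = _payl

    return payl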

max_similarity("chicken", "car")


{'pair': (Synset('wimp.n.01'), Synset('car.n.02')),
 'anc': Synset('whole.n.02'),
 'noun1_distance': 5,
 'noun2_distance': 5,
 'total_distance': 10,
 'path_similarity': 0.09090909090909091,
 'res_similarity': 1.5318337432196856,
 'lin_similarity': 0.16297501193902675}
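The nested loop in `max_similarity` can also be factored into a small helper that receives the similarity measure as a function, which is one way to make the measure selectable. The sketch below is a variant under stated assumptions, not the lab's reference solution: the helper name `max_sense_similarity`, the toy sense labels, and the toy scores are all invented so that the example runs without WordNet data.

```python
from itertools import product

def max_sense_similarity(senses1, senses2, sim_fn):
    """Return (best_score, best_pair) over all sense pairings, or (None, None)."""
    best_score, best_pair = None, None
    for s1, s2 in product(senses1, senses2):
        score = sim_fn(s1, s2)
        if score is None:
            continue  # skip pairs the measure cannot score
        if best_score is None or score > best_score:
            best_score, best_pair = score, (s1, s2)
    return best_score, best_pair

# Hypothetical sense inventory and scores, invented for illustration.
toy_senses = {"chicken": ["chicken_food", "chicken_bird"], "car": ["car_auto", "car_rail"]}
toy_scores = {
    ("chicken_food", "car_auto"): 0.05,
    ("chicken_bird", "car_auto"): 0.08,
    ("chicken_bird", "car_rail"): 0.07,
}  # the missing pair stands in for a measure that returns None

def toy_similarity(s1, s2):
    """Look up a toy score; returns None for unscored pairs."""
    return toy_scores.get((s1, s2))

score, pair = max_sense_similarity(toy_senses["chicken"], toy_senses["car"], toy_similarity)
print(score, pair)
```

With NLTK, the same helper could be given the lists returned by `wn.synsets(word, pos=wn.NOUN)` together with a function such as `lambda a, b: a.path_similarity(b)` or `lambda a, b: a.lin_similarity(b, brown_ic)`. Passing the measure in this way avoids computing all three scores for every pair.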

## 3 Human Synonymy Judgements

`mcdata.csv` contains the Miller & Charles human similarity judgements discussed in the seminar.