# Lab 1: Semantic Similarity

In this lab, you will investigate measures of semantic similarity based on WordNet and on distributional similarity. In particular, you will consider how closely they correlate with human judgements of synonymy. Students who have recently done Natural Language Engineering or Applied Natural Language Processing should be able to get through this relatively quickly and have time to move on to the extension material on statistical significance.

* [Getting Started](#getting-started)

## Getting Started

In [17]:
# import nltk
# nltk.download()

import operator
from nltk.corpus import wordnet as wn, wordnet_ic as wn_ic, lin_thesaurus as lin

In [16]:
import nltk
nltk.download('wordnet_ic')

[nltk_data] Downloading package wordnet_ic to
[nltk_data]     /Users/lukebirkett/nltk_data...
[nltk_data]   Unzipping corpora/wordnet_ic.zip.


True

## Using WordNet (WN) Functions

In [None]:
wn.synsets("book")  # returns all the senses of a word
wn.synsets("book", wn.NOUN)  # returns only the noun senses
synsetA = wn.synsets("book", wn.NOUN)[0]  # extract the first sense as a variable
synsetA.definition()  # get the definition of that sense
synsetA.hyponyms()  # get the hyponyms (more specific children) of the sense
synsetA.hypernyms()  # get the hypernym(s) (more general parents) of the sense
synsetB = wn.synsets("book", wn.NOUN)[1]  # grab another sense
synsetB
synsetA.path_similarity(synsetB)  # compute the WordNet path similarity between the two
brown_ic = wn_ic.ic("ic-brown.dat")  # load the information content data derived from the Brown corpus
brown_ic
synsetA.res_similarity(synsetB, brown_ic)  # Resnik similarity
synsetA.lin_similarity(synsetB, brown_ic)  # Lin similarity

0.7098990245459575

### `ic-brown` Explained

The Brown data imported here is a dictionary of frequency counts keyed by WordNet synset ID. The counts come from the Brown Corpus, and NLTK derives each synset's information content (IC) score from them.

The structure of `ic-brown` is given as:
`{ 'part_of_speech': defaultdict(float, {synset_id: frequency_count}) }`

`part_of_speech` is a short key, for example `n` for nouns. Nested under each part-of-speech key is a `dict` of `synset_id: frequency_count` (key, value) pairs.
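As a rough illustration of this shape, the sketch below builds a toy table by hand and derives an IC score from a count. The offsets and counts are invented, and the convention that key `0` holds the total count for a part of speech is an assumption about how NLTK stores the data, so treat it as a model of the structure and not as NLTK's own code.

```python
import math
from collections import defaultdict

# Toy table in the same shape as the loaded IC data.
# All offsets and counts below are invented for illustration.
toy_ic = {
    "n": defaultdict(float, {
        0: 1000.0,    # assumed convention: key 0 holds the total count
        1111: 500.0,  # a frequent, general concept
        2222: 10.0,   # a rarer, more specific concept
    })
}

def information_content(offset, pos, ic_table):
    """Return -log(count / total), or infinity for an unseen synset."""
    counts = ic_table[pos]
    count = counts.get(offset, 0.0)  # .get avoids inserting new keys
    if count == 0:
        return math.inf
    return -math.log(count / counts[0])

print(information_content(1111, "n", toy_ic))  # lower IC: common concept
print(information_content(2222, "n", toy_ic))  # higher IC: rare concept
```

The rarer a concept is in the corpus, the higher its IC, which is why specific senses score higher than general ones.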

The `synset_id` is in "offset" form, which is a number of up to 8 digits. WordNet synsets are usually accessed by name, as in `syn = wn.synset('bank.n.01')`, but a synset's offset can be retrieved with `syn.offset()`. WordNet can also be searched directly by offset with `wn.synset_from_pos_and_offset(pos, offset)`.

Functions that take the `ic-brown` dictionary look items up by offset, for example `lion.lin_similarity(cat, brown_ic)`. In this example NLTK obtains the offsets of `lion` and `cat`, uses `brown_ic` to get the IC of each, and uses `brown_ic` again to get the IC of their lowest common subsumer (LCS). These three values are combined into the Lin similarity:

$$Sim_{Lin}(s_1, s_2) = \frac{2 \times IC(LCS)}{IC(s_1) + IC(s_2)}$$

The Brown dataset is essentially a lookup table of counts from which NLTK derives the IC scores used by the WordNet similarity measures.
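To make the Lin formula concrete, here is a minimal sketch that applies it to three IC values. The numbers are made up, and the name `ic_feline` for the LCS is purely illustrative; it is not a claim about what WordNet returns for these senses.

```python
def lin_similarity(ic_s1, ic_s2, ic_lcs):
    """Lin similarity computed from three information content values."""
    return (2 * ic_lcs) / (ic_s1 + ic_s2)

# Made-up IC values: two specific senses sharing a more general LCS.
ic_lion, ic_cat, ic_feline = 9.0, 7.0, 6.0

score = lin_similarity(ic_lion, ic_cat, ic_feline)
print(score)  # 2 * 6 / (9 + 7) = 0.75
```

When both senses are the same synset, the LCS is that synset too, so the score is 1. The more general the LCS (the lower its IC), the closer the score gets to 0.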

## 2.1 Tasks

#### Write a function to return the path similarity of two nouns. Remember this is the maximum similarity of all of the possible pairings of the two nouns. Make sure you test it. For (chicken,car) the correct answer is 0.0909 (3sf).

In [None]:
# TODO: look up every noun sense of both words, loop over all sense pairs,
# and keep the maximum similarity.
# Note: this first attempt compares only one sense of each word and follows
# only the first hypernym at each step, so it can find a longer path (and a
# lower score) than the expected 0.0909.

syn1 = wn.synset('chicken.n.01')
syn2 = wn.synset('car.n.01')

def distance_from_hyper(entry, ancestor):
    """Count the hypernym steps from `entry` up to `ancestor`."""
    distance = 0
    print(f"starting sense {entry}")
    while entry != ancestor:
        entry = entry.hypernyms()[0]  # only the first hypernym is followed
        print(entry)
        distance += 1
    return distance


def path_similarity(noun1, noun2):
    """Print 1 / (path length + 1) for the path through the LCS."""
    print(f"noun_1 is: {noun1}")
    print(f"noun_2 is: {noun2}")

    # Lowest common subsumer (LCS) of the two synsets
    anc = noun1.lowest_common_hypernyms(noun2)[0]
    print(f"The LCS is {anc}")

    print("===")

    noun1_distance = distance_from_hyper(noun1, anc)
    print(f"distance: {noun1_distance}")

    noun2_distance = distance_from_hyper(noun2, anc)
    print(f"distance: {noun2_distance}")

    total_distance = noun1_distance + noun2_distance

    path_sim = 1 / (total_distance + 1)

    print(path_sim)

path_similarity(syn1, syn2)



noun_1 is: Synset('chicken.n.01')
noun_2 is: Synset('car.n.01')
The LCS is Synset('physical_entity.n.01')
===
starting sense Synset('chicken.n.01')
Synset('poultry.n.02')
Synset('bird.n.02')
Synset('meat.n.01')
Synset('food.n.02')
Synset('solid.n.01')
Synset('matter.n.03')
Synset('physical_entity.n.01')
distance: 7
starting sense Synset('car.n.01')
Synset('motor_vehicle.n.01')
Synset('self-propelled_vehicle.n.01')
Synset('wheeled_vehicle.n.01')
Synset('container.n.01')
Synset('instrumentality.n.03')
Synset('artifact.n.01')
Synset('whole.n.02')
Synset('object.n.01')
Synset('physical_entity.n.01')
distance: 9
0.058823529411764705
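
The task asks for the maximum over all sense pairings, which the attempt above does not yet do. One possible sketch is given below. It assumes NLTK's built-in `Synset.path_similarity` is acceptable for scoring each pair, and the helper names `max_similarity` and `noun_path_similarity` are hypothetical, not part of NLTK.

```python
from itertools import product

def max_similarity(senses1, senses2, measure):
    """Return the best score over every pairing of senses, or None."""
    scores = (measure(s1, s2) for s1, s2 in product(senses1, senses2))
    # Skip pairs for which the measure returns None (no score available)
    return max((s for s in scores if s is not None), default=None)

def noun_path_similarity(word1, word2):
    """Maximum WordNet path similarity over all noun senses of two words."""
    from nltk.corpus import wordnet as wn  # needs the WordNet corpus installed
    return max_similarity(
        wn.synsets(word1, wn.NOUN),
        wn.synsets(word2, wn.NOUN),
        lambda s1, s2: s1.path_similarity(s2),
    )

# With WordNet available, the task statement expects about 0.0909 for:
# noun_path_similarity("chicken", "car")
```

Keeping the pairing logic in `max_similarity` separate from the WordNet lookup means the same loop can be reused with any other pairwise measure.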


#### Generalise it so that you have an extra (optional) parameter which you use to select the WordNet similarity measure e.g., res similarity and lin similarity
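
One way to approach this is to dispatch on a `measure` string. The sketch below assumes the three `Synset` methods used earlier in this lab (`path_similarity`, `res_similarity`, `lin_similarity`); the function names and the `lookup` parameter are hypothetical and exist only to keep the example self-contained.

```python
from itertools import product

def synset_similarity(s1, s2, measure="path", ic=None):
    """Score two synsets with the named measure: 'path', 'res' or 'lin'."""
    if measure == "path":
        return s1.path_similarity(s2)
    if measure in ("res", "lin"):
        if ic is None:
            raise ValueError(f"measure '{measure}' needs an IC table")
        method = s1.res_similarity if measure == "res" else s1.lin_similarity
        return method(s2, ic)
    raise ValueError(f"unknown measure: {measure}")

def noun_similarity(word1, word2, measure="path", ic=None, lookup=None):
    """Maximum similarity over all noun senses of two words.

    `lookup` maps a word to its noun synsets; it defaults to WordNet.
    """
    if lookup is None:
        from nltk.corpus import wordnet as wn  # needs the WordNet corpus
        lookup = lambda word: wn.synsets(word, wn.NOUN)
    scores = [synset_similarity(a, b, measure, ic)
              for a, b in product(lookup(word1), lookup(word2))]
    scores = [s for s in scores if s is not None]
    return max(scores, default=None)
```

With the `brown_ic` table loaded earlier, a call such as `noun_similarity("chicken", "car", "lin", ic=brown_ic)` would select Lin similarity, while omitting the extra arguments falls back to path similarity.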