# Week 6 Lab: Lexical Semantics

This week we turn our attention to lexical semantics, i.e., the meaning of words. In this lab, you will be:
* exploring the WordNet resource
* learning about lexical relations such as synonymy and hyponymy
* quantifying semantic similarity via distance in the WordNet hierarchy
* comparing WordNet similarity scores with human synonymy judgements




In [None]:
###uncomment if working on colab

#from google.colab import drive
#drive.mount('/content/drive')


First, let's import WordNet from the nltk library.

In [None]:
#import nltk
#nltk.download('wordnet')
#nltk.download('wordnet_ic')

from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic as wn_ic


## Navigating WordNet

Central to the organisation of WordNet is the idea of a synset. Words have senses, and senses are grouped with synonymous senses (of other words) in **synsets**.

If you want to find out which synsets a word belongs to, you use the `synsets` function.  

In [None]:
from nltk.corpus import wordnet as wn
wn.synsets("book")

The output is a list of `Synset` objects, each of which has a unique identifier containing one of its words, its part of speech and a number. `Synset('book.n.01')` is the first noun sense of *book*. However, the word *book* is also in `Synset('record.n.05')`, which is the fifth noun sense of *record*. Let's inspect this synset further.

In [None]:
book_synsets=wn.synsets('book')
recordn5=book_synsets[2]
print(recordn5.lemma_names())  #get the words in the synset
print(recordn5.definition())   #get the definition of the synset
print(recordn5.examples())  #get examples of the words used in this sense

In [None]:
plant_synsets=wn.synsets('plant')
for i,s in enumerate(plant_synsets):
    print("{}:{}".format(i+1,s.definition()))

If you only want to find synsets associated with a particular part of speech of a word, then you can give `synsets` an extra argument.

In [None]:
#all of the WN POS tags
parts_of_speech=[wn.NOUN,wn.VERB,wn.ADJ,wn.ADV]

print(wn.synsets("red",parts_of_speech[0]))


### Exercise 1.1
* Write code to compute the number of synsets of each part of speech (noun, verb, adjective and adverb) for each of the following words: book, chicken, counter, twig, fast, plant
* Store and display the information using a Pandas dataframe

Hint: you could use a nested list comprehension

The `Synset` object has a `lemmas()` method which returns all of the lemmas / word senses which make up that synset.  Remember, it is one sense of each word which is considered as synonymous within the synset.  Not every sense of *plant* is considered synonymous with every sense of *works*.  

In [None]:
for i,s in enumerate(plant_synsets):
    print("{}:{}".format(i,s.lemmas()))

Access the word form of a `Lemma` using its `name()` method.

In [None]:
cat_synsets = wn.synsets("cat",wn.NOUN)
for i,s in enumerate(cat_synsets):
    wordforms=[l.name() for l in s.lemmas()]
    print("{}:{}\n\t{}".format(i,wordforms,s.definition()))

The `Synset` object also has `hyponyms` and `hypernyms` methods which return hyponym and hypernym synsets respectively.

For example:

In [None]:
#getting back the list of hyponym synsets for recordn5 (the 5th noun sense of record)
recordn5.hyponyms()

In [None]:
#iterating over the hyponyms of the Synset at index 6 (the 7th) in the list of noun synsets for cat
for h in cat_synsets[6].hyponyms():
    h_words=[w.name() for w in h.lemmas()]
    print("{}:{}".format(h_words,h.definition()))

In [None]:
##Iterating over the hypernyms of the noun synset at index 6 for cat and outputting lemma names and definition
for h in cat_synsets[6].hypernyms():
    h_words=[w.name() for w in h.lemmas()]
    print("{}:{}".format(h_words,h.definition()))

As an alternative to calling `.name()` on each of the Lemmas associated with a Synset, you can also use the `.lemma_names()` method directly on a synset.

In [None]:
for h in recordn5.hypernyms():
    print(h.lemma_names())

Since the hyponymy relation forms a tree, we would expect synsets to generally have multiple hyponyms and a single hypernym. At the top of the tree (also called the **root**), the hypernym list will be empty. Most noun concepts in WordNet share a common root hypernym, which is *entity*. At the bottom of the tree (also referred to as the **leaves** of the tree), the hyponym list will be empty.

### Exercise 1.2
Write a function, `distance_to_root` that will take a Synset and traverse up the tree until it reaches a root of the tree.  When it does so, it should return the number of steps taken.

Hint: This can be done using **recursion**, where a function repeatedly calls itself.  You need to define:
* a base case:  How will the function know when it is at the top of the tree and what should it return?
* a recursive step: In the general case, the function should call itself with a simpler problem (a Synset which is closer to the top of the tree).  When it gets the result of this function call, it needs to modify it in some way and then return its own answer
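
The base case and recursive step described above can be sketched on a toy hierarchy before you apply them to WordNet. The dictionary `toy_hypernym` and the function `steps_to_root` are hypothetical names invented for this illustration and are not part of NLTK. With real `Synset` objects you would call `hypernyms()` rather than look up a dictionary, and you would still need to decide what to do if a synset has more than one hypernym.

```python
# Toy hierarchy (an assumption for illustration): each concept maps to its
# single hypernym, and the root maps to None.
toy_hypernym = {
    "entity": None,
    "animal": "entity",
    "dog": "animal",
    "poodle": "dog",
}

def steps_to_root(concept):
    """Count the upward steps from `concept` to the root of the toy hierarchy."""
    parent = toy_hypernym[concept]
    if parent is None:
        # Base case: a root has no hypernym, so no steps are needed.
        return 0
    # Recursive step: solve the smaller problem for the parent, then add one
    # for the step from `concept` up to its parent.
    return 1 + steps_to_root(parent)

print(steps_to_root("entity"))  # 0
print(steps_to_root("poodle"))  # 3
```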

Make sure you test your function.  You should find that the 5th noun sense of record is 6 steps from the top.

How far are all of the other noun senses of *book* from a root of the tree?




## Semantic Similarity in WordNet

The simplest way of defining how similar two concepts are according to WordNet is to use the pathlength measure:

\begin{eqnarray*}
\mbox{sim}(\mbox{synsetA},\mbox{synsetB})=\frac{1}{1+\mbox{lengthOfPath}(\mbox{synsetA},\mbox{synsetB})}
\end{eqnarray*}
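
As a quick worked example of this formula, the sketch below evaluates it for a few path lengths. The function name `path_sim` is an illustrative choice, and the path lengths are arbitrary values rather than distances taken from WordNet.

```python
def path_sim(path_length):
    """Pathlength similarity for a shortest path of `path_length` steps."""
    return 1 / (1 + path_length)

# Identical synsets are 0 steps apart, so their similarity is 1.
# The similarity shrinks towards 0 as the path gets longer.
for length in [0, 1, 4, 10]:
    print(length, round(path_sim(length), 4))
# 0 1.0
# 1 0.5
# 4 0.2
# 10 0.0909
```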

We have also introduced other measures in the lectures which incorporate **information content**, i.e., the amount of information we receive when a word from a given synset is used (there is more information in being told that something is a *poodle* than in being told it is an *animal*).
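
To make the idea concrete, here is a minimal sketch that assumes the usual definition of information content, IC(c) = -log P(c), where P(c) is the probability of meeting a word from synset c or from any of its hyponyms. The counts are invented for illustration and do not come from the Brown corpus or any other real data.

```python
import math

# Hypothetical counts, invented for illustration: how often a word from each
# synset (or from any of its hyponyms) occurs among 1000 noun tokens.
total = 1000
counts = {"animal": 200, "dog": 20, "poodle": 2}

def information_content(concept):
    """IC(c) = -log P(c), with P(c) estimated as a relative frequency."""
    return -math.log(counts[concept] / total)

# Rarer, more specific concepts carry more information.
for concept in counts:
    print(concept, round(information_content(concept), 2))
```

Because every occurrence of *poodle* also counts as an occurrence of *dog* and of *animal*, the probability can only fall, and the information content can only rise, as you move down the hierarchy.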

The WordNet corpus reader, `nltk.corpus.wordnet` (imported above as `wn`), has built-in functions for computing these similarities between synsets.

In [None]:
books=wn.synsets("book",wn.NOUN)
print("path_similarity {}".format(wn.path_similarity(books[0],books[1])))

brown_ic=wn_ic.ic("ic-brown.dat")  #this gets information content data from the Brown corpus
print("resnik_similarity {}".format(wn.res_similarity(books[0],books[1],brown_ic)))
print("lin_similarity {}".format(wn.lin_similarity(books[0],books[1],brown_ic)))

Note that it is impossible to compare synsets of different parts of speech using these methods because they are not connected via hyponymy.

In [None]:
booksN=wn.synsets("book",wn.NOUN)
booksV=wn.synsets("book",wn.VERB)
print("path_similarity {}".format(wn.path_similarity(booksN[0],booksV[1])))
print("resnik_similarity {}".format(wn.res_similarity(booksN[0],booksV[1],brown_ic)))
print("lin_similarity {}".format(wn.lin_similarity(booksN[0],booksV[1],brown_ic)))

### Exercise 2.1

The similarity of two **words** with a given part of speech is defined as the **maximum** similarity of all possible sense pairings. If word A has 5 noun senses and word B has 4 noun senses, then there are 20 possible sense pairings to check.

* Write a function which will compute the path_similarity of two nouns.
* Make sure you test it.  The correct answer for *chicken* and *car* is 0.0909 to 3SF
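
The "maximum over all sense pairings" idea can be sketched with placeholder data. The sense labels and the `toy_similarity` function below are made up for illustration; in your own solution they would be replaced by real synsets and a real WordNet measure. Bear in mind that `path_similarity` may return `None` when no connecting path is found, so such values need handling before taking a maximum.

```python
from itertools import product

# Placeholder sense labels (hypothetical): word A has 5 senses, word B has 4.
senses_a = ["a1", "a2", "a3", "a4", "a5"]
senses_b = ["b1", "b2", "b3", "b4"]

def toy_similarity(sense_a, sense_b):
    """Made-up score standing in for a real synset similarity measure."""
    toy_distance = int(sense_a[1]) + int(sense_b[1])
    return 1 / (1 + toy_distance)

# Every sense of A is paired with every sense of B.
pairings = list(product(senses_a, senses_b))
print(len(pairings))  # 20

# The word-level similarity is the best score over all pairings.
best = max(toy_similarity(a, b) for a, b in pairings)
print(round(best, 3))
```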

### Exercise 2.2
Generalise your path_similarity function so that it takes an extra optional argument:
* the similarity measure to use


In [None]:
word_similarity("chicken","car",measure="lin")

## Comparing WordNet Similarities with Human Synonymy Judgements

The file `mcdata.csv` contains human synonymy judgements for a list of 30 noun pairs. We can read in a `.csv` file using the `csv` library.

In [None]:
import csv
import os
import pandas as pd

#adjust directory to wherever mcdata.csv is stored (use '' if it is in the current working directory)
directory='/content/drive/My Drive/NLE Notebooks/Week5LabsSolutions/'
filename='mcdata.csv'
filepath=os.path.join(directory,filename)


with open(filepath,'r') as filestream:
    mcdata=list(csv.reader(filestream,delimiter=','))

df=pd.DataFrame(mcdata,columns=["word1","word2","human similarity"])
#let's make sure the scores are floats not strings.  We can do this by applying the float() function to every value in the column (which we can do using map)
df["human similarity"]=df["human similarity"].map(float)

In [None]:
df.describe()

Note that the human similarity judgements range between 0 and 4.

### Exercise 3.1
Write code that will 
* compute the WordNet path_similarity for every pair of words in this data; and
* add it as a column in the dataframe.  If you have the path similarity scores in a list called `scores`, you can do this using `df['path']=scores`

Repeat for the Resnik and Lin similarity scores.


We can use pandas functionality to produce scatter plots and examine the correlation between different variables.

In [None]:
%matplotlib inline

x="human similarity"
y="path"

df.plot.scatter(x,y)

### Exercise 3.2
Generate scatter plots showing Resnik similarity against human similarity and Lin similarity against human similarity.

The `DataFrame.corr()` method will compute the correlation for all pairs of columns with numeric values.  It is better to use Spearman's rank correlation coefficient than Pearson's product-moment correlation coefficient, since similarity scores are unlikely to be normally distributed.
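
To see how the two coefficients can differ, the sketch below computes both by hand on made-up scores. Spearman's coefficient is Pearson's coefficient applied to the ranks of the values, so it only asks whether the two lists put the pairs in the same order. The helper names (`ranks`, `pearson`, `spearman`) and the data are illustrative assumptions, and the ranking helper assumes there are no tied values, which a library implementation would need to handle.

```python
def ranks(values):
    """Rank values from 1 (smallest) upwards; assumes no ties, for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def pearson(xs, ys):
    """Pearson's product-moment correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def spearman(xs, ys):
    """Spearman's rank correlation: Pearson's coefficient on the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Made-up scores: the second list rises with the first, but not linearly.
human = [0.5, 1.0, 2.0, 3.0, 4.0]
model = [0.05, 0.06, 0.08, 0.20, 1.00]

print(round(pearson(human, model), 3))
print(round(spearman(human, model), 3))
```

Here the two lists order the pairs identically, so the rank correlation is perfect even though the relationship is far from a straight line and Pearson's coefficient is lower.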

In [None]:
df.corr(method='spearman',numeric_only=True)  #numeric_only skips the word columns (needs pandas 1.5 or later)

### Exercise 3.3
* Looking at the scatter plots and the correlation coefficients, what do you conclude about the different WordNet similarity measures?
* Do you have any reservations about your conclusions?