# Lab 2: Semantic Similarity Using WordNet

I am running Python 3.10.12.  Which version are you running?

In [1]:
!python --version

Python 3.12.12


## 1. Getting Started

If you haven't used nltk before, you will need to run the following cell and download resources.  You will need:

*   wordnet
*   wordnet_ic
*   omw-1.4


In [20]:
import nltk
nltk.download("wordnet")
nltk.download('wordnet_ic')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package wordnet_ic to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet_ic.zip.


True

In [6]:
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic as wn_ic
from nltk.corpus import lin_thesaurus as lin

## 2 Useful WN Functions

Look at the code in the cells below.  Write notes as to what is being returned.

In [7]:
wn.synsets("book")

[Synset('book.n.01'),
 Synset('book.n.02'),
 Synset('record.n.05'),
 Synset('script.n.01'),
 Synset('ledger.n.01'),
 Synset('book.n.06'),
 Synset('book.n.07'),
 Synset('koran.n.01'),
 Synset('bible.n.01'),
 Synset('book.n.10'),
 Synset('book.n.11'),
 Synset('book.v.01'),
 Synset('reserve.v.04'),
 Synset('book.v.03'),
 Synset('book.v.04')]

In [8]:
len(wn.synsets("book"))

15

In [9]:
wn.synsets("book", wn.ADV)


[]

In [10]:
asynset=wn.synsets("book",wn.NOUN)[1]
asynset.definition()

'physical objects consisting of a number of pages bound together'

In [12]:
wn.synsets("book",wn.NOUN)[1]

Synset('book.n.02')

In [13]:
asynset.hyponyms()


[Synset('coffee-table_book.n.01'),
 Synset('folio.n.03'),
 Synset('sketchbook.n.01'),
 Synset('novel.n.02'),
 Synset('journal.n.04'),
 Synset('paperback_book.n.01'),
 Synset('hardback.n.01'),
 Synset('order_book.n.02'),
 Synset('picture_book.n.01'),
 Synset('album.n.02'),
 Synset('notebook.n.01')]

In [14]:
asynset.hypernyms()

[Synset('product.n.02')]

In [18]:
asynset.hypernyms().hyponyms()

AttributeError: 'list' object has no attribute 'hyponyms'

In [17]:
asynset.hypernyms()[0].hyponyms()


[Synset('magazine.n.02'),
 Synset('book.n.02'),
 Synset('book.n.11'),
 Synset('inspiration.n.02'),
 Synset('deliverable.n.01'),
 Synset('by-product.n.02'),
 Synset('work.n.02'),
 Synset('output.n.05'),
 Synset('yield.n.03'),
 Synset('movie.n.01'),
 Synset('newspaper.n.03'),
 Synset('end_product.n.01'),
 Synset('turnery.n.02'),
 Synset('job.n.04')]
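
The `AttributeError` above arises because `hypernyms()` returns a list of synsets rather than a single synset, so a method such as `hyponyms()` has to be called on an element of that list. The following is a sketch of collecting the siblings (co-hyponyms) across all hypernyms. `co_hyponyms` and the `Node` class are hypothetical: `Node` only imitates the two synset methods used here so that the block runs without WordNet.

```python
def co_hyponyms(sense):
    """Collect the hyponyms of every hypernym of `sense`, without duplicates."""
    siblings = []
    for parent in sense.hypernyms():      # hypernyms() returns a list
        for child in parent.hyponyms():
            if child != sense and child not in siblings:
                siblings.append(child)
    return siblings


class Node:
    """Hypothetical stand-in for a synset, for demonstration only."""

    def __init__(self, name):
        self.name, self.parents, self.children = name, [], []

    def hypernyms(self):
        return self.parents

    def hyponyms(self):
        return self.children

    def __repr__(self):
        return f"Node({self.name!r})"


product_, book, magazine = Node("product"), Node("book"), Node("magazine")
product_.children = [magazine, book]
book.parents = magazine.parents = [product_]
print(co_hyponyms(book))  # [Node('magazine')]
```

With the notebook's objects, `co_hyponyms(asynset)` should return the synsets listed above except `Synset('book.n.02')` itself.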

In [21]:
brown_ic=wn_ic.ic('ic-brown.dat')
asynset.lin_similarity(asynset.hypernyms()[0].hyponyms()[8],brown_ic)

0.6062360082744072

### 2.1 Q1
Write a function to return the path similarity of two nouns.  Remember this is the maximum similarity of all of the possible pairings of the two nouns.  Make sure you test it.  For (chicken, car) the correct answer is 0.0909 (to 3 SF).

In [87]:
wn.synsets("car")

[Synset('car.n.01'),
 Synset('car.n.02'),
 Synset('car.n.03'),
 Synset('car.n.04'),
 Synset('cable_car.n.01')]

In [88]:
wn.synsets("chicken")

[Synset('chicken.n.01'),
 Synset('chicken.n.02'),
 Synset('wimp.n.01'),
 Synset('chicken.n.04'),
 Synset('chicken.s.01')]

In [54]:
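def find_path_similarity(word1, word2):
  # Only noun synsets are compared, as the question asks about nouns
  word1_synsets = wn.synsets(word1, wn.NOUN)
  word2_synsets = wn.synsets(word2, wn.NOUN)

  possible_similarities = []

  # Score every pairing of a sense of word1 with a sense of word2
  for syn1 in word1_synsets:
    for syn2 in word2_synsets:
      sim = wn.path_similarity(syn1, syn2)
      possible_similarities.append(sim)

  # The word-level similarity is the best-scoring sense pairing
  return max(possible_similarities)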

In [56]:
find_path_similarity("chicken", "car")

0.09090909090909091
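
The "maximum over all sense pairings" rule can be separated from the choice of similarity measure. The following is a minimal sketch, not the lab's reference solution: `max_pairwise_similarity` is a hypothetical helper that takes two lists of senses and any scoring function, and the toy senses and `toy_sim` function are invented purely so that the block runs without WordNet.

```python
from itertools import product


def max_pairwise_similarity(senses1, senses2, sim_fn):
    """Return the best sim_fn score over every pairing of senses.

    Pairings for which sim_fn returns None are skipped; None is
    returned if no pairing produces a score.
    """
    scores = [sim_fn(a, b) for a, b in product(senses1, senses2)]
    scores = [s for s in scores if s is not None]
    return max(scores) if scores else None


# Toy stand-ins (assumptions for illustration only): senses are plain
# strings and toy_sim looks up a hand-written score table.
toy_scores = {("chicken.n.01", "car.n.01"): 0.05,
              ("chicken.n.02", "car.n.01"): 0.09}


def toy_sim(a, b):
    return toy_scores.get((a, b))


best = max_pairwise_similarity(["chicken.n.01", "chicken.n.02"],
                               ["car.n.01", "car.n.02"], toy_sim)
print(best)  # 0.09
```

With the notebook's imports, the same helper could be called as `max_pairwise_similarity(wn.synsets("chicken", wn.NOUN), wn.synsets("car", wn.NOUN), wn.path_similarity)`. Since NLTK's path similarity is 1 / (distance + 1), the expected 0.0909 = 1/11 corresponds to a shortest path of 10 edges between the closest senses.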

### 2.1 Q2
Generalise your function so that it has an extra (optional) parameter which you can use to select the WordNet similarity measure, e.g., res_similarity or lin_similarity.  Make sure you test it.  For the pair (chicken, car), the correct answer for res_similarity is 1.53 (to 3 SF) and the correct answer for lin_similarity is 0.179 (to 3 SF).

In [69]:
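def find_similarity1(word1, word2, sim_measure):
  # First attempt: synsets of every part of speech are compared, so
  # res_similarity and lin_similarity raise WordNetError on mixed-POS pairs

  possible_similarities = []

  word1_synsets = wn.synsets(word1)
  word2_synsets = wn.synsets(word2)
  brown_ic=wn_ic.ic("ic-brown.dat")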

  for i in word1_synsets:
    for j in word2_synsets:
      if sim_measure == "res_similarity":

        sim = wn.res_similarity(i, j, brown_ic)
        possible_similarities.append(sim)

      if sim_measure == "lin_similarity":

        sim = wn.lin_similarity(i, j, brown_ic)
        possible_similarities.append(sim)

  return max(possible_similarities)

In [89]:
def find_similarity(word1, word2, sim_measure="path_similarity"):
  # sim_measure is optional and defaults to path similarity

  possible_similarities = []

  # Restrict to noun synsets: res/lin similarity need both synsets
  # to have the same part of speech
  word1_synsets = wn.synsets(word1, wn.NOUN)
  word2_synsets = wn.synsets(word2, wn.NOUN)
  brown_ic = wn_ic.ic("ic-brown.dat")

  for i in word1_synsets:
    for j in word2_synsets:

      if sim_measure == "path_similarity":
        sim = wn.path_similarity(i, j)
      elif sim_measure == "res_similarity":
        sim = wn.res_similarity(i, j, brown_ic)
      elif sim_measure == "lin_similarity":
        sim = wn.lin_similarity(i, j, brown_ic)
      else:
        raise ValueError("Unknown similarity measure: " + sim_measure)

      possible_similarities.append(sim)

  # The word-level similarity is the best-scoring sense pairing
  return max(possible_similarities)

In [84]:
find_similarity1("chicken", "car", "res_similarity")

WordNetError: Computing the least common subsumer requires Synset('chicken.s.01') and Synset('car.n.01') to have the same part of speech.

In [90]:
find_similarity("chicken", "car", "path_similarity")

0.09090909090909091

In [85]:
find_similarity("chicken", "car", "res_similarity")

1.5318337432196856

In [67]:
find_similarity("chicken", "car", "lin_similarity")

0.17900106582025765

**Lin similarity and Resnik similarity require both synsets to have the same part of speech (see the WordNetError above), so the synsets are restricted to nouns.**
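
One way to make the measure an optional parameter without a chain of `if` statements is to look the method up by name on the synset itself. This is a sketch, assuming only that each sense object exposes methods named `path_similarity`, `res_similarity` and `lin_similarity` (as NLTK `Synset` objects do). `pair_score` and the `FakeSense` class are hypothetical and exist only so that the block can run without WordNet.

```python
# Measures that need an information-content dictionary as an extra argument
IC_MEASURES = {"res_similarity", "lin_similarity", "jcn_similarity"}


def pair_score(sense1, sense2, measure="path_similarity", ic=None):
    """Score two senses with the method called `measure` on sense1."""
    method = getattr(sense1, measure)
    if measure in IC_MEASURES:
        if ic is None:
            raise ValueError(f"{measure} needs an information-content dictionary")
        return method(sense2, ic)
    return method(sense2)


class FakeSense:
    """Hypothetical stand-in for a WordNet synset, for demonstration only."""

    def path_similarity(self, other):
        return 0.25

    def res_similarity(self, other, ic):
        return ic["shared"]


a, b = FakeSense(), FakeSense()
print(pair_score(a, b))                                        # 0.25
print(pair_score(a, b, "res_similarity", ic={"shared": 1.5}))  # 1.5
```

With real synsets, `pair_score(s1, s2, "lin_similarity", ic=brown_ic)` would call `s1.lin_similarity(s2, brown_ic)`, the same method used with `brown_ic` earlier in this notebook.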

## 3 Human Synonymy Judgments

### 3.1
Read in mcdata.csv and store it in an appropriate format so that you can obtain a list of pairs of the nouns and the score associated with each pair.


Again, there are many ways to complete this one.  I would encourage the use of the csv package (either with dialect = 'excel' or with the delimiter and quotechar explicitly set) to avoid future problems when reading csv files which have commas in the fields.  Pandas is a good choice here and will make the correlations easier later (but just storing a list of triples (i.e., mcdata in the code below) is fine too).
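
A minimal sketch of the `csv`-based approach suggested above. It assumes each row holds exactly three fields (two nouns and a score) with no header row; `read_pairs` is a hypothetical helper name, and the in-memory sample stands in for the real file.

```python
import csv
import io


def read_pairs(fileobj):
    """Return a list of (word1, word2, score) triples from CSV rows."""
    triples = []
    for row in csv.reader(fileobj, dialect="excel"):
        if not row:          # skip blank lines
            continue
        word1, word2, score = row
        triples.append((word1.strip(), word2.strip(), float(score)))
    return triples


# In-memory sample standing in for the file (three rows from mcdata.csv)
sample = io.StringIO("asylum,madhouse,3.61\nbird,cock,3.05\nchord,smile,0.13\n")
mcdata = read_pairs(sample)
print(mcdata[0])  # ('asylum', 'madhouse', 3.61)
```

For the real file, `with open("sample_data/mcdata.csv", newline="") as f: mcdata = read_pairs(f)` would be the equivalent call, using the path that appears elsewhere in this notebook.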

In [None]:
#from google.colab import drive
#drive.mount('/content/drive')


In [113]:
list_df = []
with open("sample_data/mcdata.csv") as file:
  for i in file:
    w1, w2, sim = i.split(",")
    list_df.append([w1, w2, sim.strip()])

list_df

[['asylum', 'madhouse', '3.61'],
 ['bird', 'cock', '3.05'],
 ['bird', 'crane', '2.97'],
 ['boy', 'lad', '3.76'],
 ['brother', 'monk', '2.82'],
 ['car', 'automobile', '3.92'],
 ['cemetery', 'woodland', '0.95'],
 ['chord', 'smile', '0.13'],
 ['coast', 'forest', '0.42'],
 ['coast', 'hill', '0.87'],
 ['coast', 'shore', '3.7'],
 ['crane', 'implement', '1.68'],
 ['food', 'fruit', '3.08'],
 ['food', 'rooster', '0.89'],
 ['forest', 'graveyard', '0.84'],
 ['furnace', 'stove', '3.11'],
 ['gem', 'jewel', '3.84'],
 ['glass', 'magician', '0.11'],
 ['journey', 'car', '1.16'],
 ['journey', 'voyage', '3.84'],
 ['lad', 'brother', '1.66'],
 ['lad', 'wizard', '0.42'],
 ['magician', 'wizard', '3.5'],
 ['midday', 'noon', '3.42'],
 ['monk', 'oracle', '1.1'],
 ['monk', 'slave', '0.55'],
 ['noon', 'string', '0.08'],
 ['rooster', 'voyage', '0.08'],
 ['shore', 'woodland', '0.63'],
 ['tool', 'implement', '1.68']]

In [98]:
import pandas as pd

df = pd.read_csv("sample_data/mcdata.csv", header=None)
df.columns = ["word1", "word2", "sim"]
df

      word1       word2   sim
0    asylum    madhouse  3.61
1      bird        cock  3.05
2      bird       crane  2.97
3       boy         lad  3.76
4   brother        monk  2.82
5       car  automobile  3.92
6  cemetery    woodland  0.95
7     chord       smile  0.13
8     coast      forest  0.42
9     coast        hill  0.87


### 3.2

Calculate the similarity score for each pair of nouns using at least 2 semantic similarity measures.

In [134]:
df.iloc[0,1]


'madhouse'

In [153]:
# Count the rows whose second word has at least one noun synset
count = 0
for x in range(df.shape[0]):
  if len(wn.synsets(df.iloc[x,1], wn.NOUN)) != 0:
    count += 1

count


30

In [141]:
word2_noun_syn = wn.synsets(df.iloc[0,1], wn.NOUN)[0]
word2_noun_syn

Synset('bedlam.n.02')

In [155]:
for x in range(df.shape[0]):
  find_similarity1(df.iloc[x,0], df.iloc[x,1], "lin_similarity")

WordNetError: Computing the least common subsumer requires Synset('bird.n.01') and Synset('cock.v.01') to have the same part of speech.

In [129]:
df.shape[0]

30

In [144]:
lin_sim = []


for x in range(df.shape[0]):
  sims = find_similarity1(df.iloc[x,0], df.iloc[x,1], "lin_similarity")
  lin_sim.append(sims)

lin_sim

WordNetError: Computing the least common subsumer requires Synset('bird.n.01') and Synset('cock.v.01') to have the same part of speech.

In [123]:
find_similarity1(df.iloc[0,0], df.iloc[0,1], "res_similarity")

9.475167326283652
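
Both `WordNetError`s above come from `find_similarity1`, which compares synsets of every part of speech; the noun-restricted `find_similarity` avoids that. A sketch of scoring every row follows, assuming a word-level function with the signature `(word1, word2) -> float`. `score_pairs` and `toy_word_sim` are hypothetical names, and the toy function is a stand-in so that the block runs on its own.

```python
def score_pairs(pairs, word_sim):
    """Apply word_sim to each (word1, word2, human_score) triple.

    Returns a list of (word1, word2, human_score, model_score) tuples.
    """
    return [(w1, w2, human, word_sim(w1, w2)) for w1, w2, human in pairs]


def toy_word_sim(w1, w2):
    # Placeholder measure (assumption): 1.0 if the words have equal length
    return 1.0 if len(w1) == len(w2) else 0.0


pairs = [("boy", "lad", 3.76), ("car", "automobile", 3.92)]
for row in score_pairs(pairs, toy_word_sim):
    print(row)
```

In the notebook, this could be used as `score_pairs(df.itertuples(index=False, name=None), lambda a, b: find_similarity(a, b, "lin_similarity"))`, which keeps the human score and the computed score side by side for the correlation step.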

### 3.3
Correlate each of the calculated sets of scores with each other and with the human judgements (I suggest you use scipy.stats.spearmanr() or pandas for this).
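
`scipy.stats.spearmanr(x, y)` is the practical choice here. To show what that statistic measures, the following dependency-free sketch ranks both lists (ties share their average rank) and takes the Pearson correlation of the ranks. `average_ranks` and `spearman` are hypothetical helper names, and the model scores in the example are invented for illustration.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of tied values starting at position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1          # mean of the 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank lists."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)


human = [3.92, 3.05, 0.13, 0.08]      # four human scores from the data
model = [0.9, 0.7, 0.2, 0.1]          # invented model scores (same ordering)
print(spearman(human, model))         # 1.0 because the rankings agree exactly
```

Because only the ranks matter, the measures do not need to share a scale: a path similarity in [0, 1] and a human score in [0, 4] can be compared directly.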

### 3.4

What do you conclude?

## 4 Distributional Similarity

We are going to be using some pre-computed Word2Vec embeddings.  We will be learning about how these are computed in a few weeks' time.  For now, you can assume that they capture, in some way, the notion of distributional similarity discussed in this week's class: words which are used in similar ways will have similar vectors in this space.  You can download the embeddings here:

https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit

Note that this is a very large file: 1.65 GB zipped.

Or, if working on a lab computer, you can use the following full path:

mac: /Volumes/teaching/Departments/Informatics/AdvancedNLP/GoogleNews-vectors-negative300.bin

windows: //ad.susx.ac.uk/ITS/Teaching/Departments/Informatics/AdvancedNLP/GoogleNews-vectors-negative300.bin


You should now be able to load them into Python using the following code (it may take a while to run).  If working on a Mac or your own machine, you may need to use conda or pip to install the gensim package into your environment.


In [None]:
from gensim.models import KeyedVectors

In [None]:
#this cell may take a minute or more to run as it is loading a large model into memory
#avoid re-running it unnecessarily
import os
path = "."  # placeholder: set this to the folder that contains the .bin file
filename = os.path.join(path, "GoogleNews-vectors-negative300.bin")
mymodel = KeyedVectors.load_word2vec_format(filename, binary=True)

You can now query the model with calls to methods of mymodel as in the cells below.  Take time to think about what each call is doing and try some similar queries of your own.

In [None]:
mymodel.similarity('car','chicken')

In [None]:
#this cell may crash your session on basic Colab as you might run out of RAM.
#You probably need to be working in Anaconda on a reasonably powerful machine or have Colab Pro
mymodel.most_similar(positive=['man'])

In [None]:
mymodel.similarity('noon','string')

In [None]:
mymodel['man']
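
The `similarity` method of gensim's `KeyedVectors` returns the cosine similarity between two word vectors. Here is a minimal sketch of that calculation on toy three-dimensional vectors. The real embeddings have 300 dimensions; the vectors and the `cosine` helper below are made up for illustration.

```python
from math import sqrt


def cosine(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors (assumptions): the first two point in the same direction
car = [1.0, 2.0, 0.0]
truck = [2.0, 4.0, 0.0]
noon = [0.0, 0.0, 3.0]

print(cosine(car, truck))  # parallel vectors: 1.0 up to rounding
print(cosine(car, noon))   # orthogonal vectors: 0.0
```

Cosine similarity depends only on the direction of the vectors, not their length, which is why `car` and `truck` score as maximally similar here even though one vector is twice as long as the other.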

### 4.1
Repeat the tasks in Section 3 using similarity scores from the word2vec model.  Make sure you correlate the word2vec similarities with the human synonymy judgements and the wordnet similarity scores.  What do you conclude?

### 4.2
* Which measure has the highest correlation with human synonymy judgements now?

## 5 Extension
See the pdf for instructions.