In [1]:
import nltk
nltk.download("wordnet")
from nltk.corpus import wordnet as wn
nltk.download("omw-1.4")
nltk.download('semcor')
from nltk.corpus import semcor
from collections import Counter

[nltk_data] Downloading package wordnet to /home/mauro/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/mauro/nltk_data...
[nltk_data] Downloading package semcor to /home/mauro/nltk_data...


# Finding the most frequent WordNet synset for each word (when possible)

We found contradictory information on this point. One source claimed that the synsets returned for a word are already ordered by frequency, while others said that this is not guaranteed. To be sure, we looked for a sense-tagged corpus from which to estimate the frequency of each synset for a given word, and chose the SemCor corpus.

In [2]:
# Map a Penn Treebank-style POS tag to the WordNet POS code ('a', 'v', 'n' or 'r')
def get_wordnet_pos(category):
  if category.startswith('J'):
    return 'a'  # Adjective
  elif category.startswith('V'):
    return 'v'  # Verb
  elif category.startswith('N'):
    return 'n'  # Noun
  elif category.startswith('R'):
    return 'r'  # Adverb
  else:
    return None  # WordNet doesn't handle other POS tags

word_tag_pairs = [
('the', 'DT'), ('man', 'NN'), ('swim', 'VB'), ('with', 'PR'), ('a', 'DT'),
('girl', 'NN'), ('and', 'CC'), ('a', 'DT'), ('boy', 'NN'), ('whilst', 'PR'),
('the', 'DT'), ('woman', 'NN'), ('walk', 'VB')
]

In [3]:
filtered_synsets=[]
for target_word, target_pos in word_tag_pairs:
  synset_counts = Counter()
  # Iterate through tagged sentences in the SemCor corpus
  for sentence in semcor.tagged_sents(tag='sem'):
      for chunk in sentence:
          # Sense-tagged chunks are Trees whose label is a WordNet Lemma
          if isinstance(chunk, nltk.tree.Tree):
              lemma = chunk.label()
              if isinstance(lemma, nltk.corpus.reader.wordnet.Lemma):
                  # If the lemma matches the target word and its synset has the right POS, count it
                  if lemma.name().lower() == target_word and lemma.synset().pos() == get_wordnet_pos(target_pos):
                      synset_counts[lemma.synset()] += 1

  # Find the most frequent synset
  if synset_counts:
      most_common_synset, frequency = synset_counts.most_common(1)[0]
      filtered_synsets.append(most_common_synset)
      print(f"\nMost frequent {target_pos} synset for '{target_word}': {most_common_synset}")
      print(f"Definition: {most_common_synset.definition()}")
      print(f"Frequency: {frequency}")
  else:
      print(f"\nNo {target_pos} synsets found for '{target_word}' in SemCor")


No DT synsets found for 'the' in SemCor

Most frequent NN synset for 'man': Synset('man.n.01')
Definition: an adult person who is male (as opposed to a woman)
Frequency: 429

Most frequent VB synset for 'swim': Synset('swim.v.01')
Definition: travel through water
Frequency: 12

No PR synsets found for 'with' in SemCor

No DT synsets found for 'a' in SemCor

Most frequent NN synset for 'girl': Synset('girl.n.01')
Definition: a young woman
Frequency: 77

No CC synsets found for 'and' in SemCor

No DT synsets found for 'a' in SemCor

Most frequent NN synset for 'boy': Synset('male_child.n.01')
Definition: a youthful male person
Frequency: 135

No PR synsets found for 'whilst' in SemCor

No DT synsets found for 'the' in SemCor

Most frequent NN synset for 'woman': Synset('woman.n.01')
Definition: an adult female person (as opposed to a man)
Frequency: 137

Most frequent VB synset for 'walk': Synset('walk.v.01')
Definition: use one's feet to advance; advance by steps
Frequency: 163


In [None]:
filtered_synsets

[Synset('man.n.01'),
 Synset('swim.v.01'),
 Synset('girl.n.01'),
 Synset('male_child.n.01'),
 Synset('woman.n.01'),
 Synset('walk.v.01')]

* First we converted the POS tags to WordNet POS codes. WordNet only covers nouns, verbs, adjectives and adverbs (content words), so words of any other category have no synset in WordNet.
* After that, we counted how often each synset of the target words appears in the SemCor corpus (restricted to the matching POS) and saved the most common one.
* In these cases, the most frequent synsets are also the first ones (their names end in .01), so the results are consistent with the claim that WordNet lists synsets in order of frequency. Note that the numeric suffix is relative to the lemma in the synset's name, so for 'boy' (`male_child.n.01`) the position should be checked against `wn.synsets('boy')` rather than read off the suffix.
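
The loop above rescans the whole of SemCor once per target word. A minimal sketch of the same counting done in a single pass is shown here. To keep it runnable without any corpus download, it works on a hypothetical list of `(lemma, pos, synset_name)` triples instead of the SemCor reader; `count_senses` and the toy data are illustrative names and values, not part of NLTK or of the results above.

```python
from collections import Counter, defaultdict

def count_senses(tagged_tokens, targets):
    """Count synsets for every (lemma, pos) target in one pass over the corpus."""
    wanted = set(targets)
    counts = defaultdict(Counter)
    for lemma, pos, synset_name in tagged_tokens:
        key = (lemma.lower(), pos)
        if key in wanted:
            counts[key][synset_name] += 1
    return counts

# Toy stand-in for sense-tagged corpus tokens (made-up frequencies)
toy_corpus = [
    ("boy", "n", "male_child.n.01"),
    ("boy", "n", "male_child.n.01"),
    ("boy", "n", "boy.n.02"),
    ("walk", "v", "walk.v.01"),
    ("Walk", "n", "walk.n.01"),
]
targets = [("boy", "n"), ("walk", "v"), ("the", None)]

counts = count_senses(toy_corpus, targets)
# Keep the most frequent synset for each target that was seen at all
most_frequent = {key: counts[key].most_common(1)[0] for key in targets if counts[key]}
print(most_frequent)
```

With the real corpus, the triples would come from the `Lemma` labels of `semcor.tagged_sents(tag='sem')`, so the corpus is read once regardless of how many words are looked up.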

# For each pair of words, when possible, print their corresponding least common subsumer (LCS) and their similarity value

In [None]:
# Get all possible combinations:
from itertools import combinations
def get_all_pairs(lst):
  return list(combinations(lst, 2))

pairs=get_all_pairs(filtered_synsets)

* We create a list with all the possible pairs of synsets and then keep only the pairs whose synsets share the same POS. This is required because WordNet keeps a separate hierarchy for each part of speech, and we can't compute the similarities or the LCS of two synsets that are not in the same hierarchy.

In [None]:
# Similarities and the LCS can only be computed for synsets with the same POS:
valid_pairs= [pair for pair in pairs if pair[0].pos()==pair[1].pos()]
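
As a quick sanity check on the filter above, the number of surviving pairs can be worked out from the synset names alone. The sketch below uses plain strings instead of `Synset` objects, and `pos_of` is a hypothetical helper that reads the POS from the `lemma.pos.nn` naming pattern.

```python
from itertools import combinations

names = ["man.n.01", "swim.v.01", "girl.n.01",
         "male_child.n.01", "woman.n.01", "walk.v.01"]

def pos_of(name):
    # Synset names follow the pattern lemma.pos.nn, so the POS is the second-to-last field
    return name.split(".")[-2]

all_pairs = list(combinations(names, 2))
same_pos_pairs = [(a, b) for a, b in all_pairs if pos_of(a) == pos_of(b)]
print(len(all_pairs), len(same_pos_pairs))  # 15 7
```

Six synsets give 15 unordered pairs. The four nouns contribute 6 same-POS pairs and the two verbs contribute 1, so 7 pairs remain.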

* For the Lin similarity calculation we decided to use the information content computed from the Brown corpus.

In [6]:
# We import the information content computed from the Brown corpus:
nltk.download('wordnet_ic')
from nltk.corpus import wordnet_ic
ic_brown = wordnet_ic.ic('ic-brown.dat')

[nltk_data] Downloading package wordnet_ic to /home/mauro/nltk_data...
[nltk_data]   Package wordnet_ic is already up-to-date!


We calculated the ranges for the different similarity metrics based on the formulas from the class, and we found that the Path, Wu-Palmer and Lin similarities are all contained in the $[0, 1]$ range, so we only need to normalize the Leacock-Chodorow similarity, since it takes values in the $[0, \log(2\,\text{MaxDepth})]$ interval. NLTK uses the natural logarithm here, and $\text{MaxDepth}$ is the maximum depth of the hierarchy for the synsets' part of speech, so the upper bound differs between nouns and verbs. We normalize by dividing by the similarity value we get when comparing a synset with itself, which is exactly this upper bound.
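
The normalization can be checked by hand against the output further down. The sketch below assumes the formula $-\ln\big(p / (2\,\text{MaxDepth})\big)$, where $p$ is the number of synsets on the shortest path (the reciprocal of the Path similarity), and a noun depth of 19. That depth is an assumption inferred from the printed normalizing value $3.6376 \approx \ln 38$, not a value read from NLTK; `lch` is an illustrative helper.

```python
import math

MAX_DEPTH_NOUN = 19  # assumption: inferred from the printed normalizing value ln(38)

def lch(path_nodes, max_depth):
    """Leacock-Chodorow score for a shortest path spanning `path_nodes` synsets."""
    return -math.log(path_nodes / (2.0 * max_depth))

norm = lch(1, MAX_DEPTH_NOUN)               # a synset compared with itself
man_girl = lch(4, MAX_DEPTH_NOUN) / norm    # Path similarity 0.25 -> 4 synsets on the path
man_woman = lch(3, MAX_DEPTH_NOUN) / norm   # Path similarity 1/3  -> 3 synsets on the path
print(norm, man_girl, man_woman)
```

Under these assumptions the three values agree with the normalizing value and the normalized Leacock-Chodorow scores reported in the notebook output for man/girl and man/woman.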

In [None]:
# We get the value needed to normalize Leacock-Chodorow Similarity (no information content is involved):
norm=wn.synset('historic_period.n.01').lch_similarity(wn.synset('historic_period.n.01')) # historic_period is an arbitrary noun; the value depends only on the part of speech
print(f'The normalizing value for Leacock-Chodorow is:  {norm}')

# Least common subsumer (LCS) and similarities
def compute_similarities(synset1,synset2):

  # Print the LCS
  lcs = synset1.lowest_common_hypernyms(synset2)
  print(f"LCS for '{synset1}' and '{synset2}': {lcs}")

  # Compute Path Similarity
  print(f"Path Similarity: {synset1.path_similarity(synset2)}")

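  # Compute Leacock-Chodorow Similarity, normalized by the self-similarity of synset1
  # (the maximum for its part of speech, which differs between nouns and verbs)
  print(f"Leacock-Chodorow Similarity: {synset1.lch_similarity(synset2)/synset1.lch_similarity(synset1)}")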

  # Compute Wu-Palmer Similarity
  print(f"Wu-Palmer Similarity: {synset1.wup_similarity(synset2)}")

  # Compute Lin Similarity (requires information content)
  try:
    lin_similarity = synset1.lin_similarity(synset2, ic_brown)
    if lin_similarity:
      print(f"Lin Similarity: {lin_similarity}")
    else:
      print("Lin Similarity: Not available")
  except Exception:
    # Raised, for example, when the information content is missing for these synsets
    print("Lin Similarity: Not available")


for pair in valid_pairs:
  print(f'\nComputing LCS and Similarities for pair: {pair}')
  compute_similarities(*pair)

The normalizing value for Leacock-Chodorow is:  3.6375861597263857

Computing LCS and Similarities for pair: (Synset('man.n.01'), Synset('girl.n.01'))
LCS for 'Synset('man.n.01')' and 'Synset('girl.n.01')': [Synset('adult.n.01')]
Path Similarity: 0.25
Leacock-Chodorow Similarity: 0.6188971751464533
Wu-Palmer Similarity: 0.631578947368421
Lin Similarity: 0.7135111237276783

Computing LCS and Similarities for pair: (Synset('man.n.01'), Synset('male_child.n.01'))
LCS for 'Synset('man.n.01')' and 'Synset('male_child.n.01')': [Synset('male.n.02')]
Path Similarity: 0.3333333333333333
Leacock-Chodorow Similarity: 0.6979831568441128
Wu-Palmer Similarity: 0.6666666666666666
Lin Similarity: 0.7294717876200584

Computing LCS and Similarities for pair: (Synset('man.n.01'), Synset('woman.n.01'))
LCS for 'Synset('man.n.01')' and 'Synset('woman.n.01')': [Synset('adult.n.01')]
Path Similarity: 0.3333333333333333
Leacock-Chodorow Similarity: 0.6979831568441128
Wu-Palmer Similarity: 0.6666666666666666
L

*Which similarity seems better?*

This question is particularly hard because the answer may depend on the scenario. What can be said in general is that Path similarity can give misleading results because it doesn't account for depth in the tree: it only counts the synsets on the shortest path. A larger or more finely subdivided region of the hierarchy contains more intermediate synsets, so equally related synsets end up more nodes apart.

Another point is that Lin similarity requires closer inspection, because it relies on the size and quality of the corpus from which the information content is computed. With a biased or outdated corpus, certain relationships could be missed. With a good corpus and fairly common words, on the other hand, it gives very good results. An example of this is the woman-girl pair, where it gives a higher similarity than the other measures, reflecting greater sensitivity to their close relationship.
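
To illustrate how the corpus drives the result, here is a minimal sketch of the Lin formula, $2\,IC(\text{LCS}) / (IC(s_1) + IC(s_2))$ with $IC(c) = -\ln p(c)$, on a hypothetical four-node hierarchy with made-up counts. The hierarchy, the two toy "corpora" and the helper names are assumptions for illustration only; NLTK's information content files are built from real corpus counts and are not reproduced here.

```python
import math

# Hypothetical hierarchy (child -> parent); not WordNet's real structure
PARENT = {"person": None, "adult": "person", "man": "adult", "woman": "adult"}

def information_content(own_counts):
    """IC(c) = -ln p(c), where a concept's count includes all of its descendants."""
    cumulative = dict.fromkeys(PARENT, 0)
    for concept, n in own_counts.items():
        node = concept
        while node is not None:       # propagate the count up to the root
            cumulative[node] += n
            node = PARENT[node]
    total = cumulative["person"]      # the root count is the corpus size
    return {c: -math.log(f / total) for c, f in cumulative.items() if f > 0}

def lin(ic, a, b, lcs):
    return 2.0 * ic[lcs] / (ic[a] + ic[b])

# Two made-up corpora with different usage patterns for the same concepts
corpus_a = {"man": 40, "woman": 30, "adult": 10, "person": 20}
corpus_b = {"man": 5, "woman": 5, "adult": 0, "person": 90}

lin_a = lin(information_content(corpus_a), "man", "woman", "adult")
lin_b = lin(information_content(corpus_b), "man", "woman", "adult")
print(round(lin_a, 3), round(lin_b, 3))
```

The hierarchy is identical in both runs, yet the similarity of the same pair changes considerably, because only the counts differ. This is the dependence on corpus quality discussed above.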

Both Leacock-Chodorow and Wu-Palmer take the hierarchy of the tree into account and normalize by depth. Leacock-Chodorow scales the path length by twice the maximum depth of the taxonomy and takes the negative logarithm, so within one part of speech it ranks pairs in the same order as Path similarity. Wu-Palmer uses the depth of the lowest common subsumer relative to the depths of the two synsets, so pairs whose only shared ancestor lies near the top of the tree, where terms tend to be broader, are penalised; this is closer to our intuitive sense of similarity. A shared drawback of these methods is that, unlike Lin similarity, they don't take into account how the words are actually used in real texts.
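
The behaviour of the three path-based measures can be compared on a toy example. The sketch below implements the textbook formulas on a hypothetical six-node taxonomy; it is not WordNet's structure and not NLTK's implementation (whose depth conventions differ in detail), and all names are illustrative.

```python
import math

# Hypothetical toy taxonomy (child -> parent)
PARENT = {
    "entity": None,
    "object": "entity",
    "person": "entity",
    "adult": "person",
    "man": "adult",
    "woman": "adult",
}

def ancestors(node):
    """Chain from the node up to the root, node first."""
    chain = []
    while node is not None:
        chain.append(node)
        node = PARENT[node]
    return chain

def depth(node):
    return len(ancestors(node))  # the root has depth 1

MAX_DEPTH = max(depth(n) for n in PARENT)

def lcs(a, b):
    others = set(ancestors(b))
    return next(n for n in ancestors(a) if n in others)

def path_nodes(a, b):
    d = depth(lcs(a, b))
    return (depth(a) - d) + (depth(b) - d) + 1  # synsets on the shortest path

def path_sim(a, b):
    return 1.0 / path_nodes(a, b)

def lch_sim(a, b):
    return -math.log(path_nodes(a, b) / (2.0 * MAX_DEPTH))

def wup_sim(a, b):
    return 2.0 * depth(lcs(a, b)) / (depth(a) + depth(b))

for a, b in [("man", "woman"), ("person", "object")]:
    print(a, b, lcs(a, b), round(path_sim(a, b), 3),
          round(lch_sim(a, b), 3), round(wup_sim(a, b), 3))
```

Both pairs are two edges apart, so Path and Leacock-Chodorow score them identically. Wu-Palmer gives 0.75 for man/woman, whose common subsumer is deep in the tree, and 0.5 for person/object, whose common subsumer is the root.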

To sum up, we consider all of them preferable to plain Path similarity. Depending on the availability and quality of the corpus information, the better choice is either Lin similarity or one of Leacock-Chodorow and Wu-Palmer.
