# Lab2.1: Words, concepts, semantic relations in Wordnet-NLTK

Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

In this notebook, you are going to work with wordnet databases that have been incorporated in the NLTK package.
Detailed information on how to access and use wordnet can be found here: http://www.nltk.org/howto/wordnet.html

Study the documentation and make yourself familiar with the different commands. Some of the documentation is also explained below.

In [1]:
from nltk.corpus import wordnet as wn

Look up a word using the "wn.synsets()" function. This will give you a list of synsets in which the lookup string is matched with a lemma (synonym).

In [2]:
all_dog_synsets = wn.synsets('dog')
print('Number of synsets with "dog" as a synonym:', len(all_dog_synsets))
print(all_dog_synsets)

Number of synsets with "dog" as a synonym: 8
[Synset('dog.n.01'), Synset('frump.n.01'), Synset('dog.n.03'), Synset('cad.n.01'), Synset('frank.n.02'), Synset('pawl.n.01'), Synset('andiron.n.01'), Synset('chase.v.01')]


Note that the synsets in this list are printed by listing the first synonym only. Also note that we got nouns and verbs. We take the first synset from the list and print out some information to understand the concept.

In [3]:
first_synset = all_dog_synsets[0]
print('The synset =', first_synset)
print('Type', type(first_synset))
print('The synonyms = ', first_synset.lemmas())
print('The definition =', first_synset.definition())
print('The full path of hypernyms =', first_synset.hypernym_paths())
print('The maximum depth of its hyponymy chain is = ', first_synset.max_depth())

The synset = Synset('dog.n.01')
Type <class 'nltk.corpus.reader.wordnet.Synset'>
The synonyms =  [Lemma('dog.n.01.dog'), Lemma('dog.n.01.domestic_dog'), Lemma('dog.n.01.Canis_familiaris')]
The definition = a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds
The full path of hypernyms = [[Synset('entity.n.01'), Synset('physical_entity.n.01'), Synset('object.n.01'), Synset('whole.n.02'), Synset('living_thing.n.01'), Synset('organism.n.01'), Synset('animal.n.01'), Synset('chordate.n.01'), Synset('vertebrate.n.01'), Synset('mammal.n.01'), Synset('placental.n.01'), Synset('carnivore.n.01'), Synset('canine.n.02'), Synset('dog.n.01')], [Synset('entity.n.01'), Synset('physical_entity.n.01'), Synset('object.n.01'), Synset('whole.n.02'), Synset('living_thing.n.01'), Synset('organism.n.01'), Synset('animal.n.01'), Synset('domestic_animal.n.01'), Synset('dog.n.01')]]
The maximum depth of its hyponymy

We see that the synset is an instance of the class **Synset** defined in the wordnet module. The function **lemmas** gives us the synonyms (prefixed by the synset identifier **dog.n.01**), and we can get the definition, the hypernym chains, and the max depth, which is the length of the longest hypernym path from this synset up to the root of the wordnet graph.
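The identifier format seen above can be unpacked with plain string handling. The sketch below is only an illustration: `split_synset_name` is a hypothetical helper, not part of NLTK, and it assumes the `lemma.pos.number` layout visible in the printed synsets.

```python
def split_synset_name(name):
    """Split an identifier such as 'dog.n.01' into (head lemma, pos, sense number).

    Hypothetical helper for illustration; assumes the 'lemma.pos.nn' layout.
    """
    # Split from the right so that a dot inside the lemma itself stays intact.
    lemma, pos, number = name.rsplit('.', 2)
    return lemma, pos, int(number)


for name in ['dog.n.01', 'chase.v.01', "cat-o'-nine-tails.n.01"]:
    print(name, '->', split_synset_name(name))
```

The head lemma is only the first synonym of the synset, so it does not have to be the word you searched for.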

The function **hypernym_paths** is interesting. Let us inspect it a bit more. It follows all the hypernym relations upward, starting from this synset, until there are no more hypernyms. Note that there can be multiple hypernym chains for a synset because WordNet occasionally has multiple hypernyms for a synset. The next piece makes it easier to see the structure:

In [4]:
for path in first_synset.hypernym_paths():
    indent = ""
    for hyper in path:
        print(indent, hyper)
        indent += "   "

 Synset('entity.n.01')
    Synset('physical_entity.n.01')
       Synset('object.n.01')
          Synset('whole.n.02')
             Synset('living_thing.n.01')
                Synset('organism.n.01')
                   Synset('animal.n.01')
                      Synset('chordate.n.01')
                         Synset('vertebrate.n.01')
                            Synset('mammal.n.01')
                               Synset('placental.n.01')
                                  Synset('carnivore.n.01')
                                     Synset('canine.n.02')
                                        Synset('dog.n.01')
 Synset('entity.n.01')
    Synset('physical_entity.n.01')
       Synset('object.n.01')
          Synset('whole.n.02')
             Synset('living_thing.n.01')
                Synset('organism.n.01')
                   Synset('animal.n.01')
                      Synset('domestic_animal.n.01')
                         Synset('dog.n.01')


We see that dog in its first sense has two paths due to the hypernyms **canine.n.02** and **domestic_animal.n.01**. These are two different ways of classifying dogs that eventually share the same top nodes from **animal.n.01** upwards.
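To see why two hypernyms lead to two complete paths, the following minimal sketch enumerates all root-to-node paths in a small hand-made graph. The graph is an abridged toy copy of the output above, and `toy_hypernym_paths` is a hypothetical function written for this illustration; it is not NLTK's implementation.

```python
# Toy hypernym graph (child -> list of parents), abridged from the output above.
TOY_HYPERNYMS = {
    'dog': ['canine', 'domestic_animal'],
    'canine': ['carnivore'],
    'carnivore': ['animal'],
    'domestic_animal': ['animal'],
    'animal': ['entity'],
    'entity': [],
}


def toy_hypernym_paths(node, hypernyms):
    """Return every path from a root down to `node`, root first."""
    parents = hypernyms[node]
    if not parents:
        # A node without parents is a root: the path consists of the node itself.
        return [[node]]
    paths = []
    for parent in parents:
        # Every path that reaches a parent is extended with the current node.
        for path in toy_hypernym_paths(parent, hypernyms):
            paths.append(path + [node])
    return paths


for path in toy_hypernym_paths('dog', TOY_HYPERNYMS):
    print(' > '.join(path))
```

Each extra parent anywhere along the way multiplies the number of paths, which is why a single synset can return a list of lists.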

This makes you think: what are all the possible ways to classify something, and how can we know them? What about wordnets for other languages? Should concepts be classified in the same way across different cultures? Clearly, WordNet is just a proxy for semantic space, although it may be the best we have.

We can iterate over the list of all synsets with the synonym dog, get each synset as an 'object' and next call specific functions to know more what they represent.

In [5]:
for synset in all_dog_synsets:
    print()
    print('The synset =', synset)
    print('Type', type(synset))
    print('The synonyms = ', synset.lemmas())
    print('The definition =', synset.definition())
    print('The maximum depth of its hyponymy chain is = ', synset.max_depth())


The synset = Synset('dog.n.01')
Type <class 'nltk.corpus.reader.wordnet.Synset'>
The synonyms =  [Lemma('dog.n.01.dog'), Lemma('dog.n.01.domestic_dog'), Lemma('dog.n.01.Canis_familiaris')]
The definition = a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds
The maximum depth of its hyponymy chain is =  13

The synset = Synset('frump.n.01')
Type <class 'nltk.corpus.reader.wordnet.Synset'>
The synonyms =  [Lemma('frump.n.01.frump'), Lemma('frump.n.01.dog')]
The definition = a dull unattractive unpleasant girl or woman
The maximum depth of its hyponymy chain is =  10

The synset = Synset('dog.n.03')
Type <class 'nltk.corpus.reader.wordnet.Synset'>
The synonyms =  [Lemma('dog.n.03.dog')]
The definition = informal term for a man
The maximum depth of its hyponymy chain is =  9

The synset = Synset('cad.n.01')
Type <class 'nltk.corpus.reader.wordnet.Synset'>
The synonyms =  [Lemma('cad.n.01.cad

You can also obtain synsets with a certain part-of-speech only by passing a part-of-speech value as a parameter.

In [6]:
all_dog_verb_synsets = wn.synsets('dog', 'v')
print(all_dog_verb_synsets)

[Synset('chase.v.01')]


Remember from the lecture that the verbs do not have an augmented top layer in WordNet; instead, the verb subnetwork consists of many isolated subgraphs, like islands. So let's see what the hypernym path is for this verbal synset:

In [7]:
for path in all_dog_verb_synsets[0].hypernym_paths():
    indent = ""
    for hyper in path:
        print(indent, hyper.lemmas(), hyper.definition())
        indent += "   "

 [Lemma('travel.v.01.travel'), Lemma('travel.v.01.go'), Lemma('travel.v.01.move'), Lemma('travel.v.01.locomote')] change location; move, travel, or proceed, also metaphorically
    [Lemma('pursue.v.02.pursue'), Lemma('pursue.v.02.follow')] follow in or as if in pursuit
       [Lemma('chase.v.01.chase'), Lemma('chase.v.01.chase_after'), Lemma('chase.v.01.trail'), Lemma('chase.v.01.tail'), Lemma('chase.v.01.tag'), Lemma('chase.v.01.give_chase'), Lemma('chase.v.01.dog'), Lemma('chase.v.01.go_after'), Lemma('chase.v.01.track')] go after with the intent to catch


Interesting. The path is very short and not so informative. What would you consider to be a hypernym of the top synset **change location; move, travel, or proceed, also metaphorically**? **Change** or **Change-location**?

Various functions can be called on the synset object yielding different data structures. Try some to get a feeling for it. Below we show the specific relations that synsets can have besides hyponymy:

In [8]:
doggy_synset = all_dog_synsets[0]
#### Part - to -  whole relations:
print('Part holonyms:',doggy_synset.part_holonyms())
print('Member holonyms:',doggy_synset.member_holonyms())
print('Substance holonyms:',doggy_synset.substance_holonyms())

### Whole - to - part relations
print('Part meronyms:',doggy_synset.part_meronyms())
print('Member meronyms:',doggy_synset.member_meronyms())
print('Substance meronyms:',doggy_synset.substance_meronyms())


Part holonyms: []
Member holonyms: [Synset('canis.n.01'), Synset('pack.n.06')]
Substance holonyms: []
Part meronyms: [Synset('flag.n.07')]
Member meronyms: []
Substance meronyms: []


In [9]:
chase_synset = all_dog_verb_synsets[0]
#### Relations for verbal synsets
print('Caused:', chase_synset.causes())
print('Entailments:',chase_synset.entailments())
print('Hyponyms:', chase_synset.hyponyms())
print('Examples:', chase_synset.examples())


Caused: []
Entailments: []
Hyponyms: [Synset('hound.v.01'), Synset('quest.v.02'), Synset('run_down.v.07'), Synset('tree.v.03')]
Examples: ['The policeman chased the mugger down the alley', 'the dog chased the rabbit']


To get the full Python object definition of a synset to show all options, you can use the 'dir' command:

In [10]:
dir(chase_synset)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__slots__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_all_hypernyms',
 '_definition',
 '_doc',
 '_examples',
 '_frame_ids',
 '_hypernyms',
 '_instance_hypernyms',
 '_iter_hypernym_lists',
 '_lemma_names',
 '_lemma_pointers',
 '_lemmas',
 '_lexname',
 '_max_depth',
 '_min_depth',
 '_name',
 '_needs_root',
 '_offset',
 '_pointers',
 '_pos',
 '_related',
 '_shortest_hypernym_paths',
 '_wordnet_corpus_reader',
 'acyclic_tree',
 'also_sees',
 'attributes',
 'causes',
 'closure',
 'common_hypernyms',
 'definition',
 'entailments',
 'examples',
 'frame_ids',
 'hypernym_distances',
 'hypernym_paths',
 'hypernyms',
 'hyponyms',
 'in_region_domains',
 'in_topic_domains

Some of these options may be familiar to you if you have read the literature. We will look into the similarity functions more closely later on.
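The `dir` listing mixes internal names (starting with an underscore) with the functions you would normally call. A small sketch of how to filter it is shown below; `public_names` and `ToySynset` are hypothetical names for this illustration, and the toy class only stands in for a real synset so that the example runs without WordNet data.

```python
def public_names(obj):
    """List attribute names of `obj` that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith('_')]


class ToySynset:
    """Stand-in object for the demo; not the NLTK Synset class."""

    def __init__(self):
        self._name = 'toy.n.01'

    def definition(self):
        return 'a toy example'

    def hypernyms(self):
        return []


print(public_names(ToySynset()))
```

In the notebook, the same helper could be applied to `chase_synset` to get a shorter list of relation and similarity functions.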

## Wordnets in other languages

There are wordnets in many different languages, and many are linked to English. The ones that are freely available on the Open Multilingual Wordnet platform (http://compling.hss.ntu.edu.sg/omw/) are also available in NLTK. You can use "wn.langs()" to get the full list.

You might get an error about NLTK not finding an 'omw' dataset. You can download it just like you did in lab1.

In [11]:
import nltk
nltk.download('omw')

[nltk_data] Downloading package omw to /Users/piek/nltk_data...
[nltk_data]   Package omw is already up-to-date!


True

Let's check out which languages have wordnets in the OMW package:

In [12]:
sorted(wn.langs())

['eng']

The listed language wordnets are created by translating the English synsets, the so-called Expand Method (Vossen (ed.) 1998). This means that the concepts of the English wordnet are re-used and the synonyms in the synsets are translated. Another approach is the Merge Method, in which a wordnet is built independently from English and mapped to English afterwards. Only a few wordnets are built independently, and they are often not freely available.

Vossen, Piek. "Introduction to EuroWordNet." In EuroWordNet: A multilingual database with lexical semantic networks, pp. 1-17. Springer, Dordrecht, 1998.

In the case of the expanded wordnets, the concept structure is the same for all these wordnets (they share the English concepts). Starting from the English wordnet, you can ask for the lemmas in another language that are linked to any English synset.

Are there any Japanese lemmas linked to English dog sense 1? For this we need to use a different function **lemma_names** and pass in the  3-letter language tag as a parameter: 

In [13]:
# Are there any Japanese lemmas linked to English dog sense 1
wn.synset('dog.n.01').lemma_names('jpn')

['イヌ', 'ドッグ', '洋犬', '犬', '飼い犬', '飼犬']

In [14]:
# The same for Dutch
wn.synset('dog.n.01').lemma_names('nld')

['hond', 'joekel']

So that is great but can we also get the synset directly through a Dutch or Japanese synonym?

In [15]:
wn.synsets('dog.n.01.hond')

[]

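Unfortunately not like this. The string 'dog.n.01.hond' is a lemma identifier rather than a word, and without a language argument "wn.synsets()" only matches English lemmas. The next call, with the same defaults we used before for 'dog', therefore also returns nothing: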

In [16]:
all_dog_synsets = wn.synsets('hond')
print('Number of synsets with "hond" as a synonym:', len(all_dog_synsets))
print(all_dog_synsets)

Number of synsets with "hond" as a synonym: 0
[]


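To obtain the synsets for a non-English word, we can use the wn.lemmas() function with the *lang* parameter to get the list of lemma objects for a specific language. (The NLTK documentation also shows wn.synsets() being called with a *lang* argument.) The next cell shows the lemma route for the Dutch lemma *hond*.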

In [17]:
dutch_dog_lemmas = wn.lemmas('hond', lang='nld')
print(dutch_dog_lemmas)

[Lemma('dog.n.01.hond'), Lemma('asshole.n.01.hond')]
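The route from a word in another language to its synsets can be wrapped in a small function. The sketch below is a hedged illustration: `synsets_for_word`, `ToyLemma`, and `ToyReader` are hypothetical names, and the toy classes only imitate the two calls used in this notebook (`lemmas(word, lang=...)` and `synset()`) so that the example runs without WordNet data.

```python
def synsets_for_word(reader, word, lang):
    """Collect the distinct synsets of all lemmas matching `word` in `lang`.

    `reader` is expected to offer lemmas(word, lang=...) and each lemma a
    synset() method, as the wordnet reader `wn` does in this notebook.
    """
    found = []
    for lemma in reader.lemmas(word, lang=lang):
        synset = lemma.synset()
        if synset not in found:  # keep the order, drop duplicates
            found.append(synset)
    return found


# Toy stand-ins so the sketch runs without WordNet data (not NLTK classes).
class ToyLemma:
    def __init__(self, synset_name):
        self._synset_name = synset_name

    def synset(self):
        return self._synset_name


class ToyReader:
    def lemmas(self, word, lang='eng'):
        table = {('hond', 'nld'): ['dog.n.01', 'asshole.n.01']}
        return [ToyLemma(name) for name in table.get((word, lang), [])]


print(synsets_for_word(ToyReader(), 'hond', 'nld'))
print(synsets_for_word(ToyReader(), 'hond', 'eng'))
```

With NLTK loaded, the intended call would be `synsets_for_word(wn, 'hond', 'nld')`.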


In [18]:
type(dutch_dog_lemmas[0])

nltk.corpus.reader.wordnet.Lemma

*Lemma* is yet another class defined in the wordnet module with attributes and functions, some of which overlap with those of a synset. Let's check them out through the 'dir' function.

In [19]:
dutch_lemma = dutch_dog_lemmas[0]
dir(dutch_lemma)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__slots__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_frame_ids',
 '_frame_strings',
 '_hypernyms',
 '_instance_hypernyms',
 '_key',
 '_lang',
 '_lex_id',
 '_lexname_index',
 '_name',
 '_related',
 '_synset',
 '_syntactic_marker',
 '_wordnet_corpus_reader',
 'also_sees',
 'antonyms',
 'attributes',
 'causes',
 'count',
 'derivationally_related_forms',
 'entailments',
 'frame_ids',
 'frame_strings',
 'hypernyms',
 'hyponyms',
 'in_region_domains',
 'in_topic_domains',
 'in_usage_domains',
 'instance_hypernyms',
 'instance_hyponyms',
 'key',
 'lang',
 'member_holonyms',
 'member_meronyms',
 'name',
 'part_holonyms',
 'part_meronyms',
 'pertainyms',
 'region_dom

Some differ from those of a synset, such as lang(). The function *.synset()* can be used to get the synset associated with a lemma. Obviously, the synset information is the same as for the English wordnet because the Open Dutch Wordnet (http://wordpress.let.vupr.nl/odwn/) was created by expanding the English wordnet.

In [20]:
print(dutch_lemma, dutch_lemma.lang())

dutch_dog_synset = dutch_lemma.synset()
print('Synsets:', dutch_dog_synset)
print('Hypernyms:', dutch_dog_synset.hypernyms())
print('Definition:', dutch_dog_synset.definition())

Lemma('dog.n.01.hond') nld
Synsets: Synset('dog.n.01')
Hypernyms: [Synset('canine.n.02'), Synset('domestic_animal.n.01')]
Definition: a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds


So we have many wordnets in different languages. Can we get statistics on their coverage?

In [21]:
print('English:', len(list(wn.all_lemma_names(pos='n', lang='eng'))))

print('Dutch:', len(list(wn.all_lemma_names(pos='n', lang='nld'))))
print('Italian:', len(list(wn.all_lemma_names(pos='n', lang='ita'))))
print('Japanese:', len(list(wn.all_lemma_names(pos='n', lang='jpn'))))
print('Slovene:', len(list(wn.all_lemma_names(pos='n', lang='slv'))))
print('Spanish:', len(list(wn.all_lemma_names(pos='n', lang='spa'))))

English: 117798
Dutch: 36896
Italian: 31477
Japanese: 66031
Slovene: 30892
Spanish: 28647


We can see that English has a lot more noun lemmas than the other wordnets. So there is still work to be done.
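To put the raw counts in proportion, the sketch below expresses each count as a fraction of the English total. The numbers are copied from the output above; treating the lemma count as a proxy for coverage is an assumption of this illustration, since lemma counts say nothing about how many synsets are covered.

```python
# Noun lemma counts copied from the output above.
noun_lemma_counts = {
    'English': 117798,
    'Dutch': 36896,
    'Italian': 31477,
    'Japanese': 66031,
    'Slovene': 30892,
    'Spanish': 28647,
}

english_total = noun_lemma_counts['English']

# Ratio of each language's noun lemma count to the English count.
coverage = {
    language: count / english_total
    for language, count in noun_lemma_counts.items()
}

# Print the languages from highest to lowest ratio.
for language, ratio in sorted(coverage.items(), key=lambda item: -item[1]):
    print(f'{language:10s} {ratio:6.1%}')
```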

### Get the Dutch and Japanese dogs

We can get more insight in the coverage of specific areas or semantic fields by collecting all the **hyponyms** for a basic level hypernym, e.g. all dogs, cats, horses, ships, cars, etc.

The next simple "for" loop iterates over all dog-hyponyms in the English wordnet and prints each synset together with its Dutch, Japanese, Italian, and Spanish labels. We can easily see which dogs are included in each language and which are not.

In [22]:
dog = wn.synset('dog.n.01')
dogs = dog.hyponyms()
print('Number of dogs:', len(dogs))

for s in dogs:
    print(s)
    print('English:', s.lemma_names('eng'))
    print(s.definition())
    print('Dutch:', s.lemma_names('nld'))
    print('Japanese:', s.lemma_names('jpn'))
    print('Italian:', s.lemma_names('ita'))
    print('Spanish:', s.lemma_names('spa'))
    print()

Number of dogs: 18
Synset('basenji.n.01')
English: ['basenji']
small smooth-haired breed of African origin having a tightly curled tail and the inability to bark
Dutch: []
Japanese: []
Italian: ['basenji', 'cane_del_Congo']
Spanish: ['basenji']

Synset('corgi.n.01')
English: ['corgi', 'Welsh_corgi']
either of two Welsh breeds of long-bodied short-legged dogs with erect ears and a fox-like head
Dutch: []
Japanese: ['ウェルシュ・コーギー']
Italian: []
Spanish: []

Synset('cur.n.01')
English: ['cur', 'mongrel', 'mutt']
an inferior dog or one of mixed breed
Dutch: ['bastaard', 'bastaardhond', 'halve_gare', 'idioot', 'mormel', 'straathond']
Japanese: ['雑犬', '雑種犬', '駄犬']
Italian: ['bastardo']
Spanish: ['chucho', 'gozque', 'mestizo']

Synset('dalmatian.n.02')
English: ['dalmatian', 'coach_dog', 'carriage_dog']
a large breed having a smooth white coat with black or brown spots; originated in Dalmatia
Dutch: ['Dalmatische', 'dalmatiër']
Japanese: []
Italian: ['dalmata']
Spanish: []

Synset('great_pyrenee

This gives dogs as direct hyponyms, but maybe there are more dogs as hyponyms of hyponyms of hyponyms, etc. The WordNet interface documentation uses a so-called **anonymous function** (lambda), which the synset's **closure** function applies repeatedly: first to the starting synset, then to each hyponym it finds, and so on. This is higher Python magic. For now, accept that it traverses the hyponym tree from a starting synset and puts all results in a list.

The next cell defines the function **hypo** that returns all hyponyms of a synset assigned to the variable **s**. 

In [23]:
hypo = lambda s: s.hyponyms()

After defining the **hypo** function we can pass it as a parameter to the synset **closure** function.

In [24]:
dogs_at_all_levels = list(dog.closure(hypo))
print('Number of dogs:', len(dogs_at_all_levels))

Number of dogs: 189
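What such a closure does can be sketched on a small hand-made graph. The code below is a simplified illustration under stated assumptions: the toy breeds and the function `toy_closure` are made up for this example, and the breadth-first traversal shown here is not claimed to be NLTK's actual implementation.

```python
from collections import deque

# Toy hyponym graph (parent -> direct hyponyms); entries picked for illustration.
TOY_HYPONYMS = {
    'dog': ['hunting_dog', 'corgi'],
    'hunting_dog': ['hound', 'terrier'],
    'hound': ['beagle'],
    'corgi': [],
    'terrier': [],
    'beagle': [],
}


def toy_closure(start, relation):
    """Yield every node reachable from `start` via `relation`, breadth first."""
    seen = set()
    queue = deque(relation(start))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # a node reachable along two routes is reported once
        seen.add(node)
        yield node
        queue.extend(relation(node))


toy_hypo = lambda s: TOY_HYPONYMS[s]

print('Direct hyponyms:', len(toy_hypo('dog')))
print('All levels:', list(toy_closure('dog', toy_hypo)))
```

The relation is passed in as a function, so the same traversal would work for hypernyms or meronyms by swapping the lambda.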


Ahhh, we now have 189 dogs instead of 18 if we go deeper! Let's check their coverage in Dutch and the other languages.

In [25]:

for s in dogs_at_all_levels: 
    print(s)
    print('English:', s.lemma_names('eng'))
    print(s.definition())
    print('Dutch:', s.lemma_names('nld'))
    print('Japanese:', s.lemma_names('jpn'))
    print('Italian:', s.lemma_names('ita'))
    print('Spanish:', s.lemma_names('spa'))
    print()

Synset('basenji.n.01')
English: ['basenji']
small smooth-haired breed of African origin having a tightly curled tail and the inability to bark
Dutch: []
Japanese: []
Italian: ['basenji', 'cane_del_Congo']
Spanish: ['basenji']

Synset('corgi.n.01')
English: ['corgi', 'Welsh_corgi']
either of two Welsh breeds of long-bodied short-legged dogs with erect ears and a fox-like head
Dutch: []
Japanese: ['ウェルシュ・コーギー']
Italian: []
Spanish: []

Synset('cur.n.01')
English: ['cur', 'mongrel', 'mutt']
an inferior dog or one of mixed breed
Dutch: ['bastaard', 'bastaardhond', 'halve_gare', 'idioot', 'mormel', 'straathond']
Japanese: ['雑犬', '雑種犬', '駄犬']
Italian: ['bastardo']
Spanish: ['chucho', 'gozque', 'mestizo']

Synset('dalmatian.n.02')
English: ['dalmatian', 'coach_dog', 'carriage_dog']
a large breed having a smooth white coat with black or brown spots; originated in Dalmatia
Dutch: ['Dalmatische', 'dalmatiër']
Japanese: []
Italian: ['dalmata']
Spanish: []

Synset('great_pyrenees.n.01')
English: [

It is clear that the non-English wordnets lack coverage compared to the English WordNet for dogs.

The next simple code counts the missing dogs per language.

In [26]:
dog_gaps_nl = 0
dog_gaps_jp = 0
dog_gaps_es = 0
dog_gaps_it = 0


for s in dogs_at_all_levels: 
    if not s.lemma_names('nld'):
        dog_gaps_nl +=1
    
    if not s.lemma_names('jpn'):
        dog_gaps_jp +=1
    
    if not s.lemma_names('ita'):
        dog_gaps_it +=1
    
    if not s.lemma_names('spa'):
        dog_gaps_es +=1
        
print('Missing dogs in Dutch:', dog_gaps_nl)
print('Missing dogs in Japanese:', dog_gaps_jp)
print('Missing dogs in Italian:', dog_gaps_it)
print('Missing dogs in Spanish:', dog_gaps_es)

Missing dogs in Dutch: 142
Missing dogs in Japanese: 100
Missing dogs in Italian: 127
Missing dogs in Spanish: 154
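The four separate counters can be folded into one dictionary keyed by language code, which makes it easy to add more languages. This is a minimal sketch: `count_gaps` and `ToySynset` are hypothetical names, and the toy class only provides a `lemma_names(lang)` method so that the example runs without WordNet data.

```python
def count_gaps(synsets, langs):
    """Count, per language code, the synsets that have no lemma in that language."""
    gaps = {lang: 0 for lang in langs}
    for synset in synsets:
        for lang in langs:
            if not synset.lemma_names(lang):
                gaps[lang] += 1
    return gaps


class ToySynset:
    """Stand-in with a lemma_names(lang) method; not the NLTK Synset class."""

    def __init__(self, names_by_lang):
        self._names_by_lang = names_by_lang

    def lemma_names(self, lang='eng'):
        return self._names_by_lang.get(lang, [])


# Two entries copied from the dog output earlier in this notebook.
toy_dogs = [
    ToySynset({'eng': ['basenji'], 'ita': ['basenji', 'cane_del_Congo'], 'spa': ['basenji']}),
    ToySynset({'eng': ['corgi', 'Welsh_corgi'], 'jpn': ['ウェルシュ・コーギー']}),
]

print(count_gaps(toy_dogs, ['nld', 'jpn', 'ita', 'spa']))
```

In the notebook, the intended call would be `count_gaps(dogs_at_all_levels, ['nld', 'jpn', 'ita', 'spa'])`.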


Question: are there any Dutch words for dogs that are not in the English WordNet? Where can you find the answer to such a question? There is work to do to complete it. Perhaps a nice project for you to work on to increase the coverage of our Dutch WordNet?

## Wordnet Similarity

The structure of wordnet as a graph can be used to measure the similarity across concepts. The basic idea is that concepts can be connected by going up and down through the relations. By counting the steps, we can measure the distance between for example **car**, **train**, **man**, **woman**. 

![distance_in_wordnet](./images/wordnet_sim.png)

A whole series of similarity functions have been added to NLTK that measure the distances in different ways but use the same basic strategy that exploits the relations between synsets. See the documentation for the other methods. 

We show here how it works for the most basic method **path** that counts the steps. We first obtain a few synsets:

In [27]:
dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')
car = wn.synset('car.n.01')

We can get the full hypernym path for these synsets:

In [28]:
print('dogs are a type of:')
for path in dog.hypernym_paths():
    indent = ""
    for hyper in path:
        print(indent, hyper)
        indent += "   "

print('cats are a type of:')
for path in cat.hypernym_paths():
    indent = ""
    for hyper in path:
        print(indent, hyper)
        indent += "   "

dogs are a type of:
 Synset('entity.n.01')
    Synset('physical_entity.n.01')
       Synset('object.n.01')
          Synset('whole.n.02')
             Synset('living_thing.n.01')
                Synset('organism.n.01')
                   Synset('animal.n.01')
                      Synset('chordate.n.01')
                         Synset('vertebrate.n.01')
                            Synset('mammal.n.01')
                               Synset('placental.n.01')
                                  Synset('carnivore.n.01')
                                     Synset('canine.n.02')
                                        Synset('dog.n.01')
 Synset('entity.n.01')
    Synset('physical_entity.n.01')
       Synset('object.n.01')
          Synset('whole.n.02')
             Synset('living_thing.n.01')
                Synset('organism.n.01')
                   Synset('animal.n.01')
                      Synset('domestic_animal.n.01')
                         Synset('dog.n.01')
cats are a type of:
 Sy

First of all, we see that **cat** has only one path, which is most similar to the **canine** path of **dog**.
We can see that large parts of the biological paths are the same, from the top node down to *carnivore*.

In [29]:
print('cars are a type of:')
for path in car.hypernym_paths():
    indent = ""
    for hyper in path:
        print(indent, hyper)
        indent += "   "

cars are a type of:
 Synset('entity.n.01')
    Synset('physical_entity.n.01')
       Synset('object.n.01')
          Synset('whole.n.02')
             Synset('artifact.n.01')
                Synset('instrumentality.n.03')
                   Synset('container.n.01')
                      Synset('wheeled_vehicle.n.01')
                         Synset('self-propelled_vehicle.n.01')
                            Synset('motor_vehicle.n.01')
                               Synset('car.n.01')
 Synset('entity.n.01')
    Synset('physical_entity.n.01')
       Synset('object.n.01')
          Synset('whole.n.02')
             Synset('artifact.n.01')
                Synset('instrumentality.n.03')
                   Synset('conveyance.n.03')
                      Synset('vehicle.n.01')
                         Synset('wheeled_vehicle.n.01')
                            Synset('self-propelled_vehicle.n.01')
                               Synset('motor_vehicle.n.01')
                                  Syn

The paths for *car* are very different, but they meet the other paths at *whole.n.02*.

Wordnet similarity measures compare the distance between word meanings through the hypernym paths and possibly other relations. Since the other relations are very sparse, the measurements mostly boil down to the hypernym relations.

Synsets have the similarity function built in. The function requires another synset as the input and gives back a score:

In [30]:
print(dog.path_similarity(cat))

0.2


Is this very similar? The only way to find out is to compare this with other synsets such as *car*:

In [31]:
print(car.path_similarity(cat))
print(car.path_similarity(dog))

0.05555555555555555
0.07692307692307693


Right, these score a lot lower.
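The scores can be traced by hand: the path measure is 1 divided by the number of steps on the shortest connecting path plus 1. The sketch below recomputes this on a toy graph built from the hypernym paths printed above. It is an illustration, not NLTK's code: the function names are hypothetical, only the shorter *car* path is included, and the *cat* branch via `feline` is filled in as an assumption because the printed output is cut off.

```python
from collections import deque

# Hypernym edges (child -> parents), abridged from the paths printed above.
# Assumption: the cat branch runs via 'feline' to 'carnivore'.
TOY_HYPERNYMS = {
    'dog': ['canine', 'domestic_animal'],
    'canine': ['carnivore'],
    'cat': ['feline'],
    'feline': ['carnivore'],
    'carnivore': ['placental'],
    'placental': ['mammal'],
    'mammal': ['vertebrate'],
    'vertebrate': ['chordate'],
    'chordate': ['animal'],
    'domestic_animal': ['animal'],
    'animal': ['organism'],
    'organism': ['living_thing'],
    'living_thing': ['whole'],
    'car': ['motor_vehicle'],
    'motor_vehicle': ['self-propelled_vehicle'],
    'self-propelled_vehicle': ['wheeled_vehicle'],
    'wheeled_vehicle': ['container'],
    'container': ['instrumentality'],
    'instrumentality': ['artifact'],
    'artifact': ['whole'],
}


def ancestor_distances(node, hypernyms):
    """Map the node and each ancestor to the minimum number of steps upward."""
    distances = {node: 0}
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for parent in hypernyms.get(current, []):
            if parent not in distances:
                distances[parent] = distances[current] + 1
                queue.append(parent)
    return distances


def toy_path_similarity(a, b, hypernyms=TOY_HYPERNYMS):
    """Return 1 / (shortest path length + 1), going up to a shared ancestor and down."""
    up_a = ancestor_distances(a, hypernyms)
    up_b = ancestor_distances(b, hypernyms)
    shared = set(up_a) & set(up_b)
    if not shared:
        return None  # no common ancestor, so no connecting path
    shortest = min(up_a[n] + up_b[n] for n in shared)
    return 1 / (shortest + 1)


print('dog-cat:', toy_path_similarity('dog', 'cat'))
print('car-dog:', toy_path_similarity('car', 'dog'))
print('car-cat:', toy_path_similarity('car', 'cat'))
```

On this toy graph, dog and cat meet at `carnivore` after two steps each, whereas *car* only meets the animals at `whole`, which explains the much lower scores.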

Let's see if this also works for verbs.

In [32]:
hit = wn.synset('hit.v.01')
slap = wn.synset('slap.v.01')
print(hit.path_similarity(slap))
print(wn.path_similarity(hit, slap))

0.14285714285714285
0.14285714285714285


It seems to work well, but...

If you have done some reading on WordNet, you know that the noun hierarchy has a single top-node synset, 'entity.n.01'. All nominal synsets descend from this synset. This is not the case for verbs or for adjectives. The verb part of WordNet therefore consists of 559 islands of disconnected synsets with 559 root nodes. The English WordNet editors decided not to connect these islands in an artificial way, as was done for nouns.

In [33]:
print('hit is a type of:')
for path in hit.hypernym_paths():
    indent = ""
    for hyper in path:
        print(indent, hyper)
        indent += "   "
print('slap is a type of:')
for path in slap.hypernym_paths():
    indent = ""
    for hyper in path:
        print(indent, hyper)
        indent += "   "

hit is a type of:
 Synset('move.v.02')
    Synset('propel.v.01')
       Synset('hit.v.01')
slap is a type of:
 Synset('touch.v.01')
    Synset('strike.v.01')
       Synset('slap.v.01')


We see there is no overlap between the paths for the two verbs.

We can also see this using the NLTK root_hypernyms() function for the above noun and verb synsets.

In [34]:
print('Root for dog:', dog.root_hypernyms())
print('Root for cat:', cat.root_hypernyms())
print('Root for slap:', slap.root_hypernyms())
print('Root for hit:', hit.root_hypernyms())

Root for dog: [Synset('entity.n.01')]
Root for cat: [Synset('entity.n.01')]
Root for slap: [Synset('touch.v.01')]
Root for hit: [Synset('move.v.02')]


How is it still possible to get a value for similarity if the subgraphs are not connected? Well, the package imposes a simulated rootnode by grouping all the subgraph top-nodes under a single node. This is the default setting. If you do not want to use this, you can turn it off by setting the parameter *simulate_root* to False.

In [35]:
print(hit.path_similarity(slap, simulate_root=False))
print(wn.path_similarity(hit, slap, simulate_root=False))

None
None


Without the simulated root there is no path from 'hit' to 'slap'.
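The effect of a simulated root can be reproduced on a toy version of the two verb paths printed above. The sketch is an illustration under assumptions: the label `*ROOT*` and the function names are hypothetical, and the code only mimics the idea of placing one artificial node above all top nodes; it is not NLTK's implementation.

```python
from collections import deque

# Verb hypernym edges (child -> parents) taken from the two paths printed above.
VERB_HYPERNYMS = {
    'hit': ['propel'],
    'propel': ['move'],
    'slap': ['strike'],
    'strike': ['touch'],
}
FAKE_ROOT = '*ROOT*'  # hypothetical label for the simulated root


def ancestor_distances(node, hypernyms, simulate_root):
    """Map the node and each ancestor to the minimum number of steps upward."""
    distances = {node: 0}
    queue = deque([node])
    while queue:
        current = queue.popleft()
        parents = hypernyms.get(current, [])
        # A top node gets the artificial root as its parent when requested.
        if not parents and simulate_root and current != FAKE_ROOT:
            parents = [FAKE_ROOT]
        for parent in parents:
            if parent not in distances:
                distances[parent] = distances[current] + 1
                queue.append(parent)
    return distances


def toy_path_similarity(a, b, hypernyms, simulate_root=True):
    """Return 1 / (shortest path length + 1), or None if no path exists."""
    up_a = ancestor_distances(a, hypernyms, simulate_root)
    up_b = ancestor_distances(b, hypernyms, simulate_root)
    shared = set(up_a) & set(up_b)
    if not shared:
        return None
    shortest = min(up_a[n] + up_b[n] for n in shared)
    return 1 / (shortest + 1)


print(toy_path_similarity('hit', 'slap', VERB_HYPERNYMS))
print(toy_path_similarity('hit', 'slap', VERB_HYPERNYMS, simulate_root=False))
```

With the artificial root, each verb needs three steps to reach the shared node, so the path has six steps and the score is 1/7. Note that any two verbs from different islands get a score this way, even if they are unrelated in meaning.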

## Using WordNet similarity for words instead of synsets

Now if we want to use this for words, we first need to obtain all the synsets for a word and then compare each synset with the synsets of another word. We thus need a **for-loop** inside a **for-loop**. The first loop gets the synsets for the first word and the second loop for each synset gets the synsets for the second word to compare.

In [36]:
w1='dog'
w2='cat'
for s1 in wn.synsets(w1, 'n'):
    print(s1,':')
    for s2 in wn.synsets(w2, 'n'):
        print('\t', s2,':', s1.path_similarity(s2))

Synset('dog.n.01') :
	 Synset('cat.n.01') : 0.2
	 Synset('guy.n.01') : 0.125
	 Synset('cat.n.03') : 0.125
	 Synset('kat.n.01') : 0.07692307692307693
	 Synset('cat-o'-nine-tails.n.01') : 0.08333333333333333
	 Synset('caterpillar.n.02') : 0.07692307692307693
	 Synset('big_cat.n.01') : 0.2
	 Synset('computerized_tomography.n.01') : 0.05263157894736842
Synset('frump.n.01') :
	 Synset('cat.n.01') : 0.07142857142857142
	 Synset('guy.n.01') : 0.125
	 Synset('cat.n.03') : 0.125
	 Synset('kat.n.01') : 0.1
	 Synset('cat-o'-nine-tails.n.01') : 0.07142857142857142
	 Synset('caterpillar.n.02') : 0.06666666666666667
	 Synset('big_cat.n.01') : 0.07142857142857142
	 Synset('computerized_tomography.n.01') : 0.05555555555555555
Synset('dog.n.03') :
	 Synset('cat.n.01') : 0.07692307692307693
	 Synset('guy.n.01') : 0.2
	 Synset('cat.n.03') : 0.14285714285714285
	 Synset('kat.n.01') : 0.1111111111111111
	 Synset('cat-o'-nine-tails.n.01') : 0.07692307692307693
	 Synset('caterpillar.n.02') : 0.07142857142857

We can use the highest similarity from all pairs to find the strongest association.

In [37]:
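w1 = 'dog'
w2 = 'cat'
for s1 in wn.synsets(w1, 'n'):
    top_sim_score = 0
    top_sim_synset_w2 = None
    # Compare this sense of w1 with every noun sense of w2 and keep the best match
    for s2 in wn.synsets(w2, 'n'):
        sim = s1.path_similarity(s2)
        # path_similarity can return None when no path exists; skip such pairs
        if sim is not None and sim > top_sim_score:
            top_sim_score = sim
            top_sim_synset_w2 = s2
    print('Most similar are', s1, top_sim_synset_w2, ':', top_sim_score)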

Most similar are Synset('dog.n.01') Synset('cat.n.01') : 0.2
Most similar are Synset('frump.n.01') Synset('guy.n.01') : 0.125
Most similar are Synset('dog.n.03') Synset('guy.n.01') : 0.2
Most similar are Synset('cad.n.01') Synset('guy.n.01') : 0.14285714285714285
Most similar are Synset('frank.n.02') Synset('kat.n.01') : 0.09090909090909091
Most similar are Synset('pawl.n.01') Synset('cat-o'-nine-tails.n.01') : 0.14285714285714285
Most similar are Synset('andiron.n.01') Synset('cat-o'-nine-tails.n.01') : 0.16666666666666666
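The loop above reports the best match for each sense of the first word. To get a single score for the word pair, we can take the maximum over all sense pairs. The sketch below is one possible way to do this; `best_sense_pair` is a hypothetical helper, and the toy scores are a few values copied from the output above so that the example runs without WordNet data.

```python
from itertools import product


def best_sense_pair(senses1, senses2, similarity):
    """Return (score, sense1, sense2) for the highest-scoring pair, or None."""
    scored = []
    for s1, s2 in product(senses1, senses2):
        score = similarity(s1, s2)
        if score is not None:  # skip pairs without a score
            scored.append((score, s1, s2))
    if not scored:
        return None
    return max(scored, key=lambda item: item[0])


# Toy scores copied from the output above (synset names used as plain strings).
toy_scores = {
    ('dog.n.01', 'cat.n.01'): 0.2,
    ('dog.n.01', 'guy.n.01'): 0.125,
    ('frump.n.01', 'cat.n.01'): 0.07142857142857142,
    ('frump.n.01', 'guy.n.01'): 0.125,
}
toy_similarity = lambda a, b: toy_scores.get((a, b))

print(best_sense_pair(['dog.n.01', 'frump.n.01'], ['cat.n.01', 'guy.n.01'], toy_similarity))
```

With NLTK loaded, the intended call would be `best_sense_pair(wn.synsets('dog', 'n'), wn.synsets('cat', 'n'), lambda a, b: a.path_similarity(b))`. Taking the maximum assumes that the most closely related senses are the intended ones, which is a simplification.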


# End of this notebook