<a href="https://colab.research.google.com/github/scskalicky/LING-226-vuw/blob/main/22_Intro_to_WordNet.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# What do words mean?

## *Introduction to WordNet*

This notebook will introduce you to the English *WordNet*, which is a database of word associations. Of all the topics / resources we can work with in NLTK, I think WordNet may be the most interesting. That's because WordNet is trying to get at something we have not directly addressed up to this point: capturing the *meaning* of English words computationally. WordNet tries to capture such information by modelling (1) the different senses associated with words and (2) the relationships between those senses / words and other senses / words.


As such, WordNet is a fantastic resource for delving deeper into lexical information, in particular how words in English relate to one another. You can [use WordNet online](http://wordnetweb.princeton.edu/perl/webwn), which might help you conceptually understand the way that WordNet is organised and how NLTK is used to access the WordNet information. The [Wikipedia page](https://en.wikipedia.org/wiki/WordNet) for WordNet also includes a good deal of background and discussion (although you may want to read it after going through this notebook).

NLTK includes WordNet as a resource, and using it is just a matter of downloading and importing the necessary resources.

> *You may also want to explore using `help(wn)` (after you have imported it)*.


In [1]:
# load in wordnet as wn
import nltk
nltk.download(['wordnet', 'omw-1.4'])
from nltk.corpus import wordnet as wn

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Ming\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\Ming\AppData\Roaming\nltk_data...


## Synsets


In WordNet, words are organised into what are known as `synsets`, and these synsets represent the key objects you work with when using WordNet. You can search which synsets are associated with different words using the `synsets()` function.

Simply pass the word as a string to the function.  

In [2]:
# we can see that the word "dog" is associated with eight different synsets.
wn.synsets('dog')

[Synset('dog.n.01'),
 Synset('frump.n.01'),
 Synset('dog.n.03'),
 Synset('cad.n.01'),
 Synset('frank.n.02'),
 Synset('pawl.n.01'),
 Synset('andiron.n.01'),
 Synset('chase.v.01')]

Each synset is a word or set of words associated with a specific meaning (which are referred to as *senses*). In the output above, we can see that the word `dog` is included in eight different synsets.


This means that WordNet includes eight possible meanings (or senses) with which the word `dog` can be associated. The actual names of the synsets are provided in this format:

```
word.x.01
```

`word.` — the "main" word associated with the synset.

`.x.` — the part of speech   
  - (above we see `n` and `v` for `noun` and `verb`)

`01` — this is how they maintain repeats of the same synset name and also indicate the most common uses.
  - this means whatever meaning or sense is associated with `dog.n.01` is thought to be more frequent than `dog.n.03`.

So, inspecting the output above, there are eight synsets. Two synsets have dog as the "main" word (`dog.n.01` and `dog.n.03`), whereas the other six synsets include `dog` as a word associated with the synset meaning.
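To make the naming scheme concrete, here is a minimal sketch that splits a synset name into its three parts. The helper name `split_synset_name` is my own invention for illustration (it is not part of NLTK), and it only does string handling, so it runs without loading WordNet.

```python
def split_synset_name(name):
    """Split a name like 'dog.n.01' into (main word, part of speech, sense number).

    Hypothetical helper for illustration only; it is not an NLTK function.
    """
    # split from the right so that only the last two dots act as separators
    word, pos, number = name.rsplit('.', 2)
    return word, pos, int(number)


# names copied from the output of wn.synsets('dog') above
for name in ['dog.n.01', 'frump.n.01', 'dog.n.03', 'chase.v.01']:
    print(split_synset_name(name))
```

Splitting from the right is a small design choice: it keeps the main word intact even if it happens to contain a dot itself.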


Let's save the first synset to a variable and explore the information contained within:

CAREFUL!

- Note that we use `synsets()` to find all synsets with a string word as input (synset**S**)

- We use `synset()` to call a specific synset using the name of the synset



In [11]:
# when you know the name of a synset you can call it directly
# save the dog synset to a variable.
dog01 = wn.synset('dog.n.01')

There are a number of method functions we can use. Two that are helpful to understand synsets are these:

- `.lemma_names()` — the other words which have the same meaning/sense (i.e., other words in the same synset)
- `.definition()` — the full definition of the synset

In [12]:
# use lemma_names() to see all the other words in a synset
# this helps us understand the "meaning" of the first synset for dog
dog01.lemma_names()

['dog', 'domestic_dog', 'Canis_familiaris']

In [13]:
# use .definition() to get the actual meaning
# this meaning makes sense right?
dog01.definition()

'a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds'

So `dog.n.01` is related to the domestic animal, and the words / phrases `domestic_dog` and `Canis_familiaris` are thought to be approximately equivalent in meaning/sense. That makes...sense...right?

Let's explore the other synset with `dog` in the name, `dog.n.03`. There is only one word in the lemma names, suggesting this is a very specific use of the word dog:

In [6]:
# lemma names
wn.synset('dog.n.03').lemma_names()

['dog']

In [7]:
# oh...this is clearly a different meaning!
wn.synset('dog.n.03').definition()

'informal term for a man'

So the other meaning of "dog" is a colloquialism with a very specific meaning. Again, there are no other words (or lemmas) included in this synset, further exhibiting the specific meaning associated with this use of dog.

What about the other synsets? Each of them was returned because the word "dog" is a lemma in the synsets.


For example, the `frank.n.02` synset meaning is related to hot dogs or frankfurters, and `dog` is one of the lemmas listed as a possible synonym.

In [14]:
wn.synset('frank.n.02').lemma_names()

['frank',
 'frankfurter',
 'hotdog',
 'hot_dog',
 'dog',
 'wiener',
 'wienerwurst',
 'weenie']

In [15]:
wn.synset('frank.n.02').definition()

'a smooth-textured sausage of minced beef or pork usually smoked; often served on a bread roll'

### Better understanding how synsets are accessed

> *(I found this out by accident when playing around and thought it was fun to share)*

You can explore the other meanings and definitions by calling synset names as they were provided above.

Alternatively, since we know there are seven total synsets which include `dog` as a `noun`, we can iterate through those synsets by calling `dog.n.0x`, where `x` is the number 1-7.

I demonstrate this below using a loop which ranges through the numbers 1-7 and then calls the synset using `dog.n.0x` instead of the names provided above.

This helps us see that the number in the synset is related to the frequency of use of a word's different meanings, as well as the flexibility associated with calling synsets!

>> *In the cell you will see that to cycle through 1-7, I need to call `range(1,8)`. This is similar to slicing, in that I am asking for everything starting at 1 and stopping before 8, so 8 is **not** included. I encourage you to play around with `range()` in order to better understand how this works.*
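If `range()` still feels slippery, a tiny sketch like the one below may help. It only builds the synset *names* as strings (no WordNet lookup), and the `{i:02d}` format is my own suggestion for keeping the number at two digits so the names match the way WordNet displays them.

```python
# range(start, stop) includes start but stops *before* stop
numbers = list(range(1, 8))
print(numbers)

# build the seven synset names, padding each number to two digits
names = [f'dog.n.{i:02d}' for i in numbers]
print(names)
```

The same formatting also works for two-digit sense numbers: `f'dog.n.{12:02d}'` gives the string `'dog.n.12'`.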




In [16]:
# (how could this loop be changed to account for words that have more than 10 synsets?)
# (how could you make a function which can print out the synsets regardless of their length?
#  how could you always know the size of the synset?)

for i in range(1,8):
  print('definition: ' + str(i))
  print(wn.synset('dog.n.0' + str(i)).definition())

definition: 1
a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds
definition: 2
a dull unattractive unpleasant girl or woman
definition: 3
informal term for a man
definition: 4
someone who is morally reprehensible
definition: 5
a smooth-textured sausage of minced beef or pork usually smoked; often served on a bread roll
definition: 6
a hinged catch that fits into a notch of a ratchet to move a wheel forward or prevent it from moving backward
definition: 7
metal supports for logs in a fireplace


In [20]:
# we can also see what happens when we call a synset beyond the range (check out the WordNetError)
wn.synset('dog.n.08')

WordNetError: Lemma 'dog' with part of speech 'n' only has 7 senses

This makes perfect sense, because if you recall, we started this whole search by searching for the word "dog" and saw that there were eight synsets, seven of which were associated with nouns. So if we wanted to know how many "meanings" or "senses" are associated with a word, we can query the total number of synsets returned for a word, regardless of the name of the synset.

Just as a fun exercise, consider the function below where I use this information to make a smarter function which will tell us all the basic information that we might want to know for different words.



In [21]:
# because you can use len to find how many senses a word has...
len(wn.synsets('dog'))

8

In [22]:
# and because you could use the `.pos()` method to find the POS uses...
for synset in wn.synsets('dog'):
    print(synset.pos())

n
n
n
n
n
n
n
v


In [23]:
#  you could write a function to automatically count and print various information about words
def print_my_synsets_please(target):
    synsets = wn.synsets(target)

    # only do this stuff if there are any synsets for the target
    if synsets:

        # create variables to count number of POS uses for the word
        nouns = 0
        verbs = 0
        adjs = 0

        # iteratively add to each pos as you find them
        for synset in synsets:
            if synset.pos() == 'n':
                nouns += 1
            if synset.pos() == 'v':
                verbs += 1
            if synset.pos() == 'a':
                adjs += 1

        # save all the synsets for different POS
        noun_synsets = [synset.definition() + '\n' for synset in synsets if synset.pos() == "n"]
        verb_synsets = [synset.definition() + '\n' for synset in synsets if synset.pos() == "v"]
        adj_synsets = [synset.definition() + '\n' for synset in synsets if synset.pos() == "a"]

        # print out total number of senses, then number of senses for each POS along with their definitions
        # only print info for POS if they have senses for that POS

        print(f'Target {target} has {len(synsets)} senses.')
        if nouns > 0:
            print(f'Nouns: {nouns}')
            for n in noun_synsets:
                print(n)

        if verbs > 0:
            print(f'Verbs: {verbs}')
            for v in verb_synsets:
                print(v)

        if adjs > 0:
            print(f'Adjectives: {adjs}')
            for a in adj_synsets:
                print(a)
    else:
        print(f'Sorry, target word {target} makes no sense!')

Since I've gone to all that trouble making this function, try it out on some words to see how it works!

If you're brave, try it on the word `run`

In [24]:
print_my_synsets_please('dog')

Target dog has 8 senses.
Nouns: 7
a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds

a dull unattractive unpleasant girl or woman

informal term for a man

someone who is morally reprehensible

a smooth-textured sausage of minced beef or pork usually smoked; often served on a bread roll

a hinged catch that fits into a notch of a ratchet to move a wheel forward or prevent it from moving backward

metal supports for logs in a fireplace

Verbs: 1
go after with the intent to catch



In [25]:
print_my_synsets_please('Canada')

Target Canada has 1 senses.
Nouns: 1
a nation in northern North America; the French were the first Europeans to settle in mainland Canada



In [26]:
print_my_synsets_please('happy')

Target happy has 4 senses.
Adjectives: 1
enjoying or showing or marked by joy or pleasure



In [27]:
print_my_synsets_please('comb')

Target comb has 8 senses.
Nouns: 5
a flat device with narrow pointed teeth on one edge; disentangles or arranges hair

the fleshy red crest on the head of the domestic fowl and other gallinaceous birds

any of several tools for straightening fibers

ciliated comb-like swimming plate of a ctenophore

the act of drawing a comb through hair

Verbs: 3
straighten with a comb

search thoroughly

smoothen and neaten with or as with a comb



In [28]:
print_my_synsets_please('blarg')

Sorry, target word blarg makes no sense!


What other helper functions might you want to make for WordNet? Or, how else could you update this function to allow for more options / customization?

### Specify part of speech in WordNet searches

We now know that we can use `.pos()` to check the POS of specific synsets, but we can also specify specific parts of speech when querying WordNet. You can do so by using the `pos = ` argument in `wn.synsets()`

In [34]:
# only return synsets that are nouns
wn.synsets('dog', pos = 'n')

[Synset('dog.n.01'),
 Synset('frump.n.01'),
 Synset('dog.n.03'),
 Synset('cad.n.01'),
 Synset('frank.n.02'),
 Synset('pawl.n.01'),
 Synset('andiron.n.01')]

So we can quickly count specific meanings for different types if we need — could this information be used to modify the function I wrote above?

In [30]:
# how many meanings does 'dog' have as a noun?
len(wn.synsets('dog', pos = 'n'))

7

In [31]:
# how many meanings does 'dog' have as a verb?
len(wn.synsets('dog', pos = 'v'))

1

### Double-check the online version of WordNet

Take a moment now to go to the [online version of WordNet](http://wordnetweb.princeton.edu/perl/webwn). Type in the word 'dog' and search for it. You should see the same information we've explored here, but presented in a graphical user interface rather than using Python as we have in this notebook. This may help you get a better understanding of how the WordNet information is stored.

After using the website, consider the pros/cons of using Python and NLTK to do effectively the same thing. Being able to search WordNet programmatically will be more efficient than looking up words one-at-a-time in the website version. At the same time, using the website version might make it easier to see the larger connections and categories when compared to accessing the information through NLTK.

### **Your Turn**

Explore some other words in WordNet to get a hang of looking through different synsets and accessing the information within. You should try the following **your turn** prompt:

> *Your Turn: Write down all the senses of the word `dish` that you can think of. Now, explore this word with the help of WordNet, using the same operations we used above.*



In [39]:
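# look through other synsets here
# (this example looks up 'word'; swap in 'dish' to answer the prompt above)
word_synsets = wn.synsets('word')


# avoid naming the loop variable `set`, which would shadow Python's built-in set type
for synset in word_synsets:
    print(f"set: {synset}\tdefinition: {synset.definition()}")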


set: Synset('word.n.01')	definition: a unit of language that native speakers can identify
set: Synset('word.n.02')	definition: a brief statement
set: Synset('news.n.01')	definition: information about recent and important events
set: Synset('word.n.04')	definition: a verbal command for action
set: Synset('discussion.n.02')	definition: an exchange of views on some topic
set: Synset('parole.n.01')	definition: a promise
set: Synset('word.n.07')	definition: a word is a string of bits stored in computer memory
set: Synset('son.n.02')	definition: the divine word of God; the second person in the Trinity (incarnate in Jesus)
set: Synset('password.n.01')	definition: a secret word or phrase known only to a restricted group
set: Synset('bible.n.01')	definition: the sacred writings of the Christian religions
set: Synset('give_voice.v.01')	definition: put into words or an expression


## Hypernyms and Hyponyms


So, the `synsets` show groups of words with approximately similar meanings. In this manner, these words are related to one another (through shared and similar senses). Another way words can relate to one another is through hierarchical categorization. For example, some words are larger categories for other words: the word `vehicle` is a larger category which can include `car`, `ambulance`, `truck`, and so on, while `car` is itself a larger category which can contain terms like `sedan`, `hatchback`, and so on.

These sorts of categorical relationships are known as **hypernymy** and **hyponymy**. WordNet includes this information for all of the words in WordNet, allowing you to traverse higher-level and lower-level categories associated with particular words.

The two key terms are **hypernym** and **hyponym**.

> HYPER means *above*, and HYPO means *below*.

Whether any one word is a hypernym or hyponym is always relative and depends on the word you are starting from. Using the example above of vehicles, we could say that `vehicle` is a hypernym of `car`, while `car` is a hypernym of `hatchback`. Conversely, `car` is a hyponym of `vehicle`. So, again, the designation of hypernym vs. hyponym comes down to which word one starts with, and which direction through the categories one wishes to traverse.

We can access the hypernyms or hyponyms of words using the `.hypernyms()` and `.hyponyms()` methods of the `synset` objects.

In [37]:
# which categories are smaller (more specific) than "dog"?
# inspect the results - are these all more specific types of dogs?
dog01.hyponyms()

[Synset('basenji.n.01'),
 Synset('corgi.n.01'),
 Synset('cur.n.01'),
 Synset('dalmatian.n.02'),
 Synset('great_pyrenees.n.01'),
 Synset('griffon.n.02'),
 Synset('hunting_dog.n.01'),
 Synset('lapdog.n.01'),
 Synset('leonberg.n.01'),
 Synset('mexican_hairless.n.01'),
 Synset('newfoundland.n.01'),
 Synset('pooch.n.01'),
 Synset('poodle.n.01'),
 Synset('pug.n.01'),
 Synset('puppy.n.01'),
 Synset('spitz.n.01'),
 Synset('toy_dog.n.01'),
 Synset('working_dog.n.01')]

In [38]:
# which categories are larger (less specific) than "dog"?
# inspect the results - are these categories larger than dog? Would they have additional members besides dog?
dog01.hypernyms()

[Synset('canine.n.02'), Synset('domestic_animal.n.01')]

Again, you can think of the hypernyms and hyponyms as different levels up or down a hierarchy. And, because we can quantify the number of levels or steps between two words/senses, we can calculate an approximation of the distance and/or similarity among concepts associated with the words. For example, what are the hypernyms of `dog` and `wolf`?



In [40]:
# compare hypernyms of dog and wolf
print(f'Dog hypernyms: {wn.synset("dog.n.01").hypernyms()}\n\nWolf hypernyms: {wn.synset("wolf.n.01").hypernyms()}')

Dog hypernyms: [Synset('canine.n.02'), Synset('domestic_animal.n.01')]

Wolf hypernyms: [Synset('canine.n.02')]


Ah ha! We can see that `dog` and `wolf` both include `canine.n.02` as a hypernym. Inspecting that synset gives us the definition:

In [41]:
wn.synset('canine.n.02').definition()

'any of various fissiped mammals with nonretractile claws and typically long muzzles'

And running `.hyponyms()` on that synset naturally includes `dog` and `wolf`

In [42]:
# that first one is a doozy - run the `.definition()` to find out it refers specifically to female dogs
wn.synset('canine.n.02').hyponyms()

[Synset('bitch.n.04'),
 Synset('dog.n.01'),
 Synset('fox.n.01'),
 Synset('hyena.n.01'),
 Synset('jackal.n.01'),
 Synset('wild_dog.n.01'),
 Synset('wolf.n.01')]

In [44]:
wn.synset('bitch.n.04').lemma_names()

['bitch']

So, we can see that `dog` and `wolf` are immediate members of the same hypernym (i.e., `canine`), which suggests these words are conceptually and semantically related. At the same time, there are seven total immediate members of the category `canine`, meaning that these seven terms/words/concepts/senses are relatively similar (because they are at an equal level under a hypernym).

What do you think? We know that a fox, a wolf, and a dog are clearly different. But we know they are *similar*, and this categorization captures some approximation of that similarity.

Do you think this categorization reflects the way that humans categorize objects/concepts through language?

### Calculating specificity through hypernym/hyponym pathways

Based on where a word is relative to other words in the WordNet hierarchy, that word may be associated with a very specific or very general concept. And, there may be different pathways to different higher-level concepts, depending on how similar the word is to other words. No matter what word you start with, there are higher level nodes that represent the "end" of many concepts.

Several people have mapped this relationship, for example [this paper](https://www.semanticscholar.org/paper/Inductive-learning-of-lexical-semantics-with-typed-Kazakov-Dobnik/e46c008c3b83b9a0155d7f8a5319a1208d8922ae) provided this figure:

<img src = https://i.imgur.com/eKr5Zfh.jpg width = "800" height = "350">

You can see on this graphic the highest level node linking all of the lower-level nodes is `entity`. There are no hypernyms for `entity`, only hyponyms. Inspecting the definition of entity reveals a very broad and general category, making it a good fit for one of the highest-level concepts.

In [48]:
# in a notebook only the last expression is displayed, so the definition is not shown below
wn.synset('entity.n.01').definition()
wn.synset('entity.n.01').hyponyms()

[Synset('abstraction.n.06'),
 Synset('physical_entity.n.01'),
 Synset('thing.n.08')]

We can explore relative distances among words/concepts in WordNet using these paths.

Specifically, the `.hypernym_paths()` method lets you explore these paths. For `dog`, because there are two *immediate* hypernyms (`domestic_animal` or `canine`), there are two possible routes or pathways to a higher-level category (both eventually end at `entity`). Depending on which of these two paths you take, different higher-level categories will be traversed. Remember that `wolf` only had one hypernym, and thus has only one immediate route to a higher level category.

Examine the two hypernym paths below; you should see how the categories become less specific as one goes up the list. Also note that the end hypernym is the same for both paths.

When starting at the bottom, you should be able to use the `"is a"` method for moving up the list. A dog is a canine, a canine is a carnivore, etc...

As we will see later on, the number of steps along these paths between two synsets is used to approximate their similarity - neat!

In [49]:
# save the hypernym paths of dog.n.01
dog_paths = dog01.hypernym_paths()

# how many paths are there?
# (read the paths from the bottom to the top, noting there are two separate entries for dog)
len(dog_paths)

dog_paths

[[Synset('entity.n.01'),
  Synset('physical_entity.n.01'),
  Synset('object.n.01'),
  Synset('whole.n.02'),
  Synset('living_thing.n.01'),
  Synset('organism.n.01'),
  Synset('animal.n.01'),
  Synset('chordate.n.01'),
  Synset('vertebrate.n.01'),
  Synset('mammal.n.01'),
  Synset('placental.n.01'),
  Synset('carnivore.n.01'),
  Synset('canine.n.02'),
  Synset('dog.n.01')],
 [Synset('entity.n.01'),
  Synset('physical_entity.n.01'),
  Synset('object.n.01'),
  Synset('whole.n.02'),
  Synset('living_thing.n.01'),
  Synset('organism.n.01'),
  Synset('animal.n.01'),
  Synset('domestic_animal.n.01'),
  Synset('dog.n.01')]]

In [50]:
# look at the first path (read from the bottom up)
# a dog is a canine is a carnivore is a placental...(and so on)
dog_paths[0]

[Synset('entity.n.01'),
 Synset('physical_entity.n.01'),
 Synset('object.n.01'),
 Synset('whole.n.02'),
 Synset('living_thing.n.01'),
 Synset('organism.n.01'),
 Synset('animal.n.01'),
 Synset('chordate.n.01'),
 Synset('vertebrate.n.01'),
 Synset('mammal.n.01'),
 Synset('placental.n.01'),
 Synset('carnivore.n.01'),
 Synset('canine.n.02'),
 Synset('dog.n.01')]

In [51]:
# look at the second path (read from the bottom up)
# a dog is a domestic animal is an animal is an organism...(and so on)
dog_paths[1]

[Synset('entity.n.01'),
 Synset('physical_entity.n.01'),
 Synset('object.n.01'),
 Synset('whole.n.02'),
 Synset('living_thing.n.01'),
 Synset('organism.n.01'),
 Synset('animal.n.01'),
 Synset('domestic_animal.n.01'),
 Synset('dog.n.01')]

## **Semantic Similarity**

There are a number of computational ways to calculate semantic similarity, and they usually rely on capturing some sort of mathematical "distance" between words.

Semantic spaces built from word vectors using Latent Semantic Analysis (LSA) or word2vec can do this too, but WordNet demonstrates this idea using distance in terms of categories, which is related more to human conceptualisations of word meaning, rather than word distributions.

In all cases, greater distance arguably means that something is less similar when compared to shorter distance.

In WordNet, the number of steps between two synsets along hypernym/hyponym links is taken as a measure of similarity. So, the hypernym/hyponym relationship is used to calculate this measure of similarity.

In the example from the book, several different synsets are saved as variables, some of which logically are more related than others.

The authors then work out the similarity among these saved synsets.

I have always found it confusing the authors used "right" as a variable name without explaining it, but it refers to a type of whale: [a right whale](https://en.wikipedia.org/wiki/Right_whale).

I added the `.definition()` method for each version below so you can get an idea as to why the authors are using these examples for the test.

In [52]:
orca = wn.synset('orca.n.01')
orca.definition()

'predatory black-and-white toothed whale with large dorsal fin; common in cold seas'

In [53]:
minke = wn.synset('minke_whale.n.01')
minke.definition()

'small finback of coastal waters of Atlantic and Pacific'

In [54]:
right = wn.synset('right_whale.n.01')
right.definition()

"large Arctic whalebone whale; allegedly the `right' whale to hunt because of its valuable whalebone and oil"

In [55]:
tortoise = wn.synset('tortoise.n.01')
tortoise.definition()

'usually herbivorous land turtles having clawed elephant-like limbs; worldwide in arid area except Australia and Antarctica'

In [56]:
novel = wn.synset('novel.n.01')
novel.definition()

'an extended fictional work in prose; usually in the form of a story'

The `lowest_common_hypernyms()` method will find the closest hypernym which two synsets share. Remember, hypernyms are the higher-level categories. So if two words share the next highest-level category (such as wolf and dog), they will be more similar than words which only share the highest-level category, such as `entity`.

If a hypernym is "lower" that means it is more specific because it is lower in the overall categorisation schema - so two words linked by a more specific hypernym are more similar than if they are linked by an abstract hypernym (such as entity). You can see the NLTK documentation [here](https://www.nltk.org/api/nltk.corpus.reader.wordnet.html?highlight=lowest_common_hypernyms#nltk.corpus.reader.wordnet.Synset.lowest_common_hypernyms).

In [57]:
# what is the lowest common hypernym of right whales and minke whales?
right.lowest_common_hypernyms(minke)

[Synset('baleen_whale.n.01')]

In [58]:
# lowest common hypernym of right whale and orcas?
right.lowest_common_hypernyms(orca)

[Synset('whale.n.02')]

So which pair is more similar? Right Whales and Minke Whales are linked by Baleen Whale. Right Whales and Orcas are linked by Whale. Since a Baleen Whale is more specific than the term Whale, Right Whales and Minke Whales are more similar than Right Whales and Orcas. At least, that is the logic of these relationships.

Continue the exercise to compare right whales to tortoises and novels.

In [59]:
# vertebrate is less specific when compared to 'whale'
# so they are less similar than the comparisons done above.
right.lowest_common_hypernyms(tortoise)

[Synset('vertebrate.n.01')]

In [60]:
# entity is the least specific category possible,
# so right whales and novels have almost no semantic similarity.
right.lowest_common_hypernyms(novel)

[Synset('entity.n.01')]

Now that you conceptually understand the way NLTK defines these measures of similarity and difference, you can use similarity methods such as `.path_similarity()` to see the numeric measure NLTK will return.

Remember that `right` refers to a type of whale.

In [61]:
# finds the shortest path which connects two things - higher number = more similar
right.path_similarity(minke)

0.25

In [62]:
# less similar..
right.path_similarity(orca)

0.16666666666666666

In [63]:
# even less similar...
right.path_similarity(tortoise)

0.07692307692307693

In [64]:
# even more less similar!...
right.path_similarity(novel)

0.043478260869565216

Let's look at some other examples. Try it out for yourself, what other examples do you want to look at?

In [65]:
# compare dog and cat
wn.synset('dog.n.01').path_similarity(wn.synset('cat.n.01'))

0.2

Hopefully you're at the point where you think writing a function is a better way to explore new resources. I've written a helper function below to make it faster for us to compare similarities between inputs.

In [66]:
# why not write a short helper function?

def compare_similarity(base, comparison):
  """compares two WordNet synsets"""
  print(f'{base} & {comparison}: {base.path_similarity(comparison)}')


In [67]:
# define a baseline for comparison
base = wn.synset('beer.n.01')

# you can add any number of synsets here
comparisons = [wn.synset('wine.n.01'), wn.synset('water.n.01'), wn.synset('bread.n.01'),
               wn.synset('alcohol.n.01'), wn.synset('whiskey.n.01'), wn.synset('juice.n.01'),
               wn.synset('vegetable.n.01')]

# loop through comparisons and compare all to base.
for compare in comparisons:
  compare_similarity(base, compare)

Synset('beer.n.01') & Synset('wine.n.01'): 0.25
Synset('beer.n.01') & Synset('water.n.01'): 0.1
Synset('beer.n.01') & Synset('bread.n.01'): 0.125
Synset('beer.n.01') & Synset('alcohol.n.01'): 0.3333333333333333
Synset('beer.n.01') & Synset('whiskey.n.01'): 0.2
Synset('beer.n.01') & Synset('juice.n.01'): 0.14285714285714285
Synset('beer.n.01') & Synset('vegetable.n.01'): 0.09090909090909091


## **Meronyms and Holonyms**

In addition to larger/smaller categories, words can also be related in terms of being parts of a whole or a whole comprised of parts. Wikipedia has a pretty good explanation of this [here](https://en.wikipedia.org/wiki/Meronymy_and_holonymy).

For example, we can see the words which are all thought to have meanings that are related to different components of what a tree is.

In [68]:
# what are smaller ideas contained within the larger idea of tree?
wn.synset('tree.n.01').part_meronyms()

[Synset('burl.n.02'),
 Synset('crown.n.07'),
 Synset('limb.n.02'),
 Synset('stump.n.01'),
 Synset('trunk.n.01')]

In [69]:
# this is kinda weird, what's going on here?
# what are the smaller "ideas" or "concepts" contained within "dog"?
wn.synset('dog.n.01').part_meronyms()

[Synset('flag.n.07')]

In [70]:
# oh...weird?
wn.synset('flag.n.07').definition()

'a conspicuously marked or shaped tail'

In [71]:
# how about the word planet? what might we find?
wn.synsets('planet')

[Synset('planet.n.01'), Synset('satellite.n.02'), Synset('planet.n.03')]

In [72]:
# is this part of a planet?
wn.synset('planet.n.01').part_meronyms()

[Synset('biosphere.n.01')]

WordNet also has a way to find the "substance" of something, which is slightly different than the component parts.

In [73]:
wn.synset('tree.n.01').substance_meronyms()

[Synset('heartwood.n.01'), Synset('sapwood.n.01')]

In [74]:
# I think the water example makes it a bit easier to see what substance/part is doing.
wn.synset('water.n.01').substance_meronyms()

[Synset('hydrogen.n.01'), Synset('oxygen.n.01')]

We can then use the `member_holonyms()` and `substance_holonyms()` methods to go the other direction.

In [75]:
# trees are members of forests
wn.synset('tree.n.01').member_holonyms()

[Synset('forest.n.01')]

In [76]:
# water is the substance for plenty of things.
wn.synset('water.n.01').substance_holonyms()

[Synset('body_of_water.n.01'),
 Synset('ice.n.01'),
 Synset('ice_crystal.n.01'),
 Synset('perspiration.n.01'),
 Synset('snowflake.n.01'),
 Synset('tear.n.01')]
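
Since each part/whole relation has its own method, it can be handy to query them all at once. The sketch below is only an illustration: `part_whole_relations` and `PART_WHOLE_METHODS` are names made up for this notebook (they are not part of NLTK), and the function assumes it is handed an NLTK `Synset` such as `wn.synset('tree.n.01')`.

```python
# The six part/whole methods available on an NLTK WordNet Synset.
PART_WHOLE_METHODS = (
    "part_meronyms", "substance_meronyms", "member_meronyms",
    "part_holonyms", "substance_holonyms", "member_holonyms",
)


def part_whole_relations(synset):
    """Return {method name: [synset names]} for each non-empty part/whole relation.

    Hypothetical helper: `synset` is assumed to be an NLTK Synset.
    """
    found = {}
    for method_name in PART_WHOLE_METHODS:
        related = getattr(synset, method_name)()
        if related:  # skip relations WordNet has no entries for
            found[method_name] = [s.name() for s in related]
    return found
```

With `wn` imported as at the top of this notebook, `part_whole_relations(wn.synset('tree.n.01'))` should gather the part meronyms, substance meronyms, and member holonyms from the cells above into a single dictionary.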

The NLTK book makes the point that these different relationships demonstrate the intertwined nature of words and concepts (i.e., like a net!). The book demonstrates this first by looking at the synsets for the word "mint", and then how those synsets are further related to each other through associations such as meronyms etc.

In [77]:
# note that you can specify a string and a POS for WordNet to search.
for synset in wn.synsets('mint', pos='n'):
    print(synset.name() + ':', synset.definition())

batch.n.02: (often followed by `of') a large number or amount or extent
mint.n.02: any north temperate plant of the genus Mentha with aromatic leaves and small mauve flowers
mint.n.03: any member of the mint family of plants
mint.n.04: the leaves of a mint plant used fresh or candied
mint.n.05: a candy that is flavored with a mint oil
mint.n.06: a plant where money is coined by authority of the government


The point the NLTK book is making here is how intertwined synsets are: different senses of `mint` are themselves connected to each other through these part:whole relationships.

In [78]:
wn.synset('mint.n.04').part_holonyms()

[Synset('mint.n.02')]

In [79]:
wn.synset('mint.n.04').substance_holonyms()

[Synset('mint.n.05')]
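
Rather than checking each sense by hand, we could look for every holonym link that stays inside one word's set of senses. This is a rough sketch under a couple of assumptions: `links_between_senses` is a hypothetical helper name, and its argument is a list of NLTK `Synset` objects (which are hashable, so they can go in a set).

```python
def links_between_senses(synsets):
    """Yield (part, relation, whole) name triples where both ends are in `synsets`.

    Hypothetical helper: `synsets` is assumed to be a list of NLTK Synsets,
    e.g. the result of wn.synsets('mint', pos='n').
    """
    wanted = set(synsets)
    for synset in synsets:
        for relation in ("part_holonyms", "substance_holonyms", "member_holonyms"):
            for whole in getattr(synset, relation)():
                if whole in wanted:  # keep only links to another sense of the same word
                    yield synset.name(), relation, whole.name()
```

Running `list(links_between_senses(wn.synsets('mint', pos='n')))` should include the two links found above: `mint.n.04` is part of `mint.n.02` and is the substance of `mint.n.05`.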

There are quite a few other things you can do with WordNet that are not explained in the NLTK book - look [here](https://www.nltk.org/howto/wordnet.html).

Although there is a range of other methods, not all words / synsets have the same info available, and it can be a pain figuring this out. For example, below I use the `derivationally_related_forms()` method (which is not explained in the book), but this method has to be run on a lemma, not a synset.

In [80]:
# find all the different ways the word "happy" can be derived
wn.synset('happy.a.01').lemmas()[0].derivationally_related_forms()

[Lemma('happiness.n.01.happiness'), Lemma('happiness.n.02.happiness')]
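
Because the method lives on lemmas, checking a whole synset means looping over its lemmas. The sketch below does that and groups the results by part of speech. `derived_forms_by_pos` is a name invented for this example, and it assumes an NLTK `Synset` as input.

```python
from collections import defaultdict


def derived_forms_by_pos(synset):
    """Map part-of-speech tag -> sorted lemma names derivationally related to `synset`.

    Hypothetical helper: loops over every lemma of an NLTK Synset, because
    derivationally_related_forms() is a Lemma method rather than a Synset method.
    """
    forms = defaultdict(set)
    for lemma in synset.lemmas():
        for related in lemma.derivationally_related_forms():
            # each related Lemma knows which synset (and so which POS) it belongs to
            forms[related.synset().pos()].add(related.name())
    return {pos: sorted(names) for pos, names in forms.items()}
```

Going by the output above, `derived_forms_by_pos(wn.synset('happy.a.01'))` should list `'happiness'` under the noun tag `'n'`.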

There is also a useful method, `.antonyms()`, which returns lemmas with opposite meanings (like `derivationally_related_forms()`, it is run on a lemma, not a synset). You likely noticed the synsets here have `.a.` as the part of speech, which means they are adjectives. In English, adjectives can be more productively derived and tend to have more straightforward opposites, so it makes sense that these methods are most fruitful for adjectives, although they are not limited to them (the NLTK book gives noun and verb examples such as supply/demand and rush/linger).

In [81]:
# if you're not happy, you're...
wn.synset('happy.a.01').lemmas()[0].antonyms()

[Lemma('unhappy.a.01.unhappy')]

In [82]:
# why is it difficult for nouns to have opposites compared to adjectives?
# there is no "opposite" for dog, right? (it would be something like "undog")
wn.synset('dog.n.01').lemmas()[0].antonyms()

[]
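
To see which senses of a word have any opposites at all, we can collect antonyms across every lemma of every synset. This is a minimal sketch: `antonym_pairs` is a made-up helper name, and it assumes a list of NLTK `Synset` objects as input.

```python
def antonym_pairs(synsets):
    """Return sorted (lemma, antonym) pairs as 'synset_name.lemma_name' strings.

    Hypothetical helper: `synsets` is assumed to be a list of NLTK Synsets,
    e.g. wn.synsets('happy'). Synsets without antonyms contribute nothing.
    """
    pairs = set()
    for synset in synsets:
        for lemma in synset.lemmas():
            for antonym in lemma.antonyms():
                source = f"{synset.name()}.{lemma.name()}"
                target = f"{antonym.synset().name()}.{antonym.name()}"
                pairs.add((source, target))
    return sorted(pairs)
```

Based on the two cells above, `antonym_pairs(wn.synsets('happy'))` should include the happy/unhappy pair, and `antonym_pairs([wn.synset('dog.n.01')])` should contain no pair for the first lemma of that synset.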

## **Verb Entailment**

Another interesting concept to explore in WordNet is **verb entailment**. Roughly speaking, one verb entails another when doing the first necessarily involves the second, sort of like a cause-and-effect relationship. For example, being punched also means being hurt, swimming also means becoming wet, and so on. The example from the book is that walking entails stepping.

I've found that most of the verbs I think up have no entailments listed!

But you should play around and see what you can find :)


In [83]:
# What other entailments can you think of and test?
wn.synset('practice.v.01').entailments()

[Synset('work.v.02')]
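
Since so many verbs come back empty, it can save time to check all the verb senses of a word at once and keep only the hits. A small sketch, assuming a list of NLTK verb `Synset` objects as input (`entailments_for` is a hypothetical helper name, not an NLTK function):

```python
def entailments_for(synsets):
    """Map each synset name to the names of the synsets it entails.

    Hypothetical helper: `synsets` is assumed to be a list of NLTK Synsets,
    e.g. wn.synsets('walk', pos='v'). Synsets with no entailments are left out.
    """
    result = {}
    for synset in synsets:
        entailed = synset.entailments()
        if entailed:  # most verbs return an empty list here
            result[synset.name()] = [e.name() for e in entailed]
    return result
```

For example, `entailments_for(wn.synsets('walk', pos='v'))` should surface the book's walking-entails-stepping example, and you can swap in any other verb you want to test.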

# Conclusion

WordNet represents a really interesting way to start thinking about how we can computationally measure a word's "meaning." If you think about it, the `.definition()` method of WordNet is not very useful on its own as a computational measure (unless we can somehow parse the text of the definition!). By quantifying the space between words and concepts, WordNet provides an interesting and useful (if imperfect) measure of semantic distance and associations.

There are other "nets" included in NLTK, such as VerbNet, FrameNet, and the Multilingual WordNet. If you're keen, you can explore them in the NLTK docs:

- [VerbNet](https://www.nltk.org/api/nltk.corpus.reader.verbnet.html?highlight=verb%20net#module-nltk.corpus.reader.verbnet)
- [FrameNet](https://www.nltk.org/api/nltk.corpus.reader.framenet.html?highlight=frame%20net#module-nltk.corpus.reader.framenet)
- [Multilingual WordNet](https://github.com/globalwordnet/OMW) - ("omw" stands for Open Multilingual WordNet; you can access it through NLTK using the `omw-1.4` resource downloaded at the top of this notebook, since many WordNet methods accept a `lang` argument and `wn.langs()` lists the available language codes)
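
As a starting point for the Multilingual WordNet, the sketch below looks up the lemma names of one synset in several languages. It leans on the `lang` argument of the synset method `lemma_names()` and assumes the `omw-1.4` resource has been downloaded (as at the top of this notebook). `lemma_names_by_language` is a hypothetical helper name, and the language codes you pass in should be ones reported by `wn.langs()`.

```python
def lemma_names_by_language(synset, languages):
    """Map each language code to the synset's lemma names in that language.

    Hypothetical helper: `synset` is assumed to be an NLTK Synset and
    `languages` a list of codes such as those returned by wn.langs().
    """
    return {lang: synset.lemma_names(lang=lang) for lang in languages}
```

You could try it with something like `lemma_names_by_language(wn.synset('dog.n.01'), ['eng', 'fra', 'spa'])` and compare how many lemmas each language has for the same concept.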