# WordNet

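WordNet is a lexical database of semantic relations between words. The original Princeton WordNet covers English; the Open Multilingual Wordnet (the `omw-1.4` package downloaded below) links wordnets for many other languages to it. Words with similar meanings are grouped into *synsets*, each of which provides a definition, usage examples, and lemmas.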

NLTK offers a Python interface to WordNet, which we can use after downloading the data and importing the corpus reader.

In [74]:
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('book')

from nltk.corpus import wordnet as wn

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading collection 'book'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package brown to /root/nltk_data...
[nltk_data]    |   Package brown is already up-to-date!
[nltk_data]    | Downloading package chat80 to /root/nltk_data...
[nltk_data]    |   Package chat80 is already up-to-date!
[nltk_data]    | Downloading package cmudict to /root/nltk_data...
[nltk_data]    |   Package cmudict is already up-to-date!
[nltk_data]    | Downloading package conll2000 to /root/nltk_data...
[nltk_data]    |   Package conll2000 is already up-to-date!
[nltk_data]    | Downloading package conll2002 to /root/nltk_data...
[nltk_data]    |   Package

## Nouns

As an example, I select the noun 'friend' and output all of its synsets. Each synset in the returned list corresponds to one sense of the word.

In [75]:
wn.synsets('friend')

[Synset('friend.n.01'),
 Synset('ally.n.02'),
 Synset('acquaintance.n.03'),
 Synset('supporter.n.01'),
 Synset('friend.n.05')]

Let's extract its definition, usage examples, and lemmas. 

In [76]:
#Definition
wn.synset('friend.n.01').definition()

'a person you know well and regard with affection and trust'

In [77]:
#Usage examples
wn.synset('friend.n.01').examples()

['he was my best friend at the university']

In [78]:
#Lemmas
wn.synset('friend.n.01').lemmas()

[Lemma('friend.n.01.friend')]

We can also find the hypernyms of our word. Here, I walk up the hierarchy from my chosen word, following the first hypernym at each level until I reach the top.

In [79]:
#Traversing the hierarchy for a noun
hyper = wn.synset('friend.n.01').hypernyms()[0]
top = wn.synset('entity.n.01')

while hyper:
  print(hyper)
  if hyper == top:
    break
  #Follow the first hypernym; stop if there is none so the loop cannot run forever
  parents = hyper.hypernyms()
  if not parents:
    break
  hyper = parents[0]

Synset('person.n.01')
Synset('causal_agent.n.01')
Synset('physical_entity.n.01')
Synset('entity.n.01')


You can see how the hierarchy is structured so that each noun is a subclass of its parent. This pattern continues until we reach 'entity', the root synset that encompasses all nouns.
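The same walk can be sketched without NLTK. This is a minimal illustration, assuming a hypothetical `toy_parent` dictionary that copies only the chain printed above. `hypernym_chain` is an illustrative helper, not an NLTK function, and it follows a single parent per node, just as the loop above follows only the first hypernym.

```python
# Toy parent map copied from the chain printed above
# (hypothetical; it is not the full WordNet graph)
toy_parent = {
    "friend.n.01": "person.n.01",
    "person.n.01": "causal_agent.n.01",
    "causal_agent.n.01": "physical_entity.n.01",
    "physical_entity.n.01": "entity.n.01",
}

def hypernym_chain(name, parent):
    """Return the ancestors of `name`, nearest first, stopping at a node with no parent."""
    chain = []
    while name in parent:
        name = parent[name]
        chain.append(name)
    return chain

chain = hypernym_chain("friend.n.01", toy_parent)
print(chain)
```

Because the loop ends as soon as a node has no parent, it terminates for any chain, whether or not it ends at 'entity'.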

Below, I query several lexical relations for the same synset. Note that antonyms are defined on lemmas rather than on synsets. Where a relation has no entries, an empty list is returned.

In [80]:
#Hypernyms
print('Hypernyms:', wn.synset('friend.n.01').hypernyms())

#Hyponyms
print('Hyponyms:', wn.synset('friend.n.01').hyponyms())

#Meronyms
print('Meronyms:', wn.synset('friend.n.01').part_meronyms(), wn.synset('friend.n.01').substance_meronyms())

#Holonyms
print('Holonyms:', wn.synset('friend.n.01').member_holonyms())

#Antonyms
print('Antonyms:', wn.synset('friend.n.01').lemmas()[0].antonyms())



Hypernyms: [Synset('person.n.01')]
Hyponyms: [Synset('alter_ego.n.01'), Synset('amigo.n.01'), Synset('best_friend.n.01'), Synset('brother.n.04'), Synset('buddy.n.01'), Synset('companion.n.01'), Synset('confidant.n.01'), Synset('flatmate.n.01'), Synset('girlfriend.n.01'), Synset('light.n.08'), Synset('mate.n.08'), Synset('roommate.n.01'), Synset('schoolfriend.n.01')]
Meronyms: [] []
Holonyms: []
Antonyms: []


## Verbs

Now let's see what we can do with verbs. Below, I output a chosen word's synsets.

In [81]:
wn.synsets('punch')

[Synset('punch.n.01'),
 Synset('punch.n.02'),
 Synset('punch.n.03'),
 Synset('punch.v.01'),
 Synset('punch.v.02'),
 Synset('punch.v.03')]

Now, I select one of the verb synsets of the word and extract its definition, usage examples, and lemmas.

In [82]:
#Definition
wn.synset('punch.v.01').definition()

'deliver a quick blow to'

In [83]:
#Usage examples
wn.synset('punch.v.01').examples()

['he punched me in the stomach']

In [84]:
#Lemmas
wn.synset('punch.v.01').lemmas()

[Lemma('punch.v.01.punch'), Lemma('punch.v.01.plug')]

Here, I traverse up the hierarchy of my chosen word, outputting all synsets as I go.

In [85]:
punch = wn.synset('punch.v.01')
hyper = lambda s: s.hypernyms()
list(punch.closure(hyper))

[Synset('hit.v.03'), Synset('touch.v.01')]

Notice how verbs differ from nouns: there is no single hypernym shared by all verbs, so different verbs end their hierarchies at different top-level synsets.

## Morphy

The morphy() function returns the base form of a given word. Given an inflected word (and optionally its part of speech), morphy() applies English suffix rules and exception lists to find the root word, and returns None when no base form exists for the requested part of speech.


In [86]:
print(wn.morphy('punching'))
print(wn.morphy('punched', wn.ADV))
print(wn.morphy('punched', wn.VERB))

punch
None
punch
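To make the rule-based idea concrete, here is a minimal sketch of suffix detachment. It rests on assumptions: `VERB_RULES` is modeled on WordNet-style detachment rules rather than copied from NLTK, `TOY_VERBS` is a hypothetical mini-lexicon standing in for WordNet's verb list, and `toy_morphy` is an illustrative helper. The real morphy() also consults exception lists for irregular forms, which this sketch omits.

```python
# Suffix rules for verbs as (old_suffix, new_suffix) pairs.
# Assumption: modeled on WordNet-style detachment rules, not copied from NLTK.
VERB_RULES = [("s", ""), ("ies", "y"), ("es", "e"), ("es", ""),
              ("ed", "e"), ("ed", ""), ("ing", "e"), ("ing", "")]

# Hypothetical mini-lexicon standing in for WordNet's list of known verbs
TOY_VERBS = {"punch", "bake", "carry"}

def toy_morphy(word, lexicon=TOY_VERBS, rules=VERB_RULES):
    """Return the first candidate base form found in the lexicon, or None."""
    if word in lexicon:
        return word
    for old, new in rules:
        if word.endswith(old):
            # Strip the old suffix and attach the replacement
            candidate = word[: len(word) - len(old)] + new
            if candidate in lexicon:
                return candidate
    return None

for w in ["punching", "baked", "carries", "zzzed"]:
    print(w, "->", toy_morphy(w))
```

A candidate is accepted only if it appears in the lexicon, which is why an unknown form such as 'zzzed' yields None instead of a guessed stem.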


## Wu-Palmer Similarity Metric and the Lesk Algorithm

A similarity metric estimates how closely related two word senses are. Scores range from 0 (little similarity) to 1 (identical).

The Wu-Palmer similarity metric is based on the depths of the two synsets in the hierarchy and the depth of their most specific common ancestor node.

Below, I choose two words I believe may be similar to some degree to demonstrate this function.

In [87]:
wn.wup_similarity(wn.synset('dog.n.01'), wn.synset('cat.n.01'))

0.8571428571428571
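The Wu-Palmer score is commonly written as 2 * depth(lcs) / (depth(s1) + depth(s2)), where lcs is the most specific common ancestor. The sketch below applies that formula to a hypothetical toy taxonomy, so its numbers will not match WordNet's. `toy_taxonomy`, `path_to_root`, `depth`, and `toy_wup` are illustrative names, and depth is counted in nodes so that the root has depth 1.

```python
# Hypothetical toy taxonomy: child -> parent (the root has no entry)
toy_taxonomy = {
    "animal": "entity",
    "carnivore": "animal",
    "canine": "carnivore",
    "feline": "carnivore",
    "dog": "canine",
    "cat": "feline",
}

def path_to_root(node, parent):
    """Return [node, ..., root]."""
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path

def depth(node, parent):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(node, parent))

def toy_wup(a, b, parent):
    """Wu-Palmer score: 2 * depth(lcs) / (depth(a) + depth(b))."""
    ancestors_of_a = set(path_to_root(a, parent))
    # Walking up from b, the first node shared with a is the deepest common ancestor
    lcs = next(n for n in path_to_root(b, parent) if n in ancestors_of_a)
    return 2 * depth(lcs, parent) / (depth(a, parent) + depth(b, parent))

print(toy_wup("dog", "cat", toy_taxonomy))  # 2 * 3 / (5 + 5) = 0.6
print(toy_wup("dog", "dog", toy_taxonomy))  # identical nodes give 1.0
```

The deeper the shared ancestor sits relative to the two nodes, the closer the score gets to 1.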

The Lesk algorithm compares a context sentence with the definition of each synset of a given word and returns the synset whose definition shares the most words with the context. We can additionally provide a pos argument to restrict the candidate synsets.

In [88]:
from nltk.wsd import lesk

for ss in wn.synsets('hit'):
  print(ss, ss.definition())

Synset('hit.n.01') (baseball) a successful stroke in an athletic contest (especially in baseball)
Synset('hit.n.02') the act of contacting one thing with another
Synset('hit.n.03') a conspicuous success
Synset('collision.n.01') (physics) a brief event in which two or more bodies come together
Synset('hit.n.05') a dose of a narcotic drug
Synset('hit.n.06') a murder carried out by an underworld syndicate
Synset('hit.n.07') a connection made via the internet to another website
Synset('hit.v.01') cause to move by striking
Synset('hit.v.02') hit against; come into sudden contact with
Synset('hit.v.03') deal a blow to, either with the hand or with an instrument
Synset('reach.v.01') reach a destination, either real or abstract
Synset('hit.v.05') affect or afflict suddenly, usually adversely
Synset('shoot.v.01') hit with a missile from a weapon
Synset('stumble.v.03') encounter by chance
Synset('score.v.01') gain points in a game
Synset('hit.v.09') cause to experience suddenly
Synset('strike.v.

In [89]:
my_sentence = ['Let', 'me', 'take', 'a', 'hit', 'of', 'that', '.']

print(lesk(my_sentence, 'hit', 'n'))
print(lesk(my_sentence, 'hit'))

Synset('hit.n.05')
Synset('shoot.v.01')


Specifying the pos helps the algorithm pick the correct synset for the word as it is used in the context sentence. Notice how, without the target pos, the algorithm incorrectly returned a synset whose definition has nothing to do with my sentence.
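The overlap idea behind Lesk can be sketched in a few lines. This is a simplified illustration, not NLTK's implementation: `simplified_lesk` is a hypothetical helper, the sense labels in `toy_glosses` are made up, and the glosses are three of the 'hit' definitions listed above.

```python
def simplified_lesk(context_tokens, glosses):
    """Return (sense, overlap) for the gloss sharing the most words with the context."""
    context = {t.lower() for t in context_tokens}
    best_sense, best_overlap = None, -1
    for sense, gloss in glosses.items():
        # Overlap = number of distinct words shared by the context and the gloss
        overlap = len(context & set(gloss.lower().split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense, best_overlap

# Hypothetical sense labels paired with definitions taken from the listing above
toy_glosses = {
    "hit_drug": "a dose of a narcotic drug",
    "hit_baseball": "a successful stroke in an athletic contest",
    "hit_success": "a conspicuous success",
}

context = ["Let", "me", "take", "a", "hit", "of", "that", "."]
print(simplified_lesk(context, toy_glosses))  # ('hit_drug', 2)
```

Here the winning overlap consists only of the function words 'a' and 'of', which shows how fragile raw word overlap can be and why narrowing the candidates by part of speech helps.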

## SentiWordNet

SentiWordNet is designed for opinion mining: it assigns three sentiment scores to each synset (positivity, negativity, and objectivity). Each score lies between 0 and 1, and the three scores sum to 1.0.

Below, I select an emotionally charged word, and output the polarity scores for each of its synsets.

In [90]:

from nltk.corpus import sentiwordnet as swn

for ss in list(swn.senti_synsets('anticipation')):
  anti = ss
  print(anti)
  print('Positive score: ', anti.pos_score())
  print('Negative score: ', anti.neg_score())
  print('Objective score: ', anti.obj_score(), '\n')

<anticipation.n.01: PosScore=0.125 NegScore=0.25>
Positive score:  0.125
Negative score:  0.25
Objective score:  0.625 

<anticipation.n.02: PosScore=0.0 NegScore=0.0>
Positive score:  0.0
Negative score:  0.0
Objective score:  1.0 

<prediction.n.01: PosScore=0.0 NegScore=0.0>
Positive score:  0.0
Negative score:  0.0
Objective score:  1.0 

<anticipation.n.04: PosScore=0.5 NegScore=0.0>
Positive score:  0.5
Negative score:  0.0
Objective score:  0.5 
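Since the three scores sum to 1.0, the objective score is whatever remains after the positive and negative scores. The sketch below checks this against the values printed above. `objectivity` is an illustrative helper, not part of NLTK.

```python
# (synset name, PosScore, NegScore) copied from the output above
scores = [
    ("anticipation.n.01", 0.125, 0.25),
    ("anticipation.n.02", 0.0, 0.0),
    ("prediction.n.01", 0.0, 0.0),
    ("anticipation.n.04", 0.5, 0.0),
]

def objectivity(pos, neg):
    """Objective score as the remainder after the positive and negative scores."""
    return 1.0 - (pos + neg)

for name, pos, neg in scores:
    obj = objectivity(pos, neg)
    print(name, "objective:", obj, "total:", pos + neg + obj)
```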



Now, let's try extracting polarity scores for each word in a given sentence.

In [91]:
my_sentence = 'anticipate lots of homework this semester'
neg = 0
pos = 0
words = my_sentence.split()
print(words)
for word in words:
  syn_list = list(swn.senti_synsets(word))
  if syn_list:
    syn = syn_list[0]
    print(syn)
    print('Positive score: ', syn.pos_score())
    print('Negative score: ', syn.neg_score())
    print('Objective score: ', syn.obj_score(), '\n')


['anticipate', 'lots', 'of', 'homework', 'this', 'semester']
<expect.v.01: PosScore=0.25 NegScore=0.25>
Positive score:  0.25
Negative score:  0.25
Objective score:  0.5 

<tons.n.01: PosScore=0.0 NegScore=0.25>
Positive score:  0.0
Negative score:  0.25
Objective score:  0.75 

<homework.n.01: PosScore=0.0 NegScore=0.0>
Positive score:  0.0
Negative score:  0.0
Objective score:  1.0 

<semester.n.01: PosScore=0.0 NegScore=0.0>
Positive score:  0.0
Negative score:  0.0
Objective score:  1.0 



Notice that each content word returns the polarity scores of its first synset. The words 'of' and 'this' have no synsets, so senti_synsets() returns an empty list for them and they are skipped. Scores like these are useful for NLP applications that need to analyze the sentiment behind a sentence, for example to tell whether someone feels happy about something or finds it unfavorable.
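One simple way to turn per-word scores into a sentence-level judgment is to sum the positive and negative scores and compare the totals. The sketch below does this with the first-synset scores printed above. It is a crude heuristic under stated assumptions: it trusts the first synset of each word, ignores negation and word order, and the three labels are my own choice.

```python
# (first synset, PosScore, NegScore) copied from the output above
first_sense_scores = [
    ("expect.v.01", 0.25, 0.25),
    ("tons.n.01", 0.0, 0.25),
    ("homework.n.01", 0.0, 0.0),
    ("semester.n.01", 0.0, 0.0),
]

# Sum each polarity across all words that had a synset
pos_total = sum(pos for _, pos, _ in first_sense_scores)
neg_total = sum(neg for _, _, neg in first_sense_scores)

if pos_total > neg_total:
    label = "positive"
elif neg_total > pos_total:
    label = "negative"
else:
    label = "neutral"

print("Positive total:", pos_total)
print("Negative total:", neg_total)
print("Overall:", label)
```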

## Collocation
A collocation is a sequence of two or more words that occur together more often than chance would predict and whose combined meaning is more than the sum of its parts. For example, 'fast food' means more than the two words convey individually.

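I will use the Inaugural corpus in NLTK, which is available as text4 in nltk.book, so I will import it first.

In [92]:
#Get collocations
#text4 (the Inaugural corpus) is defined in nltk.book, not nltk.corpus
from nltk.book import text4
text4.collocations()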

United States; fellow citizens; years ago; four years; Federal
Government; General Government; American people; Vice President; God
bless; Chief Justice; one another; fellow Americans; Old World;
Almighty God; Fellow citizens; Chief Magistrate; every citizen; Indian
tribes; public debt; foreign nations


Let's select one of the collocations identified above and calculate mutual information.

Pointwise mutual information (PMI) is the log of a ratio of probabilities:

PMI(x, y) = log2( P(x,y) / [P(x) * P(y)] )




In [93]:
import math
text = ' '.join(text4.tokens)

vocab = len(set(text4))
fellow = text.count('fellow')/vocab
print('p(\'fellow\'): ', fellow)
citizens = text.count('citizens')/vocab
print('p(\'citizens\'):', citizens)
fellow_citizens = text.count('fellow citizens')/vocab
print('p(\'fellow citizens\'):', fellow_citizens)

pmi = math.log2(fellow_citizens / (fellow * citizens))

print('Mutual information score: ', pmi)

p('fellow'):  0.013665835411471322
p('citizens'): 0.026932668329177057
p('fellow citizens'): 0.006084788029925187
Mutual information score:  4.0472042737811735


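The MI score measures how much more often the two words occur together than they would if they were independent. A score near 0 means the words co-occur about as often as chance predicts; the positive score of about 4.05 here suggests that 'fellow' and 'citizens' attract each other and form a collocation. Note that the calculation above is a rough estimate: it divides by the vocabulary size rather than the total number of tokens, and text.count() counts substring matches, so a word such as 'fellowship' would also count toward 'fellow'.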
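For comparison, here is a minimal sketch of the same calculation done on whole tokens and adjacent token pairs. It rests on assumptions: `tokens` is a hypothetical toy text rather than the Inaugural corpus, `toy_pmi` is an illustrative helper, and every probability is estimated as a count divided by the number of tokens.

```python
import math
from collections import Counter

# Hypothetical toy text, already tokenized
tokens = ["fellow", "citizens", "of", "the", "world",
          "fellow", "citizens", "unite"]

n = len(tokens)
unigrams = Counter(tokens)                  # counts of whole tokens
bigrams = Counter(zip(tokens, tokens[1:]))  # counts of adjacent token pairs

def toy_pmi(x, y):
    """PMI in bits, estimating every probability as count / number of tokens."""
    p_x = unigrams[x] / n
    p_y = unigrams[y] / n
    p_xy = bigrams[(x, y)] / n
    return math.log2(p_xy / (p_x * p_y))

print(toy_pmi("fellow", "citizens"))  # log2(0.25 / (0.25 * 0.25)) = 2.0
```

Counting tokens instead of substrings keeps words like 'fellowship' out of the count for 'fellow'. The helper assumes the pair occurs at least once; a pair that never occurs would make the logarithm undefined.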