# WordNet

## Summary

WordNet is a free and publically downloadable corpus of English words. This lexical database contains labeled and grouped sets of synsets -- which are cognitive synonyms linked together by lexical relationships. Essentially, it's grouped together by how similar the words are based on their meanings (semantics).

In [3]:
!pip install nltk

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [90]:
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('sentiwordnet')
nltk.download('gutenberg')
nltk.download('genesis')
nltk.download('inaugural')
nltk.download('nps_chat')
nltk.download('webtext')
nltk.download('treebank')
nltk.download('stopwords')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package sentiwordnet to /root/nltk_data...
[nltk_data]   Package sentiwordnet is already up-to-date!
[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
[nltk_data] Downloading package genesis to /root/nltk_data...
[nltk_data]   Package genesis is already up-to-date!
[nltk_data] Downloading package inaugural to /root/nltk_data...
[nltk_data]   Package inaugural is already up-to-date!
[nltk_data] Downloading package nps_chat to /root/nltk_data...
[nltk_data]   Package nps_chat is already up-to-date!
[nltk_data] Downloading package webtext to /root/nltk_data...
[nltk_data]   Package webtext is already up-to-date!
[nltk_data] Downloading package treebank to /root/nltk_data.

True

In [8]:
from nltk.corpus import wordnet as wn

# Output all synsets of `leg`
wn.synsets('leg')

[Synset('leg.n.01'),
 Synset('leg.n.02'),
 Synset('leg.n.03'),
 Synset('branch.n.03'),
 Synset('leg.n.05'),
 Synset('peg.n.04'),
 Synset('leg.n.07'),
 Synset('leg.n.08'),
 Synset('stage.n.06')]

In [9]:
# Using the synset `n.n.06`, we will output its definition, usage examples, and lemmas
n = wn.synset('stage.n.06')
print('Synset: n.n.06')
print('Definition:', n.definition())
print('Usage Examples:', n.examples())
print('Lemmas:', n.lemmas())

Synset: n.n.06
Definition: a section or portion of a journey or course
Usage Examples: ['then we embarked on the second stage of our Caribbean cruise']
Lemmas: [Lemma('stage.n.06.stage'), Lemma('stage.n.06.leg')]


In [10]:
# Traverse up the WordNet and print out all synsets
list(n.closure(lambda x : x.hypernyms()))

[Synset('travel.n.01'),
 Synset('motion.n.06'),
 Synset('change.n.03'),
 Synset('action.n.01'),
 Synset('act.n.02'),
 Synset('event.n.01'),
 Synset('psychological_feature.n.01'),
 Synset('abstraction.n.06'),
 Synset('entity.n.01')]

The nouns are organized with more general concepts on top. For my example, notice how event appears above travel -- this is in relation to how "event" is significantly more broad than "travel".

In [11]:
def test_nym(func, syn):
    try:
        return func(syn)
    except:
        return []

# Print all: hypernyms, hyponyms, meronyms, holonyms, antonym
print('Hypernyms:', test_nym(lambda x : x.hypernyms(), n))
print('Hyponyms:', test_nym(lambda x : x.hyponyms(), n))
print('Meronyms:', test_nym(lambda x : x.part_meronyms(), n))
print('Holonyms:', test_nym(lambda x : x.part_holonyms(), n))
print('Antonyms:', test_nym(lambda x : x.antonyms(), n))

Hypernyms: [Synset('travel.n.01')]
Hyponyms: [Synset('fare-stage.n.01')]
Meronyms: []
Holonyms: [Synset('journey.n.01')]
Antonyms: []


In [12]:
wn.synsets('jump')

[Synset('jump.n.01'),
 Synset('leap.n.02'),
 Synset('jump.n.03'),
 Synset('startle.n.01'),
 Synset('jump.n.05'),
 Synset('jump.n.06'),
 Synset('jump.v.01'),
 Synset('startle.v.02'),
 Synset('jump.v.03'),
 Synset('jump.v.04'),
 Synset('leap_out.v.01'),
 Synset('jump.v.06'),
 Synset('rise.v.11'),
 Synset('jump.v.08'),
 Synset('derail.v.02'),
 Synset('chute.v.01'),
 Synset('jump.v.11'),
 Synset('jumpstart.v.01'),
 Synset('jump.v.13'),
 Synset('leap.v.02'),
 Synset('alternate.v.01')]

In [13]:
v = wn.synset('startle.v.02')
print('Synset: startle.v.02')
print('Definition:', v.definition())
print('Usage Examples:', v.examples())
print('Lemmas:', v.lemmas())

Synset: startle.v.02
Definition: move or jump suddenly, as if in surprise or alarm
Usage Examples: ['She startled when I walked into the room']
Lemmas: [Lemma('startle.v.02.startle'), Lemma('startle.v.02.jump'), Lemma('startle.v.02.start')]


In [14]:
# Traverse up the WordNet and print out all synsets
list(v.closure(lambda x : x.hypernyms()))

[Synset('move.v.03')]

Similar to the nouns, the verbs are also organized into a similar hierarchy. The more broad verbs are classified above in the hierarchy the more specific verbs.

In [32]:
print(wn.morphy('jumped'))
print(wn.morphy('jumps'))
print(wn.morphy('jump'))

# Interestingly, the morphy of jumping is just jumping not jump
print(wn.morphy('jumping'))

jump
jump
jump
jumping


In [50]:
a = 'sheer'
b = 'cut'
print(wn.synsets(a))
print(wn.synsets(b))

# Synsets of interest
a = wn.synset('swerve.v.01')
b = wn.synset('cut.v.01')
print(a)
print(b)

# Find the Wu-Palmer similarity
wn.wup_similarity(a, b)

[Synset('swerve.v.01'), Synset('sheer.v.02'), Synset('absolute.s.02'), Synset('plain.s.04'), Synset('bluff.s.01'), Synset('diaphanous.s.01'), Synset('sheer.r.01'), Synset('sheer.r.02')]
[Synset('cut.n.01'), Synset('cut.n.02'), Synset('cut.n.03'), Synset('cut.n.04'), Synset('cut.n.05'), Synset('cut.n.06'), Synset('stinger.n.02'), Synset('cut.n.08'), Synset('deletion.n.03'), Synset('cut.n.10'), Synset('cut.n.11'), Synset('snub.n.02'), Synset('baseball_swing.n.01'), Synset('cut.n.14'), Synset('cut.n.15'), Synset('cut.n.16'), Synset('cut.n.17'), Synset('cut.n.18'), Synset('cut.n.19'), Synset('cut.n.20'), Synset('cut.v.01'), Synset('reduce.v.01'), Synset('swerve.v.01'), Synset('cut.v.04'), Synset('cut.v.05'), Synset('cut.v.06'), Synset('cut.v.07'), Synset('cut.v.08'), Synset('write_out.v.02'), Synset('edit.v.03'), Synset('cut.v.11'), Synset('hack.v.02'), Synset('cut.v.13'), Synset('cut.v.14'), Synset('cut.v.15'), Synset('cut.v.16'), Synset('cut.v.17'), Synset('cut.v.18'), Synset('cut.v.19')

0.25

In [56]:
from nltk.wsd import lesk

# Use the Lesk Algorithm to find similarity
sent_a = 'The car swerved to avoid the incoming vehicles'.split(' ')
print(lesk(sent_a, 'swerve', 'v'))

sent_b = 'I used scissors to cut the sheet of paper'.split(' ')
print(lesk(sent_b, 'cut'))

Synset('swerve.v.01')
Synset('edit.v.03')


## Wu-Palmer Similarity Score

Based on the depth of their lowest common subsumer (LCS) node and their position in the WordNet hierarchy, two synsets are compared using the Wu-Palmer similarity score to determine how closely related they are.
The score is between 0 (no similarity) and 1. (identical meanings). Since I picked two words that are relatively close, it gave me a score of `0.25`.

## Lesk Algorithm

The Lesk method determines the relationship between two synsets based on the overlap of words in their **definitions**.
In my example, both approaches imply that "cut" and "swerve" are similar words but still have their own subtle differences in meaning. 

# SentiWordNet

SentiWordNet is a lexical resource that judges WordNet synsets to determine if they are positive, negative, or neutral sentiment by giving them polarity scores.
Applications for natural language processing could make advantage of sentiment analysis such as analyzing social media posts or determining news article bias.

In [66]:
from nltk.corpus import sentiwordnet as swn

w = swn.senti_synset('hideous.s.01')
print(w)
print("Positive score = ", w.pos_score())
print("Negative score = ", w.neg_score())
print("Objective score = ", w.obj_score())

<hideous.s.01: PosScore=0.0 NegScore=0.875>
Positive score =  0.0
Negative score =  0.875
Objective score =  0.125


In [72]:
sent = 'i absolutely loved the way he flipped his hair'
for s in sent.split(' '):
    sys = list(swn.senti_synsets(s))
    if len(sys) > 0:
        print(sys[0])

<iodine.n.01: PosScore=0.0 NegScore=0.0>
<absolutely.r.01: PosScore=0.5 NegScore=0.0>
<love.v.01: PosScore=0.5 NegScore=0.0>
<manner.n.01: PosScore=0.0 NegScore=0.0>
<helium.n.01: PosScore=0.0 NegScore=0.0>
<flip.v.01: PosScore=0.0 NegScore=0.0>
<hair.n.01: PosScore=0.0 NegScore=0.0>


These scores pretty much directly reflected what I intended for SentiWordNet to read. Knowing these scores can defintely help (as I mentioned above) with the classification of social media posts and news article analysis. More generally, it helps the app know the sentiment of the text. This can be used to show certain trends or the status quo feeling towards a certain idea, if we scrape web data.

# Collocations

A collocation is a group of words that regularly occur together in a corpus of language, implying a close relationship or frequent usage in groups (such as a word phrase).
The collocations module of NLTK uses statistical metrics like mutual information to locate and grade collocations in a corpus. 

In [87]:
import nltk
from nltk.book import *
text4

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


<Text: Inaugural Address Corpus>

In [91]:
text4.collocations()

United States; fellow citizens; years ago; four years; Federal
Government; General Government; American people; Vice President; God
bless; Chief Justice; one another; fellow Americans; Old World;
Almighty God; Fellow citizens; Chief Magistrate; every citizen; Indian
tribes; public debt; foreign nations


In [92]:
# Figure out Mutual information
text = ' '.join(text4.tokens)
text[:50]

'Fellow - Citizens of the Senate and of the House o'

In [94]:
import math
vocab = len(set(text4))
us = text.count('United States')/vocab
print("p(United States) = ",us )
u = text.count('United')/vocab
print("p(United) = ", u)
s = text.count('States')/vocab
print('p(States) = ', s)
pmi = math.log2(us / (u * s))
print('pmi = ', pmi)

p(United States) =  0.015860349127182045
p(United) =  0.0170573566084788
p(States) =  0.03301745635910224
pmi =  4.815657649820885


In [96]:
import math
vocab = len(set(text4))
us = text.count('every citizen')/vocab
print("p(every citizen) = ",us )
u = text.count('fellow')/vocab
print("p(every) = ", u)
s = text.count('citizen')/vocab
print('p(citizen) = ', s)
pmi = math.log2(us / (u * s))
print('pmi = ', pmi)

p(every citizen) =  0.0016957605985037406
p(every) =  0.013665835411471322
p(citizen) =  0.032618453865336655
pmi =  1.9275985490213754


Notice that `every citizen`'s multual information score is much lower than `United States`, which shows that the latter is much more likely to be a collocation. This makes sense as there are very little cases where `United` appears without `States`. However, the use of `every` and `citizen` are relatively more common alone.