Houston Holman
2/22/23

WordNet Showcase

WordNet is a large lexical database of the English language. Words are grouped into synsets, which are sets of near-synonyms that express a single concept, and a word belongs to one synset for each of its senses. WordNet is used in natural language processing to help programs reason about the meaning of words.

In [None]:
import nltk

In [None]:
nltk.download('all')

In [None]:
from nltk.corpus import wordnet as wn

Here are all the synsets for the word "dog"

In [None]:
wn.synsets('dog')

[Synset('dog.n.01'),
 Synset('frump.n.01'),
 Synset('dog.n.03'),
 Synset('cad.n.01'),
 Synset('frank.n.02'),
 Synset('pawl.n.01'),
 Synset('andiron.n.01'),
 Synset('chase.v.01')]

Here are the definition, usage examples, and lemmas of 'dog.n.03'

In [None]:
wn.synset('dog.n.03').definition()

'informal term for a man'

In [None]:
wn.synset('dog.n.03').examples()

['you lucky dog']

In [None]:
wn.synset('dog.n.03').lemmas()

[Lemma('dog.n.03.dog')]

Now let's traverse the hierarchy for 'dog.n.03'

In [None]:
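# Walk up the hierarchy, always following the first hypernym
hyp = wn.synset('dog.n.03').hypernyms()[0]
top = wn.synset('entity.n.01')
while hyp:
    print(hyp)
    # Stop at the root, or at any synset without hypernyms, to avoid looping forever
    if hyp == top or not hyp.hypernyms():
        break
    hyp = hyp.hypernyms()[0]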

Synset('chap.n.01')
Synset('male.n.02')
Synset('person.n.01')
Synset('causal_agent.n.01')
Synset('physical_entity.n.01')
Synset('entity.n.01')


WordNet's organization of nouns is interesting. In this example, I find it a bit strange that 'dog' does not lead directly into 'male' and instead first goes to 'chap'. I am also not sure what a 'causal_agent' is. Other than those oddities, the hierarchy makes sense.
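The traversal above follows only the first hypernym at each step. Here is a minimal sketch of that walk as a reusable function. The names `hypernym_chain` and `StubSynset` are hypothetical and introduced only for illustration; the stub mimics nothing but the `.hypernyms()` method so the example runs without WordNet data. A real synset such as wn.synset('dog.n.03') could be passed in instead.

```python
def hypernym_chain(synset):
    """Follow the first hypernym upward until a synset has none."""
    chain = []
    current = synset
    while True:
        parents = current.hypernyms()
        if not parents:
            break
        current = parents[0]
        chain.append(current)
    return chain


class StubSynset:
    """Hypothetical stand-in that exposes only .hypernyms()."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent

    def hypernyms(self):
        return [self.parent] if self.parent else []


# Rebuild the chain printed above, from the root down to 'dog.n.03'
names = ['entity.n.01', 'physical_entity.n.01', 'causal_agent.n.01',
         'person.n.01', 'male.n.02', 'chap.n.01', 'dog.n.03']
node = None
for name in names:
    node = StubSynset(name, node)

chain_names = [s.name for s in hypernym_chain(node)]
print(chain_names)
```

Some synsets have more than one hypernym. For those, NLTK's hypernym_paths() method returns every path to the root rather than just the first one.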

In [None]:
wn.synset('dog.n.03').hypernyms()

[Synset('chap.n.01')]

In [None]:
wn.synset('dog.n.03').hyponyms()

[]

In [None]:
wn.synset('dog.n.03').part_meronyms()

[]

In [None]:
wn.synset('dog.n.03').part_holonyms()

[]

In [None]:
for lemma in wn.synset('dog.n.03').lemmas():
  print(lemma.antonyms())

[]


Here are all the synsets for the word "pen"

In [None]:
wn.synsets('pen')

Synset('pen.n.01')

Here are the definition, usage examples, and lemmas of 'write.v.01', the verb synset that includes 'pen' as a lemma

In [None]:
wn.synset('write.v.01').definition()

'produce a literary work'

In [None]:
wn.synset('write.v.01').examples()

['She composed a poem', 'He wrote four novels']

In [None]:
wn.synset('write.v.01').lemmas()

[Lemma('write.v.01.write'),
 Lemma('write.v.01.compose'),
 Lemma('write.v.01.pen'),
 Lemma('write.v.01.indite')]

Now let's traverse the hierarchy for 'write.v.01'

In [None]:
# Walk up the hierarchy until a synset has no hypernyms
hyp = wn.synset('write.v.01').hypernyms()[0]
while hyp:
    print(hyp)
    if not hyp.hypernyms():
        break
    hyp = hyp.hypernyms()[0]

Synset('create_verbally.v.01')
Synset('make.v.03')


WordNet's hierarchy for this verb confuses me. It surprises me that it consists of only 3 levels ('write', 'create_verbally', 'make'), but I guess that makes sense since I am having a hard time imagining what would go above 'make'. I have no idea how 'write' becomes 'create_verbally', considering that there is nothing verbal about it. I would have assumed it would lead into 'create'.

Let's use morphy() to show that other forms of the word share the same base

In [None]:
wn.morphy('penned', 'v')

'pen'

In [None]:
wn.morphy('penning', 'v')

'pen'

Let's now test the similarity between two words. For this example, let's use the words 'planet' and 'moon'

In [None]:
planet = wn.synset('planet.n.01')
planet.definition()

'(astronomy) any of the nine large celestial bodies in the solar system that revolve around the sun and shine by reflected light; Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune, and Pluto in order of their proximity to the sun; viewed from the constellation Hercules, all the planets rotate around the sun in a counterclockwise direction'

In [None]:
moon = wn.synset('moon.n.01')
moon.definition()

'the natural satellite of the Earth'

In [None]:
wn.wup_similarity(planet,moon)

0.8
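Wu-Palmer similarity is based on depth in the hypernym tree: it is twice the depth of the deepest common ancestor of the two synsets, divided by the sum of their depths. The sketch below applies that formula to a small made-up taxonomy. The `PARENT` table, its depths, and the function names are assumptions for illustration and do not reproduce WordNet's actual tree.

```python
# Toy taxonomy, child -> parent (hypothetical, not WordNet's real structure)
PARENT = {
    'object': 'entity',
    'celestial_body': 'object',
    'planet': 'celestial_body',
    'satellite': 'celestial_body',
    'moon': 'satellite',
}


def path_to_root(node):
    """Return [node, parent, ..., root]."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def depth(node):
    """Depth counted from the root, with the root at depth 1."""
    return len(path_to_root(node))


def wup(a, b):
    """2 * depth(deepest common ancestor) / (depth(a) + depth(b))."""
    ancestors_a = set(path_to_root(a))
    # First node on b's path to the root that is also on a's path
    common = next(n for n in path_to_root(b) if n in ancestors_a)
    return 2 * depth(common) / (depth(a) + depth(b))


toy_score = wup('planet', 'moon')
print(round(toy_score, 3))
```

In this toy tree the deepest common ancestor is 'celestial_body' at depth 3, so the score is 2 * 3 / (4 + 5). Two nodes that share a deep ancestor score close to 1, and a node compared with itself scores exactly 1.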

Now let's use the Lesk algorithm to see whether it picks out the correct synsets

In [None]:
from nltk.wsd import lesk

context = "Each planet in the solar system may have a moon that orbits it."
context_tokens = nltk.word_tokenize(context)
print(lesk(context_tokens, 'planet'))
print(lesk(context_tokens, 'moon'))

Synset('planet.n.01')
Synset('moon.v.02')


In [None]:
print(wn.synset('planet.n.01').definition())
print(wn.synset('moon.v.02').definition())

(astronomy) any of the nine large celestial bodies in the solar system that revolve around the sun and shine by reflected light; Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune, and Pluto in order of their proximity to the sun; viewed from the constellation Hercules, all the planets rotate around the sun in a counterclockwise direction
be idle in a listless or dreamy way


The Wu-Palmer algorithm seems to do well at measuring the similarity between 'planet' and 'moon'. My guess is that both belong to some kind of 'celestial objects' category, which explains their high similarity score. The Lesk algorithm picked the correct synset for 'planet' but failed for 'moon': it chose a verb sense, 'moon.v.02', even though the sentence uses 'moon' as a noun. This shows that the algorithm is not perfect and can miss things.
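Lesk-style disambiguation scores each candidate sense by how many words its dictionary gloss shares with the context. The following is a simplified sketch of that overlap count, not NLTK's implementation: it compares only two of the glosses quoted earlier in this notebook, and the helper names and the short stopword list are assumptions made for illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and keep alphabetic words only."""
    return set(re.findall(r'[a-z]+', text.lower()))


def overlap_scores(context, glosses, stopwords=frozenset()):
    """Count words shared by the context and each gloss, ignoring stopwords."""
    context_words = tokenize(context) - stopwords
    return {
        sense: len(context_words & (tokenize(gloss) - stopwords))
        for sense, gloss in glosses.items()
    }


context = "Each planet in the solar system may have a moon that orbits it."
# Two glosses copied from the definitions shown in this notebook
glosses = {
    'moon.n.01': 'the natural satellite of the Earth',
    'moon.v.02': 'be idle in a listless or dreamy way',
}

raw = overlap_scores(context, glosses)
# Small hand-picked stopword list (an assumption, not NLTK's list)
filtered = overlap_scores(
    context, glosses, frozenset({'the', 'of', 'in', 'a', 'or', 'be'})
)
print(raw)
print(filtered)
```

In this toy setup the verb gloss scores higher only because it shares 'in' and 'a' with the sentence, and once those function words are filtered out neither gloss overlaps with the context at all. That hints at why short glosses can mislead an overlap-based method. NLTK's lesk() also accepts a pos argument (for example pos='n') that restricts the candidate senses to one part of speech.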

SentiWordNet is a resource that assigns sentiment scores to WordNet synsets. For a given synset, it returns a positive score, a negative score, and an objective (neutral) score. Using this information, SentiWordNet can be used to classify text or judge opinions.

In [None]:
from nltk.corpus import sentiwordnet as swn

word = swn.senti_synset('assault.n.02')
print("Positive score = ", word.pos_score())
print("Negative score = ", word.neg_score())
print("Objective score = ", word.obj_score())

Positive score =  0.0
Negative score =  0.375
Objective score =  0.625


In [None]:
sentence = "The ravenous child murdered the bowl of delicious cereal"
tokens = nltk.word_tokenize(sentence)

for token in tokens:
    synsets = wn.synsets(token)
    if synsets:
        # Use the first synset listed for the token, whatever its part of speech
        word = swn.senti_synset(synsets[0].name())
        print("Word: " + token)
        print("Positive score = ", word.pos_score())
        print("Negative score = ", word.neg_score())
        print("Objective score = ", word.obj_score())

Word: ravenous
Positive score =  0.25
Negative score =  0.625
Objective score =  0.125
Word: child
Positive score =  0.0
Negative score =  0.0
Objective score =  1.0
Word: murdered
Positive score =  0.0
Negative score =  0.625
Objective score =  0.375
Word: bowl
Positive score =  0.0
Negative score =  0.0
Objective score =  1.0
Word: delicious
Positive score =  0.0
Negative score =  0.0
Objective score =  1.0
Word: cereal
Positive score =  0.0
Negative score =  0.0
Objective score =  1.0


SentiWordNet seems to do an alright job at sentiment analysis. It catches most of the obviously charged words, but it does miss a lot. It surprised me that 'assault' was considered mostly neutral and only partially negative. In the sentence example, most of the results make sense. I was surprised to find that 'ravenous' is considered mostly negative but also somewhat positive. I also wonder why 'delicious' is not considered a positive word. Regardless, these scores can be used in NLP programs to perform sentiment analysis. This can be especially useful for analyzing opinions on certain topics by looking at social media platforms like Twitter.
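One simple way to turn per-word scores into a sentence-level judgment is to sum the positive and negative scores and compare them. The sketch below does this with the scores printed above; the summing rule and the names `sentence_polarity` and `label` are assumptions for illustration, not something SentiWordNet provides.

```python
# (positive, negative) scores copied from the output above
scores = {
    'ravenous': (0.25, 0.625),
    'child': (0.0, 0.0),
    'murdered': (0.0, 0.625),
    'bowl': (0.0, 0.0),
    'delicious': (0.0, 0.0),
    'cereal': (0.0, 0.0),
}


def sentence_polarity(word_scores):
    """Return total positive, total negative, and their difference."""
    total_pos = sum(pos for pos, _ in word_scores.values())
    total_neg = sum(neg for _, neg in word_scores.values())
    return total_pos, total_neg, total_pos - total_neg


total_pos, total_neg, net = sentence_polarity(scores)
label = 'positive' if net > 0 else 'negative' if net < 0 else 'neutral'
print(total_pos, total_neg, net, label)
```

Under this rule the playful cereal sentence comes out negative, driven by 'ravenous' and 'murdered', which illustrates how word-level scores can miss figurative usage.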

Collocation refers to the tendency for some words to occur together more often than would be expected by chance, suggesting that there is a relationship between those words. This is important in NLP because it can be used to find meaningful patterns in texts.

In [6]:
import math

nltk.download('inaugural')

from nltk.book import text4

# collocations() prints its results and returns None, so it is not wrapped in print()
text4.collocations()

text = ' '.join(text4.tokens)

# Note: counts are divided by the vocabulary size (number of unique tokens),
# so these values are rough estimates rather than true probabilities
vocab = len(set(text4))
hg = text.count('fellow Americans')/vocab
print("p(fellow Americans) = ",hg )
h = text.count('fellow')/vocab
print("p(fellow) = ", h)
g = text.count('Americans')/vocab
print('p(Americans) = ', g)
# Pointwise mutual information: log2 of p(x, y) / (p(x) * p(y))
pmi = math.log2(hg / (h * g))
print('pmi = ', pmi)

United States; fellow citizens; years ago; four years; Federal
Government; General Government; American people; Vice President; God
bless; Chief Justice; one another; fellow Americans; Old World;
Almighty God; Fellow citizens; Chief Magistrate; every citizen; Indian
tribes; public debt; foreign nations
p(fellow Americans) =  0.00199501246882793
p(fellow) =  0.013665835411471322
p(Americans) =  0.008478802992518703
pmi =  4.105819692018779


[nltk_data] Downloading package inaugural to /root/nltk_data...
[nltk_data]   Package inaugural is already up-to-date!


Since the pointwise mutual information is positive, 'fellow' and 'Americans' occur together more often than would be expected if they were independent, which suggests that 'fellow Americans' is likely to be a collocation.
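The calculation can be wrapped in a small function to make the interpretation easier to check. Pointwise mutual information is log2(p(x, y) / (p(x) * p(y))), so it is 0 when the joint probability equals the product of the individual probabilities and positive when the pair occurs more often than that. The function name `pmi` and its parameter names are chosen here for illustration; the input values are the estimates printed above.

```python
import math


def pmi(p_xy, p_x, p_y):
    """Pointwise mutual information in bits."""
    return math.log2(p_xy / (p_x * p_y))


# Estimates printed by the cell above
p_fellow_americans = 0.00199501246882793
p_fellow = 0.013665835411471322
p_americans = 0.008478802992518703

score = pmi(p_fellow_americans, p_fellow, p_americans)
# If the two words were independent, the joint probability would equal the product
independent = pmi(p_fellow * p_americans, p_fellow, p_americans)
print(score)
print(independent)
```

A score of about 4.1 bits means the pair appears roughly 2 ** 4.1, or about 17, times more often than independence would predict under these estimates.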