# WordNet Portfolio Assignment
Jimmy Harvin



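WordNet is a lexical database developed at Princeton and accessible through NLTK that stores word definitions and semantic relationships. Words are grouped into synsets, or synonym sets: each synset collects the words that express one concept, along with a definition and usage examples, and a word with several senses belongs to several synsets. Synsets are organized in hierarchies, with is-a (hypernym/hyponym) and part-of (holonym/meronym) relationships making up the bulk of these hierarchies. Only content words (nouns, verbs, adjectives, and adverbs) appear in WordNet; most stop words are function words and are therefore left out of these hierarchies.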

In [129]:
import math
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('sentiwordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')

nltk.download('gutenberg')
nltk.download('genesis')
nltk.download('inaugural')
nltk.download('nps_chat')
nltk.download('webtext')
nltk.download('treebank')

from nltk.corpus import wordnet as wn
from nltk.wsd import lesk
from nltk.corpus import sentiwordnet as swn
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag, word_tokenize
from nltk.book import *
text4

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package sentiwordnet to /root/nltk_data...
[nltk_data]   Package sentiwordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
[nltk_data] Downloading package genesis to /root/nltk_data...
[nltk_data]   Package genesis is already up-to

<Text: Inaugural Address Corpus>

# Noun

In [56]:
wn.synsets("value")

[Synset('value.n.01'),
 Synset('value.n.02'),
 Synset('value.n.03'),
 Synset('value.n.04'),
 Synset('value.n.05'),
 Synset('value.n.06'),
 Synset('value.v.01'),
 Synset('prize.v.01'),
 Synset('respect.v.01'),
 Synset('measure.v.04'),
 Synset('rate.v.03')]

In [57]:
val_synset = wn.synset("value.n.02")
print(val_synset.definition())
print(val_synset.examples())
print(val_synset.lemmas())

the quality (positive or negative) that renders something desirable or valuable
['the Shakespearean Shylock is of dubious value in the modern world']
[Lemma('value.n.02.value')]


In [58]:
hyper = val_synset.hypernyms()[0]
top = wn.synset("entity.n.01")

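# Follow the first hypernym at each level until the root is reached
while True:
  print(hyper)
  if hyper == top or not hyper.hypernyms():
    break  # stop at 'entity' or at any synset with no hypernym
  hyper = hyper.hypernyms()[0]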

Synset('worth.n.02')
Synset('quality.n.01')
Synset('attribute.n.02')
Synset('abstraction.n.06')
Synset('entity.n.01')


WordNet is organized around hypernym and holonym relations between synsets. Hypernyms and hyponyms form an is-a relationship, with a hypernym being a more general form of a given hyponym. All nouns can be traced to the top hypernym 'entity'. Verbs also have hypernyms but no single shared root, and adjectives and adverbs are not arranged in hypernym trees at all. Holonyms and meronyms form a part-of relationship, with a holonym being a whole and its meronyms being parts of that whole. Each synset exposes lists of its hypernyms/hyponyms and holonyms/meronyms, and these lists can be traversed to find more generic or more specific synsets.
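To make the traversal idea concrete without depending on the WordNet data files, here is a minimal sketch that stores the hypernym chain printed above in a plain dictionary. The names `HYPERNYM`, `hypernym_path`, and `hyponyms` are hypothetical helpers for illustration, not part of NLTK, and the sketch assumes each synset has exactly one hypernym, which real WordNet does not guarantee.

```python
# Toy is-a hierarchy copied from the hypernym chain printed above.
# HYPERNYM maps each synset name to its single parent; real WordNet
# synsets can have several hypernyms, which this sketch ignores.
HYPERNYM = {
    "value.n.02": "worth.n.02",
    "worth.n.02": "quality.n.01",
    "quality.n.01": "attribute.n.02",
    "attribute.n.02": "abstraction.n.06",
    "abstraction.n.06": "entity.n.01",
}

def hypernym_path(name):
    """Return the list of names from `name` up to the root of the toy tree."""
    path = [name]
    while path[-1] in HYPERNYM:  # the root has no entry, so the loop ends
        path.append(HYPERNYM[path[-1]])
    return path

def hyponyms(name):
    """Invert the parent map to list the direct children of `name`."""
    return sorted(child for child, parent in HYPERNYM.items() if parent == name)

print(hypernym_path("value.n.02"))  # walks toward more generic synsets
print(hyponyms("worth.n.02"))       # walks toward more specific synsets
```

Walking up the parent map corresponds to repeatedly calling `hypernyms()`, and inverting it corresponds to `hyponyms()`; the loop ends on its own because the root has no parent entry.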

In [59]:
print("Hypernyms: ", end = "")
print(val_synset.hypernyms())
print("Hyponyms: ", end = "")
print(val_synset.hyponyms())
print("Holonyms: ", end = "")
print(val_synset.member_holonyms())
print("Meronyms: ", end = "")
print(val_synset.member_meronyms())
print("Antonyms: ", end = "")
print(val_synset.lemmas()[0].antonyms())

Hypernyms: [Synset('worth.n.02')]
Hyponyms: [Synset('book_value.n.01'), Synset('gross_domestic_product.n.01'), Synset('gross_national_product.n.01'), Synset('importance.n.01'), Synset('invaluableness.n.01'), Synset('market_value.n.01'), Synset('monetary_value.n.01'), Synset('national_income.n.01'), Synset('par_value.n.01'), Synset('price.n.03'), Synset('richness.n.04'), Synset('standard.n.04'), Synset('unimportance.n.02')]
Holonyms: []
Meronyms: []
Antonyms: []


# Verb

In [60]:
val_verb = wn.synset("value.v.01")
print(val_verb.definition())
print(val_verb.examples())
print(val_verb.lemmas())

fix or determine the value of; assign a value to
['value the jewelry and art work in the estate']
[Lemma('value.v.01.value')]


In [61]:
hyper_verb = val_verb.hypernyms()[0]

while hyper_verb:
  print(hyper_verb)
  if len(hyper_verb.hypernyms()) > 0:
    hyper_verb = hyper_verb.hypernyms()[0]
  else:
    break

Synset('determine.v.03')


Verbs also take part in hypernym and hyponym relationships, but unlike nouns they cannot all be traced back to a single umbrella term; here the chain stops after one step at 'determine.v.03'. Rather than meronym and holonym relationships, verbs have troponyms, which are more specific ways of performing an action, such as walking as opposed to moving.

# Morphy

In [62]:
print(wn.morphy("valuing", wn.NOUN))
print(wn.morphy("valuing", wn.VERB))
print(wn.morphy("valuing", wn.ADJ))

print(wn.morphy("valuer", wn.NOUN))
print(wn.morphy("valuer", wn.VERB))
print(wn.morphy("valuer", wn.ADJ))

None
value
None
valuer
None
None


# Similarity

In [67]:
print(wn.synsets("worth"))
worth = wn.synset("worth.n.01")
print(worth.definition())

[Synset('worth.n.01'), Synset('worth.n.02'), Synset('worth.n.03'), Synset('deserving.s.01'), Synset('worth.s.02')]
an indefinite quantity of something having a specified value


In [68]:
print("Path similarity: ", end = "")
print(val_synset.path_similarity(worth))
print("Wu-Palmer similarity: ", end = "")
print(wn.wup_similarity(val_synset, worth))

context = ["This", "gold", "has", "a", "high", "value"]
print("\nLesk algorithm: ", end = "")
print(lesk(context, 'worth'))
print(wn.synset("worth.s.02").definition())
print(lesk(context, 'worth', pos = "n"))

Path similarity: 0.125
Wu-Palmer similarity: 0.36363636363636365

Lesk algorithm: Synset('worth.s.02')
having a specified value
Synset('worth.n.01')


In [69]:
print(val_synset.path_similarity(wn.synset("worth.n.02")))
print(wn.wup_similarity(val_synset, wn.synset("worth.n.02")))

0.5
0.9090909090909091


I expected higher numbers for path and Wu-Palmer similarity, given that these words are almost synonymous. After double-checking, though, the sense of 'worth' that appears as the hypernym of this sense of 'value' is 'worth.n.02', not the 'worth.n.01' I compared first, which describes a quantity rather than a quality. The similarities between 'value.n.02' and 'worth.n.02' are much higher (0.5 and about 0.91). The Lesk algorithm is easy to understand: it picks the sense whose definition shares the most words with the surrounding context. I do not entirely see the point of this, though, as words from a definition rarely appear in the context where the word is used, and overlap between the context and WordNet's usage examples would likely be more valuable.
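As a rough check on where the 0.5 and 0.91 come from, the sketch below applies the usual textbook formulas to the hypernym chain found earlier in this notebook: path similarity as 1 / (1 + number of edges between the two synsets), and Wu-Palmer similarity as 2 * depth(LCS) / (depth(a) + depth(b)), where LCS is the lowest common subsumer. The helper names are hypothetical, the hierarchy is a toy single-parent tree, and counting the root as depth 1 is an assumption chosen to match the printed values; NLTK's own implementation handles multiple paths and other details that are not modeled here.

```python
# Toy single-parent hierarchy taken from the hypernym chain in this notebook.
PARENT = {
    "value.n.02": "worth.n.02",
    "worth.n.02": "quality.n.01",
    "quality.n.01": "attribute.n.02",
    "attribute.n.02": "abstraction.n.06",
    "abstraction.n.06": "entity.n.01",
}

def ancestors(name):
    """Return [name, parent, ..., root]."""
    chain = [name]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def depth(name):
    """Depth counted in nodes, so the root has depth 1 (an assumption)."""
    return len(ancestors(name))

def lowest_common_subsumer(a, b):
    """Deepest node on both paths to the root (assumes a single shared root)."""
    on_b_path = set(ancestors(b))
    for node in ancestors(a):  # walks upward, so the first hit is the deepest
        if node in on_b_path:
            return node
    raise ValueError("the two names do not share a root")

def path_similarity(a, b):
    """1 / (1 + number of edges on the path between a and b)."""
    lcs = lowest_common_subsumer(a, b)
    edges = (depth(a) - depth(lcs)) + (depth(b) - depth(lcs))
    return 1 / (1 + edges)

def wup_similarity(a, b):
    """2 * depth(LCS) / (depth(a) + depth(b))."""
    lcs = lowest_common_subsumer(a, b)
    return 2 * depth(lcs) / (depth(a) + depth(b))

print(path_similarity("value.n.02", "worth.n.02"))  # one edge apart: 1 / 2
print(wup_similarity("value.n.02", "worth.n.02"))   # 2 * 5 / (6 + 5)
```

On this toy tree the two synsets are one edge apart, so path similarity is 0.5, and the depths are 6 and 5 with 'worth.n.02' itself as the LCS, giving 10/11, consistent with the values printed for this pair.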

# SentiWordNet

In [75]:
print(wn.synsets("disgraceful"))
disgrace = wn.synset("disgraceful.s.01")
print(disgrace.definition())

[Synset('disgraceful.s.01'), Synset('black.s.12')]
giving offense to moral sensibilities and injurious to reputation; ; - Thackeray


In [84]:
for senti in swn.senti_synsets("disgraceful"):
  print(senti.synset)
  print("Positive: " + str(senti.pos_score()))
  print("Negative: " + str(senti.neg_score()))
  print("Objective: " + str(senti.obj_score()))

Synset('disgraceful.s.01')
Positive: 0.0
Negative: 0.5
Objective: 0.5
Synset('black.s.12')
Positive: 0.125
Negative: 0.5
Objective: 0.375


In [94]:
senti_sentence = "I am unbelievably stressed about these upcoming interviews"
senti_list = senti_sentence.split(" ")

print(senti_list)
for word in senti_list:
  if len(list(swn.senti_synsets(word))) > 0:
    print(list(swn.senti_synsets(word))[0])

['I', 'am', 'unbelievably', 'stressed', 'about', 'these', 'upcoming', 'interviews']
<iodine.n.01: PosScore=0.0 NegScore=0.0>
<americium.n.01: PosScore=0.0 NegScore=0.0>
<incredibly.r.01: PosScore=0.0 NegScore=0.5>
<stress.v.01: PosScore=0.0 NegScore=0.0>
<about.s.01: PosScore=0.0 NegScore=0.0>
<approaching.s.01: PosScore=0.0 NegScore=0.0>
<interview.n.01: PosScore=0.0 NegScore=0.0>


In [108]:
tagged_list = pos_tag(word_tokenize(senti_sentence))

def pos_to_wn(pos):
  if pos[0] == 'N':
    return wn.NOUN
  elif pos[0] == 'V':
    return wn.VERB
  elif pos[0] == 'J':
    return wn.ADJ
  elif pos[0] == 'R':
    return wn.ADV
  return

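# Iterate over the tagged tokens directly so each word stays aligned with its tag
for word, tag in tagged_list:
  wn_pos = pos_to_wn(tag)
  if not wn_pos:
    continue  # skip tags with no WordNet equivalent (pronouns, prepositions, ...)
  senti_synsets = list(swn.senti_synsets(word, wn_pos))
  if senti_synsets:
    first = senti_synsets[0]  # still assumes the first sense for that POS
    print(first, end = " ")
    print("ObjScore=" + str(first.obj_score()))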

<be.v.01: PosScore=0.25 NegScore=0.125> ObjScore=0.625
<incredibly.r.01: PosScore=0.0 NegScore=0.5> ObjScore=0.5
<stress.v.01: PosScore=0.0 NegScore=0.0> ObjScore=1.0
<approaching.s.01: PosScore=0.0 NegScore=0.0> ObjScore=1.0
<interview.n.01: PosScore=0.0 NegScore=0.0> ObjScore=1.0


SentiWordNet assigns sentiment scores to the synsets in the WordNet database. Each synset stores a positivity, a negativity, and an objectivity score, and the three sum to 1. The first code block here uses a naive approach, assuming that the first synset found for each word has the correct sense. This surprisingly returned chemical elements (iodine and americium) for the first two words in the sentence, 'I' and 'am', so I took another approach, using POS tagging to try to find the correct sense for each word and skipping stop words and words not in the WordNet database. The resulting scores usually make sense, though I am surprised that 'stress' has a negative score of zero. It is possible that the wrong sense of the word was chosen, but it is also possible that this word is not seen as necessarily negative for whatever reason. With these polarity scores, it could be possible to estimate the general mood of a body of text by comparing its average or total positivity and negativity to those of a larger corpus, and the same can be done with objectivity scores to try to isolate bias.
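The averaging idea can be sketched on the scores already printed by the POS-tagged cell above. This is a minimal illustration rather than a full sentiment pipeline: `average_sentiment` is a hypothetical helper, the triples are copied by hand from the output instead of being read from SentiWordNet, and objectivity is derived as 1 minus the other two scores.

```python
# (name, positive, negative) triples copied from the POS-tagged output above.
scores = [
    ("be.v.01", 0.25, 0.125),
    ("incredibly.r.01", 0.0, 0.5),
    ("stress.v.01", 0.0, 0.0),
    ("approaching.s.01", 0.0, 0.0),
    ("interview.n.01", 0.0, 0.0),
]

def average_sentiment(triples):
    """Return mean (positive, negative, objective) scores over the triples."""
    if not triples:
        raise ValueError("need at least one scored synset")
    n = len(triples)
    avg_pos = sum(p for _, p, _ in triples) / n
    avg_neg = sum(q for _, _, q in triples) / n
    avg_obj = 1.0 - avg_pos - avg_neg  # what remains after pos and neg
    return avg_pos, avg_neg, avg_obj

avg_pos, avg_neg, avg_obj = average_sentiment(scores)
if avg_neg > avg_pos:
    label = "negative"
elif avg_pos > avg_neg:
    label = "positive"
else:
    label = "neutral"
print(avg_pos, avg_neg, avg_obj, label)
```

For this sentence the average negativity (0.625 / 5) exceeds the average positivity (0.25 / 5), so the sketch labels it negative, driven almost entirely by 'unbelievably'. Comparing these averages against the same averages for a larger reference corpus would be the next step described above.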

# Collocation

In [127]:
text4.collocations()

United States; fellow citizens; years ago; four years; Federal
Government; General Government; American people; Vice President; God
bless; Chief Justice; one another; fellow Americans; Old World;
Almighty God; Fellow citizens; Chief Magistrate; every citizen; Indian
tribes; public debt; foreign nations


In [131]:
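# Rough PMI estimate for the bigram "fellow citizens".
# Caveats: counts are divided by the vocabulary size rather than the total
# number of tokens, and str.count() also matches substrings of longer words.
text = " ".join(token.lower() for token in text4.tokens)
unique_words = len(set(text4))
prob_col = text.count("fellow citizens") / unique_words
prob_fellow = text.count("fellow") / unique_words
prob_citizens = text.count("citizens") / unique_words

print("PMI: " + str(math.log2(prob_col / (prob_fellow * prob_citizens))))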

PMI: 4.132057790928088


A collocation is a sequence of words that carries more meaning than the sum of its parts, so much so that replacing a word in the collocation with a synonym would change the meaning of the phrase. Collocations can be found by checking whether words appear together more frequently than they would by chance, which can be measured with pointwise mutual information (PMI). For a bigram, PMI is the base-2 logarithm of the probability of both words appearing together divided by the product of the probabilities of each individual word in a given text. A PMI of zero means the words co-occur about as often as independence would predict, and a positive PMI means they co-occur more often. For the Inaugural Address Corpus, 'fellow citizens' has a PMI of about 4.13 as computed here, which is well above zero. The probabilities in that calculation are normalized by the number of unique words rather than the total number of tokens, so the value should be read as a rough estimate. From this, it is reasonable to treat 'fellow citizens' as a collocation. The phrase was also identified by the NLTK collocations function, which supports that conclusion, though from this sample of code it is unknown whether that list is exhaustive.
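For comparison, here is a minimal token-based sketch of the same PMI calculation that counts whole tokens and adjacent pairs instead of substrings, and normalizes by the number of tokens and bigrams. The function name `bigram_pmi` and the toy sentence are made up for illustration; this is one common way to estimate PMI, not the method NLTK uses internally to rank collocations.

```python
import math
from collections import Counter

def bigram_pmi(tokens, first, second):
    """Estimate PMI of the adjacent pair (first, second) from a token list."""
    tokens = [t.lower() for t in tokens]
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))  # adjacent token pairs
    if bigrams[(first, second)] == 0:
        return float("-inf")  # the pair never occurs together
    p_xy = bigrams[(first, second)] / (len(tokens) - 1)
    p_x = unigrams[first] / len(tokens)
    p_y = unigrams[second] / len(tokens)
    return math.log2(p_xy / (p_x * p_y))

# Hypothetical toy text, not taken from the inaugural corpus.
toy = "my fellow citizens we thank our fellow citizens and all citizens".split()
print(bigram_pmi(toy, "fellow", "citizens"))
```

In the notebook, the same function could be called as `bigram_pmi(text4.tokens, "fellow", "citizens")`, assuming `text4` has been loaded from `nltk.book`; the result will differ from the 4.13 above because the normalization and counting are different.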