# WordNet Project

WordNet is a lexical database of English that groups nouns, verbs, adjectives, and adverbs into synsets, each of which represents a distinct concept. Each synset collects the words that share one meaning. WordNet arranges synsets in a hierarchical structure using hypernyms, hyponyms, troponyms, and other relations, which place synsets in relation to one another. WordNet was originally designed to model how the human mind stores word concepts.

### Imports

In [2]:
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt')
nltk.download('sentiwordnet')
nltk.download('stopwords')

nltk.download('webtext')
nltk.download('treebank')
nltk.download('nps_chat')
nltk.download('inaugural')
nltk.download('genesis')
nltk.download('gutenberg')
from nltk.corpus import wordnet as wordnet
from nltk.corpus import sentiwordnet as sentiwordnet
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.wsd import lesk
from nltk.book import text4

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package sentiwordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/sentiwordnet.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package webtext to /root/nltk_data...
[nltk_data]   Unzipping corpora/webtext.zip.
[nltk_data] Downloading package treebank to /root/nltk_data...
[nltk_data]   Unzipping corpora/treebank.zip.
[nltk_data] Downloading package nps_chat to /root/nltk_data...
[nltk_data]   Unzipping corpora/nps_chat.zip.
[nltk_data] Downloading package inaugural to /root/nltk_data...
[nltk_data]   Unzipping corpora/inaugural.zip.
[nltk_data] Downloading package genesis to /root/nltk_data...
[nltk_data]   Unzipping corpora/genesis.zip.

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


## Synset Interconnection for Nouns

Nouns are organized with several relations between synsets. Hypernyms and hyponyms express an 'is a' relationship: a hypernym is a broader concept (bone is a hypernym of vertebra), whereas a hyponym narrows the synset's concept down to something more specific. Holonyms and meronyms express an 'is a part of' relationship: a meronym is a smaller part of the synset's concept, whereas a holonym is a larger whole of which the synset's concept is a component.

Antonyms are defined on lemmas rather than on synsets, so they are accessed through a synset's lemmas.

### Explore the synset of a noun
Retrieve the synsets for a given noun and display the definition, usage examples, and lemmas of the first synset.

In [3]:
# retrieve a synset group
synsets = wordnet.synsets("vertebrae", pos=wordnet.NOUN)
print(synsets)

# extract info of first synset, if there is a synset
if synsets:
  print("\nSynset {}\nDefinition: {}\nUsage Examples: {}\nLemmas: {}".format(synsets[0],synsets[0].definition(), synsets[0].examples(), synsets[0].lemmas()))
else:
  print("No synset found for that word")

[Synset('vertebra.n.01')]

Synset Synset('vertebra.n.01')
Definition: one of the bony segments of the spinal column
Usage Examples: []
Lemmas: [Lemma('vertebra.n.01.vertebra')]


Print the definition, usage examples, and lemmas for each synset of the word.

In [4]:
# extract info of all synsets, if there is a synset
if synsets:
  for syn in synsets:
    print("\nSynset {}\nDefinition: {}\nUsage Examples: {}\nLemmas: {}".format(syn,syn.definition(), syn.examples(), syn.lemmas()))
else:
  print("No synsets generated.")


Synset Synset('vertebra.n.01')
Definition: one of the bony segments of the spinal column
Usage Examples: []
Lemmas: [Lemma('vertebra.n.01.vertebra')]


### Traverse the Tree of the first synset
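Starting from the first synset, climb the hierarchy through its hypernyms as far as they go, then descend through its hyponyms as far as they go. At each level only the first synset in the list is followed.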


In [5]:
if synsets:
  # select the first synset
  syn = synsets[0]

  print("Word: {}".format(syn))

  # get hypernyms--------------------
  hyp = syn.hypernyms()

  # print hypernyms, if any, following the first hypernym at each level
  if hyp:
    count = 1
    while hyp:
      print("Round {} hypernyms: {}".format(count, hyp))
      hyp = hyp[0].hypernyms()
      count+=1

  # report when the synset has no hypernyms
  else:
    print("No hypernyms for {}".format(syn))

  # get hyponyms--------------------
  hyp = syn.hyponyms()

  # print hyponyms, if any, following the first hyponym at each level
  if hyp:
    count = 1
    while hyp:
      print("Round {} hyponyms: {}".format(count, hyp))
      hyp = hyp[0].hyponyms()
      count+=1

  # report when the synset has no hyponyms
  else:
    print("No hyponyms for {}".format(syn))
else:
  print("No synsets generated.")

Word: Synset('vertebra.n.01')
Round 1 hypernyms: [Synset('bone.n.01')]
Round 2 hypernyms: [Synset('connective_tissue.n.01')]
Round 3 hypernyms: [Synset('animal_tissue.n.01')]
Round 4 hypernyms: [Synset('tissue.n.01')]
Round 5 hypernyms: [Synset('body_part.n.01')]
Round 6 hypernyms: [Synset('part.n.03')]
Round 7 hypernyms: [Synset('thing.n.12')]
Round 8 hypernyms: [Synset('physical_entity.n.01')]
Round 9 hypernyms: [Synset('entity.n.01')]
Round 1 hyponyms: [Synset('cervical_vertebra.n.01'), Synset('coccygeal_vertebra.n.01'), Synset('lumbar_vertebra.n.01'), Synset('sacral_vertebra.n.01'), Synset('thoracic_vertebra.n.01')]
Round 2 hyponyms: [Synset('atlas.n.03'), Synset('axis.n.05')]
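
The loop above follows only the first synset at each level, so sibling branches are never visited. A level-by-level walk covers every branch. The sketch below is a minimal illustration that runs on a hand-copied toy dictionary (`TOY_HYPONYMS` and `hyponyms_by_level` are hypothetical names) holding just the hyponyms printed above, so it needs no WordNet data. With NLTK, the dictionary lookup would be replaced by a call to `hyponyms()` on a real synset.

```python
# Toy data copied from the output above; names are plain strings, not Synset objects.
# Only the children of cervical_vertebra appear in that output, so the other
# vertebra types are left without children here (an assumption, not a claim
# about WordNet).
TOY_HYPONYMS = {
    "vertebra": ["cervical_vertebra", "coccygeal_vertebra", "lumbar_vertebra",
                 "sacral_vertebra", "thoracic_vertebra"],
    "cervical_vertebra": ["atlas", "axis"],
}

def hyponyms_by_level(root, table):
    """Return a list of levels; level i holds every node i + 1 steps below root."""
    levels = []
    frontier = [root]
    while frontier:
        next_level = []
        for node in frontier:
            # collect the children of every node on the current level
            next_level.extend(table.get(node, []))
        if next_level:
            levels.append(next_level)
        frontier = next_level
    return levels

levels = hyponyms_by_level("vertebra", TOY_HYPONYMS)
for depth, names in enumerate(levels, start=1):
    print("Level {} hyponyms: {}".format(depth, names))
```

Because every node on a level is expanded, a child that hangs off the second or third sibling would still be found, which the first-element loop would miss.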


### Relations to a Synset
Print the hypernyms, hyponyms, meronyms, holonyms, and antonyms of a selected synset.


In [6]:
# perform action if synset exists
if synsets:
  # select the first synset
  syn = synsets[0]

  # print metrics
  print("\nSynset {}\nHypernyms: {}\nHyponyms: {}\nHolonyms: {}\nMeronyms: {}\nAntonyms: {}".format(
      syn,syn.hypernyms(), syn.hyponyms(), syn.part_holonyms(), syn.part_meronyms(), syn.lemmas()[0].antonyms()))


else:
  print("No synsets generated.")


Synset Synset('vertebra.n.01')
Hypernyms: [Synset('bone.n.01')]
Hyponyms: [Synset('cervical_vertebra.n.01'), Synset('coccygeal_vertebra.n.01'), Synset('lumbar_vertebra.n.01'), Synset('sacral_vertebra.n.01'), Synset('thoracic_vertebra.n.01')]
Holonyms: [Synset('spinal_column.n.01')]
Meronyms: [Synset('apophysis.n.02'), Synset('centrum.n.01'), Synset('transverse_process.n.01')]
Antonyms: []


## Synset Interconnection for Verbs

The arrangement of verbs has a similar structure to that of nouns.
* Hypernyms are less specific ways of describing the same action.
* Hyponyms (called troponyms for verbs) are more specific ways of describing the same action.
* Antonyms describe the opposite of the action, if the action has a clear inverse (e.g. sitting and rising).
* Holonyms and meronyms are more difficult to associate with verbs because verbs are rarely 'part of a whole'.

In [7]:
# select a verb
synsets = wordnet.synsets("soar", pos=wordnet.VERB)
print(synsets)

# extract info of first synset, if there is a synset
if synsets:
  print("\nSynset {}\nDefinition: {}\nUsage Examples: {}\nLemmas: {}".format(synsets[0],synsets[0].definition(), synsets[0].examples(), synsets[0].lemmas()))
else:
  print("No synset found for that word")

[Synset('soar.v.01'), Synset('hang_glide.v.01'), Synset('soar.v.03'), Synset('soar.v.04'), Synset('sailplane.v.01')]

Synset Synset('soar.v.01')
Definition: rise rapidly
Usage Examples: ['the dollar soared against the yen']
Lemmas: [Lemma('soar.v.01.soar'), Lemma('soar.v.01.soar_up'), Lemma('soar.v.01.soar_upwards'), Lemma('soar.v.01.surge'), Lemma('soar.v.01.zoom')]


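### Traverse the Tree of the first synset
Starting from the first synset, climb the hierarchy through its hypernyms as far as they go, then descend through its hyponyms as far as they go. At each level only the first synset in the list is followed.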


In [8]:
if synsets:
  # select the first synset
  syn = synsets[0]

  print("Word: {}".format(syn))

  # get hypernyms--------------------
  hyp = syn.hypernyms()

  # print hypernyms, if any, following the first hypernym at each level
  if hyp:
    count = 1
    while hyp:
      print("Round {} hypernyms: {}".format(count, hyp))
      hyp = hyp[0].hypernyms()
      count+=1

  # report when the synset has no hypernyms
  else:
    print("No hypernyms for {}".format(syn))

  # get hyponyms--------------------
  hyp = syn.hyponyms()

  # print hyponyms, if any, following the first hyponym at each level
  if hyp:
    count = 1
    while hyp:
      print("Round {} hyponyms: {}".format(count, hyp))
      hyp = hyp[0].hyponyms()
      count+=1

  # report when the synset has no hyponyms
  else:
    print("No hyponyms for {}".format(syn))
else:
  print("No synsets generated.")

Word: Synset('soar.v.01')
Round 1 hypernyms: [Synset('rise.v.01')]
Round 2 hypernyms: [Synset('travel.v.01')]
Round 1 hyponyms: [Synset('billow.v.01')]
Round 2 hyponyms: [Synset('cloud.v.03')]


## Generate Morphologies
Generate candidate inflected forms for the word by appending the suffixes that WordNet's morphy function knows how to strip. Morphy is a rule-based lemmatizer: it removes a known suffix and checks whether what remains is a WordNet entry. Note that the cell below passes only `word + "s"` to morphy, so every suffix in a list is printed whenever that one form is recognized for the part of speech. The listing therefore shows the suffix list for each part of speech in which 'wolfs' is recognized (noun and verb), not that morphy accepts each printed form individually; there is no such thing as 'wolfshes', and each form would need to be passed to morphy on its own to test it.

A permissive rule-based approach does have some benefits for parsing 'Internet English', where new forms of words are invented for self-expression.
Example: "Wow look at all them wolfies" is not proper English but is representative of how some people talk in messages.
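
To judge each form on its own, every candidate has to go through the suffix rules individually. The sketch below is a deliberately simplified, single-pass stand-in for that idea and makes no claim to reproduce NLTK's `morphy`, which also consults exception lists. `NOUN_RULES`, `TOY_LEXICON`, and `toy_morphy` are hypothetical names; the suffixes are the noun suffixes used in this notebook, while the replacement endings and the four-word lexicon are assumptions made for illustration.

```python
# Noun detachment rules as (suffix, replacement) pairs. The suffixes come from
# this notebook; the replacements are an assumption about how WordNet-style
# rules rewrite each ending.
NOUN_RULES = [("s", ""), ("ses", "s"), ("xes", "x"), ("zes", "z"),
              ("ches", "ch"), ("shes", "sh"), ("men", "man"), ("ies", "y")]

# Hypothetical stand-in for the WordNet noun index.
TOY_LEXICON = {"wolf", "box", "pony", "man"}

def toy_morphy(form, rules=NOUN_RULES, lexicon=TOY_LEXICON):
    """Return the base form if one rule application lands in the lexicon, else None."""
    if form in lexicon:
        return form
    for suffix, replacement in rules:
        if form.endswith(suffix):
            # strip the suffix and attach the replacement ending
            candidate = form[:len(form) - len(suffix)] + replacement
            if candidate in lexicon:
                return candidate
    return None

# test every candidate form separately
candidates = ["wolf" + suffix for suffix, _ in NOUN_RULES]
results = {form: toy_morphy(form) for form in candidates}
for form, base in results.items():
    print("{:10s} -> {}".format(form, base))
```

In this toy setting only 'wolfs' resolves to a base form, because each candidate is checked against the lexicon separately instead of being gated on a single test.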

In [9]:
# select a word
word = "wolf"

# print the word being examined
print("Word: {}".format(word))

# suffixes to try for each part of speech
noun = ["s","ses","xes","zes","ches","shes","men","ies"]
verb = ["s","ies","es","ed","ing"]
adjective = ["er","est"]

print("Accepted (by morphy) Variant forms:")

# list noun forms
# (the check uses word + "s" for every suffix, so it gates the whole list on that one form)
for suffix in noun:
  if wordnet.morphy(word+"s", wordnet.NOUN) is not None:
    print("Noun: ", word + suffix)

# list verb forms (same word + "s" check)
for suffix in verb:
  if wordnet.morphy(word+"s", wordnet.VERB) is not None:
    print("Verb: ", word + suffix)

# list adjective forms (same word + "s" check)
for suffix in adjective:
  if wordnet.morphy(word+"s", wordnet.ADJ) is not None:
    print("Adjective: ", word + suffix)


Word: wolf
Accepted (by morphy) Variant forms:
Noun:  wolfs
Noun:  wolfses
Noun:  wolfxes
Noun:  wolfzes
Noun:  wolfches
Noun:  wolfshes
Noun:  wolfmen
Noun:  wolfies
Verb:  wolfs
Verb:  wolfies
Verb:  wolfes
Verb:  wolfed
Verb:  wolfing


## Similarity (Wu-Palmer and Lesk)


### Select two words

In [10]:
# word
word_a = "wolf"
word_b = "cat"

# use of word
word_a_use = "The wolf pack raced after the deer."
word_b_use = "Sometimes I like to give my cat extra treats."

# possible synsets for both words
synsa = wordnet.synsets(word_a)
synsb = wordnet.synsets(word_b)

# select the first synset (can be changed by the next step)
syna = synsa[0]
synb = synsb[0]

Select the desired synset from the words (manual tuning)

In [11]:
# extract synset a
syna = synsa[0]
print("Synsets for a: {}".format(synsa))

print("\nSynset {}\nDefinition: {}\nUsage Examples: {}\nLemmas: {}".format(syna,syna.definition(), syna.examples(), syna.lemmas()))

# extract synset b
synb = synsb[0]
print("\n\nSynsets for b: {}".format(synsb))

print("\nSynset {}\nDefinition: {}\nUsage Examples: {}\nLemmas: {}".format(synb,synb.definition(), synb.examples(), synb.lemmas()))

Synsets for a: [Synset('wolf.n.01'), Synset('wolf.n.02'), Synset('wolf.n.03'), Synset('wolf.n.04'), Synset('beast.n.02'), Synset('wolf.v.01')]

Synset Synset('wolf.n.01')
Definition: any of various predatory carnivorous canine mammals of North America and Eurasia that usually hunt in packs
Usage Examples: []
Lemmas: [Lemma('wolf.n.01.wolf')]


Synsets for b: [Synset('cat.n.01'), Synset('guy.n.01'), Synset('cat.n.03'), Synset('kat.n.01'), Synset('cat-o'-nine-tails.n.01'), Synset('caterpillar.n.02'), Synset('big_cat.n.01'), Synset('computerized_tomography.n.01'), Synset('cat.v.01'), Synset('vomit.v.01')]

Synset Synset('cat.n.01')
Definition: feline mammal usually having thick soft fur and no ability to roar: domestic cats; wildcats
Usage Examples: []
Lemmas: [Lemma('cat.n.01.cat'), Lemma('cat.n.01.true_cat')]


### Wu-Palmer Similarity
Wu-Palmer similarity scores two synsets by the depth of their most specific common ancestor relative to their own depths, whereas path similarity scores them by the length of the shortest path that connects them through hypernym and hyponym links.
* Both scores are at most 1, which is reached when a synset is compared with itself, and approach 0 as the synsets become less related.
* Wu-Palmer is calculated as 2 * depth(common ancestor) / (depth(a) + depth(b)).
* Path similarity is calculated as 1 / (shortest path length + 1), so it depends only on the number of links between the two synsets.
* Wu-Palmer tends to give a higher score than path similarity, as in the output below (0.857 versus 0.2).
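
Both measures can be worked by hand on a small tree. The sketch below is an illustration under stated assumptions: `TOY_PARENT` is a hypothetical, heavily shortened taxonomy (real WordNet paths have many more levels), every node has a single parent, and depth is counted in nodes so that the root has depth 1. The helper names are invented for this example.

```python
# Hypothetical child -> parent links; a real WordNet path has many more levels.
TOY_PARENT = {
    "animal": "entity",
    "carnivore": "animal",
    "canine": "carnivore",
    "feline": "carnivore",
    "wolf": "canine",
    "cat": "feline",
}

def path_to_root(node):
    """Return [node, parent, ..., root]."""
    path = [node]
    while path[-1] in TOY_PARENT:
        path.append(TOY_PARENT[path[-1]])
    return path

def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(node))

def lowest_common_ancestor(a, b):
    ancestors_of_a = set(path_to_root(a))
    # path_to_root walks upward, so the first shared node is the deepest one
    for node in path_to_root(b):
        if node in ancestors_of_a:
            return node
    return None

def wu_palmer(a, b):
    lca = lowest_common_ancestor(a, b)
    return 2 * depth(lca) / (depth(a) + depth(b))

def path_similarity(a, b):
    lca = lowest_common_ancestor(a, b)
    # in a single-parent tree the shortest path runs through the common ancestor
    edges = (depth(a) - depth(lca)) + (depth(b) - depth(lca))
    return 1 / (edges + 1)

print("Toy Wu-Palmer score:", wu_palmer("wolf", "cat"))
print("Toy path similarity:", path_similarity("wolf", "cat"))
```

In the toy tree 'wolf' and 'cat' are four links apart, giving a path similarity of 1 / (4 + 1) = 0.2. Wu-Palmer comes out higher because it rewards the depth of the shared ancestor; in a deeper hierarchy such as WordNet the same four-link separation yields an even larger Wu-Palmer score.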

In [12]:
print("Wu-Palmer score: {}".format(wordnet.wup_similarity(syna, synb)))
print("Path-Similarity score: {}".format(syna.path_similarity(synb)))

Wu-Palmer score: 0.8571428571428571
Path-Similarity score: 0.2


### Lesk Algorithm
The Lesk algorithm identifies which synset a given word most likely belongs to by comparing the words around it with each candidate synset's definition. It tends to correctly gauge whether the word is a noun, verb, or adjective, but it fails with more complex context.
For example:
* "The wolf pack raced after the deer." - identifies 'wolf' as a German classical scholar.
* "Sometimes I like to give my cat extra treats." - identifies 'cat' as 'kat', the leaves of a shrub used as a stimulant.

Using Lesk with a sentence whose stopwords have been removed appears to improve performance significantly (see the SentiWordNet example for a demonstration).
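
A simplified Lesk scores each candidate sense by counting the words shared between the context and the sense's definition. The sketch below illustrates that scoring idea only and does not reproduce `nltk.wsd.lesk`: it assumes plain whitespace splitting of the definition and considers just two candidate senses, whose definitions are copied from the output printed in this notebook. `TOY_GLOSSES` and `gloss_overlap` are hypothetical names.

```python
# Definitions copied from the output printed in this notebook.
TOY_GLOSSES = {
    "cat.n.01": "feline mammal usually having thick soft fur and no ability "
                "to roar: domestic cats; wildcats",
    "kat.n.01": "the leaves of the shrub Catha edulis which are chewed like "
                "tobacco or used to make tea; has the effect of a euphoric stimulant",
}

# Tokens of the example sentence, as printed by word_tokenize in this notebook.
context = ['Sometimes', 'I', 'like', 'to', 'give', 'my', 'cat', 'extra', 'treats', '.']

def gloss_overlap(context_tokens, gloss):
    """Words shared by the context and a whitespace-split definition."""
    return set(context_tokens) & set(gloss.split())

# score every candidate sense and keep the one with the largest overlap
overlaps = {name: gloss_overlap(context, gloss) for name, gloss in TOY_GLOSSES.items()}
best = max(overlaps, key=lambda name: len(overlaps[name]))

for name, shared in overlaps.items():
    print("{}: {} shared word(s) {}".format(name, len(shared), sorted(shared)))
print("Chosen sense:", best)
```

Here the choice is decided entirely by the function words 'like' and 'to', which suggests why short contexts made up of common words can mislead overlap counting.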

In [13]:
# tokenize examples
tokens_a = word_tokenize(word_a_use)
tokens_b = word_tokenize(word_b_use)

# lesk algorithm
lesk_a = lesk(tokens_a, word_a)
lesk_b = lesk(tokens_b, word_b)
print("Example sentence tokens: ", tokens_a)
print("Lesk Synset for {}: {} - {}".format(word_a, lesk_a, lesk_a.definition()))
print("Example sentence tokens: ", tokens_b)
print("Lesk Synset for {}: {} - {}".format(word_b, lesk_b, lesk_b.definition()))

Example sentence tokens:  ['The', 'wolf', 'pack', 'raced', 'after', 'the', 'deer', '.']
Lesk Synset for wolf: Synset('wolf.n.03') - German classical scholar who claimed that the Iliad and Odyssey were composed by several authors (1759-1824)
Example sentence tokens:  ['Sometimes', 'I', 'like', 'to', 'give', 'my', 'cat', 'extra', 'treats', '.']
Lesk Synset for cat: Synset('kat.n.01') - the leaves of the shrub Catha edulis which are chewed like tobacco or used to make tea; has the effect of a euphoric stimulant


## SentiWordNet
SentiWordNet is a lexical resource that assigns positive, negative, and objective sentiment scores to WordNet synsets. It can be used in applications that measure the emotional intent behind words, such as detecting hate speech in social media posts, detecting depression in conversation records, or allowing a conversational agent to identify the tone behind words and respond appropriately to the sentiments expressed by a human user.

SentiWordNet can be used to analyze the sentiment of entire sentences as a function of the sentiments of the individual words. The example given here removes stopwords and then uses Lesk to assign a specific synset to each token, before summing the positive and negative scores across the tokens to give a sentiment measure for the entire sentence.

In [14]:
# select a word
word = "terrific"
syns = wordnet.synsets(word)
print("Possible synsets: ", syns)
syn = syns[1]
print("Synset {}\nDefinition: {}".format(syn,syn.definition()))

Possible synsets:  [Synset('terrific.s.01'), Synset('fantastic.s.02'), Synset('terrific.s.03')]
Synset Synset('fantastic.s.02')
Definition: extraordinarily good or great ; used especially as intensifiers


In [15]:
# score the word sentiment
sents = sentiwordnet.senti_synsets(word)
for sent in sents:
  print("{} - Definition: {}".format(sent.synset,sent.synset.definition()))
  print("Positive Score: {}\nNegative score: {}\nObjective score: {}".format(sent.pos_score(), sent.neg_score(), sent.obj_score()))

Synset('terrific.s.01') - Definition: very great or intense
Positive Score: 0.25
Negative score: 0.25
Objective score: 0.5
Synset('fantastic.s.02') - Definition: extraordinarily good or great ; used especially as intensifiers
Positive Score: 0.75
Negative score: 0.0
Objective score: 0.25
Synset('terrific.s.03') - Definition: causing extreme terror
Positive Score: 0.0
Negative score: 0.625
Objective score: 0.375


Analyze the sentiment of a sentence, using Lesk to identify the synset for each token. Removing the stopwords significantly improves the results from Lesk, because Lesk will otherwise assign incorrect synsets to stopwords (such as 'a' = 'an amino acid in deoxyribonucleic acid'). With these false assignments gone, the remaining words are matched to synsets much closer to their intended meaning.

In [16]:
# score the sentiment of a whole sentence of words
# three test sentences; each assignment overwrites the previous one, so only the last is scored
sentence = "When I look out at the moon, I imagine you are looking up at the same moon as I." # mostly negative
sentence = "Together we will walk through a somber woods, hand in hand, your shadow swallowed by mine in the calm morning sunlight." # partly positive but mostly negative
sentence = "Such a maddening delight, for power to corrupt my veins as surely as fire consuming dry paper." # mostly positive
tokens = word_tokenize(sentence)

# remove stopwords
stopset = stopwords.words('english')
tokens = [token for token in tokens if token not in stopset]

positive = 0
negative = 0
# gather scores
for token in tokens:
  syn = lesk(tokens, token)
  if syn:
    sent = sentiwordnet.senti_synset(syn.name())
    print("Word {} ({})\nPos: {}, Neg: {}, Obj: {}".format(syn.name(), syn.definition(), sent.pos_score(), sent.neg_score(), sent.obj_score()))
    positive += sent.pos_score()
    negative += sent.neg_score()
    
print("\nPositive score total: {}\nNegative score total: {}".format(positive, negative))

Word such.s.01 (of so extreme a degree or extent)
Pos: 0.0, Neg: 0.125, Obj: 0.875
Word madden.v.03 (make mad)
Pos: 0.0, Neg: 0.0, Obj: 1.0
Word delight.v.02 (take delight in)
Pos: 0.0, Neg: 0.25, Obj: 0.75
Word power.v.01 (supply the force or power for the functioning of)
Pos: 0.0, Neg: 0.0, Obj: 1.0
Word corrupt.v.01 (corrupt morally or by intemperance or sensuality)
Pos: 0.0, Neg: 0.625, Obj: 0.375
Word vein.v.01 (make a veinlike pattern)
Pos: 0.0, Neg: 0.0, Obj: 1.0
Word surely.r.01 (definitely or positively (`sure' is sometimes used informally for `surely'))
Pos: 0.25, Neg: 0.0, Obj: 0.75
Word fire.v.06 (drive out or away by or as if by fire)
Pos: 0.0, Neg: 0.0, Obj: 1.0
Word devour.v.03 (eat immoderately)
Pos: 0.0, Neg: 0.0, Obj: 1.0
Word dry.v.02 (become dry or drier)
Pos: 0.0, Neg: 0.0, Obj: 1.0
Word paper.v.01 (cover with paper)
Pos: 0.0, Neg: 0.0, Obj: 1.0

Positive score total: 0.25
Negative score total: 1.0
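
The aggregation step can be isolated from Lesk and SentiWordNet. The sketch below is a minimal illustration that sums hard-coded scores copied from the output above, so it runs without any corpus data. `TOY_SCORES` and `sentence_polarity` are hypothetical names, and the final label (positive, negative, or neutral) is an added assumption; the cell above only prints the two totals.

```python
# (positive, negative) scores copied from the output above, keyed by synset name.
TOY_SCORES = {
    "such.s.01": (0.0, 0.125),
    "madden.v.03": (0.0, 0.0),
    "delight.v.02": (0.0, 0.25),
    "power.v.01": (0.0, 0.0),
    "corrupt.v.01": (0.0, 0.625),
    "vein.v.01": (0.0, 0.0),
    "surely.r.01": (0.25, 0.0),
    "fire.v.06": (0.0, 0.0),
    "devour.v.03": (0.0, 0.0),
    "dry.v.02": (0.0, 0.0),
    "paper.v.01": (0.0, 0.0),
}

def sentence_polarity(scores):
    """Sum the per-token scores and label the sentence by the larger total."""
    positive = sum(pos for pos, _ in scores.values())
    negative = sum(neg for _, neg in scores.values())
    if positive > negative:
        label = "positive"
    elif negative > positive:
        label = "negative"
    else:
        label = "neutral"
    return positive, negative, label

positive, negative, label = sentence_polarity(TOY_SCORES)
print("Positive total: {}\nNegative total: {}\nLabel: {}".format(positive, negative, label))
```

Only four of the eleven tokens carry any sentiment score here, so the sentence total depends heavily on which synset Lesk picked for those few words.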


## Collocations
Collocations are groups of words that together have a different meaning than the words have individually. The words in a collocation cannot be swapped for synonyms without changing the meaning of the collocation.

A good example of this is 'dead ahead'. Read word by word, 'dead' and 'ahead' would suggest that there are zombies in front of us, whereas 'dead ahead' taken as a single unit simply means directly ahead. To say 'deceased ahead' would change the meaning entirely.

Collocations can be detected in text when words appear next to each other with an unusually high probability.

In [17]:
# print the collocations found in the text (collocations() prints its results and returns None)
text4.collocations()

United States; fellow citizens; years ago; four years; Federal
Government; General Government; American people; Vice President; God
bless; Chief Justice; one another; fellow Americans; Old World;
Almighty God; Fellow citizens; Chief Magistrate; every citizen; Indian
tribes; public debt; foreign nations


### Calculate mutual information for a collocation

Pointwise mutual information (PMI) is a metric calculated as:

log2(P(x, y) / (P(x) * P(y)))

The score is the log of the probability that the two words appear together, divided by the probability that they would appear together by chance if they occurred independently of one another. A score well above 0 means the pair co-occurs more often than chance would predict.

The formula is useful, but it could be specialized further by comparing probabilities only among words that share a part of speech (noun, verb, etc.), since certain groups of words may have a higher probability of appearing next to one another in a sequence. Furthermore, whether or not a group of words is reported as a collocation depends on both the corpus and the score threshold that is applied.
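
The formula can be checked by hand on a tiny token list. The sketch below is an illustration under simplifying assumptions: it counts whole tokens and adjacent token pairs, and it divides both the unigram and the bigram counts by the total number of tokens. The function `pmi` and the eight-word `toy_tokens` list are hypothetical and are not taken from the inaugural corpus.

```python
import math
from collections import Counter

def pmi(tokens, x, y):
    """PMI of the adjacent pair (x, y), using whole-token counts."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    p_x = unigrams[x] / n
    p_y = unigrams[y] / n
    # same denominator as the unigram probabilities, a simplification
    p_xy = bigrams[(x, y)] / n
    if p_xy == 0:
        # the pair never occurs together, so the log is undefined; report -inf
        return float("-inf")
    return math.log2(p_xy / (p_x * p_y))

toy_tokens = "god bless america and god bless you all".split()
score = pmi(toy_tokens, "god", "bless")
print("Toy PMI score:", score)
```

In the toy list 'god' and 'bless' each make up 2 of the 8 tokens and always occur together, so P(x, y) = 0.25 and P(x) * P(y) = 0.0625, giving log2(4) = 2.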

In [18]:
import math

# select the two words of a collocation
a = "God"
b = "bless"
ab = a + " " + b

word_count = len(text4.tokens) # total number of tokens
text = ' '.join(text4.tokens) # recreate text as one space-separated string

# estimate the probability of each word
# (str.count matches substrings, so longer tokens that contain the word are counted too)
p_a = text.count(a) / word_count
p_b = text.count(b) / word_count

# estimate the probability of the two words appearing together
p_ab = text.count(ab) / word_count

# calculate and print the PMI score
pmi = math.log2(p_ab / (p_a * p_b))

print("PMI score: ", pmi)

PMI score:  8.076081481141687
