# What is WordNet?

WordNet is a lexical database of English that NLTK exposes as a corpus reader. It lets Python users look up a word's definitions, synonyms, hypernyms, and antonyms, which makes it useful for text analysis. One downside is that WordNet does not cover contractions.


# NOUN

In [1]:
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')
from nltk.corpus import wordnet as wn
from nltk.wsd import lesk

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


In [2]:
wn.synsets('paper')  # all synsets for 'paper' (noun and verb senses)

[Synset('paper.n.01'),
 Synset('composition.n.08'),
 Synset('newspaper.n.01'),
 Synset('paper.n.04'),
 Synset('paper.n.05'),
 Synset('newspaper.n.02'),
 Synset('newspaper.n.03'),
 Synset('paper.v.01'),
 Synset('wallpaper.v.01')]

In [3]:
paper = wn.synset('paper.n.05')
wn.synset('paper.n.05').definition() # definition of synset

'a scholarly article describing the results of observations or stating hypotheses'

In [4]:
wn.synset('paper.n.05').examples()

['he has written many scientific papers']

In [5]:
wn.synset('paper.n.05').lemmas()

[Lemma('paper.n.05.paper')]

In [6]:
# List every noun synset of 'paper' with its definition and lemmas
paper_synsets = wn.synsets('paper', pos = wn.NOUN)
for sense in paper_synsets:
  lemmas = [l.name() for l in sense.lemmas()]
  print("Synset: " + sense.name() + "(" + sense.definition() + ")  \n\t Lemmas:" + str(lemmas))

Synset: paper.n.01(a material made of cellulose pulp derived mainly from wood or rags or certain grasses)  
	 Lemmas:['paper']
Synset: composition.n.08(an essay (especially one written as an assignment))  
	 Lemmas:['composition', 'paper', 'report', 'theme']
Synset: newspaper.n.01(a daily or weekly publication on folded sheets; contains news and articles and advertisements)  
	 Lemmas:['newspaper', 'paper']
Synset: paper.n.04(a medium for written communication)  
	 Lemmas:['paper']
Synset: paper.n.05(a scholarly article describing the results of observations or stating hypotheses)  
	 Lemmas:['paper']
Synset: newspaper.n.02(a business firm that publishes newspapers)  
	 Lemmas:['newspaper', 'paper', 'newspaper_publisher']
Synset: newspaper.n.03(the physical object that is the product of a newspaper publisher)  
	 Lemmas:['newspaper', 'paper']


The noun senses appear to be ordered by how frequently each sense is used. Whatever frequency data WordNet relies on seems closely aligned with academic usage, because the second result is 'composition'. I like that the first result matches the sense I had in mind, paper as a material. The senses also seem to run from more concrete to more abstract. The numbers in the synset names are not sequential in this list. That is because each synset is named after its first lemma, so composition.n.08 is the eighth noun sense of 'composition', not of 'paper'.
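The naming pattern in the list above can be made explicit by splitting a synset name into its three parts. This is a minimal sketch in plain Python; `parse_synset_name` is a hypothetical helper written for illustration, not part of NLTK.

```python
def parse_synset_name(name):
    """Split a synset name such as 'composition.n.08' into its parts."""
    # Split from the right so lemmas with underscores (or dots) stay whole.
    lemma, pos, sense = name.rsplit(".", 2)
    return lemma, pos, int(sense)


# Names copied from the synset list printed earlier in this notebook.
names = ["paper.n.01", "composition.n.08", "newspaper.n.01", "paper.n.04"]
for name in names:
    lemma, pos, sense = parse_synset_name(name)
    print(f"{name}: sense {sense} of '{lemma}' (part of speech: {pos})")
```

Read this way, the sense number belongs to the head lemma of each synset, which is why the numbers do not count up in the list for 'paper'.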

### NOUN: hypernyms, hyponyms, meronyms, holonyms, antonyms

In [7]:
paper.hypernyms()

[Synset('article.n.01')]

In [8]:
paper.hyponyms()

[]

In [9]:
paper.member_meronyms()

[]

In [10]:
paper.member_holonyms()

[]

In [11]:
paper.lemmas()[0].antonyms()

[]

# VERB

In [12]:
wn.synsets('spin') # output all synsets of spin

[Synset('spin.n.01'),
 Synset('spin.n.02'),
 Synset('spin.n.03'),
 Synset('tailspin.n.02'),
 Synset('spin.n.05'),
 Synset('spin.v.01'),
 Synset('spin.v.02'),
 Synset('whirl.v.02'),
 Synset('spin.v.04'),
 Synset('spin.v.05'),
 Synset('spin.v.06'),
 Synset('spin.v.07'),
 Synset('spin.v.08')]

In [13]:
spin = wn.synset('spin.n.03')  # note: this is a noun sense of 'spin'
wn.synset('spin.n.03').definition()

'a short drive in a car'

In [14]:
wn.synset('spin.n.03').examples()

['he took the new car for a spin']

In [15]:
wn.synset('spin.n.03').lemmas()

[Lemma('spin.n.03.spin')]

In [16]:
# List every verb synset of 'spin' with its definition and lemmas
spin_synsets = wn.synsets('spin', pos = wn.VERB)
for sense in spin_synsets:
  lemmas = [l.name() for l in sense.lemmas()]
  print("Synset: " + sense.name() + "(" + sense.definition() + ")  \n\t Lemmas:" + str(lemmas))

Synset: spin.v.01(revolve quickly and repeatedly around one's own axis)  
	 Lemmas:['spin', 'spin_around', 'whirl', 'reel', 'gyrate']
Synset: spin.v.02(stream in jets, of liquids)  
	 Lemmas:['spin']
Synset: whirl.v.02(cause to spin)  
	 Lemmas:['whirl', 'birl', 'spin', 'twirl']
Synset: spin.v.04(make up a story)  
	 Lemmas:['spin']
Synset: spin.v.05(form a web by making a thread)  
	 Lemmas:['spin']
Synset: spin.v.06(work natural fibers into a thread)  
	 Lemmas:['spin']
Synset: spin.v.07(twist and turn so as to give an intended interpretation)  
	 Lemmas:['spin']
Synset: spin.v.08(prolong or extend)  
	 Lemmas:['spin', 'spin_out']


For the verbs, WordNet again seems to order the senses by the most commonly used definition, similar to the nouns. However, the numbers for these synsets are in order, whereas the noun numbers were not. I found it interesting that spin.v.07 was listed well below spin.v.04 when the two definitions seem very close in meaning to me. It also seemed a bit odd to list them separately; I would have expected WordNet to keep only one variation of the definition.

In [17]:
wn.morphy('spin', wn.VERB)  # morphy returns the base form (lemma) of a word

'spin'

# Two Similar Words

In [18]:
# allowed/permitted | allow/permit
wn.synsets('allow') # output all synsets of allow

[Synset('let.v.01'),
 Synset('permit.v.01'),
 Synset('allow.v.03'),
 Synset('allow.v.04'),
 Synset('leave.v.06'),
 Synset('allow.v.06'),
 Synset('admit.v.05'),
 Synset('give_up.v.11'),
 Synset('allow.v.09'),
 Synset('allow.v.10')]

In [19]:
wn.synsets('permit') # output all synsets of permit

[Synset('license.n.01'),
 Synset('license.n.04'),
 Synset('permit.n.03'),
 Synset('permit.v.01'),
 Synset('let.v.01'),
 Synset('allow.v.10')]

In [20]:
allow = wn.synset('allow.v.10') # chosen synset
wn.synset('allow.v.10').definition()

'allow the presence of or allow (an activity) without opposing or prohibiting'

In [21]:
permit = wn.synset('permit.v.01') # chosen synset
wn.synset('permit.v.01').definition()

'consent to, give permission'

In [22]:
wn.wup_similarity(allow, permit) # Wu-Palmer result

0.8888888888888888
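The Wu-Palmer score is computed from depths in the hypernym hierarchy: twice the depth of the lowest common subsumer (LCS), divided by the sum of the depths of the two synsets. The sketch below applies that formula to a small hand-made hierarchy. The node names and the tree shape are hypothetical and are not WordNet's actual structure; they are chosen only to show one depth combination (4 and 5, with the LCS at depth 4) that produces 8/9.

```python
# Hypothetical toy hierarchy: child -> parent (None marks the root).
PARENT = {
    "entity": None,
    "act": "entity",
    "interact": "act",
    "permit": "interact",
    "tolerate": "permit",
    "forbid": "interact",
}


def path_to_root(node):
    """Return the list of nodes from `node` up to the root, inclusive."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(node))


def lowest_common_subsumer(a, b):
    """Deepest node that is an ancestor of (or equal to) both a and b."""
    ancestors_of_a = set(path_to_root(a))
    for node in path_to_root(b):  # walks upward, so the first hit is deepest
        if node in ancestors_of_a:
            return node
    raise ValueError("nodes do not share a root")


def wu_palmer(a, b):
    lcs = lowest_common_subsumer(a, b)
    return 2 * depth(lcs) / (depth(a) + depth(b))


print(wu_palmer("permit", "tolerate"))  # LCS is 'permit' itself
print(wu_palmer("permit", "forbid"))    # LCS is 'interact'
print(wu_palmer("permit", "permit"))    # identical nodes
```

A score of 1.0 is only reached when both inputs are the same node, which is why comparing a synset shared by both words would give the maximum score.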

### Lesk Algorithm

In [23]:
# lesk algorithm for "allow"
for ss in wn.synsets('allow'):
    print(ss, ss.definition())
sent1 = ['I', 'will', 'allow', 'it', '.']
print(lesk(sent1, 'allow', 'v'))
print(lesk(sent1, 'allow'))

Synset('let.v.01') make it possible through a specific action or lack of action for something to happen
Synset('permit.v.01') consent to, give permission
Synset('allow.v.03') let have
Synset('allow.v.04') give or assign a resource to a particular person or cause
Synset('leave.v.06') make a possibility or provide opportunity for; permit to be attainable or cause to remain
Synset('allow.v.06') allow or plan for a certain possibility; concede the truth or validity of something
Synset('admit.v.05') afford possibility
Synset('give_up.v.11') allow the other (baseball) team to score
Synset('allow.v.09') grant as a discount or in exchange
Synset('allow.v.10') allow the presence of or allow (an activity) without opposing or prohibiting
Synset('let.v.01')
Synset('let.v.01')


In [24]:
# lesk algorithm for "permit"
for ss in wn.synsets('permit'):
    print(ss, ss.definition())
sent2 = ['I', 'will', 'permit', 'it', '.']
print(lesk(sent2, 'permit', 'v'))  # use sent2, the sentence that contains 'permit'
print(lesk(sent2, 'permit'))

Synset('license.n.01') a legal document giving official permission to do something
Synset('license.n.04') the act of giving a formal (usually written) authorization
Synset('permit.n.03') large game fish; found in waters of the West Indies
Synset('permit.v.01') consent to, give permission
Synset('let.v.01') make it possible through a specific action or lack of action for something to happen
Synset('allow.v.10') allow the presence of or allow (an activity) without opposing or prohibiting
Synset('let.v.01')
Synset('let.v.01')


The Wu-Palmer similarity metric was easier to run than the Lesk algorithm because it only took one line. It gave me a score I was happy with, about 0.89. I agree with the result, and a different pair of synsets could have scored higher: 'allow' and 'permit' share let.v.01 and allow.v.10, and comparing a shared synset with itself gives 1.0. The Lesk algorithm does not produce a similarity score, but if it did I think it would rate the two words as identical, because it returned the same synset, let.v.01, for both. I also think that different sentences for each word would have produced different results, depending on how much the context words vary.
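To see why the sentence matters, here is a minimal sketch of the simplified Lesk idea in plain Python: pick the sense whose definition shares the most tokens with the context. The glosses are copied from the output above; `simplified_lesk` and the second example sentence are my own illustrations, and ties here go to the first sense listed, which may differ from how NLTK breaks ties.

```python
# Verb glosses for 'permit', copied from the WordNet output above.
GLOSSES = {
    "permit.v.01": "consent to, give permission",
    "let.v.01": "make it possible through a specific action or lack of "
                "action for something to happen",
    "allow.v.10": "allow the presence of or allow (an activity) without "
                  "opposing or prohibiting",
}


def simplified_lesk(context_tokens, glosses):
    """Return (sense, overlap) for the gloss sharing most tokens with context."""
    context = set(context_tokens)
    best_sense, best_overlap = None, -1
    for sense, gloss in glosses.items():
        # Whitespace split only, so 'to,' and 'activity)' keep their punctuation.
        overlap = len(context & set(gloss.split()))
        if overlap > best_overlap:  # strict '>' keeps the first sense on ties
            best_sense, best_overlap = sense, overlap
    return best_sense, best_overlap


short_sentence = ["I", "will", "permit", "it", "."]
longer_sentence = ["They", "will", "allow", "the", "activity",
                   "without", "opposing", "it", "."]

print(simplified_lesk(short_sentence, GLOSSES))
print(simplified_lesk(longer_sentence, GLOSSES))
```

With the short sentence the only overlapping token is 'it', so the choice rests on a single function word. A richer context shifts the overlap toward a different sense.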


In [25]:
print(lesk(sent2, 'permit', pos = 'a'))  # 'permit' has no adjective senses, so the result is None

None


# SentiWordNet

SentiWordNet assigns sentiment scores to WordNet synsets. Each synset has a positive, a negative, and an objective score; each score ranges from 0 to 1 and the three sum to 1. It scores word senses rather than full sentences, so a sentence-level score has to be built by combining the scores of its words. This can be used by chatbots to estimate how a person is feeling during the automated assistance process.

In [30]:
# Import the SentiWordNet corpus reader
from nltk.corpus import sentiwordnet as swn
nltk.download('sentiwordnet')
#wn.synsets('joy')
joy = swn.senti_synset('fantastic.s.02')
print(joy)
print("Positive score: ", joy.pos_score())
print("Negative score: ", joy.neg_score())
print("Objective score: ", joy.obj_score())

<fantastic.s.02: PosScore=0.75 NegScore=0.0>
Positive score:  0.75
Negative score:  0.0
Objective score:  0.25


[nltk_data] Downloading package sentiwordnet to /root/nltk_data...
[nltk_data]   Package sentiwordnet is already up-to-date!
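The three scores printed above are not independent: the objective score is whatever remains after the positive and negative scores are subtracted from 1. A quick check with the values shown for fantastic.s.02:

```python
# Scores printed above for fantastic.s.02
pos_score = 0.75
neg_score = 0.0

# The objective score is the remainder, so the three scores sum to 1.
obj_score = 1.0 - (pos_score + neg_score)
print(obj_score)
```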


In [31]:
sentence_joy = 'I am very excited for this weekend!'
# Using the word 'very' makes the result more positive than using
# the word 'so' or leaving the intensifier out.
negative = 0
positive = 0
tokens = sentence_joy.split()
for token in tokens:
  syn_list = list(swn.senti_synsets(token))
  if syn_list:
    syn = syn_list[0]
    negative += syn.neg_score()
    positive += syn.pos_score()

print('-', '\t', '+')
print(negative, '\t', positive)

- 	 +
0.375 	 0.75


I think these scores are a bit off because I expected the positive score to be higher, perhaps in the 0.8 to 0.9 range, with the negative score around 0.1 to 0.2. Note that the loop sums the first-sense scores of every token, so the totals are not normalized to the 0-to-1 range and they grow as scored words are added. That fits what I observed: using 'very' made the result more positive than using 'so', and leaving out the emphasis word altogether lowered the positive score.
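Two details of the loop above affect the totals: `split()` leaves punctuation attached to tokens (the last token is 'weekend!'), and the scores are summed rather than averaged. The sketch below shows one possible alternative, assuming a regex tokenizer and a per-token average. `tokenize`, `average_scores`, and the values in `TOY_SCORES` are hypothetical and made up for illustration; in practice the lookup would be backed by `swn.senti_synsets`.

```python
import re


def tokenize(sentence):
    """Lowercase the sentence and keep only runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", sentence.lower())


def average_scores(tokens, lookup):
    """Average (positive, negative) scores over tokens that have a score.

    `lookup` is any callable mapping a token to a (pos, neg) pair or None.
    """
    scored = [s for s in (lookup(t) for t in tokens) if s is not None]
    if not scored:
        return 0.0, 0.0
    pos = sum(p for p, _ in scored) / len(scored)
    neg = sum(n for _, n in scored) / len(scored)
    return pos, neg


# Hypothetical scores, NOT real SentiWordNet values.
TOY_SCORES = {"excited": (0.5, 0.25), "very": (0.25, 0.125)}

sentence_joy = "I am very excited for this weekend!"
print(sentence_joy.split()[-1])   # whitespace split keeps the '!'
tokens = tokenize(sentence_joy)
print(tokens[-1])                 # punctuation removed
print(average_scores(tokens, TOY_SCORES.get))
```

Averaging keeps the result between 0 and 1, at the cost of no longer rewarding extra positive words the way a running sum does.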

# Collocation

In [32]:
import nltk
nltk.download('gutenberg')
nltk.download('genesis')
nltk.download('inaugural')
nltk.download('nps_chat')
nltk.download('treebank')
nltk.download('webtext')
nltk.download('stopwords')
from nltk.book import *
from nltk.collocations import *
text4.collocations()

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.
[nltk_data] Downloading package genesis to /root/nltk_data...
[nltk_data]   Unzipping corpora/genesis.zip.
[nltk_data] Downloading package inaugural to /root/nltk_data...
[nltk_data]   Unzipping corpora/inaugural.zip.
[nltk_data] Downloading package nps_chat to /root/nltk_data...
[nltk_data]   Unzipping corpora/nps_chat.zip.
[nltk_data] Downloading package treebank to /root/nltk_data...
[nltk_data]   Unzipping corpora/treebank.zip.
[nltk_data] Downloading package webtext to /root/nltk_data...
[nltk_data]   Unzipping corpora/webtext.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
United States; fellow citizens; years ago; four years; Federal
Government; General Government; American people; Vice President; God
bless; Chief Justice; one another; fellow Americans; Old World;
Almighty God; Fellow citizens; Chief Magistrate; every citizen; Indian
tribes; public debt; foreign nations


A collocation is a pair or group of words that occur together more often than chance would predict. Common patterns are a noun next to a verb, or an adjective next to a noun. An example of a common collocation is "lions roar." Collocations also help give a fuller picture of what the speaker is trying to portray to the listener. With these extra details, the listener is in a better position to respond appropriately to what the speaker said.
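The raw ingredient of any collocation measure is a count of adjacent word pairs. As a minimal sketch, the snippet below counts bigrams in the opening words of the inaugural corpus using only the standard library. It is a plain frequency count for illustration, not the scoring that `text4.collocations()` applies.

```python
from collections import Counter

# Opening words of text4 (the Inaugural Address Corpus).
opening = "Fellow - Citizens of the Senate and of the House of Representatives"
tokens = opening.split()

# Pair each token with its right-hand neighbour and count the pairs.
bigram_counts = Counter(zip(tokens, tokens[1:]))

top_bigram, top_count = bigram_counts.most_common(1)[0]
print(top_bigram, top_count)
```

The most frequent pair here is made of function words, which hints at why raw frequency alone is a weak signal and why collocation tools usually filter or rescore the candidates.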

In [33]:
text = ' '.join(text4.tokens)
text[:100]

'Fellow - Citizens of the Senate and of the House of Representatives : Among the vicissitudes inciden'

In [34]:
import math
vocab = len(set(text4))

hg = text.count('Chief Justice')/vocab
print("p(Chief Justice) = ",hg )
h = text.count('Chief')/vocab
print("p(Chief) = ", h)
g = text.count('Justice')/vocab
print('p(Justice) = ', g)
pmi = math.log2(hg / (h * g))
print('pmi = ', pmi)

p(Chief Justice) =  0.001396508728179551
p(Chief) =  0.002793017456359102
p(Justice) =  0.002394014962593516
pmi =  7.706352115508489


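At first the mutual information results did not look very impressive to me. I was disappointed because I expected 'Chief Justice' to appear together more frequently, and the individual probabilities are all small. The PMI itself tells a different story, though. A positive value of about 7.7 means the pair occurs roughly 2^7.7 ≈ 209 times more often than it would if the two words were independent, so 'Chief Justice' is a strong collocation. In fact, p(Chief) is exactly twice p(Chief Justice), so half of the occurrences of 'Chief' are part of this phrase. One caveat is that the counts are divided by the vocabulary size rather than the total number of tokens, so the probabilities are only rough estimates.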
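To put the PMI value in perspective, the sketch below recomputes it from the three probabilities printed above and converts it back into a ratio. It also includes `bigram_pmi`, a hypothetical helper that estimates the probabilities from token and bigram counts instead of substring counts divided by vocabulary size. That normalization is my assumption about a more conventional estimate, and the toy sentence is made up for illustration.

```python
import math
from collections import Counter

# Probabilities exactly as printed in the cell above.
p_xy = 0.001396508728179551   # p(Chief Justice)
p_x = 0.002793017456359102    # p(Chief)
p_y = 0.002394014962593516    # p(Justice)

ratio = p_xy / (p_x * p_y)    # how many times more often than independence
pmi = math.log2(ratio)
print("ratio =", ratio, "pmi =", pmi)


def bigram_pmi(tokens, w1, w2):
    """PMI of the adjacent pair (w1, w2), estimated from token counts."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    if bigrams[(w1, w2)] == 0:
        return float("-inf")  # the pair never occurs together
    p_pair = bigrams[(w1, w2)] / (len(tokens) - 1)
    p_first = unigrams[w1] / len(tokens)
    p_second = unigrams[w2] / len(tokens)
    return math.log2(p_pair / (p_first * p_second))


# Made-up toy text, only to exercise the function.
toy = "the Chief Justice and the Chief Magistrate met the Chief Justice".split()
print(bigram_pmi(toy, "Chief", "Justice"))
```

A PMI of 0 would mean the words co-occur exactly as often as independence predicts; negative values mean they avoid each other, and positive values mean they attract.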