# **What is WordNet?** 👇

WordNet is a lexical database that groups English words into sets of synonyms based on their meanings. These groups are called "synsets" and are linked together by semantic relationships such as antonyms, hypernyms, hyponyms, and meronyms. It is used in various applications of Natural Language Processing.

Below is how we incorporate WordNet into our Python project.

In [15]:
import nltk
import math
from nltk.corpus import wordnet as wn
from nltk.wsd import lesk as lk
from nltk.corpus import sentiwordnet as swn
nltk.download('omw-1.4')
nltk.download('wordnet')
nltk.download('sentiwordnet')
nltk.download('book')
from nltk.book import *
text4

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package sentiwordnet to /root/nltk_data...
[nltk_data]   Package sentiwordnet is already up-to-date!
[nltk_data] Downloading collection 'book'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package brown to /root/nltk_data...
[nltk_data]    |   Package brown is already up-to-date!
[nltk_data]    | Downloading package chat80 to /root/nltk_data...
[nltk_data]    |   Package chat80 is already up-to-date!
[nltk_data]    | Downloading package cmudict to /root/nltk_data...
[nltk_data]    |   Package cmudict is already up-to-date!
[nltk_data]    | Downloading package conll2000 to /root/nltk_data...
[nltk_data]    |   Package con

<Text: Inaugural Address Corpus>

## Part 1: Picking a Noun

In the following code we pick a noun and run various WordNet operations on it to take advantage of WordNet's features. First we get the synsets of the noun.

In [16]:
wn.synsets('enemy')

[Synset('enemy.n.01'),
 Synset('enemy.n.02'),
 Synset('enemy.n.03'),
 Synset('foe.n.02')]

The synsets are displayed above. Now we pick the first one, 'enemy.n.01', and output its:

*   Definition
*   Usage example
*   Lemma

In [17]:
wn.synset('enemy.n.01').definition()

'an opposing military force'

In [18]:
wn.synset('enemy.n.01').examples()

['the enemy attacked at dawn']

In [19]:
wn.synset('enemy.n.01').lemmas()

[Lemma('enemy.n.01.enemy')]

Now we iterate over all noun synsets of the word 'enemy' to print each definition and its lemmas.

In [20]:
enemy_synsets = wn.synsets('enemy', pos=wn.NOUN)
for sense in enemy_synsets:
    lemmas = [l.name() for l in sense.lemmas()]
    print("Synset: " + sense.name() + "(" +sense.definition() + ")  \n\t Lemmas:" + str(lemmas))

Synset: enemy.n.01(an opposing military force)  
	 Lemmas:['enemy']
Synset: enemy.n.02(an armed adversary (especially a member of an opposing military force))  
	 Lemmas:['enemy', 'foe', 'foeman', 'opposition']
Synset: enemy.n.03(any hostile group of people)  
	 Lemmas:['enemy']
Synset: foe.n.02(a personal enemy)  
	 Lemmas:['foe', 'enemy']


As the output shows, WordNet lists the military senses first and ends with the more personal sense of the word. WordNet generally orders senses by how frequently they are used, so closely related definitions often end up next to each other.

Now we output the hypernyms, hyponyms, meronyms, and holonyms of our selected synset. Antonyms are handled separately below.

In [48]:
enemy = wn.synset('enemy.n.01')
print(enemy.hypernyms())
print(enemy.hyponyms())
print(enemy.part_meronyms())
print(enemy.part_holonyms())

[Synset('military_unit.n.01')]
[]
[]
[]


Getting an antonym is a bit different but not difficult: antonyms are defined on lemmas rather than on synsets, so we go through the synset's lemmas.

In [49]:
enemy = wn.synsets('enemy', pos=wn.NOUN)[0]
enemy

Synset('enemy.n.01')

In [47]:
enemy.lemmas()

[Lemma('enemy.n.01.enemy')]

In [45]:
enemy.lemmas()[0].antonyms()

[]

In this case, our noun doesn't have an antonym in WordNet, so an empty list was returned.

## Part 2: Picking a verb

Now we will do the same as above, but with a verb: 'climb'.



In [22]:
wn.synsets('climb')

[Synset('ascent.n.01'),
 Synset('climb.n.02'),
 Synset('climb.n.03'),
 Synset('climb.v.01'),
 Synset('climb.v.02'),
 Synset('wax.v.02'),
 Synset('climb.v.04'),
 Synset('climb.v.05'),
 Synset('rise.v.02')]

In [23]:
wn.synset('climb.v.01').definition()

'go upward with gradual or continuous progress'

In [24]:
wn.synset('climb.v.01').examples()

['Did you ever climb up the hill behind your house?']

In [25]:
wn.synset('climb.v.01').lemmas()

[Lemma('climb.v.01.climb'),
 Lemma('climb.v.01.climb_up'),
 Lemma('climb.v.01.mount'),
 Lemma('climb.v.01.go_up')]

In [26]:
# Collect only the verb senses of 'climb' and print each definition with its lemmas
climb_synsets = wn.synsets('climb', pos=wn.VERB)
for sense in climb_synsets:
    lemmas = [l.name() for l in sense.lemmas()]
    print("Synset: " + sense.name() + "(" +sense.definition() + ")  \n\t Lemmas:" + str(lemmas))

Synset: climb.v.01(go upward with gradual or continuous progress)  
	 Lemmas:['climb', 'climb_up', 'mount', 'go_up']
Synset: climb.v.02(move with difficulty, by grasping)  
	 Lemmas:['climb']
Synset: wax.v.02(go up or advance)  
	 Lemmas:['wax', 'mount', 'climb', 'rise']
Synset: climb.v.04(slope upward)  
	 Lemmas:['climb']
Synset: climb.v.05(improve one's social status)  
	 Lemmas:['climb']
Synset: rise.v.02(increase in value or to a higher point)  
	 Lemmas:['rise', 'go_up', 'climb']


As we see above, the most common usage appears at the top, and further down the list we see more figurative or niche uses of the verb.

## Part 3: Using Morphy

In this example we use morphy to reduce inflected forms of a word to a base form that exists in WordNet.

In [27]:

wn.morphy('steepest')

'steep'

In [40]:
wn.morphy('steeper')

'steeper'
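The two results above differ, and a toy version of the lookup may help show why. The sketch below is not NLTK's implementation: the lexicon, the suffix rules, and the name `toy_morphy` are made up for illustration. It assumes that the lookup tries each part of speech in turn and returns the first candidate found in the lexicon, and that 'steeper' is also listed as a noun (my assumption about why `wn.morphy('steeper')` came back unchanged).

```python
# Hypothetical toy lexicon keyed by part of speech; this is not real WordNet data.
LEXICON = {
    "n": {"steeper", "hill"},  # assumption: 'steeper' exists as a noun
    "a": {"steep"},
}

# Simplified (suffix, replacement) rules per part of speech.
RULES = {
    "n": [("s", "")],
    "a": [("er", ""), ("est", "")],
}


def toy_morphy(form, pos=None):
    """Return the first candidate base form found in the toy lexicon, or None."""
    for p in ([pos] if pos else ["n", "a"]):
        # The unchanged form is tried first, then each suffix-stripped variant.
        candidates = [form]
        for suffix, replacement in RULES[p]:
            if form.endswith(suffix):
                candidates.append(form[: len(form) - len(suffix)] + replacement)
        for candidate in candidates:
            if candidate in LEXICON[p]:
                return candidate
    return None


superlative = toy_morphy("steepest")            # only the adjective rules match
comparative = toy_morphy("steeper")             # found as a noun before adjectives are tried
comparative_adj = toy_morphy("steeper", "a")    # restricting the part of speech
```

With real WordNet data, passing a part of speech, as in `wn.morphy('steeper', wn.ADJ)`, should restrict the lookup in a similar way.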

## Part 4: Using similarities

Here we are going to pick two similar words and use the Wu-Palmer similarity metric and the Lesk algorithm to find some interesting information.

Using the Wu-Palmer algorithm

In [28]:
first_word = wn.synset('snake.n.01')
second_word = wn.synset('python.n.01')
print(first_word.definition())
print(second_word.definition())
wn.wup_similarity(first_word, second_word)

limbless scaly elongate reptile; some are venomous
large Old World boas


0.8888888888888888
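To make the score above less of a black box, here is a minimal sketch of the idea behind Wu-Palmer similarity: twice the depth of the deepest shared ancestor, divided by the sum of the depths of the two senses. The taxonomy and the helper names (`PARENT`, `path_to_root`, `depth`, `wup`) are hypothetical and much smaller than the real WordNet hierarchy, and NLTK's own depth calculation may differ in detail.

```python
# Hypothetical toy taxonomy (child -> parent); not the real WordNet hierarchy.
PARENT = {
    "animal": "entity",
    "reptile": "animal",
    "snake": "reptile",
    "boa": "snake",
    "python": "boa",
    "mammal": "animal",
    "dog": "mammal",
}


def path_to_root(node):
    """Return the list of nodes from `node` up to the root, inclusive."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path


def depth(node):
    """Depth of a node, counting the root as depth 1."""
    return len(path_to_root(node))


def wup(a, b):
    """Wu-Palmer style similarity: 2 * depth(LCS) / (depth(a) + depth(b))."""
    ancestors_b = set(path_to_root(b))
    # Walking up from `a`, the first node that is also an ancestor of `b`
    # is the deepest common ancestor (least common subsumer).
    lcs = next(n for n in path_to_root(a) if n in ancestors_b)
    return 2 * depth(lcs) / (depth(a) + depth(b))


close_pair = wup("snake", "python")  # shared ancestor is 'snake' itself
far_pair = wup("snake", "dog")       # shared ancestor is only 'animal'
```

In this toy tree the snake/python pair scores higher than the snake/dog pair because their shared ancestor sits deeper in the hierarchy.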

Using the Lesk algorithm

In [29]:
lk('snake', 'python')

Synset('python.n.02')

In [30]:
lk('python', 'snake')

Synset('snake.v.03')

In [31]:
wn.synset('python.n.02').definition()

'a soothsaying spirit or a person who is possessed by such a spirit'

In [32]:
wn.synset('snake.v.03').definition()

'move along a winding path'

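It is interesting that the two approaches landed on different senses. For Wu-Palmer we chose the everyday animal senses ourselves and got a similarity score back, while Lesk picked more obscure senses on its own: a soothsaying spirit for 'python' and a verb of motion for 'snake'. Note that `lesk` expects its first argument to be a tokenized context sentence, so a single bare word gives it very little context to work with. If I were to pick one, I would go with Wu-Palmer, since it gives me a value for how closely two senses are related and could be more widely applicable. The only time I would use the Lesk algorithm is if I didn't need a specific definition.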
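The core idea of Lesk is to pick the sense whose definition shares the most words with the context. The sketch below is a simplified version of that idea, not NLTK's implementation; `simple_lesk` and `GLOSSES` are hypothetical names, and the two glosses are copied from the definitions printed earlier in this notebook.

```python
# Glosses copied from the definitions shown above; only two senses are included.
GLOSSES = {
    "python.n.01": "large Old World boas",
    "python.n.02": "a soothsaying spirit or a person who is possessed by such a spirit",
}


def simple_lesk(context_tokens, glosses):
    """Return the sense whose gloss overlaps most with the context tokens."""
    context = {token.lower() for token in context_tokens}
    best_sense, best_overlap = None, -1
    for sense, gloss in glosses.items():
        overlap = len(context & set(gloss.lower().split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


# A tokenized sentence gives the overlap count real words to match on.
with_sentence = simple_lesk("I saw large boas at the zoo".split(), GLOSSES)

# Iterating over a bare string yields single characters, so the only
# possible matches are one-letter gloss words such as 'a'.
with_bare_string = simple_lesk("snake", GLOSSES)
```

In this sketch the bare string 'snake' selects the soothsaying-spirit sense purely because the letter 'a' matches the word 'a' in that gloss. Something similar may be behind the obscure senses returned above, though that is an assumption rather than something verified here.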

## Part 5: SentiWordNet

SentiWordNet is a lexical tool used for analyzing the sentiment of a text. It assigns each synset three scores: positivity, negativity, and objectivity. It can be used in various NLP applications; for example, it could be used to gauge the general opinion about a specific piece of media.

Each score ranges from 0 to 1, where a value closer to 1 means a stronger association with that category. The three scores of a synset sum to 1.

In [33]:

for word_use in swn.senti_synsets('hate'):
    print("\nNegative score = ", word_use.neg_score())
    print("Positive score = ", word_use.pos_score())
    print("Objective score = ", word_use.obj_score())


Negative score =  0.375
Positive score =  0.125
Objective score =  0.5

Negative score =  0.75
Positive score =  0.0
Objective score =  0.25


The word 'hate' didn't always get a positive score of zero. I assume this is because one may hate bad things, which is good.

Now we will do the same thing with a whole sentence, using only the first sense of each word.

In [34]:
sentence = 'Computer Science majors hate touching grass'

neg = 0
pos = 0
tokens = sentence.split()
for token in tokens:
    print('\nWord: ' + token)
    syn_list = list(swn.senti_synsets(token))
    if syn_list:
        syn = syn_list[0]
        neg += syn.neg_score()
        pos += syn.pos_score()
        print(syn.neg_score())
        print(syn.pos_score())
    
print("\nneg\tpos counts")
print(neg, '\t', pos)



Word: Computer
0.0
0.0

Word: Science
0.0
0.0

Word: majors
0.0
0.125

Word: hate
0.375
0.125

Word: touching
0.0
0.0

Word: grass
0.0
0.0

neg	pos counts
0.375 	 0.25


As we see, the sentence scores as more negative than positive (0.375 vs. 0.25), which fits since it is poking fun at CS majors, and I would say SentiWordNet did a decent job. With a larger input text, a more reliable score could be obtained. Of course, this approach has limitations, since the sentiment of a word can be amplified or changed by the context of surrounding words. For example, the word 'majors' was scored as slightly positive even though in this context it shouldn't affect the sentiment.
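One way to soften the first-sense limitation described above is to average the scores over all senses of a word instead of taking only the first. The sketch below shows that idea on a tiny hand-written lexicon rather than on SentiWordNet itself. `TOY_LEXICON` and `score_sentence` are hypothetical names, and the scores are the (positive, negative) pairs printed earlier in this notebook for 'hate' and 'majors'; words missing from the toy lexicon are treated as neutral, which is an assumption.

```python
# Hypothetical toy lexicon: word -> list of (positive, negative) scores per sense.
# Values are copied from the outputs above; everything else counts as neutral.
TOY_LEXICON = {
    "hate": [(0.125, 0.375), (0.0, 0.75)],
    "majors": [(0.125, 0.0)],
}


def score_sentence(sentence, lexicon):
    """Sum per-word scores, averaging each word's scores over all of its senses."""
    pos_total, neg_total = 0.0, 0.0
    for token in sentence.lower().split():
        senses = lexicon.get(token)
        if not senses:
            continue  # unknown words contribute nothing
        pos_total += sum(p for p, _ in senses) / len(senses)
        neg_total += sum(n for _, n in senses) / len(senses)
    return pos_total, neg_total


pos, neg = score_sentence("Computer Science majors hate touching grass", TOY_LEXICON)
```

Averaging makes the stronger second sense of 'hate' count, so the negative total rises relative to the first-sense approach. Whether that is an improvement depends on the text, and it still ignores the surrounding context.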

## Part 6: Collocation

In Natural Language Processing, collocation refers to words in a language that often appear together. For instance, the words 'milk tea' often appear together because it is a popular beverage, while the words 'milk soda' appear together much less often since they aren't a common combination (at least as far as I am aware).

In [35]:
text4.name

'Inaugural Address Corpus'

In [36]:
text4.collocations()

United States; fellow citizens; years ago; four years; Federal
Government; General Government; American people; Vice President; God
bless; Chief Justice; one another; fellow Americans; Old World;
Almighty God; Fellow citizens; Chief Magistrate; every citizen; Indian
tribes; public debt; foreign nations


As we see, there are common collocations in the text such as 'United States' and 'American people', which makes sense given that text4 is the Inaugural Address Corpus.
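As a rough illustration of what finding collocations involves, the sketch below counts adjacent word pairs in a short made-up token list and skips pairs that contain a stopword. This is a simplification under my own assumptions: `top_bigrams`, the sample tokens, and the stopword set are hypothetical, and NLTK's `collocations()` filters and scores candidates with its own method rather than using raw counts.

```python
from collections import Counter


def top_bigrams(tokens, stopwords, n=3):
    """Return the n most frequent adjacent word pairs that contain no stopword."""
    pairs = zip(tokens, tokens[1:])
    counts = Counter(
        (first, second)
        for first, second in pairs
        if first.lower() not in stopwords and second.lower() not in stopwords
    )
    return counts.most_common(n)


# Made-up sample text, not taken from text4.
sample_tokens = "the United States and the people of the United States".split()
sample_stopwords = {"the", "and", "of"}

top = top_bigrams(sample_tokens, sample_stopwords)
```

Filtering stopwords matters here: without it, pairs such as ('the', 'United') would rank as high as the actual collocation.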

## Part 7: Mutual Information

Mutual information is a statistic that measures the association between two words in a given text. A higher mutual information indicates a stronger dependence between the words.

Now let us take a collocation from text4 and calculate the pointwise mutual information (PMI) for it.

The formula is log2[P(x,y) / [P(x) * P(y)]], which compares the probability of the two words appearing together with the probability expected if each word appeared independently of the other.
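The formula above can be sketched with plain token counts. This is a minimal version under my own assumptions: probabilities are estimated as counts divided by the number of tokens (or the number of adjacent pairs for the joint probability), and the function name `pmi` and the sample sentence are hypothetical.

```python
import math
from collections import Counter


def pmi(tokens, x, y):
    """Pointwise mutual information of the adjacent pair (x, y) in a token list."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    if bigrams[(x, y)] == 0:
        return float("-inf")  # the pair never occurs together
    p_x = unigrams[x] / len(tokens)
    p_y = unigrams[y] / len(tokens)
    p_xy = bigrams[(x, y)] / (len(tokens) - 1)  # number of adjacent pairs
    return math.log2(p_xy / (p_x * p_y))


# Made-up sample text, not taken from text4.
sample = "the United States and the people of the United States thank the people".split()

pmi_united_states = pmi(sample, "United", "States")
pmi_the_people = pmi(sample, "the", "people")
```

Both pairs occur twice in the sample, but 'the' is frequent on its own, so 'the people' gets a lower PMI than 'United States'. Counting whole tokens rather than substrings also avoids matching a word inside a longer word.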

In [37]:
text = ' '.join(text4.tokens)

In [38]:
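# Note: this divides by the number of unique tokens (vocabulary size) rather than
# the total token count, and str.count matches substrings, so these values are
# rough estimates rather than true probabilities.
vocab = len(set(text4))
hg = text.count('United States')/vocab
print("p(United States) = ",hg )
h = text.count('United')/vocab
print("p(United) = ", h)
g = text.count('States')/vocab
print('p(States) = ', g)
pmi = math.log2(hg / (h * g))
print('pmi = ', pmi)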

p(United States) =  0.015860349127182045
p(United) =  0.0170573566084788
p(States) =  0.03301745635910224
pmi =  4.815657649820885


In [41]:
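vocab = len(set(text4))
hg = text.count('in the')/vocab
print("p(in the) = ",hg )
# The trailing space keeps 'in' and 'the' from matching word prefixes such as
# 'into' or 'them'; words that merely end in these letters are still counted.
h = text.count('in ')/vocab
print("p(in) = ", h)
g = text.count('the ')/vocab
print('p(the) = ', g)
pmi = math.log2(hg / (h * g))
print('pmi = ', pmi)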

p(in the) =  0.09276807980049875
p(in) =  0.30733167082294266
p(the) =  0.9533167082294264
pmi =  -1.659123548063425


As we see above, in this text the words 'United States' have a higher PMI than the words 'in the', which means they depend more strongly on each other. That being said, a different text could give different scores. This score can be useful in NLP since it could help with tasks such as information retrieval, text analysis, or sentiment analysis.