
# NLTK Library and WordNet

WordNet® is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept.

In Python, the NLTK library includes the English WordNet.

**To use WordNet, you first need to download the WordNet data through NLTK.**

 * **[NLTK](https://www.nltk.org/)** is the **N**atural **L**anguage **T**ool**k**it for Python.

In [0]:
import nltk

## WordNet

In [0]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [0]:
from nltk.corpus import wordnet as wn

Let's get a set of synonyms that share a common meaning.

In [0]:
dog = wn.synset('dog.n.01')
person = wn.synset('person.n.01')
cat = wn.synset('cat.n.01')
computer = wn.synset('computer.n.01')

### path_similarity()
path_similarity() returns a score denoting how similar two word senses are, based on the shortest path that connects the senses in the is-a (hypernym/hyponym) taxonomy. The score is 1 / (shortest path length + 1), so it lies in the range 0 to 1, and a sense compared with itself scores 1.

In [0]:
print("dog<->cat : ", wn.path_similarity(dog,cat))
print("person<->cat : ", wn.path_similarity(person,cat))
print("person<->dog : ", wn.path_similarity(person,dog))
print("person<->computer : ", wn.path_similarity(person,computer))

dog<->cat :  0.2
person<->cat :  0.1
person<->dog :  0.2
person<->computer :  0.1111111111111111
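
To make the score concrete, here is a minimal sketch that computes 1 / (shortest path length + 1) on a hand-made toy taxonomy. The `TOY_HYPERNYM` table and the `toy_*` function names are assumptions made for illustration; nothing here is read from WordNet, and NLTK's own implementation handles details (multiple hypernyms, verbs, missing paths) that this sketch ignores.

```python
from collections import deque

# Hand-made toy is-a taxonomy (child -> parent); NOT read from WordNet.
TOY_HYPERNYM = {
    "dog": "canine",
    "canine": "carnivore",
    "cat": "feline",
    "feline": "carnivore",
    "carnivore": "animal",
}


def toy_neighbors(node):
    """Nodes exactly one is-a edge away from `node`, going up or down."""
    up = [TOY_HYPERNYM[node]] if node in TOY_HYPERNYM else []
    down = [child for child, parent in TOY_HYPERNYM.items() if parent == node]
    return up + down


def toy_path_similarity(a, b):
    """Return 1 / (shortest path length + 1), using breadth-first search."""
    queue = deque([(a, 0)])
    seen = {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return 1 / (dist + 1)
        for nxt in toy_neighbors(node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # no path connects the two nodes


print(toy_path_similarity("dog", "cat"))  # 4 edges apart -> 0.2
print(toy_path_similarity("dog", "dog"))  # 0 edges apart -> 1.0
```

In this toy tree, dog and cat are four edges apart (dog, canine, carnivore, feline, cat), which gives the same 0.2 as the WordNet result above.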


### Wu-Palmer Similarity (wup_similarity() )
wup_similarity() returns a score denoting how similar two word senses are, based on the depth of the two senses in the taxonomy and that of their Least Common Subsumer (most specific ancestor node).

In [0]:
print("dog<->cat : ", wn.wup_similarity(dog,cat))
print("person<->cat : ", wn.wup_similarity(person,cat))
print("person<->dog : ", wn.wup_similarity(person,dog))
print("person<->computer : ", wn.wup_similarity(person,computer))

dog<->cat :  0.8571428571428571
person<->cat :  0.5714285714285714
person<->dog :  0.75
person<->computer :  0.5
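
The Wu-Palmer idea can be sketched as 2 * depth(LCS) / (depth(a) + depth(b)). The example below applies that formula to a hand-made toy taxonomy; `TOY_HYPERNYM`, `ancestors`, `depth` and `toy_wup_similarity` are hypothetical names for illustration. It counts depth in nodes from the root (the root has depth 1), which is an assumption of this sketch, so its numbers are not expected to match the WordNet scores above.

```python
# Hand-made toy is-a taxonomy (child -> parent); NOT read from WordNet.
TOY_HYPERNYM = {
    "dog": "canine",
    "canine": "carnivore",
    "cat": "feline",
    "feline": "carnivore",
    "carnivore": "animal",
}


def ancestors(node):
    """Path from `node` up to the root, starting with `node` itself."""
    path = [node]
    while path[-1] in TOY_HYPERNYM:
        path.append(TOY_HYPERNYM[path[-1]])
    return path


def depth(node):
    """Number of nodes on the path from the root down to `node`."""
    return len(ancestors(node))


def toy_wup_similarity(a, b):
    """2 * depth(LCS) / (depth(a) + depth(b)) on the toy taxonomy."""
    ancestors_of_b = set(ancestors(b))
    # Walking up from a, the first node shared with b is the deepest
    # common ancestor, i.e. the Least Common Subsumer.
    lcs = next((n for n in ancestors(a) if n in ancestors_of_b), None)
    if lcs is None:
        return None
    return 2 * depth(lcs) / (depth(a) + depth(b))


print(toy_wup_similarity("dog", "cat"))     # LCS is carnivore -> 0.5
print(toy_wup_similarity("dog", "canine"))  # LCS is canine, so the score is higher
```

The deeper the shared ancestor sits relative to the two senses, the closer the score gets to 1.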


# Exercise

Write a program that finds the words most and least similar to a given word by repeatedly expanding ("nesting") its synonyms and antonyms. To compare words, you can use the two similarity functions (path_similarity() and wup_similarity()). Add a nesting limit as an argument with a default value, so that by default your program nests no more than 6 times.

Useful information: [NLTK-WordNet](http://www.nltk.org/howto/wordnet.html#synsets)

Note. Some words do not have antonyms (dog, cat, person, computer, etc.)



In [0]:
def find_similar_synset(synset, max_nesting=6):
  similar_synsets = {
      "least":"",
      "most":""
  }
  
  # We assume the input is a Synset
  def itereator(synset, max_nesting, wordbag):
    synonyms = []
    antonyms = []

    for lemma in synset.lemmas():
      # Every synset that contains this lemma's word is a synonym candidate
      meaning = lemma.name()
      synonyms += wn.synsets(meaning)

      # antonyms() returns a (possibly empty) list of Lemma objects
      for ant in lemma.antonyms():
        antonyms.append(ant.synset())

    synonyms = list(set(synonyms))
    antonyms = list(set(antonyms))

    # NOTE: the recursive calls receive the same synset, so the walk collects only
    # its direct synonyms and antonyms; pass syn (or ant) instead to follow chains.
    if max_nesting > 0:
      max_nesting -= 1
      for syn in synonyms:
        if syn not in wordbag:
          # print("{}\t syn: {}".format(max_nesting, syn))
          wordbag.append(syn)
          itereator(synset, max_nesting, wordbag)

      for ant in antonyms:
        if ant not in wordbag:
          # print("{}\t ant: {}".format(max_nesting, ant))
          wordbag.append(ant)
          itereator(synset, max_nesting, wordbag)
   
  wordbag = []
  itereator(synset, max_nesting, wordbag)
  # print(wordbag)
  # print(len(wordbag))
  # The input synset is not a candidate for its own most/least similar synset
  if synset in wordbag:
    wordbag.remove(synset)
  
  least, most = 1, 0
  for item in wordbag:
    # Compare against the function argument, not a global variable
    simi = wn.wup_similarity(item, synset)
    # wup_similarity() returns None when the two synsets cannot be compared
    if simi:
      if simi < least:
        least = simi
        similar_synsets["least"] = item.name()

      if simi > most:
        most = simi
        similar_synsets["most"] = item.name()

  return similar_synsets

In [0]:
word = wn.synset('give.v.01')
print(word.definition())
print(word.examples()[0])
print(find_similar_synset(word, max_nesting=6))

cause to have, in the abstract sense or physical sense
She gave him a black eye
{'least': 'respect.v.01', 'most': 'value.n.06'}


In [0]:
# Submit the .ipynb file to Canvas (download it via "File" > "Download .ipynb")
# You can write extra functions

def find_similar_words(word, max_nesting=6):
  similar_words = {
      "least":"",
      "most":""
  }
  
  # We assume word is a str
  def itereator(word, max_nesting, wordbag):
    synonyms = []
    antonyms = []
    
    # Collect the lemma names of every sense of the word, plus their antonyms
    for syn in wn.synsets(word):
      for lemma in syn.lemmas():
        synonyms.append(lemma.name())
        for ant in lemma.antonyms():
          antonyms.append(ant.name())

    synonyms = list(set(synonyms))
    antonyms = list(set(antonyms))

    # NOTE: the recursive calls receive the same word, so the walk collects only
    # its direct synonyms and antonyms; pass syn (or ant) instead to follow chains.
    if max_nesting > 0:
      max_nesting -= 1
      for syn in synonyms:
        if syn not in wordbag:
          wordbag.append(syn)
          itereator(word, max_nesting, wordbag)

      for ant in antonyms:
        if ant not in wordbag:
          wordbag.append(ant)
          itereator(word, max_nesting, wordbag)
   
  wordbag = []
  itereator(word, max_nesting, wordbag)
  if word in wordbag:
    wordbag.remove(word)
  
  least, most= 1, 0
  word_syns = wn.synsets(word)
  for item in wordbag:
    sum_simi = 0
    count = 0
    for word_syn in word_syns:
      for item_syn in wn.synsets(item):
        simi = wn.wup_similarity(item_syn,word_syn)
        if simi: 
          count += 1
          sum_simi += simi
    
    if count > 0:
      simi = sum_simi/count
      if simi:
        if simi < least:
          least = simi
          similar_words["least"] = item

        if simi > most:
          most = simi
          similar_words["most"] = item
      
  return similar_words

In [0]:
find_similar_words('dog', max_nesting=6)

{'least': 'frank', 'most': 'domestic_dog'}

In [0]:
# How should similarity be handled when the parts of speech differ?
# Should we filter so that both synsets have the same part of speech?
a = wn.synset('good.s.20')
b = wn.synset('good.n.02')
print(wn.wup_similarity(a,b))
print(wn.path_similarity(a,b))

None
None
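
One possible answer to the question in the cell above is to skip pairs whose parts of speech differ before asking for a score. The sketch below works on synset name strings such as `'good.n.02'` so that it runs without the WordNet data; `pos_of` and `comparable` are hypothetical helper names. It also restricts comparison to nouns and verbs, on the assumption that only those are organised into is-a hierarchies in WordNet.

```python
def pos_of(synset_name):
    """Extract the part-of-speech tag from a name such as 'good.n.02'."""
    # Names have the form <lemma>.<pos>.<sense number>; split from the
    # right in case the lemma itself contains a dot.
    return synset_name.rsplit(".", 2)[1]


def comparable(name_a, name_b):
    """True when both names are nouns or both are verbs."""
    pos_a, pos_b = pos_of(name_a), pos_of(name_b)
    return pos_a == pos_b and pos_a in ("n", "v")


print(comparable("good.s.20", "good.n.02"))  # adjective vs noun -> False
print(comparable("dog.n.01", "cat.n.01"))    # noun vs noun -> True
```

With real `Synset` objects the same check could use `synset.pos()` instead of parsing the name.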


In [0]:
# Assume the input word is a str
def itereator(word, max_nesting, wordbag):
  synonyms = []
  antonyms = []
  
  # Get every sense (synset) of the word,
  # then collect the synonyms and antonyms of each sense
  for syn in wn.synsets(word):
    print("syn:",syn)
    for lemma in syn.lemmas():
      print("lemma:",lemma.name())
      synonyms.append(lemma.name())
      
      ant = lemma.antonyms()
      if ant:
        antonyms.append(ant[0].name())
  
  synonyms = list(set(synonyms))
  print(synonyms)
  antonyms = list(set(antonyms))
  print(antonyms)
      

The open question at this point is what lemmas() actually represents.

In [0]:
# Assume the input word is a Synset
def itereator(word, max_nesting, wordbag):
  synonyms = []
  antonyms = []

  for lemma in word.lemmas():
    meaning = lemma.name()
    synonyms += wn.synsets(meaning)

    ants = lemma.antonyms()
    if ants:
      for ant in ants:
        antonyms.append(ant.synset())

  synonyms = list(set(synonyms))
  antonyms = list(set(antonyms))
  
  # May need to change this to BFS
  
  # DFS
  if (max_nesting > 0):
    max_nesting -= 1
#     print(max_nesting > 0)
#     print(synonyms)
    for syn in synonyms:
#       print(max_nesting)
      if syn not in wordbag:
        print("{}\t syn: {}".format(max_nesting, syn))
        wordbag.append(syn)
        itereator(word, max_nesting, wordbag)

    for ant in antonyms:
      if ant not in wordbag:
        print("{}\t ant: {}".format(max_nesting, ant))
        wordbag.append(ant)
        itereator(word, max_nesting, wordbag)

wordbag = []
itereator(wn.synset('car.n.01'),6,wordbag)
print(wordbag)

5	 syn: Synset('car.n.04')
4	 syn: Synset('machine.v.02')
3	 syn: Synset('car.n.03')
2	 syn: Synset('cable_car.n.01')
1	 syn: Synset('machine.n.01')
0	 syn: Synset('machine.n.05')
0	 syn: Synset('machine.n.03')
0	 syn: Synset('car.n.02')
0	 syn: Synset('machine.n.04')
0	 syn: Synset('car.n.01')
0	 syn: Synset('machine.v.01')
0	 syn: Synset('automobile.v.01')
0	 syn: Synset('machine.n.02')
[Synset('car.n.04'), Synset('machine.v.02'), Synset('car.n.03'), Synset('cable_car.n.01'), Synset('machine.n.01'), Synset('machine.n.05'), Synset('machine.n.03'), Synset('car.n.02'), Synset('machine.n.04'), Synset('car.n.01'), Synset('machine.v.01'), Synset('automobile.v.01'), Synset('machine.n.02')]
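
The cell above labels its traversal as DFS and raises the idea of switching to BFS. As a minimal sketch of that alternative, the code below runs a breadth-first expansion with a nesting limit over a small hand-made link table. `TOY_LINKS` and `collect_related` are hypothetical names, and the links are invented for illustration rather than taken from WordNet.

```python
from collections import deque

# Hypothetical synonym/antonym links for illustration only; NOT WordNet data.
TOY_LINKS = {
    "car": ["auto", "machine"],
    "auto": ["car"],
    "machine": ["car", "engine"],
    "engine": ["motor"],
    "motor": ["engine"],
}


def collect_related(start, max_nesting=6, links=TOY_LINKS):
    """Return {word: level} for every word reachable within max_nesting links."""
    levels = {start: 0}
    queue = deque([start])
    while queue:
        word = queue.popleft()
        if levels[word] == max_nesting:
            continue  # nesting limit reached; do not expand this word
        for nxt in links.get(word, []):
            if nxt not in levels:
                levels[nxt] = levels[word] + 1
                queue.append(nxt)
    del levels[start]  # the start word is not a candidate for itself
    return levels


print(collect_related("car", max_nesting=1))  # {'auto': 1, 'machine': 1}
print(collect_related("car", max_nesting=3))
```

Because BFS visits words level by level, each word is recorded with the smallest number of nestings needed to reach it, and a dict lookup replaces the linear `not in wordbag` list scan. For WordNet, `links.get(word, [])` would be replaced by the lemma names and antonyms gathered from `wn.synsets(word)`.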


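*   As the example below shows, synsets() returns every synset that contains the input word.
*   The synonyms are then obtained from the lemmas of each of those synsets, one sense at a time.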





In [0]:
# Assumption: synsets() returns all the synonyms
def itereator(word, max_nesting, wordbag):
  synonyms = []
  antonyms = []


  # Get every sense (synset) of the word,
  # then collect the synonyms of each sense

  for syn in wn.synsets(word):
    print("     syn:",syn)
    for lemma in syn.lemmas():
      print("lemma:",lemma)
      synonyms.append(lemma.name())
  synonyms = list(set(synonyms))
  print(synonyms)
    
itereator('car',1,[])

     syn: Synset('car.n.01')
lemma: Lemma('car.n.01.car')
lemma: Lemma('car.n.01.auto')
lemma: Lemma('car.n.01.automobile')
lemma: Lemma('car.n.01.machine')
lemma: Lemma('car.n.01.motorcar')
     syn: Synset('car.n.02')
lemma: Lemma('car.n.02.car')
lemma: Lemma('car.n.02.railcar')
lemma: Lemma('car.n.02.railway_car')
lemma: Lemma('car.n.02.railroad_car')
     syn: Synset('car.n.03')
lemma: Lemma('car.n.03.car')
lemma: Lemma('car.n.03.gondola')
     syn: Synset('car.n.04')
lemma: Lemma('car.n.04.car')
lemma: Lemma('car.n.04.elevator_car')
     syn: Synset('cable_car.n.01')
lemma: Lemma('cable_car.n.01.cable_car')
lemma: Lemma('cable_car.n.01.car')
['railway_car', 'car', 'motorcar', 'gondola', 'railcar', 'railroad_car', 'elevator_car', 'machine', 'auto', 'cable_car', 'automobile']


# Extension
Test your program with 10 different words of your own choosing and find a reasonable limit for the number of nestings (by changing the value of the max_nesting argument).

In [0]:
# This is an extension task. You do not need to submit this task for your assessment.


words = ["test", "program", "ten", "different", "word", "assessment", "limit", "number", "nesting", "value"]
print("The length of word list is:", len(words))

max_nesting = 100
print("The max nesting is:", max_nesting)

table = []

for word in words:
#   word_syn = wn.synsets(word)[0]
#   similar_words = find_similar_words(word_syn, max_nesting)
#   table.append([word_syn.name(), similar_words["least"], similar_words["most"]])

  similar_words = find_similar_words(word, max_nesting)
  table.append([word, similar_words["least"], similar_words["most"]])
  
#   print(word+"\t\t", find_similar_words(word_syn, max_nesting))

import pandas as pd
table = pd.DataFrame(table)
table.columns = ['word','least','most']
table

The length of word list is: 10
The max nesting is: 100


|   | word | least | most |
|---|------|-------|------|
| 0 | test | prove | psychometric_test |
| 1 | program | platform | syllabus |
| 2 | ten | tenner | 10 |
| 3 | different | like | dissimilar |
| 4 | word | formulate | word_of_honor |
| 5 | assessment | judgement | appraisal |
| 6 | limit | specify | limit_point |
| 7 | number | numerate | phone_number |
| 8 | nesting | nuzzle | nest |
| 9 | value | assess | economic_value |
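
One way to look for a reasonable nesting limit is to raise max_nesting step by step and note the point after which the result no longer changes. The sketch below does this for any function of the nesting limit; `stable_threshold` and the stand-in `fake_find` are hypothetical names, and `fake_find` only imitates the shape of the dictionaries returned above.

```python
def stable_threshold(result_for, limit=10):
    """Smallest n in 1..limit from which the result stays equal up to limit."""
    results = [result_for(n) for n in range(1, limit + 1)]
    threshold = limit
    # Walk backwards while the result is identical to the final one
    for n in range(limit - 1, 0, -1):
        if results[n - 1] != results[-1]:
            break
        threshold = n
    return threshold


# Hypothetical stand-in whose answer stops changing at a nesting limit of 3
def fake_find(max_nesting):
    return {"least": "a", "most": "b" if max_nesting < 3 else "c"}


print(stable_threshold(fake_find))  # 3
```

With the function from the exercise this could be called as `stable_threshold(lambda n: find_similar_words("test", max_nesting=n))`, repeated for each of the ten words.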


In [0]:
# Problem words: different, limit

word = 'different'
item = 'dissimilar'

sum_simi = 0
count = 0
for word_syn in wn.synsets(word):
  for item_syn in wn.synsets(item):
    count += 1
    simi = wn.wup_similarity(item_syn,word_syn)
    print(count, '\t',word_syn, '\t',item_syn, '\t',simi)
    if simi: 
      sum_simi += simi

simi = sum_simi/count
print(simi)

1 	 Synset('different.a.01') 	 Synset('dissimilar.a.01') 	 None
2 	 Synset('different.a.01') 	 Synset('unalike.a.01') 	 None
3 	 Synset('different.a.01') 	 Synset('unlike.a.01') 	 None
4 	 Synset('different.s.02') 	 Synset('dissimilar.a.01') 	 None
5 	 Synset('different.s.02') 	 Synset('unalike.a.01') 	 None
6 	 Synset('different.s.02') 	 Synset('unlike.a.01') 	 None
7 	 Synset('different.s.03') 	 Synset('dissimilar.a.01') 	 None
8 	 Synset('different.s.03') 	 Synset('unalike.a.01') 	 None
9 	 Synset('different.s.03') 	 Synset('unlike.a.01') 	 None
10 	 Synset('unlike.a.01') 	 Synset('dissimilar.a.01') 	 None
11 	 Synset('unlike.a.01') 	 Synset('unalike.a.01') 	 None
12 	 Synset('unlike.a.01') 	 Synset('unlike.a.01') 	 1.0
13 	 Synset('different.s.05') 	 Synset('dissimilar.a.01') 	 None
14 	 Synset('different.s.05') 	 Synset('unalike.a.01') 	 None
15 	 Synset('different.s.05') 	 Synset('unlike.a.01') 	 None
0.06666666666666667
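
The debugging cell above divides by the number of all pairs, while find_similar_words() divides only by the number of pairs that produced a score. The sketch below contrasts the two averaging rules on a list shaped like the 15 pairs just printed (14 undefined scores and one 1.0); `mean_over_all` and `mean_over_defined` are hypothetical names for illustration.

```python
def mean_over_all(scores):
    """Average that treats undefined (None) scores as 0."""
    return sum(s for s in scores if s is not None) / len(scores)


def mean_over_defined(scores):
    """Average over the defined scores only; None if there are none."""
    defined = [s for s in scores if s is not None]
    return sum(defined) / len(defined) if defined else None


scores = [None] * 14 + [1.0]  # same shape as the 15 pairs printed above
print(mean_over_all(scores))      # 0.06666666666666667
print(mean_over_defined(scores))  # 1.0
```

Under the second rule a single matching pair is enough to give a word an average of 1.0, which would explain why 'dissimilar' comes out as the most similar word for 'different' in the table.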


In [0]:
# Look up the synonyms of a given word
print(wn.synsets('motorcar'))
# lemma_names is a method and must be called; without the parentheses
# the bound method itself is printed instead of the names
print(wn.synset('car.n.01').lemma_names())
motorcar = wn.synset('car.n.01')
types_of_motorcar = motorcar.hyponyms()
print(types_of_motorcar[26])
# lemmas and name are methods too; leaving out the parentheses raises
# "'method' object is not iterable"
# print([lemma.name() for synset in types_of_motorcar for lemma in synset.lemmas()])

# Antonyms
wn.lemma('supply.n.02.supply').antonyms()[0].name()


[Synset('car.n.01')]
['car', 'auto', 'automobile', 'machine', 'motorcar']
Synset('stanley_steamer.n.01')


'demand'