# WordNet

http://wordnetweb.princeton.edu/perl/webwn

In [1]:
from nltk.corpus import wordnet

Let's see how WordNet works, using the noun *star* as an example.

1) Find all senses of the word *star*.

In [7]:
star = wordnet.synsets('star')
for ss in star:
    print(ss, ss.definition())

Synset('star.n.01') (astronomy) a celestial body of hot gases that radiates energy derived from thermonuclear reactions in the interior
Synset('ace.n.03') someone who is dazzlingly skilled in any field
Synset('star.n.03') any celestial body visible (as a point of light) from the Earth at night
Synset('star.n.04') an actor who plays a principal role
Synset('star.n.05') a plane figure with 5 or more points; often used as an emblem
Synset('headliner.n.01') a performer who receives prominent billing
Synset('asterisk.n.01') a star-shaped character * used in printing
Synset('star_topology.n.01') the topology of a network whose components are connected to a hub
Synset('star.v.01') feature as the star
Synset('star.v.02') be the star in a performance
Synset('star.v.03') mark with an asterisk
Synset('leading.s.01') indicating the most important performer or role


2) Print the definitions for the senses (a) "star in the sky" and (b) "famous actor".

In [8]:
print(star[2], star[2].definition())
print(star[3], star[3].definition())

Synset('star.n.03') any celestial body visible (as a point of light) from the Earth at night
Synset('star.n.04') an actor who plays a principal role


3) Take two arbitrary contexts for the word *star*, one for sense (a) "star in the sky" and one for sense (b) "famous actor".

In [10]:
sent1 = 'Have you ever looked up into the night sky and wondered just how many stars there are in space?'
sent2 = 'Because every kid thinks they should be a rap mogul or a movie star.'
sent1_tokens = [word.strip('.,?') for word in sent1.split(' ')]
sent2_tokens = [word.strip('.,?') for word in sent2.split(' ')]
print(sent1_tokens)
print(sent2_tokens)

['Have', 'you', 'ever', 'looked', 'up', 'into', 'the', 'night', 'sky', 'and', 'wondered', 'just', 'how', 'many', 'stars', 'there', 'are', 'in', 'space']
['Because', 'every', 'kid', 'thinks', 'they', 'should', 'be', 'a', 'rap', 'mogul', 'or', 'a', 'movie', 'star']


Apply the Lesk algorithm to disambiguate *star* in both sentences:  
https://ru.wikipedia.org/wiki/Метод_Леска

In [11]:
from nltk.wsd import lesk
print(lesk(sent1_tokens, 'star').definition())
print(lesk(sent2_tokens, 'star').definition())

the topology of a network whose components are connected to a hub
be the star in a performance


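The algorithm got the first case wrong. In the second case it returned a definition close to the correct one, although it is a verb sense (`star.v.02`) rather than the noun sense `star.n.04`.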
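
One way to see why the first sentence goes wrong is to count gloss overlaps by hand. The sketch below is a simplified Lesk in plain Python, not NLTK's implementation: it keeps only three glosses copied from the synset listing above, lowercases tokens, and uses a hand-picked stopword list (`STOPWORDS` is an assumption made for illustration).

```python
# Glosses copied from the synset listing above; only three senses are kept for brevity.
GLOSSES = {
    "star.n.03": "any celestial body visible (as a point of light) from the Earth at night",
    "star.n.04": "an actor who plays a principal role",
    "star_topology.n.01": "the topology of a network whose components are connected to a hub",
}

# Hand-picked function words (an assumption, not an NLTK resource).
STOPWORDS = frozenset({"the", "are", "a", "an", "of", "to", "in", "and",
                       "as", "at", "from", "any", "who", "whose"})


def gloss_overlap(context_tokens, gloss, stopwords=frozenset()):
    """Count distinct lowercase tokens shared by the context and the gloss."""
    context = {t.lower() for t in context_tokens} - stopwords
    gloss_tokens = {t.strip("().,;").lower() for t in gloss.split()} - stopwords
    return len(context & gloss_tokens)


sent1_tokens = ['Have', 'you', 'ever', 'looked', 'up', 'into', 'the', 'night',
                'sky', 'and', 'wondered', 'just', 'how', 'many', 'stars',
                'there', 'are', 'in', 'space']

# Overlap per sense, first with all tokens, then with function words removed.
raw = {s: gloss_overlap(sent1_tokens, g) for s, g in GLOSSES.items()}
filtered = {s: gloss_overlap(sent1_tokens, g, STOPWORDS) for s, g in GLOSSES.items()}
print(raw)
print(filtered)
```

With all tokens counted, the celestial sense and the network-topology sense tie, and the tie comes only from function words such as *the* and *are*. Once those are dropped, only the celestial sense still overlaps with the context (through *night*).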

4) Find the hypernyms of sense (a) "star in the sky" and of sense (b) "famous actor".

In [12]:
for ss in star[2].hypernyms():
    print(ss, ss.definition())
for ss in star[3].hypernyms():
    print(ss, ss.definition())

Synset('celestial_body.n.01') natural objects visible in the sky
Synset('actor.n.01') a theatrical performer


The hypernyms are a celestial body and an actor.

5) Compute the shortest distance between the "famous actor" sense of *star* and the senses of the lexeme *movie* (looked up in the code as *film*), and  
the shortest distance between the "star in the sky" sense of *star* and the senses of the lexeme *sky*.  
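
Distance here means, roughly, the number of edges on the shortest path between two senses that goes up the hypernym links to a shared ancestor. A minimal sketch of that idea on a made-up taxonomy follows; the `HYPERNYMS` table and the function names are hypothetical and do not reproduce the real WordNet graph or NLTK's code.

```python
from collections import deque

# Hypothetical child -> hypernyms links, invented for illustration only.
HYPERNYMS = {
    "movie_star": ["actor"],
    "actor": ["performer"],
    "performer": ["person"],
    "person": ["entity"],
    "star": ["celestial_body"],
    "celestial_body": ["natural_object"],
    "natural_object": ["entity"],
    "entity": [],
}


def ancestor_distances(node):
    """Map the node and each of its ancestors to the number of edges up from the node."""
    dist = {node: 0}
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for parent in HYPERNYMS.get(current, []):
            if parent not in dist:
                dist[parent] = dist[current] + 1
                queue.append(parent)
    return dist


def toy_shortest_path_distance(a, b):
    """Shortest path between a and b through any shared ancestor, or None."""
    up_a, up_b = ancestor_distances(a), ancestor_distances(b)
    common = up_a.keys() & up_b.keys()
    if not common:
        return None
    return min(up_a[n] + up_b[n] for n in common)


d_actor = toy_shortest_path_distance("movie_star", "actor")
d_star = toy_shortest_path_distance("movie_star", "star")
print(d_actor, d_star)
```

In this toy graph the actor senses are one edge apart, while the path to the celestial sense has to climb all the way to the root and back down.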

In [20]:
film = wordnet.synsets('film')
sky = wordnet.synsets('sky')

def get_dist_sim(ss1, lexeme):
    distances = []
    similarities = []
    for ss2 in lexeme:
        dist = ss1.shortest_path_distance(ss2)
        if dist is not None:
            distances.append(dist)
            sim = ss1.path_similarity(ss2)
            similarities.append(sim)
    return distances, similarities

# min d(star: "star in the sky", sky)
dist1 = get_dist_sim(star[2], sky)[0]
print('min d(star: "star in the sky", sky): {}'.format(min(dist1)))
print('closest lemma definition: {}\n'.format(sky[dist1.index(min(dist1))].definition()))
# Note: only one closest sense is printed here; several senses may tie.

# min d(star: "famous actor", film)
dist2 = get_dist_sim(star[3], film)[0]
print('min d(star: "famous actor", film): {}'.format(min(dist2)))
print('closest lemma definition: {}\n'.format(film[dist2.index(min(dist2))].definition()))

# min d(star: "star in the sky", film)
dist3 = get_dist_sim(star[2], film)[0]
print('min d(star: "star in the sky", film): {}'.format(min(dist3)))
print('closest lemma definition: {}\n'.format(film[dist3.index(min(dist3))].definition()))

# min d(star: "famous actor", sky)
dist4 = get_dist_sim(star[3], sky)[0]
print('min d(star: "famous actor", sky): {}'.format(min(dist4)))
print('closest lemma definition: {}\n'.format(sky[dist4.index(min(dist4))].definition()))

# Each summary line compares the two distances measured from the same sense of star.
print('min (d(star: "star in the sky", sky), d(star: "star in the sky", film)): {}'.format(min(min(dist1), min(dist3))))
print('min (d(star: "famous actor", sky), d(star: "famous actor", film)): {}'.format(min(min(dist4), min(dist2))))

min d(star: "star in the sky", sky): 10
closest lemma definition: the atmosphere and outer space as viewed from the earth

min d(star: "famous actor", film): 8
closest lemma definition: a thin coating or layer

min d(star: "star in the sky", film): 5
closest lemma definition: a thin coating or layer

min d(star: "famous actor", sky): 11
closest lemma definition: the atmosphere and outer space as viewed from the earth

min (d(star: "star in the sky", sky), d(star: "star in the sky", film)): 5
min (d(star: "famous actor", sky), d(star: "famous actor", film)): 8


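The shortest distance from the celestial sense of *star* to *sky* is 10 (the correct sense of *sky* is picked), but the distance to *film* is only 5, so by this measure a star in the sky is closer to *film* than to *sky*, which is wrong.  
The shortest distance from the actor sense to *film* is 8 and to *sky* is 11. Here *film* is closer, as expected, but the closest sense of *film* ("a thin coating or layer") is not the cinematic one.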
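
The cell above looks up the closest sense with `dist.index(min(dist))`, which indexes a list from which senses without a path have already been dropped, and it reports only one sense even when several tie. A small sketch of a helper that keeps each distance paired with its sense avoids both issues. The name `closest_senses` and the toy distances are hypothetical, chosen only to show the behaviour.

```python
def closest_senses(candidates, distance_fn):
    """Return (min_distance, all candidates at that distance), skipping None distances."""
    scored = [(distance_fn(c), c) for c in candidates]
    scored = [(d, c) for d, c in scored if d is not None]
    if not scored:
        return None, []
    best = min(d for d, _ in scored)
    return best, [c for d, c in scored if d == best]


# Made-up distances for illustration; None marks a sense with no connecting path.
toy_distances = {"film.v.01": None, "movie.n.01": 9, "film.n.04": 8, "film.n.05": 8}
best, senses = closest_senses(toy_distances, toy_distances.get)
print(best, senses)
```

With NLTK the same helper could presumably be called as `closest_senses(film, star[3].shortest_path_distance)`, since the bound method takes the other synset as its only required argument.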

In [15]:
sky = wordnet.synsets('sky')
for ss in sky:
    print(ss, ss.definition())

Synset('sky.n.01') the atmosphere and outer space as viewed from the earth
Synset('flip.v.06') throw or toss with a light motion


In [17]:
film = wordnet.synsets('film')
for ss in film:
    print(ss, ss.definition())

Synset('movie.n.01') a form of entertainment that enacts a story by sound and a sequence of images giving the illusion of continuous movement
Synset('film.n.02') a medium that disseminates moving pictures
Synset('film.n.03') photographic material consisting of a base of celluloid covered with a photographic emulsion; used to make negatives or transparencies
Synset('film.n.04') a thin coating or layer
Synset('film.n.05') a thin sheet of (usually plastic and usually transparent) material used to wrap or cover things
Synset('film.v.01') make a film or photograph of something
Synset('film.v.02') record in film


6) Compute the similarity between the "star in the sky" sense of *star* and *cosmos*.

In [24]:
cosmos = wordnet.synsets("cosmos")
for ss in cosmos:
    print(ss, ss.definition())

Synset('universe.n.01') everything that exists anywhere
Synset('cosmos.n.02') any of various mostly Mexican herbs of the genus Cosmos having radiate heads of variously colored flowers and pinnate leaves; popular fall-blooming annuals


We compute similarity with Path Similarity, Leacock-Chodorow Similarity, Wu-Palmer Similarity, Resnik Similarity, Jiang-Conrath Similarity, and Lin Similarity. For the last three we use the Information Content of the Brown corpus.

In [26]:
from nltk.corpus import wordnet_ic
ic = wordnet_ic.ic('ic-brown.dat')

# Path Similarity
print(star[2].path_similarity(cosmos[0]))

# Leacock-Chodorow Similarity
print(star[2].lch_similarity(cosmos[0]))

# Wu-Palmer Similarity
print(star[2].wup_similarity(cosmos[0]))

# Resnik Similarity
print(star[2].res_similarity(cosmos[0], ic))

# Jiang-Conrath Similarity
print(star[2].jcn_similarity(cosmos[0], ic))

# Lin Similarity
print(star[2].lin_similarity(cosmos[0], ic))

0.25
2.2512917986064953
0.7692307692307693
4.94037581120928
0.1248115233147824
0.5522184208452325
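
The first two numbers can be reproduced from the path length alone. Path Similarity is 1 / (d + 1) and Leacock-Chodorow is -ln((d + 1) / (2 D)), where d is the shortest path distance and D is the maximum depth of the taxonomy. The sketch below assumes d = 3 (implied by 0.25 = 1 / (3 + 1)) and D = 19; the value of D is inferred from the printed output, not read from WordNet.

```python
import math


def path_similarity(distance):
    """Path Similarity: inverse of the path length counted in nodes."""
    return 1.0 / (distance + 1)


def lch_similarity(distance, max_depth):
    """Leacock-Chodorow: negative log of the path length scaled by twice the taxonomy depth."""
    return -math.log((distance + 1) / (2.0 * max_depth))


MAX_DEPTH = 19  # assumption: inferred from the printed values above
d = 3           # assumption: implied by the printed Path Similarity of 0.25

path_value = path_similarity(d)
lch_value = lch_similarity(d, MAX_DEPTH)
print(path_value, lch_value)
```

Both measures depend only on d, so they always rank pairs the same way; Leacock-Chodorow just rescales the distance logarithmically.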


Compute the similarity between the "famous actor" sense of *star* and the "domestic animal" sense of *cat*.

In [27]:
cat = wordnet.synsets("cat")
for ss in cat:
    print(ss, ss.definition())

Synset('cat.n.01') feline mammal usually having thick soft fur and no ability to roar: domestic cats; wildcats
Synset('guy.n.01') an informal term for a youth or man
Synset('cat.n.03') a spiteful woman gossip
Synset('kat.n.01') the leaves of the shrub Catha edulis which are chewed like tobacco or used to make tea; has the effect of a euphoric stimulant
Synset('cat-o'-nine-tails.n.01') a whip with nine knotted cords
Synset('caterpillar.n.02') a large tracked vehicle that is propelled by two endless metal belts; frequently used for moving earth in construction and farm work
Synset('big_cat.n.01') any of several large cats typically able to roar and living in the wild
Synset('computerized_tomography.n.01') a method of examining body organs by scanning them with X rays and using a computer to construct a series of cross-sectional scans along a single axis
Synset('cat.v.01') beat with a cat-o'-nine-tails
Synset('vomit.v.01') eject the contents of the stomach through the mouth


In [28]:
from nltk.corpus import wordnet_ic
ic = wordnet_ic.ic('ic-brown.dat')

# Path Similarity
print(star[3].path_similarity(cat[0]))

# Leacock-Chodorow Similarity
print(star[3].lch_similarity(cat[0]))

# Wu-Palmer Similarity
print(star[3].wup_similarity(cat[0]))

# Resnik Similarity
print(star[3].res_similarity(cat[0], ic))

# Jiang-Conrath Similarity
print(star[3].jcn_similarity(cat[0], ic))

# Lin Similarity
print(star[3].lin_similarity(cat[0], ic))

0.07142857142857142
0.9985288301111273
0.48
2.2241504712318556
0.0750854700495999
0.25037636757776993
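
The three Information Content measures are built from the same ingredients: Resnik is IC(lcs), the information content of the lowest common subsumer; Lin is 2 IC(lcs) / (IC(a) + IC(b)); Jiang-Conrath is 1 / (IC(a) + IC(b) - 2 IC(lcs)). As a consistency check, the sketch below recovers IC(a) + IC(b) from the printed Resnik and Jiang-Conrath values and rebuilds Lin from them. The helper names are hypothetical.

```python
def ic_sum_from(resnik, jcn):
    """Recover IC(a) + IC(b) from the Resnik and Jiang-Conrath scores."""
    return 2.0 * resnik + 1.0 / jcn


def lin_from(resnik, ic_sum):
    """Lin similarity from IC(lcs) and IC(a) + IC(b)."""
    return 2.0 * resnik / ic_sum


# (Resnik, Jiang-Conrath, Lin) as printed in the two output cells above.
printed = {
    "star/cosmos": (4.94037581120928, 0.1248115233147824, 0.5522184208452325),
    "star/cat": (2.2241504712318556, 0.0750854700495999, 0.25037636757776993),
}

rebuilt = {}
for pair, (res, jcn, lin) in printed.items():
    rebuilt[pair] = lin_from(res, ic_sum_from(res, jcn))
    print(pair, round(rebuilt[pair], 6), round(lin, 6))
```

The rebuilt Lin values agree with the printed ones, which suggests the three scores were indeed derived from the same information content values.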


We computed semantic similarity for two pairs of words:  
    in the first pair the words are semantically close, related as a word and its semantic field;  
    in the second pair the words are semantically unrelated. Accordingly, we expect higher metric values for the first pair.

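All metrics give a higher similarity for the first pair and a lower one for the second. Taking the quotient of the second-pair score over the first-pair score, Path Similarity separates the pairs most sharply (about 0.29). Wu-Palmer has the largest quotient (0.624) and Jiang-Conrath the next largest (about 0.60), so these two distinguish the pairs least.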
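
The quotients can be computed directly from the printed scores. The sketch below divides each second-pair score by the corresponding first-pair score; the short metric labels are just dictionary keys chosen for this example.

```python
# (first pair: star/cosmos, second pair: star/cat), copied from the outputs above.
scores = {
    "path": (0.25, 0.07142857142857142),
    "lch": (2.2512917986064953, 0.9985288301111273),
    "wup": (0.7692307692307693, 0.48),
    "res": (4.94037581120928, 2.2241504712318556),
    "jcn": (0.1248115233147824, 0.0750854700495999),
    "lin": (0.5522184208452325, 0.25037636757776993),
}

# A smaller quotient means the metric separates the two pairs more strongly.
ratios = {name: second / first for name, (first, second) in scores.items()}
for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print('{:5s} {:.3f}'.format(name, ratio))
```

The measures live on different scales, so the quotient is only a rough yardstick for comparing them.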