### Краткое содержание:
- Попробован алгоритм Леска с удалением стоп-слов из датасета и WordNet, без удаления стоп-слов, c ограничениями на окно контекста, с применением не только определений, но и примеров,с разделением по частям речи
- Число определений для слова "break" в WordNet
- Получилось определить с высокой степенью субъективного согласия с полученным вариантом 3 значения
- Ещё 1-2 значения получается улучшать до состояния "частично подходит"
- Причины неэффективного определения: высокая многозначность слова, широкая вариабельность контекстов употребления, короткие определения, короткие примеры, которых немного. Таким образом, максимальное пересечение слов контекста, в котором употреблено слово с разными определениями и примерами из WordNet недостаточно велико и не может являться хорошим дискриминативным признаком для определения значения.

In [170]:
import random
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.corpus import wordnet as wn
from nltk.wsd import lesk
from string import punctuation
morph = MorphAnalyzer()
punct = punctuation+'«»—…“”*№–'
stops = set(stopwords.words('english'))

import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /home/lisa/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [627]:
def tokenize(text):
    
    words = [word.strip(punct) for word in text.lower().split() if word and word not in stops]
    words = [word for word in words if word]

    return words

In [174]:
corpus = [line for line in open('corpus_eng.txt') if 'break' in tokenize(line)]

In [236]:
sample_full = random.sample(corpus, 10)

In [623]:
sample_full

['About 31,000 Tumblekins toys are being recalled by International Playthings as the toy can break into small pieces with sharp edges, posing a laceration hazard. More » ©Landscape Structures \n',
 "Black Magic - Halloween's the perfect excuse to break out the black vodka and an easy made-from-scratch soda drink awaits your ghoulish guests. Bring your summer shandy into autumn with this easy Pumpkin Shandy recipe. Photo Courtesy: © Cointreau • From the Pumpkin Patch \n",
 'In the picture you can see the cross hatching. This is caused from the honing operation and helps the rings "seat" during the break in process. When the term "seat" is used, it means the ring is worn down slightly, allowing it to mate to the cylinder wall, and form a seal which holds compression in the cylinder/combustion chamber, as well as keeping oil out. A secondary purpose of boring the cylinder is to ensure it is completely round. To do this correctly, the shop should use a torque plate, which is attached to th

Lines в```sample``` слишком длинные, неудобно просматривать. Вырежем предложения со словом 'break'

In [238]:
sample = []
for line in sample_full:
    sentences = line.split('.')
    for s in sentences:
        if 'break' in tokenize(s):
            sample.append(s)

In [239]:
len(sample)

10

In [240]:
sample

['About 31,000 Tumblekins toys are being recalled by International Playthings as the toy can break into small pieces with sharp edges, posing a laceration hazard',
 "Black Magic - Halloween's the perfect excuse to break out the black vodka and an easy made-from-scratch soda drink awaits your ghoulish guests",
 ' This is caused from the honing operation and helps the rings "seat" during the break in process',
 ' They break the rules in order to convince the rule-makers that they need to change the rules, which is itself a kind of state-approved process',
 ' Assuming the vice president would break any tie in favor of change, it would take 50 of those 52 to do away with the filibuster',
 " Trending Now Hillary tried to break the glass ceiling, but the only crack she found was Trumps &ss pressed against it Jeffrey · 6 mins ago She wasn't the first woman to run for president",
 'Hernandez led graduate and undergraduate students to Houston during Spring Break 2016 to conduct original reporti

In [306]:
sample_tok = [tokenize(line) for line in sample]

### Алгоритм Леска c токенизацией и удалением стоп-слов

In [648]:
def lesk_tokenize( word, sentence):

    bestsense = 0
    maxoverlap = 0
    
    for i, synset in enumerate(wn.synsets(word)):
        definition = tokenize(synset.definition())
        definition = set(definition)
        sentence = set(sentence)
        overlap = len(definition & sentence)
        if overlap > maxoverlap:
            maxoverlap = overlap
            bestsense = i
    
    return bestsense

In [649]:
i = 0
definitions = []
for line in sample_tok:
    i = lesk_tokenize('break', line)
    definitions.append(wn.synsets('break')[i].definition())

In [650]:
for i in range(10):
    print(i+1, sample[i], '---', definitions[i], sep = '\n')

1
About 31,000 Tumblekins toys are being recalled by International Playthings as the toy can break into small pieces with sharp edges, posing a laceration hazard
---
become separated into pieces or fragments
2
Black Magic - Halloween's the perfect excuse to break out the black vodka and an easy made-from-scratch soda drink awaits your ghoulish guests
---
break down, literally or metaphorically
3
 This is caused from the honing operation and helps the rings "seat" during the break in process
---
break down, literally or metaphorically
4
 They break the rules in order to convince the rule-makers that they need to change the rules, which is itself a kind of state-approved process
---
an abrupt change in the tone or register of the voice (as at puberty or due to emotion)
5
 Assuming the vice president would break any tie in favor of change, it would take 50 of those 52 to do away with the filibuster
---
an abrupt change in the tone or register of the voice (as at puberty or due to emotion)

Подходит: 1 хорошо, 6, 10 - в метафорическом смысле.
Может быть убирать стоп-слова не стоит, т.к. среди них есть предлоги из фразовых глаголов типа "break up"
  
Не подходят определения, не пересекающиеся со значениями, поскольку 'break' - очень многозначное слово(в WordNet 75 значений)  и контексты чрезвычайно разнообразны, а определения - коротки
7 - Spring Break 2016, мероприятие, нет в определениях
7,5,4 - "зацепилось" за глаголы, не относящиеся к break
2, 3, 8, 9 - 'break' в определении

Чтобы убрать бессмысленные пересечения, нужно брать контекст как несколько слов вокруг "break"
Чтобы попробовать поймать фразовые глаголы, можно вернуть стоп-слова

### Алгоритм Леска с ограниченным контекстом

In [639]:
def get_words_in_context(word, words, window=3):
    words_in_context = []
    for i in range(len(words)):
        left = words[max(0, i-window):i]
        right = words[i+1:i+window+1]
        target = words[i]
        if target == word:
            words_in_context = left+right
    
    return words_in_context

In [638]:
get_words_in_context('break', sample_tok[0])

['international', 'playthings', 'toy', 'small', 'pieces', 'sharp']

In [651]:
i = 0
definitions = []
for line in sample_tok:
    line = get_words_in_context('break', line)
    i = lesk_tokenize('break', line)
    definitions.append(wn.synsets('break')[i].definition())

In [652]:
for i in range(10):
    print(i+1, sample[i], '---', definitions[i], sep = '\n')

1
About 31,000 Tumblekins toys are being recalled by International Playthings as the toy can break into small pieces with sharp edges, posing a laceration hazard
---
become separated into pieces or fragments
2
Black Magic - Halloween's the perfect excuse to break out the black vodka and an easy made-from-scratch soda drink awaits your ghoulish guests
---
some abrupt occurrence that interrupts an ongoing activity
3
 This is caused from the honing operation and helps the rings "seat" during the break in process
---
some abrupt occurrence that interrupts an ongoing activity
4
 They break the rules in order to convince the rule-makers that they need to change the rules, which is itself a kind of state-approved process
---
act in disregard of laws, rules, contracts, or promises
5
 Assuming the vice president would break any tie in favor of change, it would take 50 of those 52 to do away with the filibuster
---
an abrupt change in the tone or register of the voice (as at puberty or due to em

Стало интереснее!

Подходят:
1 - результат не изменился
4, 8 - новое

Не подходят:
2 - кажется, тут довольно специфический контекст. Значение 'break out' не очень понятно, скорее всего что-то близкое к "to make a surprise", хотя чаще это что-то междуе "to get out" и "to destroy"
5 - кажется, так всё равно зацепляем "change"
3 - стало ближе
6 - стало хуже
7 - - Spring Break is a vacation period in early spring at universities and schools (wikipedia), результат "some abrupt occurrence that interrupts an ongoing activity" почти подходит:)
9, 10 - ближе, чем было

### Алгоритм Леска без удаления стоп-слов

In [654]:
def tokenize_full(text):
    words = [word.strip(punct) for word in text.lower().split() if word]
    return words

In [656]:
sample_tok_full = [tokenize_full(line) for line in sample]

In [655]:
def lesk_tokenize_full( word, sentence):

    bestsense = 0
    maxoverlap = 0
    
    for i, synset in enumerate(wn.synsets(word)):
        definition = tokenize_full(synset.definition())
        definition = set(definition)
        sentence = set(sentence)
        overlap = len(definition & sentence)
        if overlap > maxoverlap:
            maxoverlap = overlap
            bestsense = i
    
    return bestsense

In [657]:
i = 0
definitions = []
for line in sample_tok_full:
    line = get_words_in_context('break', line)
    i = lesk_tokenize_full('break', line)
    definitions.append(wn.synsets('break')[i].definition())

In [658]:
for i in range(10):
    print(i+1, sample[i], '---', definitions[i], sep = '\n')

1
About 31,000 Tumblekins toys are being recalled by International Playthings as the toy can break into small pieces with sharp edges, posing a laceration hazard
---
destroy the integrity of; usually by force; cause to separate into pieces or fragments
2
Black Magic - Halloween's the perfect excuse to break out the black vodka and an easy made-from-scratch soda drink awaits your ghoulish guests
---
(geology) a crack in the earth's crust resulting from the displacement of one side with respect to the other
3
 This is caused from the honing operation and helps the rings "seat" during the break in process
---
(geology) a crack in the earth's crust resulting from the displacement of one side with respect to the other
4
 They break the rules in order to convince the rule-makers that they need to change the rules, which is itself a kind of state-approved process
---
(geology) a crack in the earth's crust resulting from the displacement of one side with respect to the other
5
 Assuming the vi

1, 8 подходят, остальное замусорили стоп-словами. Попробуем увеличить контекстное окно, чтобы попадали не только они

In [659]:
i = 0
definitions = []
for line in sample_tok_full:
    line = get_words_in_context('break', line, 5)
    i = lesk_tokenize_full('break', line)
    definitions.append(wn.synsets('break')[i].definition())

In [660]:
for i in range(10):
    print(i+1, sample[i], '---', definitions[i], sep = '\n')

1
About 31,000 Tumblekins toys are being recalled by International Playthings as the toy can break into small pieces with sharp edges, posing a laceration hazard
---
destroy the integrity of; usually by force; cause to separate into pieces or fragments
2
Black Magic - Halloween's the perfect excuse to break out the black vodka and an easy made-from-scratch soda drink awaits your ghoulish guests
---
(geology) a crack in the earth's crust resulting from the displacement of one side with respect to the other
3
 This is caused from the honing operation and helps the rings "seat" during the break in process
---
(geology) a crack in the earth's crust resulting from the displacement of one side with respect to the other
4
 They break the rules in order to convince the rule-makers that they need to change the rules, which is itself a kind of state-approved process
---
(geology) a crack in the earth's crust resulting from the displacement of one side with respect to the other
5
 Assuming the vi

Не помогло.

### Алгоритм Леска с удалением стоп-слов и с использованием определений+примеров

In [671]:
def lesk_examples(word, sentence):

    bestsense = 0
    maxoverlap = 0
    example = []
    
    for i, synset in enumerate(wn.synsets(word)):
        definition = tokenize(synset.definition())
        exes = [tokenize(line) for line in synset.examples()]
        for ex in exes:
            example = example
        synset_context = example + definition
        sentence = set(sentence)
        synset_context = set(synset_context)
        overlap = len(synset_context & sentence)
        if overlap > maxoverlap:
            maxoverlap = overlap
            bestsense = i
    
    return bestsense

In [677]:
i = 0
definitions = []
for line in sample_tok:
    line = get_words_in_context('break', line, 5)
    i = lesk_examples('break', line)
    definitions.append(wn.synsets('break')[i].definition())

In [678]:
for i in range(10):
    print(i+1, sample[i], '---', definitions[i], sep = '\n')

1
About 31,000 Tumblekins toys are being recalled by International Playthings as the toy can break into small pieces with sharp edges, posing a laceration hazard
---
become separated into pieces or fragments
2
Black Magic - Halloween's the perfect excuse to break out the black vodka and an easy made-from-scratch soda drink awaits your ghoulish guests
---
some abrupt occurrence that interrupts an ongoing activity
3
 This is caused from the honing operation and helps the rings "seat" during the break in process
---
some abrupt occurrence that interrupts an ongoing activity
4
 They break the rules in order to convince the rule-makers that they need to change the rules, which is itself a kind of state-approved process
---
act in disregard of laws, rules, contracts, or promises
5
 Assuming the vice president would break any tie in favor of change, it would take 50 of those 52 to do away with the filibuster
---
an abrupt change in the tone or register of the voice (as at puberty or due to em

Ок: 1, 4, 8.
Всё ещё слишком вариативный контекст, примеров недостаточно.
Например, для 9 предложения подходит определение  
n.synsets('separate.v.08') - -  discontinue an association or relation; go different ways  
Посмотрим примеры:
'The business partners broke over a tax question',  
'The couple separated after 25 years of marriage',  
'My friend and I split up'  

Даже приведя 'broke' к нормальной форме, потенциально мы получим такие неинформативные пересечения, как 'I', 'up' (если оставим стоп-слова) 'after'. И даже если нам повезёт с 'up', всё равно останутся несколько вариантов значения выражения 'break up'

### Разделим примеры по частям речи

In [692]:
import nltk
nltk.download('averaged_perceptron_tagger')
from nltk import pos_tag

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/lisa/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [701]:
sample_pos = [pos_tag(word_tokenize(line)) for line in sample]

In [702]:
sample_verb = []
sample_noun = []
sample_other = []
for i in range(len(sample_pos)):
    for j in sample_pos[i]:
        if j[0].lower() == 'break':
            if j[1].startswith('V'):
                sample_verb.append(sample[i])
            elif j[1].startswith('N'):
                sample_noun.append(sample[i])
            else:
                sample_other.append(sample[i])

In [703]:
print(len(sample_noun)+len(sample_verb))

10


In [708]:
sample_tok_noun = [tokenize(line) for line in sample_noun]

In [709]:
sample_tok_verb = [tokenize(line) for line in sample_verb]

In [706]:
def lesk_examples_pos(word, pos, sentence):

    bestsense = 0
    maxoverlap = 0
    example = []
    
    for i, synset in enumerate(wn.synsets(word, pos)):
        definition = tokenize(synset.definition())
        exes = [tokenize(line) for line in synset.examples()]
        for ex in exes:
            example = example
        synset_context = example + definition
        sentence = set(sentence)
        synset_context = set(synset_context)
        overlap = len(synset_context & sentence)
        if overlap > maxoverlap:
            maxoverlap = overlap
            bestsense = i
    
    return bestsense

In [710]:
i = 0
definitions_noun = []
for line in sample_tok_noun:
    line = get_words_in_context('break', line, 5)
    i = lesk_examples_pos('break','n', line)
    definitions_noun.append(wn.synsets('break','n')[i].definition())

In [715]:
for i in range(len(sample_tok_noun)):
    print(i+1, sample_tok_noun[i], '---', definitions_noun[i], sep = '\n')

1
['caused', 'honing', 'operation', 'helps', 'rings', 'seat', 'break', 'process']
---
some abrupt occurrence that interrupts an ongoing activity
2
['hernandez', 'led', 'graduate', 'undergraduate', 'students', 'houston', 'spring', 'break', '2016', 'conduct', 'original', 'reporting', 'based', 'hell', 'high', 'water', 'investigation']
---
some abrupt occurrence that interrupts an ongoing activity
3
['around', '7', 'months', 'break', 'met', 'girl', 'eventually', 'married', 'love', 'dearly', 'leads', 'inner', 'turmoil']
---
some abrupt occurrence that interrupts an ongoing activity


Для существительных улучшений нет

In [714]:
i = 0
definitions_verb = []
for line in sample_tok_verb:
    line = get_words_in_context('break', line, 5)
    i = lesk_examples_pos('break','v', line)
    definitions_verb.append(wn.synsets('break','v')[i].definition())

In [716]:
for i in range(len(sample_tok_verb)):
    print(i+1, sample_tok_verb[i], '---', definitions_verb[i], sep = '\n')

1
['31,000', 'tumblekins', 'toys', 'recalled', 'international', 'playthings', 'toy', 'break', 'small', 'pieces', 'sharp', 'edges', 'posing', 'laceration', 'hazard']
---
become separated into pieces or fragments
2
['black', 'magic', "halloween's", 'perfect', 'excuse', 'break', 'black', 'vodka', 'easy', 'made-from-scratch', 'soda', 'drink', 'awaits', 'ghoulish', 'guests']
---
terminate
3
['break', 'rules', 'order', 'convince', 'rule-makers', 'need', 'change', 'rules', 'kind', 'state-approved', 'process']
---
act in disregard of laws, rules, contracts, or promises
4
['assuming', 'vice', 'president', 'would', 'break', 'tie', 'favor', 'change', 'would', 'take', '50', '52', 'away', 'filibuster']
---
happen or take place
5
['trending', 'hillary', 'tried', 'break', 'glass', 'ceiling', 'crack', 'found', 'trumps', 'ss', 'pressed', 'jeffrey', '·', '6', 'mins', 'ago', 'first', 'woman', 'run', 'president']
---
become fractured; break or crack on the surface only
6
['also', 'country', 'fans', 'break

Чуть лучше стало для примера 5 (6 в исходном датасете)

Вспомогательные ячейки

In [718]:
for synset in wn.synsets('break'):
    print(synset, ' - ',synset.definition())

Synset('interruption.n.02')  -  some abrupt occurrence that interrupts an ongoing activity
Synset('break.n.02')  -  an unexpected piece of good luck
Synset('fault.n.04')  -  (geology) a crack in the earth's crust resulting from the displacement of one side with respect to the other
Synset('rupture.n.02')  -  a personal or social separation (as between opposing factions)
Synset('respite.n.02')  -  a pause from doing something (as work)
Synset('breakage.n.03')  -  the act of breaking something
Synset('pause.n.01')  -  a time interval during which there is a temporary cessation of something
Synset('fracture.n.01')  -  breaking of hard tissue such as bone
Synset('break.n.09')  -  the occurrence of breaking
Synset('break.n.10')  -  an abrupt change in the tone or register of the voice (as at puberty or due to emotion)
Synset('break.n.11')  -  the opening shot that scatters the balls in billiards or pool
Synset('break.n.12')  -  (tennis) a score consisting of winning a game when your opponen

In [691]:
wn.synset('separate.v.08').examples()

['The business partners broke over a tax question',
 'The couple separated after 25 years of marriage',
 'My friend and I split up']