# Word Sense Disambiguation using Supervised Learning

## The Senseval dataset

The Senseval 2 corpus is a word sense disambiguation corpus. Each item in the corpus corresponds to a single ambiguous word. For each of these words, the corpus contains a list of instances, corresponding to occurrences of that word. Each instance provides the word; a list of word senses that apply to the word occurrence; and the word’s context.
See the NLTK corpus how-to: https://www.nltk.org/howto/corpus.html#senseval

A detailed description of how the dataset was created is given in this publication: https://aclanthology.org/S01-1001/

In [None]:
import re
import string

import nltk
from nltk import tokenize
from nltk.corpus import senseval
from nltk.corpus import stopwords
from nltk.corpus import wordnet as wn
from nltk.wsd import lesk

# Download the corpora and models used in this notebook
nltk.download('senseval')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')
nltk.download('punkt')



[nltk_data] Downloading package senseval to /root/nltk_data...
[nltk_data]   Package senseval is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [None]:
inst = senseval.instances('interest.pos')
inst[0]

SensevalInstance(word='interest-n', position=18, context=[('yields', 'NNS'), ('on', 'IN'), ('money-market', 'JJ'), ('mutual', 'JJ'), ('funds', 'NNS'), ('continued', 'VBD'), ('to', 'TO'), ('slide', 'VB'), (',', ','), ('amid', 'IN'), ('signs', 'VBZ'), ('that', 'IN'), ('portfolio', 'NN'), ('managers', 'NNS'), ('expect', 'VBP'), ('further', 'JJ'), ('declines', 'NNS'), ('in', 'IN'), ('interest', 'NN'), ('rates', 'NNS'), ('.', '.')], senses=('interest_6',))

In [None]:
len(inst)

2368

In [None]:
for inst in senseval.instances('interest.pos')[:40]:
  p = inst.position
  left = ' '.join(w for (w,t) in inst.context[p-3:p])
  word = ' '.join(w for (w,t) in inst.context[p:p+1])
  right = ' '.join(w for (w,t) in inst.context[p+1:p+4])
  senses = ' '.join(inst.senses)
  print('%20s |%10s | %-15s -> %s' % (left, word, right, senses))

 further declines in |  interest | rates .         -> interest_6
to indicate declining |  interest | rates because they -> interest_6
 rises in short-term |  interest | rates .         -> interest_6
               . 4 % |  interest | in this energy-services -> interest_5
holding company with | interests | in the mechanical -> interest_5
     refunded , plus |  interest | .               -> interest_6
       curry set the |  interest | rate on the     -> interest_6
      country 's own |  interest | , prompted the  -> interest_4
    of principal and |  interest | is the only     -> interest_6
     to increase its |  interest | to 70 %         -> interest_5
     show the strong |  interest | of japanese investors -> interest_1
    retired early if |  interest | rates decline , -> interest_6
         the drop in |  interest | rates since the -> interest_6
         the drop in |  interest | rates eventually will -> interest_6
    par plus accrued |  interest | to the date     -> interest_6

In [None]:
senseval.fileids()

['hard.pos', 'interest.pos', 'line.pos', 'serve.pos']

In [None]:
def senses(word):
    """
    Take a target word file from the Senseval-2 corpus (list the available
    ones with senseval.fileids()) and return the list of possible senses
    for that word.
    """
    return list(set(i.senses[0] for i in senseval.instances(word)))

senses('interest.pos')

['interest_3',
 'interest_2',
 'interest_1',
 'interest_6',
 'interest_5',
 'interest_4']

In [None]:
[i for i in senseval.instances('interest.pos') if i.senses[0]=='interest_1']


In [None]:
senses('line.pos')

['division', 'text', 'cord', 'formation', 'phone', 'product']

In [None]:
senses('serve.pos')

['SERVE12', 'SERVE10', 'SERVE6', 'SERVE2']

In [None]:
def print_concordance(instances):
  """Print each target word with up to three words of context on each side and its sense(s)."""
  for inst in instances:
    p = inst.position
    # max(0, ...) keeps the left slice valid when the target is near the sentence start
    left = ' '.join(w for (w,t) in inst.context[max(0, p-3):p])
    word = ' '.join(w for (w,t) in inst.context[p:p+1])
    right = ' '.join(w for (w,t) in inst.context[p+1:p+4])
    # named sense_labels so that the senses() helper defined earlier is not overwritten
    sense_labels = ' '.join(inst.senses)
    print('%20s |%10s | %-15s -> %s' % (left, word, right, sense_labels))

hard_instances = senseval.instances('hard.pos')
print_concordance(hard_instances[0:10])   # first ten instances
print_concordance(hard_instances[-10:])   # last ten instances

         and that 's |      hard | to do .         -> HARD1
        are having a |      hard | time helping president -> HARD1
           i find it |      hard | to believe that -> HARD1
        person , the |      hard | part in correcting -> HARD1
        'our life is |    harder | now , yes       -> HARD1
   which have become |      hard | to sell in      -> HARD1
         to face the |      hard | facts of life   -> HARD1
                     |      hard | to make the     -> HARD1
           it may be |      hard | just to finish  -> HARD1
             , is it |      hard | portraying matt dillon -> HARD1
       also weed out |      hard | cover classics that -> HARD3
        cushion on a |      hard | headboard for comfortable -> HARD3
            a rock - |      hard | field against the -> HARD3
   either fabrics or |      hard | surfaces .      -> HARD3
        ivory is the |      hard | endosperm of the -> HARD3
     capitol was its |      hard | floors , the    -> HARD3
      

In [None]:
senses('hard.pos')

['HARD3', 'HARD2', 'HARD1']

## The Naive Bayes model

We use a Naive Bayes classifier to label (classify) each occurrence of a word with a certain WordNet sense. For this we need a context window surrounding the target word (the word whose sense we are looking for). The context window should contain only "content words" (words that carry meaning, such as nouns and verbs).

We write P(s|c) for the probability of sense s in the context c. We compute this probability for each sense of the target word and choose the sense with the highest probability.

In order to compute the probability `P(s|c)`, we use the formula: 

`P(s|c)=P(c|s)*P(s)/P(c)`. 

`P(s)` is the probability of a sense without any context. `P(c)` does not depend on the sense, so it can be ignored when comparing senses. For computing `P(c|s)` we need a training set (texts that contain the target word, already labeled with its correct sense).
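
As a rough illustration of the formula, the sketch below scores two senses for a two-word context. The training pairs, the sense names and the helper `posterior` are invented for this example. It also makes two assumptions that the formula itself does not prescribe: context words are treated as independent given the sense (the "naive" part), and add-one smoothing is used so that unseen words do not give a zero probability.

```python
from collections import Counter

# Hypothetical training data, invented for illustration: (context words, sense)
training = [
    (["rates", "bank"], "money"),
    (["rates", "loan"], "money"),
    (["bank", "loan"], "money"),
    (["hobby", "music"], "curiosity"),
]

def posterior(context, training):
    """Return P(s|c) for every sense s, computed as P(c|s) * P(s) / P(c)."""
    sense_counts = Counter(label for _, label in training)
    word_counts = {s: Counter() for s in sense_counts}
    for words, label in training:
        word_counts[label].update(words)
    vocab = {w for words, _ in training for w in words}

    joint = {}
    for s, count in sense_counts.items():
        p = count / len(training)                  # prior P(s)
        total = sum(word_counts[s].values())
        for w in context:                          # P(c|s) as a product of P(w|s)
            p *= (word_counts[s][w] + 1) / (total + len(vocab))  # add-one smoothing
        joint[s] = p                               # P(c|s) * P(s)
    evidence = sum(joint.values())                 # P(c), identical for every sense
    return {s: p / evidence for s, p in joint.items()}

probs = posterior(["rates", "bank"], training)
best = max(probs, key=probs.get)
print(best, probs)
```

Dividing by `evidence` only rescales the scores so that they sum to 1; the winning sense is the same with or without it.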

NLTK already has the classifier implemented. In this lab we will use the NLTK NaiveBayesClassifier: https://www.nltk.org/_modules/nltk/classify/naivebayes.html

The Naive Bayes classifier first computes the prior probability of each sense (or, generally speaking, of each class label), which is determined by the label's frequency in the training set. The features are then used to estimate the likelihood of that label in a given context.

In [None]:
import nltk
import random
from nltk.classify import accuracy, NaiveBayesClassifier, MaxentClassifier
from collections import defaultdict

In [None]:
# NaiveBayesClassifier.train(train_set)

Here `train_set` is a list of tuples, one per training instance. The first element of each tuple is a dictionary of features (the name and value of each feature); the second element is the class label.
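
To make the expected format concrete, here is a minimal sketch that trains the NLTK classifier on a handful of hand-written pairs. The feature names imitate the style used later in this notebook, but the data in `toy_train_set` is invented for illustration and is not taken from the corpus.

```python
from nltk.classify import NaiveBayesClassifier

# Toy training set, invented for illustration: (feature dict, label) pairs
toy_train_set = [
    ({"right-context-word-1(rates)": True}, "interest_6"),
    ({"right-context-word-1(rate)": True}, "interest_6"),
    ({"right-context-word-1(rates)": True, "left-context-word-1(in)": True}, "interest_6"),
    ({"right-context-word-1(in)": True}, "interest_5"),
    ({"right-context-word-1(in)": True, "left-context-word-1(its)": True}, "interest_5"),
]

toy_classifier = NaiveBayesClassifier.train(toy_train_set)

# Classify a new feature dict built with the same feature names
toy_prediction = toy_classifier.classify({"right-context-word-1(rates)": True})
print(toy_prediction)
print(sorted(toy_classifier.labels()))
```

The same feature names must be used at training and at classification time, because the classifier only knows the features it saw during training.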


In [None]:
def sense_instances(instances, sense):
    """
    This returns the list of instances in instances that have the sense sense
    """
    return [instance for instance in instances if instance.senses[0]==sense]

In [None]:
sense2 = sense_instances(senseval.instances('hard.pos'), 'HARD2')

In [None]:
sense2[:5]

[SensevalInstance(word='hard-a', position=15, context=[('keep', 'VB'), ('this', 'DT'), ('one', 'CD'), ('in', 'IN'), ('your', 'PRP$'), ('drawer', 'NN'), ('for', 'IN'), ('the', 'DT'), ('next', 'JJ'), ('time', 'NN'), ('the', 'DT'), ('boss', 'NN'), ('gives', 'VBZ'), ('you', 'PRP'), ('a', 'DT'), ('hard', 'JJ'), ('time', 'NN'), ('.', '.')], senses=('HARD2',)),
 SensevalInstance(word='hard-a', position=11, context=[('she', 'PRP'), ('recommends', 'VBZ'), ('continuing', 'VBG'), ('education', 'NN'), ('courses', 'NNS'), (',', ','), ('developing', 'VBG'), ('effective', 'JJ'), ('people', 'NNS'), ('skills', 'NNS'), ('and', 'CC'), ('hard', 'JJ'), ('work', 'NN'), ('.', '.')], senses=('HARD2',)),
 SensevalInstance(word='hard-a', position=10, context=[('the', 'DT'), ('phrase', 'NN'), ('``', '``'), ('consent', 'NN'), ('of', 'IN'), ('the', 'DT'), ('governed', 'VBN'), ("''", "''"), ('needs', 'VBZ'), ('a', 'DT'), ('hard', 'JJ'), ('look', 'NN'), ('.', '.')], senses=('HARD2',)),
 SensevalInstance(word='hard-a

In [None]:
nltk.download('stopwords')
# Stored as a set so that membership tests and set differences are fast
STOPWORDS_SET = set(nltk.corpus.stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
# Some helper functions we'll need to train our model

def extract_vocab_frequency(instances, stopwords=STOPWORDS_SET, n=300):
    """
    Given a list of senseval instances, return the most frequent words that
    appear in their contexts (i.e., the sentences containing the target word),
    excluding stopwords and the target word itself. The output is a list of
    (word, count) pairs in order of frequency, where count is the number of
    instances whose context contains the word.
    """
    fd = nltk.FreqDist()
    for i in instances:
        (target, suffix) = i.word.split('-')
        words = (c[0] for c in i.context if c[0] != target)
        # Count each word at most once per instance
        for word in set(words) - set(stopwords):
            fd[word] += 1
    # Note: this slice keeps n+1 entries, not n
    return fd.most_common()[:n+1]

In [None]:
def extract_vocab(instances, stopwords=STOPWORDS_SET, n=300):
    return [w for w,f in extract_vocab_frequency(instances,stopwords,n)]

In [None]:
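def extract_vocab_frequency2(sentences, word, stopwords=STOPWORDS_SET, n=300):
  """
  Variant of extract_vocab_frequency for plain tokenized sentences (lists of
  strings): count in how many sentences each word occurs, ignoring stopwords
  and the target word.
  """
  fd = nltk.FreqDist()
  excluded = set(stopwords) | {word}
  for sentence in sentences:
    for token in set(sentence) - excluded:
      fd[token] += 1

  return fd.most_common()[:n+1]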


def extract_vocab2(sentences, word, stopwords=STOPWORDS_SET, n=300):
  return [w for w, f in extract_vocab_frequency2(sentences,word,stopwords,n)]

In [None]:
extract_vocab(senseval.instances('interest.pos'), stopwords=STOPWORDS_SET, n=1000)[:20]

['.',
 ',',
 'rates',
 "'s",
 'said',
 '%',
 'interests',
 '``',
 "''",
 '$',
 'million',
 'n',
 "'t",
 'mr',
 'company',
 'u',
 'rate',
 'would',
 'market',
 'bonds']

In [None]:
# Feature extraction

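def wsd_context_features(instance, vocab, dist=3):
    """
    Represent an instance by the words found up to dist positions to the left
    and to the right of the target word. vocab is unused; it is accepted so
    that all feature extractors share the same signature.
    """
    features = {}
    ind = instance.position
    con = instance.context
    for i in range(max(0, ind-dist), ind):
        j = ind-i
        features['left-context-word-%s(%s)' % (j, con[i][0])] = True

    for i in range(ind+1, min(ind+dist+1, len(con))):
        j = i-ind
        features['right-context-word-%s(%s)' % (j, con[i][0])] = True

    features['word'] = instance.word
    # Note: con[1][1] is the tag of the second token in the sentence;
    # the target word's own tag would be con[ind][1]
    features['pos'] = con[1][1]
    return features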


This feature set represents the context of a word w as the m words that occur before w and the m words that occur after w, each marked with its distance from w, together with the target word and a part-of-speech tag. As we'll see shortly, you can specify the value of m (the `dist` argument here, `distance` in `wsd_classifier`); for example, m=1 means the context consists of just the immediately preceding and immediately following words. Otherwise, m defaults to 3.
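
The effect of the window size can be seen on a single sentence. The sketch below is a standalone variant written for illustration (the name `window_features` is hypothetical); it works on a plain list of (word, tag) pairs and, unlike `wsd_context_features`, stores the tag of the target word itself under `'pos'`. The context is copied from the instance at index 4 of `interest.pos`.

```python
def window_features(context, position, dist=3):
    """Features for up to `dist` words on each side of context[position].

    `context` is a list of (word, tag) pairs, as in a SensevalInstance.
    """
    features = {}
    for i in range(max(0, position - dist), position):
        features['left-context-word-%s(%s)' % (position - i, context[i][0])] = True
    for i in range(position + 1, min(position + dist + 1, len(context))):
        features['right-context-word-%s(%s)' % (i - position, context[i][0])] = True
    features['pos'] = context[position][1]  # tag of the target word itself
    return features

# Context of senseval.instances('interest.pos')[4]; the target is at position 8
context = [('finmeccanica', 'NN'), ('is', 'VBZ'), ('an', 'DT'), ('italian', 'NN'),
           ('state-owned', 'JJ'), ('holding', 'NN'), ('company', 'NN'), ('with', 'IN'),
           ('interests', 'NNS'), ('in', 'IN'), ('the', 'DT'), ('mechanical', 'JJ'),
           ('engineering', 'NN'), ('industry', 'NN'), ('.', '.')]

narrow = window_features(context, position=8, dist=1)
wide = window_features(context, position=8, dist=3)
print(narrow)
print(len(wide))
```

With `dist=1` only "with" and "in" are kept; with `dist=3` the window has three words on each side, so the dictionary holds six context features plus the tag.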

In [None]:
senseval.instances('interest.pos')[4]

SensevalInstance(word='interest-n', position=8, context=[('finmeccanica', 'NN'), ('is', 'VBZ'), ('an', 'DT'), ('italian', 'NN'), ('state-owned', 'JJ'), ('holding', 'NN'), ('company', 'NN'), ('with', 'IN'), ('interests', 'NNS'), ('in', 'IN'), ('the', 'DT'), ('mechanical', 'JJ'), ('engineering', 'NN'), ('industry', 'NN'), ('.', '.')], senses=('interest_5',))

In [None]:
vocab_interest = extract_vocab(senseval.instances('interest.pos'), stopwords=[], n=300)
wsd_context_features(senseval.instances('interest.pos')[4], vocab=vocab_interest)

{'left-context-word-1(with)': True,
 'left-context-word-2(company)': True,
 'left-context-word-3(holding)': True,
 'pos': 'VBZ',
 'right-context-word-1(in)': True,
 'right-context-word-2(the)': True,
 'right-context-word-3(mechanical)': True,
 'word': 'interest-n'}

In [None]:
def wsd_word_features(instance, vocab, dist=3):
    """
    Create a featureset where every key returns False unless it occurs in the
    instance's context
    """
    features = defaultdict(lambda: False)
    features['alwayson'] = True
    try:
        for (w, pos) in instance.context:
            if w in vocab:
                features[w] = True
    except ValueError:
        # Unpacking raises ValueError if a context item is not a (word, tag) pair
        pass
    return features

In [None]:
def wsd_context_features1(instance, word, vocab, dist=3):
  """
  Same window features as wsd_context_features, but for a plain list of tokens.
  Raises ValueError if word does not occur in instance.
  """
  ind = instance.index(word)
  features = {}
  for i in range(max(0, ind - dist), ind):
    j = ind-i
    features['left-context-word-%s(%s)' % (j, instance[i])] = True

  for i in range(ind+1, min(ind+dist+1, len(instance))):
    j = i-ind  # distance to the right, so names match those of wsd_context_features
    features['right-context-word-%s(%s)' % (j, instance[i])] = True

  features['word'] = word

  return features

This feature set is based on the set S of the n most frequent words that occur in the same sentence as the target word w across the entire training corpus (as you'll see later, you can specify the value of n, but if you don't specify it then it defaults to 300). For each occurrence of w, `wsd_word_features` represents its context as the subset of those words from S that occur in w's sentence.

In [None]:
senseval.instances('interest.pos')[4]

SensevalInstance(word='interest-n', position=8, context=[('finmeccanica', 'NN'), ('is', 'VBZ'), ('an', 'DT'), ('italian', 'NN'), ('state-owned', 'JJ'), ('holding', 'NN'), ('company', 'NN'), ('with', 'IN'), ('interests', 'NNS'), ('in', 'IN'), ('the', 'DT'), ('mechanical', 'JJ'), ('engineering', 'NN'), ('industry', 'NN'), ('.', '.')], senses=('interest_5',))

In [None]:
wsd_word_features(senseval.instances('interest.pos')[4], vocab=vocab_interest)

defaultdict(<function __main__.wsd_word_features.<locals>.<lambda>>,
            {'.': True,
             'alwayson': True,
             'an': True,
             'company': True,
             'holding': True,
             'in': True,
             'industry': True,
             'interests': True,
             'is': True,
             'the': True,
             'with': True})

In [None]:
_inst_cache = {}


In [None]:
def wsd_classifier(trainer, word, features, stopwords_list = STOPWORDS_SET, number=300, distance=3, confusion_matrix=False):
    """
    This function takes as arguments:
        a trainer (e.g., NaiveBayesClassifier.train);
        a target word from senseval2
        a feature set (this can be wsd_context_features or wsd_word_features);
        a number (defaults to 300), which determines for wsd_word_features the number of
            most frequent words within the context of a given sense that you use to classify examples;
        a distance (defaults to 3) which determines the size of the window for wsd_context_features (if distance=3, then
            wsd_context_features gives 3 words and tags to the left and 3 words and tags to
            the right of the target word);
        confusion_matrix (defaults to False), which if set to True prints a confusion matrix.

    Calling this function splits the senseval data for the word into a training set and a test set (the way it does
    this is the same for each call of this function, because the argument to random.seed is specified,
    but removing this argument would make the training and testing sets different each time you build a classifier).

    It then trains the trainer on the training set to create a classifier that performs WSD on the word,
    using features (with number or distance where relevant).

    It then tests the classifier on the test set, and prints its accuracy on that set.


    If confusion_matrix==True, then calling this function prints out a confusion matrix, where each cell [i,j]
    indicates how often label j was predicted when the correct label was i (so the diagonal entries indicate labels
    that were correctly predicted).
    """
    print("Reading data...")
    #global _inst_cache
    if word not in _inst_cache:
        _inst_cache[word] = [(i, i.senses[0]) for i in senseval.instances(word)]
    events = _inst_cache[word][:]
    senses = list(set(l for (i, l) in events))
    instances = [i for (i, l) in events]
    vocab = extract_vocab(instances, stopwords=stopwords_list, n=number)
    print(' Senses: ' + ' '.join(senses))

    # Split the instances into a training and test set,
    #if n > len(events): n = len(events)
    n = len(events)
    random.seed(334)
    random.shuffle(events)
    training_data = events[:int(0.8 * n)]
    test_data = events[int(0.8 * n):n]

    # Train classifier
    print('Training classifier...')
    classifier = trainer([(features(i, vocab, distance), label) for (i, label) in training_data])
    # Test classifier
    print('Testing classifier...')
    acc = accuracy(classifier, [(features(i, vocab, distance), label) for (i, label) in test_data] )
    print('Accuracy: %6.4f' % acc)
    
    if confusion_matrix:
        gold = [label for (i, label) in test_data]
        # Pass distance here too, so the features match those used for training
        derived = [classifier.classify(features(i, vocab, distance)) for (i, label) in test_data]
        cm = nltk.ConfusionMatrix(gold, derived)
        print(cm)
    
    return classifier
        

In [None]:
# Training the classifier:
# NB, with features based on 300 most frequent context words
wsd_classifier(NaiveBayesClassifier.train, 'hard.pos', wsd_word_features)

# Pseudocode, general training steps:
# train_set = [(extract_features(i), i.senses[0]) for i in instances]
# classifier = NaiveBayesClassifier.train(train_set)


Reading data...
 Senses: HARD3 HARD2 HARD1
Training classifier...
Testing classifier...
Accuracy: 0.8547


In [None]:
# NB, with features based on word + pos in a 6-word window
wsd_classifier(NaiveBayesClassifier.train, 'hard.pos', wsd_context_features)

Reading data...
 Senses: HARD3 HARD2 HARD1
Training classifier...
Testing classifier...
Accuracy: 0.8927


In [None]:
wsd_classifier(NaiveBayesClassifier.train, 'hard.pos', wsd_context_features, confusion_matrix=True) # chance level with 3 senses: 1/3 ~ 0.33

Reading data...
 Senses: HARD3 HARD2 HARD1
Training classifier...
Testing classifier...
Accuracy: 0.8927
      |   H   H   H |
      |   A   A   A |
      |   R   R   R |
      |   D   D   D |
      |   1   2   3 |
------+-------------+
HARD1 |<650> 35  16 |
HARD2 |  12 <76>  5 |
HARD3 |   7  18 <48>|
------+-------------+
(row = reference; col = test)



In [None]:
wsd_classifier(NaiveBayesClassifier.train, 'interest.pos', wsd_context_features, confusion_matrix=True) # chance level with 6 senses: 1/6 ~ 0.17

Reading data...
 Senses: interest_5 interest_4 interest_2 interest_1 interest_3 interest_6
Training classifier...
Testing classifier...
Accuracy: 0.4219
           |   i   i   i   i   i   i |
           |   n   n   n   n   n   n |
           |   t   t   t   t   t   t |
           |   e   e   e   e   e   e |
           |   r   r   r   r   r   r |
           |   e   e   e   e   e   e |
           |   s   s   s   s   s   s |
           |   t   t   t   t   t   t |
           |   _   _   _   _   _   _ |
           |   1   2   3   4   5   6 |
-----------+-------------------------+
interest_1 | <24> 33   4   4   5   1 |
interest_2 |   .  <3>  .   1   .   . |
interest_3 |   2   2  <4>  .   .   . |
interest_4 |   1  22   2 <14>  .   2 |
interest_5 |   5  45   1   6 <61>  2 |
interest_6 |   . 133   2   1   . <94>|
-----------+-------------------------+
(row = reference; col = test)



In [None]:
wsd_classifier(NaiveBayesClassifier.train, 'interest.pos', wsd_context_features) # chance level with 6 senses: 1/6 ~ 0.17

Reading data...
 Senses: interest_5 interest_4 interest_2 interest_1 interest_3 interest_6
Training classifier...
Testing classifier...
Accuracy: 0.4219


Why is the accuracy lower for "interest"...?
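
One way to start answering this is to look at the confusion matrix for "interest" sense by sense. The sketch below copies the counts from the matrix printed above (a `.` in the printout is a 0) and computes the overall accuracy, the recall of each sense, and how often each sense was predicted. The variable names are chosen for this example only.

```python
labels = ['interest_1', 'interest_2', 'interest_3',
          'interest_4', 'interest_5', 'interest_6']

# Rows are reference senses, columns are predicted senses
matrix = [
    [24,  33, 4,  4,  5,  1],
    [ 0,   3, 0,  1,  0,  0],
    [ 2,   2, 4,  0,  0,  0],
    [ 1,  22, 2, 14,  0,  2],
    [ 5,  45, 1,  6, 61,  2],
    [ 0, 133, 2,  1,  0, 94],
]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(labels)))
overall_accuracy = correct / total

# Number of test instances per reference sense, and per predicted sense
gold = {lab: sum(matrix[i]) for i, lab in enumerate(labels)}
predicted = {lab: sum(row[j] for row in matrix) for j, lab in enumerate(labels)}
# Recall: share of the instances of a sense that were labeled correctly
recall = {lab: matrix[i][i] / gold[lab] for i, lab in enumerate(labels)}

print('accuracy: %.4f' % overall_accuracy)
for lab in labels:
    print('%s  recall=%.2f  gold=%3d  predicted=%3d'
          % (lab, recall[lab], gold[lab], predicted[lab]))
```

The diagonal sums to 200 out of 474 test instances, which reproduces the reported accuracy. The counts also show that `interest_2` is predicted 238 times although only 4 test instances carry that sense, so most errors fall in that single column.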

**Baseline**: how could we guess the sense of a word without any additional information?

In [None]:
# Frequency baseline: always predict the most frequent sense
hard_sense_fd = nltk.FreqDist([i.senses[0] for i in senseval.instances('hard.pos')])
# max() returns the most frequent sample; the first key is only the first sense seen
most_frequent_hard_sense = hard_sense_fd.max()
frequency_hard_sense_baseline = hard_sense_fd.freq(most_frequent_hard_sense)


In [None]:
frequency_hard_sense_baseline

0.797369028386799

In [None]:
interest_sense_fd = nltk.FreqDist([i.senses[0] for i in senseval.instances('interest.pos')])
# max() returns the most frequent sample; the first key is only the first sense seen
most_frequent_interest_sense = interest_sense_fd.max()
frequency_interest_sense_baseline = interest_sense_fd.freq(most_frequent_interest_sense)

In [None]:
frequency_interest_sense_baseline

0.5287162162162162

You can also use the Naive Bayes classifier from sklearn (https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html).
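
A minimal sketch of how that could look is given below. It assumes the features are kept as dictionaries, as in the NLTK examples, and converts them to a matrix with `DictVectorizer` before fitting `MultinomialNB`. The training and test pairs are invented for illustration; with the Senseval data, the dictionaries would come from one of the feature extractors defined earlier.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB

# Toy (feature dict, label) pairs, invented for illustration
train = [
    ({"rates": 1, "bank": 1}, "money"),
    ({"rates": 1, "loan": 1}, "money"),
    ({"hobby": 1, "music": 1}, "curiosity"),
    ({"music": 1, "art": 1}, "curiosity"),
]
test = [
    ({"rates": 1}, "money"),
    ({"music": 1}, "curiosity"),
]

# Turn the feature dictionaries into a sparse count matrix
vectorizer = DictVectorizer()
X_train = vectorizer.fit_transform([features for features, _ in train])
y_train = [label for _, label in train]

model = MultinomialNB()
model.fit(X_train, y_train)

# transform() reuses the feature columns learned from the training data
X_test = vectorizer.transform([features for features, _ in test])
y_test = [label for _, label in test]
y_pred = model.predict(X_test)

# Per-class precision, recall and F1
print(classification_report(y_test, y_pred))
```

`classification_report` gives per-sense precision and recall, which is more informative than a single accuracy value when the senses are unevenly distributed.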

In [None]:
from sklearn.metrics import classification_report

In [None]:
classification_report??

# Exercises (1p)

Download a sample of the iWeb corpus, available here: https://www.corpusdata.org/iweb/samples/text0.zip . Unzip the archive and choose one of the text files in the archive at random. You will use it in the next exercises.

In [None]:
!wget https://www.corpusdata.org/iweb/samples/text0.zip

--2022-05-25 17:51:11--  https://www.corpusdata.org/iweb/samples/text0.zip
Resolving www.corpusdata.org (www.corpusdata.org)... 209.90.108.238
Connecting to www.corpusdata.org (www.corpusdata.org)|209.90.108.238|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 37647973 (36M) [application/x-zip-compressed]
Saving to: ‘text0.zip’


2022-05-25 17:51:51 (930 KB/s) - ‘text0.zip’ saved [37647973/37647973]



In [None]:
!unzip -x 'text0.zip'

Archive:  text0.zip
  inflating: 103053.txt              
  inflating: 116053.txt              
  inflating: 12053.txt               
  inflating: 128053.txt              
  inflating: 138053.txt              
  inflating: 140053.txt              
  inflating: 152053.txt              
  inflating: 161053.txt              
  inflating: 167053.txt              
  inflating: 17053.txt               
  inflating: 179053.txt              
  inflating: 181053.txt              
  inflating: 183053.txt              
  inflating: 185053.txt              
  inflating: 19053.txt               
  inflating: 206053.txt              
  inflating: 228053.txt              
  inflating: 241053.txt              
  inflating: 246053.txt              
  inflating: 247053.txt              
  inflating: 253053.txt              
  inflating: 259053.txt              
  inflating: 27053.txt               
  inflating: 273053.txt              
  inflating: 287053.txt              
  inflating: 297053.txt       

In [None]:
file = '179053.txt'


def read_data_and_preprocess(file):
  # Read the whole file; the context manager closes it afterwards
  with open(file, 'r') as f:
    data = f.read()
  sentences = tokenize.sent_tokenize(data)
  preprocessed = []
  for sentence in sentences:
    # lowercase
    sentence = sentence.lower()
    # remove the <h> and <p> tags
    prep_sentence = re.sub(r"<h>|<p>", " ", sentence)
    # remove @ and &amp;
    prep_sentence = re.sub(r"@|&amp;", " ", prep_sentence)
    # remove numbers, optionally preceded by letters and followed by slashes
    prep_sentence = re.sub(r"[a-z]*\d+\/*", " ", prep_sentence)
    # remove literal "\n" sequences (a backslash followed by n)
    prep_sentence = re.sub(r"\\n", " ", prep_sentence)
    preprocessed.append(prep_sentence)

  # Tokenize each preprocessed sentence into a list of words
  documents = [tokenize.word_tokenize(document) for document in preprocessed]
  return documents


documents = read_data_and_preprocess(file)

In [None]:
from nltk.corpus import wordnet as wn
from nltk.corpus import stopwords
import string
nltk.download('stopwords')
import re

gloss_rel = lambda x: x.definition()
example_rel = lambda x: " ".join(x.examples())
hyponym_rel = lambda x: " ".join(w.definition() for w in x.hyponyms())
meronym_rel = lambda x: " ".join(w.definition() for w in x.member_meronyms() + \
                                 x.part_meronyms() + x.substance_meronyms())
also_rel = lambda x: " ".join(w.definition() for w in x.also_sees())
attr_rel = lambda x: " ".join(w.definition() for w in x.attributes())
hypernym_rel = lambda x: " ".join(w.definition() for w in x.hypernyms())

relpairs = {wn.NOUN: [(hyponym_rel, meronym_rel), (meronym_rel, hyponym_rel),
                      (hyponym_rel, hyponym_rel),
                      (gloss_rel, meronym_rel), (meronym_rel, gloss_rel),
                      (example_rel, meronym_rel), (meronym_rel, example_rel),
                      (gloss_rel, gloss_rel)],
            wn.ADJ: [(also_rel, gloss_rel), (gloss_rel, also_rel),
                     (attr_rel, gloss_rel), (gloss_rel, attr_rel),
                     (gloss_rel, gloss_rel),
                     (example_rel, gloss_rel), (gloss_rel, example_rel),
                     (gloss_rel, hypernym_rel), (hypernym_rel, gloss_rel)],
            wn.VERB:[(example_rel, example_rel),
                     (example_rel, hypernym_rel), (hypernym_rel, example_rel),
                     (hyponym_rel, hyponym_rel),
                     (gloss_rel, hyponym_rel), (hyponym_rel, gloss_rel),
                     (example_rel, gloss_rel), (gloss_rel, example_rel)]}

def preprocess(text):
    """
    Helper function to preprocess text (lowercase, remove punctuation and stopwords)
    """
    words = nltk.word_tokenize(text)
    punctuation = string.punctuation
    words = [word.lower() for word in words if word not in punctuation]
    # Note: removing all stopwords is not part of the original algorithm,
    # which only removes the ones at the edges of the subsequence
    stop = set(stopwords.words('english'))  # built once instead of once per word
    words = [word for word in words if word not in stop]
    return words

def lcs(S1, S2):
    """
    Helper function to compute length and offsets of longest common substring of
    S1 and S2. Uses the classical dynamic programming algorithm.
    """
    M = [[0]*(1+len(S2)) for i in range(1+len(S1))]
    longest, x_longest, y_longest = 0, 0, 0
    for x in range(1,1+len(S1)):
        for y in range(1,1+len(S2)):
            if S1[x-1] == S2[y-1]:
                M[x][y] = M[x-1][y-1] + 1
                if M[x][y]>longest:
                    longest = M[x][y]
                    x_longest = x
                    y_longest = y
            else:
                M[x][y] = 0
    return longest, x_longest - longest, y_longest - longest

def score(gloss1, gloss2, normalized=False):
    """
    Compute score between two glosses based on length of common substrings.
    """
    gloss1 = preprocess(gloss1)
    gloss2 = preprocess(gloss2)
    curr_score = 0
    longest, start1, start2, = lcs(gloss1, gloss2)
    while longest > 0:
        gloss1[start1 : start1 + longest] = []
        gloss2[start2 : start2 + longest] = []
        curr_score += longest ** 2
        longest, start1, start2 = lcs(gloss1, gloss2)
    if normalized and curr_score:
      return curr_score / (len(gloss1) + len(gloss2))
    return curr_score

def relatedness(sense1, sense2, relpairs, normalized=False):
    """
    Compute the relatedness of two senses (synsets) using the list of pairs of
    relations in relpairs.
    """
    return sum(score(pair[0](sense1), pair[1](sense2), normalized=normalized) # Note: normalization not explicitly part of original algorithm!
    for pair in relpairs)

def wsd(context, target, winsize, pos_tag, verbose=False, normalized=False):
    """
    Find the best sense for a word in a given context.
    Arguments:
    context - sentence(s) we are analyzing; expected as list of strings
    target  - string representing the word whose senses we're trying to
              disambiguate. Target is assumed to occur once in sentence. In case
              of multiple occurrences, the first one is considered. Returns
              (None, 0.) if target is not found in sentence
    winsize - size of window used for disambiguating. The algorithm will only
              look at winsize words of the appropriate part-of-speech around the
              target word
    pos_tag - part of speech of target word
    """
    context = list(filter(None, [wn.synsets(word, pos=pos_tag) for word in context]))
    target_synsets = wn.synsets(target, pos=pos_tag)
    try:
      pos = context.index(target_synsets)
    except ValueError:
      return None, 0.

    window = context[max(pos - winsize, 0) : pos] + \
             context[pos + 1 : min(pos + winsize + 1, len(context))]
    sense_scores = [sum(sum(relatedness(sense, other_sense, relpairs[pos_tag], normalized=normalized)
                              for other_sense in senses)
                   for senses in window) for sense in target_synsets]
    if verbose:
      print("All scores:")
      for i, s in enumerate(target_synsets):
        print(sense_scores[i], s, s.definition())
    best_score = max(sense_scores)
    best_index = sense_scores.index(best_score)
    return target_synsets[best_index], best_score


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


1. Load the text in the selected file and disambiguate every instance of the words `hard` and `line`. Try different approaches, using both knowledge-based and corpus-based methods:

- use the trained Naive Bayes classifier above
- use Lesk algorithm's implementation in NLTK (see previous lab)
- use Banerjee & Pedersen's extended Lesk algorithm (see previous lab, you can use the implementation there)


In [None]:
def pos_tag_wordnet(text, target):
  """
  Return the WordNet part of speech of the first occurrence of target in the
  token list text. Adverbs and all other tags fall back to wn.NOUN.
  """
  tag_to_wn = {'J': wn.ADJ, 'N': wn.NOUN, 'V': wn.VERB}

  pos_text = nltk.pos_tag(text)
  (word, tag) = pos_text[text.index(target)]

  # The first letter of the Penn Treebank tag identifies the word class
  return tag_to_wn.get(tag[0], wn.NOUN)

### EXERCISES FOR WORD 'HARD'

In [None]:
target_synsets = wn.synsets('hard')
for element in target_synsets:
  print(element, element.definition())

Synset('difficult.a.01') not easy; requiring great physical or mental effort to accomplish or comprehend or endure
Synset('hard.a.02') dispassionate; 
Synset('hard.a.03') resisting weight or pressure
Synset('hard.s.04') very strong or vigorous
Synset('arduous.s.01') characterized by effort to the point of exhaustion; especially physical effort
Synset('unvoiced.a.01') produced without vibration of the vocal cords
Synset('hard.a.07') (of light) transmitted directly from a pointed light source
Synset('hard.a.08') (of speech sounds); produced with the back of the tongue raised toward or touching the velum
Synset('intemperate.s.03') given to excessive indulgence of bodily appetites especially for intoxicating liquors
Synset('hard.s.10') being distilled rather than fermented; having a high alcoholic content
Synset('hard.s.11') unfortunate or hard to bear
Synset('hard.s.12') dried out
Synset('hard.r.01') with effort or force or vigor
Synset('hard.r.02') with firmness
Synset('hard.r.03') earne

In [None]:
mapping = {
    "HARD1": ["difficult.a.01", "arduous.s.01", "hard.s.11", "hard.r.01", "hard.r.09", "hard.r.10"],
    "HARD2": ["hard.a.02", "hard.s.04", "unvoiced.a.01", "intemperate.s.03"],
    "HARD3": ["hard.a.03", "hard.r.07", "heavily.r.07", "hard.a.08"],
}
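
To compare a WordNet synset with a Senseval label, it can help to invert the mapping into a synset-to-label lookup. The sketch below copies the mapping above; `invert_mapping` and `synset_to_label` are hypothetical names. Synsets missing from the mapping come back as `None`, which shows that a method predicting such a synset can never be counted as agreeing with the classifier.

```python
mapping = {
    "HARD1": ["difficult.a.01", "arduous.s.01", "hard.s.11", "hard.r.01", "hard.r.09", "hard.r.10"],
    "HARD2": ["hard.a.02", "hard.s.04", "unvoiced.a.01", "intemperate.s.03"],
    "HARD3": ["hard.a.03", "hard.r.07", "heavily.r.07", "hard.a.08"],
}

def invert_mapping(mapping):
    """Build a synset-name -> label lookup; fail if a synset is listed under two labels."""
    lookup = {}
    for label, synset_names in mapping.items():
        for name in synset_names:
            if name in lookup:
                raise ValueError("{} is mapped to both {} and {}".format(name, lookup[name], label))
            lookup[name] = label
    return lookup

synset_to_label = invert_mapping(mapping)
print(synset_to_label.get("hard.s.11"))  # label for a synset covered by the mapping
print(synset_to_label.get("hard.s.10"))  # None: this synset is not covered
```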

In [None]:
classifier = wsd_classifier(NaiveBayesClassifier.train, 'hard.pos', wsd_context_features)
voc_hard = extract_vocab2(documents, "hard")
print("-----")
total_occurences = 0
total_1_2 = 0    # extended Lesk == Lesk
total_1_3 = 0    # extended Lesk matches NB (via mapping)
total_2_3 = 0    # Lesk matches NB (via mapping)
total_1_2_3 = 0  # all three agree

for sent in documents:
  new_document = preprocess(' '.join(sent))
  if 'hard' in new_document:
    total_occurences += 1
    pos = pos_tag_wordnet(new_document, 'hard')
    ws = wsd(context=new_document, target="hard", winsize=3, pos_tag=pos)
    # Run each method once and reuse the results below.
    ext_sense = ws[0].name() if ws[0] is not None else None
    lesk_sense = lesk(new_document, 'hard').name()
    ws2 = wsd_context_features1(new_document, 'hard', voc_hard)
    nb_sense = classifier.classify(ws2)

    print("extended Lesk: {}".format(ext_sense))
    print("Lesk: {}".format(lesk_sense))
    print("Classifier NB: {}".format(nb_sense))
    print("-----")

    # Agreement is only counted when extended Lesk and NB both returned a sense.
    if nb_sense is not None and ext_sense is not None:
      ok = 0
      if ext_sense == lesk_sense:
        total_1_2 += 1
        ok += 1

      if ext_sense in mapping[nb_sense]:
        total_1_3 += 1
        ok += 1

      if lesk_sense in mapping[nb_sense]:
        total_2_3 += 1
        ok += 1

      if ok == 3:
        total_1_2_3 += 1
    

Reading data...
 Senses: HARD2 HARD3 HARD1
Training classifier...
Testing classifier...
Accuracy: 0.8927
-----
extended Lesk: hard.s.04
Lesk: hard.s.11
Classifier NB: HARD1
-----
extended Lesk: arduous.s.01
Lesk: hard.s.11
Classifier NB: HARD3
-----
extended Lesk: difficult.a.01
Lesk: hard.s.11
Classifier NB: HARD1
-----
extended Lesk: hard.a.03
Lesk: hard.s.11
Classifier NB: HARD2
-----
extended Lesk: hard.s.10
Lesk: hard.s.11
Classifier NB: HARD1
-----
extended Lesk: hard.a.08
Lesk: hard.s.11
Classifier NB: HARD3
-----
extended Lesk: difficult.a.01
Lesk: hard.s.11
Classifier NB: HARD1
-----
extended Lesk: None
Lesk: hard.s.11
Classifier NB: HARD1
-----
extended Lesk: None
Lesk: hard.s.11
Classifier NB: HARD1
-----
extended Lesk: hard.s.10
Lesk: hard.s.11
Classifier NB: HARD3
-----
extended Lesk: None
Lesk: hard.s.11
Classifier NB: HARD1
-----
extended Lesk: difficult.a.01
Lesk: hard.s.11
Classifier NB: HARD1
-----
extended Lesk: difficult.a.01
Lesk: hard.s.11
Classifier NB: HARD1
---

2. Compute the proportion of word occurrences that were disambiguated identically by the three algorithms, separately for `line` and `hard` (#identical outputs / #total occurrences of the word). For which of the two words is there higher agreement between the methods?

You can create your own mapping between Senseval2 and WordNet senses (doesn't need to be perfect). More info on mappings here: http://lcl.uniroma1.it/wsdeval/
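
The proportions asked for here can be computed from three parallel lists of predictions. The following is a minimal sketch on made-up toy predictions, not on the lab data; `agreement_rates` and the toy mapping are hypothetical. One design choice to be aware of: in this sketch a missing extended Lesk output (`None`) does not prevent the Lesk vs. NB pair from being counted.

```python
def agreement_rates(ext_lesk, simple_lesk, nb_labels, mapping):
    """Return pairwise and three-way agreement rates over parallel prediction lists.

    ext_lesk / simple_lesk hold synset names (ext_lesk may contain None);
    nb_labels holds classifier labels that `mapping` translates to synset names.
    """
    total = len(nb_labels)
    counts = {"ext_vs_lesk": 0, "ext_vs_nb": 0, "lesk_vs_nb": 0, "all_three": 0}
    for ext, simple, label in zip(ext_lesk, simple_lesk, nb_labels):
        allowed = mapping.get(label, [])
        ext_lesk_same = ext is not None and ext == simple
        ext_nb_same = ext in allowed
        lesk_nb_same = simple in allowed
        counts["ext_vs_lesk"] += int(ext_lesk_same)
        counts["ext_vs_nb"] += int(ext_nb_same)
        counts["lesk_vs_nb"] += int(lesk_nb_same)
        counts["all_three"] += int(ext_lesk_same and ext_nb_same and lesk_nb_same)
    return {name: (count / total if total else 0.0) for name, count in counts.items()}

# Toy predictions for four occurrences (illustrative values only).
toy_mapping = {"HARD1": ["difficult.a.01", "hard.s.11"], "HARD3": ["hard.a.03"]}
ext = ["difficult.a.01", "hard.a.03", None, "hard.s.10"]
simple = ["difficult.a.01", "hard.s.11", "hard.s.11", "hard.s.11"]
nb = ["HARD1", "HARD3", "HARD1", "HARD1"]
rates = agreement_rates(ext, simple, nb, toy_mapping)
print(rates)
```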


In [None]:
# total_1_3 counts extended Lesk vs. NB; total_2_3 counts Lesk vs. NB.
print("Proportion 3 algorithms: {}".format(total_1_2_3/total_occurences))
print("Proportion Lesk and Extended Lesk : {}".format(total_1_2/total_occurences))
print("Proportion Extended Lesk and NB: {}".format(total_1_3/total_occurences))
print("Proportion Lesk and NB: {}".format(total_2_3/total_occurences)) #high agreement

Proportion 3 algorithms: 0.0
Proportion Lesk and Extended Lesk : 0.0
Proportion Extended Lesk and NB: 0.5113636363636364
Proportion Lesk and NB: 0.6363636363636364


3. Pick one of the knowledge-based algorithms above and print the instances where it disagreed with the Naive Bayes method (i.e., the two returned different predictions): show the context in which the word occurred and the output of each method.
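
One way to organise this is to collect `(context, synset, label)` records first and filter them afterwards. The sketch below assumes such records already exist; `disagreements` is a hypothetical helper, the mapping is a cut-down toy version, and the second record is made up for contrast.

```python
def disagreements(records, mapping):
    """Yield the (context, synset_name, nb_label) records where the two methods disagree.

    The methods disagree when the knowledge-based synset is not among the
    synsets that `mapping` lists for the Naive Bayes label.
    """
    for context, synset_name, nb_label in records:
        if synset_name not in mapping.get(nb_label, []):
            yield context, synset_name, nb_label

toy_mapping = {"HARD1": ["difficult.a.01", "hard.s.11"], "HARD3": ["hard.a.03"]}
records = [
    ("climb rock get back boat almost hard getting rock definitely worth effort", "hard.s.11", "HARD3"),
    ("exam hard pass", "hard.s.11", "HARD1"),  # made-up record where both methods agree
]

for context, synset_name, nb_label in disagreements(records, toy_mapping):
    print("Document: {}".format(context))
    print("Lesk output: {}".format(synset_name))
    print("NB output: {}".format(nb_label))
    print("-----")
```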

In [None]:
# Lesk vs. NB
classifier = wsd_classifier(NaiveBayesClassifier.train, 'hard.pos', wsd_context_features)
voc_hard = extract_vocab2(documents, "hard")
print("-----")
for sent in documents:
  new_document = preprocess(' '.join(sent))
  if 'hard' in new_document:
    ws2 = wsd_context_features1(new_document, 'hard', voc_hard)
    nb_sense = classifier.classify(ws2)
    if nb_sense is not None:
      lesk_sense = lesk(new_document, 'hard').name()
      # Disagreement: the Lesk synset is not among those mapped to the NB label.
      if lesk_sense not in mapping[nb_sense]:
        print("Document: {}".format(' '.join(new_document)))
        print("Lesk output: {}".format(lesk_sense))
        print("NB output: {}".format(mapping[nb_sense]))
        print("-----")

    

Reading data...
 Senses: HARD2 HARD3 HARD1
Training classifier...
Testing classifier...
Accuracy: 0.8927
-----
Document: add degree confusion name suspension design gets suspension designs even r/c pan hard bar track bar helps locate axle vehicle keeps axle housing moving side side
Lesk output: hard.s.11
NB output: ['hard.a.03', 'hard.r.07', 'heavily.r.07', 'hard.a.08']
-----
Document: 's worth noting link design properly triangulated need pan hard bar
Lesk output: hard.s.11
NB output: ['hard.a.02', 'hard.s.04', 'unvoiced.a.01', 'intemperate.s.03']
-----
Document: climb rock get back boat almost hard getting rock definitely worth effort
Lesk output: hard.s.11
NB output: ['hard.a.03', 'hard.r.07', 'heavily.r.07', 'hard.a.08']
-----
Document: 're hard anodized use larger stronger hardware durability
Lesk output: hard.s.11
NB output: ['hard.a.03', 'hard.r.07', 'heavily.r.07', 'hard.a.08']
-----
Document: machined sway bar clamps little hard see 've added install
Lesk output: hard.s.11
NB 

4. Train several NaiveBayes models for 'hard.pos'/'interest.pos', including at least the following:
- for the wsd_word_features version, vary number between 100, 200 and 300,
- and vary the stopwords_list between [] (i.e., the empty list) and STOPWORDS;
- for the wsd_context_features version, vary the distance between 1, 2 and 3,
- and vary the stopwords_list between [] and STOPWORDS;
- try keeping only the POS tags of the words in the context (remove the words themselves from the features and use their POS tags instead).
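
For the last item, a POS-only feature extractor could look roughly like the sketch below. It is not the lab's `wsd_context_features`: the name `pos_context_features`, the feature keys and the `Instance` stand-in are all hypothetical, and it assumes every context item is a `(word, tag)` pair. Stopwords inside the window are simply skipped rather than replaced by words further away.

```python
from collections import namedtuple

# Stand-in for NLTK's SensevalInstance, which exposes the same two fields.
Instance = namedtuple("Instance", ["position", "context"])

def pos_context_features(instance, distance=3, stopwords_list=()):
    """Use the POS tags of up to `distance` words on each side of the target as features."""
    features = {}
    for offset in range(-distance, distance + 1):
        if offset == 0:
            continue  # skip the target word itself
        index = instance.position + offset
        if 0 <= index < len(instance.context):
            word, tag = instance.context[index]
            if word.lower() in stopwords_list:
                continue
            features["pos({:+d})".format(offset)] = tag
    return features

# Tail of the 'interest' instance shown at the top of the notebook.
example = Instance(
    position=2,
    context=[("declines", "NNS"), ("in", "IN"), ("interest", "NN"), ("rates", "NNS"), (".", ".")],
)
print(pos_context_features(example, distance=1))
print(pos_context_features(example, distance=1, stopwords_list=["in"]))
```

Because the words themselves are dropped, the stopword list only decides which positions contribute a tag, so the effect of toggling it is worth checking separately for each distance.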

In [None]:

print("\nWSD_WORD_FEATURES\n")
numbers = [100, 200, 300]
for number in numbers:
  print("\nNumber: {}, STOPWORDS=YES".format(number))
  classifier = wsd_classifier(NaiveBayesClassifier.train, 'hard.pos', wsd_word_features, number = number)
  print("\nNumber: {}, STOPWORDS=NO".format(number))
  classifier = wsd_classifier(NaiveBayesClassifier.train, 'hard.pos', wsd_word_features, number = number, stopwords_list=[])


print("\nWSD_CONTEXT_FEATURES\n")
distances = [1, 2, 3]
for distance in distances:
  # Use the context-window extractor here: `distance` is its window size.
  print("\nDistance: {}, STOPWORDS=YES".format(distance))
  classifier = wsd_classifier(NaiveBayesClassifier.train, 'hard.pos', wsd_context_features, distance = distance)
  print("\nDistance: {}, STOPWORDS=NO".format(distance))
  classifier = wsd_classifier(NaiveBayesClassifier.train, 'hard.pos', wsd_context_features, distance = distance, stopwords_list=[])


WSD_WORD_FEATURES


Number: 100, STOPWORDS=YES
Reading data...
 Senses: HARD2 HARD3 HARD1
Training classifier...
Testing classifier...
Accuracy: 0.8431

Number: 100, STOPWORDS=NO
Reading data...
 Senses: HARD2 HARD3 HARD1
Training classifier...
Testing classifier...
Accuracy: 0.8489

Number: 200, STOPWORDS=YES
Reading data...
 Senses: HARD2 HARD3 HARD1
Training classifier...
Testing classifier...
Accuracy: 0.8547

Number: 200, STOPWORDS=NO
Reading data...
 Senses: HARD2 HARD3 HARD1
Training classifier...
Testing classifier...
Accuracy: 0.8581

Number: 300, STOPWORDS=YES
Reading data...
 Senses: HARD2 HARD3 HARD1
Training classifier...
Testing classifier...
Accuracy: 0.8547

Number: 300, STOPWORDS=NO
Reading data...
 Senses: HARD2 HARD3 HARD1
Training classifier...
Testing classifier...
Accuracy: 0.8604

WSD_CONTEXT_FEATURES


Distance: 1, STOPWORDS=YES
Reading data...
 Senses: HARD2 HARD3 HARD1
Training classifier...
Testing classifier...
Accuracy: 0.8547

Distance: 1, STOPWORDS=NO
Re

### EXERCISES FOR WORD 'LINE'

In [None]:
target_synsets = wn.synsets('line')
for element in target_synsets:
  print(element, element.definition())

Synset('line.n.01') a formation of people or things one beside another
Synset('line.n.02') a mark that is long relative to its width
Synset('line.n.03') a formation of people or things one behind another
Synset('line.n.04') a length (straight or curved) without breadth or thickness; the trace of a moving point
Synset('line.n.05') text consisting of a row of words written across a page or computer screen
Synset('line.n.06') a single frequency (or very narrow band) of radiation in a spectrum
Synset('line.n.07') a fortified position (especially one marking the most forward position of troops)
Synset('argumentation.n.02') a course of reasoning aimed at demonstrating a truth or falsehood; the methodical process of logical reasoning
Synset('cable.n.02') a conductor for transmitting electrical or optical signals or electric power
Synset('course.n.02') a connected series of events or actions or developments
Synset('line.n.11') a spatial location defined by a real or imaginary unidimensional ex

In [None]:
mapping = {
    "cord": ["cable.n.02", "line.n.18", "line.n.20"],
    "phone": ["telephone_line.n.02"],
    "product": ["line.n.22"],
    "text": ["line.n.05", "agate_line.n.01", "note.n.02"],
    "division": ["line.n.29"],
    "formation": ["line.n.03"],
}

In [None]:
classifier = wsd_classifier(NaiveBayesClassifier.train, 'line.pos', wsd_context_features)
voc_line = extract_vocab2(documents, "line")
print("-----")
total_occurences = 0
total_1_2 = 0    # extended Lesk == Lesk
total_1_3 = 0    # extended Lesk matches NB (via mapping)
total_2_3 = 0    # Lesk matches NB (via mapping)
total_1_2_3 = 0  # all three agree

for sent in documents:
  new_document = preprocess(' '.join(sent))
  if 'line' in new_document:
    total_occurences += 1
    pos = pos_tag_wordnet(new_document, 'line')
    ws = wsd(context=new_document, target="line", winsize=3, pos_tag=pos)
    # Run each method once and reuse the results below.
    ext_sense = ws[0].name() if ws[0] is not None else None
    lesk_sense = lesk(new_document, 'line').name()
    ws2 = wsd_context_features1(new_document, 'line', voc_line)
    nb_sense = classifier.classify(ws2)

    print("extended Lesk: {}".format(ext_sense))
    print("Lesk: {}".format(lesk_sense))
    print("Classifier NB: {}".format(nb_sense))
    print("-----")

    # Agreement is only counted when extended Lesk and NB both returned a sense.
    if nb_sense is not None and ext_sense is not None:
      ok = 0
      if ext_sense == lesk_sense:
        total_1_2 += 1
        ok += 1

      if ext_sense in mapping[nb_sense]:
        total_1_3 += 1
        ok += 1

      if lesk_sense in mapping[nb_sense]:
        total_2_3 += 1
        ok += 1

      if ok == 3:
        total_1_2_3 += 1
    

Reading data...
 Senses: cord phone product formation division text
Training classifier...
Testing classifier...
Accuracy: 0.7470
-----
extended Lesk: line.n.18
Lesk: trace.v.02
Classifier NB: formation
-----
extended Lesk: line.n.11
Lesk: line.v.01
Classifier NB: text
-----
extended Lesk: line.n.18
Lesk: agate_line.n.01
Classifier NB: text
-----
extended Lesk: line.n.18
Lesk: line.v.01
Classifier NB: formation
-----
extended Lesk: line.n.18
Lesk: line.v.01
Classifier NB: text
-----
extended Lesk: line.n.18
Lesk: line.v.01
Classifier NB: text
-----
extended Lesk: line.n.18
Lesk: line.v.01
Classifier NB: text
-----
extended Lesk: line.n.22
Lesk: line.v.01
Classifier NB: product
-----
extended Lesk: line.n.18
Lesk: line.v.01
Classifier NB: product
-----
extended Lesk: tune.n.01
Lesk: line.v.01
Classifier NB: formation
-----
extended Lesk: line.n.18
Lesk: line.v.01
Classifier NB: text
-----
extended Lesk: line.n.18
Lesk: line.v.01
Classifier NB: text
-----
extended Lesk: cable.n.02
Lesk: 

In [None]:
# total_1_3 counts extended Lesk vs. NB; total_2_3 counts Lesk vs. NB.
print("Proportion 3 algorithms: {}".format(total_1_2_3/total_occurences))
print("Proportion Lesk and Extended Lesk : {}".format(total_1_2/total_occurences))
print("Proportion Extended Lesk and NB: {}".format(total_1_3/total_occurences))
print("Proportion Lesk and NB: {}".format(total_2_3/total_occurences)) #high agreement

Proportion 3 algorithms: 0.00847457627118644
Proportion Lesk and Extended Lesk : 0.025423728813559324
Proportion Extended Lesk and NB: 0.0423728813559322
Proportion Lesk and NB: 0.06779661016949153


All three methods agree slightly more often for 'line.pos' (about 0.8% of occurrences) than for 'hard.pos' (0%). At the same time, the pairwise agreement between NB and either Lesk variant is much higher for 'hard.pos' (about 51% to 64%) than for 'line.pos' (about 4% to 7%).

In [None]:
# Lesk vs. NB
classifier = wsd_classifier(NaiveBayesClassifier.train, 'line.pos', wsd_context_features)
voc_line = extract_vocab2(documents, "line")
print("-----")
for sent in documents:
  new_document = preprocess(' '.join(sent))
  if 'line' in new_document:
    ws2 = wsd_context_features1(new_document, 'line', voc_line)
    nb_sense = classifier.classify(ws2)
    if nb_sense is not None:
      lesk_sense = lesk(new_document, 'line').name()
      # Disagreement: the Lesk synset is not among those mapped to the NB label.
      if lesk_sense not in mapping[nb_sense]:
        print("Document: {}".format(' '.join(new_document)))
        print("Lesk output: {}".format(lesk_sense))
        print("NB output: {}".format(mapping[nb_sense]))
        print("-----")

    

Reading data...
 Senses: cord phone product formation division text
Training classifier...
Testing classifier...
Accuracy: 0.7470
-----
Document: bottom line larger pinion gear make axial vehicle faster
Lesk output: trace.v.02
NB output: ['line.n.03']
-----
Document: line on-site registration line **all on-site registration must made cash
Lesk output: line.v.01
NB output: ['line.n.05', 'note.n.02', 'agate_line.n.01']
-----
Document: bottom line cleaner gears especially pinion ridgecrest without removing transmission completely dropping transmission makes process much easier
Lesk output: line.v.01
NB output: ['line.n.03']
-----
Document: want cut line check download waiver adult- and- minor
Lesk output: line.v.01
NB output: ['line.n.05', 'note.n.02', 'agate_line.n.01']
-----
Document: non-emergency problem reach on-site axial phone line contact listed
Lesk output: line.v.01
NB output: ['line.n.05', 'note.n.02', 'agate_line.n.01']
-----
Document: emergency contact info on-site axial phon

In [None]:

print("\nWSD_WORD_FEATURES\n")
numbers = [100, 200, 300]
for number in numbers:
  print("\nNumber: {}, STOPWORDS=YES".format(number))
  classifier = wsd_classifier(NaiveBayesClassifier.train, 'line.pos', wsd_word_features, number = number)
  print("\nNumber: {}, STOPWORDS=NO".format(number))
  classifier = wsd_classifier(NaiveBayesClassifier.train, 'line.pos', wsd_word_features, number = number, stopwords_list=[])


print("\nWSD_CONTEXT_FEATURES\n")
distances = [1, 2, 3]
for distance in distances:
  # Use the context-window extractor here: `distance` is its window size.
  print("\nDistance: {}, STOPWORDS=YES".format(distance))
  classifier = wsd_classifier(NaiveBayesClassifier.train, 'line.pos', wsd_context_features, distance = distance)
  print("\nDistance: {}, STOPWORDS=NO".format(distance))
  classifier = wsd_classifier(NaiveBayesClassifier.train, 'line.pos', wsd_context_features, distance = distance, stopwords_list=[])


WSD_WORD_FEATURES


Number: 100, STOPWORDS=YES
Reading data...
 Senses: cord phone product formation division text
Training classifier...
Testing classifier...
Accuracy: 0.6169

Number: 100, STOPWORDS=NO
Reading data...
 Senses: cord phone product formation division text
Training classifier...
Testing classifier...
Accuracy: 0.6024

Number: 200, STOPWORDS=YES
Reading data...
 Senses: cord phone product formation division text
Training classifier...
Testing classifier...
Accuracy: 0.6663

Number: 200, STOPWORDS=NO
Reading data...
 Senses: cord phone product formation division text
Training classifier...
Testing classifier...
Accuracy: 0.6578

Number: 300, STOPWORDS=YES
Reading data...
 Senses: cord phone product formation division text
Training classifier...
Testing classifier...
Accuracy: 0.6928

Number: 300, STOPWORDS=NO
Reading data...
 Senses: cord phone product formation division text
Training classifier...
Testing classifier...
Accuracy: 0.6843

WSD_CONTEXT_FEATURES


Distance: 1