## 3

> The Senseval 2 Corpus contains data intended to train word-sense disambiguation classifiers. It contains data for four words: `hard`, `interest`, `line`, and `serve`. Choose one of these four words, and load the corresponding data:
> ```python
> >>> from nltk.corpus import senseval
> >>> instances = senseval.instances('hard.pos')
> >>> size = int(len(instances) * 0.1)
> >>> train_set, test_set = instances[size:], instances[:size]
> ```
> Using this dataset, build a classifier that predicts the correct sense tag for a given instance. See the corpus HOWTO at http://www.nltk.org/howto for information on using the instance objects returned by the Senseval 2 Corpus.

In [1]:
from nltk.corpus import senseval
instances = senseval.instances('hard.pos')
size = int(len(instances) * 0.1)

In [23]:
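def sense_features(instance):
    # Features: the target word, its POS tag, and the POS tag of the preceding token.
    # Note: if position is 0, index -1 wraps around to the last token of the context.
    pos = instance.position
    return {'word': instance.context[pos][0],
            'tag': instance.context[pos][1],
            'tag-prev': instance.context[pos - 1][1]}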
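
The tag of a single neighbour is a narrow view of the context; the surrounding words themselves may also help separate senses. A minimal sketch under stated assumptions: `ToyInstance` is a hypothetical stand-in that mimics the `context`, `position`, and `senses` attributes used above, `window_features` and the `'<none>'` padding value are illustrative names, and every context item is assumed to be a `(word, tag)` pair.

```python
from collections import namedtuple

# Hypothetical stand-in for a Senseval instance (same attribute names as used above).
ToyInstance = namedtuple('ToyInstance', ['context', 'position', 'senses'])

def window_features(instance, width=2):
    """Target word and tag, plus the words within `width` tokens on either side."""
    pos = instance.position
    context = instance.context
    features = {'word': context[pos][0], 'tag': context[pos][1]}
    for offset in range(-width, width + 1):
        if offset == 0:
            continue
        i = pos + offset
        key = 'word%+d' % offset  # e.g. 'word-1', 'word+2'
        # Explicit bounds check so negative indices never wrap around.
        if 0 <= i < len(context):
            features[key] = context[i][0].lower()
        else:
            features[key] = '<none>'
    return features

# Made-up example, not taken from the corpus.
inst = ToyInstance(
    context=[('It', 'PRP'), ('was', 'VBD'), ('a', 'DT'), ('hard', 'JJ'), ('winter', 'NN')],
    position=3,
    senses=('HARD1',),
)
print(window_features(inst))
```

Whether the wider window improves accuracy on the real data would need to be checked by retraining; the sketch only shows the shape of the features.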

In [24]:
instances = [(sense_features(instance), instance.senses[0]) for instance in instances]

In [25]:
train_set, test_set = instances[size:], instances[:size]

In [27]:
import nltk
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [28]:
print(nltk.classify.accuracy(classifier, test_set))

0.9676674364896074
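
An accuracy figure like this is easier to interpret next to a majority-class baseline, and the head-of-list split prescribed by the exercise gives a representative test set only if the instances are not grouped by sense. A minimal sketch of both checks, assuming labelled pairs in the same `(features, label)` shape as above; the toy data and the helper names `shuffled_split` and `majority_baseline_accuracy` are hypothetical, introduced for illustration and not part of NLTK.

```python
import random
from collections import Counter

def shuffled_split(labeled, test_fraction=0.1, seed=0):
    """Shuffle a copy of the labelled pairs, then hold out the first fraction as test data."""
    items = list(labeled)
    random.Random(seed).shuffle(items)
    size = int(len(items) * test_fraction)
    return items[size:], items[:size]

def majority_baseline_accuracy(train, test):
    """Accuracy obtained by always predicting the most frequent training label."""
    majority_label, _ = Counter(label for _, label in train).most_common(1)[0]
    hits = sum(1 for _, label in test if label == majority_label)
    return hits / len(test)

# Toy stand-in for the featuresets: 80 items of one label followed by 20 of another,
# deliberately grouped by label.
toy = [({'tag': 'JJ'}, 'HARD1')] * 80 + [({'tag': 'JJ'}, 'HARD2')] * 20

# Unshuffled head-of-list split: the test set contains only the first label.
head_test = toy[:10]
print({label for _, label in head_test})

# Shuffled split: both labels can appear in the held-out portion.
train, test = shuffled_split(toy)
print(len(train), len(test))
print(majority_baseline_accuracy(train, test))
```

If the baseline on the real data comes out close to the classifier's accuracy, the features are adding little beyond the label distribution.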


## 6

> The synonyms `strong` and `powerful` pattern differently (try combining them with `chip` and `sales`). What features are relevant in this distinction? Build a classifier that predicts when each word should be used.

In [29]:
from nltk.corpus import brown

In [36]:
# Use a set so the membership test in the next cell is O(1) per sentence.
raw_sents_idx = {i for (i, sent) in enumerate(brown.sents()) if 'strong' in sent or 'powerful' in sent}

In [37]:
tagged_sents = [sent for (i, sent) in enumerate(brown.tagged_sents()) if i in raw_sents_idx]

In [39]:
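def syn_features(sent):
    # Locate the first occurrence of either target word in the tagged sentence.
    idx = 0
    for i, (word, _) in enumerate(sent):
        if word in ('strong', 'powerful'):
            idx = i
            break
    # Features: POS tags of the neighbouring tokens. Assumes the target is neither the
    # first token (idx - 1 would wrap around) nor the last (idx + 1 would raise IndexError).
    features = {'tag-prev': sent[idx - 1][1],
                'tag-after': sent[idx + 1][1]}
    label = sent[idx][0]
    return (features, label)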

In [40]:
tagged_sents = [syn_features(sent) for sent in tagged_sents]

In [41]:
size = int(len(tagged_sents) * 0.1)
train_set, test_set = tagged_sents[size:], tagged_sents[:size]

In [42]:
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [43]:
print(nltk.classify.accuracy(classifier, test_set))

0.7083333333333334
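
The exercise's hint about `chip` and `sales` suggests that the neighbouring word itself, not only its tag, may be informative. A sketch of an extended feature extractor, assuming the same `(word, tag)` sentence format; `syn_features_with_words` and the padding value `'<none>'` are hypothetical, and the example sentence is made up with Brown-style tags, not drawn from the corpus. Unlike the cell above, this version also matches capitalised targets.

```python
TARGETS = ('strong', 'powerful')
PAD = ('<none>', '<none>')  # placeholder for a missing neighbour (assumption)

def syn_features_with_words(sent):
    """Return (features, label) for the first target word in a tagged sentence.

    Returns None when the sentence contains neither target word.
    """
    for idx, (word, _) in enumerate(sent):
        if word.lower() in TARGETS:
            break
    else:
        return None
    # Fall back to padding at sentence boundaries instead of wrapping or raising.
    prev_word, prev_tag = sent[idx - 1] if idx > 0 else PAD
    next_word, next_tag = sent[idx + 1] if idx + 1 < len(sent) else PAD
    features = {
        'tag-prev': prev_tag,
        'tag-after': next_tag,
        'word-prev': prev_word.lower(),
        'word-after': next_word.lower(),
    }
    return features, sent[idx][0].lower()

example = [('a', 'AT'), ('powerful', 'JJ'), ('chip', 'NN')]
print(syn_features_with_words(example))
```

Word-level features are much sparser than tag features, so any gain on a sample this small would need to be confirmed by retraining and re-evaluating.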


## 9

> The PP Attachment Corpus is a corpus describing prepositional phrase attachment decisions. Each instance in the corpus is encoded as a `PPAttachment` object:
> ```python
> >>> from nltk.corpus import ppattach
> >>> ppattach.attachments('training')
> [PPAttachment(sent='0', verb='join', noun1='board',
> prep='as', noun2='director', attachment='V'),
> PPAttachment(sent='1', verb='is', noun1='chairman',
> prep='of', noun2='N.V.', attachment='N'),
> ...]
> >>> inst = ppattach.attachments('training')[1]
> >>> (inst.noun1, inst.prep, inst.noun2)
> ('chairman', 'of', 'N.V.')
> ```
> Select only the instances where `inst.attachment` is `N`:
> ```python
> >>> nattach = [inst for inst in ppattach.attachments('training')
> ... if inst.attachment == 'N']
> ```
> Using this subcorpus, build a classifier that attempts to predict which preposition is used to connect a given pair of nouns. For example, given the pair of nouns `team` and `researchers`, the classifier should predict the preposition `of`. See the corpus HOWTO at http://www.nltk.org/howto for more information on using the PP Attachment Corpus.

In [44]:
from nltk.corpus import ppattach

In [45]:
nattach = [inst for inst in ppattach.attachments('training') if inst.attachment == 'N']

In [51]:
def prep_feature(inst):
    # Features: the two nouns to be connected, plus the verb as additional context.
    return {'verb': inst.verb, 'noun1': inst.noun1, 'noun2': inst.noun2}

In [52]:
nattach = [(prep_feature(inst), inst.prep) for inst in nattach]

In [54]:
classifier = nltk.NaiveBayesClassifier.train(nattach)

In [55]:
classifier.classify({'verb': 'is', 'noun1': 'team', 'noun2': 'researchers'})

'of'
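
A single prediction says little about overall quality, and the classifier above was trained on every `N` instance with none held out. A minimal sketch of a held-out check, assuming labelled pairs in the `(features, label)` shape built above; `heldout_accuracy`, `toy_classify`, and the toy pairs are hypothetical stand-ins so that the block runs without the corpus.

```python
def heldout_accuracy(classify, labeled_pairs):
    """Fraction of (features, label) pairs for which classify(features) == label."""
    if not labeled_pairs:
        raise ValueError('need at least one labelled pair')
    correct = sum(1 for features, label in labeled_pairs if classify(features) == label)
    return correct / len(labeled_pairs)

def toy_classify(features):
    # Hypothetical stand-in for classifier.classify: always predicts 'of'.
    return 'of'

# Made-up held-out pairs, not taken from the corpus.
toy_heldout = [
    ({'verb': 'is', 'noun1': 'team', 'noun2': 'researchers'}, 'of'),
    ({'verb': 'is', 'noun1': 'chairman', 'noun2': 'N.V.'}, 'of'),
    ({'verb': 'had', 'noun1': 'stake', 'noun2': 'company'}, 'in'),
    ({'verb': 'had', 'noun1': 'effect', 'noun2': 'sales'}, 'on'),
]
print(heldout_accuracy(toy_classify, toy_heldout))  # 0.5
```

With the real data, `classifier.classify` would take the place of `toy_classify`, and the pairs could come from a slice of `nattach` set aside before training or from `ppattach.attachments('test')` filtered the same way. `nltk.classify.accuracy`, used in the earlier exercises, performs the same computation.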