# 1 Supervised Classification
![Image](images/supervised-classification.png)

# 1.1 Gender Identification
- Names ending in a, e and i are likely to be female
- Names ending in k, o, r, s and t are likely to be male

In [1]:
# feature extractor function
def gender_features(word):
    return {'last_letter': word[-1]}

# returned dictionary, known as a feature set
gender_features('Shrek')

{'last_letter': 'k'}

In [2]:
# list of examples and corresponding class labels
import nltk
from nltk.corpus import names

labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
[(name, 'female') for name in names.words('female.txt')])

import random
random.shuffle(labeled_names)

In [3]:
featuresets = [(gender_features(n), gender) for (n, gender) in labeled_names]
train_set, test_set = featuresets[500:], featuresets[:500]

# training set is used to train a new "naive Bayes" classifier
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [4]:
classifier.classify(gender_features('Neo'))

'male'

In [5]:
classifier.classify(gender_features('Trinity'))

'female'

In [6]:
# percent of names classified gender correctly
print(nltk.classify.accuracy(classifier, test_set))

0.742


In [7]:
# features most effective for distinguishing the names' genders
classifier.show_most_informative_features(5)

Most Informative Features
             last_letter = 'k'              male : female =     33.9 : 1.0
             last_letter = 'a'            female : male   =     33.0 : 1.0
             last_letter = 'f'              male : female =     16.7 : 1.0
             last_letter = 'p'              male : female =     12.6 : 1.0
             last_letter = 'd'              male : female =     10.5 : 1.0


# 1.2 Choosing The Right Features
- Feature extractors are built through a process of trial-and-error

In [8]:
def gender_features2(name):
    features = {}
    features["first_letter"] = name[0].lower()
    features["last_letter"] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count({})".format(letter)] = name.lower().count(letter)
        features["has({})".format(letter)] = (letter in name.lower())
    return features

In [9]:
print(gender_features2('John'))

{'first_letter': 'j', 'last_letter': 'n', 'count(a)': 0, 'has(a)': False, 'count(b)': 0, 'has(b)': False, 'count(c)': 0, 'has(c)': False, 'count(d)': 0, 'has(d)': False, 'count(e)': 0, 'has(e)': False, 'count(f)': 0, 'has(f)': False, 'count(g)': 0, 'has(g)': False, 'count(h)': 1, 'has(h)': True, 'count(i)': 0, 'has(i)': False, 'count(j)': 1, 'has(j)': True, 'count(k)': 0, 'has(k)': False, 'count(l)': 0, 'has(l)': False, 'count(m)': 0, 'has(m)': False, 'count(n)': 1, 'has(n)': True, 'count(o)': 1, 'has(o)': True, 'count(p)': 0, 'has(p)': False, 'count(q)': 0, 'has(q)': False, 'count(r)': 0, 'has(r)': False, 'count(s)': 0, 'has(s)': False, 'count(t)': 0, 'has(t)': False, 'count(u)': 0, 'has(u)': False, 'count(v)': 0, 'has(v)': False, 'count(w)': 0, 'has(w)': False, 'count(x)': 0, 'has(x)': False, 'count(y)': 0, 'has(y)': False, 'count(z)': 0, 'has(z)': False}


In [10]:
print(len(labeled_names))

featuresets = [(gender_features2(n), gender) for (n, gender) in labeled_names]
print(labeled_names[0])
print(featuresets[:1])

7944
('Barron', 'male')
[({'first_letter': 'b', 'last_letter': 'n', 'count(a)': 1, 'has(a)': True, 'count(b)': 1, 'has(b)': True, 'count(c)': 0, 'has(c)': False, 'count(d)': 0, 'has(d)': False, 'count(e)': 0, 'has(e)': False, 'count(f)': 0, 'has(f)': False, 'count(g)': 0, 'has(g)': False, 'count(h)': 0, 'has(h)': False, 'count(i)': 0, 'has(i)': False, 'count(j)': 0, 'has(j)': False, 'count(k)': 0, 'has(k)': False, 'count(l)': 0, 'has(l)': False, 'count(m)': 0, 'has(m)': False, 'count(n)': 1, 'has(n)': True, 'count(o)': 1, 'has(o)': True, 'count(p)': 0, 'has(p)': False, 'count(q)': 0, 'has(q)': False, 'count(r)': 2, 'has(r)': True, 'count(s)': 0, 'has(s)': False, 'count(t)': 0, 'has(t)': False, 'count(u)': 0, 'has(u)': False, 'count(v)': 0, 'has(v)': False, 'count(w)': 0, 'has(w)': False, 'count(x)': 0, 'has(x)': False, 'count(y)': 0, 'has(y)': False, 'count(z)': 0, 'has(z)': False}, 'male')]


In [11]:
train_set, test_set = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)

# accuracy
print(nltk.classify.accuracy(classifier, test_set))

0.77


![Image](images/corpus-org.png)
- Very productive method for refining the feature set is error analysis
- The training set is used to train the model, and the dev-test set is used to perform error analysis

In [12]:
train_names = labeled_names[1500:]
devtest_names = labeled_names[500:1500]
test_names = labeled_names[:500]

train_set = [(gender_features(n), gender) for (n, gender) in train_names]
devtest_set = [(gender_features(n), gender) for (n, gender) in devtest_names]
test_set = [(gender_features(n), gender) for (n, gender) in test_names]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, devtest_set))

0.751


In [13]:
errors = []
for (name, tag) in devtest_names:
    guess = classifier.classify(gender_features(name))
    if guess != tag:
        errors.append( (tag, guess, name) )


for (tag, guess, name) in sorted(errors)[:10]:
    print('correct={:<8} guess={:<8s} name={:<30}'.format(tag, guess, name))

correct=female   guess=male     name=Aleen                         
correct=female   guess=male     name=Alexis                        
correct=female   guess=male     name=Alisun                        
correct=female   guess=male     name=Alyson                        
correct=female   guess=male     name=Anet                          
correct=female   guess=male     name=Annabel                       
correct=female   guess=male     name=Annabell                      
correct=female   guess=male     name=Arden                         
correct=female   guess=male     name=Avril                         
correct=female   guess=male     name=Bev                           


- Looking through this list of errors
- Names ending in yn appear to be predominantly female, despite the fact that names ending in n tend to be male
- Names ending in ch are usually male, even though names that end in h tend to be female

In [14]:
def gender_features(word):
    return {'suffix1': word[-1:],
            'suffix2': word[-2:]}

In [15]:
train_set = [(gender_features(n), gender) for (n, gender) in train_names]
devtest_set = [(gender_features(n), gender) for (n, gender) in devtest_names]

# the performance on the dev-test dataset improves
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, devtest_set))

0.761


# 1.3 Document Classification

In [16]:
# Movie Reviews Corpus, which categorizes each review as positive or negative
from nltk.corpus import movie_reviews
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

random.shuffle(documents)

# print(movie_reviews.categories())
# print(movie_reviews.fileids())
# print(movie_reviews.words())
print(len(documents))
print(documents[:1])

2000
[(['as', 'african', 'american', 'detective', 'vergil', 'tibbs', 'questions', 'a', 'suspected', 'white', 'murderer', 'inside', 'a', 'jail', 'cell', ',', 'there', 'is', 'a', 'wonderful', ',', 'eye', '-', 'catching', 'shot', 'which', 'instantaneously', 'presents', 'the', 'main', 'message', 'of', 'the', 'entire', 'film', '.', 'the', 'shot', 'has', 'tibbs', "'", 'face', 'completely', 'covered', 'by', 'the', 'shadows', 'of', 'the', 'prison', 'bars', '.', 'to', 'see', 'these', 'bars', 'blocking', 'his', 'face', ',', 'we', 'see', 'how', 'separated', 'tibbs', 'is', 'from', 'the', 'rest', 'of', 'the', 'characters', 'in', 'the', 'film', '.', 'as', 'a', 'black', 'detective', 'conducting', 'an', 'investigation', 'in', 'a', 'southern', 'town', 'full', 'of', 'violent', 'white', 'bigots', ',', 'no', 'matter', 'how', 'innocent', 'tibbs', 'is', ',', 'he', 'is', 'still', 'seen', 'by', 'these', 'bigots', 'as', 'a', 'threat', 'simply', 'because', 'of', 'the', 'color', 'of', 'his', 'skin', '.', 'the', 

In [17]:
# constructing a list of the 2000 most frequent words
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = list(all_words)[:2000]

# feature extractor that simply checks whether each of these words is present in a given document
def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

In [18]:
features = document_features(movie_reviews.words('pos/cv957_8737.txt'))

# preview 1st 10 items of features (dictionary)
dict(list(features.items())[:10])

{'contains(:)': True,
 'contains(a)': True,
 'contains(church)': False,
 'contains(couples)': False,
 'contains(go)': False,
 'contains(party)': False,
 'contains(plot)': True,
 'contains(teen)': False,
 'contains(to)': True,
 'contains(two)': True}

In [19]:
featuresets = [(document_features(d), c) for (d,c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)

print(nltk.classify.accuracy(classifier, test_set))

classifier.show_most_informative_features(5)

0.86
Most Informative Features
       contains(martian) = True              neg : pos    =      7.6 : 1.0
    contains(schumacher) = True              neg : pos    =      7.4 : 1.0
     contains(atrocious) = True              neg : pos    =      6.6 : 1.0
        contains(turkey) = True              neg : pos    =      6.5 : 1.0
       contains(singers) = True              pos : neg    =      6.4 : 1.0


# 1.4 Part-of-Speech Tagging
- We can train a classifier to work out which suffixes are most informative

In [20]:
from nltk.corpus import brown

suffix_fdist = nltk.FreqDist()
for word in brown.words():
    word = word.lower()
    suffix_fdist[word[-1:]] += 1
    suffix_fdist[word[-2:]] += 1
    suffix_fdist[word[-3:]] += 1

common_suffixes = [suffix for (suffix, count) in suffix_fdist.most_common(100)]
print(common_suffixes)

['e', ',', '.', 's', 'd', 't', 'he', 'n', 'a', 'of', 'the', 'y', 'r', 'to', 'in', 'f', 'o', 'ed', 'nd', 'is', 'on', 'l', 'g', 'and', 'ng', 'er', 'as', 'ing', 'h', 'at', 'es', 'or', 're', 'it', '``', 'an', "''", 'm', ';', 'i', 'ly', 'ion', 'en', 'al', '?', 'nt', 'be', 'hat', 'st', 'his', 'th', 'll', 'le', 'ce', 'by', 'ts', 'me', 've', "'", 'se', 'ut', 'was', 'for', 'ent', 'ch', 'k', 'w', 'ld', '`', 'rs', 'ted', 'ere', 'her', 'ne', 'ns', 'ith', 'ad', 'ry', ')', '(', 'te', '--', 'ay', 'ty', 'ot', 'p', 'nce', "'s", 'ter', 'om', 'ss', ':', 'we', 'are', 'c', 'ers', 'uld', 'had', 'so', 'ey']


In [21]:
# feature extraction
def pos_features(word):
    features = {}
    for suffix in common_suffixes:
        features['endswith({})'.format(suffix)] = word.lower().endswith(suffix)
    return features

In [22]:
tagged_words = brown.tagged_words(categories='news')
featuresets = [(pos_features(n), g) for (n,g) in tagged_words]

print(len(featuresets))
print(featuresets[0])

100554
({'endswith(e)': True, 'endswith(,)': False, 'endswith(.)': False, 'endswith(s)': False, 'endswith(d)': False, 'endswith(t)': False, 'endswith(he)': True, 'endswith(n)': False, 'endswith(a)': False, 'endswith(of)': False, 'endswith(the)': True, 'endswith(y)': False, 'endswith(r)': False, 'endswith(to)': False, 'endswith(in)': False, 'endswith(f)': False, 'endswith(o)': False, 'endswith(ed)': False, 'endswith(nd)': False, 'endswith(is)': False, 'endswith(on)': False, 'endswith(l)': False, 'endswith(g)': False, 'endswith(and)': False, 'endswith(ng)': False, 'endswith(er)': False, 'endswith(as)': False, 'endswith(ing)': False, 'endswith(h)': False, 'endswith(at)': False, 'endswith(es)': False, 'endswith(or)': False, 'endswith(re)': False, 'endswith(it)': False, 'endswith(``)': False, 'endswith(an)': False, "endswith('')": False, 'endswith(m)': False, 'endswith(;)': False, 'endswith(i)': False, 'endswith(ly)': False, 'endswith(ion)': False, 'endswith(en)': False, 'endswith(al)': Fal

In [23]:
%%time

size = int(len(featuresets) * 0.1)
# size
train_set, test_set = featuresets[size:], featuresets[:size]

classifier = nltk.DecisionTreeClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

print(classifier.classify(pos_features('cats')))

0.6270512182993535
NNS
CPU times: user 10min 59s, sys: 17.7 s, total: 11min 17s
Wall time: 11min 15s


In [24]:
print(classifier.pseudocode(depth=4))

if endswith(the) == False: 
  if endswith(,) == False: 
    if endswith(s) == False: 
      if endswith(.) == False: return '.'
      if endswith(.) == True: return '.'
    if endswith(s) == True: 
      if endswith(is) == False: return 'PP$'
      if endswith(is) == True: return 'BEZ'
  if endswith(,) == True: return ','
if endswith(the) == True: return 'AT'



# 1.5 Exploiting Context
- We will pass in a complete (untagged) sentence, along with the index of the target word

In [25]:
def pos_features(sentence, i):
    features = {"suffix(1)": sentence[i][-1:],
                "suffix(2)": sentence[i][-2:],
                "suffix(3)": sentence[i][-3:]}
    if i == 0:
        features["prev-word"] = "<START>"
    else:
        features["prev-word"] = sentence[i-1]
    return features

In [26]:
pos_features(brown.sents()[0], 8)

{'prev-word': 'an', 'suffix(1)': 'n', 'suffix(2)': 'on', 'suffix(3)': 'ion'}

In [27]:
%time

tagged_sents = brown.tagged_sents(categories='news')
featuresets = []
for tagged_sent in tagged_sents:
    untagged_sent = nltk.tag.untag(tagged_sent)
    for i, (word, tag) in enumerate(tagged_sent):
        featuresets.append( (pos_features(untagged_sent, i), tag) )

size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.NaiveBayesClassifier.train(train_set)

nltk.classify.accuracy(classifier, test_set)

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 3.1 µs


0.7891596220785678

# 1.6 Sequence Classification
- Use joint classifier models to capture the dependencies between related classification tasks
- Sequence classifier models can be used to jointly choose part-of-speech tags for all the words in a given sentence
- Consecutive classification / greedy sequence classification
    1. Find the most likely class label for the first input
    2. Then to use that answer to help find the best label for the next input
    3. Process can then be repeated until all of the inputs have been labeled

In [28]:
# provides a list of the tags that we've predicted for the sentence so far
def pos_features(sentence, i, history):
    features = {"suffix(1)": sentence[i][-1:],
                "suffix(2)": sentence[i][-2:],
                "suffix(3)": sentence[i][-3:]}
    if i == 0:
        features["prev-word"] = "<START>"
        features["prev-tag"] = "<START>"
    else:
        features["prev-word"] = sentence[i-1]
        features["prev-tag"] = history[i-1]
    return features


# sequence classifier
class ConsecutivePosTagger(nltk.TaggerI):

    def __init__(self, train_sents):
        train_set = []
        for tagged_sent in train_sents:
            untagged_sent = nltk.tag.untag(tagged_sent)
            history = []
            for i, (word, tag) in enumerate(tagged_sent):
                featureset = pos_features(untagged_sent, i, history)
                train_set.append( (featureset, tag) )
                history.append(tag)
        self.classifier = nltk.NaiveBayesClassifier.train(train_set)

    def tag(self, sentence):
        history = []
        for i, word in enumerate(sentence):
            featureset = pos_features(sentence, i, history)
            tag = self.classifier.classify(featureset)
            history.append(tag)
        return zip(sentence, history)

In [29]:
%time

tagged_sents = brown.tagged_sents(categories='news')
size = int(len(tagged_sents) * 0.1)
train_sents, test_sents = tagged_sents[size:], tagged_sents[:size]

tagger = ConsecutivePosTagger(train_sents)
tagger.evaluate(test_sents)

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 3.58 µs


0.7980528511821975

# 2.1 Sentence Segmentation
- Classification task for punctuation

In [30]:
sents = nltk.corpus.treebank_raw.sents()

"""
Tokens is a merged list of tokens from the individual sentences
Boundaries is a set containing the indexes of all sentence-boundary tokens
"""
tokens = []
boundaries = set()
offset = 0
for sent in sents:
    tokens.extend(sent)
    offset += len(sent)
    boundaries.add(offset-1)

In [31]:
def punct_features(tokens, i):
    return {'next-word-capitalized': tokens[i+1][0].isupper(),
            'prev-word': tokens[i-1].lower(),
            'punct': tokens[i],
            'prev-word-is-one-char': len(tokens[i-1]) == 1}

In [32]:
%time
featuresets = [(punct_features(tokens, i), (i in boundaries))
               for i in range(1, len(tokens)-1)
               if tokens[i] in '.?!']

featuresets[0]

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 3.34 µs


({'next-word-capitalized': False,
  'prev-word': 'nov',
  'prev-word-is-one-char': False,
  'punct': '.'},
 False)

In [33]:
%time

size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]

classifier = nltk.NaiveBayesClassifier.train(train_set)
nltk.classify.accuracy(classifier, test_set)

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 3.34 µs


0.936026936026936

# 8 Summary
- Modeling the linguistic data found in corpora can help us to understand linguistic patterns, and can be used to make predictions about new language data.
- Supervised classifiers use labeled training corpora to build models that predict the label of an input based on specific features of that input.
- Supervised classifiers can perform a wide variety of NLP tasks, including document classification, part-of-speech tagging, sentence segmentation, dialogue act type identification, and determining entailment relations, and many other tasks.
- When training a supervised classifier, you should split your corpus into three datasets: a training set for building the classifier model; a dev-test set for helping select and tune the model's features; and a test set for evaluating the final model's performance.
- When evaluating a supervised classifier, it is important that you use fresh data, that was not included in the training or dev-test set. Otherwise, your evaluation results may be unrealistically optimistic.
- Decision trees are automatically constructed tree-structured flowcharts that are used to assign labels to input values based on their features. Although they're easy to interpret, they are not very good at handling cases where feature values interact in determining the proper label.
- In naive Bayes classifiers, each feature independently contributes to the decision of which label should be used. This allows feature values to interact, but can be problematic when two or more features are highly correlated with one another.
- Maximum Entropy classifiers use a basic model that is similar to the model used by naive Bayes; however, they employ iterative optimization to find the set of feature weights that maximizes the probability of the training set.
- Most of the models that are automatically constructed from a corpus are descriptive — they let us know which features are relevant to a given patterns or construction, but they don't give any information about causal relationships between those features and patterns.