# Learning to Classify Text

## Supervised Classification

Classification is the task of choosing the correct class label for a given input. A classifier is called supervised if it is built based on training corpora containing the correct label for each input.

### Gender Identification

Let's build a classifier to determine whether a given name is of a male or a female person. To start with, we'll build a function that extracts the final letter of a given word and returns a *feature dictionary*.

In [1]:
def gender_features(word):
    return {'last_letter': word[-1]}

In [2]:
gender_features('Shrek')

{'last_letter': 'k'}

The key in the dictionary is the feature name. Let's prepare a list of examples of names and corresponding class labels (males or females)

In [8]:
import nltk
from nltk.corpus import names
labeled_names = (
    [(name, 'male') for name in names.words('male.txt')] + [(name, 'female') for name in names.words('female.txt')])

In [9]:
import random
random.shuffle(labeled_names)

We can now use the feature extractor to process the data, and divide it into train and test sets.

In [10]:
featuresets = [(gender_features(n), gender) for (n, gender) in labeled_names]
train_set, test_set = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [11]:
# Testing the classifier on names that do not appear in the training set
print('Neo:', classifier.classify(gender_features('Neo')))
print('Trinity:', classifier.classify(gender_features('Trinity')))

Neo: male
Trinity: female


In [12]:
# Evaulating the classifier on the test set
print(nltk.classify.accuracy(classifier, test_set))

0.786


We can examine the classifier to determine the most effective features:

In [13]:
classifier.show_most_informative_features(5)

Most Informative Features
             last_letter = 'a'            female : male   =     37.3 : 1.0
             last_letter = 'k'              male : female =     30.8 : 1.0
             last_letter = 'f'              male : female =     17.1 : 1.0
             last_letter = 'p'              male : female =     12.4 : 1.0
             last_letter = 'v'              male : female =      9.7 : 1.0


According to the above the names in the training set that end in "a" are female 33 times more often than they are male.

When working with a large corpora, constructing a list for features can quickly use up a large amount of memory. In these cases, we can use `nltk.classify.apply_features` which returns an object that acts like a list but does not store the feature sets in memory.

In [18]:
from nltk.classify import apply_features
train_set = apply_features(gender_features, labeled_names[500:])
test_set = apply_features(gender_features, labeled_names[:500])

### Choosing the Right Features

Let's improve upon our feature extractor for gender names by adding more features.

In [19]:
def gender_features2(name):
    features = {}
    features["first_letter"] = name[0].lower()
    features["last_letter"] = name[-1].lower()
    for letter in "abcdefghijklmnopqrstuvwxyz":
        features["count({})".format(letter)] = name.lower().count(letter)
        features["has({})".format(letter)] = (letter in name.lower())
    return features

In [22]:
print(gender_features2('John'))

{'first_letter': 'j', 'last_letter': 'n', 'count(a)': 0, 'has(a)': False, 'count(b)': 0, 'has(b)': False, 'count(c)': 0, 'has(c)': False, 'count(d)': 0, 'has(d)': False, 'count(e)': 0, 'has(e)': False, 'count(f)': 0, 'has(f)': False, 'count(g)': 0, 'has(g)': False, 'count(h)': 1, 'has(h)': True, 'count(i)': 0, 'has(i)': False, 'count(j)': 1, 'has(j)': True, 'count(k)': 0, 'has(k)': False, 'count(l)': 0, 'has(l)': False, 'count(m)': 0, 'has(m)': False, 'count(n)': 1, 'has(n)': True, 'count(o)': 1, 'has(o)': True, 'count(p)': 0, 'has(p)': False, 'count(q)': 0, 'has(q)': False, 'count(r)': 0, 'has(r)': False, 'count(s)': 0, 'has(s)': False, 'count(t)': 0, 'has(t)': False, 'count(u)': 0, 'has(u)': False, 'count(v)': 0, 'has(v)': False, 'count(w)': 0, 'has(w)': False, 'count(x)': 0, 'has(x)': False, 'count(y)': 0, 'has(y)': False, 'count(z)': 0, 'has(z)': False}


We have to keep in mind that providing too many features to the model can lead to overfitting on the training data. Thus, the model will not be able to generalize well to unseen data. 

Let's train our model with the new features and see how it performs on the test set.

In [23]:
featuresets = [(gender_features2(n), gender) for (n, gender) in labeled_names]
train_set, test_set = featuresets[500:], featuresets[:500]

In [24]:
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

0.782


As we can see above, the new model using the new featureset does not offer much improvement over the model that uses just the last letter of the name as a feature. In fact, it performs *worse*. 

An efficient method for refining the feature set is *error analysis*. First, we select a development set, containing the corpus data for creating the model. This development set is then subdivided into the training set and the dev-test set.

We'll follow this approach now.

In [25]:
train_names = labeled_names[1500:]
devtest_names = labeled_names[500:1500]
test_names = labeled_names[:500]

The training set is used to train the model, and the dev-test set is used to perform error analysis. The test set serves in our final evaluation of the system.

Now that we have divided the corpus into appropriate datasets, we'll train our model using the *train* set and test it on the *dev-test* set.

In [26]:
train_set = [(gender_features(n), gender) for (n, gender) in train_names]
devtest_set = [(gender_features(n), gender) for (n, gender) in devtest_names]
test_set = [(gender_features(n), gender) for (n, gender) in test_names]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, devtest_set))

0.76


Using the dev-test set, we can check the errors that the classifier makes when predicting the name genders. 

In [27]:
errors = []
for (name, tag) in devtest_names:
    guess = classifier.classify(gender_features(name))
    if guess != tag:
        errors.append((tag, guess, name))

In [30]:
for (tag, guess, name) in sorted(errors)[:20]:
    print('correct={:<8} guess={:<8s} name={:<30}'.format(tag, guess, name))

correct=female   guess=male     name=Allison                       
correct=female   guess=male     name=Allsun                        
correct=female   guess=male     name=Amargo                        
correct=female   guess=male     name=Ardeen                        
correct=female   guess=male     name=Avrit                         
correct=female   guess=male     name=Bill                          
correct=female   guess=male     name=Brenn                         
correct=female   guess=male     name=Brittan                       
correct=female   guess=male     name=Carleen                       
correct=female   guess=male     name=Carlen                        
correct=female   guess=male     name=Carolann                      
correct=female   guess=male     name=Caryn                         
correct=female   guess=male     name=Cass                          
correct=female   guess=male     name=Cathyleen                     
correct=female   guess=male     name=Catlin     

If we analyse the list above, we will notice that the *suffixes* seem to be important for classifying a name gender i.e. we have to look at more than one letter at the end of the names.

We can, thus, accordingly modify the feature extractor to include the same.

In [31]:
def gender_features(word):
    return {
        'suffix1': word[-1:],
        'suffix2': word[-2:]}

In [32]:
train_set = [(gender_features(n), gender) for (n, gender) in train_names]
devtest_set = [(gender_features(n), gender) for (n, gender) in devtest_names]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, devtest_set))

0.794


We can see the the performance of our model has improved by almost 1%, compared to the one that takes a look only at the last letter.

### Document Classification

Let's take a look at a case of classifying documents (movies) into categories (positive or negative).

In [36]:
from nltk.corpus import movie_reviews
documents = [
    (list(movie_reviews.words(fileid)), category)
    for category in movie_reviews.categories()
    for fileid in movie_reviews.fileids(category)]

In [43]:
print(documents[0][0][:40])

['plot', ':', 'two', 'teen', 'couples', 'go', 'to', 'a', 'church', 'party', ',', 'drink', 'and', 'then', 'drive', '.', 'they', 'get', 'into', 'an', 'accident', '.', 'one', 'of', 'the', 'guys', 'dies', ',', 'but', 'his', 'girlfriend', 'continues', 'to', 'see', 'him', 'in', 'her', 'life', ',', 'and']


In [44]:
random.shuffle(documents)

Now let's define a feature extractor for documents. We can define a feature for each word, indicating whether or not a document containts that word. 

To limit the number of features that the classifier will need to process, we can construct a list of the most frequent 200 words, and define a feature extractor only for those words.

In [45]:
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = list(all_words)[:2000]

In [46]:
def document_features(document):
    document_words = set(document)
    features = {"contains({})".format(word): (word in document_words) for word in word_features}
    return features

In [49]:
# Uncomment the following line to check the results of feature extraction on one document
# print(document_features(movie_reviews.words('pos/cv957_8737.txt')))

We can now train a classifier to label new movie reviews.

In [50]:
featuresets = [(document_features(d), c) for (d,c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [51]:
print(nltk.classify.accuracy(classifier, test_set))

0.82


In [52]:
classifier.show_most_informative_features(5)

Most Informative Features
 contains(unimaginative) = True              neg : pos    =      7.6 : 1.0
    contains(schumacher) = True              neg : pos    =      7.4 : 1.0
     contains(atrocious) = True              neg : pos    =      7.0 : 1.0
        contains(suvari) = True              neg : pos    =      7.0 : 1.0
        contains(shoddy) = True              neg : pos    =      7.0 : 1.0


### Part-of-Speech Tagging

For tagging parts of speech, we can train a classifier to work out which suffixes are most informative. 

First, let's check out which suffixes are most common.

In [53]:
from nltk.corpus import brown
suffix_fdist = nltk.FreqDist()
for word in brown.words():
    word = word.lower()
    suffix_fdist[word[-1:]] += 1
    suffix_fdist[word[-2:]] += 1
    suffix_fdist[word[-3:]] += 1

In [54]:
common_suffixes = [suffix for (suffix, count) in suffix_fdist.most_common(100)]
print(common_suffixes)

['e', ',', '.', 's', 'd', 't', 'he', 'n', 'a', 'of', 'the', 'y', 'r', 'to', 'in', 'f', 'o', 'ed', 'nd', 'is', 'on', 'l', 'g', 'and', 'ng', 'er', 'as', 'ing', 'h', 'at', 'es', 'or', 're', 'it', '``', 'an', "''", 'm', ';', 'i', 'ly', 'ion', 'en', 'al', '?', 'nt', 'be', 'hat', 'st', 'his', 'th', 'll', 'le', 'ce', 'by', 'ts', 'me', 've', "'", 'se', 'ut', 'was', 'for', 'ent', 'ch', 'k', 'w', 'ld', '`', 'rs', 'ted', 'ere', 'her', 'ne', 'ns', 'ith', 'ad', 'ry', ')', '(', 'te', '--', 'ay', 'ty', 'ot', 'p', 'nce', "'s", 'ter', 'om', 'ss', ':', 'we', 'are', 'c', 'ers', 'uld', 'had', 'so', 'ey']


Let's define a feature extractor which checks a given word for these suffixes, and train a decision tree classifier. 

In [55]:
def pos_features(word):
    return {"endswith({})".format(suffix): word.lower().endswith(suffix) for suffix in common_suffixes}

In [56]:
tagged_words = brown.tagged_words(categories='news')
featuresets = [(pos_features(n), g) for (n,g) in tagged_words]

In [57]:
size = int(len(featuresets)*0.1)
train_set, test_set = featuresets[size:], featuresets[:size]

In [58]:
classifier = nltk.DecisionTreeClassifier.train(train_set)

In [59]:
nltk.classify.accuracy(classifier, test_set)

0.6270512182993535

In [60]:
classifier.classify(pos_features('cats'))

'NNS'

We can use NLTK to print out the decision tree model as pseudocode:

In [61]:
print(classifier.pseudocode(depth=4))

if endswith(the) == False: 
  if endswith(,) == False: 
    if endswith(s) == False: 
      if endswith(.) == False: return '.'
      if endswith(.) == True: return '.'
    if endswith(s) == True: 
      if endswith(is) == False: return 'PP$'
      if endswith(is) == True: return 'BEZ'
  if endswith(,) == True: return ','
if endswith(the) == True: return 'AT'



### Exploiting Context

Let's modify our feature extractor to use an untagged sentence so that we can train a classifier to tag words with part-of-speech while looking at the context.

In [62]:
def pos_features(sentence, i):
    features = {
        "suffix(1)": sentence[i][-1:],
        "suffix(2)": sentence[i][-2:],
        "suffix(3)": sentence[i][-3:]}
    if i==0:
        features["prev-word"] = "<START>"
    else:
        features["prev-word"] = sentence[i-1]
    return features        

In [64]:
# testing the feature extractor
pos_features(brown.sents()[0], 8)

{'suffix(1)': 'n', 'suffix(2)': 'on', 'suffix(3)': 'ion', 'prev-word': 'an'}

In [66]:
tagged_sents = brown.tagged_sents(categories='news')
featuresets = []
for tagged_sent in tagged_sents:
    untagged_sent = nltk.tag.untag(tagged_sent)
    for i, (word, tag) in enumerate(tagged_sent):
        featuresets.append((pos_features(untagged_sent, i), tag))

In [67]:
size = int(len(featuresets)*0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [68]:
nltk.classify.accuracy(classifier, test_set)

0.7891596220785678

### Sequence Classification

The goal of classifying a sequence is to find the most appropriate labellings for a collection of related inputs. 

One strategy for sequence classification is known as *consecutive classification* which infs the most ikely class label for the first input and then use that answer to help find the best label for the next input. 

First, we must modify our pos feature extractor to take a `history` argument, which will provide a list of tags that we've predicted for the sentence so far.

In [69]:
def pos_features(sentence, i, history):
    features = {
        "suffix(1)": sentence[i][-1:],
        "suffix(2)": sentence[i][-2:],
        "suffix(3)": sentence[i][-3:]}
    if i==0:
        features["prev-word"] = "<START>"
        features["prev-tag"] = "<START>"
    else:
        features["prev-word"] = sentence[i-1]
        features["prev-tag"] = history[i-1]
    return features

Let's build our feature 