# Learning to Classify Text

## Supervised Classification

Classification is the task of choosing the correct class label for a given input. A classifier is called supervised if it is built based on training corpora containing the correct label for each input.

### Gender Identification

Let's build a classifier that predicts whether a given name is male or female. To start with, we'll write a function that extracts the final letter of a given word and returns it in a *feature dictionary*.

In [3]:
def gender_features(word):
    return {'last_letter': word[-1]}

In [4]:
gender_features('Shrek')

{'last_letter': 'k'}

The key in the dictionary is the feature name, and the value is that feature's value for the given input. Next, let's prepare a list of example names with their corresponding class labels (male or female).

In [5]:
import nltk
from nltk.corpus import names
# Pair each name in the corpus with its class label
labeled_names = (
    [(name, 'male') for name in names.words('male.txt')] +
    [(name, 'female') for name in names.words('female.txt')])

In [6]:
import random
random.shuffle(labeled_names)

We can now use the feature extractor to process the data and divide the resulting feature sets into a training set and a test set.

In [7]:
featuresets = [(gender_features(n), gender) for (n, gender) in labeled_names]
# Hold out the first 500 shuffled examples for testing; train on the rest
train_set, test_set = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [15]:
# Testing the classifier on names that do not appear in the training set
print('Neo:', classifier.classify(gender_features('Neo')))
print('Trinity:', classifier.classify(gender_features('Trinity')))

Neo: male
Trinity: female


In [16]:
# Evaluating the classifier on the test set
print(nltk.classify.accuracy(classifier, test_set))

0.746
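To make the accuracy figure concrete, here is a minimal sketch of the idea behind it: classify every held-out example and report the fraction of predictions that match the true label. The `accuracy` function, the `toy_classify` rule, and `toy_test_set` are all hypothetical stand-ins written for illustration; this is not NLTK's own implementation.

```python
def accuracy(classify, labeled_featuresets):
    """Return the fraction of (features, label) pairs that `classify` gets right."""
    correct = sum(1 for features, label in labeled_featuresets
                  if classify(features) == label)
    return correct / len(labeled_featuresets)


# Hypothetical stand-in for a trained classifier: guesses 'female' for names
# ending in a vowel and 'male' otherwise.
def toy_classify(features):
    return 'female' if features['last_letter'] in 'aeiou' else 'male'


# Hypothetical hand-made test set of (feature dictionary, true label) pairs.
toy_test_set = [
    ({'last_letter': 'a'}, 'female'),
    ({'last_letter': 'k'}, 'male'),
    ({'last_letter': 'e'}, 'male'),    # misclassified by the toy rule
    ({'last_letter': 'n'}, 'female'),  # misclassified by the toy rule
]

print(accuracy(toy_classify, toy_test_set))  # 0.5
```

Because the test examples were held out of training, this fraction estimates how well the classifier generalizes to names it has not seen.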


We can examine the classifier to determine the most effective features:

In [17]:
classifier.show_most_informative_features(5)

Most Informative Features
             last_letter = 'a'            female : male   =     34.2 : 1.0
             last_letter = 'k'              male : female =     32.1 : 1.0
             last_letter = 'f'              male : female =     15.2 : 1.0
             last_letter = 'p'              male : female =     11.1 : 1.0
             last_letter = 'v'              male : female =     10.5 : 1.0


According to the output above, names in the training set that end in "a" are female about 34 times more often than they are male.
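The ratios in this listing compare how likely a feature value is under one label with how likely it is under the other. The sketch below works through that calculation on a small, made-up data set. The names `toy_training` and `likelihood` are hypothetical, and the estimate is deliberately unsmoothed; as far as we know, NLTK's Naive Bayes classifier smooths its probability estimates by default, so its reported ratios would differ from a raw count-based calculation like this one.

```python
from collections import Counter

# Hypothetical toy training data: (last letter, label) pairs.
toy_training = (
    [('a', 'female')] * 6 + [('e', 'female')] * 4 +
    [('a', 'male')] * 1 + [('k', 'male')] * 4 + [('n', 'male')] * 5
)

label_counts = Counter(label for _, label in toy_training)
pair_counts = Counter(toy_training)


def likelihood(letter, label):
    """Unsmoothed estimate of P(last_letter = letter | label)."""
    return pair_counts[(letter, label)] / label_counts[label]


# 6 of 10 female names end in 'a', against 1 of 10 male names.
ratio = likelihood('a', 'female') / likelihood('a', 'male')
print(round(ratio, 1))  # 6.0
```

In this toy data, a final "a" is six times as likely for a female name as for a male name, which is the same kind of statement the listing makes with `female : male = 34.2 : 1.0`.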

When working with large corpora, constructing a single list that contains the features of every instance can use up a large amount of memory. In these cases, we can use `nltk.classify.apply_features`, which returns an object that acts like a list but does not store all the feature sets in memory.
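To illustrate what "acts like a list but does not store the feature sets" can mean, here is a minimal sketch of a list-like wrapper that computes each feature set only when it is accessed. The `LazyFeatureList` class and `toy_names` are hypothetical and written for illustration only; this is not how NLTK implements `apply_features`, and the sketch supports integer indexing and iteration but not slicing.

```python
class LazyFeatureList:
    """Hypothetical list-like view that computes feature sets on access."""

    def __init__(self, feature_func, labeled_tokens):
        self._feature_func = feature_func
        self._labeled_tokens = labeled_tokens

    def __len__(self):
        return len(self._labeled_tokens)

    def __getitem__(self, index):
        # Only the requested item is converted; nothing is cached.
        token, label = self._labeled_tokens[index]
        return (self._feature_func(token), label)


def gender_features(word):
    return {'last_letter': word[-1]}


# Hypothetical toy data standing in for the labeled names corpus.
toy_names = [('Neo', 'male'), ('Trinity', 'female')]
lazy_set = LazyFeatureList(gender_features, toy_names)

print(len(lazy_set))  # 2
print(lazy_set[1])    # ({'last_letter': 'y'}, 'female')
```

The trade-off is that feature extraction is repeated each time an item is read, in exchange for never holding every feature dictionary in memory at once.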

In [None]:
from nltk.classify import apply_features
train_set = apply_features(gender_features, labeled_names[500:])
test