## Naive Bayes Classification ##

This example identifies the gender associated with a given name.

Male and female names have distinct characteristics (features) that may be used to classify a given name as male or female. Names ending in a, e, and i are usually female, while names ending in k, o, r, s, and t are typically male.
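Before any training is involved, the rule of thumb above can be written down directly. The following is a minimal sketch, not part of the original program; the function name `guess_gender_by_last_letter` and the `'unknown'` fallback are assumptions made for illustration.

```python
# Letter groups taken from the rule of thumb described in the text.
FEMALE_ENDINGS = set("aei")
MALE_ENDINGS = set("korst")

def guess_gender_by_last_letter(name):
    """Return 'female', 'male', or 'unknown' using only the final letter."""
    last = name[-1].lower()
    if last in FEMALE_ENDINGS:
        return 'female'
    if last in MALE_ENDINGS:
        return 'male'
    # The rule of thumb says nothing about the remaining letters.
    return 'unknown'

for sample in ['Anna', 'Mark', 'Lynn']:
    print(sample, guess_gender_by_last_letter(sample))
```

A hand-written rule like this cannot say how strongly each letter indicates a class, which is what the trained classifier estimates from the data.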

The following program will classify a name as male or female based on the above features. First, the names corpus is imported from nltk.corpus to build a list of examples with corresponding class labels. The class label is the gender associated with the name.

In [None]:
# Author: Elizabeth Brooks
# Date Modified: 06/29/2015

# Imports
import random
import nltk
from nltk.classify import apply_features
from nltk.corpus import names
labeledNames = ([(name, 'male') for name in names.words('male.txt')] +
    [(name, 'female') for name in names.words('female.txt')])
    
# Randomize the data
random.shuffle(labeledNames)

Next, a feature extractor will be used to build a dictionary that maps each feature name to its value for a given name. Here the only feature is the last letter of the name.

In [None]:
# Function for extracting relevant features
def extractFeatures(wordInput):
    return {'lastLetter': wordInput[-1]}
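
To make the shape of the extractor's output concrete, the short sketch below applies it to two sample names. The names and labels in `labeledSample` are hypothetical stand-ins for entries from the corpus.

```python
def extractFeatures(wordInput):
    # Same extractor as above: the only feature is the final letter.
    return {'lastLetter': wordInput[-1]}

# Hypothetical labeled examples, standing in for the corpus data.
labeledSample = [('Shrek', 'male'), ('Fiona', 'female')]

# Each example becomes a (feature dictionary, label) pair.
featureSample = [(extractFeatures(n), g) for (n, g) in labeledSample]
print(featureSample)
# [({'lastLetter': 'k'}, 'male'), ({'lastLetter': 'a'}, 'female')]
```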

Use the feature extractor to process the labeled names, then divide the resulting list of feature sets into a training set and a dev set. Finally, train the Naive Bayes classifier on the training set (trainSet).

In [None]:
# Determine the feature sets
featureSets = [(extractFeatures(n), gender) for (n, gender) in labeledNames]

# Establish the training and dev data sets
trainSet, devSet = featureSets[500:], featureSets[:500]  # dev set: first 500; training set: the rest

# Train the Naive Bayes (NB) classifier
classifierNB = nltk.NaiveBayesClassifier.train(trainSet)
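
For intuition about what training does, here is a minimal sketch of Naive Bayes with a single feature, written without NLTK. The toy data, the helper names `trainToyNB` and `classifyToyNB`, and the add-one smoothing are all assumptions for illustration; NLTK's own probability estimates differ in detail.

```python
from collections import Counter, defaultdict

# Hypothetical toy training data: (feature dictionary, label) pairs.
toyTrainSet = [
    ({'lastLetter': 'a'}, 'female'),
    ({'lastLetter': 'a'}, 'female'),
    ({'lastLetter': 'e'}, 'female'),
    ({'lastLetter': 'k'}, 'male'),
    ({'lastLetter': 'o'}, 'male'),
    ({'lastLetter': 'a'}, 'male'),
]

def trainToyNB(trainSet):
    """Count labels and, per label, how often each last letter occurs."""
    labelCounts = Counter(label for _, label in trainSet)
    letterCounts = defaultdict(Counter)
    for features, label in trainSet:
        letterCounts[label][features['lastLetter']] += 1
    vocabulary = {features['lastLetter'] for features, _ in trainSet}
    return labelCounts, letterCounts, vocabulary

def classifyToyNB(model, features):
    """Pick the label maximizing P(label) * P(lastLetter | label)."""
    labelCounts, letterCounts, vocabulary = model
    total = sum(labelCounts.values())
    scores = {}
    for label, count in labelCounts.items():
        prior = count / total
        # Add-one smoothing so unseen letters never get probability zero.
        seen = letterCounts[label][features['lastLetter']]
        likelihood = (seen + 1) / (count + len(vocabulary))
        scores[label] = prior * likelihood
    return max(scores, key=scores.get)

toyModel = trainToyNB(toyTrainSet)
print(classifyToyNB(toyModel, {'lastLetter': 'a'}))  # female
print(classifyToyNB(toyModel, {'lastLetter': 'k'}))  # male
```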

Note: constructing a single list that contains the features of every instance can use a large amount of memory. To reduce memory requirements, the function nltk.classify.apply_features may be used instead. It returns an object that behaves like a list but does not store all the feature sets in memory. See the example function calls below.

In [None]:
# Establish the training set
trainSet = apply_features(extractFeatures, labeledNames[500:])

# Establish the dev set
devSet = apply_features(extractFeatures, labeledNames[:500])
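
The idea of a list-like object that does not store its feature sets can be sketched in a few lines. The class `LazyFeatureList` below is a hypothetical illustration of the concept, not NLTK's implementation: it keeps only the labeled names and computes each feature set when it is accessed.

```python
class LazyFeatureList:
    """Hypothetical list-like wrapper that computes feature sets on access."""

    def __init__(self, extractor, labeledItems):
        self._extractor = extractor
        self._items = labeledItems

    def __len__(self):
        return len(self._items)

    def __getitem__(self, index):
        # Slices return another lazy wrapper; nothing is computed yet.
        if isinstance(index, slice):
            return LazyFeatureList(self._extractor, self._items[index])
        item, label = self._items[index]
        return (self._extractor(item), label)

def extractFeatures(wordInput):
    return {'lastLetter': wordInput[-1]}

# Hypothetical labeled names for illustration.
lazySet = LazyFeatureList(extractFeatures,
                          [('Neo', 'male'), ('Trinity', 'female')])
print(len(lazySet))  # 2
print(lazySet[0])    # ({'lastLetter': 'o'}, 'male')
```

The trade-off is that each feature set is recomputed on every access, which costs time in exchange for memory.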

The accuracy of the trained classifier on the dev set (devSet) determined above can be displayed with the following function call.

In [None]:
# Print to the screen the accuracy of the NB classifier on the dev set
print(nltk.classify.accuracy(classifierNB, devSet))
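
Accuracy here means the fraction of dev examples whose predicted label matches the true label. The sketch below computes that fraction by hand; `accuracySketch`, `toyClassify`, and the toy dev set are hypothetical stand-ins, not NLTK code.

```python
def accuracySketch(classifyFn, labeledFeatureSets):
    """Fraction of examples for which classifyFn returns the true label."""
    correct = sum(1 for features, label in labeledFeatureSets
                  if classifyFn(features) == label)
    return correct / len(labeledFeatureSets)

def toyClassify(features):
    # Stand-in classifier: names ending in a, e, or i are labeled female.
    return 'female' if features['lastLetter'] in 'aei' else 'male'

# Hypothetical dev set; the third example is misclassified by toyClassify.
toyDevSet = [
    ({'lastLetter': 'a'}, 'female'),
    ({'lastLetter': 'k'}, 'male'),
    ({'lastLetter': 'n'}, 'female'),
    ({'lastLetter': 'o'}, 'male'),
]
print(accuracySketch(toyClassify, toyDevSet))  # 0.75
```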

It is also possible to display the features most effective for distinguishing a name's class, or gender.

In [None]:
# Print the 5 most informative features for
# distinguishing between the classes
classifierNB.show_most_informative_features(5)
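
One way to think about how informative a feature value is: compare how likely it is under each class and report the ratio. The sketch below does this on toy data. The data, the helper name `likelihoodRatios`, and the add-0.5 smoothing are assumptions for illustration, and the numbers will not match what NLTK prints for the real corpus.

```python
from collections import Counter

# Hypothetical toy training data: (feature dictionary, label) pairs.
toyTrainSet = [
    ({'lastLetter': 'a'}, 'female'), ({'lastLetter': 'a'}, 'female'),
    ({'lastLetter': 'a'}, 'female'), ({'lastLetter': 'e'}, 'female'),
    ({'lastLetter': 'k'}, 'male'), ({'lastLetter': 'o'}, 'male'),
    ({'lastLetter': 'a'}, 'male'), ({'lastLetter': 'k'}, 'male'),
]

def likelihoodRatios(trainSet, featureName='lastLetter'):
    """Return (value, favored label, ratio) tuples, most informative first."""
    labelTotals = Counter(label for _, label in trainSet)
    valueCounts = Counter((f[featureName], label) for f, label in trainSet)
    values = {f[featureName] for f, _ in trainSet}
    results = []
    for value in values:
        # Smoothed estimate of P(value | label) for each label.
        probs = {
            label: (valueCounts[(value, label)] + 0.5)
                   / (total + 0.5 * len(values))
            for label, total in labelTotals.items()
        }
        favored = max(probs, key=probs.get)
        ratio = probs[favored] / min(probs.values())
        results.append((value, favored, ratio))
    # Sort by ratio, breaking ties alphabetically for a stable order.
    return sorted(results, key=lambda r: (-r[2], r[0]))

for value, favored, ratio in likelihoodRatios(toyTrainSet):
    print(f"lastLetter = {value!r}  favors {favored}  ratio {ratio:.1f} : 1.0")
```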

Finally, either create a test data set or input names not found in the training/dev data sets.

In [None]:
# Test using sample names
# Wrap each call in print so both results are displayed
print(classifierNB.classify(extractFeatures('Neo')))
print(classifierNB.classify(extractFeatures('Trinity')))