# DATA 620 - Project 3

Jeremy OBrien, Mael Illien, Vanita Thompson

### Name Gender Classifier

* Using any of the three classifiers described in chapter 6 of Natural Language Processing with Python, and any features you can think of, build the best name gender classifier you can. 
* Begin by splitting the Names Corpus into three subsets: 500 words for the test set, 500 words for the dev-test set, and the remaining 6900 words for the training set. 
* Then, starting with the example name gender classifier, make incremental improvements. Use the dev-test set to check your progress. 
* Once you are satisfied with your classifier, check its final performance on the test set. 
* How does the performance on the test set compare to the performance on the dev-test set? Is this what you'd expect?
* Source: Natural Language Processing with Python, exercise 6.10.2.

## Setup

For this project, we work with the 'names' corpus contained within NLTK.

We start by restructuring some of the guidelines from Chapter 6 of our text. The function 'test_classifier' performs most of the work: splitting the data into training, dev-test, and test sets, extracting features, training the model, and measuring accuracy on the dev-test and test sets.

In the feature engineering section, we build on examples from the text and develop new features to evaluate. 

We study each of the three classifiers described in the text - NaiveBayes, DecisionTree, and MaxEntropy - in their respective sections, and compare results in a summary table in the conclusion.

In [1]:
import random
import pandas as pd
import nltk, re, pprint
from nltk.corpus import names
from nltk.classify import apply_features
from IPython.display import display, HTML

For ease of comparison, we compile a list of the models created for this project and their performance.

In [2]:
models = [] # Will contain tuples (classifier, class_name, gf_name, acc_devtest, acc_test)

## Data Import & Transformation

The NLTK 'names' corpus contains both male and female names in separate text files. The code below extracts the names from both files, assigns a gender to the name, and stores the information in a list of tuples. The labeled names are shuffled to randomize their distribution over the train and test sets.

In [3]:
labeled_names = ([(name, 'male') for name in names.words('male.txt')] + 
         [(name, 'female') for name in names.words('female.txt')])

random.seed(620)
random.shuffle(labeled_names)
labeled_names[:10]

[('Christie', 'female'),
 ('Tibold', 'male'),
 ('Chet', 'male'),
 ('Alyss', 'female'),
 ('Eunice', 'female'),
 ('Mehetabel', 'female'),
 ('Marj', 'female'),
 ('Adam', 'male'),
 ('Natka', 'female'),
 ('Sarene', 'female')]

### Test Classifier

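The **test_classifier** function takes as arguments a corpus of labeled names, a function to extract features from that corpus, and a classifier type. It splits the data into three parts: the first 500 names form the training set, the next 500 the devtest set, and the remaining names the test set. Note that this differs from the split suggested in the assignment, which reserves 500 names each for the test and dev-test sets and trains on the remainder. The function trains the classifier and returns the model, printing its accuracy on both test sets and recording it in the models list. This information is made available to compare between approaches to feature engineering and model types.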

In [4]:
def test_classifier(names_corpus, gender_features_function, classifier_type):

    # Train test split
    train_names = names_corpus[:500]
    devtest_names = names_corpus[500:1000]
    test_names = names_corpus[1000:]
    
    # Apply features
    train_set = [(gender_features_function(n), gender) for (n, gender) in train_names]
    devtest_set = [(gender_features_function(n), gender) for (n, gender) in devtest_names]
    test_set = [(gender_features_function(n), gender) for (n, gender) in test_names]
    
    # Classify and print score; if Maximum Entropy, use trace to limit diagnostic output on screen
    if classifier_type == nltk.ConditionalExponentialClassifier:
        classifier = classifier_type.train(train_set, trace=0)
    else:
        classifier = classifier_type.train(train_set)
    acc_devtest = round(nltk.classify.accuracy(classifier, devtest_set), 3)
    acc_test = round(nltk.classify.accuracy(classifier, test_set), 3)
    print('Dev test set accuracy: ' + str(acc_devtest))
    print('Test set accuracy: ' + str(acc_test))
    
    #classifier.show_most_informative_features(5)
    
    class_name = classifier_type.__name__
    gf_name = gender_features_function.__name__
    models.append((classifier, class_name, gf_name, acc_devtest, acc_test))
    
    return classifier  
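
For comparison, the split described in the assignment (500 names for testing, 500 for dev-test, and the remainder for training) could be sketched as follows. The helper `split_corpus` is hypothetical and not part of the notebook, and the placeholder list merely stands in for the shuffled `labeled_names`. The results reported in this project were not produced with this split.

```python
def split_corpus(items, n_test=500, n_devtest=500):
    """Hypothetical helper: reserve fixed-size test and dev-test sets, train on the rest."""
    test = items[:n_test]
    devtest = items[n_test:n_test + n_devtest]
    train = items[n_test + n_devtest:]
    return train, devtest, test

# Placeholder data standing in for the shuffled labeled_names list
placeholder = [('name%d' % i, 'female') for i in range(7900)]
train, devtest, test = split_corpus(placeholder)
print(len(train), len(devtest), len(test))
```

Training on the larger portion would give each classifier more examples per feature value, at the cost of longer training times for the iterative models.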

## Feature Engineering

We explore a number of approaches to breaking down first names and extracting features to train the classifier model. These different approaches may lend the classifiers more or less power to efficiently discriminate between female and male names.

We include features defined in the text and augment them with additional approaches.

**Approach \#1**:  The last letter of the name. While a simple approach, the particular letter at the end of a name can be a powerful predictor of gender.

In [5]:
def gender_features1(name):
    return {'last_letter': name[-1]}

**Approach \#2**:  The first letter, last letter, and presence and counts of all letters in the name.

In [6]:
def gender_features2(name):
    features = {}
    features["firstletter"] = name[0].lower()
    features["lastletter"] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count(%s)" % letter] = name.lower().count(letter)
        features["has(%s)" % letter] = (letter in name.lower())
    return features

**Approach \#3**:  The last letter and the last two letters of the name (suffixes of length one and two).

In [7]:
def gender_features3(word):
    return {'suffix1': word[-1:], 'suffix2': word[-2:]}

**Approach \#4**:  The first letter, last letter, presence and counts of all letters, and suffixes (final sequence of one, two, or three letters) of the name.

In [8]:
def gender_features4(name):
    features = {}
    features["firstletter"] = name[0].lower()
    features["lastletter"] = name[-1].lower()
    features['suffix1'] =  name[-1:]
    features['suffix2'] = name[-2:]
    features['suffix3'] = name[-3:]
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count(%s)" % letter] = name.lower().count(letter)
        features["has(%s)" % letter] = (letter in name.lower())
    return features
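
The suffix features above keep the name's original capitalization, so a three-letter name such as 'Ace' yields the suffix3 value 'Ace' rather than 'ace', and suffix1 carries the same information as lastletter. A possible variant that lowercases once and drops the duplicate is sketched below. The function `gender_features4_normalized` is hypothetical, was not evaluated in this project, and may or may not change accuracy.

```python
def gender_features4_normalized(name):
    """Hypothetical variant of gender_features4: lowercase once, drop the duplicate suffix1."""
    lowered = name.lower()
    features = {
        'firstletter': lowered[0],
        'lastletter': lowered[-1],   # same information as a one-letter suffix
        'suffix2': lowered[-2:],
        'suffix3': lowered[-3:],
    }
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features['count(%s)' % letter] = lowered.count(letter)
        features['has(%s)' % letter] = letter in lowered
    return features

print(gender_features4_normalized('Ace')['suffix3'])
```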

**Approach \#5**:  Whether the first letter or last letter of the name are vowels.

In [9]:
def gender_features5(name):
    features = {}
    features["vowel_start"] = int(name[0].lower() in 'aeiuo')
    features["vowel_end"] = int(name[-1].lower() in 'aeiuo')
    return features

**Approach \#6**:  The length of the name, together with a flag for short names, using an arbitrary cutoff of fewer than four letters.

In [10]:
def gender_features6(name):
    features = {}
    features["short_name"] = int(len(name) < 4)
    features['length'] = len(name)
    return features

## Classifiers

In this section, we define the model parameters and call the **test_classifier** function. Accuracy scores on both test sets are printed for each model.

### Naive Bayes

With a Naive Bayes classifier, every feature is used to determine which label (male or female) should be assigned to a given input name. The prior probability (the proportion of male and female names in the training data) is modulated by the contribution from each feature to arrive at a likelihood estimate for each label. The label with the highest probability is then assigned. Note that this classifier works under the assumption that each feature is independent of every other feature, which can be unrealistic, hence the qualifier 'naive'.
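
The calculation described above can be sketched in plain Python. This is a minimal illustration on made-up toy data, using unsmoothed frequency estimates; it is not how NLTK implements the classifier, and the names `toy`, `value_counts`, and `posterior` are our own.

```python
from collections import Counter, defaultdict

# Toy training data, made up for illustration: (features, label)
toy = [
    ({'last': 'a', 'first': 'm'}, 'female'),
    ({'last': 'a', 'first': 'a'}, 'female'),
    ({'last': 'e', 'first': 'j'}, 'female'),
    ({'last': 'n', 'first': 'j'}, 'male'),
    ({'last': 'd', 'first': 'd'}, 'male'),
    ({'last': 'e', 'first': 'j'}, 'male'),
]

label_counts = Counter(label for _, label in toy)
value_counts = defaultdict(Counter)   # (label, feature name) -> counts of feature values
for feats, label in toy:
    for fname, value in feats.items():
        value_counts[(label, fname)][value] += 1

def posterior(feats):
    """Return P(label | feats) as prior times per-feature likelihoods, normalized over labels."""
    scores = {}
    for label, n in label_counts.items():
        score = n / len(toy)                                   # prior P(label)
        for fname, value in feats.items():
            score *= value_counts[(label, fname)][value] / n   # P(value | label), unsmoothed
        scores[label] = score
    total = sum(scores.values())
    if total == 0:                                             # unseen combination: fall back to uniform
        return {label: 1 / len(scores) for label in scores}
    return {label: s / total for label, s in scores.items()}

print(posterior({'last': 'e', 'first': 'j'}))
```

Multiplying the per-feature likelihoods is exactly where the independence assumption enters: each feature contributes as if the others were not there.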

We test our first feature (last letter only). By displaying the most informative features, we discover the classifying power of the last letter 'a', which is nearly 46 times more likely to occur in a female name than in a male name. The accuracy score is solid for a single feature.

In [11]:
nb_mod1 = test_classifier(labeled_names, gender_features1, nltk.NaiveBayesClassifier)
nb_mod1.show_most_informative_features(5)

Dev test set accuracy: 0.758
Test set accuracy: 0.756
Most Informative Features
             last_letter = 'a'            female : male   =     45.9 : 1.0
             last_letter = 'd'              male : female =      9.7 : 1.0
             last_letter = 'o'              male : female =      8.9 : 1.0
             last_letter = 'r'              male : female =      6.5 : 1.0
             last_letter = 'i'            female : male   =      4.7 : 1.0
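
Each ratio in this output compares how likely a feature value is under one label versus the other. A minimal sketch of that calculation follows, using made-up counts. The add-0.5 smoothing (expected likelihood estimation) is an assumption on our part about the estimator involved, and the helper names are hypothetical.

```python
def smoothed_prob(count, total, bins, gamma=0.5):
    """Lidstone-style estimate; gamma=0.5 gives the expected likelihood estimate."""
    return (count + gamma) / (total + gamma * bins)

def likelihood_ratio(count_a, total_a, count_b, total_b, bins):
    """Ratio P(value | label a) / P(value | label b) under the smoothed estimates."""
    return smoothed_prob(count_a, total_a, bins) / smoothed_prob(count_b, total_b, bins)

# Made-up counts: 60 of 300 female names end in a given letter, versus 2 of 200 male names,
# with 26 possible last letters.
ratio = likelihood_ratio(60, 300, 2, 200, bins=26)
print(round(ratio, 1))
```

Smoothing matters most for rare values: without it, a letter never seen with one label would produce a division by zero or an infinite ratio.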


When including the first letter, last letter, and the presence and count of all alphabet letters, we see that the first five most informative features are identical to the first approach. We print the next five features to see the contribution of the added features. While their contribution is not as significant as the first five, we can see below the impact of two 'd's in a name, the first letter 'z', and the presence of 'w'. 

In [12]:
nb_mod2 = test_classifier(labeled_names, gender_features2, nltk.NaiveBayesClassifier)
nb_mod2.show_most_informative_features(10)

Dev test set accuracy: 0.738
Test set accuracy: 0.753
Most Informative Features
              lastletter = 'a'            female : male   =     45.9 : 1.0
              lastletter = 'd'              male : female =      9.7 : 1.0
              lastletter = 'o'              male : female =      8.9 : 1.0
              lastletter = 'r'              male : female =      6.5 : 1.0
              lastletter = 'i'            female : male   =      4.7 : 1.0
                count(d) = 2                male : female =      4.5 : 1.0
              lastletter = 'm'              male : female =      3.9 : 1.0
             firstletter = 'z'              male : female =      3.9 : 1.0
                count(w) = 1                male : female =      3.9 : 1.0
                  has(w) = True             male : female =      3.9 : 1.0


A suffix of size one is equivalent to the last letter of a name, so it is no surprise to see suffix1 = 'a' appearing at the top. Suffixes of size two have a strong contribution. The suffix 'ne' (as in 'Anne') is more likely to be female, while the suffix 'er' (as in 'Peter') is more likely to be male.

In [13]:
nb_mod3 = test_classifier(labeled_names, gender_features3, nltk.NaiveBayesClassifier)
nb_mod3.show_most_informative_features(10)

Dev test set accuracy: 0.762
Test set accuracy: 0.766
Most Informative Features
                 suffix1 = 'a'            female : male   =     45.9 : 1.0
                 suffix2 = 'ne'           female : male   =     10.2 : 1.0
                 suffix1 = 'd'              male : female =      9.7 : 1.0
                 suffix1 = 'o'              male : female =      8.9 : 1.0
                 suffix1 = 'r'              male : female =      6.5 : 1.0
                 suffix2 = 'er'             male : female =      5.2 : 1.0
                 suffix1 = 'i'            female : male   =      4.7 : 1.0
                 suffix2 = 'on'             male : female =      4.6 : 1.0
                 suffix1 = 'm'              male : female =      3.9 : 1.0
                 suffix2 = 'ed'             male : female =      3.6 : 1.0


When three-letter suffixes are included, we encounter the highest accuracy on the test set for a Naive Bayes model so far: 0.779. We encounter duplication between lastletter and suffix1. A suffix of size three, 'ine', shows up in the top ten.

In [14]:
nb_mod4 = test_classifier(labeled_names, gender_features4, nltk.NaiveBayesClassifier)
nb_mod4.show_most_informative_features(10)

Dev test set accuracy: 0.78
Test set accuracy: 0.779
Most Informative Features
              lastletter = 'a'            female : male   =     45.9 : 1.0
                 suffix1 = 'a'            female : male   =     45.9 : 1.0
                 suffix2 = 'ne'           female : male   =     10.2 : 1.0
              lastletter = 'd'              male : female =      9.7 : 1.0
                 suffix1 = 'd'              male : female =      9.7 : 1.0
              lastletter = 'o'              male : female =      8.9 : 1.0
                 suffix1 = 'o'              male : female =      8.9 : 1.0
                 suffix3 = 'ine'          female : male   =      6.5 : 1.0
              lastletter = 'r'              male : female =      6.5 : 1.0
                 suffix1 = 'r'              male : female =      6.5 : 1.0


By evaluating letters on whether they are vowels, we find that names ending in vowels are 2.5 times more likely to be female names. Conversely, names ending in a consonant are about 2.4 times more likely to be male names.

In [15]:
nb_mod5 = test_classifier(labeled_names, gender_features5, nltk.NaiveBayesClassifier)
nb_mod5.show_most_informative_features()

Dev test set accuracy: 0.728
Test set accuracy: 0.729
Most Informative Features
               vowel_end = 1              female : male   =      2.5 : 1.0
               vowel_end = 0                male : female =      2.4 : 1.0
             vowel_start = 1              female : male   =      1.1 : 1.0
             vowel_start = 0                male : female =      1.0 : 1.0


With a test accuracy of 0.63, the number of letters in a name is the weakest feature set for classifying gender thus far.

In [16]:
nb_mod6 = test_classifier(labeled_names, gender_features6, nltk.NaiveBayesClassifier)
nb_mod6.show_most_informative_features(10)

Dev test set accuracy: 0.656
Test set accuracy: 0.63
Most Informative Features
                  length = 3                male : female =      3.5 : 1.0
                  length = 9              female : male   =      3.5 : 1.0
              short_name = 1                male : female =      3.1 : 1.0
                  length = 4                male : female =      1.7 : 1.0
                  length = 7              female : male   =      1.3 : 1.0
                  length = 10               male : female =      1.2 : 1.0
                  length = 5              female : male   =      1.1 : 1.0
              short_name = 0              female : male   =      1.1 : 1.0
                  length = 8              female : male   =      1.0 : 1.0
                  length = 6              female : male   =      1.0 : 1.0


### Decision Trees

Decision trees are made up of two components: decision nodes, which check feature values, and leaf nodes, which assign labels. The algorithm computes a decision stump for each possible feature, and evaluates which feature achieves the best accuracy on the training data. It then iteratively checks every leaf of the stump and computes a new decision stump based on the feature that maximizes the accuracy as before.
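
The stump-selection step can be sketched as follows. This is a simplified illustration on made-up data, not NLTK's implementation, and `best_stump` is a hypothetical helper: for each feature it predicts the majority label for every feature value and keeps the feature with the highest training accuracy.

```python
from collections import Counter, defaultdict

def best_stump(labeled_featuresets):
    """Return (feature name, training accuracy, value -> label mapping) for the best single-feature split."""
    feature_names = {name for feats, _ in labeled_featuresets for name in feats}
    best = None
    for name in sorted(feature_names):
        # For each value of this feature, count the labels of the matching examples
        by_value = defaultdict(Counter)
        for feats, label in labeled_featuresets:
            by_value[feats.get(name)][label] += 1
        # Predicting the majority label per value gets this many examples right
        correct = sum(counts.most_common(1)[0][1] for counts in by_value.values())
        accuracy = correct / len(labeled_featuresets)
        if best is None or accuracy > best[1]:
            leaves = {value: counts.most_common(1)[0][0] for value, counts in by_value.items()}
            best = (name, accuracy, leaves)
    return best

toy = [
    ({'last_letter': 'a', 'length': 4}, 'female'),
    ({'last_letter': 'a', 'length': 5}, 'female'),
    ({'last_letter': 'e', 'length': 4}, 'female'),
    ({'last_letter': 'n', 'length': 4}, 'male'),
    ({'last_letter': 'd', 'length': 5}, 'male'),
    ({'last_letter': 'n', 'length': 5}, 'male'),
]
name, accuracy, leaves = best_stump(toy)
print(name, accuracy, leaves)
```

A full tree would repeat this selection within each leaf, using only the training examples that reach that leaf.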

We leverage pseudocode and truncated pretty_format output to help understand the structure of each decision tree below.

The first approach uses only a single feature (the last letter), which doesn't result in much of a tree. However, the accuracy score is quite similar to the Naive Bayes model using the same feature.

In [17]:
dt_mod1 = test_classifier(labeled_names, gender_features1, nltk.DecisionTreeClassifier)

Dev test set accuracy: 0.756
Test set accuracy: 0.762


When a wider set of features is used, a branching structure becomes apparent, with indentation representing deeper levels of the tree. Interestingly, while dev-test accuracy rises slightly (0.768 versus 0.756), test accuracy drops from 0.762 to 0.723 compared with the last letter feature alone.

In [18]:
dt_mod2 = test_classifier(labeled_names, gender_features2, nltk.DecisionTreeClassifier)
print(dt_mod2.pretty_format(width=50, prefix='', depth=4)[:1003]) # alternate to pseudocode function call

Dev test set accuracy: 0.768
Test set accuracy: 0.723
lastletter=a? ..................... female
lastletter=b? ..................... male
lastletter=c? ..................... male
lastletter=d? ..................... male
  firstletter=a? .................. male
  firstletter=c? .................. female
  firstletter=d? .................. male
  firstletter=f? .................. female
  firstletter=m? .................. male
  firstletter=n? .................. male
  firstletter=r? .................. male
  firstletter=s? .................. male
  firstletter=t? .................. male
  firstletter=w? .................. male
lastletter=e? ..................... female
  firstletter=a? .................. female
    count(c)=0? ................... female
      count(a)=2? ................. male
      count(a)=1? ................. female
    count(c)=1? ................... male
  firstletter=b? .................. female
    count(f)=0? ................... female
    count(f)=1? ..........

In the next approach, only suffixes of size 2 form the leaves of the tree. Dev-test accuracy (0.73) is lower than in the first two approaches, while test accuracy (0.732) falls between theirs (0.762 and 0.723).

In [19]:
dt_mod3 = test_classifier(labeled_names, gender_features3, nltk.DecisionTreeClassifier)
print(dt_mod3.pretty_format(width=50, prefix='', depth=4)[:500])

Dev test set accuracy: 0.73
Test set accuracy: 0.732
suffix2=ad? ....................... male
suffix2=ah? ....................... female
suffix2=ak? ....................... male
suffix2=al? ....................... male
suffix2=am? ....................... male
suffix2=an? ....................... female
suffix2=ar? ....................... male
suffix2=as? ....................... male
suffix2=at? ....................... male
suffix2=ba? ....................... female
suffix2=be? ....................... female
suffix2=by? ....................... male



While adding more features increases complexity, it has actually reduced accuracy compared with the preceding approaches. This is likely because the tree splits first on three-letter suffixes, which take many distinct values, so each branch is left with very little training data from which to generalize, leading to overfitting.

In [20]:
dt_mod4 = test_classifier(labeled_names, gender_features4, nltk.DecisionTreeClassifier)
print(dt_mod4.pretty_format(width=50, prefix='', depth=4)[:1000])

Dev test set accuracy: 0.634
Test set accuracy: 0.639
suffix3=aak? ...................... male
suffix3=ace? ...................... male
suffix3=Ace? ...................... male
suffix3=ada? ...................... female
suffix3=add? ...................... male
suffix3=ady? ...................... male
suffix3=afe? ...................... male
suffix3=ain? ...................... female
suffix3=air? ...................... male
suffix3=ait? ...................... male
suffix3=ale? ...................... male
suffix3=ami? ...................... female
suffix3=ana? ...................... female
suffix3=and? ...................... male
suffix3=ane? ...................... female
suffix3=ang? ...................... male
suffix3=ani? ...................... female
suffix3=ano? ...................... male
suffix3=ara? ...................... female
suffix3=ard? ...................... male
suffix3=arj? ...................... female
suffix3=ark? ...................... male
suffix3=arv? ...............

The next two approaches yield single-feature trees without additional branching. Their accuracy scores are similar to the Naive Bayes approaches using the same feature sets.

In [21]:
dt_mod5 = test_classifier(labeled_names, gender_features5, nltk.DecisionTreeClassifier)
print(dt_mod5.pseudocode(depth=2))

Dev test set accuracy: 0.728
Test set accuracy: 0.729
if vowel_end == 0: return 'male'
if vowel_end == 1: return 'female'



In [22]:
dt_mod6 = test_classifier(labeled_names, gender_features6, nltk.DecisionTreeClassifier)
print(dt_mod6.pseudocode(depth=2))

Dev test set accuracy: 0.626
Test set accuracy: 0.61
if length == 10: return 'female'
if length == 11: return 'female'
if length == 12: return 'female'
if length == 13: return 'female'
if length == 2: return 'female'
if length == 3: return 'male'
if length == 4: return 'male'
if length == 5: return 'female'
if length == 6: return 'female'
if length == 7: return 'female'
if length == 8: return 'female'
if length == 9: return 'female'



### Maximum Entropy

Instead of using probabilities to set model parameters as the Naive Bayes classifier does, the Maximum Entropy Model (or [MaxEntropy](https://web.stanford.edu/class/cs124/lec/Maximum_Entropy_Classifiers.pdf)) searches for the set of parameters that maximizes model performance, specifically the total likelihood of the training data. The property of [entropy](https://lost-contact.mit.edu/afs/cs.pitt.edu/projects/nltk/docs/tutorial/classifying/nochunks.html#maxent) entails uniformity of the distribution where there isn't empirical evidence that would constrain that uniformity.  

Unlike Naive Bayes, MaxEntropy does not assume independence of features, and so is not negatively impacted when there is dependence between features - which can often be the case. As MaxEnt captures the structure of the training data, the more features it uses the stronger the constraint of empirical consistency becomes.

MaxEntropy is a conditional classifier, meaning it can be used to determine the most likely label for a given input, or how likely a given label is for that input. A generative classifier like Naive Bayes can answer those questions as well, and can additionally estimate the most likely input value and how likely a given input value is, with or without a given label.
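
As a rough illustration of how a conditional exponential model turns feature weights into label probabilities, the sketch below sums the weights of the active (feature, value, label) combinations for each label and normalizes the exponentiated totals. The weights are hypothetical, and the use of the natural exponential is an assumption; NLTK's implementation may use a different logarithm base and encoding.

```python
import math

# Hypothetical weights for (feature name, feature value, label) triples
weights = {
    ('vowel_end', 1, 'female'): 0.5,
    ('vowel_end', 1, 'male'): -1.2,
    ('vowel_end', 0, 'female'): -0.5,
    ('vowel_end', 0, 'male'): 0.4,
}
labels = ['female', 'male']

def conditional_probs(features):
    """P(label | features), proportional to exp(sum of weights of the active triples)."""
    scores = {
        label: sum(weights.get((name, value, label), 0.0) for name, value in features.items())
        for label in labels
    }
    z = sum(math.exp(s) for s in scores.values())   # normalizing constant over labels
    return {label: math.exp(s) / z for label, s in scores.items()}

probs = conditional_probs({'vowel_end': 1})
print(probs)
```

A positive weight raises the probability of its label when the feature value is present, and a negative weight lowers it, which is how the signed values in the informative-feature listings should be read.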

NLTK offers two MaxEntropy algorithms out of the box: Generalized Iterative Scaling (GIS) and Improved Iterative Scaling (IIS). Iterative optimization using these algorithms can be time-consuming. Both the Wikipedia article on [MaxEntropy](https://en.wikipedia.org/wiki/Maximum_entropy_probability_distribution) and the paper ['A comparison of algorithms for maximum entropy parameter estimation'](http://luthuli.cs.uiuc.edu/~daf/courses/Opt-2017/Papers/p18-malouf.pdf) note that gradient-based methods, such as coordinate descent, L-BFGS, and LMBVM, are preferable for their improved computational performance. While NLTK offered additional classifier algorithms through SciPy, it seems support has lapsed in the latest SciPy releases.

Given the outdated SciPy support and the modest computational demands of this dataset, for this project we implement MaxEntropy using the GIS algorithm.

Using only the last letter of the name leads to rapid convergence on the second iteration. Interestingly, consonants in the last letter position are the most informative features, in five of six cases classifying male. This contrasts with the Naive Bayes model, which had a mix of vowels / consonants and genders as top features.

In [23]:
me_mod1 = test_classifier(labeled_names, gender_features1, nltk.ConditionalExponentialClassifier)
me_mod1.show_most_informative_features(10)

Dev test set accuracy: 0.756
Test set accuracy: 0.763
   6.644 last_letter=='k' and label is 'male'
   6.644 last_letter=='x' and label is 'male'
   6.644 last_letter=='b' and label is 'male'
   6.644 last_letter=='z' and label is 'female'
   6.644 last_letter=='p' and label is 'male'
   6.644 last_letter=='c' and label is 'male'
  -5.858 last_letter=='a' and label is 'male'
  -2.392 last_letter=='i' and label is 'male'
  -2.000 last_letter=='d' and label is 'female'
  -1.807 last_letter=='o' and label is 'female'


Including the first letter and counts of all letters improves accuracy (dev-test 0.766, test 0.782), but with a noticeable impact on computational performance. Some first letters and letter counts appear among the top ten features, showing that these additions contribute to performance. Even after 90 iterations the model continues to improve incrementally.

In [24]:
me_mod2 = test_classifier(labeled_names, gender_features2, nltk.ConditionalExponentialClassifier)
me_mod2.show_most_informative_features(10)

Dev test set accuracy: 0.766
Test set accuracy: 0.782
  -5.130 lastletter=='a' and label is 'male'
   3.347 firstletter=='u' and label is 'female'
   2.292 firstletter=='y' and label is 'male'
  -2.175 lastletter=='i' and label is 'male'
  -1.604 lastletter=='o' and label is 'female'
  -1.590 lastletter=='d' and label is 'female'
   1.563 lastletter=='c' and label is 'male'
   1.501 count(e)==4 and label is 'female'
   1.449 count(h)==3 and label is 'female'
  -1.374 lastletter=='r' and label is 'female'


Including the last two letters of the name does not improve upon the accuracy of just using the final letter.  As it takes noticeably longer to process and seven iterations to reach an optimum, this is not a strong candidate.

In [25]:
me_mod3 = test_classifier(labeled_names, gender_features3, nltk.ConditionalExponentialClassifier)
me_mod3.show_most_informative_features(10)

Dev test set accuracy: 0.758
Test set accuracy: 0.763
  15.029 suffix2=='ha' and label is 'male'
  11.066 suffix2=='vi' and label is 'male'
  10.253 suffix2=='ko' and label is 'female'
  -9.849 suffix1=='a' and label is 'male'
   8.238 suffix2=='Em' and label is 'female'
   7.264 suffix2=='ev' and label is 'female'
   6.848 suffix2=='es' and label is 'female'
   6.848 suffix2=='ss' and label is 'female'
   6.439 suffix2=='me' and label is 'male'
   6.439 suffix2=='fe' and label is 'male'


Adding two- and three-letter suffixes to the first letter, last letter, and letter counts delivers the best test accuracy of all models (including Naive Bayes and Decision Trees) so far: 0.784. In exchange, the model takes a good amount of time to run, and continues to improve after 90 iterations. Apart from the final letter 'a', which carries a strong negative weight for the male label, two- and three-letter suffixes have the most impact.

In [26]:
me_mod4 = test_classifier(labeled_names, gender_features4, nltk.ConditionalExponentialClassifier)
me_mod4.show_most_informative_features(10)

Dev test set accuracy: 0.772
Test set accuracy: 0.784
  -3.148 lastletter=='a' and label is 'male'
  -3.148 suffix1=='a' and label is 'male'
   2.847 suffix2=='ha' and label is 'male'
   2.847 suffix3=='cha' and label is 'male'
   2.367 suffix3=='nri' and label is 'male'
   2.058 suffix3=='eer' and label is 'female'
   1.968 suffix3=='lil' and label is 'male'
   1.847 suffix3=='ase' and label is 'male'
   1.799 suffix3=='vie' and label is 'male'
   1.795 suffix2=='vi' and label is 'male'


The final two approaches - whether first / last letters are vowels, and the length of names - yield the worst accuracy measures of the MaxEntropy models.

In [27]:
me_mod5 = test_classifier(labeled_names, gender_features5, nltk.ConditionalExponentialClassifier)
me_mod5.show_most_informative_features(10)

Dev test set accuracy: 0.728
Test set accuracy: 0.729
  -1.238 vowel_end==1 and label is 'male'
   0.495 vowel_end==1 and label is 'female'
  -0.477 vowel_end==0 and label is 'female'
   0.394 vowel_end==0 and label is 'male'
  -0.314 vowel_start==1 and label is 'male'
   0.219 vowel_start==1 and label is 'female'
  -0.188 vowel_start==0 and label is 'male'
   0.149 vowel_start==0 and label is 'female'


In [28]:
me_mod6 = test_classifier(labeled_names, gender_features6, nltk.ConditionalExponentialClassifier)
me_mod6.show_most_informative_features(10)

Dev test set accuracy: 0.626
Test set accuracy: 0.611
   5.721 length==2 and label is 'female'
   5.185 length==13 and label is 'female'
   5.185 length==11 and label is 'female'
   5.185 length==12 and label is 'female'
  -1.716 length==9 and label is 'male'
  -0.740 length==3 and label is 'female'
   0.542 length==9 and label is 'female'
   0.417 length==3 and label is 'male'
  -0.400 length==7 and label is 'male'
  -0.279 short_name==0 and label is 'male'


## Conclusion

We conclude by summarizing all the models we have tested. This summary is shown both in the order of testing, and ranked by accuracy on the test set. 

In [29]:
def summarize_models(models):
    # Build the summary table in one step; DataFrame.append was removed in pandas 2.0
    columns = ['class', 'features', 'accuracy_devtest', 'accuracy_test']
    rows = [(m[1], m[2], m[3], m[4]) for m in models]
    return pd.DataFrame(rows, columns=columns)

In [30]:
table = summarize_models(models)
table.sort_values(by='accuracy_test', ascending=False)

| | class | features | accuracy_devtest | accuracy_test |
|---|---|---|---|---|
| 15 | MaxentClassifier | gender_features4 | 0.772 | 0.784 |
| 13 | MaxentClassifier | gender_features2 | 0.766 | 0.782 |
| 3 | NaiveBayesClassifier | gender_features4 | 0.78 | 0.779 |
| 2 | NaiveBayesClassifier | gender_features3 | 0.762 | 0.766 |
| 14 | MaxentClassifier | gender_features3 | 0.758 | 0.763 |
| 12 | MaxentClassifier | gender_features1 | 0.756 | 0.763 |
| 6 | DecisionTreeClassifier | gender_features1 | 0.756 | 0.762 |
| 0 | NaiveBayesClassifier | gender_features1 | 0.758 | 0.756 |
| 1 | NaiveBayesClassifier | gender_features2 | 0.738 | 0.753 |
| 8 | DecisionTreeClassifier | gender_features3 | 0.73 | 0.732 |


The best model is:

In [31]:
best_model = models[15][0] # Select the best model by index from the table above
best_model

<ConditionalExponentialClassifier: 2 labels, 892 features>
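
Hard-coding index 15 depends on reading the table by eye, and it selects the model by test accuracy. A sketch of selecting programmatically on dev-test accuracy instead, the set intended for guiding such choices, is shown below. The list `example_models` and the helper `select_by_devtest` are hypothetical; the tuples mirror the shape of the entries in `models`, with None standing in for the trained classifier and accuracies copied from the summary table.

```python
# Tuples shaped like the entries in models: (classifier, class_name, gf_name, acc_devtest, acc_test)
# The classifier slot holds None as a placeholder; accuracies are copied from the summary table.
example_models = [
    (None, 'NaiveBayesClassifier', 'gender_features4', 0.78, 0.779),
    (None, 'DecisionTreeClassifier', 'gender_features1', 0.756, 0.762),
    (None, 'MaxentClassifier', 'gender_features2', 0.766, 0.782),
    (None, 'MaxentClassifier', 'gender_features4', 0.772, 0.784),
]

def select_by_devtest(models):
    """Return the entry with the highest dev-test accuracy (position 3 of each tuple)."""
    return max(models, key=lambda m: m[3])

chosen = select_by_devtest(example_models)
print(chosen[1], chosen[2], 'dev-test:', chosen[3], 'test:', chosen[4])
```

On these figures, dev-test selection would pick the Naive Bayes model with gender_features4, whose test accuracy of 0.779 is within 0.005 of the MaxEntropy model at index 15.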

The highest accuracy scores overall were attained by MaxEntropy classifiers - one using gender_features4 (combining many different features) to achieve an accuracy of 0.784, and close behind it the other one using gender_features2 for 0.782.

The next best performing models are Naive Bayes classifiers using gender_features4 and gender_features3 (one- and two-letter suffixes) with accuracies of 0.779 and 0.766, respectively.

The top-performing Decision Tree classifier, which uses gender_features1 (last letter), places seventh with an accuracy of 0.762.

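Overall, assessed on the basis of average test accuracy over the different feature sets, MaxEntropy (0.739) narrowly outperforms Naive Bayes (0.736), and both outscore Decision Trees (0.699). On the dev-test set the order of the top two is reversed, with Naive Bayes (0.737) slightly ahead of MaxEntropy (0.734).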

This may have something to do with the MaxEntropy classifier accounting for connections between the features rather than treating them independently in the way that Naive Bayes does. Additionally, given that the training set consists of only 500 names it's possible that the Decision Tree classifier 'memorized' the training data and is unable to generalize on new, unseen data. 

In [32]:
table.groupby('class', as_index=False)[['accuracy_devtest', 'accuracy_test']].mean().sort_values(by='accuracy_test', ascending=False)

| | class | accuracy_devtest | accuracy_test |
|---|---|---|---|
| 1 | MaxentClassifier | 0.734333 | 0.738667 |
| 2 | NaiveBayesClassifier | 0.737 | 0.7355 |
| 0 | DecisionTreeClassifier | 0.707 | 0.699167 |


MaxEntropy models were relatively more accurate than Naive Bayes and Decision Tree classifiers when using the GIS algorithm, but it's not clear if more efficient algorithms (like coordinate descent) might also deliver improved accuracy. Unfortunately, testing this without an update to the NLTK package that supports current versions of SciPy would entail a non-trivial level of effort, so this remains to be explored.

Additionally, while essential for Natural Language Processing, the NLTK module may not contain the most powerful classifiers. Revisiting this project using the sklearn package could allow for more control over the classification process. 

## Youtube

In [33]:
from IPython.display import YouTubeVideo
#YouTubeVideo('...')