# DATA 620 - Project 3

Jeremy OBrien, Mael Illien, Vanita Thompson

* Using any of the three classifiers described in chapter 6 of Natural Language Processing with Python, and any features you can think of, build the best name gender classifier you can. 
* Begin by splitting the Names Corpus into three subsets: 500 words for the test set, 500 words for the dev-test set, and the remaining 6900 words for the training set. 
* Then, starting with the example name gender classifier, make incremental improvements. Use the dev-test set to check your progress. 
* Once you are satisfied with your classifier, check its final performance on the test set. 
* How does the performance on the test set compare to the performance on the dev-test set? Is this what you'd expect?
* Source: Natural Language Processing with Python, exercise 6.10.2.

## Setup

We will be working with the 'names' corpus contained within NLTK.

We begin by restructuring some of the content from Chapter 6. The function 'test_classifier' encapsulates most of the work: it splits the data into train, dev-test, and test sets, extracts features, trains the model, and evaluates it on both held-out sets.

A section is dedicated to feature engineering, building on the example features from the book. The three classifiers from Chapter 6 are Naive Bayes, Decision Tree, and Maximum Entropy. Each classifier is studied in its own section, and a summary table makes it easy to compare the different models.

In [1]:
import random
import pandas as pd
import nltk, re, pprint
from nltk.corpus import names
from nltk.classify import apply_features

A number of models will be constructed in this document which will be saved in a list as defined below.

In [2]:
models = [] # Will contain tuples (classifier, class_name, gf_name, acc_devtest, acc_test)

## Data Import & Transformation

The NLTK 'names' corpus contains male and female names in separate text files. The code below extracts the names from both files, assigns the gender, and stores the information in a list of tuples. The labeled names are shuffled so that no train or test slice ends up with a single gender.

In [3]:
labeled_names = ([(name, 'male') for name in names.words('male.txt')] + 
         [(name, 'female') for name in names.words('female.txt')])

random.seed(620)
random.shuffle(labeled_names)
labeled_names[:10]

[('Christie', 'female'),
 ('Tibold', 'male'),
 ('Chet', 'male'),
 ('Alyss', 'female'),
 ('Eunice', 'female'),
 ('Mehetabel', 'female'),
 ('Marj', 'female'),
 ('Adam', 'male'),
 ('Natka', 'female'),
 ('Sarene', 'female')]
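
The effect of shuffling on a slice can be checked with a small sketch. The data below are placeholders rather than the real corpus, and `label_counts` is a hypothetical helper, not part of the notebook; the point is only that a slice of an unshuffled, concatenated list holds a single gender, while shuffling keeps the overall counts unchanged.

```python
import random
from collections import Counter

# Toy stand-in for the labeled corpus: 30 'female' and 20 'male' placeholder names.
toy_labeled = ([("f%d" % i, "female") for i in range(30)] +
               [("m%d" % i, "male") for i in range(20)])

def label_counts(labeled_items):
    """Count how many items carry each label."""
    return Counter(label for _, label in labeled_items)

# Before shuffling, the last 20 items are all one gender.
unshuffled_tail = label_counts(toy_labeled[-20:])

rng = random.Random(620)   # local generator, so the global random state is untouched
shuffled = toy_labeled[:]  # copy, so the original order stays intact
rng.shuffle(shuffled)

print(unshuffled_tail)               # only 'male' appears
print(label_counts(shuffled[-20:]))  # typically a mix of both labels
```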

### Train Test Split

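(Explanation) MI: This section is not strictly necessary, because the same split is built into the test_classifier function. Note that devtest_names, defined here, is still used by the errors function below.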

In [4]:
# Incorporated in function (JO: uncommented for use in testing, can recomment out for submission)
train_names = labeled_names[:500]
devtest_names = labeled_names[500:1000]
test_names = labeled_names[1000:]

In [5]:
# Incorporated in function (JO: uncommented for use in testing, can recomment out for submission)
# train_set = [(gender_features(n), gender) for (n, gender) in train_names]
# devtest_set = [(gender_features(n), gender) for (n, gender) in devtest_names]
# test_set = [(gender_features(n), gender) for (n, gender) in test_names]
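
The exercise statement asks for 500 test names, 500 dev-test names, and the remainder for training, whereas the cell above takes the first 500 names for training. A minimal sketch of the split as the exercise describes it follows, assuming the list has already been shuffled. `split_names` and the placeholder data are hypothetical and not part of the notebook; the list length of 7944 is an assumption chosen for illustration.

```python
def split_names(labeled_items, n_test=500, n_devtest=500):
    """Return (train, devtest, test) slices of an already shuffled list."""
    if n_test + n_devtest >= len(labeled_items):
        raise ValueError("not enough items left for a training set")
    test = labeled_items[:n_test]
    devtest = labeled_items[n_test:n_test + n_devtest]
    train = labeled_items[n_test + n_devtest:]
    return train, devtest, test

# Placeholder data standing in for the shuffled labeled names.
toy = [("name%d" % i, "female" if i % 2 else "male") for i in range(7944)]
train, devtest, test = split_names(toy)
print(len(train), len(devtest), len(test))  # 6944 500 500
```

Because the three slices do not overlap, accuracy on the test slice is never influenced by names seen during training or during feature tuning.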

### Test Classifier

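The test_classifier function takes a corpus of labeled names, a feature extraction function, and a classifier type. It splits the data into three sets: the first 500 names for training, the next 500 for dev-test, and the remainder for test. Note that this trains on 500 names and tests on all remaining names, whereas the exercise describes 500 test names, 500 dev-test names, and the remainder for training. The classifier is then trained and its accuracy on both held-out sets is printed. We also save information about each model in the models list in order to use it later in a summary table.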

In [41]:
def test_classifier(names_corpus, gender_features_function, classifier_type):
    """Train classifier_type on names_corpus and report dev-test and test accuracy."""
    # Train / dev-test / test split (names_corpus is assumed to be shuffled already)
    train_names = names_corpus[:500]
    devtest_names = names_corpus[500:1000]
    test_names = names_corpus[1000:]

    # Apply features
    train_set = [(gender_features_function(n), gender) for (n, gender) in train_names]
    devtest_set = [(gender_features_function(n), gender) for (n, gender) in devtest_names]
    test_set = [(gender_features_function(n), gender) for (n, gender) in test_names]

    # Train the classifier and print accuracy scores rounded to 3 decimals
    classifier = classifier_type.train(train_set)
    acc_devtest = round(nltk.classify.accuracy(classifier, devtest_set), 3)
    acc_test = round(nltk.classify.accuracy(classifier, test_set), 3)
    print('Dev test set accuracy: ' + str(acc_devtest))
    print('Test set accuracy: ' + str(acc_test))

    # Record the model in the global models list for the summary table
    class_name = classifier_type.__name__
    gf_name = gender_features_function.__name__
    models.append((classifier, class_name, gf_name, acc_devtest, acc_test))

    return classifier
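
The accuracy reported by the function is the fraction of names whose predicted label matches the true label. A minimal sketch of that computation follows. The `accuracy` helper and the vowel rule are hypothetical stand-ins written for illustration; they are not NLTK's implementation or a trained classifier.

```python
def accuracy(predict, labeled_featuresets):
    """Fraction of (features, label) pairs for which predict(features) == label."""
    if not labeled_featuresets:
        raise ValueError("cannot score an empty set")
    correct = sum(1 for features, label in labeled_featuresets
                  if predict(features) == label)
    return correct / len(labeled_featuresets)

# Hypothetical rule standing in for a trained classifier.
def ends_in_vowel_rule(features):
    return "female" if features["last_letter"] in "aeiou" else "male"

toy_set = [({"last_letter": "a"}, "female"),
           ({"last_letter": "k"}, "male"),
           ({"last_letter": "e"}, "male"),
           ({"last_letter": "n"}, "female")]
print(accuracy(ends_in_vowel_rule, toy_set))  # 2 of 4 correct -> 0.5
```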

### Errors

We define an error function so that we can take a glance at what the classifiers are getting wrong.

In [7]:
def errors(classifier, gender_feature_function):
    errors = []
    for (name, tag) in devtest_names:
        guess = classifier.classify(gender_feature_function(name))
        if guess != tag:
            errors.append( (tag, guess, name) )
            
    for (tag, guess, name) in sorted(errors): 
        print('correct=%-8s guess=%-8s name=%-30s'%(tag, guess, name))
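
Beyond listing individual errors, it can help to tally them by a feature value to see where a classifier fails most often. The sketch below is one way to do that under stated assumptions: `error_counts_by_last_letter` is a hypothetical helper, the vowel rule is a stand-in for a trained classifier, and the ten names are the sample printed earlier in this notebook.

```python
from collections import Counter

def gender_features(name):
    return {"last_letter": name[-1]}

def error_counts_by_last_letter(predict, feature_function, labeled_names):
    """Tally misclassified names by (last letter, true label)."""
    counts = Counter()
    for name, tag in labeled_names:
        if predict(feature_function(name)) != tag:
            counts[(name[-1].lower(), tag)] += 1
    return counts

# Hypothetical rule standing in for classifier.classify.
def ends_in_vowel_rule(features):
    return "female" if features["last_letter"] in "aeiou" else "male"

sample = [("Christie", "female"), ("Tibold", "male"), ("Chet", "male"),
          ("Alyss", "female"), ("Eunice", "female"), ("Mehetabel", "female"),
          ("Marj", "female"), ("Adam", "male"), ("Natka", "female"),
          ("Sarene", "female")]

counts = error_counts_by_last_letter(ends_in_vowel_rule, gender_features, sample)
for (letter, tag), n in sorted(counts.items()):
    print("last letter %r, true label %s: %d error(s)" % (letter, tag, n))
```

With this toy rule, the three misclassified names are the female names ending in a consonant (Alyss, Mehetabel, Marj).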

## Feature Engineering

Here, we try to break down first names by extracting features from them. This exploration allows us to find features that make it possible to discriminate between a female and a male name. These features are what the models are trained on.

Below, we develop a variety of features and also include those that were defined in the text.

We start with the most basic feature, which is only the last letter. While very simple, we will see later that particular letters at the end of a name can be powerful predictors.

In [8]:
def gender_features(name):
    return {'last_letter': name[-1]}

We augment the simple feature above to include the first letter as well as the presence and counts of all alphabet letters.

In [9]:
def gender_features2(name):
    features = {}
    features["firstletter"] = name[0].lower()
    features["lastletter"] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count(%s)" % letter] = name.lower().count(letter)
        features["has(%s)" % letter] = (letter in name.lower())
    return features

These features revisit the end of names, but this time the features also include the second to last letter.

In [10]:
def gender_features3(word):
    return {'suffix1': word[-1:],'suffix2': word[-2:]}

Here we have a combination of the features above with the addition of suffix3, the last 3 letters.

In [11]:
def gender_features4(name):
    features = {}
    features["firstletter"] = name[0].lower()
    features["lastletter"] = name[-1].lower()
    features['suffix1'] =  name[-1:]
    features['suffix2'] = name[-2:]
    features['suffix3'] = name[-3:]
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count(%s)" % letter] = name.lower().count(letter)
        features["has(%s)" % letter] = (letter in name.lower())
    return features
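
One detail worth noting: the suffix features above are taken from the name as written, so for names of three letters or fewer a suffix can include the capitalized initial (the decision tree output later in this notebook lists values such as 'Em' and 'Ace'). The sketch below shows a lowercased variant for comparison. `suffix_features` is a hypothetical alternative, not the function used to produce the results reported here.

```python
def suffix_features(name):
    """Lowercased suffixes of length 1 to 3 (a variant, not the notebook's function)."""
    lowered = name.lower()
    return {"suffix%d" % n: lowered[-n:] for n in (1, 2, 3)}

print(suffix_features("Em"))    # {'suffix1': 'm', 'suffix2': 'em', 'suffix3': 'em'}
print(suffix_features("Anne"))  # {'suffix1': 'e', 'suffix2': 'ne', 'suffix3': 'nne'}
```

Lowercasing makes 'Em' and 'em' the same feature value, which may reduce the number of rarely seen values the classifiers have to handle.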

We already looked at individual letters. Here we include whether the first and last letters are vowels or consonants.

In [12]:
def gender_features5(name):
    features = {}
    features["vowel_start"] = int(name[0].lower() in 'aeiuo')
    #features["con_start"] = int(name[0].lower() not in 'aeiuo')
    features["vowel_end"] = int(name[-1].lower() in 'aeiuo')
    #features["con_end"] = int(name[-1].lower() not in 'aeiuo')
    return features

We also check whether the length of a name has any classifying power. The cutoff for a short name (fewer than 4 letters) is arbitrary.

In [55]:
def gender_features6(name):
    features = {}
    #features["long_name"] = int(len(name) > 4) 
    features["short_name"] = int(len(name) < 4)
    features['length'] = len(name)
    return features

## Classifiers

In this section, we define the model parameters and call the test_classifier function. Accuracy scores on both test sets are printed for each model.

### Naive Bayes

With a Naive Bayes classifier, every feature is used to determine which label (male or female) should be assigned to a given input name. The prior probability (the proportion of male and female names in the training data) is modulated by the contribution from each feature to arrive at a likelihood estimate for each label. The label with the highest likelihood is then assigned. Note that this classifier works under the assumption that each feature is independent of every other feature, which can be unrealistic, hence the word naive.
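
The calculation described above can be sketched in a few lines. All probabilities below are made-up illustration values, and `naive_bayes_posterior` is a hypothetical helper rather than NLTK's implementation: it multiplies each label's prior by the per-feature likelihoods and normalizes the results.

```python
def naive_bayes_posterior(priors, likelihoods, features):
    """Combine label priors with per-feature likelihoods and normalize.

    priors:      {label: P(label)}
    likelihoods: {label: {(feature_name, value): P(value | label)}}
    features:    {feature_name: value}
    """
    scores = {}
    for label, prior in priors.items():
        score = prior
        for item in features.items():
            # Independence assumption: likelihoods are simply multiplied.
            score *= likelihoods[label][item]
        scores[label] = score
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}

# Made-up numbers for illustration only.
priors = {"female": 0.6, "male": 0.4}
likelihoods = {
    "female": {("last_letter", "a"): 0.30, ("vowel_start", 1): 0.20},
    "male":   {("last_letter", "a"): 0.01, ("vowel_start", 1): 0.15},
}
posterior = naive_bayes_posterior(
    priors, likelihoods, {"last_letter": "a", "vowel_start": 1})
print(max(posterior, key=posterior.get))  # female
```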

We test our first feature (last letter only). Displaying the most informative features reveals the classifying power of a final 'a': in the training data, names ending in 'a' are nearly 46 times more likely to be female than male. The accuracy score is solid for a single feature.

In [45]:
nb_mod1 = test_classifier(labeled_names, gender_features, nltk.NaiveBayesClassifier)
nb_mod1.show_most_informative_features(5)

Dev test set accuracy: 0.758
Test set accuracy: 0.756
Most Informative Features
             last_letter = 'a'            female : male   =     45.9 : 1.0
             last_letter = 'd'              male : female =      9.7 : 1.0
             last_letter = 'o'              male : female =      8.9 : 1.0
             last_letter = 'r'              male : female =      6.5 : 1.0
             last_letter = 'i'            female : male   =      4.7 : 1.0


When we include the first letter, the last letter, and the presence and count of every letter of the alphabet, the five most informative features are identical to those above. We print the next five to see the contribution of the added features. While less significant, they show the contribution of two 'd's in a name, a first letter 'z', and the presence of 'w'.

In [15]:
nb_mod2 = test_classifier(labeled_names, gender_features2, nltk.NaiveBayesClassifier)
nb_mod2.show_most_informative_features(10)

0.738
0.753168202764977
Most Informative Features
              lastletter = 'a'            female : male   =     45.9 : 1.0
              lastletter = 'd'              male : female =      9.7 : 1.0
              lastletter = 'o'              male : female =      8.9 : 1.0
              lastletter = 'r'              male : female =      6.5 : 1.0
              lastletter = 'i'            female : male   =      4.7 : 1.0
                count(d) = 2                male : female =      4.5 : 1.0
              lastletter = 'm'              male : female =      3.9 : 1.0
             firstletter = 'z'              male : female =      3.9 : 1.0
                count(w) = 1                male : female =      3.9 : 1.0
                  has(w) = True             male : female =      3.9 : 1.0


A suffix of size 1 is the same as the last letter of a name, so it is no surprise to see suffix1 = 'a' at the top again. Suffixes of size 2 start to make a stronger contribution. The suffix 'ne', as in Anne, is more likely to be female, while the suffix 'er', as in Peter, is more likely to be male.

In [47]:
nb_mod3 = test_classifier(labeled_names, gender_features3, nltk.NaiveBayesClassifier)
nb_mod3.show_most_informative_features(10)

Dev test set accuracy: 0.762
Test set accuracy: 0.766
Most Informative Features
                 suffix1 = 'a'            female : male   =     45.9 : 1.0
                 suffix2 = 'ne'           female : male   =     10.2 : 1.0
                 suffix1 = 'd'              male : female =      9.7 : 1.0
                 suffix1 = 'o'              male : female =      8.9 : 1.0
                 suffix1 = 'r'              male : female =      6.5 : 1.0
                 suffix2 = 'er'             male : female =      5.2 : 1.0
                 suffix1 = 'i'            female : male   =      4.7 : 1.0
                 suffix2 = 'on'             male : female =      4.6 : 1.0
                 suffix1 = 'm'              male : female =      3.9 : 1.0
                 suffix2 = 'ed'             male : female =      3.6 : 1.0


The model below combines the features above and adds 3-letter suffixes. Again, it is no surprise that some results are repeated, since lastletter and suffix1 are the same thing. The size-3 suffix 'ine' shows up in the top 10. At 0.779 accuracy on the test set, this is the most accurate Naive Bayes model we have tested so far.

In [50]:
nb_mod4 = test_classifier(labeled_names, gender_features4, nltk.NaiveBayesClassifier)
nb_mod4.show_most_informative_features(10)

Dev test set accuracy: 0.78
Test set accuracy: 0.779
Most Informative Features
              lastletter = 'a'            female : male   =     45.9 : 1.0
                 suffix1 = 'a'            female : male   =     45.9 : 1.0
                 suffix2 = 'ne'           female : male   =     10.2 : 1.0
              lastletter = 'd'              male : female =      9.7 : 1.0
                 suffix1 = 'd'              male : female =      9.7 : 1.0
              lastletter = 'o'              male : female =      8.9 : 1.0
                 suffix1 = 'o'              male : female =      8.9 : 1.0
                 suffix3 = 'ine'          female : male   =      6.5 : 1.0
              lastletter = 'r'              male : female =      6.5 : 1.0
                 suffix1 = 'r'              male : female =      6.5 : 1.0


By looking beyond individual letters to vowels and consonants, we learn that names ending in a vowel are about 2.5 times more likely to be female. The converse holds for names ending in a consonant, which are about 2.4 times more likely to be male.

In [52]:
nb_mod5 = test_classifier(labeled_names, gender_features5, nltk.NaiveBayesClassifier)
nb_mod5.show_most_informative_features()

Dev test set accuracy: 0.728
Test set accuracy: 0.729
Most Informative Features
               vowel_end = 1              female : male   =      2.5 : 1.0
               vowel_end = 0                male : female =      2.4 : 1.0
             vowel_start = 1              female : male   =      1.1 : 1.0
             vowel_start = 0                male : female =      1.0 : 1.0


With a test set accuracy of 0.63, the number of letters in a name is so far the worst feature for classifying gender.

In [57]:
nb_mod6 = test_classifier(labeled_names, gender_features6, nltk.NaiveBayesClassifier)
nb_mod6.show_most_informative_features(10)

Dev test set accuracy: 0.656
Test set accuracy: 0.63
Most Informative Features
                  length = 3                male : female =      3.5 : 1.0
                  length = 9              female : male   =      3.5 : 1.0
              short_name = 1                male : female =      3.1 : 1.0
                  length = 4                male : female =      1.7 : 1.0
                  length = 7              female : male   =      1.3 : 1.0
                  length = 10               male : female =      1.2 : 1.0
                  length = 5              female : male   =      1.1 : 1.0
              short_name = 0              female : male   =      1.1 : 1.0
                  length = 8              female : male   =      1.0 : 1.0
                  length = 6              female : male   =      1.0 : 1.0


### Decision Trees

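A decision tree classifier assigns a label by checking feature values one at a time, following the branch for each value until it reaches a leaf that holds a label. The pseudocode method prints the learned tree as nested if-statements, which makes it easy to see which feature the tree tests first. With only the last letter available, the first tree below is a single level that maps each final letter to a gender.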
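
Chapter 6 of the book describes information gain, the reduction in label entropy after splitting on a feature, as one way to decide which feature a decision tree should test first. The sketch below illustrates that idea on toy data. `entropy` and `information_gain` are hypothetical helpers written for this example, and we make no claim that NLTK's DecisionTreeClassifier selects features in exactly this way.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(labeled_featuresets, feature_name):
    """Entropy of the labels minus the weighted entropy after splitting on one feature."""
    labels = [label for _, label in labeled_featuresets]
    groups = {}
    for features, label in labeled_featuresets:
        groups.setdefault(features[feature_name], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Toy data: vowel_end separates the labels perfectly, short_name not at all.
toy = [({"vowel_end": 1, "short_name": 0}, "female"),
       ({"vowel_end": 1, "short_name": 1}, "female"),
       ({"vowel_end": 0, "short_name": 0}, "male"),
       ({"vowel_end": 0, "short_name": 1}, "male")]

print(information_gain(toy, "vowel_end"))   # 1.0
print(information_gain(toy, "short_name"))  # 0.0
```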

In [20]:
dt_mod1 = test_classifier(labeled_names, gender_features, nltk.DecisionTreeClassifier)
print(dt_mod1.pseudocode(depth=4))

0.756
0.7622407834101382
if last_letter == 'a': return 'female'
if last_letter == 'b': return 'male'
if last_letter == 'c': return 'male'
if last_letter == 'd': return 'male'
if last_letter == 'e': return 'female'
if last_letter == 'g': return 'male'
if last_letter == 'h': return 'female'
if last_letter == 'i': return 'female'
if last_letter == 'j': return 'female'
if last_letter == 'k': return 'male'
if last_letter == 'l': return 'male'
if last_letter == 'm': return 'male'
if last_letter == 'n': return 'male'
if last_letter == 'o': return 'male'
if last_letter == 'p': return 'male'
if last_letter == 'r': return 'male'
if last_letter == 's': return 'male'
if last_letter == 't': return 'male'
if last_letter == 'v': return 'male'
if last_letter == 'x': return 'male'
if last_letter == 'y': return 'female'
if last_letter == 'z': return 'female'



In [21]:
dt_mod2 = test_classifier(labeled_names, gender_features2, nltk.DecisionTreeClassifier)
print(dt_mod2.pseudocode(depth=2))

0.746
0.7193260368663594
if lastletter == 'a': return 'female'
if lastletter == 'b': return 'male'
if lastletter == 'c': return 'male'
if lastletter == 'd': 
  if firstletter == 'a': return 'male'
  if firstletter == 'c': return 'female'
  if firstletter == 'd': return 'male'
  if firstletter == 'f': return 'female'
  if firstletter == 'm': return 'male'
  if firstletter == 'n': return 'male'
  if firstletter == 'r': return 'male'
  if firstletter == 's': return 'male'
  if firstletter == 't': return 'male'
  if firstletter == 'w': return 'male'
if lastletter == 'e': 
  if firstletter == 'a': return 'female'
  if firstletter == 'b': return 'female'
  if firstletter == 'c': return 'female'
  if firstletter == 'd': return 'female'
  if firstletter == 'e': return 'female'
  if firstletter == 'f': return 'female'
  if firstletter == 'g': return 'female'
  if firstletter == 'h': return 'female'
  if firstletter == 'i': return 'female'
  if firstletter == 'j': return 'male'
  if firstletter 

In [22]:
dt_mod3 = test_classifier(labeled_names, gender_features3, nltk.DecisionTreeClassifier)
print(dt_mod3.pseudocode(depth=2))

0.73
0.732286866359447
if suffix2 == 'ad': return 'male'
if suffix2 == 'ah': return 'female'
if suffix2 == 'ak': return 'male'
if suffix2 == 'al': return 'male'
if suffix2 == 'am': return 'male'
if suffix2 == 'an': return 'female'
if suffix2 == 'ar': return 'male'
if suffix2 == 'as': return 'male'
if suffix2 == 'at': return 'male'
if suffix2 == 'ba': return 'female'
if suffix2 == 'be': return 'female'
if suffix2 == 'by': return 'male'
if suffix2 == 'ca': return 'female'
if suffix2 == 'ce': return 'male'
if suffix2 == 'ch': return 'male'
if suffix2 == 'ck': return 'male'
if suffix2 == 'cy': return 'female'
if suffix2 == 'da': return 'female'
if suffix2 == 'dd': return 'male'
if suffix2 == 'de': return 'female'
if suffix2 == 'do': return 'male'
if suffix2 == 'dy': return 'male'
if suffix2 == 'ea': return 'female'
if suffix2 == 'ed': return 'male'
if suffix2 == 'ee': return 'female'
if suffix2 == 'ej': return 'male'
if suffix2 == 'el': return 'female'
if suffix2 == 'Em': return 'female'
i

In [23]:
dt_mod4 = test_classifier(labeled_names, gender_features4, nltk.DecisionTreeClassifier)
print(dt_mod4.pseudocode(depth=2))

0.656
0.6460253456221198
if suffix3 == 'aak': return 'male'
if suffix3 == 'ace': return 'male'
if suffix3 == 'Ace': return 'male'
if suffix3 == 'ada': return 'female'
if suffix3 == 'add': return 'male'
if suffix3 == 'ady': return 'male'
if suffix3 == 'afe': return 'male'
if suffix3 == 'ain': return 'female'
if suffix3 == 'air': return 'male'
if suffix3 == 'ait': return 'male'
if suffix3 == 'ale': return 'male'
if suffix3 == 'ami': return 'female'
if suffix3 == 'ana': return 'female'
if suffix3 == 'and': return 'male'
if suffix3 == 'ane': return 'female'
if suffix3 == 'ang': return 'male'
if suffix3 == 'ani': return 'female'
if suffix3 == 'ano': return 'male'
if suffix3 == 'ara': return 'female'
if suffix3 == 'ard': return 'male'
if suffix3 == 'arj': return 'female'
if suffix3 == 'ark': return 'male'
if suffix3 == 'arv': return 'male'
if suffix3 == 'ary': return 'male'
if suffix3 == 'ase': return 'male'
if suffix3 == 'ata': return 'female'
if suffix3 == 'ate': return 'female'
if suffix3

In [24]:
dt_mod5 = test_classifier(labeled_names, gender_features5, nltk.DecisionTreeClassifier)
print(dt_mod5.pseudocode(depth=2))

0.728
0.7288306451612904
if vowel_end == 0: return 'male'
if vowel_end == 1: return 'female'



In [25]:
dt_mod6 = test_classifier(labeled_names, gender_features6, nltk.DecisionTreeClassifier)
print(dt_mod6.pseudocode(depth=2))

0.626
0.6104550691244239
if length == 10: return 'female'
if length == 11: return 'female'
if length == 12: return 'female'
if length == 13: return 'female'
if length == 2: return 'female'
if length == 3: return 'male'
if length == 4: return 'male'
if length == 5: return 'female'
if length == 6: return 'female'
if length == 7: return 'female'
if length == 8: return 'female'
if length == 9: return 'female'



### Max Entropy

Instead of using probabilities to set model parameters as the Naive Bayes classifier does, the Maximum Entropy model (or MaxEnt) searches for the set of parameters that maximizes the total likelihood of the training corpus. The principle of maximum entropy favors the most uniform distribution that is still consistent with the empirical evidence: where the training data impose no constraint, the distribution is left uniform.

The intuition is that distributions with lower entropy introduce biases that are not justified by the data.

Importantly, MaxEnt does not assume independence of features (as Naive Bayes does) and so is not negatively impacted when features depend on one another, which is often the case. As MaxEnt captures the structure of the training data, the more features it uses, the stronger the constraint of empirical consistency becomes ([reference](https://lost-contact.mit.edu/afs/cs.pitt.edu/projects/nltk/docs/tutorial/classifying/nochunks.html#maxent)).

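For each joint feature (a combination of a label and an input feature that receives its own parameter), MaxEnt algorithms calculate the empirical frequency, that is, how often the joint feature occurs in the training set. They then search for the distribution that maximizes entropy while still predicting the correct frequency for each joint feature.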
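
To make the role of joint features concrete, the sketch below computes label probabilities for a conditional exponential model from a table of joint-feature weights. The weights are made-up illustration values and `maxent_probabilities` is a hypothetical helper, not NLTK's code; training, which is the search for good weights, is not shown.

```python
import math

def maxent_probabilities(weights, features, labels):
    """P(label | features) for a conditional exponential model.

    weights maps a joint feature (feature_name, value, label) to a real weight;
    joint features missing from the mapping contribute a weight of 0.
    """
    scores = {}
    for label in labels:
        total = sum(weights.get((name, value, label), 0.0)
                    for name, value in features.items())
        scores[label] = math.exp(total)
    normalizer = sum(scores.values())
    return {label: score / normalizer for label, score in scores.items()}

# Made-up weights for two joint features.
weights = {("last_letter", "a", "female"): 2.0,
           ("last_letter", "a", "male"): -1.0}
labels = ["female", "male"]

probs = maxent_probabilities(weights, {"last_letter": "a"}, labels)
uniform = maxent_probabilities({}, {"last_letter": "a"}, labels)
print(round(probs["female"], 3))  # 0.953
print(uniform)                    # {'female': 0.5, 'male': 0.5}
```

With no weights at all, the model falls back to the uniform distribution, which is the maximum entropy choice when nothing constrains it.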

NLTK offers two algorithms, Generalized Iterative Scaling (GIS) and Improved Iterative Scaling (IIS), the latter of which offers faster convergence. According to Wikipedia and the literature ([A comparison of algorithms for maximum entropy parameter estimation](http://luthuli.cs.uiuc.edu/~daf/courses/Opt-2017/Papers/p18-malouf.pdf)), the performance of these algorithms has been substantially improved upon by gradient-based methods, such as coordinate descent and limited-memory methods like L-BFGS and LMVM. Most notably, iterative optimization can be time consuming.

Additionally, MaxEnt is a conditional classifier, meaning it can be used to determine the most likely label for a given input, or how likely a given label is for that input. A generative classifier like Naive Bayes can additionally estimate the most likely input value, how likely a given input value is, and how likely an input value is given a label.

Reference: https://web.stanford.edu/class/cs124/lec/Maximum_Entropy_Classifiers.pdf


Rapid computation; accuracy peaks after a single iteration.

(Add chart)

In [26]:
me_mod1 = test_classifier(labeled_names, gender_features, nltk.ConditionalExponentialClassifier)  # consider changing max_iter param to 20, how to add kwarg?
# https://stackoverflow.com/questions/39391280/how-to-change-number-of-iterations-in-maxent-classifier-for-pos-tagging-in-nltk

me_mod1.show_most_informative_features(10)

  ==> Training (100 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.368
             2          -0.37629        0.758
             3          -0.37386        0.758
             4          -0.37241        0.758
             5          -0.37144        0.758
             6          -0.37075        0.758
             7          -0.37024        0.758
             8          -0.36983        0.758
             9          -0.36951        0.758
            10          -0.36925        0.758
            11          -0.36903        0.758
            12          -0.36884        0.758
            13          -0.36869        0.758
            14          -0.36855        0.758
            15          -0.36843        0.758
            16          -0.36832        0.758
            17          -0.36823        0.758
            18          -0.36814        0.758
            19          -0.36807        0.758
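
Each training log in this section starts at a log likelihood of -0.69315. That value is the natural logarithm of 0.5, which is what one would expect if the model starts out assigning equal probability to the two labels. This is our reading of the log, stated as an assumption rather than taken from NLTK's documentation. The arithmetic can be checked directly:

```python
import math

# Two labels with equal weight: each true label gets probability 0.5.
uniform_probability = 1 / 2
start_log_likelihood = math.log(uniform_probability)
print(round(start_log_likelihood, 5))  # -0.69315
```

As training proceeds, the log likelihood moves toward 0, meaning the model assigns higher probability to the correct labels in the training set.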
 

In [27]:
#me_mod1.explain(test_set)

In [28]:
# type(me_mod1.show_most_informative_features(10))
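
The comment in the MaxEnt cell above asks how a keyword argument such as max_iter could be passed through to training. One possible approach is to forward arbitrary keyword arguments to the classifier's train method. In the sketch below, `train_with_options` and `EchoClassifier` are hypothetical names introduced for illustration; the stand-in class only records the options it receives so the example runs without NLTK.

```python
def train_with_options(classifier_type, train_set, **train_kwargs):
    """Call classifier_type.train, forwarding any extra keyword arguments."""
    return classifier_type.train(train_set, **train_kwargs)

class EchoClassifier:
    """Hypothetical stand-in that records the options it was trained with."""

    def __init__(self, options):
        self.options = options

    @classmethod
    def train(cls, train_set, **options):
        return cls(options)

model = train_with_options(EchoClassifier, [], max_iter=20, trace=0)
print(model.options)  # {'max_iter': 20, 'trace': 0}
```

With NLTK, the analogous call would be `train_with_options(nltk.MaxentClassifier, train_set, max_iter=20, trace=0)`. As far as we can tell from the NLTK documentation, `max_iter` is accepted as a training cutoff and `trace` controls how much of the iteration log is printed. The same pattern could be added to test_classifier by giving it a `**train_kwargs` parameter.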

(Explanation)

(Add chart)

In [29]:
me_mod2 = test_classifier(labeled_names, gender_features2, nltk.ConditionalExponentialClassifier)
me_mod2.show_most_informative_features(10)

  ==> Training (100 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.368
             2          -0.60663        0.632
             3          -0.59048        0.632
             4          -0.57533        0.632
             5          -0.56114        0.638
             6          -0.54787        0.664
             7          -0.53547        0.684
             8          -0.52387        0.694
             9          -0.51302        0.716
            10          -0.50286        0.728
            11          -0.49334        0.742
            12          -0.48442        0.756
            13          -0.47604        0.750
            14          -0.46816        0.760
            15          -0.46074        0.762
            16          -0.45374        0.768
            17          -0.44714        0.764
            18          -0.44090        0.764
            19          -0.43500        0.768
 

Slow computation; accuracy peaks after six iterations.

(Add chart)

In [30]:
me_mod3 = test_classifier(labeled_names, gender_features3, nltk.ConditionalExponentialClassifier)
me_mod3.show_most_informative_features(10)

  ==> Training (100 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.368
             2          -0.34422        0.816
             3          -0.32017        0.830
             4          -0.30386        0.834
             5          -0.29234        0.836
             6          -0.28389        0.838
             7          -0.27743        0.840
             8          -0.27234        0.840
             9          -0.26820        0.840
            10          -0.26477        0.840
            11          -0.26187        0.840
            12          -0.25940        0.840
            13          -0.25725        0.840
            14          -0.25537        0.840
            15          -0.25372        0.840
            16          -0.25225        0.840
            17          -0.25093        0.840
            18          -0.24975        0.840
            19          -0.24868        0.840
 

Rapid computation; it is unclear whether the optimum has been reached, as accuracy continues to improve.

(Add chart)

In [31]:
me_mod4 = test_classifier(labeled_names, gender_features4, nltk.ConditionalExponentialClassifier)
me_mod4.show_most_informative_features(10)

  ==> Training (100 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.368
             2          -0.58951        0.632
             3          -0.55937        0.632
             4          -0.53237        0.656
             5          -0.50821        0.714
             6          -0.48661        0.744
             7          -0.46726        0.782
             8          -0.44989        0.800
             9          -0.43425        0.820
            10          -0.42012        0.838
            11          -0.40730        0.858
            12          -0.39564        0.868
            13          -0.38498        0.872
            14          -0.37520        0.878
            15          -0.36620        0.880
            16          -0.35788        0.878
            17          -0.35017        0.882
            18          -0.34300        0.882
            19          -0.33631        0.882
 

## Conclusion

In [32]:
models[11]

(<nltk.classify.decisiontree.DecisionTreeClassifier at 0x10ebcac10>,
 'DecisionTreeClassifier',
 'gender_features6',
 0.626,
 0.6104550691244239)

In [33]:
def summarize_models(models):
    # Collect one row per model, then build the table in a single step
    # (DataFrame.append was removed in pandas 2.0)
    rows = [{'class': m[1], 'features': m[2], 'accuracy_devtest': m[3], 'accuracy_test': m[4]}
            for m in models]
    return pd.DataFrame(rows, columns=['class', 'features', 'accuracy_devtest', 'accuracy_test'])

In [34]:
table = summarize_models(models)
table

| | class | features | accuracy_devtest | accuracy_test |
|---|---|---|---|---|
| 0 | NaiveBayesClassifier | gender_features | 0.758 | 0.756336 |
| 1 | NaiveBayesClassifier | gender_features2 | 0.738 | 0.753168 |
| 2 | NaiveBayesClassifier | gender_features3 | 0.762 | 0.765697 |
| 3 | NaiveBayesClassifier | gender_features4 | 0.78 | 0.778658 |
| 4 | NaiveBayesClassifier | gender_features5 | 0.728 | 0.728831 |
| 5 | NaiveBayesClassifier | gender_features6 | 0.628 | 0.611031 |
| 6 | DecisionTreeClassifier | gender_features | 0.756 | 0.762241 |
| 7 | DecisionTreeClassifier | gender_features2 | 0.746 | 0.719326 |
| 8 | DecisionTreeClassifier | gender_features3 | 0.73 | 0.732287 |
| 9 | DecisionTreeClassifier | gender_features4 | 0.656 | 0.646025 |


In [35]:
table.sort_values(by='accuracy_test', ascending=False)

| | class | features | accuracy_devtest | accuracy_test |
|---|---|---|---|---|
| 15 | MaxentClassifier | gender_features4 | 0.772 | 0.784274 |
| 13 | MaxentClassifier | gender_features2 | 0.766 | 0.782114 |
| 3 | NaiveBayesClassifier | gender_features4 | 0.78 | 0.778658 |
| 2 | NaiveBayesClassifier | gender_features3 | 0.762 | 0.765697 |
| 14 | MaxentClassifier | gender_features3 | 0.758 | 0.762673 |
| 12 | MaxentClassifier | gender_features | 0.756 | 0.762529 |
| 6 | DecisionTreeClassifier | gender_features | 0.756 | 0.762241 |
| 0 | NaiveBayesClassifier | gender_features | 0.758 | 0.756336 |
| 1 | NaiveBayesClassifier | gender_features2 | 0.738 | 0.753168 |
| 8 | DecisionTreeClassifier | gender_features3 | 0.73 | 0.732287 |

In [36]:
# Index 11 is the DecisionTreeClassifier trained on gender_features6 (see models[11] above)
final_model = models[11][0]
final_model

<nltk.classify.decisiontree.DecisionTreeClassifier at 0x10ebcac10>
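
The exercise asks us to use the dev-test set to guide improvements and to consult the test set only for a final check. A minimal sketch of selecting a model that way follows. The four rows are copied from the tables above and are only a subset of the models trained; `best_by_devtest` is a hypothetical helper, not part of the notebook.

```python
# (class_name, gf_name, acc_devtest, acc_test), copied from rows of the tables above
rows = [
    ("MaxentClassifier", "gender_features4", 0.772, 0.784274),
    ("MaxentClassifier", "gender_features2", 0.766, 0.782114),
    ("NaiveBayesClassifier", "gender_features4", 0.78, 0.778658),
    ("DecisionTreeClassifier", "gender_features", 0.756, 0.762241),
]

def best_by_devtest(rows):
    """Pick the row with the highest dev-test accuracy (index 2)."""
    return max(rows, key=lambda row: row[2])

best = best_by_devtest(rows)
gap = round(best[3] - best[2], 3)  # test accuracy minus dev-test accuracy
print(best[0], best[1], gap)       # NaiveBayesClassifier gender_features4 -0.001
```

Among these four rows, the model chosen by dev-test accuracy differs from the one with the highest test accuracy, and for the chosen model the two accuracies differ by about a tenth of a percentage point.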

## YouTube