# DATA 620 - Project 3

Jeremy OBrien, Mael Illien, Vanita Thompson

* Using any of the three classifiers described in chapter 6 of Natural Language Processing with Python, and any features you can think of, build the best name gender classifier you can. 
* Begin by splitting the Names Corpus into three subsets: 500 words for the test set, 500 words for the dev-test set, and the remaining 6900 words for the training set. 
* Then, starting with the example name gender classifier, make incremental improvements. Use the dev-test set to check your progress. 
* Once you are satisfied with your classifier, check its final performance on the test set. 
* How does the performance on the test set compare to the performance on the dev-test set? Is this what you'd expect?
* Source: Natural Language Processing with Python, exercise 6.10.2.

## Setup

We will be working with the 'names' corpus contained within NLTK.

We start by restructuring some of the content from Chapter 6. The function 'test_classifier' encapsulates most of the work: it splits the data into train and test sets, extracts features, trains the model, and evaluates it on the test sets.

The feature engineering section builds on the example features from the book. The three classifiers from Chapter 6 are Naive Bayes, Decision Tree, and Maximum Entropy; each is studied in its own section. A summary table at the end makes it easy to compare the models.

In [1]:
import random
import pandas as pd
import nltk, re, pprint
from nltk.corpus import names
from nltk.classify import apply_features

A number of models will be constructed in this document which will be saved in a list as defined below.

In [2]:
models = []  # Will contain tuples (classifier, classifier_name, feature_function_name, acc_devtest, acc_test)

## Data Import & Transformation

The NLTK 'names' corpus stores male and female names in separate text files. The code below extracts the names from both files, assigns the gender, and stores the result in a list of tuples. The labeled names are then shuffled so that no train or test subset ends up with a single gender.

In [3]:
labeled_names = ([(name, 'male') for name in names.words('male.txt')] + 
         [(name, 'female') for name in names.words('female.txt')])

random.seed(620)
random.shuffle(labeled_names)
labeled_names[:10]

[('Christie', 'female'),
 ('Tibold', 'male'),
 ('Chet', 'male'),
 ('Alyss', 'female'),
 ('Eunice', 'female'),
 ('Mehetabel', 'female'),
 ('Marj', 'female'),
 ('Adam', 'male'),
 ('Natka', 'female'),
 ('Sarene', 'female')]

### Train Test Split

The split below is also built into the test_classifier function. It is repeated here at module level because the errors function reads devtest_names.

In [4]:
# Also performed inside test_classifier; kept here so devtest_names is available for error analysis
train_names = labeled_names[:500]
devtest_names = labeled_names[500:1000]
test_names = labeled_names[1000:]

In [5]:
# Also performed inside test_classifier; left commented out here
# train_set = [(gender_features(n), gender) for (n, gender) in train_names]
# devtest_set = [(gender_features(n), gender) for (n, gender) in devtest_names]
# test_set = [(gender_features(n), gender) for (n, gender) in test_names]
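
The assignment statement asks for 500 names in the test set, 500 in the dev-test set, and the remainder for training, whereas the slices above assign the subsets differently. A minimal sketch of the split as the assignment describes it follows. The helper `split_names` and its parameter names are hypothetical and not part of the notebook, and the toy list only stands in for `labeled_names`.

```python
def split_names(labeled, n_test=500, n_devtest=500):
    """Return (train, devtest, test) slices of an already shuffled list."""
    test = labeled[:n_test]
    devtest = labeled[n_test:n_test + n_devtest]
    train = labeled[n_test + n_devtest:]
    return train, devtest, test


# Stand-in data with the same shape as labeled_names (hypothetical values)
toy = [("name%d" % i, "female" if i % 2 else "male") for i in range(7900)]

train, devtest, test = split_names(toy)
print(len(train), len(devtest), len(test))
```

Because the list is shuffled before slicing, which end of the list each subset comes from does not matter; only the subset sizes change.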

### Test Classifier

The test_classifier function takes a corpus of labeled names, a feature extraction function, and a classifier type. It splits the data into three subsets: the first 500 names for training, the next 500 for dev-test, and the remainder for test. The classifier is then trained and its accuracy on the dev-test and test sets is printed.

(Explanation)

In [6]:
def test_classifier(names_corpus, gender_features_function, classifier_type):
    """Train classifier_type on names_corpus and report dev-test and test accuracy."""
    # Split: first 500 names for training, next 500 for dev-test, remainder for test
    train_names = names_corpus[:500]
    devtest_names = names_corpus[500:1000]
    test_names = names_corpus[1000:]

    # Apply features
    train_set = [(gender_features_function(n), gender) for (n, gender) in train_names]
    devtest_set = [(gender_features_function(n), gender) for (n, gender) in devtest_names]
    test_set = [(gender_features_function(n), gender) for (n, gender) in test_names]

    # Train the classifier and print accuracy on the dev-test and test sets
    classifier = classifier_type.train(train_set)
    acc_devtest = nltk.classify.accuracy(classifier, devtest_set)
    acc_test = nltk.classify.accuracy(classifier, test_set)
    print(acc_devtest)
    print(acc_test)

    # Record the model and its scores for the summary table
    class_name = classifier_type.__name__
    gf_name = gender_features_function.__name__
    models.append((classifier, class_name, gf_name, acc_devtest, acc_test))

    return classifier

### Errors

(Explanation)

In [7]:
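def errors(classifier, feature_function=None, names=None):
    """Print the names that the classifier labels incorrectly."""
    # Fall back to the notebook globals when no arguments are given
    if feature_function is None:
        feature_function = gender_features
    if names is None:
        names = devtest_names

    mistakes = []
    for (name, tag) in names:
        guess = classifier.classify(feature_function(name))
        if guess != tag:
            mistakes.append((tag, guess, name))

    for (tag, guess, name) in sorted(mistakes):
        print('correct=%-8s guess=%-8s name=%-30s' % (tag, guess, name))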

## Feature Engineering

(Explanation)

### Example 1

(Explanation)

In [8]:
def gender_features(word):
    return {'last_letter': word[-1]}

In [9]:
gender_features('John')

{'last_letter': 'n'}

### Example 2

(Explanation)

In [10]:
def gender_features2(name):
    features = {}
    features["firstletter"] = name[0].lower()
    features["lastletter"] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count(%s)" % letter] = name.lower().count(letter)
        features["has(%s)" % letter] = (letter in name.lower())
    return features

In [11]:
#gender_features2('John')
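
Because gender_features2 returns 54 entries per name (two letter features plus a count and a presence flag for each of the 26 letters), the full dictionary is hard to read. The sketch below repeats the function so that it runs on its own and prints only the informative entries. The helper `compact_view` is hypothetical and introduced only for illustration.

```python
def gender_features2(name):
    features = {}
    features["firstletter"] = name[0].lower()
    features["lastletter"] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count(%s)" % letter] = name.lower().count(letter)
        features["has(%s)" % letter] = (letter in name.lower())
    return features


def compact_view(features):
    """Drop the has(...) flags and any count(...) entries equal to zero."""
    return {k: v for k, v in features.items()
            if not k.startswith("has(") and v != 0}


full = gender_features2('John')
compact = compact_view(full)
print(len(full))
print(compact)
```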

### Example 3

(Explanation)

In [12]:
def gender_features3(word):
    return {'suffix1': word[-1:],'suffix2': word[-2:]}

In [13]:
gender_features3('Cristina')

{'suffix1': 'a', 'suffix2': 'na'}

### Example 4

(Explanation)

In [14]:
def gender_features4(name):
    features = {}
    features["firstletter"] = name[0].lower()
    features["lastletter"] = name[-1].lower()
    features['suffix1'] =  name[-1:]
    features['suffix2'] = name[-2:]
    features['suffix3'] = name[-3:]
    # features['length'] = len(name) # doesn't add much
    #suf = []
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count(%s)" % letter] = name.lower().count(letter)
        features["has(%s)" % letter] = (letter in name.lower())
    return features

## Classifiers

In this section, we define the model parameters and call the test_classifier function. Accuracy scores on both test sets are printed for each model.

### Naive Bayes

(Explanation)

In [15]:
nb_mod1 = test_classifier(labeled_names, gender_features, nltk.NaiveBayesClassifier)

0.758
0.756336405529954


(Explanation)

In [16]:
nb_mod1.show_most_informative_features(5)
print(nb_mod1.classify(gender_features('Neo')))
print(nb_mod1.classify(gender_features('Trinity')))

Most Informative Features
             last_letter = 'a'            female : male   =     45.9 : 1.0
             last_letter = 'd'              male : female =      9.7 : 1.0
             last_letter = 'o'              male : female =      8.9 : 1.0
             last_letter = 'r'              male : female =      6.5 : 1.0
             last_letter = 'i'            female : male   =      4.7 : 1.0
male
female
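
Each row of the most informative features output is a likelihood ratio: how much more often a feature value occurs with one label than with the other. The sketch below computes such a ratio from raw relative frequencies on a small made-up list of names. The data and the helper `last_letter_ratio` are hypothetical, and NLTK smooths its probability estimates, so its figures will not match an unsmoothed calculation like this one.

```python
from collections import Counter


def last_letter_ratio(labeled, letter, label_a, label_b):
    """Ratio of P(last letter | label_a) to P(last letter | label_b).

    Uses unsmoothed relative frequencies, so it fails with a
    ZeroDivisionError if label_b never ends in the given letter.
    """
    totals = Counter(label for _, label in labeled)
    hits = Counter(label for name, label in labeled
                   if name[-1].lower() == letter)
    p_a = hits[label_a] / totals[label_a]
    p_b = hits[label_b] / totals[label_b]
    return p_a / p_b


# Made-up toy data, not drawn from the names corpus
toy = [("Anna", "female"), ("Maria", "female"), ("Emma", "female"), ("Alice", "female"),
       ("Luca", "male"), ("John", "male"), ("Mark", "male"), ("Paul", "male")]

ratio = last_letter_ratio(toy, "a", "female", "male")
print("female : male = %.1f : 1.0" % ratio)
```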


(Explanation)

In [17]:
nb_mod2 = test_classifier(labeled_names, gender_features2, nltk.NaiveBayesClassifier)
nb_mod2.show_most_informative_features(5)

0.738
0.753168202764977
Most Informative Features
              lastletter = 'a'            female : male   =     45.9 : 1.0
              lastletter = 'd'              male : female =      9.7 : 1.0
              lastletter = 'o'              male : female =      8.9 : 1.0
              lastletter = 'r'              male : female =      6.5 : 1.0
              lastletter = 'i'            female : male   =      4.7 : 1.0


(Explanation)

In [18]:
nb_mod3 = test_classifier(labeled_names, gender_features3, nltk.NaiveBayesClassifier)
nb_mod3.show_most_informative_features(5)

0.762
0.7656970046082949
Most Informative Features
                 suffix1 = 'a'            female : male   =     45.9 : 1.0
                 suffix2 = 'ne'           female : male   =     10.2 : 1.0
                 suffix1 = 'd'              male : female =      9.7 : 1.0
                 suffix1 = 'o'              male : female =      8.9 : 1.0
                 suffix1 = 'r'              male : female =      6.5 : 1.0


(Explanation)

In [19]:
nb_mod4 = test_classifier(labeled_names, gender_features4, nltk.NaiveBayesClassifier)
nb_mod4.show_most_informative_features(10)

0.78
0.7786578341013825
Most Informative Features
              lastletter = 'a'            female : male   =     45.9 : 1.0
                 suffix1 = 'a'            female : male   =     45.9 : 1.0
                 suffix2 = 'ne'           female : male   =     10.2 : 1.0
              lastletter = 'd'              male : female =      9.7 : 1.0
                 suffix1 = 'd'              male : female =      9.7 : 1.0
              lastletter = 'o'              male : female =      8.9 : 1.0
                 suffix1 = 'o'              male : female =      8.9 : 1.0
                 suffix3 = 'ine'          female : male   =      6.5 : 1.0
              lastletter = 'r'              male : female =      6.5 : 1.0
                 suffix1 = 'r'              male : female =      6.5 : 1.0


### Decision Trees

(Explanation)

In [20]:
dt_mod1 = test_classifier(labeled_names, gender_features, nltk.DecisionTreeClassifier)

0.756
0.7622407834101382


In [21]:
dt_mod2 = test_classifier(labeled_names, gender_features2, nltk.DecisionTreeClassifier)

0.75
0.7226382488479263


In [22]:
dt_mod3 = test_classifier(labeled_names, gender_features3, nltk.DecisionTreeClassifier)

0.73
0.732286866359447


In [23]:
dt_mod4 = test_classifier(labeled_names, gender_features4, nltk.DecisionTreeClassifier)

0.63
0.637528801843318


(Explanation)

In [24]:
import math
def entropy(labels):
    freqdist = nltk.FreqDist(labels)
    probs = [freqdist.freq(l) for l in freqdist]
    return -sum(p * math.log(p,2) for p in probs)
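
As a quick check of the entropy calculation, a label list containing a single gender should score 0 bits and an even split should score 1 bit. The sketch below uses `collections.Counter` in place of `nltk.FreqDist` so that it runs without NLTK. That substitution, and the name `entropy_bits`, are assumptions made for illustration rather than the notebook's own code.

```python
import math
from collections import Counter


def entropy_bits(labels):
    """Shannon entropy (base 2) of the label distribution."""
    counts = Counter(labels)
    total = len(labels)
    probs = [c / total for c in counts.values()]
    return -sum(p * math.log2(p) for p in probs)


pure = entropy_bits(["male"] * 4)                 # one label only
even = entropy_bits(["male", "female"] * 2)       # 50/50 split
skewed = entropy_bits(["male", "female", "female", "female"])

print(pure, even, round(skewed, 3))
```

A decision tree prefers splits that reduce this quantity, so a feature that separates the labels cleanly yields subsets with entropy close to 0.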

### Max Entropy

Instead of using probabilities to set model parameters as the Naive Bayes classifier does, the Maximum Entropy model (MaxEnt) searches for the set of parameters that maximizes the total likelihood of the training corpus. The maximum entropy principle says that, among the distributions consistent with the empirical evidence, the model should prefer the most uniform one: the distribution stays uniform wherever no evidence constrains it.

The intuition is that classifiers with lower entropy introduce biases that are not justified by the data.

Importantly, MaxEnt does not assume independence of features (as Naive Bayes does) and so is not negatively impacted when features depend on one another, which is often the case. As MaxEnt captures the structure of the training data, the more features it uses, the stronger the constraint of empirical consistency becomes ([reference](https://lost-contact.mit.edu/afs/cs.pitt.edu/projects/nltk/docs/tutorial/classifying/nochunks.html#maxent)).

For each joint feature (define!), MaxEnt algorithms calculate the empirical frequency and...(complete!).

NLTK offers two algorithms, Generalized Iterative Scaling (GIS) and Improved Iterative Scaling (IIS), the latter of which offers faster convergence. According to Wikipedia (link!) and the literature ([A comparison of algorithms for maximum entropy parameter estimation](http://luthuli.cs.uiuc.edu/~daf/courses/Opt-2017/Papers/p18-malouf.pdf)), the performance of these algorithms has been substantially improved upon by gradient-based methods such as coordinate descent, L-BFGS, and LMVM. Most notably, iterative optimization can be time consuming.

Additionally, MaxEnt is a conditional classifier, meaning it can be used to determine the most likely label for a given input, or how likely a given label is for that input. A generative classifier like Naive Bayes can additionally estimate the most likely input value, how likely a given input value is, and both of these given an input label.

https://web.stanford.edu/class/cs124/lec/Maximum_Entropy_Classifiers.pdf (cite!) 


Rapid computation, peaks after a single iteration

(Add chart)

In [25]:
me_mod1 = test_classifier(labeled_names, gender_features, nltk.ConditionalExponentialClassifier)  # consider changing max_iter param to 20, how to add kwarg?
# https://stackoverflow.com/questions/39391280/how-to-change-number-of-iterations-in-maxent-classifier-for-pos-tagging-in-nltk

me_mod1.show_most_informative_features(10)

  ==> Training (100 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.368
             2          -0.37629        0.758
             3          -0.37386        0.758
             4          -0.37241        0.758
             5          -0.37144        0.758
             6          -0.37075        0.758
             7          -0.37024        0.758
             8          -0.36983        0.758
             9          -0.36951        0.758
            10          -0.36925        0.758
            11          -0.36903        0.758
            12          -0.36884        0.758
            13          -0.36869        0.758
            14          -0.36855        0.758
            15          -0.36843        0.758
            16          -0.36832        0.758
            17          -0.36823        0.758
            18          -0.36814        0.758
            19          -0.36807        0.758
 

In [26]:
#me_mod1.explain(test_set)

In [27]:
# type(me_mod1.show_most_informative_features(10))
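
The comment in the me_mod1 cell asks how to pass an option such as max_iter through to the classifier. One possible approach, sketched here under stated assumptions, is to collect extra keyword arguments and forward them to the classifier's train method. The wrapper `train_with_options` and the `StubClassifier` stand-in are hypothetical and exist only so that the sketch runs without NLTK. `nltk.MaxentClassifier.train` does accept `max_iter` as a keyword cutoff.

```python
def train_with_options(classifier_type, train_set, **train_kwargs):
    """Forward any extra keyword arguments to classifier_type.train."""
    return classifier_type.train(train_set, **train_kwargs)


class StubClassifier:
    """Hypothetical stand-in that records the options it was trained with."""

    def __init__(self, n_examples, options):
        self.n_examples = n_examples
        self.options = options

    @classmethod
    def train(cls, train_set, **options):
        return cls(len(train_set), options)


toy_set = [({"last_letter": "a"}, "female"), ({"last_letter": "n"}, "male")]
model = train_with_options(StubClassifier, toy_set, max_iter=20)
print(model.options)

# With NLTK installed, the analogous call would be:
# train_with_options(nltk.MaxentClassifier, train_set, max_iter=20)
```

The same `**train_kwargs` pattern could be added to test_classifier so that classifiers without such options are called exactly as before.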

(Explanation)

(Add chart)

In [28]:
me_mod2 = test_classifier(labeled_names, gender_features2, nltk.ConditionalExponentialClassifier)
me_mod2.show_most_informative_features(10)

  ==> Training (100 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.368
             2          -0.60663        0.632
             3          -0.59048        0.632
             4          -0.57533        0.632
             5          -0.56114        0.638
             6          -0.54787        0.664
             7          -0.53547        0.684
             8          -0.52387        0.694
             9          -0.51302        0.716
            10          -0.50286        0.728
            11          -0.49334        0.742
            12          -0.48442        0.756
            13          -0.47604        0.750
            14          -0.46816        0.760
            15          -0.46074        0.762
            16          -0.45374        0.768
            17          -0.44714        0.764
            18          -0.44090        0.764
            19          -0.43500        0.768
 

Slow computation, peaks after six iterations

(Add chart)

In [29]:
me_mod3 = test_classifier(labeled_names, gender_features3, nltk.ConditionalExponentialClassifier)
me_mod3.show_most_informative_features(10)

  ==> Training (100 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.368
             2          -0.34422        0.816
             3          -0.32017        0.830
             4          -0.30386        0.834
             5          -0.29234        0.836
             6          -0.28389        0.838
             7          -0.27743        0.840
             8          -0.27234        0.840
             9          -0.26820        0.840
            10          -0.26477        0.840
            11          -0.26187        0.840
            12          -0.25940        0.840
            13          -0.25725        0.840
            14          -0.25537        0.840
            15          -0.25372        0.840
            16          -0.25225        0.840
            17          -0.25093        0.840
            18          -0.24975        0.840
            19          -0.24868        0.840
 

Rapid computation; unclear whether it has reached an optimum, as accuracy continues to improve

(Add chart)

In [None]:
me_mod4 = test_classifier(labeled_names, gender_features4, nltk.ConditionalExponentialClassifier)
me_mod4.show_most_informative_features(10)

  ==> Training (100 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.368
             2          -0.58951        0.632
             3          -0.55937        0.632
             4          -0.53237        0.656
             5          -0.50821        0.714
             6          -0.48661        0.744
             7          -0.46726        0.782
             8          -0.44989        0.800
             9          -0.43425        0.820
            10          -0.42012        0.838
            11          -0.40730        0.858
            12          -0.39564        0.868
            13          -0.38498        0.872
            14          -0.37520        0.878
            15          -0.36620        0.880
            16          -0.35788        0.878
            17          -0.35017        0.882
            18          -0.34300        0.882
            19          -0.33631        0.882
 

## Conclusion

In [None]:
models[0]

In [None]:
def summarize_models(models):
    # Each entry in models is (classifier, class_name, gf_name, acc_devtest, acc_test)
    rows = [{'class': m[1], 'features': m[2], 'accuracy_devtest': m[3], 'accuracy_test': m[4]}
            for m in models]

    # Build the table in one step; DataFrame.append was removed in pandas 2.0
    return pd.DataFrame(rows, columns=['class', 'features', 'accuracy_devtest', 'accuracy_test'])

In [None]:
table = summarize_models(models)
table

In [None]:
table.sort_values(by='accuracy_test', ascending=False)
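
Since the dev-test set is meant for tracking progress and the test set for the final check, one option is to pick the model with the highest dev-test accuracy and only then read off its test accuracy. The sketch below does this for tuples shaped like the entries of models, using the Naive Bayes and decision tree accuracies printed earlier in this document. The classifier objects are replaced by None, and `pick_by_devtest` is a hypothetical helper.

```python
def pick_by_devtest(models):
    """Return the entry with the highest dev-test accuracy (index 3)."""
    return max(models, key=lambda m: m[3])


# (classifier, class_name, gf_name, acc_devtest, acc_test), scores as printed above
recorded = [
    (None, 'NaiveBayesClassifier', 'gender_features', 0.758, 0.756336405529954),
    (None, 'NaiveBayesClassifier', 'gender_features2', 0.738, 0.753168202764977),
    (None, 'NaiveBayesClassifier', 'gender_features3', 0.762, 0.7656970046082949),
    (None, 'NaiveBayesClassifier', 'gender_features4', 0.78, 0.7786578341013825),
    (None, 'DecisionTreeClassifier', 'gender_features', 0.756, 0.7622407834101382),
    (None, 'DecisionTreeClassifier', 'gender_features2', 0.75, 0.7226382488479263),
    (None, 'DecisionTreeClassifier', 'gender_features3', 0.73, 0.732286866359447),
    (None, 'DecisionTreeClassifier', 'gender_features4', 0.63, 0.637528801843318),
]

best = pick_by_devtest(recorded)
print(best[1], best[2], best[3], best[4])
```

Selecting on dev-test accuracy keeps the test score an honest estimate, because the test set plays no part in choosing the model.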

## Youtube