# DATA 620 - Project 3

Jeremy OBrien, Mael Illien, Vanita Thompson

* Using any of the three classifiers described in chapter 6 of Natural Language Processing with Python, and any features you can think of, build the best name gender classifier you can. 
* Begin by splitting the Names Corpus into three subsets: 500 words for the test set, 500 words for the dev-test set, and the remaining 6900 words for the training set. 
* Then, starting with the example name gender classifier, make incremental improvements. Use the dev-test set to check your progress. 
* Once you are satisfied with your classifier, check its final performance on the test set. 
* How does the performance on the test set compare to the performance on the dev-test set? Is this what you'd expect?
* Source: Natural Language Processing with Python, exercise 6.10.2.
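
The split described in the assignment can be sketched as follows. This is a minimal illustration only, assuming a list of (name, label) pairs; the helper name `split_names`, its parameters, and the toy data are hypothetical and not part of the original notebook.

```python
import random

def split_names(labeled, n_test=500, n_devtest=500, seed=None):
    """Split labeled names into (train, devtest, test) lists.

    The first n_test shuffled items become the test set, the next
    n_devtest the dev-test set, and everything else the training set.
    """
    items = list(labeled)               # copy so the caller's list is untouched
    random.Random(seed).shuffle(items)  # shuffle with a local, seedable RNG
    test = items[:n_test]
    devtest = items[n_test:n_test + n_devtest]
    train = items[n_test + n_devtest:]
    return train, devtest, test

# Toy stand-in for the Names Corpus (hypothetical data, 7900 entries).
toy = [("name%d" % i, "male" if i % 2 else "female") for i in range(7900)]
train, devtest, test = split_names(toy, seed=0)
print(len(train), len(devtest), len(test))  # 6900 500 500
```

Keeping the test set out of every tuning step is what makes the final test score a fair estimate; the dev-test set absorbs the repeated checks made while features are being adjusted.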

The three classifiers from Chapter 6: Naive Bayes, Decision Tree, and Maximum Entropy.

## Setup

In [1]:
import random
import nltk, re, pprint
from nltk.corpus import names
from nltk.classify import apply_features

## Data Import & Transformation

In [2]:
labeled_names = ([(name, 'male') for name in names.words('male.txt')] + 
         [(name, 'female') for name in names.words('female.txt')])

random.shuffle(labeled_names)
labeled_names[:10]

[('Petunia', 'female'),
 ('Angelita', 'female'),
 ('Margit', 'female'),
 ('Bathsheba', 'female'),
 ('Garland', 'female'),
 ('Flemming', 'male'),
 ('Monty', 'male'),
 ('Karylin', 'female'),
 ('Chelton', 'male'),
 ('Pepillo', 'male')]

### Train Test Split

In [3]:
# Incorporated in function
# train_names = labeled_names[:500]
# devtest_names = labeled_names[500:1000]
# test_names = labeled_names[1000:]

In [4]:
# Incorporated in function
# train_set = [(gender_features(n), gender) for (n, gender) in train_names]
# devtest_set = [(gender_features(n), gender) for (n, gender) in devtest_names]
# test_set = [(gender_features(n), gender) for (n, gender) in test_names]

### Test Classifier

In [5]:
def test_classifier(names_corpus, gender_features_function, classifier_type):
    """Train a classifier and print its dev-test and test accuracy."""
    # Train test split
    # NOTE: this trains on the first 500 names and tests on everything after
    # the first 1000, which is the reverse of the split in the assignment.
    # The outputs recorded below were produced with this split.
    train_names = names_corpus[:500]
    devtest_names = names_corpus[500:1000]
    test_names = names_corpus[1000:]
    
    # Apply features
    train_set = [(gender_features_function(n), gender) for (n, gender) in train_names]
    devtest_set = [(gender_features_function(n), gender) for (n, gender) in devtest_names]
    test_set = [(gender_features_function(n), gender) for (n, gender) in test_names]
    
    # Classify and print score (dev-test accuracy first, then test accuracy)
    classifier = classifier_type.train(train_set)
    print(nltk.classify.accuracy(classifier, devtest_set))
    print(nltk.classify.accuracy(classifier, test_set))
    
    #classifier.show_most_informative_features(5)
    
    return classifier  

### Errors

In [6]:
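def errors(classifier, devtest_names, gender_features_function):
    # Collect the dev-test names whose predicted label differs from the true label
    mistakes = []
    for (name, tag) in devtest_names:
        guess = classifier.classify(gender_features_function(name))
        if guess != tag:
            mistakes.append( (tag, guess, name) )
            
    # Print the misclassified names, sorted by true label
    for (tag, guess, name) in sorted(mistakes): 
        print('correct=%-8s guess=%-8s name=%-30s'%(tag, guess, name))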

## Feature Engineering

### Example 1

In [7]:
def gender_features(word):
    return {'last_letter': word[-1]}

In [8]:
gender_features('John')

{'last_letter': 'n'}

### Example 2

In [9]:
def gender_features2(name):
    features = {}
    features["firstletter"] = name[0].lower()
    features["lastletter"] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count(%s)" % letter] = name.lower().count(letter)
        features["has(%s)" % letter] = (letter in name.lower())
    return features

In [10]:
#gender_features2('John')

### Example 3

In [11]:
def gender_features3(word):
    return {'suffix1': word[-1:],'suffix2': word[-2:]}

In [12]:
gender_features3('Cristina')

{'suffix1': 'a', 'suffix2': 'na'}

### Example 4

In [13]:
def gender_features4(name):
    features = {}
    features["firstletter"] = name[0].lower()
    features["lastletter"] = name[-1].lower()
    features['suffix1'] =  name[-1:]
    features['suffix2'] = name[-2:]
    features['suffix3'] = name[-3:]
    # features['length'] = len(name) # doesn't add much
    #suf = []
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count(%s)" % letter] = name.lower().count(letter)
        features["has(%s)" % letter] = (letter in name.lower())
    return features

## Naive Bayes

In [14]:
mod1 = test_classifier(labeled_names, gender_features, nltk.NaiveBayesClassifier)

0.762
0.7452476958525346


In [15]:
mod1.show_most_informative_features(5)
print(mod1.classify(gender_features('Neo')))
print(mod1.classify(gender_features('Trinity')))

Most Informative Features
             last_letter = 'a'            female : male   =     16.4 : 1.0
             last_letter = 's'              male : female =     15.5 : 1.0
             last_letter = 'o'              male : female =     15.5 : 1.0
             last_letter = 'r'              male : female =      6.3 : 1.0
             last_letter = 'k'              male : female =      5.5 : 1.0
male
female


In [16]:
mod2 = test_classifier(labeled_names, gender_features2, nltk.NaiveBayesClassifier)
mod2.show_most_informative_features(5)

0.722
0.7386232718894009
Most Informative Features
              lastletter = 'a'            female : male   =     16.4 : 1.0
              lastletter = 'o'              male : female =     15.5 : 1.0
              lastletter = 's'              male : female =     15.5 : 1.0
             firstletter = 'h'              male : female =      6.9 : 1.0
              lastletter = 'r'              male : female =      6.3 : 1.0


In [17]:
mod3 = test_classifier(labeled_names, gender_features3, nltk.NaiveBayesClassifier)
mod3.show_most_informative_features(5)

0.774
0.7564804147465438
Most Informative Features
                 suffix1 = 'a'            female : male   =     16.4 : 1.0
                 suffix1 = 'o'              male : female =     15.5 : 1.0
                 suffix1 = 's'              male : female =     15.5 : 1.0
                 suffix1 = 'r'              male : female =      6.3 : 1.0
                 suffix1 = 'k'              male : female =      5.5 : 1.0


In [18]:
mod4 = test_classifier(labeled_names, gender_features4, nltk.NaiveBayesClassifier)
mod4.show_most_informative_features(10)

0.76
0.7785138248847926
Most Informative Features
              lastletter = 'a'            female : male   =     16.4 : 1.0
                 suffix1 = 'a'            female : male   =     16.4 : 1.0
                 suffix1 = 'o'              male : female =     15.5 : 1.0
              lastletter = 'o'              male : female =     15.5 : 1.0
              lastletter = 's'              male : female =     15.5 : 1.0
                 suffix1 = 's'              male : female =     15.5 : 1.0
             firstletter = 'h'              male : female =      6.9 : 1.0
              lastletter = 'r'              male : female =      6.3 : 1.0
                 suffix1 = 'r'              male : female =      6.3 : 1.0
             firstletter = 'w'              male : female =      6.2 : 1.0


## Decision Trees

In [19]:
dt_mod1 = test_classifier(labeled_names, gender_features4, nltk.DecisionTreeClassifier)

0.686
0.663594470046083


In [20]:
import math
def entropy(labels):
    freqdist = nltk.FreqDist(labels)
    probs = [freqdist.freq(l) for l in freqdist]
    return -sum(p * math.log(p,2) for p in probs)
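
As an illustration of what this function measures, the same calculation can be written without NLTK. This is a minimal sketch, assuming the labels arrive as a plain list; `label_entropy` is a hypothetical name used only here. Entropy is 0 when all labels agree and 1 bit for an even two-way split, which is why it is a natural measure of how mixed a group of names is after splitting on a feature.

```python
import math
from collections import Counter

def label_entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    counts = Counter(labels)
    total = len(labels)
    probs = [c / total for c in counts.values()]
    return -sum(p * math.log2(p) for p in probs)

# An even split is maximally uncertain; a pure group has no uncertainty.
even = label_entropy(["male", "female", "male", "female"])
skewed = label_entropy(["male", "male", "male", "female"])
pure = label_entropy(["male"] * 4)
print(even, round(skewed, 3))
```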

## Max Entropy

Per the text:
Instead of using probabilities to set the model's parameters, Maximum Entropy searches for the set of parameters that maximizes the model's performance, using iterative optimization techniques (which can be time-consuming).

Generalization of the Naive Bayes classifier model.

For each joint feature, Max Ent calculates the empirical frequency of that feature.
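
To make "joint feature" concrete: one common formulation pairs each (feature name, feature value) with a label and counts how often that combination occurs in the training data. The following is a minimal sketch on toy data, not the NLTK internals; the function names and the five example names are hypothetical.

```python
from collections import Counter

def last_letter_feature(name):
    """Single-feature extractor, mirroring the last-letter idea."""
    return {"last_letter": name[-1].lower()}

# Toy training data (hypothetical, for illustration only).
toy_train = [("Anna", "female"), ("Maria", "female"), ("Luca", "male"),
             ("Marco", "male"), ("Hugo", "male")]

def joint_feature_frequencies(labeled_names, feature_fn):
    """Empirical frequency of each (feature name, value, label) triple."""
    counts = Counter()
    for name, label in labeled_names:
        for fname, fval in feature_fn(name).items():
            counts[(fname, fval, label)] += 1
    n = len(labeled_names)
    return {joint: c / n for joint, c in counts.items()}

freqs = joint_feature_frequencies(toy_train, last_letter_feature)
for joint, freq in sorted(freqs.items()):
    print(joint, freq)
```

Here the joint feature `('last_letter', 'a', 'female')` fires for two of the five names, so its empirical frequency is 0.4.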

Conditional classifier: it can be used to determine the most likely label for a given input, or how likely a given label is for that input, i.e. P(label|input).
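
The conditional form P(label|input) can be sketched as a normalized exponential over the weights of the joint features that fire for an input. This is a minimal sketch under stated assumptions: the weights below are hand-picked, hypothetical values rather than anything learned from the corpus, and `conditional_probs` is an illustrative helper, not an NLTK function.

```python
import math

# Hypothetical weights for (feature name, value, label) joint features.
weights = {
    ("last_letter", "a", "female"): 1.5,
    ("last_letter", "a", "male"): -1.5,
    ("last_letter", "o", "male"): 1.2,
    ("last_letter", "o", "female"): -1.2,
}
LABELS = ("female", "male")

def conditional_probs(features, weights, labels=LABELS):
    """Return P(label | features) for every label."""
    scores = {}
    for label in labels:
        # Sum the weights of the joint features active for this label;
        # unseen joint features contribute a weight of 0.
        total = sum(weights.get((fname, fval, label), 0.0)
                    for fname, fval in features.items())
        scores[label] = math.exp(total)
    z = sum(scores.values())  # normalizer so the probabilities sum to 1
    return {label: s / z for label, s in scores.items()}

probs = conditional_probs({"last_letter": "a"}, weights)
best = max(probs, key=probs.get)
print(best, probs)
```

Training a Maximum Entropy model amounts to searching for the weights; the iterative log-likelihood improvements in the training traces further down reflect that search.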

---


Notes: https://lost-contact.mit.edu/afs/cs.pitt.edu/projects/nltk/docs/tutorial/classifying/nochunks.html#maxent

The Maximum Entropy classifier is not negatively impacted by feature interdependence the way Naive Bayes (which assumes independence) is.  
It uses classifiers that are empirically consistent with the training data, meaning that the model's estimate of the frequency of each feature equals its actual frequency in the training data.  
This captures the structure of the training data. The more features used, the stronger the constraint of empirical consistency becomes.

The intuition is that 'classifiers with lower entropy introduce biases that are not justified'.



Rapid computation, peaks after a single iteration

ADD CHART

In [23]:
me_mod1 = test_classifier(labeled_names, gender_features, nltk.ConditionalExponentialClassifier)
me_mod1.show_most_informative_features(10)

  ==> Training (100 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.396
             2          -0.40885        0.742
             3          -0.40384        0.742
             4          -0.40085        0.742
             5          -0.39887        0.742
             6          -0.39745        0.742
             7          -0.39638        0.742
             8          -0.39556        0.742
             9          -0.39490        0.742
            10          -0.39436        0.742
            11          -0.39391        0.742
            12          -0.39353        0.742
            13          -0.39320        0.742
            14          -0.39292        0.742
            15          -0.39268        0.742
            16          -0.39246        0.742
            17          -0.39226        0.742
            18          -0.39209        0.742
            19          -0.39194        0.742
 

In [24]:
me_mod2 = test_classifier(labeled_names, gender_features2, nltk.ConditionalExponentialClassifier)
me_mod2.show_most_informative_features(10)

  ==> Training (100 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.396
             2          -0.62973        0.604
             3          -0.61106        0.604
             4          -0.59370        0.628
             5          -0.57761        0.658
             6          -0.56269        0.692
             7          -0.54888        0.716
             8          -0.53608        0.736
             9          -0.52421        0.754
            10          -0.51319        0.768
            11          -0.50294        0.770
            12          -0.49340        0.770
            13          -0.48451        0.774
            14          -0.47620        0.780
            15          -0.46842        0.784
            16          -0.46112        0.794
            17          -0.45427        0.794
            18          -0.44783        0.794
            19          -0.44176        0.798
 

Slow computation, peaks after six iterations

In [25]:
me_mod3 = test_classifier(labeled_names, gender_features3, nltk.ConditionalExponentialClassifier)
me_mod3.show_most_informative_features(10)

  ==> Training (100 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.396
             2          -0.37462        0.792
             3          -0.34529        0.818
             4          -0.32621        0.822
             5          -0.31325        0.824
             6          -0.30403        0.826
             7          -0.29714        0.828
             8          -0.29181        0.828
             9          -0.28753        0.828
            10          -0.28402        0.828
            11          -0.28108        0.828
            12          -0.27858        0.828
            13          -0.27643        0.828
            14          -0.27455        0.828
            15          -0.27290        0.828
            16          -0.27144        0.828
            17          -0.27013        0.828
            18          -0.26896        0.828
            19          -0.26790        0.828
 

Rapid computation; unclear whether the optimum was reached, as accuracy continues to improve

In [26]:
me_mod4 = test_classifier(labeled_names, gender_features4, nltk.ConditionalExponentialClassifier)
me_mod4.show_most_informative_features(10)

  ==> Training (100 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.396
             2          -0.61468        0.604
             3          -0.58329        0.622
             4          -0.55521        0.700
             5          -0.53014        0.752
             6          -0.50773        0.780
             7          -0.48767        0.800
             8          -0.46966        0.818
             9          -0.45343        0.834
            10          -0.43876        0.850
            11          -0.42544        0.858
            12          -0.41331        0.866
            13          -0.40221        0.866
            14          -0.39202        0.870
            15          -0.38263        0.870
            16          -0.37394        0.874
            17          -0.36589        0.878
            18          -0.35839        0.878
            19          -0.35139        0.878
 

## Conclusion

## Youtube