# DATA 620 - Project 3

Jeremy OBrien, Mael Illien, Vanita Thompson

* Using any of the three classifiers described in chapter 6 of Natural Language Processing with Python, and any features you can think of, build the best name gender classifier you can. 
* Begin by splitting the Names Corpus into three subsets: 500 words for the test set, 500 words for the dev-test set, and the remaining 6900 words for the training set. 
* Then, starting with the example name gender classifier, make incremental improvements. Use the dev-test set to check your progress. 
* Once you are satisfied with your classifier, check its final performance on the test set. 
* How does the performance on the test set compare to the performance on the dev-test set? Is this what you'd expect?
* Source: Natural Language Processing with Python, exercise 6.10.2.

The three classifiers covered in Chapter 6 are Naive Bayes, Decision Tree, and Maximum Entropy.

## Setup

In [35]:
import random
import nltk, re, pprint
from nltk.corpus import names
from nltk.classify import apply_features

## Data Import & Transformation

In [36]:
labeled_names = ([(name, 'male') for name in names.words('male.txt')] + 
         [(name, 'female') for name in names.words('female.txt')])

random.shuffle(labeled_names)
labeled_names[:10]

[('Genny', 'female'),
 ('Colene', 'female'),
 ('Maura', 'female'),
 ('Bertie', 'male'),
 ('Stacey', 'female'),
 ('Kala', 'female'),
 ('Carlina', 'female'),
 ('Annetta', 'female'),
 ('Sansone', 'male'),
 ('Lilias', 'female')]
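Because `random.shuffle` is called without a seed, the ordering above, and every accuracy figure that depends on it, changes from run to run. The following is a minimal sketch of a reproducible shuffle. The helper name `shuffled_copy`, the seed value 620, and the toy list are illustrative assumptions and are not part of the original notebook.

```python
import random

def shuffled_copy(items, seed=620):
    """Return a shuffled copy of items; the same seed gives the same order."""
    rng = random.Random(seed)  # private generator, leaves the global RNG untouched
    result = list(items)       # copy so the caller's list is not modified
    rng.shuffle(result)
    return result

# Toy stand-in for labeled_names (assumption, not the real corpus)
toy = [('Anna', 'female'), ('John', 'male'), ('Maria', 'female'), ('Mark', 'male')]
first = shuffled_copy(toy)
second = shuffled_copy(toy)
print(first == second)  # True
```

Calling `random.seed(...)` once before `random.shuffle(labeled_names)` would likely have a similar effect in the notebook itself, at the cost of reseeding the global generator.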

### Train Test Split

In [28]:
# Superseded: the split is now performed inside test_classifier
# train_names = labeled_names[:500]
# devtest_names = labeled_names[500:1000]
# test_names = labeled_names[1000:]

In [37]:
# Superseded: feature extraction is now performed inside test_classifier
# train_set = [(gender_features(n), gender) for (n, gender) in train_names]
# devtest_set = [(gender_features(n), gender) for (n, gender) in devtest_names]
# test_set = [(gender_features(n), gender) for (n, gender) in test_names]
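The exercise asks for 500 test names, 500 dev-test names, and the remaining names for training. A minimal sketch of a split in that shape is shown here. The helper `split_names` and the placeholder corpus of 7900 generated names are assumptions made for illustration, with the size chosen to match the exercise's figure of 6900 training names.

```python
def split_names(labeled, n_test=500, n_devtest=500):
    """Split an already shuffled list into (train, devtest, test) parts."""
    test_part = labeled[:n_test]
    devtest_part = labeled[n_test:n_test + n_devtest]
    train_part = labeled[n_test + n_devtest:]  # everything that is left
    return train_part, devtest_part, test_part

# Placeholder corpus (assumption): 7900 made-up (name, gender) pairs
toy_corpus = [('name%d' % i, 'female' if i % 2 else 'male') for i in range(7900)]
train_part, devtest_part, test_part = split_names(toy_corpus)
print(len(train_part), len(devtest_part), len(test_part))  # 6900 500 500
```

The three parts do not overlap, so a name used for evaluation is never seen during training.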

### Test Classifier

In [70]:
def test_classifier(names_corpus, gender_features_function):
    """Train a Naive Bayes classifier with the given feature extractor,
    print dev-test and test accuracy, and return the trained classifier."""

    # Train test split of the shuffled corpus.
    # As written, the first 500 names train the model, the next 500 form the
    # dev-test set, and all remaining names form the test set. The exercise
    # describes the opposite sizes: 500 test names and the remainder for training.
    train_names = names_corpus[:500]
    devtest_names = names_corpus[500:1000]
    test_names = names_corpus[1000:]

    # Apply the feature extractor to every (name, gender) pair
    train_set = [(gender_features_function(n), gender) for (n, gender) in train_names]
    devtest_set = [(gender_features_function(n), gender) for (n, gender) in devtest_names]
    test_set = [(gender_features_function(n), gender) for (n, gender) in test_names]

    # Train, then print dev-test accuracy followed by test accuracy
    classifier = nltk.NaiveBayesClassifier.train(train_set)
    print(nltk.classify.accuracy(classifier, devtest_set))
    print(nltk.classify.accuracy(classifier, test_set))

    # Show the five feature values with the largest likelihood ratios
    classifier.show_most_informative_features(5)

    return classifier
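
`nltk.classify.accuracy` reports the fraction of labeled feature sets that the classifier labels correctly. A minimal stand-alone sketch of that calculation follows. It uses a hypothetical hand-written rule, `last_letter_rule`, in place of a trained model, and a made-up four-item evaluation set.

```python
def accuracy(classify, labeled_featuresets):
    """Fraction of (features, label) pairs for which classify(features) == label."""
    correct = sum(1 for features, label in labeled_featuresets
                  if classify(features) == label)
    return correct / len(labeled_featuresets)

def last_letter_rule(features):
    """Hypothetical stand-in for a trained classifier."""
    return 'female' if features['last_letter'] in 'aei' else 'male'

# Made-up evaluation data (assumption, not taken from the Names Corpus)
toy_set = [
    ({'last_letter': 'a'}, 'female'),
    ({'last_letter': 'n'}, 'male'),
    ({'last_letter': 'e'}, 'male'),   # the rule gets this one wrong
    ({'last_letter': 'k'}, 'male'),
]
print(accuracy(last_letter_rule, toy_set))  # 0.75
```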

    

## Feature Engineering

### Example 1

In [60]:
def gender_features(word):
    return {'last_letter': word[-1]}

In [61]:
gender_features('John')

{'last_letter': 'n'}

In [71]:
mod1 = test_classifier(labeled_names, gender_features)

0.75
0.7512960829493087
Most Informative Features
             last_letter = 'a'            female : male   =     42.7 : 1.0
             last_letter = 'd'              male : female =     15.5 : 1.0
             last_letter = 'k'              male : female =     10.1 : 1.0
             last_letter = 'i'            female : male   =      9.8 : 1.0
             last_letter = 's'              male : female =      3.9 : 1.0
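
Each row above is a likelihood ratio: how much more often a feature value occurs with one label than with the other in the training data. A rough sketch of that idea for the last-letter feature is given here. The function `likelihood_ratio`, the add-0.5 smoothing, and the toy names are assumptions for illustration; this is not NLTK's exact computation.

```python
from collections import Counter

def likelihood_ratio(labeled_names, letter, smoothing=0.5):
    """Estimate P(last letter | female) / P(last letter | male) with additive smoothing."""
    counts = {'male': Counter(), 'female': Counter()}
    for name, gender in labeled_names:
        counts[gender][name[-1].lower()] += 1

    # All last letters seen under either label, plus the one being queried
    letters = set(counts['male']) | set(counts['female']) | {letter}

    def prob(gender):
        total = sum(counts[gender].values())
        return (counts[gender][letter] + smoothing) / (total + smoothing * len(letters))

    return prob('female') / prob('male')

# Toy training data (assumption)
toy_names = [('Anna', 'female'), ('Maria', 'female'), ('John', 'male'), ('Mark', 'male')]
print('%.1f' % likelihood_ratio(toy_names, 'a'))  # 5.0
```

A ratio above 1 favors `female` and a ratio below 1 favors `male`; smoothing keeps the ratio finite when a letter never occurs under one label.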


In [72]:
mod1.classify(gender_features('Neo'))

'male'

In [73]:
mod1.classify(gender_features('Trinity'))

'female'

### Example 2

In [62]:
def gender_features2(name):
    lowered = name.lower()  # lowercase once instead of on every lookup
    features = {}
    features["firstletter"] = lowered[0]
    features["lastletter"] = lowered[-1]
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count(%s)" % letter] = lowered.count(letter)
        features["has(%s)" % letter] = (letter in lowered)
    return features

In [63]:
#gender_features2('John')

In [74]:
mod2 = test_classifier(labeled_names, gender_features2)

0.742
0.7618087557603687
Most Informative Features
              lastletter = 'a'            female : male   =     42.7 : 1.0
              lastletter = 'd'              male : female =     15.5 : 1.0
              lastletter = 'k'              male : female =     10.1 : 1.0
              lastletter = 'i'            female : male   =      9.8 : 1.0
                count(z) = 1                male : female =      4.9 : 1.0


### Example 3

In [77]:
def gender_features3(word):
    return {'suffix1': word[-1:], 'suffix2': word[-2:]}

In [78]:
gender_features3('Cristina')

{'suffix1': 'a', 'suffix2': 'na'}

In [79]:
mod3 = test_classifier(labeled_names, gender_features3)

0.77
0.763536866359447
Most Informative Features
                 suffix1 = 'a'            female : male   =     42.7 : 1.0
                 suffix1 = 'd'              male : female =     15.5 : 1.0
                 suffix2 = 'an'             male : female =     10.3 : 1.0
                 suffix1 = 'k'              male : female =     10.1 : 1.0
                 suffix1 = 'i'            female : male   =      9.8 : 1.0


## Error Analysis

In [None]:
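# Collect the dev-test names that the first model (mod1) misclassifies.
# devtest_names is local to test_classifier, so it is rebuilt here with the same slice.
devtest_names = labeled_names[500:1000]
errors = []
for (name, tag) in devtest_names:
    guess = mod1.classify(gender_features(name))
    if guess != tag:
        errors.append((tag, guess, name))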

In [None]:
for (tag, guess, name) in sorted(errors):
    print('correct=%-8s guess=%-8s name=%-30s' % (tag, guess, name))

## Naive Bayes

## Decision Trees

In [None]:
import math

def entropy(labels):
    """Return the entropy, in bits, of the label distribution in labels."""
    freqdist = nltk.FreqDist(labels)
    probs = [freqdist.freq(label) for label in freqdist]
    return -sum(p * math.log(p, 2) for p in probs)
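
The entropy calculation above can be checked on small label lists without NLTK. The sketch below uses `collections.Counter` in place of `nltk.FreqDist`; the helper name `label_entropy` and the example lists are assumptions for illustration.

```python
import math
from collections import Counter

def label_entropy(labels):
    """Entropy in bits of the empirical label distribution."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# An evenly mixed set of labels is maximally uncertain for two classes
print(label_entropy(['male', 'female', 'male', 'female']))  # 1.0

# A three-to-one mix carries less uncertainty than an even mix
mixed = label_entropy(['male', 'female', 'male', 'male'])
print(round(mixed, 3))  # 0.811
```

A list with a single label has zero entropy, which is why a decision tree prefers splits that leave each branch dominated by one gender.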

## Max Entropy

## Conclusion

## YouTube