# DATA 620 - Project 3

Jeremy OBrien, Mael Illien, Vanita Thompson

* Using any of the three classifiers described in chapter 6 of Natural Language Processing with Python, and any features you can think of, build the best name gender classifier you can. 
* Begin by splitting the Names Corpus into three subsets: 500 words for the test set, 500 words for the dev-test set, and the remaining 6900 words for the training set. 
* Then, starting with the example name gender classifier, make incremental improvements. Use the dev-test set to check your progress. 
* Once you are satisfied with your classifier, check its final performance on the test set. 
* How does the performance on the test set compare to the performance on the dev-test set? Is this what you'd expect?
* Source: Natural Language Processing with Python, exercise 6.10.2.

The three classifiers from Chapter 6: NaiveBayes, DecisionTree, MaxEntropy

## Setup

In [27]:
import nltk, re, pprint
from nltk.corpus import names
from nltk.classify import apply_features

## Data Import & Transformation

In [43]:
names = ([(name, 'male') for name in names.words('male.txt')] + 
         [(name, 'female') for name in names.words('female.txt')])

names[:10]

[('Aamir', 'male'),
 ('Aaron', 'male'),
 ('Abbey', 'male'),
 ('Abbie', 'male'),
 ('Abbot', 'male'),
 ('Abbott', 'male'),
 ('Abby', 'male'),
 ('Abdel', 'male'),
 ('Abdul', 'male'),
 ('Abdulkarim', 'male')]

### Train Test Split

In [45]:
train_names = names[:500]
devtest_names = names[500:1000]
test_names = names[1000:]

In [46]:
train_set = [(gender_features(n), gender) for (n, gender) in train_names]
devtest_set = [(gender_features(n), gender) for (n, gender) in devtest_names]
test_set = [(gender_features(n), gender) for (n, gender) in test_names]

## Feature Engineering

### Basic Example

In [33]:
def gender_features(word):
    return {'last_letter': word[-1]}

In [37]:
gender_features('John')

{'last_letter': 'n'}

In [4]:
featuresets = [(gender_features(n), g) for (n,g) in names]
train_set, test_set = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [5]:
classifier.classify(gender_features('Neo'))

'male'

In [6]:
classifier.classify(gender_features('Trinity'))

'female'

In [7]:
classifier.classify(gender_features('Mael'))

'female'

In [8]:
print(nltk.classify.accuracy(classifier, test_set))

0.602


In [9]:
classifier.show_most_informative_features(5)

Most Informative Features
             last_letter = 'a'            female : male   =     35.5 : 1.0
             last_letter = 'k'              male : female =     34.1 : 1.0
             last_letter = 'f'              male : female =     15.9 : 1.0
             last_letter = 'p'              male : female =     13.5 : 1.0
             last_letter = 'v'              male : female =     12.7 : 1.0


### Example 2

In [47]:
def gender_features2(name):
    features = {}
    features["firstletter"] = name[0].lower()
    features["lastletter"] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count(%s)" % letter] = name.lower().count(letter)
        features["has(%s)" % letter] = (letter in name.lower())
    return features

In [36]:
gender_features2('John')

{'firstletter': 'j',
 'lastletter': 'n',
 'count(a)': 0,
 'has(a)': False,
 'count(b)': 0,
 'has(b)': False,
 'count(c)': 0,
 'has(c)': False,
 'count(d)': 0,
 'has(d)': False,
 'count(e)': 0,
 'has(e)': False,
 'count(f)': 0,
 'has(f)': False,
 'count(g)': 0,
 'has(g)': False,
 'count(h)': 1,
 'has(h)': True,
 'count(i)': 0,
 'has(i)': False,
 'count(j)': 1,
 'has(j)': True,
 'count(k)': 0,
 'has(k)': False,
 'count(l)': 0,
 'has(l)': False,
 'count(m)': 0,
 'has(m)': False,
 'count(n)': 1,
 'has(n)': True,
 'count(o)': 1,
 'has(o)': True,
 'count(p)': 0,
 'has(p)': False,
 'count(q)': 0,
 'has(q)': False,
 'count(r)': 0,
 'has(r)': False,
 'count(s)': 0,
 'has(s)': False,
 'count(t)': 0,
 'has(t)': False,
 'count(u)': 0,
 'has(u)': False,
 'count(v)': 0,
 'has(v)': False,
 'count(w)': 0,
 'has(w)': False,
 'count(x)': 0,
 'has(x)': False,
 'count(y)': 0,
 'has(y)': False,
 'count(z)': 0,
 'has(z)': False}

### Example 3

In [38]:
def gender_features3(word):
    return {'suffix1': word[-1:],'suffix2': word[-2:]}

In [39]:
gender_features3('Cristina')

{'suffix1': 'a', 'suffix2': 'na'}

## Test Classifier

In [61]:
def test_classifier(gender_features_function):
#     train_set = apply_features(gender_features, names[:500])
#     devtest_set = apply_features(gender_features, names[500:1000])
#     test_set = apply_features(gender_features, names[1000:])

    train_set = [(gender_features_function(n), gender) for (n, gender) in train_names]
    devtest_set = [(gender_features_function(n), gender) for (n, gender) in devtest_names]
    test_set = [(gender_features_function(n), gender) for (n, gender) in test_names]
    
    classifier = nltk.NaiveBayesClassifier.train(train_set)
    print(nltk.classify.accuracy(classifier, devtest_set))
    print(nltk.classify.accuracy(classifier, test_set))
    return classifier
    

In [62]:
mod1 = test_classifier(gender_features)

1.0
0.27980990783410137


In [63]:
mod2 = test_classifier(gender_features2)

1.0
0.27980990783410137


In [64]:
mod3 = test_classifier(gender_features3)

1.0
0.27980990783410137


In [65]:
errors = []
for (name, tag) in devtest_names:
    guess = classifier.classify(gender_features(name))
    if guess != tag:
        errors.append( (tag, guess, name) )

In [17]:
for (tag, guess, name) in sorted(errors): 
    print('correct=%-8s guess=%-8s name=%-30s'%(tag, guess, name))

correct=male     guess=female   name=Clinton                       
correct=male     guess=female   name=Clive                         
correct=male     guess=female   name=Clyde                         
correct=male     guess=female   name=Cob                           
correct=male     guess=female   name=Cobb                          
correct=male     guess=female   name=Cobbie                        
correct=male     guess=female   name=Cobby                         
correct=male     guess=female   name=Cody                          
correct=male     guess=female   name=Colbert                       
correct=male     guess=female   name=Cole                          
correct=male     guess=female   name=Coleman                       
correct=male     guess=female   name=Colin                         
correct=male     guess=female   name=Collin                        
correct=male     guess=female   name=Conan                         
correct=male     guess=female   name=Connie     

## Naive Bayes

## Decision Trees

## Max Entropy

## Conclusion

## Youtube