## Naive Bayes Classifiers
A powerful and intuitive technique. File this one away: it will often teach you a lot about a problem, even if it doesn't "win" the accuracy game. First, some examples from NLTK.

In [None]:
import nltk
from nltk.corpus import names
import random
from collections import Counter
import re

In [None]:
# Create some labeled observations
labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                 [(name, 'female') for name in names.words('female.txt')])

# shuffle so that we can have a training and test set
random.shuffle(labeled_names)

In [None]:
# For this toy example, the last letter of the name is our only feature
def gender_features(word):
    return {'last_letter': word[-1]}

For this next line, read a bit about what's going on with this classifier [here](http://www.nltk.org/book/ch06.html). 

In [None]:
# This line is super important to understand
featuresets = [(gender_features(n), gender) for (n, gender) in labeled_names]
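
To make the shape of `featuresets` concrete, here is a minimal sketch on a few made-up names. The `toy_names` list is hypothetical and only stands in for `labeled_names`. Each element is a `(feature_dict, label)` pair, which is the form `nltk.NaiveBayesClassifier.train` is given later in this notebook.

```python
def gender_features(word):
    return {'last_letter': word[-1]}

# Hypothetical stand-in for labeled_names (not taken from the NLTK corpus)
toy_names = [('Mary', 'female'), ('John', 'male'), ('Diana', 'female')]

# Same comprehension as above: one (features, label) pair per name
toy_featuresets = [(gender_features(n), gender) for (n, gender) in toy_names]

for features, label in toy_featuresets:
    print(features, label)
```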

In [None]:
train_set, test_set = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [None]:
# Accuracy on the 500 held-out test names
print(nltk.classify.accuracy(classifier, test_set))

In [None]:
our_class = """Mary Peyten Xin Alexis John Brenden
               Madeline Claire Diana August Jon
               Brenna Hong-Shen Chris CJ Kristi
               Apsara Mike Craig""".split() 
# Taking some liberties with Hong Shen to prevent splitting his name

for student in our_class :
    print(student + " classified as " + classifier.classify(gender_features(student)))

print(1 - 5/len(our_class))  # about 74% accuracy if 5 of the 19 names are misclassified

In [None]:
# Looking at the counts by gender can be useful for
# understanding priors.
Counter([gender for name, gender in labeled_names])
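
To see how the priors enter the prediction, here is a rough sketch of the Naive Bayes calculation done by hand for a single feature. The toy data, the `posterior` helper, and the add-one smoothing are all assumptions made for illustration; NLTK's own estimator may smooth differently, so do not expect identical numbers from `classifier`.

```python
from collections import Counter, defaultdict

# Hypothetical toy data, not drawn from the NLTK names corpus
toy_names = [('Anna', 'female'), ('Maria', 'female'), ('Ellen', 'female'),
             ('John', 'male'), ('Kevin', 'male'), ('Joshua', 'male')]

# Class counts give the priors; per-class letter counts give the likelihoods
label_counts = Counter(label for _, label in toy_names)
letter_counts = defaultdict(Counter)
for name, label in toy_names:
    letter_counts[label][name[-1]] += 1

def posterior(last_letter, alpha=1.0, vocab_size=26):
    """P(label | last_letter) using add-alpha smoothing (an assumption)."""
    scores = {}
    for label, n in label_counts.items():
        prior = n / len(toy_names)
        likelihood = (letter_counts[label][last_letter] + alpha) / (n + alpha * vocab_size)
        scores[label] = prior * likelihood
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}

post = posterior('a')
print(post)
```

With equal priors, the posterior is driven entirely by how often each class ends in the letter. If one class were much more common, its larger prior would pull every prediction toward it.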

In [None]:
# let's just look at all the features. Usually you'd only show a few
classifier.show_most_informative_features(26)

Now let's build up some data sets so we can do iterative improvements to our model. 

In [None]:
random.shuffle(labeled_names) # Use this to shuffle in place to build training and test set

This next cell is worth understanding. Ask questions if it is opaque. 

In [None]:
test_size = 500
devtest_size = 1000

train_names = labeled_names[(test_size + devtest_size):]
devtest_names = labeled_names[test_size:(test_size + devtest_size)]
test_names = labeled_names[:test_size]
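
If the slicing is hard to follow, this small sketch applies the same three slices to a list of integers. The `items` list is a hypothetical stand-in for `labeled_names`. It checks that the three pieces do not overlap and together cover the whole list.

```python
import random

items = list(range(2000))  # hypothetical stand-in for labeled_names
random.seed(0)             # fixed seed only so the sketch is repeatable
random.shuffle(items)

test_size = 500
devtest_size = 1000

train = items[(test_size + devtest_size):]              # everything after the first 1500
devtest = items[test_size:(test_size + devtest_size)]   # items 500 through 1499
test = items[:test_size]                                # the first 500 items

sizes = (len(train), len(devtest), len(test))
overlap = (set(train) & set(devtest)) | (set(train) & set(test)) | (set(devtest) & set(test))
print(sizes, len(overlap))
```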

In [None]:
train_set = [(gender_features(n), gender) for (n, gender) in train_names]
devtest_set = [(gender_features(n), gender) for (n, gender) in devtest_names]
test_set = [(gender_features(n), gender) for (n, gender) in test_names]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, devtest_set))

In [None]:
errors = []
for (name, tag) in devtest_names:
    guess = classifier.classify(gender_features(name))
    if guess != tag:
        errors.append( (tag, guess, name) )

Read the output of the cell below and form some hypotheses about additional features to add. 

In [None]:
for (tag, guess, name) in sorted(errors):
    print(f'correct={tag:<8} guess={guess:<8s} name={name:<30}')
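
A long error list can be hard to scan. One option, sketched here on made-up data, is to tally the errors by true label and name ending. The `sample_errors` list is hypothetical but has the same `(tag, guess, name)` shape as `errors`.

```python
from collections import Counter

# Hypothetical (tag, guess, name) triples in the same shape as `errors`
sample_errors = [('female', 'male', 'Ardis'), ('female', 'male', 'Doris'),
                 ('male', 'female', 'Joshua'), ('male', 'female', 'Elisha'),
                 ('female', 'male', 'Gwyn')]

# Count misclassified names by (true label, last two letters)
by_suffix = Counter((tag, name[-2:].lower()) for tag, guess, name in sample_errors)

for (tag, suffix), count in by_suffix.most_common():
    print(f'correct={tag:<8} suffix={suffix:<4} errors={count}')
```

Endings that show up repeatedly for one label are natural candidates for new features.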

At this point, look at the names that are being missed and see if you can add some features that will improve our accuracy. Some potential options:

* Specific starting or ending letters.
* Groups of letters at the beginning or end of the name (for example, the first two or last three).
* Patterns like doubled letters, etc. 
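
As a starting point, the ideas in this list could be turned into features along these lines. This is only a sketch: `candidate_features` and its feature names are hypothetical, and whether any of them helps is something to check on `devtest`.

```python
import re

# Matches any character immediately followed by itself, e.g. the "nn" in "Anna"
doubled_letter = re.compile(r'(.)\1')

def candidate_features(word):
    """Hypothetical extra features; lowercased so capitalized names match."""
    name = word.lower()
    return {'first_letter': name[0],
            'last_2': name[-2:],
            'has_double_letter': bool(doubled_letter.search(name))}

print(candidate_features('Anna'))
print(candidate_features('John'))
```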

In [None]:
# Putting regexes in their own cell so they only have to be compiled once
hyphen_or_space = re.compile(r'[ -]')

In [None]:
# Here's a more complicated version.
def gender_features_2(word):
    ''' Take in a name and return a dictionary mapping each
        feature name to that feature's value. '''
    # Names in the corpus are capitalized, so lowercase before matching;
    # otherwise "bob" never matches "Bob" and "lynn" never matches "Lynn".
    name = word.lower()
    last_3 = name[-3:]
    last_4 = name[-4:]

    # Highest count of any single letter in the name
    max_letters = max(Counter(name).values())

    ret_dict = {'last_letter': name[-1],
                # Slicing (rather than name[-2]) is safe for one-letter names
                'penultimate_y': (name[-2:-1] == "y"),
                'last_3': last_3,
                'last_3_ann_een': (last_3 in {"ann", "een"}),
                'last_4_lynn': (last_4 == "lynn"),
                # True for hyphenated or two-part names such as "Hong-Shen"
                'double_name': bool(hyphen_or_space.search(name)),
                'has_bob': "bob" in name,
                'first_2': name[:2],
                'letter_repeats': max_letters >= 2}

    return ret_dict

Now, having defined our new function, we can test it on `devtest`.

In [None]:
train_set = [(gender_features_2(n), gender) for (n, gender) in train_names]
devtest_set = [(gender_features_2(n), gender) for (n, gender) in devtest_names]
test_set = [(gender_features_2(n), gender) for (n, gender) in test_names]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, devtest_set))

And you can look at the features and the errors:

In [None]:
classifier.show_most_informative_features(30)

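In [None]:
# Recompute the guesses first; the earlier `errors` list came from the one-feature classifier.
guesses = [(tag, classifier.classify(gender_features_2(name)), name) for (name, tag) in devtest_names]
for (tag, guess, name) in sorted(g for g in guesses if g[0] != g[1]):
    print(f'correct={tag:<8} guess={guess:<8s} name={name:<30}')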

Don't run this next cell till you're _completely_ done tweaking your `gender_features_2` code. 

In [None]:
# Once you're done tweaking your code, run this one. 
print(nltk.classify.accuracy(classifier, test_set))

That figure is your unbiased estimate of the classifier's accuracy, because the test set played no part in training or in choosing features. 