## Naive Bayes Classifiers
A powerful and intuitive technique. File this one away: it will often teach you a lot about a problem, even if it doesn't "win" the accuracy game. First, some examples from NLTK.

In [1]:
import random
import re
from collections import Counter

import nltk
from nltk.corpus import names  # needs a one-time nltk.download('names')

In [2]:
# Create some labeled observations
labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                 [(name, 'female') for name in names.words('female.txt')])

# shuffle so that we can have a training and test set
random.shuffle(labeled_names)

In [3]:
# For the purposes of this toy example, we just use the last letters as our only feature
def gender_features(word):
    return {'last_letter': word[-1]}
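
To see what the classifier will actually receive, it can help to call the feature extractor on a single name. A minimal sketch, with the function repeated so the snippet runs on its own and an arbitrary example name:

```python
def gender_features(word):
    # Same toy extractor as above: the only feature is the final letter.
    return {'last_letter': word[-1]}

example = gender_features('Shrek')
print(example)  # {'last_letter': 'k'}
```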

For this next line, read a bit about what's going on with this classifier [here](http://www.nltk.org/book/ch06.html). 

In [4]:
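# This line is super important to understand: it pairs each name's feature
# dictionary with its label, the (features, label) format the classifier trains on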
featuresets = [(gender_features(n), gender) for (n, gender) in labeled_names]
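
If the comprehension above is hard to picture, here is a sketch of the same transformation on a tiny hand-made list. The names in `toy_labeled_names` are a hypothetical stand-in for `labeled_names`, used only so the snippet runs without the corpus:

```python
def gender_features(word):
    return {'last_letter': word[-1]}

# Hypothetical stand-in for labeled_names (not taken from the corpus).
toy_labeled_names = [('Ann', 'female'), ('Bob', 'male')]

# Each element becomes a (feature dictionary, label) pair.
toy_featuresets = [(gender_features(n), gender) for (n, gender) in toy_labeled_names]
print(toy_featuresets)
# [({'last_letter': 'n'}, 'female'), ({'last_letter': 'b'}, 'male')]
```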

In [5]:
train_set, test_set = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [6]:
# Accuracy on the 500 held-out test names (not used in training)
print(nltk.classify.accuracy(classifier, test_set))

0.784


Now something for you to do. Fill in the blanks in the cell below to look at how the classifier does on our class.

In [7]:
our_class = "name1 name2".split() # fill in our class here 

for student in our_class:
    print(student + " classified as " + classifier.classify(gender_features(student)))

# Calculate the overall accuracy
correct = 0  # fill in: the number of students classified correctly
print("On our class, the accuracy is {}.".format(correct / len(our_class)))

name1 classified as female
name2 classified as female
On our class, the accuracy is 0.0.
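
One way to fill in the accuracy calculation is to compare the guesses against labels you supply by hand. This is only a sketch: the `accuracy` helper and the toy lists are hypothetical, and in the notebook the guesses would come from `classifier.classify`.

```python
def accuracy(guesses, truths):
    """Fraction of guesses that match the true labels."""
    correct = sum(g == t for g, t in zip(guesses, truths))
    return correct / len(truths)

# Hypothetical guesses and labels, for illustration only.
toy_guesses = ['female', 'male', 'female', 'male']
toy_truths = ['female', 'male', 'male', 'male']

toy_accuracy = accuracy(toy_guesses, toy_truths)
print(toy_accuracy)  # 0.75
```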


In [8]:
# Looking at the counts by gender can be useful for
# understanding priors.
Counter([gender for name, gender in labeled_names])

Counter({'female': 5001, 'male': 2943})
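
Those counts translate directly into prior probabilities. The sketch below uses the counts from the output above; note that the classifier itself estimates its priors from the training subset, so its values will be close to these but not identical.

```python
from collections import Counter

# Counts copied from the output above.
counts = Counter({'female': 5001, 'male': 2943})
total = sum(counts.values())

# Prior probability of each label before looking at any feature.
priors = {label: n / total for label, n in counts.items()}
for label, p in priors.items():
    print('{:<8} {:.3f}'.format(label, p))
```

About 63% of the names are labeled female, so a classifier that always guesses `female` would already be right about 63% of the time on this list. That is the baseline to beat.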

In [9]:
# let's just look at all the features. Usually you'd only show a few
classifier.show_most_informative_features(26)

Most Informative Features
             last_letter = 'a'            female : male   =     36.9 : 1.0
             last_letter = 'k'              male : female =     31.2 : 1.0
             last_letter = 'f'              male : female =     16.6 : 1.0
             last_letter = 'p'              male : female =     11.2 : 1.0
             last_letter = 'v'              male : female =     10.5 : 1.0
             last_letter = 'd'              male : female =     10.3 : 1.0
             last_letter = 'm'              male : female =     10.1 : 1.0
             last_letter = 'o'              male : female =      8.2 : 1.0
             last_letter = 'r'              male : female =      6.5 : 1.0
             last_letter = 'w'              male : female =      5.4 : 1.0
             last_letter = 'g'              male : female =      4.9 : 1.0
             last_letter = 's'              male : female =      4.1 : 1.0
             last_letter = 'z'              male : female =      4.0 : 1.0
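
Each ratio in that table compares how often a feature value shows up under one label versus the other. Here is a rough sketch of the idea on a hypothetical toy list (not the corpus). It uses raw relative frequencies; NLTK smooths its estimates, so the numbers it reports are computed a little differently.

```python
# Hypothetical toy data, for illustration only.
toy = [('Anna', 'female'), ('Emma', 'female'), ('Maria', 'female'), ('Ruth', 'female'),
       ('Luca', 'male'), ('Mark', 'male'), ('Jack', 'male'), ('John', 'male')]

def likelihood(letter, label):
    """Unsmoothed estimate of P(last_letter == letter | label)."""
    with_label = [name for name, gender in toy if gender == label]
    hits = sum(1 for name in with_label if name[-1] == letter)
    return hits / len(with_label)

p_female = likelihood('a', 'female')  # 3 of 4 female names end in 'a'
p_male = likelihood('a', 'male')      # 1 of 4 male names ends in 'a'
ratio = p_female / p_male
print('female : male = {:.1f} : 1.0'.format(ratio))  # female : male = 3.0 : 1.0
```

In this toy list no female name ends in `k`, so the same calculation for `k` would divide by zero. Avoiding that kind of zero count is one reason smoothed estimates are used in practice.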

Now let's build up some data sets so we can do iterative improvements to our model. 

In [10]:
random.shuffle(labeled_names)  # Reshuffle in place before carving out the training, dev-test, and test sets

This next cell is worth understanding. Ask questions if it is opaque. 

In [11]:
test_size = 500
devtest_size = 1000

train_names = labeled_names[(test_size + devtest_size):]
devtest_names = labeled_names[test_size:(test_size + devtest_size)]
test_names = labeled_names[:test_size]
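
If the slicing is opaque, try the same pattern on a short list where you can see every element. In this sketch `items` is a stand-in for the shuffled `labeled_names`, and the sizes are shrunk to fit:

```python
items = list(range(10))  # stand-in for the shuffled labeled_names
test_size = 2
devtest_size = 3

train = items[(test_size + devtest_size):]              # everything after the first 5
devtest = items[test_size:(test_size + devtest_size)]   # the next 3
test = items[:test_size]                                # the first 2

print(test, devtest, train)
# [0, 1] [2, 3, 4] [5, 6, 7, 8, 9]
```

The three slices cover the whole list and never overlap, so no name used for training can leak into the dev-test or test sets.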

In [12]:
train_set = [(gender_features(n), gender) for (n, gender) in train_names]
devtest_set = [(gender_features(n), gender) for (n, gender) in devtest_names]
test_set = [(gender_features(n), gender) for (n, gender) in test_names]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, devtest_set))

0.766


In [13]:
errors = []
for (name, tag) in devtest_names:
    guess = classifier.classify(gender_features(name))
    if guess != tag:
        errors.append( (tag, guess, name) )

Read the results of the cells below, and form some hypotheses of additional features to add. 

In [14]:
for (tag, guess, name) in sorted(errors):
    print('correct={:<8} guess={:<8s} name={:<30}'.format(tag, guess, name))

correct=female   guess=male     name=Alleen                        
correct=female   guess=male     name=Alyson                        
correct=female   guess=male     name=Alyss                         
correct=female   guess=male     name=Anabel                        
correct=female   guess=male     name=April                         
correct=female   guess=male     name=Ardelis                       
correct=female   guess=male     name=Ardis                         
correct=female   guess=male     name=Aurel                         
correct=female   guess=male     name=Avril                         
correct=female   guess=male     name=Babs                          
correct=female   guess=male     name=Beau                          
correct=female   guess=male     name=Beret                         
correct=female   guess=male     name=Bill                          
correct=female   guess=male     name=Blondell                      
correct=female   guess=male     name=Brook      

At this point, look at the names that are being missed and see if you can add some features that will improve our accuracy. Some potential options:

* Specific starting or ending letters.
* Groups of letters at the beginning or end of the name (prefixes and suffixes).
* Patterns like doubled letters, etc. 
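
As a starting point, the options above might be turned into features along these lines. This is a hypothetical sketch: `candidate_features`, its feature names, and the `doubled` regex are illustrations, and whether any of them helps is something to check on the dev-test set.

```python
import re

# Compiled once; matches any character that is immediately repeated, as in "ll" or "ee".
doubled = re.compile(r'(.)\1')

def candidate_features(word):
    """Hypothetical feature extractor illustrating the options listed above."""
    lower = word.lower()
    return {
        'first_letter': lower[0],
        'last_letter': lower[-1],
        'suffix2': lower[-2:],                      # last two letters
        'has_double': bool(doubled.search(lower)),  # any doubled letter
    }

print(candidate_features('Alleen'))
# {'first_letter': 'a', 'last_letter': 'n', 'suffix2': 'en', 'has_double': True}
```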

### Your work
Make changes to the below cells to improve `gender_features_2`.

In [17]:
# Putting regexes in their own cell so they only have to be compiled once
hyphen = re.compile(r'-') # here's an example.

In [18]:
# build your own function. Here's an example to get you started
def gender_features_2(word):
    ''' This function should take in a word and return a dictionary
        with the name of the feature as the key and the value 
        as the feature value. '''
    last_letter = word[-1]
    first_letter = word[0]

    # Hyphenated names such as "Carrie-Ann" count as double names
    double = bool(hyphen.search(word))

    ret_dict = {'last_letter': last_letter,
                'first_letter_c': first_letter == "C",
                'first_letter_j': first_letter == "J",
                'double_name': double}

    return ret_dict

In [19]:
print(gender_features_2("John"))
print(gender_features_2("Harika"))
print(gender_features_2("Carrie-Ann"))

{'last_letter': 'n', 'first_letter_c': False, 'first_letter_j': True, 'double_name': False}
{'last_letter': 'a', 'first_letter_c': False, 'first_letter_j': False, 'double_name': False}
{'last_letter': 'n', 'first_letter_c': True, 'first_letter_j': False, 'double_name': True}


Now, having defined our new function, we can test it on `devtest`.

In [20]:
train_set = [(gender_features_2(n), gender) for (n, gender) in train_names]
devtest_set = [(gender_features_2(n), gender) for (n, gender) in devtest_names]
test_set = [(gender_features_2(n), gender) for (n, gender) in test_names]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, devtest_set))

0.769


And you can look at the features and the errors:

In [21]:
classifier.show_most_informative_features(30)

Most Informative Features
             last_letter = 'a'            female : male   =     39.5 : 1.0
             last_letter = 'k'              male : female =     37.3 : 1.0
             last_letter = 'f'              male : female =     26.8 : 1.0
             last_letter = 'v'              male : female =     11.3 : 1.0
             last_letter = 'p'              male : female =      9.9 : 1.0
             last_letter = 'd'              male : female =      9.9 : 1.0
             last_letter = 'm'              male : female =      9.6 : 1.0
             last_letter = 'o'              male : female =      8.3 : 1.0
             last_letter = 'r'              male : female =      6.8 : 1.0
             last_letter = 'z'              male : female =      5.1 : 1.0
             last_letter = 'g'              male : female =      4.3 : 1.0
             last_letter = 's'              male : female =      3.9 : 1.0
             last_letter = 'u'              male : female =      3.9 : 1.0

In [22]:
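# `errors` must be rebuilt with gender_features_2 first (rerun the error-collection
# loop from earlier with the new function); otherwise this prints the old list.
for (tag, guess, name) in sorted(errors):
    print('correct={:<8} guess={:<8s} name={:<30}'.format(tag, guess, name))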

correct=female   guess=male     name=Alleen                        
correct=female   guess=male     name=Alyson                        
correct=female   guess=male     name=Alyss                         
correct=female   guess=male     name=Anabel                        
correct=female   guess=male     name=April                         
correct=female   guess=male     name=Ardelis                       
correct=female   guess=male     name=Ardis                         
correct=female   guess=male     name=Aurel                         
correct=female   guess=male     name=Avril                         
correct=female   guess=male     name=Babs                          
correct=female   guess=male     name=Beau                          
correct=female   guess=male     name=Beret                         
correct=female   guess=male     name=Bill                          
correct=female   guess=male     name=Blondell                      
correct=female   guess=male     name=Brook      
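
An error list is only as fresh as the last time it was built. If `errors` is not rebuilt after retraining, the printout simply repeats the earlier list. One way to keep it in step is to wrap the collection loop in a small helper and call it after every retrain. The helper name `collect_errors` and the toy stand-ins below are hypothetical:

```python
def collect_errors(classify, feature_fn, labeled):
    """Return sorted (tag, guess, name) triples for every misclassified name."""
    errors = []
    for (name, tag) in labeled:
        guess = classify(feature_fn(name))
        if guess != tag:
            errors.append((tag, guess, name))
    return sorted(errors)

# Stand-ins so the snippet runs on its own. In the notebook you would pass
# classifier.classify, gender_features_2, and devtest_names instead.
def toy_features(word):
    return {'last_letter': word[-1]}

def toy_classify(features):
    return 'female' if features['last_letter'] == 'a' else 'male'

toy_names = [('Anna', 'female'), ('April', 'female'), ('Luca', 'male'), ('Mark', 'male')]
toy_errors = collect_errors(toy_classify, toy_features, toy_names)
print(toy_errors)
# [('female', 'male', 'April'), ('male', 'female', 'Luca')]
```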

Don't run this next cell till you're _completely_ done tweaking your `gender_features_2` code. 

In [23]:
# Once you're done tweaking your code, run this one. 
print(nltk.classify.accuracy(classifier, test_set))

0.748


Because the test set played no part in training or in choosing features, this number is your unbiased estimate of the classifier's accuracy on new names.
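
Unbiased does not mean exact: with only 500 test names the estimate carries some sampling noise. A rough sketch of how much, assuming the test names are an independent random sample and using the normal approximation to the binomial:

```python
import math

accuracy = 0.748  # test accuracy reported above
n = 500           # test_size

# Standard error of a proportion, then an approximate 95% interval.
standard_error = math.sqrt(accuracy * (1 - accuracy) / n)
low = accuracy - 1.96 * standard_error
high = accuracy + 1.96 * standard_error
print('{:.3f} +/- {:.3f}'.format(accuracy, 1.96 * standard_error))  # 0.748 +/- 0.038
```

Under those assumptions the interval runs from roughly 0.71 to 0.79, so differences of a point or two between runs are well within the noise.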