## Naive Bayes Classifiers: Introduction
A powerful and intutitive technique. File this one away, it'll often teach you a lot about a problem, even if it doesn't "win" the accuracy game. First some examples from NLTK.

In [1]:
import nltk

from nltk.corpus import names
import random

# Create some labeled observations
labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                 [(name, 'female') for name in names.words('female.txt')])

# shuffle so that we can have a training and test set
random.shuffle(labeled_names)

Take a look at labeled_names, to get a sense for what's in there. This is always a good idea.

In [2]:
labeled_names[:5]

[('Chandler', 'male'),
 ('Tonnie', 'male'),
 ('Minny', 'female'),
 ('Catlee', 'female'),
 ('Gertrudis', 'female')]

In [3]:
# For the purposes of this toy example, we just use the last letters as our only feature
def gender_features(word):
    return {'last_letter': word[-1]}

For this next line, read a bit about what's going on with this classifier [here](http://www.nltk.org/book/ch06.html). 

In [4]:
featuresets = [(gender_features(n), gender) for (n, gender) in labeled_names]
train_set, test_set = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [10]:
featuresets[:10]

[({'last_letter': 'r'}, 'male'),
 ({'last_letter': 'e'}, 'male'),
 ({'last_letter': 'y'}, 'female'),
 ({'last_letter': 'e'}, 'female'),
 ({'last_letter': 's'}, 'female'),
 ({'last_letter': 'a'}, 'female'),
 ({'last_letter': 'o'}, 'male'),
 ({'last_letter': 'a'}, 'female'),
 ({'last_letter': 't'}, 'male'),
 ({'last_letter': 'n'}, 'male')]

Take a look at `featuresets`. What kind of data structure is it? What are the elements within it?

In [5]:
# NLTK makes it easy to evaluate the accuracy of the rule.
print(nltk.classify.accuracy(classifier, test_set))

0.76


Let's see how the classifier does on our class. Fill in the gaps below. 

In [6]:
our_class = # make this a list of first names in our class

for student in our_class :
    print(student + " classified as " + classifier.classify(gender_features(student)))

# What's the overall accuracy? 


SyntaxError: invalid syntax (<ipython-input-6-107ad8c6174c>, line 1)

We might reasonably ask, how many males and females do we have in each group? Below we see two ways of displaying that information.

In [7]:
# This method takes more typing, but may 
# be easier to read.

num_males = 0

for item in featuresets :
    dd, gender = item
        
    if gender == "male" :
        num_males += 1
    
num_males

2943

In [8]:
# This approach is more pythonic, but also harder to understand.
# When you try to interpret it, remember to start with the innermost
# part (probably the `for` loop here). 

from collections import Counter

Counter([gender for dd, gender in featuresets])

Counter({'female': 5001, 'male': 2943})

In [9]:
# let's just look at all the features. Usually you'd only show a few
classifier.show_most_informative_features(26)

Most Informative Features
             last_letter = 'k'              male : female =     41.6 : 1.0
             last_letter = 'a'            female : male   =     33.5 : 1.0
             last_letter = 'f'              male : female =     13.2 : 1.0
             last_letter = 'v'              male : female =     11.2 : 1.0
             last_letter = 'p'              male : female =     10.5 : 1.0
             last_letter = 'd'              male : female =      9.5 : 1.0
             last_letter = 'o'              male : female =      8.8 : 1.0
             last_letter = 'm'              male : female =      8.5 : 1.0
             last_letter = 'r'              male : female =      6.7 : 1.0
             last_letter = 'w'              male : female =      6.2 : 1.0
             last_letter = 'g'              male : female =      5.2 : 1.0
             last_letter = 's'              male : female =      4.5 : 1.0
             last_letter = 'z'              male : female =      4.3 : 1.0

How should we interpret those columns above? 

--- 

The lecture mentions the idea of building a dev-test set, in addition to the test and train sets above. Let's do that now so that we can build up some more complicated feature extractors.

In [11]:
random.shuffle(labeled_names) # Use this to shuffle in place to build training and test set

In [12]:
test_size = 500
devtest_size = 1000

train_names = labeled_names[(test_size + devtest_size):]
devtest_names = labeled_names[test_size:(test_size + devtest_size)]
test_names = labeled_names[:test_size]

In [13]:
errors = []
for (name, tag) in devtest_names:
    guess = classifier.classify(gender_features(name))
    if guess != tag:
        errors.append( (tag, guess, name) )

Run the code below. Look at the kind of names that are being misclassified. As you do that, think about rules you migth design that would correct these mistakes.  

In [14]:
for (tag, guess, name) in sorted(errors):
    print('correct={:<8} guess={:<8s} name={:<30}'.format(tag, guess, name))

correct=female   guess=male     name=Adrien                        
correct=female   guess=male     name=Alison                        
correct=female   guess=male     name=Angil                         
correct=female   guess=male     name=Arlen                         
correct=female   guess=male     name=Aryn                          
correct=female   guess=male     name=Avrit                         
correct=female   guess=male     name=Bab                           
correct=female   guess=male     name=Bell                          
correct=female   guess=male     name=Brett                         
correct=female   guess=male     name=Brigit                        
correct=female   guess=male     name=Britt                         
correct=female   guess=male     name=Brooks                        
correct=female   guess=male     name=Candis                        
correct=female   guess=male     name=Carlin                        
correct=female   guess=male     name=Carmon     

Now you're going to start building your own feature extractor. 

In [26]:
# build your own function. Here's an example to
# help you get the syntax right. 
def gender_features_2(word):
    ''' This function should take in a word and return a dictionary
        with the name of the feature as the key and the value 
        as the feature value. '''
    ll = word[-1]
    penultimate = word[-2]
    last_3 = word[-3:]
    has_bob = "bob" in word
    
    letters_2_3 = word[1:3]
    ends_in_lyn = "lyn" == last_3
    
#    ret_dict = {'last_letter':ll,
#                'penultimate_y':(penultimate=="y"),
#                'last_3':last_3,
#                'has_bob' : has_bob}

    ret_dict = {'letters_2_3': letters_2_3,
                'lyn':ends_in_lyn,
                'll':ll}

    return (ret_dict)

In [21]:
# let's look at an output
gender_features_2("bobby")

{'letters_2_3': 'ob', 'lyn': False}

Now let's form our new training and dev-test sets. 

In [27]:
train_set = [(gender_features_2(n), gender) for (n, gender) in train_names]
devtest_set = [(gender_features_2(n), gender) for (n, gender) in devtest_names]

Let's train this new code on the training set and evaluate it on the _development_ test set. 

In [28]:
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, devtest_set))

0.761


We can look at the most informative features...

In [29]:
classifier.show_most_informative_features(10)

Most Informative Features
                      ll = 'a'            female : male   =     37.4 : 1.0
                      ll = 'k'              male : female =     24.9 : 1.0
                      ll = 'p'              male : female =     18.6 : 1.0
                      ll = 'f'              male : female =     12.5 : 1.0
                      ll = 'd'              male : female =     10.5 : 1.0
                      ll = 'm'              male : female =     10.4 : 1.0
                     lyn = True           female : male   =     10.3 : 1.0
                      ll = 'v'              male : female =      9.8 : 1.0
                      ll = 'o'              male : female =      8.1 : 1.0
             letters_2_3 = 'is'           female : male   =      7.4 : 1.0


And look at where we're getting errors.

In [25]:
errors = []
for (name, tag) in devtest_names:
    guess = classifier.classify(gender_features(name))
    if guess != tag:
        errors.append( (tag, guess, name) )

for (tag, guess, name) in sorted(errors):
    print('correct={:<8} guess={:<8s} name={:<30}'.format(tag, guess, name))

correct=male     guess=female   name=Abel                          
correct=male     guess=female   name=Aguste                        
correct=male     guess=female   name=Ajai                          
correct=male     guess=female   name=Aleck                         
correct=male     guess=female   name=Alex                          
correct=male     guess=female   name=Alf                           
correct=male     guess=female   name=Alfonse                       
correct=male     guess=female   name=Alfredo                       
correct=male     guess=female   name=Algernon                      
correct=male     guess=female   name=Ali                           
correct=male     guess=female   name=Alister                       
correct=male     guess=female   name=Allah                         
correct=male     guess=female   name=Alphonso                      
correct=male     guess=female   name=Amery                         
correct=male     guess=female   name=Amos       

Now you'll refine `gender_features_2`. Go through the errors above, try new rules. Can you come up with any that drammatically increase the accuracy of your classifer? You should be able to get this above 82% accuracy with some experimentation. What's the highest value you can get? 

--- 

Once you're done tweaking your code or we're out of time, get your final accuracy measure against the test set. In order to have an unbiased estimate of your error, you need to do this once at the end of your development cycle. 

In [None]:
# Once you're done tweaking your code, run this one. 
print(nltk.classify.accuracy(classifier, test_set))