## Naive Bayes Classifiers: Introduction
A powerful and intuitive technique. File this one away: it will often teach you a lot about a problem, even if it doesn't "win" the accuracy game. First, some examples from NLTK.

In [None]:
import nltk

from nltk.corpus import names
import random

# Create some labeled observations
labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                 [(name, 'female') for name in names.words('female.txt')])

# shuffle so that we can have a training and test set
random.shuffle(labeled_names)

Take a look at `labeled_names` to get a sense of what's in there. This is always a good idea.

In [None]:
labeled_names[:5]

In [None]:
# For the purposes of this toy example, we just use the last letters as our only feature
def gender_features(word):
    return {'last_letter': word[-1]}

For this next line, read a bit about what's going on with this classifier [here](http://www.nltk.org/book/ch06.html). 

In [None]:
featuresets = [(gender_features(n), gender) for (n, gender) in labeled_names]
train_set, test_set = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)
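
To make the idea behind the classifier concrete, here is a minimal sketch of naive Bayes with a single last-letter feature, written in plain Python. The toy names, the `posterior` helper, and the add-0.5 smoothing over a 26-letter alphabet are all assumptions made for illustration; NLTK's own estimator differs in its details.

```python
from collections import Counter, defaultdict

# Toy data, invented for illustration (not drawn from the NLTK names corpus).
toy_names = [("Anna", "female"), ("Maria", "female"), ("Ellen", "female"),
             ("John", "male"), ("Ethan", "male"), ("Luca", "male")]


def last_letter(name):
    return name[-1].lower()


# Count how often each label occurs, and how often each last letter
# occurs within each label.
label_counts = Counter(label for _, label in toy_names)
letter_counts = defaultdict(Counter)
for name, label in toy_names:
    letter_counts[label][last_letter(name)] += 1


def posterior(name, smoothing=0.5, alphabet_size=26):
    """Return P(label | last letter) using smoothed count estimates."""
    letter = last_letter(name)
    scores = {}
    for label, n_label in label_counts.items():
        prior = n_label / len(toy_names)
        likelihood = ((letter_counts[label][letter] + smoothing)
                      / (n_label + smoothing * alphabet_size))
        scores[label] = prior * likelihood
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}


print(posterior("Sophia"))  # female: 0.625, male: 0.375
```

With more than one feature, the same recipe multiplies one likelihood term per feature, which is the "naive" independence assumption.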

Take a look at `featuresets`. What kind of data structure is it? What are the elements within it?

In [None]:
# NLTK makes it easy to evaluate the accuracy of the classifier.
print(nltk.classify.accuracy(classifier, test_set))

Let's see how the classifier does on our class. Fill in the gaps below. 

In [None]:
our_class = []  # fill this in with a list of first names in our class

for student in our_class :
    print(student + " classified as " + classifier.classify(gender_features(student)))

# What's the overall accuracy? 


We might reasonably ask, how many males and females do we have in each group? Below we see two ways of displaying that information.

In [None]:
# This method takes more typing, but may 
# be easier to read.

num_males = 0

for item in featuresets :
    dd, gender = item
        
    if gender == "male" :
        num_males += 1
    
num_males

In [None]:
# This approach is more pythonic, but also harder to understand.
# When you try to interpret it, remember to start with the innermost
# part (probably the `for` loop here). 

from collections import Counter

Counter([gender for dd, gender in featuresets])

In [None]:
# let's just look at all the features. Usually you'd only show a few
classifier.show_most_informative_features(26)

How should we interpret those columns above? 
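
One way to read each row: it names a feature value, the two labels being compared, and a ratio of likelihoods, that is, how much more often that feature value occurs with one label than with the other. The sketch below computes such a ratio by hand. The counts are hypothetical, and the estimate is unsmoothed, so the numbers will not match NLTK's output exactly.

```python
# Hypothetical counts, invented for illustration: how many training names
# of each label end in "a", out of all training names with that label.
ends_in_a = {"female": 1800, "male": 30}
label_totals = {"female": 5000, "male": 3000}


def likelihood(label):
    """Unsmoothed estimate of P(last_letter = 'a' | label)."""
    return ends_in_a[label] / label_totals[label]


p_female = likelihood("female")  # 0.36
p_male = likelihood("male")      # 0.01
ratio = p_female / p_male

# Mimics the general layout of one row of the table.
print("last_letter = 'a'   female : male = {:.1f} : 1.0".format(ratio))
```

A ratio of 36 would mean that a name ending in "a" is 36 times more likely under the female label than under the male label, before the class priors are taken into account.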

--- 

The lecture mentions the idea of building a dev-test set, in addition to the test and train sets above. Let's do that now so that we can build up some more complicated feature extractors.

In [None]:
random.shuffle(labeled_names) # Use this to shuffle in place to build training and test set

In [None]:
test_size = 500
devtest_size = 1000

train_names = labeled_names[(test_size + devtest_size):]
devtest_names = labeled_names[test_size:(test_size + devtest_size)]
test_names = labeled_names[:test_size]

In [None]:
# `classifier` is still the last-letter model trained above, before the reshuffle
errors = []
for (name, tag) in devtest_names:
    guess = classifier.classify(gender_features(name))
    if guess != tag:
        errors.append( (tag, guess, name) )

Run the code below. Look at the kinds of names that are being misclassified. As you do that, think about rules you might design that would correct these mistakes.

In [None]:
for (tag, guess, name) in sorted(errors):
    print('correct={:<8} guess={:<8s} name={:<30}'.format(tag, guess, name))

Now you're going to start building your own feature extractor. 

In [None]:
# build your own function. Here's an example to
# help you get the syntax right. 
def gender_features_2(word):
    ''' This function should take in a word and return a dictionary
        with the name of the feature as the key and the value 
        as the feature value. '''
    ll = word[-1]
    penultimate = word[-2]
    last_3 = word[-3:]
    
    has_bob = "bob" in word.lower()  # lowercase first: names in the corpus are capitalized
        
    ret_dict = {'last_letter':ll,
                'penultimate_y':(penultimate=="y"),
                'last_3':last_3,
                'has_bob' : has_bob}
    
    return ret_dict

In [None]:
# let's look at an output
gender_features_2("bobby")

Now let's form our new training and dev-test sets. 

In [None]:
train_set = [(gender_features_2(n), gender) for (n, gender) in train_names]
devtest_set = [(gender_features_2(n), gender) for (n, gender) in devtest_names]

Let's train a new classifier on the training set and evaluate it on the _development_ test set.

In [None]:
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, devtest_set))

We can look at the most informative features...

In [None]:
classifier.show_most_informative_features(10)

And look at where we're getting errors.

In [None]:
errors = []
for (name, tag) in devtest_names:
    guess = classifier.classify(gender_features_2(name))  # must match the features used in training
    if guess != tag:
        errors.append( (tag, guess, name) )

for (tag, guess, name) in sorted(errors):
    print('correct={:<8} guess={:<8s} name={:<30}'.format(tag, guess, name))

Now you'll refine `gender_features_2`. Go through the errors above and try new rules. Can you come up with any that dramatically increase the accuracy of your classifier? You should be able to get this above 82% accuracy with some experimentation. What's the highest value you can get?
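
When the error list is long, tallying the misclassified names by suffix can make patterns easier to spot. The sketch below is one possible helper, not part of NLTK: `tally_suffixes` and the sample tuples are hypothetical, and the tuples only mimic the `(tag, guess, name)` shape of `errors`.

```python
from collections import Counter

# Hypothetical (correct_label, guess, name) tuples in the same shape as `errors`.
sample_errors = [("male", "female", "Joshua"), ("male", "female", "Luca"),
                 ("female", "male", "Gwyn"), ("female", "male", "Robyn"),
                 ("female", "male", "Carolyn")]


def tally_suffixes(errors, n=2):
    """Count misclassified names by (correct label, last n letters)."""
    return Counter((tag, name[-n:].lower()) for tag, guess, name in errors)


# Show the most common (label, suffix) pairs among the errors.
for (tag, suffix), count in tally_suffixes(sample_errors).most_common(3):
    print("correct={:<8} suffix={:<4} errors={}".format(tag, suffix, count))
```

In this made-up sample, three female names ending in "yn" are misclassified, which would suggest trying a two-letter suffix feature.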

--- 

Once you're done tweaking your code or we're out of time, get your final accuracy measure against the test set. To get an unbiased estimate of your error, do this only once, at the end of your development cycle.

In [None]:
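# Once you're done tweaking your code, run this one. It rebuilds the test set from test_names with the new features.
test_set = [(gender_features_2(n), gender) for (n, gender) in test_names]
print(nltk.classify.accuracy(classifier, test_set))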