Natural language processing in Python is often done with two libraries, NLTK and scikit-learn.  We are going to use NLTK today.

First, we need data to process.  Because we're going to teach our computer to guess whether someone is male or female based on their first name, we download lists of the 1000 most popular baby names from a really cute web site: https://www.babble.com/pregnancy/1000-most-popular-girl-names/

In [172]:
def load_names():
    """Read the boys' and girls' names from the local CSV copies."""
    with open('../sample_data/boys.csv', 'r') as inf:
        boys_names = inf.read().split(',\n')
    with open('../sample_data/girls.csv', 'r') as inf:
        girls_names = inf.read().split(',\n')
    return boys_names, girls_names

try:
    import requests
    from bs4 import BeautifulSoup

    def pull_names():
        """
        Reads the 1000 most common American baby names from the internet.

        Returns names in two lists: boys_names, girls_names
        """
        site = 'https://www.babble.com/pregnancy/'
        boys_url = site + '1000-most-popular-boy-names/'
        girls_url = site + '1000-most-popular-girl-names/'

        boys_page = requests.get(boys_url)
        girls_page = requests.get(girls_url)

        def get_name(x):
            if x.attrs.get('class', None) == [u'p1'] \
                and x.getText() \
                and x.getText().isalpha():
                return x.getText()
            return None

        boys_soup = BeautifulSoup(boys_page.text, "html.parser")
        boys_names = [get_name(x) for x in boys_soup.find_all('li')]
        boys_names = filter(None, boys_names)

        girls_soup = BeautifulSoup(girls_page.text, "html.parser")
        girls_names = [get_name(x) for x in girls_soup.find_all('li')]
        girls_names = filter(None, girls_names)

        return boys_names, girls_names    
except ImportError:
    # requests or bs4 is not installed: fall back to the local CSV copies
    def pull_names():        
        return load_names()        
        
boys, girls = pull_names()
print len(boys), len(girls)

999 999


Wait, what?!

We specified a web site, and asked the requests library to download its text contents.

In [157]:
site = 'https://www.babble.com/pregnancy/'
boys_url = site + '1000-most-popular-boy-names/'

boys_page = requests.get(boys_url)
print boys_page.text[:241] + "\n"*5 + boys_page.text[-300:]

<!doctype html>


<!--[if lt IE 7]> <html class="no-js lt-ie9 lt-ie8 lt-ie7" lang="en-US"> <![endif]-->
<!--[if IE 7]>    <html class="no-js lt-ie9 lt-ie8" lang="en-US"> <![endif]-->
<!--[if IE 8]>    <html class="no-js lt-ie9" lang="en-US">




</body>
</html> <!-- end page -->
<!-- Performance optimized by W3 Total Cache. Learn more: https://www.w3-edge.com/products/

Object Caching 4277/4642 objects using memcached
Content Delivery Network via a.dilcdn.com/bl

 Served from: www.babble.com @ 2017-04-25 18:00:00 by W3 Total Cache -->


After looking at the page source (with Inspect in Chrome), I saw that the names are all list items (marked with <li> tags).

I looked at their structure and wrote a small helper function to filter out bad list items.

In [158]:
def get_name(x):
    """
    Return the name in a list item, or None if the item is not a name.

    A list item counts as a name when:
      - its class is p1, i.e. text with a certain look,
      - it has text at all, and
      - that text is alphabetical.
    """
    if x.attrs.get('class', None) == [u'p1'] \
        and x.getText() \
        and x.getText().isalpha():
        return x.getText()
    return None

Then we let an HTML parser (Beautiful Soup) read the HTML contents and parse them into a document object model, essentially a searchable tree of tags.

In [159]:
boys_soup = BeautifulSoup(boys_page.text, "html.parser")
print boys_soup.find_all('a')[:2]

[<a class="tm-logo" href="https://www.babble.com" name="&amp;lpos=nav-category&amp;lid=nav-category/image/logo">Babble</a>, <a class="tm-social-link tm-social-facebook" href="https://www.facebook.com/Babble" name="&amp;lpos=social/sharebar/top-nav&amp;lid=social/sharebar/top-nav/social_Facebook" target="_blank">Facebook</a>]

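If you want to try the filtering rule without hitting the web site, here is a minimal sketch that applies the same three checks (class p1, non-empty text, alphabetic text) to a hand-written HTML fragment. It is written for Python 3 and uses the standard library's html.parser instead of Beautiful Soup; the NameListParser class and the sample_html string are made up for illustration and are not part of the notebook.

```python
from html.parser import HTMLParser


class NameListParser(HTMLParser):
    """Collect the text of every <li class="p1"> whose text is alphabetic."""

    def __init__(self):
        super().__init__()
        self.in_name_item = False
        self.names = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (attribute, value) pairs
        self.in_name_item = (tag == "li" and dict(attrs).get("class") == "p1")

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_name_item = False

    def handle_data(self, data):
        text = data.strip()
        # Keep only non-empty, purely alphabetic text inside a p1 list item
        if self.in_name_item and text.isalpha():
            self.names.append(text)


# Hypothetical fragment standing in for the downloaded page
sample_html = """
<ul>
  <li class="p1">Noah</li>
  <li class="p1">Liam</li>
  <li class="menu">Sign in</li>
  <li class="p1">1.</li>
  <li class="p1"></li>
</ul>
"""

parser = NameListParser()
parser.feed(sample_html)
print(parser.names)
```

Unlike get_name, this sketch does not handle tags nested inside a list item, so treat it as an illustration of the rule rather than a replacement for Beautiful Soup.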

I then used a list comprehension over all the list items, turning any bad items into None.

Then I filtered out all the None items.

In [171]:
boys_names = [get_name(x) for x in boys_soup.find_all('li')]
print "Num None elements: %d" % sum([b is None for b in boys_names])

boys_names = filter(None, boys_names)
print boys_names[:5]

Num None elements: 35
[u'Noah', u'Liam', u'Mason', u'Jacob', u'William']
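
One caveat if you run this under Python 3 (the cells here are Python 2): filter returns a lazy iterator there rather than a list, so len() and slicing no longer work on its result directly. A minimal sketch of the workaround, using a made-up candidates list in place of the scraped items:

```python
# Made-up stand-in for [get_name(x) for x in soup.find_all('li')]
candidates = ["Noah", None, "Liam", None, "Mason"]

lazy = filter(None, candidates)          # Python 3: a filter object, not a list
names = list(filter(None, candidates))   # wrap in list() to get len() and slicing back

# Same result here, since get_name never returns an empty string
names_alt = [c for c in candidates if c is not None]

print(len(names), names[:2])
```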


Now for the natural language processing.  To process natural language, you have to turn the text into a set of features.  Our feature is going to be simple: the first letter of the name.

In [162]:
def first_letter(name):
    return {'first_letter': name[0]}

features = [first_letter(name) for name in boys]
print zip(boys[:3], features[:3])

[(u'Noah', {'first_letter': u'N'}), (u'Liam', {'first_letter': u'L'}), (u'Mason', {'first_letter': u'M'})]


We want to tell a classifier: these are a bunch of features that belong to boys' names, and these are a bunch that belong to girls' names (in other words, this is a labeled training set).

The third argument of label_names is a function that turns a name into a feature dictionary.  This flexibility lets us come up with better features later.

In [163]:
from numpy.random import shuffle, seed

def label_names(boys_names, girls_names, func):
    """
    Apply feature function, func to each name, 
    and return test and training feature sets of the form
    feature_vector, label.
    
    Returns: test_set, training_set
    """
    labeled_names = [(name, 'male') for name in boys_names] + \
                    [(name, 'female') for name in girls_names]

    featuresets = [(func(x), g) for (x, g) in labeled_names]
    shuffle(featuresets)   
    train_set = featuresets[:-len(featuresets)/3]
    test_set = featuresets[-len(featuresets)/3:]
    return test_set, train_set
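
Note that the slicing in label_names relies on Python 2 integer division; under Python 3, -len(featuresets)/3 is a float and slicing with it raises a TypeError. Here is a hedged sketch of the same shuffle-and-split idea for Python 3, using the standard library's random module instead of numpy. The split_train_test name and its seed parameter are my own, not part of the notebook, and the shuffle order will not match the numpy one.

```python
import random


def split_train_test(items, seed=20):
    """Shuffle a copy of items and hold out the last third as the test set."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)   # seeded so reruns give the same split
    n_test = len(shuffled) // 3             # integer division, as in Python 2
    split = len(shuffled) - n_test
    return shuffled[split:], shuffled[:split]   # test_set, train_set


# Toy stand-in for the (features, label) pairs
toy_items = list(range(9))
test_set, train_set = split_train_test(toy_items)
print(len(test_set), len(train_set))
```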

Now we can do text classification with very few lines of code, and see how well the first letter of a name predicts a person's gender.

In [170]:
import nltk

boys, girls = pull_names()
seed(20) # so our class has matching results
test_set, train_set = label_names(boys, girls, first_letter)

classifier = nltk.NaiveBayesClassifier.train(train_set)
classifier.show_most_informative_features(10)

print "Accuracy: {}".format(nltk.classify.accuracy(classifier, test_set))

Most Informative Features
            first_letter = u'H'           female : male   =      2.5 : 1.0
            first_letter = u'D'             male : female =      2.4 : 1.0
            first_letter = u'T'             male : female =      2.4 : 1.0
            first_letter = u'I'             male : female =      2.4 : 1.0
            first_letter = u'W'             male : female =      2.0 : 1.0
            first_letter = u'A'           female : male   =      1.9 : 1.0
            first_letter = u'F'             male : female =      1.8 : 1.0
            first_letter = u'V'           female : male   =      1.7 : 1.0
            first_letter = u'B'             male : female =      1.7 : 1.0
            first_letter = u'S'           female : male   =      1.6 : 1.0
Accuracy: 0.599099099099
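
Each row of that table compares how often a feature value shows up among names with one label versus the other. To get a feel for where such ratios come from, here is a rough sketch that computes them from plain relative frequencies on a tiny made-up list of names. As far as I know, NLTK smooths its probability estimates, so its numbers will not match a raw count ratio like this one; the toy_boys and toy_girls lists and the letter_frequencies helper are hypothetical.

```python
from collections import Counter

# Hypothetical toy data, not the scraped names
toy_boys = ["David", "Daniel", "Dylan", "Henry", "Thomas", "Tyler"]
toy_girls = ["Hannah", "Harper", "Hazel", "Diana", "Taylor", "Emma"]


def letter_frequencies(names):
    """Return the fraction of names that start with each first letter."""
    counts = Counter(name[0] for name in names)
    total = len(names)
    return {letter: count / total for letter, count in counts.items()}


boy_freq = letter_frequencies(toy_boys)
girl_freq = letter_frequencies(toy_girls)

# Only letters seen for both labels have a well-defined unsmoothed ratio
for letter in sorted(set(boy_freq) & set(girl_freq)):
    ratio = boy_freq[letter] / girl_freq[letter]
    print("first_letter = {!r}  male : female = {:.1f} : 1.0".format(letter, ratio))
```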


It's really easy to switch to another type of classifier.

In [165]:
classifier = nltk.DecisionTreeClassifier.train(train_set)
print classifier.pseudocode(depth=2)
print "Accuracy: {}".format(nltk.classify.accuracy(classifier, test_set))

if first_letter == u'A': return 'female'
if first_letter == u'B': return 'male'
if first_letter == u'C': return 'male'
if first_letter == u'D': return 'male'
if first_letter == u'E': return 'female'
if first_letter == u'F': return 'male'
if first_letter == u'G': return 'male'
if first_letter == u'H': return 'female'
if first_letter == u'I': return 'male'
if first_letter == u'J': return 'male'
if first_letter == u'K': return 'female'
if first_letter == u'L': return 'female'
if first_letter == u'M': return 'female'
if first_letter == u'N': return 'female'
if first_letter == u'O': return 'male'
if first_letter == u'P': return 'female'
if first_letter == u'Q': return 'male'
if first_letter == u'R': return 'male'
if first_letter == u'S': return 'female'
if first_letter == u'T': return 'male'
if first_letter == u'U': return 'male'
if first_letter == u'V': return 'female'
if first_letter == u'W': return 'male'
if first_letter == u'X': return 'male'
if first_letter == u'Y': return 'male'
if fi
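
With only one feature to split on, the tree above boils down to something like a majority vote per first letter. The following sketch shows that idea with the standard library only; it illustrates the rule and is not NLTK's implementation, and the toy list and the majority_label_by_first_letter helper are made up for this example.

```python
from collections import Counter, defaultdict


def majority_label_by_first_letter(labeled_names):
    """Map each first letter to the label most names with that letter carry."""
    votes = defaultdict(Counter)
    for name, label in labeled_names:
        votes[name[0]][label] += 1
    return {letter: counts.most_common(1)[0][0]
            for letter, counts in votes.items()}


# Hypothetical labeled names
toy = [("Adam", "male"), ("Anna", "female"), ("Amy", "female"),
       ("Ben", "male"), ("Bob", "male"), ("Beth", "female")]

rules = majority_label_by_first_letter(toy)
for letter in sorted(rules):
    print("if first_letter == {!r}: return {!r}".format(letter, rules[letter]))
```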

Let's see how it does for the people in our class.

In [166]:
def test_our_class(classifier, func):
    py_names = ['James', 'Shailja', 'Chris', 'Dave', 'Sheng', 
                'Claire', 'Akshay', 'Catherine', 'Rhonda', 'Emily']
    py_genders = ['male', 'female', 'male', 'male', 'male',
                  'female', 'male', 'female', 'female', 'female']

    for name, gender in zip(py_names, py_genders):
        gender_guess = classifier.classify(func(name))
        if gender_guess != gender:
            print "Incorrectly classified: {:<8}".format(name)
    print "\n"
            
test_our_class(classifier, first_letter)

Incorrectly classified: Sheng   
Incorrectly classified: Claire  
Incorrectly classified: Akshay  
Incorrectly classified: Catherine
Incorrectly classified: Rhonda  




Let's define a new feature function.  I think women's names have more vowels than men's, and that the last letter might also help us discriminate.

In [167]:
def ends_and_vowels(name):
    n_vowels = sum(map(name.lower().count, "aeiou"))
    vowel_decile = int(10*n_vowels / float(len(name)))
    return {'first_letter': name[0],
            'last_letter':name[-1],
            'vowel_decile':vowel_decile}
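
To see what these features look like, here is the same function applied to three names from our class. The function body is copied from the cell above; only the print calls are written for Python 3. Note that "y" is not counted as a vowel.

```python
def ends_and_vowels(name):
    n_vowels = sum(map(name.lower().count, "aeiou"))
    # Fraction of letters that are vowels, bucketed into tenths (0 to 10)
    vowel_decile = int(10*n_vowels / float(len(name)))
    return {'first_letter': name[0],
            'last_letter': name[-1],
            'vowel_decile': vowel_decile}


for name in ['Emily', 'Akshay', 'Claire']:
    print(name, ends_and_vowels(name))
```

Emily has 2 vowels in 5 letters, which lands in decile 4; Akshay has 2 in 6 (decile 3) and Claire has 3 in 6 (decile 5).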

I'll write two quick functions to show the accuracy of our two classifiers.

In [168]:
def naive_bayes(test_set, train_set, func):     
    classifier = nltk.NaiveBayesClassifier.train(train_set) 
    accuracy = nltk.classify.accuracy(classifier, test_set)
    print "Naive Bayes accuracy: {:0.2}".format(accuracy)
    return classifier, accuracy

def decision_tree(test_set, train_set, func):    
    classifier = nltk.DecisionTreeClassifier.train(train_set)
    accuracy = nltk.classify.accuracy(classifier, test_set)
    print "Decision Tree accuracy: {:0.2}".format(accuracy)
    return classifier, accuracy

Now let's see how much better the classifiers do with these features.

In [177]:
seed(2) # So the class has matching results
func = ends_and_vowels
test_set, train_set = label_names(boys, girls, func)

classifier, accuracy = naive_bayes(test_set, train_set, func)
test_our_class(classifier, func)

classifier, accuracy = decision_tree(test_set, train_set, func)
test_our_class(classifier, func)

Naive Bayes accuracy: 0.74
Incorrectly classified: Akshay  


Decision Tree accuracy: 0.74
Incorrectly classified: Akshay  


