# Project 1 - Gender Prediction by Name

A small application that takes a person's name as input and predicts whether the name is female or male.

In order to accomplish that, we will look at the last letter of each name.

First, we create a very simple function which returns the last letter of the name which is provided as input:

In [25]:

def gender_features_part1(word):
    word = str(word).lower()
    return {'last_letter': word[-1:]}
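
As a quick sanity check, the sketch below repeats the function so that it runs on its own and applies it to a few hand-picked names. The sample names are assumptions for illustration, not entries taken from the corpus. Because the function slices with `word[-1:]` instead of indexing with `word[-1]`, an empty string yields an empty feature value rather than raising an `IndexError`.

```python
def gender_features_part1(word):
    # Same definition as above, repeated so this snippet is self-contained.
    word = str(word).lower()
    return {'last_letter': word[-1:]}

# Hypothetical sample names, chosen only for illustration.
samples = ['Anna', 'Mark', 'ZOE', '']
features = [gender_features_part1(name) for name in samples]

for name, feats in zip(samples, features):
    print(repr(name), '->', feats)
```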

Next, we load a sample of names from the `names` corpus that ships with NLTK (if the corpus is not installed yet, `nltk.download('names')` fetches it):

In [26]:
from nltk.corpus import names as names_sample
import nltk, random

Now we build a single list that combines the names from `male.txt` and `female.txt`, labelling each name with its gender:

In [27]:
names = ([(name, 'male') for name in names_sample.words('male.txt')] +
         [(name, 'female') for name in names_sample.words('female.txt')])

Let's shuffle them a bit ;)

In [28]:
random.shuffle(names)
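
Because the shuffle is random, the split and the accuracy reported further down differ from run to run. One possible way to make a run repeatable is to shuffle with a seeded generator. The sketch below is only an illustration: the toy list, the helper name `shuffled_copy`, and the seed value 42 are all assumptions, not part of the original notebook.

```python
import random

# Placeholder data standing in for the (name, gender) pairs built above.
toy_names = [('Anna', 'female'), ('Mark', 'male'), ('Zoe', 'female'),
             ('John', 'male'), ('Emma', 'female')]

def shuffled_copy(items, seed):
    """Return a shuffled copy of items; the same seed gives the same order."""
    rng = random.Random(seed)   # private generator, leaves the global one untouched
    result = list(items)
    rng.shuffle(result)
    return result

first = shuffled_copy(toy_names, seed=42)
second = shuffled_copy(toy_names, seed=42)
print(first == second)
```

In the notebook itself, calling `random.seed(42)` right before `random.shuffle(names)` should have a similar effect.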

Now, we can create a feature set, containing the last letter and the gender:

In [29]:
feature_sets = [(gender_features_part1(name.lower()), gender) for name, gender in names]

In [30]:
feature_sets[:10]

[({'last_letter': 'a'}, 'female'),
 ({'last_letter': 'l'}, 'female'),
 ({'last_letter': 'l'}, 'male'),
 ({'last_letter': 'a'}, 'female'),
 ({'last_letter': 'e'}, 'female'),
 ({'last_letter': 'a'}, 'female'),
 ({'last_letter': 'e'}, 'female'),
 ({'last_letter': 'h'}, 'male'),
 ({'last_letter': 'e'}, 'female'),
 ({'last_letter': 'a'}, 'female')]

In [31]:
len(feature_sets)

7944

We split our feature set into **train** and **test** sets:

In [32]:
train_set = feature_sets[3000:]
test_set = feature_sets[:3000]
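
The two slices keep the first 3000 examples for testing and everything after them for training. With the 7944 feature sets counted above, that leaves 4944 training examples. The sketch below checks the arithmetic on a placeholder list of integers; the names `placeholder_sets`, `demo_train`, and `demo_test` are made up for this illustration.

```python
# Stand-in for feature_sets: 7944 placeholder items (the size reported above).
placeholder_sets = list(range(7944))
TEST_SIZE = 3000  # hold-out size used in this notebook

demo_test = placeholder_sets[:TEST_SIZE]
demo_train = placeholder_sets[TEST_SIZE:]

# The two slices are disjoint and together cover every example.
print(len(demo_train), len(demo_test))
```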

Create and train our classifier:

In [33]:
classifier = nltk.NaiveBayesClassifier.train(train_set)
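
With a single feature, Naive Bayes boils down to scoring each label by its prior probability times the likelihood of the observed last letter, then normalising. The pure-Python sketch below illustrates that idea on a tiny hypothetical training set. It is not NLTK's implementation: the toy data, the helper `posterior`, and the add-0.5 smoothing are assumptions made for this example, and NLTK's own estimator may differ in detail.

```python
from collections import Counter, defaultdict

# Toy training data (hypothetical), in the same shape as train_set.
toy_train = [({'last_letter': 'a'}, 'female'), ({'last_letter': 'a'}, 'female'),
             ({'last_letter': 'e'}, 'female'), ({'last_letter': 'k'}, 'male'),
             ({'last_letter': 'k'}, 'male'), ({'last_letter': 'a'}, 'male')]

label_counts = Counter(label for _, label in toy_train)
letter_counts = defaultdict(Counter)          # label -> Counter of last letters
for feats, label in toy_train:
    letter_counts[label][feats['last_letter']] += 1
letters = {feats['last_letter'] for feats, _ in toy_train}

def posterior(letter, alpha=0.5):
    """P(label | letter) via Bayes' rule with add-alpha smoothing (an assumption)."""
    scores = {}
    for label, n_label in label_counts.items():
        prior = n_label / len(toy_train)
        likelihood = ((letter_counts[label][letter] + alpha)
                      / (n_label + alpha * len(letters)))
        scores[label] = prior * likelihood
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}

probs = posterior('a')
print(max(probs, key=probs.get), probs)
```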

Test our classifier using a random name:

In [34]:
print(classifier.classify(gender_features_part1('Anna')))

female


Testing the accuracy of our classifier, using our **test set**:

In [35]:
# This value changes between runs because the names are shuffled randomly before the split.
print(nltk.classify.accuracy(classifier, test_set) * 100)

75.7
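
Accuracy here is simply the fraction of test examples whose predicted label matches the true label. The sketch below spells that out on a toy test set and compares it with a majority-class baseline, which is a useful reference point when judging a score. The toy data and the vowel-based `toy_predict` rule are hypothetical stand-ins, not the trained classifier.

```python
from collections import Counter

# Toy test data (hypothetical), in the same shape as test_set.
toy_test = [({'last_letter': 'a'}, 'female'), ({'last_letter': 'e'}, 'female'),
            ({'last_letter': 'k'}, 'male'), ({'last_letter': 'n'}, 'female'),
            ({'last_letter': 'd'}, 'male')]

VOWELS = {'a', 'e', 'i', 'o', 'u'}

def toy_predict(feats):
    # Hypothetical stand-in for classifier.classify: vowel -> female, else male.
    return 'female' if feats['last_letter'] in VOWELS else 'male'

def accuracy(predict, labeled):
    """Fraction of examples whose predicted label equals the gold label."""
    correct = sum(1 for feats, gold in labeled if predict(feats) == gold)
    return correct / len(labeled)

# Baseline: always predict the most frequent label in the data.
majority_label = Counter(gold for _, gold in toy_test).most_common(1)[0][0]
baseline = accuracy(lambda feats: majority_label, toy_test)

print(accuracy(toy_predict, toy_test) * 100, baseline * 100)
```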


To see what the classifier has learned from the training set, we call `show_most_informative_features`. Its argument `n` is the number of features to display (default: 10):

In [37]:
# show_most_informative_features prints the table itself and returns None,
# so wrapping it in print() would only add a stray "None" line.
classifier.show_most_informative_features(n=10)

Most Informative Features
             last_letter = 'k'              male : female =     45.7 : 1.0
             last_letter = 'a'            female : male   =     39.6 : 1.0
             last_letter = 'f'              male : female =     17.1 : 1.0
             last_letter = 'w'              male : female =     13.8 : 1.0
             last_letter = 'p'              male : female =     11.6 : 1.0
             last_letter = 'v'              male : female =     11.6 : 1.0
             last_letter = 'd'              male : female =     11.2 : 1.0
             last_letter = 'o'              male : female =      9.6 : 1.0
             last_letter = 'm'              male : female =      7.6 : 1.0
             last_letter = 'r'              male : female =      7.4 : 1.0
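
Each row of this table can be read as a likelihood ratio: how many times more likely a given last letter is under one label than under the other. The sketch below computes such a ratio from raw counts. The counts are invented for illustration and no smoothing is applied, so the numbers will not match what NLTK reports for the real corpus; the helper `likelihood_ratio` is likewise hypothetical.

```python
def likelihood_ratio(count_a, total_a, count_b, total_b):
    """Ratio P(feature | label A) / P(feature | label B) from raw counts."""
    p_a = count_a / total_a
    p_b = count_b / total_b
    return p_a / p_b

# Hypothetical counts: names ending in 'k' among 2000 male and 3000 female names.
ratio = likelihood_ratio(count_a=60, total_a=2000, count_b=2, total_b=3000)
print(f"male : female = {ratio:.1f} : 1.0")
```

Without smoothing, a letter that never occurs for one label would cause a division by zero, which is one reason a smoothed estimate is preferable in practice.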
