Assignment Detail:

For this project, please work with the entire class as one collaborative group! Your project should be submitted (as a Jupyter Notebook via GitHub) by end of the due date. The group should present their code and findings in our meetup. The ability to be an effective member of a virtual team is highly valued in the data science job market.

Using any of the three classifiers described in chapter 6 of Natural Language Processing with Python, and any features you can think of, build the best name gender classifier you can.
Begin by splitting the Names Corpus into three subsets: 500 words for the test set, 500 words for the devtest set, and the remaining 6900 words for the training set. Then, starting with the example name gender classifier, make incremental improvements. Use the dev-test set to check your progress. Once you are satisfied with your classifier, check its final performance on the test set. 

How does the performance on the test set compare to the performance on the dev-test set? Is this what you'd expect?

Source: Natural Language Processing with Python, exercise 6.10.2.


Building data set: Using NLTK, we have built a labeled data set of names called "Gender_names".
    

In [1]:
import nltk
nltk.download('names')

[nltk_data] Downloading package names to
[nltk_data]     C:\Users\sql_ent_svc\AppData\Roaming\nltk_data...
[nltk_data]   Package names is already up-to-date!


True

In [2]:
from nltk.corpus import names
import random
from nltk.classify import apply_features

#Building the Gender_names data set
Gender_names = ([(name, 'male') for name in names.words('male.txt')] +
                [(name, 'female') for name in names.words('female.txt')])
random.shuffle(Gender_names)

In [3]:
Gender_names[0:10] #show the names with gender. 

[('Harmon', 'male'),
 ('Rozina', 'female'),
 ('Miles', 'male'),
 ('Berk', 'male'),
 ('Mitra', 'female'),
 ('Berte', 'female'),
 ('Hiro', 'male'),
 ('Giovanni', 'male'),
 ('Hildagarde', 'female'),
 ('Clem', 'male')]

Gender Identification:
Male and female names have distinct characteristics: names ending in a, e, and i are likely to be female, while names ending in k, o, r, s, and t are likely to be male. We have built a classifier to model these differences more precisely, starting with a single feature, the last letter of a given name. (Source: NLP book, pages 222-223)
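The last-letter pattern can be checked by hand on a small sample. The sketch below tallies final letters per gender for the ten names printed in the sample above; the helper name last_letter_tally is hypothetical, and ten names are far too few to draw conclusions from (note that Giovanni is a male name ending in i).

```python
from collections import Counter, defaultdict

# The ten labelled names shown in the sample output above.
sample = [
    ("Harmon", "male"), ("Rozina", "female"), ("Miles", "male"),
    ("Berk", "male"), ("Mitra", "female"), ("Berte", "female"),
    ("Hiro", "male"), ("Giovanni", "male"), ("Hildagarde", "female"),
    ("Clem", "male"),
]

def last_letter_tally(labelled_names):
    """Count final letters separately for each gender label (hypothetical helper)."""
    tally = defaultdict(Counter)
    for name, gender in labelled_names:
        tally[gender][name[-1].lower()] += 1
    return tally

tally = last_letter_tally(sample)
for gender in sorted(tally):
    print(gender, dict(tally[gender]))
```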

In [4]:
def gender_features(word):
    return {'last_letter': word[-1]}

In [5]:
gender_features('Justine')

{'last_letter': 'e'}

Next, we use the feature extractor to process the Gender_names data and divide the resulting list of feature sets into a training set and a test set.

In [6]:
featuresets = [(gender_features(n), g) for (n,g) in Gender_names]
train_set, test_set = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [7]:
featuresets[2]

({'last_letter': 's'}, 'male')

In [9]:
print (classifier.classify(gender_features('Romeo'))) #male
print (classifier.classify(gender_features('Trinity'))) #female

male
female


In [10]:
print (nltk.classify.accuracy(classifier, test_set))

0.758


In [11]:
classifier.show_most_informative_features(5)

Most Informative Features
             last_letter = 'a'            female : male   =     35.9 : 1.0
             last_letter = 'k'              male : female =     32.3 : 1.0
             last_letter = 'f'              male : female =     14.6 : 1.0
             last_letter = 'p'              male : female =     11.9 : 1.0
             last_letter = 'd'              male : female =     10.2 : 1.0
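A ratio such as "female : male = 35.9 : 1.0" compares how often a feature value occurs among the names of one gender with how often it occurs among the names of the other. The sketch below shows that arithmetic under stated assumptions: the counts are invented for illustration, and the add-0.5 smoothing is an assumption of this sketch rather than a statement of NLTK's exact estimator.

```python
def likelihood_ratio(count_a, total_a, count_b, total_b, n_values=26, gamma=0.5):
    """Ratio P(value | label A) / P(value | label B) with add-gamma smoothing.

    n_values is the number of distinct feature values (26 letters here).
    """
    p_a = (count_a + gamma) / (total_a + gamma * n_values)
    p_b = (count_b + gamma) / (total_b + gamma * n_values)
    return p_a / p_b

# Hypothetical counts, invented for illustration only:
# 1800 of 5000 female names and 30 of 3000 male names end in 'a'.
ratio = likelihood_ratio(1800, 5000, 30, 3000)
print("female : male = %.1f : 1.0" % ratio)
```

A value that is equally frequent in both groups gives a ratio of 1.0, which is why such values never appear near the top of the list.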


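When working with large corpora, building a single list of all the feature sets can use a lot of memory, so we will use the function
nltk.classify.apply_features, which returns an object that acts like a list but does not store all the feature sets in memory.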

In [12]:
train_set = apply_features(gender_features, Gender_names[500:])
test_set = apply_features(gender_features, Gender_names[:500])
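The idea behind a lazily evaluated feature list can be sketched in a few lines. This is a simplified illustration, not NLTK's implementation: the class name LazyFeatureList and the helper last_letter are hypothetical, and only integer indexing and iteration are supported.

```python
class LazyFeatureList:
    """Sequence that computes (features, label) pairs only when they are read."""

    def __init__(self, feature_func, labelled_items):
        self._func = feature_func
        self._items = labelled_items  # only the raw (item, label) pairs are stored

    def __len__(self):
        return len(self._items)

    def __getitem__(self, index):
        # Integer indexing only; the feature dict is built on demand.
        item, label = self._items[index]
        return (self._func(item), label)

def last_letter(word):
    return {"last_letter": word[-1]}

lazy = LazyFeatureList(last_letter, [("Romeo", "male"), ("Trinity", "female")])
print(len(lazy), lazy[1])
```

The trade-off is that each feature dictionary is recomputed every time it is read, so memory is saved at the cost of repeated work.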

Choosing the Right Features:
Selecting relevant features and deciding how to encode them is very important for building a good model.

In [13]:
def gender_features2(name):
    features = {}
    features["firstletter"] = name[0].lower()
    features["lastletter"] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count(%s)" % letter] = name.lower().count(letter)
        features["has(%s)" % letter] = (letter in name.lower())
    return features

In [14]:
gender_features2('John')

{'firstletter': 'j',
 'lastletter': 'n',
 'count(a)': 0,
 'has(a)': False,
 'count(b)': 0,
 'has(b)': False,
 'count(c)': 0,
 'has(c)': False,
 'count(d)': 0,
 'has(d)': False,
 'count(e)': 0,
 'has(e)': False,
 'count(f)': 0,
 'has(f)': False,
 'count(g)': 0,
 'has(g)': False,
 'count(h)': 1,
 'has(h)': True,
 'count(i)': 0,
 'has(i)': False,
 'count(j)': 1,
 'has(j)': True,
 'count(k)': 0,
 'has(k)': False,
 'count(l)': 0,
 'has(l)': False,
 'count(m)': 0,
 'has(m)': False,
 'count(n)': 1,
 'has(n)': True,
 'count(o)': 1,
 'has(o)': True,
 'count(p)': 0,
 'has(p)': False,
 'count(q)': 0,
 'has(q)': False,
 'count(r)': 0,
 'has(r)': False,
 'count(s)': 0,
 'has(s)': False,
 'count(t)': 0,
 'has(t)': False,
 'count(u)': 0,
 'has(u)': False,
 'count(v)': 0,
 'has(v)': False,
 'count(w)': 0,
 'has(w)': False,
 'count(x)': 0,
 'has(x)': False,
 'count(y)': 0,
 'has(y)': False,
 'count(z)': 0,
 'has(z)': False}

Accuracy of the naive Bayes classifier using this feature extractor:

In [15]:
featuresets = [(gender_features2(n), g) for (n,g) in Gender_names]
train_set, test_set = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print (nltk.classify.accuracy(classifier, test_set))

0.77


The result above shows that the classifier that counts letters is about 1.2 percentage points more accurate (77.0% vs. 75.8%)
than the classifier that only pays attention to the final letter of each name.


Error analysis: First, we select a development set, containing the
corpus data for creating the model. This development set is then subdivided into the
training set and the dev-test set.


In [16]:
train_names = Gender_names[1500:]
devtest_names = Gender_names[500:1500]
test_names = Gender_names[:500]

We have divided the corpus into a training set, a dev-test set (1000 names), and a test set (500 names). Then we build a model using the training
set and run it on the dev-test set.

In [17]:
train_set = [(gender_features(n), g) for (n,g) in train_names]
devtest_set = [(gender_features(n), g) for (n,g) in devtest_names]
test_set = [(gender_features(n), g) for (n,g) in test_names]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print (nltk.classify.accuracy(classifier, devtest_set))

0.748


Using the dev-test set, we can generate a list of the errors that the classifier makes when
predicting name genders:

In [18]:
errors = []
for (name, tag) in devtest_names:
    guess = classifier.classify(gender_features(name))
    if guess != tag:
        errors.append( (tag, guess, name) )

The names classifier that we have built misclassifies about 250 of the 1000 names in the dev-test set (accuracy 0.748). The first errors in the sorted list are:

In [19]:
for (tag, guess, name) in sorted(errors):
    print ('correct=%-8s guess=%-8s name=%-30s' % (tag, guess, name))

correct=female   guess=male     name=Adrien                        
correct=female   guess=male     name=Alis                          
correct=female   guess=male     name=Ann                           
correct=female   guess=male     name=Ardys                         
correct=female   guess=male     name=Aryn                          
correct=female   guess=male     name=Ashleigh                      
correct=female   guess=male     name=Astrid                        
correct=female   guess=male     name=Ayn                           
correct=female   guess=male     name=Beilul                        
correct=female   guess=male     name=Bel                           
correct=female   guess=male     name=Bert                          
correct=female   guess=male     name=Beryl                         
correct=female   guess=male     name=Beth                          
correct=female   guess=male     name=Beulah                        
correct=female   guess=male     name=Bridget    
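One way to look for patterns in an error list like this is to tally the endings of the misclassified names. The sketch below does so for the fifteen names in the truncated listing above; the helper name suffix_counts is hypothetical, and the full error list would be needed for a reliable picture.

```python
from collections import Counter

# Misclassified female names from the truncated error listing above.
missed = ["Adrien", "Alis", "Ann", "Ardys", "Aryn", "Ashleigh", "Astrid",
          "Ayn", "Beilul", "Bel", "Bert", "Beryl", "Beth", "Beulah", "Bridget"]

def suffix_counts(names, length=2):
    """Count how often each lowercase suffix of the given length occurs."""
    return Counter(name[-length:].lower() for name in names)

one_letter = suffix_counts(missed, length=1)
two_letter = suffix_counts(missed, length=2)
print(one_letter.most_common(3))
print(two_letter.most_common(3))
```

In this small sample, several names share a final letter but differ in their last two letters, which is the kind of distinction a single-letter feature cannot capture.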

Now, we will adjust our feature extractor to include features for two-letter suffixes:

In [20]:
def gender_features(word):
    return {'suffix1': word[-1:], 'suffix2': word[-2:]}

In [21]:
train_set = [(gender_features(n), g) for (n,g) in train_names]
devtest_set = [(gender_features(n), g) for (n,g) in devtest_names]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print (nltk.classify.accuracy(classifier, devtest_set))

0.77


Rebuilding the classifier with the new feature extractor, we see that the performance
on the dev-test dataset improves by about two percentage points (from 74.8% to 77.0%).

I have created another feature extractor here, which uses the first and last letter. It also takes a prefix and a suffix (the first and last three letters for names longer than four letters, otherwise the first and last two) and records up to three consonant clusters found in the name.

In [31]:
def class_gender_features4(name):
    features = {}
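    # temp_name keeps the original capitalization, so the lowercase clusters
    # below never match a capitalized cluster at the start of a name (e.g. "Ch")
    temp_name = name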
    cons_clusters = ["bl", "br", "ch", "cl", "cr", "dr", "fl", "fr", "gl", "gr", "pl", "pr", "sc", "sh", "sk", "sl", "sm", "sn", "sp", "st", "sw", "th", "tr", "tw", "wh", "wr", "sch", "scr", "shr", "sph", "spl", "spr", "squ", "str", "thr"]
    features["firstletter"] = name[0].lower() 
    features["lastletter"] = name[-1].lower() 
    features["prefix"] = name[:3].lower() if len(name) > 4 else name[:2].lower() 
    features["suffix"] = name[-3:].lower() if len(name) > 4 else name[-2:].lower()
    clusters = []
    for cluster in cons_clusters[::-1]:
        if cluster in temp_name:
            temp_name = temp_name.replace(cluster, "")
            clusters.append(cluster)
    features["consonant_clusters_1"] = clusters[0] if len(clusters) > 0 else None
    features["consonant_clusters_2"] = clusters[1] if len(clusters) > 1 else None
    features["consonant_clusters_3"] = clusters[2] if len(clusters) > 2 else None
    return features
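Because the cluster search in the cell above compares lowercase clusters against the name as written, a cluster at the start of a capitalized name is not detected. The sketch below isolates the cluster search and adds a hypothetical ignore_case switch to show the difference; it is a variant for illustration, not the version used to produce the results reported in this notebook.

```python
CLUSTERS = ["bl", "br", "ch", "cl", "cr", "dr", "fl", "fr", "gl", "gr", "pl",
            "pr", "sc", "sh", "sk", "sl", "sm", "sn", "sp", "st", "sw", "th",
            "tr", "tw", "wh", "wr", "sch", "scr", "shr", "sph", "spl", "spr",
            "squ", "str", "thr"]

def find_clusters(name, ignore_case=True, clusters=CLUSTERS):
    """Return the consonant clusters found in name, in search order.

    The list is searched in reverse, as in the cell above, so the
    three-letter clusters are tried before the two-letter ones.
    """
    remaining = name.lower() if ignore_case else name
    found = []
    for cluster in clusters[::-1]:
        if cluster in remaining:
            remaining = remaining.replace(cluster, "")
            found.append(cluster)
    return found

print(find_clusters("Christine", ignore_case=False))  # ['st']
print(find_clusters("Christine", ignore_case=True))   # ['st', 'ch']
```

Switching to the case-insensitive search would change the feature values, so the accuracy figures would need to be regenerated before comparing.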

Testing accuracy:
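We will test the accuracy of the feature extractors here: finding the gender from the end of the name (gender_features), counting the letters of the name (gender_features2), and the prefix/suffix extractor (class_gender_features4). To do this, we will run each function 100 times, reshuffling the names and retraining the classifier on each run.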

In [23]:
import pandas as pd

In [27]:
def accuracy(number_of_runs, function_to_use):
    acc_df = {
        "classifier": [],
        "train_set_accuracy": [],
        "test_set_accuracy": [],
        "devtest_set_accuracy": [],
        "devtest_errors": []
    }
    for i in range(number_of_runs):
        random.shuffle(Gender_names)
        acc_train_names = Gender_names[1000:]
        acc_devtest_names = Gender_names[500:1000]
        acc_test_names = Gender_names[:500]
        acc_train_set = [(function_to_use(n), g) for (n,g) in acc_train_names]
        acc_devtest_set = [(function_to_use(n), g) for (n,g) in acc_devtest_names]
        acc_test_set = [(function_to_use(n), g) for (n,g) in acc_test_names]
        acc_classifier = nltk.NaiveBayesClassifier.train(acc_train_set)
        acc_df["classifier"].append(acc_classifier)
        acc_df["train_set_accuracy"].append(nltk.classify.accuracy(acc_classifier, acc_train_set))
        acc_df["test_set_accuracy"].append(nltk.classify.accuracy(acc_classifier, acc_test_set))
        acc_df["devtest_set_accuracy"].append(nltk.classify.accuracy(acc_classifier, acc_devtest_set))
        acc_errors = []
        for (name, tag) in acc_devtest_names:
            acc_guess = acc_classifier.classify(function_to_use(name))
            if acc_guess != tag:
                acc_errors.append( (tag, acc_guess, name) )
        acc_df["devtest_errors"].append(acc_errors)
    acc_df = pd.DataFrame.from_dict(acc_df)
    return acc_df

In [28]:
Accuracy_df_1 = accuracy(100, gender_features)
Accuracy_df_1.describe()

       train_set_accuracy  test_set_accuracy  devtest_set_accuracy
count               100.0              100.0                 100.0
mean             0.789293            0.78172               0.78106
std              0.001951           0.014845              0.019954
min              0.785138              0.744                 0.732
25%              0.787874              0.772                 0.768
50%              0.789315              0.782                  0.78
75%              0.790323               0.79                 0.796
max              0.794067              0.816                  0.83


For the first feature extractor, the mean accuracy over 100 runs ranges from 78.1% (dev-test set) to 78.9% (training set). The mean training-set accuracy is slightly higher than the dev-test and test accuracies, which are nearly identical (78.1% and 78.2%).
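The spread of the test-set accuracy across runs (the std row above) can be compared with what sampling noise alone would predict for a 500-name evaluation set. The sketch below uses the binomial standard error and assumes, as a simplification, that each test name is an independent trial with the same chance of being classified correctly; the function name accuracy_std_error is hypothetical.

```python
import math

def accuracy_std_error(accuracy, n_items):
    """Binomial standard error of an accuracy measured on n_items examples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_items)

# Mean test-set accuracy from the table above, evaluated on 500 names.
se = accuracy_std_error(0.78172, 500)
print("expected std from sampling alone: %.4f" % se)
```

Under that assumption the estimate is about 0.018, the same order as the observed values of 0.015 (test set) and 0.020 (dev-test set), so run-to-run differences of a point or two are not by themselves evidence of a better or worse classifier.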

In [29]:
Accuracy_df_2 = accuracy(100, gender_features2)
Accuracy_df_2.describe()

       train_set_accuracy  test_set_accuracy  devtest_set_accuracy
count               100.0              100.0                 100.0
mean             0.778713            0.77186               0.77356
std              0.001699           0.018326               0.01807
min              0.775202              0.716                 0.726
25%               0.77747             0.7595                0.7615
50%              0.778658              0.774                 0.774
75%              0.780098              0.784                 0.784
max              0.783698              0.818                 0.816


For the second feature extractor, the mean accuracy ranges from 77.2% (test set) to 77.9% (training set). Again, the mean training-set accuracy is slightly higher than the dev-test accuracy (77.4%).

In [32]:
class_df_3 = accuracy(100, class_gender_features4)
class_df_3.describe()

       train_set_accuracy  test_set_accuracy  devtest_set_accuracy
count               100.0              100.0                 100.0
mean             0.883635            0.83434               0.83302
std              0.001748           0.016208              0.016638
min              0.878744              0.792                  0.79
25%              0.882488              0.824                 0.824
50%              0.883641              0.834                 0.834
75%              0.884937              0.845                 0.844
max              0.887529              0.868                 0.876


For the third feature extractor, which looks at prefixes, suffixes, and consonant clusters, the mean accuracy ranges from 83.3% (dev-test set) to 88.4% (training set). The mean training-set accuracy is about five percentage points higher than the dev-test and test accuracies (83.3% and 83.4%), which remain close to each other.
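The three runs can be compared side by side by computing the gap between mean training accuracy and mean test accuracy, and the difference between test and dev-test accuracy. The sketch below copies the mean values from the three tables above; the dictionary layout and variable names are illustrative only.

```python
# Mean accuracies copied from the three describe() tables above:
# (train_set_accuracy, test_set_accuracy, devtest_set_accuracy)
means = {
    "gender_features": (0.789293, 0.78172, 0.78106),
    "gender_features2": (0.778713, 0.77186, 0.77356),
    "class_gender_features4": (0.883635, 0.83434, 0.83302),
}

gaps = {}
for name, (train, test, devtest) in means.items():
    gaps[name] = train - test
    print("%-24s train-test gap = %.3f, test-devtest difference = %+.3f"
          % (name, train - test, test - devtest))
```

The test and dev-test means differ by less than 0.2 percentage points for all three extractors, while the train-test gap is noticeably larger for the third extractor than for the first two, which may indicate that its richer features fit the training names more closely.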