# Data 620 - Project 3 | Building the best name gender classifier

Using any of the three classifiers described in Chapter 6 of Natural Language Processing with Python, and any features you can think of, build the best name gender classifier you can.

Begin by splitting the Names Corpus into three subsets: 500 words for the test set, 500 words for the dev-test set and the remaining 6900 words for the training set. Then, starting with the example name gender classifier, make incremental improvements. Use the dev-test set to check your progress. Once you are satisfied with your classifier, check its final performance on the test set. 

How does the performance on the test set compare to the performance on the dev-test set? Is this what you'd expect?

Group Members: Abdellah Ait Elmouden, Habib Khan, Priya Shaji, Vijaya Cherukuri

In [7]:
# loading libraries
import nltk
from nltk.corpus import names
from nltk.classify import apply_features
import pandas as pd
import random

# download names
#nltk.download('names')

# Getting the Corpus

In [8]:
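# Build the list of (name, gender) pairs; the shuffle happens in the next cell.
# Note: this rebinds `names`, shadowing the imported corpus reader, so this cell
# fails if it is re-run without re-running the import cell first.
names = ([(name, 'male') for name in names.words('male.txt')] +
         [(name, 'female') for name in names.words('female.txt')])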


In [9]:
# Shuffle the corpus in place
random.shuffle(names)
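
The 500 / 500 / remainder split described in the assignment can be sketched as a small helper. This is a minimal sketch, not the notebook's own code: `split_corpus`, its parameters and the fixed `seed` are hypothetical names introduced here, on the assumption that a reproducible split is wanted. A toy list stands in for the corpus so the example runs without NLTK.

```python
import random

def split_corpus(labeled, n_test=500, n_devtest=500, seed=620):
    """Shuffle a copy of `labeled` and return (train, devtest, test)."""
    rng = random.Random(seed)   # private RNG, so the global random state is untouched
    shuffled = list(labeled)    # copy; the caller's list keeps its order
    rng.shuffle(shuffled)
    test = shuffled[:n_test]
    devtest = shuffled[n_test:n_test + n_devtest]
    train = shuffled[n_test + n_devtest:]
    return train, devtest, test

# Toy stand-in for the 7,944 labeled names (hypothetical data, not the corpus).
toy = [("name%d" % i, "male" if i % 2 else "female") for i in range(7944)]
train, devtest, test = split_corpus(toy)
print(len(train), len(devtest), len(test))
```

With 7,944 labeled names, holding out 1,000 leaves 6,944 for training.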

In [44]:
# Random names with gender
names[1:15]

[('Bartholomeo', 'male'),
 ('Corrine', 'female'),
 ('Hodge', 'male'),
 ('Inesita', 'female'),
 ('Meier', 'male'),
 ('Coletta', 'female'),
 ('Marlin', 'male'),
 ('Terri', 'female'),
 ('Darth', 'male'),
 ('Lind', 'male'),
 ('Shem', 'male'),
 ('Moreen', 'female'),
 ('Kettie', 'female'),
 ('Gracie', 'female')]

In [11]:
# checking the length
len(names)

7944

# Analysis

For this project, we start from the feature functions in Chapter 6 and test how accurately they classify gender from names. Two functions are tested first. A modified third function is then created and evaluated on the test data for the final results.

Before going further, we define an accuracy function that scores a classifier built from each feature function, so we can see which one classifies name gender best.

In [31]:
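# Defining the accuracy function

def accuracy(runs, function):
    """Train and score a Naive Bayes classifier `runs` times with feature extractor `function`.

    Each run reshuffles the global `names` list, holds out 500 names for the
    test set and 500 for the dev-test set, and trains on the rest.
    """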
    accuracy_df = {
        "classifier": [],
        "train_accuracy": [],
        "test_accuracy": [],
        "devtest_accuracy": [],
        "devtest_errors": []
    }
    for i in range(runs):
        random.shuffle(names)
        accuracy_train = names[1000:]
        accuracy_devtest = names[500:1000]
        accuracy_test = names[:500]
        
        accuracy_trainset = [(function(n), g) for (n,g) in accuracy_train]
        accuracy_devtestset = [(function(n), g) for (n,g) in accuracy_devtest]
        accuracy_testset = [(function(n), g) for (n,g) in accuracy_test]
        
        accuracy_classifier = nltk.NaiveBayesClassifier.train(accuracy_trainset)
        accuracy_df["classifier"].append(accuracy_classifier)
        accuracy_df["train_accuracy"].append(nltk.classify.accuracy(accuracy_classifier, accuracy_trainset))
        accuracy_df["test_accuracy"].append(nltk.classify.accuracy(accuracy_classifier, accuracy_testset))
        accuracy_df["devtest_accuracy"].append(nltk.classify.accuracy(accuracy_classifier, accuracy_devtestset))
        
        accuracy_errors = []
        for (name, tag) in accuracy_devtest:
            accuracy_guess = accuracy_classifier.classify(function(name))
            if accuracy_guess != tag:
                accuracy_errors.append( (tag, accuracy_guess, name) )
                
        accuracy_df["devtest_errors"].append(accuracy_errors)
        
    # One row per run
    accuracy_df = pd.DataFrame.from_dict(accuracy_df)
    return accuracy_df
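
The `devtest_errors` column stores one `(tag, guess, name)` tuple per misclassified dev-test name. One way to use it, in the spirit of the error analysis in Chapter 6, is to count which name endings are misclassified most often. The following is a minimal sketch: `summarize_errors` is a hypothetical helper, and the sample errors are made up for illustration rather than taken from a real run.

```python
from collections import Counter

def summarize_errors(errors, suffix_len=2, top=3):
    """Count the most frequent name endings among (tag, guess, name) errors."""
    endings = Counter(name[-suffix_len:].lower() for (_tag, _guess, name) in errors)
    return endings.most_common(top)

# Illustrative errors only (hypothetical, not output of the classifier above).
sample_errors = [
    ("male", "female", "Joshua"),
    ("male", "female", "Ezra"),
    ("female", "male", "Gwen"),
    ("female", "male", "Karen"),
    ("female", "male", "Ellen"),
]
print(summarize_errors(sample_errors))
```

An ending that shows up repeatedly in the errors is a candidate for a new feature, which is the reasoning behind the suffix features added below.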

In [40]:
# Creating function 1
def gender_feature1(name):
    return {'last_letter': name[-1]}

gender_feature1("Mr. Sherlock Holmes")

{'last_letter': 's'}

In [41]:
# Creating function 2 - lowercased first and last letter, 2-letter suffix and prefix, vowel counts
def gender_features2(name):
    features={}
    features["firstletter"]= name[0].lower()
    features["lastletter"] = name[-1].lower()
    features["suffix2"] =  name[-2:].lower()
    features["preffix2"] = name[:2].lower()
    for letter in 'aeiou':
        features["count(%s)" % letter] = name.lower().count(letter)
        features["has(%s)" % letter] = (letter in name.lower())
    return features
gender_features2("Mr. Sherlock Holmes")

{'firstletter': 'm',
 'lastletter': 's',
 'suffix2': 'es',
 'preffix2': 'mr',
 'count(a)': 0,
 'has(a)': False,
 'count(e)': 2,
 'has(e)': True,
 'count(i)': 0,
 'has(i)': False,
 'count(o)': 2,
 'has(o)': True,
 'count(u)': 0,
 'has(u)': False}

In [49]:
# Creating function 3 - like function 2, but with a 3-letter suffix and prefix for longer names

def gender_features3(name):
    features={}
    features["firstletter"]= name[0].lower()
    features["lastletter"] = name[-1].lower()
    
    # use a 3-letter suffix and prefix when the name has more than 4 letters,
    # otherwise 2 letters (the keys keep the names "suffix2" and "preffix2")
    features["suffix2"] =  name[-3:].lower() if len(name) > 4 else name[-2:].lower()
    features["preffix2"] = name[:3].lower() if len(name) > 4 else name[:2].lower()
    for letter in 'aeiou':
        features["count(%s)" % letter] = name.lower().count(letter)
        features["has(%s)" % letter] = (letter in name.lower())
    return features

In [52]:
gender_features3("Sherlock")

{'firstletter': 's',
 'lastletter': 'k',
 'suffix2': 'ock',
 'preffix2': 'she',
 'count(a)': 0,
 'has(a)': False,
 'count(e)': 1,
 'has(e)': True,
 'count(i)': 0,
 'has(i)': False,
 'count(o)': 1,
 'has(o)': True,
 'count(u)': 0,
 'has(u)': False}

In [35]:
# Testing function 1
test_f1 = accuracy(100, gender_feature1)
test_f1.describe()

| | train_accuracy | test_accuracy | devtest_accuracy |
|---|---|---|---|
| count | 100.0 | 100.0 | 100.0 |
| mean | 0.76327 | 0.7563 | 0.75928 |
| std | 0.001891 | 0.016586 | 0.017786 |
| min | 0.758929 | 0.71 | 0.708 |
| 25% | 0.761953 | 0.746 | 0.7475 |
| 50% | 0.763177 | 0.754 | 0.758 |
| 75% | 0.764437 | 0.7665 | 0.7725 |
| max | 0.768865 | 0.806 | 0.8 |


Function 1 looks only at the last letter of the name and reaches a mean accuracy of 75.63% on the test set. According to Chapter 6, the last letter of a name carries a gender pattern: names ending in a, e and i are likely to be female, while names ending in k, o, r, s and t are likely to be male. Accuracy is nearly the same on the train (76.33%), dev-test (75.93%) and test (75.63%) sets.
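
To illustrate the last-letter pattern, the tally below counts genders per final letter. It is a rough sketch only: `last_letter_counts` is a hypothetical helper, and the input is just the 14 names printed earlier in this notebook, so the counts are far too small to support conclusions.

```python
from collections import defaultdict

# The 14 labeled names printed earlier in this notebook.
sample = [
    ('Bartholomeo', 'male'), ('Corrine', 'female'), ('Hodge', 'male'),
    ('Inesita', 'female'), ('Meier', 'male'), ('Coletta', 'female'),
    ('Marlin', 'male'), ('Terri', 'female'), ('Darth', 'male'),
    ('Lind', 'male'), ('Shem', 'male'), ('Moreen', 'female'),
    ('Kettie', 'female'), ('Gracie', 'female'),
]

def last_letter_counts(labeled):
    """Map each final letter to its {'male': n, 'female': m} counts."""
    counts = defaultdict(lambda: {'male': 0, 'female': 0})
    for name, gender in labeled:
        counts[name[-1].lower()][gender] += 1
    return dict(counts)

for letter, tally in sorted(last_letter_counts(sample).items()):
    print(letter, tally)
```

In this sample the names ending in a and i are all female and those ending in o and r are male, while e and n are mixed. On a trained classifier, NLTK's `show_most_informative_features(5)` reports the same kind of evidence as likelihood ratios.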

In [42]:
# Testing function 2
test_f2 = accuracy(100, gender_features2)
test_f2.describe()

| | train_accuracy | test_accuracy | devtest_accuracy |
|---|---|---|---|
| count | 100.0 | 100.0 | 100.0 |
| mean | 0.80862 | 0.79704 | 0.79664 |
| std | 0.001876 | 0.016476 | 0.015384 |
| min | 0.804724 | 0.758 | 0.764 |
| 25% | 0.80746 | 0.786 | 0.786 |
| 50% | 0.80854 | 0.796 | 0.796 |
| 75% | 0.809908 | 0.81 | 0.806 |
| max | 0.813076 | 0.844 | 0.834 |


Function 2 adds features beyond the last letter to improve accuracy: the first letter of the name, the two-letter prefix and suffix, and, for each vowel, how many times it occurs and whether it occurs at all. Compared with function 1, mean accuracy rose by roughly 4 percentage points on all three sets (train 80.86%, test 79.70%, dev-test 79.66%). The train, test and dev-test accuracies again differ only slightly, so the classifier predicts the gender correctly for roughly 80% of names.

In [50]:
# Testing function 3
test_f3 = accuracy(100, gender_features3)
test_f3.describe()

| | train_accuracy | test_accuracy | devtest_accuracy |
|---|---|---|---|
| count | 100.0 | 100.0 | 100.0 |
| mean | 0.870756 | 0.82806 | 0.82956 |
| std | 0.001737 | 0.017033 | 0.017379 |
| min | 0.866647 | 0.768 | 0.794 |
| 25% | 0.869636 | 0.8175 | 0.816 |
| 50% | 0.870752 | 0.826 | 0.828 |
| 75% | 0.871976 | 0.84 | 0.842 |
| max | 0.875288 | 0.866 | 0.876 |


Function 3 is a modification of function 2. It changes the suffix and prefix rule: if the name has more than 4 letters, the function uses the first 3 and last 3 letters instead of 2. Mean training accuracy rose sharply to 87.08%, while test and dev-test accuracy reached 82.81% and 82.96%. That is still better than both previous functions, but the gap of about 4 points between training and held-out accuracy suggests that the longer prefixes and suffixes overfit the training names to some degree. Test and dev-test accuracy are nearly identical, which is what we would expect, since both are random 500-name samples drawn from the same shuffled corpus.
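
The comparison between training and held-out accuracy can be made explicit with a little arithmetic on the mean values reported in the tables above. This is a small sketch; `accuracy_gaps` is a hypothetical helper, and the numbers are copied from the `describe()` outputs rather than recomputed.

```python
# Mean accuracies copied from the describe() tables above (100 runs each).
reported = {
    "gender_feature1":  {"train": 0.76327,  "test": 0.7563,  "devtest": 0.75928},
    "gender_features2": {"train": 0.80862,  "test": 0.79704, "devtest": 0.79664},
    "gender_features3": {"train": 0.870756, "test": 0.82806, "devtest": 0.82956},
}

def accuracy_gaps(means):
    """Return train-minus-devtest and test-minus-devtest gaps in percentage points."""
    return {
        "train_vs_devtest": round(100 * (means["train"] - means["devtest"]), 2),
        "test_vs_devtest": round(100 * (means["test"] - means["devtest"]), 2),
    }

for name, means in reported.items():
    print(name, accuracy_gaps(means))
```

The train versus dev-test gap grows from about 0.4 points for function 1 to about 1.2 for function 2 and about 4.1 for function 3, while test and dev-test stay within about 0.3 points of each other for all three.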

# References

https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781789139495/2/ch02lvl1sec17/training-a-sentiment-classifier-for-movie-reviews

https://www.nltk.org/book/ch06.html