# Project 3: Classification of Gender based on Names

This is a collaborative project conducted by the Fall 2017 students of DATA 620 at The City University of New York, in partial fulfillment of the requirements for the MS in Data Science degree.

### Problem Description

Using any of the three classifiers described in chapter 6 of Natural Language Processing with Python, and any features you can think of, build the best name gender classifier you can. Begin by splitting the Names Corpus into three subsets: 500 words for the test set, 500 words for the dev-test set, and the remaining 6900 words for the training set. Then, starting with the example name gender classifier, make incremental improvements. Use the dev-test set to check your progress. Once you are satisfied with your classifier, check its final performance on the test set. How does the performance on the test set compare to the performance on the dev-test set? Is this what you'd expect?

Source: Natural Language Processing with Python, exercise 6.10.2.

### Contributors Include

* K. Joy Payton
* Keith Folsom
* Sonya Hong
* Shyam Balagurumurthy Viswanathan
* Derek Nokes
* Liam Byrne
* Latif Masud
* Valerie Briot


## Obtaining the Corpus

Note: If you have not already downloaded it, run `nltk.download()` to get access to the Names Corpus.

### Importing Packages

Since we all used Anaconda's Python distribution, which comes with most of the packages we need pre-installed, we can simply import them into our notebook. `textstat` doesn't come with the base installation, so we have to install it before importing it:

In [1]:
! pip install textstat



In [30]:
# Importing required libraries/packages
import nltk
from nltk.corpus import names
import random
import numpy as np
from nltk.metrics import ConfusionMatrix
import re
import operator
import string
from textstat.textstat import textstat

To start, we will use NLTK's provided library of male and female names, and shuffle our dataset: 

In [12]:
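# Note: rebinding `names` shadows the imported corpus reader,
# so this cell can only be run once after the import cell
names = ([(name, 'male') for name in names.words('male.txt')] + \
         [(name, 'female') for name in names.words('female.txt')])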

random.shuffle(names)

print("Size of dataset:", len(names))
# let's see what the randomly shuffled names look like
names[0:9]

Size of dataset: 7944


[('Caro', 'female'),
 ('Collie', 'female'),
 ('Geoffrey', 'male'),
 ('Kizzee', 'female'),
 ('Issy', 'female'),
 ('Dody', 'female'),
 ('Cari', 'female'),
 ('Abigale', 'female'),
 ('Rice', 'male')]

## Creating Data Subsets
Now that we have our overall dataset, we will split our data into three different subsets to be used for different purposes. There are 7,944 names in the dataset, and we will first split the "names" data set into a Development set of 7,444 entries and a Test set of 500 entries.

The Development set will be used to test each feature as we build the model. This set will in turn be split between a training set and a dev-test set.  

##### Development set:
* 6944 names for the training set
* 500 names for the dev-test set  

##### Test set:
* 500 names for the testing set
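
The split described above can be sketched with the standard library alone. This is a minimal illustration rather than the notebook's own code: the function name `three_way_split`, its `seed` parameter, and the placeholder data are assumptions. Seeding a private `random.Random` makes the partition repeatable, which an unseeded `random.shuffle` call does not.

```python
import random


def three_way_split(items, n_test=500, n_devtest=500, seed=0):
    """Shuffle a copy of `items` and cut it into test, dev-test and training lists."""
    rng = random.Random(seed)   # private RNG, so the global random state is untouched
    shuffled = list(items)      # copy, so the caller's list keeps its order
    rng.shuffle(shuffled)
    test = shuffled[:n_test]
    devtest = shuffled[n_test:n_test + n_devtest]
    train = shuffled[n_test + n_devtest:]
    return test, devtest, train


# Placeholder data standing in for the 7,944 labelled names (not the real corpus)
labelled = [("name{}".format(i), "female" if i % 2 else "male") for i in range(7944)]
test, devtest, train = three_way_split(labelled)
print(len(test), len(devtest), len(train))
```

With 7,944 items this yields 500, 500 and 6,944 entries, matching the subset sizes listed above.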



In [13]:
# Split the names set into a Development set and a Test set;
# the Development set will be used for training and testing each feature as we build the model

test_names, development_set_names = names[0:500], names[500:]

In [14]:
# Confirm the size of the two subsets
print("Development Set = {}".format(len(development_set_names)))
print("Test Set = {}".format(len(test_names)))

Development Set = 7444
Test Set = 500


## Feature Extractor Class

This section incrementally improves the feature extraction functions, which are subsequently applied to the development and test datasets.

To facilitate the analysis, we constructed a class for the classifier. The class holds all of the features along with methods to show errors, display a confusion matrix, and evaluate the fit of the model. The features that the class can evaluate are:

1. Last letter
2. First letter (most names beginning with a vowel are associated with females)
3. Vowel count
4. Hard consonants using general rules of c and g
5. Soft consonants using general rules of c and g
6. Syllable count of names via textstat
7. Name length
8. Last two letters
9. Last three letters
10. Character count (count of each letter a-z)
11. Character present (boolean for each letter a-z)
12. Count of each letter
13. Count of pair of letters in the alphabet (boolean for the presence of each two-letter pair)
14. First two letters
15. First letter, last letter, last two letters, last three letters, letter counts, letter presence, and 2-grams
16. Boolean flags for the top indicators for classification (selected 2-grams and suffixes)

There are four additional methods provided in the class, which do the following:
* `show_errors`: prints the correct label, the classifier's guess, and the name for each misclassified name passed to the function.
* `classifier_report`: prints a confusion matrix of the predictions, which helps describe the performance of our classification model.
* `fit_model`: takes a range of feature numbers, trains the model on each feature in turn, and reports the results.
* `get_sorted_feature_accuracies`: takes the same range and returns (feature number, dev-test accuracy) pairs sorted from most to least accurate.
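
As a compact illustration of how numbered options map to dictionary entries, the sketch below combines the suffix features (options 1, 8 and 9) with the hard and soft c/g counts (options 4 and 5). The standalone function `suffix_and_cg_features` is hypothetical and exists only for illustration; the class below is the code actually used. Unlike the class, this sketch does not halve the regex counts.

```python
import re


def suffix_and_cg_features(name):
    """Return a feature dictionary for one name (illustrative only)."""
    lowered = name.lower()
    return {
        "last_letter": lowered[-1],
        "last2letters": lowered[-2:],
        "last3letters": lowered[-3:],
        # c or g followed by any character other than e, i or y is usually hard
        "hard_consts": len(re.findall(r"[cg][^eiy]", lowered)),
        # c or g followed by e, i or y is usually soft
        "soft_consts": len(re.findall(r"[cg][eiy]", lowered)),
    }


print(suffix_and_cg_features("Geoffrey"))
print(suffix_and_cg_features("Candice"))
```

A c or g in the final position has no following character, so neither pattern counts it.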

In [61]:
class NLP_Classifier():    
    def __init__(self,model):
        self.model = model


    def get_features(self,name,feat_num):
        '''
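        Parameters:
            name - string of name to extract features from
            feat_num - int, or iterable collection of ints, specifying which features to extract
                1: last letter
                2: first letter
                3: Vowel counts
                4: Hard consonant count
                5: Soft consonant count
                6: Syllable Count
                7: Name length
                8: Last two chars
                9: Last three chars
                10: char count --> feature for all alpha chars
                11: char present --> feature for all alpha chars (boolean)
                12: count of each letter (resets the feature dictionary)
                13: presence of each two-letter pair (resets the feature dictionary)
                14: first two chars
                15: first letter, suffixes, letter counts, letter presence, 2-grams (resets the feature dictionary)
                16: boolean flags for selected 2-grams and suffixes (resets the feature dictionary)
        Returns:
            features: a dictionary of extracted features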
        '''
        features = {}    
        
        # Wrap a single int in a tuple so the membership tests below work
        if isinstance(feat_num, int):
            feat_num = (feat_num,)
       
        # Gender Feature 1: Last letter - book example
        if 1 in feat_num:
            features['last_letter'] = name[-1].lower()
            
        # Gender Feature 2: First letter - most names beginning with a vowel --> females
        if 2 in feat_num:
            features['first_letter'] = name[0].lower()
            
        # Gender Feature 3: Vowel Counts
        if 3 in feat_num:
            features['vowel_count'] = len(re.sub(r'[^aeiou]', '', name.lower()))
            
        # Gender Feature 4: Hard consonants using general rules of c and g
        if 4 in feat_num:
            features['hard_consts'] = len(re.findall(r'[cg][^eiy]', name.lower()))/2
            
        # Gender Feature 5: Soft consonants using general rules of c and g
        if 5 in feat_num:
            features['soft_consts'] = len(re.findall(r'[cg][eiy]', name.lower()))/2
            
        # Gender Feature 6: Syllable Count of names via textstat
        if 6 in feat_num:
            features['syllable_count'] = textstat.syllable_count(name.lower())
    
        # Gender Feature 7: Name length
        if 7 in feat_num:
            features["length"] = len(name)
        
        # Gender Feature 8: Last two chars
        if 8 in feat_num:
            features["last2letters"] = name[-2:].lower()
            
        # Gender Feature 9: Last three chars
        if 9 in feat_num:
            features["last3letters"] = name[-3:].lower()
    
        # Gender Feature 10: Char Counts (overfits)
        if 10 in feat_num:
            for letter in string.ascii_lowercase:
                features["count_{0}".format(letter)] = name.lower().count(letter)
                
        # Gender Feature 11: Char Booleans (overfits)
        if 11 in feat_num:
            for letter in string.ascii_lowercase:
                features["has_{0}".format(letter)] = letter in name.lower()
        
        
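        # Gender Feature 12: Count of each letter
        # Note: resets the dictionary, discarding any features collected above
        if 12 in feat_num:
            features = {}
            for letter in string.ascii_lowercase:
                features["count(%s)" % letter] = name.lower().count(letter)

        # Gender Feature 13: Presence of each two-letter pair (2-gram)
        # Note: resets the dictionary, discarding any features collected above
        if 13 in feat_num:
            features = {}
            for letter1 in string.ascii_lowercase:
                for letter2 in string.ascii_lowercase:
                    features["has("+letter1+letter2+")"] = (letter1+letter2 in name.lower())

        # Gender Feature 14: First two chars
        if 14 in feat_num:
            features["first2Letters"] = name[0:2].lower()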

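        # Gender Feature 15: First letter, suffixes, letter counts, letter presence and 2-grams
        # Note: resets the dictionary, discarding any features collected above
        if 15 in feat_num:
            features = {}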
            features["firstletter"] = name[0].lower()
            features["lastletter"] = name[-1].lower()
            features["last2letter"] = name[-2:].lower()
            features["last3letter"] = name[-3:].lower()

            letters=list(map(chr, range(ord('a'), ord('z') + 1)))
            for letter1 in letters:
                features["count("+letter1+")"] = name.lower().count(letter1)
                features["has("+letter1+")"] = (letter1 in name.lower())
                # iterate over 2-grams
                for letter2 in letters:

                    features["has("+letter1+letter2+")"] = (letter1+letter2 in name.lower())


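        # Gender Feature 16: Boolean flags for selected 2-grams and suffixes
        # Note: resets the dictionary, discarding any features collected above
        if 16 in feat_num:
            features = {}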
            # has(fo) = True
            features["has(fo)"] = ('fo' in name.lower())
            # has(hu) = True
            features["has(hu)"] = ('hu' in name.lower())
            # has(rv) = True
            features["has(rv)"] = ('rv' in name.lower())    
            # has(rw) = True
            features["has(rw)"] = ('rw' in name.lower()) 
            # has(sp) = True
            features["has(sp)"] = ('sp' in name.lower())

            # lastletter = 'a'
            features["lastletter=a"] = ('a' in name[-1:].lower())
            # lastletter = 'f'
            features["lastletter=f"] = ('f' in name[-1:].lower())
            # lastletter = 'k'
            features["lastletter=k"] = ('k' in name[-1:].lower())

            # last2letter = 'ch'
            features["last2letter=ch"] = ('ch' in name[-2:].lower())
            # last2letter = 'do'
            features["last2letter=do"] = ('do' in name[-2:].lower())
            # last2letter = 'ia'
            features["last2letter=ia"] = ('ia' in name[-2:].lower())
            # last2letter = 'im'
            features["last2letter=im"] = ('im' in name[-2:].lower())
            # last2letter = 'io'
            features["last2letter=io"] = ('io' in name[-2:].lower())
            # last2letter = 'la'
            features["last2letter=la"] = ('la' in name[-2:].lower())
            # last2letter = 'ld'
            features["last2letter=ld"] = ('ld' in name[-2:].lower())
            # last2letter = 'na'
            features["last2letter=na"] = ('na' in name[-2:].lower())
            # last2letter = 'os'
            features["last2letter=os"] = ('os' in name[-2:].lower())
            # last2letter = 'ra'
            features["last2letter=ra"] = ('ra' in name[-2:].lower())
            # last2letter = 'rd'
            features["last2letter=rd"] = ('rd' in name[-2:].lower())
            # last2letter = 'rt'
            features["last2letter=rt"] = ('rt' in name[-2:].lower())
            # last2letter = 'sa'
            features["last2letter=sa"] = ('sa' in name[-2:].lower())
            # last2letter = 'ta'
            features["last2letter=ta"] = ('ta' in name[-2:].lower())
            # last2letter = 'us'
            features["last2letter=us"] = ('us' in name[-2:].lower())

            # last3letter = 'ana'
            features["last3letter=ana"] = ('ana' in name[-3:].lower())    
            # last3letter = u'ard'
            features["last3letter=ard"] = ('ard' in name[-3:].lower())        
            # last3letter = u'ita'
            features["last3letter=ita"] = ('ita' in name[-3:].lower())    
            # last3letter = u'nne'
            features["last3letter=nne"] = ('nne' in name[-3:].lower())    
            # last3letter = u'tta'
            features["last3letter=tta"] = ('tta' in name[-3:].lower())    
        
        return features

    
    def show_errors(self, errors, n=None):
        # Keep only the first n errors when n is given
        if n is not None:
            errors = errors[:n]
        print("list of first %s errors :" % len(errors))
        for (tag, guess, name) in sorted(errors): 
            print('correct=%-8s guess=%-8s name=%-30s' %(tag, guess, name))
        print("")
    
    
    def classifier_report(self,classifier,dataset,feat_num):
        feat_num = int(feat_num)
        dataset_predictions = [classifier.classify(self.get_features(n,feat_num))  for (n, g) in dataset]
        dataset_gold = [g  for (n, g) in dataset]
        cm=ConfusionMatrix(dataset_gold, dataset_predictions)
        print("Confusion Matrix: ")
        print(cm)
        print("")
    
    def fit_model(self,feat_num_start, feat_num):
       
        for i in np.arange(feat_num_start, feat_num+1):
            feat_num =int(i)
            errors = []
            
            # Reshuffle the development set and carve out a fresh dev-test set and
            # training set; the split therefore differs for every feature evaluated
            random.shuffle(development_set_names)
            devtest_names, train_names = development_set_names[0:500], development_set_names[500:]
            
            train_set = [(self.get_features(n,feat_num), g)  for (n, g) in train_names]
            devtest_set = [(self.get_features(n,feat_num), g)  for (n, g) in devtest_names]
            test_set = [(self.get_features(n,feat_num), g)  for (n, g) in test_names] 
            
            classifier = self.model.train(train_set) 
            
            # For errors list
            for (name, tag) in devtest_names:
                guess = classifier.classify(self.get_features(name,feat_num)) 
                if guess != tag: 
                    errors.append((tag, guess, name))    
                    
            # Print the confusion matrix (computed on the training names)
            self.classifier_report(classifier,train_names,i)
            
            # Print errors        
            self.show_errors(errors, 5)
            
            # Print show_most_informative_features for NaiveBayes
            # Print pseudocode for DecisionTree
            if self.model is nltk.NaiveBayesClassifier:
                classifier.show_most_informative_features(5)                
            elif self.model is nltk.DecisionTreeClassifier:
                print(classifier.pseudocode(depth=5))
            
            print("-----------------------------------------")
            print("Accuracy of model {} using feature {} : {}".format(self.model,feat_num,nltk.classify.accuracy(classifier, devtest_set)))
            print("=========================================")
            print("")
            
    def get_sorted_feature_accuracies(self,feat_num_start, feat_num):
        feature_accuracy = {}
        for i in np.arange(feat_num_start, feat_num+1):
            feat_num =int(i)
            errors = []
            
            # devtest-set and training set are constructed
            random.shuffle(development_set_names)
            devtest_names, train_names = development_set_names[0:500], development_set_names[500:]
            
            train_set = [(self.get_features(n,feat_num), g)  for (n, g) in train_names]
            devtest_set = [(self.get_features(n,feat_num), g)  for (n, g) in devtest_names]
            test_set = [(self.get_features(n,feat_num), g)  for (n, g) in test_names] 
            
            classifier = self.model.train(train_set) 
            
            # For errors list
            for (name, tag) in devtest_names:
                guess = classifier.classify(self.get_features(name,feat_num)) 
                if guess != tag: 
                    errors.append((tag, guess, name))    
                    
            
            feature_accuracy[feat_num] = nltk.classify.accuracy(classifier, devtest_set)
        
        
        #sort for accuracy, and then reverse the array to return the array as most accurate to least accurate
        sorted_by_accuracy = sorted(feature_accuracy.items(), key=operator.itemgetter(1))
        return sorted_by_accuracy[::-1]
        

## Naive Bayes

The first classifier we selected is the Naive Bayes classifier.  

We will first use features 1-14 (single features) and examine how the model performs for each. These results will provide the basis for deriving more complex features to refine the model.  
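
To make the idea behind the classifier concrete, here is a minimal pure-Python sketch of naive Bayes over categorical features, using only the last letter and add-one smoothing. It is not NLTK's implementation, whose estimator and smoothing may differ; the class name `TinyNaiveBayes`, the helper `last_letter`, and the eight training names are invented for illustration.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Naive Bayes over dictionaries of categorical features (illustrative only)."""

    def fit(self, labelled_features):
        self.label_counts = Counter()
        # value_counts[label][feature_name][value] -> number of occurrences
        self.value_counts = defaultdict(lambda: defaultdict(Counter))
        self.values_seen = defaultdict(set)
        for features, label in labelled_features:
            self.label_counts[label] += 1
            for fname, value in features.items():
                self.value_counts[label][fname][value] += 1
                self.values_seen[fname].add(value)
        return self

    def log_scores(self, features):
        total = sum(self.label_counts.values())
        scores = {}
        for label, n_label in self.label_counts.items():
            score = math.log(n_label / total)  # log prior
            for fname, value in features.items():
                count = self.value_counts[label][fname][value]
                # add-one smoothing; the extra +1 reserves mass for unseen values
                n_values = len(self.values_seen[fname]) + 1
                score += math.log((count + 1) / (n_label + n_values))
            scores[label] = score
        return scores

    def classify(self, features):
        scores = self.log_scores(features)
        return max(scores, key=scores.get)


def last_letter(name):
    return {"last_letter": name[-1].lower()}


# Toy training data, not drawn from the Names Corpus
toy = [("Anna", "female"), ("Maria", "female"), ("Julia", "female"), ("Sophie", "female"),
       ("Mark", "male"), ("Frank", "male"), ("Erik", "male"), ("Luke", "male")]
model = TinyNaiveBayes().fit([(last_letter(n), g) for n, g in toy])
print(model.classify(last_letter("Olivia")))
print(model.classify(last_letter("Derek")))
```

Each label's score is its log prior plus the log likelihood of every feature value, which is why adding many correlated features (such as per-letter counts) can over-weight the same evidence.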

In [63]:
clf = NLP_Classifier(nltk.NaiveBayesClassifier)
# specify initial and final feature
ranked_features = clf.get_sorted_feature_accuracies(1, 14)

In [64]:
features = {
    1: "Last Letter",
    2: "First Letter",
    3: "Vowels count",
    4: "Hard consonants using general rules of c and g",
    5: "Soft consonants using general rules of c and g",
    6: "Syllable Count of names via textstat",
    7: "Name length",
    8: "Last two letters",
    9: "Last 3 letters",
    10: "Character count",
    11: "Character present",
    12: "Count of each letter",
    13: "Count of pair of letters in the alphabet",
    14: "First 2 letters"
}

print("Top Five Single Features with the Highest Accuracy")
print("---------------------------------------------------------------")
for (feat_num, accuracy) in ranked_features[0:5]:
    print('Feature: %-30s Accuracy: %-8s' %(features[feat_num], accuracy))

print("---------------------------------------------------------------")


Top Five Single Features with the Highest Accuracy
---------------------------------------------------------------
Feature: Last 3 letters                 Accuracy: 0.8     
Feature: Last two letters               Accuracy: 0.77    
Feature: Count of pair of letters in the alphabet Accuracy: 0.758   
Feature: Last Letter                    Accuracy: 0.738   
Feature: Character count                Accuracy: 0.706   
---------------------------------------------------------------


In [65]:
clf.fit_model(1, 14)

Confusion Matrix: 
       |    f      |
       |    e      |
       |    m    m |
       |    a    a |
       |    l    l |
       |    e    e |
-------+-----------+
female |<3563> 807 |
  male |  857<1717>|
-------+-----------+
(row = reference; col = test)


list of first 5 errors :
correct=female   guess=male     name=Allys                         
correct=female   guess=male     name=Linnet                        
correct=male     guess=female   name=Ashley                        
correct=male     guess=female   name=Lennie                        
correct=male     guess=female   name=Timmie                        

Most Informative Features
             last_letter = 'a'            female : male   =     32.3 : 1.0
             last_letter = 'k'              male : female =     29.8 : 1.0
             last_letter = 'f'              male : female =     15.2 : 1.0
             last_letter = 'v'              male : female =     11.2 : 1.0
             last_letter = 'd'              mal


Confusion Matrix: 
       |    f      |
       |    e      |
       |    m    m |
       |    a    a |
       |    l    l |
       |    e    e |
-------+-----------+
female |<3782> 586 |
  male | 1540<1036>|
-------+-----------+
(row = reference; col = test)


list of first 5 errors :
correct=male     guess=female   name=Artie                         
correct=male     guess=female   name=Austen                        
correct=male     guess=female   name=Micheil                       
correct=male     guess=female   name=Rene                          
correct=male     guess=female   name=Salim                         

Most Informative Features
                   has_w = True             male : female =      4.2 : 1.0
                   has_u = True             male : female =      1.8 : 1.0
                   has_p = True             male : female =      1.8 : 1.0
                   has_f = True             male : female =      1.7 : 1.0
                   has_o = True             ma

#### Evaluation of single features for Naive Bayes Classifier

Based on the results for features 1-14, the best-performing features are 1, 8, 9, 10, 11, and 13. Combining these features, we produced features 15 and 16.  

We will now evaluate the classifier with these features.

In [66]:
clf.fit_model(15, 16)

Confusion Matrix: 
       |    f      |
       |    e      |
       |    m    m |
       |    a    a |
       |    l    l |
       |    e    e |
-------+-----------+
female |<3732> 623 |
  male |  503<2086>|
-------+-----------+
(row = reference; col = test)


list of first 5 errors :
correct=female   guess=male     name=Charmain                      
correct=female   guess=male     name=Chickie                       
correct=female   guess=male     name=Cloris                        
correct=female   guess=male     name=Rianon                        
correct=female   guess=male     name=Veronique                     

Most Informative Features
             last2letter = 'na'           female : male   =     95.5 : 1.0
             last2letter = 'la'           female : male   =     72.2 : 1.0
             last2letter = 'ra'           female : male   =     58.8 : 1.0
                 has(hu) = True             male : female =     37.6 : 1.0
             last2letter = 'ia'           femal

Based on these results, we would consider feature 15 to build the best model.
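
The confusion matrices printed by `fit_model` can be reduced to summary metrics by hand. The sketch below does this for the counts in the matrix directly above (rows are reference labels, columns are predictions). Because `fit_model` passes `train_names` to `classifier_report`, these counts cover the 6,944 training names rather than the dev-test set. The helper `summarize_confusion` is a hypothetical name, not part of NLTK.

```python
def summarize_confusion(ff, fm, mf, mm):
    """Summary metrics from a 2x2 confusion matrix.

    ff: female names labelled female    fm: female names labelled male
    mf: male names labelled female      mm: male names labelled male
    """
    total = ff + fm + mf + mm
    return {
        "total": total,
        "accuracy": (ff + mm) / total,
        "female_recall": ff / (ff + fm),
        "female_precision": ff / (ff + mf),
        "male_recall": mm / (mm + mf),
        "male_precision": mm / (mm + fm),
    }


# Counts copied from the confusion matrix above
metrics = summarize_confusion(ff=3732, fm=623, mf=503, mm=2086)
for key, value in metrics.items():
    print(key, round(value, 3))
```

Accuracy measured on the training names is usually more optimistic than accuracy on held-out names, so it should not be compared directly with the dev-test figures.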

## Decision Tree

The second classifier we selected is the Decision Tree classifier.  

Again, we will first use features 1-14 (single features) and examine how the model performs for each. These results will provide the basis for deriving more complex features to refine the model.  

In [9]:
clf_dt = NLP_Classifier(nltk.DecisionTreeClassifier)
# specify initial and final feature
clf_dt.fit_model(1, 14)

Confusion Matrix: 
       |    f      |
       |    e      |
       |    m    m |
       |    a    a |
       |    l    l |
       |    e    e |
-------+-----------+
female |<3590> 802 |
  male |  840<1712>|
-------+-----------+
(row = reference; col = test)


list of first 5 errors :
correct=female   guess=male     name=Clo                           
correct=female   guess=male     name=Dell                          
correct=female   guess=male     name=Row                           
correct=male     guess=female   name=Connie                        
correct=male     guess=female   name=Godfree                       

-----------------------------------------
Accuracy of model <class 'nltk.classify.decisiontree.DecisionTreeClassifier'> using feature 1 : 0.748

Confusion Matrix: 
       |    f      |
       |    e      |
       |    m    m |
       |    a    a |
       |    l    l |
       |    e    e |
-------+-----------+
female |<4170> 206 |
  male | 2218 <350>|
-------+-----------+

Confusion Matrix: 
       |    f      |
       |    e      |
       |    m    m |
       |    a    a |
       |    l    l |
       |    e    e |
-------+-----------+
female |<4150> 219 |
  male | 1175<1400>|
-------+-----------+
(row = reference; col = test)


list of first 5 errors :
correct=female   guess=male     name=Agretha                       
correct=female   guess=male     name=Felipa                        
correct=male     guess=female   name=Barri                         
correct=male     guess=female   name=Lex                           
correct=male     guess=female   name=Reinhold                      

-----------------------------------------
Accuracy of model <class 'nltk.classify.decisiontree.DecisionTreeClassifier'> using feature 12 : 0.716



KeyboardInterrupt: 

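Based on these results, it would seem that features 1, 8, 9, 10, and 12 lead to better outcomes. We will now evaluate the combination features. Note that the cell below calls `fit_model` on `clf`, the Naive Bayes instance, rather than on `clf_dt`, so the output that follows (including the Most Informative Features list) comes from the Naive Bayes classifier.  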

In [67]:
clf.fit_model(15, 16)

Confusion Matrix: 
       |    f      |
       |    e      |
       |    m    m |
       |    a    a |
       |    l    l |
       |    e    e |
-------+-----------+
female |<3748> 615 |
  male |  507<2074>|
-------+-----------+
(row = reference; col = test)


list of first 5 errors :
correct=female   guess=male     name=Agnes                         
correct=female   guess=male     name=Hedy                          
correct=female   guess=male     name=Ivory                         
correct=female   guess=male     name=Kirstin                       
correct=male     guess=female   name=Kenneth                       

Most Informative Features
             last2letter = 'na'           female : male   =     92.8 : 1.0
             last2letter = 'la'           female : male   =     70.6 : 1.0
             last2letter = 'ia'           female : male   =     37.5 : 1.0
                 has(hu) = True             male : female =     36.6 : 1.0
             last2letter = 'ra'           femal

### Model Selection (Choose Best Candidate) 

Now that we have built the best model with both classifiers, we will test the model against the test data set and compare the results.  
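
One way to frame the final comparison is to score the same fitted classifier on both held-out sets and look at the gap. The sketch below uses a hand-written last-letter rule as a stand-in for a trained model; `rule_classifier`, `accuracy`, and the two tiny name lists are placeholders invented for illustration, and the resulting numbers say nothing about the real corpus.

```python
def accuracy(classify, labelled_names):
    """Fraction of (name, label) pairs for which classify(name) equals label."""
    correct = sum(1 for name, label in labelled_names if classify(name) == label)
    return correct / len(labelled_names)


def rule_classifier(name):
    # Stand-in model: names ending in a, e or i are guessed female
    return "female" if name[-1].lower() in "aei" else "male"


# Placeholder samples, not drawn from the notebook's actual splits
devtest_sample = [("Anna", "female"), ("Sophie", "female"), ("Mark", "male"), ("Luke", "male")]
test_sample = [("Maria", "female"), ("Ingrid", "female"), ("Erik", "male"), ("Jesse", "male")]

devtest_acc = accuracy(rule_classifier, devtest_sample)
test_acc = accuracy(rule_classifier, test_sample)
print("dev-test accuracy:", devtest_acc)
print("test accuracy:    ", test_acc)
print("gap:              ", devtest_acc - test_acc)
```

Since features are tuned against dev-test names, the dev-test score tends to be somewhat optimistic, so a modestly lower score on the untouched test set would not be surprising.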


#### Check the model's final performance on the test set. 

#### How does the performance on the test set compare to the performance on the dev-test set? 

#### Is this what you'd expect?