## Project 3

This is a collaborative project conducted by the Fall 2017 students of DATA 620 at The City University of New York, in partial fulfillment of the requirements for the MS in Data Science degree.

### Problem Description

This is a Team Project! For this project, please work with the entire class as one collaborative group! Your project should be submitted (as an IPython Notebook via GitHub) by end of day on Monday, October 25th. The group should present their code and findings in our meet-up on Tuesday, October 26th. The ability to be an effective member of a virtual team is highly valued in the data science job market.

Using any of the three classifiers described in chapter 6 of Natural Language Processing with Python, and any features you can think of, build the best name gender classifier you can. Begin by splitting the Names Corpus into three subsets: 500 words for the test set, 500 words for the dev-test set, and the remaining 6900 words for the training set. Then, starting with the example name gender classifier, make incremental improvements. Use the dev-test set to check your progress. Once you are satisfied with your classifier, check its final performance on the test set. How does the performance on the test set compare to the performance on the dev-test set? Is this what you'd expect?

Source: Natural Language Processing with Python, exercise 6.10.2.

### Contributors Include

* Joy Payton
* Keith Folsom
* Sonya Hong


### First, Obtain the Corpus

Note: If you have not already downloaded it, run nltk.download('names') to get access to the Names Corpus.

In [6]:
import nltk
from nltk.corpus import names
import random
import numpy as np
from nltk.metrics import *
import re
import string
from textstat.textstat import textstat
#nltk.download('names')

In [7]:
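# Build one list of (name, gender) pairs; this rebinds `names` from the corpus reader.
names = ([(name, 'male') for name in names.words('male.txt')] +
         [(name, 'female') for name in names.words('female.txt')])

# random.shuffle() works in place and returns None, so this assignment would set
# names to None. Left disabled, the list stays ordered: male names first, then female.
#names = random.shuffle(names)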

In [8]:
test_names, devtest_names, train_names = names[0:500], names[500:1000], names[1000:]

In [9]:
# Confirm the size of the three subsets
print("Training Set = {}".format(len(train_names)))
print("Dev-Test Set = {}".format(len(devtest_names)))
print("Test Set = {}".format(len(test_names)))

Training Set = 6944
Dev-Test Set = 500
Test Set = 500
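
Because the combined list is ordered by gender (male names first), slicing it directly places only male names in the test and dev-test sets. A minimal sketch of a shuffled split follows, assuming a fixed seed for reproducibility. The `split_names` helper, the seed value, and the toy name list are illustrative and not part of the original notebook.

```python
import random

def split_names(labeled_names, n_test=500, n_devtest=500, seed=620):
    """Shuffle a copy of labeled_names and split it into test, dev-test and training lists."""
    shuffled = list(labeled_names)          # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)   # in-place shuffle with a private, seeded RNG
    test = shuffled[:n_test]
    devtest = shuffled[n_test:n_test + n_devtest]
    train = shuffled[n_test + n_devtest:]
    return test, devtest, train

# Toy stand-in for the Names Corpus: ordered male first, then female
toy_names = ([("male_%d" % i, "male") for i in range(30)] +
             [("female_%d" % i, "female") for i in range(50)])

test, devtest, train = split_names(toy_names, n_test=10, n_devtest=10)
print(len(test), len(devtest), len(train))  # 10 10 60
```

Calling `random.shuffle` (or the `Random.shuffle` method used here) for its side effect, rather than assigning its return value, avoids ending up with `None` in place of the list.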


In [10]:
class NLP_Classifier():
    def __init__(self,model):
        self.model = model
        #self.feat_num = feat_num
        
        #train_set = [(gender_features3(n), g)  for (n, g) in train_names]
        #devtest_set = [(gender_features3(n), g)  for (n, g) in devtest_names]
        #test_set = [(gender_features3(n), g)  for (n, g) in test_names]


    def get_features(self,name,feat_num):
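        '''
        Parameters:
            name - name (string) to extract features from
            feat_num - iterable collection of integers specifying features; a single int is also accepted
                1: last letter
                2: first letter
                3: Vowel counts
                4: Hard consonant count
                5: Soft consonant count
                6: Syllable Count
                7: Name length
                8: Last two chars
                9: Last three chars
                10: char count --> feature for all alpha chars
                11: char present --> feature for all alpha chars (boolean)
                12: char count for all alpha chars (discards features set above)
                13: letter pair present, for all pairs of alpha chars (discards features set above)
                14: first two chars
                15: first/last letter, last two/three chars, char counts, char and letter-pair presence (discards features set above)
                16: hand-picked letter-pair and suffix indicators (discards features set above)
        Returns:
            features: a dictionary of extracted features
        '''
        features = {}

        # Wrap a single int in a tuple so the membership tests below work
        if type(feat_num) is int:
            feat_num = (0, feat_num)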
       
        # Gender Feature 1: Last letter - book example
        if 1 in feat_num:
            features['last_letter'] = name[-1].lower()
            
        # Gender Feature 2: First letter - most names beginning with a vowel --> females
        if 2 in feat_num:
            features['first_letter'] = name[0].lower()
            
        # Gender Feature 3: Vowel Counts
        if 3 in feat_num:
            features['vowel_count'] = len(re.sub(r'[^aeiou]', '', name.lower()))
            
        # Gender Feature 4: Hard consonants using general rules of c and g
        if 4 in feat_num:
            features['hard_consts'] = len(re.findall(r'[cg][^eiy]', name.lower()))/2
            
        # Gender Feature 5: Soft consonants using general rules of c and g
        if 5 in feat_num:
            features['soft_consts'] = len(re.findall(r'[cg][eiy]', name.lower()))/2
            
        # Gender Feature 6: Syllable Count of names via textstat
        if 6 in feat_num:
            features['syllable_count'] = textstat.syllable_count(name.lower())
    
        # Gender Feature 7: Name length
        if 7 in feat_num:
            features["length"] = len(name)
        
        # Gender Feature 8: Last two chars
        if 8 in feat_num:
            features["last2letters"] = name[-2:].lower()
            
        # Gender Feature 9: Last three chars
        if 9 in feat_num:
            features["last3letters"] = name[-3:].lower()
    
        # Gender Feature 10: Char Counts (overfits)
        if 10 in feat_num:
            for letter in string.ascii_lowercase:
                features["count_{0}".format(letter)] = name.lower().count(letter)
                
        # Gender Feature 11: Char Booleans (overfits)
        if 11 in feat_num:
            for letter in string.ascii_lowercase:
                features["has_{0}".format(letter)] = letter in name.lower()
        
        
        # Gender Feature 12: Char counts, starting from an empty feature dictionary
        if 12 in feat_num:
            features = {}
            letters=list(map(chr, range(ord('a'), ord('z') + 1)))
            for letter in letters:
                features["count(%s)" % letter] = name.lower().count(letter)


        # Gender Feature 13: Letter-pair booleans for every pair of alpha chars
        if 13 in feat_num:
            features = {}
            letters=list(map(chr, range(ord('a'), ord('z') + 1)))
            for letter1 in letters:
                for letter2 in letters:
                    features["has("+letter1+letter2+")"] = (letter1+letter2 in name.lower())

        # Gender Feature 14: First two chars
        if 14 in feat_num:
            features["first2Letters"]=name[0:2].lower()

        # Gender Feature 15: Combined first/last letters, suffixes, char counts and letter-pair booleans
        if 15 in feat_num:
            features = {}
            features["firstletter"] = name[0].lower()
            features["lastletter"] = name[-1].lower()
            features["last2letter"] = name[-2:].lower()
            features["last3letter"] = name[-3:].lower()

            letters=list(map(chr, range(ord('a'), ord('z') + 1)))
            for letter1 in letters:
                features["count("+letter1+")"] = name.lower().count(letter1)
                features["has("+letter1+")"] = (letter1 in name.lower())
                # iterate over 2-grams
                for letter2 in letters:

                    features["has("+letter1+letter2+")"] = (letter1+letter2 in name.lower())


        # Gender Feature 16: Hand-picked letter-pair and suffix indicators
        if 16 in feat_num:
            # Start from an empty feature dictionary
            features = {}
            # has(fo) = True
            features["has(fo)"] = ('fo' in name.lower())
            # has(hu) = True
            features["has(hu)"] = ('hu' in name.lower())
            # has(rv) = True
            features["has(rv)"] = ('rv' in name.lower())    
            # has(rw) = True
            features["has(rw)"] = ('rw' in name.lower()) 
            # has(sp) = True
            features["has(sp)"] = ('sp' in name.lower())

            # lastletter = 'a'
            features["lastletter=a"] = ('a' in name[-1:].lower())
            # lastletter = 'f'
            features["lastletter=f"] = ('f' in name[-1:].lower())
            # lastletter = 'k'
            features["lastletter=k"] = ('k' in name[-1:].lower())

            # last2letter = 'ch'
            features["last2letter=ch"] = ('ch' in name[-2:].lower())
            # last2letter = 'do'
            features["last2letter=do"] = ('do' in name[-2:].lower())
            # last2letter = 'ia'
            features["last2letter=ia"] = ('ia' in name[-2:].lower())
            # last2letter = 'im'
            features["last2letter=im"] = ('im' in name[-2:].lower())
            # last2letter = 'io'
            features["last2letter=io"] = ('io' in name[-2:].lower())
            # last2letter = 'la'
            features["last2letter=la"] = ('la' in name[-2:].lower())
            # last2letter = 'ld'
            features["last2letter=ld"] = ('ld' in name[-2:].lower())
            # last2letter = 'na'
            features["last2letter=na"] = ('na' in name[-2:].lower())
            # last2letter = 'os'
            features["last2letter=os"] = ('os' in name[-2:].lower())
            # last2letter = 'ra'
            features["last2letter=ra"] = ('ra' in name[-2:].lower())
            # last2letter = 'rd'
            features["last2letter=rd"] = ('rd' in name[-2:].lower())
            # last2letter = 'rt'
            features["last2letter=rt"] = ('rt' in name[-2:].lower())
            # last2letter = 'sa'
            features["last2letter=sa"] = ('sa' in name[-2:].lower())
            # last2letter = 'ta'
            features["last2letter=ta"] = ('ta' in name[-2:].lower())
            # last2letter = 'us'
            features["last2letter=us"] = ('us' in name[-2:].lower())

            # last3letter = 'ana'
            features["last3letter=ana"] = ('ana' in name[-3:].lower())    
            # last3letter = u'ard'
            features["last3letter=ard"] = ('ard' in name[-3:].lower())        
            # last3letter = u'ita'
            features["last3letter=ita"] = ('ita' in name[-3:].lower())    
            # last3letter = u'nne'
            features["last3letter=nne"] = ('nne' in name[-3:].lower())    
            # last3letter = u'tta'
            features["last3letter=tta"] = ('tta' in name[-3:].lower())    
        
        return features

    
    def show_errors(self, errors, n=None):
        if n is not None: errors = errors[:n]          
        for (tag, guess, name) in sorted(errors): 
            print('correct=%-8s guess=%-8s name=%-30s' %(tag, guess, name))
        return None 
    
    
    def classifier_report(self,classifier,dataset,feat_num):
        feat_num = int(feat_num)
        dataset_predictions = [classifier.classify(self.get_features(n,feat_num))  for (n, g) in dataset]
        dataset_gold = [g  for (n, g) in dataset]
        cm=ConfusionMatrix(dataset_gold, dataset_predictions)
        print(cm)
    
    def fit_model(self,feat_num):
        # np.arange excludes the stop value, so fit_model(16) evaluates features 1 through 15
        for i in np.arange(1,feat_num):
            feat_num =int(i)
            errors = [] 
            
            train_set = [(self.get_features(n,feat_num), g)  for (n, g) in train_names]
            devtest_set = [(self.get_features(n,feat_num), g)  for (n, g) in devtest_names]
            # Built here but not evaluated inside this loop
            test_set = [(self.get_features(n,feat_num), g)  for (n, g) in test_names] 
            classifier = self.model.train(train_set) 
            
            # Collect the misclassified dev-test names
            for (name, tag) in devtest_names:
                guess = classifier.classify(self.get_features(name,feat_num)) 
                if guess != tag: 
                    errors.append((tag, guess, name))    
                    
            # Called with n=0, so no errors are printed; raise n to inspect them
            self.show_errors(errors, 0)
            
            # Print the confusion matrix, computed on the training names
            self.classifier_report(classifier,train_names,i)
            
            # Only for NaiveBayes
            if self.model ==nltk.NaiveBayesClassifier:
                classifier.show_most_informative_features(5)
            
            # Accuracy is measured on the dev-test set
            print("Accuracy of model {} using feature {}:{}".format(self.model,feat_num,nltk.classify.accuracy(classifier, devtest_set)))
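
The chain of `if n in feat_num` checks in `get_features` could also be organised as a lookup table that maps each feature number to a small extractor function. The sketch below illustrates the idea for four of the features only. `EXTRACTORS` and `extract_features` are hypothetical names, and this is an alternative layout under those assumptions, not the notebook's implementation.

```python
import re

# Hypothetical lookup table: feature number -> function returning a dict of features
EXTRACTORS = {
    1: lambda name: {"last_letter": name[-1]},
    2: lambda name: {"first_letter": name[0]},
    3: lambda name: {"vowel_count": len(re.findall(r"[aeiou]", name))},
    8: lambda name: {"last2letters": name[-2:]},
}

def extract_features(name, feat_nums):
    """Merge the feature dicts of every requested extractor for one name."""
    name = name.lower()
    features = {}
    for num in feat_nums:
        features.update(EXTRACTORS[num](name))
    return features

print(extract_features("Sophia", [1, 3, 8]))
# {'last_letter': 'a', 'vowel_count': 3, 'last2letters': 'ia'}
```

With this layout, adding a feature means adding one table entry, and combining features never silently discards the ones already collected.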
            
        

In [11]:
clf = NLP_Classifier(nltk.NaiveBayesClassifier)
clf.fit_model(16)

       |    f      |
       |    e      |
       |    m    m |
       |    a    a |
       |    l    l |
       |    e    e |
-------+-----------+
female |<4665> 336 |
  male | 1111 <832>|
-------+-----------+
(row = reference; col = test)

Most Informative Features
             last_letter = 'a'            female : male   =     37.4 : 1.0
             last_letter = 'k'              male : female =     30.4 : 1.0
             last_letter = 'f'              male : female =     16.9 : 1.0
             last_letter = 'p'              male : female =     14.9 : 1.0
             last_letter = 'v'              male : female =     11.8 : 1.0
Accuracy of model <class 'nltk.classify.naivebayes.NaiveBayesClassifier'> using feature 1:0.468
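
The confusion matrix printed for feature 1 can be turned into summary figures by hand. The sketch below hard-codes the four counts from that matrix and derives overall accuracy and per-class recall. The variable names are illustrative and not part of the notebook.

```python
# Counts copied from the feature-1 confusion matrix (rows = reference, cols = predicted)
female_as_female, female_as_male = 4665, 336
male_as_female, male_as_male = 1111, 832

total = female_as_female + female_as_male + male_as_female + male_as_male
accuracy = (female_as_female + male_as_male) / total
female_recall = female_as_female / (female_as_female + female_as_male)
male_recall = male_as_male / (male_as_female + male_as_male)

print("names counted: %d" % total)               # 6944
print("accuracy: %.3f" % accuracy)               # 0.792
print("female recall: %.3f" % female_recall)     # 0.933
print("male recall: %.3f" % male_recall)         # 0.428
```

The total of 6944 equals the training set size, which indicates that the matrix describes the training names, while the accuracy printed after it (0.468) is measured on the dev-test set. The gap is less surprising once male recall is considered: the dev-test slice comes from the male portion of the unshuffled list, so its accuracy behaves like male recall rather than overall accuracy.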
       |    f      |
       |    e      |
       |    m    m |
       |    a    a |
       |    l    l |
       |    e    e |
-------+-----------+
female |<4795> 206 |
  male | 1585 <358>|
-------+-----------+
(row = reference; col = test)

Most Informative Feat

       |    f      |
       |    e      |
       |    m    m |
       |    a    a |
       |    l    l |
       |    e    e |
-------+-----------+
female |<4523> 478 |
  male |  875<1068>|
-------+-----------+
(row = reference; col = test)

Most Informative Features
                 has(rv) = True             male : female =     33.4 : 1.0
                 has(hu) = True             male : female =     33.4 : 1.0
                 has(sp) = True             male : female =     19.7 : 1.0
                 has(lt) = True             male : female =     17.0 : 1.0
                 has(tc) = True             male : female =     16.3 : 1.0
Accuracy of model <class 'nltk.classify.naivebayes.NaiveBayesClassifier'> using feature 13:0.322
       |    f      |
       |    e      |
       |    m    m |
       |    a    a |
       |    l    l |
       |    e    e |
-------+-----------+
female |<4687> 314 |
  male | 1282 <661>|
-------+-----------+
(row = reference; col = test)

Most Informative Fea