Assignment Detail:

For this project, please work with the entire class as one collaborative group! Your project should be submitted (as a Jupyter Notebook via GitHub) by end of the due date. The group should present their code and findings in our meetup. The ability to be an effective member of a virtual team is highly valued in the data science job market.

Using any of the three classifiers described in chapter 6 of Natural Language Processing with Python, and any features you can think of, build the best name gender classifier you can.
Begin by splitting the Names Corpus into three subsets: 500 words for the test set, 500 words for the devtest set, and the remaining 6900 words for the training set. Then, starting with the example name gender classifier, make incremental improvements. Use the dev-test set to check your progress. Once you are satisfied with your classifier, check its final performance on the test set. 

How does the performance on the test set compare to the performance on the dev-test set? Is this what you'd expect?

Source: Natural Language Processing with Python, exercise 6.10.2.
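
The three-way split described in the exercise can be sketched as follows. This is a minimal illustration, not the notebook's own code: the helper name `three_way_split`, the fixed seed, and the toy data are all assumptions made for the example.

```python
import random

def three_way_split(items, n_test=500, n_devtest=500, seed=0):
    """Shuffle a copy of items and return (train, devtest, test)."""
    shuffled = list(items)
    # A seeded Random instance keeps the split reproducible (assumed choice).
    random.Random(seed).shuffle(shuffled)
    test = shuffled[:n_test]
    devtest = shuffled[n_test:n_test + n_devtest]
    train = shuffled[n_test + n_devtest:]
    return train, devtest, test

# Hypothetical stand-in for the labeled Names Corpus.
toy_names = [("name%d" % i, "female" if i % 2 else "male") for i in range(2000)]
train, devtest, test = three_way_split(toy_names)
print(len(train), len(devtest), len(test))
```

Keeping the three slices disjoint is the point of the exercise: the dev-test set guides feature changes, and the test set is only consulted once at the end.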


Building data set: Using NLTK, we build a labeled data set of names, called "Gender_names" here.
    

In [701]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import nltk
nltk.download('names')

[nltk_data] Downloading package names to
[nltk_data]     C:\Users\rajwa\AppData\Roaming\nltk_data...
[nltk_data]   Package names is already up-to-date!


True

In [702]:
from nltk.corpus import names
import random
from nltk.classify import apply_features

#Building the Gender_names data set
Gender_names = ([(name, 'male') for name in names.words('male.txt')] +
[(name, 'female') for name in names.words('female.txt')])
random.shuffle(Gender_names)

In [703]:
Gender_names[0:10] #show the names with gender. 

[('Avivah', 'female'),
 ('Emmaline', 'female'),
 ('Rori', 'female'),
 ('Shannon', 'male'),
 ('Celia', 'female'),
 ('Gay', 'female'),
 ('Charin', 'female'),
 ('Pollyanna', 'female'),
 ('Brena', 'female'),
 ('Tobye', 'female')]

Gender Identification:
Male and female names have distinct characteristics: names ending in a, e, and i are likely to be female, while names ending in k, o, r, s, and t are likely to be male. We build a classifier to model these differences more precisely, starting with the last letter of a given name. (Source: NLP book, pages 222-223)
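
One way to check the last-letter pattern before training anything is to tally last letters by gender. The sketch below assumes a small hand-picked list of labeled names instead of the full corpus; `last_letter_counts` is a hypothetical helper, not an NLTK function.

```python
from collections import Counter, defaultdict

def last_letter_counts(labeled_names):
    """Map each final letter to a Counter of the genders it appears with."""
    counts = defaultdict(Counter)
    for name, gender in labeled_names:
        counts[name[-1].lower()][gender] += 1
    return counts

# Small illustrative sample; the real analysis would use Gender_names.
sample = [("Celia", "female"), ("Brena", "female"), ("Rori", "female"),
          ("Shannon", "male"), ("Romeo", "male"), ("Mark", "male")]
counts = last_letter_counts(sample)
for letter in sorted(counts):
    print(letter, dict(counts[letter]))
```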

In [704]:
def gender_features(word):
    return {'last_letter': word[-1]}

In [705]:
gender_features('Justine')

{'last_letter': 'e'}

Next, we use the feature extractor to process the Gender_names data and divide the resulting list of feature sets into a training set and a test set.

In [706]:
featuresets = [(gender_features(n), g) for (n,g) in Gender_names]
train_set, test_set = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)
accuracy_base= nltk.classify.accuracy(classifier, test_set)
print ("Accuracy with last letter [classifier] check :{}".format(accuracy_base))

Accuracy with last letter [classifier] check :0.786


In [707]:
featuresets[2]

({'last_letter': 'i'}, 'female')

In [708]:
print (classifier.classify(gender_features('Romeo'))) #male
print (classifier.classify(gender_features('Trinity'))) #female

male
female


In [709]:
classifier.show_most_informative_features(5)

Most Informative Features
             last_letter = 'a'            female : male   =     33.9 : 1.0
             last_letter = 'k'              male : female =     32.7 : 1.0
             last_letter = 'f'              male : female =     16.6 : 1.0
             last_letter = 'v'              male : female =     11.2 : 1.0
             last_letter = 'p'              male : female =     11.2 : 1.0


In [710]:
df_gf = pd.DataFrame(featuresets, columns=['Letter','Gender'])
df_gf['LastLetter'] = df_gf['Letter'].apply(lambda x: x['last_letter'])
df_gf = df_gf[['LastLetter','Gender']]
df_gf.head()

  LastLetter  Gender
0          h  female
1          e  female
2          i  female
3          n    male
4          a  female


In [711]:
df_gf.shape

(7944, 2)

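When working with large corpora, building a single list that holds every feature set can use a lot of memory. In that case we can use the function
nltk.classify.apply_features, which returns an object that acts like a list but does not store all the feature sets in memory.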
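
The idea behind a lazy feature list can be sketched in plain Python. This is an illustration of the concept only, not NLTK's implementation: the class `LazyFeatureList` is hypothetical and supports integer indexing and iteration, nothing more.

```python
class LazyFeatureList:
    """Compute (features, label) pairs on access instead of storing them."""

    def __init__(self, feature_func, labeled_items):
        self._func = feature_func
        self._items = labeled_items

    def __len__(self):
        return len(self._items)

    def __getitem__(self, index):
        # Integer indices only; slices are not handled in this sketch.
        item, label = self._items[index]
        return (self._func(item), label)

    def __iter__(self):
        for item, label in self._items:
            yield (self._func(item), label)

def last_letter(word):
    return {"last_letter": word[-1]}

lazy = LazyFeatureList(last_letter, [("Romeo", "male"), ("Trinity", "female")])
print(len(lazy), lazy[1])
```

Only the raw names are held in memory; each feature dictionary is rebuilt whenever an element is read, trading some computation for a smaller footprint.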

In [712]:
### DO WE need THIS CODE
train_set = apply_features(gender_features, Gender_names[500:])
test_set = apply_features(gender_features, Gender_names[:500])

Choosing the Right Features:
Selecting relevant features and deciding how to encode them is very important for building a good model.

In [713]:
def gender_features_az(name):
    features = {}
    features["firstletter"] = name[0].lower()
    features["lastletter"] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count(%s)" % letter] = name.lower().count(letter)
        features["has(%s)" % letter] = (letter in name.lower())
    return features

In [714]:
# gender_features_az('John')


Accuracy of naive Bayes classifier using the feature extractor: 

In [715]:
featuresets_az = [(gender_features_az(n), g) for (n,g) in Gender_names]
train_set, test_set = featuresets_az[500:], featuresets_az[:500]
classifier_az = nltk.NaiveBayesClassifier.train(train_set)
accuracy_az= nltk.classify.accuracy(classifier_az, test_set)
print ("Accuracy with Last Letter  [classifier] check :{}".format(accuracy_base))
print ("Accuracy First and Last Letter [classifier_az] check :{}".format(accuracy_az))

Accuracy with Last Letter  [classifier] check :0.786
Accuracy First and Last Letter [classifier_az] check :0.8


In [716]:
print(featuresets_az[1])
list(featuresets_az[1][0].values())[0]


({'firstletter': 'e', 'lastletter': 'e', 'count(a)': 1, 'has(a)': True, 'count(b)': 0, 'has(b)': False, 'count(c)': 0, 'has(c)': False, 'count(d)': 0, 'has(d)': False, 'count(e)': 2, 'has(e)': True, 'count(f)': 0, 'has(f)': False, 'count(g)': 0, 'has(g)': False, 'count(h)': 0, 'has(h)': False, 'count(i)': 1, 'has(i)': True, 'count(j)': 0, 'has(j)': False, 'count(k)': 0, 'has(k)': False, 'count(l)': 1, 'has(l)': True, 'count(m)': 2, 'has(m)': True, 'count(n)': 1, 'has(n)': True, 'count(o)': 0, 'has(o)': False, 'count(p)': 0, 'has(p)': False, 'count(q)': 0, 'has(q)': False, 'count(r)': 0, 'has(r)': False, 'count(s)': 0, 'has(s)': False, 'count(t)': 0, 'has(t)': False, 'count(u)': 0, 'has(u)': False, 'count(v)': 0, 'has(v)': False, 'count(w)': 0, 'has(w)': False, 'count(x)': 0, 'has(x)': False, 'count(y)': 0, 'has(y)': False, 'count(z)': 0, 'has(z)': False}, 'female')


'e'

In [717]:
df_gf_az = pd.DataFrame(featuresets_az, columns=['Letter','Gender'])
df_gf_az['Duo'] = df_gf_az['Letter'].apply(lambda x: list(x.values())[0] + list(x.values())[1]) 
df_gf_az = df_gf_az[['Duo','Gender']]
df_gf_az.head()

  Duo  Gender
0  ah  female
1  ee  female
2  ri  female
3  sn    male
4  ca  female


In [718]:
print(df_gf_az.shape)


(7944, 2)


The result above shows that the accuracy of the classifier that also counts letters (0.800) is about 1.4 percentage points higher
than the accuracy of the classifier that only looks at the final letter of each name (0.786).

### Error analysis

First, we select a development set, containing the corpus data for creating the model. This development set is then subdivided into the *training set* and the *dev-test* set.
<br>
<br>
**devtest_names :** Records at indices 500 to 1499 <br>
**train_names :** Records from index 1500 onward <br>
**test_names :** Records at indices 0 to 499 <br>

In [719]:
train_names = Gender_names[1500:]
devtest_names = Gender_names[500:1500]
test_names = Gender_names[:500]

We have divided the corpus into the appropriate datasets. Next, we build a model using the training
set and run it on the dev-test set.

#### Running Base Gender Classifier with Last Letter

In [720]:
train_set = [(gender_features(n), g) for (n,g) in train_names]
devtest_set = [(gender_features(n), g) for (n,g) in devtest_names]
test_set = [(gender_features(n), g) for (n,g) in test_names]
classifier1 = nltk.NaiveBayesClassifier.train(train_set)
print (nltk.classify.accuracy(classifier1, devtest_set))

0.764


#### Running Gender Classifier with First and Last Letter

In [721]:
train_set = [(gender_features_az(n), g) for (n,g) in train_names]
devtest_set = [(gender_features_az(n), g) for (n,g) in devtest_names]
test_set = [(gender_features_az(n), g) for (n,g) in test_names]
classifier2 = nltk.NaiveBayesClassifier.train(train_set)
print (nltk.classify.accuracy(classifier2, devtest_set))

0.787


Using the dev-test set, we can generate a list of the errors that the classifier makes when
predicting name genders:<br>
Let's use the 1000 names in devtest_names to check the predictions of classifier1, which uses the feature extractor gender_features.

In [722]:
errors = []
for (name, tag) in devtest_names:
    guess = classifier1.classify(gender_features(name))
    if guess != tag:
        errors.append( (tag, guess, name) )

errors = sorted(errors)

In [723]:
len(errors)

236

The names classifier that we have built makes **236 errors** on the **devtest_names** set. The first few are listed below: <br>



In [724]:
for (tag, guess, name) in errors[0:10]: # doctest: +ELLIPSIS +NORMALIZE_WHITESPACE
    print ('correct=%-8s guess=%-8s name=%-30s' % (tag, guess, name))

correct=female   guess=male     name=Allsun                        
correct=female   guess=male     name=Alys                          
correct=female   guess=male     name=Arden                         
correct=female   guess=male     name=Ardis                         
correct=female   guess=male     name=Astrid                        
correct=female   guess=male     name=Avis                          
correct=female   guess=male     name=Beatriz                       
correct=female   guess=male     name=Bette-Ann                     
correct=female   guess=male     name=Bev                           
correct=female   guess=male     name=Bridget                       


We note that names ending in <b>l</b> are mostly male, but names ending in <b>el</b> can be female.
<br>Similarly, names ending in <b>n</b> are mostly male, but names ending in <b>nn/an</b> can be female.
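
Patterns like these are easier to spot when the errors are grouped by suffix. The sketch below is one possible way to do that, assuming errors are stored as `(tag, guess, name)` tuples as in the cell above; `suffix_error_counts` is a hypothetical helper and the sample is only a handful of entries from the listing.

```python
from collections import Counter

def suffix_error_counts(errors, n=2):
    """Count misclassified names by (n-letter suffix, correct tag)."""
    tally = Counter()
    for tag, guess, name in errors:
        tally[(name[-n:].lower(), tag)] += 1
    return tally

# A few entries copied from the error listing above.
sample_errors = [("female", "male", "Alys"), ("female", "male", "Ardis"),
                 ("female", "male", "Avis"), ("female", "male", "Arden"),
                 ("female", "male", "Astrid")]
tally = suffix_error_counts(sample_errors)
for (suffix, tag), count in tally.most_common():
    print(suffix, tag, count)
```

Suffixes that show up repeatedly with the same correct tag are candidates for a new feature.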

We will now build another model that considers the last two letters of each name, and train it on the same data.

In [725]:
# Collect last Two Letters from the Words 
def gender_features_tls(word):
    return {'suffix1': word[-1:], 'suffix2': word[-2:]}

In [726]:
gender_features_tls("John")

{'suffix1': 'n', 'suffix2': 'hn'}

In [727]:
train_set = [(gender_features_tls(n), g) for (n,g) in train_names]
devtest_set = [(gender_features_tls(n), g) for (n,g) in devtest_names]
classifier3 = nltk.NaiveBayesClassifier.train(train_set)
print (nltk.classify.accuracy(classifier3, devtest_set))

0.792


Rebuilding the classifier with the new feature extractor, we see that the performance
on the dev-test dataset improves by almost three percentage points, from 76.4% to 79.2%.

I have created another feature extractor here that uses the first and last letter. It also takes the prefix and suffix (the first and last three letters for names longer than four letters, otherwise the first and last two) and records which consonant clusters, if any, are present.

In [728]:
# NEED SOURCE FOR cons_clusters
def class_gender_features4(name):
    features = {}
    temp_name = name
    cons_clusters = ["bl", "br", "ch", "cl", "cr", "dr", "fl", "fr", 
                     "gl", "gr", "pl", "pr", "sc", "sh", "sk", "sl", 
                     "sm", "sn", "sp", "st", "sw", "th", "tr", "tw", 
                     "wh", "wr", "sch", "scr", "shr", "sph", "spl", 
                     "spr", "squ", "str", "thr"]
    features["firstletter"] = name[0].lower() 
    features["lastletter"] = name[-1].lower() 
    features["prefix"] = name[:3].lower() if len(name) > 4 else name[:2].lower() 
    features["suffix"] = name[-3:].lower() if len(name) > 4 else name[-2:].lower()
    clusters = []
    for cluster in cons_clusters[::-1]:
        if cluster in temp_name:
            temp_name = temp_name.replace(cluster, "")
            clusters.append(cluster)
    features["consonant_clusters_1"] = clusters[0] if len(clusters) > 0 else None
    features["consonant_clusters_2"] = clusters[1] if len(clusters) > 1 else None
    features["consonant_clusters_3"] = clusters[2] if len(clusters) > 2 else None
    return features

In [729]:
class_gender_features4("RAJWAN")

{'firstletter': 'r',
 'lastletter': 'n',
 'prefix': 'raj',
 'suffix': 'wan',
 'consonant_clusters_1': None,
 'consonant_clusters_2': None,
 'consonant_clusters_3': None}

In [952]:
classifier_4.show_most_informative_features(10)

Most Informative Features
              lastletter = 'a'            female : male   =     34.8 : 1.0
              lastletter = 'k'              male : female =     18.9 : 1.0
                  suffix = 'nne'          female : male   =     18.9 : 1.0
                  suffix = 'ita'          female : male   =     15.3 : 1.0
                  suffix = 'tta'          female : male   =     14.4 : 1.0
              lastletter = 'o'              male : female =     12.7 : 1.0
                  suffix = 'ard'            male : female =     12.6 : 1.0
                  suffix = 'and'            male : female =     11.5 : 1.0
                  suffix = 'son'            male : female =     11.1 : 1.0
                  prefix = 'dor'          female : male   =     10.5 : 1.0


### TESTING with new Model with Vowel Count

In [None]:
from scipy import spatial

def gender_features_icv(name):
    features = {}
    temp_name = name.lower()
    lenName = len(name)
    if lenName >= 4 :
        features["name_len"] = 1 
    
    features["name_len"] = len(name)
    features["firstletter"] = name[0].lower() 
    features["lastletter"] = name[-1].lower() 
    features["prefix"] = name[:3].lower() if len(name) > 4 else name[:2].lower() 
    features["suffix"] = name[-3:].lower() if len(name) > 4 else name[-2:].lower()
    Vowel = ['a','e','i','o','u']
    Vclusters = []
    dataSetI = []
    dataSetII = []
#     Check if Last 2 Letters are Vowels
    if name[-2:lenName-1] in Vowel :  features["Vowel_l2"] =  1  
    if name[-1:lenName] in Vowel : features["Vowel_l1"] =  1
        #     Find the Cosine Distance of Last 2 Letters 
        #    temp_name[0],temp_name[1],temp_name[-2],temp_name[-1]
    if lenName >= 4 :      
        dataSetI = [
                ord(name[0])-96,ord(name[1])-96,ord(name[-2])-96,ord(name[-3])-96] # This returns Number of Letters
        dataSetII = [#ord(name[-1:lenName])-96,ord(name[-1].lower())-96,
                 ord(name[1])-96,ord(name[2])-96,ord(name[-1])-96,ord(name[-2])-96]  # This returns Number of Letters
    
    else:
        dataSetI = [
                ord(name[0])-96,ord(name[-2])-96] # This returns Number of Letters
        dataSetII = [#ord(name[-1:lenName])-96,ord(name[-1].lower())-96,
                 ord(name[1])-96,ord(name[-1])-96]  # This returns Number of Letters
    
    cos_result = 1 - spatial.distance.cosine(dataSetI, dataSetII)
    features["cos_lt"] = cos_result
    features["Len_vowel"] = 0  
    dataSetI = []
    dataSetII = []
    flag = False
    for vb in Vowel[::-1]:
        if vb in temp_name:
            n_vowels= temp_name.count(vb) # COunt how many times you see Vowels            
            temp_name = temp_name.replace(vb, "")
            Vclusters.append(vb)
            features[vb]=n_vowels
            features["Len_vowel"] = features["Len_vowel"] + n_vowels
            if flag == True:
                dataSetI.append(ord(vb)-96)
                flag = False
            else :
                dataSetII.append(ord(vb)-96)
                flag = True
            
    if features["Len_vowel"]%2==0:
        cos_result = 1 - spatial.distance.cosine(dataSetI, dataSetII)
#         features["cos_lt"] = features["cos_lt"]  + cos_result
           # Find the Cosine Distance of Last 2 Letters 
#         dataSetI = [ord(name[-2:lenName-1])-96,ord(name[0].lower())-96,] # This returns Number of Letters
#         dataSetII = [ord(name[-1:lenName])-96,ord(name[-1].lower())-96,]  # This returns Number of Letters


    return features

In [1277]:
from scipy import spatial

def gender_features_icv(name):
    features = {}
    temp_name = name.lower()
    lenName = len(name)    
    features["name_len"] = len(name)
    features["firstletter"] = name[0].lower() 
    features["lastletter"] = name[-1].lower() 
    features["prefix"] = name[:3].lower() if len(name) > 4 else name[:2].lower() 
    features["suffix"] = name[-3:].lower() if len(name) > 4 else name[-2:].lower()
    Vowel = ['a','e','i','o','u']
    Vclusters = []
    dataSetI = []
    dataSetII = []
#     Check if Last 2 Letters are Vowels
#     if name[-2:lenName-1] in Vowel :  
#         features["Vowel_l2"] =  1
#     else :  
#         features["Vowel_l2"] =  0
#     if name[-1:lenName] in Vowel : 
#         features["Vowel_l1"] =  1 
#     else :
#         features["Vowel_l1"] =  0
#     if lenName >= 4 :    else:       
    cos_result = 1 - spatial.distance.cosine(dataSetI, dataSetII)
   
    features["Len_vowel"] = 0  
    dataSetI = []
    dataSetII = []
    flag = False
    for vb in Vowel[::-1]:
        if vb in temp_name:
            n_vowels= temp_name.count(vb) # COunt how many times you see Vowels            
            temp_name = temp_name.replace(vb, "")
            Vclusters.append(vb)
            features[vb]=n_vowels
#             features["Len_vowel"] = features["Len_vowel"] + n_vowels
            if flag == True:
                dataSetI.append(ord(vb)-96)
                flag = False
            else :
                dataSetII.append(ord(vb)-96)
                flag = True
#     features["Len_Ratio"] = (lenName - features["Len_vowel"])
    #Check if There are two letters repeated next to each other
    ch = [name[i]==name[i+1] for i in range(len(name)) if i <= len(name)-2 ]
#     features["Double"] = ch.count(True)
    return features

In [1283]:
# classifier_icv.classify(gender_features_icv("Rajendra"))
# gender_features_icv("Raj")
temp_name = "Rajiity"


# dataSetI = [ord(temp_name[-2:len(temp_name)-1])-96] # This returns Number of Letters
# dataSetII = [ord(temp_name[-1:len(temp_name)])-96]  # This returns Number of Letters
# cos_result = 1 - spatial.distance.cosine(dataSetI, dataSetII)

# cos_result

# # result
# # temp_name[-2:len(temp_name)-1],temp_name[-1:len(temp_name)]

gender_features_icv("Rajama")
[temp_name[a] for a in range(len(temp_name))]
[(temp_name[a],temp_name[b]) if a <= len(temp_name) else b - 1 
 for a in range(len(temp_name))  for b in range(len(temp_name[a:a+2]))   ]

for i in range(len(temp_name)):
    if i <= len(temp_name)-2:
        print(temp_name[i]==temp_name[i+1])

# len(temp_name),temp_name,temp_name[4]

ch = [temp_name[i]==temp_name[i+1] for i in range(len(temp_name)) if i <= len(temp_name)-2 ]
ch.count(True)

gender_features_icv("Raiijama")
# ch = [temp_name[i]==temp_name[i+1] for i in range(len(temp_name)) if i <= len(temp_name)-2 ]
# ch.count(True)

False
False
False
True
False
False


{'name_len': 8,
 'firstletter': 'r',
 'lastletter': 'a',
 'prefix': 'rai',
 'suffix': 'ama',
 'Len_vowel': 0,
 'i': 2,
 'a': 3}

In [1284]:
train_set = [(gender_features_icv(n), g) for (n,g) in train_names]
devtest_set = [(gender_features_icv(n), g) for (n,g) in devtest_names]
classifier_icv = nltk.NaiveBayesClassifier.train(train_set)
print (nltk.classify.accuracy(classifier_icv, devtest_set))

0.842


In [1280]:
classifier_icv.show_most_informative_features(10)

Most Informative Features
              lastletter = 'a'            female : male   =     31.2 : 1.0
              lastletter = 'k'              male : female =     27.2 : 1.0
                  suffix = 'ard'            male : female =     22.6 : 1.0
                  suffix = 'tta'          female : male   =     19.6 : 1.0
              lastletter = 'v'              male : female =     18.5 : 1.0
                  suffix = 'na'           female : male   =     17.4 : 1.0
                  suffix = 'nne'          female : male   =     17.2 : 1.0
              lastletter = 'f'              male : female =     15.2 : 1.0
                  prefix = 'rod'            male : female =     14.9 : 1.0
                  suffix = 'vin'            male : female =     13.9 : 1.0


In [1281]:
errors = []
for (name, tag) in devtest_names:
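    # NOTE: classifier_tls was trained on gender_features_tls features, so it sees
    # none of the feature names produced here; classifier_icv is likely the intended model.
    guess = classifier_tls.classify(gender_features_icv(name))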
    if guess != tag:
        errors.append( (tag, guess, name) )

errors = sorted(errors)

for (tag, guess, name) in errors[0:20]: # doctest: +ELLIPSIS +NORMALIZE_WHITESPACE
    print ('correct=%-8s guess=%-8s name=%-30s' % (tag, guess, name))

correct=male     guess=female   name=Abby                          
correct=male     guess=female   name=Abner                         
correct=male     guess=female   name=Adam                          
correct=male     guess=female   name=Adlai                         
correct=male     guess=female   name=Aguste                        
correct=male     guess=female   name=Albrecht                      
correct=male     guess=female   name=Alec                          
correct=male     guess=female   name=Aleksandrs                    
correct=male     guess=female   name=Allen                         
correct=male     guess=female   name=Allie                         
correct=male     guess=female   name=Ambrose                       
correct=male     guess=female   name=Ambrosi                       
correct=male     guess=female   name=Ambrosius                     
correct=male     guess=female   name=Amos                          
correct=male     guess=female   name=Andrej     

In [1282]:
# MOVE TO END 
from sklearn.model_selection import train_test_split
base_data = [(n, g) for (n,g) in Gender_names]
# split data into training and test data.
gl_train_set, gl_test_set = train_test_split(base_data,train_size=0.5,test_size=0.5,shuffle =True)

featuresets_train = [(gender_features(n), g) for (n,g) in gl_train_set]
featuresets_test = [(gender_features(n), g) for (n,g) in gl_test_set]

featuresets_az_train = [(gender_features_az(n), g) for (n,g) in gl_train_set]
featuresets_az_test = [(gender_features_az(n), g) for (n,g) in gl_test_set]

featuresets_tls_train = [(gender_features_tls(n), g) for (n,g) in gl_train_set]
featuresets_tls_test = [(gender_features_tls(n), g) for (n,g) in gl_test_set]

featuresets_4_train = [(class_gender_features4(n), g) for (n,g) in gl_train_set]
featuresets_4_test = [(class_gender_features4(n), g) for (n,g) in gl_test_set]

featuresets_icv_train = [(gender_features_icv(n), g) for (n,g) in gl_train_set]
featuresets_icv_test = [(gender_features_icv(n), g) for (n,g) in gl_test_set]


classifier = nltk.NaiveBayesClassifier.train(featuresets_train)
classifier_az = nltk.NaiveBayesClassifier.train(featuresets_az_train)
classifier_4 = nltk.NaiveBayesClassifier.train(featuresets_4_train)
classifier_tls = nltk.NaiveBayesClassifier.train(featuresets_tls_train)
classifier_icv = nltk.NaiveBayesClassifier.train(featuresets_icv_train)



accuracy_base= nltk.classify.accuracy(classifier, featuresets_test)
accuracy_az= nltk.classify.accuracy(classifier_az, featuresets_az_test)
accuracy_4= nltk.classify.accuracy(classifier_4, featuresets_4_test)
accuracy_tls= nltk.classify.accuracy(classifier_tls, featuresets_tls_test)
accuracy_icv= nltk.classify.accuracy(classifier_icv, featuresets_icv_test)

print ("Accuracy with Last Letter  [classifier] check :{}".format(accuracy_base))
print ("Accuracy First and Last Letter [classifier_az] check :{}".format(accuracy_az))
print ("Accuracy First and Last Letter and last 2 letter [classifier_4] check :{}".format(accuracy_4))
print ("Accuracy Last Two Letter [classifier_tls] check :{}".format(accuracy_tls))
print ("Accuracy With All+ Vowels [classifier_icv] check :{}".format(accuracy_icv))

Accuracy with Last Letter  [classifier] check :0.7666163141993958
Accuracy First and Last Letter [classifier_az] check :0.771399798590131
Accuracy First and Last Letter and last 2 letter [classifier_4] check :0.8272910372608258
Accuracy Last Two Letter [classifier_tls] check :0.7812185297079557
Accuracy With All+ Vowels [classifier_icv] check :0.8267875125881168


In [1152]:
train_set = [(gender_features_tls(n), g) for (n,g) in train_names]
devtest_set = [(gender_features_tls(n), g) for (n,g) in devtest_names]
classifier_tls = nltk.NaiveBayesClassifier.train(train_set)
print (nltk.classify.accuracy(classifier_tls, devtest_set))

0.792


Testing accuracy:
We will test the accuracy of both gender feature extractors here: the one that uses the last letter and the one that counts the letters of each name. To do this, we will run each function 100 times, reshuffling the data on every run.

In [369]:
import pandas as pd

In [570]:
def accuracy(number_of_runs, function_to_use):
    acc_df = {
        "classifier": [],
        "train_set_accuracy": [],
        "test_set_accuracy": [],
        "devtest_set_accuracy": [],
        "devtest_errors": []
    }
    for i in range(number_of_runs):
        random.shuffle(Gender_names)
        acc_train_names = Gender_names[1000:]
        acc_devtest_names = Gender_names[500:1000]
        acc_test_names = Gender_names[:500]
        acc_train_set = [(function_to_use(n), g) for (n,g) in acc_train_names]
        acc_devtest_set = [(function_to_use(n), g) for (n,g) in acc_devtest_names]
        acc_test_set = [(function_to_use(n), g) for (n,g) in acc_test_names]
        acc_classifier = nltk.NaiveBayesClassifier.train(acc_train_set)
        acc_df["classifier"].append(acc_classifier)
        acc_df["train_set_accuracy"].append(nltk.classify.accuracy(acc_classifier, acc_train_set))
        acc_df["test_set_accuracy"].append(nltk.classify.accuracy(acc_classifier, acc_test_set))
        acc_df["devtest_set_accuracy"].append(nltk.classify.accuracy(acc_classifier, acc_devtest_set))
        acc_errors = []
        for (name, tag) in acc_devtest_names:
            acc_guess = acc_classifier.classify(function_to_use(name))
            if acc_guess != tag:
                acc_errors.append( (tag, acc_guess, name) )
        acc_df["devtest_errors"].append(acc_errors)
    acc_df = pd.DataFrame.from_dict(acc_df)
    return(acc_df)

In [571]:
Accuracy_df_1 = accuracy(100, gender_features)
Accuracy_df_1.describe()

       train_set_accuracy  test_set_accuracy  devtest_set_accuracy
count               100.0              100.0                 100.0
mean             0.763174            0.75926               0.75966
std              0.001712           0.018119              0.017762
min              0.759361               0.72                 0.716
25%              0.761953              0.748                  0.75
50%              0.763105              0.757                 0.759
75%              0.764185              0.772                 0.774
max              0.767137              0.808                  0.81


The first feature extractor gives a mean accuracy of about 76% on every subset: 76.3% on train_set, 75.9% on test_set, and 76.0% on devtest_set. The mean accuracy on train_set is slightly higher than on devtest_set.

In [572]:
Accuracy_df_2 = accuracy(100, gender_features_az)
Accuracy_df_2.describe()

       train_set_accuracy  test_set_accuracy  devtest_set_accuracy
count               100.0              100.0                 100.0
mean             0.778135             0.7752               0.77426
std              0.002122           0.017805              0.015598
min              0.773474               0.73                 0.738
25%              0.776642             0.7655                 0.764
50%               0.77837              0.777                 0.774
75%              0.779414              0.786                0.7845
max              0.783266              0.824                 0.816


The second feature extractor gives mean accuracies of 77.8% on train_set, 77.5% on test_set, and 77.4% on devtest_set. The mean accuracy on train_set is slightly higher than on devtest_set.

In [578]:
class_df_3 = accuracy(10, class_gender_features4)
class_df_3.describe()

       train_set_accuracy  test_set_accuracy  devtest_set_accuracy
count                10.0               10.0                  10.0
mean             0.883338             0.8346                 0.824
std              0.001471           0.007834              0.009615
min              0.881192              0.824                 0.808
25%               0.88238             0.8285                0.8205
50%              0.883137              0.836                 0.824
75%              0.884649             0.8375                0.8315
max              0.885513              0.852                 0.838


The third feature extractor, which looks at the prefix and suffix of each name, reaches a mean accuracy over 10 runs of 83.5% on test_set and 82.4% on devtest_set, compared with 88.3% on train_set. The mean accuracy on train_set is higher than on both devtest_set and test_set.
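
The difference between training and test accuracy can be made explicit with a few lines of arithmetic. The sketch below uses the mean values from the three `describe()` tables above; the dictionary layout and the helper `generalization_gap` are assumptions made for illustration.

```python
# Mean accuracies copied from the three describe() tables above.
summary = {
    "gender_features": {"train": 0.763174, "test": 0.75926, "devtest": 0.75966},
    "gender_features_az": {"train": 0.778135, "test": 0.7752, "devtest": 0.77426},
    "class_gender_features4": {"train": 0.883338, "test": 0.8346, "devtest": 0.824},
}

def generalization_gap(means):
    """Training accuracy minus test accuracy for one feature extractor."""
    return means["train"] - means["test"]

gaps = {name: round(generalization_gap(m), 4) for name, m in summary.items()}
for name, gap in gaps.items():
    print("%-24s train-test gap = %.4f" % (name, gap))
```

A gap near zero suggests the model behaves on unseen names much as it does on the training names. A gap of several points, as with the prefix and suffix features, may indicate that some of the richer features are fitting the training names specifically.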

In [634]:
# NEW MODEL
gender_features_icv("temp_name")
features = {}
temp_name = "Rajwanti"
cons_clusters = ["bl", "br", "ch", "cl", "cr", "dr", "fl", "fr", 
                     "gl", "gr", "pl", "pr", "sc", "sh", "sk", "sl", 
                     "sm", "sn", "sp", "st", "sw", "th", "tr", "tw", 
                     "wh", "wr", "sch", "scr", "shr", "sph", "spl", 
                     "spr", "squ", "str", "thr"]
features["firstletter"] = temp_name[0].lower() 
features["lastletter"] = temp_name[-1].lower() 
features["prefix"] = temp_name[:3].lower() if len(temp_name) > 4 else temp_name[:2].lower() 
features["suffix"] = temp_name[-3:].lower() if len(temp_name) > 4 else temp_name[-2:].lower()
# clusters = []
# for cluster in cons_clusters[::-1]:
#     if cluster in temp_name:
#         temp_name = temp_name.replace(cluster, "")
#         clusters.append(cluster)
# features["consonant_clusters_1"] = clusters[0] if len(clusters) > 0 else None
# features["consonant_clusters_2"] = clusters[1] if len(clusters) > 1 else None
# features["consonant_clusters_3"] = clusters[2] if len(clusters) > 2 else None
# len(temp_name)
Vowel = ['a','e','i','o','u']
Vclusters = []
for vb in Vowel[::-1]:
    if vb in temp_name:
        temp_name = temp_name.replace(vb, "")
        Vclusters.append(vb)
features["Len_vowel"] = len(Vclusters)

In [662]:
clusters,Vclusters, features
len(temp_name)
Vowel = ['a','e','i','o','u']
Vclusters = []
for vb in Vowel[::-1]:
    if vb in temp_name:
        temp_name.count(vb)
        temp_name = temp_name.replace(vb, "")
        Vclusters.append(vb)


In [865]:
gender_features_icv("Rajwant")

# 'a' in "Rajwant"

# "Rajwant".count('a')

# Check whether any vowel occurs in the name
any(v in "Rajwant" for v in Vowel)

In [888]:
# MOVE TO END 
from sklearn.model_selection import train_test_split
base_data = [(n, g) for (n,g) in Gender_names]
# split data into training and test data.
gl_train_set, gl_test_set = train_test_split(base_data,train_size=0.5,test_size=0.5,shuffle =True)

featuresets_train = [(gender_features(n), g) for (n,g) in gl_train_set]
featuresets_test = [(gender_features(n), g) for (n,g) in gl_test_set]

featuresets_az_train = [(gender_features_az(n), g) for (n,g) in gl_train_set]
featuresets_az_test = [(gender_features_az(n), g) for (n,g) in gl_test_set]

featuresets_tls_train = [(gender_features_tls(n), g) for (n,g) in gl_train_set]
featuresets_tls_test = [(gender_features_tls(n), g) for (n,g) in gl_test_set]

featuresets_4_train = [(class_gender_features4(n), g) for (n,g) in gl_train_set]
featuresets_4_test = [(class_gender_features4(n), g) for (n,g) in gl_test_set]

featuresets_icv_train = [(gender_features_icv(n), g) for (n,g) in gl_train_set]
featuresets_icv_test = [(gender_features_icv(n), g) for (n,g) in gl_test_set]


classifier = nltk.NaiveBayesClassifier.train(featuresets_train)
classifier_az = nltk.NaiveBayesClassifier.train(featuresets_az_train)
classifier_4 = nltk.NaiveBayesClassifier.train(featuresets_4_train)
classifier_tls = nltk.NaiveBayesClassifier.train(featuresets_tls_train)
classifier_icv = nltk.NaiveBayesClassifier.train(featuresets_icv_train)

accuracy_base= nltk.classify.accuracy(classifier, featuresets_test)
accuracy_az= nltk.classify.accuracy(classifier_az, featuresets_az_test)
accuracy_4= nltk.classify.accuracy(classifier_4, featuresets_4_test)
accuracy_tls= nltk.classify.accuracy(classifier_tls, featuresets_tls_test)
accuracy_icv= nltk.classify.accuracy(classifier_icv, featuresets_icv_test)

print ("Accuracy with Last Letter  [classifier] check :{}".format(accuracy_base))
print ("Accuracy First and Last Letter [classifier_az] check :{}".format(accuracy_az))
print ("Accuracy First and Last Letter and last 2 letter [classifier_4] check :{}".format(accuracy_4))
print ("Accuracy Last Two Letter [classifier_tls] check :{}".format(accuracy_tls))
print ("Accuracy With All+ Vowels [classifier_icv] check :{}".format(accuracy_icv))

Accuracy with Last Letter  [classifier] check :0.7640986908358509
Accuracy First and Last Letter [classifier_az] check :0.7648539778449144
Accuracy First and Last Letter and last 2 letter [classifier_4] check :0.82527693856999
Accuracy Last Two Letter [classifier_tls] check :0.7842396777442094
Accuracy With All+ Vowels [classifier_icv] check :0.81797583081571
