# Baseline Classifier Evaluation

**Uses the NLTK maximum entropy (MaxEnt) classifier.**

Original algorithm: https://github.com/kitofans/ethnicityguesser


**TOC**

1. Import & Clean Name Data
2. EDA on Name Data
3. Train the Model
4. Evaluate the Model


In [2]:
from ethnicityguesser.NLTKMaxentEthnicityClassifier import NLTKMaxentEthnicityClassifier as mxec
from os import walk
import pandas as pd
import csv
import pickle
import numpy as np




# 1. Import & Clean Name Data




In [19]:
## Import names paired with ethnicities ##

# find names of files
f = []
for (dirpath, dirnames, filenames) in walk("ethnicityguesser/pickled_names"):
    f.extend(filenames)
    break

# list types of ethnicities
ethnicities = []
for each in f:
    ethnicities.append(each.partition('.')[0])


# pair each ethnicity with the contents of its pickle file in a dict
eth_dict = {}
for ethnicity in ethnicities:
    with open('ethnicityguesser/pickled_names/'+ethnicity+'.pkl', 'rb') as pkl_file:
        names = pickle.load(pkl_file)
    eth_dict[ethnicity] = names




In [20]:
ethnicities

['chinese',
 'vietnamese',
 'irish',
 'danish',
 'french',
 'russian',
 'japanese',
 'german',
 'czech',
 'arabic',
 'ukranian',
 'swedish',
 'spanish',
 'african',
 'swiss',
 'korean',
 'jewish',
 'greek',
 'italian',
 'slavic',
 'indian',
 'muslim',
 'portugese']

In [7]:
## make a super list of names and true ethnicities

super_list_names = []
super_list_ethnicities = []

for ethnicity in ethnicities:
    # the first element of each pickled object is the list of names
    name_list = eth_dict[ethnicity][0]
    super_list_names += name_list
    # one label per name
    super_list_ethnicities += [ethnicity] * len(name_list)

df = pd.DataFrame(
            {'Name': super_list_names,
             'True Ethnicity': super_list_ethnicities
            })

# 2. EDA on Name Data

Let's look at a random sample of the name data.

In [23]:
df.sample(frac=1).head(10)

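|       | Name     | True Ethnicity |
|-------|----------|----------------|
| 9458  | Grinberg | swedish        |
| 7489  | Herda    | czech          |
| 19557 | Caro     | portugese      |
| 13215 | Guell    | swiss          |
| 13208 | Grunder  | swiss          |
| 2000  | Bonet    | french         |
| 13994 | Awerbuch | jewish         |
| 17392 | Agli     | italian        |
| 11597 | Mejias   | spanish        |
| 2349  | Chabot   | french         |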


Let's compare the data we have for every ethnicity.

In [33]:
print "n for each ethnicity sample"
for ethnicity in ethnicities:
    print len(df[df['True Ethnicity']==ethnicity]), ethnicity
    

n for each ethnicity sample
426 chinese
129 vietnamese
318 irish
656 danish
4143 french
181 russian
566 japanese
723 german
1406 czech
108 arabic
409 ukranian
1158 swedish
2366 spanish
300 african
774 swiss
169 korean
3076 jewish
429 greek
711 italian
258 slavic
580 indian
525 muslim
834 portugese


23
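The same per-group counts can also be read off in a single call. A minimal sketch, assuming a frame shaped like `df` above; `toy_df` and its four rows are placeholders for illustration, not the real data:

```python
import pandas as pd

# Toy stand-in for df; the real frame has one row per name.
toy_df = pd.DataFrame({
    'Name': ['Herda', 'Bonet', 'Chabot', 'Agli'],
    'True Ethnicity': ['czech', 'french', 'french', 'italian'],
})

# value_counts() returns the number of rows per label, largest first.
counts = toy_df['True Ethnicity'].value_counts()
print(counts)
```

Sorting by count makes the imbalance between groups (for example, 4143 French names against 108 Arabic names) easier to spot than a list in file order.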

Choosing accurately among 23 specific ethnicities is harder than choosing among 5 or 6 broadly defined groups because (1) more candidate labels create more opportunities for misclassification, and (2) ethnicities from neighboring parts of the world share names (for example, "Alexander" is a common Danish, Greek, and French name).

Let's consolidate some groups that share name/cultural similarities. Let's also make equal sample sizes for each consolidated group, drawing evenly from each subgroup to prevent our classifier from ignoring "low frequency" ethnicities.

In [55]:
# Trial run of the consolidation on three groups
target_df = pd.DataFrame(columns=['Name', 'True Ethnicity'])
eth_list = ['danish', 'french', 'italian']

sample_df = pd.DataFrame(columns=['Name', 'True Ethnicity'])
for ethnicity in eth_list:
    eth_df = df[df['True Ethnicity']==ethnicity]
    # floor division, so each subgroup contributes the same whole number of rows
    n_per_eth = 1000 // len(eth_list)
    print n_per_eth
    sample_df = pd.concat([sample_df, eth_df.sample(n=n_per_eth)])

target_df = pd.concat([target_df,sample_df])

# relabel every sampled row with the consolidated group name
target_df['True Ethnicity'] = "white"


333
333
333


In [65]:
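# Consolidation function

df_c = pd.DataFrame(columns=['Name', 'True Ethnicity'])

def consolidate(eth_list, target_df, consolidated_eth):
    """Sample about 1000 names split evenly across eth_list, relabel them
    as consolidated_eth, and append them to target_df."""
    sample_df = pd.DataFrame(columns=['Name', 'True Ethnicity'])
    # floor division: 1000 rows per group, minus any remainder
    n_per_eth = 1000 // len(eth_list)
    for ethnicity in eth_list:
        eth_df = df[df['True Ethnicity']==ethnicity]
        # replace=True lets subgroups with fewer than n_per_eth names be
        # oversampled, so the same name can appear more than once
        sample_df = pd.concat([sample_df, eth_df.sample(n=n_per_eth, replace = True)])
    sample_df['True Ethnicity'] = consolidated_eth
    return pd.concat([target_df,sample_df])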


# Consolidate East European
east_euro = ['russian','ukranian','czech','slavic']
df_c = consolidate(east_euro, df_c, 'Eastern European')

# Consolidate West European
west_euro = ['italian','irish','danish','french',
                'swedish','german','swiss']
df_c = consolidate(west_euro, df_c, 'Western European')

# Consolidate Muslim / Arab
muslim_arabic = ['muslim', 'arabic']
df_c = consolidate(muslim_arabic, df_c, 'Muslim/Arabic')

# Consolidate East Asian
east_asian = ['chinese','japanese','vietnamese','korean']
df_c = consolidate(east_asian, df_c, 'East Asian')

# Consolidate Spanish / Portuguese into Hispanic
hispanic = ['spanish','portugese']
df_c = consolidate(hispanic, df_c, 'Hispanic')

# Jewish can remain its own category
jewish = ['jewish']
df_c = consolidate(jewish, df_c, 'Jewish')

# Indian can remain its own category
indian = ['indian']
df_c = consolidate(indian, df_c, 'Indian')

# African can remain its own category 
african = ['african']
df_c = consolidate(african, df_c, 'African')

print 'Cleaned sample size:', len(df_c)
df_c.sample(n=10)

Cleaned sample size: 7994


|       | Name       | True Ethnicity |
|-------|------------|----------------|
| 18894 | Abdi       | Muslim/Arabic  |
| 10846 | Cobos      | Hispanic       |
| 16637 | Vidal      | Jewish         |
| 11767 | Napoleon   | Hispanic       |
| 12599 | Adanna     | African        |
| 18334 | Bal        | Indian         |
| 18730 | Rampersaud | Indian         |
| 18520 | Jhaveri    | Indian         |
| 18948 | Assaf      | Muslim/Arabic  |
| 8607  | Koury      | Muslim/Arabic  |
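The cleaned sample has 7994 rows rather than 8000 because the per-subgroup quota is a whole number: dividing 1000 by a group of seven subgroups leaves a remainder that is dropped. The arithmetic can be checked with a short sketch; the `groups` dict and `TARGET` name are introduced here for illustration, with the subgroup lists copied from the consolidation cell:

```python
# Subgroup lists copied from the consolidation cell above.
groups = {
    'Eastern European': ['russian', 'ukranian', 'czech', 'slavic'],
    'Western European': ['italian', 'irish', 'danish', 'french',
                         'swedish', 'german', 'swiss'],
    'Muslim/Arabic': ['muslim', 'arabic'],
    'East Asian': ['chinese', 'japanese', 'vietnamese', 'korean'],
    'Hispanic': ['spanish', 'portugese'],
    'Jewish': ['jewish'],
    'Indian': ['indian'],
    'African': ['african'],
}

TARGET = 1000  # rows requested per consolidated group

# Floor division drops the remainder, so 7 subgroups give 7 * 142 = 994 rows.
rows_per_group = {}
for label, subgroups in groups.items():
    per_subgroup = TARGET // len(subgroups)
    rows_per_group[label] = per_subgroup * len(subgroups)

total = sum(rows_per_group.values())
print(total)
```

Only the Western European group falls short (994 rows); the other seven groups divide evenly and contribute 1000 rows each.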


In [68]:
## List Consolidated Ethnicities
ethnicities_c = [
    'Eastern European',
    'Western European',
    'Muslim/Arabic',
    'East Asian',
    'Hispanic',
    'Jewish',
    'Indian',
    'African'
]

# 3. Train the Model

In [69]:
## Split into Training and Test
msk = np.random.rand(len(df_c)) < 0.5
train_df = df_c[msk]
test_df = df_c[~msk]

print "Total Sample (n)", len(df_c)
print "Test Sample (test n)", len(test_df)
print "Train Sample (train n)", len(train_df)

Total Sample (n) 7994
Test Sample (test n) 3894
Train Sample (train n) 4100
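Because `consolidate` samples with `replace=True`, the same name can occur more than once in `df_c`, and a row-level random mask may then place copies of one name on both sides of the split. One possible variation, sketched here on toy data, assigns each distinct name to a single side. The helper `split_by_unique_name`, its `seed` parameter, and the toy frame are hypothetical and not part of the original notebook:

```python
import numpy as np
import pandas as pd

def split_by_unique_name(frame, train_frac=0.5, seed=0):
    """Hypothetical helper: split rows so every copy of a name lands on one side."""
    rng = np.random.RandomState(seed)
    unique_names = frame['Name'].unique()
    # One random draw per distinct name, not per row.
    in_train = rng.rand(len(unique_names)) < train_frac
    train_names = set(unique_names[in_train])
    mask = frame['Name'].isin(train_names)
    return frame[mask], frame[~mask]

# Toy frame with deliberate duplicates ('Koury' and 'Vidal' appear twice).
toy = pd.DataFrame({
    'Name': ['Koury', 'Koury', 'Abdi', 'Cobos', 'Vidal', 'Vidal', 'Bal', 'Assaf'],
    'True Ethnicity': ['Muslim/Arabic', 'Muslim/Arabic', 'Muslim/Arabic', 'Hispanic',
                       'Jewish', 'Jewish', 'Indian', 'Muslim/Arabic'],
})

train_toy, test_toy = split_by_unique_name(toy)
shared = set(train_toy['Name']) & set(test_toy['Name'])
print(len(shared))
```

Fixing the seed also makes the split reproducible between runs, which the unseeded `np.random.rand` mask does not.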


In [70]:
## Package DF into training tokens
train_tokens = []
for ethnicity in ethnicities_c:
    new_tokens = (list(train_df[train_df['True Ethnicity'] == ethnicity]['Name']), ethnicity)
    train_tokens.append(new_tokens)

# (Tokens must be a list of ([list of names], 'ethnicity') pairs.)
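To make the expected shape concrete, here is a small sketch that builds the same `([list of names], 'ethnicity')` pairs from a toy frame with `groupby`. The frame `toy_train` and its rows are placeholders for illustration:

```python
import pandas as pd

toy_train = pd.DataFrame({
    'Name': ['Hakimi', 'Mifsud', 'Shankar', 'Muraoka'],
    'True Ethnicity': ['Muslim/Arabic', 'Muslim/Arabic', 'Indian', 'East Asian'],
})

# One ([names], label) pair per label; sort=False keeps first-seen label order.
toy_tokens = [
    (list(group['Name']), label)
    for label, group in toy_train.groupby('True Ethnicity', sort=False)
]
print(toy_tokens)
```

Unlike the loop over `ethnicities_c`, `groupby` only emits labels that actually occur in the frame, so a group with no training rows would be skipped instead of producing an empty name list.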

In [71]:
## Train Classifier (beware, this takes time)

classifier = mxec(train_tokens)
classifier.train()

  ==> Training (100 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -2.07944        0.122
             2          -1.14179        0.929
             3          -0.78956        0.950
             4          -0.61208        0.966
             5          -0.50362        0.977
             6          -0.42959        0.982
             7          -0.37544        0.986
             8          -0.33392        0.988
             9          -0.30099        0.990
            10          -0.27420        0.990
            11          -0.25196        0.991
            12          -0.23319        0.991
            13          -0.21713        0.991
            14          -0.20324        0.991
            15          -0.19109        0.991
            16          -0.18039        0.992
            17          -0.17088        0.992
            18          -0.16238        0.992
            19          -0.15473        0.992
 

In [140]:
# Test Classifier
print classifier.classify('Michael')
print classifier.classify('Roberto')
print classifier.classify('Lee')
print classifier.classify('sajkfldsafh')

def prob(name):
    # (label, log-probability) pairs for every candidate ethnicity
    dist = classifier.prob_classify(name)
    return [(label, dist.logprob(label)) for label in dist.samples()]

# log-probability of each label (closer to 0 means more likely)
print prob('Michael')
#print prob('Roberto')
#print prob('Lee')
#print prob('sajkfldsafh')

Western European
Hispanic
East Asian
Muslim/Arabic
[('Eastern European', -3.7065529482840653), ('Jewish', -6.6514402039659508), ('Western European', -0.16855233603485154), ('African', -11.308063175888986), ('Hispanic', -7.7907496492955053), ('Indian', -8.3545578338112048), ('East Asian', -6.0727470332155962), ('Muslim/Arabic', -10.121878347922477)]
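The scores above are logarithms, so they are easier to interpret once converted back to probabilities. A minimal sketch, assuming the values are base-2 logs (an assumption that is consistent with the eight converted values summing to roughly 1); the list `log_scores` is copied from the output for 'Michael':

```python
# Log scores copied from the prob('Michael') output above.
log_scores = [
    ('Eastern European', -3.7065529482840653),
    ('Jewish', -6.6514402039659508),
    ('Western European', -0.16855233603485154),
    ('African', -11.308063175888986),
    ('Hispanic', -7.7907496492955053),
    ('Indian', -8.3545578338112048),
    ('East Asian', -6.0727470332155962),
    ('Muslim/Arabic', -10.121878347922477),
]

# Assumption: scores are base-2 logs, so probability = 2 ** score.
probs = {label: 2.0 ** score for label, score in log_scores}

# The most likely label is the one with the score closest to zero.
best_label = max(probs, key=probs.get)
print(best_label)
print(round(sum(probs.values()), 3))
```

Read this way, the score closest to zero marks the predicted label, which matches the `classify('Michael')` result of Western European.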


In [83]:
## Predict!!!!

test_names = list(test_df['Name'])
test_eth = list(test_df['True Ethnicity'])

test_preds = []

for name in test_names:
    pred = classifier.classify(name)
    test_preds.append(pred)

df_preds = pd.DataFrame({
    'Name': test_names,
    'True Ethnicity': test_eth,
    'Prediction': test_preds
})

df_preds.sample(15)

|      | Name      | Prediction       | True Ethnicity   |
|------|-----------|------------------|------------------|
| 419  | Triebel   | Western European | Eastern European |
| 1386 | Hakimi    | Muslim/Arabic    | Muslim/Arabic    |
| 2639 | Alpron    | Western European | Jewish           |
| 1978 | Apollo    | Hispanic         | Hispanic         |
| 2899 | Baria     | Hispanic         | Indian           |
| 3379 | Shankar   | Indian           | Indian           |
| 2741 | Zimbalist | Jewish           | Jewish           |
| 1314 | Mifsud    | Muslim/Arabic    | Muslim/Arabic    |
| 1589 | Muraoka   | East Asian       | East Asian       |
| 479  | Acconci   | Western European | Western European |


In [84]:
# Add True if you got it right
df_preds['Accuracy'] = (df_preds['Prediction']==df_preds['True Ethnicity'])
df_preds.sample(15)

|      | Name       | Prediction       | True Ethnicity   | Accuracy |
|------|------------|------------------|------------------|----------|
| 1368 | Koury      | Muslim/Arabic    | Muslim/Arabic    | True     |
| 2245 | De Araujo  | Hispanic         | Hispanic         | True     |
| 1092 | Siddiqui   | Muslim/Arabic    | Muslim/Arabic    | True     |
| 2399 | Tzarfat    | Jewish           | Jewish           | True     |
| 2872 | Chazzan    | Jewish           | Jewish           | True     |
| 113  | Urban      | Eastern European | Eastern European | True     |
| 1345 | Atiyeh     | Muslim/Arabic    | Muslim/Arabic    | True     |
| 3146 | Upadhyay   | Indian           | Indian           | True     |
| 3215 | Sundaram   | Jewish           | Indian           | False    |
| 3801 | Oluwatoyin | African          | African          | True     |


# 4. Evaluate the Model

In [85]:
## Tool to Calculate Accuracy Rates

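def calcAccuracy(df):
    # share of rows whose prediction matches the true ethnicity
    length = len(df)
    length_true = len(df[df['Accuracy']==True])
    return float(length_true)/float(length)

def calcTP(df, eth):
    # true positives: rows of ethnicity eth that were also predicted as eth
    return len(df[(df['True Ethnicity']==eth) & (df['Prediction']==eth)])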

In [88]:
accuracies = []
TP = [] # true positives:  predicted X,     true ethnicity is X
FP = [] # false positives: predicted X,     true ethnicity is not X
TN = [] # true negatives:  predicted not X, true ethnicity is not X
FN = [] # false negatives: predicted not X, true ethnicity is X


ethnicity_list = []

# Classification Accuracy
for ethnicity in ethnicities_c:
    accuracy = calcAccuracy(df_preds[df_preds['True Ethnicity']==ethnicity])
    accuracies.append(accuracy)
    ethnicity_list.append(ethnicity)

# Aggregate accuracy
accuracies.append(calcAccuracy(df_preds))
ethnicity_list.append('OVERALL')

# put into df
df_acc = pd.DataFrame({
    'True Ethnicity': ethnicity_list,
    'Classification Accuracy': accuracies
})

df_acc.set_index('True Ethnicity', inplace=True)

In [89]:
df_acc

| True Ethnicity   | Classification Accuracy |
|------------------|-------------------------|
| Eastern European | 0.737418                |
| Western European | 0.577236                |
| Muslim/Arabic    | 0.85259                 |
| East Asian       | 0.78913                 |
| Hispanic         | 0.776639                |
| Jewish           | 0.520243                |
| Indian           | 0.760784                |
| African          | 0.947047                |
| OVERALL          | 0.744992                |
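Each per-group figure above is computed only over rows whose true label is that group, so it is the group's recall; it says nothing about how often a predicted label is correct (precision). The `TP`, `FP`, `TN` and `FN` lists are declared but never filled. A minimal sketch of how those counts could be computed for one label, on toy rows taken from the prediction samples shown earlier; the helper `per_class_counts` is hypothetical:

```python
import pandas as pd

def per_class_counts(frame, label):
    """Hypothetical helper: TP, FP, FN, TN with `label` playing the role of X."""
    is_true = frame['True Ethnicity'] == label
    is_pred = frame['Prediction'] == label
    tp = int((is_true & is_pred).sum())
    fp = int((~is_true & is_pred).sum())
    fn = int((is_true & ~is_pred).sum())
    tn = int((~is_true & ~is_pred).sum())
    return tp, fp, fn, tn

# Six rows copied from the sampled prediction tables above.
toy_preds = pd.DataFrame({
    'Name': ['Triebel', 'Alpron', 'Zimbalist', 'Sundaram', 'Shankar', 'Baria'],
    'Prediction': ['Western European', 'Western European', 'Jewish',
                   'Jewish', 'Indian', 'Hispanic'],
    'True Ethnicity': ['Eastern European', 'Jewish', 'Jewish',
                       'Indian', 'Indian', 'Indian'],
})

tp, fp, fn, tn = per_class_counts(toy_preds, 'Jewish')
recall = tp / float(tp + fn)      # share of true Jewish rows that were found
precision = tp / float(tp + fp)   # share of Jewish predictions that were right
print(recall)
print(precision)
```

Reporting precision next to recall would show whether a group with low recall, such as Jewish or Western European, is also being over-predicted for names from other groups.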


Load the voter data that the classifier will be applied to.

In [91]:
df_voters = pd.read_csv('Milestone33.csv', sep='\t')
df_voters

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,index,district,id,LAST_NAME,FIRST_NAME,zip,female,dob,regyear,party,electiondate,general,typeofvote,age,GEN16,GEN14
0,0,0,4581539,DUV,103746250,Jennings,Barbara,32225,F,1944-05-23 00:00:00,08/12/2004,REP,11/08/2016,GEN,A,73.0,1,0
1,1,1,6399633,DUV,103699536,Nellenbach,Marlene,32223,F,1944-07-04 00:00:00,09/03/1996,NPA,11/04/2014,GEN,A,73.0,0,0
2,2,2,5068762,DUV,103823224,YORK,MARTHA,32246,F,1948-02-10 00:00:00,10/05/1992,REP,11/04/2008,GEN,Y,69.0,0,1
3,3,3,2390889,CLA,102858200,SUCO,BRANDI,32656,F,1981-09-25 00:00:00,05/16/2000,IDP,11/07/2006,GEN,N,36.0,0,0
4,4,4,7092052,DUV,103844317,Amato,Lisa,32250,F,1972-09-14 00:00:00,10/01/2004,DEM,11/04/2008,GEN,Y,45.0,0,1
5,8,8,7621091,DUV,120286726,Gadio,Adama,32206,F,1989-06-01 00:00:00,09/21/2012,DEM,11/08/2016,GEN,E,28.0,1,0
6,9,9,4305830,DUV,114831837,Schaap,Carole,32257,F,1942-02-04 00:00:00,11/28/2006,INT,11/04/2014,GEN,Y,75.0,0,0
7,10,10,2075069,CLA,102861065,DOXEY,VICKEY,32656,F,1961-08-20 00:00:00,08/28/2000,REP,11/06/2012,GEN,Y,56.0,0,0
8,11,11,206657,BAY,116352668,LOGAN,JESSICA,32405,F,1987-10-06 00:00:00,07/01/2008,REP,11/06/2012,GEN,Y,30.0,0,0
9,12,12,7439038,DUV,103580214,Thomas,Kim,32209,F,1960-04-09 00:00:00,07/31/2003,DEM,11/02/2010,GEN,Y,57.0,0,0


# Additional EDA

In [95]:
# Add predictions to voter data
ethnicity_predictions = []
for name in list(df_voters['LAST_NAME']):
    ethnicity_predictions.append(classifier.classify(name))
    
ethnicity_predictions

['Jewish',
 'Jewish',
 'Jewish',
 'African',
 'African',
 'Hispanic',
 'Jewish',
 'Western European',
 'Western European',
 'Indian',
 'Eastern European',
 'Jewish',
 'Western European',
 'Eastern European',
 'Western European',
 'Jewish',
 'Hispanic',
 'Jewish',
 'Muslim/Arabic',
 'African',
 'Eastern European',
 'Jewish',
 'East Asian',
 'Jewish',
 'Eastern European',
 'East Asian',
 'Eastern European',
 'Western European',
 'Hispanic',
 'Jewish',
 'Hispanic',
 'East Asian',
 'Jewish',
 'Hispanic',
 'Western European',
 'Jewish',
 'Western European',
 'East Asian',
 'Western European',
 'Jewish',
 'Jewish',
 'Western European',
 'Western European',
 'Western European',
 'Jewish',
 'African',
 'Western European',
 'Jewish',
 'Western European',
 'Hispanic',
 'Western European',
 'Western European',
 'Hispanic',
 'Indian',
 'East Asian',
 'East Asian',
 'Eastern European',
 'Eastern European',
 'Hispanic',
 'Hispanic',
 'Western European',
 'Western European',
 'East Asian',
 'Hispanic',
 ...]

In [96]:
df_voters['Ethnicity Prediction'] = ethnicity_predictions

In [100]:
df_voters.sample(15)[['LAST_NAME', 'Ethnicity Prediction']]

|      | LAST_NAME | Ethnicity Prediction |
|------|-----------|----------------------|
| 399  | Phillips  | Western European     |
| 1654 | Burch     | Eastern European     |
| 2752 | Smith     | Western European     |
| 2783 | SAVISKY   | Jewish               |
| 4068 | Arroyo    | Hispanic             |
| 3729 | Whatley   | Eastern European     |
| 5055 | VAUGHAN   | Western European     |
| 1226 | Kerlicker | Jewish               |
| 362  | Palmer    | Jewish               |
| 3345 | Rioux     | Western European     |
