# Baseline Classifier Evaluation

### Overview

This notebook presents our baseline model. Below we describe the design decisions behind this initial classifier, along with ideas for improving it as the final project proceeds.



### TOC

1. Import & Clean Name Data
2. EDA on Name Data
3. Train the Model
4. Evaluate the Model


In [2]:
from ethnicityguesser.NLTKMaxentEthnicityClassifier import NLTKMaxentEthnicityClassifier as mxec
from os import walk
import pandas as pd
import csv
import pickle
import numpy as np

# 1. Import & Clean Name Data

### Choosing the Data

There are a couple of ways to get lists of names paired with ethnicity. Some solutions include:

- **Lists of baby names** by ethnicity [familyeducation.com](http://familyeducation.com/baby-names/browse-origin/surname)
- **Wikipedia Metadata** [Ethnicity List Scraped from Wikipedia](https://raw.githubusercontent.com/appeler/ethnicolr/master/ethnicolr/data/wiki/wiki_name_race.csv)
- **Census Data** [Frequencies of names by race](https://raw.githubusercontent.com/appeler/ethnicolr/master/ethnicolr/data/census/census_2010.csv)

For our base model, we chose the lists of baby names. Strengths of this data source include:
- More specific ethnicity breakdowns (i.e. we can differentiate between Indian names vs Chinese names vs Vietnamese names, whereas the Census data would just tell us "Asian" because it is by race)
- Standard use of 26 character alphabet (Wikipedia has many special characters like "ä" or "ü" on many names). Standard use is important because voter datasets do not use these special characters.

Weaknesses of the baby names data source include:
- No way to tell the "frequency" of a name
- Exclusion of less common names not associated with ethnicity

Considering the strengths and weaknesses of the baby names data, it seems like a reasonable starting point for our baseline model, although we may seek to refine future models through frequency data available from the other data sets.
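Because voter files use only the plain 26-letter alphabet, a source with accented characters (such as the Wikipedia list) would need to be folded to ASCII before it could be mixed with the baby-name lists. Below is a minimal sketch using only the standard library. The helper name `to_ascii` is ours and is not part of the notebook.

```python
import unicodedata

def to_ascii(name):
    """Strip diacritics by decomposing characters and dropping the combining marks."""
    decomposed = unicodedata.normalize('NFKD', name)
    # Anything that still is not ASCII after decomposition is dropped, not transliterated.
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(to_ascii(u'J\u00e4ger'))   # input is "Jäger"
print(to_ascii(u'M\u00fcller'))  # input is "Müller"
```

Letters with no decomposed form (for example "ø" or "ß") are simply removed by this approach, so a real pipeline would probably need an explicit transliteration table for them.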

In [19]:
## Import names paired with ethnicities ##

# find names of files
f = []
for (dirpath, dirnames, filenames) in walk("ethnicityguesser/pickled_names"):
    f.extend(filenames)
    break

# list types of ethnicities
ethnicities = []
for each in f:
    ethnicities.append(each.partition('.')[0])


# pair type of ethnicity to its names in a dict
eth_dict = {}
for ethnicity in ethnicities:
    with open('ethnicityguesser/pickled_names/'+ethnicity+'.pkl', 'rb') as filename:
        names = pickle.load(filename)
    eth_dict[ethnicity] = names




Here are the ethnicities we have:

In [141]:
ethnicities

['chinese',
 'vietnamese',
 'irish',
 'danish',
 'french',
 'russian',
 'japanese',
 'german',
 'czech',
 'arabic',
 'ukranian',
 'swedish',
 'spanish',
 'african',
 'swiss',
 'korean',
 'jewish',
 'greek',
 'italian',
 'slavic',
 'indian',
 'muslim',
 'portugese']

Now let's package these into a nice dataframe.

In [7]:
## make a dataframe of names and true ethnicities

super_list_names = []
super_list_ethnicities = []

for ethnicity in ethnicities:
    # the list of names is the first element of each unpickled object
    name_list = eth_dict[ethnicity][0]
    super_list_names.extend(name_list)
    # one ethnicity label per name
    super_list_ethnicities.extend([ethnicity] * len(name_list))

df = pd.DataFrame(
            {'Name': super_list_names,
             'True Ethnicity': super_list_ethnicities
            })
    

# 2. EDA on Name Data

Let's examine what our name data looks like in reality.

In [23]:
df.sample(frac=1).head(10)

Unnamed: 0,Name,True Ethnicity
9458,Grinberg,swedish
7489,Herda,czech
19557,Caro,portugese
13215,Guell,swiss
13208,Grunder,swiss
2000,Bonet,french
13994,Awerbuch,jewish
17392,Agli,italian
11597,Mejias,spanish
2349,Chabot,french


Let's compare the data we have for every ethnicity.

In [142]:
print "n for each ethnicity sample"
for ethnicity in ethnicities:
    print len(df[df['True Ethnicity']==ethnicity]), ethnicity
    

n for each ethnicity sample
426 chinese
129 vietnamese
318 irish
656 danish
4143 french
181 russian
566 japanese
723 german
1406 czech
108 arabic
409 ukranian
1158 swedish
2366 spanish
300 african
774 swiss
169 korean
3076 jewish
429 greek
711 italian
258 slavic
580 indian
525 muslim
834 portugese


### Problem 1: Non-balanced

We have only 169 Korean names (and just 108 Arabic names) but 4143 French names. If we train a classifier on this data right off the bat, it may be unfairly biased towards French names.

Frequency in this dataset does not reflect real-world frequency, so we ought to at least balance it to prevent unwanted bias.

### Problem 2: Too many categories

Choosing between 23 specific ethnicities accurately is more difficult than choosing between around 7 or 8 broadly defined ethnicities because:
  1. Having more choices to choose from in general creates more opportunities for classification errors
  2. Some ethnicities from similar parts of the world have overlapping names (for example, "Alexander" is a common Danish, Greek, and French name).

Let's consolidate some groups that share name/cultural similarities. Let's also make equal sample sizes for each consolidated group, drawing evenly from each subgroup to prevent our classifier from ignoring "low frequency" ethnicities that are only low frequency due to the bias in our data set.

In [147]:
# Consolidation function

df_c = pd.DataFrame(columns=['Name', 'True Ethnicity'])

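def consolidate(eth_list, target_df, consolidated_eth):
    # Draw an equal share of roughly 1000 names from each sub-ethnicity.
    # Floor division keeps n an integer on both Python 2 and 3.
    n_per_eth = 1000 // len(eth_list)
    sample_df = pd.DataFrame(columns=['Name', 'True Ethnicity'])
    for ethnicity in eth_list:
        eth_df = df[df['True Ethnicity']==ethnicity]
        # replace=True oversamples groups that have fewer than n_per_eth names
        sample_df = pd.concat([sample_df, eth_df.sample(n=n_per_eth, replace=True)])
    sample_df['True Ethnicity'] = consolidated_eth
    return pd.concat([target_df, sample_df])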


# Consolidate East European
east_euro = ['russian','ukranian','czech','slavic']
df_c = consolidate(east_euro, df_c, 'Eastern European')

# Consolidate West European
west_euro = ['italian','irish','danish','french',
                'swedish','german','swiss']
df_c = consolidate(west_euro, df_c, 'Western European')

# Consolidate Muslim / Arab
muslim_arabic = ['muslim', 'arabic']
df_c = consolidate(muslim_arabic, df_c, 'Muslim/Arabic')

# Consolidate East Asian
east_asian = ['chinese','japanese','vietnamese','korean']
df_c = consolidate(east_asian, df_c, 'East Asian')

# Consolidate Hispanic (Spanish and Portuguese name lists)
hispanic = ['spanish','portugese']
df_c = consolidate(hispanic, df_c, 'Hispanic')

# Jewish can remain its own category
jewish = ['jewish']
df_c = consolidate(jewish, df_c, 'Jewish')

# Indian can remain its own category
indian = ['indian']
df_c = consolidate(indian, df_c, 'Indian')

# African can remain its own category 
african = ['african']
df_c = consolidate(african, df_c, 'African')

print 'Cleaned sample size:', len(df_c)
df_c.sample(n=15)

Cleaned sample size: 7994
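The total is 7994 rather than 8000 because `consolidate` uses integer division: the seven Western European lists each contribute 1000 // 7 = 142 names, for 994 in that group. The short check below reproduces the arithmetic; the dictionary simply restates the group sizes from the cell above.

```python
# Number of source lists merged into each consolidated group (from the cell above).
group_sizes = {
    'Eastern European': 4, 'Western European': 7, 'Muslim/Arabic': 2,
    'East Asian': 4, 'Hispanic': 2, 'Jewish': 1, 'Indian': 1, 'African': 1,
}
TARGET = 1000

# Each source list contributes TARGET // k rows, so a group ends up with (TARGET // k) * k rows.
rows_per_group = {g: (TARGET // k) * k for g, k in group_sizes.items()}
total_rows = sum(rows_per_group.values())

print(rows_per_group['Western European'])
print(total_rows)
```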


Unnamed: 0,Name,True Ethnicity
12676,Chinweike,African
513,Phuong,East Asian
12780,Monifa,African
18364,Bhalla,Indian
19354,Shehata,Muslim/Arabic
6904,Regenbogen,Western European
13800,Son,East Asian
530,Tieu,East Asian
19266,Rashed,Muslim/Arabic
12807,Nsia,African


There is a lot of academic debate about how ethnicity categories ought to be defined, but for our purposes, let's just consolidate groups based on similarities in the names and culture of their American diaspora.

# 3. Train the Model

The Baseline Model uses a multinomial logistic regression (also known as a MaxEnt / Maximum Entropy classifier).

### Motivation

Because the classifier chooses among many ethnicities (it is not a binary decision), a multinomial logistic regression is appropriate: it models a categorical dependent variable with several unordered categories.

This model is particularly useful because it outputs a probability for each prediction, which indicates how certain we are about the classification. A future iteration of the model can choose to "abstain" from predicting when uncertain, giving us a much more precise list of voters for a given ethnicity if we choose a conservative threshold.

### Implementation

For the baseline model, we adapted an open-source implementation of a MaxEnt classifier by GitHub user kitofans, who wrote a wrapper around NLTK's Maximum Entropy classifier that takes names as training input.

During training, the NLTK classifier considers all probability distributions that are consistent with the training data that has been fed in, and it chooses the distribution with the highest entropy.
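For reference, the standard conditional MaxEnt (multinomial logistic) model has the form below, where $f(x, y)$ is a feature vector computed from a name $x$ and a candidate ethnicity $y$, and $w$ is the learned weight vector. The notation is ours; the notebook does not spell out which features the wrapper extracts.

$$
P(y \mid x) = \frac{\exp\left(w \cdot f(x, y)\right)}{\sum_{y'} \exp\left(w \cdot f(x, y')\right)}
$$

Among all distributions whose expected feature values match those observed in the training data, the maximum-entropy one has this exponential form, which is why the two names for the model are used interchangeably.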


**Citation**

- kitofans's wrapper: https://github.com/kitofans/ethnicityguesser
- NLTK MaxEnt model: http://www.nltk.org/api/nltk.classify.html


Now let's train this bad boy.

In [69]:
## Split into Training and Test
msk = np.random.rand(len(df_c)) < 0.5
train_df = df_c[msk]
test_df = df_c[~msk]

print "Total Sample (n)", len (df_c)
print "Test Sample (test n)", len(test_df)
print "Train Sample (train n)", len(train_df)

Total Sample (n) 7994
Test Sample (test n) 3894
Train Sample (train n) 4100
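One caveat with this split: because `consolidate` samples with replacement, the same name can appear several times in `df_c`, and a row-level random mask can put copies of one name in both the training and test sets, which would tend to inflate test accuracy. A possible remedy, sketched here with plain Python lists rather than the notebook's DataFrames, is to split on unique names and fix the random seed. The function `split_by_name` and the toy rows are illustrative only.

```python
import random

def split_by_name(rows, test_fraction=0.5, seed=0):
    """rows: list of (name, ethnicity) pairs, possibly with repeated names."""
    unique_names = sorted({name for name, _ in rows})
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(unique_names)
    n_test = int(len(unique_names) * test_fraction)
    test_names = set(unique_names[:n_test])
    # Every copy of a name goes to the same side of the split.
    train = [r for r in rows if r[0] not in test_names]
    test = [r for r in rows if r[0] in test_names]
    return train, test

rows = [('Hakimi', 'Muslim/Arabic'), ('Hakimi', 'Muslim/Arabic'),
        ('Shankar', 'Indian'), ('Muraoka', 'East Asian'),
        ('Muraoka', 'East Asian'), ('Urban', 'Eastern European')]
train, test = split_by_name(rows)
print(len(train) + len(test) == len(rows))
```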


In [70]:
## List Consolidated Ethnicities
ethnicities_c = [
    'Eastern European',
    'Western European',
    'Muslim/Arabic',
    'East Asian',
    'Hispanic',
    'Jewish',
    'Indian',
    'African'
]

## Package DF into training token
train_tokens = []
for ethnicity in ethnicities_c:
    new_tokens = (list(train_df[train_df['True Ethnicity'] == ethnicity]['Name']), ethnicity)
    train_tokens.append(new_tokens)

# (Tokens must be a list of ([list of names], 'ethnicity') pairs.)
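To make the expected token shape concrete, here is a small standalone sketch that groups (name, ethnicity) pairs into the `([list of names], 'ethnicity')` structure described above. The helper `to_tokens` is hypothetical and works on plain lists, not on `train_df`.

```python
from collections import defaultdict

def to_tokens(rows):
    """Group (name, label) pairs into a list of ([names], label) tuples."""
    names_by_label = defaultdict(list)
    for name, label in rows:
        names_by_label[label].append(name)
    return [(names, label) for label, names in names_by_label.items()]

rows = [('Mejias', 'Hispanic'), ('Caro', 'Hispanic'), ('Bhalla', 'Indian')]
tokens = to_tokens(rows)
print(sorted(tokens, key=lambda pair: pair[1]))
```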

In [71]:
## Train Classifier (beware, this takes time)

classifier = mxec(train_tokens)
classifier.train()

  ==> Training (100 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -2.07944        0.122
             2          -1.14179        0.929
             3          -0.78956        0.950
             4          -0.61208        0.966
             5          -0.50362        0.977
             6          -0.42959        0.982
             7          -0.37544        0.986
             8          -0.33392        0.988
             9          -0.30099        0.990
            10          -0.27420        0.990
            11          -0.25196        0.991
            12          -0.23319        0.991
            13          -0.21713        0.991
            14          -0.20324        0.991
            15          -0.19109        0.991
            16          -0.18039        0.992
            17          -0.17088        0.992
            18          -0.16238        0.992
            19          -0.15473        0.992
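The log likelihood at iteration 1, -2.07944, is what a uniform guess over the eight consolidated classes would give, assuming the reported value is the average natural-log probability assigned to the correct label. That reading is our inference from the number, not something the notebook states.

```python
import math

n_classes = 8  # consolidated ethnicities
# Log probability of the correct label when every class gets probability 1/8.
uniform_log_likelihood = math.log(1.0 / n_classes)
print(round(uniform_log_likelihood, 5))
```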
 

In [152]:
# Test Classifier
print "Michael     ", classifier.classify('Michael')
print "Roberto     ", classifier.classify('Roberto')
print "Lee         ", classifier.classify('Lee')
print "sajkfldsafh ", classifier.classify('sajkfldsafh')


Michael      Western European
Roberto      Hispanic
Lee          East Asian
sajkfldsafh  Muslim/Arabic


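These predictions look reasonable for the real names. Note, however, that the classifier always returns a label, even for the gibberish string. Let's take a peek under the hood and see what the probabilities of the predictions are like: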

In [153]:
def prob(name):
    return classifier.prob_classify(name)._prob_dict.items()
    #return max((p,v) for (v,p) in classifier.prob_classify(name)._prob_dict.items())

# log probability of each candidate ethnicity (higher, i.e. closer to 0, is more likely)
print prob('Michael')


[('Eastern European', -3.7065529482840653), ('Jewish', -6.6514402039659508), ('Western European', -0.16855233603485154), ('African', -11.308063175888986), ('Hispanic', -7.7907496492955053), ('Indian', -8.3545578338112048), ('East Asian', -6.0727470332155962), ('Muslim/Arabic', -10.121878347922477)]


According to documentation, the classifier chooses whichever ethnicity has the highest score here. For "Michael", that would be:
* 'Western European', -0.16855233603485154
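The scores are log probabilities, so the largest value (the one closest to zero) wins. Raising 2 to each score gives values that sum to one, which suggests the logs are base 2; that is an inference from the numbers printed above rather than something stated in the notebook. The sketch below converts the "Michael" scores back to probabilities.

```python
# Scores copied from the output above for "Michael".
log2_scores = {
    'Eastern European': -3.7065529482840653,
    'Jewish': -6.6514402039659508,
    'Western European': -0.16855233603485154,
    'African': -11.308063175888986,
    'Hispanic': -7.7907496492955053,
    'Indian': -8.3545578338112048,
    'East Asian': -6.0727470332155962,
    'Muslim/Arabic': -10.121878347922477,
}

# Assumption: scores are base-2 logs, so 2 ** score recovers the probability.
probs = {label: 2.0 ** score for label, score in log2_scores.items()}
best = max(probs, key=probs.get)

print(best)
print(round(probs[best], 3))
print(round(sum(probs.values()), 3))
```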

In [83]:
## Predict!!!!

test_names = list(test_df['Name'])
test_eth = list(test_df['True Ethnicity'])

test_preds = []

for name in test_names:
    pred = classifier.classify(name)
    test_preds.append(pred)

df_preds = pd.DataFrame({
    'Name': test_names,
    'True Ethnicity': test_eth,
    'Prediction': test_preds
})

df_preds.sample(15)

Unnamed: 0,Name,Prediction,True Ethnicity
419,Triebel,Western European,Eastern European
1386,Hakimi,Muslim/Arabic,Muslim/Arabic
2639,Alpron,Western European,Jewish
1978,Apollo,Hispanic,Hispanic
2899,Baria,Hispanic,Indian
3379,Shankar,Indian,Indian
2741,Zimbalist,Jewish,Jewish
1314,Mifsud,Muslim/Arabic,Muslim/Arabic
1589,Muraoka,East Asian,East Asian
479,Acconci,Western European,Western European


Looks alright. Let's add an indicator variable of True if the guess is correct.

In [84]:
# Add True if you got it right
df_preds['Accuracy'] = (df_preds['Prediction']==df_preds['True Ethnicity'])
df_preds.sample(15)

Unnamed: 0,Name,Prediction,True Ethnicity,Accuracy
1368,Koury,Muslim/Arabic,Muslim/Arabic,True
2245,De Araujo,Hispanic,Hispanic,True
1092,Siddiqui,Muslim/Arabic,Muslim/Arabic,True
2399,Tzarfat,Jewish,Jewish,True
2872,Chazzan,Jewish,Jewish,True
113,Urban,Eastern European,Eastern European,True
1345,Atiyeh,Muslim/Arabic,Muslim/Arabic,True
3146,Upadhyay,Indian,Indian,True
3215,Sundaram,Jewish,Indian,False
3801,Oluwatoyin,African,African,True


# 4. Evaluate the Model

With binary logistic regression, the ROC curve and corresponding AUC are usually good ways to assess overall performance, but this is a multiclass problem. We'll instead compute the following metrics for each ethnicity, treating it as one class against the rest:
- Classification Accuracy
- TPR (true positive rate, also called sensitivity or recall)
- FPR (false positive rate)
- Precision
- F1 score (harmonic mean of precision and recall)
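Alongside these per-class metrics, a table of (true, predicted) counts shows which ethnicities get confused with which. The sketch below is a minimal standard-library version, not part of the original notebook; the five example pairs are taken from the prediction sample shown earlier.

```python
from collections import Counter

def confusion_counts(true_labels, predicted_labels):
    """Count (true, predicted) pairs; off-diagonal pairs are the misclassifications."""
    return Counter(zip(true_labels, predicted_labels))

# Triebel, Alpron, Baria, Shankar, Zimbalist from the sample above.
true_labels = ['Eastern European', 'Jewish', 'Indian', 'Indian', 'Jewish']
predicted = ['Western European', 'Western European', 'Hispanic', 'Indian', 'Jewish']

counts = confusion_counts(true_labels, predicted)
for (truth, guess), n in sorted(counts.items()):
    print('{} -> {}: {}'.format(truth, guess, n))
```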



In [180]:
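## Tools to Calculate One vs. Rest Accuracy Rates

def calcAccuracy(df):
    # Fraction of rows predicted correctly. This helper is called in the next
    # cell but its definition is not shown in this notebook; this version
    # assumes it averages the boolean 'Accuracy' column added above, which is
    # consistent with the per-ethnicity values reported below.
    return df['Accuracy'].mean()

def calcTP(df, eth):
    # predicted eth, truly eth
    P = df[df['Prediction']==eth]
    TP = P[P['True Ethnicity']==eth]
    return len(TP)

def calcFP(df, eth):
    # predicted eth, truly something else
    P = df[df['Prediction']==eth]
    FP = P[P['True Ethnicity']!=eth]
    return len(FP)

def calcTN(df, eth):
    # predicted something else, truly something else
    N = df[df['Prediction']!=eth]
    TN = N[N['True Ethnicity']!=eth]
    return len(TN)

def calcFN(df, eth):
    # predicted something else, truly eth
    N = df[df['Prediction']!=eth]
    FN = N[N['True Ethnicity']==eth]
    return len(FN)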




In [201]:
accuracies = []
TPs = [] # number of times we predict X when the true ethnicity is X
FPs = [] # number of times we predict X when the true ethnicity is not X
TNs = [] # number of times we predict not-X when the true ethnicity is not X
FNs = [] # number of times we predict not-X when the true ethnicity is X


ethnicity_list = []

# Classification Accuracy
for ethnicity in ethnicities_c:
    accuracy = calcAccuracy(df_preds[df_preds['True Ethnicity']==ethnicity])
    accuracies.append(accuracy)
    TPs.append(calcTP(df_preds, ethnicity))
    FPs.append(calcFP(df_preds, ethnicity))
    TNs.append(calcTN(df_preds, ethnicity))
    FNs.append(calcFN(df_preds, ethnicity))
    ethnicity_list.append(ethnicity)

# Aggregate accuracy
#accuracies.append(calcAccuracy(df_preds))
#ethnicity_list.append('OVERALL')

# put into df
df_acc = pd.DataFrame({
    'True Ethnicity': ethnicity_list,
    'Classification Accuracy': accuracies,
    'TP': TPs,
    'FP': FPs,
    'TN': TNs,
    'FN': FNs
})

df_acc.set_index('True Ethnicity', inplace=True)

# Add TPR (Sensitivity)
df_acc['Sensitivity (TPR)'] = (df_acc['TP']) / (df_acc['TP'] + df_acc['FN'])

# Add FPR
df_acc['FPR'] = (df_acc['FP']) / (df_acc['FP'] + df_acc['TN'])

# Add Precision
df_acc['Precision'] = (df_acc['TP']) / (df_acc['TP'] + df_acc['FP'])

# F1 Score (harmonic mean of precision and sensitivity)
df_acc['F1 Score'] = (2 * df_acc['TP'] / ((2*df_acc['TP'])+df_acc['FP']+df_acc['FN']))

# Accuracy (ACC)
df_acc['ACC'] = (df_acc['TP']+df_acc['TN']) / (df_acc['TP']+df_acc['TN']+df_acc['FP']+df_acc['FN'])






In [202]:
df_acc

True Ethnicity,Classification Accuracy,FN,FP,TN,TP,Sensitivity (TPR),FPR,Precision,F1 Score,ACC
Eastern European,0.737418,120,130,3307,337,0.737418,0.037824,0.721627,0.729437,0.935799
Western European,0.577236,208,174,3228,284,0.577236,0.051146,0.620087,0.597895,0.9019
Muslim/Arabic,0.85259,74,123,3269,428,0.85259,0.036262,0.77677,0.812915,0.949409
East Asian,0.78913,97,96,3338,363,0.78913,0.027956,0.79085,0.789989,0.950437
Hispanic,0.776639,109,138,3268,379,0.776639,0.040517,0.733075,0.754229,0.936569
Jewish,0.520243,237,160,3240,257,0.520243,0.047059,0.616307,0.564215,0.898048
Indian,0.760784,122,113,3271,388,0.760784,0.033392,0.774451,0.767557,0.939651
African,0.947047,26,59,3344,465,0.947047,0.017338,0.887405,0.916256,0.978172


### Observations

Our ACCs (one-vs-rest accuracy scores) are very high and our FPRs are very low, but both are misleading because they are dominated by our high True Negative count. We are using a one vs. rest calculation approach. With eight classes, this leads to a high number of true negatives, because most names neither belong to nor are predicted as the "one" ethnicity in question.

The most useful metrics here are probably True Positive Rate/Sensitivity (because it tells us how often we are predicting correctly within a given ethnicity), and the Precision (because it tells the likelihood of whether or not our guess is correct, once we make it).

F1 Score is useful because it is the harmonic mean of these two metrics (Precision and Sensitivity), so it lets us consider them together.
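To put a number on how much the one-vs-rest ACC overstates performance, the counts in the table can be combined into a single multiclass accuracy (correct predictions over all test names) and a macro-averaged F1. The sketch below only re-uses the TP, FP and FN columns reported above.

```python
# (TP, FP, FN) per ethnicity, copied from the results table.
counts = {
    'Eastern European': (337, 130, 120),
    'Western European': (284, 174, 208),
    'Muslim/Arabic': (428, 123, 74),
    'East Asian': (363, 96, 97),
    'Hispanic': (379, 138, 109),
    'Jewish': (257, 160, 237),
    'Indian': (388, 113, 122),
    'African': (465, 59, 26),
}

# Every test name is a TP or an FN for its own true class.
n_test = sum(tp + fn for tp, fp, fn in counts.values())
overall_accuracy = sum(tp for tp, fp, fn in counts.values()) / float(n_test)

# Macro F1: unweighted mean of the per-class F1 scores.
f1_scores = [2.0 * tp / (2 * tp + fp + fn) for tp, fp, fn in counts.values()]
macro_f1 = sum(f1_scores) / len(f1_scores)

print(n_test)
print(round(overall_accuracy, 3))
print(round(macro_f1, 3))
```

Both figures come out near 0.74, well below the per-class ACC values of 0.90 to 0.98 and also well below the 0.992 training accuracy reported at iteration 19, which may point to overfitting.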

### Next Steps


**Tuning this Model**

Moving forward, we think that it would be beneficial to optimize for Precision (by only assigning an ethnicity when we are confident, and abstaining in ambiguous cases). This would mean that we predict on only some data points.
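The abstention rule described above could look something like the following sketch. It assumes the classifier's output has already been converted to a dictionary of ordinary probabilities; the function name, the 0.9 threshold, and the two example distributions are made up for illustration.

```python
def classify_or_abstain(probs, threshold=0.9):
    """probs: dict mapping label -> probability. Return the top label, or None to abstain."""
    label = max(probs, key=probs.get)
    return label if probs[label] >= threshold else None

# Hypothetical distributions, not real classifier output.
confident = {'African': 0.95, 'Hispanic': 0.03, 'Indian': 0.02}
ambiguous = {'Western European': 0.55, 'Jewish': 0.40, 'Eastern European': 0.05}

print(classify_or_abstain(confident))
print(classify_or_abstain(ambiguous))
```

Raising the threshold trades coverage for precision, so it would need to be tuned on held-out data rather than fixed in advance.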

In our research project, we think this is a good trade off. Given the enormous size of our dataset (>10 million voters in Florida), it is fine for us to abstain from predicting for many people. Even if we predict on only 1 million of the most ethnically identifiable names for each ethnicity, that is enough to make conclusions on the state level.

One concern is that by only classifying highly identifiable names, we are introducing a bias, because people of a certain ethnicity who have a less identifiable name may behave differently as voters. As a working assumption, we'll assume that the ethnic identifiability of a person's name does not have a causal or confounding effect on their voting behavior, and that abstaining on those less identifiable names is better than making an incorrect classification.

**Testing Other Models**

This Multinomial Logistic Regression (MaxEnt) is one of many approaches that can be used to classify based on name. Other reasonable models could include:

- k-Nearest Neighbors
- hidden Markov models
- k-means clustering
- LDA

We will continue to explore the literature on name/language classification and will test one or more of the above models as appropriate.


**Using Better Data**

A limitation of our approach is that a certain name may appear more frequently in one ethnicity, and less in another, but our Baby Names dataset does not account for this.

We will try to experiment with other data sources to see if we can get a better performance.

## Mini EDA: Applying the Predictor to a subset of our data

In [208]:
## Import data cleaned by Riddhi and Kimia
df_voters = pd.read_csv('Milestone33.csv', sep='\t')
df_voters.sample(10)

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,index,district,id,LAST_NAME,FIRST_NAME,zip,female,dob,regyear,party,electiondate,general,typeofvote,age,GEN16,GEN14
2644,5139,5139,3153570,CLL,103047739,Howard,Timothy,34117,M,1961-10-26 00:00:00,09/22/2000,REP,11/06/2012,GEN,A,56.0,0,0
754,1443,1443,1696657,CHA,102606670,Pruey,R,34224,M,1980-01-11 00:00:00,05/06/1998,NPA,11/08/2016,GEN,Y,37.0,1,0
2516,4871,4871,5669616,DUV,103293240,Moore,William,32233,M,1956-11-13 00:00:00,10/08/1990,REP,11/08/2016,GEN,Y,61.0,1,0
4975,9672,9672,7433685,DUV,103841413,Sanders,Sue,32210,U,1969-04-06 00:00:00,01/31/1995,DEM,11/06/2012,GEN,Y,48.0,0,0
34,70,70,1608176,CHA,102626244,Kelley,Kristine,34224,F,1966-02-24 00:00:00,11/15/2000,DEM,11/07/2006,GEN,N,51.0,0,0
5081,9887,9887,6220186,DUV,103752822,Carswell,Dana,32277,F,1952-09-01 00:00:00,11/08/1983,DEM,11/04/2014,GEN,E,65.0,0,0
1910,3704,3704,2768859,CLL,103016985,James,Carol,34112,F,1939-06-15 00:00:00,02/26/1998,REP,11/07/2006,GEN,E,78.0,0,0
2701,5248,5248,999427,BRA,100753930,Goodman,Anessa,32058,F,1975-07-07 00:00:00,05/05/1997,DEM,11/06/2012,GEN,Y,42.0,0,0
2801,5449,5449,3971893,CLM,103179402,DRAWDY,CAROLYN,32024,F,1941-07-16 00:00:00,03/28/1968,DEM,11/04/2008,GEN,E,76.0,0,1
834,1595,1595,1290058,CHA,102560007,Goldman,Jason,33954,M,1970-03-23 00:00:00,06/02/1988,DEM,11/04/2014,GEN,Y,47.0,0,0


In [209]:
## Add predictions to voter data
ethnicity_predictions = []
for name in list(df_voters['LAST_NAME']):
    ethnicity_predictions.append(classifier.classify(name))

In [210]:
df_voters['Ethnicity Prediction'] = ethnicity_predictions

In [295]:
df_voters.sample(15)[['LAST_NAME', 'Ethnicity Prediction']]

Unnamed: 0,LAST_NAME,Ethnicity Prediction
1526,Kelleher,Jewish
3536,Pullen,Western European
4547,Pineda,Hispanic
4707,MCDOWELL,Western European
2959,Lessord,Eastern European
3846,Clemons,Western European
3054,Montgomery,Western European
4772,THIBODAUX,Western European
1682,RANDLES,Hispanic
3758,THORNTON,Western European


Baseline looks reasonable on the voter data, but definitely has room for improvement.
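One possible source of error here is casing: the voter file mixes title-case entries ("Howard") with all-caps ones ("DRAWDY", "MCDOWELL"), while the training names shown earlier are title case. Whether the wrapper's features are case-sensitive is not established in this notebook, so treat the following as an assumption-driven sketch. The helper `normalize_name` is hypothetical.

```python
def normalize_name(name):
    """Collapse stray whitespace and convert to title case before classifying."""
    return ' '.join(name.split()).title()

for raw in ['MCDOWELL', '  THIBODAUX ', 'De Araujo']:
    print(normalize_name(raw))
```

Title-casing is lossy for names with internal capitals (it yields "Mcdowell", not "McDowell"), so it would be worth checking the effect on a labelled sample before adopting it.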