# Baseline Ethnicity Imputer (1.0)


**Model:** We used a *Multinomial Logistic Regression* because:
1. Our decision is categorical (non-binary)
2. We can use the substrings within names as features, and let the classifier assign coefficients to them as they get correlated to ethnicities
3. (Most Important) It allows us to output a score for each prediction, so we can tweak the threshold at which we go ahead and make a prediction. 
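
To illustrate point 2, the sketch below shows one way substrings of a name can be turned into features. It is a minimal illustration, assuming character bigrams and trigrams with start/end markers; the function `name_features` and its feature keys are hypothetical, and the feature extractor actually used inside `ethnicityguesser` may differ.

```python
def name_features(name, sizes=(2, 3)):
    """Return a dict of binary substring features for one name.

    Assumption: character n-grams, with '^' and '$' marking the start
    and end of the name. This is an illustrative scheme, not
    necessarily the one used by the classifier in this notebook.
    """
    padded = '^' + name.lower() + '$'
    features = {}
    for n in sizes:
        for i in range(len(padded) - n + 1):
            features['has(%s)' % padded[i:i + n]] = True
    return features


features = name_features('Lee')
```

A logistic regression then learns one coefficient per (feature, ethnicity) pair, so a substring such as `ee$` can count as evidence for some ethnicities and against others.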

**Name-Ethnicity Datasets:** We experimented with two name-ethnicity datasets:
1. A list of baby names found on FamilyEducation.com
2. Names scraped from Wikipedia that had ethnicity metadata associated with them, open-sourced by (Ambekar, et al., 2009)

**Test Set:** Here we had to get creative. Because we didn't have actual ethnicities attached to the Voter Records, we needed to test externally. To make our test set, we:
1. Took 50% of the total names in the Baby Names name-ethnicity dataset, and set them aside for testing
2. Out of that dataset, we eliminated the names that already appeared in the training set. This makes our test set more stringent than the Voter Records set. This was necessary because many of the names in the Voter Records did not appear in our training set, so we didn't want an artificially high accuracy score caused by training names re-appearing in the test set.
3. Used over-sampling to balance the test set.

**Training Set:** The remaining 50% of the names went to train our model. We over-sampled / under-sampled certain ethnicities to balance the training set. We did not remove repeated names, because their frequencies correlate with real-world frequencies.
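
The split described above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions: the helper names `split_without_overlap` and `oversample`, the fixed seed, and the toy data are all hypothetical and are not the code used later in this notebook.

```python
import random


def split_without_overlap(pairs, test_fraction=0.5, seed=0):
    """Split (name, ethnicity) pairs, then drop test names seen in training."""
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * test_fraction)
    test, train = shuffled[:cut], shuffled[cut:]
    train_names = set(name for name, _ in train)
    test = [(name, eth) for name, eth in test if name not in train_names]
    return train, test


def oversample(pairs, seed=0):
    """Balance classes by resampling each one up to the largest class size."""
    rng = random.Random(seed)
    by_eth = {}
    for name, eth in pairs:
        by_eth.setdefault(eth, []).append((name, eth))
    target = max(len(rows) for rows in by_eth.values())
    balanced = []
    for eth in sorted(by_eth):
        rows = by_eth[eth]
        balanced.extend(rows)
        # draw extra rows with replacement until the class reaches the target
        balanced.extend(rng.choice(rows) for _ in range(target - len(rows)))
    return balanced


# Toy data (hypothetical), only to show the shapes involved.
pairs = [('Koury', 'Muslim/Arabic'), ('Koury', 'Muslim/Arabic'),
         ('Assaf', 'Muslim/Arabic'), ('Cobos', 'Hispanic'),
         ('Vidal', 'Hispanic'), ('Bal', 'Indian')]
train, test = split_without_overlap(pairs)
balanced_train = oversample(train)
```

Repeated names are kept in the training half, as described above, while any test name that also occurs in training is dropped.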


Chong, D., & Kim, D. (2006). The experiences and effects of economic status among racial and ethnic minorities. American Political Science Review, 100(3), 335–351.

Ambekar, A., Ward, C., Mohammed, J., Male, S., & Skiena, S. (2009, June). Name-ethnicity classification from open sources. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge Discovery and Data Mining (pp. 49-58). ACM.

**TOC**

Baseline Model

1. Import & Clean Name Data
2. EDA on Name Data
3. Training the Baseline Model
4. Evaluating the Baseline Model

Revised Model (uses Race as a Prior + bigger Wikipedia dataset)

1. Import & Clean Name Data
2. EDA on Name Data
3. Training the Revised Model
4. Evaluating the Revised Model





In [2]:
from ethnicityguesser.NLTKMaxentEthnicityClassifier import NLTKMaxentEthnicityClassifier as mxec
from os import walk
import pandas as pd
import csv
import pickle
import numpy as np




##  Import & Clean Name Data




In [19]:
## Import names paired with ethnicities ##

# find names of files
f = []
for (dirpath, dirnames, filenames) in walk("ethnicityguesser/pickled_names"):
    f.extend(filenames)
    break

# list types of ethnicities
ethnicities = []
for each in f:
    ethnicities.append(each.partition('.')[0])


# pair type of ethnicity to its names in a dict
eth_dict = {}
for ethnicity in ethnicities:
    with open('ethnicityguesser/pickled_names/'+ethnicity+'.pkl', 'rb') as filename:
        names = pickle.load(filename)
    eth_dict[ethnicity] = names




In [20]:
ethnicities

['chinese',
 'vietnamese',
 'irish',
 'danish',
 'french',
 'russian',
 'japanese',
 'german',
 'czech',
 'arabic',
 'ukranian',
 'swedish',
 'spanish',
 'african',
 'swiss',
 'korean',
 'jewish',
 'greek',
 'italian',
 'slavic',
 'indian',
 'muslim',
 'portugese']

In [7]:
## make a super list of names and true ethnicities

super_list_names = []
super_list_ethnicities = []

for ethnicity in ethnicities:
    name_list = eth_dict[ethnicity][0]
    eth_list = []
    for name in name_list:
        eth_list.append(ethnicity)
    super_list_names = super_list_names + name_list
    super_list_ethnicities = super_list_ethnicities + eth_list
    
df = pd.DataFrame(
            {'Name': super_list_names,
             'True Ethnicity': super_list_ethnicities
            })
    

## EDA on Name Data

Let's examine what our name data looks like in reality.

In [23]:
df.sample(frac=1).head(10)

Unnamed: 0,Name,True Ethnicity
9458,Grinberg,swedish
7489,Herda,czech
19557,Caro,portugese
13215,Guell,swiss
13208,Grunder,swiss
2000,Bonet,french
13994,Awerbuch,jewish
17392,Agli,italian
11597,Mejias,spanish
2349,Chabot,french


Let's compare the data we have for every ethnicity.

In [33]:
print "n for each ethnicity sample"
for ethnicity in ethnicities:
    print len(df[df['True Ethnicity']==ethnicity]), ethnicity
    

n for each ethnicity sample
426 chinese
129 vietnamese
318 irish
656 danish
4143 french
181 russian
566 japanese
723 german
1406 czech
108 arabic
409 ukranian
1158 swedish
2366 spanish
300 african
774 swiss
169 korean
3076 jewish
429 greek
711 italian
258 slavic
580 indian
525 muslim
834 portugese


23

Choosing accurately between 23 specific ethnicities is more difficult than choosing between 5 or 6 broadly defined ethnicities because (1) more classes create more opportunities for classification errors, and (2) some ethnicities from similar parts of the world have overlapping names (for example, "Alexander" is a common Danish, Greek, and French name).

Let's consolidate some groups that share name/cultural similarities. Let's also make equal sample sizes for each consolidated group, drawing evenly from each subgroup to prevent our classifier from ignoring "low frequency" ethnicities.

In [55]:
temp_df = pd.DataFrame(columns=['Name', 'True Ethnicity'])

target_df = temp_df
eth_list = ['danish', 'french', 'italian']

sample_df = pd.DataFrame(columns=['Name', 'True Ethnicity'])
for ethnicity in eth_list:
    eth_df = df[df['True Ethnicity']==ethnicity]
    n_per_eth = (1000 / len(eth_list))
    print n_per_eth
    sample_df = pd.concat([sample_df, eth_df.sample(n=n_per_eth)])

target_df = pd.concat([target_df,sample_df]) 

target_df['True Ethnicity'] = "white"


333
333
333


In [65]:
# Consolidation function

df_c = pd.DataFrame(columns=['Name', 'True Ethnicity'])

def consolidate(eth_list, target_df, consolidated_eth):
    sample_df = pd.DataFrame(columns=['Name', 'True Ethnicity'])
    for ethnicity in eth_list:
        eth_df = df[df['True Ethnicity']==ethnicity]
        n_per_eth = (1000 / len(eth_list))
        sample_df = pd.concat([sample_df, eth_df.sample(n=n_per_eth, replace = True)])
    sample_df['True Ethnicity'] = consolidated_eth
    return pd.concat([target_df,sample_df]) 


# Consolidate East European
east_euro = ['russian','ukranian','czech','slavic']
df_c = consolidate(east_euro, df_c, 'Eastern European')

# Consolidate West European
west_euro = ['italian','irish','danish','french',
                'swedish','german','swiss']
df_c = consolidate(west_euro, df_c, 'Western European')

# Consolidate Muslim / Arab
muslim_arabic = ['muslim', 'arabic']
df_c = consolidate(muslim_arabic, df_c, 'Muslim/Arabic')

# Consolidate East Asian
east_asian = ['chinese','japanese','vietnamese','korean']
df_c = consolidate(east_asian, df_c, 'East Asian')

# Spanish / Hispanic can remain its own category
hispanic = ['spanish','portugese'] 
df_c = consolidate(hispanic, df_c, 'Hispanic')

# Jewish can remain its own category
jewish = ['jewish']
df_c = consolidate(jewish, df_c, 'Jewish')

# Indian can remain its own category
indian = ['indian']
df_c = consolidate(indian, df_c, 'Indian')

# African can remain its own category 
african = ['african']
df_c = consolidate(african, df_c, 'African')

print 'Cleaned sample size:', len(df_c)
df_c.sample(n=10)

Cleaned sample size: 7994


Unnamed: 0,Name,True Ethnicity
18894,Abdi,Muslim/Arabic
10846,Cobos,Hispanic
16637,Vidal,Jewish
11767,Napoleon,Hispanic
12599,Adanna,African
18334,Bal,Indian
18730,Rampersaud,Indian
18520,Jhaveri,Indian
18948,Assaf,Muslim/Arabic
8607,Koury,Muslim/Arabic


In [68]:
## List Consolidated Ethnicities
ethnicities_c = [
    'Eastern European',
    'Western European',
    'Muslim/Arabic',
    'East Asian',
    'Hispanic',
    'Jewish',
    'Indian',
    'African'
]

## Train the Model

In [69]:
## Split into Training and Test
msk = np.random.rand(len(df_c)) < 0.5
train_df = df_c[msk]
test_df = df_c[~msk]

print "Total Sample (n)", len (df_c)
print "Test Sample (test n)", len(test_df)
print "Train Sample (train n)", len(train_df)

Total Sample (n) 7994
Test Sample (test n) 3894
Train Sample (train n) 4100


In [70]:
## Package DF into training token
train_tokens = []
for ethnicity in ethnicities_c:
    new_tokens = (list(train_df[train_df['True Ethnicity'] == ethnicity]['Name']), ethnicity)
    train_tokens.append(new_tokens)

# (Tokens must be a list of ([list of names], 'ethnicity') pairs.)

In [71]:
## Train Classifier (beware, this takes time)

classifier = mxec(train_tokens)
classifier.train()

  ==> Training (100 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -2.07944        0.122
             2          -1.14179        0.929
             3          -0.78956        0.950
             4          -0.61208        0.966
             5          -0.50362        0.977
             6          -0.42959        0.982
             7          -0.37544        0.986
             8          -0.33392        0.988
             9          -0.30099        0.990
            10          -0.27420        0.990
            11          -0.25196        0.991
            12          -0.23319        0.991
            13          -0.21713        0.991
            14          -0.20324        0.991
            15          -0.19109        0.991
            16          -0.18039        0.992
            17          -0.17088        0.992
            18          -0.16238        0.992
            19          -0.15473        0.992
 

In [140]:
# Test Classifier
print classifier.classify('Michael')
print classifier.classify('Roberto')
print classifier.classify('Lee')
print classifier.classify('sajkfldsafh')

def prob(name):
    return classifier.prob_classify(name)._prob_dict.items()
    #return max((p,v) for (v,p) in classifier.prob_classify(name)._prob_dict.items())

# log probability of each label (values closer to 0 mean more likely)
print prob('Michael')
#print prob('Roberto')
#print prob('Lee')
#print prob('sajkfldsafh')

Western European
Hispanic
East Asian
Muslim/Arabic
[('Eastern European', -3.7065529482840653), ('Jewish', -6.6514402039659508), ('Western European', -0.16855233603485154), ('African', -11.308063175888986), ('Hispanic', -7.7907496492955053), ('Indian', -8.3545578338112048), ('East Asian', -6.0727470332155962), ('Muslim/Arabic', -10.121878347922477)]
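
The scores printed above are log probabilities. They appear to be base 2, since raising 2 to each score gives values that sum to roughly 1. As a small sketch, the printed scores can be converted to ordinary probabilities as follows; the helper name `to_probabilities` is hypothetical and the numbers are copied from the output above.

```python
# Scores copied from the output above for 'Michael'.
log2_scores = [
    ('Eastern European', -3.7065529482840653),
    ('Jewish', -6.6514402039659508),
    ('Western European', -0.16855233603485154),
    ('African', -11.308063175888986),
    ('Hispanic', -7.7907496492955053),
    ('Indian', -8.3545578338112048),
    ('East Asian', -6.0727470332155962),
    ('Muslim/Arabic', -10.121878347922477),
]


def to_probabilities(scores):
    """Convert (label, log2 probability) pairs to (label, probability), best first."""
    probs = [(label, 2.0 ** score) for label, score in scores]
    return sorted(probs, key=lambda pair: pair[1], reverse=True)


probs = to_probabilities(log2_scores)
best_label, best_prob = probs[0]
total = sum(p for _, p in probs)
```

Here `best_label` is 'Western European' with a probability of roughly 0.89, which is the kind of score a prediction threshold can be applied to.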


In [83]:
## Predict!!!!

test_names = list(test_df['Name'])
test_eth = list(test_df['True Ethnicity'])

test_preds = []

for name in test_names:
    pred = classifier.classify(name)
    test_preds.append(pred)

df_preds = pd.DataFrame({
    'Name': test_names,
    'True Ethnicity': test_eth,
    'Prediction': test_preds
})

df_preds.sample(15)

Unnamed: 0,Name,Prediction,True Ethnicity
419,Triebel,Western European,Eastern European
1386,Hakimi,Muslim/Arabic,Muslim/Arabic
2639,Alpron,Western European,Jewish
1978,Apollo,Hispanic,Hispanic
2899,Baria,Hispanic,Indian
3379,Shankar,Indian,Indian
2741,Zimbalist,Jewish,Jewish
1314,Mifsud,Muslim/Arabic,Muslim/Arabic
1589,Muraoka,East Asian,East Asian
479,Acconci,Western European,Western European


In [84]:
# Add True if you got it right
df_preds['Accuracy'] = (df_preds['Prediction']==df_preds['True Ethnicity'])
df_preds.sample(15)

Unnamed: 0,Name,Prediction,True Ethnicity,Accuracy
1368,Koury,Muslim/Arabic,Muslim/Arabic,True
2245,De Araujo,Hispanic,Hispanic,True
1092,Siddiqui,Muslim/Arabic,Muslim/Arabic,True
2399,Tzarfat,Jewish,Jewish,True
2872,Chazzan,Jewish,Jewish,True
113,Urban,Eastern European,Eastern European,True
1345,Atiyeh,Muslim/Arabic,Muslim/Arabic,True
3146,Upadhyay,Indian,Indian,True
3215,Sundaram,Jewish,Indian,False
3801,Oluwatoyin,African,African,True


## Evaluate the Baseline Model

In [85]:
## Tools to Calculate One vs. Rest Accuracy Rates

def calcTP(df, eth):
    P = df[df['Prediction']==eth]
    TP = P[P['True Ethnicity']==eth]
    return len(TP)

def calcFP(df, eth):
    P = df[df['Prediction']==eth]
    FP = P[P['True Ethnicity']!=eth]
    return len(FP)

def calcTN(df, eth):
    N = df[df['Prediction']!=eth]
    TN = N[N['True Ethnicity']!=eth]
    return len(TN)

def calcFN(df, eth):
    N = df[df['Prediction']!=eth]
    FN = N[N['True Ethnicity']==eth]
    return len(FN)

In [88]:
TPs = [] # number of times we predict X when the true ethnicity is X
FPs = [] # number of times we predict X when the true ethnicity is not X
TNs = [] # number of times we predict not-X when the true ethnicity is not X
FNs = [] # number of times we predict not-X when the true ethnicity is X


ethnicity_list = []

# Classification Accuracy
for ethnicity in ethnicities_c:

    TPs.append(calcTP(df_preds, ethnicity))
    FPs.append(calcFP(df_preds, ethnicity))
    TNs.append(calcTN(df_preds, ethnicity))
    FNs.append(calcFN(df_preds, ethnicity))
    ethnicity_list.append(ethnicity)



# put into df
df_acc = pd.DataFrame({
    'True Ethnicity': ethnicity_list,
    'TP': TPs,
    'FP': FPs,
    'TN': TNs,
    'FN': FNs
})

df_acc.set_index('True Ethnicity', inplace=True)

# Add TPR (Sensitivity)
df_acc['Sensitivity (TPR)'] = (df_acc['TP']) / (df_acc['TP'] + df_acc['FN'])

# Add FPR
df_acc['FPR'] = (df_acc['FP']) / (df_acc['FP'] + df_acc['TN'])

# Add Precision
df_acc['Precision'] = (df_acc['TP']) / (df_acc['TP'] + df_acc['FP'])

# F1 Score (harmonic mean of precision and sensitivity)
df_acc['F1 Score'] = (2 * df_acc['TP'] / ((2*df_acc['TP'])+df_acc['FP']+df_acc['FN']))

# Accuracy (ACC)
df_acc['ACC'] = (df_acc['TP']+df_acc['TN']) / (df_acc['TP']+df_acc['TN']+df_acc['FP']+df_acc['FN'])




In [202]:
df_acc

True Ethnicity,Classification Accuracy,FN,FP,TN,TP,Sensitivity (TPR),FPR,Precision,F1 Score,ACC
Eastern European,0.737418,120,130,3307,337,0.737418,0.037824,0.721627,0.729437,0.935799
Western European,0.577236,208,174,3228,284,0.577236,0.051146,0.620087,0.597895,0.9019
Muslim/Arabic,0.85259,74,123,3269,428,0.85259,0.036262,0.77677,0.812915,0.949409
East Asian,0.78913,97,96,3338,363,0.78913,0.027956,0.79085,0.789989,0.950437
Hispanic,0.776639,109,138,3268,379,0.776639,0.040517,0.733075,0.754229,0.936569
Jewish,0.520243,237,160,3240,257,0.520243,0.047059,0.616307,0.564215,0.898048
Indian,0.760784,122,113,3271,388,0.760784,0.033392,0.774451,0.767557,0.939651
African,0.947047,26,59,3344,465,0.947047,0.017338,0.887405,0.916256,0.978172


### Observations

Our ACCs (overall accuracy scores) are very high and our FPRs are very low, but both are misleading because they are inflated by our high True Negative count. We are using a one-vs-rest calculation, which produces many true negatives: for any given ethnicity, most names neither belong to it nor are predicted as it.

The most useful metrics here are probably True Positive Rate/Sensitivity (because it tells us how often we are predicting correctly within a given ethnicity), and the Precision (because it tells the likelihood of whether or not our guess is correct, once we make it).

F1 Score is interesting because it gives us a harmonic mean of these two metrics (Precision and Sensitivity), so it can help us consider the two together.
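
To make the effect of the True Negatives concrete, the sketch below recomputes the metrics from the confusion counts in the table above and compares them with the share of test names classified correctly overall (the sum of the TPs divided by the test set size). The counts are copied from the table; the function name `one_vs_rest_metrics` is ours.

```python
# Confusion counts (TP, FP, TN, FN) copied from the df_acc table above.
counts = {
    'Eastern European': (337, 130, 3307, 120),
    'Western European': (284, 174, 3228, 208),
    'Muslim/Arabic': (428, 123, 3269, 74),
    'East Asian': (363, 96, 3338, 97),
    'Hispanic': (379, 138, 3268, 109),
    'Jewish': (257, 160, 3240, 237),
    'Indian': (388, 113, 3271, 122),
    'African': (465, 59, 3344, 26),
}


def one_vs_rest_metrics(tp, fp, tn, fn):
    """Compute the one-vs-rest metrics used in this notebook for one ethnicity."""
    tp, fp, tn, fn = float(tp), float(fp), float(tn), float(fn)
    return {
        'sensitivity': tp / (tp + fn),
        'fpr': fp / (fp + tn),
        'precision': tp / (tp + fp),
        'f1': 2 * tp / (2 * tp + fp + fn),
        'acc': (tp + tn) / (tp + tn + fp + fn),
    }


metrics = dict((eth, one_vs_rest_metrics(*c)) for eth, c in counts.items())

# Every row covers the whole test set, so any row's counts sum to its size.
n_test = sum(counts['Jewish'])
# Each test name is a true positive for at most one ethnicity.
overall_accuracy = sum(c[0] for c in counts.values()) / float(n_test)
```

With these counts, `overall_accuracy` comes out at about 0.745, well below every one-vs-rest ACC value in the table, which is why Sensitivity and Precision are the more informative columns.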

### Next Steps


**Tuning this Model**
Moving forward, we think it would be beneficial to optimize for Precision (by only classifying a name as an ethnicity when we are sure, and abstaining in ambiguous cases). This would cause us to predict on only some data points.

In our research project, we think this is a good trade off. Given the enormous size of our dataset (>10 million voters in Florida), it is fine for us to abstain from predicting for many people. Even if we predict on only 1 million of the most ethnically identifiable names for each ethnicity, that is enough to make conclusions on the state level.

One concern is that by only classifying highly identifiable names, we introduce a bias, because people of a certain ethnicity who have a less identifiable name may behave differently as voters. As a working assumption, we'll assume that the ethnic identifiability of a person's name does not have a causal or confounding effect on their voting behavior, and that abstaining on less identifiable names is better than making an incorrect classification.

We will continue to explore the literature on name/language classification and will test one or more of the above models as appropriate.
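
The precision/coverage trade-off described in this section can be sketched as follows. This is an illustration only: the threshold values, the helper names, and the example probabilities are hypothetical, and the probabilities are assumed to come from the classifier's per-label scores.

```python
def classify_or_abstain(label_probs, threshold=0.8):
    """Return the most likely label, or None when its probability is below threshold."""
    label, prob = max(label_probs.items(), key=lambda item: item[1])
    return label if prob >= threshold else None


def precision_and_coverage(examples, threshold):
    """examples: list of (label_probs, true_label). Returns (precision, coverage)."""
    predicted = 0
    correct = 0
    for label_probs, true_label in examples:
        guess = classify_or_abstain(label_probs, threshold)
        if guess is None:
            continue  # abstain: counts against coverage, not precision
        predicted += 1
        if guess == true_label:
            correct += 1
    precision = correct / float(predicted) if predicted else None
    coverage = predicted / float(len(examples))
    return precision, coverage


# Hypothetical per-label probabilities with their true labels.
examples = [
    ({'Hispanic': 0.95, 'Western European': 0.05}, 'Hispanic'),
    ({'Jewish': 0.55, 'Western European': 0.45}, 'Western European'),
    ({'Indian': 0.90, 'Jewish': 0.10}, 'Indian'),
    ({'Western European': 0.60, 'Eastern European': 0.40}, 'Western European'),
]
always = precision_and_coverage(examples, threshold=0.0)
strict = precision_and_coverage(examples, threshold=0.8)
```

On this toy data, raising the threshold from 0.0 to 0.8 lifts precision from 0.75 to 1.0 while coverage drops from 1.0 to 0.5. In practice the threshold would be chosen on held-out data.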


**Using Better Data**

A limitation of our approach is that a certain name may appear more frequently in one ethnicity, and less in another, but our Baby Names dataset does not account for this.

We will try to experiment with other data sources to see if we can get a better performance.

## Actually imputing ethnicity into our Voter Dataset

In [208]:
## Import data cleaned by Riddhi and Kimia
df_voters = pd.read_csv('Milestone33.csv', sep='\t')
df_voters.sample(10)

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,index,district,id,LAST_NAME,FIRST_NAME,zip,female,dob,regyear,party,electiondate,general,typeofvote,age,GEN16,GEN14
2644,5139,5139,3153570,CLL,103047739,Howard,Timothy,34117,M,1961-10-26 00:00:00,09/22/2000,REP,11/06/2012,GEN,A,56.0,0,0
754,1443,1443,1696657,CHA,102606670,Pruey,R,34224,M,1980-01-11 00:00:00,05/06/1998,NPA,11/08/2016,GEN,Y,37.0,1,0
2516,4871,4871,5669616,DUV,103293240,Moore,William,32233,M,1956-11-13 00:00:00,10/08/1990,REP,11/08/2016,GEN,Y,61.0,1,0
4975,9672,9672,7433685,DUV,103841413,Sanders,Sue,32210,U,1969-04-06 00:00:00,01/31/1995,DEM,11/06/2012,GEN,Y,48.0,0,0
34,70,70,1608176,CHA,102626244,Kelley,Kristine,34224,F,1966-02-24 00:00:00,11/15/2000,DEM,11/07/2006,GEN,N,51.0,0,0
5081,9887,9887,6220186,DUV,103752822,Carswell,Dana,32277,F,1952-09-01 00:00:00,11/08/1983,DEM,11/04/2014,GEN,E,65.0,0,0
1910,3704,3704,2768859,CLL,103016985,James,Carol,34112,F,1939-06-15 00:00:00,02/26/1998,REP,11/07/2006,GEN,E,78.0,0,0
2701,5248,5248,999427,BRA,100753930,Goodman,Anessa,32058,F,1975-07-07 00:00:00,05/05/1997,DEM,11/06/2012,GEN,Y,42.0,0,0
2801,5449,5449,3971893,CLM,103179402,DRAWDY,CAROLYN,32024,F,1941-07-16 00:00:00,03/28/1968,DEM,11/04/2008,GEN,E,76.0,0,1
834,1595,1595,1290058,CHA,102560007,Goldman,Jason,33954,M,1970-03-23 00:00:00,06/02/1988,DEM,11/04/2014,GEN,Y,47.0,0,0


In [209]:
## Add predictions to voter data
ethnicity_predictions = []
for name in list(df_voters['LAST_NAME']):
    ethnicity_predictions.append(classifier.classify(name))
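
The loop above calls the classifier once per voter row. The voter file also mixes casing in `LAST_NAME` (for example `DRAWDY` next to `Howard`), while the baby-name training data appears capitalized. The sketch below normalizes names and classifies each distinct name only once. Whether the classifier is case sensitive is not stated here, so the title-casing step is an assumption; `normalize_name`, `classify_all`, and the stand-in classifier are hypothetical names.

```python
def normalize_name(name):
    """Strip whitespace and title-case a surname, e.g. 'DRAWDY' -> 'Drawdy'."""
    return name.strip().title()


def classify_all(names, classify):
    """Classify every name, calling `classify` once per distinct normalized name."""
    cache = {}
    predictions = []
    for name in names:
        key = normalize_name(name)
        if key not in cache:
            cache[key] = classify(key)
        predictions.append(cache[key])
    return predictions


# Stand-in classifier (hypothetical) that records how often it is called.
calls = []


def fake_classify(name):
    calls.append(name)
    return 'Hispanic' if name.endswith('a') else 'Western European'


predictions = classify_all(['DRAWDY', 'Drawdy', ' Pineda', 'Moore'], fake_classify)
```

In the notebook, `classifier.classify` would take the place of `fake_classify`. Note that simple title-casing turns `MCDOWELL` into `Mcdowell`, so names with internal capitals may need extra handling.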

In [210]:
df_voters['Ethnicity Prediction'] = ethnicity_predictions

In [295]:
df_voters.sample(15)[['LAST_NAME', 'Ethnicity Prediction']]

Unnamed: 0,LAST_NAME,Ethnicity Prediction
1526,Kelleher,Jewish
3536,Pullen,Western European
4547,Pineda,Hispanic
4707,MCDOWELL,Western European
2959,Lessord,Eastern European
3846,Clemons,Western European
3054,Montgomery,Western European
4772,THIBODAUX,Western European
1682,RANDLES,Hispanic
3758,THORNTON,Western European


Baseline looks reasonable on the voter data, but definitely has room for improvement.

# Revised Ethnicity Imputer (2.0)

## Improvements:
* Bigger dataset (Wikipedia)
* Option to abstain from predicting when uncertain
* Ability to use "Race" information as a prior to eliminate unlikely ethnicities

Now let's import and clean the Wikipedia name data, as it is larger and may be able to better train our model.

In [6]:
df_wiki_raw = pd.read_csv('wikipedia_data_scraped/wiki_name_race.csv')
df_wiki_raw.sample(5)

Unnamed: 0,name_last,name_suffix,name_first,name_middle,race
98023,paul,,lyn,,"GreaterEuropean,British"
39842,okumu,,sibi,,"GreaterAfrican,Africans"
58182,iv,,honoré,,"GreaterEuropean,WestEuropean,French"
131755,rao,,rao,gopal,"Asian,IndianSubContinent"
96192,furphy,,ken,,"GreaterEuropean,British"


Yep. This is going to need some cleaning. Let's do this by:
- Creating a new row for each name that is present (first, middle, last). We'll assume for now that the distinction is not important.
- For ethnicity, let's only use the most specific ethnicity available to us (e.g. Italian instead of WestEuropean)

In [7]:
len(df_wiki_raw)

148275

In [8]:
super_list_names = []
super_list_ethnicities = []

# clean names, simplify "ethnicity" field to just most specific one
for _, row in df_wiki_raw.iterrows():
    # keep first, middle and last names that are strings longer than 2 characters
    for column in ['name_first', 'name_middle', 'name_last']:
        name = row[column]
        if type(name) == str and len(name) > 2:
            super_list_names.append(name)
            # the most specific ethnicity is the last entry of the "race" field
            super_list_ethnicities.append(row['race'].split(',')[-1])

# throw it into a dataframe
df_wiki = pd.DataFrame(
            {'Name': super_list_names,
             'True Ethnicity': super_list_ethnicities
            })

In [9]:
df_wiki.sample(20)

Unnamed: 0,Name,True Ethnicity
256259,abidi,IndianSubContinent
206431,elmsley,British
18151,muhammad,Muslim
252247,arun,IndianSubContinent
195046,hunter,British
104580,kuramoto,Japanese
244159,fan,EastAsian
72702,weinberger,Jewish
154685,andrew,British
169829,harley,British


In [10]:
df_wiki.describe()

Unnamed: 0,Name,True Ethnicity
count,295895,295895
unique,97507,13
top,john,British
freq,2910,88353


Now let's standardize the ethnicities between our two datasets (the baby names dataset has ten more ethnicity categories than the Wikipedia dataset, 23 and 13 respectively):

In [13]:
## standardize & consolidated ethnicities
# format: c_eth["consolidated"] = ['eth from baby names', ..., 'eth from wiki', ...] (one flat list)

ethnicities_wiki = df_wiki["True Ethnicity"].unique()

c_eth = {}

c_eth["East European"] = ['russian','ukranian','czech','slavic', 'greek', # baby names
                          'EastEuropean'] # wiki names

c_eth['West European'] = ['italian','irish','danish','french', 'swedish','german','swiss',
                          'Nordic','British', 'Germanic', 'French', 'Italian'] 

c_eth['Muslim'] = ['muslim', 'arabic',
                   'Muslim']

c_eth['East Asian'] = ['chinese','japanese','vietnamese','korean',
                       'EastAsian', 'Japanese']

c_eth['Hispanic'] = ['spanish','portugese',
                     'Hispanic']

c_eth['Jewish'] = ['jewish','Jewish']

c_eth['Indian'] = ['indian','IndianSubContinent']

c_eth['Continental African'] = ['african','Africans']


In [14]:
## transform datasets
def standardizeEth(df):
    names = list(df['Name'])
    org_eth = list(df['True Ethnicity'])    
    standard_eth = []
    for ethnicity in org_eth:
        # search ethnicity dict
        for c in c_eth:
            # if found
            if ethnicity in c_eth[c]:
                # then add to master list
                standard_eth.append(c)
    print len(names), len(standard_eth), len(org_eth)
    df_new = pd.DataFrame(
            {'Name': names,
             'True Ethnicity': org_eth,
             'Standardized Ethnicity': standard_eth
            })
    return df_new

df_baby = df

df_baby_standard = standardizeEth(df_baby)

df_wiki_standard = standardizeEth(df_wiki)
    

20245 20245 20245
295895 295895 295895
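
The matching lengths printed above confirm that every label was found in `c_eth`. If a label were missing, `standardizeEth` would skip it silently and the three lists would no longer line up. As a sketch of an alternative, the mapping can be inverted once into a flat lookup so that unknown labels become explicit; the names `to_standard` and `standardize` are hypothetical.

```python
# Same consolidated groups as c_eth in the notebook.
c_eth = {
    'East European': ['russian', 'ukranian', 'czech', 'slavic', 'greek', 'EastEuropean'],
    'West European': ['italian', 'irish', 'danish', 'french', 'swedish', 'german', 'swiss',
                      'Nordic', 'British', 'Germanic', 'French', 'Italian'],
    'Muslim': ['muslim', 'arabic', 'Muslim'],
    'East Asian': ['chinese', 'japanese', 'vietnamese', 'korean', 'EastAsian', 'Japanese'],
    'Hispanic': ['spanish', 'portugese', 'Hispanic'],
    'Jewish': ['jewish', 'Jewish'],
    'Indian': ['indian', 'IndianSubContinent'],
    'Continental African': ['african', 'Africans'],
}

# Invert the mapping: original label -> consolidated label.
to_standard = {}
for consolidated, originals in c_eth.items():
    for original in originals:
        if original in to_standard:
            raise ValueError('%s is listed in two groups' % original)
        to_standard[original] = consolidated


def standardize(labels):
    """Map original labels to consolidated ones; unknown labels become None."""
    return [to_standard.get(label) for label in labels]


standardized = standardize(['British', 'ukranian', 'Japanese', 'martian'])
```

With pandas, `df['True Ethnicity'].map(to_standard)` performs the same lookup on a whole column and leaves `NaN` for labels that are not mapped.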


In [15]:
df_wiki_standard.sample(5)

Unnamed: 0,Name,Standardized Ethnicity,True Ethnicity
177607,davies,West European,British
63333,rowe,Jewish,Jewish
2108,roesler,West European,Germanic
140517,patrushev,East European,EastEuropean
13045,valero,Muslim,Muslim



Finally, nice data that we can use. Let's do one last step and balance the datasets.

In [16]:
## Balance Wiki

df_wiki_standard['True Ethnicity'].value_counts()

British               88353
French                27566
Italian               26711
Hispanic              24469
Jewish                22406
EastEuropean          18311
IndianSubContinent    17988
Japanese              15906
Muslim                14340
EastAsian             11459
Nordic                10927
Germanic               8999
Africans               8460
Name: True Ethnicity, dtype: int64

In [17]:
c_eth[c_eth.keys()[0]]

['spanish', 'portugese', 'Hispanic']

In [18]:
df_wiki_balanced = pd.DataFrame(columns=['Name', 'Standardized Ethnicity','True Ethnicity'])

##  balancing #1 - balance by True Ethnicity
for eth in ethnicities_wiki:
    # sample 8460 names from each label (the size of the smallest label, Africans)
    sample_df = df_wiki_standard[df_wiki_standard['True Ethnicity']==eth].sample(8460)
    df_wiki_balanced = pd.concat([df_wiki_balanced,sample_df])
    
## balancing #2 - balance by Standardized Eth, with equal numbers of True Eth in each group
df_wiki_b = pd.DataFrame(columns=['Name', 'Standardized Ethnicity','True Ethnicity'])

for eth in c_eth.keys():
    sample_df = df_wiki_balanced[df_wiki_balanced['Standardized Ethnicity']==eth].sample(8460)
    df_wiki_b = pd.concat([df_wiki_b,sample_df])

Let's make sure the proportions make sense.

In [19]:
df_wiki_b['Standardized Ethnicity'].value_counts()

West European          8460
East Asian             8460
Muslim                 8460
Indian                 8460
East European          8460
Continental African    8460
Jewish                 8460
Hispanic               8460
Name: Standardized Ethnicity, dtype: int64

In [20]:
df_wiki_b['True Ethnicity'].value_counts()

Muslim                8460
Africans              8460
Jewish                8460
Hispanic              8460
IndianSubContinent    8460
EastEuropean          8460
EastAsian             4258
Japanese              4202
British               1735
Germanic              1713
Nordic                1702
French                1659
Italian               1651
Name: True Ethnicity, dtype: int64
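
The subgroup counts above are only approximately equal because the second balancing pass samples at random from the pooled subgroups. After the first pass every Wikipedia label has 8460 rows, so a group pooled from k labels has 8460 * k rows, and drawing 8460 of them keeps about 8460 / k per label. A small sketch of that arithmetic, using the groups defined in `c_eth`:

```python
# Wikipedia labels that feed the multi-label consolidated groups (from c_eth).
wiki_subgroups = {
    'West European': ['Nordic', 'British', 'Germanic', 'French', 'Italian'],
    'East Asian': ['EastAsian', 'Japanese'],
}
per_group = 8460

# Expected number of rows kept per label after the second sampling pass.
expected = {}
for group, subgroups in wiki_subgroups.items():
    for subgroup in subgroups:
        expected[subgroup] = per_group / float(len(subgroups))
```

This gives an expectation of 1692 per West European label and 4230 per East Asian label, close to the observed counts (for example 1735 for British and 4202 for Japanese).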

Looks good. Let's finish this up by splitting into training/test (80/20).

In [21]:
## Split into Training and Test
msk = np.random.rand(len(df_wiki_b)) < 0.8
train_df_w = df_wiki_b[msk]
test_df_w = df_wiki_b[~msk]

print "Total Sample (n)", len (df_wiki_b)
print "Test Sample (test n)", len(test_df_w)
print "Train Sample (train n)", len(train_df_w)

Total Sample (n) 67680
Test Sample (test n) 13616
Train Sample (train n) 54064


In [39]:
# Our ethnicities
for each in c_eth:
    print each

Hispanic
Jewish
East Asian
Muslim
West European
East European
Indian
Continental African


## Training the Revised Ethnicity Imputer

First, let's generate training tokens from our training dataframe.

In [32]:
## Package DF into training token

def makeTokens(ethnicities, train_df):
    train_tokens = []
    for ethnicity in ethnicities:
        new_tokens = (list(train_df[train_df['Standardized Ethnicity'] == ethnicity]['Name']), ethnicity)
        train_tokens.append(new_tokens)
    return train_tokens

wiki_tokens = makeTokens(c_eth, train_df_w)


# (Tokens must be a list of ([list of names], 'ethnicity') pairs.)

Train Classifier (beware, this takes time)


In [34]:
## White
white_tokens = makeTokens(['East European', 'West European', 'Jewish', 'Muslim'], train_df_w)
white_classifier = mxec(white_tokens)
white_classifier.train()

  ==> Training (100 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -1.38629        0.250
             2          -0.89420        0.786
             3          -0.70446        0.805
             4          -0.60567        0.821
             5          -0.54320        0.835
             6          -0.49901        0.846
             7          -0.46546        0.855
             8          -0.43875        0.861
             9          -0.41676        0.867
            10          -0.39821        0.871
            11          -0.38226        0.875
            12          -0.36835        0.879
            13          -0.35606        0.881
            14          -0.34511        0.884
            15          -0.33526        0.886
            16          -0.32636        0.887
            17          -0.31825        0.889
            18          -0.31084        0.890
            19          -0.30403        0.891
 

In [37]:
## Black
black_tokens = makeTokens(['West European', 'Continental African', 'Muslim'], train_df_w)
black_classifier = mxec(black_tokens)
black_classifier.train()

  ==> Training (100 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -1.09861        0.333
             2          -0.71315        0.818
             3          -0.56806        0.837
             4          -0.49074        0.852
             5          -0.44072        0.865
             6          -0.40470        0.875
             7          -0.37699        0.883
             8          -0.35472        0.891
             9          -0.33625        0.896
            10          -0.32060        0.901
            11          -0.30709        0.905
            12          -0.29527        0.909
            13          -0.28482        0.911
            14          -0.27549        0.913
            15          -0.26710        0.915
            16          -0.25950        0.917
            17          -0.25259        0.918
            18          -0.24626        0.919
            19          -0.24044        0.920
 

In [38]:
## Asian
asian_tokens = makeTokens(['East Asian', 'Indian', 'Muslim'], train_df_w)
asian_classifier = mxec(asian_tokens)
asian_classifier.train()

  ==> Training (100 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -1.09861        0.333
             2          -0.62121        0.862
             3          -0.46739        0.880
             4          -0.39155        0.895
             5          -0.34462        0.904
             6          -0.31182        0.914
             7          -0.28712        0.922
             8          -0.26758        0.927
             9          -0.25159        0.932
            10          -0.23817        0.937
            11          -0.22669        0.940
            12          -0.21672        0.942
            13          -0.20795        0.944
            14          -0.20017        0.946
            15          -0.19320        0.947
            16          -0.18692        0.948
            17          -0.18122        0.949
            18          -0.17603        0.950
            19          -0.17128        0.951
 

In [None]:
## Hispanic will be their own ethnic group

In [None]:
## Native American will be their own ethnic group

## Testing the Revised Model

We will test our model on the baby names test set to stay consistent with our testing criterion.

In [57]:
## Consolidation function: Combines similar ethnicities together into groups
# Example: 'russian','ukranian','czech','slavic' become "East European"

super_list_names = []
super_list_ethnicities = []

for ethnicity in ethnicities:
    name_list = eth_dict[ethnicity][0]
    eth_list = []
    for name in name_list:
        eth_list.append(ethnicity)
    super_list_names = super_list_names + name_list
    super_list_ethnicities = super_list_ethnicities + eth_list
    
df = pd.DataFrame(
            {'Name': super_list_names,
             'True Ethnicity': super_list_ethnicities
            })

df_c = pd.DataFrame(columns=['Name', 'True Ethnicity'])

def consolidate(eth_list, target_df, consolidated_eth):
    sample_df = pd.DataFrame(columns=['Name', 'True Ethnicity'])
    for ethnicity in eth_list:
        eth_df = df[df['True Ethnicity']==ethnicity]
        # Integer division: sample() needs a whole number of rows per ethnicity
        n_per_eth = 1000 // len(eth_list)
        sample_df = pd.concat([sample_df, eth_df.sample(n=n_per_eth, replace = True)])
    sample_df['True Ethnicity'] = consolidated_eth
    return pd.concat([target_df,sample_df]) 


# Consolidate East European
east_euro = ['russian','ukranian','czech','slavic']
df_c = consolidate(east_euro, df_c, 'East European')

# Consolidate West European
west_euro = ['italian','irish','danish','french',
                'swedish','german','swiss']
df_c = consolidate(west_euro, df_c, 'West European')

# Consolidate Muslim / Arab
muslim_arabic = ['muslim', 'arabic']
df_c = consolidate(muslim_arabic, df_c, 'Muslim')

# Consolidate East Asian
east_asian = ['chinese','japanese','vietnamese','korean']
df_c = consolidate(east_asian, df_c, 'East Asian')

# Spanish / Hispanic can remain its own category
hispanic = ['spanish','portugese'] 
df_c = consolidate(hispanic, df_c, 'Hispanic')

# Jewish can remain its own category
jewish = ['jewish']
df_c = consolidate(jewish, df_c, 'Jewish')

# Indian can remain its own category
indian = ['indian']
df_c = consolidate(indian, df_c, 'Indian')

# African can remain its own category 
african = ['african']
df_c = consolidate(african, df_c, 'Continental African')

print 'Cleaned sample size:', len(df_c)
df_c.sample(n=15)

Cleaned sample size: 7994


Index,Name,True Ethnicity
18601,Mangat,Indian
11774,Natal,Hispanic
15934,Porath,Jewish
9899,Quarnstrom,West European
9106,Andersson,West European
17885,Rais,West European
15998,Reichenheim,Jewish
4102,Levert,West European
13202,Grimme,West European
7935,Machart,East European
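
The cleaned sample holds 7994 rows rather than 8000 because `consolidate` draws `1000 / len(eth_list)` rows per sub-ethnicity using integer division, so the seven-way West European group ends up with 7 × 142 = 994 rows. A quick check of that arithmetic follows; the `rows_for_group` helper and the `group_sizes` dictionary are hypothetical and only illustrate the calculation, with the group sizes read off the consolidation calls above.

```python
# Number of sub-ethnicities merged into each consolidated group
# (counted from the lists passed to consolidate() above)
group_sizes = {
    'East European': 4,
    'West European': 7,
    'Muslim': 2,
    'East Asian': 4,
    'Hispanic': 2,
    'Jewish': 1,
    'Indian': 1,
    'Continental African': 1,
}


def rows_for_group(n_ethnicities, target=1000):
    # Each sub-ethnicity contributes target // n rows (integer division),
    # so the group total can fall slightly short of the target.
    return (target // n_ethnicities) * n_ethnicities


rows = {group: rows_for_group(n) for group, n in group_sizes.items()}
total = sum(rows.values())

print(rows['West European'])  # 994
print(total)                  # 7994
```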


In [58]:
# split test set into subsets by race
test_df_black = df_c[df_c['True Ethnicity'].isin(['West European', 'Continental African', 'Muslim'])]
test_df_asian = df_c[df_c['True Ethnicity'].isin(['East Asian', 'Indian', 'Muslim'])]
test_df_white = df_c[df_c['True Ethnicity'].isin(['East European', 'West European', 'Jewish', 'Muslim'])]

In [61]:
## Predict on racially split test sets!!

def makePreds(test_df, classifier): 
    test_names = list(test_df['Name'])
    test_eth = list(test_df['True Ethnicity'])

    test_preds = []

    for name in test_names:
        pred = classifier.classify(name)
        test_preds.append(pred)

    df_preds = pd.DataFrame({
        'Name': test_names,
        'True Ethnicity': test_eth,
        'Prediction': test_preds
    })
    return df_preds

black_p = makePreds(test_df_black, black_classifier)
asian_p = makePreds(test_df_asian, asian_classifier)
white_p = makePreds(test_df_white, white_classifier)

asian_p

Index,Name,Prediction,True Ethnicity
0,Shaer,Muslim,Muslim
1,Sharaf,Muslim,Muslim
2,Jan,Muslim,Muslim
3,Kazmi,Muslim,Muslim
4,Niazi,Indian,Muslim
5,Kazemi,East Asian,Muslim
6,Dajani,Muslim,Muslim
7,Saeed,Muslim,Muslim
8,Othman,Muslim,Muslim
9,Doud,Muslim,Muslim


In [76]:
## Accuracy Table Generator


def accuracyTable(df_preds):
    accuracies = []
    TPs = [] # true positives:  predicted X when the true ethnicity is X
    FPs = [] # false positives: predicted X when the true ethnicity is not X
    TNs = [] # true negatives:  predicted not-X when the true ethnicity is not X
    FNs = [] # false negatives: predicted not-X when the true ethnicity is X


    ethnicities_c = list(df_preds['True Ethnicity'].unique())
    ethnicity_list = []
    
    # Classification Accuracy
    for ethnicity in ethnicities_c:
        #accuracy = calcAccuracy(df_preds[df_preds['True Ethnicity']==ethnicity])
        #accuracies.append(accuracy)
        TPs.append(calcTP(df_preds, ethnicity))
        FPs.append(calcFP(df_preds, ethnicity))
        TNs.append(calcTN(df_preds, ethnicity))
        FNs.append(calcFN(df_preds, ethnicity))
        ethnicity_list.append(ethnicity)

    # Aggregate accuracy
    #accuracies.append(calcAccuracy(df_preds))
    #ethnicity_list.append('OVERALL')

    # put into df
    df_acc = pd.DataFrame({
        'True Ethnicity': ethnicity_list,
        #'Classification Accuracy': accuracies,
        'TP': TPs,
        'FP': FPs,
        'TN': TNs,
        'FN': FNs
    })

    df_acc.set_index('True Ethnicity', inplace=True)

    # Add TPR (Sensitivity)
    df_acc['Sensitivity (TPR)'] = (df_acc['TP']) / (df_acc['TP'] + df_acc['FN'])

    # Add FPR
    df_acc['FPR'] = (df_acc['FP']) / (df_acc['FP'] + df_acc['TN'])

    # Add Precision
    df_acc['Precision'] = (df_acc['TP']) / (df_acc['TP'] + df_acc['FP'])

    # F1 Score (harmonic mean of precision and sensitivity)
    df_acc['F1 Score'] = (2 * df_acc['TP'] / ((2*df_acc['TP'])+df_acc['FP']+df_acc['FN']))

    # Accuracy (ACC)
    df_acc['ACC'] = (df_acc['TP']+df_acc['TN']) / (df_acc['TP']+df_acc['TN']+df_acc['FP']+df_acc['FN'])
    
    return df_acc
    

In [79]:
asian_table = accuracyTable(asian_p)
black_table = accuracyTable(black_p)
white_table = accuracyTable(white_p)

In [80]:
asian_table

True Ethnicity,FN,FP,TN,TP,Sensitivity (TPR),FPR,Precision,F1 Score,ACC
Muslim,192,180,1820,808,0.808,0.09,0.817814,0.812877,0.876
East Asian,187,130,1870,813,0.813,0.065,0.862142,0.83685,0.894333
Indian,172,241,1759,828,0.828,0.1205,0.774556,0.800387,0.862333
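
As a sanity check on the formulas in `accuracyTable`, the sketch below recomputes the Muslim row of the table above from its four counts. The function name `one_vs_rest_metrics` is illustrative only; it does not show how `calcTP`, `calcFP`, `calcTN`, and `calcFN` are implemented.

```python
def one_vs_rest_metrics(tp, fp, tn, fn):
    # Same formulas as accuracyTable, applied to a single ethnicity's counts.
    # float() keeps the divisions exact under Python 2 as well as Python 3.
    return {
        'Sensitivity (TPR)': tp / float(tp + fn),
        'FPR': fp / float(fp + tn),
        'Precision': tp / float(tp + fp),
        'F1 Score': 2 * tp / float(2 * tp + fp + fn),
        'ACC': (tp + tn) / float(tp + tn + fp + fn),
    }


# Counts copied from the Muslim row of asian_table
muslim = one_vs_rest_metrics(tp=808, fp=180, tn=1820, fn=192)

for metric in sorted(muslim):
    print('{}: {:.6f}'.format(metric, muslim[metric]))
```

The printed values should agree with the Muslim row: sensitivity 0.808, FPR 0.09, and accuracy 0.876.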


In [81]:
black_table

True Ethnicity,FN,FP,TN,TP,Sensitivity (TPR),FPR,Precision,F1 Score,ACC
West European,275,133,1867,719,0.72334,0.0665,0.843897,0.778982,0.863727
Muslim,311,153,1841,689,0.689,0.07673,0.81829,0.7481,0.845023
Continental African,110,410,1584,890,0.89,0.205617,0.684615,0.773913,0.826319


In [82]:
white_table

True Ethnicity,FN,FP,TN,TP,Sensitivity (TPR),FPR,Precision,F1 Score,ACC
East European,491,266,2728,509,0.509,0.088844,0.656774,0.573521,0.810466
West European,430,334,2666,564,0.567404,0.111333,0.628062,0.596195,0.808713
Muslim,222,287,2707,778,0.778,0.095858,0.730516,0.753511,0.872559
Jewish,404,660,2334,596,0.596,0.220441,0.474522,0.528369,0.7336


## Apply predictions to Voter Records

First import the data

In [78]:
## Import data previously cleaned by Riddhi and Kimia
df_voters = pd.read_csv('with race.csv')
len(df_voters)

100000

Let's write functions that return the score each classifier assigns to its top prediction:

In [77]:
## Generate probabilities

import operator
def black_proba(name):
    probs = black_classifier.prob_classify(str(name))._prob_dict.items()
    probs.sort(key=operator.itemgetter(1))
    return probs[-1][1]

def asian_proba(name):
    probs = asian_classifier.prob_classify(str(name))._prob_dict.items()
    probs.sort(key=operator.itemgetter(1))
    return probs[-1][1]

def white_proba(name):
    probs = white_classifier.prob_classify(str(name))._prob_dict.items()
    probs.sort(key=operator.itemgetter(1))
    return probs[-1][1]

# prints the log-probability score of each classifier's top prediction (higher, i.e. closer to 0, is better)
print black_proba('Halal')
print white_proba('Halal')
print asian_proba('Halal')

-0.0957566641707
-0.426646745245
-0.145693472201


^ These are log-probability scores for each classifier's top prediction. They are negative because the log of a probability is at most 0, and they measure how confident the model is about this name. Higher (closer to 0) is better.
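
To read these scores as ordinary probabilities, one can exponentiate them. The sketch below assumes the scores are base-2 logarithms, which is how NLTK's `DictionaryProbDist` stores values when created with `log=True`; the notebook does not confirm that `mxec` wraps NLTK, so treat the base as an assumption and pass `base=math.e` if the scores turn out to be natural logs. The helper name `score_to_probability` is hypothetical.

```python
import math


def score_to_probability(score, base=2.0):
    # Invert the logarithm: a score of 0 maps to probability 1,
    # and more negative scores map to smaller probabilities.
    return base ** score


# Scores printed above for the name 'Halal', plus the -0.75 abstention threshold
scores = {
    'black': -0.0957566641707,
    'white': -0.426646745245,
    'asian': -0.145693472201,
    'threshold': -0.75,
}

for label in sorted(scores):
    as_base2 = score_to_probability(scores[label])
    as_natural = score_to_probability(scores[label], base=math.e)
    print('{}: base 2 -> {:.3f}, base e -> {:.3f}'.format(label, as_base2, as_natural))
```

Under the base-2 assumption, the -0.75 threshold corresponds to a top-class probability of roughly 0.59.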

Now let's get back to parsing our voter file for important things like race and name:

In [91]:
# Pull important columns out of big dataframe
name = df_voters['LAST_NAME']
race_prior = df_voters['19']

In [92]:
# Let's turn Race into a string instead of INT, for comprehensibility
race_prior2 = []
for row in range(len(race_prior)):
    if race_prior[row] == 5:   #white
        race_prior2.append('White')
    elif race_prior[row] == 2: #asian
        race_prior2.append('Asian')
    elif race_prior[row] == 3: #black 
        race_prior2.append('Black')
    elif race_prior[row] == 1: #American Indian
        race_prior2.append('American Indian')
    elif race_prior[row] == 4: #Hispanic
        race_prior2.append('Hispanic')
    else:                      #Others
        race_prior2.append('Other / Mixed')
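
The same mapping can also be written as a dictionary lookup with a default. This is only a sketch of an alternative; `RACE_LABELS` and `race_label` are hypothetical names, and the codes are the ones used in the cell above.

```python
# Race codes as used in the voter file column, taken from the if/elif chain above
RACE_LABELS = {
    5: 'White',
    2: 'Asian',
    3: 'Black',
    1: 'American Indian',
    4: 'Hispanic',
}


def race_label(code):
    # Any code not listed above falls back to 'Other / Mixed'
    return RACE_LABELS.get(code, 'Other / Mixed')


example_codes = [5, 2, 3, 1, 4, 9]
example_labels = [race_label(code) for code in example_codes]
print(example_labels)
```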




Now let's generate the predictions for our Voter Dataset

In [107]:
ethnicity_predictions = []
pred_score = []
abstain_predictions = []


## Make the predictions from the ensemble model

for row in range(len(df_voters)):
    if race_prior[row] == 5:   #white
        eth = white_classifier.classify(str(name[row]))
        prob = white_proba(name[row])
        ethnicity_predictions.append(eth)
        pred_score.append(prob)
        if prob > -0.75: # threshold for abstaining
            abstain_predictions.append(eth)
        else:
            abstain_predictions.append('Abstain')
    elif race_prior[row] == 2: #asian
        eth = asian_classifier.classify(str(name[row]))
        prob = asian_proba(name[row])
        ethnicity_predictions.append(eth)
        pred_score.append(prob)
        if prob > -0.75:
            abstain_predictions.append(eth)
        else:
            abstain_predictions.append('Abstain')
    elif race_prior[row] == 3: #black 
        eth = black_classifier.classify(str(name[row]))
        prob = black_proba(name[row])
        ethnicity_predictions.append(eth)
        pred_score.append(prob)
        if prob > -0.75:
            abstain_predictions.append(eth)
        else:
            abstain_predictions.append('Abstain')
    elif race_prior[row] == 1: #American Indian
        eth = 'American Indian'
        prob = 0.0
        ethnicity_predictions.append(eth)
        pred_score.append(prob)
        if prob > -0.75:
            abstain_predictions.append(eth)
        else:
            abstain_predictions.append('Abstain')
    elif race_prior[row] == 4: #Hispanic
        eth = 'Hispanic'
        prob = 0.0
        ethnicity_predictions.append(eth)
        pred_score.append(prob)
        if prob > -0.75:
            abstain_predictions.append(eth)
        else:
            abstain_predictions.append('Abstain')
    else:                      #Others
        eth = 'Abstain'
        prob = -10.0
        ethnicity_predictions.append(eth)
        pred_score.append(prob)
        abstain_predictions.append('Abstain')
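
The abstention rule repeated in each branch above boils down to one comparison. A minimal sketch follows, with the hypothetical helper name `conservative_prediction`; the example scores are taken from the sample of predictions printed later in this notebook.

```python
ABSTAIN_THRESHOLD = -0.75  # same cutoff as in the loop above


def conservative_prediction(eth, score, threshold=ABSTAIN_THRESHOLD):
    # Keep the predicted ethnicity only when its score is strictly above
    # the threshold; otherwise abstain.
    return eth if score > threshold else 'Abstain'


examples = [
    ('Jewish', -0.607304),         # above the threshold: kept
    ('West European', -0.838545),  # below the threshold: abstain
    ('Hispanic', 0.0),             # fixed score of 0.0: always kept
    ('Abstain', -10.0),            # 'Other / Mixed' rows: always abstain
]

results = [conservative_prediction(eth, score) for eth, score in examples]
print(results)
```

Because Hispanic and American Indian rows are assigned a fixed score of 0.0, they always clear the threshold and are never turned into abstentions.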





Now let's relabel some of our ethnicities to make their labels more informative.

* e.g. A Black person with a "West European" name classification becomes "Blackamerican" instead of "West European".
* e.g. A Black person with a "Muslim" name classification becomes "Black Muslim" instead of simply Muslim.

There may be trends we can see with these more granular ethnicity distinctions.

In [108]:
# Relabel ethnicities
def label(race, eth):
    new_eths = []
    if len(eth) ==  len(race):
        print 'List lengths are equivalent, good.'
    for row in range(len(eth)):
        if race[row] == 'White':
            if eth[row] == 'Muslim':
                new_eths.append('Arab Muslim')
            else:
                new_eths.append(eth[row])
        elif race[row] == 'Black':
            if eth[row] == 'Muslim':
                new_eths.append('Black Muslim')
            elif eth[row] == 'West European':
                new_eths.append('Blackamerican')
            else:
                new_eths.append(eth[row])        
        elif race[row] == 'Asian':
            if eth[row] == 'Muslim':
                new_eths.append('Asian Muslim')
            elif eth[row] == 'Indian':
                new_eths.append('Indian Subcont.')
            else:
                new_eths.append(eth[row])            
        else:
            new_eths.append(eth[row])
    return new_eths
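
The relabeling rules in `label` can also be read as a lookup table keyed on (race, predicted ethnicity). The sketch below is an illustrative alternative, not the code used to produce the results; `RELABEL` and `relabel` are hypothetical names.

```python
# (race prior, name-based prediction) -> more informative label
RELABEL = {
    ('White', 'Muslim'): 'Arab Muslim',
    ('Black', 'Muslim'): 'Black Muslim',
    ('Black', 'West European'): 'Blackamerican',
    ('Asian', 'Muslim'): 'Asian Muslim',
    ('Asian', 'Indian'): 'Indian Subcont.',
}


def relabel(race, eth):
    # Pairs without a rule keep the original prediction unchanged
    return RELABEL.get((race, eth), eth)


print(relabel('Black', 'West European'))
print(relabel('White', 'Jewish'))
print(relabel('Hispanic', 'Hispanic'))
```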

In [109]:
df_voters['Ethnicity Prediction'] = label(race_prior2, ethnicity_predictions)
df_voters['Prediction Score'] = pred_score
df_voters['Conservative Ethnicity Prediction'] = label(race_prior2,abstain_predictions)
df_voters['Race (Prior)'] = race_prior2

List lengths are equivalent, good.
List lengths are equivalent, good.


Let's take a peek at our predictions and the scores

In [110]:
df_voters.sample(100)[['LAST_NAME', 'Race (Prior)','Ethnicity Prediction', 
                       'Prediction Score', 'Conservative Ethnicity Prediction']]

Index,LAST_NAME,Race (Prior),Ethnicity Prediction,Prediction Score,Conservative Ethnicity Prediction
85982,MATYLANGE,Black,Continental African,-1.055064,Abstain
54106,MAHONIK,White,East European,-1.605525,Abstain
25209,Carroll,White,West European,-0.838545,Abstain
27516,Porter,White,Jewish,-0.607304,Jewish
86434,Rodriguez,Hispanic,Hispanic,0.000000,Hispanic
4487,Santana,Hispanic,Hispanic,0.000000,Hispanic
11562,Monk,Black,Blackamerican,-0.604397,Blackamerican
83412,Ceaser,Other / Mixed,Abstain,-10.000000,Abstain
77012,Snider,White,Jewish,-1.566677,Abstain
81044,Stewart,White,West European,-0.700196,West European


In [112]:
df_voters['Ethnicity Prediction'].value_counts()

West European          31623
Jewish                 22978
Hispanic               16181
Blackamerican           8507
East European           5043
Arab Muslim             4491
Abstain                 4467
Continental African     3680
Black Muslim             801
East Asian               761
Indian Subcont.          747
Asian Muslim             405
American Indian          316
Name: Ethnicity Prediction, dtype: int64

Some groups (like East Asians and American Indians) make up a very small share of the predictions, while others (like West Europeans and Hispanics) make up a large share. This seems to be in line, more or less, with our intuitions about the Florida populace.

We are predicting a lot of Jewish people (22,978 of 100,000 records, or about 23%). This is unlikely given the composition of Florida.

Below we have a more conservative model that abstains whenever the prediction score does not exceed the threshold of -0.75:

In [111]:
df_voters['Conservative Ethnicity Prediction'].value_counts()

Abstain                58144
Hispanic               16181
West European           8635
Blackamerican           6191
Jewish                  5895
Continental African     1723
East European            869
Arab Muslim              566
Indian Subcont.          521
East Asian               472
American Indian          316
Black Muslim             261
Asian Muslim             226
Name: Conservative Ethnicity Prediction, dtype: int64

Notice the count for Jewish names went down to 5,895 (from 22,978). This indicates that many of those predictions were ones our classifier was less confident about.
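
To see how unevenly the threshold affects each group, the share of baseline predictions that survive it can be computed from the two `value_counts()` outputs above. This is a small sketch using only the printed counts; the dictionary names are hypothetical, and Hispanic, American Indian, and Abstain are left out because they are not scored by a name classifier.

```python
# Counts copied from the two value_counts() outputs above
baseline = {
    'West European': 31623, 'Jewish': 22978, 'Blackamerican': 8507,
    'East European': 5043, 'Arab Muslim': 4491, 'Continental African': 3680,
    'Black Muslim': 801, 'East Asian': 761, 'Indian Subcont.': 747,
    'Asian Muslim': 405,
}
conservative = {
    'West European': 8635, 'Jewish': 5895, 'Blackamerican': 6191,
    'East European': 869, 'Arab Muslim': 566, 'Continental African': 1723,
    'Black Muslim': 261, 'East Asian': 472, 'Indian Subcont.': 521,
    'Asian Muslim': 226,
}

# Fraction of each group's baseline predictions kept by the conservative model
retained = {
    group: conservative[group] / float(baseline[group]) for group in baseline
}

for group in sorted(retained, key=retained.get):
    print('{}: {:.1%} retained'.format(group, retained[group]))

# Overall share of the 100,000 voter records labelled 'Abstain'
abstain_share = 58144 / 100000.0
print('Abstain share: {:.1%}'.format(abstain_share))
```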

The Conservative Model will be used as our "Revised Ethnicity model" for predictions into our Voter Records. Our Baseline Model will remain the "Old Ethnicity model".

Let's save these imputations to a CSV. This model took 3 hours to train, so we don't want to lose the results.

In [113]:
df_voters.to_csv('updated_predictions.csv')