# Ethnicity Classifier 2.0

## Our story:

We hypothesized that people's voting patterns vary along ethnic lines. In the voter-turnout literature, researchers have examined the ethnic composition of neighborhoods (as available from U.S. Census data) and found that certain ethnicities were more or less likely to vote (Chong & Kim, 2006).

We want to test this conclusion with a different approach. Instead of looking at neighborhoods as a whole, we want to look at individuals. Neither U.S. Census Data nor Voter Records provide ethnicity data for individuals (although Voter Records provide a broad classification for "race", which is less specific than ethnicity). 

For each person, we will impute their ethnicity using a classifying algorithm that makes use of information within the Voter Records, and then we will compare the voting behaviors of different ethnic groups.

We hope to add this imputed ethnicity to our larger "vote-turnout" prediction model to increase its accuracy.


## Our approach:

**Model:** We used a *Multinomial Logistic Regression* because:
1. Our decision is categorical (non-binary)
2. We can use the substrings within names as features, and let the classifier assign coefficients to them as they get correlated to ethnicities
3. (Most Important) It allows us to output a score for each prediction, so we can tweak the threshold at which we go ahead and make a prediction. 
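
To make point 2 concrete, here is a minimal sketch of turning a name into substring features. It is an illustration only: the function name `name_substrings`, the boundary markers `^` and `$`, and the substring lengths of 2 to 3 are assumptions for the example, and the feature extraction inside `ethnicityguesser` may differ.

```python
def name_substrings(name, min_len=2, max_len=3):
    """Return the set of character substrings (n-grams) of a name.

    '^' and '$' mark the start and end of the name so that prefixes and
    suffixes become distinct features (an assumption for this sketch).
    """
    padded = "^" + name.lower() + "$"
    grams = set()
    for n in range(min_len, max_len + 1):
        for i in range(len(padded) - n + 1):
            grams.add(padded[i:i + n])
    return grams


# Each substring would become one binary feature for the classifier.
features = name_substrings("Lee")
```

A classifier trained on such features can learn, for example, that a particular suffix is associated with one ethnicity more than another, even for names it has never seen in full.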

**Name-Ethnicity Datasets:** We experimented with two name-ethnicity datasets:
1. A list of baby names found on FamilyEducation.com
2. Names scraped from Wikipedia that had ethnicity metadata associated with them, open-sourced by Ambekar et al. (2009)

**Test Set:** Here we had to get creative. Because we didn't have actual ethnicities attached to the Voting Records, we needed to test externally. To make our test set, we:
1. Took 10% of the total names in the Wikipedia name-ethnicity dataset and set them aside for testing
2. Out of that held-out set, we eliminated the names that already appeared in the training set. This makes our test set more stringent than the Voter Records set. This was necessary because many of the names in the Voter Records did not appear in our training set, and we did not want an artificially high accuracy score caused by names from the training set re-appearing in the test set.
3. Used over-sampling to balance the test set.
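
The three steps above can be sketched in plain Python. This is a hedged illustration, not the code used in this notebook: the function name `build_test_set`, the `(name, ethnicity)` tuple format, and the fixed seed are assumptions for the example.

```python
import random
from collections import defaultdict


def build_test_set(pairs, test_fraction=0.1, seed=0):
    """Split (name, ethnicity) pairs into a training list and a balanced test list."""
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)

    # Step 1: hold out a fraction of the pairs for testing.
    n_test = int(len(shuffled) * test_fraction)
    held_out, train = shuffled[:n_test], shuffled[n_test:]

    # Step 2: drop held-out names that also occur in the training data.
    train_names = set(name for name, _ in train)
    unseen = [(name, eth) for name, eth in held_out if name not in train_names]

    # Step 3: over-sample (with replacement) up to the size of the largest group.
    by_eth = defaultdict(list)
    for name, eth in unseen:
        by_eth[eth].append(name)
    if not by_eth:
        return train, []
    target = max(len(names) for names in by_eth.values())
    test = []
    for eth, names in by_eth.items():
        extra = [rng.choice(names) for _ in range(target - len(names))]
        test.extend((name, eth) for name in names + extra)
    return train, test
```

Because step 2 runs before over-sampling, duplicated test rows are always copies of names the model has not seen during training.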

**Training Set:** The remaining 90% of names went to train our model. We over-sampled or under-sampled certain ethnicities to balance the training sets. We did not remove repeated names, because their frequencies correlate with real-world frequencies.




### Ethnicity Classifier 1.0: Naive (Baseline) Model

- **Trained on**: Lists of Baby Names.
- **Classification decision**: Whichever ethnicity has the highest *proba* score. A decision is always made.

### Ethnicity Classifier 1.1: New Training Data

- **Trained on**: **Wikipedia-scraped name-ethnicity pairs.**
- **Classification decision**: Whichever ethnicity has the highest *proba* score. A decision is always made.

### Ethnicity Classifier 2.0: Race Used as an Input

- **Trained on**: Wikipedia-scraped name-ethnicity pairs **& Race**.
- **Classification decision**: Whichever ethnicity has the highest *proba* score. A decision is always made.
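
One way to use race as an input, and the one the training cells later in this notebook follow, is to train a separate classifier per race, each restricted to the ethnicities considered plausible for that race. The sketch below shows only the routing step. The candidate lists are taken from those training cells; the race label strings, the function name `classify_with_race`, and the callable-classifier interface are assumptions for the example.

```python
# Candidate ethnicities per race, as used by the per-race training cells.
RACE_TO_CANDIDATES = {
    "White": ["East European", "West European", "Jewish", "Muslim"],
    "Black": ["West European", "Continental African", "Muslim"],
    "Asian": ["East Asian", "Indian", "Muslim"],
}


def classify_with_race(name, race, classifiers_by_race, fallback=None):
    """Route a name to the classifier trained for the voter's race.

    classifiers_by_race maps a race label to a callable name -> ethnicity.
    Returns None when no classifier (and no fallback) is available.
    """
    classifier = classifiers_by_race.get(race, fallback)
    if classifier is None:
        return None
    return classifier(name)
```

Restricting each classifier to a short candidate list shrinks the decision from eight classes to three or four, which is a plausible reason for the higher training accuracies reported in the per-race logs.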

### Ethnicity Classifier 2.1: Abstain option added

- **Trained on**: Wikipedia-scraped name-ethnicity pairs & Race.
- **Classification decision**: Only happens if *proba* exceeds a certain level, else we abstain from making a decision.
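
The abstain rule can be sketched as a small wrapper around the per-class scores. This is a minimal illustration, not the notebook's implementation: the function name `classify_or_abstain`, the dictionary input format, and the threshold of 0.6 are all assumptions made for the example.

```python
def classify_or_abstain(probabilities, threshold=0.6):
    """Return the top-scoring ethnicity, or None to abstain.

    probabilities maps each ethnicity to its proba score.
    A decision is made only if the best score exceeds the threshold.
    """
    best = max(probabilities, key=probabilities.get)
    if probabilities[best] > threshold:
        return best
    return None


confident = classify_or_abstain({"Hispanic": 0.8, "Jewish": 0.2})
uncertain = classify_or_abstain({"Hispanic": 0.4, "Jewish": 0.35, "Muslim": 0.25})
```

Raising the threshold trades coverage for precision: fewer voters receive an imputed ethnicity, but the ones that do are labeled with more confidence.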



Chong, D., & Kim, D. (2006). The experiences and effects of economic status among racial and ethnic minorities. American Political Science Review, 100(3), 335–351.

Ambekar, A., Ward, C., Mohammed, J., Male, S., & Skiena, S. (2009, June). Name-ethnicity classification from open sources. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge Discovery and Data Mining (pp. 49-58). ACM.

In [1]:
from ethnicityguesser.NLTKMaxentEthnicityClassifier import NLTKMaxentEthnicityClassifier as mxec
from os import walk
import pandas as pd
import csv
import pickle
import numpy as np


## 1. Import & Clean Name Training/Testing Data

Let's import and clean our name-ethnicity training sets.
- List of Baby Names by Ethnicity
- Name-ethnicity pairs scraped from Wikipedia

Let's start with baby names:

In [2]:
# find names of files
f = []
for (dirpath, dirnames, filenames) in walk("ethnicityguesser/pickled_names"):
    f.extend(filenames)
    break

# list types of ethnicities
ethnicities_baby = []
for each in f:
    ethnicities_baby.append(each.partition('.')[0])

# pair type of ethnicity to its names in a dict
eth_dict = {}
for ethnicity in ethnicities_baby:
    with open('ethnicityguesser/pickled_names/'+ethnicity+'.pkl', 'rb') as filename:
        names = pickle.load(filename)
    eth_dict[ethnicity] = names
    
ethnicities = ethnicities_baby

Here are the ethnicities we have:

In [3]:
ethnicities_baby

['chinese',
 'vietnamese',
 'irish',
 'danish',
 'french',
 'russian',
 'japanese',
 'german',
 'czech',
 'arabic',
 'ukranian',
 'swedish',
 'spanish',
 'african',
 'swiss',
 'korean',
 'jewish',
 'greek',
 'italian',
 'slavic',
 'indian',
 'muslim',
 'portugese']

Now let's package these into a nice dataframe:

In [4]:
## make a dataframe of names and true ethnicities

super_list_names = []
super_list_ethnicities = []

for ethnicity in ethnicities:
    name_list = eth_dict[ethnicity][0]
    eth_list = []
    for name in name_list:
        eth_list.append(ethnicity)
    super_list_names = super_list_names + name_list
    super_list_ethnicities = super_list_ethnicities + eth_list
    
df_baby = pd.DataFrame(
            {'Name': super_list_names,
             'True Ethnicity': super_list_ethnicities
            })
    

Looks good:

In [5]:
df_baby.sample(10)

| | Name | True Ethnicity |
|---|---|---|
| 5433 | Thebeau | french |
| 16914 | Anania | greek |
| 18489 | Gokhale | indian |
| 1227 | Jonassen | danish |
| 14420 | Ditzah | jewish |
| 4358 | Masson | french |
| 16491 | Tabachnik | jewish |
| 20145 | Santiago | portugese |
| 18301 | Zachow | slavic |
| 17896 | Righi | italian |


Now let's import and clean the Wikipedia name data a bit.

In [6]:
df_wiki_raw = pd.read_csv('wikipedia_data_scraped/wiki_name_race.csv')
df_wiki_raw.sample(5)

| | name_last | name_suffix | name_first | name_middle | race |
|---|---|---|---|---|---|
| 98023 | paul | | lyn | | GreaterEuropean,British |
| 39842 | okumu | | sibi | | GreaterAfrican,Africans |
| 58182 | iv | | honoré | | GreaterEuropean,WestEuropean,French |
| 131755 | rao | | rao | gopal | Asian,IndianSubContinent |
| 96192 | furphy | | ken | | GreaterEuropean,British |


Yep. This is going to need some cleaning. Let's do this by:
- Creating a new row for each name that is present (first, middle, last). We'll assume for now that the distinction is not important.
- For ethnicity, using only the most specific ethnicity available to us (e.g. Italian instead of WestEuropean)

In [7]:
len(df_wiki_raw)

148275

In [8]:
super_list_names = []
super_list_ethnicities = []

# clean names, simplify "ethnicity" field to just most specific one
for row in range(len(df_wiki_raw)):
    # filter valid first names
    if type(df_wiki_raw.iloc[row].name_first) == str and len(df_wiki_raw.iloc[row].name_first) > 2:
        super_list_names.append(df_wiki_raw.iloc[row].name_first)
        super_list_ethnicities.append(df_wiki_raw.iloc[row].race.split(',')[-1])
    # filter valid middle names
    if type(df_wiki_raw.iloc[row].name_middle) == str and len(df_wiki_raw.iloc[row].name_middle) > 2:
        super_list_names.append(df_wiki_raw.iloc[row].name_middle)
        super_list_ethnicities.append(df_wiki_raw.iloc[row].race.split(',')[-1])
    # filter valid last names
    if type(df_wiki_raw.iloc[row].name_last) == str and len(df_wiki_raw.iloc[row].name_last) > 2:
        super_list_names.append(df_wiki_raw.iloc[row].name_last)
        super_list_ethnicities.append(df_wiki_raw.iloc[row].race.split(',')[-1])

# throw it into a dataframe
df_wiki = pd.DataFrame(
            {'Name': super_list_names,
             'True Ethnicity': super_list_ethnicities
            })

In [9]:
df_wiki.sample(20)

| | Name | True Ethnicity |
|---|---|---|
| 256259 | abidi | IndianSubContinent |
| 206431 | elmsley | British |
| 18151 | muhammad | Muslim |
| 252247 | arun | IndianSubContinent |
| 195046 | hunter | British |
| 104580 | kuramoto | Japanese |
| 244159 | fan | EastAsian |
| 72702 | weinberger | Jewish |
| 154685 | andrew | British |
| 169829 | harley | British |


In [10]:
df_wiki.describe()

| | Name | True Ethnicity |
|---|---|---|
| count | 295895 | 295895 |
| unique | 97507 | 13 |
| top | john | British |
| freq | 2910 | 88353 |


Looks fine. Next, let's compare the ethnicity categories used in the two datasets.

In [11]:
print "Baby:", ethnicities_baby



Baby: ['chinese', 'vietnamese', 'irish', 'danish', 'french', 'russian', 'japanese', 'german', 'czech', 'arabic', 'ukranian', 'swedish', 'spanish', 'african', 'swiss', 'korean', 'jewish', 'greek', 'italian', 'slavic', 'indian', 'muslim', 'portugese']


In [12]:
ethnicities_wiki = df_wiki["True Ethnicity"].unique()
print "Baby Names:", ethnicities_baby
print "Wiki:", ethnicities_wiki

Baby Names: ['chinese', 'vietnamese', 'irish', 'danish', 'french', 'russian', 'japanese', 'german', 'czech', 'arabic', 'ukranian', 'swedish', 'spanish', 'african', 'swiss', 'korean', 'jewish', 'greek', 'italian', 'slavic', 'indian', 'muslim', 'portugese']
Wiki: ['Germanic' 'Muslim' 'Nordic' 'Hispanic' 'Jewish' 'Africans' 'Japanese'
 'French' 'EastEuropean' 'British' 'EastAsian' 'IndianSubContinent'
 'Italian']


Now let's standardize the ethnicities between our two datasets (the baby names dataset has ten more ethnicity categories than the Wikipedia dataset, 23 and 13 respectively):

In [13]:
## standardize & consolidated ethnicities
# format: c_eth["consolidated"] = [labels from baby names..., labels from wiki...] (one flat list)

c_eth = {}

c_eth["East European"] = ['russian','ukranian','czech','slavic', 'greek', # baby names
                          'EastEuropean'] # wiki names

c_eth['West European'] = ['italian','irish','danish','french', 'swedish','german','swiss',
                          'Nordic','British', 'Germanic', 'French', 'Italian'] 

c_eth['Muslim'] = ['muslim', 'arabic',
                   'Muslim']

c_eth['East Asian'] = ['chinese','japanese','vietnamese','korean',
                       'EastAsian', 'Japanese']

c_eth['Hispanic'] = ['spanish','portugese',
                     'Hispanic']

c_eth['Jewish'] = ['jewish','Jewish']

c_eth['Indian'] = ['indian','IndianSubContinent']

c_eth['Continental African'] = ['african','Africans']


In [14]:
## transform datasets
def standardizeEth(df):
    names = list(df['Name'])
    org_eth = list(df['True Ethnicity'])    
    standard_eth = []
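    for ethnicity in org_eth:
        # look up the consolidated group that contains this raw label
        for c in c_eth:
            if ethnicity in c_eth[c]:
                standard_eth.append(c)
                # stop at the first match so each name gets exactly one label
                break
        else:
            # unmapped label: keep the lists aligned instead of silently skipping
            standard_eth.append(None)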
    print len(names), len(standard_eth), len(org_eth)
    df_new = pd.DataFrame(
            {'Name': names,
             'True Ethnicity': org_eth,
             'Standardized Ethnicity': standard_eth
            })
    return df_new

df_baby_standard = standardizeEth(df_baby)

df_wiki_standard = standardizeEth(df_wiki)
    

20245 20245 20245
295895 295895 295895
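
The nested loop in `standardizeEth` scans every consolidated group for each name. An equivalent lookup can be sketched by inverting `c_eth` once into a flat dictionary. The names `raw_to_standard` and `standardize` are hypothetical and introduced only for this illustration; the mapping itself is copied from the cell above.

```python
c_eth = {
    "East European": ["russian", "ukranian", "czech", "slavic", "greek", "EastEuropean"],
    "West European": ["italian", "irish", "danish", "french", "swedish", "german", "swiss",
                      "Nordic", "British", "Germanic", "French", "Italian"],
    "Muslim": ["muslim", "arabic", "Muslim"],
    "East Asian": ["chinese", "japanese", "vietnamese", "korean", "EastAsian", "Japanese"],
    "Hispanic": ["spanish", "portugese", "Hispanic"],
    "Jewish": ["jewish", "Jewish"],
    "Indian": ["indian", "IndianSubContinent"],
    "Continental African": ["african", "Africans"],
}

# Invert once: raw label -> consolidated label.
raw_to_standard = {}
for standard, raw_labels in c_eth.items():
    for raw in raw_labels:
        raw_to_standard[raw] = standard


def standardize(raw_label):
    """Return the consolidated ethnicity, or None if the label is unmapped."""
    return raw_to_standard.get(raw_label)
```

With pandas, the same dictionary could be applied to a whole column via `df['True Ethnicity'].map(raw_to_standard)`, which leaves unmapped labels as NaN rather than dropping them.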


In [15]:
df_wiki_standard.sample(5)

| | Name | Standardized Ethnicity | True Ethnicity |
|---|---|---|---|
| 177607 | davies | West European | British |
| 63333 | rowe | Jewish | Jewish |
| 2108 | roesler | West European | Germanic |
| 140517 | patrushev | East European | EastEuropean |
| 13045 | valero | Muslim | Muslim |


Finally, nice data that we can use. Let's do one last step and balance the datasets.

In [16]:
## Balance Wiki

df_wiki_standard['True Ethnicity'].value_counts()

British               88353
French                27566
Italian               26711
Hispanic              24469
Jewish                22406
EastEuropean          18311
IndianSubContinent    17988
Japanese              15906
Muslim                14340
EastAsian             11459
Nordic                10927
Germanic               8999
Africans               8460
Name: True Ethnicity, dtype: int64

In [17]:
c_eth[c_eth.keys()[0]]

['spanish', 'portugese', 'Hispanic']

In [18]:
df_wiki_balanced = pd.DataFrame(columns=['Name', 'Standardized Ethnicity','True Ethnicity'])

##  balancing #1 - balance by True Ethnicity
for eth in ethnicities_wiki:
    # sample 8460 rows from each ethnicity (the size of the smallest group, Africans)
    sample_df = df_wiki_standard[df_wiki_standard['True Ethnicity']==eth].sample(8460)
    df_wiki_balanced = pd.concat([df_wiki_balanced,sample_df])
    
## balancing #2 - balance by Standardized Eth, with equal numbers of True Eth in each group
df_wiki_b = pd.DataFrame(columns=['Name', 'Standardized Ethnicity','True Ethnicity'])

for eth in c_eth.keys():
    sample_df = df_wiki_balanced[df_wiki_balanced['Standardized Ethnicity']==eth].sample(8460)
    df_wiki_b = pd.concat([df_wiki_b,sample_df])

Let's make sure the proportions make sense.

In [19]:
df_wiki_b['Standardized Ethnicity'].value_counts()

West European          8460
East Asian             8460
Muslim                 8460
Indian                 8460
East European          8460
Continental African    8460
Jewish                 8460
Hispanic               8460
Name: Standardized Ethnicity, dtype: int64

In [20]:
df_wiki_b['True Ethnicity'].value_counts()

Muslim                8460
Africans              8460
Jewish                8460
Hispanic              8460
IndianSubContinent    8460
EastEuropean          8460
EastAsian             4258
Japanese              4202
British               1735
Germanic              1713
Nordic                1702
French                1659
Italian               1651
Name: True Ethnicity, dtype: int64
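
The uneven counts inside the consolidated groups are expected. The second balancing pass draws 8460 rows at random from a pool holding 8460 rows of each sub-ethnicity, so a group with k sub-ethnicities should contribute roughly 8460 / k rows per sub-ethnicity. The short check below uses only the counts printed above; the variable names are introduced for illustration.

```python
target = 8460

# Counts copied from the value_counts() output above.
west_european = {"British": 1735, "Germanic": 1713, "Nordic": 1702,
                 "French": 1659, "Italian": 1651}
east_asian = {"EastAsian": 4258, "Japanese": 4202}

# Each consolidated group still sums to the target size.
west_total = sum(west_european.values())
east_total = sum(east_asian.values())

# Expected rows per sub-ethnicity under uniform random sampling.
west_expected = target / float(len(west_european))  # 1692.0
east_expected = target / float(len(east_asian))     # 4230.0

# Largest deviation from the expectation in each group.
west_max_dev = max(abs(c - west_expected) for c in west_european.values())
east_max_dev = max(abs(c - east_expected) for c in east_asian.values())
```

The deviations (at most 43 rows for West European and 28 for East Asian) are small relative to the group sizes, which is consistent with random sampling rather than a bug in the balancing.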

Looks good. Let's finish this up by splitting into training/test sets (80/20).

In [21]:
## Split into Training and Test
msk = np.random.rand(len(df_wiki_b)) < 0.8
train_df_w = df_wiki_b[msk]
test_df_w = df_wiki_b[~msk]

print "Total Sample (n)", len (df_wiki_b)
print "Test Sample (test n)", len(test_df_w)
print "Train Sample (train n)", len(train_df_w)

Total Sample (n) 67680
Test Sample (test n) 13616
Train Sample (train n) 54064


Let's repeat for the baby names dataframe.


In [22]:
df_baby_standard['True Ethnicity'].value_counts()

french        4143
jewish        3076
spanish       2366
czech         1406
swedish       1158
portugese      834
swiss          774
german         723
italian        711
danish         656
indian         580
japanese       566
muslim         525
greek          429
chinese        426
ukranian       409
irish          318
african        300
slavic         258
russian        181
korean         169
vietnamese     129
arabic         108
Name: True Ethnicity, dtype: int64

Some 

In [23]:
for each in c_eth:
    print each

Hispanic
Jewish
East Asian
Muslim
West European
East European
Indian
Continental African


## Train Model

In [32]:
## Package DF into training token

def makeTokens(ethnicities, train_df):
    train_tokens = []
    for ethnicity in ethnicities:
        new_tokens = (list(train_df[train_df['Standardized Ethnicity'] == ethnicity]['Name']), ethnicity)
        train_tokens.append(new_tokens)
    return train_tokens

wiki_tokens = makeTokens(c_eth, train_df_w)


# (Tokens must be a list of ([list of names], 'ethnicity') pairs.)

In [33]:
## Train Classifier (beware, this takes time)

##classifier = mxec(wiki_tokens)
##classifier.train()

In [34]:
## White
white_tokens = makeTokens(['East European', 'West European', 'Jewish', 'Muslim'], train_df_w)
white_classifier = mxec(white_tokens)
white_classifier.train()

  ==> Training (100 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -1.38629        0.250
             2          -0.89420        0.786
             3          -0.70446        0.805
             4          -0.60567        0.821
             5          -0.54320        0.835
             6          -0.49901        0.846
             7          -0.46546        0.855
             8          -0.43875        0.861
             9          -0.41676        0.867
            10          -0.39821        0.871
            11          -0.38226        0.875
            12          -0.36835        0.879
            13          -0.35606        0.881
            14          -0.34511        0.884
            15          -0.33526        0.886
            16          -0.32636        0.887
            17          -0.31825        0.889
            18          -0.31084        0.890
            19          -0.30403        0.891
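
A quick sanity check on the first row of this log: if the model starts by assigning equal probability to each of k classes, the average log-likelihood is ln(1/k) and, on a balanced training set, accuracy is about 1/k. The sketch below assumes that uniform starting point (it is consistent with the logged values, but the classifier's internals are not shown here), and the function name `uniform_log_likelihood` is introduced for illustration.

```python
import math


def uniform_log_likelihood(k):
    """Average log-likelihood when every one of k classes gets probability 1/k."""
    return math.log(1.0 / k)


four_class = round(uniform_log_likelihood(4), 5)   # four candidate ethnicities
three_class = round(uniform_log_likelihood(3), 5)  # three candidate ethnicities
```

The four-class value matches the -1.38629 in iteration 1 above (with accuracy 0.250), and the three-class value matches the -1.09861 reported by the three-class classifiers trained in the following cells.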
 

In [37]:
## Black
black_tokens = makeTokens(['West European', 'Continental African', 'Muslim'], train_df_w)
black_classifier = mxec(black_tokens)
black_classifier.train()

  ==> Training (100 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -1.09861        0.333
             2          -0.71315        0.818
             3          -0.56806        0.837
             4          -0.49074        0.852
             5          -0.44072        0.865
             6          -0.40470        0.875
             7          -0.37699        0.883
             8          -0.35472        0.891
             9          -0.33625        0.896
            10          -0.32060        0.901
            11          -0.30709        0.905
            12          -0.29527        0.909
            13          -0.28482        0.911
            14          -0.27549        0.913
            15          -0.26710        0.915
            16          -0.25950        0.917
            17          -0.25259        0.918
            18          -0.24626        0.919
            19          -0.24044        0.920
 

In [38]:
## Asian
asian_tokens = makeTokens(['East Asian', 'Indian', 'Muslim'], train_df_w)
asian_classifier = mxec(asian_tokens)
asian_classifier.train()

  ==> Training (100 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -1.09861        0.333
             2          -0.62121        0.862
             3          -0.46739        0.880
             4          -0.39155        0.895
             5          -0.34462        0.904
             6          -0.31182        0.914
             7          -0.28712        0.922
             8          -0.26758        0.927
             9          -0.25159        0.932
            10          -0.23817        0.937
            11          -0.22669        0.940
            12          -0.21672        0.942
            13          -0.20795        0.944
            14          -0.20017        0.946
            15          -0.19320        0.947
            16          -0.18692        0.948
            17          -0.18122        0.949
            18          -0.17603        0.950
            19          -0.17128        0.951
 

In [None]:
## Hispanic

In [None]:
## Native American

In [30]:
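# Test Classifier
# NOTE: `classifier` is only defined if the commented-out training cell
# (In [33]) has been run; otherwise this cell raises a NameError.
print "Michael     ", classifier.classify('Michael')
print "Roberto     ", classifier.classify('Roberto')
print "Lee         ", classifier.classify('Lee')
print "Humza       ", classifier.classify('Humza')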

Michael     

NameError: name 'classifier' is not defined
