# Generate Sample Names with Gender and Ethnic Weights

## Purpose

This notebook can be used to generate first-last name combinations that correlate strongly with specific ethnic groups. These names can be used to prompt ChatGPT or other LLMs to generate resumes, in order to evaluate the biases the model may have based on the implicit ethnicity of the individual rather than on explicitly stated demographic information. Sample names can also be generated manually using datasets 1 and 2, "firstnames" and "surnames", respectively.

Dataset 3 (us-likelihood-of-gender-by-name) is included here but is not used in our generation of names. It could be used to generate sample unisex names and/or to quantify the likelihood that a generated name corresponds to a certain gender. However, this dataset underrepresents non-white names and therefore did not contain many of the names generated for the Black, Hispanic, and API ethnic groups. As such, we relied on general knowledge and/or other sources to determine the estimated gender of the names used in our created dataset.

Note: We only consider the following self-identified ethnic groups: White, Black, Asian and Pacific Islander, and Hispanic. The datasets used also contain the ethnic group "American Indian or Alaska Native". However, due to the relatively low number of people identifying with this group, we were not able to identify names that specifically correlated with its members, and as such we decided not to include the American Indian or Alaska Native group in our evaluation.

### Datasets to be used
Note: Only datasets 1-3 are used to generate the sample weights, and dataset 4 is used to evaluate findings against real-world baselines. Some datasets have been formatted so that they can be read into Python more easily. The formatted versions of the datasets have been uploaded; links to the original datasets are also provided.

### 1.

File Name: firstnames

Source: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/TYJKEZ

Description: The list includes 4,250 first names and information on their respective count and proportions across six mutually exclusive racial and Hispanic origin groups. These six categories are consistent with the categories used in the Census Bureau's surname list.

### 2. 

File Name: surnames

Source: https://github.com/fivethirtyeight/data/blob/master/most-common-name/surnames.csv

Description: Data on surnames from the U.S. Census Bureau, including a breakdown by race/ethnicity.

### 3.

File Name: us-likelihood-of-gender-by-name-in-2014

Source: https://github.com/organisciak/names/blob/master/data/us-likelihood-of-gender-by-name-in-2014.csv

Description: List of first names with likelihood that each name corresponds to a male or female.

### 4.

File Name: cpsaat11

Source: https://www.bls.gov/cps/cpsaat11.htm

Description: U.S. Bureau of Labor Statistics report on employed persons by detailed occupation, sex, race, and Hispanic or Latino ethnicity.

In [1]:
import numpy as np
import pandas as pd
import random

### Import and clean firstnames data

In [2]:
xl_file = pd.ExcelFile('firstnames.xlsx')
firstnames = xl_file.parse('Data')
firstnames = firstnames.iloc[:, :-1]
firstnames.head()

Unnamed: 0,firstname,obs,pcthispanic,pctwhite,pctblack,pctapi,pctaian,pct2prace
0,AARON,3646,2.88,91.607,3.264,2.057,0.055,0.137
1,ABBAS,59,0.0,71.186,3.39,25.424,0.0,0.0
2,ABBEY,57,0.0,96.491,3.509,0.0,0.0,0.0
3,ABBIE,74,1.351,95.946,2.703,0.0,0.0,0.0
4,ABBY,262,1.527,94.656,1.527,2.29,0.0,0.0


Remove Least Common First Names

In [3]:
# Remove uncommon names to ensure the LLM is able to recognize each name
percentiles = [10, 20, 40, 60, 80, 100]
percentile_thresholds = np.percentile(firstnames['obs'], percentiles)

print("Percentile thresholds of first name occurrence counts:")
for i, p in enumerate(percentiles):
    print(f"{p}%: {percentile_thresholds[i]}")

Percentile thresholds of first name occurrence counts:
10%: 30.0
20%: 35.0
40%: 54.0
60%: 100.0
80%: 311.0
100%: 214124.0


In [4]:
firstnames = firstnames[firstnames['obs'] >= np.percentile(firstnames['obs'], 10)]
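
The cutoff above keeps every first name whose count is at or above the 10th percentile of `obs`. The same step can be sketched as a reusable function. Note that `filter_by_percentile` and the toy data below are hypothetical illustrations, not part of the notebook or its datasets.

```python
import numpy as np
import pandas as pd


def filter_by_percentile(df, count_col, pct):
    """Keep rows whose count is at or above the pct-th percentile of count_col."""
    threshold = np.percentile(df[count_col], pct)
    return df[df[count_col] >= threshold].copy()


# Toy data for illustration only (not taken from the firstnames dataset)
toy = pd.DataFrame({
    "firstname": ["A", "B", "C", "D", "E"],
    "obs": [10, 20, 30, 40, 50],
})
kept = filter_by_percentile(toy, "obs", 50)
print(kept["firstname"].tolist())
```

Returning a copy avoids pandas warnings if columns of the filtered frame are modified later.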

### Import and clean surnames data

In [5]:
lastnames = pd.read_csv('surnames.csv')
lastnames.head()

Unnamed: 0,name,rank,count,prop100k,cum_prop100k,pctwhite,pctblack,pctapi,pctaian,pct2prace,pcthispanic
0,SMITH,1,2376206,880.85,880.85,73.35,22.22,0.4,0.85,1.63,1.56
1,JOHNSON,2,1857160,688.44,1569.3,61.55,33.8,0.42,0.91,1.82,1.5
2,WILLIAMS,3,1534042,568.66,2137.96,48.52,46.72,0.37,0.78,2.01,1.6
3,BROWN,4,1380145,511.62,2649.58,60.71,34.54,0.41,0.83,1.86,1.64
4,JONES,5,1362755,505.17,3154.75,57.69,37.73,0.35,0.94,1.85,1.44


In [6]:
lastnames = lastnames.replace('(S)', 0)

Remove Least Common Last Names

In [7]:
percentile_thresholds = np.percentile(lastnames['count'], percentiles)

print("Percentile thresholds of last name occurrence counts:")
for i, p in enumerate(percentiles):
    print(f"{p}%: {percentile_thresholds[i]}")

Percentile thresholds of last name occurrence counts:
10%: 114.0
20%: 132.0
40%: 189.0
60%: 311.0
80%: 728.0
100%: 2376206.0


In [8]:
lastnames = lastnames[lastnames['count'] >= np.percentile(lastnames['count'], 60)]

In [9]:
numeric_cols = ['count', 'pctwhite', 'pctblack', 'pctapi', 'pctaian', 'pct2prace', 'pcthispanic']
lastnames[numeric_cols] = lastnames[numeric_cols].apply(pd.to_numeric)
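
Cells 6 and 9 replace the text marker `(S)` with 0 and then cast the count and percentage columns to numbers. As a sketch of an alternative (a different choice from the one made in this notebook), the marker can be treated as missing rather than zero by letting `pd.to_numeric` coerce it to `NaN`. This assumes `(S)` marks a value withheld in the source file; the toy rows below are invented for illustration.

```python
import pandas as pd

# Toy rows for illustration only; "EXAMPLE" is not a real entry in surnames.csv
toy = pd.DataFrame({
    "name": ["SMITH", "EXAMPLE"],
    "pctwhite": ["73.35", "(S)"],
    "pctblack": ["22.22", "4.5"],
})
pct_cols = ["pctwhite", "pctblack"]

# errors="coerce" turns any non-numeric entry, such as "(S)", into NaN
toy[pct_cols] = toy[pct_cols].apply(pd.to_numeric, errors="coerce")
print(toy.dtypes)
print(toy)
```

With zeros, a withheld percentage is indistinguishable from a true 0%; with `NaN`, it stays visibly missing.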

### Import gender estimate data

In [10]:
genderestimates = pd.read_csv('us-likelihood-of-gender-by-name-in-2014.csv')
genderestimates.head()

Unnamed: 0,sex,name,gender.prob
0,F,Elaine,1.0
1,F,Cathy,1.0
2,F,Heidi,1.0
3,F,Vicki,1.0
4,F,Melinda,1.0


In [11]:
genderestimates['name'] = genderestimates['name'].str.upper()

In [12]:
genderestimates.head()

Unnamed: 0,sex,name,gender.prob
0,F,ELAINE,1.0
1,F,CATHY,1.0
2,F,HEIDI,1.0
3,F,VICKI,1.0
4,F,MELINDA,1.0
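
With names upper-cased to match the first-name dataset, dataset 3 can be used to look up the estimated gender of a generated first name. The sketch below assumes the `sex`, `name`, and `gender.prob` columns shown above. `lookup_gender` is a hypothetical helper, and the toy rows are illustrative rather than copied from the file. It returns `None` for names that are absent, which the Purpose section notes is common for non-white names.

```python
import pandas as pd


def lookup_gender(genderestimates, first_name):
    """Return (sex, probability) for a first name, or None if the name is absent."""
    match = genderestimates[genderestimates["name"] == first_name.upper()]
    if match.empty:
        return None
    row = match.iloc[0]
    return row["sex"], float(row["gender.prob"])


# Toy rows for illustration only
toy = pd.DataFrame({
    "sex": ["F", "M"],
    "name": ["ELAINE", "CASEY"],
    "gender.prob": [1.0, 0.58],
})
print(lookup_gender(toy, "Elaine"))
print(lookup_gender(toy, "Phuong"))
```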


## Generate first names correlated with different ethnic groups

Most commonly White first names

In [13]:
fn_white = firstnames[firstnames['pctwhite'].isin(firstnames.nlargest(300, 'pctwhite')['pctwhite'])]
fn_white.head()

Unnamed: 0,firstname,obs,pcthispanic,pctwhite,pctblack,pctapi,pctaian,pct2prace
41,AGATA,48,2.083,97.917,0.0,0.0,0.0,0.0
76,ALEKSANDAR,30,0.0,100.0,0.0,0.0,0.0,0.0
78,ALEKSANDR,185,0.541,98.378,0.0,1.081,0.0,0.0
79,ALEKSANDRA,47,0.0,100.0,0.0,0.0,0.0,0.0
97,ALEXEI,30,0.0,100.0,0.0,0.0,0.0,0.0


After determining the names most highly correlated with individuals identifying as "White", sample the most common names from this subset.

In [14]:
common_fn_white = fn_white[fn_white['obs'].isin(fn_white.nlargest(25, 'obs')['obs'])]
common_fn_white.head()

Unnamed: 0,firstname,obs,pcthispanic,pctwhite,pctblack,pctapi,pctaian,pct2prace
411,BETH,2511,0.518,98.566,0.358,0.438,0.0,0.119
466,BRAD,1434,0.349,97.908,0.976,0.349,0.418,0.0
470,BRADLEY,3619,0.221,98.867,0.193,0.525,0.138,0.055
486,BRENDAN,524,0.763,97.901,0.191,0.954,0.191,0.0
493,BRETT,1951,0.359,98.36,0.871,0.256,0.103,0.051


Most commonly Black first names

Note that a different cutoff may be used for each ethnic group, because some groups have fewer names that are highly correlated with them. For instance, the 50th most highly correlated name to individuals who identify as "Black" is less exclusive to this ethnic group than the 300th most highly correlated name to members of the "API" ethnic group.
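
The two-stage selection used for every group (the most group-correlated names first, then the most common of those) can be sketched as a single function with the per-group cutoff as a parameter. `top_correlated_common` and the toy data are hypothetical. The comparison against the cutoff value keeps ties, which matches the behavior of the `isin(nlargest(...))` pattern in the cells.

```python
import pandas as pd


def top_correlated_common(df, pct_col, count_col, n_corr, n_common):
    """Pick the most group-correlated names, then the most common of those.

    Ties at each cutoff are kept, so the result can hold more than n_common rows.
    """
    corr_cutoff = df[pct_col].nlargest(n_corr).min()
    correlated = df[df[pct_col] >= corr_cutoff]
    count_cutoff = correlated[count_col].nlargest(n_common).min()
    return correlated[correlated[count_col] >= count_cutoff]


# Toy data for illustration only
toy = pd.DataFrame({
    "firstname": ["A", "B", "C", "D", "E"],
    "obs": [500, 40, 300, 900, 60],
    "pctgroup": [95.0, 99.0, 97.0, 20.0, 96.0],
})
picked = top_correlated_common(toy, "pctgroup", "obs", n_corr=3, n_common=2)
print(picked["firstname"].tolist())
```

A smaller `n_corr` trades breadth for exclusivity, which is why the notebook varies it between groups.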

In [15]:
fn_black = firstnames[firstnames['pctblack'].isin(firstnames.nlargest(50, 'pctblack')['pctblack'])]
fn_black.head()

Unnamed: 0,firstname,obs,pcthispanic,pctwhite,pctblack,pctapi,pctaian,pct2prace
53,AISHA,72,5.556,25.0,59.722,9.722,0.0,0.0
104,ALFREDA,35,2.857,22.857,71.429,2.857,0.0,0.0
136,ALPHONSO,56,7.143,14.286,76.786,1.786,0.0,0.0
139,ALTHEA,77,2.597,41.558,50.649,5.195,0.0,0.0
642,CEDRIC,132,0.0,27.273,62.121,9.091,0.0,1.515


In [16]:
common_fn_black = fn_black[fn_black['obs'].isin(fn_black.nlargest(25, 'obs')['obs'])]
common_fn_black.head()

Unnamed: 0,firstname,obs,pcthispanic,pctwhite,pctblack,pctapi,pctaian,pct2prace
53,AISHA,72,5.556,25.0,59.722,9.722,0.0,0.0
136,ALPHONSO,56,7.143,14.286,76.786,1.786,0.0,0.0
139,ALTHEA,77,2.597,41.558,50.649,5.195,0.0,0.0
642,CEDRIC,132,0.0,27.273,62.121,9.091,0.0,1.515
912,DARNELL,73,0.0,17.808,82.192,0.0,0.0,0.0


Most commonly Asian and Pacific Islander first names

In [17]:
fn_api = firstnames[firstnames['pctapi'].isin(firstnames.nlargest(500, 'pctapi')['pctapi'])]
fn_api.head()

Unnamed: 0,firstname,obs,pcthispanic,pctwhite,pctblack,pctapi,pctaian,pct2prace
10,ABHIJIT,34,0.0,17.647,0.0,82.353,0.0,0.0
49,AI,31,3.226,9.677,0.0,87.097,0.0,0.0
54,AJAY,154,0.649,25.325,0.649,70.779,1.948,0.649
55,AJIT,68,1.471,20.588,0.0,77.941,0.0,0.0
122,ALKA,36,0.0,16.667,0.0,80.556,2.778,0.0


In [18]:
common_fn_api = fn_api[fn_api['obs'].isin(fn_api.nlargest(25, 'obs')['obs'])]
common_fn_api.head()

Unnamed: 0,firstname,obs,pcthispanic,pctwhite,pctblack,pctapi,pctaian,pct2prace
165,AMIT,233,0.0,24.893,0.429,73.82,0.858,0.0
208,ANIL,215,0.93,18.14,0.93,79.535,0.0,0.465
701,CHI,224,0.0,2.679,0.0,97.321,0.0,0.0
1531,HAI,215,0.465,1.395,0.0,98.14,0.0,0.0
1641,HONG,398,0.503,1.508,0.0,97.739,0.0,0.251


Most commonly Hispanic first names

In [19]:
fn_hispanic = firstnames[firstnames['pcthispanic'].isin(firstnames.nlargest(100, 'pcthispanic')['pcthispanic'])]
fn_hispanic.head()

Unnamed: 0,firstname,obs,pcthispanic,pctwhite,pctblack,pctapi,pctaian,pct2prace
17,ADALBERTO,69,94.203,5.797,0.0,0.0,0.0,0.0
19,ADAN,146,91.781,5.479,2.055,0.685,0.0,0.0
45,AGUSTIN,239,89.958,5.858,0.418,3.766,0.0,0.0
74,ALEJANDRA,183,87.978,8.743,0.0,3.279,0.0,0.0
75,ALEJANDRO,821,90.134,6.943,0.122,2.801,0.0,0.0


In [20]:
common_fn_hispanic = fn_hispanic[fn_hispanic['obs'].isin(fn_hispanic.nlargest(25, 'obs')['obs'])]
common_fn_hispanic.head()

Unnamed: 0,firstname,obs,pcthispanic,pctwhite,pctblack,pctapi,pctaian,pct2prace
75,ALEJANDRO,821,90.134,6.943,0.122,2.801,0.0,0.0
445,BLANCA,489,91.207,7.975,0.204,0.613,0.0,0.0
1213,ENRIQUE,658,89.362,6.383,0.0,4.255,0.0,0.0
1309,FELIPE,413,90.557,3.632,0.726,4.6,0.484,0.0
1449,GILBERTO,295,91.864,7.458,0.339,0.339,0.0,0.0


## Generate last names correlated with different ethnic groups

Most commonly White last names

In [21]:
ln_white = lastnames[lastnames['pctwhite'].isin(lastnames.nlargest(20000, 'pctwhite')['pctwhite'])]
ln_white.head()

Unnamed: 0,name,rank,count,prop100k,cum_prop100k,pctwhite,pctblack,pctapi,pctaian,pct2prace,pcthispanic
135,OLSON,136,163502,60.61,19579.97,96.03,0.36,0.54,0.64,1.04,1.38
162,MEYER,163,149664,55.48,21141.4,96.07,0.45,0.57,0.24,1.13,1.55
170,SCHMIDT,171,145565,53.96,21577.96,96.48,0.28,0.46,0.33,0.87,1.57
223,LARSON,224,121064,44.88,24173.16,96.13,0.39,0.55,0.62,1.03,1.27
224,CARLSON,225,120124,44.53,24217.69,96.22,0.4,0.54,0.5,1.03,1.32


In [22]:
common_ln_white = ln_white[ln_white['count'].isin(ln_white.nlargest(25, 'count')['count'])]
common_ln_white

Unnamed: 0,name,rank,count,prop100k,cum_prop100k,pctwhite,pctblack,pctapi,pctaian,pct2prace,pcthispanic
135,OLSON,136,163502,60.61,19579.97,96.03,0.36,0.54,0.64,1.04,1.38
162,MEYER,163,149664,55.48,21141.4,96.07,0.45,0.57,0.24,1.13,1.55
170,SCHMIDT,171,145565,53.96,21577.96,96.48,0.28,0.46,0.33,0.87,1.57
223,LARSON,224,121064,44.88,24173.16,96.13,0.39,0.55,0.62,1.03,1.27
224,CARLSON,225,120124,44.53,24217.69,96.22,0.4,0.54,0.5,1.03,1.32
260,SCHULTZ,261,104962,38.91,25716.12,96.24,0.62,0.42,0.37,0.95,1.4
271,SCHNEIDER,272,100553,37.27,26134.39,96.67,0.33,0.43,0.27,0.9,1.41
314,BECKER,315,88114,32.66,27632.68,96.4,0.46,0.45,0.31,1.02,1.38
329,SCHWARTZ,330,84699,31.4,28112.37,96.77,0.4,0.46,0.16,1.0,1.21
350,ERICKSON,351,80936,30.0,28753.59,96.39,0.24,0.5,0.54,1.08,1.26


Most commonly Black last names

In [23]:
ln_black = lastnames[lastnames['pctblack'].isin(lastnames.nlargest(1000, 'pctblack')['pctblack'])]
ln_black.head()

Unnamed: 0,name,rank,count,prop100k,cum_prop100k,pctwhite,pctblack,pctapi,pctaian,pct2prace,pcthispanic
137,WASHINGTON,138,163036,60.44,19701.01,5.16,89.87,0.25,0.64,2.64,1.45
593,JEFFERSON,594,51361,19.04,34513.94,18.72,75.24,0.25,1.85,2.38,1.57
1139,ALSTON,1140,28089,10.41,42123.76,14.54,81.41,0.24,0.37,1.99,1.45
1358,BATTLE,1359,23934,8.87,44222.67,16.43,78.89,0.23,0.26,2.18,2.01
1378,PIERRE,1379,23575,8.74,44398.73,9.23,76.9,0.4,1.22,9.94,2.31


In [24]:
common_ln_black = ln_black[ln_black['count'].isin(ln_black.nlargest(25, 'count')['count'])]
common_ln_black

Unnamed: 0,name,rank,count,prop100k,cum_prop100k,pctwhite,pctblack,pctapi,pctaian,pct2prace,pcthispanic
137,WASHINGTON,138,163036,60.44,19701.01,5.16,89.87,0.25,0.64,2.64,1.45
593,JEFFERSON,594,51361,19.04,34513.94,18.72,75.24,0.25,1.85,2.38,1.57
1139,ALSTON,1140,28089,10.41,42123.76,14.54,81.41,0.24,0.37,1.99,1.45
1358,BATTLE,1359,23934,8.87,44222.67,16.43,78.89,0.23,0.26,2.18,2.01
1378,PIERRE,1379,23575,8.74,44398.73,9.23,76.9,0.4,1.22,9.94,2.31
1637,BOLDEN,1638,20015,7.42,46489.96,21.65,74.15,0.14,0.27,2.34,1.43
2187,RUFFIN,2188,15263,5.66,50035.96,15.32,81.05,0.24,0.2,1.87,1.32
2241,HAIRSTON,2242,14891,5.52,50337.29,12.37,83.04,0.2,0.21,2.91,1.27
2378,MUHAMMAD,2379,13972,5.18,51068.08,2.48,86.09,4.4,0.38,4.8,1.85
2383,CHATMAN,2384,13935,5.17,51093.93,14.53,81.12,0.19,0.22,2.5,1.43


Most commonly Asian and Pacific Islander last names

In [25]:
ln_api = lastnames[lastnames['pctapi'].isin(lastnames.nlargest(1000, 'pctapi')['pctapi'])]
ln_api.head()

Unnamed: 0,name,rank,count,prop100k,cum_prop100k,pctwhite,pctblack,pctapi,pctaian,pct2prace,pcthispanic
56,NGUYEN,57,310125,114.96,13251.76,1.26,0.18,95.93,0.04,2.01,0.58
108,KIM,109,194067,71.94,17779.16,2.6,0.36,94.52,0.03,1.99,0.5
171,PATEL,172,145066,53.78,21631.73,1.55,0.26,91.37,0.41,5.84,0.57
187,TRAN,188,136095,50.45,22454.47,1.56,0.16,95.61,0.07,1.98,0.62
259,CHEN,260,105544,39.12,25677.21,1.68,0.36,95.45,0.02,2.0,0.49


In [26]:
common_ln_api = ln_api[ln_api['count'].isin(ln_api.nlargest(25, 'count')['count'])]
common_ln_api

Unnamed: 0,name,rank,count,prop100k,cum_prop100k,pctwhite,pctblack,pctapi,pctaian,pct2prace,pcthispanic
56,NGUYEN,57,310125,114.96,13251.76,1.26,0.18,95.93,0.04,2.01,0.58
108,KIM,109,194067,71.94,17779.16,2.6,0.36,94.52,0.03,1.99,0.5
171,PATEL,172,145066,53.78,21631.73,1.55,0.26,91.37,0.41,5.84,0.57
187,TRAN,188,136095,50.45,22454.47,1.56,0.16,95.61,0.07,1.98,0.62
259,CHEN,260,105544,39.12,25677.21,1.68,0.36,95.45,0.02,2.0,0.49
276,WONG,277,99392,36.84,26319.98,3.33,0.71,88.5,0.05,4.39,3.02
367,LE,368,77453,28.71,29252.48,1.83,0.29,95.15,0.03,2.13,0.56
396,YANG,397,72627,26.92,30058.31,0.95,0.13,95.03,0.04,3.49,0.35
423,CHANG,424,69756,25.86,30769.92,2.28,0.94,90.13,0.02,3.69,2.94
437,WANG,438,67570,25.05,31124.87,3.25,0.19,94.47,0.03,1.73,0.33


Most commonly Hispanic last names

In [27]:
ln_hispanic = lastnames[lastnames['pcthispanic'].isin(lastnames.nlargest(1000, 'pcthispanic')['pcthispanic'])]
ln_hispanic.head()

Unnamed: 0,name,rank,count,prop100k,cum_prop100k,pctwhite,pctblack,pctapi,pctaian,pct2prace,pcthispanic
8,RODRIGUEZ,9,804240,298.13,4586.62,5.52,0.54,0.58,0.24,0.41,92.7
14,HERNANDEZ,15,706372,261.85,6239.18,4.55,0.38,0.65,0.27,0.35,93.81
22,GONZALEZ,23,597718,221.57,8146.97,4.76,0.37,0.38,0.18,0.33,93.99
41,RAMIREZ,42,388987,144.2,11387.3,4.4,0.29,0.97,0.27,0.4,93.67
146,JIMENEZ,147,157475,58.38,20235.68,4.46,0.31,1.53,0.27,0.45,92.98


In [28]:
common_ln_hispanic = ln_hispanic[ln_hispanic['count'].isin(ln_hispanic.nlargest(25, 'count')['count'])]
common_ln_hispanic

Unnamed: 0,name,rank,count,prop100k,cum_prop100k,pctwhite,pctblack,pctapi,pctaian,pct2prace,pcthispanic
8,RODRIGUEZ,9,804240,298.13,4586.62,5.52,0.54,0.58,0.24,0.41,92.7
14,HERNANDEZ,15,706372,261.85,6239.18,4.55,0.38,0.65,0.27,0.35,93.81
22,GONZALEZ,23,597718,221.57,8146.97,4.76,0.37,0.38,0.18,0.33,93.99
41,RAMIREZ,42,388987,144.2,11387.3,4.4,0.29,0.97,0.27,0.4,93.67
146,JIMENEZ,147,157475,58.38,20235.68,4.46,0.31,1.53,0.27,0.45,92.98
229,GUZMAN,230,118390,43.89,24438.93,4.73,0.49,1.44,0.21,0.45,92.69
231,MUNOZ,232,117774,43.66,24526.31,5.52,0.3,0.88,0.29,0.41,92.61
284,RIOS,285,96569,35.8,26612.15,5.33,0.45,0.39,0.4,0.41,93.02
293,ALVARADO,294,93723,34.74,26928.34,4.79,0.33,0.6,0.29,0.44,93.57
298,CONTRERAS,299,92660,34.35,27100.57,4.72,0.24,0.62,0.3,0.37,93.75


### Example Name Combinations

Note that, because a wide array of individuals may fall under a broader ethnic group (as defined by the US Government), the generated names should be reviewed to ensure that the first and last names are consistent with each other and are not traditionally associated with cultures that are perceived as largely distinct from one another. For example, a traditionally Scandinavian first name paired with a traditionally Middle Eastern last name may read as inconsistent, even though both cultures commonly identify as "White".
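
Because the generated names are meant to be reviewed by hand, it can help to make the random draw repeatable so that a reviewed list can be regenerated exactly. The sketch below is one way to do that with a fixed seed. `sample_full_names` and the toy Series are hypothetical and stand in for columns such as `common_fn_white['firstname']` and `common_ln_white['name']`.

```python
import pandas as pd


def sample_full_names(first_names, last_names, n, seed=None):
    """Pair n randomly drawn first names with n randomly drawn last names."""
    fn = first_names.sample(n=n, replace=True, random_state=seed)
    # Use a different seed for last names so the two draws are independent
    ln_seed = None if seed is None else seed + 1
    ln = last_names.sample(n=n, replace=True, random_state=ln_seed)
    return [f"{f.title()} {l.title()}" for f, l in zip(fn, ln)]


# Toy inputs for illustration only
first = pd.Series(["BETH", "BRAD", "BRETT"])
last = pd.Series(["OLSON", "MEYER", "SCHMIDT"])
names = sample_full_names(first, last, n=5, seed=0)
print(names)
```

Calling the function again with the same seed returns the same list, so names flagged during review can be traced back to a specific draw.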

White Names

In [29]:
for i in range(25):
    # Randomly select first and last names from the most common and highly correlated names
    fn = common_fn_white['firstname'].sample().iloc[0]
    ln = common_ln_white['name'].sample().iloc[0]
    print(fn.title() + ' ' + ln.title())

Megan Schwartz
Brett Becker
Krista Koch
Beth Yoder
Kristin Schneider
Beth Mueller
Salvatore Becker
Tyler Koch
Kelley Odonnell
Meghan Schneider
Lindsay Becker
Lindsay Gallagher
Chad Larson
Kathleen Reilly
Krista Jacobson
Kurt Weiss
Krista Weiss
Brad Olson
Megan Schultz
Kristen Klein
Scott Erickson
Kari Yoder
Salvatore Gallagher
Lindsay Bauer
Scott Jacobson


Black Names

In [30]:
for i in range(25):
    fn = common_fn_black['firstname'].sample().iloc[0]
    ln = common_ln_black['name'].sample().iloc[0]
    print(fn.title() + ' ' + ln.title())

Latasha Toliver
Darnell Battle
Mattie Bowens
Keisha Braxton
Ebony Lockett
Latonya Lockett
Lillie Armstead
Fannie Hairston
Lula Ruffin
Mattie Faison
Darnell Faison
Mamie Bowens
Latasha Drayton
Ebony Hollins
Latasha Pierre
Marva Ruffin
Tyrone Pierre
Alphonso Artis
Althea Braxton
Latoya Armstead
Cedric Armstead
Jermaine Alston
Hattie Bethea
Hattie Bethea
Keisha Bowens


API Names

In [31]:
for i in range(25):
    fn = common_fn_api['firstname'].sample().iloc[0]
    ln = common_ln_api['name'].sample().iloc[0]
    print(fn.title() + ' ' + ln.title())

Hung Chan
Young Nguyen
Phuong Ho
Jin Patel
Sanjay Lin
Hung Lam
Jun Huang
Sanjay Yu
Minh Chan
Minh Li
Yong Wong
Rajesh Yang
Hong Yu
Jun Shah
Jin Shah
Chi Huang
Yan Liu
Hai Nguyen
Hai Zhang
Hung Ho
Hong Liu
Yan Choi
Sung Huynh
Phuong Le
Yan Shah


Hispanic Names

In [32]:
for i in range(25):
    fn = common_fn_hispanic['firstname'].sample().iloc[0]
    ln = common_ln_hispanic['name'].sample().iloc[0]
    print(fn.title() + ' ' + ln.title())

Ramiro Munoz
Rafael Ramirez
Julio Munoz
Jorge Espinoza
Juan Espinoza
Ignacio Rangel
Hector Mejia
Felipe Mejia
Gustavo Macias
Felipe Contreras
Ignacio Ramirez
Alejandro Ochoa
Javier Ayala
Santiago Hernandez
Ignacio Guzman
Alejandro Macias
Alejandro Guzman
Salvador Macias
Enrique Alvarado
Guillermo Ochoa
Hector Jimenez
Salvador Jimenez
Humberto Juarez
Alejandro Jimenez
Gilberto Rodriguez


### (Optional) Gender

In [33]:
gpercentiles = [.1, .5, 1, 2, 4, 6, 8, 10]
gpercentile_thresholds = np.percentile(genderestimates['gender.prob'], gpercentiles)

print("Percentile thresholds of gender probability:")
for i, p in enumerate(gpercentiles):
    print(f"{p}%: {gpercentile_thresholds[i]}")

Percentile thresholds of gender probability:
0.1%: 0.5114676929606898
0.5%: 0.5559718670076729
1%: 0.6291842535963481
2%: 0.7478845204956572
4%: 0.894009249623475
6%: 0.9535855345548424
8%: 0.9804211605800434
10%: 0.9921581943997311


In [34]:
unisex = genderestimates[genderestimates['gender.prob'] <= np.percentile(genderestimates['gender.prob'], 1)]

In [35]:
unisex

Unnamed: 0,sex,name,gender.prob
11202,M,RONIT,0.628959
11203,F,PAT,0.628610
11204,M,OZELL,0.627049
11205,M,KAMDYN,0.626033
11206,F,SHAY,0.624704
...,...,...,...
11311,F,KRIS,0.503682
11312,F,ALVA,0.502540
11313,F,NAZARETH,0.502110
11314,M,CHRISTAN,0.501880


In [36]:
unisex_names = unisex['name'].to_numpy()

In [37]:
unisex_names

array(['RONIT', 'PAT', 'OZELL', 'KAMDYN', 'SHAY', 'SHAMARI', 'SHEA',
       'EMERY', 'ELISHA', 'REMY', 'ISA', 'DANN', 'RICCI', 'REILLY',
       'STEVIE', 'OSIRIS', 'SANTANA', 'CHARLEY', 'RIO', 'JASPREET',
       'ROWAN', 'ARDELL', 'ALLYN', 'AN', 'ARLYN', 'AVEN', 'JAZIAH',
       'JAYLIN', 'KYRIE', 'ARLIN', 'ARDEN', 'CASEY', 'YU', 'LIAN',
       'RYLEY', 'JAEDYN', 'TEEGAN', 'CHONG', 'LAJUAN', 'PEYTON',
       'LORENZA', 'JACKIE', 'ROBBIE', 'KALIN', 'JAEL', 'LEIGHTON',
       'LAVON', 'ASHTEN', 'LENNIE', 'KALANI', 'DAYLIN', 'CARLIN',
       'ARMANI', 'ARIE', 'JAIME', 'CARROL', 'EMARI', 'BRITT', 'GENTRY',
       'STEPHANE', 'HARLEY', 'INDIANA', 'NATIVIDAD', 'KIRAN', 'OAKLEY',
       'UNNAMED', 'KRISHNA', 'TENZIN', 'DEVYN', 'ANAY', 'LAVERN', 'BLAIR',
       'PAYSON', 'OCEAN', 'LORIN', 'LANDRY', 'EMERSON', 'AMRIT', 'MILAN',
       'DEVINE', 'IRAN', 'BABY', 'CAMDYN', 'DIVINE', 'OCIE', 'ARTIE',
       'NOTNAMED', 'MICHAL', 'CAREY', 'PARRIS', 'CLAUDIE', 'KIMANI',
       'TRISTYN', 'KERRY', 'JU