# Missing Values Exercises - due thu 09/21 | *by Charlotte*
---
### Exercise 1
To begin, load the ACS Data we used in our first pandas exercise. That data can be found here. We’ll be working with US_ACS_2017_10pct_sample.dta.

In [76]:
import pandas as pd
acs = pd.read_stata('US_ACS_2017_10pct_sample.dta')

---
### Exercise 2
Let’s begin by calculating the mean US income from this data (recall that income is stored in the inctot variable).

In [77]:
f'Mean US incomes is ${acs["inctot"].mean().round(2)}'

'Mean US incomes is $1723646.27'

---
### Exercise 3
Hmmm… That doesn’t look right. The average American is definitely not earning 1.7 million dollars a year. Let’s look at the values of inctot using value_counts(). Do you see a problem?

- Yes. Most likely the dataset has missing values that are still coded as 9999999 or 8888888 rather than as NaN, so pandas treats them as real incomes and the mean is inflated.

Now use value_counts() with the argument normalize=True to see proportions of the sample that report each value instead of the count of people in each category. What percentage of our sample has an income of 9,999,999? What percentage has an income of 0?

In [78]:
acs['inctot'].value_counts(normalize=True)

9999999    0.168967
0          0.105575
30000      0.014978
50000      0.013837
40000      0.013834
             ...   
70520      0.000003
76680      0.000003
57760      0.000003
200310     0.000003
505400     0.000003
Name: inctot, Length: 8471, dtype: float64

- 16.9% of the sample reportedly has an income of $9,999,999. 
- 10.6% has an income of $0.
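To read specific shares off the `value_counts(normalize=True)` result rather than eyeballing the printout, one option is to look them up by value. The sketch below uses a made-up ten-row Series (`toy` is illustrative, not the ACS data) in which 9999999 stands in for the missing-data code.

```python
import pandas as pd

# Made-up incomes for illustration only; 9999999 plays the role of the missing code.
toy = pd.Series([9999999, 0, 30000, 9999999, 50000, 0, 30000, 40000, 0, 9999999])

# normalize=True returns proportions (which sum to 1) instead of raw counts.
shares = toy.value_counts(normalize=True)

# .get returns the fallback instead of raising KeyError if a value never occurs.
share_sentinel = shares.get(9999999, 0.0)
share_zero = shares.get(0, 0.0)

print(f"{share_sentinel:.1%} coded 9999999, {share_zero:.1%} report an income of 0")
```

On the real data the same lookups would be applied to `acs['inctot'].value_counts(normalize=True)`.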

---
### Exercise 4
To help out pandas, use the replace command to replace all values of 9999999 with np.nan.

In [79]:
import numpy as np
acs['inctot'] = acs['inctot'].replace(9999999, np.nan)
acs['inctot']

0              NaN
1           6000.0
2           6150.0
3          14000.0
4              NaN
            ...   
318999     22130.0
319000         NaN
319001      5000.0
319002    240000.0
319003     48000.0
Name: inctot, Length: 319004, dtype: float64
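The effect of the replacement can be seen on a made-up five-row Series (the values in `toy` are illustrative, not taken from the ACS file). It also shows why the column now prints as floats: NaN is a floating-point value, so pandas converts an integer column to float64 to hold it.

```python
import numpy as np
import pandas as pd

# Made-up incomes; two rows carry the 9999999 missing-data code.
toy = pd.Series([9999999, 6000, 6150, 14000, 9999999])

# Before cleaning, the code is averaged in as if it were a real income.
mean_with_code = toy.mean()

# After replacing the code with NaN, the column is upcast to float64.
cleaned = toy.replace(9999999, np.nan)

# mean() skips NaN by default (skipna=True), so only the three real incomes count.
mean_cleaned = cleaned.mean()
n_missing = cleaned.isna().sum()

print(cleaned.dtype, n_missing)
print(round(mean_with_code, 2), round(mean_cleaned, 2))
```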

---
### Exercise 5

Now that we’ve properly labeled our missing data as np.nan, let’s calculate the average US income once more.

In [80]:
f'Mean US incomes is ${acs["inctot"].mean().round(2)}'

'Mean US incomes is $40890.18'

---
### Exercise 6
So let’s make sure we understand why data is missing for some people. If you recall from our last exercise, it seemed to be the case that most of the people who had incomes of 9999999 were children. Let’s make sure that’s true by looking at the distribution of the variable age for people for whom inctot is missing (i.e. subset the data to people with inctot missing, then look at the values of age with value_counts()).

In [81]:
# Keep only the rows where inctot is missing, then tabulate their ages
acs_sub = acs.loc[acs['inctot'].isnull()]
acs_sub['age'].value_counts()

10    3997
9     3977
14    3847
12    3845
13    3800
      ... 
39       0
38       0
37       0
36       0
96       0
Name: age, Length: 97, dtype: int64

It seems that the people whose income is missing are children.

Then do the opposite: look at the distribution of the age variable for people for whom inctot is not missing.

In [82]:
# Keep only the rows where inctot is reported, then tabulate their ages
acs_sub2 = acs.loc[acs['inctot'].notnull()]
acs_sub2['age'].value_counts()

60                      4950
54                      4821
59                      4776
56                      4776
58                      4734
                        ... 
5                          0
4                          0
3                          0
2                          0
less than 1 year old       0
Name: age, Length: 97, dtype: int64

The opposite is true: the people whose income is reported are adults, which makes sense.

Can you determine when 9999999 was being used? Is it ok we’re excluding those people from our analysis?

- 9999999 was used for children. Excluding them is fine for our purpose, which is to look at the income distribution and how it varies by race: children normally have no income of their own and are not employed. For the exclusion not to skew the comparison, however, we must assume that this (children having no income and not being employed) holds equally across racial groups.
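The missing-versus-reported comparison can also be done in one pass by grouping on a missingness flag instead of building two subsets. The sketch below runs on a made-up frame with numeric ages; `toy` and the `income_status` column are hypothetical names. In the real data `age` appears to be categorical (note the zero-count rows and the "less than 1 year old" label in the output above), so it would likely need converting to numbers before a min/max summary like this one.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; NaN marks income that was never collected.
toy = pd.DataFrame({
    "age": [4, 9, 13, 17, 33, 47, 63],
    "inctot": [np.nan, np.nan, np.nan, 6000, 22130, 18000, 6150],
})

# Label each row by whether its income is missing (hypothetical helper column).
toy["income_status"] = np.where(toy["inctot"].isna(), "missing", "reported")

# Summarise the ages within each group side by side.
age_summary = toy.groupby("income_status")["age"].agg(["min", "max", "count"])
print(age_summary)

max_age_missing = age_summary.loc["missing", "max"]
min_age_reported = age_summary.loc["reported", "min"]
```

If the oldest person with missing income is younger than the youngest person with reported income, as in this toy frame, the code is acting as an age cutoff rather than marking scattered non-response.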

---
### Exercise 7
Let’s limit our attention to people who are currently working. We can do this using empstat. Remember you can use value_counts() to see what values of empstat are in the data!

In [83]:
acs_sub2['empstat'].value_counts()


employed              148758
not in labor force    104676
unemployed              7727
n/a                     3942
Name: empstat, dtype: int64

In [84]:
print("Subsetting the dataset for the people for whom empstat is equal to “employed”, i.e. people that are employed.")
acs_emp = acs.loc[acs['empstat'] == "employed"]
acs_emp

Subsetting the dataset for the people for whom empstat is equal to “employed”, i.e. people that are employed.


Unnamed: 0,year,datanum,serial,cbserial,numprec,subsamp,hhwt,hhtype,cluster,adjust,cpi99,region,stateicp,statefip,countyicp,countyfip,metro,city,citypop,strata,gq,farm,ownershp,ownershpd,mortgage,mortgag2,mortamt1,mortamt2,respmode,pernum,cbpernum,perwt,slwt,famunit,sex,age,marst,birthyr,race,raced,hispan,hispand,bpl,bpld,citizen,yrnatur,yrimmig,language,languaged,speakeng,hcovany,hcovpriv,hinsemp,hinspur,hinstri,hcovpub,hinscaid,hinscare,hinsva,hinsihs,school,educ,educd,gradeatt,gradeattd,schltype,degfield,degfieldd,degfield2,degfield2d,empstat,empstatd,labforce,occ,ind,classwkr,classwkrd,looking,availble,inctot,ftotinc,incwage,incbus00,incss,incwelfr,incinvst,incretir,incsupp,incother,incearn,poverty,migrate1,migrate1d,migplac1,migcounty1,migmet131,vetdisab,diffrem,diffphys,diffmob,diffcare,diffsens,diffeye,diffhear
1,2017,1,1200045,2.017001e+12,6,79,25,"male householder, no wife present",2.017012e+12,1.011189,0.679,west south central div,texas,texas,1210,121,in metropolitan area: not in central/principal...,not in identifiable city (or size group),0,200448,households under 1970 definition,non-farm,owned or being bought (loan),owned free and clear,"no, owned free and clear",,0,0,mail,3,3,57,57,1st family in household or group quarters,female,17,never married/single,2000,white,white,other,ecuadorian,texas,texas,,,0,spanish,spanish,"yes, speaks very well",with health insurance coverage,with private health insurance coverage,has insurance through employer/union,no insurance purchased directly,no insurance through tricare,without public health insurance coverage,no insurance through medicaid,no,no insurance through va,no insurance through indian health service,"yes, in school",grade 11,grade 11,grade 9 to grade 12,grade 12,public school,,,,,employed,at work,"yes, in the labor force",4110,8680,works for wages,"wage/salary, private","no, did not look for work","no, other reason(s)",6000.0,93000,6000,0,0,0,0,0,0,0,6000,286,same house,same house,,0,not in identifiable area,,no cognitive difficulty,no ambulatory difficulty,no independent living difficulty,no,no vision or hearing difficulty,no,no
2,2017,1,70831,2.017000e+12,1 person record,36,57,"male householder, living alone",2.017001e+12,1.011189,0.679,pacific division,california,california,790,79,in metropolitan area: central/principal city s...,not in identifiable city (or size group),0,790106,households under 1970 definition,non-farm,rented,with cash rent,,,0,0,internet,1,1,58,58,1st family in household or group quarters,male,63,separated,1954,white,white,not hispanic,not hispanic,california,california,,,0,english,english,"yes, speaks only english",with health insurance coverage,with private health insurance coverage,has insurance through employer/union,no insurance purchased directly,no insurance through tricare,without public health insurance coverage,no insurance through medicaid,no,no insurance through va,no insurance through indian health service,"no, not in school",4 years of college,bachelor's degree,,,not enrolled,family and consumer sciences,family and consumer sciences,,,employed,at work,"yes, in the labor force",540,7870,works for wages,state govt employee,not reported,not reported,6150.0,6150,6000,0,0,0,150,0,0,0,6000,49,same house,same house,,0,not in identifiable area,,has cognitive difficulty,no ambulatory difficulty,no independent living difficulty,no,no vision or hearing difficulty,no,no
5,2017,1,563897,2.017001e+12,3,19,66,"female householder, no husband present",2.017006e+12,1.011189,0.679,west south central div,louisiana,louisiana,710,71,in metropolitan area: in central/principal city,"new orleans, la",3933,240222,households under 1970 definition,non-farm,owned or being bought (loan),owned with mortgage or loan,"yes, mortgaged/ deed of trust or similar debt",no,1000,0,mail,2,2,195,195,1st family in household or group quarters,male,50,never married/single,1967,white,white,not hispanic,not hispanic,louisiana,louisiana,,,0,english,english,"yes, speaks only english",with health insurance coverage,with private health insurance coverage,has insurance through employer/union,no insurance purchased directly,no insurance through tricare,without public health insurance coverage,no insurance through medicaid,no,no insurance through va,no insurance through indian health service,"no, not in school",grade 12,"12th grade, no diploma",,,not enrolled,,,,,employed,at work,"yes, in the labor force",8965,4480,works for wages,"wage/salary, private","no, did not look for work","yes, available for work",50000.0,51600,50000,0,0,0,0,0,0,0,50000,352,same house,same house,,0,not in identifiable area,,no cognitive difficulty,no ambulatory difficulty,no independent living difficulty,no,no vision or hearing difficulty,no,no
9,2017,1,856859,2.017001e+12,5,69,12,married-couple family household,2.017009e+12,1.011189,0.679,middle atlantic division,new york,new york,0,0,metropolitan status indeterminable (mixed),not in identifiable city (or size group),0,80036,households under 1970 definition,non-farm,owned or being bought (loan),owned with mortgage or loan,"yes, mortgaged/ deed of trust or similar debt",no,400,0,mail,4,4,12,12,1st family in household or group quarters,female,17,never married/single,2000,white,white,not hispanic,not hispanic,new york,new york,,,0,english,english,"yes, speaks only english",with health insurance coverage,with private health insurance coverage,has insurance through employer/union,no insurance purchased directly,no insurance through tricare,without public health insurance coverage,no insurance through medicaid,no,no insurance through va,no insurance through indian health service,"yes, in school",grade 12,"12th grade, no diploma",grade 9 to grade 12,grade 12,public school,,,,,employed,at work,"yes, in the labor force",5240,8680,works for wages,"wage/salary, private",not reported,not reported,2000.0,71400,2000,0,0,0,0,0,0,0,2000,231,same house,same house,,0,not in identifiable area,,no cognitive difficulty,no ambulatory difficulty,no independent living difficulty,no,no vision or hearing difficulty,no,no
10,2017,1,175930,2.017001e+12,9,72,171,married-couple family household,2.017002e+12,1.011189,0.679,pacific division,california,california,590,59,in metropolitan area: central/principal city s...,"santa ana, ca",3341,591606,households under 1970 definition,non-farm,rented,with cash rent,,,0,0,cati/capi,2,2,162,162,1st family in household or group quarters,male,47,"married, spouse present",1970,"other race, nec","other race, n.e.c",mexican,mexican,mexico,mexico,not a citizen,,2003,spanish,spanish,"yes, speaks well",with health insurance coverage,without private health insurance coverage,no insurance through employer/union,no insurance purchased directly,no insurance through tricare,with public health insurance coverage,has insurance through medicaid,no,no insurance through va,no insurance through indian health service,"no, not in school",n/a or no schooling,no schooling completed,,,not enrolled,,,,,employed,at work,"yes, in the labor force",4020,8680,works for wages,"wage/salary, private",not reported,not reported,18000.0,78400,18000,0,0,0,0,0,0,0,18000,149,same house,same house,,0,not in identifiable area,,no cognitive difficulty,no ambulatory difficulty,no independent living difficulty,no,no vision or hearing difficulty,no,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
318995,2017,1,46231,2.017001e+12,3,36,104,married-couple family household,2.017000e+12,1.011189,0.679,mountain division,arizona,arizona,130,13,in metropolitan area: not in central/principal...,not in identifiable city (or size group),0,13204,households under 1970 definition,non-farm,owned or being bought (loan),owned with mortgage or loan,"yes, mortgaged/ deed of trust or similar debt",no,2500,0,internet,1,1,105,105,1st family in household or group quarters,male,67,"married, spouse present",1950,white,white,mexican,mexican,texas,texas,,,0,english,english,"yes, speaks only english",with health insurance coverage,without private health insurance coverage,no insurance through employer/union,no insurance purchased directly,no insurance through tricare,with public health insurance coverage,no insurance through medicaid,yes,no insurance through va,no insurance through indian health service,"no, not in school",grade 12,ged or alternative credential,,,not enrolled,,,,,employed,at work,"yes, in the labor force",20,770,works for wages,"wage/salary, private",not reported,not reported,125000.0,270000,125000,0,0,0,0,0,0,0,125000,501,moved within state,"different house, moved within state, within puma",arizona,13,"phoenix-mesa-scottsdale, az",,no cognitive difficulty,no ambulatory difficulty,no independent living difficulty,no,no vision or hearing difficulty,no,no
318999,2017,1,734396,2.017001e+12,4,78,100,married-couple family household,2.017007e+12,1.011189,0.679,west north central div,missouri,missouri,0,0,in metropolitan area: central/principal city s...,not in identifiable city (or size group),0,90129,households under 1970 definition,non-farm,owned or being bought (loan),owned with mortgage or loan,"yes, mortgaged/ deed of trust or similar debt",no,920,0,mail,2,2,165,165,1st family in household or group quarters,female,33,"married, spouse present",1984,white,white,not hispanic,not hispanic,missouri,missouri,,,0,english,english,"yes, speaks only english",with health insurance coverage,with private health insurance coverage,has insurance through employer/union,no insurance purchased directly,no insurance through tricare,without public health insurance coverage,no insurance through medicaid,no,no insurance through va,no insurance through indian health service,"yes, in school",4 years of college,bachelor's degree,college undergraduate,college undergraduate,"private school (1960,1990-2000,acs,prcs)",medical and health sciences and services,nursing,,,employed,at work,"yes, in the labor force",3255,8190,works for wages,wage/salary at non-profit,not reported,not reported,22130.0,82210,22000,0,0,0,130,0,0,0,22000,333,same house,same house,,0,not in identifiable area,,no cognitive difficulty,no ambulatory difficulty,no independent living difficulty,no,no vision or hearing difficulty,no,no
319001,2017,1,510444,2.017001e+12,2,43,152,"female householder, no husband present",2.017005e+12,1.011189,0.679,west north central div,iowa,iowa,1130,113,in metropolitan area: central/principal city s...,not in identifiable city (or size group),0,100019,households under 1970 definition,non-farm,rented,with cash rent,,,0,0,internet,2,2,145,145,1st family in household or group quarters,male,20,never married/single,1997,two major races,white and black,not hispanic,not hispanic,iowa,iowa,,,0,english,english,"yes, speaks only english",no health insurance coverage,without private health insurance coverage,no insurance through employer/union,no insurance purchased directly,no insurance through tricare,without public health insurance coverage,no insurance through medicaid,no,no insurance through va,no insurance through indian health service,"no, not in school",grade 12,"12th grade, no diploma",,,not enrolled,,,,,employed,at work,"yes, in the labor force",4020,8680,works for wages,"wage/salary, private",not reported,not reported,5000.0,33000,5000,0,0,0,0,0,0,0,5000,201,same house,same house,,0,not in identifiable area,,has cognitive difficulty,no ambulatory difficulty,no independent living difficulty,no,no vision or hearing difficulty,no,no
319002,2017,1,1220474,2.017001e+12,4,16,148,married-couple family household,2.017012e+12,1.011189,0.679,west south central div,texas,texas,1210,121,in metropolitan area: not in central/principal...,not in identifiable city (or size group),0,200448,households under 1970 definition,non-farm,owned or being bought (loan),owned with mortgage or loan,"yes, mortgaged/ deed of trust or similar debt",no,1200,0,internet,1,1,148,148,1st family in household or group quarters,male,47,"married, spouse present",1970,other asian or pacific islander,asian indian (hindu 1920_1940),not hispanic,not hispanic,india,india,naturalized citizen,2008,1996,dravidian,tamil,"yes, speaks very well",with health insurance coverage,with private health insurance coverage,no insurance through employer/union,has insurance purchased directly,no insurance through tricare,without public health insurance coverage,no insurance through medicaid,no,no insurance through va,no insurance through indian health service,"no, not in school",5+ years of college,master's degree,,,not enrolled,physical sciences,multi-disciplinary or general science,,,employed,at work,"yes, in the labor force",1010,7390,self-employed,"self-employed, incorporated",not reported,not reported,240000.0,260000,180000,60000,0,0,0,0,0,0,240000,501,same house,same house,,0,not in identifiable area,,no cognitive difficulty,no ambulatory difficulty,no independent living difficulty,no,no vision or hearing difficulty,no,no


---
### Exercise 8

Now let’s estimate the racial income gap in the United States. 

In [85]:
print("Find race variable and thus print out all columns")
pd.set_option('display.max_columns', None)
acs.head()

Find race variable and thus print out all columns


Unnamed: 0,year,datanum,serial,cbserial,numprec,subsamp,hhwt,hhtype,cluster,adjust,cpi99,region,stateicp,statefip,countyicp,countyfip,metro,city,citypop,strata,gq,farm,ownershp,ownershpd,mortgage,mortgag2,mortamt1,mortamt2,respmode,pernum,cbpernum,perwt,slwt,famunit,sex,age,marst,birthyr,race,raced,hispan,hispand,bpl,bpld,citizen,yrnatur,yrimmig,language,languaged,speakeng,hcovany,hcovpriv,hinsemp,hinspur,hinstri,hcovpub,hinscaid,hinscare,hinsva,hinsihs,school,educ,educd,gradeatt,gradeattd,schltype,degfield,degfieldd,degfield2,degfield2d,empstat,empstatd,labforce,occ,ind,classwkr,classwkrd,looking,availble,inctot,ftotinc,incwage,incbus00,incss,incwelfr,incinvst,incretir,incsupp,incother,incearn,poverty,migrate1,migrate1d,migplac1,migcounty1,migmet131,vetdisab,diffrem,diffphys,diffmob,diffcare,diffsens,diffeye,diffhear
0,2017,1,177686,2017001000000.0,9,64,55,"female householder, no husband present",2017002000000.0,1.011189,0.679,pacific division,california,california,370,37,in metropolitan area: not in central/principal...,not in identifiable city (or size group),0,374206,additional households under 1990 definition,non-farm,rented,with cash rent,,,0,0,internet,8,8,84,84,2nd family in household or group quarters,female,4,never married/single,2013,white,white,mexican,mexican,california,california,,,0,n/a or blank,n/a or blank,n/a (blank),with health insurance coverage,without private health insurance coverage,no insurance through employer/union,no insurance purchased directly,no insurance through tricare,with public health insurance coverage,has insurance through medicaid,no,no insurance through va,no insurance through indian health service,"yes, in school",nursery school to grade 4,"nursery school, preschool",nursery school/preschool,nursery school/preschool,public school,,,,,,,,0,0,,,,,,21200,999999,999999,99999,99999,999999,999999,99999,99999,0,63,same house,same house,,0,not in identifiable area,,,,,,no vision or hearing difficulty,no,no
1,2017,1,1200045,2017001000000.0,6,79,25,"male householder, no wife present",2017012000000.0,1.011189,0.679,west south central div,texas,texas,1210,121,in metropolitan area: not in central/principal...,not in identifiable city (or size group),0,200448,households under 1970 definition,non-farm,owned or being bought (loan),owned free and clear,"no, owned free and clear",,0,0,mail,3,3,57,57,1st family in household or group quarters,female,17,never married/single,2000,white,white,other,ecuadorian,texas,texas,,,0,spanish,spanish,"yes, speaks very well",with health insurance coverage,with private health insurance coverage,has insurance through employer/union,no insurance purchased directly,no insurance through tricare,without public health insurance coverage,no insurance through medicaid,no,no insurance through va,no insurance through indian health service,"yes, in school",grade 11,grade 11,grade 9 to grade 12,grade 12,public school,,,,,employed,at work,"yes, in the labor force",4110,8680,works for wages,"wage/salary, private","no, did not look for work","no, other reason(s)",6000.0,93000,6000,0,0,0,0,0,0,0,6000,286,same house,same house,,0,not in identifiable area,,no cognitive difficulty,no ambulatory difficulty,no independent living difficulty,no,no vision or hearing difficulty,no,no
2,2017,1,70831,2017000000000.0,1 person record,36,57,"male householder, living alone",2017001000000.0,1.011189,0.679,pacific division,california,california,790,79,in metropolitan area: central/principal city s...,not in identifiable city (or size group),0,790106,households under 1970 definition,non-farm,rented,with cash rent,,,0,0,internet,1,1,58,58,1st family in household or group quarters,male,63,separated,1954,white,white,not hispanic,not hispanic,california,california,,,0,english,english,"yes, speaks only english",with health insurance coverage,with private health insurance coverage,has insurance through employer/union,no insurance purchased directly,no insurance through tricare,without public health insurance coverage,no insurance through medicaid,no,no insurance through va,no insurance through indian health service,"no, not in school",4 years of college,bachelor's degree,,,not enrolled,family and consumer sciences,family and consumer sciences,,,employed,at work,"yes, in the labor force",540,7870,works for wages,state govt employee,not reported,not reported,6150.0,6150,6000,0,0,0,150,0,0,0,6000,49,same house,same house,,0,not in identifiable area,,has cognitive difficulty,no ambulatory difficulty,no independent living difficulty,no,no vision or hearing difficulty,no,no
3,2017,1,557128,2017001000000.0,2,10,98,married-couple family household,2017006000000.0,1.011189,0.679,west south central div,louisiana,louisiana,0,0,metropolitan status indeterminable (mixed),not in identifiable city (or size group),0,180022,households under 1970 definition,non-farm,rented,no cash rent,,,0,0,mail,1,1,98,98,1st family in household or group quarters,female,66,"married, spouse present",1951,white,white,not hispanic,not hispanic,new jersey,new jersey,,,0,english,english,"yes, speaks only english",with health insurance coverage,with private health insurance coverage,no insurance through employer/union,has insurance purchased directly,no insurance through tricare,with public health insurance coverage,no insurance through medicaid,yes,no insurance through va,no insurance through indian health service,"no, not in school",grade 12,"some college, but less than 1 year",,,not enrolled,,,,,not in labor force,not in labor force,"no, not in the labor force",0,0,,,not reported,not reported,14000.0,28500,0,0,10300,0,100,3600,0,0,0,194,same house,same house,,0,not in identifiable area,,no cognitive difficulty,no ambulatory difficulty,no independent living difficulty,no,no vision or hearing difficulty,no,no
4,2017,1,614890,2017001000000.0,4,96,54,married-couple family household,2017006000000.0,1.011189,0.679,new england division,massachusetts,massachusetts,0,0,in metropolitan area: not in central/principal...,not in identifiable city (or size group),0,70125,households under 1970 definition,non-farm,rented,with cash rent,,,0,0,mail,4,4,45,45,1st family in household or group quarters,male,1,never married/single,2016,white,white,not hispanic,not hispanic,massachusetts,massachusetts,,,0,n/a or blank,n/a or blank,n/a (blank),with health insurance coverage,with private health insurance coverage,has insurance through employer/union,no insurance purchased directly,no insurance through tricare,without public health insurance coverage,no insurance through medicaid,no,no insurance through va,no insurance through indian health service,,n/a or no schooling,,,,,,,,,,,,0,0,,,,,,89000,999999,999999,99999,99999,999999,999999,99999,99999,0,361,same house,same house,,0,not in identifiable area,,,,,,no vision or hearing difficulty,no,no


What is the average salary for employed Black Americans, and what is the average salary for employed White Americans? In percentage terms, how much more does the average White American make than the average Black American?

In [86]:
acs_emp.groupby('race', as_index=False)['inctot'].mean().round(2)

| | race | inctot |
|---|---|---|
| 0 | white | 60473.15 |
| 1 | black/african american/negro | 41747.95 |
| 2 | american indian or alaska native | 37996.52 |
| 3 | chinese | 72804.92 |
| 4 | japanese | 78906.74 |
| 5 | other asian or pacific islander | 66647.74 |
| 6 | other race, nec | 34989.4 |
| 7 | two major races | 49021.15 |
| 8 | three or more major races | 49787.18 |
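The percentage gap asked for in the question can be computed directly from the two group means in the table above. This is a minimal sketch: the two numbers are copied from that output (unweighted means of `inctot` among the employed), and the variable names are illustrative.

```python
# Group means copied from the groupby output above (mean inctot, employed only).
white_mean = 60473.15
black_mean = 41747.95

# "How much more" is measured relative to the lower (Black) mean.
gap_pct = (white_mean - black_mean) / black_mean * 100

print(f"White mean income exceeds Black mean income by {gap_pct:.1f}%")
```

With these figures the gap comes out at roughly 45%. Dividing by the White mean instead would answer a different question, namely how much less the average employed Black American makes.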

