# Missing Values Exercises

<span style="color: #008080">*Jiechen Li*</span>

<span style="color: #008080">*Bárbara Flores*</span>

## Gradescope Autograding

Please follow [all standard guidance](https://www.practicaldatascience.org/html/autograder_guidelines.html) for submitting this assignment to the Gradescope autograder, including storing your solutions in a dictionary called `results` and ensuring your notebook runs from the start to completion without any errors.

For this assignment, please name your file `exercise_missing.ipynb` before uploading.

You can check that you have answers for all questions in your `results` dictionary with this code:


```python
assert set(results.keys()) == {
    "ex2_avg_income",
    "ex3_share_making_9999999",
    "ex3_share_making_zero",
    "ex5_avg_income",
    "ex8_avg_income_black",
    "ex8_avg_income_white",
    "ex8_racial_difference",
    "ex9_avg_income_black",
    "ex9_avg_income_white",
    "ex10_wage_gap",
}
```

### Submission Limits

Please remember that you are **only allowed three submissions to the autograder.** Your last submission (if you submit 3 or fewer times), or your third submission (if you submit more than 3 times) will determine your grade. Submissions that error out will **not** count against this total.


## Exercises

### Exercise 1

Today, we will be using the ACS data we used during our first `pandas` exercise to examine the US income distribution, and how it varies by race. Note that because the US income distribution has a very small number of people with *extremely* high incomes, and the ACS is just a sample of Americans, the far right tail of the distribution will not be very well estimated. However, this data should suffice for helping to understand wealth inequality in the United States.

To begin, load the ACS Data we used in our first pandas exercise. That [data can be found here](https://github.com/nickeubank/MIDS_Data/tree/master/US_AmericanCommunitySurvey). We'll be working with `US_ACS_2017_10pct_sample.dta`. 

In [1]:
import pandas as pd

pd.set_option("mode.copy_on_write", True)

data = pd.read_stata(
    "https://github.com/nickeubank/MIDS_Data/raw/master/US_AmericanCommunitySurvey/US_ACS_2017_10pct_sample.dta"
)

data.head()

Unnamed: 0,year,datanum,serial,cbserial,numprec,subsamp,hhwt,hhtype,cluster,adjust,...,migcounty1,migmet131,vetdisab,diffrem,diffphys,diffmob,diffcare,diffsens,diffeye,diffhear
0,2017,1,177686,2017001000000.0,9,64,55,"female householder, no husband present",2017002000000.0,1.011189,...,0,not in identifiable area,,,,,,no vision or hearing difficulty,no,no
1,2017,1,1200045,2017001000000.0,6,79,25,"male householder, no wife present",2017012000000.0,1.011189,...,0,not in identifiable area,,no cognitive difficulty,no ambulatory difficulty,no independent living difficulty,no,no vision or hearing difficulty,no,no
2,2017,1,70831,2017000000000.0,1 person record,36,57,"male householder, living alone",2017001000000.0,1.011189,...,0,not in identifiable area,,has cognitive difficulty,no ambulatory difficulty,no independent living difficulty,no,no vision or hearing difficulty,no,no
3,2017,1,557128,2017001000000.0,2,10,98,married-couple family household,2017006000000.0,1.011189,...,0,not in identifiable area,,no cognitive difficulty,no ambulatory difficulty,no independent living difficulty,no,no vision or hearing difficulty,no,no
4,2017,1,614890,2017001000000.0,4,96,54,married-couple family household,2017006000000.0,1.011189,...,0,not in identifiable area,,,,,,no vision or hearing difficulty,no,no


### Exercise 2

Let's begin by calculating the mean US incomes from this data (recall that income is stored in the `inctot` variable). Store the answer in `results` under the key `"ex2_avg_income"`.

In [2]:
results = dict()
ex2_avg_income = data["inctot"].mean()
results["ex2_avg_income"] = ex2_avg_income


print(f"The mean US income in this sample is {round(ex2_avg_income):,}")

The mean US income in this sample is 1,723,646


### Exercise 3

Hmmm... That doesn't look right. The average American is definitely not earning that much a year! Let's look at the values of `inctot` using `value_counts()`. Do you see a problem?

Now use `value_counts()` with the argument `normalize=True` to see proportions of the sample that report each value instead of the count of people in each category. What percentage of our sample has an income of 9,999,999? Store that proportion (between 0 and 1) as `"ex3_share_making_9999999"`. What percentage has an income of 0? Store that proportion as `"ex3_share_making_zero"`.

(Recall `.value_counts()` returns a Series, so you can pull values out with our usual pandas tools.)
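
As a small illustration of that hint, here is a sketch on made-up numbers (the `toy_income` Series is hypothetical and not part of the ACS data). It shows that the normalized counts come back as a Series indexed by the observed values, so a single share can be pulled out by label:

```python
import pandas as pd

# Made-up incomes; 9999999 plays the role of the suspicious value.
toy_income = pd.Series([0, 0, 30000, 9999999, 50000, 9999999, 9999999, 30000])

# Shares instead of counts; the index holds the distinct income values.
shares = toy_income.value_counts(normalize=True)

# Label-based lookup of individual shares.
share_sentinel = shares.loc[9999999]  # 3 of the 8 observations
share_zero = shares.loc[0]            # 2 of the 8 observations

print(share_sentinel, share_zero)
```

Using `.loc` makes it explicit that the lookup is by income value rather than by position.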

In [3]:
print("Let's see how the proportions of the income sample behave.\n")
sumarize = data["inctot"].value_counts(normalize=True)
print(sumarize)

Let's see how the proportions of the income sample behave.

inctot
9999999    0.168967
0          0.105575
30000      0.014978
50000      0.013837
40000      0.013834
             ...   
70520      0.000003
76680      0.000003
57760      0.000003
200310     0.000003
505400     0.000003
Name: proportion, Length: 8471, dtype: float64


In [4]:
ex3_share_making_9999999 = sumarize[9999999]
ex3_share_making_zero = sumarize[0]

results["ex3_share_making_9999999"] = ex3_share_making_9999999
results["ex3_share_making_zero"] = ex3_share_making_zero

print(
    f"The proportion of the sample with an income of 9,999,999 is {round(ex3_share_making_9999999,3)}."
)
print(
    f"The proportion of the sample with an income of 0 is {round(ex3_share_making_zero, 3)}."
)

The proportion of the sample with an income of 9,999,999 is 0.169.
The proportion of the sample with an income of 0 is 0.106.


### Exercise 4

As we discussed before, the ACS uses a value of 9999999 to denote that income information is not available for someone. The problem with using this kind of "sentinel value" is that pandas doesn't understand that this is supposed to denote missing data, and so when it averages the variable, it doesn't know to ignore 9999999. 

To help out `pandas`, use the `replace` command to replace all values of 9999999 with `np.nan`. 
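
To see why the sentinel matters, the following sketch uses a tiny made-up DataFrame (the `toy` frame and its values are assumptions for illustration only). It compares the mean before and after swapping the sentinel for `np.nan`:

```python
import numpy as np
import pandas as pd

# Made-up incomes; 9999999 stands for "income not available".
toy = pd.DataFrame({"inctot": [20000, 9999999, 40000, 9999999, 60000]})

# The sentinel is treated as a real income and inflates the mean.
mean_with_sentinel = toy["inctot"].mean()

# replace returns a new Series, so assign it back to keep the change.
toy["inctot"] = toy["inctot"].replace(9999999, np.nan)

# NaN values are now skipped when averaging.
mean_without_sentinel = toy["inctot"].mean()
n_missing = toy["inctot"].isna().sum()

print(mean_with_sentinel, mean_without_sentinel, n_missing)
```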

In [5]:
import numpy as np

print(
    "Replacing the values of 9999999 with NaN, we observe the following resulting proportions:"
)

data["inctot"] = data["inctot"].replace(9999999, np.nan)
data["inctot"].value_counts(normalize=True)

Replacing the values of 9999999 with NaN, we observe the following resulting proportions:


inctot
0.0         0.127041
30000.0     0.018023
50000.0     0.016650
40000.0     0.016646
20000.0     0.015341
              ...   
246600.0    0.000004
90810.0     0.000004
341380.0    0.000004
15790.0     0.000004
505400.0    0.000004
Name: proportion, Length: 8470, dtype: float64

### Exercise 5

Now that we've properly labeled our missing data as `np.nan`, let's calculate the average US income once more. Store the answer in `results` under the key `"ex5_avg_income"`.

In [6]:
ex5_avg_income = data["inctot"].mean()
results["ex5_avg_income"] = ex5_avg_income

print(
    f"The mean US income in this sample, without considering values of 9999999, is {round(ex5_avg_income):,}"
)

The mean US income in this sample, without considering values of 9999999, is 40,890


### Exercise 6

OK, now we've been able to get a reasonable average income number. As we can see, a major advantage of using `np.nan` is that `pandas` knows that `np.nan` observations should just be ignored when we are calculating means. 
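
A minimal sketch of that behavior on a made-up Series (the values are illustrative): by default `mean()` skips missing values, while `skipna=False` lets them propagate, and `count()` reports how many observations were actually used.

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, np.nan, 30.0])

default_mean = s.mean()              # NaN is skipped: (10 + 30) / 2
strict_mean = s.mean(skipna=False)   # NaN propagates into the result
n_used = s.count()                   # number of non-missing values

print(default_mean, strict_mean, n_used)
```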

But it's not enough to just get rid of the people who had `inctot` values of 9999999. We also need to know why those values were missing. Suppose, for example, that the value of 9999999 was used for anyone who made more than 100,000 dollars: if we just dropped those people, then our estimate of average income wouldn't mean much, would it?

So let's make sure we understand *why* data is missing for some people. If you recall from our last exercise, it seemed to be the case that most of the people who had incomes of 9999999 were children. Let's make sure that's true by looking at the distribution of the variable `age` for people for whom `inctot` is missing (i.e. subset the data to people with `inctot` missing, then look at the values of `age` with `value_counts()`).

Then do the opposite: look at the distribution of the `age` variable for people for whom `inctot` is *not* missing.

Can you determine when 9999999 was being used? Is it ok we're excluding those people from our analysis?

Note: In this data, Python doesn't understand `age` is a number; it thinks it is a string because the original data has categories like "90 (90+ in 1980 and 1990)" and "less than 1 year old". So you can't just use `min()` or `max()`. We'll discuss converting string variables into numbers in a future class.
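
For readers curious about what such a conversion might look like, here is one possible sketch on made-up values. The mapping of the two text labels to 0 and 90 is an assumption made for illustration, not necessarily how the course will handle it:

```python
import pandas as pd

# Made-up ages that include the two text labels mentioned above.
toy_age = pd.Series(
    ["less than 1 year old", "14", "15", "90 (90+ in 1980 and 1990)"]
)

# Assumption: treat the two text categories as 0 and 90.
label_to_number = {
    "less than 1 year old": "0",
    "90 (90+ in 1980 and 1990)": "90",
}

# astype(str) also covers the case where the column is stored as a categorical.
age_numeric = pd.to_numeric(toy_age.astype(str).replace(label_to_number))

print(age_numeric.min(), age_numeric.max())
```

Once the column is numeric, `min()` and `max()` behave as expected.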

In [7]:
age_missing_inctot = data[pd.isna(data["inctot"])]["age"].value_counts().sort_index()
for age, count in age_missing_inctot.items():
    print(age, count)

less than 1 year old 3150
1 3340
2 3405
3 3220
4 3318
5 3512
6 3524
7 3527
8 3648
9 3977
10 3997
11 3791
12 3845
13 3800
14 3847
15 0
16 0
17 0
18 0
19 0
20 0
21 0
22 0
23 0
24 0
25 0
26 0
27 0
28 0
29 0
30 0
31 0
32 0
33 0
34 0
35 0
36 0
37 0
38 0
39 0
40 0
41 0
42 0
43 0
44 0
45 0
46 0
47 0
48 0
49 0
50 0
51 0
52 0
53 0
54 0
55 0
56 0
57 0
58 0
59 0
60 0
61 0
62 0
63 0
64 0
65 0
66 0
67 0
68 0
69 0
70 0
71 0
72 0
73 0
74 0
75 0
76 0
77 0
78 0
79 0
80 0
81 0
82 0
83 0
84 0
85 0
86 0
87 0
88 0
89 0
90 (90+ in 1980 and 1990) 0
91 0
92 0
93 0
94 0
95 0
96 0


In [8]:
age_not_missing_inctot = (
    data[~pd.isna(data["inctot"])]["age"].value_counts().sort_index()
)


for age, count in age_not_missing_inctot.items():
    print(age, count)

less than 1 year old 0
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0
10 0
11 0
12 0
13 0
14 0
15 3942
16 4106
17 4021
18 4496
19 4342
20 3992
21 3740
22 3617
23 3551
24 3641
25 3708
26 3781
27 3884
28 3808
29 3810
30 3917
31 3880
32 3883
33 3734
34 3942
35 3867
36 3834
37 3870
38 3718
39 3783
40 3884
41 3487
42 3603
43 3573
44 3656
45 3939
46 4064
47 4256
48 3956
49 3940
50 4272
51 4021
52 4418
53 4600
54 4821
55 4693
56 4776
57 4720
58 4734
59 4776
60 4950
61 4644
62 4614
63 4488
64 4287
65 4362
66 4106
67 4055
68 3951
69 3877
70 3953
71 2917
72 2901
73 2781
74 2819
75 2532
76 2170
77 2089
78 1985
79 1758
80 1721
81 1524
82 1464
83 1335
84 1157
85 1117
86 1041
87 908
88 859
89 628
90 (90+ in 1980 and 1990) 480
91 227
92 355
93 476
94 1035
95 471
96 10


In [9]:
data["race"].sample(30)

192067                               white
249102                               white
262223                             chinese
228867                               white
300234    american indian or alaska native
294634                               white
104516                               white
195075                               white
28867                                white
166544                               white
11359                                white
171051                               white
289595                               white
317304                               white
102481        black/african american/negro
141141                               white
131256        black/african american/negro
124090     other asian or pacific islander
220453                               white
127689                               white
193545                               white
12273                                white
316578                               white
10176      

><span style="color: #008080">*Comparing the age distribution of respondents whose income is missing with that of respondents whose income is reported, we can observe that income is missing for every child aged 14 or younger, while it is recorded for everyone aged 15 or older. This aligns with our hypothesis.*</span>
>
><span style="color: #008080">*Given this situation, it is acceptable not to include the observations with 9999999 income, considering that we are analyzing data on the population of individuals who can legally work.*</span>
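
The two age listings above can also be viewed side by side. The sketch below uses `pd.crosstab` on a small made-up frame (the `toy` data and the `inctot_status` label are hypothetical) to tabulate age against whether income is missing:

```python
import numpy as np
import pandas as pd

# Made-up respondents: income is missing for the youngest ones.
toy = pd.DataFrame(
    {
        "age": ["13", "14", "14", "15", "16", "16"],
        "inctot": [np.nan, np.nan, np.nan, 0.0, 12000.0, 500.0],
    }
)

# Turn the missingness flag into readable labels for the table columns.
status = (
    toy["inctot"].isna().map({True: "missing", False: "present"}).rename("inctot_status")
)

# Rows: age; columns: count of respondents with missing / present income.
missing_by_age = pd.crosstab(toy["age"], status)

print(missing_by_age)
```

A table like this makes the age cutoff visible in one place instead of across two separate printouts.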

### Exercise 7

Great, so now we know why those people had missing data, and we're ok with excluding them. 

But as we previously noted, there are also a lot of observations of zero income in our data, and it's not clear that everyone with zero income *should* be included in this average, since those may be people who are retired, or in school.

Let's limit our attention to people who are currently working by subsetting to only employed respondents. We can do this using `empstat`. Remember you can use `value_counts()` to see what values of `empstat` are in the data!

In [10]:
print("First, let's look at how our variable 'empstat' is distributed.\n")
print(data["empstat"].value_counts())

First, let's look at how our variable 'empstat' is distributed.

empstat
employed              148758
not in labor force    104676
n/a                    57843
unemployed              7727
Name: count, dtype: int64


><span style="color: #008080">*We can observe that a substantial portion of the dataset is not in the labor force, for example because respondents are retired or in school. It is crucial to consider this situation for future analyses.*</span>
>
><span style="color: #008080">*Given that we want to analyze how salaries behave, we will limit our analysis to only those individuals who are employed. Therefore, we will select a subset of our data.*</span>

In [11]:
employed_data = data[data["empstat"] == "employed"]

### Exercise 8

Now let's estimate the racial income gap in the United States. What is the average salary for employed Black Americans, and what is the average salary for employed White Americans? In percentage terms, how much more does the average White American make than the average Black American?

**Note:** these values are not quite accurate estimates. As we'll discuss in later lessons, to get completely accurate estimates from the ACS we have to take into account how people were selected to be interviewed. But you get pretty good estimates in most cases even without weights—your estimate of the racial wage gap without weights is within 5\% of the corrected value. 

**Note:** This is actually an underestimate of the wage gap. The US Census treats Hispanic respondents as a sub-category of "White." While all ethnic distinctions are socially constructed, and so on some level these distinctions are all deeply problematic, this coding is inconsistent with what most Americans think of when they hear the term "White," a term *most* Americans think of as a category that is mutually exclusive of being Hispanic or Latino (categories which are also usually conflated in American popular discussion). With that in mind, most researchers working with US Census data split "White" into "White, Hispanic" and "White, Non-Hispanic" using `race` *and* `hispan`. But for the moment, just identify "White" respondents using the value in `race`.

Store your results in `results` under the keys `"ex8_avg_income_black"`, `"ex8_avg_income_white"`, and the percentage difference as `ex8_racial_difference`. Please note the wording above when calculating the percentage difference to ensure you get the reference category correct, and interpret your result as well.

In [12]:
print("Let's see how the variable 'race' is distributed in our dataset:\n")
print(employed_data["race"].value_counts())

Let's see how the variable 'race' is distributed in our dataset:

race
white                               116017
black/african american/negro         13175
other asian or pacific islander       6424
other race, nec                       5755
two major races                       3135
chinese                               2149
american indian or alaska native      1290
three or more major races              426
japanese                               387
Name: count, dtype: int64


In [13]:
# calculating results
ex8_avg_income_black = employed_data[
    employed_data["race"] == "black/african american/negro"
]["inctot"].mean()

ex8_avg_income_white = employed_data[employed_data["race"] == "white"]["inctot"].mean()

ex8_racial_difference = (
    (ex8_avg_income_white - ex8_avg_income_black) / ex8_avg_income_black * 100
)

# storing results
results["ex8_avg_income_black"] = ex8_avg_income_black
results["ex8_avg_income_white"] = ex8_avg_income_white
results["ex8_racial_difference"] = ex8_racial_difference

# printing results
print(
    f"The average income for employed individuals who identify as Black in this sample is {round(ex8_avg_income_black):,}."
)

print(
    f"The average income for employed individuals who identify as White in this sample is {round(ex8_avg_income_white):,}."
)

print(
    f"\nIn percentage terms, in this sample, the average income for employed White Americans\nis approximately  {round(ex8_racial_difference,1):,}% higher than the average salary for Black Americans."
)

The average income for employed individuals who identify as Black in this sample is 41,748.
The average income for employed individuals who identify as White in this sample is 60,473.

In percentage terms, in this sample, the average income for employed White Americans
is approximately  44.9% higher than the average salary for Black Americans.
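
To make the role of the reference category concrete, here is a small sketch using the rounded averages printed above. The helper name `pct_more` is hypothetical; it simply divides the gap by whichever group is chosen as the reference:

```python
def pct_more(value: float, reference: float) -> float:
    """Percent by which `value` exceeds `reference`."""
    return (value - reference) / reference * 100


avg_white = 60473  # rounded average from the output above
avg_black = 41748  # rounded average from the output above

# "How much more does the average White American make?" -> Black is the reference.
white_vs_black = pct_more(avg_white, avg_black)

# Swapping the reference answers a different question and gives a different number.
black_vs_white = pct_more(avg_black, avg_white)

print(round(white_vs_black, 1), round(black_vs_white, 1))
```

The same dollar gap is about 44.9% of the Black average but only about 31.0% of the White average, which is why the wording of the question matters.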


### Exercise 9


As noted above, these estimates are not actually *quite* correct because we aren't using survey weights. To calculate a weighted average that takes into account survey weights, you need to use the following formula:

$$weighted\_mean\_of\_x = \frac{\sum_i x_i * weight_i}{\sum_i weight_i}$$

(As you can see, when $weight_i$ is constant for all observations, this just simplifies to our normal formula for mean values. It is only when weights vary across individuals that weights must be explicitly addressed).
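
As a quick check of the formula on made-up numbers (the arrays below are illustrative only), the sketch computes the weighted mean by hand, compares it with `np.average`, and confirms that constant weights reproduce the ordinary mean:

```python
import numpy as np

x = np.array([10.0, 20.0, 60.0])
weights = np.array([1.0, 1.0, 2.0])

# Direct translation of the formula: sum(x_i * w_i) / sum(w_i).
by_formula = (x * weights).sum() / weights.sum()  # (10 + 20 + 120) / 4

# NumPy's built-in weighted average should agree.
by_numpy = np.average(x, weights=weights)

# With constant weights, the weighted mean equals the plain mean.
constant_weights = np.average(x, weights=np.full(len(x), 5.0))

print(by_formula, by_numpy, constant_weights, x.mean())
```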

In this data, weights are stored in the variable `perwt`, which is the number of people for which each observation is a stand-in (the inverse of that observation's sampling probability).

Using the formula, re-calculate the *weighted* average income for both populations and store them as `ex9_avg_income_white` and `ex9_avg_income_black`.


In [14]:
# calculating results
blk_weighted_sum = (
    employed_data[employed_data["race"] == "black/african american/negro"]["inctot"]
    * employed_data[employed_data["race"] == "black/african american/negro"]["perwt"]
).sum()
blk_total_weight = employed_data[
    employed_data["race"] == "black/african american/negro"
]["perwt"].sum()

ex9_avg_income_black = blk_weighted_sum / blk_total_weight

white_weighted_sum = (
    employed_data[employed_data["race"] == "white"]["inctot"]
    * employed_data[employed_data["race"] == "white"]["perwt"]
).sum()
white_total_weight = employed_data[employed_data["race"] == "white"]["perwt"].sum()
ex9_avg_income_white = white_weighted_sum / white_total_weight

# storing results
results["ex9_avg_income_black"] = ex9_avg_income_black
results["ex9_avg_income_white"] = ex9_avg_income_white

# printing results
print(
    f"The weighted average income for employed individuals who identify as Black in this sample is {round(ex9_avg_income_black):,}."
)

print(
    f"The weighted average income for employed individuals who identify as White in this sample is {round(ex9_avg_income_white):,}."
)

The weighted average income for employed individuals who identify as Black in this sample is 40,431.
The weighted average income for employed individuals who identify as White in this sample is 58,361.
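
One subtlety worth keeping in mind with this pattern: if any rows had a missing `inctot`, pandas would skip them in the numerator's sum but still count their weights in the denominator. The sketch below, on made-up data with a hypothetical helper named `weighted_mean`, shows the difference and one way to guard against it:

```python
import numpy as np
import pandas as pd


def weighted_mean(df: pd.DataFrame, value_col: str, weight_col: str) -> float:
    """Weighted mean that drops rows with a missing value before summing weights."""
    valid = df[df[value_col].notna()]
    return (valid[value_col] * valid[weight_col]).sum() / valid[weight_col].sum()


# Made-up data: the middle row has no income but carries a weight of 2.
toy = pd.DataFrame({"inctot": [10.0, np.nan, 30.0], "perwt": [1, 2, 1]})

# Naive version: numerator skips the NaN row, denominator does not.
naive = (toy["inctot"] * toy["perwt"]).sum() / toy["perwt"].sum()  # 40 / 4

# Guarded version: both sums use the same rows.
careful = weighted_mean(toy, "inctot", "perwt")  # 40 / 2

print(naive, careful)
```

Wrapping the calculation in a helper also avoids repeating the same filter expression for each group.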


### Exercise 10

Now calculate the weighted average income gap between *non-Hispanic* White Americans and Black Americans. What percentage more do employed White non-Hispanic Americans earn than employed Black Americans? Store as `"ex10_wage_gap"`.
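
Selecting this group requires two conditions at once. Here is a sketch of combining boolean masks on a small made-up frame (the rows and the `"mexican"` label are illustrative assumptions, not taken from the data shown here):

```python
import pandas as pd

# Made-up respondents with race and Hispanic-origin labels.
toy = pd.DataFrame(
    {
        "race": ["white", "white", "black/african american/negro", "white"],
        "hispan": ["not hispanic", "mexican", "not hispanic", "not hispanic"],
    }
)

# Each comparison needs its own parentheses because & binds more tightly than ==.
mask = (toy["race"] == "white") & (toy["hispan"] == "not hispanic")

white_non_hispanic = toy[mask]

print(white_non_hispanic)
```

Checking `value_counts()` on `hispan` first is a reasonable way to confirm the exact label used for non-Hispanic respondents.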

In [15]:
# calculating results
white_non_hispanic_employed_data = employed_data[
    (employed_data["race"] == "white") & (employed_data["hispan"] == "not hispanic")
]

white_non_hispanic_avg_income_white = (
    (
        white_non_hispanic_employed_data["inctot"]
        * white_non_hispanic_employed_data["perwt"]
    ).sum()
) / white_non_hispanic_employed_data["perwt"].sum()


ex10_wage_gap = (
    (white_non_hispanic_avg_income_white - ex9_avg_income_black)
    / ex9_avg_income_black
    * 100
)

# storing results
results["ex10_wage_gap"] = ex10_wage_gap

# printing results
print(
    f"\nIn percentage terms, in this sample, the weighted average income for employed White Non hispanic Americans\nis approximately  {round(ex10_wage_gap,1):,}% higher than the average salary for Black Americans."
)


In percentage terms, in this sample, the weighted average income for employed White Non hispanic Americans
is approximately  52.5% higher than the average salary for Black Americans.


### Exercise 11

Is that greater or less than the difference you found in Exercise 8? Why do you think that's the case?

><span style="color: #008080">*The difference is greater than what was found in Exercise 8. The observed racial income gap is larger when the White group is restricted to non-Hispanic White Americans (52.5%) than when Hispanic respondents are included within the White category (44.9%). This suggests that including Hispanic respondents within the White category in the initial estimate produces a lower apparent racial income gap, leading to an underestimation of the true wage gap. Note that the Exercise 10 figure also uses survey weights, while the Exercise 8 figure does not, so the two numbers differ in weighting as well.*</span>
>
><span style="color: #008080">*It is important to take these variables into consideration, especially when analyzing data for public policy purposes.*</span>

## Check results

In [16]:
results

{'ex2_avg_income': 1723646.2703978634,
 'ex3_share_making_9999999': 0.1689665333350052,
 'ex3_share_making_zero': 0.10557547867738336,
 'ex5_avg_income': 40890.177564946454,
 'ex8_avg_income_black': 41747.949905123336,
 'ex8_avg_income_white': 60473.15372747098,
 'ex8_racial_difference': 44.85299006275197,
 'ex9_avg_income_black': 40430.953355310274,
 'ex9_avg_income_white': 58361.48196061399,
 'ex10_wage_gap': 52.52989147705372}

In [17]:
assert set(results.keys()) == {
    "ex2_avg_income",
    "ex3_share_making_9999999",
    "ex3_share_making_zero",
    "ex5_avg_income",
    "ex8_avg_income_black",
    "ex8_avg_income_white",
    "ex8_racial_difference",
    "ex9_avg_income_black",
    "ex9_avg_income_white",
    "ex10_wage_gap",
}