<center>
<img style="float: center;" src="images/CI_horizontal.png" width="400">
</center>
<center>
    <span style="font-size: 1.5em;">
        <a href='https://www.coleridgeinitiative.org'>Website</a>
    </span>
</center>


<center> Rukhshan Mian, Benjamin Feder, Josh Edelmann </center>

# **Imputation, Outcome Measurement, and Inference**

What should you do when you encounter missing values in your data? Unfortunately, there is usually no *right* answer. However, different decisions about how to address missing values can have different implications for the inferences you draw from any analysis. One option is to omit observations that have missing values, but you may also choose to retain all observations and instead impute these missing values using various techniques. Imputation provides your best guess for each missing point's true value.

Here, you will learn how you can approach working with missing wage records. Recall that we have already analyzed employment outcomes for our 2015 calendar year bachelor's degree recipients in Texas. Those employment outcomes only included those who appeared in Texas' Unemployment Insurance wage records, though. A person may not appear in these records for several reasons:
- The person is unemployed. 
- The person is out of the labor force, e.g., due to schooling or childcare.
- The person was employed outside of Texas.
- The person's job is not covered in wage records, e.g., self-employed workers, independent contractors, and federal government workers. <a href='https://www.nap.edu/read/10206/chapter/11#294'>(Hotz and Scholz, 2002)</a>

## **Learning Objectives**

We will focus primarily on the earnings for our cohort in their first quarter after graduation and apply various earnings imputation methods. After this notebook, you should know:

- How to identify those with missing records
- Options for imputing missing values
- How to visualize estimate changes following imputation

## **R Setup and Server Connection**

In [None]:
# Database interaction imports
library(odbc, warn.conflicts=F, quietly=T)

# For data manipulation/visualization
library(tidyverse, warn.conflicts=F, quietly=T)

# For faster date conversions
library(lubridate, warn.conflicts=F, quietly=T)

# Use percent() function
library(scales, warn.conflicts=F, quietly=T)
              
theme_set(theme_gray(base_size = 24))

# Adjust repr.plot.width and repr.plot.height to change the size of graphs
options(repr.plot.width = 20, repr.plot.height = 12)

In [None]:
# Connect to the server
con <- DBI::dbConnect(odbc::odbc(),
                     Driver = "SQL Server",
                     Server = "msssql01.c7bdq4o2yhxo.us-gov-west-1.rds.amazonaws.com",
                     Trusted_Connection = "True")

## **Data Preparation**

Before we begin imputing earnings, we need to do some quick data manipulation to isolate earnings from the first quarter after each individual's graduation. To do so, using the same approach as we did in the Data Exploration: Wages notebook, we will create a new column, `quarter_number`, to track the quarter after graduation by taking the difference between `job_date` and `grad_date` in weeks, dividing by 13 (the approximate number of weeks in a quarter), and rounding to the nearest whole number.
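To see how this calculation behaves, here is a minimal sketch on a few made-up dates. The `toy` data frame and its dates are hypothetical and are not drawn from the wage records.

```r
# Hypothetical graduation and job dates, for illustration only
toy <- data.frame(
    grad_date = as.Date(c("2015-05-15", "2015-05-15", "2015-12-18")),
    job_date  = as.Date(c("2015-07-01", "2015-10-01", "2016-04-01"))
)

# weeks between graduation and the job record, divided by 13 weeks per quarter
toy$quarter_number <- round(
    as.double(difftime(toy$job_date, toy$grad_date, units = "weeks")) / 13, 0
)

toy
```

Because the result is rounded, a job record roughly 7 to 19 weeks after graduation is assigned to quarter 1, so the boundaries between quarters are approximate.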

In [None]:
# read in earnings table
qry <- "
SELECT * 
FROM tr_tx_2021.dbo.nb_cohort_wages_link;
"

df_wages <- dbGetQuery(con, qry)

In [None]:
# add in quarter number
df_wages <- df_wages %>%
    mutate(
        quarter_number = round(as.double(difftime(as.Date(job_date), as.Date(grad_date), units = "weeks")/13), 0)
    )

# see evidence
df_wages %>%
    select(grad_date, job_date, quarter_number) %>%
    head()

In [None]:
# Filter quarter 1 after graduation
q1_wages <- df_wages %>%
    filter(quarter_number == 1)

In [None]:
# get total earnings in quarter because we will impute quarterly earnings
q1_wages <- q1_wages %>%
    group_by(gradid) %>%
    summarize(tot_wages = sum(wage)) %>%
    ungroup()

In [None]:
# number of graduates with positive earnings
q1_num <- q1_wages %>%
    summarize(
        n=n_distinct(gradid)
    )

cat('The total graduates with positive earnings during their first quarter after graduation:', q1_num$n)

We can now compare the number of graduates with positive earnings with that of the full cohort.

In [None]:
# read in cohort
# only take variables used for imputation in notebook
qry <- "
SELECT gradid, gradgen, gradmaj, gradyob, grad_date
FROM tr_tx_2021.dbo.grads15_dated;
"

df <- dbGetQuery(con, qry)

In [None]:
cat('The total number of graduates with positive earnings during their first quarter after graduation is', percent(q1_num$n/nrow(df), .01), 'of the study cohort.')

<font color=orange> <h3> Checkpoint 1: Identifying Earnings in the Fourth Quarter after Graduation </h3> </font>

Given the code above, create a data subset `q4_wages` that contains all earnings for the cohort in their fourth quarter after graduation. How many members of our cohort had positive earnings in this quarter? Do you expect this number to be higher or lower than the number in the first quarter?

In [None]:
# Import the file with hints and solutions
source("nb6_hints_and_solutions.txt")

In [None]:
# replace __ in the code below with the appropriate quarter
# filter by quarter number 4
q4_wages <- df_wages %>%
    filter(quarter_number == __)

# replace __ in the code below with the appropriate function
# grouping by gradid and summing wages
q4_wages <- q4_wages %>%
    ___(gradid) %>%
    summarize(tot_wages = sum(wage)) %>%
    ungroup()

# replace __ in the code below with the appropriate variable
# count total number of distinct gradid values
q4_num <- __ %>%
    summarize(n=n_distinct(gradid))

cat('The total graduates with positive earnings during their fourth quarter after graduation:', q4_num$n)
cat('\nThat is', percent(q4_num$n/nrow(df), .01), 'of the study cohort.')

In [None]:
# Uncomment the line below to view a hint
# check_1.hint()

In [None]:
# Uncomment the line below to view a solution
# check_1.solution()

------

Our current data frame, `q1_wages`, only contains individuals with positive earnings in their first quarter after graduation in Texas. Let's add in members of our cohort who did not appear in Texas's wage records during this time period, as well as the additional variables from the original cohort table to better describe the individuals. This will let us conduct different imputation methods on the cohort's earnings in their first quarter after graduation as we progress throughout this notebook.

In [None]:
# add grads without positive earnings
q1_all_wages <- df %>%
    left_join(q1_wages, c("gradid"))

In [None]:
# see if the number of those with wages in q1_all_wages is the same as total individuals in q1_wages
q1_all_wages %>%
    mutate(
        wage_ind = ifelse(is.na(tot_wages), 'no', 'yes')
    ) %>%
    group_by(wage_ind) %>%
    summarize(n=n_distinct(gradid))

In [None]:
# number of individuals in q1_wages
q1_num$n

In [None]:
# see if number of rows is equal (each gradid should have one row)
nrow(df) == nrow(q1_all_wages)

We can see that these numbers make sense. If they did not add up, chances are there was an issue with the details of your join. Let's also check to see if we have any missing values for our demographic variables. If so, let's fill these in as `U` so they won't be dropped in future analyses.

In [None]:
# see number of na values by column
colSums(is.na(q1_all_wages))

In [None]:
# don't have any missing, but if we did we can do this
q1_all_wages <- q1_all_wages %>%
    replace_na(list(
        gradgen = 'U'
    ))

# see na distribution now
colSums(is.na(q1_all_wages))

Theoretically, you could apply these imputation methods to these missing demographic values. However, for the purposes of this notebook, we will focus our imputation techniques on missing earnings values.

<font color=orange> <h3> **Checkpoint 2: Replicate for Q4** </h3> </font>

Create a data frame `q4_all_wages` that mirrors `q1_all_wages` except for Q4. Feel free to add in as many code cells as you deem necessary.

In [None]:
# replace __ in the code below with the appropriate word
# add grads without positive earnings

q4_all_wages <- df %>%
   ___(q4_wages, c("gradid"))

# see number of na values by column
colSums(is.na(q4_all_wages))


# for each variable with missing data fill in with unknown, as shown above. 

q4_all_wages <- q4_all_wages %>%
    replace_na(list(
        ___ = 'U'
    ))

# see na distribution now
colSums(is.na(q4_all_wages))

In [None]:
# Uncomment the code below to view a hint
# check_2.hint()

In [None]:
# Uncomment the code below to view a solution
# check_2.solution()

## **Imputation**

Now that we have confirmed that our `q1_all_wages` data frame is ready for testing our imputation methods, we can get started. We will cover the following methods:
1. Dropping all people with "missing" values on the variable of interest (Q1 wages)
2. Filling in zero for people who do not have records in Texas wage data
3. Filling in missing values with the average earnings of people who are in the same degree field and have the same gender
4. Regression

### **1. Drop All Missing Values**

First, let's look at the earnings outcomes during the first quarter after graduation when we drop all missing earnings values. Here, by ignoring the missing values, we are implicitly assuming that they follow the same distribution as the observed ones. Although this approach is fairly common, it is not a recommended method.

> Deleting missing values is often called listwise deletion and essentially assumes that missing values are missing completely at random (MCAR). For a scholarly treatment of this issue, see (amongst others): 
> - Rubin (1976) "Inference and Missing Data" for the initial presentation, or
> - Peugh and Enders (2004) "Missing Data in Educational Research: A Review of Reporting Practices and Suggestions for Improvement" for a more recent discussion.  

In [None]:
# drop missing values
q1_no_missing <- q1_all_wages %>% 
    filter(!is.na(tot_wages))

In [None]:
# see earnings distribution
summary(q1_no_missing$tot_wages)
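
The MCAR assumption behind listwise deletion can be illustrated with a small simulation. This is only a sketch on simulated data: the earnings distribution, the missingness rates, and all object names below are assumptions made for illustration and do not reflect the Texas wage records.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

# simulated "true" quarterly earnings for 10,000 hypothetical graduates
true_wages <- rlnorm(10000, meanlog = 9, sdlog = 0.5)

# missing completely at random: every record has the same 30% chance of being missing
mcar_missing <- runif(10000) < 0.3

# not at random: lower earners are more likely to be missing (50% vs. 10%)
low_earner <- true_wages < median(true_wages)
nonrandom_missing <- runif(10000) < ifelse(low_earner, 0.5, 0.1)

# compare the true mean with the means after dropping missing records
est <- c(
    truth          = mean(true_wages),
    drop_mcar      = mean(true_wages[!mcar_missing]),
    drop_nonrandom = mean(true_wages[!nonrandom_missing])
)
round(est, 2)
```

When records are missing completely at random, dropping them leaves the mean close to the truth. When low earners are more likely to be missing, dropping them overstates average earnings, which is the main risk of this method.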

<font color=orange> <h3> **Checkpoint 3: Replicate for Q4** </h3> </font>

What does the earnings distribution look like for Q4 when you drop missing values?

In [None]:
# replace __ with the appropriate variable
# drop missing values
q4_no_missing <- ___ %>% 
    filter(!is.na(tot_wages))

# see earnings distribution
summary(q4_no_missing$tot_wages)

In [None]:
# Uncomment the code below to view a hint
# check_3.hint()

In [None]:
# Uncomment the code below to view a solution
# check_3.solution()

### **2. Fill in Missing Values with Zero**

Next, let's see how the earnings distribution shifts when we encode all missing earnings outcomes as 0. Here, we are assuming that all missing earnings are due to unemployment.

In [None]:
# fill all null tot_wages with 0
q1_wages_zero <- q1_all_wages %>%
    mutate(tot_wages = ifelse(is.na(tot_wages), 0, tot_wages)) 

In [None]:
# Take a look at the distribution. How does it vary from the distribution you get in method 1?
summary(q1_wages_zero$tot_wages)

In [None]:
cat('Average earnings if missing wages are dropped is $', round(mean(q1_no_missing$tot_wages), 2), sep = '', '.')

cat('\nAverage earnings if missing wages are imputed as 0 is $', round(mean(q1_wages_zero$tot_wages), 2), sep = '', '.')

<font color=orange> <h3> **Checkpoint 4: Replicate for Q4** </h3> </font>

What does the earnings distribution look like for Q4 when you fill missing values with zero?

In [None]:
# replace __ with the appropriate function
# fill all null tot_wages with 0
q4_wages_zero <- q4_all_wages %>%
    mutate(tot_wages = ifelse(___(tot_wages), 0, tot_wages))

# see earnings distribution
summary(q4_wages_zero$tot_wages)

# Take a look at the distribution. How does it vary from the distribution you get in method 1?
cat('Average earnings if missing wages are dropped is $', round(mean(q4_no_missing$tot_wages), 2), sep = '', '.')
cat('\nAverage earnings if missing wages are imputed as 0 is $', round(mean(q4_wages_zero$tot_wages), 2), sep = '', '.')

In [None]:
# Uncomment the line below to view a hint
# check_4.hint()

In [None]:
# Uncomment the line below to view a solution
# check_4.solution()

### **3. Fill in Missing Values with Major/Gender Mean Earnings**

Now, instead of either ignoring missing values or assuming the earnings are 0, we will try imputing missing earnings for each individual as the average quarterly earnings of the other individuals in our cohort of the same `gradgen` and `CIPTitle2010` (major title corresponding to the `gradmaj` CIP code).

Here, our strategy is as follows:
- Using populated wages, find mean earnings for each major by gender
- Merge the mean earnings, based on major and gender, to the overall cohort
 - Create an additional column `mean_wages`
- Recode so that missing values are populated with mean earnings
 - Store data in a new column `imputed_wages`


>Note: This process is frequently termed mean imputation. Implementing this method will compress the variance and covariance of the imputed variable, resulting in biased parameter estimates for all parameters except the mean (Peugh & Enders, 2004, p.529). In this example, we are assuming that missing wages depend only on gender and major, so that graduates with missing wages earn, on average, the same as observed graduates with the same gender and major. We also assume that the missingness is not truly indicative of a lack of wages.

First, we can recycle our code from Module 2 to create a `CIP_Program` variable in our wages data.

In [None]:
# Create a 2 digit CIP program code from the full CIP code in `gradmaj`
q1_all_wages <- q1_all_wages %>%
    mutate(
        CIP_Program = substring(gradmaj, 1, 2)
    )

# load CIP crosswalk into R
qry <- "
SELECT *
FROM ds_public_1.dbo.cip_lookup
"
cip_lookup <- dbGetQuery(con, qry)

# only select 2010 columns
cip_lookup <- cip_lookup %>%
    select(ends_with("2010"))

q1_all_wages <- q1_all_wages %>%
    inner_join(cip_lookup, by = c("CIP_Program" = "CIPCode2010"))

In [None]:
# mean earnings by gender/CIPTitle2010
q1_major_gend <- q1_all_wages %>%
    group_by(gradgen, CIPTitle2010) %>%
    summarize(
        mean_wages = mean(tot_wages, na.rm=T)
    ) %>%
    ungroup()

head(q1_major_gend)

In [None]:
# merge mean earnings into original data frame
# see if join works
q1_all_wages %>%
    inner_join(q1_major_gend, by=c('gradgen', 'CIPTitle2010')) %>%
    head()

In [None]:
# save join results to q1_joined_major_gend
q1_joined_major_gend <- q1_all_wages %>%
    inner_join(q1_major_gend, by=c('gradgen', 'CIPTitle2010'))

In [None]:
# create new column populated with mean wages if previously NA, otherwise keep quarterly wages
q1_major_gend_impute <- q1_joined_major_gend %>%
    mutate(imputed_wages = ifelse(is.na(tot_wages), mean_wages, tot_wages))

In using this method, there is a chance we cannot impute missing values for all individuals in the cohort. If `imputed_wages` is still `NA`, we can assume there were no individuals in the cohort with non-missing earnings with the same major/gender combination.

In [None]:
# see if any still don't have imputed earnings
q1_major_gend_impute %>%
    filter(is.na(imputed_wages)) %>%
    summarize(n=n())

It appears we have available earnings for every combination of gender and major.

In [None]:
# see distribution
summary(q1_major_gend_impute$imputed_wages)

In [None]:
cat('Average earnings if missing wages are dropped is $', round(mean(q1_no_missing$tot_wages), 2), sep = '', '.')

cat('\nAverage earnings if missing wages are imputed as 0 is $', round(mean(q1_wages_zero$tot_wages), 2), sep = '', '.')

cat('\nAverage earnings if missing wages are imputed using major/gender means earnings is $', round(mean(q1_major_gend_impute$imputed_wages, na.rm=TRUE), 2), sep = '', '.')
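
The note above about mean imputation compressing the variance can be checked on a tiny example. The sketch below uses a made-up `toy` data frame with a hypothetical `group` column standing in for the gender/major combination. It is an illustration, not a result from the cohort.

```r
library(dplyr)

# hypothetical toy data: two groups, each with one missing wage
toy <- tibble(
    group     = c("A", "A", "A", "B", "B", "B"),
    tot_wages = c(4000, 6000, NA, 9000, 11000, NA)
)

# fill each missing wage with the mean of the observed wages in its group
toy_imputed <- toy %>%
    group_by(group) %>%
    mutate(
        imputed_wages = ifelse(is.na(tot_wages), mean(tot_wages, na.rm = TRUE), tot_wages)
    ) %>%
    ungroup()

# compare the spread before and after imputation
c(
    var_observed = var(toy$tot_wages, na.rm = TRUE),
    var_imputed  = var(toy_imputed$imputed_wages)
)
```

Every imputed value sits exactly at its group mean, so the imputed column has less spread than the observed wages. This is why standard errors computed after mean imputation tend to be too small.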

<font color=orange> <h3> **Checkpoint 5: Replicate for Q4** </h3> </font>

Impute missing earnings values as the mean earnings of individuals in the cohort with the same gender (`gradgen`) and degree designation (`CIPTitle2010`) in quarter 4. What does the earnings distribution look like? For how many individuals could you not impute values using this method?

In [None]:
# replace __ with the appropriate function
# Create a 2 digit CIP program code from the full CIP code in `gradmaj`
q4_all_wages <- q4_all_wages %>%
    mutate(
        CIP_Program = substring(gradmaj, 1, 2)
    )

q4_all_wages <- q4_all_wages %>%
    inner_join(cip_lookup, by = c("CIP_Program" = "CIPCode2010"))

#mean earnings by gender/CIPTitle2010
q4_major_gend <- q4_all_wages %>%
    group_by(gradgen, CIPTitle2010) %>%
    summarize(
        mean_wages = ___(tot_wages, na.rm=T)
    ) %>%
    ungroup()

# save join results to q4_joined_major_gend
q4_joined_major_gend <- q4_all_wages %>%
    inner_join(q4_major_gend, by=c('gradgen', 'CIPTitle2010'))

q4_major_gend_impute <- q4_joined_major_gend %>%
    mutate(imputed_wages = ifelse(___(tot_wages), mean_wages, tot_wages))

# see if any still don't have imputed earnings
q4_major_gend_impute %>%
    filter(is.na(___)) %>%
    summarize(n=n())

# see earnings distribution
summary(q4_major_gend_impute$imputed_wages)

cat('Average earnings if missing wages are dropped is $', round(mean(q4_no_missing$tot_wages), 2), sep = '', '.')

cat('\nAverage earnings if missing wages are imputed as 0 is $', round(mean(q4_wages_zero$tot_wages), 2), sep = '', '.')

cat('\nAverage earnings if missing wages are imputed using major/gender means earnings is $', round(mean(q4_major_gend_impute$imputed_wages, na.rm=TRUE), 2), sep = '', '.')

In [None]:
# Uncomment the code below to view a hint
# check_5.hint()

In [None]:
# Uncomment the code below to view a solution
# check_5.solution()

### **4. Regression Imputation**

We can also use regression to try to get more accurate imputed earnings. We will build a regression equation from the observations for which we know the earnings, then use the equation to predict the missing earnings values. This is, in effect, an extension of the mean imputation by subgroup. Here, we will use information about graduates such as birth year, gender, and major as input variables to build out the regression.
> Note: We will not be checking the assumptions associated with linear regressions, as this example is aimed at merely displaying how to use a linear regression for imputation. If you plan on using regression imputation, please check all assumptions before employing a predictive model.

In [None]:
# subset to variables included in regression analysis
q1_reg <- q1_all_wages %>%
    select(-c(CIP_Program, gradmaj))

For ease of interpreting the linear regression results, we will identify the five most common majors and group all other majors together.

In [None]:
# finding top 5 majors
top_5_majors <- q1_reg %>%
    count(CIPTitle2010) %>% 
    arrange(desc(n)) %>%
    head(5)

top_5_majors

In [None]:
# create an additional column that keeps the top 5 majors and assigns all others to "Other"
# only take first word of major
q1_reg <- q1_reg %>%
    mutate(
        major_group = ifelse(CIPTitle2010 %in% top_5_majors$CIPTitle2010, word(CIPTitle2010, 1), "Other")
    ) %>%
    select(-CIPTitle2010)

In [None]:
# convert gradyob to numeric so birth year enters the model as a numeric predictor
# values that cannot be converted (e.g., unknown birth years) become NA; lm() drops those rows when fitting
q1_reg <- q1_reg %>%
    mutate(
        gradyob = as.numeric(gradyob)
    )

nrow(q1_reg)

In [None]:
# see summary of variables
summary(q1_reg)

Here, we are creating two data frames to predict the missing wages. `q1_wages_na` is the data frame that we need to predict wages for and `q1_wages_pred` is the data frame that is used to create the linear regression.

In [None]:
q1_wages_na <- q1_reg %>%
    filter(is.na(tot_wages)) %>%
# don't need tot_wages because they are null 
    select(-c(tot_wages))

# removing NAs from tot_wages
q1_wages_pred <- q1_reg %>%
    filter(!is.na(tot_wages))

In [None]:
# run model and fit coefficients
# linear regression can be performed with the lm() function with outcome to left of ~ and predictors to right
q1_wages_model <- lm(tot_wages ~ gradyob + gradgen + major_group, data = q1_wages_pred)

In [None]:
# see model summary
summary(q1_wages_model)

Part of regression-based imputation is evaluating your model for any unusual relationships. Examining the above result suggests that younger graduates tend to earn less, males tend to earn more, and all majors tend to earn more than the comparison group (biological and biomedical sciences). While there is certainly more we could add to inform this model, the signs of these coefficients make theoretical sense.

Now that we have fit coefficients for each of the predictors in the model, we can predict the `tot_wages` variable for the missing data using `predict()`.

In [None]:
# predict earnings
pred_earnings <- data.frame(tot_wages = predict(q1_wages_model, newdata=q1_wages_na))

In [None]:
head(pred_earnings)

In [None]:
# see updated data frame with predicted earnings
# pred_earnings retains same order of rows so can see predicted earnings with characteristics
cbind(q1_wages_na, pred_earnings) %>% 
    head()

In [None]:
# save updated data frame with predicted earnings
q1_wages_na_w_earnings <- cbind(q1_wages_na, pred_earnings)

In [None]:
# combine the known earnings with predicted earnings
q1_reg_earnings <- rbind(q1_wages_na_w_earnings, q1_wages_pred)

In [None]:
# see earnings distribution for full cohort
summary(q1_reg_earnings$tot_wages)

In [None]:
# see earnings distribution for imputed portion of cohort
summary(q1_wages_na_w_earnings$tot_wages)
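
The regression imputation steps above can be condensed into a short sketch on simulated data. Everything here is an assumption for illustration: the `toy` data frame, the `age` and `group` predictors, and the coefficients used to simulate earnings are all hypothetical.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# simulated cohort: earnings rise with age and differ by group
n <- 200
toy <- data.frame(
    age   = sample(22:40, n, replace = TRUE),
    group = sample(c("A", "B"), n, replace = TRUE)
)
toy$tot_wages <- 3000 + 100 * toy$age + 1500 * (toy$group == "B") + rnorm(n, sd = 500)

# blank out 25% of earnings to mimic missing wage records
toy$tot_wages[sample(n, 50)] <- NA

# fit on rows with known earnings, then predict for rows with missing earnings
observed <- toy[!is.na(toy$tot_wages), ]
missing  <- toy[is.na(toy$tot_wages), ]
model    <- lm(tot_wages ~ age + group, data = observed)
missing$tot_wages <- predict(model, newdata = missing)

# recombine known and predicted earnings
toy_imputed <- rbind(observed, missing)
summary(toy_imputed$tot_wages)

# a row with a missing predictor receives an NA prediction
na_pred <- predict(model, newdata = data.frame(age = NA_real_, group = "A"))
na_pred
```

The last step shows why it is worth checking for leftover `NA` values after regression imputation: any row with a missing predictor, such as an unknown birth year, cannot be predicted and stays missing.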

## **Visualizing Earnings Distributions**

We can quickly determine if these different imputation methods significantly altered the pre-imputation wage distribution by visualizing the overall earnings distribution. Plotting side-by-side boxplots can be an effective choice. To do so, we need to bind the earnings from all of these methods by rows, meaning they must have the same columns. For the sake of simplicity, we will have three columns in this data frame:

- `gradid`, the person identifier
- `tot_wages`, cumulative earnings in first quarter post-graduation
- `method`, type of imputation method

In [None]:
# adapt q1_no_missing
q1_no_missing %>%
    select(gradid, tot_wages) %>% head()

q1_no_missing <- q1_no_missing %>%
    select(gradid, tot_wages) %>%
    mutate(method = 'remove missing')

In [None]:
# adapt q1_reg_earnings
q1_reg_earnings%>%
    select(gradid, tot_wages) %>% head()

q1_reg_earnings <- q1_reg_earnings %>%
    select(gradid, tot_wages) %>%
    mutate(method = 'regression')

In [None]:
# adapt q1_wages_zero
q1_wages_zero %>%
    select(gradid, tot_wages) %>% head()

q1_wages_zero <- q1_wages_zero %>%
    select(gradid, tot_wages) %>%
    mutate(method = 'zero')

In [None]:
#adapt q1_major_gend_impute
q1_major_gend_impute %>% 
    select(gradid, imputed_wages) %>%
    rename(tot_wages = imputed_wages) %>% 
    head()

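# keep the imputed earnings (not the original tot_wages, which still contains NA values)
q1_major_gend_impute <- q1_major_gend_impute %>%
    select(gradid, imputed_wages) %>%
    rename(tot_wages = imputed_wages) %>%
    mutate(method = 'mean')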

In [None]:
# combine earnings from all methods now that they have the same column names
all_methods <- rbind(q1_major_gend_impute, q1_reg_earnings, q1_no_missing, q1_wages_zero)
max(all_methods$tot_wages, na.rm=TRUE)

Before visualizing these distributions, we will filter out extreme outliers that would affect the rest of the visualizations.

In [None]:
# remove outliers with tot_wages of 40000 or more; filter() also drops rows where tot_wages is NA
all_methods <- all_methods %>% 
    filter(tot_wages < 40000)

# check the max value
max(all_methods$tot_wages, na.rm=TRUE)

In [None]:
# boxplot of all methods
all_methods %>%
    ggplot(aes(x=method, y = tot_wages)) +
    geom_boxplot() + 
    labs(
        title = "The Q1 Earnings Distribution's Quartiles are moderately affected across \n imputation methods",
        x='Imputation Method',
        y='Quarter Earnings',
        caption = 'Source: TX Wage Data'
    ) +
    theme(
        legend.text = element_text(size=24), # legend text font size
        legend.title = element_text(size=24), # legend title font size
        axis.text.x = element_text(size=24), # x axis label font size
        axis.title.x = element_text(size=24), # x axis title font size
        axis.text.y = element_text(size=24), # y axis label font size
        axis.title.y = element_text(size=24) # y axis title font size
    )
    


We can also look at the differences in the earnings distribution by looking at side-by-side histograms. Instead of using the `geom_` layer `geom_boxplot()`, we will use `geom_histogram()`.

In [None]:
all_methods %>%
    ggplot(aes(x=tot_wages)) +
    geom_histogram() + 
    facet_grid(method ~ .) +
    labs(
        title = 'Zero imputation substantially changes the overall earnings distribution',
        y='Number of Workers',
        x='Q1 After Graduation Wages',
        caption = 'Source: TX Wage records data'
    ) +
    theme(
        legend.text = element_text(size=24), # legend text font size
        legend.title = element_text(size=24), # legend title font size
        axis.text.x = element_text(size=24), # x axis label font size
        axis.title.x = element_text(size=24), # x axis title font size
        axis.text.y = element_text(size=24), # y axis label font size
        axis.title.y = element_text(size=24) # y axis title font size
    )

## **(Optional) Advanced: Using machine learning to impute values**

To impute values, we can also use machine learning algorithms such as `K-nearest Neighbors` and `Decision Trees`. The principle behind `K-nearest Neighbors` is quite simple: the missing values can be imputed by values of "closest neighbors" - as approximated by other, known, features. 

For example, if we had cases where the data on earnings of some graduates was completely missing, we could approximate their earnings by referring to other characteristics which could be shared by major group (their "closest neighbors" in terms of characteristics).

The algorithm calculates the distance between each observation with a missing value and the observations with known values, using the features they share (such as major group), and fills in the missing value from the closest observations, for example by averaging their earnings. Imputing missing data using machine learning is an active research area, and there are plenty of papers covering the various algorithms if you are curious.
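
As a rough sketch of the nearest-neighbor idea, the base R code below imputes one missing earnings value by averaging the earnings of its `k` closest neighbors. The `toy` data, the `age` and `tenure` features, and the choice of `k = 3` are all hypothetical. In practice you would likely use a dedicated package and tune `k`.

```r
# hypothetical toy data: two numeric features and one missing earnings value
toy <- data.frame(
    age       = c(22, 23, 24, 35, 36, 23),
    tenure    = c(1, 1, 2, 10, 12, 1),
    tot_wages = c(5000, 5200, 5600, 12000, 12500, NA)
)

k <- 3  # number of neighbors, chosen arbitrarily for this sketch
features <- c("age", "tenure")

known   <- toy[!is.na(toy$tot_wages), ]
unknown <- toy[is.na(toy$tot_wages), ]

# standardize features so that neither one dominates the distance
centers   <- colMeans(known[, features])
scales    <- apply(known[, features], 2, sd)
known_x   <- scale(known[, features], center = centers, scale = scales)
unknown_x <- scale(unknown[, features], center = centers, scale = scales)

# Euclidean distance from the row with missing earnings to every known row
dists <- sqrt(rowSums(sweep(known_x, 2, unknown_x[1, ])^2))

# average the earnings of the k closest known rows
nearest <- order(dists)[1:k]
imputed <- mean(known$tot_wages[nearest])
imputed
```

Here the three closest neighbors are the other young, low-tenure rows, so the imputed value is the average of their earnings rather than the overall mean.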

## **References**

Peugh, J. L., & Enders, C. K. (2004). Missing Data in Educational Research: A Review of Reporting Practices and Suggestions for Improvement. _Review of Educational Research_, 74(4), 525-556. doi: 10.3102/00346543074004525

Rubin, D. B. (1976). Inference and Missing Data. _Biometrika_, 63(3), 581-592. doi: 10.2307/2335739