In [5]:
# Reference: https://jupyterbook.org/interactive/hiding.html
# Use {hide, remove}-{input, output, cell} tags to hide content

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
from IPython.display import display
import myst_nb

sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option('display.max_rows', 7)
pd.set_option('display.max_columns', 8)
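# Use the full option name; the short pattern 'precision' can match
# more than one option in recent pandas versions
pd.set_option('display.precision', 2)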
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)

def display_df(df, rows=pd.options.display.max_rows,
               cols=pd.options.display.max_columns):
    with pd.option_context('display.max_rows', rows,
                           'display.max_columns', cols):
        display(df)
        
# businesses
bus = pd.read_csv('data/businesses.csv', encoding='ISO-8859-1')

(ch:wrangling_missing)=
# Missing Values and Records

In {numref}`Section %s <sec:scope_accuracy>`, we considered the problems that arise when the population and the access frame are not in alignment, so that not everyone we want to study can be reached. We also described problems that arise when someone refuses to participate in the study. In these cases, entire records (observations) are missing, and we discussed the kinds of bias that can occur due to missing records. If nonrespondents differ in critical ways from respondents or if the nonresponse rate is not negligible, then our analysis may be seriously flawed. The example of {numref}`Section %s <sec:theory_electionpolls>` showed that increasing the sample size without addressing nonresponse does not reduce nonresponse bias. Also in that section, we discussed ways to prevent nonresponse. These preventive measures include using incentives to encourage response, keeping surveys short, writing clear questions, training interviewers, and investing in extensive follow-up procedures. Unfortunately, some amount of nonresponse is unavoidable.

After nonresponse has occurred, it is sometimes possible to use models to predict the missing data. (Predicting missing observations is never as good as observing them in the first place.) Records are **missing completely at random** when the chance that a unit responds to a survey does not depend on what is being measured or on the sampling design. For example, if someone accidentally breaks the laboratory equipment at Mauna Loa and CO2 is not recorded for a day, there is no reason to think that the lost measurements had anything to do with the level of CO2 that day. At other times, we consider records **missing at random given covariates** when the nonresponse depends only on observed features and not on the main response. For example, an ER visit in the DAWN survey would be missing at random given covariates if, say, nonresponse depends on race, sex, and age, which are all quantities measured in the survey, but does not vary with the type of ER visit within each age/race/sex subgroup. In these cases, the observed data can be weighted to adjust for nonresponse.
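
To make the weighting idea concrete, here is a minimal sketch on a toy sample. The data frame, its column names (`age_group`, `responded`, `outcome`), and its values are all made up for illustration, and the sketch assumes that nonresponse depends only on `age_group`. Each respondent is weighted by the inverse of the response rate in their group, so that respondents stand in for the nonrespondents who resemble them.

```python
import pandas as pd

# Toy sample: one row per sampled unit. All names and values are made up.
sample = pd.DataFrame({
    'age_group': ['18-34'] * 4 + ['35+'] * 4,
    'responded': [True, False, False, False, True, True, True, True],
    'outcome':   [10.0, None, None, None, 20.0, 22.0, 18.0, 20.0],
})

# Response rate within each covariate group, aligned to the rows of `sample`
rate = sample.groupby('age_group')['responded'].transform('mean')

# Weight each respondent by the inverse of their group's response rate
respondents = sample[sample['responded']].copy()
respondents['weight'] = 1 / rate[sample['responded']]

unweighted = respondents['outcome'].mean()
weighted = ((respondents['outcome'] * respondents['weight']).sum()
            / respondents['weight'].sum())

print(f'unweighted mean: {unweighted:.1f}, weighted mean: {weighted:.1f}')
```

In this toy sample, the younger group responds at a much lower rate, so the unweighted mean leans toward the older group. The weights restore each group to its share of the original sample, but only under the assumption that respondents and nonrespondents within a group are alike.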

When a record is not entirely missing, but a particular field in a record is unavailable, we have nonresponse at the field level. Some datasets use a special coding to signify that the information is missing. For example, Mauna Loa used -99.99 to indicate a missing CO2 measurement. We found only 7 of these values among 738 rows in the table. We showed in {numref}`Section %s <ch:wrangling_co2>` that these missing values have little impact on the analysis.
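
A special code like -99.99 distorts summaries if we treat it as a measurement, so a common first step is to convert it to `NaN`. The sketch below uses a small made-up data frame; the column name `Avg` and the values are assumptions for illustration, not the actual Mauna Loa table.

```python
import numpy as np
import pandas as pd

# Made-up measurements, with -99.99 standing in for a missing value
co2 = pd.DataFrame({'Avg': [410.2, -99.99, 411.5, 412.1]})

# Mean computed with the sentinel treated as a real measurement
naive_mean = co2['Avg'].mean()

# Convert the sentinel to NaN so pandas skips it in summaries
co2['Avg'] = co2['Avg'].replace(-99.99, np.nan)
n_missing = co2['Avg'].isna().sum()
clean_mean = co2['Avg'].mean()

print(f'{n_missing} missing; mean {naive_mean:.2f} before, {clean_mean:.2f} after')
```

When the code is known before loading the file, the `na_values` argument of `pd.read_csv` can do this conversion at read time.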

In some surveys, missing information is further categorized according to whether the respondent refused to answer, was unsure of the answer, or the interviewer didn't ask the question. Each of these types of missing values is recorded using a different value. For example, many questions in the Behavioral Risk Factor Surveillance System use a code of 7 for don't know and 9 for refused to answer, and the field is left blank if the question was not asked.[^DAWNCodebook] These codings help us further refine our study of nonresponse.

[^DAWNCodebook]: See https://www.cdc.gov/brfss/annual_data/2020/pdf/codebook20_llcp-v2-508.pdf
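
When the codes distinguish the reasons for nonresponse, we can keep that information in its own column before blanking out the codes. The following sketch assumes a hypothetical question `q1` that uses 7 for don't know, 9 for refused, and blank for not asked; the column names and values are made up, and the codes for any real question should be checked against the codebook.

```python
import numpy as np
import pandas as pd

# Made-up responses to a hypothetical question `q1`
responses = pd.DataFrame({'q1': [1, 2, 7, 9, np.nan, 1]})

# Record why each value is missing (assumed coding: 7, 9, and blank)
conditions = [
    responses['q1'] == 7,
    responses['q1'] == 9,
    responses['q1'].isna(),
]
labels = ["don't know", 'refused', 'not asked']
responses['q1_status'] = np.select(conditions, labels, default='answered')

# Keep only substantive answers in the cleaned column
responses['q1_clean'] = responses['q1'].where(
    responses['q1_status'] == 'answered')

print(responses['q1_status'].value_counts())
```

Keeping `q1_status` alongside `q1_clean` lets us study each kind of nonresponse separately without the codes leaking into numeric summaries.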

At times, we substitute a reasonable value for missing ones to create a "clean" data frame. This process is called **imputation**. Some common approaches for imputing values are **deductive**, **mean**, and **hot-deck** imputation. We can deductively impute a value through logical relations. For example, below are rows in the business data frame for San Francisco restaurant inspections. Their zip codes are erroneously marked as "Ca", and latitude and longitude are missing. We can look up the address on the USPS website to get the correct zip code, and we can use Google Maps to find the latitude and longitude of the restaurant to fill in these missing values.

In [6]:
bus[bus['postal_code'] == "Ca"]

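|      | business_id | name | address | city | ... | postal_code | latitude | longitude | phone_number |
|------|-------------|------|---------|------|-----|-------------|----------|-----------|--------------|
| 5480 | 88139 | TACOLICIOUS | 2250 CHESTNUT ST | San Francisco | ... | Ca | NaN | NaN | 14156496077 |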
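
Once we have looked up the values by hand, it is best to record them in code. The sketch below shows one way to do this on a toy data frame. The businesses, addresses, and the filled-in zip code and coordinates are placeholders and not the result of an actual lookup.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the business table (all values are placeholders)
toy = pd.DataFrame({
    'business_id': [1, 2],
    'address': ['1 EXAMPLE ST', '2 SAMPLE AVE'],
    'postal_code': ['94110', 'Ca'],
    'latitude': [37.75, np.nan],
    'longitude': [-122.42, np.nan],
})

# Values found by hand, keyed by business_id (placeholders, not real lookups)
lookups = {
    2: {'postal_code': '94000', 'latitude': 37.0, 'longitude': -122.0},
}

# Apply the corrections to a copy so the original table stays intact
fixed = toy.copy()
for business_id, values in lookups.items():
    row = fixed['business_id'] == business_id
    for column, value in values.items():
        fixed.loc[row, column] = value

print(fixed)
```

Keeping the corrections in a dictionary keyed by `business_id` documents exactly which records were changed and what they were changed to.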


Mean imputation uses an average value from rows in the dataset that have values, and hot-deck imputation uses a chance process to select a value at random from rows that have values. For mean and hot-deck imputation, we often impute values based on records in the dataset that resemble the nonrespondents in other features. The average or the randomly selected value then comes from this reduced group. A key issue with mean imputation is that the variability in the imputed feature will be smaller because the feature now has additional values that are identical to the mean. A potential problem with hot-deck imputation is that the strength of a relationship might decline because we have added randomness to the values. More sophisticated imputation techniques use nearest-neighbor methods to find similar subgroups of records, and others use regression techniques to predict the missing value. [REFERENCE]
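
As a rough illustration, the sketch below applies mean and hot-deck imputation within groups of similar records. The data frame, the column names `group` and `value`, and the helper `hot_deck` are all made up for this example, and the sketch assumes every group has at least one observed value to draw from.

```python
import numpy as np
import pandas as pd

# Toy data: `group` stands in for the features that define similar records
df = pd.DataFrame({
    'group': ['a', 'a', 'a', 'b', 'b', 'b'],
    'value': [1.0, 3.0, np.nan, 10.0, np.nan, 14.0],
})

# Mean imputation: fill each missing value with the mean of its group
group_mean = df.groupby('group')['value'].transform('mean')
df['value_mean'] = df['value'].fillna(group_mean)

# Hot-deck imputation: fill each missing value with a randomly chosen
# observed value (a "donor") from the same group
rng = np.random.default_rng(42)

def hot_deck(values):
    donors = values.dropna().to_numpy()
    filled = values.copy()
    missing = filled.isna()
    filled[missing] = rng.choice(donors, size=missing.sum())
    return filled

df['value_hot_deck'] = df.groupby('group')['value'].transform(hot_deck)

print(df)
```

The mean-imputed column places each filled value at the center of its group, which shrinks the spread, while the hot-deck column draws filled values from the observed ones and so changes from one random seed to the next.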

In any of these types of imputation, we should create a new feature that contains the altered data or a new feature to indicate whether or not the response in the original feature has been imputed.
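
One simple way to follow this advice is sketched below. The column names are hypothetical, and overall mean imputation is used only to keep the example short.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': [2.0, np.nan, 4.0]})

# Flag the rows that will be imputed before changing anything
df['value_was_imputed'] = df['value'].isna()

# Store the altered data in a new feature and leave `value` untouched
df['value_imputed'] = df['value'].fillna(df['value'].mean())

print(df)
```

With the flag in place, any later analysis can be rerun on the observed values alone to see how much the imputation matters.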

Decisions to keep or drop a record, to change a value, or to remove a feature may seem small, but they are critical. One anomalous record can seriously impact your findings. [Footnote to Greek austerity study]. Whatever you decide, be sure to check the impact of dropping or changing features and records. And be transparent and thorough in reporting any modifications you make to the data. It's best to make these changes programmatically to reduce potential errors and enable others to confirm exactly what you have done by reviewing your code.
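
A quick way to check the impact of a decision is to compute the same summaries with and without the change. The sketch below does this for one anomalous record in a made-up data frame; the columns, the helper `summarize`, and the rule for dropping the record (`x < 10`) are assumptions for illustration.

```python
import pandas as pd

# Toy data in which the last record is anomalous
df = pd.DataFrame({
    'x': [1.0, 2.0, 3.0, 4.0, 50.0],
    'y': [2.0, 4.0, 6.0, 8.0, 0.0],
})

def summarize(frame):
    """Summaries that we want to compare across versions of the data."""
    return pd.Series({
        'n': len(frame),
        'mean_y': frame['y'].mean(),
        'corr_xy': frame['x'].corr(frame['y']),
    })

# Hypothetical rule for dropping the anomalous record
keep = df['x'] < 10

comparison = pd.DataFrame({
    'all records': summarize(df),
    'after dropping': summarize(df[keep]),
})

print(comparison)
```

Because the rule is written in code, a reader can see exactly which record was dropped and how much the summaries moved as a result.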