# Air pollution and mortality

We use the following two datasets:

1. [Synthetic Medicare Data for Environmental Health Studies](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/L7YF2G)
2. CDC all-cause mortality data

Load the CDC all-cause mortality data. These data must be downloaded manually from https://wonder.cdc.gov/controller/datarequest/D77

Use the following options:
1. Group results by: county.
4. Year 2010.
7. ICD codes I00-I99, J00-J98.
8. Send results to a file.

Click send and save the file to "data/cdc.tsv".

In [12]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelBinarizer

In [3]:
mort = pd.read_csv("data/cdc.tsv", sep="\t", dtype={"County Code": "object"})

In [6]:
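# Rows whose death count is reported as "Suppressed" or "Missing" are blanked out
# entirely, so they carry no county code and drop out of the merge below.
mort[mort["Deaths"].isin(["Suppressed", "Missing"])] = np.nan
mort["Deaths"] = mort["Deaths"].astype(float)
mort["Population"] = mort["Population"].astype(float)
# Despite the "_pct" suffix, this is deaths per 1,000 residents.
mort["cdc_mortality_pct"] = 1e3 * mort["Deaths"] / mort["Population"]
mort = mort.rename(columns={"County Code": "FIPS"}).set_index("FIPS")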
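
The cleaning step above can be checked on a small table. The following is a minimal sketch, assuming a made-up three-county extract (the county names, codes, and counts are hypothetical). It uses `pd.to_numeric` with `errors="coerce"` as an alternative way to turn non-numeric entries into `NaN`; unlike the cell above, it keeps the county code for suppressed rows.

```python
import io

import pandas as pd

# Hypothetical extract mimicking the tab-separated layout; all values are made up.
toy_tsv = (
    "County\tCounty Code\tDeaths\tPopulation\n"
    "County A\t01001\t150\t50000\n"
    "County B\t01003\tSuppressed\t1200\n"
    "County C\t01005\t30\t20000\n"
)
toy = pd.read_csv(io.StringIO(toy_tsv), sep="\t", dtype={"County Code": "object"})

# errors="coerce" turns any non-numeric entry (e.g. "Suppressed") into NaN.
for col in ["Deaths", "Population"]:
    toy[col] = pd.to_numeric(toy[col], errors="coerce")

# Deaths per 1,000 residents, the same scaling as cdc_mortality_pct.
toy["rate_per_1000"] = 1e3 * toy["Deaths"] / toy["Population"]
toy = toy.rename(columns={"County Code": "FIPS"}).set_index("FIPS")
print(toy)
```

Reading the county code as a string keeps leading zeros, which matters because the later merge matches on these codes.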

In [8]:
# Read confounder and exposure data
df = pd.read_csv("data/Study_dataset_2010.csv", index_col=0, dtype={"FIPS": object})

In [10]:
id_vars = ["NAME", "STATE_CODE", "STATE"]
discrete_vars = ["region"]
drop_vars = ["cs_total_population", "cs_area"]
df_id = df[id_vars]
df_discrete = []

In [13]:
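# One-hot encode each discrete variable, dropping the first class as the reference level.
for c in discrete_vars:
    col = df[c]
    lb = LabelBinarizer()
    lb.fit(col)
    # One indicator column per class, named "bin_<class>"; classes_ is sorted.
    bcols = pd.DataFrame(
        lb.transform(col), columns=["bin_" + x for x in lb.classes_], index=df.index
    )
    df_discrete.append(bcols.drop(columns="bin_" + lb.classes_[0]))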
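
The loop above builds indicator columns and drops the first class so that it serves as the reference level. A roughly equivalent sketch with `pd.get_dummies` is shown here on a made-up `region` column. The level `MIDWEST` is an assumption: the notebook output only shows the three retained levels.

```python
import pandas as pd

# Hypothetical region column; "MIDWEST" as the dropped level is an assumption.
toy = pd.DataFrame(
    {"region": ["MIDWEST", "SOUTH", "WEST", "NORTHEAST", "SOUTH"]},
    index=["a", "b", "c", "d", "e"],
)

# drop_first=True removes the first level in sorted order, leaving it as the reference.
dummies = pd.get_dummies(toy["region"], prefix="bin", drop_first=True, dtype=int)
print(dummies.columns.tolist())
```

A row belonging to the reference level has zeros in every indicator column.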

In [14]:
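df_discrete = pd.concat(df_discrete, axis=1)
# Keep only numeric covariates, then attach CDC mortality by county code (index).
df = df.drop(columns=id_vars + discrete_vars + drop_vars)
df = df.merge(mort, how="left", right_index=True, left_index=True)
# Append the indicator columns built from the discrete variables.
df = pd.concat([df, df_discrete], axis=1)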
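
A left merge keeps every county in the study dataset, including those with no CDC record. One way to count the unmatched counties is the `indicator` option of `DataFrame.merge`. This is a sketch on hypothetical frames with made-up values, not a step from the notebook.

```python
import pandas as pd

# Hypothetical frames indexed by county code; values are made up for illustration.
left = pd.DataFrame({"qd_mean_pm25": [8.1, 9.4, 7.2]}, index=["01001", "01003", "01005"])
right = pd.DataFrame({"cdc_mortality_pct": [3.0, 1.5]}, index=["01001", "01005"])

merged = left.merge(right, how="left", left_index=True, right_index=True, indicator=True)

# Every row of `left` survives; rows without a match are flagged "left_only".
n_unmatched = int((merged["_merge"] == "left_only").sum())
print(len(merged), n_unmatched)
```

Unmatched counties end up with `NaN` in the mortality columns, so they need to be dropped or handled before modelling.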

In [16]:
df.columns

Index(['qd_mean_pm25', 'cs_poverty', 'cs_hispanic', 'cs_black', 'cs_white',
       'cs_native', 'cs_asian', 'cs_ed_below_highschool',
       'cs_household_income', 'cs_median_house_value', 'cs_other',
       'cs_population_density', 'cdc_mean_bmi', 'cdc_pct_cusmoker',
       'cdc_pct_sdsmoker', 'cdc_pct_fmsmoker', 'cdc_pct_nvsmoker',
       'cdc_pct_nnsmoker', 'gmet_mean_tmmn', 'gmet_mean_summer_tmmn',
       'gmet_mean_winter_tmmn', 'gmet_mean_tmmx', 'gmet_mean_summer_tmmx',
       'gmet_mean_winter_tmmx', 'gmet_mean_rmn', 'gmet_mean_summer_rmn',
       'gmet_mean_winter_rmn', 'gmet_mean_rmx', 'gmet_mean_summer_rmx',
       'gmet_mean_winter_rmx', 'gmet_mean_sph', 'gmet_mean_summer_sph',
       'gmet_mean_winter_sph', 'cms_mortality_pct', 'cms_white_pct',
       'cms_black_pct', 'cms_others_pct', 'cms_hispanic_pct', 'cms_female_pct',
       'Notes', 'County', 'Deaths', 'Population', 'Crude Rate',
       'cdc_mortality_pct', 'bin_NORTHEAST', 'bin_SOUTH', 'bin_WEST'],
      dtype='object')

In [17]:
df.to_csv("../data/air_pollution_mortality.csv", index=False)
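
With `index=False`, the row index (the county codes used for the merge) is not written to the file. If county identifiers are needed downstream, one option is to write the index under a label and read it back as a string. This is a sketch on a made-up two-row table, using an in-memory buffer in place of a file; the label `FIPS` is an assumption about what the index should be called.

```python
import io

import pandas as pd

# Hypothetical table indexed by county code; values are made up.
toy = pd.DataFrame({"qd_mean_pm25": [8.1, 9.4]}, index=["01001", "01003"])

buf = io.StringIO()
toy.to_csv(buf, index_label="FIPS")  # write the index as a labelled column
buf.seek(0)

# Reading the code as a string preserves leading zeros.
restored = pd.read_csv(buf, dtype={"FIPS": str}).set_index("FIPS")
print(restored.index.tolist())
```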