# Overview

The purpose of this notebook is to combine data from different sources and clean it. Two data sets are saved at the end:
* **Dataset 1:** Population & diabetes for all counties, 2004 to 2017. This data will only be used for exploring nationwide trends. 
* **Dataset 2:** Diabetes and census demographic (ACS) data for counties with popultation >65,000, 2006 to 2017. This data set will be used for the predictive model.



**Notes**
* All data from Puerto Rico has been removed from both data sets due to a lack of explanatory data (education level & age ratio). 
* Counties missing four or more years worth of ACS data (due to changing population numbers, either over or under the 65,000 population limit) are removed from dataset 2. 
* In any counties missing three or less years worth of data, the data is interpolated or backfilled if no points exist preceeding that year. 
* Much of the counties have null values for the pacific islander race for all years; if so, the null values are replaced with 0.


# Code Navigation
* [1. Load Packages & Data](#1.LoadPackages&Data)
* [2. All Counties](#2.AllCounties)
    * [2.1. Prep: Population Data](#2.1.Prep:PopulationData)
    * [2.2. Prep: Age Ratio Data](#2.2.Prep:AgeRatioData)
    * [2.3. Prep: Diabetes Data](#2.3.Prep:DiabetesData)
    * [2.4. Combine: All County Data](#2.4.Combine:AllCountyData)
* [3. Populous Counties + ACS](#3.PopulousCounties+ACS)
    * [3.1. Prep: ACS Demographic Data](#3.1.Prep:ACSDemographicData)
    * [3.2. Combine Demographic & Diabetes Data](#3.2.Combine:Demographic&DiabetesData)
* [4. Saving Data](#4.SavingData)

## 1. Load Packages & Data <a class="anchor" id="1.LoadPackages&Data"></a>

In [1]:
import requests
import pandas as pd
import csv
import os
import sys
from glob import glob
import numpy as np
import matplotlib.pyplot as plt
import geopandas as gpd
from tqdm import tqdm

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

#Change directory to project root directory
os.chdir("..")

#Import custom code
from src.code_flow.CreateJupyterNotebookHeadings import jupyternotebookheadings1, jupyternotebookheadings2

In [2]:
#Read in csv from raw data folder
df_diabetes=pd.read_csv('data/raw/diabetes_data_2004_2017.csv',index_col=0)
df_census=pd.read_csv('data/raw/ACS_data_2006_2019.csv',index_col=0)
df_pop=pd.read_csv('data/raw/population_est_2000_2019.csv',index_col=0)
df_pop_age=pd.read_csv('data/raw/population_age_ratio_2010.csv',index_col=0)

## 2. All Counties <a class="anchor" id="2.AllCounties"></a>

### 2.1. Prep: Population Data <a class="anchor" id="2.1.Prep:PopulationData"></a>

In [3]:
#Obtain year from date data, convert to an integer
df_pop.year=pd.to_datetime(df_pop.year).dt.year.astype('int64')

#Convert the value column to type float
df_pop.value=df_pop.value.astype('float64')

#List any duplicates - head & tail
df_pop[df_pop.duplicated(subset=['county','state','year'])].head(5)

Unnamed: 0,county,state,value,state_fips,county_fips,year,variable
1,Aroostook County,Maine,73872.0,23,3,2000,population
13,Cumberland County,Maine,266109.0,23,5,2000,population
25,Franklin County,Maine,29498.0,23,7,2000,population
37,Hancock County,Maine,51967.0,23,9,2000,population
49,Kennebec County,Maine,117177.0,23,11,2000,population


In [4]:
df_pop[df_pop.duplicated(subset=['county','state','year'])].tail(5)

Unnamed: 0,county,state,value,state_fips,county_fips,year,variable
38617,Yabucoa Municipio,Puerto Rico,37941.0,72,151,2010,population
38618,Yabucoa Municipio,Puerto Rico,37874.0,72,151,2010,population
38628,Yauco Municipio,Puerto Rico,42043.0,72,153,2010,population
38629,Yauco Municipio,Puerto Rico,41947.0,72,153,2010,population
38630,Yauco Municipio,Puerto Rico,41828.0,72,153,2010,population


In [5]:
#These are duplicate entries from 2000 & 2010 from the census data, drop them
df_pop.drop_duplicates(subset=['county','state','year'],keep='first',inplace=True)

#Remove population data between 2000-2003, 2018 & 2019 
#Since there's no other data to match from the CDC
df_pop=df_pop[~(df_pop.year.isin([2000,2001,2002,2003,2018,2019]))]

#Remove preceeding white spaces from state names
df_pop.state=df_pop.state.str.strip()

In [6]:
#Shannon County SD was renamed Oglala Lakota county in 2013 - will fix in all the data sets

#Rename any Shannon County entries as Oglala Lakota County & rename county fips
cond_s=df_pop.county=='Shannon County'
cond_o=df_pop.county=='Oglala Lakota County'
cond_sd=df_pop.state=='South Dakota'
df_pop.loc[(cond_s)&(cond_sd),'county_fips']=102
df_pop.loc[(cond_s)&(cond_sd),'county']='Oglala Lakota County'
#Pop one of the duplicate values
ind=df_pop[(cond_o) & (df_pop.year==2010)].index[0]
df_pop=df_pop.drop(ind).reset_index(drop=True)
#Ensure there's only one entry for 2010
cond_o=df_pop.county=='Oglala Lakota County'
df_pop[(cond_o) & (df_pop.year==2010)]

Unnamed: 0,county,state,value,state_fips,county_fips,year,variable
8910,Oglala Lakota County,South Dakota,13586.0,46,102,2010,population


### 2.2. Prep: Age Ratio Data <a class="anchor" id="2.2.Prep:AgeRatioData"></a>

In [7]:
#Rename columns
col_name={'STATE':'state_fips','COUNTY':'county_fips','STNAME':'state','CTYNAME':'county'}
df_pop_age.rename(columns=col_name,inplace=True)

In [8]:
#Rename Shannon County entry as Oglala Lakota County, change county fips
cond_s=df_pop_age.county=='Shannon County'
cond_sd=df_pop_age.state=='South Dakota'
df_pop_age.loc[(cond_s)&(cond_sd),'county_fips']=102
df_pop_age.loc[(cond_s)&(cond_sd),'county']='Oglala Lakota County'

In [9]:
#Merge with all population data
id_=['state_fips','county_fips']
df_pop=df_pop.merge(df_pop_age.drop(columns=['state','county']),right_on=id_,left_on=id_,how='left')

In [10]:
df_pop_age.head()

Unnamed: 0,state_fips,county_fips,state,county,adult_pop_ratio
0,1,1,Alabama,Autauga County,0.704971
1,1,3,Alabama,Baldwin County,0.74686
2,1,5,Alabama,Barbour County,0.757141
3,1,7,Alabama,Bibb County,0.749114
4,1,9,Alabama,Blount County,0.728592


In [11]:
df_pop.head()

Unnamed: 0,county,state,value,state_fips,county_fips,year,variable,adult_pop_ratio
0,Aroostook County,Maine,72959.0,23,3,2004,population,0.774986
1,Aroostook County,Maine,72881.0,23,3,2005,population,0.774986
2,Aroostook County,Maine,72827.0,23,3,2006,population,0.774986
3,Aroostook County,Maine,72711.0,23,3,2007,population,0.774986
4,Aroostook County,Maine,72542.0,23,3,2008,population,0.774986


In [12]:
#Calculate the adult population & drop the ratio column
df_pop['adult_pop']=round(df_pop.value*df_pop.adult_pop_ratio,0)
df_pop.drop(columns=['adult_pop_ratio'],inplace=True)

#Rename the value column, print a sample of the dataframe
df_pop.rename(columns={'value':'total_population'},inplace=True)
df_pop.sort_values(['state','county','year']).head(5)

Unnamed: 0,county,state,total_population,state_fips,county_fips,year,variable,adult_pop
13107,Autauga County,Alabama,48366.0,1,1,2004,population,34097.0
13108,Autauga County,Alabama,49676.0,1,1,2005,population,35020.0
13109,Autauga County,Alabama,51328.0,1,1,2006,population,36185.0
13110,Autauga County,Alabama,52405.0,1,1,2007,population,36944.0
13111,Autauga County,Alabama,53277.0,1,1,2008,population,37559.0


### 2.3. Prep: Diabetes Data <a class="anchor" id="2.3.Prep:DiabetesData"></a>

In [13]:
#Recast CountyFIPS as a string
df_diabetes.CountyFIPS=df_diabetes.CountyFIPS.astype(str)

#Create a county_fips column that is just the county fips number, not the state to match the census data
df_diabetes['county_fips']=df_diabetes.CountyFIPS.apply(lambda x: x[-3:])
def fips_state(x):
    """This function takes the combo state/county fips string and returns the state fips number"""
    if len(x)==5:
        return x[:2]
    else:
        return x[0]
    
#Do the same for state fips
df_diabetes['state_fips']=df_diabetes.CountyFIPS.apply(fips_state)
df_diabetes.drop(columns='CountyFIPS',inplace=True)

#Recast fips columns as integers
df_diabetes.county_fips=df_diabetes.county_fips.astype('int64')
df_diabetes.state_fips=df_diabetes.state_fips.astype('int64')

#Rename the columns to match the census data style
df_diabetes.rename(columns={'County':'county','State':'state','Percentage':'diabetes_%','Year':'year',
                           ' Upper Limit':'diabetes_%_upper','Lower Limit':'diabetes_%_lower'},inplace=True)
#Melt the dataframe
df_diabetes=pd.melt(df_diabetes,id_vars=['county','state','year','county_fips','state_fips'],
                    value_name='value',var_name='variable')

#Convert the value from string to numeric type
df_diabetes.value=pd.to_numeric(df_diabetes.value,errors='coerce')

#Reorder diabetes df columns to match others
df_diabetes=df_diabetes[['county','state','value','state_fips','county_fips','year','variable']]

#Make all county names completely lowercase
df_diabetes.county=df_diabetes.county.str.lower()

#Print a sample of the dataframe
df_diabetes.head()

Unnamed: 0,county,state,value,state_fips,county_fips,year,variable
0,autauga county,Alabama,10.1,1,1,2004,diabetes_%
1,autauga county,Alabama,11.5,1,1,2005,diabetes_%
2,autauga county,Alabama,11.0,1,1,2006,diabetes_%
3,autauga county,Alabama,11.2,1,1,2007,diabetes_%
4,autauga county,Alabama,10.9,1,1,2008,diabetes_%


### 2.4. Combine: All County Data <a class="anchor" id="2.4.CombiningAllCountyData"></a>

In [14]:
#Obtain just diabetes prevelance data and name value column after it
df_diabetes_=df_diabetes[df_diabetes.variable=='diabetes_%'].drop(columns='variable').rename(columns={'value':'diabetes_%'})
#Resolve Oglala Lakota County issues
cond_s=df_diabetes_.county=='shannon county'
cond_o=df_diabetes_.county=='oglala lakota county'
cond_sd=df_diabetes_.state=='South Dakota'
df_diabetes_.loc[(cond_s)&(cond_sd),'county_fips']=102
df_diabetes_.loc[(cond_s)&(cond_sd),'county']='oglala lakota county'
#Drop any rows that contain empty values for Oglala Lakota County
index_val=df_diabetes_[(cond_o) & (cond_sd)][df_diabetes_[(cond_o) & (cond_sd)]['diabetes_%'].isna()].index.values
df_diabetes_=df_diabetes_.drop(index_val).reset_index(drop=True)
#Doublecheck
cond_o=df_diabetes_.county=='oglala lakota county'
cond_sd=df_diabetes_.state=='South Dakota'
index_val=df_diabetes_[(cond_o) & (cond_sd)][df_diabetes_[(cond_o) & (cond_sd)]['diabetes_%'].isna()].index.values
df_diabetes_=df_diabetes_.drop(index_val).reset_index(drop=True)

In [15]:
#Merge diabetes & all county population data 
id_=['state_fips','county_fips','year']
df_pop_diabetes=df_diabetes_.merge(df_pop.drop(columns=['state','county']),right_on=id_,left_on=id_,how='right')

#Inspect df for missing diabetes values
df_pop_diabetes.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 45089 entries, 0 to 45088
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   county            45089 non-null  object 
 1   state             45089 non-null  object 
 2   diabetes_%        45060 non-null  float64
 3   state_fips        45089 non-null  int64  
 4   county_fips       45089 non-null  int64  
 5   year              45089 non-null  int64  
 6   total_population  45089 non-null  float64
 7   variable          45089 non-null  object 
 8   adult_pop         43989 non-null  float64
dtypes: float64(3), int64(3), object(3)
memory usage: 3.4+ MB


In [16]:
#Make a list of counties missing diabetes prevelance data
missing_counties=list(df_diabetes_[df_diabetes_['diabetes_%'].isna()].county.unique())
missing_counties

['hoonah-angoon census area',
 'kusilvak census area',
 'petersburg census area',
 'prince of wales - outer ketchikan',
 'prince of wales-hyder censu',
 'skagway municipality',
 'skagway-hoonah-angoon',
 'wade hampton census area',
 'wrangell city and borough',
 'wrangell-petersburg census area',
 'bedford city']

In [17]:
#Remove them from dataframe
df_pop_diabetes=df_pop_diabetes[~df_pop_diabetes.county.isin(missing_counties)]

#Verify there are no missing diabetes values
df_pop_diabetes[df_pop_diabetes['diabetes_%'].isna()]

Unnamed: 0,county,state,diabetes_%,state_fips,county_fips,year,total_population,variable,adult_pop


In [18]:
#Drop Puerto Rico from this data set - don't have age ratio
df_pop_diabetes=df_pop_diabetes[~(df_pop_diabetes.state_fips==72)]

#Ensure there are no missing values
df_pop_diabetes.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 43904 entries, 0 to 44542
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   county            43904 non-null  object 
 1   state             43904 non-null  object 
 2   diabetes_%        43904 non-null  float64
 3   state_fips        43904 non-null  int64  
 4   county_fips       43904 non-null  int64  
 5   year              43904 non-null  int64  
 6   total_population  43904 non-null  float64
 7   variable          43904 non-null  object 
 8   adult_pop         43904 non-null  float64
dtypes: float64(3), int64(3), object(3)
memory usage: 3.3+ MB


In [19]:
#Calculate number of adults with diabetes per county per year
df_pop_diabetes['adult_diabetes_pop']=round(df_pop_diabetes.adult_pop*(df_pop_diabetes['diabetes_%']/100))

#Reorder Columns
df_pop_diabetes=df_pop_diabetes[['county', 'state','state_fips', 'county_fips', 'year','total_population', 'adult_pop', 'diabetes_%','adult_diabetes_pop']]

#Print a sample from the dataframe
df_pop_diabetes.head()

Unnamed: 0,county,state,state_fips,county_fips,year,total_population,adult_pop,diabetes_%,adult_diabetes_pop
0,aroostook county,Maine,23,3,2004,72959.0,56542.0,8.7,4919.0
1,aroostook county,Maine,23,3,2005,72881.0,56482.0,9.3,5253.0
2,aroostook county,Maine,23,3,2006,72827.0,56440.0,10.2,5757.0
3,aroostook county,Maine,23,3,2007,72711.0,56350.0,10.2,5748.0
4,aroostook county,Maine,23,3,2008,72542.0,56219.0,10.5,5903.0


## 3. Populous Counties + ACS <a class="anchor" id="3.PopulousCounties+ACS"></a>

### 3.1. Prep: ACS Demographic Data <a class="anchor" id="3.1.Prep:ACSDemographicData"></a>

In [20]:
#Print out any duplicate entires 
df_census[df_census.duplicated(subset=['state','county','year','variable'],keep=False)].sort_values('county').head(5)

Unnamed: 0,county,state,value,state_fips,county_fips,year,variable
180,Ada County,Idaho,359035.0,16,1,2006,total_pop
180,Ada County,Idaho,359035.0,16,1,2006,total_pop
84,Adams County,Colorado,414338.0,8,1,2006,total_pop
186,Adams County,Illinois,67221.0,17,1,2006,total_pop
563,Adams County,Pennsylvania,101105.0,42,1,2006,total_pop


In [21]:
#These duplicates are hold over from the api call where population from 2006 was called twice 
#Will drop them now
df_census=df_census.drop_duplicates(subset=['state','county','year','variable'])

In [22]:
#Fix county in New Mexico with misspelling
df_census.loc[(df_census.county_fips==13) & (df_census.state_fips==35),'county']='Doña Ana County'

#Remove Preceeding white spaces from state names
df_census.state=df_census.state.str.strip()

#Make all county names completely lowercase
df_census.county=df_census.county.str.lower()

#Remove any apostrophes
df_census.county=df_census.county.str.replace("'","")

### 3.2. Combine: Demographic & Diabetes Data <a class="anchor" id="3.2.Combine:Demographic&DiabetesData"></a>

In [23]:
#Ensure that diabetes & census dataframes have same datatype
df_census.dtypes==df_diabetes.dtypes

county         True
state          True
value          True
state_fips     True
county_fips    True
year           True
variable       True
dtype: bool

In [24]:
#Save 2004 & 2005 data to its own df, merge again later
df_diabetes=df_diabetes[df_diabetes.year>2005]

In [25]:
#Merge all dataframes into one
df=pd.concat([df_diabetes,df_census])
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 369114 entries, 2 to 839
Data columns (total 7 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   county       369114 non-null  object 
 1   state        369114 non-null  object 
 2   value        368583 non-null  float64
 3   state_fips   369114 non-null  int64  
 4   county_fips  369114 non-null  int64  
 5   year         369114 non-null  int64  
 6   variable     369114 non-null  object 
dtypes: float64(1), int64(3), object(3)
memory usage: 22.5+ MB


In [26]:
#Find any duplicates county/state, year & variable entries
df[df.duplicated(subset=['year','county_fips','state_fips','variable'],keep=False)].sort_values(['county_fips','state_fips','year','variable'])

Unnamed: 0,county,state,value,state_fips,county_fips,year,variable


In [27]:
#Pivot table into wide form
df_=pd.pivot(df,index=['state','county','county_fips','state_fips','year'],columns=['variable'],values='value')
df_.reset_index(inplace=True)

#Inspect for missing values
df_.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40390 entries, 0 to 40389
Data columns (total 30 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   state               40390 non-null  object 
 1   county              40390 non-null  object 
 2   county_fips         40390 non-null  int64  
 3   state_fips          40390 non-null  int64  
 4   year                40390 non-null  int64  
 5   20_24_years         11499 non-null  float64
 6   25_34_years         11499 non-null  float64
 7   35_44_years         11499 non-null  float64
 8   45_54_years         11499 non-null  float64
 9   55_59_years         11499 non-null  float64
 10  60_64_years         11499 non-null  float64
 11  65_74_years         11499 non-null  float64
 12  75_84_years         11499 non-null  float64
 13  85_plus_years       11499 non-null  float64
 14  amer_indian_pop     11499 non-null  float64
 15  asian_pop           11499 non-null  float64
 16  blac

In [28]:
#Make combined county/state column
df_['county_state']=df_.county+', '+df_.state

In [29]:
#Search for counties that are missing some entries in the demographic data (not all)
df_na_less=df_.dropna(subset=['total_pop'])
acs_count=pd.DataFrame(df_na_less.groupby('county_state').state.count())

#Isolate counties missing at least 4 of the 14 years worth of total population data from ACS
drop_counties=list(acs_count[acs_count.state<10].index)
acs_count[acs_count.state<10]

Unnamed: 0_level_0,state
county_state,Unnamed: 1_level_1
"aguadilla municipio, Puerto Rico",4
"athens county, Ohio",5
"blue earth county, Minnesota",8
"boone county, Indiana",3
"broomfield county, Colorado",5
"catoosa county, Georgia",8
"chatham county, North Carolina",8
"crow wing county, Minnesota",1
"franklin county, North Carolina",3
"jackson county, Georgia",3


In [30]:
#Drop counties from dataframe with at least 4 ACS data years missing
df_=df_[~(df_.county_state.isin(drop_counties))]

In [31]:
df_.head(100)

variable,state,county,county_fips,state_fips,year,20_24_years,25_34_years,35_44_years,45_54_years,55_59_years,60_64_years,65_74_years,75_84_years,85_plus_years,amer_indian_pop,asian_pop,black_pop,diabetes_%,diabetes_%_lower,diabetes_%_upper,education_bach,education_hs,female,hispanic_pop,male,median_income,other_pop,pacific_island_pop,total_pop,white_pop,county_state
0,Alabama,autauga county,1,1,2006,,,,,,,,,,,,,11.0,8.5,13.9,,,,,,,,,,,"autauga county, Alabama"
1,Alabama,autauga county,1,1,2007,,,,,,,,,,,,,11.2,8.6,14.1,,,,,,,,,,,"autauga county, Alabama"
2,Alabama,autauga county,1,1,2008,,,,,,,,,,,,,10.9,8.5,13.5,,,,,,,,,,,"autauga county, Alabama"
3,Alabama,autauga county,1,1,2009,,,,,,,,,,,,,11.7,9.2,14.7,,,,,,,,,,,"autauga county, Alabama"
4,Alabama,autauga county,1,1,2010,,,,,,,,,,,,,11.2,8.8,13.9,,,,,,,,,,,"autauga county, Alabama"
5,Alabama,autauga county,1,1,2011,,,,,,,,,,,,,11.3,8.9,14.1,,,,,,,,,,,"autauga county, Alabama"
6,Alabama,autauga county,1,1,2012,,,,,,,,,,,,,11.1,8.8,13.8,,,,,,,,,,,"autauga county, Alabama"
7,Alabama,autauga county,1,1,2013,,,,,,,,,,,,,11.9,9.6,14.6,,,,,,,,,,,"autauga county, Alabama"
8,Alabama,autauga county,1,1,2014,,,,,,,,,,,,,11.4,9.3,13.9,,,,,,,,,,,"autauga county, Alabama"
9,Alabama,autauga county,1,1,2015,,,,,,,,,,,,,13.0,10.2,16.1,,,,,,,,,,,"autauga county, Alabama"


In [32]:
#Fill in missing values with nan, interpolate & backfill data by county
for col in list(df_.columns):
    if (df_[col].dtype!='object'):
        df_.loc[df_[col]<0,col]=np.nan
    df_[col]=df_.groupby('county_state')[col].apply(lambda group: group.interpolate())
    df_[col]=df_.groupby('county_state')[col].fillna(method='bfill')

#Drop any rows missing total population data
df_.dropna(subset=['total_pop'],inplace=True)

#Recast year as an integer
df.year=df.year.astype(int)

#Ensure that the diabetes values are marked as NaN for 2018 & 2019 (these are to be estimated later and are not in the 
#CDC data, but may have been interpolated by steps above)
df_.loc[df_.year==2018,'diabetes_%']=np.nan
df_.loc[df_.year==2019,'diabetes_%']=np.nan
df_.loc[df_.year==2018,'diabetes_%_upper']=np.nan
df_.loc[df_.year==2019,'diabetes_%_upper']=np.nan
df_.loc[df_.year==2018,'diabetes_%_lower']=np.nan
df_.loc[df_.year==2019,'diabetes_%_lower']=np.nan

#Remove all Puerto Rico data because there is no education data for any counties or years
df_=df_[~(df_.state=='Puerto Rico')]

#Fill missing Pacific Island pop with 0
df_.pacific_island_pop.fillna(value=0,inplace=True)

#Inspect data for missing values
df_.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11296 entries, 12 to 40269
Data columns (total 31 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   state               11296 non-null  object 
 1   county              11296 non-null  object 
 2   county_fips         11296 non-null  float64
 3   state_fips          11296 non-null  float64
 4   year                11296 non-null  float64
 5   20_24_years         11296 non-null  float64
 6   25_34_years         11296 non-null  float64
 7   35_44_years         11296 non-null  float64
 8   45_54_years         11296 non-null  float64
 9   55_59_years         11296 non-null  float64
 10  60_64_years         11296 non-null  float64
 11  65_74_years         11296 non-null  float64
 12  75_84_years         11296 non-null  float64
 13  85_plus_years       11296 non-null  float64
 14  amer_indian_pop     11296 non-null  float64
 15  asian_pop           11296 non-null  float64
 16  bla

## 4. Saving Data <a class="anchor" id="4.SavingData"></a>

In [33]:
#Write all county population & diabetes data to a csv file
df_pop_diabetes.to_csv('data/interim/population_diabetes_allcounties_2004_2017.csv')

#Write demographic & diabetes data for populous counties to a csv file
df_.to_csv('data/interim/diabetes_ACS_populous_counties_2004to2019.csv')