# Capstone Project - 03 - Feature Modification, Selection & Engineering

## Introduction

In this notebook, I adjust the features in preparation for modeling. Following the EDA, I have an idea of which features will be most predictive. I must first address issues such as null values and the mismatch between the counties in the Food Environment Atlas and those in the Diabetes Atlas. I also add the mental health information to the Health dataset, create classes for the target diabetes values, and add the target values to all datasets for modeling.

## Data Import

In [302]:
import pandas as pd
import numpy as np

In [303]:
#imports data

file_path = '../data/DiabetesAtlasData_CLEAN.csv'

d18 = pd.read_csv(file_path)

In [304]:
#imports data

file_path = '../data/FoodEnvironmentAtlas.xls.HEALTH_CLEAN.csv'

health = pd.read_csv(file_path)

## Reconciling Differences between Food Environment Atlas & Diabetes Atlas Counties

Based on the results of EDA, the only columns in the Health dataset that will be used in modeling are the recreation facility columns. The recreation facility counts and rates do not contain null values, although the two percent-change columns do.

However, I will need to add in the target values - those in the Diabetes Atlas 2018 dataset.

First, I need to ensure that the counties match. This will be addressed below.

In [305]:
health.isnull().sum()

fips                       0
state                      0
county                     0
pct_diabetes_adults08      5
pct_diabetes_adults13      1
pct_obese_adults12         0
pct_obese_adults17         0
pct_hspa17               760
recfac11                   0
recfac16                   0
pch_recfac_11_16         143
recfacpth11                0
recfacpth16                0
pch_recfacpth_11_16      143
dtype: int64

In [306]:
health.shape

(3143, 14)

In [307]:
d18.isnull().sum()

fips              0
county            0
state             0
pct_diabetes18    0
dtype: int64

In [308]:
d18.shape

(3141, 4)

Ultimately, it will not be possible to predict the prevalence of diabetes for any county that is not in the Food Environment Atlas.

I will begin by confirming the two counties that are in the Diabetes Atlas but not in the Food Environment Atlas:

46102 South Dakota    Oglala Lakota 

2158  Alaska          Kusilvak Census Area
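
Mismatches like these can also be located programmatically. The following is a minimal sketch, assuming two small stand-in frames (`health_demo` and `d18_demo` are hypothetical names that hold only a few of the South Dakota rows discussed in this notebook) rather than the full datasets:

```python
import pandas as pd

# Hypothetical miniature versions of the two datasets (a few SD rows only)
health_demo = pd.DataFrame({
    'fips': [46101, 46103, 46113],
    'county': ['Moody', 'Pennington', 'Shannon'],
})
d18_demo = pd.DataFrame({
    'fips': [46101, 46102, 46103],
    'county': ['Moody', 'Oglala Lakota', 'Pennington'],
})

# Outer merge on fips; the indicator column records which side each row came from
merged = health_demo.merge(d18_demo, on='fips', how='outer',
                           suffixes=('_health', '_d18'), indicator=True)

# fips codes present in only one of the two frames
only_in_health = merged.loc[merged['_merge'] == 'left_only', 'fips'].tolist()
only_in_d18 = merged.loc[merged['_merge'] == 'right_only', 'fips'].tolist()

print(only_in_health)
print(only_in_d18)
```

Applied to the full `health` and `d18` dataframes, the same pattern should list every fips code that appears in one source but not the other.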


##### Oglala Lakota County, SD

In [309]:
d18.loc[d18['fips'] == 46102]

Unnamed: 0,fips,county,state,pct_diabetes18
2411,46102,Oglala Lakota,South Dakota,17.9


This county does appear in the Diabetes Atlas.

In [310]:
health.loc[health['fips'] == 46102]

Unnamed: 0,fips,state,county,pct_diabetes_adults08,pct_diabetes_adults13,pct_obese_adults12,pct_obese_adults17,pct_hspa17,recfac11,recfac16,pch_recfac_11_16,recfacpth11,recfacpth16,pch_recfacpth_11_16


It appears that this fips code is not within the Food Environment Atlas.

I will search more closely to confirm...

In [311]:
health.loc[health['fips'] == 46101]

Unnamed: 0,fips,state,county,pct_diabetes_adults08,pct_diabetes_adults13,pct_obese_adults12,pct_obese_adults17,pct_hspa17,recfac11,recfac16,pch_recfac_11_16,recfacpth11,recfacpth16,pch_recfacpth_11_16
2411,46101,SD,Moody,8.3,11.0,28.1,31.9,,0,0,0.0,0.0,0.0,0.0


FIPS code 46101 is in the dataset, and it is in South Dakota, like our first county of interest, Oglala Lakota.

The index of FIPS 46101 is 2411. Because the FIPS codes occur in ascending order, Oglala Lakota should be at index 2412. I will check to see what is there.

In [312]:
health.iloc[2412]

fips                          46103
state                            SD
county                   Pennington
pct_diabetes_adults08           7.5
pct_diabetes_adults13           8.4
pct_obese_adults12             28.1
pct_obese_adults17             31.9
pct_hspa17                      NaN
recfac11                         14
recfac16                         16
pch_recfac_11_16          14.285714
recfacpth11                0.136743
recfacpth16                 0.14692
pch_recfacpth_11_16        7.442403
Name: 2412, dtype: object

It appears that FIPS 46102 was skipped, and FIPS 46103 (Pennington, South Dakota) is there instead.

To determine what may have happened here, I went to the following website: https://oglalalakota.sdcounties.org/.

Among other things, I learned that Oglala Lakota County was previously called Shannon County and was renamed in May 2015.

The absence of Shannon County, SD in the 2018 Diabetes Atlas was noticed during previous EDA. I will look more closely...

In [313]:
health.loc[(health['county'] == 'Shannon') & (health['state'] == 'SD')]

Unnamed: 0,fips,state,county,pct_diabetes_adults08,pct_diabetes_adults13,pct_obese_adults12,pct_obese_adults17,pct_hspa17,recfac11,recfac16,pch_recfac_11_16,recfacpth11,recfacpth16,pch_recfacpth_11_16
2417,46113,SD,Shannon,12.4,15.8,28.1,31.9,,0,0,0.0,0.0,0.0,0.0


Shannon, SD is within the Food Environment Atlas under a different name. 

I confirmed on the National Association of Counties website (https://www.naco.org/articles/shannon-county-sd-be-renamed-oglala-lakota-county) that this county was renamed Oglala Lakota in 2015.

Much of the data within the (most recent) Food Environment Atlas is from years prior to 2015, so the difference makes sense. 

Additionally, within each state the fips codes appear to follow the alphabetical order of the county names, so it makes sense that a county with a different name would receive a different fips code.

The question now is how to reconcile this difference within the two datasets...

According to the NACo article above, renaming the county meant a lot to its residents. I will rename the county in the Food Environment Atlas to stay current and to respect their wishes.

I will return to the spreadsheet versions of the Food Environment Atlas to make this change, so that I can easily apply it to each dataset while also correcting the index order.

This has been completed. I will make any other revisions before displaying the results below.
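
The correction itself was made in the spreadsheet, but the same change could be expressed in pandas. The following is only a sketch of that alternative, assuming a small stand-in frame (`health_demo` and the `recodes` dictionary are hypothetical names) with three of the South Dakota rows shown above:

```python
import pandas as pd

# Hypothetical subset of the Health data before the correction
health_demo = pd.DataFrame({
    'fips': [46101, 46103, 46113],
    'state': ['SD', 'SD', 'SD'],
    'county': ['Moody', 'Pennington', 'Shannon'],
})

# Old fips code -> (current fips code, current county name)
recodes = {46113: (46102, 'Oglala Lakota')}

for old_fips, (new_fips, new_name) in recodes.items():
    mask = health_demo['fips'] == old_fips   # rows still using the old code
    health_demo.loc[mask, 'county'] = new_name
    health_demo.loc[mask, 'fips'] = new_fips

# Restore ascending fips order and rebuild the index
health_demo = health_demo.sort_values('fips').reset_index(drop=True)
print(health_demo)
```

Keeping the recodes in one dictionary would make it straightforward to apply the same correction to each Food Environment Atlas dataset.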



#####  Kusilvak Census Area, AK

I went to the Alaska page on the census.gov website, https://www.census.gov/geographies/reference-files/2010/geo/state-local-geo-guides-2010/alaska.html, which reports the following:

"Alaska has 11 statistical entities called “census areas.” Census areas are statistical areas established in cooperation with state government for reporting data in the portion of the state outside any borough."

This may again be a case of a county or census area that is simply named differently in the two sources.

I will look more closely below.

In [314]:
d18.loc[d18['county'] == 'Kusilvak Census Area']

Unnamed: 0,fips,county,state,pct_diabetes18
81,2158,Kusilvak Census Area,Alaska,7.4


This "county" is present within the Diabetes Atlas.

In [315]:
health.loc[health['fips'] == 2158]

Unnamed: 0,fips,state,county,pct_diabetes_adults08,pct_diabetes_adults13,pct_obese_adults12,pct_obese_adults17,pct_hspa17,recfac11,recfac16,pch_recfac_11_16,recfacpth11,recfacpth16,pch_recfacpth_11_16


It is confirmed that the fips code is not in the Health dataset.

In [316]:
health.loc[(health['county'] == 'Kusilvak Census Area') & (health['state']== 'AK')]

Unnamed: 0,fips,state,county,pct_diabetes_adults08,pct_diabetes_adults13,pct_obese_adults12,pct_obese_adults17,pct_hspa17,recfac11,recfac16,pch_recfac_11_16,recfacpth11,recfacpth16,pch_recfacpth_11_16


The Kusilvak Census Area name is also not in the Health dataset.

According to this news article, https://www.adn.com/alaska-news/article/governor-announces-new-name-alaska-census-area-named-confederate-officer/2015/07/02/, this census area was previously named Wade Hampton Census Area and was renamed in 2015.

This was another county that appeared in the Health dataset but not in the Diabetes Atlas during EDA. I will look at Wade Hampton Census Area below.

In [317]:
health.loc[(health['county'] == 'Wade Hampton') & (health['state'] == 'AK')]

Unnamed: 0,fips,state,county,pct_diabetes_adults08,pct_diabetes_adults13,pct_obese_adults12,pct_obese_adults17,pct_hspa17,recfac11,recfac16,pch_recfac_11_16,recfacpth11,recfacpth16,pch_recfacpth_11_16
92,2270,AK,Wade Hampton,6.4,4.6,25.7,34.2,18.4,0,0,0.0,0.0,0.0,0.0


The name does not include the "Census Area" part, but this appears to be the area in question. Before making the change, I will ensure that it is not listed in the Diabetes Atlas under its old fips code or name.

In [318]:
d18.loc[d18['fips'] == 2270]

Unnamed: 0,fips,county,state,pct_diabetes18


In [319]:
d18.loc[(d18['county'] == 'Wade Hampton') & (d18['state'] == 'AK')]

Unnamed: 0,fips,county,state,pct_diabetes18


It is confirmed that this county name and fips are not in the Diabetes Atlas.

I will update the entry in the Food Environment Atlas to its current fips code and name (2158, Kusilvak Census Area).

This change has been made. Only two counties remain in question, both of which appear in the Food Environment Atlas but not in the Diabetes Atlas:

35039	NM	Rio Arriba

51515	VA	Bedford

##### Rio Arriba, NM

First, I will ensure that these findings are correct:

In [320]:
health.loc[health['fips'] == 35039]

Unnamed: 0,fips,state,county,pct_diabetes_adults08,pct_diabetes_adults13,pct_obese_adults12,pct_obese_adults17,pct_hspa17,recfac11,recfac16,pch_recfac_11_16,recfacpth11,recfacpth16,pch_recfacpth_11_16
1816,35039,NM,Rio Arriba,6.4,8.8,27.1,28.4,30.8,2,1,-50.0,0.049708,0.025538,-48.624802


The referenced county is within the Health dataset.

In [321]:
d18.loc[d18['fips'] == 35039]

Unnamed: 0,fips,county,state,pct_diabetes18


However, the referenced county does not seem to be in the Diabetes Atlas (at least, its fips code is not). I will search for the county by name to make sure.

In [322]:
d18.loc[d18['county'] == 'Rio Arriba']

Unnamed: 0,fips,county,state,pct_diabetes18


This county name does not appear to be within the d18 dataframe at all.

According to this [website](https://www.ereferencedesk.com/resources/counties/new-mexico/rio-arriba.html),
    
"Rio Arriba County comprises the Española, NM Micropolitan Statistical Area, which is also included in the Albuquerque-Santa Fe-Las Vegas, NM Combined Statistical Area." (E-ReferenceDesk.com)

I will look through the Diabetes Atlas New Mexico counties to see what I can find.

In [323]:
d18.loc[d18['state'] == 'New Mexico'] #shows counties in New Mexico

Unnamed: 0,fips,county,state,pct_diabetes18
1795,35001,Bernalillo,New Mexico,7.4
1796,35003,Catron,New Mexico,6.6
1797,35005,Chaves,New Mexico,9.8
1798,35006,Cibola,New Mexico,12.5
1799,35007,Colfax,New Mexico,8.0
1800,35009,Curry,New Mexico,9.9
1801,35011,De Baca,New Mexico,7.3
1802,35013,Doña Ana,New Mexico,9.4
1803,35015,Eddy,New Mexico,11.4
1804,35017,Grant,New Mexico,7.5


I do not see Albuquerque or Las Vegas in this list; however, I do see Santa Fe.

I will check to see if Santa Fe is listed as a NM county in the Health dataset.

In [324]:
health.loc[(health['county'] == 'Santa Fe') & (health['state'] == 'NM')]

Unnamed: 0,fips,state,county,pct_diabetes_adults08,pct_diabetes_adults13,pct_obese_adults12,pct_obese_adults17,pct_hspa17,recfac11,recfac16,pch_recfac_11_16,recfacpth11,recfacpth16,pch_recfacpth_11_16
1821,35049,NM,Santa Fe,4.0,4.7,27.1,28.4,30.8,17,18,5.882353,0.116885,0.121668,4.092395


Santa Fe County does appear to exist within the Health dataset. 

Additionally, this [website](https://datausa.io/profile/geo/rio-arriba-county-nm) suggests that Rio Arriba is still a current county name.

A quick glance through the 2016 [Diabetes Atlas](https://gis.cdc.gov/grasp/diabetes/diabetesatlas-sdoh.html#) table shows that Rio Arriba County is included in it, but it is missing from the 2018 version of the table.

However, it is included in the 2018 map version, where Rio Arriba County, NM is shown with a diagnosed diabetes prevalence of 9.4%. I will add the county along with its value.

I added the county in a spreadsheet. I will reload the corrected data as a dataframe below.

In [325]:
file_path = '../data/DiabetesAtlasData_REC.csv'

d18_rec = pd.read_csv(file_path)

In [326]:
d18_rec.head()

Unnamed: 0,fips,county,state,pct_diabetes18
0,1001,Autauga,Alabama,9.5
1,1003,Baldwin,Alabama,8.4
2,1005,Barbour,Alabama,13.5
3,1007,Bibb,Alabama,10.2
4,1009,Blount,Alabama,10.5


In [327]:
d18_rec.tail()

Unnamed: 0,fips,county,state,pct_diabetes18
3137,56037,Sweetwater,Wyoming,7.8
3138,56039,Teton,Wyoming,3.8
3139,56041,Uinta,Wyoming,8.4
3140,56043,Washakie,Wyoming,7.4
3141,56045,Weston,Wyoming,7.6


In [328]:
d18_rec.shape

(3142, 4)

In [329]:
d18_rec.loc[d18_rec['county'] == 'Rio Arriba'] #shows row containing Rio Arriba County

Unnamed: 0,fips,county,state,pct_diabetes18
1816,35039,Rio Arriba,New Mexico,9.4


This shows the county was entered correctly with the fips code and diabetes prevalence.

I will now move on to address the Bedford County, VA discrepancy.

##### Bedford County, VA

In [330]:
health.loc[(health['county'] == 'Bedford') & (health['state'] == 'VA')]

Unnamed: 0,fips,state,county,pct_diabetes_adults08,pct_diabetes_adults13,pct_obese_adults12,pct_obese_adults17,pct_hspa17,recfac11,recfac16,pch_recfac_11_16,recfacpth11,recfacpth16,pch_recfacpth_11_16
2829,51019,VA,Bedford,9.1,11.1,27.4,30.1,22.4,5,5,0.0,0.066173,0.064244,-2.914118
2916,51515,VA,Bedford,12.6,,27.4,30.1,22.4,0,0,0.0,0.0,0.0,0.0


Now I see the problem. 

There appear to be two entries named Bedford, VA in the Health dataset (and therefore, most likely, in the other Food Environment Atlas datasets). The entry with fips code 51515 is missing the 2013 diabetes prevalence value (this must be the one missing value in this column that has been referenced all along!).

I will observe the information regarding Bedford County, VA in the Diabetes Atlas.

In [331]:
d18_rec.loc[(d18_rec['county'] == 'Bedford') & (d18_rec['state'] == 'Virginia')]

Unnamed: 0,fips,county,state,pct_diabetes18
2830,51019,Bedford,Virginia,10.9


The fips code of 51019 is what the Diabetes Atlas has for Bedford, VA.

According to this [website](https://transition.fcc.gov/oet/info/maps/census/fips/fips.txt#:~:text=FIPS%20codes%20are%20numbers%20which,to%20which%20the%20county%20belongs),

51515 is the fips code for "Bedford City." 51019 is listed as Bedford County. 

I will look for 51515 and Bedford City within the Diabetes Atlas.

In [332]:
d18_rec[d18_rec['county'] == 'Bedford City']

Unnamed: 0,fips,county,state,pct_diabetes18


There is no such county within the Diabetes Atlas.

I will search for the fips code.

In [333]:
d18_rec[d18_rec['fips'] == 51515] #searches for Bedford City's fips code (fips is stored as an integer)

Unnamed: 0,fips,county,state,pct_diabetes18


There is no such fips code within the Diabetes Atlas.

It is confirmed, then, that the extra entry in the Food Environment Atlas is Bedford City. (The fips code list referenced above does note that it includes codes for independent cities.)

In searching for Bedford City, I found this [website](https://www.city-data.com/city/Bedford-Virginia.html), which contains "Food Environment Statistics" for Bedford. One of these statistics is prevalence of diabetes, which it shows at 8.6%. This matches the value shown for 51019 Bedford, not that of 51515 Bedford.

My thought is that I may need to remove Bedford City from the Food Environment Atlas, because I do not have information regarding prevalence of diabetes after 2008.

One option would be to take the 2008 and 2018 diabetes values for a reference group (say, the other Virginia counties), calculate the average percent change from 2008 to 2018, and apply it to Bedford City's 2008 value to produce an estimate. (Still, this would be risky, as much may have changed over the 10-year period.)
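
To make the arithmetic of that option concrete, here is a rough sketch. The Bedford County row uses the 2008 and 2018 values shown above; the other two rows (`County A`, `County B`) are made up purely for illustration, so the resulting estimate is not a real figure. All variable names are hypothetical.

```python
import pandas as pd

# Illustrative values only: the Bedford County row uses figures shown above,
# the other two rows are invented for the example
va_demo = pd.DataFrame({
    'county': ['Bedford', 'County A', 'County B'],
    'pct_diabetes_adults08': [9.1, 10.0, 8.0],
    'pct_diabetes18': [10.9, 11.0, 9.0],
})

# Percent change from 2008 to 2018 for each county
pct_change = ((va_demo['pct_diabetes18'] - va_demo['pct_diabetes_adults08'])
              / va_demo['pct_diabetes_adults08'] * 100)
avg_pct_change = pct_change.mean()

# Apply the average change to Bedford City's 2008 value (12.6)
bedford_city_08 = 12.6
bedford_city_18_est = bedford_city_08 * (1 + avg_pct_change / 100)
print(round(bedford_city_18_est, 1))
```

An estimate built this way assumes Bedford City changed at the same rate as the reference counties, which is exactly the risk noted above.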

I will consider this as I delve further into the Health dataset. First, I will confirm that the changes made above to the Health and target datasets were completed.

##### Confirmation of Changes

In [334]:
file_path = '../data/FoodEnvironmentAtlas.xls.HEALTH_REC.csv'

health_rec = pd.read_csv(file_path)

In [335]:
health_rec.iloc[2411:2414]

Unnamed: 0,fips,state,county,pct_diabetes_adults08,pct_diabetes_adults13,pct_obese_adults12,pct_obese_adults17,pct_hspa17,recfac11,recfac16,pch_recfac_11_16,recfacpth11,recfacpth16,pch_recfacpth_11_16
2411,46101,SD,Moody,8.3,11.0,28.1,31.9,,0,0,0.0,0.0,0.0,0.0
2412,46102,SD,Ogala Lakota,12.4,15.8,28.1,31.9,,0,0,0.0,0.0,0.0,0.0
2413,46103,SD,Pennington,7.5,8.4,28.1,31.9,,14,16,14.285714,0.136743,0.14692,7.442403


Oglala Lakota County appears in its correct location.

I will now ensure the Kusilvak Census Area is in its correct location:

In [336]:
health_rec.iloc[80:83]

Unnamed: 0,fips,state,county,pct_diabetes_adults08,pct_diabetes_adults13,pct_obese_adults12,pct_obese_adults17,pct_hspa17,recfac11,recfac16,pch_recfac_11_16,recfacpth11,recfacpth16,pch_recfacpth_11_16
80,2150,AK,Kodiak Island,6.2,6.6,25.7,34.2,18.4,0,0,0.0,0.0,0.0,0.0
81,2158,AK,Kusilvak Census Area,6.4,4.6,25.7,34.2,18.4,0,0,0.0,0.0,0.0,0.0
82,2164,AK,Lake and Peninsula,7.2,7.4,25.7,34.2,18.4,0,0,0.0,0.0,0.0,0.0


The county/area does appear in its correct location.

I will now ensure these counties are correct in the Diabetes Atlas as well.

In [337]:
d18_rec.iloc[2411:2414]

Unnamed: 0,fips,county,state,pct_diabetes18
2411,46101,Moody,South Dakota,7.1
2412,46102,Oglala Lakota,South Dakota,17.9
2413,46103,Pennington,South Dakota,8.2


Oglala Lakota County appears in its correct location.

In [338]:
d18_rec.iloc[80:83]

Unnamed: 0,fips,county,state,pct_diabetes18
80,2150,Kodiak Island Borough,Alaska,7.0
81,2158,Kusilvak Census Area,Alaska,7.4
82,2164,Lake and Peninsula Borough,Alaska,8.5


Kusilvak does appear in its correct location. I do notice some differences in how the Kodiak Island and Lake and Peninsula entries are named: the Diabetes Atlas includes "Borough" in the names.

However, the row numbers and fips codes match, so I will still be able to concatenate the dataframes.

I will now move into preparing each individual dataset for modeling.

First, I will create the classes for prediction: low and high prevalence of diabetes.

##### Creating classes

This CDC [report](https://www.cdc.gov/diabetes/pdfs/data/statistics/national-diabetes-statistics-report.pdf) states that the prevalence of diabetes among adults in 2018 was 13.0% (CDC, 2020).

According to the document, "Estimated percentages and total number of people with diabetes and prediabetes were derived from the National Health and Nutrition Examination Survey (NHANES), National Health Interview Survey (NHIS), IHS National Data Warehouse (NDW), Behavioral Risk Factor Surveillance System (BRFSS), United States Diabetes Surveillance System (USDSS), and US resident population estimates." (CDC, 2020)

The Diabetes Atlas values are also reported to come from the United States Diabetes Surveillance System.

However, the figure in the report is a percentage of the total U.S. adult population, not an average county prevalence. I will determine the average county prevalence below.

In [339]:
d18_rec.pct_diabetes18.mean()

8.724506683641012

I will use this as the baseline: values above this will be considered high. Values at or below this will be considered low.

In [340]:
d18_rec['class'] = np.nan

In [341]:
d18_rec.head()

Unnamed: 0,fips,county,state,pct_diabetes18,class
0,1001,Autauga,Alabama,9.5,
1,1003,Baldwin,Alabama,8.4,
2,1005,Barbour,Alabama,13.5,
3,1007,Bibb,Alabama,10.2,
4,1009,Blount,Alabama,10.5,


In [342]:
d18_rec.loc[d18_rec['pct_diabetes18'] <= d18_rec.pct_diabetes18.mean(), 'class' ] = 0

d18_rec.loc[d18_rec['pct_diabetes18'] > d18_rec.pct_diabetes18.mean(), 'class'] = 1

In [343]:
d18_rec['class'].value_counts(normalize=True)

0.0    0.58275
1.0    0.41725
Name: class, dtype: float64

The classes are reasonably balanced (roughly 58% low and 42% high), so I will leave them as they are.
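
The same labels can also be produced in a single step, without first creating an empty column. A minimal sketch, assuming a stand-in frame (`demo` is a hypothetical name) that holds the first five rows shown above and uses the county-level mean already computed as the threshold:

```python
import pandas as pd

# First five rows of the Diabetes Atlas data shown above
demo = pd.DataFrame({
    'county': ['Autauga', 'Baldwin', 'Barbour', 'Bibb', 'Blount'],
    'pct_diabetes18': [9.5, 8.4, 13.5, 10.2, 10.5],
})

# County-level mean computed above on the full dataset
threshold = 8.724506683641012

# True/False comparison cast to 1/0: 1 = high (above the mean), 0 = low
demo['class'] = (demo['pct_diabetes18'] > threshold).astype(int)
print(demo)
```

This variant stores the class as an integer rather than a float, which some modeling libraries prefer for a target column.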

In [344]:
d18_rec.shape

(3142, 5)

In [345]:
d18_rec.tail()

Unnamed: 0,fips,county,state,pct_diabetes18,class
3137,56037,Sweetwater,Wyoming,7.8,0.0
3138,56039,Teton,Wyoming,3.8,0.0
3139,56041,Uinta,Wyoming,8.4,0.0
3140,56043,Washakie,Wyoming,7.4,0.0
3141,56045,Weston,Wyoming,7.6,0.0


## I. Health

For this dataset, I will need to find a solution for the missing values listed below. I will also add in the mental health provider and services information.

In [346]:
health_rec.isnull().sum()

fips                       0
state                      0
county                     0
pct_diabetes_adults08      5
pct_diabetes_adults13      0
pct_obese_adults12         0
pct_obese_adults17         0
pct_hspa17               760
recfac11                   0
recfac16                   0
pch_recfac_11_16         143
recfacpth11                0
recfacpth16                0
pch_recfacpth_11_16      143
dtype: int64

##### Columns Describing Health Conditions

First, while obesity would likely be an effective predictor, I will not use it in this study. The link between obesity and diabetes is [well-known](https://www.medschool.lsuhsc.edu/genetics/louisiana_genetics_and_hereditary_health_care_obesity_and_diabetes.aspx).

I am not attempting to predict diabetes from health conditions; rather, I would like to look specifically at environmental factors. (For the same reason, I will not use the prevalence of diabetes in 2008 or 2013 either.)

I will drop these columns at this time.

In [347]:
health_rec = health_rec.drop(columns = ['pct_diabetes_adults08', 'pct_diabetes_adults13', 'pct_obese_adults12',
                                       'pct_obese_adults17'])

health_rec.columns

Index(['fips', 'state', 'county', 'pct_hspa17', 'recfac11', 'recfac16',
       'pch_recfac_11_16', 'recfacpth11', 'recfacpth16',
       'pch_recfacpth_11_16'],
      dtype='object')

The referenced columns have been dropped. I will move on to discuss the High Schoolers Physically Active column.

##### High Schoolers Physically Active (2017)

I also cannot use columns that are missing many values, such as high schoolers physically active (2017), which is missing for 760 of the 3,143 rows. There are no hspa values from prior years from which to estimate a 2017 value, and I do not want to impute generalized values for so many rows.

I will proceed with removing the column.

In [348]:
health_rec = health_rec.drop(columns = ['pct_hspa17'])

health_rec.columns

Index(['fips', 'state', 'county', 'recfac11', 'recfac16', 'pch_recfac_11_16',
       'recfacpth11', 'recfacpth16', 'pch_recfacpth_11_16'],
      dtype='object')

##### Percent Change - Recreational Facilities (2011 - 2016) & Recreational Facilities per 1000 (2011 - 2016)

These two columns are missing values; however, the actual numbers from which they are derived are not missing.

I will attempt to fill in the missing values below.

In [349]:
health_rec.head()

Unnamed: 0,fips,state,county,recfac11,recfac16,pch_recfac_11_16,recfacpth11,recfacpth16,pch_recfacpth_11_16
0,1001,AL,Autauga,4,6,50.0,0.072465,0.108542,49.785629
1,1003,AL,Baldwin,16,21,31.25,0.085775,0.1012,17.983256
2,1005,AL,Barbour,2,0,-100.0,0.073123,0.0,-100.0
3,1007,AL,Bibb,0,1,,0.0,0.044183,
4,1009,AL,Blount,3,4,33.333333,0.052118,0.06949,33.333333


It appears that the pch_recfac_11_16 column is derived through the following formula:

(recfac16 - recfac11) / recfac11 * 100

Similarly, the pch_recfacpth_11_16 column is derived through:

(recfacpth16 - recfacpth11) / recfacpth11 * 100

However, I see why some values are left blank: in some cases this would involve dividing by zero, as in Bibb County above, where recfac11 is 0.

Because the 2011 and 2016 values from which these columns are derived are already present, I will drop the two percent-change columns.
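
For reference, the inferred formula and its division-by-zero gap can be reproduced on the first five rows shown above. This is a sketch of the formula as I have inferred it, not the Atlas's documented method; `demo` and `base` are hypothetical names.

```python
import numpy as np
import pandas as pd

# First five rows of the Health data shown above
demo = pd.DataFrame({
    'county': ['Autauga', 'Baldwin', 'Barbour', 'Bibb', 'Blount'],
    'recfac11': [4, 16, 2, 0, 3],
    'recfac16': [6, 21, 0, 1, 4],
})

# Treat a 2011 count of zero as missing so the division yields NaN, not inf
base = demo['recfac11'].replace(0, np.nan)
demo['pch_recfac_11_16'] = (demo['recfac16'] - demo['recfac11']) / base * 100
print(demo)
```

Note that this sketch also returns NaN when both counts are zero, whereas the Atlas reports 0.0 in that situation (see the Moody County row earlier in this notebook), so it is not an exact reconstruction of the original column.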

In [350]:
health_rec = health_rec.drop(columns = ['pch_recfac_11_16', 'pch_recfacpth_11_16'])

health_rec.columns

Index(['fips', 'state', 'county', 'recfac11', 'recfac16', 'recfacpth11',
       'recfacpth16'],
      dtype='object')

In [351]:
health_rec.isnull().sum()

fips           0
state          0
county         0
recfac11       0
recfac16       0
recfacpth11    0
recfacpth16    0
dtype: int64

Now, there are no missing values. I will need to add in the columns from the Diabetes Atlas...

First, however, I must still come up with a solution for the Bedford City discrepancy described above. 

If I had more time, or if it seemed crucial to include Bedford City in the model, I would estimate its 2018 value as described above.

Due to time constraints, I will simply remove the row.

In [352]:
health_rec = health_rec[health_rec['fips'] != 51515]


Bedford City has been dropped. I will now confirm below.

In [353]:
health_rec.loc[health_rec['fips'] == 51515]

Unnamed: 0,fips,state,county,recfac11,recfac16,recfacpth11,recfacpth16


The fips for Bedford City is no longer present. I will now ensure the Health dataframe matches that of the Diabetes Atlas.

In [354]:
health_rec.shape #shows revised layout of Health dataframe

(3142, 7)

In [355]:
d18_rec.shape #shows layout of Diabetes 18 dataframe

(3142, 5)

In [356]:
health_rec.head()

Unnamed: 0,fips,state,county,recfac11,recfac16,recfacpth11,recfacpth16
0,1001,AL,Autauga,4,6,0.072465,0.108542
1,1003,AL,Baldwin,16,21,0.085775,0.1012
2,1005,AL,Barbour,2,0,0.073123,0.0
3,1007,AL,Bibb,0,1,0.0,0.044183
4,1009,AL,Blount,3,4,0.052118,0.06949


In [357]:
d18_rec.head()

Unnamed: 0,fips,county,state,pct_diabetes18,class
0,1001,Autauga,Alabama,9.5,1.0
1,1003,Baldwin,Alabama,8.4,0.0
2,1005,Barbour,Alabama,13.5,1.0
3,1007,Bibb,Alabama,10.2,1.0
4,1009,Blount,Alabama,10.5,1.0


In [358]:
health_rec.tail()

Unnamed: 0,fips,state,county,recfac11,recfac16,recfacpth11,recfacpth16
3137,56037,WY,Sweetwater,4,6,0.090882,0.135609
3138,56039,WY,Teton,9,13,0.419072,0.560828
3139,56041,WY,Uinta,3,2,0.143548,0.096567
3140,56043,WY,Washakie,1,1,0.118203,0.12213
3141,56045,WY,Weston,1,0,0.140036,0.0


In [359]:
d18_rec.tail()

Unnamed: 0,fips,county,state,pct_diabetes18,class
3137,56037,Sweetwater,Wyoming,7.8,0.0
3138,56039,Teton,Wyoming,3.8,0.0
3139,56041,Uinta,Wyoming,8.4,0.0
3140,56043,Washakie,Wyoming,7.4,0.0
3141,56045,Weston,Wyoming,7.6,0.0


In [360]:
d18_rec.shape

(3142, 5)

In [361]:
health_rec.shape

(3142, 7)

In [362]:
health_rec.isnull().sum()

fips           0
state          0
county         0
recfac11       0
recfac16       0
recfacpth11    0
recfacpth16    0
dtype: int64

In [363]:
d18_rec.isnull().sum()

fips              0
county            0
state             0
pct_diabetes18    0
class             0
dtype: int64

In [364]:
health_rec.fips.describe()

count     3142.000000
mean     30383.649268
std      15162.508374
min       1001.000000
25%      18177.500000
50%      29176.000000
75%      45080.500000
max      56045.000000
Name: fips, dtype: float64

In [365]:
d18_rec.fips.describe()

count     3142.000000
mean     30383.649268
std      15162.508374
min       1001.000000
25%      18177.500000
50%      29176.000000
75%      45080.500000
max      56045.000000
Name: fips, dtype: float64

It is confirmed. The Health dataframe and the Diabetes Atlas dataframe have the same number of rows, the counties in the first 5 and last 5 rows match, and the statistical summaries of their fips columns are identical.
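
Matching shapes and summary statistics are strong evidence, but a row-by-row comparison is a stricter check. A minimal sketch, assuming two stand-in frames (`health_demo` and `d18_demo` are hypothetical names) built from the three Alaska rows shown earlier:

```python
import pandas as pd

# Three Alaska rows from each source, as displayed earlier in the notebook
health_demo = pd.DataFrame({
    'fips': [2150, 2158, 2164],
    'county': ['Kodiak Island', 'Kusilvak Census Area', 'Lake and Peninsula'],
})
d18_demo = pd.DataFrame({
    'fips': [2150, 2158, 2164],
    'county': ['Kodiak Island Borough', 'Kusilvak Census Area',
               'Lake and Peninsula Borough'],
})

# True only if both frames list the same fips codes in the same order
# (indexes are reset first so that a gap left by a dropped row does not matter)
same_order = health_demo['fips'].reset_index(drop=True).equals(
    d18_demo['fips'].reset_index(drop=True))

# Joining on fips avoids relying on row position at all;
# validate='one_to_one' raises an error if either side has duplicate fips codes
merged = health_demo.merge(d18_demo, on='fips', how='inner',
                           suffixes=('_health', '_d18'), validate='one_to_one')
print(same_order, len(merged))
```

Merging on `fips` rather than concatenating by position also sidesteps the differences in county naming between the two sources.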

I will now add the Mental Health Services information to the dataframe. Because the MHS data are 1 row for each state, whereas the Health data contain many rows for each state, I will need to create and use a function.

##### Adding Mental Health Services

In [366]:
file_path = '../data/NSDUH_RcvdMHServes2016_CLEAN.csv' #importing data

mhs = pd.read_csv(file_path)

In [367]:
mhs.head()

Unnamed: 0,state,age18plus,age18_25,age26plus
0,Alabama,13.0,11.0,13.0
1,Alaska,14.0,13.0,14.0
2,Arizona,12.0,10.0,12.0
3,Arkansas,16.0,13.0,16.0
4,California,12.0,10.0,12.0


In [368]:
health_rec['pct_mhs18'] = np.nan #adds the new columns (without values) into the Health dataframe

health_rec['pct_mhs1825'] = np.nan

health_rec['pct_mhs26'] = np.nan

In [369]:
health_rec.state.unique() #shows list of states in Health dataset

array(['AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'DC', 'FL', 'GA',
       'HI', 'ID', 'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME', 'MD', 'MA',
       'MI', 'MN', 'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ', 'NM', 'NY',
       'NC', 'ND', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX',
       'UT', 'VT', 'VA', 'WA', 'WV', 'WI', 'WY'], dtype=object)

In [370]:
#generates lists of the three MH services columns

age18plus_list = mhs['age18plus'].values.tolist()

age18_25_list = mhs['age18_25'].values.tolist()

age26plus_list = mhs['age26plus'].values.tolist()

In [371]:
#creates a list of the state abbreviations, in the same order as the rows of mhs (alphabetical by state name)

state_list = ['AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'DC', 'FL', 'GA',
       'HI', 'ID', 'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME', 'MD', 'MA',
       'MI', 'MN', 'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ', 'NM', 'NY',
       'NC', 'ND', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX',
       'UT', 'VT', 'VA', 'WA', 'WV', 'WI', 'WY']

In [372]:
#function for adding in values for age 18+ column

def add18plus(list1, list2):
    # list1: state abbreviations; list2: matching state-level values, in the same order
    for state, value in zip(list1, list2):
        health_rec.loc[health_rec['state'] == state, 'pct_mhs18'] = value
        
add18plus(state_list, age18plus_list)

In [373]:
#function for adding in values for age 18-25 column

def add1825(list1, list2):
    # list1: state abbreviations; list2: matching state-level values, in the same order
    for state, value in zip(list1, list2):
        health_rec.loc[health_rec['state'] == state, 'pct_mhs1825'] = value
        
add1825(state_list, age18_25_list)

In [374]:
#function for adding in values for age 26+ column

def add26plus(list1, list2):
    # list1: state abbreviations; list2: matching state-level values, in the same order
    for state, value in zip(list1, list2):
        health_rec.loc[health_rec['state'] == state, 'pct_mhs26'] = value
        
add26plus(state_list, age26plus_list)
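
The functions above depend on the abbreviation list and the rows of the state-level table being in the same order. An alternative that does not depend on row order is to build an explicit lookup and map it onto the county rows. The following is a sketch only, assuming stand-in frames (`mhs_demo`, `health_demo`) with four of the states shown in this notebook and a hypothetical `abbrev` dictionary covering just those states:

```python
import pandas as pd

# Subset of the state-level table (values as shown in this notebook)
mhs_demo = pd.DataFrame({
    'state': ['Alabama', 'Alaska', 'Kentucky', 'Ohio'],
    'age18plus': [13.0, 14.0, 18.0, 17.0],
})

# Subset of the county-level Health rows
health_demo = pd.DataFrame({
    'fips': [1001, 2013, 21001, 39001],
    'state': ['AL', 'AK', 'KY', 'OH'],
})

# Explicit name-to-abbreviation lookup (only the states used here)
abbrev = {'Alabama': 'AL', 'Alaska': 'AK', 'Kentucky': 'KY', 'Ohio': 'OH'}

# Build an abbreviation -> value Series, then map it onto every county row
lookup = (mhs_demo.assign(abbr=mhs_demo['state'].map(abbrev))
                  .set_index('abbr')['age18plus'])
health_demo['pct_mhs18'] = health_demo['state'].map(lookup)
print(health_demo)
```

With this approach, any state missing from the lookup shows up as NaN in the new column instead of silently receiving another state's value.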

I will now check to see if this worked correctly.

In [375]:
health_rec.head(2)

Unnamed: 0,fips,state,county,recfac11,recfac16,recfacpth11,recfacpth16,pct_mhs18,pct_mhs1825,pct_mhs26
0,1001,AL,Autauga,4,6,0.072465,0.108542,13.0,11.0,13.0
1,1003,AL,Baldwin,16,21,0.085775,0.1012,13.0,11.0,13.0


In [376]:
mhs.loc[mhs['state'] == 'Alabama']

Unnamed: 0,state,age18plus,age18_25,age26plus
0,Alabama,13.0,11.0,13.0


In [377]:
health_rec.loc[health_rec['state'] =='OH'].head(2)

Unnamed: 0,fips,state,county,recfac11,recfac16,recfacpth11,recfacpth16,pct_mhs18,pct_mhs1825,pct_mhs26
2043,39001,OH,Adams,0,0,0.0,0.0,17.0,15.0,17.0
2044,39003,OH,Allen,12,11,0.113234,0.106151,17.0,15.0,17.0


In [378]:
mhs.loc[mhs['state'] == 'Ohio']

Unnamed: 0,state,age18plus,age18_25,age26plus
35,Ohio,17.0,15.0,17.0


In [379]:
health_rec.loc[health_rec['state'] =='KY'].head(2)

Unnamed: 0,fips,state,county,recfac11,recfac16,recfacpth11,recfacpth16,pct_mhs18,pct_mhs1825,pct_mhs26
993,21001,KY,Adair,1,0,0.052676,0.0,18.0,15.0,18.0
994,21003,KY,Allen,1,1,0.049552,0.048242,18.0,15.0,18.0


In [380]:
mhs.loc[mhs['state'] == 'Kentucky']

Unnamed: 0,state,age18plus,age18_25,age26plus
17,Kentucky,18.0,15.0,18.0


In [381]:
health_rec.pct_mhs18.nunique()

12

In [382]:
mhs.age18plus.nunique()

12

In [383]:
health_rec.isnull().sum()

fips           0
state          0
county         0
recfac11       0
recfac16       0
recfacpth11    0
recfacpth16    0
pct_mhs18      0
pct_mhs1825    0
pct_mhs26      0
dtype: int64

It appears that the columns were added correctly. I will now move on to add the mental health providers information into the Health dataframe.

I will use a similar process, creating a function to apply, because the MHP data have one row for each state, whereas the Health data contain many rows for each state.

##### Adding Mental Health Providers Information

In [384]:
file_path = '../data/state_mh_providers.csv'

mhp = pd.read_csv(file_path)

In [385]:
mhp.head()

Unnamed: 0,state,rank18,per100th18,rank17,per100th17
0,Alabama,50,92.6,50,85.0
1,Alaska,7,391.2,8,364.2
2,Arizona,47,129.3,47,121.9
3,Arkansas,27,226.0,26,213.3
4,California,11,338.0,10,315.5


In [386]:
#adds the 4 columns (without values) into the Health dataframe

health_rec['rank18'] = np.nan

health_rec['per100th18'] = np.nan

health_rec['rank17'] = np.nan

health_rec['per100th17'] = np.nan

health_rec.head()

Unnamed: 0,fips,state,county,recfac11,recfac16,recfacpth11,recfacpth16,pct_mhs18,pct_mhs1825,pct_mhs26,rank18,per100th18,rank17,per100th17
0,1001,AL,Autauga,4,6,0.072465,0.108542,13.0,11.0,13.0,,,,
1,1003,AL,Baldwin,16,21,0.085775,0.1012,13.0,11.0,13.0,,,,
2,1005,AL,Barbour,2,0,0.073123,0.0,13.0,11.0,13.0,,,,
3,1007,AL,Bibb,0,1,0.0,0.044183,13.0,11.0,13.0,,,,
4,1009,AL,Blount,3,4,0.052118,0.06949,13.0,11.0,13.0,,,,


The empty columns have been created. I will now fill in the values.

In [387]:
#generates lists of the 4 MH provider columns

rank18_list = mhp['rank18'].values.tolist()

per100th18_list = mhp['per100th18'].values.tolist()

rank17_list = mhp['rank17'].values.tolist()

per100th17_list = mhp['per100th17'].values.tolist()

This time, I will create a function that can be used to add values for all 4 of the columns.

In [388]:
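#function for adding in values to mhp columns
#list1: state abbreviations, list2: values in the same order, column: name of the column to fill

def add_values(list1, list2, column):
    for n in range(0, 51): #50 states plus DC
        health_rec.loc[health_rec['state'] == list1[n], column] = list2[n]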


add_values(state_list, rank18_list, 'rank18')
add_values(state_list, per100th18_list, 'per100th18')
add_values(state_list, rank17_list, 'rank17')
add_values(state_list, per100th17_list, 'per100th17')


In [389]:
health_rec.head()

Unnamed: 0,fips,state,county,recfac11,recfac16,recfacpth11,recfacpth16,pct_mhs18,pct_mhs1825,pct_mhs26,rank18,per100th18,rank17,per100th17
0,1001,AL,Autauga,4,6,0.072465,0.108542,13.0,11.0,13.0,50,92.6,50,85.0
1,1003,AL,Baldwin,16,21,0.085775,0.1012,13.0,11.0,13.0,50,92.6,50,85.0
2,1005,AL,Barbour,2,0,0.073123,0.0,13.0,11.0,13.0,50,92.6,50,85.0
3,1007,AL,Bibb,0,1,0.0,0.044183,13.0,11.0,13.0,50,92.6,50,85.0
4,1009,AL,Blount,3,4,0.052118,0.06949,13.0,11.0,13.0,50,92.6,50,85.0


Now, I must ensure that these columns were added correctly.

In [390]:
mhp.loc[mhp['state'] == 'Alabama']

Unnamed: 0,state,rank18,per100th18,rank17,per100th17
0,Alabama,50,92.6,50,85.0


In [391]:
health_rec.loc[health_rec['state'] == 'AK'].head(2)

Unnamed: 0,fips,state,county,recfac11,recfac16,recfacpth11,recfacpth16,pct_mhs18,pct_mhs1825,pct_mhs26,rank18,per100th18,rank17,per100th17
67,2013,AK,Aleutians East,0,0,0.0,0.0,14.0,13.0,14.0,7,391.2,8,364.2
68,2016,AK,Aleutians West,0,0,0.0,0.0,14.0,13.0,14.0,7,391.2,8,364.2


In [392]:
mhp.loc[mhp['state'] == 'Alaska']

Unnamed: 0,state,rank18,per100th18,rank17,per100th17
1,Alaska,7,391.2,8,364.2


In [393]:
health_rec.loc[health_rec['state'] == 'WY'].head(2)

Unnamed: 0,fips,state,county,recfac11,recfac16,recfacpth11,recfacpth16,pct_mhs18,pct_mhs1825,pct_mhs26,rank18,per100th18,rank17,per100th17
3119,56001,WY,Albany,4,3,0.108451,0.078974,14.0,13.0,14.0,13,331.6,12,310.2
3120,56003,WY,Big Horn,2,1,0.170561,0.083745,14.0,13.0,14.0,13,331.6,12,310.2


In [394]:
mhp.loc[mhp['state'] == 'Wyoming']

Unnamed: 0,state,rank18,per100th18,rank17,per100th17
50,Wyoming,13,331.6,12,310.2


In [395]:
health_rec.isnull().sum()

fips           0
state          0
county         0
recfac11       0
recfac16       0
recfacpth11    0
recfacpth16    0
pct_mhs18      0
pct_mhs1825    0
pct_mhs26      0
rank18         0
per100th18     0
rank17         0
per100th17     0
dtype: int64

These columns were filled in correctly. I will now move on to revise the ranks (DC currently has a rank of '•', a dot placeholder rather than a number).
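A placeholder like this can also be detected in bulk. The sketch below is an assumption-laden illustration rather than the method used in this notebook: it treats the ranks as a column of strings (the toy Series `ranks` is hypothetical) and uses `pd.to_numeric` with `errors="coerce"` so that anything non-numeric becomes a null value.

```python
import pandas as pd

# Hypothetical toy column of ranks read in as text, with one dot placeholder.
ranks = pd.Series(["50", "7", "•", "1"], name="rank18")

# Non-numeric entries are converted to NaN instead of raising an error.
numeric_ranks = pd.to_numeric(ranks, errors="coerce")

print(numeric_ranks)
print("placeholders found:", numeric_ranks.isna().sum())
```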

#### Rank - Number of MH Providers (2018)

In [396]:
health_rec.loc[health_rec['state'] == 'DC']

Unnamed: 0,fips,state,county,recfac11,recfac16,recfacpth11,recfacpth16,pct_mhs18,pct_mhs1825,pct_mhs26,rank18,per100th18,rank17,per100th17
319,11001,DC,District of Columbia,75,102,0.120902,0.14905,15.0,13.0,16.0,•,486.9,•,470.5


In [397]:
health_rec.loc[health_rec['state'] == 'DC', 'rank18'] = np.nan

In [398]:
health_rec.loc[health_rec['state'] == 'DC']

Unnamed: 0,fips,state,county,recfac11,recfac16,recfacpth11,recfacpth16,pct_mhs18,pct_mhs1825,pct_mhs26,rank18,per100th18,rank17,per100th17
319,11001,DC,District of Columbia,75,102,0.120902,0.14905,15.0,13.0,16.0,,486.9,•,470.5


I need to give the District of Columbia the rank (2018) of 3.

To do this, I will assign DC a rank of 2, add 1 to the entire column (which moves DC to 3 and shifts every state down by one place), and then restore Massachusetts and Oregon to ranks 1 and 2, respectively.
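The same end result can be reached by shifting only the ranks at or below the new position. The following is a minimal sketch on a hypothetical four-row frame, assuming the unranked entry starts as a null value. It is an alternative illustration, not the steps run below.

```python
import pandas as pd

# Hypothetical toy frame: three ranked states and one unranked entry.
ranks = pd.DataFrame({
    "state": ["MA", "OR", "ME", "DC"],
    "rank18": [1.0, 2.0, 3.0, float("nan")],
})
new_rank = 3  # position to give the unranked entry

# Move every existing rank at or below the new position down by one place.
shift = ranks["rank18"] >= new_rank
ranks.loc[shift, "rank18"] += 1

# Fill the freed slot, then store the column as integers.
ranks.loc[ranks["state"] == "DC", "rank18"] = new_rank
ranks["rank18"] = ranks["rank18"].astype(int)

print(ranks)
```

Because ranks above the new position are never touched, nothing needs to be restored afterwards.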

##### Re-assigning ranks

In [399]:
health_rec.loc[health_rec['state'] == 'DC', 'rank18'] = 2 #gives DC a rank of 2

In [400]:
health_rec.rank18 = health_rec.rank18.astype('int') #changes column datatype to integer

In [401]:
health_rec['rank18'] = health_rec['rank18'] + 1 #adds 1 to each rank

In [402]:
#resets MA and OR to null values for this column

health_rec.loc[health_rec['state'] == 'MA', 'rank18'] = np.nan 
health_rec.loc[health_rec['state'] == 'OR', 'rank18'] = np.nan

In [403]:
#restores original/correct ranks to MA and OR

health_rec.loc[health_rec['state'] == 'MA', 'rank18'] = 1
health_rec.loc[health_rec['state'] == 'OR', 'rank18'] = 2

##### Confirming change

In [404]:
health_rec.rank18.unique()

array([51.,  8., 48., 28., 12., 11., 10., 21.,  3., 42., 47., 23., 33.,
       29., 43., 44., 36., 30., 19.,  4., 25.,  1., 20., 24., 45., 37.,
       17., 22., 32., 16., 31.,  9., 18., 26., 38., 27.,  7.,  2., 34.,
        6., 40., 39., 46., 50., 15.,  5., 41., 13., 49., 35., 14.])

In [405]:
health_rec.loc[health_rec['state'] == 'MA'].head(2)

Unnamed: 0,fips,state,county,recfac11,recfac16,recfacpth11,recfacpth16,pct_mhs18,pct_mhs1825,pct_mhs26,rank18,per100th18,rank17,per100th17
1217,25001,MA,Barnstable,46,42,0.213616,0.196777,19.0,17.0,19.0,1.0,590.9,1,547.3
1218,25003,MA,Berkshire,13,14,0.099626,0.11036,19.0,17.0,19.0,1.0,590.9,1,547.3


In [406]:
health_rec.loc[health_rec['state'] == 'OR'].head(2)

Unnamed: 0,fips,state,county,recfac11,recfac16,recfacpth11,recfacpth16,pct_mhs18,pct_mhs1825,pct_mhs26,rank18,per100th18,rank17,per100th17
2208,41001,OR,Baker,3,2,0.186776,0.125408,18.0,16.0,18.0,2.0,492.3,2,453.7
2209,41003,OR,Benton,14,12,0.162836,0.134385,18.0,16.0,18.0,2.0,492.3,2,453.7


In [407]:
health_rec.loc[health_rec['state'] == 'DC'].head(2)

Unnamed: 0,fips,state,county,recfac11,recfac16,recfacpth11,recfacpth16,pct_mhs18,pct_mhs1825,pct_mhs26,rank18,per100th18,rank17,per100th17
319,11001,DC,District of Columbia,75,102,0.120902,0.14905,15.0,13.0,16.0,3.0,486.9,•,470.5


In [408]:
health_rec.loc[health_rec['state'] == 'ME'].head(2)

Unnamed: 0,fips,state,county,recfac11,recfac16,recfacpth11,recfacpth16,pct_mhs18,pct_mhs1825,pct_mhs26,rank18,per100th18,rank17,per100th17
1177,23001,ME,Androscoggin,13,11,0.121048,0.102546,19.0,17.0,20.0,4.0,459.5,3,442.1
1178,23003,ME,Aroostook,3,4,0.042033,0.058723,19.0,17.0,20.0,4.0,459.5,3,442.1


The change appears to be successful. I will do the same for the 2017 rank column.

#### Rank - Number of MH Providers (2017)

Previous EDA revealed that the 2017 rank for DC should be 2.

##### Re-assigning ranks

In [409]:
health_rec.loc[health_rec['state'] == 'DC', 'rank17'] = np.nan #resets DC rank to null value

In [410]:
health_rec.loc[health_rec['state'] == 'DC', 'rank17'] = 1 #assigns rank number 1 to DC

In [411]:
health_rec.rank17 = health_rec.rank17.astype('int') #changes column datatype to integer

In [412]:
health_rec['rank17'] = health_rec['rank17'] +1 #adds one to each rank

Now I must reset the value for Massachusetts, and change it to 1, as it again has the top rank for number of MH providers.

In [413]:
#resets MA 2017 rank to null, then restores true/original rank of 1

health_rec.loc[health_rec['state'] == 'MA', 'rank17'] = np.nan

health_rec.loc[health_rec['state'] == 'MA', 'rank17'] = 1

##### Confirming change

In [414]:
health_rec.loc[health_rec['state'] == 'MA'].head(2)

Unnamed: 0,fips,state,county,recfac11,recfac16,recfacpth11,recfacpth16,pct_mhs18,pct_mhs1825,pct_mhs26,rank18,per100th18,rank17,per100th17
1217,25001,MA,Barnstable,46,42,0.213616,0.196777,19.0,17.0,19.0,1.0,590.9,1.0,547.3
1218,25003,MA,Berkshire,13,14,0.099626,0.11036,19.0,17.0,19.0,1.0,590.9,1.0,547.3


In [415]:
health_rec.loc[health_rec['state'] == 'DC'].head(2)

Unnamed: 0,fips,state,county,recfac11,recfac16,recfacpth11,recfacpth16,pct_mhs18,pct_mhs1825,pct_mhs26,rank18,per100th18,rank17,per100th17
319,11001,DC,District of Columbia,75,102,0.120902,0.14905,15.0,13.0,16.0,3.0,486.9,2.0,470.5


In [416]:
health_rec.loc[health_rec['state'] == 'OR'].head(2)

Unnamed: 0,fips,state,county,recfac11,recfac16,recfacpth11,recfacpth16,pct_mhs18,pct_mhs1825,pct_mhs26,rank18,per100th18,rank17,per100th17
2208,41001,OR,Baker,3,2,0.186776,0.125408,18.0,16.0,18.0,2.0,492.3,3.0,453.7
2209,41003,OR,Benton,14,12,0.162836,0.134385,18.0,16.0,18.0,2.0,492.3,3.0,453.7


In [417]:
health_rec.loc[health_rec['state'] == 'ME'].head(2)

Unnamed: 0,fips,state,county,recfac11,recfac16,recfacpth11,recfacpth16,pct_mhs18,pct_mhs1825,pct_mhs26,rank18,per100th18,rank17,per100th17
1177,23001,ME,Androscoggin,13,11,0.121048,0.102546,19.0,17.0,20.0,4.0,459.5,4.0,442.1
1178,23003,ME,Aroostook,3,4,0.042033,0.058723,19.0,17.0,20.0,4.0,459.5,4.0,442.1


In [418]:
health_rec.rank17.unique()

array([51.,  9., 48., 27., 11., 12., 10., 20.,  2., 42., 47., 23., 30.,
       29., 43., 45., 35., 28., 19.,  4., 24.,  1., 21., 26., 46., 37.,
       17., 22., 31., 16., 32.,  8., 18., 25., 38., 33.,  6.,  3., 34.,
        7., 40., 39., 44., 50., 15.,  5., 41., 14., 49., 36., 13.])

The change appears to be correct. I will now move on to add in the columns for diabetes prevalence.
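One caveat about adding a column from another dataframe by direct assignment: pandas aligns the values on index labels, not on the county key. The sketch below uses hypothetical toy frames (`left`, `right`) to show the difference between assignment by index and a lookup keyed on `fips`. In this notebook the rows are assumed to already be in the same order, in which case both give the same result.

```python
import pandas as pd

# Hypothetical toy frames holding the same counties in a different row order.
left = pd.DataFrame({"fips": [1001, 1003, 1005]})
right = pd.DataFrame({
    "fips": [1003, 1005, 1001],
    "pct_diabetes18": [8.4, 13.5, 9.5],
})

# Direct assignment pairs rows by index label (0 with 0, 1 with 1, ...).
left["by_index"] = right["pct_diabetes18"]

# A keyed lookup pairs rows by fips code, whatever the row order.
left["by_fips"] = left["fips"].map(right.set_index("fips")["pct_diabetes18"])

print(left)
```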

##### Adding Diabetes Prevalence

In [419]:
health_rec['pct_diabetes18'] = d18_rec['pct_diabetes18']

health_rec['class'] = d18_rec['class']

health_rec.head()

Unnamed: 0,fips,state,county,recfac11,recfac16,recfacpth11,recfacpth16,pct_mhs18,pct_mhs1825,pct_mhs26,rank18,per100th18,rank17,per100th17,pct_diabetes18,class
0,1001,AL,Autauga,4,6,0.072465,0.108542,13.0,11.0,13.0,51.0,92.6,51.0,85.0,9.5,1.0
1,1003,AL,Baldwin,16,21,0.085775,0.1012,13.0,11.0,13.0,51.0,92.6,51.0,85.0,8.4,0.0
2,1005,AL,Barbour,2,0,0.073123,0.0,13.0,11.0,13.0,51.0,92.6,51.0,85.0,13.5,1.0
3,1007,AL,Bibb,0,1,0.0,0.044183,13.0,11.0,13.0,51.0,92.6,51.0,85.0,10.2,1.0
4,1009,AL,Blount,3,4,0.052118,0.06949,13.0,11.0,13.0,51.0,92.6,51.0,85.0,10.5,1.0


In [420]:
health_rec.isnull().sum()

fips              0
state             0
county            0
recfac11          0
recfac16          0
recfacpth11       0
recfacpth16       0
pct_mhs18         0
pct_mhs1825       0
pct_mhs26         0
rank18            0
per100th18        0
rank17            0
per100th17        0
pct_diabetes18    0
class             0
dtype: int64

This dataset is ready for modeling. I will save it to a new csv. 

In [421]:
health_rec.to_csv('../data/FoodEnvironmentAtlas.xls.HEALTH_Modeling.csv', index=False)

## II. Access

I changed the Kusilvak Census Area and Oglala Lakota County names and removed Bedford City in Google Sheets to create the csv below.

In [422]:
file_path = '../data/FoodEnvironmentAtlas.xls.ACCESS_REC.csv'

access_rec = pd.read_csv(file_path)

In [423]:
access_rec.head()

Unnamed: 0,fips,state,county,pct_laccess_pop10,pct_laccess_pop15,pct_laccess_lowi10,pct_laccess_lowi15,pct_laccess_hhnv10,pct_laccess_hhnv15,pct_laccess_snap15,pct_laccess_child10,pct_laccess_child15,pct_laccess_seniors10,pct_laccess_seniors15
0,1001,AL,Autauga,33.769657,32.062255,9.79353,11.991125,3.284786,3.351332,4.608749,8.837112,8.460485,4.376378,3.996279
1,1003,AL,Baldwin,19.318473,16.767489,5.460261,5.424427,2.147827,1.905114,1.2989,4.343199,3.844936,3.51357,3.06184
2,1005,AL,Barbour,20.840972,22.10556,11.420316,10.739667,4.135869,4.329378,4.303147,3.425062,3.758341,2.805166,3.001695
3,1007,AL,Bibb,4.559753,4.230324,2.144661,2.601627,3.45858,2.821427,0.67671,1.087518,1.015242,0.657008,0.600865
4,1009,AL,Blount,2.70084,6.49738,1.062468,2.88015,3.26938,3.336414,0.812727,0.67149,1.58872,0.340269,0.882583


In [424]:
access_rec.shape

(3142, 14)

In [425]:
d18_rec.shape

(3142, 5)

In [426]:
access_rec.head()

Unnamed: 0,fips,state,county,pct_laccess_pop10,pct_laccess_pop15,pct_laccess_lowi10,pct_laccess_lowi15,pct_laccess_hhnv10,pct_laccess_hhnv15,pct_laccess_snap15,pct_laccess_child10,pct_laccess_child15,pct_laccess_seniors10,pct_laccess_seniors15
0,1001,AL,Autauga,33.769657,32.062255,9.79353,11.991125,3.284786,3.351332,4.608749,8.837112,8.460485,4.376378,3.996279
1,1003,AL,Baldwin,19.318473,16.767489,5.460261,5.424427,2.147827,1.905114,1.2989,4.343199,3.844936,3.51357,3.06184
2,1005,AL,Barbour,20.840972,22.10556,11.420316,10.739667,4.135869,4.329378,4.303147,3.425062,3.758341,2.805166,3.001695
3,1007,AL,Bibb,4.559753,4.230324,2.144661,2.601627,3.45858,2.821427,0.67671,1.087518,1.015242,0.657008,0.600865
4,1009,AL,Blount,2.70084,6.49738,1.062468,2.88015,3.26938,3.336414,0.812727,0.67149,1.58872,0.340269,0.882583


In [427]:
access_rec.tail()

Unnamed: 0,fips,state,county,pct_laccess_pop10,pct_laccess_pop15,pct_laccess_lowi10,pct_laccess_lowi15,pct_laccess_hhnv10,pct_laccess_hhnv15,pct_laccess_snap15,pct_laccess_child10,pct_laccess_child15,pct_laccess_seniors10,pct_laccess_seniors15
3137,56037,WY,Sweetwater,30.570505,43.224074,5.512073,10.845331,0.877134,2.182752,2.141828,8.18123,11.077838,2.832748,4.454212
3138,56039,WY,Teton,29.174527,29.17437,4.975409,7.409463,1.374848,0.540222,0.670815,5.147907,5.147868,3.086168,3.086129
3139,56041,WY,Uinta,20.220414,22.189685,7.19015,9.727151,0.966219,2.759922,2.072485,6.328975,6.815148,1.70729,2.069942
3140,56043,WY,Washakie,10.915407,10.915407,2.737939,3.621591,0.396304,1.203633,1.05398,2.162275,2.162275,2.572249,2.572249
3141,56045,WY,Weston,17.209949,17.165192,3.989222,4.199467,1.483037,3.728765,0.971078,3.507074,3.504601,2.974616,2.959728


At times when running this notebook, the dataframe displays with an error in the index, showing the last row as 3142. I will reset the index.
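For reference, `reset_index()` returns a new dataframe and leaves the original untouched unless the result is assigned back. Passing `drop=True` also avoids keeping the old index as an extra column. A minimal sketch on a hypothetical toy frame with a gap in its index:

```python
import pandas as pd

# Hypothetical toy frame whose index skips a label.
df = pd.DataFrame({"fips": [1001, 1003, 1005]}, index=[0, 2, 3])

# Without assignment, the call only returns a copy; the old index becomes an 'index' column.
preview = df.reset_index()
index_after_preview = df.index.tolist()  # still [0, 2, 3]

# Assigning the result back with drop=True makes the new 0..n-1 index permanent.
df = df.reset_index(drop=True)

print(preview)
print(df)
```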

In [428]:
access_rec.reset_index() #displays a copy with a fresh index; access_rec itself changes only if the result is assigned back

Unnamed: 0,index,fips,state,county,pct_laccess_pop10,pct_laccess_pop15,pct_laccess_lowi10,pct_laccess_lowi15,pct_laccess_hhnv10,pct_laccess_hhnv15,pct_laccess_snap15,pct_laccess_child10,pct_laccess_child15,pct_laccess_seniors10,pct_laccess_seniors15
0,0,1001,AL,Autauga,33.769657,32.062255,9.793530,11.991125,3.284786,3.351332,4.608749,8.837112,8.460485,4.376378,3.996279
1,1,1003,AL,Baldwin,19.318473,16.767489,5.460261,5.424427,2.147827,1.905114,1.298900,4.343199,3.844936,3.513570,3.061840
2,2,1005,AL,Barbour,20.840972,22.105560,11.420316,10.739667,4.135869,4.329378,4.303147,3.425062,3.758341,2.805166,3.001695
3,3,1007,AL,Bibb,4.559753,4.230324,2.144661,2.601627,3.458580,2.821427,0.676710,1.087518,1.015242,0.657008,0.600865
4,4,1009,AL,Blount,2.700840,6.497380,1.062468,2.880150,3.269380,3.336414,0.812727,0.671490,1.588720,0.340269,0.882583
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3137,3137,56037,WY,Sweetwater,30.570505,43.224074,5.512073,10.845331,0.877134,2.182752,2.141828,8.181230,11.077838,2.832748,4.454212
3138,3138,56039,WY,Teton,29.174527,29.174370,4.975409,7.409463,1.374848,0.540222,0.670815,5.147907,5.147868,3.086168,3.086129
3139,3139,56041,WY,Uinta,20.220414,22.189685,7.190150,9.727151,0.966219,2.759922,2.072485,6.328975,6.815148,1.707290,2.069942
3140,3140,56043,WY,Washakie,10.915407,10.915407,2.737939,3.621591,0.396304,1.203633,1.053980,2.162275,2.162275,2.572249,2.572249


In [429]:
access_rec #shows revised dataframe

Unnamed: 0,fips,state,county,pct_laccess_pop10,pct_laccess_pop15,pct_laccess_lowi10,pct_laccess_lowi15,pct_laccess_hhnv10,pct_laccess_hhnv15,pct_laccess_snap15,pct_laccess_child10,pct_laccess_child15,pct_laccess_seniors10,pct_laccess_seniors15
0,1001,AL,Autauga,33.769657,32.062255,9.793530,11.991125,3.284786,3.351332,4.608749,8.837112,8.460485,4.376378,3.996279
1,1003,AL,Baldwin,19.318473,16.767489,5.460261,5.424427,2.147827,1.905114,1.298900,4.343199,3.844936,3.513570,3.061840
2,1005,AL,Barbour,20.840972,22.105560,11.420316,10.739667,4.135869,4.329378,4.303147,3.425062,3.758341,2.805166,3.001695
3,1007,AL,Bibb,4.559753,4.230324,2.144661,2.601627,3.458580,2.821427,0.676710,1.087518,1.015242,0.657008,0.600865
4,1009,AL,Blount,2.700840,6.497380,1.062468,2.880150,3.269380,3.336414,0.812727,0.671490,1.588720,0.340269,0.882583
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3137,56037,WY,Sweetwater,30.570505,43.224074,5.512073,10.845331,0.877134,2.182752,2.141828,8.181230,11.077838,2.832748,4.454212
3138,56039,WY,Teton,29.174527,29.174370,4.975409,7.409463,1.374848,0.540222,0.670815,5.147907,5.147868,3.086168,3.086129
3139,56041,WY,Uinta,20.220414,22.189685,7.190150,9.727151,0.966219,2.759922,2.072485,6.328975,6.815148,1.707290,2.069942
3140,56043,WY,Washakie,10.915407,10.915407,2.737939,3.621591,0.396304,1.203633,1.053980,2.162275,2.162275,2.572249,2.572249


The final index number is now showing correctly as 3141. I will now add in the columns from the Diabetes Atlas.

In [430]:
access_rec['pct_diabetes18'] = d18_rec['pct_diabetes18']

access_rec['class'] = d18_rec['class']

access_rec.head()

Unnamed: 0,fips,state,county,pct_laccess_pop10,pct_laccess_pop15,pct_laccess_lowi10,pct_laccess_lowi15,pct_laccess_hhnv10,pct_laccess_hhnv15,pct_laccess_snap15,pct_laccess_child10,pct_laccess_child15,pct_laccess_seniors10,pct_laccess_seniors15,pct_diabetes18,class
0,1001,AL,Autauga,33.769657,32.062255,9.79353,11.991125,3.284786,3.351332,4.608749,8.837112,8.460485,4.376378,3.996279,9.5,1.0
1,1003,AL,Baldwin,19.318473,16.767489,5.460261,5.424427,2.147827,1.905114,1.2989,4.343199,3.844936,3.51357,3.06184,8.4,0.0
2,1005,AL,Barbour,20.840972,22.10556,11.420316,10.739667,4.135869,4.329378,4.303147,3.425062,3.758341,2.805166,3.001695,13.5,1.0
3,1007,AL,Bibb,4.559753,4.230324,2.144661,2.601627,3.45858,2.821427,0.67671,1.087518,1.015242,0.657008,0.600865,10.2,1.0
4,1009,AL,Blount,2.70084,6.49738,1.062468,2.88015,3.26938,3.336414,0.812727,0.67149,1.58872,0.340269,0.882583,10.5,1.0


In [431]:
access_rec.shape

(3142, 16)

The two columns have been added. Now I will address the null values.

In [432]:
access_rec.isnull().sum()

fips                      0
state                     0
county                    0
pct_laccess_pop10         0
pct_laccess_pop15        19
pct_laccess_lowi10        0
pct_laccess_lowi15       19
pct_laccess_hhnv10        0
pct_laccess_hhnv15        2
pct_laccess_snap15       19
pct_laccess_child10       0
pct_laccess_child15      19
pct_laccess_seniors10     0
pct_laccess_seniors15    19
pct_diabetes18            0
class                     0
dtype: int64

It appears there are 19 counties with missing information for 2015.

I will create two separate datasets:

1) One with 2010 information only (drop 2015 columns)


In [433]:
access_rec10 = access_rec.drop(columns = ['pct_laccess_pop15', 'pct_laccess_lowi15', 'pct_laccess_hhnv15',
                                  'pct_laccess_snap15', 'pct_laccess_child15', 'pct_laccess_seniors15' ])

access_rec10.isnull().sum()

fips                     0
state                    0
county                   0
pct_laccess_pop10        0
pct_laccess_lowi10       0
pct_laccess_hhnv10       0
pct_laccess_child10      0
pct_laccess_seniors10    0
pct_diabetes18           0
class                    0
dtype: int64

2) One with complete records only (drop rows missing values)

In [434]:
access_rec = access_rec.dropna()

access_rec.isnull().sum()

fips                     0
state                    0
county                   0
pct_laccess_pop10        0
pct_laccess_pop15        0
pct_laccess_lowi10       0
pct_laccess_lowi15       0
pct_laccess_hhnv10       0
pct_laccess_hhnv15       0
pct_laccess_snap15       0
pct_laccess_child10      0
pct_laccess_child15      0
pct_laccess_seniors10    0
pct_laccess_seniors15    0
pct_diabetes18           0
class                    0
dtype: int64

In [435]:
access_rec10.head()

Unnamed: 0,fips,state,county,pct_laccess_pop10,pct_laccess_lowi10,pct_laccess_hhnv10,pct_laccess_child10,pct_laccess_seniors10,pct_diabetes18,class
0,1001,AL,Autauga,33.769657,9.79353,3.284786,8.837112,4.376378,9.5,1.0
1,1003,AL,Baldwin,19.318473,5.460261,2.147827,4.343199,3.51357,8.4,0.0
2,1005,AL,Barbour,20.840972,11.420316,4.135869,3.425062,2.805166,13.5,1.0
3,1007,AL,Bibb,4.559753,2.144661,3.45858,1.087518,0.657008,10.2,1.0
4,1009,AL,Blount,2.70084,1.062468,3.26938,0.67149,0.340269,10.5,1.0


In [436]:
access_rec10.shape

(3142, 10)

In [437]:
access_rec.head()

Unnamed: 0,fips,state,county,pct_laccess_pop10,pct_laccess_pop15,pct_laccess_lowi10,pct_laccess_lowi15,pct_laccess_hhnv10,pct_laccess_hhnv15,pct_laccess_snap15,pct_laccess_child10,pct_laccess_child15,pct_laccess_seniors10,pct_laccess_seniors15,pct_diabetes18,class
0,1001,AL,Autauga,33.769657,32.062255,9.79353,11.991125,3.284786,3.351332,4.608749,8.837112,8.460485,4.376378,3.996279,9.5,1.0
1,1003,AL,Baldwin,19.318473,16.767489,5.460261,5.424427,2.147827,1.905114,1.2989,4.343199,3.844936,3.51357,3.06184,8.4,0.0
2,1005,AL,Barbour,20.840972,22.10556,11.420316,10.739667,4.135869,4.329378,4.303147,3.425062,3.758341,2.805166,3.001695,13.5,1.0
3,1007,AL,Bibb,4.559753,4.230324,2.144661,2.601627,3.45858,2.821427,0.67671,1.087518,1.015242,0.657008,0.600865,10.2,1.0
4,1009,AL,Blount,2.70084,6.49738,1.062468,2.88015,3.26938,3.336414,0.812727,0.67149,1.58872,0.340269,0.882583,10.5,1.0


In [438]:
access_rec.shape

(3123, 16)
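The trade-off between the two datasets above can be sketched on a toy frame: dropping the incomplete columns keeps every county, while dropping the incomplete rows keeps every feature. The frame `toy` and its values are hypothetical, and selecting the 2015 columns by their `15` suffix is an assumption made for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing 2015 value.
toy = pd.DataFrame({
    "fips": [1, 2, 3],
    "pct_laccess_pop10": [33.8, 19.3, 20.8],
    "pct_laccess_pop15": [32.1, np.nan, 22.1],
})

# Strategy 1: keep all rows, drop the columns that contain gaps (the 2015 columns).
cols_2015 = [c for c in toy.columns if c.endswith("15")]
toy_2010 = toy.drop(columns=cols_2015)

# Strategy 2: keep all columns, drop the rows that contain gaps.
toy_complete = toy.dropna()

print(toy_2010.shape, toy_complete.shape)
```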

These dataframes are complete. I will save them as csv files.

In [439]:
access_rec10.to_csv('../data/FoodEnvironmentAtlas.xls.ACCESS10_Modeling.csv', index=False)

access_rec.to_csv('../data/FoodEnvironmentAtlas.xls.ACCESS_Modeling.csv', index=False)

## III. Assistance

In [440]:
file_path = '../data/FoodEnvironmentAtlas.xls.ASSISTANCE_REC.csv'

assistance_rec = pd.read_csv(file_path)

In [441]:
assistance_rec.shape

(3142, 22)

In [442]:
assistance_rec.head()

Unnamed: 0,fips,state,county,pct_snap12,pct_snap17,snap_part_rate11,snap_part_rate16,pct_nslp12,pct_nslp17,pct_sbp12,...,pct_wic17,pct_wicinfantchild14,pct_wicinfantchild16,pct_wicwomen14,pct_wicwomen16,pct_cacfp12,pct_cacfp17,fdpir12,fdpir15,food_banks18
0,1001,AL,Autauga,18.908476,16.500056,84.02,86.898,68.226043,63.12659,27.206328,...,2.54357,33.481211,32.910876,3.318827,3.309759,0.891239,1.258763,0,0,0
1,1003,AL,Baldwin,18.908476,16.500056,84.02,86.898,68.226043,63.12659,27.206328,...,2.54357,33.481211,32.910876,3.318827,3.309759,0.891239,1.258763,0,0,0
2,1005,AL,Barbour,18.908476,16.500056,84.02,86.898,68.226043,63.12659,27.206328,...,2.54357,33.481211,32.910876,3.318827,3.309759,0.891239,1.258763,0,0,0
3,1007,AL,Bibb,18.908476,16.500056,84.02,86.898,68.226043,63.12659,27.206328,...,2.54357,33.481211,32.910876,3.318827,3.309759,0.891239,1.258763,0,0,0
4,1009,AL,Blount,18.908476,16.500056,84.02,86.898,68.226043,63.12659,27.206328,...,2.54357,33.481211,32.910876,3.318827,3.309759,0.891239,1.258763,0,0,0


In [443]:
assistance_rec.tail()

Unnamed: 0,fips,state,county,pct_snap12,pct_snap17,snap_part_rate11,snap_part_rate16,pct_nslp12,pct_nslp17,pct_sbp12,...,pct_wic17,pct_wicinfantchild14,pct_wicinfantchild16,pct_wicwomen14,pct_wicwomen16,pct_cacfp12,pct_cacfp17,fdpir12,fdpir15,food_banks18
3137,56037,WY,Sweetwater,5.956719,5.668505,58.381,56.037,59.171454,49.458965,16.42267,...,1.753551,22.089798,21.296784,2.571757,2.375607,1.635123,1.220234,0,0,0
3138,56039,WY,Teton,5.956719,5.668505,58.381,56.037,59.171454,49.458965,16.42267,...,1.753551,22.089798,21.296784,2.571757,2.375607,1.635123,1.220234,0,0,0
3139,56041,WY,Uinta,5.956719,5.668505,58.381,56.037,59.171454,49.458965,16.42267,...,1.753551,22.089798,21.296784,2.571757,2.375607,1.635123,1.220234,0,0,0
3140,56043,WY,Washakie,5.956719,5.668505,58.381,56.037,59.171454,49.458965,16.42267,...,1.753551,22.089798,21.296784,2.571757,2.375607,1.635123,1.220234,0,0,0
3141,56045,WY,Weston,5.956719,5.668505,58.381,56.037,59.171454,49.458965,16.42267,...,1.753551,22.089798,21.296784,2.571757,2.375607,1.635123,1.220234,0,0,0


The shape and row numbers are correct. I will ensure the fips codes are accurate.

In [444]:
assistance_rec.fips.describe()

count     3142.000000
mean     30383.649268
std      15162.508374
min       1001.000000
25%      18177.500000
50%      29176.000000
75%      45080.500000
max      56045.000000
Name: fips, dtype: float64

In [445]:
d18_rec.fips.describe()

count     3142.000000
mean     30383.649268
std      15162.508374
min       1001.000000
25%      18177.500000
50%      29176.000000
75%      45080.500000
max      56045.000000
Name: fips, dtype: float64

The datasets match in terms of layout, number of rows, and fips codes.
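Matching summary statistics show that the two columns hold the same set of values, but not necessarily in the same row order. As a hedged sketch of a stricter check (the toy frames `a` and `b` are hypothetical), `Series.equals` compares the columns row by row:

```python
import pandas as pd

# Hypothetical toy frames: same fips codes, different row order.
a = pd.DataFrame({"fips": [1001, 1003, 1005]})
b = pd.DataFrame({"fips": [1001, 1005, 1003]})

# Summary statistics agree because the set of values is identical.
same_summary = a["fips"].describe().equals(b["fips"].describe())

# A row-by-row comparison reveals the difference in order.
same_order = a["fips"].equals(b["fips"])

print("same summary:", same_summary)
print("same row order:", same_order)
```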

I will add the Diabetes Atlas columns to the Assistance dataframe.

In [446]:
assistance_rec['pct_diabetes18'] = d18_rec['pct_diabetes18']

assistance_rec['class'] = d18_rec['class']

assistance_rec.head()

Unnamed: 0,fips,state,county,pct_snap12,pct_snap17,snap_part_rate11,snap_part_rate16,pct_nslp12,pct_nslp17,pct_sbp12,...,pct_wicinfantchild16,pct_wicwomen14,pct_wicwomen16,pct_cacfp12,pct_cacfp17,fdpir12,fdpir15,food_banks18,pct_diabetes18,class
0,1001,AL,Autauga,18.908476,16.500056,84.02,86.898,68.226043,63.12659,27.206328,...,32.910876,3.318827,3.309759,0.891239,1.258763,0,0,0,9.5,1.0
1,1003,AL,Baldwin,18.908476,16.500056,84.02,86.898,68.226043,63.12659,27.206328,...,32.910876,3.318827,3.309759,0.891239,1.258763,0,0,0,8.4,0.0
2,1005,AL,Barbour,18.908476,16.500056,84.02,86.898,68.226043,63.12659,27.206328,...,32.910876,3.318827,3.309759,0.891239,1.258763,0,0,0,13.5,1.0
3,1007,AL,Bibb,18.908476,16.500056,84.02,86.898,68.226043,63.12659,27.206328,...,32.910876,3.318827,3.309759,0.891239,1.258763,0,0,0,10.2,1.0
4,1009,AL,Blount,18.908476,16.500056,84.02,86.898,68.226043,63.12659,27.206328,...,32.910876,3.318827,3.309759,0.891239,1.258763,0,0,0,10.5,1.0


In [447]:
assistance_rec.shape

(3142, 24)

The columns have been added. I will now address the null values.

In [448]:
assistance_rec.isnull().sum()

fips                    0
state                   0
county                  0
pct_snap12              0
pct_snap17              0
snap_part_rate11        0
snap_part_rate16        0
pct_nslp12              0
pct_nslp17              0
pct_sbp12               0
pct_sbp17               0
pct_wic12               0
pct_wic17               0
pct_wicinfantchild14    0
pct_wicinfantchild16    0
pct_wicwomen14          0
pct_wicwomen16          0
pct_cacfp12             0
pct_cacfp17             0
fdpir12                 0
fdpir15                 0
food_banks18            0
pct_diabetes18          0
class                   0
dtype: int64

There are no null values. The dataframe will be saved as-is.

In [449]:
assistance_rec.to_csv('../data/FoodEnvironmentAtlas.xls.ASSISTANCE_Modeling.csv', index=False)

## IV. Insecurity

In [450]:
file_path = '../data/FoodEnvironmentAtlas.xls.INSECURITY_REC.csv'

insecurity_rec = pd.read_csv(file_path)

In [451]:
insecurity_rec.shape

(3142, 9)

In [452]:
d18_rec.shape

(3142, 5)

In [453]:
insecurity_rec.head()

Unnamed: 0,fips,state,county,foodinsec_12_14,foodinsec_15_17,ch_foodinsec_14_17,vlfoodsec_12_14,vlfoodsec_15_17,ch_vlfoodsec_14_17
0,1001,AL,Autauga,16.8,16.3,-0.5,7.2,7.1,-0.1
1,1003,AL,Baldwin,16.8,16.3,-0.5,7.2,7.1,-0.1
2,1005,AL,Barbour,16.8,16.3,-0.5,7.2,7.1,-0.1
3,1007,AL,Bibb,16.8,16.3,-0.5,7.2,7.1,-0.1
4,1009,AL,Blount,16.8,16.3,-0.5,7.2,7.1,-0.1


In [454]:
d18_rec.head()

Unnamed: 0,fips,county,state,pct_diabetes18,class
0,1001,Autauga,Alabama,9.5,1.0
1,1003,Baldwin,Alabama,8.4,0.0
2,1005,Barbour,Alabama,13.5,1.0
3,1007,Bibb,Alabama,10.2,1.0
4,1009,Blount,Alabama,10.5,1.0


In [455]:
insecurity_rec.tail()

Unnamed: 0,fips,state,county,foodinsec_12_14,foodinsec_15_17,ch_foodinsec_14_17,vlfoodsec_12_14,vlfoodsec_15_17,ch_vlfoodsec_14_17
3137,56037,WY,Sweetwater,14.0,13.2,-0.8,5.3,5.1,-0.2
3138,56039,WY,Teton,14.0,13.2,-0.8,5.3,5.1,-0.2
3139,56041,WY,Uinta,14.0,13.2,-0.8,5.3,5.1,-0.2
3140,56043,WY,Washakie,14.0,13.2,-0.8,5.3,5.1,-0.2
3141,56045,WY,Weston,14.0,13.2,-0.8,5.3,5.1,-0.2


In [456]:
d18_rec.tail()

Unnamed: 0,fips,county,state,pct_diabetes18,class
3137,56037,Sweetwater,Wyoming,7.8,0.0
3138,56039,Teton,Wyoming,3.8,0.0
3139,56041,Uinta,Wyoming,8.4,0.0
3140,56043,Washakie,Wyoming,7.4,0.0
3141,56045,Weston,Wyoming,7.6,0.0


In [457]:
insecurity_rec.fips.describe()

count     3142.000000
mean     30383.649268
std      15162.508374
min       1001.000000
25%      18177.500000
50%      29176.000000
75%      45080.500000
max      56045.000000
Name: fips, dtype: float64

In [458]:
d18_rec.fips.describe()

count     3142.000000
mean     30383.649268
std      15162.508374
min       1001.000000
25%      18177.500000
50%      29176.000000
75%      45080.500000
max      56045.000000
Name: fips, dtype: float64

The top, bottom, layout, and fips codes match between the two datasets. I will add the Diabetes Atlas information into the Insecurity dataset.

In [459]:
insecurity_rec['pct_diabetes18'] = d18_rec['pct_diabetes18']

insecurity_rec['class'] = d18_rec['class']

insecurity_rec.head()

Unnamed: 0,fips,state,county,foodinsec_12_14,foodinsec_15_17,ch_foodinsec_14_17,vlfoodsec_12_14,vlfoodsec_15_17,ch_vlfoodsec_14_17,pct_diabetes18,class
0,1001,AL,Autauga,16.8,16.3,-0.5,7.2,7.1,-0.1,9.5,1.0
1,1003,AL,Baldwin,16.8,16.3,-0.5,7.2,7.1,-0.1,8.4,0.0
2,1005,AL,Barbour,16.8,16.3,-0.5,7.2,7.1,-0.1,13.5,1.0
3,1007,AL,Bibb,16.8,16.3,-0.5,7.2,7.1,-0.1,10.2,1.0
4,1009,AL,Blount,16.8,16.3,-0.5,7.2,7.1,-0.1,10.5,1.0


In [460]:
insecurity_rec.shape

(3142, 11)

The diabetes values were added. I will now address null values.

In [461]:
insecurity_rec.isnull().sum()

fips                  0
state                 0
county                0
foodinsec_12_14       0
foodinsec_15_17       0
ch_foodinsec_14_17    0
vlfoodsec_12_14       0
vlfoodsec_15_17       0
ch_vlfoodsec_14_17    0
pct_diabetes18        0
class                 0
dtype: int64

There are no null values. The dataframe will be saved into a csv as-is.

In [462]:
insecurity_rec.to_csv('../data/FoodEnvironmentAtlas.xls.INSECURITY_Modeling.csv', index=False)

## V. Local

In [463]:
file_path = '../data/FoodEnvironmentAtlas.xls.LOCAL_REC.csv'

local_rec = pd.read_csv(file_path)

In [464]:
local_rec.shape

(3142, 100)

In [465]:
d18_rec.shape

(3142, 5)

In [466]:
local_rec.head()

Unnamed: 0,fips,state,county,dirsales_farms07,dirsales_farms12,pch_dirsales_farms_07_12,pct_loclfarm07,pct_loclfarm12,pct_loclsale07,pct_loclsale12,...,csa12,pch_csa_07_12,agritrsm_ops07,agritrsm_ops12,pch_agritrsm_ops_07_12,agritrsm_rct07,agritrsm_rct12,pch_agritrsm_rct_07_12,farm_to_school13,farm_to_school15
0,1001,AL,Autauga,25.0,51.0,104.0,6.024096,13.11054,0.596374,1.554692,...,3.0,50.0,7.0,10.0,42.857143,228000.0,146000.0,-35.964912,,0.0
1,1003,AL,Baldwin,80.0,103.0,28.75,7.023705,10.41456,0.712634,0.47801,...,7.0,-46.153846,18.0,16.0,-11.111111,124000.0,204000.0,64.516129,0.0,1.0
2,1005,AL,Barbour,18.0,13.0,-27.777778,2.889246,2.276708,0.015403,0.012457,...,0.0,-100.0,27.0,32.0,18.518519,163000.0,304000.0,86.503067,1.0,0.0
3,1007,AL,Bibb,12.0,13.0,8.333333,5.687204,6.878307,,,...,3.0,50.0,5.0,6.0,20.0,,21000.0,,0.0,0.0
4,1009,AL,Blount,84.0,88.0,4.761905,5.940594,7.091056,0.267717,0.277792,...,4.0,-42.857143,10.0,8.0,-20.0,293000.0,30000.0,-89.761092,1.0,0.0


In [467]:
d18_rec.head()

Unnamed: 0,fips,county,state,pct_diabetes18,class
0,1001,Autauga,Alabama,9.5,1.0
1,1003,Baldwin,Alabama,8.4,0.0
2,1005,Barbour,Alabama,13.5,1.0
3,1007,Bibb,Alabama,10.2,1.0
4,1009,Blount,Alabama,10.5,1.0


In [468]:
local_rec.tail()

Unnamed: 0,fips,state,county,dirsales_farms07,dirsales_farms12,pch_dirsales_farms_07_12,pct_loclfarm07,pct_loclfarm12,pct_loclsale07,pct_loclsale12,...,csa12,pch_csa_07_12,agritrsm_ops07,agritrsm_ops12,pch_agritrsm_ops_07_12,agritrsm_rct07,agritrsm_rct12,pch_agritrsm_rct_07_12,farm_to_school13,farm_to_school15
3137,56037,WY,Sweetwater,15.0,22.0,46.666667,6.147541,8.627451,1.089204,0.444697,...,0.0,,1.0,2.0,100.0,,,,0.0,0.0
3138,56039,WY,Teton,4.0,11.0,175.0,2.222222,7.142857,0.021817,0.265604,...,0.0,,5.0,12.0,140.0,1614000.0,,,0.0,0.0
3139,56041,WY,Uinta,13.0,24.0,84.615385,3.77907,7.619048,,0.445308,...,0.0,-100.0,5.0,9.0,80.0,105000.0,,,0.0,0.0
3140,56043,WY,Washakie,25.0,5.0,-80.0,11.682243,2.392344,0.154231,,...,1.0,,8.0,6.0,-25.0,70000.0,62000.0,-11.428571,0.0,0.0
3141,56045,WY,Weston,15.0,3.0,-80.0,6.329114,1.136364,,,...,0.0,,14.0,9.0,-35.714286,147000.0,71000.0,-51.70068,0.0,0.0


In [469]:
d18_rec.tail()

Unnamed: 0,fips,county,state,pct_diabetes18,class
3137,56037,Sweetwater,Wyoming,7.8,0.0
3138,56039,Teton,Wyoming,3.8,0.0
3139,56041,Uinta,Wyoming,8.4,0.0
3140,56043,Washakie,Wyoming,7.4,0.0
3141,56045,Weston,Wyoming,7.6,0.0


In [470]:
local_rec.fips.describe()

count     3142.000000
mean     30383.649268
std      15162.508374
min       1001.000000
25%      18177.500000
50%      29176.000000
75%      45080.500000
max      56045.000000
Name: fips, dtype: float64

In [471]:
d18_rec.fips.describe()

count     3142.000000
mean     30383.649268
std      15162.508374
min       1001.000000
25%      18177.500000
50%      29176.000000
75%      45080.500000
max      56045.000000
Name: fips, dtype: float64

In [472]:
local_rec.loc[local_rec['county'] == 'Oglala Lakota']

Unnamed: 0,fips,state,county,dirsales_farms07,dirsales_farms12,pch_dirsales_farms_07_12,pct_loclfarm07,pct_loclfarm12,pct_loclsale07,pct_loclsale12,...,csa12,pch_csa_07_12,agritrsm_ops07,agritrsm_ops12,pch_agritrsm_ops_07_12,agritrsm_rct07,agritrsm_rct12,pch_agritrsm_rct_07_12,farm_to_school13,farm_to_school15
2412,46102,SD,Oglala Lakota,0.0,0.0,,,,,,...,0.0,,0.0,0.0,,0.0,0.0,,0.0,0.0


In [473]:
local_rec.loc[local_rec['county'] == 'Kusilvak Census Area']

Unnamed: 0,fips,state,county,dirsales_farms07,dirsales_farms12,pch_dirsales_farms_07_12,pct_loclfarm07,pct_loclfarm12,pct_loclsale07,pct_loclsale12,...,csa12,pch_csa_07_12,agritrsm_ops07,agritrsm_ops12,pch_agritrsm_ops_07_12,agritrsm_rct07,agritrsm_rct12,pch_agritrsm_rct_07_12,farm_to_school13,farm_to_school15
81,2158,AK,Kusilvak Census Area,,,,,,,,...,,,,,,,,,1.0,1.0


Certain rows in this dataframe are missing many values. However, the layout, the first and last rows, and the summary statistics of the fips column all match between the Local dataframe and the Diabetes Atlas. I will add in the diabetes prevalence information.

In [474]:
local_rec['pct_diabetes18'] = d18_rec['pct_diabetes18']

local_rec['class'] = d18_rec['class']

local_rec.head()

Unnamed: 0,fips,state,county,dirsales_farms07,dirsales_farms12,pch_dirsales_farms_07_12,pct_loclfarm07,pct_loclfarm12,pct_loclsale07,pct_loclsale12,...,agritrsm_ops07,agritrsm_ops12,pch_agritrsm_ops_07_12,agritrsm_rct07,agritrsm_rct12,pch_agritrsm_rct_07_12,farm_to_school13,farm_to_school15,pct_diabetes18,class
0,1001,AL,Autauga,25.0,51.0,104.0,6.024096,13.11054,0.596374,1.554692,...,7.0,10.0,42.857143,228000.0,146000.0,-35.964912,,0.0,9.5,1.0
1,1003,AL,Baldwin,80.0,103.0,28.75,7.023705,10.41456,0.712634,0.47801,...,18.0,16.0,-11.111111,124000.0,204000.0,64.516129,0.0,1.0,8.4,0.0
2,1005,AL,Barbour,18.0,13.0,-27.777778,2.889246,2.276708,0.015403,0.012457,...,27.0,32.0,18.518519,163000.0,304000.0,86.503067,1.0,0.0,13.5,1.0
3,1007,AL,Bibb,12.0,13.0,8.333333,5.687204,6.878307,,,...,5.0,6.0,20.0,,21000.0,,0.0,0.0,10.2,1.0
4,1009,AL,Blount,84.0,88.0,4.761905,5.940594,7.091056,0.267717,0.277792,...,10.0,8.0,-20.0,293000.0,30000.0,-89.761092,1.0,0.0,10.5,1.0


In [475]:
local_rec.shape

(3142, 102)
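
The column assignment above lines rows up by index position, so it relies on both dataframes sharing the same row order. As a minimal sketch of an alternative that does not depend on order, the target columns could be joined on `fips`. The two small frames below are toy stand-ins for `local_rec` and `d18_rec`, deliberately placed in different row orders; this is an illustration, not the method used in this notebook.

```python
import pandas as pd

# Toy stand-ins for local_rec and d18_rec (hypothetical, different row order on purpose).
features = pd.DataFrame({"fips": [1001, 1003, 1005],
                         "dirsales_farms07": [25.0, 80.0, 18.0]})
targets = pd.DataFrame({"fips": [1005, 1001, 1003],
                        "pct_diabetes18": [13.5, 9.5, 8.4],
                        "class": [1.0, 1.0, 0.0]})

# Join on the county key; validate raises an error if fips is duplicated on either side.
merged = features.merge(targets, on="fips", how="left", validate="one_to_one")
print(merged)
```

A left merge keeps every row of the feature frame, so any county without a target would show up as a null in `pct_diabetes18` rather than silently receiving another county's value.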

The diabetes information has been added. I will now split the data into separate dataframes, and then address the null values.

### 1. Sales

In [476]:
local_sales = local_rec[['fips', 'state', 'county', 'dirsales_farms07', 'dirsales_farms12',
       'pch_dirsales_farms_07_12', 'pct_loclfarm07', 'pct_loclfarm12',
       'pct_loclsale07', 'pct_loclsale12', 'dirsales07', 'dirsales12',
       'pch_dirsales_07_12', 'pc_dirsales07', 'pc_dirsales12',
       'pch_pc_dirsales_07_12','pct_diabetes18', 'class']]

In [477]:
local_sales = pd.DataFrame(local_sales)

In [478]:
local_sales.isnull().sum()

fips                          0
state                         0
county                        0
dirsales_farms07             62
dirsales_farms12             62
pch_dirsales_farms_07_12    123
pct_loclfarm07               68
pct_loclfarm12               67
pct_loclsale07              345
pct_loclsale12              286
dirsales07                  287
dirsales12                  241
pch_dirsales_07_12          458
pc_dirsales07               287
pc_dirsales12               242
pch_pc_dirsales_07_12       458
pct_diabetes18                0
class                         0
dtype: int64

##### Percent change columns

My hypothesis is that some of the missing percent change values (among others) occur where the calculation would require division by zero. I will display the rows with missing values in one of the percent change columns.

In [479]:
local_sales.loc[local_sales['pch_dirsales_farms_07_12'].isnull()]

Unnamed: 0,fips,state,county,dirsales_farms07,dirsales_farms12,pch_dirsales_farms_07_12,pct_loclfarm07,pct_loclfarm12,pct_loclsale07,pct_loclsale12,dirsales07,dirsales12,pch_dirsales_07_12,pc_dirsales07,pc_dirsales12,pch_pc_dirsales_07_12,pct_diabetes18,class
67,2013,AK,Aleutians East,,,,,,,,,,,,,,9.3,1.0
68,2016,AK,Aleutians West,,,,,,,,,,,,,,7.9,0.0
70,2050,AK,Bethel,,,,,,,,,,,,,,8.0,0.0
71,2060,AK,Bristol Bay,,,,,,,,,,,,,,9.1,1.0
72,2068,AK,Denali,,,,,,,,,,,,,,6.8,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2947,51790,VA,Staunton,,,,,,,,,,,,,,9.3,1.0
2950,51820,VA,Waynesboro,,,,,,,,,,,,,,13.2,1.0
2951,51830,VA,Williamsburg,,,,,,,,,,,,,,9.0,1.0
2952,51840,VA,Winchester,,,,,,,,,,,,,,9.3,1.0


In this case, it seems the problem is that there were no values with which to calculate a percent change.
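
The hypothesis can also be checked by counting rather than by eye. The sketch below uses a small made-up frame with hypothetical column names (`v07`, `v12`, `pch`) and splits the rows with a missing percent change into those whose inputs are missing and those with a zero baseline.

```python
import numpy as np
import pandas as pd

# Made-up data: one complete row, one zero baseline, two rows with missing inputs.
toy = pd.DataFrame({
    "v07": [25.0, 0.0, np.nan, 12.0],
    "v12": [51.0, 3.0, np.nan, np.nan],
    "pch": [104.0, np.nan, np.nan, np.nan],
})

missing = toy[toy["pch"].isnull()]

# A percent change cannot be computed if either input value is absent...
inputs_missing = missing[["v07", "v12"]].isnull().any(axis=1)
# ...or if both inputs exist but the baseline year is zero.
zero_baseline = missing["v07"].eq(0) & ~inputs_missing

summary = {"inputs_missing": int(inputs_missing.sum()),
           "zero_baseline": int(zero_baseline.sum())}
print(summary)
```

Applied to `local_sales`, the same two masks would show how much of the missingness each explanation accounts for.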

While I would like to predict for as many counties as possible, I think I will need to drop rows that are missing values. I will still be able to predict for the excluded counties using other datasets.

In [480]:
local_sales = local_sales.dropna()

In [481]:
local_sales.isnull().sum()

fips                        0
state                       0
county                      0
dirsales_farms07            0
dirsales_farms12            0
pch_dirsales_farms_07_12    0
pct_loclfarm07              0
pct_loclfarm12              0
pct_loclsale07              0
pct_loclsale12              0
dirsales07                  0
dirsales12                  0
pch_dirsales_07_12          0
pc_dirsales07               0
pc_dirsales12               0
pch_pc_dirsales_07_12       0
pct_diabetes18              0
class                       0
dtype: int64

The rows missing values have been removed. I will observe the revised layout of the dataframe.

In [482]:
local_sales.shape

(2624, 18)

Using this dataset, I will still be able to predict for the majority of counties. 

The dataframe will be saved as a csv file for later modeling.

In [483]:
local_sales.to_csv('../data/FoodEnvironmentAtlas.xls.LOCAL_Modeling_Sales.csv', index=False)

I will now move to the next subset.

### 2. Farmer's Markets - Payment

In [484]:
local_frmrkt_payment = local_rec[['fips', 'state', 'county', 'fmrkt13', 'fmrkt18', 'pch_fmrkt_13_18',
       'fmrktpth13', 'fmrktpth18', 'pch_fmrktpth_13_18', 'fmrkt_snap18',
       'pct_fmrkt_snap18', 'fmrkt_wic18', 'pct_fmrkt_wic18', 'fmrkt_wiccash18',
       'pct_fmrkt_wiccash18', 'fmrkt_sfmnp18', 'pct_fmrkt_sfmnp18',
       'fmrkt_credit18', 'pct_fmrkt_credit18', 'pct_diabetes18', 'class']]

In [485]:
local_frmrkt_payment = pd.DataFrame(local_frmrkt_payment)

In [486]:
local_frmrkt_payment.isnull().sum()

fips                     0
state                    0
county                   0
fmrkt13                  1
fmrkt18                  1
pch_fmrkt_13_18        166
fmrktpth13               2
fmrktpth18               2
pch_fmrktpth_13_18     167
fmrkt_snap18             1
pct_fmrkt_snap18         1
fmrkt_wic18              1
pct_fmrkt_wic18          1
fmrkt_wiccash18          1
pct_fmrkt_wiccash18      1
fmrkt_sfmnp18            1
pct_fmrkt_sfmnp18        1
fmrkt_credit18           1
pct_fmrkt_credit18       1
pct_diabetes18           0
class                    0
dtype: int64

There may be one row that is missing the vast majority of its data. I will observe.

In [487]:
local_frmrkt_payment.loc[local_frmrkt_payment['fmrkt13'].isnull()]

Unnamed: 0,fips,state,county,fmrkt13,fmrkt18,pch_fmrkt_13_18,fmrktpth13,fmrktpth18,pch_fmrktpth_13_18,fmrkt_snap18,...,fmrkt_wic18,pct_fmrkt_wic18,fmrkt_wiccash18,pct_fmrkt_wiccash18,fmrkt_sfmnp18,pct_fmrkt_sfmnp18,fmrkt_credit18,pct_fmrkt_credit18,pct_diabetes18,class
81,2158,AK,Kusilvak Census Area,,,,,,,,...,,,,,,,,,7.4,0.0


It happens to be the Kusilvak Census Area.

Regardless, I cannot use this row since it only contains the diabetes data. In fact, I would prefer not to use any rows that are missing data.

First, I will drop the two percent change columns, which are missing data in 166 and 167 rows respectively. Because a percent change is derived from the 2013 and 2018 values, which are kept, these columns are not necessary anyway.

In [488]:
local_frmrkt_payment.drop(columns = ['pch_fmrkt_13_18', 'pch_fmrktpth_13_18'], inplace = True)

local_frmrkt_payment.isnull().sum()

fips                   0
state                  0
county                 0
fmrkt13                1
fmrkt18                1
fmrktpth13             2
fmrktpth18             2
fmrkt_snap18           1
pct_fmrkt_snap18       1
fmrkt_wic18            1
pct_fmrkt_wic18        1
fmrkt_wiccash18        1
pct_fmrkt_wiccash18    1
fmrkt_sfmnp18          1
pct_fmrkt_sfmnp18      1
fmrkt_credit18         1
pct_fmrkt_credit18     1
pct_diabetes18         0
class                  0
dtype: int64
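
If a percent change feature is wanted later, it can be rebuilt from the two level columns. In the Local data shown earlier, 25 farms in 2007 and 51 in 2012 correspond to a reported change of 104.0, which is consistent with (new - old) / old * 100. A minimal sketch, assuming that formula also applies to the farmers' market columns and using a hypothetical helper name:

```python
import pandas as pd

def pct_change(old: pd.Series, new: pd.Series) -> pd.Series:
    """Percent change from old to new; NaN where the baseline is zero or missing."""
    change = (new - old) / old * 100
    # Division by a zero baseline yields inf (or NaN for 0/0); mask those rows to NaN.
    return change.where(old.ne(0))

old = pd.Series([25.0, 80.0, 0.0])
new = pd.Series([51.0, 103.0, 3.0])
result = pct_change(old, new)
print(result.tolist())
```

Masking the zero-baseline rows keeps infinite values out of the feature matrix, at the cost of leaving those rows null.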

Additionally, I will drop the Kusilvak Census Area, since it is missing most of its values.

In [489]:
local_frmrkt_payment = local_frmrkt_payment[local_frmrkt_payment['county'] != 'Kusilvak Census Area']

local_frmrkt_payment.isnull().sum()

fips                   0
state                  0
county                 0
fmrkt13                0
fmrkt18                0
fmrktpth13             1
fmrktpth18             1
fmrkt_snap18           0
pct_fmrkt_snap18       0
fmrkt_wic18            0
pct_fmrkt_wic18        0
fmrkt_wiccash18        0
pct_fmrkt_wiccash18    0
fmrkt_sfmnp18          0
pct_fmrkt_sfmnp18      0
fmrkt_credit18         0
pct_fmrkt_credit18     0
pct_diabetes18         0
class                  0
dtype: int64

Kusilvak Census Area has been dropped. It appears there may be one county missing its farmers' markets per 1,000 population values for 2013 and 2018 (fmrktpth13 and fmrktpth18). I will observe.

In [490]:
local_frmrkt_payment.loc[local_frmrkt_payment['fmrktpth13'].isnull()]

Unnamed: 0,fips,state,county,fmrkt13,fmrkt18,fmrktpth13,fmrktpth18,fmrkt_snap18,pct_fmrkt_snap18,fmrkt_wic18,pct_fmrkt_wic18,fmrkt_wiccash18,pct_fmrkt_wiccash18,fmrkt_sfmnp18,pct_fmrkt_sfmnp18,fmrkt_credit18,pct_fmrkt_credit18,pct_diabetes18,class
2412,46102,SD,Oglala Lakota,1.0,1.0,,,1.0,100.0,0.0,0.0,0.0,0.0,1.0,100.0,1.0,100.0,17.9,1.0


It happens to be another county of interest: Oglala Lakota. 

It appears that this county had 1 farmer's market in 2013, and still had 1 in 2018.

The Supplemental County data shows that the county's 2013 population estimate was 14,127 and its 2018 population estimate was 14,309.

I will use these to calculate and impute the missing per-1,000-population values. I will confirm below that a decimal value is what is needed.

In [491]:
local_frmrkt_payment.dtypes

fips                     int64
state                   object
county                  object
fmrkt13                float64
fmrkt18                float64
fmrktpth13             float64
fmrktpth18             float64
fmrkt_snap18           float64
pct_fmrkt_snap18       float64
fmrkt_wic18            float64
pct_fmrkt_wic18        float64
fmrkt_wiccash18        float64
pct_fmrkt_wiccash18    float64
fmrkt_sfmnp18          float64
pct_fmrkt_sfmnp18      float64
fmrkt_credit18         float64
pct_fmrkt_credit18     float64
pct_diabetes18         float64
class                  float64
dtype: object

The 2 referenced columns are float datatypes. I will now enter the values.

In [492]:
local_frmrkt_payment.loc[local_frmrkt_payment['fips'] == 46102, 'fmrktpth13'] = 1/14.127 #(there are 14.127 thousands)

local_frmrkt_payment.loc[local_frmrkt_payment['fips'] == 46102]

Unnamed: 0,fips,state,county,fmrkt13,fmrkt18,fmrktpth13,fmrktpth18,fmrkt_snap18,pct_fmrkt_snap18,fmrkt_wic18,pct_fmrkt_wic18,fmrkt_wiccash18,pct_fmrkt_wiccash18,fmrkt_sfmnp18,pct_fmrkt_sfmnp18,fmrkt_credit18,pct_fmrkt_credit18,pct_diabetes18,class
2412,46102,SD,Oglala Lakota,1.0,1.0,0.070786,,1.0,100.0,0.0,0.0,0.0,0.0,1.0,100.0,1.0,100.0,17.9,1.0


In [493]:
local_frmrkt_payment.loc[local_frmrkt_payment['fips'] == 46102, 'fmrktpth18'] = 1/14.309 #(there are 14.309 thousands)

local_frmrkt_payment.loc[local_frmrkt_payment['fips'] == 46102]

Unnamed: 0,fips,state,county,fmrkt13,fmrkt18,fmrktpth13,fmrktpth18,fmrkt_snap18,pct_fmrkt_snap18,fmrkt_wic18,pct_fmrkt_wic18,fmrkt_wiccash18,pct_fmrkt_wiccash18,fmrkt_sfmnp18,pct_fmrkt_sfmnp18,fmrkt_credit18,pct_fmrkt_credit18,pct_diabetes18,class
2412,46102,SD,Oglala Lakota,1.0,1.0,0.070786,0.069886,1.0,100.0,0.0,0.0,0.0,0.0,1.0,100.0,1.0,100.0,17.9,1.0
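
The arithmetic behind the two imputed values can be written out as a small helper. This is a minimal sketch; `per_thousand` is a hypothetical name, and the counts and population estimates are the ones quoted above.

```python
def per_thousand(count: float, population: float) -> float:
    """Return a count expressed per 1,000 residents."""
    return count / (population / 1000)

# One farmers' market in each year, with the 2013 and 2018 population estimates.
fmrktpth13 = per_thousand(1, 14127)
fmrktpth18 = per_thousand(1, 14309)
print(round(fmrktpth13, 6), round(fmrktpth18, 6))
```

Passing the raw population and dividing by 1,000 inside the function avoids retyping the population in thousands by hand.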


The missing values have been filled in. I will ensure there are no additional missing values before saving the dataframe.

In [494]:
local_frmrkt_payment.isnull().sum()

fips                   0
state                  0
county                 0
fmrkt13                0
fmrkt18                0
fmrktpth13             0
fmrktpth18             0
fmrkt_snap18           0
pct_fmrkt_snap18       0
fmrkt_wic18            0
pct_fmrkt_wic18        0
fmrkt_wiccash18        0
pct_fmrkt_wiccash18    0
fmrkt_sfmnp18          0
pct_fmrkt_sfmnp18      0
fmrkt_credit18         0
pct_fmrkt_credit18     0
pct_diabetes18         0
class                  0
dtype: int64

There are no additional null values.

In [495]:
local_frmrkt_payment.to_csv('../data/FoodEnvironmentAtlas.xls.LOCAL_Modeling_FrmrktP.csv', index=False)

### 3. Farmer's Markets - Foods

In [496]:
local_frmrkt_foods = local_rec[['fips', 'state', 'county', 'fmrkt_frveg18',
       'pct_fmrkt_frveg18', 'fmrkt_anmlprod18', 'pct_fmrkt_anmlprod18',
       'fmrkt_baked18', 'pct_fmrkt_baked18', 'fmrkt_otherfood18',
       'pct_fmrkt_otherfood18', 'pct_diabetes18', 'class']]

In [497]:
local_frmrkt_foods = pd.DataFrame(local_frmrkt_foods)

In [498]:
local_frmrkt_foods.isnull().sum()

fips                     0
state                    0
county                   0
fmrkt_frveg18            1
pct_fmrkt_frveg18        1
fmrkt_anmlprod18         1
pct_fmrkt_anmlprod18     1
fmrkt_baked18            1
pct_fmrkt_baked18        1
fmrkt_otherfood18        1
pct_fmrkt_otherfood18    1
pct_diabetes18           0
class                    0
dtype: int64

There may be just one row missing values. I will observe.

In [499]:
local_frmrkt_foods.loc[local_frmrkt_foods['fmrkt_frveg18'].isnull()]

Unnamed: 0,fips,state,county,fmrkt_frveg18,pct_fmrkt_frveg18,fmrkt_anmlprod18,pct_fmrkt_anmlprod18,fmrkt_baked18,pct_fmrkt_baked18,fmrkt_otherfood18,pct_fmrkt_otherfood18,pct_diabetes18,class
81,2158,AK,Kusilvak Census Area,,,,,,,,,7.4,0.0


It appears that my hypothesis was correct. I will remove the Kusilvak Census Area from the dataframe.

In [500]:
local_frmrkt_foods = local_frmrkt_foods[local_frmrkt_foods['fips'] != 2158]

local_frmrkt_foods.isnull().sum()

fips                     0
state                    0
county                   0
fmrkt_frveg18            0
pct_fmrkt_frveg18        0
fmrkt_anmlprod18         0
pct_fmrkt_anmlprod18     0
fmrkt_baked18            0
pct_fmrkt_baked18        0
fmrkt_otherfood18        0
pct_fmrkt_otherfood18    0
pct_diabetes18           0
class                    0
dtype: int64

There are no additional null values. The dataframe will be saved as-is.

In [501]:
local_frmrkt_foods.to_csv('../data/FoodEnvironmentAtlas.xls.LOCAL_Modeling_FrmrktF.csv', index=False)

### 4. Vegetable Acres/Farms

In [502]:
local_vegfarms = local_rec[['fips', 'state', 'county', 'veg_farms07', 'veg_farms12',
       'pch_veg_farms_07_12', 'veg_acres07', 'veg_acres12',
       'pch_veg_acres_07_12', 'veg_acrespth07', 'veg_acrespth12',
       'pch_veg_acrespth_07_12', 'freshveg_farms07', 'freshveg_farms12',
       'pch_freshveg_farms_07_12', 'freshveg_acres07', 'freshveg_acres12',
       'pch_freshveg_acres_07_12', 'freshveg_acrespth07',
       'freshveg_acrespth12', 'pch_freshveg_acrespth_07_12', 'pct_diabetes18', 'class']]

In [503]:
local_vegfarms = pd.DataFrame(local_vegfarms)

In [504]:
local_vegfarms.isnull().sum()

fips                              0
state                             0
county                            0
veg_farms07                      62
veg_farms12                      62
pch_veg_farms_07_12             349
veg_acres07                     610
veg_acres12                     561
pch_veg_acres_07_12            1082
veg_acrespth07                  610
veg_acrespth12                  611
pch_veg_acrespth_07_12          897
freshveg_farms07                 62
freshveg_farms12                 62
pch_freshveg_farms_07_12        363
freshveg_acres07               1299
freshveg_acres12               1225
pch_freshveg_acres_07_12       2090
freshveg_acrespth07            1299
freshveg_acrespth12            1226
pch_freshveg_acrespth_07_12    2090
pct_diabetes18                    0
class                             0
dtype: int64

There are many missing values! I will begin by removing the percent change columns.

In [505]:
local_vegfarms.drop(columns = ['pch_veg_farms_07_12', 'pch_veg_acres_07_12', 'pch_veg_acrespth_07_12',
                              'pch_freshveg_farms_07_12', 'pch_freshveg_acres_07_12', 'pch_freshveg_acrespth_07_12'],
                   inplace = True)

local_vegfarms.isnull().sum()

fips                      0
state                     0
county                    0
veg_farms07              62
veg_farms12              62
veg_acres07             610
veg_acres12             561
veg_acrespth07          610
veg_acrespth12          611
freshveg_farms07         62
freshveg_farms12         62
freshveg_acres07       1299
freshveg_acres12       1225
freshveg_acrespth07    1299
freshveg_acrespth12    1226
pct_diabetes18            0
class                     0
dtype: int64

Similar numbers of values are missing in the 07 and 12 versions of each measure, so it would be difficult to impute one year's value from the other based on expected change over the years. Using averages in place of true values for so many columns would likely weaken any models created.

While I will lose many counties by dropping rows containing null values, I will do so because I would like to keep the rest of these columns.

In [506]:
local_vegfarms = local_vegfarms.dropna()

local_vegfarms.isnull().sum()

fips                   0
state                  0
county                 0
veg_farms07            0
veg_farms12            0
veg_acres07            0
veg_acres12            0
veg_acrespth07         0
veg_acrespth12         0
freshveg_farms07       0
freshveg_farms12       0
freshveg_acres07       0
freshveg_acres12       0
freshveg_acrespth07    0
freshveg_acrespth12    0
pct_diabetes18         0
class                  0
dtype: int64

In [507]:
local_vegfarms.shape

(1218, 17)
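
Before committing to `dropna()`, it can help to see how many complete rows would survive and which columns cost the most rows. A minimal sketch on a made-up frame (the column names are reused for readability; the values are hypothetical):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "veg_farms07": [3.0, 5.0, np.nan, 2.0],
    "veg_acres07": [10.0, np.nan, np.nan, 7.0],
    "freshveg_acres07": [np.nan, np.nan, np.nan, 4.0],
})

# Rows with no nulls at all: exactly the rows dropna() would keep.
complete_rows = int(toy.notnull().all(axis=1).sum())

# Null counts per column, largest first, to see which columns drive the loss.
nulls_per_column = toy.isnull().sum().sort_values(ascending=False)

print(f"{complete_rows} of {len(toy)} rows are complete")
print(nulls_per_column)
```

If one column accounts for most of the loss, dropping that column instead of the rows may retain more counties.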

Only a minority (1,218) of the original 3,142 counties remain; however, those excluded are still represented in other datasets.

I will save the dataframe, then move on to observe the Slaughterhouses/Greenhouses subset. (*I will not use the Berry Farms or Orchards subsets as they previously showed only weak correlations with prevalence of diabetes.*)

In [508]:
local_vegfarms.to_csv('../data/FoodEnvironmentAtlas.xls.LOCAL_Modeling_VegFarms.csv', index=False)

### 5. Slaughterhouses/Greenhouses

In [509]:
local_houses = local_rec[['fips', 'state', 'county', 'slhouse07',
       'slhouse12', 'pch_slhouse_07_12', 'ghveg_farms07', 'ghveg_farms12',
       'pch_ghveg_farms_07_12', 'ghveg_sqft07', 'ghveg_sqft12',
       'pch_ghveg_sqft_07_12', 'ghveg_sqftpth07', 'ghveg_sqftpth12',
       'pch_ghveg_sqftpth_07_12', 'pct_diabetes18', 'class']]

In [510]:
local_houses = pd.DataFrame(local_houses)

In [511]:
local_houses.isnull().sum()

fips                          0
state                         0
county                        0
slhouse07                     1
slhouse12                     1
pch_slhouse_07_12           166
ghveg_farms07                62
ghveg_farms12                62
pch_ghveg_farms_07_12      1941
ghveg_sqft07                842
ghveg_sqft12                894
pch_ghveg_sqft_07_12       2800
ghveg_sqftpth07             842
ghveg_sqftpth12             895
pch_ghveg_sqftpth_07_12    2800
pct_diabetes18                0
class                         0
dtype: int64

I will start by removing the percent change columns, as they contain many missing values.

In [512]:
local_houses = local_houses.drop(columns = ['pch_slhouse_07_12', 'pch_ghveg_farms_07_12', 'pch_ghveg_sqft_07_12',
                                            'pch_ghveg_sqftpth_07_12'])

local_houses.isnull().sum()

fips                 0
state                0
county               0
slhouse07            1
slhouse12            1
ghveg_farms07       62
ghveg_farms12       62
ghveg_sqft07       842
ghveg_sqft12       894
ghveg_sqftpth07    842
ghveg_sqftpth12    895
pct_diabetes18       0
class                0
dtype: int64

I will look at the row that is missing the slaughterhouse information.

In [513]:
local_houses.loc[local_houses['slhouse07'].isnull()]

Unnamed: 0,fips,state,county,slhouse07,slhouse12,ghveg_farms07,ghveg_farms12,ghveg_sqft07,ghveg_sqft12,ghveg_sqftpth07,ghveg_sqftpth12,pct_diabetes18,class
76,2105,AK,Hoonah-Angoon,,,,,,,,,7.3,0.0


This row contains only the diabetes information.

I will drop all rows containing missing values.

In [514]:
local_houses = local_houses.dropna()

In [515]:
local_houses.isnull().sum()

fips               0
state              0
county             0
slhouse07          0
slhouse12          0
ghveg_farms07      0
ghveg_farms12      0
ghveg_sqft07       0
ghveg_sqft12       0
ghveg_sqftpth07    0
ghveg_sqftpth12    0
pct_diabetes18     0
class              0
dtype: int64

In [516]:
local_houses.shape

(1778, 13)

More than half of the original counties remain.

I will save this dataset for modeling.

In [517]:
local_houses.to_csv('../data/FoodEnvironmentAtlas.xls.LOCAL_Modeling_Houses.csv', index=False)

### 6. Other Food Outlets

In [518]:
local_other = local_rec[['fips', 'county', 'state', 'foodhub18', 'csa07', 'csa12',
       'pch_csa_07_12', 'agritrsm_ops07', 'agritrsm_ops12',
       'pch_agritrsm_ops_07_12', 'agritrsm_rct07', 'agritrsm_rct12',
       'pch_agritrsm_rct_07_12', 'farm_to_school13', 'farm_to_school15', 'pct_diabetes18', 'class']]

In [519]:
local_other = pd.DataFrame(local_other)

In [520]:
local_other.isnull().sum()

fips                         0
county                       0
state                        0
foodhub18                 2962
csa07                       62
csa12                       62
pch_csa_07_12              795
agritrsm_ops07              62
agritrsm_ops12              62
pch_agritrsm_ops_07_12     425
agritrsm_rct07            1168
agritrsm_rct12            1023
pch_agritrsm_rct_07_12    1875
farm_to_school13           208
farm_to_school15           218
pct_diabetes18               0
class                        0
dtype: int64

The foodhub18 column is missing most of its values.

I will observe.

In [521]:
local_other.foodhub18.unique()

array([nan,  1.,  2.,  4.,  6.,  3.])

I might hypothesize that counties with null values in this column had no food hubs...but it could also be that those counties were not assessed.

What I will do is replace the missing values with zero. I will try creating a model with and without the foodhub18 column and see what it produces.

In [522]:
#replaces missing food hub counts with zero (assigning back avoids a chained in-place call)
local_other['foodhub18'] = local_other['foodhub18'].fillna(0)

In [523]:
local_other.isnull().sum()

fips                         0
county                       0
state                        0
foodhub18                    0
csa07                       62
csa12                       62
pch_csa_07_12              795
agritrsm_ops07              62
agritrsm_ops12              62
pch_agritrsm_ops_07_12     425
agritrsm_rct07            1168
agritrsm_rct12            1023
pch_agritrsm_rct_07_12    1875
farm_to_school13           208
farm_to_school15           218
pct_diabetes18               0
class                        0
dtype: int64
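
Because a null here could mean either "no food hubs" or "not assessed", one hedged option is to keep a flag recording which values were imputed, so that models with and without that information can be compared. This is an illustrative alternative on toy data, not what the notebook does; the `foodhub18_missing` column name is hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"fips": [1001, 1003, 1005],
                    "foodhub18": [np.nan, 1.0, 2.0]})

# Record which rows were originally null before overwriting them.
toy["foodhub18_missing"] = toy["foodhub18"].isnull().astype(int)

# Impute zero, as in the notebook, while the flag preserves the distinction.
toy["foodhub18"] = toy["foodhub18"].fillna(0)
print(toy)
```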

Next, I will delete the percent change columns.

In [524]:
local_other = local_other.drop(columns = ['pch_csa_07_12', 'pch_agritrsm_ops_07_12', 'pch_agritrsm_rct_07_12'] )

local_other.isnull().sum()

fips                   0
county                 0
state                  0
foodhub18              0
csa07                 62
csa12                 62
agritrsm_ops07        62
agritrsm_ops12        62
agritrsm_rct07      1168
agritrsm_rct12      1023
farm_to_school13     208
farm_to_school15     218
pct_diabetes18         0
class                  0
dtype: int64

I would like to use all of these columns, therefore I will delete the remaining rows that contain null values.

In [525]:
local_other = local_other.dropna()

In [526]:
local_other.isnull().sum()

fips                0
county              0
state               0
foodhub18           0
csa07               0
csa12               0
agritrsm_ops07      0
agritrsm_ops12      0
agritrsm_rct07      0
agritrsm_rct12      0
farm_to_school13    0
farm_to_school15    0
pct_diabetes18      0
class               0
dtype: int64

In [527]:
local_other.shape

(1314, 14)

Fewer than half of the original counties remain (1,314 of 3,142, about 42%); however, I will save this as a csv for modeling.

In [528]:
local_other.to_csv('../data/FoodEnvironmentAtlas.xls.LOCAL_Modeling_Other.csv', index=False)
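
The row counts reported by `.shape` for the Local subsets that used `dropna()` can be put side by side as shares of the original 3,142 counties. A minimal sketch of that arithmetic (the dictionary keys are informal labels, not dataframe names):

```python
TOTAL_COUNTIES = 3142

# Row counts taken from the .shape outputs above.
rows_kept = {"sales": 2624, "vegfarms": 1218, "houses": 1778, "other": 1314}

# Share of the original counties retained by each subset, in percent.
share_kept = {name: round(100 * n / TOTAL_COUNTIES, 1) for name, n in rows_kept.items()}

for name, pct in share_kept.items():
    print(f"{name}: {rows_kept[name]} rows ({pct}%)")
```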

### Restaurants

In [529]:
file_path = ('../data/FoodEnvironmentAtlas.xls.RESTAURANTS_REC.csv')

restaurants_rec = pd.read_csv(file_path)

In [530]:
restaurants_rec.shape

(3142, 15)

In [531]:
restaurants_rec.head()

Unnamed: 0,fips,state,county,ffr11,ffr16,ffrpth11,ffrpth16,fsr11,fsr16,fsrpth11,fsrpth16,pc_ffrsales07,pc_ffrsales12,pc_fsrsales07,pc_fsrsales12
0,1001,AL,Autauga,34,44,0.615953,0.795977,32,31,0.579721,0.560802,649.511367,674.80272,484.381507,512.280987
1,1003,AL,Baldwin,121,156,0.648675,0.751775,216,236,1.157966,1.1373,649.511367,674.80272,484.381507,512.280987
2,1005,AL,Barbour,19,23,0.694673,0.892372,17,14,0.621549,0.543183,649.511367,674.80272,484.381507,512.280987
3,1007,AL,Bibb,6,7,0.263794,0.309283,5,7,0.219829,0.309283,649.511367,674.80272,484.381507,512.280987
4,1009,AL,Blount,20,23,0.347451,0.399569,15,12,0.260589,0.208471,649.511367,674.80272,484.381507,512.280987


In [532]:
d18_rec.head()

Unnamed: 0,fips,county,state,pct_diabetes18,class
0,1001,Autauga,Alabama,9.5,1.0
1,1003,Baldwin,Alabama,8.4,0.0
2,1005,Barbour,Alabama,13.5,1.0
3,1007,Bibb,Alabama,10.2,1.0
4,1009,Blount,Alabama,10.5,1.0


In [533]:
restaurants_rec.tail()

Unnamed: 0,fips,state,county,ffr11,ffr16,ffrpth11,ffrpth16,fsr11,fsr16,fsrpth11,fsrpth16,pc_ffrsales07,pc_ffrsales12,pc_fsrsales07,pc_fsrsales12
3137,56037,WY,Sweetwater,25,31,0.568014,0.700644,33,32,0.749778,0.723246,656.20861,598.027144,715.635645,706.676425
3138,56039,WY,Teton,27,20,1.257217,0.862813,57,59,2.654126,2.545298,656.20861,598.027144,715.635645,706.676425
3139,56041,WY,Uinta,19,17,0.909134,0.82082,25,19,1.196229,0.917387,656.20861,598.027144,715.635645,706.676425
3140,56043,WY,Washakie,7,6,0.827423,0.73278,14,12,1.654846,1.465559,656.20861,598.027144,715.635645,706.676425
3141,56045,WY,Weston,3,4,0.420109,0.55571,11,10,1.540401,1.389275,656.20861,598.027144,715.635645,706.676425


In [534]:
d18_rec.tail()

Unnamed: 0,fips,county,state,pct_diabetes18,class
3137,56037,Sweetwater,Wyoming,7.8,0.0
3138,56039,Teton,Wyoming,3.8,0.0
3139,56041,Uinta,Wyoming,8.4,0.0
3140,56043,Washakie,Wyoming,7.4,0.0
3141,56045,Weston,Wyoming,7.6,0.0


In [535]:
restaurants_rec.fips.describe()

count     3142.000000
mean     30383.649268
std      15162.508374
min       1001.000000
25%      18177.500000
50%      29176.000000
75%      45080.500000
max      56045.000000
Name: fips, dtype: float64

In [536]:
d18_rec.fips.describe()

count     3142.000000
mean     30383.649268
std      15162.508374
min       1001.000000
25%      18177.500000
50%      29176.000000
75%      45080.500000
max      56045.000000
Name: fips, dtype: float64

The Restaurants and Diabetes Atlas data match. I will add the diabetes columns into the restaurants dataframe.

In [537]:
restaurants_rec['pct_diabetes18'] = d18_rec['pct_diabetes18']

restaurants_rec['class'] = d18_rec['class']

restaurants_rec.head()

Unnamed: 0,fips,state,county,ffr11,ffr16,ffrpth11,ffrpth16,fsr11,fsr16,fsrpth11,fsrpth16,pc_ffrsales07,pc_ffrsales12,pc_fsrsales07,pc_fsrsales12,pct_diabetes18,class
0,1001,AL,Autauga,34,44,0.615953,0.795977,32,31,0.579721,0.560802,649.511367,674.80272,484.381507,512.280987,9.5,1.0
1,1003,AL,Baldwin,121,156,0.648675,0.751775,216,236,1.157966,1.1373,649.511367,674.80272,484.381507,512.280987,8.4,0.0
2,1005,AL,Barbour,19,23,0.694673,0.892372,17,14,0.621549,0.543183,649.511367,674.80272,484.381507,512.280987,13.5,1.0
3,1007,AL,Bibb,6,7,0.263794,0.309283,5,7,0.219829,0.309283,649.511367,674.80272,484.381507,512.280987,10.2,1.0
4,1009,AL,Blount,20,23,0.347451,0.399569,15,12,0.260589,0.208471,649.511367,674.80272,484.381507,512.280987,10.5,1.0


In [538]:
restaurants_rec.shape

(3142, 17)
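
Matching `head()`, `tail()` and `describe()` output is consistent with identical row order but does not strictly prove it, since two frames can hold the same FIPS codes with interior rows swapped. A stricter row-by-row check is sketched below on toy frames; `same_key_order` is a hypothetical helper name.

```python
import pandas as pd

def same_key_order(left: pd.DataFrame, right: pd.DataFrame, key: str = "fips") -> bool:
    """True only if both frames have the same key values in the same row order."""
    if len(left) != len(right):
        return False
    return bool((left[key].to_numpy() == right[key].to_numpy()).all())

a = pd.DataFrame({"fips": [1001, 1003, 1005, 1007]})
b = pd.DataFrame({"fips": [1001, 1003, 1005, 1007]})
# Same first row, last row and summary statistics as a, but two interior rows swapped.
c = pd.DataFrame({"fips": [1001, 1005, 1003, 1007]})

print(same_key_order(a, b), same_key_order(a, c))
```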

The diabetes columns have been added. I will address null values.

In [539]:
restaurants_rec.isnull().sum()

fips              0
state             0
county            0
ffr11             0
ffr16             0
ffrpth11          0
ffrpth16          0
fsr11             0
fsr16             0
fsrpth11          0
fsrpth16          0
pc_ffrsales07     0
pc_ffrsales12     0
pc_fsrsales07     0
pc_fsrsales12     0
pct_diabetes18    0
class             0
dtype: int64

There are no missing values. I will save the dataframe for modeling.

In [540]:
restaurants_rec.to_csv('../data/FoodEnvironmentAtlas.xls.RESTAURANTS_Modeling.csv', index=False)

### Socioeconomic

In [541]:
file_path = '../data/FoodEnvironmentAtlas.xls.SOCIOECONOMIC_REC.csv'

sec_rec = pd.read_csv(file_path)

In [542]:
sec_rec.shape

(3142, 12)

In [543]:
sec_rec.head()

Unnamed: 0,fips,state,county,pct_65older10,pct_18younger10,medhhinc15,povrate15,perpov10,childpovrate15,perchldpov10,metro13,poploss10
0,1001,AL,Autauga,11.995382,26.777959,56580.0,12.7,0,18.8,0,1,0.0
1,1003,AL,Baldwin,16.771185,22.987408,52387.0,12.9,0,19.6,0,1,0.0
2,1005,AL,Barbour,14.236807,21.906982,31433.0,32.0,1,45.2,1,0,0.0
3,1007,AL,Bibb,12.68165,22.696923,40767.0,22.2,0,29.3,1,1,0.0
4,1009,AL,Blount,14.722096,24.608353,50487.0,14.7,0,22.2,0,1,0.0


In [544]:
d18_rec.head()

Unnamed: 0,fips,county,state,pct_diabetes18,class
0,1001,Autauga,Alabama,9.5,1.0
1,1003,Baldwin,Alabama,8.4,0.0
2,1005,Barbour,Alabama,13.5,1.0
3,1007,Bibb,Alabama,10.2,1.0
4,1009,Blount,Alabama,10.5,1.0


In [545]:
sec_rec.tail()

Unnamed: 0,fips,state,county,pct_65older10,pct_18younger10,medhhinc15,povrate15,perpov10,childpovrate15,perchldpov10,metro13,poploss10
3137,56037,WY,Sweetwater,8.316212,27.094462,71867.0,8.5,0,10.2,0,0,0.0
3138,56039,WY,Teton,9.852541,19.141542,83290.0,6.6,0,7.6,0,0,0.0
3139,56041,WY,Uinta,8.873946,30.168577,62968.0,9.8,0,11.9,0,0,0.0
3140,56043,WY,Washakie,17.672565,25.454119,56088.0,11.2,0,15.7,0,0,0.0
3141,56045,WY,Weston,15.940622,21.822974,60986.0,9.8,0,13.1,0,0,0.0


In [546]:
d18_rec.tail()

Unnamed: 0,fips,county,state,pct_diabetes18,class
3137,56037,Sweetwater,Wyoming,7.8,0.0
3138,56039,Teton,Wyoming,3.8,0.0
3139,56041,Uinta,Wyoming,8.4,0.0
3140,56043,Washakie,Wyoming,7.4,0.0
3141,56045,Weston,Wyoming,7.6,0.0


In [547]:
sec_rec.fips.describe()

count     3142.000000
mean     30383.649268
std      15162.508374
min       1001.000000
25%      18177.500000
50%      29176.000000
75%      45080.500000
max      56045.000000
Name: fips, dtype: float64

In [548]:
d18_rec.fips.describe()

count     3142.000000
mean     30383.649268
std      15162.508374
min       1001.000000
25%      18177.500000
50%      29176.000000
75%      45080.500000
max      56045.000000
Name: fips, dtype: float64

The datasets match. I will add the diabetes columns into the Socioeconomic dataframe.

In [549]:
sec_rec['pct_diabetes18'] = d18_rec['pct_diabetes18']

sec_rec['class'] = d18_rec['class']

sec_rec.head()

Unnamed: 0,fips,state,county,pct_65older10,pct_18younger10,medhhinc15,povrate15,perpov10,childpovrate15,perchldpov10,metro13,poploss10,pct_diabetes18,class
0,1001,AL,Autauga,11.995382,26.777959,56580.0,12.7,0,18.8,0,1,0.0,9.5,1.0
1,1003,AL,Baldwin,16.771185,22.987408,52387.0,12.9,0,19.6,0,1,0.0,8.4,0.0
2,1005,AL,Barbour,14.236807,21.906982,31433.0,32.0,1,45.2,1,0,0.0,13.5,1.0
3,1007,AL,Bibb,12.68165,22.696923,40767.0,22.2,0,29.3,1,1,0.0,10.2,1.0
4,1009,AL,Blount,14.722096,24.608353,50487.0,14.7,0,22.2,0,1,0.0,10.5,1.0


In [550]:
sec_rec.shape

(3142, 14)

The columns have been added. I will address the null values.

In [551]:
sec_rec.isnull().sum()

fips               0
state              0
county             0
pct_65older10      0
pct_18younger10    0
medhhinc15         3
povrate15          3
perpov10           0
childpovrate15     3
perchldpov10       0
metro13            0
poploss10          2
pct_diabetes18     0
class              0
dtype: int64

A small number of rows have missing values. I will investigate.

In [552]:
sec_rec.loc[sec_rec['medhhinc15'].isnull()]

Unnamed: 0,fips,state,county,pct_65older10,pct_18younger10,medhhinc15,povrate15,perpov10,childpovrate15,perchldpov10,metro13,poploss10,pct_diabetes18,class
81,2158,AK,Kusilvak Census Area,5.416276,41.573938,,,1,,1,0,,7.4,0.0
548,15005,HI,Kalawao,28.888889,0.0,,,0,,0,1,0.0,8.4,0.0
2412,46102,SD,Oglala Lakota,5.881054,39.319888,,,1,,1,0,,17.9,1.0


I will remove these three rows because they are missing the income and poverty rate values.

In [553]:
#removes the rows referenced above

sec_rec = sec_rec[sec_rec['fips'] != 2158]

sec_rec = sec_rec[sec_rec['fips'] != 15005]

sec_rec = sec_rec[sec_rec['fips'] != 46102]

In [554]:
sec_rec.shape

(3139, 14)
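
The same rows can be removed in a single step, either by listing the FIPS codes or by dropping any row that is null in the affected columns. The sketch below uses a toy dataframe with the three FIPS codes from above; the non-null values are illustrative, and the choice of columns passed to `subset` is an assumption about which nulls should disqualify a row.

```python
import numpy as np
import pandas as pd

# Toy stand-in for sec_rec: one complete row and three rows missing income/poverty data.
sec = pd.DataFrame({
    "fips": [1001, 2158, 15005, 46102],
    "medhhinc15": [56580.0, np.nan, np.nan, np.nan],
    "povrate15": [12.7, np.nan, np.nan, np.nan],
})

# Option 1: remove rows by an explicit list of FIPS codes.
by_fips = sec[~sec["fips"].isin([2158, 15005, 46102])]

# Option 2: remove any row that is null in the listed columns.
by_nulls = sec.dropna(subset=["medhhinc15", "povrate15"])

print(by_fips.equals(by_nulls))
```

Filtering by FIPS code documents exactly which counties were removed, while dropna adapts automatically if the input data changes.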

In [555]:
sec_rec.isnull().sum()

fips               0
state              0
county             0
pct_65older10      0
pct_18younger10    0
medhhinc15         0
povrate15          0
perpov10           0
childpovrate15     0
perchldpov10       0
metro13            0
poploss10          0
pct_diabetes18     0
class              0
dtype: int64

There are no missing values. I will save this as a csv.

In [556]:
sec_rec.to_csv('../data/FoodEnvironmentAtlas.xls.SOCIOECONOMIC_Modeling.csv', index=False)
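
Passing index=False matters when the file is read back later. Without it, pandas writes the row index as an extra unnamed column, which reappears as "Unnamed: 0" on import. The sketch below demonstrates this with a toy dataframe and temporary files; the file names are placeholders.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"fips": [1001, 1003], "class": [1.0, 0.0]})

with tempfile.TemporaryDirectory() as tmp:
    path_no_index = os.path.join(tmp, "no_index.csv")      # placeholder file name
    path_with_index = os.path.join(tmp, "with_index.csv")  # placeholder file name

    df.to_csv(path_no_index, index=False)
    df.to_csv(path_with_index)  # default behavior writes the index as a column

    cols_no_index = list(pd.read_csv(path_no_index).columns)
    cols_with_index = list(pd.read_csv(path_with_index).columns)

print(cols_no_index)
print(cols_with_index)
```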

## Stores

In [557]:
file_path = '../data/FoodEnvironmentAtlas.xls.STORES_REC.csv'

stores_rec = pd.read_csv(file_path)

In [558]:
stores_rec.shape

(3142, 39)

In [559]:
stores_rec.head()

Unnamed: 0,fips,state,county,groc11,groc16,pch_groc_11_16,grocpth11,grocpth16,pch_grocpth_11_16,superc11,...,pch_snaps_12_17,snapspth12,snapspth17,pch_snapspth_12_17,wics11,wics16,pch_wics_11_16,wicspth11,wicspth16,pch_wicspth_11_16
0,1001,AL,Autauga,5,3,-40.0,0.090581,0.054271,-40.085748,1,...,19.376392,0.674004,0.804747,19.3979,5.0,5.0,0.0,0.090567,0.090511,-0.061543
1,1003,AL,Baldwin,27,29,7.407407,0.144746,0.139753,-3.449328,6,...,36.927711,0.725055,0.890836,22.864524,26.0,28.0,7.692307,0.13938,0.134802,-3.284727
2,1005,AL,Barbour,6,4,-33.333333,0.21937,0.155195,-29.254287,0,...,3.349282,1.28059,1.424614,11.246689,7.0,6.0,-14.285714,0.255942,0.232387,-9.203081
3,1007,AL,Bibb,6,5,-16.666667,0.263794,0.220916,-16.254289,1,...,11.794872,0.719122,0.801423,11.444711,6.0,5.0,-16.666666,0.263771,0.221474,-16.035471
4,1009,AL,Blount,7,5,-28.571429,0.121608,0.086863,-28.571429,1,...,5.701754,0.657144,0.692374,5.361034,8.0,8.0,0.0,0.139,0.139089,0.064332


In [560]:
d18_rec.head()

Unnamed: 0,fips,county,state,pct_diabetes18,class
0,1001,Autauga,Alabama,9.5,1.0
1,1003,Baldwin,Alabama,8.4,0.0
2,1005,Barbour,Alabama,13.5,1.0
3,1007,Bibb,Alabama,10.2,1.0
4,1009,Blount,Alabama,10.5,1.0


In [561]:
stores_rec.tail()

Unnamed: 0,fips,state,county,groc11,groc16,pch_groc_11_16,grocpth11,grocpth16,pch_grocpth_11_16,superc11,...,pch_snaps_12_17,snapspth12,snapspth17,pch_snapspth_12_17,wics11,wics16,pch_wics_11_16,wicspth11,wicspth16,pch_wicspth_11_16
3137,56037,WY,Sweetwater,5,4,-20.0,0.113603,0.090406,-20.419482,1,...,40.343348,0.428936,0.625948,45.930131,4.0,4.0,0.0,0.090948,0.090344,-0.664035
3138,56039,WY,Teton,5,11,120.0,0.232818,0.474547,103.827437,0,...,52.380952,0.242215,0.343864,41.96678,3.0,3.0,0.0,0.140095,0.129528,-7.542849
3139,56041,WY,Uinta,3,2,-33.333333,0.143548,0.096567,-32.72818,1,...,30.714286,0.554895,0.744084,34.094553,3.0,3.0,0.0,0.143589,0.144991,0.976268
3140,56043,WY,Washakie,2,2,0.0,0.236407,0.24426,3.321935,0,...,7.352941,0.669502,0.754382,12.677988,2.0,2.0,0.0,0.236742,0.244858,3.428013
3141,56045,WY,Weston,4,4,0.0,0.560146,0.55571,-0.791887,0,...,24.096386,0.976654,1.239113,26.873192,3.0,3.0,0.0,0.42005,0.415916,-0.984338


In [562]:
d18_rec.tail()

Unnamed: 0,fips,county,state,pct_diabetes18,class
3137,56037,Sweetwater,Wyoming,7.8,0.0
3138,56039,Teton,Wyoming,3.8,0.0
3139,56041,Uinta,Wyoming,8.4,0.0
3140,56043,Washakie,Wyoming,7.4,0.0
3141,56045,Weston,Wyoming,7.6,0.0


In [563]:
stores_rec.fips.describe()

count     3142.000000
mean     30383.649268
std      15162.508374
min       1001.000000
25%      18177.500000
50%      29176.000000
75%      45080.500000
max      56045.000000
Name: fips, dtype: float64

In [564]:
d18_rec.fips.describe()

count     3142.000000
mean     30383.649268
std      15162.508374
min       1001.000000
25%      18177.500000
50%      29176.000000
75%      45080.500000
max      56045.000000
Name: fips, dtype: float64

I will add the diabetes information into the Stores dataframe.
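
As an alternative to assigning the columns directly, which pairs rows by index, the targets can be joined on fips so that row order no longer matters. A minimal sketch with toy data follows; validate="one_to_one" asks pandas to raise an error if either side contains duplicate FIPS codes.

```python
import pandas as pd

# Toy stand-ins for stores_rec and d18_rec (values are illustrative only).
stores = pd.DataFrame({"fips": [1001, 1003, 1005], "groc16": [3, 29, 4]})
targets = pd.DataFrame({
    "fips": [1005, 1001, 1003],  # deliberately in a different order
    "pct_diabetes18": [13.5, 9.5, 8.4],
    "class": [1.0, 1.0, 0.0],
})

# A left join keeps every row of `stores` and attaches the target by FIPS code.
merged = stores.merge(
    targets[["fips", "pct_diabetes18", "class"]],
    on="fips",
    how="left",
    validate="one_to_one",
)

print(merged)
```

With a left join, any county missing from the targets would show up as a null in pct_diabetes18 rather than silently receiving another county's value.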

In [565]:
stores_rec['pct_diabetes18'] = d18_rec['pct_diabetes18']

stores_rec['class'] = d18_rec['class']

stores_rec.head()

Unnamed: 0,fips,state,county,groc11,groc16,pch_groc_11_16,grocpth11,grocpth16,pch_grocpth_11_16,superc11,...,snapspth17,pch_snapspth_12_17,wics11,wics16,pch_wics_11_16,wicspth11,wicspth16,pch_wicspth_11_16,pct_diabetes18,class
0,1001,AL,Autauga,5,3,-40.0,0.090581,0.054271,-40.085748,1,...,0.804747,19.3979,5.0,5.0,0.0,0.090567,0.090511,-0.061543,9.5,1.0
1,1003,AL,Baldwin,27,29,7.407407,0.144746,0.139753,-3.449328,6,...,0.890836,22.864524,26.0,28.0,7.692307,0.13938,0.134802,-3.284727,8.4,0.0
2,1005,AL,Barbour,6,4,-33.333333,0.21937,0.155195,-29.254287,0,...,1.424614,11.246689,7.0,6.0,-14.285714,0.255942,0.232387,-9.203081,13.5,1.0
3,1007,AL,Bibb,6,5,-16.666667,0.263794,0.220916,-16.254289,1,...,0.801423,11.444711,6.0,5.0,-16.666666,0.263771,0.221474,-16.035471,10.2,1.0
4,1009,AL,Blount,7,5,-28.571429,0.121608,0.086863,-28.571429,1,...,0.692374,5.361034,8.0,8.0,0.0,0.139,0.139089,0.064332,10.5,1.0


The diabetes information has been added. Before addressing null values, I will split the data into subsets for processing.
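
Each subset keeps the identifier columns, a group of feature columns, and the two target columns. A minimal sketch of that pattern is shown below; `make_subset`, `ID_COLS`, and `TARGET_COLS` are hypothetical names, and the toy values are illustrative. Calling .copy() makes the subset independent of the parent dataframe, so later in-place edits do not raise a SettingWithCopyWarning.

```python
import pandas as pd

ID_COLS = ["fips", "state", "county"]
TARGET_COLS = ["pct_diabetes18", "class"]


def make_subset(df: pd.DataFrame, feature_cols: list) -> pd.DataFrame:
    """Hypothetical helper: return identifiers, chosen features, and targets as a new dataframe."""
    return df[ID_COLS + feature_cols + TARGET_COLS].copy()


# Toy stand-in for stores_rec.
stores = pd.DataFrame({
    "fips": [1001, 1003],
    "state": ["AL", "AL"],
    "county": ["Autauga", "Baldwin"],
    "groc11": [5, 27],
    "groc16": [3, 29],
    "convs11": [31, 118],
    "pct_diabetes18": [9.5, 8.4],
    "class": [1.0, 0.0],
})

grocery = make_subset(stores, ["groc11", "groc16"])

# In-place edits on the subset leave the parent dataframe untouched.
grocery.drop(columns=["groc11"], inplace=True)

print(list(grocery.columns))
print("groc11" in stores.columns)
```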

### 1. Grocery/superstores

While EDA suggested that the correlations between prevalence of diabetes and the variables below are weak, perhaps they will be more predictive when 2018 data are used. I will still create a model with the subset below.

In [566]:
stores_grocery_super = stores_rec[['fips', 'state', 'county', 'groc11', 'groc16', 'pch_groc_11_16',
       'grocpth11', 'grocpth16', 'pch_grocpth_11_16', 'superc11', 'superc16',
       'pch_superc_11_16', 'supercpth11', 'supercpth16', 'pch_supercpth_11_16', 'pct_diabetes18', 'class']]

In [567]:
#copies the subset so in-place edits below do not affect stores_rec
stores_grocery_super = stores_grocery_super.copy()

In [568]:
stores_grocery_super.head()

Unnamed: 0,fips,state,county,groc11,groc16,pch_groc_11_16,grocpth11,grocpth16,pch_grocpth_11_16,superc11,superc16,pch_superc_11_16,supercpth11,supercpth16,pch_supercpth_11_16,pct_diabetes18,class
0,1001,AL,Autauga,5,3,-40.0,0.090581,0.054271,-40.085748,1,1,0.0,0.018116,0.01809,-0.142914,9.5,1.0
1,1003,AL,Baldwin,27,29,7.407407,0.144746,0.139753,-3.449328,6,7,16.666667,0.032166,0.033733,4.874005,8.4,0.0
2,1005,AL,Barbour,6,4,-33.333333,0.21937,0.155195,-29.254287,0,1,,0.0,0.038799,,13.5,1.0
3,1007,AL,Bibb,6,5,-16.666667,0.263794,0.220916,-16.254289,1,1,0.0,0.043966,0.044183,0.494853,10.2,1.0
4,1009,AL,Blount,7,5,-28.571429,0.121608,0.086863,-28.571429,1,1,0.0,0.017373,0.017373,0.0,10.5,1.0


In [569]:
stores_grocery_super.isnull().sum()

fips                     0
state                    0
county                   0
groc11                   0
groc16                   0
pch_groc_11_16          21
grocpth11                0
grocpth16                0
pch_grocpth_11_16       16
superc11                 0
superc16                 0
pch_superc_11_16       132
supercpth11              0
supercpth16              0
pch_supercpth_11_16    132
pct_diabetes18           0
class                    0
dtype: int64

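Only the percent change columns have missing values. In the Barbour row above, the percent change is missing where the 2011 count is 0, which suggests these values are undefined rather than unrecorded. I will remove these columns.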

In [570]:
stores_grocery_super.drop(columns = ['pch_groc_11_16', 'pch_grocpth_11_16', 'pch_superc_11_16',
                                     'pch_supercpth_11_16'], inplace = True)

stores_grocery_super.isnull().sum()

fips              0
state             0
county            0
groc11            0
groc16            0
grocpth11         0
grocpth16         0
superc11          0
superc16          0
supercpth11       0
supercpth16       0
pct_diabetes18    0
class             0
dtype: int64
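
Since every percent change column in this dataset starts with pch_, the columns to drop can also be selected by prefix instead of being typed out. The sketch below assumes that naming convention holds for all such columns and uses a toy dataframe modeled on the Autauga and Barbour rows.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a stores subset; the percent change is missing where superc11 is 0.
df = pd.DataFrame({
    "fips": [1001, 1005],
    "superc11": [1, 0],
    "superc16": [1, 1],
    "pch_superc_11_16": [0.0, np.nan],
})

# Collect every column whose name starts with the percent change prefix.
pch_cols = [col for col in df.columns if col.startswith("pch_")]

trimmed = df.drop(columns=pch_cols)

print(pch_cols)
print(trimmed.isnull().sum().sum())
```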

There are no remaining null values. I will save this as a csv.

In [571]:
stores_grocery_super.to_csv('../data/FoodEnvironmentAtlas.xls.STORES_Modeling_GS.csv', index=False)

### 2. Convenience & Specialized Food Stores

In [572]:
stores_conv_spec = stores_rec[['fips', 'state', 'county', 'convs11', 'convs16', 'pch_convs_11_16', 'convspth11', 'convspth16',
       'pch_convspth_11_16', 'specs11', 'specs16', 'pch_specs_11_16',
       'specspth11', 'specspth16', 'pch_specspth_11_16', 'pct_diabetes18', 'class']]

In [573]:
#copies the subset so in-place edits below do not affect stores_rec
stores_conv_spec = stores_conv_spec.copy()

In [574]:
stores_conv_spec.isnull().sum()

fips                    0
state                   0
county                  0
convs11                 0
convs16                 0
pch_convs_11_16        26
convspth11              0
convspth16              0
pch_convspth_11_16     21
specs11                 0
specs16                 0
pch_specs_11_16       250
specspth11              0
specspth16              0
pch_specspth_11_16    247
pct_diabetes18          0
class                   0
dtype: int64

In this subset too, all of the missing values are in the percent change columns. I will remove these columns.

In [575]:
stores_conv_spec.drop(columns = ['pch_convs_11_16', 'pch_convspth_11_16', 'pch_specs_11_16',
                                 'pch_specspth_11_16'], inplace = True)

In [576]:
stores_conv_spec.isnull().sum()

fips              0
state             0
county            0
convs11           0
convs16           0
convspth11        0
convspth16        0
specs11           0
specs16           0
specspth11        0
specspth16        0
pct_diabetes18    0
class             0
dtype: int64

This will be saved as a csv.

In [577]:
stores_conv_spec.to_csv('../data/FoodEnvironmentAtlas.xls.STORES_Modeling_ConSpec.csv', index=False)

### 3. SNAP & WIC Stores

In [578]:
stores_snap_wic = stores_rec[['fips', 'state', 'county', 'snaps12', 'snaps17',
       'pch_snaps_12_17', 'snapspth12', 'snapspth17', 'pch_snapspth_12_17',
       'wics11', 'wics16', 'pch_wics_11_16', 'wicspth11', 'wicspth16',
       'pch_wicspth_11_16', 'pct_diabetes18', 'class']]

In [579]:
#copies the subset so in-place edits below do not affect stores_rec
stores_snap_wic = stores_snap_wic.copy()

In [580]:
stores_snap_wic.isnull().sum()

fips                    0
state                   0
county                  0
snaps12                 0
snaps17                26
pch_snaps_12_17        32
snapspth12              0
snapspth17             26
pch_snapspth_12_17     32
wics11                132
wics16                158
pch_wics_11_16        159
wicspth11             134
wicspth16             160
pch_wicspth_11_16     161
pct_diabetes18          0
class                   0
dtype: int64

In this case, there are missing values outside of the percent change columns, and in more than a few rows: the WIC columns are each missing between 132 and 160 of the 3,142 counties.

First, I will delete the percent change columns.

In [581]:
stores_snap_wic.drop(columns = ['pch_snaps_12_17', 'pch_snapspth_12_17', 'pch_wics_11_16',
                                'pch_wicspth_11_16'], inplace=True)

In [582]:
stores_snap_wic.isnull().sum()

fips                0
state               0
county              0
snaps12             0
snaps17            26
snapspth12          0
snapspth17         26
wics11            132
wics16            158
wicspth11         134
wicspth16         160
pct_diabetes18      0
class               0
dtype: int64

Even if I drop every row with a null value, I will still have the vast majority of counties, so I will drop those rows below.

In [583]:
stores_snap_wic = stores_snap_wic.dropna()

In [584]:
stores_snap_wic.isnull().sum()

fips              0
state             0
county            0
snaps12           0
snaps17           0
snapspth12        0
snapspth17        0
wics11            0
wics16            0
wicspth11         0
wicspth16         0
pct_diabetes18    0
class             0
dtype: int64

In [585]:
stores_snap_wic.shape

(2978, 13)
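
The shape above means that 2,978 of the 3,142 counties (about 94.8%) are retained. The retained share can also be estimated before dropping anything, by counting the rows that contain at least one null. Per-column null counts cannot simply be added up, because a single row may be null in several columns. A minimal sketch with toy data:

```python
import numpy as np
import pandas as pd

# Toy stand-in for stores_snap_wic; one row is null in both columns' neighbors.
snap_wic = pd.DataFrame({
    "fips": [1, 2, 3, 4, 5],
    "snaps17": [10.0, np.nan, 7.0, 3.0, 2.0],
    "wics16": [5.0, 4.0, np.nan, np.nan, 1.0],
})

# Number of rows with at least one missing value.
rows_with_null = int(snap_wic.isnull().any(axis=1).sum())

# Share of rows that would survive a plain dropna().
retained = 1 - rows_with_null / len(snap_wic)

print(rows_with_null)
print(round(retained, 3))
```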

There are no remaining missing values, and 2,978 of the 3,142 counties are retained. I will save this work and move on to the next dataset.

In [586]:
stores_snap_wic.to_csv('../data/FoodEnvironmentAtlas.xls.STORES_Modeling_SnapWic.csv', index=False)

## Taxes

In [587]:
file_path = '../data/FoodEnvironmentAtlas.xls.TAXES_REC.csv'

taxes_rec = pd.read_csv(file_path)

In [588]:
taxes_rec.shape

(3142, 8)

In [589]:
taxes_rec.head()

Unnamed: 0,fips,state,county,sodatax_stores14,sodatax_vendm14,chipstax_stores14,chipstax_vendm14,food_tax14
0,1001,AL,Autauga,4.0,4.0,4.0,4.0,4.0
1,1003,AL,Baldwin,4.0,4.0,4.0,4.0,4.0
2,1005,AL,Barbour,4.0,4.0,4.0,4.0,4.0
3,1007,AL,Bibb,4.0,4.0,4.0,4.0,4.0
4,1009,AL,Blount,4.0,4.0,4.0,4.0,4.0


In [590]:
d18_rec.head()

Unnamed: 0,fips,county,state,pct_diabetes18,class
0,1001,Autauga,Alabama,9.5,1.0
1,1003,Baldwin,Alabama,8.4,0.0
2,1005,Barbour,Alabama,13.5,1.0
3,1007,Bibb,Alabama,10.2,1.0
4,1009,Blount,Alabama,10.5,1.0


In [593]:
taxes_rec.tail()

Unnamed: 0,fips,state,county,sodatax_stores14,sodatax_vendm14,chipstax_stores14,chipstax_vendm14,food_tax14
3137,56037,WY,Sweetwater,0.0,4.0,0.0,4.0,0.0
3138,56039,WY,Teton,0.0,4.0,0.0,4.0,0.0
3139,56041,WY,Uinta,0.0,4.0,0.0,4.0,0.0
3140,56043,WY,Washakie,0.0,4.0,0.0,4.0,0.0
3141,56045,WY,Weston,0.0,4.0,0.0,4.0,0.0


In [594]:
d18_rec.tail()

Unnamed: 0,fips,county,state,pct_diabetes18,class
3137,56037,Sweetwater,Wyoming,7.8,0.0
3138,56039,Teton,Wyoming,3.8,0.0
3139,56041,Uinta,Wyoming,8.4,0.0
3140,56043,Washakie,Wyoming,7.4,0.0
3141,56045,Weston,Wyoming,7.6,0.0


In [595]:
taxes_rec.fips.describe()

count     3142.000000
mean     30383.649268
std      15162.508374
min       1001.000000
25%      18177.500000
50%      29176.000000
75%      45080.500000
max      56045.000000
Name: fips, dtype: float64

In [596]:
d18_rec.fips.describe()

count     3142.000000
mean     30383.649268
std      15162.508374
min       1001.000000
25%      18177.500000
50%      29176.000000
75%      45080.500000
max      56045.000000
Name: fips, dtype: float64

The taxes dataframe is ready to accept the diabetes columns.

In [597]:
taxes_rec['pct_diabetes18'] = d18_rec['pct_diabetes18']

taxes_rec['class'] = d18_rec['class']

taxes_rec.head()

Unnamed: 0,fips,state,county,sodatax_stores14,sodatax_vendm14,chipstax_stores14,chipstax_vendm14,food_tax14,pct_diabetes18,class
0,1001,AL,Autauga,4.0,4.0,4.0,4.0,4.0,9.5,1.0
1,1003,AL,Baldwin,4.0,4.0,4.0,4.0,4.0,8.4,0.0
2,1005,AL,Barbour,4.0,4.0,4.0,4.0,4.0,13.5,1.0
3,1007,AL,Bibb,4.0,4.0,4.0,4.0,4.0,10.2,1.0
4,1009,AL,Blount,4.0,4.0,4.0,4.0,4.0,10.5,1.0


The diabetes columns have been added.

In [598]:
taxes_rec.isnull().sum()

fips                 0
state                0
county               0
sodatax_stores14     0
sodatax_vendm14      0
chipstax_stores14    0
chipstax_vendm14     0
food_tax14           0
pct_diabetes18       0
class                0
dtype: int64

There are no missing values. I will save this dataframe for modeling.

In [599]:
taxes_rec.to_csv('../data/FoodEnvironmentAtlas.xls.TAXES_Modeling.csv', index=False)

### Discussion

The target values have been used to create classes, and both the percentage and the class were added to each dataset that will be used for modeling. Some datasets lost rows because of null values (for example, the SNAP & WIC subset retains 2,978 of 3,142 counties), while others are complete. With the exception of Bedford City, VA (which was dropped entirely from the dataset because it lacked diabetes prevalence and other information), all counties within the Food Environment Atlas are represented in multiple datasets.

The mental health-related information was added into the Health dataset for modeling purposes.

In the next notebook, I will create and evaluate models for predicting prevalence of diabetes for each dataset, using binary classification.