# Capstone Project - 02a EDA Phase 1 - SQL

## Introduction

In this notebook, I analyze the datasets using SQL. This gives me preliminary ideas for visualizations to prepare, features to engineer, and ultimately, models to create.

The information contained in these datasets is from distinct years, with some data from as long ago as 2008 to as recent as 2018. The objective here is to observe trends/relationships that may give insight into which trends to look for after reconciling the year differences through later feature transformation.

## I. Data Import

In [1]:
#imports necessary libraries

import pandas as pd
import numpy as np
from sqlalchemy import create_engine

In [2]:
engine = create_engine('postgresql://localhost/diabetes') #connects to the local diabetes database

## II. EDA by Table

In [3]:
#displays table names in database

sql = """
SELECT table_name
FROM information_schema.tables
WHERE table_schema = 'public'
"""

df = pd.read_sql(sql, engine)
df.head()

Unnamed: 0,table_name
0,taxes
1,diabetes18
2,variables
3,assistance
4,insecurity


### 1. Variables (Data Dictionary)

This table is essentially a data dictionary for the extensive number of variables (281) used in the Food Environment Atlas. In the README for this project, I included the Variables dataset as an Excel spreadsheet, because the information is much easier to use in that form.

However, I will observe the dataset here for consistency, as I will perform EDA on all of the other datasets within the Food Environment Atlas.

In [4]:
#displays top 5 rows of table

sql = """
SELECT *
FROM variables;
"""

df = pd.read_sql_query(sql, engine)
df.head()

Unnamed: 0,category_name,category_code,subcategory_name,variable_name,variable_code,geography,units
0,Access and Proximity to Grocery Store,ACCESS,Overall,"Population, low access to store, 2010",LACCESS_POP10,CNTY10,Count
1,Access and Proximity to Grocery Store,ACCESS,Overall,"Population, low access to store, 2015",LACCESS_POP15,CNTY10,Count
2,Access and Proximity to Grocery Store,ACCESS,Overall,"Population, low access to store (% change), 20...",PCH_LACCESS_POP_10_15,CNTY10,% change
3,Access and Proximity to Grocery Store,ACCESS,Overall,"Population, low access to store (%), 2010",PCT_LACCESS_POP10,CNTY10,Percent
4,Access and Proximity to Grocery Store,ACCESS,Overall,"Population, low access to store (%), 2015",PCT_LACCESS_POP15,CNTY10,Percent


As this is simply a data dictionary, and appears to be complete, no additional analysis is needed.

Instead, I will move on to observe the Health dataset from the Food Environment Atlas.

### 2. Health

This dataset contains information about obesity, diabetes (2008 and 2013), the number of recreational/fitness facilities, and more. I will begin by pairing columns from this dataset for analysis.

In [5]:
#displays top 5 rows of table

sql = """
SELECT *
FROM health;
"""

df = pd.read_sql_query(sql, engine)
df.head()

Unnamed: 0,fips,state,county,pct_diabetes_adults08,pct_diabetes_adults13,pct_obese_adults12,pct_obese_adults17,pct_hspa17,recfac11,recfac16,pch_recfac_11_16,recfacpth11,recfacpth16,pch_recfacpth_11_16
0,1001,AL,Autauga,11.4,13.0,33.0,36.3,,4,6,50.0,0.072465,0.108542,49.785629
1,1003,AL,Baldwin,9.8,10.4,33.0,36.3,,16,21,31.25,0.085775,0.1012,17.983256
2,1005,AL,Barbour,13.6,18.4,33.0,36.3,,2,0,-100.0,0.073123,0.0,-100.0
3,1007,AL,Bibb,11.1,14.8,33.0,36.3,,0,1,,0.0,0.044183,
4,1009,AL,Blount,11.4,14.1,33.0,36.3,,3,4,33.333333,0.052118,0.06949,33.333333


I see again that some values are blank. I know from my preliminary analysis that 760 of these are in pct_hspa17 (percent of physically active high schoolers in 2017), with 143 missing from each of the pch, or "percent change," columns. The % diabetes adults columns for 2008 and 2013 are missing only 5 values and 1 value, respectively.

Additionally, after observing mins and maxes for these columns, I know that none have the "-9999" that is at times used to fill in missing values. 

Therefore, at this time, I will analyze the information as-is.
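The missing-value and sentinel checks described above can be expressed directly in SQL. The following is a minimal sketch, not the check actually run during cleaning: it assumes an in-memory SQLite table as a stand-in for the PostgreSQL `health` table, and it loads only the five rows and two of the columns shown in the output above.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the PostgreSQL database.
# The rows are the five shown by df.head() above, limited to two columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE health (county TEXT, pct_hspa17 REAL, pch_recfac_11_16 REAL)"
)
conn.executemany(
    "INSERT INTO health VALUES (?, ?, ?)",
    [
        ("Autauga", None, 50.0),
        ("Baldwin", None, 31.25),
        ("Barbour", None, -100.0),
        ("Bibb", None, None),
        ("Blount", None, 33.333333),
    ],
)

# COUNT(*) counts rows and COUNT(col) counts non-NULL values,
# so their difference is the number of missing values in that column.
# The CASE expression counts rows holding the -9999 sentinel.
sql = """
SELECT COUNT(*) - COUNT(pct_hspa17)       AS hspa_missing,
       COUNT(*) - COUNT(pch_recfac_11_16) AS pch_missing,
       SUM(CASE WHEN pch_recfac_11_16 = -9999 THEN 1 ELSE 0 END) AS pch_sentinel
FROM health
"""
hspa_missing, pch_missing, pch_sentinel = conn.execute(sql).fetchone()
print(hspa_missing, pch_missing, pch_sentinel)
```

The same `COUNT(*) - COUNT(column)` pattern should also work in PostgreSQL, so the query string could be passed to `pd.read_sql_query(sql, engine)` against the full table.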

##### Diabetes 2008

#### 1. Minimum Values

It may be worthwhile to determine which counties had the lowest prevalence of diabetes in 2008. The ten counties with the lowest diabetes values in 2008 are determined below.

It would be interesting to observe these diabetes values alongside their respective obesity rates; however, the years do not match up, as the obesity data pertains to 2012 and 2017 (and appears to apply to states, rather than specific counties).

In [6]:
#displays counties with 10 lowest prevalences of diabetes

sql = """
SELECT state, county, pct_diabetes_adults08
FROM health
ORDER BY pct_diabetes_adults08
LIMIT 10

"""

df = pd.read_sql_query(sql, engine)
df

Unnamed: 0,state,county,pct_diabetes_adults08
0,CO,Eagle,3.0
1,CO,Douglas,3.4
2,CO,Routt,3.4
3,CO,Boulder,3.5
4,CO,Summit,3.5
5,MT,Gallatin,3.6
6,CO,Gunnison,3.6
7,UT,Summit,3.9
8,CO,La Plata,4.0
9,NM,Santa Fe,4.0


The 5 lowest prevalences of diabetes all occur in Colorado (as do 7 of the lowest 10). It may be worthwhile to observe what food environmental features are present in Colorado when state-wide data is later analyzed.

These values range from 3.0 - 4.0%.

*Additionally, the states represented here (Colorado, Montana, Utah, and New Mexico) are all in the Mountain West.*


#### 2. Maximum Values

In [7]:
#displays counties with 10 highest prevalence of diabetes

sql = """
SELECT state, county, pct_diabetes_adults08
FROM health
WHERE pct_diabetes_adults08 IS NOT NULL
ORDER BY pct_diabetes_adults08 DESC
LIMIT 10

"""

df = pd.read_sql_query(sql, engine)
df

Unnamed: 0,state,county,pct_diabetes_adults08
0,AL,Greene,18.2
1,AL,Perry,17.5
2,AL,Wilcox,17.1
3,AL,Sumter,17.1
4,AL,Lowndes,17.1
5,GA,Stewart,16.8
6,AL,Macon,16.2
7,GA,Taliaferro,16.2
8,AL,Chambers,16.1
9,AL,Marengo,15.9


The top 5 values for prevalence of diabetes (and 8 out of the 10 highest) occur in Alabama. The other two occur in Georgia.

Whereas the lowest values ranged from 3.0 - 4.0 %, the highest range from 15.9 - 18.2%!

It will be worth examining the food environmental features in Alabama.

#### 3. Average Prevalence of Diabetes

In [8]:

sql = """
SELECT AVG (pct_diabetes_adults08)
FROM health
WHERE pct_diabetes_adults08 IS NOT NULL

"""

df = pd.read_sql_query(sql, engine)
df

Unnamed: 0,avg
0,9.913257


#### *Findings*

Prevalence of diabetes ranges from 3.0 - 18.2% in 2008. Lowest values tend to be found in Colorado, whereas the highest tend to be found in Alabama. It may be worth comparing the food environmental features of each to see what could contribute to the difference.

Additionally, the average prevalence of diabetes is 9.91%.

##### Diabetes (2013)

Since there is obesity data from 2012, I would like to observe it alongside the diabetes information from 2013 (the two years are close).

#### 1. Minimum Diabetes Values: with Obesity, Ratio of Diabetes/Obesity

In [9]:
sql = """
SELECT state, county, pct_diabetes_adults13, pct_obese_adults12
    , pct_diabetes_adults13/pct_obese_adults12 AS ratio_diabetes_obese
FROM health
WHERE pct_diabetes_adults13 IS NOT NULL
ORDER BY pct_diabetes_adults13
LIMIT 10

"""

df = pd.read_sql_query(sql, engine)
df

Unnamed: 0,state,county,pct_diabetes_adults13,pct_obese_adults12,ratio_diabetes_obese
0,CO,Eagle,3.3,20.5,0.160976
1,CO,Summit,4.1,20.5,0.2
2,CO,Routt,4.1,20.5,0.2
3,CO,Boulder,4.1,20.5,0.2
4,VA,Arlington,4.2,27.4,0.153285
5,UT,Summit,4.2,24.3,0.17284
6,MT,Gallatin,4.3,24.3,0.176955
7,AK,Wade Hampton,4.6,25.7,0.178988
8,CO,Gunnison,4.6,20.5,0.22439
9,CO,Garfield,4.7,20.5,0.229268


The list of counties with the 10 lowest prevalences of diabetes in 2013 is similar to that of 2008, except that Arlington, VA; Wade Hampton, AK; and Garfield, CO are now present, while Douglas, CO; La Plata, CO; and Santa Fe, NM are absent.

The obesity values (20.5 - 27.4%) are all lower than the accepted national value of 33%. The ratio of diabetes to obesity ranges from approximately 0.15 to 0.23.

*Additionally, one can see that the lowest diabetes prevalence values are higher here (3.3 - 4.7%) than in 2008 (3.0 - 4.0%), suggesting rising prevalence of diabetes during this time period. Maximum values will be observed next.*

#### 2. Maximum Diabetes Values - with Obesity, Ratio of Diabetes/Obesity

In [10]:
sql = """
SELECT state, county, pct_diabetes_adults13, pct_obese_adults12
    , pct_diabetes_adults13/pct_obese_adults12 AS ratio_diabetes_obese
FROM health
WHERE pct_diabetes_adults13 IS NOT NULL
ORDER BY pct_diabetes_adults13 DESC
LIMIT 10

"""

df = pd.read_sql_query(sql, engine)
df

Unnamed: 0,state,county,pct_diabetes_adults13,pct_obese_adults12,ratio_diabetes_obese
0,AL,Lowndes,23.5,33.0,0.712121
1,AL,Perry,21.7,33.0,0.657576
2,AL,Greene,21.0,33.0,0.636364
3,AL,Marengo,20.2,33.0,0.612121
4,AL,Sumter,20.1,33.0,0.609091
5,AL,Bullock,19.6,33.0,0.593939
6,AL,Wilcox,19.3,33.0,0.584848
7,SC,Allendale,19.2,31.6,0.607595
8,AL,Dallas,19.0,33.0,0.575758
9,AL,Henry,18.7,33.0,0.566667


In 2013, 9 out of the 10 highest prevalences of diabetes occur in Alabama, and one in South Carolina - both southern states.

Prevalences here range from 18.7% to 23.5% of the population, meaning approximately 1 in 5 adults in these counties was diabetic. This is much higher than the 3.3 - 4.7% observed in the low prevalence set; additionally, the range is higher than the 15.9 - 18.2% observed for the high prevalence group in 2008.

It may also be worthwhile to note that the obesity rates are much higher here (typically 33%) than in the low diabetes prevalence set (20.5 - 27.4%).

*Again, obesity is a strong predictor of diabetes. It may be important to note that the ratio of diabetes to obesity is much higher here (0.57 - 0.71) than in the low prevalence diabetes set (0.15 - 0.23).*
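As a quick arithmetic check on the ratio ranges for the two groups, the ratios can be recomputed from the values shown in the two result tables above. This is only a sketch in plain Python: the pairs are copied from the displayed output, and `ratio_range` is a helper name introduced here for illustration.

```python
# (pct_diabetes_adults13, pct_obese_adults12) pairs copied from the
# low-prevalence and high-prevalence result tables above.
low_group = [(3.3, 20.5), (4.1, 20.5), (4.1, 20.5), (4.1, 20.5), (4.2, 27.4),
             (4.2, 24.3), (4.3, 24.3), (4.6, 25.7), (4.6, 20.5), (4.7, 20.5)]
high_group = [(23.5, 33.0), (21.7, 33.0), (21.0, 33.0), (20.2, 33.0), (20.1, 33.0),
              (19.6, 33.0), (19.3, 33.0), (19.2, 31.6), (19.0, 33.0), (18.7, 33.0)]


def ratio_range(pairs):
    """Return the (min, max) diabetes/obesity ratio, rounded to 2 decimals."""
    ratios = [diabetes / obesity for diabetes, obesity in pairs]
    return round(min(ratios), 2), round(max(ratios), 2)


low_range = ratio_range(low_group)
high_range = ratio_range(high_group)
print(low_range, high_range)
```

Because the obesity values appear to be state-level, differences in the ratio within a state are driven entirely by the county-level diabetes values.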

#### 3. Average Prevalence of Diabetes (2013)

In [11]:
sql = """
SELECT AVG (pct_diabetes_adults13)
FROM health
WHERE pct_diabetes_adults13 IS NOT NULL

"""

df = pd.read_sql_query(sql, engine)
df

Unnamed: 0,avg
0,11.236123


The average prevalence of diabetes was 11.24% - an increase from the 9.91% observed in 2008.

#### 4. Average Obesity (2012)

In [12]:
sql = """
SELECT AVG (pct_obese_adults12)
FROM health
WHERE pct_obese_adults12 IS NOT NULL

"""

df = pd.read_sql_query(sql, engine)
df

Unnamed: 0,avg
0,29.05657


Average prevalence of obesity in 2012 was 29.1%.

For the ratio, the values observed so far range from about 0.15 in the low-prevalence counties to about 0.71 in the high-prevalence counties. I will find the average for this ratio.

#### 5. Mean diabetes/obesity ratio (2013/2012)

In [13]:
sql = """
SELECT AVG (pct_diabetes_adults13 / pct_obese_adults12) AS avg_ratio
FROM health;
"""

df = pd.read_sql_query(sql, engine)
df.head(10)

Unnamed: 0,avg_ratio
0,0.385371


At 0.385, the average ratio is somewhat above 1/3. However, obesity values from 2012 were used instead of 2013. If the obesity values were higher in 2013 than in 2012, this average ratio would be lower.
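One subtlety is that the average of per-county ratios is not the same quantity as the ratio of the two averages. The sketch below uses only the three averages reported above; the one-point increase in obesity is a hypothetical value chosen purely to illustrate the direction of the effect, and the two averages may have been computed over slightly different sets of counties because of missing values.

```python
# Averages reported in the outputs above.
mean_diabetes_13 = 11.236123   # AVG(pct_diabetes_adults13)
mean_obesity_12 = 29.05657     # AVG(pct_obese_adults12)
mean_of_ratios = 0.385371      # AVG(pct_diabetes_adults13 / pct_obese_adults12)

# Ratio of the averages, which need not equal the average of the ratios.
ratio_of_means = mean_diabetes_13 / mean_obesity_12

# Hypothetical: if obesity were one percentage point higher in 2013,
# the ratio would be smaller, as argued in the text.
ratio_if_obesity_higher = mean_diabetes_13 / (mean_obesity_12 + 1.0)

print(round(ratio_of_means, 4), round(ratio_if_obesity_higher, 4))
```

The two summary figures are close here, but they answer slightly different questions: the SQL query weights every county's ratio equally, while the ratio of means compares the two column averages.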

Now, I will determine how to fill in the missing prevalence of diabetes 2013 value.

#### *Findings*

A comparison of diabetes values between those of 2008 and 2013 reveals an increased rate of diabetes on average, and when looking at both high-prevalence and low-prevalence counties. It will be worthwhile to examine food environmental factors that changed in the 5 years in between.

Meanwhile, the pattern first observed in 2008 remains: low-prevalence counties are typically in Colorado, whereas the high-prevalence counties are in Alabama.

As the low versus high prevalence counties represent 2 distinct geographical areas, it may be interesting to later observe patterns in diet, poverty, and other factors that differ between the two groups.

##### Recreation Facilities

I am also interested in looking for a correlation between the number of recreation & fitness facilities per 1,000 population and the percentage of adult diabetics.

#### 2011

#### 1. In Low Diabetes Prevalence Counties

In [14]:
#displays prevalence of diabetes in 2013 with the recreational value for 2011

sql = """
SELECT state, county, pct_diabetes_adults13, recfacpth11
FROM health
WHERE pct_diabetes_adults13 IS NOT NULL
ORDER BY pct_diabetes_adults13
LIMIT 10;
"""

df = pd.read_sql_query(sql, engine)
df

Unnamed: 0,state,county,pct_diabetes_adults13,recfacpth11
0,CO,Eagle,3.3,0.173437
1,CO,Summit,4.1,0.107189
2,CO,Routt,4.1,0.258109
3,CO,Boulder,4.1,0.186345
4,VA,Arlington,4.2,0.147854
5,UT,Summit,4.2,0.160265
6,MT,Gallatin,4.3,0.164303
7,AK,Wade Hampton,4.6,0.0
8,CO,Gunnison,4.6,0.259
9,CO,Garfield,4.7,0.089343


Only the county in Alaska has essentially no recreational facilities. The rest of the low-prevalence counties have 0.09 - 0.26 recreational facilities per 1000 people.

#### 2. In High Diabetes Prevalence Counties

In [15]:
#displays prevalence of diabetes in 2013 with the recreational value for 2011

sql = """
SELECT state, county, pct_diabetes_adults13, recfacpth11
FROM health
WHERE pct_diabetes_adults13 IS NOT NULL
ORDER BY pct_diabetes_adults13 DESC
LIMIT 10;
"""

df = pd.read_sql_query(sql, engine)
df

Unnamed: 0,state,county,pct_diabetes_adults13,recfacpth11
0,AL,Lowndes,23.5,0.089847
1,AL,Perry,21.7,0.0
2,AL,Greene,21.0,0.0
3,AL,Marengo,20.2,0.096857
4,AL,Sumter,20.1,0.074123
5,AL,Bullock,19.6,0.0
6,AL,Wilcox,19.3,0.0
7,SC,Allendale,19.2,0.0
8,AL,Dallas,19.0,0.023118
9,AL,Henry,18.7,0.057551


Half (5) of the 10 high diabetes prevalence counties have essentially no recreational facilities. Those that do range from 0.02 to 0.10 per 1000 people - this is generally lower than the 0.09 - 0.26 per 1000 people observed in the low prevalence counties.

This could be suggestive of a correlation, although 20 counties taken from the two extremes cannot establish one. This will be further explored in the next notebook.
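To put a rough number on this impression, a Pearson correlation can be computed over the 20 counties shown in the two tables above. This is only an illustrative sketch: the rows are copied from the displayed output, `pearson_r` is a helper written here for the purpose, and because the sample contains only the most extreme counties, the coefficient will overstate whatever relationship exists in the full table.

```python
from math import sqrt

# (pct_diabetes_adults13, recfacpth11) pairs copied from the two tables above:
# first the 10 lowest-prevalence counties, then the 10 highest.
rows = [
    (3.3, 0.173437), (4.1, 0.107189), (4.1, 0.258109), (4.1, 0.186345),
    (4.2, 0.147854), (4.2, 0.160265), (4.3, 0.164303), (4.6, 0.0),
    (4.6, 0.259), (4.7, 0.089343),
    (23.5, 0.089847), (21.7, 0.0), (21.0, 0.0), (20.2, 0.096857),
    (20.1, 0.074123), (19.6, 0.0), (19.3, 0.0), (19.2, 0.0),
    (19.0, 0.023118), (18.7, 0.057551),
]


def pearson_r(pairs):
    """Pearson correlation coefficient for a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / sqrt(var_x * var_y)


r = pearson_r(rows)
print(round(r, 2))
```

A negative coefficient here is consistent with the observation above, but a correlation over all counties would be needed before drawing conclusions.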

For now, 2016 data for recreational facilities will be observed next.

#### 2016

Unfortunately, this dataset contains information regarding diabetes for 2008 and 2013 only.

At a later time, the information may be estimated based on change between 2008 and 2013, or may be added manually from a different data source.

For now, obesity rates from 2017 will be used as a basis for comparison, although these are assessed by state. Therefore, the values below will be ordered by the recreational facilities information, rather than by obesity prevalence, in order to get county-specific information.

#### 1. Counties with 0 Recreational Facilities (per 1000)

Earlier, it was observed that some of the counties have 0 recreational facilities per 1000.

These exact counties will be observed below.

In [16]:
#displays prevalence of obesity from 2017 with the recreational values for 2016

sql = """
SELECT state, county, recfacpth16, pct_obese_adults17, pct_diabetes_adults13
FROM health
WHERE recfacpth16 = 0;

"""

df = pd.read_sql_query(sql, engine)
df

Unnamed: 0,state,county,recfacpth16,pct_obese_adults17,pct_diabetes_adults13
0,AL,Barbour,0.0,36.3,18.4
1,AL,Bullock,0.0,36.3,19.6
2,AL,Chambers,0.0,36.3,16.4
3,AL,Choctaw,0.0,36.3,18.1
4,AL,Clay,0.0,36.3,14.6
...,...,...,...,...,...
1004,WY,Crook,0.0,28.8,8.9
1005,WY,Goshen,0.0,28.8,9.0
1006,WY,Platte,0.0,28.8,8.8
1007,WY,Sublette,0.0,28.8,6.9


It appears that there are 1009 counties (of 3,143 total) that have essentially no recreational facilities! While there are other opportunities to get exercise, this may reflect a mean/median socioeconomic status too low to support tax-funded recreational facilities (and/or a low value placed on exercise, which would be difficult to assess in the current study).


Although the pct diabetes is from 2013, it appeared from the 2008 and 2013 data that high prevalence diabetes counties are still that way 5 years later; similarly low prevalence diabetes counties seem to be low prevalence diabetes counties 5 years later.

The average % diabetes prevalence in the counties with 0 recreational facilities per 1000 people will be observed below.

#### 2. Mean Prevalence of Diabetes in Counties with 0 Recreational Facilities per 1000

In [17]:
sql = """
SELECT AVG (pct_diabetes_adults13)
FROM health
WHERE recfacpth16 = 0;

"""

df = pd.read_sql_query(sql, engine)
df

Unnamed: 0,avg
0,11.789782


#### 3. Counties with the Highest Number of Recreational Facilities (per 1000 people)

In [18]:
sql = """
SELECT state, county, recfacpth16, pct_diabetes_adults13
FROM health
ORDER BY recfacpth16 DESC
LIMIT 10
"""

df = pd.read_sql_query(sql, engine)
df

Unnamed: 0,state,county,recfacpth16,pct_diabetes_adults13
0,ND,Oliver,1.053741,10.9
1,ND,Divide,0.83717,10.1
2,VA,Fairfax,0.713476,7.5
3,NE,Perkins,0.688231,8.6
4,TX,Briscoe,0.672948,11.2
5,TX,Irion,0.636943,10.6
6,VA,Falls Church,0.576868,10.2
7,WY,Teton,0.560828,4.8
8,NE,Dundy,0.550964,10.8
9,NE,Nuckolls,0.468604,9.4


#### 4. Mean Prevalence of Diabetes in Counties with Highest Number of Recreational Facilities per 1000 people

In [19]:

sql = """
SELECT AVG (pct_diabetes_adults13)
FROM health
WHERE recfacpth16 >= 0.468604
"""

df = pd.read_sql_query(sql, engine)
df

Unnamed: 0,avg
0,9.411111


This average (9.41%) is lower than the mean prevalence of diabetes (11.79%) in counties with the minimum number (0) of recreational facilities in 2016.
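One caveat about the query above: the reported 9.411111 equals the mean of the first nine rows of the top-10 table (84.7 / 9), whereas the mean of all ten rows is 9.41. This suggests that the hard-coded threshold `>= 0.468604`, which is a rounded display value, may have excluded Nuckolls County. A subquery with `ORDER BY ... LIMIT` avoids the hard-coded number. The sketch below assumes an in-memory SQLite table as a stand-in for PostgreSQL, loaded with the ten displayed rows plus two of the zero-facility counties shown earlier; `mean_top_n` is a helper name introduced here.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the PostgreSQL `health` table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE health (state TEXT, county TEXT, "
    "recfacpth16 REAL, pct_diabetes_adults13 REAL)"
)
conn.executemany(
    "INSERT INTO health VALUES (?, ?, ?, ?)",
    [
        # Top-10 rows as displayed above.
        ("ND", "Oliver", 1.053741, 10.9), ("ND", "Divide", 0.83717, 10.1),
        ("VA", "Fairfax", 0.713476, 7.5), ("NE", "Perkins", 0.688231, 8.6),
        ("TX", "Briscoe", 0.672948, 11.2), ("TX", "Irion", 0.636943, 10.6),
        ("VA", "Falls Church", 0.576868, 10.2), ("WY", "Teton", 0.560828, 4.8),
        ("NE", "Dundy", 0.550964, 10.8), ("NE", "Nuckolls", 0.468604, 9.4),
        # Two zero-facility counties from the earlier output, so LIMIT matters.
        ("AL", "Barbour", 0.0, 18.4), ("AL", "Bullock", 0.0, 19.6),
    ],
)


def mean_top_n(n):
    """Average diabetes prevalence over the n counties with the most facilities."""
    sql = """
    SELECT AVG(pct_diabetes_adults13)
    FROM (
        SELECT pct_diabetes_adults13
        FROM health
        ORDER BY recfacpth16 DESC
        LIMIT ?
    ) AS top_n
    """
    return conn.execute(sql, (n,)).fetchone()[0]


avg_top10 = mean_top_n(10)
avg_top9 = mean_top_n(9)
print(round(avg_top10, 4), round(avg_top9, 4))
```

Either way the average rounds to 9.41%, so the comparison with the zero-facility counties is unaffected.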

##### Findings

As mentioned earlier, data pertaining to prevalence of diabetes in 2017 is not contained within this dataset. However, the lower average prevalence of diabetes (2013) in the counties with the highest number of recreational facilities (2016) suggests that a correlation will be found.

This will be assessed later using the target dataset of 2018 diabetes prevalence values and/or calculated values based on trajectory between 2008 and 2013 values.


#### Percent High Schoolers Active

If healthy habits are established during youth, it can be easier to maintain them when older. My hypothesis is that active high schoolers are less likely to develop diabetes during adulthood.

During preliminary EDA/data cleaning, I discovered that 760 of the 3143 values for this feature are missing. If I use it, I may need to impute values.

First, however, I will observe for correlation between percent of high schoolers defined as active, and the prevalence of diabetes in 2013. It should be noted that unfortunately, the high schooler activity data is from 2017 only. Still, as noted previously, comparison of 2008 and 2013 diabetes prevalence values suggests that the same/similar counties will still be high or low respectively 5 years later.

#### 1. In Counties with Lowest Prevalence of Diabetes 

In [20]:
#displays lowest diabetes prevalence values (2013) & their corresponding pcthspa values (2017)

sql = """
SELECT state, county,  pct_diabetes_adults13, pct_hspa17
FROM health
WHERE pct_hspa17 IS NOT NULL AND pct_diabetes_adults13 IS NOT NULL
ORDER BY pct_diabetes_adults13
LIMIT 10

"""

df = pd.read_sql_query(sql, engine)
df

Unnamed: 0,state,county,pct_diabetes_adults13,pct_hspa17
0,CO,Eagle,3.3,27.4
1,CO,Summit,4.1,27.4
2,CO,Routt,4.1,27.4
3,CO,Boulder,4.1,27.4
4,VA,Arlington,4.2,22.4
5,UT,Summit,4.2,19.1
6,MT,Gallatin,4.3,28.0
7,AK,Wade Hampton,4.6,18.4
8,CO,Gunnison,4.6,27.4
9,CO,Garfield,4.7,27.4


For the low prevalence diabetes group, pct_hspa17 ranges from 18.4 to 28.0. (In looking at the data, the pct_hspa17 values appear to be state-level, as opposed to county-level, with the same value repeated within each state.)

PCTHSPA values will next be observed for the high-prevalence diabetes counties.

#### 2. In Counties with Highest Prevalence of Diabetes

In [21]:
#displays highest diabetes prevalence values (2013) & their corresponding pct_hspa17 values (2017)

sql = """
SELECT state, county,  pct_diabetes_adults13, pct_hspa17
FROM health
WHERE pct_hspa17 IS NOT NULL AND pct_diabetes_adults13 IS NOT NULL
ORDER BY pct_diabetes_adults13 DESC
LIMIT 10

"""

df = pd.read_sql_query(sql, engine)
df

Unnamed: 0,state,county,pct_diabetes_adults13,pct_hspa17
0,SC,Allendale,19.2,21.7
1,SC,Fairfield,18.6,21.7
2,SC,Calhoun,18.6,21.7
3,TN,Cumberland,18.2,25.6
4,OK,Haskell,18.0,29.5
5,KY,Lawrence,17.7,22.0
6,TN,Fayette,17.6,25.6
7,WV,McDowell,17.6,23.4
8,SC,McCormick,17.6,21.7
9,KY,McLean,17.5,22.0


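Here, the PCTHSPA values range from 21.7 to 29.5, which seems similar to the range in the low-prevalence group. Note that this list differs from the earlier high-prevalence list: the Alabama counties do not appear here, presumably because their pct_hspa17 values are missing.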

#### 3. Mean PCTHSPA Value

In [22]:
#displays mean pct_hspa17 value (2017)

sql = """
SELECT AVG(pct_hspa17)
FROM health
WHERE pct_hspa17 IS NOT NULL;

"""

df = pd.read_sql_query(sql, engine)
df

Unnamed: 0,avg
0,24.62052


The mean PCTHSPA value is 24.62. This is within the PCTHSPA range of both the high and low diabetes prevalence counties.

#### 4. Level of pctHSPA values

It appears that the HSPA values are state-level. I will observe:

In [23]:
sql = """
SELECT state, pct_hspa17
FROM health
WHERE state = 'CO'
LIMIT 5
"""

df = pd.read_sql_query(sql, engine)
df

Unnamed: 0,state,pct_hspa17
0,CO,27.4
1,CO,27.4
2,CO,27.4
3,CO,27.4
4,CO,27.4


In [24]:
sql = """
SELECT state, pct_hspa17
FROM health
WHERE state = 'CA'
LIMIT 5
"""

df = pd.read_sql_query(sql, engine)
df

Unnamed: 0,state,pct_hspa17
0,CA,27.5
1,CA,27.5
2,CA,27.5
3,CA,27.5
4,CA,27.5


These values do appear to be state level. While there are many missing values in this column, I wonder if there is at least one for each state. I will observe below:

In [25]:
sql = """
SELECT DISTINCT state, pct_hspa17
FROM health
WHERE pct_hspa17 IS NOT NULL
ORDER BY state
"""

df = pd.read_sql_query(sql, engine)
df

Unnamed: 0,state,pct_hspa17
0,AK,18.4
1,AR,21.4
2,AZ,24.5
3,CA,27.5
4,CO,27.4
5,CT,22.3
6,DC,13.4
7,DE,25.1
8,FL,22.8
9,HI,19.6


Unfortunately, not all states have a value for this column.
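Both questions (is the value constant within each state, and which states have no value at all) can be answered with `GROUP BY` and `HAVING` rather than by inspecting states one at a time. The following is a minimal sketch: it assumes an in-memory SQLite table as a stand-in for PostgreSQL and loads only a handful of rows that appear in the outputs above.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the PostgreSQL `health` table.
# Rows are taken from outputs shown earlier in this notebook.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE health (state TEXT, county TEXT, pct_hspa17 REAL)")
conn.executemany(
    "INSERT INTO health VALUES (?, ?, ?)",
    [
        ("AL", "Autauga", None), ("AL", "Baldwin", None), ("AL", "Barbour", None),
        ("CO", "Eagle", 27.4), ("CO", "Summit", 27.4), ("CO", "Routt", 27.4),
        ("SC", "Allendale", 21.7), ("SC", "Fairfield", 21.7),
        ("TN", "Cumberland", 25.6), ("TN", "Fayette", 25.6),
        ("MT", "Gallatin", 28.0), ("AK", "Wade Hampton", 18.4),
    ],
)

# States where the column takes more than one distinct non-NULL value;
# an empty result supports the state-level interpretation.
varying_sql = """
SELECT state
FROM health
GROUP BY state
HAVING COUNT(DISTINCT pct_hspa17) > 1
ORDER BY state
"""
states_with_varying_values = [row[0] for row in conn.execute(varying_sql)]

# States where every county is missing the value (COUNT ignores NULLs).
missing_sql = """
SELECT state
FROM health
GROUP BY state
HAVING COUNT(pct_hspa17) = 0
ORDER BY state
"""
states_without_value = [row[0] for row in conn.execute(missing_sql)]

print(states_with_varying_values, states_without_value)
```

Run against the full table, the second query would list exactly the states for which a value would have to be imputed or sourced elsewhere.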

##### Findings

If a relationship exists between prevalence of diabetes and PCTHSPA, it is not apparent at this time. Although the exact target diabetes values (2018) will be different from those of 2013, comparing high and low diabetes prevalence counties should still reveal a pattern if one exists; so far, none is clear.

Perhaps after observing a heatmap during later EDA/Visualization, a relationship may be revealed.

### Discussion

The data for obesity, diabetes, and recreational facilities do not line up in terms of years of assessment. However, it was observed that trends tend to persist in terms of which counties have higher prevalence of diabetes. Thus, additional trends that pertain to state, obesity, and number of recreational facilities seem to persist between 2008 and 2013, and will likely still be in effect for the target data (2018 diabetes prevalences). As noted, a relationship between PCTHSPA and diabetes has not been observed at this time.

At a later time, I may calculate or locate diabetes prevalence information for the specific years that match the other data (i.e. 2011, 2012, 2016, and/or 2017).

First, however, the Access dataset will be analyzed.

### 3. Access

The Access dataset within the Food Environment Atlas contains information regarding residents' access to foods, including percentages of those who are low income and have low access to stores, and more.

In [26]:
#displays top 5 rows of table

sql = """
SELECT *
FROM access;
"""

df = pd.read_sql_query(sql, engine)
df.head()

Unnamed: 0,fips,state,county,pct_laccess_pop10,pct_laccess_pop15,pct_laccess_lowi10,pct_laccess_lowi15,pct_laccess_hhnv10,pct_laccess_hhnv15,pct_laccess_snap15,pct_laccess_child10,pct_laccess_child15,pct_laccess_seniors10,pct_laccess_seniors15
0,1001,AL,Autauga,33.769657,32.062255,9.79353,11.991125,3.284786,3.351332,4.608749,8.837112,8.460485,4.376378,3.996279
1,1003,AL,Baldwin,19.318473,16.767489,5.460261,5.424427,2.147827,1.905114,1.2989,4.343199,3.844936,3.51357,3.06184
2,1005,AL,Barbour,20.840972,22.10556,11.420316,10.739667,4.135869,4.329378,4.303147,3.425062,3.758341,2.805166,3.001695
3,1007,AL,Bibb,4.559753,4.230324,2.144661,2.601627,3.45858,2.821427,0.67671,1.087518,1.015242,0.657008,0.600865
4,1009,AL,Blount,2.70084,6.49738,1.062468,2.88015,3.26938,3.336414,0.812727,0.67149,1.58872,0.340269,0.882583


I am very interested to look for relationships between this information and the diabetes prevalence information.

However, in looking more closely, I see that the information here is for 2010 and 2015, whereas the data for diabetes prevalence is from 2008 and 2013.

When I create visualizations, I will probably first create columns of estimated 2013 values for the Access dataset, based on the percent change between 2010 and 2015. 
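One simple way to estimate a 2013 value is linear interpolation between the 2010 and 2015 figures. The sketch below is an assumption about how the estimate might be made, not a method fixed by this notebook: it treats the change as evenly spread across the five years, and `interpolate_linear` is a helper name introduced here. The example values are Autauga County's pct_laccess_pop10 and pct_laccess_pop15 from the table above.

```python
def interpolate_linear(value_start, value_end, year_start, year_end, target_year):
    """Estimate a value at target_year, assuming a constant yearly change."""
    fraction = (target_year - year_start) / (year_end - year_start)
    return value_start + fraction * (value_end - value_start)


# Autauga County, AL: pct_laccess_pop10 and pct_laccess_pop15 from the output above.
autauga_2013 = interpolate_linear(33.769657, 32.062255, 2010, 2015, 2013)
print(round(autauga_2013, 4))
```

The same calculation could be written as a SQL expression, for example `pct_laccess_pop10 + 0.6 * (pct_laccess_pop15 - pct_laccess_pop10)`, since 2013 is three fifths of the way from 2010 to 2015.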

I will explore this here, using one of the feature types: % with low access to store (pct_laccess_pop).

### 1. Percent with low access to store (pct_laccess_pop10)  (Minimum Low Access to Store)

For now, I will observe the 2010 low access information with the 2013 diabetes values to see if differences exist between prevalence of diabetes between groups.

##### Prevalence of Diabetes in Counties with the LOWEST low access to stores (High access to stores counties)

In [27]:
sql = """
SELECT access.state, access.county, pct_laccess_pop10, health.pct_diabetes_adults13
FROM access
LEFT JOIN health
    ON health.fips = access.fips
ORDER BY pct_laccess_pop10
"""

df = pd.read_sql_query(sql, engine)
df

Unnamed: 0,state,county,pct_laccess_pop10,pct_diabetes_adults13
0,KY,Menifee,0.000000,12.3
1,TN,Grainger,0.000000,14.4
2,MI,Benzie,0.000000,12.6
3,WV,Lincoln,0.000000,15.7
4,IN,Union,0.000000,12.8
...,...,...,...,...
3138,ID,Clark,100.000001,8.5
3139,AK,Denali,100.000001,5.8
3140,NE,Banner,100.000001,13.1
3141,NE,Arthur,100.000001,8.3


It appears that some have a percent low access of 0% (high access to stores.)

This will be explored further.

##### Prevalence of Diabetes in Counties with 0% low access to Stores (HIGH Access)

In [28]:
sql = """
SELECT access.state, access.county, pct_laccess_pop10, health.pct_diabetes_adults13
FROM access
LEFT JOIN health
    ON health.fips = access.fips
    WHERE pct_laccess_pop10 = 0
ORDER BY health.pct_diabetes_adults13
"""

df = pd.read_sql_query(sql, engine)
df

Unnamed: 0,state,county,pct_laccess_pop10,pct_diabetes_adults13
0,VA,Alexandria,0.0,6.2
1,CO,Gilpin,0.0,6.3
2,VA,Fairfax,0.0,7.5
3,VA,King George,0.0,9.3
4,VA,Greene,0.0,9.9
5,GA,Dawson,0.0,10.5
6,TX,Somervell,0.0,10.7
7,NC,Avery,0.0,11.1
8,HI,Kalawao,0.0,11.3
9,WI,Pepin,0.0,11.7


It appears there are 27 counties without low access to stores. Many of these counties are in southern states.

Their diabetes values range from 6.2% to 17.5%. I will next look at the average pct_diabetes for this group.

#### Average % Diabetes in 0% low access counties

In [29]:
sql = """
SELECT AVG(pct_diabetes_adults13)
FROM health
LEFT JOIN access
    ON health.fips = access.fips
WHERE pct_laccess_pop10 = 0
"""

df = pd.read_sql_query(sql, engine)
df

Unnamed: 0,avg
0,12.007143


It appears the average diabetes percentage for the 0% low access group is 12.007.

Next, the 100% + low access group will be examined.

##### Prevalence of Diabetes in Counties with 100% low access to stores (LOW access)

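The query below keeps counties whose stored 2010 low-access percentage is greater than 100 (values such as 100.000001, presumably floating-point artifacts of 100%), ordered by diabetes prevalence. A county stored as exactly 100 would not be matched by this filter.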

In [30]:

sql = """
SELECT access.state, access.county, pct_laccess_pop10, health.pct_diabetes_adults13
FROM access
LEFT JOIN health
    ON health.fips = access.fips
    WHERE pct_laccess_pop10 >100.00000
ORDER BY health.pct_diabetes_adults13
"""

df = pd.read_sql_query(sql, engine)
df

Unnamed: 0,state,county,pct_laccess_pop10,pct_diabetes_adults13
0,AK,Denali,100.000001,5.8
1,CO,Kiowa,100.0,7.7
2,NE,Arthur,100.000001,8.3
3,ID,Clark,100.000001,8.5
4,TX,Oldham,100.0,9.3
5,MT,Petroleum,100.0,9.5
6,NV,Lincoln,100.0,9.5
7,NE,Frontier,100.0,9.7
8,MT,Golden Valley,100.0,9.7
9,NM,De Baca,100.0,9.8


In the 24 counties with 100% low access to stores, diabetes percentage ranges from 5.8% to 13.9%.

I will determine the average diabetes percentage for this group below.
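Because the filter uses a strict `> 100` comparison, any county stored as exactly 100 would be left out of this group. The sketch below illustrates the difference between `>` and `>=` under stated assumptions: an in-memory SQLite table stands in for PostgreSQL, and the three rows use the values as displayed in the outputs above (the stored values in the real table may differ slightly from what is displayed).

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the PostgreSQL `access` table,
# and the displayed values are treated as the stored values.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE access (state TEXT, county TEXT, pct_laccess_pop10 REAL)")
conn.executemany(
    "INSERT INTO access VALUES (?, ?, ?)",
    [
        ("AK", "Denali", 100.000001),
        ("CO", "Kiowa", 100.0),
        ("AL", "Autauga", 33.769657),
    ],
)


def counties_matching(comparison):
    """Return counties whose pct_laccess_pop10 satisfies `<comparison> 100`."""
    # comparison is restricted to two fixed operators, never user input.
    assert comparison in (">", ">=")
    sql = (
        "SELECT county FROM access "
        f"WHERE pct_laccess_pop10 {comparison} 100 ORDER BY county"
    )
    return [row[0] for row in conn.execute(sql)]


strict_matches = counties_matching(">")
inclusive_matches = counties_matching(">=")
print(strict_matches, inclusive_matches)
```

If the intent is "100% low access", `>= 100` is the safer filter, since it includes both the exact value and values that are marginally above it.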

##### Average Prevalence of Diabetes in 100%+ low access to stores

In [31]:
sql = """
SELECT AVG(pct_diabetes_adults13)
FROM health
LEFT JOIN access
    ON health.fips = access.fips
WHERE pct_laccess_pop10 > 100.00000
"""

df = pd.read_sql_query(sql, engine)
df

Unnamed: 0,avg
0,10.184


I would have expected the average diabetes prevalence of the low access to stores counties to be higher, but in fact, the low access to stores counties have a lower average prevalence of diabetes (10.18% versus 12.01%).

I suppose this makes sense - if one has access to money and a car, it may not necessarily matter that the store is far away in terms of getting nutritious food.

It may be the case that the low access + low income, and low access + low income + no vehicle variables will reveal more of a difference.

This will be explored later with a correlation chart, but for now, I will examine the low access + low income variable.

### 2. Percent Low Access to Stores - Low Income (pct_laccess_lowi10)

I suspect that when adding in the low income factor, a stronger relationship between this variable and diabetes prevalence will be seen.

##### Prevalence of Diabetes in Counties with the LOWEST pct_laccess_lowi10

In [32]:
sql = """
SELECT access.state, access.county, pct_laccess_lowi10, health.pct_diabetes_adults13
FROM access
LEFT JOIN health
    ON health.fips = access.fips
ORDER BY pct_laccess_lowi10
"""

df = pd.read_sql_query(sql, engine)
df

Unnamed: 0,state,county,pct_laccess_lowi10,pct_diabetes_adults13
0,KY,Menifee,0.000000,12.3
1,GA,Pike,0.000000,11.7
2,VA,Fairfax,0.000000,7.5
3,MS,Covington,0.000000,16.3
4,WI,Pepin,0.000000,11.7
...,...,...,...,...
3138,TX,Foard,53.001464,11.7
3139,CO,Costilla,55.349060,8.3
3140,ID,Clark,59.976663,8.5
3141,SD,Corson,62.935290,13.3


It appears there are counties with 0% low access & low income. I will explore this further.

##### Prevalence of Diabetes in Counties with 0% low access & low income

In [33]:
sql = """
SELECT access.state, access.county, pct_laccess_lowi10, health.pct_diabetes_adults13
FROM access
LEFT JOIN health
    ON health.fips = access.fips
    WHERE pct_laccess_lowi10 = 0
ORDER BY health.pct_diabetes_adults13
"""

df = pd.read_sql_query(sql, engine)
df

Unnamed: 0,state,county,pct_laccess_lowi10,pct_diabetes_adults13
0,VA,Alexandria,0.0,6.2
1,CO,Gilpin,0.0,6.3
2,NY,New York,0.0,6.7
3,VA,Fairfax,0.0,7.5
4,VA,King George,0.0,9.3
5,VA,Greene,0.0,9.9
6,TX,Loving,0.0,10.5
7,GA,Dawson,0.0,10.5
8,TX,Somervell,0.0,10.7
9,NC,Avery,0.0,11.1


There are 29 counties with 0% low income & low access. The prevalence of diabetes ranges from 6.2% to 17.5%.

Next, I will look at the counties with the highest values for comparison.

##### Prevalence of Diabetes in Counties with highest low access & low income

In [34]:
sql = """
SELECT access.state, access.county, pct_laccess_lowi10, health.pct_diabetes_adults13
FROM access
LEFT JOIN health
    ON health.fips = access.fips
ORDER BY pct_laccess_lowi10 DESC
LIMIT 10
"""

df = pd.read_sql_query(sql, engine)
df

Unnamed: 0,state,county,pct_laccess_lowi10,pct_diabetes_adults13
0,TX,Hudspeth,72.274456,10.0
1,SD,Corson,62.93529,13.3
2,ID,Clark,59.976663,8.5
3,CO,Costilla,55.34906,8.3
4,TX,Foard,53.001464,11.7
5,TX,Real,52.408342,12.1
6,CO,Saguache,51.455473,7.0
7,AK,Yukon-Koyukuk,50.088821,8.3
8,KS,Elk,49.830168,12.4
9,MT,Meagher,49.655386,8.9


The 10 counties with the highest % low access & low income are displayed here. Diabetes prevalence ranges from 7.0 - 13.3%. (Here I am looking at the top 10, whereas previously I was looking at 29 counties because they all had the same % low access & low income score.)

Looking at the range of diabetes prevalences, I am not able to see a clear difference between the groups, as I could with some of the other variables.

##### Findings

Relationships between access variables and diabetes prevalence are not clear at this time.

As mentioned, this will be explored further in the next notebook. First, however, I will analyze the variables in the Assistance dataset.

### 4. Assistance

This dataset contains information regarding food-related public benefits, including WIC, SNAP, and qualification for free meals in schools. Essentially, the participants in these programs are low-income, and therefore have limited food budgets.

Anecdotally, my observation as a nutritionist, and the observations of others, suggest that it is less expensive to fill up on foods such as fast food, chips, and soda (high in fat, sugar, and calories) than on whole-food, plant-based options. Since SNAP participants have a very limited budget for food, it makes sense that they might choose the first option, which could then lead to higher rates of conditions such as obesity and diabetes. (REFERENCE.)

Full heatmaps and correlation charts will be observed later. During this initial phase of EDA, I will explore the pct_snap12 variable, which is the percentage of the population that participates in SNAP (Supplemental Nutrition Assistance Program). (REFERENCE)

In [35]:
#displays top 5 rows of table

sql = """
SELECT *
FROM assistance;
"""

df = pd.read_sql_query(sql, engine)
df.head()

Unnamed: 0,fips,state,county,pct_snap12,pct_snap17,snap_part_rate11,snap_part_rate16,pct_nslp12,pct_nslp17,pct_sbp12,...,pct_wic17,pct_wicinfantchild14,pct_wicinfantchild16,pct_wicwomen14,pct_wicwomen16,pct_cacfp12,pct_cacfp17,fdpir12,fdpir15,food_banks18
0,1001,AL,Autauga,18.908476,16.500056,84.02,86.898,68.226043,63.12659,27.206328,...,2.54357,33.481211,32.910876,3.318827,3.309759,0.891239,1.258763,0,0,0
1,1003,AL,Baldwin,18.908476,16.500056,84.02,86.898,68.226043,63.12659,27.206328,...,2.54357,33.481211,32.910876,3.318827,3.309759,0.891239,1.258763,0,0,0
2,1005,AL,Barbour,18.908476,16.500056,84.02,86.898,68.226043,63.12659,27.206328,...,2.54357,33.481211,32.910876,3.318827,3.309759,0.891239,1.258763,0,0,0
3,1007,AL,Bibb,18.908476,16.500056,84.02,86.898,68.226043,63.12659,27.206328,...,2.54357,33.481211,32.910876,3.318827,3.309759,0.891239,1.258763,0,0,0
4,1009,AL,Blount,18.908476,16.500056,84.02,86.898,68.226043,63.12659,27.206328,...,2.54357,33.481211,32.910876,3.318827,3.309759,0.891239,1.258763,0,0,0


##### Lowest Percent SNAP Participants (2012) & Prevalence of Diabetes

SNAP stands for "Supplemental Nutrition Assistance Program," which is for low income families or individuals who need assistance with paying for food.

I hypothesize that places with higher participation rates will have higher prevalence of diabetes due to the population's inability to afford nutritious foods.

In [36]:
#shows counties with 10 lowest SNAP participation rates & their respective prevalences of diabetes

sql = """
SELECT assistance.state, assistance.county, pct_snap12, health.pct_diabetes_adults13
FROM assistance
LEFT JOIN health
    ON assistance.fips = health.fips
ORDER BY pct_snap12
LIMIT 10
"""

df = pd.read_sql_query(sql, engine)
df

Unnamed: 0,state,county,pct_snap12,pct_diabetes_adults13
0,WY,Crook,5.956719,8.9
1,WY,Hot Springs,5.956719,11.0
2,WY,Campbell,5.956719,7.0
3,WY,Converse,5.956719,10.2
4,WY,Fremont,5.956719,9.6
5,WY,Goshen,5.956719,9.0
6,WY,Albany,5.956719,6.9
7,WY,Big Horn,5.956719,11.7
8,WY,Carbon,5.956719,8.2
9,WY,Johnson,5.956719,8.7


This information appears to be state-level, as all of these counties have the exact same participation value.

It seems that Wyoming has the lowest SNAP participation rate. I will determine the state average prevalence of diabetes for Wyoming.
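One way to check whether a column is really state-level is to count its distinct values within each state. The sketch below is only an illustration under stated assumptions: it uses an in-memory SQLite table instead of the actual `diabetes` database, the `fips` values are placeholders, and `toy_county_value` is a hypothetical county-level column added for contrast.

```python
import sqlite3

# Toy stand-in for the assistance table (not the real database).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE assistance (fips INTEGER, state TEXT, county TEXT, "
    "pct_snap12 REAL, toy_county_value REAL)"
)
rows = [
    (1, "WY", "Crook", 5.956719, 1.0),     # fips values are placeholders
    (2, "WY", "Campbell", 5.956719, 2.0),
    (3, "MS", "Carroll", 22.121369, 3.0),
    (4, "MS", "Alcorn", 22.121369, 4.0),
]
conn.executemany("INSERT INTO assistance VALUES (?, ?, ?, ?, ?)", rows)

# A state-level column has exactly one distinct value per state.
sql = """
SELECT state,
       COUNT(DISTINCT pct_snap12) AS n_snap_values,
       COUNT(DISTINCT toy_county_value) AS n_toy_values
FROM assistance
GROUP BY state
ORDER BY state
"""
distinct_counts = conn.execute(sql).fetchall()
print(distinct_counts)
```

A count of 1 for every state suggests the column is state-level; larger counts suggest county-level values. Against the real database, the same query text could presumably be passed to `pd.read_sql_query(sql, engine)`.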

##### Average Prevalence of Diabetes in Wyoming (lowest % SNAP participation, 2012)

In [37]:
sql = """
SELECT AVG(pct_diabetes_adults13) AS avg_prev_diabetes_wy
FROM health
WHERE state = 'WY'
"""

df = pd.read_sql_query(sql, engine)
df

Unnamed: 0,avg_prev_diabetes_wy
0,8.9


In Wyoming, the average prevalence of diabetes is 8.9%. I will compare this value to the average prevalence of diabetes in the state with the highest SNAP participation percentage.

##### Highest Percent SNAP Participants (2012) & Prevalence of Diabetes

In [38]:
#shows counties with 10 highest SNAP participation rates & their respective prevalences of diabetes

sql = """
SELECT assistance.state, assistance.county, pct_snap12, health.pct_diabetes_adults13
FROM assistance
LEFT JOIN health
    ON assistance.fips = health.fips
ORDER BY pct_snap12 DESC
LIMIT 10
"""

df = pd.read_sql_query(sql, engine)
df

Unnamed: 0,state,county,pct_snap12,pct_diabetes_adults13
0,DC,District of Columbia,22.205895,8.1
1,MS,Carroll,22.121369,16.8
2,MS,Alcorn,22.121369,14.3
3,MS,Attala,22.121369,15.4
4,MS,Bolivar,22.121369,16.5
5,MS,Calhoun,22.121369,14.7
6,MS,Benton,22.121369,15.9
7,MS,Adams,22.121369,13.9
8,MS,Amite,22.121369,17.6
9,MS,Chickasaw,22.121369,15.2


It appears that the District of Columbia has the highest SNAP participation rate at more than 22%, close to 1/4 of the population. However, its prevalence of diabetes (8.1%) is actually lower than Wyoming's.

If I compare the top 5 states versus the bottom 5 states, I may notice a difference. At this time, since the dataset contains state-level data pertaining to obesity, and obesity is a risk factor for diabetes, I will observe the obesity rates for each of the states.

##### Prevalence of Obesity in 5 states with lowest SNAP participation

In [39]:
#shows states with 5 lowest SNAP participation rates & their respective prevalences of obesity

sql = """
SELECT DISTINCT assistance.state, pct_snap12, health.pct_obese_adults12
FROM assistance
LEFT JOIN health
    ON assistance.fips = health.fips
ORDER BY pct_snap12
LIMIT 5
"""

df = pd.read_sql_query(sql, engine)
df

Unnamed: 0,state,pct_snap12,pct_obese_adults12
0,WY,5.956719,24.6
1,ND,8.38295,29.7
2,NH,8.849462,27.3
3,NJ,9.301112,24.6
4,CO,9.479336,20.5


The prevalence of obesity in these states with lowest SNAP participation ranges from 20.5 to 29.7%.

##### Prevalence of Obesity in 5 states with highest SNAP participation

In [40]:
#shows states with 5 highest SNAP participation rates & their respective average prevalences of obesity

sql = """
SELECT DISTINCT assistance.state, pct_snap12, health.pct_obese_adults12
FROM assistance
LEFT JOIN health
    ON assistance.fips = health.fips
ORDER BY pct_snap12 DESC
LIMIT 5
"""

df = pd.read_sql_query(sql, engine)
df

Unnamed: 0,state,pct_snap12,pct_obese_adults12
0,DC,22.205895,21.9
1,MS,22.121369,34.6
2,NM,21.033493,27.1
3,OR,20.935749,27.3
4,LA,20.613165,34.7


The 5 states with the highest SNAP participation rate have a prevalence of obesity ranging from 21.9 to 34.7%, which is higher than the 20.5 to 29.7% observed in the states with the lowest SNAP participation rate.

This is the anticipated finding. 
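Ranges are sensitive to a single extreme state, so a group mean can be a useful complement. The following is a minimal sketch that uses only the ten obesity values displayed in the two outputs above; the variable names are illustrative and the means are simple unweighted averages of those values.

```python
from statistics import mean

# Obesity prevalence (2012) copied from the two query outputs above.
low_snap = {"WY": 24.6, "ND": 29.7, "NH": 27.3, "NJ": 24.6, "CO": 20.5}
high_snap = {"DC": 21.9, "MS": 34.6, "NM": 27.1, "OR": 27.3, "LA": 34.7}

# Unweighted mean across the five states in each group.
low_mean = mean(low_snap.values())
high_mean = mean(high_snap.values())

print(f"lowest SNAP participation:  mean obesity {low_mean:.2f}%")
print(f"highest SNAP participation: mean obesity {high_mean:.2f}%")
```

With only five states per group, this is a rough comparison rather than a test of the hypothesis.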

##### Findings

Preliminary EDA suggests that correlations may be found between these assistance variables and the prevalence of diabetes. Visualizations will be explored in the next notebook, and for now, I will move into analyzing the Insecurity dataset.

### 5. Insecurity

This dataset contains information about the prevalence of household food insecurity in each of the states. Because these variables are state-level, I will observe the relationships between them and obesity.

Variables that can be assessed include food insecurity (foodinsec) and very low food security (vlfoodsec). Both will be analyzed here. I will use the 2012 - 2014 data in order to do this, before observing averages of some of the columns.

In [41]:
#displays top 5 rows of table

sql = """
SELECT *
FROM insecurity;
"""

df = pd.read_sql_query(sql, engine)
df.head()

Unnamed: 0,fips,state,county,foodinsec_12_14,foodinsec_15_17,ch_foodinsec_14_17,vlfoodsec_12_14,vlfoodsec_15_17,ch_vlfoodsec_14_17
0,1001,AL,Autauga,16.8,16.3,-0.5,7.2,7.1,-0.1
1,1003,AL,Baldwin,16.8,16.3,-0.5,7.2,7.1,-0.1
2,1005,AL,Barbour,16.8,16.3,-0.5,7.2,7.1,-0.1
3,1007,AL,Bibb,16.8,16.3,-0.5,7.2,7.1,-0.1
4,1009,AL,Blount,16.8,16.3,-0.5,7.2,7.1,-0.1


##### Lowest Prevalence of Food Insecurity (2012 - 2014) & Prevalence of Obesity (2012)

In [42]:
sql = """
SELECT DISTINCT insecurity.state, foodinsec_12_14, health.pct_obese_adults12
FROM insecurity
LEFT JOIN health
    ON insecurity.fips = health.fips
ORDER BY foodinsec_12_14
LIMIT 5
"""

df = pd.read_sql_query(sql, engine)
df

Unnamed: 0,state,foodinsec_12_14,pct_obese_adults12
0,NV,8.4,26.2
1,ME,9.6,28.4
2,NJ,10.0,24.6
3,VT,10.1,23.7
4,MN,10.4,25.7


The states with lowest food insecurity (8.4 to 10.4%) have a prevalence of obesity that ranges from 23.7 to 28.4.

##### Highest Prevalence of Food Insecurity (2012 - 2014) & Prevalence of Obesity (2012)

In [43]:
sql = """
SELECT DISTINCT insecurity.state, foodinsec_12_14, health.pct_obese_adults12
FROM insecurity
LEFT JOIN health
    ON insecurity.fips = health.fips
ORDER BY foodinsec_12_14 DESC
LIMIT 5
"""

df = pd.read_sql_query(sql, engine)
df

Unnamed: 0,state,foodinsec_12_14,pct_obese_adults12
0,MO,22.0,29.6
1,AZ,19.9,26.0
2,LA,17.6,34.7
3,KY,17.5,31.3
4,TX,17.2,29.2


The 5 states with the highest food insecurity (17.2 to 22.0%) have a prevalence of obesity that ranges from 26.0 to 34.7%. Both ends of this range are higher than those for the states with low food insecurity (23.7 to 28.4%).

I anticipate that a relationship between food insecurity and diabetes will be found when assessed. For now, I will move on to observing patterns for states with very low food security.

##### Lowest Prevalence of Very Low Food Security (2012 - 2014) & Prevalence of Obesity (2012)

In [44]:
sql = """
SELECT DISTINCT insecurity.state, vlfoodsec_12_14, health.pct_obese_adults12
FROM insecurity
LEFT JOIN health
    ON insecurity.fips = health.fips
ORDER BY vlfoodsec_12_14
LIMIT 5
"""

df = pd.read_sql_query(sql, engine)
df

Unnamed: 0,state,vlfoodsec_12_14,pct_obese_adults12
0,NV,2.9,26.2
1,HI,4.0,23.6
2,ME,4.1,28.4
3,MN,4.2,25.7
4,VT,4.3,23.7


Not surprisingly, four of the five states that were lowest in food insecurity (NV, ME, MN, and VT) are also lowest in very low food security; Hawaii replaces New Jersey. Obesity prevalence in these states is therefore similar to that shown previously, ranging from 23.6 to 28.4%.

##### Highest Prevalence of Very Low Food Security (2012 - 2014) & Prevalence of Obesity (2012)

In [45]:
sql = """
SELECT DISTINCT insecurity.state, vlfoodsec_12_14, health.pct_obese_adults12
FROM insecurity
LEFT JOIN health
    ON insecurity.fips = health.fips
ORDER BY vlfoodsec_12_14 DESC
LIMIT 5
"""

df = pd.read_sql_query(sql, engine)
df

Unnamed: 0,state,vlfoodsec_12_14,pct_obese_adults12
0,AZ,8.1,26.0
1,MS,7.9,34.6
2,OH,7.5,30.1
3,MA,7.5,22.9
4,MO,7.3,29.6


This list is different from the 5 states with the highest prevalence of food insecurity. The 5 states with the highest prevalence of very low food security are AZ, MS, OH, MA, and MO. In contrast, the states with the highest food insecurity are MO, AZ, LA, KY, and TX.

In the states with the highest very low food security, the prevalence of obesity ranges from 22.9 to 34.6%. The upper end is close to the 34.7% seen in the high food insecurity states and well above the 28.4% seen in the states with the lowest very low food security, although the lower end (22.9%, Massachusetts) falls slightly below theirs (23.6%). The higher upper end is the expected finding.


Next, I will determine the average food insecurity across all states combined.

##### Average Food Insecurity (2012 - 2014)

In [46]:
sql = """
SELECT AVG(foodinsec_12_14)
FROM insecurity
"""

df = pd.read_sql_query(sql, engine)
df

Unnamed: 0,avg
0,14.863952


The average prevalence of food insecurity (2012 - 2014), taken over all county rows, was 14.86%.
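Because each state's value is repeated on every one of its county rows, `AVG` over the whole table weights states by their number of counties. A minimal sketch of the difference between that and a one-row-per-state average follows; it assumes an in-memory SQLite table with hypothetical states `A` and `B` and made-up values, not the real `insecurity` data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE insecurity (fips INTEGER, state TEXT, foodinsec_12_14 REAL)"
)
# Toy rows: state A has three counties, state B has one (hypothetical values).
conn.executemany(
    "INSERT INTO insecurity VALUES (?, ?, ?)",
    [(1, "A", 10.0), (2, "A", 10.0), (3, "A", 10.0), (4, "B", 20.0)],
)

# Average over county rows: state A is counted three times.
row_avg = conn.execute(
    "SELECT AVG(foodinsec_12_14) FROM insecurity"
).fetchone()[0]

# Average over one row per state.
state_avg = conn.execute("""
    SELECT AVG(foodinsec_12_14)
    FROM (SELECT DISTINCT state, foodinsec_12_14 FROM insecurity) AS per_state
""").fetchone()[0]

print(row_avg, state_avg)
```

Which of the two is appropriate depends on the question: the row-based average describes the typical county, while the per-state average treats every state equally.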

##### Average Food Insecurity (2015 - 2017)

In [47]:
sql = """
SELECT AVG(foodinsec_15_17)
FROM insecurity
"""

df = pd.read_sql_query(sql, engine)
df

Unnamed: 0,avg
0,12.833726


The average prevalence of food insecurity in 2015 - 2017 (12.83%) is lower than in 2012 - 2014 (14.86%). If time permits, I will explore further what the causes of this change might be.

For now, I will determine the averages for very low food security.

##### Average Very Low Food Security (2012 - 2014)

In [48]:
sql = """
SELECT AVG(vlfoodsec_12_14)
FROM insecurity

"""

df = pd.read_sql_query(sql, engine)
df

Unnamed: 0,avg
0,5.915527


The average very low food security (2012 to 2014) is 5.92%.

##### Average Very Low Food Security (2015 - 2017)

In [49]:

sql = """
SELECT AVG(vlfoodsec_15_17)
FROM insecurity

"""

df = pd.read_sql_query(sql, engine)
df

Unnamed: 0,avg
0,5.098027


This average very low food security prevalence (2015 to 2017), 5.10%, is lower than it was for the years 2012 - 2014 (5.92%).
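The size of these changes can be expressed in percentage points and relative to the earlier period. This is a small arithmetic sketch using only the four averages reported in the outputs above; the dictionary layout and names are illustrative.

```python
# Averages copied from the four query outputs above: (2012-2014, 2015-2017).
averages = {
    "foodinsec": (14.863952, 12.833726),
    "vlfoodsec": (5.915527, 5.098027),
}

changes = {}
for name, (avg_12_14, avg_15_17) in averages.items():
    point_change = avg_15_17 - avg_12_14               # percentage points
    relative_change = 100 * point_change / avg_12_14   # percent of earlier value
    changes[name] = (point_change, relative_change)
    print(f"{name}: {point_change:+.3f} points, {relative_change:+.1f}% relative")
```

Both measures fell by a similar relative amount, which is one reason the two variables may carry overlapping information in later modeling.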

##### Findings

States with high prevalence of food insecurity and very low food security were generally shown to have higher rates of obesity, which is a risk factor for diabetes. These variables may be of use during later modeling.

Additionally, it appears that on average, food insecurity and very low food security were less prevalent in 2015 - 2017 than in 2012 - 2014.

Further details will be sought later - for now, I will move on to discuss the Local dataset.

### 6. Local

This dataset contains 100 columns of information regarding local food options, including local farms, farming acres, farmers' markets, and more, including percents, percent changes, and whole numbers.

I will observe these columns, their relationships, and more using pandas.

For now, I will move on to the Restaurants dataset.

### 7. Restaurants

This dataset contains information regarding the number of full-service restaurants (fsr) and fast food restaurants (ffr), both as counts and per 1000 people, and per capita expenditures at each.

First, I will calculate the average number of fast food restaurants per 1000 people, and average expenditure on fast food in the counties. Then, I will explore the relationships these features may have with prevalence of diabetes (2013).

In [50]:
#displays top 5 rows of table

sql = """
SELECT *
FROM restaurants;
"""

df = pd.read_sql_query(sql, engine)
df.head()

Unnamed: 0,fips,state,county,ffr11,ffr16,ffrpth11,ffrpth16,fsr11,fsr16,fsrpth11,fsrpth16,pc_ffrsales07,pc_ffrsales12,pc_fsrsales07,pc_fsrsales12
0,1001,AL,Autauga,34,44,0.615953,0.795977,32.0,31.0,0.579721,0.560802,649.511367,674.80272,484.381507,512.280987
1,1003,AL,Baldwin,121,156,0.648675,0.751775,216.0,236.0,1.157966,1.1373,649.511367,674.80272,484.381507,512.280987
2,1005,AL,Barbour,19,23,0.694673,0.892372,17.0,14.0,0.621549,0.543183,649.511367,674.80272,484.381507,512.280987
3,1007,AL,Bibb,6,7,0.263794,0.309283,5.0,7.0,0.219829,0.309283,649.511367,674.80272,484.381507,512.280987
4,1009,AL,Blount,20,23,0.347451,0.399569,15.0,12.0,0.260589,0.208471,649.511367,674.80272,484.381507,512.280987


In looking at these values, it appears that the fsr and ffr values are county-level as the counties shown all have different values. Meanwhile, the exact same value for the sales columns appears in each of the 5 counties shown, suggesting this information is state-level.

##### Average Fast Food Restaurants per 1000 (2011)

In [51]:


sql = """
SELECT AVG(ffrpth11)
FROM restaurants;
"""

df = pd.read_sql_query(sql, engine)
df

Unnamed: 0,avg
0,0.560159


It appears that on average, each county has approximately 0.56 fast food restaurants per 1000 people.

##### Average Expenditure on Fast Food Restaurants per capita (2012)

In [52]:
sql = """
SELECT AVG(pc_ffrsales12)
FROM restaurants;
"""

df = pd.read_sql_query(sql, engine)
df.head()

Unnamed: 0,avg
0,599.639926


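Taken at face value, this would mean that the average person spent $599.64 on fast food in 2012, although the figure is an average over county rows of what appear to be state-level values.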

##### Prevalence of Diabetes (2013) in Counties with a Low Prevalence of Fast Food Restaurants (2011)

In [53]:
sql = """
SELECT restaurants.state, restaurants.county, ffrpth11, health.pct_diabetes_adults13
FROM restaurants
LEFT JOIN health
    ON restaurants.fips = health.fips
ORDER BY ffrpth11
LIMIT 10
"""

df = pd.read_sql_query(sql, engine)
df

Unnamed: 0,state,county,ffrpth11,pct_diabetes_adults13
0,AK,Skagway,0.0,6.4
1,AK,Wade Hampton,0.0,4.6
2,AK,Petersburg,0.0,7.1
3,AK,Prince of Wales-Hyder,0.0,7.3
4,AL,Greene,0.0,21.0
5,AK,Lake and Peninsula,0.0,7.4
6,AK,Bristol Bay,0.0,8.5
7,AL,Coosa,0.0,17.3
8,AK,Hoonah-Angoon,0.0,7.3
9,AK,Wrangell,0.0,8.9


It appears that some counties have essentially no fast food restaurants. Most of the counties shown are in Alaska (two are in Alabama), but there could be others. I will determine this below.

##### Prevalence of Diabetes (2013) in Counties with 0 Fast Food Restaurants (2011)

In [54]:


sql = """
SELECT restaurants.state, restaurants.county, ffrpth11, health.pct_diabetes_adults13
FROM restaurants
LEFT JOIN health
    ON restaurants.fips = health.fips
WHERE ffrpth11 = 0
ORDER BY health.pct_diabetes_adults13

"""

df = pd.read_sql_query(sql, engine)
df

Unnamed: 0,state,county,ffrpth11,pct_diabetes_adults13
0,AK,Wade Hampton,0.0,4.6
1,CO,Crowley,0.0,6.0
2,CO,Gilpin,0.0,6.3
3,AK,Skagway,0.0,6.4
4,CO,Jackson,0.0,6.6
...,...,...,...,...
161,GA,Quitman,0.0,16.9
162,GA,Clay,0.0,17.2
163,AL,Coosa,0.0,17.3
164,AL,Greene,0.0,21.0


It appears that there are 165 counties with 0 fast food restaurants per 1000 people (the output index runs from 0 to 164).

I will determine the average below.
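The number of counties with no fast food restaurants can also be obtained directly with `COUNT(*)`, rather than read off the index of a truncated output, and grouping by state shows where those counties are. This is a minimal sketch on an in-memory SQLite table with a handful of rows taken from the outputs above; the `fips` values are placeholders, and the real table would of course return different counts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE restaurants (fips INTEGER, state TEXT, county TEXT, ffrpth11 REAL)"
)
conn.executemany(
    "INSERT INTO restaurants VALUES (?, ?, ?, ?)",
    [
        (1, "AK", "Skagway", 0.0),       # fips values are placeholders
        (2, "AL", "Greene", 0.0),
        (3, "CO", "Crowley", 0.0),
        (4, "AL", "Autauga", 0.615953),  # not counted: has fast food restaurants
    ],
)

# Number of zero-fast-food counties per state.
sql = """
SELECT state, COUNT(*) AS n_counties
FROM restaurants
WHERE ffrpth11 = 0
GROUP BY state
ORDER BY state
"""
per_state = conn.execute(sql).fetchall()
total = sum(n for _, n in per_state)
print(per_state, total)
```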

##### Average Prevalence of Diabetes in Counties with 0 fast food restaurants per 1000 people

In [55]:


sql = """
SELECT AVG(pct_diabetes_adults13)
FROM health
LEFT JOIN restaurants
    ON restaurants.fips = health.fips
WHERE restaurants.ffrpth11 = 0 AND health.pct_diabetes_adults13 IS NOT NULL

"""

df = pd.read_sql_query(sql, engine)
df

Unnamed: 0,avg
0,11.004242


The average prevalence of diabetes in counties with 0 fast food restaurants per 1000 people is 11.00%.

I will next determine the average prevalence of diabetes in counties with the highest number of fast food restaurants per 1000 people.

##### Prevalence of Diabetes in Counties with Highest Number of Fast Food Restaurants per 1000 people

In [56]:
sql = """
SELECT restaurants.state, restaurants.county, ffrpth11, health.pct_diabetes_adults13
FROM restaurants
LEFT JOIN health
    ON restaurants.fips = health.fips
ORDER BY ffrpth11 DESC
LIMIT 10
"""

df = pd.read_sql_query(sql, engine)
df

Unnamed: 0,state,county,ffrpth11,pct_diabetes_adults13
0,CO,San Juan,5.797101,6.7
1,VA,Norton,3.173828,9.6
2,VA,Falls Church,2.988596,10.2
3,VA,Fairfax,2.935682,7.5
4,UT,Rich,2.614379,8.7
5,CO,San Miguel,2.276074,5.0
6,VA,Colonial Heights,2.08261,11.6
7,MT,Petroleum,2.016129,9.5
8,MA,Nantucket,1.876729,8.1
9,NJ,Cape May,1.864107,11.0


In observing these, it appears that the average prevalence of diabetes in these counties will be less than the 11.00% observed in the counties with 0 fast food restaurants per 1000. Still, I will calculate it below. 

##### Average Prevalence of Diabetes in Counties with Highest Number of Fast Food Restaurants per 1000 people

In [57]:
sql = """
SELECT AVG(pct_diabetes_adults13)
FROM health
LEFT JOIN restaurants
    ON restaurants.fips = health.fips
WHERE restaurants.ffrpth11 >= 1.864107

"""

df = pd.read_sql_query(sql, engine)
df

Unnamed: 0,avg
0,8.544444


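While these counties have many fast food restaurants, their average prevalence of diabetes is lower than the 11.00% seen in counties with 0 fast food restaurants per 1000 people. Note that 8.54 matches the mean of the first nine counties listed above (76.9 / 9), which suggests that the hard-coded threshold, copied from the rounded display, excluded Cape May. The mean of all ten displayed values is 8.79%, which is still lower than 11.00%.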
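A cutoff typed in from a rounded display, such as `1.864107`, is fragile: if the stored value is slightly smaller than what is shown, the boundary county drops out. Ranking inside a subquery avoids this. The sketch below is an illustration under assumptions: it uses in-memory SQLite tables, placeholder `fips` values, a hypothetical unrounded value for Cape May, and one invented low-density county.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE restaurants (fips INTEGER, county TEXT, ffrpth11 REAL)")
conn.execute("CREATE TABLE health (fips INTEGER, pct_diabetes_adults13 REAL)")

# Rows from the top-10 output above. Cape May's stored value (1.8641069) is a
# hypothetical unrounded number that would display as 1.864107, and
# "Toy County" is invented so that LIMIT 10 has something to exclude.
rows = [
    ("San Juan", 5.797101, 6.7), ("Norton", 3.173828, 9.6),
    ("Falls Church", 2.988596, 10.2), ("Fairfax", 2.935682, 7.5),
    ("Rich", 2.614379, 8.7), ("San Miguel", 2.276074, 5.0),
    ("Colonial Heights", 2.08261, 11.6), ("Petroleum", 2.016129, 9.5),
    ("Nantucket", 1.876729, 8.1), ("Cape May", 1.8641069, 11.0),
    ("Toy County", 0.5, 12.0),
]
for fips, (county, ffr, diabetes) in enumerate(rows, start=1):
    conn.execute("INSERT INTO restaurants VALUES (?, ?, ?)", (fips, county, ffr))
    conn.execute("INSERT INTO health VALUES (?, ?)", (fips, diabetes))

# Hard-coded threshold: silently averages only nine counties here.
threshold_avg = conn.execute("""
    SELECT AVG(h.pct_diabetes_adults13)
    FROM health AS h
    JOIN restaurants AS r ON r.fips = h.fips
    WHERE r.ffrpth11 >= 1.864107
""").fetchone()[0]

# Ranking in a subquery: averages exactly the ten highest counties.
top10_avg = conn.execute("""
    SELECT AVG(pct_diabetes_adults13)
    FROM (
        SELECT h.pct_diabetes_adults13
        FROM restaurants AS r
        JOIN health AS h ON r.fips = h.fips
        ORDER BY r.ffrpth11 DESC
        LIMIT 10
    ) AS top10
""").fetchone()[0]

print(round(threshold_avg, 2), round(top10_avg, 2))
```

The subquery form should also work in PostgreSQL, which requires the alias on the derived table.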

Perhaps the expenditure in the fast food restaurants is more predictive of prevalence of diabetes. I will explore this next.

##### Prevalence of Diabetes in Counties with Low Fast Food Expenditure per capita

In [58]:
sql = """
SELECT restaurants.state, restaurants.county, pc_ffrsales12, health.pct_diabetes_adults13
FROM restaurants
LEFT JOIN health
    ON restaurants.fips = health.fips
ORDER BY pc_ffrsales12
LIMIT 10
"""

df = pd.read_sql_query(sql, engine)
df

Unnamed: 0,state,county,pc_ffrsales12,pct_diabetes_adults13
0,VT,Franklin,364.112002,8.1
1,VT,Orange,364.112002,7.7
2,VT,Caledonia,364.112002,8.2
3,VT,Essex,364.112002,10.2
4,VT,Grand Isle,364.112002,7.4
5,VT,Lamoille,364.112002,7.0
6,VT,Addison,364.112002,6.7
7,VT,Bennington,364.112002,8.8
8,VT,Chittenden,364.112002,6.5
9,VT,Orleans,364.112002,9.3


This confirms that the information in this column is state-level, with the exact same sales value shown in every row for Vermont.

I will observe the 5 states with the lowest fast food expenditure.

##### Distinct States with Lowest Fast Food Expenditure per capita

In [59]:
sql = """
SELECT DISTINCT state, pc_ffrsales12
FROM restaurants
ORDER BY pc_ffrsales12
LIMIT 5
"""

df = pd.read_sql_query(sql, engine)
df

Unnamed: 0,state,pc_ffrsales12
0,VT,364.112002
1,PA,425.683469
2,CT,436.682244
3,NJ,436.783112
4,ME,447.176931


The states with the lowest expenditure on fast food are Vermont, Pennsylvania, Connecticut, New Jersey, and Maine. These states are all in the Northeast.

Since the information is state-level, I will compare it to the average obesity prevalence (2012) in these five states.

In [60]:
sql = """
SELECT AVG(pct_obese_adults12)
FROM health
LEFT JOIN restaurants
    ON health.fips = restaurants.fips
WHERE health.state IN ('VT', 'PA', 'CT', 'NJ', 'ME')
"""

df = pd.read_sql_query(sql, engine)
df

Unnamed: 0,avg
0,27.438889


The average prevalence of obesity in the states with lowest expenditure on fast food is 27.44%.

Now, I will observe the states with highest expenditure on fast food.

##### Distinct States with Highest Fast Food Expenditure per capita

In [61]:
sql = """
SELECT DISTINCT state, pc_ffrsales12
FROM restaurants
ORDER BY pc_ffrsales12 DESC
LIMIT 5
"""

df = pd.read_sql_query(sql, engine)
df

Unnamed: 0,state,pc_ffrsales12
0,DC,1035.391608
1,HI,836.408434
2,TX,728.768645
3,NV,694.278586
4,KY,677.731961


The "state" with highest expenditure per capita on fast food is Washington D.C., whose expenditure more than doubles the expenditure in each of the 5 states with lowest fast food expenditure.

The 4 other states with the highest fast food expenditure are Hawaii, Texas, Nevada, and Kentucky.

I will observe their rates of obesity.

In [62]:
sql = """
SELECT AVG(pct_obese_adults12)
FROM health
LEFT JOIN restaurants
    ON health.fips = restaurants.fips
WHERE health.state IN ('DC', 'HI', 'TX', 'NV', 'KY')
"""

df = pd.read_sql_query(sql, engine)
df

Unnamed: 0,avg
0,29.61738


The average prevalence of obesity in the states with highest fast food expenditure is 29.62%, which is higher than the average prevalence of obesity in the states with lowest fast food expenditure.


##### Findings

Counties with the highest number of fast food restaurants per 1000 people actually had a lower average prevalence of diabetes when compared to those counties with no fast food restaurants. This was not the anticipated finding.

However, states with the lowest expenditure on fast food were shown to have a lower average prevalence of obesity when compared to states with the highest expenditure on fast food. It may be that the number of restaurants is less important when predicting health outcomes than the amount of money people spend (and therefore how much they eat) at the restaurants.

Relationships between obesity and/or diabetes and full service restaurants may also exist, but will be explored in the next notebook. For now, I will move on to analysis of the Socioeconomic dataset.

### 8. Socioeconomic

This dataset includes information regarding the socioeconomic status of the populations living in each of the counties. For now, I will observe information regarding persistent poverty (perpov10), metro/nonmetro counties (metro13), and how these variables relate to prevalence of diabetes (2013).

In [63]:
#displays top 5 rows of table

sql = """
SELECT *
FROM socioeconomic;
"""

df = pd.read_sql_query(sql, engine)
df.head()

Unnamed: 0,fips,state,county,pct_65older10,pct_18younger10,medhhinc15,povrate15,perpov10,childpovrate15,perchldpov10,metro13,poploss10
0,1001,AL,Autauga,11.995382,26.777959,56580.0,12.7,0,18.8,0,1,0.0
1,1003,AL,Baldwin,16.771185,22.987408,52387.0,12.9,0,19.6,0,1,0.0
2,1005,AL,Barbour,14.236807,21.906982,31433.0,32.0,1,45.2,1,0,0.0
3,1007,AL,Bibb,12.68165,22.696923,40767.0,22.2,0,29.3,1,1,0.0
4,1009,AL,Blount,14.722096,24.608353,50487.0,14.7,0,22.2,0,1,0.0


##### Number of Persistent Poverty Counties

In [64]:

sql = """
SELECT COUNT(perpov10)
FROM socioeconomic
WHERE perpov10 = 1
"""

df = pd.read_sql_query(sql, engine)
df

Unnamed: 0,count
0,353


In 2010, there were 353 counties experiencing persistent poverty in the United States. As there are approximately 3143 counties, this means more than 1/10 of these were experiencing persistent poverty.

##### Average Prevalence of Diabetes in Persistent Poverty Positive (1) Counties

In [65]:
sql = """
SELECT AVG(pct_diabetes_adults13)
FROM health
LEFT JOIN socioeconomic
    ON health.fips = socioeconomic.fips
WHERE socioeconomic.perpov10 = 1

"""

df = pd.read_sql_query(sql, engine)
df.head()

Unnamed: 0,avg
0,13.72068


The average prevalence of diabetes was 13.72% (approximate, as the years of the diabetes prevalence and poverty measurement are 3 years apart.)

##### Average Prevalence of Diabetes in Persistent Poverty Negative (0) Counties

In [66]:
sql = """
SELECT AVG(pct_diabetes_adults13)
FROM health
LEFT JOIN socioeconomic
    ON health.fips = socioeconomic.fips
WHERE socioeconomic.perpov10 = 0

"""

df = pd.read_sql_query(sql, engine)
df.head()

Unnamed: 0,avg
0,10.921657


The average diabetes prevalence of 10.92% was less than that of the persistent poverty counties (13.72%).
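The two averages can also be produced in a single query with `GROUP BY`, together with the number of counties behind each average. The following is a sketch on in-memory SQLite tables with hypothetical values and placeholder `fips` keys, not the real `socioeconomic` and `health` tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE socioeconomic (fips INTEGER, perpov10 INTEGER)")
conn.execute("CREATE TABLE health (fips INTEGER, pct_diabetes_adults13 REAL)")

# Toy rows (hypothetical values); county 5 has no diabetes estimate.
conn.executemany(
    "INSERT INTO socioeconomic VALUES (?, ?)",
    [(1, 0), (2, 0), (3, 1), (4, 1), (5, 1)],
)
conn.executemany(
    "INSERT INTO health VALUES (?, ?)",
    [(1, 10.0), (2, 12.0), (3, 13.0), (4, 15.0), (5, None)],
)

# COUNT(column) and AVG(column) both skip NULLs, so n_with_data reports
# how many counties actually contribute to each average.
sql = """
SELECT s.perpov10,
       COUNT(h.pct_diabetes_adults13) AS n_with_data,
       AVG(h.pct_diabetes_adults13) AS avg_diabetes
FROM socioeconomic AS s
JOIN health AS h ON h.fips = s.fips
GROUP BY s.perpov10
ORDER BY s.perpov10
"""
grouped = conn.execute(sql).fetchall()
print(grouped)
```

Seeing the group sizes next to the means helps judge how much weight a difference between the groups deserves.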

This suggests a link may exist between poverty and prevalence of diabetes. I will now explore the variable of metro/nonmetro counties to see if a link may exist with it and prevalence of diabetes.

##### Number of Metropolitan (1) Counties

In [67]:
sql = """
SELECT COUNT(metro13)
FROM socioeconomic
WHERE metro13 = 1
"""

df = pd.read_sql_query(sql, engine)
df

Unnamed: 0,count
0,1167


There were 1167 metropolitan counties in 2013. I will observe to see if there is a difference in diabetes prevalence between metropolitan and non-metropolitan counties.

##### Average Prevalence of Diabetes in Metropolitan (1) Counties

In [68]:
sql = """
SELECT AVG(pct_diabetes_adults13)
FROM health
LEFT JOIN socioeconomic
    ON health.fips = socioeconomic.fips
WHERE socioeconomic.metro13 = 1

"""

df = pd.read_sql_query(sql, engine)
df.head()

Unnamed: 0,avg
0,10.770926


The average prevalence of diabetes in metropolitan counties was 10.77%. In this case, the metropolitan data and diabetes prevalence data are from the same year (2013.)

##### Average Prevalence of Diabetes in Non-Metropolitan (0) Counties

In [69]:
sql = """

SELECT AVG(pct_diabetes_adults13)
FROM health
LEFT JOIN socioeconomic
    ON health.fips = socioeconomic.fips
WHERE socioeconomic.metro13 = 0

"""

df = pd.read_sql_query(sql, engine)
df

Unnamed: 0,avg
0,11.510628


The average prevalence of diabetes is higher in non-metropolitan counties at 11.51%.

##### Findings

The results of the EDA performed for this dataset indicate that persistent poverty counties have a higher average prevalence of diabetes (13.72% versus 10.92%), while metropolitan counties have, on average, a slightly lower prevalence than non-metropolitan counties (10.77% versus 11.51%).

I will explore the relationships between prevalence of diabetes and the other columns in this dataset in the next notebook. First, I will discuss the Stores dataset.

### 9. Stores

This dataset contains 39 variables related to stores, including grocery stores, supercenters, SNAP-authorized stores, WIC-authorized stores, and more.

Any of these variables might be useful, but there are many of them. Rather than analyze them here, I will use Pandas to look for correlations and to prepare visualizations. After that, I will decide which, if any, to use in my models.

For now, I will move on to discuss the Supplemental County dataset. 

### 10. Supplemental Data - County

This dataset contains the census population of each county for 2010, and population estimates for 2011 - 2018. I may not end up using it, and no analysis is needed at this time. I will move on to discuss the Supplemental State dataset.

### 11. Supplemental Data - State

This dataset contains raw numbers of participants in various state programs such as WIC, and state populations from 2012 - 2018. I may not need this information, and no analysis is needed at this time. I will move on to observe the Taxes dataset.

### 12. Taxes

This dataset contains information about state taxes on soda, chips, and food in general.

Since this information is state-level, I will observe for relationships between these state taxes and obesity, which is also state-level.

In [70]:
#displays top 5 rows of table

sql = """
SELECT *
FROM taxes;
"""

df = pd.read_sql_query(sql, engine)
df.head()

Unnamed: 0,fips,state,county,sodatax_stores14,sodatax_vendm14,chipstax_stores14,chipstax_vendm14,food_tax14
0,1001,AL,Autauga,4.0,4.0,4.0,4.0,4.0
1,1003,AL,Baldwin,4.0,4.0,4.0,4.0,4.0
2,1005,AL,Barbour,4.0,4.0,4.0,4.0,4.0
3,1007,AL,Bibb,4.0,4.0,4.0,4.0,4.0
4,1009,AL,Blount,4.0,4.0,4.0,4.0,4.0


##### Average soda tax in stores (2014)

In [71]:

sql = """
SELECT AVG(sodatax_stores14)
FROM taxes;
"""

df = pd.read_sql_query(sql, engine)
df

Unnamed: 0,avg
0,3.951254


The average tax on soda in stores (2014) was 3.95%.

##### Average chip tax in stores (2014)

In [72]:
sql = """
SELECT AVG(chipstax_stores14)
FROM taxes
"""

df = pd.read_sql_query(sql, engine)
df

Unnamed: 0,avg
0,1.102251


The average tax on chips in stores (2014) was 1.10%.

##### Average food tax (2014)

In [73]:
sql = """
SELECT AVG(food_tax14)
FROM taxes
"""

df = pd.read_sql_query(sql, engine)
df

Unnamed: 0,avg
0,1.102251


This is the same as the tax placed on chips. It appears that on average, there is a higher tax on soda than on foods in general. I could speculate that this is because soda is such a popular item, and states hope to maximize revenue.

I will move on to observe the differences in obesity prevalence in low chips/soda/food tax states versus states in which these taxes are high.

##### Prevalence of obesity (2012) in states with low taxes on soda (2014)

In [74]:
sql = """

SELECT DISTINCT taxes.state, taxes.sodatax_stores14, health.pct_obese_adults12
FROM taxes
LEFT JOIN health
    ON taxes.fips = health.fips
WHERE taxes.sodatax_stores14 = 0
"""

df = pd.read_sql_query(sql, engine)
df

Unnamed: 0,state,sodatax_stores14,pct_obese_adults12
0,DE,0.0,26.9
1,AZ,0.0,26.0
2,LA,0.0,34.7
3,NE,0.0,28.6
4,AK,0.0,25.7
5,NM,0.0,27.1
6,WY,0.0,24.6
7,VT,0.0,23.7
8,GA,0.0,29.1
9,MT,0.0,24.3


It appears that these 10 states have no tax on their soda. I will find their average obesity prevalence.

##### Average Prevalence of Obesity in States with no tax on soda

In [75]:
sql = """

SELECT AVG(health.pct_obese_adults12)
FROM health
LEFT JOIN taxes
    ON taxes.fips = health.fips
WHERE taxes.sodatax_stores14 = 0
"""

df = pd.read_sql_query(sql, engine)
df

Unnamed: 0,avg
0,28.680432


The average obesity prevalence in these states is 28.68%. I will observe states with the highest tax on soda.

##### Prevalence of Obesity in States with highest tax on soda

In [76]:
sql = """

SELECT DISTINCT taxes.state, taxes.sodatax_stores14, health.pct_obese_adults12
FROM taxes
LEFT JOIN health
    ON taxes.fips = health.fips
ORDER BY taxes.sodatax_stores14 DESC
LIMIT 5
"""

df = pd.read_sql_query(sql, engine)
df

Unnamed: 0,state,sodatax_stores14,pct_obese_adults12
0,IN,7.0,31.4
1,NJ,7.0,24.6
2,RI,7.0,25.7
3,MS,7.0,34.6
4,MN,6.875,25.7


It appears that four states share the same highest tax on soda at 7.00%. Since I selected states with the same tax (0%) for lowest, I will use the four states with the same highest tax for calculation of average prevalence of obesity.

##### Average Prevalence of Obesity in states with 7% (highest) tax on soda

In [77]:
sql = """

SELECT AVG(health.pct_obese_adults12)
FROM health
LEFT JOIN taxes
    ON taxes.fips = health.fips
WHERE taxes.sodatax_stores14 = 7
"""

df = pd.read_sql_query(sql, engine)
df

Unnamed: 0,avg
0,31.8555


The average prevalence of obesity is higher in the states with a high tax on soda, at 31.86%. Again, I can only speculate as to why this is the case, but perhaps people in the high soda tax states drink more soda (high in calories and sugar, low in satiety), and soda is therefore taxed more in order to generate revenue.

In any event, now that I have seen a potential relationship between tax on a specific food item and prevalence of obesity, I will look for a relationship between overall food tax and prevalence of obesity.
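The `AVG` queries above run over county rows, so states with many counties weigh more heavily. As a cross-check, an unweighted per-state mean can be computed from the values displayed in the two state-level outputs. This sketch uses only the states shown in those outputs, so it is not a substitute for the full query, and the variable names are illustrative.

```python
from statistics import mean

# Obesity prevalence (2012) copied from the displayed outputs above.
no_soda_tax = {
    "DE": 26.9, "AZ": 26.0, "LA": 34.7, "NE": 28.6, "AK": 25.7,
    "NM": 27.1, "WY": 24.6, "VT": 23.7, "GA": 29.1, "MT": 24.3,
}
seven_pct_tax = {"IN": 31.4, "NJ": 24.6, "RI": 25.7, "MS": 34.6}

# Each state counts once, regardless of how many counties it has.
no_tax_mean = mean(no_soda_tax.values())
seven_pct_mean = mean(seven_pct_tax.values())

print(f"no soda tax: {no_tax_mean:.2f}%   7% soda tax: {seven_pct_mean:.2f}%")
```

The direction of the difference is the same as in the county-weighted averages, but the gap is smaller, which is worth keeping in mind before reading much into it.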

##### Prevalence of Obesity in States with no food tax

In [78]:
sql = """

SELECT DISTINCT taxes.state, taxes.food_tax14, health.pct_obese_adults12
FROM taxes
LEFT JOIN health
    ON taxes.fips = health.fips
WHERE taxes.food_tax14 = 0
LIMIT 10
"""

df = pd.read_sql_query(sql, engine)
df

Unnamed: 0,state,food_tax14,pct_obese_adults12
0,AZ,0.0,26.0
1,MN,0.0,25.7
2,LA,0.0,34.7
3,KY,0.0,31.3
4,AK,0.0,25.7
5,NM,0.0,27.1
6,MT,0.0,24.3
7,IN,0.0,31.4
8,SC,0.0,31.6
9,WI,0.0,29.7


These states have 0 food tax, and there may be others. I will find the average prevalence of obesity in states with no food tax.

##### Average Prevalence of obesity in states with no food tax

In [79]:
sql = """

SELECT AVG(health.pct_obese_adults12)
FROM health
LEFT JOIN taxes
    ON taxes.fips = health.fips
WHERE taxes.food_tax14 = 0
"""

df = pd.read_sql_query(sql, engine)
df

Unnamed: 0,avg
0,28.559571


The average prevalence of obesity in states with no food tax is 28.56%. I will now observe states with the highest food tax.

##### Prevalence of Obesity in states with highest food tax

In [80]:
sql = """

SELECT DISTINCT taxes.state, taxes.food_tax14, health.pct_obese_adults12
FROM taxes
LEFT JOIN health
    ON taxes.fips = health.fips
ORDER BY taxes.food_tax14 DESC
LIMIT 5
"""

df = pd.read_sql_query(sql, engine)
df

Unnamed: 0,state,food_tax14,pct_obese_adults12
0,MS,7.0,34.6
1,KS,6.15,29.9
2,ID,6.0,26.8
3,TN,5.0,31.1
4,OK,4.5,32.2


The states with the 5 highest taxes on food are Mississippi, Kansas, Idaho, Tennessee, and Oklahoma. All tax amounts are unique. I will calculate the average prevalence of obesity for these states.

##### Average Prevalence of Obesity in states with highest food tax

In [81]:
sql = """

SELECT AVG(health.pct_obese_adults12)
FROM health
LEFT JOIN taxes
    ON taxes.fips = health.fips
WHERE taxes.state IN ('MS', 'KS', 'ID', 'TN', 'OK')
"""

df = pd.read_sql_query(sql, engine)
df

Unnamed: 0,avg
0,31.240199


The average prevalence here, 31.24%, is higher than in the states with no food tax (28.56%). Rather than hazard a speculation at this point, I will wait and see if other patterns appear that might shed light on why these states have a higher food tax and obesity rate.
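
Because the join returns one row per county, the 31.24% above is weighted by county count. Averaging the five state values shown earlier directly gives (34.6 + 29.9 + 26.8 + 31.1 + 32.2) / 5 = 30.92%. A minimal sketch of a state-level query follows. It runs against an in-memory SQLite stand-in with a handful of toy rows (the FIPS codes and the number of counties per state are illustrative only), and the column alias `avg_obese` is my own naming.

```python
import sqlite3
import pandas as pd

# In-memory stand-in for the Postgres database; all rows are toy data
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE taxes  (fips INTEGER, state TEXT, food_tax14 REAL);
CREATE TABLE health (fips INTEGER, state TEXT, pct_obese_adults12 REAL);
INSERT INTO taxes  VALUES (28001,'MS',7.0),(28003,'MS',7.0),(28005,'MS',7.0),(16001,'ID',6.0);
INSERT INTO health VALUES (28001,'MS',34.6),(28003,'MS',34.6),(28005,'MS',34.6),(16001,'ID',26.8);
""")

# Same shape as the query above: one row per county goes into AVG
per_county = """
SELECT AVG(health.pct_obese_adults12) AS avg_obese
FROM health
JOIN taxes ON taxes.fips = health.fips
WHERE taxes.state IN ('MS', 'ID')
"""

# Collapse to one row per state first, then average
per_state = """
SELECT AVG(pct_obese_adults12) AS avg_obese
FROM (
    SELECT DISTINCT taxes.state, health.pct_obese_adults12
    FROM health
    JOIN taxes ON taxes.fips = health.fips
    WHERE taxes.state IN ('MS', 'ID')
) AS by_state
"""

county_avg = pd.read_sql_query(per_county, con)["avg_obese"].iloc[0]
state_avg = pd.read_sql_query(per_state, con)["avg_obese"].iloc[0]
print(county_avg, state_avg)
```

The `DISTINCT` subquery relies on the obesity rate being constant within a state, which matches how `pct_obese_adults12` appears in the outputs above. If that assumption fails, a `GROUP BY state` with an inner `AVG` would be the safer form.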

##### Findings

Preliminary observations suggest that states with higher tax on soda and food in general may have higher rates of obesity. These may translate into higher prevalence of diabetes as well, which will be further explored.

Before moving on to visualizations, the target dataset will be observed here.

### 13. Diabetes Atlas Data - 2018

This dataset contains the prevalence of diabetes in each of the counties. As previously mentioned, my intent is to use these values as the target. However, preliminary analysis in the previous notebook suggests that the Diabetes18 and Health datasets may not cover exactly the same set of counties.

This will be explored below. Additionally, I will observe min, max, and average diabetes prevalences for 2018.

In [82]:
sql = """

SELECT *
FROM diabetes18
"""

df = pd.read_sql_query(sql, engine)
df.head()

Unnamed: 0,fips,county,state,pct_diabetes18
0,1001,Autauga,Alabama,9.5
1,1003,Baldwin,Alabama,8.4
2,1005,Barbour,Alabama,13.5
3,1007,Bibb,Alabama,10.2
4,1009,Blount,Alabama,10.5
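
The min, max, and average mentioned above can be obtained in a single aggregate query. The sketch below runs it against an in-memory SQLite table holding only the five Alabama rows shown in the output, so the numbers it prints describe those five rows and not the full table. The aliases `min_pct`, `max_pct`, and `avg_pct` are my own.

```python
import sqlite3
import pandas as pd

# In-memory stand-in containing only the five rows displayed above
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE diabetes18 (fips INTEGER, county TEXT, state TEXT, pct_diabetes18 REAL)"
)
con.executemany(
    "INSERT INTO diabetes18 VALUES (?, ?, ?, ?)",
    [
        (1001, "Autauga", "Alabama", 9.5),
        (1003, "Baldwin", "Alabama", 8.4),
        (1005, "Barbour", "Alabama", 13.5),
        (1007, "Bibb", "Alabama", 10.2),
        (1009, "Blount", "Alabama", 10.5),
    ],
)

sql = """
SELECT MIN(pct_diabetes18) AS min_pct,
       MAX(pct_diabetes18) AS max_pct,
       AVG(pct_diabetes18) AS avg_pct
FROM diabetes18
"""

summary = pd.read_sql_query(sql, con)
print(summary)
```

The same `sql` string should work unchanged with the Postgres `engine` used in this notebook, since it uses only standard aggregates.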


##### Counties in Health that are not in Diabetes18

In [83]:
sql = """

SELECT health.fips, health.state, health.county, health.pct_diabetes_adults13
FROM health
LEFT JOIN diabetes18
    ON health.fips = diabetes18.fips
WHERE diabetes18.county IS NULL
"""

df = pd.read_sql_query(sql, engine)
df

Unnamed: 0,fips,state,county,pct_diabetes_adults13
0,2270,AK,Wade Hampton,4.6
1,35039,NM,Rio Arriba,8.8
2,46113,SD,Shannon,15.8
3,51515,VA,Bedford,


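It appears that these four counties are in the Health dataset but not in the Diabetes18 dataset. The missing 2013 value for Bedford, VA is a separate issue: the join matches on `fips`, so the null does not affect which rows are returned, but it does leave that county without a usable 2013 prevalence. I will need to verify these findings once the EDA is complete.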

For now, I will check for counties that are in the Diabetes18 dataset that are not in the Health dataset.

##### Counties in Diabetes18 that are not in Health

In [84]:
sql = """

SELECT diabetes18.fips, diabetes18.state, diabetes18.county, diabetes18.pct_diabetes18
FROM diabetes18
LEFT JOIN health
    ON health.fips = diabetes18.fips
WHERE health.county IS NULL
"""

df = pd.read_sql_query(sql, engine)
df

Unnamed: 0,fips,state,county,pct_diabetes18
0,46102,South Dakota,Oglala Lakota,17.9
1,2158,Alaska,Kusilvak Census Area,7.4


##### Findings

The difference in county counts seems to be explained: the Health dataset contains 3143 counties, and the Diabetes18 dataset contains 3141. Four counties that are in Health but not in Diabetes18, and two that are in Diabetes18 but not in Health, produce a net difference of 2 (3143 - 4 + 2 = 3141). However, these findings must be confirmed.
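
One way to confirm both directions in a single pass is an outer merge with pandas' `indicator` flag. This is a minimal sketch on toy frames built from the FIPS codes in the two outputs above, plus one matching county (Autauga, 1001), which I assume is present in both tables.

```python
import pandas as pd

# Toy frames: the unmatched FIPS codes come from the outputs above;
# Autauga (1001) is assumed to be present in both tables
health = pd.DataFrame({
    "fips": [1001, 2270, 35039, 46113, 51515],
    "county": ["Autauga", "Wade Hampton", "Rio Arriba", "Shannon", "Bedford"],
})
diabetes18 = pd.DataFrame({
    "fips": [1001, 46102, 2158],
    "county": ["Autauga", "Oglala Lakota", "Kusilvak Census Area"],
})

# indicator=True adds a _merge column: left_only, right_only, or both
merged = health.merge(
    diabetes18, on="fips", how="outer",
    suffixes=("_health", "_d18"), indicator=True,
)

only_health = merged[merged["_merge"] == "left_only"]
only_d18 = merged[merged["_merge"] == "right_only"]
print(len(only_health), len(only_d18))
```

On the full tables, the two counts should come out as 4 and 2 if the SQL anti-joins above are right.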

Furthermore, the Bedford County, VA diabetes prevalence for 2013 is null anyway; if I cannot find it for 2013 or 2018, I may as well drop it. For the 3 other counties in the Health dataset that are not present in the Diabetes18 dataset, I will need to find the 2018 prevalences, drop the counties before modeling, or perhaps impute the values based on the expected percent change between 2013 and 2018.
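
The imputation option could look roughly like the sketch below: take the median ratio of 2018 to 2013 prevalence over counties that have both values, then apply it to a county that lacks a 2018 value. Only the 2013 value for Rio Arriba (8.8) comes from the output above. Counties A, B, and C and their numbers are hypothetical, as is the `imputed` flag column.

```python
import pandas as pd

# Rio Arriba's 2013 value is from the output above; all other rows are hypothetical
df = pd.DataFrame({
    "county": ["A", "B", "C", "Rio Arriba"],
    "pct_diabetes_adults13": [10.0, 8.0, 12.0, 8.8],
    "pct_diabetes18": [11.0, 8.8, 13.2, None],
})

# Typical 2013 -> 2018 change, estimated from counties with both years
known = df.dropna(subset=["pct_diabetes18"])
ratio = (known["pct_diabetes18"] / known["pct_diabetes_adults13"]).median()

# Flag imputed rows so they can be excluded or inspected later
df["imputed"] = df["pct_diabetes18"].isna()
df["pct_diabetes18"] = df["pct_diabetes18"].fillna(df["pct_diabetes_adults13"] * ratio)
print(df)
```

The median is used here so that a few counties with unusual changes do not dominate the ratio. Whether a single national ratio is appropriate, as opposed to a state-level one, is an open choice.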

As for the two counties in the Diabetes18 dataset that are not in the Health dataset, I will need to check the other Food Environment Atlas datasets to see if I have information for them at all. If not, I will need to drop them from the target dataset, as I would have no information from which to make a prediction.
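
Two of the unmatched pairs look like the same areas under new names and codes. To my knowledge, Wade Hampton Census Area (FIPS 02270) was renamed Kusilvak Census Area (02158), and Shannon County (46113) was renamed Oglala Lakota County (46102), both in 2015. Treating that as an assumption to confirm against Census Bureau documentation, here is a minimal sketch of recoding `health` before the join. The `FIPS_RECODE` mapping is mine, and the frames hold only the rows from the two outputs above.

```python
import pandas as pd

# Assumed old -> new FIPS codes for the two renamed areas (verify before use)
FIPS_RECODE = {2270: 2158, 46113: 46102}

# Rows copied from the two anti-join outputs above
health = pd.DataFrame({
    "fips": [2270, 35039, 46113, 51515],
    "pct_diabetes_adults13": [4.6, 8.8, 15.8, None],
})
diabetes18 = pd.DataFrame({
    "fips": [46102, 2158],
    "pct_diabetes18": [17.9, 7.4],
})

# Recode, then repeat the left join to see which counties still lack a target
health["fips"] = health["fips"].replace(FIPS_RECODE)
matched = health.merge(diabetes18, on="fips", how="left")
still_missing = matched.loc[matched["pct_diabetes18"].isna(), "fips"].tolist()
print(still_missing)
```

If the recode holds, only Rio Arriba (35039) and Bedford (51515) remain without a 2018 value, and neither Diabetes18 row is left without predictors.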

I will re-address this situation following the second portion of EDA.

### 14. Mental Health Practitioners (State)

For this portion of EDA, my main concern is determining what the rank of the District of Columbia should be. The rank is based on the number of practitioners per 100,000 population, with the highest number given rank 1, and so on.

Additionally, I will determine the national average for the number of practitioners per 100,000 population for each of the years.

In [85]:
#displays top 5 rows of dataset

sql = """

SELECT *
FROM mhp
"""

df = pd.read_sql_query(sql, engine)
df.head()

Unnamed: 0,state,rank18,per100th18,rank17,per100th17
0,Alabama,50,92.6,50,85.0
1,Alaska,7,391.2,8,364.2
2,Arizona,47,129.3,47,121.9
3,Arkansas,27,226.0,26,213.3
4,California,11,338.0,10,315.5


##### States w/ Rank and per100th MH providers (2018)

In [86]:
sql = """

SELECT state, rank18, per100th18
FROM mhp
ORDER BY per100th18 DESC
"""

df = pd.read_sql_query(sql, engine)
df.head()

Unnamed: 0,state,rank18,per100th18
0,Massachusetts,1,590.9
1,Oregon,2,492.3
2,District of Columbia,•,486.9
3,Maine,3,459.5
4,Vermont,4,433.4


This shows that the District of Columbia should be ranked 3rd for the 2018 rank.

I will follow the same process for identifying the rank for 2017.

##### States w/Rank and per100th MH providers (2017)

In [87]:
sql = """

SELECT state, rank17, per100th17
FROM mhp
ORDER BY per100th17 DESC
"""

df = pd.read_sql_query(sql, engine)
df.head()

Unnamed: 0,state,rank17,per100th17
0,Massachusetts,1,547.3
1,District of Columbia,•,470.5
2,Oregon,2,453.7
3,Maine,3,442.1
4,Vermont,4,407.3


This shows that the District of Columbia should be ranked 2nd for 2017.

I will make the edits to the dataframe as part of my preparations for creating visualizations during the next phase of EDA.
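
A minimal sketch of how that edit might be derived in pandas rather than read off by eye, using only the five rows shown in the two outputs above. The `pos17`, `pos18`, and `rank18_num` column names are my own.

```python
import pandas as pd

# Top rows from the two outputs above; "•" marks the missing published rank
mhp = pd.DataFrame({
    "state": ["Massachusetts", "Oregon", "District of Columbia", "Maine", "Vermont"],
    "rank18": ["1", "2", "•", "3", "4"],
    "per100th18": [590.9, 492.3, 486.9, 459.5, 433.4],
    "per100th17": [547.3, 453.7, 470.5, 442.1, 407.3],
})

# The placeholder makes the rank column text; coerce it to NaN when converting
mhp["rank18_num"] = pd.to_numeric(mhp["rank18"], errors="coerce")

# Position of each row by rate, highest first, with DC included
for year in ("17", "18"):
    mhp[f"pos{year}"] = (
        mhp[f"per100th{year}"].rank(ascending=False, method="min").astype(int)
    )

dc = mhp.set_index("state").loc["District of Columbia"]
print(dc["pos18"], dc["pos17"])
```

Once DC is counted, Maine's computed position for 2018 becomes 4 although its published rank is 3. The edit therefore needs a decision: either DC shares a rank with a state, or every state below it shifts down by one.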

Now, I will determine average values for the practitioners per 100,000 population.

##### Average number of MH providers per 100,000 population across states (2017, 2018)

In [88]:
sql = """

SELECT AVG(per100th18) as avg2018, AVG(per100th17) as avg2017
FROM mhp
"""

df = pd.read_sql_query(sql, engine)
df

Unnamed: 0,avg2018,avg2017
0,257.519608,240.937255


This shows that the average number of mental health practitioners per 100,000 population, taken across states, increased from about 241 in 2017 to about 258 in 2018.

On average, there were about 258 mental health practitioners per 100,000 population in each state in 2018.

### 15. Mental Health Services (State)

The values in this dataset represent the percentage of the (adult) population that received mental health services in the last year. Since the ages 18 and older column incorporates the other two columns (ages 18-25 and ages 26 and older, respectively), I will observe the highest and lowest percentages for the 18+ column alone.

In [89]:
sql = """

SELECT *
FROM mhs
"""

df = pd.read_sql_query(sql, engine)
df.head()

Unnamed: 0,state,age18plus,age18_25,age26plus
0,Alabama,13.0,11.0,13.0
1,Alaska,14.0,13.0,14.0
2,Arizona,12.0,10.0,12.0
3,Arkansas,16.0,13.0,16.0
4,California,12.0,10.0,12.0


##### States with lowest % of adults receiving MH services

In [90]:
sql = """

SELECT state, age18plus
FROM mhs
ORDER BY age18plus
LIMIT 5
"""

df = pd.read_sql_query(sql, engine)
df

Unnamed: 0,state,age18plus
0,Hawaii,9.0
1,Texas,11.0
2,Arizona,12.0
3,California,12.0
4,Georgia,12.0


It appears that Hawaii had the lowest percentage at 9.0%. I will observe the highest percentages next.
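
Note that the last three rows are tied at 12.0%. With `LIMIT 5` and no tie-breaker in the `ORDER BY`, which tied states appear is up to the database, and other states at 12.0% could exist outside the cut. Here is a minimal pandas sketch that keeps every tied row, using only values that appear in the outputs above (eight states, not the full table).

```python
import pandas as pd

# Values copied from the outputs above; this is a subset of the mhs table
mhs = pd.DataFrame({
    "state": ["Alabama", "Alaska", "Arizona", "Arkansas", "California",
              "Hawaii", "Texas", "Georgia"],
    "age18plus": [13.0, 14.0, 12.0, 16.0, 12.0, 9.0, 11.0, 12.0],
})

# keep="all" returns more than n rows when the n-th value is tied
lowest = mhs.nsmallest(3, "age18plus", keep="all")
print(lowest)
```

Asking for the 3 lowest values returns 5 rows here, because the three states at 12.0% are all retained.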

##### States with highest % of adults receiving MH services

In [91]:
sql = """

SELECT state, age18plus
FROM mhs
ORDER BY age18plus DESC
LIMIT 5
"""

df = pd.read_sql_query(sql, engine)
df

Unnamed: 0,state,age18plus
0,New Hampshire,21.0
1,Vermont,20.0
2,Minnesota,19.0
3,Maine,19.0
4,Massachusetts,19.0


It appears that New Hampshire is the highest, at 21.0%.

I will next observe the average percentage.

##### Average % of adults receiving MH Services

In [92]:

sql = """

SELECT AVG(age18plus)
FROM mhs
"""

df = pd.read_sql_query(sql, engine)
df

Unnamed: 0,avg
0,15.254902


It appears that the average percentage of adults receiving mental health services (2016) per state was 15.25%.

## Discussion

Some of the datasets, namely Local and Stores, were not observed here due to the sheer number of variables. The target (Diabetes 2018) dataset and the Health dataset cover largely the same counties, with the exception of 4 counties in the Health dataset that are not present in the target dataset and 2 counties in the target dataset that are not in the Health dataset. Null values must still be corrected, and the appropriate manner in which to do this must be determined.

Still, much information was gleaned through this phase of EDA. Relationships between some (but not all) of the variables and the prevalence of obesity and/or diabetes were observed. Minimums, maximums, and averages were determined as described above.

Before addressing the issues noted in the first paragraph above, I will seek additional information about correlations in the next notebook.