# DiabetesAtlasdata Prep

# Setup

## Imports

In [2]:
import pandas as pd
import numpy as np

## Parameters

In [3]:
LOCAL_DIABETES_ATLAS_DATASET = "../../../data/RQ3/raw/DiabetesAtlasdata.csv"

PROCESSED_COUNTYCITY_DIABETES_ATLAS_DATASET = "../../../data/RQ3/processed/diabetes_by_countyCity_df.csv"
PROCESSED_STATE_DIABETES_ATLAS_DATASET = "../../../data/RQ3/processed/diabetes_diagnosis_percentage_state_df.csv"

## Configuration

In [4]:
#eg %matplotlib inline

# Loading the Diabetes Atlas Dataset

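We load the dataset with `header=2`, so that pandas takes the column names from row 2 (zero-indexed) of the file rather than from the first line; without this, `read_csv` raises an error.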

In [5]:
diabetes_by_countyCity_df = pd.read_csv(LOCAL_DIABETES_ATLAS_DATASET, header=2)
diabetes_by_countyCity_df.shape

(3142, 6)
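As a minimal sketch of what `header=2` does, assume (hypothetically) that the raw export starts with two descriptive lines before the column names. The inline text below is made-up sample data standing in for the real file, and `raw_text` and `sample_df` are illustrative names only.

```python
import io

import pandas as pd

# Hypothetical miniature of the raw export: two preamble lines, then the header row.
# The preamble wording and the SVI values are invented for illustration.
raw_text = (
    "Diagnosed Diabetes sample export\n"
    "Illustrative preamble line\n"
    "Year,County_FIPS,County,State,Diagnosed Diabetes Percentage,Overall SVI\n"
    "2018,1001,Autauga County,Alabama,9.5,0.4\n"
    "2018,1003,Baldwin County,Alabama,8.4,0.2\n"
)

# header=2 makes pandas use row 2 (zero-indexed) as the column names
# and ignore the lines above it.
sample_df = pd.read_csv(io.StringIO(raw_text), header=2)
print(sample_df.columns.tolist())
print(sample_df.shape)
```

With the default `header=0`, the first preamble line would be treated as the header, which is the likely cause of the error mentioned above.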

In [6]:
diabetes_by_countyCity_df.sample(5)

Unnamed: 0,Year,County_FIPS,County,State,Diagnosed Diabetes Percentage,Overall SVI
1011,2018,21037.0,Campbell County,Kentucky,10.3,0.2188
1865,2018,36077.0,Otsego County,New York,7.0,0.4236
2158,2018,40057.0,Harmon County,Oklahoma,7.0,0.8108
2728,2018,48413.0,Schleicher County,Texas,6.9,0.5857
1502,2018,29039.0,Cedar County,Missouri,6.5,0.6411


The random sample above lets us check that the dataset has loaded correctly: the column names and values look as expected.

# Cleaning the Dataset

The last row is a footer (a source note) rather than a data point, so we remove it.

In [7]:
diabetes_by_countyCity_df.tail()

Unnamed: 0,Year,County_FIPS,County,State,Diagnosed Diabetes Percentage,Overall SVI
3137,2018,56039.0,Teton County,Wyoming,3.8,0.1127
3138,2018,56041.0,Uinta County,Wyoming,8.4,0.4522
3139,2018,56043.0,Washakie County,Wyoming,7.4,0.3732
3140,2018,56045.0,Weston County,Wyoming,7.6,0.3475
3141,US Diabetes Surveillance System; www.cdc.gov/d...,,,,,


In [8]:
#drop the last row
diabetes_by_countyCity_df = diabetes_by_countyCity_df.drop(index=3141)
diabetes_by_countyCity_df.tail()

Unnamed: 0,Year,County_FIPS,County,State,Diagnosed Diabetes Percentage,Overall SVI
3136,2018,56037.0,Sweetwater County,Wyoming,7.8,0.3701
3137,2018,56039.0,Teton County,Wyoming,3.8,0.1127
3138,2018,56041.0,Uinta County,Wyoming,8.4,0.4522
3139,2018,56043.0,Washakie County,Wyoming,7.4,0.3732
3140,2018,56045.0,Weston County,Wyoming,7.6,0.3475
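Dropping by a hard-coded index works here, but it would break if the row count changed. One possible alternative, sketched below on a hypothetical three-row frame, is to drop rows that have no `County_FIPS` value, since the footer row is empty in every column except the first. The names `toy_df` and `toy_clean_df` and the footer text are placeholders, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame: two data rows followed by a footer row
# whose only populated column is the first one.
toy_df = pd.DataFrame({
    "Year": ["2018", "2018", "footer text"],
    "County_FIPS": [56043.0, 56045.0, np.nan],
    "County": ["Washakie County", "Weston County", None],
})

# Keep only rows that carry a FIPS code; the footer row has none.
toy_clean_df = toy_df.dropna(subset=["County_FIPS"])
print(toy_clean_df.shape)
```

This assumes every genuine data row has a FIPS code, which should be verified on the real dataset before relying on it.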


We collected data for 2018, but let's make sure that this dataset only contains data from this year.

In [9]:
#Make sure there is only one year
print("Number of unique values for year: ", diabetes_by_countyCity_df.Year.nunique())
print("All unique values for year: " , diabetes_by_countyCity_df.Year.unique())

Number of unique values for year:  1
All unique values for year:  ['2018']
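Note that the unique value prints as the string `'2018'`, presumably because the footer text forced the `Year` column to an object dtype at load time. The column is dropped in the next step, so this does not matter here, but if it were kept, a conversion along the following lines could be used. This is a sketch on a hypothetical series, not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-in for the Year column after the footer row is removed.
year_series = pd.Series(["2018", "2018", "2018"], name="Year")

# Convert the strings to integers so that numeric comparisons behave as expected.
year_numeric = pd.to_numeric(year_series)
print(year_numeric.dtype)
print(year_numeric.unique())
```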


## Select Key Columns

In [10]:
#drop unneeded columns
key_cols = ['County', 'State', 'Diagnosed Diabetes Percentage']
diabetes_by_countyCity_df = diabetes_by_countyCity_df[key_cols]
diabetes_by_countyCity_df.head()

Unnamed: 0,County,State,Diagnosed Diabetes Percentage
0,Autauga County,Alabama,9.5
1,Baldwin County,Alabama,8.4
2,Barbour County,Alabama,13.5
3,Bibb County,Alabama,10.2
4,Blount County,Alabama,10.5


In [11]:
# Build a second, state-level dataframe: the mean of the county/city percentages in each state
diabetes_diagnosis_percentage_state_df = diabetes_by_countyCity_df.copy()
diabetes_diagnosis_percentage_state_df = diabetes_diagnosis_percentage_state_df.drop(['County'], axis=1)
diabetes_diagnosis_percentage_state_df = diabetes_diagnosis_percentage_state_df.groupby(['State']).mean()
diabetes_diagnosis_percentage_state_df

State,Diagnosed Diabetes Percentage
Alabama,11.214925
Alaska,8.155172
Arizona,9.333333
Arkansas,9.616
California,7.956897
Colorado,6.853125
Connecticut,7.7125
Delaware,10.433333
District of Columbia,8.7
Florida,10.079104


## Identifying bad data

Even though the column is named 'County', its values include both counties and independent cities, so we should give it a more accurate name.

In [12]:
diabetes_by_countyCity_df[diabetes_by_countyCity_df['County'].str.contains('City')]

Unnamed: 0,County,State,Diagnosed Diabetes Percentage
77,Juneau City and Borough,Alaska,6.3
89,Sitka City and Borough,Alaska,7.3
93,Wrangell City and Borough,Alaska,8.4
94,Yakutat City and Borough,Alaska,9.0
1196,Baltimore City,Maryland,12.4
1584,St. Louis City,Missouri,11.9
1747,Carson City,Nevada,9.1
2821,Alexandria City,Virginia,5.9
2832,Bristol City,Virginia,8.4
2836,Buena Vista City,Virginia,7.2


In [13]:
#Renaming the column 'County' to 'County/City'
diabetes_by_countyCity_df = diabetes_by_countyCity_df.rename(columns = {"County": "County/City"})
diabetes_by_countyCity_df

Unnamed: 0,County/City,State,Diagnosed Diabetes Percentage
0,Autauga County,Alabama,9.5
1,Baldwin County,Alabama,8.4
2,Barbour County,Alabama,13.5
3,Bibb County,Alabama,10.2
4,Blount County,Alabama,10.5
...,...,...,...
3136,Sweetwater County,Wyoming,7.8
3137,Teton County,Wyoming,3.8
3138,Uinta County,Wyoming,8.4
3139,Washakie County,Wyoming,7.4


There also seems to have been an issue with the text processing: some names contain the HTML entity `&#39;` where an apostrophe should be, as in the value below.

In [14]:
diabetes_by_countyCity_df['County/City'].iloc[1210]

'Queen Anne&#39;s County'

We now look for all similar cases and correct them.

In [15]:
diabetes_by_countyCity_df[diabetes_by_countyCity_df['County/City'].str.contains('&')]

Unnamed: 0,County/City,State,Diagnosed Diabetes Percentage
859,O&#39;brien County,Iowa,6.4
1209,Prince George&#39;s County,Maryland,12.6
1210,Queen Anne&#39;s County,Maryland,7.4
1212,St. Mary&#39;s County,Maryland,10.6


In [16]:
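# Replace the HTML entity with an apostrophe, then check the affected rows

# Record the index labels of the affected rows before changing them
index = diabetes_by_countyCity_df[diabetes_by_countyCity_df['County/City'].str.contains('&#39;')].index
# regex=False treats the pattern as a literal string
diabetes_by_countyCity_df['County/City'] = diabetes_by_countyCity_df['County/City'].str.replace("&#39;", "'", regex=False)
# Select by label with .loc, since `index` holds labels rather than positions
diabetes_by_countyCity_df['County/City'].loc[index]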

859             O'brien County
1209    Prince George's County
1210       Queen Anne's County
1212         St. Mary's County
Name: County/City, dtype: object
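The replacement above targets one specific entity. If other HTML entities could appear in the names, a more general option is the standard library's `html.unescape`, sketched here on a hypothetical series (`names` and `decoded` are illustrative names, not part of the original notebook).

```python
import html

import pandas as pd

# Hypothetical sample of names, two of which contain the &#39; entity.
names = pd.Series(["Queen Anne&#39;s County", "St. Mary&#39;s County", "Bibb County"])

# html.unescape decodes any HTML entity; names without entities pass through unchanged.
decoded = names.map(html.unescape)
print(decoded.tolist())
```

For this dataset the only entity observed is `&#39;`, so both approaches should give the same result.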

# Dealing with Missing Data

A check for missing data returns no issues.

In [17]:
diabetes_by_countyCity_df.isnull().sum()

County/City                      0
State                            0
Diagnosed Diabetes Percentage    0
dtype: int64

# Saving the Dataframes

In [18]:
diabetes_by_countyCity_df.to_csv(PROCESSED_COUNTYCITY_DIABETES_ATLAS_DATASET, index=False)
diabetes_by_countyCity_df.shape

(3141, 3)

In [19]:
diabetes_diagnosis_percentage_state_df.to_csv(PROCESSED_STATE_DIABETES_ATLAS_DATASET)
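The state-level dataframe is saved without `index=False` because `State` is its index after the `groupby`, and dropping the index would lose the state names. The sketch below shows the round trip on a hypothetical two-row frame, writing to an in-memory buffer instead of the real output path; `state_df`, `buffer` and `reloaded` are illustrative names.

```python
import io

import pandas as pd

# Hypothetical miniature of the state-level frame, with State as the index.
state_df = pd.DataFrame(
    {"Diagnosed Diabetes Percentage": [8.95, 6.3]},
    index=pd.Index(["Alabama", "Alaska"], name="State"),
)

# Write with the index included (the default), then read it back.
buffer = io.StringIO()
state_df.to_csv(buffer)
buffer.seek(0)

# index_col="State" restores the state names as the index on reload.
reloaded = pd.read_csv(buffer, index_col="State")
print(reloaded)
```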