# Capstone 2: Obesity in America
## Data Wrangling

This step focuses on data wrangling: collecting the data, organizing the data and the file structure for the rest of the project, defining the data, and cleaning it up.

In [1]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

#### Data Collection

First, I check that I am in the correct directory and change it if need be.

In [3]:
cd springboard/Capstone2Project/

/Users/erinquense/springboard/Capstone2Project


Next, I import my dataset using read_csv and look at the first 3 rows to get an idea of what the dataframe looks like.

In [4]:
df = pd.read_csv('Capstone_BRFSS_Obesity_CSV.csv')
df.head(3)

Unnamed: 0,YearStart,YearEnd,LocationAbbr,LocationDesc,Datasource,Class,Topic,Question,Data_Value_Unit,Data_Value_Type,...,GeoLocation,ClassID,TopicID,QuestionID,DataValueTypeID,LocationID,StratificationCategory1,Stratification1,StratificationCategoryId1,StratificationID1
0,2011,2011,AL,Alabama,Behavioral Risk Factor Surveillance System,Obesity / Weight Status,Obesity / Weight Status,Percent of adults aged 18 years and older who ...,,Value,...,"(32.84057112200048, -86.63186076199969)",OWS,OWS1,Q036,VALUE,1,Total,Total,OVR,OVERALL
1,2011,2011,AL,Alabama,Behavioral Risk Factor Surveillance System,Obesity / Weight Status,Obesity / Weight Status,Percent of adults aged 18 years and older who ...,,Value,...,"(32.84057112200048, -86.63186076199969)",OWS,OWS1,Q036,VALUE,1,Gender,Male,GEN,MALE
2,2011,2011,AL,Alabama,Behavioral Risk Factor Surveillance System,Obesity / Weight Status,Obesity / Weight Status,Percent of adults aged 18 years and older who ...,,Value,...,"(32.84057112200048, -86.63186076199969)",OWS,OWS1,Q036,VALUE,1,Gender,Female,GEN,FEMALE
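
The 'Unnamed: 0' column in the output above usually means the CSV was saved with its row index. As a minimal sketch, assuming that is the case here, `index_col=0` tells `read_csv` to treat that first column as the index instead of as data. The two-row CSV text below is made up for illustration and is not the BRFSS file.

```python
import io

import pandas as pd

# Hypothetical two-row CSV that mimics a file written together with its index.
csv_text = ",YearStart,LocationAbbr\n0,2011,AL\n1,2011,AK\n"

# Default read: the unnamed first column becomes a data column.
default_read = pd.read_csv(io.StringIO(csv_text))

# index_col=0: the first column is used as the row index instead.
indexed_read = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default_read.columns.tolist())
print(indexed_read.columns.tolist())
```

In this notebook the extra column does no harm, because the later pivot keeps only the columns that are named explicitly.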


#### Data Organization

Here I create a file structure to store the data, figures, and models I produce.

In [None]:
import os

path = os.getcwd()  # ask the OS for the directory instead of hard-coding it
print(f"The current working directory is {path}")

In [None]:
mkdir data

In [None]:
mkdir figures

In [None]:
mkdir models
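
The three `mkdir` cells fail if a folder already exists, so re-running the notebook raises an error. A possible alternative, sketched here with `pathlib`, creates all three folders in a way that is safe to repeat. The temporary directory is an assumption that stands in for the project root so the sketch runs anywhere.

```python
import tempfile
from pathlib import Path

# A temporary directory stands in for the project root (assumption for this sketch).
project_root = Path(tempfile.mkdtemp())

for name in ("data", "figures", "models"):
    # exist_ok=True makes the call a no-op when the folder is already there.
    (project_root / name).mkdir(parents=True, exist_ok=True)

created = sorted(p.name for p in project_root.iterdir() if p.is_dir())
print(created)
```

In the real project, `project_root` would be `Path.cwd()` once the working directory is set.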

#### Data Definition

At this point, I want to understand what my data looks like and what needs to happen to make it cleaner to work with later on. First, I am going to drop some columns I don't need. 'YearEnd' is the same as 'YearStart', 'Datasource' is the same for all observations, 'Data_Value_Footnote_Symbol' is unnecessary because we have the footnote itself, and we don't need the confidence limits. We can also drop 'GeoLocation' because we have 'LocationDesc' to identify the state.

In [5]:
df = df.drop(['GeoLocation', 'YearEnd', 'Datasource', 'Data_Value_Unit', 'Data_Value_Footnote_Symbol', 'Data_Value_Type', 'DataValueTypeID', 'Data_Value_Alt', 'Low_Confidence_Limit', 'High_Confidence_Limit '], axis=1)
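
Note that 'High_Confidence_Limit ' in the drop list above carries a trailing space, which is easy to mistype. One way to guard against that, shown as a sketch on a made-up one-row frame, is to strip whitespace from the column names first and then drop with `errors="ignore"` so a missing name does not raise. The name 'Not_A_Column' is hypothetical and only demonstrates the tolerant drop.

```python
import pandas as pd

# Toy frame; the values are invented for illustration.
toy = pd.DataFrame({
    "YearStart": [2011],
    "YearEnd": [2011],
    "High_Confidence_Limit ": [35.0],  # trailing space, as in the drop list above
    "Data_Value": [32.0],
})

# Normalise the headers so later code can use clean names.
toy.columns = toy.columns.str.strip()

# errors="ignore" skips names that are not present instead of raising KeyError.
to_drop = ["YearEnd", "High_Confidence_Limit", "Not_A_Column"]
trimmed = toy.drop(columns=to_drop, errors="ignore")

print(trimmed.columns.tolist())
```

The trade-off is that `errors="ignore"` also hides genuine typos, so it is worth printing the remaining columns afterwards.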

In [6]:
df.sample(5)

Unnamed: 0,YearStart,LocationAbbr,LocationDesc,Class,Topic,Question,Data_Value,Data_Value_Footnote,Sample_Size,Total,...,Income,Race/Ethnicity,ClassID,TopicID,QuestionID,LocationID,StratificationCategory1,Stratification1,StratificationCategoryId1,StratificationID1
30942,2013,UT,Utah,Physical Activity,Physical Activity - Behavior,Percent of adults who achieve at least 150 min...,55.3,,11589.0,Total,...,,,PA,PA1,Q043,49,Total,Total,OVR,OVERALL
15281,2013,MI,Michigan,Fruits and Vegetables,Fruits and Vegetables - Behavior,Percent of adults who report consuming fruit l...,29.5,,3925.0,,...,,,FV,FV1,Q018,26,Age (years),65 or older,AGEYR,AGEYR65PLUS
3255,2013,CA,California,Physical Activity,Physical Activity - Behavior,Percent of adults who engage in muscle-strengt...,29.2,,2652.0,,...,,Hispanic,PA,PA1,Q046,6,Race/Ethnicity,Hispanic,RACE,RACEHIS
40520,2015,KS,Kansas,Physical Activity,Physical Activity - Behavior,Percent of adults who achieve at least 300 min...,28.0,,2976.0,,...,Data not reported,,PA,PA1,Q045,20,Income,Data not reported,INC,INCNR
26250,2014,OR,Oregon,Obesity / Weight Status,Obesity / Weight Status,Percent of adults aged 18 years and older who ...,33.8,,1084.0,,...,,,OWS,OWS1,Q037,41,Age (years),55 - 64,AGEYR,AGEYR5564


In [7]:
df['Question'].unique()

array(['Percent of adults aged 18 years and older who have obesity',
       'Percent of adults aged 18 years and older who have an overweight classification',
       'Percent of adults who report consuming fruit less than one time daily',
       'Percent of adults who report consuming vegetables less than one time daily',
       'Percent of adults who engage in muscle-strengthening activities on 2 or more days a week',
       'Percent of adults who achieve at least 150 minutes a week of moderate-intensity aerobic physical activity or 75 minutes a week of vigorous-intensity aerobic activity (or an equivalent combination)',
       'Percent of adults who achieve at least 150 minutes a week of moderate-intensity aerobic physical activity or 75 minutes a week of vigorous-intensity aerobic physical activity and engage in muscle-strengthening activities on 2 or more days a week',
       'Percent of adults who achieve at least 300 minutes a week of moderate-intensity aerobic physical activity 

Here I am going to pivot the table so that the 9 questions in the 'Question' column become features, each filled with the corresponding 'Data_Value'. Then I will rename those columns so they take up less space but still describe what we need to know.

In [8]:
no_pivot = ['YearStart', 'LocationAbbr', 'LocationDesc', 'LocationID',
       'StratificationCategory1', 'Stratification1',
       'StratificationCategoryId1', 'StratificationID1']
df = pd.pivot_table(df, columns='Question', values='Data_Value', index=no_pivot).reset_index()
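
To make the long-to-wide reshaping concrete, here is a small sketch on an invented four-row frame. The short question labels and the values are placeholders, not BRFSS data. Each distinct value of 'Question' becomes its own column, and 'Data_Value' fills the cells.

```python
import pandas as pd

# Long format: one row per (year, state, question).
long_df = pd.DataFrame({
    "YearStart": [2011, 2011, 2011, 2011],
    "LocationAbbr": ["AL", "AL", "AK", "AK"],
    "Question": ["obesity", "fruit", "obesity", "fruit"],
    "Data_Value": [32.0, 43.8, 27.4, 39.1],
})

# Wide format: one row per (year, state), one column per question.
wide_df = pd.pivot_table(
    long_df,
    columns="Question",
    values="Data_Value",
    index=["YearStart", "LocationAbbr"],
).reset_index()

print(wide_df)
```

`pivot_table` aggregates with the mean by default, so if two rows shared the same index values and question, their 'Data_Value' entries would be averaged rather than kept separately.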

In [9]:
df.rename(columns={'Percent of adults aged 18 years and older who have an overweight classification': 'overweight'}, inplace=True)

In [10]:
df.rename(columns={'Percent of adults aged 18 years and older who have obesity': 'obese'}, inplace=True)

In [11]:
df.rename(columns={'Percent of adults who achieve at least 150 minutes a week of moderate-intensity aerobic physical activity or 75 minutes a week of vigorous-intensity aerobic activity (or an equivalent combination)': 'some_activity'}, inplace=True)

In [12]:
df.rename(columns={'Percent of adults who achieve at least 150 minutes a week of moderate-intensity aerobic physical activity or 75 minutes a week of vigorous-intensity aerobic physical activity and engage in muscle-strengthening activities on 2 or more days a week': 'some_and_muslce'}, inplace=True)

In [13]:
df.rename(columns={'Percent of adults who achieve at least 300 minutes a week of moderate-intensity aerobic physical activity or 150 minutes a week of vigorous-intensity aerobic activity (or an equivalent combination)': 'more_activity'}, inplace=True)

In [14]:
df.rename(columns={'Percent of adults who engage in muscle-strengthening activities on 2 or more days a week': 'Strength_training'}, inplace=True)

In [15]:
df.rename(columns={'Percent of adults who engage in no leisure-time physical activity': 'no_physical_activity'}, inplace=True)

In [16]:
df.rename(columns={'Percent of adults who report consuming fruit less than one time daily': 'fruit'}, inplace=True)

In [17]:
df.rename(columns={'Percent of adults who report consuming vegetables less than one time daily': 'vegetables'}, inplace=True)
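
The nine separate `rename` calls could also be written as one call with a single mapping, which keeps all the short names in one place. The sketch below uses two of the question strings from above on an invented one-row frame. The `unmatched` check is an addition for illustration: it lists any long name that is not actually a column, which would catch a typo in a question string.

```python
import pandas as pd

# Mapping from long question text to short column name (two entries shown).
short_names = {
    "Percent of adults aged 18 years and older who have obesity": "obese",
    "Percent of adults who report consuming fruit less than one time daily": "fruit",
}

# Toy frame with invented values.
toy = pd.DataFrame({
    "YearStart": [2011],
    "Percent of adults aged 18 years and older who have obesity": [32.0],
    "Percent of adults who report consuming fruit less than one time daily": [43.8],
})

# rename silently skips keys that are missing, so check them explicitly.
unmatched = [old for old in short_names if old not in toy.columns]
renamed = toy.rename(columns=short_names)

print(renamed.columns.tolist())
print(unmatched)
```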

In [18]:
df.sample(8)

Question,YearStart,LocationAbbr,LocationDesc,LocationID,StratificationCategory1,Stratification1,StratificationCategoryId1,StratificationID1,overweight,obese,some_activity,some_and_muslce,more_activity,Strength_training,no_physical_activity,fruit,vegetables
7370,2016,MN,Minnesota,27,Age (years),25 - 34,AGEYR,AGEYR2534,35.4,24.6,,,,,14.9,,
1787,2012,KY,Kentucky,21,Income,"$35,000 - $49,999",INC,INC3550,37.2,32.0,,,,,26.3,,
3000,2013,IA,Iowa,19,Income,"Less than $15,000",INC,INCLESS15,31.5,37.1,37.2,11.6,23.7,26.0,38.3,49.0,39.3
645,2011,MO,Missouri,29,Total,Total,OVR,OVERALL,34.6,30.3,49.5,17.3,30.5,24.7,28.4,43.9,25.2
2661,2012,WY,Wyoming,56,Income,"$50,000 - $74,999",INC,INC5075,37.3,29.1,,,,,16.1,,
129,2011,CO,Colorado,8,Age (years),25 - 34,AGEYR,AGEYR2534,31.1,19.3,59.4,27.2,35.8,38.1,14.2,38.4,21.7
3526,2013,NV,Nevada,32,Income,"$75,000 or greater",INC,INC75PLUS,41.0,26.7,61.7,32.2,39.6,41.4,16.5,32.1,14.3
3649,2013,PA,Pennsylvania,42,Gender,Female,GEN,FEMALE,28.0,29.7,47.0,16.4,28.8,24.5,28.1,32.7,22.2


We are also going to pivot the table so that each category in 'StratificationCategory1' becomes a feature, with each new column filled with the matching 'Stratification1' value that defines the datapoint.

In [19]:
df = df.drop(['StratificationCategoryId1', 'StratificationID1'], axis=1)

In [20]:
no_pivot = ['YearStart', 'LocationAbbr', 'LocationDesc', 'LocationID', 'overweight', 'obese',
       'some_activity', 'some_and_muslce', 'more_activity',
       'Strength_training', 'no_physical_activity', 'fruit', 'vegetables']
       
df = pd.pivot_table(df, columns='StratificationCategory1', values='Stratification1', index=no_pivot, aggfunc='first').reset_index()

We now have a dataframe that is organized by year and state, with every feature in its own column.

In [21]:
df.head()

StratificationCategory1,YearStart,LocationAbbr,LocationDesc,LocationID,overweight,obese,some_activity,some_and_muslce,more_activity,Strength_training,no_physical_activity,fruit,vegetables,Age (years),Education,Gender,Income,Race/Ethnicity,Total
0,2011,AK,Alaska,2,24.6,33.6,56.0,22.3,38.1,30.3,25.0,46.5,35.0,,,,"Less than $15,000",,
1,2011,AK,Alaska,2,29.8,35.5,49.9,32.9,30.1,42.3,35.6,36.0,18.6,,,,,Other,
2,2011,AK,Alaska,2,31.3,26.7,57.4,21.7,37.5,28.9,24.0,34.2,16.2,,,Female,,,
3,2011,AK,Alaska,2,32.0,19.8,62.8,34.9,32.4,51.9,16.1,45.0,29.1,18 - 24,,,,,
4,2011,AK,Alaska,2,33.5,30.2,51.8,17.3,32.7,28.1,24.6,44.6,26.1,,,,"$15,000 - $24,999",,
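
One side effect of this second pivot is worth checking: because the numeric features are used as the pivot index, a row with a missing value in any of them is dropped. The sketch below shows this on an invented three-row frame with a single numeric column. It is an illustration of the default `pivot_table` behaviour, not a claim about how many rows the real data lost.

```python
import numpy as np
import pandas as pd

# Toy frame: the third row has no 'obese' value.
toy = pd.DataFrame({
    "LocationAbbr": ["AK", "AK", "AK"],
    "obese": [26.7, 19.8, np.nan],
    "StratificationCategory1": ["Gender", "Age (years)", "Income"],
    "Stratification1": ["Female", "18 - 24", "Less than $15,000"],
})

# aggfunc='first' keeps the string label, since a mean of strings is undefined.
wide = pd.pivot_table(
    toy,
    columns="StratificationCategory1",
    values="Stratification1",
    index=["LocationAbbr", "obese"],
    aggfunc="first",
).reset_index()

print(len(toy), "rows in,", len(wide), "rows out")
print(wide)
```

Comparing the row counts before and after the pivot on the real dataframe would show how many observations were excluded for having an unanswered question.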


In [22]:
df.isnull().mean()

StratificationCategory1
YearStart               0.000000
LocationAbbr            0.000000
LocationDesc            0.000000
LocationID              0.000000
overweight              0.000000
obese                   0.000000
some_activity           0.000000
some_and_muslce         0.000000
more_activity           0.000000
Strength_training       0.000000
no_physical_activity    0.000000
fruit                   0.000000
vegetables              0.000000
Age (years)             0.762814
Education               0.841206
Gender                  0.920603
Income                  0.722111
Race/Ethnicity          0.792965
Total                   0.960302
dtype: float64

There are many NaN values in the categorical features (i.e. education, age, gender, etc.). This is because each question was assessed as a separate instance for each categorical feature, so every row carries a value in only one of these columns. For now, I will leave those and explore further in the next step, exploratory data analysis.
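
A quick way to confirm that explanation is to count the non-null stratification values per row: if each observation belongs to exactly one category, every count is 1. This is a sketch on an invented three-row frame with only three of the six stratification columns.

```python
import numpy as np
import pandas as pd

strat_cols = ["Age (years)", "Gender", "Total"]

# Toy frame: each row is filled in exactly one stratification column.
toy = pd.DataFrame({
    "Age (years)": ["18 - 24", np.nan, np.nan],
    "Gender": [np.nan, "Female", np.nan],
    "Total": [np.nan, np.nan, "Total"],
})

# Number of non-null stratification values in each row.
filled_per_row = toy[strat_cols].notna().sum(axis=1)

# Share of missing values per column, as in df.isnull().mean().
null_share = toy[strat_cols].isnull().mean()

print(filled_per_row.tolist())
print(null_share)
```

On the real dataframe, `(filled_per_row == 1).all()` returning True would support leaving the NaN values in place.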

In [23]:
df.shape

(3980, 19)

In [24]:
df.describe().T

StratificationCategory1,count,mean,std,min,25%,50%,75%,max
YearStart,3980.0,2013.013568,1.637957,2011.0,2011.0,2013.0,2015.0,2015.0
LocationID,3980.0,29.969347,16.544564,1.0,17.0,30.0,44.0,72.0
overweight,3980.0,35.010678,4.576289,13.7,32.575,35.5,38.0,56.1
obese,3980.0,28.710151,6.807211,0.9,25.2,29.2,33.1,56.2
some_activity,3980.0,50.204347,7.532289,24.0,45.2,50.3,55.3,77.6
some_and_muslce,3980.0,20.089523,5.631964,2.2,16.3,19.7,23.4,46.5
more_activity,3980.0,31.252161,6.086512,12.5,27.1,31.1,35.1,64.9
Strength_training,3980.0,29.625804,7.095855,3.3,24.9,29.2,33.725,58.6
no_physical_activity,3980.0,26.797965,7.679105,2.5,21.7,26.3,31.6,60.5
fruit,3980.0,40.201432,6.944133,9.6,35.5,40.0,44.9,63.0


#### Data Cleaning

Let's rename some of our columns so they are more descriptive of what they contain.

In [25]:
df.rename(columns={'YearStart':'Year', 'LocationDesc':'Location'}, inplace=True)

In [26]:
df.nunique()

StratificationCategory1
Year                      3
LocationAbbr             54
Location                 54
LocationID               54
overweight              279
obese                   396
some_activity           401
some_and_muslce         320
more_activity           342
Strength_training       393
no_physical_activity    399
fruit                   363
vegetables              364
Age (years)               6
Education                 4
Gender                    2
Income                    7
Race/Ethnicity            8
Total                     1
dtype: int64

In [27]:
df['Location'].unique()

array(['Alaska', 'Alabama', 'Arkansas', 'Arizona', 'California',
       'Colorado', 'Connecticut', 'District of Columbia', 'Delaware',
       'Florida', 'Georgia', 'Hawaii', 'Iowa', 'Idaho', 'Illinois',
       'Indiana', 'Kansas', 'Kentucky', 'Louisiana', 'Massachusetts',
       'Maryland', 'Maine', 'Michigan', 'Minnesota', 'Missouri',
       'Mississippi', 'Montana', 'North Carolina', 'North Dakota',
       'Nebraska', 'New Hampshire', 'New Jersey', 'New Mexico', 'Nevada',
       'New York', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania',
       'Rhode Island', 'South Carolina', 'South Dakota', 'Tennessee',
       'Texas', 'National', 'Utah', 'Virginia', 'Vermont', 'Washington',
       'Wisconsin', 'West Virginia', 'Wyoming', 'Guam', 'Puerto Rico'],
      dtype=object)

In [28]:
duplicateRowsDF = df[df.duplicated()]
duplicateRowsDF

StratificationCategory1,Year,LocationAbbr,Location,LocationID,overweight,obese,some_activity,some_and_muslce,more_activity,Strength_training,no_physical_activity,fruit,vegetables,Age (years),Education,Gender,Income,Race/Ethnicity,Total
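
The empty result above means no row is an exact copy of another. As a small sketch on invented data, the same check can be reduced to a single count. The `subset` argument restricts the comparison to chosen key columns, which can reveal repeats that differ only in other columns.

```python
import pandas as pd

# Toy frame: the first two rows are identical.
toy = pd.DataFrame({
    "Year": [2011, 2011, 2013],
    "LocationAbbr": ["AK", "AK", "AK"],
    "obese": [26.7, 26.7, 28.1],
})

# Rows that repeat an earlier row in every column.
n_full_duplicates = int(toy.duplicated().sum())

# Rows that repeat an earlier row in the chosen key columns only.
n_key_duplicates = int(toy.duplicated(subset=["Year", "LocationAbbr"]).sum())

# Keep the first occurrence of each duplicated row.
deduplicated = toy.drop_duplicates()

print(n_full_duplicates, n_key_duplicates, len(deduplicated))
```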


In [29]:
df.sample(5)

StratificationCategory1,Year,LocationAbbr,Location,LocationID,overweight,obese,some_activity,some_and_muslce,more_activity,Strength_training,no_physical_activity,fruit,vegetables,Age (years),Education,Gender,Income,Race/Ethnicity,Total
1050,2011,SD,South Dakota,46,39.3,26.1,52.7,20.1,28.3,29.6,18.1,31.3,18.9,,College graduate,,,,
3403,2015,NH,New Hampshire,33,44.2,28.0,57.9,23.9,38.6,34.3,22.3,38.6,20.2,,,Male,,,
1376,2013,AR,Arkansas,5,36.8,32.3,42.4,13.1,26.6,20.9,33.7,50.2,24.2,,,,,Non-Hispanic White,
274,2011,GA,Georgia,13,37.1,34.4,50.3,16.1,33.4,23.8,28.3,40.5,22.1,55 - 64,,,,,
3599,2015,PA,Pennsylvania,42,37.5,33.2,44.4,20.3,27.8,31.6,29.9,42.9,21.9,,,,"$50,000 - $74,999",,


In [None]:
df.to_csv('df_wrangled.csv')
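
The call above writes the file to the working directory and includes the row index, which would come back as an 'Unnamed: 0' column on the next read. A possible variant, sketched here on an invented two-row frame, saves into a data folder and passes `index=False`. The temporary directory is an assumption that stands in for the project root.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for the wrangled dataframe.
toy = pd.DataFrame({
    "Year": [2011, 2013],
    "LocationAbbr": ["AK", "AL"],
    "obese": [26.7, 32.0],
})

# Temporary stand-in for the project's data folder (assumption for this sketch).
data_dir = Path(tempfile.mkdtemp()) / "data"
data_dir.mkdir(parents=True, exist_ok=True)

out_path = data_dir / "df_wrangled.csv"
toy.to_csv(out_path, index=False)  # index=False leaves the row index out of the file

reloaded = pd.read_csv(out_path)
print(reloaded.columns.tolist())
```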

At this point the data has been transformed and is ready for exploratory data analysis in the next step.