# Capstone 2: Obesity in America
## Data Wrangling

This step focuses on data wrangling: collecting the data, organizing the data and the file structure for the rest of the project, defining the data, and cleaning it up.

In [1]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

#### Data Collection

First, I check that I am in the correct directory and change it if need be. 

In [3]:
cd springboard/Capstone2Project/

/Users/erinquense/springboard/Capstone2Project
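
The `cd` magic above only works inside IPython. A plain-Python version of the same check might look like the sketch below. The `project_dir` location is an assumption based on the path printed above, so adjust it to your own layout.

```python
import os
from pathlib import Path

# Assumed project location (hypothetical); change this to match your machine.
project_dir = Path.home() / "springboard" / "Capstone2Project"

# Only change directory if the folder exists and we are not already inside it.
if project_dir.is_dir() and Path.cwd() != project_dir:
    os.chdir(project_dir)

# Path.cwd() reports where Python actually is, rather than a hard-coded string.
print(f"The current working directory is {Path.cwd()}")
```

Printing `Path.cwd()` is a safer confirmation than printing a typed-in path, because it reflects the directory the kernel is really using.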


I import my dataset using read_csv and take a look at the first 3 rows to get an idea of what the dataframe looks like. 

In [4]:
df = pd.read_csv('Capstone_BRFSS_Obesity_CSV.csv')
df.head(3)

Unnamed: 0,YearStart,YearEnd,LocationAbbr,LocationDesc,Datasource,Class,Topic,Question,Data_Value_Unit,Data_Value_Type,...,GeoLocation,ClassID,TopicID,QuestionID,DataValueTypeID,LocationID,StratificationCategory1,Stratification1,StratificationCategoryId1,StratificationID1
0,2011,2011,AL,Alabama,Behavioral Risk Factor Surveillance System,Obesity / Weight Status,Obesity / Weight Status,Percent of adults aged 18 years and older who ...,,Value,...,"(32.84057112200048, -86.63186076199969)",OWS,OWS1,Q036,VALUE,1,Total,Total,OVR,OVERALL
1,2011,2011,AL,Alabama,Behavioral Risk Factor Surveillance System,Obesity / Weight Status,Obesity / Weight Status,Percent of adults aged 18 years and older who ...,,Value,...,"(32.84057112200048, -86.63186076199969)",OWS,OWS1,Q036,VALUE,1,Gender,Male,GEN,MALE
2,2011,2011,AL,Alabama,Behavioral Risk Factor Surveillance System,Obesity / Weight Status,Obesity / Weight Status,Percent of adults aged 18 years and older who ...,,Value,...,"(32.84057112200048, -86.63186076199969)",OWS,OWS1,Q036,VALUE,1,Gender,Female,GEN,FEMALE


#### Data Organization

Next, I create a file structure to store the data, figures, and models I produce.  

In [5]:
path = 'springboard/Capstone2Project'
print ("The current working directory is %s" % path)

The current working directory is springboard/Capstone2Project


In [6]:
mkdir data

mkdir: data: File exists


In [7]:
mkdir figures

mkdir: figures: File exists


In [8]:
mkdir models

mkdir: models: File exists
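
The `File exists` messages appear because the folders were already there when the cells were re-run. One way to make this step safe to repeat is sketched below. The helper name `make_project_dirs` is hypothetical, and the demo runs in a temporary folder so it does not touch the real project.

```python
import tempfile
from pathlib import Path

def make_project_dirs(base, names=("data", "figures", "models")):
    """Create each sub-folder under base; do nothing if it already exists."""
    created = []
    for name in names:
        folder = Path(base) / name
        # exist_ok=True turns a repeated call into a no-op instead of an error.
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created

# Demonstrated in a temporary folder; pass "." to use the working directory.
with tempfile.TemporaryDirectory() as tmp:
    make_project_dirs(tmp)
    make_project_dirs(tmp)  # the second call raises no error
    print(sorted(p.name for p in Path(tmp).iterdir()))
```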


#### Data Definition

At this point, I want to gain an understanding of what my data looks like and what might need to happen to make it cleaner to work with later on. First, I am going to drop some columns I don't need.  
YearEnd is the same as YearStart, Datasource is the same for all observations, Data_Value_Footnote_Symbol is unnecessary because we have the footnote itself, and we don't need the confidence limits. We can also drop GeoLocation because we have LocationDesc to identify the state. 

In [9]:
df = df.drop(['GeoLocation', 'YearEnd', 'Datasource', 'Data_Value_Unit', 'Data_Value_Footnote_Symbol', 'Data_Value_Type', 'DataValueTypeID', 'Data_Value_Alt', 'Low_Confidence_Limit', 'High_Confidence_Limit '], axis=1)
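
Before dropping columns it can help to confirm the claims behind the drop, for example that YearEnd really matches YearStart and that Datasource holds a single value. The sketch below shows one way to check this on a small stand-in frame. The frame `raw` and its values are illustrative only and are not taken from the BRFSS file.

```python
import pandas as pd

# Toy stand-in for the raw frame; values are illustrative only.
raw = pd.DataFrame({
    "YearStart": [2011, 2011, 2012],
    "YearEnd": [2011, 2011, 2012],
    "Datasource": ["BRFSS", "BRFSS", "BRFSS"],
    "Data_Value": [32.0, 19.8, 33.1],
})

# YearEnd is redundant only if it equals YearStart on every row.
year_end_redundant = (raw["YearStart"] == raw["YearEnd"]).all()

# A column with a single distinct value carries no information.
constant_cols = [c for c in raw.columns if raw[c].nunique(dropna=False) == 1]

print(year_end_redundant, constant_cols)
```

Running the same two checks on the real dataframe would show whether the dropped columns are as redundant as assumed.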

In [12]:
df['Question'].unique()

array(['Percent of adults aged 18 years and older who have obesity',
       'Percent of adults aged 18 years and older who have an overweight classification',
       'Percent of adults who report consuming fruit less than one time daily',
       'Percent of adults who report consuming vegetables less than one time daily',
       'Percent of adults who engage in muscle-strengthening activities on 2 or more days a week',
       'Percent of adults who achieve at least 150 minutes a week of moderate-intensity aerobic physical activity or 75 minutes a week of vigorous-intensity aerobic activity (or an equivalent combination)',
       'Percent of adults who achieve at least 150 minutes a week of moderate-intensity aerobic physical activity or 75 minutes a week of vigorous-intensity aerobic physical activity and engage in muscle-strengthening activities on 2 or more days a week',
       'Percent of adults who achieve at least 300 minutes a week of moderate-intensity aerobic physical activity 

Here I am going to pivot the table so that the 9 questions defined in the 'Question' column become features filled with the corresponding 'Data_Value'.  Then I will change their column names so they take up less space but still describe what we need to know. 

In [13]:
no_pivot = ['YearStart', 'LocationAbbr', 'LocationDesc', 'LocationID',
       'StratificationCategory1', 'Stratification1',
       'StratificationCategoryId1', 'StratificationID1']
df = pd.pivot_table(df, columns='Question', values='Data_Value', index=no_pivot).reset_index()
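
To make the reshaping concrete, here is a minimal sketch of the same long-to-wide pivot on a four-row toy frame. The question labels `q_obese` and `q_fruit` are hypothetical placeholders for the long question strings.

```python
import pandas as pd

# Long format: one row per (year, state, question).
long_df = pd.DataFrame({
    "YearStart": [2011, 2011, 2011, 2011],
    "LocationAbbr": ["AK", "AK", "AL", "AL"],
    "Question": ["q_obese", "q_fruit", "q_obese", "q_fruit"],
    "Data_Value": [19.8, 45.0, 16.3, 45.0],
})

# Wide format: one row per (year, state), one column per question.
wide_df = pd.pivot_table(
    long_df,
    index=["YearStart", "LocationAbbr"],
    columns="Question",
    values="Data_Value",
).reset_index()

# Remove the leftover "Question" label on the column axis.
wide_df.columns.name = None
print(wide_df)
```

Note that `pd.pivot_table` aggregates with the mean by default, so if two rows share the same index values and question, their values are averaged rather than raising an error.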

In [14]:
df.rename(columns={'Percent of adults aged 18 years and older who have an overweight classification': 'overweight'}, inplace=True)

In [15]:
df.rename(columns={'Percent of adults aged 18 years and older who have obesity': 'obese'}, inplace=True)

In [16]:
df.rename(columns={'Percent of adults who achieve at least 150 minutes a week of moderate-intensity aerobic physical activity or 75 minutes a week of vigorous-intensity aerobic activity (or an equivalent combination)': 'some_activity'}, inplace=True)

In [17]:
df.rename(columns={'Percent of adults who achieve at least 150 minutes a week of moderate-intensity aerobic physical activity or 75 minutes a week of vigorous-intensity aerobic physical activity and engage in muscle-strengthening activities on 2 or more days a week': 'some_and_muslce'}, inplace=True)

In [18]:
df.rename(columns={'Percent of adults who achieve at least 300 minutes a week of moderate-intensity aerobic physical activity or 150 minutes a week of vigorous-intensity aerobic activity (or an equivalent combination)': 'more_activity'}, inplace=True)

In [19]:
df.rename(columns={'Percent of adults who engage in muscle-strengthening activities on 2 or more days a week': 'Strength_training'}, inplace=True)

In [20]:
df.rename(columns={'Percent of adults who engage in no leisure-time physical activity': 'no_physical_activity'}, inplace=True)

In [21]:
df.rename(columns={'Percent of adults who report consuming fruit less than one time daily': 'fruit'}, inplace=True)

In [22]:
df.rename(columns={'Percent of adults who report consuming vegetables less than one time daily': 'vegetables'}, inplace=True)
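
The nine separate rename cells could also be written as a single call with one mapping. The sketch below shows the idea with two of the question strings on a one-row demo frame. The names `short_names` and `demo` are illustrative.

```python
import pandas as pd

# One dictionary holds every old-name -> new-name pair.
short_names = {
    "Percent of adults aged 18 years and older who have obesity": "obese",
    "Percent of adults who report consuming fruit less than one time daily": "fruit",
}

# One-row demo frame whose columns are the long question strings.
demo = pd.DataFrame([[19.8, 45.0]], columns=list(short_names))

# A single rename call applies the whole mapping at once.
demo = demo.rename(columns=short_names)
print(list(demo.columns))
```

Keeping the mapping in one place makes it easier to spot a misspelled short name before it spreads through later cells.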

In [23]:
df.head(8)

Question,YearStart,LocationAbbr,LocationDesc,LocationID,StratificationCategory1,Stratification1,StratificationCategoryId1,StratificationID1,overweight,obese,some_activity,some_and_muslce,more_activity,Strength_training,no_physical_activity,fruit,vegetables
0,2011,AK,Alaska,2,Age (years),18 - 24,AGEYR,AGEYR1824,32.0,19.8,62.8,34.9,32.4,51.9,16.1,45.0,29.1
1,2011,AK,Alaska,2,Age (years),25 - 34,AGEYR,AGEYR2534,38.7,23.5,57.6,27.0,35.7,36.4,18.1,43.6,18.8
2,2011,AK,Alaska,2,Age (years),35 - 44,AGEYR,AGEYR3544,38.9,29.5,55.3,22.2,32.3,30.8,21.1,42.3,18.4
3,2011,AK,Alaska,2,Age (years),45 - 54,AGEYR,AGEYR4554,43.3,29.2,56.2,25.2,38.8,32.4,24.7,35.8,17.0
4,2011,AK,Alaska,2,Age (years),55 - 64,AGEYR,AGEYR5564,38.9,33.4,58.2,20.1,43.3,26.6,26.0,33.3,16.0
5,2011,AK,Alaska,2,Age (years),65 or older,AGEYR,AGEYR65PLUS,40.9,29.3,58.9,19.8,46.6,24.2,27.4,28.8,22.0
6,2011,AK,Alaska,2,Education,College graduate,EDU,EDUCOGRAD,39.9,22.1,67.2,33.8,43.3,41.8,12.5,28.9,12.8
7,2011,AK,Alaska,2,Education,High school graduate,EDU,EDUHSGRAD,38.0,33.1,49.9,17.4,30.1,25.9,26.9,47.5,25.4


In [26]:
df.shape

(8153, 17)

In [37]:
df['StratificationCategory1'].unique()

array(['Age (years)', 'Education', 'Gender', 'Income', 'Race/Ethnicity',
       'Total'], dtype=object)

We also want each category in StratificationCategory1 to be a feature we can explore.  I am going to pull out each category into its own dataframe and then concatenate the dataframes together. 

In [27]:
df = df.drop(['StratificationCategoryId1', 'StratificationID1'], axis=1)

In [28]:
df_age = df[df['StratificationCategory1']== 'Age (years)']
df_age = df_age.drop('StratificationCategory1', axis=1)
df_age.rename(columns={'Stratification1': 'Age'}, inplace=True)

In [29]:
df_ed = df[df['StratificationCategory1']== 'Education']
df_ed = df_ed.drop('StratificationCategory1', axis=1)
df_ed.rename(columns={'Stratification1': 'Education'}, inplace=True)

In [30]:
df_gen = df[df['StratificationCategory1']== 'Gender']
df_gen = df_gen.drop('StratificationCategory1', axis=1)
df_gen.rename(columns={'Stratification1': 'Gender'}, inplace=True)

In [31]:
df_inc = df[df['StratificationCategory1']== 'Income']
df_inc = df_inc.drop('StratificationCategory1', axis=1)
df_inc.rename(columns={'Stratification1': 'Income'}, inplace=True)

In [32]:
df_race = df[df['StratificationCategory1']== 'Race/Ethnicity']
df_race = df_race.drop('StratificationCategory1', axis=1)
df_race.rename(columns={'Stratification1': 'Race/Ethnicity'}, inplace=True)

In [33]:
df_total = df[df['StratificationCategory1']== 'Total']
df_total = df_total.drop('StratificationCategory1', axis=1)
df_total.rename(columns={'Stratification1': 'Total'}, inplace=True)

In [34]:
df_all = [df_age, df_ed, df_gen, df_inc, df_race, df_total]

In [35]:
df = pd.concat(df_all)
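
The split, rename, and concat pattern used in the cells above can be sketched as a loop on a toy frame. The frame `strat` and its values are illustrative only, and only two of the six categories are shown.

```python
import pandas as pd

# Toy frame with two stratification categories; values are illustrative.
strat = pd.DataFrame({
    "LocationAbbr": ["AK", "AK", "AK"],
    "StratificationCategory1": ["Age (years)", "Gender", "Gender"],
    "Stratification1": ["18 - 24", "Male", "Female"],
    "obese": [19.8, 27.0, 28.1],
})

# Category value -> name of the new feature column.
new_names = {"Age (years)": "Age", "Gender": "Gender"}

parts = []
for category, column in new_names.items():
    part = strat[strat["StratificationCategory1"] == category]
    part = part.drop(columns="StratificationCategory1")
    parts.append(part.rename(columns={"Stratification1": column}))

# concat takes the union of the columns; cells with no value become NaN.
combined = pd.concat(parts)
print(combined)
```

Each row keeps a value in only one of the new feature columns, which is why the combined frame has NaN in the others.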

In [36]:
df.shape

(8153, 19)

In [38]:
df.head(10)

Unnamed: 0,YearStart,LocationAbbr,LocationDesc,LocationID,Age,overweight,obese,some_activity,some_and_muslce,more_activity,Strength_training,no_physical_activity,fruit,vegetables,Education,Gender,Income,Race/Ethnicity,Total
0,2011,AK,Alaska,2,18 - 24,32.0,19.8,62.8,34.9,32.4,51.9,16.1,45.0,29.1,,,,,
1,2011,AK,Alaska,2,25 - 34,38.7,23.5,57.6,27.0,35.7,36.4,18.1,43.6,18.8,,,,,
2,2011,AK,Alaska,2,35 - 44,38.9,29.5,55.3,22.2,32.3,30.8,21.1,42.3,18.4,,,,,
3,2011,AK,Alaska,2,45 - 54,43.3,29.2,56.2,25.2,38.8,32.4,24.7,35.8,17.0,,,,,
4,2011,AK,Alaska,2,55 - 64,38.9,33.4,58.2,20.1,43.3,26.6,26.0,33.3,16.0,,,,,
5,2011,AK,Alaska,2,65 or older,40.9,29.3,58.9,19.8,46.6,24.2,27.4,28.8,22.0,,,,,
25,2011,AL,Alabama,1,18 - 24,27.1,16.3,49.4,27.2,25.8,41.6,22.5,45.0,31.4,,,,,
26,2011,AL,Alabama,1,25 - 34,31.9,35.2,41.9,16.3,20.9,28.3,27.2,43.1,25.6,,,,,
27,2011,AL,Alabama,1,35 - 44,33.3,35.5,41.8,17.0,22.8,28.3,29.8,50.1,23.8,,,,,
28,2011,AL,Alabama,1,45 - 54,35.8,38.0,39.3,12.4,22.1,20.3,35.3,47.6,23.7,,,,,


We now have a dataframe which is organized by year and state and all features are columns. 

There are many NaN values in the categorical features (i.e. education, age, gender, etc.).  This is because each question was assessed as a separate instance for each categorical feature.  For now I will leave those and explore further in the next step, exploratory data analysis. 
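
A quick way to quantify these gaps is to count missing values per column and confirm that every row carries exactly one category label. The sketch below does this on a toy frame. The frame `demo` and its values are illustrative, not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame shaped like the concatenated result; values are illustrative.
demo = pd.DataFrame({
    "Age": ["18 - 24", np.nan, np.nan],
    "Gender": [np.nan, "Male", "Female"],
    "obese": [19.8, 27.0, np.nan],
})

# Number of missing values in each column.
missing_counts = demo.isna().sum()

# True if every row has exactly one non-missing category label.
one_label_per_row = demo[["Age", "Gender"]].notna().sum(axis=1).eq(1).all()

print(missing_counts)
print(one_label_per_row)
```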

In [40]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
YearStart,8153.0,2013.5268,1.709682,2011.0,2012.0,2014.0,2015.0,2016.0
LocationID,8153.0,30.39004,17.037228,1.0,17.0,30.0,44.0,78.0
overweight,8127.0,34.926467,4.673879,10.1,32.4,35.4,37.9,58.3
obese,8127.0,28.929384,6.92926,0.9,25.4,29.4,33.25,60.4
some_activity,3994.0,50.202479,7.547899,24.0,45.2,50.3,55.3,77.6
some_and_muslce,3991.0,20.094287,5.643522,2.2,16.3,19.7,23.4,46.5
more_activity,3991.0,31.250388,6.088415,12.5,27.1,31.1,35.1,64.9
Strength_training,4003.0,29.640919,7.141949,3.3,24.9,29.2,33.8,61.0
no_physical_activity,8119.0,25.560488,8.053705,2.5,20.0,25.1,30.7,60.5
fruit,3998.0,40.174562,6.975817,9.6,35.4,40.0,44.9,63.0


#### Data Cleaning

Let's rename some of our columns so they are more descriptive of what they contain. 

In [41]:
df.rename(columns={'YearStart':'Year', 'LocationDesc':'Location'}, inplace=True)

In [43]:
df['Location'].unique()

array(['Alaska', 'Alabama', 'Arkansas', 'Arizona', 'California',
       'Colorado', 'Connecticut', 'District of Columbia', 'Delaware',
       'Florida', 'Georgia', 'Hawaii', 'Iowa', 'Idaho', 'Illinois',
       'Indiana', 'Kansas', 'Kentucky', 'Louisiana', 'Massachusetts',
       'Maryland', 'Maine', 'Michigan', 'Minnesota', 'Missouri',
       'Mississippi', 'Montana', 'North Carolina', 'North Dakota',
       'Nebraska', 'New Hampshire', 'New Jersey', 'New Mexico', 'Nevada',
       'New York', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania',
       'Rhode Island', 'South Carolina', 'South Dakota', 'Tennessee',
       'Texas', 'National', 'Utah', 'Virginia', 'Vermont', 'Washington',
       'Wisconsin', 'West Virginia', 'Wyoming', 'Puerto Rico', 'Guam',
       'Virgin Islands'], dtype=object)

In [44]:
duplicateRowsDF = df[df.duplicated()]
duplicateRowsDF

Unnamed: 0,Year,LocationAbbr,Location,LocationID,Age,overweight,obese,some_activity,some_and_muslce,more_activity,Strength_training,no_physical_activity,fruit,vegetables,Education,Gender,Income,Race/Ethnicity,Total
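
The empty result above means no fully duplicated rows were found. For reference, the sketch below shows how the same check behaves when a duplicate is present. The frame `dup_demo` and its values are illustrative only.

```python
import pandas as pd

# Toy frame in which the second row repeats the first exactly.
dup_demo = pd.DataFrame({
    "Year": [2011, 2011, 2012],
    "Location": ["Alaska", "Alaska", "Alaska"],
    "obese": [19.8, 19.8, 20.1],
})

# duplicated() marks every repeat after the first occurrence as True.
n_dupes = int(dup_demo.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row.
deduped = dup_demo.drop_duplicates()

print(n_dupes, len(deduped))
```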


At this point the data has been transformed and is ready for exploratory data analysis in the next step.  The last thing to do is save my new, cleaned dataframe.  

In [46]:
df.to_csv('df_wrangled.csv')
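
By default `to_csv` also writes the index, which comes back as an extra `Unnamed: 0` column when the file is read again. The sketch below shows a round trip with `index=False`. It uses a toy frame and a temporary folder, so the names `clean` and `out_path` are illustrative.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame with a non-contiguous index, like the one left after concat.
clean = pd.DataFrame({"Year": [2011, 2012], "obese": [19.8, 20.1]}, index=[0, 25])

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "df_wrangled.csv"
    # index=False keeps the row labels out of the file.
    clean.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

print(list(reloaded.columns))
```

In the project itself, the same call could write into the `data` folder created earlier, for example `df.to_csv('data/df_wrangled.csv', index=False)`.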