##Reduce / transform Census Tract data
[Data](https://docs.google.com/file/d/0B1aa6nX82m2WY2RBNERsd0VfN1k/edit)  
[US Census Documentation](http://www.census.gov/acs/www/data_documentation/documentation_main/)

In [227]:
reset

Once deleted, variables cannot be recovered. Proceed (y/[n])? y


In [228]:
import pandas as pd
from pprint import pprint

In [229]:
path_base = ("/Users/brian/Google Drive/")
path_specfic = "SPHIP/2009-2013 ACS Demographic Data/Census Tract Data Files and Documentation/"
years = ['2', '3', '4', '5']
year = years[0] # Later this will 'for year in years:'

In [243]:
file_name = 'ACS_13_5YR_DP0'+year+'.txt'
print('File: '+file_name+'\n')
with open(path_base+path_specfic+file_name) as f:
    text_description = f.read()
print(text_description[:1000])
print('...')

File: ACS_13_5YR_DP02.txt

DP02
SELECTED SOCIAL CHARACTERISTICS IN THE UNITED STATES

Although the American Community Survey (ACS) produces population, demographic and housing unit estimates, it is the Census Bureau's Population Estimates Program that produces and disseminates the official estimates of the population for the nation, states, counties, cities and towns and estimates of housing units for states and counties.


Supporting documentation on code lists, subject definitions, data accuracy, and statistical testing can be found on the American Community Survey website in the Data and Documentation section.

Sample size and data quality measures (including coverage rates, allocation rates, and response rates) can be found on the American Community Survey website in the Methodology section.


Source:  U.S. Census Bureau, 2009-2013 5-Year American Community Survey


Explanation of Symbols:An '**' entry in the margin of error column indicates that either no sample observations or to

In [244]:
# Load meta data
file_name = "ACS_13_5YR_DP0"+year+"_metadata.csv"
path_name = (path_base+path_specfic+file_name)
df_meta = pd.read_csv(path_name,
                      header=None)
df_meta.columns = ['data_column_name', 'description']
# df_meta.head()

In [232]:
# Load data
file_name = "ACS_13_5YR_DP0"+year+"_with_ann.csv"
path_name = (path_base+path_specfic+file_name)
df_data_with_ann = pd.read_csv(path_name)

# Create data dictionary
df_data_with_ann[:1] # 1st row is annotations / meta data
data_dictionary = df_data_with_ann[:1].T.to_dict()[0]

# Drop annotations/meta data from dataframe
df_data = df_data_with_ann.drop(df_data_with_ann.head(n=1).index)

In [233]:
# Compare data dictionary to meta-data-frame (they should be the same)
df_meta.set_index('data_column_name')['description'].to_dict() == data_dictionary

True

In [234]:
# Improve column naming

# Data dataframe
# Lowercase
df_data.columns = [c.lower() for c in df_data.columns] 
# Rename
df_data = df_data.rename(columns={'geo.id': 'geo_id',
                                  'geo.id2': 'geo_id2',
                                  'geo.display-label': 'geo_display_label'
                            })

# Meta dataframe
# Lowercase
df_meta.data_column_name = df_meta.data_column_name.apply(lambda x: x.lower())
# Rename
df_meta.ix[0,'data_column_name'] = 'geo_id'
df_meta.ix[1,'data_column_name'] = 'geo_id2'
df_meta.ix[2,'data_column_name'] = 'geo_display_label'
df_meta.head(n=4)

# Data dictionary
# Lowercase
data_dictionary = {k.lower(): v for k, v in data_dictionary.items()}
# Rename
data_dictionary['geo_display_label'] = data_dictionary.pop('geo.display-label')
data_dictionary['geo_id'] = data_dictionary.pop('geo.id')
data_dictionary['geo_display_label'] = data_dictionary.pop('geo.id2')

In [235]:
print('data dictionary')
pprint(data_dictionary)
df_meta.head(n=2)

data dictionary
{'geo_display_label': 'Id2',
 'geo_id': 'Id',
 'hc01_vc03': 'Estimate; HOUSEHOLDS BY TYPE - Total households',
 'hc01_vc04': 'Estimate; HOUSEHOLDS BY TYPE - Total households - Family '
              'households (families)',
 'hc01_vc05': 'Estimate; HOUSEHOLDS BY TYPE - Total households - Family '
              'households (families) - With own children under 18 years',
 'hc01_vc06': 'Estimate; HOUSEHOLDS BY TYPE - Total households - Family '
              'households (families) - Married-couple family',
 'hc01_vc07': 'Estimate; HOUSEHOLDS BY TYPE - Total households - Family '
              'households (families) - Married-couple family - With own '
              'children under 18 years',
 'hc01_vc08': 'Estimate; HOUSEHOLDS BY TYPE - Total households - Family '
              'households (families) - Male householder, no wife present, '
              'family',
 'hc01_vc09': 'Estimate; HOUSEHOLDS BY TYPE - Total households - Family '
              'households (families) -

Unnamed: 0,data_column_name,description
0,geo_id,Id
1,geo_id2,Id2


In [236]:
print("Data:")
df_data.head(n=2)

Data:


Unnamed: 0,geo_id,geo_id2,geo_display_label,hc01_vc03,hc02_vc03,hc03_vc03,hc04_vc03,hc01_vc04,hc02_vc04,hc03_vc04,...,hc03_vc216,hc04_vc216,hc01_vc217,hc02_vc217,hc03_vc217,hc04_vc217,hc01_vc218,hc02_vc218,hc03_vc218,hc04_vc218
1,1400000US06075010100,6075010100,"Census Tract 101, San Francisco County, Califo...",2177,126,2177,(X),603,137,27.7,...,(X),(X),(X),(X),(X),(X),(X),(X),(X),(X)
2,1400000US06075010200,6075010200,"Census Tract 102, San Francisco County, Califo...",2547,186,2547,(X),677,136,26.6,...,(X),(X),(X),(X),(X),(X),(X),(X),(X),(X)


Ideas and work is tracked on [Hackpad](https://datakindsfbayarea.hackpad.com/SF-Health-Improvement-Partnership-SFHIP-IdGfO4Yn60V)

In [237]:
# Filter / reduce data
# Define these groups:
#     ACS_13_5YR_DP02_metadata
#     Total households
#     Families with children 
#     Under 12
#     12 to 18
#     19 to 64
#     65 and over
#     Single Parents
#     Under 12
#     12 to 18
#     19 to 64
#     65 and over
#     Average family size
#     Estimate; HOUSEHOLDS BY TYPE - Average family size
#     Educational Attainment
#     No Educational Attainment
#     High School
#     Collect
#     Master and Beyond
#     Disability
#     Percent; DISABILITY STATUS OF THE CIVILIAN NONINSTITUTIONALIZED POPULATION - 18 to 64 years
#     Percent; PLACE OF BIRTH - Total population - Native

In [238]:
# Find columns that have name
# phrase = 'Percent Margin of Error; SEX AND AGE - Total population - Female'
phrase = 'Born in Puerto Rico'

df_selected = df_meta[df_meta.description.str.contains(phrase)]
df_selected

Unnamed: 0,data_column_name,description
363,hc01_vc135,Estimate; PLACE OF BIRTH - Total population - ...
364,hc02_vc135,Margin of Error; PLACE OF BIRTH - Total popula...
365,hc03_vc135,Percent; PLACE OF BIRTH - Total population - N...
366,hc04_vc135,Percent Margin of Error; PLACE OF BIRTH - Tota...


In [239]:
# Print the information
for key, value in df_selected.values:
    print("""Column key:  \t{}\n\nDescription: \t{}\n\n***
            """.format(key, value))

Column key:  	hc01_vc135

Description: 	Estimate; PLACE OF BIRTH - Total population - Native - Born in Puerto Rico, U.S. Island areas, or born abroad to American parent(s)

***
            
Column key:  	hc02_vc135

Description: 	Margin of Error; PLACE OF BIRTH - Total population - Native - Born in Puerto Rico, U.S. Island areas, or born abroad to American parent(s)

***
            
Column key:  	hc03_vc135

Description: 	Percent; PLACE OF BIRTH - Total population - Native - Born in Puerto Rico, U.S. Island areas, or born abroad to American parent(s)

***
            
Column key:  	hc04_vc135

Description: 	Percent Margin of Error; PLACE OF BIRTH - Total population - Native - Born in Puerto Rico, U.S. Island areas, or born abroad to American parent(s)

***
            


In [240]:
# Select columns
selected_cols = df_meta[df_meta.description.str.contains(phrase)].data_column_name.values
# Display data for those columns
df_data[['geo_display_label']+selected_cols.tolist()].head(n=5)

Unnamed: 0,geo_display_label,hc01_vc135,hc02_vc135,hc03_vc135,hc04_vc135
1,"Census Tract 101, San Francisco County, Califo...",164,181,4.4,4.8
2,"Census Tract 102, San Francisco County, Califo...",57,51,1.4,1.3
3,"Census Tract 103, San Francisco County, Califo...",12,20,0.3,0.4
4,"Census Tract 104, San Francisco County, Califo...",102,98,2.0,1.9
5,"Census Tract 105, San Francisco County, Califo...",20,26,0.8,1.0
