07-06-2018

**_Author: Dana Chermesh, Regional Planning intern_**


### US Metros comparison 
Comparison at the county level of 15 regions (CSAs) across the country

----

### _Notebook no.3_
# Housing + Labor force data
# ACS 1-year estimates 2017 using Census API

----

A user guide for Census Data API:

# [Census Data API User Guide](https://www.census.gov/content/dam/Census/data/developers/api-user-guide/api-guide.pdf)

The Census Data API is an API that gives the public access to raw statistical data from various Census Bureau data
programs. Spatially, the data are aggregated and usually associated with a
Census geographic boundary or area identified by a FIPS code.
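
As a rough illustration of how such a request is put together, the sketch below assembles a query URL from a dataset path, a list of variables, and a geography. The helper name `build_census_url` and its parameters are hypothetical and not part of any Census library; only the `get`, `for`, `in`, and `key` query parameters mirror the URLs used later in this notebook.

```python
from urllib.parse import urlencode

BASE = "https://api.census.gov/data"


def build_census_url(year, dataset, variables, geo_for, geo_in=None, key=None):
    """Assemble a Census Data API query URL (hypothetical helper)."""
    params = [("get", ",".join(variables)), ("for", geo_for)]
    if geo_in is not None:
        params.append(("in", geo_in))
    if key is not None:
        params.append(("key", key))
    # keep ':', '*' and ',' unescaped so the URL reads like the hand-written ones
    query = urlencode(params, safe=":*,")
    return f"{BASE}/{year}/{dataset}?{query}"


url = build_census_url(2016, "acs/acs5", ["B25001_001E", "NAME"],
                       geo_for="county:*", geo_in="state:*")
print(url)
```

Passing `key=myAPI` would append the API key as a final `key=` parameter, as done in the last section of this notebook.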

## _get your API key from:_ 
https://api.census.gov/data/key_signup.html

**Recommended:** In order to keep your API key confidential, please save your API key in a .py file named **censusAPI.py** as follows:

```python
myAPI = 'XXXXXXXXXXXXXXX'
```
Then read into this notebook as in the following cell:
```python
from censusAPI import myAPI
```

### The complete list of all available datasets for the API is located here:
https://api.census.gov/data.html

---- 

### More on the 2000 Decennial: Summary File 1 (SF1)
- [geographies](https://api.census.gov/data/2000/sf1/geography.html)
- [variables](https://api.census.gov/data/2000/sf1/variables.html)

----
# Housing units 2016
### _data were obtained from the ACS 2012-2016 5-year estimate, all counties in the US_
variables to be acquired:
- **B25001_001E** | Total Housing Units (occupied+vacant)
- **B25003_002E** | Owner occupied
- **B25003_003E** | Renter occupied

In [1]:
import pandas as pd
import json
# reading in my api key saved in censusAPI.py as
# myAPI = 'XXXXXXXXXXXXXXX'
# request an api key in: https://api.census.gov/data/key_signup.html
from censusAPI import myAPI

In [2]:
import json
import requests 
import urllib
import numpy as np

# read in the variables available; the URL below points at the 2016 ACS 5-year dataset
url = "https://api.census.gov/data/2016/acs/acs5/variables.json"
resp = requests.request('GET', url)
aff1y = json.loads(resp.text)

In [3]:
#turning things into arrays to enable broadcasting
#Python3
affkeys = np.array(list(aff1y['variables'].keys()))

affkeys

array(['B25036_005E', 'B08534_008E', 'B24126_481E', ..., 'B25124_028E',
       'B24123_413E', 'B24125_205E'], dtype='<U14')

In [4]:
# variable codes for housing units and tenure (owner / renter occupied)
totalHU = 'B25001_001E'
owner = 'B25003_002E'
renter = 'B25003_003E'

aff1y['variables'][totalHU]

{'attributes': 'B25001_001M,B25001_001MA,B25001_001EA',
 'concept': 'HOUSING UNITS',
 'group': 'B25001',
 'label': 'Estimate!!Total',
 'limit': 0,
 'predicateType': 'int',
 'validValues': []}
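
Looking up one code at a time works once the code is known. To discover codes, a small filter over the same dictionary can help. The sketch below is an assumption-laden illustration: `find_variables` is a hypothetical helper, and `sample_variables` is a tiny stand-in for `aff1y['variables']` in which the two tenure labels are illustrative rather than copied from the API.

```python
def find_variables(variables, keyword):
    """Return {code: label} for entries whose 'concept' contains keyword."""
    keyword = keyword.lower()
    return {
        code: meta.get("label", "")
        for code, meta in variables.items()
        # .get() guards against entries that carry no 'concept' field
        if keyword in meta.get("concept", "").lower()
    }


# tiny stand-in for aff1y['variables']; tenure labels are illustrative
sample_variables = {
    "B25001_001E": {"concept": "HOUSING UNITS", "label": "Estimate!!Total"},
    "B25003_002E": {"concept": "TENURE", "label": "Estimate!!Total!!Owner occupied"},
    "B25003_003E": {"concept": "TENURE", "label": "Estimate!!Total!!Renter occupied"},
}

tenure = find_variables(sample_variables, "tenure")
print(sorted(tenure))
```

In the notebook itself the call would presumably be `find_variables(aff1y['variables'], 'tenure')`.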

In [5]:
# HU2016 data for all counties in the US
totalHU16 = pd.read_json('https://api.census.gov/data/2016/acs/acs5?get='+
                         totalHU + ',' +
                         owner + ',' +
                         renter +',NAME&for=county:*&in=state:*')
totalHU16.columns = totalHU16.iloc[0]
totalHU16 = totalHU16[1:]

totalHU16['state'] = totalHU16['state'].apply(lambda x: '{0:0>2}'.format(x))
totalHU16['county'] = totalHU16['county'].apply(lambda x: '{0:0>3}'.format(x))
totalHU16['STCO'] = totalHU16[['state', 'county']].apply(lambda x: ''.join(x), axis=1)

totalHU16 = totalHU16.drop(['state', 'county'], axis=1)
totalHU16.columns = ['TotalHousing16', 'Owners16', 'renters16',
                     'Name', 'STCO']

print(totalHU16.shape)
totalHU16.head()

(3220, 5)


Unnamed: 0,TotalHousing16,Owners16,renters16,Name,STCO
1,22714,15218,5582,"Autauga County, Alabama",1001
2,107579,53905,21244,"Baldwin County, Alabama",1003
3,11802,5829,3293,"Barbour County, Alabama",1005
4,8972,5119,1929,"Bibb County, Alabama",1007
5,23850,16254,4365,"Blount County, Alabama",1009
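
The cell above promotes the first row of the response to the header and then builds the 5-digit `STCO` code. A compact sketch of the same steps on an in-memory response is shown below; `census_rows_to_frame` is a hypothetical helper, and the two sample rows reuse the values printed above in the header-plus-rows shape the API returns.

```python
import pandas as pd


def census_rows_to_frame(rows):
    """Turn a Census API response (header row + data rows) into a DataFrame."""
    frame = pd.DataFrame(rows[1:], columns=rows[0])
    # zero-pad the FIPS parts and concatenate them into a 5-digit county code
    state = frame["state"].astype(str).str.zfill(2)
    county = frame["county"].astype(str).str.zfill(3)
    frame["STCO"] = state + county
    return frame.drop(columns=["state", "county"])


sample_rows = [
    ["B25001_001E", "B25003_002E", "B25003_003E", "NAME", "state", "county"],
    ["22714", "15218", "5582", "Autauga County, Alabama", "01", "001"],
    ["107579", "53905", "21244", "Baldwin County, Alabama", "01", "003"],
]
hu_sample = census_rows_to_frame(sample_rows)
print(hu_sample["STCO"].tolist())
```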


##  Reading in geo-coded dataset
created on a different notebook, please refer to _**ADD NOTEBOOK NAME**_

In [6]:
geo = pd.read_csv('data/USmetros_full.csv').iloc[:,:-2] \
            .drop(['Unnamed: 0', 'SHAPE_AREA'], axis=1)
geo['STCO'] = geo['STCO'].apply(lambda x: '{0:0>5}'.format(x))

print(geo.shape)
geo.head()

(270, 4)


Unnamed: 0,CSA,CSA_name,County_name,STCO
0,488,"San Jose-San Francisco-Oakland, CA",Alameda,6001
1,488,"San Jose-San Francisco-Oakland, CA",Contra Costa,6013
2,488,"San Jose-San Francisco-Oakland, CA",Marin,6041
3,488,"San Jose-San Francisco-Oakland, CA",Napa,6055
4,488,"San Jose-San Francisco-Oakland, CA",San Benito,6069
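
The re-padding of `STCO` is needed because `read_csv` parses the codes as integers and drops leading zeros. An alternative, sketched here on an inline stand-in for the CSV (contents illustrative), is to read the column as text from the start.

```python
import io
import pandas as pd

# inline stand-in for a file such as data/USmetros_full.csv (contents illustrative)
csv_text = "CSA,County_name,STCO\n488,Alameda,06001\n488,Marin,06041\n"

as_numbers = pd.read_csv(io.StringIO(csv_text))
as_strings = pd.read_csv(io.StringIO(csv_text), dtype={"STCO": str})

print(as_numbers["STCO"].tolist())   # leading zeros are lost when parsed as integers
print(as_strings["STCO"].tolist())   # leading zeros survive when read as text
```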


### Merging datasets

In [7]:
HOUSING16_CO = totalHU16.merge(geo, on='STCO').set_index('County_name')

print(HOUSING16_CO.shape)
HOUSING16_CO.head()

(270, 7)


County_name,TotalHousing16,Owners16,renters16,Name,STCO,CSA,CSA_name
Alameda,592796,296634,267659,"Alameda County, California",6001,488,"San Jose-San Francisco-Oakland, CA"
Contra Costa,406803,250055,137485,"Contra Costa County, California",6013,488,"San Jose-San Francisco-Oakland, CA"
Los Angeles,3490118,1499576,1782269,"Los Angeles County, California",6037,348,"Los Angeles-Long Beach, CA"
Marin,112259,66200,38200,"Marin County, California",6041,488,"San Jose-San Francisco-Oakland, CA"
Napa,55301,30411,18964,"Napa County, California",6055,488,"San Jose-San Francisco-Oakland, CA"
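
The default inner merge silently drops any county without a matching `STCO`. Here the merged row count (270) equals the size of `geo`, which suggests nothing was lost. A sketch of an explicit check using toy frames follows; the frame names and the deliberately missing county are made up for illustration.

```python
import pandas as pd

housing = pd.DataFrame({"STCO": ["06001", "06013", "01001"],
                        "TotalHousing16": [592796, 406803, 22714]})
regions = pd.DataFrame({"STCO": ["06001", "06013", "06041"],
                        "CSA": [488, 488, 488]})

# how='right' keeps every county from the region table; _merge flags the source
check = housing.merge(regions, on="STCO", how="right", indicator=True)
missing = check.loc[check["_merge"] == "right_only", "STCO"].tolist()
print(missing)
```

An empty `missing` list would confirm that every county in the region table found housing data.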


In [8]:
HOUSING16_CO[HOUSING16_CO['CSA']==408].shape

(31, 7)

In [14]:
# convert numeric columns from str ('object') to int via to_numeric
HOUSING16_CO.iloc[:,:3] = HOUSING16_CO.iloc[:,:3].apply(pd.to_numeric,
                                                      errors='coerce')

HOUSING16_CO.dtypes

TotalHousing16     int64
Owners16           int64
renters16          int64
Name              object
STCO              object
CSA                int64
CSA_name          object
dtype: object
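
Depending on the pandas version, assigning through `.iloc` may write the converted values into the existing `object` columns instead of replacing them, so the dtypes might not change as shown above. A sketch of a variant that assigns by column label, on toy data reusing two rows from the 2016 table:

```python
import pandas as pd

df = pd.DataFrame({"TotalHousing16": ["22714", "107579"],
                   "Owners16": ["15218", "53905"],
                   "Name": ["Autauga County, Alabama", "Baldwin County, Alabama"]})

num_cols = ["TotalHousing16", "Owners16"]
# replacing the columns by label swaps in the converted Series, dtype included
df[num_cols] = df[num_cols].apply(pd.to_numeric, errors="coerce")
print(df.dtypes)
```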

### Exporting all counties Housing data to .csv

In [15]:
HOUSING16_CO.to_csv('HOUSING16_CO.csv')

## Groupby CSAs to sum

In [17]:
CSA_housing16 = HOUSING16_CO.groupby(['CSA', 'CSA_name']).sum()

print(CSA_housing16.shape)
CSA_housing16

(15, 3)


CSA,CSA_name,TotalHousing16,Owners16,renters16
122,"Atlanta--Athens-Clarke County--Sandy Springs, GA",2475969,1396394,824382
148,"Boston-Worcester-Providence, MA-RI-NH-CT",3412143,1934972,1151423
176,"Chicago-Naperville, IL-IN-WI",1792036,1222063,435630
206,"Dallas-Fort Worth, TX-OK",2832798,1566020,1034702
216,"Denver-Aurora, CO",1350238,814383,464369
220,"Detroit-Warren-Ann Arbor, MI",2340603,1420760,652609
288,"Houston-The Woodlands, TX",2534512,1383407,907205
348,"Los Angeles-Long Beach, CA",4562239,2081082,2217775
370,"Miami-Fort Lauderdale-Port St. Lucie, FL",2809034,1419029,888587
378,"Minneapolis-St. Paul, MN-WI",1554227,1028692,444354


### Exporting CSA's Housing data to .csv

In [19]:
CSA_housing16.to_csv('HOUSING16_CSAs.csv')

# Housing units 2010
### _data were obtained from the Decennial Census 2010, SF1, all counties in the US_
variables to be acquired:
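- **H00010001** | Total Housing Units (occupied+vacant)
- **H0030001** | Occupancy status, total (used as a check)
- **H0040002** | Owner occupied, owned with a mortgage or a loan
- **H0040003** | Owner occupied, owned free and clear
- **H0040004** | Renter occupied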

In [20]:
url10 = "https://api.census.gov/data/2010/sf1/variables.json"
resp10 = requests.request('GET', url10)
aff1y10 = json.loads(resp10.text)

In [21]:
#turning things into arrays to enable broadcasting
#Python3
affkeys10 = np.array(list(aff1y10['variables'].keys()))

affkeys10

array(['PCT0150019', 'P029E001', 'P031H013', ..., 'PCT012I167',
       'PCT012C117', 'PCT013C004'], dtype='<U10')

In [22]:
# variable codes for housing units, occupancy and tenure in the 2010 SF1
totalHU10 = 'H00010001'
test10 = 'H0030001'
owner10_1 = 'H0040002'
owner10_2 = 'H0040003'
renter10 = 'H0040004'

aff1y10['variables'][totalHU10]

{'concept': 'H1. Housing Units [1]',
 'group': 'N/A',
 'label': 'Housing units',
 'limit': 0,
 'validValues': []}

In [None]:
# HU2010 data for all counties in the US
# (stored under a new name so the variable-code string totalHU10 is not
#  overwritten and the cell can be re-run)
HU10_CO = pd.read_json('https://api.census.gov/data/2010/sf1?get='+
                       totalHU10 + ',' +
                       test10 + ',' +
                       owner10_1 + ',' +
                       owner10_2 + ',' +
                       renter10 +',NAME&for=county:*&in=state:*')
HU10_CO.columns = HU10_CO.iloc[0]
HU10_CO = HU10_CO[1:]

HU10_CO['state'] = HU10_CO['state'].apply(lambda x: '{0:0>2}'.format(x))
HU10_CO['county'] = HU10_CO['county'].apply(lambda x: '{0:0>3}'.format(x))
HU10_CO['STCO'] = HU10_CO[['state', 'county']].apply(lambda x: ''.join(x), axis=1)

HU10_CO = HU10_CO.drop(['state', 'county'], axis=1)
# TODO: assign 2010 column names (this table has 7 columns:
#       5 variables, NAME and STCO)

print(HU10_CO.shape)
HU10_CO.head()
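
The 2010 request pulls two owner variables (`owner10_1`, `owner10_2`) where the 2016 table has a single `Owners16` column. One plausible next step, sketched here as an assumption rather than the notebook's actual method, is to sum the two into one owner-occupied total. The values are illustrative, not Census figures, and the column names `Owners10` and `renters10` are hypothetical.

```python
import pandas as pd

# illustrative values only, not actual Census figures
hu10_sample = pd.DataFrame({
    "H0040002": ["100", "250"],   # owner10_1
    "H0040003": ["40", "60"],     # owner10_2
    "H0040004": ["80", "90"],     # renter10
    "STCO": ["01001", "01003"],
})

cols = ["H0040002", "H0040003", "H0040004"]
hu10_sample[cols] = hu10_sample[cols].apply(pd.to_numeric, errors="coerce")
# single owner-occupied total, comparable to Owners16 in the 2016 table
hu10_sample["Owners10"] = hu10_sample["H0040002"] + hu10_sample["H0040003"]
hu10_sample = hu10_sample.rename(columns={"H0040004": "renters10"})
print(hu10_sample[["STCO", "Owners10", "renters10"]])
```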

------

## Obtaining PLACES 2000 data in order to define cities within major metros

### 2000

In [78]:
# total POP for all places in the US, 2000
POP00_place = pd.read_json('https://api.census.gov/data/2000/sf1?get=P001001,NAME&for=place:*')
POP00_place.columns = POP00_place.iloc[0]
POP00_place = POP00_place[1:]

POP00_place.rename(columns={'P001001':'2000'}, inplace=True)
POP00_place['GEOID'] = POP00_place[['state', 'place']].apply(lambda x: ''.join(x), axis=1)

print(POP00_place.shape)
POP00_place.head()

(25150, 5)


Unnamed: 0,2000,NAME,state,place,GEOID
1,984,Altoona town,1,1660,101660
2,7411,Boaz city,1,7912,107912
3,3158,Calera city,1,11416,111416
4,4927,Childersburg city,1,14464,114464
5,53929,Decatur city,1,20104,120104


In [79]:
POP00_place[POP00_place['NAME'] == 'San Francisco city']

Unnamed: 0,2000,NAME,state,place,GEOID
2330,776733,San Francisco city,6,67000,667000


### Reading in my Geocoded places table
Created by Dara Goldberg

In [80]:
cities = pd.read_excel('data/CSA Population+Change_2010-2017.xlsx', 
             sheet_name='Cities_pop+geoinfo')

# zero-padding GEOID to 7 digits to ensure a match
cities['GEOID'] = cities['GEOID'].apply(lambda x: '{0:0>7}'.format(x))
# setting GEOID to str
cities.GEOID = cities.GEOID.astype(str)

print(cities.shape)
cities

(19, 15)


Unnamed: 0,GEOID,NAMELSAD,NAME,CSA,ALAND_mi,Dec,Est,2010,2011,2012,2013,2014,2015,2016,2017
0,644000,"Los Angeles city, California",Los Angeles,348,468.65867,3792621,3792724,3796060,3824592,3859267,3891783,3922668,3953459,3981116,3999759
1,653000,"Oakland city, California",Oakland,488,55.89604,390724,390822,391571,396480,401906,407567,413933,418929,421566,425195
2,667000,"San Francisco city, California",San Francisco,488,46.90564,805235,805193,805770,816294,830406,841270,853258,866320,876103,884363
3,668000,"San Jose city, California",San Jose,488,177.5141,945942,952574,955255,971352,985722,1003735,1016708,1027560,1031942,1035317
4,820000,"Denver city, Colorado",Denver,216,153.30483,600158,599813,603218,619356,633798,648049,663271,681618,694777,704621
5,1150000,"Washington city, District of Columbia",Washington,548,61.13988,601723,601766,605040,620336,635630,650114,660797,672736,684336,693972
6,1245000,"Miami city, Florida",Miami,370,35.98691,399457,399527,400864,410932,416157,421149,431645,442277,456632,463347
7,1304000,"Atlanta city, Georgia",Atlanta,122,133.43344,420003,420425,422849,431729,443008,447812,455589,463479,472967,486290
8,1714000,"Chicago city, Illinois",Chicago,176,227.3401,2695598,2695620,2697661,2706670,2717989,2724482,2726533,2725154,2720275,2716450
9,2507000,"Boston city, Massachusetts",Boston,148,48.34364,617594,617725,620702,630072,641955,652039,661103,669255,678430,685094


### Merging 2000 + cities datasets

In [81]:
POP_place = cities.merge(POP00_place, on='GEOID')
POP_place = POP_place.drop(['NAME_x', 'Dec', 'Est', 2011, 2012, 2013, 
                           2014, 2015, 2016, 2017, 'state', 'place'], axis=1)
POP_place.columns = ['GEOID', 'NAMELSAD', 'CSA', 'ALAND_mi',
                     '2010', '2000', 'NAME']

POP_place['2000'] = POP_place['2000'].astype(int)
POP_place['2010'] = POP_place['2010'].astype(int)

POP_place['2000-2010_NET'] = POP_place['2010'] - POP_place['2000']
POP_place['2000-2010_%'] = (POP_place['2010'] - POP_place['2000'])/POP_place['2000']

print(POP_place.shape)
POP_place

(19, 9)


Unnamed: 0,GEOID,NAMELSAD,CSA,ALAND_mi,2010,2000,NAME,2000-2010_NET,2000-2010_%
0,644000,"Los Angeles city, California",348,468.65867,3796060,3694820,Los Angeles city,101240,0.027401
1,653000,"Oakland city, California",488,55.89604,391571,399484,Oakland city,-7913,-0.019808
2,667000,"San Francisco city, California",488,46.90564,805770,776733,San Francisco city,29037,0.037384
3,668000,"San Jose city, California",488,177.5141,955255,894943,San Jose city,60312,0.067392
4,820000,"Denver city, Colorado",216,153.30483,603218,554636,Denver city,48582,0.087593
5,1150000,"Washington city, District of Columbia",548,61.13988,605040,572059,Washington city,32981,0.057653
6,1245000,"Miami city, Florida",370,35.98691,400864,362470,Miami city,38394,0.105923
7,1304000,"Atlanta city, Georgia",122,133.43344,422849,416474,Atlanta city,6375,0.015307
8,1714000,"Chicago city, Illinois",176,227.3401,2697661,2896016,Chicago city,-198355,-0.068492
9,2507000,"Boston city, Massachusetts",148,48.34364,620702,589141,Boston city,31561,0.053571


### Exporting Places 2000 table to .csv

In [82]:
POP_place.to_csv('SF1_POP00-10_Places.csv')

----



# _Another approach: collecting Census data and exporting it to an Excel file_
* example for one variable of interest in this notebook, **population for all counties from the SF1 Decennial Census 2000**:

### Census/Collect_Census_into_Excel.py
from:
https://github.com/xbwei/Data-Mining-on-Social-Media/blob/master/Census/Collect_Census_into_Excel.py

- Tutorial on this package:
https://www.youtube.com/watch?v=5vvAOsIB2fY

The code is as follows:

In [6]:
from urllib import request
import json
# from pprint import pprint
import xlwt
# import xlrd
# from xlutils.copy import copy

census_api_key = myAPI #get your key from https://api.census.gov/data/key_signup.html
 
 
# parentheses allow the string concatenation to span two lines
url_str = ('https://api.census.gov/data/2000/sf1?get=P001001,NAME'+
           '&for=county:*&in=state:*&key='+census_api_key) # create the url of your census data
 
response = request.urlopen(url_str) # read the response into computer
 
 
file = xlwt.Workbook() # create a new excel file
sheet_Co = file.add_sheet('SF1_2000_Co') # add a new sheet named SF1_2000_Co
html_str = response.read().decode("utf-8") # convert the response into string
i = 0 
if (html_str):
    json_data = json.loads(html_str) # convert the string into json
    for row in json_data:
        cl1, cl2, cl3, cl4 =row
        
        #write format (row_num, col_num, value)
        sheet_Co.write(i,0,cl1)
        sheet_Co.write(i,1,cl2)
        sheet_Co.write(i,2,cl3)
        sheet_Co.write(i,3,cl4)
        i = i+1

# note: xlwt writes the legacy .xls format regardless of the file extension
file.save('SF1_2000_Co.xlsx') #define the location of your excel file

In [8]:
# reading in the Excel file written in the previous cell
POP00all = pd.read_excel('SF1_2000_Co.xlsx')

POP00all['state'] = POP00all['state'].apply(lambda x: '{0:0>2}'.format(x))
POP00all['county'] = POP00all['county'].apply(lambda x: '{0:0>3}'.format(x))

POP00all['STCO'] = POP00all[['state', 'county']].apply(lambda x: ''.join(x), axis=1)

print(POP00all.shape)
POP00all.tail(10)

(3141, 5)


Unnamed: 0,P001001,NAME,state,county,STCO
3131,2407,Niobrara County,56,27,56027
3132,25786,Park County,56,29,56029
3133,8807,Platte County,56,31,56031
3134,26560,Sheridan County,56,33,56033
3135,5920,Sublette County,56,35,56035
3136,37613,Sweetwater County,56,37,56037
3137,18251,Teton County,56,39,56039
3138,19742,Uinta County,56,41,56041
3139,8289,Washakie County,56,43,56043
3140,6644,Weston County,56,45,56045
