### Citrics Documentation - Data Collection

The data used in this project came from several different sources. Merging them required a uniform naming convention for each city, so the data science team adopted the format `city_name ST`, where `ST` is the postal abbreviation for the state.
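As an illustration of the `city_name ST` convention, here is a minimal sketch of a helper that builds such a key. The function name `make_city_key` and the whitespace and casing rules are assumptions made for this example; the project's actual normalization may differ.

```python
def make_city_key(city_name: str, state_abbrev: str) -> str:
    """Build a `city_name ST` key, e.g. 'Birmingham AL' (hypothetical helper)."""
    # Collapse repeated whitespace so 'New  York' and 'New York' produce the same key
    city = " ".join(city_name.split())
    # Assumption: postal abbreviations are stored as two uppercase letters
    state = state_abbrev.strip().upper()
    if len(state) != 2 or not state.isalpha():
        raise ValueError(f"Expected a two-letter postal abbreviation, got {state_abbrev!r}")
    return f"{city} {state}"


print(make_city_key("Birmingham", "AL"))     # Birmingham AL
print(make_city_key("  New   York ", "ny"))  # New York NY
```

Building the key in one place means every source is merged on exactly the same string.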

The most comprehensive data set, and the one we built on, was the US Census data (https://www.census.gov/acs/www/data/data-tables-and-tools/data-profiles/2018/).

### Census Data Collection Methodology



Collecting the Census data was a fairly manual process. We started from the Census website: https://data.census.gov/cedsci/table?q=&d=ACS%205-Year%20Estimates%20Data%20Profiles&table=DP02&tid=ACSDP5Y2017.DP02&lastDisplayedRow=39&hidePreview=true&g=0100000US.160000

1) Select the Geographies option from the navigation bar

2) Clear anything listed under "Selected Geographies:" across the bottom of the page

3) From the remaining Geography column, scroll to and select "Place"; a "Place" column will appear

4) Scroll to and select the state you need data for; that state's name will appear in a new column
- Note: for some of the tables you can select multiple states at a time to speed up this process

5) From that column, select the first option, "all places in state_name"

- image: census-select-geographies

6) Hit the close button in the bottom right corner

7) Select "Download Table"

8) Choose CSV as the file format and select the date you want

9) You'll then have a zip file with the data for that state and time period

10) Repeat steps 1-9 until you have all states for that particular table

11) Then from the top-left corner of the page, select the "Tables" link

- image: census-tables-highlighted

12) Select another table (in total, you'll need DP02, DP03, DP04, and DP05)

13) Select "Customize Table" from the top-right corner of the page and repeat steps 1-12 until you've collected all of the data needed
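The steps above leave you with one zip file per state and table. The following is a minimal sketch of how those archives might be combined, using only the standard library. The function name `read_rows_from_zips` and the added `source_zip` and `source_file` columns are hypothetical, and the sketch assumes each archive holds UTF-8 CSV files with a header row.

```python
import csv
import io
import zipfile
from pathlib import Path


def read_rows_from_zips(zip_paths):
    """Read every CSV member of each zip file into a list of dict rows."""
    rows = []
    for zip_path in zip_paths:
        with zipfile.ZipFile(zip_path) as archive:
            for member in archive.namelist():
                if not member.lower().endswith(".csv"):
                    continue  # skip any non-CSV files in the archive
                with archive.open(member) as raw:
                    # utf-8-sig drops a byte-order mark if one is present
                    text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
                    for row in csv.DictReader(text):
                        # Record where each row came from for later auditing
                        row["source_zip"] = Path(zip_path).name
                        row["source_file"] = member
                        rows.append(row)
    return rows
```

With the downloads saved in a folder of your choosing (say, a hypothetical `census_downloads/`), `read_rows_from_zips(sorted(Path("census_downloads").glob("*.zip")))` returns rows that can be passed straight to `pd.DataFrame`. If an archive holds CSV files other than the data table, filter on `source_file` afterward.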

### Walk Score Data Collection Methodology

Walk Score is a service that offers a uniform scoring methodology for rating cities, as well as specific addresses, on their walkability, bike-friendliness, and available public transit. We used a third party for these figures so that we did not have to develop our own methodology.

About Walk Score:
https://www.walkscore.com/professional/research.php

Source of the data tables:
https://www.walkscore.com/cities-and-neighborhoods/states/

Trademark Guidelines:
https://www.walkscore.com/trademark-use.shtml

```python
import pandas as pd
```

```python
# Postal abbreviations for the 50 states plus DC
us_state_abbrev = ['AL','AK','AZ','AR','CA','CO','CT','DE','DC','FL','GA','HI','ID','IL','IN','IA','KS',
                   'KY','LA','ME','MD','MA','MI','MN','MS','MO','MT','NE','NV','NH','NJ','NM','NY','NC',
                   'ND','OH','OK','OR','PA','RI','SC','SD','TN','TX','UT','VT','VA','WA','WV','WI','WY']
```

```python
# Collect one table per state, then concatenate once after the loop
frames = []

for state in us_state_abbrev:
    # The first table on each state page holds the city summary scores
    df_state = pd.read_html('https://www.walkscore.com/' + state)[0]
    df_state['State'] = state
    frames.append(df_state)

df = pd.concat(frames)
```

This is a simple loop that uses the `pd.read_html` function and a list of state abbreviations to scrape the summary scores for all major cities from Walk Score's publicly available data tables.

```python
df.head()
```

| Unnamed: 0 | City | Zip Code | Walk Score | Transit Score | Bike Score | Population | State |
|---|---|---|---|---|---|---|---|
| 0 | Birmingham (the largest city in Alabama) | 35211.0 | 35 | 25 | 31 | 212237 | AL |
| 1 | Montgomery | 36109.0 | 27 | 16 | 38 | 205764 | AL |
| 2 | Mobile | 36605.0 | 33 | -- | 39 | 195111 | AL |
| 3 | Huntsville | 35810.0 | 23 | 13 | 40 | 180105 | AL |
| 4 | Tuscaloosa |  | 33 | -- | 37 | 90468 | AL |
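The rows above show two quirks in the scraped values: a parenthetical note attached to the first city name, and `--` where a score is missing. The source does not describe how these were handled, so the following is only a sketch of one possible cleanup. The helper names `clean_city_name` and `parse_score` are hypothetical.

```python
import re


def clean_city_name(raw_city: str) -> str:
    """Drop a trailing parenthetical note from a scraped city name."""
    return re.sub(r"\s*\([^)]*\)\s*$", "", raw_city).strip()


def parse_score(raw_score):
    """Convert a scraped score to int, or None when it is missing (e.g. '--')."""
    text = str(raw_score).strip()
    try:
        return int(float(text))
    except (ValueError, OverflowError):
        # '--', empty strings, and NaN all count as missing
        return None


print(clean_city_name("Birmingham (the largest city in Alabama)"))  # Birmingham
print(parse_score("--"))  # None
print(parse_score("25"))  # 25
```

Applied to the scraped frame, this could look like `df['City'] = df['City'].map(clean_city_name)` and `df['Transit Score'] = df['Transit Score'].map(parse_score)`, after which the city and state columns can be combined into the `city_name ST` key.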
