# US Census Data API/Data Collection
## ACS Survey 5

The ACS 5-year survey (ACS5) includes social, economic, demographic, and housing data all the way down to the zip code level. We pull the 2018 through 2022 releases and use zip code level data for our analysis. We can then split zip codes into urban/suburban/rural groups and into household income buckets. We can also summarize across the state.

In [1]:
# Dependencies
import requests
import numpy as np
import pandas as pd
from census import Census
from us import states
from config import census_api_key

## Data Requirements

So, what data will we need to collect? 

For all zip codes in California, for each of the five years we are interested in:
-        What is the population by zip code?
-        What is the geographic area by zip code?
-        What is the household income by zip code?

Those are the basics. It may be worth exploring whether other data would help answer further questions, but this should be enough to cover the questions already identified in our project. We can combine these data with the car registration by fuel type data from the California government website, as well as the California energy department's information on energy production.

At present, I don't have geographic area by zip code; as far as I can tell, Census does not publish this. They do publish population density, but not by zip code, so we may need to work at the city level instead.



In [2]:
# Create an instance of the Census library for each year
# Run a Census search to retrieve data on all zip codes (2018 to 2022 ACS5, the latest five years)
# We'll need a for loop to go through the years from 2018 to 2022
# range(2018, 2023) stops just before 2023 because the stop value is excluded

multi_census_pd = pd.DataFrame()

for year in range(2018, 2023):
    c = Census(
        census_api_key,
        year = year
    )
    census_data = c.acs5.get(
        (
            "NAME",
            "B19013_001E",
            "B01003_001E",
        ),
        {'for': 'zip code tabulation area:*'}
    )

    # Convert to DataFrame
    census_pd = pd.DataFrame(census_data)

    # Column renaming
    census_pd = census_pd.rename(
        columns = {
            "NAME": "Name",
            "B19013_001E": "Household Income",
            "B01003_001E": "Population",
            "zip code tabulation area": "Zipcode"
        }
    )

    # Configure the final DataFrame (basically, just drop the "Name" column)
    census_pd = census_pd[
        [
            "Zipcode",
            "Population",
            "Household Income",
        ]
    ].copy()
    census_pd['Year'] = year

    multi_census_pd = pd.concat([census_pd, multi_census_pd], axis=0)
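
The year loop can also collect each year's frame in a list and concatenate once at the end, which avoids re-copying the accumulated frame on every pass. The sketch below is only an illustration of that pattern, not the code that produced our data: `fetch_year` is a hypothetical stand-in for `c.acs5.get(...)`, and the rows it returns are made up so the cell runs without an API key.

```python
import pandas as pd

def fetch_year(year):
    """Hypothetical stand-in for c.acs5.get(...); the values are made up."""
    return [
        {"NAME": "ZCTA5 90001", "B19013_001E": 50000.0,
         "B01003_001E": 1000.0, "zip code tabulation area": "90001"},
        {"NAME": "ZCTA5 90002", "B19013_001E": 45000.0,
         "B01003_001E": 2000.0, "zip code tabulation area": "90002"},
    ]

frames = []
# range(2018, 2023) yields 2018 through 2022; the stop value is excluded
for year in range(2018, 2023):
    yearly = pd.DataFrame(fetch_year(year)).rename(
        columns={
            "B19013_001E": "Household Income",
            "B01003_001E": "Population",
            "zip code tabulation area": "Zipcode",
        }
    )
    # .copy() makes the column selection an independent frame before adding Year
    yearly = yearly[["Zipcode", "Population", "Household Income"]].copy()
    yearly["Year"] = year
    frames.append(yearly)

# One concat at the end instead of one per loop iteration
combined = pd.concat(frames, ignore_index=True)
```

With two made-up zip codes and five years, `combined` ends up with ten rows and a `Year` column covering 2018 through 2022.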

# Filters/Slices

I will need to filter this dataset down to just California zip codes and add each zip code's city, longitude, and latitude, in case we want to plot the data on a map or group by city.
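
The merge-then-filter step can be sketched on toy data. The column names `zipcode`, `state_abbr`, and `city` follow the lookup file used in this notebook, but the rows here are made up. One thing worth watching is that both merge keys share a dtype; casting both to `str` is an assumption about how the lookup file may need to be handled, not something this notebook confirms.

```python
import pandas as pd

# Toy census rows (made-up populations)
census = pd.DataFrame({
    "Zipcode": ["90001", "10001", "94105"],
    "Population": [1000, 2000, 3000],
})

# Toy lookup rows shaped like the zip/state/city reference file
lookup = pd.DataFrame({
    "zipcode": ["90001", "10001", "94105"],
    "state_abbr": ["CA", "NY", "CA"],
    "city": ["Los Angeles", "New York", "San Francisco"],
})

# Merge keys must share a dtype, so cast both sides to str first
census["Zipcode"] = census["Zipcode"].astype(str)
lookup["zipcode"] = lookup["zipcode"].astype(str)

merged = census.merge(lookup, how="inner", left_on="Zipcode", right_on="zipcode")

# Keep only California and drop the duplicate key column
california = merged[merged["state_abbr"] == "CA"].drop(columns="zipcode")
```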

In [3]:
# In order to filter by state, we can use a CSV file available from the US Census containing all
# states, cities, zips, and state abbreviations.

zip_state_pd = pd.read_csv("Resources/geo-data.csv")

# Now we merge our two dataframes to attach the state and city to each zip code
# (the filter down to California happens in the next cell)
multi_census_pd = multi_census_pd.merge(zip_state_pd, how='inner', left_on='Zipcode', right_on='zipcode')



In [4]:
# OK, now let's do some cleanup and drop states we don't need...
multi_census_pd.drop(columns='zipcode', inplace=True)
multi_census_pd = multi_census_pd[multi_census_pd['state_abbr'] == 'CA']

# Data Cleansing and Output

I will need to cleanse the data a bit so that I get rid of NaN's (yes, they exist) and negative values, and maybe some other things.
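
The negative values are most likely placeholders rather than real incomes: to my understanding, the ACS API returns large negative sentinel values such as -666666666 when an estimate could not be computed. A minimal sketch of treating them as missing, using made-up rows:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Zipcode": ["90001", "90002", "90003"],
    "Household Income": [52000.0, -666666666.0, np.nan],
})

cleaned = df.copy()
# Treat any negative income as a placeholder and convert it to NaN
cleaned.loc[cleaned["Household Income"] < 0, "Household Income"] = np.nan

# Rows with a usable income estimate
valid = cleaned.dropna(subset=["Household Income"])
```

Converting to NaN first keeps the placeholder rows visible for inspection before they are dropped.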


In [5]:
# We need to drop the following rows:
#     1) Any rows where the population is zero (to reduce risk of dividing by zero)
#     2) Any rows where Household Income is zero, negative, or missing
#     3) Any rows missing cities (in case we end up aggregating by city)
#     4) Any rows missing counties (in case we end up aggregating by county)
#     5) Any rows where the city is just a ZCTA (zip code)
#     6) Any rows for zip codes with fewer than 5 rows, so we don't have mismatched data by year

multi_census_pd = multi_census_pd[multi_census_pd['Population']!= 0]
multi_census_pd = multi_census_pd[multi_census_pd['Household Income']> 0]
multi_census_pd = multi_census_pd[multi_census_pd['city'].notna()]
multi_census_pd = multi_census_pd[multi_census_pd['county'].notna()]
multi_census_pd = multi_census_pd[~multi_census_pd['city'].str.startswith('Zcta')]
# To do number 6 above, we first count the remaining rows for each zip code
# and keep only the zip codes with exactly five rows (one per year)
keepzip = multi_census_pd.groupby('Zipcode').count()
keepzip = keepzip.loc[keepzip['city'] == 5]
# Now we can delete any rows from multi_census_pd
# where the Zipcode is not in the 'keepzip' index
multi_census_pd = multi_census_pd[multi_census_pd['Zipcode'].isin(keepzip.index)]

# df.groupby('key1')['key2'].apply(lambda x: (x=='one').sum()).reset_index(name='count')
# multi_census_pd = multi_census_pd[multi_census_pd['Zipcode'].isin()
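
The "complete five years" filter can also be written without the intermediate `keepzip` frame. This is an alternative sketch on made-up rows, not the cell that produced our output; it counts distinct years per zip code with `groupby(...).transform("nunique")` and keeps the zip codes that have all five.

```python
import pandas as pd

df = pd.DataFrame({
    "Zipcode": ["90001"] * 5 + ["90002"] * 3,
    "Year": [2018, 2019, 2020, 2021, 2022, 2018, 2019, 2020],
})

expected_years = 5  # one row per year from 2018 through 2022

# Number of distinct years seen for each row's zip code, aligned to df's index
years_per_zip = df.groupby("Zipcode")["Year"].transform("nunique")

# Keep only zip codes that have data for every year
complete = df[years_per_zip == expected_years]
```

Counting distinct years rather than rows also guards against a zip code that appears twice in the same year.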


In [13]:
# Now we can write out our data file for incorporation into the master Pandas notebook
multi_census_pd.to_csv("Resources/census_data.csv", header=True, index=False)

# Alternate Data Sets
Not all of the different datasets are available by zip code, so I need to prepare a couple of additional files:
- Census data by county
- A cross reference from zip code to county

In [7]:
# Create an instance of the Census library for each year
# Run a Census search to retrieve data on all California counties (2018 to 2022 ACS5, the latest five years)
# We'll need a for loop to go through the years from 2018 to 2022
# range(2018, 2023) stops just before 2023 because the stop value is excluded
# We'll start by creating an empty dataframe that will hold the data for all years...

multi_census_county = pd.DataFrame()

for year in range(2018, 2023):
    c = Census(
        census_api_key,
        year = year
    )
    census_data = c.acs5.get(
        (
            "NAME",
            "B19013_001E",
            "B01003_001E"
        ),
        {'for': 'county:*', 'in': 'state:06'}
    )


    # Convert to DataFrame
    census_pd = pd.DataFrame(census_data)

    # Column renaming
    census_pd = census_pd.rename(
        columns = {
            "NAME": "Name",
            "B19013_001E": "Household Income",
            "B01003_001E": "Population",
            "county": "County FIPS"
        }
    )


    
    # Configure the final DataFrame (basically, just drop the "Name" column)
    census_pd = census_pd[
        [
            "County FIPS",
            "Population",
            "Household Income",
        ]
    ].copy()
    census_pd['Year'] = year

    multi_census_county = pd.concat([census_pd, multi_census_county], axis=0)


In [8]:
# But our county FIPS does not tell us the name of the county.
# We can pull in a CSV with a FIPS lookup value for California taken from US Census SF1

# We need to make sure the FIPS column isn't interpreted as a number, so we specify the dtype
county_name_pd = pd.read_csv('Resources/census_county_name_fips.csv', dtype={
    'Name': 'string',
    'FIPS': 'string'
})
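
To see why the dtype matters, here is a small sketch that reads the same two-row CSV text both ways. The inline CSV is a stand-in for the real lookup file, using the `Name` and `FIPS` columns from above.

```python
import io
import pandas as pd

# Two-row stand-in for the county lookup file
csv_text = "Name,FIPS\nAlameda,001\nAlpine,003\n"

# Default parsing turns the codes into integers and loses the leading zeros
as_number = pd.read_csv(io.StringIO(csv_text))

# Forcing a string dtype keeps the codes exactly as written
as_text = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"Name": "string", "FIPS": "string"},
)
```

With default parsing, `001` becomes the integer `1` and would no longer match the three-character county codes in the census data.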

In [9]:
# So, now we will join our county name cross reference to our dataset...
multi_census_county = multi_census_county.merge(county_name_pd,
                                                how='inner',
                                                left_on = 'County FIPS',
                                                right_on = 'FIPS'
                                               )
# Then, we will do a little cleanup--renaming columns, dropping the FIPS columns, and reordering...
multi_census_county = multi_census_county.rename(columns={"Name": "County"})
multi_census_county.drop('County FIPS', axis=1, inplace=True)
multi_census_county.drop('FIPS', axis=1, inplace=True)
multi_census_county = multi_census_county[['County', 'Year', 'Population', 'Household Income']]

In [16]:
# Now we can write our data set to a CSV file for any of the collaborators to use...
multi_census_county.to_csv("Resources/census_data_by_county.csv", header=True, index=False)

In [11]:
# We still need to make a file containing the zip codes and the counties
# so that we can summarize by county in situations where there is no zipcode level information

zipcode_county_pd = multi_census_pd.loc[multi_census_pd['Year'] == 2022]
zipcode_county_pd = zipcode_county_pd[
    [
        "Zipcode",
        "county"
    ]
]

In [12]:
# And now we will write out our zip/county cross reference file for anyone who needs it...
zipcode_county_pd.to_csv("Resources/zip_to_county.csv", header=True, index=False)
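
As a rough sketch of how a collaborator might use the cross reference, the example below rolls a zip-level count up to county level. The `Registrations` column and all values are hypothetical; only the `Zipcode` and `county` column names come from the file written above.

```python
import pandas as pd

# Shaped like the zip/county cross reference (made-up rows)
zip_to_county = pd.DataFrame({
    "Zipcode": ["90001", "90002", "94105"],
    "county": ["Los Angeles", "Los Angeles", "San Francisco"],
})

# Hypothetical zip-level counts from another data set
zip_values = pd.DataFrame({
    "Zipcode": ["90001", "90002", "94105"],
    "Registrations": [10, 20, 5],
})

# Attach the county to each zip code, then total the counts per county
by_county = (
    zip_values.merge(zip_to_county, on="Zipcode", how="left")
    .groupby("county", as_index=False)["Registrations"]
    .sum()
)
```

Summing works for counts such as population or registrations. Medians such as household income cannot be rolled up this way, which is one reason the county-level census file is pulled separately.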