Goal: compare the PBA50 BAUS 2020 output (TAZ data) with Census 2020 data (2020 Decennial and 2020 ACS 1-Year experimental data) and scale the BAUS output if needed.

Methodology (refer to the next cell):
- Fields that do not need to be modified: id_fields, land_fields.

- For the other fields, first get county totals from Census data. The "total attributes" to compare are: total population (Census Decennial), total group quarters population (Census Decennial), total housing units (ACS 1-year), and total households (ACS 1-year).

- Compare the 4 total attributes with the BAUS output county sums and calculate an adjustment ratio (Census total / BAUS total) for each. If the BAUS 2020 output is very close to the Census 2020 numbers, no adjustment is needed and the process stops here. Otherwise, continue:

- Apply the adjustment ratio of each total attribute to all the TAZs in the county. E.g., TAZ 1450 is in Marin County, where ACS county total hh (105298) / BAUS county total hh (108118) = 0.973917. BAUS TAZ 1450 total hh = 2729, so the adjusted value is 2729 * 0.973917 ≈ 2658.

- For each TAZ, adjust the sub-totals proportionally and assign the rounding error to the largest category. E.g., for TAZ 1450, the BAUS output has 'HHINCQ1' 269, 'HHINCQ2' 434, 'HHINCQ3' 589, 'HHINCQ4' 1437; multiply the first three categories by 0.973917, then calculate HHINCQ4 = adjusted TOTHH - sum(HHINCQ1, HHINCQ2, HHINCQ3). This method applies to the fields in "emp_fields", "pop_fields", "hh_fields", "housing_fields".

- For density_fields, recalculate based on the adjusted values.
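
The scaling and rounding steps above can be sketched on the TAZ 1450 household-income example. This is only an illustration under stated assumptions, not the notebook's implementation: the helper name `scale_with_residual` is hypothetical, and rounding each scaled value to the nearest integer is an assumption (the methodology does not say how values are rounded).

```python
def scale_with_residual(row, part_cols, total_col, ratio):
    """Scale a total and its sub-categories by `ratio`.

    The largest sub-category is not scaled directly; it is set to the
    adjusted total minus the other adjusted sub-categories, so it absorbs
    the rounding error and the parts still sum to the total.
    """
    out = dict(row)
    # Assumption: round to the nearest integer.
    out[total_col] = int(round(row[total_col] * ratio))
    largest = max(part_cols, key=lambda c: row[c])
    others = [c for c in part_cols if c != largest]
    for c in others:
        out[c] = int(round(row[c] * ratio))
    out[largest] = out[total_col] - sum(out[c] for c in others)
    return out


# Values for TAZ 1450 and Marin County are taken from the text above.
taz_1450 = {'HHINCQ1': 269, 'HHINCQ2': 434, 'HHINCQ3': 589, 'HHINCQ4': 1437, 'TOTHH': 2729}
hh_ratio = 105298 / 108118  # ACS county total hh / BAUS county total hh

adjusted = scale_with_residual(
    taz_1450, ['HHINCQ1', 'HHINCQ2', 'HHINCQ3', 'HHINCQ4'], 'TOTHH', hh_ratio)
print(adjusted)
```

Under these assumptions the adjusted TOTHH is 2658 and HHINCQ4 becomes 1399, so the four income categories still sum to the adjusted total.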

In [139]:
#  categorize BAUS output TAZ table fields into groups and write out the relationship among fields

                 # ID fields
id_fields     = ['TAZ', 'SD', 'ZONE', 'COUNTY', 'COUNTY_NAME', 'county', 'county_name']
                 
                 # employment: sum('AGREMPN', 'FPSEMPN', 'HEREMPN', 'RETEMPN', 'MWTEMPN', 'OTHEMPN') = 'TOTEMP'
emp_fields    = ['AGREMPN', 'FPSEMPN', 'HEREMPN', 'RETEMPN', 'MWTEMPN', 'OTHEMPN', 'TOTEMP',
                 # employed residents = total population * employed ratio?
                 'EMPRES']

                 # sum('HHPOP', 'GQPOP') = 'TOTPOP'
pop_fields    = ['HHPOP', 'GQPOP', 'TOTPOP',
                 # Share of the population age 62 or older = 'TOTPOP' * 62P_ratio
                 'SHPOP62P',
                 # age breakdown: sum ('AGE0004', 'AGE0519', 'AGE2044', 'AGE4564', 'AGE65P') = 'TOTPOP'
                 'AGE0004', 'AGE0519', 'AGE2044', 'AGE4564', 'AGE65P',
                 # gq (group quarters) breakdown: sum ('gq_type_univ', 'gq_type_mil', 'gq_type_othnon') = 'gq_tot_pop'
                 'gq_type_univ', 'gq_type_mil', 'gq_type_othnon', 'gq_tot_pop']

                 # household income breakdown: sum('HHINCQ1', 'HHINCQ2', 'HHINCQ3', 'HHINCQ4') = 'TOTHH'
hh_fields     = ['HHINCQ1', 'HHINCQ2', 'HHINCQ3', 'HHINCQ4',  'TOTHH',
                 # by hh size: sum('hh_size_1', 'hh_size_2', 'hh_size_3', 'hh_size_4_plus') = 'TOTHH'
                 'hh_size_1', 'hh_size_2', 'hh_size_3', 'hh_size_4_plus',
                 # by worker count: sum('hh_wrks_0', 'hh_wrks_1', 'hh_wrks_2', 'hh_wrks_3_plus') = 'TOTHH'
                 'hh_wrks_0', 'hh_wrks_1', 'hh_wrks_2', 'hh_wrks_3_plus',
                 # by with kids or not: sum('hh_kids_no', 'hh_kids_yes') = 'TOTHH'
                 'hh_kids_no', 'hh_kids_yes']

                  # housing units: sum('MFDU', 'SFDU') = 'RES_UNITS'
housing_fields = ['RES_UNITS', 'MFDU', 'SFDU']

land_fields    = ['TOTACRE', 'RESACRE_UNWEIGHTED', 'CIACRE_UNWEIGHTED', 'CIACRE', 'RESACRE']

                  # Area type designation, no need to update
density_fields = ['AREATYPE',
                  # density = tot pop or emp / acreage
                  'DENSITY_POP', 'DENSITY_EMP', 'DENSITY']
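
The sum relationships written in the comments above can be checked mechanically before and after any adjustment. The sketch below is a hedged illustration: `SUM_RULES` and `find_sum_mismatches` are hypothetical names, only two of the relationships are listed, and the toy table uses made-up numbers rather than BAUS output.

```python
import pandas as pd

# Hypothetical mapping: total field -> components that should sum to it.
# Only two of the relationships from the comments above are shown.
SUM_RULES = {
    'TOTPOP': ['HHPOP', 'GQPOP'],
    'RES_UNITS': ['MFDU', 'SFDU'],
}


def find_sum_mismatches(df, rules):
    """Count, per total field, the rows where the components do not sum to the total."""
    return {
        total: int((df[parts].sum(axis=1) != df[total]).sum())
        for total, parts in rules.items()
    }


# Toy data (made-up numbers); the second row has an inconsistent TOTPOP.
toy_taz = pd.DataFrame({
    'TAZ': [1, 2],
    'HHPOP': [100, 200], 'GQPOP': [5, 0], 'TOTPOP': [105, 201],
    'MFDU': [20, 10], 'SFDU': [30, 40], 'RES_UNITS': [50, 50],
})

mismatches = find_sum_mismatches(toy_taz, SUM_RULES)
print(mismatches)
```

Running the same check on the adjusted table is one way to confirm that the residual step kept every breakdown consistent with its total.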

In [140]:
import os
import pandas as pd

In [141]:
# inputs

# BAUS output
BOX_DIR = 'C:\\Users\\{}\\Box\\Modeling and Surveys'.format(os.getenv('USERNAME'))
BAUS_PBA50_FBP_DIR = os.path.join(BOX_DIR, 'Urban Modeling', 'Bay Area UrbanSim', 'PBA50', 'Final Blueprint runs',
                                  'Final Blueprint (s24)', 'BAUS v2.25 - FINAL VERSION')
BAUS_2020_TAZ_FILE = os.path.join(BAUS_PBA50_FBP_DIR, 'run182_taz_summaries_2020.csv')

# Census 2020 decennial data
L_DIR = 'L:\\Application\\Model_One\\TransitRecovery\\land_use_preprocessing'
CENSUS_INPUT_DIR = os.path.join(L_DIR, 'census_raw_data')
DEC_P1_FILE = os.path.join(CENSUS_INPUT_DIR, 'DECENNIALPL2020.P1-2022-05-06T201441.csv') # P1: total pop by race, will use 'total population'
DEC_P5_FILE = os.path.join(CENSUS_INPUT_DIR, 'DECENNIALPL2020.P5-2022-05-06T201358.csv') # P5: group quarters pop by major group quarters type (use 'total group quarter pop')
# Census 2020 ACS 1 Year data
ACS_DP05_FILE = os.path.join(CENSUS_INPUT_DIR, 'ACSDP1Y2019.DP05-2022-05-06T202628.csv') # DP05: selected demographic characteristics, will use 'population by age' and 'total housing units'
ACS_DP04_FILE = os.path.join(CENSUS_INPUT_DIR, 'ACSDP1Y2019.DP04-2022-05-06T202641.csv') # DP04: selected housing characteristics, will use "housing by tenure"
ACS_DP02_FILE = os.path.join(CENSUS_INPUT_DIR, 'ACSDP1Y2019.DP02-2022-05-06T202705.csv') # DP02: selected social characteristics, will use 'total households'

# Esri business data for employment

## get the following data from Census: total population, total group quarters population, total households, total housing units

In [143]:
# 1. total population from Decennial P1 table
tot_pop_dec_raw = pd.read_csv(DEC_P1_FILE)
# only keep the total pop data and transpose the table so that each row represents one county
tot_pop_dec_raw.set_index('Label (Grouping)', inplace=True)
tot_pop_dec = tot_pop_dec_raw.loc[
    tot_pop_dec_raw.index == 'Total:'].transpose().rename(columns={'Total:': 'TOTPOP_dec'}).reset_index()
display(tot_pop_dec)

Label (Grouping),index,TOTPOP_dec
0,"Alameda County, California",1682353
1,"Contra Costa County, California",1165927
2,"Marin County, California",262321
3,"Napa County, California",138019
4,"San Francisco County, California",873965
5,"San Mateo County, California",764442
6,"Santa Clara County, California",1936259
7,"Solano County, California",453491
8,"Sonoma County, California",488863


In [144]:
# 2. total group quarters (gq) pop from Decennial P5 table
tot_gp_pop_dec_raw = pd.read_csv(DEC_P5_FILE)
# only keep the total gq pop and transpose the table so that each row represents one county
tot_gp_pop_dec_raw.set_index('Label (Grouping)', inplace=True)
tot_gp_pop_dec = tot_gp_pop_dec_raw.loc[
    tot_gp_pop_dec_raw.index == 'Total:'].transpose().rename(columns={'Total:': 'GQPOP_dec'}).reset_index()
display(tot_gp_pop_dec)

Label (Grouping),index,GQPOP_dec
0,"Alameda County, California",53833
1,"Contra Costa County, California",11255
2,"Marin County, California",7743
3,"Napa County, California",5172
4,"San Francisco County, California",27892
5,"San Mateo County, California",9352
6,"Santa Clara County, California",39607
7,"Solano County, California",11137
8,"Sonoma County, California",8866


In [145]:
# 3. total households from ACS 1-year DP02 table
tot_hh_acs_raw = pd.read_csv(ACS_DP02_FILE)
# only keep columns with 'Estimate', drop columns for 'Margin of Error'
estimate_cols = [col for col in tot_hh_acs_raw.columns if 'Estimate' in col]
tot_hh_acs = tot_hh_acs_raw.loc[:, ['Label (Grouping)'] + estimate_cols]
# trim the leading space of the attributes
tot_hh_acs.loc[:, 'Label (Grouping)'] = tot_hh_acs['Label (Grouping)'].apply(lambda x: x.strip())
tot_hh_acs.set_index('Label (Grouping)', inplace=True)
# drop_duplicates() because the table contains multiple breakdowns, with the same total hh number
tot_hh_acs = tot_hh_acs.loc[tot_hh_acs.index == 'Total households'].drop_duplicates().transpose().rename(
    columns={'Total households': 'TOTHH_acs'}).reset_index()
# remove "!!Estimate" from county names
tot_hh_acs.loc[:, 'index'] = tot_hh_acs['index'].apply(lambda x: x.replace('!!Estimate', ''))
display(tot_hh_acs)

Label (Grouping),index,TOTHH_acs
0,"Alameda County, California",585632
1,"Contra Costa County, California",399792
2,"Marin County, California",105298
3,"Napa County, California",48107
4,"San Francisco County, California",365851
5,"San Mateo County, California",265003
6,"Santa Clara County, California",643637
7,"Solano County, California",150393
8,"Sonoma County, California",190689


In [146]:
# 4. total housing units from ACS 1-year DP04 table
tot_unit_acs_raw = pd.read_csv(ACS_DP04_FILE)
# only keep columns with 'Estimate', drop columns for 'Margin of Error'
estimate_cols = [col for col in tot_unit_acs_raw.columns if 'Estimate' in col]
tot_unit_acs = tot_unit_acs_raw.loc[:, ['Label (Grouping)'] + estimate_cols]
# trim the leading space of the attributes
tot_unit_acs.loc[:, 'Label (Grouping)'] = tot_unit_acs['Label (Grouping)'].apply(lambda x: x.strip())
tot_unit_acs.set_index('Label (Grouping)', inplace=True)
# drop_duplicates() because the table contains multiple breakdowns, with the same total housing units number
tot_unit_acs = tot_unit_acs.loc[tot_unit_acs.index == 'Total housing units'].drop_duplicates().transpose().rename(
    columns={'Total housing units': 'RES_UNITS_acs'}).reset_index()
# remove "!!Estimate" from county names
tot_unit_acs.loc[:, 'index'] = tot_unit_acs['index'].apply(lambda x: x.replace('!!Estimate', ''))
display(tot_unit_acs)

Label (Grouping),index,RES_UNITS_acs
0,"Alameda County, California",622957
1,"Contra Costa County, California",418696
2,"Marin County, California",113345
3,"Napa County, California",55659
4,"San Francisco County, California",406399
5,"San Mateo County, California",280500
6,"Santa Clara County, California",686306
7,"Solano County, California",159804
8,"Sonoma County, California",208293
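
The two ACS cells above follow the same pattern (keep the 'Estimate' columns, strip the labels, pick one row, de-duplicate, transpose), so it could be wrapped in a small helper. The sketch below is an illustration only: `extract_county_total` is a hypothetical name, the toy table imitates the layout the cells above assume, and the margin-of-error entries are placeholders, not Census values.

```python
import pandas as pd


def extract_county_total(raw, label, out_col):
    """Return one row per county with the 'Estimate' value of the row named `label`."""
    est_cols = [c for c in raw.columns if 'Estimate' in c]
    df = raw[['Label (Grouping)'] + est_cols].copy()
    # Labels are indented in the export, so strip surrounding whitespace.
    df['Label (Grouping)'] = df['Label (Grouping)'].str.strip()
    # The same total can appear under several breakdowns; keep one copy.
    row = df.loc[df['Label (Grouping)'] == label, est_cols].drop_duplicates()
    if len(row) != 1:
        raise ValueError('expected exactly one distinct row for {!r}'.format(label))
    out = row.T.reset_index()
    out.columns = ['index', out_col]
    out['index'] = out['index'].str.replace('!!Estimate', '', regex=False)
    return out


# Toy table in the assumed layout; 'n/a' entries are placeholders.
toy_raw = pd.DataFrame({
    'Label (Grouping)': ['HOUSING OCCUPANCY', '    Total housing units',
                         'UNITS IN STRUCTURE', '    Total housing units'],
    'Marin County, California!!Estimate': [None, '113,345', None, '113,345'],
    'Marin County, California!!Margin of Error': [None, 'n/a', None, 'n/a'],
    'Napa County, California!!Estimate': [None, '55,659', None, '55,659'],
})

toy_units = extract_county_total(toy_raw, 'Total housing units', 'RES_UNITS_acs')
print(toy_units)
```

The values come back as comma-formatted strings, as in the cells above, so they still need converting to numbers afterwards.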


In [148]:
# combine census county-level total metrics and modify county names to be consistent with modeling convention
census_tots_county = tot_pop_dec.merge(
                     tot_gp_pop_dec, on='index', how='outer').merge(
                     tot_hh_acs, on='index', how='outer').merge(
                     tot_unit_acs, on='index', how='outer')
census_tots_county.loc[:, 'COUNTY_NAME'] = census_tots_county['index'].apply(lambda x: x.replace(' County, California', ''))
census_tots_county.drop(columns='index', inplace=True)
# convert value fields to numeric
for col_name in ['TOTPOP_dec', 'GQPOP_dec', 'TOTHH_acs', 'RES_UNITS_acs']:
    census_tots_county[col_name] = census_tots_county[col_name].astype(str).str.replace(',', '', regex=False).astype(int)
display(census_tots_county)

Label (Grouping),TOTPOP_dec,GQPOP_dec,TOTHH_acs,RES_UNITS_acs,COUNTY_NAME
0,1682353,53833,585632,622957,Alameda
1,1165927,11255,399792,418696,Contra Costa
2,262321,7743,105298,113345,Marin
3,138019,5172,48107,55659,Napa
4,873965,27892,365851,406399,San Francisco
5,764442,9352,265003,280500,San Mateo
6,1936259,39607,643637,686306,Santa Clara
7,453491,11137,150393,159804,Solano
8,488863,8866,190689,208293,Sonoma


## read PBA50 county-level total population, group quarters population, total households, total housing units, and calculate county-level Census-to-BAUS adjustment ratios

In [150]:
# PBA50 BAUS 2020 output
baus_taz = pd.read_csv(BAUS_2020_TAZ_FILE)
taz_fields = list(baus_taz)
print('read {} rows of BAUS output taz data, with the following fields: {}'.format(baus_taz.shape[0], taz_fields))

read 1454 rows of BAUS output taz data, with the following fields: ['TAZ', 'SD', 'ZONE', 'COUNTY', 'COUNTY_NAME', 'AGREMPN', 'FPSEMPN', 'HEREMPN', 'RETEMPN', 'MWTEMPN', 'OTHEMPN', 'TOTEMP', 'HHINCQ1', 'HHINCQ2', 'HHINCQ3', 'HHINCQ4', 'HHPOP', 'TOTHH', 'SHPOP62P', 'GQPOP', 'TOTACRE', 'TOTPOP', 'RES_UNITS', 'MFDU', 'SFDU', 'RESACRE_UNWEIGHTED', 'CIACRE_UNWEIGHTED', 'CIACRE', 'RESACRE', 'EMPRES', 'DENSITY_POP', 'DENSITY_EMP', 'DENSITY', 'AREATYPE', 'AGE0004', 'AGE0519', 'AGE2044', 'AGE4564', 'AGE65P', 'gq_type_univ', 'gq_type_mil', 'gq_type_othnon', 'gq_tot_pop', 'hh_size_1', 'hh_size_2', 'hh_size_3', 'hh_size_4_plus', 'county', 'county_name', 'hh_wrks_0', 'hh_wrks_1', 'hh_wrks_2', 'hh_wrks_3_plus', 'hh_kids_no', 'hh_kids_yes']


In [151]:
# county-level sums of the same fields
baus_tots_county = baus_taz.groupby('COUNTY_NAME')[['TOTPOP', 'GQPOP', 'TOTHH', 'RES_UNITS']].sum().reset_index()

In [152]:
# merge with census data
baus_census_tots_county = baus_tots_county.merge(census_tots_county, on='COUNTY_NAME', how='outer')
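
Because the merge is an outer join on county names, a spelling difference between the two sources would produce extra rows with missing values instead of an error. One way to catch this is the `indicator` option of `DataFrame.merge`. The sketch below uses made-up toy frames (not the notebook's data), with one name deliberately misspelled.

```python
import pandas as pd

# Toy frames with made-up numbers; 'San  Mateo' has a double space on purpose.
toy_baus = pd.DataFrame({'COUNTY_NAME': ['Marin', 'San Mateo'], 'TOTHH': [10, 20]})
toy_census = pd.DataFrame({'COUNTY_NAME': ['Marin', 'San  Mateo'], 'TOTHH_acs': [11, 19]})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'.
toy_merged = toy_baus.merge(toy_census, on='COUNTY_NAME', how='outer', indicator=True)
unmatched = toy_merged.loc[toy_merged['_merge'] != 'both', ['COUNTY_NAME', '_merge']]
print(unmatched)
```

An empty `unmatched` table would indicate that every county name was found on both sides.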

In [154]:
# calculate adjustment ratios
baus_census_tots_county.loc[:, 'TOTPOP_ratio'] = baus_census_tots_county['TOTPOP_dec'] / baus_census_tots_county['TOTPOP']
baus_census_tots_county.loc[:, 'GQPOP_ratio'] = baus_census_tots_county['GQPOP_dec'] / baus_census_tots_county['GQPOP']
baus_census_tots_county.loc[:, 'TOTHH_ratio'] = baus_census_tots_county['TOTHH_acs'] / baus_census_tots_county['TOTHH']
baus_census_tots_county.loc[:, 'RES_UNITS_ratio'] = baus_census_tots_county['RES_UNITS_acs'] / baus_census_tots_county['RES_UNITS']
print(baus_census_tots_county[['TOTPOP_ratio', 'GQPOP_ratio', 'TOTHH_ratio', 'RES_UNITS_ratio']])

   TOTPOP_ratio  GQPOP_ratio  TOTHH_ratio  RES_UNITS_ratio
0      1.010646     1.526095     1.019235         1.002349
1      1.018940     1.121910     1.030057         1.000117
2      0.963074     1.774696     0.973917         1.000627
3      0.923050     1.339549     0.938362         0.954585
4      0.921860     1.115368     0.941447         0.969949
5      0.963413     1.038649     0.977535         0.977464
6      0.964673     1.319969     0.988550         0.980909
7      1.031871     0.850088     1.051471         1.005911
8      0.947839     0.879476     1.010375         0.985951
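
To turn "very close" into a concrete test, the ratios can be compared against a tolerance. The sketch below is an illustration only: the 1% `TOLERANCE` is an assumed threshold that the notebook does not specify, and the ratios are copied from rows 0 and 2 of the output above (row 2 matches the Marin County household ratio used in the methodology).

```python
import pandas as pd

TOLERANCE = 0.01  # assumption: treat ratios within 1% of 1.0 as "very close"

# Ratios copied from rows 0 and 2 of the printed output above.
ratios = pd.DataFrame(
    {
        'TOTPOP_ratio': [1.010646, 0.963074],
        'GQPOP_ratio': [1.526095, 1.774696],
        'TOTHH_ratio': [1.019235, 0.973917],
        'RES_UNITS_ratio': [1.002349, 1.000627],
    },
    index=[0, 2],
)

# True where the BAUS county total differs from Census by more than the tolerance.
needs_adjustment = (ratios - 1).abs() > TOLERANCE
print(needs_adjustment)
```

With this assumed threshold, only the housing-unit totals in these two rows fall within tolerance; population, group quarters population, and households would all be adjusted.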


## employment data

In [98]:
# raw string so the backslashes in the Windows path are not treated as escape sequences
emp = pd.read_csv(r'M:\Data\BusinessData\Businesses_2020_BayArea_wcountyTAZ.csv')

In [101]:
print(list(emp))

['OBJECTID', 'Join_Count', 'TARGET_FID', 'Join_Cou_1', 'TARGET_F_1', 'LOCNUM', 'CONAME', 'STREET', 'CITY', 'STATE', 'STATE_NAME', 'ZIP', 'ZIP4', 'NAICS', 'SIC', 'SALESVOL', 'HDBRCH', 'ULTNUM', 'PUBPRV', 'EMPNUM', 'FRNCOD', 'ISCODE', 'SQFTCODE', 'LOC_NAME', 'STATUS', 'SCORE', 'SOURCE', 'REC_TYPE', 'POINT_X', 'POINT_Y', 'COUNTYNAME', 'SUPERD', 'TAZ1454']
