Goal: compare PBA50 BAUS 2020 TAZ output ('run182_taz_summaries_2020.csv') with other data sources, and scale taz_summaries if needed.

Reference data sources:
- population: Census 2020 Decennial
- household: Census 2020 PUMS 1 Year
- housing units: Census 2020 PUMS 1 Year
- employment: ESRI Business Analyst, Census 2020 PUMS 1 Year

Methodology (refer to the next cell):
- fields that do not need to be modified: id_fields, land_fields.

- pop_fields, hh_fields, housing_fields:

    - first get totals from Census data by county. "Total attributes" to compare: total population (Census Decennial), total group quarters population (Census Decennial), total housing units (ACS 1-year), total households (ACS 1-year).

    - compare the four total attributes with the BAUS output county sums and calculate an adjustment ratio (Census / BAUS) for each. If the BAUS 2020 output is very close to the Census 2020 numbers, no adjustment is needed. Otherwise, continue:

    - apply the adjustment ratio of each total attribute to all the TAZs in the county. E.g., for TAZ 1450 in Marin County, ACS county TOTHH (105298) / BAUS county TOTHH (108118) = 0.973917. BAUS TAZ 1450 has TOTHH = 2729, so the adjusted value is 2729 * 0.973917 ≈ 2658. Within each county, resolve the rounding error by adjusting the TAZ with the largest pop/hh/housing count.

    - for each TAZ, adjust sub-totals proportionally, and assign the rounding error to the largest category. E.g., for TAZ 1450, BAUS output has 'HHINCQ1' 269, 'HHINCQ2' 434, 'HHINCQ3' 589, 'HHINCQ4' 1437; adjust the first three categories by * 0.973917, and calculate HHINCQ4 = TOTHH - sum(HHINCQ1, HHINCQ2, HHINCQ3).

- emp_fields:

    - get TAZ level total employment from ESRI Business Analyst (running script https://github.com/BayAreaMetro/petrale/blob/main/applications/travel_model_lu_inputs/2015/Employment/summarize_BusinessData_by_TAZ_industry.R), and summarize to county-level total employment.
    
    - compare county-level total employment from ESRI with BAUS. If they are close, no adjustment is needed. If the discrepancies are large, continue with the following adjustment:

    - for total employment, apply county-level adjustment ratio to all TAZs within each county. Within each county, resolve rounding error by adjusting the TAZ with the largest employment.

    - for each TAZ, adjust sub-totals by employment category, and resolve the rounding errors to the largest category.


- empres_fields: 

    - get county-level total employed residents from PUMS persons file, based on "ESR".

    - compare PUMS data with BAUS county-level 'EMPRES'. If they are close, no adjustment is needed. If the discrepancies are large, continue with the following adjustment:

    - apply county-level adjustment ratio to all TAZs within each county. Within each county, resolve rounding error by adjusting the TAZ with the largest 'EMPRES'.

- density_fields: recalculate based on adjusted values.
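
As a worked check of the Marin County example above, the following minimal sketch reproduces the arithmetic in plain Python. The county totals and TAZ 1450 values are the ones quoted in the methodology; the variable names are illustrative only and do not appear elsewhere in this notebook.

```python
# County-level household totals quoted in the methodology (Marin County).
acs_county_tothh = 105298
baus_county_tothh = 108118
ratio = acs_county_tothh / baus_county_tothh  # about 0.973917

# BAUS values for TAZ 1450.
taz_tothh = 2729
taz_income = {'HHINCQ1': 269, 'HHINCQ2': 434, 'HHINCQ3': 589, 'HHINCQ4': 1437}

# Scale the TAZ total and round to a whole number of households.
scaled_tothh = int(round(taz_tothh * ratio))

# Scale the first three income quartiles, then let HHINCQ4 (the largest
# category in this TAZ) absorb the rounding residual so the parts sum to the total.
scaled_income = {k: int(round(v * ratio)) for k, v in taz_income.items() if k != 'HHINCQ4'}
scaled_income['HHINCQ4'] = scaled_tothh - sum(scaled_income.values())

print(round(ratio, 6), scaled_tothh, scaled_income)
```

Scaling HHINCQ4 directly would give 1400 households, one more than the residual value of 1399, which is why the largest category is derived by subtraction rather than scaled.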

In [1]:
#  categorize BAUS output TAZ table fields into groups and write out the relationship among fields

                  # ID fields
id_fields      = ['TAZ', 'SD', 'ZONE', 'COUNTY', 'COUNTY_NAME', 'county', 'county_name']
                 
                  # employment: sum('AGREMPN', 'FPSEMPN', 'HEREMPN', 'RETEMPN', 'MWTEMPN', 'OTHEMPN') = 'TOTEMP'
emp_fields     = ['AGREMPN', 'FPSEMPN', 'HEREMPN', 'RETEMPN', 'MWTEMPN', 'OTHEMPN', 'TOTEMP']

empres_fields = [# employed residents = total population * resident employed ratio?
                 'EMPRES']

                  # sum('HHPOP', 'GQPOP') = 'TOTPOP'
pop_fields     = ['HHPOP', 'GQPOP', 'TOTPOP',
                  # Share of the population age 62 or older = 'TOTPOP' * 62P_ratio
                  'SHPOP62P',
                  # age breakdown: sum ('AGE0004', 'AGE0519', 'AGE2044', 'AGE4564', 'AGE65P') = 'TOTPOP'
                  'AGE0004', 'AGE0519', 'AGE2044', 'AGE4564', 'AGE65P',
                  # gq breakdown: sum ('gq_type_univ', 'gq_type_mil', 'gq_type_othnon') = 'gq_tot_pop'
                  'gq_type_univ', 'gq_type_mil', 'gq_type_othnon', 'gq_tot_pop'
]

                  # household income breakdown: sum('HHINCQ1', 'HHINCQ2', 'HHINCQ3', 'HHINCQ4') = 'TOTHH'
hh_fields      = ['HHINCQ1', 'HHINCQ2', 'HHINCQ3', 'HHINCQ4',  'TOTHH',
                  # by hh size: sum('hh_size_1', 'hh_size_2', 'hh_size_3', 'hh_size_4_plus') = 'TOTHH'
                  'hh_size_1', 'hh_size_2', 'hh_size_3', 'hh_size_4_plus',
                  # by worker count: sum('hh_wrks_0', 'hh_wrks_1', 'hh_wrks_2', 'hh_wrks_3_plus') = 'TOTHH'
                  'hh_wrks_0', 'hh_wrks_1', 'hh_wrks_2', 'hh_wrks_3_plus',
                  # by with kids or not: sum('hh_kids_no', 'hh_kids_yes') = 'TOTHH'
                  'hh_kids_no', 'hh_kids_yes']

                  # housing units: sum('MFDU', 'SFDU') = 'RES_UNITS'
housing_fields = ['RES_UNITS', 'MFDU', 'SFDU']

land_fields    = ['TOTACRE', 'RESACRE_UNWEIGHTED', 'CIACRE_UNWEIGHTED', 'CIACRE', 'RESACRE']

                  # Area type designation
density_fields = ['AREATYPE',
                  # density_pop = tot pop/acre, density_emp = 2.5 * tot emp/acre, density = density_pop + density_emp
                  'DENSITY_POP', 'DENSITY_EMP', 'DENSITY']
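
The sum relationships written in the comments above can be checked mechanically. This is a minimal sketch on a made-up two-TAZ table; `toy_taz` and `relationships` are hypothetical names, and only the household-income relationship is shown.

```python
import pandas as pd

# Toy two-TAZ table; all values are made up for illustration.
toy_taz = pd.DataFrame({
    'TAZ': [1, 2],
    'HHINCQ1': [10, 5], 'HHINCQ2': [20, 5], 'HHINCQ3': [30, 5], 'HHINCQ4': [40, 5],
    'TOTHH': [100, 21],   # TAZ 2 is deliberately inconsistent (parts sum to 20)
})

# (component fields, total field) pairs taken from the comments above
relationships = [(['HHINCQ1', 'HHINCQ2', 'HHINCQ3', 'HHINCQ4'], 'TOTHH')]

for parts, total in relationships:
    # list the TAZs whose components do not add up to the stated total
    mismatch = toy_taz.loc[toy_taz[parts].sum(axis=1) != toy_taz[total], 'TAZ'].tolist()
    print('{}: TAZs where components do not sum to the total: {}'.format(total, mismatch))
```

The same loop could be extended with the other (components, total) pairs before and after scaling.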

In [2]:
import os
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

In [3]:
# inputs

# BAUS output
BOX_DIR = 'C:\\Users\\{}\\Box\\Modeling and Surveys'.format(os.getenv('USERNAME'))
BAUS_PBA50_FBP_DIR = os.path.join(BOX_DIR, 'Urban Modeling', 'Bay Area UrbanSim', 'PBA50', 'Final Blueprint runs',
                                  'Final Blueprint (s24)', 'BAUS v2.25 - FINAL VERSION')
BAUS_2020_TAZ_FILE = os.path.join(BAUS_PBA50_FBP_DIR, 'run182_taz_summaries_2020.csv')

# Census 2020 decennial data
L_DIR = 'L:\\Application\\Model_One\\TransitRecovery\\land_use_preprocessing'
CENSUS_INPUT_DIR = os.path.join(L_DIR, 'census_data')
DEC_P1_FILE = os.path.join(CENSUS_INPUT_DIR, 'DECENNIALPL2020.P1-2022-05-06T201441.csv') # P1: total pop by race, will use 'total population'
DEC_P5_FILE = os.path.join(CENSUS_INPUT_DIR, 'DECENNIALPL2020.P5-2022-05-06T201358.csv') # P5: group quarters pop by major group quarters type (use 'total group quarter pop')

# Census 2020 ACS PUMS 1-year data
PUMS_H_FILE = os.path.join(CENSUS_INPUT_DIR, 'hbayarea20.csv')  # PUMS housing records, for housing units and total household count
PUMS_P_FILE = os.path.join(CENSUS_INPUT_DIR, 'pbayarea20.csv') 

# ESRI business data for employment, already summarized by TAZ and scaled to match regional control totals;
# produced by the script "https://github.com/BayAreaMetro/petrale/blob/main/applications/travel_model_lu_inputs/2015/Employment/summarize_BusinessData_by_TAZ_industry.R"
ESRI_EMP_TAZ_FILE = os.path.join(L_DIR, 'esri_business_analyst', 'BusinessData_2020_TAZ_industry.csv')

# outputs
taz_summaries_scaled = os.path.join(L_DIR, 'run182_taz_summaries_2020.csv')

## 1. population, households, housing units - compare BAUS with Census

### 1.1 get data from Census: total population, total households, total housing units

In [4]:
# 1. total population from Decennial P1 table
tot_pop_dec_raw = pd.read_csv(DEC_P1_FILE)
# only keep the total pop data and transpose the table so that each row represents one county
tot_pop_dec_raw.set_index('Label (Grouping)', inplace=True)
tot_pop_dec = tot_pop_dec_raw.loc[
    tot_pop_dec_raw.index == 'Total:'].transpose().rename(columns={'Total:': 'TOTPOP_dec'}).reset_index()
tot_pop_dec.loc[:, 'COUNTY_NAME'] = tot_pop_dec['index'].apply(lambda x: x.replace(' County, California', ''))
tot_pop_dec.drop(columns=['index'], inplace=True)
display(tot_pop_dec)

Label (Grouping),TOTPOP_dec,COUNTY_NAME
0,1682353,Alameda
1,1165927,Contra Costa
2,262321,Marin
3,138019,Napa
4,873965,San Francisco
5,764442,San Mateo
6,1936259,Santa Clara
7,453491,Solano
8,488863,Sonoma


In [5]:
# 2. total group quarters (GQ) pop from Decennial P5 table
tot_gp_pop_dec_raw = pd.read_csv(DEC_P5_FILE)
# only keep the total GQ pop and transpose the table so that each row represents one county
tot_gp_pop_dec_raw.set_index('Label (Grouping)', inplace=True)
tot_gp_pop_dec = tot_gp_pop_dec_raw.loc[
    tot_gp_pop_dec_raw.index == 'Total:'].transpose().rename(columns={'Total:': 'GQPOP_dec'}).reset_index()
tot_gp_pop_dec.loc[:, 'COUNTY_NAME'] = tot_gp_pop_dec['index'].apply(lambda x: x.replace(' County, California', ''))
tot_gp_pop_dec.drop(columns=['index'], inplace=True)
display(tot_gp_pop_dec)

Label (Grouping),GQPOP_dec,COUNTY_NAME
0,53833,Alameda
1,11255,Contra Costa
2,7743,Marin
3,5172,Napa
4,27892,San Francisco
5,9352,San Mateo
6,39607,Santa Clara
7,11137,Solano
8,8866,Sonoma


In [None]:
# 3. total household from PUMS 2020

hh_pums_2020 = pd.read_csv(PUMS_H_FILE, usecols = [
    'PUMA', 
    'County_Name',
    'WGTP',    # Housing Unit Weight: 
#               0       Group quarters place holder record 
#               1..9999 Integer weight of housing unit
    'NP',      # Number of persons in this household:
#               0 .Vacant unit 
#               1 .One person in household or any person in group quarters 
#               2..20 .Number of persons in household 
    'TYPEHUGQ' # Type of unit 
#               1 .Housing unit
#               2 .Institutional group quarters 
#               3 .Noninstitutional group quarters 
])

# total households are represented by PUMS housing records with 'NP' > 0 (non-vacant);
# group quarters records also have 'NP' > 0 but carry 'WGTP' = 0, so they add nothing to the weighted sum
tot_hh_pums = hh_pums_2020.loc[(hh_pums_2020.NP > 0)].groupby('County_Name')['WGTP'].sum().reset_index()

tot_hh_pums.rename(columns={'County_Name': 'COUNTY_NAME', 
                            'WGTP': 'TOTHH_pums'}, inplace=True)

print(tot_hh_pums['TOTHH_pums'].sum())
display(tot_hh_pums)

In [None]:
# 4. total housing units from PUMS 2020
# PUMS records with unit type = 1 (non group quarter)
tot_unit_pums = hh_pums_2020.loc[hh_pums_2020.TYPEHUGQ == 1].groupby('County_Name')['WGTP'].sum().reset_index()
tot_unit_pums.rename(columns={'County_Name': 'COUNTY_NAME', 
                              'WGTP': 'RES_UNITS_pums'}, inplace=True)

print(tot_unit_pums['RES_UNITS_pums'].sum())
tot_unit_pums
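
To make the two PUMS filters concrete, here is a minimal sketch on made-up housing records. The column names follow the PUMS fields read above; the weights and the `toy_*` names are invented for illustration.

```python
import pandas as pd

# Made-up housing records: an occupied unit, a vacant unit,
# a group quarters placeholder (WGTP = 0), and an occupied unit in another county.
toy_pums_h = pd.DataFrame({
    'County_Name': ['Marin', 'Marin', 'Marin', 'Napa'],
    'WGTP':        [20,      15,      0,       30],
    'NP':          [3,       0,       1,       2],
    'TYPEHUGQ':    [1,       1,       3,       1],
})

# Households: occupied records (NP > 0); the group quarters row adds 0 because WGTP == 0.
toy_hh = toy_pums_h.loc[toy_pums_h.NP > 0].groupby('County_Name')['WGTP'].sum()

# Housing units: all housing-unit records (TYPEHUGQ == 1), occupied or vacant.
toy_units = toy_pums_h.loc[toy_pums_h.TYPEHUGQ == 1].groupby('County_Name')['WGTP'].sum()

print(toy_hh)
print(toy_units)
```

In this toy data the vacant Marin unit counts toward housing units but not households, so Marin has 35 units and 20 households.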

In [None]:
# combine census county-level total metrics and modify county names to be consistent with modeling convention
census_tots_county = tot_pop_dec.merge(
                     tot_gp_pop_dec, on='COUNTY_NAME', how='outer').merge(
                     tot_hh_pums, on='COUNTY_NAME', how='outer').merge(
                     tot_unit_pums, on='COUNTY_NAME', how='outer')
# census_tots_county.loc[:, 'COUNTY_NAME'] = census_tots_county['index'].apply(lambda x: x.replace(' County, California', ''))
# census_tots_county.drop(columns='index', inplace=True)
# convert value fields to numeric (the Census export stores them as strings with thousands separators)
for col_name in ['TOTPOP_dec', 'GQPOP_dec']:
    census_tots_county[col_name] = census_tots_county[col_name].astype(str).str.replace(',', '', regex=False).astype(int)
display(census_tots_county)

### 1.2 compare PBA50 county-level total population, GQ population, total households, total housing units with Census

In [None]:
# PBA50 BAUS 2020 output
baus_taz = pd.read_csv(BAUS_2020_TAZ_FILE)
taz_fields = list(baus_taz)
print('read {} rows of BAUS output taz data, with the following fields: {}'.format(baus_taz.shape[0], taz_fields))

In [None]:
# county-level sums of the same fields
baus_demo_tots_county = baus_taz.groupby('COUNTY_NAME')[['TOTPOP', 'GQPOP', 'TOTHH', 'RES_UNITS']].sum().reset_index()
display(baus_demo_tots_county.head())
baus_demo_tots_county.columns = ['COUNTY_NAME'] + [x+'_baus' for x in list(baus_demo_tots_county)[1:]]
display(baus_demo_tots_county.head())

In [None]:
baus_taz.groupby('COUNTY_NAME')['gq_tot_pop'].sum()

In [None]:
tot_gp_pop_dec.groupby('COUNTY_NAME')['GQPOP_dec'].sum()

In [None]:
# merge with census data
baus_census_tots_county_comp = baus_demo_tots_county.merge(census_tots_county, on='COUNTY_NAME', how='outer')

In [None]:
# calculate HHPOP
baus_census_tots_county_comp['HHPOP_dec'] = baus_census_tots_county_comp['TOTPOP_dec'] - baus_census_tots_county_comp['GQPOP_dec']
baus_census_tots_county_comp['HHPOP_baus'] = baus_census_tots_county_comp['TOTPOP_baus'] - baus_census_tots_county_comp['GQPOP_baus']

In [None]:
# calculate diffs and adjustment ratios
attr_source = {'TOTPOP': 'dec',
               'GQPOP' : 'dec',
               'HHPOP' : 'dec',
               'TOTHH' : 'pums',
               'RES_UNITS' : 'pums'}


for demo_attr in ['TOTPOP', 'GQPOP', 'HHPOP', 'TOTHH', 'RES_UNITS']:
    source = attr_source[demo_attr]
    baus_census_tots_county_comp.loc[:, demo_attr+'_'+source+'_baus_diff'] = \
        baus_census_tots_county_comp[demo_attr+'_'+source] - baus_census_tots_county_comp[demo_attr+'_baus']
    baus_census_tots_county_comp.loc[:, demo_attr+'_'+source+'_baus_ratio'] = \
        baus_census_tots_county_comp[demo_attr+'_'+source] / baus_census_tots_county_comp[demo_attr+'_baus']
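
The methodology does not say how close is "very close". As a sketch of one way to flag counties that would need scaling, the example below assumes a hypothetical 1% tolerance on the Census / BAUS ratio; `TOLERANCE`, `toy_comp`, and the county values are all made up.

```python
import pandas as pd

# Hypothetical tolerance: the notebook itself does not define "very close".
TOLERANCE = 0.01

toy_comp = pd.DataFrame({
    'COUNTY_NAME': ['A', 'B'],
    'TOTHH_pums': [1000, 2000],
    'TOTHH_baus': [1005, 2100],
})

# Same ratio definition as above: reference value divided by BAUS value.
toy_comp['ratio'] = toy_comp['TOTHH_pums'] / toy_comp['TOTHH_baus']

# Flag counties whose ratio deviates from 1 by more than the tolerance.
toy_comp['needs_scaling'] = (toy_comp['ratio'] - 1).abs() > TOLERANCE

print(toy_comp)
```

County A is within 0.5% and is left alone, while county B is off by nearly 5% and is flagged.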

In [None]:
# print out comparison

print('Total Population comparison:')
print('Census: {:,}'.format(baus_census_tots_county_comp['TOTPOP_dec'].sum()))
print('BAUS: {:,}'.format(int(baus_census_tots_county_comp['TOTPOP_baus'].sum())))
display(baus_census_tots_county_comp[[
    'COUNTY_NAME', 'TOTPOP_dec', 'TOTPOP_baus',
    'TOTPOP_dec_baus_diff', 'TOTPOP_dec_baus_ratio']].sort_values('TOTPOP_dec_baus_diff', ascending=False))

print('\nGroup quarters Population comparison:')
print('Census: {:,}'.format(baus_census_tots_county_comp['GQPOP_dec'].sum()))
print('BAUS: {:,}'.format(int(baus_census_tots_county_comp['GQPOP_baus'].sum())))
display(baus_census_tots_county_comp[[
    'COUNTY_NAME', 'GQPOP_dec', 'GQPOP_baus', 
    'GQPOP_dec_baus_diff', 'GQPOP_dec_baus_ratio']].sort_values('GQPOP_dec_baus_diff', ascending=False))

print('\nHouseholds Population comparison:')
print('Census: {:,}'.format(baus_census_tots_county_comp['HHPOP_dec'].sum()))
print('BAUS: {:,}'.format(int(baus_census_tots_county_comp['HHPOP_baus'].sum())))
display(baus_census_tots_county_comp[[
    'COUNTY_NAME', 'HHPOP_dec', 'HHPOP_baus',
    'HHPOP_dec_baus_diff', 'HHPOP_dec_baus_ratio']].sort_values('HHPOP_dec_baus_diff', ascending=False))


print('\nTotal Households comparison')
print('Census: {:,}'.format(baus_census_tots_county_comp['TOTHH_pums'].sum()))
print('BAUS: {:,}'.format(int(baus_census_tots_county_comp['TOTHH_baus'].sum())))
display(baus_census_tots_county_comp[[
    'COUNTY_NAME', 'TOTHH_pums', 'TOTHH_baus',
    'TOTHH_pums_baus_diff', 'TOTHH_pums_baus_ratio']].sort_values('TOTHH_pums_baus_diff', ascending=False))


print('\nTotal Housing Units comparison')
print('Census: {:,}'.format(baus_census_tots_county_comp['RES_UNITS_pums'].sum()))
print('BAUS: {:,}'.format(int(baus_census_tots_county_comp['RES_UNITS_baus'].sum())))
display(baus_census_tots_county_comp[[
    'COUNTY_NAME', 'RES_UNITS_pums', 'RES_UNITS_baus',
    'RES_UNITS_pums_baus_diff', 'RES_UNITS_pums_baus_ratio']].sort_values('RES_UNITS_pums_baus_diff', ascending=False))

In [None]:
baus_census_tots_county_comp

### 1.3 scale BAUS household counts to be consistent with Census (PUMS) data at the county level
Including total households and household by categories in each TAZ.

In [None]:
# calculate scale ratio by county
hh_adjust_ratio = baus_census_tots_county_comp[['COUNTY_NAME', 'TOTHH_pums_baus_ratio']]

baus_taz_hh_unscaled = baus_taz[['TAZ', 'COUNTY_NAME'] + hh_fields].merge(hh_adjust_ratio, on='COUNTY_NAME', how='left')
baus_taz_hh_unscaled

In [None]:
def scale_by_taz(df_unscaled, target_df, attr_name, attr_name_in_target_df, scale_ratio_field):
    """
    Scales a TAZ-level attribute (e.g. TOTHH, TOTEMP) by county-level scale ratios, 
    so that the sums by county equal the target.
    Three steps:
     - apply the county-level scale ratio to all TAZs within each county
     - round to the nearest integer
     - within each county, correct rounding error by modifying the value of the largest TAZ 
       (largest in terms of the attribute for scaling)
    
    Arguments:
        df_unscaled:            e.g. baus_taz_hh_unscaled
        target_df:              e.g. tot_hh_pums
        attr_name:              e.g. 'TOTHH'
        attr_name_in_target_df: e.g. 'TOTHH_pums'
        scale_ratio_field:      e.g. 'TOTHH_pums_baus_ratio'
    """
    
    df_scaled = pd.DataFrame()

    for county in target_df['COUNTY_NAME'].unique():
    # for county in ['Alameda']:
        print(county)

        ##### get sub-dataframe of TAZs within a county (copy so the input dataframe is not modified)
        county_df = df_unscaled.loc[df_unscaled.COUNTY_NAME == county].copy()
#         display(county_df.sort_values(attr_name, ascending=False).head(3))

        ##### scale the attribute
        # apply the county-level scale ratio to all TAZs within the county
        county_df[attr_name] = county_df[attr_name] * county_df[scale_ratio_field]

        # round to the nearest integer
        county_df[attr_name] = county_df[attr_name].apply(lambda x: int(round(x)))
#         display(county_df.sort_values(attr_name, ascending=False).head(3))

        # correct for rounding errors by allocating the diff to the single TAZ with the largest value
        # (idxmax picks one TAZ, so the diff is applied only once even if several TAZs tie for the maximum)
        target = target_df.loc[target_df.COUNTY_NAME == county][attr_name_in_target_df].sum()
        rounding_diff = target - county_df[attr_name].sum()
#         print(county_df[attr_name].sum(), rounding_diff)
        county_df.loc[county_df[attr_name].idxmax(), attr_name] += rounding_diff
#         display(county_df.sort_values(attr_name, ascending=False).head(3))
        
        df_scaled = pd.concat([df_scaled, county_df])
    
    # drop the scale ratio field
    df_scaled.drop(columns=scale_ratio_field, inplace=True)
    
    return df_scaled
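
The rounding correction is easiest to see on a single toy county. The helper below is a simplified, hypothetical stand-in for `scale_by_taz` (one county, one Series, ratio derived from the target), not the function itself; the TAZ ids and counts are made up.

```python
import pandas as pd

def scale_county_sketch(values, target):
    """Scale integer TAZ values so they sum to `target`; the largest TAZ absorbs the rounding diff."""
    ratio = target / values.sum()
    scaled = (values * ratio).round().astype(int)
    # put the rounding difference on the single largest TAZ
    scaled.loc[scaled.idxmax()] += target - scaled.sum()
    return scaled

# made-up household counts for three TAZs in one county
toy = pd.Series([13, 11, 9], index=[101, 102, 103])
scaled = scale_county_sketch(toy, target=50)
print(scaled)
```

Rounding alone gives 20, 17, and 14 (sum 51), so the largest TAZ is reduced by one to hit the county target of 50.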

In [None]:
def scale_by_taz_and_category(df_cat_unscaled, target_df, attr_tot_name, attr_cat_names, scale_ratio_field):
    """
    Scales a set of sub-category attributes (e.g. 'HHINCQ1', 'HHINCQ2', 'HHINCQ3', 'HHINCQ4') at the TAZ level based on
    county-level scale ratios, and ensures that the sum of the sub-category values for each TAZ equals the total attribute (e.g. TOTHH).
    Three steps:
     - apply the county-level scale ratio to each sub-category attribute for all TAZs within each county
     - round to the nearest integer
     - within each TAZ, correct rounding error by modifying the value of the largest sub-category
    
    Arguments:
        df_cat_unscaled:        e.g. baus_taz_hh_income_unscaled
        target_df:              e.g. baus_taz_tot_hh_scaled
        attr_tot_name:          e.g. 'TOTHH'
        attr_cat_names:         e.g. ['HHINCQ1', 'HHINCQ2', 'HHINCQ3', 'HHINCQ4']
        scale_ratio_field:      e.g. 'TOTHH_pums_baus_ratio'

    """
    
    df_scaled = pd.DataFrame()

    for county in target_df['COUNTY_NAME'].unique():
    # for county in ['Alameda']:
        print(county)

        # get sub-dataframe of TAZs within a county (copy so the input dataframe is not modified)
        county_df = df_cat_unscaled.loc[df_cat_unscaled.COUNTY_NAME == county].copy()
#         display(county_df.head(3))

        # apply the scale ratio to all sub-categories and round to the nearest integer
        for i in attr_cat_names:
            # assign the result back; an in-place fillna on a selected column may not update county_df
            county_df[i] = county_df[i].fillna(0)
            county_df[i] = county_df[i] * county_df[scale_ratio_field]
            county_df[i] = county_df[i].apply(lambda x: int(round(x)))
    
        # correct for rounding errors within each TAZ by allocating the diff to the largest sub-category
        # 1. merge in the scaled total values of the sub-categories
#         print(county_df.shape[0])
        county_df = county_df.merge(target_df, on=['TAZ', 'COUNTY_NAME'], how='left')
#         display(county_df.head())
#         print(county_df.shape[0])
        
        # 2. calculate rounding diff
        county_df.loc[:, 'tot_temp'] = county_df[attr_cat_names].sum(axis=1)
        county_df['rounding_diff'] = county_df[attr_tot_name] - county_df['tot_temp']

        # 3. get the value of the largest sub-category before rounding error correction 
        largest_cat_values = county_df[attr_cat_names].max(axis=1)
#         print(largest_cat_values)
        # 4. calculate the value of the largest sub-category with rounding error correction    
        county_df['rounding_adj'] = largest_cat_values + county_df['rounding_diff']
        # 5. get the name of the largest sub-category for each TAZ
#         display(county_df[['TAZ', attr_tot_name] + attr_cat_names].max(axis=1))
        county_df['largest_cat'] = county_df[attr_cat_names].idxmax(axis=1)
        # 6. loop through each TAZ to correct the value of its largest sub-category
#         display(county_df[['TAZ', attr_tot_name] + attr_cat_names + ['tot_temp','rounding_adj','largest_cat']].head())
        for i in county_df.index:
            county_df.loc[i, county_df['largest_cat'][i]] = county_df.loc[i, 'rounding_adj']
#         display(county_df[['TAZ', attr_tot_name] + attr_cat_names + ['tot_temp','rounding_adj','largest_cat']].head())

        df_scaled = pd.concat([df_scaled, county_df])

    # drop the total, the scale ratio field, and the temporary helper fields
    df_scaled.drop(columns=[attr_tot_name, scale_ratio_field, 'tot_temp', 'rounding_diff', 'rounding_adj', 'largest_cat'], inplace=True)
    
    return df_scaled
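
The per-TAZ residual step can be illustrated on two made-up TAZs. This sketch is a simplified stand-in for `scale_by_taz_and_category` with a hypothetical county ratio of 1.52 and invented counts; `toy`, `cats`, and `ratio` are not names used elsewhere in the notebook.

```python
import pandas as pd

cats = ['HHINCQ1', 'HHINCQ2', 'HHINCQ3', 'HHINCQ4']
toy = pd.DataFrame({
    'TAZ': [1, 2],
    'HHINCQ1': [13, 4], 'HHINCQ2': [11, 4], 'HHINCQ3': [9, 4], 'HHINCQ4': [7, 9],
    'TOTHH': [61, 32],   # TAZ totals after scaling (40 * 1.52 and 21 * 1.52, rounded)
})
ratio = 1.52  # hypothetical county-level ratio

# scale every category and round to whole households
scaled = (toy[cats] * ratio).round().astype(int)

# per-TAZ difference between the scaled total and the sum of rounded categories
residual = toy['TOTHH'] - scaled.sum(axis=1)

# add each TAZ's residual to its largest category
largest = scaled.idxmax(axis=1)
for row, col in largest.items():
    scaled.loc[row, col] += residual[row]

print(scaled)
```

In TAZ 1 the rounded categories sum to 62, so the largest category (HHINCQ1) drops from 20 to 19; TAZ 2 already sums to 32 and is unchanged.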

In [None]:
# scale total household counts

attr_name = 'TOTHH'
scale_ratio_field = 'TOTHH_pums_baus_ratio'
df_unscaled = baus_taz_hh_unscaled[['TAZ', 'COUNTY_NAME', 'TOTHH', 'TOTHH_pums_baus_ratio']]
target_df = tot_hh_pums
attr_name_in_target_df = 'TOTHH_pums'

baus_taz_tot_hh_scaled = scale_by_taz(df_unscaled, target_df, attr_name, attr_name_in_target_df, scale_ratio_field)

In [None]:
baus_taz_hh_unscaled[['COUNTY_NAME', 'TOTHH']].groupby('COUNTY_NAME')['TOTHH'].sum()

In [None]:
baus_taz_tot_hh_scaled[['COUNTY_NAME', 'TOTHH']].groupby('COUNTY_NAME')['TOTHH'].sum()

In [None]:
# scale households by income data

df_cat_unscaled = baus_taz_hh_unscaled[['TAZ', 'COUNTY_NAME', 'HHINCQ1', 'HHINCQ2', 'HHINCQ3', 'HHINCQ4', 'TOTHH_pums_baus_ratio']]
target_df = baus_taz_tot_hh_scaled
attr_tot_name = 'TOTHH'
attr_cat_names = ['HHINCQ1', 'HHINCQ2', 'HHINCQ3', 'HHINCQ4']
scale_ratio_field = 'TOTHH_pums_baus_ratio'

baus_taz_hh_income_scaled = scale_by_taz_and_category(df_cat_unscaled, target_df, attr_tot_name, attr_cat_names, scale_ratio_field)

In [None]:
# scale households by size data

df_cat_unscaled = baus_taz_hh_unscaled[['TAZ', 'COUNTY_NAME', 'hh_size_1', 'hh_size_2', 'hh_size_3', 'hh_size_4_plus', 'TOTHH_pums_baus_ratio']]
target_df = baus_taz_tot_hh_scaled
attr_tot_name = 'TOTHH'
attr_cat_names = ['hh_size_1', 'hh_size_2', 'hh_size_3', 'hh_size_4_plus']
scale_ratio_field = 'TOTHH_pums_baus_ratio'

baus_taz_hh_size_scaled = scale_by_taz_and_category(df_cat_unscaled, target_df, attr_tot_name, attr_cat_names, scale_ratio_field)

In [None]:
# scale households by worker count data
df_cat_unscaled = baus_taz_hh_unscaled[['TAZ', 'COUNTY_NAME', 'hh_wrks_0', 'hh_wrks_1', 'hh_wrks_2', 'hh_wrks_3_plus', 'TOTHH_pums_baus_ratio']]
target_df = baus_taz_tot_hh_scaled
attr_tot_name = 'TOTHH'
attr_cat_names = ['hh_wrks_0', 'hh_wrks_1', 'hh_wrks_2', 'hh_wrks_3_plus']
scale_ratio_field = 'TOTHH_pums_baus_ratio'

baus_taz_hh_worker_scaled = scale_by_taz_and_category(df_cat_unscaled, target_df, attr_tot_name, attr_cat_names, scale_ratio_field)

In [None]:
# scale households by kids data
df_cat_unscaled = baus_taz_hh_unscaled[['TAZ', 'COUNTY_NAME', 'hh_kids_no', 'hh_kids_yes', 'TOTHH_pums_baus_ratio']]
target_df = baus_taz_tot_hh_scaled
attr_tot_name = 'TOTHH'
attr_cat_names = ['hh_kids_no', 'hh_kids_yes']
scale_ratio_field = 'TOTHH_pums_baus_ratio'

baus_taz_hh_kids_scaled = scale_by_taz_and_category(df_cat_unscaled, target_df, attr_tot_name, attr_cat_names, scale_ratio_field)

In [None]:
# put all household fields together
baus_taz_hh_scaled = baus_taz_tot_hh_scaled.merge(
                     baus_taz_hh_income_scaled, on=['TAZ', 'COUNTY_NAME'], how='outer').merge(
                     baus_taz_hh_size_scaled, on=['TAZ', 'COUNTY_NAME'], how='outer').merge(
                     baus_taz_hh_worker_scaled, on=['TAZ', 'COUNTY_NAME'], how='outer').merge(
                     baus_taz_hh_kids_scaled, on=['TAZ', 'COUNTY_NAME'], how='outer')

baus_taz_hh_scaled

## 2. employment - compare BAUS with ESRI

### 2.1 ESRI employment data

In [None]:
# TAZ-level employment data from ESRI Business Analyst, scaled to match 2020 regional control totals
esri_emp_taz = pd.read_csv(ESRI_EMP_TAZ_FILE)
esri_emp_taz = esri_emp_taz[['TAZ1454', 'County_Name', 
                                           'TOTEMP', 'AGREMPN', 'FPSEMPN', 'HEREMPN', 'MWTEMPN', 'OTHEMPN', 'RETEMPN']]
display(esri_emp_taz.head())
esri_emp_taz.columns = ['TAZ1454', 'COUNTY_NAME'] + [x + '_esri' for x in list(esri_emp_taz)[2:]]

display(esri_emp_taz.head())

In [None]:
# get total employment by county
esribiz_emptot_county = esri_emp_taz.groupby('COUNTY_NAME')['TOTEMP_esri'].sum().reset_index()
display(esribiz_emptot_county)

### 2.2 compare BAUS total employment by county with ESRI

In [None]:
# get BAUS 2020 output total employment by county
baus_emptot_county = baus_taz.groupby('COUNTY_NAME')[['TOTEMP']].sum().reset_index().rename(
    columns={'TOTEMP': 'TOTEMP_baus'})
display(baus_emptot_county)

In [None]:
# merge ESRI with BAUS
emptot_compare = esribiz_emptot_county.merge(baus_emptot_county, on='COUNTY_NAME', how='left')

# convert 'TOTEMP_esri' to integer
emptot_compare['TOTEMP_esri'] = emptot_compare['TOTEMP_esri'].apply(lambda x: int(round(x)))

print('total employment comparison:\n{}'.format(emptot_compare[['TOTEMP_esri', 'TOTEMP_baus']].sum()))

# correct rounding error by adjusting the largest employment county
rounding_diff = emptot_compare['TOTEMP_baus'].sum() - emptot_compare['TOTEMP_esri'].sum()
emptot_compare.loc[emptot_compare.TOTEMP_esri == emptot_compare.TOTEMP_esri.max(),
                   'TOTEMP_esri'] = emptot_compare['TOTEMP_esri'] + rounding_diff
# check the totals match
print('after correcting for rounding error, total employment comparison:\n{}'.format(emptot_compare[['TOTEMP_esri', 'TOTEMP_baus']].sum()))

# add esri / baus ratio by county
emptot_compare['totemp_esri_baus_diff'] = emptot_compare['TOTEMP_esri'] - emptot_compare['TOTEMP_baus']
emptot_compare['totemp_esri_baus_ratio'] = emptot_compare['TOTEMP_esri'] / emptot_compare['TOTEMP_baus']

display(emptot_compare.sort_values('totemp_esri_baus_diff', ascending=False))

### 2.3 scale BAUS employment to be consistent with ESRI data at the county level

In [None]:
# calculate scale ratio by county
emp_adjust_ratio = emptot_compare[['COUNTY_NAME', 'totemp_esri_baus_ratio']]

baus_taz_emp_unscaled = baus_taz[['TAZ', 'COUNTY_NAME', 'AGREMPN', 'FPSEMPN', 'HEREMPN', 'RETEMPN', 'MWTEMPN', 'OTHEMPN', 'TOTEMP']].merge(emp_adjust_ratio, on='COUNTY_NAME', how='left')
baus_taz_emp_unscaled

In [None]:
# scale total employment
attr_name = 'TOTEMP'
scale_ratio_field = 'totemp_esri_baus_ratio'
df_unscaled = baus_taz_emp_unscaled[['TAZ', 'COUNTY_NAME', 'TOTEMP', 'totemp_esri_baus_ratio']]
target_df = emptot_compare[['COUNTY_NAME', 'TOTEMP_esri']]
attr_name_in_target_df = 'TOTEMP_esri'

baus_taz_tot_emp_scaled = scale_by_taz(df_unscaled, target_df, attr_name, attr_name_in_target_df, scale_ratio_field)

In [None]:
# scale employment by type
df_cat_unscaled = baus_taz_emp_unscaled[['TAZ', 'COUNTY_NAME', 'AGREMPN', 'FPSEMPN', 'HEREMPN', 'RETEMPN', 'MWTEMPN', 'OTHEMPN', 'totemp_esri_baus_ratio']]
target_df = baus_taz_tot_emp_scaled
attr_tot_name = 'TOTEMP'
attr_cat_names = ['AGREMPN', 'FPSEMPN', 'HEREMPN', 'RETEMPN', 'MWTEMPN', 'OTHEMPN']
scale_ratio_field = 'totemp_esri_baus_ratio'

baus_taz_emp_type_scaled = scale_by_taz_and_category(df_cat_unscaled, target_df, attr_tot_name, attr_cat_names, scale_ratio_field)

In [None]:
# put all employment fields together
baus_taz_emp_scaled = baus_taz_tot_emp_scaled.merge(baus_taz_emp_type_scaled, on=['TAZ', 'COUNTY_NAME'], how='outer')
baus_taz_emp_scaled

## 3. employed residents

### 3.1 PUMS person data

In [None]:
# employment data in PUMS person file

p_pums_2020 = pd.read_csv(PUMS_P_FILE, usecols = [
    'PUMA', 
    'County_Name',
    'PWGTP',   # Person's weight:
#               1..9999 .Integer weight of person 
    'ESR',     # Employment status recode:
#               b .N/A (less than 16 years old)
#               1 .Civilian employed, at work
#               2 .Civilian employed, with a job but not at work
#               3 .Unemployed
#               4 .Armed forces, at work
#               5 .Armed forces, with a job but not at work
#               6 .Not in labor force
])
p_pums_2020.rename(columns={'County_Name': 'COUNTY_NAME'}, inplace=True)
display(p_pums_2020)

In [None]:
# recode ESR: non employed categories to 0, employed categories to 1 
employed_dict = {1: 1,
                 2: 1,
                 3: 0,
                 4: 1,
                 5: 1,
                 6: 0}

# map ESR to the employed flag; unmapped values (e.g. N/A for persons under 16) become 0
p_pums_2020['employed_recode'] = p_pums_2020['ESR'].map(employed_dict).fillna(0)
p_pums_2020['EMPRES_pums'] = p_pums_2020['employed_recode'] * p_pums_2020['PWGTP']
pums_empres_county = p_pums_2020.groupby('COUNTY_NAME')['EMPRES_pums'].sum().reset_index()
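
A minimal sketch of the weighted employed-resident count on made-up person records. It uses `isin` on the employed ESR codes (1, 2, 4, 5) as an alternative to the dictionary mapping above; the `toy_*` names and the weights are invented.

```python
import numpy as np
import pandas as pd

toy_persons = pd.DataFrame({
    'COUNTY_NAME': ['Napa', 'Napa', 'Napa', 'Napa'],
    'ESR':   [1, 3, 4, np.nan],   # NaN stands for "N/A (less than 16 years old)"
    'PWGTP': [10, 20, 5, 8],
})

# ESR codes counted as employed: civilian (1, 2) and armed forces (4, 5)
employed_codes = {1, 2, 4, 5}
toy_persons['employed'] = toy_persons['ESR'].isin(employed_codes).astype(int)

# employed residents = sum of person weights over employed persons
toy_empres = (toy_persons['employed'] * toy_persons['PWGTP']).groupby(toy_persons['COUNTY_NAME']).sum()
print(toy_empres)
```

Only the first and third records are employed, so the weighted count for the toy county is 10 + 5 = 15.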

In [None]:
print(pums_empres_county.EMPRES_pums.sum())
pums_empres_county

### 3.2 compare with BAUS 'EMPRES'

In [None]:
# get BAUS 2020 output 'EMPRES' by county
baus_empres_county = baus_taz.groupby('COUNTY_NAME')[['EMPRES']].sum().reset_index().rename(
    columns={'EMPRES': 'EMPRES_baus'})
display(baus_empres_county)

In [None]:
# merge PUMS with BAUS
empres_compare = pums_empres_county.merge(baus_empres_county, on='COUNTY_NAME', how='left')

print('employed residents (total employment with no in-commute) comparison:\n{}'.format(empres_compare[['EMPRES_pums', 'EMPRES_baus']].sum()))

# add pums / baus ratio by county
empres_compare['empres_pums_baus_diff'] = empres_compare['EMPRES_pums'] - empres_compare['EMPRES_baus']
empres_compare['empres_pums_baus_ratio'] = empres_compare['EMPRES_pums'] / empres_compare['EMPRES_baus']

display(empres_compare.sort_values('empres_pums_baus_diff', ascending=False))

### 3.3 scale BAUS EMPRES by county

In [None]:
# calculate scale ratio by county
empres_adjust_ratio = empres_compare[['COUNTY_NAME', 'empres_pums_baus_ratio']]

baus_taz_empres_unscaled = baus_taz[['TAZ', 'COUNTY_NAME', 'EMPRES']].merge(empres_adjust_ratio, on='COUNTY_NAME', how='left')
baus_taz_empres_unscaled

In [None]:
# scale employed residents
attr_name = 'EMPRES'
scale_ratio_field = 'empres_pums_baus_ratio'
df_unscaled = baus_taz_empres_unscaled[['TAZ', 'COUNTY_NAME', 'EMPRES', 'empres_pums_baus_ratio']]
target_df = pums_empres_county
attr_name_in_target_df = 'EMPRES_pums'

baus_taz_empres_scaled = scale_by_taz(df_unscaled, target_df, attr_name, attr_name_in_target_df, scale_ratio_field)
baus_taz_empres_scaled

In [None]:
baus_taz_empres_scaled.groupby('COUNTY_NAME')['EMPRES'].sum()

## 4. recalculate densities

In [None]:
# BAUS TAZ data after scaling
baus_taz_scaled = baus_taz_hh_scaled.merge(
                  baus_taz_emp_scaled, on=['TAZ', 'COUNTY_NAME'], how='outer').merge(
                  baus_taz[id_fields + pop_fields + housing_fields + land_fields], on=['TAZ', 'COUNTY_NAME'], how='outer').merge(
                  baus_taz_empres_scaled)



In [None]:
# calculate density fields
baus_taz_scaled['DENSITY_POP'] = baus_taz_scaled.TOTPOP / baus_taz_scaled.TOTACRE
baus_taz_scaled['DENSITY_POP'] = baus_taz_scaled['DENSITY_POP'].fillna(0)

baus_taz_scaled['DENSITY_EMP'] = (2.5 * baus_taz_scaled.TOTEMP) / baus_taz_scaled.TOTACRE
baus_taz_scaled['DENSITY_EMP'] = baus_taz_scaled['DENSITY_EMP'].fillna(0)

baus_taz_scaled['DENSITY'] = baus_taz_scaled['DENSITY_POP'] + baus_taz_scaled['DENSITY_EMP']
baus_taz_scaled['AREATYPE'] = pd.cut(
    baus_taz_scaled.DENSITY,
    bins=[0, 6, 30, 55, 100, 300, np.inf],
    labels=[5, 4, 3, 2, 1, 0]
)
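
To show how the density bins map to area types, here is a small sketch that applies the same `pd.cut` call to made-up density values. Because `pd.cut` bins are right-closed and exclude the lowest edge by default, a density of exactly 0 falls outside the first bin and gets no area type, which may be worth checking in the scaled table.

```python
import numpy as np
import pandas as pd

# made-up density values, including the edge cases 0 and 6
toy_density = pd.Series([0.0, 3.0, 6.0, 6.1, 80.0, 400.0])

toy_areatype = pd.cut(
    toy_density,
    bins=[0, 6, 30, 55, 100, 300, np.inf],
    labels=[5, 4, 3, 2, 1, 0]
)
print(toy_areatype)
```

The values 3.0 and 6.0 both land in area type 5, 6.1 in type 4, 80.0 in type 2, and 400.0 in type 0, while 0.0 is left as NaN.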

In [None]:
baus_taz_scaled

In [None]:
# check the fields are the same as BAUS output
sorted(list(baus_taz_scaled)) == sorted(list(baus_taz))
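
When the check above returns False, set differences show which fields differ. A minimal sketch with made-up field lists standing in for `list(baus_taz)` and `list(baus_taz_scaled)`:

```python
# made-up stand-ins for list(baus_taz) and list(baus_taz_scaled)
original_fields = ['TAZ', 'TOTHH', 'TOTEMP', 'DENSITY']
scaled_fields = ['TAZ', 'TOTHH', 'TOTEMP', 'tot_extra']

# fields present in the original table but absent after scaling
missing = sorted(set(original_fields) - set(scaled_fields))
# fields present after scaling that the original table does not have
unexpected = sorted(set(scaled_fields) - set(original_fields))

print('missing:', missing)
print('unexpected:', unexpected)
```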

## 5. export

In [None]:
baus_taz_scaled.to_csv(taz_summaries_scaled, index=False)