# Purpose

This notebook prepares tract-level self-employment data by place of work, broken down by industry (the six MTC sectors, `mtc6`).

The LEHD WAC files used for the travel model's TAZ-level data are also tabulated by place of work and industry, but they exclude proprietors.

# Approach
* We need tract-level data on self-employed workers, by place of work, broken down by industry.
* That is not available from the standard ACS tables. The closest is C24070, "Industry by Class of Worker for the Civilian Employed Population 16 Years and Over", but that is tabulated for place of residence geography only.
* Other tables get close: B08528 has the class of worker dimension (but not industry), but it is [not available](https://data.census.gov/table/ACSDT5Y2021.B08528?g=1400000US06085500100) at the tract level, though a [place of residence variant, B08128, is](https://data.census.gov/table/ACSDT5Y2021.B08128?g=1400000US06085500100).

* Instead, we turn to CTPP, which provides a place-of-work-based class of worker accounting in table A202102.
* This is a good start, but it is missing two key pieces:
  1. The sectoral breakdown for the self-employed workers
  1. Temporal currency for the estimates

For the first one, we apply county-level industry shares to tract-level self-employed totals. The result is our "seed" data: a representation of ACS/CTPP 2012-2016 self-employed workers, with a sectoral distribution that has known deficiencies. It is wrong at the county level insofar as the shares describe the total universe of workers, not just the self-employed, and it is wrong at the tract level insofar as tracts don't necessarily mirror their county's distribution.

For the second one, we apply iterative proportional fitting (IPF) to scale the seed tract data to more current marginals on sectoral and class of worker distributions. We obtain those from ACS PUMS 2019 and 2021, pooling the two years for a larger sample to bring down standard errors, and skipping 2020 because that year's release is experimental.
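As a rough illustration of the fitting step, a minimal two-way IPF can be sketched in NumPy. The function name `ipf_2d`, the toy seed table and the target totals are all made up for illustration; the notebook itself relies on the `ipfn` package further down, and this sketch assumes every row and column of the seed has a positive sum.

```python
import numpy as np


def ipf_2d(seed, row_targets, col_targets, max_iter=100, tol=1e-9):
    """Alternately scale rows and columns of `seed` until both sets of totals match."""
    table = np.asarray(seed, dtype=float).copy()
    for _ in range(max_iter):
        # Scale each row so that row sums equal the row targets
        table *= (row_targets / table.sum(axis=1))[:, None]
        # Scale each column so that column sums equal the column targets
        table *= (col_targets / table.sum(axis=0))[None, :]
        # Columns match exactly after the last step; stop once rows match too
        if np.allclose(table.sum(axis=1), row_targets, rtol=0, atol=tol):
            break
    return table


# Hypothetical 2x2 seed and marginals (both sets of targets sum to 100)
seed = np.array([[10.0, 20.0], [30.0, 40.0]])
row_targets = np.array([40.0, 60.0])
col_targets = np.array([45.0, 55.0])

fitted = ipf_2d(seed, row_targets, col_targets)
print(fitted.round(2))
```

The fitted table honors the new totals while keeping the cross-product ratios of the seed, which is why the quality of the seed distribution still matters after fitting.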


In [82]:
import os

# Must be set before geopandas is first imported, so that it uses Shapely instead of PyGEOS
os.environ['USE_PYGEOS'] = '0'

import pandas as pd
import geopandas as gpd
import numpy as np

drop = os.environ['DROPBOX_LOC']



# Mappings and helpers

In [14]:
# for halo counties, I rely on the regionalization in http://www.cdss.ca.gov/research/res/pdf/multireports/RegionsofCalifornia.pdf

bayareafips_full = {'06001': 'Alameda', '06013': 'Contra Costa', '06041': 'Marin', '06055': 'Napa',
                    '06075': 'San Francisco', '06081': 'San Mateo', '06085': 'Santa Clara', '06097': 'Sonoma', '06095': 'Solano'}

In [15]:
indus_to_mtc = pd.read_excel(
    '/Users/aolsen/Dropbox/Documents/Data/BayArea/Projections 2013/NAICS_to_ABAG_SECTORS.xlsx', 'both')
indus_to_mtc['naics_2'] = indus_to_mtc['NAICS-2'].astype(str)

In [16]:
# naics_supersector = pd.read_csv(
#     '/Users/aolsen/Box/Vital Signs/02_data/RESTRUCTURE/sources/CENSUS/codelists/high_level_crosswalk.csv')
# naics_supersector['naics_sector'] = naics_supersector.naics_sector.str.split(
#     ' ').apply(lambda x: x[1])
# naics_supersector['naics_2'] = naics_supersector['naics_sector'].str.slice(
#     0, 2)
# naics_supersector['naics_super'] = naics_supersector.super_sector.str.replace(
#     '^\d{4} ', '').str.strip()

# naics_supersector['sector_list'] = naics_supersector.naics_sector.str.split(
#     '-').apply(lambda x: [int(y) for y in x]).apply(lambda x: range(np.array(x).min(), 1+np.array(x).max()))
# naics_supersector = naics_supersector.explode('sector_list')
# naics_supersector['sector_list'] = naics_supersector['sector_list'].astype(str)
# map_naics_supersector = naics_supersector.set_index('sector_list').naics_super
# map_naics_supersector

In [17]:
pums_jwtr_map = {1: 'drove_alone',
            2: 'bus/streetcar/trolley',
            3: 'rail',  # 'subway',
            4: 'rail',  # 'railroad',
            5: 'bus/streetcar/trolley',
            6: 'other',
            7: 'other',
            8: 'other',
            9: 'other',
            10: 'other',
            11: 'worked_at_home',
            12: 'other'}

# JWRIP 2-10 provides info on carpool

In [18]:
pums_cow_map = {  # 'b': 'N/A (less than 16 years old/NILF who last worked more than 5 years ago or never worked)',
    1: 'Private',
    2: 'Private',
    3: 'Government',
    4: 'Government',
    5: 'Government',
    6: 'Self',
    7: 'Self',
    8: 'No pay family',
    9: 'Unemployed'}

In [19]:

def simple_estimate_SE(grouped_df, weight='PWGTP', years=1):
    """
    Calculate standard errors, margins of error, and confidence intervals for simple sums using ACS PUMS data.

    Args:
        grouped_df (pd.DataFrame): The ACS PUMS records for one group, as passed by `groupby(...).apply`.
        weight (str): Column name of the primary weight; replicate weight columns are matched as this name followed by one or two digits. Default is 'PWGTP'.
        years (int): Number of years the data represents. Default is 1.

    Returns:
        pd.Series: A Series containing the following statistics:
            - 'Total': Total estimate of the variable.
            - 'ci_upper': Upper bound of the confidence interval.
            - 'ci_lower': Lower bound of the confidence interval.
            - 'se': Standard error.
            - 'moe': Margin of error.
            - 'coef_variation': Coefficient of variation.
            - 'sample_recs': Number of records in the grouped DataFrame.
    """


    # Create a regular expression to match replicate weight columns
    # (raw f-string, so that \d is not treated as a string escape)
    repwgt_str = rf'{weight}\d{{1,2}}'

    # Calculate the sum of replicate weights for each column that matches the pattern
    estim_repwgts = grouped_df.filter(regex=repwgt_str).sum().div(float(years))

    # Calculate the sum of primary weights
    estim_prim = grouped_df[weight].sum() / float(years)

    # Calculate squared differences
    squared_diffs = (estim_repwgts - estim_prim)**2
    squared_diffs_summed = squared_diffs.sum()

    # Calculate variance and standard error
    variance = (4 / (80 * years)) * squared_diffs_summed
    standard_error = variance**0.5

    # Calculate coefficient of variation
    coefficient_of_variation = standard_error / estim_prim

    # Calculate margin of error (moe) and confidence intervals
    moe = standard_error * 1.645
    ci_upper = np.ceil(estim_prim + moe)
    ci_lower = np.ceil(estim_prim - moe)
    
    # Ensure confidence interval lower bound is not negative
    ci_lower = ci_lower if ci_lower > 0 else 0

    # Create a Series with the calculated statistics
    output = pd.Series({
        'Total': int(estim_prim),
        'ci_upper': ci_upper,
        'ci_lower': ci_lower,
        'se': standard_error,
        'moe': moe,
        'coef_variation': coefficient_of_variation,
        'sample_recs': grouped_df.shape[0]
    })

    return output
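
For reference, the single-year case (`years=1`) of the calculation above can be traced on synthetic data. This is only a sketch: the weights below are randomly generated, not PUMS records, and the variable names are chosen for illustration. The variance formula is the usual successive-difference replication form, 4/80 times the sum of squared differences between each replicate estimate and the primary estimate.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_records = 200

# Made-up primary weights plus 80 noisy replicate weights
primary = rng.integers(10, 100, size=n_records)
noise = rng.uniform(0.5, 1.5, size=(n_records, 80))
replicates = pd.DataFrame(primary[:, None] * noise,
                          columns=[f'PWGTP{i}' for i in range(1, 81)])
df = pd.concat([pd.Series(primary, name='PWGTP'), replicates], axis=1)

# Primary estimate and the 80 replicate estimates
estimate = df['PWGTP'].sum()
rep_estimates = df.filter(regex=r'^PWGTP\d{1,2}$').sum()

# Variance, standard error and 90 percent margin of error
variance = (4 / 80) * ((rep_estimates - estimate) ** 2).sum()
standard_error = np.sqrt(variance)
moe_90 = 1.645 * standard_error

print(estimate, round(standard_error, 1), round(moe_90, 1))
```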


In [20]:
# data_dict_csv = '/Users/aolsen/Dropbox/My Mac (AOLSEN-MBP.local)/Downloads/PUMS_Data_Dictionary_2022.csv'
# data_dict = pd.read_csv(data_dict_csv, index_col=False, engine='python', names=[
#                         'NAME_OR_VAL', 'VAR', 'DTYPE', 'DWIDTH', 'DESC', 'DET1', 'DET2'])


# cow_map = data_dict.query('NAME_OR_VAL=="VAL" & VAR=="COW"').set_index('DESC').DET2
# cow_map.to_dict()

# Get PUMS data

In [21]:
# 1-year PUMS person files for California, 2019 and 2021 (pooled below)
PUMS_PATH_2021 = '/Users/aolsen/Dropbox/Documents/Data/_Census/ACS/PUMS/2021/csv_pca/psam_p06.csv'
PUMS_PATH_2019 = '/Users/aolsen/Dropbox/Documents/Data/_Census/ACS/PUMS/2019/csv_pca/psam_p06.csv'


In [22]:
KEEP_COLS = ['RT', 'SERIALNO', 'SPORDER', 'PUMA', 'ST', 'ADJINC', 'PWGTP', 'AGEP', 'COW', 'JWTRNS', 'JWRIP', 'ESR', 'MIG',
             'MIGSP', 'SCHL', 'SEX', 'WAGP', 'PINCP', 'JWMNP', 'WKL', 'RAC1P', 'HISP', 'POWPUMA', 'POWSP', 'MIGPUMA', 'NAICSP']
REP_WGTS = [f'PWGTP{i}' for i in range(1, 81)]

In [23]:
%%time

pers_data_2021 = pd.read_csv(PUMS_PATH_2021, usecols=KEEP_COLS+REP_WGTS)

pers_data_2019 = pd.read_csv(PUMS_PATH_2019, usecols=KEEP_COLS+REP_WGTS)

pers_data = pd.concat([pers_data_2019, pers_data_2021], keys=[
                      2019, 2021], names=['YEAR', 'OID']).reset_index()

CPU times: user 10.2 s, sys: 1.89 s, total: 12.1 s
Wall time: 12.1 s


## Geographic assignments

In [24]:
pers_data['STCOUNTY'] = pers_data.ST.map(
    lambda x: f'{x:02.0f}')+pers_data.PUMA.map(lambda x: f'{x:05.0f}'[:-2])

pers_data['STPUMA'] = pers_data.ST.apply(
    lambda x: f'{x:02d}') + pers_data.PUMA.apply(lambda x: f'{x:05d}')

In [25]:

# set powpuma var
mask_powpuma = pers_data.POWPUMA.notna()

pers_data.loc[mask_powpuma, 'POWSTPUMA'] = pers_data.loc[mask_powpuma, 'POWSP'].map(
    lambda x: f'{x:03.0f}')+pers_data.loc[mask_powpuma, 'POWPUMA'].map(lambda x: f'{x:05.0f}')

# set powstcounty var, leveraging POWSP == 6 and the fact that CA PUMA codes are prefixed with the county code
# open question: how should multi-county PUMAs be handled?
mask_powsp_ca = pers_data.POWSP == 6

pers_data.loc[mask_powsp_ca, 'POWSTCOUNTY'] = pers_data.loc[mask_powsp_ca, 'POWSP'].map(
    lambda x: f'{x:02.0f}')+pers_data.loc[mask_powsp_ca, 'POWPUMA'].map(lambda x: f'{x:05.0f}'[:-2])

pers_data.loc[mask_powsp_ca, 'wrk_county'] = pers_data.loc[mask_powsp_ca,
                                                           'POWSTCOUNTY'].map(bayareafips_full)
pers_data.loc[mask_powsp_ca, 'wrk_county']

6               NaN
12              NaN
17              NaN
19          Alameda
20          Alameda
            ...    
766133    San Mateo
766140          NaN
766145          NaN
766149          NaN
766150          NaN
Name: wrk_county, Length: 347046, dtype: object
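
The county assignment above relies on string formatting only. A small sketch of the same idea, using a hypothetical helper name `state_county_from_puma` and two illustrative PUMA codes, shows how the zero-padded PUMA code is truncated to its leading three digits and prefixed with the state code. As the comment in the cell notes, this shortcut does not resolve multi-county PUMAs.

```python
def state_county_from_puma(state_fips, puma):
    """Build a 5-character state+county string from a state code and a PUMA code.

    Assumes the first three digits of the zero-padded PUMA code are the county code.
    """
    return f'{int(state_fips):02d}' + f'{int(puma):05d}'[:-2]


print(state_county_from_puma(6, 101))    # 06001
print(state_county_from_puma(6, 8501))   # 06085
```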

## Industry and class of worker assignments

In [26]:
pers_data['cow'] = pers_data.COW.map(pums_cow_map).fillna('Not a worker')
pers_data['cow'].value_counts()

Private          322122
Not a worker     313055
Government        71182
Self              54192
Unemployed         3993
No pay family      1608
Name: cow, dtype: int64

In [27]:
pers_data['naics_2'] = pers_data.NAICSP.str.slice(
    0, 2).replace({'4M': '44', '3M': '31'})
# pers_data['indus_super'] = pers_data['naics_2'].map(
#     map_naics_supersector).fillna('Not a worker')
# pers_data['indus_super'].dropna()
pers_data['mtc6'] = pers_data.naics_2.map(
    indus_to_mtc.set_index('naics_2').MTCname)
pers_data['mtc6']
# pers_data.loc[(pers_data.naics_2.str.contains('M$',na=False))&(pers_data.naics_2.notna())].NAICSP.unique()

0         mwtempn
1             NaN
2         herempn
3             NaN
4         herempn
           ...   
766147        NaN
766148        NaN
766149    fpsempn
766150    mwtempn
766151        NaN
Name: mtc6, Length: 766152, dtype: object

# Prepare PUMS-derived county marginals 

Coefficients of variation are respectable for most observations, but remain high for small cells such as Napa retail, where the estimate rests on only 8 sample records.

In [37]:
# place of work based accounting of self-employed workers by industry

cow_indus_summary_pow = pers_data.query('POWSTCOUNTY.isin(@bayareafips_full)').groupby(
    ['wrk_county',  'cow', 'mtc6']).apply(simple_estimate_SE, years=2)  # .loc(0)[:, 'Self']
cow_indus_summary_pow.loc(0)[:,'Self']

wrk_county,cow,mtc6,Total,ci_upper,ci_lower,se,moe,coef_variation,sample_recs
Alameda,Self,agrempn,640.0,880.0,401.0,145.342763,239.088845,0.227098,15.0
Alameda,Self,fpsempn,25920.0,27260.0,24581.0,814.258217,1339.454767,0.031414,598.0
Alameda,Self,herempn,29133.0,31003.0,27265.0,1136.195186,1869.04108,0.039,585.0
Alameda,Self,mwtempn,11251.0,12450.0,10054.0,728.499438,1198.381576,0.064747,211.0
Alameda,Self,othempn,10114.0,11303.0,8926.0,722.422665,1188.385283,0.071428,202.0
Alameda,Self,retempn,4719.0,5399.0,4041.0,412.647708,678.80548,0.087435,90.0
Contra Costa,Self,agrempn,849.0,1169.0,530.0,194.163961,319.399716,0.228697,15.0
Contra Costa,Self,fpsempn,19891.0,21295.0,18488.0,853.450141,1403.925483,0.042906,382.0
Contra Costa,Self,herempn,15662.0,16944.0,14381.0,779.156439,1281.712342,0.049748,299.0
Contra Costa,Self,mwtempn,3809.0,4486.0,3133.0,411.228138,676.470287,0.107962,74.0
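
One way to act on the reliability concern is to flag cells whose coefficient of variation exceeds some cutoff. The sketch below recomputes the coefficient of variation for a few rows copied from the summary above; the 0.15 cutoff and the name `CV_THRESHOLD` are assumptions for illustration, not a threshold used elsewhere in this notebook.

```python
import pandas as pd

# A few self-employed rows copied from the summary above (Total and se columns)
rows = pd.DataFrame(
    {'Total': [640.0, 25920.0, 4719.0, 3809.0],
     'se': [145.342763, 814.258217, 412.647708, 411.228138]},
    index=pd.MultiIndex.from_tuples(
        [('Alameda', 'agrempn'), ('Alameda', 'fpsempn'),
         ('Alameda', 'retempn'), ('Contra Costa', 'mwtempn')],
        names=['wrk_county', 'mtc6']))

# Coefficient of variation: standard error relative to the estimate
rows['cv'] = rows['se'] / rows['Total']

# Hypothetical cutoff; estimates above it are treated as unreliable
CV_THRESHOLD = 0.15
flagged = rows[rows['cv'] > CV_THRESHOLD]
print(flagged)
```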


In [44]:
# pers_data.query('STCOUNTY.isin(@bayareafips_full)').groupby(
#     ['STCOUNTY',  'cow']).apply(simple_estimate_SE, years=2)


In [45]:
county_industry_marginals = cow_indus_summary_pow.Total.loc(0)[:,'Self'].reset_index(
    1).Total.unstack().fillna(0).astype(int)#.to_records()
county_industry_marginals.index = county_industry_marginals.index.set_names('county_name')
county_industry_marginals.columns = county_industry_marginals.columns.set_names('industry')
county_industry_marginals

county_name,agrempn,fpsempn,herempn,mwtempn,othempn,retempn
Alameda,640,25920,29133,11251,10114,4719
Contra Costa,849,19891,15662,3809,6230,2889
Marin,338,9496,7400,1652,2843,1271
Napa,427,1882,2565,952,1266,403
San Francisco,0,24105,19800,7751,8352,4281
San Mateo,281,14022,11067,4142,5413,1736
Santa Clara,253,31829,24836,10019,10870,4476
Solano,256,4263,5152,1707,2286,1162
Sonoma,932,9414,10761,2761,5262,2696


# CTPP data

In [46]:
# ctpp_cow_codes = {1: 'Total, all class of worker',
#              2: 'Private for-profit wage and salary workers',
#              3: 'Private not-for-profit wage and salary workers',
#              4: 'Local government workers',
#              5: 'State government workers',
#              6: 'Federal government workers',
#              7: 'Self-employed workers in own not incorporated business',
#              8: 'Self-employed workers in own, incorporated business',
#              9: 'Unpaid family workers'}

ctpp_cow_codes = {#1: 'Total, all class of worker',
             2: 'Private',
             3: 'Private',
             4: 'Government',
             5: 'Government',
             6: 'Government',
             7: 'Self-employed',
             8: 'Self-employed',
             9: 'Unpaid family workers'}

In [47]:
indus_to_mtc = pd.read_excel(
    '/Users/aolsen/Box/Modeling and Surveys/Regional Modeling/Regional Forecast PBA50 Plus Update/mappings/NAICS_to_ABAG_SECTORS.xlsx', 'ctpp_to_mtc')
indus_to_mtc = indus_to_mtc.groupby('CTPP2').MTCname.first()
indus_to_mtc

CTPP2
Agriculture, forestry, fishing and hunting, and mining                                  agrempn
Armed forces                                                                            othempn
Arts, entertainment, recreation, accommodation and food services                        herempn
Construction                                                                            othempn
Educational, health and social services                                                 herempn
Finance, insurance, real estate and rental and leasing                                  fpsempn
Information                                                                             othempn
Manufacturing                                                                           mwtempn
Other services (except public administration)                                           herempn
Professional, scientific, management, administrative,  and waste management services    fpsempn
Public administration             

In [48]:
data_dict_all = pd.read_excel(
    '/Users/aolsen/Dropbox/Documents/Data/_Census/CTPP/ACS2012_2016/docs/2012-2016 CTPP Final Table Specs.xlsx', 'Table Specs', skiprows=[1])

# detailed / itemized
ctpp_indus_codes = data_dict_all.query('`Table ID`=="A202212" & Type=="D" ').set_index([
    'Line Number']).Stub

# summary level totals
ctpp_indus_codes = data_dict_all.query('`Table ID`=="A202212" & Type=="I" ').set_index([
    'Line Number']).Stub
ctpp_indus_codes

Line Number
2.0     Agriculture, forestry, fishing and hunting, an...
3.0                                          Construction
4.0                                         Manufacturing
5.0                                       Wholesale trade
6.0                                          Retail trade
7.0         Transportation and warehousing, and utilities
8.0                                           Information
9.0     Finance, insurance, real estate and rental and...
10.0    Professional, scientific, management, administ...
11.0              Educational, health and social services
12.0    Arts, entertainment, recreation, accommodation...
13.0        Other services (except public administration)
14.0                                Public administration
15.0                                         Armed forces
Name: Stub, dtype: object

In [49]:
ctpp_indus_cow_all_codes = data_dict_all.query(
    '`Table ID`=="A202220" & `Line Number`>0').set_index(['Line Number']).Stub  # .to_dict()
ctpp_indus_cow_part_codes = data_dict_all.query(
    '`Table ID`=="A202220" & 90<`Line Number`<120').set_index(['Line Number']).Stub  # .to_dict()

In [50]:
# ctpp_occ_codes = pd.read_excel(os.path.join(
#     drop, 'Documents/Data/_Census/CTPP/ACS2006_2010/occupation_codes.xlsx'), 'occ', index_col=0).occ
# #indcodes = pd.read_excel(os.path.join(drop,'Documents/Data/_Census/CTPP/2006_2010/occupation_codes.xlsx'),'indus',index_col=0).indus
# ctpp_occ_codes

In [51]:
data_cow_indus = pd.read_csv(
    '/Users/aolsen/Dropbox/Documents/Data/_Census/CTPP/ACS2012_2016/06/CA_2012thru2016_A202220.csv')
data_cow_indus['SUMLEVEL'] = data_cow_indus.GEOID.str.slice(0, 3)
data_cow_indus.head()

,GEOID,TBLID,LINENO,EST,MOE,SOURCE,SUMLEVEL
0,C2200US06,A202220,1,17192045,"+/-20,898",,C22
1,C2200US06,A202220,2,399070,"+/-5,797",,C22
2,C2200US06,A202220,3,1024990,"+/-7,841",,C22
3,C2200US06,A202220,4,1657620,"+/-10,586",,C22
4,C2200US06,A202220,5,519670,"+/-5,831",,C22


In [52]:
def process_CTTP_data(table_id='A202102', sumlevel='C31'):
    """
    Load a CTPP 2012-2016 table for California and subset it to the Bay Area.

    Args:
        table_id (str): CTPP table ID, e.g. 'A202102' (class of worker) or 'A202212' (industry).
        sumlevel (str): Three-character summary level prefix of GEOID, e.g. 'C31' for tracts.

    Returns:
        pd.DataFrame: Rows at the requested summary level within the Bay Area counties,
        with `value`, `county_name` and `geoid10` columns added.
    """

    # Load the CTPP data
    file_path = f'/Users/aolsen/Dropbox/Documents/Data/_Census/CTPP/ACS2012_2016/06/CA_2012thru2016_{table_id}.csv'
    print(file_path)

    data = pd.read_csv(
        file_path)
    data['SUMLEVEL'] = data.GEOID.str.slice(0, 3)

    # Subset data to relevant summary level; copy to avoid SettingWithCopyWarning below
    data_tract = data[data.SUMLEVEL == sumlevel].copy()

    # Get the numeric value, stripping formatting characters
    data_tract['value'] = pd.to_numeric(data_tract.EST.str.replace(',', ''))

    # Extract county name, geoid10
    data_tract['county_name'] = data_tract.GEOID.str.slice(
        7, 12).map(bayareafips_full)

    # this will be tract level detail when C31 is passed as sumlevel
    data_tract['geoid10'] = data_tract.GEOID.str.slice(7, 18)

    # Subset to Bay Area tracts
    cow_data_tract_bayarea = data_tract[data_tract.county_name.isin(
        bayareafips_full.values())]

    return cow_data_tract_bayarea
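
The string slicing in this function can be checked on a couple of rows. The sketch below uses the state-level GEOID shown earlier plus a tract-level GEOID that is assumed to follow the same pattern (summary level prefix, then `US`, then the 11-digit tract ID); the `EST` values are made up to show the comma stripping.

```python
import pandas as pd

# Two illustrative rows: a state-level record and an assumed tract-level record
sample = pd.DataFrame({
    'GEOID': ['C2200US06', 'C3100US06001400100'],
    'EST': ['17,192,045', '185'],
})

# First three characters carry the summary level
sample['SUMLEVEL'] = sample.GEOID.str.slice(0, 3)

# Characters 7-11 are state+county FIPS, characters 7-17 the full tract ID
sample['county_fips'] = sample.GEOID.str.slice(7, 12)
sample['geoid10'] = sample.GEOID.str.slice(7, 18)

# Strip thousands separators before converting to numbers
sample['value'] = pd.to_numeric(sample.EST.str.replace(',', ''))

print(sample[['SUMLEVEL', 'county_fips', 'geoid10', 'value']])
```

For the state-level row the slices simply return the two-digit state code, so the county lookup yields no match and the row drops out of the Bay Area subset.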

## Tract level data for industry

In [53]:
indus_data_tract_bayarea = process_CTTP_data(
    table_id='A202212', sumlevel='C31')
indus_data_tract_bayarea['industry'] = indus_data_tract_bayarea.LINENO.map(
    ctpp_indus_codes.map(indus_to_mtc))

# get just the high level industry total numbers from the A202212 table
indus_linenos = list(range(2, 16))

indus_data_tract_bayarea = indus_data_tract_bayarea.query(
    'LINENO.isin(@indus_linenos)')

indus_data_tract_bayarea = indus_data_tract_bayarea.groupby(
    ['geoid10', 'county_name', 'industry']).value.sum().unstack('industry').fillna(0).astype(int)
indus_data_tract_bayarea

/Users/aolsen/Dropbox/Documents/Data/_Census/CTPP/ACS2012_2016/06/CA_2012thru2016_A202212.csv



geoid10,county_name,agrempn,fpsempn,herempn,mwtempn,othempn,retempn
06001400100,Alameda,10,205,640,79,90,15
06001400200,Alameda,0,185,825,120,45,325
06001400300,Alameda,0,555,1055,145,275,275
06001400400,Alameda,0,230,320,80,85,120
06001400500,Alameda,0,190,355,70,14,65
...,...,...,...,...,...,...,...
06097154201,Sonoma,0,155,335,505,155,240
06097154202,Sonoma,50,200,365,100,180,95
06097154302,Sonoma,0,245,565,90,145,45
06097154303,Sonoma,10,100,164,19,80,15


## Tract level data for class of workers

In [54]:
cow_data_tract_bayarea = process_CTTP_data(table_id='A202102', sumlevel='C31')
cow_data_tract_bayarea['class_of_worker'] = cow_data_tract_bayarea.LINENO.map(
    ctpp_cow_codes)
cow_data_tract_bayarea = cow_data_tract_bayarea.query('LINENO!=1')
cow_data_tract_bayarea = cow_data_tract_bayarea.groupby(['geoid10','county_name','class_of_worker']).value.sum().unstack('class_of_worker').fillna(0).astype(int)
cow_data_tract_bayarea.head()

/Users/aolsen/Dropbox/Documents/Data/_Census/CTPP/ACS2012_2016/06/CA_2012thru2016_A202102.csv



geoid10,county_name,Government,Private,Self-employed,Unpaid family workers
06001400100,Alameda,160,705,185,0
06001400200,Alameda,39,1235,214,4
06001400300,Alameda,195,1415,680,10
06001400400,Alameda,40,595,195,0
06001400500,Alameda,90,430,174,0


## County level data for industry

In [55]:
indus_data_county_bayarea = process_CTTP_data(
    table_id='A202212', sumlevel='C29')
indus_data_county_bayarea['industry'] = indus_data_county_bayarea.LINENO.map(
    ctpp_indus_codes.map(indus_to_mtc))

# get just the high level industry total numbers from the A202212 table
indus_linenos = list(range(2, 16))

indus_data_county_bayarea = indus_data_county_bayarea.query(
    'LINENO.isin(@indus_linenos)')

indus_data_county_bayarea = indus_data_county_bayarea.groupby(
    ['county_name', 'industry']).value.sum().unstack('industry').fillna(0).astype(int)
indus_data_county_bayarea

/Users/aolsen/Dropbox/Documents/Data/_Census/CTPP/ACS2012_2016/06/CA_2012thru2016_A202212.csv



county_name,agrempn,fpsempn,herempn,mwtempn,othempn,retempn
Alameda,2645,145635,258620,144135,89380,69045
Contra Costa,4155,92530,143425,48990,47920,41855
Marin,860,33000,49655,12050,17685,14260
Napa,5815,10095,30560,14220,8100,7405
San Francisco,1430,224575,232170,67100,99305,61320
San Mateo,2055,105775,118455,73360,51025,38575
Santa Clara,6055,250715,313780,238000,138805,94400
Solano,2395,19820,53515,21590,27290,18365
Sonoma,7480,37645,83575,37110,28380,26895


In [56]:
indus_data_county_bayarea_pct = indus_data_county_bayarea.div(
    indus_data_county_bayarea.sum(axis=1), axis=0)
indus_data_county_bayarea_pct

county_name,agrempn,fpsempn,herempn,mwtempn,othempn,retempn
Alameda,0.003728,0.205276,0.364531,0.203162,0.125983,0.09732
Contra Costa,0.010967,0.244223,0.378555,0.129304,0.12648,0.110472
Marin,0.006745,0.258803,0.38942,0.094502,0.138695,0.111834
Napa,0.076317,0.132489,0.401076,0.186626,0.106306,0.097185
San Francisco,0.002085,0.327417,0.33849,0.097828,0.144781,0.089401
San Mateo,0.005279,0.271744,0.30432,0.188467,0.131087,0.099102
Santa Clara,0.005812,0.240666,0.301203,0.228461,0.133242,0.090616
Solano,0.016751,0.138626,0.374296,0.151005,0.190873,0.128449
Sonoma,0.033833,0.170274,0.378022,0.167854,0.128367,0.12165


## County level data for class of workers

In [57]:
cow_data_county_bayarea = process_CTTP_data(table_id='A202102', sumlevel='C29')
cow_data_county_bayarea['class_of_worker'] = cow_data_county_bayarea.LINENO.map(
    ctpp_cow_codes)
cow_data_county_bayarea = cow_data_county_bayarea.query('LINENO!=1')
cow_data_county_bayarea = cow_data_county_bayarea.groupby(
    ['county_name', 'class_of_worker']).value.sum().unstack('class_of_worker').fillna(0).astype(int)
cow_data_county_bayarea.head()

/Users/aolsen/Dropbox/Documents/Data/_Census/CTPP/ACS2012_2016/06/CA_2012thru2016_A202102.csv



county_name,Government,Private,Self-employed,Unpaid family workers
Alameda,106340,528380,73790,960
Contra Costa,47970,280255,49945,715
Marin,14270,88870,24165,210
Napa,9565,57660,8855,115
San Francisco,87115,523775,74285,720


In [58]:
tract_seed_self_employed = indus_data_county_bayarea_pct.mul(
    cow_data_tract_bayarea['Self-employed'], axis=0)
#tract_seed_self_employed.columns.set_names('class_detail')
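
The seed construction multiplies each tract's self-employed total by the industry shares of the county it sits in. A toy version with made-up shares and tract totals makes the alignment explicit; here the county shares are looked up per tract before multiplying, instead of relying on index-level alignment, and all names and numbers are illustrative.

```python
import pandas as pd

# Hypothetical county-level industry shares (each row sums to 1)
county_shares = pd.DataFrame(
    {'fpsempn': [0.6, 0.25], 'herempn': [0.4, 0.75]},
    index=pd.Index(['Alameda', 'Napa'], name='county_name'))

# Hypothetical tract-level self-employed totals
tract_self = pd.Series(
    [100, 50, 20],
    index=pd.MultiIndex.from_tuples(
        [('06001400100', 'Alameda'),
         ('06001400200', 'Alameda'),
         ('06055200100', 'Napa')],
        names=['geoid10', 'county_name']),
    name='Self-employed')

# Look up each tract's county shares, then scale them by the tract total
shares_by_tract = county_shares.reindex(
    tract_self.index.get_level_values('county_name'))
shares_by_tract.index = tract_self.index
seed = shares_by_tract.mul(tract_self, axis=0)

print(seed)
```

Each row of the result sums back to the tract's self-employed total, so the seed preserves tract totals while borrowing the county's sectoral mix.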

In [59]:
cow_data_tract_bayarea

geoid10,county_name,Government,Private,Self-employed,Unpaid family workers
06001400100,Alameda,160,705,185,0
06001400200,Alameda,39,1235,214,4
06001400300,Alameda,195,1415,680,10
06001400400,Alameda,40,595,195,0
06001400500,Alameda,90,430,174,0
...,...,...,...,...,...
06097154201,Sonoma,80,1175,120,10
06097154202,Sonoma,210,575,215,0
06097154302,Sonoma,175,700,185,25
06097154303,Sonoma,59,215,115,0


# Update 2012-2016 data to 2019/2021 data using IPF

In [60]:
from ipfn import ipfn

# first - seed df - in long form
tract_data = tract_seed_self_employed.stack().reset_index(name='total')
tract_data_orig = tract_data.copy()
print('Before: ', tract_data_orig.total.sum())

# second - prep marginals - county x industry totals
margins_data_long = county_industry_marginals.stack()
#margins_data_long.index = margins_data_long.index.set_names('class_detail', 1)

# denote marginals and their mappings
aggregates = [margins_data_long]
dimensions = [['county_name', 'industry']]

# call the ipf on the data
IPF = ipfn.ipfn(tract_data, aggregates, dimensions)
tract_data_updated = IPF.iteration()
print('After: ', tract_data_updated.total.sum())

Before:  329196.0
After:  391487.0
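
Because only one marginal (county by industry) is supplied here, the fit amounts to a single proportional scaling within each county-industry group. The sketch below reproduces that scaling with plain pandas on made-up numbers; it is meant as a way to reason about the result, not as a description of what `ipfn` does internally.

```python
import pandas as pd

# Hypothetical seed in long form: two tracts, two industries, one county
seed = pd.DataFrame({
    'county_name': ['Alameda', 'Alameda', 'Alameda', 'Alameda'],
    'industry': ['fpsempn', 'fpsempn', 'herempn', 'herempn'],
    'geoid10': ['06001400100', '06001400200', '06001400100', '06001400200'],
    'total': [30.0, 10.0, 60.0, 20.0],
})

# Hypothetical county x industry targets
targets = pd.DataFrame({
    'county_name': ['Alameda', 'Alameda'],
    'industry': ['fpsempn', 'herempn'],
    'target': [80.0, 100.0],
})

keys = ['county_name', 'industry']
merged = seed.merge(targets, on=keys, how='left')

# Scale every tract by the ratio of the target to the current group sum
group_sums = merged.groupby(keys)['total'].transform('sum')
merged['scaled'] = merged['total'] * merged['target'] / group_sums

print(merged[['geoid10', 'industry', 'total', 'scaled']])
```

Within each group the tracts keep their relative shares from the seed; only the group total changes.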


In [61]:
tract_data_updated

,county_name,industry,geoid10,total
0,Alameda,agrempn,06001400100,1.974222
1,Alameda,agrempn,06001400200,2.283694
2,Alameda,agrempn,06001400300,7.256599
3,Alameda,agrempn,06001400400,2.080936
4,Alameda,agrempn,06001400500,1.856836
...,...,...,...,...
9475,Sonoma,retempn,06097154201,11.363541
9476,Sonoma,retempn,06097154202,20.359677
9477,Sonoma,retempn,06097154302,17.518792
9478,Sonoma,retempn,06097154303,10.890060


In [62]:
self_emp_combo = pd.concat([tract_data_orig.set_index(['county_name', 'geoid10', 'industry']).total,
                            tract_data_updated.set_index(['county_name', 'geoid10', 'industry']).total], keys=['orig', 'upd']).unstack(0)
self_emp_combo.head()

county_name,geoid10,industry,orig,upd
Alameda,06001400100,agrempn,0.689715,1.974222
Alameda,06001400100,fpsempn,37.976031,79.95598
Alameda,06001400100,herempn,67.438192,89.86719
Alameda,06001400100,mwtempn,37.584889,34.706201
Alameda,06001400100,othempn,23.306881,31.198873


## Check against marginals
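The updated totals should match the marginals exactly along the fitted dimensions.

This checks out: in the tables below, the ratio of updated totals to marginals is 1.0 for every industry and every county.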


In [63]:
self_emp_combo.groupby(level='industry').sum()


industry,orig,upd
agrempn,3206.604528,3976.0
fpsempn,79818.279738,140822.0
herempn,113754.09463,126376.0
mwtempn,55254.038407,44044.0
othempn,44168.411476,52636.0
retempn,32994.571221,23633.0


In [64]:
self_emp_combo.groupby(level='industry').sum().div(
    margins_data_long.groupby(level='industry').sum(), axis=0)


industry,orig,upd
agrempn,0.80649,1.0
fpsempn,0.566803,1.0
herempn,0.900124,1.0
mwtempn,1.254519,1.0
othempn,0.839129,1.0
retempn,1.396123,1.0


In [65]:
# ratio of tract totals to target marginals, by county
self_emp_combo.groupby(level='county_name').sum().div(
    margins_data_long.groupby(level='county_name').sum(), axis=0)

county_name,orig,upd
Alameda,0.733372,1.0
Contra Costa,0.809548,1.0
Marin,0.836522,1.0
Napa,0.970914,1.0
San Francisco,0.9142,1.0
San Mateo,0.91315,1.0
Santa Clara,0.863277,1.0
Solano,0.743154,1.0
Sonoma,0.894552,1.0


In [66]:
self_emp_combo_df = self_emp_combo.round(0).astype(int).unstack('industry')


self_emp_combo_df.columns = ['_'.join(col).strip() for col in self_emp_combo_df.columns.values]
# county totals (index level 0 is county_name)
self_emp_combo_df.groupby(level=0).sum()

county_name,orig_agrempn,orig_fpsempn,orig_herempn,orig_mwtempn,orig_othempn,orig_retempn,upd_agrempn,upd_fpsempn,upd_herempn,upd_mwtempn,upd_othempn,upd_retempn
Alameda,208,12301,21868,12154,7553,5833,641,25925,29126,11255,10107,4733
Contra Costa,431,9759,15113,5160,5051,4409,853,19883,15669,3807,6229,2890
Marin,129,4976,7491,1821,2668,2152,340,9497,7402,1650,2842,1271
Napa,553,960,2912,1356,775,705,428,1883,2568,953,1265,402
San Francisco,110,19237,19893,5747,8513,5252,0,24106,19805,7757,8362,4284
San Mateo,177,9100,10190,6311,4382,3300,279,14020,11071,4140,5411,1737
Santa Clara,398,17095,21412,16230,9461,6432,217,31829,24814,10017,10863,4481
Solano,192,1523,4126,1666,2103,1415,252,4267,5147,1711,2281,1161
Sonoma,960,4850,10764,4775,3652,3461,931,9413,10762,2770,5264,2695


In [125]:
# store in a long form dataframe

out_df = self_emp_combo.upd.round(0).astype(int).reset_index(name='value')
out_df['tract10'] = out_df.geoid10
out_df.head()

Unnamed: 0,county_name,geoid10,industry,value,tract10
0,Alameda,6001400100,agrempn,2,6001400100
1,Alameda,6001400100,fpsempn,80,6001400100
2,Alameda,6001400100,herempn,90,6001400100
3,Alameda,6001400100,mwtempn,35,6001400100
4,Alameda,6001400100,othempn,31,6001400100


In [80]:
import os

# write to disk
out_path = '.'

out_df.to_csv(os.path.join(out_path,'tract_self_employed_workers_2020.csv'))

# Translate to TAZ geographies

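We need the data at the TAZ level, so we translate the tract totals to TAZ geographies.

* We assign Census 2010 blocks to the containing TAZ based on each block's representative point (a stand-in for the centroid that is guaranteed to fall inside the block).
* We then use WAC job counts as weights: we sum block-level jobs by tract and TAZ and compute, for each tract, the share of its jobs that falls in each related TAZ. Those shares are used to distribute the tract-level self-employed workers to TAZs.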
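The weighting step can be sketched on toy data as follows. This is a minimal illustration under stated assumptions: the block, tract, and zone identifiers and all counts are made up, and the sketch uses an explicit `merge` on `tract10` to apply the shares, which is a variation on (not a copy of) the index-aligned multiplication used in the cells further down.

```python
import pandas as pd

# Toy blocks: parent tract, the TAZ containing the block's point, and a WAC job count.
blocks = pd.DataFrame({
    'tract10': ['T1', 'T1', 'T1', 'T2'],
    'zone_id': [101, 101, 102, 102],
    'jobs':    [30, 10, 60, 25],
})

# Jobs in each tract/TAZ overlap, and each overlap's share of its tract's jobs.
jobs_by_tract_zone = blocks.groupby(['tract10', 'zone_id'])['jobs'].sum()
tract_jobs = jobs_by_tract_zone.groupby(level='tract10').transform('sum')
share = (jobs_by_tract_zone / tract_jobs).rename('share')

# Toy tract-level self-employed workers by industry.
tract_workers = pd.DataFrame({
    'tract10':  ['T1', 'T1', 'T2'],
    'industry': ['retempn', 'fpsempn', 'retempn'],
    'value':    [50, 20, 8],
})

# Attach every TAZ share to every tract row, then scale the tract value by the share.
allocated = tract_workers.merge(share.reset_index(), on='tract10')
allocated['value'] = allocated['value'] * allocated['share']

# Sum the allocated pieces to TAZ by industry.
taz_workers = allocated.groupby(['zone_id', 'industry'])['value'].sum()
print(taz_workers)
```

Because the shares sum to 1 within each tract, the regional total is preserved before rounding. A tract with zero WAC jobs would have undefined (0/0) shares, so its workers would not be allocated unless a fallback weight is supplied.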

In [212]:
# WAC data, LODES 7 vintage (2010 census block geography)
wac_2014 = pd.read_csv(
    '/Users/aolsen/Dropbox/Documents/Data/_Census/LEHD/LODES/workplace_area_characteristics/ca_wac_S000_JT00_2014.csv', dtype={'w_geocode': str})
wac_2014

Unnamed: 0,w_geocode,C000,CA01,CA02,CA03,CE01,CE02,CE03,CNS01,CNS02,...,CFA02,CFA03,CFA04,CFA05,CFS01,CFS02,CFS03,CFS04,CFS05,createdate
0,060014001001007,45,17,15,13,16,7,22,0,0,...,0,0,0,0,0,0,0,0,0,20190825
1,060014001001008,26,3,15,8,3,5,18,0,0,...,0,0,0,0,0,0,0,0,0,20190825
2,060014001001017,9,2,3,4,1,1,7,0,0,...,0,0,0,0,0,0,0,0,0,20190825
3,060014001001024,22,9,10,3,9,6,7,0,0,...,0,0,0,0,0,0,0,0,0,20190825
4,060014001001026,4,1,2,1,0,3,1,0,0,...,0,0,0,0,0,0,0,0,0,20190825
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
240521,061150411004037,16,2,8,6,4,7,5,0,0,...,0,0,0,0,0,0,0,0,0,20190825
240522,061150411004040,2,0,0,2,0,2,0,0,0,...,0,0,0,0,0,0,0,0,0,20190825
240523,061150411004047,6,0,3,3,4,1,1,0,0,...,0,0,0,0,0,0,0,0,0,20190825
240524,061150411004051,9,5,2,2,3,4,2,0,0,...,0,0,0,0,0,0,0,0,0,20190825


In [213]:
# Get TAZ zones
user = 'aolsen'
zones_path = f'/Users/{user}/Box/Modeling and Surveys/Urban Modeling/Spatial/Zones/TAZ1454/zones1454.shp'

zones = gpd.read_file(
    zones_path).to_crs('EPSG:26910')

In [94]:
import os
import geopandas as gpd
import pandas as pd
from pygris import blocks

# County FIPS codes for the nine Bay Area counties, in the original concatenation order
BAYAREA_COUNTY_FIPS = {
    'Marin': '041', 'Napa': '055', 'Solano': '095', 'Sonoma': '097',
    'San Mateo': '081', 'Santa Clara': '085', 'San Francisco': '075',
    'Contra Costa': '013', 'Alameda': '001',
}


def fetch_bayarea_blocks(output_path, year):
    """Return Bay Area census blocks in EPSG:26910, using a feather file at output_path as a cache."""
    if not os.path.exists(output_path):
        # download each county's blocks and stack them into one GeoDataFrame
        bayarea_blocks = pd.concat(
            [blocks(state='CA', county=fips, year=year, cache=True)
             for fips in BAYAREA_COUNTY_FIPS.values()])
        bayarea_blocks.to_feather(output_path)
    else:
        bayarea_blocks = gpd.read_feather(output_path)

    return bayarea_blocks.to_crs('EPSG:26910')


year = 2010
output_path = f'/Users/aolsen/Downloads/bayarea_blocks_{year}.feather'
bayarea_blocks = fetch_bayarea_blocks(output_path, year)

In [183]:
bayarea_blocks['jobs'] = bayarea_blocks.GEOID10.map(
    wac_2014.set_index('w_geocode').C000)
bayarea_blocks['jobs'].sum()

bayarea_blocks['tract10'] = bayarea_blocks.GEOID10.str.slice(0, 11)

In [107]:
# get centroid / representative point for block
bayarea_blocks['geom_pt'] = bayarea_blocks.representative_point()

In [190]:
# for each block, get the zone it falls within

blocks_x_zones = gpd.sjoin(bayarea_blocks.set_geometry('geom_pt'), zones)

In [214]:
# sum BLOCK level jobs into both ZONES and TRACTS - so we can figure out the share of each tract that goes to each zone

pct = lambda x: x/x.sum()

jobs_by_tract_zone = blocks_x_zones.groupby(['tract10','zone_id']).jobs.sum()
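# group_keys=False keeps the (tract10, zone_id) index instead of prepending a duplicate tract10 level
jobs_by_tract_zone_pct = jobs_by_tract_zone.groupby(
    level='tract10', observed=True, group_keys=False).apply(pct)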

jobs_by_tract_zone_pct


tract10      zone_id
06001400100  914        0.002128
             1005       0.997872
             1007       0.000000
             1028       0.000000
             1155       0.000000
                          ...   
06097154202  1402       1.000000
06097154302  1389       1.000000
             1390       0.000000
06097154303  1403       1.000000
06097154304  1403       1.000000
Name: jobs, Length: 3454, dtype: float64

In [215]:
self_emp_series = out_df.set_index(['tract10','industry']).value#.unstack('industry')
self_emp_series

tract10      industry
06001400100  agrempn      2
             fpsempn     80
             herempn     90
             mwtempn     35
             othempn     31
                         ..
06097154304  fpsempn     36
             herempn     41
             mwtempn     11
             othempn     20
             retempn     10
Name: value, Length: 9480, dtype: int64

In [216]:
# distribute to TAZs
self_emp_distributed = self_emp_series.mul(jobs_by_tract_zone_pct)

In [217]:
# sum to TAZs
# sum(level=...) is deprecated in pandas; group on the index levels instead
self_emp_distributed_out = self_emp_distributed.groupby(
    level=['zone_id', 'industry']).sum().round(0).astype(int).reset_index(name='value')
self_emp_distributed_out


Unnamed: 0,zone_id,industry,value
0,914,agrempn,2
1,1005,agrempn,2
2,1007,agrempn,0
3,1028,agrempn,1
4,1155,agrempn,3
...,...,...,...
8713,1403,fpsempn,74
8714,1403,herempn,84
8715,1403,mwtempn,22
8716,1403,othempn,41


In [218]:
self_emp_distributed_out.to_csv(os.path.join(out_path,'taz_self_employed_workers_2020.csv'))