## Background

This script estimates an updated work-from-home (WFH) model for TM1.6.1. The [initial WFH model created for TM1.6.0](https://github.com/BayAreaMetro/travel-model-one/pull/63) was estimated using ACS 2021 data (and later updated to use ACS 2022 data) based on household income, home county, and industry. However, ACS and ACS PUMS data on WFH have a limitation: for workers who report WFH as their primary journey-to-work mode for the reference week, no distinct workplace (separate from home) is reported.

MTC staff looked at data from the [2023 Bay Area Travel Study (BATS)](https://mtc.ca.gov/tools-resources/survey-program), where workplace location was reported for all respondents who had a workplace, even if they worked from home on a survey day. We [found that the distances between work and home tended to be longer for people who worked from home](https://10ay.online.tableau.com/t/metropolitantransportationcommission/views/BATS-2023-SurveyDataViz--WeightedDataset_09112024_17261683633090/Dist-to-WorkTable), which is intuitive because a longer commute would incentivize workers to WFH. However, the implemented ACS-based WFH model produced the reverse pattern.

Therefore, the goal here is to estimate a simple model using BATS 2023 data and a similar set of independent variables (to minimize implementation effort, given the limited time available), with the addition of a distance-to-work term.

Asana task (internal): [Estimate WFH binomial logit model using BATS2023](https://app.asana.com/0/15119358130897/1208621825395379/f)

In [1]:
import pandas as pd
import statsmodels
import os
import pathlib

pd.options.display.max_rows = 1000

## 1. Prepare BATS 2023 data

In [2]:
# default is C:/Users/username/Box
BOX_ROOT_DIR = pathlib.Path("C:/Users") / os.environ['USERNAME'] / "Box"
if (os.environ['USERNAME'] == 'lzorn'):
    BOX_ROOT_DIR = pathlib.Path("E:/Box")

BATS_DATA_DIR = BOX_ROOT_DIR / "Modeling and Surveys" / "Surveys" / "Travel Diary Survey" / "Biennial Travel Diary Survey" / \
    "MTC_RSG_Partner Repository" / "5.Deliverables" / "Task 10 - Final Weighted and Expanded Data Files" / "WeightedDataset_09112024"

### 1.1 Read Households data for household income, home location

In [3]:
# read households
bats_hhlds = pd.read_csv(BATS_DATA_DIR / "hh.csv")
print(f"Read {len(bats_hhlds):,} rows from \"{BATS_DATA_DIR / 'hh.csv'}\"")

# print("\nbats_hhlds.dtypes:")
# print(bats_hhlds.dtypes)

# select relevant variables
bats_hhlds = bats_hhlds[[
    'hh_id',
    'home_lon','home_lat','home_in_region','home_county',
    'income_detailed','income_followup','income_broad','income_imputed']]

# filter to home_in_region? Not necessary -- these are all 1
print("\nAll records have home_in_region==1:")
print(bats_hhlds['home_in_region'].value_counts(dropna=False))

# use income_imputed?
print("\nIncome variable tabulation:")
print(bats_hhlds[['income_detailed','income_imputed']].value_counts(dropna=False))

Read 8,258 rows from "E:\Box\Modeling and Surveys\Surveys\Travel Diary Survey\Biennial Travel Diary Survey\MTC_RSG_Partner Repository\5.Deliverables\Task 10 - Final Weighted and Expanded Data Files\WeightedDataset_09112024\hh.csv

All records have home_in_region==1:
home_in_region
1    8258
Name: count, dtype: int64

Income variable tabulation:
income_detailed  income_imputed   
7                $100,000-$199,999    1370
10               $200,000 or more     1235
8                $100,000-$199,999     935
6                $75,000-$99,999       852
5                $50,000-$74,999       761
9                $200,000 or more      622
999              $100,000-$199,999     547
1                Under $25,000         428
4                $25,000-$49,999       417
2                Under $25,000         305
3                $25,000-$49,999       303
999              $200,000 or more      261
                 Under $25,000         122
                 $25,000-$49,999        45
                

### 1.2 Read Person data for employment status, work location and industry

Filter to employed persons who have a recorded work location in the region.
Merge with households information.

In [4]:
# read persons
bats_persons = pd.read_csv(BATS_DATA_DIR / "person.csv")
print(f"Read {len(bats_persons):,} rows from \"{BATS_DATA_DIR / 'person.csv'}\"")

# print("\nbats_persons.dtypes:")
# print(bats_persons.dtypes)

# select relevant variables
bats_persons = bats_persons[[
    'hh_id','person_id',
    'employment','work_lat','work_lon','work_in_region','work_county','industry',
    'can_telework']]

print("\nFiltering to: employment == 1 Employed full-time (paid) or 2 Employed part-time (paid)")
bats_persons = bats_persons.loc[bats_persons.employment.isin([1,2])]
print(f" => {len(bats_persons):,} rows")

# set variable for has_work_location
# True only when both work coordinates are present
bats_persons['has_work_location'] = bats_persons.work_lat.notna() & bats_persons.work_lon.notna()

print("\nWork in region vs has_work_location tabulation:")
print(bats_persons[["work_in_region","has_work_location"]].value_counts(dropna=False))

# It looks like work_in_region==995 => has_work_location==False
# Keep only work_in_region==1 with a work location;
# this drops work_in_region==0 and work_in_region==995 (no work location)
print("\nFiltering to: work_in_region==1 and has_work_location==True")
bats_persons = bats_persons.loc[(bats_persons.work_in_region == 1)&
                                (bats_persons.has_work_location==True)]
print(f" => {len(bats_persons):,} rows")



Read 15,985 rows from "E:\Box\Modeling and Surveys\Surveys\Travel Diary Survey\Biennial Travel Diary Survey\MTC_RSG_Partner Repository\5.Deliverables\Task 10 - Final Weighted and Expanded Data Files\WeightedDataset_09112024\person.csv"

Filtering to: employment == 1 Employed full-time (paid) or 2 Employed part-time (paid)
 => 8,374 rows

Work in region vs has_work_location tabulation:
work_in_region  has_work_location
1               True                 5982
995             False                2328
0               True                   64
Name: count, dtype: int64

Filtering to: work_in_region==1 and has_work_location==True
 => 5,982 rows


### 1.3 Merge persons with households

In [5]:
# Merge with households
bats_persons = pd.merge(
    left=bats_persons,
    right=bats_hhlds,
    on=['hh_id'],
    how='left',
    validate='many_to_one',
    indicator=True
)
# verify all person records have household information
assert all(bats_persons['_merge'] == 'both'), "Not all values in _merge are 'both'"
bats_persons.drop(columns=['_merge'],inplace=True)
# bats_persons.head()
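
After this merge, each person record carries both home (`home_lat`, `home_lon`) and work (`work_lat`, `work_lon`) coordinates, which is what a distance-to-work term needs. One possible way to derive it is a great-circle (haversine) distance, sketched below. This is an assumption for illustration: the helper `haversine_miles` and the sample coordinates are hypothetical, and the model may instead use a different distance measure, such as network distance.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

# Illustrative coordinates only (roughly San Francisco and Oakland), not survey records
home = (37.7749, -122.4194)
work = (37.8044, -122.2712)
dist_to_work = haversine_miles(*home, *work)
print(f"{dist_to_work:.1f} miles")
```

Applied to the merged table, the same function could be called row by row on the four coordinate columns to populate a distance column.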

### 1.4 Read Day data

In [6]:
# read day data for day of week and telecommute time spent
bats_day = pd.read_csv(BATS_DATA_DIR / "day.csv")
print(f"Read {len(bats_day):,} rows from \"{BATS_DATA_DIR / 'day.csv'}\"")

print("\nbats_day.dtypes:")
print(bats_day.dtypes)

# select relevant variables
bats_day = bats_day[[
    'hh_id','person_id','day_id',
    'travel_date','travel_dow',
    'telecommute_time']]

Read 89,112 rows from "E:\Box\Modeling and Surveys\Surveys\Travel Diary Survey\Biennial Travel Diary Survey\MTC_RSG_Partner Repository\5.Deliverables\Task 10 - Final Weighted and Expanded Data Files\WeightedDataset_09112024\day.csv"

bats_day.dtypes:
hh_id                          int64
num_trips                      int64
person_id                      int64
person_num                     int64
hh_is_complete                 int64
hh_is_complete_a               int64
hh_is_complete_b               int64
surveyable                     int64
is_participant                 int64
is_complete                    int64
is_complete_a                  int64
is_complete_b                  int64
day_id                         int64
day_num                        int64
travel_date                   object
travel_dow                     int64
num_complete_trip_surveys      int64
hh_day_complete                int64
hh_day_complete_a              int64
hh_day_complete_b              int64
num_flagg