# Estimating Auto Ownership

This notebook illustrates how to re-estimate ActivitySim's auto ownership model.  The steps in the process are:
  - Run ActivitySim in estimation mode to read household travel survey files and run the ActivitySim submodels, writing estimation data bundles (EDBs) that contain the model utility specifications, coefficients, chooser data, and alternatives data for each submodel.
  - Read and transform the relevant EDB into the format required by the model estimation package [larch](https://larch.newman.me) and then re-estimate the model coefficients.  No changes to the model specification will be made.
  - Update the ActivitySim model coefficients and re-run the model in simulation mode.
  
The basic estimation workflow is shown below and explained in the next steps.

![estimation workflow](https://github.com/RSGInc/activitysim/raw/develop/docs/images/estimation_example.jpg)

# Load libraries

In [1]:
import os
import larch  # !conda install larch -c conda-forge # for estimation
import pandas as pd

# Review Inputs

In addition to a working ActivitySim model setup, estimation mode requires an ActivitySim format household travel survey.  Its tables closely mirror ActivitySim's simulation model tables:

 - households
 - persons
 - tours
 - joint_tour_participants
 - trips (not yet implemented)

Examples of the ActivitySim format household travel survey are included in the [example_estimation data folders](https://github.com/RSGInc/activitysim/tree/develop/activitysim/examples/example_estimation).  The user is responsible for formatting their household travel survey into the appropriate format.  
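
Before running estimation, it can help to confirm that the formatted survey tables are internally consistent. The following is a minimal sketch, assuming the `household_id` and `person_id` columns shown in the example files; the helper `check_survey_keys` is hypothetical and not part of ActivitySim, and the toy tables stand in for the real CSV files.

```python
import pandas as pd


def check_survey_keys(households: pd.DataFrame, persons: pd.DataFrame) -> dict:
    """Hypothetical helper: report basic key problems in the survey tables."""
    known = set(households["household_id"])
    # Persons whose household_id does not appear in the households table
    orphans = persons.loc[~persons["household_id"].isin(known), "person_id"]
    return {
        "duplicate_households": int(households["household_id"].duplicated().sum()),
        "duplicate_persons": int(persons["person_id"].duplicated().sum()),
        "persons_without_household": orphans.tolist(),
    }


# Toy tables standing in for override_households.csv and override_persons.csv
households = pd.DataFrame({"household_id": [166, 197, 268]})
persons = pd.DataFrame(
    {"person_id": [166, 197, 999], "household_id": [166, 197, 555]}
)
report = check_survey_keys(households, persons)
print(report)
```

With real data, the two frames would come from `pd.read_csv` on the survey files, and any non-empty entry in the report is worth resolving before estimation.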

After creating an ActivitySim format household travel survey, the `scripts/infer.py` script is run to append additional calculated fields.  One example is `household:joint_tour_frequency`, which is calculated from the `tours` and `joint_tour_participants` tables.  

The input survey files are below.

### Survey households

In [2]:
pd.read_csv("../data_sf/survey_data/override_households.csv")

Unnamed: 0,household_id,home_zone_id,income,hhsize,HHT,auto_ownership,num_workers,joint_tour_frequency
0,841891,126,48000,1,4,1,1,0_tours
1,990869,134,48000,2,1,2,2,0_tours
2,125886,113,25900,1,4,1,1,0_tours
3,727893,8,26100,2,1,0,1,0_tours
4,2741769,150,121600,4,1,2,1,0_tours
...,...,...,...,...,...,...,...,...
1995,663493,110,19180,1,6,1,1,0_tours
1996,569375,20,7400,1,6,1,0,0_tours
1997,1445193,17,75000,1,4,0,1,0_tours
1998,2833455,69,0,1,0,0,0,0_tours


### Survey persons

In [3]:
pd.read_csv("../data_sf/survey_data/override_persons.csv")

Unnamed: 0,person_id,household_id,age,PNUM,sex,pemploy,pstudent,ptype,school_zone_id,workplace_zone_id,free_parking_at_work,cdap_activity,mandatory_tour_frequency,_escort,_shopping,_othmaint,_othdiscr,_eatout,_social,non_mandatory_tour_frequency
0,166,166,54,1,2,3,3,4,-1,-1,False,N,,0,0,0,0,1,0,4
1,197,197,46,1,2,3,3,4,-1,-1,False,N,,0,1,0,0,0,0,16
2,268,268,46,1,1,3,3,4,-1,-1,False,N,,0,0,1,1,0,0,9
3,375,375,54,1,2,3,3,4,-1,-1,False,N,,0,0,1,0,0,0,8
4,387,387,44,1,2,3,3,4,-1,-1,False,N,,1,0,0,1,0,0,33
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4400,7554799,2863464,93,1,2,3,3,5,-1,-1,False,N,,0,0,0,1,0,0,1
4401,7554818,2863483,68,1,1,3,3,5,-1,-1,False,N,,0,0,1,1,0,0,9
4402,7555141,2863806,93,1,2,3,3,5,-1,-1,False,N,,0,2,0,1,0,0,17
4403,7555853,2864518,71,1,1,3,3,5,-1,-1,False,N,,0,0,0,0,0,1,2


# Example Setup if Needed

To avoid duplication of inputs, especially model settings and expressions, the `example_estimation` depends on the `example`.  The following commands create an example setup for use.  The location of these example setups (i.e. the folders) is important because the paths are referenced in this notebook.  The commands below download the skims.omx for the SF county example from the [activitysim resources repository](https://github.com/RSGInc/activitysim_resources).

In [None]:
!activitysim create -e example_estimation_sf -d test

# Run the Estimation Example

The next step is to run the model with an `estimation.yaml` settings file with the following settings in order to output the EDB for all submodels:

```
enable: True

bundles:
  - school_location
  - workplace_location
  - auto_ownership
  - free_parking
  - cdap
  - mandatory_tour_frequency
  - mandatory_tour_scheduling
  - joint_tour_frequency
  - joint_tour_composition
  - joint_tour_participation
  - joint_tour_destination
  - joint_tour_scheduling
  - non_mandatory_tour_frequency
  - non_mandatory_tour_destination
  - non_mandatory_tour_scheduling
  - tour_mode_choice
  - atwork_subtour_frequency
  - atwork_subtour_destination
  - atwork_subtour_scheduling
  - atwork_subtour_mode_choice
  
survey_tables:
  households:
    file_name: survey_data/override_households.csv
    index_col: household_id
  persons:
    file_name:  survey_data/override_persons.csv
    index_col: person_id
  tours:
    file_name:  survey_data/override_tours.csv
  joint_tour_participants:
    file_name:  survey_data/override_joint_tour_participants.csv
```

This enables the estimation mode functionality, identifies which models to run and write estimation data bundles (EDBs) for, and specifies the input survey tables, which include the override settings for each model choice.  

With this setup, the model will output an EDB with the following tables for this submodel:
  - model settings - auto_ownership_model_settings.yaml
  - coefficients - auto_ownership_coefficients.csv
  - utilities specification - auto_ownership_SPEC.csv
  - chooser and alternatives data - auto_ownership_values_combined.csv
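
After an estimation-mode run, a quick check that these four files were written can save a confusing error later. This is a minimal sketch: the file names come from the list above, while the EDB directory path and the helper `missing_edb_files` are assumptions made for illustration.

```python
from pathlib import Path

# File names taken from the list above.
EXPECTED_EDB_FILES = [
    "auto_ownership_model_settings.yaml",
    "auto_ownership_coefficients.csv",
    "auto_ownership_SPEC.csv",
    "auto_ownership_values_combined.csv",
]


def missing_edb_files(edb_dir):
    """Return the expected EDB file names that are not present in edb_dir."""
    edb_dir = Path(edb_dir)
    return [name for name in EXPECTED_EDB_FILES if not (edb_dir / name).is_file()]


# Assumed location; adjust to wherever your run wrote the auto ownership EDB.
edb_dir = Path("output") / "estimation_data_bundle" / "auto_ownership"
missing = missing_edb_files(edb_dir)
print("missing:", missing or "none")
```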
  
The following code runs the software in estimation mode, inheriting the settings from the simulation setup and using the San Francisco county data setup.  It produces an EDB for each submodel listed above, but still runs every model step identified in the inherited settings file.  

In [4]:
%cd test

/activitysim/activitysim/examples/example_estimation/notebooks/test


In [None]:
!activitysim run -c configs_estimation/configs -c configs -o output -d data_sf

# Load data and prep model for estimation

In [5]:
from activitysim.estimation.larch.auto_ownership import auto_ownership_model
model, data = auto_ownership_model(return_data=True)

# Review data loaded from the EDB

The next step is to review the data read from the EDB, including the coefficients, model settings, utilities specification, and chooser and alternatives data.

### Coefficients

In [6]:
data.coefficients

Unnamed: 0_level_0,value,constrain
coefficient_name,Unnamed: 1_level_1,Unnamed: 2_level_1
coef_cars1_drivers_2,0.0000,T
coef_cars1_drivers_3,0.0000,T
coef_cars1_persons_16_17,0.0000,T
coef_cars234_asc_marin,0.0000,T
coef_cars1_persons_25_34,0.0000,T
...,...,...
coef_cars4_drivers_3,5.2080,F
coef_cars3_drivers_3,5.5131,F
coef_cars2_drivers_4_up,6.3662,F
coef_cars3_drivers_4_up,8.5148,F


### Utility specification

In [7]:
data.spec

Unnamed: 0,Label,Description,Expression,cars0,cars1,cars2,cars3,cars4
0,util_drivers_2,2 Adults (age 16+),num_drivers==2,,coef_cars1_drivers_2,coef_cars2_drivers_2,coef_cars3_drivers_2,coef_cars4_drivers_2
1,util_drivers_3,3 Adults (age 16+),num_drivers==3,,coef_cars1_drivers_3,coef_cars2_drivers_3,coef_cars3_drivers_3,coef_cars4_drivers_3
2,util_drivers_4_up,4+ Adults (age 16+),num_drivers>3,,coef_cars1_drivers_4_up,coef_cars2_drivers_4_up,coef_cars3_drivers_4_up,coef_cars4_drivers_4_up
3,util_persons_16_17,Persons age 16-17,num_children_16_to_17,,coef_cars1_persons_16_17,coef_cars2_persons_16_17,coef_cars34_persons_16_17,coef_cars34_persons_16_17
4,util_persons_18_24,Persons age 18-24,num_college_age,,coef_cars1_persons_18_24,coef_cars2_persons_18_24,coef_cars34_persons_18_24,coef_cars34_persons_18_24
5,util_persons_25_34,Persons age 35-34,num_young_adults,,coef_cars1_persons_25_34,coef_cars2_persons_25_34,coef_cars34_persons_25_34,coef_cars34_persons_25_34
6,util_presence_children_0_4,Presence of children age 0-4,num_young_children>0,,coef_cars1_presence_children_0_4,coef_cars234_presence_children_0_4,coef_cars234_presence_children_0_4,coef_cars234_presence_children_0_4
7,util_presence_children_5_17,Presence of children age 5-17,(num_children_5_to_15+num_children_16_to_17)>0,,coef_cars1_presence_children_5_17,coef_cars2_presence_children_5_17,coef_cars34_presence_children_5_17,coef_cars34_presence_children_5_17
8,util_num_workers_clip_3,"Number of workers, capped at 3",@df.num_workers.clip(upper=3),,coef_cars1_num_workers_clip_3,coef_cars2_num_workers_clip_3,coef_cars3_num_workers_clip_3,coef_cars4_num_workers_clip_3
9,util_hh_income_0_30k,"Piecewise Linear household income, $0-30k","@df.income_in_thousands.clip(0, 30)",,coef_cars1_hh_income_0_30k,coef_cars2_hh_income_0_30k,coef_cars3_hh_income_0_30k,coef_cars4_hh_income_0_30k


### Chooser and alternatives data

In [8]:
data.chooser_data

Unnamed: 0_level_0,model_choice,override_choice,util_drivers_2,util_drivers_3,util_drivers_4_up,util_persons_16_17,util_persons_18_24,util_persons_25_34,util_presence_children_0_4,util_presence_children_5_17,...,OPRKCST,area_type,HSENROLL,COLLFTE,COLLPTE,TOPOLOGY,TERMINAL,household_density,employment_density,density_index
_caseid_,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
166,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.00000,2,0.0,0.00000,0.00000,1,3.21263,24.783133,31.566265,13.883217
197,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,116.00000,2,0.0,0.00000,0.00000,1,3.68156,56.783784,10.459459,8.832526
268,1,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.00000,1,0.0,3598.08521,0.00000,1,3.29100,11.947644,45.167539,9.448375
375,1,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,68.00000,1,0.0,0.00000,0.00000,1,4.11499,73.040169,28.028350,20.255520
387,1,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.00000,3,0.0,227.78223,41.22827,1,3.83527,26.631579,45.868421,16.848945
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2863464,1,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,314.01431,0,0.0,72.14684,0.00000,1,5.52555,38.187500,978.875000,36.753679
2863483,1,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,225.00000,1,0.0,0.00000,0.00000,3,3.99027,39.838272,71.693001,25.608291
2863806,1,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,202.24750,2,0.0,0.00000,0.00000,1,4.27539,51.675676,47.216216,24.672699
2864518,1,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.00000,1,0.0,0.00000,0.00000,1,25.52083,15.938148,551.353820,15.490363
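
The three tables fit together as follows: each spec row names, per alternative, the coefficient that multiplies that row's evaluated expression (the `util_*` columns in the chooser data). The sketch below illustrates this under the assumption of a linear-in-parameters multinomial logit model. All numeric values are invented, the dictionaries are toy stand-ins for `data.coefficients`, `data.spec`, and one row of `data.chooser_data`, and the two helper functions are hypothetical.

```python
import math

# Toy stand-ins; every numeric value here is invented for illustration.
coefficients = {
    "coef_cars1_drivers_2": 0.0,
    "coef_cars2_drivers_2": 3.0,
    "coef_cars1_num_workers_clip_3": 0.5,
    "coef_cars2_num_workers_clip_3": 1.0,
}
# cars0 has no entries, mirroring the empty cars0 column in the spec table.
spec = {
    "util_drivers_2": {
        "cars1": "coef_cars1_drivers_2",
        "cars2": "coef_cars2_drivers_2",
    },
    "util_num_workers_clip_3": {
        "cars1": "coef_cars1_num_workers_clip_3",
        "cars2": "coef_cars2_num_workers_clip_3",
    },
}
chooser = {"util_drivers_2": 1.0, "util_num_workers_clip_3": 2.0}
alternatives = ["cars0", "cars1", "cars2"]


def utilities(chooser, spec, coefficients, alternatives):
    """Sum coefficient * expression value over spec rows, per alternative."""
    totals = {alt: 0.0 for alt in alternatives}
    for label, by_alt in spec.items():
        for alt, coef_name in by_alt.items():
            totals[alt] += coefficients[coef_name] * chooser[label]
    return totals


def logit_probabilities(utils):
    """Multinomial logit shares, shifted by the max utility for stability."""
    top = max(utils.values())
    expu = {alt: math.exp(u - top) for alt, u in utils.items()}
    total = sum(expu.values())
    return {alt: e / total for alt, e in expu.items()}


utils = utilities(chooser, spec, coefficients, alternatives)
probs = logit_probabilities(utils)
print(utils)
print(probs)
```

An alternative with no coefficients in the spec, such as `cars0` here, keeps a utility of zero and serves as the reference alternative.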


# Estimate

With the model set up for estimation, the next step is to estimate the model coefficients.  Make sure to use a sufficiently large household sample and set of zones to avoid an over-specified model, which does not have a numerically stable likelihood-maximizing solution.  Larch has two built-in estimation methods: BHHH and SLSQP.  BHHH is the default and typically runs faster, but does not follow constraints on parameters.  SLSQP is safer, but slower, and may need additional iterations.
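
For reference, assuming auto ownership is estimated as a multinomial logit model (worth confirming in `auto_ownership_model_settings.yaml`), the objective can be written as follows. The symbols are introduced here for illustration only: $x_{nk}$ is the value of spec row $k$ for household $n$, $\beta_{ik}$ is the coefficient the spec assigns to row $k$ for alternative $i$, $C$ is the set of alternatives (`cars0` through `cars4`), and $y_{ni}$ is 1 when alternative $i$ is the household's observed choice and 0 otherwise.

$$
V_{ni} = \sum_{k} \beta_{ik}\, x_{nk}, \qquad
P_n(i) = \frac{\exp(V_{ni})}{\sum_{j \in C} \exp(V_{nj})}
$$

$$
\ln L(\beta) = \sum_{n} \sum_{i \in C} y_{ni} \, \ln P_n(i)
$$

Estimation searches for the $\beta$ that maximizes $\ln L(\beta)$. When the data cannot distinguish two or more coefficients, the maximum is not unique, which is the over-specification problem described above.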

In [9]:
model.estimate()

req_data does not request avail_ca or avail_co but it is set and being provided


Unnamed: 0,value,initvalue,nullvalue,minimum,maximum,holdfast,note,best
coef_cars1_asc,4.744711,0.0,0.0,,,0,,4.744711
coef_cars1_asc_county,-0.566000,0.0,0.0,,,0,,-0.566000
coef_cars1_asc_marin,-0.243399,0.0,0.0,,,0,,-0.243399
coef_cars1_asc_san_francisco,3.984111,0.0,0.0,,,0,,3.984111
coef_cars1_auto_time_saving_per_worker,-0.039384,0.0,0.0,,,0,,-0.039384
...,...,...,...,...,...,...,...,...
coef_retail_auto_no_workers,-0.637704,0.0,0.0,,,0,,-0.637704
coef_retail_auto_workers,-0.531112,0.0,0.0,,,0,,-0.531112
coef_retail_non_motor,-0.030000,0.0,0.0,,,1,,-0.030000
coef_retail_transit_no_workers,-0.333447,0.0,0.0,,,0,,-0.333447




Unnamed: 0,0
coef_cars1_asc,4.744711
coef_cars1_asc_county,-0.566000
coef_cars1_asc_marin,-0.243399
coef_cars1_asc_san_francisco,3.984111
coef_cars1_auto_time_saving_per_worker,-0.039384
coef_cars1_density_0_10_no_workers,0.000000
coef_cars1_density_10_up_no_workers,-0.006930
coef_cars1_density_10_up_workers,-0.016448
coef_cars1_drivers_2,0.000000
coef_cars1_drivers_3,0.000000


### Estimated coefficients

In [10]:
model.parameter_summary()

Unnamed: 0,Value,Std Err,t Stat,Signif,Like Ratio,Null Value,Constrained
coef_cars1_asc,4.74,2.66,1.78,,,0.0,
coef_cars1_asc_county,-0.566,0.0194,-29.22,***,,0.0,
coef_cars1_asc_marin,-0.243,0.112,-2.18,*,,0.0,
coef_cars1_asc_san_francisco,3.98,2.66,1.5,,,0.0,
coef_cars1_auto_time_saving_per_worker,-0.0394,0.561,-0.07,,,0.0,
coef_cars1_density_0_10_no_workers,0.0,,,,,0.0,fixed value
coef_cars1_density_10_up_no_workers,-0.00693,0.00514,-1.35,,,0.0,
coef_cars1_density_10_up_workers,-0.0164,0.0039,-4.21,***,,0.0,
coef_cars1_drivers_2,0.0,,,,,0.0,fixed value
coef_cars1_drivers_3,0.0,,,,,0.0,fixed value
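
The `t Stat` column can be read as the estimate's distance from its null value in units of standard error. The sketch below reproduces one row from the rounded values shown above. The star thresholds are the conventional two-sided normal cutoffs for 5%, 1%, and 0.1%, which is an assumption that happens to agree with the rows shown and may not be larch's exact rule; both helper functions are hypothetical.

```python
def t_stat(value, std_err, null_value=0.0):
    """t statistic of an estimate against its null value."""
    return (value - null_value) / std_err


def signif_stars(t):
    """Stars at assumed two-sided normal cutoffs (5%, 1%, 0.1%)."""
    t = abs(t)
    if t > 3.2905:
        return "***"
    if t > 2.5758:
        return "**"
    if t > 1.96:
        return "*"
    return ""


# Rounded values from the coef_cars1_density_10_up_workers row above
t = t_stat(-0.0164, 0.0039)
print(round(t, 2), signif_stars(t))
```

Because the table shows rounded values, a t statistic recomputed this way can differ from the reported one in the last digit.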


# Output Estimation Results

In [11]:
from activitysim.estimation.larch import update_coefficients
result_dir = data.edb_directory/"estimated"
update_coefficients(
    model, data, result_dir,
    output_file="auto_ownership_coefficients_revised.csv",
);

### Write the model estimation report, including coefficient t-statistics and log likelihood

In [12]:
model.to_xlsx(
    result_dir/"auto_ownership_model_estimation.xlsx", 
    data_statistics=False,
)

<larch.util.excel.ExcelWriter at 0x7fe150021f40>

# Next Steps

The final step is to either manually or automatically copy the `auto_ownership_coefficients_revised.csv` file to the configs folder, rename it to `auto_ownership_coefficients.csv`, and run ActivitySim in simulation mode.
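
The copy-and-rename step could be automated along these lines. This is a minimal sketch: `install_coefficients` is a hypothetical helper, not an ActivitySim function, and the configs folder path is an assumption that depends on your setup.

```python
import shutil
from pathlib import Path


def install_coefficients(
    revised_csv, configs_dir, target_name="auto_ownership_coefficients.csv"
):
    """Copy the revised coefficients into configs_dir under target_name.

    If a file with that name already exists, it is first saved as a .bak copy.
    """
    revised_csv = Path(revised_csv)
    target = Path(configs_dir) / target_name
    if target.exists():
        shutil.copy2(target, target.with_name(target.name + ".bak"))
    shutil.copy2(revised_csv, target)
    return target


# Example call (paths are assumptions; adjust to your setup):
# install_coefficients(result_dir / "auto_ownership_coefficients_revised.csv", "configs")
```

Keeping a backup of the previous coefficients makes it easy to compare simulation results before and after re-estimation.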

In [13]:
pd.read_csv(result_dir/"auto_ownership_coefficients_revised.csv")

Unnamed: 0,coefficient_name,value,constrain
0,coef_cars1_drivers_2,0.000000,T
1,coef_cars1_drivers_3,0.000000,T
2,coef_cars1_persons_16_17,0.000000,T
3,coef_cars234_asc_marin,0.000000,T
4,coef_cars1_persons_25_34,0.000000,T
...,...,...,...
62,coef_cars4_drivers_3,564.490158,F
63,coef_cars3_drivers_3,5.048488,F
64,coef_cars2_drivers_4_up,6.856405,F
65,coef_cars3_drivers_4_up,8.317950,F
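
Before copying the revised file into the configs folder, it may be worth comparing it with the starting coefficients. In the output above, `coef_cars4_drivers_3` moves from 5.2080 to 564.490158, which is the kind of value that can indicate a poorly identified parameter, possibly related to the over-specification caveat in the Estimate section. The sketch below flags such rows; `compare_coefficients` is a hypothetical helper and `max_abs` is an arbitrary illustrative threshold, not an ActivitySim or larch setting.

```python
import pandas as pd


def compare_coefficients(
    before: pd.DataFrame, after: pd.DataFrame, max_abs: float = 20.0
) -> pd.DataFrame:
    """Join two coefficient tables by name and flag very large estimates."""
    merged = before.merge(
        after, on="coefficient_name", suffixes=("_before", "_after")
    )
    merged["change"] = merged["value_after"] - merged["value_before"]
    merged["suspect"] = merged["value_after"].abs() > max_abs
    return merged


# Rows copied from the coefficient tables shown in this notebook
before = pd.DataFrame(
    {
        "coefficient_name": ["coef_cars4_drivers_3", "coef_cars3_drivers_3"],
        "value": [5.2080, 5.5131],
    }
)
after = pd.DataFrame(
    {
        "coefficient_name": ["coef_cars4_drivers_3", "coef_cars3_drivers_3"],
        "value": [564.490158, 5.048488],
    }
)
result = compare_coefficients(before, after)
print(result[result["suspect"]])
```

With real files, `before` and `after` would be read with `pd.read_csv` from the original and revised coefficient files.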
