# Estimating Non-Mandatory Tour Frequency

This notebook illustrates how to re-estimate a single model component for ActivitySim.  This process 
includes running ActivitySim in estimation mode to read household travel survey files and write out
the estimation data bundles used in this notebook.  To review how to do so, please visit the other
notebooks in this directory.

# Load libraries

In [None]:
import os
import larch  # !conda install larch -c conda-forge # for estimation
import pandas as pd
import numpy as np
import activitysim
import datetime
print(activitysim.__version__)  # a bare expression is only displayed when it is the last line of a cell

pd.options.display.max_columns = 150

We'll work in the `outputs` directory, where ActivitySim has saved the estimation data bundles.

In [None]:
os.chdir(r'C:\ABM3_dev\outputs')  # raw string so the backslashes are not read as escape sequences

In [None]:
def write_coeffs(segment):
    path = r'output\estimation_data_bundle\non_mandatory_tour_frequency'
    spec = pd.read_csv(os.path.join(path, 'non_mandatory_tour_frequency_SPEC.csv'), comment='#')
    # spec = pd.read_csv(os.path.join(r'C:\ABM3_dev\ABM\src\asim\configs\estimation', f'non_mandatory_tour_frequency.csv'), comment='#')
    coefs = spec[segment].dropna()
    coefs_df = pd.DataFrame()
    coefs_df['coefficient_name'] = coefs
    coefs_df.drop_duplicates(subset='coefficient_name', keep='first', inplace=True)
    coefs_df['value'] = 0.0
    coefs_df['constrain'] = 'F'
    coefs_df.loc[coefs_df['coefficient_name'] == 'coef_unavailable', 'value'] = -999
    coefs_df.loc[coefs_df['coefficient_name'] == 'coef_unavailable', 'constrain'] = 'T'
    coefs_df.to_csv(os.path.join(path, segment, f'non_mandatory_tour_frequency_coefficients_{segment}.csv'), index=False)
    # coefs_df.to_csv(os.path.join(r'C:\ABM3_dev\ABM\src\asim\configs\estimation', f'non_mandatory_tour_frequency_coefficients_{segment}.csv'), index=False)

# write_coeffs('PTYPE_FULL')
# write_coeffs('PTYPE_PART')
# write_coeffs('PTYPE_UNIVERSITY')
# write_coeffs('PTYPE_NONWORK')
# write_coeffs('PTYPE_RETIRED')
# write_coeffs('PTYPE_DRIVING')
# write_coeffs('PTYPE_SCHOOL')
# write_coeffs('PTYPE_PRESCHOOL')
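
To see what `write_coeffs` produces without touching the estimation data bundle, the sketch below applies the same steps to a small in-memory spec. The spec rows, the `Label` and `Expression` column names, and the helper name `coefficient_template` are made up for illustration; only the logic (drop blanks, drop duplicate names, start every value at 0.0, fix `coef_unavailable` at -999) follows the cell above.

```python
import io

import pandas as pd

# Hypothetical three-row spec with one coefficient-name column per segment.
spec_csv = io.StringIO(
    "Label,Expression,PTYPE_FULL,PTYPE_PART\n"
    "util_a,expr_a,coef_a,coef_a\n"
    "util_b,expr_b,coef_b,\n"
    "util_c,expr_c,coef_a,coef_unavailable\n"
)
spec = pd.read_csv(spec_csv, comment="#")


def coefficient_template(spec, segment):
    """Build a starting coefficients table for one segment (illustrative helper)."""
    # Blank cells mean the row does not apply to this segment; repeated names
    # refer to the same parameter, so keep each name once.
    names = spec[segment].dropna().drop_duplicates()
    template = pd.DataFrame(
        {"coefficient_name": names.to_numpy(), "value": 0.0, "constrain": "F"}
    )
    # The unavailability coefficient is held fixed rather than estimated.
    unavailable = template["coefficient_name"] == "coef_unavailable"
    template.loc[unavailable, "value"] = -999.0
    template.loc[unavailable, "constrain"] = "T"
    return template


full_template = coefficient_template(spec, "PTYPE_FULL")
part_template = coefficient_template(spec, "PTYPE_PART")
print(part_template)
```

Returning the table instead of writing it makes the result easy to inspect before it is saved with `to_csv`.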

In [None]:
# tours = pd.read_csv(r"C:\ABM3_dev\outputs\output_estimation\final_tours.csv")
# persons = pd.read_csv(r"C:\ABM3_dev\run_data\data_2z_series15\override_persons.csv")
# tours = pd.read_csv(r"C:\ABM3_dev\run_data\data_2z_series15\override_tours.csv")

In [None]:
persons = pd.read_csv(r"C:\ABM3_dev\outputs\output\final_persons.csv")
nm_purposes = ['_escort', '_shopping', '_othmaint', '_eatout', '_social', '_othdiscr']
persons['total_indNM_tours'] = persons[nm_purposes].sum(axis=1)
persons['age_binned'], bins = pd.cut(persons.age, bins=np.arange(0,91,2), retbins=True)
persons.groupby(['age_binned']).total_indNM_tours.mean().plot(kind='bar', figsize=(12,5))
# persons.age_binned.value_counts()

# Load data and prep model for estimation

In [None]:
modelname = "nonmand_tour_freq"

from activitysim.estimation.larch import component_model
# model, data = component_model(modelname, return_data=True, condense_parameters=False, num_chunks=10)
model, data = component_model(modelname, return_data=True, condense_parameters=False, segment_subset=['PTYPE_SCHOOL', 'PTYPE_FULL', 'PTYPE_PRESCHOOL'], num_chunks=10)

ptype_for_display = 'PTYPE_SCHOOL'

The prototype model spec we are re-estimating has 210 rows for each person type, but the
accompanying dataset is not large enough to successfully estimate anywhere near that many
parameters. The `condense_parameters` option is a short cut to making
a model that can be estimated with stable parameter results.  When activated, it merges
parameters not only by name (i.e. when the same name appears twice it is the same parameter)
but also by value, so that if the initial values of any two parameters are identical
then they are treated as the same parameter.  Using `condense_parameters` in actual model
estimation efforts is ill advised and may generate confusing or unexpected results; the cell above sets it to `False`.
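
The merge-by-value idea can be pictured with a few lines of plain Python. This is only a rough sketch of the concept, not the code that ActivitySim or Larch actually runs; the coefficient names and initial values are hypothetical.

```python
def condense_by_value(initial_values):
    """Map each coefficient name to the first name that shares its initial value."""
    first_name_for_value = {}  # initial value -> first coefficient name seen with it
    mapping = {}
    for name, value in initial_values.items():
        mapping[name] = first_name_for_value.setdefault(value, name)
    return mapping


# Hypothetical coefficients; merging by name is implicit because dict keys are unique.
initial_values = {
    "coef_escort_constant": -1.5,
    "coef_shopping_constant": -1.5,  # same initial value as the escort constant
    "coef_income_high": 0.25,
    "coef_unavailable": -999.0,
}

mapping = condense_by_value(initial_values)
n_unique = len(set(mapping.values()))
print(mapping)
print(f"{len(initial_values)} named coefficients condensed to {n_unique} parameters")
```

In this sketch the two constants collapse into one parameter only because they happen to start at the same value, which is why the option can give confusing results in real estimation work.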

This component actually has a distinct choice model for each person type, so
instead of a single model there's a `dict` of models.

In [None]:
type(model)

In [None]:
model.keys()

# Review data loaded from the EDB

We can review the loaded data as well. Similarly, there is separate data
for each person type.

## Coefficients

In [None]:
data.coefficients[ptype_for_display]

## Utility specification

In [None]:
data.spec[ptype_for_display]

## Chooser data

In [None]:
data.chooser_data[ptype_for_display]

In [None]:
alt_df = data.alt_values[ptype_for_display]
alt_df.head()

In [None]:
df = data.chooser_data[ptype_for_display].copy()
alts = pd.read_csv(r"C:\ABM3_dev\outputs\output\estimation_data_bundle\non_mandatory_tour_frequency\non_mandatory_tour_frequency_alternatives.csv", index_col=0)
df = df.merge(alts, how='left', left_on='override_choice', right_index=True)
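
The merge above attaches the per-purpose tour counts of the chosen alternative to each chooser. A minimal sketch of the same pattern on made-up data follows; it assumes the alternatives table is indexed by alternative id with one column per purpose, and all values here are hypothetical.

```python
import pandas as pd

# Hypothetical alternatives table: one row per alternative id, one column per purpose.
alts = pd.DataFrame(
    {"escort": [0, 1, 0, 1], "shopping": [0, 0, 1, 1], "tot_tours": [0, 1, 1, 2]},
    index=pd.Index([0, 1, 2, 3], name="alt"),
)

# Hypothetical choosers: override_choice holds the id of the observed alternative.
choosers = pd.DataFrame(
    {"override_choice": [0, 3, 2, 3, 1]},
    index=pd.Index([101, 102, 103, 104, 105], name="person_id"),
)

# Look up each chooser's alternative row by matching override_choice to the alts index.
merged = choosers.merge(alts, how="left", left_on="override_choice", right_index=True)

# Number of choosers making 0 or 1 shopping tours.
shopping_counts = merged["shopping"].value_counts().sort_index()
print(merged)
print(shopping_counts)
```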

In [None]:
tour_counts = []
for col in ['escort','shopping','othmaint','eatout','social','othdiscr','tot_tours', 'num_mandatory_tours', 'num_joint_tours']:
    tmp = df[col].value_counts()
    tour_counts.append(tmp)

tour_counts = pd.concat(tour_counts, axis=1).fillna(0).astype(int)
tour_counts.loc['Total'] = tour_counts.sum(axis=0)
tour_counts

In [None]:
pd.crosstab(df.tot_tours, df.income_segment)

# Estimate

With the model set up for estimation, the next step is to estimate the model coefficients.  Make sure to use a sufficiently large household sample and set of zones to avoid an over-specified model, which does not have a numerically stable likelihood-maximizing solution.  The prototype model spec we are re-estimating has 210 rows for each person type, but the accompanying dataset is not large enough to successfully estimate anywhere near that many parameters.  The `condense_parameters` short cut addresses this by keeping only one parameter per unique existing parameter value.

In [None]:
for k, m in model.items():
    print(f"Person type {k} has {len(m.utility_ca)} utility terms and {len(m.pf)} unique parameters.")

For future estimation work, parameters can be intelligently named and applied to match the model developer's desired structure (by using the same named parameter for multiple rows of the spec file).  If this is done, the "short cut" should be disabled by setting `condense_parameters=False` in the loading step above.

Larch has built-in estimation methods including BHHH, and also offers access to more advanced general-purpose non-linear optimizers in the `scipy` package, including SLSQP, which allows for bounds and constraints on parameters.  BHHH is the default and typically runs faster, but does not follow constraints on parameters.
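
For intuition about what BHHH does, here is a toy sketch on a one-parameter binary logit. BHHH approximates the Hessian with the sum of outer products of the per-observation scores, which in one dimension is just the sum of squared scores. The data are made up, and a fixed step length of 0.5 stands in for the line search a real optimizer would use; this is not Larch's implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def observation_scores(beta, xs, ys):
    """Per-observation derivatives of the binary-logit log-likelihood w.r.t. beta."""
    return [(y - sigmoid(beta * x)) * x for x, y in zip(xs, ys)]


def bhhh_1d(xs, ys, beta=0.0, step_length=0.5, tol=1e-9, maxiter=200):
    """Maximize the log-likelihood with BHHH-style steps (illustrative only)."""
    for iteration in range(maxiter):
        scores = observation_scores(beta, xs, ys)
        gradient = sum(scores)
        if abs(gradient) < tol:
            break
        # BHHH: the sum of squared scores replaces the (negative) second derivative.
        outer_product = sum(s * s for s in scores)
        beta += step_length * gradient / outer_product
    return beta, gradient, iteration


# Hypothetical data: x is the utility difference, y is 1 if the alternative was chosen.
xs = [-2, -1, -1, 0, 1, 1, 2, 3]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

beta_hat, final_gradient, iterations = bhhh_1d(xs, ys)
print(f"beta = {beta_hat:.4f} after {iterations} iterations, gradient = {final_gradient:.2e}")
```

Nothing in the update keeps `beta` inside a bound, which is the sense in which BHHH does not follow constraints on parameters; a constrained method such as SLSQP is needed for that.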

In [None]:
for k, m in model.items():
    # m.estimate(method='SLSQP')
    m.estimate(method='BHHH', options={'maxiter':1500})

### Estimated coefficients

In [None]:
model[ptype_for_display].parameter_summary()

# Output Estimation Results

In [None]:
from activitysim.estimation.larch import update_coefficients
for k, m in model.items():
    result_dir = data.edb_directory/k/"estimated"
    update_coefficients(
        m, data.coefficients[k], result_dir,
        output_file=f"{modelname}_{k}_coefficients_revised_{datetime.datetime.now().strftime('%d_%m_%Y %H_%M_%S')}.csv",
        relabel_coef=data.relabel_coef.get(k),
    );

### Write the model estimation report, including coefficient t-statistic and log likelihood

In [None]:
for k, m in model.items():
    result_dir = data.edb_directory/k/"estimated"
    m.to_xlsx(
        result_dir/f"{modelname}_{k}_model_estimation_{datetime.datetime.now().strftime('%d_%m_%Y %H_%M_%S')}.xlsx", 
        data_statistics=True,
    )

# Next Steps

The final step is to either manually or automatically copy the `*_coefficients_revised_<timestamp>.csv` file written above to the configs folder, rename it to `*_coefficients.csv`, and run ActivitySim in simulation mode.
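
One way the automatic copy could look is sketched below. It runs entirely in a temporary directory with placeholder files, so the directory layout and the helper name `latest_revised` are assumptions; only the timestamp format and the file-name patterns come from the cells above. Because the timestamp starts with the day, file names do not sort chronologically, so the sketch parses the timestamp instead of sorting names.

```python
import datetime
import shutil
import tempfile
from pathlib import Path

STAMP_FORMAT = "%d_%m_%Y %H_%M_%S"  # same format used when the results were written


def latest_revised(result_dir, prefix):
    """Return the newest '<prefix>_<timestamp>.csv' file, judged by its parsed timestamp."""
    def stamp(path):
        return datetime.datetime.strptime(path.stem[len(prefix) + 1:], STAMP_FORMAT)
    return max(result_dir.glob(f"{prefix}_*.csv"), key=stamp)


with tempfile.TemporaryDirectory() as tmp:
    # Stand-ins for the EDB "estimated" folder and the model configs folder.
    result_dir = Path(tmp) / "estimated"
    configs_dir = Path(tmp) / "configs"
    result_dir.mkdir()
    configs_dir.mkdir()

    prefix = "nonmand_tour_freq_PTYPE_FULL_coefficients_revised"
    # "28_02_..." sorts after "05_03_..." as text even though it is the older run.
    (result_dir / f"{prefix}_28_02_2024 09_00_00.csv").write_text("older\n")
    (result_dir / f"{prefix}_05_03_2024 10_00_00.csv").write_text("newer\n")

    newest = latest_revised(result_dir, prefix)
    target = configs_dir / "non_mandatory_tour_frequency_coefficients_PTYPE_FULL.csv"
    shutil.copyfile(newest, target)
    newest_name = newest.name
    copied_text = target.read_text()

print(f"copied {newest_name}")
```

Copying rather than moving keeps the timestamped results in the `estimated` folder as a record of each run.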

In [None]:
# result_dir = data.edb_directory/'PTYPE_FULL'/"estimated"
# pd.read_csv(result_dir/f"{modelname}_PTYPE_FULL_coefficients_revised.csv")