Merged
13 changes: 2 additions & 11 deletions .travis.yml
@@ -16,7 +16,7 @@ install:
- conda info -a
- conda create -q -n test-environment python=$TRAVIS_PYTHON_VERSION
- conda activate test-environment
- conda install pytest pytest-cov coveralls pycodestyle
- conda install pytest pytest-cov coveralls pycodestyle cytoolz
- pip install .
- pip freeze

@@ -37,16 +37,7 @@ deploy:
provider: pages
local_dir: docs/_build/html
skip_cleanup: true
github_token: $GH_TOKEN
github_token: $GH_TOKEN_POPSIM # Set in the settings page of the repository, as a secure variable
keep_history: true
on:
branch: master

notifications:
slack:
secure: l7FunH4QsSrIumeyfAlihCOCv8MwPBv8UbPKYQWghAkaIcxQaS6YZ5MTPDX9Lx7eIax0vXRl090Or3QqCD/Ap/hLdZS3WaHC0111UjSMi96nyjeQouvW2NYxFuqBLfKJVLWjIVzDsLlMDS7a3Q25YvhnY/s5xjUsUY34fEEFRdKqRM5sUcCEARSXCAJeTVV/OQIgvraKqypmvYFX5LMlxfzhfE5U1H1Qg4vGkHkc42ZS/lSa5ow4nerpurHcLTq5zPLWCTU0XS6ikMhn98Qi8l0z/4mX8DV6Aka79tN0uZit1K3TS/uK1sDN1uoFFEVseu9y2yycJszb4PSBU5hTaurqc4Ui1OTfXN171MVC37WzpltsCyRoGKIm/50/1lWQPJijsszaYXSr6qXuOX7TjmdXZclhkxZCTDXPW0CHSFn+7ShDWMiODcBNeVQKEWlpOVATQDRRLzMnd7TvIYV8yEa3cpMtI2W2v/LxFPKygHynBoq5FY4l4rOvHaPkilUI48vfx/6AP+BuX4PAseyuuqAlLwWwXhSnNE1HHQzjwvhHx9PQ+S/GiCj3oLnxIDUE8t2hQHrD4yYXOUxjfFrjQNF6SiiNWi0+bfI/ZjpiLpFyVxDJBQovXaWRLPAM61LwqRZmoXGyUg3UWDKuEZ8Fs2Dnz5cZx05+ePb8AroaWyI=

env:
global:
# GH_TOKEN
secure: IjVfvlQqAhryvf14wFhQ9hPVfT/an2bn152wax9ZKsvo5u6OyWk9GOWEbPujGCf1l/tXeDXmYzxQJb2Yedr2BMDF1399ssSVr1fFaka0S8WhftqKf2UN8uDe9IztxoARNHxPJjecHv6bJiXQFFXREsMa6bGlM8b17GzdDiIFME2WVoBw6eb7WZNRYfJeS66ObBHxjUTpAxNaWvhF7LAt6Mcu9kHhLggaAa21zPJPJ2d2AAZpFSUpmTIfFhTAI+JFwpbH1d8M3toDQpUmVYYDjHfhBxZ9QTd/C17FOcSwAU6HnQjOxHnsJX78dl0cYacEAVngEPcYgsfwE8b13uTGVLwbRoU73FJrBQvVjK4udJA5CDd5eEk5bpAl2NPmV223RFW5ZAwMuUKpQMAQNN8nswnzYQ+XfxdgR7wopNa6ZJ9JPd2EZ5ZNrBB9cRQUS8oY016SKuOs+C93PJZmPgJBHH9WUoZlaRX5x4XIXST0v6+B+oyPvdQrKb8t8r0RA7nvg9hq4B0o9OI5boDOiVlC1akjD0XWRUxa6arWGip7g5UBxtosbg9LeUG0XuQZpHpSsJj5BJPT0xtN4wdlSOBwXOUIi/HJT8jjxukMRkEwajWaMZsfseHN9zcI7j9wsFQ7lqkWodIOtRiXprkUljfvD+aWDeBk7JidT/ksZaIAJPk=
2 changes: 1 addition & 1 deletion README.md
@@ -1,7 +1,7 @@
PopulationSim
=============

[![Build Status](https://travis-ci.org/activitysim/populationsim.svg?branch=master)](https://travis-ci.org/ActivitySim/populationsim) [![Coverage Status](https://coveralls.io/repos/ActivitySim/populationsim/badge.png?branch=master)](https://coveralls.io/r/ActivitySim/populationsim?branch=master)
[![Build Status](https://travis-ci.org/activitysim/populationsim.svg?branch=master)](https://travis-ci.org/ActivitySim/populationsim) [![Coverage Status](https://coveralls.io/repos/ActivitySim/populationsim/badge.png?branch=master)](https://coveralls.io/r/ActivitySim/populationsim?branch=master)<a href="https://medium.com/zephyrfoundation/populationsim-the-synthetic-commons-670e17383048"><img src="https://github.com/ZephyrTransport/zephyr-website/blob/gh-pages/img/badging/project_pages/populationsim/PopulationSim.png" width="72.6" height="19.8"></a>


PopulationSim is an open platform for population synthesis. It emerged
43 changes: 40 additions & 3 deletions docs/application_configuration.rst
@@ -320,7 +320,7 @@ These settings control the functionality of the PopulationSim algorithm. The set
| | | The maximum expansion factor may have to be adjusted upwards if the target |br| |
| | | is much greater than the seed number of households. |br| |
+--------------------------------------+------------+---------------------------------------------------------------------------------+
| MAX_BALANCE_ITERATIONS_SIMULTANEOUS | Integer | Number of simultaneous list balancer iterations |
| MAX_BALANCE_ITERATIONS_SIMULTANEOUS | Integer | Number of list balancer iterations. The default may be more than is needed. |
+--------------------------------------+------------+---------------------------------------------------------------------------------+


@@ -693,7 +693,7 @@ This sections describes the settings that are configured differently for the *re

**Input Data Tables for repop mode**

The repop mode runs over an existing synthetic population and uses the data pipeline (HDF5 file) from the regular run as an input. User should copy the HDF5 file from the regular outputs to the *output* folder of the repop set up. The data input which needs to be specified in this setting is the control data for the subset of geographies to be modified. Input tables for the repop mode can be specified in the same manner as regular mode. However, only one geography can be controlled. In the example below, TAZ controls are specified. The controls specified in TAZ_control_data do not have to be consistent with the controls specified in the data used to control the initial population. Only those geographic units to be repopulated should be specified in the control data (for example, TAZs 314 through 317).
The repop mode runs over an existing synthetic population and uses the data pipeline (HDF5 file) from the regular run as an input. Users should copy the HDF5 file from the regular outputs to the *output* folder of the repop setup. The data input that must be specified in this setting is the control data for the subset of geographies to be modified. Input tables for the repop mode can be specified in the same manner as for the regular mode. However, only one geography can be controlled, and it must be the lowest geography in the ``geographies`` setting. In the example below, TAZ controls are specified. The controls specified in TAZ_control_data do not have to be consistent with the controls specified in the data used to control the initial population. Only those geographic units to be repopulated should be specified in the control data (for example, TAZs 314 through 317).

::

@@ -713,6 +713,7 @@ The repop mode runs over an existing synthetic population and uses the data pipe
| Attribute | Description |
+===========================+=============================================================+
| repop_control_file_name | Name of the CSV control specification file for repop mode |
| | Must include total_hh_control field |
+---------------------------+-------------------------------------------------------------+


@@ -847,7 +848,7 @@ Attribute definitions are as follows:
:seed_table:
seed_table is the seed table the control applies to and it can be ``households`` or ``persons``. If persons, then persons are aggregated to households using the count operator.
:importance:
importance is the importance weight for the control. A higher weight will cause PopulationSim to attempt to match the control at the possible expense of matching lower-weight controls.
importance is the importance weight for the control. A higher weight will cause PopulationSim to attempt to match the control at the possible expense of matching lower-weight controls. The importance weights are described in more detail in the :ref:`importance` and :ref:`setting-importance` sections.
:control_field:
control_field is the field in the control data input files that this control applies to. Note that the control field names should be unique even if they are for different geographies.
:expression:
@@ -858,6 +859,42 @@ Some conventions for writing expressions:
* Expressions must be vectorized expressions and can use most numpy and pandas expressions.
* When editing the CSV files in Excel, use single quote ' or space at the start of a cell to get Excel to accept the expression

.. _importance:

What are importance weights
~~~~~~~~~~~~~~~~~~~~~~~~~~~

PopulationSim uses relative entropy maximization-based list balancing to match controls specified at various geographic levels. The relative entropy-based optimization ensures that the least amount of new information is introduced in finding a feasible solution. The base entropy is defined by the initial weights in the seed sample. The weights generated by the entropy maximization algorithm preserve the distribution of initial weights while matching the marginal controls. This ensures that the resulting weights are uniform and preserve the distribution of the uncontrolled variables in the seed sample. A general relative entropy optimization problem is formulated as:

:math:`\min\limits_{x_{n}} \sum_{n}{x_{n}} \ln\dfrac {x_{n}} {w_{n}}`

Where :math:`x_{n}` are the resulting household-level weights and :math:`w_{n}` are the initial weights. The marginal controls are specified as:

:math:`\sum_{n}{a_{in}*x_{n}} = A_{i}`

In PopulationSim, the hard marginal controls are relaxed by use of slack or relaxation factors in the constraints as shown below:

:math:`\sum_{n}{a_{in}*x_{n}} = A_{i}*z_{i}`

Where :math:`z_{i}` are relaxation factors and :math:`a_{in}` are incidence values that map household/person attributes to marginal controls. To ensure that marginal controls are not relaxed significantly, the relaxation factors are also included in the objective function with a penalty. With control relaxations, the relative entropy optimization problem is formulated as follows:

:math:`\min\limits_{x_{n}, z_{i}} \sum_{n}{x_{n}} \ln\dfrac {x_{n}} {w_{n}} + \sum_{i}{u_{i}*(z_{i}\ln{z_{i}})}`

Where, :math:`u_{i}` are the penalties termed as importance factors or importance weights in PopulationSim.

:math:`x_{n}` and :math:`z_{i}` are the parameters solved by the optimization, while the importance weights (:math:`u_{i}`) are hyperparameters that are exposed to the user and shape the optimization externally. The objective of the relative entropy optimization is to find a set of weights that are uniform and satisfy marginal controls. The importance weights allow the user to trade off between these objectives. High importance weights (e.g., 1E10) on all controls result in a hard-constrained optimization, which gives high preference to matching marginal controls. Low importance weights (e.g., <50) result in an almost unconstrained problem. The user may also specify different importance weights for each marginal control. In this case, the controls with higher importance weights are given preference over the ones with low importance weights. Therefore, both the absolute and the relative values of the importance weights affect the optimization problem and its solution.
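
As a minimal illustration of the formulation above (not PopulationSim's actual balancer code), the sketch below evaluates the relaxed relative-entropy objective and the relaxed constraint residuals for a toy problem. The function names, the toy incidence values and the targets are assumptions made purely for illustration.

```python
import math


def relaxed_entropy_objective(x, w, z, u):
    """Evaluate sum_n x_n ln(x_n / w_n) + sum_i u_i z_i ln(z_i)."""
    entropy_term = sum(xn * math.log(xn / wn) for xn, wn in zip(x, w))
    penalty_term = sum(ui * zi * math.log(zi) for ui, zi in zip(u, z))
    return entropy_term + penalty_term


def constraint_residuals(a, x, targets, z):
    """Return sum_n a_in x_n - A_i z_i for each control i."""
    return [
        sum(a_in * xn for a_in, xn in zip(row, x)) - A_i * z_i
        for row, A_i, z_i in zip(a, targets, z)
    ]


# Toy data (hypothetical): three households, two controls.
w = [10.0, 10.0, 10.0]   # initial seed weights
a = [[1, 1, 1],          # control 0: total households
     [0, 1, 2]]          # control 1: count of some person type per household
targets = [30.0, 30.0]   # marginal control totals A_i
u = [1000.0, 1000.0]     # importance weights
z = [1.0, 1.0]           # no relaxation

# With x equal to w and z equal to 1, both terms of the objective vanish
# and both toy controls are met exactly.
print(relaxed_entropy_objective(w, w, z, u))
print(constraint_residuals(a, w, targets, z))
```

Moving any :math:`x_{n}` away from :math:`w_{n}` makes the first term non-zero, and moving any :math:`z_{i}` away from 1 changes the second term in proportion to :math:`u_{i}`, which is the trade-off the importance weights control.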

.. _setting-importance:

Setting importance weights
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Given the flexibility that importance weights offer to the user, they need to be tuned to get the desired optimality in the outputs for the given seed sample and marginal controls. The quality of the outputs is defined by a uniformity measure of the weights and goodness of fit across marginal controls. Here are general guidelines on setting importance weights:

* Start with a reasonable importance factor value across all controls (e.g., 1000 has typically worked well for multiple regions). This excludes the control on the total number of households which should be set to very high importance to ensure that the right number of households is generated for each zone.
* After achieving reasonable goodness of fit across controls, the importance weights can be increased/decreased to favor one control over the other, or all importance weights can be reduced to improve the uniformity of the weights. Which controls to favor depends on the type of application and the quality of the marginal data.
* The importance weights are generally updated in factors of 10. The user may need to run PopulationSim multiple times using various combinations of importance weights to reach the desired quality of outputs.
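
The factor-of-10 search described in the guidelines above can be organised as a small grid of candidate settings. The sketch below only enumerates the combinations; running PopulationSim for each candidate and scoring the outputs is left out. The control names, the starting value and the `importance_grid` helper are hypothetical.

```python
from itertools import product


def importance_grid(controls, lowest=100, steps=3):
    """Yield dicts mapping each control to lowest * 10**k, for every combination of k."""
    levels = [lowest * 10 ** k for k in range(steps)]
    for combo in product(levels, repeat=len(controls)):
        yield dict(zip(controls, combo))


# Hypothetical control names. The total-households control is kept at a very
# high importance and excluded from the search, as the guidelines suggest.
tunable = ["hh_size_1", "hh_income_low"]
for candidate in importance_grid(tunable):
    candidate["num_hh"] = 10 ** 10
    print(candidate)
```

Because the number of combinations grows exponentially with the number of controls, a sweep like this is only practical when a few controls are varied at a time.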



Error Handling & Debugging
10 changes: 8 additions & 2 deletions docs/getting_started.rst
@@ -12,7 +12,13 @@ This page describes how to install and run PopulationSim with the provided examp
Installation
------------

1. Install `Anaconda 64bit Python 3 <https://www.anaconda.com/distribution/>`__. Anaconda Python is required for PopulationSim.
1. It is recommended that you install and use a *conda* package manager
for your system. One easy way to do so is by using `Anaconda 64bit Python 3 <https://www.anaconda.com/distribution/>`__,
although you should consult the `terms of service <https://www.anaconda.com/terms-of-service>`__
for this product and ensure you qualify (as of summer 2021, businesses and
governments with over 200 employees do not qualify for free usage). If you prefer
a completely free open source *conda* tool, you can download and install the
appropriate version of `Miniforge <https://github.com/conda-forge/miniforge#miniforge3>`__.

2. If you access the internet from behind a firewall, then you will need to configure your proxy server. To do so, create a .condarc file in your Anaconda installation folder (e.g., ``C:\ProgramData\Anaconda3``), such as:

@@ -62,7 +68,7 @@ ActivitySim
ActivitySim depends + some handy Python installation management tools.

For more information on Anaconda and ActivitySim, see ActivitySim's `getting started
<https://activitysim.github.io/activitysim/gettingstarted.html#anaconda>`__ guide.
<https://activitysim.github.io/activitysim/gettingstarted.html>`__ guide.


Run Examples
15 changes: 0 additions & 15 deletions docs/software.rst
@@ -224,18 +224,3 @@ Contribution Guidelines

PopulationSim development follows the same `development guidelines <https://activitysim.github.io/activitysim/development.html>`__ as ActivitySim.


Release Notes
-------------

* v0.3 - first release
* v0.3.1 - allow zones with zero households
* v0.3.2 - fix bug in mult-integerizer with total_hh_parent_control_index
* v0.3.3 - add diagnostic printouts on assert fail in mult_integerizer
* v0.3.4 - add survey weighting use case
* v0.3.5 - add Python 3.5+ support
* v0.4 - transfer to ActivitySim.org
* v0.4.1 - package updates
* v0.4.2 - validation script in Python
* v0.4.3 - allow non-binary incidence
* v0.5 - support for multiprocessing
3 changes: 2 additions & 1 deletion example_survey_weighting/configs/settings.yaml
@@ -18,7 +18,8 @@ USE_SIMUL_INTEGERIZER: True
USE_CVXPY: False
max_expansion_factor: 4 # Default is 30
min_expansion_factor: 0.5

absolute_upper_bound: 20000
absolute_lower_bound: 1

# Geographic Settings
# ------------------------------------------------------------------
21 changes: 18 additions & 3 deletions populationsim/balancer.py
@@ -242,6 +242,7 @@ def np_balancer(
def do_balancing(control_spec,
total_hh_control_col,
max_expansion_factor, min_expansion_factor,
absolute_upper_bound, absolute_lower_bound,
incidence_df, control_totals, initial_weights):

# incidence table should only have control columns
@@ -262,14 +263,21 @@ def do_balancing(control_spec,

if min_expansion_factor:

# number_of_households in this seed geograpy as specified in seed_controlss
# number_of_households in this seed geography as specified in seed_controls
number_of_households = control_totals[total_hh_control_index]

total_weights = initial_weights.sum()
lb_ratio = min_expansion_factor * float(number_of_households) / float(total_weights)

lb_weights = initial_weights * lb_ratio
lb_weights = lb_weights.clip(lower=0)

if absolute_lower_bound:
lb_weights = lb_weights.clip(lower=absolute_lower_bound)
else:
lb_weights = lb_weights.clip(lower=0)

elif absolute_lower_bound:
lb_weights = initial_weights.clip(lower=absolute_lower_bound)

else:
lb_weights = None
@@ -283,7 +291,14 @@ def do_balancing(control_spec,
ub_ratio = max_expansion_factor * float(number_of_households) / float(total_weights)

ub_weights = initial_weights * ub_ratio
ub_weights = ub_weights.round().clip(lower=1).astype(int)

if absolute_upper_bound:
ub_weights = ub_weights.round().clip(upper=absolute_upper_bound, lower=1).astype(int)
else:
ub_weights = ub_weights.round().clip(lower=1).astype(int)

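elif absolute_upper_bound:
# ub_weights is not yet defined in this branch; clip the initial weights instead,
# mirroring the absolute_lower_bound branch above
ub_weights = initial_weights.round().clip(upper=absolute_upper_bound, lower=1).astype(int)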

else:
ub_weights = None
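
The bound logic in this hunk can be sketched as a standalone function over plain Python lists. This is an illustration, not the pandas-based code in `do_balancing`: the `weight_bounds` name is hypothetical, and the branch with only an absolute upper bound is assumed to cap the unscaled initial weights, mirroring the lower-bound branch.

```python
def weight_bounds(initial_weights, number_of_households,
                  min_expansion_factor=None, max_expansion_factor=None,
                  absolute_lower_bound=None, absolute_upper_bound=None):
    """Return (lb_weights, ub_weights); either is None when that side is unbounded."""
    total_weights = sum(initial_weights)

    # Lower bounds: scale by the expansion factor, then apply the absolute floor.
    if min_expansion_factor:
        lb_ratio = min_expansion_factor * float(number_of_households) / float(total_weights)
        floor = absolute_lower_bound if absolute_lower_bound else 0
        lb_weights = [max(w * lb_ratio, floor) for w in initial_weights]
    elif absolute_lower_bound:
        lb_weights = [max(w, absolute_lower_bound) for w in initial_weights]
    else:
        lb_weights = None

    # Upper bounds: scale by the expansion factor when one is given.
    if max_expansion_factor:
        ub_ratio = max_expansion_factor * float(number_of_households) / float(total_weights)
        scaled = [w * ub_ratio for w in initial_weights]
    elif absolute_upper_bound:
        # Assumption: with no expansion factor, cap the unscaled initial weights.
        scaled = list(initial_weights)
    else:
        scaled = None

    if scaled is None:
        ub_weights = None
    else:
        # Round to integers, keep at least 1, then apply the absolute cap if any.
        cap = absolute_upper_bound if absolute_upper_bound else float("inf")
        ub_weights = [int(min(max(round(v), 1), cap)) for v in scaled]

    return lb_weights, ub_weights


# Toy example (hypothetical numbers): three seed households, 120 target households.
lb, ub = weight_bounds([10, 20, 30], 120,
                       min_expansion_factor=0.5, max_expansion_factor=4,
                       absolute_lower_bound=15, absolute_upper_bound=200)
print(lb)
print(ub)
```

The expansion factors scale with the ratio of target households to seed weights, whereas the absolute bounds apply the same limit to every household regardless of that ratio.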
4 changes: 4 additions & 0 deletions populationsim/steps/final_seed_balancing.py
@@ -68,6 +68,8 @@ def final_seed_balancing(settings, crosswalk, control_spec, incidence_table):

max_expansion_factor = settings.get('max_expansion_factor', None)
min_expansion_factor = settings.get('min_expansion_factor', None)
absolute_upper_bound = settings.get('absolute_upper_bound', None)
absolute_lower_bound = settings.get('absolute_lower_bound', None)

relaxation_factors = pd.DataFrame(index=seed_controls_df.columns.tolist())

@@ -86,6 +88,8 @@ def final_seed_balancing(settings, crosswalk, control_spec, incidence_table):
total_hh_control_col=total_hh_control_col,
max_expansion_factor=max_expansion_factor,
min_expansion_factor=min_expansion_factor,
absolute_lower_bound=absolute_lower_bound,
absolute_upper_bound=absolute_upper_bound,
incidence_df=seed_incidence_df,
control_totals=seed_controls_df.loc[seed_id],
initial_weights=seed_incidence_df['sample_weight'])
4 changes: 4 additions & 0 deletions populationsim/steps/initial_seed_balancing.py
@@ -65,6 +65,8 @@ def initial_seed_balancing(settings, crosswalk, control_spec, incidence_table):

max_expansion_factor = settings.get('max_expansion_factor', None)
min_expansion_factor = settings.get('min_expansion_factor', None)
absolute_upper_bound = settings.get('absolute_upper_bound', None)
absolute_lower_bound = settings.get('absolute_lower_bound', None)

# run balancer for each seed geography
weight_list = []
@@ -82,6 +84,8 @@ def initial_seed_balancing(settings, crosswalk, control_spec, incidence_table):
total_hh_control_col=total_hh_control_col,
max_expansion_factor=max_expansion_factor,
min_expansion_factor=min_expansion_factor,
absolute_upper_bound=absolute_upper_bound,
absolute_lower_bound=absolute_lower_bound,
incidence_df=seed_incidence_df,
control_totals=seed_controls_df.loc[seed_id],
initial_weights=seed_incidence_df['sample_weight'])
4 changes: 4 additions & 0 deletions populationsim/steps/repop_balancing.py
@@ -60,6 +60,8 @@ def repop_balancing(settings, crosswalk, control_spec, incidence_table):

max_expansion_factor = settings.get('max_expansion_factor', None)
min_expansion_factor = settings.get('min_expansion_factor', None)
absolute_upper_bound = settings.get('absolute_upper_bound', None)
absolute_lower_bound = settings.get('absolute_lower_bound', None)

# run balancer for each low geography
low_weight_list = []
@@ -101,6 +103,8 @@ def repop_balancing(settings, crosswalk, control_spec, incidence_table):
total_hh_control_col=total_hh_control_col,
max_expansion_factor=max_expansion_factor,
min_expansion_factor=min_expansion_factor,
absolute_upper_bound=absolute_upper_bound,
absolute_lower_bound=absolute_lower_bound,
incidence_df=seed_incidence_df,
control_totals=low_controls_df.loc[low_id],
initial_weights=initial_weights)
10 changes: 5 additions & 5 deletions populationsim/steps/setup_data_structures.py
@@ -111,11 +111,11 @@ def add_geography_columns(incidence_table, households_df, crosswalk_df):
# add seed_geography col to incidence table
incidence_table[seed_geography] = households_df[seed_geography]

# add meta column to incidence table
seed_to_meta = \
crosswalk_df[[seed_geography, meta_geography]] \
.groupby(seed_geography, as_index=True).min()[meta_geography]
incidence_table[meta_geography] = incidence_table[seed_geography].map(seed_to_meta)
# add meta column to incidence table (unless it's already there)
if seed_geography != meta_geography:
tmp = crosswalk_df[list({seed_geography, meta_geography})]
seed_to_meta = tmp.groupby(seed_geography, as_index=True).min()[meta_geography]
incidence_table[meta_geography] = incidence_table[seed_geography].map(seed_to_meta)

return incidence_table
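
The guarded seed-to-meta mapping in this hunk can be illustrated without pandas. The sketch below is a plain-Python analogue under stated assumptions: rows are dicts, the helper names are hypothetical, and a seed zone missing from the crosswalk maps to `None` (a pandas `map` would give NaN instead).

```python
def seed_to_meta_map(crosswalk_rows, seed_geography, meta_geography):
    """Map each seed zone to the smallest meta zone id it appears with."""
    mapping = {}
    for row in crosswalk_rows:
        seed, meta = row[seed_geography], row[meta_geography]
        mapping[seed] = min(meta, mapping.get(seed, meta))
    return mapping


def add_meta_column(incidence_rows, crosswalk_rows, seed_geography, meta_geography):
    """Add the meta geography to each row, unless it is the same as the seed geography."""
    if seed_geography != meta_geography:
        mapping = seed_to_meta_map(crosswalk_rows, seed_geography, meta_geography)
        for row in incidence_rows:
            row[meta_geography] = mapping.get(row[seed_geography])
    return incidence_rows


# Toy crosswalk and incidence rows (hypothetical zone ids).
crosswalk = [
    {"TAZ": 100, "PUMA": 1, "REGION": 1},
    {"TAZ": 101, "PUMA": 1, "REGION": 1},
    {"TAZ": 200, "PUMA": 2, "REGION": 1},
]
incidence = [{"hh_id": 1, "PUMA": 1}, {"hh_id": 2, "PUMA": 2}]
print(add_meta_column(incidence, crosswalk, "PUMA", "REGION"))
```

When the seed and meta geographies are the same column, the guard skips the mapping entirely, so the existing seed column is left as it is.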
