ActivitySim · bstabler · Nov 19, 2021 · May 12, 2020 · Jun 19, 2020 · Oct 20, 2020
diff --git a/.travis.yml b/.travis.yml
@@ -16,7 +16,7 @@ install:
 - conda info -a
 - conda create -q -n test-environment python=$TRAVIS_PYTHON_VERSION
 - conda activate test-environment
-- conda install pytest pytest-cov coveralls pycodestyle
+- conda install pytest pytest-cov coveralls pycodestyle cytoolz
 - pip install .
 - pip freeze
 
@@ -37,16 +37,7 @@ deploy:
   provider: pages
   local_dir: docs/_build/html
   skip_cleanup: true
-  github_token: $GH_TOKEN
+  github_token: $GH_TOKEN_POPSIM  # Set in the settings page of the repository, as a secure variable
   keep_history: true
   on:
     branch: master
-
-notifications:
-  slack:
-    secure: l7FunH4QsSrIumeyfAlihCOCv8MwPBv8UbPKYQWghAkaIcxQaS6YZ5MTPDX9Lx7eIax0vXRl090Or3QqCD/Ap/hLdZS3WaHC0111UjSMi96nyjeQouvW2NYxFuqBLfKJVLWjIVzDsLlMDS7a3Q25YvhnY/s5xjUsUY34fEEFRdKqRM5sUcCEARSXCAJeTVV/OQIgvraKqypmvYFX5LMlxfzhfE5U1H1Qg4vGkHkc42ZS/lSa5ow4nerpurHcLTq5zPLWCTU0XS6ikMhn98Qi8l0z/4mX8DV6Aka79tN0uZit1K3TS/uK1sDN1uoFFEVseu9y2yycJszb4PSBU5hTaurqc4Ui1OTfXN171MVC37WzpltsCyRoGKIm/50/1lWQPJijsszaYXSr6qXuOX7TjmdXZclhkxZCTDXPW0CHSFn+7ShDWMiODcBNeVQKEWlpOVATQDRRLzMnd7TvIYV8yEa3cpMtI2W2v/LxFPKygHynBoq5FY4l4rOvHaPkilUI48vfx/6AP+BuX4PAseyuuqAlLwWwXhSnNE1HHQzjwvhHx9PQ+S/GiCj3oLnxIDUE8t2hQHrD4yYXOUxjfFrjQNF6SiiNWi0+bfI/ZjpiLpFyVxDJBQovXaWRLPAM61LwqRZmoXGyUg3UWDKuEZ8Fs2Dnz5cZx05+ePb8AroaWyI=
-
-env:
-  global:
-    # GH_TOKEN
-    secure: IjVfvlQqAhryvf14wFhQ9hPVfT/an2bn152wax9ZKsvo5u6OyWk9GOWEbPujGCf1l/tXeDXmYzxQJb2Yedr2BMDF1399ssSVr1fFaka0S8WhftqKf2UN8uDe9IztxoARNHxPJjecHv6bJiXQFFXREsMa6bGlM8b17GzdDiIFME2WVoBw6eb7WZNRYfJeS66ObBHxjUTpAxNaWvhF7LAt6Mcu9kHhLggaAa21zPJPJ2d2AAZpFSUpmTIfFhTAI+JFwpbH1d8M3toDQpUmVYYDjHfhBxZ9QTd/C17FOcSwAU6HnQjOxHnsJX78dl0cYacEAVngEPcYgsfwE8b13uTGVLwbRoU73FJrBQvVjK4udJA5CDd5eEk5bpAl2NPmV223RFW5ZAwMuUKpQMAQNN8nswnzYQ+XfxdgR7wopNa6ZJ9JPd2EZ5ZNrBB9cRQUS8oY016SKuOs+C93PJZmPgJBHH9WUoZlaRX5x4XIXST0v6+B+oyPvdQrKb8t8r0RA7nvg9hq4B0o9OI5boDOiVlC1akjD0XWRUxa6arWGip7g5UBxtosbg9LeUG0XuQZpHpSsJj5BJPT0xtN4wdlSOBwXOUIi/HJT8jjxukMRkEwajWaMZsfseHN9zcI7j9wsFQ7lqkWodIOtRiXprkUljfvD+aWDeBk7JidT/ksZaIAJPk=
diff --git a/README.md b/README.md
@@ -1,7 +1,7 @@
 PopulationSim
 =============
 
-[![Build Status](https://travis-ci.org/activitysim/populationsim.svg?branch=master)](https://travis-ci.org/ActivitySim/populationsim) [![Coverage Status](https://coveralls.io/repos/ActivitySim/populationsim/badge.png?branch=master)](https://coveralls.io/r/ActivitySim/populationsim?branch=master)
+[![Build Status](https://travis-ci.org/activitysim/populationsim.svg?branch=master)](https://travis-ci.org/ActivitySim/populationsim) [![Coverage Status](https://coveralls.io/repos/ActivitySim/populationsim/badge.png?branch=master)](https://coveralls.io/r/ActivitySim/populationsim?branch=master)<a href="https://medium.com/zephyrfoundation/populationsim-the-synthetic-commons-670e17383048"><img src="https://github.com/ZephyrTransport/zephyr-website/blob/gh-pages/img/badging/project_pages/populationsim/PopulationSim.png" width="72.6" height="19.8"></a>
 
 
 PopulationSim is an open platform for population synthesis.  It emerged

diff --git a/docs/application_configuration.rst b/docs/application_configuration.rst
@@ -320,7 +320,7 @@ These settings control the functionality of the PopulationSim algorithm. The set
 |                                      |            | The maximum expansion factor may have to be adjusted upwards if the target |br| |
 |                                      |            | is much greater than the seed number of households.                        |br| |
 +--------------------------------------+------------+---------------------------------------------------------------------------------+
-| MAX_BALANCE_ITERATIONS_SIMULTANEOUS  | Integer    | Number of simultaneous list balancer iterations                                 |
+| MAX_BALANCE_ITERATIONS_SIMULTANEOUS  | Integer    | Number of list balancer iterations.  The default may be more than is needed.    |
 +--------------------------------------+------------+---------------------------------------------------------------------------------+
 
 
@@ -693,7 +693,7 @@ This sections describes the settings that are configured differently for the *re
 
 **Input Data Tables for repop mode**
 
-The repop mode runs over an existing synthetic population and uses the data pipeline (HDF5 file) from the regular run as an input. User should copy the HDF5 file from the regular outputs to the *output* folder of the repop set up. The data input which needs to be specified in this setting is the control data for the subset of geographies to be modified. Input tables for the repop mode can be specified in the same manner as regular mode. However, only one geography can be controlled. In the example below, TAZ controls are specified. The controls specified in TAZ_control_data do not have to be consistent with the controls specified in the data used to control the initial population. Only those geographic units to be repopulated should be specified in the control data (for example, TAZs 314 through 317).
+The repop mode runs over an existing synthetic population and uses the data pipeline (HDF5 file) from the regular run as an input. The user should copy the HDF5 file from the regular outputs to the *output* folder of the repop setup. The data input which needs to be specified in this setting is the control data for the subset of geographies to be modified. Input tables for the repop mode can be specified in the same manner as regular mode. However, only one geography can be controlled, and it must be the lowest geography in the ``geographies`` setting. In the example below, TAZ controls are specified. The controls specified in TAZ_control_data do not have to be consistent with the controls specified in the data used to control the initial population. Only those geographic units to be repopulated should be specified in the control data (for example, TAZs 314 through 317).
 
 ::
 
@@ -713,6 +713,7 @@ The repop mode runs over an existing synthetic population and uses the data pipe
 | Attribute                 | Description                                                 |
 +===========================+=============================================================+
 | repop_control_file_name   | Name of the CSV control specification file for repop mode   |
+|                           | Must include total_hh_control field                         |
 +---------------------------+-------------------------------------------------------------+
 
 
@@ -847,7 +848,7 @@ Attribute definitions are as follows:
 :seed_table:
         seed_table is the seed table the control applies to and it can be ``households`` or ``persons``.  If persons, then persons are aggregated to households using the count operator.
 :importance:
-        importance is the importance weight for the control. A higher weight will cause PopulationSim to attempt to match the control at the possible expense of matching lower-weight controls.
+        importance is the importance weight for the control. A higher weight will cause PopulationSim to attempt to match the control at the possible expense of matching lower-weight controls. The importance weights are described in more detail in the :ref:`importance` and :ref:`setting-importance` sections.
 :control_field:
         control_field is the field in the control data input files that this control applies to. Note that the control field names should be unique even if they are for different geographies.
 :expression:
@@ -858,6 +859,42 @@ Some conventions for writing expressions:
   * Expressions must be vectorized expressions and can use most numpy and pandas expressions.
   * When editing the CSV files in Excel, use single quote ' or space at the start of a cell to get Excel to accept the expression
 
+.. _importance: 
+
+What are importance weights
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+PopulationSim uses relative entropy maximization-based list balancing to match controls specified at various geographic levels. The relative entropy-based optimization ensures that the least amount of new information is introduced in finding a feasible solution. The base entropy is defined by the initial weights in the seed sample. The weights generated by the entropy maximization algorithm preserve the distribution of initial weights while matching the marginal controls. This ensures that the resulting weights are uniform and preserve the distribution of the uncontrolled variables in the seed sample. A general relative entropy optimization problem is formulated as:
+
+:math:`\min\limits_{x_{n}} \sum_{n}{x_{n}} \ln\dfrac{x_{n}}{w_{n}}`
+
+Where :math:`x_{n}` are the resulting household-level weights and :math:`w_{n}` are the initial weights. The marginal controls are specified as:
+
+:math:`\sum_{n}{a_{in}*x_{n}} = A_{i}`
+
+In PopulationSim, the hard marginal controls are relaxed by use of slack or relaxation factors in the constraints as shown below:
+
+:math:`\sum_{n}{a_{in}*x_{n}} = A_{i}*z_{i}`
+
+Where :math:`z_{i}` are relaxation factors and :math:`a_{in}` are incidence values that map household/person attributes to marginal controls. To ensure that marginal controls are not relaxed significantly, the relaxation factors are also included in the objective function with a penalty. With control relaxations, the relative entropy optimization problem is formulated as follows:
+
+:math:`\min\limits_{x_{n}, z_{i}} \sum_{n}{x_{n}} \ln\dfrac{x_{n}}{w_{n}} + \sum_{i}{u_{i}*(z_{i}\ln{z_{i}})}`
+
+Where, :math:`u_{i}` are the penalties termed as importance factors or importance weights in PopulationSim.
+
+:math:`x_{n}` and :math:`z_{i}` are the parameters solved by the optimization, while the importance weights (:math:`u_{i}`) are hyperparameters that are exposed to the user and impact the optimization externally. The objective of the relative entropy optimization is to find a set of weights that are uniform and satisfy marginal controls. The importance weights allow the user to trade off between these objectives. High importance weights (e.g., 1E10) on all controls result in a hard constrained optimization which gives a high preference to matching marginal controls. Low importance weights (e.g., <50) result in an almost unconstrained problem. The user may also specify different importance weights for each marginal control. In this case, the controls with higher importance weights are given preference over the ones with low importance weights. Therefore, both the absolute and relative values of the importance weights impact the optimization problem and the solution.
+
+.. _setting-importance: 
+
+Setting importance weights
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Given the flexibility that importance weights offer to the user, they need to be tuned to get the desired optimality in the outputs for the given seed sample and marginal controls. The quality of the outputs is defined by a uniformity measure of the weights and goodness of fit across marginal controls. Here are general guidelines on setting importance weights:
+
+   * Start with a reasonable importance factor value across all controls (e.g., 1000 has typically worked well for multiple regions). This excludes the control on the total number of households which should be set to very high importance to ensure that the right number of households is generated for each zone.
+   * After achieving reasonable goodness of fit across controls, the importance weights can be increased/decreased to favor one control over the other, or all importance weights can be reduced to improve the uniformity of the weights. Which controls to favor depends on the type of application and the quality of the marginal data. 
+   * The importance weights are generally updated in factors of 10. The user may need to run PopulationSim multiple times using various combinations of importance weights to reach the desired quality of outputs. 
+
 
 
 Error Handling & Debugging
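Note on the importance-weight documentation above: as a rough illustration only, the sketch below evaluates the relaxed objective exactly as it is written in the new section, for a toy case with two households and one control. The function names `relaxation_factors` and `relaxed_entropy_objective` are hypothetical and are not part of PopulationSim, and the numbers are made up. It shows only how the penalty term scales with the importance weight; it does not solve the optimization.

```python
import math


def relaxation_factors(x, incidence, targets):
    """z_i = (sum_n a_in * x_n) / A_i for each control i (hypothetical helper)."""
    return [sum(a * xn for a, xn in zip(row, x)) / target
            for row, target in zip(incidence, targets)]


def relaxed_entropy_objective(x, w, z, u):
    """sum_n x_n ln(x_n / w_n) + sum_i u_i * z_i ln(z_i), as written in the docs."""
    entropy = sum(xn * math.log(xn / wn) for xn, wn in zip(x, w))
    penalty = sum(ui * zi * math.log(zi) for ui, zi in zip(u, z))
    return entropy + penalty


# Toy data: two households, one control that counts both of them.
w = [10.0, 10.0]          # initial seed weights
x = [11.0, 11.0]          # candidate weights
incidence = [[1, 1]]      # a_in for the single control
targets = [20.0]          # control total A_i

z = relaxation_factors(x, incidence, targets)  # control overshoots by about 10%

# Same weights and same deviation, evaluated under two importance weights.
low = relaxed_entropy_objective(x, w, z, u=[10.0])
high = relaxed_entropy_objective(x, w, z, u=[1000.0])
```

With the weights held fixed, the entropy term is identical in both evaluations, so the difference between `high` and `low` comes entirely from the penalty on the relaxation factor.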

diff --git a/docs/getting_started.rst b/docs/getting_started.rst
@@ -12,7 +12,13 @@ This page describes how to install and run PopulationSim with the provided examp
 Installation
 ------------
 
-1. Install `Anaconda 64bit Python 3 <https://www.anaconda.com/distribution/>`__. Anaconda Python is required for PopulationSim.
+1. It is recommended that you install and use a *conda* package manager
+   for your system. One easy way to do so is by using `Anaconda 64bit Python 3 <https://www.anaconda.com/distribution/>`__,
+   although you should consult the `terms of service <https://www.anaconda.com/terms-of-service>`__
+   for this product and ensure you qualify (as of summer 2021, businesses and
+   governments with over 200 employees do not qualify for free usage).  If you prefer
+   a completely free open source *conda* tool, you can download and install the
+   appropriate version of `Miniforge <https://github.com/conda-forge/miniforge#miniforge3>`__.
 
 2. If you access the internet from behind a firewall, then you will need to configure your proxy server. To do so, create a .condarc file in your Anaconda installation folder (i.e. ``C:\ProgramData\Anaconda3``), such as:
 
@@ -62,7 +68,7 @@ ActivitySim
   ActivitySim depends + some handy Python installation management tools.
 
   For more information on Anaconda and ActivitySim, see ActivitySim's `getting started
-  <https://activitysim.github.io/activitysim/gettingstarted.html#anaconda>`__ guide.
+  <https://activitysim.github.io/activitysim/gettingstarted.html>`__ guide.
 
 
 Run Examples

diff --git a/docs/software.rst b/docs/software.rst
@@ -224,18 +224,3 @@ Contribution Guidelines
 
 PopulationSim development follows the same `development guidelines <https://activitysim.github.io/activitysim/development.html>`__ as ActivitySim.
 
-
-Release Notes
--------------
-
-  * v0.3 - first release
-  * v0.3.1 - allow zones with zero households
-  * v0.3.2 - fix bug in mult-integerizer with total_hh_parent_control_index
-  * v0.3.3 - add disgnostic printouts on assert fail in mult_integerizer
-  * v0.3.4 - add survey weighting use case
-  * v0.3.5 - add Python 3.5+ support
-  * v0.4 - transfer to ActivitySim.org
-  * v0.4.1 - package updates
-  * v0.4.2 - validation script in Python
-  * v0.4.3 - allow non-binary incidence 
-  * v0.5 - support for multiprocessing
diff --git a/example_survey_weighting/configs/settings.yaml b/example_survey_weighting/configs/settings.yaml
@@ -18,7 +18,8 @@ USE_SIMUL_INTEGERIZER: True
 USE_CVXPY: False
 max_expansion_factor: 4 # Default is 30
 min_expansion_factor: 0.5
-
+absolute_upper_bound: 20000
+absolute_lower_bound: 1
 
 # Geographic Settings
 # ------------------------------------------------------------------

diff --git a/populationsim/balancer.py b/populationsim/balancer.py
@@ -242,6 +242,7 @@ def np_balancer(
 def do_balancing(control_spec,
                  total_hh_control_col,
                  max_expansion_factor, min_expansion_factor,
+                 absolute_upper_bound, absolute_lower_bound,
                  incidence_df, control_totals, initial_weights):
 
     # incidence table should only have control columns
@@ -262,14 +263,21 @@ def do_balancing(control_spec,
 
     if min_expansion_factor:
 
-        # number_of_households in this seed geograpy as specified in seed_controlss
+        # number_of_households in this seed geography as specified in seed_controls
         number_of_households = control_totals[total_hh_control_index]
 
         total_weights = initial_weights.sum()
         lb_ratio = min_expansion_factor * float(number_of_households) / float(total_weights)
 
         lb_weights = initial_weights * lb_ratio
-        lb_weights = lb_weights.clip(lower=0)
+
+        if absolute_lower_bound:
+            lb_weights = lb_weights.clip(lower=absolute_lower_bound)
+        else:
+            lb_weights = lb_weights.clip(lower=0)
+
+    elif absolute_lower_bound:
+        lb_weights = initial_weights.clip(lower=absolute_lower_bound)
 
     else:
         lb_weights = None
@@ -283,7 +291,14 @@ def do_balancing(control_spec,
         ub_ratio = max_expansion_factor * float(number_of_households) / float(total_weights)
 
         ub_weights = initial_weights * ub_ratio
-        ub_weights = ub_weights.round().clip(lower=1).astype(int)
+
+        if absolute_upper_bound:
+            ub_weights = ub_weights.round().clip(upper=absolute_upper_bound, lower=1).astype(int)
+        else:
+            ub_weights = ub_weights.round().clip(lower=1).astype(int)
+
+    elif absolute_upper_bound:
+        ub_weights = initial_weights.round().clip(upper=absolute_upper_bound, lower=1).astype(int)
 
     else:
         ub_weights = None
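Note on the bounds logic in `do_balancing`: a minimal, dependency-free sketch of how the expansion-factor bounds and the new absolute bounds appear to combine when both expansion factors are set. The function name `weight_bounds` and the sample numbers are hypothetical; the real code works on pandas Series and is not reproduced here. The branches where only an absolute bound is configured are deliberately left out.

```python
def weight_bounds(initial_weights, number_of_households,
                  min_expansion_factor=None, max_expansion_factor=None,
                  absolute_lower_bound=None, absolute_upper_bound=None):
    """Return (lower, upper) per-household bounds; each is None if its factor is unset."""
    scale = float(number_of_households) / float(sum(initial_weights))

    lower = None
    if min_expansion_factor:
        # Scale the seed weights down, then floor at the absolute lower bound (or 0).
        floor = absolute_lower_bound or 0
        lower = [max(w * min_expansion_factor * scale, floor) for w in initial_weights]

    upper = None
    if max_expansion_factor:
        # Scale the seed weights up, round, and keep every upper bound at least 1.
        upper = [max(round(w * max_expansion_factor * scale), 1) for w in initial_weights]
        if absolute_upper_bound:
            # Cap at the absolute upper bound when one is configured.
            upper = [min(v, absolute_upper_bound) for v in upper]

    return lower, upper


# Made-up seed weights; the household total equals the weight sum, so scale is 1.
lower, upper = weight_bounds(
    [1.0, 200.0, 6000.0], number_of_households=6201,
    min_expansion_factor=0.5, max_expansion_factor=4,
    absolute_lower_bound=1, absolute_upper_bound=20000)
```

In this toy case the absolute lower bound lifts the smallest household's floor from 0.5 to 1, and the absolute upper bound caps the largest household's ceiling at 20000 instead of 24000.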

diff --git a/populationsim/steps/final_seed_balancing.py b/populationsim/steps/final_seed_balancing.py
@@ -68,6 +68,8 @@ def final_seed_balancing(settings, crosswalk, control_spec, incidence_table):
 
     max_expansion_factor = settings.get('max_expansion_factor', None)
     min_expansion_factor = settings.get('min_expansion_factor', None)
+    absolute_upper_bound = settings.get('absolute_upper_bound', None)
+    absolute_lower_bound = settings.get('absolute_lower_bound', None)
 
     relaxation_factors = pd.DataFrame(index=seed_controls_df.columns.tolist())
 
@@ -86,6 +88,8 @@ def final_seed_balancing(settings, crosswalk, control_spec, incidence_table):
             total_hh_control_col=total_hh_control_col,
             max_expansion_factor=max_expansion_factor,
             min_expansion_factor=min_expansion_factor,
+            absolute_lower_bound=absolute_lower_bound,
+            absolute_upper_bound=absolute_upper_bound,
             incidence_df=seed_incidence_df,
             control_totals=seed_controls_df.loc[seed_id],
             initial_weights=seed_incidence_df['sample_weight'])

diff --git a/populationsim/steps/initial_seed_balancing.py b/populationsim/steps/initial_seed_balancing.py
@@ -65,6 +65,8 @@ def initial_seed_balancing(settings, crosswalk, control_spec, incidence_table):
 
     max_expansion_factor = settings.get('max_expansion_factor', None)
     min_expansion_factor = settings.get('min_expansion_factor', None)
+    absolute_upper_bound = settings.get('absolute_upper_bound', None)
+    absolute_lower_bound = settings.get('absolute_lower_bound', None)
 
     # run balancer for each seed geography
     weight_list = []
@@ -82,6 +84,8 @@ def initial_seed_balancing(settings, crosswalk, control_spec, incidence_table):
             total_hh_control_col=total_hh_control_col,
             max_expansion_factor=max_expansion_factor,
             min_expansion_factor=min_expansion_factor,
+            absolute_upper_bound=absolute_upper_bound,
+            absolute_lower_bound=absolute_lower_bound,
             incidence_df=seed_incidence_df,
             control_totals=seed_controls_df.loc[seed_id],
             initial_weights=seed_incidence_df['sample_weight'])

diff --git a/populationsim/steps/repop_balancing.py b/populationsim/steps/repop_balancing.py
@@ -60,6 +60,8 @@ def repop_balancing(settings, crosswalk, control_spec, incidence_table):
 
     max_expansion_factor = settings.get('max_expansion_factor', None)
     min_expansion_factor = settings.get('min_expansion_factor', None)
+    absolute_upper_bound = settings.get('absolute_upper_bound', None)
+    absolute_lower_bound = settings.get('absolute_lower_bound', None)
 
     # run balancer for each low geography
     low_weight_list = []
@@ -101,6 +103,8 @@ def repop_balancing(settings, crosswalk, control_spec, incidence_table):
                 total_hh_control_col=total_hh_control_col,
                 max_expansion_factor=max_expansion_factor,
                 min_expansion_factor=min_expansion_factor,
+                absolute_upper_bound=absolute_upper_bound,
+                absolute_lower_bound=absolute_lower_bound,
                 incidence_df=seed_incidence_df,
                 control_totals=low_controls_df.loc[low_id],
                 initial_weights=initial_weights)

diff --git a/populationsim/steps/setup_data_structures.py b/populationsim/steps/setup_data_structures.py
@@ -111,11 +111,11 @@ def add_geography_columns(incidence_table, households_df, crosswalk_df):
     # add seed_geography col to incidence table
     incidence_table[seed_geography] = households_df[seed_geography]
 
-    # add meta column to incidence table
-    seed_to_meta = \
-        crosswalk_df[[seed_geography, meta_geography]] \
-        .groupby(seed_geography, as_index=True).min()[meta_geography]
-    incidence_table[meta_geography] = incidence_table[seed_geography].map(seed_to_meta)
+    # add meta column to incidence table (unless it's already there)
+    if seed_geography != meta_geography:
+        tmp = crosswalk_df[list({seed_geography, meta_geography})]
+        seed_to_meta = tmp.groupby(seed_geography, as_index=True).min()[meta_geography]
+        incidence_table[meta_geography] = incidence_table[seed_geography].map(seed_to_meta)
 
     return incidence_table
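Note on the `add_geography_columns` change: a small sketch of the seed-to-meta lookup in plain Python, assuming the crosswalk is given as a list of dicts. The function name `seed_to_meta_map` and the geography names `PUMA` and `REGION` are hypothetical; the actual code uses a pandas groupby, min and map. The sketch mirrors the new guard: when the seed and meta geographies are the same column, no lookup is built.

```python
def seed_to_meta_map(crosswalk_rows, seed_geography, meta_geography):
    """Map each seed zone id to the smallest meta zone id it appears with."""
    if seed_geography == meta_geography:
        # The column already exists under the seed name, so nothing to add.
        return None

    mapping = {}
    for row in crosswalk_rows:
        seed_id, meta_id = row[seed_geography], row[meta_geography]
        # Equivalent of groupby(seed).min()[meta]: keep the minimum meta id.
        mapping[seed_id] = min(meta_id, mapping.get(seed_id, meta_id))
    return mapping


# Made-up crosswalk rows for illustration.
crosswalk = [
    {"PUMA": 1, "REGION": 10},
    {"PUMA": 1, "REGION": 10},
    {"PUMA": 2, "REGION": 20},
]
lookup = seed_to_meta_map(crosswalk, "PUMA", "REGION")
same = seed_to_meta_map(crosswalk, "PUMA", "PUMA")
```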