
[PULL REQUEST] New methodology for n-dimensional rounding #179

Closed
Eric-Liu-SANDAG wants to merge 17 commits into main from nd-rounding

Conversation

@Eric-Liu-SANDAG
Contributor

@Eric-Liu-SANDAG Eric-Liu-SANDAG commented Dec 13, 2025

Describe this pull request. What changes are being made?

A collection of changes related to n-dimensional rounding. Specifically, an attempt to solve it such that convergence is guaranteed and, ideally, runtime is reduced.

What issues does this pull request address?

Additional context

Some of the work done on this branch was shifted into a different branch and has already been merged into main via #176. Therefore, the main purpose of this PR is to fully document the work and to ensure that it is not lost when the branch is deleted.

Eric-Liu-SANDAG and others added 17 commits September 22, 2025 12:18
Includes:
* IPF implementation using numpy (much faster than IPFN!)
* Utilities for creating various random test data
* Stochastic (aka fuzzy) and PuLP methodologies for ND integerization, both minimally tested on at least 3-D data
`gq_other` went from ~54 seconds to ~14 seconds. At this point, solving accounts for about half of that time (~7 seconds), so there may not be much more to optimize
File is too large for GH, so it's in SQL Server as `[ws].[dbo].[Group_Quarters_Institutional_Correctional_Facilities_PULP_CBC_CMD]`
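For reference, the numpy-based IPF mentioned above can be sketched roughly as follows. This is a minimal illustration of 2-D iterative proportional fitting, not the actual `ipf.py` implementation; the function name, convergence test, and iteration cap are all illustrative.

```python
import numpy as np

def ipf_2d(seed, row_targets, col_targets, tol=1e-9, max_iter=1000):
    """Iterative proportional fitting: alternately rescale rows and
    columns of `seed` until its marginals match the targets.
    Illustrative sketch only -- not the actual ipf.py code."""
    x = seed.astype(float).copy()
    row_targets = np.asarray(row_targets, dtype=float)
    col_targets = np.asarray(col_targets, dtype=float)
    for _ in range(max_iter):
        x *= (row_targets / x.sum(axis=1))[:, None]  # fit row sums
        x *= col_targets / x.sum(axis=0)             # fit column sums
        if np.allclose(x.sum(axis=1), row_targets, atol=tol):
            break  # both marginals now match within tolerance
    return x
```

Operating on whole arrays with broadcasting, rather than looping per cell, is what makes a numpy IPF much faster than a generic implementation like IPFN.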
@Eric-Liu-SANDAG
Contributor Author

This PR contains quite a few changes, which are fully summarized below. Hopefully, this makes the changes easier to find in the future in case the work is needed again.

@Eric-Liu-SANDAG
Contributor Author

ipf.py

The contents of this file have mostly been merged into main already via #176. Changes that were not merged are limited to testing code

@Eric-Liu-SANDAG
Contributor Author

random_data.py

This file contains helper functions to create deterministic random data of a specified shape, along with some marginals to experiment with. The three random generators are: uniform random; low skewed (80% of the values over 0.1 are randomly reassigned new values between 0 and 0.1); and sparse (values are randomly set to zero based on an input fraction, defaulting to 70%).
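The low-skewed and sparse generators described above could look roughly like this. Function names, the fixed seed, and the exact sampling details are illustrative assumptions, not the actual contents of `random_data.py`:

```python
import numpy as np

# Illustrative sketches of the generators described above; the real
# function names and sampling details in random_data.py may differ.
def sparse_random(shape, zero_frac=0.7, seed=42):
    """Uniform random data with roughly `zero_frac` of values zeroed."""
    rng = np.random.default_rng(seed)  # fixed seed -> deterministic output
    x = rng.random(shape)
    x[rng.random(shape) < zero_frac] = 0.0
    return x

def low_skewed(shape, seed=42):
    """Uniform random data where 80% of values over 0.1 are reassigned
    to new values between 0 and 0.1."""
    rng = np.random.default_rng(seed)
    x = rng.random(shape)
    mask = (x > 0.1) & (rng.random(shape) < 0.8)
    x[mask] = rng.random(int(mask.sum())) * 0.1
    return x
```

Seeding `default_rng` is what makes the "random" data deterministic, so test runs are reproducible.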

@Eric-Liu-SANDAG
Contributor Author

Eric-Liu-SANDAG commented Dec 16, 2025

nd_rounding.py

This file contains a bunch of different methods for solving ND rounding and some testing code. The methods include:

  1. A stochastic method (nd_controlling_fuzzy() and _nd_controlling_fuzzy_step()). In every iteration, the method selects a certain percentage (default 50%) of the current rounding error and probabilistically assigns corrections until all rounding error is zero. The correction weight for a coordinate is the product of the rounding errors along each dimension, so the coordinate $(x, y, z)$ has weight $RE_x \cdot RE_y \cdot RE_z$, where $RE_n$ is the rounding error summed along that axis at point $n$. This method worked perfectly for large, dense data, but tended to run into dead ends when working on sparse data. The next method was a small attempt to address this issue
  2. A third-party solver (nd_controlling_pulp_solver() and nd_controlling_pulp_solver_2d()). These functions are basically the same, except that one handles the 2D case only. They use the PuLP python library and various free third-party solvers to find ND rounding solutions. Most of the code in these functions sets up the problem, using pulp.LpVariable() for variables and pulp.LpAffineExpression() for equations. After setup, solve() is called and the outputs are coerced back into the original format. This method always works, but slows down significantly on larger datasets. Not only does the solving slow down, but the actual setup of creating variables/equations can also get really slow.
  3. A combination of 1 and 2 (nd_controlling_mixed() and nd_controlling_mixed_safe()). The idea here is to use the extreme speed of method 1 and the actual solving capabilities of method 2 to do things quickly with a guaranteed solve. In other words, run the stochastic method until some point, then use PuLP to actually solve. The first function uses a total rounding error threshold, with a default value of 1000. The second tries the stochastic method until failure, then undoes the steps one at a time, trying PuLP after each undo, until one finally solves. I found that neither of these solutions was very helpful in speeding things up. The first would still invariably fail on some input data no matter what threshold was chosen. With the second, in the worst case, we would end up setting up and attempting to solve many different PuLP systems of equations, which was extremely slow
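The per-coordinate weight from method 1 can be sketched as a product of axis marginals of the rounding error. This is an illustrative reconstruction from the description above (the function name is hypothetical, and taking absolute values of the marginals is my assumption, not stated in the original):

```python
import numpy as np

def fuzzy_weights(x: np.ndarray) -> np.ndarray:
    """Weight each coordinate by the product of the per-axis rounding
    errors, as described for the stochastic method. Illustrative sketch;
    the use of absolute values here is an assumption."""
    err = x - np.round(x)  # signed rounding error per cell
    w = np.ones(x.shape)
    for axis in range(x.ndim):
        # Rounding error summed along every OTHER axis gives the
        # marginal RE_n for this axis.
        other = tuple(a for a in range(x.ndim) if a != axis)
        marginal = np.abs(err.sum(axis=other))
        shape = [1] * x.ndim
        shape[axis] = x.shape[axis]
        w *= marginal.reshape(shape)  # broadcast RE_x * RE_y * ...
    return w
```

For a 2-D array this reduces to the outer product of the absolute row and column rounding-error sums, which is why cells on all-zero-error rows or columns get zero weight, the likely source of the dead ends on sparse data.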

Overall, it was determined that, barring some major speedup on larger datasets, solution 2 was workable, but only on group quarters data
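A minimal 2-D version of the PuLP setup described in method 2 might look like the following. This is a hedged sketch, not the actual nd_controlling_pulp_solver_2d(): the function name is hypothetical, it assumes the input marginals are already whole numbers, and it phrases the problem as one binary "round up" decision per cell rather than whatever encoding the real code uses.

```python
import numpy as np
import pulp

def round_2d_pulp(x: np.ndarray) -> np.ndarray:
    """Round each cell of `x` up or down so that the (assumed integer)
    row and column marginals of `x` are preserved. Illustrative sketch."""
    floor = np.floor(x)
    prob = pulp.LpProblem("nd_rounding", pulp.LpMinimize)
    prob += pulp.lpSum([])  # no objective: this is a feasibility problem
    # One binary "round up" decision per cell.
    up = [[pulp.LpVariable(f"up_{i}_{j}", cat="Binary")
           for j in range(x.shape[1])] for i in range(x.shape[0])]
    # Row and column sums must match the integer marginals of x.
    for i in range(x.shape[0]):
        prob += pulp.lpSum(up[i]) == round(x[i].sum() - floor[i].sum())
    for j in range(x.shape[1]):
        prob += (pulp.lpSum(up[i][j] for i in range(x.shape[0]))
                 == round(x[:, j].sum() - floor[:, j].sum()))
    prob.solve(pulp.PULP_CBC_CMD(msg=False))  # free bundled CBC solver
    choice = np.array([[pulp.value(v) for v in row] for row in up])
    return floor + choice
```

The setup cost noted above is visible even in this toy version: the number of variables and constraint terms grows with every cell, so building the problem alone becomes expensive on large arrays.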

@Eric-Liu-SANDAG
Contributor Author

All .csv files

A bunch of data pulled from midway through the ASE module of the Estimates Program. This data was used to test the IPF and ND rounding on actual ASE data

@Eric-Liu-SANDAG
Contributor Author

environment.yml

The file was updated to include both pulp==3.3.0, which provides the interface to various third-party solvers, and pulp[open_py]==3.3.0, which actually installs the free third-party solvers
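The relevant lines might look like this in a conda environment file; exact placement and the surrounding entries in the project's environment.yml are assumptions:

```yaml
dependencies:
  - pip
  - pip:
      - pulp==3.3.0
      - "pulp[open_py]==3.3.0"  # pulls in the free third-party solvers
```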

@Eric-Liu-SANDAG
Contributor Author

.gitignore

The file was updated to ignore all *.npy files, the binary serialization format for a numpy np.ndarray. These files were used by method 3 from nd_rounding.py in order to "roll back" from an unsolvable state to a previous state
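The snapshot-and-rollback pattern is just numpy's save/load round trip. A minimal sketch (the file path and checkpoint name are illustrative, not taken from the actual code):

```python
import os
import tempfile
import numpy as np

# Snapshot the current state to a .npy file before a risky step...
state = np.array([[1.2, 2.8], [3.5, 0.5]])
path = os.path.join(tempfile.mkdtemp(), "checkpoint.npy")
np.save(path, state)

# ...and "roll back" by reloading it if the step dead-ends.
restored = np.load(path)
```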

@Eric-Liu-SANDAG
Contributor Author

@GregorSchroeder any further thoughts or questions before I close the PR and delete the branch?

@Eric-Liu-SANDAG Eric-Liu-SANDAG deleted the nd-rounding branch December 17, 2025 19:36
@Eric-Liu-SANDAG
Contributor Author

Just to confirm: the branch has been deleted, so you cannot access it directly. However, the commits and changes still live on in this PR, so we can recover the code if necessary
