[PULL REQUEST] New IPF implementation by Eric-Liu-SANDAG · Pull Request #176 · SANDAG/Estimates-Program

Eric-Liu-SANDAG · 2025-12-03T22:38:07Z

Describe this pull request. What changes are being made?

A new bespoke implementation of IPF

What issues does this pull request address?

Resolves [BUG] Non-zero marginals associated with all zero data in IPF #174
Resolves [BUG] IPF process non-convergence #175

Additional context

Originally based on work done for #159

Eric-Liu-SANDAG · 2025-12-04T23:15:55Z

New IPF methodology implemented for the main usage of ipfn. Test run for 2020 can be found under [run_id] = 106 (it passed all QC checks with no error rows). The below query shows differences compared to the actual 2024 Estimates data, aka [run_id] = 82

SELECT 
    [82].[year],
    [82].[mgra],
    [82].[age_group],
    [82].[sex],
    [82].[ethnicity],
    [82].[value],
    [106].[value]
FROM [EstimatesProgram].[outputs].[ase] AS [82]
LEFT JOIN [EstimatesProgram].[outputs].[ase] AS [106]
    ON [82].[year] = [106].[year]
    AND [82].[mgra] = [106].[mgra]
    AND [82].[pop_type] = [106].[pop_type]
    AND [82].[age_group] = [106].[age_group]
    AND [82].[sex] = [106].[sex]
    AND [82].[ethnicity] = [106].[ethnicity]
WHERE [82].[run_id] = 82
    AND [106].[run_id] = 106
    AND [82].[value] != [106].[value]
ORDER BY ABS([82].[value] - [106].[value]) DESC

Quite a few rows of differences, hard to really tell what's significant and what's not...

When looking at the ethnicity distribution of GQ college, nothing really stands out to me as a clear and obvious error.

SELECT 
    [82].[year],
    [82].[mgra],
    [82].[pop_type],
    -- [82].[age_group],
    -- [82].[sex],
    [82].[ethnicity],
    SUM([82].[value]) AS [82_value],
    SUM([106].[value]) AS [106_value]
FROM [EstimatesProgram].[outputs].[ase] AS [82]
LEFT JOIN [EstimatesProgram].[outputs].[ase] AS [106]
    ON [82].[year] = [106].[year]
    AND [82].[mgra] = [106].[mgra]
    AND [82].[pop_type] = [106].[pop_type]
    AND [82].[age_group] = [106].[age_group]
    AND [82].[sex] = [106].[sex]
    AND [82].[ethnicity] = [106].[ethnicity]
WHERE [82].[run_id] = 82
    AND [106].[run_id] = 106
    AND [82].[pop_type] = 'Group Quarters - College'
    AND [82].[mgra] IN (
        SELECT DISTINCT [mgra] 
        FROM [EstimatesProgram].[outputs].[gq]
        WHERE [run_id] = 82
            AND [year] = 2020
            AND [gq_type] = 'Group Quarters - College'
            AND [value] > 0)
GROUP BY 
    [82].[year],
    [82].[mgra],
    [82].[pop_type],
    -- [82].[age_group],
    -- [82].[sex],
    [82].[ethnicity]
ORDER BY 
    [82].[year],
    [82].[mgra],
    [82].[pop_type],
    -- [82].[age_group],
    -- [82].[sex],
    [82].[ethnicity]

Values which are zero in 82 are typically still zero or tiny values in 106. Similarly, distributions seem to be more or less the same, with the slight caveat that 106 seems to have fewer overall zeros

Also, turns out that the input data to IPFN wasn't very clean, and the new IPF function has a bunch of input checking which resulted in some fail states. So I had to clean up the data a little bit

Eric-Liu-SANDAG · 2025-12-10T01:01:47Z

[run_id] = 115 now contains a run of 2020 with all IPFN replaced with the new IPF function (all automated checks passed). The GQ College check which was done for [run_id] = 106 was repeated using 115 and similar results were observed. Will do some more rigorous testing later, aka tomorrow

Eric-Liu-SANDAG · 2025-12-11T21:02:02Z

A complete run 2020 to 2024 was completed as well ([run_id] = 117) and all automated checks passed. This includes the MGRA restrictions checks, which seems a little odd to me. Maybe the old IPFN was doing something weird that caused the integerization to fail?

The GQ College check was done again and results similar to 106 and 115 were observed, but also spread over all the years.

GQ Military was virtually identical between 82 and 117, with off by one differences but not much more

GQ Other/Prison/HHP were both similar to GQ College in that distributions were broadly the same, but actual values differed, sometimes by a more substantial amount (less so for HHP). But for the most part, any changes can functionally be attributed to noise

Eric-Liu-SANDAG · 2025-12-11T21:41:37Z

Of particular note is this little bit of data cleaning that I was forced to do:

Estimates-Program/python/ase.py

Lines 355 to 364 in e1a520f

    
           # Haha, JK, not finally, there's one more thing to fix. Due to the removal of 
        
           # B01001F from the seed data, we could have a non-zero marginal associated with 
        
           # all zero seed data. We'll resolve by just seeding with a tiny value. Note this 
        
           # can only occur with the column controls (B01001) 
        
           if np.any((col_controls > 0) & (seed.sum(axis=0) == 0)): 
        
               col_idx_to_adjust = np.where( 
        
                   (col_controls > 0) & (seed.sum(axis=0).to_numpy() == 0) 
        
               ) 
        
               for col_idx in col_idx_to_adjust: 
        
                   seed[seed.columns[col_idx]] = 0.001

I observed in certain GQ mgras (easier to see in GQ data since there's a lot less of it) that some MGRAs got a large spike in NH2+, likely result of this pre-seeding

SELECT 
    [82].[year],
    [82].[mgra],
    [82].[pop_type],
    -- [82].[age_group],
    -- [82].[sex],
    [82].[ethnicity],
    SUM([82].[value]) AS [82_value],
    SUM([117].[value]) AS [117_value],
    ABS(SUM([82].[value]) - SUM([117].[value])) AS [diff]
FROM [EstimatesProgram].[outputs].[ase] AS [82]
LEFT JOIN [EstimatesProgram].[outputs].[ase] AS [117]
    ON [82].[year] = [117].[year]
    AND [82].[mgra] = [117].[mgra]
    AND [82].[pop_type] = [117].[pop_type]
    AND [82].[age_group] = [117].[age_group]
    AND [82].[sex] = [117].[sex]
    AND [82].[ethnicity] = [117].[ethnicity]
WHERE [82].[run_id] = 82
    AND [117].[run_id] = 117
    AND [82].[pop_type] = 'Group Quarters - Institutional Correctional Facilities'
    AND [82].[mgra] IN (
        SELECT DISTINCT [mgra] 
        FROM [EstimatesProgram].[outputs].[gq]
        WHERE [run_id] = 82
            AND [gq_type] = 'Group Quarters - Institutional Correctional Facilities'
            AND [value] > 0)
    AND [82].[ethnicity] = 'Non-Hispanic, Two or More Races'
GROUP BY 
    [82].[year],
    [82].[mgra],
    [82].[pop_type],
    -- [82].[age_group],
    -- [82].[sex],
    [82].[ethnicity]
HAVING 
    SUM([82].[value]) = 0
ORDER BY 
    [82].[year],
    [82].[mgra],
    [82].[pop_type],
    --[82].[age_group],
    --[82].[sex],
    [82].[ethnicity]
    -- ABS(SUM([82].[value]) - SUM([117].[value])) DESC

But all else being equal, I think the new data is better as it explicitly fixes data issues, instead of having them fixed by undefined behavior of IPFN. Also the exact differences don't rise to the level of concern for me

Eric-Liu-SANDAG · 2025-12-11T22:25:19Z

The data, when analyzed at larger geographies, seems to have most differences canceled out such that you nearly can't tell that we switched to a new IPF. It seems like despite some pop types hav large-ish MGRA level differences, they are more or less canceled out with opposite differences

Copilot

Pull request overview

This PR introduces a bespoke IPF (Iterative Proportional Fitting) implementation to replace the external ipfn library dependency. The new implementation provides direct control over the IPF algorithm and simplifies the codebase by removing an external dependency.

Key Changes:

Added a new ipf() function in utils.py with comprehensive input validation and convergence controls
Refactored ase.py to use the new IPF implementation with improved data preprocessing and clearer logic
Removed the ipfn package dependency from environment.yml

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.

File	Description
python/utils.py	Implements new IPF algorithm with validation, convergence logic, and test code in `__main__` block
python/ase.py	Refactors seed creation and ASE calculation to use new IPF implementation with improved data handling
environment.yml	Removes external `ipfn` library dependency

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Eric-Liu-SANDAG · 2025-12-11T22:53:46Z

Just realized, I still need to run a stability test, so I'll need to re-run 2020 again and compare with 117 to ensure results are deterministic

Documentation, additional data checks, removing testing code

Eric-Liu-SANDAG · 2025-12-12T00:02:08Z

Stability is good, following query returned no rows of data:

SELECT 
    [117].[year],
    [117].[mgra],
    [117].[pop_type],
    [117].[age_group],
    [117].[sex],
    [117].[ethnicity],
    [117].[value],
    [120].[value]
FROM [EstimatesProgram].[outputs].[ase] AS [117]
LEFT JOIN [EstimatesProgram].[outputs].[ase] AS [120]
    ON [117].[year] = [120].[year]
    AND [117].[mgra] = [120].[mgra]
    AND [117].[pop_type] = [120].[pop_type]
    AND [117].[age_group] = [120].[age_group]
    AND [117].[sex] = [120].[sex]
    AND [117].[ethnicity] = [120].[ethnicity]
WHERE [117].[run_id] = 117
    AND [120].[run_id] = 120
    AND [117].[year] = 2020
    AND [117].[value] != [120].[value]

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

GregorSchroeder

Minor changes requested. This is an excellent addition to the project.
I checked major QA findings from Estimates 2024 on [run_id]=117 and found nothing of note.

Eric-Liu-SANDAG and others added 2 commits December 3, 2025 14:36

#175: Added new IPF implementation

bd99e64

#175: Replacement of 1/2 instances of ipfn

9762bb4

Eric-Liu-SANDAG added 2 commits December 10, 2025 00:56

#175: Final replacement of IPFN

e1a520f

Also, turns out that the input data to IPFN wasn't very clean, and the new IPF function has a bunch of input checking which resulted in some fail states. So I had to clean up the data a little bit

#174: Removed IPFN library

ece9516

Eric-Liu-SANDAG requested a review from Copilot December 11, 2025 22:25

Copilot AI reviewed Dec 11, 2025

View reviewed changes

Comment thread python/utils.py

Comment thread python/utils.py Outdated

Comment thread python/utils.py Outdated

Comment thread python/utils.py Outdated

Comment thread python/utils.py Outdated

Comment thread python/ase.py Outdated

Update python/utils.py

fc64b55

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

#175: PR Feedback

ca72311

Documentation, additional data checks, removing testing code

Eric-Liu-SANDAG requested a review from Copilot December 12, 2025 00:02

Copilot AI reviewed Dec 12, 2025

View reviewed changes

Comment thread python/ase.py

Comment thread python/ase.py

Comment thread python/ase.py

Eric-Liu-SANDAG requested a review from GregorSchroeder December 12, 2025 00:04

GregorSchroeder requested changes Dec 12, 2025

View reviewed changes

Comment thread python/utils.py Outdated

Comment thread python/utils.py Outdated

Comment thread python/ase.py Outdated

Comment thread python/ase.py

Comment thread python/ase.py Outdated

Comment thread python/ase.py Outdated

#175: PR Feedback

1af067b

Eric-Liu-SANDAG requested a review from GregorSchroeder December 12, 2025 21:35

GregorSchroeder approved these changes Dec 12, 2025

View reviewed changes

Eric-Liu-SANDAG merged commit 3f2a4e3 into main Dec 13, 2025

Eric-Liu-SANDAG deleted the ipf-fix branch December 13, 2025 00:04

Eric-Liu-SANDAG mentioned this pull request Dec 13, 2025

[PULL REQUEST] New methodology for n-dimensional rounding #179

Closed

Conversation

Eric-Liu-SANDAG commented Dec 3, 2025

Describe this pull request. What changes are being made?

What issues does this pull request address?

Additional context

Uh oh!

Eric-Liu-SANDAG commented Dec 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Eric-Liu-SANDAG commented Dec 10, 2025

Uh oh!

Eric-Liu-SANDAG commented Dec 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Eric-Liu-SANDAG commented Dec 11, 2025

Uh oh!

Eric-Liu-SANDAG commented Dec 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Eric-Liu-SANDAG commented Dec 11, 2025

Uh oh!

Eric-Liu-SANDAG commented Dec 12, 2025

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

GregorSchroeder left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Eric-Liu-SANDAG commented Dec 4, 2025 •

edited

Loading

Eric-Liu-SANDAG commented Dec 11, 2025 •

edited

Loading

Eric-Liu-SANDAG commented Dec 11, 2025 •

edited

Loading