Skip to content

[PULL REQUEST] New IPF implementation#176

Merged
Eric-Liu-SANDAG merged 7 commits intomainfrom
ipf-fix
Dec 13, 2025
Merged

[PULL REQUEST] New IPF implementation#176
Eric-Liu-SANDAG merged 7 commits intomainfrom
ipf-fix

Conversation

@Eric-Liu-SANDAG
Copy link
Copy Markdown
Contributor

Describe this pull request. What changes are being made?

A new bespoke implementation of IPF

What issues does this pull request address?

Additional context

Originally based on work done for #159

@Eric-Liu-SANDAG
Copy link
Copy Markdown
Contributor Author

Eric-Liu-SANDAG commented Dec 4, 2025

New IPF methodology implemented for the main usage of ipfn. Test run for 2020 can be found under [run_id] = 106 (it passed all QC checks with no error rows). The below query shows differences compared to the actual 2024 Estimates data, aka [run_id] = 82

SELECT 
    [82].[year],
    [82].[mgra],
    [82].[age_group],
    [82].[sex],
    [82].[ethnicity],
    [82].[value],
    [106].[value]
FROM [EstimatesProgram].[outputs].[ase] AS [82]
LEFT JOIN [EstimatesProgram].[outputs].[ase] AS [106]
    ON [82].[year] = [106].[year]
    AND [82].[mgra] = [106].[mgra]
    AND [82].[pop_type] = [106].[pop_type]
    AND [82].[age_group] = [106].[age_group]
    AND [82].[sex] = [106].[sex]
    AND [82].[ethnicity] = [106].[ethnicity]
WHERE [82].[run_id] = 82
    AND [106].[run_id] = 106
    AND [82].[value] != [106].[value]
ORDER BY ABS([82].[value] - [106].[value]) DESC

Quite a few rows of differences, hard to really tell what's significant and what's not...

When looking at the ethnicity distribution of GQ college, nothing really stands out to me as a clear and obvious error.

SELECT 
    [82].[year],
    [82].[mgra],
    [82].[pop_type],
    -- [82].[age_group],
    -- [82].[sex],
    [82].[ethnicity],
    SUM([82].[value]) AS [82_value],
    SUM([106].[value]) AS [106_value]
FROM [EstimatesProgram].[outputs].[ase] AS [82]
LEFT JOIN [EstimatesProgram].[outputs].[ase] AS [106]
    ON [82].[year] = [106].[year]
    AND [82].[mgra] = [106].[mgra]
    AND [82].[pop_type] = [106].[pop_type]
    AND [82].[age_group] = [106].[age_group]
    AND [82].[sex] = [106].[sex]
    AND [82].[ethnicity] = [106].[ethnicity]
WHERE [82].[run_id] = 82
    AND [106].[run_id] = 106
    AND [82].[pop_type] = 'Group Quarters - College'
    AND [82].[mgra] IN (
        SELECT DISTINCT [mgra] 
        FROM [EstimatesProgram].[outputs].[gq]
        WHERE [run_id] = 82
            AND [year] = 2020
            AND [gq_type] = 'Group Quarters - College'
            AND [value] > 0)
GROUP BY 
    [82].[year],
    [82].[mgra],
    [82].[pop_type],
    -- [82].[age_group],
    -- [82].[sex],
    [82].[ethnicity]
ORDER BY 
    [82].[year],
    [82].[mgra],
    [82].[pop_type],
    -- [82].[age_group],
    -- [82].[sex],
    [82].[ethnicity]

Values which are zero in 82 are typically still zero or tiny values in 106. Similarly, distributions seem to be more or less the same, with the slight caveat that 106 seems to have fewer overall zeros

Also, turns out that the input data to IPFN wasn't very clean, and the new IPF function has a bunch of input checking which resulted in some fail states. So I had to clean up the data a little bit
@Eric-Liu-SANDAG
Copy link
Copy Markdown
Contributor Author

[run_id] = 115 now contains a run of 2020 with all IPFN replaced with the new IPF function (all automated checks passed). The GQ College check which was done for [run_id] = 106 was repeated using 115 and similar results were observed. Will do some more rigorous testing later, aka tomorrow

@Eric-Liu-SANDAG
Copy link
Copy Markdown
Contributor Author

Eric-Liu-SANDAG commented Dec 11, 2025

A complete run 2020 to 2024 was completed as well ([run_id] = 117) and all automated checks passed. This includes the MGRA restrictions checks, which seems a little odd to me. Maybe the old IPFN was doing something weird that caused the integerization to fail?

The GQ College check was done again and results similar to 106 and 115 were observed, but also spread over all the years.

GQ Military was virtually identical between 82 and 117, with off by one differences but not much more

GQ Other/Prison/HHP were both similar to GQ College in that distributions were broadly the same, but actual values differed, sometimes by a more substantial amount (less so for HHP). But for the most part, any changes can functionally be attributed to noise

@Eric-Liu-SANDAG
Copy link
Copy Markdown
Contributor Author

Of particular note is this little bit of data cleaning that I was forced to do:

# Haha, JK, not finally, there's one more thing to fix. Due to the removal of
# B01001F from the seed data, we could have a non-zero marginal associated with
# all zero seed data. We'll resolve by just seeding with a tiny value. Note this
# can only occur with the column controls (B01001)
if np.any((col_controls > 0) & (seed.sum(axis=0) == 0)):
col_idx_to_adjust = np.where(
(col_controls > 0) & (seed.sum(axis=0).to_numpy() == 0)
)
for col_idx in col_idx_to_adjust:
seed[seed.columns[col_idx]] = 0.001

I observed in certain GQ mgras (easier to see in GQ data since there's a lot less of it) that some MGRAs got a large spike in NH2+, likely result of this pre-seeding

SELECT 
    [82].[year],
    [82].[mgra],
    [82].[pop_type],
    -- [82].[age_group],
    -- [82].[sex],
    [82].[ethnicity],
    SUM([82].[value]) AS [82_value],
    SUM([117].[value]) AS [117_value],
    ABS(SUM([82].[value]) - SUM([117].[value])) AS [diff]
FROM [EstimatesProgram].[outputs].[ase] AS [82]
LEFT JOIN [EstimatesProgram].[outputs].[ase] AS [117]
    ON [82].[year] = [117].[year]
    AND [82].[mgra] = [117].[mgra]
    AND [82].[pop_type] = [117].[pop_type]
    AND [82].[age_group] = [117].[age_group]
    AND [82].[sex] = [117].[sex]
    AND [82].[ethnicity] = [117].[ethnicity]
WHERE [82].[run_id] = 82
    AND [117].[run_id] = 117
    AND [82].[pop_type] = 'Group Quarters - Institutional Correctional Facilities'
    AND [82].[mgra] IN (
        SELECT DISTINCT [mgra] 
        FROM [EstimatesProgram].[outputs].[gq]
        WHERE [run_id] = 82
            AND [gq_type] = 'Group Quarters - Institutional Correctional Facilities'
            AND [value] > 0)
    AND [82].[ethnicity] = 'Non-Hispanic, Two or More Races'
GROUP BY 
    [82].[year],
    [82].[mgra],
    [82].[pop_type],
    -- [82].[age_group],
    -- [82].[sex],
    [82].[ethnicity]
HAVING 
    SUM([82].[value]) = 0
ORDER BY 
    [82].[year],
    [82].[mgra],
    [82].[pop_type],
    --[82].[age_group],
    --[82].[sex],
    [82].[ethnicity]
    -- ABS(SUM([82].[value]) - SUM([117].[value])) DESC
image

But all else being equal, I think the new data is better as it explicitly fixes data issues, instead of having them fixed by undefined behavior of IPFN. Also the exact differences don't rise to the level of concern for me

@Eric-Liu-SANDAG
Copy link
Copy Markdown
Contributor Author

Eric-Liu-SANDAG commented Dec 11, 2025

The data, when analyzed at larger geographies, seems to have most differences canceled out such that you nearly can't tell that we switched to a new IPF. It seems like despite some pop types hav large-ish MGRA level differences, they are more or less canceled out with opposite differences

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a bespoke IPF (Iterative Proportional Fitting) implementation to replace the external ipfn library dependency. The new implementation provides direct control over the IPF algorithm and simplifies the codebase by removing an external dependency.

Key Changes:

  • Added a new ipf() function in utils.py with comprehensive input validation and convergence controls
  • Refactored ase.py to use the new IPF implementation with improved data preprocessing and clearer logic
  • Removed the ipfn package dependency from environment.yml

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.

File Description
python/utils.py Implements new IPF algorithm with validation, convergence logic, and test code in __main__ block
python/ase.py Refactors seed creation and ASE calculation to use new IPF implementation with improved data handling
environment.yml Removes external ipfn library dependency

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread python/utils.py
Comment thread python/utils.py Outdated
Comment thread python/utils.py Outdated
Comment thread python/utils.py Outdated
Comment thread python/utils.py Outdated
Comment thread python/ase.py Outdated
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@Eric-Liu-SANDAG
Copy link
Copy Markdown
Contributor Author

Just realized, I still need to run a stability test, so I'll need to re-run 2020 again and compare with 117 to ensure results are deterministic

Documentation, additional data checks, removing testing code
@Eric-Liu-SANDAG
Copy link
Copy Markdown
Contributor Author

Stability is good, following query returned no rows of data:

SELECT 
    [117].[year],
    [117].[mgra],
    [117].[pop_type],
    [117].[age_group],
    [117].[sex],
    [117].[ethnicity],
    [117].[value],
    [120].[value]
FROM [EstimatesProgram].[outputs].[ase] AS [117]
LEFT JOIN [EstimatesProgram].[outputs].[ase] AS [120]
    ON [117].[year] = [120].[year]
    AND [117].[mgra] = [120].[mgra]
    AND [117].[pop_type] = [120].[pop_type]
    AND [117].[age_group] = [120].[age_group]
    AND [117].[sex] = [120].[sex]
    AND [117].[ethnicity] = [120].[ethnicity]
WHERE [117].[run_id] = 117
    AND [120].[run_id] = 120
    AND [117].[year] = 2020
    AND [117].[value] != [120].[value]

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread python/ase.py
Comment thread python/ase.py
Comment thread python/ase.py
Copy link
Copy Markdown
Contributor

@GregorSchroeder GregorSchroeder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor changes requested. This is an excellent addition to the project.
I checked major QA findings from Estimates 2024 on [run_id]=117 and found nothing of note.

Comment thread python/utils.py Outdated
Comment thread python/utils.py Outdated
Comment thread python/ase.py Outdated
Comment thread python/ase.py
Comment thread python/ase.py Outdated
Comment thread python/ase.py Outdated
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] IPF process non-convergence [BUG] Non-zero marginals associated with all zero data in IPF

3 participants