[PULL REQUEST] New IPF implementation #176
|
New IPF methodology implemented for the main usage of

```sql
SELECT
    [82].[year],
    [82].[mgra],
    [82].[age_group],
    [82].[sex],
    [82].[ethnicity],
    [82].[value],
    [106].[value]
FROM [EstimatesProgram].[outputs].[ase] AS [82]
LEFT JOIN [EstimatesProgram].[outputs].[ase] AS [106]
    ON [82].[year] = [106].[year]
    AND [82].[mgra] = [106].[mgra]
    AND [82].[pop_type] = [106].[pop_type]
    AND [82].[age_group] = [106].[age_group]
    AND [82].[sex] = [106].[sex]
    AND [82].[ethnicity] = [106].[ethnicity]
WHERE [82].[run_id] = 82
    AND [106].[run_id] = 106
    AND [82].[value] != [106].[value]
ORDER BY ABS([82].[value] - [106].[value]) DESC
```

Quite a few rows differ, and it's hard to tell what's significant and what's not. Looking at the ethnicity distribution of GQ College, nothing stands out to me as a clear and obvious error.

```sql
SELECT
    [82].[year],
    [82].[mgra],
    [82].[pop_type],
    -- [82].[age_group],
    -- [82].[sex],
    [82].[ethnicity],
    SUM([82].[value]) AS [82_value],
    SUM([106].[value]) AS [106_value]
FROM [EstimatesProgram].[outputs].[ase] AS [82]
LEFT JOIN [EstimatesProgram].[outputs].[ase] AS [106]
    ON [82].[year] = [106].[year]
    AND [82].[mgra] = [106].[mgra]
    AND [82].[pop_type] = [106].[pop_type]
    AND [82].[age_group] = [106].[age_group]
    AND [82].[sex] = [106].[sex]
    AND [82].[ethnicity] = [106].[ethnicity]
WHERE [82].[run_id] = 82
    AND [106].[run_id] = 106
    AND [82].[pop_type] = 'Group Quarters - College'
    AND [82].[mgra] IN (
        SELECT DISTINCT [mgra]
        FROM [EstimatesProgram].[outputs].[gq]
        WHERE [run_id] = 82
            AND [year] = 2020
            AND [gq_type] = 'Group Quarters - College'
            AND [value] > 0)
GROUP BY
    [82].[year],
    [82].[mgra],
    [82].[pop_type],
    -- [82].[age_group],
    -- [82].[sex],
    [82].[ethnicity]
ORDER BY
    [82].[year],
    [82].[mgra],
    [82].[pop_type],
    -- [82].[age_group],
    -- [82].[sex],
    [82].[ethnicity]
```

Values which are zero in |
Also, it turns out that the input data to IPFN wasn't very clean, and the new IPF function has a bunch of input checks that failed on it, so I had to clean up the data a little bit.
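
For illustration, the kind of input checks that can fail on unclean IPF inputs might look like the sketch below. This is not the code in `utils.py`: the function name `check_ipf_inputs`, the tolerance, and the specific checks are assumptions, chosen to show typical failure modes (negative cells, marginals whose totals disagree, and a positive marginal over an all-zero seed row or column, which IPF cannot satisfy because it only rescales cells).

```python
from math import isclose


def check_ipf_inputs(seed, row_targets, col_targets, rel_tol=1e-9):
    """Hypothetical pre-flight checks for a 2-D IPF problem.

    seed is a list of rows (lists of numbers); row_targets and col_targets
    are the marginals to fit. Raises ValueError on the first problem found.
    """
    n_rows, n_cols = len(seed), len(col_targets)
    if len(row_targets) != n_rows or any(len(row) != n_cols for row in seed):
        raise ValueError("seed shape does not match the marginals")
    values = [v for row in seed for v in row] + list(row_targets) + list(col_targets)
    if any(v < 0 for v in values):
        raise ValueError("negative values are not allowed")
    if not isclose(sum(row_targets), sum(col_targets), rel_tol=rel_tol):
        raise ValueError("row and column marginals have different totals")
    # IPF only rescales cells, so an all-zero row or column stays at zero
    # and can never reach a positive target.
    for i, row in enumerate(seed):
        if row_targets[i] > 0 and sum(row) == 0:
            raise ValueError(f"row {i} has a positive target but an all-zero seed")
    for j in range(n_cols):
        if col_targets[j] > 0 and sum(row[j] for row in seed) == 0:
            raise ValueError(f"column {j} has a positive target but an all-zero seed")
```

Failing loudly on these cases forces the cleanup to happen explicitly in the calling code rather than being absorbed silently by the fitting routine.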
|
A complete run 2020 to 2024 was completed as well (
- The GQ College check was done again and results similar to
- GQ Military was virtually identical between
- GQ Other/Prison/HHP were both similar to GQ College in that distributions were broadly the same, but actual values differed, sometimes by a more substantial amount (less so for HHP)

But for the most part, any changes can functionally be attributed to noise |
|
Of particular note is this little bit of data cleaning that I was forced to do:

Estimates-Program/python/ase.py, lines 355 to 364 in e1a520f

I observed in certain GQ MGRAs (easier to see in GQ data since there's a lot less of it) that some MGRAs got a large spike in NH2+, likely a result of this pre-seeding

```sql
SELECT
    [82].[year],
    [82].[mgra],
    [82].[pop_type],
    -- [82].[age_group],
    -- [82].[sex],
    [82].[ethnicity],
    SUM([82].[value]) AS [82_value],
    SUM([117].[value]) AS [117_value],
    ABS(SUM([82].[value]) - SUM([117].[value])) AS [diff]
FROM [EstimatesProgram].[outputs].[ase] AS [82]
LEFT JOIN [EstimatesProgram].[outputs].[ase] AS [117]
    ON [82].[year] = [117].[year]
    AND [82].[mgra] = [117].[mgra]
    AND [82].[pop_type] = [117].[pop_type]
    AND [82].[age_group] = [117].[age_group]
    AND [82].[sex] = [117].[sex]
    AND [82].[ethnicity] = [117].[ethnicity]
WHERE [82].[run_id] = 82
    AND [117].[run_id] = 117
    AND [82].[pop_type] = 'Group Quarters - Institutional Correctional Facilities'
    AND [82].[mgra] IN (
        SELECT DISTINCT [mgra]
        FROM [EstimatesProgram].[outputs].[gq]
        WHERE [run_id] = 82
            AND [gq_type] = 'Group Quarters - Institutional Correctional Facilities'
            AND [value] > 0)
    AND [82].[ethnicity] = 'Non-Hispanic, Two or More Races'
GROUP BY
    [82].[year],
    [82].[mgra],
    [82].[pop_type],
    -- [82].[age_group],
    -- [82].[sex],
    [82].[ethnicity]
HAVING
    SUM([82].[value]) = 0
ORDER BY
    [82].[year],
    [82].[mgra],
    [82].[pop_type],
    -- [82].[age_group],
    -- [82].[sex],
    [82].[ethnicity]
    -- ABS(SUM([82].[value]) - SUM([117].[value])) DESC
```

But all else being equal, I think the new data is better as it explicitly fixes data issues, instead of having them fixed by undefined behavior of IPFN. Also the exact differences don't rise to the level of concern for me |
|
When the data is analyzed at larger geographies, most differences cancel out, to the point where you can barely tell that we switched to a new IPF. Although some pop types have large-ish MGRA-level differences, they are more or less offset by opposite differences |
Pull request overview
This PR introduces a bespoke IPF (Iterative Proportional Fitting) implementation to replace the external ipfn library dependency. The new implementation provides direct control over the IPF algorithm and simplifies the codebase by removing an external dependency.
Key Changes:
- Added a new `ipf()` function in `utils.py` with comprehensive input validation and convergence controls
- Refactored `ase.py` to use the new IPF implementation with improved data preprocessing and clearer logic
- Removed the `ipfn` package dependency from `environment.yml`
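
For readers unfamiliar with the technique, a minimal two-dimensional IPF can be sketched as follows. This is an illustration only, not the `ipf()` function added in this PR: the name `ipf_sketch`, its signature, the default tolerance, and the iteration cap are all assumptions.

```python
def ipf_sketch(seed, row_targets, col_targets, tol=1e-8, max_iter=1000):
    """Fit a 2-D seed to row and column marginals by alternating rescaling.

    Hypothetical illustration. Returns (fitted_table, iterations_used) or
    raises RuntimeError if the row sums never get within tol of the targets.
    """
    table = [[float(v) for v in row] for row in seed]
    n_cols = len(col_targets)
    for iteration in range(1, max_iter + 1):
        # Row step: scale each row so that it sums to its target.
        for i, row in enumerate(table):
            total = sum(row)
            if total > 0:
                factor = row_targets[i] / total
                table[i] = [v * factor for v in row]
        # Column step: scale each column so that it sums to its target.
        for j in range(n_cols):
            total = sum(row[j] for row in table)
            if total > 0:
                factor = col_targets[j] / total
                for row in table:
                    row[j] *= factor
        # Columns with a positive total now match (up to rounding),
        # so convergence is judged on the row sums.
        max_err = max(abs(sum(row) - t) for row, t in zip(table, row_targets))
        if max_err <= tol:
            return table, iteration
    raise RuntimeError(f"IPF did not converge in {max_iter} iterations")
```

Because each step only multiplies cells by a factor, a cell that is zero in the seed stays zero in the result, which is why the contents of the seed matter so much for sparse data.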
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| python/utils.py | Implements new IPF algorithm with validation, convergence logic, and test code in __main__ block |
| python/ase.py | Refactors seed creation and ASE calculation to use new IPF implementation with improved data handling |
| environment.yml | Removes external ipfn library dependency |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
|
Just realized I still need to run a stability test, so I'll need to re-run 2020 and compare with |
Documentation, additional data checks, removing testing code
|
Stability is good, the following query returned no rows of data:

```sql
SELECT
    [117].[year],
    [117].[mgra],
    [117].[pop_type],
    [117].[age_group],
    [117].[sex],
    [117].[ethnicity],
    [117].[value],
    [120].[value]
FROM [EstimatesProgram].[outputs].[ase] AS [117]
LEFT JOIN [EstimatesProgram].[outputs].[ase] AS [120]
    ON [117].[year] = [120].[year]
    AND [117].[mgra] = [120].[mgra]
    AND [117].[pop_type] = [120].[pop_type]
    AND [117].[age_group] = [120].[age_group]
    AND [117].[sex] = [120].[sex]
    AND [117].[ethnicity] = [120].[ethnicity]
WHERE [117].[run_id] = 117
    AND [120].[run_id] = 120
    AND [117].[year] = 2020
    AND [117].[value] != [120].[value]
```

|
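
The same stability check could also be sketched outside the database. The snippet below assumes each run's `[outputs].[ase]` rows have already been loaded as dictionaries keyed by the column names used in the query. The helper name `diff_runs` is hypothetical and is not part of the repository. It also reports keys that are present in only one run, which a join filtered on both run IDs would not surface.

```python
KEY_COLUMNS = ("year", "mgra", "pop_type", "age_group", "sex", "ethnicity")


def diff_runs(rows_a, rows_b):
    """Return {key: (value_a, value_b)} for every key whose values differ.

    Hypothetical helper. A key missing from one run is reported with None
    on that side.
    """
    index_a = {tuple(r[c] for c in KEY_COLUMNS): r["value"] for r in rows_a}
    index_b = {tuple(r[c] for c in KEY_COLUMNS): r["value"] for r in rows_b}
    differences = {}
    for key in index_a.keys() | index_b.keys():
        a, b = index_a.get(key), index_b.get(key)
        if a != b:
            differences[key] = (a, b)
    return differences
```

An empty dictionary corresponds to the query returning no rows, meaning two runs on the same inputs produced identical outputs.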
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
GregorSchroeder left a comment
Minor changes requested. This is an excellent addition to the project.
I checked major QA findings from Estimates 2024 on [run_id]=117 and found nothing of note.

Describe this pull request. What changes are being made?
A new bespoke implementation of IPF
What issues does this pull request address?
Additional context
Originally based on work done for #159