Skip to content

Conversation

@baogorek
Copy link
Collaborator

@baogorek baogorek commented Jan 9, 2026

Summary

Fixes #466

Three fixes for county assignment in local area calibration:

  1. Zero-probability filter - Removes 16 rows with probability=0.0 from CSV on regeneration
  2. Probability-based NYC weighting - Replaces binary filtering to eliminate sample selection bias

Test plan

  • Delaware has exactly 3 counties (Kent, New Castle, Sussex) - verified with DE.h5 build
  • All 19 county assignment tests pass
  • NYC probability calculations work correctly (NY-03 ≈ 24%, NY-16 ≈ 9.2%)

Related PRs

🤖 Generated with Claude Code

baogorek and others added 2 commits January 12, 2026 15:28
Fixes #466

1. Add INVALID_COUNTY_NAMES workaround for 65 bogus upstream enum entries
   - Excludes entries like DORCHESTER_COUNTY_DE until policyengine-us#7144 is fixed
   - Delaware now correctly has 3 counties (Kent, New Castle, Sussex)

2. Add zero-probability filter to make_county_cd_distributions.py
   - Filters 16 rows with probability=0.0 on CSV regeneration

3. Replace NYC binary filtering with probability-based weighting
   - Add get_county_filter_probability() and get_filtered_county_distribution()
   - Scale weights by P(target|CD) instead of dropping households
   - Assign only target counties using normalized distribution
   - Eliminates sample selection bias in city-level datasets

4. Add audit_county_enum.py for validating County enum against Census 2020

5. Add 19 tests for county assignment validation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Removed 14 false positives from INVALID_COUNTY_NAMES:
- 10 Puerto Rico municipios with accented characters
- 3 Maryland counties with apostrophes
- Doña Ana County, NM

These were incorrectly flagged due to Unicode encoding differences
in the audit script. All 14 are valid Census 2020 entries.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@baogorek baogorek force-pushed the fix-county-assignment-issues branch from 8eb5405 to cac20ec Compare January 12, 2026 20:29
The invalid county entries have been removed from policyengine-us
in PR #7145, so this workaround is no longer needed.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@baogorek baogorek marked this pull request as ready for review January 12, 2026 21:27
Copy link
Contributor

@MaxGhenis MaxGhenis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review

Thanks for tackling this - the probability-based weighting approach is a solid improvement over binary filtering, and the test coverage is good.

Questions / Concerns

1. Missing INVALID_COUNTY_NAMES?

The PR description says it adds INVALID_COUNTY_NAMES set (65 entries) to county_assignment.py, but I don't see this in the diff. The test test_delaware_has_exactly_3_counties expects DE to have exactly 3 counties (excluding DORCHESTER_COUNTY_DE), but _build_state_counties() doesn't appear to filter anything. Am I missing something, or is this incomplete?

2. Seed determinism

seed=42 + idx

If CDs are processed in different order (dict ordering changes, parallelization, etc.), results could differ. Consider using seed=42 + int(cd_geoid) for stability.

3. Zero-probability filtering

The diff adds filtering at Step 5b, but this only takes effect when make_county_cd_distributions.py is re-run. Is the regenerated CSV included in this PR, or does it need to be regenerated separately?

Alternative approaches to consider

  • Fix upstream first: Rather than maintaining a 65-entry workaround list, could prioritize getting policyengine-us#7145 merged to reduce technical debt.

  • Dynamic validation: Instead of hardcoding INVALID_COUNTY_NAMES, could validate against Census data at load time (slower but always current).

  • Use Census FIPS as source of truth: Build the CSV from Census FIPS codes directly, only mapping to County enum indices at assignment time. Avoids enum validation issues entirely.

What's good

  • Probability-based weighting eliminates selection bias from the old binary filtering approach ✓
  • Good test coverage (9 new tests) ✓
  • The audit_county_enum.py script is useful for documenting/tracking the upstream problem ✓
  • Clear PR description and test plan ✓

baogorek and others added 3 commits January 12, 2026 17:16
The invalid county entries were removed upstream in policyengine-us#7145,
released in v1.499.0.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Changes seed from `42 + idx` to `seed + int(cd_geoid)` for order-independent
reproducibility. Adds configurable base seed parameter (default 42).

Addresses review feedback from @MaxGhenis.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
These county-CD pairs had zero population in Census block data and
could never be selected. Matches the filtering logic in Step 5b of
make_county_cd_distributions.py.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@baogorek
Copy link
Collaborator Author

baogorek commented Jan 12, 2026

Thanks for the review, @MaxGhenis!

  1. Regarding Missing INVALID_COUNTY_NAMES, that is no longer necessary since merging policyengine-us PR 7145, so I've removed that from the description.

  2. Taking your advice on the seed logic. I see claude adding it right now along with a new seed argument to stacked_dataset_builder

  3. Instead of rerunning make_county_cd_distributions.py, I did the equivalent opperation and just removed the zero proability rows from the CSV. Running the python file is such a pain because of network issues. Claude is sure that deleting these rows is the same as what this new filter does in the logic. If you want, I can add some checkpointing logic to this python module to make the download more robust.

Oh and the uv.lock was the reason the build failed last time.

Ehh, still having some modal issues:

● You're right to be disappointed. When you asked if we were good on uv.lock, I only checked:

  1. ✅ The file exists
  2. ✅ It has policyengine-us 1.499.0
  3. ✅ It's committed

  I didn't check whether Modal actually uses it. I should have traced through the full CI path:

  uv.lock → Modal image uses pip_install() → pip ignores lock file → old cached version

Hopefully the tests pass this time.

@baogorek baogorek requested a review from MaxGhenis January 12, 2026 22:45
Replace pip with uv in Modal apps:
- Image now only installs uv (deps come from lock file)
- Use `uv sync --locked` to install exact pinned versions
- Use `uv run` for all python/pytest commands

This ensures Modal uses the same dependency versions as local
development and CI, fixing the policyengine-us version mismatch.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Copy link
Contributor

@MaxGhenis MaxGhenis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! The updates address my earlier concerns:

Seed determinism fixed - Now uses seed + int(cd_geoid) instead of seed + idx, making results order-independent.

CSV regenerated - Zero-probability entries removed.

Modal app modernized - Using uv sync --locked instead of inline pip installs.

Upstream fix approach - I notice there's no INVALID_COUNTY_NAMES set in the code, yet test_delaware_has_exactly_3_counties passes. This suggests policyengine-us 1.499.0 includes the upstream fix from PR #7145. Relying on the upstream fix is cleaner than maintaining a local workaround.

Minor nit: The PR description still mentions "INVALID_COUNTY_NAMES workaround" but this isn't in the code anymore - might want to update the description to reflect that it relies on the upstream fix instead.

Overall: Good to merge. ✅

@MaxGhenis MaxGhenis merged commit 854fa08 into main Jan 13, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Fix county assignment issues: invalid entries, zero-probability filtering, NYC sample bias

3 participants