
Conversation

@GregorSchroeder
Contributor

Describe this pull request. What changes are being made?
The initial conceit of this feature branch was to perform all the work necessary to create estimates of MGRA-level population by age/sex/ethnicity by type. While undertaking this task it was decided that this would be too large a pull request. As such, this pull request stops after generating and inserting the regional age/sex/ethnicity controls by population type.

The key piece of this process, and of the upcoming MGRA-level population by age/sex/ethnicity by type estimates, is the integerization procedure encapsulated in the _2d_integerize() function. A more detailed description of this function is included below. One open question in this pull request is where to put this documentation and the higher-level explanation of how the controls are generated. My first thought was to wait and include it in the wrapper run_ase() function that will be created in the subsequent MGRA-level population by age/sex/ethnicity by type estimates pull request, which is why it is not present in the code itself at this time.

_2d_integerize()

  • This function is used here to integerize the regional a/s/e controls, matching the total number of group quarters by type to what is present in the [outputs].[gq] table.
  • This is done assuming the group quarters by type are columns in a Pandas DataFrame (always 2-dimensional), as the function always matches column-level marginal control totals perfectly via a simple call to iteround.saferound.
  • For row-level marginal control totals the function has two options controlled via the condition input parameter. The condition="exact" option matches the row marginal control totals exactly. This will be used for the estimates of MGRA-level population by age/sex/ethnicity by type. The condition="less than" option uses the row marginal controls to ensure that all rows are less than or equal to their marginal control. This is used here to ensure that no age/sex/ethnicity category, when summed across all group quarters by type, exceeds the regional control for total population by age/sex/ethnicity category.
  • At a high level, the function works via the following methodology (a minimal sketch in code follows below):

Column values are safe-rounded to integers such that their sums match the column marginal control totals exactly.

Begin the row allocation process:
    Rows whose sums exceed their row marginal control totals have their smallest non-zero column subtracted
    from until the row matches the control or the column value becomes 0. These adjustments are tracked and
    stored at the column level and are subsequently referred to as the column-level adjustment budget.

    Rows whose sums fall short of their row marginal control totals have their largest non-zero column added
    to until the row sum matches the control total or the column-level adjustment budget is spent.

    The row allocation process is repeated until there is no column-level adjustment budget left to spend, or
    until all rows either match their row marginal control totals exactly or have sums less than or equal to
    them, depending on the option selected for the condition parameter.
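
For reference, below is a minimal, illustrative sketch of the methodology above. It is not the actual _2d_integerize() implementation: the name integerize_2d_sketch is made up, and it assumes integer-valued controls plus the pandas/iteround stack already used in this repository.

import pandas as pd
from iteround import saferound


def integerize_2d_sketch(
    df: pd.DataFrame, row_controls: pd.Series, condition: str = "exact"
) -> pd.DataFrame:
    """Illustrative 2D integerization: column sums are fixed exactly via
    saferound, then row sums are pushed toward ("exact") or capped at
    ("less than") their controls using a column-level adjustment budget."""
    # Step 1: safe-round each column so its sum is preserved, i.e. it matches
    # the column marginal control when the float column already sums to it.
    out = df.apply(lambda col: pd.Series(saferound(col.tolist(), 0), index=col.index))

    budget = pd.Series(0.0, index=out.columns)  # column-level adjustment budget
    while True:
        changed = False
        for idx in out.index:
            row = out.loc[idx]
            gap = row.sum() - row_controls.loc[idx]
            if gap > 0:
                # Row exceeds its control: subtract from the smallest non-zero
                # column and bank the amount in that column's budget.
                nonzero = row[row > 0]
                if nonzero.empty:
                    continue
                col = nonzero.idxmin()
                take = min(gap, out.at[idx, col])
                out.at[idx, col] -= take
                budget[col] += take
                changed = True
            elif gap < 0:
                # Row falls short: add to the largest non-zero column, spending
                # only that column's banked budget so column totals are preserved.
                col = row[row > 0].idxmax() if (row > 0).any() else row.idxmax()
                give = min(-gap, budget[col])
                if give > 0:
                    out.at[idx, col] += give
                    budget[col] -= give
                    changed = True
        row_sums = out.sum(axis=1)
        done = (
            (row_sums == row_controls).all()
            if condition == "exact"
            else (row_sums <= row_controls).all()
        )
        # Stop once the condition is met or no budget/excess is left to move.
        if done or not changed:
            break
    return out.astype(int)

The key design point is that additions only spend budget banked from the same column, so the column marginal totals fixed in step 1 are never disturbed.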

What issues does this pull request address?
closes #83
closes #71
closes #72
closes #73

Additional context
See #69 for the parent issue this work falls under.

While working on this project, specifically sql/ase/get_region_gq_ase_dist.sql, it was found that our group quarters labels didn't make sense, as they assumed every non-Disabled person in non-College or non-Military group quarters was in prison. As such, the labels were updated to reflect the extent of our knowledge. See #83.

Testing was done manually on [run_id]=1 if you would like to review my results.

@GregorSchroeder GregorSchroeder added the bug and enhancement labels Apr 24, 2025
@GregorSchroeder GregorSchroeder self-assigned this Apr 24, 2025
@GregorSchroeder GregorSchroeder linked an issue Apr 24, 2025 that may be closed by this pull request
Contributor

This is more of an issue with the DOF data we have loaded in, but there seem to be rows missing. It seems like the missing rows should have a population of zero, but it still makes it really annoying to do row count checks on this data. For example, the following query should return seven rows and not four:

SELECT *
FROM [socioec_data].[ca_dof].[projections_p3]
WHERE [projections_id] = 10
    AND [fips] = '06073'
    AND [age] = 103
    AND [sex] = 'Male'
    AND [year] = 2020
ORDER BY year, age, sex, [race/ethnicity]

Mostly, this makes me really nervous that some data got dropped by accident, as there is no way to know for sure whether the missing rows are actually zeroed data.
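
For what it's worth, here is a hypothetical sketch of the kind of completeness check I mean, assuming the relevant slice of [projections_p3] has been pulled into a pandas DataFrame. The function name, the expected_race_eth list, and the population column name are placeholders, not actual repository code.

import pandas as pd


def fill_missing_race_eth(p3: pd.DataFrame, expected_race_eth: list) -> pd.DataFrame:
    """Re-introduce rows dropped upstream as explicit zero-population rows so
    that simple row count checks behave as expected."""
    # Build the full grid of year/age/sex/race-ethnicity combinations that
    # should exist in the slice.
    full_index = pd.MultiIndex.from_product(
        [
            sorted(p3["year"].unique()),
            sorted(p3["age"].unique()),
            sorted(p3["sex"].unique()),
            expected_race_eth,  # assumed list of the seven category labels
        ],
        names=["year", "age", "sex", "race/ethnicity"],
    )
    # Reindex onto the full grid; missing combinations come back with a
    # population of 0 instead of silently disappearing.
    return (
        p3.set_index(["year", "age", "sex", "race/ethnicity"])["population"]  # column name assumed
        .reindex(full_index, fill_value=0)
        .reset_index()
    )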

Contributor Author

This is an issue as I assumed this dataset contained 0s. It looks like the current Vintage 2024 (2024.9.23) release on their projections site contains the 0s but the dataset I have saved in the repository for the release does not! I wonder if my initial grab of that dataset from the CA DOF erroneously had 0s removed and they added them back in later? There is nothing in the insert_p3() code that would have done this.

Great catch. I need to re-load this p3. See https://github.com/SANDAG/CA-DOF/issues/40

Contributor

  • Can probably change the long PUMS query to use BETWEEN and 'FROM [acs].[pums].[vi_5y_' + @year - 4 + '_' + @year + '_persons_sd]' (see the sketch after this list).
  • SQL data types in all caps
  • I notice we are missing WHEN [RAC1P] = '8', which corresponds to "Some other race alone". Should we be grouping this data into some other existing race/eth category?
  • How much does 2010/11 data differ from 2012+ due to the lack of a disabled population filter?
  • TBH I would like these results stored in an inputs table so that they can be easily referenced. I feel like we look at these ASE distributions often enough that it is worth it.
  • In reference to the above, it would be a lot easier to check whether the created distributions have face validity, especially since military has a hard age floor of 17, which is technically 18 since we are using age groups and not single year of age.
  • Speaking of that, there are a few data points which may need to be zeroed out (I only looked at 2020). Or we could just depend on the [special_mgra] age/sex restrictions to deal with it.
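
On the first bullet, a rough illustration of the idea (shown as Python string building just for clarity; the year value, column list, and BETWEEN filter are placeholders, not the actual query):

# Hypothetical: derive the 5-year PUMS view name from a single year parameter
# and use BETWEEN for the range filter instead of enumerating values.
year = 2022  # placeholder for @year
pums_view = f"[acs].[pums].[vi_5y_{year - 4}_{year}_persons_sd]"
query = (
    "SELECT [AGEP], [SEX], [RAC1P] "   # placeholder column list
    f"FROM {pums_view} "
    "WHERE [AGEP] BETWEEN 0 AND 99"    # placeholder BETWEEN filter
)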

Contributor Author

  • Agreed.
  • Agreed.
  • We are ignoring it, which implicitly assumes it is distributed equally among all categories, a very neutral/low-information guess. It is also very low effort, which is good. If we were creating counts and not distributions, we would have to allocate it across groups, but luckily we are not.
  • Something else besides the lack of [DIS] must be happening in 2010, but you can see the discontinuity between 2011 and the rest of the years in the table below. Note "records" just means age/sex/ethnicity categories, of which there are 280.
    [table of record counts by year omitted]
  • I do not want to load in any age/sex/ethnicity stuff unless it is directly used as a control. I understand these controls are altered a bit by the scaling, rounding, and re-allocating process but I would prefer to not load in interim calculations.
  • What needs to be zeroed out? The process should respect the 0s coming out of the ACS distributions. For the MGRA-level estimates we will be using the [special_mgra] table to enforce restrictions at the MGRA level.

Contributor

What needs to be zeroed out? The process should respect the 0s coming out of the ACS distributions. For the MGRA-level estimates we will be using the [special_mgra] table to enforce restrictions at the MGRA level.

Yeah, that's what I was referring to: using the [special_mgras] table to zero out. Just be aware there should be a tiny fraction of 17-year-old military GQ. Currently military MGRAs are not loaded into [special_mgras], and the [min_age]/[max_age] columns use single year of age rather than age groups. Not sure how this will interact with GQ ASE distributions that use age groups (see the sketch below).
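
To make that interaction concrete, here is a hypothetical sketch of applying a single-year [min_age]/[max_age] window to an age-grouped distribution. The group labels and bounds are illustrative assumptions, not the actual ASE category definitions or repository code.

import pandas as pd

# Illustrative age-group bounds; NOT the actual ASE category definitions.
AGE_GROUP_BOUNDS = {"0-17": (0, 17), "18-24": (18, 24), "25-64": (25, 64), "65+": (65, 120)}


def apply_special_mgra_age_window(dist: pd.Series, min_age: int, max_age: int) -> pd.Series:
    """Zero out distribution cells for age groups with no overlap with the
    single-year [min_age, max_age] window; partially overlapping groups (e.g.,
    the 17-year-old military case) are left untouched for a separate decision."""
    keep = pd.Series(
        {group: not (hi < min_age or lo > max_age) for group, (lo, hi) in AGE_GROUP_BOUNDS.items()}
    ).reindex(dist.index, fill_value=False)
    return dist.where(keep, other=0.0)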

@GregorSchroeder
Contributor Author

All changes have been made, except that I am blocked on https://github.com/SANDAG/CA-DOF/issues/40, although that fix is outside of this repository.

Contributor

@Eric-Liu-SANDAG Eric-Liu-SANDAG left a comment

Small changes requested. You can merge after making the changes; no further review required.

Contributor

@Eric-Liu-SANDAG Eric-Liu-SANDAG Apr 30, 2025

Just need to remove the leading _ from _2d_integerize() as it is not a private function.

Also, I know you said that imports should be alphabetical, but I aesthetically hate how it ends up like:

import pandas as pd
import pathlib
import sqlalchemy as sql

where a non-as import is squeezed between two as imports. IMO, to make it look good, simple imports should be grouped together alphabetically, then all import _ as _ imports should be grouped together alphabetically, then all import package.submodule as _ imports should be grouped together alphabetically. For example, see this (not alphabetical, but I don't actually care that much about being alphabetical).
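
A sketch of the grouping I mean (the specific modules are just illustrative):

# simple imports, alphabetical
import os
import pathlib

# import _ as _ imports, alphabetical
import pandas as pd
import sqlalchemy as sql

# import package.submodule as _ imports, alphabetical
import matplotlib.pyplot as plt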

@GregorSchroeder GregorSchroeder merged commit 628b05a into main Apr 30, 2025
@GregorSchroeder GregorSchroeder deleted the 69-feature-ase branch April 30, 2025 22:14