
Conversation

@GregorSchroeder
Contributor

Describe this pull request. What changes are being made?
The initial conceit of this feature branch was to perform all the work necessary to create estimates of MGRA-level population by age/sex/ethnicity by type. While undertaking this task it was decided that this would be too large a pull request. As such, this pull request stops after generating and inserting the regional age/sex/ethnicity controls by population type.

The key piece of this process, and of the upcoming MGRA-level population by age/sex/ethnicity by type estimates, is the integerization procedure encapsulated in the _2d_integerize() function. A more detailed description of this function is included below. One open question in this pull request is where to put this documentation and the higher-level explanation of how the controls are generated. My first thought was to wait and include it in the wrapper run_ase() function that will be created in the subsequent MGRA-level population by age/sex/ethnicity by type estimates pull request, which is why it is not present in the code itself at this time.

_2d_integerize()

  • This function is used here to integerize the regional a/s/e controls, matching the total number of group quarters by type to what is present in the [outputs].[gq] table.
  • This is done assuming the group quarters by type are columns in a Pandas DataFrame (always 2-dimensional), as the function always matches column-level marginal control totals perfectly via a simple call to iteround.saferound.
  • For row-level marginal control totals the function has two options controlled via the condition input parameter. The condition="exact" option matches the row marginal control totals exactly. This will be used for the estimates of MGRA-level population by age/sex/ethnicity by type. The condition="less than" option uses the row marginal controls to ensure that all rows are less than or equal to their marginal control. This is used here to ensure that no age/sex/ethnicity category, when summed across all group quarters by type, exceeds the regional control for total population by age/sex/ethnicity category.
  • At a high level, the function works via the following methodology (a minimal sketch in code follows below):

Column values are safe-rounded to integers such that their sums match the column marginal control totals exactly.

Begin the row allocation process:
    Rows whose sums exceed their row marginal control totals have their smallest non-zero column subtracted
    from until the row matches the control or the column value becomes 0. These adjustments are tracked and
    stored at the column level and are subsequently referred to as the column-level adjustment budget.

    Rows whose sums fall short of their row marginal control totals have their largest non-zero column added
    to until the row sum matches the control total or the column-level adjustment budget is spent.

    The row allocation process is repeated until there is no column-level adjustment budget left to spend, or
    until all rows either match their row marginal control totals exactly or have sums less than or equal to
    them, depending on the option selected for the condition parameter.
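
For reference, below is a minimal, illustrative sketch of the methodology above. It is not the actual _2d_integerize() implementation: the name integerize_2d_sketch is made up, and it assumes integer-valued controls plus the pandas/iteround stack already used in this repository.

import pandas as pd
from iteround import saferound


def integerize_2d_sketch(
    df: pd.DataFrame, row_controls: pd.Series, condition: str = "exact"
) -> pd.DataFrame:
    """Illustrative 2D integerization: column sums are fixed exactly via
    saferound, then row sums are pushed toward ("exact") or capped at
    ("less than") their controls using a column-level adjustment budget."""
    # Step 1: safe-round each column so its sum is preserved, i.e. it matches
    # the column marginal control when the float column already sums to it.
    out = df.apply(lambda col: pd.Series(saferound(col.tolist(), 0), index=col.index))

    budget = pd.Series(0.0, index=out.columns)  # column-level adjustment budget
    while True:
        changed = False
        for idx in out.index:
            row = out.loc[idx]
            gap = row.sum() - row_controls.loc[idx]
            if gap > 0:
                # Row exceeds its control: subtract from the smallest non-zero
                # column and bank the amount in that column's budget.
                nonzero = row[row > 0]
                if nonzero.empty:
                    continue
                col = nonzero.idxmin()
                take = min(gap, out.at[idx, col])
                out.at[idx, col] -= take
                budget[col] += take
                changed = True
            elif gap < 0:
                # Row falls short: add to the largest non-zero column, spending
                # only that column's banked budget so column totals are preserved.
                col = row[row > 0].idxmax() if (row > 0).any() else row.idxmax()
                give = min(-gap, budget[col])
                if give > 0:
                    out.at[idx, col] += give
                    budget[col] -= give
                    changed = True
        row_sums = out.sum(axis=1)
        done = (
            (row_sums == row_controls).all()
            if condition == "exact"
            else (row_sums <= row_controls).all()
        )
        # Stop once the condition is met or no budget/excess is left to move.
        if done or not changed:
            break
    return out.astype(int)

The key design point is that additions only spend budget banked from the same column, so the column marginal totals fixed in step 1 are never disturbed.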

What issues does this pull request address?
closes #83
closes #71
closes #72
closes #73

Additional context
See #69 for the parent issue this work falls under.

While working on this project, specifically sql/ase/get_region_gq_ase_dist.sql, it was found that our group quarters labels didn't make sense, as they assumed every non-Disabled person in non-College or non-Military group quarters was in prison. As such, the labels were updated to reflect the extent of our knowledge. See #83.

Testing was done manually on [run_id]=1 if you would like to review my results.

@GregorSchroeder GregorSchroeder added the bug and enhancement labels Apr 24, 2025
@GregorSchroeder GregorSchroeder self-assigned this Apr 24, 2025
@GregorSchroeder GregorSchroeder linked an issue Apr 24, 2025 that may be closed by this pull request
Contributor

This is more of an issue with the DOF data we have loaded in, but there seem to be rows missing. It seems like the missing rows should have a population of zero, but it still makes it really annoying to do row count checks on this data. For example, the following query should return seven rows and not four:

SELECT *
FROM [socioec_data].[ca_dof].[projections_p3]
WHERE [projections_id] = 10
    AND [fips] = '06073'
    AND [age] = 103
    AND [sex] = 'Male'
    AND [year] = 2020
ORDER BY year, age, sex, [race/ethnicity]

Mostly, this makes me really nervous that some data got dropped by accident, as there is no way to know for sure whether the missing rows are actually zeroed data.
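
For what it's worth, here is a hypothetical sketch of the kind of completeness check I mean, assuming the relevant slice of [projections_p3] has been pulled into a pandas DataFrame. The function name, the expected_race_eth list, and the population column name are placeholders, not actual repository code.

import pandas as pd


def fill_missing_race_eth(p3: pd.DataFrame, expected_race_eth: list) -> pd.DataFrame:
    """Re-introduce rows dropped upstream as explicit zero-population rows so
    that simple row count checks behave as expected."""
    # Build the full grid of year/age/sex/race-ethnicity combinations that
    # should exist in the slice.
    full_index = pd.MultiIndex.from_product(
        [
            sorted(p3["year"].unique()),
            sorted(p3["age"].unique()),
            sorted(p3["sex"].unique()),
            expected_race_eth,  # assumed list of the seven category labels
        ],
        names=["year", "age", "sex", "race/ethnicity"],
    )
    # Reindex onto the full grid; missing combinations come back with a
    # population of 0 instead of silently disappearing.
    return (
        p3.set_index(["year", "age", "sex", "race/ethnicity"])["population"]  # column name assumed
        .reindex(full_index, fill_value=0)
        .reset_index()
    )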

Contributor Author

This is an issue as I assumed this dataset contained 0s. It looks like the current Vintage 2024 (2024.9.23) release on their projections site contains the 0s but the dataset I have saved in the repository for the release does not! I wonder if my initial grab of that dataset from the CA DOF erroneously had 0s removed and they added them back in later? There is nothing in the insert_p3() code that would have done this.

Great catch. I need to re-load this p3. See https://github.com/SANDAG/CA-DOF/issues/40

Contributor

  • Can probably change the long PUMS query to use BETWEEN and 'FROM [acs].[pums].[vi_5y_' + @year - 4 + '_' + @year + '_persons_sd]' (see the sketch after this list).
  • SQL data types in all caps
  • I notice we are missing WHEN [RAC1P] = '8', which corresponds to "Some other race alone". Should we be grouping this data into some other existing race/eth category?
  • How much does 2010/11 data differ from 2012+ due to the lack of a disabled population filter?
  • TBH I would like these results stored in an inputs table so that they can be easily referenced. I feel like we look at these ASE distributions often enough that it is worth it.
  • In reference to the above, it would be a lot easier to check whether the created distributions have face validity, especially since military has a hard age floor of 17, which is technically 18 since we are using age groups and not single year of age.
  • Speaking of that, there are a few data points which may need to be zeroed out (I only looked at 2020). Or we could just depend on the [special_mgra] age/sex restrictions to deal with it.
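
On the first bullet, a rough illustration of the idea (shown as Python string building just for clarity; the year value, column list, and BETWEEN filter are placeholders, not the actual query):

# Hypothetical: derive the 5-year PUMS view name from a single year parameter
# and use BETWEEN for the range filter instead of enumerating values.
year = 2022  # placeholder for @year
pums_view = f"[acs].[pums].[vi_5y_{year - 4}_{year}_persons_sd]"
query = (
    "SELECT [AGEP], [SEX], [RAC1P] "   # placeholder column list
    f"FROM {pums_view} "
    "WHERE [AGEP] BETWEEN 0 AND 99"    # placeholder BETWEEN filter
)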

Contributor Author

  • Agreed.
  • Agreed.
  • We are ignoring it, which implicitly assumes it is distributed equally among all categories, a very neutral/low-information guess. It is also very low effort, which is good. If we were creating counts and not distributions, we would have to allocate it across groups, but luckily we are not.
  • Something else besides the lack of [DIS] must be happening in 2010, but you can see the discontinuity between 2011 and the rest of the years in the table below. Note "records" just means age/sex/ethnicity categories, of which there are 280.
    [table of record counts by year omitted]
  • I do not want to load in any age/sex/ethnicity stuff unless it is directly used as a control. I understand these controls are altered a bit by the scaling, rounding, and re-allocating process but I would prefer to not load in interim calculations.
  • What needs to be zeroed out? The process should respect the 0s coming out of the ACS distributions. For the MGRA-level estimates we will be using the [special_mgra] table to enforce restrictions at the MGRA level.

Contributor

What needs to be zeroed out? The process should respect the 0s coming out of the ACS distributions. For the MGRA-level estimates we will be using the [special_mgra] table to enforce restrictions at the MGRA level.

Yeah, that's what I was referring to: using the [special_mgras] table to zero out. Just be aware there should be a tiny fraction of 17-year-old military GQ. Currently military MGRAs are not loaded into [special_mgras], and the [min_age]/[max_age] columns use single year of age rather than age groups. Not sure how this will interact with GQ ASE distributions that use age groups (see the sketch below).
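
To make that interaction concrete, here is a hypothetical sketch of applying a single-year [min_age]/[max_age] window to an age-grouped distribution. The group labels and bounds are illustrative assumptions, not the actual ASE category definitions or repository code.

import pandas as pd

# Illustrative age-group bounds; NOT the actual ASE category definitions.
AGE_GROUP_BOUNDS = {"0-17": (0, 17), "18-24": (18, 24), "25-64": (25, 64), "65+": (65, 120)}


def apply_special_mgra_age_window(dist: pd.Series, min_age: int, max_age: int) -> pd.Series:
    """Zero out distribution cells for age groups with no overlap with the
    single-year [min_age, max_age] window; partially overlapping groups (e.g.,
    the 17-year-old military case) are left untouched for a separate decision."""
    keep = pd.Series(
        {group: not (hi < min_age or lo > max_age) for group, (lo, hi) in AGE_GROUP_BOUNDS.items()}
    ).reindex(dist.index, fill_value=False)
    return dist.where(keep, other=0.0)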

@GregorSchroeder
Contributor Author

All changes have been made, except that I am blocked on https://github.com/SANDAG/CA-DOF/issues/40, although that fix is outside of this repository.

Contributor

@Eric-Liu-SANDAG Eric-Liu-SANDAG left a comment

Small changes requested. You can merge after making the changes; no further review required.

Contributor

@Eric-Liu-SANDAG Eric-Liu-SANDAG Apr 30, 2025

Just need to remove the leading _ from _2d_integerize() as it is not a private function.

Also, I know you said that imports should be alphabetical, but I aesthetically hate how it ends up like:

import pandas as pd
import pathlib
import sqlalchemy as sql

where a non-as import is squeezed between two as imports. IMO, to make it look good, simple imports should be grouped together alphabetically, then all import _ as _ imports should be grouped together alphabetically, then all import package.submodule as _ imports should be grouped together alphabetically. For example, see this (not alphabetical, but I don't actually care that much about being alphabetical).
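
A sketch of the grouping I mean (the specific modules are just illustrative):

# simple imports, alphabetical
import os
import pathlib

# import _ as _ imports, alphabetical
import pandas as pd
import sqlalchemy as sql

# import package.submodule as _ imports, alphabetical
import matplotlib.pyplot as plt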

@GregorSchroeder GregorSchroeder merged commit 628b05a into main Apr 30, 2025
@GregorSchroeder GregorSchroeder deleted the 69-feature-ase branch April 30, 2025 22:14