[PULL REQUEST] Create A/S/E Regional Controls by Population Type #84
Conversation
This is more an issue with the DOF data we have loaded in, but there seem to be rows missing. It seems like the missing rows should have population of zero, but it still makes it really annoying to do row count checks on this data. For example, the following query should return seven rows and not four:
SELECT *
FROM [socioec_data].[ca_dof].[projections_p3]
WHERE [projections_id] = 10
AND [fips] = '06073'
AND [age] = 103
AND [sex] = 'Male'
AND [year] = 2020
ORDER BY year, age, sex, [race/ethnicity]

Mostly, this makes me really nervous that some data got dropped by accident, as there is no way to know for sure whether the missing rows are actually zeroed data.
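To make those row count checks less painful, something like the following could flag every combination that is missing race/ethnicity rows. This is a rough sketch, not code from this PR; the connection string is a placeholder.

```python
# Rough sketch: flag combinations with fewer than the expected seven
# race/ethnicity rows in the DOF p3 projections table.
import pandas as pd
import sqlalchemy as sql

# Hypothetical connection string; substitute the real socioec_data server.
engine = sql.create_engine("mssql+pyodbc://@socioec_data_dsn")

query = """
SELECT [fips], [year], [age], [sex], COUNT(*) AS [n_rows]
FROM [socioec_data].[ca_dof].[projections_p3]
WHERE [projections_id] = 10
GROUP BY [fips], [year], [age], [sex]
HAVING COUNT(*) < 7
ORDER BY [fips], [year], [age], [sex]
"""

incomplete = pd.read_sql(query, engine)
print(incomplete)  # any rows here are combinations missing race/ethnicity records
```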
This is an issue, as I assumed this dataset contained 0s. It looks like the current Vintage 2024 (2024.9.23) release on their projections site contains the 0s, but the dataset I have saved in the repository for that release does not! I wonder if my initial grab of that dataset from the CA DOF erroneously had the 0s removed and they were added back in later? There is nothing in the `insert_p3()` code that would have done this.
Great catch. I need to re-load this p3. See https://github.com/SANDAG/CA-DOF/issues/40
- Can probably change the long PUMS query to use `BETWEEN` and `'FROM [acs].[pums].[vi_5y_' + @year - 4 + '_' + @year + '_persons_sd]'` (see the sketch after this list)
- SQL data types in all caps
- I notice we are missing `WHEN [RAC1P] = '8'`, which corresponds to "Some other race alone". Should we be grouping this data into some other existing race/eth category?
- How much does 2010/11 data differ from 2012+ due to the lack of a disabled population filter?
- TBH I would like these results stored in an inputs table so that they can be easily referenced. I feel like we look at these ASE distributions often enough that it is worth it.
- In reference to the above, it would be a lot easier to check whether the created distributions have face validity, especially since military has a hard age floor of 17, which is technically 18 since we are using age groups and not single year of age.
- Speaking of that, there are a few data points which may need to be zeroed out (I only looked at 2020). Or we could just depend on the `[special_mgra]` age/sex restrictions to deal with it.
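A sketch of the first suggestion, translated to Python purely for illustration; the function name, the `[AGEP]` filter, and the age bounds are hypothetical. The point is deriving the 5-year table name from a single year parameter and using `BETWEEN` for the range filter.

```python
# Illustrative only: build the 5-year PUMS table name from one year parameter
# and use BETWEEN for the range filter instead of a long list of comparisons.
def build_pums_query(year: int) -> str:
    table = f"[acs].[pums].[vi_5y_{year - 4}_{year}_persons_sd]"
    return (
        "SELECT * "
        f"FROM {table} "
        "WHERE [AGEP] BETWEEN 0 AND 99"  # placeholder filter
    )

print(build_pums_query(2023))
# SELECT * FROM [acs].[pums].[vi_5y_2019_2023_persons_sd] WHERE [AGEP] BETWEEN 0 AND 99
```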
- Agreed.
- Agreed.
- We are ignoring it, which implicitly assumes it is distributed equally among all categories, a very neutral/low-information guess. It is also very low effort, which is good. If we were creating counts and not distributions, we would have to allocate it across groups, but luckily we are not.
- Something else besides the lack of `[DIS]` must be happening in 2010, but you can see the discontinuity between 2011 and the rest of the years in the table below. Note that "records" just means age/sex/ethnicity categories, of which there are 280.

- I do not want to load in any age/sex/ethnicity stuff unless it is directly used as a control. I understand these controls are altered a bit by the scaling, rounding, and re-allocating process but I would prefer to not load in interim calculations.
- What needs to be zeroed out? The process should respect the 0s coming out of the ACS distributions. For the MGRA level, we will be using the `[special_mgra]` table to enforce restrictions at the MGRA level.
> What needs to be zeroed out? The process should respect the 0s coming out of the ACS distributions. For the MGRA level, we will be using the `[special_mgra]` table to enforce restrictions at the MGRA level.

Yeah, that's what I was referring to: using the `[special_mgras]` table to zero out. Just be aware there should be a tiny fraction of 17-year-old military GQ. Currently, military MGRAs are not loaded into `[special_mgras]`, and the `[min_age]`/`[max_age]` columns use single year of age and not age groups. Not sure how this will interact with GQ ASE distributions that use age groups (see the toy example below).
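A toy example of that interaction; the age-group labels, bins, and variable names here are assumptions for illustration, not the repository's data.

```python
# Toy example: a single-year military floor of [min_age] = 17 cannot be expressed
# cleanly with age groups; only the 18+ group is fully eligible, while the group
# containing age 17 is only partially eligible.
age_groups = {"15 to 17": range(15, 18), "18 to 19": range(18, 20)}  # assumed bins
min_age = 17  # single-year age floor

fully_eligible = [g for g, ages in age_groups.items() if min(ages) >= min_age]
partially_eligible = [
    g for g, ages in age_groups.items() if min(ages) < min_age <= max(ages)
]
print(fully_eligible)      # ['18 to 19'] -> the effective floor becomes 18
print(partially_eligible)  # ['15 to 17'] -> holds the tiny 17-year-old military GQ slice
```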
All changes have been made except I am blocked on https://github.com/SANDAG/CA-DOF/issues/40, although that fix is outside of this repository.
Eric-Liu-SANDAG left a comment
Small changes requested. You can merge after making the changes; no re-review required.
Just need to remove the leading _ from `_2d_integerize()` as it is not a private function.

Also, I know you said that imports should be alphabetical, but I aesthetically hate how it's like:

import pandas as pd
import pathlib
import sqlalchemy as sql

where a non-`as` import is squeezed between two `as` imports. IMO, to make it look good, simple imports should be grouped together alphabetically, then all `import _ as _`'s should be grouped together alphabetically, then all `import python._ as _`'s should be grouped together alphabetically. For example, see this (not alphabetical, but I don't actually care that much about being alphabetical).
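For instance, the grouping described above would look something like this; the specific modules are just the ones from the snippet, and the ordering within groups is the only point.

```python
# Simple (non-aliased) imports first, alphabetical...
import pathlib

# ...then "import _ as _" imports, alphabetical.
# (Dotted "import package.module as _" imports would form a third group.)
import pandas as pd
import sqlalchemy as sql
```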
Describe this pull request. What changes are being made?
The initial intent of this feature branch was to perform all the work necessary to create estimates of MGRA-level population by age/sex/ethnicity by type. While undertaking this task, it was decided that would be too large a pull request. As such, this pull request stops after generating and inserting the regional age/sex/ethnicity controls by population type.
The key piece of this process, and of the upcoming MGRA-level population by age/sex/ethnicity by type estimates, is the integerization procedure encapsulated in the `_2d_integerize()` function. A more detailed description of this function is included below. One key piece missing in this pull request is where to put this documentation and the higher-level explanation of how the controls are generated. My first thought was to wait and include it in the wrapper `run_ase()` function that will be created in the subsequent MGRA-level population by age/sex/ethnicity by type estimates pull request, so that is why it is not present in the code itself as of this time.

The `_2d_integerize()` function works off the `[outputs].[gq]` data, uses `iteround.saferound` for rounding, and takes a `condition` input parameter. The `condition="exact"` option matches the row marginal control totals exactly; this will be used for the estimates of MGRA-level population by age/sex/ethnicity by type. The `condition="less than"` option uses the row marginal controls to ensure that all rows are less than or equal to their marginal control; this is used here to ensure that no age/sex/ethnicity category, when summed across all group quarters by type, exceeds the regional control for total population by age/sex/ethnicity category.
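To make the `condition` behavior concrete, here is a minimal, self-contained sketch of this style of row-by-row integerization with `iteround.saferound`. It is not the repository's `_2d_integerize()`; the function name, signature, and example data are illustrative only.

```python
# Minimal sketch of 2-D integerization: round each row to whole numbers with
# iteround.saferound (which preserves the row sum) and then check the row
# against its marginal control under the requested condition.
import numpy as np
from iteround import saferound

def integerize_2d_sketch(values, row_controls, condition="exact"):
    """values: 2-D array of non-negative floats; row_controls: integer row totals."""
    result = np.empty(values.shape, dtype=int)
    for i, row in enumerate(values):
        rounded = saferound(list(row), places=0)  # sum-preserving rounding
        result[i] = np.asarray(rounded, dtype=int)
        if condition == "exact" and result[i].sum() != row_controls[i]:
            raise ValueError(f"row {i} does not match its marginal control")
        if condition == "less than" and result[i].sum() > row_controls[i]:
            raise ValueError(f"row {i} exceeds its marginal control")
    return result

# Example: two rows already scaled so they sum to their controls of 10 and 5.
vals = np.array([[3.2, 4.4, 2.4], [1.6, 1.7, 1.7]])
print(integerize_2d_sketch(vals, row_controls=np.array([10, 5]), condition="exact"))
```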
What issues does this pull request address?
closes #83
closes #71
closes #72
closes #73
Additional context
See #69 for the parent issue this is under.
While working on this project, specifically `sql/ase/get_region_gq_ase_dist.sql`, it was found that our group quarters labels didn't make sense, as they assumed every non-Disabled person in non-College or non-Military Group Quarters was in prison. As such, the labels were updated to reflect the extent of our knowledge. See #83.

Testing was done manually on `[run_id] = 1` if you would like to review my results.