Add ACS, rent and property taxes and 3-year CPS #35
Merged
46 commits
| Commit | Message | Author |
|---|---|---|
| d9146dd | Migrate ACS from policyengine-us | PavelMakarchuk |
| 7fadd29 | Merge branch 'main' of https://github.com/PavelMakarchuk/policyengine… | PavelMakarchuk |
| 5065de3 | populate acs | PavelMakarchuk |
| b9e0f22 | Update PolicyEngine US data | PavelMakarchuk |
| 68e2e98 | Merge branch 'main' of https://github.com/PolicyEngine/policyengine-u… | PavelMakarchuk |
| 74b53bf | format | PavelMakarchuk |
| 0ba70a6 | data fix | PavelMakarchuk |
| 380d6b2 | test | PavelMakarchuk |
| c72d813 | changelog | PavelMakarchuk |
| 9d2e340 | Update PolicyEngine US data | PavelMakarchuk |
| 040ea97 | remove extra | PavelMakarchuk |
| 2390120 | chagelog | PavelMakarchuk |
| 9c8ecd5 | Update PolicyEngine US data | PavelMakarchuk |
| a292087 | readme file | PavelMakarchuk |
| 8a8c93f | Merge branch 'main' of https://github.com/PavelMakarchuk/policyengine… | PavelMakarchuk |
| 553f63f | property tax | PavelMakarchuk |
| 96013e9 | changelog | PavelMakarchuk |
| ed627e8 | Update PolicyEngine US data | PavelMakarchuk |
| cd66e84 | Merge branch 'main' of https://github.com/PolicyEngine/policyengine-u… | PavelMakarchuk |
| 6d48d19 | format | PavelMakarchuk |
| 317de21 | changelog | PavelMakarchuk |
| 8914b9e | Pool 3 CPS years | nikhilwoodruff |
| 43e3bb7 | Upload ECPS result in PRs | nikhilwoodruff |
| 247230c | Feed into ECPS | nikhilwoodruff |
| 84ac325 | Bump version and ECPS file | nikhilwoodruff |
| 824bf8e | Merge branch 'main' of https://github.com/PavelMakarchuk/policyengine… | PavelMakarchuk |
| 0338cb9 | changelog | PavelMakarchuk |
| 42fdd24 | Move back to old ECPS | nikhilwoodruff |
| d619ef0 | Merge branch 'main' of https://github.com/PolicyEngine/policyengine-u… | PavelMakarchuk |
| abf512e | init | PavelMakarchuk |
| 33251a9 | storage | PavelMakarchuk |
| 8af92c3 | Fix imports | nikhilwoodruff |
| 80be6b9 | Move versioning back | nikhilwoodruff |
| 1526b47 | Merge branch 'main' of https://github.com/PolicyEngine/policyengine-u… | nikhilwoodruff |
| 7edbccc | Add URL for ACS 2022 | nikhilwoodruff |
| 95d7980 | Add QRF rewrite and full imputations | nikhilwoodruff |
| 9330572 | Merge branch 'nikhilwoodruff/issue66' of https://github.com/PolicyEng… | nikhilwoodruff |
| c35a21c | Add calibration | nikhilwoodruff |
| 5a3f94d | Shift to branch of US | nikhilwoodruff |
| a23329b | Make optional install | nikhilwoodruff |
| dcda8bd | Generate ACS before CPS | nikhilwoodruff |
| 502d8c9 | What a silly error | nikhilwoodruff |
| c8e2710 | Minor improvements | nikhilwoodruff |
| 7024666 | Fix bugs | nikhilwoodruff |
| 54449a2 | Adjust QRF to enable single-output predictions | nikhilwoodruff |
| b67a64f | Fix bug in QRF | nikhilwoodruff |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,6 @@ | ||
| - bump: minor | ||
| changes: | ||
| added: | ||
| - Migrate the ACS from the US-repository. | ||
| changed: | ||
| - Enhanced CPS now uses a 3-year pooled CPS. |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,6 @@ | ||
| 2022 ACS 1 Year Data Dictionary: | ||
| https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2022.pdf | ||
| User Guide: | ||
| https://www2.census.gov/programs-surveys/acs/tech_docs/pums/2022ACS_PUMS_User_Guide.pdf | ||
| PUMS Documentation: | ||
| https://www.census.gov/programs-surveys/acs/microdata/documentation.html |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,2 @@ | ||
| from .acs import * | ||
| from .census_acs import * |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,118 @@ | ||
| import logging | ||
| from policyengine_core.data import Dataset | ||
| import h5py | ||
| from policyengine_us_data.datasets.acs.census_acs import CensusACS_2022 | ||
| from policyengine_us_data.storage import STORAGE_FOLDER | ||
| from pandas import DataFrame | ||
| import numpy as np | ||
| import pandas as pd | ||
| | ||
| | ||
| class ACS(Dataset): | ||
| data_format = Dataset.ARRAYS | ||
| time_period = None | ||
| census_acs = None | ||
| | ||
| def generate(self) -> None: | ||
| """Generates the ACS dataset.""" | ||
| | ||
| raw_data = self.census_acs(require=True).load() | ||
| acs = h5py.File(self.file_path, mode="w") | ||
| person, household = [ | ||
| raw_data[entity] for entity in ("person", "household") | ||
| ] | ||
| | ||
| self.add_id_variables(acs, person, household) | ||
| self.add_person_variables(acs, person, household) | ||
| self.add_household_variables(acs, household) | ||
| | ||
| acs.close() | ||
| raw_data.close() | ||
| | ||
| @staticmethod | ||
| def add_id_variables( | ||
| acs: h5py.File, | ||
| person: DataFrame, | ||
| household: DataFrame, | ||
| ) -> None: | ||
| # Create numeric IDs based on SERIALNO | ||
| h_id_to_number = pd.Series( | ||
| np.arange(len(household)), index=household["SERIALNO"] | ||
| ) | ||
| household["household_id"] = h_id_to_number[ | ||
| household["SERIALNO"] | ||
| ].values | ||
| person["household_id"] = h_id_to_number[person["SERIALNO"]].values | ||
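| # Use positional IDs: the raw index can repeat because the "a" and "b" file parts are concatenated without resetting it | ||
| person["person_id"] = np.arange(len(person)) + 1 | ||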
| | ||
| acs["person_id"] = person["person_id"] | ||
| acs["household_id"] = household["household_id"] | ||
| acs["spm_unit_id"] = acs["household_id"] | ||
| acs["tax_unit_id"] = acs["household_id"] | ||
| acs["family_id"] = acs["household_id"] | ||
| acs["marital_unit_id"] = acs["household_id"] | ||
| acs["person_household_id"] = person["household_id"] | ||
| acs["person_spm_unit_id"] = person["household_id"] | ||
| acs["person_tax_unit_id"] = person["household_id"] | ||
| acs["person_family_id"] = person["household_id"] | ||
| acs["person_marital_unit_id"] = person["household_id"] | ||
| acs["household_weight"] = household.WGTP | ||
| | ||
| @staticmethod | ||
| def add_person_variables( | ||
| acs: h5py.File, person: DataFrame, household: DataFrame | ||
| ) -> None: | ||
| acs["age"] = person.AGEP | ||
| acs["is_male"] = person.SEX == 1 | ||
| acs["employment_income"] = person.WAGP | ||
| acs["self_employment_income"] = person.SEMP | ||
| acs["social_security"] = person.SSP | ||
| acs["taxable_private_pension_income"] = person.RETP | ||
| person[["rent", "real_estate_taxes"]] = ( | ||
| household.set_index("household_id") | ||
| .loc[person["household_id"]][["RNTP", "TAXAMT"]] | ||
| .values | ||
| ) | ||
| acs["is_household_head"] = person.SPORDER == 1 | ||
| factor = person.SPORDER == 1 | ||
| person.rent *= factor * 12 | ||
| person.real_estate_taxes *= factor | ||
| acs["rent"] = person.rent | ||
| acs["real_estate_taxes"] = person.real_estate_taxes | ||
| acs["tenure_type"] = ( | ||
| household.TEN.astype(int) | ||
| .map( | ||
| { | ||
| 1: "OWNED_WITH_MORTGAGE", | ||
| 2: "OWNED_OUTRIGHT", | ||
| 3: "RENTED", | ||
| } | ||
| ) | ||
| .fillna("NONE") | ||
| .astype("S") | ||
| ) | ||
| | ||
| @staticmethod | ||
| def add_spm_variables(acs: h5py.File, spm_unit: DataFrame) -> None: | ||
| acs["spm_unit_net_income_reported"] = spm_unit.SPM_RESOURCES | ||
| acs["spm_unit_spm_threshold"] = spm_unit.SPM_POVTHRESHOLD | ||
| | ||
| @staticmethod | ||
| def add_household_variables(acs: h5py.File, household: DataFrame) -> None: | ||
| acs["household_vehicles_owned"] = household.VEH | ||
| acs["state_fips"] = acs["household_state_fips"] = household.ST.astype( | ||
| int | ||
| ) | ||
| | ||
| | ||
| class ACS_2022(ACS): | ||
| name = "acs_2022" | ||
| label = "ACS 2022" | ||
| time_period = 2022 | ||
| file_path = STORAGE_FOLDER / "acs_2022.h5" | ||
| | ||
| census_acs = CensusACS_2022 | ||
| url = "release://PolicyEngine/policyengine-us-data/release/acs_2022.h5" | ||
| | ||
| | ||
| if __name__ == "__main__": | ||
| ACS_2022().generate() | ||
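The ID and housing-cost steps in `add_id_variables` and `add_person_variables` above can be illustrated on toy data. This is a minimal sketch, not the PR's code: the two small DataFrames and their values are made up, and only the column names (`SERIALNO`, `SPORDER`, `RNTP`, `TAXAMT`) come from the diff. It shows how string serial numbers map to numeric household IDs, and how monthly rent (annualised with `* 12`) and property taxes are assigned to the household head only (`SPORDER == 1`), so that summing over persons does not double count.

```python
import numpy as np
import pandas as pd

# Hypothetical toy tables; real PUMS files have many more rows and columns.
household = pd.DataFrame(
    {"SERIALNO": ["A", "B"], "RNTP": [1000, 0], "TAXAMT": [0, 2400]}
)
person = pd.DataFrame(
    {"SERIALNO": ["A", "A", "B"], "SPORDER": [1, 2, 1]}
)

# Map each string SERIALNO to a numeric household ID (0, 1, ...).
h_id_to_number = pd.Series(
    np.arange(len(household)), index=household["SERIALNO"]
)
household["household_id"] = h_id_to_number.loc[household["SERIALNO"]].to_numpy()
person["household_id"] = h_id_to_number.loc[person["SERIALNO"]].to_numpy()

# Positional person IDs stay unique even if the DataFrame index repeats.
person["person_id"] = np.arange(1, len(person) + 1)

# Look up each person's household housing costs.
housing = (
    household.set_index("household_id")
    .loc[person["household_id"], ["RNTP", "TAXAMT"]]
    .to_numpy()
)

# Keep the amounts on the household head only; rent is monthly, so annualise.
is_head = (person["SPORDER"] == 1).to_numpy()
person["rent"] = housing[:, 0] * is_head * 12
person["real_estate_taxes"] = housing[:, 1] * is_head

print(person)
```

Placing household-level amounts on a single person is one way to store them in a person-level array; the choice of the head as that person follows the `SPORDER == 1` filter in the diff.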
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,208 @@ | ||
| from io import BytesIO | ||
| import logging | ||
| from typing import List | ||
| from zipfile import ZipFile | ||
| import pandas as pd | ||
| from policyengine_core.data import Dataset | ||
| import requests | ||
| from tqdm import tqdm | ||
| from policyengine_us_data.storage import STORAGE_FOLDER | ||
| | ||
| logging.getLogger().setLevel(logging.INFO) | ||
| | ||
| PERSON_COLUMNS = [ | ||
| "SERIALNO", # Household ID | ||
| "SPORDER", # Person number within household | ||
| "PWGTP", # Person weight | ||
| "AGEP", # Age | ||
| "CIT", # Citizenship | ||
| "MAR", # Marital status | ||
| "WAGP", # Wage/salary | ||
| "SSP", # Social security income | ||
| "SSIP", # Supplemental security income | ||
| "SEX", # Sex | ||
| "SEMP", # Self-employment income | ||
| "SCHL", # Educational attainment | ||
| "RETP", # Retirement income | ||
| "PAP", # Public assistance income | ||
| "OIP", # Other income | ||
| "PERNP", # Total earnings | ||
| "PINCP", # Total income | ||
| "POVPIP", # Income-to-poverty line percentage | ||
| "RAC1P", # Race | ||
| ] | ||
| | ||
| HOUSEHOLD_COLUMNS = [ | ||
| "SERIALNO", # Household ID | ||
| "PUMA", # PUMA area code | ||
| "ST", # State code | ||
| "ADJHSG", # Adjustment factor for housing dollar amounts | ||
| "ADJINC", # Adjustment factor for income | ||
| "WGTP", # Household weight | ||
| "NP", # Number of persons in household | ||
| "BDSP", # Number of bedrooms | ||
| "ELEP", # Electricity monthly cost | ||
| "FULP", # Fuel monthly cost | ||
| "GASP", # Gas monthly cost | ||
| "RMSP", # Number of rooms | ||
| "RNTP", # Monthly rent | ||
| "TEN", # Tenure | ||
| "VEH", # Number of vehicles | ||
| "FINCP", # Total income | ||
| "GRNTP", # Gross rent | ||
| "TAXAMT", # Property taxes | ||
| ] | ||
| | ||
| | ||
| class CensusACS(Dataset): | ||
| data_format = Dataset.TABLES | ||
| | ||
| def generate(self) -> None: | ||
| spm_url = f"https://www2.census.gov/programs-surveys/supplemental-poverty-measure/datasets/spm/spm_{self.time_period}_pu.dta" | ||
| person_url = f"https://www2.census.gov/programs-surveys/acs/data/pums/{self.time_period}/1-Year/csv_pus.zip" | ||
| household_url = f"https://www2.census.gov/programs-surveys/acs/data/pums/{self.time_period}/1-Year/csv_hus.zip" | ||
| | ||
| with pd.HDFStore(self.file_path, mode="w") as storage: | ||
| household = self.process_household_data( | ||
| household_url, "psam_hus", HOUSEHOLD_COLUMNS | ||
| ) | ||
| person = self.process_person_data( | ||
| person_url, "psam_pus", PERSON_COLUMNS | ||
| ) | ||
| person = person[person.SERIALNO.isin(household.SERIALNO)] | ||
| household = household[household.SERIALNO.isin(person.SERIALNO)] | ||
| storage["household"] = household | ||
| storage["person"] = person | ||
| | ||
| @staticmethod | ||
| def process_household_data( | ||
| url: str, prefix: str, columns: List[str] | ||
| ) -> pd.DataFrame: | ||
| req = requests.get(url, stream=True) | ||
| with BytesIO() as f: | ||
| pbar = tqdm() | ||
| for chunk in req.iter_content(chunk_size=1024): | ||
| if chunk: | ||
| pbar.update(len(chunk)) | ||
| f.write(chunk) | ||
| f.seek(0) | ||
| zf = ZipFile(f) | ||
| a = pd.read_csv( | ||
| zf.open(prefix + "a.csv"), | ||
| usecols=columns, | ||
| dtype={"SERIALNO": str}, | ||
| ) | ||
| b = pd.read_csv( | ||
| zf.open(prefix + "b.csv"), | ||
| usecols=columns, | ||
| dtype={"SERIALNO": str}, | ||
| ) | ||
| res = pd.concat([a, b]).fillna(0) | ||
| res.columns = res.columns.str.upper() | ||
| | ||
| # Ensure correct data types | ||
| res["ST"] = res["ST"].astype(int) | ||
| | ||
| return res | ||
| | ||
| @staticmethod | ||
| def process_person_data( | ||
| url: str, prefix: str, columns: List[str] | ||
| ) -> pd.DataFrame: | ||
| req = requests.get(url, stream=True) | ||
| with BytesIO() as f: | ||
| pbar = tqdm() | ||
| for chunk in req.iter_content(chunk_size=1024): | ||
| if chunk: | ||
| pbar.update(len(chunk)) | ||
| f.write(chunk) | ||
| f.seek(0) | ||
| zf = ZipFile(f) | ||
| a = pd.read_csv( | ||
| zf.open(prefix + "a.csv"), | ||
| usecols=columns, | ||
| dtype={"SERIALNO": str}, | ||
| ) | ||
| b = pd.read_csv( | ||
| zf.open(prefix + "b.csv"), | ||
| usecols=columns, | ||
| dtype={"SERIALNO": str}, | ||
| ) | ||
| res = pd.concat([a, b]).fillna(0) | ||
| res.columns = res.columns.str.upper() | ||
| | ||
| # Ensure correct data types | ||
| res["SPORDER"] = res["SPORDER"].astype(int) | ||
| | ||
| return res | ||
| | ||
| @staticmethod | ||
| def create_spm_unit_table( | ||
| storage: pd.HDFStore, person: pd.DataFrame | ||
| ) -> None: | ||
| SPM_UNIT_COLUMNS = [ | ||
| "CAPHOUSESUB", | ||
| "CAPWKCCXPNS", | ||
| "CHILDCAREXPNS", | ||
| "EITC", | ||
| "ENGVAL", | ||
| "EQUIVSCALE", | ||
| "FEDTAX", | ||
| "FEDTAXBC", | ||
| "FICA", | ||
| "GEOADJ", | ||
| "MEDXPNS", | ||
| "NUMADULTS", | ||
| "NUMKIDS", | ||
| "NUMPER", | ||
| "POOR", | ||
| "POVTHRESHOLD", | ||
| "RESOURCES", | ||
| "SCHLUNCH", | ||
| "SNAPSUB", | ||
| "STTAX", | ||
| "TENMORTSTATUS", | ||
| "TOTVAL", | ||
| "WCOHABIT", | ||
| "WICVAL", | ||
| "WKXPNS", | ||
| "WUI_LT15", | ||
| "ID", | ||
| ] | ||
| spm_table = ( | ||
| person[["SPM_" + column for column in SPM_UNIT_COLUMNS]] | ||
| .groupby(person.SPM_ID) | ||
| .first() | ||
| ) | ||
| | ||
| original_person_table = storage["person"] | ||
| original_person_table.to_csv("person.csv") | ||
| person.to_csv("spm_person.csv") | ||
| | ||
| # Ensure SERIALNO is treated as string | ||
| JOIN_COLUMNS = ["SERIALNO", "SPORDER"] | ||
| original_person_table["SERIALNO"] = original_person_table[ | ||
| "SERIALNO" | ||
| ].astype(str) | ||
| original_person_table["SPORDER"] = original_person_table[ | ||
| "SPORDER" | ||
| ].astype(int) | ||
| person["SERIALNO"] = person["SERIALNO"].astype(str) | ||
| person["SPORDER"] = person["SPORDER"].astype(int) | ||
| | ||
| # Add SPM_ID from the SPM person table to the original person table. | ||
| combined_person_table = pd.merge( | ||
| original_person_table, | ||
| person[JOIN_COLUMNS + ["SPM_ID"]], | ||
| on=JOIN_COLUMNS, | ||
| ) | ||
| | ||
| storage["person_matched"] = combined_person_table | ||
| storage["spm_unit"] = spm_table | ||
| | ||
| | ||
| class CensusACS_2022(CensusACS): | ||
| label = "Census ACS (2022)" | ||
| name = "census_acs_2022.h5" | ||
| file_path = STORAGE_FOLDER / "census_acs_2022.h5" | ||
| time_period = 2022 |
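`process_household_data` and `process_person_data` above differ only in the final type cast, so the shared "read both CSV parts from the zip and concatenate" step could be factored out. The following is a rough sketch under assumptions: `read_pums_parts` and `make_demo_zip` are hypothetical names that do not appear in the PR, and the in-memory archive stands in for the downloaded Census file so the example runs without network access. It also passes `ignore_index=True` to `pd.concat`, which keeps the row index unique across the "a" and "b" parts.

```python
from io import BytesIO
from typing import List
from zipfile import ZipFile

import pandas as pd


def read_pums_parts(
    zip_buffer: BytesIO, prefix: str, columns: List[str]
) -> pd.DataFrame:
    """Read the 'a' and 'b' CSV parts of a PUMS zip into one DataFrame.

    Hypothetical helper, not part of the PR.
    """
    parts = []
    with ZipFile(zip_buffer) as zf:
        for suffix in ("a", "b"):
            with zf.open(f"{prefix}{suffix}.csv") as part:
                parts.append(
                    pd.read_csv(
                        part,
                        usecols=columns,
                        dtype={"SERIALNO": str},  # keep leading zeros
                    )
                )
    # ignore_index=True avoids repeated index labels across the two parts.
    res = pd.concat(parts, ignore_index=True).fillna(0)
    res.columns = res.columns.str.upper()
    return res


def make_demo_zip() -> BytesIO:
    """Build a tiny in-memory archive that mimics the two-part layout."""
    buffer = BytesIO()
    with ZipFile(buffer, "w") as zf:
        zf.writestr(
            "psam_husa.csv",
            "SERIALNO,ST,WGTP,EXTRA\n001,1,10,x\n002,1,20,y\n",
        )
        zf.writestr(
            "psam_husb.csv",
            "SERIALNO,ST,WGTP,EXTRA\n003,6,,z\n",
        )
    buffer.seek(0)
    return buffer


demo = read_pums_parts(make_demo_zip(), "psam_hus", ["SERIALNO", "ST", "WGTP"])
print(demo)
```

With a helper like this, each caller would only apply its own cast afterwards (`ST` for households, `SPORDER` for persons), as the two methods in the diff already do.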