Trouble replicating Imperial county typologies #115

Open
xing-gao-phd opened this issue Jan 13, 2022 · 0 comments

Describe the bug
I'm running the code in the "SCAG-DT" folder, using CA census-based statistical areas to define "city", so the only regions in my dataset that are comparable to the UDP outputs are Ventura and Imperial. After running through 4_typology.py, my Ventura tracts match the UDP Ventura typology file well (only 1 census tract differs), but 11 of 27 Imperial tracts do not match, mostly in the displacement and gentrification categories. My dataset has 31 tracts in Imperial county, while UDP's imperial_typology_output.csv and scag.csv have only 27. I suspect the medians differ because they are computed over a different number of tracts. That would shift the categorical variables derived from those medians, and the differences would accumulate into different typology categories.

To Reproduce
I think the discrepancy starts around line 384 in 2_data_curation.py, where the medians are calculated, for example rm_hinc_18 = np.nanmedian(census['hinc_18']).
At that point there should still be 31 tracts in Imperial county (based on the code from the beginning of 2_data_curation.py up to line 384). If I filter out the tracts that are not in Imperial_database_2018.csv, the medians match.
For example, median(mydataset_31tracts["hinc_18"])=41767, median(Imperial_database_2018.csv["hinc_18"])=43651, and median(mydataset_27tracts["hinc_18"])=43651.
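
To illustrate the mechanism, here is a minimal sketch with made-up numbers (not the actual census values). The FIPS codes, incomes, and the `kept` set are all hypothetical; only the use of np.nanmedian on an hinc_18 column mirrors the line quoted above. It shows how dropping tracts before taking the median can flip an above-median flag for a tract whose own value never changed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 7 tracts, one with a missing income value.
census = pd.DataFrame({
    "FIPS": [1, 2, 3, 4, 5, 6, 7],
    "hinc_18": [30000.0, 35000.0, 40000.0, 45000.0, 50000.0, 55000.0, np.nan],
})

# Hypothetical subset standing in for "tracts present in the reference file".
kept = {3, 4, 5, 6, 7}

# Median over all tracts vs. median over the reduced set (NaN is ignored).
median_all = np.nanmedian(census["hinc_18"])
median_kept = np.nanmedian(census.loc[census["FIPS"].isin(kept), "hinc_18"])

# A categorical flag derived from each median; NaN compares as False.
above_all = census["hinc_18"] > median_all
above_kept = census["hinc_18"] > median_kept

# Tracts whose flag depends on which set of tracts defined the median.
flipped = census.loc[above_all != above_kept, "FIPS"].tolist()

print(median_all, median_kept, flipped)
```

In this toy case a single tract changes category, which is the same kind of shift I suspect is feeding into the typology differences.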

I think the same thing happens when the PUMS and Zillow data are used to create categorical variables.

The four missing tracts are: 6025010102 6025010900 6025012302 6025940000. These tracts are present in Imperialcensus_summ_2018.csv, which is the input at the beginning of 2_data_curation.py. Do you know why they are not included in the UDP outputs? Where in the code should I be excluding tracts, and based on what criteria? Thanks!
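
For reference, this is roughly how the missing tracts can be identified: a sketch that uses small in-memory tables in place of the real CSV files. The column name FIPS is an assumption about how the tract identifier is labeled, and 6025010101 is only a placeholder for a tract that appears in both files.

```python
import pandas as pd

# Stand-in for my 31-tract dataset (only a few rows shown; "FIPS" is an assumed column name).
mine = pd.DataFrame({
    "FIPS": [6025010101, 6025010102, 6025010900, 6025012302, 6025940000],
})

# Stand-in for the UDP output file, which lacks the four tracts in question.
udp = pd.DataFrame({"FIPS": [6025010101]})

# A left merge with indicator=True marks rows that exist only in my dataset.
merged = mine.merge(udp, on="FIPS", how="left", indicator=True)
missing = sorted(merged.loc[merged["_merge"] == "left_only", "FIPS"].tolist())

print(missing)
```

With the real files, the two DataFrames would come from pd.read_csv on my curated data and on imperial_typology_output.csv respectively.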
