Backend: Speed up cdc_restricted_local step #2342
Conversation
✅ Deploy Preview for health-equity-tracker ready! Built without sensitive environment variables.
This is awesome! Great job.
```python
# Create a new combined race/eth column, initialized with UNKNOWN
df[RACE_ETH_COL] = std_col.Race.UNKNOWN.value
# Overwrite with the specific race where one is given
df.loc[other_mask, RACE_ETH_COL] = df.loc[other_mask, RACE_COL].map(RACE_NAMES_MAPPING)
```
Nice!
Description
Speed Optimizations
Significantly speeds up the local step that processes the CDC's restricted, raw, unzipped .csv files into the non-restricted, aggregated .csv files needed for the `cdc_restricted` ingestion step. To benchmark, I ran the code against just a single raw .csv file (instead of all ~20).

Here are the incremental changes I made, and the improvement from each step:

- Refactored `combine_race_eth()` to use pandas' built-in optimized functions rather than mapping/applying a lambda function against each row, and only read in the needed columns with the `usecols` arg in `read_csv`: 191s

When running against the full set of all raw files, the run time went from 2h40m down to 34m, and produced identical output files. A rough sketch of both changes is below.
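For reference, a minimal sketch of the two changes together; the "before" lambda version is reconstructed for illustration, and the file name, column names, and mapping here are assumptions rather than the actual pipeline code:

```python
import pandas as pd

# Illustrative stand-ins for the real constants in the pipeline
NEEDED_COLS = ["race", "ethnicity", "age_group", "res_state"]
RACE_NAMES_MAPPING = {"White": "White", "Black": "Black or African American"}

# Only parse the columns the aggregation needs, instead of the whole file
df = pd.read_csv("raw_cdc_file.csv", usecols=NEEDED_COLS, dtype=str)

# Before (slow): a Python lambda executed once per row
# df["race_and_ethnicity"] = df.apply(
#     lambda row: RACE_NAMES_MAPPING.get(row["race"], "Unknown"), axis=1)

# After (fast): one vectorized dict lookup across the whole column
df["race_and_ethnicity"] = df["race"].map(RACE_NAMES_MAPPING).fillna("Unknown")
```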
Updates

- Sets `numeric_only` explicitly to prevent the pandas FutureWarning (minimal example below)
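A minimal example of the warning fix; the frame here is made up:

```python
import pandas as pd

df = pd.DataFrame({
    "state": ["AL", "AL", "AK"],
    "cases": [10, 5, 3],
    "age_group": ["0-9", "10-19", "0-9"],
})

# Being explicit about numeric_only silences the FutureWarning pandas emits
# about its changing default; only the numeric "cases" column is summed
totals = df.groupby("state").sum(numeric_only=True)
```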
Abandoned Optimizations

I tried several more vectorization optimizations, replacing the original code's `map`, `applymap`, and `apply` with the pandas vectorized versions, but saw no improvement from those. For example, replacing an `apply`-based string operation with its `.str.` equivalent, along the lines of the sketch below (the data and column here are illustrative, not the original snippet):
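```python
import pandas as pd

df = pd.DataFrame({"county_name": ["cook county", "harris county"]})

# apply version: a Python-level function call per element
df["county_name"] = df["county_name"].apply(lambda x: x.title())

# .str. vectorized version: same result, but no faster here in practice
df["county_name"] = df["county_name"].str.title()
```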
Although pandas vectorized methods are usually much faster than `apply`, in these cases they were not, probably because the `.str.` vectorized methods are not especially optimized and are sometimes even slower. So going forward, we should prioritize refactoring our Python code to these vectorized versions (like #2033) when it's doing numerical work, and not necessarily for string-based work.

Motivation and Context
It was taking hours to run the local code; it was extremely slow and frustrating, and it locked up my machine for the duration.
Has this been tested? How?
First I ran the original code and generated the HET tables with an `old_` prefix, then switched to this branch and ran the updated code to generate the tables without that prefix. I then used a quick bash script to iterate over the tables and compare the old and new versions of each table, ensuring that the matching files were identical between the `main` branch and this branch. I verified the checker itself was working by manually diffing age_county against age_state and confirming that it did show a difference.

The original unit tests continue to pass with all updates.
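That check could look something like the sketch below, written here in Python rather than the original bash; the output directory and the `old_` file-naming convention are assumptions:

```python
import filecmp
from pathlib import Path

OUTPUT_DIR = Path("cdc_restricted_output")  # hypothetical output location

# Compare each regenerated table against its old_-prefixed counterpart
for old_file in sorted(OUTPUT_DIR.glob("old_*.csv")):
    new_file = OUTPUT_DIR / old_file.name.removeprefix("old_")
    same = filecmp.cmp(old_file, new_file, shallow=False)
    print(f"{new_file.name}: {'identical' if same else 'DIFFERS'}")
```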
Screenshots (if appropriate):
Types of changes
Post-merge TODO
I have inspected frontend changes and/or run affected data pipelines:
Any target user persona(s)?
Preview link below in Netlify comment 😎