
Backend: Speed up cdc_restricted_local step #2342

Merged

Conversation

benhammondmusic
Collaborator

@benhammondmusic benhammondmusic commented Aug 30, 2023

Description

Speed Optimizations

Significantly speeds up the local step that processes the CDC's restricted, raw, unzipped .csv files into the non-restricted, aggregated .csv files needed for the cdc_restricted ingestion step.

To benchmark, I ran the code against just a single raw csv file (instead of all ~20).

  • The original code took 313s to process the one file

Here are the incremental changes I made, and the improvement from each step:

  • Using a vectorized combine_race_eth() that relies on pandas' built-in optimized functions rather than mapping/applying a lambda against each row, and reading in only the needed columns via the usecols argument of read_csv: 191s
  • Above + Removing unneeded 2nd quote cleaning: 188s
  • Above + Removing unneeded 1st quote cleaning: 180s
  • Above + chunk_size = 1 million. Using a chunk size splits the df into chunks so your machine doesn't have to hold the entire thing in memory; my computer is fast enough to hold it all, so a giant chunk size is faster because it reduces the need to iterate over chunks, while chunking would still kick in if the CDC ever sends a much bigger file: 75s
  • Above + chunk_size = 2 million: 64s
  • Above + chunk_size = 5 million: 57s
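The usecols + chunksize pattern from the steps above can be sketched as follows. The column names, file path, and aggregation are hypothetical stand-ins, not the actual CDC restricted schema or the pipeline's real aggregation logic:

```python
import pandas as pd

# Hypothetical column names; the real CDC restricted schema differs.
USECOLS = ["res_state", "res_county_fips", "race", "ethnicity", "age_group"]


def process_file(path: str, chunk_size: int = 5_000_000) -> pd.DataFrame:
    """Read only the needed columns, in large chunks, aggregating each
    chunk before concatenating, so the full raw file never needs to be
    held in memory at once."""
    aggregated_chunks = []
    for chunk in pd.read_csv(path, usecols=USECOLS, dtype=str,
                             chunksize=chunk_size):
        counts = chunk.groupby(USECOLS, dropna=False).size().reset_index(name="count")
        aggregated_chunks.append(counts)
    combined = pd.concat(aggregated_chunks, ignore_index=True)
    # Re-aggregate, since the same group may have appeared in multiple chunks.
    return combined.groupby(USECOLS, dropna=False, as_index=False)["count"].sum()
```

Because each chunk is collapsed to group counts before the next chunk is read, peak memory is bounded by the chunk size plus the (much smaller) running aggregates, which is why a larger chunk_size trades memory for speed.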

When running against the full set of all raw files, the run time went from 2h40m down to 34m, and produced identical output files.

Updates

  • Specifies numeric_only to prevent the pandas FutureWarning
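For context, that FutureWarning comes from calling an aggregation like sum() or mean() on a groupby that still contains non-numeric columns; newer pandas warns (and later errors) instead of silently dropping them. A minimal illustration with toy data, not the actual pipeline columns:

```python
import pandas as pd

df = pd.DataFrame({
    "state": ["GA", "GA", "CA"],
    "age_group": ["18-29", "18-29", "30-39"],
    "cases": [10, 5, 7],
})

# numeric_only=True makes the intent explicit: aggregate only the numeric
# columns, dropping the non-numeric age_group column without a warning.
totals = df.groupby("state").sum(numeric_only=True)
```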

Abandoned Optimizations

I tried several more vectorization optimizations, replacing the original code's map, applymap, and apply with the Pandas vectorized versions, but saw no improvement from them. For example, replacing:

df[COUNTY_FIPS_COL] = df[COUNTY_FIPS_COL].map(lambda x: x.zfill(5) if len(x) > 0 else x)

with

df[COUNTY_FIPS_COL] = df[COUNTY_FIPS_COL].str.zfill(5)

Although Pandas vectorized methods are usually much faster than using apply, in these cases they were not. That's probably because the .str. vectorized methods are not heavily optimized and are sometimes SLOWER. So going forward we should prioritize refactoring our Python code with these vectorized versions (like #2033) only when it's doing numerical work, and not necessarily for string-based work.
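A quick way to check this kind of claim on your own machine is timeit; the sketch below compares the two zfill approaches on synthetic FIPS-like strings (actual timings depend on pandas version and data):

```python
import timeit

import pandas as pd

# Synthetic county-FIPS-like strings, including empty values.
fips = pd.Series(["4013", "", "601"] * 10_000)


def with_map(s: pd.Series) -> pd.Series:
    # Original approach: per-row lambda that skips empty strings.
    return s.map(lambda x: x.zfill(5) if len(x) > 0 else x)


def with_str_accessor(s: pd.Series) -> pd.Series:
    # Vectorized .str accessor; note it also pads empty strings.
    return s.str.zfill(5)


t_map = timeit.timeit(lambda: with_map(fips), number=3)
t_vec = timeit.timeit(lambda: with_str_accessor(fips), number=3)
print(f"map+lambda: {t_map:.3f}s  .str.zfill: {t_vec:.3f}s")
```

One caveat worth noting: the two are not strictly equivalent. `.str.zfill(5)` pads an empty string to `"00000"`, while the original lambda leaves empty strings untouched, so the drop-in replacement would have changed behavior for missing FIPS values.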

Motivation and Context

It was taking hours to run the local code; it was extremely slow and frustrating, and it locked up my machine for the duration.

Has this been tested? How?

First I ran the original code and generated the HET tables with an old_ prefix, then switched to this branch and ran the updated code to generate the tables without that prefix. I then used a quick bash script to iterate over the tables, compare the old and new versions of each table, and ensure the matching files were identical between the main branch and this branch. I verified the checker itself was working by manually diffing age_county against age_state and confirming that it did show a difference.
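A Python equivalent of that comparison script might look like the following. The old_ prefix convention comes from the description above, but the directory layout and file extension are assumptions:

```python
import filecmp
from pathlib import Path


def compare_outputs(output_dir: str) -> list[str]:
    """Compare each old_-prefixed table against its unprefixed counterpart,
    returning the names of any tables that differ or are missing."""
    out = Path(output_dir)
    mismatches = []
    for old_file in sorted(out.glob("old_*.csv")):
        new_file = out / old_file.name.removeprefix("old_")
        # shallow=False forces a byte-for-byte content comparison.
        if not new_file.exists() or not filecmp.cmp(old_file, new_file, shallow=False):
            mismatches.append(new_file.name)
    return mismatches
```

An empty return list means every regenerated table matched its old counterpart exactly.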

The original unit tests continue to pass with all updates

Screenshots (if appropriate):

Types of changes

  • Refactor / chore

Post-merge TODO

I have inspected frontend changes and/or run affected data pipelines:

  • on DEV
  • on PROD

Any target user persona(s)?

Preview link below in Netlify comment 😎

@netlify
netlify bot commented Aug 30, 2023

Deploy Preview for health-equity-tracker ready!

Built without sensitive environment variables

🔨 Latest commit e8271b8
🔍 Latest deploy log https://app.netlify.com/sites/health-equity-tracker/deploys/64ef602b3cc7700008069de3
😎 Deploy Preview https://deploy-preview-2342--health-equity-tracker.netlify.app/
@benhammondmusic benhammondmusic marked this pull request as ready for review August 30, 2023 15:47
@benhammondmusic benhammondmusic self-assigned this Aug 30, 2023
@benhammondmusic benhammondmusic added Data DX Developer Experience labels Aug 30, 2023
Collaborator

@eriwarr eriwarr left a comment


This is awesome! Great job.

# Create a new combined race/eth column; initialize with UNKNOWN
df[RACE_ETH_COL] = std_col.Race.UNKNOWN.value
# Overwrite specific race if given
df.loc[other_mask, RACE_ETH_COL] = df.loc[other_mask, RACE_COL].map(RACE_NAMES_MAPPING)
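The quoted mask-and-map pattern can be seen end to end in a toy example. The constants, the RACE_NAMES_MAPPING contents, and the final Hispanic-precedence layer are illustrative assumptions standing in for the project's std_col definitions, not the actual pipeline values:

```python
import pandas as pd

# Stand-ins for the project's std_col constants and mapping (assumed values).
RACE_COL = "race"
RACE_ETH_COL = "race_and_ethnicity"
UNKNOWN = "Unknown"
HISPANIC = "Hispanic or Latino"
RACE_NAMES_MAPPING = {"White": "White (NH)", "Black": "Black or African American (NH)"}

df = pd.DataFrame({
    RACE_COL: ["White", "Black", "Missing", "White"],
    "ethnicity": ["Non-Hispanic", "Non-Hispanic", "Missing", "Hispanic"],
})

# Initialize the combined column with UNKNOWN, then overwrite in layers,
# touching only the rows each mask selects (no per-row lambda needed).
df[RACE_ETH_COL] = UNKNOWN
other_mask = df[RACE_COL].isin(RACE_NAMES_MAPPING)
df.loc[other_mask, RACE_ETH_COL] = df.loc[other_mask, RACE_COL].map(RACE_NAMES_MAPPING)
# Assumed final layer: Hispanic ethnicity takes precedence over reported race.
df.loc[df["ethnicity"] == "Hispanic", RACE_ETH_COL] = HISPANIC
```

Each `df.loc[mask, col] = ...` assignment is a single vectorized write, which is the core of the speedup over mapping a lambda across every row.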

Nice!

@benhammondmusic benhammondmusic merged commit e871d45 into SatcherInstitute:main Aug 30, 2023
10 checks passed
@benhammondmusic benhammondmusic deleted the cdc-local-vectorize branch August 30, 2023 19:31