
Backend: Speed up cdc_restricted_local step #2342

Merged

Conversation

benhammondmusic
Collaborator

@benhammondmusic benhammondmusic commented Aug 30, 2023

Description

Speed Optimizations

Significantly speeds up the local step that processes the CDC's restricted, raw, unzipped .csv files into the non-restricted, aggregated .csv files needed for the cdc_restricted ingestion step.

To benchmark, I ran the code against just a single raw csv file (instead of all ~20).

  • The original code took 313s to process the one file

Here are the incremental changes I made, and the improvement from each step:

  • Using a vectorized combine_race_eth() that relies on pandas' built-in optimized functions rather than mapping/applying a lambda against each row, and reading in only the needed columns via the usecols argument of read_csv: 191s
  • Above + Removing unneeded 2nd quote cleaning: 188s
  • Above + Removing unneeded 1st quote cleaning: 180s
  • Above + chunk_size = 1 million. Using a chunk size splits the df into chunks so your machine doesn't have to hold the entire thing in memory; my computer is fast enough to hold it all, so a giant chunk size is faster because it reduces the need to iterate over chunks, while chunking would still kick in if the CDC ever sends a much bigger file: 75s
  • Above + chunk_size = 2 million: 64s
  • Above + chunk_size = 5 million: 57s
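The usecols + chunksize pattern from the steps above can be sketched as follows. The column names, file path, and aggregation are hypothetical stand-ins, not the actual CDC restricted schema or the pipeline's real aggregation logic:

```python
import pandas as pd

# Hypothetical column names; the real CDC restricted schema differs.
USECOLS = ["res_state", "res_county_fips", "race", "ethnicity", "age_group"]


def process_file(path: str, chunk_size: int = 5_000_000) -> pd.DataFrame:
    """Read only the needed columns, in large chunks, aggregating each
    chunk before concatenating, so the full raw file never needs to be
    held in memory at once."""
    aggregated_chunks = []
    for chunk in pd.read_csv(path, usecols=USECOLS, dtype=str,
                             chunksize=chunk_size):
        counts = chunk.groupby(USECOLS, dropna=False).size().reset_index(name="count")
        aggregated_chunks.append(counts)
    combined = pd.concat(aggregated_chunks, ignore_index=True)
    # Re-aggregate, since the same group may have appeared in multiple chunks.
    return combined.groupby(USECOLS, dropna=False, as_index=False)["count"].sum()
```

Because each chunk is collapsed to group counts before the next chunk is read, peak memory is bounded by the chunk size plus the (much smaller) running aggregates, which is why a larger chunk_size trades memory for speed.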

When running against the full set of all raw files, the run time went from 2h40m down to 34m, and produced identical output files.

Updates

  • Specifies numeric_only to prevent the pandas FutureWarning
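For context, that FutureWarning comes from calling an aggregation like sum() or mean() on a groupby that still contains non-numeric columns; newer pandas warns (and later errors) instead of silently dropping them. A minimal illustration with toy data, not the actual pipeline columns:

```python
import pandas as pd

df = pd.DataFrame({
    "state": ["GA", "GA", "CA"],
    "age_group": ["18-29", "18-29", "30-39"],
    "cases": [10, 5, 7],
})

# numeric_only=True makes the intent explicit: aggregate only the numeric
# columns, dropping the non-numeric age_group column without a warning.
totals = df.groupby("state").sum(numeric_only=True)
```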

Abandoned Optimizations

I tried several more vectorization optimizations, replacing the original code's map, applymap, and apply with the Pandas vectorized versions, but saw no improvement from them. For example, replacing:

df[COUNTY_FIPS_COL] = df[COUNTY_FIPS_COL].map(lambda x: x.zfill(5) if len(x) > 0 else x)

with

df[COUNTY_FIPS_COL] = df[COUNTY_FIPS_COL].str.zfill(5)

Although Pandas vectorized methods are usually much faster than using apply, in these cases they were not. That's probably because the .str. vectorized methods are not heavily optimized and are sometimes SLOWER. So going forward we should prioritize refactoring our Python code with these vectorized versions (like #2033) only when it's doing numerical work, and not necessarily for string-based work.
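A quick way to check this kind of claim on your own machine is timeit; the sketch below compares the two zfill approaches on synthetic FIPS-like strings (actual timings depend on pandas version and data):

```python
import timeit

import pandas as pd

# Synthetic county-FIPS-like strings, including empty values.
fips = pd.Series(["4013", "", "601"] * 10_000)


def with_map(s: pd.Series) -> pd.Series:
    # Original approach: per-row lambda that skips empty strings.
    return s.map(lambda x: x.zfill(5) if len(x) > 0 else x)


def with_str_accessor(s: pd.Series) -> pd.Series:
    # Vectorized .str accessor; note it also pads empty strings.
    return s.str.zfill(5)


t_map = timeit.timeit(lambda: with_map(fips), number=3)
t_vec = timeit.timeit(lambda: with_str_accessor(fips), number=3)
print(f"map+lambda: {t_map:.3f}s  .str.zfill: {t_vec:.3f}s")
```

One caveat worth noting: the two are not strictly equivalent. `.str.zfill(5)` pads an empty string to `"00000"`, while the original lambda leaves empty strings untouched, so the drop-in replacement would have changed behavior for missing FIPS values.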

Motivation and Context

It was taking hours to run the local code; it was extremely slow and frustrating, and it locked up my machine for the duration.

Has this been tested? How?

First I ran the original code and generated the HET tables with an old_ prefix, then switched to this branch and ran the updated code to generate the tables without that prefix. I then used a quick bash script to iterate over the tables, compare the old and new versions of each table, and ensure the matching files were identical between the main branch and this branch. I verified the checker itself was working by manually diffing age_county against age_state and confirming that it did show a difference.
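A Python equivalent of that comparison script might look like the following. The old_ prefix convention comes from the description above, but the directory layout and file extension are assumptions:

```python
import filecmp
from pathlib import Path


def compare_outputs(output_dir: str) -> list[str]:
    """Compare each old_-prefixed table against its unprefixed counterpart,
    returning the names of any tables that differ or are missing."""
    out = Path(output_dir)
    mismatches = []
    for old_file in sorted(out.glob("old_*.csv")):
        new_file = out / old_file.name.removeprefix("old_")
        # shallow=False forces a byte-for-byte content comparison.
        if not new_file.exists() or not filecmp.cmp(old_file, new_file, shallow=False):
            mismatches.append(new_file.name)
    return mismatches
```

An empty return list means every regenerated table matched its old counterpart exactly.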

The original unit tests continue to pass with all updates

Screenshots (if appropriate):

Types of changes

  • Refactor / chore

Post-merge TODO

I have inspected frontend changes and/or run affected data pipelines:

  • on DEV
  • on PROD

Any target user persona(s)?

Preview link below in Netlify comment 😎

@netlify
netlify bot commented Aug 30, 2023

Deploy Preview for health-equity-tracker ready!

Built without sensitive environment variables

🔨 Latest commit e8271b8
🔍 Latest deploy log https://app.netlify.com/sites/health-equity-tracker/deploys/64ef602b3cc7700008069de3
😎 Deploy Preview https://deploy-preview-2342--health-equity-tracker.netlify.app/
@benhammondmusic benhammondmusic marked this pull request as ready for review August 30, 2023 15:47
@benhammondmusic benhammondmusic self-assigned this Aug 30, 2023
@benhammondmusic benhammondmusic added Data DX Developer Experience labels Aug 30, 2023
Collaborator

@eriwarr eriwarr left a comment


This is awesome! Great job.

# Create a new combined race/eth column; initialize with UNKNOWN
df[RACE_ETH_COL] = std_col.Race.UNKNOWN.value
# Overwrite specific race if given
df.loc[other_mask, RACE_ETH_COL] = df.loc[other_mask, RACE_COL].map(RACE_NAMES_MAPPING)
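The quoted mask-and-map pattern can be seen end to end in a toy example. The constants, the RACE_NAMES_MAPPING contents, and the final Hispanic-precedence layer are illustrative assumptions standing in for the project's std_col definitions, not the actual pipeline values:

```python
import pandas as pd

# Stand-ins for the project's std_col constants and mapping (assumed values).
RACE_COL = "race"
RACE_ETH_COL = "race_and_ethnicity"
UNKNOWN = "Unknown"
HISPANIC = "Hispanic or Latino"
RACE_NAMES_MAPPING = {"White": "White (NH)", "Black": "Black or African American (NH)"}

df = pd.DataFrame({
    RACE_COL: ["White", "Black", "Missing", "White"],
    "ethnicity": ["Non-Hispanic", "Non-Hispanic", "Missing", "Hispanic"],
})

# Initialize the combined column with UNKNOWN, then overwrite in layers,
# touching only the rows each mask selects (no per-row lambda needed).
df[RACE_ETH_COL] = UNKNOWN
other_mask = df[RACE_COL].isin(RACE_NAMES_MAPPING)
df.loc[other_mask, RACE_ETH_COL] = df.loc[other_mask, RACE_COL].map(RACE_NAMES_MAPPING)
# Assumed final layer: Hispanic ethnicity takes precedence over reported race.
df.loc[df["ethnicity"] == "Hispanic", RACE_ETH_COL] = HISPANIC
```

Each `df.loc[mask, col] = ...` assignment is a single vectorized write, which is the core of the speedup over mapping a lambda across every row.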

Nice!

@benhammondmusic benhammondmusic merged commit e871d45 into SatcherInstitute:main Aug 30, 2023
10 checks passed
@benhammondmusic benhammondmusic deleted the cdc-local-vectorize branch August 30, 2023 19:31