refactor: avoid append to speed up extraction #90
Conversation
BainanXia
left a comment
There was a problem hiding this comment.
This is nice! In general it is faster to pre-allocate memory than to maintain a dynamically growing structure, e.g. filling a fixed-size list with zeros instead of appending to it. Agreed with @rouille: we are safe in the early stage of extraction, given that the mat files won't be deleted if it breaks down.
if i == 0:
    interval_length, n_columns = temps[v].shape
    total_length = end_index * interval_length
    outputs[v] = pd.DataFrame(np.zeros((total_length, n_columns)))
There was a problem hiding this comment.
Not sure whether zeros are good default values for a placeholder, given that sometimes zeros are actual results, very unlikely though.
There was a problem hiding this comment.
How do you allocate memory in Python?
There was a problem hiding this comment.
In general, you can't. Starting with Python 3.6 you can declare types for variables and functions, though not a lot of people use them (and if you need that, why not C++?). Although memory is dynamically allocated in Python, making changes in space that is already in use is still faster than asking for more.
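A rough illustration of that last point (function names are my own, and absolute timings will vary by machine): writing into a pre-allocated list avoids the periodic reallocations that growing a list by appending can trigger.

```python
import timeit

N = 100_000

def with_append():
    # Grow the list dynamically; the interpreter must occasionally
    # reallocate the underlying array as it runs out of capacity.
    out = []
    for i in range(N):
        out.append(i * 2)
    return out

def with_preallocation():
    # Reserve the full space up front, then write into existing slots.
    out = [0] * N
    for i in range(N):
        out[i] = i * 2
    return out

# Both strategies produce the same result; only the allocation pattern differs.
assert with_append() == with_preallocation()
print("append:      ", timeit.timeit(with_append, number=20))
print("preallocated:", timeit.timeit(with_preallocation, number=20))
```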
There was a problem hiding this comment.
There won't be many zeros for PF and LMP, but there will probably be plenty for CONGU, CONGL, and PG.
Instead of creating an array of zeros, we could create an empty array (uninitialized values) and then fill it with NaNs, in case we want to make it super clear when there is some sort of weird bug: `arr = np.empty((total_length, n_columns)); arr.fill(np.nan)`, or simply `np.full((total_length, n_columns), np.nan)`. But this code seems foolproof (as dangerous as it is to ever say that around software).
There was a problem hiding this comment.
Since all the cells will be replaced through outputs[v].iloc[start_hour:end_hour, :] = temps[v], I am fine with the preset zeros.
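A minimal sketch of the two NaN-placeholder options discussed above, with toy dimensions. One gotcha worth noting: `ndarray.fill` modifies the array in place and returns `None`, so `np.empty(...)` and `.fill(...)` must be two separate statements rather than a chained one-liner.

```python
import numpy as np
import pandas as pd

total_length, n_columns = 4, 3  # toy sizes for illustration

# Option 1: np.full allocates and fills with NaN in one expression.
placeholder = pd.DataFrame(np.full((total_length, n_columns), np.nan))

# Option 2: np.empty + fill must be two statements, because
# ndarray.fill() returns None rather than the filled array.
arr = np.empty((total_length, n_columns))
arr.fill(np.nan)
placeholder2 = pd.DataFrame(arr)

assert placeholder.isna().all().all()
assert placeholder2.isna().all().all()

# The chained form evaluates to None -- a common pitfall:
assert np.empty((2, 2)).fill(np.nan) is None
```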
This refactor has been tested successfully with Scenario 404.
This has been rebased to be independent of the un-merged changes to
Purpose
Speed up the extraction of simulation results, which has grown to ~1 hour now that we are running 24-hour intervals. With 144-hour intervals, extraction had been very quick: 15 minutes at most.
What is the code doing?
The code pre-allocates a DataFrame of the correct size, then writes the results from each interval into the appropriate location. This is more efficient than continually appending (the old way), which must either resize the DataFrame or create a new one of the correct size on every loop iteration.
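A sketch of the two strategies with dummy data (variable names are illustrative, not the actual extraction code; the append-style loop uses `pd.concat`, since `DataFrame.append` has been removed from recent pandas):

```python
import numpy as np
import pandas as pd

n_intervals, interval_length, n_columns = 10, 24, 5
rng = np.random.default_rng(0)
# One result chunk per simulation interval.
temps = [rng.random((interval_length, n_columns)) for _ in range(n_intervals)]

# Old way: grow the frame each iteration, copying all prior rows every time.
old = pd.DataFrame()
for chunk in temps:
    old = pd.concat([old, pd.DataFrame(chunk)], ignore_index=True)

# New way: allocate the full frame once, then assign into row slices.
new = pd.DataFrame(np.zeros((n_intervals * interval_length, n_columns)))
for i, chunk in enumerate(temps):
    start, end = i * interval_length, (i + 1) * interval_length
    new.iloc[start:end, :] = chunk

# Both approaches yield identical results.
assert old.equals(new)
```

The append-style loop is quadratic in the total number of rows (each concat copies everything accumulated so far), while the pre-allocated version copies each chunk exactly once.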
Some testing results
I have not tested this on files on the server. However, I have a modified version of `postreise/extract/extract_data.py` which I've been using for extracting results from Julia-created matfiles, and I tested both approaches locally.

old:

100%|████████████████████████████████████████| 366/366 [05:59<00:00, 2.05s/it]

new:

100%|████████████████████████████████████████| 366/366 [00:17<00:00, 21.01it/s]

It appears to be a ~20x speedup, so for Eastern scenarios, which take ~20 hours to run and ~1 hour to extract, this offers a ~5% improvement in overall throughput.
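The back-of-the-envelope arithmetic behind that ~5% figure, using the numbers above:

```python
run_hours = 20.0          # approximate Eastern scenario run time
extract_old_hours = 1.0   # current extraction time
speedup = 20.0            # observed extraction speedup
extract_new_hours = extract_old_hours / speedup

old_total = run_hours + extract_old_hours   # 21.0 h end to end
new_total = run_hours + extract_new_hours   # 20.05 h end to end
improvement = 1 - new_total / old_total
print(f"{improvement:.1%}")  # 4.5%, i.e. roughly 5%
```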
Time estimate
30 minutes. I'm not sure if we can test this locally, so we will want to scrutinize the code carefully before deploying/testing on the server.