refactor: avoid append to speed up extraction#90

Merged
danielolsen merged 1 commit into develop from faster_extract on Mar 27, 2020

Conversation

Contributor

@danielolsen danielolsen commented Mar 27, 2020

Purpose

Speed up the extraction of simulation results, which has grown to ~1 hour now that we are running 24-hour intervals. For 144-hour intervals, extracting results had been very quick: 15 minutes at most.

What is the code doing?

The code pre-allocates a DataFrame of the correct size, then writes the results from each interval into the appropriate location. This is more efficient than the old approach of continually appending, which must resize the DataFrame or create a new one of the correct size on every loop iteration.
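The two approaches can be sketched as follows. The shapes and data here are hypothetical, not the actual simulation results, and `pd.concat` stands in for the old append-style growth (`DataFrame.append` has since been removed from pandas):

```python
import numpy as np
import pandas as pd

# Hypothetical dimensions: 100 intervals of 24 hours, 5 result columns.
n_intervals, interval_length, n_columns = 100, 24, 5
rng = np.random.default_rng(0)
temps = [pd.DataFrame(rng.random((interval_length, n_columns)))
         for _ in range(n_intervals)]

# Old way: grow the DataFrame each iteration, which reallocates every loop.
old = pd.DataFrame()
for t in temps:
    old = pd.concat([old, t], ignore_index=True)

# New way: pre-allocate the full-size frame once, then write each
# interval's results into the matching row slice.
new = pd.DataFrame(np.zeros((n_intervals * interval_length, n_columns)))
for i, t in enumerate(temps):
    new.iloc[i * interval_length:(i + 1) * interval_length, :] = t.to_numpy()

# Both approaches produce the same result.
assert np.allclose(old.to_numpy(), new.to_numpy())
```

The pre-allocated version does a single large allocation up front and only copies each interval once, instead of re-copying all previously accumulated rows on every iteration.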

Some testing results

I have not tested this on files on the server. However, I have a modified version of postreise/extract/extract_data.py which I've been using for extracting results from Julia-created matfiles, and I tested both approaches locally:

old: 100%|████████████████████████████████████████| 366/366 [05:59<00:00, 2.05s/it]
new: 100%|████████████████████████████████████████| 366/366 [00:17<00:00, 21.01it/s]

It appears to be a ~20x speedup. For Eastern scenarios, which take ~20 hours to run and ~1 hour to extract, this offers a ~5% improvement in overall throughput.

Time estimate

30 minutes. I'm not sure if we can test this locally, so we will want to scrutinize the code carefully before deploying/testing on the server.

Collaborator

@rouille rouille left a comment


It looks good. Allocating the memory at the beginning instead of repeatedly resizing the array is a good idea.

A failure during data extraction is not a big deal, since the output MAT-files are only deleted at the very end.

Collaborator

@BainanXia BainanXia left a comment


This is nice! In general, pre-allocating memory is faster than maintaining a dynamically sized structure, e.g., filling in zeros in a fixed-size list instead of appending to it. I agree with @rouille: we are safe if the early stage of extraction breaks down, given that the MAT-files won't be deleted.

if i == 0:
    interval_length, n_columns = temps[v].shape
    total_length = end_index * interval_length
    outputs[v] = pd.DataFrame(np.zeros((total_length, n_columns)))
Collaborator


Not sure whether zeros are good default values for the placeholder, given that zeros are sometimes actual results, however unlikely.

Collaborator


How do you allocate memory in Python?

Collaborator


In general, you can't. Starting with Python 3.6 you can annotate the types of variables and functions, though not many people use that (and if you need that level of control, why not C++?); note that annotations are hints only and don't change how memory is allocated. Although memory is dynamically allocated in Python, writing into space that is already allocated is still faster than repeatedly asking for more.

Contributor Author


There won't be many zeros for PF and LMP, but there will probably be plenty for CONGU, CONGL, and PG.

Instead of creating an array of zeros, we could create an empty array (uninitialized values) and then fill it with NaNs, in case we want to make it super clear when there is some sort of weird bug: np.empty([total_length, n_columns]) followed by .fill(np.nan) (which works in place and returns None), or np.full([total_length, n_columns], np.nan). But this code seems foolproof (as dangerous as it is to ever say that around software).
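A minimal sketch of the NaN-placeholder alternatives discussed here (sizes are hypothetical). One subtlety: `ndarray.fill` works in place and returns None, so chaining it directly onto `np.empty(...)` discards the array:

```python
import numpy as np

total_length, n_columns = 6, 3  # hypothetical sizes

# Gotcha: .fill() modifies the array in place and returns None,
# so this chained expression evaluates to None, not an array.
chained = np.empty((total_length, n_columns)).fill(np.nan)
assert chained is None

# Two-step version: allocate uninitialized memory, then fill in place.
placeholder = np.empty((total_length, n_columns))
placeholder.fill(np.nan)
assert np.isnan(placeholder).all()

# np.full does the same thing in a single expression.
placeholder = np.full((total_length, n_columns), np.nan)
assert np.isnan(placeholder).all()
```

With NaN placeholders, any slice that is never overwritten stands out immediately, whereas leftover zeros could be mistaken for real results.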

Collaborator


Since all the cells will be replaced through outputs[v].iloc[start_hour:end_hour, :] = temps[v], I am fine with the preset zeros.
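A small illustration of that slice-assignment pattern, with hypothetical sizes; `start_hour`/`end_hour` here delimit the second of two intervals:

```python
import numpy as np
import pandas as pd

# Hypothetical: two intervals of 3 hours each, 2 result columns.
interval_length, n_columns = 3, 2
out = pd.DataFrame(np.zeros((2 * interval_length, n_columns)))

# An interval's results, all ones so the overwrite is visible.
temp = pd.DataFrame(np.ones((interval_length, n_columns)))
start_hour, end_hour = interval_length, 2 * interval_length

# Write the interval into its row slice (to_numpy() avoids
# pandas index alignment against the target rows).
out.iloc[start_hour:end_hour, :] = temp.to_numpy()

assert (out.iloc[:start_hour] == 0).all().all()      # preset zeros untouched
assert (out.iloc[start_hour:end_hour] == 1).all().all()  # slice replaced
```

Once every interval has been written, no preset value survives, which is why the choice of zeros versus NaNs only matters if a slice is accidentally skipped.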

@danielolsen
Contributor Author

This refactor has been tested successfully with Scenario 404.

@danielolsen
Contributor Author

This has been rebased to be independent of the un-merged changes to calculate_averaged_cong.

@danielolsen danielolsen merged commit 804c5ea into develop Mar 27, 2020
@danielolsen danielolsen deleted the faster_extract branch March 27, 2020 17:37
@ahurli ahurli mentioned this pull request Mar 16, 2021