Garbage collection/memory optimization #374
Merged
Conversation
elijahbenizzy force-pushed the garbage-collect-parallelism branch 2 times, most recently from 5b60218 to 5e5f9cb on September 22, 2023 21:28
skrawcz reviewed Sep 23, 2023
elijahbenizzy force-pushed the garbage-collect-parallelism branch from 5e5f9cb to 7a7e5e3 on September 23, 2023 19:01
elijahbenizzy force-pushed the garbage-collect-parallelism branch 2 times, most recently from 6d1cf8a to 549a41b on October 4, 2023 04:46
elijahbenizzy changed the title from "Ensures parallelism has the ability to garbage collect" to "Garbage collection/memory optimization" on Oct 4, 2023
This works nicely. See this script:

import numpy as np
import pandas as pd

from hamilton import driver
from hamilton.ad_hoc_utils import create_temporary_module
from hamilton.function_modifiers import parameterize, source

NUM_ITERS = 100


def foo_0(memory_size: int = 100_000_000) -> pd.DataFrame:
    """Generates a large DataFrame with memory usage close to the specified memory_size.

    Parameters:
        memory_size (int): Desired memory size of the DataFrame in bytes. Default is 100MB.

    Returns:
        pd.DataFrame: Generated DataFrame with approximate memory usage of memory_size.
    """
    # Number of rows in the DataFrame
    num_rows = 10 ** 6
    # Calculate the number of columns required to reach approximately memory_size bytes.
    # Each float64 column takes 8 bytes per row.
    bytes_per_col = 8 * num_rows
    num_cols = memory_size // bytes_per_col
    # Create a DataFrame with random data
    data = {f'col_{i}': np.random.random(num_rows) for i in range(int(num_cols))}
    df = pd.DataFrame(data)
    # Print DataFrame info, including memory usage
    df.info(memory_usage='deep')
    return df


@parameterize(
    **{f"foo_{i}": {"foo_i_minus_one": source(f"foo_{i-1}")} for i in range(1, NUM_ITERS)}
)
def foo_i(foo_i_minus_one: pd.DataFrame) -> pd.DataFrame:
    return foo_i_minus_one * 1.01


if __name__ == '__main__':
    mod = create_temporary_module(foo_i, foo_0)
    dr = driver.Builder().with_modules(mod).build()
    output = dr.execute(
        [f"foo_{NUM_ITERS-1}"],
        inputs=dict(memory_size=1_000_000_000),
    )

Using mprof, we can easily run it and get the following plot; without this change, my computer dies.

[memory usage plot]
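For reference, mprof is memory_profiler's CLI, and the usual workflow is mprof run script.py followed by mprof plot. If you'd rather sample memory programmatically, here is a minimal sketch using memory_profiler's memory_usage helper; it assumes dr and NUM_ITERS from the script above are in scope:

from memory_profiler import memory_usage

# Sample resident memory (in MiB) every 0.5s while the Hamilton driver executes.
samples = memory_usage(
    (dr.execute, ([f'foo_{NUM_ITERS - 1}'],), {'inputs': {'memory_size': 1_000_000_000}}),
    interval=0.5,
)
print(f'peak memory: {max(samples):.1f} MiB')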
elijahbenizzy force-pushed the garbage-collect-parallelism branch from 549a41b to b4d0bab on October 4, 2023 17:22
skrawcz reviewed Oct 4, 2023
hamilton/version.py (Outdated)
@@ -1 +1 @@
VERSION = (1, 30, 1)
I think we have enough pushed that 1.32.0 is warranted.
skrawcz approved these changes Oct 4, 2023
This deletes unused references in parallelism/dynamism. That said, the Python garbage collector occasionally has to be told what to do, so the user should be aware. Fixes #373
This is not perfect, but what we do is: after computing a node and sticking it in the memoization dict, we ensure that, for each of its dependencies, if we know that dependency is not needed by anything else, we delete it from computed. This way we save on memory. There are likely a few edge cases where this won't work (e.g. we're only executing a portion of the graph), but in many cases this should help, and it certainly won't hurt.
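A minimal sketch of that bookkeeping (hypothetical names, not the actual internals): once a node lands in the memoization dict, walk its dependencies and drop any whose dependents have all already been computed.

from typing import Any

def release_unneeded_deps(
    node: str,
    computed: dict[str, Any],             # memoized results, keyed by node name
    dependencies: dict[str, list[str]],   # node -> nodes it reads from
    dependents: dict[str, list[str]],     # node -> nodes that read from it
    requested_outputs: set[str],          # never delete what the caller asked for
) -> None:
    """After `node` is computed, free dependencies that no remaining node needs."""
    for dep in dependencies.get(node, []):
        still_needed = any(d not in computed for d in dependents.get(dep, []))
        if dep not in requested_outputs and not still_needed:
            computed.pop(dep, None)  # drop the reference so Python can reclaim the memory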
elijahbenizzy force-pushed the garbage-collect-parallelism branch from b4d0bab to 0b81557 on October 4, 2023 17:27
elijahbenizzy temporarily deployed to github-pages with GitHub Actions on October 4, 2023 17:27
skrawcz approved these changes Oct 4, 2023