Garbage collection/memory optimization #374

Merged
merged 3 commits into main from garbage-collect-parallelism on Oct 4, 2023

Conversation

elijahbenizzy (Collaborator)

This deletes unused references in parallelism/dynamism so intermediate results can be garbage-collected. That said, the Python garbage collector occasionally has to be told what to do, so the user should be aware.

Fixes #373
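
To make the caveat about the collector concrete, here is a minimal standalone sketch (not code from this PR): refcounting frees most objects immediately, but objects caught in reference cycles wait for the cyclic collector, and gc.collect() forces that pass to run now:

import gc


class CycleHolder:
    """Toy object whose self-reference forms a cycle, defeating refcounting."""

    def __init__(self) -> None:
        self.payload = [0.0] * 10_000_000  # hold a noticeable chunk of memory
        self.me = self                     # reference cycle keeps refcount above zero


holder = CycleHolder()
del holder    # the cycle keeps the payload alive past this del
gc.collect()  # explicitly run the cyclic collector to reclaim it immediately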

Changes

How I tested this

Notes

Checklist

  • PR has an informative and human-readable title (this will be pulled into the release notes)
  • Changes are limited to a single goal (no scope creep)
  • Code passed the pre-commit check & code is left cleaner/nicer than when first encountered.
  • Any change in functionality is tested
  • New functions are documented (with a description, list of inputs, and expected output)
  • Placeholder code is flagged / future TODOs are captured in comments
  • Project documentation has been updated if adding/changing functionality.

@elijahbenizzy force-pushed the garbage-collect-parallelism branch 2 times, most recently from 5b60218 to 5e5f9cb on September 22, 2023 21:28
@elijahbenizzy force-pushed the garbage-collect-parallelism branch 2 times, most recently from 6d1cf8a to 549a41b on October 4, 2023 04:46
@elijahbenizzy changed the title from "Ensures parallelism has the ability to garbage collect" to "Garbage collection/memory optimization" on Oct 4, 2023
@elijahbenizzy (Collaborator, Author)

This works nicely. See this script:

import numpy as np
import pandas as pd

from hamilton import driver
from hamilton.ad_hoc_utils import create_temporary_module
from hamilton.function_modifiers import parameterize, source

NUM_ITERS = 100


def foo_0(memory_size: int = 100_000_000) -> pd.DataFrame:
    """
    Generates a large DataFrame with memory usage close to the specified memory_size.

    Parameters:
    memory_size (int): Desired memory size of the DataFrame in bytes. Default is 100MB.

    Returns:
    pd.DataFrame: Generated DataFrame with approximate memory usage of memory_size.
    """
    # Number of rows in the DataFrame
    num_rows = 10 ** 6

    # Calculate the number of columns required to reach memory_size.
    # Each float64 column of num_rows values takes 8 * num_rows bytes.
    bytes_per_column = 8 * num_rows
    num_cols = memory_size // bytes_per_column

    # Create a DataFrame with random data
    data = {f'col_{i}': np.random.random(num_rows) for i in range(int(num_cols))}
    df = pd.DataFrame(data)

    # Print DataFrame info, including memory usage
    df.info(memory_usage='deep')
    return df


@parameterize(
    **{f"foo_{i}": {"foo_i_minus_one": source(f"foo_{i-1}")} for i in range(1, NUM_ITERS)}
)
def foo_i(foo_i_minus_one: pd.DataFrame) -> pd.DataFrame:
    # Each generated node (foo_1 ... foo_{NUM_ITERS-1}) transforms its upstream frame.
    return foo_i_minus_one * 1.01

if __name__ == '__main__':
    mod = create_temporary_module(foo_i, foo_0)
    dr = driver.Builder().with_modules(mod).build()
    output = dr.execute(
        [f"foo_{NUM_ITERS-1}"],
        inputs=dict(memory_size=1_000_000_000)
    )

Using mprof (mprof run to record, then mprof plot to render), we can easily run it and get the following plot. Without this change, my computer dies:

[image: mprof memory plot of the script above]

@elijahbenizzy (Collaborator, Author)

If you comment out the new code, you'll notice that things act strangely. The caveat here is that mprof doesn't record swap space. So you see a large allocation, then things start to get quite janky (my computer's music skips 😆). Note this run uses 100MB increments instead of 1GB; at 1GB it just gives up.

[image: mprof memory plot with the new code commented out]

Now, with the code added, it caps out at 500MB before GC -- I think this is storing a few pointers to the node plus general Python memory usage. It's keeping a few extra ones around -- hard to tell why, as there is likely more than one reference being kept around, due to the way del works with dicts and to when GC happens:

[image: mprof memory plot with the new code added]
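
For intuition on why a few extras stick around: del on a dict entry removes only the dict's reference, so the object lives on while anything else points at it. A minimal standalone sketch (not Hamilton code):

import sys

computed = {"foo_1": [0.0] * 10_000_000}  # stand-in for the memoization dict
alias = computed["foo_1"]                  # a second reference to the same object
del computed["foo_1"]                      # drops only the dict's reference
print(sys.getrefcount(alias))              # still > 1, so the memory is not reclaimed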

@@ -1 +1 @@
VERSION = (1, 30, 1)
Collaborator

I think we have enough pushed that 1.32.0 is warranted.

This deletes unused references in parallelism/dynamism. That said, the
Python garbage collector occasionally has to be told what to do, so the
user should be aware.

Fixes #373

This is not perfect, but what we do is: after computing a node and
sticking it in the memoization dict, we ensure that, for each of its
dependencies, if we know that dependency is not needed by anything else,
we delete it from computed. This way we save on memory.

There are likely a few edge cases where this won't work (e.g. we're only
executing a portion of the graph), but in many cases this should help,
and it certainly won't harm.
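
A minimal sketch of that bookkeeping, with hypothetical names (computed and dependents_left are illustrative, not Hamilton's actual internals), assuming we precompute how many downstream consumers each node has:

def store_and_collect(
    node_name: str,
    result: object,
    dependencies: list[str],
    computed: dict[str, object],
    dependents_left: dict[str, int],
) -> None:
    """Memoize a node's result, then evict dependencies nothing else will read."""
    computed[node_name] = result
    for dep in dependencies:
        dependents_left[dep] -= 1      # this node has now consumed dep
        if dependents_left[dep] == 0:  # no remaining consumers in the graph
            del computed[dep]          # drop the reference so Python can reclaim it

Here dependents_left would be initialized from the DAG, e.g. dependents_left[n] = number of nodes that depend on n.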
@elijahbenizzy temporarily deployed to github-pages on October 4, 2023 17:27 with GitHub Actions
@elijahbenizzy merged commit a92f6ca into main on Oct 4, 2023
21 checks passed
@elijahbenizzy deleted the garbage-collect-parallelism branch on October 4, 2023 19:58