Garbage collection/memory optimization #374

Merged
merged 3 commits into main from garbage-collect-parallelism on Oct 4, 2023

Conversation

elijahbenizzy (Collaborator)

This deletes unused references in parallelism/dynamism so intermediate results can be garbage-collected. That said, the Python garbage collector occasionally has to be told what to do, so the user should be aware.

Fixes #373
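
To make the caveat about the collector concrete, here is a minimal standalone sketch (not code from this PR): refcounting frees most objects immediately, but objects caught in reference cycles wait for the cyclic collector, and gc.collect() forces that pass to run now:

import gc


class CycleHolder:
    """Toy object whose self-reference forms a cycle, defeating refcounting."""

    def __init__(self) -> None:
        self.payload = [0.0] * 10_000_000  # hold a noticeable chunk of memory
        self.me = self                     # reference cycle keeps refcount above zero


holder = CycleHolder()
del holder    # the cycle keeps the payload alive past this del
gc.collect()  # explicitly run the cyclic collector to reclaim it immediately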

Changes

How I tested this

Notes

Checklist

  • PR has an informative and human-readable title (this will be pulled into the release notes)
  • Changes are limited to a single goal (no scope creep)
  • Code passed the pre-commit check & code is left cleaner/nicer than when first encountered.
  • Any change in functionality is tested
  • New functions are documented (with a description, list of inputs, and expected output)
  • Placeholder code is flagged / future TODOs are captured in comments
  • Project documentation has been updated if adding/changing functionality.

@elijahbenizzy force-pushed the garbage-collect-parallelism branch 2 times, most recently from 5b60218 to 5e5f9cb on September 22, 2023 21:28
@elijahbenizzy force-pushed the garbage-collect-parallelism branch 2 times, most recently from 6d1cf8a to 549a41b on October 4, 2023 04:46
@elijahbenizzy changed the title from "Ensures parallelism has the ability to garbage collect" to "Garbage collection/memory optimization" on Oct 4, 2023
@elijahbenizzy (Collaborator, Author)

This works nicely. See this script:

import numpy as np
import pandas as pd

from hamilton import driver
from hamilton.ad_hoc_utils import create_temporary_module
from hamilton.function_modifiers import parameterize, source

NUM_ITERS = 100


def foo_0(memory_size: int = 100_000_000) -> pd.DataFrame:
    """
    Generates a large DataFrame with memory usage close to the specified memory_size.

    Parameters:
    memory_size (int): Desired memory size of the DataFrame in bytes. Default is 100MB.

    Returns:
    pd.DataFrame: Generated DataFrame with approximate memory usage of memory_size.
    """
    # Number of rows in the DataFrame
    num_rows = 10 ** 6

    # Calculate the number of columns required to reach memory_size.
    # Each float64 column of num_rows values takes 8 * num_rows bytes.
    bytes_per_column = 8 * num_rows
    num_cols = memory_size // bytes_per_column

    # Create a DataFrame with random data
    data = {f'col_{i}': np.random.random(num_rows) for i in range(int(num_cols))}
    df = pd.DataFrame(data)

    # Print DataFrame info, including memory usage
    df.info(memory_usage='deep')
    return df


@parameterize(
    **{f"foo_{i}": {"foo_i_minus_one": source(f"foo_{i-1}")} for i in range(1, NUM_ITERS)}
)
def foo_i(foo_i_minus_one: pd.DataFrame) -> pd.DataFrame:
    # Each generated node (foo_1 ... foo_{NUM_ITERS-1}) transforms its upstream frame.
    return foo_i_minus_one * 1.01

if __name__ == '__main__':
    mod = create_temporary_module(foo_i, foo_0)
    dr = driver.Builder().with_modules(mod).build()
    output = dr.execute(
        [f"foo_{NUM_ITERS-1}"],
        inputs=dict(memory_size=1_000_000_000)
    )

Using mprof (mprof run to record, then mprof plot to render), we can easily run it and get the following plot. Without this change, my computer dies:

[image: mprof memory plot of the script above]

@elijahbenizzy (Collaborator, Author)

If you comment out the new code, you'll notice that things act strangely. The caveat here is that mprof doesn't record swap space. So you see a large allocation, then things start to get quite janky (my computer's music skips 😆). Note this run uses 100MB increments instead of 1GB; at 1GB it just gives up.

[image: mprof memory plot with the new code commented out]

Now, with the code added, it caps out at 500MB before GC -- I think this is storing a few pointers to the node plus general Python memory usage. It's keeping a few extra ones around -- hard to tell why, as there is likely more than one reference being kept around, due to the way del works with dicts and to when GC happens:

[image: mprof memory plot with the new code added]
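
For intuition on why a few extras stick around: del on a dict entry removes only the dict's reference, so the object lives on while anything else points at it. A minimal standalone sketch (not Hamilton code):

import sys

computed = {"foo_1": [0.0] * 10_000_000}  # stand-in for the memoization dict
alias = computed["foo_1"]                  # a second reference to the same object
del computed["foo_1"]                      # drops only the dict's reference
print(sys.getrefcount(alias))              # still > 1, so the memory is not reclaimed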

@@ -1 +1 @@
VERSION = (1, 30, 1)
Collaborator

I think we have enough pushed that 1.32.0 is warranted.

This deletes unused references in parallelism/dynamism. That said, the
Python garbage collector occasionally has to be told what to do, so the
user should be aware.

Fixes #373

This is not perfect, but what we do is: after computing a node and
sticking it in the memoization dict, we ensure that, for each of its
dependencies, if we know that dependency is not needed by anything else,
we delete it from computed. This way we save on memory.

There are likely a few edge cases where this won't work (e.g. we're only
executing a portion of the graph), but in many cases this should help,
and it certainly won't harm.
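
A minimal sketch of that bookkeeping, with hypothetical names (computed and dependents_left are illustrative, not Hamilton's actual internals), assuming we precompute how many downstream consumers each node has:

def store_and_collect(
    node_name: str,
    result: object,
    dependencies: list[str],
    computed: dict[str, object],
    dependents_left: dict[str, int],
) -> None:
    """Memoize a node's result, then evict dependencies nothing else will read."""
    computed[node_name] = result
    for dep in dependencies:
        dependents_left[dep] -= 1      # this node has now consumed dep
        if dependents_left[dep] == 0:  # no remaining consumers in the graph
            del computed[dep]          # drop the reference so Python can reclaim it

Here dependents_left would be initialized from the DAG, e.g. dependents_left[n] = number of nodes that depend on n.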
@elijahbenizzy temporarily deployed to github-pages on October 4, 2023 17:27 with GitHub Actions
@elijahbenizzy merged commit a92f6ca into main on Oct 4, 2023
21 checks passed
@elijahbenizzy deleted the garbage-collect-parallelism branch on October 4, 2023 19:58