[mlir][python] Tracking of PyOperations references in `liveOperations` map breaks down in reasonable scenarios #92344

christopherbate · 2024-05-16T04:09:36Z

Background:

The PyMlirContext has a mechanism for creating unique references to MlirOperation objects. When a PyOperation object is created, it is interned into a map (together with an owning py::object) into a map called liveOperations held by the PyMlirContext. This allows for various checks (assertions in the C++ code or tests written in Python) to occur to ensure that objects have the appropriate lifetimes. Since MLIR IR has a nested structure, it's possible to create a reference to an Operation in Python and then do something funny (e.g. delete the parent), and then try to use the child op somehow. In C++, printing a freed Operation will probably crash, but in Python printing a PyOperation whose underlying Operation* has been freed should recover gracefully. For reasons like this (I think), the PyOperation retains a valid member variable which should be set to false in such situations, and most operations in PyOperation are guarded by a check to ensure valid == true.

This seems to work well, but there are several holes in how lifetimes of PyOperations are tracked (and therefore how they are inserted or cleaned up from the liveOperations map). These are mostly documented in IRCore.cpp as TODOs.

For example, the following code appears to crash:

def testModuleCleanup():
    ctx = Context()
    module = Module.parse(r"""module @my_module {
        func.func @my_func() {
            return
        }
    }""", ctx)
    func = module.body.operations[0]
    assert ctx._get_live_module_count() == 1
    assert ctx._get_live_operation_count() == 1
    module = None
    gc.collect()
    assert ctx._get_live_module_count() == 0
    assert ctx._get_live_operation_count() == 1
    print(func)

The reason is that even though module (PyModule) was dereferenced and cleaned up, it didn't invalidate its children. Therefore, the PyOperation pointed to by func still has its valid bit set (and is still in the liveOperations map).

Furthermore, even if you don't do anything with func, it remaining in liveOperations is problematic. If you re-use the same Context, as in the below code, the MLIR C++ API and the underlying allocator may return a pointer that is still in liveOperations (since operations may be freed by MLIR but still exist in liveOperations, that is a valid state):

def testModuleCleanup():
    ctx = Context()
    module = Module.parse(r"""module @my_module {
        func.func @my_func() {
            return
        }
    }""", ctx)
    func = module.body.operations[0]              # suppose the underlying Operation* == 0x5e89746a58b0
    assert ctx._get_live_module_count() == 1
    assert ctx._get_live_operation_count() == 1
    module = None
    gc.collect()
    assert ctx._get_live_module_count() == 0
   
   # live operations is still 1, this is expected. In `liveOperations` map, 
   # PyOperation are tracked by the underlying `Operation*`, which means the key 0x5e89746a58b0
   # is still in the map
    assert ctx._get_live_operation_count() == 1
   
   #  This can result in assertion unpredictably, because MLIR may return new Operation* == 0x5e89746a58b0, 
   # and the key exists in `liveOperation` map.
   module = Module.parse(r"""module @my_module { }""")

An assertion will be encountered because of this line:
https://github.com/llvm/llvm-project/blame/main/mlir/lib/Bindings/Python/IRCore.cpp#L1164

Here you can see that even if we amend the assertion to include a check if it is valid, proceeding to update the liveOperations map may invalidate a reference still held by the Python user. IMO this is overly cautious because there are too many ways in which the liveOperations tracking mechanism can loose track of which operations are valid or not.

In addition, compiling MLIR with CMAKE_BUILD_TYPE=RelWithDebInfo vs CMAKE_BUILD_TYPE=Debug appears to change the likelihood of encountering the assertion in the second example.

The text was updated successfully, but these errors were encountered:

llvmbot · 2024-05-16T05:07:28Z

@llvm/issue-subscribers-mlir-python

Author: Christopher Bate (christopherbate)

Background:

The PyMlirContext has a mechanism for creating unique references to MlirOperation objects. When a PyOperation object is created, it is interned into a map (together with an owning py::object) into a map called liveOperations held by the PyMlirContext. This allows for various checks (assertions in the C++ code or tests written in Python) to occur to ensure that objects have the appropriate lifetimes. Since MLIR IR has a nested structure, it's possible to create a reference to an Operation in Python and then do something funny (e.g. delete the parent), and then try to use the child op somehow. In C++, printing a freed Operation will probably crash, but in Python printing a PyOperation whose underlying Operation* has been freed should recover gracefully. For reasons like this (I think), the PyOperation retains a valid member variable which should be set to false in such situations, and most operations in PyOperation are guarded by a check to ensure valid == true.

This seems to work well, but there are several holes in how lifetimes of PyOperations are tracked (and therefore how they are inserted or cleaned up from the liveOperations map). These are mostly documented in IRCore.cpp as TODOs.

For example, the following code appears to crash:

def testModuleCleanup():
    ctx = Context()
    module = Module.parse(r"""module @<!-- -->my_module {
        func.func @<!-- -->my_func() {
            return
        }
    }""", ctx)
    func = module.body.operations[0]
    assert ctx._get_live_module_count() == 1
    assert ctx._get_live_operation_count() == 1
    module = None
    gc.collect()
    assert ctx._get_live_module_count() == 0
    assert ctx._get_live_operation_count() == 1
    print(func)

The reason is that even though module (PyModule) was dereferenced and cleaned up, it didn't invalidate its children. Therefore, the PyOperation pointed to by func still has its valid bit set (and is still in the liveOperations map).

Furthermore, even if you don't do anything with func, it remaining in liveOperations is problematic. If you re-use the same Context, as in the below code, the MLIR C++ API and the underlying allocator may return a pointer that is still in liveOperations (since operations may be freed by MLIR but still exist in liveOperations, that is a valid state):

def testModuleCleanup():
    ctx = Context()
    module = Module.parse(r"""module @<!-- -->my_module {
        func.func @<!-- -->my_func() {
            return
        }
    }""", ctx)
    func = module.body.operations[0]              # suppose the underlying Operation* == 0x5e89746a58b0
    assert ctx._get_live_module_count() == 1
    assert ctx._get_live_operation_count() == 1
    module = None
    gc.collect()
    assert ctx._get_live_module_count() == 0
   
   # live operations is still 1, this is expected. In `liveOperations` map, 
   # PyOperation are tracked by the underlying `Operation*`, which means the key 0x5e89746a58b0
   # is still in the map
    assert ctx._get_live_operation_count() == 1
   
   #  This can result in assertion unpredictably, because MLIR may return new Operation* == 0x5e89746a58b0, 
   # and the key exists in `liveOperation` map.
   module = Module.parse(r"""module @<!-- -->my_module { }""")

An assertion will be encountered because of this line:
https://github.com/llvm/llvm-project/blame/main/mlir/lib/Bindings/Python/IRCore.cpp#L1164

Here you can see that even if we amend the assertion to include a check if it is valid, proceeding to update the liveOperations map may invalidate a reference still held by the Python user. IMO this is overly cautious because there are too many ways in which the liveOperations tracking mechanism can loose track of which operations are valid or not.

In addition, compiling MLIR with CMAKE_BUILD_TYPE=RelWithDebInfo vs CMAKE_BUILD_TYPE=Debug appears to change the likelihood of encountering the assertion in the second example.

makslevental · 2024-05-16T06:03:31Z

This is all true but I just have to say there's basically no way out of this without a major overhaul of the ownership reconciliation system. And maybe not even then because you're essentially dealing with the unsolvable problem of managed memory on the C++ side vs unmanaged memory on the Python side because it's actually C. So we can patch up these issues using more heuristics but it's inevitable that the heuristics become unwieldy (and maybe they already have).

makslevental · 2024-05-16T06:04:57Z

I wish there existed even a tedious but comprehensive solution - I would sit down and write it - but I don't think there is short of rewriting everything in Rust (lol).

stellaraccident · 2024-05-16T06:16:18Z

The dangling pointers in liveOperations are the biggest problem imo. I think we need to overhaul that, possibly with a different API to scope access... Maybe a context manager that holds the map and guarantees PyOperation uniquing within its schedule.

Alternatively, we could back off of all of this liveness to accounting and just embrace that PyOperations are very ephemeral things vs trying to keep the worlds in sync.

makslevental · 2024-05-16T06:55:21Z

PyOperations are very ephemeral things

It's late so I'm not thinking straight - what does this world look like? Re-creating/materializing a new PyOperation each time someone digs around in a Module? If so then I guess there's a perf hit for anyone doing that a lot? If that's the only thing (and we recover sanity), I'm for it.

stellaraccident · 2024-05-16T16:26:43Z

Yeah, same as PyAttribute and PyType in that world. I'd be shocked if this caching is paying for itself by reducing an object allocation... You can't go an inch in Python without such an allocation anyway.

We could implement some extra special methods to make the is check fudge over the difference.

I'd want to test such a change carefully with some downstreams vs just doing a rip and run. I'm trying to remember if there was a more nuanced rationale for the current scheme and not remembering one. I think it was just a premature optimization gone wrong.

ftynse · 2024-05-24T20:01:49Z

I happened to stumble upon another instance of this problem before I reached this thread in my email, so I went ahead and fixed my problem here: #93339. This will not be sufficient to fix this problem for at least two reasons:

Python bidings don't know that modules are operations #93337 (destroying the module will not trigger the cascading invalidation I added);
there's no checkValid before print.

Both seem fixable, though the former may require a bit of re-engineering of how modules are handled. So maybe there is a chance to improve the existing solution, potentially with explicit scoping as suggested above. FWIW, my initial "fix" to this was to call _clear_live_operations_inside.

stellaraccident · 2024-05-24T20:05:26Z

Yeah, we really need to so this fix. Sorry, I've been out of commission with illness and not staying in top of things. I want to find time next week to put some pressure on this patch.

github-actions bot added the mlir label May 16, 2024

EugeneZelenko added mlir:python MLIR Python bindings and removed mlir labels May 16, 2024

makslevental mentioned this issue May 19, 2024

[mlir][python][wip] remove liveOperations #92631

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[mlir][python] Tracking of PyOperations references in `liveOperations` map breaks down in reasonable scenarios #92344

[mlir][python] Tracking of PyOperations references in `liveOperations` map breaks down in reasonable scenarios #92344

christopherbate commented May 16, 2024 •

edited

llvmbot commented May 16, 2024

makslevental commented May 16, 2024

makslevental commented May 16, 2024 •

edited

stellaraccident commented May 16, 2024

makslevental commented May 16, 2024

stellaraccident commented May 16, 2024

ftynse commented May 24, 2024

stellaraccident commented May 24, 2024

[mlir][python] Tracking of PyOperations references in liveOperations map breaks down in reasonable scenarios #92344

[mlir][python] Tracking of PyOperations references in liveOperations map breaks down in reasonable scenarios #92344

Comments

christopherbate commented May 16, 2024 • edited

llvmbot commented May 16, 2024

makslevental commented May 16, 2024

makslevental commented May 16, 2024 • edited

stellaraccident commented May 16, 2024

makslevental commented May 16, 2024

stellaraccident commented May 16, 2024

ftynse commented May 24, 2024

stellaraccident commented May 24, 2024

[mlir][python] Tracking of PyOperations references in `liveOperations` map breaks down in reasonable scenarios #92344

[mlir][python] Tracking of PyOperations references in `liveOperations` map breaks down in reasonable scenarios #92344

christopherbate commented May 16, 2024 •

edited

makslevental commented May 16, 2024 •

edited