Handle pickling for generic pydantic models, fixes #210 #211

Open
NeejWeej wants to merge 2 commits into Point72:main from NeejWeej:nk/generic_cloudpickle

@NeejWeej NeejWeej commented May 15, 2026

Pydantic Generic Pickle and Ray Notes

TLDR

Concrete Pydantic generic BaseModel specializations can be fragile across
fresh-process pickle/cloudpickle boundaries. ccflow works around this for
concrete generic ccflow.BaseModel instances by pickling them as stable
origin + args + state data and recreating the specialized class during load.
Fixes #210.

Summary

This issue is a mismatch between three things:

  1. Python pickle prefers to reconstruct classes by importing a global module path.
  2. Pydantic creates concrete generic model classes, such as GenericResult[int], dynamically at runtime.
  3. Ray workers are fresh processes and may unpickle an object before the same concrete generic class has been created in that worker.

The bug is not specific to GenericResult. Any concrete Pydantic generic BaseModel specialization can have the same problem if that specialized class object crosses a process boundary before the receiver has materialized it.

The fix is confusing because there are two separate objects involved:

  • the model instance being pickled, such as GenericResult[int](value=5)
  • the generated model class objects that may appear in the instance's class or
    generic type arguments, such as GenericResult[int], ListResult[int], or
    CallableModelGenericType[NullContext, GenericResult[int]]

Fixing only the top-level instance class is not enough if generated generic
classes are also embedded inside the type arguments that define that instance's
specialized class.

What Pydantic Does

Pydantic v2 does not require pydantic.generics.GenericModel; normal BaseModel subclasses can be generic. When code evaluates:

GenericResult[int]

Pydantic runs BaseModel.__class_getitem__. In the local environment, this is Pydantic 2.13.4.
The relevant source is pinned to the v2.13.4 tag here:
https://github.com/pydantic/pydantic/blob/v2.13.4/pydantic/main.py#L904-L969

The relevant flow is:

  1. Check Pydantic's generic specialization cache.
  2. Map type variables to concrete args.
  3. Compute a parametrized model name, such as GenericResult[int].
  4. Call _generics.create_generic_submodel(...).
  5. Cache the generated class.
  6. Return that generated class.
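
The flow above can be imitated with a stdlib-only toy. This is a sketch under stated assumptions, not Pydantic's implementation: ToyModel, _SPECIALIZATIONS, and generic_metadata are hypothetical names, and the toy skips type-variable mapping (step 2) and all schema generation.

```python
# Hypothetical stand-in for a generic model; none of these names come from Pydantic.
_SPECIALIZATIONS = {}  # cache keyed by (origin class, args)


class ToyModel:
    generic_metadata = {"origin": None, "args": ()}

    def __class_getitem__(cls, args):
        args = args if isinstance(args, tuple) else (args,)
        key = (cls, args)
        cached = _SPECIALIZATIONS.get(key)  # step 1: check the cache
        if cached is not None:
            return cached
        arg_names = ", ".join(a.__name__ for a in args)
        name = f"{cls.__name__}[{arg_names}]"  # step 3: parametrized name
        created = type(  # step 4: create the subclass at runtime
            name,
            (cls,),
            {
                "__module__": cls.__module__,
                "generic_metadata": {"origin": cls, "args": args},
            },
        )
        _SPECIALIZATIONS[key] = created  # step 5: cache the generated class
        return created  # step 6: return it


specialized = ToyModel[int]
print(specialized.__name__)
print(specialized is ToyModel[int])
```

The point of the toy is that the specialized class exists only because some code evaluated the subscription expression in this process.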

The generated class has metadata like:

GenericResult[int].__pydantic_generic_metadata__

with shape:

{
    "origin": GenericResult,
    "args": (int,),
    "parameters": (),
}

For module-level models like this repro, origin is stable and importable. In
general, the reducer still relies on the origin class itself being importable or
otherwise serializable by cloudpickle. The generated specialized class is
runtime-created.

In Pydantic's _generics.create_generic_submodel, the new subclass is created
with the origin model's __module__ and generic metadata. The relevant source
is pinned here:
https://github.com/pydantic/pydantic/blob/v2.13.4/pydantic/_internal/_generics.py#L105-L149

namespace = {"__module__": origin.__module__}
bases = (origin,)
...
created_model = meta(
    model_name,
    bases,
    namespace,
    __pydantic_generic_metadata__={
        "origin": origin,
        "args": args,
        "parameters": params,
    },
    ...
)

Then Pydantic conditionally registers the generated class in the origin module
when _get_caller_frame_info(...) decides the specialization was created from
a global context:
https://github.com/pydantic/pydantic/blob/v2.13.4/pydantic/_internal/_generics.py#L140-L147

model_module, called_globally = _get_caller_frame_info(depth=3)
if called_globally:
    reference_module_globals = sys.modules[created_model.__module__].__dict__
    reference_module_globals.setdefault(reference_name, created_model)

That global registration is the key point. Sometimes a process has ccflow.result.generic.GenericResult[int] as a module attribute because that process already materialized it in a context Pydantic considers global. A fresh process may not.
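
What "global context" can mean is easier to see with a frame check. The sketch below assumes the decision comes down to whether the caller's local namespace is its module namespace. created_at_module_scope is a hypothetical helper, not Pydantic's _get_caller_frame_info, whose exact logic should be read from the pinned source.

```python
import sys


def created_at_module_scope() -> bool:
    """Return True when the direct caller is running at module (global) scope."""
    caller = sys._getframe(1)  # CPython-specific frame access
    return caller.f_locals is caller.f_globals


def call_from_function() -> bool:
    # Inside a function body, locals and globals are different namespaces.
    return created_at_module_scope()


# exec() with a single namespace dict behaves like module-level code.
module_like_namespace = {"created_at_module_scope": created_at_module_scope}
exec("result = created_at_module_scope()", module_like_namespace)

print(module_like_namespace["result"], call_from_function())
```

Under that assumption, the same specialization expression registers a module attribute when written at module level and registers nothing when it first runs inside a function.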

What Pickle and Cloudpickle Do

Pickle generally reconstructs class objects by global reference:

module name + class name

For an ordinary class, this is fine:

ccflow.result.generic.GenericResult

can be imported in any process.

For a generated specialization, pickle/cloudpickle may see:

__module__ = "ccflow.result.generic"
__name__ = "GenericResult[int]"

and serialize it by reference as if this were importable:

ccflow.result.generic.GenericResult[int]

That works in the process that created and registered the class. It can fail in a fresh process:

AttributeError: Can't get attribute 'GenericResult[int]' on module 'ccflow.result.generic'

Ray makes this easy to hit because Ray workers are separate Python processes. Importing GenericResult in the worker does not necessarily create GenericResult[int]. If unpickling happens first, the generated class name is missing.
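
The by-reference failure can be reproduced in a single process with only the standard library. This is a sketch under stated assumptions: toy_byref_module is a hypothetical in-memory module, the class is created with type() rather than by Pydantic, and the cold receiver is simulated by deleting the module attribute.

```python
import pickle
import sys
import types

# Hypothetical in-memory module standing in for an importable package module.
toy_module = types.ModuleType("toy_byref_module")
sys.modules[toy_module.__name__] = toy_module

# A runtime-created class whose name is not a normal identifier.
generated = type("Box[int]", (), {"__module__": toy_module.__name__})

# "Creator" side: the class is registered on the module, so pickling by reference works.
setattr(toy_module, "Box[int]", generated)
payload = pickle.dumps(generated())
print(b"Box[int]" in payload)  # checks that the stream names the class

# "Cold receiver" side, simulated by removing the registration.
delattr(toy_module, "Box[int]")
try:
    pickle.loads(payload)
    load_error = None
except AttributeError as exc:
    load_error = exc
print(type(load_error).__name__)
```

The payload carries only the module name and class name, so whether it loads depends entirely on what the receiving module namespace contains at that moment.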

Pydantic's Instance Pickle Behavior

Pydantic already has instance pickle machinery.

BaseModel.__getstate__() returns a dict containing Pydantic's internal model
state, and BaseModel.__setstate__() restores that state directly. The source
is pinned here:
https://github.com/pydantic/pydantic/blob/v2.13.4/pydantic/main.py#L1145-L1160

{
    "__dict__": self.__dict__,
    "__pydantic_extra__": self.__pydantic_extra__,
    "__pydantic_fields_set__": self.__pydantic_fields_set__,
    "__pydantic_private__": private,
}

That is important because pickle should preserve an already-validated object. It should not rerun normal validation, coerce values again, drop private attrs, or rebuild the object through model_validate.
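
A stdlib-only analogy shows the difference between restoring state and re-validating. Checked is a hypothetical class, not a Pydantic model, and its __init__ merely stands in for validation and coercion.

```python
import pickle
import sys
import types

# Hypothetical in-memory module so pickle can find the class by reference.
toy_module = types.ModuleType("toy_state_module")
sys.modules[toy_module.__name__] = toy_module


class Checked:
    __module__ = "toy_state_module"
    validation_runs = 0

    def __init__(self, value):
        type(self).validation_runs += 1  # stands in for validation/coercion
        self.value = int(value)

    def __getstate__(self):
        return {"value": self.value}

    def __setstate__(self, state):
        self.__dict__.update(state)  # restore directly, no re-validation


toy_module.Checked = Checked

original = Checked("5")
restored = pickle.loads(pickle.dumps(original))
print(restored.value, Checked.validation_runs)
```

The counter stays at one after the round trip because unpickling allocates the object and applies state without calling __init__.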

ccflow already overrides __getstate__ / __setstate__ in a small way, so that __pydantic_fields_set__ is deterministic in pickle output.

The new fix keeps this Pydantic state-based behavior. For concrete generic
specializations of ccflow.BaseModel, it changes only the reduce recipe.

The Exact Failure

A minimal failing shape is:

payload = cloudpickle.dumps(GenericResult[int](value=5))

Then in a fresh process that has imported GenericResult but has not evaluated GenericResult[int]:

cloudpickle.loads(payload)

can fail because the receiver has the origin class:

ccflow.result.generic.GenericResult

but not the generated specialization:

ccflow.result.generic.GenericResult[int]

The problem is broader than the top-level class:

GenericResult[ListResult[int]](...)

Here the top-level class is GenericResult[ListResult[int]], and the generic arg contains another generated class, ListResult[int].

This also appears inside typing aliases:

GenericResult[list[ListResult[int]]](...)
GenericResult[typing.List[ListResult[int]]](...)
GenericResult[typing.Callable[[ListResult[int]], int]](...)
GenericResult[GenericContext[int] | None](...)

typing.Callable is especially annoying because its parameter types can appear as a plain Python list inside typing.get_args():

typing.get_args(Callable[[ListResult[int]], int])
# ([ListResult[int]], int)

So a helper that only walks typing.get_args() recursively can still miss generated classes inside that list.
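
The gap can be shown with two small walkers. This is a minimal sketch: Marker is a hypothetical stand-in for a generated class such as ListResult[int], and neither function is the helper used in this PR.

```python
import typing
from collections.abc import Callable


class Marker:
    """Stand-in for a generated class such as ListResult[int]."""


def naive_leaves(tp):
    """Recurse through typing.get_args() only."""
    args = typing.get_args(tp)
    if not args:
        return [tp]
    return [leaf for arg in args for leaf in naive_leaves(arg)]


def list_aware_leaves(tp):
    """Also descend into plain list/tuple containers found inside the args."""
    if isinstance(tp, (list, tuple)):
        return [leaf for item in tp for leaf in list_aware_leaves(item)]
    args = typing.get_args(tp)
    if not args:
        return [tp]
    return [leaf for arg in args for leaf in list_aware_leaves(arg)]


expr = Callable[[Marker], int]
print(Marker in naive_leaves(expr), Marker in list_aware_leaves(expr))
```

The naive walker treats the parameter list as an opaque leaf, so the class inside it is never visited.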

Generated classes can also appear as field values:

GenericResult[type](value=ListResult[int])

Even if the instance class is reconstructed correctly, the field value
ListResult[int] would otherwise be pickled by its fragile generated class
name. That field-value shape is not fixed by the current change. The current
fix deliberately covers generated classes used as the instance class and inside
generic type arguments, while leaving arbitrary class objects stored in model
state to pickle/cloudpickle's normal behavior.

Pydantic-Only Repro

The smallest useful repro does not need ccflow or Ray. It only needs a plain
Pydantic generic model, two Python processes, and a cold receiver that imports
the generic origin class without first materializing the concrete specialization.

Consider the following repro:

# /// script
# dependencies = [
#   "cloudpickle",
#   "pydantic>=2.6,<3",
# ]
# ///
"""Minimal Pydantic-only repro for generic BaseModel pickle lookup failures.

This intentionally does not import ccflow. It creates a normal importable module
containing a plain Pydantic generic model, pickles ``Box[int](...)`` in one
process, then tries to unpickle it in a second fresh process that imported
``Box`` but did not materialize ``Box[int]``.
"""

from __future__ import annotations

import base64
import subprocess
import sys
import tempfile
import textwrap
from pathlib import Path


def run_python(script: str) -> subprocess.CompletedProcess[str]:
    return subprocess.run(
        [sys.executable, "-c", textwrap.dedent(script)],
        capture_output=True,
        text=True,
        timeout=30,
    )


with tempfile.TemporaryDirectory() as temp_dir:
    root = Path(temp_dir)
    module_path = root / "pydantic_generic_model.py"
    module_path.write_text(
        textwrap.dedent(
            """
            from typing import Generic, TypeVar

            from pydantic import BaseModel

            T = TypeVar("T")

            class Box(BaseModel, Generic[T]):
                value: T
            """
        )
    )

    creator = run_python(
        f"""
        import base64
        import cloudpickle
        import pydantic
        import sys

        sys.path.insert(0, {str(root)!r})
        import pydantic_generic_model
        from pydantic_generic_model import Box

        print("pydantic_version=", pydantic.__version__)
        value = Box[int](value=5)
        print("creator_has_Box_int=", hasattr(pydantic_generic_model, "Box[int]"))
        print(base64.b64encode(cloudpickle.dumps(value, protocol=5)).decode())
        """
    )
    if creator.returncode != 0:
        raise SystemExit(creator.stderr)

    payload = creator.stdout.splitlines()[-1]
    print("=== creator ===")
    print("\n".join(creator.stdout.splitlines()[:-1]))

    cold_loader = run_python(
        f"""
        import base64
        import cloudpickle
        import sys

        sys.path.insert(0, {str(root)!r})
        import pydantic_generic_model
        from pydantic_generic_model import Box

        print("loader_has_Box_int_before=", hasattr(pydantic_generic_model, "Box[int]"))
        value = cloudpickle.loads(base64.b64decode({payload!r}))
        print(value)
        """
    )
    print("\n=== cold receiver: imports Box but not Box[int] ===")
    print("returncode:", cold_loader.returncode)
    print("stdout:")
    print(cold_loader.stdout.rstrip() or "<empty>")
    print("stderr:")
    print(cold_loader.stderr.rstrip() or "<empty>")

    warm_loader = run_python(
        f"""
        import base64
        import cloudpickle
        import sys

        sys.path.insert(0, {str(root)!r})
        import pydantic_generic_model
        from pydantic_generic_model import Box

        _ = Box[int]
        print("loader_has_Box_int_before=", hasattr(pydantic_generic_model, "Box[int]"))
        value = cloudpickle.loads(base64.b64decode({payload!r}))
        print(value)
        """
    )
    print("\n=== warm receiver: materializes Box[int] before load ===")
    print("returncode:", warm_loader.returncode)
    print("stdout:")
    print(warm_loader.stdout.rstrip() or "<empty>")
    print("stderr:")
    print(warm_loader.stderr.rstrip() or "<empty>")

The script writes a temporary module that defines Box, then runs three subprocess steps:

  1. A creator process imports Box, evaluates Box[int], constructs
    Box[int](value=5), and serializes it with cloudpickle.
  2. A cold receiver process imports Box but does not evaluate Box[int]
    before calling cloudpickle.loads(...).
  3. A warm receiver process imports Box, evaluates Box[int], then calls
    cloudpickle.loads(...).

The observed output is:

=== creator ===
pydantic_version= 2.13.4
creator_has_Box_int= True

=== cold receiver: imports Box but not Box[int] ===
returncode: 1
stdout:
loader_has_Box_int_before= False
stderr:
Traceback (most recent call last):
  File "<string>", line 11, in <module>
AttributeError: Can't get attribute 'Box[int]' on <module 'pydantic_generic_model' ...>

=== warm receiver: materializes Box[int] before load ===
returncode: 0
stdout:
loader_has_Box_int_before= True
value=5
stderr:
<empty>

That is the core bug in isolation. The creating process has a generated
Box[int] class registered on the module. The cold receiving process has only
the importable generic origin class Box, so pickle's global lookup for
Box[int] fails. The warm receiver evaluates Box[int] at module/global scope
before unpickling, which causes Pydantic to install the same generated class as
a module attribute before pickle tries to resolve it. That proves this is an
import/materialization-ordering problem rather than a ccflow model-definition
problem.

The same repro was checked with several Pydantic 2.x releases using:

uv run --with pydantic==<version> <pydantic-only-repro-script>

| Pydantic version | Cold receiver result | Warm receiver result |
| --- | --- | --- |
| 2.6.4 | fails with missing Box[int] module attribute | succeeds |
| 2.7.4 | fails with missing Box[int] module attribute | succeeds |
| 2.8.2 | fails with missing Box[int] module attribute | succeeds |
| 2.9.2 | fails with missing Box[int] module attribute | succeeds |
| 2.10.6 | fails with missing Box[int] module attribute | succeeds |
| 2.11.10 | fails with missing Box[int] module attribute | succeeds |
| 2.12.5 | fails with missing Box[int] module attribute | succeeds |
| 2.13.4 | fails with missing Box[int] module attribute | succeeds |

This is not a regression in a recent Pydantic minor release. The behavior is
stable across the tested 2.x line.

Why the Fix Uses __reduce_ex__

__reduce_ex__ is the pickle hook that returns a reconstruction recipe.

For concrete generic specializations of ccflow.BaseModel, ccflow now returns
a recipe like:

(
    _new_ccflow_generic_model,
    (origin, portable_args),
    pydantic_state,
)

For:

GenericResult[int](value=5)

the recipe is conceptually:

origin = GenericResult
args = (int,)
state = self.__getstate__()

On load, pickle first calls the reducer function, which conceptually does:

cls = GenericResult[int]
obj = cls.__new__(cls)

Then, because the reducer returned pydantic_state as the third tuple element,
pickle applies that state to the object:

obj.__setstate__(pydantic_state)

This uses Pydantic's own generic construction path in the receiving process,
while still letting pickle apply Pydantic's normal state protocol. The receiver
does not need to already have a global GenericResult[int] module attribute.
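
The recipe can be exercised end to end with a stdlib-only toy. This is a sketch under stated assumptions, not the ccflow code: Toy, _new_toy_generic, and toy_reduce_module are hypothetical names, the toy class builds its specializations with type(), and the cold receiver is simulated by clearing the toy's cache.

```python
import pickle
import sys
import types

# Hypothetical in-memory module holding only the stable, importable names.
toy_module = types.ModuleType("toy_reduce_module")
sys.modules[toy_module.__name__] = toy_module


def _new_toy_generic(origin, args):
    cls = origin[args]  # re-run the generic construction path on load
    return cls.__new__(cls)  # allocate only; pickle applies the state afterwards


class Toy:
    __module__ = "toy_reduce_module"
    _cache = {}
    origin = None
    args = ()

    def __class_getitem__(cls, args):
        args = args if isinstance(args, tuple) else (args,)
        key = (cls, args)
        if key not in Toy._cache:
            name = f"{cls.__name__}[{', '.join(a.__name__ for a in args)}]"
            namespace = {"__module__": cls.__module__, "origin": cls, "args": args}
            Toy._cache[key] = type(name, (cls,), namespace)
        return Toy._cache[key]

    def __init__(self, value):
        self.value = value

    def __reduce_ex__(self, protocol):
        cls = type(self)
        if cls.origin is None:
            return super().__reduce_ex__(protocol)  # non-specialized: default path
        return (_new_toy_generic, (cls.origin, cls.args), dict(self.__dict__))


# Register only the stable names; "Toy[int]" is never added to the module.
_new_toy_generic.__module__ = toy_module.__name__
toy_module._new_toy_generic = _new_toy_generic
toy_module.Toy = Toy

payload = pickle.dumps(Toy[int](value=5))
Toy._cache.clear()  # simulate a receiver that has not built Toy[int] yet
restored = pickle.loads(payload)
print(type(restored).__name__, restored.value)
```

The generated class name never enters the payload, so the load succeeds without any module attribute for the specialization.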

Why Type Arguments Need Special Handling

The generic argument portability layer is the ugly part, but it is solving a
real second-order problem.

If we only serialize the top-level class as:

origin + args

then this works:

GenericResult[int](value=5)

but these can still fail:

GenericResult[ListResult[int]](...)
GenericResult[list[ListResult[int]]](...)
GenericResult[Callable[[ListResult[int]], int]](...)

because ListResult[int] is itself a generated Pydantic generic class.

The helper therefore handles:

  • generic type arguments
  • typing aliases that contain generic type arguments
  • list and tuple containers that can appear inside type expressions, such
    as the callable parameter list

The raw Pydantic state remains in the outer pickle stream. That is intentional:
pickle keeps its own memo table there, which is required to preserve shared
references, cycles, and protocol-5 buffers. The reducer only changes how the
generic class is recreated; it does not recursively rewrite arbitrary model
field/private state.
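
The memo-table point can be checked with a stdlib-only comparison. This is a sketch: OuterState mirrors the approach described here (state as the third reduce element), while NestedState is a hypothetical counterexample that pre-pickles its state into bytes. Neither class is ccflow code.

```python
import pickle
import sys
import types

# Hypothetical in-memory module so pickle can resolve the names by reference.
toy_module = types.ModuleType("toy_memo_module")
sys.modules[toy_module.__name__] = toy_module


def _rebuild(cls):
    return cls.__new__(cls)


def _rebuild_from_bytes(cls, blob):
    obj = cls.__new__(cls)
    obj.__dict__.update(pickle.loads(blob))
    return obj


class OuterState:
    """State travels as the third reduce element, inside the outer stream."""

    def __init__(self, items):
        self.items = items

    def __reduce_ex__(self, protocol):
        return (_rebuild, (type(self),), dict(self.__dict__))


class NestedState(OuterState):
    """State is pre-pickled into bytes, hiding it from the outer memo table."""

    def __reduce_ex__(self, protocol):
        return (_rebuild_from_bytes, (type(self), pickle.dumps(self.__dict__)))


for obj in (_rebuild, _rebuild_from_bytes, OuterState, NestedState):
    obj.__module__ = toy_module.__name__
    setattr(toy_module, obj.__name__, obj)

shared = [1, 2, 3]
a, b = pickle.loads(pickle.dumps((OuterState(shared), OuterState(shared))))
c, d = pickle.loads(pickle.dumps((NestedState(shared), NestedState(shared))))
print(a.items is b.items, c.items is d.items)
```

Only the first variant keeps the two restored objects pointing at one shared list, which is the identity behavior the outer stream is meant to preserve.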

It intentionally does not treat model instances as type specs. A value like:

ListResult[int](value=[1])

is still a model instance and should be pickled as an instance. Its own __reduce_ex__ will handle its generated class. Accidentally converting instances into class specs would corrupt data.

Blast Radius

The detection predicate combines two related guards:

isinstance(value, type)
and value.__pydantic_generic_metadata__["origin"] is not None

That predicate identifies concrete generated Pydantic generic classes such as:

GenericResult[int]
ListResult[str]
CallableModelGenericType[NullContext, GenericResult[int]]
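
Written as a function, the predicate might look like the sketch below. The function name is hypothetical, and FakeOrigin and FakeSpecialized are stand-in classes that only copy the metadata shape shown earlier; the sketch assumes unparametrized classes carry an origin of None.

```python
def is_concrete_generic_specialization(value) -> bool:
    """True only for classes whose generic metadata names an origin class."""
    if not isinstance(value, type):
        return False  # instances are never treated as class specs
    metadata = getattr(value, "__pydantic_generic_metadata__", None)
    return bool(metadata) and metadata.get("origin") is not None


# Stand-in classes carrying the same metadata shape described earlier.
class FakeOrigin:
    __pydantic_generic_metadata__ = {"origin": None, "args": (), "parameters": ()}


class FakeSpecialized(FakeOrigin):
    __pydantic_generic_metadata__ = {"origin": FakeOrigin, "args": (int,), "parameters": ()}


print(
    is_concrete_generic_specialization(FakeSpecialized),
    is_concrete_generic_specialization(FakeOrigin),
    is_concrete_generic_specialization(FakeSpecialized()),
)
```

The isinstance check comes first so that a model instance, which can see the same class-level metadata, is never mistaken for a class.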

The BaseModel.__reduce_ex__ override runs when pickling a ccflow model
instance whose type(self) satisfies that predicate. So it does apply to:

GenericResult[int](value=5)

It does not apply to:

GenericResult
ParentModel
ParentModel(...)

Normal non-generic BaseModel instances continue using the default reducer
path, plus ccflow's existing deterministic __getstate__ / __setstate__
hooks.

The custom reduce recipe is created only during pickling of concrete generic
ccflow BaseModel instances. It does not run during:

  • normal construction
  • validation
  • model_dump
  • equality
  • compute/call paths
  • registry lookup

The performance cost is therefore limited to pickling generic model instances.
The extra work is walking the generic type arguments for the model class. The
actual Pydantic instance state is still handled by the surrounding pickle
operation.

Why Not Simpler Alternatives

Why not call model_validate on restore?

Because pickle should restore object state, not validate new input.

Revalidation can:

  • coerce values differently
  • drop private attrs
  • rerun validators with side effects
  • fail if the current schema changed
  • fail if the object was valid when created but no longer validates under newer code

Pydantic's own pickle support uses __getstate__ / __setstate__, so the ccflow fix follows that model.

Why not rely on cloudpickle to serialize the generated class by value?

Sometimes cloudpickle can serialize dynamic classes by value. But Pydantic specializations are not just ordinary dynamic classes. They carry generated schemas, validators, serializers, generic metadata, and cache behavior.

Also, if the class appears to be importable by module/name in the creating process, cloudpickle can choose a global-reference path. That is exactly the fragile path that fails in a fresh receiver.

The stable representation for a Pydantic generic specialization is not the generated class object. It is:

origin class + generic args

Why not globally register every generated specialization?

Pydantic already conditionally registers generated specializations when it thinks they were created globally. But a fresh Ray worker has not necessarily executed the same specialization expression yet.

Eagerly registering every possible specialization is impossible, because the set of type arguments is open-ended. Registering during serialization still would not help the receiver unless the receiver had run the same registration side effects before unpickling.

Why not monkeypatch pickle/cloudpickle for all classes?

Generated Pydantic specializations are class objects, so a global reducer would mean changing behavior for type or for broad classes of model classes. That is much wider than this bug.

The ccflow fix keeps the custom behavior inside ccflow BaseModel instance pickling.

Why not support fields containing generated classes too?

That is a real broader issue, but fixing it at the state-value level is a much
bigger change. It requires intercepting arbitrary class objects inside the
pickle stream or walking Pydantic state manually, both of which can disturb
pickle's normal identity/cycle/buffer semantics if done carelessly.

The current fix chooses the smaller and safer boundary: generated classes that
define the model instance's own type, plus generated classes inside that type's
generic arguments. A field value like GenericResult[type](value=ListResult[int])
can still fail in a cold receiver and remains out of scope.

How Broad Is This Problem?

It affects concrete Pydantic generic specializations crossing a process
boundary by pickle/cloudpickle when the receiver has not already materialized
the same specialization.

Shapes this PR fixes:

  • instance class: GenericResult[int](...)
  • nested generic arg: GenericResult[ListResult[int]](...)
  • builtin alias arg: GenericResult[list[ListResult[int]]](...)
  • dict alias arg: GenericResult[dict[str, GenericContext[int]]](...)
  • typing alias arg: GenericResult[typing.List[ListResult[int]]](...)
  • callable alias arg: GenericResult[typing.Callable[[ListResult[int]], int]](...)
  • single-argument special-form args accepted by Pydantic in this context, such
    as typing.ClassVar[ListResult[int]] and typing.Final[ListResult[int]]
  • union arg: GenericResult[GenericContext[int] | None](...)
  • optional alias arg: GenericResult[typing.Optional[ListResult[int]]](...)

Not affected:

  • non-generic BaseModel classes
  • generic origin classes before parametrization, such as GenericResult
  • normal runtime use that does not pickle

Handled by normal nested pickling, not by rewriting field/private state:

  • generic ccflow model instances stored as field values; pickle visits those
    instances normally, and each instance's own reducer handles its generated
    class

Helper-level coverage only:

  • CallableModelGenericType[NullContext, GenericResult[int]] as a type
    argument
  • typing.Required[ListResult[int]] and
    typing.NotRequired[ListResult[int]]; Pydantic rejects these as
    GenericResult[...] arguments in this context, but the restore helper
    handles the one-argument special-form shape

Known not fixed:

  • class-valued state such as GenericResult[type](value=ListResult[int])
  • arbitrary metadata containers inside type expressions, such as
    Annotated[int, frozenset([ListResult[int]])]; the helper walks normal
    typing args plus explicit list/tuple containers, not every object that
    can be embedded in metadata

Why This Is Hard To Read

The code is confusing because it has to preserve three different pieces of
pickle/type state:

  1. Pydantic model instance state: restored through __getstate__ / __setstate__
  2. Pydantic generic class identity: restored through origin[args]
  3. Python typing alias form: rebuilt as the same kind of type expression, such
    as builtin list[...], typing.List[...], typing.Optional[...],
    typing.Callable[...], or a PEP 604 A | B union

It also has to avoid a dangerous false positive:

ListResult[int]          # generated class: make portable
ListResult[int](...)     # model instance: do not convert to a class spec

That is why there are separate helpers for:

  • detecting concrete Pydantic generic specialization classes
  • making generic type arguments portable
  • restoring generic type arguments
  • allocating the rebuilt generic model instance before pickle applies
    Pydantic's normal __setstate__

The resulting implementation is not aesthetically simple, but each piece exists to handle a real pickle/Ray failure mode.


codecov Bot commented May 15, 2026

Codecov Report

❌ Patch coverage is 92.75362% with 15 lines in your changes missing coverage. Please review.
✅ Project coverage is 95.32%. Comparing base (3c8fd19) to head (4abb390).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ccflow/base.py | 90.58% | 6 Missing and 2 partials ⚠️ |
| ccflow/tests/test_base_cloudpickle.py | 81.57% | 7 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #211      +/-   ##
==========================================
- Coverage   95.37%   95.32%   -0.05%     
==========================================
  Files         142      143       +1     
  Lines       11404    11608     +204     
  Branches      620      633      +13     
==========================================
+ Hits        10876    11065     +189     
- Misses        399      412      +13     
- Partials      129      131       +2     


@NeejWeej NeejWeej force-pushed the nk/generic_cloudpickle branch 2 times, most recently from f4c4745 to 840d895 Compare May 15, 2026 05:58
Signed-off-by: Nijat K <nijat.khanbabayev@gmail.com>
@NeejWeej NeejWeej force-pushed the nk/generic_cloudpickle branch from 840d895 to 7404c07 Compare May 15, 2026 07:22

Development

Successfully merging this pull request may close these issues.

[BUG] GenericResult[T] Cloudpickle Fails In Fresh Processes When Serialized By Global Reference