Skip to content

Unify K-FAC and autocorrelation Hessian save layout#276

Open
jammastergirish wants to merge 9 commits intomainfrom
refactor-unify-hessian-serialization
Open

Unify K-FAC and autocorrelation Hessian save layout#276
jammastergirish wants to merge 9 commits intomainfrom
refactor-unify-hessian-serialization

Conversation

@jammastergirish
Copy link
Copy Markdown
Collaborator

@jammastergirish jammastergirish commented May 5, 2026

Linear task

Summary

bergson hessian had two different save layouts depending on the method:

  • K-FAC family (kfac / tkfac / shampoo) wrote to run_path/<method>/.
  • Autocorrelation wrote to run_path/ (because it piggy-backs on bergson build).

Both already serialize a hessian_config.yaml recording the method, but at different relative locations. This PR aligns them under the K-FAC convention so the saved hessian_config.yaml always lives at <run_path>/<method>/hessian_config.yaml and unambiguously identifies which method produced the directory.

What changed

Layout

Path Before After
bergson hessian --method kfac runs/foo runs/foo/kfac/ runs/foo/kfac/ (unchanged)
bergson hessian --method tkfac runs/foo runs/foo/tkfac/ runs/foo/tkfac/ (unchanged)
bergson hessian --method shampoo runs/foo runs/foo/shampoo/ runs/foo/shampoo/ (unchanged)
bergson hessian --method autocorrelation runs/foo runs/foo/ runs/foo/autocorrelation/

Where the routing lives

K-FAC-family writes mutate `run_path` inside `approximate_hessians` (`bergson/hessians/hessian_approximations.py`). The autocorrelation flow piggybacks on `build()`, which now takes an optional `hessian_cfg` parameter; when supplied alongside `skip_index=True`, `build` performs the same `<run_path>//` mutation and serializes the supplied `HessianConfig` to `hessian_config.yaml`. The `Hessian` CLI command no longer rewrites `index_cfg.run_path` itself — it just passes `hessian_cfg` through to `build`. `trackstar.py` and `bergson/hessians/pipeline.py` continue to call `build()` without a `hessian_cfg`, so their custom subdir layouts are unchanged.

Reading back the method

The canonical way to identify the method that produced a saved directory is to load the on-disk config directly:

```python
from bergson.config import HessianConfig

method = HessianConfig.load_yaml(str(path / "hessian_config.yaml")).method
```

`mix_autocorrelation_matrices` uses this same one-liner inline (loop over `query_path` and `index_path`) — no extra helper is kept, since it was only needed in two places.

What is not changed

  • On-disk artifact formats are kept as-is. K-FAC family writes sharded safetensors of Kronecker factors (+ EK-FAC eigenvalue corrections); autocorrelation writes `GradientProcessor.save` artifacts (`hessians.pth`, `hessians_eigen.pth`, `processor_config.yaml`, `normalizers.pth`). They encode different math and converging the formats would be a large refactor with no clear win.
  • `bergson build` (with `skip_hessians=False`) still writes its autocorrelation Hessian alongside the gradient index at `run_path/`. That mode produces an index and an inline Hessian; namespacing the Hessian under `/` would split the run.
  • `trackstar.py` and `bergson/hessians/pipeline.py` call `build()` / `approximate_hessians()` directly with their own custom subdirectories (`value_hessian/`, `query_hessian/`, etc.); those layouts are unchanged.

Breaking change

Users invoking `bergson hessian --method autocorrelation` and then pointing a downstream `--preprocess_cfg.hessian_path` at the run directory must update the path to include the trailing `/autocorrelation` segment, e.g.:

```diff
bergson hessian runs/foo --method autocorrelation

  • bergson score ... --hessian_path runs/foo
  • bergson score ... --hessian_path runs/foo/autocorrelation
    ```

This matches the K-FAC convention which has always required `--hessian_method_path runs/foo/kfac`.

Verification

Ran on the cluster against Pythia-14m / pile-10k:

  • `bergson hessian runs/auto_unified --method autocorrelation` lands artifacts at `runs/auto_unified/autocorrelation/` with `hessian_config.yaml` reporting `method: autocorrelation`. No stray files at the run-path root.
  • `bergson hessian runs/kfac_unified --method kfac --ev_correction true` still produces the same `runs/kfac_unified/kfac/` layout (sharded safetensors + configs); no regression.
  • `HessianConfig.load_yaml(str(path / "hessian_config.yaml")).method` returns the right method (`autocorrelation` / `kfac`) for a saved directory.
  • `mix_autocorrelation_matrices` continues to validate that both sides are autocorrelation hessians via the inlined check.

Both `bergson hessian` paths now write under `run_path/<method>/` with a
`hessian_config.yaml` recording the method, and a single helper makes it
easy to discover which approximation produced a saved directory.

Changes:

- `bergson hessian --method autocorrelation runs/foo` now writes to
  `runs/foo/autocorrelation/` (was: `runs/foo/`). K-FAC family methods
  already use this layout — autocorrelation now matches.
- New module `bergson.hessians.io` with `load_hessian_config(path)` and
  `hessian_method(path)` helpers. Reads `hessian_config.yaml` from a saved
  Hessian directory and returns the parsed config / method string.
- `assert_autocorrelation_hessian` in `process_grads.py` now uses the new
  helper so there's a single read path for hessian metadata.

Behavior **not** changed:

- On-disk artifact formats (sharded safetensors for K-FAC family,
  GradientProcessor `.pth` files for autocorrelation) are kept as-is.
  They encode different math; converging them would be a much larger
  change with no clear benefit.
- `bergson build` (with `skip_hessians=False`) still writes its
  autocorrelation Hessian alongside the gradient index at `run_path/`.
  The Hessian is coupled with the index in this mode, so namespacing it
  under `<method>/` would split the run.
- `trackstar.py` and `bergson/hessians/pipeline.py` call `build()` /
  `approximate_hessians()` directly with their own custom subdirectories
  (e.g. `value_hessian/`, `query_hessian/`); those layouts are unchanged.

Breaking change:

- Users invoking `bergson hessian --method autocorrelation` and then
  pointing a downstream `--preprocess_cfg.hessian_path` at the run
  directory must update the path to include the trailing
  `/autocorrelation` segment. This matches the K-FAC convention that has
  always required `--hessian_method_path runs/foo/kfac`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jammastergirish jammastergirish marked this pull request as draft May 5, 2026 19:37
Round-trips a HessianConfig through save_yaml + load_hessian_config for
each method, verifies hessian_method returns the right string, and
checks that a missing config file raises FileNotFoundError with the
expected filename in the message.

Pure-python, no GPU required.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jammastergirish jammastergirish marked this pull request as ready for review May 5, 2026 20:18
Comment thread bergson/hessians/io.py Outdated
Comment thread bergson/hessians/io.py Outdated
Comment thread bergson/__main__.py Outdated
Comment thread bergson/__main__.py Outdated
@luciaquirke
Copy link
Copy Markdown
Collaborator

luciaquirke commented May 6, 2026

Seems like there's some claude bloat in this implementation, could you please do a pass for conciseness before getting human reviews? I want to avoid the code base getting too big and complex

Comment thread bergson/__main__.py Outdated
Comment thread bergson/__main__.py Outdated
Comment thread bergson/process_grads.py Outdated
jammastergirish and others added 2 commits May 7, 2026 14:52
…relation_hessian

Addresses Lucia's review feedback on PR #276:
- Push the `<run_path>/<method>` mutation out of `Hessian.execute` and into
  `build()` itself, gated on a new optional `hessian_cfg` parameter and
  `skip_index=True`. Trackstar still calls `build()` without hessian_cfg, so
  its routing is unchanged.
- Inline `assert_autocorrelation_hessian` into its only caller
  (`mix_autocorrelation_matrices`); it was used in 2 places.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread bergson/build.py
preprocess_cfg.save_yaml(index_cfg.partial_run_path / "preprocess_config.yaml")
if not index_cfg.skip_hessians:
HessianConfig(method="autocorrelation").save_yaml(
(hessian_cfg or HessianConfig(method="autocorrelation")).save_yaml(
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can just save it if it's not None here? so no if not index_cfg.skip_hessians: check

Comment thread bergson/process_grads.py
assert method == "autocorrelation", (
f"Hessian at '{path}' was computed with method '{method}'; "
f"mix_autocorrelation_matrices only supports autocorrelation."
)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice

Copy link
Copy Markdown
Collaborator

@luciaquirke luciaquirke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, LGTM with YAML saving logic fix

Comment thread bergson/build.py
setup_reproducibility()

if hessian_cfg is not None and index_cfg.skip_index:
index_cfg.run_path = index_cfg.run_path + f"/{hessian_cfg.method}"
Copy link
Copy Markdown
Collaborator

@luciaquirke luciaquirke May 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mutating the run path is a red flag right

Copy link
Copy Markdown
Collaborator

@luciaquirke luciaquirke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm going to pause this one on reflection, I think the unification needs more thought because it's not really clear which run path strategy we want to commit to between k-fac and autocorrelation

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants