Seed all four RNGs on cold start to match checkpoint coverage #60

Open
mmshad wants to merge 1 commit into main from cold-start-seed-coverage
Collaborator

@mmshad mmshad commented Apr 22, 2026

Summary

  • In _set_seed, add random.seed(effective_seed) and numpy.random.seed(effective_seed) so Python's random and NumPy's legacy global RNG are deterministic on cold start. The checkpoint path already captures and restores both, so warm resume is no longer stricter than cold start.
  • Swap torch.cuda.manual_seed for torch.cuda.manual_seed_all to cover every visible device, not just the current one.
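For reviewers who want the shape of the change without opening the diff, here is a minimal sketch of the seeding described above. It is not the actual `_set_seed` body: `seed_all_rngs` and `draw` are hypothetical names, and the guarded imports are an assumption made only so the snippet runs where NumPy or PyTorch is not installed.

```python
import random

try:
    import numpy as np
except ImportError:  # assumption: NumPy is optional in this sketch
    np = None

try:
    import torch
except ImportError:  # assumption: PyTorch is optional in this sketch
    torch = None


def seed_all_rngs(effective_seed: int) -> None:
    """Hypothetical stand-in for _set_seed: seed every global RNG that is available."""
    random.seed(effective_seed)  # Python's global RNG
    if np is not None:
        np.random.seed(effective_seed)  # NumPy's legacy global RNG
    if torch is not None:
        torch.manual_seed(effective_seed)  # torch CPU generator
        # Queues the seed for every visible CUDA device; ignored without CUDA.
        torch.cuda.manual_seed_all(effective_seed)


def draw() -> tuple:
    """Draw one value from each available global RNG (hypothetical helper)."""
    values = [random.random()]
    if np is not None:
        values.append(float(np.random.rand()))
    if torch is not None:
        values.append(float(torch.rand(1)))
    return tuple(values)
```

Calling `seed_all_rngs(1234)` before two separate `draw()` calls should yield identical tuples, which is the cold-start property this PR is after.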

Drive-by: tests/unit/test_eval.py::TestRunEval::test_perfect_model_low_loss relied on sum(embed(0)) landing positive from inherited global RNG state. Zero the embedding and set row 0 to ones so the assertion is deterministic regardless of test ordering.

Closes #59

Test plan

  • uv run pytest tests/unit/test_distributed_seed.py tests/unit/test_eval.py -v.
  • Full unit suite clean.

_set_seed() only called torch.manual_seed, leaving Python random and
NumPy's legacy global RNG uninitialized. Warm resume restores all four
generators (python, numpy, torch_cpu, torch_cuda) via get_rng_state, so
cold-start runs had strictly weaker reproducibility than resumed ones.
Any code path using random.random() or np.random.rand() on rank > 0
picked up process-level entropy rather than the configured seed.
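
To illustrate why warm resume was stricter, here is a rough sketch of a capture/restore round trip for the global RNGs. `capture_rng_state` and `restore_rng_state` are hypothetical names, not the project's `get_rng_state` helper, and the torch generators are left out to keep the snippet short.

```python
import random

try:
    import numpy as np
except ImportError:  # assumption: NumPy is optional in this sketch
    np = None


def capture_rng_state() -> dict:
    """Snapshot the global RNGs (hypothetical helper; torch omitted for brevity)."""
    state = {"python": random.getstate()}
    if np is not None:
        state["numpy"] = np.random.get_state()
    return state


def restore_rng_state(state: dict) -> None:
    """Restore the global RNGs from a snapshot taken by capture_rng_state."""
    random.setstate(state["python"])
    if np is not None and "numpy" in state:
        np.random.set_state(state["numpy"])
```

A draw taken after `restore_rng_state(saved)` repeats the draw taken right after `saved = capture_rng_state()`, so a resumed run replays the same stream even when the cold start never seeded it.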

Also fix a latent flake in test_perfect_model_low_loss: the test relied
on sum(embed(0)) landing positive by luck from global RNG state. Zero
the embedding and set row 0 to ones so the assertion is deterministic
regardless of test ordering.
@mmshad mmshad requested a review from Naeemkh April 22, 2026 01:12
@mmshad mmshad self-assigned this Apr 22, 2026
Member

@Naeemkh Naeemkh left a comment

I'm not sure that aliasing imports with a leading underscore is a good idea. Please double-check.


import logging
import os
import random as _random
Member

Why not just import random?

import random as _random
from datetime import timedelta

import numpy as _np
Member

Same here.

import numpy as np

Development

Successfully merging this pull request may close these issues.

_set_seed covers fewer RNGs than the checkpoint path captures
