
Non-reproducibility in training algorithms #60

@gioelemo


Hi @g-braeunlich,

Following Soheyl's suggestion, I’m assigning this issue to you.

I am encountering an issue where training runs on Euler are not deterministic, despite using a fixed seed (--seed 1). As shown in the W&B report, two identical runs on the same node yield different loss curves and different generated designs.

#!/bin/bash
#SBATCH --job-name=cgan_cnn_2d_beams2d
#SBATCH --time=00:45:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=7GB
#SBATCH --gpus=rtx_4090:1
#SBATCH --output=engiopt_cgan_cnn_2d_beams2d_%j.out
#SBATCH --error=engiopt_cgan_cnn_2d_beams2d_%j.err
# Mail notifications disabled (SKIP_SLURM_EMAIL)

mkdir -p "$SCRATCH/logs" "$SCRATCH/datasets" "$SCRATCH/models"

module purge
module load stack/2024-06 gcc/12.2.0 python_cuda/3.11.6 cuda/12.8.0 eth_proxy
source ~/venvs/engineer_assistant/bin/activate

export WANDB_API_KEY="...."
export WANDB_ENTITY="gioelemo-ethz"
export WANDB_PROJECT="engiopt"
export HF_HOME="$SCRATCH/models"
export HF_DATASETS_CACHE="$SCRATCH/datasets"
export HF_TOKEN="..."

cd $HOME/EngiOpt
python engiopt/cgan_cnn_2d/cgan_cnn_2d.py --problem-id "beams2d" --track --save-model --n-epochs 100 --seed 1

echo "Training complete!"

Both training runs were executed on the same Euler node:

Image

The generated designs also look different:

Image

Probably not everything is seeded correctly; see the PyTorch reproducibility notes (https://docs.pytorch.org/docs/stable/notes/randomness.html#reproducibility) and the documentation for torch.use_deterministic_algorithms (https://docs.pytorch.org/docs/stable/generated/torch.use_deterministic_algorithms.html#torch-use-deterministic-algorithms).
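For reference, here is a minimal sketch of the settings those two pages describe. The helper name `seed_everything` is hypothetical and not part of engiopt, and I have not checked which of these calls the training script already makes. The guarded imports are only there so the snippet also runs where NumPy or PyTorch is not installed.

```python
import os
import random

# cuBLAS requires this variable for deterministic matrix multiplies on CUDA >= 10.2;
# it must be set before the first CUDA call (see the PyTorch reproducibility notes).
os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")

try:
    import numpy as np
except ImportError:  # NumPy not installed: skip its RNG
    np = None
try:
    import torch
except ImportError:  # PyTorch not installed: skip its RNGs and flags
    torch = None


def seed_everything(seed: int, deterministic: bool = True) -> None:
    """Hypothetical helper: seed the Python, NumPy, and PyTorch RNGs."""
    random.seed(seed)
    if np is not None:
        np.random.seed(seed)
    if torch is not None:
        torch.manual_seed(seed)           # seeds the CPU and CUDA generators
        torch.cuda.manual_seed_all(seed)  # explicit; silently ignored without CUDA
        if deterministic:
            # Disable the cuDNN autotuner, which may pick different kernels per run.
            torch.backends.cudnn.benchmark = False
            torch.backends.cudnn.deterministic = True
            # Raise an error whenever an op without a deterministic implementation runs.
            torch.use_deterministic_algorithms(True)


seed_everything(1)
```

With `torch.use_deterministic_algorithms(True)`, any operation that has no deterministic implementation raises a `RuntimeError`, which should point directly at the layers causing the divergence. Passing `warn_only=True` turns those errors into warnings if a hard failure is too strict.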

Could you help me investigate if there are specific settings in the engiopt trainer we should adjust to ensure bit-wise reproducibility?
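To check bit-wise reproducibility without comparing loss curves by eye, one option is to hash the saved weights of each run and compare the digests. The sketch below is an assumption about how this could be done, not existing engiopt code: `fingerprint` is a hypothetical helper that takes raw byte buffers, and the dictionaries are stand-in data. For a PyTorch model, each value could be produced with `tensor.detach().cpu().numpy().tobytes()` over the entries of `state_dict()`.

```python
import hashlib
from typing import Mapping


def fingerprint(buffers: Mapping[str, bytes]) -> str:
    """Return a SHA-256 hex digest over named byte buffers, in sorted key order."""
    digest = hashlib.sha256()
    for name in sorted(buffers):
        data = buffers[name]
        digest.update(name.encode("utf-8"))
        # Length prefix keeps adjacent buffers from blending into each other.
        digest.update(len(data).to_bytes(8, "little"))
        digest.update(data)
    return digest.hexdigest()


# Stand-in data; in a real comparison each value would hold one tensor's raw bytes.
run_a = {"weight": b"\x00\x01", "bias": b"\x02"}
run_b = {"bias": b"\x02", "weight": b"\x00\x01"}  # same content, different key order
run_c = {"weight": b"\x00\x03", "bias": b"\x02"}  # one byte differs

print(fingerprint(run_a) == fingerprint(run_b))
print(fingerprint(run_a) == fingerprint(run_c))
```

Two runs are bit-wise identical only if their digests match exactly; a single differing bit in any parameter changes the digest.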

Thanks!

Gioele

CC: @SoheylM @ffelten

Labels: enhancement