# Environment & Reproducibility in ML

Reproducibility means that running the same code twice (same data, same config) gives the same results (model weights, metrics, outputs), or at least very close ones. This is crucial for debugging, for comparing experiments, and for being confident that a change in results comes from a code or config change rather than from random noise.

Getting reproducible runs in TensorFlow requires controlling many randomness sources: Python, NumPy, TensorFlow, OS-level hashing, dataset shuffling, and GPU library nondeterminism. We'll set seeds and flags to control these, explain why each is necessary, and show tests you can run.

## Sources of randomness & what to control

- Python-level randomness (random module): used by many libraries and by your own code (e.g., random.shuffle).
- NumPy RNG (numpy.random): many preprocessing and ML libraries draw from NumPy's global RNG.
- TensorFlow RNG (tf.random): weight initialization, dropout, shuffling inside TF ops, randomized GPU algorithms.
- tf.data shuffling: dataset.shuffle(buffer_size) uses its own RNG (you can pass a seed).
- OS-level hashing (PYTHONHASHSEED): Python randomizes the hashes of str and bytes objects per process, so the iteration order of a set of strings can change from run to run and subtly change execution. A dict always iterates in insertion order, but if its keys are inserted while iterating over a set, the dict inherits the set's order. PYTHONHASHSEED must be set before the interpreter starts. Example:
```py
# save as test.py
data = {'a', 'b', 'c'}       # a set is unordered (hash-based)
d = {k: None for k in data}  # a dict keeps insertion order, which here is the set's order
print(list(d.keys()))

# Run it from a terminal with two different hash seeds:
#   PYTHONHASHSEED=0 python test.py
#   PYTHONHASHSEED=1 python test.py
# The two runs may print different orders, for example:
#   ['a', 'b', 'c']
#   ['b', 'c', 'a']
# Without PYTHONHASHSEED, Python picks a random hash seed for each process:
#   python test.py
# When the hash seed changes, the hash of each string changes, so the set's
# iteration order changes and the keys are inserted into the dict in a different order.
# For stable results, fix PYTHONHASHSEED to a constant such as 0.
```
- Non-deterministic GPU kernels (cuDNN/cuBLAS): some GPU ops (e.g., certain convolution/cuDNN algorithms, atomic adds, reductions) are inherently nondeterministic for performance reasons. Environment flags can force deterministic algorithms but may reduce performance or even be unavailable for certain ops.
- Other system-level variance: different TF/CUDA/cuDNN versions, different device counts, different threading settings, or multi-process race conditions can change outcomes.
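
The PYTHONHASHSEED effect from the list above can be checked from a single script. The following is a minimal sketch that uses only the standard library and launches fresh interpreters with a chosen hash seed; the helper name `run_with_hash_seed` is made up for this illustration.

```python
import os
import subprocess
import sys

# Child program: print the iteration order of a set of strings.
CHILD = "print(list({'a', 'b', 'c'}))"

def run_with_hash_seed(seed: str) -> str:
    """Run CHILD in a fresh interpreter with PYTHONHASHSEED set to `seed`."""
    env = dict(os.environ, PYTHONHASHSEED=seed)
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        env=env, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

# The same hash seed gives the same order in every fresh process.
print(run_with_hash_seed("0") == run_with_hash_seed("0"))
# Different hash seeds may (but need not) give different orders.
print(run_with_hash_seed("0"), run_with_hash_seed("1"))
# Sorting removes the dependence on hashing altogether.
print(sorted({'a', 'b', 'c'}))
```

If some code path depends on set order, sorting the elements before iterating is usually more robust than relying on a fixed hash seed.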



In [None]:
import os
import random
import numpy as np
import tensorflow as tf

def set_seeds(seed=42, deterministic_ops=False):
    """Seed the Python, NumPy and TensorFlow RNGs; optionally request deterministic TF kernels."""
    # Only affects Python processes started after this point (e.g. subprocesses).
    # String hashing in the current interpreter is fixed at startup, so to stabilize
    # set iteration order here, export PYTHONHASHSEED before launching Python.
    os.environ['PYTHONHASHSEED'] = str(seed)
    random.seed(seed)         # Python's random module: same pseudo-random sequence on every run
    np.random.seed(seed)      # NumPy's global RNG: same idea for the np.random.* functions
    tf.random.set_seed(seed)  # TensorFlow global seed: same initial weights, same dropout masks, etc.

    # NOTE: deterministic ops can be slower, and some ops have no deterministic
    # implementation. Availability and effect depend on the TF, CUDA and cuDNN
    # versions and on which ops you use.
    if deterministic_ops:
        # Set this before TensorFlow runs any op. Newer TF releases also provide
        # tf.config.experimental.enable_op_determinism() for the same purpose.
        os.environ['TF_DETERMINISTIC_OPS'] = '1'
        # optionally: os.environ['TF_CUDNN_DETERMINISTIC'] = '1' (older TF guidance)
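
Recent TensorFlow 2.x releases ship utilities that cover most of what the helper above does by hand. The sketch below assumes a TensorFlow version that provides `tf.keras.utils.set_random_seed` and `tf.config.experimental.enable_op_determinism`; the wrapper name `set_seeds_builtin` is hypothetical.

```python
import tensorflow as tf

def set_seeds_builtin(seed: int = 42, deterministic_ops: bool = False) -> None:
    """Hypothetical helper: seed everything through TensorFlow's own utilities."""
    # Seeds Python's random module, NumPy's global RNG and TensorFlow in one call.
    tf.keras.utils.set_random_seed(seed)
    if deterministic_ops:
        # Requests deterministic kernels; ops without a deterministic
        # implementation may raise an error instead of running.
        tf.config.experimental.enable_op_determinism()

set_seeds_builtin(42)
a = tf.random.uniform([3]).numpy()
set_seeds_builtin(42)
b = tf.random.uniform([3]).numpy()
print((a == b).all())  # re-seeding replays the same draw
```

Neither call touches PYTHONHASHSEED, so that variable still has to be exported before Python starts.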


### Recommended extras (for tf.data and reproducible shuffling)
- Pass an explicit seed to every tf.data shuffle, including in pipelines that also use repeat.
- Keep the batching order deterministic when you need it.
- If you want the exact same batch composition across runs and across epochs, set reshuffle_each_iteration=False and use the same seed.
- If you rely on tf.data.experimental.service or on parallelism, be aware that they can introduce nondeterminism.

In [None]:
# Example:
seed = 1234
base = tf.data.Dataset.range(10)  # a simple tf.data dataset holding the integers 0..9

# Same order on every pass over the dataset and on every run:
ds = base.shuffle(buffer_size=10000, seed=seed, reshuffle_each_iteration=False)
# Alternative for training with repeat (a different order each epoch):
ds_train = base.shuffle(buffer_size=10000, seed=seed, reshuffle_each_iteration=True)

# Print the dataset as a list
print(list(ds.as_numpy_iterator()))  # prints the same order on every run

### Optional: stronger OS / thread controls (when you need very strict reproducibility)
- Set OMP_NUM_THREADS, TF_NUM_INTRAOP_THREADS, TF_NUM_INTEROP_THREADS to fixed values so thread scheduling is stable.
- Run inside a Docker image with pinned TF/CUDA/cuDNN versions (this helps when you want to share experiments across machines).

Example:
``` bash
export OMP_NUM_THREADS=1
export TF_NUM_INTRAOP_THREADS=1
export TF_NUM_INTEROP_THREADS=1
```
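
The TensorFlow thread pools can also be fixed from inside the script instead of through environment variables. This is a minimal sketch using the `tf.config.threading` setters; it assumes a fresh process, because these calls must happen before TensorFlow executes its first op.

```python
import tensorflow as tf

# Must run before TensorFlow executes any op; afterwards the setters raise RuntimeError.
tf.config.threading.set_intra_op_parallelism_threads(1)  # threads used inside a single op
tf.config.threading.set_inter_op_parallelism_threads(1)  # threads used to run independent ops concurrently

print(tf.config.threading.get_intra_op_parallelism_threads())
print(tf.config.threading.get_inter_op_parallelism_threads())
```

Single-threaded execution is usually much slower, so this setting makes most sense for small debugging or CI runs.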


### Tradeoffs 
- Performance vs determinism: Making TensorFlow operations fully deterministic (e.g., TF_DETERMINISTIC_OPS=1) can slow training because the fastest GPU kernels (for example, some cuDNN convolution algorithms) are often nondeterministic. Use determinism for debugging or CI; disable it for max speed in final training.
- Not everything can be deterministic: Some ops (collective operations, certain GPU kernels) remain nondeterministic, causing small numeric differences.
- Multi-GPU / distributed training: Determinism is harder across multiple GPUs or machines due to variable reduction orders and NCCL algorithms. Exact reproducibility may be limited.
- TF/CUDA version changes: Results can differ across TensorFlow, CUDA, or cuDNN versions; pin versions to reproduce results exactly.


### Practical checklist to include in your repo README / experiment.json
- seed: 1337
- deterministic_ops: (true/false)
- tf_version: tf.__version__
- cuda_version, cudnn_version (if GPU)
- python_version
- git commit hash
- dataset id / TFRecord path
- shuffle buffer size, batch size, tokenizer mode, max_len

In [None]:
config = {
  "seed": 1337,
  "deterministic_ops": False,
  "tf_version": "2.14.0",
  "python_version": "3.10.12",
  "git_commit": "abc1234",
  "shuffle_buffer_size": 10000,
  "batch_size": 32,
  "tokenizer_mode": "python"
}
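
Several checklist fields can be collected automatically rather than typed by hand. The sketch below uses only the standard library; the helper names (`package_version`, `git_commit`, `build_experiment_config`) are hypothetical, and it assumes TensorFlow is installed under the distribution name `tensorflow` and that the script runs inside a git checkout (both cases fall back to a placeholder string otherwise).

```python
import json
import platform
import subprocess
from importlib import metadata

def package_version(name: str) -> str:
    """Installed version of a distribution, or 'not installed'."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"

def git_commit() -> str:
    """Short hash of the current commit, or 'unknown' outside a git checkout."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "--short", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"

def build_experiment_config(seed: int, deterministic_ops: bool) -> dict:
    """Combine hand-chosen settings with values read from the environment."""
    return {
        "seed": seed,
        "deterministic_ops": deterministic_ops,
        "tf_version": package_version("tensorflow"),
        "python_version": platform.python_version(),
        "git_commit": git_commit(),
    }

auto_config = build_experiment_config(seed=1337, deterministic_ops=False)
print(json.dumps(auto_config, indent=2))
```

The printed JSON can be written to experiment.json next to the saved weights, so each result stays linked to the environment that produced it.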

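### Example usage
Training the same small model twice with the same seed should give identical weights. The exact sum depends on your TF version and hardware; what matters is that both lines match on every run. Expected output:
```
Sum w1: 21.006514500710182
Sum w2: 21.006514500710182
Are weights equal? True
```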

In [21]:
import os
import tensorflow as tf
import numpy as np
import h5py

# Ensure assets folder exists
os.makedirs("assets", exist_ok=True)

# --- Function to run your training and save weights ---
def run_model(weights_fname, seed=42, deterministic_ops=False):
    # Set seeds for reproducibility
    set_seeds(seed, deterministic_ops=deterministic_ops)
    
    # Small model
    inputs = tf.keras.Input(shape=(4,), dtype=tf.float32)
    x = tf.keras.layers.Dense(8, activation='relu')(inputs)
    outputs = tf.keras.layers.Dense(2)(x)
    model = tf.keras.Model(inputs, outputs)
    
    # Synthetic data
    X = np.random.randn(100, 4).astype('float32')
    y = np.random.randint(0, 2, size=(100,)).astype('int32')
    
    # Compile and train
    model.compile(optimizer='adam', loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
    model.fit(X, y, epochs=3, batch_size=16, verbose=0)
    
    # Save weights
    weights_path = os.path.join("assets", weights_fname)
    model.save_weights(weights_path)
    return weights_path

# --- Run first time ---
w1_path = run_model("w1.weights.h5", seed=42)
# --- Run second time ---
w2_path = run_model("w2.weights.h5", seed=42)

# --- Function to sum weights robustly ---
def sum_weights(fname):
    total = 0.0
    with h5py.File(fname,'r') as f:
        # visititems will call the visitor for every object in the file
        def visitor(name, obj):
            nonlocal total
            if isinstance(obj, h5py.Dataset):
                # use [...] to read full dataset
                total += obj[...].sum()
        f.visititems(visitor)
    return total

# --- Compare weights ---
s1 = sum_weights(w1_path)
s2 = sum_weights(w2_path)
print("Sum w1:", s1)
print("Sum w2:", s2)
print("Are weights equal?", s1 == s2)


Sum w1: 21.006514500710182
Sum w2: 21.006514500710182
Are weights equal? True
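
Comparing sums is a quick check, but two different sets of weights can add up to the same number. A stricter option is to hash the raw bytes of every weight array. The sketch below is one possible fingerprint; the function name `weights_fingerprint` is hypothetical, and the small arrays stand in for what `model.get_weights()` would return.

```python
import hashlib
import numpy as np

def weights_fingerprint(weights) -> str:
    """SHA-256 over the shape, dtype and raw bytes of each array, in order."""
    digest = hashlib.sha256()
    for w in weights:
        arr = np.ascontiguousarray(w)
        digest.update(str(arr.shape).encode())
        digest.update(str(arr.dtype).encode())
        digest.update(arr.tobytes())
    return digest.hexdigest()

# Stand-in arrays; with Keras you would pass model.get_weights().
w_a = [np.array([1.0, 2.0], dtype="float32"), np.array([[0.5]], dtype="float32")]
w_b = [np.array([2.0, 1.0], dtype="float32"), np.array([[0.5]], dtype="float32")]

# The sums collide even though the weights differ.
print(sum(float(w.sum()) for w in w_a) == sum(float(w.sum()) for w in w_b))
# The fingerprints tell them apart.
print(weights_fingerprint(w_a) == weights_fingerprint(w_b))
# An exact copy produces the same fingerprint.
print(weights_fingerprint(w_a) == weights_fingerprint([w.copy() for w in w_a]))
```

A fingerprint only answers "bit-identical or not"; when small numeric differences are acceptable, a tolerance-based comparison such as np.allclose is the better fit.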


### How tf.random.set_seed interacts with operations (important conceptual detail)

tf.random.set_seed(s) sets the global (graph-level) seed. Ops that accept a seed argument combine that operation-level seed with the global seed, and the pair determines the random sequence the op produces.  
If you set the seed before creating layers and initializers, the initial weights will be the same on every run.  
In eager mode, tf.random.set_seed affects all subsequent TF RNG calls, and calling it again restarts the sequence.  
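
The interaction between the global seed and operation-level seeds can be observed directly in eager mode. This is a small sketch under that assumption; the function name `draw_three` is made up for the example.

```python
import tensorflow as tf

def draw_three(global_seed: int):
    """Reset the global seed, then draw from three random ops."""
    tf.random.set_seed(global_seed)
    first = tf.random.uniform([2]).numpy()           # no operation-level seed
    second = tf.random.uniform([2]).numpy()          # same op again: a new value
    seeded = tf.random.uniform([2], seed=7).numpy()  # global seed combined with op seed 7
    return first, second, seeded

run1 = draw_three(1234)
run2 = draw_three(1234)

# Resetting the global seed should replay the same sequence of draws.
print(all((x == y).all() for x, y in zip(run1, run2)))
# Within one sequence, consecutive calls should still differ.
print((run1[0] == run1[1]).all())
```

This is why the seed has to be set before the model is built: the initializers consume the first draws of the sequence.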

### tf.data specifics (practical patterns)

#### Shuffle with seed

```python
ds = ds.shuffle(buffer_size=10000, seed=seed, reshuffle_each_iteration=False)
```

- seed gives a deterministic shuffle order.
- reshuffle_each_iteration=True reshuffles at the start of each epoch; if you want the same ordering in every epoch, set it to False.

- When using map() with num_parallel_calls or other parallel transformations, elements can finish out of order, which causes nondeterminism unless you pass deterministic=True to map or interleave (support varies by TF version).
Example:
``` py
ds = ds.map(fn, num_parallel_calls=tf.data.AUTOTUNE, deterministic=True)
```
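
Putting the two patterns together, the sketch below builds the same seeded pipeline twice and compares the batches. It assumes a TF version whose `map` accepts the `deterministic` argument; `make_pipeline` and `one_epoch` are hypothetical helper names.

```python
import tensorflow as tf

def make_pipeline(seed: int) -> tf.data.Dataset:
    """Seeded shuffle followed by a parallel map that preserves order."""
    ds = tf.data.Dataset.range(20)
    ds = ds.shuffle(buffer_size=20, seed=seed, reshuffle_each_iteration=False)
    # deterministic=True keeps the output order equal to the input order,
    # even though elements are processed in parallel.
    ds = ds.map(lambda x: x * 2, num_parallel_calls=tf.data.AUTOTUNE, deterministic=True)
    return ds.batch(5)

def one_epoch(ds: tf.data.Dataset):
    """Materialize one pass over the dataset as nested Python lists."""
    return [batch.tolist() for batch in ds.as_numpy_iterator()]

pipeline = make_pipeline(seed=1234)
epoch_a = one_epoch(pipeline)
epoch_b = one_epoch(pipeline)                 # second pass over the same dataset
epoch_c = one_epoch(make_pipeline(seed=1234)) # freshly built pipeline, same seed

print(epoch_a == epoch_b)  # same order in every epoch (reshuffle_each_iteration=False)
print(epoch_a == epoch_c)  # same order when the pipeline is rebuilt with the same seed
```

Comparing whole batches rather than individual elements also catches changes in batch composition, which is what affects the gradient updates.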



### Practical recommendations (what I’d do as your teacher)
- Default: add a set_seeds(seed) helper in src/utils/seed.py and call it at the top of every script.
- During development: run with deterministic_ops=True to make debugging easier (accept slower run time).
- For final training: set deterministic_ops=False for speed, but keep seeds & save seeds in experiment.json.
- For pipelines: prefer pre-tokenizing and storing token ids in TFRecord if tf.py_function slows your pipeline. Pre-tokenized TFRecords remove Python-based randomness in tokenization.
- For CI: add a small test based on the weight comparison above (for example, saved as repro_test.py) to validate that model weight fingerprints match across runs.
