Matheart/experiment_guide

For empirical research (for myself only).

Usage on the run.ai cluster (A100) and the B200 nodes.

Large Storage: /shared_data0/hnwong

run.ai

  • Access the cluster: ssh hnwong@locust-login.seas.upenn.edu
  • Log in: runai login (username: hnwong@upenn.edu)
  • Delete a job: runai delete job honam
  • Submit an interactive job:
runai submit honam \
   -i hnwong2025/base:latest \
   --attach \
   --interactive \
   --tty \
   --stdin \
   -v /home/hnwong:/home/hnwong \
   -v /shared_data0:/shared_data0 \
   --cpu 8 \
   -g 1 \
   --large-shm \
   --memory 128G \
   --working-dir /home/hnwong \
   -e HOME=/home/hnwong \
   --service-type=nodeport --port 30025:22 \
   -- /usr/sbin/sshd -D

To run another job at the same time, change the port number 30025. Then forward the port (possibly optional): runai port-forward honam --port 30025:30025

  • Jupyter notebook: create one through the run.ai web UI.
  • Access TensorBoard: runai port-forward honam --port 6006:6006 (forwards the login node's port to the job's port), then ssh -L 6006:localhost:6006 hnwong@locust-login.seas.upenn.edu (connects the local machine's port 6006 to the login node).

B200 nodes

kinit hnwong@UPENN.EDU
ssh hnwong@login.betty.parcc.upenn.edu
  • Home directory: /vast/home/h/hnwong
  • Find your own jobs: squeue -u $USER
  • Run a job: srun --partition=dgx-b200 --pty --container-image=hnwong2025/base:latest bash
  • A more complete invocation (still need to figure out how to map home directories correctly):
srun --partition=dgx-b200 \
     --container-image=docker://hnwong2025/base:latest \
     --container-mounts=/vast/home/h/hnwong:/home/hnwong \
     --container-workdir=/home/hnwong \
     --container-env=HOME=/home/hnwong \
     --cpus-per-task=8 \
     --gpus=1 \
     --mem=128G \
     --pty \
     --time=01:00:00 \
     bash

Docker commands:

docker build --build-arg UID=$(id -u) --build-arg GID=$(id -g) -t hnwong2025/base:latest base
docker push hnwong2025/base:latest

Git command:

If you have made a bad commit (for example, one that adds a large file) and want to fix it before pushing to the remote:

  • Keep a copy of your current state in case anything goes wrong: git branch backup-before-rewrite
  • Prevent untracked local files from blocking rebase checkouts: git stash -u -m "temp stash for rebase"
  • Rebase onto an earlier commit: git rebase -i HEAD~2 (the number of steps depends on how many commits you have made)
  • Change the line of the bad commit to edit, and remove any unnecessary commits after it
  • At that commit, make the fix you need; for example, to undo adding a large file: git rm --cached path/to/large_file, then git commit --amend and git rebase --continue
  • Verify with git rev-list --objects --all | grep 'path/to/large_file' || echo "✅ Large file removed from history."
  • Finally, git push origin <branch>.

Afterwards, restore the stashed files: git stash pop

Hosted platform

https://modal.com/

Command

import modal

app = modal.App("experiment-guide")  # app name is arbitrary; looper, run_parameters,
                                     # and save_results come from the local modules below

training_image = (
    modal.Image.debian_slim(python_version="3.10")
    .env({"CUBLAS_WORKSPACE_CONFIG": ":4096:8"})  # deterministic cuBLAS kernels
    .pip_install("torch")
    .add_local_python_source("tensor_initializations", "optimization_algorithms",
                             "synthetic_data", "utils", "models", "looper")
)

@app.function(image=training_image, gpu="A100-40GB", timeout=3600)
def get_results(params):
    return looper(params)

@app.local_entrypoint()
def main():
    inputs = [run_parameters]
    # .map runs one container per input in parallel on Modal
    for result in get_results.map(inputs):
        save_results(result)

Deep Learning Experiments

Choose a GPU: export CUDA_VISIBLE_DEVICES=1

Torch profiling

Timing

The simplest way:

torch.cuda.synchronize(); sync_start = time.time()  # wait for pending GPU work first
loss = loss.item()
torch.cuda.synchronize(); sync_time = time.time() - sync_start

Ignore the first measurement, since it can be slow (pre-loading, kernel compilation, etc.), and average over multiple runs.
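A minimal sketch of that pattern, with a warmup call and averaging (the matmul is a stand-in for the operation being timed):

import time
import torch

x = torch.randn(4096, 4096, device="cuda")

def op():
    return x @ x  # stand-in for the operation being timed

op(); torch.cuda.synchronize()  # warmup: discard the first, slow call
times = []
for _ in range(20):
    torch.cuda.synchronize(); t0 = time.time()
    op()
    torch.cuda.synchronize(); times.append(time.time() - t0)
print(f"mean: {sum(times) / len(times) * 1e3:.3f} ms")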

For more detailed profiling, use torch.profiler:

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
    # skip 1 step, warm up for 1, then record 3 steps per cycle
    schedule=torch.profiler.schedule(wait=1, warmup=1, active=3),
    on_trace_ready=torch.profiler.tensorboard_trace_handler('/shared_data0/hnwong/logs/profile'),
    with_stack=True
) as prof:
    for step in range(steps):
        train_step()
        prof.step()  # tell the profiler that one step has finished

uv add torch_tb_profiler, then uv run tensorboard --logdir=/shared_data0/hnwong/logs/profile --port=6006 --bind_all --load_fast=false. After that, handle the port forwarding as before: runai port-forward honam --port 6006:6006 (forwards the login node's port to the job's port), then ssh -L 6006:localhost:6006 hnwong@locust-login.seas.upenn.edu (connects the local machine's port 6006 to the login node).

Parallelize many small experiments on a single GPU

We use small_exp as a proxy for small experiments in practice, and we test and compare two approaches: sequential execution and parallel execution.

When the experiments inside one batch take approximately the same amount of time, parallel execution saves some time (e.g. with num_of_workers = 4), but introduces overhead if num_of_workers gets too large. When their running times differ (e.g. when varying batch size or width), parallel execution can be worse than sequential execution. To verify these conclusions, run and compare the timings of the code inside the small_exp folder.

This tells us when considering parallelism:

  • Make sure that within every batch of experiments the execution times are roughly equal, i.e. do not run experiments of width 64 and 128 in parallel.
  • Use num_of_workers = 4, and do not set it too large, to avoid overhead.
  • The time can be reduced, but not by much.

Refer to small_exp/small_exp_multiprocessing.py for the template.
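A minimal sketch of the pattern (the run_experiment body here is a toy stand-in, not the actual small_exp code):

import time
import torch
import torch.multiprocessing as mp

def run_experiment(width):
    # one small experiment on the shared GPU
    x = torch.randn(width, width, device="cuda")
    w = torch.randn(width, width, device="cuda")
    for _ in range(100):
        x = torch.tanh(x @ w)
    return x.abs().mean().item()

if __name__ == "__main__":
    mp.set_start_method("spawn")        # CUDA requires spawn, not fork
    widths = [64] * 16                  # keep sizes uniform within a batch
    t0 = time.time()
    with mp.Pool(processes=4) as pool:  # num_of_workers = 4
        results = pool.map(run_experiment, widths)
    print(f"{len(results)} experiments in {time.time() - t0:.2f}s")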

jax is actually more powerful for dealing with a large number of small synthetic experiments (see the jax section below).

Distributed training

TO-DO

Accelerate Transformer training time on a single GPU

Remember: avoid frequent CPU-GPU communication inside the training loop (e.g. .item(), .to(device)); a sketch of the pattern follows.
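For example, accumulate the loss on the GPU instead of syncing every step (the loss computation here is a stand-in for a real training step):

import torch

running = torch.zeros((), device="cuda")
for step in range(100):
    loss = (torch.randn(32, device="cuda") ** 2).mean()  # stand-in training step
    # bad: losses.append(loss.item())  -> one CPU-GPU sync every step
    running += loss.detach()  # stays on the GPU, no sync
print((running / 100).item())  # sync once, at the very end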

Change to a lower precision

TF32 (Enable Tensor Core)

torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

Mixed-precision training

Enabling this can give a massive speedup.

TO-DO: Try mixed-precision training.
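A minimal sketch of the standard torch.amp pattern in recent PyTorch (toy model and data; untested here, per the TO-DO above):

import torch
from torch import nn

model = nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.amp.GradScaler("cuda")  # scales the loss to avoid fp16 gradient underflow

for step in range(10):
    x = torch.randn(64, 1024, device="cuda")
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(x).pow(2).mean()  # forward pass runs in mixed fp16/fp32
    scaler.scale(loss).backward()      # backward on the scaled loss
    scaler.step(optimizer)             # unscales gradients, then steps
    scaler.update()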

torch.compile()

Almost one line: model = torch.compile(model). It captures the model's forward/backward pass once, fuses and optimizes operations, and generates efficient GPU kernels. This typically gives a 1.5-2x speedup (the first call is slow, since that is when compilation happens).

TO-DO: Test.
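A minimal sketch for that test (toy model; the first call triggers compilation, so it is excluded from the timing):

import time
import torch
from torch import nn

model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024)).cuda()
compiled = torch.compile(model)
x = torch.randn(64, 1024, device="cuda")

compiled(x)  # first call: compilation happens here, slow
torch.cuda.synchronize(); t0 = time.time()
for _ in range(100):
    compiled(x)
torch.cuda.synchronize()
print(f"{(time.time() - t0) / 100 * 1e3:.2f} ms/step")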

Increase batch size

Increasing the batch size can sometimes be more efficient: a larger batch gives a more accurate estimate of the gradient, and can reach a lower loss after a large number of steps than a small batch can.

Hyperparameter sweep experiments using wandb
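A minimal sketch of the standard wandb sweep pattern (the project name, metric, and hyperparameter values below are placeholders):

import wandb

def train():
    run = wandb.init()
    lr = wandb.config.lr  # set by the sweep controller for each run
    loss = 1.0
    for step in range(100):
        loss *= 1.0 - lr  # stand-in for a real training step
        wandb.log({"loss": loss})
    run.finish()

sweep_config = {
    "method": "grid",
    "metric": {"name": "loss", "goal": "minimize"},
    "parameters": {"lr": {"values": [1e-3, 1e-2, 1e-1]}},
}
sweep_id = wandb.sweep(sweep_config, project="experiment_guide")  # placeholder project
wandb.agent(sweep_id, function=train, count=3)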

jax

https://docs.jax.dev/en/latest/index.html

  • Guideline: JAX internally uses a functional programming model, so all functions should be pure (no side effects, i.e. no print inside a function and no use of external mutable variables). Don't use iterators, or you may get errors / unexpected results. For debug printing, use jax.debug.print().
  • jax.jit, jax.vmap, and jax.grad often apply to static shapes only, but the scenarios that need dynamic shapes can usually be avoided.
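A minimal sketch of the pure-function style applied to a batch of small synthetic experiments (the quadratic loss is a toy stand-in):

import jax
import jax.numpy as jnp

def run_one(key, lr):
    # pure function: everything it uses comes in through its arguments
    w = jax.random.normal(key, (16,))
    grad_fn = jax.grad(lambda w: jnp.sum(w ** 2))  # toy quadratic loss
    for _ in range(100):
        w = w - lr * grad_fn(w)
    jax.debug.print("final loss: {l}", l=jnp.sum(w ** 2))  # not print()
    return jnp.sum(w ** 2)

# vmap over experiments, then jit: one compiled computation for all runs
keys = jax.random.split(jax.random.PRNGKey(0), 8)
lrs = jnp.full((8,), 0.1)
losses = jax.jit(jax.vmap(run_one))(keys, lrs)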

Git command

Push to a private repository: git remote set-url origin https://Matheart:<api_key>@github.com/Matheart/<project>.git
