# Short enwiki batch testing

Tests that the training code works in batching mode with the following model.

**1B5 model (L24-D2048)**
- Layer count: 24
- Embed size: 2048

> Benchmarks were run in JIT mode (not torch compile) on 18th Nov 2023

## DeepSpeed Stage 2 Offload speed on
- RTX 4090 + AMD Ryzen 7 3700X 8-Core Processor
- AWS 1 x A10G node

| Batch Size | Peak VRAM | 4090 kt/s  | 4090 time | A10G kt/s | A10G time |
|------------|-----------|------------|-----------|-----------|-----------|
| 6          | ~ 23.7 GB | 5.37 kt/s  | 10.15 min | 2.90 kt/s | 18.25 min |
| 5          | ~ 20.8 GB | 5.45 kt/s  | 9.95 min  | 2.87 kt/s | 18.46 min |
| 4          | ~ 18.0 GB | 5.21 kt/s  | 10.35 min | 2.70 kt/s | 19.50 min |
| 3          | ~ 15.1 GB | 4.76 kt/s  | 11.32 min | 2.43 kt/s | 21.05 min |
| 2          | ~ 12.3 GB | 4.26 kt/s  | 12.73 min | 2.08 kt/s | 25.05 min |
| 1          | ~ 9.54 GB | 3.07 kt/s  | 17.29 min | 1.41 kt/s | 36.50 min |
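
As a quick way to read the table, the sketch below copies the kt/s figures into a dictionary and reports where throughput peaks on each GPU, plus the speedup relative to a microbatch size of 1. The names `BENCHMARK`, `best_microbatch`, and `speedup_over_batch_one` are illustrative helpers and are not part of the trainer.

```python
# Throughput figures copied from the table above (kt/s = thousand tokens per second).
# Key: microbatch size. Value: (RTX 4090 kt/s, A10G kt/s).
BENCHMARK = {
    6: (5.37, 2.90),
    5: (5.45, 2.87),
    4: (5.21, 2.70),
    3: (4.76, 2.43),
    2: (4.26, 2.08),
    1: (3.07, 1.41),
}

RTX_4090, A10G = 0, 1  # column indexes into the tuples above


def best_microbatch(column: int) -> int:
    """Return the microbatch size with the highest kt/s for the given GPU column."""
    return max(BENCHMARK, key=lambda size: BENCHMARK[size][column])


def speedup_over_batch_one(size: int, column: int) -> float:
    """Return throughput at `size` divided by throughput at microbatch size 1."""
    return BENCHMARK[size][column] / BENCHMARK[1][column]


for name, column in (("RTX 4090", RTX_4090), ("A10G", A10G)):
    best = best_microbatch(column)
    gain = speedup_over_batch_one(best, column)
    print(f"{name}: peak at microbatch {best} ({gain:.2f}x over microbatch 1)")
```

On these numbers the RTX 4090 peaks at a microbatch size of 5 while the A10G is still improving slightly at 6, so the best setting depends on the hardware.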

The general advice is to increase microbatch_size until you hit either the VRAM limit or peak kt/s. Then increase target_batch_size to match, or scale it in multiples of the GPU count and microbatch_size.
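
The arithmetic behind this advice can be sketched as follows. The sketch assumes the usual data-parallel relationship, where the effective batch size equals microbatch_size times GPU count times the number of gradient accumulation steps. Whether the trainer derives its accumulation steps in exactly this way is an assumption here, and `accumulation_steps` is a hypothetical helper rather than a trainer function.

```python
def accumulation_steps(target_batch_size: int, microbatch_size: int, gpu_count: int) -> int:
    """Return how many microbatch steps are accumulated per optimizer step.

    Assumption: target_batch_size = microbatch_size * gpu_count * accumulation steps.
    """
    if min(target_batch_size, microbatch_size, gpu_count) < 1:
        raise ValueError("all arguments must be positive integers")

    samples_per_step = microbatch_size * gpu_count
    if target_batch_size % samples_per_step != 0:
        raise ValueError(
            f"target_batch_size={target_batch_size} is not a multiple of "
            f"microbatch_size * gpu_count = {samples_per_step}"
        )
    return target_batch_size // samples_per_step


# Single GPU, microbatch size 5: a target batch size of 40 needs 8 accumulation steps.
print(accumulation_steps(target_batch_size=40, microbatch_size=5, gpu_count=1))
# Eight GPUs with the same microbatch size reach 40 without any accumulation.
print(accumulation_steps(target_batch_size=40, microbatch_size=5, gpu_count=8))
```

Rejecting values that do not divide evenly makes a mismatched target_batch_size visible up front rather than silently training at a different effective batch size.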

## Preparing the init model and test dataset

In [32]:
GPU_DEVICES="auto"
ENABLE_WANDB=True
WANDB_PREFIX="infctx-v5-validation"
DEEPSPEED_STRAT="deepspeed_stage_2_offload"

print("ENABLE_WANDB:", ENABLE_WANDB)
print("GPU_DEVICES:", GPU_DEVICES)

if ENABLE_WANDB:
    WANDB_MODE="online"
else:
    WANDB_MODE="disabled"

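# Compute the notebook directory and the related paths.
# Note: "__file__" is a string literal here, so this resolves to the current working directory.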
import os
NOTEBOOK_DIR=os.path.dirname(os.path.abspath("__file__"))
PROJECT_DIR=os.path.abspath(os.path.join(NOTEBOOK_DIR, "../../"))
TRAINER_DIR=os.path.abspath(os.path.join(PROJECT_DIR, "./RWKV-v5/"))

print("NOTEBOOK_DIR:", NOTEBOOK_DIR)
print("TRAINER_DIR:", TRAINER_DIR)
print("PROJECT_DIR:", PROJECT_DIR)

ENABLE_WANDB: True
GPU_DEVICES: auto
NOTEBOOK_DIR: /home/picocreator/rwkv-proj/RWKV-infctx-trainer/notebook/trainer-v5-validation
TRAINER_DIR: /home/picocreator/rwkv-proj/RWKV-infctx-trainer/RWKV-v5
PROJECT_DIR: /home/picocreator/rwkv-proj/RWKV-infctx-trainer


In [33]:
# First, let's set up the various directories
!mkdir -p "{PROJECT_DIR}/model/"
!mkdir -p "{PROJECT_DIR}/datapath/"
!mkdir -p "{PROJECT_DIR}/checkpoint/"

In [34]:
# Let's initialize the L24-D2048 model with the init_model.py code
!cd "{TRAINER_DIR}" && python3 init_model.py \
    --n_layer 24 --n_embd 2048 \
    --vocab_size world \
    --skip-if-exists --safe-init \
    ../model/L24-D2048-world-init.pth

[2023-11-19 14:29:09,876] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[RWKV.model] Running RWKV model using 'torch-jit' with torch '2.1.1'
---- Initializing model ----
No of layers: 24
Embedding size: 2048
Output model path: ../model/L24-D2048-world-init.pth
Vocab size: 65536
Emb scale: 0.0001
Note: this process takes a significant time (and ram) for large models
---- ----- ----
Model exists, skipping init_model


In [35]:
# Preload the dataset
!cd "{TRAINER_DIR}" && \
    python3 preload_datapath.py "{NOTEBOOK_DIR}/config/enwiki_10k-world-4096.yaml"

Saving the dataset (1/1 shards): 100%|█| 763/763 [00:00<00:00, 21120.72 examples
Saving the dataset (1/1 shards): 100%|███| 8/8 [00:00<00:00, 4262.50 examples/s]


In [36]:
# Short training process - for quick testing / debugging
!cd "{TRAINER_DIR}" && \
    export WANDB_MODE="disabled" && \
    python3 lightning_trainer.py fit \
        -c "{NOTEBOOK_DIR}/config/enwiki_10k-world-4096.yaml" \
        --trainer.logger.init_args.name="{WANDB_PREFIX} (train-ctx=4096, data-ctx=4096, {DEEPSPEED_STRAT})" \
        --trainer.strategy="{DEEPSPEED_STRAT}" \
        --trainer.devices="{GPU_DEVICES}" \
        --trainer.fast_dev_run=2 \
        --model.ctx_len=4096 \
        --trainer.microbatch_size=1 \
        --model.load_model="../model/L24-D2048-world-init.pth"

[2023-11-19 14:29:22,119] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[RWKV.model] Running RWKV model using 'torch-jit' with torch '2.1.1'
/home/picocreator/anaconda3/envs/rwkv-infctx/lib/python3.11/site-packages/lightning/pytorch/cli.py:518: LightningCLI's args parameter is intended to run from within Python like if it were from the command line. To prevent mistakes it is not recommended to provide both args and command line arguments, got: sys.argv[1:]=['fit', '-c', '/home/picocreator/rwkv-proj/RWKV-infctx-trainer/notebook/trainer-v5-validation/config/enwiki_10k-world-4096.yaml', '--trainer.logger.init_args.name=infctx-v5-validation (train-ctx=4096, data-ctx=4096, deepspeed_stage_2_offload)', '--trainer.strategy=deepspeed_stage_2_offload', '--trainer.devices=auto', '--trainer.fast_dev_run=2', '--model.ctx_len=4096', '--trainer.microbatch_size=1', '--model.load_model=../model/L24-D2048-world-init.pth'], args=['fit', '-c', '/home/picocr

In [37]:
# Empty out the checkpoint
!cd "{PROJECT_DIR}" && rm -rf "./checkpoint/infctx-v5-unit-test-baseline-4096/"

# Microbatch 1 training
!cd "{TRAINER_DIR}" && \
    export WANDB_MODE="{WANDB_MODE}" && \
    python3 lightning_trainer.py fit \
        -c "{NOTEBOOK_DIR}/config/enwiki_10k-world-4096.yaml" \
        --trainer.logger.init_args.name="{WANDB_PREFIX} - microbatch 1 (train-ctx=4096, data-ctx=4096, {DEEPSPEED_STRAT})" \
        --trainer.strategy="{DEEPSPEED_STRAT}" \
        --trainer.devices="{GPU_DEVICES}" \
        --trainer.microbatch_size=1 \
        --model.load_model="../model/L24-D2048-world-init.pth"
        

[2023-11-19 14:30:02,692] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[RWKV.model] Running RWKV model using 'torch-jit' with torch '2.1.1'
/home/picocreator/anaconda3/envs/rwkv-infctx/lib/python3.11/site-packages/lightning/pytorch/cli.py:518: LightningCLI's args parameter is intended to run from within Python like if it were from the command line. To prevent mistakes it is not recommended to provide both args and command line arguments, got: sys.argv[1:]=['fit', '-c', '/home/picocreator/rwkv-proj/RWKV-infctx-trainer/notebook/trainer-v5-validation/config/enwiki_10k-world-4096.yaml', '--trainer.logger.init_args.name=infctx-v5-validation - microbatch 1 (train-ctx=4096, data-ctx=4096, deepspeed_stage_2_offload)', '--trainer.strategy=deepspeed_stage_2_offload', '--trainer.devices=auto', '--trainer.microbatch_size=1', '--model.load_model=../model/L24-D2048-world-init.pth'], args=['fit', '-c', '/home/picocreator/rwkv-proj/RWKV-infctx-trainer/n

In [38]:
# Empty out the checkpoint
!cd "{PROJECT_DIR}" && rm -rf "./checkpoint/infctx-v5-unit-test-baseline-4096/"

# Microbatch 2 training
!cd "{TRAINER_DIR}" && \
    export WANDB_MODE="{WANDB_MODE}" && \
    python3 lightning_trainer.py fit \
        -c "{NOTEBOOK_DIR}/config/enwiki_10k-world-4096.yaml" \
        --trainer.logger.init_args.name="{WANDB_PREFIX} - microbatch 2 (train-ctx=4096, data-ctx=4096, {DEEPSPEED_STRAT})" \
        --trainer.strategy="{DEEPSPEED_STRAT}" \
        --trainer.devices="{GPU_DEVICES}" \
        --trainer.microbatch_size=2 \
        --model.load_model="../model/L24-D2048-world-init.pth"

[2023-11-19 14:47:55,447] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[RWKV.model] Running RWKV model using 'torch-jit' with torch '2.1.1'
/home/picocreator/anaconda3/envs/rwkv-infctx/lib/python3.11/site-packages/lightning/pytorch/cli.py:518: LightningCLI's args parameter is intended to run from within Python like if it were from the command line. To prevent mistakes it is not recommended to provide both args and command line arguments, got: sys.argv[1:]=['fit', '-c', '/home/picocreator/rwkv-proj/RWKV-infctx-trainer/notebook/trainer-v5-validation/config/enwiki_10k-world-4096.yaml', '--trainer.logger.init_args.name=infctx-v5-validation - microbatch 2 (train-ctx=4096, data-ctx=4096, deepspeed_stage_2_offload)', '--trainer.strategy=deepspeed_stage_2_offload', '--trainer.devices=auto', '--trainer.microbatch_size=2', '--model.load_model=../model/L24-D2048-world-init.pth'], args=['fit', '-c', '/home/picocreator/rwkv-proj/RWKV-infctx-trainer/n

In [39]:
# Empty out the checkpoint
!cd "{PROJECT_DIR}" && rm -rf "./checkpoint/infctx-v5-unit-test-baseline-4096/"

# Microbatch 3 training
!cd "{TRAINER_DIR}" && \
    export WANDB_MODE="{WANDB_MODE}" && \
    python3 lightning_trainer.py fit \
        -c "{NOTEBOOK_DIR}/config/enwiki_10k-world-4096.yaml" \
        --trainer.logger.init_args.name="{WANDB_PREFIX} - microbatch 3 (train-ctx=4096, data-ctx=4096, {DEEPSPEED_STRAT})" \
        --trainer.strategy="{DEEPSPEED_STRAT}" \
        --trainer.devices="{GPU_DEVICES}" \
        --trainer.microbatch_size=3 \
        --model.load_model="../model/L24-D2048-world-init.pth"

[2023-11-19 15:01:12,310] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[RWKV.model] Running RWKV model using 'torch-jit' with torch '2.1.1'
/home/picocreator/anaconda3/envs/rwkv-infctx/lib/python3.11/site-packages/lightning/pytorch/cli.py:518: LightningCLI's args parameter is intended to run from within Python like if it were from the command line. To prevent mistakes it is not recommended to provide both args and command line arguments, got: sys.argv[1:]=['fit', '-c', '/home/picocreator/rwkv-proj/RWKV-infctx-trainer/notebook/trainer-v5-validation/config/enwiki_10k-world-4096.yaml', '--trainer.logger.init_args.name=infctx-v5-validation - microbatch 3 (train-ctx=4096, data-ctx=4096, deepspeed_stage_2_offload)', '--trainer.strategy=deepspeed_stage_2_offload', '--trainer.devices=auto', '--trainer.microbatch_size=3', '--model.load_model=../model/L24-D2048-world-init.pth'], args=['fit', '-c', '/home/picocreator/rwkv-proj/RWKV-infctx-trainer/n

In [40]:
# Empty out the checkpoint
!cd "{PROJECT_DIR}" && rm -rf "./checkpoint/infctx-v5-unit-test-baseline-4096/"

# Microbatch 4 training
!cd "{TRAINER_DIR}" && \
    export WANDB_MODE="{WANDB_MODE}" && \
    python3 lightning_trainer.py fit \
        -c "{NOTEBOOK_DIR}/config/enwiki_10k-world-4096.yaml" \
        --trainer.logger.init_args.name="{WANDB_PREFIX} - microbatch 4 (train-ctx=4096, data-ctx=4096, {DEEPSPEED_STRAT})" \
        --trainer.strategy="{DEEPSPEED_STRAT}" \
        --trainer.devices="{GPU_DEVICES}" \
        --trainer.microbatch_size=4 \
        --model.load_model="../model/L24-D2048-world-init.pth"

[2023-11-19 15:13:04,335] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[RWKV.model] Running RWKV model using 'torch-jit' with torch '2.1.1'
/home/picocreator/anaconda3/envs/rwkv-infctx/lib/python3.11/site-packages/lightning/pytorch/cli.py:518: LightningCLI's args parameter is intended to run from within Python like if it were from the command line. To prevent mistakes it is not recommended to provide both args and command line arguments, got: sys.argv[1:]=['fit', '-c', '/home/picocreator/rwkv-proj/RWKV-infctx-trainer/notebook/trainer-v5-validation/config/enwiki_10k-world-4096.yaml', '--trainer.logger.init_args.name=infctx-v5-validation - microbatch 4 (train-ctx=4096, data-ctx=4096, deepspeed_stage_2_offload)', '--trainer.strategy=deepspeed_stage_2_offload', '--trainer.devices=auto', '--trainer.microbatch_size=4', '--model.load_model=../model/L24-D2048-world-init.pth'], args=['fit', '-c', '/home/picocreator/rwkv-proj/RWKV-infctx-trainer/n

In [41]:
# Empty out the checkpoint
!cd "{PROJECT_DIR}" && rm -rf "./checkpoint/infctx-v5-unit-test-baseline-4096/"

# Microbatch 5 training
!cd "{TRAINER_DIR}" && \
    export WANDB_MODE="{WANDB_MODE}" && \
    python3 lightning_trainer.py fit \
        -c "{NOTEBOOK_DIR}/config/enwiki_10k-world-4096.yaml" \
        --trainer.logger.init_args.name="{WANDB_PREFIX} - microbatch 5 (train-ctx=4096, data-ctx=4096, {DEEPSPEED_STRAT})" \
        --trainer.strategy="{DEEPSPEED_STRAT}" \
        --trainer.devices="{GPU_DEVICES}" \
        --trainer.microbatch_size=5 \
        --model.load_model="../model/L24-D2048-world-init.pth"

[2023-11-19 15:23:57,294] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[RWKV.model] Running RWKV model using 'torch-jit' with torch '2.1.1'
/home/picocreator/anaconda3/envs/rwkv-infctx/lib/python3.11/site-packages/lightning/pytorch/cli.py:518: LightningCLI's args parameter is intended to run from within Python like if it were from the command line. To prevent mistakes it is not recommended to provide both args and command line arguments, got: sys.argv[1:]=['fit', '-c', '/home/picocreator/rwkv-proj/RWKV-infctx-trainer/notebook/trainer-v5-validation/config/enwiki_10k-world-4096.yaml', '--trainer.logger.init_args.name=infctx-v5-validation - microbatch 5 (train-ctx=4096, data-ctx=4096, deepspeed_stage_2_offload)', '--trainer.strategy=deepspeed_stage_2_offload', '--trainer.devices=auto', '--trainer.microbatch_size=5', '--model.load_model=../model/L24-D2048-world-init.pth'], args=['fit', '-c', '/home/picocreator/rwkv-proj/RWKV-infctx-trainer/n

In [42]:
# Empty out the checkpoint
!cd "{PROJECT_DIR}" && rm -rf "./checkpoint/infctx-v5-unit-test-baseline-4096/"

# Microbatch 6 training
!cd "{TRAINER_DIR}" && \
    export WANDB_MODE="{WANDB_MODE}" && \
    python3 lightning_trainer.py fit \
        -c "{NOTEBOOK_DIR}/config/enwiki_10k-world-4096.yaml" \
        --trainer.logger.init_args.name="{WANDB_PREFIX} - microbatch 6 (train-ctx=4096, data-ctx=4096, {DEEPSPEED_STRAT})" \
        --trainer.strategy="{DEEPSPEED_STRAT}" \
        --trainer.devices="{GPU_DEVICES}" \
        --trainer.microbatch_size=6 \
        --model.load_model="../model/L24-D2048-world-init.pth"

[2023-11-19 15:34:28,832] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[RWKV.model] Running RWKV model using 'torch-jit' with torch '2.1.1'
/home/picocreator/anaconda3/envs/rwkv-infctx/lib/python3.11/site-packages/lightning/pytorch/cli.py:518: LightningCLI's args parameter is intended to run from within Python like if it were from the command line. To prevent mistakes it is not recommended to provide both args and command line arguments, got: sys.argv[1:]=['fit', '-c', '/home/picocreator/rwkv-proj/RWKV-infctx-trainer/notebook/trainer-v5-validation/config/enwiki_10k-world-4096.yaml', '--trainer.logger.init_args.name=infctx-v5-validation - microbatch 6 (train-ctx=4096, data-ctx=4096, deepspeed_stage_2_offload)', '--trainer.strategy=deepspeed_stage_2_offload', '--trainer.devices=auto', '--trainer.microbatch_size=6', '--model.load_model=../model/L24-D2048-world-init.pth'], args=['fit', '-c', '/home/picocreator/rwkv-proj/RWKV-infctx-trainer/n
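
The six benchmark cells above differ only in the microbatch size, so the sweep could also be driven from a loop. The sketch below only builds and prints the command for each size, using the same flags as the cells above. `build_fit_command` is a hypothetical helper, and `/path/to/notebook` is a placeholder for `NOTEBOOK_DIR`.

```python
import shlex

INIT_MODEL = "../model/L24-D2048-world-init.pth"


def build_fit_command(
    microbatch_size: int,
    notebook_dir: str,
    wandb_prefix: str = "infctx-v5-validation",
    strategy: str = "deepspeed_stage_2_offload",
    devices: str = "auto",
) -> list[str]:
    """Build the lightning_trainer.py argument list for one microbatch size."""
    run_name = (
        f"{wandb_prefix} - microbatch {microbatch_size} "
        f"(train-ctx=4096, data-ctx=4096, {strategy})"
    )
    return [
        "python3", "lightning_trainer.py", "fit",
        "-c", f"{notebook_dir}/config/enwiki_10k-world-4096.yaml",
        f"--trainer.logger.init_args.name={run_name}",
        f"--trainer.strategy={strategy}",
        f"--trainer.devices={devices}",
        f"--trainer.microbatch_size={microbatch_size}",
        f"--model.load_model={INIT_MODEL}",
    ]


# Print the command for each microbatch size covered by the benchmark table.
for size in range(1, 7):
    print(shlex.join(build_fit_command(size, "/path/to/notebook")))
```

To actually run a command, pass the list to `subprocess.run(cmd, cwd=TRAINER_DIR, check=True)` after clearing the checkpoint directory, as the cells above do before each run.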