# InfCtx trainer validation

The model being trained is the small L6-D512 model, with:
- Layer count: 6
- Embed size: 512

The goal is to validate how the loss curve changes across runs that share the exact same hyperparameters:
- 1024 data chunk size
- same learning rate / weight decay / seed
- "teven/enwiki_10k" dataset, chunked to 1024 token sizes

The only change between runs is the training context size:
- 1024 context vs 128 context chunks (with a 512 context run as an intermediate point)

> This project assumes you have set up the rwkv-infctx conda env and are executing in that environment. See the main README.md for the conda env setup steps.
>
> It also assumes that you have completed the `baseline-setup.ipynb` notebook.

## Configure and apply your preferred settings

Adjust the deepspeed strategy and GPU device count to your preferred settings.

Enable or disable WANDB here as well. (It is enabled by default, as we need the loss curve for this experiment.)

(Note: you will need to rerun this cell if you restart your env.)

In [None]:
DEEPSPEED_STRAT="deepspeed_stage_1"
GPU_DEVICES="auto"
ENABLE_WANDB=True
WANDB_PREFIX="infctx-bptt-validation L6-D512"

print("DEEPSPEED_STRAT:", DEEPSPEED_STRAT)
print("ENABLE_WANDB:", ENABLE_WANDB)
print("GPU_DEVICES:", GPU_DEVICES)

if ENABLE_WANDB:
    WANDB_MODE="online"
else:
    WANDB_MODE="disabled"

# Baseline full context (1024) training

Perform a full 1-epoch training run with a training context size of 1024, so that every data sample fits within the allocated training context.

In [None]:
!cd ../../RWKV-v4neo && \
    export WANDB_MODE="{WANDB_MODE}" && \
    python3 new_train.py fit \
        -c ../notebook/trainer-validation/config/L6-D512-neox-1024.yaml \
        --trainer.logger.init_args.name="{WANDB_PREFIX} (full, train-ctx=1024, data-ctx=1024, {DEEPSPEED_STRAT})" \
        --trainer.strategy="{DEEPSPEED_STRAT}" \
        --trainer.devices="{GPU_DEVICES}"

# Back-Propagation Through Time (512) training

Perform a full 1-epoch training run with a training context size of 512. This is a less exaggerated version of the 128 context run.

In [None]:
!cd ../../RWKV-v4neo && \
    export WANDB_MODE="{WANDB_MODE}" && \
    python3 new_train.py fit \
        -c ../notebook/trainer-validation/config/L6-D512-neox-1024.yaml \
        --trainer.logger.init_args.name="{WANDB_PREFIX} (bptt, train-ctx=512, data-ctx=1024, {DEEPSPEED_STRAT})" \
        --trainer.strategy="{DEEPSPEED_STRAT}" \
        --trainer.devices="{GPU_DEVICES}" \
        --model.bptt_learning="true" \
        --model.ctx_len=512

# Back-Propagation Through Time (128) training

Perform a full 1-epoch training run with a training context size of 128. This forces every data sample to be split into 8 segments, via "Truncated Back-Propagation Through Time".
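
The segment counts follow directly from dividing the 1024-token data chunk by the training context size. A minimal sketch of that arithmetic is shown here; the helper name `split_into_segments` is hypothetical and is not taken from the trainer, which may segment its batches differently.

```python
def split_into_segments(tokens, ctx_len):
    """Split one token sequence into consecutive chunks of at most ctx_len tokens.

    Hypothetical helper for illustration only; not part of the trainer code.
    """
    return [tokens[i:i + ctx_len] for i in range(0, len(tokens), ctx_len)]


# Stand-in for a single 1024-token data chunk from the dataset
sample = list(range(1024))

for ctx_len in (1024, 512, 128):
    segments = split_into_segments(sample, ctx_len)
    print(f"ctx_len={ctx_len}: {len(segments)} segment(s)")

# ctx_len=1024: 1 segment(s)
# ctx_len=512: 2 segment(s)
# ctx_len=128: 8 segment(s)
```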

In [None]:
!cd ../../RWKV-v4neo && \
    export WANDB_MODE="{WANDB_MODE}" && \
    python3 new_train.py fit \
        -c ../notebook/trainer-validation/config/L6-D512-neox-1024.yaml \
        --trainer.logger.init_args.name="{WANDB_PREFIX} (bptt, train-ctx=128, data-ctx=1024, {DEEPSPEED_STRAT})" \
        --trainer.strategy="{DEEPSPEED_STRAT}" \
        --trainer.devices="{GPU_DEVICES}" \
        --model.bptt_learning="true" \
        --model.ctx_len=128

# Last 1 segmented (128) training

Perform a full 1-epoch training run with a training context size of 128, learning only from the last segment. (This replicates a previously known regression.)

In [None]:
!cd ../../RWKV-v4neo && \
    export WANDB_MODE="{WANDB_MODE}" && \
    python3 new_train.py fit \
        -c ../notebook/trainer-validation/config/L6-D512-neox-1024.yaml \
        --trainer.logger.init_args.name="{WANDB_PREFIX} (last-1-bptt, train-ctx=128, data-ctx=1024, {DEEPSPEED_STRAT})" \
        --trainer.strategy="{DEEPSPEED_STRAT}" \
        --trainer.devices="{GPU_DEVICES}" \
        --model.bptt_learning="true" \
        --model.bptt_learning_range=1 \
        --model.ctx_len=128

# Last 2 segmented (128) training

A variant that learns from the last segment plus the one before it (the last 2 segments), to compare how the loss curves differ.
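
To make the difference between the last-1 and last-2 runs concrete, the sketch below lists which of the 8 segments (1024 / 128) would be learned from for a given learning range. It only illustrates the descriptions in this notebook. The function `learning_segments` is hypothetical, and treating `None` as "learn from every segment" is a convention of this sketch, not a statement about how the trainer handles `bptt_learning_range` internally.

```python
def learning_segments(num_segments, learning_range=None):
    """Return the indices of the segments that would be learned from.

    Hypothetical illustration: learning_range=None means every segment,
    otherwise only the last `learning_range` segments are kept.
    """
    if learning_range is None or learning_range >= num_segments:
        return list(range(num_segments))
    return list(range(num_segments - learning_range, num_segments))


num_segments = 1024 // 128  # 8 segments per data sample

print(learning_segments(num_segments))     # full BPTT run: all 8 segments
print(learning_segments(num_segments, 1))  # last-1 run: [7]
print(learning_segments(num_segments, 2))  # last-2 run: [6, 7]
```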

In [None]:
!cd ../../RWKV-v4neo && \
    export WANDB_MODE="{WANDB_MODE}" && \
    python3 new_train.py fit \
        -c ../notebook/trainer-validation/config/L6-D512-neox-1024.yaml \
        --trainer.logger.init_args.name="{WANDB_PREFIX} (last-2-bptt, train-ctx=128, data-ctx=1024, {DEEPSPEED_STRAT})" \
        --trainer.strategy="{DEEPSPEED_STRAT}" \
        --trainer.devices="{GPU_DEVICES}" \
        --model.bptt_learning="true" \
        --model.bptt_learning_range=2 \
        --model.ctx_len=128