# InfCtx trainer validation
The model being trained has the same settings as the Raven 1B5 model.
- Layer count: 24
- Embed size: 2048

The goal is to validate how the loss changes across runs that share the exact same hyperparameters:
- 1024 data chunk size
- same learning rate / weight decay / seed
- "teven/enwiki_10k" dataset, chunked to 1024-token sizes

The only change between runs is the training context size:
- 1024 context vs 128 context
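As a rough illustration of what the two context sizes mean for a single 1024-token data chunk, the sketch below computes how many training segments each setting needs. The helper name `segments_per_sample` is hypothetical and is not part of the trainer.

```python
import math

def segments_per_sample(data_ctx: int, train_ctx: int) -> int:
    """Number of training segments needed to cover one data sample.

    Hypothetical helper for illustration only; not part of the trainer code.
    """
    return math.ceil(data_ctx / train_ctx)

DATA_CTX = 1024  # data chunk size used by every run in this notebook

for train_ctx in (1024, 128):
    count = segments_per_sample(DATA_CTX, train_ctx)
    print(f"train-ctx={train_ctx}: {count} segment(s) per sample")
```

With a 1024 training context each sample fits in a single segment, while a 128 training context needs 8 segments per sample.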

> This project assumes you have the rwkv-infctx conda env set up, and that you are executing in that environment. See the main README.md for the conda env setup steps.

## Preparing the init model and test dataset

In [None]:
# First, let's set up the various directories and get the blank init model. This init model was generated
# using the original RWKV-LM repo (at the time of writing, this repo cannot init a model).
# As such, I have preinitialized these blank models and uploaded them to HF for convenience.
!mkdir -p ../../model/
!mkdir -p ../../datapath/
!mkdir -p ../../checkpoint/
!rm -rf ../../model/Echo-A-1B5-Init.pth
!cd ../../model/ && wget https://huggingface.co/picocreator/memory-size-experiment-for-rwkv/resolve/main/Echo-A-1B5-Init.pth
!ls -alh ../../model/Echo-A-1B5-Init.pth

In [9]:
# Let's preload the required dataset
!cd ../../RWKV-v4neo && python3 preload_dataset.py ../notebook/trainer-validation/infctx-validation-dryrun.yaml

Found cached dataset parquet (/home/picocreator/.cache/huggingface/datasets/teven___parquet/teven--enwiki_10k-de63a925546e70ab/0.0.0/14a00e99c0d15a23649d0db8944380ac81082d4b021f398733dd84f3a6c569a7)
100%|████████████████████████████████████████████| 1/1 [00:00<00:00, 986.66it/s]
Loading cached processed dataset at /home/picocreator/.cache/huggingface/datasets/teven___parquet/teven--enwiki_10k-de63a925546e70ab/0.0.0/14a00e99c0d15a23649d0db8944380ac81082d4b021f398733dd84f3a6c569a7/cache-3d43d1724bef83d7_*_of_00016.arrow
Loading cached processed dataset at /home/picocreator/.cache/huggingface/datasets/teven___parquet/teven--enwiki_10k-de63a925546e70ab/0.0.0/14a00e99c0d15a23649d0db8944380ac81082d4b021f398733dd84f3a6c569a7/cache-5033407f38c97f24.arrow
Loading cached processed dataset at /home/picocreator/.cache/huggingface/datasets/teven___parquet/teven--enwiki_10k-de63a925546e70ab/0.0.0/14a00e99c0d15a23649d0db8944380ac81082d4b021f398733dd84f3a6c569a7/cache-78e7f3a5f1679aa4_*_of_00016.arrow

# Trainer Code validation via dryrun

The following dryrun helps check that the existing trainer code changes are valid across 2 * 2 data samples.
It does not log the run to W&B.

In [6]:
# Validate that the source code and env are working, by doing a short 2 sample dryrun
!cd ../../RWKV-v4neo && python3 new_train.py fit -c ../notebook/trainer-validation/infctx-validation-dryrun.yaml

[2023-07-01 19:44:46,777] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Global seed set to 3941088705
Using /home/picocreator/.cache/torch_extensions/py311_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/picocreator/.cache/torch_extensions/py311_cu117/wkv_128_bf16/build.ninja...
Building extension module wkv_128_bf16...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] c++ -MMD -MF wkv_op_bf16.o.d -DTORCH_EXTENSION_NAME=wkv_128_bf16 -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/picocreator/anaconda3/envs/rwkv-exp/lib/python3.11/site-packages/torch/include -isystem /home/picocreator/anaconda3/envs/rwkv-exp/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /home/picocreator/anaconda3/env

# Baseline full context (1024) training

Perform a full 1 epoch training run with training context size = 1024, ensuring all data samples fit within the allocated training size.
> PS: Weights and biases logging is enabled

In [9]:
# Full training run
!cd ../../RWKV-v4neo && python3 new_train.py fit -c ../notebook/trainer-validation/infctx-validation-full.yaml

[2023-07-01 20:53:18,310] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Global seed set to 3941088705
[34m[1mwandb[0m: Currently logged in as: [33mpicocreator[0m. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Tracking run with wandb version 0.15.4
[34m[1mwandb[0m: Run data is saved locally in [35m[1m./wandb/run-20230701_205320-k8flu72z[0m
[34m[1mwandb[0m: Run [1m`wandb offline`[0m to turn off syncing.
[34m[1mwandb[0m: Syncing run [33minfctx-validation-full (train-ctx=1024, data-ctx=1024, bs=12)[0m
[34m[1mwandb[0m: ⭐️ View project at [34m[4mhttps://wandb.ai/picocreator/RWKV-InfCtx-Validation[0m
[34m[1mwandb[0m: 🚀 View run at [34m[4mhttps://wandb.ai/picocreator/RWKV-InfCtx-Validation/runs/k8flu72z[0m
Using /home/picocreator/.cache/torch_extensions/py311_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/picocreator/.cache/torch

# Segmented (128) training

Perform a full 1 epoch training run with training context size = 128, forcing every data sample to be split into 8 segments.
> PS: Weights and biases logging is enabled
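To illustrate why a segmented run can be expected to track the full-context baseline, here is a minimal toy sketch, not the trainer's actual code. It assumes a simple decaying recurrence as a stand-in for the model's recurrent state, and shows that processing a 1024-token sample in 128-token segments, while carrying the state from one segment to the next, reproduces the full-context outputs. The names `run_recurrence` and `split_segments` are hypothetical, and only the forward pass is modelled; how gradients are handled across segments is not shown.

```python
def run_recurrence(tokens, state=0.0, decay=0.5):
    """Toy recurrent update (a stand-in, NOT the real RWKV kernel).

    Returns the per-token outputs and the final state.
    """
    outputs = []
    for x in tokens:
        state = decay * state + x
        outputs.append(state)
    return outputs, state

def split_segments(tokens, segment_len):
    """Split one sample into consecutive segments of at most segment_len tokens."""
    return [tokens[i:i + segment_len] for i in range(0, len(tokens), segment_len)]

# One fake 1024-token sample (values are arbitrary illustration data)
sample = [float(i % 7) for i in range(1024)]

# Full-context pass: the whole sample in one go
full_out, _ = run_recurrence(sample)

# Segmented pass: 128 tokens at a time, carrying the state across segments
segments = split_segments(sample, 128)
seg_out, state = [], 0.0
for seg in segments:
    out, state = run_recurrence(seg, state)
    seg_out.extend(out)

print(len(segments))        # number of segments per sample
print(seg_out == full_out)  # segmented outputs match the full-context outputs
```

In this toy setting the match is exact because the same operations run in the same order; for the real trainer, the two W&B runs are what actually test whether the loss curves agree.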

In [None]:
# Full training run
!cd ../../RWKV-v4neo && python3 new_train.py fit -c ../notebook/trainer-validation/infctx-validation-segmented.yaml

[2023-07-01 20:53:00,724] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Global seed set to 3941088705
[34m[1mwandb[0m: Currently logged in as: [33mpicocreator[0m. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Tracking run with wandb version 0.15.4
[34m[1mwandb[0m: Run data is saved locally in [35m[1m./wandb/run-20230701_205302-qr7w1xkv[0m
[34m[1mwandb[0m: Run [1m`wandb offline`[0m to turn off syncing.
[34m[1mwandb[0m: Syncing run [33minfctx-validation-segmented (train-ctx=128, data-ctx=1024, bs=12)[0m
[34m[1mwandb[0m: ⭐️ View project at [34m[4mhttps://wandb.ai/picocreator/RWKV-InfCtx-Validation[0m
[34m[1mwandb[0m: 🚀 View run at [34m[4mhttps://wandb.ai/picocreator/RWKV-InfCtx-Validation/runs/qr7w1xkv[0m
Using /home/picocreator/.cache/torch_extensions/py311_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/picocreator/.cache/t

# Last segmented (128) training

Perform a full 1 epoch training run with training context size = 128, using only the last segment. (This replicates a previously known regression.)
> PS: Weights and biases logging is enabled

In [None]:
# Full training run
!cd ../../RWKV-v4neo && python3 new_train.py fit -c ../notebook/trainer-validation/infctx-validation-last-segment.yaml