# InfCtx torch.compile performance uplift
The following trainer validation, is used to compare performance differences between the following optimizations

- torch native
- torch + JIT
- torch + torch.compile

It presumes that basic setup has been done as per
- `./baseline-setup.ipynb`

## Install the nightly build within conda
(Skip if you already have 2.0.2, or already done the setup)

For torch.compile, as of 8th July 2023, you will need the torch nightly build for several fixes we depend on. This is expected to be resolved when merged in for torch 2.0.2 (you will need to call this outside the notebook)

```bash
conda activate rwkv-infctx
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch-nightly -c nvidia
```

To simplify the benchmarking, we are intentionally only performing
- a 100 trainer/global_step
- of 10 gradient accumulation * gpu
- no checkpoint save to disk

This would (on a single GPU) perform the run over
- a 1000 data samples

## Configure and apply your preferred settings
( note you will need to rerun this cell, if you restart your env )

In [21]:
DEEPSPEED_STRAT="deepspeed_stage_2_offload"
ENABLE_WANDB=False

print("DEEPSPEED_STRAT:", DEEPSPEED_STRAT)
print("ENABLE_WANDB:", ENABLE_WANDB)

DEEPSPEED_STRAT: deepspeed_stage_2_offload
ENABLE_WANDB: False


## Run, and get baseline timing (no JIT / torch compile)

In [23]:
%%bash -s "$DEEPSPEED_STRAT" "$ENABLE_WANDB"

export RWKV_TORCH_COMPILE=0 
export RWKV_JIT_ON=0
if [ "$2" = "False" ]; then
    export WANDB_MODE="disabled"
fi

cd ../../RWKV-v4neo

python3 new_train.py fit \
    -c ../notebook/trainer-validation/config-baseline.yaml \
    --trainer.strategy="$1" \
    --trainer.logger.init_args.name="infctx-validation-baseline (OS environment vars: RWKV_JIT_ON=False, RWKV_TORCH_COMPILE=False)" \
    --trainer.max_epochs=-1 --trainer.max_steps:1000

WANDB: False


[2023-07-07 18:25:17,089] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[RWKV.model] Running RWKV model with : torch-native


Global seed set to 3941088705
Using /home/ubuntu/.cache/torch_extensions/py311_cu118 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/ubuntu/.cache/torch_extensions/py311_cu118/wkv_128_bf16/build.ninja...
Building extension module wkv_128_bf16...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)


ninja: no work to do.


Loading extension module wkv_128_bf16...
  rank_zero_warn(
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Found cached dataset parquet (/home/ubuntu/.cache/huggingface/datasets/teven___parquet/teven--enwiki_10k-de63a925546e70ab/0.0.0/14a00e99c0d15a23649d0db8944380ac81082d4b021f398733dd84f3a6c569a7)
100%|██████████| 1/1 [00:00<00:00, 906.48it/s]
Loading cached processed dataset at /home/ubuntu/.cache/huggingface/datasets/teven___parquet/teven--enwiki_10k-de63a925546e70ab/0.0.0/14a00e99c0d15a23649d0db8944380ac81082d4b021f398733dd84f3a6c569a7/cache-85ed41912c749812_*_of_00032.arrow
Loading cached processed dataset at /home/ubuntu/.cache/huggingface/datasets/teven___parquet/teven--enwiki_10k-de63a925546e70ab/0.0.0/14a00e99c0d15a23649d0db8944380ac81082d4b021f398733dd84f3a6c569a7/cache-d919d919e3a12608_*_of_00032.arrow
Loading cached processed dataset at /home/ubuntu/.cache/huggingface/da



Enabling DeepSpeed BF16.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Using /home/ubuntu/.cache/torch_extensions/py311_cu118 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/ubuntu/.cache/torch_extensions/py311_cu118/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)


ninja: no work to do.


Loading extension module cpu_adam...


Time to load cpu_adam op: 2.3277952671051025 seconds
Rank: 0 partition count [1, 1, 1] and sizes[(1515008000, False), (49152, False), (49152, False)] 



  | Name   | Type       | Params
--------------------------------------
0 | emb    | Embedding  | 102 M 
1 | blocks | ModuleList | 1.3 B 
2 | ln_out | LayerNorm  | 4.1 K 
3 | head   | Linear     | 102 M 
--------------------------------------
1.5 B     Trainable params
0         Non-trainable params
1.5 B     Total params
6,060.425 Total estimated model params size (MB)
  overflow_gpu = get_accelerator().ByteTensor([overflow])
`Trainer.fit` stopped: `max_steps=2` reached.


Epoch 0:   0%|          | 4/5308 [01:01<22:28:21, 15.25s/it, v_num=10, train/loss=9.620]
