# CUDA optimization notebook

The following is a quick and dirty log of various optimization experiments done on the CUDA kernel, along with their results.

Each run is done twice after every code change. The first run is aborted once compilation finishes, so that the compiled CUDA extension is cached and the second run starts from the cached build.
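
To make the "run twice" step less manual, one could check whether the compiled extension is already cached before starting a run. The following is a minimal sketch, not part of the trainer. The extensions root and module name are copied from the log output in this notebook. The `has_cached_build` helper is hypothetical, and the `<module_name>.so` file name is an assumption about how PyTorch lays out its build directory on Linux.

```python
from pathlib import Path


def has_cached_build(extensions_root: str, module_name: str) -> bool:
    """Return True if a compiled shared object for `module_name` already exists.

    Assumption: the extension is built into `<extensions_root>/<module_name>/`
    and the finished build leaves a `<module_name>.so` file there.
    """
    build_dir = Path(extensions_root).expanduser() / module_name
    return (build_dir / f"{module_name}.so").is_file()


if __name__ == "__main__":
    # Paths taken from the "Using ... as PyTorch extensions root" log line.
    root = "/home/ubuntu/.cache/torch_extensions/py311_cu117"
    print(has_cached_build(root, "wkv_4096_bf16"))
```

If this returns `False` after a code change, the next run will spend its first minutes in `nvcc`, which is the run that gets aborted here.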

## Preparing the init model and test dataset

In [1]:
# First, let's set up the various directories and get the blank init model. This init model was generated
# using the original RWKV-LM repo (at the time of writing, this repo cannot init a model).
# As such, I have preinitialized these blank models and uploaded them to HF for convenience.
!mkdir -p ../../model/
!mkdir -p ../../datapath/
!mkdir -p ../../checkpoint/
!rm -rf ../../model/Echo-A-1B5-Init.pth
!cd ../../model/ && wget https://huggingface.co/picocreator/memory-size-experiment-for-rwkv/resolve/main/Echo-A-1B5-Init.pth
!ls -alh ../../model/Echo-A-1B5-Init.pth

--2023-07-05 01:12:44--  https://huggingface.co/picocreator/memory-size-experiment-for-rwkv/resolve/main/Echo-A-1B5-Init.pth
Resolving huggingface.co (huggingface.co)... 99.84.108.70, 99.84.108.129, 99.84.108.55, ...
Connecting to huggingface.co (huggingface.co)|99.84.108.70|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://cdn-lfs.huggingface.co/repos/cb/ef/cbef09abb2634a3375b28868bffa285226dfeabedec89b28c2fb302221164d66/0ec7214ed16737a6348254e6f96d8cdc04d3b5efbd5f53fe9337607ea42b5b9f?response-content-disposition=attachment%3B+filename*%3DUTF-8%27%27Echo-A-1B5-Init.pth%3B+filename%3D%22Echo-A-1B5-Init.pth%22%3B&Expires=1688778765&Policy=eyJTdGF0ZW1lbnQiOlt7IlJlc291cmNlIjoiaHR0cHM6Ly9jZG4tbGZzLmh1Z2dpbmdmYWNlLmNvL3JlcG9zL2NiL2VmL2NiZWYwOWFiYjI2MzRhMzM3NWIyODg2OGJmZmEyODUyMjZkZmVhYmVkZWM4OWIyOGMyZmIzMDIyMjExNjRkNjYvMGVjNzIxNGVkMTY3MzdhNjM0ODI1NGU2Zjk2ZDhjZGMwNGQzYjVlZmJkNWY1M2ZlOTMzNzYwN2VhNDJiNWI5Zj9yZXNwb25zZS1jb250ZW50LWRpc3Bvc2l0aW9uPSoiLCJDb25ka

In [2]:
# Let's preload the required dataset
!cd ../../RWKV-v4neo && python3 preload_dataset.py ../notebook/trainer-validation/cuda-optimization.yaml

Found cached dataset parquet (/home/ubuntu/.cache/huggingface/datasets/teven___parquet/teven--enwiki_10k-de63a925546e70ab/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
100%|████████████████████████████████████████████| 1/1 [00:00<00:00, 983.42it/s]
                                                                                

# 1000 samples - Baseline without optimization changes

In [13]:
!cd ../../RWKV-v4neo && python3 new_train.py fit -c ../notebook/trainer-validation/cuda-optimization.yaml

Setting ds_accelerator to cuda (auto detect)
Global seed set to 3941088705
Using /home/ubuntu/.cache/torch_extensions/py311_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/ubuntu/.cache/torch_extensions/py311_cu117/wkv_4096_bf16/build.ninja...
Building extension module wkv_4096_bf16...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module wkv_4096_bf16...
  rank_zero_warn(
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Found cached dataset parquet (/home/ubuntu/.cache/huggingface/datasets/teven___parquet/teven--enwiki_10k-de63a925546e70ab/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
100%|████████████████████████████████████████████| 1/1 [00:00<00:00, 956.73it/s]
Loading cached processed dataset

# With `--default-stream per-thread` and `-arch=native`
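
For context, here is a rough sketch of how these two `nvcc` flags could be appended to the CUDA flags of a JIT-compiled PyTorch extension. It is not the trainer's actual code. `build_cuda_flags`, `load_wkv_kernel`, the `-O3` base flag, and the source file arguments are hypothetical placeholders; only `torch.utils.cpp_extension.load` and its `extra_cuda_cflags` parameter are real PyTorch API. Changing the flags invalidates the cached build, which is why the log for this run shows `nvcc` being invoked instead of `ninja: no work to do.`

```python
from typing import List, Sequence

# The two nvcc flags under test in this section.
EXPERIMENT_FLAGS = ("--default-stream per-thread", "-arch=native")


def build_cuda_flags(base_flags: Sequence[str],
                     extra_flags: Sequence[str] = EXPERIMENT_FLAGS) -> List[str]:
    """Return the base nvcc flags followed by the experimental ones."""
    return list(base_flags) + list(extra_flags)


def load_wkv_kernel(sources: Sequence[str],
                    name: str = "wkv_4096_bf16",
                    base_flags: Sequence[str] = ("-O3",)):
    """JIT-compile the kernel with the experimental flags (needs PyTorch and CUDA).

    `sources` is a hypothetical list of .cpp/.cu paths; the real file names
    live in the trainer repo and are not reproduced here.
    """
    # Imported lazily so build_cuda_flags works without PyTorch installed.
    from torch.utils.cpp_extension import load

    return load(
        name=name,
        sources=list(sources),
        extra_cuda_cflags=build_cuda_flags(base_flags),
        verbose=True,  # prints the ninja / nvcc build log
    )
```

Keeping the flag list in one helper makes it easy to switch an experiment on or off between runs without touching the rest of the build call.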

In [1]:
!cd ../../RWKV-v4neo && python3 new_train.py fit -c ../notebook/trainer-validation/cuda-optimization.yaml

Setting ds_accelerator to cuda (auto detect)
Global seed set to 3941088705
Using /home/ubuntu/.cache/torch_extensions/py311_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/ubuntu/.cache/torch_extensions/py311_cu117/wkv_4096_bf16/build.ninja...
Building extension module wkv_4096_bf16...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=wkv_4096_bf16 -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/ubuntu/anaconda3/envs/rwkv-exp/lib/python3.11/site-packages/torch/include -isystem /home/ubuntu/anaconda3/envs/rwkv-exp/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /home/ubuntu/anaconda3/envs/rwkv-exp/lib/python3.11/site-packages/torch/include/TH -isystem /home/ubuntu/anaconda3/envs/rwkv-e