# Using pre-tokenized binidx/numpy datasets with the infctx trainer

This notebook covers backward-compatibility support for the following binary dataset formats:

- binidx
- (NOT SUPPORTED) numpy

> Important note: This example focuses only on how to configure your dataset, and does not properly perform checkpointing. For trainer configurations, refer to the training notebooks.

## Initial setup

Before we go into the dataset setup, let's create the folders we need and initialize a small toy model, which we will use throughout the examples in this notebook.

In [1]:
# Setup the folders we will need
!mkdir -p ../../model/
!mkdir -p ../../datapath/
!mkdir -p ../../checkpoint/

# Initialize a simple L6-D512 model for the v4 neox (50277) tokenizer
!cd ../../RWKV-v4neo/ && python3 ./init_model.py --n_layer 6 --n_embd 512 --vocab_size neox --skip-if-exists ../model/L6-D512-neox-init.pth

# and another for the rwkv world (65529) tokenizer
!cd ../../RWKV-v4neo/ && python3 ./init_model.py --n_layer 6 --n_embd 512 --vocab_size world --skip-if-exists ../model/L6-D512-world-init.pth

# If you have a custom vocab size, you can pass it as an int instead
!cd ../../RWKV-v4neo/ && python3 ./init_model.py --n_layer 6 --n_embd 512 --vocab_size 20259 --skip-if-exists ../model/L6-D512-V20259-init.pth

[2023-08-04 11:06:02,253] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[RWKV.model] Running RWKV model using 'torch-jit' with torch '2.0.1'
---- Initializing model ----
No of layers: 6
Embedding size: 512
Output model path: ../model/L6-D512-neox-init.pth
Vocab size: 50277
---- ----- ----
Model exists, skipping init_model
[2023-08-04 11:06:05,508] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[RWKV.model] Running RWKV model using 'torch-jit' with torch '2.0.1'
---- Initializing model ----
No of layers: 6
Embedding size: 512
Output model path: ../model/L6-D512-world-init.pth
Vocab size: 65529
---- ----- ----
Model exists, skipping init_model
[2023-08-04 11:06:08,808] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[RWKV.model] Running RWKV model using 'torch-jit' with torch '2.0.1'
---- Initializing model ----
No of layers: 6
Embedding size: 512
O
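
The `--vocab_size` argument in the cell above takes either a tokenizer name or an integer. As a rough sketch of that mapping: the helper `resolve_vocab_size` below is hypothetical and is not taken from `init_model.py`; the two named sizes are the ones reported in the "Vocab size" log lines above.

```python
# Hypothetical helper, for illustration only: it is NOT part of init_model.py.
# The named sizes are copied from the "Vocab size" lines in the log output.
NAMED_VOCAB_SIZES = {"neox": 50277, "world": 65529}


def resolve_vocab_size(value: str) -> int:
    """Map a --vocab_size argument (tokenizer name or integer string) to an int."""
    key = value.strip().lower()
    if key in NAMED_VOCAB_SIZES:
        return NAMED_VOCAB_SIZES[key]
    size = int(key)  # raises ValueError for anything that is not an integer
    if size <= 0:
        raise ValueError(f"vocab size must be positive, got {size}")
    return size


for arg in ("neox", "world", "20259"):
    print(arg, "->", resolve_vocab_size(arg))
```

The point is only that the vocab size of the init model has to agree with the tokenizer that produced the dataset you train on.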

## Training using a binidx dataset

The following are the `example-binidx.yaml` settings for training on a pre-tokenized binidx dataset, with most of the comments removed.

```yaml
trainer:
  # Low max step limit, so that this dataset run can complete quickly
  max_steps: 10
  # Reasonable batch size, for a more realistic it/s rate
  target_batch_size: 32

model:
  load_model: ../model/L6-D512-neox-init.pth
  ctx_len: 1024
  lr_init: 3e-4

data:
  # Directory where the formatted HF dataset will be saved
  data_path: ../datapath/example-binidx/
  # Source here points to the binidx file to use (without the .bin / .idx suffix !!!)
  source: ../dataset/dataset-config/sample_data_text_document
  tokenizer: binidx
  test_split: 0
  test_split_shuffle: false
```
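
Because `source` is given without the `.bin` / `.idx` suffix, a typo in the prefix is easy to miss. A minimal sketch of a pre-flight check, assuming only that both files sit side by side under the same prefix; `missing_binidx_files` is a hypothetical helper and not part of the trainer:

```python
from pathlib import Path


def missing_binidx_files(source: str) -> list[str]:
    """Return the binidx files that are missing for a suffix-less source prefix."""
    # Append the suffix to the full prefix string. Path.with_suffix is avoided
    # because it would replace any dotted part already present in the name.
    candidates = [f"{source}{ext}" for ext in (".bin", ".idx")]
    return [path for path in candidates if not Path(path).is_file()]


# Prefix copied from the `source` setting above, relative to the trainer directory.
missing = missing_binidx_files("../dataset/dataset-config/sample_data_text_document")
print("missing files:", missing or "none")
```

Running this from the directory the trainer is launched in shows whether the prefix resolves to a real `.bin` / `.idx` pair before any preloading starts.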

### Let's download the example binidx files

In [3]:
# Setup the dataset dir
!mkdir -p ../../dataset/dataset-config/

# Download the binidx file
!cd ../../dataset/dataset-config/ && wget -nc https://huggingface.co/datasets/picocreator/RWKV-notebook-assets/resolve/main/wiki40b_world_text_document.bin
!cd ../../dataset/dataset-config/ && wget -nc https://huggingface.co/datasets/picocreator/RWKV-notebook-assets/resolve/main/wiki40b_world_text_document.idx

File ‘wiki40b_world_text_document.bin’ already there; not retrieving.

File ‘wiki40b_world_text_document.idx’ already there; not retrieving.
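
To sanity-check a downloaded `.idx` file, you can peek at its header. The sketch below assumes the index follows the Megatron-style `MMapIndexedDataset` layout (a 9-byte magic, then a version, a dtype code, and two counts). That layout is an assumption here and is not stated in this notebook, so verify it against the loader you actually use. `read_idx_header` is a hypothetical helper.

```python
import os
import struct

# ASSUMPTION: Megatron-style MMapIndexedDataset index layout.
IDX_MAGIC = b"MMIDIDX\x00\x00"
# Integer dtype codes commonly used for token ids (assumed; others are possible).
DTYPE_CODES = {1: "uint8", 2: "int8", 3: "int16", 4: "int32", 5: "int64", 8: "uint16"}


def read_idx_header(path: str) -> dict:
    """Read the fixed-size header fields from the start of an .idx file."""
    with open(path, "rb") as f:
        magic = f.read(len(IDX_MAGIC))
        if magic != IDX_MAGIC:
            raise ValueError(f"{path}: unexpected magic bytes {magic!r}")
        # Little-endian: uint64 version, uint8 dtype code, two uint64 counts.
        (version,) = struct.unpack("<Q", f.read(8))
        (dtype_code,) = struct.unpack("<B", f.read(1))
        (num_sequences,) = struct.unpack("<Q", f.read(8))
        (doc_idx_len,) = struct.unpack("<Q", f.read(8))
    return {
        "version": version,
        "dtype": DTYPE_CODES.get(dtype_code, f"unknown({dtype_code})"),
        "num_sequences": num_sequences,
        "doc_idx_len": doc_idx_len,
    }


idx_path = "../../dataset/dataset-config/wiki40b_world_text_document.idx"
if os.path.exists(idx_path):  # only inspect the file if it was downloaded
    print(read_idx_header(idx_path))
```

If the magic bytes do not match, the file is either corrupted or uses a different index format than the one assumed here.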



### Then convert it to the HF datapath format and save it

In [5]:
# Let's preload the required dataset
!cd ../../RWKV-v4neo && python3 preload_datapath.py ../notebook/dataset-config/example-binidx.yaml

[2023-08-04 11:15:01,340] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Downloading and preparing dataset generator/default to /home/picocreator/.cache/huggingface/datasets/generator/default-b8afac7cc2da0cb8/0.0.0...
Dataset generator downloaded and prepared to /home/picocreator/.cache/huggingface/datasets/generator/default-b8afac7cc2da0cb8/0.0.0. Subsequent calls will reuse this data.
                                                                                

### Finally run the training process (with the HF datapath)

In [8]:
# Train using the converted binidx format
!cd ../../RWKV-v4neo && python3 lightning_trainer.py fit -c ../notebook/dataset-config/example-binidx.yaml

[2023-08-04 11:18:25,122] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[RWKV.model] Running RWKV model using 'torch-jit' with torch '2.0.1'
  rank_zero_warn(
  rank_zero_warn(f"No seed found, seed set to {seed}")
Global seed set to 1182591630
[RWKV.model]: Preloading model from '../model/L6-D512-world-init.pth'
Using /home/picocreator/.cache/torch_extensions/py311_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/picocreator/.cache/torch_extensions/py311_cu117/wkv_1024_bf16/build.ninja...
Building extension module wkv_1024_bf16...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=wkv_1024_bf16 -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/picocreator/anaconda3/en