# Using pre-tokenized binidx/numpy datasets with the infctx trainer

This notebook covers backward-compatibility support for the following binary dataset formats:

- binidx
- (NOT SUPPORTED) numpy

> Important note: These examples focus only on how to configure your dataset, and do not properly perform checkpointing. For trainer configurations, refer to the training notebooks.

## Initial setup

Before we go into the dataset setup, let's create all the folders we need, along with a small toy model that we will use throughout the examples in this notebook.

In [None]:
# Setup the folders we will need
!mkdir -p ../../model/
!mkdir -p ../../datapath/
!mkdir -p ../../checkpoint/

# Initialize a simple L6-D512 model, for both the v4 neox (50277) tokenizer
!cd ../../RWKV-v5/ && python3 ./init_model.py --n_layer 6 --n_embd 512 --vocab_size neox --skip-if-exists ../model/L6-D512-neox-init.pth

# and the rwkv world (65529) tokenizer
!cd ../../RWKV-v5/ && python3 ./init_model.py --n_layer 6 --n_embd 512 --vocab_size world --skip-if-exists ../model/L6-D512-world-init.pth

# If you have a custom vocab size, you can pass it as an integer instead
!cd ../../RWKV-v5/ && python3 ./init_model.py --n_layer 6 --n_embd 512 --vocab_size 20259 --skip-if-exists ../model/L6-D512-V20259-init.pth
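
The three `--vocab_size` values used above (two names and one integer) can be summarized in a small lookup. The sketch below is illustrative only: `resolve_vocab_size` and `NAMED_VOCAB_SIZES` are hypothetical names and are not taken from `init_model.py`; the two named sizes are simply the ones quoted in the comments above.

```python
# Hypothetical helper, NOT part of init_model.py: it only mirrors the
# values quoted in the cell above (neox = 50277, world = 65529).
NAMED_VOCAB_SIZES = {"neox": 50277, "world": 65529}


def resolve_vocab_size(value: str) -> int:
    """Map a --vocab_size style argument to an integer vocabulary size."""
    key = value.strip().lower()
    if key in NAMED_VOCAB_SIZES:
        return NAMED_VOCAB_SIZES[key]
    # Anything else is treated as a custom integer size
    size = int(key)  # raises ValueError for non-numeric input
    if size <= 0:
        raise ValueError(f"vocab size must be positive, got {size}")
    return size


if __name__ == "__main__":
    for arg in ("neox", "world", "20259"):
        print(arg, "->", resolve_vocab_size(arg))
```

Whichever value is used, the model's vocab size should match the tokenizer that produced the dataset you plan to train on.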

## Training using a binidx dataset

The following are the `example-binidx.yaml` settings, for using a pre-tokenized binidx dataset, with most of the comments removed.

---
```yaml
trainer:
  # Low max step limit, so that this dataset run can complete quickly
  max_steps: 10
  # Reasonable batch size, for a more realistic it/s rate
  target_batch_size: 32

model:
  load_model: ../model/L6-D512-neox-init.pth
  ctx_len: 1024
  lr_init: 3e-4

data:
  # Directory where the formatted HF dataset will be saved
  data_path: ../datapath/example-binidx/
  # Source here points to the binidx file to use (without the .bin / .idx suffix !!!)
  source: ../dataset/dataset-config/sample_data_text_document
  tokenizer: binidx
  test_split: 0.001
  test_split_shuffle: false
```
---
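
To get a feel for how small this run is, the arithmetic behind `max_steps` and `target_batch_size` can be sketched as below. This is a rough sketch under stated assumptions, not the trainer's actual code: it assumes `target_batch_size` is reached through gradient accumulation across devices and nodes, and that `max_steps` counts optimizer steps. The function `plan_run` and the `num_devices`, `num_nodes`, and `microbatch_size` parameters are hypothetical names introduced for illustration.

```python
def plan_run(target_batch_size: int, max_steps: int,
             num_devices: int = 1, num_nodes: int = 1,
             microbatch_size: int = 1) -> dict:
    """Estimate gradient accumulation and total samples for a short run.

    ASSUMPTION: the target batch size is split evenly across all devices
    and nodes, and the remainder is made up by accumulating gradients.
    """
    # Samples processed in one forward/backward pass across the cluster
    per_pass = num_devices * num_nodes * microbatch_size
    # Passes to accumulate before each optimizer step (at least one)
    accumulate = max(1, target_batch_size // per_pass)
    effective_batch = accumulate * per_pass
    return {
        "accumulate_grad_batches": accumulate,
        "effective_batch_size": effective_batch,
        "samples_seen": max_steps * effective_batch,
    }


if __name__ == "__main__":
    # Values from the config above, on a single GPU
    print(plan_run(target_batch_size=32, max_steps=10))
```

Under these assumptions the example config touches only a few hundred samples, which is why it finishes quickly and is unsuitable for judging model quality.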

### Let's download the example binidx files

In [None]:
# Setup the dataset dir
!mkdir -p ../../dataset/dataset-config/

# Download the binidx file
!cd ../../dataset/dataset-config/ && wget -nc https://huggingface.co/datasets/picocreator/RWKV-notebook-assets/resolve/main/wiki40b_world_text_document.bin
!cd ../../dataset/dataset-config/ && wget -nc https://huggingface.co/datasets/picocreator/RWKV-notebook-assets/resolve/main/wiki40b_world_text_document.idx
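
Because the config's `source` is given without the `.bin` / `.idx` suffix, it can be worth confirming that both files exist for the prefix before converting. The following is a minimal sketch using only the standard library; `missing_binidx_files` is a hypothetical helper, not part of the trainer, and the prefix in the usage section is the one downloaded above. It should match whatever your config's `source` points to.

```python
from pathlib import Path


def missing_binidx_files(prefix: str) -> list[str]:
    """Return the binidx files that are absent or empty for a given prefix."""
    missing = []
    for suffix in (".bin", ".idx"):
        # Append the suffix textually, since the prefix may itself contain dots
        path = Path(prefix + suffix)
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(str(path))
    return missing


if __name__ == "__main__":
    # Prefix of the files downloaded above, relative to the notebook directory
    prefix = "../../dataset/dataset-config/wiki40b_world_text_document"
    problems = missing_binidx_files(prefix)
    if problems:
        print("missing or empty:", problems)
    else:
        print("binidx pair looks complete")
```

This only checks that the files are present and non-empty; it does not validate the binidx contents.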

### And convert it to HF datapath format + save it

In [None]:
# Let's preload the required dataset
!cd ../../RWKV-v5 && python3 preload_datapath.py ../notebook/dataset-config/example-binidx.yaml

### Finally run the training process (with the HF datapath)

In [None]:
# Train using the converted binidx format
!cd ../../RWKV-v5 && python3 lightning_trainer.py fit -c ../notebook/dataset-config/example-binidx.yaml