# Using HF datasets with infctx trainer

The infctx trainer shifts heavily toward an HF-focused dataset parser, which comes with several pros and cons. This notebook aims to cover all the common use cases for dataset handling and processing that this trainer code supports.

As this guide focuses only on the dataset configuration options, we will limit our training example runs to 16 data samples.

## Initial setup

Before we go into the dataset setup, let's create all the folders we need and initialize a small toy model, which we will use throughout the examples in this notebook.

In [7]:
# Setup the folders we will need
!mkdir -p ../../model/
!mkdir -p ../../datapath/
!mkdir -p ../../checkpoint/

# Initialize a simple L6-D512 model, for both the v4 neox (50277) tokenizer
!cd ../../RWKV-v4neo/ && python3 ./init_model.py --n_layer 6 --n_embd 512 --vocab_size neox ../model/L6-D512-neox-init.pth

# and rwkv world (65529) tokenizers
!cd ../../RWKV-v4neo/ && python3 ./init_model.py --n_layer 6 --n_embd 512 --vocab_size world ../model/L6-D512-world-init.pth

# If you have a custom vocab size, you can pass it as an integer instead
!cd ../../RWKV-v4neo/ && python3 ./init_model.py --n_layer 6 --n_embd 512 --vocab_size 20259 ../model/L6-D512-V20259-init.pth

[2023-07-29 07:11:05,106] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[RWKV.model] Running RWKV model using 'torch-jit' with torch '2.1.0.dev20230706'
---- Initializing model ----
No of layers: 6
Embedding size: 512
Output model path: ../model/L6-D512-neox-init.pth
Vocab size: 50277
---- ----- ----
Using /home/ubuntu/.cache/torch_extensions/py311_cu118 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/ubuntu/.cache/torch_extensions/py311_cu118/wkv_2048_bf16/build.ninja...
Building extension module wkv_2048_bf16...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module wkv_2048_bf16...
50277 512   -0.1 emb.weight
512   512   0    blocks.0.att.key.weight
512   512   1.0  blocks.0.att.value.weight
512   512   0    blocks.0.att.receptance.weight
512   512   0    blocks.0.att.output.weigh
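
The setup cell above passes `--vocab_size` either a tokenizer name or a plain integer. As a rough illustration of that convention, the sketch below maps the two names to the sizes quoted in the cell's comments and builds the matching `init_model.py` argument list. Both `resolve_vocab_size` and `init_model_command` are hypothetical helpers written for this note; they are not part of the trainer code, and how `init_model.py` resolves the value internally is not shown in this notebook.

```python
from typing import List, Union

# Vocab sizes quoted in the setup cell's comments (neox: 50277, world: 65529).
NAMED_VOCAB_SIZES = {"neox": 50277, "world": 65529}


def resolve_vocab_size(vocab: Union[str, int]) -> int:
    """Hypothetical helper: turn a tokenizer name or a custom size into an int."""
    if isinstance(vocab, str):
        key = vocab.strip().lower()
        # Known tokenizer names map to fixed sizes; anything else must parse as an int.
        size = NAMED_VOCAB_SIZES[key] if key in NAMED_VOCAB_SIZES else int(key)
    else:
        size = int(vocab)
    if size <= 0:
        raise ValueError(f"vocab size must be positive, got {size}")
    return size


def init_model_command(n_layer: int, n_embd: int,
                       vocab: Union[str, int], out_path: str) -> List[str]:
    """Hypothetical helper: build the argument list used in the setup cell."""
    return [
        "python3", "./init_model.py",
        "--n_layer", str(n_layer),
        "--n_embd", str(n_embd),
        "--vocab_size", str(vocab),
        out_path,
    ]


if __name__ == "__main__":
    for vocab in ("neox", "world", 20259):
        print(vocab, "->", resolve_vocab_size(vocab))
    print(" ".join(init_model_command(6, 512, "neox", "../model/L6-D512-neox-init.pth")))
```

The command is returned as a list rather than a shell string so it can be handed to `subprocess.run` without quoting concerns.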

## Training using a text dataset via Hugging Face

The `mini-enwiki.yaml` config (under `../notebook/dataset-config/`) holds the settings for using a text dataset via Hugging Face. The cells below preload the dataset with it and then run a short training pass.


In [8]:
# Let's preload the required dataset
!cd ../../RWKV-v4neo && python3 preload_dataset.py ../notebook/dataset-config/mini-enwiki.yaml

Found cached dataset parquet (/home/ubuntu/.cache/huggingface/datasets/teven___parquet/teven--enwiki_10k-de63a925546e70ab/0.0.0/14a00e99c0d15a23649d0db8944380ac81082d4b021f398733dd84f3a6c569a7)
100%|███████████████████████████████████████████| 1/1 [00:00<00:00, 1016.55it/s]
Loading cached processed dataset at /home/ubuntu/.cache/huggingface/datasets/teven___parquet/teven--enwiki_10k-de63a925546e70ab/0.0.0/14a00e99c0d15a23649d0db8944380ac81082d4b021f398733dd84f3a6c569a7/cache-d42d04c7a8a36da8_*_of_00032.arrow
Loading cached processed dataset at /home/ubuntu/.cache/huggingface/datasets/teven___parquet/teven--enwiki_10k-de63a925546e70ab/0.0.0/14a00e99c0d15a23649d0db8944380ac81082d4b021f398733dd84f3a6c569a7/cache-1dc161f5d8b6c045_*_of_00032.arrow
Loading cached processed dataset at /home/ubuntu/.cache/huggingface/datasets/teven___parquet/teven--enwiki_10k-de63a925546e70ab/0.0.0/14a00e99c0d15a23649d0db8944380ac81082d4b021f398733dd84f3a6c569a7/cache-a1b68feb26abfa73_*_of_00032.arrow
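
The workflow in this section is two commands run against the same config file: `preload_dataset.py`, then `new_train.py fit -c`. The sketch below chains them from Python, assuming both scripts are run from the `RWKV-v4neo` directory as in the notebook cells. `dataset_commands` and `run_pipeline` are hypothetical names introduced here for illustration, and `dry_run` defaults to printing the commands rather than executing them.

```python
import subprocess
from typing import List


def dataset_commands(config_path: str) -> List[List[str]]:
    """Hypothetical helper: the preload and training commands for one config."""
    return [
        ["python3", "preload_dataset.py", config_path],
        ["python3", "new_train.py", "fit", "-c", config_path],
    ]


def run_pipeline(config_path: str,
                 trainer_dir: str = "../../RWKV-v4neo",
                 dry_run: bool = True) -> List[List[str]]:
    """Run (or just print) the preload step followed by the training step."""
    commands = dataset_commands(config_path)
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops the pipeline if preloading fails,
            # so training never starts on a half-prepared dataset.
            subprocess.run(cmd, cwd=trainer_dir, check=True)
    return commands


if __name__ == "__main__":
    run_pipeline("../notebook/dataset-config/mini-enwiki.yaml")
```

Running the preload step first means the tokenized dataset is already cached when training starts, which is why the training log reports the cached dataset being found.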
        

In [15]:
# Validate that the source code and environment are working by doing a short 2-sample dry run
!cd ../../RWKV-v4neo && python3 new_train.py fit -c ../notebook/dataset-config/mini-enwiki.yaml

[2023-07-29 07:35:48,845] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[RWKV.model] Running RWKV model using 'torch-jit' with torch '2.1.0.dev20230706'
  rank_zero_warn(f"No seed found, seed set to {seed}")
Global seed set to 2583992680
Using /home/ubuntu/.cache/torch_extensions/py311_cu118 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/ubuntu/.cache/torch_extensions/py311_cu118/wkv_1024_bf16/build.ninja...
Building extension module wkv_1024_bf16...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module wkv_1024_bf16...
  rank_zero_warn(
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Found cached dataset parquet (/home/ubuntu/.cache/huggingface/datasets/teven___parquet/teven--en