# Echo-B 1B4 (Memory Finetune)
This continues off from `Echo-B-1B4-basemodel.ipynb` to perform the full memory finetune & testing process

This is done generally in 3 Tune stages
- Tune 1: Low ctx size (512), Training with only the input masked. This does very limited memory training, and is used primarily to train the instruction set.
- Tune 2: Low ctx size (512), Training with instruction & input masked. This forces the actual memory training on the output tokens.
- Tune 3: Mid ctx size (1024), stage 2, scaled up to 1024 context sizes.

In all cases, the input tokens is always masked. And we intentionally use the limited word set for memory training, which matches the same wordset used in the original memory evaluation of raven pretrained models. This is intentional to serve as both consistent comparision between experiments, and resonable training time.

One of the issue faced previously with an excessive large word set, is that the model would be required to see "new words" atleast a few time before being able to train the memory process. This drastically slowed down the process as the large word list meant the model was constantly spending time learning new words (instead of memory training).

If we want to increase the number / type of words the model can handle for memory training, that can be done later as a stage 4 memory tune if needed. But that exceeds the current requirements for the memory experiment process.

> This project assumes you have the rwkv-infctx conda env setup, and you are executing in that environment - see the main README.md for the conda env setup steps

## Optional: Download the pretrained model
(if you want to skip the the basemodel train + instruct tune)


In [1]:
# # Init required dirs
# !mkdir -p ../../../model/
# !mkdir -p ../../../datapath/
# !mkdir -p ../../../checkpoint/

# # Download the Stage2.pth file
# !rm -rf ../../../model/Echo-B-1B4-Stage2.pth
# !cd ../../../model/ && wget https://huggingface.co/picocreator/memory-size-experiment-for-rwkv/resolve/main/Echo-B-1B4-Stage2.pth
# !ls -alh ../../../model/Echo-B-1B4-Stage2.pth

## Configure your environment settings
(!Important: you will need to rerun the below cell, if you restart your kernel)

In [3]:
DEEPSPEED_STRAT="deepspeed_stage_1"
GPU_DEVICES="auto"
ENABLE_WANDB=True
WANDB_PREFIX="(8x3090) Echo-B-1B4"

print("DEEPSPEED_STRAT:", DEEPSPEED_STRAT)
print("ENABLE_WANDB:", ENABLE_WANDB)
print("GPU_DEVICES:", GPU_DEVICES)

if ENABLE_WANDB:
    WANDB_MODE="online"
else:
    WANDB_MODE="disabled"

# Computing the notebook, and various paths
import os
NOTEBOOK_DIR=os.path.dirname(os.path.abspath("__file__"))
PROJECT_DIR=os.path.abspath(os.path.join(NOTEBOOK_DIR, "../../../"))
TRAINER_DIR=os.path.abspath(os.path.join(PROJECT_DIR, "./RWKV-v4neo/"))

print("NOTEBOOK_DIR:", NOTEBOOK_DIR)
print("TRAINER_DIR:", TRAINER_DIR)
print("PROJECT_DIR:", PROJECT_DIR)

DEEPSPEED_STRAT: deepspeed_stage_1
ENABLE_WANDB: True
GPU_DEVICES: auto
NOTEBOOK_DIR: /home/picocreator/rwkv-proj/picocreator-memory-experiment/notebook/experiment/memory-enwiki-v2
TRAINER_DIR: /home/picocreator/rwkv-proj/picocreator-memory-experiment/RWKV-v4neo
PROJECT_DIR: /home/picocreator/rwkv-proj/picocreator-memory-experiment


## Tune 1 : Simple Memory instruct finetuning

- Tune 1: Low ctx size (512), Training with only the input masked. This does very limited memory training, and is used primarily to train the instruction set.

In [3]:
%%script bash

########################################
# Generate the required jsonl dataset
########################################

# Reset the dataset dir
mkdir -p ./dataset
rm -rf ./dataset/*.jsonl

# Generate the various datasets
echo "## Generating word reptition dataset ##"

# We do a strong bias for smaller word count, to teach the concept from scratch
# so that the model can learn the function. 
#
# Note that all document samples, are randomized between the target word count, 
# to half of the target word count.
python ./memory_script/gen_limited_segmented_jsonl.py ./dataset/word-2-count.jsonl  2  5000 &
python ./memory_script/gen_limited_segmented_jsonl.py ./dataset/word-5-count.jsonl  5  5000 &
python ./memory_script/gen_limited_segmented_jsonl.py ./dataset/word-10-count.jsonl 10 2500 &
python ./memory_script/gen_limited_segmented_jsonl.py ./dataset/word-15-count.jsonl 15 2500 &
python ./memory_script/gen_limited_segmented_jsonl.py ./dataset/word-20-count.jsonl 20 2500 &
python ./memory_script/gen_limited_segmented_jsonl.py ./dataset/word-25-count.jsonl 25 2500 &
python ./memory_script/gen_limited_segmented_jsonl.py ./dataset/word-40-count.jsonl 40 2500 &
python ./memory_script/gen_limited_segmented_jsonl.py ./dataset/word-50-count.jsonl 50 2500 &
python ./memory_script/gen_limited_segmented_jsonl.py ./dataset/word-60-count.jsonl 80 2500 &
python ./memory_script/gen_limited_segmented_jsonl.py ./dataset/word-80-count.jsonl 80 2500 &

# With a slight mix of the larger word count
python ./memory_script/gen_limited_segmented_jsonl.py ./dataset/word-100-count.jsonl 100 2500 &
python ./memory_script/gen_limited_segmented_jsonl.py ./dataset/word-200-count.jsonl 200 2500 &

wait
echo "## Done ##"

ls -alh ./dataset/

## Generating word reptition dataset ##
Generated JSONL file with - 10 max words, 2500 samples - at ./dataset/word-10-count.jsonl
Generated JSONL file with - 15 max words, 2500 samples - at ./dataset/word-15-count.jsonl
Generated JSONL file with - 20 max words, 2500 samples - at ./dataset/word-20-count.jsonl
Generated JSONL file with - 2 max words, 5000 samples - at ./dataset/word-2-count.jsonl
Generated JSONL file with - 25 max words, 2500 samples - at ./dataset/word-25-count.jsonl
Generated JSONL file with - 5 max words, 5000 samples - at ./dataset/word-5-count.jsonl
Generated JSONL file with - 40 max words, 2500 samples - at ./dataset/word-40-count.jsonl
Generated JSONL file with - 50 max words, 2500 samples - at ./dataset/word-50-count.jsonl
Generated JSONL file with - 80 max words, 2500 samples - at ./dataset/word-80-count.jsonl
Generated JSONL file with - 80 max words, 2500 samples - at ./dataset/word-60-count.jsonl
Generated JSONL file with - 100 max words, 2500 samples - at ./d

In [4]:
# Lets pre tokenize the requried dataset
!cd "{TRAINER_DIR}" && \
    python3 preload_dataset.py "{NOTEBOOK_DIR}/Echo-B-1B4-mem-finetune-1.yaml"

# Ensure the checkpoint directory exists
!cd "{TRAINER_DIR}" && mkdir -p "../checkpoint/Echo-B-1B4-mem-finetune-1/"

Downloading and preparing dataset json/default to /root/.cache/huggingface/datasets/json/default-0960727d51f49ca1/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96...
Downloading data files: 100%|███████████████████| 1/1 [00:00<00:00, 6482.70it/s]
Extracting data files: 100%|█████████████████████| 1/1 [00:00<00:00, 342.50it/s]
Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/default-0960727d51f49ca1/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96. Subsequent calls will reuse this data.
100%|████████████████████████████████████████████| 1/1 [00:00<00:00, 200.77it/s]
                                                                                

In [5]:
# Start the finetune model training
!cd "{TRAINER_DIR}" && \
    export WANDB_MODE="{WANDB_MODE}" && \
    python new_train.py fit \
        -c "{NOTEBOOK_DIR}/Echo-B-1B4-mem-finetune-1.yaml" \
        --trainer.logger.init_args.name="{WANDB_PREFIX} - Mem-Finetune-1 (bs=256, train-ctx=512, {DEEPSPEED_STRAT})" \
        --trainer.strategy="{DEEPSPEED_STRAT}" \
        --trainer.devices="{GPU_DEVICES}"  \
        --model.ctx_len=512

Setting ds_accelerator to cuda (auto detect)
[RWKV.model] Running RWKV model using 'torch-jit' with torch '2.0.1+cu118'
  rank_zero_warn(f"No seed found, seed set to {seed}")
Global seed set to 4190240465
[34m[1mwandb[0m: Currently logged in as: [33mpicocreator[0m. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: wandb version 0.15.5 is available!  To upgrade, please run:
[34m[1mwandb[0m:  $ pip install wandb --upgrade
[34m[1mwandb[0m: Tracking run with wandb version 0.15.4
[34m[1mwandb[0m: Run data is saved locally in [35m[1m./wandb/run-20230714_044445-5f9aziq0[0m
[34m[1mwandb[0m: Run [1m`wandb offline`[0m to turn off syncing.
[34m[1mwandb[0m: Syncing run [33m(8x3090) Echo-B-1B4 - Mem-Finetune-1 (bs=256, train-ctx=512, deepspeed_stage_1)[0m
[34m[1mwandb[0m: ⭐️ View project at [34m[4mhttps://wandb.ai/picocreator/RWKV-Memory-Experiment[0m
[34m[1mwandb[0m: 🚀 View run at [34m[4mhttps://wandb.ai/picocreator/RWKV-Memory-Experimen

In [6]:
# Lets export the model from the checkpoint
!cd "{TRAINER_DIR}" && \
    python export_checkpoint.py \
        "../checkpoint/Echo-B-1B4-mem-finetune-1/last.ckpt" \
        "../model/Echo-B-1B4-Tune1.pth"
!cd "{TRAINER_DIR}" && ls -alh ../model/Echo-B-1B4-Tune1.pth

Setting ds_accelerator to cuda (auto detect)
Processing zero checkpoint '../checkpoint/Echo-B-1B4-mem-finetune-1/last.ckpt/checkpoint'
Detected checkpoint of type zero stage ZeroStageEnum.optimizer_states, world_size: 8
Parsing checkpoint created by deepspeed==0.9.3
Reconstructed fp32 state dict with 1734 params 1412675584 elements
Saving fp32 state dict to ../model/Echo-B-1B4-Tune1.pth
-rw-r--r-- 1 root root 5.3G Jul 14 05:27 ../model/Echo-B-1B4-Tune1.pth


In [7]:
# Lets do a memory eval
#
# Note that the expected performance "is not that great", as the model seems to be only loosely
# learning the memorization task, and the instruction propmt. And is seem to be acting more
# like an RNG based on the instruct. (Instead of the actual memorization task)
!python3 ./memory_script/eval_memory_guided.py "{PROJECT_DIR}/model/Echo-B-1B4-Tune1.pth"

Using /root/.cache/torch_extensions/py311_cu118 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py311_cu118/wkv_cuda/build.ninja...
Building extension module wkv_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module wkv_cuda...
RWKV_JIT_ON 1 RWKV_CUDA_ON 1 RESCALE_LAYER 0

Loading /root/picocreator-memory-experiment/model/Echo-B-1B4-Tune1.pth ...
Strategy: (total 96+1=97 layers)
* cuda [float32, float32], store 97 layers
0-cuda-float32-float32 1-cuda-float32-float32 2-cuda-float32-float32 3-cuda-float32-float32 4-cuda-float32-float32 5-cuda-float32-float32 6-cuda-float32-float32 7-cuda-float32-float32 8-cuda-float32-float32 9-cuda-float32-float32 10-cuda-float32-float32 11-cuda-float32-float32 12-cuda-float32-float32 13-cuda-float32-float32 14-cuda-float32-float32 15-cuda-float32-float32 16-cuda-fl

## Tune 2 : Low ctx size (512), memory training

- Tune 2: Low ctx size (512), Training with instruction & input masked. This forces the actual memory training on the output tokens.

In [9]:
%%script bash

########################################
# Generate the required jsonl dataset
########################################

# Reset the dataset dir
mkdir -p ./dataset
rm -rf ./dataset/*.jsonl

# Generate the various datasets
echo "## Generating word reptition dataset ##"

#
# We switch over to fully masked instruct+input, to properly learn the memorization task
#
python ./memory_script/gen_limited_prompt_completion_jsonl.py ./dataset/word-2-count.jsonl  2  5000 &
python ./memory_script/gen_limited_prompt_completion_jsonl.py ./dataset/word-5-count.jsonl  5  5000 &
python ./memory_script/gen_limited_prompt_completion_jsonl.py ./dataset/word-10-count.jsonl 10 10000 &
python ./memory_script/gen_limited_prompt_completion_jsonl.py ./dataset/word-15-count.jsonl 15 10000 &
python ./memory_script/gen_limited_prompt_completion_jsonl.py ./dataset/word-20-count.jsonl 20 10000 &
python ./memory_script/gen_limited_prompt_completion_jsonl.py ./dataset/word-25-count.jsonl 25 10000 &
python ./memory_script/gen_limited_prompt_completion_jsonl.py ./dataset/word-40-count.jsonl 40 10000 &
python ./memory_script/gen_limited_prompt_completion_jsonl.py ./dataset/word-50-count.jsonl 50 10000 &
python ./memory_script/gen_limited_prompt_completion_jsonl.py ./dataset/word-60-count.jsonl 80 10000 &
python ./memory_script/gen_limited_prompt_completion_jsonl.py ./dataset/word-80-count.jsonl 80 10000 &
python ./memory_script/gen_limited_prompt_completion_jsonl.py ./dataset/word-100-count.jsonl 100 5000 &
python ./memory_script/gen_limited_prompt_completion_jsonl.py ./dataset/word-200-count.jsonl 200 5000 &

#
# We mixin the shuffled word list, so that we ensure all words / tokens are learned
# however this might intrduce an exclusion bias (if seen this word, never repeat it), 
# so we limit the mixture of this data samples
#
python ./memory_script/shuffle_limited_prompt_completion_jsonl.py ./dataset/shuffle-word-10-count.jsonl 10 20 &
python ./memory_script/shuffle_limited_prompt_completion_jsonl.py ./dataset/shuffle-word-15-count.jsonl 15 20 &
python ./memory_script/shuffle_limited_prompt_completion_jsonl.py ./dataset/shuffle-word-25-count.jsonl 25 30 &
python ./memory_script/shuffle_limited_prompt_completion_jsonl.py ./dataset/shuffle-word-50-count.jsonl 50 50 &
python ./memory_script/shuffle_limited_prompt_completion_jsonl.py ./dataset/shuffle-word-75-count.jsonl 75 50 &
python ./memory_script/shuffle_limited_prompt_completion_jsonl.py ./dataset/shuffle-word-100-count.jsonl 100 50 &
python ./memory_script/shuffle_limited_prompt_completion_jsonl.py ./dataset/shuffle-word-200-count.jsonl 200 50 &

wait
echo "## Done ##"

ls -alh ./dataset/

## Generating word reptition dataset ##
Generated a single JSONL file with 673 samples (50 token repeat) - 200 max words - at ./dataset/shuffle-word-200-count.jsonl
Generated a single JSONL file with 1768 samples (50 token repeat) - 75 max words - at ./dataset/shuffle-word-75-count.jsonl
Generated a single JSONL file with 1324 samples (50 token repeat) - 100 max words - at ./dataset/shuffle-word-100-count.jsonl
Generated a single JSONL file with 2631 samples (50 token repeat) - 50 max words - at ./dataset/shuffle-word-50-count.jsonl
## Done ##
total 22K
drwxrwxr-x 2 picocreator picocreator    6 Jul 14 14:21 .
drwxrwxr-x 4 picocreator picocreator   12 Jul 14 14:16 ..
-rw-rw-r-- 1 picocreator picocreator 1.4M Jul 14 14:21 shuffle-word-100-count.jsonl
-rw-rw-r-- 1 picocreator picocreator 1.4M Jul 14 14:21 shuffle-word-200-count.jsonl
-rw-rw-r-- 1 picocreator picocreator 1.5M Jul 14 14:21 shuffle-word-50-count.jsonl
-rw-rw-r-- 1 picocreator picocreator 1.5M Jul 14 14:21 shuffle-word-75-cou

In [10]:
# Lets pre tokenize the requried dataset
!cd "{TRAINER_DIR}" && \
    python3 preload_dataset.py "{NOTEBOOK_DIR}/Echo-B-1B4-mem-finetune-2.yaml"

# Ensure the checkpoint directory exists
!cd "{TRAINER_DIR}" && mkdir -p "../checkpoint/Echo-B-1B4-mem-finetune-2/"

Downloading and preparing dataset json/default to /home/picocreator/.cache/huggingface/datasets/json/default-2fdcb4be08d20e18/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96...
Downloading data files: 100%|██████████████████| 1/1 [00:00<00:00, 13357.66it/s]
Extracting data files: 100%|█████████████████████| 1/1 [00:00<00:00, 749.65it/s]
Dataset json downloaded and prepared to /home/picocreator/.cache/huggingface/datasets/json/default-2fdcb4be08d20e18/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96. Subsequent calls will reuse this data.
100%|████████████████████████████████████████████| 1/1 [00:00<00:00, 320.84it/s]
                                                                                

In [10]:
# Start the finetune model training
!cd "{TRAINER_DIR}" && \
    export WANDB_MODE="{WANDB_MODE}" && \
    python new_train.py fit \
        -c "{NOTEBOOK_DIR}/Echo-B-1B4-mem-finetune-2.yaml" \
        --trainer.logger.init_args.name="{WANDB_PREFIX} - Mem-Finetune-2 (bs=256, train-ctx=512, {DEEPSPEED_STRAT})" \
        --trainer.strategy="{DEEPSPEED_STRAT}" \
        --trainer.devices="{GPU_DEVICES}"  \
        --model.ctx_len=512

Setting ds_accelerator to cuda (auto detect)
[RWKV.model] Running RWKV model using 'torch-jit' with torch '2.0.1+cu118'
  rank_zero_warn(f"No seed found, seed set to {seed}")
Global seed set to 3772335697
[34m[1mwandb[0m: Currently logged in as: [33mpicocreator[0m. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: wandb version 0.15.5 is available!  To upgrade, please run:
[34m[1mwandb[0m:  $ pip install wandb --upgrade
[34m[1mwandb[0m: Tracking run with wandb version 0.15.4
[34m[1mwandb[0m: Run data is saved locally in [35m[1m./wandb/run-20230714_053206-50q6wti0[0m
[34m[1mwandb[0m: Run [1m`wandb offline`[0m to turn off syncing.
[34m[1mwandb[0m: Syncing run [33m(8x3090) Echo-B-1B4 - Mem-Finetune-2 (bs=256, train-ctx=512, deepspeed_stage_1)[0m
[34m[1mwandb[0m: ⭐️ View project at [34m[4mhttps://wandb.ai/picocreator/RWKV-Memory-Experiment[0m
[34m[1mwandb[0m: 🚀 View run at [34m[4mhttps://wandb.ai/picocreator/RWKV-Memory-Experimen

In [4]:
# Lets export the model from the checkpoint
!cd "{TRAINER_DIR}" && \
    python export_checkpoint.py \
        "../checkpoint/Echo-B-1B4-mem-finetune-2/last.ckpt" \
        "../model/Echo-B-1B4-Tune2.pth"
!cd "{TRAINER_DIR}" && ls -alh ../model/Echo-B-1B4-Tune1.pth

[2023-07-14 14:18:32,364] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Traceback (most recent call last):
  File "/home/picocreator/rwkv-proj/picocreator-memory-experiment/RWKV-v4neo/export_checkpoint.py", line 623, in <module>
    convert_zero_checkpoint_to_fp32_state_dict(args.checkpoint_dir, output_file)
  File "/home/picocreator/rwkv-proj/picocreator-memory-experiment/RWKV-v4neo/export_checkpoint.py", line 537, in convert_zero_checkpoint_to_fp32_state_dict
    state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir, tag)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/picocreator/rwkv-proj/picocreator-memory-experiment/RWKV-v4neo/export_checkpoint.py", line 516, in get_fp32_state_dict_from_zero_checkpoint
    raise ValueError(f"Unable to find 'latest' file at {latest_path}")
ValueError: Unable to find 'latest' file at ../checkpoint/Echo-B-1B4-mem-finetune-2/last.ckpt/latest
ls

In [12]:
# Lets do a memory eval 
#
# While not at its full potential, its memory ability should start emerging
#
!python3 ./memory_script/eval_memory_guided.py "{PROJECT_DIR}/model/Echo-B-1B4-Tune2.pth"

Using /root/.cache/torch_extensions/py311_cu118 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py311_cu118/wkv_cuda/build.ninja...
Building extension module wkv_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module wkv_cuda...
RWKV_JIT_ON 1 RWKV_CUDA_ON 1 RESCALE_LAYER 0

Loading /root/picocreator-memory-experiment/model/Echo-B-1B4-Tune2.pth ...
Traceback (most recent call last):
  File "/root/picocreator-memory-experiment/notebook/experiment/memory-enwiki-v2/./memory_script/eval_memory_guided.py", line 46, in <module>
    model = RWKV(model=model_path, strategy=model_run_strat)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/jit/_script.py", line 292, in init_then_script
    original_init(self, *args, **kwargs)
  File "/usr/local/