# RWKV v5 Wavenet - memory finetune

This notebook continues from `./V5-Wave-1B5-basemodel.ipynb` and runs the full memory finetune and testing process.

The process generally runs across the following stages:
- Tune 1: Low ctx size (512), Training with only the input masked. This does very limited memory training, and is used primarily to train the instructions.
- Tune 2: Low ctx size (512), Training with instruction & input masked. This forces the actual memory training on the output tokens.
- Tune 3: Mid ctx size (1024), scaled up
- Tune 4: Mid ctx size (2048), scaled up
- Tune 5: Mid ctx size (4096), scaled up
- Tune 6: Large ctx size (8192), scaled up
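
The difference between the Tune 1 and Tune 2 masking can be sketched in plain Python. This is only an illustration of the idea, not the trainer's implementation: the segment labels, the `build_loss_mask` and `masked_mean_loss` helpers, and the token ids are all hypothetical.

```python
from typing import List, Set, Tuple

# Hypothetical segment labels; the real dataset scripts may name these differently.
Segment = Tuple[str, List[int]]  # (kind, token ids)


def build_loss_mask(segments: List[Segment], masked_kinds: Set[str]) -> List[int]:
    """Return one flag per token: 0 = excluded from the loss, 1 = trained on."""
    mask: List[int] = []
    for kind, tokens in segments:
        flag = 0 if kind in masked_kinds else 1
        mask.extend([flag] * len(tokens))
    return mask


def masked_mean_loss(per_token_loss: List[float], mask: List[int]) -> float:
    """Average the per-token loss over the unmasked tokens only."""
    kept = sum(mask)
    if kept == 0:
        return 0.0
    return sum(loss * m for loss, m in zip(per_token_loss, mask)) / kept


# Toy sample with made-up token ids: the output repeats the input.
sample = [
    ("instruction", [11, 12, 13]),
    ("input", [21, 22, 23, 24]),
    ("output", [21, 22, 23, 24]),
]

# Tune 1: only the input is masked, so instruction tokens still contribute.
tune1_mask = build_loss_mask(sample, {"input"})
# Tune 2: instruction and input are masked, so only output tokens contribute.
tune2_mask = build_loss_mask(sample, {"instruction", "input"})

print("Tune 1 mask:", tune1_mask)
print("Tune 2 mask:", tune2_mask)
```

With the Tune 2 mask, the only way to reduce the loss is to reproduce the output tokens correctly, which is why that stage carries the actual memory training.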

In all cases, the input tokens are masked. We also intentionally use a limited word set for memory training, matching the word set used in the original memory evaluation of the Raven pretrained models. This keeps comparisons between experiments consistent and training time reasonable.

One issue previously faced with an excessively large word set is that the model needed to see each "new word" at least a few times before it could learn the memory task. This drastically slowed training, because the large word list meant the model was constantly spending time learning new words instead of training memory.

If we want to increase the number or type of words the model can handle for memory training, that can be done later as a staged memory tune. However, this exceeds the current requirements of the memory experiment process.

> This project assumes you have the rwkv-infctx conda env set up, and that you are executing in that environment. See the main README.md for the conda env setup steps.

## Optional: Download the pretrained model
(if you want to skip the basemodel train + instruct tune)


In [None]:
# Init required dirs
!mkdir -p ../../../../model/
!mkdir -p ../../../../datapath/
!mkdir -p ../../../../checkpoint/

# Download the Stage2.pth file
!cd ../../../../model/ && wget -nc https://huggingface.co/picocreator/memory-size-experiment-for-rwkv/resolve/main/V5x-16k/V5-Wave-1B5-Stage2.pth
!ls -alh ../../../../model/V5-Wave-1B5-Stage2.pth

## Configure your environment settings
(Important: you will need to rerun the cell below if you restart your kernel)

In [None]:
DEEPSPEED_STRAT="deepspeed_stage_1"
GPU_DEVICES="auto"
ENABLE_WANDB=True
WANDB_PREFIX="V5-Wave-1B5"

# WAVENET LAYERS settings
RWKV_WAVENET_LAYERS=13

print("DEEPSPEED_STRAT:", DEEPSPEED_STRAT)
print("ENABLE_WANDB:", ENABLE_WANDB)
print("GPU_DEVICES:", GPU_DEVICES)
print("RWKV_WAVENET_LAYERS:", RWKV_WAVENET_LAYERS)

if ENABLE_WANDB:
    WANDB_MODE="online"
else:
    WANDB_MODE="disabled"

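# Compute the notebook directory and the various project paths
import os
# Note: "__file__" is deliberately a plain string, since notebooks do not define __file__;
# abspath() resolves it against the current working directory, so dirname() yields that directory
NOTEBOOK_DIR=os.path.dirname(os.path.abspath("__file__"))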
PROJECT_DIR=os.path.abspath(os.path.join(NOTEBOOK_DIR, "../../../../"))
TRAINER_DIR=os.path.abspath(os.path.join(PROJECT_DIR, "./RWKV-v5/"))
INFERENCE_DIR=os.path.abspath(os.path.join(PROJECT_DIR, "./RWKV-v5/"))

print("NOTEBOOK_DIR:", NOTEBOOK_DIR)
print("INFERENCE_DIR:", INFERENCE_DIR)
print("TRAINER_DIR:", TRAINER_DIR)
print("PROJECT_DIR:", PROJECT_DIR)

## Tune 1 : Simple Memory instruct finetuning

- Tune 1: Low ctx size (512), Training with only the input masked. This does very limited memory training, and is used primarily to train the instruction set.
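
The dataset generation cell in this section launches its jobs in the background and then calls `wait`, which does not report the failure of an individual job. A minimal sketch of a follow-up check is shown here: it lists the files that cell is expected to write and reports any that are missing. The helper names `expected_dataset_files` and `missing_dataset_files` are hypothetical and not part of the repository.

```python
from pathlib import Path
from typing import List


def expected_dataset_files(dataset_dir: str = "../dataset") -> List[Path]:
    """List the files the Tune 1 generation cell is expected to write."""
    base = Path(dataset_dir)
    files = [base / "word-2-count.jsonl"]
    # Bash `{5..250..5}` expands to 5, 10, ..., 250 (inclusive),
    # which matches range(5, 251, 5) in Python.
    files += [base / f"gen-word-{i}-count.jsonl" for i in range(5, 251, 5)]
    return files


def missing_dataset_files(dataset_dir: str = "../dataset") -> List[Path]:
    """Return the expected files that do not exist on disk."""
    return [p for p in expected_dataset_files(dataset_dir) if not p.is_file()]


files = expected_dataset_files()
print(len(files), "files expected")
print(len(missing_dataset_files()), "files missing")
```

Running this after the generation cell should report no missing files; any that are listed point to a generation job that failed silently.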

In [None]:
%%script bash

########################################
# Generate the required jsonl dataset
########################################

# Reset the dataset dir
mkdir -p ../dataset
rm -rf ../dataset/*.jsonl

# Generate the various datasets
echo "## Generating word repetition dataset ##"

# Let's generate for ctx len <= 512
python ../memory_script/gen_limited_segmented_jsonl.py ../dataset/word-2-count.jsonl  2  1000 &
for i in {5..250..5} 
do
    python ../memory_script/gen_limited_segmented_jsonl.py ../dataset/gen-word-$i-count.jsonl $i 1000 & 
done

wait
echo "## Done ##"

ls -alh ../dataset/

In [None]:
# Ensure the checkpoint directory exists (existing checkpoints are not removed)
!cd "{TRAINER_DIR}" && mkdir -p "../checkpoint/V5-Wave-1B5-mem-instruct/"

In [None]:
# Start the finetune model training
!cd "{TRAINER_DIR}" && \
    export WANDB_MODE="{WANDB_MODE}" && \
    export RWKV_WAVENET_LAYERS="{RWKV_WAVENET_LAYERS}" && \
    python lightning_trainer.py fit \
        -c "{NOTEBOOK_DIR}/V5-Wave-1B5-mem-instruct.yaml" \
        --trainer.logger.init_args.name="{WANDB_PREFIX} - Mem-Instruct (bs=256, train-ctx=512, {DEEPSPEED_STRAT})" \
        --trainer.strategy="{DEEPSPEED_STRAT}" \
        --trainer.devices="{GPU_DEVICES}"  \
        --model.ctx_len=512

In [None]:
# Let's export the model from the checkpoint
!cd "{TRAINER_DIR}" && \
    python export_checkpoint.py \
        "../checkpoint/V5-Wave-1B5-mem-instruct/last.ckpt" \
        "../model/V5-Wave-1B5-Tune-1.pth"
!cd "{TRAINER_DIR}" && ls -alh "../model/V5-Wave-1B5-Tune-1.pth"

In [None]:
# # Let's do a memory eval
# #
# # Note that the expected performance "is not that great", as the model seems to be only loosely
# # learning the memorization task and the instruction prompt. It seems to be acting more
# # like an RNG based on the instruction (instead of performing the actual memorization task)
# !python3 ../memory_script/eval_memory_guided.py "{PROJECT_DIR}/model/V5-Wave-1B5-Tune-1.pth"

## Tune 2 : Low ctx size (512), memory training

- Tune 2: Low ctx size (512), Training with instruction & input masked. This forces the actual memory training on the output tokens.
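
To make the "instruction & input masked" setup concrete, here is a rough sketch of what a prompt/completion training record could look like. The real output format of `gen_limited_prompt_completion_jsonl.py` is not shown in this notebook, so the `prompt` and `completion` field names, the instruction wording, and the word list are all assumptions made for illustration.

```python
import json
import random
from typing import Dict

# Placeholder word set; the real scripts use their own limited word list.
WORDS = ["apple", "river", "stone", "cloud", "tiger", "lamp"]


def make_memory_sample(word_count: int, rng: random.Random) -> Dict[str, str]:
    """Build one hypothetical prompt/completion pair for a repetition task."""
    words = " ".join(rng.choice(WORDS) for _ in range(word_count))
    # Everything in the prompt (instruction + input) would be masked from the loss.
    prompt = (
        "Instruction: Repeat the document word for word.\n\n"
        f"Input:\n{words}\n\n"
        "Response:\n"
    )
    # Only the completion tokens would contribute to the loss.
    return {"prompt": prompt, "completion": words}


rng = random.Random(0)  # fixed seed so the sketch is reproducible
sample = make_memory_sample(5, rng)

# One JSON object per line is the usual jsonl layout.
print(json.dumps(sample))
```

Because the completion is an exact copy of the input words, the model can only score well on the unmasked tokens by recalling what it read earlier in the context.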

In [None]:
%%script bash

########################################
# Generate the required jsonl dataset
########################################

# Reset the dataset dir
mkdir -p ../dataset
rm -rf ../dataset/*.jsonl

# Generate the various datasets
echo "## Generating word repetition dataset ##"

#
# We switch over to fully masked instruct+input, to properly learn the memorization task
#
python ../memory_script/gen_limited_prompt_completion_jsonl.py ../dataset/word-2-count.jsonl  2  1000 &
for i in {5..250..5} 
do
    python ../memory_script/gen_limited_prompt_completion_jsonl.py ../dataset/gen-word-$i-count.jsonl $i 1000 & 
    python ../memory_script/shuffle_limited_prompt_completion_jsonl.py ../dataset/shuffle-word-$i-count.jsonl $i 100 &
done

wait
echo "## Done ##"

ls -alh ../dataset/

In [None]:
# Ensure the checkpoint directory exists (existing checkpoints are not removed)
!cd "{TRAINER_DIR}" && \
    mkdir -p "../checkpoint/V5-Wave-1B5-mem-ctx-512/"

In [None]:
# Start the finetune model training
!cd "{TRAINER_DIR}" && \
    export WANDB_MODE="{WANDB_MODE}" && \
    export RWKV_WAVENET_LAYERS="{RWKV_WAVENET_LAYERS}" && \
    python lightning_trainer.py fit \
        -c "{NOTEBOOK_DIR}/V5-Wave-1B5-mem-template.yaml" \
        --trainer.logger.init_args.name="{WANDB_PREFIX} - Mem-Tune Ctx-512 (bs=256, train-ctx=512, {DEEPSPEED_STRAT})" \
        --trainer.strategy="{DEEPSPEED_STRAT}" \
        --trainer.devices="{GPU_DEVICES}"  \
        --model.lr_init=5e-4 \
        --model.lr_final=4e-4 \
        --data.max_token_size=512 \
        --model.ctx_len=512 \
        --model.bptt_learning_range=1 \
        --model.load_model="../model/V5-Wave-1B5-Tune-1.pth" \
        --callback[0].dirpath="../checkpoint/V5-Wave-1B5-mem-ctx-512/"
        

In [None]:
# Let's export the model from the checkpoint
!cd "{TRAINER_DIR}" && \
    python export_checkpoint.py \
        "../checkpoint/V5-Wave-1B5-mem-ctx-512/last.ckpt" \
        "../model/V5-Wave-1B5-Tune-ctx512.pth"
!cd "{TRAINER_DIR}" && ls -alh "../model/V5-Wave-1B5-Tune-ctx512.pth"

In [None]:
# # Let's do a memory eval
# #
# # While not at its full potential, its memory ability should start emerging
# #
# !python3 ../memory_script/eval_memory_guided.py "{PROJECT_DIR}/model/V5-Wave-1B5-Tune-ctx512.pth"