# RWKV Token Shift Experiment A (Memory Finetune)
This notebook continues from `./TokenShift-A-basemodel.ipynb` and performs the full memory finetune and testing process.

This is generally done in 3 tune stages:
- Tune 1: Low ctx size (512), training with only the input masked. This does very limited memory training, and is used primarily to train the instruction set.
- Tune 2: Low ctx size (512), training with instruction & input masked. This forces the actual memory training on the output tokens.
- Tune 3: Mid ctx size (1024), same as Tune 2, but scaled up to a 1024 context size.

In all cases, the input tokens are always masked. We intentionally use the limited word set for memory training, which matches the word set used in the original memory evaluation of the raven pretrained models. This keeps comparisons between experiments consistent and the training time reasonable.
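The masking scheme across the stages can be pictured as a per-token loss mask. The sketch below is only an illustration: the function name `build_loss_mask`, its parameters, and the segment lengths are hypothetical, and the actual trainer may implement masking differently.

```python
from typing import List

def build_loss_mask(n_instruction: int, n_input: int, n_output: int,
                    mask_instruction: bool) -> List[int]:
    """Return a per-token mask: 1 = token contributes to the loss, 0 = masked."""
    instruction = [0 if mask_instruction else 1] * n_instruction
    input_part = [0] * n_input      # the input is masked in every stage
    output_part = [1] * n_output    # the output is always trained on
    return instruction + input_part + output_part

# Tune 1: only the input is masked
tune1_mask = build_loss_mask(3, 4, 4, mask_instruction=False)
# Tune 2 / Tune 3: instruction and input are both masked
tune2_mask = build_loss_mask(3, 4, 4, mask_instruction=True)
print(tune1_mask)
print(tune2_mask)
```

With the instruction masked as well, the only tokens left to learn from are the output tokens, which is what pushes the training towards the memory task.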

One issue faced previously with an excessively large word set is that the model needed to see "new words" at least a few times before it could train the memory process. This drastically slowed down the process, as the large word list meant the model was constantly spending time learning new words (instead of memory training).

If we want to increase the number / type of words the model can handle for memory training, that can be done later as a stage 4 memory tune if needed. But that exceeds the current requirements for the memory experiment process.

> This project assumes you have the rwkv-infctx conda env setup, and you are executing in that environment - see the main README.md for the conda env setup steps

## Optional: Download the pretrained model
(if you want to skip the basemodel train + instruct tune)


In [None]:
# # Init required dirs
# !mkdir -p ../../../model/
# !mkdir -p ../../../datapath/
# !mkdir -p ../../../checkpoint/

# # Download the Stage2.pth file
# !rm -rf ../../../model/TokenShift-A-Stage2.pth
# !cd ../../../model/ && wget https://huggingface.co/picocreator/memory-size-experiment-for-rwkv/resolve/main/TokenShift-A-Stage2.pth
# !ls -alh ../../../model/TokenShift-A-Stage2.pth

# Other models to skip steps if wanted
# !cd ../../../model/ && wget https://huggingface.co/picocreator/memory-size-experiment-for-rwkv/resolve/main/TokenShift-A-Tune1.pth
# !cd ../../../model/ && wget https://huggingface.co/picocreator/memory-size-experiment-for-rwkv/resolve/main/TokenShift-A-Tune2.pth
# !cd ../../../model/ && wget https://huggingface.co/picocreator/memory-size-experiment-for-rwkv/resolve/main/TokenShift-A-Tune3.pth
# !cd ../../../model/ && wget https://huggingface.co/picocreator/memory-size-experiment-for-rwkv/resolve/main/TokenShift-A-Tune4.pth

## Configure your environment settings
(!Important: you will need to rerun the below cell, if you restart your kernel)

In [2]:
DEEPSPEED_STRAT="deepspeed_stage_1"
GPU_DEVICES="auto"
ENABLE_WANDB=True
WANDB_PREFIX="(8x3090) TokenShift-A"

print("DEEPSPEED_STRAT:", DEEPSPEED_STRAT)
print("ENABLE_WANDB:", ENABLE_WANDB)
print("GPU_DEVICES:", GPU_DEVICES)

if ENABLE_WANDB:
    WANDB_MODE="online"
else:
    WANDB_MODE="disabled"

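# Compute the notebook dir and the various project paths
# (note: "__file__" is a plain string here, so this resolves against the current working directory)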
import os
NOTEBOOK_DIR=os.path.dirname(os.path.abspath("__file__"))
PROJECT_DIR=os.path.abspath(os.path.join(NOTEBOOK_DIR, "../../../"))
TRAINER_DIR=os.path.abspath(os.path.join(PROJECT_DIR, "./RWKV-v4neo/"))
INFERENCE_DIR=os.path.abspath(os.path.join(PROJECT_DIR, "./RWKV-v5x/"))

print("NOTEBOOK_DIR:", NOTEBOOK_DIR)
print("INFERENCE_DIR:", INFERENCE_DIR)
print("TRAINER_DIR:", TRAINER_DIR)
print("PROJECT_DIR:", PROJECT_DIR)

DEEPSPEED_STRAT: deepspeed_stage_1
ENABLE_WANDB: True
GPU_DEVICES: auto
NOTEBOOK_DIR: /home/ubuntu/rwkv5x-tokenshift-exp-A/notebook/experiment/tokenshift-exp
INFERENCE_DIR: /home/ubuntu/rwkv5x-tokenshift-exp-A/RWKV-v5x
TRAINER_DIR: /home/ubuntu/rwkv5x-tokenshift-exp-A/RWKV-v4neo
PROJECT_DIR: /home/ubuntu/rwkv5x-tokenshift-exp-A
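A note on how `NOTEBOOK_DIR` is resolved: `"__file__"` is passed as a string literal, so `os.path.abspath` resolves it relative to the current working directory. The short sketch below (variable names are illustrative) shows that the result is simply the working directory, which means the notebook needs to be run from its own folder for the relative paths to line up.

```python
import os

# "__file__" here is a plain string, not the module attribute, so abspath()
# resolves it against the current working directory.
fake_path = os.path.abspath("__file__")
notebook_dir = os.path.dirname(fake_path)

print(os.path.basename(fake_path))
print(notebook_dir == os.getcwd())
```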


## Tune 1 : Simple Memory instruct finetuning

- Tune 1: Low ctx size (512), Training with only the input masked. This does very limited memory training, and is used primarily to train the instruction set.

In [3]:
%%script bash

########################################
# Generate the required jsonl dataset
########################################

# Reset the dataset dir
mkdir -p ./dataset
rm -rf ./dataset/*.jsonl

# Generate the various datasets
echo "## Generating word reptition dataset ##"

# We bias strongly towards smaller word counts, to teach the concept from scratch
# so that the model can learn the function.
#
# Note that the word count of each sample is randomized between half of the
# target word count and the target word count.
python ./memory_script/gen_limited_segmented_jsonl.py ./dataset/word-2-count.jsonl  2  5000 &
python ./memory_script/gen_limited_segmented_jsonl.py ./dataset/word-5-count.jsonl  5  5000 &
python ./memory_script/gen_limited_segmented_jsonl.py ./dataset/word-10-count.jsonl 10 2500 &
python ./memory_script/gen_limited_segmented_jsonl.py ./dataset/word-15-count.jsonl 15 2500 &
python ./memory_script/gen_limited_segmented_jsonl.py ./dataset/word-20-count.jsonl 20 2500 &
python ./memory_script/gen_limited_segmented_jsonl.py ./dataset/word-25-count.jsonl 25 2500 &
python ./memory_script/gen_limited_segmented_jsonl.py ./dataset/word-40-count.jsonl 40 2500 &
python ./memory_script/gen_limited_segmented_jsonl.py ./dataset/word-50-count.jsonl 50 2500 &
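# NOTE: the file name says 60, but the word count argument passed is 80 (kept as run, see the log below)
python ./memory_script/gen_limited_segmented_jsonl.py ./dataset/word-60-count.jsonl 80 2500 &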
python ./memory_script/gen_limited_segmented_jsonl.py ./dataset/word-80-count.jsonl 80 2500 &

# With a slight mix of the larger word count
python ./memory_script/gen_limited_segmented_jsonl.py ./dataset/word-100-count.jsonl 100 2500 &
python ./memory_script/gen_limited_segmented_jsonl.py ./dataset/word-200-count.jsonl 200 2500 &

wait
echo "## Done ##"

ls -alh ./dataset/

## Generating word reptition dataset ##
Generated JSONL file with - 10 max words, 2500 samples - at ./dataset/word-10-count.jsonl
Generated JSONL file with - 15 max words, 2500 samples - at ./dataset/word-15-count.jsonl
Generated JSONL file with - 2 max words, 5000 samples - at ./dataset/word-2-count.jsonl
Generated JSONL file with - 25 max words, 2500 samples - at ./dataset/word-25-count.jsonl
Generated JSONL file with - 5 max words, 5000 samples - at ./dataset/word-5-count.jsonl
Generated JSONL file with - 40 max words, 2500 samples - at ./dataset/word-40-count.jsonl
Generated JSONL file with - 50 max words, 2500 samples - at ./dataset/word-50-count.jsonl
Generated JSONL file with - 20 max words, 2500 samples - at ./dataset/word-20-count.jsonl
Generated JSONL file with - 80 max words, 2500 samples - at ./dataset/word-60-count.jsonl
Generated JSONL file with - 100 max words, 2500 samples - at ./dataset/word-100-count.jsonl
Generated JSONL file with - 80 max words, 2500 samples - at ./
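The generation cell above notes that each sample's word count is randomized between half of the target and the target itself. A minimal sketch of that sampling rule follows, assuming a uniform, inclusive choice; `sample_word_count` is a hypothetical helper, not a function taken from `gen_limited_segmented_jsonl.py`.

```python
import random

def sample_word_count(target: int, rng: random.Random) -> int:
    """Pick a word count between half of `target` and `target`, inclusive."""
    low = max(1, target // 2)   # never go below a single word
    return rng.randint(low, target)

rng = random.Random(0)   # fixed seed so the sketch is repeatable
counts = [sample_word_count(20, rng) for _ in range(1000)]
print(min(counts), max(counts))
```

For a target of 20, every sample lands between 10 and 20 words, so each dataset file covers a band of lengths and not a single fixed length.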

In [4]:
# Let's pre-tokenize the required dataset
!cd "{TRAINER_DIR}" && \
    python3 preload_dataset.py "{NOTEBOOK_DIR}/TokenShift-A-mem-finetune-1.yaml"

# Ensure the checkpoint directory exists
!cd "{TRAINER_DIR}" && mkdir -p "../checkpoint/TokenShift-A-mem-finetune-1/"

Downloading and preparing dataset json/default to /root/.cache/huggingface/datasets/json/default-248374da0b936b0e/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96...
Downloading data files: 100%|███████████████████| 1/1 [00:00<00:00, 6710.89it/s]
Extracting data files: 100%|█████████████████████| 1/1 [00:00<00:00, 243.94it/s]
Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/default-248374da0b936b0e/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96. Subsequent calls will reuse this data.
100%|████████████████████████████████████████████| 1/1 [00:00<00:00, 349.32it/s]
                                                                                

In [5]:
# Start the finetune model training
!cd "{TRAINER_DIR}" && \
    export WANDB_MODE="{WANDB_MODE}" && \
    python new_train.py fit \
        -c "{NOTEBOOK_DIR}/TokenShift-A-mem-finetune-1.yaml" \
        --trainer.logger.init_args.name="{WANDB_PREFIX} - Mem-Finetune-1 (bs=256, train-ctx=512, {DEEPSPEED_STRAT})" \
        --trainer.strategy="{DEEPSPEED_STRAT}" \
        --trainer.devices="{GPU_DEVICES}"  \
        --model.ctx_len=512

Setting ds_accelerator to cuda (auto detect)
[RWKV.model] Running RWKV model using 'torch-jit' with torch '2.0.1+cu118'
  rank_zero_warn(f"No seed found, seed set to {seed}")
Global seed set to 3204158003
[34m[1mwandb[0m: Currently logged in as: [33mpicocreator[0m. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: wandb version 0.15.5 is available!  To upgrade, please run:
[34m[1mwandb[0m:  $ pip install wandb --upgrade
[34m[1mwandb[0m: Tracking run with wandb version 0.15.4
[34m[1mwandb[0m: Run data is saved locally in [35m[1m./wandb/run-20230715_150021-ty4q0dit[0m
[34m[1mwandb[0m: Run [1m`wandb offline`[0m to turn off syncing.
[34m[1mwandb[0m: Syncing run [33m(8x3090) TokenShift-A - Mem-Finetune-1 (bs=256, train-ctx=512, deepspeed_stage_1)[0m
[34m[1mwandb[0m: ⭐️ View project at [34m[4mhttps://wandb.ai/picocreator/RWKV-5X-Experiments[0m
[34m[1mwandb[0m: 🚀 View run at [34m[4mhttps://wandb.ai/picocreator/RWKV-5X-Experiments/ru

In [6]:
# Let's export the model from the checkpoint
!cd "{TRAINER_DIR}" && \
    python export_checkpoint.py \
        "../checkpoint/TokenShift-A-mem-finetune-1/last.ckpt" \
        "../model/TokenShift-A-Tune1.pth"
!cd "{TRAINER_DIR}" && ls -alh ../model/TokenShift-A-Tune1.pth

Setting ds_accelerator to cuda (auto detect)
Processing zero checkpoint '../checkpoint/TokenShift-A-mem-finetune-1/last.ckpt/checkpoint'
Detected checkpoint of type zero stage ZeroStageEnum.optimizer_states, world_size: 8
Parsing checkpoint created by deepspeed==0.9.3
Reconstructed fp32 state dict with 222 params 1280128000 elements
Saving fp32 state dict to ../model/TokenShift-A-Tune1.pth
-rw-r--r-- 1 root root 4.8G Jul 15 15:24 ../model/TokenShift-A-Tune1.pth
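As a quick sanity check on the export, the file size reported by `ls` can be compared against the element count in the log. The arithmetic below assumes the file is dominated by the fp32 weights (4 bytes per element), with only a small serialization overhead on top.

```python
num_elements = 1_280_128_000   # "1280128000 elements" from the export log
bytes_per_fp32 = 4             # a float32 value occupies 4 bytes

size_bytes = num_elements * bytes_per_fp32
size_gib = size_bytes / 2**30  # ls -h reports sizes in powers of 1024

print(size_bytes)
print(round(size_gib, 2))
```

This comes to a little under 4.8 GiB, in line with the `4.8G` shown for `TokenShift-A-Tune1.pth`.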


In [4]:
# Let's do a memory eval
#
# Note that the expected performance "is not that great", as the model seems to be only loosely
# learning the memorization task and the instruction prompt. It seems to be acting more
# like an RNG based on the instruct (instead of performing the actual memorization task).
!python3 ./memory_script/eval_memory_guided.py "{PROJECT_DIR}/model/TokenShift-A-Tune1.pth"


RWKV_HEAD_QK_DIM 0 RWKV_JIT_ON 1

blocks.0.att.key.weight                  float32    cuda:0
blocks.0.att.output.weight               float32    cuda:0
blocks.0.att.receptance.weight           float32    cuda:0
blocks.0.att.time_mix_k                  float32    cuda:0
blocks.0.att.time_mix_r                  float32    cuda:0
blocks.0.att.time_mix_v                  float32    cuda:0
blocks.0.att.value.weight                float32    cuda:0
blocks.0.ffn.key.weight                  float32    cuda:0
blocks.0.ffn.receptance.weight           float32    cuda:0
blocks.0.ffn.time_mix_k                  float32    cuda:0
blocks.0.ffn.time_mix_r                  float32    cuda:0
blocks.0.ffn.value.weight                float32    cuda:0
blocks.0.ln0.bias                        float32    cuda:0
blocks.0.ln0.weight                      float32    cuda:0
blocks.0.ln1.bias                        float32    cuda:0
blocks.0.ln1.weight                      float32    cuda:0
blocks.0.ln2.bias    
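To make the idea of a memory eval concrete, here is a hypothetical scoring sketch: it compares the words the model was asked to repeat against the words it produced, position by position. `score_recall` and the sample words are illustrative only, and `eval_memory_guided.py` may measure recall differently (for example at the token level).

```python
from typing import List, Tuple

def score_recall(expected: List[str], generated: List[str]) -> Tuple[int, float]:
    """Count position-wise matches and return (matched, accuracy)."""
    matched = sum(1 for e, g in zip(expected, generated) if e == g)
    accuracy = matched / len(expected) if expected else 0.0
    return matched, accuracy

expected = ["apple", "river", "stone", "cloud"]
generated = ["apple", "river", "cloud", "cloud"]
matched, accuracy = score_recall(expected, generated)
print(matched, accuracy)
```

A model acting like an RNG over the word set would score low on a measure like this even when its output looks well formed.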

## Tune 2 : Low ctx size (512), memory training

- Tune 2: Low ctx size (512), Training with instruction & input masked. This forces the actual memory training on the output tokens.

In [8]:
%%script bash

########################################
# Generate the required jsonl dataset
########################################

# Reset the dataset dir
mkdir -p ./dataset
rm -rf ./dataset/*.jsonl

# Generate the various datasets
echo "## Generating word reptition dataset ##"

#
# We switch over to fully masked instruct+input, to properly learn the memorization task
#
python ./memory_script/gen_limited_prompt_completion_jsonl.py ./dataset/word-2-count.jsonl  2  5000 &
for i in {5..95..5} 
do
    python ./memory_script/gen_limited_prompt_completion_jsonl.py ./dataset/gen-word-$i-count.jsonl $i 5000 & 
done
python ./memory_script/gen_limited_prompt_completion_jsonl.py ./dataset/word-100-count.jsonl 100 5000 &
python ./memory_script/gen_limited_prompt_completion_jsonl.py ./dataset/word-200-count.jsonl 200 5000 &

#
# We mix in the shuffled word list, to ensure all words / tokens are learned.
# However, this might introduce an exclusion bias (if this word was seen, never repeat it),
# so we limit the share of these data samples
#
python ./memory_script/shuffle_limited_prompt_completion_jsonl.py ./dataset/shuffle-word-10-count.jsonl 10 20 &
python ./memory_script/shuffle_limited_prompt_completion_jsonl.py ./dataset/shuffle-word-15-count.jsonl 15 20 &
python ./memory_script/shuffle_limited_prompt_completion_jsonl.py ./dataset/shuffle-word-25-count.jsonl 25 30 &
python ./memory_script/shuffle_limited_prompt_completion_jsonl.py ./dataset/shuffle-word-50-count.jsonl 50 50 &
python ./memory_script/shuffle_limited_prompt_completion_jsonl.py ./dataset/shuffle-word-75-count.jsonl 75 50 &
python ./memory_script/shuffle_limited_prompt_completion_jsonl.py ./dataset/shuffle-word-100-count.jsonl 100 50 &
python ./memory_script/shuffle_limited_prompt_completion_jsonl.py ./dataset/shuffle-word-200-count.jsonl 200 50 &

wait
echo "## Done ##"

ls -alh ./dataset/

## Generating word reptition dataset ##
Generated a single JSONL file with 3558 samples (20 token repeat) - 15 max words - at ./dataset/shuffle-word-15-count.jsonl
Generated JSONL file with - 2 max words, 5000 samples - at ./dataset/word-2-count.jsonl
Generated JSONL file with - 5 max words, 5000 samples - at ./dataset/gen-word-5-count.jsonl
Generated a single JSONL file with 676 samples (50 token repeat) - 200 max words - at ./dataset/shuffle-word-200-count.jsonl
Generated a single JSONL file with 1320 samples (50 token repeat) - 100 max words - at ./dataset/shuffle-word-100-count.jsonl
Generated JSONL file with - 10 max words, 5000 samples - at ./dataset/gen-word-10-count.jsonl
Generated a single JSONL file with 2633 samples (50 token repeat) - 50 max words - at ./dataset/shuffle-word-50-count.jsonl
Generated JSONL file with - 15 max words, 5000 samples - at ./dataset/gen-word-15-count.jsonl
Generated a single JSONL file with 1771 samples (50 token repeat) - 75 max words - at ./datas
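The shuffled datasets can be pictured roughly as below. This is a simplified sketch under stated assumptions: the third script argument is treated as a repeat count (based on the "token repeat" wording in the log), samples are cut at a fixed size, and `shuffled_samples` with its stand-in word list is hypothetical. The real `shuffle_limited_prompt_completion_jsonl.py` may differ.

```python
import random
from typing import List

def shuffled_samples(words: List[str], max_words: int, repeat: int,
                     rng: random.Random) -> List[List[str]]:
    """Shuffle the word list `repeat` times and cut each pass into samples."""
    samples = []
    for _ in range(repeat):
        shuffled = words[:]   # copy so the caller's list is untouched
        rng.shuffle(shuffled)
        for start in range(0, len(shuffled), max_words):
            samples.append(shuffled[start:start + max_words])
    return samples

words = [f"word{i}" for i in range(10)]   # stand-in for the limited word set
samples = shuffled_samples(words, max_words=4, repeat=2, rng=random.Random(0))
print(len(samples))
```

Each pass covers every word exactly once, which guarantees coverage. It also means no word ever repeats within a sample, which is the exclusion bias the cell comment warns about.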

In [9]:
# Let's pre-tokenize the required dataset
!cd "{TRAINER_DIR}" && \
    python3 preload_dataset.py "{NOTEBOOK_DIR}/TokenShift-A-mem-finetune-2.yaml"

# Ensure the checkpoint directory exists
!cd "{TRAINER_DIR}" && mkdir -p "../checkpoint/TokenShift-A-mem-finetune-2/"

Resolving data files: 100%|██████████████████| 29/29 [00:00<00:00, 36461.28it/s]
Downloading and preparing dataset json/default to /root/.cache/huggingface/datasets/json/default-d816d1b1ca075f1e/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96...
Downloading data files: 100%|███████████████████| 1/1 [00:00<00:00, 2054.02it/s]
Extracting data files: 100%|█████████████████████| 1/1 [00:00<00:00, 144.20it/s]
Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/default-d816d1b1ca075f1e/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96. Subsequent calls will reuse this data.
100%|████████████████████████████████████████████| 1/1 [00:00<00:00, 123.13it/s]
                                                                                

In [4]:
# Start the finetune model training
!cd "{TRAINER_DIR}" && \
    export WANDB_MODE="{WANDB_MODE}" && \
    python new_train.py fit \
        -c "{NOTEBOOK_DIR}/TokenShift-A-mem-finetune-2.yaml" \
        --trainer.logger.init_args.name="{WANDB_PREFIX} - Mem-Finetune-2 (bs=256, train-ctx=512, {DEEPSPEED_STRAT})" \
        --trainer.strategy="{DEEPSPEED_STRAT}" \
        --trainer.devices="{GPU_DEVICES}"  \
        --model.ctx_len=512

Setting ds_accelerator to cuda (auto detect)
[RWKV.model] Running RWKV model using 'torch-jit' with torch '2.0.1+cu118'
  rank_zero_warn(f"No seed found, seed set to {seed}")
Global seed set to 1171926168
[34m[1mwandb[0m: Currently logged in as: [33mpicocreator[0m. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: wandb version 0.15.5 is available!  To upgrade, please run:
[34m[1mwandb[0m:  $ pip install wandb --upgrade
[34m[1mwandb[0m: Tracking run with wandb version 0.15.4
[34m[1mwandb[0m: Run data is saved locally in [35m[1m./wandb/run-20230715_170253-boah6u52[0m
[34m[1mwandb[0m: Run [1m`wandb offline`[0m to turn off syncing.
[34m[1mwandb[0m: Syncing run [33m(8x3090) TokenShift-A - Mem-Finetune-2 (bs=256, train-ctx=512, deepspeed_stage_1)[0m
[34m[1mwandb[0m: ⭐️ View project at [34m[4mhttps://wandb.ai/picocreator/RWKV-5X-Experiments[0m
[34m[1mwandb[0m: 🚀 View run at [34m[4mhttps://wandb.ai/picocreator/RWKV-5X-Experiments/ru

In [5]:
# Let's export the model from the checkpoint
!cd "{TRAINER_DIR}" && \
    python export_checkpoint.py \
        "../checkpoint/TokenShift-A-mem-finetune-2/last.ckpt" \
        "../model/TokenShift-A-Tune2.pth"
!cd "{TRAINER_DIR}" && ls -alh ../model/TokenShift-A-Tune2.pth

Setting ds_accelerator to cuda (auto detect)
Processing zero checkpoint '../checkpoint/TokenShift-A-mem-finetune-2/last.ckpt/checkpoint'
Detected checkpoint of type zero stage ZeroStageEnum.optimizer_states, world_size: 8
Parsing checkpoint created by deepspeed==0.9.3
Reconstructed fp32 state dict with 222 params 1280128000 elements
Saving fp32 state dict to ../model/TokenShift-A-Tune2.pth
-rw-r--r-- 1 root root 4.8G Jul 15 18:27 ../model/TokenShift-A-Tune2.pth


In [6]:
# Let's do a memory eval
#
# While not at its full potential, its memory ability should start emerging
#
!python3 ./memory_script/eval_memory_guided.py "{PROJECT_DIR}/model/TokenShift-A-Tune2.pth"


RWKV_HEAD_QK_DIM 0 RWKV_JIT_ON 1

blocks.0.att.key.weight                  float32    cuda:0
blocks.0.att.output.weight               float32    cuda:0
blocks.0.att.receptance.weight           float32    cuda:0
blocks.0.att.time_mix_k                  float32    cuda:0
blocks.0.att.time_mix_r                  float32    cuda:0
blocks.0.att.time_mix_v                  float32    cuda:0
blocks.0.att.value.weight                float32    cuda:0
blocks.0.ffn.key.weight                  float32    cuda:0
blocks.0.ffn.receptance.weight           float32    cuda:0
blocks.0.ffn.time_mix_k                  float32    cuda:0
blocks.0.ffn.time_mix_r                  float32    cuda:0
blocks.0.ffn.value.weight                float32    cuda:0
blocks.0.ln0.bias                        float32    cuda:0
blocks.0.ln0.weight                      float32    cuda:0
blocks.0.ln1.bias                        float32    cuda:0
blocks.0.ln1.weight                      float32    cuda:0
blocks.0.ln2.bias    

## Tune 3 : Ramping up the ctx size (1024), memory training

- Tune 3: Mid ctx size (1024), same as tune 2, but extended in context size

This intentionally uses a much larger dataset and a lower learning rate, to help ensure we push the model to its absolute limits.

In [12]:
%%script bash

########################################
# Generate the required jsonl dataset
########################################

# Reset the dataset dir
mkdir -p ./dataset
rm -rf ./dataset/*.jsonl

# Generate the various datasets
echo "## Generating word reptition dataset ##"

#
# We reduce the training set for < 50 words - and shift the focus upwards
# (aka 50-100 token * 2 : ~100 - 250 token ctx len)
#
python ./memory_script/gen_limited_prompt_completion_jsonl.py ./dataset/word-2-count.jsonl 2 1000 &
for i in {5..45..5} 
do
    python ./memory_script/gen_limited_prompt_completion_jsonl.py ./dataset/gen-word-$i-count.jsonl $i 1000 & 
    python ./memory_script/shuffle_limited_prompt_completion_jsonl.py ./dataset/shuffle-word-$i-count.jsonl $i 10 & 
done

#
# Ramping up the 50 - 450 words dataset (the loop below runs up to 450)
# 
for i in {50..450..5} 
do
    python ./memory_script/gen_limited_prompt_completion_jsonl.py ./dataset/gen-word-$i-count.jsonl $i 2000 & 
    python ./memory_script/shuffle_limited_prompt_completion_jsonl.py ./dataset/shuffle-word-$i-count.jsonl $i 20 & 
done

wait
echo "## Done ##"

ls -alh ./dataset/

## Generating word reptition dataset ##
Generated JSONL file with - 2 max words, 1000 samples - at ./dataset/word-2-count.jsonl
Generated JSONL file with - 5 max words, 1000 samples - at ./dataset/gen-word-5-count.jsonl
Generated JSONL file with - 10 max words, 1000 samples - at ./dataset/gen-word-10-count.jsonl
Generated a single JSONL file with 667 samples (10 token repeat) - 40 max words - at ./dataset/shuffle-word-40-count.jsonl
Generated JSONL file with - 25 max words, 1000 samples - at ./dataset/gen-word-25-count.jsonl
Generated JSONL file with - 15 max words, 1000 samples - at ./dataset/gen-word-15-count.jsonl
Generated a single JSONL file with 587 samples (10 token repeat) - 45 max words - at ./dataset/shuffle-word-45-count.jsonl
Generated JSONL file with - 20 max words, 1000 samples - at ./dataset/gen-word-20-count.jsonl
Generated JSONL file with - 30 max words, 1000 samples - at ./dataset/gen-word-30-count.jsonl
Generated a single JSONL file with 1784 samples (10 token repeat

In [13]:
# Let's pre-tokenize the required dataset
!cd "{TRAINER_DIR}" && \
    python3 preload_dataset.py "{NOTEBOOK_DIR}/TokenShift-A-mem-finetune-3.yaml"

# Ensure the checkpoint directory exists
!cd "{TRAINER_DIR}" && mkdir -p "../checkpoint/TokenShift-A-mem-finetune-3/"

Resolving data files: 100%|███████████████| 181/181 [00:00<00:00, 145741.80it/s]
Downloading and preparing dataset json/default to /root/.cache/huggingface/datasets/json/default-5a26de720865b5a0/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96...
Downloading data files: 100%|████████████████████| 1/1 [00:00<00:00, 714.65it/s]
Extracting data files: 100%|██████████████████████| 1/1 [00:00<00:00, 26.73it/s]
Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/default-5a26de720865b5a0/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96. Subsequent calls will reuse this data.
100%|█████████████████████████████████████████████| 1/1 [00:00<00:00, 77.87it/s]
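The `181/181` in the "Resolving data files" line can be cross-checked against the generation loops. The sketch below rebuilds the list of file names the Tune 3 bash cell writes, assuming the preload step picks up every `.jsonl` file in `./dataset/`; `tune3_dataset_files` is a helper made up for this check.

```python
def tune3_dataset_files():
    """List the JSONL file names the Tune 3 generation cell writes."""
    files = ["word-2-count.jsonl"]
    # {5..45..5} and {50..450..5} in bash are inclusive ranges with step 5
    for i in list(range(5, 46, 5)) + list(range(50, 451, 5)):
        files.append(f"gen-word-{i}-count.jsonl")
        files.append(f"shuffle-word-{i}-count.jsonl")
    return files

files = tune3_dataset_files()
print(len(files))
```

That is 1 file for the 2-word set, plus a `gen` and a `shuffle` file for each of the 9 + 81 word counts, which gives 181 and matches the log.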
                                                                                

In [None]:
# Start the finetune model training
!cd "{TRAINER_DIR}" && \
    export WANDB_MODE="{WANDB_MODE}" && \
    python new_train.py fit \
        -c "{NOTEBOOK_DIR}/TokenShift-A-mem-finetune-3.yaml" \
        --trainer.logger.init_args.name="{WANDB_PREFIX} - Mem-Finetune-3 (bs=256, train-ctx=1024, {DEEPSPEED_STRAT})" \
        --trainer.strategy="{DEEPSPEED_STRAT}" \
        --trainer.devices="{GPU_DEVICES}"  \
        --model.ctx_len=1024

In [None]:
# Let's export the model from the checkpoint
!cd "{TRAINER_DIR}" && \
    python export_checkpoint.py \
        "../checkpoint/TokenShift-A-mem-finetune-3/last.ckpt" \
        "../model/TokenShift-A-Tune3.pth"
!cd "{TRAINER_DIR}" && ls -alh ../model/TokenShift-A-Tune3.pth

In [5]:
# Let's do a memory eval
#
# We should start approaching the full potential of the model, unless it is able to exceed 250 tokens of memory
#
!python3 ./memory_script/eval_memory_guided.py "{PROJECT_DIR}/model/TokenShift-A-Tune3.pth" "verbose"


RWKV_HEAD_QK_DIM 0 RWKV_JIT_ON 1

blocks.0.att.key.weight                  float32    cuda:0
blocks.0.att.output.weight               float32    cuda:0
blocks.0.att.receptance.weight           float32    cuda:0
blocks.0.att.time_mix_k                  float32    cuda:0
blocks.0.att.time_mix_r                  float32    cuda:0
blocks.0.att.time_mix_v                  float32    cuda:0
blocks.0.att.value.weight                float32    cuda:0
blocks.0.ffn.key.weight                  float32    cuda:0
blocks.0.ffn.receptance.weight           float32    cuda:0
blocks.0.ffn.time_mix_k                  float32    cuda:0
blocks.0.ffn.time_mix_r                  float32    cuda:0
blocks.0.ffn.value.weight                float32    cuda:0
blocks.0.ln0.bias                        float32    cuda:0
blocks.0.ln0.weight                      float32    cuda:0
blocks.0.ln1.bias                        float32    cuda:0
blocks.0.ln1.weight                      float32    cuda:0
blocks.0.ln2.bias    