# Echo-A 1B5 (Memory Finetune)
This continues from `Echo-A-1B5-basemodel.ipynb` to perform the full memory finetune and testing process.

This is generally done in 3 stages:
- Stage 1: Low ctx size (512), training with only the input masked. This does very limited memory training, and is used primarily to train the instruction set.
- Stage 2: Low ctx size (512), training with both instruction and input masked. This forces the actual memory training onto the output tokens.
- Stage 3: Mid ctx size (1024), same as stage 2, scaled up to a context size of 1024.

In all cases, the input tokens are masked. We also intentionally use a limited word set for memory training, the same word set used in the original memory evaluation of the raven pretrained models. This keeps comparisons between experiments consistent and training time reasonable.
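The difference between the stage 1 and stage 2 masking can be sketched as a per-token loss mask. This is a minimal illustration only: the function name `build_loss_mask` and the segment lengths are hypothetical, and the actual trainer may represent masking differently.

```python
from typing import List


def build_loss_mask(instruction_len: int, input_len: int, output_len: int,
                    mask_instruction: bool) -> List[int]:
    """Return a per-token mask: 1 = token contributes to the loss, 0 = masked.

    Hypothetical helper for illustration, not the trainer's own code.
    """
    instruction_mask = [0 if mask_instruction else 1] * instruction_len
    input_mask = [0] * input_len    # input tokens are masked in every stage
    output_mask = [1] * output_len  # output tokens are always trained on
    return instruction_mask + input_mask + output_mask


# Stage 1: only the input is masked
stage1 = build_loss_mask(3, 4, 2, mask_instruction=False)
# Stage 2 and 3: instruction and input are both masked
stage2 = build_loss_mask(3, 4, 2, mask_instruction=True)

print(stage1)  # [1, 1, 1, 0, 0, 0, 0, 1, 1]
print(stage2)  # [0, 0, 0, 0, 0, 0, 0, 1, 1]
```

With the instruction masked as well, the only tokens left contributing to the loss are the output tokens, which is why stage 2 concentrates training on the memory task.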

One issue faced previously with an excessively large word set was that the model needed to see each "new word" at least a few times before it could train the memory process. This drastically slowed training, as the large word list meant the model was constantly spending time learning new words (instead of memory training).
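As a rough illustration of why word list size matters, the average number of times each word is seen shrinks in proportion to the size of the word list. The numbers below are hypothetical and assume words are sampled uniformly, which may not match the real generator.

```python
def mean_exposures_per_word(total_words_in_dataset: int, word_list_size: int) -> float:
    """Average number of times each distinct word appears, assuming uniform sampling."""
    if word_list_size <= 0:
        raise ValueError("word_list_size must be positive")
    return total_words_in_dataset / word_list_size


# Hypothetical dataset size and word list sizes, for illustration only
total_words = 1_000_000
for word_list_size in (1_000, 100_000):
    print(word_list_size, mean_exposures_per_word(total_words, word_list_size))
```

Under these assumptions, a 100x larger word list means each word is seen 100x less often for the same amount of training data.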

If we want to increase the number or type of words the model can handle for memory training, that can be done later as a stage 4 memory tune if needed, but that exceeds the current requirements for the memory experiment process.

> This project assumes you have set up the rwkv-infctx conda env, and that you are executing in that environment. See the main README.md for the conda env setup steps.

## Optional: Download the pretrained model
(if you want to skip the basemodel train + instruct tune)


In [None]:
# # Init required dirs
# !mkdir -p ../../../model/
# !mkdir -p ../../../datapath/
# !mkdir -p ../../../checkpoint/

# # Download the Stage2.pth file
# !rm -rf ../../../model/Echo-A-1B5-Stage2.pth
# !cd ../../../model/ && wget https://huggingface.co/picocreator/memory-size-experiment-for-rwkv/resolve/main/Echo-A-1B5-Stage2.pth
# !ls -alh ../../../model/Echo-A-1B5-Stage2.pth

## Configure your environment settings
(!Important: you will need to rerun the below cell, if you restart your kernel)

In [None]:
DEEPSPEED_STRAT="deepspeed_stage_1"
GPU_DEVICES="auto"
ENABLE_WANDB=True
WANDB_PREFIX="Echo-A-1B5"

print("DEEPSPEED_STRAT:", DEEPSPEED_STRAT)
print("ENABLE_WANDB:", ENABLE_WANDB)
print("GPU_DEVICES:", GPU_DEVICES)

if ENABLE_WANDB:
    WANDB_MODE="online"
else:
    WANDB_MODE="disabled"

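# Compute the notebook directory and the paths derived from it
import os
# Note: "__file__" is a plain string here (notebooks do not define __file__),
# so this resolves to the current working directory of the kernel
NOTEBOOK_DIR=os.path.dirname(os.path.abspath("__file__"))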
PROJECT_DIR=os.path.abspath(os.path.join(NOTEBOOK_DIR, "../../../"))
TRAINER_DIR=os.path.abspath(os.path.join(PROJECT_DIR, "./RWKV-v4neo/"))

print("NOTEBOOK_DIR:", NOTEBOOK_DIR)
print("TRAINER_DIR:", TRAINER_DIR)
print("PROJECT_DIR:", PROJECT_DIR)
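Since the later cells `cd` into these directories, it can help to confirm they exist before starting a long run. The helper below is a hypothetical sketch (the name `find_missing_dirs` is not part of the project), demonstrated against a temporary directory so it runs on its own.

```python
import os
import tempfile


def find_missing_dirs(named_paths):
    """Return the names whose path is not an existing directory."""
    return [name for name, path in named_paths.items() if not os.path.isdir(path)]


# Demo: the temporary directory exists, the subdirectory under it does not
with tempfile.TemporaryDirectory() as tmp:
    demo_paths = {
        "PROJECT_DIR": tmp,
        "TRAINER_DIR": os.path.join(tmp, "RWKV-v4neo"),
    }
    demo_missing = find_missing_dirs(demo_paths)

print(demo_missing)  # ['TRAINER_DIR']
```

In the notebook itself, this could be called as `find_missing_dirs({"PROJECT_DIR": PROJECT_DIR, "TRAINER_DIR": TRAINER_DIR})`, with an empty list meaning both paths resolved correctly.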

## Tune 1 : Simple Memory instruct finetuning

In [None]:
%%script bash
# Reset the dataset dir
mkdir -p ./dataset
rm -rf ./dataset/*.jsonl

# Generate the various datasets
echo "## Generating word repetition dataset ##"

# We do a strong bias for smaller word count, to teach the concept from scratch
# so that the model can learn the function
python ./memory_script/gen_limited_segmented_jsonl.py ./dataset/word-2-count.jsonl  2  20000 &
python ./memory_script/gen_limited_segmented_jsonl.py ./dataset/word-5-count.jsonl  5  20000 &
python ./memory_script/gen_limited_segmented_jsonl.py ./dataset/word-10-count.jsonl 10 20000 &
python ./memory_script/gen_limited_segmented_jsonl.py ./dataset/word-20-count.jsonl 20 20000 &
python ./memory_script/gen_limited_segmented_jsonl.py ./dataset/word-40-count.jsonl 40 20000 &
python ./memory_script/gen_limited_segmented_jsonl.py ./dataset/word-80-count.jsonl 80 20000 &

# With a slight mix of the larger word count
python ./memory_script/gen_limited_segmented_jsonl.py ./dataset/word-100-count.jsonl 100 5000 &
python ./memory_script/gen_limited_segmented_jsonl.py ./dataset/word-200-count.jsonl 200 5000 &

wait
echo "## Done ##"

ls -alh ./dataset/
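For readers without the generator script at hand, the kind of record it produces can be sketched as follows. This is not the actual `gen_limited_segmented_jsonl.py`: the word list, the prompt wording, and the `instruction` / `input` / `output` field names are all assumptions made for illustration.

```python
import json
import random

# Hypothetical limited word list (the real script uses its own word set)
WORD_LIST = ["apple", "river", "stone", "cloud", "tiger", "lamp"]


def make_sample(max_words, rng):
    """Build one word-repetition record: the output repeats the input words."""
    word_count = rng.randint(1, max_words)
    text = " ".join(rng.choice(WORD_LIST) for _ in range(word_count))
    return {
        "instruction": "Repeat the following text exactly.",  # assumed wording
        "input": text,
        "output": text,
    }


def write_jsonl(path, max_words, sample_count, seed=0):
    """Write sample_count records to path, one JSON object per line."""
    rng = random.Random(seed)
    with open(path, "w", encoding="utf-8") as f:
        for _ in range(sample_count):
            f.write(json.dumps(make_sample(max_words, rng)) + "\n")


# Print one example record
print(json.dumps(make_sample(5, random.Random(0))))
```

Because the output is an exact copy of the input, the model can only produce it correctly by carrying the input words through its state, which is the memory behaviour this finetune targets.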

## Prepare the dataset

Prepare and preload the finetuning process dataset

In [None]:
# Let's preload the required dataset (enwiki_100k)
!cd "{TRAINER_DIR}" && \
    python3 preload_dataset.py "{NOTEBOOK_DIR}/Echo-A-1B5-mem-finetune-1.yaml"

# Ensure the checkpoint directory exists
!cd "{TRAINER_DIR}" && mkdir -p "../checkpoint/Echo-A-1B5-mem-finetune-1/"

In [None]:
# Start the memory finetune training (tune 1)
!cd "{TRAINER_DIR}" && \
    export WANDB_MODE="{WANDB_MODE}" && \
    python new_train.py fit \
        -c "{NOTEBOOK_DIR}/Echo-A-1B5-mem-finetune-1.yaml" \
        --trainer.logger.init_args.name="{WANDB_PREFIX} - Mem-Finetune-1 (bs=256, train-ctx=512, {DEEPSPEED_STRAT})" \
        --trainer.strategy="{DEEPSPEED_STRAT}" \
        --trainer.devices="{GPU_DEVICES}"  \
        --model.ctx_len=512

In [None]:
# Let's export the model from the checkpoint
!cd "{TRAINER_DIR}" && \
    python export_checkpoint.py \
        "../checkpoint/Echo-A-1B5-mem-finetune-1/last.ckpt" \
        "../model/Echo-A-1B5-Tune1.pth"
!cd "{TRAINER_DIR}" && ls -alh ../model/Echo-A-1B5-Tune1.pth

In [None]:
# Let's do a quick guided memory validation
!python3 ./memory_script/eval_memory_guided.py "{PROJECT_DIR}/model/Echo-A-1B5-Tune1.pth"