# Echo-A 1B5 (Memory model from scratch)
This notebook attempts to build the memory model in stages, from scratch
(instead of the previous attempts, which used an enwiki foundation + gpt4all + etc.)

> This project assumes you have the rwkv-infctx conda env set up, and that you are executing in that environment. See the main README.md for the conda env setup steps.

## Insights from previous failed attempt at training from scratch / training from enwiki

The following insights were found after multiple rounds of back and forth (3+ weeks of experiments). These insights in particular were derived from the failure to tune the enwiki model directly, and the success in doing so, in limited capacity, after limited gpt4all tuning.

What prevented training from scratch was the lack of unmasked instruction training data, as the original finetune went straight to fully masked instruction + input with unmasked outputs. This worked on raven models because they had already been pretrained for memory recall, but it is unable to teach the model on its own otherwise.

On the other hand, if we train with the instruction unmasked, even with the "input" document (what needs to be memorized) masked, the model ends up thinking it should be somewhat RNG-ing a specific set of words when given an instruction, and it fails to fully learn the memorization task. At the very least, however, it learns the "instruction statement" trigger.
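
To make the two masking schemes above concrete, here is a minimal sketch of per-token loss masks. The segment names, token counts, and the 0/1 mask convention are assumptions for illustration; they are not taken from the actual `memory_script` generators or the trainer.

```python
# Sketch only: segment names and the 0/1 mask convention are assumptions,
# not the format actually produced by the memory_script generators.

def build_loss_mask(segment_lengths, unmasked_segments):
    """Return a per-token mask: 1 = token contributes to the loss, 0 = masked."""
    mask = []
    for name, length in segment_lengths:
        mask.extend([1 if name in unmasked_segments else 0] * length)
    return mask

# Hypothetical token counts for one sample: instruction, input document, output.
sample = [("instruction", 4), ("input", 3), ("output", 3)]

# "Prompt completion" style: instruction and input masked, only the output is trained on.
masked_prompt = build_loss_mask(sample, {"output"})

# "Segmented" style: only the input document is masked.
segmented = build_loss_mask(sample, {"instruction", "output"})

print(masked_prompt)  # [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
print(segmented)      # [1, 1, 1, 1, 0, 0, 0, 1, 1, 1]
```

In the first scheme the model never gets a training signal on the instruction tokens, which matches the observation that it cannot learn the trigger on its own. In the second scheme it does, but the memorized document itself stays masked.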

Finally, we used the original limited word list for the bulk of the training / validation, instead of the full, larger 400k+ word list. Another issue we faced was that the model training / memory gets completely blindsided (bad loss) when it encounters a set of words it has never seen before. This is possible even after 100k samples, due to how large the word list is, and it prevents or slows down the training process.
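
A rough back-of-the-envelope check of the "new words after 100k samples" point, as a sketch. It assumes words are drawn uniformly at random with replacement, and the average of 10 words per sample is a made-up figure for illustration; the real generators may sample differently.

```python
def unseen_fraction(vocab_size, words_drawn):
    """Expected fraction of the vocabulary never drawn, assuming uniform
    sampling with replacement (an assumption, not the generators' known behavior)."""
    return (1.0 - 1.0 / vocab_size) ** words_drawn

VOCAB_SIZE = 400_000        # "400k+" word list size, from the text
SAMPLES = 100_000           # number of document samples, from the text
AVG_WORDS_PER_SAMPLE = 10   # hypothetical average, purely for illustration

frac = unseen_fraction(VOCAB_SIZE, SAMPLES * AVG_WORDS_PER_SAMPLE)
print(f"unseen fraction of the word list: {frac:.3f}")
print(f"expected number of unseen words:  {frac * VOCAB_SIZE:.0f}")
```

Under these assumptions about 8% of the word list (roughly 33k words) would still be unseen after 100k samples, so a validation sample made of never-seen words is entirely plausible. The figure shrinks quickly as the average sample length grows.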

While the enwiki finetune resolves these issues in stages, our goal is to be able to replicate this experiment across multiple model configs rapidly. The training-from-scratch model is an attempt to remove the enwiki + gpt4all steps from the process (hopefully), by ensuring a good mix of all 3 of the above kinds of data when training from scratch (instead of the previous attempts at them one by one).
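
For a sense of what the "mix" looks like in this notebook, the sketch below tallies the sample counts passed to the three generator scripts in the Stage 1 dataset cell. The counts are copied from those commands; the grouping names are just labels chosen here.

```python
# Sample counts copied from the generator commands in the Stage 1 dataset cell.
# The dictionary keys are labels for this sketch only.
PLANNED_SAMPLES = {
    "segmented":      [5000] + [20000] * 6 + [10000] * 2,
    "limited-masked": [5000] + [10000] * 6 + [5000] * 2,
    "full-masked":    [5000] + [10000] * 6,
}

totals = {name: sum(counts) for name, counts in PLANNED_SAMPLES.items()}
grand_total = sum(totals.values())
shares = {name: total / grand_total for name, total in totals.items()}

for name in PLANNED_SAMPLES:
    print(f"{name:>15}: {totals[name]:>7} samples ({shares[name]:.1%})")
print(f"{'total':>15}: {grand_total:>7} samples")
```

By sample count, roughly half of the data is segmented (instruction unmasked) and the rest is split between the limited and full word list prompt completion pairs. This counts samples only, not tokens.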

## Scratch - Stage 1

Prepare and preload the dataset for the finetuning process.

In [5]:
%%script bash
# Reset the dataset dir
mkdir -p ./dataset
rm -rf ./dataset/*.jsonl

# Generate the various datasets
echo "## Generating word reptition dataset ##"

# For the first stage of < 512 tokens, we form a strong bias towards <= 100 words
# to focus training on smaller datasets in the initial stages

# Segmented JSONL was designed to mask only the input document.
# While it failed to teach memorization properly, it teaches the model how to understand the instruction triggers.
#
# One theory is that it loosely teaches memory, but the model gets confused into thinking these words should be
# randomly generated when it sees a certain instruction, which is not the case.
python ./memory_script/gen_limited_segmented_jsonl.py ./dataset/segmented-word-2-count.jsonl  2  5000 &
python ./memory_script/gen_limited_segmented_jsonl.py ./dataset/segmented-word-5-count.jsonl  5  20000 &
python ./memory_script/gen_limited_segmented_jsonl.py ./dataset/segmented-word-10-count.jsonl 10 20000 &
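# NOTE: the file name says 15, but the max word count passed is 10 (this matches the logged output); 15 was probably intended
python ./memory_script/gen_limited_segmented_jsonl.py ./dataset/segmented-word-15-count.jsonl 10 20000 &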
python ./memory_script/gen_limited_segmented_jsonl.py ./dataset/segmented-word-20-count.jsonl 20 20000 &
python ./memory_script/gen_limited_segmented_jsonl.py ./dataset/segmented-word-40-count.jsonl 40 20000 &
python ./memory_script/gen_limited_segmented_jsonl.py ./dataset/segmented-word-80-count.jsonl 80 20000 &
python ./memory_script/gen_limited_segmented_jsonl.py ./dataset/segmented-word-100-count.jsonl 100 10000 &
python ./memory_script/gen_limited_segmented_jsonl.py ./dataset/segmented-word-200-count.jsonl 200 10000 &

# Prompt completion pairs have a fully masked instruction and input, with unmasked outputs.
# This is required to actually teach the model how to memorize the input, but on its own
# it is unable to teach the model how to trigger this behavior (as the instruction is masked).
python ./memory_script/gen_limited_masked_jsonl.py ./dataset/limited-masked-word-2-count.jsonl  2  5000 &
python ./memory_script/gen_limited_masked_jsonl.py ./dataset/limited-masked-word-5-count.jsonl  5  10000 &
python ./memory_script/gen_limited_masked_jsonl.py ./dataset/limited-masked-word-10-count.jsonl 10 10000 &
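# NOTE: the file name says 15, but the max word count passed is 10 (this matches the logged output); 15 was probably intended
python ./memory_script/gen_limited_masked_jsonl.py ./dataset/limited-masked-word-15-count.jsonl 10 10000 &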
python ./memory_script/gen_limited_masked_jsonl.py ./dataset/limited-masked-word-20-count.jsonl 20 10000 &
python ./memory_script/gen_limited_masked_jsonl.py ./dataset/limited-masked-word-40-count.jsonl 40 10000 &
python ./memory_script/gen_limited_masked_jsonl.py ./dataset/limited-masked-word-80-count.jsonl 80 10000 &
python ./memory_script/gen_limited_masked_jsonl.py ./dataset/limited-masked-word-100-count.jsonl 100 5000 &
python ./memory_script/gen_limited_masked_jsonl.py ./dataset/limited-masked-word-200-count.jsonl 200 5000 &

# Prompt completion pairs, with the full word list. Due to the size of the full word list, the model
# could get stuck just learning to recognize new words / tokens instead of performing the memorization task.
# This greatly slowed down the memorization learning process, as the model was constantly learning new words.
# With 400k+ words in total, even after 100k document samples, new words can still appear (due to how RNG works).
#
# We still include a mix of this data, in an attempt to reduce overtraining the model on only a fixed token set,
# which was one of the weaknesses faced in the original training / benchmark (but technically not an issue for measuring memory).
python ./memory_script/gen_full_masked_jsonl.py ./dataset/full-masked-word-2-count.jsonl  2  5000 &
python ./memory_script/gen_full_masked_jsonl.py ./dataset/full-masked-word-5-count.jsonl  5  10000 &
python ./memory_script/gen_full_masked_jsonl.py ./dataset/full-masked-word-10-count.jsonl 10 10000 &
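# NOTE: the file name says 15, but the max word count passed is 10 (this matches the logged output); 15 was probably intended
python ./memory_script/gen_full_masked_jsonl.py ./dataset/full-masked-word-15-count.jsonl 10 10000 &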
python ./memory_script/gen_full_masked_jsonl.py ./dataset/full-masked-word-20-count.jsonl 20 10000 &
python ./memory_script/gen_full_masked_jsonl.py ./dataset/full-masked-word-40-count.jsonl 40 10000 &
python ./memory_script/gen_full_masked_jsonl.py ./dataset/full-masked-word-80-count.jsonl 80 10000 &

wait
echo "## Done ##"

## Generating word reptition dataset ##
Generated JSONL file with - 2 max words, 5000 samples - at ./dataset/segmented-word-2-count.jsonl
Generated JSONL file with - 5 max words, 10000 samples - at ./dataset/limited-masked-word-5-count.jsonl
Generated JSONL file with - 10 max words, 10000 samples - at ./dataset/limited-masked-word-15-count.jsonl
Generated JSONL file with - 2 max words, 5000 samples - at ./dataset/full-masked-word-2-count.jsonl
Generated JSONL file with - 2 max words, 5000 samples - at ./dataset/limited-masked-word-2-count.jsonl
Generated JSONL file with - 10 max words, 20000 samples - at ./dataset/segmented-word-15-count.jsonl
Generated JSONL file with - 10 max words, 10000 samples - at ./dataset/limited-masked-word-10-count.jsonl
Generated JSONL file with - 10 max words, 10000 samples - at ./dataset/full-masked-word-15-count.jsonl
Generated JSONL file with - 5 max words, 10000 samples - at ./dataset/full-masked-word-5-count.jsonl
Generated JSONL file with - 5 max word
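
Since the generators run in parallel and the captured output above is cut off, it can help to confirm the files on disk before preloading. The following is a small sketch that counts the JSONL samples per file. The helper names are made up for this example, and it only assumes one JSON object per line in `./dataset/*.jsonl`; it does not assume any particular field names.

```python
import glob
import json
import os

def count_jsonl_samples(path):
    """Count non-empty lines in a JSONL file, checking that each one parses as JSON."""
    count = 0
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                json.loads(line)  # raises ValueError on a malformed line
                count += 1
    return count

def summarize_dataset_dir(dataset_dir="./dataset"):
    """Map each *.jsonl file name in dataset_dir to its sample count."""
    paths = sorted(glob.glob(os.path.join(dataset_dir, "*.jsonl")))
    return {os.path.basename(p): count_jsonl_samples(p) for p in paths}

if __name__ == "__main__":
    summary = summarize_dataset_dir()
    for name, n in summary.items():
        print(f"{n:>7}  {name}")
    print(f"{sum(summary.values()):>7}  total across {len(summary)} files")
```

With the commands in the cell above, one would expect 25 files; a lower file count or an unexpectedly small sample count suggests one of the background jobs failed.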

In [6]:
# Configure your preferred options

DEEPSPEED_STRAT="deepspeed_stage_1"
GPU_DEVICES="auto"
ENABLE_WANDB=True
WANDB_PREFIX="Echo-A-1B5"

print("DEEPSPEED_STRAT:", DEEPSPEED_STRAT)
print("ENABLE_WANDB:", ENABLE_WANDB)
print("GPU_DEVICES:", GPU_DEVICES)

if ENABLE_WANDB:
    WANDB_MODE="online"
else:
    WANDB_MODE="disabled"

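# Compute the notebook directory and the trainer path
import os
# "__file__" is a plain string here (notebooks do not define __file__), so this resolves to the current working directory
NOTEBOOK_DIR=os.path.dirname(os.path.abspath("__file__"))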
TRAINER_DIR=os.path.abspath(os.path.join(NOTEBOOK_DIR, "../../../RWKV-v4neo/"))

print("NOTEBOOK_DIR:", NOTEBOOK_DIR)
print("TRAINER_DIR:", TRAINER_DIR)

DEEPSPEED_STRAT: deepspeed_stage_1
ENABLE_WANDB: True
GPU_DEVICES: auto
NOTEBOOK_DIR: /home/picocreator/rwkv-proj/picocreator-memory-experiment/notebook/experiment/memory-scratch
TRAINER_DIR: /home/picocreator/rwkv-proj/picocreator-memory-experiment/RWKV-v4neo


## Stage 1: Low word count memory training

In [7]:
# Let's preload the required dataset
!cd "{TRAINER_DIR}" && \
    python3 preload_dataset.py "{NOTEBOOK_DIR}/Echo-A-1B5-scratch-stage-1.yaml"

# Ensure the checkpoint directory exists
!cd "{TRAINER_DIR}" && mkdir -p "../checkpoint/Echo-A-1B5-scratch-stage-1/"

Resolving data files: 100%|██████████████████| 25/25 [00:00<00:00, 38850.54it/s]
Downloading and preparing dataset json/default to /home/picocreator/.cache/huggingface/datasets/json/default-9d3acc0155290f3e/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96...
Downloading data files: 100%|███████████████████| 1/1 [00:00<00:00, 4686.37it/s]
Extracting data files: 100%|█████████████████████| 1/1 [00:00<00:00, 122.88it/s]
Dataset json downloaded and prepared to /home/picocreator/.cache/huggingface/datasets/json/default-9d3acc0155290f3e/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96. Subsequent calls will reuse this data.
100%|█████████████████████████████████████████████| 1/1 [00:00<00:00, 32.98it/s]
                                                                                

In [5]:
# Start the memory model training
!cd "{TRAINER_DIR}" && \
    export RWKV_TORCH_COMPILE=0 && \
    export RWKV_JIT_ON=1 && \
    export WANDB_MODE="{WANDB_MODE}" && \
    python new_train.py fit \
        -c "{NOTEBOOK_DIR}/Echo-A-1B5-scratch-stage-1.yaml" \
        --trainer.logger.init_args.name="{WANDB_PREFIX} - Scratch-Stage-1 (bs=256, train-ctx=512, {DEEPSPEED_STRAT})" \
        --trainer.strategy="{DEEPSPEED_STRAT}" \
        --trainer.devices="{GPU_DEVICES}"  \
        --model.ctx_len=512

[2023-07-10 11:53:45,272] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[RWKV.model] Running RWKV model using 'torch-jit' with torch '2.1.0.dev20230706'
  rank_zero_warn(f"No seed found, seed set to {seed}")
Global seed set to 3901155180
[34m[1mwandb[0m: Currently logged in as: [33mpicocreator[0m. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Tracking run with wandb version 0.15.5
[34m[1mwandb[0m: Run data is saved locally in [35m[1m./wandb/run-20230710_115347-c8mf73xy[0m
[34m[1mwandb[0m: Run [1m`wandb offline`[0m to turn off syncing.
[34m[1mwandb[0m: Syncing run [33mEcho-A-1B5 - Mem-Train-Stage-1 (bs=64, train-ctx=1024)[0m
[34m[1mwandb[0m: ⭐️ View project at [34m[4mhttps://wandb.ai/picocreator/RWKV-Memory-Experiment[0m
[34m[1mwandb[0m: 🚀 View run at [34m[4mhttps://wandb.ai/picocreator/RWKV-Memory-Experiment/runs/c8mf73xy[0m
Using /home/ubuntu/.cache/torch_extensions/py311_cu118 

In [None]:
# Let's export the model from the checkpoint
!cd "{TRAINER_DIR}" && \
    python export_checkpoint.py "../checkpoint/Echo-A-1B5-scratch-stage-1/last.ckpt" "../model/Echo-A-1B5-Scratch-Stage-1.pth"
!cd "{TRAINER_DIR}" && ls -alh "../model/Echo-A-1B5-Scratch-Stage-1.pth"

In [None]:
# Let's do a quick dragon prompt validation
!cd "{TRAINER_DIR}" && python3 dragon_test.py "../model/Echo-A-1B5-Scratch-Stage-1.pth" "cuda fp32"