TaskMem-Phase One

RL training pipeline for episodic memory generation, built on top of volcengine/verl (forked at 33eb86f54f75ffc5825bfd54cee71b57b6b8ae13). The base policy is Qwen3-VL-30B-A3B-Thinking; reward judging is done via Azure OpenAI (GPT-4o) + Gemini 2.5 Flash through the M3 reward manager.

1. Install

git clone https://github.com/Hope-Rita/TaskMem-PhaseOne.git
cd TaskMem-PhaseOne
bash bootstrap.sh

bootstrap.sh installs PyTorch 2.8 / CUDA 12.8, vLLM 0.11, Megatron-LM core_v0.13.1, transformer-engine 2.8, flash-attn 2.7.4, and the M3 reward manager's Python dependencies (json_repair, openai, etc.).

Azure OpenAI / Gemini judge config

The M3 reward manager calls an external judge through Azure OpenAI. Copy the template and fill in your own endpoints / keys (the real file is gitignored):

cp configs/api_config.example.json configs/api_config.json
# then edit configs/api_config.json

Each model name maps to either a single config object or a list of config objects; when a list is given, requests are spread round-robin across its entries (useful for staying under rate limits):

{
  "gpt-4o": { "azure_endpoint": "...", "api_version": "...", "api_key": "..." },
  "gemini-2.5-flash": [
    { "azure_endpoint": "...", "api_version": "...", "api_key": "..." },
    { "azure_endpoint": "...", "api_version": "...", "api_key": "..." }
  ]
}

If you store the config elsewhere, point at it via M3_API_CONFIG:

export M3_API_CONFIG=/path/to/your/api_config.json
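The lookup and rotation described above can be sketched in a few lines of Python. This is not the M3 reward manager's actual loader, which is not shown here; `load_api_config`, `make_rotators`, and the fallback path are hypothetical names chosen for illustration. The sketch assumes only what the README states: the file is JSON, each model name holds one object or a list of objects, and M3_API_CONFIG may point at a non-default location.

```python
import itertools
import json
import os
from pathlib import Path

# Assumed fallback location, taken from the `cp` command above.
DEFAULT_CONFIG = Path("configs/api_config.json")


def load_api_config(path=None):
    """Load the judge config from an explicit path, else M3_API_CONFIG, else the default."""
    cfg_path = Path(path or os.environ.get("M3_API_CONFIG", DEFAULT_CONFIG))
    with cfg_path.open(encoding="utf-8") as f:
        raw = json.load(f)
    # Normalize so every model name maps to a list of config objects.
    return {
        name: entry if isinstance(entry, list) else [entry]
        for name, entry in raw.items()
    }


def make_rotators(config):
    """Build one endless round-robin iterator per model name (hypothetical helper)."""
    return {name: itertools.cycle(entries) for name, entries in config.items()}
```

With this in place, `next(make_rotators(load_api_config())["gemini-2.5-flash"])` would return the next config object in rotation on each call; a model with a single object simply yields that object every time.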

2. Data format

VLDataset (see verl/utils/dataset/vl_dataset.py) reads JSONL, with one JSON object per line. The required fields depend on type:

| Field | Type | Required for | Description |
|---|---|---|---|
| `id` | str | all | Sample id; format `<video_id><key>` (split on ``) |
| `type` | str | all | One of `"episodic_on_policy"`, `"semantic"` |
| `step` | int | episodic / semantic | Curriculum step. When `> 0` the loader pulls previously generated memories from `<default_local_dir>/history/<video_id>.json` |
| `video_idx` | int | episodic / semantic | Current segment index inside the video |
| `episodic_folder` | str | episodic / semantic | Directory containing `<video_id>/{video_id}_map.json` (face-id mapping) and `episodic_<memory_tag>.json` |
| `memory_tag` | str | semantic only | Selector for `episodic_<memory_tag>.json` |
| `input` | list | all | OpenAI-style content blocks (`{"type": "text"\|"video"\|"image", ...}`); the last text block must contain `[Description of the preceding part]:` so the reward manager can extract preceding context |

The dataset rewrites face ids (<face_n>) using the per-frame map files so that ids remain locally consistent across history segments.
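Before launching a run it can help to sanity-check a JSONL file against the field table. The following is a minimal sketch, not part of VLDataset: `validate_sample` is a hypothetical helper, it treats `"episodic_on_policy"` as the "episodic" case from the table, and it assumes text blocks carry their content under a `"text"` key as in OpenAI-style messages. It does not check the `id` format.

```python
CONTEXT_MARKER = "[Description of the preceding part]:"
VALID_TYPES = {"episodic_on_policy", "semantic"}

# Fields the table lists for every sample type, with their expected Python types.
COMMON_FIELDS = {
    "id": str, "type": str, "step": int,
    "video_idx": int, "episodic_folder": str, "input": list,
}
# Fields the table lists for one type only.
EXTRA_FIELDS = {"semantic": {"memory_tag": str}}


def validate_sample(sample):
    """Return a list of problems found in one decoded JSONL record (empty if none)."""
    problems = []
    sample_type = sample.get("type")
    if not isinstance(sample_type, str) or sample_type not in VALID_TYPES:
        problems.append(f"unknown type: {sample_type!r}")
        sample_type = None
    required = {**COMMON_FIELDS, **EXTRA_FIELDS.get(sample_type, {})}
    for name, expected in required.items():
        if not isinstance(sample.get(name), expected):
            problems.append(f"{name} must be {expected.__name__}")
    blocks = sample.get("input")
    if not isinstance(blocks, list):
        blocks = []
    # Only the last text block is required to carry the context marker.
    texts = [b for b in blocks if isinstance(b, dict) and b.get("type") == "text"]
    if not texts or CONTEXT_MARKER not in str(texts[-1].get("text", "")):
        problems.append("last text block lacks the preceding-context marker")
    return problems
```

Running `validate_sample(json.loads(line))` over each line of a training file and printing any non-empty result gives a quick pre-flight check without loading the model.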

3. Train

The repository ships run.sh with the Megatron / vLLM / offload defaults already baked in. Supply the user-specific overrides (model path, data paths, output dir, reward hyperparameters) on the command line:

bash run.sh \
    data.train_batch_size=32 \
    actor_rollout_ref.rollout.n=8 \
    data.train_files=/path/to/train.jsonl \
    data.val_files=/path/to/val.jsonl \
    actor_rollout_ref.model.path=/path/to/Qwen3-VL-30B-A3B-Thinking \
    trainer.default_local_dir=/path/to/output/run_name \
    trainer.experiment_name=run_name \
    trainer.save_freq=10 \
    reward_model.reward_kwargs.memory_length_threshold=5 \
    reward_model.reward_kwargs.think_penalty=-1.0 \
    reward_model.reward_kwargs.think_length_threshold=1200 \
    reward_model.reward_kwargs.correct_reward=0.5 \
    reward_model.reward_kwargs.wrong_reward=-0.5 \
    reward_model.reward_kwargs.format_penalty=-1.5 \
    reward_model.reward_kwargs.memory_token_threshold=150 \
    reward_model.reward_kwargs.usefulness_scale=0.1 \
    actor_rollout_ref.actor.policy_loss.loss_mode=gspo \
    actor_rollout_ref.rollout.calculate_log_probs=True \
    actor_rollout_ref.actor.use_dynamic_bsz=False \
    actor_rollout_ref.ref.log_prob_use_dynamic_bsz=False \
    actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=False \
    algorithm.adv_estimator=grpo \
    algorithm.mode=optimize_threshold \
    data.shuffle=False \
    data.dataloader_num_workers=0
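When the same overrides are reused across several runs, it may be more convenient to keep them in a nested dictionary and generate the command line. The sketch below is an illustration only: `to_overrides` and `launch` are hypothetical helpers, and the sketch assumes run.sh accepts `dotted.key=value` arguments exactly as in the command above.

```python
import subprocess


def to_overrides(params, prefix=""):
    """Flatten a nested dict into `dotted.key=value` strings, preserving order."""
    overrides = []
    for key, value in params.items():
        dotted = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            overrides.extend(to_overrides(value, dotted))
        else:
            # Booleans render as True/False, matching the command above.
            overrides.append(f"{dotted}={value}")
    return overrides


def launch(params, script="run.sh", dry_run=True):
    """Build the command; run it only when dry_run is False."""
    cmd = ["bash", script, *to_overrides(params)]
    if dry_run:
        return cmd
    return subprocess.run(cmd, check=True)
```

For example, `launch({"data": {"train_batch_size": 32, "shuffle": False}})` returns the argument list without executing anything, which makes it easy to inspect the overrides before starting a long job.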

4. Outputs

All artifacts go under trainer.default_local_dir:

<default_local_dir>/
├── global_step_<N>/                  # checkpoints
├── <N>/<video_id>.json               # per-step, per-video reward log
└── history/<video_id>.json           # accumulated memory used as preceding-context
                                      # for subsequent curriculum steps
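Given this layout, locating artifacts from a script is straightforward. The helpers below are a hypothetical sketch based only on the directory tree above; they resolve paths and make no assumption about the contents of the JSON files or checkpoint directories. Reading `<N>` in the reward-log path as the training step number is an assumption.

```python
import re
from pathlib import Path

STEP_DIR = re.compile(r"global_step_(\d+)")


def latest_checkpoint(root):
    """Return the global_step_<N> directory with the largest N, or None if absent."""
    best = None
    for child in Path(root).iterdir():
        match = STEP_DIR.fullmatch(child.name)
        if child.is_dir() and match:
            step = int(match.group(1))
            # Compare numerically so that step 10 sorts after step 9.
            if best is None or step > best[0]:
                best = (step, child)
    return best[1] if best else None


def history_path(root, video_id):
    """Path of the accumulated memory file for one video."""
    return Path(root) / "history" / f"{video_id}.json"


def reward_log_path(root, step, video_id):
    """Path of the reward log for one video at one step (step meaning assumed)."""
    return Path(root) / str(step) / f"{video_id}.json"
```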

W&B logging is enabled by default (trainer.logger=["console","wandb"], trainer.project_name=m3_qwen3_vl). To disable it, pass trainer.logger=["console"].
