This is the official implementation of the paper Skill-Pro: Learning Reusable Skills from Experience via Non-Parametric PPO for LLM Agents (Qirui Mi, Zhijian Ma, Mengyue Yang, Haoxuan Li, Yisen Wang, Haifeng Zhang, Jun Wang).
Skill-Pro is a Skill-augmented Markov Decision Process (Skill-MDP, SMDP) framework for LLM Agents.
On top of a vanilla LLM-Agent decision loop, it introduces an evolvable Skill Pool, where each skill is described by a three-part Initiation / Policy / Termination schema, similar to the classical Options framework. At every step, the agent first selects an appropriate skill and injects it into the prompt, then lets the LLM produce an action. In parallel, a Skill Evolution module continuously generates, refines, and prunes skills based on interaction experience via Non-Parametric PPO (semantic-gradient candidate generation + a PPO Gate for verification), enabling online evolution of the agent's capabilities without any parameter updates.
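Concretely, a skill in the pool can be pictured as a small record like the one below. This is an illustrative sketch only; the field names are assumptions, and the repo's actual definition lives in `data_structures.py`:

```python
from dataclasses import dataclass

@dataclass
class Skill:
    """Illustrative three-part skill schema, in the spirit of the Options framework."""
    name: str
    initiation: str   # natural-language condition: when does this skill apply?
    policy: str       # instructions injected into the prompt while the skill is active
    termination: str  # condition under which the skill should stop
    # Usage statistics maintained online during training
    frequency: int = 0
    avg_gain: float = 0.0
    maturity: float = 0.0
```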
- Skill-MDP / vanilla MDP modes: switch via `--MDP_type {SMDP, MDP}` to enable or disable skills.
- Evolvable Skill Pool: ships with a set of general reasoning skills (StructuredCoT, ReActDecision, HypothesisElimination, etc.), maintained online during training.
- Multiple skill-selection strategies: supports `llm_model` (the LLM picks the skill) and `llm_topk_lcb` (Top-K retrieval + LCB), among others.
- Multiple environments: built on TextArena and ALFWorld; already adapted to `Mastermind-v0`, `FrozenLake`, `Hangman`, `alfworld`, and more.
- Local and remote LLM backends: local models are served via vLLM; remote models are accessed through OpenRouter.
- Rich experiment logging: training curves are tracked with SwanLab, and per-iteration skill snapshots, rewards, and maintenance logs are written to JSON.
- Ablation switches: `--ablation_type` supports `wo_sg / wo_ppo / wo_score / none` for paper-grade reproducibility.
```
skill_Pro/
├── main.py               # Entry point: parses CLI args and launches SkillMDP
├── run.py                # Core training / evaluation loop (SkillMDP)
├── data_structures.py    # Skill / Experience dataclasses
├── pool_managers.py      # ExperiencePool / GoldenExperiencePool
├── Skills/
│   ├── skill_pool.py     # Skill Pool (selection, retrieval, maintenance)
│   ├── skill_evolution.py # Skill evolution and verification
│   └── loss.py           # Log-prob and other training losses
├── utils/
│   ├── local_llm.py      # Local LLM wrapper (vLLM / HF)
│   ├── encode.py         # Text embedding (sentence-transformers)
│   ├── models.py         # Model-side utilities
│   └── utils.py          # Logging paths, random seeds, etc.
├── configs/
│   └── base_config.yaml  # Configuration for ALFWorld and other envs
├── requirements.txt
└── README.md
```
We recommend Python 3.10. Run the following in Anaconda Prompt or PowerShell:

```shell
git clone https://github.com/<your-org>/Skill_Pro.git
cd Skill_Pro
conda create -n Skillmdp python=3.10 -y
conda activate Skillmdp
pip install -r requirements.txt
```

Tip: if `conda activate` fails the first time you use it in PowerShell, run `conda init powershell` once and restart the terminal.
To run ALFWorld tasks, download the data and set the data path (PowerShell syntax):

```powershell
pip install alfworld
$env:ALFWORLD_DATA = "D:\path\to\alfworld_data"
alfworld-download
```

The `$ALFWORLD_DATA` placeholder in `configs/base_config.yaml` will be expanded automatically.
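Environment-variable expansion of this kind typically boils down to a one-liner; the snippet below is an illustrative sketch (the example path and subdirectory are made up), not necessarily how the repo implements it:

```python
import os

# Illustrative: expanding a $ALFWORLD_DATA placeholder from a config value.
os.environ["ALFWORLD_DATA"] = "/tmp/alfworld_data"  # example value for demonstration
path_template = "$ALFWORLD_DATA/json_2.1.1/train"   # hypothetical config entry
print(os.path.expandvars(path_template))  # -> /tmp/alfworld_data/json_2.1.1/train
```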
- Local model: pass a local path to `--agent_name`, e.g. `/mnt/.../gemma-2-9b-it`. The model is loaded via vLLM internally.
- Remote model: any OpenRouter-served model, e.g. `--agent_name meta-llama/llama-3.3-70b-instruct`. Set the `OPENROUTER_API_KEY` environment variable beforehand.
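One simple way to dispatch between the two backends is to check whether `--agent_name` points at a directory on disk. This is only an illustrative heuristic (the function name is hypothetical; the repo's actual dispatch logic may differ):

```python
import os

def resolve_backend(agent_name: str) -> str:
    """Hypothetical dispatch: an existing local directory is treated as a
    checkpoint to serve via vLLM; anything else as a remote OpenRouter model ID."""
    return "local" if os.path.isdir(agent_name) else "remote"

print(resolve_backend("meta-llama/llama-3.3-70b-instruct"))  # -> remote
```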
```powershell
python main.py `
  --MDP_type SMDP `
  --env_names Mastermind-v0 `
  --agent_name meta-llama/llama-3.3-70b-instruct `
  --ge_model_name meta-llama/llama-3.3-70b-instruct `
  --max_iters 50 `
  --episodes_per_iter 5 `
  --pool_size 10 `
  --select_type llm_model
```

Each outer iteration will:

- Run `episodes_per_iter` episodes in every environment.
- Update each skill's `frequency / avg_gain / maturity` statistics from the collected trajectories.
- Invoke `SkillEvolution` for skill evolution and verification.
- Log to SwanLab and write `outputs/*_training_logs.json`.
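These per-iteration steps can be sketched as a loop like the one below. This is pseudocode-level Python with stub classes; none of the class or method names are the repo's actual API:

```python
class StubEnv:
    """Minimal stand-in for a TextArena/ALFWorld environment."""
    def run_episode(self, pool):
        return {"trajectory": [], "reward": 1.0}

class StubPool(list):
    def update_statistics(self, trajectories):
        pass  # would update frequency / avg_gain / maturity per skill

class StubEvolution:
    def step(self, pool, trajectories):
        pass  # would generate / refine / verify / prune skills

def training_loop(envs, pool, evolution, max_iters, episodes_per_iter):
    logs = []
    for it in range(max_iters):
        # 1. Roll out episodes_per_iter episodes in every environment
        trajectories = [env.run_episode(pool)
                        for env in envs for _ in range(episodes_per_iter)]
        # 2. Update per-skill statistics from the collected trajectories
        pool.update_statistics(trajectories)
        # 3. Skill evolution and verification
        evolution.step(pool, trajectories)
        # 4. Record logs (SwanLab / JSON in the real code)
        logs.append({"iteration": it, "episodes": len(trajectories)})
    return logs

logs = training_loop([StubEnv()], StubPool(), StubEvolution(),
                     max_iters=3, episodes_per_iter=2)
print(len(logs), logs[0]["episodes"])  # -> 3 2
```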
```powershell
python main.py `
  --MDP_type SMDP `
  --test `
  --env_names Mastermind-v0 `
  --agent_name meta-llama/llama-3.3-70b-instruct `
  --load_pool_path ./outputs/xxx_training_logs.json `
  --load_iteration 1
```

```powershell
# Disable Skill Generation
python main.py --ablation_type wo_sg ...

# Disable score-based maintenance, use FIFO instead
python main.py --ablation_type wo_score ...
```

| Argument | Default | Description |
|---|---|---|
| `--MDP_type` | `SMDP` | `SMDP` enables skills; `MDP` is pure LLM decision-making |
| `--ablation_type` | `none` | One of `wo_sg` / `wo_ppo` / `wo_score` / `none` |
| `--env_names` | `Mastermind-v0` | Comma-separated list of task names |
| `--agent_name` | gemma/llama, etc. | Decision LLM (local path or remote ID) |
| `--ge_model_name` | llama-3.3-70b | LLM used for Skill Evolution |
| `--select_type` | `llm_model` | Skill selection strategy |
| `--pool_size` | `10` | Maximum size of the Skill Pool |
| `--max_iters` | `50` | Number of outer iterations |
| `--episodes_per_iter` | `5` | Episodes per environment per iteration |
| `--epsilon_initial` | `0.3` | Initial ε for skill selection |
| `--skill_select_k` | `1` | Number of skill-selection steps |
| `--topk` | `5` | Top-K used during retrieval |
| `--output_dir` | `outputs` | Directory for training logs |
See `parse_arguments` in `main.py` for the full list.
- JSON training logs: `outputs/<env>_<MDP>_<select>_<agent>_<ge_model>_<timestamp>_training_logs.json`, containing:
  - Per-iteration `avg_return`, `task_baselines`, and `global_baseline`
  - `pool_snapshot`: snapshot of the Skill Pool (including initiation / policy / termination / freq / gain / maturity)
  - `evolution_details` and `maintenance_details`: evolution and maintenance traces
  - `delta_prompt_tokens_per_step`: extra prompt tokens introduced by skill injection
- SwanLab: the experiment name is auto-composed from CLI arguments and tracks per-environment episode return, baseline, ε, and more.
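A log with these fields can be inspected with a few lines of Python. The sample data below is made up, and the exact layout of the real JSON file may differ:

```python
import json  # for a real file you would use: log = json.load(open(path))

# Hypothetical sample mirroring the per-iteration fields listed above
log = [
    {"iteration": 0, "avg_return": 0.2, "delta_prompt_tokens_per_step": 130.5},
    {"iteration": 1, "avg_return": 0.5, "delta_prompt_tokens_per_step": 142.0},
]

best = max(log, key=lambda it: it["avg_return"])
print("best iteration:", best["iteration"])  # -> best iteration: 1
print("mean avg_return:", sum(it["avg_return"] for it in log) / len(log))
```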
```
┌─────────────────────┐
│      SkillPool      │
│  (Initiation/Policy │
│    /Termination)    │
└──────────┬──────────┘
           │ select_skill(state)
           ▼
state ─► build_decision_prompt() ─► LLM Policy ─► action ─► Env
  ▲                                                  │
  │              trajectory + reward                 │
  │                                                  ▼
ExperiencePool / GoldenExperiencePool ◄──────────────┘
           │
           ▼
     SkillEvolution
(generate / refine / verify / score / prune skills)
```
- Skill selection: `SkillPool.select_skill` decides whether to activate a skill given the current state, ε, and `select_type`.
- Skill termination: `SkillMDP._should_terminate` uses a meta-controller LLM to decide whether the active skill should stop.
- Skill maintenance: `maintain` / `maintain_fifo` prune the pool by `freq × avg_gain` score or by FIFO.
- Skill evolution: `SkillEvolution.run_skill_evolution_with_verification` combines recent experiences with the Golden Experience Pool to generate and verify new skills.
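The two maintenance modes can be illustrated with a minimal sketch. The scoring rule follows the `freq × avg_gain` description above, but the skill entries are made-up examples and the repo's `maintain` / `maintain_fifo` may differ in detail:

```python
def maintain(skills, pool_size):
    """Score-based pruning: keep the top pool_size skills by frequency * avg_gain."""
    ranked = sorted(skills, key=lambda s: s["frequency"] * s["avg_gain"], reverse=True)
    return ranked[:pool_size]

def maintain_fifo(skills, pool_size):
    """Ablation (wo_score): drop the oldest skills first instead of scoring them."""
    return skills[-pool_size:]

skills = [  # illustrative entries, ordered oldest-first
    {"name": "StructuredCoT", "frequency": 12, "avg_gain": 0.4},         # score 4.8
    {"name": "ReActDecision", "frequency": 3, "avg_gain": 0.9},          # score 2.7
    {"name": "HypothesisElimination", "frequency": 8, "avg_gain": 0.1},  # score 0.8
]
print([s["name"] for s in maintain(skills, 2)])  # -> ['StructuredCoT', 'ReActDecision']
```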
- vLLM fails to load? Check `tensor_parallel_size` (default is 4); set it to 1 on a single-GPU machine and make sure VRAM is sufficient.
- OpenRouter returns 401? Set the environment variable: in PowerShell, `$env:OPENROUTER_API_KEY = "..."`.
- Running non-ALFWorld tasks without ALFWorld installed? The code falls back gracefully and skips ALFWorld initialization automatically.
If you find this work useful, please consider citing our paper:

```bibtex
@article{mi2026skillpro,
  title   = {Skill-Pro: Learning Reusable Skills from Experience via Non-Parametric PPO for LLM Agents},
  author  = {Mi, Qirui and Ma, Zhijian and Yang, Mengyue and Li, Haoxuan and Wang, Yisen and Zhang, Haifeng and Wang, Jun},
  journal = {arXiv preprint arXiv:2602.01869},
  year    = {2026},
  url     = {https://arxiv.org/abs/2602.01869}
}
```

This project is released under the MIT License. See the LICENSE file at the repository root for details.