Mamba PPO Design #1

@ahmedammar

First, this is a great repo; it seems to have taken inspiration from cleanrl, lovely!

I would love to learn more about the NN architecture decisions you made in building ppo_mamba.py. I was hoping to find something in the reference paper, but it seems to focus more on results.

The only statement on the architecture I found in the paper is the one below:

Mamba/Mamba-2: Integrated using the official implementation from the mamba-ssm repository. For Mamba, we employed an optimized training approach utilizing the selective scan mechanism without resetting at episode boundaries, offering computational advantages but potentially introducing state leakage between episodes. We incorporated post-model MLP layers and layer normalization.
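To make sure I'm picturing that description correctly, here is a rough sketch of how I imagine it could look. This assumes the Mamba block from mamba-ssm and made-up layer sizes; it is not necessarily what ppo_mamba.py actually does:

```python
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # official implementation from the mamba-ssm repository


class MambaBackbone(nn.Module):
    """Rough sketch: a single Mamba layer followed by post-model LayerNorm + MLP,
    roughly as described in the paper excerpt above. Sizes are placeholders."""

    def __init__(self, obs_dim: int, d_model: int = 128, hidden: int = 256):
        super().__init__()
        self.embed = nn.Linear(obs_dim, d_model)   # project obs to the Mamba input dim
        self.mamba = Mamba(d_model=d_model)        # one Mamba block over the time axis
        self.norm = nn.LayerNorm(d_model)          # post-model layer normalization
        self.mlp = nn.Sequential(                  # post-model MLP layers
            nn.Linear(d_model, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )

    def forward(self, obs_seq: torch.Tensor) -> torch.Tensor:
        # obs_seq: (batch, seq_len, obs_dim); the selective scan runs over seq_len
        x = self.embed(obs_seq)
        x = self.mamba(x)
        x = self.norm(x)
        return self.mlp(x)  # features for the actor/critic heads
```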

The kinds of questions I'm looking for answers to:

  • Why are the envs aligned to a specific input dimension?
  • How could you address the state leakage? Padding? Or resetting the recurrent state at episode boundaries (first sketch at the end of this post)?
  • The usage of mamba_state[0] and mamba_state[1]
  • Why only a single Mamba layer?
  • The reasoning behind the different optimizer learning rates? And why is LR annealing only applied to layers 0 and -1 (the second sketch at the end shows the generic pattern I'm assuming)? How would you tune these for different problems, and which metrics should one watch during training?

Just trying to learn here!
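On the state-leakage point, here is a minimal sketch of the alternative I have in mind: masking the recurrent state with the done flags at episode boundaries, the way recurrent PPO implementations (e.g. CleanRL's LSTM variant) usually do. The tuple layout of mamba_state below is just an assumption on my part, not taken from the repo:

```python
import torch


def reset_states_on_done(mamba_state, done):
    """Zero the per-env recurrent state wherever an episode just ended.

    Hypothetical helper: assumes `mamba_state` is a tuple of tensors whose
    first dimension indexes the parallel envs (e.g. a conv state and an SSM
    state), and `done` is a float tensor of shape (num_envs,) with 1.0 at
    episode boundaries.
    """
    return tuple(
        s * (1.0 - done).view(-1, *([1] * (s.dim() - 1))).to(s.dtype)
        for s in mamba_state
    )


# Example: two made-up state tensors for 4 parallel envs; env 2 just finished.
conv_state = torch.randn(4, 16, 4)
ssm_state = torch.randn(4, 16, 8)
done = torch.tensor([0.0, 0.0, 1.0, 0.0])
conv_state, ssm_state = reset_states_on_done((conv_state, ssm_state), done)
```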
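And on the learning-rate question, this is the generic PyTorch pattern I assume is in play: separate parameter groups with different base learning rates, where the annealing loop only overwrites some of them. All names and values below are placeholders, not taken from ppo_mamba.py:

```python
import torch
import torch.nn as nn

# Stand-in network: the point is only the optimizer/annealing pattern.
model = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 2))

# Separate parameter groups with different base learning rates (made-up values).
base_lrs = [3e-4, 1e-3]
optimizer = torch.optim.Adam([
    {"params": model[0].parameters(), "lr": base_lrs[0]},   # group 0
    {"params": model[2].parameters(), "lr": base_lrs[-1]},  # group -1
])

total_updates = 1000
for update in range(1, total_updates + 1):
    frac = 1.0 - (update - 1) / total_updates
    # Anneal only the first and last groups; any groups in between would keep
    # their base learning rate untouched.
    optimizer.param_groups[0]["lr"] = frac * base_lrs[0]
    optimizer.param_groups[-1]["lr"] = frac * base_lrs[-1]
    # ... rollout collection, PPO loss, and optimizer.step() would go here ...
```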
