Mamba PPO Design #1

@ahmedammar

First, this is a great repo; it seems to have taken inspiration from cleanrl, lovely!

I would love to learn more about the NN architecture decisions you made in building ppo_mamba.py. I was hoping to find something in the reference paper, but it seems to focus more on results.

The only statement on the architecture I found in the paper is the one below:

Mamba/Mamba-2: Integrated using the official implementation from the mamba-ssm repository. For Mamba, we employed an optimized training approach utilizing the selective scan mechanism without resetting at episode boundaries, offering computational advantages but potentially introducing state leakage between episodes. We incorporated post-model MLP layers and layer normalization.
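To make sure I'm picturing that description correctly, here is a rough sketch of how I imagine it could look. This assumes the Mamba block from mamba-ssm and made-up layer sizes; it is not necessarily what ppo_mamba.py actually does:

```python
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # official implementation from the mamba-ssm repository


class MambaBackbone(nn.Module):
    """Rough sketch: a single Mamba layer followed by post-model LayerNorm + MLP,
    roughly as described in the paper excerpt above. Sizes are placeholders."""

    def __init__(self, obs_dim: int, d_model: int = 128, hidden: int = 256):
        super().__init__()
        self.embed = nn.Linear(obs_dim, d_model)   # project obs to the Mamba input dim
        self.mamba = Mamba(d_model=d_model)        # one Mamba block over the time axis
        self.norm = nn.LayerNorm(d_model)          # post-model layer normalization
        self.mlp = nn.Sequential(                  # post-model MLP layers
            nn.Linear(d_model, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )

    def forward(self, obs_seq: torch.Tensor) -> torch.Tensor:
        # obs_seq: (batch, seq_len, obs_dim); the selective scan runs over seq_len
        x = self.embed(obs_seq)
        x = self.mamba(x)
        x = self.norm(x)
        return self.mlp(x)  # features for the actor/critic heads
```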

The kinds of questions I'm looking for answers to:

  • Why are the envs aligned to a specific input dimension?
  • How could you address the state leakage? Padding? Or resetting the recurrent state at episode boundaries (first sketch at the end of this post)?
  • The usage of mamba_state[0] and mamba_state[1]
  • Why only a single Mamba layer?
  • The reasoning behind the different optimizer learning rates? And why is LR annealing only applied to layers 0 and -1 (the second sketch at the end shows the generic pattern I'm assuming)? How would you tune these for different problems, and which metrics should one watch during training?

Just trying to learn here!
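On the state-leakage point, here is a minimal sketch of the alternative I have in mind: masking the recurrent state with the done flags at episode boundaries, the way recurrent PPO implementations (e.g. CleanRL's LSTM variant) usually do. The tuple layout of mamba_state below is just an assumption on my part, not taken from the repo:

```python
import torch


def reset_states_on_done(mamba_state, done):
    """Zero the per-env recurrent state wherever an episode just ended.

    Hypothetical helper: assumes `mamba_state` is a tuple of tensors whose
    first dimension indexes the parallel envs (e.g. a conv state and an SSM
    state), and `done` is a float tensor of shape (num_envs,) with 1.0 at
    episode boundaries.
    """
    return tuple(
        s * (1.0 - done).view(-1, *([1] * (s.dim() - 1))).to(s.dtype)
        for s in mamba_state
    )


# Example: two made-up state tensors for 4 parallel envs; env 2 just finished.
conv_state = torch.randn(4, 16, 4)
ssm_state = torch.randn(4, 16, 8)
done = torch.tensor([0.0, 0.0, 1.0, 0.0])
conv_state, ssm_state = reset_states_on_done((conv_state, ssm_state), done)
```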
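And on the learning-rate question, this is the generic PyTorch pattern I assume is in play: separate parameter groups with different base learning rates, where the annealing loop only overwrites some of them. All names and values below are placeholders, not taken from ppo_mamba.py:

```python
import torch
import torch.nn as nn

# Stand-in network: the point is only the optimizer/annealing pattern.
model = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 2))

# Separate parameter groups with different base learning rates (made-up values).
base_lrs = [3e-4, 1e-3]
optimizer = torch.optim.Adam([
    {"params": model[0].parameters(), "lr": base_lrs[0]},   # group 0
    {"params": model[2].parameters(), "lr": base_lrs[-1]},  # group -1
])

total_updates = 1000
for update in range(1, total_updates + 1):
    frac = 1.0 - (update - 1) / total_updates
    # Anneal only the first and last groups; any groups in between would keep
    # their base learning rate untouched.
    optimizer.param_groups[0]["lr"] = frac * base_lrs[0]
    optimizer.param_groups[-1]["lr"] = frac * base_lrs[-1]
    # ... rollout collection, PPO loss, and optimizer.step() would go here ...
```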
