PPO kernel optimization #472

jonahsamost · 2026-01-24T19:19:06Z

This kernel gets most of its speedup from two changes: it uses an online softmax in both the forward and backward passes, and it computes in 32-bit instead of 64-bit precision. It also recomputes intermediate values in the backward pass rather than writing them out during the forward pass.
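For readers unfamiliar with the two ideas, here is a minimal Python sketch of an online softmax (single pass, running max plus rescaled running sum) and of a backward pass that recomputes the softmax from the logits instead of reading saved probabilities. This is an illustration only, not the code in this PR: the function names `online_logsumexp`, `log_prob`, and `grad_log_prob` are hypothetical, the logits are assumed finite and non-empty, and Python floats are 64-bit, so the precision change is not shown.

```python
import math

def online_logsumexp(logits):
    """One pass over the logits: track a running max m and a running
    sum s of exp(x - m), rescaling s whenever the max changes."""
    m = float("-inf")
    s = 0.0
    for x in logits:
        m_new = max(m, x)
        # Rescale the old sum to the new max, then add the new term.
        s = s * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return m + math.log(s)

def log_prob(logits, action):
    """Forward: log-softmax of the chosen action. Nothing is saved."""
    return logits[action] - online_logsumexp(logits)

def grad_log_prob(logits, action):
    """Backward: recompute the softmax from the logits.
    d log p[action] / d logits[i] = (i == action) - p[i]."""
    lse = online_logsumexp(logits)
    return [(1.0 if i == action else 0.0) - math.exp(x - lse)
            for i, x in enumerate(logits)]
```

The trade-off in this sketch is the usual one: the backward pass repeats a small amount of arithmetic so that the forward pass does not have to write per-action probabilities to memory.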

ppo_loss (NT=32768, 512x64, A=4)
  forward  (original)     17.3 us   1897.58 M elem/s
  backward (original)     14.4 us   2279.77 M elem/s
  forward  (optimized)     3.3 us   9848.82 M elem/s
  backward (optimized)     3.3 us   9787.83 M elem/s
  forward  (torch)        17.3 us   1890.14 M elem/s
  backward (torch)       141.5 us    231.54 M elem/s
  forward  (cpp)         220.1 us    148.87 M elem/s
  backward (cpp)        1084.5 us     30.21 M elem/s
  forward  (graph)        49.2 us    666.60 M elem/s
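The speedups implied by these timings can be worked out directly. The sketch below only divides the rounded microsecond figures from the benchmark listing; the helper name `speedup` and the dictionary layout are illustrative. Because the timings are rounded to 0.1 us, the ratios are approximate.

```python
# Timings in microseconds, copied from the benchmark listing.
times_us = {
    "forward": {"original": 17.3, "optimized": 3.3, "torch": 17.3},
    "backward": {"original": 14.4, "optimized": 3.3, "torch": 141.5},
}

def speedup(baseline_us, optimized_us):
    """How many times faster the optimized kernel is than the baseline."""
    return baseline_us / optimized_us

for direction, t in times_us.items():
    for baseline in ("original", "torch"):
        ratio = speedup(t[baseline], t["optimized"])
        print(f"{direction} vs {baseline}: {ratio:.1f}x")
```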

jonahsamost added 4 commits January 23, 2026 06:14

initial

fe1bea0

minor

9eb0b97

remove comments

e3bbe5d

ppo

b97d065

jsuarez5341 merged commit 705acc9 into PufferAI:4.0 Jan 24, 2026


2 participants
