
Conversation

@Adhithya-Laxman

Description

This PR implements the Muon optimizer using pure NumPy, completing the neural network optimizers module for the repository.

Muon is a cutting-edge optimizer specifically designed for hidden layer weight matrices in neural networks, using Newton-Schulz matrix orthogonalization iterations for improved convergence and computational efficiency.

This PR addresses part of issue #13662 - Add neural network optimizers module to enhance training capabilities

What does this PR do?

  • Implements Muon optimizer that applies Newton-Schulz orthogonalization to gradient matrices
  • Specifically optimized for 2D weight matrices (hidden layers)
  • Provides superior convergence properties compared to standard optimizers
  • Uses momentum with orthogonalized gradient updates
  • Memory-efficient implementation with minimal state tracking
  • Educational implementation suitable for learning advanced optimization techniques

Implementation Details

  • Algorithm: Muon (MomentUm Orthogonalized by Newton-Schulz)
  • Core Mechanism: Newton-Schulz iterations for gradient orthogonalization
  • Update rule:
    ortho_grad = NewtonSchulz(gradient, steps=5)
    velocity = momentum * velocity + ortho_grad
    param = param - learning_rate * velocity
    
  • Newton-Schulz Iteration (see the NumPy sketch after this list):
    Repeat for k steps:
      X = 1.5 * X - 0.5 * X @ (X.T @ X)
    Returns the (approximately) orthogonalized matrix
    
  • Key Features:
    • Matrix-aware optimization
    • Spectral norm-based updates
    • Implicit regularization through orthogonalization
    • Momentum accumulation with orthogonalized gradients
  • Pure NumPy: No PyTorch, TensorFlow, or other frameworks required
  • Educational focus: Clear implementation showing matrix orthogonalization concepts
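
As a companion to the pseudocode above, here is a minimal pure-NumPy sketch of the Newton-Schulz orthogonalization and the stated update rule. The function names, default hyperparameters, and the Frobenius-norm pre-scaling (needed for the iteration to converge) are illustrative assumptions; the actual neural_network/optimizers/muon.py in this PR may organize the code differently. The sketch follows the update rule exactly as stated above; note that some Muon variants orthogonalize the accumulated momentum buffer rather than the raw gradient.

    import numpy as np


    def newton_schulz(matrix: np.ndarray, steps: int = 5, eps: float = 1e-7) -> np.ndarray:
        """Approximately orthogonalize a matrix via the cubic Newton-Schulz iteration."""
        # Pre-scale so the spectral norm is at most 1 (Frobenius norm >= spectral norm);
        # the iteration converges toward the orthogonal factor only for norms below sqrt(3).
        x = matrix / (np.linalg.norm(matrix) + eps)
        for _ in range(steps):
            x = 1.5 * x - 0.5 * x @ (x.T @ x)
        return x


    def muon_step(
        param: np.ndarray,
        grad: np.ndarray,
        velocity: np.ndarray,
        learning_rate: float = 0.02,
        momentum: float = 0.95,
        ns_steps: int = 5,
    ) -> tuple[np.ndarray, np.ndarray]:
        """One Muon update following the rule stated above; returns (new_param, new_velocity)."""
        ortho_grad = newton_schulz(grad, steps=ns_steps)
        velocity = momentum * velocity + ortho_grad
        return param - learning_rate * velocity, velocity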

Why Muon?

Muon represents state-of-the-art optimizer research and offers several advantages:

  • Computational Efficiency: ~2x more efficient than AdamW in large-scale training
  • Better Convergence: Particularly effective for training deep networks
  • Memory Efficient: Maintains only momentum buffer (no second-order moments)
  • Geometric Awareness: Exploits matrix structure of neural network parameters
  • Recent Innovation: Used in current training speed records for NanoGPT and CIFAR-10

Features

✅ Complete docstrings with parameter descriptions
✅ Type hints for all function parameters and return values
✅ Doctests for correctness validation
✅ Usage example demonstrating optimizer on matrix optimization
✅ PEP8 compliant code formatting
✅ Newton-Schulz orthogonalization implementation
✅ Configurable hyperparameters (learning rate, momentum, iteration steps)
✅ Pure NumPy - no external deep learning frameworks

Testing

All doctests pass:

python -m doctest neural_network/optimizers/muon.py -v

Linting passes:

ruff check neural_network/optimizers/muon.py

Example output demonstrates proper optimization behavior on matrix parameters.
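
For illustration only, here is a hypothetical usage loop in the same spirit as the included example, reusing the muon_step sketch from the Implementation Details section. The problem setup, hyperparameters, and printed values below are chosen for this toy demo and are not taken from the PR's file.

    import numpy as np

    # Toy problem: drive a 2D weight matrix toward a fixed target, the
    # matrix-shaped setting Muon is designed for.
    rng = np.random.default_rng(0)
    target = rng.standard_normal((4, 4))
    weights = np.zeros((4, 4))
    velocity = np.zeros_like(weights)

    print(f"initial error: {np.linalg.norm(weights - target):.3f}")
    for _ in range(300):
        grad = 2.0 * (weights - target)  # gradient of ||weights - target||_F^2
        # Modest momentum for this tiny deterministic problem to limit oscillation.
        weights, velocity = muon_step(
            weights, grad, velocity, learning_rate=0.02, momentum=0.5
        )
    print(f"final error:   {np.linalg.norm(weights - target):.3f}")  # should be much smaller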


Relation to Issue #13662

This PR completes the optimizer sequence outlined in #13662.

With this PR, the neural network optimizers module is now complete with 6 fundamental optimizers covering classical to cutting-edge optimization techniques.

Use Cases

Muon is particularly effective for:

  • Training large language models and transformers
  • Deep convolutional networks
  • Scenarios requiring computational efficiency at scale
  • Research into matrix-aware optimization
  • Cases where AdamW shows diminishing returns

Checklist

  • I have read CONTRIBUTING.md
  • This pull request is all my own work -- I have not plagiarized
  • I know that pull requests will not be merged if they fail the automated tests
  • This PR only changes one algorithm file
  • All new Python files are placed inside an existing directory
  • All filenames are in all lowercase characters with no spaces or dashes
  • All functions and variable names follow Python naming conventions
  • All function parameters and return values are annotated with Python type hints
  • All functions have doctests that pass the automated testing
  • All new algorithms include at least one URL that points to Wikipedia or another similar explanation

Summary

This PR marks the completion of the neural network optimizers module, providing educators and learners with a comprehensive collection of optimization algorithms from fundamental SGD to cutting-edge Muon. The module now serves as a complete educational resource for understanding neural network training optimization.


This PR, together with the related PRs listed below, collectively fixes issue #13662.

Fixes #13662 (this PR completes the neural network optimizers module)

Related PRs:

Adagrad optimizer:
- Implements Adagrad (Adaptive Gradient) using pure NumPy
- Adapts the learning rate individually for each parameter
- Includes comprehensive docstrings and type hints
- Adds doctests for validation
- Provides a usage example demonstrating convergence
- Follows PEP8 coding standards
- Part of issue TheAlgorithms#13662

Adam and NAG optimizers:
- Implements the Adam (Adaptive Moment Estimation) optimizer
- Implements the Nesterov Accelerated Gradient (NAG) optimizer
- Both use pure NumPy without deep learning frameworks
- Includes comprehensive docstrings and type hints
- Adds doctests for validation
- Provides usage examples demonstrating convergence
- Follows PEP8 coding standards
- Part of issue TheAlgorithms#13662

Muon optimizer (this PR):
- Implements the Muon optimizer for hidden layer weight matrices
- Uses Newton-Schulz orthogonalization iterations
- Provides matrix-aware gradient updates with spectral constraints
- Includes comprehensive docstrings and type hints
- Adds doctests for validation
- Provides a usage example demonstrating optimization
- Follows PEP8 coding standards
- Pure NumPy implementation without frameworks
- Part of issue TheAlgorithms#13662
@algorithms-keeper

Multiple Pull Request Detected

@Adhithya-Laxman, we are extremely excited that you want to submit multiple algorithms to this repository, but we have a limit on how many pull requests a user can keep open at a time. This is to make sure all maintainers and users focus on a limited number of pull requests at a time to maintain the quality of the code.

This pull request is being closed as the user already has an open pull request. Please focus on your previous pull request before opening another one. Thank you for your cooperation.

User opened pull requests (including this one): #13721, #13718, #13681, #13680, #13646

algorithms-keeper bot added the "awaiting reviews" label (This PR is ready to be reviewed) on Oct 23, 2025
@Adhithya-Laxman
Author

Adhithya-Laxman commented Oct 23, 2025

My other PRs are closed now. Kindly help me open a PR for this branch; this is the final branch that fixes the issue.
