
Conversation


@Adhithya-Laxman Adhithya-Laxman commented Oct 23, 2025

Description

This PR implements three advanced optimizers using pure NumPy as part of the effort to add neural network optimizers to the repository:

  1. Adam (Adaptive Moment Estimation) - Industry-standard optimizer combining momentum and adaptive learning rates
  2. Nesterov Accelerated Gradient (NAG) - Momentum-based optimizer with look-ahead gradient computation
  3. Muon - Cutting-edge matrix-aware optimizer using Newton-Schulz orthogonalization

This PR addresses part of issue #13662 - Add neural network optimizers module to enhance training capabilities

What does this PR do?

Adam Optimizer

  • Combines first moment (mean) and second moment (variance) estimates of gradients
  • Adapts learning rate for each parameter individually
  • Includes bias correction for initial iterations
  • Industry-standard optimizer widely used in deep learning (used in GPT training, etc.)

Nesterov Accelerated Gradient (NAG)

  • Enhanced momentum-based optimization with look-ahead capability
  • Computes gradients at the anticipated position rather than current position
  • Provides better convergence properties than standard momentum
  • Particularly effective for convex optimization problems

Muon Optimizer

  • State-of-the-art optimizer for hidden layer weight matrices
  • Uses Newton-Schulz orthogonalization iterations for gradient updates
  • ~2x more computationally efficient than AdamW in large-scale training
  • Memory efficient with only momentum buffer (no second-order moments)
  • Recently used in training speed records for NanoGPT and CIFAR-10

All implementations provide clean, educational code without external deep learning frameworks.

Implementation Details

Adam

  • Update rule (a minimal NumPy sketch follows this list):
    m = beta1 * m + (1 - beta1) * gradient
    v = beta2 * v + (1 - beta2) * gradient^2
    m_hat = m / (1 - beta1^t)
    v_hat = v / (1 - beta2^t)
    param = param - learning_rate * m_hat / (sqrt(v_hat) + epsilon)
    
  • Key Features:
    • First and second moment estimates
    • Bias correction
    • Per-parameter adaptive learning rates
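
A minimal NumPy sketch of the update rule above; the name adam_update, its signature, and the default hyperparameters are illustrative assumptions rather than what this PR's file necessarily uses:

    import numpy as np

    def adam_update(
        param: np.ndarray,
        grad: np.ndarray,
        m: np.ndarray,
        v: np.ndarray,
        t: int,
        learning_rate: float = 0.001,
        beta1: float = 0.9,
        beta2: float = 0.999,
        epsilon: float = 1e-8,
    ) -> tuple[np.ndarray, np.ndarray, np.ndarray]:
        """Apply one Adam step; t is the 1-based iteration counter."""
        m = beta1 * m + (1 - beta1) * grad        # first moment (mean of gradients)
        v = beta2 * v + (1 - beta2) * grad**2     # second moment (uncentered variance)
        m_hat = m / (1 - beta1**t)                # bias correction for early iterations
        v_hat = v / (1 - beta2**t)
        new_param = param - learning_rate * m_hat / (np.sqrt(v_hat) + epsilon)
        return new_param, m, v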

Nesterov Accelerated Gradient

  • Update rule (the common reformulation: the stored parameters already sit at the look-ahead point, so gradient_lookahead is simply the gradient evaluated there; a NumPy sketch follows this list):
    velocity_prev = velocity
    velocity = momentum * velocity - learning_rate * gradient_lookahead
    param = param - momentum * velocity_prev + (1 + momentum) * velocity
    
  • Key Features:
    • Look-ahead gradient computation
    • Enhanced momentum for faster convergence
    • Better handling of curvature
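
For comparison, a minimal sketch of the explicit look-ahead form, which is mathematically equivalent to the reformulation above; nag_update and grad_fn are illustrative names and the defaults are assumptions, not values taken from this PR:

    import numpy as np
    from collections.abc import Callable

    def nag_update(
        param: np.ndarray,
        velocity: np.ndarray,
        grad_fn: Callable[[np.ndarray], np.ndarray],
        learning_rate: float = 0.01,
        momentum: float = 0.9,
    ) -> tuple[np.ndarray, np.ndarray]:
        """One NAG step: evaluate the gradient at the anticipated next position."""
        lookahead = param + momentum * velocity   # where momentum alone would take us
        velocity = momentum * velocity - learning_rate * grad_fn(lookahead)
        return param + velocity, velocity

For a quadratic such as f(x) = x^2, grad_fn would simply be lambda x: 2 * x.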

Muon

  • Algorithm: Matrix-aware Optimizer with Orthogonalization
  • Core Mechanism: Newton-Schulz iterations for gradient orthogonalization
  • Update rule (a NumPy sketch of both steps follows the Newton-Schulz description below):
    ortho_grad = NewtonSchulz(gradient, steps=5)
    velocity = momentum * velocity + ortho_grad
    param = param - learning_rate * velocity
    
  • Newton-Schulz Iteration (the input matrix is first normalized, e.g. by its norm, so the iteration converges):
    For k steps:
      X = 1.5 * X - 0.5 * X @ (X.T @ X)
    Returns an approximately orthogonalized matrix
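
A minimal NumPy sketch mirroring the update rule stated above; newton_schulz and muon_update are illustrative names, and the defaults, the Frobenius-norm normalization, and the 2-D-matrix assumption are mine rather than this PR's:

    import numpy as np

    def newton_schulz(grad: np.ndarray, steps: int = 5) -> np.ndarray:
        """Approximately orthogonalize a 2-D matrix via cubic Newton-Schulz iterations."""
        x = grad / (np.linalg.norm(grad) + 1e-7)  # normalize so singular values are at most 1
        for _ in range(steps):
            x = 1.5 * x - 0.5 * x @ (x.T @ x)     # pushes every singular value toward 1
        return x

    def muon_update(
        param: np.ndarray,
        grad: np.ndarray,
        velocity: np.ndarray,
        learning_rate: float = 0.02,
        momentum: float = 0.95,
        ns_steps: int = 5,
    ) -> tuple[np.ndarray, np.ndarray]:
        """One Muon-style step: orthogonalized gradient fed into a momentum buffer."""
        ortho_grad = newton_schulz(grad, steps=ns_steps)
        velocity = momentum * velocity + ortho_grad
        return param - learning_rate * velocity, velocity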
    

Features

✅ Complete docstrings with parameter descriptions for all three optimizers
✅ Type hints for all function parameters and return values
✅ Doctests for correctness validation
✅ Usage examples demonstrating optimizer behavior on optimization problems
✅ PEP8 compliant code formatting
✅ Pure NumPy implementations - no framework dependencies
✅ Configurable hyperparameters (learning rates, momentum, betas, epsilon, NS steps)

Testing

All doctests pass:

python -m doctest neural_network/optimizers/adam.py -v
python -m doctest neural_network/optimizers/nesterov_accelerated_gradient.py -v
python -m doctest neural_network/optimizers/muon.py -v

Linting passes:

ruff check neural_network/optimizers/adam.py
ruff check neural_network/optimizers/nesterov_accelerated_gradient.py
ruff check neural_network/optimizers/muon.py

Example outputs demonstrate proper convergence behavior for all optimizers.
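
As an illustration of the kind of convergence behavior referred to here, a sketch along these lines (reusing the hypothetical adam_update helper shown earlier, not this PR's actual doctest) drives a simple quadratic toward its minimum:

    import numpy as np

    # Minimize f(w) = sum(w ** 2); its gradient is 2 * w.
    w = np.array([5.0, -3.0])
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(1, 1001):
        w, m, v = adam_update(w, 2 * w, m, v, t, learning_rate=0.1)
    print(np.round(w, 3))  # both components should end up close to 0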

References

Adam

Nesterov Accelerated Gradient

Muon

Why Combine These Three?

These optimizers represent the cutting edge of neural network optimization:

  • Adam is the industry standard used in most modern deep learning
  • NAG demonstrates advanced momentum-based techniques
  • Muon represents state-of-the-art research in matrix-aware optimization

Together, they provide learners with a comprehensive view from established best practices (Adam) to innovative research directions (Muon).

Relation to Issue #13662

This PR continues the optimizer sequence outlined in #13662.

Combined with the other optimizer PRs, it completes the neural network optimizers module with 6 fundamental optimizers, covering classical through cutting-edge optimization techniques.

Checklist

  • I have read CONTRIBUTING.md
  • This pull request is all my own work -- I have not plagiarized
  • I know that pull requests will not be merged if they fail the automated tests
  • This PR changes related algorithm files that work together
  • All new Python files are placed inside an existing directory
  • All filenames are in all lowercase characters with no spaces or dashes
  • All functions and variable names follow Python naming conventions
  • All function parameters and return values are annotated with Python type hints
  • All functions have doctests that pass the automated testing
  • All new algorithms include at least one URL that points to Wikipedia or another similar explanation

Summary

This PR provides three advanced optimizers representing both industry-standard techniques (Adam) and cutting-edge research (Muon), contributing significantly to the neural network optimizers module for educational purposes.


This PR along with the following PRs collectively addresses issue #13662:

Related PRs:

Fixes #13662

- Implements Adagrad (Adaptive Gradient) using pure NumPy
- Adapts learning rate individually for each parameter
- Includes comprehensive docstrings and type hints
- Adds doctests for validation
- Provides usage example demonstrating convergence
- Follows PEP8 coding standards
- Part of issue TheAlgorithms#13662
- Implements Adam (Adaptive Moment Estimation) optimizer
- Implements Nesterov Accelerated Gradient (NAG) optimizer
- Both use pure NumPy without deep learning frameworks
- Includes comprehensive docstrings and type hints
- Adds doctests for validation
- Provides usage examples demonstrating convergence
- Follows PEP8 coding standards
- Part of issue TheAlgorithms#13662
Adhithya-Laxman force-pushed the add_nag_adam_optimizers branch from 76f8f40 to 08314f8 on October 23, 2025 at 21:19.
algorithms-keeper bot added the "awaiting reviews" label (This PR is ready to be reviewed) on Oct 23, 2025.