
Conversation


@Adhithya-Laxman Adhithya-Laxman commented Oct 23, 2025

Description

This PR implements three advanced optimizers using pure NumPy as part of the effort to add neural network optimizers to the repository:

  1. Adam (Adaptive Moment Estimation) - Industry-standard optimizer combining momentum and adaptive learning rates
  2. Nesterov Accelerated Gradient (NAG) - Momentum-based optimizer with look-ahead gradient computation
  3. Muon - Cutting-edge matrix-aware optimizer using Newton-Schulz orthogonalization

This PR addresses part of issue #13662 - Add neural network optimizers module to enhance training capabilities

What does this PR do?

Adam Optimizer

  • Combines first moment (mean) and second moment (variance) estimates of gradients
  • Adapts learning rate for each parameter individually
  • Includes bias correction for initial iterations
  • Industry-standard optimizer widely used in deep learning (used in GPT training, etc.)

Nesterov Accelerated Gradient (NAG)

  • Enhanced momentum-based optimization with look-ahead capability
  • Computes gradients at the anticipated position rather than current position
  • Provides better convergence properties than standard momentum
  • Particularly effective for convex optimization problems

Muon Optimizer

  • State-of-the-art optimizer for hidden layer weight matrices
  • Uses Newton-Schulz orthogonalization iterations for gradient updates
  • ~2x more computationally efficient than AdamW in large-scale training
  • Memory efficient with only momentum buffer (no second-order moments)
  • Recently used in training speed records for NanoGPT and CIFAR-10

All implementations provide clean, educational code without external deep learning frameworks.

Implementation Details

Adam

  • Update rule (a minimal NumPy sketch follows this list):
    m = beta1 * m + (1 - beta1) * gradient
    v = beta2 * v + (1 - beta2) * gradient^2
    m_hat = m / (1 - beta1^t)
    v_hat = v / (1 - beta2^t)
    param = param - learning_rate * m_hat / (sqrt(v_hat) + epsilon)
    
  • Key Features:
    • First and second moment estimates
    • Bias correction
    • Per-parameter adaptive learning rates
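
A minimal NumPy sketch of the update rule above; the name adam_update, its signature, and the default hyperparameters are illustrative assumptions rather than what this PR's file necessarily uses:

    import numpy as np

    def adam_update(
        param: np.ndarray,
        grad: np.ndarray,
        m: np.ndarray,
        v: np.ndarray,
        t: int,
        learning_rate: float = 0.001,
        beta1: float = 0.9,
        beta2: float = 0.999,
        epsilon: float = 1e-8,
    ) -> tuple[np.ndarray, np.ndarray, np.ndarray]:
        """Apply one Adam step; t is the 1-based iteration counter."""
        m = beta1 * m + (1 - beta1) * grad        # first moment (mean of gradients)
        v = beta2 * v + (1 - beta2) * grad**2     # second moment (uncentered variance)
        m_hat = m / (1 - beta1**t)                # bias correction for early iterations
        v_hat = v / (1 - beta2**t)
        new_param = param - learning_rate * m_hat / (np.sqrt(v_hat) + epsilon)
        return new_param, m, v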

Nesterov Accelerated Gradient

  • Update rule (the common reformulation: the stored parameters already sit at the look-ahead point, so gradient_lookahead is simply the gradient evaluated there; a NumPy sketch follows this list):
    velocity_prev = velocity
    velocity = momentum * velocity - learning_rate * gradient_lookahead
    param = param - momentum * velocity_prev + (1 + momentum) * velocity
    
  • Key Features:
    • Look-ahead gradient computation
    • Enhanced momentum for faster convergence
    • Better handling of curvature
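
For comparison, a minimal sketch of the explicit look-ahead form, which is mathematically equivalent to the reformulation above; nag_update and grad_fn are illustrative names and the defaults are assumptions, not values taken from this PR:

    import numpy as np
    from collections.abc import Callable

    def nag_update(
        param: np.ndarray,
        velocity: np.ndarray,
        grad_fn: Callable[[np.ndarray], np.ndarray],
        learning_rate: float = 0.01,
        momentum: float = 0.9,
    ) -> tuple[np.ndarray, np.ndarray]:
        """One NAG step: evaluate the gradient at the anticipated next position."""
        lookahead = param + momentum * velocity   # where momentum alone would take us
        velocity = momentum * velocity - learning_rate * grad_fn(lookahead)
        return param + velocity, velocity

For a quadratic such as f(x) = x^2, grad_fn would simply be lambda x: 2 * x.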

Muon

  • Algorithm: Matrix-aware Optimizer with Orthogonalization
  • Core Mechanism: Newton-Schulz iterations for gradient orthogonalization
  • Update rule (a NumPy sketch of both steps follows the Newton-Schulz description below):
    ortho_grad = NewtonSchulz(gradient, steps=5)
    velocity = momentum * velocity + ortho_grad
    param = param - learning_rate * velocity
    
  • Newton-Schulz Iteration (the input matrix is first normalized, e.g. by its norm, so the iteration converges):
    For k steps:
      X = 1.5 * X - 0.5 * X @ (X.T @ X)
    Returns an approximately orthogonalized matrix
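
A minimal NumPy sketch mirroring the update rule stated above; newton_schulz and muon_update are illustrative names, and the defaults, the Frobenius-norm normalization, and the 2-D-matrix assumption are mine rather than this PR's:

    import numpy as np

    def newton_schulz(grad: np.ndarray, steps: int = 5) -> np.ndarray:
        """Approximately orthogonalize a 2-D matrix via cubic Newton-Schulz iterations."""
        x = grad / (np.linalg.norm(grad) + 1e-7)  # normalize so singular values are at most 1
        for _ in range(steps):
            x = 1.5 * x - 0.5 * x @ (x.T @ x)     # pushes every singular value toward 1
        return x

    def muon_update(
        param: np.ndarray,
        grad: np.ndarray,
        velocity: np.ndarray,
        learning_rate: float = 0.02,
        momentum: float = 0.95,
        ns_steps: int = 5,
    ) -> tuple[np.ndarray, np.ndarray]:
        """One Muon-style step: orthogonalized gradient fed into a momentum buffer."""
        ortho_grad = newton_schulz(grad, steps=ns_steps)
        velocity = momentum * velocity + ortho_grad
        return param - learning_rate * velocity, velocity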
    

Features

✅ Complete docstrings with parameter descriptions for all three optimizers
✅ Type hints for all function parameters and return values
✅ Doctests for correctness validation
✅ Usage examples demonstrating optimizer behavior on optimization problems
✅ PEP8 compliant code formatting
✅ Pure NumPy implementations - no framework dependencies
✅ Configurable hyperparameters (learning rates, momentum, betas, epsilon, NS steps)

Testing

All doctests pass:

python -m doctest neural_network/optimizers/adam.py -v
python -m doctest neural_network/optimizers/nesterov_accelerated_gradient.py -v
python -m doctest neural_network/optimizers/muon.py -v

Linting passes:

ruff check neural_network/optimizers/adam.py
ruff check neural_network/optimizers/nesterov_accelerated_gradient.py
ruff check neural_network/optimizers/muon.py

Example outputs demonstrate proper convergence behavior for all optimizers.
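
As an illustration of the kind of convergence behavior referred to here, a sketch along these lines (reusing the hypothetical adam_update helper shown earlier, not this PR's actual doctest) drives a simple quadratic toward its minimum:

    import numpy as np

    # Minimize f(w) = sum(w ** 2); its gradient is 2 * w.
    w = np.array([5.0, -3.0])
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(1, 1001):
        w, m, v = adam_update(w, 2 * w, m, v, t, learning_rate=0.1)
    print(np.round(w, 3))  # both components should end up close to 0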

References

Adam

Nesterov Accelerated Gradient

Muon

Why Combine These Three?

These optimizers represent the cutting edge of neural network optimization:

  • Adam is the industry standard used in most modern deep learning
  • NAG demonstrates advanced momentum-based techniques
  • Muon represents state-of-the-art research in matrix-aware optimization

Together, they provide learners with a comprehensive view from established best practices (Adam) to innovative research directions (Muon).

Relation to Issue #13662

This PR continues the optimizer sequence outlined in #13662.

Combined with the other optimizer PRs, it completes the neural network optimizers module with 6 fundamental optimizers, covering classical through cutting-edge optimization techniques.

Checklist

  • I have read CONTRIBUTING.md
  • This pull request is all my own work -- I have not plagiarized
  • I know that pull requests will not be merged if they fail the automated tests
  • This PR changes related algorithm files that work together
  • All new Python files are placed inside an existing directory
  • All filenames are in all lowercase characters with no spaces or dashes
  • All functions and variable names follow Python naming conventions
  • All function parameters and return values are annotated with Python type hints
  • All functions have doctests that pass the automated testing
  • All new algorithms include at least one URL that points to Wikipedia or another similar explanation

Summary

This PR provides three advanced optimizers representing both industry-standard techniques (Adam) and cutting-edge research (Muon), contributing significantly to the neural network optimizers module for educational purposes.


This PR along with the following PRs collectively addresses issue #13662:

Related PRs:

Fixes #13662

- Implements Adagrad (Adaptive Gradient) using pure NumPy
- Adapts learning rate individually for each parameter
- Includes comprehensive docstrings and type hints
- Adds doctests for validation
- Provides usage example demonstrating convergence
- Follows PEP8 coding standards
- Part of issue TheAlgorithms#13662
- Implements Adam (Adaptive Moment Estimation) optimizer
- Implements Nesterov Accelerated Gradient (NAG) optimizer
- Both use pure NumPy without deep learning frameworks
- Includes comprehensive docstrings and type hints
- Adds doctests for validation
- Provides usage examples demonstrating convergence
- Follows PEP8 coding standards
- Part of issue TheAlgorithms#13662
Adhithya-Laxman force-pushed the add_nag_adam_optimizers branch from 76f8f40 to 08314f8 on October 23, 2025 at 21:19.
algorithms-keeper bot added the "awaiting reviews" label (This PR is ready to be reviewed) on Oct 23, 2025.