# Implement Adam Optimization Algorithm
Implement the Adam (Adaptive Moment Estimation) optimization algorithm in Python. Adam is a gradient-based optimization algorithm that adapts the effective step size for each parameter individually. Your task is to write a function `adam_optimizer` that minimizes a given objective function by iteratively updating its parameters with the Adam update rule.

The function should take the following parameters:

- `f`: The objective function to be optimized
- `grad`: A function that computes the gradient of `f`
- `x0`: Initial parameter values
- `learning_rate`: The step size (default: 0.001)
- `beta1`: Exponential decay rate for the first moment estimates (default: 0.9)
- `beta2`: Exponential decay rate for the second moment estimates (default: 0.999)
- `epsilon`: A small constant for numerical stability (default: 1e-8)
- `num_iterations`: Number of iterations to run the optimizer (default: 10)

The function should return the optimized parameters.

Example:
```py
import numpy as np

def objective_function(x):
    return x[0]**2 + x[1]**2

def gradient(x):
    return np.array([2*x[0], 2*x[1]])

x0 = np.array([1.0, 1.0])
x_opt = adam_optimizer(objective_function, gradient, x0)

print("Optimized parameters:", x_opt)

# Expected Output:
# Optimized parameters: [0.99000325 0.99000325]
```

## Understanding the Adam Optimization Algorithm

Adam (Adaptive Moment Estimation) is an optimization algorithm commonly used in training deep neural networks. It combines ideas from two other optimization algorithms: RMSprop and Momentum.

## Key Concepts

1. **Adaptive Learning Rates**: Adam computes individual adaptive learning rates for different parameters from estimates of the first and second moments of the gradients.

2. **Momentum**: It keeps track of an exponentially decaying average of past gradients, as momentum methods do.

3. **RMSprop**: It also keeps track of an exponentially decaying average of past squared gradients, as RMSprop does.

4. **Bias Correction**: Because the first and second moment estimates are initialized to zero, they are biased toward zero during the first iterations. Adam includes bias correction terms to compensate for this.
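
To make the adaptive step size and the bias correction concrete, here is a minimal sketch of the very first update (t = 1, starting from zero moments). The gradient values are illustrative assumptions chosen to differ by two orders of magnitude, and the names `step_corrected` and `step_uncorrected` are hypothetical.

```python
import numpy as np

lr, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8

# Illustrative gradient with very different magnitudes per parameter (assumed values).
g = np.array([0.4, 24.6])

# First iteration (t = 1), starting from m_0 = v_0 = 0.
m = (1 - beta1) * g        # biased first moment: only 10% of g
v = (1 - beta2) * g**2     # biased second moment: only 0.1% of g**2

# Bias correction rescales both estimates back to the gradient's scale.
m_hat = m / (1 - beta1**1)
v_hat = v / (1 - beta2**1)

step_corrected = lr * m_hat / (np.sqrt(v_hat) + eps)
step_uncorrected = lr * m / (np.sqrt(v) + eps)

print("biased moments:   ", m, v)
print("corrected moments:", m_hat, v_hat)
print("corrected step:   ", step_corrected)
print("uncorrected step: ", step_uncorrected)
```

In this sketch the corrected first step comes out close to `lr` for both parameters even though their gradients differ widely. This is the per-parameter adaptivity described above. Without the correction, the first step would be roughly 3.16 times larger, because the two zero-initialized estimates are shrunk by different factors.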

## The Adam Algorithm

Given parameters $\theta$, objective function $f(\theta)$, and its gradient $\nabla_\theta f(\theta)$:

1. Initialize time step $t = 0$, parameters $\theta$, first moment vector $m_0 = 0$, second moment vector $v_0 = 0$, and hyperparameters $\alpha$ (learning rate), $\beta_1$, $\beta_2$, and $\epsilon$.

2. While not converged, do:
    1. Increment time step: $t = t + 1$
    2. Compute gradient: $g_t = \nabla_\theta f(\theta_{t-1})$
    3. Update biased first moment estimate: $m_t = \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t$
    4. Update biased second raw moment estimate: $v_t = \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2$
    5. Compute bias-corrected first moment estimate: $\hat{m}_t = m_t / (1 - \beta_1^t)$
    6. Compute bias-corrected second raw moment estimate: $\hat{v}_t = v_t / (1 - \beta_2^t)$
    7. Update parameters: $\theta_t = \theta_{t-1} - \alpha \cdot \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$
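
As a worked illustration of the steps above, consider a single coordinate of the quadratic example, $f(\theta) = \theta^2$, starting at $\theta_0 = 1$ with the default hyperparameters $\alpha = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$. The numbers below are computed by hand for this assumed setup, and the last line is rounded:

$$
\begin{aligned}
g_1 &= 2\theta_0 = 2 \\
m_1 &= 0.9 \cdot 0 + 0.1 \cdot 2 = 0.2 \\
v_1 &= 0.999 \cdot 0 + 0.001 \cdot 2^2 = 0.004 \\
\hat{m}_1 &= 0.2 / (1 - 0.9^1) = 2 \\
\hat{v}_1 &= 0.004 / (1 - 0.999^1) = 4 \\
\theta_1 &= 1 - 0.001 \cdot 2 / (\sqrt{4} + 10^{-8}) \approx 0.999
\end{aligned}
$$

The first step therefore has magnitude close to $\alpha$, which is consistent with the example ending near $0.99$ after ten iterations.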
    
Adam combines the advantages of AdaGrad, which works well with sparse gradients, and RMSprop, which works well in online and non-stationary settings. Adam is generally regarded as fairly robust to the choice of hyperparameters, though the learning rate may sometimes need to be changed from the suggested default.
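
To illustrate why the learning rate may need tuning, the sketch below runs the update steps listed above on the quadratic example with two learning rates. The helper `run_adam`, the function names `objective` and `grad_objective`, the choice of 100 iterations, and the second learning rate of 0.1 are assumptions made for this illustration only.

```python
import numpy as np

def run_adam(grad, x0, learning_rate, num_iterations,
             beta1=0.9, beta2=0.999, epsilon=1e-8):
    """Hypothetical helper: a plain Adam loop following the listed steps."""
    x = np.array(x0, dtype=float)  # copy, so x0 is left untouched
    m = np.zeros_like(x)
    v = np.zeros_like(x)
    for t in range(1, num_iterations + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)
        v_hat = v / (1 - beta2**t)
        x = x - learning_rate * m_hat / (np.sqrt(v_hat) + epsilon)
    return x

def objective(x):
    return x[0]**2 + x[1]**2

def grad_objective(x):
    return np.array([2 * x[0], 2 * x[1]])

x0 = [1.0, 1.0]
results = {}
for lr in (0.001, 0.1):
    x = run_adam(grad_objective, x0, learning_rate=lr, num_iterations=100)
    results[lr] = objective(x)
    print(f"learning_rate={lr}: x={x}, objective={results[lr]:.6f}")
```

With the default learning rate, each step moves a parameter by at most about 0.001, so 100 iterations cover only about a tenth of the distance to the minimum at the origin. The larger rate should end up considerably closer on this problem, although a rate that helps here is not guaranteed to suit other objectives.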

In [3]:
import numpy as np

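def adam_optimizer(f, grad, x0, learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8, num_iterations=10):
    # Work on a float copy so the caller's x0 is not modified in place.
    x = np.array(x0, dtype=float)
    m = np.zeros_like(x)  # first moment estimate (decaying mean of gradients)
    v = np.zeros_like(x)  # second raw moment estimate (decaying mean of squared gradients)
    # f is accepted to match the required signature; the update itself only needs grad.
    for t in range(1, num_iterations + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)  # bias-corrected first moment
        v_hat = v / (1 - beta2**t)  # bias-corrected second raw moment
        x = x - learning_rate * m_hat / (np.sqrt(v_hat) + epsilon)
    return x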

In [4]:
import numpy as np
def objective_function(x):
    return x[0]**2 + x[1]**2
def gradient(x):
    return np.array([2*x[0], 2*x[1]])
x0 = np.array([1.0, 1.0])
x_opt = adam_optimizer(objective_function, gradient, x0)
print('Test Case 1: Accepted') if np.allclose(x_opt, [0.99000325, 0.99000325]) else print('Test Case 1: Failed')
print('Input:')
print('import numpy as np\ndef objective_function(x):\n    return x[0]**2 + x[1]**2\ndef gradient(x):\n    return np.array([2*x[0], 2*x[1]])\nx0 = np.array([1.0, 1.0])\nx_opt = adam_optimizer(objective_function, gradient, x0)\nprint(x_opt)')
print()
print('Output:')
print(x_opt)
print()
print('Expected:')
print('[0.99000325 0.99000325]')
print()
print()

import numpy as np
def objective_function(x):
    return x[0]**2 + x[1]**2
def gradient(x):
    return np.array([2*x[0], 2*x[1]])
x0 = np.array([0.2, 12.3])
x_opt = adam_optimizer(objective_function, gradient, x0)
print('Test Case 2: Accepted') if np.allclose(x_opt, [0.19001678, 12.29000026]) else print('Test Case 2: Failed')
print('Input:')
print('import numpy as np\ndef objective_function(x):\n    return x[0]**2 + x[1]**2\ndef gradient(x):\n    return np.array([2*x[0], 2*x[1]])\nx0 = np.array([0.2, 12.3])\nx_opt = adam_optimizer(objective_function, gradient, x0)\nprint(x_opt)')
print()
print('Output:')
print(x_opt)
print()
print('Expected:')
print('[ 0.19001678 12.29000026]')

Test Case 1: Accepted
Input:
import numpy as np
def objective_function(x):
    return x[0]**2 + x[1]**2
def gradient(x):
    return np.array([2*x[0], 2*x[1]])
x0 = np.array([1.0, 1.0])
x_opt = adam_optimizer(objective_function, gradient, x0)
print(x_opt)

Output:
[0.99000325 0.99000325]

Expected:
[0.99000325 0.99000325]


Test Case 2: Accepted
Input:
import numpy as np
def objective_function(x):
    return x[0]**2 + x[1]**2
def gradient(x):
    return np.array([2*x[0], 2*x[1]])
x0 = np.array([0.2, 12.3])
x_opt = adam_optimizer(objective_function, gradient, x0)
print(x_opt)

Output:
[ 0.19001678 12.29000026]

Expected:
[ 0.19001678 12.29000026]
