# ADAM Optimizer Tutorial: Intuition And Implementation in Python

## Introduction

Have you ever tried navigating your way down a hilly area blindfolded? That's somewhat similar to what machine learning models do when they are trying to improve. They continually search for the lowest point (best solution) without really seeing the whole picture. This is where optimization algorithms come in handy, and ADAM is like a smart flashlight in this journey. 

ADAM, short for Adaptive Moment Estimation, is a popular optimization technique, especially in deep learning. In this article, you'll see why this is the case. We will cover the intuition behind it, dive into some math (don't worry, we will keep it friendly), its Python implementation and how to use it in PyTorch. 

## What Is ADAM Optimizer? The Short Answer

ADAM (Adaptive Moment Estimation) is an optimization algorithm widely used in machine learning and, most often, in deep learning. 

ADAM combines ideas from two other robust optimization techniques: momentum and RMSprop. It is named _adaptive_ as it adjusts the learning rate for each parameter. 

Here are its key features and advantages:

- Adaptivity: ADAM adapts the learning rate for each parameter, which can speed up learning in many cases.
- Momentum: It uses a form of momentum, helping it navigate complex surfaces such as ravines and saddle points more effectively.
- Bias correction: ADAM includes bias correction terms, which help it perform well even in the initial stages of training.
- Computational efficiency: It's relatively computationally efficient and has low memory requirements.
- Hyperparameter robustness: While the learning rate may need tuning, ADAM is often less sensitive to hyperparameter choices than some other optimizers.

To summarize, ADAM helps models learn more efficiently by continuously adjusting the learning rate of each parameter and, as a consequence, often converges faster than standard stochastic gradient descent. For many deep learning applications, it is therefore a strong default choice.
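
To make the features above concrete, here is a minimal sketch of a single ADAM update in NumPy. The function name `adam_step` and the example values are hypothetical, and the hyperparameter defaults are the ones commonly quoted for ADAM (`lr=0.001`, `beta1=0.9`, `beta2=0.999`, `eps=1e-8`). Treat it as an illustration of the idea rather than a reference implementation.

```python
import numpy as np


def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM update (hypothetical helper, for illustration only)."""
    # Momentum: exponential moving average of the gradients (first moment)
    m = beta1 * m + (1 - beta1) * grad
    # Adaptivity: exponential moving average of the squared gradients (second moment)
    v = beta2 * v + (1 - beta2) * grad**2
    # Bias correction: both averages start at zero, so rescale them early in training
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    # Per-parameter step: large gradients are scaled down, small ones scaled up
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v


# Two parameters whose gradients differ by orders of magnitude (made-up values)
params = np.array([1.0, -2.0])
grads = np.array([0.5, -100.0])
m = np.zeros_like(params)
v = np.zeros_like(params)

new_params, m, v = adam_step(params, grads, m, v, t=1)
print(new_params)
```

Even though the two gradients differ by a factor of 200, both parameters move by roughly the same amount (about `lr`) on this first step. That is the per-parameter adaptivity at work.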

## Prerequisite Algorithms

ADAM unifies key ideas from a few other critical optimization algorithms, strengthening their advantages while also addressing their shortcomings. We will need to review them before we can grasp the intuition behind ADAM and implement it in Python.

### Optimization Analogy

To understand the intuition of these optimization algorithms, let's continue our analogy from the introduction. Imagine you are blindfolded in a complicated, hilly region. You have been tasked to find the lowest point in this terrain. The hilliness of the terrain represents the loss function of a machine learning model. The overall "lowest" (global minimum) point is the optimal solution to the system. 

Now let's connect some dots: Your current position in the terrain represents the current state of the model's parameters. The height at any point represents the loss value for those parameters. Each step you take corresponds to an update of the model's parameters.

Every optimization algorithm functions like a strategy to navigate the landscape of this problem successfully, guiding the solver on where to step next and how large those steps should be. Some algorithms scan the entire area before deciding on a next move, while others rely on limited information to be faster. Still other algorithms use tools like momentum and step-size adaptation: a good solver knows when to push its way through a problem and when to ease up. 

### Gradient Descent

Gradient Descent is the cornerstone of optimization in machine learning, as it sets the foundation many other algorithms build upon. 

If you use Gradient Descent (GD), at each step, you carefully feel the entire area around you (using the full dataset). This thorough examination allows you to make very accurate decisions about which way is downhill, but it takes a lot of time. You always move in the direction of steepest descent, which means you'll consistently move towards lower ground. However, if you reach a small depression (local minimum), you might get stuck there, unable to detect that there's an even lower point elsewhere.

Key features of GD:
- Uses the entire dataset for each step
- Consistent, but potentially slow
- Can get stuck in local minima
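
To see the "stuck in a local minimum" behavior in action, here is a small sketch that runs plain gradient descent on a one-dimensional function with two valleys. The function `f`, the starting points and the learning rate are all made up for illustration. There is no dataset here, only a fixed curve, so this shows the descent rule itself rather than a full training loop.

```python
def f(x):
    # A toy "terrain" with a shallow valley on the right and a deeper one on the left
    return x**4 - 3 * x**2 + x


def f_prime(x):
    # Derivative of f: the slope we "feel" at position x
    return 4 * x**3 - 6 * x + 1


def gradient_descent(start, learning_rate=0.01, steps=1000):
    x = start
    for _ in range(steps):
        # Always step against the slope, i.e. downhill
        x -= learning_rate * f_prime(x)
    return x


x_right = gradient_descent(start=2.0)   # starts on the right-hand slope
x_left = gradient_descent(start=-2.0)   # starts on the left-hand slope

print(f"From  2.0: x = {x_right:.4f}, f(x) = {f(x_right):.4f}")
print(f"From -2.0: x = {x_left:.4f}, f(x) = {f(x_left):.4f}")
```

Both runs follow exactly the same rule, yet the one starting on the right settles in the shallower valley and never discovers the deeper one on the left.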

## It Starts With SGD

This article assumes you are familiar with the fundamental optimization algorithm, Gradient Descent (GD), and its more popular variant, Stochastic Gradient Descent (SGD). Both play crucial roles in understanding ADAM, the reason for its invention and how to implement it in Python. If you are completely new to them, we have written a [separate article](https://www.datacamp.com/tutorial/stochastic-gradient-descent) explaining both regular and stochastic gradient descent.

In any case, in this section, we will go through a basic Python implementation of SGD as a kind of background for what we will do in the next sections. 

First, we import necessary libraries:

```python
import seaborn as sns
import numpy as np
import pandas as pd
import warnings

warnings.filterwarnings("ignore")
np.random.seed(42)
```

Next, we load the Diamonds dataset from Seaborn, take a sample from it and build the feature and target arrays:

```python
# Load the data
dataset_size = 10_000
diamonds = sns.load_dataset("diamonds")

# Extract the target and the feature
xy = diamonds[["carat", "price"]].values
np.random.shuffle(xy)  # Shuffle the data
xy = xy[:dataset_size]

xy.shape
```

```
(10000, 2)
```

We are setting up a basic regression problem: given a diamond's carat (its weight), predict its price. 

Now, let's split the data to create training and test sets:

```python
# Split the data
np.random.shuffle(xy)

train_size = int(0.8 * dataset_size)
train_xy, test_xy = xy[:train_size], xy[train_size:]

train_xy.shape
```

```
(8000, 2)
```

To solve the task, we have a range of models at our disposal, but to keep things simple, we will choose Linear Regression and define it as a function:

```python
def model(m, x, b):
    """
    Simple Linear Regression: f(x) = m * x + b, where
    - x: diamond carat
    - m: price increase per carat
    - b: base diamond price
    - f(x): predicted diamond price
    """

    return m * x + b
```

Our Linear Regression model has only two parameters, `m` and `b`, so the task for SGD (and later, for ADAM) is to find optimal values for them. 

We should also define the loss function, Mean Squared Error, which will be used by SGD for optimization:

```python
def mean_squared_error(y_true, y_pred):
    """
    MSE as a loss function. It is defined as:

    Loss = (1/n) * Σ((y - f(x))²), where:
    - n: the length of the dataset
    - y: the true diamond price
    - f(x): predicted diamond price, i.e. m*x + b
    """

    return np.mean((y_true - y_pred) ** 2)
```
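
As a quick sanity check, here is how the model and the loss fit together on a tiny, made-up example. The two functions are repeated in compact form so the snippet runs on its own, and the numbers are chosen only so the arithmetic is easy to follow by hand.

```python
import numpy as np


def model(m, x, b):
    return m * x + b


def mean_squared_error(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)


# Three made-up data points (not taken from the Diamonds dataset)
x_toy = np.array([1.0, 2.0, 3.0])
y_toy = np.array([3.0, 5.0, 10.0])

# With m=2 and b=1 the predictions are [3, 5, 7]
y_pred_toy = model(m=2.0, x=x_toy, b=1.0)

# Errors are [0, 0, 3], squared errors are [0, 0, 9], so the mean is 3.0
toy_loss = mean_squared_error(y_toy, y_pred_toy)
print(y_pred_toy, toy_loss)
```

Only the third point is predicted incorrectly, but because the error is squared, that single miss of 3 contributes 9 to the sum before averaging.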

Now, we define a function called `stochastic_gradient_descent` that accepts six arguments:

- `x` and `y` are the single feature and the target in our problem
- `epochs` is the number of full passes over the training data
- `learning_rate` is the step size
- `batch_size` is the number of samples used for each parameter update
- `stopping_threshold` is the minimum amount by which the loss must decrease from one epoch to the next; if it improves by less, training stops early

```python
def stochastic_gradient_descent(
    x, y, epochs=100, learning_rate=0.01, batch_size=32, stopping_threshold=1e-6
):
    """
    SGD with support for mini-batches.
    """

    # Initialize the model parameters randomly
    m = np.random.randn()
    b = np.random.randn()
    n = len(x)  # The number of data points
    previous_loss = np.inf
```

Inside the function, we first initialize the parameters we want to optimize with random values. We also set the initial loss to infinity, so that the first epoch always counts as an improvement over the previous loss.

Then, we start a `for` loop that runs for `epochs` iterations. Inside the loop, we shuffle the data to prevent learning order-dependent patterns in the data:

```python
def stochastic_gradient_descent(...):
    ...
    for i in range(epochs):
        # Shuffle the data
        indices = np.random.permutation(n)
        x = x[indices]
        y = y[indices]
```

Then, we start another loop controlled by the `batch_size` parameter:

```python
def stochastic_gradient_descent(...):
    ...
    for i in range(epochs):
        ...
        for j in range(0, n, batch_size):
            # Extract the current batch
            x_batch = x[j:j + batch_size]
            y_batch = y[j:j + batch_size]
```

Inside this inner loop, we calculate the gradients (partial derivatives) for both parameters:

```python
def stochastic_gradient_descent(...):
    ...
    for i in range(epochs):
        ...
        for j in range(0, n, batch_size):
            # Extract the current batch
            ...
            # Compute the gradients of the MSE loss with respect to m and b
            y_pred = model(m, x_batch, b)  # Make predictions with current m and b
            m_gradient = -2 * np.mean(x_batch * (y_batch - y_pred))
            b_gradient = -2 * np.mean(y_batch - y_pred)
```
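
Where do these two formulas come from? Differentiating the MSE loss `mean((y - (m*x + b))**2)` with respect to `m` gives `-2 * mean(x * (y - y_pred))`, and with respect to `b` gives `-2 * mean(y - y_pred)`. The sketch below checks this numerically with finite differences on a made-up batch. The helper `mse_for` and all the data values are hypothetical.

```python
import numpy as np


def mse_for(m, b, x, y):
    # MSE of the linear model m * x + b on the given batch
    return np.mean((y - (m * x + b)) ** 2)


# A made-up mini-batch (not from the Diamonds dataset)
rng = np.random.default_rng(0)
x_batch = rng.uniform(0.0, 1.0, size=32)
y_batch = 3 * x_batch + 2 + rng.normal(0.0, 0.1, size=32)
m, b = 1.0, 0.5

# Analytical gradients, exactly as in the tutorial code
y_pred = m * x_batch + b
m_gradient = -2 * np.mean(x_batch * (y_batch - y_pred))
b_gradient = -2 * np.mean(y_batch - y_pred)

# Numerical gradients via central differences
h = 1e-4
m_numeric = (mse_for(m + h, b, x_batch, y_batch) - mse_for(m - h, b, x_batch, y_batch)) / (2 * h)
b_numeric = (mse_for(m, b + h, x_batch, y_batch) - mse_for(m, b - h, x_batch, y_batch)) / (2 * h)

print(m_gradient, m_numeric)
print(b_gradient, b_numeric)
```

The analytical and numerical values agree closely, which confirms that `m_gradient` and `b_gradient` are the gradients themselves. Subtracting them in the update step is what moves the parameters downhill.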

After we have the gradients, we update the parameters using the learning rate:

```python
def stochastic_gradient_descent(...):
    ...
    for i in range(epochs):
        ...
        for j in range(0, n, batch_size):
            # Extract the current batch
            ...
            # Compute the gradients of the MSE loss with respect to m and b
            ...
            # Update the model parameters
            m -= learning_rate * m_gradient
            b -= learning_rate * b_gradient
```

Now, back in the outer loop, after all batches have been processed, we calculate the loss for the current epoch:

```python
def stochastic_gradient_descent(...):
    ...
    for i in range(epochs):
        ...
        for j in range(0, n, batch_size):
            # Extract the current batch
            ...
            # Compute the gradients of the MSE loss with respect to m and b
            ...
            # Update the model parameters
            ...
        # Compute the epoch loss
        y_pred = model(m, x, b)
        current_loss = mean_squared_error(y, y_pred)
```

If the loss improved by less than the `stopping_threshold` compared to the previous epoch, we stop the entire process:

```python
def stochastic_gradient_descent(...):
    ...
    for i in range(epochs):
        ...
        for j in range(0, n, batch_size):
            # Extract the current batch
            ...
            # Compute the gradients of the MSE loss with respect to m and b
            ...
            # Update the model parameters
            ...
        # Compute the epoch loss
        ...
        # Check against the stopping threshold
        if previous_loss - current_loss < stopping_threshold:
            break
        previous_loss = current_loss
```
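
To get a feel for this check, here is a small sketch that applies the same comparison to hand-written sequences of epoch losses. The helper `epochs_until_stop` and the loss values are hypothetical and not part of the tutorial's function.

```python
def epochs_until_stop(losses, stopping_threshold=1e-6):
    """Return the epoch (1-based) at which the stopping check would trigger."""
    previous_loss = float("inf")
    for epoch, current_loss in enumerate(losses, start=1):
        # Same comparison as in stochastic_gradient_descent
        if previous_loss - current_loss < stopping_threshold:
            return epoch
        previous_loss = current_loss
    return len(losses)  # Never triggered: all epochs were used


# The loss plateaus at epoch 4, where the improvement is only about 1e-7
plateau_stop = epochs_until_stop([10.0, 5.0, 4.0, 3.9999999, 3.0])

# The loss goes *up* at epoch 3, which also counts as "not enough improvement"
increase_stop = epochs_until_stop([10.0, 5.0, 6.0, 1.0])

print(plateau_stop, increase_stop)
```

Note the second case: because mini-batch updates are noisy, a single epoch where the loss rises slightly also ends training. That keeps the rule simple, but it can stop earlier than you might expect.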

In the end (after epochs run out or the stopping threshold is met), we return `m` and `b` which are now optimized:

```python
def stochastic_gradient_descent(...):
    ...
    for i in range(epochs):
        ...
        for j in range(0, n, batch_size):
            # Extract the current batch
            ...
            # Compute the gradients of the MSE loss with respect to m and b
            ...
            # Update the model parameters
            ...
        # Compute the epoch loss
        ...
        # Check against the stopping threshold
        ...
    return m, b
```

I’ve pasted the entire code into [this GitHub gist](https://gist.github.com/BexTuychiev/645dcd35ef12dad323b6c0182f29be74) so that you can look at the whole picture.
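
For convenience, here is the function assembled from the snippets above, followed by a quick check on synthetic data. Two things in it are assumptions on my side: the epoch loss is computed with `mean_squared_error`, since that is the loss function defined earlier, and the demo uses noise-free data generated from `y = 3x + 2` with a larger learning rate of `0.1`, so that it runs without downloading the Diamonds dataset.

```python
import numpy as np


def model(m, x, b):
    return m * x + b


def mean_squared_error(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)


def stochastic_gradient_descent(
    x, y, epochs=100, learning_rate=0.01, batch_size=32, stopping_threshold=1e-6
):
    """SGD with support for mini-batches."""
    # Initialize the model parameters randomly
    m = np.random.randn()
    b = np.random.randn()
    n = len(x)  # The number of data points
    previous_loss = np.inf

    for i in range(epochs):
        # Shuffle the data (same permutation for x and y keeps pairs aligned)
        indices = np.random.permutation(n)
        x = x[indices]
        y = y[indices]

        for j in range(0, n, batch_size):
            # Extract the current batch
            x_batch = x[j:j + batch_size]
            y_batch = y[j:j + batch_size]

            # Compute the gradients of the MSE loss with respect to m and b
            y_pred = model(m, x_batch, b)
            m_gradient = -2 * np.mean(x_batch * (y_batch - y_pred))
            b_gradient = -2 * np.mean(y_batch - y_pred)

            # Update the model parameters
            m -= learning_rate * m_gradient
            b -= learning_rate * b_gradient

        # Compute the epoch loss
        y_pred = model(m, x, b)
        current_loss = mean_squared_error(y, y_pred)

        # Check against the stopping threshold
        if previous_loss - current_loss < stopping_threshold:
            break
        previous_loss = current_loss

    return m, b


# Synthetic, noise-free demo data (an assumption, not the Diamonds dataset)
np.random.seed(0)
x_demo = np.random.uniform(0, 1, size=1000)
y_demo = 3 * x_demo + 2

m_hat, b_hat = stochastic_gradient_descent(
    x_demo, y_demo, epochs=200, learning_rate=0.1
)
print(f"m = {m_hat:.3f}, b = {b_hat:.3f}")
```

On this toy problem, the recovered `m` and `b` should land close to the true values of 3 and 2. On the real diamond data the features and prices live on very different scales, so expect to tune `learning_rate` and `epochs`.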

## SGD With Momentum

## RMSProp

## Finally, ADAM