In [1]:
# Reference: https://jupyterbook.org/interactive/hiding.html
# Use {hide, remove}-{input, output, cell} tags to hide content

import sys
import os
if not any(path.endswith('textbook') for path in sys.path):
    sys.path.append(os.path.abspath('../../..'))
from textbook_utils import *

In [2]:
times = pd.read_csv('data/seattle_bus_times_NC.csv')

# Minimizing Huber Loss

We first defined the average Huber loss in {numref}`Chapter %s <ch:modeling>`:

$$
L(\theta, \textbf{y}) = \frac{1}{n} \sum_{i=1}^n \begin{cases}
    \frac{1}{2}(y_i - \theta)^2 &  | y_i - \theta | \le \gamma \\
     \gamma (|y_i - \theta| - \frac{1}{2} \gamma ) & \text{otherwise}
\end{cases}
$$

The gradient of the average Huber loss is:

$$
\nabla_{\theta} L(\theta, \textbf{y}) = \frac{1}{n} \sum_{i=1}^n \begin{cases}
    -(y_i - \theta) &  | y_i - \theta | \le \gamma \\
    - \gamma \cdot \text{sign} (y_i - \theta) & \text{otherwise}
\end{cases}
$$
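
To see where the two cases come from, we can differentiate each piece of the loss for a single observation $y_i$ with respect to $\theta$. This is a short sketch of the calculus, not a new result:

$$
\begin{aligned}
\frac{\partial}{\partial \theta} \, \frac{1}{2}(y_i - \theta)^2
  &= -(y_i - \theta) \\
\frac{\partial}{\partial \theta} \, \gamma \left( |y_i - \theta| - \frac{1}{2} \gamma \right)
  &= -\gamma \cdot \text{sign}(y_i - \theta)
\end{aligned}
$$

At the transition point, where $|y_i - \theta| = \gamma$, both expressions equal $-\gamma \cdot \text{sign}(y_i - \theta)$, so the two pieces join smoothly and the loss is differentiable everywhere.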

(Note that in previous definitions of Huber loss we used the variable $ \alpha $ to denote the transition point. To avoid confusion with the $ \alpha $ used as the learning rate in gradient descent, we replace the transition point parameter of the Huber loss with $ \gamma $.) 

We create the functions `huber_loss` and `grad_huber_loss` to compute the average loss and its gradient. Each function takes the parameter $\theta$, the observed data to average over, and the transition point $\gamma$ of the loss function.

In [3]:
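def huber_loss(theta, dataset, gamma = 1):
    '''Average Huber loss of the constant theta over dataset.'''
    d = np.abs(theta - dataset)
    return np.mean(
        # Quadratic within gamma of theta, linear beyond it
        np.where(d <= gamma,
                 (theta - dataset)**2 / 2.0,
                 gamma * (d - gamma / 2.0))
    )

def grad_huber_loss(theta, dataset, gamma = 1):
    '''Gradient of the average Huber loss with respect to theta.'''
    d = np.abs(theta - dataset)
    return np.mean(
        # Residual-sized slope within gamma, constant slope of size gamma beyond it
        np.where(d <= gamma,
                 -(dataset - theta),
                 -gamma * np.sign(dataset - theta))
    )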
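
One way to gain confidence in a hand-derived gradient is to compare it against a numerical approximation. The sketch below repeats the two functions so that it runs on its own, and then compares `grad_huber_loss` to a central finite difference of `huber_loss`. The toy data, the evaluation point `theta`, and the step size `h` are all made up for illustration and are not part of the bus data.

```python
import numpy as np

def huber_loss(theta, dataset, gamma=1):
    d = np.abs(theta - dataset)
    return np.mean(np.where(d <= gamma,
                            (theta - dataset)**2 / 2.0,
                            gamma * (d - gamma / 2.0)))

def grad_huber_loss(theta, dataset, gamma=1):
    d = np.abs(theta - dataset)
    return np.mean(np.where(d <= gamma,
                            -(dataset - theta),
                            -gamma * np.sign(dataset - theta)))

# Hypothetical toy data: at theta = 2.5 the middle point falls in the
# quadratic region and the other two fall in the linear region
toy = np.array([0.0, 2.0, 5.0])
theta, h = 2.5, 1e-6

analytic = grad_huber_loss(theta, toy)
# Central difference approximation to the derivative of the average loss
numeric = (huber_loss(theta + h, toy) - huber_loss(theta - h, toy)) / (2 * h)

print(f'analytic: {analytic:.4f} | numeric: {numeric:.4f}')
```

If the two printed values agree to several decimal places, the gradient function is consistent with the loss function at that point. Checking a few values of `theta` on both sides of the transition point gives a more thorough test.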

The signature of our simple implementation of gradient descent includes the loss function, its gradient, and the data to average over. We also supply the learning rate. 

In [4]:
def minimize(loss_fn, grad_loss_fn, dataset, alpha=0.2, progress=False):
    '''
    Uses gradient descent to minimize loss_fn. Returns the minimizing value of
    theta_hat once theta_hat changes less than 0.001 between iterations.
    '''
    theta = 0
    while True:
        if progress:
            print(f'theta: {theta:.2f} | loss: {loss_fn(theta, dataset):.3f}')
        gradient = grad_loss_fn(theta, dataset)
        new_theta = theta - alpha * gradient
        
        if abs(new_theta - theta) < 0.001:
            return new_theta
        
        theta = new_theta

For the bus delays, we use the gradient descent algorithm to find the minimizing constant model for Huber loss. 

In [5]:
%%time
theta_hat = minimize(huber_loss, grad_huber_loss, times['minutes_late'], progress=False)
print(f'Minimizing theta: {theta_hat:.3f}')
print()


Minimizing theta: 0.701

CPU times: user 158 ms, sys: 12.3 ms, total: 170 ms
Wall time: 211 ms


Gradient descent typically stops when the steps become quite small. In our example, we stop when a step changes $\theta$ by less than 0.001. It is also common to stop the search after a large number of steps, such as 1,000, although our simple `minimize` function does not include this cap. If the algorithm has not arrived at the minimizing value after so many steps, then it might be diverging because the learning rate is too large, or the minimum might exist only in the limit at $ \pm \infty $.
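
A cap on the number of steps can be added with a bounded loop. The following is a minimal sketch and not the implementation used above: the name `minimize_capped`, the parameters `tol` and `max_iter`, and the returned convergence flag are all assumptions made for illustration. To make divergence easy to provoke, the example uses the gradient of an average half-squared loss, $\frac{1}{2}(y_i - \theta)^2$, as a stand-in for the Huber gradient.

```python
import numpy as np

def grad_half_squared_loss(theta, dataset):
    # Gradient of the average of (1/2) * (y_i - theta)**2 with respect to theta
    return np.mean(-(dataset - theta))

def minimize_capped(grad_loss_fn, dataset, alpha=0.2, tol=0.001, max_iter=1000):
    '''
    Gradient descent that stops when a step is smaller than tol or after
    max_iter steps. Returns (theta, converged).
    '''
    theta = 0.0
    for _ in range(max_iter):
        new_theta = theta - alpha * grad_loss_fn(theta, dataset)
        if abs(new_theta - theta) < tol:
            return new_theta, True
        theta = new_theta
    # The step size never fell below tol within max_iter steps
    return theta, False

# Hypothetical toy data whose minimizer is the mean, 2.0
data = np.array([1.0, 2.0, 3.0])

theta_ok, converged_ok = minimize_capped(grad_half_squared_loss, data, alpha=0.2)
theta_bad, converged_bad = minimize_capped(grad_half_squared_loss, data,
                                           alpha=2.5, max_iter=50)

print(f'alpha=0.2: theta={theta_ok:.3f}, converged={converged_ok}')
print(f'alpha=2.5: theta={theta_bad:.3e}, converged={converged_bad}')
```

With the smaller learning rate, the search settles near the mean of the toy data. With the larger one, each step overshoots the minimum by more than the last, so the loop runs out of steps and reports that it did not converge.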

Gradient descent gives us a general way to minimize average loss when we cannot easily solve for the minimizing value analytically or when the minimization is computationally expensive. The algorithm relies on two important properties of the average loss function: it is convex and differentiable in $ \boldsymbol{\theta} $. We discuss how the algorithm relies on these properties next.