Hessian diagonal computation #445

Open

MaxHalford opened this issue Nov 20, 2018 · 8 comments

@MaxHalford commented Nov 20, 2018

Hello,

I've been playing around with autograd and I'm having a blast. However, I'm having some difficulty extracting the diagonal of the Hessian.

This is my current code:

from autograd import hessian
import autograd.numpy as np 

y_pred = np.array([
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1]
], dtype=float)

weights = np.array([1, 1, 1, 1, 1], dtype=float)

def softmax(x, axis=1):
    z = np.exp(x)
    return z / np.sum(z, axis=axis, keepdims=True)

def loss(y_pred):
    
    y_true = np.array([
        [1, 0, 0, 0, 0],
        [0, 1, 0, 0, 0],
        [0, 0, 0, 1, 0],
        [0, 0, 1, 0, 0],
        [0, 0, 0, 0, 1],
        [0, 0, 0, 0, 1],
        [0, 0, 0, 0, 1]
    ], dtype=float)
    
    ys = np.sum(y_true, axis=0)
    y_true = y_true / ys
    ln_p = np.log(softmax(y_pred))
    wll = np.sum(y_true * ln_p, axis=0)
    loss = -np.dot(weights, wll)
    return loss

hess = hessian(loss)(y_pred)

I understand that hessian is simply jacobian applied twice, so hess is an n × p × n × p array (where (n, p) is the shape of y_pred). I can extract the diagonal manually and obtain my expected output, which is:

[[0.24090069 0.12669198 0.12669198 0.12669198 0.12669198]
 [0.12669198 0.24090069 0.12669198 0.12669198 0.12669198]
 [0.12669198 0.12669198 0.12669198 0.24090069 0.12669198]
 [0.12669198 0.12669198 0.24090069 0.12669198 0.12669198]
 [0.04223066 0.04223066 0.04223066 0.04223066 0.08030023]
 [0.04223066 0.04223066 0.04223066 0.04223066 0.08030023]
 [0.04223066 0.04223066 0.04223066 0.04223066 0.08030023]]

I've checked this numerically and it's fine. The problem is that this still requires computing the full Hessian before accessing the diagonal part, which is really expensive. Is there any better way to proceed? I think this is a common use case in machine learning optimization that deserves a dedicated convenience function.

@neonwatty commented Nov 21, 2018

EDIT: THIS PROPOSED SOLUTION DOES NOT WORK IN GENERAL - see dougalm's answer below for explanation

Just provided a solution for this over at #417 - small world.

Below is an example showing how to compute the gradient and just the diagonal of the Hessian (using elementwise_grad twice).

from autograd import numpy as np
from autograd import elementwise_grad

# an example function
g = lambda w: 5*w[0]**2 + 7*w[1]**2

# define a test point
w = np.array([2.0,2.0])

# compute gradient function, then evaluate
nabla_g = elementwise_grad(g)
print ('gradient at test point = ' + str(nabla_g(w)))

# just compute pure second derivatives - the diagonal of the hessian - without computing full hessian first
diag_hess_g = elementwise_grad(nabla_g)
print ('diagonal of hessian at test point = ' + str(diag_hess_g(w)))
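
For this particular g the two approaches happen to agree, because its Hessian is diagonal; a quick sanity check against the full hessian (using the same w as above):

from autograd import hessian

H = hessian(g)(w)
print(np.diag(H))  # [10. 14.], matching diag_hess_g(w)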

PS: you can check out the entire list of built-in differential operators in the autograd documentation.

@MaxHalford (Author)

Hey, thanks a lot for the answer.

I thought about doing that, but I find the results surprising.

Here is the updated example using the approach you're proposing.

from autograd import elementwise_grad
import autograd.numpy as np 

y_true = np.array([
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1]
], dtype=float)

y_pred = np.array([
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1]
], dtype=float)

weights = np.array([1, 1, 1, 1, 1], dtype=float)

def softmax(x, axis=1):
    z = np.exp(x)
    return z / np.sum(z, axis=axis, keepdims=True)

def loss(y_pred):
    
    y_true = np.array([
        [1, 0, 0, 0, 0],
        [0, 1, 0, 0, 0],
        [0, 0, 0, 1, 0],
        [0, 0, 1, 0, 0],
        [0, 0, 0, 0, 1],
        [0, 0, 0, 0, 1],
        [0, 0, 0, 0, 1]
    ], dtype=float)
    
    ys = np.sum(y_true, axis=0)
    y_true = np.divide(y_true, ys)
    ln_p = np.log(softmax(y_pred))
    wll = np.sum(y_true * ln_p, axis=0)
    loss = -np.dot(weights, wll)
    return loss


nabla_g = elementwise_grad(loss)

diag_hess_g = elementwise_grad(nabla_g)

print(diag_hess_g(y_pred))

This returns:

[[7.54474768e-17 2.77555756e-17 2.77555756e-17 2.77555756e-17
  2.77555756e-17]
 [2.77555756e-17 7.54474768e-17 2.77555756e-17 2.77555756e-17
  2.77555756e-17]
 [2.77555756e-17 2.77555756e-17 2.77555756e-17 7.54474768e-17
  2.77555756e-17]
 [2.77555756e-17 2.77555756e-17 7.54474768e-17 2.77555756e-17
  2.77555756e-17]
 [6.93889390e-18 6.93889390e-18 6.93889390e-18 6.93889390e-18
  1.88618692e-17]
 [6.93889390e-18 6.93889390e-18 6.93889390e-18 6.93889390e-18
  1.88618692e-17]
 [6.93889390e-18 6.93889390e-18 6.93889390e-18 6.93889390e-18
  1.88618692e-17]]

which isn't the result I'm expecting (I gave the expected output in my initial post). Instead it's essentially a matrix of zeros. I'm not saying you're wrong, but I'm fairly sure of my expected output: I computed the same thing with TensorFlow and it matched, and when I used it in my machine learning routine it worked very well. On the other hand, the output obtained by applying elementwise_grad twice gives very poor results, so it "seems" wrong.

@MaxHalford (Author) commented Nov 21, 2018

Here is the code I'm using to extract the diagonal part.

hess = hessian(loss)

H = hess(y_pred)

diag = np.array([
    H[i, j, i, j]
    for j in range(H.shape[1])
    for i in range(H.shape[0])
])

# reshape returns a new array; order='F' matches the column-major loop order above
diag = diag.reshape(y_pred.shape, order='F')
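
For reference, the same entries can be pulled out in one step with np.einsum (this still requires building the full Hessian first; it only replaces the explicit loop):

diag = np.einsum('ijij->ij', H)  # picks H[i, j, i, j] for every (i, j)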

@neonwatty

Side question (from looking a bit closer at your original code): are you trying to differentiate the function loss with respect to weights? Right now, the way you have implemented loss, autograd will by default differentiate it with respect to the first input argument, y_pred. By default, every autograd differential operator (e.g., grad or hessian) differentiates with respect to the first input.

If you want to differentiate with respect to weights, you can write your function as shown below:

def loss(weights,y_pred):
    
    y_true = np.array([
        [1, 0, 0, 0, 0],
        [0, 1, 0, 0, 0],
        [0, 0, 0, 1, 0],
        [0, 0, 1, 0, 0],
        [0, 0, 0, 0, 1],
        [0, 0, 0, 0, 1],
        [0, 0, 0, 0, 1]
    ], dtype=float)
    
    ys = np.sum(y_true, axis=0)
    y_true = y_true / ys
    ln_p = np.log(softmax(y_pred))    
    wll = np.sum(y_true * ln_p, axis=0)
    loss = -np.dot(weights, wll)

    return loss

Since your weights have length 5, your gradient will also have length 5 and your Hessian will be a 5 × 5 matrix. Then (I assume you're computing the Hessian to perform Newton's method to tune the weights?) you're good to go, i.e., you won't need to take the "diagonal of the Hessian".
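
As an aside, autograd's operators also accept an argnum argument that selects which positional argument to differentiate with respect to; a minimal sketch, assuming the two-argument loss(weights, y_pred) above:

from autograd import grad

grad_wrt_weights = grad(loss, 0)  # w.r.t. weights (the default, argnum=0)
grad_wrt_y_pred = grad(loss, 1)   # w.r.t. y_pred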

@MaxHalford (Author) commented Dec 1, 2018

Hey,

Nope, I'm trying to differentiate w.r.t. y_pred. And no, the idea is that I only need the diagonal of the Hessian, for gradient boosting. Computing the full Hessian is too costly.

@dougalm (Contributor) commented Dec 1, 2018

Unfortunately, I don't think it's possible to compute the diagonal of the Hessian other than by taking N separate Hessian-vector products, equivalent to instantiating the full Hessian and then taking the diagonal. People resort to all sorts of tricks to estimate the trace of the Hessian (e.g. https://arxiv.org/abs/1802.03451) precisely because it's expensive to evaluate the diagonal.
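
For concreteness, here is a minimal sketch of that N-Hessian-vector-product approach (the helper hessian_diag is hypothetical, not part of autograd, and it costs roughly as much as forming the full Hessian):

import autograd.numpy as np
from autograd import grad

def hessian_diag(f, x):
    g = grad(f)
    # Hessian-vector product, written as the gradient of <grad f(y), v>
    hvp = lambda z, v: grad(lambda y: np.sum(g(y) * v))(z)
    diag = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        v = np.zeros_like(x)
        v[idx] = 1.0                # one-hot vector for this coordinate
        diag[idx] = hvp(x, v)[idx]  # keep only the matching entry of H @ v
    return diag

# hessian_diag(loss, y_pred) should match the diagonal extracted from hessian(loss)(y_pred)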

Autograd's elementwise_grad has a very important caveat: it only applies to functions for which the Jacobian is diagonal. All it does is a vector-Jacobian product with a vector of ones, which gives you the sum of each row of the Jacobian. If the Jacobian is diagonal, then that's the same thing as the diagonal of the Jacobian.

That caveat was in the docstring of an earlier version of elementwise_grad. But at some point we deleted the function (because it can be misleading!) and then later we reinstated it, without the docstring. I just added the caveat back in. Sorry for the confusion.

@dougalm (Contributor) commented Dec 1, 2018

Sorry, I meant to say "sum of each column". I was thinking of forward mode (which we should probably be using in elementwise_grad anyway).
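
A small sketch of this caveat, using a made-up linear map whose Jacobian is clearly not diagonal: elementwise_grad returns the column sums of the Jacobian rather than its diagonal.

import autograd.numpy as np
from autograd import elementwise_grad, jacobian

A = np.array([[2., 1.],
              [0., 3.]])
f = lambda x: np.dot(A, x)      # the Jacobian of f is A, which is not diagonal
x = np.array([1.0, 1.0])

print(jacobian(f)(x))           # [[2. 1.] [0. 3.]]
print(elementwise_grad(f)(x))   # [2. 4.] -- the column sums of A
print(np.diag(A))               # [2. 3.] -- the actual diagonal of the Jacobian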

@neonwatty commented Dec 2, 2018

How stupid of a workaround (not invoking jacobian or elementwise_grad / trying to avoid the diagonal-Jacobian restriction / trying to avoid computing the second cross-partials) would it be to loop over the input arguments and compute the pure second partials one at a time using grad? Supposing the input arguments are unpacked, the j-th pure second partial would be grad(grad(g, j), j).

I'm assuming pretty stupid, but for example

from autograd import grad

# a test function
g = lambda w0, w1: 5*w0**2 + 7*w1**2 + w0*w1

# a test point
w0 = 2.0; w1 = 2.0

# number of input arguments, i.e. of pure second partials
num_partials = 2

# construct the pure second partials one at a time
second_pure_partials = [grad(grad(g, j), j) for j in range(num_partials)]

# evaluate the pure second partials one at a time
diag_hess = lambda w0, w1: [part(w0, w1) for part in second_pure_partials]
print('pure second derivatives at test point = ' + str(diag_hess(w0, w1)))
     
`pure second derivatives at test point = [10.0, 14.0]`

Compare to the "incorrect" answer given by composing elementwise_grad in an attempt to get at the pure partials:


from autograd import elementwise_grad

bad_hess_diag = elementwise_grad(elementwise_grad(g, (0, 1)), (0, 1))
print('"incorrect" pure second derivatives at test point = ' + str(bad_hess_diag(w0, w1)))

`"incorrect" pure second derivatives at test point = (array(11.), array(15.))`
