
[Minor] In the objective function, drop terms that are not dependent on the parameters. #169

Closed
tbenthompson opened this issue Jun 4, 2020 · 9 comments


@tbenthompson (Collaborator) commented Jun 4, 2020

[EDIT] See the conversation below.

Currently, in _eta_mu_deviance, we compute the deviance and then later multiply by 0.5 and add the L1 and L2 penalty terms to compute an objective function value. Strictly speaking, this isn't the objective function value, but it should differ from it only by a constant that depends on y. For most distribution/link function pairs, computing the deviance is more complicated than computing the log-likelihood. For example, for Poisson, the LL is:

y[i] * eta[i] - mu[i]

whereas the deviance as currently implemented is:

        if y[i] == 0:
            unit_deviance = 2 * (-y[i] + mu_out[i])
        else:
            unit_deviance = 2 * ((y[i] * (log(y[i]) - eta_out[i] - 1)) + mu_out[i])

Since we don't actually need a deviance, we should compute the log-likelihood.
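
To illustrate with a quick sketch (toy inputs, not our code): the two quantities differ only by a term that depends on y alone, not on the parameters.

import numpy as np

def poisson_ll_kernel(y, eta, mu):
    # y * eta - mu; the -log(y!) normalization is dropped since it is constant in mu.
    return y * eta - mu

def poisson_unit_deviance(y, eta, mu):
    # 2 * (y * log(y / mu) - y + mu), using the convention 0 * log(0) = 0 and eta = log(mu).
    y_log_y = np.where(y == 0, 0.0, y * np.log(np.where(y == 0, 1.0, y)))
    return 2 * (y_log_y - y * eta - y + mu)

y = np.array([0.0, 1.0, 3.0])
eta = np.array([-0.5, 0.2, 1.0])
mu = np.exp(eta)
# deviance / (-2) - ll_kernel depends only on y, not on eta or mu:
print(poisson_unit_deviance(y, eta, mu) / -2 - poisson_ll_kernel(y, eta, mu))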

@tbenthompson tbenthompson changed the title Use log-likelihood instead of deviance in the line search. [Minor] Use log-likelihood instead of deviance in the line search. Jun 4, 2020
@lbittarello (Member)

On the other hand, the Tweedie LL does not have a closed-form expression (Dunn and Smyth, 2005), and approximating the series is rather expensive.

@tbenthompson (Collaborator, Author)

Oh interesting. So, the deviance has a closed form, but the LL does not? Currently the deviance is implemented as:

    def unit_deviance(self, y, mu):
        p = self.power
........
            # return 2 * (np.maximum(y,0)**(2-p)/((1-p)*(2-p))
            #    - y*mu**(1-p)/(1-p) + mu**(2-p)/(2-p))
            return 2 * (
                np.power(np.maximum(y, 0), 2 - p) / ((1 - p) * (2 - p))
                - y * np.power(mu, 1 - p) / (1 - p)
                + np.power(mu, 2 - p) / (2 - p)
            )

@lbittarello (Member)

Exactly, the normalization term in the Tweedie LL does not have a closed form. The deviance gets rid of it.
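
Roughly (writing the decomposition from memory, so treat the exact form as an approximation of Dunn and Smyth, 2005): for 1 < p < 2 the Tweedie log-density splits into a closed-form part and a normalizing series,

log f(y; mu, phi) = (1 / phi) * (y * mu**(1 - p) / (1 - p) - mu**(2 - p) / (2 - p)) + log a(y, phi, p)

where a(y, phi, p) is an infinite series with no closed form. The deviance only involves the first bracket (plus terms in y alone), which is why the series drops out.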

@tbenthompson (Collaborator, Author)

The only place we actually use the deviance/LL is in the line search, where we determine whether the step size is safe via a backtracking line search (https://en.wikipedia.org/wiki/Backtracking_line_search). In that setting, we're always subtracting one objective value from another.

In that sense, any constant offset in the objective function is fine. So the "ugly" but performant solution here might be to use the deviance when that's convenient and the log-likelihood when that's convenient, with comments to make it clear what the heck is going on.

@lbittarello what do you think of that?
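
To make that concrete, here's a rough sketch of the backtracking logic (illustrative names and Armijo constants, not our actual implementation): only differences of objective values enter the sufficient-decrease test, so any additive constant cancels.

import numpy as np

def backtracking_step(obj_up_to_constant, x, direction, grad,
                      step=1.0, shrink=0.5, c=1e-4, max_iter=20):
    # Armijo backtracking: accept the first step t such that
    # f(x + t * d) - f(x) <= c * t * grad.dot(d). Adding a constant to f
    # shifts both f values equally, so the test is unaffected.
    f0 = obj_up_to_constant(x)
    slope = grad @ direction  # negative for a descent direction
    for _ in range(max_iter):
        if obj_up_to_constant(x + step * direction) - f0 <= c * step * slope:
            break
        step *= shrink
    return step

# e.g., for f(x) = 0.5 * ||x||^2 (+ any constant), starting at x0 with d = -grad:
x0 = np.array([3.0, -4.0])
t = backtracking_step(lambda x: 0.5 * x @ x + 7.0, x0, -x0, x0)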

@lbittarello (Member)

Fine by me. Just wanted to warn you of the monsters ahead. :)

By the way, I'm also not sure whether the Gamma LL is cheaper to compute than the deviance. We currently have:

def _gamma_unit_deviance(power, dispersion, y, mu):
    return 2 * (np.log(mu) - np.log(y) + y / mu - 1)


def _gamma_unit_loglikelihood(power, dispersion, y, mu):
    log_y = np.log(y)
    normalization = (
        (log_y - np.log(dispersion)) / dispersion - log_y - loggamma(1 / dispersion)
    )
    return normalization - y / (dispersion * mu) - np.log(mu) / dispersion

LightGBM uses something else, which we call raw log loss for lack of a better name:

def _gamma_unit_raw_logloss(power, dispersion, y, mu):
    return y / mu + np.log(mu)
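
A quick numerical check (arbitrary y, dispersion, and mu values) that, as functions of mu, the three expressions differ only by a y-dependent offset and a scale factor:

import numpy as np
from scipy.special import loggamma

y, dispersion = 3.0, 0.7
mus = np.array([0.5, 1.0, 2.0, 4.0])

deviance = 2 * (np.log(mus) - np.log(y) + y / mus - 1)
raw_logloss = y / mus + np.log(mus)
loglik = (
    (np.log(y) - np.log(dispersion)) / dispersion
    - np.log(y)
    - loggamma(1 / dispersion)
    - y / (dispersion * mus)
    - np.log(mus) / dispersion
)

# Both of these are constant across mus:
print(deviance / 2 - raw_logloss)
print(-dispersion * loglik - raw_logloss)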

@lbittarello (Member)

> what the heck

what the heck hack

@tbenthompson (Collaborator, Author)

Your point about Gamma is a great one. Beyond just deviance vs. log-likelihood, we can drop anything that doesn't depend on the parameters, like LightGBM is doing. That's even better. And we can just call it a "raw_logloss". Thanks!
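
For instance, the Poisson version could just be the negated LL kernel from above (a hypothetical sketch to illustrate the idea, not necessarily the final signature or what the PR ends up doing):

def _poisson_unit_raw_logloss(y, eta, mu):
    # mu - y * eta: the -log(y!) term is dropped because it does not depend
    # on the parameters.
    return mu - y * eta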

@tbenthompson tbenthompson changed the title [Minor] Use log-likelihood instead of deviance in the line search. [Minor] In the objective function, drop terms that are not dependent on the parameters. Jun 4, 2020
@tbenthompson (Collaborator, Author)

> what the heck
>
> what the heck hack

Luca, haxor extraordinaire.

@tbenthompson (Collaborator, Author)

Did this for Poisson in #170

I'm going to close this issue and fold it into #151 since the work is very similar.
