
Quantiles Regression is Poorly-Calibrated #1182

Closed
joseortiz3 opened this issue Jan 8, 2018 · 10 comments
@joseortiz3 joseortiz3 commented Jan 8, 2018

Environment info

Operating System: Windows 10 64
CPU: i7-7700k
C++/Python/R version: Python 3.6 Anaconda 5.0.0

Problem Description

The recently implemented quantile-estimation functionality is poorly calibrated in comparison to sklearn's GradientBoostingRegressor. Specifying a quantile (the 75th percentile, for instance) results in estimates that do not bound 75% of the training data (usually less in practice), and no parameter configuration fixes this.

Reproducible examples

I have modified the script given in a previous issue on quantiles. Toggle USE_SKLEARN to observe the issue; change the parameters to see that the calibration problem persists.

import numpy as np
import matplotlib.pyplot as plt
import lightgbm as lgb
from sklearn.ensemble import GradientBoostingRegressor
np.random.seed(1)

# Use sklearn or lightgbm?
USE_SKLEARN = True # Toggle this to observe issue.


# Quantile to Estimate
alpha = 0.75
# Training data size
N_DATA = 1000
# Function to Estimate
def f(x):
    """The function to predict."""
    return x * np.sin(x)

# model parameters
LEARNING_RATE = 0.1
N_ESTIMATORS = 100
MAX_DEPTH = -1
NUM_LEAVES = 31 # lgbm only
OBJECTIVE = 'quantile_l2' # lgbm only, 'quantile' or 'quantile_l2'
REG_SQRT = True # lgbm only
if USE_SKLEARN:
    if MAX_DEPTH < 0:  # sklearn grows differently than lgbm.
        print('Max Depth specified is incompatible with sklearn. Changing to 3.')
        MAX_DEPTH = 3

#---------------------- DATA GENERATION ------------------- #

#  First the noiseless case
X = np.atleast_2d(np.random.uniform(0, 10.0, size=N_DATA)).T
X = X.astype(np.float32)

# Observations
y = f(X).ravel()

dy = 1.5 + 1.0 * np.random.random(y.shape)
noise = np.random.normal(0, dy)
y += noise
y = y.astype(np.float32)

# Mesh the input space for evaluations of the real function, the prediction and
# its MSE
xx = np.atleast_2d(np.linspace(0, 10, 9999)).T
xx = xx.astype(np.float32)


# Train high, low, and mean regressors.
# ------------------- HIGH/UPPER BOUND ------------------- #
if USE_SKLEARN:
    clfh = GradientBoostingRegressor(loss='quantile', alpha=alpha,
                n_estimators=N_ESTIMATORS, max_depth=MAX_DEPTH,
                learning_rate=LEARNING_RATE, min_samples_leaf=9,
                min_samples_split=9)

    clfh.fit(X, y)
else:
    ## ADDED
    clfh = lgb.LGBMRegressor(objective = OBJECTIVE,
                            alpha = alpha,
                            num_leaves = NUM_LEAVES,
                            learning_rate = LEARNING_RATE,
                            n_estimators = N_ESTIMATORS,
                            reg_sqrt = REG_SQRT,
                            max_depth = MAX_DEPTH)
    clfh.fit(X, y,
            #eval_set=[(X, y)],
            #eval_metric='quantile'
           )
    ## END ADDED

# ------------------- LOW/LOWER BOUND ------------------- #

if USE_SKLEARN:
    clfl = GradientBoostingRegressor(loss='quantile', alpha=1.0-alpha,
        n_estimators=N_ESTIMATORS, max_depth=MAX_DEPTH,
        learning_rate=LEARNING_RATE, min_samples_leaf=9,
        min_samples_split=9)

    clfl.fit(X, y)
else:
    ## ADDED
    clfl = lgb.LGBMRegressor(objective = OBJECTIVE,
                            alpha = 1.0 - alpha,
                            num_leaves = NUM_LEAVES,
                            learning_rate = LEARNING_RATE,
                            n_estimators = N_ESTIMATORS,
                            reg_sqrt = REG_SQRT,
                            max_depth = MAX_DEPTH)
    clfl.fit(X, y,
            #eval_set=[(X, y)],
            #eval_metric='quantile'
            )
    ## END ADDED

# ------------------- MEAN/PREDICTION ------------------- #

if USE_SKLEARN:
    clf = GradientBoostingRegressor(loss='ls',
            n_estimators=N_ESTIMATORS, max_depth=MAX_DEPTH,
            learning_rate=LEARNING_RATE, min_samples_leaf=9,
            min_samples_split=9)
    clf.fit(X, y)
else:
    ## ADDED
    clf = lgb.LGBMRegressor(objective = 'regression',
                            num_leaves = NUM_LEAVES,
                            learning_rate = LEARNING_RATE,
                            n_estimators = N_ESTIMATORS,
                            max_depth = MAX_DEPTH)

    clf.fit(X, y,
            #eval_set=[(X, y)],
            #eval_metric='l2',
            #early_stopping_rounds=5
            )
    ## END ADDED

# ---------------- PREDICTING ----------------- #

# Make the prediction on the meshed x-axis
y_pred = clf.predict(xx)
y_lower = clfl.predict(xx)
y_upper = clfh.predict(xx)

# Check calibration by predicting the training data.
y_autopred = clf.predict(X)
y_autolow = clfl.predict(X)
y_autohigh = clfh.predict(X)
frac_below_upper = round(np.count_nonzero(y_autohigh > y) / len(y),3)
frac_above_upper = round(np.count_nonzero(y_autohigh < y) / len(y),3)
frac_above_lower = round(np.count_nonzero(y_autolow < y) / len(y),3)
frac_below_lower = round(np.count_nonzero(y_autolow > y) / len(y),3)

# Print calibration test
print('fraction below upper estimate: \t actual: ' + str(frac_below_upper) + '\t ideal: ' + str(alpha))
print('fraction above lower estimate: \t actual: ' + str(frac_above_lower) + '\t ideal: ' + str(alpha))

# ------------------- PLOTTING ----------------- #

plt.plot(xx, f(xx), 'g:', label=r'$f(x) = x\,\sin(x)$')
plt.plot(X, y, 'b.', markersize=3, label=u'Observations')
plt.plot(xx, y_pred, 'r-', label=u'Mean Prediction')
plt.plot(xx, y_upper, 'k-')
plt.plot(xx, y_lower, 'k-')
plt.fill(np.concatenate([xx, xx[::-1]]),
            np.concatenate([y_upper, y_lower[::-1]]),
            alpha=.5, fc='b', ec='None', label=(str(round(100*(alpha-0.5)*2))+'% prediction interval'))
plt.scatter(x=X[y_autohigh < y], y=y[y_autohigh < y], s=20, marker='x', c = 'red', 
        label = str(round(100*frac_above_upper,1))+'% of training data above upper (expect '+str(round(100*(1-alpha),1))+'%)')
plt.scatter(x=X[y_autolow > y], y=y[y_autolow > y], s=20, marker='x', c = 'orange', 
        label = str(round(100*frac_below_lower,1))+ '% of training data below lower (expect '+str(round(100*(1-alpha),1))+'%)')
plt.xlabel('$x$')
plt.ylabel('$f(x)$')
plt.ylim(-10, 20)
plt.legend(loc='upper left')
plt.title(  '  Alpha: '+str(alpha) +
            '  Sklearn?: '+str(USE_SKLEARN) +
            '  N_est: '+str(N_ESTIMATORS) +
            '  L_rate: '+str(LEARNING_RATE) +
            '  N_Leaf: '+str(NUM_LEAVES) + 
            '  Obj: '+str(OBJECTIVE) +
            '  R_sqrt: '+str(int(REG_SQRT))
        )

plt.show()

Steps to reproduce

  1. Run code snippet with USE_SKLEARN set to True
  2. Run again with USE_SKLEARN set to False
  3. Compare Results
  4. Modify parameters, repeat, etc.

Results

With sklearn:
[calibration plot omitted]

With LightGBM:
[calibration plot omitted]

As shown above, LightGBM's quantile implementation estimates a narrower quantile (about 0.62) than was specified (0.75). Sklearn, on the other hand, produces a well-calibrated quantile estimate. Playing with the parameters does not help.

@guolinke guolinke commented Jan 8, 2018

@joseortiz3
Thanks, this is very helpful. It is indeed an issue.

As we discussed in #1109 (comment), the current implementation of quantile in LightGBM is not perfect.
There are two reasons:

  1. The gradient/hessian of the quantile loss is not easy to fit; quantile_l2 is a trade-off solution, which is not equivalent to quantile. Maybe you can try this (https://www.bigdatarepublic.nl/regression-prediction-intervals-with-xgboost/) and provide some feedback.
  2. The leaf value in LightGBM is sum_grad/sum_hess, while in sklearn it is the value of the corresponding quantile. Sklearn's solution seems better. Maybe we can try something similar.
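The difficulty in point 1 can be illustrated directly: the pinball (quantile) loss is piecewise linear, so its gradient is a step function and its second derivative is zero almost everywhere, which gives a second-order booster little curvature to work with. A minimal NumPy sketch (function names are illustrative, not LightGBM internals):

```python
import numpy as np

def pinball_loss(y_true, y_pred, alpha):
    """Pinball (quantile) loss: alpha-weighted absolute error."""
    diff = y_true - y_pred
    return np.where(diff >= 0, alpha * diff, (alpha - 1) * diff)

def pinball_grad(y_true, y_pred, alpha):
    """Gradient w.r.t. y_pred: a step function taking only the two
    values -alpha (under-prediction) and 1 - alpha (over-prediction)."""
    diff = y_true - y_pred
    return np.where(diff >= 0, -alpha, 1.0 - alpha)

# The Hessian of this loss is zero almost everywhere, so the leaf value
# sum_grad / sum_hess is ill-defined without extra smoothing.
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([2.0, 2.0, 2.0])
print(pinball_grad(y_true, y_pred, 0.75))  # -> [ 0.25 -0.75 -0.75]
```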
@guolinke guolinke commented Jan 8, 2018

@Laurae2 @henry0312
any suggestions for the quantile objective ?

@joseortiz3 joseortiz3 commented Jan 8, 2018

"Maybe you can try this: (https://www.bigdatarepublic.nl/regression-prediction-intervals-with-xgboost/) and provides some feedback."

Oh, by the way, yes, I've seen this. That approach is not very useful for me (finance, science). Grid-searching three parameters to find a well-calibrated quantile estimate is wasteful, and sklearn's implementation proves that a well-calibrated quantile estimate is possible without such a search. Sklearn's estimation is remarkably robust when estimating vastly different things / different distributions, so judging from my experience, whatever they are doing is the way to go.

@guolinke guolinke commented Jan 8, 2018

@joseortiz3 Thanks for the feedback.
So maybe we should fix the leaf values towards sklearn's solution.

@Laurae2 Laurae2 commented Jan 8, 2018

@guolinke The user must transform the labels to [0, 1] for quantile regression, then extrapolate back to the original range of the labels.

Quantile regression in LightGBM will not work properly without scaling values to the correct range. For instance, scikit-learn uses the range [0, 100], with alpha*100 = alpha for the target quantile. Training is not done using the labels, but is done using the labels' quantiles.
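A sketch of the rescaling workaround described above, assuming a simple min-max transform to [0, 1] and back (the helper name is hypothetical, not a LightGBM or sklearn API):

```python
import numpy as np

def scale_labels(y):
    """Min-max scale labels to [0, 1]; return the scaled labels and an
    inverse function to map predictions back to the original range."""
    y_min, y_max = y.min(), y.max()
    scaled = (y - y_min) / (y_max - y_min)
    inverse = lambda s: s * (y_max - y_min) + y_min
    return scaled, inverse

# Round-trip check: fit a quantile model on `scaled`, then apply
# `inverse` to its predictions.
y = np.array([3.0, 7.0, 11.0])
y_s, inv = scale_labels(y)
print(y_s)       # -> [0.  0.5 1. ]
print(inv(y_s))  # -> [ 3.  7. 11.]
```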

@guolinke guolinke commented Jan 9, 2018

@Laurae2 I didn't find where sklearn normalizes the label range to [0, 100].
Can you provide a source for it?

@Laurae2 Laurae2 commented Jan 9, 2018

@guolinke guolinke commented Jan 10, 2018

@Laurae2
I think it is not the same as normalizing the label range to [0, 100].
sklearn assigns each leaf the percentile value of delta = (y - pred_over_previous_trees).

I think simply normalizing the label range cannot help, since the leaf output is still sum_grad/sum_hess, which is not the same as the percentile value even for normalized labels.
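The sklearn behavior described here — each leaf outputs a percentile of its residuals rather than sum_grad/sum_hess — can be sketched as follows (illustrative helper, not sklearn's actual internals):

```python
import numpy as np

def quantile_leaf_values(residuals, leaf_ids, alpha):
    """Sketch of sklearn-style leaf relabeling: after growing the tree,
    each leaf outputs the alpha-percentile of the residuals of the
    training samples that fall into it."""
    values = {}
    for leaf in np.unique(leaf_ids):
        in_leaf = residuals[leaf_ids == leaf]
        values[int(leaf)] = float(np.percentile(in_leaf, alpha * 100))
    return values

residuals = np.array([0.0, 1.0, 2.0, 10.0, 20.0, 30.0])
leaf_ids = np.array([0, 0, 0, 1, 1, 1])
print(quantile_leaf_values(residuals, leaf_ids, 0.5))  # -> {0: 1.0, 1: 20.0}
```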

@guolinke guolinke commented Jan 16, 2018

fixed in #1199

@guolinke guolinke closed this Jan 16, 2018
@joseortiz3 joseortiz3 commented Jan 16, 2018

I think that with the new updates fixing this issue, LightGBM is now the fastest quantile-supporting boosted decision tree implementation available. Pretty exciting! I'm getting about a 20x speedup over sklearn with similar performance on quantile workloads. Great work.
