Way to generate confidence interval for predictions? #756

Open
enriquecardenas24 opened this issue Jan 30, 2024 · 6 comments

@enriquecardenas24

Looking at glum, I don't currently see a way to generate confidence intervals for predictions on an input dataset, but I wanted to check here. By this I mean something similar to statsmodels' ability to generate confidence intervals (as well as standard errors) for predictions on a given dataset by way of its PredictionResults class. Is there an equivalent way to do this in glum?

[screenshot: statsmodels prediction output with standard errors and confidence intervals]
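For concreteness, the statsmodels pattern I have in mind looks roughly like this (a sketch on made-up data, not glum code; get_prediction and summary_frame are the statsmodels calls):

import numpy as np
import statsmodels.api as sm

# Sketch of the statsmodels behaviour I'm after, on made-up data.
rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(100, 2)))
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=100)

res = sm.GLM(y, X, family=sm.families.Gaussian()).fit()
pred = res.get_prediction(X)
# Columns include the predicted mean, its standard error, and CI bounds.
print(pred.summary_frame(alpha=0.05).head())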

@MatthiasSchmidtblaicherQC
Contributor

There is currently no out-of-the-box way to generate confidence intervals for means. But you should be able to do this in a couple of lines once you have an estimated model and its covariance matrix, along the lines of the statsmodels code that you linked to (especially var_pred_mean = (exog * np.dot(covb, exog.T).T).sum(1)). Note that, in a model without an identity link, you will need to apply the delta method to obtain confidence intervals on the scale of the original variable (the statsmodels code appears to be for identity-link / linear models only).
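A minimal sketch of that computation for a fitted glum model with a log link, assuming the coefficient covariance matrix is available as model.covariance_matrix_ with the intercept ordered first (matching coef_table); rather than computing response-scale standard errors via the delta method, it maps the endpoints of the linear-predictor interval through the inverse link:

import numpy as np
import scipy.stats

def mean_confidence_interval(model, X, alpha=0.05):
    # X: dense 2-D array whose columns are in the same order as model.coef_.
    beta = np.concatenate([[model.intercept_], model.coef_])
    exog = np.column_stack([np.ones(len(X)), X])   # prepend the intercept column
    cov = np.asarray(model.covariance_matrix_)     # assumed: (p+1) x (p+1), intercept first

    eta = exog @ beta                              # linear predictor
    var_eta = (exog * (exog @ cov)).sum(axis=1)    # row-wise x' Cov x
    se_eta = np.sqrt(var_eta)

    q = scipy.stats.norm.ppf(1 - alpha / 2)
    lower, upper = eta - q * se_eta, eta + q * se_eta
    # exp inverts the log link; substitute the appropriate inverse link otherwise.
    return np.exp(eta), np.exp(lower), np.exp(upper)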

@enriquecardenas24
Author

I think I've got a draft of a solution that strips down some of the statsmodels code. I believe I'll only be working with identity-link models, but I'll keep working on it if I find otherwise. I wanted to post my solution here for future reference:

# Get prediction errors and CIs for each prediction (test).
# Assumes model (a fitted glum GeneralizedLinearRegressor), in_data (feature data),
# cov_params (coefficient covariance matrix), alpha, and dt_dedup are already defined.
import numpy as np
import pandas as pd
import scipy.stats
from patsy import dmatrix

# Create design matrix using model design info and dataset.
coef_table = model.coef_table()
exog_cols = coef_table.index[1:]  # drop the intercept row; dmatrix adds it back first
exog = pd.DataFrame(in_data, columns=exog_cols)
formula = ' + '.join(exog_cols)
exog = np.atleast_2d(np.asarray(dmatrix(formula, exog)))

# Calculate values used to generate errors and CIs.
covb = np.array(cov_params)
var_pred_mean = (exog * np.dot(covb, exog.T).T).sum(1)
params = coef_table['coef']
predicted_mean = np.dot(exog, params) #+ offset + exposure

# Calculate errors and confidence intervals.
se = np.sqrt(var_pred_mean)
dist_args = ()
q = scipy.stats.norm.ppf(1 - alpha / 2.0, *dist_args)
margin = q * se
lower, upper = predicted_mean - margin, predicted_mean + margin
    
# Record errors and confidence intervals.
new_vals = pd.DataFrame([se, lower, upper]).T
dt_dedup[['error', 'Low_C_I', 'High_C_I']] = new_vals

@enriquecardenas24
Author

Actually, I'm not sure the above code is entirely correct. It's a good starting point, but the statsmodels method is a bit more involved than I originally thought.

Are there any plans to add this methodology to glum soon, perhaps in v3? Regardless, I'll continue to work on the problem and will hopefully post here when done.

@MatthiasSchmidtblaicherQC
Contributor

I see scope for this feature in glum. Before deciding whether to implement it, though, we would first need to address a few questions:

  1. How could confidence intervals for predictions be incorporated into the API without distracting the standard user who just wants point predictions?
  2. Are confidence intervals really what one wants statistically for predictions? Many use cases call for prediction intervals instead.
  3. These confidence intervals can only be computed for unregularised GLMs. If we really care about prediction intervals, then a model-agnostic approach such as conformal prediction would be better (a rough sketch follows below). See also this library.

Adding the feature does not need to be coupled to v3 release because it would not be a breaking change.
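To make the third point concrete, here is a rough sketch of split conformal prediction on top of a glum fit, using synthetic data and absolute residuals as the conformity score (a dedicated conformal library handles this more carefully):

import numpy as np
from glum import GeneralizedLinearRegressor

# Synthetic data, purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 3))
y = rng.gamma(shape=2.0, scale=np.exp(X @ np.array([0.3, -0.2, 0.1])))
X_train, y_train = X[:300], y[:300]
X_calib, y_calib = X[300:500], y[300:500]
X_new = X[500:]

# Fit an unregularised GLM on the training split only.
model = GeneralizedLinearRegressor(family="gamma", alpha=0.0)
model.fit(X_train, y_train)

# Size a symmetric prediction interval from held-out calibration residuals.
target_alpha = 0.1                                 # aim for 90% coverage
scores = np.abs(y_calib - model.predict(X_calib))  # conformity scores
n = len(scores)
q_hat = np.quantile(scores, min(np.ceil((n + 1) * (1 - target_alpha)) / n, 1.0))

preds = model.predict(X_new)
lower, upper = preds - q_hat, preds + q_hat        # prediction interval per row

The appeal is that the marginal coverage guarantee only needs exchangeability of the data, not a correctly specified or unpenalised model.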

@stanmart
Collaborator

Agreed. While implementing confidence intervals for the predicted mean should be relatively straightforward, I'm not sure how useful they would be for the reasons @MatthiasSchmidtblaicherQC mentioned. For prediction tasks, people are usually more interested in

  • penalized models, for which we could not compute these confidence intervals
  • prediction intervals, but those are kind of tricky to compute for glms, and don't always make much sense (e.g. for binary outcomes)

OTOH, if users would find this feature useful despite these limitations, then it seems like a good addition for some future release.

Also, @enriquecardenas24, thanks so much for posting the code snippets! Those should be super useful for others looking for this feature. Please keep the thread updated if you make more progress.

@enriquecardenas24
Author

Here's something I've been working on, based on statsmodels v0.13.5.

It's missing a few features present in the statsmodels generalized_linear_model.py source for the get_prediction function and its subfunctions (Prediction results class 1 and class 2), but it works for my purposes (I may edit later if I'm able to find the supporting glum functions). Looking at these links, the functions are a bit different in the current v0.15, but I believe v0.13.5 should be more or less the same.

There are submodules used in the original version which I can post, but the basic process is as follows:

import numpy as np
import pandas as pd
from scipy import stats
from patsy import dmatrix

import glum

def get_pred_github(
        model: glum.GeneralizedLinearRegressor,
        input_dataset: pd.DataFrame,
        alpha: float = 0.05,
):
        
    # Get exog data.
    # A bit of magic is done here to get an array version of the data with dummy columns.
    exog_data, pred_data = get_exog_pred_data(input_dataset)
    # Can expand on this process if necessary, but the essence of this is:
    # nrows = num rows in input_dataset; ncols = num rows in model.coef_table() minus 1 (intercept).
    # This is done for categorical variables that may not have all data present.
    # Ex: If input_dataset is a subset of the entire dataset, then for a given categorical field in the dataset,
    # tabmat.from_pandas(df = input_dataset) doesn't capture all categories if any are missing in input_dataset.
    # pred_data is the same as exog data, except the reference for the control is included.

    # Record predictions.
    wts, offset = input_dataset['Weight'], input_dataset['Offset']
    preds = model.predict(pred_data, wts, offset)
    
    # Get prediction errors and CIs for each prediction.
    # Create design matrix using model design info and dataset.
    cols = model.coef_table().index[1:]    # ignore intercept idx
    exog = pd.DataFrame(exog_data, columns = cols)
    formula = ' + '.join(cols)
    exog = np.atleast_2d(np.asarray(dmatrix(formula, exog)))
    # Get covariance parameters, predicted means, variances of predicted means.
    covb = model.covariance_matrix_ # make sure model feature names are set before doing this
    predicted_mean = np.log(np.array(preds))    # for log link
    var_pred_mean = (exog * np.dot(covb, exog.T).T).sum(1)
    
    # Other parameters from statsmodels to be appended.
    #exposure=offset=row_labels=None
    #transform=True    
    #pred_kwds = {'exposure': exposure, 'offset': offset, 'linear': True}
    
    # Get standard errors and confidence intervals.
    #se = self.se_obs if obs else self.se_mean # obs=False always, in my case
    se = np.sqrt(var_pred_mean)
    dist_args = ()
    q = stats.norm.ppf(1 - alpha / 2., *dist_args)
    lower = predicted_mean - q * se
    upper = predicted_mean + q * se
    cis = np.column_stack((lower, upper))
    ci_mean = np.exp(cis)   # for log link; need way to do self.link.inverse(cis)
    
    summary_frame = pd.DataFrame()
    summary_frame['mean_se'] = se
    summary_frame['mean_ci_lower'] = ci_mean[:, 0]
    summary_frame['mean_ci_upper'] = ci_mean[:, 1]
    
    return summary_frame
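A hypothetical call, where df_test is assumed to contain the model's feature columns plus the 'Weight' and 'Offset' columns read inside the function:

summary = get_pred_github(model, df_test, alpha=0.05)
print(summary[['mean_se', 'mean_ci_lower', 'mean_ci_upper']].head())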

Here's a snapshot of the exog_data array, which contains all of the data in the input_dataset with dummy columns for categorical levels:
[screenshot: exog_data array with dummy columns for each categorical level]

The first few columns are levels of the categorical field "Year," hence the 0s. Only in pred_data is the base level of the Year filled with 1s.
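For completeness, one way to guarantee a dummy column for every level even when input_dataset is a subset is to cast the column to a pandas Categorical with the full list of levels before encoding (a toy sketch with a hypothetical Year column, not the actual get_exog_pred_data code):

import pandas as pd

all_years = [2018, 2019, 2020, 2021]            # hypothetical full set of levels
subset = pd.DataFrame({"Year": [2019, 2021]})   # subset missing some levels

subset["Year"] = pd.Categorical(subset["Year"], categories=all_years)
dummies = pd.get_dummies(subset["Year"], prefix="Year")
print(list(dummies.columns))
# ['Year_2018', 'Year_2019', 'Year_2020', 'Year_2021'] -- every level gets a column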
