
Linear separability diagnostics? #9

Open
johnmyleswhite opened this issue Jan 11, 2013 · 5 comments

Comments

@johnmyleswhite
Member

One thing I'd really like is for Julia to tell the user when the data is linearly separable under a logistic model. This could be done by having calls to glm for logistic models terminate with a call to predict, checking whether any responses are mispredicted. If none are, it would be nice to output a message noting the possible separation.
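The post-fit check described above can be sketched as follows (in Python for illustration, not GLM.jl's API; the function name and tolerance are assumptions). The idea: after fitting, if every fitted probability lands strictly on the correct side of 1/2, no training response is mispredicted, which is the symptom of (quasi-)complete separation worth warning about:

```python
import numpy as np

def perfectly_predicted(y, p, tol=1e-8):
    """Return True when every fitted probability lies strictly on the
    correct side of 1/2, i.e. no training response is mispredicted.
    A heuristic post-fit signal of possible separation, not a proof."""
    y = np.asarray(y, dtype=bool)
    p = np.asarray(p, dtype=float)
    # y == 1 must get p > 0.5, y == 0 must get p < 0.5
    return bool(np.all(np.where(y, p > 0.5 + tol, p < 0.5 - tol)))

# Separation pushes fitted probabilities toward 0 and 1
print(perfectly_predicted([0, 0, 1, 1], [0.01, 0.02, 0.98, 0.99]))  # True
# Overlapping data: at least one response lands on the wrong side
print(perfectly_predicted([0, 0, 1, 1], [0.30, 0.60, 0.45, 0.90]))  # False
```

Perfect in-sample classification is implied by complete separation but can also occur by chance in small samples, so this is a cheap trigger for a warning rather than a definitive test.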

@Nosferican
Contributor

Kjell Konis's 2007 thesis surveys various practical methods and explains the approach taken in R's safeBinaryRegression.
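The detection approach in that line of work casts separation as a linear-programming feasibility question: the data are (quasi-)separated iff some nonzero b satisfies (2y−1)·(Xb) ≥ 0 with at least one strict inequality. A minimal sketch using SciPy (an illustrative stand-in, not safeBinaryRegression's actual code):

```python
import numpy as np
from scipy.optimize import linprog

def is_separable(X, y):
    """LP-based separation check: maximize the total margin sum_i z_i . b
    subject to 0 <= z_i . b <= 1, where z_i = (2*y_i - 1) * x_i.
    Optimum 0  -> only the trivial b = 0 works, no separating direction.
    Optimum > 0 -> a (quasi-)separating direction exists."""
    Z = (2 * np.asarray(y) - 1)[:, None] * np.asarray(X, dtype=float)
    n, k = Z.shape
    res = linprog(c=-Z.sum(axis=0),                 # maximize total margin
                  A_ub=np.vstack([-Z, Z]),          # Z b >= 0  and  Z b <= 1
                  b_ub=np.concatenate([np.zeros(n), np.ones(n)]),
                  bounds=[(None, None)] * k, method="highs")
    return bool(res.status == 0 and -res.fun > 1e-7)

X = np.column_stack([np.ones(4), [-2.0, -1.0, 1.0, 2.0]])
print(is_separable(X, [0, 0, 1, 1]))  # True: x separates the responses
print(is_separable(X, [0, 1, 0, 1]))  # False: the classes overlap in x
```

The upper bound Zb ≤ 1 only normalizes the problem so the LP stays bounded; the cost is one LP solve, independent of the IRLS iterations.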

@andreasnoack
Member

Thanks for the reference. I think that, ideally, the check could be a post-processing function, potentially run as part of the coeftable function.

@Nosferican
Contributor

I thought the main consideration would be to have the detection work during the fitting process and deal with it there (e.g., drop covariates, drop observations, issue a warning, stop the iteration early, etc.). This is the approach Stata takes: it sequentially drops covariates / observations until the separability disappears, and if that isn't possible it issues an error that there are no valid observations.
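The sequential-dropping idea can be sketched roughly as follows (Python for illustration; this is a simplified caricature of Stata's behavior that only checks dummy columns on their x == 1 cells, and all names here are hypothetical):

```python
import numpy as np

def drop_perfect_predictors(X, y, names):
    """Whenever a 0/1 dummy column perfectly predicts the outcome where it
    equals 1, drop that column and the observations it determines, then
    re-check, until no such column remains. A rough sketch of the
    sequential-dropping strategy, not Stata's actual algorithm."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    names = list(names)
    changed = True
    while changed and X.shape[0] > 0:
        changed = False
        for j in range(X.shape[1]):
            col = X[:, j]
            mask = col == 1
            if (set(np.unique(col)) <= {0.0, 1.0} and mask.any()
                    and y[mask].min() == y[mask].max()):
                # column j predicts y perfectly on the rows where it is 1:
                # drop those rows and the column, then start over
                keep = ~mask
                X = np.delete(X[keep], j, axis=1)
                y = y[keep]
                names.pop(j)
                changed = True
                break
    return X, y, names

X = [[1, 0], [1, 1], [0, 1], [0, 1], [0, 0]]
y = [1, 1, 0, 1, 0]
X2, y2, kept = drop_perfect_predictors(X, y, ["d1", "d2"])
print(kept, len(y2))  # d1 is dropped along with the rows it determines
```

If the loop empties the sample entirely, that corresponds to the "no valid observations" error mentioned above.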

@andreasnoack
Member

I wouldn't be in favor of too much magic happening automatically; I'd rather provide the tools to diagnose this and let the user adjust the model. I also wouldn't be in favor of slowing down the fitting procedure: you might only be interested in prediction, or in parameters not affected by the separation.

@Nosferican
Contributor

The methods outlined take the additional computational expense into consideration. I recently implemented O'Leary's (1990) IRLS QR Newton method (which might be one of the DenseQR methods here?) while developing a few routines missing in GLM, so I could use it to verify the computational cost of adding those checks. It would not apply to all models, only those that are "unsafe", but I agree that in this case warnings might be preferable to an unspecified handling method. Linear separability seems trickier than a non-full-rank matrix, which I am totally fine with automatically making full rank and letting the user know about.

As for development, I think the safe-binary algorithms could be developed in a separate package and used in GLM. It might also be nice to move the IRLS methods to a solver package and call them from GLM; those could be optimized for the Dense, Sparse, Mixed, and Distributed cases (see the Kane and Lewis working notes). I mention this since StatsModels is moving to allow other tabular data packages with capabilities different from DataFrames (Slack#Data). If this is something to consider, I can move that discussion to a different issue to keep this one focused on linear separability.
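For readers unfamiliar with the IRLS-with-QR step mentioned here, a minimal sketch (in Python with NumPy for illustration; the helper name, iteration count, and clipping are assumptions, not GLM's implementation): each Newton step for the logistic log-likelihood is a weighted least-squares solve, done here through a QR factorization of the weighted design matrix in the spirit of O'Leary (1990):

```python
import numpy as np

def logistic_irls_qr(X, y, iters=25):
    """IRLS for logistic regression where each weighted least-squares
    step is solved through a QR factorization. Assumes the data are
    NOT separated, so the maximum-likelihood estimate exists."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        eta = X @ beta
        p = 1.0 / (1.0 + np.exp(-eta))
        w = np.clip(p * (1.0 - p), 1e-10, None)   # IRLS weights
        z = eta + (y - p) / w                     # working response
        sw = np.sqrt(w)
        # Weighted LS step via QR: min || sqrt(W) (z - X beta) ||
        Q, R = np.linalg.qr(sw[:, None] * X)
        beta = np.linalg.solve(R, Q.T @ (sw * z))
    return beta

x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0])  # overlapping classes
X = np.column_stack([np.ones_like(x), x])
beta = logistic_irls_qr(X, y)
p = 1.0 / (1.0 + np.exp(-(X @ beta)))
print(np.max(np.abs(X.T @ (y - p))))  # score equations ~ 0 at the MLE
```

Under separation the same loop would push the coefficients toward infinity instead of converging, which is exactly why the detection question above matters for this solver.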
