Incorrect R-squared values for lasso models #2

Closed
dhimmel opened this issue Jan 24, 2016 · 2 comments


dhimmel commented Jan 24, 2016

The glmnet upgrade to version 2 introduced a bug where the methods package is not properly loaded. In the course of diagnosing that issue, I discovered a second issue which was brought to light by the upgrade. I briefly mentioned the second issue before knowing its cause:

However, if I run the analysis by launching an R session from the project's root directory and then run source('./code/run.R'), the code progresses past create-models.R before another error occurs.

Now, I have tracked down the cause. We were improperly computing R² values for our lasso models. I corrected the faulty code after evaluating several methods for the R² computation.

Prior to the fix, we were extracting R² values directly from a cv.glmnet object. This class is poorly documented and the glmnet vignette now cautions:

We do not encourage users to extract the components directly except for viewing the selected values of λ.

In essence, we were reporting the R² of a model fit at a λ that was evaluated during cross-validation, rather than the R² of the model at the optimal λ, as we intended. Our faulty extraction method started throwing an error after a glmnet update that brought a:

Major upgrade to CV; let each model use its own lambdas, then predict at original set.
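
To make the distinction concrete, here is a minimal sketch in Python. It is not the R/glmnet code used in this project and not the fix from 65f4d8c. It fits a toy one-predictor lasso in closed form and computes R² as 1 − SS_res/SS_tot from predictions at each λ. The data and the helper names (`r_squared`, `soft_threshold`, `lasso_1d_predict`) are hypothetical. The only point is that R² depends on which λ the model is taken at, so reading it off at an unintended λ reports a different number than the intended model's.

```python
def r_squared(observed, predicted):
    """R-squared as 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


def soft_threshold(value, lam):
    """Shrink value toward zero by lam, clipping at zero."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def lasso_1d_predict(x, y, lam):
    """Fitted values of a one-predictor lasso with an unpenalized intercept.

    Minimizes (1 / (2n)) * sum((y - b0 - b * x)^2) + lam * |b|,
    which has the closed form b = soft_threshold(cov(x, y), lam) / var(x).
    """
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    x_centered = [v - x_mean for v in x]
    y_centered = [v - y_mean for v in y]
    cov = sum(a * b for a, b in zip(x_centered, y_centered)) / n
    var = sum(a * a for a in x_centered) / n
    beta = soft_threshold(cov, lam) / var
    return [y_mean + beta * a for a in x_centered]


# Toy data, made up for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1]

# R-squared of the same model family at three different penalties.
r2_by_lambda = {
    lam: r_squared(y, lasso_1d_predict(x, y, lam)) for lam in (0.0, 0.5, 1.0)
}

for lam, r2 in r2_by_lambda.items():
    print(f"lambda={lam}: R2={r2:.3f}")
```

Computing R² from the predictions of the model at the chosen λ, rather than pulling a stored value out of a cross-validation object, keeps the reported number tied to the model that is actually being described.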

We will keep this thread updated with information on this issue.

dhimmel added a commit that referenced this issue Jan 24, 2016
Previous results were based on an incorrect method for retrieving
the R-squared of lasso models. See #2 for more detail.

After evaluating several methods (see https://gist.github.com/dhimmel/588d64a73fa4fef02c8f/a256479897a1a9bc63b5b7985df1e1b2ad8fd1e8),
a fix was implemented in 65f4d8c.
Analysis was rerun with fix. Only files that meaningfully changed
were committed.

dhimmel commented Jan 24, 2016

Corrected R² values

I updated our analysis with the correct lasso R² values. The old (incorrect) and new (correct) values are:

| Cancer | Old Lasso R² | New Lasso R² |
| --- | --- | --- |
| Lung | 67.1% | 68.9% |
| Breast | 51.3% | 54.5% |
| Colorectal | 27.4% | 31.9% |
| Prostate | 7.8% | 15.0% |

For all four cancers, the faulty method underestimated the lasso R². The underestimation was smallest for lung cancer and largest for prostate cancer. The new values are more concordant with the best-subset R² values. As expected, the best-subset values are still higher, but the discrepancy is now smaller.
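
As a quick check of the comparison above, the following Python sketch computes the size of the underestimate for each cancer in percentage points. The numbers are copied from the table; the variable names are illustrative only.

```python
# (old, new) lasso R-squared values in percent, copied from the table above.
lasso_r2 = {
    "Lung": (67.1, 68.9),
    "Breast": (51.3, 54.5),
    "Colorectal": (27.4, 31.9),
    "Prostate": (7.8, 15.0),
}

# Underestimate in percentage points; rounded to hide floating-point noise.
underestimate = {
    cancer: round(new - old, 1) for cancer, (old, new) in lasso_r2.items()
}

smallest = min(underestimate, key=underestimate.get)
largest = max(underestimate, key=underestimate.get)

for cancer, diff in underestimate.items():
    print(f"{cancer}: +{diff} percentage points")
print(f"smallest: {smallest}, largest: {largest}")
```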

The conclusions of our study are not affected by this change. To contextualize the change, the old values suggested more overfitting by the best-subset approach than the new values do. However, the main conclusions we drew from the lasso approach were based on the models themselves, which were not affected by this issue. Essentially, the lasso models now appear to explain slightly more variation in cancer incidence.

Errors in the publication

Accordingly, the following paragraph of the paper has errors. The lasso values (bolded in the original: 67%, 51%, 29%, and 9%) should be replaced with 69%, 55%, 32%, and 15%, respectively:

The lasso (and best subset) models explained 67% (70%) of variation in lung cancer incidence, 51% (57%) in breast, 29% (34%) in colorectal, and 9% (19%) in prostate, (Tables 3 and 2) mirroring a previously described trend in fraction of risk attributable to modifiable factors for each of the four cancers (Danaei et al., 2005).

In addition, the R² column of Table 3 should be updated according to the above table.

You may notice that the old lasso R² values for the colorectal and prostate models differ slightly between the paragraph and the table. The table contains the values that the faulty method actually produced; the two paragraph values were not properly updated in the manuscript text at 4352de6.


dhimmel commented Jan 26, 2016

Comments added on PeerJ

I added comments on the online PeerJ article using the questions feature. Now both Table 3 and Paragraph 37 reference the inaccuracy and link to this issue.

dhimmel closed this as completed Jan 26, 2016