Incorrect R-squared values for lasso models #2

dhimmel · 2016-01-24T03:27:26Z

The glmnet upgrade to version 2 introduced a bug where the methods package is not properly loaded. In the course of diagnosing that issue, I discovered a second issue which was brought to light by the upgrade. I briefly mentioned the second issue before knowing its cause:

However, if I run the analysis by launching an R session from the project's root directory and then run source('./code/run.R'), the code progresses past create-models.R before another error occurs.

Now, I have tracked down the cause. We were improperly computing R² values for our lasso models. I corrected the faulty code after evaluating several methods for the R² computation.

Prior to the fix, we were extracting R² values directly from a cv.glmnet object. This class is poorly documented, and the glmnet vignette now cautions:

We do not encourage users to extract the components directly except for viewing the selected values of λ.

So essentially, we were reporting an R² for a model based on a λ evaluated during cross-validation, but not the model with the optimal λ that we intended. Our faulty method for extracting R² started throwing an error due to a glmnet update that brought a:

Major upgrade to CV; let each model use its own lambdas, then predict at original set.
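For readers less familiar with the quantity at stake: under one common definition, R² is one minus the ratio of the residual sum of squares to the total sum of squares, computed from a model's predictions. The Python sketch below only illustrates that definition. It is not the fix implemented in this repository (the analysis itself is R code using glmnet), and the function name `r_squared` and the toy numbers are hypothetical. The relevant point is that the predictions passed in should come from the model at the intended λ, rather than from values stored inside the cross-validation object.

```python
def r_squared(observed, predicted):
    """Return 1 - SS_res / SS_tot for paired observed and predicted values."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    mean_observed = sum(observed) / len(observed)
    # Residual sum of squares: variation the predictions fail to explain.
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Total sum of squares: variation around the mean of the observations.
    ss_tot = sum((o - mean_observed) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


# Hypothetical toy data, not taken from the study.
observed = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared(observed, [1.0, 2.0, 3.0, 4.0])   # predictions match exactly
baseline = r_squared(observed, [2.5, 2.5, 2.5, 2.5])  # always predict the mean
print(perfect, baseline)
```

A model that predicts perfectly scores 1, and a model that always predicts the mean scores 0, so an R² computed from the wrong model's predictions can differ substantially from the intended one.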

We will keep this thread updated with information on this issue.

Previous results were based on an incorrect method for retrieving the R-squared of lasso models. See #2 for more detail. After evaluating several methods (see https://gist.github.com/dhimmel/588d64a73fa4fef02c8f/a256479897a1a9bc63b5b7985df1e1b2ad8fd1e8), a fix was implemented in 65f4d8c. Analysis was rerun with fix. Only files that meaningfully changed were committed.

dhimmel · 2016-01-24T04:17:55Z

Corrected R² values

I updated our analysis with the correct lasso R² values. The old (incorrect) and new (correct) values are:

Cancer	Old Lasso R²	New Lasso R²
Lung	67.1%	68.9%
Breast	51.3%	54.5%
Colorectal	27.4%	31.9%
Prostate	7.8%	15.0%

For all four cancers, the faulty method underestimated the lasso R². The underestimation was minimal for lung cancer and largest for prostate cancer. The new values are more concordant with the best-subset R² values. As expected, the best-subset values are still higher, but now the discrepancy is smaller.
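The size of each change can be checked directly from the table above. A small Python sketch (variable names are illustrative; the values are copied from the table) computes the increase in percentage points for each cancer:

```python
# Old and new lasso R² values (percent), copied from the table above.
lasso_r2 = {
    "Lung": (67.1, 68.9),
    "Breast": (51.3, 54.5),
    "Colorectal": (27.4, 31.9),
    "Prostate": (7.8, 15.0),
}

# Increase in percentage points, rounded to one decimal to hide float noise.
increase = {
    cancer: round(new - old, 1) for cancer, (old, new) in lasso_r2.items()
}
smallest = min(increase, key=increase.get)
largest = max(increase, key=increase.get)

for cancer, delta in increase.items():
    print(f"{cancer}: +{delta} percentage points")
print("smallest change:", smallest, "| largest change:", largest)
```

The increases are 1.8, 3.2, 4.5, and 7.2 percentage points for lung, breast, colorectal, and prostate cancer, respectively.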

The conclusions of our study are not affected by this change. To contextualize the change, the old values suggested that the best-subset approach overfit more than the new values indicate. However, the main conclusions we drew from the lasso approach were based on the models themselves, which were not affected by this issue. Essentially, the lasso models now appear to explain slightly more variation in cancer incidence.

Errors in the publication

Accordingly, the following paragraph of the paper has errors. The lasso values (67%, 51%, 29%, and 9%; bolded in the original) should be replaced with 69%, 55%, 32%, and 15%, respectively:

The lasso (and best subset) models explained 67% (70%) of variation in lung cancer incidence, 51% (57%) in breast, 29% (34%) in colorectal, and 9% (19%) in prostate, (Tables 3 and 2) mirroring a previously described trend in fraction of risk attributable to modifiable factors for each of the four cancers (Danaei et al., 2005).

In addition, the R² column of Table 3 should be updated according to the above table.

You may notice that the old lasso R² values for the colorectal and prostate models differ slightly between the paragraph and the table. The table contains the values that the faulty method actually produced; the two paragraph values were not properly updated in the manuscript text at 4352de6.

dhimmel · 2016-01-26T18:19:41Z

Comments added on PeerJ

I added comments on the online PeerJ article using the questions feature. Now both Table 3 and Paragraph 37 reference the inaccuracy and link to this issue.

dhimmel closed this as completed Jan 26, 2016
