
Comets 1.3 - Models adjusted for race (when analyses are stratified by BMI) returns errors #32

Closed
steven-moore opened this issue Mar 1, 2018 · 22 comments

@steven-moore

In the test file I had prepared, the N for non-white/European persons was quite small; in fact, only 1 individual had race_grp=2. This seems to be causing all kinds of problems in the adjusted/stratified analyses.

To test, run in interactive mode:

Exposure: Age
Outcome: Any individual metabolite
Adjusted covariates: race_grp
Strata by: BMI_grp

Two of the three values returned will be NA. Possibly this reflects a degrees-of-freedom issue?
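
For illustration, a standalone R sketch of the suspected mechanism, using synthetic data and made-up names rather than COMETS code: the lone race_grp=2 subject falls into a single BMI stratum, so within the other strata the race_grp2 dummy is constant and the stratum-specific model is rank-deficient.

```r
# Synthetic illustration (not COMETS code): a single race_grp=2 subject,
# located in BMI stratum 3, leaves the race_grp2 dummy all-zero in strata
# 1 and 2, so lm() returns NA for that coefficient there.
set.seed(42)
n <- 30
d <- data.frame(
  metab   = rnorm(n),
  age     = rnorm(n, mean = 60, sd = 5),
  bmi_grp = rep(1:3, each = n / 3)
)
d$race_grp2 <- c(rep(0, n - 1), 1)  # the lone race_grp=2 subject (stratum 3)

for (s in 1:3) {
  ds  <- d[d$bmi_grp == s, ]
  fit <- lm(metab ~ age + race_grp2, data = ds)
  print(coef(fit)["race_grp2"])     # NA in strata 1 and 2; a value in stratum 3
}
```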

Input file is below.

cometsInput_March_2018.xlsx

@wobenshain (Contributor)

I don't know whether it is relevant, but hadn't you decided not to allow stratification on groups smaller than 15?

@steven-moore (Author)

In the example above, we're adjusting for the race-group, rather than stratifying by it. However, I will test stratifying by race-group all the same.

@steven-moore (Author) commented Mar 1, 2018

To further clarify, this is an important issue because we are asking all cohorts to use standardized coding, so many (perhaps most) will have at least one covariate where data are thin at one level.

In the latest runs in batch mode, this issue may have caused missing tables, resulting in severe problems further downstream. Ewy, Ella: perhaps we can discuss on the Data Harmonization call?

One further addition: technically, it seems we should be able to estimate the partial correlations. I just ran this in SAS and received valid output, which I pasted into the Word doc below as an example.

SAS output for partial corrs.docx

@ewymathe (Contributor) commented Mar 1, 2018

I'd like to confirm that the issue has to do with the fact that one of the race groups contains only a single individual, so its dummy variable is nearly constant.
When I run the model "Age.2.5 Heart-disease stratified", I get the following dummy variables (the first part is the head of the dummy matrix, the second part is the full race_grp2 vector):

```
> corrmatrix <- calcCorr(modeldata, exmetabdata, "DPP")
[1] "running adjusted"
     smk_grp1 smk_grp2 smk_grp3 bmi_grp2 bmi_grp3 bmi_grp4 race_grp1 race_grp2
[1,]        1        0        0        0        0        0         0         0
[2,]        0        0        0        0        0        0         1         0
[3,]        0        0        0        0        0        0         0         0
[4,]        1        0        0        1        0        0         0         0
[5,]        0        0        0        0        1        0         0         0
[6,]        0        0        0        1        0        0         0         0
     educ_grp1 educ_grp2 educ_grp3 educ_grp4 alc_grp1 alc_grp2 alc_grp3
[1,]         0         1         0         0        0        1        0
[2,]         0         0         1         0        1        0        0
[3,]         0         0         1         0        0        0        0
[4,]         0         1         0         0        1        0        0
[5,]         1         0         0         0        1        0        0
[6,]         1         0         0         0        0        0        0
     multivitamin1 multivitamin2 horm_curr1 horm_curr2 nested_case1 age
[1,]             0             0          0          0            1  66
[2,]             1             0          0          1            1  61
[3,]             0             0          0          0            1  71
[4,]             1             0          0          0            1  65
[5,]             1             0          1          0            1  56
[6,]             0             0          0          0            1  59
[1] "Race_grp2 dummy"
   [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
   ...
 [815] 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
   ...
[1000] 0
```

(The full vector has 1,000 entries, all 0 except the single 1 visible in the row beginning at index 815.)

R gets caught in an infinite loop when this happens, so we must check that each covariate has enough values.

@ewymathe (Contributor) commented Mar 1, 2018

Can I assume that SAS runs the partial correlations after removing the variable with low numbers?

@steven-moore (Author) commented Mar 1, 2018 via email

@ewymathe (Contributor) commented Mar 1, 2018

With that in mind, can I implement a simple filter that removes dummy variables corresponding to fewer than x samples, with x set to 5 by default?
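
A minimal sketch of such a filter (the function name and the exact boundary on x are illustrative, not package code):

```r
# Drop any 0/1 dummy column whose less-frequent value occurs in x or fewer
# samples, so a dummy with a single 1 (like race_grp2 above) never enters
# the model. Whether the cutoff should be "< x" or "<= x" is the open
# question in this comment.
dropSparseDummies <- function(dummies, x = 5) {
  keep <- apply(dummies, 2, function(col) {
    counts <- table(col)
    length(counts) > 1 && min(counts) > x
  })
  dummies[, keep, drop = FALSE]
}
```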

@ellatemprosa (Collaborator)

Here's an alternative calculation of the adjusted Spearman rank correlation; I will test it: http://onlinelibrary.wiley.com/doi/10.1111/biom.12812/epdf

@ewymathe (Contributor) commented Mar 1, 2018

And there's an R package :o) Nice find!
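
For reference, the method in the linked paper appears to be implemented in the PResiduals package as partial_Spearman(); a hedged sketch, with the formula interface assumed from that package's documentation, the variable names taken from this thread's example, and mydata as a placeholder:

```r
# Hedged sketch: covariate-adjusted Spearman correlation via
# probability-scale residuals. The two variables of interest go on the
# left of "|"; the adjustment covariates follow "~".
# install.packages("PResiduals")
library(PResiduals)
fit <- partial_Spearman(metab | age ~ race_grp + smk_grp + educ_grp,
                        data = mydata)
fit
```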

@steven-moore (Author)

Just be careful--that's a big change to make. We don't know yet what the downstream implications are, especially for large files.

@steven-moore (Author)

And Ewy, you could just remove any dummy variables that, within a given stratum, have only one value, or possibly x values (though implementing based on x could be tricky: do you mean x distinct values, or x people sharing a value? The coding is different for each).

@ewymathe (Contributor) commented Mar 1, 2018

@steven-moore Yes, I could set x=1 and remove variables that have only one value. I was thinking I would remove dummy variables that have five or fewer occurrences of a given value. For example, there is only 1 sample with race=2; if there were 5 samples with race=2, I would remove that variable too.
I'll implement this shortly (x=1).

@ellatemprosa (Collaborator)

I am not sure this is the right way to handle this. I tested it on the raw code and it works.

@ewymathe (Contributor) commented Mar 2, 2018

@ellatemprosa What do you mean by "it works"? Does it automatically remove dummy variables whose values are all the same? That is, does it drop the category that has only one entry, so that the result from inputting all the data and the result from inputting the data minus that "one entry" are the same?

@steven-moore (Author)

So, what do you guys think is the right approach here?

@ellatemprosa (Collaborator)

Here's exactly what SAS is doing, described here: http://support.sas.com/documentation/cdl/en/procstat/67528/HTML/default/viewer.htm#procstat_corr_details.htm

If a variable does not meet the criterion, SAS drops it.

> After applying the Cholesky decomposition algorithm to each row associated with variables z, PROC CORR checks all higher-numbered diagonal elements associated with z for singularity. A variable is considered singular if the value of the corresponding diagonal element is less than ε times the original unpartialled corrected sum of squares of that variable. You can specify the singularity criterion ε by using the SINGULAR= option. For Pearson partial correlations, a controlling variable z is considered singular if the R² for predicting this variable from the variables that are already partialled out exceeds 1 − ε. When this happens, PROC CORR excludes the variable from the analysis. Similarly, a variable is considered singular if the R² for predicting this variable from the controlling variables exceeds 1 − ε. When this happens, its associated diagonal element and all higher-numbered elements in this row or column are set to zero.

I think this is the safe way, but it is crazy that SAS does not tell you which variables were dropped. In my testing of the example you sent, the correlation is just the unadjusted correlation.
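
A rough R transcription of that rule, as one possible reading of the quoted documentation (a sketch, not the package's code or SAS's exact algorithm):

```r
# SAS-like singularity screen: drop a controlling variable when the R^2 for
# predicting it from the variables already partialled out exceeds 1 - eps
# (eps playing the role of the SINGULAR= criterion).
screenControls <- function(Z, eps = 1e-8) {
  keep <- integer(0)
  for (j in seq_len(ncol(Z))) {
    if (var(Z[, j]) == 0) next                  # constant column: singular
    if (length(keep) > 0) {
      r2 <- summary(lm(Z[, j] ~ Z[, keep, drop = FALSE]))$r.squared
      if (r2 > 1 - eps) next                    # collinear with earlier z's
    }
    keep <- c(keep, j)
  }
  Z[, keep, drop = FALSE]
}
```

Unlike PROC CORR, a screen written this way could also report which variables were dropped, by comparing the column names before and after.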

@steven-moore (Author) commented Mar 5, 2018

OK, good to know. What do you think the pragmatic solution is here? Adopt a SAS-like approach for dropping variables? Or a similar but less complex approach? Or swap out the R package?

@steven-moore (Author)

We need a fix for this ASAP--can't go into production until this issue is eliminated. I am fine with a kludge or simple hack for now--like eliminating covariate categories if they have 5 or fewer observations and dropping them into the reference. We can always arrange for a more elegant fix in subsequent versions.

@steven-moore (Author)

> A matrix is singular when at least one variable can be expressed as an exact linear combination of some of the others.

So, when a variable has few observations and the model includes several other covariates, there is no mathematical means to distinguish their effects, and a mathematical solution here may be impossible. Ella, do let us know what you find, though.

Assuming that a math solution is impossible here, we will need to engineer a solution.

Currently, the model handles dummies where all values are the same (0 or 1) well: it drops the dummy, so no fix is needed in these instances. The issue is dropping dummy variables when there is just a smidgen of valid data. For example, a BMI 40+ category might contain only one person. Dropping the dummy variable, though, would result in this person being added to the reference category (18.5-24.9) rather than a more sensible category (BMI 35.0+).

A partial solution is to use broader categories, and fewer of them, which reduces the opportunities for singularity. But we can’t change the coding of categories, since this would result in a lack of comparability across studies (e.g. it's not great to have a category for BMI of 35+ in one cohort and BMI 40.0+ in another cohort--especially since we stratify on BMI). And yet, we still need a way to prevent the reference group from becoming contaminated. Any thoughts on this?
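
A toy illustration of the contamination point, with hypothetical category codes (this is not a proposed fix, just the mechanics): dropping the sparse dummy silently treats its subjects as the reference, whereas an explicit recode would merge them into the adjacent category.

```r
# Hypothetical coding: 1 = reference (18.5-24.9), ..., 4 = top BMI category
bmi_grp <- c(rep(1, 5), rep(2, 4), rep(3, 3), 4)   # one subject in category 4

# Implicit merge: dropping the bmi_grp4 dummy leaves that subject with all
# dummies 0, i.e. counted as reference (18.5-24.9).
# Explicit merge: recode 4 -> 3 so the subject stays at the top of the range.
bmi_grp_merged <- ifelse(bmi_grp == 4, 3, bmi_grp)
table(bmi_grp_merged)
```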

@steven-moore (Author)

Any further progress on this? We need to make a decision ASAP.

@ellatemprosa (Collaborator) commented Mar 14, 2018

Based on the attached testing of various packages and methods in SAS, the right packages are being invoked; we just need to fix the parameterization of the models. Specifically, we need to update the acovs for each stratum. The dummy can remain where it is, but it needs to be fixed so that we assume 0 for missing values. If we go this route, the lm and reg methods will produce y_predicted even for some subjects with missing x covariates; we need to screen them out, but the current packages invoked are good. I don't think it's right to lump categories together, because we are going to meta-analyze: we want to make sure the coding is consistent and does not depend on the n of the cohort, and lumping would not be defensible in the methods section. In our example, in BMI strata 1 and 2 we should specify race_grp1, and for BMI stratum 3 it should be race_grp1 and race_grp2 (see the sketch at the end of this comment).
marcorrv2.pdf
sasdummytest2.pdf
sasdummytest.pdf
testdummycode.pdf
marcorr.pdf

Update on SAS singularity: I confirmed, in the example of _1_2_DIPALMITOYLGLYCEROL for BMI group 2, that there is a warning for singularity on race_grp2 and that the variable was dropped from the analysis, so specifying race 1 versus race 1 and race 2 shows the same results.
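
A sketch of the per-stratum parameterization described above (acovs is the name used in this comment; the surrounding data layout is assumed): within each stratum, keep only the adjustment dummies that actually vary there.

```r
# Restrict the adjustment covariates to those that vary within a stratum,
# so BMI strata 1 and 2 adjust for race_grp1 only, while stratum 3 keeps
# both race_grp1 and race_grp2, as in the example above.
acovsForStratum <- function(df, acovs) {
  varies <- vapply(acovs, function(v) length(unique(df[[v]])) > 1, logical(1))
  acovs[varies]
}

# Usage sketch:
# for (s in unique(d$bmi_grp)) {
#   ds      <- d[d$bmi_grp == s, ]
#   acovs_s <- acovsForStratum(ds, c("race_grp1", "race_grp2"))
#   # ...fit the stratum model adjusting for acovs_s...
# }
```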

@kailingchen changed the title from "Models adjusted for race (when analyses are stratified by BMI) returns errors" to "Comets 1.3 - Models adjusted for race (when analyses are stratified by BMI) returns errors" on Mar 16, 2018
@steven-moore (Author)

The original issue as described at the top appears to have been resolved, so I will close this. However, the fix has created another issue, which I have described in issue #34.
