
Comets 1.3 - Models adjusted for race (when analyses are stratified by BMI) returns errors #32

Closed
steven-moore opened this issue Mar 1, 2018 · 22 comments

@steven-moore

In the test file I had prepared, the N for non-white/European persons was quite small; in fact, only 1 individual had race_grp=2. This seems to be causing all kinds of problems in the adjusted/stratified analyses.

To test, run in interactive mode:

Exposure: Age
Outcome: Any individual metabolite
Adjusted covariates: race_grp
Strata by: BMI_grp

Two of the three values returned will be NA. Possibly this reflects a degrees-of-freedom issue?
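
For illustration, a standalone R sketch of the suspected mechanism, using synthetic data and made-up names rather than COMETS code: the lone race_grp=2 subject falls into a single BMI stratum, so within the other strata the race_grp2 dummy is constant and the stratum-specific model is rank-deficient.

```r
# Synthetic illustration (not COMETS code): a single race_grp=2 subject,
# located in BMI stratum 3, leaves the race_grp2 dummy all-zero in strata
# 1 and 2, so lm() returns NA for that coefficient there.
set.seed(42)
n <- 30
d <- data.frame(
  metab   = rnorm(n),
  age     = rnorm(n, mean = 60, sd = 5),
  bmi_grp = rep(1:3, each = n / 3)
)
d$race_grp2 <- c(rep(0, n - 1), 1)  # the lone race_grp=2 subject (stratum 3)

for (s in 1:3) {
  ds  <- d[d$bmi_grp == s, ]
  fit <- lm(metab ~ age + race_grp2, data = ds)
  print(coef(fit)["race_grp2"])     # NA in strata 1 and 2; a value in stratum 3
}
```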

Input file is below.

cometsInput_March_2018.xlsx

@wobenshain (Contributor)

I don't know whether it is relevant, but hadn't you decided not to allow stratification on groups smaller than 15?

@steven-moore (Author)

In the example above, we're adjusting for the race-group, rather than stratifying by it. However, I will test stratifying by race-group all the same.

@steven-moore (Author) commented Mar 1, 2018

To further clarify, this is an important issue because we are asking all cohorts to use standardized coding, so many (perhaps most) will have at least one covariate where data are thin at one level.

In the latest runs in batch mode, this issue may have caused missing tables, resulting in severe problems further downstream. Ewy, Ella: perhaps we can discuss on the Data Harmonization call?

One further addition: technically, it seems we should be able to estimate the partial correlations. I just ran this in SAS and received valid output, which I pasted into the Word doc below as an example.

SAS output for partial corrs.docx

@ewymathe (Contributor) commented Mar 1, 2018

I'd like to confirm that the issue has to do with the fact that one of the race groups contains only a single individual, so its dummy variable is nearly constant.
When I run the model "Age.2.5 Heart-disease stratified", I get the following dummy variables (the first part is the head of the dummy matrix, the second part is the full race_grp2 vector):

```
> corrmatrix <- calcCorr(modeldata, exmetabdata, "DPP")
[1] "running adjusted"
     smk_grp1 smk_grp2 smk_grp3 bmi_grp2 bmi_grp3 bmi_grp4 race_grp1 race_grp2
[1,]        1        0        0        0        0        0         0         0
[2,]        0        0        0        0        0        0         1         0
[3,]        0        0        0        0        0        0         0         0
[4,]        1        0        0        1        0        0         0         0
[5,]        0        0        0        0        1        0         0         0
[6,]        0        0        0        1        0        0         0         0
     educ_grp1 educ_grp2 educ_grp3 educ_grp4 alc_grp1 alc_grp2 alc_grp3
[1,]         0         1         0         0        0        1        0
[2,]         0         0         1         0        1        0        0
[3,]         0         0         1         0        0        0        0
[4,]         0         1         0         0        1        0        0
[5,]         1         0         0         0        1        0        0
[6,]         1         0         0         0        0        0        0
     multivitamin1 multivitamin2 horm_curr1 horm_curr2 nested_case1 age
[1,]             0             0          0          0            1  66
[2,]             1             0          0          1            1  61
[3,]             0             0          0          0            1  71
[4,]             1             0          0          0            1  65
[5,]             1             0          1          0            1  56
[6,]             0             0          0          0            1  59
[1] "Race_grp2 dummy"
   [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
   ...
 [815] 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
   ...
[1000] 0
```

(The full vector has 1,000 entries, all 0 except the single 1 visible in the row beginning at index 815.)

R gets caught in an infinite loop when this happens, so we must check that each covariate has enough values.

@ewymathe (Contributor) commented Mar 1, 2018

Can I assume that SAS runs the partial correlations after removing the variable with low numbers?

@steven-moore (Author) commented Mar 1, 2018 via email

@ewymathe (Contributor) commented Mar 1, 2018

With that in mind, can I implement a simple filter that removes dummy variables corresponding to fewer than x samples, with x set to 5 by default?
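
A minimal sketch of such a filter (the function name and the exact boundary on x are illustrative, not package code):

```r
# Drop any 0/1 dummy column whose less-frequent value occurs in x or fewer
# samples, so a dummy with a single 1 (like race_grp2 above) never enters
# the model. Whether the cutoff should be "< x" or "<= x" is the open
# question in this comment.
dropSparseDummies <- function(dummies, x = 5) {
  keep <- apply(dummies, 2, function(col) {
    counts <- table(col)
    length(counts) > 1 && min(counts) > x
  })
  dummies[, keep, drop = FALSE]
}
```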

@ellatemprosa (Collaborator)

Here's an alternative calculation of the adjusted Spearman rank correlation; I will test it: http://onlinelibrary.wiley.com/doi/10.1111/biom.12812/epdf

@ewymathe (Contributor) commented Mar 1, 2018

And there's an R package :o) Nice find!
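
For reference, the method in the linked paper appears to be implemented in the PResiduals package as partial_Spearman(); a hedged sketch, with the formula interface assumed from that package's documentation, the variable names taken from this thread's example, and mydata as a placeholder:

```r
# Hedged sketch: covariate-adjusted Spearman correlation via
# probability-scale residuals. The two variables of interest go on the
# left of "|"; the adjustment covariates follow "~".
# install.packages("PResiduals")
library(PResiduals)
fit <- partial_Spearman(metab | age ~ race_grp + smk_grp + educ_grp,
                        data = mydata)
fit
```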

@steven-moore (Author)

Just be careful--that's a big change to make. We don't know yet what the downstream implications are, especially for large files.

@steven-moore (Author)

And Ewy, you could just remove any dummy variables that, within a given stratum, have only one value, or possibly x values (though implementing based on x could be tricky: do you mean x distinct values, or x people sharing a value? The coding is different for each).

@ewymathe (Contributor) commented Mar 1, 2018

@steven-moore Yes, I could set x=1 and remove variables that have only one value. I was thinking I would remove dummy variables that have five or fewer occurrences of a given value. For example, there is only 1 sample with race=2; if there were 5 samples with race=2, I would remove that variable too.
I'll implement this shortly (x=1).

@ellatemprosa (Collaborator)

I am not sure this is the right way to handle this. I tested it on the raw code and it works.

@ewymathe (Contributor) commented Mar 2, 2018

@ellatemprosa What do you mean by "it works"? Does it automatically remove dummy variables whose values are all the same? That is, does it drop the category that has only one entry, so that the result from inputting all the data and the result from inputting the data minus that "one entry" are the same?

@steven-moore (Author)

So, what do you guys think is the right approach here?

@ellatemprosa (Collaborator)

Here's exactly what SAS is doing, described here: http://support.sas.com/documentation/cdl/en/procstat/67528/HTML/default/viewer.htm#procstat_corr_details.htm

If a variable does not meet the criterion, SAS drops it.

> After applying the Cholesky decomposition algorithm to each row associated with variables z, PROC CORR checks all higher-numbered diagonal elements associated with z for singularity. A variable is considered singular if the value of the corresponding diagonal element is less than ε times the original unpartialled corrected sum of squares of that variable. You can specify the singularity criterion ε by using the SINGULAR= option. For Pearson partial correlations, a controlling variable z is considered singular if the R² for predicting this variable from the variables that are already partialled out exceeds 1 − ε. When this happens, PROC CORR excludes the variable from the analysis. Similarly, a variable is considered singular if the R² for predicting this variable from the controlling variables exceeds 1 − ε. When this happens, its associated diagonal element and all higher-numbered elements in this row or column are set to zero.

I think this is the safe way, but it is crazy that SAS does not tell you which variables were dropped. In my testing of the example you sent, the correlation is just the unadjusted correlation.
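
A rough R transcription of that rule, as one possible reading of the quoted documentation (a sketch, not the package's code or SAS's exact algorithm):

```r
# SAS-like singularity screen: drop a controlling variable when the R^2 for
# predicting it from the variables already partialled out exceeds 1 - eps
# (eps playing the role of the SINGULAR= criterion).
screenControls <- function(Z, eps = 1e-8) {
  keep <- integer(0)
  for (j in seq_len(ncol(Z))) {
    if (var(Z[, j]) == 0) next                  # constant column: singular
    if (length(keep) > 0) {
      r2 <- summary(lm(Z[, j] ~ Z[, keep, drop = FALSE]))$r.squared
      if (r2 > 1 - eps) next                    # collinear with earlier z's
    }
    keep <- c(keep, j)
  }
  Z[, keep, drop = FALSE]
}
```

Unlike PROC CORR, a screen written this way could also report which variables were dropped, by comparing the column names before and after.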

@steven-moore (Author) commented Mar 5, 2018

OK, good to know. What do you think the pragmatic solution is here? Adopt a SAS-like approach for dropping variables? Or a similar but less complex approach? Or swap out the R package?

@steven-moore (Author)

We need a fix for this ASAP--can't go into production until this issue is eliminated. I am fine with a kludge or simple hack for now--like eliminating covariate categories if they have 5 or fewer observations and dropping them into the reference. We can always arrange for a more elegant fix in subsequent versions.

@steven-moore (Author)

> A matrix is singular when at least one variable can be expressed as an exact linear combination of some of the others.

So, when a variable has few observations and the model includes several other covariates, there is no mathematical means to distinguish their effects, and a mathematical solution here may be impossible. Ella, do let us know what you find, though.

Assuming that a math solution is impossible here, we will need to engineer a solution.

Currently, the model handles dummies where all values are the same (0 or 1) well: it drops the dummy, so no fix is needed in these instances. The issue is dropping dummy variables when there is just a smidgen of valid data. For example, a BMI 40+ category might contain only one person. Dropping the dummy variable, though, would result in this person being added to the reference category (18.5-24.9) rather than a more sensible category (BMI 35.0+).

A partial solution is to use broader categories, and fewer of them, which reduces the opportunities for singularity. But we can’t change the coding of categories, since this would result in a lack of comparability across studies (e.g. it's not great to have a category for BMI of 35+ in one cohort and BMI 40.0+ in another cohort--especially since we stratify on BMI). And yet, we still need a way to prevent the reference group from becoming contaminated. Any thoughts on this?
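
A toy illustration of the contamination point, with hypothetical category codes (this is not a proposed fix, just the mechanics): dropping the sparse dummy silently treats its subjects as the reference, whereas an explicit recode would merge them into the adjacent category.

```r
# Hypothetical coding: 1 = reference (18.5-24.9), ..., 4 = top BMI category
bmi_grp <- c(rep(1, 5), rep(2, 4), rep(3, 3), 4)   # one subject in category 4

# Implicit merge: dropping the bmi_grp4 dummy leaves that subject with all
# dummies 0, i.e. counted as reference (18.5-24.9).
# Explicit merge: recode 4 -> 3 so the subject stays at the top of the range.
bmi_grp_merged <- ifelse(bmi_grp == 4, 3, bmi_grp)
table(bmi_grp_merged)
```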

@steven-moore (Author)

Any further progress on this? We need to make a decision ASAP.

@ellatemprosa (Collaborator) commented Mar 14, 2018

Based on the attached testing of various packages and methods in SAS, the right packages are being invoked; we just need to fix the parameterization of the models. Specifically, we need to update the acovs for each stratum. The dummy can remain where it is, but it needs to be fixed so that we assume 0 for missing values. If we go this route, the lm and reg methods will produce y_predicted even for some subjects with missing x covariates; we need to screen them out, but the current packages invoked are good. I don't think it's right to lump categories together, because we are going to meta-analyze: we want to make sure the coding is consistent and does not depend on the n of the cohort, and lumping would not be defensible in the methods section. In our example, in BMI strata 1 and 2 we should specify race_grp1, and for BMI stratum 3 it should be race_grp1 and race_grp2 (see the sketch at the end of this comment).
marcorrv2.pdf
sasdummytest2.pdf
sasdummytest.pdf
testdummycode.pdf
marcorr.pdf

Update on SAS singularity: I confirmed, in the example of _1_2_DIPALMITOYLGLYCEROL for BMI group 2, that there is a warning for singularity on race_grp2 and that the variable was dropped from the analysis, so specifying race 1 versus race 1 and race 2 shows the same results.
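
A sketch of the per-stratum parameterization described above (acovs is the name used in this comment; the surrounding data layout is assumed): within each stratum, keep only the adjustment dummies that actually vary there.

```r
# Restrict the adjustment covariates to those that vary within a stratum,
# so BMI strata 1 and 2 adjust for race_grp1 only, while stratum 3 keeps
# both race_grp1 and race_grp2, as in the example above.
acovsForStratum <- function(df, acovs) {
  varies <- vapply(acovs, function(v) length(unique(df[[v]])) > 1, logical(1))
  acovs[varies]
}

# Usage sketch:
# for (s in unique(d$bmi_grp)) {
#   ds      <- d[d$bmi_grp == s, ]
#   acovs_s <- acovsForStratum(ds, c("race_grp1", "race_grp2"))
#   # ...fit the stratum model adjusting for acovs_s...
# }
```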

@kailingchen changed the title from "Models adjusted for race (when analyses are stratified by BMI) returns errors" to "Comets 1.3 - Models adjusted for race (when analyses are stratified by BMI) returns errors" on Mar 16, 2018
@steven-moore (Author)

The original issue as described at the top appears to have been resolved, so I will close this. However, the fix has created another issue, which I have described in issue #34.
