Comets 1.3 - Models adjusted for race (when analyses are stratified by BMI) returns errors #32
I don't know that it is relevant, but hadn't you decided not to allow stratification on groups smaller than 15? |
In the example above, we're adjusting for the race-group, rather than stratifying by it. However, I will test stratifying by race-group all the same. |
To further clarify, this is an important issue because we are asking all cohorts to use standardized coding, so many (perhaps most) will have at least one covariate where data are thin at one level. In the latest runs in batch mode, this issue may have caused missing tables, resulting in severe problems further downstream. Ewy, Ella: perhaps we can discuss on the Data Harmonization call?

One further addition: technically, it seems like we should be able to estimate the partial correlations. I just ran in SAS and received valid output, which I pasted in the Word doc below as an example. |
I'd like to confirm that the issue has to do with the fact that one of the race-groups has only one unique value; the dummy column looks like

    [1,] 1 0 0 0 0 0 0 0

R gets caught in an infinite loop when this happens. So we must check that each covariable has enough values. |
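The pre-fit check described above could be sketched along these lines. This is illustrative Python rather than the package's actual R code, and the function and variable names are hypothetical:

```python
# Minimal sketch, assuming dummy covariables arrive as 0/1 vectors.
# A column with fewer than two distinct values can never be estimated,
# so it is flagged before the model is fit instead of trapping R.

def usable_covariables(design, min_unique=2):
    """Return names of columns with at least `min_unique` distinct values."""
    keep = []
    for name, column in design.items():
        if len(set(column)) >= min_unique:
            keep.append(name)
    return keep

design = {
    "race_grp1": [1, 0, 0, 1, 0],
    "race_grp2": [0, 0, 0, 0, 0],  # constant column: the problem case
}
print(usable_covariables(design))  # -> ['race_grp1']
```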
Can I assume that SAS runs the partial correlations after removing the variable with low numbers? |
Yes |
With that in mind, can I implement a simple filter that removes dummy variables that correspond to fewer than x samples, with x set to 5 by default? |
here's an alternative calculation of the adjusted spearman rank correlation, i will test it http://onlinelibrary.wiley.com/doi/10.1111/biom.12812/epdf |
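For reference, the general idea behind a covariate-adjusted rank correlation can be sketched as follows. This is a generic rank-residual partial correlation in Python, offered only as an illustration; it is not the estimator from the linked paper or its accompanying R package:

```python
import numpy as np

def rank_partial_corr(x, y, z):
    """Correlate the ranks of x and y after regressing out the ranks of a
    single covariate z (tie handling omitted for brevity). A generic
    rank-residual partial correlation, not the published estimator."""
    def ranks(a):
        a = np.asarray(a, dtype=float)
        r = np.empty_like(a)
        r[a.argsort()] = np.arange(1, len(a) + 1)
        return r

    rx, ry, rz = ranks(x), ranks(y), ranks(z)
    Z = np.column_stack([np.ones(len(rz)), rz])  # intercept + covariate ranks

    def resid(v):
        # Least-squares residuals of v on the covariate ranks.
        beta, *_ = np.linalg.lstsq(Z, v, rcond=None)
        return v - Z @ beta

    ex, ey = resid(rx), resid(ry)
    return float(ex @ ey / np.sqrt((ex @ ex) * (ey @ ey)))
```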
And there's an R package :o) Nice find! |
Just be careful--that's a big change to make. We don't know yet what the downstream implications are, especially for large files. |
And Ewy, you could just remove any dummy variables that, within a given stratum, have only one value, or possibly x values (though implementing based on x could be tricky--do you mean different values, or do you mean people with different values? The coding is different for each.). |
@steven-moore Yes, I could set x=1 and remove variables that have only one value. I was thinking that I would remove variables that had <= 5 occurrences of a given value. |
i am not sure this is the right way to handle this, i tested on raw code and it works |
@ellatemprosa what do you mean by it works? Does it automatically remove dummy variables that have the same values? Meaning it drops the category that only has one entry, and the result from inputting all the data and the result from inputting only the data minus the "one entry" is the same? |
So, what do you guys think is the right approach here? |
here's exactly what sas is doing, described here http://support.sas.com/documentation/cdl/en/procstat/67528/HTML/default/viewer.htm#procstat_corr_details.htm: after applying the Cholesky decomposition algorithm to each row associated with the variables, if a variable does not meet the singularity criterion, sas drops that variable. i think this is the safe way, but it is crazy that sas does not tell you which vars were dropped. in my testing of the example you sent, the correlation is just the unadjusted correlation |
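The drop-on-singularity behavior described above could be approximated like this. This greedy Gram-matrix rank check is only a Python stand-in for SAS's actual Cholesky sweep, and the function name and tolerance are assumptions:

```python
import numpy as np

def screen_singular(X, tol=1e-8):
    """Return indices of columns of X that survive a greedy screen:
    a column is dropped when it is a (near-)exact linear combination
    of the columns already kept."""
    X = np.asarray(X, dtype=float)
    keep = []
    for j in range(X.shape[1]):
        cols = X[:, keep + [j]]
        gram = cols.T @ cols
        # Keep column j only if the Gram matrix stays full rank,
        # i.e. column j adds independent information.
        if np.linalg.matrix_rank(gram, tol=tol * max(1.0, gram.max())) == gram.shape[0]:
            keep.append(j)
    return keep

X = np.array([[1, 1, 2],
              [1, 0, 1],
              [1, 1, 2],
              [1, 0, 1]])  # col 2 = col 0 + col 1, so it gets dropped
print(screen_singular(X))  # -> [0, 1]
```

Unlike SAS, a sketch like this can report exactly which columns were dropped, which addresses the complaint above.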
OK, good to know. What do you think the pragmatic solution is here? Adopt a SAS-like approach for dropping variables? Or a similar but less complex approach? Or swap out the R package? |
We need a fix for this ASAP--can't go into production until this issue is eliminated. I am fine with a kludge or simple hack for now--like eliminating covariate categories if they have 5 or fewer observations and dropping them into the reference. We can always arrange for a more elegant fix in subsequent versions. |
A matrix is singular when at least one variable can be expressed as an exact linear combination of some of the others. So, when a variable has few observations alongside several other covariates, models have no mathematical means to distinguish effects, and a purely mathematical solution here may be impossible. Ella, do let us know what you find, though.

Assuming that a math solution is impossible here, we will need to engineer a solution. Currently, the model handles dummies where all values are the same (0 or 1) well--it drops the dummy. So no fix is needed in these instances. The issue is dropping dummy variables when there is just a smidgen of valid data. For example, a BMI 40+ category might have only one person. Dropping the dummy variable, though, would result in this person being added to the reference category (18.5-24.9) rather than a more sensible category (BMI 35.0+).

A partial solution is to use broader categories, and fewer of them, which reduces the opportunities for singularity. But we can't change the coding of categories, since this would result in a lack of comparability across studies (e.g., it's not great to have a category for BMI 35+ in one cohort and BMI 40.0+ in another cohort, especially since we stratify on BMI). And yet, we still need a way to prevent the reference group from becoming contaminated. Any thoughts on this? |
Any further progress on this? We need to make a decision ASAP |
based on the attached testing of various packages and methods in sas, the right packages are being invoked; we just need to fix the parameterization of the models. specifically, we need to update the acovs for each stratum. the dummy can remain where it is but needs to be fixed so that we assume 0 for missing values. the lm and reg methods, if we go this route, will produce y_predicted even for some subjects with missing x covariates. we need to screen them out, but the current packages invoked are good. i don't think it's right to lump categories together, because we are going to meta-analyze. we want to make sure the coding is consistent and does not depend on the n of the cohort; otherwise it will not be defensible in the methods section. in our example, in bmi strata 1 and 2, we should specify race_grp1, and for bmi stratum 3, it should be race_grp1 and race_grp2.

update on sas singularity: |
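The stratum-specific specification suggested here could be sketched as follows. This is illustrative Python, and the helper name and the `min_n` threshold are assumptions rather than the package's actual API:

```python
# For each stratum, keep a dummy covariate only if both of its levels
# have at least `min_n` subjects within that stratum; otherwise the model
# for that stratum omits it, as with race_grp2 in bmi strata 1 and 2.

def covariates_for_stratum(rows, dummies, min_n=2):
    keep = []
    for d in dummies:
        ones = sum(r[d] for r in rows)
        if min_n <= ones <= len(rows) - min_n:
            keep.append(d)
    return keep

stratum1 = [  # race_grp2 has a single subject here, so it is dropped
    {"race_grp1": 1, "race_grp2": 0},
    {"race_grp1": 1, "race_grp2": 0},
    {"race_grp1": 0, "race_grp2": 1},
    {"race_grp1": 0, "race_grp2": 0},
    {"race_grp1": 0, "race_grp2": 0},
]
print(covariates_for_stratum(stratum1, ["race_grp1", "race_grp2"]))
# -> ['race_grp1']
```

Because the rule depends only on within-stratum counts, the category coding itself stays identical across cohorts, which keeps the meta-analysis defensible.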
The original issue as described at the top appears to have been resolved, so I will close this. However, the fix has created another issue, which I have described in issue #34. |
In the test file that I had prepared, the N for non-white/European persons was quite small. In fact, only 1 individual had a race_grp=2. This seems to be causing all kinds of problems in the adjusted/stratified analyses.
To test, run in interactive mode:
Exposure: Age
Outcome: Any individual metabolite
Adjusted covariates: race_grp
Strata by: BMI_grp
Two of the three values returned will be NA. Possibly this reflects a degrees-of-freedom issue?
Input file is below.
cometsInput_March_2018.xlsx