All SNPs are being dropped due to non-positive-semi-definiteness of omega #23
If there are negative diagonal elements in the LD score regression coefficients, that's going to torpedo your results. That might happen with a very small sample size and/or mean chi squared < 1. Mind including your full log with the logging turned up so that we can have a little more information?
2021-06-07 14:07:59,337 Reading in and running QC on LD Scores
Reading in summary statistics.
2021-06-07 14:07:59,568 Reading in ('SAS', 'T2D') sumstats file: ./type_2_diabetes_UKB_450_SAS.txt
2021-06-07 14:08:18,965 Filtered out 0 SNPs with "NO NAN" (Filters out SNPs with any NaN values in required columns {'FREQ', 'SNP', 'A2', 'BETA', 'CHR', 'BP', 'A1', 'SE', 'P'})
Filtered out 845055 SNPs in total (as the union of drops, this may be less than the total of all the per-filter drops)
2021-06-07 14:08:18,992 Reading in ('EUR', 'T2D') sumstats file: ./EUR_T2d.txt
2021-06-07 14:08:24,666 Filtered out 0 SNPs with "NO NAN" (Filters out SNPs with any NaN values in required columns {'FREQ', 'SNP', 'A2', 'BETA', 'CHR', 'BP', 'A1', 'SE', 'P'})
Filtered out 33022 SNPs in total (as the union of drops, this may be less than the total of all the per-filter drops)
Number of SNPS in initial intersection of all sources: 9350
Running LD Score regression.
Creating omega and sigma matrices.
Running main MAMA method.
2021-06-07 14:08:26,372 Population 0: ('SAS', 'T2D')
It looks like you're losing a ton of SNPs when the data sets are being intersected / merged (it gets pared down to less than 10k). Are you expecting such a small overlap? Any chance anything happened to the rsID columns of your data, or that they're somehow formatted very differently from each other, or something like that?
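One quick way to check is to compare the ID columns directly. A minimal pandas sketch (it assumes whitespace-delimited files with a SNP header column, using the file names from the log above; adjust to your data):

```python
import pandas as pd

# File paths taken from the log above; the SNP column name is assumed
sas = pd.read_csv("type_2_diabetes_UKB_450_SAS.txt", sep=r"\s+", usecols=["SNP"])
eur = pd.read_csv("EUR_T2d.txt", sep=r"\s+", usecols=["SNP"])

sas_ids = set(sas["SNP"].astype(str).str.strip().str.lower())
eur_ids = set(eur["SNP"].astype(str).str.strip().str.lower())

print(f"SAS SNPs: {len(sas_ids)}  EUR SNPs: {len(eur_ids)}  overlap: {len(sas_ids & eur_ids)}")

# Eyeball a few non-overlapping IDs to spot formatting differences
# (e.g. rsIDs vs. chr:pos:ref:alt style identifiers)
print(sorted(sas_ids - eur_ids)[:5])
print(sorted(eur_ids - sas_ids)[:5])
```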
On top of what Jon said, it looks like 50% of your SAS SNPs are being dropped due to allele frequency. You can use the allele frequency filter options to adjust that if it's too aggressive for your data.
Hi Grant,
The data is from exome sequencing of SAS and UKB cohorts and there is not much overlap for the common SNPs. I already filtered the EUR data for the common SNPs (AF > 0.05%). I can use the imputed data to see if the overlap increases between the data sets. Unfortunately there is no array data for our SAS group and I wanted to avoid imputing the SAS data.
Thanks,
Manav
That's tricky. MAMA essentially calculates the heritability and genetic correlation between the populations using LD score regression, and if there aren't enough SNPs to get reliable estimates, then you end up with crazy results like the ones you are getting. One option is to use the imputed SAS results for the MAMA step, but drop the imputed variants after the MAMA step. The imputed variants may be reliable enough for LD score regression even if you think the final results are suspect.
That said, you'd need to make it clear that is what you did when you write up your results, and reviewers may get nervous if it looks too unusual.
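If you go that route, the post-MAMA filtering can be as simple as subsetting the results to a list of directly sequenced variant IDs. A rough sketch; the file names, the exome variant list, and the output column name are placeholders, not anything MAMA produces under those names:

```python
import pandas as pd

# Placeholder file names -- adjust to your actual MAMA output and variant list
results = pd.read_csv("mama_results_SAS_T2D.txt", sep="\t")
with open("sas_exome_variants.txt") as fh:
    exome_snps = {line.strip() for line in fh}

# Keep only variants that were directly sequenced, i.e. drop the imputed ones
# that were included solely to stabilize the LD score regression step
filtered = results[results["SNP"].isin(exome_snps)]
filtered.to_csv("mama_results_SAS_T2D.exome_only.txt", sep="\t", index=False)
```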
Thanks Patrick. I will try that and let you know. I will make sure to describe the process for the reviewers.
Any luck with trying that out?
Hi,
Ok, sounds good!
Hi, I am having similar issues, with almost all my SNPs being removed by the 'non-positive-semi-definiteness of omega' filter. I am using UKB data, imputed by UKB using HRC (the standard data provided), for EUR, AFR and CSA population subgroups. I selected SNPs with INFO score > 0.8, derived from 1000 randomly selected subjects in each population. This gives trans-ancestral LD scores spanning approximately 6.3 million SNPs. When I then try to merge summary statistics for each population (also from UKB, so covering the same SNP set and imputation strategy), almost all SNPs are removed due to 'non-positive-semi-definiteness of omega'. I am slightly confused, as I would have thought this was a routine application of this approach? I have copied the output below. Kind regards, Sam Kleeman
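For anyone reproducing this setup, the INFO > 0.8 / 1000-random-subjects selection can be scripted along these lines. This is a generic sketch that writes plain SNP and sample lists; the file layouts and column names are assumptions, and it is not the MAMA ancestry-file format described in the notebook:

```python
import pandas as pd

# Assumed inputs: a UKB-style imputation INFO table and a plink .fam file
info = pd.read_csv("ukb_mfi_chr8.txt", sep=r"\s+", header=None,
                   names=["alt_id", "rsid", "pos", "a1", "a2", "maf", "minor", "info"])
fam = pd.read_csv("ukb_afr.fam", sep=r"\s+", header=None,
                  names=["FID", "IID", "PAT", "MAT", "SEX", "PHENO"])

# Keep imputed SNPs with INFO > 0.8 and draw 1000 random subjects
keep_snps = info.loc[info["info"] > 0.8, "rsid"]
keep_ids = fam.sample(n=1000, random_state=1)[["FID", "IID"]]

keep_snps.to_csv("snps_info0.8.txt", index=False, header=False)
keep_ids.to_csv("afr_random_1000.txt", sep=" ", index=False, header=False)
```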
It looks like you have negative diagonal coefficients on your LD score regressions. Maybe try the following option, which will set the intercept to zero: --reg-int-zero, and see if that helps. @ggoldman1 @JonJala Maybe we should make the software throw an error in cases like this, since the results will never make sense if the diagonal entries of the LD score regression coefficient matrices are negative.
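For illustration, the kind of guard being suggested might look roughly like this (just a sketch, not current MAMA code; the function name is made up):

```python
import numpy as np

def check_ld_coefficients(ld_coef):
    """Fail fast when the LD score regression coefficient matrix has negative
    diagonal entries, since the per-SNP omega matrices built from it then get
    negative diagonals and every SNP ends up dropped as non-PSD."""
    diag = np.diag(ld_coef)
    if np.any(diag < 0):
        raise ValueError(
            f"Negative diagonal entries in LD regression coefficients: {diag}; "
            "check the sample sizes / mean chi^2 of the input GWAS."
        )

# The coefficient matrix reported in this issue trips the check:
try:
    check_ld_coefficients(np.array([[-1.45883266e-03, -1.27498494e-03],
                                    [-1.27498494e-03, -9.65617573e-05]]))
except ValueError as err:
    print(err)
```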
I tried using --reg-int-zero - it runs across all SNPs, but I get crazy p-values in the region of 8.169262804980647e-291088526. Why do I see negative diagonal coefficients in the LD score regressions? This is the script I use to compute LD scores. The IID ancestry file and SNP ancestry file were generated as per the Jupyter notebook. SNPs are from UK Biobank with INFO > 0.8, with 1000 randomly selected subjects per population.
Hey Sam,
Did you see the most recent response at #15? I see you're using a centimorgan-based window, but from that thread it seems like your data doesn't have cM reported. If this is the case then your LD scores will likely be wrong, which could cause issues.
Grant
Sorry, the script above was an old version; I used the following (with the ld-wind-kb parameter):
If I add in centimorgans, will that fix the issue?
No, not unless your .bim files have a non-zero third column. Otherwise leaving it as you have is fine. Out of curiosity, what is the mean chi^2 in each of your input GWAS? Per #26, if you rerun with --verbose, the mean chi^2 of each input should be reported in the log.
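If it is easier to compute directly from the input files, mean chi^2 is just the mean of (beta/se)^2. A minimal sketch, with the path and column names as placeholders for your own:

```python
import numpy as np
import pandas as pd

# Placeholder path and column names -- point at one of your input sumstats files
df = pd.read_csv("sumstats_AFR.tsv", sep="\t")

# chi^2 per SNP is (beta / se)^2; for a well-powered GWAS the mean is usually >= 1
chi2 = np.square(df["beta"] / df["se"])
print("Mean chi^2:", chi2.mean())
```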
I can see the following: Do you have any suggestions about how we can resolve this? We are really keen to use this approach in our work.
That looks like it might be the mean chi squared of the outputs? If you pull the latest from the repo and re-run (making sure to keep the --verbose flag set), the log should have the mean chi squared of the input populations after it goes through QCing and harmonizing all the input summary statistics. The log file should have something like "Harmonized AFR cyc mean chi squared: [VALUE]" in it.
Sure, have done.
Those first two mean chi squared values are quite low. From what I understand, they should be above 1. If you include the complete log from the recent run, we could maybe see if anything else looks funny, but otherwise maybe double-check to make sure your standard errors and betas are ok?
…On Tue, Aug 3, 2021, 2:58 PM samkleeman1 ***@***.***> wrote:
Sure, have done
Harmonized AFR cyc mean chi squared: 0.785988856018892
Harmonized CSA cyc mean chi squared: 0.8874974162834628
Harmonized EUR cyc mean chi squared: 2.9613204740241974
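One quick way to do the "double-check your standard errors and betas" step is to see whether they reproduce the reported p-values. A sketch, assuming the beta/se/pval column names used elsewhere in this thread:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Placeholder path -- one of the input sumstats files
df = pd.read_csv("sumstats_AFR.tsv", sep="\t")

# Two-sided p-value implied by a Wald test on beta / se
z = df["beta"] / df["se"]
implied_p = 2 * stats.norm.sf(np.abs(z))

# If the implied and reported p-values disagree badly, the betas/SEs are
# probably not on the scale (or from the test) that the p-values came from
print(np.corrcoef(-np.log10(implied_p + 1e-300),
                  -np.log10(df["pval"].to_numpy() + 1e-300)))
```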
I ran again using a different set of phenotypes. They are standard UK Biobank phenotypes (eGFR creatinine), with GWAS implemented in BOLT-LMM and the summary statistics imported directly into MAMA. I am really struggling to understand why it isn't working and am grateful for your help with this.
The LD score creation ran ok, right? What was the full command you ran there? I'll ask some other folks about the latest results, though.
…On Wed, Aug 4, 2021, 3:13 PM samkleeman1 ***@***.***> wrote:
I ran again using a different set of phenotypes. They are standard UK
Biobank phenotypes (eGFR Creatinine), with GWAS implemented in BOLT-LMM,
with summary statistics directly imported to Mama. I am really struggling
to understanding why it isn't working and am grateful for your help with
this.
python3 /mnt/grid/ukbiobank/data/Application58510/skleeman/was/mama/mama/mama.py --sumstats "/mnt/grid/ukbiobank/data/Application58510/skleeman/gwas/AFR/creatinine_summary2.tsv,AFR,cyc" "/mnt/grid/ukbiobank/data/Application58510/skleeman/gwas/CSA/creatinine_summary2.tsv,CSA,cyc" "/mnt/grid/ukbiobank/data/Application58510/skleeman/EUR/creatinine_summary2.tsv,EUR,cyc" \
--ld-scores "/mnt/grid/ukbiobank/data/Application58510/skleeman/mama/out/ukb_EUR_AFR_CSA_EAS_maf0.01_info0.8_ld_chr*.l2.ldscore.gz" \
--out "./mama_merged" \
--replace-a1-col-match "a1" --verbose \
--replace-bp-col-match "BP" --replace-chr-col-match "CHR" --replace-freq-col-match "VAF" \
--replace-info-col-match "info" \
--replace-a2-col-match "a2" --replace-snp-col-match "snpid" --replace-beta-col-match "beta" \
--replace-se-col-match "se" --replace-p-col-match "pval" --allow-palindromic-snps
Harmonized AFR cyc mean chi squared: 1.0315720691747015
Harmonized CSA cyc mean chi squared: 1.0396088044819032
Harmonized EUR cyc mean chi squared: 2.681982126791807
Dropped 6596642 total SNPs due to non-positive-(semi)-definiteness of omega / sigma.
I ran across each chromosome separately. No error messages to my knowledge.
Curious. It's possible that the LD score software has some problem in non-standardized units. Can you try generating LD scores based on standardized genotypes and running the MAMA code with those LD scores and the standardized model flag?
…On Wed, Aug 4, 2021 at 4:41 PM samkleeman1 ***@***.***> wrote:
I ran across each chromosome separately. No error messages to my knowledge.
source ../mama/mama_env/bin/activate
python3 ../mama/mama_ldscores.py --ances-path "../iid_ances_file" --snp-ances "../snp_ances_file" --ld-wind-kb 1000 --stream-stdout --bfile-merged-path "../ukb_EUR_AFR_CSA_EAS_maf0.01_info0.8_8" --out "../out/ukb_EUR_AFR_CSA_EAS_maf0.01_info0.8_ld_chr8"
What do you mean by standardized genotypes? Like recoding to 0/1/2?
The MAMA method can be based on a model where effect sizes are assumed to be drawn from an iid distribution, either assuming that the genotypes are coded as 0/1/2 or assuming they are standardized to have mean zero and variance one. I think the default of the software is to assume the 0/1/2 model, but we recently started noticing a few funny things in the code that make me worry about that specification.
In practice, this would mean regenerating the LD scores and running MAMA with the standardized genotype flag. You would not have to modify any of your data. (Can you remind me what the flags are to do this, Jon or Grant?)
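As a toy illustration of what "standardized" means here (textbook definitions only, not the mama_ldscores.py internals):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy genotype matrix: 500 individuals x 200 SNPs, coded as 0/1/2 copies of the alt allele
G = rng.binomial(2, 0.3, size=(500, 200)).astype(float)

# "Standardized" genotypes: each SNP column rescaled to mean 0, variance 1.
# The divisor is the SNP's standard deviation (about sqrt(2p(1-p)) under HWE),
# which is exactly the allele-frequency scaling that separates the
# standardized model from the 0/1/2 model.
G_std = (G - G.mean(axis=0)) / G.std(axis=0)

# LD score of SNP j under the standardized model: sum over k of r^2_{jk}
R = np.corrcoef(G_std, rowvar=False)
ld_scores = (R ** 2).sum(axis=1)
print(ld_scores[:5])
```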
When calling the ldsc script, you should specify the standardized-genotypes option.
Running now, will keep you posted.
It worked! Thanks so much guys!
Interesting. This is good for us to know. Thanks Sam!
Hi Grant and Jon,
All SNPs from my analysis are being dropped due to "non-positive-semi-definiteness of omega". Do you have any suggestions to fix this problem?
Thanks,
Manav
#########################################
Running LD Score regression.
Regression coefficients (LD):
[[-1.45883266e-03 -1.27498494e-03]
[-1.27498494e-03 -9.65617573e-05]]
Regression coefficients (Intercept):
[[1.08435618 1.04543503]
[1.04543503 1.00893974]]
Regression coefficients (SE^2):
[[-0.04235656 -2.6532606 ]
[-2.6532606 -3.08265681]]
Creating omega and sigma matrices.
Average Omega (including dropped slices) =
[[-0.01087047 -0.00785719]
[-0.00785719 -0.00084023]]
Average Sigma (including dropped slices) =
[[1.08297402 1.04543503]
[1.04543503 1.00471217]]
Adjusted 0 SNPs to make omega positive semi-definite.
Dropped 4730 SNPs due to non-positive-semi-definiteness of omega.
Dropped 4730 SNPs due to non-positive-definiteness of sigma.
Dropped 4730 total SNPs due to non-positive-(semi)-definiteness of omega / sigma.
Running main MAMA method.
Preparing results for output.
/mnt/efs/users/manav.kapoor/mama-mainline/mama_pipeline.py:552: RuntimeWarning: Mean of empty slice.
mean_chi_2 = np.square(new_df[Z_COL].to_numpy()).mean()
ERROR: Received Numpy error: invalid value (8)
Mean Chi^2 for ('SAS', 'T2D') = nan
Population 1: ('EUR', 'T2D')
ERROR: Received Numpy error: invalid value (8)
Mean Chi^2 for ('EUR', 'T2D') = nan
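For reference, plugging the average omega printed above into numpy makes the drop concrete: a PSD matrix needs non-negative eigenvalues (and non-negative diagonal entries), and this one fails on both counts.

```python
import numpy as np

# Average omega as printed in the log above
avg_omega = np.array([[-0.01087047, -0.00785719],
                      [-0.00785719, -0.00084023]])

# A PSD matrix needs all eigenvalues >= 0; one of these is negative
print(np.linalg.eigvalsh(avg_omega))

# The negative diagonal alone already rules out PSD,
# since every diagonal entry of a PSD matrix is non-negative
print(np.diag(avg_omega))
```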