error in log file #24

Open
umcyh opened this issue Apr 6, 2018 · 8 comments

Comments

@umcyh

umcyh commented Apr 6, 2018

I applied MTAG (https://github.com/omeed-maghzian/mtag) to some public data: http://csg.sph.umich.edu//abecasis/public/lipids2013/. I chose the Total Cholesterol and Triglycerides data to test MTAG.

When I ran MTAG, the log file showed an error; please see the log file below. The only change I made to the original data was adding a column z=Beta/SE to the MTAG input file.
(1) Is this z-value calculation correct?
(2) Is the N value correct?
(3) The error in the log is: ERROR converting summary statistics. Could you explain why there is an error in converting the summary statistics?

The original GWAS data columns are listed below:
SNP_hg19 | Marker name in build hg19.
rsid | Marker name in rsid format.
A1 | Effect allele.
A2 | Other allele.
Beta | Effect size.
SE | Standard Error for Beta.
N | The number of individuals analyzed for this marker.
P-value | P-value after doing genomic control.
Freq.A1.1000G.EUR | Frequency of allele A1 from 1000G EUR sample.

Log file:

2018/04/06/12:23:11 PM <><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>
2018/04/06/12:23:11 PM Munging Trait 1 <><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><
2018/04/06/12:23:11 PM <><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>
2018/04/06/12:23:11 PM Interpreting column names as follows:
2018/04/06/12:23:11 PM snpid: Variant ID (e.g., rs number)
n: Sample size
a1: Allele 1, interpreted as ref allele for signed sumstat.
pval: p-Value
a2: Allele 2, interpreted as non-ref allele for signed sumstat.
z: Directional summary statistic as specified by --signed-sumstats.

2018/04/06/12:23:11 PM Reading sumstats from provided DataFrame into memory 10000000 SNPs at a time.
2018/04/06/12:23:16 PM Read 2446981 SNPs from --sumstats file.
Removed 805 SNPs with missing values.
Removed 0 SNPs with INFO <= None.
Removed 0 SNPs with MAF <= 0.01.
Removed 0 SNPs with out-of-bounds p-values.
Removed 0 variants that were not SNPs. Note: strand ambiguous SNPs were not dropped.
2446176 SNPs remain.
2018/04/06/12:23:17 PM Removed 0 SNPs with duplicated rs numbers (2446176 SNPs remain).
2018/04/06/12:23:18 PM Removed 33274 SNPs with N < 63063.3333333 (2412902 SNPs remain).
2018/04/06/12:24:37 PM
ERROR converting summary statistics:

2018/04/06/12:24:37 PM Traceback (most recent call last):
File "/mnt/speliotes-lab/Software/MTAG/mtag-master/ldsc_mod/munge_sumstats.py", line 718, in munge_sumstats
check_median(dat.SIGNED_SUMSTAT, signed_sumstat_null, 0.1, sign_cname))
File "/mnt/speliotes-lab/Software/MTAG/mtag-master/ldsc_mod/munge_sumstats.py", line 372, in check_median
raise ValueError(msg.format(F=name, M=expected_median, V=round(m, 2)))
ValueError: WARNING: median value of SIGNED_SUMSTATS is 0.71 (should be close to 0.0). This column may be mislabeled.

2018/04/06/12:24:37 PM
Conversion finished at Fri Apr 6 12:24:37 2018
2018/04/06/12:24:37 PM Total time elapsed: 1.0m:26.4s
2018/04/06/12:24:37 PM WARNING: median value of SIGNED_SUMSTATS is 0.71 (should be close to 0.0). This column may be mislabeled.
Traceback (most recent call last):
File "mtag.py", line 1348, in <module>
mtag(args)
File "mtag.py", line 1194, in mtag
DATA, args = load_and_merge_data(args)
File "mtag.py", line 229, in load_and_merge_data
GWAS_d[p], sumstats_format[p] = _perform_munge(args, GWAS_d[p], gwas_dat_gen, p)
File "mtag.py", line 149, in _perform_munge
munged_results = munge_sumstats.munge_sumstats(argnames, write_out=False, new_log=False)
File "/mnt/speliotes-lab/Software/MTAG/mtag-master/ldsc_mod/munge_sumstats.py", line 718, in munge_sumstats
check_median(dat.SIGNED_SUMSTAT, signed_sumstat_null, 0.1, sign_cname))
File "/mnt/speliotes-lab/Software/MTAG/mtag-master/ldsc_mod/munge_sumstats.py", line 372, in check_median
raise ValueError(msg.format(F=name, M=expected_median, V=round(m, 2)))
ValueError: WARNING: median value of SIGNED_SUMSTATS is 0.71 (should be close to 0.0). This column may be mislabeled.

@GeneticResources

I think the public data flipped allele A1 to make the betas > 0, but mtag expects the mean value of the betas to be 0. Randomly selecting half of the SNPs and flipping A1 and beta may solve the problem.
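The random flip suggested above could be sketched roughly as follows. This is only an illustration, not part of MTAG: the function name `flip_random_half` and the toy rows are made up, and the column names are taken from the data description earlier in the thread. The sketch assumes that flipping a SNP means swapping A1 and A2, negating Beta, and replacing the A1 frequency with one minus itself; z would then need to be recomputed from the flipped Beta.

```python
import random

FREQ_KEY = "Freq.A1.1000G.EUR"  # column name from the data description above


def flip_random_half(rows, seed=0):
    """Swap A1/A2, negate Beta, and complement the A1 frequency
    for a randomly chosen half of the SNPs (hypothetical helper)."""
    rng = random.Random(seed)
    to_flip = set(rng.sample(range(len(rows)), len(rows) // 2))
    flipped = []
    for i, row in enumerate(rows):
        new = dict(row)
        if i in to_flip:
            new["A1"], new["A2"] = row["A2"], row["A1"]
            new["Beta"] = -row["Beta"]
            if FREQ_KEY in new:
                new[FREQ_KEY] = 1.0 - row[FREQ_KEY]
        flipped.append(new)
    return flipped


# Toy rows (made up): every Beta is positive, as in the public data.
rows = [
    {"A1": "A", "A2": "G", "Beta": 0.01 * (i + 1), "SE": 0.02, FREQ_KEY: 0.3}
    for i in range(1000)
]
flipped = flip_random_half(rows)
n_negative = sum(r["Beta"] < 0 for r in flipped)
print(n_negative, "of", len(flipped), "betas are now negative")
```

SE is left unchanged because it does not depend on which allele is called A1.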

Another related question about mtag z-scores: in an XY plot of the raw Z(gwas) against Z(mtag), should the slope (beta) be around 1? I got 0.5 and don't know how to explain the shrunken Z-scores from mtag.

@huilisabrina
Collaborator

Hi @umcyh ,

The error comes from one of the data validity checks built in the LDSC package, which MTAG uses to estimate the Sigma.

To your questions:
(1) You’re right about the z=beta/SE.
(2) N is correct based on the table you sent.
(3) The message indicates the effect sizes in the input sumstats are skewed to be positive. This violates the underlying random effect assumption used in MTAG. We’d expect beta or Z to be zero on average, if the choice of reference allele is arbitrary.
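To make points (1) and (3) concrete, here is a minimal sketch of the z computation and of a median check in the spirit of the one that raised the error. It is not the LDSC code: the helper names and the toy numbers are made up, and the tolerance of 0.1 is assumed from the `check_median(..., 0.1, ...)` call visible in the traceback.

```python
import statistics


def z_scores(betas, ses):
    """Return z = Beta / SE for paired effect sizes and standard errors."""
    return [b / s for b, s in zip(betas, ses)]


def median_is_near_zero(values, tolerance=0.1):
    """True if the median of the signed statistics is within `tolerance` of 0."""
    return abs(statistics.median(values)) <= tolerance


# Toy data (made up): alleles oriented so that every Beta is positive.
betas = [0.02, 0.05, 0.01, 0.03, 0.04]
ses = [0.02, 0.04, 0.02, 0.03, 0.05]
z = z_scores(betas, ses)

# With all-positive betas the median z is far from 0, so the check fails.
print(statistics.median(z), median_is_near_zero(z))
```

With an arbitrary choice of reference allele, roughly half of the z values would be negative and the median would sit near 0.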

Thanks @GeneticResources for pointing out the allele issue in the source data. That should explain the error Yanhua was getting. As for the Z-score issue: are you running a single-trait MTAG? The input GWAS z and the output mtag_z should match if that is the case.

Thanks,
Hui

@GeneticResources

Hi @huilisabrina ,

I ran mtag on a disease trait (binary) and a quantitative trait. The slope (beta) between Z(gwas of binary trait) and Z(mtag of binary trait) was 0.5 (not around 1). Do you know the reason? Thanks.

@huilisabrina
Collaborator

Hi @GeneticResources ,

One possibility is that the N used in the case-control trait is not the "effective N". MTAG assumes SE = 1/sqrt(N_eff * 2 * p * (1-p)), where p is the minor allele frequency. Can you try replacing the N column in the binary trait sumstats with 1/(2 * p * (1-p) * SE^2) and see if that solves the problem? Also, there is some discussion in an older issue, #10, that might be helpful.
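As a rough sketch of that replacement (the function names are made up for illustration and are not part of MTAG), the effective N can be computed per SNP from SE and the minor allele frequency, and plugging it back into the assumed SE formula recovers the original SE:

```python
import math


def effective_n(se, maf):
    """N_eff = 1 / (2 * p * (1 - p) * SE^2), with p the minor allele frequency."""
    return 1.0 / (2.0 * maf * (1.0 - maf) * se ** 2)


def implied_se(n_eff, maf):
    """SE implied by the assumption SE = 1 / sqrt(N_eff * 2 * p * (1 - p))."""
    return 1.0 / math.sqrt(n_eff * 2.0 * maf * (1.0 - maf))


# Toy values (made up): one SNP with SE = 0.01 and MAF = 0.3.
n_eff = effective_n(se=0.01, maf=0.3)
print(n_eff, implied_se(n_eff, maf=0.3))
```

The two functions are algebraic inverses of each other, which is why using this N is equivalent to supplying beta and SE directly.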

Best,
Hui

@GeneticResources

Hi @huilisabrina ,
Issue #10 is very helpful. After I used the new "effective N", the lm slope between beta (gwas) and beta (mtag) is 1.0308840; previously it was 4.3747334.
However, the slope between z (gwas) and z (mtag) is still about 0.6259420; previously it was 0.619527.
The slope between se (gwas) and se (mtag) is 1.6295029; previously it was 6.935761.
Do you know the potential reason for z(slope) ~ 0.63 and se(slope) ~ 1.63?
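For readers who want to reproduce this kind of comparison, a simple-regression slope can be sketched as below. This is an illustration only: the thread does not say which variable was regressed on which, so treating the GWAS values as x and the MTAG values as y is an assumption, and the numbers are made up.

```python
def ols_slope(x, y):
    """Slope of the least-squares line of y on x (with an intercept)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


# Toy values (made up): the "mtag" column is exactly twice the "gwas" column.
gwas = [0.01, -0.02, 0.03, 0.00, -0.01]
mtag = [2.0 * v for v in gwas]
print(ols_slope(gwas, mtag))
```

A uniform rescaling of one column shows up directly as the slope, which is the pattern to look for when beta and se are off by the same factor.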

Thanks.

@paturley
Collaborator

paturley commented Apr 9, 2018 via email

@GeneticResources

out_mtag_trait_three.log

Hi @paturley ,

The attached is the log file.
The first trait is the binary trait with z(slope) ~ 0.63 and se(slope) ~ 1.63.
Traits 2 and 3 are quantitative traits, and their slopes of z (gwas) vs. z (mtag) are 1.004962 and 0.949929, respectively. However, the slopes of beta and se are not around 1 (~0.2 for trait 2 and ~3.5 for trait 3, for both beta and se).

Since the effective sample size is hard to determine, is it possible to run mtag based on beta and se, rather than z?

Thanks.

@paturley
Collaborator

MTAG outputs betas and SEs assuming that the phenotype has been standardized to have a standard deviation of one. That appears to be the problem with traits 2 and 3, since the betas and SEs are inflated/deflated by the same amount. (Do you know the variance of the phenotype for those two data sets?) See Issue #10.
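A small sketch of why standardization changes beta and SE by the same factor while leaving z alone. The helper name and the numbers are made up, and it assumes the original GWAS was run on the phenotype in its raw units with a known standard deviation:

```python
def standardize(beta, se, pheno_sd):
    """Rescale a raw-unit beta and SE to a phenotype with standard deviation 1."""
    return beta / pheno_sd, se / pheno_sd


# Toy values (made up): a phenotype with SD 3.5 in its raw units.
raw_beta, raw_se, pheno_sd = 0.7, 0.14, 3.5
std_beta, std_se = standardize(raw_beta, raw_se, pheno_sd)

# Beta and SE both shrink by the factor pheno_sd; their ratio z does not change.
print(std_beta, std_se, raw_beta / raw_se, std_beta / std_se)
```

This matches the pattern reported above: z slopes near 1, with beta and se slopes that differ from 1 by a common factor.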

For trait one, there may be a few things that are going on.

  1. Since the trait is binary, the results correspond to first standardizing the binary trait, and then doing GWAS. If you want to convert the betas and SEs back into binary units, you need to multiply them by the standard deviation of the binary phenotype.
  2. Even if you correct the units of the estimates, the slope won't be one in expectation due to attenuation bias, since the betas are estimated with noise. (https://en.wikipedia.org/wiki/Regression_dilution)
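Both points can be illustrated with a short sketch. It is a toy simulation, not a model of MTAG output: the helper names, the noise level, and the sample size are all made up. The standard deviation of a 0/1 phenotype with case fraction k is sqrt(k(1-k)), and regressing one noisy estimate on another gives a slope below one even when both measure the same true effect.

```python
import math
import random


def binary_sd(case_fraction):
    """Standard deviation of a 0/1 phenotype with the given fraction of cases."""
    return math.sqrt(case_fraction * (1.0 - case_fraction))


def ols_slope(x, y):
    """Slope of the least-squares line of y on x (with an intercept)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


# Point 1: converting a standardized beta back to binary units (toy numbers).
beta_std = 0.05
beta_binary_units = beta_std * binary_sd(0.2)

# Point 2: two noisy estimates of the same true effects (toy simulation).
rng = random.Random(1)
true_effects = [rng.gauss(0.0, 1.0) for _ in range(20000)]
estimate_a = [t + rng.gauss(0.0, 1.0) for t in true_effects]
estimate_b = [t + rng.gauss(0.0, 1.0) for t in true_effects]

# In expectation the slope is var(true) / (var(true) + var(noise)) = 0.5 here.
slope = ols_slope(estimate_a, estimate_b)
print(beta_binary_units, slope)
```

The noisier the estimate used as the regressor, the further the slope falls below one.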

Re using beta and se rather than z and N: we hope to implement that soon. However, if you use the formula for N found in issue #10, that is equivalent to using the beta and se.
