Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add field to signify which allele is the reference #4

Open
jdhayhurst opened this issue Aug 3, 2022 · 14 comments
Open

Add field to signify which allele is the reference #4

jdhayhurst opened this issue Aug 3, 2022 · 14 comments

Comments

@jdhayhurst
Copy link
Contributor

We need to be able to determine which is the reference allele.

For example if you have a ref sequence of ATGTGA and at position 3 you have an EA of GT and an OA of G, you have no way of knowing which is ref. The trouble with that is that the variant is ambiguous.
If the EA is the ref: the OA is a deletion wrt ref seq --> effect sequence: ATGTGA
If the OA is the ref: the EA is an insertion wrt ref seq. --> effect sequence: ATGTTGA
Since you have no way of knowing if the EA sequence is ATGTGA or ATGTTGA, it's ambiguous. Yes, you know with respect to the other allele that there is an additional T, but without knowing which is the reference, you cannot actually know the variant.

Suggestions:

  1. an additional field giving the reference allele string (genome reference in the metadata)
  2. an additional field that signifies whether EA or OA is the ref
@freeseek
Copy link

freeseek commented Aug 8, 2022

I very much agree with @jdhayhurst on this point. This might seem at first as a minor issue, but it really seems like a limitation of the GWAS-SSF standard in my opinion. When looking at the high coverage 1000 Genomes reference panel, there are actually >700k indel pairs with reference and alternate alleles swapped, corresponding to the two possible type of indel variants, insertions and deletions. This represents a typical scenario at homopolymer runs and STR loci and, as our ability to correctly assess these loci increases, this is going to become an increasingly larger problem

As early as 2014 I had noticed that this unordered way to represents the two alleles at a locus like PLINK1 does (see here) was flawed and I believed that the VCF standard was partly driven by having to address these scenarios. Indeed, PLINK2 has now a framework to remember which allele is the reference allele

I would strongly encourage to recommend the genetic community to adopt a summary statistics standard built on top of VCF. The GWAS-VCF standard seemed a much better option in my opinion

It has several advantages over GWAS-SSF:

  1. it is not ambiguous
  2. variants can be easily annotated with available tools (bcftools csq and bcftools annotate)
  3. it can be stored as compressed text (.vcf.gz) or as compressed binary (.bcf) to significantly speed up parsing of the file
  4. it can include summary statistics from multiple traits in the same file
  5. it can be queried with available bcftools infrastructure (e.g. bcftools query --format "%CHROM\t%POS\t%REF\t%ALT[\t%ES\t%LP]\n" --include 'LP>7.3' to extract genome-wide significant variants)
  6. it can be easily converted to other formats. Indeed, the following command would generate a GWAS-SSF file from a GWAS-VCF file (the opposite is much more complicated):
(echo -e "chromosome\tbase_pair_location\teffect_allele\tother_allele\tbeta\tstandard_error\teffect_allele_frequency\tp_value";
bcftools query -s SM -f "%CHROM\t%POS\t%ALT\t%REF[\t%ES\t%SE\t%AF\t%LP]\n" gwas-vcf.vcf | \
  awk -F"\t" -v OFS="\t" '{$8=10^(-$8); print}') > gwas-ssf.tsv

Once a format like GWAS-VCF is agreed and adopted, I think developers of software tools such as PLINK2 or REGENIE could be convinced to allow generation of summary statistics in such a format

@marcora
Copy link

marcora commented Sep 8, 2022

I agree with the points raised by @freeseek and not because I am one of the proponents of the GWAS-VCF format ;)

I would be ecstatic with the wide adoption of any sumstats format but, in my opinion, one of the major difficulties with the absence of such standardized format has been the ambiguity regarding what is ref and what is alt (or min vs major) in combination with what is effect vs non-effect allele.

@julesjacobsen
Copy link

I also generally agree that this should be built on top of the VCF spec rather than kind-of duplicate it, but differently. VCF has well-supported tooling and has already solved most of the variant representation issues. This spec could aim to constrain some aspects of the VCF spec to fit with the existing proposal e.g. requiring the use of contig names 1-25 instead of the slightly more variable (chr)[1-22,X,Y,M,MT] 'traditional' names or requiring splitting multiple alternate alleles onto separate lines along with defining specific keys required/recommended for the INFO and FORMAT sections of a record.

What are the benefits of this over GWAS-VCF? This is a genuine question, as I have no vested interest (or in-depth knowledge) in either this or GWAS-VCF!

@ljwh2
Copy link
Collaborator

ljwh2 commented Mar 20, 2023

Thanks for the discussion and helpful contributions. We have now amended the spec to include variant ID in the format chr_bp_ref_alt, and added an additional column to specify whether the effect or other allele is REF. We haven’t mandated that submitters fix their data with EA=ALT. This is because our main goal in GWAS-SSF is to encourage data sharing (bearing in mind that still only around 30% of GWAS publications share data at all). We know that many data generators (rightly or wrongly) prefer to use EA=minor allele or other allele orders, and we wanted the format to be flexible enough to handle legacy data without presenting further overheads/barriers to data sharing. However in the GWAS Catalog’s submission documentation accompanying GWAS-SSF we now strongly recommend that EA is fixed to ALT at the point of data generation. Of note, the GWAS Catalog also provides data in a harmonised format which includes harmonisation to the reference allele.
We are updating the GWAS-SSF manuscript to include the above points.

@julesjacobsen The current state of the field is that many summary statistics files are lacking key information (particularly effect allele, EAF) which hinder downstream use of the data, or are not shared at all. The main goal of GWAS-SSF is to identify key mandatory and non-mandatory data and metadata fields for usability and encourage data sharing. We believe at this point in time, the community will benefit from definition of these data fields which can be applied to the simple tsv format described here (which is the GWAS Catalog’s implementation), or GWAS-VCF, or any other file format. We are updating the manuscript to focus on the data content and make this clearer.

@freeseek
Copy link

@ljwh2 thank you for the update. I would like to share another example of why I think not requiring the reference allele in the file format is a bad idea for a file standard. Take the latest Alzheimer disease GWAS. The summary statistics are shared as a GWAS-SSF and the most strongly associated variant is the following:

variant_id               rs747519137
p_value                  5.75e-659
chromosome               19
base_pair_location       44909521
effect_allele            CT
other_allele             C
effect_allele_frequency  0.9048
odds_ratio               0.350
ci_lower                 0.338
ci_upper                 0.364
beta                     -1.0485
standard_error           0.0191
n_cases                  34921
n_controls               55357
het_isq                  83.4
het_pvalue               2.493e-06
variant_alternate_id     chr19:44909521:CT:C

Now, if you look at the 1000 Genomes project high coverage reference panel you can find two variants at position chr19:44,909,521:

$ bcftools annotate -x ^INFO/AC,INFO/AN -r chr19:44909521 1kGP_high_coverage_Illumina.chr19.filtered.SNV_INDEL_SV_phased_panel.vcf.gz | bcftools view -GH
chr19	44909521	19:44909522:C:CT	C	CT	.	.	AC=40;AN=6404
chr19	44909521	19:44909521:CT:C	CT	C	.	.	AC=98;AN=6404

Which variant is the one including the effect allele? Both variants have rsID rs747519137 in dbSNP. It's impossible to tell and this problem is just going to get worse as users switch to running GWAS using the latest imputation reference panels that have numerous cases of variants like these.

Now, I can sympathize with the fact that the problem originates at the point of data generation, with software like PLINK largely to blame, and so the GWAS Catalog has limited influence over this. But allowing a file format to be adopted that does not mandate this will inevitably lower the pressure for the authors of the softwares from which the summary statistics originate to fix the problem. Ideally, you would want a file format to be adopted by software like PLINK, but if recording which one is the reference allele is not mandated, the problem is likely to persist. What would be the incentives for users to do what they can to keep this additional information and do what they can to make sure it is correct?

Also, what does it mean that the GWAS Catalog provides data in a harmonised format which includes harmonisation to the reference allele? How would the reference allele be determined in situations like the example presented here?

@chrchang
Copy link

Note that PLINK itself will be updated; this should be taken into account when analyzing the relative merits of these options.

@freeseek
Copy link

I guess I meant the original PLINK software and the BIM file format, rather than the current implementation of PLINK, which did not require to specify which allele is the reference and which allele is the alternate. This is likely due to the fact that the first GWASs were restricted to SNPs and did not have to deal with indels.

@chrchang
Copy link

Note that PLINK 1.x also did not include effect-allele-frequency in its output, which is a required field. So anyone who wants to submit a GWAS-SSF file based on an old PLINK 1.x analysis will need to do some additional work.

A primary goal with my current PLINK 2.0 update is to make it easy for scientists in this boat to submit the most useful thing they realistically can. There are two scenarios to balance here: (i) the scientist has easy access to a VCF file which spells out the REF/ALT alleles for these otherwise-ambiguous indels, and (ii) they do not.

In the second scenario, it is better for the scientist to submit 99.99%-unambiguous results than to submit nothing at all. But we want to nudge as many scientists as possible in the first scenario to generate results with correctly-annotated REF/ALT alleles, rather than to just use the workflow designed to allow scenario-2 scientists to submit anything at all.

To address this, I will modify plink2 --gwas-ssf so that, if an ambiguous indel is present, the command errors out unless the user manually overrides.

@chrchang
Copy link

Implementation: chrchang/plink-ng@c961b6a

Note that there is still an ambiguous case that comes up for multiallelic variants where REF is neither 'effect' nor 'other'.

@ljwh2
Copy link
Collaborator

ljwh2 commented Mar 21, 2023

In the second scenario, it is better for the scientist to submit 99.99%-unambiguous results than to submit nothing at all. But we want to nudge as many scientists as possible in the first scenario to generate results with correctly-annotated REF/ALT alleles, rather than to just use the workflow designed to allow scenario-2 scientists to submit anything at all.

100% agree with this @chrchang

What would be the incentives for users to do what they can to keep this additional information and do what they can to make sure it is correct?

@freeseek I think we can help to address this with messaging and education. Minor allele can make more intuitive sense to a biologist, especially if they're used to working with SNVs only, or looking more broadly for loci for followup study rather than specific variants. It hasn't occurred to them that there is an issue with not using ref/alt. We'll address this with recommendations in the paper and the documentation. I also propose we review in a year or so to see how effective that is, and we can consider mandating it in future.

@freeseek
Copy link

@chrchang thank you for addressing this issue in the --gwas-ssf option and @ljwh2 thank you for helping addressing this issue. Do expect that most users will not understand why this is an issue at all and their goal is going simply to be to upload the summary statistics to wherever they need to upload them. To achieve this goal most users will follow the path of least resistance and if it is made it easier for them to upload summary statistics without reference allele information, that is most likely what they will do because the problem will eventually fall on those using the summary statistics, not on those that generated them.

@ljwh2
Copy link
Collaborator

ljwh2 commented Mar 21, 2023

@jdhayhurst just tagging you for the harmonisation pipeline question

@marcora
Copy link

marcora commented Mar 21, 2023

Messaging, education and recommendations (i.e., relying on good will without strong incentives) clearly does not work. Much messaging, education, and recommendations has been done for a decade or longer on NHST, reproducibility, etc. and not much has changed. Enforcement of unambiguous and robust standards for submission is the only way to go in my opinion. I totally agree with @freeseek on this.

@ljwh2
Copy link
Collaborator

ljwh2 commented Mar 23, 2023

I think we are in agreement on the problem, but differ on the best method to achieve the goal. In our experience, the path of least resistance is for data generators to not share the data at all (Nature & PloS journals have data sharing mandates, but other journals do not - less than 30% of GWAS papers share data at the point of publication) or to stick it on an internal website or FigShare in a non standard format, rather than submit to a repository. Thus we think the best approach is to start with messaging and work towards a mandate. The GWAS Catalog provides harmonised files alongside the submitted version which have been oriented to the reference strand & mapped to the same genome build (these are the files that are ingested by OpenGWAS, for example). This does not solve the indel problem, of course - if the alleles cannot be oriented, the variant is not harmonised - but does take some of the strain for bioinformaticians. Harmoniser repo is here: https://github.com/EBISPOT/gwas-sumstats-harmoniser, and useful info here: https://github.com/opentargets/genetics-sumstat-harmoniser/blob/master/flowchart_v3.svg

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants