We suggest submitting raw virus sequence data, as well as assembled and annotated genomes, to ENA.
- There are several ways to submit data to ENA, including a dedicated SARS-CoV-2 submission route; extensive documentation on programmatic submissions is also available.
- Before submitting raw sequence data (e.g. from shotgun sequencing), contaminating human reads must be removed. This can be done with, for example, Metagen-FastQC; for assistance, contact virus-dataflow@ebi.ac.uk.
A list of relevant data and metadata standards can be found in FAIRsharing; some specific examples are given below.
We suggest storing data preferentially in the following formats, to maximize interoperability between datasets and with standard analysis pipelines:
- Raw sequences: .fastq, optionally compressed with gzip
- Genome contigs: .fastq if the assembler's uncertainties can be captured, otherwise .fasta; optionally compressed with gzip
- De novo aligned sequences: .afa
- Gene Structure: .gtf
- Gene Features: .gff
- Sequences mapped to a genome: .sam or the compressed formats .bam or .cram. Please ensure that the reference sequence used is also publicly available and that the @SQ header is present and unambiguously describes that reference sequence.
- Variant calling: .vcf. Please ensure that the reference sequence used is also publicly available and that it is unambiguously referenced in the header of the .vcf file, e.g. using the URL key of the ##contig header line.
- Browser: .bed
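
The header recommendations for .sam and .vcf files above can be checked before submission. The following Python sketch is illustrative only: the function names, the example accession and the example.org URLs are assumptions made for this example, and it works on header text that has already been extracted from the file. It treats an @SQ line as unambiguous if it carries at least one of the optional UR (URI), M5 (MD5 checksum) or AS (assembly identifier) tags, and a ##contig line as unambiguous if it carries a URL key; stricter or different criteria may apply to a given project.

```python
import re


def sq_lines_missing_reference(sam_header: str) -> list[str]:
    """Return the names of @SQ sequences that carry none of UR, M5 or AS."""
    missing = []
    for line in sam_header.splitlines():
        if not line.startswith("@SQ"):
            continue
        # SAM header fields are tab-separated TAG:VALUE pairs.
        fields = [f for f in line.split("\t")[1:] if ":" in f]
        tags = dict(f.split(":", 1) for f in fields)
        if not any(tag in tags for tag in ("UR", "M5", "AS")):
            missing.append(tags.get("SN", "<unnamed>"))
    return missing


def vcf_contigs_missing_url(vcf_header: str) -> list[str]:
    """Return the IDs of ##contig header lines that have no URL key."""
    missing = []
    for line in vcf_header.splitlines():
        match = re.match(r"##contig=<(.*)>$", line.strip())
        if not match:
            continue
        # Values are either quoted strings (may contain commas) or bare tokens.
        pairs = dict(re.findall(r'(\w+)=("[^"]*"|[^,]*)', match.group(1)))
        if "URL" not in pairs:
            missing.append(pairs.get("ID", "<unnamed>"))
    return missing


# Hypothetical headers, for illustration only.
sam_header = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@SQ\tSN:MN908947.3\tLN:29903\tUR:https://example.org/ref.fasta\n"
    "@SQ\tSN:other\tLN:100\n"
)
vcf_header = (
    "##fileformat=VCFv4.2\n"
    "##contig=<ID=MN908947.3,length=29903,URL=https://example.org/ref.fasta>\n"
    '##contig=<ID=ctg2,length=100,species="A, b">\n'
)

print(sq_lines_missing_reference(sam_header))
print(vcf_contigs_missing_url(vcf_header))
```

For .bam and .cram files the header first has to be converted to text with a suitable tool before a check like this can be applied.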
Consider annotating virus genomes using the ENA virus pathogen reporting standard checklist, a minimal information standard that is currently under development, and the more general Viral Genome Annotation System (VGAS) (Zhang et al. 2019).
For submitting data and metadata relating to phylogenetic relationships (including topology, branch lengths, and support values) consider using widely accepted formats such as Newick, NEXUS and PhyloXML. The Minimum Information About a Phylogenetic Analysis checklist provides a reference list of useful tree annotations.
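
As an illustration of how topology, branch lengths and support values fit into a single Newick string, the sketch below serializes a small in-memory tree. The `Node` class, the `to_newick` function and the taxon names A, B and C are hypothetical and exist only for this example. It assumes the common convention of writing support values as internal node labels, and it assumes taxon names without characters that would need quoting; established phylogenetics libraries handle these cases more thoroughly.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    name: str = ""                    # taxon label (used for leaves)
    length: Optional[float] = None    # branch length leading to this node
    support: Optional[float] = None   # support value (internal nodes only)
    children: list["Node"] = field(default_factory=list)


def _subtree(node: Node) -> str:
    if node.children:
        # Internal node: children in parentheses, then the support value as label.
        text = "(" + ",".join(_subtree(child) for child in node.children) + ")"
        if node.support is not None:
            text += f"{node.support:g}"
    else:
        # Leaf: just the taxon name.
        text = node.name
    if node.length is not None:
        text += f":{node.length:g}"
    return text


def to_newick(root: Node) -> str:
    """Serialize a tree to a Newick string terminated by a semicolon."""
    return _subtree(root) + ";"


# Hypothetical three-taxon tree with one supported internal branch.
tree = Node(children=[
    Node("A", 0.1),
    Node(children=[Node("B", 0.2), Node("C", 0.3)], length=0.05, support=95),
])
print(to_newick(tree))  # (A:0.1,(B:0.2,C:0.3)95:0.05);
```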