This repository has been archived by the owner on Mar 17, 2023. It is now read-only.

FAQ.md

File metadata and controls

75 lines (75 loc) · 9.3 KB
  1. What does mkmetadata do?
    mkmetadata reads the files you provide and fills in some of the information required in the metadata file for submission to EVA. It uses your file structure to determine your project structure. See the README for more details.
  2. What do I need for variant submission to EVA?
    You will need your variant files (vcf, bed, wig, etc.) and a metadata file in the same format as the EVA template.
  3. What do I need to run mkmetadata?
    You will need your variant files, organized as instructed in the README. Optionally, you can provide config files for your submitter information, project information, and analysis information so that mkmetadata can autofill these fields for you. See here for details on how to format your own config files.
  4. How does mkmetadata recognize my projects?
    mkmetadata takes a path as input. Any folder directly under that path is considered a project. Any folder under a project folder is considered an analysis associated with that project. Any files under an analysis folder are associated with that analysis.
    If a project_info.config or analysis_info.config file is found under a project or analysis folder, mkmetadata will try to read information about that project or analysis and autofill the corresponding fields in the output metadata file. See here for details on how to format your own config files.
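
The layout rule above can be sketched in a few lines of shell. This is only an illustration of the folder convention, not mkmetadata's actual code: list_layout is a hypothetical helper, the demo folder and file names are made up, and skipping analysis_info.config as a data file is an assumption.

```shell
#!/bin/sh
# Sketch only, not mkmetadata's real code: print one
# "project <TAB> analysis <TAB> file" line per file found under a root path.
# list_layout is a hypothetical helper name.
list_layout() {
    root=$1
    for project in "$root"/*/; do
        [ -d "$project" ] || continue            # skip if the glob matched nothing
        for analysis in "$project"*/; do
            [ -d "$analysis" ] || continue
            for file in "$analysis"*; do
                [ -f "$file" ] || continue       # ignore nested folders
                case $file in
                    */analysis_info.config) continue ;;   # assumed: config is not a data file
                esac
                printf '%s\t%s\t%s\n' \
                    "$(basename "$project")" "$(basename "$analysis")" "$(basename "$file")"
            done
        done
    done
}

# Demo on a throwaway tree; every name below is made up.
demo=$(mktemp -d)
mkdir -p "$demo/projectA/analysis1" "$demo/projectA/analysis2"
: > "$demo/projectA/project_info.config"
: > "$demo/projectA/analysis1/analysis_info.config"
: > "$demo/projectA/analysis1/sample1.vcf"
: > "$demo/projectA/analysis2/sample2.vcf"
list_layout "$demo"
rm -rf "$demo"
```

Running the helper on your own submission folder before running mkmetadata is a quick way to see which files would end up under which analysis.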
  5. Can I upload files that are not associated with any specific analysis?
    No. EVA guidelines require every file in the metadata to be associated with at least one analysis.
  6. What is the reference in an analysis?
    The reference assembly against which the analysis was performed. You can provide either a URL to the reference file hosted on a publicly available server (NCBI, Ensembl, etc.) or an ENA accession. For human references, a simple GRC reference name is accepted.
  7. What is the MD5sum of a reference?
    MD5sum is a hash value calculated from the content of a file. Although MD5 is vulnerable to intentional tampering and attacks, it is widely used to check that files are free of unintentional corruption, and EVA uses it to ensure that the correct version of the reference is associated with an analysis. If you modified your reference in any way, you must calculate the MD5sum of the modified file. To do so, run md5sum reference.fa on a Unix machine (Linux or macOS; if md5sum is not installed on macOS, md5 reference.fa is the usual equivalent) or CertUtil -hashfile reference.fa MD5 on a Windows machine. You also need to provide a publicly accessible link to your modified reference (see FAQ #6).
    If you did not alter the reference, you can usually get this value from the public database that hosts the reference. See here for an example.
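
If you want to compare the MD5sum of your local reference with a published value, a minimal sketch could look like the following. md5_of and check_md5 are hypothetical helper names, the small FASTA file is made up for the demo, and the fallback to md5 -q assumes a macOS or BSD system without md5sum.

```shell
#!/bin/sh
# Sketch: compute the MD5 of a file and compare it with an expected value.
# md5_of and check_md5 are hypothetical helpers, not part of mkmetadata.
md5_of() {
    if command -v md5sum >/dev/null 2>&1; then
        md5sum "$1" | awk '{ print $1 }'   # keep only the digest column
    else
        md5 -q "$1"                        # assumed fallback on macOS/BSD
    fi
}

check_md5() {
    actual=$(md5_of "$1")
    if [ "$actual" = "$2" ]; then
        echo "OK"
    else
        echo "MISMATCH: got $actual"
        return 1
    fi
}

# Demo with a made-up reference file.
ref=$(mktemp)
printf '>chr_demo\nACGTACGT\n' > "$ref"
digest=$(md5_of "$ref")
echo "MD5sum: $digest"
check_md5 "$ref" "$digest"
rm -f "$ref"
```

In practice you would pass the value listed by the public database as the second argument to check_md5.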
  8. Which "Experiment Types" are allowed in the metadata file?
    Allowed experiment types are:
    a. Whole genome sequencing
    b. Exome sequencing
    c. Genotyping by array
    d. Curation
  9. How do I fill out the "Sample" section of the metadata?
    Depending on whether your samples have been previously registered with EVA, you need to fill out either the "Pre-registered sample" section or the "Novel sample" section.
    Pre-registered sample section
    Provide the accession number of the registered sample, its sample ID as it appears in the corresponding vcf files, and the alias of the analysis associated with this sample.
    Novel sample section
    You must provide at least the sample name, a short description of the sample as Title, the Tax ID (the NCBI taxonomic classification of the sample, see http://www.ncbi.nlm.nih.gov/taxonomy; e.g. 9606 for human), and the Scientific name of your sample. It is good practice to fill in as much information as possible. Additional attributes are listed in the metadata file generated by mkmetadata, along with instructions for each attribute.
    Note: For a previously unsequenced organism, please contact eva-helpdesk@ebi.ac.uk to request a new Tax ID.
  10. What are platforms in the analysis tab?
    The sequencing platforms used to generate the sequencing data, such as Illumina MiSeq or Illumina HiSeq 2000. A list of accepted platforms can be found in the metadata file generated by mkmetadata or in the template sheet. If your platform is not included (e.g. PacBio, Nanopore), contact eva-helpdesk@ebi.ac.uk.
  11. EVA says my vcf files are not valid. What should I do?
    (Note that you can use this software provided by EVA to validate your vcf files before submitting)
    EVA requires that all vcf files strictly follow the VCF 4.X format specifications.
    If you have only a few variants to submit, an easy approach is to manually edit the template vcf file provided by EVA. Once you have downloaded the template, you can open it with any text editor. (MS Excel is recommended on Windows machines because Windows Notepad uses different line breaks than Unix machines, which the vcf format specifications do not support. This may change in the near future; learn more about this here.)
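
If a file has already been saved with Windows-style line breaks, a Unix shell can detect and remove the extra carriage returns. The sketch below is one possible way to do it, assuming the carriage returns are the only difference; has_crlf and strip_cr are hypothetical helper names and the file contents are made up.

```shell
#!/bin/sh
# Sketch: detect and remove carriage returns (Windows-style line breaks).
# has_crlf and strip_cr are hypothetical helper names.
has_crlf() {
    cr=$(printf '\r')        # a literal carriage return character
    grep -q "$cr" "$1"       # succeeds if any line contains one
}

strip_cr() {
    # Write a copy of $1 to $2 with every carriage return removed.
    tr -d '\r' < "$1" > "$2"
}

# Demo with a made-up two-line file that uses CRLF line breaks.
dir=$(mktemp -d)
printf '##fileformat=VCFv4.3\r\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\r\n' > "$dir/windows.vcf"
if has_crlf "$dir/windows.vcf"; then
    strip_cr "$dir/windows.vcf" "$dir/unix.vcf"
    echo "carriage returns removed"
fi
rm -rf "$dir"
```

Writing to a second file rather than editing in place keeps the original available in case anything goes wrong.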
    The following section briefly explains how to add your variants to the template vcf file provided by EVA.
    On a Windows machine:
    Open the template with MS Excel. You should see two rows:
##fileformat=VCFv4.3
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO

On the second row, each attribute should occupy a cell. Starting from the third row, input your data, one variant per row, with each attribute in corresponding column. After all data is input, save and rename the file. Then run vcf validator to make sure it passes EVA's requirements.
On a Unix machine:
On macOS with MS Excel, or on Linux with a similar spreadsheet app, the steps above still apply.
Alternatively, you can use a text editor to open the template.vcf file. You should see two rows:

##fileformat=VCFv4.3
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO

Starting from a new row directly after the second row, input your data. Each attribute should be separated by tab (IMPORTANT) and the order of attributes should follow the order in the second row.
To double-check that all fields in your vcf file are tab-delimited, run the following in a terminal:

cat -A yourvariants.vcf | less

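You should see something similar to the following (each ^I indicates a tab character, $ marks the end of a line, and a ^M before $ is a carriage return left by Windows-style line breaks):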

##fileformat=VCFv4.3^I^I^I^I^I^I^I^M$
#CHROM^IPOS^IID^IREF^IALT^IQUAL^IFILTER^IINFO^M$
1^I10^Ivar1^IC^IT^I35^I30^Iadditional info in this field^M$
2^I100^Ivar2^IA^IG^I45^I30^Iadditional info in this field^M$

After all data is input, save and rename the file. Then run vcf validator to make sure it passes EVA's requirements.
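
As an alternative to typing tabs by hand, records can be appended with printf and the column count checked with awk. This is a minimal sketch, assuming the 8-column template shown above (files with FORMAT and sample columns would need a different count); append_variant and bad_lines are hypothetical helper names and the variant values are made up.

```shell
#!/bin/sh
# Sketch: append tab-separated records and check the column count.
# append_variant and bad_lines are hypothetical helper names.

# append_variant FILE CHROM POS ID REF ALT QUAL FILTER INFO
append_variant() {
    file=$1
    shift
    # Exactly eight values are expected; printf inserts real tab characters.
    printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' "$@" >> "$file"
}

# Print the number of every line (other than ## meta lines) that does not
# have exactly eight tab-separated fields.
bad_lines() {
    awk -F'\t' '!/^##/ && NF != 8 { print NR }' "$1"
}

# Demo with made-up variants.
vcf=$(mktemp)
printf '##fileformat=VCFv4.3\n' > "$vcf"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> "$vcf"
append_variant "$vcf" 1 10 var1 C T 35 PASS .
echo "2 100 var2 A G 45 PASS ." >> "$vcf"   # deliberately space-separated
bad_lines "$vcf"                             # reports line 4
rm -f "$vcf"
```

An empty result from bad_lines only means the column count is right; the vcf validator is still needed to confirm that the file meets EVA's requirements.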
Large sets of variants generated directly by variant callers should pass validation with no issues, as long as you are using the latest version of the variant caller. If errors occur, contact eva-helpdesk@ebi.ac.uk to identify the issue, and the developers of the variant caller for help.
If you modify your vcf files, errors may arise. Some common errors are:
a. Tabs are replaced by spaces in newly generated vcf file.
This error can be confirmed with cat -A yourvariants.vcf | less: tabs show up as ^I, while spaces are displayed as spaces. There are many possible causes. A common one is the echo command on a Unix machine. If you are running a script to edit your vcf file and using echo to output lines, make sure variables are double-quoted or, better yet, use printf instead of echo.
b. Info field value is not a comma-separated list of valid strings.
This error can also be confirmed with cat -A yourvariants.vcf | less. There should not be any spaces in the value of the Info field (see the vcf specifications for details on the format standards).
c. Error: Info key is not a sequence of alphanumeric and/or punctuation characters.
Same as above. As the error message states, an Info key must consist only of alphanumeric and/or punctuation characters, with no spaces. This is likely caused by the text editor used to edit the vcf files. Vim is known to cause this problem. Nano on Unix and Notepad++ or MS Excel on Windows are recommended for editing vcf files. In general, however, you should avoid manually editing vcf files unless absolutely necessary.
d. Reference and alternate alleles do not share the first nucleotide.
This most likely refers to deletions. Per the vcf specifications, the Ref string of a deletion must include at least one base before the deletion, so the Alt string must share at least its first nucleotide with the Ref string. It is not clear how this error arises (possibly from the program used to generate the vcf files). Please contact eva-helpdesk@ebi.ac.uk for help on how to correct it.
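
Several of the errors above can be spotted before submission with a short awk script. The sketch below is a rough pre-check under simplifying assumptions, not a replacement for the vcf validator: lint_vcf is a hypothetical helper name, it only looks at the 8 fixed columns, and the Ref/Alt check only considers records where both alleles are plain base strings of different lengths.

```shell
#!/bin/sh
# Sketch: rough pre-check for errors a, b and d. Not a replacement for the
# vcf validator. lint_vcf is a hypothetical helper name.
lint_vcf() {
    awk -F'\t' '
        /^#/ { next }                            # skip meta and header lines
        NF < 8 {                                 # error a: tabs replaced by spaces
            printf "line %d: fewer than 8 tab-separated fields\n", NR
            next
        }
        $8 ~ / / {                               # error b: space inside INFO
            printf "line %d: space in INFO field\n", NR
        }
        # error d (heuristic): an indel whose REF and ALT start differently
        $4 ~ /^[ACGTNacgtn]+$/ && $5 ~ /^[ACGTNacgtn]+$/ &&
        length($4) != length($5) && substr($4, 1, 1) != substr($5, 1, 1) {
            printf "line %d: REF and ALT do not share the first base\n", NR
        }
    ' "$1"
}

# Demo with made-up records: lines 5, 6 and 7 each contain one problem.
vcf=$(mktemp)
{
    printf '##fileformat=VCFv4.3\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
    printf '1\t10\tvar1\tC\tT\t35\tPASS\tDP=14\n'
    printf '1\t20\tvar2\tGA\tG\t35\tPASS\tDP=14\n'
    printf '1\t30\tvar3\tGA\tT\t35\tPASS\tDP=14\n'
    printf '1\t40\tvar4\tA\tG\t35\tPASS\tadditional info\n'
    printf '1 50 var5 A G 35 PASS DP=14\n'
} > "$vcf"
lint_vcf "$vcf"
rm -f "$vcf"
```

A clean result here does not guarantee a valid file, so the vcf validator should still be run afterwards.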