AustralianBioCommons/xml4ena

Helper documentation for bulk genome uploads to the ENA sequence repository

TCCP produces multi-FASTA assembly files with about 1K-3K sequences each, which are technically identical to genome assembly files. The diplotype files produced by TCCP require extra effort to upload because they contain long strings of N in hard-masked, low-coverage areas of the assembly. The procedure below assumes you have already set up a Webin account with ENA. For that step, see the official documentation page "Register a Submission Account".

Stage 0: Register your new project on ENA Webin

Follow the interactive menu and enter the information about your study. This creates a record with a project ID, which will be the root for the samples you later add with the sample_xmler.pl script.

Stage 1: Prepare biosamples entries [sample_xmler.pl]

Biosample records contain the specifics of each sample and link the sample metadata to the project ID (stage 0) and, later, to the sequence assembly FASTA file. The following steps are required to run sample_xmler.pl.

1.1 Collect metadata in a single tab-delimited metadata.txt file

The metadata file can have an arbitrary number of fields (columns), but the following columns are obligatory; you select them interactively when running the Perl script.

  1. Sample ID/NAME -- should match the beginning of the assembly file name. I recommend using the BPA sample ID here.
  2. TAXON_ID -- this must be the exact species identifier from the NCBI taxonomy database. Genus or any taxonomic level other than species is not allowed here.
  3. Scientific name -- genus and species; it can be combined from two columns.
  4. COMMON_NAME -- for ordinary humans.

There are some optional columns with geographical information, which will be parsed and coded automatically if you keep header names like these:

  • country
  • state|region|state_or_region
  • collection_date

Here is an example for two samples in the metadata.txt file.
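Separately, as a rough pre-flight check before running the script, the sketch below reports which of the optional geographic headers appear in a tab-delimited table. It is not part of sample_xmler.pl. The function name is hypothetical, the header row is made up for illustration, and two things are assumptions: that state|region|state_or_region lists three alternative spellings, and that matching is case-insensitive.

```python
import csv
import io

# Optional geographic headers named in this README.
# Assumption: "state|region|state_or_region" means any one of the three spellings.
OPTIONAL_HEADERS = {"country", "state", "region", "state_or_region", "collection_date"}


def recognized_optional_columns(metadata_text):
    """Return the optional headers found in the first row of a tab-delimited table."""
    reader = csv.reader(io.StringIO(metadata_text), delimiter="\t")
    header = next(reader)
    # Assumption: headers are compared case-insensitively.
    return [name for name in header if name.strip().lower() in OPTIONAL_HEADERS]


# Hypothetical header row, for illustration only.
example = "sample_id\ttaxon_id\tcountry\tcollection_date\n"
print(recognized_optional_columns(example))
```

Columns that are not reported here would simply not be picked up as geographic information under these assumptions.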

1.2 Run sample_xmler.pl script to convert metadata

You have to set these variables in the script before running it (sorry, no interface for them yet):

$investigationtype = "phylogenetic study";
$sequencingmethod = "Illumina SBS, short PE reads";
$bioprojectaccessionid = "PRJEB00000"; # Register the project online manually before generating the XMLs (stage 0)

The script should produce a set of XML files ready for upload to ENA.

1.3 Prepare submission.xml

This small extra XML file with the action details can be written manually. It should contain the following lines:

<?xml version="1.0" encoding="UTF-8"?>
<SUBMISSION>
<ACTIONS>
<ACTION>
<ADD/>
</ACTION>
</ACTIONS>
</SUBMISSION>

You can change <ADD/> to <MODIFY/> to update an existing sample. This is convenient when uploading a new version of an assembly.
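If you prefer to generate the file instead of typing it, here is a minimal sketch that builds the same structure with either action. The function name is hypothetical. Python's ElementTree writes the empty element as `<ADD />` (with a space), which is equivalent XML.

```python
import xml.etree.ElementTree as ET


def build_submission_xml(action="ADD"):
    """Return submission.xml content with a single ADD or MODIFY action."""
    if action not in ("ADD", "MODIFY"):
        raise ValueError("this sketch only covers ADD and MODIFY")
    submission = ET.Element("SUBMISSION")
    actions = ET.SubElement(submission, "ACTIONS")
    action_element = ET.SubElement(actions, "ACTION")
    ET.SubElement(action_element, action)
    body = ET.tostring(submission, encoding="unicode")
    # ET.tostring with encoding="unicode" omits the declaration, so add it by hand.
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


with open("submission.xml", "w", encoding="utf-8") as handle:
    handle.write(build_submission_xml("ADD"))
```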

More options here

1.4 Upload with curl

curl is a command-line data transfer tool (used here to POST the files over HTTPS) and needs to be pre-installed on the system. To upload the generated XML files, use the curl command:

curl -u 'username:password' -F "SUBMISSION=@submission.xml" -F "SAMPLE=@sample.xml" "https://wwwdev.ebi.ac.uk/ena/submit/drop-box/submit/"

Note that for the real upload you have to change wwwdev.ebi.ac.uk to www.ebi.ac.uk. Otherwise the submission will live on the test server for only 24 hours.
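To make the test/production switch harder to forget, the following sketch assembles the same curl arguments with the host chosen by a flag. It only builds and prints the command and does not run it. The function and parameter names are hypothetical, while the URL path and form field names are the ones given in this README.

```python
import shlex

TEST_HOST = "wwwdev.ebi.ac.uk"   # test server named in this README
PROD_HOST = "www.ebi.ac.uk"      # real upload


def curl_command(username, password, sample_xml="sample.xml",
                 submission_xml="submission.xml", production=False):
    """Return the curl argument list for one sample XML upload."""
    host = PROD_HOST if production else TEST_HOST
    url = f"https://{host}/ena/submit/drop-box/submit/"
    return [
        "curl", "-u", f"{username}:{password}",
        "-F", f"SUBMISSION=@{submission_xml}",
        "-F", f"SAMPLE=@{sample_xml}",
        url,
    ]


# Print a shell-quoted command line for the test server.
print(shlex.join(curl_command("username", "password")))
```

Passing the list to `subprocess.run` would execute it. Printing first lets you confirm which server you are about to hit.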

1.5 Collect receipts from submission

You have to compile all receipts generated by the biosample submission into a single table. If you are missing receipts, you can log in to your account and download the sample IDs table manually from there.
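One way to compile that table is sketched below. It assumes each receipt contains SAMPLE elements with `alias` and `accession` attributes wrapping an EXT_ID element of type biosample. Only RECEIPT and EXT_ID appear in this README, so the SAMPLE layout is an assumption to check against your own receipts. The function name and the sample receipt text are hypothetical.

```python
import xml.etree.ElementTree as ET


def receipt_rows(receipt_text):
    """Return (alias, sample accession, biosample accession) tuples from one receipt."""
    # Saved receipts may start with curl progress output, so cut out the RECEIPT element.
    start = receipt_text.find("<RECEIPT")
    end = receipt_text.find("</RECEIPT>")
    if start < 0 or end < 0:
        return []
    root = ET.fromstring(receipt_text[start:end + len("</RECEIPT>")])
    if root.get("success") != "true":
        return []
    rows = []
    # Assumption: one SAMPLE element per submitted sample.
    for sample in root.iter("SAMPLE"):
        ext_id = sample.find("EXT_ID")
        biosample = ext_id.get("accession") if ext_id is not None else None
        rows.append((sample.get("alias"), sample.get("accession"), biosample))
    return rows


# Hypothetical receipt, for illustration only.
HYPOTHETICAL_RECEIPT = (
    '<RECEIPT success="true">'
    '<SAMPLE accession="ERS0000001" alias="10001">'
    '<EXT_ID accession="SAMEA0000110" type="biosample"/>'
    '</SAMPLE></RECEIPT>'
)
for row in receipt_rows(HYPOTHETICAL_RECEIPT):
    print("\t".join(str(value) for value in row))
```

Running this over every saved receipt and concatenating the printed rows gives a tab-delimited table of sample IDs.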

Stage 2: Submit assembly files [manifestator.pl]

In stage 1 you created database entries for the biosample IDs. Now we need to create manifest files for the assemblies from the generated biosample IDs with manifestator.pl. To upload one assembly FASTA you will need 2 files:

  1. Manifest file # Text file pairing the sample IDs with the FASTA file name.
  2. FASTA file # For unannotated assemblies (plain multi-FASTA; all exons in the same FASTA file).

2.1 Prepare manifest_meta_table.txt

This is another tab-delimited metadata text file with a small number of columns. The file should have only these columns:

  1. BPA sample / library id
  2. Biosample sample ID
  3. Biosample alias ID
  4. FASTA file name
  5. Coverage (sequencing depth, from TCCP QC metrics)

10001    ERS0000001    SAMEA0000110    10001_AHGVYVBCX2_GAATCTC_S19_DD.fasta.gz    127
10002    ERS0000011    SAMEA0000111    10002_AHGVYVBCX2_GAGGAC_S21_DD.fasta.gz     88
10003    ERS5079562    SAMEA0000112    10003_AHGVYVBCX2_GATTCTC_S33_DD.fasta.gz    87

2.2 Run manifestator.pl

This script takes the manifest_meta_table.txt table and splits it into individual manifests. The manifest files are named by the ID in the first column of the metadata (manifest_meta_table.txt). An example of a produced manifest file (10001.manifest):

STUDY PRJEB00000
SAMPLE SAMEA0000110
ASSEMBLYNAME ERS0000001 MarsupialExonCaptKit
ASSEMBLY_TYPE isolate
COVERAGE 127
PROGRAM "TCCP (docker: trust1/ubuntu:OMGv001)"
PLATFORM illumina
MINGAPLENGTH 50
MOLECULETYPE genomic DNA
FASTA 10001_AHGVYVBCX2_GAATCTC_S19_DD.fasta.gz

Where:

PRJEB00000 -- the project ID. This is the root project record, which you may have to add manually online when you register the dataset.

SAMEA0000110 -- the sample identifier. It can be taken either from the XML submission receipt (the log file) or downloaded from the ENA biosample summary page for the bioproject.

Hint: if you are missing those sample IDs, you can rerun the XML submission (see 1.3) with a modified "submission.xml".
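To illustrate how one row of manifest_meta_table.txt maps onto a manifest, here is a minimal sketch that reproduces the example manifest from a single tab-delimited row. It is not the code of manifestator.pl. The function name is hypothetical, and the fixed values (assembly type, program, platform, gap length, molecule type, the MarsupialExonCaptKit suffix) are simply copied from the example as displayed in this README.

```python
def manifest_text(row, study="PRJEB00000"):
    """Return (file name, manifest content) for one tab-delimited table row."""
    bpa_id, sample_id, biosample_id, fasta_name, coverage = row.rstrip("\n").split("\t")
    lines = [
        f"STUDY {study}",
        f"SAMPLE {biosample_id}",
        # Copied from the README example as displayed there.
        f"ASSEMBLYNAME {sample_id} MarsupialExonCaptKit",
        "ASSEMBLY_TYPE isolate",
        f"COVERAGE {coverage}",
        'PROGRAM "TCCP (docker: trust1/ubuntu:OMGv001)"',
        "PLATFORM illumina",
        "MINGAPLENGTH 50",
        "MOLECULETYPE genomic DNA",
        f"FASTA {fasta_name}",
    ]
    # Manifests are named by the ID in the first column.
    return f"{bpa_id}.manifest", "\n".join(lines) + "\n"


row = "10001\tERS0000001\tSAMEA0000110\t10001_AHGVYVBCX2_GAATCTC_S19_DD.fasta.gz\t127"
name, text = manifest_text(row)
print(name)
print(text, end="")
```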

2.3 Submit sequence data with webin-cli-3.1.0.jar

To upload the FASTA files you will need the submission program webin-cli-3.1.0.jar (or the latest version) from the ENA site: https://github.com/enasequence/webin-cli/releases

Here is a shell command to loop over the manifests and samples:

# Submit every manifest in the current directory, one at a time.
for file in *.manifest; do
  echo "$file" && java -jar webin-cli-3.1.0.jar \
    -userName Webin-Username \
    -password='somepasswordhere' \
    -context=genome \
    -manifest="/data/local/tmp/$file" \
    -inputdir=/data/local/tmp \
    -submit
done
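Before starting a long submission loop, it can help to confirm that every manifest points at a FASTA file that actually exists in the input directory. The sketch below does only that check. It assumes the manifests and FASTA files sit in the same directory, as in the loop above, and the function name is hypothetical.

```python
from pathlib import Path


def missing_fasta_files(directory):
    """Return (manifest name, FASTA name) pairs whose FASTA file is not in the directory."""
    base = Path(directory)
    missing = []
    for manifest in sorted(base.glob("*.manifest")):
        for line in manifest.read_text().splitlines():
            # Each manifest line is a field name followed by its value.
            parts = line.split(None, 1)
            if len(parts) == 2 and parts[0] == "FASTA":
                fasta_name = parts[1].strip()
                if not (base / fasta_name).is_file():
                    missing.append((manifest.name, fasta_name))
    return missing


for manifest_name, fasta_name in missing_fasta_files("."):
    print(f"{manifest_name}: missing {fasta_name}")
```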

2.4 Collect FASTA data IDs from receipts

Check the receipts for the line pattern success="true". Keep the receipts in case you need to do further work on or update those assembly files. Here is an example of a receipt file:

  % Total    % Received % Xferd  Average Speed   Time     Time     Time     Current
                                 Download Upload Total    Spent    Left     Speed
100  7322  100   542  100  6780     345    4323  0:00:01  0:00:01  --:--:-- 4666

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="receipt.xsl"?>
<RECEIPT receiptDate="2020-09-25T00:38:36.465+01:00" submissionFile="submission.xml" success="true">
<EXT_ID accession="SAMEA0000110" type="biosample">
<INFO>Submission has been committed. MODIFY


Refer to the ENA portal for more detailed documentation about the submission processes.
