Tutorial: How to upload your data to the Sequence Read Archive (SRA)?

Hi all,

I am writing this tutorial because THE WHOLE UPLOADING PROCESS IS KILLING ME!!!
I hope my sacrifice can save everybody time and effort in the future.
In this tutorial, I am going to show first-time users how to upload large RNA-Seq data (>10GB) to SRA and create a BioProject and BioSample.
You may ask "Why also BioProject and BioSample?" That's because they are prerequisites for an SRA submission. I will get to that point later.

Intro

At this point, your question may be "WHY WOULD I HAVE TO UPLOAD MY DATA?"
The answer is simple: for publication.
More and more journals are asking researchers to provide the original data upon submission. For example, here is the requirement for Scientific Reports under the Sharing datasets section:

Datasets must be made freely available to readers from the date of publication and must be provided to Editorial Board Members and referees at submission, for the purposes of evaluating the manuscript.

If you are worried about confidentiality, you can set a release date for your data (up to 4 years later) or release it upon publication, whichever comes first. The SRA can also generate a link for reviewers without making your data public.

Initial steps

Google "NCBI SRA", this is what you'll find:

Once you click on "how to submit", you will be led to the SRA Submission Quick Start.

If you look closely, I know you probably won't, but if you look closely, you will notice this line:

You can complete both your BioProject as well as BioSample submissions within SRA wizard in Submission Portal.

I am not lying to you. It's written right there on the Quick Start page.

However, if you are naive enough to follow their instructions, you will be stuck at Step 3 SRA metadata FOREVER, just like I was. Since you are completing the BioProject and BioSample submissions at the same time as the SRA submission, it is NATURAL that you will not yet have a sample_name or biosample_accession to put into the SRA_metadata. However, the system will keep warning you that neither sample_name nor biosample_accession is set and will not allow you to proceed to the next step in the SRA submission portal.

Updated on June 27, 2018 - Solution 1 (suggested by the SRA team)

You should leave the BioProject column blank. The other column asks for the sample_name, not the accession number.
Enter the name you gave each sample in the BioSample attributes spreadsheet. The names should match exactly.

I have not had the chance to upload my second dataset yet... I will verify Solution 1 later.
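If you try Solution 1, you can check the "match exactly" requirement on the command line before submitting. This is only a sketch under my own assumptions: the file names biosample_attributes.tsv and SRA_metadata.txt are made up, both files are assumed to be tab-delimited, and the header is assumed to be the first line that does not start with # and to contain a sample_name column (a leading * on the header is ignored).

```shell
# Sketch: list sample names that appear in one spreadsheet but not the other.
# File names and layout are assumptions; adjust them to your own files.

# Print the sorted, unique values of the sample_name column of a TSV file.
sample_names() {
  awk -F'\t' '
    { sub(/\r$/, "") }          # tolerate Windows line endings
    /^#/ { next }               # skip comment lines
    !col {                      # first non-comment line is the header
      for (i = 1; i <= NF; i++) {
        name = $i; sub(/^\*/, "", name)
        if (name == "sample_name") col = i
      }
      next
    }
    $col != "" { print $col }
  ' "$1" | sort -u
}

# Print names found in only one of the two files (no output = all match).
mismatched_names() {
  a=$(mktemp); b=$(mktemp)
  sample_names "$1" > "$a"
  sample_names "$2" > "$b"
  comm -3 "$a" "$b" | tr -d '\t'
  rm -f "$a" "$b"
}

# Hypothetical file names; the check only runs if both exist.
if [ -f biosample_attributes.tsv ] && [ -f SRA_metadata.txt ]; then
  mismatched_names biosample_attributes.tsv SRA_metadata.txt
fi
```

If the command prints nothing, every sample name appears in both files. Any name it does print exists in only one of them, which is usually a typo or a stray space.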

Solution 2 (Upload data separately)

Let's start with BioProject, following these instructions:

BioProject submission portal - part 1/2

Go to the Submission portal
Click on BioProject
Click on New submission
Step 1 Submitter: There shouldn't be any problem, just ask yourself who you are.
Step 2 Project Type: My Project data type is "Transcriptome or Gene expression" and my Sample scope is "Multiisolate" because I am comparing multiple individuals within the same species.
Step 3 Target: The only required slot is organism name.
Step 4 General Info: You will be asked to select a release date. It can be immediate, as far as 4 years later, or upon publication. Then you will have to enter a project title and description.
Step 5 BioSample: Here's the fun part! If you are a first-time user like me, you will have to click on register at BioSample first, then come back to this page later.

BioSample submission portal

Now you have been automatically directed to the BioSample submission.
Step 1 General Information: Again, here you can select a publication date.
Step 2 Sample Type: Mine is "Model organism or animal sample".
Step 3 Attributes: Now you are asked to type in every teeny-tiny detail of your sample.
Step 4 Title and Comments: No comments.
Step 5 Overview.
Shortly after you hit submit, you will receive your BioSample accession (the biosample_accession you will need for the SRA metadata) in the form of SAMN# by email, or you can check it later on the BioSample submission portal.
Now, you can go back to BioProject by clicking on Submission Portal on the upper left corner:
Then, select BioProject:

BioProject submission portal - part 2/2

Step 6 Publication: Link any existing publication. If there is none, go to the next step.
Step 7 Overview: If everything looks great, hit submit. Again, you will receive your BioProject accession in the form of PRJNA# by email, or you can check it later on the BioProject submission portal.

SRA submission portal

Hooray! Now we have both the BioProject accession (PRJNA#) and the biosample_accession (SAMN#). Let's get down to business.
Go to SRA submission portal.
Hit New submission.
Step 1 Submitter: yadi yadi yada.
Step 2 General Information: Now that you already have a BioProject, remember to put the project number in the Existing BioProject area.
Step 3 SRA metadata: FINALLY, we have reached the point where I was stuck previously. Now you need to enter every library prep. I suggest downloading the spreadsheet, filling it out, then saving it as Tab Delimited Text (.txt). Click Download Excel spreadsheet:

In the spreadsheet, there are detailed explanations for every column.
Of note, if you are dealing with paired-end reads, put both reads of a pair in the same row and use filename and filename2 for the two files. Each filename needs to be exactly the same (including the file extension) as the sequence file that you are going to upload later, e.g. t1.control_1_I23_GAGTGG_R1_combined.fastq.gz.
Once you've finished, stay on the same sheet and click Save As. Set File Format to Tab Delimited Text (.txt):
Then hit Save Active Sheet, because you only need SRA_metadata and not the other sheets such as Contact Info and Instructions and Library and Platform Terms.
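Because the file names in the spreadsheet must match the uploaded files exactly, it can be worth checking them on the machine where the fastq files live. A minimal sketch, assuming the saved file is called SRA_metadata.txt (my own name for it), that it is tab-delimited with the header on the first line, and that the file columns are named filename, filename2, and so on:

```shell
# Sketch: print every file named in the filename/filename2 columns of the
# metadata table that does not exist in the given directory.
# The metadata file name and its layout are assumptions.
missing_files() {
  meta=$1
  dir=$2
  awk -F'\t' '
    { sub(/\r$/, "") }          # tolerate Windows line endings
    NR == 1 {                   # remember which columns hold file names
      for (i = 1; i <= NF; i++) if ($i ~ /^filename[0-9]*$/) cols[i] = 1
      next
    }
    { for (i in cols) if ($i != "") print $i }
  ' "$meta" | while IFS= read -r f; do
    [ -e "$dir/$f" ] || printf '%s\n' "$f"
  done
}

# Hypothetical usage: compare the table against the current directory.
if [ -f SRA_metadata.txt ]; then
  missing_files SRA_metadata.txt .
fi
```

No output means every file listed in the table was found. Anything printed is either misspelled in the spreadsheet or missing from the folder.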

Step 4 Files: If your files are small, click on I will upload all the files now via HTTP/Aspera and start uploading. If, like mine, your files are over 10GB, you need to preload them. At this point, you will have to click on the purple Sequence Read Archive (SRA) to go back to the front page of Sequence Read Archive (SRA):

My files are stored on a supercomputer, and I am not going to download them to my local computer and then upload them through FTP. It's a waste of time. So I am going to upload them directly from the command line on the supercomputer. On the front page, click on command line upload and request a preload folder. Once you have a preload folder, your front page will look like this:

Click on Aspera command line upload, you will see the following instructions:

The red block is the email used in your NCBI account.
The blue block is a random_code generated by NCBI.
The orange rectangle is where to download your key file. You can put your key file in the same folder as all the fastq files to make the upload easier. Simply right-click the key file and select Copy Link Address. Log in to the supercomputer's Linux system, navigate to the folder where you stored your sequence data, and type wget <the link you copied>. It will download your key file to the current folder.

First, you need to install Aspera on your supercomputer. Second, make sure all your fastq files (they can be compressed using gzip or bzip2; in my case, gz files) are in the same folder as the key file.
Third, upload everything in your folder via Aspera. My command is:

  ~/.aspera/connect/bin/ascp -i <path/to/key_file> -QT -l10000m -k1 -d <path/to/file(s)> subasp@upload.ncbi.nlm.nih.gov:uploads/NCBI_account_email_<random_code>/<submission_folder>/

I use ~/.aspera/connect/bin/ascp because my Aspera is installed in the ~/.aspera folder. If Aspera is pre-installed on your system, just type ascp.
<random_code>: A random code for upload is provided by NCBI.
<path/to/key_file>: key file is provided by NCBI. Download it and save to the same folder where you put all the sequence files. It must be an absolute path, e.g.: /home/keys/aspera.openssh.
<path/to/file(s)>: An absolute path to your sequencing files on the supercomputer.
<submission_folder>: Name it as you want. It is required and will be created automatically.
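To make the placeholders concrete, here is a small pre-flight sketch. It is not part of NCBI's instructions: it first tests the .gz files with gzip -t (an optional check I am adding), then only prints the filled-in ascp command so you can review it before running it yourself. Every value in it (the /home/me/fastq folder, the email, the code AbCdEfGh, the folder name my_rnaseq_upload) is made up, and the function names are my own.

```shell
# Sketch of a pre-flight check. All paths, the email, and the code are made up.

# Report every .gz file in a directory that fails gzip's integrity test.
# Returns 1 if any file is corrupt, 0 otherwise.
check_gz() {
  status=0
  for f in "$1"/*.gz; do
    [ -e "$f" ] || continue     # the glob matched nothing
    if ! gzip -t "$f" 2>/dev/null; then
      printf 'corrupt: %s\n' "$f"
      status=1
    fi
  done
  return "$status"
}

# Print (not run) the ascp command with every placeholder filled in.
# Arguments: key file, data path, account email, random code, submission folder.
build_ascp_cmd() {
  printf 'ascp -i %s -QT -l10000m -k1 -d %s subasp@upload.ncbi.nlm.nih.gov:uploads/%s_%s/%s/\n' \
    "$1" "$2" "$3" "$4" "$5"
}

# Hypothetical values; replace them with your own.
data_dir=/home/me/fastq
if [ -d "$data_dir" ] && check_gz "$data_dir"; then
  build_ascp_cmd "$data_dir/aspera.openssh" "$data_dir" \
    me@example.edu AbCdEfGh my_rnaseq_upload
fi
```

Printing the command instead of executing it is deliberate: you can compare it against the instructions NCBI shows for your own account before anything is transferred. Paths containing spaces would need quoting before you paste the command.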

Once you have uploaded your files successfully, go back to the SRA submission portal Step 4 Files and select the preload folder. It will have the same name as you specified in <submission_folder> earlier.

Step 5 Overview: Congratulations! You have successfully uploaded your data to SRA and created a BioProject and BioSample.

I can't believe this whole process took me 3 days. I hope this tutorial can save you time and make your life easier.

Useful(?) links:
SRA Submission Quick Start
Troubleshooting SRA submission
SRA Metadata and Submission Overview
