Download_From_SRA

karlrl edited this page Oct 18, 2017 · 22 revisions

Downloading from the NCBI Sequence Read Archive (SRA)

Step 1 - Gather Run IDs:

Obtain a list of run accession numbers from the SRA. This can be done by searching for the SRA or SRP number in the SRA database (or by searching for the PRJ number in the BioProject database). See the NCBI's documentation for more information about what all the accession numbers mean and how they're linked. For example, searching for the SRA accession number SRA045646 yields 145 metagenomic experiments.

Once you have the search results you want, you can collect the run IDs for the experiments:

  1. One run ID per line, by selecting "Send to" -> "File" and choosing "Accession List" as the format. This method works if each sample has only one associated run.
  2. A table of run IDs with additional metadata, by choosing "RunInfo" as the format instead. This may be required if some samples needed multiple sequencing runs (e.g. the sequencing was split across multiple lanes). In that case, some extra effort is needed to merge the resulting FASTQs (not covered in this tutorial).

More information about this is here: https://www.ncbi.nlm.nih.gov/books/NBK158899/
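
If you exported the metadata table rather than the plain accession list, the run IDs can be pulled back out of it with standard tools. The sketch below assumes the table is a comma-separated file whose header row contains a column named Run, and that no field before that column contains an embedded comma. The function name run_ids_from_table and the file name SraRunInfo.csv are illustrative and not part of any NCBI tool.

```shell
# Print the values of the "Run" column from a comma-separated metadata table.
# Assumption: the header row has a column called "Run" and fields are unquoted.
run_ids_from_table() {
    awk -F, '
        # Header row: remember which column is named "Run"
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "Run") col = i; next }
        # Data rows: skip blank lines and any repeated header lines
        col && $col != "" && $col != "Run" { print $col }
    ' "$1"
}

# Usage (hypothetical file name):
#   run_ids_from_table SraRunInfo.csv > SraAccList.txt
```

The output has one run ID per line, so it can be used wherever SraAccList.txt is expected in the later steps.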

Step 2 - Download:

This step is technically optional, since fastq-dump can download and dump FASTQs in one go. Downloading first, however, is a simple way to keep network failures from interrupting later steps, for example when concatenating many runs that belong to a single sample. There are at least two ways to download the files.

Using prefetch (recommended)

NCBI's SRA Toolkit comes with a command named prefetch that takes a run accession as an argument and stores the run in a user folder (~/ncbi/public/sra/). To use prefetch to download all the files, wrap it in a shell script loop or use parallel:

parallel -j 1 prefetch {} ::: $(cat SraAccList.txt)
  • The -j 1 sets the number of jobs that parallel runs at once. A value of 1 downloads one file at a time (simultaneous downloads may be faster, depending on your computer and network).
  • The ::: $(cat SraAccList.txt) passes the contents of SraAccList.txt as arguments to the parallel command. This assumes that all the SRR IDs are in the file SraAccList.txt that was downloaded in Step 1.
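
For readers without parallel installed, the plain shell loop alternative might look like the following sketch. The function name prefetch_all is illustrative. It assumes prefetch is on the PATH and that the accession list holds one run ID per line.

```shell
# Download every run listed in a file (one accession per line) with prefetch.
# Failed runs are reported on stderr and make the function return non-zero.
prefetch_all() {
    list=$1
    failed=0
    # The "|| [ -n ... ]" part also handles a last line without a newline
    while IFS= read -r acc || [ -n "$acc" ]; do
        # Skip blank lines
        [ -n "$acc" ] || continue
        # Redirect stdin so the command cannot consume the accession list
        if ! prefetch "$acc" < /dev/null; then
            echo "prefetch failed for $acc" >&2
            failed=1
        fi
    done < "$list"
    return "$failed"
}

# Usage:
#   prefetch_all SraAccList.txt
```

Because the loop continues after a failure, one bad accession does not stop the remaining downloads.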

Using wget

Download the files using wget. You can form the URL for each file like so (note that the first three characters of the identifier, SRR, and then the first six, SRR304, are used as nested subdirectories):

wget ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR304/SRR304976/SRR304976.sra

Since we don't want to do that manually for each file we can get parallel to help:

parallel -j 1 \
   wget -P sra ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/{='$_=substr($_,0,3)'=}/{='$_=substr($_,0,6)'=}/{}/{}.sra \
   ::: $(cat SraAccList.txt)
  • The -P sra specifies that the downloaded files should be placed in the directory 'sra'.
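
To make the directory layout explicit, the URL can also be assembled in plain shell, which is handy for checking a single accession before launching the full download. This is only a sketch that mirrors the example URL above: sra_url is a hypothetical helper name, and the FTP layout is assumed to be exactly as shown there.

```shell
# Build the FTP URL for a run accession, following the layout of the
# example URL: <first 3 chars>/<first 6 chars>/<accession>/<accession>.sra
sra_url() {
    acc=$1
    prefix3=$(printf '%s' "$acc" | cut -c1-3)   # e.g. SRR
    prefix6=$(printf '%s' "$acc" | cut -c1-6)   # e.g. SRR304
    printf 'ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/%s/%s/%s/%s.sra\n' \
        "$prefix3" "$prefix6" "$acc" "$acc"
}

# Usage:
#   wget -P sra "$(sra_url SRR304976)"
```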

Step 3 - Convert to FASTQ:

Here, we use the SRA Toolkit's fastq-dump command. If you used prefetch above, or if you skipped the download step, pass the run accessions to fastq-dump directly:

parallel -j 1 fastq-dump --skip-technical -F --split-files -O fastq {} ::: $(cat SraAccList.txt)
  • -j 1 sets the number of jobs that parallel runs at once. Increase this number to process several files in parallel.
  • -F specifies that the original read IDs be used (instead of the ones assigned by the SRA).
  • --skip-technical skips technical reads. Some sequencing technologies produce other reads besides forward and reverse; this option leaves those out.
  • --split-files writes the forward and reverse reads to separate files.
  • -O fastq specifies the directory in which to place the converted FASTQ files.
  • --gzip can be added if you would like the FASTQ files to be gzipped (this saves space, but the conversion takes much longer).
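
After the conversion finishes, a quick check that every run produced output can catch silent failures. The sketch below assumes the file naming that --split-files produces (ACCESSION_1.fastq, or ACCESSION_1.fastq.gz with --gzip) and the fastq output directory used above. missing_fastq is a hypothetical helper name.

```shell
# List accessions that have no first-read FASTQ file in the output directory.
# Arguments: <accession list> <fastq directory>
missing_fastq() {
    while IFS= read -r acc || [ -n "$acc" ]; do
        # Skip blank lines
        [ -n "$acc" ] || continue
        # Accept either the plain or the gzipped output name
        if [ ! -e "$2/${acc}_1.fastq" ] && [ ! -e "$2/${acc}_1.fastq.gz" ]; then
            echo "$acc"
        fi
    done < "$1"
}

# Example:
#   missing_fastq SraAccList.txt fastq
```

Any accession it prints can be run through fastq-dump again on its own.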

Otherwise, if you used wget, the command will be similar:

parallel -j 1 fastq-dump --skip-technical -F --split-files -O fastq {} ::: sra/*.sra
  • ::: sra/*.sra passes the .sra files downloaded in Step 2 to parallel as arguments, one per fastq-dump job.

A nice explanation of other fastq-dump options is provided by Rob Edwards' group: https://edwards.sdsu.edu/research/fastq-dump/
