# Download sequence data from the NCBI Sequence Read Archive (SRA)

## Overview

DNA sequence data are typically deposited in the NCBI Sequence Read Archive (SRA) and can be accessed through the SRA website or through a set of command-line tools called the SRA Toolkit. Each sequence entry is assigned an accession ID, which can be used to find and download a particular file. For example, if you go to the [SRA database](https://www.ncbi.nlm.nih.gov/sra) in a browser window and search for `SRX15695630`, you should see an entry for _C. elegans_. Here we will use tools from the SRA Toolkit to download a few fastq files, which can then be copied to a cloud bucket.

### 1) Download SRA data using SRA Toolkit

First, install the dependencies, including mamba (you could also use conda). At the time of writing, the version of SRA tools available through the Anaconda distribution was v2.11.0. If you want the latest version, download and install it from [here](https://github.com/ncbi/sra-tools/wiki/01.-Downloading-SRA-Toolkit). If you install directly, you will also need to configure the toolkit interactively following [this guide](https://github.com/ncbi/sra-tools/wiki/05.-Toolkit-Configuration). You can do that by opening a terminal and running the commands there.

In [1]:
!curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh
!bash Mambaforge-$(uname)-$(uname -m).sh -b -p $HOME/mambaforge

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 88.1M  100 88.1M    0     0  39.4M      0  0:00:02  0:00:02 --:--:-- 49.5M
ERROR: File or directory already exists: '/home/jupyter/mambaforge'
If you want to update an existing installation, use the -u option.


In [2]:
!~/mambaforge/bin/mamba install -c bioconda -c conda-forge sra-tools==2.11.0 -y


                  __    __    __    __
                 /  \  /  \  /  \  /  \
                /    \/    \/    \/    \
███████████████/  /██/  /██/  /██/  /████████████████████████
              /  / \   / \   / \   / \  \____
             /  /   \_/   \_/   \_/   \    o \__,
            / _/                       \_____/  `
            |/
        ███╗   ███╗ █████╗ ███╗   ███╗██████╗  █████╗
        ████╗ ████║██╔══██╗████╗ ████║██╔══██╗██╔══██╗
        ██╔████╔██║███████║██╔████╔██║██████╔╝███████║
        ██║╚██╔╝██║██╔══██║██║╚██╔╝██║██╔══██╗██╔══██║
        ██║ ╚═╝ ██║██║  ██║██║ ╚═╝ ██║██████╔╝██║  ██║
        ╚═╝     ╚═╝╚═╝  ╚═╝╚═╝     ╚═╝╚═════╝ ╚═╝  ╚═╝

        mamba (0.22.1) supported by @QuantStack

        GitHub:  https://github.com/mamba-org/mamba
        Twitter: https://twitter.com/QuantStack

█████████████████████████████████████████████████████████████


Looking for: ['sra-tools==2.11.0']


Test that the install works and that `fasterq-dump` is available on your PATH.

In [None]:
!fasterq-dump -h
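
If `fasterq-dump -h` fails, it can help to check which of the toolkit commands are visible at all. The following is a minimal sketch that uses only the Python standard library. The helper name `missing_tools` is made up for illustration, and the list of commands is simply the set used in this notebook.

```python
import shutil

def missing_tools(tools):
    """Return the subset of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]

# Commands used later in this notebook (an assumption about what you need).
required = ["prefetch", "fasterq-dump", "fastq-dump", "vdb-dump"]

missing = missing_tools(required)
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All SRA Toolkit commands found.")
```

If something is reported missing, one likely cause is that the `bin` directory of the mamba install (here `$HOME/mambaforge/bin`) is not on the PATH seen by the notebook.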

In [12]:
cd ~/NIHCloudLabGCP/tutorials/notebooks/SRADownload/

/home/jupyter/NIHCloudLabGCP/tutorials/notebooks/SRADownload


Set up the directory structure for the raw fastq data.

In [None]:
!mkdir -p data data/fasterqdump/raw_fastq data/ipyrad/raw_fastq data/prefetch_fastqdump/raw_fastq data/prefetch_fasterqdump/raw_fastq

### Create Accession list and figure out how much space you need on your system

Here we will download nine SARS-CoV-2 fastq files. You will need a list of accession IDs to download, which you can find in papers or by searching SRA directly.

In [13]:
!for i in {1..9}; do echo "SRR1408688$i"; done > ~/NIHCloudLabGCP/tutorials/notebooks/SRADownload/data/list_of_accesionIDS.txt
!cat ~/NIHCloudLabGCP/tutorials/notebooks/SRADownload/data/list_of_accesionIDS.txt

SRR14086881
SRR14086882
SRR14086883
SRR14086884
SRR14086885
SRR14086886
SRR14086887
SRR14086888
SRR14086889
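
The same list can be built in Python, which may be easier to adapt when the accessions do not follow a simple numeric pattern. This is a sketch only. The helper names `accession_range` and `write_accessions` are hypothetical, and the prefix is the one used in the shell loop above.

```python
from pathlib import Path

def accession_range(prefix, start, stop):
    """Return accession IDs made by appending start..stop (inclusive) to prefix."""
    return [f"{prefix}{i}" for i in range(start, stop + 1)]

def write_accessions(accessions, path):
    """Write one accession per line, creating parent directories if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(accessions) + "\n")

accessions = accession_range("SRR1408688", 1, 9)
print(len(accessions), "accessions, from", accessions[0], "to", accessions[-1])
```

Calling `write_accessions(accessions, "data/list_of_accesionIDS.txt")` from the notebook directory would then produce the same file as the shell loop.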


Let's get the info for one of those accession numbers.

In [14]:
!vdb-dump --info SRR14086881

acc    : SRR14086881
path   : https://storage.googleapis.com/nih-sequence-read-archive/run/SRR14086881/SRR14086881
size   : 2,328,919
type   : Database
platf  : SRA_PLATFORM_ILLUMINA
SEQ    : 11,748
SCHEMA : NCBI:align:db:alignment_sorted#1.3
TIME   : 0x00000000605f7719 (03/27/2021 18:19)
FMT    : FASTQ
FMTVER : 2.9.1
LDR    : latf-load.2.9.1
LDRVER : 2.9.1
LDRDATE: Jun 15 2018 (6/15/2018 0:0)


`path` shows the location of the data. In this case the data are on GCP, but some files are on AWS, some are on-premises, and others are in multiple places.
`size` shows the size of the compressed file in bytes. You should plan to have about 10x that size available on your system, so to get a rough estimate of the space you need, look at a few files, convert to MB/GB, then multiply by the number of accessions. In this case, we are converting the .sra file to forward and reverse files of ~4 MB each.
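
The rule of thumb above can be written out as a small calculation. This sketch assumes you have copied the `size` field from `vdb-dump --info` by hand. The function names are illustrative, and the factor of 10 is the rough multiplier suggested in the text, not a guarantee.

```python
def parse_size(size_field):
    """Convert a vdb-dump size string such as '2,328,919' to bytes."""
    return int(size_field.replace(",", ""))

def estimate_space_bytes(sampled_sizes, n_accessions, factor=10):
    """Mean sampled size x safety factor x number of accessions."""
    mean_size = sum(sampled_sizes) / len(sampled_sizes)
    return mean_size * factor * n_accessions

# Only one accession was inspected above; sampling several gives a better mean.
sampled = [parse_size("2,328,919")]
needed = estimate_space_bytes(sampled, n_accessions=9)
print(f"Plan for roughly {needed / 1e6:.0f} MB")
```

Because runs in the same project can differ a lot in size, an estimate based on a single small accession is likely to be too low.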

Check the amount of space on your VM. If it is not enough, stop the machine, resize it, and come back here.

In [15]:
!df -h .

Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb         98G   22G   77G  22% /home/jupyter
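
The same check can be done from Python, which makes it easy to compare the free space against an estimate. This is a minimal sketch using `shutil.disk_usage` from the standard library. The helper `has_room` and the 1 GiB threshold are made up for illustration.

```python
import shutil

GIB = 1024 ** 3  # df -h reports sizes in powers of 1024

def has_room(path, needed_bytes):
    """True if the filesystem holding `path` has at least needed_bytes free."""
    return shutil.disk_usage(path).free >= needed_bytes

free_gib = shutil.disk_usage(".").free / GIB
print(f"Free space here: {free_gib:.1f} GiB")
print("Room for 1 GiB of downloads:", has_room(".", 1 * GIB))
```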


### Test standalone fasterq-dump

`fasterq-dump` is the replacement for the legacy `fastq-dump` tool (see below). You can read [this guide](https://github.com/ncbi/sra-tools/wiki/HowTo:-fasterq-dump) for full details on the tool. You can also run `fasterq-dump -h` to see most of the options.

In [16]:
cd data/fasterqdump/

/home/jupyter/NIHCloudLabGCP/tutorials/notebooks/SRADownload/data/fasterqdump


`fasterq-dump` doesn't run in batch mode, so one way to run it on multiple samples is a for loop. There are many options you can explore, but here we use `-f` to overwrite existing output, `-O` for the output directory, `-e` for the number of threads, `-m` for memory (4 GB), and `--location` for the location we want to retrieve the file from. Depending on the type of cloud storage, it may be faster to select `NCBI` for the location, so consider running a few tests with one or two of your accession numbers before downloading a whole batch.

The default number of threads is 6, so adjust `-e` based on your machine size. For large files, you may also benefit from a machine type with more memory and/or threads; to change it, stop this VM, resize it, then restart and come back. Note that there are several ways to split your fastq files (defined [here](https://github.com/ncbi/sra-tools/wiki/HowTo:-fasterq-dump)), but the default of `--split-3` splits into forward, reverse, and unpaired reads.

In [19]:
%%time
!for x in `cat ../list_of_accesionIDS.txt`; do fasterq-dump -f -O raw_fastq -e 8 -m 4G --location GCP $x ; done

spots read      : 11,748
reads read      : 23,496
reads written   : 23,496
spots read      : 1,277,437
reads read      : 2,554,874
reads written   : 2,554,874
spots read      : 1,388,190
reads read      : 2,776,380
reads written   : 2,776,380
spots read      : 272,200
reads read      : 544,400
reads written   : 272,211
reads 0-length  : 272,189
spots read      : 644,160
reads read      : 1,288,320
reads written   : 1,288,320
spots read      : 1,181,998
reads read      : 2,363,996
reads written   : 2,363,996
spots read      : 13,614
reads read      : 27,228
reads written   : 27,228
spots read      : 176
reads read      : 352
reads written   : 352
spots read      : 451
reads read      : 902
reads written   : 902
CPU times: user 2.46 s, sys: 497 ms, total: 2.95 s
Wall time: 2min 30s
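
The summary lines printed by the loop are not labelled by accession, so it is easy to miss a run where reads were dropped (the fourth block above has a `reads 0-length` line). A sketch of one way to tabulate that output is below. It assumes the output has been captured as text and that every run starts with a `spots read` line, as in the output shown here. The function names are hypothetical.

```python
import re

def parse_summary(text):
    """Group 'label : number' lines into one dict per run."""
    runs = []
    for line in text.splitlines():
        match = re.match(r"\s*(.+?)\s*:\s*([\d,]+)\s*$", line)
        if not match:
            continue  # skip timing lines and anything else
        label = match.group(1)
        value = int(match.group(2).replace(",", ""))
        if label == "spots read":
            runs.append({})  # a new run starts here
        if runs:
            runs[-1][label] = value
    return runs

def runs_with_dropped_reads(runs):
    """Return indices of runs where fewer reads were written than read."""
    return [i for i, run in enumerate(runs)
            if run.get("reads written", 0) < run.get("reads read", 0)]

# Two of the blocks from the output above.
captured = """\
spots read      : 11,748
reads read      : 23,496
reads written   : 23,496
spots read      : 272,200
reads read      : 544,400
reads written   : 272,211
reads 0-length  : 272,189
Wall time: 2min 30s
"""

runs = parse_summary(captured)
print("Runs parsed:", len(runs))
print("Runs with dropped reads (0-based):", runs_with_dropped_reads(runs))
```

Because the loop processes accessions in file order, the index of a flagged run maps back to the corresponding line of the accession list.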


### Test prefetch with legacy fastq-dump

In [9]:
cd ../prefetch_fastqdump

/home/jupyter/NIHCloudLabGCP/tutorials/notebooks/SRADownload/data/prefetch_fastqdump


First run the prefetch tool. It pulls the compressed .sra (ETL) files, which then need to be converted. You can run prefetch with an explicit list of accessions, or feed in a text file with the `--option-file` option. Here we run it with `-v` for verbose output. Try changing `--location` between AWS, GCP, and NCBI to see where prefetch pulls the data from and to compare download times between locations. Note that these accessions are only stored on GCP. What happens if you put AWS as the location?

In [None]:
%%time
!prefetch --option-file ../list_of_accesionIDS.txt -O raw_fastq -f yes --location AWS -v

Now we will use the legacy fastq-dump tool to convert to fastq files. This is an older, single-threaded tool, so it will take a few minutes to convert the .sra files to fastq. See the full list of options by typing `fastq-dump -h`.

In [None]:
%%time
!for x in `cat ../list_of_accesionIDS.txt`; do fastq-dump -O raw_fastq --split-3 $x; done

### Test Prefetch plus fasterq-dump

Now we are going to do the same thing, but convert the ETL files using fasterq-dump. For this to work, you need to either give the path to the prefetch directories in your text file, or cd into the raw_fastq directory so that fasterq-dump can find the directories containing the .sra files. In this case, `--location GCP` is a lot faster than NCBI, but feel free to run your own tests with different locations.

In [None]:
cd ../prefetch_fasterqdump

In [None]:
%%time
!prefetch --option-file ../list_of_accesionIDS.txt -O raw_fastq -f yes --location GCP

In [None]:
cd raw_fastq/

Here we won't specify a location, since we are just converting the prefetched records.

In [None]:
%%time
!for x in `cat ../../list_of_accesionIDS.txt`; do fasterq-dump -f -O raw_fastq -e 8 -m 4G $x; done
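
The two-step pattern (prefetch, then fasterq-dump) can also be assembled in Python, which makes it easier to change threads or locations in one place. The sketch below only builds and prints the commands and does not run them. It uses the same flags as the cells above. The function names and defaults are illustrative assumptions.

```python
import shlex

def prefetch_cmd(list_file, outdir, location=None):
    """Build a prefetch command that reads accessions from list_file."""
    cmd = ["prefetch", "--option-file", list_file, "-O", outdir, "-f", "yes"]
    if location:
        cmd += ["--location", location]
    return cmd

def fasterq_cmd(accession, outdir, threads=8, mem="4G"):
    """Build a fasterq-dump command for one accession."""
    return ["fasterq-dump", "-f", "-O", outdir,
            "-e", str(threads), "-m", mem, accession]

accessions = [f"SRR1408688{i}" for i in range(1, 10)]

# Print the commands so they can be reviewed before running anything.
print(shlex.join(prefetch_cmd("../list_of_accesionIDS.txt", "raw_fastq", "GCP")))
for accession in accessions[:2]:
    print(shlex.join(fasterq_cmd(accession, "raw_fastq")))
```

To execute a command, pass the list to `subprocess.run(cmd, check=True)` from the directory where the shell version would be run. Keeping the commands as lists avoids shell quoting problems.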

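### Copy data to a bucket
First create the bucket. Bucket names are globally unique, so replace `cloud-lab-tutorials` with a name of your own in the commands below.

In [None]:
!gsutil mb gs://cloud-lab-tutorials/

Then copy the fastq files to the bucket. Using `-m` allows multithreading, and `-r` allows recursive copy of a directory.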

In [None]:
!gsutil -m cp -r raw_fastq/*.fastq gs://cloud-lab-tutorials/raw_fastq/

In [None]:
!gsutil ls gs://cloud-lab-tutorials/raw_fastq/

### Summary
The easiest single command for downloading SRA files is fasterq-dump, which accesses the ETL files and converts them to fastq all in one command. In our example, with 8 threads, this took about 2.5 minutes to download the data from GCP. If, however, we first use prefetch to download the ETL files and then use fasterq-dump to convert them, the whole process takes about 20 seconds. Thus, the best solution is to run two commands: first prefetch, then fasterq-dump.

### Clean up
Make sure you shut down this VM, or delete it if you don't plan to use it further.