# Download sequence data from the NCBI Sequence Read Archive (SRA)

## Overview

DNA sequence data are typically deposited in the NCBI Sequence Read Archive (SRA) and can be accessed through the SRA website or via a set of command-line tools called the SRA Toolkit. Individual sequence entries are assigned an accession ID, which can be used to find and download a particular file. For example, if you go to the [SRA database](https://www.ncbi.nlm.nih.gov/sra) in a browser window and search for `SRX15695630`, you should see an entry for _C. elegans_. Here we use one tool from the SRA Toolkit, `fasterq-dump`, to download a few fastq files, and then do the same thing with a wrapper package, the [ipyrad-analysis toolkit](https://ipyrad.readthedocs.io/en/latest/API-analysis/index.html).

### 1) Download SRA data using SRA Toolkit

First, install dependencies, including mamba (you could also use conda)

In [None]:
!curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh
!bash Mambaforge-$(uname)-$(uname -m).sh -b -p $HOME/mambaforge

In [None]:
!~/mambaforge/bin/mamba install -c bioconda -c conda-forge sra-tools==2.11.0 ipyrad -y

Test that your install works and that fasterq-dump is available in your path

In [None]:
!fasterq-dump -h

Set up your directory structure for the raw fastq data

In [None]:
!mkdir -p data
!mkdir -p data/raw_fastq-fasterq-dump
!mkdir -p data/raw_fastq-ipyrad

In [None]:
cd data/raw_fastq-fasterq-dump

Now we need a list of accession IDs to download. You can find these in papers or by searching SRA directly. Here we are going to use sequence data from three fastq files of SARS-CoV-2.

In [None]:
!for i in {1..3}; do echo "SRR1408688$i"; done > list_of_accesionIDS.txt
!cat list_of_accesionIDS.txt
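
The same list can be built in Python, which may be easier to adapt when the accession numbers are not consecutive. The following is a minimal sketch; `make_accession_ids` and `write_accession_file` are hypothetical helper names used only for illustration and are not part of the SRA Toolkit.

```python
from pathlib import Path


def make_accession_ids(prefix, suffixes):
    """Return accession IDs formed by appending each suffix to the prefix."""
    return [f"{prefix}{s}" for s in suffixes]


def write_accession_file(ids, path):
    """Write one accession ID per line, the same layout the shell loop produces."""
    Path(path).write_text("\n".join(ids) + "\n")
    return path


# Same three runs as the shell loop: SRR14086881 to SRR14086883
ids = make_accession_ids("SRR1408688", range(1, 4))
print(ids)
```

Calling `write_accession_file(ids, "list_of_accesionIDS.txt")` would then write the file used in the next step.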

The SRA Toolkit doesn't run in batch mode, so one way to run a command on multiple samples is to use a simple for loop. Once you run the following cell, it will take about 15 minutes to finish downloading the three samples.

In [None]:
!for x in `cat list_of_accesionIDS.txt`; do fasterq-dump $x; done
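
If you prefer to drive the loop from Python, something like the sketch below could work. It assumes `fasterq-dump` is on your PATH and uses its `-O` (output directory) and `-e` (threads) options; the helper names `build_fasterq_command` and `download_all` are hypothetical. With `check=True`, the loop stops at the first failed download rather than continuing silently.

```python
import shutil
import subprocess


def build_fasterq_command(accession, outdir=".", threads=4):
    """Build the fasterq-dump argument list for one accession."""
    return ["fasterq-dump", accession, "-O", outdir, "-e", str(threads)]


def download_all(accessions, outdir=".", threads=4):
    """Run fasterq-dump once per accession; raise if any download fails."""
    if shutil.which("fasterq-dump") is None:
        raise RuntimeError("fasterq-dump not found on PATH")
    for acc in accessions:
        subprocess.run(build_fasterq_command(acc, outdir, threads), check=True)


# Print the commands without running them, as a dry run
for acc in ["SRR14086881", "SRR14086882", "SRR14086883"]:
    print(" ".join(build_fasterq_command(acc)))
```

Calling `download_all([...])` with the same list would start the actual downloads.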

### 2) Download SRA data using ipyrad-analysis toolkit

In [None]:
import ipyrad.analysis as ipa

From an sratools object you can fetch just the run info, or you can download the files as well. Here we call `.run()` to download the data into a designated workdir. The `name_fields` argument controls how the files are named, using fields from the `fetch_runinfo` table. The `accessions` argument here is a list of the SRR numbers; you could also read this list from a text file. Here we will download three fastq files of SARS-CoV-2, which should run fairly quickly.

In [None]:
cd ~/data/raw_fastq-ipyrad

In [None]:
list_of_srrs = ['SRR14086881','SRR14086882','SRR14086883']
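
As noted above, the accession list could also come from a text file with one ID per line. A minimal sketch follows; `read_accessions` is a hypothetical helper, and the path in the commented usage line is an assumption about where the list was saved.

```python
from pathlib import Path


def read_accessions(path):
    """Return the non-empty, whitespace-stripped lines of a text file as a list."""
    lines = Path(path).read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]


# Hypothetical usage, assuming the list written earlier sits in the sibling directory:
# list_of_srrs = read_accessions("../raw_fastq-fasterq-dump/list_of_accesionIDS.txt")
```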

In [None]:
# new sra object
sra2 = ipa.sratools(accessions=list_of_srrs, workdir=".")

# call download (run) function
sra2.run(auto=True, name_fields=(1,30))

### Check the data files
You can see that the files were named according to the SRR and species name in the table. The intermediate .sra files were removed and only the fastq files were saved. 

In [None]:
#you can also navigate in the menu on the left
!ls -l downloaded
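
The same check can be done from Python if you want the file names and sizes in a variable for later steps. This is a small sketch; `list_fastq_files` is a hypothetical helper, and the directory you pass should be wherever your fastq files were written.

```python
from pathlib import Path


def list_fastq_files(directory="."):
    """Return (file name, size in bytes) for each .fastq or .fastq.gz file, sorted by name."""
    files = []
    for p in sorted(Path(directory).iterdir()):
        if p.is_file() and p.name.endswith((".fastq", ".fastq.gz")):
            files.append((p.name, p.stat().st_size))
    return files


# Hypothetical usage, mirroring the `ls` call above:
# for name, size in list_fastq_files("downloaded"):
#     print(f"{size:>12d}  {name}")
```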

In [None]:
# now copy the data back to a cloud bucket for future use
!gsutil cp SRR14086881_MSHSPSP-RNP263.fastq gs://<bucket-name>

### 3) Copy directly from Google Cloud Storage
SRA data is now housed on GCP and AWS, and you can download directly from the cloud buckets using gsutil. These data are in what are called user pays buckets (Google Cloud calls this Requester Pays), which means that you will incur a small charge to download them.
Data is organized in buckets by run ID rather than by accession, so to find the run ID, go to the [SRA Run Selector](https://www.ncbi.nlm.nih.gov/Traces/study/) and search for your accession number.

Once you have the run ID, you can use gsutil to copy the data either locally or to a bucket. The only part you need to modify for the user pays buckets is to add your project ID with `-u`, so that the bucket can charge you for access. The command looks like this:
```
gsutil -u <PROJECT-ID> cp -r gs://sra-pub-src/<YOUR-RUN-ID>/ <DESTINATION>
```
The catch is that the bucket name may vary slightly. To find the exact path, click on the run ID on the Run Selector page, then go to Data Access and scroll down to get the exact path. For some projects, like 1000 Genomes, you will only have an FTP site rather than a gs path, so you would need to use curl or wget to download the files instead. You may find that in some cases the SRA Toolkit is a lot easier for bulk downloads.
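
Because the bucket name varies between runs, it can help to assemble the gsutil command from its parts. The sketch below only builds and prints the command. `sra_source_path` and `build_gsutil_copy` are hypothetical helper names, and the bucket, run ID, and project ID shown are placeholders that you would replace with the values from the Data Access tab and your own project.

```python
import subprocess


def sra_source_path(bucket, run_id):
    """Return the gs:// path for a run, given the bucket name copied from Data Access."""
    return f"gs://{bucket}/{run_id}/"


def build_gsutil_copy(project_id, source, destination):
    """Build a recursive gsutil copy command that bills the given project via -u."""
    return ["gsutil", "-u", project_id, "cp", "-r", source, destination]


# Placeholders only; substitute your own project ID, bucket, and run ID
cmd = build_gsutil_copy("<PROJECT-ID>", sra_source_path("sra-pub-src", "<YOUR-RUN-ID>"), ".")
print(" ".join(cmd))

# To run the copy, uncomment the next line:
# subprocess.run(cmd, check=True)
```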

Download a dog metagenome fastq using gsutil

In [None]:
!gsutil -u cit-oconnellka-1212 cp -r gs://sra-pub-src-10/SRR19658964/ .

Use wget to download a SARS-CoV-2 fastq from a storage URL

In [None]:
!wget https://storage.googleapis.com/nih-sequence-read-archive/sra-src/SRR14086881/MSHSPSP-RNP263_R2_001.fastq.gz.1