# Download sequence data from the NCBI Sequence Read Archive (SRA)

## Overview

DNA sequence data are typically deposited into the NCBI Sequence Read Archive (SRA) and can be accessed through the SRA website or via a set of command-line tools called the SRA Toolkit. Each sequence entry is assigned an accession ID, which can be used to find and download a particular file. For example, if you go to the [SRA database](https://www.ncbi.nlm.nih.gov/sra) in a browser window and search for `SRX15695630`, you should see an entry for _C. elegans_. Here we use one tool from the SRA Toolkit, `fasterq-dump`, to download a few fastq files, and then do the same thing with a wrapper package, the [ipyrad-analysis toolkit](https://ipyrad.readthedocs.io/en/latest/API-analysis/index.html). Finally, we show you how to copy data directly from an SRA S3 bucket.

### 1) Download SRA data using SRA Toolkit

First, install the dependencies, including mamba (you could also use conda).

In [None]:
!curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh
!bash Mambaforge-$(uname)-$(uname -m).sh -b -p $HOME/mambaforge

In [None]:
!~/mambaforge/bin/mamba install -c bioconda -c conda-forge sra-tools==2.11.0 ipyrad -y

Test that the install worked and that `fasterq-dump` is available on your path.

In [None]:
!fasterq-dump -h

Set up your directory structure for the raw fastq data.

In [None]:
cd ~/SageMaker/

In [None]:
!mkdir -p data
!mkdir -p data/raw_fastq-fasterq-dump
!mkdir -p data/raw_fastq-ipyrad

In [None]:
cd data/raw_fastq-fasterq-dump

Now we need a list of accession IDs to download. You can find these in papers or by searching SRA directly. Here we use sequence data from Cushman et al., <em><a href='https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8191103/'>Increased whiB7 expression and antibiotic resistance in Mycobacterium chelonae carrying two prophages</a></em>. The next cell generates a text file with two accession numbers. The fastq files are large, so make sure you have at least 100 GB of EBS storage on this instance.

In [None]:
!for i in {22..23}; do echo "SRR133491$i"; done > list_of_accesionIDS.txt
!cat list_of_accesionIDS.txt
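The same list can be built in Python rather than with a shell loop. This is a minimal sketch: `make_accessions` and `write_accessions` are hypothetical helper names, not part of the SRA Toolkit or ipyrad, and the sketch assumes the accessions share a prefix and differ only in a trailing run of consecutive numbers, as in the cell above.

```python
from pathlib import Path


def make_accessions(prefix, first, last):
    """Return prefix + number for every number from first to last, inclusive."""
    return [f"{prefix}{number}" for number in range(first, last + 1)]


def write_accessions(accessions, path):
    """Write one accession per line to a text file (hypothetical helper)."""
    Path(path).write_text("\n".join(accessions) + "\n")


# Same IDs as the shell loop over {22..23}
accessions = make_accessions("SRR133491", 22, 23)
print(accessions)  # ['SRR13349122', 'SRR13349123']
```

Calling `write_accessions(accessions, "list_of_accesionIDS.txt")` would then produce the same file as the shell cell.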

The SRA Toolkit doesn't run in batch mode, so one way to run a command on multiple samples is a simple for loop. Once you run the following cell, it will take about 15 min to finish downloading the two samples.

In [None]:
!for x in `cat list_of_accesionIDS.txt`; do fasterq-dump $x; done
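The loop above can also be driven from Python, which makes it easier to set an output directory or stop on the first failure. The sketch below is one possible approach, not the notebook's method: `build_fasterq_commands` and `download_all` are hypothetical helpers, and the defaults for `outdir` and `threads` are assumptions. It defaults to a dry run that only prints the commands, so nothing is downloaded unless you pass `dry_run=False`.

```python
import shutil
import subprocess


def build_fasterq_commands(accessions, outdir=".", threads=4):
    """Build one fasterq-dump argument list per accession."""
    return [
        ["fasterq-dump", accession, "--outdir", outdir, "--threads", str(threads)]
        for accession in accessions
    ]


def download_all(accessions, outdir=".", threads=4, dry_run=True):
    """Run fasterq-dump once per accession; with dry_run=True only print."""
    commands = build_fasterq_commands(accessions, outdir, threads)
    for command in commands:
        if dry_run:
            print("would run:", " ".join(command))
        elif shutil.which("fasterq-dump") is None:
            raise FileNotFoundError("fasterq-dump was not found on the PATH")
        else:
            # check=True raises CalledProcessError if a download fails
            subprocess.run(command, check=True)
    return commands


# Dry run: prints the two commands without downloading anything
download_all(["SRR13349122", "SRR13349123"])
```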

### 2) Download SRA data using ipyrad-analysis toolkit

In [None]:
import ipyrad.analysis as ipa

From an sratools object you can fetch just the run info, or you can download the files as well. Here we call `.run()` to download the data into a designated workdir. The `name_fields` argument controls how the files are named, using fields from the `fetch_runinfo` table. The `accessions` argument here is a list of SRR numbers; you could also read the list from a text file. Here we download three SARS-CoV-2 fastq files, which runs fairly quickly.

In [None]:
cd /home/ec2-user/SageMaker/data/raw_fastq-ipyrad

In [None]:
list_of_srrs = ['SRR14086881','SRR14086882','SRR14086883']
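If the accessions live in a text file instead, they can be read into a list first. This is a small sketch, assuming one accession per line; `read_accessions` is a hypothetical helper, and skipping blank lines and `#` comment lines is a convenience chosen here rather than a file format defined by SRA or ipyrad.

```python
import tempfile
from pathlib import Path


def read_accessions(path):
    """Read one accession per line, skipping blank lines and # comments."""
    accessions = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            accessions.append(line)
    return accessions


# Demonstration with a throwaway file so the example does not depend on the notebook state
with tempfile.TemporaryDirectory() as tmp:
    example = Path(tmp) / "srr_list.txt"
    example.write_text("SRR14086881\nSRR14086882\n\nSRR14086883\n")
    print(read_accessions(example))  # ['SRR14086881', 'SRR14086882', 'SRR14086883']
```

The returned list can be passed as the `accessions` argument in the same way as `list_of_srrs`.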

In [None]:
# new sra object
sra2 = ipa.sratools(accessions=list_of_srrs, workdir=".")

# call download (run) function
sra2.run(auto=True, name_fields=(1,30))

### Check the data files
You can see that the files were named according to the SRR and species name in the table. The intermediate .sra files were removed and only the fastq files were saved. 

In [None]:
# You can also browse the files in the navigation pane on the left.
# The workdir above was ".", so the fastq files are in the current directory.
!ls -l .
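Beyond listing the files, a quick sanity check is to count the reads in each fastq file. The sketch below assumes the standard four-lines-per-record fastq layout and uncompressed `.fastq` files; `count_fastq_reads` and `summarize_fastq_dir` are hypothetical helpers, not ipyrad functions.

```python
from pathlib import Path


def count_fastq_reads(path):
    """Count records in an uncompressed fastq file (four lines per record)."""
    with open(path) as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4


def summarize_fastq_dir(directory="."):
    """Map each *.fastq file name in a directory to its read count."""
    return {
        fastq.name: count_fastq_reads(fastq)
        for fastq in sorted(Path(directory).glob("*.fastq"))
    }
```

Calling `summarize_fastq_dir(".")` from the download directory returns a dictionary of file names and read counts. A `ValueError` suggests a truncated or multi-line-record file.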

In [None]:
# Now copy the data to a cloud bucket for future use
!aws s3 cp SRR14086881_MSHSPSP-RNP263.fastq s3://<bucket-name>

### 3) Copy directly from S3
SRA data are now hosted on AWS and GCP, and you can download directly from cloud buckets using the AWS CLI. Data are organized in buckets by run ID rather than accession, so to find the run ID, go to the [SRA Run Selector](https://www.ncbi.nlm.nih.gov/Traces/study/) and search for your accession number. Several buckets are public, and you can read about them all [here](https://www.ncbi.nlm.nih.gov/sra/docs/sra-aws-download/).

Before hunting for the S3 path of your run ID, we recommend going to [this bucket](https://s3.console.aws.amazon.com/s3/buckets/sra-pub-run-odp?region=us-east-1&prefix=sra/&showversions=false) and searching for your run ID. For example, say we are interested in run ID DRR000001. We can then run the following command to copy those data onto our instance. Some runs have a direct link to the fastq or BAM files, but on S3 it is usually just the .sra files, so you will need to use `fastq-dump` to convert them to fastq. We show you how below.

In [None]:
!aws s3 cp s3://sra-pub-run-odp/sra/DRR000001/DRR000001 .
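The path in the copy command repeats the run ID twice. A small helper can build it for other runs, as sketched below. The `sra/<run>/<run>` layout is inferred only from the DRR000001 example above, so treat it as an assumption and confirm that the object exists in the bucket before copying. `sra_s3_uri` is a hypothetical helper, and the pattern check only covers the SRR, ERR and DRR run prefixes.

```python
import re

# Run accessions start with SRR, ERR or DRR followed by digits
RUN_ID_PATTERN = re.compile(r"^[SED]RR\d+$")


def sra_s3_uri(run_id, bucket="sra-pub-run-odp"):
    """Build the S3 URI for a run, assuming the sra/<run>/<run> layout."""
    if not RUN_ID_PATTERN.match(run_id):
        raise ValueError(f"{run_id!r} does not look like a run ID")
    return f"s3://{bucket}/sra/{run_id}/{run_id}"


print(sra_s3_uri("DRR000001"))  # s3://sra-pub-run-odp/sra/DRR000001/DRR000001
```

Passing an experiment accession such as `SRX15695630` raises a `ValueError`, which is a reminder to look up the run ID first.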

In [None]:
!fastq-dump DRR000001

In [None]:
ls DRR000001*