# Download sequence data from the NCBI Sequence Read Archive (SRA)

## Overview

DNA sequence data are typically deposited in the NCBI Sequence Read Archive (SRA) and can be accessed through the SRA website or via a set of command-line tools called the SRA Toolkit. Each sequence entry is assigned an accession ID, which can be used to find and download a particular file. For example, if you go to the [SRA database](https://www.ncbi.nlm.nih.gov/sra) in a browser window and search for `SRX15695630`, you should see an entry for _C. elegans_. Here we use one tool from the SRA Toolkit, `fasterq-dump`, to download a few fastq files, and then do the same thing with a wrapper from the [ipyrad-analysis toolkit](https://ipyrad.readthedocs.io/en/latest/API-analysis/index.html).

### 1) Download SRA data using SRA Toolkit

First, install the dependencies, including mamba (you could also use conda).

In [None]:
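# The Mambaforge installer has been retired upstream; Miniforge3 ships mamba as well.
# The install prefix is kept as ~/mambaforge so the later cells work unchanged.
!curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh
!bash Miniforge3-$(uname)-$(uname -m).sh -b -p $HOME/mambaforge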

In [None]:
!~/mambaforge/bin/mamba install -c bioconda -c conda-forge sra-tools==2.11.0 ipyrad -y

Test that your install works and that `fasterq-dump` is available on your PATH.

In [None]:
!fasterq-dump -h
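
If the help command above reports that `fasterq-dump` cannot be found, the install directory may simply be missing from PATH. The following is a minimal Python sketch for checking that from the notebook. The helper names `find_tool` and `prepend_to_path` are hypothetical, and the `~/mambaforge/bin` location is an assumption based on the install prefix used in the cells above.

```python
import os
import shutil


def find_tool(name):
    """Return the full path of an executable on PATH, or None if it is absent."""
    return shutil.which(name)


def prepend_to_path(directory):
    """Put `directory` at the front of PATH for this Python process and its children."""
    os.environ["PATH"] = directory + os.pathsep + os.environ.get("PATH", "")


tool = find_tool("fasterq-dump")
if tool is None:
    # Assumption: mamba was installed to ~/mambaforge as in the install cell.
    print("fasterq-dump not found; consider prepend_to_path(os.path.expanduser('~/mambaforge/bin'))")
else:
    print("fasterq-dump found at", tool)
```

Changing `os.environ["PATH"]` only affects the running kernel and the `!` commands it launches; it does not edit your shell profile.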

Set up your directory structure for the raw fastq data.

In [1]:
cd ~/SageMaker/

/home/ec2-user/SageMaker


In [None]:
!mkdir -p data
!mkdir -p data/raw_fastq-fasterq-dump
!mkdir -p data/raw_fastq-ipyrad

In [2]:
cd data/raw_fastq-fasterq-dump

/home/ec2-user/SageMaker/data/raw_fastq-fasterq-dump
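
Fastq downloads can be large, so before fetching anything it may help to confirm how much free space the current volume has. This is a small sketch using only the Python standard library; `free_gb` and `has_enough_space` are hypothetical helper names, and the threshold you pass in is your own choice rather than a value enforced by any tool.

```python
import shutil


def free_gb(path="."):
    """Return free space, in gigabytes (10**9 bytes), on the volume holding `path`."""
    return shutil.disk_usage(path).free / 1e9


def has_enough_space(path=".", required_gb=100):
    """Return True if at least `required_gb` gigabytes are free at `path`."""
    return free_gb(path) >= required_gb


print(f"Free space in the current directory: {free_gb('.'):.1f} GB")
```

Calling `has_enough_space(".", required_gb=100)` in a cell gives a quick yes or no before a long download starts.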


Now we need a list of accession IDs to download. You can find these in papers or by searching SRA directly. Here we use sequence data from Cushman et al., <em><a href='https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8191103/'>Increased whiB7 expression and antibiotic resistance in Mycobacterium chelonae carrying two prophages</a></em>. The next cell generates a text file with two accession numbers (`SRR13349122` and `SRR13349123`). The fastq files are large, so make sure you have at least 100 GB of EBS storage on this instance.

In [None]:
!for i in {22..23}; do echo "SRR133491$i"; done > list_of_accesionIDS.txt
!cat list_of_accesionIDS.txt
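
The shell brace expansion above can also be written in Python, which may be easier to adapt when the accession numbers do not form a simple range. This is a sketch only; `make_accessions` and `write_accessions` are hypothetical helpers, and the prefix and range simply mirror the shell loop.

```python
from pathlib import Path


def make_accessions(prefix, start, stop):
    """Build accession IDs by appending each integer in [start, stop] to `prefix`."""
    return [f"{prefix}{i}" for i in range(start, stop + 1)]


def write_accessions(path, accessions):
    """Write one accession ID per line, matching the format of the shell-generated file."""
    Path(path).write_text("".join(f"{acc}\n" for acc in accessions))


accessions = make_accessions("SRR133491", 22, 23)
print(accessions)  # ['SRR13349122', 'SRR13349123']
```

Running `write_accessions("list_of_accesionIDS.txt", accessions)` would produce the same file as the shell cell.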

The SRA Toolkit does not run in batch mode, so one way to run a command on multiple samples is a simple for loop. Once you run the following cell, it takes about 15 min to finish downloading the samples.

In [None]:
!for x in `cat list_of_accesionIDS.txt`; do fasterq-dump $x; done
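
A shell loop keeps going silently when one accession fails. As an alternative, the same loop can be sketched in Python so that failed downloads are collected and reported. The function names here are hypothetical; the only `fasterq-dump` option used is `--outdir`, and the `runner` parameter exists purely so the loop can be exercised without actually downloading anything.

```python
import subprocess
from pathlib import Path


def build_command(accession, outdir="."):
    """Return the fasterq-dump command for one accession as an argument list."""
    return ["fasterq-dump", accession, "--outdir", str(outdir)]


def download_all(list_file, outdir=".", runner=subprocess.run):
    """Run fasterq-dump once per line of `list_file`; return the accessions that failed."""
    failed = []
    for line in Path(list_file).read_text().splitlines():
        accession = line.strip()
        if not accession:
            continue  # skip blank lines
        result = runner(build_command(accession, outdir))
        if result.returncode != 0:
            failed.append(accession)
    return failed
```

For example, `failed = download_all("list_of_accesionIDS.txt")` runs the downloads in the current directory and leaves any problem accessions in `failed` for a retry.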

### 2) Download SRA data using ipyrad-analysis toolkit

In [7]:
import ipyrad.analysis as ipa

From an sratools object you can fetch just the run info, or you can download the files as well. Here we call `.run()` to download the data into a designated workdir. The `name_fields` argument controls how the files are named, using fields from the `fetch_runinfo` table. The `accessions` argument here is a list of SRR numbers; you could also read that list from a text file. Here we download three SARS-CoV-2 fastq files, which should run fairly quickly.

In [4]:
cd /home/ec2-user/SageMaker/data/raw_fastq-ipyrad

/home/ec2-user/SageMaker/data/raw_fastq-ipyrad


In [5]:
list_of_srrs = ['SRR14086881','SRR14086882','SRR14086883']
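
Instead of typing the list by hand, the SRR numbers can be read from a text file. The sketch below assumes one accession per line, allows `#` comments, and rejects anything that does not look like a run accession (`SRR`, `ERR`, or `DRR` followed by digits). `load_run_accessions` and the file name `srr_list.txt` are hypothetical.

```python
import re
from pathlib import Path

# Run accessions from SRA, ENA, and DDBJ start with SRR, ERR, or DRR respectively.
RUN_ACCESSION = re.compile(r"^[SED]RR\d+$")


def load_run_accessions(path):
    """Read run accessions from a text file, one per line, validating each entry."""
    accessions = []
    for line in Path(path).read_text().splitlines():
        token = line.strip()
        if not token or token.startswith("#"):
            continue  # ignore blank lines and comments
        if not RUN_ACCESSION.match(token):
            raise ValueError(f"not a run accession: {token!r}")
        accessions.append(token)
    return accessions

# Usage (assuming a file named srr_list.txt exists):
# list_of_srrs = load_run_accessions("srr_list.txt")
```

Validating up front catches typos before a download is attempted.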

In [None]:
# new sra object
sra2 = ipa.sratools(accessions=list_of_srrs, workdir=".")

# call download (run) function
sra2.run(auto=True, name_fields=(1,30))

### Check the data files
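You can see that the files were named according to the two fields selected with `name_fields`: the run accession (SRR) followed by a sample label from the run info table, such as `MSHSPSP-RNP263`. The intermediate .sra files were removed and only the fastq files were saved.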

In [9]:
# You can also browse the files in the menu on the left
!ls -l downloaded

total 1805672
-rw-rw-r-- 1 ec2-user ec2-user   8061074 Jun 15 14:49 SRR14086881_MSHSPSP-RNP263.fastq
-rw-rw-r-- 1 ec2-user ec2-user 881967690 Jun 15 14:49 SRR14086882_MSHSPSP-RNP134.fastq
-rw-rw-r-- 1 ec2-user ec2-user 958961950 Jun 15 14:49 SRR14086883_MSHSPSP-RNP095.fastq
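
File sizes alone do not say how many reads were downloaded. A rough check is to count records in each fastq file. This sketch assumes uncompressed fastq files with exactly four lines per record; `count_fastq_reads` and `summarize_fastqs` are hypothetical helpers.

```python
from pathlib import Path


def count_fastq_reads(path):
    """Count reads in an uncompressed fastq file, assuming four lines per record."""
    with open(path) as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4


def summarize_fastqs(directory="."):
    """Return {file name: read count} for every *.fastq file in `directory`."""
    return {
        fastq.name: count_fastq_reads(fastq)
        for fastq in sorted(Path(directory).glob("*.fastq"))
    }
```

Calling `summarize_fastqs(".")` in the download directory returns one entry per fastq file, which makes an empty or truncated download easy to spot.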


In [10]:
# Now copy the data back to a cloud bucket for future use.
# Replace <bucket-name> with the name of your own bucket before running this cell.
# Left as-is, the shell reads < and > as redirections and stops with a syntax error.
!aws s3 cp SRR14086881_MSHSPSP-RNP263.fastq s3://<bucket-name>
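
Because `<` and `>` are shell redirection characters, a literal `<bucket-name>` placeholder must be replaced with a real bucket name before the copy can work. One way to make that harder to forget is to build the `aws s3 cp` command in Python and pass it as an argument list, which bypasses shell parsing. This is a sketch under assumptions: `build_s3_cp` and `upload` are hypothetical helpers, `my-bucket` is a made-up bucket name, and the AWS CLI is assumed to be installed and configured.

```python
import subprocess


def build_s3_cp(local_path, bucket, prefix=""):
    """Return an `aws s3 cp` command (argument list) copying a file into a bucket."""
    if not bucket or "<" in bucket or ">" in bucket:
        raise ValueError("replace the bucket placeholder with a real bucket name")
    key_prefix = prefix.strip("/")
    destination = f"s3://{bucket}/{key_prefix}/" if key_prefix else f"s3://{bucket}/"
    return ["aws", "s3", "cp", str(local_path), destination]


def upload(local_path, bucket, prefix="", runner=subprocess.run):
    """Run the copy; with the default runner, a non-zero exit status raises an error."""
    return runner(build_s3_cp(local_path, bucket, prefix), check=True)


print(build_s3_cp("SRR14086881_MSHSPSP-RNP263.fastq", "my-bucket"))
```

The printed list shows the exact command that `upload` would run, so it can be inspected before anything is transferred.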
