# Download sequence data from the NCBI Sequence Read Archive (SRA)

## Overview

DNA sequence data are typically deposited into the NCBI Sequence Read Archive (SRA), and can be accessed through the SRA website or via a collection of command line tools called the SRA Toolkit. Individual sequence entries are assigned an accession ID, which can be used to find and download a particular file. For example, if you go to the [SRA database](https://www.ncbi.nlm.nih.gov/sra) in a browser window and search for `SRX15695630`, you should see an entry for _C. elegans_. Alternatively, you can search the SRA metadata using Amazon Athena and generate a list of accession numbers. Here we are going to generate a list of accessions using Athena, use tools from the SRA Toolkit to download a few FASTQ files, then copy those FASTQ files to a cloud bucket. We only scratch the surface of how to search Athena using SQL. If you want more examples, you can also try the notebooks from [this SRA GitHub repo](https://github.com/ncbi/ASHG-Workshop-2021).

## Learning Objectives
+ Learn more about the Sequence Read Archive
+ Learn how to download SRA data locally
+ Learn how to interact with SRA metadata via Athena

## Prerequisites
+ Make sure you have access to SageMaker, Athena and Glue

## Get Started

### Set up your Athena Database
You need to set up your Athena database in the Athena console before you start this notebook. Follow our [guide](https://github.com/STRIDES/NIHCloudLabAWS/blob/main/docs/create_athena_database.md) to walk you through it.

### Set up environment and install dependencies

At the time of writing, the version of the SRA Toolkit available from Anaconda was v2.11.0. If you want to install the latest version, download and install it from [here](https://github.com/ncbi/sra-tools/wiki/01.-Downloading-SRA-Toolkit). If you do the direct install, you will also need to configure the toolkit interactively by following [this guide](https://github.com/ncbi/sra-tools/wiki/05.-Toolkit-Configuration); you can do that by opening a terminal and running the commands there.

In [None]:
# install the SRA Toolkit and the Athena dependencies
! mamba install -c bioconda -c conda-forge sra-tools==2.11.0 sql-magic pyathena -y

Test that your install works and that `fasterq-dump` is available in your path.

In [None]:
! fasterq-dump -h

### Set Up Directory Structure and Create a Staging Bucket

In [None]:
! mkdir -p data data/fasterqdump/raw_fastq data/prefetch_fasterqdump/raw_fastq

In [None]:
cd data/

In [None]:
# make sure you change this name, it needs to be globally unique
%env BUCKET=sra-data-athena
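Because bucket names must be globally unique, one option is to append a random suffix to a base name. The sketch below shows one way to do that. The helper name `make_bucket_name` and the 8-character suffix are illustrative assumptions, not part of the original workflow.

```python
import re
import uuid


def make_bucket_name(base="sra-data-athena"):
    """Return `base` plus a random 8-character hex suffix (hypothetical helper)."""
    name = f"{base}-{uuid.uuid4().hex[:8]}"
    # S3 general-purpose bucket names: 3-63 characters, lowercase letters, digits,
    # dots and hyphens, beginning and ending with a letter or digit.
    if not re.fullmatch(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]", name):
        raise ValueError(f"not a valid S3 bucket name: {name}")
    return name


bucket = make_bucket_name()
print(bucket)
```

In a notebook cell, setting `os.environ['BUCKET'] = bucket` (after `import os`) makes the name available to the `!` shell commands in the same way as the `%env` line above.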

In [None]:
# will only create the bucket if it doesn't yet exist
# if the bucket exists you won't see any output
! aws s3 ls s3://$BUCKET > /dev/null 2>&1 || aws s3 mb s3://$BUCKET

### Create a list of SRA accessions using Athena

Here we use Athena to generate a list of accessions. You can also generate a manual list by searching the [SRA Database](https://www.ncbi.nlm.nih.gov/sra) and saving to a file or list.

If you get a module not found error for either of the packages imported below, rerun the mamba command above, make sure mamba is still in your path, or just use `pip install pyathena`.

In [None]:
# import packages
from pyathena import connect
import pandas as pd

Establish the connection by specifying your staging bucket and its region. Make sure your bucket is in us-east-1 to avoid egress charges when downloading from the SRA.

In [None]:
import os
# use the bucket name set with %env above as the Athena staging directory
conn = connect(s3_staging_dir=f"s3://{os.environ['BUCKET']}/",
               region_name='us-east-1')

**When you run the query in the next cell you may get this error**:

`An error occurred (AccessDeniedException) when calling the StartQueryExecution operation: User: arn:aws:sts::055102001469:assumed-role/sagemaker-notebook-instance-role/SageMaker is not authorized to perform: athena:StartQueryExecution on resource: arn:aws:athena:us-east-1:055102001469:workgroup/primary because no identity-based policy allows the athena:StartQueryExecution action`

If you get this error, read our [IAM guide](https://github.com/STRIDES/NIHCloudLabAWS/blob/main/docs/update_sagemaker_role.md) to set up the correct policy for your SageMaker role.


Now that the permissions are all set up, let's download bacterial samples from the [NIGMS Sandbox RNAseq tutorial](https://github.com/NIGMS/RNA-Seq-Differential-Expression-Analysis). You can change the SQL query as you like: take a look at the generated df, then play with different parameters. For more inspiration on what is possible with SQL queries, look at this [SRA tutorial](https://github.com/ncbi/ASHG-Workshop-2021/blob/main/3_Biology_Example_AWS_Demo.ipynb).

In [None]:
query = """
SELECT *
FROM AwsDataCatalog.srametadata.metadata
WHERE organism = 'Mycobacteroides chelonae' 
limit 3;
"""
df = pd.read_sql(
    query, conn
)
df
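To vary the organism, the selected columns, or the row limit without editing the SQL by hand, the query string can be assembled by a small function. This is a minimal sketch: `build_sra_query` is a hypothetical helper, and it only builds the text of the query shown above.

```python
def build_sra_query(organism, columns="*", limit=3):
    """Build a metadata query for one organism (hypothetical helper)."""
    # Double any single quotes so the organism name stays a valid SQL string literal.
    safe_organism = organism.replace("'", "''")
    return (
        f"SELECT {columns}\n"
        "FROM AwsDataCatalog.srametadata.metadata\n"
        f"WHERE organism = '{safe_organism}'\n"
        f"LIMIT {int(limit)}"
    )


acc_query = build_sra_query("Mycobacteroides chelonae", columns="acc", limit=3)
print(acc_query)
```

The resulting string can be passed to `pd.read_sql(acc_query, conn)` in the same way as the query in the cell above.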

As you can see, most of what you need to know is shown in this data frame. If you wanted to show just the accession, you could replace `*` with `acc` in the SELECT statement. Another thing to consider is how large these files are, and whether you have enough space on your VM to download them. You can figure this out by looking at the 'jattr' column, converting the number of bytes to GB, and adding that up for a few samples to get a ballpark figure. If you need more space, stop your notebook instance, then edit it in SageMaker to [resize your disk](https://aws.amazon.com/blogs/machine-learning/customize-your-notebook-volume-size-up-to-16-tb-with-amazon-sagemaker/). You can see the amount of space on your disk from the command line using `! df -h .`

In [None]:
df['jattr'][0]
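The size estimate can be sketched in a few lines of Python. This assumes each 'jattr' entry is a JSON string with a numeric `"bytes"` field; the sample values below are made up for illustration, and `total_size_gb` is a hypothetical helper.

```python
import json
import shutil

# Hypothetical jattr strings; real values would come from df['jattr'].
# Assumption: each entry is a JSON string with a numeric "bytes" field.
example_jattr = [
    '{"bytes": 1500000000}',
    '{"bytes": 2500000000}',
    '{"bytes": 1000000000}',
]


def total_size_gb(jattr_values):
    """Sum the "bytes" field across jattr entries and convert to GB (10**9 bytes)."""
    total_bytes = sum(int(json.loads(j)["bytes"]) for j in jattr_values)
    return total_bytes / 1e9


needed_gb = total_size_gb(example_jattr)
# Free space on the disk that holds the current working directory.
free_gb = shutil.disk_usage(".").free / 1e9
print(f"estimated size: {needed_gb:.1f} GB, free disk: {free_gb:.1f} GB")
```

With real data you would call `total_size_gb(df['jattr'])`. Treat the result as a lower bound, since the converted FASTQ files are usually larger than the compressed archive.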

You can get the same information using `vdb-dump --info <ACCESSION>`, and the path to the compressed SRA file in a bucket using `srapath <ACCESSION>`.

In [None]:
! vdb-dump --info SRR13349124 

In [None]:
! srapath SRR13349124

Save our accession list to a text file.

In [None]:
with open('list_of_accessionIDS.txt', 'w') as f:
    # one accession per line, with no padding whitespace
    f.write('\n'.join(df['acc']) + '\n')

In [None]:
! cat list_of_accessionIDS.txt
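Before starting a long download, it can help to read the list back and check that every line looks like a run accession. The sketch below assumes the list holds run accessions of the form `SRR`, `ERR`, or `DRR` followed by digits. `read_accessions` is a hypothetical helper, and `SRR0000001` is a placeholder ID used only for the demonstration.

```python
import re
import tempfile
from pathlib import Path

# Assumed format: SRR, ERR or DRR followed by digits.
RUN_ACCESSION = re.compile(r"[SED]RR\d+")


def read_accessions(path):
    """Return the non-empty, whitespace-stripped lines of an accession list file."""
    lines = Path(path).read_text().splitlines()
    accessions = [line.strip() for line in lines if line.strip()]
    bad = [a for a in accessions if not RUN_ACCESSION.fullmatch(a)]
    if bad:
        raise ValueError(f"unexpected accession format: {bad}")
    return accessions


# Demonstration on a temporary file with stray whitespace and a blank line.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo_accessions.txt"
    demo.write_text("SRR13349124\n  SRR0000001\n\n")
    demo_accessions = read_accessions(demo)

print(demo_accessions)
```

In the notebook you would call `read_accessions('list_of_accessionIDS.txt')` instead of using the temporary file.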

### Download FASTQ files with fasterq-dump

`fasterq-dump` is the replacement for the legacy `fastq-dump` tool. You can read [this guide](https://github.com/ncbi/sra-tools/wiki/HowTo:-fasterq-dump) to see the full details on this tool. You can also run `fasterq-dump -h` to see most of the options.

In [None]:
cd fasterqdump/

`fasterq-dump` doesn't run in batch mode, so one way to run it on multiple samples is with a for loop. There are many options you can explore, but here we use `-O` for the output directory, `-e` for the number of threads, and `-m` for memory (4 GB); `-f` overwrites any existing output. The default number of threads is 6, so adjust `-e` based on your machine size. For large files, you may also benefit from a machine type with more memory and/or threads, in which case you may need to stop this VM, resize it, then restart and come back. There are also several ways to split your FASTQ files (defined [here](https://github.com/ncbi/sra-tools/wiki/HowTo:-fasterq-dump)), but the default of `--split-3` will split into forward, reverse, and unpaired reads. Depending on your machine size, expect about 5 min for these three files.

In [None]:
%%time
! for x in `cat ../list_of_accessionIDS.txt`; do fasterq-dump -f -O raw_fastq -e 8 -m 4G $x ; done

On our VM that command took 6.5 min, although with a larger machine size it will run faster.
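The same loop can be written in Python, which makes it easier to change the options per run. The sketch below only builds and prints the commands, using the flags from the cell above. `fasterq_dump_commands` is a hypothetical helper, and nothing is downloaded when it runs.

```python
import shlex


def fasterq_dump_commands(accessions, outdir="raw_fastq", threads=8, mem="4G"):
    """Build one fasterq-dump argument list per accession (nothing is executed)."""
    return [
        ["fasterq-dump", "-f", "-O", outdir, "-e", str(threads), "-m", mem, acc]
        for acc in accessions
    ]


commands = fasterq_dump_commands(["SRR13349124"])
for cmd in commands:
    # Print the command as it would appear on the shell command line.
    print(shlex.join(cmd))
```

To actually run each command, you could pass the argument list to `subprocess.run(cmd, check=True)`, which stops at the first accession that fails.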

### Download FASTQ files with prefetch + fasterq-dump

Using the example bacterial data, `fasterq-dump` took about 6.5 min to download the files (ml.t3.2xlarge with 8 CPUs and 32 GB RAM). Under the hood, `fasterq-dump` pulls the compressed SRA files from the database (in this case they should be coming from AWS) and converts them on the fly, which is slow because it has to do a lot of work over the network. A better method is to separate these steps: use `prefetch` to pull the compressed files, then use `fasterq-dump` to convert them locally rather than over the network. For this to work, you need to either give `fasterq-dump` the path to each prefetch directory (here, `raw_fastq/<ACCESSION>`), or cd into the raw_fastq directory so that it can find the directories containing the .sra files.

In [None]:
cd ../prefetch_fasterqdump

In [None]:
%%time
! prefetch --option-file ../list_of_accessionIDS.txt -O raw_fastq -f yes

In [None]:
ls raw_fastq/

Now convert the prefetched records to FASTQ.

In [None]:
%%time
! for x in `cat ../list_of_accessionIDS.txt`; do fasterq-dump -f -O raw_fastq -e 8 -m 4G raw_fastq/$x; done

Comparing the two methods, we can see that fasterq-dump on its own took 6.5 min, whereas prefetch + fasterq-dump takes less than 1.5 min.

### Copy Files to a Bucket

`--recursive` copies a whole directory, like `cp -r` in bash. The AWS CLI transfers files in parallel by default, so you don't have to specify threads.

In [None]:
# copy only the FASTQ files from raw_fastq/ to the bucket
! aws s3 cp raw_fastq/ s3://$BUCKET/raw_fastq/ --recursive --exclude "*" --include "*.fastq"

In [None]:
! aws s3 ls s3://$BUCKET/raw_fastq/

## Conclusions 
Here you learned how to generate a list of accessions using Athena and then use the SRA Toolkit to download FASTQ files. We tested `fasterq-dump` on its own, but found that using `prefetch` followed by `fasterq-dump` is much faster. Finally, you learned how to copy directories to S3 using the `--recursive` flag.

## Clean up
Make sure you shut down this VM, or delete it if you don't plan to use it further.

You can also [delete the buckets](https://docs.aws.amazon.com/AmazonS3/latest/userguide/delete-bucket.html) if you don't want to pay for the data: `aws s3 rb s3://bucket-name --force`