# Download sequence data from the NCBI Sequence Read Archive (SRA)

## Overview

DNA sequence data are typically deposited into the NCBI Sequence Read Archive (SRA) and can be accessed through the SRA website or via a collection of command line tools called the SRA Toolkit. Individual sequence entries are assigned an accession ID, which can be used to find and download a particular file. For example, if you go to the [SRA database](https://www.ncbi.nlm.nih.gov/sra) in a browser window and search for `SRX15695630`, you should see an entry for _C. elegans_. Alternatively, you can search the SRA metadata dataset in BigQuery to generate a list of accession numbers. Here we generate a list of accessions using BigQuery, use tools from the SRA Toolkit to download a few FASTQ files, then copy those FASTQ files to a cloud bucket.

## Learning Objectives
+ Learn more about the Sequence Read Archive
+ Learn how to download SRA data locally
+ Learn how to interact with SRA metadata via BigQuery Tables

## Prerequisites
Make sure you have enabled the [BigQuery API](https://cloud.google.com/endpoints/docs/openapi/enable-api).

## Get Started

### Install packages

Install dependencies using mamba (you could also use conda). At the time of writing, the version of SRA tools available with the Anaconda distribution was v2.11.0. If you want to install the latest version, download and install it from [here](https://github.com/ncbi/sra-tools/wiki/01.-Downloading-SRA-Toolkit). If you do the direct install, you will also need to configure the toolkit interactively following [this guide](https://github.com/ncbi/sra-tools/wiki/05.-Toolkit-Configuration); you can do that by opening a terminal and running the commands there.

In [None]:
! mamba install -c bioconda -c conda-forge sra-tools==2.11.0 -y

Test that your install works and that fasterq-dump is available in your path.

In [None]:
! fasterq-dump -h

In [None]:
!wget https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/current/sratoolkit.current-ubuntu64.tar.gz && \
tar -xvzf sratoolkit.current-ubuntu64.tar.gz && \
export PATH=$PATH:$(pwd)/sratoolkit.*-ubuntu64/bin && \
fasterq-dump --version

### Set Up Directory Structure

In [None]:
pwd

Set up your directory structure for the raw fastq data

In [None]:
! mkdir -p data data/fasterqdump/raw_fastq data/prefetch_fasterqdump/raw_fastq

### Create Accession List using BigQuery

Here we use BigQuery to generate a list of accessions. You can also generate a manual list by searching the [SRA Database](https://www.ncbi.nlm.nih.gov/sra) and saving to a file or list.

In [None]:
# Import the BigQuery API
from google.cloud import bigquery
import pandas

In [None]:
# Designate the client for the API
client = bigquery.Client(location="US")
print("Client created using default project: {}".format(client.project))

Let's download bacterial samples, one of which happens to come from a swab of a seahorse (which we tell you for no particular reason!). Feel free to change the SQL query, take a look at the generated `df`, and then play with different parameters. For more inspiration, look at this [SRA tutorial](https://www.ncbi.nlm.nih.gov/sra/docs/sra-bigquery-examples/); links to other SRA examples can be found in our [README](https://github.com/STRIDES/NIHCloudLabGCP?tab=readme-ov-file#download-data-from-the-sequence-read-archive-sra-).

In [None]:
query = """
#standardSQL
SELECT *
FROM `nih-sra-datastore.sra.metadata`
WHERE organism = 'Mycobacteroides chelonae'
LIMIT 3
"""
query_job = client.query(
    query,
    # Location must match that of the dataset(s) referenced in the query.
    location="US",
)  # API request - starts the query

df = query_job.to_dataframe()

In [None]:
df
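If you only need the accessions, a narrower query can be sketched as follows. This is an illustration rather than part of the original notebook: `QUERY` and `fetch_accessions` are hypothetical names, and passing the organism as a named query parameter is just one way to avoid editing the SQL string by hand. Calling the function requires BigQuery credentials.

```python
# Sketch: select only the accession column and pass the organism as a parameter.
# QUERY and fetch_accessions are illustrative names, not part of the notebook.
QUERY = """
SELECT acc
FROM `nih-sra-datastore.sra.metadata`
WHERE organism = @organism
LIMIT 3
"""


def fetch_accessions(organism):
    """Return a list of accession IDs for one organism."""
    # Imported here so this cell can be defined even before the client
    # library is installed; the call itself needs BigQuery access.
    from google.cloud import bigquery

    client = bigquery.Client(location="US")
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("organism", "STRING", organism)
        ]
    )
    result = client.query(QUERY, job_config=job_config).to_dataframe()
    return result["acc"].tolist()
```

For example, `fetch_accessions("Mycobacteroides chelonae")` would run the same filter as the query above, returning only the `acc` values.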

As you can see, most of what you need to know is shown in this data frame. If you wanted to show only the accession, you could replace `*` with `acc` in the SELECT statement. Another thing to consider is how large these files are and whether you have enough space on your VM to download them. You can figure this out by looking at the `jattr` column, converting the number of bytes to GB, and adding that up for a few samples to get a ballpark figure. If you need more space, stop the VM, go to Compute Engine, and either [resize your disk](https://cloud.google.com/compute/docs/disks/resize-persistent-disk) or add a disk. You can see the amount of free space on your disk from the command line using `!df -h .`

In [None]:
df['jattr'][0]
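The size estimate described above can be sketched in a few lines. This assumes each `jattr` value is a JSON string with a numeric `bytes` key; check the output of the previous cell to confirm the key name for your rows. The helper names and the `headroom` factor (extra room for the uncompressed fastq output) are illustrative choices, not values from the SRA documentation.

```python
import json
import shutil


def jattr_bytes(jattr):
    """Return the run size in bytes from one jattr JSON string (0 if absent)."""
    record = json.loads(jattr)
    # Assumption: the size is stored under the "bytes" key.
    return int(record.get("bytes", 0))


def total_size_gb(jattr_values):
    """Sum the sizes of several runs and convert bytes to GB (1 GB = 1e9 bytes)."""
    return sum(jattr_bytes(j) for j in jattr_values) / 1e9


def fits_on_disk(jattr_values, path=".", headroom=3.0):
    """Compare the estimated need (size x headroom) with free space at `path`."""
    needed_gb = total_size_gb(jattr_values) * headroom
    free_gb = shutil.disk_usage(path).free / 1e9
    return needed_gb <= free_gb


# Made-up values for illustration only.
example = ['{"bytes": 1500000000}', '{"bytes": 500000000}']
print(total_size_gb(example))
```

With the data frame from the query, you could call `total_size_gb(df['jattr'])` and `fits_on_disk(df['jattr'])` instead of using the made-up example values.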

You can also get the same info using `vdb-dump --info <ACCESSION>` although that command may not always work as expected. You can also get the path for the sra compressed file in a bucket using `srapath <ACCESSION>`.

Save our accession list to a text file

In [None]:
with open('list_of_accessionIDS.txt', 'w') as f:
    # Write one accession per line, without padding or an index column
    f.write('\n'.join(df['acc']) + '\n')

In [None]:
cat list_of_accessionIDS.txt
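Before starting a long download, it can help to read the list back and check that every entry looks like a run accession. The sketch below assumes run accessions start with `SRR`, `ERR`, or `DRR` followed by digits; `parse_accessions` and `read_accessions` are hypothetical helper names.

```python
import re
from pathlib import Path

# Assumption: run accessions are SRR/ERR/DRR followed by digits.
ACCESSION_RE = re.compile(r"^[SED]RR\d+$")


def parse_accessions(text):
    """Split whitespace-separated text into accessions and validate each one."""
    accessions = text.split()
    bad = [a for a in accessions if not ACCESSION_RE.match(a)]
    if bad:
        raise ValueError(f"Unexpected accession format: {bad}")
    return accessions


def read_accessions(path="list_of_accessionIDS.txt"):
    """Read and validate the accession list saved above."""
    return parse_accessions(Path(path).read_text())
```

Splitting on whitespace means stray padding or blank lines in the file are ignored rather than passed on to the download tools.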

### Download FASTQ Files with fasterq-dump

fasterq-dump is the replacement for the legacy fastq-dump tool. You can read [this guide](https://github.com/ncbi/sra-tools/wiki/HowTo:-fasterq-dump) to see the full details on this tool. You can also run `fasterq-dump -h` to see most of the options.

fasterq-dump doesn't run in batch mode, so one way to run a command on multiple samples is by using a for loop. There are many options you can explore; here we use `-f` to overwrite existing output, `-O` for the output directory, `-e` for the number of threads, and `-m` for memory (4 GB). The `--location` option sets the location to retrieve the file from; the command below leaves it at its default. Depending on the type of cloud storage, it may be faster to select `NCBI` for the location. Consider running a few tests with one or two of your accession numbers before downloading a whole batch. The default number of threads is 6, so adjust `-e` based on your machine size. For large files, you may also benefit from a machine type with more memory and/or threads, in which case you may need to stop this VM, resize it, then restart and come back. There are also several ways to split your fastq files (defined [here](https://github.com/ncbi/sra-tools/wiki/HowTo:-fasterq-dump)), but the default, `--split-3`, will split into forward, reverse, and unpaired reads.

In [None]:
%%time
! for x in `cat list_of_accessionIDS.txt`; do fasterq-dump -f -O data/fasterqdump/raw_fastq -e 8 -m 4G $x ; done
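If you would rather drive the loop from Python, for example to stop at the first failure, the same command can be sketched with `subprocess`. The function names are hypothetical; the flags mirror the shell loop above.

```python
import subprocess


def fasterq_dump_cmd(accession, outdir, threads=8, mem="4G"):
    """Build the fasterq-dump command for one accession as an argument list."""
    return [
        "fasterq-dump", "-f",
        "-O", outdir,        # output directory
        "-e", str(threads),  # number of threads
        "-m", mem,           # memory limit
        accession,
    ]


def download_all(accessions, outdir):
    """Run fasterq-dump once per accession, raising on the first failure."""
    for acc in accessions:
        subprocess.run(fasterq_dump_cmd(acc, outdir), check=True)
```

Pass `download_all` the accessions from your list and the same output directory used in the shell loop. With `check=True`, a failed download raises `CalledProcessError` instead of silently moving on to the next accession.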

### Download FASTQ Files with prefetch + fasterq-dump

Using the example bacterial data, fasterq-dump takes about 3.5 min to download the files. Under the hood, fasterq-dump pulls the compressed SRA files from the database and converts them on the fly, which is slow because much of the work happens over the network. A better, though less advertised, method is to split these steps: use prefetch to pull the compressed files, then run fasterq-dump to convert them locally. For this to work, fasterq-dump has to find the prefetch directories that hold the .sra files: either give the path to those directories, or cd into the raw_fastq dir before running it. In this case, `--location GCP` is a lot faster than `NCBI`, but feel free to run your own tests with different locations.

In [None]:
%%time
! prefetch --option-file list_of_accessionIDS.txt -O data/prefetch_fasterqdump/raw_fastq/ -f yes

In [None]:
ls data/prefetch_fasterqdump/raw_fastq/
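To confirm that prefetch fetched every accession before converting, you could check for the expected files. This sketch assumes prefetch writes each run into a sub-directory named after the accession, containing a file whose name starts with `<accession>.sra`; compare that with the `ls` output above, since the layout can vary between toolkit versions. `missing_sra` is a hypothetical helper.

```python
from pathlib import Path


def missing_sra(accessions, prefetch_dir):
    """Return the accessions that have no <acc>/<acc>.sra* file under prefetch_dir."""
    root = Path(prefetch_dir)
    missing = []
    for acc in accessions:
        run_dir = root / acc
        # The glob pattern also matches variants such as <acc>.sralite.
        if not run_dir.is_dir() or not any(run_dir.glob(f"{acc}.sra*")):
            missing.append(acc)
    return missing
```

An empty list from `missing_sra(accessions, "data/prefetch_fasterqdump/raw_fastq")` means every run is ready for local conversion.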

In [None]:
%%time
! for x in `cat list_of_accessionIDS.txt`; do fasterq-dump -f -O data/prefetch_fasterqdump/raw_fastq/ -e 8 -m 4G data/prefetch_fasterqdump/raw_fastq/$x; done

Comparing the two methods, we can see that fasterq-dump on its own took 3.5 min, whereas prefetch + fasterq-dump takes less than 40 seconds.
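As a quick check on those numbers, the speedup works out as follows. The timings are the ones quoted above; your own runs will differ with machine size and network.

```python
# Timings quoted in the text, converted to seconds.
fasterq_only_s = 3.5 * 60        # fasterq-dump on its own
prefetch_plus_fasterq_s = 40     # upper bound for prefetch + fasterq-dump

speedup = fasterq_only_s / prefetch_plus_fasterq_s
print(f"prefetch + fasterq-dump is at least {speedup:.2f}x faster")
```

Because 40 seconds is an upper bound for the combined method, the computed ratio is a lower bound on the speedup.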

### Copy Files to a Bucket

Create a new bucket, or give the path to an existing bucket.

In [None]:
! gsutil mb gs://cloud-lab-tutorials_sra/

In [None]:
ls data/prefetch_fasterqdump/raw_fastq/

Using `-m` enables multithreading. `-r` would allow a recursive copy of a directory, but here we are just giving the path to the fastq files.

In [None]:
! gsutil -m cp data/prefetch_fasterqdump/raw_fastq/*.fastq gs://cloud-lab-tutorials_sra/raw_fastq/

In [None]:
! gsutil ls gs://cloud-lab-tutorials_sra/raw_fastq/

## Conclusions
Here we learned about SRA, queried SRA metadata in BigQuery, and then downloaded data using SRA Toolkit. We found that the fastest download method was to use a combination of prefetch + fasterq-dump.

### Clean up
Make sure you shut down this VM, or delete it if you don't plan to use it further.