# Finding and retrieving SRA data with SRA-explorer

## Find data 

Visit https://sra-explorer.info/#

Query with

- free text, e.g. human, lung, diabetes, mouse
- an SRA accession ID of interest

**Example search**: `human rna cell illumina`

![](assets/sra-explorer-demo.gif)

We can retrieve

- https paths to FASTQ
- download bash script
- full metadata file with paths and information in a tabular (csv) format

The metadata include the following fields:

1. **Accession**
2. **Title**
3. **Platform**
4. **Total bases**
5. **Create date**
6. **SRA URL**
7. **SRA filename**
8. **SRA nice filename**
9. **FastQ URL**
10. **FastQ Aspera URL**
11. **FastQ filename**
12. **FastQ nice filename**


Example:

| Accession   | Title                                                | Platform              | Total bases | Create date   | SRA URL                                                                                                 | SRA filename    | SRA nice filename                                                | FastQ URL                                                                        | FastQ Aspera URL                                                                     | FastQ filename         | FastQ nice filename                                                     |
|-------------|------------------------------------------------------|-----------------------|-------------|---------------|---------------------------------------------------------------------------------------------------------|-----------------|------------------------------------------------------------------|----------------------------------------------------------------------------------|--------------------------------------------------------------------------------------|------------------------|-------------------------------------------------------------------------|
| SRR12421213 | GSM4718280: A549 24 hpi _rep1; Homo sapiens; RNA-Seq | Illumina NovaSeq 6000 | 45952       | 1597100400000 | ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR124/SRR12421213/SRR12421213.sra | SRR12421213.sra | SRR12421213_GSM4718280_A549_24_hpi_rep1_Homo_sapiens_RNA-Seq.sra | ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR124/013/SRR12421213/SRR12421213_1.fastq.gz | era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR124/013/SRR12421213/SRR12421213_1.fastq.gz | SRR12421213_1.fastq.gz | SRR12421213_GSM4718280_A549_24_hpi_rep1_Homo_sapiens_RNA-Seq_1.fastq.gz |
| SRR12421213 | GSM4718280: A549 24 hpi _rep1; Homo sapiens; RNA-Seq | Illumina NovaSeq 6000 | 45952       | 1597100400000 | ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR124/SRR12421213/SRR12421213.sra | SRR12421213.sra | SRR12421213_GSM4718280_A549_24_hpi_rep1_Homo_sapiens_RNA-Seq.sra | ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR124/013/SRR12421213/SRR12421213_2.fastq.gz | era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR124/013/SRR12421213/SRR12421213_2.fastq.gz | SRR12421213_2.fastq.gz | SRR12421213_GSM4718280_A549_24_hpi_rep1_Homo_sapiens_RNA-Seq_2.fastq.gz |
| SRR12421212 | GSM4718279: A549 12 hpi _rep1; Homo sapiens; RNA-Seq | Illumina NovaSeq 6000 | 40923       | 1597100400000 | ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR124/SRR12421212/SRR12421212.sra | SRR12421212.sra | SRR12421212_GSM4718279_A549_12_hpi_rep1_Homo_sapiens_RNA-Seq.sra | ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR124/012/SRR12421212/SRR12421212_1.fastq.gz | era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR124/012/SRR12421212/SRR12421212_1.fastq.gz | SRR12421212_1.fastq.gz | SRR12421212_GSM4718279_A549_12_hpi_rep1_Homo_sapiens_RNA-Seq_1.fastq.gz |
| SRR12421212 | GSM4718279: A549 12 hpi _rep1; Homo sapiens; RNA-Seq | Illumina NovaSeq 6000 | 40923       | 1597100400000 | ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR124/SRR12421212/SRR12421212.sra | SRR12421212.sra | SRR12421212_GSM4718279_A549_12_hpi_rep1_Homo_sapiens_RNA-Seq.sra | ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR124/012/SRR12421212/SRR12421212_2.fastq.gz | era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR124/012/SRR12421212/SRR12421212_2.fastq.gz | SRR12421212_2.fastq.gz | SRR12421212_GSM4718279_A549_12_hpi_rep1_Homo_sapiens_RNA-Seq_2.fastq.gz |
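The `Create date` values in the example rows look like Unix timestamps in milliseconds. That reading is an inference from the values themselves, not something stated by SRA-explorer. Under that assumption, a minimal sketch for turning one into a calendar date could look like this (`epoch_ms_to_date` is a hypothetical helper name, and the `-d "@..."` form assumes GNU `date`):

```bash
# Hypothetical helper: convert a millisecond Unix timestamp to YYYY-MM-DD (UTC).
# Assumes GNU date, which accepts "@<seconds>" as a date string.
epoch_ms_to_date() {
  secs=$(( $1 / 1000 ))            # drop the millisecond part
  date -u -d "@${secs}" +%Y-%m-%d
}

# Value taken from the "Create date" column of the example table
epoch_ms_to_date 1597100400000
```

On macOS, where `date` is the BSD variant, `date -u -r <seconds>` is the usual equivalent.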

The simplest way to retrieve the data is with `curl` or `wget` using the ftp links:

Example:

```bash
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR124/013/SRR12421213/SRR12421213_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR124/013/SRR12421213/SRR12421213_2.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR124/012/SRR12421212/SRR12421212_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR124/012/SRR12421212/SRR12421212_2.fastq.gz
```
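When many runs are involved, the URLs can be generated rather than typed. The sketch below rebuilds the `FastQ URL` values of the example table from the accession alone. The path layout (first six characters of the accession, then `0` plus its last two digits) is inferred only from the two accessions shown above, so treat it as an assumption and prefer the URLs in the downloaded metadata when in doubt. `ena_fastq_url` is a hypothetical helper name.

```bash
# Hypothetical helper: print the FASTQ URL for a run accession and mate number.
# The directory layout is inferred from the example table, not from a specification.
ena_fastq_url() {
  acc=$1                                        # run accession, e.g. SRR12421213
  mate=$2                                       # 1 or 2
  prefix=$(printf '%s' "$acc" | cut -c1-6)      # first six characters, e.g. SRR124
  last2=$(printf '%s' "$acc" | tail -c 2)       # last two digits, e.g. 13
  echo "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/${prefix}/0${last2}/${acc}/${acc}_${mate}.fastq.gz"
}

# Print the four URLs used in the wget example
for run in SRR12421213 SRR12421212; do
  for read_no in 1 2; do
    ena_fastq_url "$run" "$read_no"
  done
done
```

The printed list can be saved to a file and passed to `wget -i <file>` to download everything in one go.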

# A. nf-core/rnaseq: An end-to-end RNAseq workflow

![](https://github.com/TheJacksonLaboratory/nf-core-rnaseq/raw/sk-sahu-dsl-1-latest/docs/images/nf-core-rnaseq_logo.png)

## Pipeline configuration logic


### 1. Select `input` folder with paired-end FASTQ files 



**Example configuration options**

```bash
nextflow run main.nf --input 'testdata/*_R{1,2}.fastq.gz'
nextflow run main.nf --input 'gs://rnaseq-examples/inputs/*_{1,2}.fq'

```

#### a. The folder
We can use local folders when working on a cluster or on our laptops, or Google Cloud Storage buckets when working in JAX CloudOS.

For the class, we have prepared an example folder with data in the following Google Cloud Storage bucket:

`gs://rnaseq-examples/inputs`

#### b. The pattern to match paired-end reads

To process paired-end reads correctly, we need to define the pattern (a glob expression rather than a true regular expression) that matches both files of each pair. The pattern depends on how our FASTQ files are named.

Frequently used suffixes that can be found in FASTQ filenames are:

- `fq`
- `fq.gz` (compressed)
- `fastq`
- `fastq.gz` (compressed)


Text enclosed in curly brackets denotes `either or`. For example, the expression `*_{1,2}.fq`
matches any pair of files with the suffixes

- **either**: `*_1.fq`
- **or**: `*_2.fq`


The asterisk (`*`) means `everything` or `any match`: it stands for any basename, so the pattern captures every file that ends with one of these suffixes.
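Before launching a run on a local folder, it can help to confirm that every `_1` file really has its `_2` mate, since an unpaired file will not match the pattern as a pair. The following is a minimal sketch for local files only (it does not read `gs://` buckets), and `check_pairs` is a hypothetical helper, not part of the pipeline:

```bash
# Hypothetical helper: for every *_1.fq file in a folder, report whether
# the matching *_2.fq file exists. Returns non-zero if any mate is missing.
check_pairs() {
  dir=$1
  status=0
  for r1 in "$dir"/*_1.fq; do
    [ -e "$r1" ] || continue            # no match: the glob stays literal
    sample=$(basename "$r1" _1.fq)      # the part captured by '*'
    if [ -f "$dir/${sample}_2.fq" ]; then
      echo "paired: $sample"
    else
      echo "missing mate: $sample"
      status=1
    fi
  done
  return "$status"
}

# Demo on a throwaway folder: sampleA is complete, sampleB lacks its _2 file
demo=$(mktemp -d)
touch "$demo/sampleA_1.fq" "$demo/sampleA_2.fq" "$demo/sampleB_1.fq"
check_pairs "$demo" || echo "at least one sample is incomplete"
rm -rf "$demo"
```

For other naming schemes, such as `_R1.fastq.gz`, the two suffixes in the helper would need to be adjusted to match.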

### 2. Select `genome`

**Available:**

Human:
 - GRCh37
 - GRCh38

Mouse:
 - GRCm38
 - `< add your custom! >`
 
More organisms are available; see the [**igenomes.config**](https://github.com/TheJacksonLaboratory/nf-core-rnaseq/blob/sk-sahu-dsl-1-latest/conf/igenomes.config) file for the full list.


```bash
nextflow run main.nf --genome 'GRCm38' ..
nextflow run main.nf --genome 'GRCh37' ..
```

### Running with my own custom genome? 🤔

We can add any custom genome for our own research, and use it in the same way, by editing the [**igenomes.config**](https://github.com/TheJacksonLaboratory/nf-core-rnaseq/blob/sk-sahu-dsl-1-latest/conf/igenomes.config) file and providing the required reference files (e.g. FASTA, BED, STAR indices) for that genome.


The addition in the file could look like this:


```groovy
// contents of igenomes.config

params {
  genomes {
    'my-custom-genome' {
      fasta       = "path-to/WholeGenomeFasta/genome.fa"
      bwa         = "path-to/BWAIndex/genome.fa"
      bowtie2     = "path-to/Bowtie2Index/"
      star        = "path-to/STARIndex/"
      bismark     = "path-to/BismarkIndex/"
      gtf         = "path-to/genes.gtf"
      bed12       = "path-to/genes.bed"
      mito_name   = "MT"
      macs_gsize  = "2.7e9"
      blacklist   = "path-to/my-custom-genome-blacklist.bed"
    }
    // ... other genome entries ...
  }
}

```

We could then run similarly:

```bash
nextflow run main.nf --genome 'my-custom-genome' ..
```


**NOTE**:

If you have a custom genome that you would like to use please reach out via Slack or email to get help for your analysis.

### 3. Prepare `input` folder

For the class, we have prepared an input folder in the location

**`gs://rnaseq-examples/inputs`**

You can find it in our current workspace by clicking

`Data` > `rnaseq-examples` as shown below.

![](assets/add-data.gif)

# B. Running an example in the JAX CloudOS

An example run can be as simple as:

### 1. Provide the `input` reads folder
e.g. `gs://rnaseq-examples/inputs/`
### 2. Specify the pattern that matches the paired-end read files
e.g. `*_{1,2}.fq`
### 3. Specify the `genome`
e.g. `GRCh37`

The command in a terminal would look like this:

```bash
nextflow run https://github.com/TheJacksonLaboratory/nf-core-rnaseq \
--input 'gs://rnaseq-examples/inputs/*_{1,2}.fq' \
--genome GRCh37
```

We can run the same in JAX CloudOS, with the exact same arguments.

For the class, we have prepared an example run. To reproduce the analysis and inspect the parameters used, click the `Clone` button on the page linked below:


## https://cloudos.lifebit.ai/app/jobs/606eed7b4f08a000e50e5ee7


![](assets/clone-analysis.gif)

## Running with my own data? 🤔

Feel free to select another genome or another of the example input datasets.


ℹ️ **NOTE:**

If you don't yet have a JAX CloudOS account for your lab, get in touch via Slack or email and we will help you raise a ticket with JAX support.
Alternatively, you can try to run **locally** or on a **cluster** where you have access to a personal space.

# C. Execution across Cloud, Clusters and local machines

The same workflow can be run with minimal configuration adjustments across laptops, SLURM-managed clusters and different cloud vendors.

Only the configuration file passed to the command needs to change.


For Google Cloud we can use a specific configuration file:


```bash
nextflow run https://github.com/TheJacksonLaboratory/nf-core-rnaseq \
--config 'conf/google.config'  # assumed filename: point this at the Google Cloud config described in C2
```

For a SLURM-managed cluster, we use another:

```bash
nextflow run https://github.com/TheJacksonLaboratory/nf-core-rnaseq \
--config 'conf/slurm.config'
```

And another for our local machine, e.g. a laptop:

```bash
nextflow run https://github.com/TheJacksonLaboratory/nf-core-rnaseq \
--config 'conf/local.config'
```

This modification lets us keep the remaining parameters unchanged, allowing for a swift change of execution environment.
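One way to make the switch explicit is a small wrapper that takes the environment name and fills in the config path, leaving everything else fixed. The sketch below only prints the command so it can be inspected before running. `nf_command` is a hypothetical helper, it covers only the two config files named in this section (`conf/slurm.config` and `conf/local.config`), and the genome is fixed to `GRCh37` purely for illustration.

```bash
# Hypothetical helper: print the nextflow command for a given environment.
# Only the --config value changes between environments.
PIPELINE='https://github.com/TheJacksonLaboratory/nf-core-rnaseq'

nf_command() {
  env_name=$1     # slurm or local
  input=$2        # quoted pattern matching the paired-end reads
  case "$env_name" in
    slurm|local) ;;
    *) echo "unknown environment: $env_name" >&2; return 1 ;;
  esac
  printf "nextflow run %s --input '%s' --genome %s --config 'conf/%s.config'\n" \
    "$PIPELINE" "$input" GRCh37 "$env_name"
}

# Same input and genome, two execution environments
nf_command slurm 'testdata/*_R{1,2}.fastq.gz'
nf_command local 'testdata/*_R{1,2}.fastq.gz'
```

Printing rather than executing keeps the sketch safe to try on any machine; the printed line can then be pasted into a terminal or a submission script.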

### C1 - SLURM Managed cluster

Two files are required:

- a SLURM submission script
- a Nextflow config, e.g. slurm.config


#### Contents of the SLURM submission script


```bash
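#!/bin/bash
#SBATCH -o logs.%j.out
#SBATCH -e logs.%j.err
#SBATCH --mail-type=END,FAIL
# Replace with your own address: #SBATCH lines do not expand variables such as $USER
#SBATCH --mail-user=your-username@my-org.org
#SBATCH --mem=20000
#SBATCH --cpus-per-task=4
#SBATCH -p compute
#SBATCH -q batch
#SBATCH -t 3-00:00:00

# Run from the directory the job was submitted from and log basic job context
cd "$SLURM_SUBMIT_DIR"
date;hostname;pwd

# Singularity provides the containers used by the pipeline
module load singularity

# Download the nextflow launcher into the current directory
curl -fsSL https://get.nextflow.io | bash

./nextflow run /absolute/path/to/pipeline/repo/main.nf --outdir "${SLURM_SUBMIT_DIR}" --param1 param1_value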
```






#### Contents of nextflow config for SLURM managed cluster

- Singularity enabled

```groovy
/*
 * -------------------------------------------------
 *  Nextflow config file for running on a SLURM managed cluster
 * -------------------------------------------------
 */

process {
  executor = 'slurm'
  beforeScript = 'module load singularity'
}
singularity {
  enabled = true
  autoMounts = true
}

```

### C2 - Google Cloud config

#### Contents of config


```groovy

// This config is specific to google-life-science

includeConfig 'base.config'

google {
    lifeSciences.bootDiskSize = 50.GB
    lifeSciences.preemptible = true
    zone = params.gls_zone
    network = params.gls_network
    subnetwork = params.gls_subnetwork
}

docker.enabled = true

executor {
    name = 'google-lifesciences'
}
```


### C3 - Local config for laptops

#### Contents of config



```groovy
/*
 * -------------------------------------------------
 *  Nextflow config file for running on a local machine eg. laptop
 * -------------------------------------------------
 */

docker.enabled = true

params {
    executor = 'local'
}


```