# RNAseq training CEA - June 2023

IFB session: 5 CPUs + 21 GB of RAM

# Part 1: Download a public dataset


- 0.1 - About session on IFB core cluster
- 0.2 - Parameters to be set or modified by the user
- 1 - Online search for a dataset
- 2 - Download data files with SRA Toolkit (*sra-tools*)
- 3 - Obtaining fastq.gz files with fasterq-dump
- 4 - Files and folders summary when leaving


---
## **Before going further**

<div class="alert alert-block alert-danger"><b>Caution:</b> 
Before starting the analysis, save a backup copy of this notebook: in the left-hand panel, right-click on this file and select "Duplicate".<br>
You can also make backups during the analysis. Don't forget to save your notebook regularly: <kbd>Ctrl</kbd> + <kbd>S</kbd> or click on the 💾 icon.
</div>

---

## 0.1 - About session on IFB core cluster ##

<em>loaded JupyterLab</em> : Version 3.2.1

In [None]:
## Code cell 1 ##

echo "=== Cell launched on $(date) ==="
squeue -hu $USER 

echo "=== Current IFB session size: Medium (4CPU, 10GB) or Large (10CPU, 50GB) ==="
jobid=$(squeue -hu $USER | awk '/jupyter/ {print $1}')
sacct --format=JobID,AllocCPUS,NODELIST -j ${jobid}

---

## 0.2 - Parameters to be set or modified by the user ##

- Using a full path with a `/` at the end, **define the folder** where you want or have to work with the `gohome` variable:

In [None]:
## Code cell 2 ##

gohome="/shared/projects/2312_rnaseq_cea/" # to adjust with your project's folder

echo "=== Home root folder (stored in the variable gohome) is ==="
echo "${gohome}"
echo ""
echo "=== Working (personal) folder tree ==="
tree "${gohome}$USER"
echo "=== Working directory ==="
echo "${PWD}"
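
Every path in this notebook is built by concatenation (for example `"${gohome}$USER"`), so a missing trailing `/` silently produces a wrong path. A minimal sketch of a guard, assuming a hypothetical helper named `ensure_trailing_slash` that is not part of the course material:

```bash
# Hypothetical helper: print the path, adding a trailing slash only if it is missing.
ensure_trailing_slash() {
    local path="$1"
    case "${path}" in
        */) printf '%s\n' "${path}" ;;
        *)  printf '%s/\n' "${path}" ;;
    esac
}

# Example with the project path, typed here without its final slash
ensure_trailing_slash "/shared/projects/2312_rnaseq_cea"
```

One possible use is `gohome="$(ensure_trailing_slash "${gohome}")"` right after defining the variable.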

- Please specify the **maximum number of CPUs** (central processing units, cores) that programs can use.

<div class="alert alert-block alert-warning">
    When working on your own project, you can request a 10-CPU session, for example. In that case, you could set the following value to <b>9, valid for a 10-CPU session</b>. Ideally, use 70-80% of the CPUs your system or session has.
    Here we work on a 5-CPU session, so we use a value of 4.
</div>

In [None]:
## Code cell 3 ##

authorizedCPU=4
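
The 70-80% guideline can also be computed instead of typed by hand. The sketch below assumes a hypothetical helper named `suggest_cpu` that applies a flat 80% rule with integer rounding; it is slightly more conservative than the value of 9 mentioned above for a 10-CPU session.

```bash
# Hypothetical helper: suggest how many CPUs programs may use,
# keeping roughly 80% of the session and never less than 1.
suggest_cpu() {
    local total="$1"
    local suggested=$(( total * 80 / 100 ))   # integer division rounds down
    if [ "${suggested}" -lt 1 ]; then
        suggested=1
    fi
    echo "${suggested}"
}

suggest_cpu 5    # 5-CPU session  -> 4
suggest_cpu 10   # 10-CPU session -> 8
```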

- The last user-defined variable we need is an accession list (SRR numbers, 1 per line) to start retrieving data.  
Either define this variable here or later in **section 2**.

In [None]:
## Code cell 4 ##

accesslist="${gohome}allData/Data/GSE158673_SRR_Acc_List.txt"
echo "The reference list that could be used is ${accesslist}"

head -n 15 "${accesslist}"

This list contains the complete dataset with the 11 samples that we want to analyse.  
However, as downloading and processing all samples would take too long and generate too many heavy files, you will work on the first 3 samples only and use the files that we already downloaded for the remaining samples.  

Therefore, for this analysis, we will go on with a shorter **GSE158673_SRR_Acc_List-2.txt**.

In [None]:
## Code cell 5 ##

accesslist="${gohome}allData/Data/GSE158673_SRR_Acc_List-2.txt"
echo "The reference list that will be used is ${accesslist}"

cat "${accesslist}"

---
## 1 - Online search for sequencing samples ##

Several cases can happen:
- you read a wonderful paper and want to analyze the dataset they used or produced <br>
- you have a project but lack the data (and/or money and/or sequencing platform!) to do it <br>
- an enthusiastic PI asks you a question but you don't have the answer yet <br>

In the last two cases, you need to find a suitable dataset before going on.  

If your dataset is already available on the server, skip this part... unless you want to know how to do it next time ;)

#### **1.1- Find a paper dataset** <a id="section005"></a>

Research articles should include an identifier corresponding to the data used to obtain the published results.   
This identifier corresponds usually to an entry in the <a href="https://www.ncbi.nlm.nih.gov/gds"> <b>GEO datasets</b> </a> database.    

In this course, we use the data registered with the GEO Series Number: <a href="https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE158673"> <b>GSE158673</b> </a>   
The page with this identifier provides many interesting info about the research project, such as experiments, protocols, related publications...

#### **1.2- Find a dataset suited for your biological question** <a id="section005"></a>

To that purpose, we can browse <a href="https://www.ncbi.nlm.nih.gov/gds"> <b>GEO datasets</b> </a> with some filters. <br>


Starting on the main page, run a search with 2 to 3 keywords (including *RNA seq*) describing your research project. <br>
Faced with the huge number of results, add the following filters (left sidebar of the web page):
- *Organism* > *Customize* > *Browse* (Homo sapiens, Mus musculus, Rattus norvegicus, ...) > select those you want to display > *Show*. Then click on the one(s) you want
- *Study type* > Select only *Expression Profiling by high throughput sequencing*. Back on the results page, click on the added option to set it active
- If you still have too many options to handle, select *Tissue* in the *Attribute name* option

Then browse the results you have to find a paper that suits your project. <br>
Once you have picked one, keep a record of its accession references:
<blockquote>
GEO Series Number: GSExxxxxx <br>
SRA study number: SRPxxxxxx <br>
SRA experiment number: SRXxxxxxxx <br>
BioProject reference: PRJNAxxxxxx
</blockquote>
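
When several of these references end up in the same notes file, the prefix tells which database each one belongs to. A small sketch, assuming a hypothetical helper named `accession_type`; the `SRR0000000` and `PRJNA000000` values are placeholders, not real records:

```bash
# Hypothetical helper: name the kind of record an accession refers to,
# based on its prefix only (the number itself is not validated).
accession_type() {
    case "$1" in
        GSE*)   echo "GEO series" ;;
        SRP*)   echo "SRA study" ;;
        SRX*)   echo "SRA experiment" ;;
        SRR*)   echo "SRA run" ;;
        PRJNA*) echo "BioProject" ;;
        *)      echo "unknown" ;;
    esac
}

# GSE158673 is the series used in this course; the two others are placeholders
for acc in GSE158673 SRR0000000 PRJNA000000; do
    printf '%s\t%s\n' "${acc}" "$(accession_type "${acc}")"
done
```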

<div class="alert alert-block alert-info">
    There is an alternative to the American NCBI database in Europe: <a href="https://www.ebi.ac.uk/ena/browser/home">ENA (<i>European Nucleotide Archive</i>)</a> from the <a href="https://www.ebi.ac.uk/">EMBL-EBI</a> institute, part of the <a href="https://elixir-europe.org/platforms/data/core-data-resources">Elixir Core Data resource</a>.<br>
    Unfortunately, to the authors' knowledge, there is not any command-line tool to retrieve datafiles when a dataset is identified. So, we will use NCBI's resources for now.
</div>

#### **1.3 - Retrieve dataset information directly from the SRA archive**

Go to the <a href='https://www.ncbi.nlm.nih.gov/Traces/study/'>SRA (Sequence Read Archive) Run Selector</a> portal from the NCBI.
> Another option is the <a href='https://trace.ncbi.nlm.nih.gov/Traces/sra/'>SRA</a> website itself.

A link to the **SRA Run Selector** is also provided at the bottom of the GEO datasets pages.

Using the search bar with your SRA accession number, open the page of the dataset you chose. 

Further down the page, you can select the samples you want from the full list. <br>
Whether you select samples or take the entire dataset, download the `Metadata` and the `Accession List` from the `Select` section in the middle of the page.
> Add an accession reference at the beginning of the file names (`SraRunTable.txt` and `SRR_Acc_List.txt` respectively) to distinguish series when you save many files.

---
## 2 - Download data files with SRA Toolkit (<i>sra-tools</i>) ##

Since this is the first time you use **sra-tools**, you need to run `vdb-config --interactive` once *(in a terminal, press 'x' to save a basic configuration)*. This step is not needed for later use. Below, we do it by calling `vdb-config` with its absolute path.

In [None]:
## Code cell 6 ##

echo "Aexyo" | /shared/software/miniconda/envs/sra-tools-2.11.0/bin/vdb-config -i

As an alternative to the sra-tools command lines, we can retrieve ready-made download commands for fastq files from the [sra-explorer web interface](https://ewels.github.io/sra-explorer/) developed by Phil Ewels <a href="https://www.biostars.org/p/366721/#366722"><i>as a fun little side project</i></a>.

### **2.1 - Tools versions**

We load needed tools and display their current versions.

In [None]:
## Code cell 7 ##

module load sra-tools pigz
# module load sra-tools/2.11 pigz/2.3.4 most recent version on the IFB as of 2023/03/15

echo "===== sra-tools modules ====="
prefetch --version
vdb-validate --version
fasterq-dump --version
echo "===== compression tool ====="
pigz --version
echo ""

### **2.2 - Download SRA files with accession list**


We will use **prefetch**, which is part of the SRA Toolkit. It allows command-line downloading of SRA, dbGaP and ADSP data.

<ul class="alert alert-block alert-info">
    <li>
        about sra toolkit <a href="https://www.ncbi.nlm.nih.gov/sra/docs/sradownload/">previous</a> and <a href="https://hpc.nih.gov/apps/sratoolkit.html">latest releases</a>
    </li>
    <li>
    another available option is to use <a href="https://www.biostars.org/p/366721/#366722">sra-explorer web interface</a> to retrieve <code>wget</code> or <code>curl</code> bash command lines
    </li>
</ul>

We will download the ``.sra`` files (compressed ones) using the accession list we have downloaded before.  
Please set the file path and name here if not previously done in the **Parameters** section.

In [None]:
## Code cell 8 ##

accesslist="${accesslist}"
echo "The reference list that will be used is ${accesslist}"
#accesslist="${gohome}data/GSE158673_SRR_Acc_List-2.txt"

Let's call this file ``accesslist`` from now on.  

Let's look inside to see which references will be fetched.

In [None]:
## Code cell 9 ##

cat "${accesslist}"

We will store downloaded files in the folder ``/shared/projects/2312_rnaseq_cea/$USER/Data/``, in a subfolder called ``sra``.  
Here, ``$USER`` corresponds to your user name and therefore to the folder that was created previously.

In [None]:
## Code cell 10 ##

mkdir -p "${gohome}$USER/Data/sra/"

``prefetch`` is verbose: dates and progress are displayed directly in the notebook output. As the download takes some time, we may get disconnected and lose track of it. To avoid losing information, we <a href="https://www.cyberciti.biz/faq/how-to-write-the-output-into-the-file-in-linux/">redirect</a> the screen output to a text file (here called <code>logfile</code>).
<blockquote>
    <code>&#38;>></code> to capture both standard output and standard error and append them to a file <br>
    <code>|&#38;</code> to pipe both standard output and standard error to the next command (here <code>tee</code>), so that information is still displayed on screen <br>
    <code>tee</code> to read from standard input and write to both standard output and files <br>
    <code>-a</code> for the <code>tee</code> command to append to the file instead of overwriting it
</blockquote>
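
To see these operators at work without downloading anything, here is a minimal sketch. The `noisy` function and the temporary `demo_log` file are made up for the illustration, and `&>>` and `|&` require bash 4 or later:

```bash
# Temporary log file, for illustration only
demo_log="$(mktemp)"

# A command that writes one line to standard output and one to standard error
noisy() {
    echo "message on stdout"
    echo "message on stderr" >&2
}

noisy &>> "${demo_log}"          # nothing on screen, both lines appended to the file
noisy |& tee -a "${demo_log}"    # both lines on screen AND appended to the file

wc -l < "${demo_log}"            # the file now holds 4 lines
```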

Then we launch the <code>prefetch</code> tool with this list.

In [None]:
## Code cell 11 ##
# We create the log text file

logfile="${gohome}$USER/Data/sra-files_retrieval.log"

In [None]:
## Code cell 12 ##

echo "Screen output is redirected to ${logfile}."

# as time command does not redirect output
echo "operation starting by $(date)" >> ${logfile}

time for srrnum in $(cat "${accesslist}"); do

    echo "=== starting for ${srrnum}..." |& tee -a ${logfile}
    date &>> ${logfile}

    prefetch -O "${gohome}$USER/Data/sra/" \
             ${srrnum} \
             &>> ${logfile}

    echo "... done $(date) ===" |& tee -a ${logfile}

done

echo "operation finished by $(date)" >> ${logfile}

``prefetch`` options we use or you can add:
<blockquote>
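    <code>-O</code> (<code>--output-directory</code>) to specify the destination folder <br>
    <code>-f yes</code> (<code>--force</code>) to download a file again even if it is already present and complete, instead of skipping it <br>
    <code>-c</code> (<code>--check-all</code>) to double-check all reference sequences <br>
    Option letters vary between sra-tools releases (in 2.x releases, <code>-t</code> selects the transport protocol, while the folder for temporary files is a <code>fasterq-dump</code> option): run <code>prefetch --help</code> to check those of the installed version <br>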

</blockquote>

Let's check that we have all expected files before going on.  
The ``prefetch`` tool puts every ``.sra`` file in a folder named after its accession number, so we use the ``tree`` command to visualise them all.

In [None]:
## Code cell 13 ##

tree "${gohome}$USER/Data/sra/" >> ${logfile}
tree "${gohome}$USER/Data/sra/" | grep "files"

If the number of downloaded files matches the number of samples, that's a good starting point. Let's check that there is no issue with those files. 
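
Comparing counts by eye works for 3 samples but not for a long list. The sketch below assumes a hypothetical helper named `missing_sra` and the folder layout described above (one folder per accession, containing `<accession>.sra`); it prints the accessions for which no file is found.

```bash
# Hypothetical helper: list accessions from the list that have no .sra file
# under the given folder (expected layout: <folder>/<accession>/<accession>.sra).
missing_sra() {
    local list="$1" sradir="$2" acc
    # the "|| [ -n ... ]" part keeps a last line that has no final newline
    while read -r acc || [ -n "${acc}" ]; do
        [ -z "${acc}" ] && continue               # skip empty lines
        if [ ! -f "${sradir%/}/${acc}/${acc}.sra" ]; then
            echo "${acc}"
        fi
    done < "${list}"
}

# Usage with the notebook variables (prints nothing when every file is present):
# missing_sra "${accesslist}" "${gohome}$USER/Data/sra/"
```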

### **2.3- Check sra files integrity**

We will use the **vdb-validate** tool from ``sra-tools`` to check whether the files match those in the online archive.

In [None]:
## Code cell 14 ##

logfile="${gohome}$USER/Data/sra-files_integrity.log"

In [None]:
## Code cell 15 ##

echo "Screen output is redirected to ${logfile}"

echo "operation starting by $(date)" >> ${logfile}

time for srrfile in $(find "${gohome}$USER" -name "*.sra" | sort); do

    echo "... working on $(basename ${srrfile}) file..." |& tee -a ${logfile}
    vdb-validate "${srrfile}" &>> ${logfile}
                 
done

echo "operation finished by $(date)" >> ${logfile}


Now, you can either open the ``.log`` file in a text editor (left panel, browse for the file and right-click on it) or display it in this notebook.  
A shorter option is to show only the lines that concern ``.sra`` files (we search for the ``.sra`` extension followed by ``' ``, otherwise comment lines would be displayed too):

In [None]:
## Code cell 16 ##

grep -F ".sra' " "${logfile}"

Another option is to count the lines that contain ***ok*** (for each sample, 2 for ``md5`` and 4 for ``checksums``) or ***is consistent*** (1 per ``.sra`` file).

In [None]:
## Code cell 17 ##

grep -c "md5 ok" "${logfile}"
echo $(( $(grep -c "checksums ok" "${logfile}") / 4 ))
grep -c "is consistent" "${logfile}"

---

## 3 - Raw ``.fastq.gz`` files generation ##

### 3.1 - Introduction

<ul class="alert alert-block alert-info">
    <li>
        about <a href="https://github.com/ncbi/sra-tools/wiki/08.-prefetch-and-fasterq-dump">prefetch and fasterq-dump</a>
    </li>
    <li>
        <a href="https://usegalaxy.eu/">usegalaxy.eu</a>: first section (<i>Get Data</i>), <br>workflow called <b>Faster Download and Extract Reads in FASTQ format from NCBI SRA</b>
    </li>
    <li>
        about <a href="https://zlib.net/pigz/pigz.pdf">pigz</a> for multithreading
    </li>
</ul>

The ``.sra`` files downloaded in the previous cells are 'small' files compared to ``.fastq`` files. The idea is to download small amounts of data to avoid network issues during the process... but we can't start our analyses on them. We have to re-create raw ``.fastq`` files first. <br>

To that purpose, the **fasterq-dump** tool (derived from **fastq-dump**, the now deprecated tool of previous sra-tools releases) works on a ``.sra`` file and decompresses it into fastq files (by default, 1 per read, so 2 for paired-end data, PE). <br>

The default command line is: ``fasterq-dump filename.sra``.   
For paired-end data, it splits reads into up to 3 files (equivalent to the ``--split-3`` option of the **fastq-dump** tool): 1 file per read direction for mated reads and another one for the remaining unmated reads.  
We can add some options to specify the folders to use; otherwise, generated files are written to the current working directory. <br>
<blockquote>
    <code>--outdir</code>, to know where to find the <code>.fastq</code> files <br>
    <code>--temp</code>, to control where temporary files are, indeed temporarily, stored (and check they are removed after :-) ) <br>
</blockquote>

As ``.fastq`` files are quite big, we add a compression step right after each ``.fastq`` file is created. The tool, called **pigz**, compresses files in the gzip (``.gz``) format, which can be handled downstream. <br>
This tool is based on the **gzip** tool (its name stands for *parallel implementation of gzip*). It lets several threads work in parallel on the same file, so it runs faster than **gzip**, even if it still takes quite some time: up to 8-9 minutes for the biggest file among the 11 samples processed here. <br>
<blockquote>
    <code>-p number</code> or <code>--processes number</code> to allow up to <code>number</code> compression threads (default is the number of online processors, or 8 if unknown)
</blockquote>
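
The compression step can be tried on a toy file before launching the real loop. This is only a sketch: `compress_fastq` is a hypothetical helper, the FASTQ record is made up, and the fallback to plain `gzip` is an assumption for machines where `pigz` is not installed.

```bash
# Hypothetical helper: compress one file with pigz when it is installed,
# otherwise fall back to single-threaded gzip (same .gz output format).
compress_fastq() {
    local file="$1" threads="${2:-1}"
    if command -v pigz > /dev/null 2>&1; then
        pigz --processes "${threads}" "${file}"
    else
        gzip "${file}"
    fi
}

# Tiny made-up FASTQ record, for illustration only
demo_dir="$(mktemp -d)"
printf '@read1\nACGT\n+\nIIII\n' > "${demo_dir}/demo.fastq"

compress_fastq "${demo_dir}/demo.fastq" 2

# gzip -t tests the integrity of the compressed file
gzip -t "${demo_dir}/demo.fastq.gz" && echo "archive is valid"
```

As with `gzip`, the original ``.fastq`` file is replaced by its ``.fastq.gz`` version, so no extra clean-up is needed.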

### **3.2 - To obtain ``.fastq.gz`` files from ``.sra``**

Using ``--outdir`` in the ``fasterq-dump`` command, we will place the created files in the following folder:

In [None]:
## Code cell 18 ##

outfolder="${gohome}$USER/Data/fastq/raw/"
mkdir -p "${outfolder}"

Let's recall, or set if not done in the **Parameters** section, the number of CPUs (central processing units, cores) that the multithreaded program `pigz` is supposed to use.  
<div class="alert alert-block alert-warning">
    A value of <b>9 is valid for a 10-CPU session</b>; for the 5-CPU session used here, we keep 4. Ideally, use 70-80% of the CPUs your system or session has.
</div>

In [None]:
## Code cell 19 ##

authorizedCPU=${authorizedCPU}
#authorizedCPU=4

echo "The number of CPU available for computing is ${authorizedCPU}"

Be patient while running the following *big loop* cell!

In [None]:
## Code cell 20 ##

logfile="${gohome}$USER/Data/sra-files_creation_fastqgz.log"
echo "Screen output is redirected to ${logfile}."

# as time command does not redirect output
echo "operation starting by $(date)" >> ${logfile}

time for srrnum in $(ls "${gohome}$USER/Data/sra/"); do
    
    echo "========== Processing sample ${srrnum} ===========" |& tee -a ${logfile}
    srrfolder="${gohome}$USER/Data/sra/${srrnum}"
    fasterq-dump --outdir ${outfolder} \
                 --temp ${outfolder} \
                 ${srrfolder} \
                 &>> ${logfile}
    
    # don't assume there are 1 or 2 fastq by sample (3rd one if unmated reads)
    for fullfile in $(ls "${outfolder}"*.fastq); do
    
        echo "... compressing $(basename ${fullfile})..." |& tee -a ${logfile}
        date >> ${logfile}
        pigz --processes ${authorizedCPU} ${fullfile}
        date >> ${logfile}
    done
    
done

echo "operation finished by $(date)" >> ${logfile}
ls -lh "${outfolder}" >> ${logfile}
echo "Step done"

---
## 4 - Files, folders and versions summary when leaving

Let's count the number of files in our destination folder.

In [None]:
## Code cell 21 ##

ls "${outfolder}" | wc -l

... and check the tree structure of our folder with the Unix command `tree`.  
<blockquote>
    Adding the <code>-L</code> option followed by a number limits how deep <code>tree</code> descends into the folders, which keeps the output readable. <br>
    To list only directories, use option <code>-d</code>
</blockquote>

In [None]:
## Code cell 22 ##

tree -d -L 3 "${gohome}$USER/"

To see how much disk space the files take (either the shortest output or a detailed one):

In [None]:
## Code cell 23 ##
# disk usage 

du -ch -d1 "${outfolder}"

or

In [None]:
## Code cell 24 ##

ls -lh "${outfolder}"
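
File sizes say little about content, so a quick extra sanity check is to count reads. The sketch below assumes a hypothetical helper named `count_reads` and relies on the usual FASTQ layout of 4 lines per read; for paired-end data, both files of a sample are expected to give the same number.

```bash
# Hypothetical helper: number of reads in a gzip-compressed FASTQ file
# (a FASTQ record normally spans 4 lines: header, sequence, '+', qualities).
count_reads() {
    local nlines
    nlines=$(gzip -dc "$1" | wc -l)
    echo $(( nlines / 4 ))
}

# Usage with the notebook variable:
# for f in "${outfolder}"*.fastq.gz; do
#     echo "$(basename "${f}"): $(count_reads "${f}") reads"
# done
```

Decompressing every file takes a while on real samples, so this is better run once than at every session.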

---
___

## Conclusion

**Next practical session** 

The jupyter notebook used for the next session will be the *Pipe_02-bash_raw-data-quality.ipynb*    
Let's retrieve it in our directory, in order to have a private copy to work on:   

In [None]:
## Code cell 25 ##   

cp "${gohome}pipeline/Pipe_02-bash_raw-data-quality.ipynb" "${gohome}$USER/"



**Save executed notebook**

To end the session, save your executed notebook in your `run_notebooks` folder. Adjust the name with yours and reformat as code cell to run it.


Now we can go on to take a look at the data and check sample quality.  
  
**=> Next: Lecture 2 : Raw data quality** 

<div class="alert alert-block alert-success"><b>Success:</b> Well done! You now know how to download fastq raw data files from GEO.<br>
Don't forget to save your notebook and export a copy as an <b>html</b> file as well <br>
- Open "File" in the Menu<br>
- Select "Export Notebook As"<br>
- Export notebook as HTML<br>
- You can then open it in your browser even without being connected to the server! 
</div>

---
---

## Useful commands
<div class="alert alert-block alert-info"> 
    
- <kbd>CTRL</kbd>+<kbd>S</kbd> : save notebook<br>    
- <kbd>CTRL</kbd>+<kbd>ENTER</kbd> : Run Cell<br>  
- <kbd>SHIFT</kbd>+<kbd>ENTER</kbd> : Run Cell and Select Next<br>   
- <kbd>ALT</kbd>+<kbd>ENTER</kbd> : Run Cell and Insert Below<br>   
- <kbd>ESC</kbd>+<kbd>y</kbd> : Change to *Code* Cell Type<br>  
- <kbd>ESC</kbd>+<kbd>m</kbd> : Change to *Markdown* Cell Type<br> 
- <kbd>ESC</kbd>+<kbd>r</kbd> : Change to *Raw* Cell Type<br>    
- <kbd>ESC</kbd>+<kbd>a</kbd> : Create Cell Above<br> 
- <kbd>ESC</kbd>+<kbd>b</kbd> : Create Cell Below<br> 

<em>  
To make nice html reports with markdown: <a href="https://dillinger.io/" title="dillinger.io">html visualization tool 1</a> or <a href="https://stackedit.io/app#" title="stackedit.io">html visualization tool 2</a>, <a href="https://www.tablesgenerator.com/markdown_tables" title="tablesgenerator.com">to draw nice tables</a>, and the <a href="https://medium.com/analytics-vidhya/the-ultimate-markdown-guide-for-jupyter-notebook-d5e5abf728fd" title="Ultimate guide">Ultimate guide</a>. <br>
Further reading on JupyterLab notebooks: <a href="https://jupyterlab.readthedocs.io/en/latest/user/notebook.html" title="Jupyter Lab">Jupyter Lab documentation</a>.<br>   
</em>    
 
</div>

Bénédicte Noblet - 05-07 2021   
Sandrine Caburet - 02-05 2023   
Claire Vandiedonck - 02-06 2023  
Updated 01/06/2023