# RNAseq training CEA - June 2023

IFB session: 5 CPU + 21 GB of RAM

# Part 2 : Quality check of raw data before processing



- 0.1 - About session on IFB core cluster
- 0.2 - Parameters to be set or modified by the user
- 1 - Some checks as a precaution
- 2 - Quality control on raw `.fastq.gz` files
- 3 - Summary report with MultiQC
- 4 - Files, folders and versions summary when leaving


---
## **Before going further**

<div class="alert alert-block alert-danger"><b>Caution:</b> 
Before starting the analysis, save a backup copy of this notebook: in the left-hand panel, right-click on this file and select "Duplicate".<br>
You can also make backups during the analysis. Don't forget to save your notebook regularly: <kbd>Ctrl</kbd> + <kbd>S</kbd> or click on the 💾 icon.
</div>

---

## 0.1 - About session on IFB core cluster

<em>loaded JupyterLab</em> : Version 3.2.1

In [None]:
## Code cell 1 ##

echo "=== Cell launched on $(date) ==="

echo "=== IFB session size ==="
jobid=$(squeue -hu $USER | awk '/jupyter/ {print $1}')
sacct --format=JobID,AllocCPUS,NODELIST -j ${jobid}

In [None]:
## Code cell 2 ##

module load bc/1.07.1 fastqc/0.11.9 multiqc/1.13

echo "===== basic calculator ====="
bc --version | head -n 1
echo "===== individual reports ====="
fastqc --version
echo "===== compiled report ====="
multiqc --version

---

## 0.2 - Parameters to be set or modified by the user

- Using a full path with a `/` at the end, **define the folder** where you want or have to work with the `gohome` variable:

In [None]:
## Code cell 3 ##

gohome="/shared/projects/2312_rnaseq_cea/"

echo "=== Home root folder (stored in the variable gohome) is ==="
echo "${gohome}"
echo ""
echo "=== Working (personal) folder tree ==="
tree -L 2 "${gohome}$USER"
echo "=== Working directory ==="
echo "${PWD}"
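
The paths in this notebook are built by plain concatenation (for example `"${gohome}$USER/Data/"`), so a missing final `/` silently produces a wrong path. As a minimal sketch, the hypothetical helper `ensure_trailing_slash` below (an assumption for illustration, not part of the original notebook) returns a folder path that ends with `/` whether or not it was typed with one:

```bash
# Hypothetical helper (not in the original notebook):
# strip one trailing "/" if present, then add it back
ensure_trailing_slash() {
    printf '%s/\n' "${1%/}"
}

ensure_trailing_slash "/shared/projects/2312_rnaseq_cea"
ensure_trailing_slash "/shared/projects/2312_rnaseq_cea/"
```

Both calls print the same path. In the notebook, it could be applied as `gohome=$(ensure_trailing_slash "${gohome}")`.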

- Please specify the **maximum number of CPUs** (central processing units, cores) that programs can use.

<div class="alert alert-block alert-warning">
    The following value of <b>4 is valid for a 5-CPU session</b>. Ideally, use 70-80% of the CPUs available on your system or in your session.
</div>

In [None]:
## Code cell 4 ##

#authorizedCPU=4

authorizedCPU=4
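
The 70-80% rule can be turned into a small calculation. The sketch below is only an illustration: `suggest_cpu` is a hypothetical helper, and the default of 80% is an assumption chosen because it gives 4 for a 5-CPU session.

```bash
# Hypothetical helper (not in the original notebook):
# number of CPUs to use = percent of the total, rounded down, at least 1
suggest_cpu() {
    local total=$1
    local percent=${2:-80}
    local n=$(( total * percent / 100 ))   # integer division rounds down
    if (( n < 1 )); then
        n=1
    fi
    echo "${n}"
}

suggest_cpu 5 80    # 5-CPU session: prints 4
```

On a Linux session, the total could be taken from `nproc`, for example `authorizedCPU=$(suggest_cpu "$(nproc)")`. Whether `nproc` reflects the CPUs actually allocated to your job depends on the cluster configuration, so check the result against the session size shown in section 0.1.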

- Give the **full path to the folder containing (only) the raw data files** to be analysed:

In [None]:
## Code cell 5 ##

rawfolder="${gohome}$USER/Data/fastq/raw/"

- In addition, a comment will later be included in the quality report to keep experiment information handy. Please adapt the text of the ``mycomment`` variable in [section 3.2](#multiqctextvar) before launching the MultiQC report generation.

---
## 1 - Some checks as a precaution

### **1.1- Available files**

The data files are already present on the server, in the `Data/fastq/raw/` folder of your current working directory, as produced by the previous step.

As we may have changed session and/or day, let's first check that all files are there using the following commands:

In [None]:
## Code cell 6 ##

rawfolder=${rawfolder}
#rawfolder="${gohome}$USER/Data/fastq/raw/"

echo "There are $(ls ${rawfolder} | wc -l) raw .fastq.gz files:"
ls ${rawfolder}

The files consist of raw data from the Illumina sequencer (`.fastq`) whose size has been reduced (`.gz`) by compression (**pigz** tool, see the `Pipe_01.ipynb` notebook). As genomics tools can deal with both compressed and uncompressed file formats, we save disk space by using the compressed ones.

### **1.2- Examining data files: are they what we expect?**

Let's pick one file to get a look at the data.

We list the files in the folder and ask for only the first line (``-n 1``).

In [None]:
## Code cell 7 ##

arawfile=$(ls "${rawfolder}"*gz | head -n 1)
echo ${arawfile}

``.fastq`` files are human-readable, so we can display the first lines of this file by piping the output of the **zcat** command, which can read ``.gz`` files, into the Unix ``head`` command.

In [None]:
## Code cell 8 ##

zcat ${arawfile} | head

We expect a text file with 4 lines per read (sequence):
- the read identifier, starting with `@`
- the sequence itself (some `N` may appear when bases are undetermined)
- a separator line starting with `+`, followed either by the identifier again (files from early sequencers) or by nothing else
- the Phred quality string, one special character per base (with the Phred+33 encoding of current Illumina data, scores 0 to 41 are written as ASCII characters 33 to 74, *i.e.* `!` to `J`)

<div class="alert alert-block alert-info">
    For more information on the Phred score and its history, please refer to the <a href="https://en.wikipedia.org/wiki/FASTQ_format#Encoding">FASTQ format Wikipedia page</a>, which displays a graphical view of the different Phred score encodings.
</div>
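
To make the quality line less mysterious, here is a minimal sketch that converts a quality character into its score. It assumes the Phred+33 encoding used by current Illumina instruments (score = ASCII code of the character minus 33); `phred33` is a hypothetical helper written for this illustration.

```bash
# Hypothetical helper: Phred score of one quality character (Phred+33 assumed)
phred33() {
    local ascii
    # a leading quote makes printf output the numeric code of the character
    ascii=$(printf '%d' "'$1")
    echo $(( ascii - 33 ))
}

for q in '!' '#' 'I' 'J'; do
    echo "character ${q} -> Phred $(phred33 "${q}")"
done
```

With this encoding, `!` is the lowest possible score (0) and `I` corresponds to a score of 40, i.e. an estimated error probability of 1 in 10,000.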

To count lines in that file:

In [None]:
## Code cell 9 ##

time wcloutput=$(zcat ${arawfile} | wc -l)
echo ${wcloutput}

For those who don't want to reach for a calculator, we will use the **bc** basic calculator, which allows decimal arithmetic in `bash`.

In [None]:
## Code cell 10 ##

echo "scale=2; ${wcloutput}/4" | bc -l

If the result has no decimal part (*i.e.* it ends with `.00`) and the file format is correct (see the bullet list above), we have a good start. Otherwise, please ask the data supplier (platform or colleagues) for information: file extensions are easy to add or change, and files could have been overwritten.
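
The same check can also be done with integer arithmetic only, using the remainder of the division by 4. This is a minimal sketch; `check_fastq_lines` is a hypothetical helper and the value 400 is just an example line count.

```bash
# Hypothetical helper: is the number of lines a multiple of 4?
check_fastq_lines() {
    local nlines=$1
    if (( nlines % 4 == 0 )); then
        echo "OK: $(( nlines / 4 )) reads"
    else
        echo "WARNING: ${nlines} lines is not a multiple of 4"
        return 1
    fi
}

check_fastq_lines 400
```

With the variable defined in this notebook, the call would be `check_fastq_lines "${wcloutput}"`.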

<blockquote>
    Alternatively, we can get the number of reads directly: all read identifiers in this file start (<code>^</code> in a search pattern) with <code>@SRR</code>, so we can use the <code>zgrep</code> command to search for this pattern in a <code>.gz</code> file:
</blockquote>

In [None]:
## Code cell 11 ##

time zgrep "^@SRR" ${arawfile} | wc -l
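
One caveat about pattern-based counting: `@` is also a valid quality character, so a quality line can start with `@` too. Matching the full `@SRR` prefix makes such a collision very unlikely, and counting identifier lines by their position (lines 1, 5, 9, ...) avoids it altogether. A minimal sketch on a made-up two-read file (read names and sequences are invented for illustration):

```bash
# Build a tiny compressed FASTQ file; the first quality line starts with "@"
tmpfq=$(mktemp)
printf '%s\n' '@SRR0000001.1 1/1' 'ACGTN' '+' '@IIII' \
              '@SRR0000001.2 2/1' 'GGCAT' '+' 'IIII#' | gzip -c > "${tmpfq}"

# Count identifier lines by position: every 4th line, starting at line 1
nreads=$(gzip -dc < "${tmpfq}" | awk 'NR % 4 == 1 && /^@/ {n++} END {print n+0}')

# Naive count of all lines starting with "@"
naive=$(gzip -dc < "${tmpfq}" | grep -c '^@')

echo "identifier lines by position: ${nreads}; lines starting with @: ${naive}"
rm -f "${tmpfq}"
```

Here the naive count is one too high because of the quality line `@IIII`, while the position-based count gives the true number of reads.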

---
## 2 - First quality control on raw <code>.fastq.gz</code> files

### **2.1 - Tool version and introduction**
For this step, we will use <a href="https://www.bioinformatics.babraham.ac.uk/projects/fastqc/"><b>FASTQC</b></a> (notebook developed with ``FastQC v0.11.9``).

In [None]:
## Code cell 12 ##

fastqc --version

To analyze a sample, we could launch: <br>
<code>fastqc --outdir path/to/destination/folder/ \ <br>        path/to/file.fastq.gz</code> <br>
where <code>--outdir</code> introduces the path where you want the newly created files to be saved, while the file to be analyzed is placed at the end of the line. <br>
<br>
For several samples, we can directly launch <code>fastqc</code> with a list of files to analyze. If several cores are available, we can ask <code>fastqc</code> to process several files at a time.
<blockquote>
    <code>-t 16</code> or <code>--threads 16</code> to ask for 16 files to be processed in parallel, knowing that each thread will use 250 MB of RAM (<em>so 4 GB at a time for 16 threads; note that 32 files also correspond to 2 times 16 samples</em>)
</blockquote>
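
The memory estimate above is a simple multiplication. As a minimal sketch (the helper name `fastqc_ram_mb` is hypothetical, and the 250 MB per thread is the figure quoted above):

```bash
# Hypothetical helper: estimated RAM (in MB) for a given number of FastQC threads
fastqc_ram_mb() {
    local threads=$1
    local mb_per_thread=250   # figure quoted in the text above
    echo $(( threads * mb_per_thread ))
}

echo "16 threads: $(fastqc_ram_mb 16) MB"
echo "4 threads: $(fastqc_ram_mb 4) MB"
```

With the 4 threads used in this session, the estimate is about 1 GB, far below the 21 GB of RAM of the session.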

### **2.2 - Prepare destination folders**

We will store output files in ``Results/`` and in a subfolder called ``fastqc/``.

In [None]:
## Code cell 13 ##

qcfolder="${gohome}$USER/Results/fastqc/"
mkdir -p "${qcfolder}"

As it's easier to work with files saved close to each other, the matching ``.log`` files will be saved in a ``logfiles/`` subfolder, also placed in ``Results/``.

In [None]:
## Code cell 14 ##

logfolder="${gohome}$USER/Results/logfiles/"
mkdir -p "${logfolder}"

Let's recall, or set if not done in the **Parameters** section, the number of CPUs (central processing units, cores) that the multithreaded program `fastqc` is supposed to use.  
<div class="alert alert-block alert-warning">
    The following value of <b>4 is valid for a 5-CPU session</b>. Ideally, use 70-80% of the CPUs your system or session has.
</div>

In [None]:
## Code cell 15 ##

authorizedCPU=${authorizedCPU}
#authorizedCPU=4


echo "The number of CPU available for computing is ${authorizedCPU}"

### **2.3 - Run ``fastqc`` tool**

In [None]:
## Code cell 16 ##

logfile="${logfolder}fastqc_raw-quality-processing.log"
echo "Screen output is redirected to ${logfile}"

In [None]:
## Code cell 17 ##

# the report printed by `time` is not captured by the redirection below,
# so start and end timestamps are written to the log file explicitly
echo "operation started on $(date)" >> "${logfile}"

time fastqc --outdir "${qcfolder}" --threads "${authorizedCPU}" \
            "${rawfolder}"*.gz \
            &>> "${logfile}"
echo "operation finished on $(date)" >> "${logfile}"

In [None]:
## Code cell 18 ##

# to see which files we have afterward and follow folder sizes
ls -lh ${qcfolder} >> ${logfile}
ls -lh "${gohome}$USER/Results/" >> ${logfile}

echo "$(ls -l "${qcfolder}"*.html | wc -l) generated .html reports"

The outputs are a `.zip` archive and an `.html` file for each input file, the latter being a complete summary of the analysis. <br>
To open an `.html` file, in the left-hand panel of *JupyterLab* double-click the "Results" folder, then the "fastqc" folder, and finally the html file: it should open in a new tab beside this notebook. <br>
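
For a quick text overview without opening the html page, each `.zip` archive should also contain a tab-separated `summary.txt` file with one PASS/WARN/FAIL line per FastQC module. The sketch below filters such a summary; to stay self-contained it works on invented content, and the file name `demo_1.fastq.gz` is a placeholder.

```bash
# Invented summary.txt content (status <TAB> module <TAB> file name)
summary=$(printf '%s\t%s\t%s\n' \
    PASS "Basic Statistics"            demo_1.fastq.gz \
    WARN "Per base sequence content"   demo_1.fastq.gz \
    FAIL "Sequence Duplication Levels" demo_1.fastq.gz \
    PASS "Adapter Content"             demo_1.fastq.gz)

# Keep only the modules that did not pass
flagged=$(printf '%s\n' "${summary}" | awk -F'\t' '$1 != "PASS" {print $1 ": " $2}')
echo "${flagged}"
```

On real results, the summary could be read from an archive with something like `unzip -p path/to/sample_fastqc.zip '*/summary.txt'` (the path is a placeholder; check the actual archive names in your `fastqc/` folder).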

If you have no teacher or bioinformatician at hand (or if they are not familiar with this subject either), you can browse the following links:
<ul class="alert alert-block alert-info">
    <li><code>fastqc</code> help sections on <a href="https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/">its raw (no layout!) website</a>
    </li>
    <li>Michigan State University's support facility offers a nicer <a href="https://rtsf.natsci.msu.edu/genomics/tech-notes/fastqc-tutorial-and-faq/">FASTQC Tutorial and FAQ</a>
    </li>
    <li>The <i>Assessing quality metrics</i> section in <a href="https://hbctraining.github.io/Intro-to-rnaseq-hpc-salmon/lessons/qc_fastqc_assessment.html">Quality control</a>, from a former workshop by the <i>Harvard Chan Bioinformatics Core</i> (HBC). <br>
        <i>++</i>: <b>some diagrams of a sequencing run and detailed information in general</b> 
    </li>
    <li>The <a href="https://huoww07.github.io/Bioinformatics-for-RNA-Seq/lessons/02_Quality_Control.html#understand-fastqc-report">RNA sequencing quality control with fastQC</a> section of the <i>Tufts University Research Technology Workshop</i>
    </li>
</ul>

After eyeing one file (<i>so only one sample!</i>), you may want to view all results at the same time to compare between samples. That's where MultiQC and next steps will help us.

---
## 3 - Compiling a summary report with MultiQC

When numerous samples are processed, looking at each individual quality report quickly becomes tedious. For that purpose, we will run <a href="https://multiqc.info/"><b>MultiQC</b></a>, which automatically scans a folder for all quality check outputs and produces a single interactive report in html format.

### **3.1 - Tool version and short presentation**

This notebook was developed with ``multiqc, version 1.13``.

In [None]:
## Code cell 19 ##

multiqc --version

This tool deals with almost all NGS tools: see <a href="https://multiqc.info/docs/#multiqc-modules">the full, up-to-date online list</a> for more details and to learn how it works (detected files and folder extensions).

By default, **multiqc** identifies any report it can parse from the input directory.
If you only want to generate a MultiQC report on specific analyses, you can add the argument ``-m`` followed by the name of the module, for example:
<code>multiqc -m fastqc ./Results/fastqc/ -o ./Results/MultiQC_on_FastQC</code>
> You can request several modules: ``-m fastqc dir_fastqc -m qualimap dir_qualimap`` etc.

The three main options we will use are:
<blockquote>
    <code>-ip</code> or <code>--interactive</code> to embed dynamic graphics and get interactive plots in the html report <br>
    <code>-p</code> or <code>--export</code> to export plots as static images alongside the html report <br>
    <code>-o</code> or <code>--outdir</code> to define the destination folder for output and report files <br>
    and finally, the folder we want to be scanned <br>
</blockquote>

Other options exist: <br>
<blockquote>
    <code>-d</code> or <code>--dirs</code> to prepend directory names to sample names (useful when files with the same name sit in different folders) <br>
    <code>-f</code> or <code>--force</code> to force overwriting existing files <br>
    <code>-v</code> or <code>--verbose</code> to increase output verbosity <br>
    <code>--tag TEXT</code> if only the modules tagged with TEXT are desired <br>
    <code>--pdf</code> to get a pdf report (requires <code>pandoc</code> to be installed)
</blockquote>
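
To see how these options combine before running anything, the command can be assembled in a bash array and printed as a dry run. This is only a sketch: the folder names are placeholders chosen for illustration, and MultiQC itself is not called.

```bash
# Placeholder folders (assumptions for illustration)
scan_dir="Results/fastqc/"
out_dir="Results/multiqc/"

# One array element per word of the command line
multiqc_cmd=(multiqc --interactive --export --force
             --outdir "${out_dir}" "${scan_dir}")

# Dry run: print the command, each word shell-quoted
printf '%q ' "${multiqc_cmd[@]}"
echo

# To actually run it: "${multiqc_cmd[@]}"
```

Storing the command in an array keeps every argument intact, even if a folder name contains spaces.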

### **3.2 - Folder, filename, title and comment**

We will create a subfolder in the ``Results/`` folder for ``multiqc``.

In [None]:
## Code cell 20 ##

qcsummaries="${gohome}$USER/Results/multiqc/"
mkdir -p ${qcsummaries}

All downstream reports will also be saved here and we will use different file names.

We will ask MultiQC for a specific and meaningful filename and title using the ``-n`` and ``-i`` options.
<blockquote>
    <code>-n</code> or <code>--filename TEXT</code> to have a non-default report filename (warning: <code>stdout</code> will just print results to the console) <br>
    <code>-i</code> or <code>--title</code> to change the report header. Also used for the filename if <code>-n</code> is not specified <br>
    <code>-b</code> or <code>--comment</code> to add any text section in report
</blockquote>

Please, specify the **file name** you want to have (do not worry about extension, MultiQC will handle this for us) inside quotes in the next cell. <a id="multiqctextvar"></a>  
<b>DO NOT use spaces or any special characters!</b> 

In [None]:
## Code cell 21 ##

inamemyfile="1_raw-fastq-files"

Please specify a meaningful **title** inside quotes in the next cell, to be displayed at the top of your upcoming report.
<b>Spaces are allowed here, but still avoid any special characters.</b> 

In [None]:
## Code cell 22 ##

mytitle="Raw fastq files quality for 3 paired-end sequenced samples from double-Htz B-ALL mice study (Dataset GSE158673, subSerie GSE158661)"

Besides, we can add a comment in the report header. It's good practice to do so, so we will define it in the following cell.

> In this cell, we use several lines to keep the text readable when displaying the notebook. As your text lines are simply joined together in the html report, be sure to keep the last blank space at the end of every line.
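
A note on how such a text is assembled: `echo` prints its arguments separated by a single space, so consecutive quoted chunks are joined with a space even when a chunk has no trailing blank; the trailing blanks simply act as a safety net. A minimal sketch with a made-up text (the variable name `demo_comment` is hypothetical):

```bash
# Three quoted chunks, spread over three lines with "\" continuations
demo_comment=$(echo "Raw fastq files" \
                    "from a made-up experiment," \
                    "3 samples.")

echo "${demo_comment}"
```

The three chunks come out as one line of text, separated by single spaces.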

In [None]:
## Code cell 23 ##

mycomment=$(echo "Raw fastq files from bulk RNA sequencing (mouse, unstranded) " \
"performed by Ramamoorthy et al. 2020 (PMID:  33004416 ; GEO: GSE158673, subSerie GSE158661 ; SRA: PRJNA666155).   "    \
"RNASeq analysis to unravel molecular networks driving leukemia in Ebf1+/-Pax5+/- (dHet) B-ALL mice : To profile gene expression changes in  Ebf1+/-Pax5+/- (dHet) leukemic mice, RNASeq analysis was performed in dHet B-ALL, dHet proB and wt proB cells.    "    \
"PRJNA666155: 3 dHet mice, 2 replicates; proB cells derived from dHet pre-leukemic mouse, 3 replicates; wt proB cells, 2 replicates.    ")   

### **3.3 - Generate summary report**

MultiQC is verbose but, as it will work only on FastQC reports, it is quite short.  
So, we will let its output lines show below while saving them in a logfile for later use.

In [None]:
## Code cell 24 ##

logfile="${logfolder}multiqc-processing_raw-quality.log"
echo "Screen output is also saved in ${logfile}"

In [None]:
## Code cell 25 ##

# write start and end timestamps to the log file, around the MultiQC output
echo "operation starting by $(date)" >> ${logfile}
multiqc --interactive --export \
        --outdir "${qcsummaries}" \
        --filename "${inamemyfile}" \
        --title "${mytitle}"  \
        --comment "${mycomment}" \
        "${qcfolder}" \
        |& tee -a ${logfile}
echo "operation finished by $(date)" >> ${logfile}

# to see which files we have afterward and follow folder sizes
ls -lh "${qcsummaries}" >> ${logfile}
ls -lh "${gohome}$USER/Results/" >> ${logfile}

To open the report (an ``.html`` file relying on JavaScript, which *JupyterLab* does not render so far), download the html file from the left-hand panel and open it in your own browser to benefit from all its features.

---
## 4 - Files, folders and versions summary when leaving

Let's count the number of files in our destination folder.

In [None]:
## Code cell 26 ##

ls "${qcfolder}" | wc -l

... and check the directory tree of our folder with the Unix command `tree`.  
<blockquote>
    Adding the <code>-L</code> option followed by a number limits how deep the command goes into the folder tree, so the output stays readable. <br>
    To list only directories, use option <code>-d</code>
</blockquote>

In [None]:
## Code cell 27 ##

tree -d -L 3 "${gohome}$USER/"

To know how much disk space the files take (either the shortest output or a detailed one):

In [None]:
## Code cell 28 ##
# disk usage 

du -ch -d1 "${gohome}$USER/Results/"  
du -ch -d1 "${gohome}$USER"  

or

In [None]:
## Code cell 29 ##

ls -lh "${gohome}$USER/Results/"

Last but not least, we list the versions of the tools used in this step for future reference:   

In [None]:
## Code cell 30 ##   

module list

---
___
## Conclusion


**Next Practical session**


After you have looked at the MultiQC report to know what to correct in your data, proceed to the next step.
  
**=> Step 3 : Preprocessing reads and checking for their quality** 

The Jupyter notebook used for the next session will be *Pipe_03-bash_preprocessing-and-check.ipynb*.    
Let's retrieve it in our directory, in order to have a private copy to work on:   

In [None]:
## Code cell 31 ##   

cp "${gohome}pipeline/Pipe_03-bash_preprocessing-and-check.ipynb" "${gohome}$USER/"



**Save executed notebook**

To end the session, save your executed notebook in your `run_notebooks` folder. Adjust the name with yours and reformat as code cell to run it.

<div class="alert alert-block alert-success"><b>Success:</b> Well done! You now know how to check quality of fastq raw data files.<br>
Don't forget to save your notebook and export a copy as an <b>html</b> file as well: <br>
- Open "File" in the Menu<br>
- Select "Export Notebook As"<br>
- Export notebook as HTML<br>
- You can then open it in your browser even without being connected to the server! 
</div>

---
---

## Useful commands
<div class="alert alert-block alert-info"> 
    
- <kbd>CTRL</kbd>+<kbd>S</kbd> : save notebook<br>    
- <kbd>CTRL</kbd>+<kbd>ENTER</kbd> : Run Cell<br>  
- <kbd>SHIFT</kbd>+<kbd>ENTER</kbd> : Run Cell and Select Next<br>   
- <kbd>ALT</kbd>+<kbd>ENTER</kbd> : Run Cell and Insert Below<br>   
- <kbd>ESC</kbd>+<kbd>y</kbd> : Change to *Code* Cell Type<br>  
- <kbd>ESC</kbd>+<kbd>m</kbd> : Change to *Markdown* Cell Type<br> 
- <kbd>ESC</kbd>+<kbd>r</kbd> : Change to *Raw* Cell Type<br>    
- <kbd>ESC</kbd>+<kbd>a</kbd> : Create Cell Above<br> 
- <kbd>ESC</kbd>+<kbd>b</kbd> : Create Cell Below<br> 

<em>  
To make nice html reports with markdown: <a href="https://dillinger.io/" title="dillinger.io">html visualization tool 1</a> or <a href="https://stackedit.io/app#" title="stackedit.io">html visualization tool 2</a>, <a href="https://www.tablesgenerator.com/markdown_tables" title="tablesgenerator.com">to draw nice tables</a>, and the <a href="https://medium.com/analytics-vidhya/the-ultimate-markdown-guide-for-jupyter-notebook-d5e5abf728fd" title="Ultimate guide">Ultimate guide</a>. <br>
Further reading on JupyterLab notebooks: <a href="https://jupyterlab.readthedocs.io/en/latest/user/notebook.html" title="Jupyter Lab">Jupyter Lab documentation</a>.<br>   
</em>    
 
</div>

---
Bénédicte Noblet - 05-07 2021   
Sandrine Caburet - 02-05 2023   
Claire Vandiedonck - 02-06 2023  
Updated 02/06/2023 by @SCaburet