# RNAseq training CEA - June 2023

IFB session: 5 CPU + 21 GB of RAM

# Part 5: Mapping quality check

  
- 0.1 - About session for IFB core cluster
- 0.2 - Parameters to be set or modified by the user
- 1 - Some checks as a precaution with ``samtools``
- 2 - Some metrics with ``samtools``
- 3 - Some metrics and graphical results with ``Qualimap``
- 4 - Summary report with ``MultiQC``
- 5 - Monitoring disk usage

After some quite long-running cells, I needed to start a new IFB session. The two additional loading sections of this notebook may not be required for your analyses:
- <a href="#sectionIII.3.b">section III.3.b</a>
- <a href="#sectionIV.1.a">section IV.1.a</a>  

**If you use these sections**, change the variable cell from *Raw* type to *Code* type, then reproduce the changes to the ``gohome`` and ``mappedfolder`` variables that you will make (or made) in the **Parameters** section.

---
## **Before going further**

<div class="alert alert-block alert-danger"><b>Caution:</b> 
Before starting the analysis, save a backup copy of this notebook: in the left-hand panel, right-click on this file and select "Duplicate".<br>
You can also make backups during the analysis. Don't forget to save your notebook regularly: <kbd>Ctrl</kbd> + <kbd>S</kbd> or click on the 💾 icon.
</div>

---

## 0.1 - About session for IFB core cluster

<em>loaded JupyterLab</em> : Version 3.2.1

In [1]:
## Code cell 1 ##

echo "=== Cell launched on $(date) ==="

echo "=== Current IFB session size: Medium (5 CPU, 21GB) ==="
jobid=$(squeue -hu $USER | awk '/jupyter/ {print $1}')
sacct --format=JobID,AllocCPUS,NODELIST -j ${jobid}


=== Cell launched on Thu May 25 09:53:47 CEST 2023 ===
=== Current IFB session size: Medium (5 CPU, 21GB) ===
       JobID  AllocCPUS        NodeList 
------------ ---------- --------------- 
33538881              5     cpu-node-82 
33538881.ba+          5     cpu-node-82 
33538881.0            5     cpu-node-82 


In [2]:
## Code cell 2 ##

module load samtools/1.15.1 qualimap/2.2.2b multiqc/1.13

echo "===== quality control & bam sorting ====="
samtools --version | head -n 2
echo "===== graphical quality controls ====="
qualimap --help | head -n 4
echo "===== quality reports compilation ====="
multiqc --version

===== quality control & bam sorting =====
samtools 1.15.1
Using htslib 1.15.1
===== graphical quality controls =====
Java memory size is set to 1200M
Launching application...

QualiMap v.2.2.2-dev
===== quality reports compilation =====
multiqc, version 1.13


---

## 0.2 - Parameters to be set or modified by the user

- Using a full path with a `/` at the end, **define the folder** where you want or have to work, with the `gohome` variable:

In [3]:
## Code cell 3 ##

gohome="/shared/projects/2312_rnaseq_cea/"

echo "=== Home root folder is ==="
echo "${gohome}"
echo ""
echo "=== Working (personal) folder tree ==="
tree -L 2 "${gohome}$USER"
echo "=== current working directory ==="
echo "${PWD}"

=== Home root folder is ===
/shared/projects/2312_rnaseq_cea/

=== Working (personal) folder tree ===
/shared/projects/2312_rnaseq_cea/scaburet
├── Data
│   ├── fastq
│   ├── sra
│   ├── sra-files_creation_fastqgz.log
│   ├── sra-files_integrity.log
│   └── sra-files_retrieval.log
├── Pipe_5-bash_mapping-quality.ipynb
└── Results
    ├── fastp
    ├── fastqc
    ├── fastq_screen
    ├── logfiles
    ├── multiqc
    ├── qualimap
    ├── samtools
    ├── star
    └── star-old

13 directories, 4 files
=== current working directory ===
/shared/ifbstor1/projects/2312_rnaseq_cea/pipeline


- Specify the **maximum number of CPUs** (central processing units, cores) and the **amount of RAM (in bytes)** that programs can use.<a id="computressources"></a>

In [4]:
## Code cell 4 ##

authorizedCPU=4

authorizedRAM=20000000000  # 20GB

- This notebook contains long-running, resource-heavy cells in <a href="#section3.2">section 3.2</a> and in section 3.3.b, so before launching them make sure that their settings suit your environment. <br>
For Samtools, the ``-m`` option sets the **amount of RAM used by each thread**, so the total is roughly the number of threads multiplied by that value! <br>
<blockquote>
        See option <code>--java-mem-size=20G</code> in Qualimap lines, <a href="#section3.2">section 3.2</a> <br>
        See options <code>--threads 3 -m 5G</code> in Samtools line, <a href="#section3.3">section 3.3.b</a>
</blockquote>

<div class="alert alert-block alert-danger">
    <b>Values set in this notebook are valid for a 5-CPU session with access to 21 GB of RAM</b>. Ideally, use 70-80% of the CPUs available to your system or session. <br>
    DO NOT ask for more RAM than you can use.
</div>
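
As a rough sanity check, a per-thread memory value for Samtools can be derived from the two variables set above. This is only a sketch: `sortthreads` and `mem_per_thread` are names made up for illustration, and it assumes that the `M` suffix of `-m` counts mebibytes (1,048,576 bytes), hence the rounding down to stay within the budget.

```bash
# Values from the parameter cell above
authorizedCPU=4
authorizedRAM=20000000000  # 20GB, in bytes

# Hypothetical helper variables (not used elsewhere in this notebook)
sortthreads=${authorizedCPU}
# Per-thread share of the RAM budget, rounded down to whole MiB
mem_per_thread="$(( authorizedRAM / sortthreads / 1048576 ))M"

echo "samtools sort could use: --threads ${sortthreads} -m ${mem_per_thread}"
```

Any combination whose product of threads and per-thread memory stays below `authorizedRAM` is acceptable; the point is to do the multiplication before launching the cell.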

- With a `/` at the end, specify the folder containing the ``_Aligned.sortedByCoord.out.bam`` alignment files produced by ``STAR``:

In [5]:
## Code cell 5 ##

mappedfolder="${gohome}allData/Results/star/"

#mappedfolder="${gohome}$USER/Results/star/"

- Reset the path to the uncompressed annotation file, especially if you changed it in the previous notebook.

In [6]:
## Code cell 6 ##

gtffile="${gohome}allData/Reference/extracted/genome_annotation.gtf"

- A comment will later be included in the quality report to keep information about the analysis handy. **Remember to adapt the text of the ``mycomment`` variable** in [section 4.1.b](#multiqctextvar) before launching MultiQC report generation.

---
## 1 - Some checks as a precaution with <code>samtools</code>

### **1.1 - Available files**

As we may have changed session, or even day, let's first check that all files are there:

In [84]:
## Code cell 7 ##

mappedfolder=${mappedfolder}
#mappedfolder="${gohome}$USER/Results/star/"

echo "mappedfolder is "${mappedfolder}
echo "There are $(ls "${mappedfolder}"*_Aligned.sortedByCoord.out.bam | wc -l) bam files in it:"
ls "${mappedfolder}"*_Aligned.sortedByCoord.out.bam

mappedfolder is /shared/projects/2312_rnaseq_cea/allData/Results/star/
There are 11 bam files in it:
/shared/projects/2312_rnaseq_cea/allData/Results/star/SRR12730403_Aligned.sortedByCoord.out.bam
/shared/projects/2312_rnaseq_cea/allData/Results/star/SRR12730404_Aligned.sortedByCoord.out.bam
/shared/projects/2312_rnaseq_cea/allData/Results/star/SRR12730405_Aligned.sortedByCoord.out.bam
/shared/projects/2312_rnaseq_cea/allData/Results/star/SRR12730406_Aligned.sortedByCoord.out.bam
/shared/projects/2312_rnaseq_cea/allData/Results/star/SRR12730407_Aligned.sortedByCoord.out.bam
/shared/projects/2312_rnaseq_cea/allData/Results/star/SRR12730408_Aligned.sortedByCoord.out.bam
/shared/projects/2312_rnaseq_cea/allData/Results/star/SRR12730409_Aligned.sortedByCoord.out.bam
/shared/projects/2312_rnaseq_cea/allData/Results/star/SRR12730410_Aligned.sortedByCoord.out.bam
/shared/projects/2312_rnaseq_cea/allData/Results/star/SRR12730411_Aligned.sortedByCoord.out.bam
/shared/projects/2312_rnaseq_cea/al

### **1.2 - Examining data files: are they what we expect?**

These files are smaller than ``.sam`` files because they are binary... but we can't look inside them without a dedicated tool, ``samtools view``, which is part of the ``samtools`` suite.

In [85]:
## Code cell 8 ##

samtools --version | head -n 2

samtools 1.15.1
Using htslib 1.15.1


We list the files in the folder and keep only the first line (``-n 1``) to get one sample:

In [86]:
## Code cell 9 ##

abamfile=$(ls "${mappedfolder}"*_Aligned.sortedByCoord.out.bam | head -n 1)
echo ${abamfile}

/shared/projects/2312_rnaseq_cea/allData/Results/star/SRR12730403_Aligned.sortedByCoord.out.bam


In [87]:
## Code cell 10 ##

samtools view ${abamfile} | head

SRR12730403.44571836	99	chr1	3111929	255	101M	=	3112269	441	CTGATAAGCAGCTTCGGTGAAGTAGCTGGATATAAAATAAACTCAAACAAGTCAATGGCCTTTCTCTATACAAAGAATAAACAGGCTGAGAAAGAAATTAG	FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FF:FFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF	NH:i:1	HI:i:1	AS:i:198	nM:i:1	NM:i:1	MD:Z:7A93	jM:B:c,-1	jI:B:i,-1	MC:Z:101M
SRR12730403.44571836	147	chr1	3112269	255	101M	=	3111929	-441	ATTCTTCAACGAATTAGAAGGAGCAATTTGCAAATTCAGCTGGAATAACAAAAAACCTAGGATAGCAAAAAGTCTTCTCAAGGATAAAAGAACCTCTGGTG	FFFFFFFFFF,F,FFFF:FFFFFFF:,:F,FFFFFFFF,FFFFF,FFFF:FF,,F:FF,FFF,FFFF:FF,FF::FFFFF::F,:FFFFF:,FFF:FFFF:	NH:i:1	HI:i:1	AS:i:198	nM:i:1	NM:i:0	MD:Z:101	jM:B:c,-1	jI:B:i,-1	MC:Z:101M
SRR12730403.6920683	163	chr1	3225116	3	101M	=	3225152	137	CTGTGGAAAAAACAGGGATGAGGAGGAACGACTTCCAGCTCCTATTTTAGCCACAAATCGTGGTGTTACTAATGACATAATTCTTGCCTAGGTCTTGCTAA	FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF	NH:i:2	HI:i:1	AS:i:200	nM:i:0	NM:i:0	MD:Z:101	jM:B:c,-1

<ul class="alert alert-block alert-info">
    <li>
        For more information on the BAM file format, see the <a href="https://biocorecrg.github.io/RNAseq_course_2019/alnpractical.html"><i>BAM/SAM/CRAM format</i></a> section of another training course, made by members of Barcelona's CRG Biocore Facility in 2019.
    </li>
    <li>
        To understand, or at least look up, what is encoded in the alignment flags, the Broad Institute's tool <a href="https://broadinstitute.github.io/picard/explain-flags.html">Explain SAM Flags</a> may be helpful!
    </li>
    <li>
        To convert flags between numeric and textual representation, you may also use <code>samtools flags</code>, see <a href="http://www.htslib.org/doc/samtools.html">online manual</a>.
    </li>
</ul>
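
To make the flag column (second field) of the reads shown above more concrete, here is a minimal bash sketch that decodes a numeric flag bit by bit. The function name `decode_flag` is made up for this example; the bit names follow those printed by `samtools flags`.

```bash
# Decode a numeric SAM flag into the names used by `samtools flags`.
decode_flag() {
    local flag=$1
    # One name per bit, from bit 0 (value 1) to bit 11 (value 2048)
    local names=(PAIRED PROPER_PAIR UNMAP MUNMAP REVERSE MREVERSE
                 READ1 READ2 SECONDARY QCFAIL DUP SUPPLEMENTARY)
    local out="" i
    for i in "${!names[@]}"; do
        # Bit i is set when the flag shifted right by i positions is odd
        if (( (flag >> i) & 1 )); then
            out+="${out:+,}${names[i]}"
        fi
    done
    echo "${out}"
}

decode_flag 99    # flag of the first read shown above
decode_flag 147   # flag of its mate
```

Flag 99 decodes to `PAIRED,PROPER_PAIR,MREVERSE,READ1` and flag 147 to `PAIRED,PROPER_PAIR,REVERSE,READ2`, which is what a properly paired fragment looks like.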

### **1.3 - Checking `.bam` files integrity**

You may join this pipeline with your own `.bam` files, either generated by another alignment tool or downloaded directly from the network.  
Alternatively, you may have had issues with `STAR` and some of your files may be corrupted.  

So, to start off on the right foot, let's check that the files are intact. For that purpose, `samtools` will once again help us deal with `.bam` files.

In [88]:
## Code cell 11 ##

logfile="${mappedfolder}samtools_quickcheck_bamfiles.log"
echo "Screen output is redirected to ${logfile}"

# as time command does not redirect output
echo "operation starts at $(date)" >> ${logfile}

samtools quickcheck -vvv "${mappedfolder}"*_Aligned.sortedByCoord.out.bam \
            &>> ${logfile}
echo "operation ends at $(date)" >> ${logfile}            

# to print user's message (samtools quickcheck is really quick!) and record it
samtools quickcheck "${mappedfolder}"*_Aligned.sortedByCoord.out.bam \
            && inbrief="All files are goods." \
            || inbrief="At least one file is corrupted, see logfile."
echo ${inbrief} | tee -a ${logfile}

Screen output is redirected to /shared/projects/2312_rnaseq_cea/allData/Results/star/samtools_quickcheck_bamfiles.log
All files are goods.


<div class="alert alert-block alert-warning">
    <b>If there is an issue with your files</b>, you should rerun the previous steps for the corrupted file(s). Indeed, some downstream tools may appear to work while <code>samtools</code> fails, and you could not be confident in the results they generate: <b>taking a little time now saves losing much more later!</b>
</div>
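
To find out which file is affected, one option is to run the check one file at a time. The sketch below uses a hypothetical helper, `failing_files`, which takes the checking command as its first argument. In this notebook that command would be `samtools quickcheck`, but the helper makes no assumption about it, so it can be tried without any `.bam` file.

```bash
# Print every file for which the checker command returns a non-zero status.
# Usage: failing_files "CHECKER COMMAND" FILE...
failing_files() {
    local checker=$1; shift
    local f
    for f in "$@"; do
        # ${checker} is left unquoted on purpose so that "samtools quickcheck"
        # is split into the command and its sub-command
        ${checker} "${f}" > /dev/null 2>&1 || echo "${f}"
    done
}

# In this notebook (requires samtools and the mappedfolder variable):
# failing_files "samtools quickcheck" "${mappedfolder}"*_Aligned.sortedByCoord.out.bam

# Stand-alone demonstration with a generic checker: "test -s" fails on empty files
emptyfile=$(mktemp)
failing_files "test -s" "${emptyfile}"   # prints the path of the empty file
rm -f "${emptyfile}"
```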

## 2 - Some metrics with <code>samtools</code>

The commands used for this part belong to a large package of utilities that are very useful to manage those types of files: Samtools (http://www.htslib.org/).

### **2.1 - Preparing command line variables**

One option would be to save the result files alongside the `.bam` files, but having a lot of files in one folder is not the best idea (especially when saving them or handling long file lists). So we will save the output files to another folder.

In [89]:
## Code cell 12 ##

samtoolsfolder="${gohome}allData/Results/samtools/"

#samtoolsfolder="${gohome}$USER/Results/samtools/"
mkdir -p ${samtoolsfolder}

The count files are *per se* the interesting outputs but, in case of any issue or to keep track of run durations, we nonetheless keep some details in a `.log` file saved with the other reports.

In [13]:
## Code cell 13 ##

logfolder="${gohome}allData/Results/logfiles/"

#logfolder="${gohome}$USER/Results/logfiles/"

### **2.2 - Running command line for <code>samtools</code> metrics**

<div class="alert alert-block alert-danger">
    In the loop below, we select files whose names end with the <code>_Aligned.sortedByCoord.out.bam</code> pattern. <br>
    Indeed, <code>STAR</code> produces a second <code>.bam</code> file if you asked for transcript counts using the <code>TranscriptomeSAM</code> quantification mode. The suffix for this second filename is <code>_Aligned.toTranscriptome.out.bam</code>: it's not a standardized one as reads are sorted by transcript name instead of chromosome position. <br>
    <b>If you did not use the previous notebook as initially set, please remove or adapt <code>_Aligned.sortedByCoord.out</code> in the fifth line below, so that the filename pattern suits your filename format.</b>
</div>
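
If you need to adapt the pattern, it can help to try the sample ID extraction on its own first. A minimal sketch, in which the path is illustrative and `suffix` is a variable name introduced here only for the example:

```bash
fn="/path/to/star/SRR12730403_Aligned.sortedByCoord.out.bam"  # illustrative path
suffix="_Aligned.sortedByCoord.out.bam"                        # adapt to your files

# basename drops the folder part and, given a second argument, the suffix too
id=$(basename "${fn}" "${suffix}")
echo "${id}"

# Same result with bash parameter expansion only
name=${fn##*/}              # strip everything up to the last "/"
echo "${name%"${suffix}"}"  # strip the suffix at the end
```

Both commands print `SRR12730403`. If the ID still contains part of the suffix, the pattern does not match your filenames.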

In [None]:
## Code cell 14 ##

logfile="${logfolder}samtools_stats_flagstat_bamfiles.log"
echo "Screen output is redirected to ${logfile}"

# as time command does not redirect output
echo "operation starts at $(date)" >> ${logfile}

time for fn in "${mappedfolder}"*_Aligned.sortedByCoord.out.bam; do

    mysortedbam=$(basename ${fn})
    id=${mysortedbam/_Aligned.sortedByCoord.out.bam/}
    echo "======= Processing sampleID: ${id}..." | tee -a ${logfile}

    statsfile="${samtoolsfolder}${id}.stats"
    echo "$(date) starting ${statsfile}" >> ${logfile}
    samtools stats ${fn} > ${statsfile}

    flagsfile="${samtoolsfolder}${id}.flagstat"
    echo "$(date) starting ${flagsfile}" >> ${logfile}
    samtools flagstat ${fn} > ${flagsfile}
    
    echo "... done" | tee -a ${logfile}

done

echo "operation ends at $(date)" >> ${logfile}

echo "=== files created after samtools counts ===" >> ${logfile}
ls -lh "${samtoolsfolder}" >> ${logfile}

echo "There are $(ls "${samtoolsfolder}" | wc -l) output files."

Screen output is redirected to /shared/projects/2312_rnaseq_cea/allData/Results/logfiles/samtools_stats_flagstat_bamfiles.log
... done
... done
... done
... done
... done
... done
... done
... done
... done


## 3 - Some metrics and graphical results with Qualimap

### **3.1 - Tool introduction and preparation step**

We'll run the Qualimap program (http://qualimap.conesalab.org/), which collects data about the `.bam` file, including coverage estimation and many other parameters, and reports a summary of the main properties of the alignment data. Qualimap reads sorted `.bam` files and generates a folder containing a report in `.html` format.

As you can see from the next command, which also shows which version we are running, Qualimap includes several tools. We will be using the classical `bamqc` tool, which can work on any kind of NGS `.bam` file. Of note, there is also an `rnaseq` tool dedicated to RNAseq data, but it fails to run with *C. parapsilosis* annotations.

In [19]:
## Code cell 15 ##

qualimap --help

Java memory size is set to 1200M
Launching application...

QualiMap v.2.2.2-dev
Built on 2017-08-28 08:37

usage: qualimap <tool> [options]

To launch GUI leave <tool> empty.

Available tools:

    bamqc            Evaluate NGS mapping to a reference genome
    rnaseq           Evaluate RNA-seq alignment data
    counts           Counts data analysis (further RNA-seq data evaluation)
    multi-bamqc      Compare QC reports from multiple NGS mappings
    clustering       Cluster epigenomic signals
    comp-counts      Compute feature counts

Special arguments: 

    --java-mem-size  Use this argument to set Java memory heap size. Example:
                     qualimap bamqc -bam very_large_alignment.bam --java-mem-size=4G



> Remark: Qualimap can also be launched on most operating systems with a command line that opens a user-friendly Java window. On Mac or Windows, you may have to adjust the memory (RAM) settings to run it. It is all explained in the documentation.

As with previous tools, we will put its results in a `Results/` subfolder:

In [20]:
## Code cell 16 ##

qualimapfolder="${gohome}allData/Results/qualimap/"

#qualimapfolder="${gohome}$USER/Results/qualimap/"
mkdir -p ${qualimapfolder}

An issue with current Qualimap releases is that we can't change the output file names for either of the two tools we will use.  
Thus, let's create distinct folders for the two tools within the qualimap folder.

In [21]:
## Code cell 17 ##

bamqcfolder="${qualimapfolder}bamqc/"
mkdir -p ${bamqcfolder}

In [22]:
## Code cell 18 ##

rnaseqfolder="${qualimapfolder}rnaseq/"
mkdir -p ${rnaseqfolder}

### **3.2 - General view with <code>qualimap bamqc</code>**

The <code>qualimap</code> options we will use are: <br>
<blockquote>
    <code>-bam filepath</code>, Input mapping file in BAM format, with reads sorted by coordinates <br>
    <code>-gff filepath</code> or <code>--feature-file path/to/file</code>, the path to the feature file with regions of interest in GFF/GTF or BED format <br>
    <code>-outdir text</code>, Output folder for HTML report and raw results. <br>
    <code>-p VALUE</code> or <code>--sequencing-protocol VALUE</code>, Sequencing library protocol: strand-specific-forward, strand-specific-reverse or non-strand-specific (default). <br>
    <code>-c</code> or <code>--paint-chromosome-limits</code>, Paint chromosome limits inside charts
</blockquote>

Qualimap produces an HTML report by default, but we may ask for a PDF or both, using the <code>-outformat</code> option and the corresponding predefined values (PDF, HTML or PDF:HTML for both).

<ul class="alert alert-block alert-info">
    <li>
        See <a href="http://qualimap.conesalab.org/doc_html/analysis.html#bam-qc">online manual</a> for more details
    </li>
</ul>

<div class="alert alert-block alert-danger"> <a id="section3.2"></a>
    The following <b>command is prepared for usage on a computational cluster</b> such as the <i>Institut Français de Bioinformatique</i> (IFB)'s core cluster. We use a session defined as <b>5 CPU with 21 GB available for RAM</b>. Adapt it to your resources.
</div>

In [11]:
## Code cell 19 ##

logfile="${logfolder}qualimap_bamqc_allSamples.log"
#logfile="${logfolder}qualimap_bamqc_3samples.log"
echo "Screen output is redirected to ${logfile}"

# as time command does not redirect output
echo "operation starts at $(date)" >> ${logfile}

for fn in "${mappedfolder}"*_Aligned.sortedByCoord.out.bam; do
    
    mysortedbam=$(basename ${fn})
    id=${mysortedbam/_Aligned.sortedByCoord.out.bam/}
    echo "===== Processing sampleID: ${id}..." | tee -a ${logfile}

    myoutdir="${bamqcfolder}${id}"
    echo "destination folder is ${myoutdir}" >> ${logfile}
    
    echo "qualimap bamqc starts at $(date)" >> ${logfile}
    qualimap bamqc -bam ${fn} \
                    --feature-file "${gtffile}" \
                    --sequencing-protocol non-strand-specific \
                    --paint-chromosome-limits \
                    -outdir ${myoutdir} \
                    --java-mem-size=20G \
                    &>> ${logfile}
    echo "qualimap bamqc ends at $(date)" >> ${logfile}
    
    echo "... done" | tee -a ${logfile}
  
done 

echo "operation ends at $(date)" >> ${logfile}

echo "=== Files created after qualimap bamqc ===" >> ${logfile}
tree "${bamqcfolder}" >> ${logfile}
echo "Qualimap generated $(find "${bamqcfolder}" -name "*.html" | wc -l) html reports."

Screen output is redirected to qualimap_bamqc_allSamples.log
===== Processing sampleID: SRR12730411...
... done
===== Processing sampleID: SRR12730412...
... done
Qualimap generated 11 html reports.


Thanks to ``logfile``, we can find out how long this cell took to run:

In [14]:
## Code cell 20 ##

grep -e "operation" ${logfile}

operation starts at Wed May 24 19:40:54 CEST 2023
operation starts at Thu May 25 09:54:38 CEST 2023
operation ends at Thu May 25 10:35:54 CEST 2023
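
With these log lines, the elapsed time has to be worked out by hand. A possible alternative, sketched here with hypothetical names (`format_duration`, `start`, `end`), is to record epoch seconds with `date +%s` around the loop and let bash do the subtraction:

```bash
# Turn a number of seconds into HH:MM:SS
format_duration() {
    local s=$1
    printf '%02d:%02d:%02d\n' $(( s / 3600 )) $(( s % 3600 / 60 )) $(( s % 60 ))
}

start=$(date +%s)
sleep 1                      # stands in for the long-running loop
end=$(date +%s)
echo "operation lasted $(format_duration $(( end - start )))"

# From 09:54:38 to 10:35:54, as logged above: 41 min 16 s = 2476 s
format_duration 2476         # prints 00:41:16
```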


### **3.3 - Specific RNAseq plots with <code>qualimap rnaseq</code>**

``qualimap rnaseq``, the Qualimap tool dedicated to RNAseq data, gives some more metrics: 
- a *Reads genomic origin* pie chart to check that reads map to exons
- two *Coverage Profile Along Genes* plots (low and high)
- a *Junction Analysis* pie chart to distinguish between known and novel junctions

<ul class="alert alert-block alert-info">
    <li>
        See <a href="http://qualimap.conesalab.org/doc_html/analysis.html#rna-seq-qc">online manual for this tool</a> for more details
    </li>
</ul>

#### **3.3.a - Understanding operations**

With paired-end data, we could use the following lines:  
<code>qualimap rnaseq -pe \  
               -bam path/to/sortedByNames.bam \ 
               -gtf path/to/gencode_annotation.gtf \ 
               -outdir path/to/outputfolder/</code>

Those options, along with others, are explained in the manual:
<blockquote>
     <code>-pe</code> or <code>--paired</code> Setting this flag for paired-end experiments will result in counting fragments instead of reads <br>
     <code>-bam path/to/file.bam</code>, to locate input mapping file in BAM format. <br>
     <code>-gtf path/to/file.gtf</code>, where to find annotations file in Ensembl GTF format (no compressed format). <br>
     <code>-outdir path/to/folder/</code> to define output folder for HTML report and raw data. <br> 
</blockquote>

There are some more options of interest; we will set the first one explicitly and leave the other two at their defaults:
<blockquote>
     <code>-p VALUE</code> or <code>--sequencing-protocol VALUE</code>, to specify the sequencing library protocol. VALUE is one of: strand-specific-forward, strand-specific-reverse or non-strand-specific (default) <br>
     <code>-oc path/to/outfile.counts</code>, to place the output file for computed counts in a distinct folder. If only the name of the file is provided, the file will be saved in the output folder. <br>
     <code>-a</code> or <code>--algorithm</code>, to set the counting algorithm: uniquely-mapped-reads (default) or proportional.
</blockquote>

Contrary to `bamqc`, the `rnaseq` Qualimap tool needs alignment files to be sorted by read name. By default, if given a file sorted otherwise, it will sort the `.bam` file itself... but we already noticed that handling those big files is quite slow with only one process running at a time.  
So we will use `samtools sort` to do this with multiple threads, give `qualimap rnaseq` an already sorted input file, and let it know with <code>-s</code> or <code>--sorted</code> (only required for paired-end analysis).

To do so, we will use this command format:  
<code>samtools sort -n \ 
            -@ X -m YG \ 
            -O BAM \ 
            -o path/to/sortedByNames.bam \ 
            -T path/to/temporaryfolder/ \ 
            path/to/input.bam</code>

where options correspond to:  
<blockquote>
    <code>-n</code> for activating sorting by read name (<i>i.e.</i> the QNAME field) <br>
    <code>-o FILE</code>    Write final output to FILE rather than standard output <br>
    <code>-T PREFIX</code>, to write temporary files to a particular folder. Used names are <i>PREFIX.nnnn.bam</i>, where <i>nnnn</i> stands for a 4-digit number <br>
    <code>-O</code> or <code>--output-fmt FORMAT</code> to specify output format (SAM, BAM or CRAM). If not specified, <code>samtools</code> will deduce it from the file extension (not possible if no output file is written!) <br>
    <code>-@ INT</code> or <code>--threads INT</code>, the number of additional threads to use [0]. By default, only 1 thread is used. <br>
    <code>-m INT</code>, to limit maximum memory per thread; suffix K/M/G recognized [768M]
</blockquote>
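
Before launching a long sort, it can be reassuring to look at the exact command that will be run. The sketch below builds the command as a bash array and only prints it. All values and variable names (`threads`, `mem`, `input`, `output`, `tmpprefix`, `sortcmd`) are placeholders chosen for this example, not settings from this notebook.

```bash
# Illustrative values only: adapt them to your session and folders
threads=3
mem="5G"
input="path/to/input.bam"
output="path/to/sortedByNames.bam"
tmpprefix="path/to/temporaryfolder/"

# One array element per argument keeps paths with spaces intact
sortcmd=(samtools sort -n
         -@ "${threads}" -m "${mem}"
         -O BAM
         -o "${output}"
         -T "${tmpprefix}"
         "${input}")

# Dry run: print the command instead of executing it
printf '%q ' "${sortcmd[@]}"; echo

# To actually run it (requires samtools): "${sortcmd[@]}"
```

Keeping the arguments in an array means the printed command and the executed command cannot drift apart.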

#### **3.3.b - Running <code>qualimap rnaseq</code> loop... with patience**

<div class="alert alert-block alert-warning">
    The cell below will run for <b>at least 30 minutes per sample</b>. Either come back later, or use the JupyterLab option <i>Run Selected Cell and All Below</i> in the <i>Run</i> menu (top bar).
</div>

<div class="alert alert-block alert-danger"> <a id="section3.3"></a>
    The following <b>command is prepared for usage on a computational cluster</b> such as the <i>Institut Français de Bioinformatique</i> (IFB)'s core cluster. We use a session defined as <b>5 CPU with 21 GB available for RAM</b>. Adapt it to your resources.
</div>

In [17]:
## Code cell 21 ##

logfile="${logfolder}qualimap_rnaseq_allSamples.log"
#logfile="${logfolder}qualimap_rnaseq_3samples.log"

In [18]:
## Code cell 22 ##

echo "Screen output is redirected to ${logfile}"

# as time command does not redirect output
echo "operation starts at $(date)" >> ${logfile}

for fn in "${mappedfolder}"*_Aligned.sortedByCoord.out.bam; do
    
    mysortedbam=$(basename ${fn})
    id=${mysortedbam/_Aligned.sortedByCoord.out.bam/}
    echo "===== Processing sampleID: ${id}..." | tee -a ${logfile}
    echo "input filename: ${fn}"
    
    # some preparation
    mytempfile="${rnaseqfolder}${id}_Aligned.sortedByNames.bam"
    myoutdir="${rnaseqfolder}${id}/"
    mkdir -p ${myoutdir}
    echo "destination folder ${myoutdir}" >> ${logfile}
    
    # bam sorting...
    echo "samtools starts at $(date)" >> ${logfile}
    
    samtools sort -n \
                --threads 3 -m 5G \
                --output-fmt BAM \
                -o ${mytempfile} \
                -T ${rnaseqfolder} \
                --verbosity 3 \
                ${fn} \
                &>> ${logfile}
    
    echo "samtools ends at $(date)" >> ${logfile}

    # some user conversation to help being patient

    echo "...changing tool..." | tee -a ${logfile}

    # then rnaseq QC from Qualimap
    echo "qualimap rnaseq starts at $(date)" >> ${logfile}
    qualimap rnaseq -bam ${mytempfile} \
                        --sorted \
                        -gtf "${gtffile}" \
                        --sequencing-protocol non-strand-specific \
                        --paired \
                        -outdir ${myoutdir} \
                        --java-mem-size=20G \
                        &>> ${logfile}
    echo "qualimap rnaseq ends at $(date)" >> ${logfile}
    
    # removing extra bam file... saving disk space
    rm ${mytempfile}
    
    echo "... done" | tee -a ${logfile}
    
done

echo "operation ends at $(date)" >> ${logfile}

echo "=== Files created after qualimap rnaseq ===" >> ${logfile}
tree "${rnaseqfolder}" >> ${logfile}
echo "Qualimap generated $(find "${rnaseqfolder}" -name "*.html" | wc -l) html reports." | tee -a ${logfile}

Screen output is redirected to /shared/projects/2312_rnaseq_cea/allData/Results/logfiles/qualimap_rnaseq_allSamples.log
===== Processing sampleID: SRR12730408...
input filename: /shared/projects/2312_rnaseq_cea/allData/Results/star/SRR12730408_Aligned.sortedByCoord.out.bam
...changing tool...
... done
===== Processing sampleID: SRR12730409...
input filename: /shared/projects/2312_rnaseq_cea/allData/Results/star/SRR12730409_Aligned.sortedByCoord.out.bam
...changing tool...
... done
===== Processing sampleID: SRR12730410...
input filename: /shared/projects/2312_rnaseq_cea/allData/Results/star/SRR12730410_Aligned.sortedByCoord.out.bam
...changing tool...
... done
===== Processing sampleID: SRR12730411...
input filename: /shared/projects/2312_rnaseq_cea/allData/Results/star/SRR12730411_Aligned.sortedByCoord.out.bam
...changing tool...
... done
===== Processing sampleID: SRR12730412...
input filename: /shared/projects/2312_rnaseq_cea/allData/Results/star/SRR12730412_Aligned.sortedByCoord.ou

In [23]:
## Code cell 23 ##

#logfile="/shared/projects/2312_rnaseq_cea/scaburet/Results/logfiles/qualimap_rnaseq_3samples.log"
grep -e "operation" ${logfile}

operation starts at Thu May 25 10:58:27 CEST 2023
operation starts at Thu May 25 13:28:46 CEST 2023
operation ends at Thu May 25 15:30:12 CEST 2023


## 4 - Having a summary report with MultiQC

### **4.1 - Preparation steps**

#### **4.1.a - Tool presentation and version**

When numerous samples are processed, it can easily become tedious to look at each mapping quality report. So we'll run MultiQC (https://multiqc.info/), which automatically scans a folder for all quality check outputs and produces a single report. MultiQC supports almost any NGS tool (https://multiqc.info/docs/#multiqc-modules).

In [24]:
## Code cell 24 ##

multiqc --version

multiqc, version 1.13


#### **4.1.b - Personalize the report: setting variables**

Please specify the **file name** you want (do not worry about the extension, MultiQC will handle it for us) inside the quotes in the next cell. <a id="multiqctextvar"></a>  
<b>DO NOT use spaces or any special characters!</b>
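
A quick check can catch an unsuitable file name before MultiQC runs. This is a sketch under an assumption: "special characters" is not defined above, so the hypothetical helper `is_safe_name` simply accepts letters, digits, dots, underscores and hyphens, and rejects everything else.

```bash
# Succeeds only if the name is made of letters, digits, dots, underscores and hyphens
is_safe_name() {
    [[ "$1" =~ ^[A-Za-z0-9._-]+$ ]]
}

for candidate in "3_starAlignments-bam-files" "my report (v2)"; do
    if is_safe_name "${candidate}"; then
        echo "OK:     ${candidate}"
    else
        echo "REJECT: ${candidate}"
    fi
done
```

The first candidate is accepted and the second is rejected because of its spaces and parentheses.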

In [25]:
## Code cell 25 ##

inamemyfile="3_starAlignments-bam-files"

Please specify a meaningful **title** inside the quotes in the next cell, to display at the top of your upcoming report.
<b>Spaces are allowed this time, but still avoid any special characters.</b> 

In [26]:
## Code cell 26 ##

mytitle="Sorted by coordinates bam files quality for 11 paired end sequenced samples, from double-Htz B-ALL mice study (Dataset GSE158673, subSerie GSE158661)"

Besides, we can add a comment in the report header. It's good practice to do so, so we will define it in the following cell.

> In this cell, we use several lines to keep the text readable when the notebook is displayed. As your text lines are simply joined together in the html report, be sure to keep the trailing blank space at the end of every line.
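
An alternative that does not rely on remembering trailing spaces is to store the fragments in a bash array and join them afterwards. A minimal sketch, with made-up fragments and hypothetical names (`parts`, `demo_comment`):

```bash
# Hypothetical fragments; the real text is defined in the next cell
parts=(
    "Bam files derived from bulk RNA sequencing."
    "Mapping performed by STAR."
)

# "${parts[*]}" joins the elements with the first character of IFS (a space by default)
demo_comment="${parts[*]}"
echo "${demo_comment}"
```

This prints the two sentences on one line, separated by a single space.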

In [27]:
## Code cell 27 ##

mycomment=$(echo "Bam files derived from bulk RNA sequencing (mouse, unstranded) " \
"Performed by Ramamoorthy et al. 2020 (PMID:  33004416 ; GEO: GSE158673, subSerie GSE158661 ; SRA: PRJNA666155). " \
"RNASeq analysis to unravel molecular networks driving leukemia in Ebf1+/-Pax5+/- (dHet) B-ALL mice : To profile gene expression changes in  Ebf1+/-Pax5+/- (dHet) leukemic mice, RNASeq analysis was performed in dHet B-ALL, dHet proB and wt proB cells. " \
"PRJNA666155: 3 dHet mice, 2 replicates; proB cells derived from dHet pre-leukemic mouse, 3 replicates; wt proB cells, 2 replicates. " \
"Mapping to the reference genome (mouse, Gencode GRCm39, release 32) performed by STAR. Additional settings:  " \
"Unmapped or too multimapped (>20) are kept in separate file, GeneCounts quantification mode activated.")

... and don't forget to specify the output folder!

In [28]:
## Code cell 28 ##

qcsummaries="${gohome}allData/Results/multiqc/"
#qcsummaries="${gohome}$USER/Results/multiqc/"

### **4.2 - Generate MultiQC summary report**

MultiQC is verbose. Its output was quite short when it only had FastQC reports to process; *this time, it will also find STAR, qualimap bamqc, qualimap rnaseq, samtools stats and samtools flagstat outputs*!
So we will let its lines show in the notebook while also saving them to a file for later use.

In [29]:
## Code cell 29 ##

logfile="${logfolder}multiqc-processing_mapped-quality.log"
echo "Screen output is also saved in ${logfile}"

# as time command does not redirect output
echo "operation starting at $(date)" >> ${logfile}
time multiqc --interactive --export \
        --module star ${mappedfolder} \
        --module samtools ${samtoolsfolder} \
        --module qualimap ${qualimapfolder} \
        --outdir "${qcsummaries}" \
        --filename "${inamemyfile}" \
        --title "${mytitle}"  \
        --comment "${mycomment}" \
        "${gohome}allData/Results/" \
        |& tee -a ${logfile}
echo "operation finished at $(date)" >> ${logfile}

# to see which files we have afterward and follow folder sizes
ls -lh "${qcsummaries}" >> ${logfile}

du -h -d1 "${gohome}allData/Results/" >> ${logfile}
#du -h -d1 "${gohome}$USER/Results/" >> ${logfile}

Screen output is also saved in /shared/projects/2312_rnaseq_cea/allData/Results/logfiles/multiqc-processing_mapped-quality.log
bash: /shared/projects/2312_rnaseq_cea/allData/Results/logfiles/multiqc-processing_mapped-quality.log: Bad address

  /// MultiQC 🔍 | v1.13

|           multiqc | MultiQC Version v1.14 now available!
|           multiqc | Report title: Sorted by coordinates bam files quality for 11 paired end sequenced samples, from double-Htz B-ALL mice study (Dataset GSE158673, subSerie GSE158661)
|           multiqc | Only using modules: star, samtools, qualimap
|           multiqc | Search path : /shared/projects/2312_rnaseq_cea/allData/Results/star
|           multiqc | Search path : /shared/projects/2312_rnaseq_cea/allData/Results/qualimap
|           multiqc | Search path : /shared/projects/2312_rnaseq_cea/allData/Results
|         searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 2696/2696  
|          qualimap | Found 11 BamQC reports
|          qualimap | Foun

## 5 - Monitoring disk space usage

In [None]:
## Code cell 30 ##

#du -h -d2 ${gohome}$USER
du -h -d2 ${gohome}
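
On a large project folder, the raw `du` listing can be long. One possible refinement, sketched here with a hypothetical helper named `biggest_folders` and assuming the GNU versions of `du` and `sort`, is to sort the folders by size:

```bash
# Hypothetical helper: largest folders first, with human-readable sizes
biggest_folders() {
    local dir=$1 depth=${2:-1}
    du -h -d "${depth}" "${dir}" | sort -rh
}

# In this notebook: biggest_folders "${gohome}$USER" 2 | head

# Stand-alone demonstration on a throw-away folder
demo=$(mktemp -d)
mkdir -p "${demo}/small" "${demo}/big"
head -c 1000000 /dev/zero > "${demo}/big/file.bin"
biggest_folders "${demo}" 1
rm -rf "${demo}"
```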

---
___

Now we go on to read quantification at the gene level.  
  
**=> Step 6: Reads Count** 

The Jupyter notebook used for the next session will be *Pipe_06-bash_reads-counts.ipynb*    
Let's copy it into our directory, in order to have a private copy to work on:   

In [None]:
## Code cell 31 ##   

cp "${gohome}pipeline/Pipe_06-bash_reads-counts.ipynb" "${gohome}$USER/"

Bénédicte Noblet - 05-07 2021   
Sandrine Caburet - 02-05 2023   
Updated 24/05/2023