# RNA-seq Analysis Module - Mapping Quality Assessment
## Practical Session 4 (Tuesday 26/11/2024)
---
This notebook guides you through the quality assessment of RNA-seq mapping outputs, focusing on:
- BAM file quality metrics using samtools
- Detailed mapping quality analysis with QUALIMAP
- Multi-sample quality report generation with MultiQC

<div class="alert alert-block alert-info">
<b>Prerequisites:</b><br>
- Completed mapping of RNA-seq reads to reference genome
- Sorted BAM files (.bam) and their indexes (.bai)
- Reference genome annotation file (GTF format)
</div>
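
The index prerequisite can be verified before running any tool. The sketch below assumes each index sits next to its BAM file under one of the two common names (`sample.bam.bai` or `sample.bai`). `has_index` is a hypothetical helper introduced here for illustration, and `Results/star/` is the BAM location used later in this notebook.

```shell
# has_index: succeed if a .bai index exists next to the given BAM file.
# Assumption: the index is named either sample.bam.bai or sample.bai.
has_index() {
    local f="$1"
    [ -f "${f}.bai" ] || [ -f "${f%.bam}.bai" ]
}

# Report the index status of every BAM file in Results/star/
for bam in Results/star/*.bam; do
    [ -e "${bam}" ] || continue   # glob did not match: no BAM files yet
    if has_index "${bam}"; then
        echo "indexed:       ${bam}"
    else
        echo "missing index: ${bam} (create it with: samtools index ${bam})"
    fi
done
```

If a file is reported as missing its index, `samtools index` can be run on it before continuing.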

<div class="alert alert-block alert-warning">
<b>Resource Requirements:</b><br>
This analysis requires:
- Approximately 30-45 minutes to complete
- 10 CPU cores maximum
- 6GB RAM per student
- ~20GB disk space for results
</div>


## 1. Environment Setup
### 1.1 Resource Configuration

Before starting the analysis, we need to configure our environment and set resource limits. This ensures efficient processing while staying within the available computational resources.

Key parameters:
- NCPUS: Maximum number of CPU cores to use
- MAXRAM: Maximum RAM allocation per process


In [None]:
## Code Cell n°1 ##
# Maximum resources available
NCPUS=10
MAXRAM="6G"  # Maximum RAM per student
echo "Using maximum ${NCPUS} CPUs and ${MAXRAM} RAM"
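
It can help to compare these limits with what the machine actually offers. The following is a minimal sketch, not part of the original pipeline: `AVAILABLE_CPUS` and `FREE_GB` are hypothetical variable names, and the sketch assumes `getconf` and a POSIX `df` are available on the server.

```shell
# Requested limit (same value as Code Cell n°1)
NCPUS=10

# Number of cores currently online; fall back to 1 if getconf cannot report it
AVAILABLE_CPUS=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)

# Never ask a tool for more threads than the machine provides
if [ "${NCPUS}" -gt "${AVAILABLE_CPUS}" ]; then
    NCPUS=${AVAILABLE_CPUS}
fi
echo "Using ${NCPUS} of ${AVAILABLE_CPUS} available CPUs"

# Free space (whole GB) on the file system holding the current directory
FREE_GB=$(df -Pk . | awk 'NR == 2 { printf "%d", $4 / 1024 / 1024 }')
echo "Free disk space: ${FREE_GB} GB (about 20 GB is needed for results)"
```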


### 1.2 Directory Structure

The analysis requires a specific directory structure to organize input data and results. We'll create separate directories for each tool's output to maintain clarity and organization.

<div class="alert alert-block alert-info">
<b>Directory Organization:</b><br>
- Results/samtools/: Basic BAM statistics and metrics
- Results/qualimap/: Detailed mapping quality analysis
- Results/multiqc/: Combined quality reports
</div>


In [None]:
## Code Cell n°2 ##
# Create Results directory structure
mkdir -p Results/{samtools,qualimap,multiqc}

# Define data paths
DATA_DIR="/srv/data/meg-m2-rnaseq/Data"
FASTQ_DIR="${DATA_DIR}/fastq/raw"
GENOME_DIR="/srv/data/meg-m2-rnaseq/Genomes/Mmu/GRCm39/extracted"
GTF_FILE="${GENOME_DIR}/genome_annotation-M35.gtf"

echo "Results directories created"
ls -l Results/


## 2. Reference Genome Preparation
### 2.1 Genome Files

For quality assessment, we need the reference genome and its annotation. These files are used by QUALIMAP to analyze gene coverage and other metrics.

<div class="alert alert-block alert-info">
<b>Required Files:</b><br>
- Genome sequence (FASTA format)
- Gene annotation (GTF format)
</div>
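
Before moving on, we can confirm that the reference files are where the notebook expects them. This is a sketch under stated assumptions: `check_inputs` is a hypothetical helper, and only the paths defined in Code Cell n°2 are tested, because the notebook does not define a variable for the genome FASTA file.

```shell
# check_inputs: print the status of each path; return non-zero if any is missing.
check_inputs() {
    local missing=0
    local p
    for p in "$@"; do
        if [ -e "${p}" ]; then
            echo "OK       ${p}"
        else
            echo "MISSING  ${p}"
            missing=1
        fi
    done
    return "${missing}"
}

# Paths as defined in Code Cell n°2
GENOME_DIR="/srv/data/meg-m2-rnaseq/Genomes/Mmu/GRCm39/extracted"
GTF_FILE="${GENOME_DIR}/genome_annotation-M35.gtf"

if check_inputs "${GENOME_DIR}" "${GTF_FILE}"; then
    echo "All reference inputs found"
else
    echo "Some reference inputs are missing; check the paths before continuing"
fi
```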


## 3. BAM File Quality Assessment
### 3.1 Initial BAM File Inspection

BAM files contain aligned sequencing reads in a binary format. We'll use samtools to examine these files and generate basic statistics.

<div class="alert alert-block alert-info">
<b>Samtools Functions:</b><br>
- view: Convert between SAM/BAM formats and inspect alignments
- sort: Sort alignments by coordinates or read names
- index: Create index for fast random access
- stats: Generate comprehensive mapping statistics
</div>

First, let's examine our BAM files and generate basic statistics:

In [None]:
## Code Cell n°4 ##
# List available BAM files
echo "Available BAM files:"
ls -lh Results/star/*.bam

# Generate basic statistics for first two samples
for bamfile in $(ls Results/star/*.bam | head -n 2); do
    echo "Processing ${bamfile}..."

    # Output directory named after the BAM file (without the .bam extension)
    outdir="Results/samtools/$(basename "${bamfile}" .bam)"
    mkdir -p "${outdir}"

    # Generate statistics
    samtools stats "${bamfile}" > "${outdir}/stats.txt"

    # Show summary (SN lines only, without the leading tag)
    echo "Summary statistics:"
    grep ^SN "${outdir}/stats.txt" | cut -f 2-
done
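
The summary printed above contains many lines. To pull out a single metric, or to turn two of them into a mapping rate, a small helper can be used. The sketch below assumes the tab-separated `SN` layout of `samtools stats` output (tag, metric name ending with a colon, value). `sn_value` and `percent` are hypothetical helpers, and the example path is a placeholder for one of your own samples.

```shell
# sn_value: print the value of one summary (SN) metric from a samtools stats file.
# Usage: sn_value <stats.txt> "<metric name without the trailing colon>"
sn_value() {
    awk -F '\t' -v key="$2:" '$1 == "SN" && $2 == key { print $3 }' "$1"
}

# percent: print 100 * numerator / denominator with two decimals ("NA" if denominator is 0).
percent() {
    awk -v n="$1" -v d="$2" 'BEGIN { if (d > 0) printf "%.2f\n", 100 * n / d; else print "NA" }'
}

# Example on one sample; the path below is hypothetical.
stats="Results/samtools/sample1_Aligned.sortedByCoord.out/stats.txt"
if [ -f "${stats}" ]; then
    total=$(sn_value "${stats}" "raw total sequences")
    mapped=$(sn_value "${stats}" "reads mapped")
    echo "Mapping rate: $(percent "${mapped}" "${total}") %"
fi
```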


## 4. QUALIMAP Analysis
### 4.1 Tool Introduction

QUALIMAP is a platform-independent application for quality control of sequencing alignment data. Here we use two of its modes: `bamqc` for general BAM statistics and `rnaseq` for RNA-seq specific metrics.

Key features:
- Multi-sample BAM QC analysis
- RNA-seq specific quality metrics
- Detailed coverage analysis
- HTML report generation

<div class="alert alert-block alert-warning">
<b>Important Parameters:</b><br>
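- -bam: Input BAM file
- -c: Paint chromosome limits inside the charts (bamqc)
- -gd: Species to compare with the genome GC-content distribution, e.g. HUMAN or MOUSE (bamqc)
- -gtf: Annotation file in GTF format (rnaseq)
- -pe: Paired-end data; count fragments instead of reads (rnaseq)
- -outdir: Output directory
- -nt: Number of threads (bamqc)
- --java-mem-size: Maximum memory allocation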
</div>

<div class="alert alert-block alert-info">
<b>Documentation:</b><br>
For more information, visit:
- [QUALIMAP Documentation](http://qualimap.conesalab.org/)
- [QUALIMAP RNA-seq QC](http://qualimap.conesalab.org/doc_html/analysis.html#rna-seq-qc)
</div>


In [None]:
## Code Cell n°6 ##
# Create output directory for QUALIMAP results
mkdir -p Results/qualimap/{bamqc,rnaseq}

# Run QUALIMAP bamqc on first two samples
for bamfile in $(ls Results/star/*.bam | head -n 2); do
    sample=$(basename "${bamfile}" _Aligned.sortedByCoord.out.bam)
    echo "Processing ${sample}..."

    qualimap bamqc \
        -bam "${bamfile}" \
        -c \
        -gd MOUSE \
        -outdir "Results/qualimap/bamqc/${sample}" \
        -nt "${NCPUS}" \
        --java-mem-size="${MAXRAM}"
done

# Run RNA-seq specific analysis
# Note: -pe is the short form of --paired, so the flag is given only once
for bamfile in $(ls Results/star/*.bam | head -n 2); do
    sample=$(basename "${bamfile}" _Aligned.sortedByCoord.out.bam)
    echo "Processing RNA-seq QC for ${sample}..."

    qualimap rnaseq \
        -bam "${bamfile}" \
        -gtf "${GTF_FILE}" \
        -outdir "Results/qualimap/rnaseq/${sample}" \
        -pe \
        --java-mem-size="${MAXRAM}"
done
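
QUALIMAP runs can take a while, so it is worth checking which samples already have a report. The sketch below reuses the sample-name rule from the loops above and assumes that each output directory contains `qualimapReport.html`, the report file QUALIMAP writes with its default HTML output. `sample_name` and `qualimap_done` are hypothetical helpers.

```shell
# sample_name: derive the sample name from a STAR output BAM path,
# using the same suffix as the QUALIMAP loops.
sample_name() {
    basename "$1" _Aligned.sortedByCoord.out.bam
}

# qualimap_done: succeed if a QUALIMAP output directory holds its HTML report.
qualimap_done() {
    [ -f "$1/qualimapReport.html" ]
}

for bamfile in Results/star/*.bam; do
    [ -e "${bamfile}" ] || continue   # glob did not match: no BAM files found
    sample=$(sample_name "${bamfile}")
    for mode in bamqc rnaseq; do
        if qualimap_done "Results/qualimap/${mode}/${sample}"; then
            echo "${sample}: ${mode} report present"
        else
            echo "${sample}: ${mode} report not found"
        fi
    done
done
```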


## 5. MultiQC Report Generation
### 5.1 Tool Overview

MultiQC aggregates results from bioinformatics analyses across many samples into a single report. It scans the given directories for log files it recognizes and compiles an HTML report with interactive plots.

<div class="alert alert-block alert-info">
<b>Key Features:</b><br>
- Combines QC reports from multiple tools
- Interactive plots and tables
- Easy to interpret summary statistics
- Exportable plots and data

<b>Documentation:</b><br>
For more information, visit:
- [MultiQC Documentation](https://multiqc.info/)
- [Available MultiQC Modules](https://multiqc.info/docs/#multiqc-modules)
</div>


In [None]:
## Code Cell n°8 ##
# Set MultiQC parameters
MULTIQC_TITLE="Mouse RNA-seq Quality Report"
MULTIQC_COMMENT="Quality control analysis of mouse RNA-seq data"
MULTIQC_FILENAME="mouse_rnaseq_multiqc"

# Run MultiQC on the samtools and QUALIMAP result directories
multiqc \
    Results/samtools/ Results/qualimap/ \
    --title "${MULTIQC_TITLE}" \
    --comment "${MULTIQC_COMMENT}" \
    --filename "${MULTIQC_FILENAME}" \
    --outdir Results/multiqc/
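
Once MultiQC has finished, we can check that the report was written. This sketch assumes that MultiQC appends `.html` to the value given to `--filename`. `report_ready` is a hypothetical helper.

```shell
# report_ready: succeed if the given report file exists and is not empty.
report_ready() {
    [ -s "$1" ]
}

# Same file name as in Code Cell n°8
MULTIQC_FILENAME="mouse_rnaseq_multiqc"
REPORT="Results/multiqc/${MULTIQC_FILENAME}.html"

if report_ready "${REPORT}"; then
    echo "Report written: ${REPORT}"
else
    echo "Report not found: ${REPORT}"
fi
```

The report can then be opened from the Jupyter file browser.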


## 6. Results Interpretation

After running the quality control pipeline, you should examine several key metrics:

1. **Basic Mapping Statistics** (samtools):
   - Total reads mapped
   - Properly paired reads
   - Insert size distribution

2. **QUALIMAP Metrics**:
   - Coverage distribution
   - Gene body coverage
   - Read genomic origin
   - Transcript coverage uniformity

3. **MultiQC Summary**:
   - Sample comparison
   - Quality trends
   - Potential batch effects

<div class="alert alert-block alert-success">
<b>Questions to Consider:</b><br>
1. Are the mapping rates consistent across samples?
2. How uniform is the coverage across gene bodies?
3. Are there any concerning quality metrics in specific samples?
4. How do the samples compare in terms of sequencing depth?
</div>
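
To help answer the questions on mapping rate and sequencing depth, the samtools results can be collected into one table. This is a minimal sketch, assuming one `stats.txt` per sample directory under `Results/samtools/` (as produced in Code Cell n°4) and the tab-separated `SN` layout of `samtools stats`. `stats_table` is a hypothetical helper.

```shell
# stats_table: print a tab-separated table with one line per stats.txt file:
# sample name (taken from the parent directory), total reads, mapped reads, mapping rate.
stats_table() {
    local f sample
    printf 'sample\ttotal\tmapped\tmapped_pct\n'
    for f in "$@"; do
        [ -f "${f}" ] || continue   # skip unmatched globs and missing files
        sample=$(basename "$(dirname "${f}")")
        awk -F '\t' -v s="${sample}" '
            $1 == "SN" && $2 == "raw total sequences:" { total = $3 }
            $1 == "SN" && $2 == "reads mapped:"        { mapped = $3 }
            END {
                pct = (total > 0) ? sprintf("%.2f", 100 * mapped / total) : "NA"
                printf "%s\t%d\t%d\t%s\n", s, total, mapped, pct
            }' "${f}"
    done
}

stats_table Results/samtools/*/stats.txt
```

Samples whose mapping rate or total read count stands apart from the others are good candidates for a closer look in the MultiQC report.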


## 7. Analysis Complete
---
<div class="alert alert-block alert-success">
<b>Completion:</b><br>
You have successfully:
- Examined BAM file quality using samtools
- Generated detailed quality metrics with QUALIMAP
- Created a comprehensive MultiQC report

The results are stored in:
- Results/samtools/: Basic BAM statistics
- Results/qualimap/: Detailed mapping quality analysis
- Results/multiqc/: Combined quality report

<b>Next Steps:</b><br>
- Review the MultiQC report for overall quality assessment
- Investigate any samples with unusual metrics
- Document any quality concerns for downstream analysis
</div>
