# RNA-seq Data Analysis Pipeline - PS4 2024
## Tuesday 26/11/2024

This notebook covers the analysis of RNA-seq data from mouse samples, focusing on quality assessment and processing of mapped reads.

<div class="alert alert-info">
<b>Environment Requirements:</b><br>
- 10 CPUs per student<br>
- 6 GB RAM per student<br>
- Tools: samtools, QUALIMAP, MultiQC
</div>
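
Before starting, it can be useful to confirm that the three tools are reachable from the notebook's shell. The snippet below is a minimal sketch: the helper name `missing_tools` is hypothetical, and the executable names (`samtools`, `qualimap`, `multiqc`) are assumptions that may differ on your installation.

```bash
# Print the name of every listed tool that is not found on PATH.
# (missing_tools is a hypothetical helper, not part of any of the tools.)
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Executable names are assumptions; adjust them to your installation.
missing=$(missing_tools samtools qualimap multiqc)
if [ -n "$missing" ]; then
    echo "Missing tools:" $missing
else
    echo "All tools found."
fi
```

If a tool is reported as missing, check that the course environment is activated before running the cells below.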


## Required Files and Data

<div class="alert alert-warning">
<b>Required Input Files:</b>
1. Paired-end RNA-seq data (fastq.gz format)
   - Location: `/srv/data/meg-m2-rnaseq/Data/fastq/raw/`
   - Format: `*_1.fastq.gz` and `*_2.fastq.gz` for paired-end reads

2. Reference Genome Files:
   - Mouse genome (GRCm39)
   - GTF annotation: `/srv/data/meg-m2-rnaseq/Genomes/Mmu/GRCm39/extracted/genome_annotation-M35.gtf`

3. Previously Generated Files:
   - BAM files from STAR mapping
   - fastp results in `/srv/data/meg-m2-rnaseq/Results/fastp/`
</div>

<div class="alert alert-info">
<b>Output Structure:</b>
Results will be organized in tool-specific directories:
- `Results/samtools/` - Alignment statistics
- `Results/qualimap/` - RNA-seq QC metrics
- `Results/multiqc/` - Combined quality reports
</div>


## Reference Genome and Annotation

This section covers the downloading and indexing of reference genome files. These steps are typically performed during environment setup.

<div class="alert alert-info">
<b>Reference Files:</b>
- Genome: Mus musculus GRCm39
- Source: Ensembl Release 109
- Annotation: genome_annotation-M35.gtf
</div>

<div class="alert alert-warning">
<b>Note:</b> The following commands are provided for reference and documentation. They are kept in raw format as they are typically executed during environment setup.
</div>


## Quality Assessment of Mapped Reads

We'll use multiple tools to assess the quality of our mapped reads:

1. **samtools**: Basic alignment statistics
   - Read counts
   - Mapping quality
   - Insert size distribution

2. **QUALIMAP**: Detailed RNA-seq metrics
   - Gene coverage
   - Read distribution
   - Transcript coverage

3. **MultiQC**: Combined report generation
   - Aggregates results from all tools
   - Creates interactive visualizations

<div class="alert alert-info">
<b>Documentation Links:</b>
- [Samtools Manual](http://www.htslib.org/doc/samtools.html)
- [QUALIMAP Documentation](http://qualimap.conesalab.org/doc_html/index.html)
- [MultiQC Documentation](https://multiqc.info/)
</div>


### Setup Results Directories

First, we'll create the necessary directories for organizing our results.


In [None]:
# Create results directories
mkdir -p Results/samtools Results/qualimap Results/multiqc

### Process First Two Samples

We'll demonstrate the analysis pipeline on the first two samples. The same process will be applied to all samples later.

<div class="alert alert-info">
<b>Important Parameters:</b>
- samtools stats: Generates comprehensive alignment statistics
- QUALIMAP rnaseq:
  - `--java-mem-size=6G`: Maximum Java heap size, set to match the 6 GB of RAM available per student
  - `-gtf`: Annotation file, required for gene-based analysis
</div>


In [None]:
# Get the forward-read files of the first two samples
samples=$(ls /srv/data/meg-m2-rnaseq/Data/fastq/raw/*_1.fastq.gz | head -n 2)

# Process each sample
for sample in $samples; do
    # Extract the sample name: strip the directory and the _1.fastq.gz suffix
    base=$(basename "$sample" _1.fastq.gz)
    echo "Processing sample: $base"

    # Samtools stats (expects ${base}.bam in the current working directory)
    echo "Running samtools stats..."
    samtools stats "${base}.bam" > "Results/samtools/${base}_stats.txt"

    # QUALIMAP RNA-seq analysis
    echo "Running QUALIMAP..."
    qualimap rnaseq \
        -bam "${base}.bam" \
        -gtf /srv/data/meg-m2-rnaseq/Genomes/Mmu/GRCm39/extracted/genome_annotation-M35.gtf \
        --java-mem-size=6G \
        -outdir "Results/qualimap/${base}"
done

### Process All Samples

The following commands (in raw format) show how to process all samples in the dataset.
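
As a minimal sketch of such an all-samples loop, the block below extends the two-sample cell by removing the `head -n 2` filter and skipping samples whose BAM file is missing. The function name `process_all_samples` and the variables `FASTQ_DIR`, `BAM_DIR` and `GTF` are hypothetical names introduced for illustration, and the assumption that BAM files sit in the working directory is taken from the two-sample cell.

```bash
# Assumed locations, copied from the two-sample cell; override via the environment.
FASTQ_DIR="${FASTQ_DIR:-/srv/data/meg-m2-rnaseq/Data/fastq/raw}"
BAM_DIR="${BAM_DIR:-.}"   # assumption: BAM files are in the working directory
GTF="${GTF:-/srv/data/meg-m2-rnaseq/Genomes/Mmu/GRCm39/extracted/genome_annotation-M35.gtf}"

# Hypothetical helper: run samtools stats and qualimap rnaseq for every sample.
process_all_samples() {
    local fq base bam
    for fq in "$FASTQ_DIR"/*_1.fastq.gz; do
        [ -e "$fq" ] || continue                  # the glob matched nothing
        base=$(basename "$fq" _1.fastq.gz)
        bam="$BAM_DIR/${base}.bam"
        if [ ! -f "$bam" ]; then
            echo "Skipping $base: $bam not found" >&2
            continue
        fi
        mkdir -p Results/samtools Results/qualimap
        echo "Processing sample: $base"
        samtools stats "$bam" > "Results/samtools/${base}_stats.txt"
        qualimap rnaseq \
            -bam "$bam" \
            -gtf "$GTF" \
            --java-mem-size=6G \
            -outdir "Results/qualimap/${base}"
    done
}

process_all_samples
```

Iterating over the glob directly, rather than over the output of `ls`, keeps the loop robust to unusual file names, and the missing-BAM check prevents a single absent file from producing empty reports.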


## MultiQC Report Generation

MultiQC combines the quality reports from all samples into a comprehensive report.

<div class="alert alert-info">
<b>MultiQC Features:</b>
- Combines reports from multiple tools
- Creates interactive visualizations
- Enables easy sample comparison
- Generates HTML report

<b>Input Sources:</b>
- samtools stats results
- QUALIMAP RNA-seq reports
</div>


In [None]:
# Generate MultiQC report
echo "Generating MultiQC report..."
multiqc \
    Results/samtools/ \
    Results/qualimap/ \
    -o Results/multiqc/ \
    --title "Mouse RNA-seq Quality Report" \
    --comment "Generated for PS4-2024 course"
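
Since MultiQC only aggregates what it finds, a quick count of the available inputs can reveal samples that failed upstream. The following is a minimal sketch; `count_reports` is a hypothetical helper, and it assumes the `*_stats.txt` naming and the one-directory-per-sample Qualimap layout used earlier in this notebook.

```bash
# Hypothetical helper: count samtools stats files and Qualimap output directories.
count_reports() {
    local stats_dir="${1:-Results/samtools}" qm_dir="${2:-Results/qualimap}"
    local n_stats=0 n_qm=0 f d
    for f in "$stats_dir"/*_stats.txt; do
        if [ -f "$f" ]; then n_stats=$((n_stats + 1)); fi
    done
    for d in "$qm_dir"/*/; do
        if [ -d "$d" ]; then n_qm=$((n_qm + 1)); fi
    done
    echo "samtools stats files: $n_stats"
    echo "Qualimap output directories: $n_qm"
}

count_reports
```

If the two counts differ from the number of samples, one of the earlier steps likely failed for some sample and is worth rerunning before interpreting the combined report.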



## Progressive Loop Building for Qualimap Analysis

In this section, we'll build, step by step, a loop that runs Qualimap on all samples. Working this way helps you understand how to automate repetitive tasks in bash.

### Step 1: Understanding the Basic Command
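First, let's look at a basic Qualimap command for a single sample. Note that this section uses the `bamqc` mode on sorted BAM files, whereas the two-sample cell above used the `rnaseq` mode: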


In [None]:
# Example for a single sample
qualimap bamqc \
    -bam Results/samtools/sample1_sorted.bam \
    -outdir Results/qualimap/sample1 \
    --java-mem-size=6G


### Step 2: Identifying Variable Parts
In the command above, we can identify two main parts that change for each sample:
1. Input BAM file path (`Results/samtools/sample1_sorted.bam`)
2. Output directory path (`Results/qualimap/sample1`)

The sample name is the key variable that changes in both paths.
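
To make this concrete, both paths can be derived from a single variable. This is a small sketch using `sample1` as a placeholder sample name; the variable names `bam` and `outdir` are chosen here for illustration only.

```bash
sample="sample1"   # placeholder sample name; this is the only part that changes

# Both paths are built from the same variable
bam="Results/samtools/${sample}_sorted.bam"
outdir="Results/qualimap/${sample}"

echo "Input BAM:  $bam"
echo "Output dir: $outdir"
```

Changing the value of `sample` updates both paths at once, which is exactly what the loop will do for us.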



### Step 3: Creating a List of Sample Names
Next, we'll create a list of our sample names by listing the sorted BAM files and stripping the directory and the `_sorted.bam` suffix from each path:


In [None]:
# List sample names
ls Results/samtools/*_sorted.bam | sed 's|Results/samtools/||' | sed 's|_sorted.bam||'
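
An alternative that avoids parsing the output of `ls` is to loop over the glob and trim each path with bash parameter expansion. The sketch below assumes the same `*_sorted.bam` naming; `list_samples` is a hypothetical helper name.

```bash
# Hypothetical helper: print one sample name per *_sorted.bam file in a directory
# (default directory: Results/samtools).
list_samples() {
    local dir="${1:-Results/samtools}" bam name
    for bam in "$dir"/*_sorted.bam; do
        [ -e "$bam" ] || continue      # no match: the pattern is left unexpanded
        name="${bam##*/}"              # strip everything up to the last slash
        echo "${name%_sorted.bam}"     # strip the _sorted.bam suffix
    done
}

list_samples
```

Here `${bam##*/}` removes the directory part and `${name%_sorted.bam}` removes the suffix, so no external `sed` call is needed and file names containing spaces are handled correctly.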


### Step 4: Building the Loop - Basic Structure
Now, let's create a simple loop structure that will iterate over our samples:


In [None]:
# Basic loop structure
for sample in $(ls Results/samtools/*_sorted.bam | sed 's|Results/samtools/||' | sed 's|_sorted.bam||')
do
    echo "Processing sample: $sample"
done


### Step 5: Adding the Qualimap Command
Now we'll add the Qualimap command inside our loop, using variables for the sample-specific parts:


In [None]:
# Complete loop with Qualimap command
for sample in $(ls Results/samtools/*_sorted.bam | sed 's|Results/samtools/||' | sed 's|_sorted.bam||')
do
    echo "Processing sample: $sample"

    # Create output directory if it doesn't exist
    mkdir -p "Results/qualimap/${sample}"

    # Run Qualimap
    qualimap bamqc \
        -bam "Results/samtools/${sample}_sorted.bam" \
        -outdir "Results/qualimap/${sample}" \
        --java-mem-size=6G
done


### Important Notes:
1. The `mkdir -p` command ensures our output directory exists
2. We write `${sample}` with braces so that bash does not read `$sample_sorted` as the name of a different variable
3. The `--java-mem-size=6G` parameter caps Qualimap's Java memory use, which matters given the memory constraint (6 GB per student)
4. Using `echo` statements helps track progress during execution
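
The effect of the braces in note 2 can be seen in a short sketch. The sample name is a placeholder, and `sample_sorted` is deliberately set to an empty string to mimic an undefined variable.

```bash
sample="sample1"     # placeholder sample name
sample_sorted=""     # normally unset; set to empty so the example also runs under `set -u`

# Without braces, bash looks up the variable named "sample_sorted"
without_braces="Results/samtools/$sample_sorted.bam"

# With braces, the variable name ends at the closing brace
with_braces="Results/samtools/${sample}_sorted.bam"

echo "Without braces: $without_braces"   # Results/samtools/.bam
echo "With braces:    $with_braces"      # Results/samtools/sample1_sorted.bam
```

Because underscores are valid in variable names, bash cannot tell where `$sample` ends unless the braces mark the boundary.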

### Exercise for Students:
Try modifying the loop to:
1. Add error checking
2. Include progress percentage
3. Save a log file of the analysis
