# RNA-seq Data Analysis Pipeline - Part 5: Read Counting
## Tuesday 26/11/2024

This notebook covers read counting and quantification using two complementary approaches:
1. Gene-level quantification with featureCounts
2. Transcript-level quantification with Salmon

<div class="alert alert-info">
<b>Learning Objectives:</b><br>
- Understand different approaches to RNA-seq read counting
- Learn to use featureCounts for gene-level quantification
- Learn to use Salmon for transcript-level quantification
- Interpret read count statistics and quality metrics
</div>


## Documentation and References

### Tool Documentation
- [featureCounts Documentation](https://subread.sourceforge.net/featureCounts.html)
- [Salmon Documentation](https://salmon.readthedocs.io/en/latest/)

### Parameter References
- featureCounts parameters: [Manual](http://bioinf.wehi.edu.au/featureCounts/)
- Salmon parameters: [Parameters](https://salmon.readthedocs.io/en/latest/salmon.html#parameters)

## 1. System Setup
### 1.1 Environment Configuration

We'll set up our working environment with the necessary variables and directory structure.

In [None]:
## Code cell 1 ##
# Base directories
WORK_DIR="/srv/home/${USER}/meg_m2_rnaseq_bash"
DATA_DIR="/srv/data/meg-m2-rnaseq"

# Input/Output directories
RESULTS_DIR="${WORK_DIR}/Results"
BAM_DIR="${RESULTS_DIR}/star"
COUNTS_DIR="${RESULTS_DIR}/counts"
SALMON_DIR="${RESULTS_DIR}/salmon"

# Reference files
GTF_FILE="${DATA_DIR}/Genomes/Mmu/GRCm39/extracted/genome_annotation-M35.gtf"
TRANSCRIPTOME="${DATA_DIR}/Genomes/Mmu/GRCm39/extracted/transcriptome.fa"

# System resources (adjusted for Plasmabio)
CPU_CORES=4
MAX_RAM="6G"

# Tool parameters
STRAND_SPECIFIC=0  # 0=unstranded, 1=stranded, 2=reversely stranded
MIN_MAPPING_QUALITY=10
LIBTYPE="A"  # Automatic detection of library type
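The `STRAND_SPECIFIC` codes listed in the configuration are easy to mistype. The following is a minimal sketch of a check that could run before a long job starts; the function name `describe_strandness` is hypothetical and not part of featureCounts.

```shell
# Hypothetical helper: translate the strandness code used above into words,
# and fail on any value outside 0, 1 and 2.
describe_strandness() {
    case "$1" in
        0) echo "unstranded" ;;
        1) echo "stranded" ;;
        2) echo "reversely stranded" ;;
        *) echo "invalid strandness code: $1" >&2; return 1 ;;
    esac
}

STRAND_SPECIFIC=0
echo "Library is $(describe_strandness "${STRAND_SPECIFIC}")"
```

Calling the helper with any other value prints a message on standard error and returns a non-zero status, which a script can use to stop early.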

In [None]:
## Code cell 2 ##
# Create output directories
mkdir -p "${COUNTS_DIR}" "${SALMON_DIR}"

# Validate input files
echo "Checking input files..."
[ -f "${GTF_FILE}" ] && echo "GTF file found" || echo "ERROR: GTF file missing"
[ -f "${TRANSCRIPTOME}" ] && echo "Transcriptome file found" || echo "ERROR: Transcriptome file missing"
[ -d "${BAM_DIR}" ] && echo "BAM directory found" || echo "ERROR: BAM directory missing"
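The checks above only print a message, so later cells would still run with a missing input. As a sketch under that observation, a small helper can return a non-zero status that the caller can act on. The name `require_path` is hypothetical, and the demo uses throwaway paths rather than the course data.

```shell
# Hypothetical helper: report a missing input and return a non-zero status,
# so the caller can stop instead of continuing with a bad path.
require_path() {
    local kind="$1" path="$2"   # kind is "-f" (file) or "-d" (directory)
    if test "${kind}" "${path}"; then
        echo "OK: ${path}"
    else
        echo "ERROR: missing ${path}" >&2
        return 1
    fi
}

# Demo on throwaway paths (not the course data)
demo_dir=$(mktemp -d)
touch "${demo_dir}/annotation.gtf"
require_path -f "${demo_dir}/annotation.gtf"
require_path -d "${demo_dir}"
require_path -f "${demo_dir}/absent.gtf" || echo "A real script would stop here"
```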

## 2. Gene-Level Read Counting with featureCounts

<div class="alert alert-info">
<b>Tool Information:</b><br>
featureCounts is part of the Subread package and provides fast and accurate read counting for RNA-seq data.
<br><br>
<b>Key Features:</b>
- Supports both single and paired-end reads
- Handles multi-mapping reads
- Provides detailed assignment statistics
- Efficient memory usage
</div>

For more information, visit the [featureCounts documentation](http://subread.sourceforge.net/).

In [None]:
## Code cell 3 ##
# Run featureCounts on a single sample
SAMPLE1=$(ls "${BAM_DIR}"/*.bam | head -n 1)
SAMPLE_NAME=$(basename "${SAMPLE1}" .bam)

# -T: threads; -s: strandness; -Q: minimum mapping quality; -p: paired-end input
# (a comment cannot follow a line-continuation backslash, so options are described here)
# Note: from Subread 2.0.2 onward, -p only declares paired-end input;
# add --countReadPairs to count fragments instead of reads.
featureCounts \
    -T "${CPU_CORES}" \
    -s "${STRAND_SPECIFIC}" \
    -Q "${MIN_MAPPING_QUALITY}" \
    -p \
    -a "${GTF_FILE}" \
    -o "${COUNTS_DIR}/${SAMPLE_NAME}_counts.txt" \
    "${SAMPLE1}"

# Display summary
cat "${COUNTS_DIR}/${SAMPLE_NAME}_counts.txt.summary"
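A quick look at the most highly counted genes is a common first sanity check on a single-sample table. The sketch below assumes the usual featureCounts layout (one `#` comment line, a header line, then `Geneid`, `Chr`, `Start`, `End`, `Strand`, `Length` and the count in column 7). The function name `top_genes` and the demo table are made up for illustration.

```shell
# Hypothetical helper: print the N genes with the highest count in a
# single-sample featureCounts table (gene ID and count, tab-separated).
top_genes() {
    local counts_file="$1" n="${2:-5}"
    grep -v '^#' "${counts_file}" \
        | tail -n +2 \
        | cut -f 1,7 \
        | sort -k2,2nr \
        | head -n "${n}"
}

# Demo on a tiny made-up table with the same layout as featureCounts output
demo=$(mktemp)
printf '# Program:featureCounts (made-up demo header)\n' > "${demo}"
printf 'Geneid\tChr\tStart\tEnd\tStrand\tLength\tsample.bam\n' >> "${demo}"
printf 'geneA\tchr1\t1\t100\t+\t100\t12\n' >> "${demo}"
printf 'geneB\tchr1\t200\t300\t-\t101\t250\n' >> "${demo}"
printf 'geneC\tchr2\t1\t50\t+\t50\t0\n' >> "${demo}"
top_genes "${demo}" 2
```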

### 2.1 Running featureCounts on Multiple Samples

<div class="alert alert-info">
<b>Important:</b><br>
Processing multiple samples at once is more efficient and ensures consistent parameters across all samples.
</div>

Now we'll process all samples in a single featureCounts call. This approach allows us to:
1. Process all samples with identical parameters
2. Generate a combined count matrix
3. Save time and reduce potential errors
4. Create consistent output files

In [None]:
## Code cell 4 ##
# Create a list of all BAM files
BAM_FILES=("${BAM_DIR}"/*.bam)
echo "Found ${#BAM_FILES[@]} BAM files"

# Run featureCounts on all samples in one call
featureCounts \
    -T "${CPU_CORES}" \
    -s "${STRAND_SPECIFIC}" \
    -Q "${MIN_MAPPING_QUALITY}" \
    -p \
    -a "${GTF_FILE}" \
    -o "${COUNTS_DIR}/all_samples_counts.txt" \
    "${BAM_FILES[@]}"

# Create a simplified count matrix: keep the gene ID (column 1) and the
# per-sample counts (column 7 onward), and drop the "# Program:..." comment line
cut -f 1,7- "${COUNTS_DIR}/all_samples_counts.txt" | grep -v '^#' > "${COUNTS_DIR}/counts_matrix.txt"
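In the simplified matrix, each sample column is named after the full BAM path passed to featureCounts. The sketch below shortens those names in the header line only. The function name `clean_header` and the demo paths are hypothetical, and the `.bam` suffix is an assumption to adapt to the actual file names.

```shell
# Hypothetical helper: shorten the sample columns of the matrix header
# by removing the directory and the ".bam" suffix from each BAM path.
clean_header() {
    awk 'BEGIN { FS = OFS = "\t" }
         NR == 1 {
             for (i = 2; i <= NF; i++) {
                 sub(/^.*\//, "", $i)   # drop the directory part
                 sub(/\.bam$/, "", $i)  # drop the file extension
             }
         }
         { print }' "$1"
}

# Demo on a made-up two-sample matrix
demo=$(mktemp)
printf 'Geneid\t/path/to/star/S1.bam\t/path/to/star/S2.bam\n' > "${demo}"
printf 'geneA\t12\t7\n' >> "${demo}"
clean_header "${demo}"
```

Only the first line is modified; count rows pass through unchanged.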

## 2. Pseudo-mapping with Salmon

<div class="alert alert-info">
<b>Tool Information:</b><br>
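Salmon performs transcript-level quantification by mapping reads directly to the transcriptome with a lightweight algorithm (selective alignment, often called pseudo-mapping), without producing a genome alignment.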
<br><br>
<b>Advantages:</b>
- Faster than traditional alignment
- Direct transcript-level quantification
- Bias-aware estimation
- Memory efficient
</div>

For more information, visit the [Salmon documentation](https://salmon.readthedocs.io/).

In [None]:
## Code cell 5 ##
# Index the transcriptome (if not already done)
salmon index \
    -t "${TRANSCRIPTOME}" \
    -i "${SALMON_DIR}/transcriptome_index" \
    -p "${CPU_CORES}"
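The comment "if not already done" suggests skipping the indexing step when an index is already present. A minimal sketch of such a guard follows. The function name `index_present` is hypothetical, and the test is only a cheap presence check (a non-empty directory), not a validation of the index contents.

```shell
# Hypothetical helper: succeed if the index directory exists and is not empty.
# This is only a presence check; it does not validate the index itself.
index_present() {
    local index_dir="$1"
    [ -d "${index_dir}" ] && [ -n "$(ls -A "${index_dir}" 2>/dev/null)" ]
}

# Demo with throwaway directories (no real index is built here)
demo_dir=$(mktemp -d)
mkdir "${demo_dir}/empty_index" "${demo_dir}/built_index"
touch "${demo_dir}/built_index/placeholder"

for idx in "${demo_dir}/empty_index" "${demo_dir}/built_index"; do
    if index_present "${idx}"; then
        echo "Index found in ${idx}: skipping salmon index"
    else
        echo "No index in ${idx}: salmon index would run here"
    fi
done
```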

In [None]:
## Code cell 6 ##
# Process first two samples
FASTQ_DIR="${DATA_DIR}/fastq/raw"
SAMPLES=($(ls "${FASTQ_DIR}"/*_R1.fastq.gz | head -n 2))

for R1 in "${SAMPLES[@]}"; do
    R2="${R1%_R1.fastq.gz}_R2.fastq.gz"   # mate file: same name with _R2
    SAMPLE=$(basename "${R1}" _R1.fastq.gz)

    echo "Processing sample: ${SAMPLE}"

    salmon quant \
        -i "${SALMON_DIR}/transcriptome_index" \
        -l "${LIBTYPE}" \
        -1 "${R1}" \
        -2 "${R2}" \
        -p "${CPU_CORES}" \
        --validateMappings \
        -o "${SALMON_DIR}/${SAMPLE}"
done
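Each Salmon run writes a `quant.sf` table with the columns `Name`, `Length`, `EffectiveLength`, `TPM` and `NumReads`. As a rough sketch of a first look at that table, the helper below counts transcripts at or above a TPM threshold. The function name `count_expressed`, the default threshold of 1 TPM and the demo values are all illustrative assumptions.

```shell
# Hypothetical helper: count transcripts whose TPM (column 4 of quant.sf)
# is at least the given threshold (default: 1).
count_expressed() {
    local quant_file="$1" min_tpm="${2:-1}"
    awk -v min="${min_tpm}" 'BEGIN { FS = "\t" }
         NR > 1 && $4 >= min { n++ }     # NR > 1 skips the header line
         END { print n + 0 }' "${quant_file}"
}

# Demo on a made-up quant.sf with three transcripts
demo=$(mktemp)
printf 'Name\tLength\tEffectiveLength\tTPM\tNumReads\n' > "${demo}"
printf 'tx1\t1500\t1320.5\t10.5\t240\n' >> "${demo}"
printf 'tx2\t900\t720.0\t0.0\t0\n' >> "${demo}"
printf 'tx3\t2100\t1920.2\t1.0\t35\n' >> "${demo}"
echo "Transcripts with TPM >= 1: $(count_expressed "${demo}" 1)"
```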

## 3. Quality Control

Let's examine the quality metrics for both quantification methods. This helps us ensure the reliability of our results.

In [None]:
## Code cell 7 ##
# Check featureCounts assignment statistics
echo "=== featureCounts Assignment Statistics ==="
cat "${COUNTS_DIR}/all_samples_counts.txt.summary"

# Check Salmon mapping rates
# (the index also lives under ${SALMON_DIR}, so skip directories without a quant log)
echo -e "\n=== Salmon Mapping Rates ==="
for dir in "${SALMON_DIR}"/*/; do
    log="${dir}logs/salmon_quant.log"
    [ -f "${log}" ] || continue
    sample=$(basename "${dir}")
    # Salmon logs a line of the form "Mapping rate = ..."; match it case-insensitively
    rate=$(grep -i "mapping rate" "${log}" | tail -n 1)
    echo "${sample}: ${rate}"
done
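The `.summary` file lists one row per assignment status and one column per BAM file, so the share of assigned reads has to be computed by hand. The sketch below does that with `awk`. The function name `assignment_rates` is hypothetical, and the demo numbers are made up.

```shell
# Hypothetical helper: for each sample column of a featureCounts .summary
# file, print the percentage of counted reads with status "Assigned".
assignment_rates() {
    awk 'BEGIN { FS = "\t" }
         NR == 1 { ncol = NF; for (i = 2; i <= NF; i++) name[i] = $i; next }
         {
             for (i = 2; i <= NF; i++) {
                 total[i] += $i
                 if ($1 == "Assigned") assigned[i] = $i
             }
         }
         END {
             for (i = 2; i <= ncol; i++) {
                 pct = (total[i] > 0) ? 100 * assigned[i] / total[i] : 0
                 printf "%s\t%.1f\n", name[i], pct
             }
         }' "$1"
}

# Demo on a made-up summary with two samples
demo=$(mktemp)
printf 'Status\tS1.bam\tS2.bam\n' > "${demo}"
printf 'Assigned\t80\t50\n' >> "${demo}"
printf 'Unassigned_NoFeatures\t15\t30\n' >> "${demo}"
printf 'Unassigned_Ambiguity\t5\t20\n' >> "${demo}"
assignment_rates "${demo}"
```

A low percentage for one sample compared with the others is worth investigating before moving on to differential analysis.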

## Troubleshooting Guide

Common Issues and Solutions:

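1. **Insufficient Memory**
   - Symptom: Process killed or memory error
   - Solution: Lower the thread count (`CPU_CORES`), lower the per-thread memory given to `samtools sort` (`-m`), or process fewer samples per call

2. **Missing Input Files**
   - Symptom: File not found error
   - Solution: Verify file paths and permissions, starting with the input checks in Section 1.1

3. **Invalid File Format**
   - Symptom: Parse error in featureCounts/Salmon
   - Solution: Check input file format and integrity (for example, a BAM or FASTQ file truncated by an interrupted copy)

4. **Resource Issues**
   - Symptom: Slow processing or timeouts
   - Solution: Check the resources allocated to the session (`squeue`, `sacct`) and adjust the thread count to match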
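For compressed FASTQ files, one simple integrity check is `gzip -t`, which reads the whole archive and reports whether it decompresses cleanly. The sketch below wraps it in a hypothetical helper named `check_gzip`; the demo files are created on the fly and are not the course data.

```shell
# Hypothetical helper: test each gzip file and return non-zero if any fails.
check_gzip() {
    local f
    local status=0
    for f in "$@"; do
        if gzip -t "${f}" 2>/dev/null; then
            echo "OK       ${f}"
        else
            echo "CORRUPT  ${f}"
            status=1
        fi
    done
    return "${status}"
}

# Demo: one valid archive and one file that is not gzip data at all
demo_dir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n' | gzip > "${demo_dir}/good.fastq.gz"
printf 'this is not gzip data\n' > "${demo_dir}/bad.fastq.gz"
check_gzip "${demo_dir}/good.fastq.gz" "${demo_dir}/bad.fastq.gz" \
    || echo "At least one file failed the check"
```

This only covers the compression layer; it does not verify that the content is well-formed FASTQ.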

## 4. Results Interpretation

<div class="alert alert-info">
<b>Key Points:</b><br>
- featureCounts provides gene-level counts
- Salmon provides transcript-level abundance estimates
- Both tools generate quality metrics
- Compare mapping rates between methods
</div>

The output files can be found in:
1. Gene counts: ${COUNTS_DIR}/counts_matrix.txt
2. Transcript quantification: ${SALMON_DIR}/<sample>/quant.sf

   

- 0.1 - About session for IFB core cluster
- 0.2 - Parameters to be set or modified by the user
- 1 - Gene level quantification using ``featureCounts``
- 2 - Pseudo-mapping with Salmon
- 3 - Monitoring disk usage

In [1]:
## Code cell 8 ##

echo "=== Cell launched on $(date) ==="
squeue -hu $USER 

echo "=== Current IFB session size: as an indication: Medium (4CPU, 10GB) or Large (10CPU, 50GB) ==="
jobid=$(squeue -hu $USER | awk '/sys\/dash/ {print $1}')

sacct --format=JobID,AllocCPUS,ReqMem,NodeList,Elapsed,State --jobs ${jobid}

=== Cell launched on Mon Jun 24 12:00:40 CEST 2024 ===
          40350384      fast sys/dash scaburet  R    2:38:33      1 cpu-node-51
=== Current IFB session size: as an indication: Medium (4CPU, 10GB) or Large (10CPU, 50GB) ===
JobID         AllocCPUS     ReqMem        NodeList    Elapsed      State 
------------ ---------- ---------- --------------- ---------- ---------- 
40350384              7        33G     cpu-node-51   02:38:33    RUNNING 
40350384.ba+          7                cpu-node-51   02:38:33    RUNNING 


In [2]:
## Code cell 9 ##

module load samtools/1.18 subread/2.0.6 salmon/1.10.2 multiqc/1.13

# module load samtools/1.10 subread/2.0.1 in 2023

echo "===== bam sorting by names ====="
samtools --version | head -n 2
echo "===== gene level quantification ====="
featureCounts -v
echo "===== Pseudo mapping and quantification with Salmon ====="
salmon -v
echo "===== quality reports compilation ====="
multiqc --version

===== bam sorting by names =====
samtools 1.18
Using htslib 1.19
===== gene level quantification =====

featureCounts v2.0.6

===== Pseudo mapping and quantification with Salmon =====
salmon 1.10.2
===== quality reports compilation =====
multiqc, version 1.13


In [3]:
## Code cell 10 ##

authorizedCPU=5           # 5 CPUs

authorizedRAM=30000000000  # 30 GB, expressed in bytes
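These two limits can be used to derive a per-thread memory value for tools that take one, such as the `-m` option of `samtools sort`. The following is only a sketch: the function name `per_thread_mem` and the 80 % safety margin are assumptions, and the result is an approximate figure in megabytes.

```shell
# Hypothetical helper: split a memory budget (in bytes) across threads,
# keeping a safety margin, and print an approximate value in megabytes.
per_thread_mem() {
    local total_bytes="$1" threads="$2" keep_pct="${3:-80}"
    echo "$(( total_bytes * keep_pct / 100 / threads / 1000000 ))M"
}

authorizedCPU=5
authorizedRAM=30000000000
echo "Per-thread memory: $(per_thread_mem "${authorizedRAM}" "${authorizedCPU}")"
```

With the values above and the default margin, the helper prints `4800M`.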

In [4]:
## Code cell 11 ##

gohome="/shared/projects/2413_rnaseq_cea/"

echo "=== Home root folder is ==="
echo "${gohome}"
echo ""
echo "=== Working (personal) folder tree ==="
tree -d -L 2 "${gohome}$USER"
echo "=== current working directory ==="
echo "${PWD}"

=== Home root folder is ===
/shared/projects/2413_rnaseq_cea/

=== Working (personal) folder tree ===
/shared/projects/2413_rnaseq_cea/scaburet
|-- Data
|   |-- fastq
|   `-- sra
|-- Results
|   |-- Rintro
|   |-- deseq2
|   |-- enrich
|   |-- fastp
|   |-- fastq_screen
|   |-- fastqc
|   |-- featurecounts
|   |-- gsea
|   |-- logfiles
|   |-- multiqc
|   |-- pca1
|   |-- qualimap
|   |-- qualimap-11juin
|   |-- salmon
|   |-- salmon2
|   |-- samtools
|   |-- samtools-11juin
|   |-- star
|   |-- star-11juin
|   `-- wgcna
|-- done
|-- run_notebooks
`-- stuff
    `-- meg_m2_rnaseq

28 directories
=== current working directory ===
/shared/ifbstor1/projects/2413_rnaseq_cea/scaburet


In [5]:
## Code cell 12 ##

logfolder="${gohome}$USER/Results/logfiles/"
echo "the folder for log files is ${logfolder}"

the folder for log files is /shared/projects/2413_rnaseq_cea/scaburet/Results/logfiles/


In [6]:
## Code cell 13 ##

featureCounts -v


featureCounts v2.0.6



In [7]:
## Code cell 14 ##

mappedfolder="${gohome}$USER/Results/star/"
echo "There are $(ls "${mappedfolder}"*_Aligned.sortedByCoord.out.bam | wc -l) bam files:"
ls "${mappedfolder}"*_Aligned.sortedByCoord.out.bam

There are 3 bam files:
/shared/projects/2413_rnaseq_cea/scaburet/Results/star/SRR12730409_Aligned.sortedByCoord.out.bam
/shared/projects/2413_rnaseq_cea/scaburet/Results/star/SRR12730410_Aligned.sortedByCoord.out.bam
/shared/projects/2413_rnaseq_cea/scaburet/Results/star/SRR12730411_Aligned.sortedByCoord.out.bam


In [8]:
## Code cell 15 ##

gtffile="${gohome}alldata/Reference/extracted/genome_annotation-M35.gtf"
echo "The transcript reference gtf file is ${gtffile}"
echo "First rows of the annotation file: ${gtffile}"
head -n 6 ${gtffile}

The transcript reference gtf file is /shared/projects/2413_rnaseq_cea/alldata/Reference/extracted/genome_annotation-M35.gtf
First rows of the annotation file: /shared/projects/2413_rnaseq_cea/alldata/Reference/extracted/genome_annotation-M35.gtf
##description: evidence-based annotation of the mouse genome (GRCm39), version M35 (Ensembl 112)
##provider: GENCODE
##contact: gencode-help@ebi.ac.uk
##format: gtf
##date: 2024-02-27
chr1	HAVANA	gene	3143476	3144545	.	+	.	gene_id "ENSMUSG00000102693.2"; gene_type "TEC"; gene_name "4933401J01Rik"; level 2; mgi_id "MGI:1918292"; havana_gene "OTTMUSG00000049935.1";


In [12]:
## Code cell 16 ##

featcountfolder="${gohome}$USER/Results/featurecounts2/"
mkdir -p ${featcountfolder}
tree -d -L 1 "${gohome}$USER/Results/"

/shared/projects/2413_rnaseq_cea/scaburet/Results/
|-- Rintro
|-- deseq2
|-- enrich
|-- fastp
|-- fastq_screen
|-- fastqc
|-- featurecounts
|-- featurecounts2
|-- gsea
|-- logfiles
|-- multiqc
|-- pca1
|-- qualimap
|-- qualimap-11juin
|-- salmon
|-- salmon2
|-- samtools
|-- samtools-11juin
|-- star
|-- star-11juin
`-- wgcna

21 directories


In [11]:
## Code cell 17 ##

logfile="${logfolder}featureCounts-gene-level-counts-individualsamples_samtoolsSort-2.log"
echo "Screen output is redirected to ${logfile}"

Screen output is redirected to /shared/projects/2413_rnaseq_cea/scaburet/Results/logfiles/featureCounts-gene-level-counts-individualsamples_samtoolsSort-2.log


In [13]:
## Code cell 18 ##

# the report printed by `time` is not captured by the redirections, so log the start time explicitly
echo "operation starts at $(date)" >> ${logfile}

time for fn in $(ls "${mappedfolder}"*_Aligned.sortedByCoord.out.bam); do  
    
    mysortedbam=$(basename ${fn})
    id=${mysortedbam/_Aligned.sortedByCoord.out.bam/}
    echo "===== Processing sampleID: ${id}..." | tee -a ${logfile}
    
    # outputfiles
    mytempfile="${featcountfolder}${id}_Aligned.sortedByNames.bam"
    myoutfile="${featcountfolder}${id}_paired-unstranded"
    
    # bam sorting...
    echo "samtools starts at $(date)" >> ${logfile}
    samtools sort -n \
                --threads 3 -m 4G \
                --output-fmt BAM \
                -o ${mytempfile} \
                -T ${featcountfolder} \
                ${fn} \
                &>> ${logfile}
    echo "samtools ends at $(date)" >> ${logfile}

    # progress message so the user knows the loop is still running
    echo "...changing tool..." | tee -a ${logfile}

    # then featureCounts
    echo "featureCounts starts at $(date)" >> ${logfile}

    featureCounts -p -s 0 -T 4 \
                  -a "${gtffile}" \
                  -o "${myoutfile}.counts" \
                  ${mytempfile} \
                  --donotsort \
                  --verbose \
                  &>> ${logfile}
    echo "featureCounts ends at $(date)" >> ${logfile}
    
    # removing extra bam file... saving disk space
    rm ${mytempfile}
    
    echo "... done" | tee -a ${logfile}
    
done

===== Processing sampleID: SRR12730409...
...changing tool...
... done
===== Processing sampleID: SRR12730410...
...changing tool...
... done
===== Processing sampleID: SRR12730411...
...changing tool...
... done

real	12m40.250s
user	37m49.605s
sys	1m9.624s
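The loop above produces one `.counts` file per sample. A sketch of how such files could be combined into a single matrix is shown below, assuming every file was produced with the same annotation and therefore lists the same genes in the same order. The function name `merge_counts` and the demo files are hypothetical. Running featureCounts once on all BAM files gives the same layout directly.

```shell
# Hypothetical helper: merge single-sample featureCounts tables into one
# matrix (gene ID plus one count column per input file).
merge_counts() {
    awk 'BEGIN { FS = OFS = "\t" }
         FNR == 1 { nfiles++ }                  # a new input file starts
         /^#/ { next }                          # skip the "# Program:..." line
         $1 == "Geneid" { header[nfiles] = $7; next }
         {
             if (nfiles == 1) order[++ngenes] = $1   # gene order from file 1
             count[$1, nfiles] = $7
         }
         END {
             line = "Geneid"
             for (f = 1; f <= nfiles; f++) line = line OFS header[f]
             print line
             for (g = 1; g <= ngenes; g++) {
                 line = order[g]
                 for (f = 1; f <= nfiles; f++) line = line OFS count[order[g], f]
                 print line
             }
         }' "$@"
}

# Demo with two made-up single-sample tables
demo_dir=$(mktemp -d)
for s in S1 S2; do
    printf '# Program:featureCounts (made-up demo header)\n' > "${demo_dir}/${s}.counts"
    printf 'Geneid\tChr\tStart\tEnd\tStrand\tLength\t%s.bam\n' "${s}" >> "${demo_dir}/${s}.counts"
done
printf 'geneA\tchr1\t1\t100\t+\t100\t12\n' >> "${demo_dir}/S1.counts"
printf 'geneB\tchr1\t200\t300\t-\t101\t250\n' >> "${demo_dir}/S1.counts"
printf 'geneA\tchr1\t1\t100\t+\t100\t3\n' >> "${demo_dir}/S2.counts"
printf 'geneB\tchr1\t200\t300\t-\t101\t199\n' >> "${demo_dir}/S2.counts"
merge_counts "${demo_dir}/S1.counts" "${demo_dir}/S2.counts"
```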


In [14]:
## Code cell 19 ##

echo "operation ends at $(date)" >> ${logfile}

echo "=== Files created after featureCounts ===" >> ${logfile}
ls -lh "${featcountfolder}" >> ${logfile}
echo "featureCounts generated $(ls "${featcountfolder}"*.counts | wc -l) count files." \
    | tee -a ${logfile}
echo "featureCounts generated $(ls "${featcountfolder}"*.counts.summary | wc -l) summary files." \
    | tee -a ${logfile}

featureCounts generated 3 count files.
featureCounts generated 3 summary files.


In [15]:
## Code cell 20 ##

cp ${gohome}alldata/Results/featurecounts/11samples_* ${gohome}$USER/Results/featurecounts/

head -n 8 ${gohome}$USER/Results/featurecounts/*_paired-unstranded.counts

==> /shared/projects/2413_rnaseq_cea/scaburet/Results/featurecounts/11samples_paired-unstranded.counts <==
# Program:featureCounts v2.0.6; Command:"featureCounts" "-p" "-s" "0" "-T" "16" "-a" "/shared/projects/2413_rnaseq_cea/alldata/Reference/extracted/genome_annotation-M35.gtf" "-o" "/shared/projects/2413_rnaseq_cea/alldata/Results/featurecounts/11samples_paired-unstranded.counts" "/shared/projects/2413_rnaseq_cea/alldata/Results/featurecounts/SRR12730403_Aligned.sortedByNames.bam" "/shared/projects/2413_rnaseq_cea/alldata/Results/featurecounts/SRR12730404_Aligned.sortedByNames.bam" "/shared/projects/2413_rnaseq_cea/alldata/Results/featurecounts/SRR12730405_Aligned.sortedByNames.bam" "/shared/projects/2413_rnaseq_cea/alldata/Results/featurecounts/SRR12730406_Aligned.sortedByNames.bam" "/shared/projects/2413_rnaseq_cea/alldata/Results/featurecounts/SRR12730407_Aligned.sortedByNames.bam" "/shared/projects/2413_rnaseq_cea/alldata/Results/featurecounts/SRR12730408_Aligned.sortedByNames.b

In [16]:
## Code cell 21 ##

salmon -v

salmon 1.10.2


In [18]:
## Code cell 22 ##

salmonfolder="${gohome}$USER/Results/salmon3/"
mkdir -p ${salmonfolder}
echo "The resulting salmon files will be in ${salmonfolder}"

The resulting salmon files will be in /shared/projects/2413_rnaseq_cea/scaburet/Results/salmon3/


In [19]:
## Code cell 23 ##

reffolder="${gohome}alldata/Reference/"
# mkdir -p ${reffolder}
echo "The general folder for reference genome, annotations and index files is ${reffolder}"


salmonreffolder="${reffolder}/salmon/"
# mkdir -p ${salmonreffolder}
echo "The index folder for Salmon is ${salmonreffolder}"

The general folder for reference genome, annotations and index files is /shared/projects/2413_rnaseq_cea/alldata/Reference/
The index folder for Salmon is /shared/projects/2413_rnaseq_cea/alldata/Reference//salmon/


In [20]:
## Code cell 24 ##

fastpfolder="${gohome}$USER/Results/fastp/"
echo "The analysed fastp.fastq files are in ${fastpfolder}"
echo "and they are:"
ls ${fastpfolder} | grep -v -e "_removed" 

#| wc -l

The analysed fastp.fastq files are in /shared/projects/2413_rnaseq_cea/scaburet/Results/fastp/
and they are:
SRR12730409_1.fastp.fastq.gz
SRR12730409_2.fastp.fastq.gz
SRR12730409_fastp.html
SRR12730409_fastp.json
SRR12730410_1.fastp.fastq.gz
SRR12730410_2.fastp.fastq.gz
SRR12730410_fastp.html
SRR12730410_fastp.json
SRR12730411_1.fastp.fastq.gz
SRR12730411_2.fastp.fastq.gz
SRR12730411_fastp.html
SRR12730411_fastp.json
preformation


In [21]:
## Code cell 25 ##

logfile4="${logfolder}Salmon3.log"
echo "Screen output is redirected to ${logfile4}"

Screen output is redirected to /shared/projects/2413_rnaseq_cea/scaburet/Results/logfiles/Salmon3.log


In [None]:
## Code cell 26 ##

echo "Screen output is redirected to ${logfile4}"

# the report printed by `time` is not captured by the redirections, so log the start time explicitly
echo "operation starts at $(date)" >> ${logfile4}
echo "before loop"

time for read1 in $(ls "${gohome}$USER/Results/fastp/"*_1.fastp.fastq.gz); do
    echo "starting the loop with ${read1}"
    
    # handling names with the sample name
    samplenum=$(basename ${read1} | cut -d"_" -f1)
    echo "====== Processing sampleID: ${samplenum}..." | tee -a ${logfile4}
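    # mate file: swap the _1 suffix for _2 (anchored on the suffix, so a "_1" elsewhere in the path is left alone)
    read2="${read1%_1.fastp.fastq.gz}_2.fastp.fastq.gz"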

    echo "Salmon starts at $(date)" >> ${logfile4}
    
    # Salmon working
    salmon quant -i "${salmonreffolder}salmon_vM35_index" -l A \
         -1 ${read1} \
         -2 ${read2} \
         --validateMappings \
         --threads ${authorizedCPU} -o "${salmonfolder}${samplenum}" \
         |& tee -a ${logfile4}
         
    echo "Salmon ends at $(date)" >> ${logfile4}
    
    echo "...done" | tee -a ${logfile4} 
done  

Screen output is redirected to /shared/projects/2413_rnaseq_cea/scaburet/Results/logfiles/Salmon3.log
before loop
starting the loop with /shared/projects/2413_rnaseq_cea/scaburet/Results/fastp/SRR12730409_1.fastp.fastq.gz
Version Server Response: Not Found
### salmon (selective-alignment-based) v1.10.2
### [ program ] => salmon 
### [ command ] => quant 
### [ index ] => { /shared/projects/2413_rnaseq_cea/alldata/Reference//salmon/salmon_vM35_index }
### [ libType ] => { A }
### [ mates1 ] => { /shared/projects/2413_rnaseq_cea/scaburet/Results/fastp/SRR12730409_1.fastp.fastq.gz }
### [ mates2 ] => { /shared/projects/2413_rnaseq_cea/scaburet/Results/fastp/SRR12730409_2.fastp.fastq.gz }
### [ validateMappings ] => { }
### [ threads ] => { 5 }
### [ output ] => { /shared/projects/2413_rnaseq_cea/scaburet/Results/salmon3/SRR12730409 }
Logs will be written to /shared/projects/2413_rnaseq_cea/scaburet/Results/salmon3/SRR12730409/logs
[2024-06-24 12:36:41.169] [jointLog] [info] setting maxHas
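Because the runs above use `-l A`, Salmon infers the library type itself. The inferred type can be read back from the `lib_format_counts.json` file in each output folder, assuming it contains an `expected_format` field. The sketch below extracts that field with `sed`; the function name `detected_libtype` and the demo file are hypothetical.

```shell
# Hypothetical helper: print the library type recorded in a Salmon output
# folder (the "expected_format" field of lib_format_counts.json).
detected_libtype() {
    sed -n 's/.*"expected_format"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' \
        "$1/lib_format_counts.json"
}

# Demo on a made-up output folder (values are illustrative only)
demo_dir=$(mktemp -d)
cat > "${demo_dir}/lib_format_counts.json" <<'EOF'
{
    "read_files": "made-up demo entry",
    "expected_format": "IU",
    "compatible_fragment_ratio": 1.0
}
EOF
echo "Detected library type: $(detected_libtype "${demo_dir}")"
```

Comparing the detected type across samples is a quick way to spot a library prepared differently from the rest.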

In [None]:
## Code cell 27 ## 

qcsummaries="${gohome}$USER/Results/multiqc/"

In [None]:
## Code cell 28 ## 

inamemyfile="5_featureCounts_salmon_3samples"

mytitle="Quality check after featureCounts and Salmon"

In [None]:
## Code cell 29 ## 

mycomment="featureCounts run at gene level and Salmon for pseudomapping on transcripts using the mouse genome (GRCm39), version M35 (Ensembl 112)"

In [None]:
## Code cell 30 ## 

logfile5="${logfolder}multiqc-featurecounts-salmon.log"
echo "Screen output is also saved in ${logfile5}"

echo "operation starts at $(date)" >> ${logfile5}
multiqc --interactive --export \
        --module featureCounts ${featcountfolder} \
        --module salmon ${salmonfolder} \
        --outdir "$qcsummaries" \
        --filename "${inamemyfile}" \
        --title "${mytitle}"  \
        --comment "${mycomment}" \
        "${gohome}$USER/Results/" \
        |& tee -a ${logfile5}
echo "operation ends at $(date)" >> ${logfile5}

In [None]:
## Code cell 31 ## 

# to see which files we have afterward and follow folder sizes
ls -lh "${qcsummaries}" >> ${logfile5}
ls -lh "${gohome}$USER/Results/" >> ${logfile5}

In [None]:
## Code cell 32 ##

du -h -d2 ${gohome}$USER

In [None]:
## Code cell 33 ##

# Saving disk space

# Removing:
# initial srr files
rm -r ${gohome}$USER/Data/sra/

# raw fastq.gz
rm ${gohome}$USER/Data/fastq/raw/*.fastq.gz

# cleaned fastq.gz
#rm ${gohome}$USER/Results/fastp/*.fastp.fastq.gz

# intermediate Aligned.sortedByNames.bam
# rm ${gohome}$USER/Results/featurecounts/*_Aligned.sortedByNames.bam    # if it was not already done in the featureCounts loop
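Since these deletions cannot be undone, it can help to list what would be removed, with sizes, before removing anything. The sketch below is one way to do that; the function name `remove_with_preview` and its two modes are hypothetical, and the demo works on a throwaway folder.

```shell
# Hypothetical helper: in "preview" mode, list each existing path with its
# size; in "delete" mode, remove it. Paths that do not exist are skipped.
remove_with_preview() {
    local mode="$1"; shift
    local f
    for f in "$@"; do
        [ -e "${f}" ] || continue   # unmatched glob or already removed
        if [ "${mode}" = "delete" ]; then
            rm -r -- "${f}" && echo "removed ${f}"
        else
            echo "would remove ${f} ($(du -sh -- "${f}" | cut -f1))"
        fi
    done
}

# Demo on a throwaway folder (not the course data)
demo_dir=$(mktemp -d)
mkdir "${demo_dir}/sra"
touch "${demo_dir}/sra/run1.sra" "${demo_dir}/sample_1.fastq.gz"
remove_with_preview preview "${demo_dir}/sra" "${demo_dir}"/*.fastq.gz
remove_with_preview delete  "${demo_dir}/sra" "${demo_dir}"/*.fastq.gz
```

Running the preview first and the deletion second, with the same arguments, keeps the two steps consistent.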

In [None]:
## Code cell 34 ##

du -h -d2 ${gohome}$USER

In [None]:
## Code cell 35 ##   

cp "${gohome}pipeline/Pipe_07a-R_intro-to-R.ipynb" "${gohome}$USER/"
cp "${gohome}alldata/Example_Data/Temperatures.txt" "${gohome}$USER/"


## Merged Sections from Pipe_06


   

- 0. 1 - About session for IFB core cluster
- 0. 2 - Parameters to be set or modified by the user
- 1 - Gene level quantification using ``featureCounts``
- 2 - Pseudo-mapping with Salmon
- 3 - Monitoring disk usage

In [1]:
## Code cell 36 ##

echo "=== Cell launched on $(date) ==="
squeue -hu $USER 

echo "=== Current IFB session size: as an indication: Medium (4CPU, 10GB) or Large (10CPU, 50GB) ==="
jobid=$(squeue -hu $USER | awk '/sys/dash {print $1}')

sacct --format=JobID,AllocCPUS,ReqMem,NodeList,Elapsed,State --jobs ${jobid}

=== Cell launched on Mon Jun 24 12:00:40 CEST 2024 ===
          40350384      fast sys/dash scaburet  R    2:38:33      1 cpu-node-51
=== Current IFB session size: as an indication: Medium (4CPU, 10GB) or Large (10CPU, 50GB) ===
JobID         AllocCPUS     ReqMem        NodeList    Elapsed      State 
------------ ---------- ---------- --------------- ---------- ---------- 
40350384              7        33G     cpu-node-51   02:38:33    RUNNING 
40350384.ba+          7                cpu-node-51   02:38:33    RUNNING 


In [2]:
## Code cell 37 ##

module load samtools/1.18 subread/2.0.6 salmon/1.10.2 multiqc/1.13

# module load samtools/1.10 subread/2.0.1 in 2023

echo "===== bam sorting by names ====="
samtools --version | head -n 2
echo "===== gene level quantification ====="
featureCounts -v
echo "===== Pseudo mapping and quantification with Salmon ====="
salmon -v
echo "===== quality reports compilation ====="
multiqc --version

===== bam sorting by names =====
samtools 1.18
Using htslib 1.19
===== gene level quantification =====

featureCounts v2.0.6

===== Pseudo mapping and quantification with Salmon =====
salmon 1.10.2
===== quality reports compilation =====
multiqc, version 1.13


In [3]:
## Code cell 38 ##

authorizedCPU=5           # 4 CPU

authorizedRAM=30000000000  # 20GB

In [4]:
## Code cell 39 ##

gohome="/shared/projects/2413_rnaseq_cea/"

echo "=== Home root folder is ==="
echo "${gohome}"
echo ""
echo "=== Working (personal) folder tree ==="
tree -d -L 2 "${gohome}$USER"
echo "=== current working directory ==="
echo "${PWD}"

=== Home root folder is ===
/shared/projects/2413_rnaseq_cea/

=== Working (personal) folder tree ===
/shared/projects/2413_rnaseq_cea/scaburet
|-- Data
|   |-- fastq
|   `-- sra
|-- Results
|   |-- Rintro
|   |-- deseq2
|   |-- enrich
|   |-- fastp
|   |-- fastq_screen
|   |-- fastqc
|   |-- featurecounts
|   |-- gsea
|   |-- logfiles
|   |-- multiqc
|   |-- pca1
|   |-- qualimap
|   |-- qualimap-11juin
|   |-- salmon
|   |-- salmon2
|   |-- samtools
|   |-- samtools-11juin
|   |-- star
|   |-- star-11juin
|   `-- wgcna
|-- done
|-- run_notebooks
`-- stuff
    `-- meg_m2_rnaseq

28 directories
=== current working directory ===
/shared/ifbstor1/projects/2413_rnaseq_cea/scaburet


In [5]:
## Code cell 40 ##

logfolder="${gohome}$USER/Results/logfiles/"
echo "the folder for log files is ${logfolder}"

the folder for log files is /shared/projects/2413_rnaseq_cea/scaburet/Results/logfiles/



## Merged Sections from Pipe_06


   

- 0. 1 - About session for IFB core cluster
- 0. 2 - Parameters to be set or modified by the user
- 1 - Gene level quantification using ``featureCounts``
- 2 - Pseudo-mapping with Salmon
- 3 - Monitoring disk usage

In [1]:
## Code cell 41 ##

echo "=== Cell launched on $(date) ==="
squeue -hu $USER 

echo "=== Current IFB session size: as an indication: Medium (4CPU, 10GB) or Large (10CPU, 50GB) ==="
jobid=$(squeue -hu $USER | awk '/sys/dash {print $1}')

sacct --format=JobID,AllocCPUS,ReqMem,NodeList,Elapsed,State --jobs ${jobid}

=== Cell launched on Mon Jun 24 12:00:40 CEST 2024 ===
          40350384      fast sys/dash scaburet  R    2:38:33      1 cpu-node-51
=== Current IFB session size: as an indication: Medium (4CPU, 10GB) or Large (10CPU, 50GB) ===
JobID         AllocCPUS     ReqMem        NodeList    Elapsed      State 
------------ ---------- ---------- --------------- ---------- ---------- 
40350384              7        33G     cpu-node-51   02:38:33    RUNNING 
40350384.ba+          7                cpu-node-51   02:38:33    RUNNING 


In [2]:
## Code cell 42 ##

module load samtools/1.18 subread/2.0.6 salmon/1.10.2 multiqc/1.13

# module load samtools/1.10 subread/2.0.1 in 2023

echo "===== bam sorting by names ====="
samtools --version | head -n 2
echo "===== gene level quantification ====="
featureCounts -v
echo "===== Pseudo mapping and quantification with Salmon ====="
salmon -v
echo "===== quality reports compilation ====="
multiqc --version

===== bam sorting by names =====
samtools 1.18
Using htslib 1.19
===== gene level quantification =====

featureCounts v2.0.6

===== Pseudo mapping and quantification with Salmon =====
salmon 1.10.2
===== quality reports compilation =====
multiqc, version 1.13


In [3]:
## Code cell 43 ##

authorizedCPU=5           # 4 CPU

authorizedRAM=30000000000  # 20GB

In [4]:
## Code cell 44 ##

gohome="/shared/projects/2413_rnaseq_cea/"

echo "=== Home root folder is ==="
echo "${gohome}"
echo ""
echo "=== Working (personal) folder tree ==="
tree -d -L 2 "${gohome}$USER"
echo "=== current working directory ==="
echo "${PWD}"

=== Home root folder is ===
/shared/projects/2413_rnaseq_cea/

=== Working (personal) folder tree ===
/shared/projects/2413_rnaseq_cea/scaburet
|-- Data
|   |-- fastq
|   `-- sra
|-- Results
|   |-- Rintro
|   |-- deseq2
|   |-- enrich
|   |-- fastp
|   |-- fastq_screen
|   |-- fastqc
|   |-- featurecounts
|   |-- gsea
|   |-- logfiles
|   |-- multiqc
|   |-- pca1
|   |-- qualimap
|   |-- qualimap-11juin
|   |-- salmon
|   |-- salmon2
|   |-- samtools
|   |-- samtools-11juin
|   |-- star
|   |-- star-11juin
|   `-- wgcna
|-- done
|-- run_notebooks
`-- stuff
    `-- meg_m2_rnaseq

28 directories
=== current working directory ===
/shared/ifbstor1/projects/2413_rnaseq_cea/scaburet


In [5]:
## Code cell 45 ##

logfolder="${gohome}$USER/Results/logfiles/"
echo "the folder for log files is ${logfolder}"

the folder for log files is /shared/projects/2413_rnaseq_cea/scaburet/Results/logfiles/



## Merged Sections from Pipe_06


   

- 0. 1 - About session for IFB core cluster
- 0. 2 - Parameters to be set or modified by the user
- 1 - Gene level quantification using ``featureCounts``
- 2 - Pseudo-mapping with Salmon
- 3 - Monitoring disk usage

In [1]:
## Code cell 46 ##

echo "=== Cell launched on $(date) ==="
squeue -hu $USER 

echo "=== Current IFB session size: as an indication: Medium (4CPU, 10GB) or Large (10CPU, 50GB) ==="
jobid=$(squeue -hu $USER | awk '/sys/dash {print $1}')

sacct --format=JobID,AllocCPUS,ReqMem,NodeList,Elapsed,State --jobs ${jobid}

=== Cell launched on Mon Jun 24 12:00:40 CEST 2024 ===
          40350384      fast sys/dash scaburet  R    2:38:33      1 cpu-node-51
=== Current IFB session size: as an indication: Medium (4CPU, 10GB) or Large (10CPU, 50GB) ===
JobID         AllocCPUS     ReqMem        NodeList    Elapsed      State 
------------ ---------- ---------- --------------- ---------- ---------- 
40350384              7        33G     cpu-node-51   02:38:33    RUNNING 
40350384.ba+          7                cpu-node-51   02:38:33    RUNNING 


In [2]:
## Code cell 47 ##

module load samtools/1.18 subread/2.0.6 salmon/1.10.2 multiqc/1.13

# module load samtools/1.10 subread/2.0.1 in 2023

echo "===== bam sorting by names ====="
samtools --version | head -n 2
echo "===== gene level quantification ====="
featureCounts -v
echo "===== Pseudo mapping and quantification with Salmon ====="
salmon -v
echo "===== quality reports compilation ====="
multiqc --version

===== bam sorting by names =====
samtools 1.18
Using htslib 1.19
===== gene level quantification =====

featureCounts v2.0.6

===== Pseudo mapping and quantification with Salmon =====
salmon 1.10.2
===== quality reports compilation =====
multiqc, version 1.13


In [3]:
## Code cell 48 ##

authorizedCPU=5           # 4 CPU

authorizedRAM=30000000000  # 20GB

In [4]:
## Code cell 49 ##

gohome="/shared/projects/2413_rnaseq_cea/"

echo "=== Home root folder is ==="
echo "${gohome}"
echo ""
echo "=== Working (personal) folder tree ==="
tree -d -L 2 "${gohome}$USER"
echo "=== current working directory ==="
echo "${PWD}"

=== Home root folder is ===
/shared/projects/2413_rnaseq_cea/

=== Working (personal) folder tree ===
/shared/projects/2413_rnaseq_cea/scaburet
|-- Data
|   |-- fastq
|   `-- sra
|-- Results
|   |-- Rintro
|   |-- deseq2
|   |-- enrich
|   |-- fastp
|   |-- fastq_screen
|   |-- fastqc
|   |-- featurecounts
|   |-- gsea
|   |-- logfiles
|   |-- multiqc
|   |-- pca1
|   |-- qualimap
|   |-- qualimap-11juin
|   |-- salmon
|   |-- salmon2
|   |-- samtools
|   |-- samtools-11juin
|   |-- star
|   |-- star-11juin
|   `-- wgcna
|-- done
|-- run_notebooks
`-- stuff
    `-- meg_m2_rnaseq

28 directories
=== current working directory ===
/shared/ifbstor1/projects/2413_rnaseq_cea/scaburet


In [5]:
## Code cell 50 ##

logfolder="${gohome}$USER/Results/logfiles/"
echo "the folder for log files is ${logfolder}"

the folder for log files is /shared/projects/2413_rnaseq_cea/scaburet/Results/logfiles/
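
The cell above only defines the path. A minimal sketch of how it might be used follows, assuming a hypothetical naming scheme `<step>_<timestamp>.log`; the fallback to a temporary folder is only there so the snippet runs outside the cluster.

```bash
# Assumption: fall back to a temporary folder when logfolder is not already set.
logfolder="${logfolder:-${TMPDIR:-/tmp}/logfiles/}"

# Create the folder if needed; -p makes this a no-op when it already exists.
mkdir -p "${logfolder}"

# Hypothetical naming scheme: <step>_<YYYYMMDD-HHMMSS>.log
step="featurecounts"
logfile="${logfolder}${step}_$(date +%Y%m%d-%H%M%S).log"

# Send both stdout and stderr of a group of commands to the log file.
{
    echo "=== ${step} started on $(date) ==="
    echo "(commands for this step would go here)"
} > "${logfile}" 2>&1

echo "log written to ${logfile}"
```

Timestamped names avoid overwriting the log of a previous run of the same step.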


#### 1.3.2 - Running <code>featureCounts</code> on multiple samples
---

In [15]:
## Code cell 51 ##

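# Copy the precomputed count tables for the 11 samples into the personal folder
cp "${gohome}"alldata/Results/featurecounts/11samples_* "${gohome}${USER}/Results/featurecounts/"

# Show the first lines of the paired-end, unstranded count table
head -n 8 "${gohome}${USER}"/Results/featurecounts/*_paired-unstranded.counts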

==> /shared/projects/2413_rnaseq_cea/scaburet/Results/featurecounts/11samples_paired-unstranded.counts <==
# Program:featureCounts v2.0.6; Command:"featureCounts" "-p" "-s" "0" "-T" "16" "-a" "/shared/projects/2413_rnaseq_cea/alldata/Reference/extracted/genome_annotation-M35.gtf" "-o" "/shared/projects/2413_rnaseq_cea/alldata/Results/featurecounts/11samples_paired-unstranded.counts" "/shared/projects/2413_rnaseq_cea/alldata/Results/featurecounts/SRR12730403_Aligned.sortedByNames.bam" "/shared/projects/2413_rnaseq_cea/alldata/Results/featurecounts/SRR12730404_Aligned.sortedByNames.bam" "/shared/projects/2413_rnaseq_cea/alldata/Results/featurecounts/SRR12730405_Aligned.sortedByNames.bam" "/shared/projects/2413_rnaseq_cea/alldata/Results/featurecounts/SRR12730406_Aligned.sortedByNames.bam" "/shared/projects/2413_rnaseq_cea/alldata/Results/featurecounts/SRR12730407_Aligned.sortedByNames.bam" "/shared/projects/2413_rnaseq_cea/alldata/Results/featurecounts/SRR12730408_Aligned.sortedByNames.b
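
The first line of a featureCounts table is a `#` comment recording the command, and the second is a header whose first six columns are `Geneid`, `Chr`, `Start`, `End`, `Strand` and `Length`, followed by one column per BAM file. The sketch below shows one way to keep only the gene identifiers and the per-sample counts. `extract_counts` is a hypothetical helper, and the toy table contains made-up values used purely to illustrate the layout.

```bash
# Hypothetical helper: drop the '#' comment line, then keep column 1 (Geneid)
# and columns 7 onwards (one count column per BAM file).
extract_counts() {
    grep -v '^#' "$1" | cut -f 1,7-
}

# Toy table with invented values, only to illustrate the file layout.
toy=$(mktemp)
printf '# Program:featureCounts (toy example)\n' > "$toy"
printf 'Geneid\tChr\tStart\tEnd\tStrand\tLength\tsampleA.bam\tsampleB.bam\n' >> "$toy"
printf 'geneX\tchr1\t100\t900\t+\t801\t12\t30\n' >> "$toy"
printf 'geneY\tchr2\t50\t450\t-\t401\t0\t7\n' >> "$toy"

extract_counts "$toy"
```

On the real data, the same helper could be applied to the copied `11samples_paired-unstranded.counts` file and piped into `head` to inspect the first genes.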


## Merged Sections from Pipe_06


   

- 0.1 - About session for IFB core cluster
- 0.2 - Parameters to be set or modified by the user
- 1 - Gene level quantification using ``featureCounts``
- 2 - Pseudo-mapping with Salmon
- 3 - Monitoring disk usage

In [1]:
## Code cell 52 ##

echo "=== Cell launched on $(date) ==="
squeue -hu $USER 

echo "=== Current IFB session size: as an indication: Medium (4CPU, 10GB) or Large (10CPU, 50GB) ==="
jobid=$(squeue -hu "$USER" | awk '/sys\/dash/ {print $1}')

sacct --format=JobID,AllocCPUS,ReqMem,NodeList,Elapsed,State --jobs ${jobid}

=== Cell launched on Mon Jun 24 12:00:40 CEST 2024 ===
          40350384      fast sys/dash scaburet  R    2:38:33      1 cpu-node-51
=== Current IFB session size: as an indication: Medium (4CPU, 10GB) or Large (10CPU, 50GB) ===
JobID         AllocCPUS     ReqMem        NodeList    Elapsed      State 
------------ ---------- ---------- --------------- ---------- ---------- 
40350384              7        33G     cpu-node-51   02:38:33    RUNNING 
40350384.ba+          7                cpu-node-51   02:38:33    RUNNING 
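
The allocation shown in this table can also be read programmatically instead of by eye. The following is a sketch under assumptions: `parse_alloc` is a hypothetical helper, and the `sacct` call uses the `-X` (allocation only), `-n` (no header) and `-P` (pipe-delimited) options. When `sacct` or `jobid` is unavailable, it falls back to the values printed in the table above.

```bash
# Hypothetical helper: split one "AllocCPUS|ReqMem" line into its two fields.
parse_alloc() {
    local line="$1"
    local cpus="${line%%|*}"   # text before the first '|'
    local mem="${line#*|}"     # text after the first '|'
    echo "CPUs=${cpus} RAM=${mem}"
}

if command -v sacct >/dev/null 2>&1 && [ -n "${jobid:-}" ]; then
    # -X: job allocation only (no steps), -n: no header, -P: pipe-delimited output
    parse_alloc "$(sacct -X -n -P --format=AllocCPUS,ReqMem --jobs "${jobid}")"
else
    # Values copied from the sacct table printed above.
    parse_alloc "7|33G"
fi
```

These numbers give an upper bound when choosing `authorizedCPU` and `authorizedRAM`.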


