# RNAseq training CEA - June 2023

*Instructors: Sandrine Caburet and Claire Vandiedonck*

IFB session: 5 CPUs + 21 GB of RAM

# Part 6: Read Counts

   

- 0.1 - About session for IFB core cluster
- 0.2 - Parameters to be set or modified by the user
- 1 - Initial checks
- 2 - Gene level quantification using ``featureCounts``
- 3 - Compute reads or fragments counts
- 4 - Pseudo-mapping with Salmon
- 5 - Monitoring disk usage

- This notebook contains computationally heavy cells in Section 3, so before launching them make sure that they are adapted to your environment. <br>
For Samtools, the ``-m`` option sets the **amount of RAM used by each thread**! <br>
<blockquote>
        See options <code>--threads 2 -m 5G</code> in the Samtools line, Section 3  <br>
        Adapt <code>-T 4</code> in the featureCounts lines to set it to 70-80% of the number of available CPUs. <br>
        Adapt <code>-s 0</code> in the featureCounts line to fit your library preparation. In the current notebook version, this option is set to <code>0</code> as the libraries are unstranded. <br>
</blockquote>

<div class="alert alert-block alert-danger">
    <b>Values set in this notebook are valid for a 5-CPU session with access to 21 GB of RAM</b>. Ideally, use 70-80% of the CPU amount your system or session has. <br>
    DO NOT ask for more RAM than you can use.
</div>
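
The 70-80% rule and the per-thread memory can be checked with a little shell arithmetic. This is a minimal sketch, assuming the session size stated above (5 CPUs, 21 GB) and the values used in Code cell 12 (`--threads 2 -m 5G`). The variable names are illustrative and not part of the notebook.

```bash
# Illustrative values: adapt them to your own session
ncpu=5            # CPUs in the session
ram_gb=21         # RAM in the session (GB)
sort_threads=2    # value given to samtools sort --threads
mem_per_thread=5  # value given to samtools sort -m (GB per thread)

# 80% of the CPUs, rounded down (integer arithmetic)
fc_threads=$(( ncpu * 80 / 100 ))

# Memory requested by samtools sort: -m applies to each thread
sort_mem_total=$(( sort_threads * mem_per_thread ))

echo "featureCounts -T ${fc_threads}"
echo "samtools sort requests about ${sort_mem_total} GB out of ${ram_gb} GB"

if [ "${sort_mem_total}" -gt "${ram_gb}" ]; then
    echo "WARNING: lower --threads or -m" >&2
fi
```

With these values the sketch suggests `-T 4` and about 10 GB for sorting, which stays below the 21 GB of the session.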

---
---
## 0. Before going further
---

<div class="alert alert-block alert-danger"><b>Caution:</b> 
Before starting the analysis, save a backup copy of this notebook: in the left-hand panel, right-click on this file and select "Duplicate".<br>
You can also make backups during the analysis. Don't forget to save your notebook regularly: <kbd>Ctrl</kbd> + <kbd>S</kbd> or click on the 💾 icon.
</div>

---

## 0.1 - About session for IFB core cluster

<em>loaded JupyterLab</em> : Version 3.2.1

In [None]:
## Code cell 1 ##

echo "=== Cell launched on $(date) ==="

echo "=== Current IFB session size: Medium (5CPU, 21 GB) ==="
jobid=$(squeue -hu $USER | awk '/jupyter/ {print $1}')
sacct --format=JobID,AllocCPUS,NODELIST -j ${jobid}

In [None]:
## Code cell 2 ##

module load samtools/1.10 subread/2.0.1

echo "===== bam sorting by names ====="
samtools --version | head -n 2
echo "===== gene level quantification ====="
featureCounts -v


---

## 0.2 - Parameters to be set or modified by the user


- Using a full path ending with a `/`, **define the folder** in which you will work, with the `gohome` variable:

In [None]:
## Code cell 3 ##

gohome="/shared/projects/2312_rnaseq_cea/"

echo "=== Home root folder is ==="
echo "${gohome}"
echo ""
echo "=== Working (personal) folder tree ==="
tree -d -L 2 "${gohome}$USER"
echo "=== current working directory ==="
echo "${PWD}"
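
Paths in this notebook are built by plain concatenation (for instance `"${gohome}$USER"`), so a missing final `/` would silently produce a wrong path. As a small safeguard, the trailing slash can be normalized with shell parameter expansion. This is only a sketch; the input value below is deliberately typed without the final slash.

```bash
# Hypothetical input, typed without the final slash
gohome="/shared/projects/2312_rnaseq_cea"

# Remove one trailing slash if present, then add exactly one back
gohome="${gohome%/}/"

echo "${gohome}"
```

The same line leaves a path that already ends with `/` unchanged.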

- With a `/` at the end, specify the folder containing the ``_Aligned.sortedByCoord.out.bam`` alignment files (produced by ***STAR***):

In [None]:
## Code cell 4 ##

mappedfolder="${gohome}$USER/Results/star/"

- Please remember the path to the uncompressed annotation file, especially if you changed it in previous notebooks.

In [None]:
## Code cell 5 ##

gtffile="${gohome}allData/Reference/extracted/genome_annotation.gtf"

---
---
## 1 - Initial checks
---

As we may have changed session, or even day, since the previous step, let's first check that all files are there:

In [None]:
## Code cell 6 ##

mappedfolder=${mappedfolder}
echo "There are $(ls "${mappedfolder}"*_Aligned.sortedByCoord.out.bam | wc -l) bam files:"
ls "${mappedfolder}"*_Aligned.sortedByCoord.out.bam
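
The same check can also report the sample IDs that later cells derive from the file names. The sketch below runs on a temporary stand-in folder with two empty mock files; the sample names are hypothetical. In the notebook, the `mappedfolder` variable from Code cell 4 would take the place of `demo_folder`.

```bash
# Stand-in folder with two empty mock files (hypothetical sample names)
demo_folder="$(mktemp -d)/"
touch "${demo_folder}sampleA_Aligned.sortedByCoord.out.bam" \
      "${demo_folder}sampleB_Aligned.sortedByCoord.out.bam"

suffix="_Aligned.sortedByCoord.out.bam"
n=0
ids=""
for fn in "${demo_folder}"*"${suffix}"; do
    # If nothing matches, the pattern is left as-is: skip it
    [ -e "${fn}" ] || continue
    # basename drops the folder and the given suffix
    id="$(basename "${fn}" "${suffix}")"
    ids="${ids}${id} "
    n=$(( n + 1 ))
done

echo "Found ${n} bam files for samples: ${ids}"
rm -r "${demo_folder}"
```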

---
---
## 2 - Gene level quantification using <code>featureCounts</code>
---

### 2.1- Tool presentation
---

`featureCounts` is part of the <a href="http://subread.sourceforge.net/">Subread</a> package, to be used with bash, and of Rsubread, to be used with R. This tool assigns mapped reads to their matching feature (exon, gene, promoter, ...) on the genome and summarizes counts per feature.  
  
In the Subread user guide, the developers recommend using <i>specialized transcript-level quantification tools [...] for counting reads to transcripts</i> (see section 6.2.5, page 34 of the pdf manual, which you can download with this <a href="https://bioconductor.org/packages/release/bioc/vignettes/Rsubread/inst/doc/SubreadUsersGuide.pdf">link</a>). So we will only generate gene-level quantification with `featureCounts`.

In [None]:
## Code cell 7 ##

featureCounts -v

A simple usage for paired-end sequencing data, counting fragments instead of reads, is:  
<code>featureCounts -p -a annotation.gtf \ 
                  -o counts.txt \ 
                  alignment.bam</code>

The main <code>featureCounts</code> options correspond to: <br>
- `-p` to count fragments instead of reads (paired-end data)
- `-a TEXT` to give the annotation file, in GTF/GFF format by default (use <code>-F</code> to specify another file format); it can be a gzipped file. <br>
- `-o TEXT` to set the filename for counts<br>
- then come the alignment files (there may be several). Either in BAM or SAM format, they can be sorted by read names or by chromosomal coordinates.
- `-M`: By default, multi-mapped reads (or fragments) are not counted unless we use the `-M` option (see other parameters to set in the manual, pdf pages 38 to 43).
- `-t exon` *(by default)*: It selects the exon lines in the annotation file to assign reads (or fragments).
- `-g gene_id` *(by default)*: Then, it counts them at the gene_id meta-feature level.

We will use some other options:  

- <code>-s INTEGER</code>, to specify strandedness. Possible values are: 0 (unstranded), 1 (stranded) and 2 (reversely stranded). Default value is 0. <br>
- <code>-T INTEGER</code>, to set the number of threads to use (default, 1) <br>
- <code>--verbose</code>, to get information for debugging, such as unmatched chromosome/contig names. <br>
    
As temporary files are saved by default to the directory specified in the <code>-o</code> option, we won't use the <code>--tmpDir STRING</code> option. <br>

For paired-end data, it is possible to keep only fragments that have both ends aligned (`-B` option), on the same chromosome and strand (`-C`), and even separated by an insert distance lying between the `-d` and `-D` values (`-P`).  

Besides, input files will be provided sorted by read names. Even if ***featureCounts*** handles `.bam` files as fast as ***samtools*** does, we will nonetheless sort them with samtools and use the `--donotsort` option. Indeed, some files may be bigger than what featureCounts supports.
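
To illustrate how the paired-end filters combine, here is a dry-run sketch that only prints a command. These filters are not used in the rest of this notebook. The file names are placeholders, and the `-d 50 -D 600` values are, to our knowledge, the documented defaults; check them with `featureCounts` (no argument) for your version.

```bash
# Build the command with positional parameters so it can be printed
# before being run. File names are placeholders.
set -- featureCounts -p -s 0 \
       -B -C \
       -P -d 50 -D 600 \
       -a annotation.gtf \
       -o counts_filtered.txt \
       alignment.bam

# Dry run: display the command instead of executing it
cmd="$*"
echo "${cmd}"

# To actually run it, replace the echo with:  "$@"
```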

---
### 2.2- Step preparation
---

We need to create a destination folder...

In [None]:
## Code cell 8 ##

featcountfolder="${gohome}$USER/Results/featurecounts/"
mkdir -p ${featcountfolder}
tree -d -L 1 "${gohome}$USER/Results/"

... and recall the folder used for saved screen output files.

In [None]:
## Code cell 9 ##

logfolder="${gohome}$USER/Results/logfiles/"

In [None]:
## Code cell 10 ##

gtffile="${gohome}allData/Reference/extracted/genome_annotation.gtf"

echo "Annotation file: ${gtffile}"

---
---
## 3 - Compute reads or fragments counts
---

### 3.1 - Running <code>featureCounts</code> on individual samples
---

- Before you run the following cells, make sure that they are adapted to your environment. <br>
For Samtools, the ``-m`` option sets the **amount of RAM used by each thread**! <br>
<blockquote>
        See options <code>--threads 2 -m 5G</code> in the Samtools line <br>
        Adapt <code>-T 4</code> in the featureCounts lines to set it to 70-80% of the number of available CPUs. <br>
        Adapt <code>-s 0</code> in the featureCounts line to fit your library preparation. In the current notebook version, this option is set to <code>0</code> as the libraries are unstranded. <br>
</blockquote>

<div class="alert alert-block alert-danger">
    <b>Values set in this notebook are valid for a 5-CPU session with access to 21 GB of RAM</b>. Ideally, use 70-80% of the CPU amount your system or session has. <br>
    DO NOT ask for more RAM than you can use.
</div>

<div class="alert alert-block alert-warning">
    <b>The <code>-T</code> option in the featureCounts command line has no impact during the counting process</b>. It has an effect when BAM files are sorted on the fly by featureCounts, but this is not the case here, as the input file is sorted beforehand by samtools.
</div>

In [None]:
## Code cell 11 ##

logfile="${logfolder}featureCounts-gene-level-counts-samtoolsSort.log"
echo "Screen output is redirected to ${logfile}"

In [None]:
## Code cell 12 ##

# as time command does not redirect output
echo "operation starts at $(date)" >> ${logfile}

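# glob directly instead of parsing ls output (safer with unusual file names)
time for fn in "${mappedfolder}"*_Aligned.sortedByCoord.out.bam; do  
    
    mysortedbam=$(basename "${fn}")
    # strip the STAR suffix to get the sample ID
    id="${mysortedbam%_Aligned.sortedByCoord.out.bam}"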
    echo "===== Processing sampleID: ${id}..." | tee -a ${logfile}
    
    # outputfiles
    mytempfile="${featcountfolder}${id}_Aligned.sortedByNames.bam"
    myoutfile="${featcountfolder}${id}_paired-unstranded"
    
    # bam sorting...
    echo "samtools starts at $(date)" >> ${logfile}
    samtools sort -n \
                --threads 2 -m 5G \
                --output-fmt BAM \
                -o ${mytempfile} \
                -T ${featcountfolder} \
                ${fn} \
                &>> ${logfile}
    echo "samtools ends at $(date)" >> ${logfile}

    # some user conversation to help being patient
    echo "...changing tool..." | tee -a ${logfile}

    # then featureCounts
    echo "featureCounts starts at $(date)" >> ${logfile}

    featureCounts -p -s 0 -T 4 \
                  -a "${gtffile}" \
                  -o "${myoutfile}.counts" \
                  ${mytempfile} \
                  --donotsort \
                  --verbose \
                  &>> ${logfile}
    echo "featureCounts ends at $(date)" >> ${logfile}
    
    # removing extra bam file... saving disk space
    rm ${mytempfile}
    
    echo "... done" | tee -a ${logfile}
    
done

In [None]:
## Code cell 13 ##

echo "operation ends at $(date)" >> ${logfile}

echo "=== Files created after featureCounts ===" >> ${logfile}
ls -lh "${featcountfolder}" >> ${logfile}
echo "featureCounts generated $(ls "${featcountfolder}"*.counts | wc -l) count files." \
    | tee -a ${logfile}
echo "featureCounts generated $(ls "${featcountfolder}"*.counts.summary | wc -l) summary files." \
    | tee -a ${logfile}

### 3.2 - Running <code>featureCounts</code> on multiple samples
---

The above cell processed each sample file individually: <code>samtools</code> generated the name-sorted bam that was used immediately by <code>featureCounts</code>, and that was removed afterward.   
Therefore we obtain one count file per sample. Those count files can later be joined into a single count matrix for the next step of the analysis.   

Alternatively, <code>featureCounts</code> can handle multiple bam files at once, directly creating a single count matrix for all the samples.    
This requires that all the sorted bam files are available as input for <code>featureCounts</code>.    
This was done beforehand for all 11 samples, by running the next 2 cells.   
Code cell 15 runs <code>samtools</code> to generate (and keep!) the sorted bam files.   
Code cell 17 runs <code>featureCounts</code> only once, using the list of sorted bam filenames as input.  

To run those cells in your own project, simply change their type from *Markdown* to *Code*.
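
The two-step logic described above can be sketched as a dry run that only prints the commands. This is not the content of Code cells 15 and 17: the sample names, the relative folders and the output name `all_samples_paired-unstranded.counts` are hypothetical, and the options simply mirror those of Code cell 12.

```bash
# Dry-run sketch: commands are printed, not executed.
# Sample names are hypothetical; file names reuse the notebook's naming scheme.
featcountfolder="Results/featurecounts/"
mappedfolder="Results/star/"
samples="sampleA sampleB"

# Step 1: one name-sorted bam per sample, kept on disk this time
bamlist=""
for id in ${samples}; do
    sorted="${featcountfolder}${id}_Aligned.sortedByNames.bam"
    echo samtools sort -n --threads 2 -m 5G -o "${sorted}" \
         "${mappedfolder}${id}_Aligned.sortedByCoord.out.bam"
    bamlist="${bamlist}${sorted} "
done

# Step 2: a single featureCounts call; each bam becomes one count column
# (bamlist is left unquoted on purpose so that it splits into file names)
echo featureCounts -p -s 0 -T 4 --donotsort \
     -a genome_annotation.gtf \
     -o "${featcountfolder}all_samples_paired-unstranded.counts" \
     ${bamlist}
```

Removing the two `echo` words would run the commands for real, at the cost of keeping all name-sorted bam files on disk until counting is done.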


We can see the proportion of alignments assigned to a single gene (if default `featureCounts` parameters were kept) by searching the logfile for specific patterns (`grep -e PATTERN`).

In [None]:
## Code cell 18 ##

logfile="${logfolder}featureCounts-gene-level-counts-samtoolsSort.log"

grep -e "Successfully assigned alignments" -e "Process BAM" "${logfile}"

In [None]:
## Code cell 19 ##

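# details per assignment category, for the last sample processed in Code cell 12
# (${myoutfile} is still set from that loop)
echo "summary file"
cat "${myoutfile}.counts.summary"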
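
The summary file is a two-column, tab-separated table (status, number of fragments), so the assignment rate can be computed with `awk`. The sketch below runs on a small mock file with made-up numbers and only three status lines; a real summary lists more categories.

```bash
# Mock summary file (made-up numbers)
summary="$(mktemp)"
printf 'Status\tsampleA.bam\nAssigned\t800\nUnassigned_NoFeatures\t150\nUnassigned_Ambiguity\t50\n' > "${summary}"

# Skip the header, sum all categories and keep the "Assigned" value
pct=$(awk -F '\t' 'NR > 1 { total += $2; if ($1 == "Assigned") assigned = $2 }
                   END { printf "%.1f", 100 * assigned / total }' "${summary}")

echo "Assigned: ${pct}%"
rm "${summary}"
```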

Let's have a look at the beginning of result files from featureCounts: 

In [None]:
## Code cell 20 ##

head -n 10 ${gohome}$USER/Results/featurecounts/*_paired-unstranded.counts

As you can see, genes are in rows, and the counts for the sample are given in the last column, after the annotation columns (Chr, Start, End, Strand and Length).   
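
To keep only the gene identifier and its count, the first comment line and the column header can be skipped with `awk`. This is a sketch on a mock file: the gene names and values are invented, and the layout assumes a single-sample count file with the gene identifier in the first column and the count in the last one.

```bash
# Mock count file: comment line, column header, then one row per gene
counts="$(mktemp)"
{
    printf '# Program:featureCounts (mock header line)\n'
    printf 'Geneid\tChr\tStart\tEnd\tStrand\tLength\tsampleA.bam\n'
    printf 'geneA\tchr1\t100\t500\t+\t401\t12\n'
    printf 'geneB\tchr1\t900\t1500\t-\t601\t0\n'
} > "${counts}"

# Skip the comment line and the column header, keep gene ID and last column
gene_counts=$(awk -F '\t' 'NR > 2 { print $1 "\t" $NF }' "${counts}")

echo "${gene_counts}"
rm "${counts}"
```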

---
---
## 4 - Pseudo-mapping with **Salmon**
---

There is an alternative method for counting reads per transcript and per gene without prior mapping to a reference genome. It relies on prior indexing of all transcripts (rather than of the genome). Using these indices, reads are directly assigned to transcripts. This method does not generate BAM/SAM alignment files. It directly produces the matrix of counts for each transcript.

The main advantages of this method are its speed and the direct counting of all transcript isoforms (only annotated ones). On the other hand, only annotated transcripts can be counted, so it does not allow the discovery of new transcripts. It is also impossible to visualize the read alignments, with IGV for example.

Two tools with very similar characteristics and outputs are currently available: Salmon and Kallisto.

To quantify expression at the gene level, transcript counts are simply aggregated (i.e. summed) by gene.
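
The aggregation step can be sketched with `awk`, assuming two hypothetical tab-separated files: a transcript-to-gene table and a table of transcript counts. The identifiers and values are invented for illustration.

```bash
workdir="$(mktemp -d)"
# transcript ID -> gene ID
printf 'tx1\tgeneA\ntx2\tgeneA\ntx3\tgeneB\n' > "${workdir}/tx2gene.tsv"
# transcript ID -> count
printf 'tx1\t10.5\ntx2\t4.5\ntx3\t7\n' > "${workdir}/tx_counts.tsv"

# First file: store the gene of each transcript; second file: sum by gene
gene_sums=$(awk -F '\t' '
    NR == FNR { gene[$1] = $2; next }
    ($1 in gene) { sum[gene[$1]] += $2 }
    END { for (g in sum) print g "\t" sum[g] }
' "${workdir}/tx2gene.tsv" "${workdir}/tx_counts.tsv" | sort)

echo "${gene_sums}"
rm -r "${workdir}"
```

Here the two transcripts of `geneA` add up to 15, while `geneB` keeps the count of its single transcript.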


To use **Salmon**, two main steps are necessary:

**1. build the ***gentrome*** index:**

- obtain the transcript sequences (link below for GENCODE release M32 on the *Mus musculus* genome build GRCm39)

`wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M32/gencode.vM32.transcripts.fa.gz`

- obtain the genome sequence

`wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M32/GRCm39.primary_assembly.genome.fa.gz`

- extract the names of the genome sequences (headers of fasta sequences), which will serve as decoys

`grep "^>" <(gunzip -c GRCm39.primary_assembly.genome.fa.gz) | cut -d " " -f 1 > decoys.txt`

`sed -i.bak -e 's/>//g' decoys.txt`

- concatenate genome sequence and transcript sequences:

`cat gencode.vM32.transcripts.fa.gz GRCm39.primary_assembly.genome.fa.gz > gentrome.fa.gz`


- transcriptome indexing with the basic command (for mouse genome version 39), passing the decoy names with `-d`:

`salmon index -t gentrome.fa.gz -d decoys.txt -i salmon_vM32_index --gencode` 

**2. run pseudo-alignment of reads** :

`salmon quant -i salmon_vM32_index -l A \
         -1 sampleID_R1_fastp.fastq.gz \
         -2 sampleID_R2_fastp.fastq.gz \
         --validateMappings \
         -p 12 -o salmon_res_folder/sampleID`

For each sample, we obtain a folder in `salmon_res_folder` with `quant.sf` as main output. This file contains ***TPM (transcripts per million)*** and read counts for each transcript ID, in rows. Caution: TPM are not integer values, which may have implications for downstream analyses.
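
To see why TPM values are not integers, here is a sketch of the usual TPM definition: each transcript's read count is divided by its length, and these rates are rescaled so that they sum to one million. The three-column input (name, length, reads) is a simplified mock, not the actual `quant.sf` layout, and this is not Salmon's own estimation procedure.

```bash
# Mock input: transcript name, length, number of reads (made-up values)
tpm=$(printf 'tx1\t1000\t100\ntx2\t1500\t300\ntx3\t1000\t30\n' |
awk -F '\t' '
    # reads per base for each transcript, and their total
    { name[NR] = $1; rate[NR] = $3 / $2; total += rate[NR] }
    # rescale so that all values sum to one million
    END { for (i = 1; i <= NR; i++) printf "%s\t%.1f\n", name[i], 1e6 * rate[i] / total }
')

echo "${tpm}"
```

Even with integer read counts as input, the rescaled values come out fractional.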

For more details, you can follow a nice tutorial here: https://combine-lab.github.io/alevin-tutorial/2019/selective-alignment/ or this one: https://combine-lab.github.io/salmon/getting_started/

---
---
## 5 - Monitoring disk usage
---

In [None]:
## Code cell 21 ##

du -h -d2 ${gohome}$USER

Our personal folder is getting heavy. Let's remove some files we won't use anymore: 

- initial srr files in Data/sra/   
- raw fastq.gz files in Data/fastq/raw/   
- cleaned fastq.gz files in Results/fastp/   
- intermediate Aligned.sortedByNames.bam files produced by <code>samtools</code> for <code>featureCounts</code> just above

In [None]:
## Code cell 22 ##

# Saving disk space
# initial srr files
rm -r ${gohome}$USER/Data/sra/

# raw fastq.gz
rm ${gohome}$USER/Data/fastq/raw/*.fastq.gz

# cleaned fastq.gz
rm ${gohome}$USER/Results/fastp/*.fastp.fastq.gz

# intermediate Aligned.sortedByNames.bam
# rm ${gohome}$USER/Results/featurecounts/*_Aligned.sortedByNames.bam # already done in Code cell 12
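
Since deletions cannot be undone, it may help to preview what a pattern matches before removing anything. The sketch below uses a temporary mock folder with hypothetical file names, standing in for a folder such as `Data/fastq/raw/`.

```bash
# Mock folder with two fastq.gz files and one file that must be kept
demo_dir="$(mktemp -d)"
touch "${demo_dir}/sampleA_R1.fastq.gz" "${demo_dir}/sampleA_R2.fastq.gz" \
      "${demo_dir}/notes.txt"

# Preview: count and list what the pattern matches, without deleting
n_match=$(( $(ls "${demo_dir}"/*.fastq.gz | wc -l) ))
echo "${n_match} files would be removed:"
ls -lh "${demo_dir}"/*.fastq.gz

# Then delete, and check that other files are untouched
rm "${demo_dir}"/*.fastq.gz
n_left=$(( $(ls "${demo_dir}" | wc -l) ))
echo "${n_left} file(s) left"

rm -r "${demo_dir}"
```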

Let's see our disk usage now:

In [None]:
## Code cell 23 ##

du -h -d2 ${gohome}$USER

---
___

## Conclusion


**Next Practical session**

Now we go on with an introduction to the R language, that will be used during the next steps of the analysis.  
  
**=> Step 7: Introduction to R** 

The jupyter notebook used for the next session will be *Pipe_07-R403_Intro-to-R.ipynb*    
Let's retrieve it in our personal directory, in order to have a private copy to work on:   

In [None]:
## Code cell 24 ##   

cp "${gohome}pipeline/Pipe_07-R403_intro-to-R.ipynb" "${gohome}$USER/"
cp "${gohome}allData/Example_Data/Temperatures.txt" "${gohome}$USER/"



**Save executed notebook**

To end the session, save your executed notebook in your `run_notebooks` folder. Adjust the name with yours and reformat as code cell to run it.

<div class="alert alert-block alert-success"><b>Success:</b> Well done! You now know how to count reads per features.<br>
Don't forget to save your notebook and export a copy as an <b>html</b> file as well <br>
- Open "File" in the Menu<br>
- Select "Export Notebook As"<br>
- Export notebook as HTML<br>
- You can then open it in your browser even without being connected to the server! 
</div>

---
---

## Useful commands
<div class="alert alert-block alert-info"> 
    
- <kbd>CTRL</kbd>+<kbd>S</kbd> : save notebook<br>    
- <kbd>CTRL</kbd>+<kbd>ENTER</kbd> : Run Cell<br>  
- <kbd>SHIFT</kbd>+<kbd>ENTER</kbd> : Run Cell and Select Next<br>   
- <kbd>ALT</kbd>+<kbd>ENTER</kbd> : Run Cell and Insert Below<br>   
- <kbd>ESC</kbd>+<kbd>y</kbd> : Change to *Code* Cell Type<br>  
- <kbd>ESC</kbd>+<kbd>m</kbd> : Change to *Markdown* Cell Type<br> 
- <kbd>ESC</kbd>+<kbd>r</kbd> : Change to *Raw* Cell Type<br>    
- <kbd>ESC</kbd>+<kbd>a</kbd> : Create Cell Above<br> 
- <kbd>ESC</kbd>+<kbd>b</kbd> : Create Cell Below<br> 

<em>  
To make nice html reports with markdown: <a href="https://dillinger.io/" title="dillinger.io">html visualization tool 1</a> or <a href="https://stackedit.io/app#" title="stackedit.io">html visualization tool 2</a>, <a href="https://www.tablesgenerator.com/markdown_tables" title="tablesgenerator.com">to draw nice tables</a>, and the <a href="https://medium.com/analytics-vidhya/the-ultimate-markdown-guide-for-jupyter-notebook-d5e5abf728fd" title="Ultimate guide">Ultimate guide</a>. <br>
Further reading on JupyterLab notebooks: <a href="https://jupyterlab.readthedocs.io/en/latest/user/notebook.html" title="Jupyter Lab">Jupyter Lab documentation</a>.<br>   
</em>    
 
</div>

---
Bénédicte Noblet - 05-07 2021   
Sandrine Caburet - 02-05 2023   
Claire Vandiedonck - 03-06 2023  
Updated 03/06/2023 by @CVandiedonck