# **RNAseq Analysis Module**

## **Practical session 5: Read Counts**

Wednesday, the 1st of December, 2021   
Claire Vandiedonck and Sandrine Caburet - 2021  

  1. Quantification of reads on genomic features

---
## **Before going further**

<div class="alert alert-block alert-danger"><b>Caution:</b> 
Before starting the analysis, save a backup copy of this notebook: in the left-hand panel, right-click on this file and select "Duplicate".<br>
You can also make backups during the analysis. Don't forget to save your notebook regularly.
</div>

___

## **I. Quantification of reads on genomic features**
    
This part is very short: it consists in taking every genomic feature provided in the annotation `.gff` file (here, genes or ORFs) and counting the number of reads that are mapped within the boundaries of each feature.

We use **BEDTOOLS** (https://bedtools.readthedocs.io/en/latest/) v2.29.2, with the `multicov` command.
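
To make the counting principle concrete, here is a toy sketch in plain shell and `awk`. It only illustrates the idea of counting the reads that overlap each feature; it is not how bedtools is implemented. The feature names and all coordinates are made up for the example.

```shell
# Toy data (hypothetical): features as name/start/end, reads as start/end
tmpdir=$(mktemp -d)
printf 'geneA\t100\t200\ngeneB\t300\t400\n' > "${tmpdir}/features.tsv"
printf '90\t110\n150\t180\n250\t260\n390\t420\n' > "${tmpdir}/reads.tsv"

# For each feature, count the reads that overlap it by at least one base
result=$(awk -F'\t' '
    # First file: store every read interval
    NR == FNR { read_start[NR] = $1; read_end[NR] = $2; n = NR; next }
    # Second file: test every stored read against the current feature
    {
        count = 0
        for (i = 1; i <= n; i++)
            if (read_start[i] <= $3 && read_end[i] >= $2) count++
        print $1 "\t" count
    }
' "${tmpdir}/reads.tsv" "${tmpdir}/features.tsv")

echo "${result}"
rm -r "${tmpdir}"
```

In this toy example, two reads overlap `geneA`, one overlaps `geneB`, and the read at 250-260 falls between the two features and is not counted at all.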


In [None]:
bedtools --version

As in Practical Sessions 3 and 4, a `for loop` will run the program once for each element in the provided list.

In [None]:
ls ./Results/*.sorted.bam

In [None]:
# Quantify your data by annotated features using BEDTOOLS with the following command
# multicov counts, for each feature, the number of reads that overlap it in each BAM file provided
# -bams specifies the BAM file(s) to count from (here, one file per run of the loop); multicov requires indexed BAM files
# -bed specifies the annotation file, here the .gff file
# The sample ID is printed at each step to provide a way to follow the progress of the analysis.

# Creation of a subfolder /Counts for writing the results (-p: no error if it already exists)
mkdir -p ./Results/Counts


# Runs for multiple gene_counts outputs, with relevant names

for fn in ./Results/*.sorted.bam; do

    # Derive the sample ID from the file name
    mysortedbam=$(basename "${fn}")
    id=${mysortedbam/_bowtie_mapping.sorted.bam/}
    echo "========Processing sampleID: ${id}"

    myout="./Results/Counts/${id}_gene_counts.txt"
    bedtools multicov -bams "${fn}" -bed /srv/data/meg-m2-rnaseq/genome/C_parapsilosis_ORFs.gff > "${myout}"

    echo "...done"

done
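
The way the loop derives the sample ID and the output name can be checked on a single path, without running bedtools. This is a minimal sketch: the path below is hypothetical and only assumes the `<sample>_bowtie_mapping.sorted.bam` naming pattern that the loop expects. The `${variable/pattern/}` substitution is bash syntax.

```shell
# Hypothetical path, following the naming pattern the loop expects
fn="./Results/Normoxia_1_bowtie_mapping.sorted.bam"

# basename strips the leading directories
mysortedbam=$(basename "${fn}")

# ${variable/pattern/} deletes the first match of the pattern (bash syntax)
id=${mysortedbam/_bowtie_mapping.sorted.bam/}

echo "${mysortedbam}"   # file name without the directories
echo "${id}"            # sample ID only
echo "./Results/Counts/${id}_gene_counts.txt"   # output path built from the ID
```

With this path, the three lines printed are `Normoxia_1_bowtie_mapping.sorted.bam`, `Normoxia_1` and `./Results/Counts/Normoxia_1_gene_counts.txt`.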

To visualize the beginning of the results, we use the command `head` (by default, it displays the first 10 lines of a text file).

In [None]:
head ./Results/Counts/Normoxia_1_gene_counts.txt

Since we only want to keep the last columns, without the "ID=" prefix, we modify the files with `sed` (= stream editor), a powerful Unix tool for handling and editing text files. Here, we use it to delete all the characters from the beginning of each line up to and including "ID=", so that only the last two columns remain: one with the gene name, the other with the read counts.

In [None]:
for fn in ./Results/Counts/*_gene_counts.txt; do

    # Derive the sample ID from the file name
    mygenecounts=$(basename "${fn}")
    id=${mygenecounts/_gene_counts.txt/}
    echo "========Processing sampleID: ${id}"

    echo "${fn}"
    # Delete everything from the start of each line up to and including "ID="
    sed 's/^.*ID=//' "${fn}" > "./Results/Counts/${id}.gene_counts.tab"

    echo "...done"

done
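
The effect of the `sed` substitution can be seen on a single line. This is a minimal sketch: the line below is invented for the example (made-up sequence name, coordinates, gene name and count) and only mimics the layout of a `.gff` line followed by a read count.

```shell
# Hypothetical multicov-like line: 9 GFF columns followed by a read count
line=$(printf 'contig_1\tannot\tORF\t100\t500\t.\t+\t.\tID=geneA\t42')

# Same substitution as in the loop: drop everything up to and including "ID="
result=$(printf '%s\n' "${line}" | sed 's/^.*ID=//')

echo "${result}"
```

Only the gene name and the count remain, separated by a tab. Because `.*` is greedy, a line containing several occurrences of `ID=` would be cut after the last one.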
    

<div class="alert alert-block alert-success"><b>=> Question: What can you say about the data?</b><br>

<em>(you can click here to add your answers directly in this markdown cell)</em><br>
    
- What are the problems associated with this way of counting reads on features?
- Which other methods could have been used?
- Does it make sense to compare the samples on this basis?
</div>

<div class="alert alert-block alert-success"><b>Success:</b> Don't forget to save your notebook and export a copy as an <b>html</b> file as well <br>
- Open "File" in the Menu<br>
- Select "Export Notebook As"<br>
- Export notebook as HTML<br>
- You can then open it in your browser even without being connected to adenine! 
</div>

___
___

The normalisation of the data will be performed during the statistical analysis (tomorrow, during **Practical session 7**).

Now we go on with a lecture about the basic principles of statistics.

**=> Lecture 7 : Basic Statistics** 

___

<div class="alert alert-block alert-info"> 
    
<b><em> About Jupyter notebooks:</em></b><br>

- To add a new cell, click on the "+" icon in the toolbar above your notebook <br>
- You can "click and drag" to move a cell up or down <br>
- You choose the type of cell in the toolbar above your notebook: <br>
    - 'Code' to enter command lines to be executed <br>
    - 'Markdown' cells to add text, that can be formatted with some characters <br>
- To execute a 'Code' cell, press SHIFT+ENTER or click on the "play" icon  <br>
- To display a 'Markdown' cell, press SHIFT+ENTER or click on the "play" icon  <br>
- To modify a 'Markdown' cell, double-click on it <br>
<br>    

<em>  
To make nice html reports with markdown: <a href="https://dillinger.io/" title="dillinger.io">html visualization tool 1</a> or <a href="https://stackedit.io/app#" title="stackedit.io">html visualization tool 2</a>, <a href="https://www.tablesgenerator.com/markdown_tables" title="tablesgenerator.com">to draw nice tables</a>, and the <a href="https://medium.com/analytics-vidhya/the-ultimate-markdown-guide-for-jupyter-notebook-d5e5abf728fd" title="Ultimate guide">Ultimate guide</a>. <br>
Further reading on JupyterLab notebooks: <a href="https://jupyterlab.readthedocs.io/en/latest/user/notebook.html" title="Jupyter Lab">Jupyter Lab documentation</a>.<br>
    
Here we are using the JupyterLab interface implemented as part of the <a href="https://plasmabio.org/" title="plasmabio.org">Plasmabio</a> project led by Sandrine Caburet, Pierre Poulain and Claire Vandiedonck.
</em>
</div>