# __RNAseq Analysis Module__

## **Practical Session 3: Quality check of raw data and mapping**

Tuesday, the 22nd of November, 2022   
Claire Vandiedonck and Sandrine Caburet - 2022  


   1. Getting started   
   2. Quality controls on C. parapsilosis fastq files   
   3. Mapping the reads on the C. parapsilosis genome using the BOWTIE program  
   4. Managing the output files
   5. Batch analysis of the other samples


---
## **Before going further**

<div class="alert alert-block alert-danger"> <b>Caution:</b> 
Before starting the analysis, save a backup copy of this notebook: in the left-hand panel, right-click on this file and select "Duplicate".<br>
You can also make backups during the analysis. Don't forget to save your notebook regularly.
</div>

<div class="alert alert-block alert-info"> 
    
<b><em> About jupyter notebooks:</em></b><br>

- To add a new cell, click on the "+" icon in the toolbar above your notebook <br>
- You can "click and drag" to move a cell up or down <br>
- You choose the type of cell in the toolbar above your notebook: <br>
    - 'Code' to enter command lines to be executed <br>
    - 'Markdown' cells to add text, that can be formatted with some characters <br>
- To execute a 'Code' cell, press SHIFT+ENTER or click on the "play" icon  <br>
- To display a 'Markdown' cell, press SHIFT+ENTER or click on the "play" icon  <br>
- To modify a 'Markdown' cell, double-click on it <br>
<br>    

<em>  
To make nice html reports with markdown: <a href="https://dillinger.io/" title="dillinger.io">html visualization tool 1</a> or <a href="https://stackedit.io/app#" title="stackedit.io">html visualization tool 2</a>, <a href="https://www.tablesgenerator.com/markdown_tables" title="tablesgenerator.com">to draw nice tables</a>, and the <a href="https://medium.com/analytics-vidhya/the-ultimate-markdown-guide-for-jupyter-notebook-d5e5abf728fd" title="Ultimate guide">Ultimate guide</a>. <br>
Further reading on JupyterLab notebooks: <a href="https://jupyterlab.readthedocs.io/en/latest/user/notebook.html" title="Jupyter Lab">Jupyter Lab documentation</a>.<br>
    
Here we are using JupyterLab interface implemented as part of the <a href="https://plasmabio.org/" title="plasmabio.org">Plasmabio</a> project led by Sandrine Caburet, Pierre Poulain and Claire Vandiedonck.
</em>
</div>

___

__*=> About this jupyter notebook*__

This is a jupyter notebook in **bash**, meaning that the commands you will enter or run in `Code` cells are directly understood by the server. <br>You could run the same commands in a `Terminal` (the frightening black window that computer scientists use :-D). 

>_If you want to see this by yourself, you can open a terminal on adenine:_
>- _in the `File` menu in the top bar, select `New Launcher` or click on the `+` sign below_
>- _open either a bash `Console` or a `Terminal`_
>- _you'll be able to copy and paste the commands from the `Code` cells of the notebook in the "bottom cell" (for the console) or after the `$` sign (for the terminal)_
>
>_This is for your information only, and not needed. All the commands are already included in this notebook_
<br>

- In Unix, all characters are case sensitive.
- It is good practice to avoid accents and special characters.   
- Within `Code` cells, lines starting with a `#` are comments and are not interpreted as a command. They are meant to help you.  
- You may add your own comments as well, either in a `Code` cell using this `#`, or in a new `Markdown` cell added with the "+" above.  
- <mark>If you add cells with comments, or modify existing cells, **don't forget to save your notebook**.</mark>
___

## **I - Getting started**

### **1- Working directory**

The working directory is where you are currently located on the server. By default, for this practical session, it is the folder that was created in your home directory when you selected the correct 'server' and launched it.  

To check where you are working, use the `pwd` command, which stands for "print working directory".

In [1]:
pwd

/srv/home/scaburet/meg_m2_rnaseq_bash


<div class="alert alert-block alert-warning"><b>The result should be like this:</b> <code>/srv/home/mylogin/m2meg-rnaseq-tp3to5-bash</code> with your own login instead of "mylogin". If not, call us! The working directory can be changed using the Unix command <b>cd</b> (change directory).
</div>

>_Here is a link for some basic Unix commands: https://files.fosswire.com/2007/08/fwunixref.pdf (there are plenty of other good ones on the net).<br>
>You can also get an explanation of general Unix commands using this tool: https://explainshell.com/. <br>
> Another tip: you can autocomplete the names of your files and folders with the Tab key on your keyboard._

The content of this working directory is displayed in the left panel. You can also list the content of this folder with the `ls` command (which stands for "list"):

In [2]:
# the option -l provides details for each file, including its size
# the option -h stands for "human-readable": file sizes are shown in K, M or G
# the two options are combined as -lh
# you may add -tr as well to sort the files by modification time, oldest first

ls -lh

total 22M
drwxr-xr-x 2 scaburet scaburet 4.0K Nov 15 17:22 binder
-rw-rw-r-- 1 scaburet scaburet 7.6M Nov 22 12:42 C_parapsilosis.1.ebwt
-rw-rw-r-- 1 scaburet scaburet 1.6M Nov 22 12:42 C_parapsilosis.2.ebwt
-rw-rw-r-- 1 scaburet scaburet   89 Nov 22 12:42 C_parapsilosis.3.ebwt
-rw-rw-r-- 1 scaburet scaburet 3.2M Nov 22 12:42 C_parapsilosis.4.ebwt
-rw-rw-r-- 1 scaburet scaburet 7.6M Nov 22 12:42 C_parapsilosis.rev.1.ebwt
-rw-rw-r-- 1 scaburet scaburet 1.6M Nov 22 12:42 C_parapsilosis.rev.2.ebwt
-rw-r--r-- 1 scaburet scaburet 1.5K Nov 15 17:22 LICENSE
-rw-rw-r-- 1 scaburet scaburet  35K Nov 22 15:26 PS3-mapping-bash-2022-Copy1.ipynb
-rw-rw-r-- 1 scaburet scaburet  60K Nov 22 13:14 PS3-mapping-bash-2022.ipynb
-rw-rw-r-- 1 scaburet scaburet  25K Nov 22 12:42 PS4-mappingOutput-bash-2022.ipynb
-rw-rw-r-- 1 scaburet scaburet  10K Nov 22 12:44 PS5-ReadCounts-bash-2022.ipynb
-rw-r--r-- 1 scaburet scaburet 1.3K Nov 15 17:22 README.md
drwxrwxr-x 3 scaburet scaburet 4.0K Nov 22 13:15 Results
drwx

### **2- Data** 
The data files are already present on the server, in the `/srv/data/meg-m2-rnaseq/genome/` and in `/srv/data/meg-m2-rnaseq/experimental_data` folders.
<br><mark> Do not copy them to your working directory. </mark> <br>
We will directly read them from where they are by indicating the  **absolute path** to these folders.

#### **2.a- list of input files:**

In [3]:
# Here we list the content of the folder containing the genome data

ls -lh /srv/data/meg-m2-rnaseq/genome/

total 16M
-rw-rw-r-- 1     1002 1011 2.2M Nov  8  2020 C_parapsilosis_CDC317_current_features.gtf
-rw-r--r-- 1 scaburet 1012 3.9K Nov 13  2020 C_parapsilosis_CDC317_GO_distrib-5958g.txt
-rwxrwxr-x 1 scaburet 1012  13M Nov 16  2020 C_parapsilosis_CGD.fasta
-rwxrwxr-x 1     1002 1011  497 Nov 17  2020 C_parapsilosis_CGD.fasta.fai
-rwxrwxr-x 1 scaburet 1012 460K Nov 16  2020 C_parapsilosis_ORFs.gff
-rw-rw-r-- 1     1002 1011  381 Nov 28  2020 md5sums.txt


In [4]:
# Here we list the content of the folder containing the experimental data

ls -ltr /srv/data/meg-m2-rnaseq/experimental_data/

total 8740404
-rwxrwxr-x 1 scaburet 1012 2076626584 Nov 17  2020 Normoxia_1.fastq
-rwxrwxr-x 1 scaburet 1012 2247722522 Nov 17  2020 Hypoxia_1.fastq
-rwxrwxr-x 1 scaburet 1012  601091995 Nov 23  2020 SRR352276.fastqsanger.gz
-rwxrwxr-x 1 scaburet 1012  453919548 Nov 23  2020 SRR352274.fastqsanger.gz
-rwxrwxr-x 1 scaburet 1012  900553614 Nov 23  2020 SRR352273.fastqsanger.gz
-rwxrwxr-x 1 scaburet 1012  287494946 Nov 24  2020 SRR352270.fastqsanger.gz
-rwxrwxr-x 1 scaburet 1012  545441830 Nov 24  2020 SRR352267.fastqsanger.gz
-rwxrwxr-x 1 scaburet 1012  614531495 Nov 24  2020 SRR352266.fastqsanger.gz
-rwxrwxr-x 1 scaburet 1012  590409188 Nov 24  2020 SRR352264.fastqsanger.gz
-rwxrwxr-x 1 scaburet 1012  632322170 Nov 24  2020 SRR352261.fastqsanger.gz
-rw-rw-r-- 1     1002 1011        993 Nov 29  2020 md5sums.txt


You may count the number of files in one folder using the following command. The symbol `|` is a "pipe": it sends the output of the command on its left as input to the command on its right. The command `grep` (*globally search for a regular expression and print matching lines*) keeps only the lines that match a specific pattern, here "fastq". The final part of the command, `wc -l`, counts the number of lines.

In [6]:
ls /srv/data/meg-m2-rnaseq/experimental_data/ | grep "fastq" | wc -l 

10


The first two files are `.fastq` files containing raw data from the Illumina sequencer. The other 8 are gzip-compressed `.gz` files. Notice that they are smaller than the `.fastq` files. Most genomics tools can work with both compressed and uncompressed files.

#### **2.b- checking files integrity:**

<div class="alert alert-block alert-warning"><b>Checking the data are not corrupted</b><br>
Whenever you get such input files, it is mandatory to verify that they are intact and not corrupted before analysing the data further.
This can be performed by computing a <b>md5sum</b>, a kind of "barcode" or "fingerprint" of each file. It should remain the same after a copy on your computer for example.<br>
Similarly, in your laboratories, if you get files from collaborators or from a Next-Generation-Sequencing platform, always ask for the md5sums to check file integrity.
</div>

You may either get the md5sum of one file at a time like this using the command `md5sum` followed by a space and the name of the file:

   - on the __genomic files__:

In [7]:
md5sum /srv/data/meg-m2-rnaseq/genome/C_parapsilosis_CGD.fasta

e189032dafc2b7013eeae7d33cbf9458  /srv/data/meg-m2-rnaseq/genome/C_parapsilosis_CGD.fasta


Or you may get the `md5sum` fingerprint of all the files at once in the folder by using `*` which stands for "anything"

In [8]:
# In a command, * stands for 'anything'.

md5sum /srv/data/meg-m2-rnaseq/genome/*

#You should get the following "barcodes" for each file :
# 423de6aa2842fa7ad2b2639fc4d47808  /srv/data/meg-m2-rnaseq/genome/C_parapsilosis_CDC317_current_features.gtf
# 6455d97a060c3c7d1e94112f818fa046  /srv/data/meg-m2-rnaseq/genome/C_parapsilosis_CDC317_GO_distrib-5958g.txt
# e189032dafc2b7013eeae7d33cbf9458  /srv/data/meg-m2-rnaseq/genome/C_parapsilosis_CGD.fasta
# 537217ec9ac54343af31b28521c0c6f3  /srv/data/meg-m2-rnaseq/genome/C_parapsilosis_CGD.fasta.fai
# e86c62e99a240c0ac309cd067d105522  /srv/data/meg-m2-rnaseq/genome/C_parapsilosis_ORFs.gff
# be6f316b0fcca1b653ee5b98648ddfb2  /srv/data/meg-m2-rnaseq/genome/md5sums.txt


423de6aa2842fa7ad2b2639fc4d47808  /srv/data/meg-m2-rnaseq/genome/C_parapsilosis_CDC317_current_features.gtf
6455d97a060c3c7d1e94112f818fa046  /srv/data/meg-m2-rnaseq/genome/C_parapsilosis_CDC317_GO_distrib-5958g.txt
e189032dafc2b7013eeae7d33cbf9458  /srv/data/meg-m2-rnaseq/genome/C_parapsilosis_CGD.fasta
537217ec9ac54343af31b28521c0c6f3  /srv/data/meg-m2-rnaseq/genome/C_parapsilosis_CGD.fasta.fai
e86c62e99a240c0ac309cd067d105522  /srv/data/meg-m2-rnaseq/genome/C_parapsilosis_ORFs.gff
be6f316b0fcca1b653ee5b98648ddfb2  /srv/data/meg-m2-rnaseq/genome/md5sums.txt


Even better is to already have in the folder a file, classically called `md5sums.txt`, containing the output of the above `md5sum` command. Should you have the rights to do it, the command to generate that file would be:

In [11]:
md5sum /srv/data/meg-m2-rnaseq/genome/* > md5sums.txt

Thus, you can automatically compare the md5sum fingerprints you obtain with the ones stored in the `md5sums.txt` file, using the `-c` (check) option. This is very convenient when you have a lot of files to check from a platform.

In [12]:
md5sum -c /srv/data/meg-m2-rnaseq/genome/md5sums.txt 

/srv/data/meg-m2-rnaseq/genome/C_parapsilosis_CDC317_GO_distrib-5958g.txt: OK
/srv/data/meg-m2-rnaseq/genome/C_parapsilosis_CGD.fasta: OK
/srv/data/meg-m2-rnaseq/genome/C_parapsilosis_CGD.fasta.fai: OK
/srv/data/meg-m2-rnaseq/genome/C_parapsilosis_ORFs.gff: OK
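
To see what a failed check looks like, here is a minimal sketch that works on throw-away files in a temporary directory. The file names `sample_A.txt` and `sample_B.txt` are made up for the illustration and are not part of the course data. The sketch records the fingerprints, alters one file, and runs the check again.

```bash
# Hypothetical demo files in a throw-away directory (not the course data)
demo_dir=$(mktemp -d)
printf 'ACGT\n' > "$demo_dir/sample_A.txt"
printf 'TTGA\n' > "$demo_dir/sample_B.txt"

# Record the fingerprints; the subshell ( ... ) keeps the notebook's working directory unchanged
( cd "$demo_dir" && md5sum sample_A.txt sample_B.txt > md5sums.txt )

# First check: both files are intact, so md5sum exits with status 0
( cd "$demo_dir" && md5sum -c md5sums.txt )
status_intact=$?

# Simulate a corrupted transfer by appending one line to one file
printf 'N\n' >> "$demo_dir/sample_B.txt"

# Second check: sample_B.txt is reported as FAILED and the exit status is non-zero
( cd "$demo_dir" && md5sum -c md5sums.txt )
status_corrupted=$?

echo "exit status when intact: $status_intact, after corruption: $status_corrupted"

# Clean up the throw-away directory
rm -rf "$demo_dir"
```

The exit status is what makes `md5sum -c` usable in scripts: a non-zero value signals that at least one file did not match its stored fingerprint.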


_Remark: To get information on a Unix command, just enter the name of the command followed by `--help` as below. If it is installed on the server/computer, you can also enter the command `man` followed by the name of the command._

In [13]:
md5sum --help
# man md5sum

Usage: md5sum [OPTION]... [FILE]...
Print or check MD5 (128-bit) checksums.

With no FILE, or when FILE is -, read standard input.

  -b, --binary         read in binary mode
  -c, --check          read MD5 sums from the FILEs and check them
      --tag            create a BSD-style checksum
  -t, --text           read in text mode (default)

The following five options are useful only when verifying checksums:
      --ignore-missing  don't fail or report status for missing files
      --quiet          don't print OK for each successfully verified file
      --status         don't output anything, status code shows success
      --strict         exit non-zero for improperly formatted checksum lines
  -w, --warn           warn about improperly formatted checksum lines

      --help     display this help and exit
      --version  output version information and exit

The sums are computed as described in RFC 1321.  When checking, the input
should be a former output of this program.  The de

   - on the __experimental data__ :
   
*Be patient, it can take a minute.*

In [14]:
md5sum -c /srv/data/meg-m2-rnaseq/experimental_data/md5sums.txt

#You should get "OK" for each file. For reference, the stored "barcodes" are:

# 2fb96155f5c708709a7539c7ff19e9ff  /srv/data/meg-m2-rnaseq/experimental_data/Hypoxia_1.fastq
# 0d8d81a7464f6b662b89a9cea5bb8d1c  /srv/data/meg-m2-rnaseq/experimental_data/Normoxia_1.fastq
# 18a714651a337245bc728f3de2d14c87  /srv/data/meg-m2-rnaseq/experimental_data/SRR352261.fastqsanger.gz
# 72249ca523761575a85c61345529595b  /srv/data/meg-m2-rnaseq/experimental_data/SRR352264.fastqsanger.gz
# 857247cf34e788aef24aeaf9c4081a10  /srv/data/meg-m2-rnaseq/experimental_data/SRR352266.fastqsanger.gz
# d7f3e511652f9f6f08092cb6dbde37b4  /srv/data/meg-m2-rnaseq/experimental_data/SRR352267.fastqsanger.gz
# 55350bf610cafb705956068851038447  /srv/data/meg-m2-rnaseq/experimental_data/SRR352270.fastqsanger.gz
# fa987e543da5da808dd73e36e341c621  /srv/data/meg-m2-rnaseq/experimental_data/SRR352273.fastqsanger.gz
# 4a3449674775c9baa76296244dfe9e3d  /srv/data/meg-m2-rnaseq/experimental_data/SRR352274.fastqsanger.gz
# 39dc93ec7820c315d1a9742444b7f83b  /srv/data/meg-m2-rnaseq/experimental_data/SRR352276.fastqsanger.gz

/srv/data/meg-m2-rnaseq/experimental_data/Hypoxia_1.fastq: OK
/srv/data/meg-m2-rnaseq/experimental_data/Normoxia_1.fastq: OK
/srv/data/meg-m2-rnaseq/experimental_data/SRR352261.fastqsanger.gz: OK
/srv/data/meg-m2-rnaseq/experimental_data/SRR352264.fastqsanger.gz: OK
/srv/data/meg-m2-rnaseq/experimental_data/SRR352266.fastqsanger.gz: OK
/srv/data/meg-m2-rnaseq/experimental_data/SRR352267.fastqsanger.gz: OK
/srv/data/meg-m2-rnaseq/experimental_data/SRR352270.fastqsanger.gz: OK
/srv/data/meg-m2-rnaseq/experimental_data/SRR352273.fastqsanger.gz: OK
/srv/data/meg-m2-rnaseq/experimental_data/SRR352274.fastqsanger.gz: OK
/srv/data/meg-m2-rnaseq/experimental_data/SRR352276.fastqsanger.gz: OK


### **3- Creating a folder for analysis results:**

Now we'll create a new directory to store the results of our analysis, using the `mkdir` command, for "make directory", and within it a sub-folder for the quality-check outputs:

In [15]:
  mkdir Results
  mkdir Results/Fastqc

You can check the tree structure of your folder with the Unix command `tree`.

In [None]:
tree

_Of note, the `binder` folder was automatically created with your environment. For those interested, it contains all the configuration information to recreate a similar JupyterLab environment outside of adenine._ 

**=> Well done, you are now ready to check and analyse the data!** 

-------

## **II - Quality controls on *C. parapsilosis* `.fastq` and `.fastq.gz` files**

### **1- Examining the data**

- `.fastq` files are readable by the human eye, and we can display the first and last lines of each file, using the Unix `head` and `tail` commands:  

In [17]:
head /srv/data/meg-m2-rnaseq/experimental_data/Normoxia_1.fastq

@O2rep2_SRR352263.1 HWI-EAS283:5:1:2:642 length=40
ACTTAATACACACCCAATTCCCTCTTCATCTGATCTAAAT
+O2rep2_SRR352263.1 HWI-EAS283:5:1:2:642 length=40
+55-891>3<7A=./<<232?AAB7C?6AB=-7'-<,A:#
@O2rep2_SRR352263.2 HWI-EAS283:5:1:2:1439 length=40
AATTTGTTCAACGTTTCTTCCCATCATCAAACATTCTGTT
+O2rep2_SRR352263.2 HWI-EAS283:5:1:2:1439 length=40
:-7=>0;AA;39'&86>8;@0A4?################
@O2rep2_SRR352263.3 HWI-EAS283:5:1:3:874 length=40
ATTTATATTTTTTTTATTTCTTTTACCTTCCCTTCTTATT


In [18]:
tail /srv/data/meg-m2-rnaseq/experimental_data/Normoxia_1.fastq

+O2rep2_SRR352263.10213663 HWI-EAS283:5:100:1789:357 length=40
9@AAB5BCB</6?CC@B9ABB>9@97>166)38B:47<:3
@O2rep2_SRR352263.10213664 HWI-EAS283:5:100:1789:667 length=40
TGGAATACCTTCTTTGTCTTGGATTTTGGACTTGAGATTG
+O2rep2_SRR352263.10213664 HWI-EAS283:5:100:1789:667 length=40
B@@@=0<;5A@0:A*=@@?@####################
@O2rep2_SRR352263.10213665 HWI-EAS283:5:100:1790:9 length=40
CATTTGATGCCATCGCGCTCAATGAAAATTATAAAANAAA
+O2rep2_SRR352263.10213665 HWI-EAS283:5:100:1790:9 length=40
=@BCCB.B:BBBBA*@B=AB8BB8B=BBBCA(:.B1%:6C


In [19]:
head /srv/data/meg-m2-rnaseq/experimental_data/Hypoxia_1.fastq

@noO2rep3_SRR352271.1 HWI-EAS283_0006:2:1:0:157 length=42
NTCCGTATTCCCTATGCCTCGTACAAATTNCTTNCAAATCCT
+noO2rep3_SRR352271.1 HWI-EAS283_0006:2:1:0:157 length=42
%198387:;699;96/56;838:::9:5&%/;1%/999::::
@noO2rep3_SRR352271.2 HWI-EAS283_0006:2:1:0:1006 length=42
NCAATTGCAATTTCCAATGTCTATCATACAAATCCTCTTCTT
+noO2rep3_SRR352271.2 HWI-EAS283_0006:2:1:0:1006 length=42
%0;<<717;6;<<<:<<7/7<<<979;;;;;<999999<8;;
@noO2rep3_SRR352271.3 HWI-EAS283_0006:2:1:0:1599 length=42
NCTCCTAATTTCAATTTATACAATATTGTGGTTTTTTTTTCA


In [20]:
tail /srv/data/meg-m2-rnaseq/experimental_data/Hypoxia_1.fastq

+noO2rep3_SRR352271.10136007 HWI-EAS283_0006:2:120:1789:1639 length=42
5BB@CCBCCCA+=CCBBBBB6<B@C@CBBCAA@AAB?%3B3?
@noO2rep3_SRR352271.10136008 HWI-EAS283_0006:2:120:1789:872 length=42
CTCCCGTACTTTTTTATATAATGCTTCTTTTGATGCTNTTTG
+noO2rep3_SRR352271.10136008 HWI-EAS283_0006:2:120:1789:872 length=42
BCBAA7B??CCCCCC>CBC>BC?BCCACCCB9AC8@?%>C@#
@noO2rep3_SRR352271.10136009 HWI-EAS283_0006:2:120:1789:1527 length=42
CCCACGATTGATAATATTGTGGAATCAAGTCCATTGANGTCT
+noO2rep3_SRR352271.10136009 HWI-EAS283_0006:2:120:1789:1527 length=42
BABBB9BBCABCAB+5CB:BB?BBC?BBC7?B?A@-7%3ABC


>Another great Unix command is `less`, when installed. If you want to try it on adenine, you have to do it in a terminal (it does not work in this notebook). It initially displays the first lines of a file. By pressing the spacebar, you will see the next lines. The options `-S` and `-N` respectively display the lines without wrapping and add the line number at the beginning of each line. Press `q` to quit.

> _For geeks only:_
>
> Similarly, you can count the number of lines in a file:

In [21]:
wc -l /srv/data/meg-m2-rnaseq/experimental_data/Hypoxia_1.fastq

40544036 /srv/data/meg-m2-rnaseq/experimental_data/Hypoxia_1.fastq


> and get the number of reads by dividing by 4:

In [22]:
nb_row=$(wc -l /srv/data/meg-m2-rnaseq/experimental_data/Hypoxia_1.fastq | cut -d" " -f1) 
echo $((${nb_row}/4))

10136009


The same for the Normoxia_1 file :

In [23]:
nb_row=$(wc -l /srv/data/meg-m2-rnaseq/experimental_data/Normoxia_1.fastq | cut -d" " -f1) 
echo $((${nb_row}/4))

10213665


> or directly get the number of reads noticing all reads in this file start with an `@noO2`:

In [24]:
grep "^@noO2" /srv/data/meg-m2-rnaseq/experimental_data/Hypoxia_1.fastq | wc -l

10136009
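
The two counting approaches above can be compared on a tiny example. The sketch below defines a hypothetical helper, `count_fastq_reads`, which divides the number of lines by 4 and assumes that every read spans exactly 4 lines. It also shows why the `grep` pattern must be specific: `@` is a valid quality character, so a bare `^@` pattern can also match quality lines.

```bash
# count_fastq_reads FILE: print the number of reads in a plain or gzip-compressed FASTQ file
# (hypothetical helper; assumes each record spans exactly 4 lines)
count_fastq_reads() {
    local file=$1 nb_lines
    case "$file" in
        *.gz) nb_lines=$(zcat "$file" | wc -l) ;;
        *)    nb_lines=$(wc -l < "$file") ;;
    esac
    echo $(( nb_lines / 4 ))
}

# Made-up FASTQ with 2 reads; the quality line of the first read starts with '@'
demo_fastq=$(mktemp)
printf '@read1\nACGT\n+\n@IIH\n@read2\nTTGA\n+\nIIII\n' > "$demo_fastq"

count_fastq_reads "$demo_fastq"    # counts lines and divides by 4
grep -c "^@" "$demo_fastq"         # counts every line starting with '@', quality lines included

rm -f "$demo_fastq"
```

On this made-up file, the helper reports 2 reads whereas the bare `^@` pattern matches 3 lines. The pattern `^@noO2` used above avoids the problem as long as no quality line happens to start with the same characters.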


- On the `gz` files, you need to combine the `zcat` command first that reads compressed files, and the `head` or `tail` commands using a pipe `|`.

In [25]:
zcat /srv/data/meg-m2-rnaseq/experimental_data/SRR352261.fastqsanger.gz | head

@SRR352261.1 HWI-EAS283:2:1:2:388 length=40
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
+SRR352261.1 HWI-EAS283:2:1:2:388 length=40
=@A@.5;8@7@@=;B@9>>>?@=@AB@;@A@@?BAB<9@B
@SRR352261.2 HWI-EAS283:2:1:3:506 length=40
CAAAATTCTTATTTATTTCCATAAATTATATCGTCATCAC
+SRR352261.2 HWI-EAS283:2:1:3:506 length=40
>3)A?ACCB>5A7/;@:@@61=:00;B@6?@=########
@SRR352261.3 HWI-EAS283:2:1:3:421 length=40
CAACCAATCTAATCATCTTTTCTCTTATTATCCCTATATT

gzip: stdout: Broken pipe


In [26]:
zcat /srv/data/meg-m2-rnaseq/experimental_data/SRR352266.fastqsanger.gz | tail

+SRR352266.12397825 HWI-EAS283:8:100:1790:1748 length=40
BBBBBBBB@<B@BCCBBB?@*):0??##############
@SRR352266.12397826 HWI-EAS283:8:100:1790:1444 length=40
TGAAAATACAATCTGCTTTANTAAATAGACCCANNNNNNN
+SRR352266.12397826 HWI-EAS283:8:100:1790:1444 length=40
A:2@==@;980;??71?;=1%9:>################
@SRR352266.12397827 HWI-EAS283:8:100:1790:1959 length=40
GTTCCATAGTTGTTTGCAATNTATCATTTAACTNNNNNNN
+SRR352266.12397827 HWI-EAS283:8:100:1790:1959 length=40
6@BB@CB>AB@@:A@7@A@=%8AA@###############


> and for geeks, the command `zgrep` will do the pattern search in a gz file: 

In [27]:
zgrep "^@SRR" /srv/data/meg-m2-rnaseq/experimental_data/SRR352261.fastqsanger.gz | wc -l

12523951
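
The read headers report a `length=` value, but the sequences themselves can also be measured. The following sketch defines a hypothetical helper, `fastq_read_lengths`, which tabulates the length of the sequence line (line 2 of each 4-line record, an assumption about the file layout) for a plain or gzip-compressed file. It is demonstrated on a made-up file rather than on the course data.

```bash
# fastq_read_lengths FILE: tabulate read lengths of a plain or gzip-compressed FASTQ file
# (hypothetical helper; assumes 4-line records, the sequence being line 2 of each record)
# Output: one line per observed length, as "length count", sorted by length
fastq_read_lengths() {
    local file=$1
    case "$file" in
        *.gz) zcat "$file" ;;
        *)    cat "$file" ;;
    esac | awk 'NR % 4 == 2 { n[length($0)]++ } END { for (len in n) print len, n[len] }' | sort -n
}

# Made-up FASTQ: two reads of 4 bases and one read of 6 bases
demo_fastq=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n@r3\nACGTAC\n+\nIIIIII\n' > "$demo_fastq"
fastq_read_lengths "$demo_fastq"
rm -f "$demo_fastq"
```

On a real sample, the same helper could be called with the path of a `.fastq` or `.fastqsanger.gz` file. Expect it to take a while on files with millions of reads.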


<div class="alert alert-block alert-success"><b>=> Question: What can you say on the data?</b><br>

<em>(you can click here to add your answers directly in this markdown cell)</em><br>

For each dataset:

- How many reads do you have in each file?
- What is the size of the reads?
</div>

### **2- fastqc**
Now we run the quality control with **FastQC** (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/). Let's first check which version of the tool is installed.

In [28]:
fastqc --version

FastQC v0.11.9


To run it on a sample, use the following command line: after the command `fastqc`, we give the name of the file to examine (with its path), then, after the `--outdir` argument, the folder where the results are written. Here the dot `.` stands for "current working directory". 

In [29]:
fastqc /srv/data/meg-m2-rnaseq/experimental_data/Normoxia_1.fastq --outdir ./Results/Fastqc

Started analysis of Normoxia_1.fastq
Approx 5% complete for Normoxia_1.fastq
Approx 10% complete for Normoxia_1.fastq
Approx 15% complete for Normoxia_1.fastq
Approx 20% complete for Normoxia_1.fastq
Approx 25% complete for Normoxia_1.fastq
Approx 30% complete for Normoxia_1.fastq
Approx 35% complete for Normoxia_1.fastq
Approx 40% complete for Normoxia_1.fastq
Approx 45% complete for Normoxia_1.fastq
Approx 50% complete for Normoxia_1.fastq
Approx 55% complete for Normoxia_1.fastq
Approx 60% complete for Normoxia_1.fastq
Approx 65% complete for Normoxia_1.fastq
Approx 70% complete for Normoxia_1.fastq
Approx 75% complete for Normoxia_1.fastq
Approx 80% complete for Normoxia_1.fastq
Approx 85% complete for Normoxia_1.fastq
Approx 90% complete for Normoxia_1.fastq
Approx 95% complete for Normoxia_1.fastq
Analysis complete for Normoxia_1.fastq


The outputs are in a `.zip` archive that you could extract with the `unzip` Unix command. But there is no need to do so, as a summary in `.html` format is also provided. To open this `html` file, in the left-hand panel of JupyterLab, double-click the `Results` folder, then the `Fastqc` folder, and finally the `.html` file: it should open in a new tab beside this notebook.

In [30]:
fastqc /srv/data/meg-m2-rnaseq/experimental_data/Hypoxia_1.fastq --outdir ./Results/Fastqc

Started analysis of Hypoxia_1.fastq
Approx 5% complete for Hypoxia_1.fastq
Approx 10% complete for Hypoxia_1.fastq
Approx 15% complete for Hypoxia_1.fastq
Approx 20% complete for Hypoxia_1.fastq
Approx 25% complete for Hypoxia_1.fastq
Approx 30% complete for Hypoxia_1.fastq
Approx 35% complete for Hypoxia_1.fastq
Approx 40% complete for Hypoxia_1.fastq
Approx 45% complete for Hypoxia_1.fastq
Approx 50% complete for Hypoxia_1.fastq
Approx 55% complete for Hypoxia_1.fastq
Approx 60% complete for Hypoxia_1.fastq
Approx 65% complete for Hypoxia_1.fastq
Approx 70% complete for Hypoxia_1.fastq
Approx 75% complete for Hypoxia_1.fastq
Approx 80% complete for Hypoxia_1.fastq
Approx 85% complete for Hypoxia_1.fastq
Approx 90% complete for Hypoxia_1.fastq
Approx 95% complete for Hypoxia_1.fastq
Analysis complete for Hypoxia_1.fastq


> In some web browsers, the letters and special characters might not be displayed correctly. If you encounter this problem with Firefox, open the menu in the top right-hand corner. Click on "customize" and select the text encoding icon. Slide it to the menu on the right. It now appears in your menu bar. Click on it and select "Unicode" instead of "Western".

In [31]:
#For more help on fastqc used in command line, you can always type:
fastqc --help


            FastQC - A high throughput sequence QC analysis tool

SYNOPSIS

	fastqc seqfile1 seqfile2 .. seqfileN

    fastqc [-o output dir] [--(no)extract] [-f fastq|bam|sam] 
           [-c contaminant file] seqfile1 .. seqfileN

DESCRIPTION

    FastQC reads a set of sequence files and produces from each one a quality
    control report consisting of a number of different modules, each one of 
    which will help to identify a different potential type of problem in your
    data.
    
    If no files to process are specified on the command line then the program
    will start as an interactive graphical application.  If files are provided
    on the command line then the program will run with no user interaction
    required.  In this mode it is suitable for inclusion into a standardised
    analysis pipeline.
    
    The options for the program as as follows:
    
    -h --help       Print this help file and exit
    
    -v --version    Print the version of the program and exit

---

## **III - Mapping reads on the *C. parapsilosis* genome using the BOWTIE program (version 1.3.1)**


Let's check which version of **BOWTIE** (http://bowtie-bio.sourceforge.net/manual.shtml) is used.

In [32]:
bowtie --version

/srv/conda/envs/notebook/bin/bowtie-align-s version 1.3.1
64-bit
Built on fv-az75-556
2022-09-15T09:30:25
Compiler: gcc version 10.4.0 (conda-forge gcc 10.4.0-16) 
Options: -O3 -Wl,--hash-style=both -DPOPCNT_CAPABILITY -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /srv/conda/envs/notebook/include -fdebug-prefix-map=/opt/conda/conda-bld/bowtie_1663233587858/work=/usr/local/src/conda/bowtie-1.3.1 -fdebug-prefix-map=/srv/conda/envs/notebook=/usr/local/src/conda-prefix  -fvisibility-inlines-hidden -std=c++17 -fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /srv/conda/envs/notebook/include -fdebug-prefix-map=/opt/conda/conda-bld/bowtie_1663233587858/work=/usr/local/src/conda/bowtie-1.3.1 -fdebug-prefix-map=/srv/conda/envs/notebook=/usr/local/src/conda-prefix                                                               


### **1- Generating the indexes of the *C.parapsilosis* genome**
The indexes are small files that tell a program where to look for data in a large data file. They are required for mapping algorithms, as they allow for faster processing of millions of reads. With BOWTIE they are generated with the `bowtie-build` command.

In [33]:
bowtie-build -q /srv/data/meg-m2-rnaseq/genome/C_parapsilosis_CGD.fasta C_parapsilosis 

The 6 created index files have the `.ebwt` suffix :

In [34]:
ls -lh *.ebwt

-rw-rw-r-- 1 scaburet scaburet 7.6M Nov 22 16:18 C_parapsilosis.1.ebwt
-rw-rw-r-- 1 scaburet scaburet 1.6M Nov 22 16:18 C_parapsilosis.2.ebwt
-rw-rw-r-- 1 scaburet scaburet   89 Nov 22 16:18 C_parapsilosis.3.ebwt
-rw-rw-r-- 1 scaburet scaburet 3.2M Nov 22 16:18 C_parapsilosis.4.ebwt
-rw-rw-r-- 1 scaburet scaburet 7.6M Nov 22 16:18 C_parapsilosis.rev.1.ebwt
-rw-rw-r-- 1 scaburet scaburet 1.6M Nov 22 16:18 C_parapsilosis.rev.2.ebwt
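
Before launching a long mapping job, it can be worth confirming that the index is complete. The sketch below defines a hypothetical helper, `check_bowtie_index`, that looks for the six suffixes seen in the listing above and reports any file that is missing or empty. The suffix list is taken from that listing only, so treat it as an assumption for other genomes or index formats.

```bash
# check_bowtie_index PREFIX: verify that the six .ebwt files of a bowtie index exist
# (hypothetical helper; the suffix list is taken from the listing above)
check_bowtie_index() {
    local prefix=$1 suffix missing=0
    for suffix in 1 2 3 4 rev.1 rev.2; do
        # -s is true only if the file exists and is not empty
        if [ ! -s "$prefix.$suffix.ebwt" ]; then
            echo "missing or empty: $prefix.$suffix.ebwt"
            missing=1
        fi
    done
    return "$missing"
}

# Prints "index complete" only if all six files are found in the working directory
check_bowtie_index C_parapsilosis && echo "index complete"
```

The prefix passed to the helper is the same one given as the last argument of `bowtie-build`.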


### **2- Mapping the reads**
We use BOWTIE, a mapper that is very simple and efficient. It is not recent at all, and it is not splice-aware, so it cannot align reads that span introns, but here it works fine.

To start with, we will run BOWTIE on the two `.fastq` files. In section V of this notebook, we will run it on the other `fastq.gz` samples.

In [35]:
# the -S option tells bowtie to generate a .sam file  
# the -x option indicates the prefix name of the various index files 
# then you specify the name of the fastq file
# the last argument is the name of the output file, here located directly into the Results folder ./Results/

bowtie -S -x C_parapsilosis /srv/data/meg-m2-rnaseq/experimental_data/Normoxia_1.fastq ./Results/Normoxia_1_bowtie_mapping.sam

# reads processed: 10213665
# reads with at least one alignment: 9160959 (89.69%)
# reads that failed to align: 1052706 (10.31%)
Reported 9160959 alignments


In [36]:
head -n 20 ./Results/Normoxia_1_bowtie_mapping.sam

# head --help


@HD	VN:1.0	SO:unsorted
@SQ	SN:Contig005504_C_parapsilosis_CDC317	LN:898305
@SQ	SN:Contig005569_C_parapsilosis_CDC317	LN:2235583
@SQ	SN:Contig005806_C_parapsilosis_CDC317	LN:1039767
@SQ	SN:Contig005807_C_parapsilosis_CDC317	LN:2091826
@SQ	SN:Contig005809_C_parapsilosis_CDC317	LN:3023470
@SQ	SN:Contig006110_C_parapsilosis_CDC317	LN:957321
@SQ	SN:Contig006139_C_parapsilosis_CDC317	LN:962442
@SQ	SN:Contig006372_C_parapsilosis_CDC317	LN:1789679
@SQ	SN:mito_C_parapsilosis_CDC317	LN:31781
@PG	ID:Bowtie	VN:1.3.1	CL:"/srv/conda/envs/notebook/bin/bowtie-align-s --wrapper basic-0 -S -x C_parapsilosis /srv/data/meg-m2-rnaseq/experimental_data/Normoxia_1.fastq ./Results/Normoxia_1_bowtie_mapping.sam"
O2rep2_SRR352263.1	4	*	0	0	*	*	0	0	ACTTAATACACACCCAATTCCCTCTTCATCTGATCTAAAT	+55-891>3<7A=./<<232?AAB7C?6AB=-7'-<,A:#	XM:i:0
O2rep2_SRR352263.2	16	Contig005809_C_parapsilosis_CDC317	2486313	255	40M	*	0	0	AACAGAATGTTTGATGATGGGAAGAAACGTTGAACAAATT	################?4A0@;8>68&'93;AA;0>=7-:	XA:i:0	MD:Z:2G1A0C
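
In the alignment lines, the second column is the FLAG, a sum of bit values defined in the SAM specification. In the two reads shown above, 4 means that the read is unmapped, and 16 means that it aligned to the reverse strand, which is why its sequence appears reverse-complemented compared with the `.fastq` file. The sketch below defines a hypothetical helper, `decode_sam_flag`, that tests only these two bits, which is enough for single-end reads like ours.

```bash
# decode_sam_flag FLAG: print the meaning of a SAM FLAG value for a single-end read
# (hypothetical helper; only bits 4 and 16 are decoded, other bits are ignored)
decode_sam_flag() {
    local flag=$1
    if (( flag & 4 )); then
        echo "unmapped"
    elif (( flag & 16 )); then
        echo "mapped on the reverse strand"
    else
        echo "mapped on the forward strand"
    fi
}

decode_sam_flag 4     # flag of the first read shown above
decode_sam_flag 16    # flag of the second read shown above
decode_sam_flag 0     # a read mapped on the forward strand
```

The `&` operator inside `(( ))` is a bitwise AND, so each test checks a single bit of the flag regardless of the others.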

In [37]:
bowtie -S -x C_parapsilosis /srv/data/meg-m2-rnaseq/experimental_data/Hypoxia_1.fastq ./Results/Hypoxia_1_bowtie_mapping.sam

# reads processed: 10136009
# reads with at least one alignment: 9252407 (91.28%)
# reads that failed to align: 883602 (8.72%)
Reported 9252407 alignments
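
The percentages in the BOWTIE reports can be recomputed from the read counts, which is a useful sanity check when comparing samples. The sketch below uses a hypothetical helper, `mapping_rate`; it relies on `awk` because bash arithmetic only handles integers.

```bash
# mapping_rate ALIGNED TOTAL: percentage of aligned reads, with two decimals
# (hypothetical helper; LC_ALL=C forces a dot as decimal separator)
mapping_rate() {
    LC_ALL=C awk -v aligned="$1" -v total="$2" 'BEGIN { printf "%.2f\n", 100 * aligned / total }'
}

mapping_rate 9160959 10213665    # Normoxia_1: prints 89.69
mapping_rate 9252407 10136009    # Hypoxia_1: prints 91.28
```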


<div class="alert alert-block alert-success"><b>=> Question: What can you say on the data?</b><br>

<em>(you can click here to add your answers directly in this markdown cell)</em><br>

For each dataset, how many reads were:
- processed?  
- mapped?  
- written in the output file?
</div>

---

## **IV - Managing the output files**

### **1- Converting, sorting and indexing the output files**
The downstream analysis is not performed on `.sam` files, but on binary versions of these : `bam` files.  
So we are going to:  
- convert the `sam` into `bam` files, 
- then sort them in genomic order,  
- finally index them, to produce the companion `bai` files

The commands used for this part belong to a large package of utilities that are very useful to manage those types of files: **SAMTOOLS** (http://www.htslib.org/).  

  Let's check first which version of SAMTOOLS we are using.

In [38]:
samtools --version

samtools: /srv/conda/envs/notebook/bin/../lib/libtinfow.so.6: no version information available (required by samtools)
samtools: /srv/conda/envs/notebook/bin/../lib/libncursesw.so.6: no version information available (required by samtools)
samtools: /srv/conda/envs/notebook/bin/../lib/libncursesw.so.6: no version information available (required by samtools)
samtools 1.16.1
Using htslib 1.16
Copyright (C) 2022 Genome Research Ltd.

Samtools compilation details:
    Features:       build=configure curses=yes 
    CC:             /opt/conda/conda-bld/samtools_1665435019136/_build_env/bin/x86_64-conda-linux-gnu-cc
    CPPFLAGS:       -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -isystem /srv/conda/envs/notebook/include
    CFLAGS:         -Wall -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /srv/conda/envs/notebook/include -fdebug-prefix-map=/opt/conda/conda-bld/samtools_1665435019136/work=/usr/local/src/conda/samtools-1.16.1 

<br>- We will start first with the ***Normoxia dataset:***

#### **1.a-** Converting .sam into .bam with **samtools view**

In [39]:
# The 'view' command displays or converts bam/sam files
# -b specifies that the output is in .bam format
# it is followed by the name of the input .sam file
# -o provides the name of the output .bam file

samtools view -b ./Results/Normoxia_1_bowtie_mapping.sam -o ./Results/Normoxia_1_bowtie_mapping.bam

samtools: /srv/conda/envs/notebook/bin/../lib/libtinfow.so.6: no version information available (required by samtools)
samtools: /srv/conda/envs/notebook/bin/../lib/libncursesw.so.6: no version information available (required by samtools)
samtools: /srv/conda/envs/notebook/bin/../lib/libncursesw.so.6: no version information available (required by samtools)


#### **1.b-** Sorting .bam with **samtools sort**

Again, `-o` is used to provide the name of the output file.

In [40]:
samtools sort ./Results/Normoxia_1_bowtie_mapping.bam -o ./Results/Normoxia_1_bowtie_mapping.sorted.bam

samtools: /srv/conda/envs/notebook/bin/../lib/libtinfow.so.6: no version information available (required by samtools)
samtools: /srv/conda/envs/notebook/bin/../lib/libncursesw.so.6: no version information available (required by samtools)
samtools: /srv/conda/envs/notebook/bin/../lib/libncursesw.so.6: no version information available (required by samtools)
[bam_sort_core] merging from 2 files and 1 in-memory blocks...


#### **1.c-** Generating an index with **samtools index**.  
There is no need to provide a name for the output file: by default, it is the name of the corresponding *bam* file with an added `.bai` suffix.

In [41]:
samtools index ./Results/Normoxia_1_bowtie_mapping.sorted.bam

samtools: /srv/conda/envs/notebook/bin/../lib/libtinfow.so.6: no version information available (required by samtools)
samtools: /srv/conda/envs/notebook/bin/../lib/libncursesw.so.6: no version information available (required by samtools)
samtools: /srv/conda/envs/notebook/bin/../lib/libncursesw.so.6: no version information available (required by samtools)
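
Because the index must always match its *bam* file, it can be useful to check that the `.bai` file exists and is not older than the *bam* file. The sketch below is a hypothetical helper written for illustration: `has_fresh_index` is not a samtools command, and the demonstration uses empty placeholder files in a temporary folder rather than real alignments.

```shell
# Return 0 if the .bai index exists and is not older than its bam file.
# 'has_fresh_index' is a hypothetical helper, not a samtools command.
has_fresh_index() {
    bam_file="$1"
    [ -f "${bam_file}.bai" ] && [ ! "${bam_file}" -nt "${bam_file}.bai" ]
}

# Demonstration on empty placeholder files in a temporary folder
demo_dir=$(mktemp -d)
touch -t 202211221000 "${demo_dir}/sample.sorted.bam"
touch -t 202211221005 "${demo_dir}/sample.sorted.bam.bai"

if has_fresh_index "${demo_dir}/sample.sorted.bam"; then
    echo "index is up to date"
else
    echo "index is missing or outdated: run samtools index again"
fi
rm -rf "${demo_dir}"
```

If a *bam* file is regenerated after indexing, it becomes newer than its index and the check fails, which is the signal to run `samtools index` again.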


<br>***- For the Hypoxia dataset***, we can run the 3 steps in the same cell: the commands are executed one after another:

In [42]:
samtools view -b ./Results/Hypoxia_1_bowtie_mapping.sam -o ./Results/Hypoxia_1_bowtie_mapping.bam
samtools sort ./Results/Hypoxia_1_bowtie_mapping.bam -o ./Results/Hypoxia_1_bowtie_mapping.sorted.bam
samtools index ./Results/Hypoxia_1_bowtie_mapping.sorted.bam

samtools: /srv/conda/envs/notebook/bin/../lib/libtinfow.so.6: no version information available (required by samtools)
samtools: /srv/conda/envs/notebook/bin/../lib/libncursesw.so.6: no version information available (required by samtools)
samtools: /srv/conda/envs/notebook/bin/../lib/libncursesw.so.6: no version information available (required by samtools)
samtools: /srv/conda/envs/notebook/bin/../lib/libtinfow.so.6: no version information available (required by samtools)
samtools: /srv/conda/envs/notebook/bin/../lib/libncursesw.so.6: no version information available (required by samtools)
samtools: /srv/conda/envs/notebook/bin/../lib/libncursesw.so.6: no version information available (required by samtools)
[bam_sort_core] merging from 2 files and 1 in-memory blocks...
samtools: /srv/conda/envs/notebook/bin/../lib/libtinfow.so.6: no version information available (required by samtools)
samtools: /srv/conda/envs/notebook/bin/../lib/libncursesw.so.6: no version information available (requi
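
When several commands are written in the same cell, a later command still runs even if an earlier one failed. One possible safeguard is to chain the steps with `&&`. The sketch below assumes a hypothetical helper named `sam_to_sorted_bam`, which takes the path prefix shared by all the files of a sample; only the three samtools commands themselves come from the cells above.

```shell
# Hypothetical helper: run the three steps for one sample and stop at the first failure.
# $1 is the path prefix shared by all files, e.g. ./Results/Hypoxia_1_bowtie_mapping
sam_to_sorted_bam() {
    prefix="$1"
    if [ ! -f "${prefix}.sam" ]; then
        echo "missing input: ${prefix}.sam" >&2
        return 1
    fi
    # '&&' runs the next command only if the previous one succeeded
    samtools view -b "${prefix}.sam" -o "${prefix}.bam" &&
    samtools sort "${prefix}.bam" -o "${prefix}.sorted.bam" &&
    samtools index "${prefix}.sorted.bam"
}
```

It would be called as `sam_to_sorted_bam ./Results/Hypoxia_1_bowtie_mapping`, and it returns a non-zero status as soon as one step fails.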

### **2- Removing the intermediate files**  
The only files needed for the rest of the analysis are the `.sorted.bam` files and their corresponding `.bai` index files. So we are going to save some space by deleting the intermediate files that are no longer needed. (You can easily produce them again by running the corresponding Code cells above.)  
You can delete a file by right-clicking on it and choosing 'x Delete', or by running the *rm* (remove) command in a cell:

In [43]:
rm ./Results/Normoxia_1_bowtie_mapping.bam
rm ./Results/Hypoxia_1_bowtie_mapping.bam

In [44]:
# removing all the .sam files at the same time

rm ./Results/*.sam
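
Since `rm` deletes files permanently, a more cautious variant removes the intermediate files of a sample only once its final files are present. This is a sketch under stated assumptions: `remove_intermediates` is a hypothetical helper, and the demonstration runs on empty placeholder files in a temporary folder, not on the real Results folder.

```shell
# Hypothetical helper: delete the intermediate .sam and unsorted .bam of a sample,
# but only once the sorted bam and its index are both present.
remove_intermediates() {
    prefix="$1"
    if [ -f "${prefix}.sorted.bam" ] && [ -f "${prefix}.sorted.bam.bai" ]; then
        rm -f "${prefix}.sam" "${prefix}.bam"
    else
        echo "keeping intermediates for ${prefix}: final files not found" >&2
        return 1
    fi
}

# Demonstration on empty placeholder files in a temporary folder
demo_dir=$(mktemp -d)
touch "${demo_dir}/A.sam" "${demo_dir}/A.bam" "${demo_dir}/A.sorted.bam" "${demo_dir}/A.sorted.bam.bai"
touch "${demo_dir}/B.sam" "${demo_dir}/B.bam"     # sample B has no sorted bam yet

remove_intermediates "${demo_dir}/A"
remove_intermediates "${demo_dir}/B" || true      # B is kept, with a message
ls "${demo_dir}"
rm -rf "${demo_dir}"
```

In this demonstration, the intermediate files of sample A are removed while those of sample B are kept.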

___

## **V - Analysis of the other 8 samples**

The complete study involves 6 Normoxia samples and 4 Hypoxia samples. For the remaining 8 samples, we will perform a batch analysis (all the steps together, for multiple files at once):
- quality check with fastqc
- mapping with bowtie
- sam-to-bam conversion with samtools
- bam sorting and indexing with samtools
- removal of intermediate files



FASTQC can deal with several files without a loop.

In [45]:
fastqc /srv/data/meg-m2-rnaseq/experimental_data/*.fastqsanger.gz --outdir ./Results/Fastqc

Started analysis of SRR352261.fastqsanger.gz
Approx 5% complete for SRR352261.fastqsanger.gz
Approx 10% complete for SRR352261.fastqsanger.gz
Approx 15% complete for SRR352261.fastqsanger.gz
Approx 20% complete for SRR352261.fastqsanger.gz
Approx 25% complete for SRR352261.fastqsanger.gz
Approx 30% complete for SRR352261.fastqsanger.gz
Approx 35% complete for SRR352261.fastqsanger.gz
Approx 40% complete for SRR352261.fastqsanger.gz
Approx 45% complete for SRR352261.fastqsanger.gz
Approx 50% complete for SRR352261.fastqsanger.gz
Approx 55% complete for SRR352261.fastqsanger.gz
Approx 60% complete for SRR352261.fastqsanger.gz
Approx 65% complete for SRR352261.fastqsanger.gz
Approx 70% complete for SRR352261.fastqsanger.gz
Approx 75% complete for SRR352261.fastqsanger.gz
Approx 80% complete for SRR352261.fastqsanger.gz
Approx 85% complete for SRR352261.fastqsanger.gz
Approx 90% complete for SRR352261.fastqsanger.gz
Approx 95% complete for SRR352261.fastqsanger.gz
Analysis complete for SRR
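
FASTQC receives several files at once because the shell expands the `*` pattern into one argument per matching file before the program starts. The sketch below illustrates this expansion on empty placeholder files; the file names and the `n_files` variable are assumptions made for the example.

```shell
# The shell expands the '*' pattern into one argument per matching file
# before the program starts, so the program receives the full list at once.
demo_dir=$(mktemp -d)
touch "${demo_dir}/sampleA.fastqsanger.gz" "${demo_dir}/sampleB.fastqsanger.gz" "${demo_dir}/notes.txt"

set -- "${demo_dir}"/*.fastqsanger.gz   # store the expanded list in $1, $2, ...
n_files=$#
echo "The pattern matches ${n_files} files:"
for f in "$@"; do
    basename "$f"
done
rm -rf "${demo_dir}"
```

Only the two files ending in `.fastqsanger.gz` are matched; `notes.txt` is ignored.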

For the next steps, we use a `for` **loop**, that will run the program once for each element in the provided list, and produce the properly-named output files.

> Here are some explanations on the loop:<br>
> - `fn` is the loop variable: it takes in turn the path of each file in the folder containing the data, and the loop body runs once per file
> - `${}` is used to read the value of a previously defined variable
> - an `id` variable is created with the prefix of the name of each fastqsanger.gz file
> - `basename` extracts the name of the file from its absolute path: only the name of the file is kept
> - `cut` splits the file name using `.` as separator, defined with the `-d` argument, then `-f1` keeps only the first element, before the first `.`
> - `echo` is used to print a message
> - we then define the variable `myoutsortedbam` with the name of the output file and its relative path
> - then we use the bowtie command, but we redirect its output to samtools using the pipe `|`
> - for samtools, `-` is given instead of the name of the input file, to specify that the input is the output of the command on the left of the pipe; the same applies to the next pipe
> - here we save only the `.sorted.bam` and `.sorted.bam.bai` files, without intermediate files
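
The naming steps of the loop can be tried on a single path before launching the whole batch. In this sketch, the sample name is the one shown in the fastqc output above, and `file_name` is an intermediate variable added for illustration; no program other than `basename` and `cut` is run.

```shell
# Path of one input file, used here only as a character string
fn="/srv/data/meg-m2-rnaseq/experimental_data/SRR352261.fastqsanger.gz"

file_name=$(basename "${fn}")              # SRR352261.fastqsanger.gz
id=$(echo "${file_name}" | cut -d. -f1)    # SRR352261
myoutsortedbam="./Results/${id}_bowtie_mapping.sorted.bam"

echo "${file_name}"
echo "${id}"
echo "${myoutsortedbam}"
```

The last line printed is the path the loop would use for the sorted bam file of this sample.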

<div class="alert alert-block alert-danger"><b>Danger:<br></b>The loop will probably take ~30 minutes to 1 hour. It generates <b>temporary "bam.tmp" files</b> in the Results folder.<br> <b>Do not delete them during the process!</b> Once a sample is processed, these temporary files are deleted automatically.
</div>

In [46]:
date

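# Loop directly over the files matching the pattern (no need to call ls)
for fn in /srv/data/meg-m2-rnaseq/experimental_data/*.fastqsanger.gz; do

    # Sample ID = file name up to the first '.'
    id=$(basename "${fn}" | cut -d. -f1)
    echo "========Processing sampleID: ${id}..."
    
    # Map, convert to bam and sort in one pipeline, then index the result
    myoutsortedbam="./Results/${id}_bowtie_mapping.sorted.bam"
    bowtie -S -x C_parapsilosis "${fn}" | samtools view -b - | samtools sort - -o "${myoutsortedbam}"
    samtools index "${myoutsortedbam}"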

    echo "...done"
    
done
date

Tue Nov 22 16:43:57 UTC 2022
samtools: /srv/conda/envs/notebook/bin/../lib/libtinfow.so.6: no version information available (required by samtools)
samtools: /srv/conda/envs/notebook/bin/../lib/libncursesw.so.6: no version information available (required by samtools)
samtools: /srv/conda/envs/notebook/bin/../lib/libncursesw.so.6: no version information available (required by samtools)
samtools: /srv/conda/envs/notebook/bin/../lib/libtinfow.so.6: no version information available (required by samtools)
samtools: /srv/conda/envs/notebook/bin/../lib/libncursesw.so.6: no version information available (required by samtools)
samtools: /srv/conda/envs/notebook/bin/../lib/libncursesw.so.6: no version information available (required by samtools)
# reads processed: 12523951
# reads with at least one alignment: 11717483 (93.56%)
# reads that failed to align: 806468 (6.44%)
Reported 11717483 alignments
[bam_sort_core] merging from 2 files and 1 in-memory blocks...
samtools: /srv/conda/envs/notebook/

In [47]:
bowtie -S -x C_parapsilosis /srv/data/meg-m2-rnaseq/experimental_data/SRR352266.fastqsanger.gz | samtools view -b - | samtools sort - -o ./Results/SRR352266_bowtie_mapping.sorted.bam

samtools: /srv/conda/envs/notebook/bin/../lib/libtinfow.so.6: no version information available (required by samtools)
samtools: /srv/conda/envs/notebook/bin/../lib/libncursesw.so.6: no version information available (required by samtools)
samtools: /srv/conda/envs/notebook/bin/../lib/libncursesw.so.6: no version information available (required by samtools)
samtools: /srv/conda/envs/notebook/bin/../lib/libtinfow.so.6: no version information available (required by samtools)
samtools: /srv/conda/envs/notebook/bin/../lib/libncursesw.so.6: no version information available (required by samtools)
samtools: /srv/conda/envs/notebook/bin/../lib/libncursesw.so.6: no version information available (required by samtools)
# reads processed: 12397827
# reads with at least one alignment: 11543165 (93.11%)
# reads that failed to align: 854662 (6.89%)
Reported 11543165 alignments
[bam_sort_core] merging from 2 files and 1 in-memory blocks...
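
The `-` convention used in this pipeline is not specific to samtools: many command-line programs accept `-` in place of a file name to read from the pipe. The sketch below illustrates it with the standard Unix `sort` command, which is unrelated to `samtools sort`; the contig names are made up for the example.

```shell
# 'sort -' reads from standard input, i.e. from the command on the left of the pipe
sorted_names=$(printf 'contig3\ncontig1\ncontig2\n' | sort -)
echo "${sorted_names}"
```

The three names are printed in alphabetical order, although no input file was given to `sort`.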


In [48]:
samtools index ./Results/SRR352266_bowtie_mapping.sorted.bam

samtools: /srv/conda/envs/notebook/bin/../lib/libtinfow.so.6: no version information available (required by samtools)
samtools: /srv/conda/envs/notebook/bin/../lib/libncursesw.so.6: no version information available (required by samtools)
samtools: /srv/conda/envs/notebook/bin/../lib/libncursesw.so.6: no version information available (required by samtools)


<div class="alert alert-block alert-success"><b>Success:</b> Don't forget to save your notebook and export a copy as an <b>html</b> file as well <br>
- Open "File" in the Menu<br>
- Select "Export Notebook As"<br>
- Export notebook as HTML<br>
- You can then open it in your browser even without being connected to adenine! 
</div>


___
___

Next comes a lecture on the content of the sorted *bam* output files.

**=> Lecture 5 : Mapping output** 