# RNAseq training CEA - June 2024

*Instructors: Sandrine Caburet, Claire Vandiedonck*

IFB session: 16 CPUs + 70 GB of RAM

# Part 4 : Mapping reads to genome

<div class="alert alert-block alert-danger"> 
    <b>     🚨 🚨 🚨 💣 💣 💣 <u> WARNING WARNING WARNING </u> 💣 💣 💣 🚨 🚨 🚨    </b> <br><br>
    The mapping tool is very resource-hungry! We therefore need to allocate more CPU and RAM for this session, but we CANNOT all run this notebook at the same time! <br>
    <b>The following values are valid for a 16-CPU session with access to 70 GB of RAM</b>. Ideally, use 70-80% of the CPUs your system or session has. DO NOT ask for more RAM than you can use.
</div>


- 0.1 - About session for IFB core cluster
- 0.2 - Parameters to be set or modified by the user
- 1 - Reference genome and annotation file
- 2 - Building genome reference files with ``STAR``
- 3 - Mapping samples on reference genome with ``STAR``
- 4 - Building sample ``.bam`` indexes with ``samtools``


---
## **Before going further**

<div class="alert alert-block alert-danger"><b>Caution:</b> 
Before starting the analysis, save a backup copy of this notebook: in the left-hand panel, right-click on this file and select "Duplicate".<br>
You can also make backups during the analysis. Don't forget to save your notebook regularly: <kbd>Ctrl</kbd> + <kbd>S</kbd> or click on the 💾 icon.
</div>

<div class="alert alert-block alert-danger">
    Please note that the mouse reference genome index is built for RNA sequencing with 100-base reads. <br>
    If your reads have a different length, set <code>rawreadlength</code> below to the read length of your dataset.
</div>

---
---

## 0.1 - About session for IFB core cluster
---

<em>loaded JupyterLab</em> : Version 3.5.0

In [1]:
## Code cell 1 ##

echo "=== Cell launched on $(date) ==="
squeue -hu $USER 

echo "=== Current IFB session size: as an indication: Medium (4CPU, 10GB) or Large (10CPU, 50GB) ==="
jobid=$(squeue -hu "$USER" | awk '/sys\/dash/ {print $1}')

sacct --format=JobID,AllocCPUS,reqmem,NODELIST,elapsed,state --jobs ${jobid}

=== Cell launched on Wed Jun 12 13:10:59 CEST 2024 ===
          40167855      fast sys/dash scaburet  R    2:59:24      1 cpu-node-54
=== Current IFB session size: as an indication: Medium (4CPU, 10GB) or Large (10CPU, 50GB) ===
JobID         AllocCPUS     ReqMem        NodeList    Elapsed      State 
------------ ---------- ---------- --------------- ---------- ---------- 
40167855             16        70G     cpu-node-54   02:59:24    RUNNING 
40167855.ba+         16                cpu-node-54   02:59:24    RUNNING 


In [2]:
## Code cell 2 ##

module load star/2.7.11a samtools/1.18

# module load star/2.7.10b samtools/1.15.1 in 2023
# module load star/2.7.11a samtools/1.18 in 2024

echo "===== download network files ====="
wget --version | head -n 1
echo "===== alignment tool ====="
STAR --version 
echo "===== index construction + quality ====="
samtools --version | head -n 2

===== download network files =====
GNU Wget 1.14 built on linux-gnu.
===== alignment tool =====
2.7.11a
===== index construction + quality =====
samtools 1.18
Using htslib 1.19


---

## 0.2 - Parameters to be set or modified by the user

- Using a full path ending with a `/`, **define the folder** in which you will work by setting the `gohome` variable:

In [4]:
## Code cell 3 ##

gohome="/shared/projects/2413_rnaseq_cea/"

echo "=== Home root folder is ==="
echo "${gohome}"
echo ""
echo "=== Working (personal) folder tree ==="
tree -d -L 2 "${gohome}$USER"
echo "=== current working directory ==="
echo "${PWD}"

=== Home root folder is ===
/shared/projects/2413_rnaseq_cea/

=== Working (personal) folder tree ===
/shared/projects/2413_rnaseq_cea/scaburet
|-- Data
|   |-- fastq
|   `-- sra
|-- Results
|   |-- fastp
|   |-- fastq_screen
|   |-- fastqc
|   |-- logfiles
|   |-- multiqc
|   |-- qualimap-11juin
|   |-- samtools-11juin
|   `-- star-11juin
|-- meg_m2_rnaseq
|   `-- binder
`-- run_notebooks

15 directories
=== current working directory ===
/shared/ifbstor1/projects/2413_rnaseq_cea/scaburet


- **Reference genome and annotation files will be downloaded from the web** in [section 1.2](#downloadsection), so **don't forget to change the URLs for both files** in [section 1.1](#urlsection). The variables that store these links are ``fagzurl`` and ``gtfgzurl``, respectively.

- To map reads efficiently, a set of indexes has to be created. These indexes depend on the read length, a parameter set by the experimenter on the sequencing platform.    
    **Change the read length value below if your reads have a different size**:   


In [5]:
## Code cell 4 ##

rawreadlength=100

- Please specify the **maximum number of CPUs** (central processing units, cores) and the **amount of RAM (in bytes)** that programs can use.<a id="computressources"></a>

<div class="alert alert-block alert-danger">
    <b>Following values are valid for a 16-CPU session with access to 70 GB of RAM</b>. Ideally, use 70-80% of the CPU amount your system or session has. DO NOT ask for more RAM than you can use.
</div>

In [6]:
## Code cell 5 ##

authorizedCPU=13  # about 80% of the 16 CPUs of this session

authorizedRAM=60000000000  # 60GB
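
The 70-80% guideline above can also be computed rather than typed by hand. The following is a minimal sketch, not part of the original pipeline: the helper names `suggest_cpu` and `gb_to_bytes` are hypothetical, and the 80% share and decimal gigabytes (10^9 bytes) are assumptions chosen to reproduce the values of Code cell 5.

```bash
# Hypothetical helpers (not part of the original notebook).

# Suggest a thread count: a given percentage (default 80) of the CPUs,
# rounded to the nearest integer.
suggest_cpu() {
    local total=$1 percent=${2:-80}
    echo $(( (total * percent + 50) / 100 ))
}

# Convert decimal gigabytes into bytes, the unit STAR expects for its RAM limits.
gb_to_bytes() {
    echo $(( $1 * 1000 * 1000 * 1000 ))
}

authorizedCPU=$(suggest_cpu 16)    # 16 CPUs in the session
authorizedRAM=$(gb_to_bytes 60)    # 60 GB out of the 70 GB of the session

echo "authorizedCPU=${authorizedCPU}"
echo "authorizedRAM=${authorizedRAM}"
```

For a smaller session, only the two arguments need to change, for instance `suggest_cpu 10` and `gb_to_bytes 40`.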

---
---
## 1 - Reference genome and annotation files 
---

Classical mapping of RNAseq data relies on the use of a reference genome sequence (in <code>fasta</code> format) and a companion annotation file that contains the location of genomic features, such as genes and exons (in <code>gtf</code> or <code>gff</code> format). 
The Reference genome also has to be indexed by STAR (the mapping tool that we'll be using), with indexes corresponding to the specific read length in the dataset.

<div class="alert alert-block alert-warning">   
<b>Here we will use pre-downloaded files and pre-computed indexes, in order to save time and disk space.</b>  
Sections 1.1 to 1.3 (downloading and extracting Reference files) and 2.1 to 2.4 (indexing Reference file) are therefore only provided for use in your own later projects.   

To use those commands:   
* change the format of the cells from <code>raw</code> to <code>code</code>.  
* specify the correct read length in Code cell 4 (above)
* choose the correct species in Code cells 6 and 7.

**The section 1.4 is kept active to verify the Reference files**
</div>

### **1.1 - Searching for urls on the web**

There are several websites to download reference genome and annotation files:<a id="urlsection"></a>
- <a href="https://www.gencodegenes.org/"><i>Gencode</i></a> from the European Bioinformatics Institute (EMBL-EBI)
- <a href="http://www.ensembl.org/info/data/ftp/index.html"><i>Accessing Ensembl Data</i></a> from the Ensembl project database, for most studies
- <a href="https://www.ncbi.nlm.nih.gov/genome/guide/human/"><i>Human Genome Resources</i></a> at NCBI
- For human, there is now also the <a href="https://github.com/marbl/CHM13">Telomere-to-telomere consortium CHM13 (T2T-CHM13)</a> project.
- ... and maybe one day a common NCBI and Ensembl/Gencode release (MANE collaboration, <a href="https://ncbiinsights.ncbi.nlm.nih.gov/2020/11/02/ncbi-refseq-ensembl-gencode-mane-v0-92/#more-4781">a story beginning in 2020</a>)

We will use a **Primary assembly** (PRI) release. It includes chromosomes and scaffolds (candidate regions to be integrated or discarded in the next genome build).  
By contrast, the main annotation file is limited to chromosomes, while the extensive annotation file also includes all known haplotypes (for highly variable regions such as the MHC).

<div class="alert alert-block alert-warning">
    In this notebook, we use the <b>Gencode release</b>, which provides <b><code>.gz</code> compressed</b> files, and we choose the <b>GTF format</b> for the annotation file. <br>
    Feel free to choose any of the sources cited above, as long as the downloaded files follow the same file formats (if they don't, change the code cells of the next sections!).
</div>

<div class="alert alert-block alert-info">
    Nonetheless, please note that the annotation file format is still an open issue, as some relevant official sources contradict each other: 
    <ul>
        <li>
            For US Galaxy Main project: <a href="https://galaxyproject.org/learn/datatypes/#gtf">GTF</a> is the GFF version 2 while <a href="https://galaxyproject.org/learn/datatypes/#gff">GFF</a> is version 1 and <a href="https://galaxyproject.org/learn/datatypes/#gff3">GFF3</a> is the latest and 3rd version... 
        </li>
        <li>
        ... but the Broad Institute's IGV, like the UCSC genome browser, makes a distinction between <a href="http://software.broadinstitute.org/software/igv/GFF">GFF2 and GTF formats</a>, <a href="https://genome.ucsc.edu/FAQ/FAQformat.html#format3"> the latter being only compatible with the former</a>.
        </li>
        <li>
            While both <a href="https://biocorecrg.github.io/PhD_course/gtf_format.html">GTF</a> and <a href="https://biostar.usegalaxy.org/p/28147/">GFF</a> formats have 9 columns, field in the ninth column is longer for <code>.gtf</code> files than for <code>.gff</code> files (<a href="https://genome.ucsc.edu/FAQ/FAQformat.html#format3">UCSC Genome browser documentation</a> and <a href="https://www.ensembl.org/info/website/upload/gff.html">ensembl documentation</a>).
        </li>
        <li>
            Even if both file formats have header lines, some tools do not support them (<a href="https://biostar.usegalaxy.org/p/28147/">second bullet point in last answer</a>) and the US Galaxy portal asks users to remove those lines before use (see the US Galaxy links above).
        </li>
        <li>
            <code>FeatureCounts</code> (a downstream tool we will use) only <a href="http://bioinf.wehi.edu.au/featureCounts/">works with GTF files</a>. This tool expects to find <i>exon</i> in the <i>features</i> column (both GFF and GTF!) and <i>gene_id</i> as a gene identifier (missing in GFF), see <a href="https://biostar.usegalaxy.org/p/28094/index.html#28099">item 4 in latest answer</a>.
        </li>
    </ul>
</div>

In order to use the latest genome release for your analyses, please go to Gencode's <a href="https://www.gencodegenes.org/human/">download page</a> (or to the download page of another chosen reference) and adapt the URLs for:
- Primary annotation (this notebook was developed with a GTF file)
> in *GTF/GFF3 files* Gencode's chart: Comprehensive gene annotation > primary annotation > *gtf* file 

- Primary genome sequence file
> in *Fasta files* Gencode's chart: Genome sequence, primary assembly > *Fasta* file 

*Note*: You can get a URL by right-clicking on a download link, then choosing *copy link to clipboard*.

<div class="alert alert-block alert-danger">
    Both files have to be retrieved from the same source, as sequence region names need to be identical in both in order to avoid downstream analysis issues. <br>
    <i>In Gencode, this compatibility between files is specified in the <b>fasta description field</b></i>.
</div>

### **1.2 - Retrieve files with ``wget``**

We will download those files in a distinct folder:<a id="downloadsection"></a>

<ul class="alert alert-block alert-info">
    <li>
        Sometimes (often?!), other users' issues help us understand a command better than its manual. For instance, a Stack Exchange <a href="https://unix.stackexchange.com/questions/23501/download-using-wget-to-a-different-directory-than-current-directory">thread</a> about the <code>wget</code> command and how to write into a chosen output folder. 
    </li>
</ul>

We only use these two options:
> ``-P PREFIX`` or ``--directory-prefix=PREFIX`` to specify output folder  
> ``-N`` or ``--timestamping``: don't re-retrieve files unless newer than local  

Some other available options exist and among them this one:
> ``-a FILE`` or ``--append-output=FILE`` to append messages to FILE  
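
The options above can be combined as in the sketch below. It is only an illustration: the two URLs are placeholders (not real download links) to be replaced with the links chosen in section 1.1, the `reffolder` path mirrors the one used later in this notebook, and the `download_reference` helper and `DRYRUN` switch are hypothetical names. With `DRYRUN` set, the command is printed instead of executed, so that it can be checked first.

```bash
# Placeholder values: replace with your own folder and with the links from section 1.1.
gohome="/shared/projects/2413_rnaseq_cea/"
reffolder="${gohome}alldata/Reference/"
fagzurl="https://example.org/path/to/genome.fa.gz"          # placeholder URL
gtfgzurl="https://example.org/path/to/annotation.gtf.gz"    # placeholder URL

DRYRUN=1   # set to an empty value (DRYRUN=) to really download

# Hypothetical helper: download every URL given after the destination folder.
download_reference() {
    local destination=$1
    shift
    # -N: do not re-retrieve files unless newer than the local copy
    # -P: write files into the destination folder
    ${DRYRUN:+echo} wget -N -P "${destination}" "$@"
}

download_reference "${reffolder}" "${fagzurl}" "${gtfgzurl}"
```

Adding `-a FILE` to the `wget` call would append its messages to a log file instead of printing them in the notebook.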

### **1.3 - Extract archive files**

**STAR**, like other downstream tools, can't deal with compressed reference files.  
Extracted files are quite big, whereas compressed ones are of a much more affordable size. So we will keep the compressed files as they are, and give the extracted files simpler names (easier to handle, in particular when changing release and/or database source).

To save time, we already performed this extraction and the following indexing of the mouse genome (version M35) before the training session.   
   
**You should do these two steps (Code cells 12 to 28) if:   
    - you're working on another genome   
    - or you want to use an older version of the mouse genome**   

- Primary annotation (notebook developed with a GTF file).    
**Modify the names in this command if they don't correspond to your genome or version!**

- Primary genome sequence file

### **1.4 - Verify downloaded files**
Let's have a look at these files to check that they correspond to what we expect (or just to discover the file formats).

- Primary annotation (notebook developed with a GTF file)

- Primary genome sequence file
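
A minimal sketch of such checks, assuming `fastafile` and `gtffile` hold the paths of the extracted files: list the sequence names of the fasta file and count the feature types (third column) of the GTF file. The helper names and the two tiny demo files are made up so that the example runs on its own; with real data, the sequence names should be the same in both files.

```bash
# Hypothetical helpers for a quick look at reference files.

# Print the sequence names (first word of each header line) of a fasta file.
fasta_names() {
    grep '^>' "$1" | sed 's/^>//' | cut -d' ' -f1
}

# Count features per type (3rd column) in a GTF file, ignoring '#' header lines.
gtf_feature_counts() {
    grep -v '^#' "$1" | cut -f3 | sort | uniq -c
}

# Tiny made-up demo files
demofolder=$(mktemp -d)
fastafile="${demofolder}/demo.fa"
gtffile="${demofolder}/demo.gtf"
printf '>chr1 1\nACGTACGT\n>chr2 2\nTTGGCCAA\n' > "${fastafile}"
printf '##description: demo\n' > "${gtffile}"
printf 'chr1\tDEMO\tgene\t1\t8\t.\t+\t.\tgene_id "g1";\n' >> "${gtffile}"
printf 'chr1\tDEMO\texon\t1\t4\t.\t+\t.\tgene_id "g1";\n' >> "${gtffile}"
printf 'chr1\tDEMO\texon\t6\t8\t.\t+\t.\tgene_id "g1";\n' >> "${gtffile}"

head -n 2 "${fastafile}"
fasta_names "${fastafile}"
gtf_feature_counts "${gtffile}"
```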

---
---
## 2 - Building genome reference index files with <code>STAR</code>
---

The indexes are small files that tell a program where to look for data in a large data file. They are required by mapping algorithms, as they allow faster processing of millions of reads.   
To save time, we already performed this indexing of the mouse genome (version M35) before the training session.   
   
**You should do this step (Code cells 18 to 28) if:   
    - you're working on another genome   
    - or you want to use an older version of the mouse genome   
    - or you have read length other than 100 bp**


### **2.1 - Tool version and command line presentation**

To create reference genome files, the default command is: <br>
<code>STAR --runMode genomeGenerate --genomeDir destination/folder \
      --genomeFastaFiles path/to/sequence.fa
</code>

<blockquote>
    <code>--runMode genomeGenerate</code>, to switch to the indexing step; otherwise STAR defaults to <code>alignReads</code> (mapping step) <br>
    <code>--genomeDir</code>, to specify the folder where the reference genome indexes are written <br>
    <code>--genomeFastaFiles</code>, path to the reference genome fasta file (DOES NOT work with gz files)
</blockquote>

We are working on RNAseq data and need to have files that take splice junctions into account. Thus, we need to add following 2 parameters: <br>
<code>--sjdbGTFfile path/to/annotation.file --sjdbOverhang readlengthnum</code>

They stand for:
<blockquote>
    <code>--sjdbGTFfile</code>, to specify where to find the annotation file that contains exon positions, thus placing splice junctions along the genome sequence <br>
    <code>--sjdbOverhang</code>, the maximum length of sequence expected on one side of a splice junction (<em>ideally, mate length - 1</em>)
</blockquote>

For STAR, we can specify those two additional options either when generating the genome index files or when mapping samples. As we may be limited in computational resources, we add them here, which avoids repeating a memory-consuming operation later, when iterating over all samples for mapping.

### **2.2 - Preparing command line variables**

<div class="alert alert-block alert-warning">The dataset used to develop this pipeline consists of 100-base reads.  
If this is not the case for your dataset, change the value in Code cell 4.</div>

We will then create a folder to put those specific genome index files, with an explicit name for later use:
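
One way to build such an explicit name is sketched here, under the assumption that the folder is named after the `--sjdbOverhang` value (read length minus 1). This is consistent with the `indexes_upto99bases/` folder used later in this notebook for 100-base reads. The `mkdir` line is commented out so that the sketch does not create anything.

```bash
gohome="/shared/projects/2413_rnaseq_cea/"
reffolder="${gohome}alldata/Reference/"
rawreadlength=100                       # value set in Code cell 4

# ideal --sjdbOverhang value: mate length - 1
sjdboverhang=$(( rawreadlength - 1 ))

indexfolder="${reffolder}indexes_upto${sjdboverhang}bases/"
echo "${indexfolder}"
# mkdir -p "${indexfolder}"             # uncomment to create the folder
```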

Let's verify ``fastafile`` and ``gtffile`` variables, defined when extracting compressed files in Code cell 15.  

Let's verify, or set if not done in [**Parameters**'s section](#computressources), the **number of CPU** (central processing units, cores) and **RAM-memory size (in Bytes)** that next multithreading program is allowed to use.  
<div class="alert alert-block alert-danger">
    <b>The following values are valid for a 16-CPU session with access to 70 GB of RAM</b>. Ideally, use 70-80% of the CPUs your system or session has. DO NOT ask for more RAM than you can use.
</div>

If you have limited computer resources, change the following parameters directly in the command cell below, or otherwise [set the ``authorizedCPU`` and ``authorizedRAM`` values](#computressources) in Code cell 5.
<blockquote>
    <code>--limitGenomeGenerateRAM</code>, to set maximum available RAM (in bytes, standing for <i>octets</i> in French) for genome generation (integer, positive and not null, default value: 31000000000) <br>
    <code>--runThreadN</code>, to limit the number of threads that <code>STAR</code> can use. On the IFB, it has to be set to the number of available cores
</blockquote>


### **2.3 - Indexing Reference genome**

If there is any issue, start with ``Log.out`` among all the output files that STAR writes. It's a plain text file that records the command line as STAR understood it. It's quite verbose, which is very helpful!
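
Putting sections 2.1 and 2.2 together, the indexing command could look like the sketch below. It only uses the options described above; the paths are illustrative (in particular, the fasta file name is an assumption), and `index_genome` and `DRYRUN` are hypothetical names. With `DRYRUN` set, the command is printed rather than executed, since indexing takes a long time and a lot of memory.

```bash
# Illustrative values; adapt them to your own files and session.
reffolder="/shared/projects/2413_rnaseq_cea/alldata/Reference/"
fastafile="${reffolder}extracted/genome_sequence-M35.fa"     # assumed file name
gtffile="${reffolder}extracted/genome_annotation-M35.gtf"
rawreadlength=100
indexfolder="${reffolder}indexes_upto$(( rawreadlength - 1 ))bases/"
authorizedCPU=13
authorizedRAM=60000000000

DRYRUN=1   # set to an empty value (DRYRUN=) to really run STAR

# Hypothetical wrapper around the STAR indexing command
index_genome() {
    ${DRYRUN:+echo} STAR --runMode genomeGenerate \
        --runThreadN "${authorizedCPU}" \
        --limitGenomeGenerateRAM "${authorizedRAM}" \
        --genomeDir "${indexfolder}" \
        --genomeFastaFiles "${fastafile}" \
        --sjdbGTFfile "${gtffile}" \
        --sjdbOverhang "$(( rawreadlength - 1 ))"
}

index_genome
```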

### **2.4 - Extracted Reference genome file removal**

We can verify the disk space that Reference files use:

The extracted genome file is no longer used after this step, so we can remove it to save space:
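
A sketch of these two steps, run here on a temporary demo folder with made-up file names so that nothing real is deleted. In a real project, `reffolder` and `fastafile` would be the variables defined earlier; note that only the extracted fasta file is removed, as the GTF file is still referenced later in this notebook (Code cell 31).

```bash
# Demo folder standing in for the Reference folder
reffolder="$(mktemp -d)/"
mkdir -p "${reffolder}extracted"
fastafile="${reffolder}extracted/genome_sequence-demo.fa"
gtffile="${reffolder}extracted/genome_annotation-demo.gtf"
printf '>chr1\nACGT\n' > "${fastafile}"
printf '# demo annotation\n' > "${gtffile}"

# disk space used by each item of the Reference folder, then the total
du -ch "${reffolder}"*

# remove the extracted genome sequence only if it is a regular file
if [ -f "${fastafile}" ]; then
    rm "${fastafile}"
fi

# the annotation file is still there
ls "${reffolder}extracted"
```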

---
---
## 3 - Mapping samples on reference genome with <code>STAR</code>
---

### **3.1 - Tool version and command line presentation**

A short stop to check the ``STAR`` version, as you may have skipped genome indexing:

In [29]:
## Code cell 29 ##

STAR --version

2.7.11a


A rather simple version of command line for mapping is: <br>
<code>STAR --genomeDir path/to/indexes/folder/ \
      --readFilesIn path/to/read1.fastq.gz path/to/read2.fastq.gz \
      --readFilesCommand zcat \
      --outSAMtype BAM SortedByCoordinate \
      --quantMode GeneCounts
</code>

<blockquote>
    <code>--readFilesIn</code>, full path(s) to the file(s) containing the input reads: <code>Read</code> (for Single End data) or both <code>Read1 Read2</code> (for Paired End data) <br> 
    <code>--readFilesCommand</code>, to indicate the tool that can handle the read file format. <code>STAR</code> allows direct use of compressed files but relies on the decompression tools available on the system <br>
    <br>
    <code>--outSAMtype word1 word2</code>, to set the output file format we want (default, SAM). <br>
    Options for <code>word1</code> are <code>BAM</code>, <code>SAM</code> and <code>None</code> (no SAM/BAM output). <br>
    Options for <code>word2</code> are <code>Unsorted</code> or <code>SortedByCoordinate</code>. Sorting allocates extra memory, the amount of which can be specified with <code>--limitBAMsortRAM</code>.<br>
    <br>
    <code>--quantMode</code> (default, <i>none</i>), to request one or several quantification outputs.  <br>
    Available options are: <code>GeneCounts</code> and <code>TranscriptomeSAM</code>. The latter writes SAM/BAM alignments to the transcriptome into a separate file, while the former only generates a text file with read counts per gene.
</blockquote>

The ``_Aligned.toTranscriptome.out.bam`` files needed for downstream transcript-level analysis are as big as or bigger than the ``_Aligned.sortedByCoord.out.bam`` files, and only the latter are required for downstream quantification by **FeatureCounts** or other counting tools. We will therefore focus on the gene-level quantification mode.

If you want to use transcript-level quantification, we have previously used the following options successfully: <br>
<code>--quantMode TranscriptomeSAM GeneCounts</code>

### **3.2- Preparing command line variables**

Let's check that we still have all ``.fastq.gz`` files where we left them. We count files that do not include *_removed* in their name:

In [31]:
## Code cell 30 ##

ls "${gohome}$USER/Results/fastp/"
ls "${gohome}$USER/Results/fastp/" | grep -v -e "_removed" | wc -l

SRR12730409_1.fastp.fastq.gz	    SRR12730410_fastp.json
SRR12730409_2.fastp.fastq.gz	    SRR12730410_removed.fastp.fastq.gz
SRR12730409_fastp.html		    SRR12730411_1.fastp.fastq.gz
SRR12730409_fastp.json		    SRR12730411_2.fastp.fastq.gz
SRR12730409_removed.fastp.fastq.gz  SRR12730411_fastp.html
SRR12730410_1.fastp.fastq.gz	    SRR12730411_fastp.json
SRR12730410_2.fastp.fastq.gz	    SRR12730411_removed.fastp.fastq.gz
SRR12730410_fastp.html
12


We also check that the variables in ``${gtffile}`` and ``${indexfolder}`` contain the proper path to Reference features and indexes: 

In [32]:
## Code cell 31 ##

reffolder="${gohome}alldata/Reference/"
gtffile="${reffolder}extracted/genome_annotation-M35.gtf"
indexfolder="${reffolder}indexes_upto99bases/"


ls -lh "${indexfolder}"

total 25G
-rw-rw----+ 1 scaburet scaburet 2.7G Jun 12 14:06 Genome
-rw-rw----+ 1 scaburet scaburet  18K Jun 12 14:07 Log.out
-rw-rw----+ 1 scaburet scaburet  21G Jun 12 14:07 SA
-rw-rw----+ 1 scaburet scaburet 1.5G Jun 12 14:07 SAindex
-rw-rw----+ 1 scaburet scaburet  459 Jun 12 13:43 chrLength.txt
-rw-rw----+ 1 scaburet scaburet  549 Jun 12 13:43 chrName.txt
-rw-rw----+ 1 scaburet scaburet 1008 Jun 12 13:43 chrNameLength.txt
-rw-rw----+ 1 scaburet scaburet  667 Jun 12 13:43 chrStart.txt
-rw-rw----+ 1 scaburet scaburet  30M Jun 12 13:43 exonGeTrInfo.tab
-rw-rw----+ 1 scaburet scaburet  12M Jun 12 13:43 exonInfo.tab
-rw-rw----+ 1 scaburet scaburet 2.4M Jun 12 13:43 geneInfo.tab
-rw-rw----+ 1 scaburet scaburet 1.1K Jun 12 14:06 genomeParameters.txt
-rw-rw----+ 1 scaburet scaburet 8.1M Jun 12 14:04 sjdbInfo.txt
-rw-rw----+ 1 scaburet scaburet 8.8M Jun 12 13:43 sjdbList.fromGTF.out.tab
-rw-rw----+ 1 scaburet scaburet 7.2M Jun 12 14:04 sjdbList.out.tab
-rw-rw----+ 1 scaburet scaburet  10M J

Now we create a destination folder for aligned ``.bam`` and other output files:

In [33]:
## Code cell 32 ##

mappedfolder="${gohome}$USER/Results/star/"
mkdir -p "${mappedfolder}"

... and reset matching ``Results/`` destination folder for log files...

In [34]:
## Code cell 33 ##

logfolder="${gohome}$USER/Results/logfiles/"

Let's verify, or set if not done in [**Parameters**'s section](#computressources), the **number of CPU** (central processing units, cores) and **RAM-memory size (in Bytes)** that next multithreading program is allowed to use.  
<div class="alert alert-block alert-danger">
    <b>The following values are valid for a 16-CPU session with access to 70 GB of RAM</b>. Ideally, use 70-80% of the CPUs your system or session has. DO NOT ask for more RAM than you can use.
</div>

In [35]:
## Code cell 34 ##

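# keep the value set in Code cell 5, or fall back to 13 if it is not set
authorizedCPU=${authorizedCPU:-13}
echo ${authorizedCPU}

# keep the value set in Code cell 5, or fall back to 60 GB if it is not set
authorizedRAM=${authorizedRAM:-60000000000}
echo ${authorizedRAM}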

13
60000000000


### **3.3- Running command line for mapping with <code>STAR</code>**

If you have limited computer resources, please change the following parameters directly in the command cell below, or otherwise [set the ``authorizedCPU`` and ``authorizedRAM`` values](#computressources) in Code cell 5. 
<blockquote>
    <code>--limitBAMsortRAM</code>, to set maximum available RAM (in bytes, standing for <i>octets</i> in French) for sorting <code>.bam</code> file (integer, positive). <i>Note: Value can be null only if <code>--genomeLoad</code> option is unchanged, thus it will be set to the genome index size.</i> <br>
    <code>--runThreadN</code>, to limit the number of threads that <code>STAR</code> can use, it has to be set to the number of available cores
</blockquote>


In [42]:
## Code cell 35 ##

logfile="${logfolder}star_mapping_3samples-M35.log"

In [37]:
## Code cell 36 ##

echo "Screen output is redirected to ${logfile}"

# as time command does not redirect output
echo "operation starts at $(date)" >> ${logfile}

time for read1 in $(ls "${gohome}$USER/Results/fastp/"*_1.fastp.fastq.gz); do

    # handling names with the sample name
    samplenum=$(basename ${read1} | cut -d"_" -f1)
    echo "====== Processing sampleID: ${samplenum}..." | tee -a ${logfile}
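    # replace only the file name suffix, so that a "_1" elsewhere in the path is left untouched
    read2="${read1%_1.fastp.fastq.gz}_2.fastp.fastq.gz"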

    echo "STAR starts at $(date)" >> ${logfile}
    # STAR working
    STAR --runThreadN ${authorizedCPU} --runMode alignReads \
        --genomeDir "${indexfolder}" \
        --readFilesIn "${read1}" "${read2}" \
        --readFilesCommand zcat \
        --outFileNamePrefix "${mappedfolder}${samplenum}_" \
        --outSAMtype BAM SortedByCoordinate \
        --outSAMattributes All \
        --outReadsUnmapped Fastx \
        --limitBAMsortRAM ${authorizedRAM} \
        --quantMode GeneCounts \
        |& tee -a ${logfile}
    echo "STAR ends at $(date)" >> ${logfile}
    
    echo "...done" | tee -a ${logfile} 
done  

Screen output is redirected to /shared/projects/2413_rnaseq_cea/scaburet/Results/logfiles/star_mapping_3samples-M35.log
	/shared/ifbstor1/software/miniconda/envs/star-2.7.11a/bin/STAR-avx2 --runThreadN 13 --runMode alignReads --genomeDir /shared/projects/2413_rnaseq_cea/alldata/Reference/indexes_upto99bases/ --readFilesIn /shared/projects/2413_rnaseq_cea/scaburet/Results/fastp/SRR12730409_1.fastp.fastq.gz /shared/projects/2413_rnaseq_cea/scaburet/Results/fastp/SRR12730409_2.fastp.fastq.gz --readFilesCommand zcat --outFileNamePrefix /shared/projects/2413_rnaseq_cea/scaburet/Results/star/SRR12730409_ --outSAMtype BAM SortedByCoordinate --outSAMattributes All --outReadsUnmapped Fastx --limitBAMsortRAM 60000000000 --quantMode GeneCounts
	STAR version: 2.7.11a   compiled: 2023-09-15T02:58:53+0000 :/opt/conda/conda-bld/star_1694746407721/work/source
Jun 12 14:23:46 ..... started STAR run
Jun 12 14:23:46 ..... loading genome
Jun 12 14:24:37 ..... started mapping
Jun 12 14:37:36 ..... finished

In [43]:
## Code cell 37 ##

echo "operation ends at $(date)"
echo "operation ends at $(date)" >> ${logfile}

echo "=== files created during mapping step ===" >> ${logfile}
ls -lh "${mappedfolder}" >> ${logfile}

echo "STAR generated $(ls "${mappedfolder}" | wc -l) files during this step." \
     | tee -a ${logfile}

operation ends at Wed Jun 12 15:06:47 CEST 2024
STAR generated 27 files during this step.


### **3.4 - Additional lines to perform single sample mapping**

<div class="alert alert-block alert-info"><b> Info/help : </b><br><ul>
    <li>If you have <b>only one sample to map</b>, or if mapping failed for one sample, here is an additional Code cell without the loop:<a id="supplementalmapping"></a></li>
    <li>This additional cell can also be used to test mapping <b>at transcript level</b>:      
option <code>--quantMode TranscriptomeSAM GeneCounts</code> </li></ul></div> 

Then you can verify the output files with the name of the sample:
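
As the corresponding cells are not shown here, the sketch below gives one possible version, reusing the options of Code cell 36 for a single sample. The sample name is taken from the files listed in section 3.2; `fastpfolder`, `map_one_sample` and `DRYRUN` are hypothetical names. With `DRYRUN` set, the command is printed rather than run.

```bash
gohome="/shared/projects/2413_rnaseq_cea/"
fastpfolder="${gohome}${USER}/Results/fastp/"
mappedfolder="${gohome}${USER}/Results/star/"
indexfolder="${gohome}alldata/Reference/indexes_upto99bases/"
authorizedCPU=13
authorizedRAM=60000000000

samplenum="SRR12730409"    # sample to (re)map
DRYRUN=1                   # set to an empty value (DRYRUN=) to really run STAR

# Hypothetical wrapper mapping one paired-end sample
map_one_sample() {
    local sample=$1
    ${DRYRUN:+echo} STAR --runThreadN "${authorizedCPU}" --runMode alignReads \
        --genomeDir "${indexfolder}" \
        --readFilesIn "${fastpfolder}${sample}_1.fastp.fastq.gz" \
                      "${fastpfolder}${sample}_2.fastp.fastq.gz" \
        --readFilesCommand zcat \
        --outFileNamePrefix "${mappedfolder}${sample}_" \
        --outSAMtype BAM SortedByCoordinate \
        --outSAMattributes All \
        --outReadsUnmapped Fastx \
        --limitBAMsortRAM "${authorizedRAM}" \
        --quantMode GeneCounts
}

map_one_sample "${samplenum}"

# once STAR has really run, list the output files of this sample:
# ls -lh "${mappedfolder}${samplenum}_"*
```

For a transcript-level test, `--quantMode TranscriptomeSAM GeneCounts` would replace the last option.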

---
---

## 4 - Building samples ``.bam`` indexes with ``samtools``
---

We now have to index the ``.bam`` files to produce their companion ``.bai`` index files. Such index files make it faster, in particular, to visualize the alignments of a ``.bam`` file in a genome browser.

### **4.1 - Tool version**

The commands used for this part belong to a large package of utilities that are very useful to manage those types of files: **SAMTOOLS** (http://www.htslib.org/).

Let's check first which version of SAMTOOLS we are using:

In [39]:
## Code cell 40 ##

samtools --version

samtools 1.18
Using htslib 1.19
Copyright (C) 2023 Genome Research Ltd.

Samtools compilation details:
    Features:       build=configure curses=yes 
    CC:             /opt/conda/conda-bld/samtools_1696474357759/_build_env/bin/x86_64-conda-linux-gnu-cc
    CPPFLAGS:       -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -isystem /shared/ifbstor1/software/miniconda/envs/samtools-1.18/include
    CFLAGS:         -Wall -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /shared/ifbstor1/software/miniconda/envs/samtools-1.18/include -fdebug-prefix-map=/opt/conda/conda-bld/samtools_1696474357759/work=/usr/local/src/conda/samtools-1.18 -fdebug-prefix-map=/shared/ifbstor1/software/miniconda/envs/samtools-1.18=/usr/local/src/conda-prefix
    LDFLAGS:        -Wl,-O2 -Wl,--sort-common -Wl,--as-needed -Wl,-z,relro -Wl,-z,now -Wl,--disable-new-dtags -Wl,--gc-sections -Wl,--allow-shlib-undefined -Wl,-rpath,/shared/ifbstor1/software/minicon

Simple commandline syntax is: <code>samtools index path/to/file.bam</code>
  
There is no need to provide a name for the output file, as it is by default the same as the corresponding ``.bam`` file, except for the added ``.bai`` suffix.

### **4.2 - Creating files**

The only variable we need is the folder where the ``.bam`` files produced by <code>STAR</code> are saved:

In [40]:
## Code cell 41 ##

echo ${mappedfolder}

/shared/projects/2413_rnaseq_cea/scaburet/Results/star/


Now, we use a loop to index the ``_Aligned.sortedByCoord.out.bam`` files:

In [41]:
## Code cell 42 ##

logfile="${logfolder}samtools_indexing_samples.log"
echo "Screen output is redirected to ${logfile}"

# as time command does not redirect output
echo "operation starts at $(date)" >> ${logfile}

time for bamfile in $(ls "${mappedfolder}"*_Aligned.sortedByCoord.out.bam); do

    samplenum=$(basename ${bamfile} | cut -d"_" -f1)
    echo "====== Processing sampleID: ${samplenum}..." | tee -a ${logfile}
    
    echo "samtools index starts at $(date)" >> ${logfile}
    samtools index "${bamfile}" \
             &>> ${logfile}
    echo "samtools index ends at $(date)" >> ${logfile}
    
    echo "...done" | tee -a ${logfile} 
    
done

Screen output is redirected to /shared/projects/2413_rnaseq_cea/scaburet/Results/logfiles/samtools_indexing_samples.log
...done
...done
...done

real	1m21.367s
user	1m15.751s
sys	0m2.265s


In [44]:
## Code cell 43 ##

echo "operation ends at $(date)"
echo "operation ends at $(date)" >> ${logfile}

echo "=== files created during indexing step ===" >> ${logfile}
ls -lh "${mappedfolder}"*.bai >> ${logfile}

echo "samtools index generated $(ls "${mappedfolder}"*.bai | wc -l) files during this step." \
     | tee -a ${logfile}

operation ends at Wed Jun 12 15:08:09 CEST 2024
samtools index generated 3 files during this step.


<div class="alert alert-block alert-warning">
    If one or more <code>.bai</code> files are missing, there is probably an error in the matching <code>.bam</code> file. You should then look into the generated <code>.log</code> file. <br>
    When there is not enough disk space during the mapping process, the <code>.bam</code> file may be incomplete: you may find a <i>missing EOF block when one should be present</i> error for this sample. <br>
    Before starting the mapping step, make sure that you have at least 5 times more free space than the size of one sample's <code>.fastq</code> files (or 10 times if you activate <code>TranscriptomeSAM</code> along with <code>GeneCounts</code>).
</div>
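
The rule of thumb above can be turned into a quick check before mapping. This is a rough sketch: the helper name `needed_bytes` is hypothetical, the factor (5, or 10 with `TranscriptomeSAM`) comes from the note above, and the demo files are made up so that the example runs anywhere.

```bash
# Hypothetical helper: factor x total size (in bytes) of the given files.
needed_bytes() {
    local factor=$1 total=0 size file
    shift
    for file in "$@"; do
        size=$(wc -c < "${file}")
        total=$(( total + size ))
    done
    echo $(( factor * total ))
}

# Made-up demo files (1000 bytes each) standing in for the two fastq files of one sample
demofolder=$(mktemp -d)
printf '%1000s' ' ' > "${demofolder}/demo_1.fastq"
printf '%1000s' ' ' > "${demofolder}/demo_2.fastq"

needed=$(needed_bytes 5 "${demofolder}/demo_1.fastq" "${demofolder}/demo_2.fastq")

# free space (in bytes) on the file system that holds the output folder
available=$(( $(df -Pk "${demofolder}" | awk 'NR==2 {print $4}') * 1024 ))

echo "needed: ${needed} bytes, available: ${available} bytes"
if [ "${available}" -lt "${needed}" ]; then
    echo "WARNING: not enough free disk space for mapping"
fi
```

With real data, the two demo files would be replaced by the read files of one sample, and `demofolder` in the `df` call by `mappedfolder`.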

## 5 - Keep track of disk space usage

In [45]:
## Code cell 44 ##
du -ch -d3 ${gohome}$USER

596K	/shared/projects/2413_rnaseq_cea/scaburet/.ipynb_checkpoints
1.9M	/shared/projects/2413_rnaseq_cea/scaburet/Results/qualimap-11juin/bamqc
2.6G	/shared/projects/2413_rnaseq_cea/scaburet/Results/qualimap-11juin/rnaseq
2.6G	/shared/projects/2413_rnaseq_cea/scaburet/Results/qualimap-11juin
4.0K	/shared/projects/2413_rnaseq_cea/scaburet/Results/.ipynb_checkpoints
16G	/shared/projects/2413_rnaseq_cea/scaburet/Results/star
596K	/shared/projects/2413_rnaseq_cea/scaburet/Results/fastqc/.ipynb_checkpoints
6.2M	/shared/projects/2413_rnaseq_cea/scaburet/Results/fastqc
260K	/shared/projects/2413_rnaseq_cea/scaburet/Results/samtools-11juin/.ipynb_checkpoints
520K	/shared/projects/2413_rnaseq_cea/scaburet/Results/samtools-11juin
4.0K	/shared/projects/2413_rnaseq_cea/scaburet/Results/fastp/.ipynb_checkpoints
7.2G	/shared/projects/2413_rnaseq_cea/scaburet/Results/fastp
20M	/shared/projects/2413_rnaseq_cea/scaburet/Results/fastq_screen
5.2G	/shared/projects/2413_rnaseq_cea/scaburet/Results/star-11j

For the current project, we can use up to 70 GB in each personal folder. The next steps consume less space; nevertheless, we will be able to delete the initial fastq.gz files once the mapping is verified (next session).

---
___

## Conclusion


**Next Practical session**

Now we go on to check mapping quality.  
  
**=> Step 5: Post mapping Quality check** 

The jupyter notebook used for the next session will be *Pipe_05-bash_mapping-quality.ipynb*    
Let's retrieve it in our directory, in order to have a private copy to work on:   

In [None]:
## Code cell 45 ##   

cp "${gohome}pipeline/Pipe_05-bash_mapping-quality.ipynb" "${gohome}$USER/"



**Save executed notebook**

To end the session, save your executed notebook in your `run_notebooks` folder. Adjust the name to yours and change the cell below to a code cell to run it.

In [47]:
## Code cell 46 ##   
#mkdir -p /shared/projects/2413_rnaseq_cea/$USER/run_notebooks
cp /shared/projects/2413_rnaseq_cea/$USER/Pipe_04-bash_classical-reads-mapping.ipynb /shared/projects/2413_rnaseq_cea/$USER/run_notebooks/Pipe_04-bash_classical-reads-mapping-run.ipynb

<div class="alert alert-block alert-success"><b>Success:</b> Well done! You now know how to map reads to a genome.<br>
Don't forget to save your notebook and export a copy as an <b>html</b> file as well <br>
- Open "File" in the Menu<br>
- Select "Export Notebook As"<br>
- Export notebook as HTML<br>
- You can then open it in your browser even without being connected to the server! 
</div>

---
---

## Useful commands
<div class="alert alert-block alert-info"> 
    
- <kbd>CTRL</kbd>+<kbd>S</kbd> : save notebook<br>    
- <kbd>CTRL</kbd>+<kbd>ENTER</kbd> : Run Cell<br>  
- <kbd>SHIFT</kbd>+<kbd>ENTER</kbd> : Run Cell and Select Next<br>   
- <kbd>ALT</kbd>+<kbd>ENTER</kbd> : Run Cell and Insert Below<br>   
- <kbd>ESC</kbd>+<kbd>y</kbd> : Change to *Code* Cell Type<br>  
- <kbd>ESC</kbd>+<kbd>m</kbd> : Change to *Markdown* Cell Type<br> 
- <kbd>ESC</kbd>+<kbd>r</kbd> : Change to *Raw* Cell Type<br>    
- <kbd>ESC</kbd>+<kbd>a</kbd> : Create Cell Above<br> 
- <kbd>ESC</kbd>+<kbd>b</kbd> : Create Cell Below<br> 

<em>  
To make nice html reports with markdown: <a href="https://dillinger.io/" title="dillinger.io">html visualization tool 1</a> or <a href="https://stackedit.io/app#" title="stackedit.io">html visualization tool 2</a>, <a href="https://www.tablesgenerator.com/markdown_tables" title="tablesgenerator.com">to draw nice tables</a>, and the <a href="https://medium.com/analytics-vidhya/the-ultimate-markdown-guide-for-jupyter-notebook-d5e5abf728fd" title="Ultimate guide">Ultimate guide</a>. <br>
Further reading on JupyterLab notebooks: <a href="https://jupyterlab.readthedocs.io/en/latest/user/notebook.html" title="Jupyter Lab">Jupyter Lab documentation</a>.<br>   
</em>    
 
</div>

---
Bénédicte Noblet - 05-07 2021   
Sandrine Caburet - 02-05 2023   
Claire Vandiedonck - 01-06 2023  
Updated 13/06/2024 by SCaburet