# RNAseq training CEA - June 2023

IFB session: 5 CPU + 21 GB of RAM

# Part 4 : Mapping reads to genome


- 0.1 - About session for IFB core cluster
- 0.2 - Parameters to be set or modified by the user
- 1 - Reference genome and annotation files
- 2 - Building genome reference index files with ``STAR``
- 3 - Mapping samples on reference genome with ``STAR``
- 4 - Building samples ``.bam`` indexes with ``samtools``
- 5 - Keep track of disk space usage


<div class="alert alert-block alert-danger">
    Please note that Reference genome indexing is set for RNA sequencing with 100-base reads. <br>
    If your reads have a different length, set <code>rawreadlength</code> below to the read length of your dataset.
</div>

---
## **Before going further**

<div class="alert alert-block alert-danger"><b>Caution:</b> 
Before starting the analysis, save a backup copy of this notebook: in the left-hand panel, right-click on this file and select "Duplicate".<br>
You can also make backups during the analysis. Don't forget to save your notebook regularly: <kbd>Ctrl</kbd> + <kbd>S</kbd> or click on the 💾 icon.
</div>

---

## 0.1 - About session for IFB core cluster

<em>loaded JupyterLab</em> : Version 3.2.1

In [1]:
## Code cell 1 ##

echo "=== Cell launched on $(date) ==="

echo "=== Current IFB session size: Medium (5CPU, 21 GB) ==="
jobid=$(squeue -hu $USER | awk '/jupyter/ {print $1}')
sacct --format=JobID,AllocCPUS,NODELIST -j ${jobid}

=== Cell launched on Wed May 24 11:12:14 CEST 2023 ===
=== Current IFB session size: Medium (5CPU, 21 GB) ===
       JobID  AllocCPUS        NodeList 
------------ ---------- --------------- 
33528930              5     cpu-node-23 
33528930.ba+          5     cpu-node-23 
33528930.0            5     cpu-node-23 


In [2]:
## Code cell 2 ##

module load star/2.7.10b samtools/1.15.1

echo "===== download network files ====="
wget --version | head -n 1
echo "===== alignement tool ====="
STAR --version
echo "===== index construction + quality ====="
samtools --version

===== download network files =====
GNU Wget 1.14 built on linux-gnu.
===== alignement tool =====
2.7.10b
===== index construction + quality =====
samtools 1.15.1
Using htslib 1.15.1
Copyright (C) 2022 Genome Research Ltd.

Samtools compilation details:
    Features:       build=configure curses=yes 
    CC:             /opt/conda/conda-bld/samtools_1649352267887/_build_env/bin/x86_64-conda-linux-gnu-cc
    CPPFLAGS:       -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -isystem /shared/ifbstor1/software/miniconda/envs/samtools-1.15.1/include
    CFLAGS:         -Wall -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /shared/ifbstor1/software/miniconda/envs/samtools-1.15.1/include -fdebug-prefix-map=/opt/conda/conda-bld/samtools_1649352267887/work=/usr/local/src/conda/samtools-1.15.1 -fdebug-prefix-map=/shared/ifbstor1/software/miniconda/envs/samtools-1.15.1=/usr/local/src/conda-prefix
    LDFLAGS:        -Wl,-O2 -Wl,--sort-com

---

## 0.2 - Parameters to be set or modified by the user

- Using a full path ending with a `/`, **define the folder** where you will work in the `gohome` variable:

In [3]:
## Code cell 3 ##

gohome="/shared/projects/2312_rnaseq_cea/"

echo "=== Home root folder is ==="
echo "${gohome}"
echo ""
echo "=== Working (personal) folder tree ==="
tree -d -L 2 "${gohome}$USER"
echo "=== current working directory ==="
echo "${PWD}"

=== Home root folder is ===
/shared/projects/2312_rnaseq_cea/

=== Working (personal) folder tree ===
/shared/projects/2312_rnaseq_cea/scaburet
├── Data
│   ├── fastq
│   └── sra
└── Results
    ├── fastp
    ├── fastqc
    ├── fastq_screen
    ├── logfiles
    ├── multiqc
    ├── star
    └── star-old

11 directories
=== current working directory ===
/shared/ifbstor1/projects/2312_rnaseq_cea/pipeline


- **Reference genome and annotation files will be downloaded from the web** in [section 1.2](#downloadsection), so **don't forget to change the URLs of both files** in [section 1.1](#urlsection). The variables used to store the links are ``fagzurl`` and ``gtfgzurl``, respectively.

- To map reads efficiently, a set of indexes has to be created. These indexes depend on the read length, a parameter set by the experimenter on the sequencing platform.    
    **Change the read length value below if your reads have a different size**:   


In [4]:
## Code cell 4 ##

rawreadlength=100

- Please specify the **maximum number of CPUs** (central processing units, cores) and the **amount of RAM (in bytes)** that programs can use.<a id="computressources"></a>

<div class="alert alert-block alert-danger">
    <b>The following values are valid for a 5-CPU session with access to 21 GB of RAM</b>. Ideally, use 70-80% of the CPUs available on your system or session. DO NOT ask for more RAM than you can use.
</div>

In [6]:
## Code cell 5 ##

authorizedCPU=4  #4

authorizedRAM=20000000000  # 20GB
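
As a rough sketch of the 70-80% rule above, both values can be derived from the session size instead of being typed by hand. The names `sessionCPU` and `sessionRAMGB` are illustrative (they are not used elsewhere in the pipeline), and the 1 GB safety margin is an assumption chosen only because it reproduces the values of Code cell 5.

```bash
# Hypothetical helper variables describing the session (here: IFB "Medium")
sessionCPU=5
sessionRAMGB=21

# Keep about 80% of the cores (integer division rounds down): 5 -> 4
authorizedCPU=$(( sessionCPU * 80 / 100 ))

# Leave 1 GB of margin, then convert GB to bytes (1 GB = 10^9 bytes)
authorizedRAM=$(( (sessionRAMGB - 1) * 1000 * 1000 * 1000 ))

echo "authorizedCPU=${authorizedCPU}"
echo "authorizedRAM=${authorizedRAM}"
```

For another session size, only the two helper variables would need to change.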

---
## 1 - Reference genome and annotation files   

Classical mapping of RNAseq data relies on the use of a reference genome sequence (in <code>fasta</code> format) and a companion annotation file that contains the location of genomic features, such as genes and exons (in <code>gtf</code> or <code>gff</code> format). 
The Reference genome also has to be indexed by STAR, with indexes corresponding to the specific read length in the dataset.

   
**Here we will use pre-downloaded files and pre-computed indexes, in order to save time and disk space.**   
Sections 1.1 to 1.3 (downloading and extracting Reference files) and 2.1 to 2.4 (indexing the Reference file) are therefore only provided for later use in your own projects.   

To use those commands:   
* change the format of the cells from <code>raw</code> to <code>code</code>.  
* specify the correct read length in Code cell 4
* choose the correct species in Code cells 6 and 7.

**Section 1.4 is kept active to verify the Reference files.**


### **1.1 - Searching for urls on the web**

There are several websites to download reference genome and annotation files:<a id="urlsection"></a>
- <a href="https://www.gencodegenes.org/human/"><i>Gencode</i></a> from the European Bioinformatics Institute (EBI)  
- <a href="http://www.ensembl.org/info/data/ftp/index.html"><i>Accessing Ensembl Data</i></a> from the Ensembl project database  
- <a href="https://www.ncbi.nlm.nih.gov/genome/guide/human/"><i>Human Genome Resources</i></a> at NCBI  
- ... and maybe one day a common NCBI and Ensembl/Gencode release (MANE collaboration, <a href="https://ncbiinsights.ncbi.nlm.nih.gov/2020/11/02/ncbi-refseq-ensembl-gencode-mane-v0-92/#more-4781">a story beginning in 2020</a>)

We will use a **Primary assembly** (PRI) release. It includes chromosomes and scaffolds (candidate regions to be integrated or discarded in the next genome build).  
By contrast, the main annotation file is limited to chromosomes, while the extensive annotation file also includes all known haplotypes (for highly variable regions such as the MHC).

<div class="alert alert-block alert-warning">
    In this notebook, we use the <b>Gencode release</b>, which provides <b><code>.gz</code> compressed</b> files, and we choose the <b>GTF format</b> for the annotation file. <br>
    Feel free to choose any of the sources cited above, as long as the downloaded files follow the same file formats (if they don't, adapt the code cells of the next sections!).
</div>

<div class="alert alert-block alert-info">
    Nonetheless, please note that the annotation file format is still an open issue, as some relevant official sources contradict each other: 
    <ul>
        <li>
            For US Galaxy Main project: <a href="https://galaxyproject.org/learn/datatypes/#gtf">GTF</a> is the GFF version 2 while <a href="https://galaxyproject.org/learn/datatypes/#gff">GFF</a> is version 1 and <a href="https://galaxyproject.org/learn/datatypes/#gff3">GFF3</a> is the latest and 3rd version... 
        </li>
        <li>
        ... but the Broad Institute's IGV, like the UCSC genome browser, distinguishes between <a href="http://software.broadinstitute.org/software/igv/GFF">GFF2 and GTF formats</a>, <a href="https://genome.ucsc.edu/FAQ/FAQformat.html#format3"> the latter being only compatible with the former</a>.
        </li>
        <li>
            While both <a href="https://biocorecrg.github.io/PhD_course/gtf_format.html">GTF</a> and <a href="https://biostar.usegalaxy.org/p/28147/">GFF</a> formats have 9 columns, field in the ninth column is longer for <code>.gtf</code> files than for <code>.gff</code> files (<a href="https://genome.ucsc.edu/FAQ/FAQformat.html#format3">UCSC Genome browser documentation</a> and <a href="https://www.ensembl.org/info/website/upload/gff.html">ensembl documentation</a>).
        </li>
        <li>
            Even if both file formats have header lines, some tools do not support them (<a href="https://biostar.usegalaxy.org/p/28147/">second bullet point in last answer</a>) and the US Galaxy portal asks users to remove those lines before use (see the US Galaxy links above).
        </li>
        <li>
            <code>FeatureCounts</code> (a downstream tool we will use) only <a href="http://bioinf.wehi.edu.au/featureCounts/">works with GTF files</a>. This tool expects to find <i>exon</i> in the <i>features</i> column (both GFF and GTF!) and <i>gene_id</i> as a gene identifier (missing in GFF), see <a href="https://biostar.usegalaxy.org/p/28094/index.html#28099">item 4 in latest answer</a>.
        </li>
    </ul>
</div>

To use the latest genome release for your analyses, please go to Gencode's <a href="https://www.gencodegenes.org/human/">download page</a> (or to the download page of the reference you chose) and adapt the URLs for:
- Primary annotation (this notebook was developed with a GTF file)
> in *GTF/GFF3 files* Gencode's chart: Comprehensive gene annotation > primary annotation > *gtf* file 

- Primary genome sequence file
> in *Fasta files* Gencode's chart: Genome sequence, primary assembly > *Fasta* file 

*Note*: You can get the URL with a right click on the download link, then *copy link to clipboard*.

<div class="alert alert-block alert-danger">
    Both files have to be retrieved from the same source, as sequence region names need to be identical in both, in order to avoid downstream analysis issues. <br>
    <i>In Gencode, this compatibility between files is specified in the <b>fasta description field</b></i>.
</div>

### **1.2 - Retrieve files with ``wget``**

We will download those files in a distinct folder:<a id="downloadsection"></a>

<ul class="alert alert-block alert-info">
    <li>
        Sometimes (often?!), other users' issues help us understand a command better than its manual. For instance, a Unix &amp; Linux Stack Exchange <a href="https://unix.stackexchange.com/questions/23501/download-using-wget-to-a-different-directory-than-current-directory">thread</a> about the <code>wget</code> command and how to write into a chosen output folder. 
    </li>
</ul>

We only use these two options:
> ``-P PREFIX`` or ``--directory-prefix=PREFIX`` to specify output folder  
> ``-N`` or ``--timestamping``: don't re-retrieve files unless newer than local  

Some other available options exist and among them this one:
> ``-a FILE`` or ``--append-output=FILE`` to append messages to FILE  
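
The two options used here can be combined as in the sketch below. The links are placeholders (hypothetical `example.org` addresses) to be replaced with the links copied in section 1.1, and writing the downloads directly into `reffolder` is an assumption about the folder layout. The `echo` in front of `wget` only prints the command, so nothing is downloaded as written.

```bash
# Placeholder links (hypothetical): paste the links copied from the download page
fagzurl="https://example.org/path/to/genome_sequence.fa.gz"
gtfgzurl="https://example.org/path/to/genome_annotation.gtf.gz"

# Destination folder, following the layout used in this notebook
gohome="${gohome:-/shared/projects/2312_rnaseq_cea/}"
reffolder="${gohome}allData/Reference/"

for url in "${fagzurl}" "${gtfgzurl}"; do
    # -N: skip the download if the local copy is up to date
    # -P: write the file into the given folder
    # "echo" only prints the command: remove it to really start the download
    echo wget -N -P "${reffolder}" "${url}"
done
```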

### **1.3 - Extract archive files**

``STAR``, like other downstream tools, can't deal with compressed reference files.  
Extracted files are quite big, whereas compressed ones have a more manageable size. So we keep the compressed files as they are, and give the extracted files simpler names (easier to handle, in particular when changing release and/or database source).

- Primary annotation (notebook developed with GTF file)

- Primary genome sequence file
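
A minimal sketch of this extraction, run on a toy compressed file created in a temporary folder (file content and the `toy_release.fa.gz` name are invented for the illustration). `gunzip -c` writes the extracted content to standard output, so the compressed file is kept and the extracted copy can be given a simpler name such as `genome_sequence.fa`.

```bash
# Toy compressed file created in a temporary folder (illustration only)
tmpdir=$(mktemp -d)
printf '>chr1 1\nNNNNNNNN\n' | gzip > "${tmpdir}/toy_release.fa.gz"

# Extract to a simpler file name and keep the compressed file untouched
mkdir -p "${tmpdir}/extracted"
gunzip -c "${tmpdir}/toy_release.fa.gz" > "${tmpdir}/extracted/genome_sequence.fa"

# Both the compressed and the extracted files are now present
ls -R "${tmpdir}"
```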

### **1.4 - Verify downloaded files**
Let's have a look at these files to check that they match what we expect (or simply to discover the file formats).

In [7]:
## Code cell 14 ##

reffolder="${gohome}allData/Reference/"
gtffile="${reffolder}extracted/genome_annotation.gtf"
fastafile="${reffolder}extracted/genome_sequence.fa"

- Primary annotation (notebook developed with GTF file)

In [7]:
## Code cell 15 ##

head ${gtffile}

##description: evidence-based annotation of the mouse genome (GRCm39), version M32 (Ensembl 109)
##provider: GENCODE
##contact: gencode-help@ebi.ac.uk
##format: gtf
##date: 2022-12-06
chr1	HAVANA	gene	3143476	3144545	.	+	.	gene_id "ENSMUSG00000102693.2"; gene_type "TEC"; gene_name "4933401J01Rik"; level 2; mgi_id "MGI:1918292"; havana_gene "OTTMUSG00000049935.1";
chr1	HAVANA	transcript	3143476	3144545	.	+	.	gene_id "ENSMUSG00000102693.2"; transcript_id "ENSMUST00000193812.2"; gene_type "TEC"; gene_name "4933401J01Rik"; transcript_type "TEC"; transcript_name "4933401J01Rik-201"; level 2; transcript_support_level "NA"; mgi_id "MGI:1918292"; tag "basic"; tag "Ensembl_canonical"; havana_gene "OTTMUSG00000049935.1"; havana_transcript "OTTMUST00000127109.1";
chr1	HAVANA	exon	3143476	3144545	.	+	.	gene_id "ENSMUSG00000102693.2"; transcript_id "ENSMUST00000193812.2"; gene_type "TEC"; gene_name "4933401J01Rik"; transcript_type "TEC"; transcript_name "4933401J01Rik-201"; exon_number 1; exon_id "

- Primary genome sequence file

In [8]:
## Code cell 16 ##

head ${fastafile}

>chr1 1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
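
Beyond this visual check, one possible way to verify that sequence region names match between the two files (see section 1.1) is to compare them with `comm`. The sketch below runs on two toy files whose content is invented for the illustration; with real data, `toyfasta` and `toygtf` would be replaced by `fastafile` and `gtffile`.

```bash
# Toy files standing in for the real Reference files (illustration only)
tmpdir=$(mktemp -d)
toyfasta="${tmpdir}/toy_sequence.fa"
toygtf="${tmpdir}/toy_annotation.gtf"
printf '>chr1 1\nACGT\n>chr2 2\nACGT\n' > "${toyfasta}"
printf '##format: gtf\nchr1\tHAVANA\tgene\t1\t4\t.\t+\t.\tgene_id "toy1";\nchrM\tHAVANA\tgene\t1\t4\t.\t+\t.\tgene_id "toy2";\n' > "${toygtf}"

# Sequence names in the fasta file: header lines, without ">" and description
grep '^>' "${toyfasta}" | sed 's/^>//' | cut -d' ' -f1 | sort -u > "${tmpdir}/fasta_names.txt"

# Sequence names in the GTF file: first column of the non-comment lines
grep -v '^#' "${toygtf}" | cut -f1 | sort -u > "${tmpdir}/gtf_names.txt"

# Names used in the annotation but absent from the genome sequence
missing=$(comm -13 "${tmpdir}/fasta_names.txt" "${tmpdir}/gtf_names.txt")
echo "Sequence names found only in the annotation: ${missing:-none}"
```

In this toy example, `chrM` is annotated but has no sequence, which is the kind of mismatch that would cause downstream issues.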


---
## 2 - Building genome reference index files with <code>STAR</code>

Indexes are files that tell a program where to look for data in a large data file. They are required by mapping algorithms, as they speed up the processing of millions of reads.

### **2.1 - Tool version and command line presentation**

To create reference genome files, the default command is: <br>
<code>STAR --runMode genomeGenerate --genomeDir destination/folder \
      --genomeFastaFiles path/to/sequence.fa
</code>

<blockquote>
    <code>--runMode genomeGenerate</code>, to switch to the indexing step; otherwise <code>STAR</code> runs by default in <code>alignReads</code> mode (mapping step) <br>
    <code>--genomeDir</code>, to specify the folder where the reference genome indexes are written <br>
    <code>--genomeFastaFiles</code>, path to the reference genome fasta file (DOES NOT work with gz files)
</blockquote>

We are working on RNAseq data and need files that take splice junctions into account. Thus, we need to add the following 2 parameters: <br>
<code>--sjdbGTFfile path/to/annotation.file --sjdbOverhang readlengthnum</code>

They stand for:
<blockquote>
    <code>--sjdbGTFfile</code>, to specify where to find the annotation file that contains exon positions, and thus to place splice junctions along the genome sequence <br>
    <code>--sjdbOverhang</code>, the length of the sequence used on each side of a splice junction (<em>ideally, mate length-1</em>)
</blockquote>

For STAR, we can specify those two additional options either when generating genome index files or when mapping samples. As we may be limited in computational resources, we add them here, which avoids repeating a memory-consuming operation later for each sample during mapping.

### **2.2 - Preparing command line variables**

The dataset used to develop this pipeline contains 100-base reads.  
If this is not the case for your dataset, change the value in Code cell 4.

We will then create a folder for these genome index files, with an explicit name for later use:
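
A minimal sketch of this step, assuming the naming pattern of the pre-computed folder used later in this notebook (`indexes_upto99bases/` for 100-base reads). The variable name `sjdboverhang` is an illustrative choice, and the folder creation is left commented out.

```bash
gohome="${gohome:-/shared/projects/2312_rnaseq_cea/}"
reffolder="${gohome}allData/Reference/"
rawreadlength="${rawreadlength:-100}"

# Ideal --sjdbOverhang value: read (mate) length minus 1
sjdboverhang=$(( rawreadlength - 1 ))

# Explicit folder name that recalls the overhang used for indexing
indexfolder="${reffolder}indexes_upto${sjdboverhang}bases/"
echo "Indexes would be written to ${indexfolder}"
# mkdir -p "${indexfolder}"   # uncomment to create the folder
```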

Let's verify ``fastafile`` and ``gtffile`` variables, defined when extracting compressed files in Code cell 14.  

Let's verify, or set if not done in the [**Parameters** section](#computressources), the **number of CPUs** (central processing units, cores) and the **amount of RAM (in bytes)** that the next multithreaded program is allowed to use.  
<div class="alert alert-block alert-danger">
    <b>The following values are valid for a 5-CPU session with access to 21 GB of RAM</b>. Ideally, use 70-80% of the CPUs available on your system or session. DO NOT ask for more RAM than you can use.
</div>

If you have limited computing resources, change the following parameters directly in the command cell below, or otherwise [set the ``authorizedCPU`` and ``authorizedRAM`` values](#computressources) in Code cell 5.
<blockquote>
    <code>--limitGenomeGenerateRAM</code>, to set the maximum available RAM (in bytes, standing for <i>octets</i> in French) for genome generation (strictly positive integer, default value: 31000000000) <br>
    <code>--runThreadN</code>, to limit the number of threads that <code>STAR</code> can use. On the IFB, it has to be set to the number of available cores
</blockquote>


### **2.3 - Indexing Reference genome**

If there is any issue, start with ``Log.out`` among all the output files that STAR writes. It is a plain text file that records the command line as STAR understood it. It is quite verbose, which is very helpful!
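
For reference, the options presented in sections 2.1 and 2.2 can be combined as in the sketch below. This is not necessarily the exact cell that was used to build the pre-computed indexes. The `sjdboverhang` and `indexstatus` names, and the guards that skip the run when STAR or the Reference files are absent or when an index already exists, are assumptions added so that the cell cannot overwrite existing files.

```bash
gohome="${gohome:-/shared/projects/2312_rnaseq_cea/}"
reffolder="${gohome}allData/Reference/"
fastafile="${reffolder}extracted/genome_sequence.fa"
gtffile="${reffolder}extracted/genome_annotation.gtf"
rawreadlength="${rawreadlength:-100}"
authorizedCPU="${authorizedCPU:-4}"
authorizedRAM="${authorizedRAM:-20000000000}"

sjdboverhang=$(( rawreadlength - 1 ))
indexfolder="${reffolder}indexes_upto${sjdboverhang}bases/"

# Run only if STAR and both input files exist and no index is already present
if command -v STAR > /dev/null 2>&1 \
   && [ -f "${fastafile}" ] && [ -f "${gtffile}" ] \
   && [ ! -e "${indexfolder}Genome" ]; then
    mkdir -p "${indexfolder}"
    STAR --runMode genomeGenerate \
        --runThreadN "${authorizedCPU}" \
        --limitGenomeGenerateRAM "${authorizedRAM}" \
        --genomeDir "${indexfolder}" \
        --genomeFastaFiles "${fastafile}" \
        --sjdbGTFfile "${gtffile}" \
        --sjdbOverhang "${sjdboverhang}"
    indexstatus="launched"
else
    indexstatus="skipped"
    echo "STAR or Reference files not found, or index already present: indexing skipped"
fi
```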

### **2.4 - Extracted Reference genome file removal**

We can verify the disk space that Reference files use:

The extracted genome file is no longer needed after this step, so we can remove it to save space:
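
These two operations could look like the sketch below, shown on a toy Reference folder created in a temporary location (the `tmpref` name and the file content are invented for the illustration). Only the extracted sequence file is removed; the compressed copy stays available in case the indexes have to be rebuilt.

```bash
# Toy Reference folder created in a temporary location (illustration only)
tmpref=$(mktemp -d)
mkdir -p "${tmpref}/extracted"
printf '>chr1 1\nNNNNNNNN\n' > "${tmpref}/extracted/genome_sequence.fa"
printf '>chr1 1\nNNNNNNNN\n' | gzip > "${tmpref}/toy_release.fa.gz"

# Disk space used by each sub-folder, with a grand total on the last line
du -ch "${tmpref}"

# Remove only the extracted sequence: the compressed copy is kept
rm "${tmpref}/extracted/genome_sequence.fa"
ls -R "${tmpref}"
```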

## 3 - Mapping samples on reference genome with <code>STAR</code>

### **3.1 - Tool version and command line presentation**

A quick stop to check the ``STAR`` version, as you may have skipped genome indexing:

In [42]:
## Code cell 26 ##

STAR --version

2.7.10b


A rather simple version of the command line for mapping is: <br>
<code>STAR --genomeDir path/to/indexes/folder/ \
      --readFilesIn path/to/read1.fastq.gz path/to/read2.fastq.gz \
      --readFilesCommand zcat \
      --outSAMtype BAM SortedByCoordinate \
      --quantMode GeneCounts
</code>

<blockquote>
    <code>--readFilesIn</code> for <code>Read</code> (for Single End data) or both <code>Read1 Read2</code> (for Paired End data) as full paths to the files that contain input reads  <br> 
    <code>--readFilesCommand</code>, to indicate the command that can read the input file format. <code>STAR</code> can use compressed files directly, but relies on an available decompression tool (here <code>zcat</code>) <br>
    <br>
    <code>--outSAMtype word1 word2</code>, to set the output file format we want (default, SAM). <br>
    Options for <code>word1</code> are <code>BAM</code>, <code>SAM</code> and <code>None</code> (no SAM/BAM output). <br>
    Options for <code>word2</code> are <code>Unsorted</code> or <code>SortedByCoordinate</code>. Sorting allocates extra memory, which can be limited with <code>--limitBAMsortRAM</code>.<br>
    <br>
    <code>--quantMode</code> (default, <i>none</i>), to request one or several quantification outputs.  <br>
    Available options are: <code>GeneCounts</code> and <code>TranscriptomeSAM</code>. The latter writes alignments to the transcriptome into a separate SAM/BAM file, while the former only generates a text file with read counts per gene.
</blockquote>

As the ``_Aligned.toTranscriptome.out.bam`` files generated for downstream transcript-level analysis are as big as or bigger than the ``_Aligned.sortedByCoord.out.bam`` files, and as only the latter are required for downstream quantification with ``FeatureCounts``, we will focus on the gene-level quantification mode.

If you want to use transcript-level quantification, we have previously used the options below successfully: <br>
<code>--quantMode TranscriptomeSAM GeneCounts</code>

### **3.2 - Preparing command line variables**

Let's check that we still have all ``.fastq.gz`` files where we left them. We count files that do not include *_removed* in their name:

In [19]:
## Code cell 27 ##

ls "${gohome}$USER/Results/fastp/"
ls "${gohome}$USER/Results/fastp/" | grep -v -e "_removed" | wc -l

SRR12730403_1.fastp.fastq.gz	    SRR12730404_fastp.json
SRR12730403_2.fastp.fastq.gz	    SRR12730404_removed.fastp.fastq.gz
SRR12730403_fastp.html		    SRR12730405_1.fastp.fastq.gz
SRR12730403_fastp.json		    SRR12730405_2.fastp.fastq.gz
SRR12730403_removed.fastp.fastq.gz  SRR12730405_fastp.html
SRR12730404_1.fastp.fastq.gz	    SRR12730405_fastp.json
SRR12730404_2.fastp.fastq.gz	    SRR12730405_removed.fastp.fastq.gz
SRR12730404_fastp.html
12


We also check that the ``gtffile`` and ``indexfolder`` variables contain the proper paths to the Reference features and indexes: 

In [20]:
## Code cell 28 ##

reffolder="${gohome}allData/Reference/"
gtffile="${reffolder}extracted/genome_annotation.gtf"
indexfolder="${reffolder}indexes_upto99bases/"


ls -lh "${indexfolder}"

total 25G
-rw-rw----+ 1 scaburet scaburet  459 May 23 17:12 chrLength.txt
-rw-rw----+ 1 scaburet scaburet 1008 May 23 17:12 chrNameLength.txt
-rw-rw----+ 1 scaburet scaburet  549 May 23 17:12 chrName.txt
-rw-rw----+ 1 scaburet scaburet  667 May 23 17:12 chrStart.txt
-rw-rw----+ 1 scaburet scaburet  30M May 23 17:12 exonGeTrInfo.tab
-rw-rw----+ 1 scaburet scaburet  12M May 23 17:12 exonInfo.tab
-rw-rw----+ 1 scaburet scaburet 2.4M May 23 17:12 geneInfo.tab
-rw-rw----+ 1 scaburet scaburet 2.7G May 23 17:26 Genome
-rw-rw----+ 1 scaburet scaburet  981 May 23 17:26 genomeParameters.txt
-rw-rw----+ 1 scaburet scaburet  29K May 23 17:26 Log.out
-rw-rw----+ 1 scaburet scaburet  21G May 23 17:26 SA
-rw-rw----+ 1 scaburet scaburet 1.5G May 23 17:26 SAindex
-rw-rw----+ 1 scaburet scaburet 8.1M May 23 17:23 sjdbInfo.txt
-rw-rw----+ 1 scaburet scaburet 8.8M May 23 17:12 sjdbList.fromGTF.out.tab
-rw-rw----+ 1 scaburet scaburet 7.2M May 23 17:23 sjdbList.out.tab
-rw-rw----+ 1 scaburet scaburet  10M M

Now we create a destination folder for aligned ``.bam`` and other output files:

In [22]:
## Code cell 29 ##

mappedfolder="${gohome}$USER/Results/star/"
mkdir -p "${mappedfolder}"

... and reset matching ``Results/`` destination folder for log files...

In [21]:
## Code cell 30 ##

logfolder="${gohome}$USER/Results/logfiles/"

Let's verify, or set if not done in the [**Parameters** section](#computressources), the **number of CPUs** (central processing units, cores) and the **amount of RAM (in bytes)** that the next multithreaded program is allowed to use.  
<div class="alert alert-block alert-danger">
    <b>The following values are valid for a 5-CPU session with access to 21 GB of RAM</b>. Ideally, use 70-80% of the CPUs available on your system or session. DO NOT ask for more RAM than you can use.
</div>

In [11]:
## Code cell 31 ##

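# keep the value set in Code cell 5, or fall back to 4 CPUs if it is not set
authorizedCPU=${authorizedCPU:-4}
echo ${authorizedCPU}


# keep the value set in Code cell 5, or fall back to 20 GB if it is not set
authorizedRAM=${authorizedRAM:-20000000000}  # 20GB
echo ${authorizedRAM}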

4
20000000000


### **3.3 - Running command line for mapping with <code>STAR</code>**

If you have limited computing resources, please change the following parameters directly in the command cell below, or otherwise [set the ``authorizedCPU`` and ``authorizedRAM`` values](#computressources) in Code cell 5. 
<blockquote>
    <code>--limitBAMsortRAM</code>, to set the maximum available RAM (in bytes, standing for <i>octets</i> in French) for sorting the <code>.bam</code> file (non-negative integer). <i>Note: the value can be 0 only if the <code>--genomeLoad</code> option is left unchanged; it is then set to the genome index size.</i> <br>
    <code>--runThreadN</code>, to limit the number of threads that <code>STAR</code> can use, it has to be set to the number of available cores
</blockquote>


In [23]:
## Code cell 32 ##

logfile="${logfolder}star_mapping_samples.log"

In [41]:
## Code cell 33 ##

echo "Screen output is redirected to ${logfile}"

# as time command does not redirect output
echo "operation starts at $(date)" >> ${logfile}

time for read1 in "${gohome}allData/Results/fastp/"*_1.fastp.fastq.gz; do

    # handling names with the sample name
    samplenum=$(basename "${read1}" | cut -d"_" -f1)
    echo "====== Processing sampleID: ${samplenum}..." | tee -a ${logfile}
    # second mate: only the "_1" located just before ".fastp.fastq.gz" is replaced
    read2=$(echo "${read1}" | sed 's#_1\.fastp\.fastq\.gz$#_2.fastp.fastq.gz#')

    echo "STAR starts at $(date)" >> ${logfile}
    # STAR working
    STAR --runThreadN ${authorizedCPU} --runMode alignReads \
        --genomeDir "${indexfolder}" \
        --readFilesIn "${read1}" "${read2}" \
        --readFilesCommand zcat \
        --outFileNamePrefix "${mappedfolder}${samplenum}_" \
        --outSAMtype BAM SortedByCoordinate \
        --outSAMattributes All \
        --outReadsUnmapped Fastx \
        --limitBAMsortRAM ${authorizedRAM} \
        --quantMode GeneCounts \
        |& tee -a ${logfile}
    echo "STAR ends at $(date)" >> ${logfile}
    
    echo "...done" | tee -a ${logfile} 
    
done
echo "operation ends at $(date)" >> ${logfile}

echo "=== files created during mapping step ===" >> ${logfile}
ls -lh "${mappedfolder}" >> ${logfile}

echo "STAR generated $(ls "${mappedfolder}" | wc -l) files during this step." \
     | tee -a ${logfile}

Screen output is redirected to /shared/projects/2312_rnaseq_cea/allData/Results/logfiles/star_mapping_samples.log
	STAR --runThreadN 45 --runMode alignReads --genomeDir /shared/projects/2312_rnaseq_cea/allData/Reference/indexes_upto99bases/ --readFilesIn /shared/projects/2312_rnaseq_cea/allData/Results/fastp/SRR12730403_1.fastp.fastq.gz /shared/projects/2312_rnaseq_cea/allData/Results/fastp/SRR12730403_2.fastp.fastq.gz --readFilesCommand zcat --outFileNamePrefix /shared/projects/2312_rnaseq_cea/allData/Results/star/SRR12730403_ --outSAMtype BAM SortedByCoordinate --outSAMattributes All --outReadsUnmapped Fastx --limitBAMsortRAM 50000000000 --quantMode GeneCounts
	STAR version: 2.7.10b   compiled: 2022-11-01T09:53:26-04:00 :/home/dobin/data/STAR/STARcode/STAR.master/source
May 23 18:33:42 ..... started STAR run
May 23 18:33:42 ..... loading genome
May 23 18:34:23 ..... started mapping
May 23 18:45:23 ..... finished mapping
May 23 18:45:24 ..... started sorting BAM
May 23 18:46:54 ..... 

### **3.4 - Additional lines to perform single sample mapping**

If you have only one sample to map, or if mapping failed for one sample, here is an additional Code cell without the loop:<a id="supplementalmapping"></a>

This additional cell can also be used to test mapping at transcript level:      
option ``--quantMode TranscriptomeSAM GeneCounts`` 

Then you can verify the output files with the name of the sample:
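
A possible version of such a cell is sketched below, reusing the options of Code cell 33 for a single sample and then listing the output files of that sample. The sample ID is taken from the listing in section 3.2 and should be adapted; the guard that skips the run when STAR or the read files are not found is an addition made for this sketch.

```bash
gohome="${gohome:-/shared/projects/2312_rnaseq_cea/}"
indexfolder="${indexfolder:-${gohome}allData/Reference/indexes_upto99bases/}"
mappedfolder="${mappedfolder:-${gohome}${USER}/Results/star/}"
authorizedCPU="${authorizedCPU:-4}"
authorizedRAM="${authorizedRAM:-20000000000}"

# Sample to (re)map: change this ID as needed
samplenum="SRR12730403"
read1="${gohome}allData/Results/fastp/${samplenum}_1.fastp.fastq.gz"
read2="${gohome}allData/Results/fastp/${samplenum}_2.fastp.fastq.gz"

if command -v STAR > /dev/null 2>&1 && [ -f "${read1}" ] && [ -f "${read2}" ]; then
    STAR --runThreadN "${authorizedCPU}" --runMode alignReads \
        --genomeDir "${indexfolder}" \
        --readFilesIn "${read1}" "${read2}" \
        --readFilesCommand zcat \
        --outFileNamePrefix "${mappedfolder}${samplenum}_" \
        --outSAMtype BAM SortedByCoordinate \
        --outSAMattributes All \
        --outReadsUnmapped Fastx \
        --limitBAMsortRAM "${authorizedRAM}" \
        --quantMode GeneCounts
    # Output files of this sample only
    ls -lh "${mappedfolder}${samplenum}_"*
else
    echo "STAR or read files for ${samplenum} not found: nothing was run"
fi
```

To test transcript-level output, the last option would become `--quantMode TranscriptomeSAM GeneCounts`.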

## 4 - Building samples ``.bam`` indexes with ``samtools``

We now have to index the ``.bam`` files to produce their companion ``.bai`` files. Such index files give fast access to any region of a ``.bam`` file, which in particular speeds up the display of alignments in a genome browser.

### **4.1 - Tool version**

The commands used for this part belong to a large package of utilities that are very useful to manage those types of files: SAMTOOLS (http://www.htslib.org/).

Let's check first which version of SAMTOOLS we are using:

In [13]:
## Code cell 36 ##

samtools --version

samtools 1.15.1
Using htslib 1.15.1
Copyright (C) 2022 Genome Research Ltd.

Samtools compilation details:
    Features:       build=configure curses=yes 
    CC:             /opt/conda/conda-bld/samtools_1649352267887/_build_env/bin/x86_64-conda-linux-gnu-cc
    CPPFLAGS:       -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -isystem /shared/ifbstor1/software/miniconda/envs/samtools-1.15.1/include
    CFLAGS:         -Wall -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /shared/ifbstor1/software/miniconda/envs/samtools-1.15.1/include -fdebug-prefix-map=/opt/conda/conda-bld/samtools_1649352267887/work=/usr/local/src/conda/samtools-1.15.1 -fdebug-prefix-map=/shared/ifbstor1/software/miniconda/envs/samtools-1.15.1=/usr/local/src/conda-prefix
    LDFLAGS:        -Wl,-O2 -Wl,--sort-common -Wl,--as-needed -Wl,-z,relro -Wl,-z,now -Wl,--disable-new-dtags -Wl,--gc-sections -Wl,--allow-shlib-undefined -Wl,-rpath,/shared/ifbstor1/soft

Simple commandline syntax is: <code>samtools index path/to/file.bam</code>
  
There is no need to provide a name for the output file, as it should always be the same as the corresponding ``.bam`` file, except for the added ``.bai`` suffix.

### **4.2 - Creating files**

The only variable we need is the folder where the ``.bam`` files produced by <code>STAR</code> are saved:

In [24]:
## Code cell 37 ##

echo ${mappedfolder}

/shared/projects/2312_rnaseq_cea/scaburet/Results/star/


Now, we use a loop to index the ``_Aligned.sortedByCoord.out.bam`` files:

In [15]:
## Code cell 38 ##

logfile="${logfolder}samtools_indexing_samples.log"
echo "Screen output is redirected to ${logfile}"

# as time command does not redirect output
echo "operation starts at $(date)" >> ${logfile}

time for bamfile in "${mappedfolder}"*_Aligned.sortedByCoord.out.bam; do

    samplenum=$(basename "${bamfile}" | cut -d"_" -f1)
    echo "====== Processing sampleID: ${samplenum}..." | tee -a ${logfile}
    
    echo "samtools index starts at $(date)" >> ${logfile}
    samtools index "${bamfile}" \
             &>> ${logfile}
    echo "samtools index ends at $(date)" >> ${logfile}
    
    echo "...done" | tee -a ${logfile} 
    
done
echo "operation ends at $(date)" >> ${logfile}

echo "=== files created during indexing step ===" >> ${logfile}
ls -lh "${mappedfolder}"*.bai >> ${logfile}

echo "samtools index generated $(ls "${mappedfolder}"*.bai | wc -l) files during this step." \
     | tee -a ${logfile}

Screen output is redirected to /shared/projects/2312_rnaseq_cea/allData/Results/logfiles/samtools_indexing_samples.log
...done
...done
...done
...done
...done
...done
...done
...done
...done
...done
...done

real	12m30.564s
user	12m4.415s
sys	0m24.725s
samtools index generated 11 files during this step.
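
To find out which samples have no index, a loop such as the one sketched below can compare each ``.bam`` file with its expected ``.bai`` companion. It runs here on empty toy files created in a temporary folder (sample names invented for the illustration); the `checkfolder` and `missingbai` names are illustrative. The `samtools quickcheck` line is left as a comment because it only makes sense on real files.

```bash
# Toy folder with two fake .bam files, only one of them indexed (illustration only)
tmpdir=$(mktemp -d)
touch "${tmpdir}/sampleA_Aligned.sortedByCoord.out.bam" \
      "${tmpdir}/sampleA_Aligned.sortedByCoord.out.bam.bai" \
      "${tmpdir}/sampleB_Aligned.sortedByCoord.out.bam"

# With real data, use the mapping folder instead: checkfolder="${mappedfolder}"
checkfolder="${tmpdir}/"

missingbai=""
for bamfile in "${checkfolder}"*_Aligned.sortedByCoord.out.bam; do
    if [ ! -f "${bamfile}.bai" ]; then
        samplenum=$(basename "${bamfile}" | cut -d"_" -f1)
        missingbai="${missingbai} ${samplenum}"
        echo "No index found for ${samplenum}"
        # On a real file, this reports truncated files (e.g. missing EOF block):
        # samtools quickcheck -v "${bamfile}"
    fi
done
echo "Samples without .bai file:${missingbai:- none}"
```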


<div class="alert alert-block alert-warning">
    If one or more <code>.bai</code> files are missing, there is probably an error in the matching <code>.bam</code> file. In that case, have a look at the generated <code>.log</code> file. <br>
    When there is not enough disk space during the mapping process, the <code>.bam</code> file may be incomplete: you will then find a <i>missing EOF block when one should be present</i> error for this sample. <br>
    Before starting the mapping step, make sure that you have at least 5 times more free space than the size of one sample's <code>.fastq</code> files (or 10 times if you activate <code>TranscriptomeSAM</code> along with <code>GeneCounts</code>).
</div>

## 5 - Keep track of disk space usage

In [18]:
du -ch -d3 ${gohome}$USER

1.3M	/shared/projects/2312_rnaseq_cea/scaburet/Results/multiqc/2_fastp-fastq-files-3samples_plots
1.8M	/shared/projects/2312_rnaseq_cea/scaburet/Results/multiqc/3_fastp-fastq-files_with_fastqscreen-3samples_data
1.3M	/shared/projects/2312_rnaseq_cea/scaburet/Results/multiqc/3_fastp-fastq-files_with_fastqscreen-3samples_plots
1.8M	/shared/projects/2312_rnaseq_cea/scaburet/Results/multiqc/3_fastp-fastq-files_with_fastqscreen-3samples_data_1
480K	/shared/projects/2312_rnaseq_cea/scaburet/Results/multiqc/1_raw-fastq-files_data
1.3M	/shared/projects/2312_rnaseq_cea/scaburet/Results/multiqc/3_fastp-fastq-files_with_fastqscreen-3samples_plots_1
972K	/shared/projects/2312_rnaseq_cea/scaburet/Results/multiqc/1_raw-fastq-files_plots
1.8M	/shared/projects/2312_rnaseq_cea/scaburet/Results/multiqc/2_fastp-fastq-files-3samples_data
1.1M	/shared/projects/2312_rnaseq_cea/scaburet/Results/multiqc/.ipynb_checkpoints
18M	/shared/projects/2312_rnaseq_cea/scaburet/Results/multiqc
8.0K	/shared/projects/2312

For the current project, we can use up to 70 GB in each personal folder. The next steps consume less space; nevertheless, we will be able to delete the initial fastq.gz files once the mapping is verified (next session).
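
As a small sketch, the space already used can be expressed as a percentage of this limit. The limit is assumed here to mean 70 × 10^9 bytes, and the `quotabytes`, `checkedfolder` and `usedbytes` names are illustrative. If the personal folder cannot be found, the current directory is measured instead.

```bash
# Limit stated above, assumed here to mean 70 * 10^9 bytes
quotabytes=$(( 70 * 1000 * 1000 * 1000 ))

# Folder to measure: the personal folder if it exists, else the current directory
if [ -n "${gohome:-}" ] && [ -d "${gohome}${USER}" ]; then
    checkedfolder="${gohome}${USER}"
else
    checkedfolder="."
fi

# 'du -sb' (GNU coreutils) gives the total apparent size in bytes
usedbytes=$(du -sb "${checkedfolder}" | cut -f1)
echo "${checkedfolder}: $(( usedbytes * 100 / quotabytes ))% of the limit is used"
```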

---
___

Now we go on to check mapping quality.  
  
**=> Step 5: Post mapping Quality check** 

The jupyter notebook used for the next session will be *Pipe_5-bash_mapping-quality.ipynb*    
Let's retrieve it in our directory, in order to have a private copy to work on:   

In [25]:
## Code cell 38 ##   

cp "${gohome}pipeline/Pipe_5-bash_mapping-quality.ipynb" "${gohome}$USER/"

Bénédicte Noblet - 05-07 2021   
Sandrine Caburet - 02-05 2023   
Updated 24/05/2023