# RNA-seq training CEA - June 2023

IFB session: 5 CPUs + 21 GB of RAM

# Part 3 : Preprocessing reads: cleaning and checking quality


- 0.1 - About session on IFB core cluster  
- 0.2 - Parameters to be set or modified by the user     
- 1 - Cleaning reads, with quality control, thanks to `fastp`   
- 2 - MultiQC report summary post prepping   
- 3 - Checking for species contamination   
- 4 - MultiQC summary report after `fastq-screen`   
- 5 - Current disk usage situation after preprocessing   


---
## **Before going further**

<div class="alert alert-block alert-danger"><b>Caution:</b> 
Before starting the analysis, save a backup copy of this notebook: in the left-hand panel, right-click on this file and select "Duplicate".<br>
You can also make backups during the analysis. Don't forget to save your notebook regularly: <kbd>Ctrl</kbd> + <kbd>S</kbd> or click on the 💾 icon.
</div>

---

## 0.1 - Preparing session for IFB core cluster   

<em>loaded JupyterLab</em> : Version 3.2.1

In [None]:
## Code cell 1 ##

echo "=== Cell launched on $(date) ==="

echo "=== Current IFB session size ==="
jobid=$(squeue -hu $USER | awk '/jupyter/ {print $1}')
sacct --format=JobID,AllocCPUS,NODELIST -j ${jobid}

In [None]:
## Code cell 2 ## 

module load fastp/0.23.1 multiqc/1.13 fastq-screen/0.13.0

echo "===== cleaning, pre- and post-qualities ====="
fastp --version
echo "===== compiled report ====="
multiqc --version
echo "===== checking species contamination ====="  # not used, see below
fastq_screen --version 

---

## 0.2 - Parameters to be set or modified by the user

- Using a full path with a `/` at the end, **define the folder** where you want or have to work with the `gohome` variable:

In [None]:
## Code cell 3 ## 

gohome="/shared/projects/2312_rnaseq_cea/"

echo "=== Home root folder is ==="
echo "${gohome}"
echo ""
echo "=== Working (personal) folder tree ==="
tree -L 2 "${gohome}$USER"
echo "=== current working directory ==="
echo "${PWD}"
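
Later cells build paths by plain concatenation (for instance `${gohome}$USER`), so a missing trailing `/` would silently produce a wrong path. A minimal sketch of a safeguard is shown here; the helper name `with_trailing_slash` is hypothetical and not part of the notebook.

```shell
# Hypothetical helper (not part of the original notebook):
# strip one trailing slash if present, then append exactly one.
with_trailing_slash() {
    local path="$1"
    echo "${path%/}/"
}

# Works whether or not the user typed the final slash.
gohome=$(with_trailing_slash "/shared/projects/2312_rnaseq_cea")
echo "${gohome}"
```

The same call can wrap any folder variable defined by hand before it is concatenated with other path elements.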

- Please specify the **maximum number of CPUs** (central processing units, cores) that programs can use.

<div class="alert alert-block alert-warning">
    A value of <b>4 is valid for a 5-CPU session</b>. Ideally, use 70-80% of the available CPUs your system or session has.
</div>
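
As an illustration of the 70-80% rule, the value could be derived from the session size instead of being typed by hand. This is only a sketch: the helper `suggest_cpu` is hypothetical, and it assumes that either the SLURM variable `SLURM_CPUS_PER_TASK` is set or that `nproc` reflects the CPUs allocated to the session, which should be checked on your own system.

```shell
# Hypothetical helper: 80% of the given CPU count, rounded down, never below 1.
suggest_cpu() {
    local total="$1"
    local n=$(( total * 80 / 100 ))
    if (( n < 1 )); then
        n=1
    fi
    echo "${n}"
}

# Assumption: SLURM_CPUS_PER_TASK is defined inside the job; otherwise fall back
# to nproc, and to 1 if nproc is not available.
total_cpu="${SLURM_CPUS_PER_TASK:-$(nproc 2>/dev/null || echo 1)}"
authorizedCPU=$(suggest_cpu "${total_cpu}")
echo "Detected ${total_cpu} CPU(s), suggesting authorizedCPU=${authorizedCPU}"
```

With a 5-CPU session this gives the same value of 4 as the one set manually in this notebook.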

In [2]:
## Code cell 4 ## 

authorizedCPU=4

- Give **full path to get to the folder containing (only) the rawdata files** to be analysed:

In [None]:
## Code cell 5 ## 

rawfolder="${gohome}$USER/Data/fastq/raw/"
echo "${rawfolder}"

- Comments will later be included in the quality reports to keep information about the analysis handy. Please adapt the text of the ``mycomment`` variables in [section 2.1](#multiqctextvar1) and [section 4.1](#multiqctextvar2) before generating the MultiQC reports.

---
## 1 - Cleaning reads with <code>fastp</code>

A widely used software is <a href="http://www.usadellab.org/cms/?page=trimmomatic">Trimmomatic</a>. <br>
Unfortunately, this tool only performs the trimming step (removal of low-quality bases) and does not include quality control afterwards. Thus, files need to be opened at least 3 times, and for big files this takes quite some time, as you discovered with FastQC. <br>

Researchers from China developed a complete preprocessing tool, called <b>Fastp</b>, and published it in <a href="https://academic.oup.com/bioinformatics/article/34/17/i884/5093234"><em>Bioinformatics Journal</em> in 2018</a>. Currently, code and guides are available on their <a href="https://github.com/OpenGene/fastp">GitHub repository</a>.

### **1.1 - Tool version and default options overview**
This notebook was developed with ``fastp 0.20.0``; the session prepared in code cell 2 loads ``fastp/0.23.1``, so check the version actually in use:

In [None]:
## Code cell 6 ## 

fastp --version

The simplest usage for paired-end (PE) sequencing data is:

<code>fastp -i R1.fastqsanger.gz -I R2.fastqsanger.gz \
       -o R1.fastp.fastqsanger.gz -O R2.fastp.fastqsanger.gz <br></code>
       
If output options (`-o` and `-O` for paired end data) are omitted, `fastp` will operate the same and generate a report but won't write clean `.fastq.gz` files!

<div class="alert alert-block alert-info">
    As its manual page, either on <a href="https://github.com/OpenGene/fastp/blob/master/README.md">Github repository</a> or with <code>fastp --help</code> (also displayed on Github page), is quite long, we will summarize here some features.
</div>

In its default options, *Fastp* proceeds to:
- adapter removal, also named trimming (automatic identification for commercial sets, ``-A`` to disable it)
- flagging a base as unqualified when its phred quality is below 15 (``-q 15``, change value if wanted)
- read removal when more than 40% of its bases are unqualified (``-u 40``) <br>
- read removal when length drops below 15 bases (``-l 15``, ``-L`` can be used to inactivate this option)
- read removal when there are more than 5 N (undetermined) bases (``-n 5`` or ``--n_base_limit 5`` to adjust) <br>
- record of mated reads in separated files (specified with ``-o`` and ``-O``)
- usage of 2 threads (<a href="https://github.com/OpenGene/fastp/issues/13">IN FACT CORES</a>!), standing for 4 computer threads (``-w 2`` or ``--thread 2``)
- a compression of created ``.fastq`` files (``-z 4`` compression level gzip, ranging from 1-faster to 9-smaller)
- files overwriting if names already used (to change it, ``--dont_overwrite``) 
- naming report as "fastp report" (title inside file)
- report files writing IN CURRENT WORKING DIRECTORY

Thus, afterwards, the removed reads are lost, whatever the reason for their removal (good quality but left without a mate, or bad quality in itself)... and each report overwrites the previous one!  
We will adjust this later!
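
To make these defaults visible, the sketch below spells them out as explicit options, using the values listed above and the placeholder file names of the usage example. It only builds and prints the command, so it can be read without `fastp` being installed; the values should be checked against `fastp --help` for the version you run.

```shell
# Build the command as an array so that each option can carry a comment.
# File names are the placeholders of the usage example, not real files.
cmd=(fastp
     -i R1.fastqsanger.gz -I R2.fastqsanger.gz
     -o R1.fastp.fastqsanger.gz -O R2.fastp.fastqsanger.gz
     -q 15   # a base is "qualified" at phred >= 15
     -u 40   # discard a read if more than 40% of its bases are unqualified
     -l 15   # discard reads shorter than 15 bases
     -n 5    # discard reads with more than 5 N bases
     -w 2    # worker threads
     -z 4)   # gzip compression level of the output files

# Print the command instead of running it.
printf '%s ' "${cmd[@]}"
echo
```

Replacing the two print lines with `"${cmd[@]}"` would run the command; report names and the file for removed reads are not set in this sketch.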

### **1.2 - Further options for ``fastp``**

Other available options are:
<blockquote>
    <code>-e score</code> or <code>--average_qual score</code> to filter reads by their mean quality (default 0, no requirement) <br>
    <code>-V</code> or <code>--verbose</code> to log progress every time 1M reads are processed <br>
    <code>-a</code> or <code>--adapter_sequence</code> to specify adapter sequence, else autodetected <br>
    <code>--adapter_sequence_r2</code> for read2 adapter sequence <br>
    <code>--adapter_fasta</code> to add a <code>.fasta</code> file with adaptors sequences to apply all sequences on both read1 and read2 <br>
    <code>--filter_by_index1</code> (same for 2), to specify a file containing a list of barcodes, and <code>--filter_by_index_threshold</code> to allow for mismatches (default is set to 0) <br>
    <code>-c</code> or <code>--correction</code>, enable base correction for overlapping regions in PE: if one base is unqualified use corresponding base in mate pair if of good quality. <i>Caution: There are other options to add with this one.</i> <br>
    <code>-p</code> or <code>--overrepresentation_analysis</code>, to report overrepresented sequences per sample (<i>careful, big html report!</i>) and <code>-P number</code> or <code>--overrepresentation_sampling number</code>, to use 1 in <i>number</i> reads to identify those overrepresented sequences (default, <code>-P 20</code>)
</blockquote>

PolyG tail trimming is required for sequencing data from Illumina NextSeq and NovaSeq platforms (2-color technology). It is activated by default for such datasets (<code>-g</code> or <code>--trim_poly_g</code>). See <a href="https://github.com/OpenGene/fastp#polyg-tail-trimming">corresponding section</a> for more details and <a href="https://github.com/OpenGene/fastp#polyx-tail-trimming">polyX tail trimming</a> counterpart.

In addition to default bad quality removal, you can add more stringent filters if needed:
<blockquote>
    <code>-f x</code> or <code>--trim_front1 x</code> to trim x bases from 5' reads extremity of read 1 (<code>-F x</code> or <code>--trim_front2 x</code> for read 2) <br>
    <code>-t x</code> or <code>--trim_tail1 x</code> to trim x bases from the 3' tail of the reads in read1 file (<code>-T x</code> or <code>--trim_tail2 x</code> for read 2) <br>
    <code>-y</code> or <code>--low_complexity_filter</code> to enable removal of reads containing few base changes (with <code>-Y 30</code> percent by default) <br>
</blockquote>

For next 3 options, you need to specify a window size (default <code>-W 4</code> or <code>--cut_window_size 4</code>, number ranging from 1 to 1000) to work with.
<blockquote>
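    <code>-5</code> or <code>--cut_front</code>: slide a window from the 5' end and drop each window whose mean quality is below the threshold, stopping at the first window that passes <br>
    <code>-3</code> or <code>--cut_tail</code>: same, sliding from the 3' end <br>
    <code>-r</code> or <code>--cut_right</code>: slide a window from the 5' end; at the first window that fails, drop this window and all downstream bases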
</blockquote>

With a window size of 1 (``-W 1``), ``-5`` and ``-3`` are similar to ``LEADING`` and ``TRAILING`` in *Trimmomatic*, while ``-r`` is similar to ``SLIDINGWINDOW``. Please be aware that **these options interfere with downstream deduplication algorithms** (see <a href="https://github.com/OpenGene/fastp#per-read-cutting-by-quality-score">corresponding section</a> in manual).
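
To make the sliding-window idea concrete, here is a toy model of the `--cut_right` rule as it is described above, written in plain bash. It is not fastp's own code: the function name `cut_right` and the quality values are made up for illustration, and the mean-quality threshold is passed by hand (in `fastp` it would be set with `-M` or `--cut_mean_quality`).

```shell
# Toy model (assumption, not fastp's implementation): return how many bases
# are kept when cutting at the first window whose mean quality is too low.
# Usage: cut_right WINDOW_SIZE MEAN_QUALITY_THRESHOLD Q1 Q2 Q3 ...
cut_right() {
    local window="$1" threshold="$2"
    shift 2
    local quals=("$@")
    local n=${#quals[@]} i j sum
    for (( i = 0; i + window <= n; i++ )); do
        sum=0
        for (( j = i; j < i + window; j++ )); do
            sum=$(( sum + quals[j] ))
        done
        # mean < threshold is the same as sum < threshold * window
        if (( sum < threshold * window )); then
            echo "${i}"   # drop this window and everything downstream
            return
        fi
    done
    echo "${n}"           # no failing window: keep the whole read
}

quals=(30 30 30 30 30 10 10 10)
keep=$(cut_right 4 20 "${quals[@]}")
echo "Keeping the first ${keep} of ${#quals[@]} bases"
```

In this made-up read, the first window with a mean below 20 starts at the fifth base, so only the first four bases are kept.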

### **1.3 - Preparing step**

For *Fastp* to treat raw ``.fastq.gz`` files, let's remember where they can be found:

In [None]:
## Code cell 7 ## 

echo "${rawfolder}"

ls -lh "${rawfolder}"
#rawfolder="${gohome}$USER/Data/fastq/raw/"

We will create a folder to store the files ``fastp`` will create:  
- cleaned ``.fastq.gz`` files  
- quality reports, one per sample (including quality before and after processing reads)

In [None]:
## Code cell 8 ## 

fastpfolder="${gohome}$USER/Results/fastp/"
mkdir -p ${fastpfolder}

In addition, we remember the destination folder for text file of redirected outputs.

In [None]:
## Code cell 9 ## 

logfolder="${gohome}$USER/Results/logfiles/"

As a reminder (or to set it now, if not done in the **Parameters** section), here is the number of CPUs (central processing units, cores) that the multithreaded program `fastp` may use.  
<div class="alert alert-block alert-warning">
    A value of <b>4 is valid for a 5-CPU session</b>. Ideally, use 70-80% of the CPU amount your system or session has.
</div>

In [1]:
## Code cell 10 ## 

authorizedCPU=${authorizedCPU}
#authorizedCPU=4

echo "The number of CPU available for computing is ${authorizedCPU}"



### **1.4 - Samples ``fastp`` processing**

And, now the big loop! :-D

In [None]:
## Code cell 11 ## 

logfile="${logfolder}fastp_prequality-filtering-postquality.log"
echo "Screen output is redirected to ${logfile}"

In [None]:
## Code cell 12 ## 

# as time command does not redirect output
echo "operation starting by $(date)" >> ${logfile}

time for read1 in $(ls "${rawfolder}"*_1.fastq.gz); do

    # starting with the sample name
    id=$(basename ${read1} | cut -d"_" -f1)
    echo "====== Processing sampleID: ${id}..." | tee -a ${logfile}
        
    # fastq files section
    # (suffix-anchored substitutions, so a "_1" elsewhere in the path is left alone)
    read2="${read1%_1.fastq.gz}_2.fastq.gz"
    outread1="${fastpfolder}${id}_1.fastp.fastq.gz"
    outread2="${fastpfolder}${id}_2.fastp.fastq.gz"
    outremoved="${fastpfolder}${id}_removed.fastp.fastq.gz"
    
    # report section
    hreport="${fastpfolder}${id}_fastp.html"  # fastp at the end, else multiqc doesn't see it
    jreport="${hreport%.html}.json"
    myheader="Sample ${id} fastp report" # append sample reference to default title

    echo "fastp starts by $(date)" >> ${logfile}
    # fastp working
    fastp --in1 ${read1} --in2 ${read2} \
          --out1 ${outread1} --out2 ${outread2} \
          --failed_out ${outremoved} \
          --qualified_quality_phred 15 \
          --length_required 30 \
          --thread ${authorizedCPU} \
          --html ${hreport} --json ${jreport} \
          --report_title "$(echo ${myheader})" \
          &>> $logfile
    echo "fastp ends by $(date)" >> ${logfile}
    
    echo "...done" | tee -a ${logfile} 
    
done

# size of files
ls -lh "${fastpfolder}" >> ${logfile}

echo "fastp folder contains $(ls -l "${fastpfolder}"*fastq.gz | wc -l) fastq.gz files." \
     | tee -a ${logfile}
echo "fastp folder contains $(ls -l "${fastpfolder}"*.html | wc -l) html reports." \
     | tee -a ${logfile} 
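
The file names handled in the loop above all derive from the path of the read 1 file. The sketch below replays this derivation on a single made-up path, using bash parameter expansion (`${var%suffix}`), which only touches the end of the string. The path and the sample name `sampleA` are hypothetical.

```shell
# Hypothetical input path, for illustration only.
read1="/hypothetical/raw/sampleA_1.fastq.gz"

# Sample name: everything before the first underscore of the file name.
id=$(basename "${read1}" | cut -d"_" -f1)

# Mate file: swap the "_1.fastq.gz" suffix for "_2.fastq.gz".
read2="${read1%_1.fastq.gz}_2.fastq.gz"

# Report names: same stem, different extension.
hreport="${id}_fastp.html"
jreport="${hreport%.html}.json"

printf '%s\n' "${id}" "${read2}" "${hreport}" "${jreport}"
```

Running such a dry run on one file before launching the loop is a cheap way to check the naming scheme, since it involves no call to `fastp`.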

The output and report options used in the loop above are:
<blockquote>
    <code>--failed_out</code> to have an output for failed reads with specified failure reason. For paired-end (PE) data with no filename for unpaired reads, the qualified discarded read is registered as <code>paired_read_is_failing</code> <br>
    <code>-R text</code> or <code>--report_title text</code>, report <b>title</b> (default "fastp report") <br>
    <code>-j</code> or <code>--json</code> for json <b>filename</b> <br>
    <code>-h</code> or <code>--html</code> for html <b>filename</b>
</blockquote>

Before modifying output filenames, please note:
<ul class="alert alert-block alert-info">
    <li>
        MultiQC detects reports only based <a href="https://multiqc.info/docs/#fastp">on the end of the filenames</a>: it looks for <code>fastp.json</code> and <code>_fastqc.zip</code>.
    </li>
</ul>
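
Since detection relies on the end of the file name, it can be useful to check which files would match before renaming anything. The sketch below mimics this name-based rule with `find` on a temporary folder; it is only an approximation of MultiQC's search, and the file names are invented.

```shell
# Temporary folder with made-up report names.
demo=$(mktemp -d)
touch "${demo}/sampleA_fastp.json" \
      "${demo}/sampleA_fastp.html" \
      "${demo}/sampleB_report.json"

# Only names ending with "fastp.json" are expected to be picked up.
found=$(find "${demo}" -type f -name "*fastp.json" | wc -l)
echo "Reports matching *fastp.json: ${found}"

# sampleB_report.json does not match and would be ignored.
rm -r "${demo}"
```

The same `find` pattern can be pointed at a real results folder to count the reports that a `fastp` MultiQC summary should contain.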

---
## 2 - MultiQC report summary post prepping

Now let's have a look at our cleaned (we expect) dataset before mapping reads to the reference genome. <br>

In [None]:
## Code cell 13 ## 

multiqc --version

For more details about MultiQC, please refer to previous notebook.

### **2.1 - Folder, filename, title and comment**

Let's remember where report files are to be placed:

In [None]:
## Code cell 14 ## 

qcsummaries="${gohome}$USER/Results/multiqc/"

We specify then names for files and title to display on html report page. <a id="multiqctextvar1"></a>

In [None]:
## Code cell 15 ## 

inamemyfile="2_fastp-fastq-files-3samples"
mytitle="Fastq files qualities by Fastp: before and after filtering"

To keep a record of what has been done with these files, I add a comment for later use (and to inform other readers):

In [None]:
## Code cell 16 ## 

mycomment=$(echo "Fastq files processed by Fastp with following options. " \
"qualified quality: 15, minimum length: 30, removed reads kept in same file.")

### **2.2 - Generate summary report**

In [None]:
## Code cell 17 ## 

logfile="${logfolder}multiqc-processing_fastp-quality.log"
echo "Screen output is also saved in ${logfile}"

echo "operation starting by $(date)" >> ${logfile}
multiqc --interactive --export \
        --module fastp \
        --outdir "${qcsummaries}" \
        --filename "${inamemyfile}" \
        --title "${mytitle}"  \
        --comment "${mycomment}" \
        "${fastpfolder}" \
        |& tee -a ${logfile}
echo "operation finished by $(date)" >> ${logfile}

# to see which files we have afterward and follow folder sizes
ls -lh "${gohome}$USER/Results/multiqc/" >> ${logfile}
ls -lh "${gohome}$USER/Results/" >> ${logfile}

---
## 3 - Checking for species contamination

<ul class="alert alert-block alert-info">
    <li>
        The introduction of the <b>FastQ Screen</b> documentation states that the tool <a href="https://www.bioinformatics.babraham.ac.uk/projects/fastq_screen/_build/html/index.html?highlight=version#installation">is compatible with Bowtie, Bowtie2 or BWA</a> aligners... and <a href="https://www.bioinformatics.babraham.ac.uk/projects/fastq_screen/_build/html/index.html?highlight=version#configuration">that the aligner used to build the index files is supposed to be used also to map RNA reads</a>
    </li>
    <li>
       Here, we will use the default aligner of FastQ Screen (Bowtie2) only to identify contamination, regardless of nucleic acid type, and we will later map RNA reads properly with a splice-aware aligner.
    </li>
</ul>

### **3.1 - Tool version and default options overview**
This notebook was developed with ``fastq-screen 0.13.0``.

In [None]:
## Code cell 18 ## 

fastq_screen --version

FastQ Screen detects species contamination by aligning a subset of the reads from a ``.fastq`` file against chosen reference genomes.  
To date, only 3 aligners can be used to do so: ``bowtie``, ``bowtie2`` and ``bwa``.

To use this tool, we need index files for the other species, the ones that could contaminate the reads in our samples. We can either:  
- download genomes sequence files and create indexes using the chosen aligner
- retrieve <a href="https://www.bioinformatics.babraham.ac.uk/projects/fastq_screen/_build/html/index.html#obtaining-reference-genomes">pre-indexed <code>bowtie2</code> genomes for a range of commonly studied species and sequences</a> using ``--get_genomes``

During this first step, we can specify destination folder for retrieved and generated files with <code>--outdir TEXT</code> option. If this destination folder is not specified, then output files are saved in our current working directory.  
A folder called ``FastQ_Screen_Genomes/`` is created; it contains a text configuration file along with the reference genome indexes. The configuration file gives the ``fastq_screen`` script the paths to the reference genomes (same folder when using ``--get_genomes``). It is meant to be edited to specify the path to the aligner, but we can also use the ``--aligner`` option directly on the command line.  
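
To give an idea of what this configuration file looks like, the sketch below writes a minimal one to a temporary file and lists the databases it declares. The keywords (`BOWTIE2`, `THREADS`, `DATABASE`) follow the FastQ Screen documentation, but every path and index name here is a made-up placeholder, not the content of the file downloaded by ``--get_genomes``.

```shell
# Write a minimal, purely illustrative configuration file.
conf=$(mktemp)
cat > "${conf}" <<'EOF'
# Path to the aligner (hypothetical location); --aligner can be used instead
# BOWTIE2 /hypothetical/bin/bowtie2

# Default number of threads, overridden by --threads on the command line
THREADS 4

# DATABASE <name> <path to index basename> (hypothetical paths)
DATABASE Human /hypothetical/FastQ_Screen_Genomes/Human/human_index
DATABASE Mouse /hypothetical/FastQ_Screen_Genomes/Mouse/mouse_index
EOF

# List the genomes the screen would run against.
ndb=$(grep -c '^DATABASE' "${conf}")
echo "${ndb} database(s) declared:"
awk '$1 == "DATABASE" {print " - " $2}' "${conf}"

rm "${conf}"
```

The same `grep` and `awk` lines, pointed at a real `fastq_screen.conf`, give a quick view of the genomes that will be screened.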
  
In the second step, the analysis itself, the default command line is ``fastq_screen path/to/sample.fastq.gz``.  
  
Along with the path to the ``.fastq.gz`` files, we will use the options below: 
<blockquote>
    <code>--aligner OPTION</code>, to specify the aligner to use for the mapping. Valid arguments are 'bowtie', 'bowtie2' (default) or 'bwa'. If not specified, it will search for <code>bowtie2</code> location in the configuration file <br>
    <code>--threads INTEGER</code>, to specify how many threads the aligner will be allowed to run on. This option overrides the default value set in the configuration file <br>
    <code>--conf TEXT</code> to manually specify a location for the configuration file. Alternatively, the configuration file should be within the program folder <br>
</blockquote>

### **3.2 - Preparation steps**

<div class="alert alert-block alert-warning">
    It takes about half an hour (30 minutes) to download all reference files.   <br>
    Therefore, we will work today on pre-downloaded files.    <br>
    The following cells are provided for later use in your own project. <br>
    (To use the commands in cells 19 to 21, change their type from <code>Raw</code> to <code>Code</code> in the drop-down menu above)
</div>

(3.2.1) - First, we will create a destination folder for all ``fastq_screen`` linked files

(3.2.2) - Then, we download genome references thanks to ``--get_genomes`` option.

3.2.3 - We verify the available genomes for contamination check:

In [None]:
## Code cell 22 ## 
# to be skipped if the cells 19 to 21 above are run for another project

contascreenfolder="${gohome}allData/Results/fastq_screen/"
fastq_screenfolder="${gohome}$USER/Results/fastq_screen/"
mkdir -p ${fastq_screenfolder}
logfile="${gohome}allData/Results/logfiles/fastq_screen_get_genomes.log"

In [None]:
## Code cell 23 ## 

echo "Folder composition:" >> ${logfile}
tree ${contascreenfolder} >> ${logfile}
du -h "${contascreenfolder}FastQ_Screen_Genomes/" >> ${logfile}

echo "Downloaded genomes:" | tee -a ${logfile}
ls "${contascreenfolder}FastQ_Screen_Genomes/" | tee -a ${logfile}

We can see the genome versions in the logfile by searching for network links (``http://``), then filtering for particular species.

In [None]:
## Code cell 24 ## 

cat "${logfile}" | grep "http://" | grep -e "Human/Homo_sapiens" -e "Mouse/Mus_musculus"

3.2.4 - Predefined Config file with path to genome indexes:

We will need a predefined config file to tell ``fastq_screen`` where to find the genome files. Let's create a variable to handle it more easily.

In [None]:
## Code cell 25 ## 

fscreenconffile="${contascreenfolder}FastQ_Screen_Genomes/fastq_screen.conf"

### **3.3 - Samples analysis**

As a reminder (or to set it now, if it was not set in the **Parameters** section), here is the number of CPUs (central processing units, cores) that `fastq_screen` may use.  
<div class="alert alert-block alert-warning">
    The following value of <b>4 is valid for a 5-CPU session</b>. Ideally, use 70-80% of the CPU amount your system or session has.
</div>

In [None]:
## Code cell 26 ## 

authorizedCPU=${authorizedCPU}
#authorizedCPU=4

echo "The number of CPU available for computing is ${authorizedCPU}"

We can now check all cleaned ``.fastq.gz`` files produced earlier by ``fastp``.
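
The `fastp` folder holds cleaned reads, removed reads and reports side by side, so the files to screen have to be selected with care. The sketch below shows one way to keep only the cleaned ``.fastq.gz`` files, using a temporary folder and invented file names; it illustrates the selection only and does not call `fastq_screen`.

```shell
# Temporary folder mimicking the content of a fastp output folder.
demo=$(mktemp -d)
touch "${demo}/sampleA_1.fastp.fastq.gz" \
      "${demo}/sampleA_2.fastp.fastq.gz" \
      "${demo}/sampleA_removed.fastp.fastq.gz" \
      "${demo}/sampleA_fastp.html" \
      "${demo}/sampleA_fastp.json"

selected=()
for f in "${demo}"/*.fastq.gz; do          # reports (.html, .json) never match
    case "$(basename "${f}")" in
        *removed*) continue ;;             # skip the file of discarded reads
    esac
    selected+=("${f}")
done

# Show the selected file names without their folder.
printf '%s\n' "${selected[@]##*/}"
rm -r "${demo}"
```

Only the two cleaned read files remain in the selection; the removed reads and both reports are left out.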

In [None]:
## Code cell 27 ## 

logfile="${gohome}$USER/Results/logfiles/fastq_screen_3samples.log"
echo "Screen output is redirected to ${logfile}"

In [None]:
## Code cell 28 ## 

echo "Destination folder already contains:" >> ${logfile}
ls -lh ${fastq_screenfolder} >> ${logfile}

echo "operations start at $(date)" >> ${logfile}

time for fastqfile in $(ls "${fastpfolder}"*.fastq.gz | grep -v "removed"); do
    
    echo "=== starting for $(basename ${fastqfile})..." |& tee -a ${logfile}
    date >> ${logfile}
    
    fastq_screen --aligner bowtie2  \
                 --outdir ${fastq_screenfolder} \
                 --threads ${authorizedCPU} \
                 --conf ${fscreenconffile} \
                 ${fastqfile} \
                 &>> ${logfile}
                 
    echo "... done ===" | tee -a ${logfile}
    date >> ${logfile}
    
done

echo "operations end at $(date)" >> ${logfile}

ls -lh ${fastq_screenfolder} >> ${logfile}
echo "$(ls ${fastq_screenfolder} | wc -l) files are present in ${fastq_screenfolder}."

If we want to know the time it took to perform this analysis, we can retrieve the info from the log file:

In [None]:
## Code cell 29 ## 

cat ${logfile} | grep "operations"
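
The log lines above hold two dates that still have to be subtracted by eye. As an alternative, elapsed time can be computed directly by recording epoch seconds with `date +%s`. This is a sketch only: the helper `format_duration` is hypothetical, and `sleep 1` stands in for the real analysis.

```shell
# Hypothetical helper: turn a number of seconds into HH:MM:SS.
format_duration() {
    local s="$1"
    printf '%02d:%02d:%02d\n' $(( s / 3600 )) $(( s % 3600 / 60 )) $(( s % 60 ))
}

start=$(date +%s)      # seconds since 1970-01-01, before the work
sleep 1                # placeholder for the actual processing loop
end=$(date +%s)        # same, after the work

echo "Elapsed time: $(format_duration $(( end - start )))"
```

The `echo` line can be appended to the log file with `>> ${logfile}` to keep the duration next to the other messages.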

The results are shown in several files for each fastq.gz input file, and we could look at each html result.    
The other possibility is to compile the various reports with MultiQC, as in the following section.

In [None]:
## Code cell 30 ## 

# output folder, its content, total number of files and number of html reports
echo "${fastq_screenfolder}"
ls ${fastq_screenfolder}
ls ${fastq_screenfolder} | wc -l
ls ${fastq_screenfolder} | grep "\.html$" | wc -l

---
## 4 - MultiQC summary report after <code>fastq_screen</code>

### **4.1 - Folder, filename, title and comment**

Let's remember where report files are to be placed:

In [None]:
## Code cell 31 ## 

qcsummaries="${gohome}$USER/Results/multiqc/"

We specify then names for files and title to display on html report page. <a id="multiqctextvar2"></a>

In [None]:
## Code cell 32 ## 

inamemyfile="3_fastp-fastq-files_with_fastqscreen-3samples"

mytitle=$(echo "Fastq files qualities by Fastp (before and after filtering) " \
"and FastQ Screen genomes contamination screening")

To keep a record of what has been done with these files, I add a comment for later use (and to inform other readers):

In [None]:
## Code cell 33 ## 

mycomment=$(echo "Fastq files processed by Fastp with following options. " \
"qualified quality: 15, minimum length: 30, removed reads kept in same file." \
"FastQ Screen with bowtie2 (included version) and Babraham genomes reference.")

### **4.2 - Generate summary report**

In [None]:
## Code cell 34 ## 

logfile="${logfolder}multiqc-processing_fastp-quality_and_fastqscreen_screening.log"
echo "Screen output is also saved in ${logfile}"

echo "operation starting by $(date)" >> ${logfile}
multiqc --interactive --export \
        --module fastp ${fastpfolder} \
        --module fastq_screen ${fastq_screenfolder} \
        --outdir "${qcsummaries}" \
        --filename "${inamemyfile}" \
        --title "${mytitle}"  \
        --comment "${mycomment}" \
        "${gohome}$USER/Results/" \
        |& tee -a ${logfile}
echo "operation finished by $(date)" >> ${logfile}

# to see which files we have afterward and follow folder sizes
ls -lh "${qcsummaries}" >> ${logfile}
ls -lh "${gohome}$USER/Results/" >> ${logfile}

### **4.3 - Retrieve MultiQC report for all samples**

Similar reports were generated for all 11 samples. They are available in the allData folder, and we can have a copy in our personal folder for later use.

In [None]:
## Code cell 35 ##   

cp "${gohome}allData/Results/multiqc/2_fastp-fastq-files-all.html" "${gohome}$USER/Results/multiqc/"
cp "${gohome}allData/Results/multiqc/3_fastp-fastq-files_with_fastqscreen-all.html" "${gohome}$USER/Results/multiqc/"

---
## 5 - Current disk usage situation after preprocessing

In [None]:
## Code cell 36 ##   

# little look at the folder tree
tree -d -L 2 "${gohome}$USER"

Used options are:
<blockquote>
    <code>-d</code> to list only directories <br>
    <code>-L n</code> to descend at most n levels, so that the tree does not go too deep and the output stays readable.
</blockquote>

In [None]:
## Code cell 37 ##   

# disk usage by folder and subfolders
du -ch -d2 "${gohome}$USER"

The options stand for:
<blockquote>
    <code>-c</code> or <code>--total</code> to have total amount displayed <br>
    <code>-h</code> or <code>--human-readable</code> to get sizes in readable units (K, M, G) <br>
    <code>-dx</code> or <code>--max-depth=x</code> to limit folder enumeration to x levels
</blockquote>
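
To see the effect of the depth limit without touching the project folder, the sketch below runs `du` on a small temporary tree. The folder names are invented; with a depth of 1, only the first-level folders, the root and the total line are listed, while deeper folders are still counted in the sizes.

```shell
# Temporary tree: two first-level folders, each with one subfolder.
demo=$(mktemp -d)
mkdir -p "${demo}/Data/fastq" "${demo}/Results/fastp"

# Human-readable sizes, limited to one level, with a grand total.
du -c -h -d1 "${demo}"

# Count the lines: Data, Results, the root folder and "total".
nlines=$(du -c -d1 "${demo}" | wc -l)
echo "Lines displayed with -d1: ${nlines}"

rm -r "${demo}"
```

Raising the depth to `-d2` would add one line per subfolder (`Data/fastq` and `Results/fastp` here).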

---
___

Next, we will map the reads to the reference genome and look at the results.  
  
**=> Step 4 : Classical reads mapping** 

The jupyter notebook used for the next session will be the *Pipe_4-bash_classical-reads-mapping.ipynb*    
Let's retrieve it in our directory, in order to have a private copy to work on:   

In [None]:
## Code cell 38 ##   

cp "${gohome}pipeline/Pipe_4-bash_classical-reads-mapping.ipynb" "${gohome}$USER/"

Bénédicte Noblet - 05-07 2021   
Sandrine Caburet - 02-05 2023   
Updated 23/05/2023

---
___

Another, faster option exists: pseudo-mapping on the transcriptome.  
  
**=> Alternate step 4 : Pseudomapping**