# Module: Read Mapping

Mapping of reads to reference sequences or assemblies is a crucial step in bioinformatics that involves aligning sequencing reads to a known reference genome, transcriptome, or de novo assembly. This process enables researchers to identify sequence variants, quantify gene expression, and analyze the abundance of specific genomic regions.

Here, we will look at a couple of read mapping utilities that are commonly used for different applications.

Created by: _Microbial Oceanography Laboratory (MOLab)_

---
## How to Use This Notebook

1. Make sure the tools are installed (see **Tools Used** below if they are not).
2. Activate the environment, replacing the environment name as needed.
```bash
conda activate mapping-env
```
3. Launch Jupyter Notebook with the command below and select this notebook.
```bash
jupyter notebook
```
4. To run a cell in this notebook, press Shift+Enter.

---
## Tools Used
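1. **BBMap** for mapping reads to a reference genome or assembly.
2. **Trinity** (its bundled wrapper script) for mapping reads to a Trinity RNA-seq assembly and estimating transcript abundance.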

To install tool (1), find the `mapping.yaml` file located in the same folder as this notebook (in repository). Then run the command below in the terminal:
```bash
conda env create -f mapping.yaml
```

To install tool (2), a Docker image of Trinity can be pulled. Instructions are outlined here: [Trinity Docker Installation](https://github.com/trinityrnaseq/trinityrnaseq/wiki/Trinity-in-Docker).

---
## Starting Files

1. A contig FASTA file (`assembly.fasta`) generated from assembly (see **Assembly Module**) or binning (see **Binning Module**), or another reference assembly retrieved from a database.
2. Clean paired-end reads (see **Quality Control Module**), e.g. `S1_forward.fastq.gz` and `S1_reverse.fastq.gz`.

---
## Expected Outputs

Output(s) may depend on the tool used, but generally, the following outputs will be present:

1. SAM/BAM alignment files.

---
## Table of Contents
 * [**BBMap**](#BBMap)  
 * [**Trinity Wrapper Script**](#Trinity-Wrapper-Script)

----
# <font color = 'gray'>BBMap</font>

BBMap is a splice-aware global aligner for DNA and RNA sequencing reads. It can align reads from all major platforms: Illumina, 454, Sanger, Ion Torrent, PacBio, and Nanopore. BBMap is fast and extremely accurate, particularly with highly mutated genomes or reads with long indels, even whole-gene deletions over 100 kbp long.

Source: [BBMap Guide](https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/bbmap-guide/)

The code below aligns your paired-end reads (`S1_forward.fastq.gz` and `S1_reverse.fastq.gz`) to the reference (`assembly.fasta`) and writes an alignment file (`S1_mapping.sam`).

<div class="alert alert-block alert-info">
<b>Note:</b> 

The command maps only a single library to a reference assembly. If you have a multi-sample dataset, you need to change the sample names for every run or write a loop to automate the mapping of all samples.
</div>
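The loop mentioned in the note could look something like the sketch below. The sample prefixes (`S1`, `S2`, `S3`), the `map_sample` helper, and the `DRY_RUN` switch are hypothetical. The sketch assumes every library follows the `<sample>_forward.fastq.gz` / `<sample>_reverse.fastq.gz` naming used in this notebook and reuses the same BBMap parameters as the single-library command in this section. By default it only prints the commands so you can check them first.

```bash
# Sketch: map several paired-end libraries to one reference with BBMap.
# Assumption: sample prefixes and the DRY_RUN switch are illustrative only.

DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands only, 0 = actually run bbmap.sh

# Build the bbmap.sh command for one sample prefix, then print or run it.
map_sample() {
    s="$1"
    set -- bbmap.sh \
        "in=${s}_forward.fastq.gz" \
        "in2=${s}_reverse.fastq.gz" \
        "ref=assembly.fasta" \
        "out=${s}_mapping.sam" \
        "bamscript=${s}_bamscript"
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical sample prefixes; replace with your own.
for sample in S1 S2 S3; do
    map_sample "$sample"
done
```

Save it as a script and run it from a terminal (or a `%%bash` cell). Once the printed commands look right, rerun with `DRY_RUN=0`.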

In [None]:
!bbmap.sh \
    in=S1_forward.fastq.gz \
    in2=S1_reverse.fastq.gz \
    ref=assembly.fasta \
    out=S1_mapping.sam \
    bamscript=S1_bamscript
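
Once `S1_mapping.sam` exists, a quick sanity check is to count how many alignment records are flagged as mapped. The sketch below relies only on the SAM format itself: header lines start with `@`, and bit `0x4` of the FLAG column (column 2) marks an unmapped read. The `sam_mapped_summary` helper is a hypothetical name. It counts records rather than unique reads, so any secondary or supplementary alignments are counted as well.

```bash
# Sketch: summarize a SAM file by counting total and mapped alignment records.
# Assumption: sam_mapped_summary is an illustrative helper, not part of BBMap.

sam_mapped_summary() {
    awk -F '\t' '
        /^@/ { next }                        # skip header lines
        { total++ }                          # every other line is one record
        int($2 / 4) % 2 == 0 { mapped++ }    # FLAG bit 0x4 unset = mapped
        END { printf "total=%d mapped=%d\n", total, mapped }
    ' "$1"
}

# Only runs when the mapping output is present in the working directory.
if [ -f S1_mapping.sam ]; then
    sam_mapped_summary S1_mapping.sam
fi
```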

If you need a sorted and indexed BAM file, run the shell script that BBMap wrote (`S1_bamscript`). The script calls samtools, so samtools must be available in your environment.

In [None]:
!sh S1_bamscript
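
If you would rather do the conversion by hand, the sketch below shows roughly equivalent samtools steps. It assumes samtools is installed in the active environment. The `sort_and_index` helper and the output name `S1_mapping_sorted.bam` are hypothetical choices, and the exact commands inside the generated script may differ.

```bash
# Sketch: sort a SAM file into BAM and index it with samtools.
# Assumption: helper name and output file name are illustrative only.

sort_and_index() {
    sam="$1"
    bam="$2"
    if ! command -v samtools >/dev/null 2>&1; then
        echo "samtools not found on PATH" >&2
        return 1
    fi
    if [ ! -f "$sam" ]; then
        echo "input not found: $sam" >&2
        return 1
    fi
    # Coordinate-sort into BAM, then build the .bai index next to it.
    samtools sort -o "$bam" "$sam" && samtools index "$bam"
}

sort_and_index S1_mapping.sam S1_mapping_sorted.bam || echo "conversion skipped"
```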

----
# <font color = 'gray'>Trinity Wrapper Script</font>

If you intend to map reads to an RNA-seq assembly made with Trinity, Trinity provides a wrapper script (`align_and_estimate_abundance.pl`) that performs the alignment and estimates transcript abundance.

### Prepare Reference Assembly

First, we index the reference assembly once, so that the step does not have to be repeated for each library/sample.

| option/input | description |
| :-: | :- |
| `--transcripts` | Reference assembly. |
| `--est_method` | Abundance estimation method. |
| `--aln_method` | Alignment method. |
| `--prep_reference` | Prepare reference assembly file. |

In [None]:
!docker run --rm -v "$(pwd)":"$(pwd)" trinityrnaseq/trinityrnaseq \
    /usr/local/bin/util/align_and_estimate_abundance.pl \
    --transcripts "$(pwd)/assembly.fasta" \
    --est_method RSEM \
    --aln_method bowtie2 \
    --prep_reference

### Map and Estimate

Now, using Bowtie2 and RSEM, we can map the clean reads to the assembly (`assembly.fasta`) and estimate the abundances of the transcripts.

| option/input | description |
| :-: | :- |
| `--transcripts` | Reference assembly. |
| `--seqType` | Type of sequence file: FASTA (`fa`) or FASTQ (`fq`). |
| `--left` | Left (forward) reads file. |
| `--right` | Right (reverse) reads file. |
| `--est_method` | Tool to use to estimate transcript abundance. |
| `--aln_method` | Tool to use to map reads to assembly. |
| `--output_dir` | Output directory. |

<div class="alert alert-block alert-info">
<b>Note:</b> 

For a multi-sample dataset, consider using the <code>--samples_file</code> option. This will automatically organize the outputs into a separate folder for each sample. Alternatively, you can modify the code below and make use of a loop.
</div>
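
A Trinity samples file is a tab-delimited text file with one line per library: condition, replicate name, left reads, right reads. The sketch below prints such lines, assuming each sample is its own condition with a single replicate. The `samples_rows` helper, the `_rep1` suffix, and the sample prefixes are hypothetical. Paths are built from the current directory because it is the directory mounted into the Docker container.

```bash
# Sketch: generate rows for a Trinity samples file (tab-delimited).
# Assumption: one replicate per sample; names below are illustrative only.

samples_rows() {
    dir="$1"
    shift
    for s in "$@"; do
        # Columns: condition, replicate name, left reads, right reads
        printf '%s\t%s_rep1\t%s/%s_forward.fastq.gz\t%s/%s_reverse.fastq.gz\n' \
            "$s" "$s" "$dir" "$s" "$dir" "$s"
    done
}

# Print rows for two hypothetical samples; redirect to a file to keep them,
# e.g.  samples_rows "$(pwd)" S1 S2 > samples.txt
samples_rows "$(pwd)" S1 S2
```

The resulting file would then be passed with `--samples_file` in place of the per-library `--left` and `--right` arguments.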

In [None]:
!docker run --rm -v "$(pwd)":"$(pwd)" trinityrnaseq/trinityrnaseq \
    /usr/local/bin/util/align_and_estimate_abundance.pl \
    --transcripts "$(pwd)/assembly.fasta" \
    --seqType fq \
    --left "$(pwd)/S1_forward.fastq.gz" \
    --right "$(pwd)/S1_reverse.fastq.gz" \
    --est_method RSEM \
    --aln_method bowtie2 \
    --output_dir "$(pwd)/rsem_out" 2>&1 | tee -a "$(pwd)/out.log"
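
After the run finishes, the abundance estimates can be inspected from the output directory. The sketch below assumes the RSEM run wrote a tab-delimited `RSEM.isoforms.results` table into `rsem_out`, with the transcript ID in column 1 and TPM in column 6. Check the header of your own file before relying on these column positions. The `top_tpm` helper is a hypothetical name.

```bash
# Sketch: list the most abundant transcripts by TPM from an RSEM results table.
# Assumption: column 1 = transcript_id, column 6 = TPM; verify against the header.

top_tpm() {
    # $1 = RSEM results table, $2 = number of transcripts to show
    tail -n +2 "$1" \
        | LC_ALL=C sort -t "$(printf '\t')" -k6,6nr \
        | head -n "$2" \
        | cut -f1,6
}

# Only runs when the results table is present.
if [ -f rsem_out/RSEM.isoforms.results ]; then
    top_tpm rsem_out/RSEM.isoforms.results 10
fi
```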