# Module: Estimate Feature Abundance

Estimating the abundance of assembled sequences (contigs or transcripts) shows how well each sequence is represented in a sample. This is crucial for many downstream analyses that assess, for example, the abundances of genes and taxonomic groups. In addition, several binning tools use contig coverage information to improve binning performance.

In this module, we will explore various tools that you can use to estimate the abundances of assembled sequences.

Created by: _Microbial Oceanography Laboratory (MOLab)_

---
## How to Use This Notebook

1. Make sure the required tools are installed (see **Tools Used** below if they are not).
2. Activate the environment, replacing the environment name as needed.
```bash
conda activate est-contig-abund-env
```
3. Start Jupyter Notebook with the command below and open this notebook.
```bash
jupyter notebook
```
4. To run a cell in this notebook, press Shift+Enter.
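
Before running the cells, it can help to confirm that the executables are visible in the activated environment. The sketch below defines a hypothetical helper, `check_tools`, that reports which of the names passed to it are on `PATH`. The executable names in the example call are assumptions about what the environment provides and may differ in your installation.

```bash
# Report which of the given executables are available on PATH.
# Returns a non-zero status if any of them is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Example call (executable names are assumptions, adjust as needed):
# check_tools jgi_summarize_bam_contig_depths rsem-calculate-expression featureCounts docker
```

Any tool reported as missing should be installed following the instructions in **Tools Used** before continuing.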

---
## Tools Used
1. **MetaBAT2**
2. **RSEM**
3. **Trinity**

To install tools (1) and (2), locate the `est-contig-abund.yaml` file in the same repository folder as this notebook, then run the command below in the terminal:
```bash
conda env create -f est-contig-abund.yaml
```

To install tool (3), pull the Trinity Docker image. Instructions are available here: [Trinity Docker Installation](https://github.com/trinityrnaseq/trinityrnaseq/wiki/Trinity-in-Docker).

---
## Starting Files 

1. Assembly FASTA file (see **Assembly Module**).
2. Alignment of reads to assembly (BAM; see **Mapping Module**).

---
## Expected Outputs

The output format depends on the tool used, but in general it is a table of per-contig abundance estimates.

---
## Table of Contents
 * [**Built-in MetaBAT2 Estimator**](#Built-in-MetaBAT2-Estimator)
 * [**Wrapper Scripts in Trinity**](#Wrapper-Scripts-in-Trinity)
 * [**featureCounts**](#featureCounts)

----
# <font color = 'gray'>Built-in MetaBAT2 Estimator</font>

If you're working with shotgun metagenomics data and intend to cluster contigs into bins (see **Binning Module**), you can use the `jgi_summarize_bam_contig_depths` script packaged with MetaBAT2. It takes one or more sorted BAM files as input. Sample code is shown below.

In [None]:
# Make sure the output folder exists before writing the depth table.
!mkdir -p bbmap_out
!jgi_summarize_bam_contig_depths \
    --outputDepth bbmap_out/summ_depth.txt \
    *.bam
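
To get a quick look at the resulting depth table, the sketch below lists the contigs with the highest total average depth. It assumes the table is tab-separated with `contigName`, `contigLen`, and `totalAvgDepth` as the first three columns, followed by per-BAM columns. `top_contigs_by_depth` is a hypothetical helper name, not part of MetaBAT2.

```bash
# Print contig name, length, and total average depth, highest depth first.
# Assumes tab-separated columns: contigName, contigLen, totalAvgDepth, then per-BAM columns.
top_contigs_by_depth() {
    depth_file=$1
    n=${2:-10}        # number of contigs to show (default: 10)
    # Skip the header, keep the first three columns, sort numerically on column 3.
    tail -n +2 "$depth_file" \
        | cut -f1-3 \
        | LC_ALL=C sort -t "$(printf '\t')" -k3,3nr \
        | head -n "$n"
}

# Example call:
# top_contigs_by_depth bbmap_out/summ_depth.txt 5
```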

----
# <font color = 'gray'>Wrapper Scripts in Trinity</font>

For RNA-seq data assembled using Trinity (see **Assembly Module**), Trinity also provides a wrapper script (`abundance_estimates_to_matrix.pl`) that combines per-sample transcript abundance estimates into expression matrices.

In the code below, we assume that the reads have been aligned with the `align_and_estimate_abundance.pl` wrapper script (see **Mapping Module**), using RSEM as the estimation method. Before building the matrix, list the paths to the per-sample RSEM outputs in a file named `rsem_quant_outputs.txt`, one path per line. A sample file with five samples is shown below.

Example content of `rsem_quant_outputs.txt`:

```
/home/user/modules/Quantification_of_Contig_Abundance/4-mapping/T1xNT1_a/rsem/RSEM.isoforms.results
/home/user/modules/Quantification_of_Contig_Abundance/4-mapping/T1xNT1_b/rsem/RSEM.isoforms.results
/home/user/modules/Quantification_of_Contig_Abundance/4-mapping/T1xNT1_c/rsem/RSEM.isoforms.results
/home/user/modules/Quantification_of_Contig_Abundance/4-mapping/T2xNT2_D/rsem/RSEM.isoforms.results
/home/user/modules/Quantification_of_Contig_Abundance/4-mapping/T2xNT2_V/rsem/RSEM.isoforms.results
```

<div class="alert alert-block alert-warning">
<b>Warning</b> 

For this step, to avoid any potential errors, it is recommended that the file paths contain no spaces.

</div>
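
The path list can also be generated instead of typed by hand. The sketch below assumes that each sample's RSEM output is named `RSEM.isoforms.results` and sits somewhere under a single mapping folder, as in the example above. `collect_rsem_outputs` and its arguments are hypothetical names. It also stops with an error if any collected path contains a space.

```bash
# Write the absolute paths of all RSEM.isoforms.results files under a mapping
# folder to a list file, one per line. Fails if any path contains a space.
collect_rsem_outputs() {
    mapping_dir=$1                              # folder holding the per-sample outputs
    list_file=${2:-rsem_quant_outputs.txt}      # output list (default name from this module)
    abs_dir=$(CDPATH= cd -- "$mapping_dir" && pwd) || return 1
    find "$abs_dir" -type f -name 'RSEM.isoforms.results' | sort > "$list_file"
    if grep -q ' ' "$list_file"; then
        echo "error: some paths contain spaces; rename the folders first" >&2
        return 1
    fi
    echo "wrote $(wc -l < "$list_file" | tr -d ' ') path(s) to $list_file"
}

# Example call (folder name is an assumption based on the sample paths above):
# collect_rsem_outputs 4-mapping rsem_quant_outputs.txt
```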

We also specify `--basedir_index -3` so that the third-from-last component of each path (e.g. `T1xNT1_a`, `T1xNT1_b`, ...) is used as the sample name. Adjust this value to match your folder structure.
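
To check which index fits your folder structure, you can split one of your paths on `/` and count from the end. The sketch below only illustrates that indexing convention; it is not Trinity's own code, and `path_component` is a hypothetical helper name.

```bash
# Print the path component selected by a negative index, counting from the end
# (-1 is the file name, -2 its parent folder, and so on).
path_component() {
    path=$1
    index=$2          # negative integer, e.g. -3
    echo "$path" | awk -F/ -v i="$index" '{ print $(NF + 1 + i) }'
}

# Example call with one of the sample paths above:
# path_component /home/user/modules/Quantification_of_Contig_Abundance/4-mapping/T1xNT1_a/rsem/RSEM.isoforms.results -3
# prints: T1xNT1_a
```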

In [None]:
%%bash

mkdir -p "build_matrix"

docker run --rm -v "$(pwd)":"$(pwd)" trinityrnaseq/trinityrnaseq \
/usr/local/bin/util/abundance_estimates_to_matrix.pl \
    --name_sample_by_basedir \
    --basedir_index -3 \
    --est_method RSEM \
    --gene_trans_map none \
    --quant_files "$(pwd)/rsem_quant_outputs.txt" \
    --out_prefix "$(pwd)/build_matrix/exp_matrix"

The outputs are written to the `build_matrix/` folder. They include a matrix of non-normalized counts (`exp_matrix.isoform.counts.matrix`), a matrix of TPM values that are not normalized across samples (`exp_matrix.isoform.TPM.not_cross_norm`), and a matrix of TMM-normalized expression values (`exp_matrix.isoform.TMM.EXPR.matrix`).

----
# <font color = 'gray'>featureCounts</font>

To estimate the abundance of predicted genes (see **Contig Level Functional Annotation Module**), a straightforward approach is to use `featureCounts`. Sample usage and a description of the parameters are shown below.

<div class="alert alert-block alert-warning">
<b>Warning</b> 

In RNA-seq data where gene isoforms are present, <code>featureCounts</code> is not ideal for estimating transcript abundance because, by default, it does not count multi-mapping reads (i.e. reads that map to multiple locations).

</div>

| argument | description |
| :-: | :- |
| `-T` | Number of threads. |
| `-a` | Feature file (GTF/GFF). |
| `-o` | Output count table. |
| `-t` | Feature type. The available feature types are listed in the third column of your feature file. |
| `-g` | Attribute type. You can select the attribute type from the attributes available in the last column of the feature file. Features will be grouped into meta-features specified in this argument. |
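| `-p` | Treat the input as paired-end data. In Subread 2.0.2 and later, also pass `--countReadPairs` to count fragments instead of individual reads. |
| `*.bam` | Alignment files, given as positional arguments after all options. |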

In [None]:
!featureCounts \
    -T 4 \
    -a feature.gff \
    -o featureCounts_out.tsv \
    -t "CDS" \
    -g "gene_id" \
    -p *.bam
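
The count table also carries annotation columns that most downstream tools do not need. The sketch below reduces it to gene IDs and counts. It assumes the usual layout: a leading comment line starting with `#`, then the columns `Geneid`, `Chr`, `Start`, `End`, `Strand`, `Length`, followed by one count column per BAM file. `counts_only` is a hypothetical helper name.

```bash
# Keep only the gene ID and the per-BAM count columns of a featureCounts table.
# Assumes columns 2-6 are Chr, Start, End, Strand, and Length.
counts_only() {
    counts_file=$1
    # Drop comment lines, then keep column 1 (Geneid) and columns 7 onward.
    grep -v '^#' "$counts_file" | cut -f1,7-
}

# Example call:
# counts_only featureCounts_out.tsv > featureCounts_counts.tsv
```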