# Bulk RNA-seq counts normalization

Quantile normalization of TPM counts, and TMM normalization of read counts.

## Overview

Currently, we have implemented two pipelines for RNA-seq data normalization along the lines of the GTEx V8 workflow:

- **Pipeline A:** Read counts -> TPM (within-sample normalization) -> TPM-level QC -> quantile normalization (between-sample normalization) -> inverse normal transformation
- **Pipeline B:** Read counts -> TMM (via edgeR, between-sample normalization) -> inverse normal transformation
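
The first and fourth steps of Pipeline A can be sketched in plain Python. This is a minimal illustration, not the code the workflow runs: the function names and the `lengths_bp` gene-length input are hypothetical, ties in quantile normalization are broken by order of appearance, and a sample with all-zero counts is not handled.

```python
from statistics import mean


def counts_to_tpm(counts, lengths_bp):
    """Convert one sample's read counts to TPM (within-sample normalization).

    counts     : read counts, one per gene
    lengths_bp : gene lengths in base pairs (hypothetical input for this sketch)
    """
    # Reads per kilobase removes the dependence on gene length.
    rpk = [c / (length / 1000) for c, length in zip(counts, lengths_bp)]
    # Rescale so that the values of the sample sum to one million.
    scale = sum(rpk) / 1e6
    return [r / scale for r in rpk]


def quantile_normalize(matrix):
    """Quantile-normalize a genes x samples matrix (between-sample normalization).

    matrix is a list of rows, one row per gene. Ties are broken by order of
    appearance, which is a simplification.
    """
    n_genes = len(matrix)
    n_samples = len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_samples)]
    # Reference distribution: mean across samples of the k-th smallest value.
    sorted_cols = [sorted(col) for col in columns]
    reference = [mean(sc[k] for sc in sorted_cols) for k in range(n_genes)]
    result = [[0.0] * n_samples for _ in range(n_genes)]
    for j, col in enumerate(columns):
        # Gene indices ordered from smallest to largest value in sample j.
        order = sorted(range(n_genes), key=lambda i: col[i])
        for rank, i in enumerate(order):
            result[i][j] = reference[rank]
    return result
```

After `quantile_normalize`, every sample (column) has the same set of values, so the samples share one empirical distribution.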

The GTEx protocol, described [here](https://gtexportal.org/home/documentationPage#staticTextAnalysisMethods), states that:

1. Genes were selected based on expression thresholds of >0.1 TPM in at least 20% of samples and ≥6 reads in at least 20% of samples.
2. Expression values were normalized between samples using TMM as implemented in edgeR (Robinson & Oshlack, Genome Biology, 2010).
3. For each gene, expression values were normalized across samples using an inverse normal transform.

In other words, GTEx normalized the read counts with TMM (Pipeline B above), while the TPM QC results were used only to select samples and genes.
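
Steps 1 and 3 of the protocol can be sketched in plain Python. This is an illustration under stated assumptions, not the GTEx or workflow implementation: the function names are hypothetical, the default thresholds follow the numbers quoted above (the workflow cell in this notebook passes `--count_threshold 1` instead), and the inverse normal transform assumes average ranks for ties with a `rank / (n + 1)` offset.

```python
from statistics import NormalDist


def passes_expression_filter(tpm_row, count_row, tpm_threshold=0.1,
                             count_threshold=6, sample_frac_threshold=0.2):
    """Return True if a gene passes both expression thresholds.

    tpm_row and count_row hold the gene's values across the same samples.
    """
    n = len(tpm_row)
    n_tpm = sum(1 for v in tpm_row if v > tpm_threshold)
    n_count = sum(1 for v in count_row if v >= count_threshold)
    min_samples = sample_frac_threshold * n
    return n_tpm >= min_samples and n_count >= min_samples


def inverse_normal_transform(values):
    """Rank-based inverse normal transform of one gene across samples.

    Assumptions of this sketch: tied values receive their average rank, and
    ranks are mapped to quantiles as rank / (n + 1).
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    start = 0
    while start < n:
        # Extend the run [start, end] over values tied with values[order[start]].
        end = start
        while end + 1 < n and values[order[end + 1]] == values[order[start]]:
            end += 1
        avg_rank = (start + end) / 2 + 1  # 1-based average rank of the run
        for k in range(start, end + 1):
            ranks[order[k]] = avg_rank
        start = end + 1
    std_normal = NormalDist()
    return [std_normal.inv_cdf(r / (n + 1)) for r in ranks]
```

For example, `inverse_normal_transform([3.0, 1.0, 2.0])` maps the three values to the 0.75, 0.25, and 0.5 quantiles of the standard normal distribution, so the middle value becomes 0 and the other two are symmetric around it.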

## Caveats

A couple of possible improvements over the existing pipeline:

1. Should we try [GeTMM](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2246-7) instead? That seems to make more sense and is very easy to implement (add one line to TMM code, as shown in [this post](https://www.reneshbedre.com/blog/expression_units.html)).
2. What if the data come from several known batches, so that we can control for batch effects explicitly? Two options are:
    a. Read counts -> ComBat-seq -> inverse normal transformation
    b. Run the existing pipeline -> apply a batch adjustment with ComBat to the normalized data
    
**Currently we do not implement either of these improvements -- until future needs or discussions emerge.**

## Input

1. TPM matrix and read count matrix in RNA-SeQC format
    - The first two rows should be commented text with a `#` prefix.
    - The matrix should be tab-delimited.
    - The matrix files should end with the `gct` suffix.
    - These requirements are satisfied if the inputs are outputs of the [`bulk_expression_QC` pipeline](bulk_expression_QC.html).
2. GTF for collapsed gene model
    - Gene names must be consistent with the GCT matrices (e.g. ENSG00000000003 vs. ENSG00000000003.1 will not work).
    - Chromosome names must be consistent with the GCT matrices (e.g. chr1 vs. 1 will not work).
3. Meta-data to match between sample names in expression data and genotype files
    - Required input.
    - Tab-delimited, with a header.
    - Exactly two columns: the first is the sample name in the expression data, the second is the sample name in the genotype data.
    - **Must contain every sample name in the expression matrices, even those that do not exist in the genotype data.**
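
The last requirement can be checked before launching the workflow. The sketch below is a hypothetical helper, not part of the pipeline; it assumes that the third line of the GCT file is the column header and that its first two columns hold the gene ID and description, with sample names starting at the third column.

```python
import csv


def read_gct_sample_ids(gct_path):
    """Return the sample names from the header line of a GCT matrix."""
    with open(gct_path, newline="") as handle:
        next(handle)  # first commented row
        next(handle)  # second commented row
        header = next(csv.reader(handle, delimiter="\t"))
    # Assumption: columns 1 and 2 are gene ID and description.
    return header[2:]


def read_lookup(lookup_path):
    """Read the two-column lookup table into a dict (expression ID -> genotype ID)."""
    with open(lookup_path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        next(reader)  # skip the header row
        return {row[0]: row[1] for row in reader if len(row) >= 2}


def missing_samples(gct_path, lookup_path):
    """List expression samples that are absent from the lookup table."""
    lookup = read_lookup(lookup_path)
    return [s for s in read_gct_sample_ids(gct_path) if s not in lookup]
```

An empty list from `missing_samples` means every sample in the expression matrix has an entry in the lookup table.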

## Output

A normalized expression file in `bed` format, written with a `.bed.gz` suffix.

## Minimal Working Example

The expression matrices can be generated by the MWE of `bulk_expression_QC.ipynb`. The full set of MWE files can be found [on Google Drive](https://drive.google.com/drive/u/0/folders/1Rv2bWHBbX_tastTh49ToYVDMV6rFP5Wk).

In [None]:
sos run normalization.ipynb Normalization \
--tpm-gct "mwe.low_expression_filtered.outlier_removed.processed.tpm.gct"      \
--counts-gct "mwe.low_expression_filtered.outlier_removed.processed.geneCount.gct"      \
--sample_participant_lookup "sampleSheetAfterQc.txt" \
--container ./rna_quantification.sif --wd ./      \
--annotation_gtf Homo_sapiens.GRCh38.103.chr.reformated.ERCC.gene.gtf  &

In [2]:
[global]
# Work directory & output directory
parameter: wd = path
# Gene count table
parameter: counts_gct = path
# Gene TPM table
parameter: tpm_gct = path
# Gene GTF annotation table
parameter: annotation_gtf = path
# A two-column file (sample_id, participant_id) mapping sample IDs in the expression files to IDs in the genotype data (the two can be the same).
parameter: sample_participant_lookup = path
# For cluster jobs, number of commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "5h"
# Memory expected
parameter: mem = "16G"
# Number of threads
parameter: numThreads = 20
# Software container to run the commands in (the MWE passes a .sif image)
parameter: container = ""

In [None]:
[Normalization]
# Inputs: the TPM and read count GCT matrices, the collapsed gene model GTF, and the sample lookup table.
input: tpm_gct, counts_gct, annotation_gtf, sample_participant_lookup
output: f'{wd}/{_input[0]:bnn}.bed.gz',  # For factor
task: trunk_workers = 1, trunk_size = 1, walltime = '4h',  mem = '20G', tags = f'{step_name}_{_output[0]:bn}'  
bash: expand = "${ }", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout',container = container
    eqtl_prepare_expression.py ${tpm_gct} ${counts_gct} ${_input[2]} \
    ${sample_participant_lookup} ${sample_participant_lookup if sample_participant_lookup else ""} ${_output[0]:bnnn} \
    --tpm_threshold 0.1 \
    --count_threshold 1 \
    --sample_frac_threshold 0.2 \
    --normalization_method tmm