# Summary of methods used in this work

Several normalization and transformation methods for RNAseq data described by [Johnson and Krishnan 2022](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02568-9) will be tested. A figure describing their analysis pipeline is available [here](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02568-9/figures/1).


## Filtering lowly expressed genes and lowly abundant OTUs




## Normalization methods for RNAseq and metataxonomics


| Data type      | Normalization         | Normalization category    | Condition  |
|----------------|-----------------------|---------------------------|------------|
| Metataxonomics | Raw counts            |   | Joined D and N |
| Metataxonomics | CPM                   |   | Joined D and N |
| Metataxonomics | Relative abundance    |   | Joined D and N |
| Transcriptomics | Estimated counts     |   | D |
| Transcriptomics | CPM                  | Within sample  | D |
| Transcriptomics | TPM                  | Within sample  | D |
| Transcriptomics | TMM                  | Between sample  | D |
| Transcriptomics | RPKM                 | Within sample  | D |
| Transcriptomics | UQ                   | Between sample  | D |
| Transcriptomics | CTF                  | Between sample  | D |
| Transcriptomics | CUF                  | Between sample  | D |
| Transcriptomics | QNT                  | Between sample | D |
| Transcriptomics | Estimated counts     |   | N |
| Transcriptomics | CPM                  | Within sample  | N |
| Transcriptomics | TPM                  | Within sample  | N |
| Transcriptomics | TMM                  | Between sample  | N |
| Transcriptomics | RPKM                 | Within sample  | N |
| Transcriptomics | UQ                   | Between sample  | N |
| Transcriptomics | CTF                  | Between sample  | N |
| Transcriptomics | CUF                  | Between sample  | N |
| Transcriptomics | QNT                  | Between sample | N |


## Data transformation


| Data type      | Transformation                             | Reference |
|----------------|--------------------------------------------|-----------|
| Transcriptomics | asinh                                     | [Johnson and Krishnan 2022](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02568-9) |
| Transcriptomics | Variance stabilizing transformation (VST) | DESeq2; used in [Priya et al 2022](https://www.nature.com/articles/s41564-022-01121-z) |
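
As a rough illustration of the asinh transformation listed in the table, the sketch below applies the inverse hyperbolic sine element-wise with NumPy. The function name `asinh_transform` and the toy matrix are hypothetical and are not part of the analysis itself.

```python
import numpy as np
import pandas as pd


def asinh_transform(expression: pd.DataFrame) -> pd.DataFrame:
    """Apply asinh(x) = ln(x + sqrt(x^2 + 1)) to every value of the matrix."""
    return np.arcsinh(expression)


# Toy matrix (hypothetical values): rows are genes, columns are samples.
toy = pd.DataFrame(
    {"s1": [0.0, 1.0, 100.0], "s2": [0.0, 10.0, 1000.0]},
    index=["g1", "g2", "g3"],
)
toy_asinh = asinh_transform(toy)
```

Unlike a plain logarithm, asinh is defined at zero, so no pseudocount is needed, while for large values it behaves approximately like `ln(2x)`.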

## Filtering low variance (expression or OTUs)


[Priya et al 2022](https://www.nature.com/articles/s41564-022-01121-z) used the 25% quantile as the low-variance cutoff in their gene expression analysis.
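
A minimal sketch of such a quantile-based variance filter is shown below. It assumes that variance is computed per row (gene or OTU) across samples and that rows strictly above the cutoff are kept; neither detail is confirmed here for Priya et al 2022, and the name `filter_low_variance` is hypothetical.

```python
import pandas as pd


def filter_low_variance(expression: pd.DataFrame, quantile: float = 0.25) -> pd.DataFrame:
    """Keep rows whose variance across samples is above the given quantile."""
    variances = expression.var(axis=1)      # one variance per gene/OTU
    cutoff = variances.quantile(quantile)   # e.g. the 25% quantile
    return expression.loc[variances > cutoff]


# Toy matrix (hypothetical values): g1 is constant and should be removed.
toy = pd.DataFrame(
    {"s1": [5, 1, 0, 0], "s2": [5, 2, 10, 100], "s3": [5, 3, 20, 200]},
    index=["g1", "g2", "g3", "g4"],
)
toy_filtered = filter_low_variance(toy, quantile=0.25)
```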



# Normalization of RNAseq data

## Importing matrices

```python
import pandas as pd

# Estimated counts for the night samples, indexed by transcript name
kremling_raw_expression_v5_night = pd.read_csv(
    '/home/rsantos/Repositories/maize_microbiome_transcriptomics/correlations_rnaseq_metataxonomics/kremling_expression_v5_night.tsv',
    sep='\t')
kremling_raw_expression_v5_night.set_index('Name', inplace=True)

# Estimated counts for the day samples, indexed by transcript name
kremling_raw_expression_v5_day = pd.read_csv(
    '/home/rsantos/Repositories/maize_microbiome_transcriptomics/correlations_rnaseq_metataxonomics/kremling_expression_v5_day.tsv',
    sep='\t')
kremling_raw_expression_v5_day.set_index('Name', inplace=True)

print(kremling_raw_expression_v5_night.shape)
print(kremling_raw_expression_v5_day.shape)
```

(39096, 228)
(39096, 176)


## Importing transcript length

Several methods for normalization of RNAseq are implemented in the [bioinfokit Python library](https://github.com/reneshbedre/bioinfokit).

`infoseq` from the [EMBOSS 6.6.0.0 package](https://emboss.sourceforge.net/download/) was used to get the transcript lengths for normalization methods that require this information (e.g., `rpkm` or `fpkm`):

```bash
infoseq -only -name -length Zma2_rnas.fa > Zmays_Zm_B73_REFERENCE_NAM_5_0_55_transcripts_PrimaryTranscriptOnly_length.txt
```
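
For reference, the same kind of name/length table could in principle be computed directly in Python. The sketch below is a hypothetical alternative to `infoseq` that uses only the standard library; the function name `fasta_lengths` and the toy sequences are assumptions and were not used in this work.

```python
from io import StringIO


def fasta_lengths(handle):
    """Return {sequence name: length} for a FASTA file-like object."""
    lengths = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The name is the first word after '>'
            name = line[1:].split()[0]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


# Toy FASTA content (hypothetical sequences)
toy_fasta = StringIO(">tx1 some description\nACGT\nAC\n>tx2\nGGG\n")
toy_lengths = fasta_lengths(toy_fasta)
```

With a real file, the handle would come from `open(...)` instead of `StringIO`.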

```python
# Transcript lengths produced by infoseq, indexed by transcript name
gene_length_table = pd.read_csv(
    '/home/rsantos/Repositories/maize_microbiome_transcriptomics/correlations_rnaseq_metataxonomics/Zmays_Zm_B73_REFERENCE_NAM_5_0_55_transcripts_PrimaryTranscriptOnly_length.txt',
    sep="\t")
gene_length_table.set_index('Name', inplace=True)
gene_length_table.head()
```

| Name                 | Length |
|----------------------|--------|
| Zm00001eb441400_T001 | 321    |
| Zm00001eb442560_T001 | 524    |
| Zm00001eb436360_T001 | 4241   |
| Zm00001eb436380_T001 | 1126   |
| Zm00001eb436400_T001 | 1804   |


## Within sample normalization methods

CPM, RPKM and TPM are the within sample normalization methods used in this work (following [Johnson and Krishnan 2022](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02568-9)).

### Counts per million (CPM or RPM)

```python
from bioinfokit.analys import norm

# CPM (or RPM, reads per million) normalization
# Following Renesh Bedre's blog:
# https://www.reneshbedre.com/blog/expression_units.html#rpm-or-cpm-reads-per-million-mapped-reads-or-counts-per-million-mapped-reads-

# Night samples
nm = norm()
nm.cpm(df=kremling_raw_expression_v5_night)
kremling_expression_v5_night_cpm = nm.cpm_norm

# Day samples
nm = norm()
nm.cpm(df=kremling_raw_expression_v5_day)
kremling_expression_v5_day_cpm = nm.cpm_norm
```
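
As a cross-check on what CPM does, the standard formula (each count divided by the sample total and multiplied by one million) can be sketched in plain pandas. The function name `cpm_manual` and the toy counts are hypothetical, and this is not necessarily how bioinfokit computes it internally.

```python
import pandas as pd


def cpm_manual(counts: pd.DataFrame) -> pd.DataFrame:
    """Scale each sample (column) so that its counts sum to one million."""
    library_sizes = counts.sum(axis=0)  # total counts per sample
    return counts.div(library_sizes, axis=1) * 1e6


# Toy counts (hypothetical values): rows are genes, columns are samples.
toy_counts = pd.DataFrame(
    {"s1": [10, 30, 60], "s2": [0, 5, 15]},
    index=["g1", "g2", "g3"],
)
toy_cpm = cpm_manual(toy_counts)
```

Note that a sample with a total count of zero would produce missing values in this sketch.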


### RPKM (Reads per kilo base of transcript per million mapped reads)

Gene/transcript lengths are required for RPKM normalization. First, I (RACS) will merge the gene lengths into the estimated counts matrix:

```python
# Add the 'Length' column to each counts matrix by matching on transcript name
kremling_raw_expression_v5_night_gene_length = pd.merge(kremling_raw_expression_v5_night, gene_length_table, on="Name")
kremling_raw_expression_v5_day_gene_length = pd.merge(kremling_raw_expression_v5_day, gene_length_table, on="Name")
print(kremling_raw_expression_v5_night_gene_length.shape)
print(kremling_raw_expression_v5_day_gene_length.shape)
kremling_raw_expression_v5_night_gene_length.head()
```

(39096, 229)
(39096, 177)


| Name | 14A0253_26 | 14A0165_8 | 14A0163_8 | 14A0199_8 | 14A0147_8 | 14A0503_8 | 14A0045_8 | 14A0085_8 | 14A0249_8 | 14A0241_8 | ... | 14A0005_8 | 14A0027_8 | 14A0533_26 | 14A0333_26 | 14A0473_26 | 14A0047_8 | 14A0453_26 | 14A0345_8 | 14A0343_8 | Length |
|------|------|------|------|------|------|------|------|------|------|------|-----|------|------|------|------|------|------|------|------|------|--------|
| Zm00001eb371370_T002 | 0 | 0 | 2 | 1 | 1 | 1 | 0 | 1 | 5 | 1 | ... | 0 | 0 | 3 | 1 | 1 | 0 | 0 | 0 | 0 | 1376 |
| Zm00001eb371350_T001 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1662 |
| Zm00001eb371330_T001 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 336 |
| Zm00001eb371310_T001 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1455 |
| Zm00001eb371280_T001 | 6 | 11 | 2 | 11 | 3 | 3 | 1 | 5 | 0 | 8 | ... | 0 | 6 | 0 | 4 | 1 | 4 | 0 | 9 | 4 | 1628 |
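
Because `pd.merge` performs an inner join by default, transcripts without a length would be dropped silently; here the row count stays at 39096, so none were lost. A more defensive variant could look like the sketch below, which merges on the index and raises an error if any transcript has no length. The function name `merge_with_lengths` and the toy data are hypothetical.

```python
import pandas as pd


def merge_with_lengths(counts: pd.DataFrame, lengths: pd.DataFrame) -> pd.DataFrame:
    """Left-join lengths onto counts by index and fail if any length is missing."""
    merged = counts.merge(
        lengths,
        left_index=True,
        right_index=True,
        how="left",               # keep every transcript of the counts matrix
        validate="one_to_one",    # fail on duplicated transcript names
        indicator=True,           # adds a '_merge' column describing each row
    )
    missing = merged.index[merged["_merge"] == "left_only"]
    if len(missing) > 0:
        raise ValueError(f"{len(missing)} transcripts have no length")
    return merged.drop(columns="_merge")


# Toy data (hypothetical values)
toy_counts = pd.DataFrame({"s1": [1, 2], "s2": [3, 4]}, index=["g1", "g2"])
toy_counts.index.name = "Name"
toy_lengths = pd.DataFrame({"Length": [100, 200, 300]}, index=["g1", "g2", "g3"])
toy_lengths.index.name = "Name"

toy_merged = merge_with_lengths(toy_counts, toy_lengths)
```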


Actual normalization:

```python
# RPKM normalization; 'Length' is the column holding the transcript lengths

# Night samples
nm = norm()
nm.rpkm(df=kremling_raw_expression_v5_night_gene_length, gl='Length')
kremling_expression_v5_night_rpkm = nm.rpkm_norm

# Day samples
nm = norm()
nm.rpkm(df=kremling_raw_expression_v5_day_gene_length, gl='Length')
kremling_expression_v5_day_rpkm = nm.rpkm_norm
```
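
To make the RPKM calculation explicit, the sketch below implements the standard formula: counts are first scaled per million reads of the sample and then divided by the transcript length in kilobases. The function name `rpkm_manual` and the toy values are hypothetical, and bioinfokit may differ in implementation details.

```python
import pandas as pd


def rpkm_manual(counts: pd.DataFrame, lengths_bp: pd.Series) -> pd.DataFrame:
    """RPKM = counts per million reads, divided by transcript length in kb."""
    library_sizes = counts.sum(axis=0)                       # total counts per sample
    per_million = counts.div(library_sizes, axis=1) * 1e6    # depth correction
    return per_million.div(lengths_bp / 1e3, axis=0)         # length correction


# Toy data (hypothetical values): lengths are in base pairs.
toy_counts = pd.DataFrame({"s1": [10, 30, 60]}, index=["g1", "g2", "g3"])
toy_lengths = pd.Series([1000, 2000, 500], index=["g1", "g2", "g3"])
toy_rpkm = rpkm_manual(toy_counts, toy_lengths)
```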

### TPM (Transcripts per Million)

Like RPKM, TPM normalization also requires gene/transcript lengths:

```python
# TPM normalization; 'Length' is the column holding the transcript lengths

# Night samples
nm = norm()
nm.tpm(df=kremling_raw_expression_v5_night_gene_length, gl='Length')
kremling_expression_v5_night_tpm = nm.tpm_norm

# Day samples
nm = norm()
nm.tpm(df=kremling_raw_expression_v5_day_gene_length, gl='Length')
kremling_expression_v5_day_tpm = nm.tpm_norm
```
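
TPM differs from RPKM in the order of operations: counts are divided by transcript length first, and the resulting rates are then rescaled so that every sample sums to one million. A minimal sketch of this standard formula follows; the function name `tpm_manual` and the toy values are hypothetical and are not taken from bioinfokit.

```python
import pandas as pd


def tpm_manual(counts: pd.DataFrame, lengths_bp: pd.Series) -> pd.DataFrame:
    """TPM = length-corrected rates rescaled to sum to one million per sample."""
    rates = counts.div(lengths_bp / 1e3, axis=0)        # reads per kilobase
    return rates.div(rates.sum(axis=0), axis=1) * 1e6   # rescale each sample


# Toy data (hypothetical values): lengths are in base pairs.
toy_counts = pd.DataFrame({"s1": [10, 30, 60]}, index=["g1", "g2", "g3"])
toy_lengths = pd.Series([1000, 2000, 500], index=["g1", "g2", "g3"])
toy_tpm = tpm_manual(toy_counts, toy_lengths)
```

Because every sample sums to the same total, TPM values are proportions within a sample, which is why TPM is still listed as a within sample method.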

## Between sample normalization methods


Between sample normalization methods have mostly been implemented in R. [Renesh Bedre's blog](https://www.reneshbedre.com/blog/expression_units.html#rpm-or-cpm-reads-per-million-mapped-reads-or-counts-per-million-mapped-reads-) discusses the TMM, edgeR and DESeq2 methods and comments on a few others. [Johnson and Krishnan 2022](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02568-9) also implemented their code in R. Therefore, I (RACS) will run these analyses in a separate R Markdown notebook.
