Select DEGs.qmd

# Select DEGs

## Performing differential analysis using DESeq2
<b style="color: #6a926b;">DESeq2(Differential Expression using DESeq2)</b>

><b style="color: #6a926b;">Principle:</b> DESeq2 is based on a negative binomial distribution model, taking into account the discreteness and variability of gene expression data, as well as the impact of library size differences on differential analysis. DESeq2 reduces technical variation between samples through normalization transformations and scaling factors, then estimates the dispersion of gene expression. It employs the negative binomial distribution model to identify differentially expressed genes and corrects for multiple testing issues. DESeq2 is suitable for small sample RNA sequencing data and performs particularly well with a small number of samples, effectively handling inter-sample variability, technical noise, and batch effects.
>
><b style="color: #6a926b;">Input Data Type:</b> DESeq2 is designed for RNA-seq data and requires count data for each gene. It typically uses raw count matrices without the need for pre-normalization or scaling. DESeq2 requires the input data to be a matrix composed of integers. In addition to importing counts, DESeq2 officially provides a method to directly import txi objects generated by tximport if salmon was used upstream. 


```{r setup, include=FALSE}
library(TransProR)
```


```{r}
DEG_deseq2 <- DESeq2_analyze(
    tumor_file = "../test_TransProR/generated_data1/removebatch_SKCM_Skin_TCGA_exp_tumor.rds", 
    normal_file = '../test_TransProR/generated_data1/removebatch_SKCM_Skin_Normal_TCGA_GTEX_count.rds',
    output_file = '../test_TransProR/Select DEGs/DEG_deseq2.Rdata',
    logFC = 2.5,
    p_value = 0.01
)
head(DEG_deseq2, 5)
```


## Performing differential analysis using edgeR
<b style="color: #6a926b;">edgeR(Empirical Analysis of Digital Gene Expression Data in R)</b>

><b style="color: #6a926b;">Principle:</b> Similar to DESeq2, edgeR also accounts for the dispersion in data. It utilizes a negative binomial distribution model and different normalization methods, such as TMM (Trimmed Mean of M values), to address technical variations between samples. edgeR employs a hypothesis-testing framework to identify differentially expressed genes and uses multiple testing correction akin to the Benjamini-Hochberg method. edgeR performs better with a larger number of samples and is suitable for medium-scale RNA sequencing data, offering high sensitivity and accuracy.
>
><b style="color: #6a926b;">Input Data Type:</b> edgeR is also applicable to RNA-seq data, requiring expression counts for each gene. It uses the raw count matrix and does not require normalization or standardization. When using edgeR, it is important to select the appropriate analysis algorithm; the official recommendation for bulk RNA-seq is the quasi-likelihood (QL) F-test tests, and for scRNA-seq or data without replicates, the likelihood ratio test is suggested.

```{r}
DEG_edgeR <- edgeR_analyze(
    tumor_file = "../test_TransProR/generated_data1/removebatch_SKCM_Skin_TCGA_exp_tumor.rds", 
    normal_file = '../test_TransProR/generated_data1/removebatch_SKCM_Skin_Normal_TCGA_GTEX_count.rds',
    output_file = '../test_TransProR/Select DEGs/DEG_edgeR.Rdata',
    logFC_threshold = 2.5,
    p_value_threshold = 0.01
)
head(DEG_edgeR, 5)
```


## Performing differential analysis using limma
<b style="color: #6a926b;">limma(Linear Models for Microarray Data)</b>

><b style="color: #6a926b;">Principle:</b> Initially developed for gene chip data, limma was later adapted for RNA sequencing data. It is based on linear models and employs weighted least squares for estimating differences in gene expression, and a Bayesian approach for correcting multiple testing issues. Limma excels in handling large-scale data and is suitable for high-throughput data analysis, such as chip and large-scale RNA sequencing data, effectively controlling the false positive rate.
>
><b style="color: #6a926b;">Input Data Type:</b> limma is applicable to various types of high-throughput data, including chip data and RNA-seq. It requires the expression values for each gene, which can be either raw counts or already normalized values. There are two methods for differential analysis in limma: limma-trend and voom. The difference between these two methods is minimal when the sequencing depth of samples is similar, but voom has advantages when there is a significant difference in sequencing depth, hence voom is generally the preferred method for differential analysis.

```{r}
DEG_limma_voom <- limma_analyze(
    tumor_file = "../test_TransProR/generated_data1/removebatch_SKCM_Skin_TCGA_exp_tumor.rds", 
    normal_file = '../test_TransProR/generated_data1/removebatch_SKCM_Skin_Normal_TCGA_GTEX_count.rds',
    output_file = '../test_TransProR/Select DEGs/DEG_limma_voom.Rdata',
    logFC_threshold = 2.5,
    p_value_threshold = 0.01
)
head(DEG_limma_voom, 5)
```


## Wilcoxon Rank-Sum Test
>Experiments indicate that the Wilcoxon rank-sum test, as a non-parametric test, is <b style="color: #6a926b;">not recommended when the sample size per condition is less than eight</b>. However, once the sample size per condition exceeds eight, the Wilcoxon rank-sum test performs <b style="color: #6a926b;">comparably or better</b> than the three parametric testing methods (DESeq2, edgeR, and limma-voom) or other non-parametric testing methods.
>
>The three parametric testing methods DESeq2, edgeR, and limma were all initially developed to address the problem of analysis with small sample sizes. In population-level studies, which typically have larger sample sizes (at least several dozen), parametric assumptions are often unnecessary. Additionally, larger sample sizes are more likely to include several outliers that, when they violate assumptions, can lead to unstable p-values and consequently invalidate FDR.
>
>Unlike DESeq2, edgeR, and limma, the Wilcoxon rank-sum test, as a non-regression method, cannot adjust for various potential confounding factors (such as differences in sequencing depth). To use the Wilcoxon rank-sum test, it is generally recommended that RNA-seq samples <b style="color: #6a926b;">first be normalized to eliminate batch effects</b> before computation.
>
><b style="color: #6a926b;">Note:</b> The method selected for function implementation in this section is: edgeR TMM + wilcox.test analysis

```{r}
outRst <- Wilcoxon_analyze(
    tumor_file = "../test_TransProR/generated_data1/removebatch_SKCM_Skin_TCGA_exp_tumor.rds", 
    normal_file = '../test_TransProR/generated_data1/removebatch_SKCM_Skin_Normal_TCGA_GTEX_count.rds',
    output_file = '../test_TransProR/Select DEGs/Wilcoxon_rank_sum_testoutRst.Rdata',
    logFC_threshold = 2.5,
    fdr_threshold = 0.01
)
head(outRst, 5)
```


## four degs venn
```{r}
all_degs_venn <- list(
    DESeq2 = deg_filter(DEG_deseq2), 
    edgeR = deg_filter(DEG_edgeR), 
    limma = deg_filter(DEG_limma_voom), 
    Wilcoxon_test = deg_filter(outRst)
)
four_degs_venn <- four_degs_venn(all_degs_venn)
four_degs_venn
```


## Reference
> - **DESeq2**:
>
>> <a href="https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/" style="color: #758b76; font-weight: bold;">Differential gene expression analysis based on the negative binomial distribution</a>
>>
>><a href="https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0550-8" style="color: #758b76; font-weight: bold;">Love, M. I., Huber, W., & Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15(12), 550.</a>
>
> - **edgeR**:
> 
>><a href="https://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf" style="color: #758b76; font-weight: bold;">edgeR: differential analysis
of sequence read count data</a>
>>
>><a href="https://academic.oup.com/bioinformatics/article/26/1/139/182458?login=false" style="color: #758b76; font-weight: bold;">Robinson, M. D., McCarthy, D. J., & Smyth, G. K. (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), 139-140. </a>
>
> - **limma**:
> 
>><a href="https://www.bioconductor.org/packages/release/bioc/vignettes/limma/inst/doc/usersguide.pdf" style="color: #758b76; font-weight: bold;">Linear Models for Microarray and RNA-Seq Data User’s Guide</a>
>>
>><a href="https://academic.oup.com/nar/article/43/7/e47/2414268" style="color: #758b76; font-weight: bold;">Ritchie, M. E., Phipson, B., Wu, D., Hu, Y., Law, C. W., Shi, W., & Smyth, G. K. (2015). limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research, 43(7),e47-e47.</a>
>
> - **Wilcoxon Rank-Sum Test**:
>
>><a href="https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02648-4" style="color: #758b76; font-weight: bold;">Li, Y., Ge, X., Peng, F., Li, W., & Li, J. J. (2022). Exaggerated false positives by popular differential expression methods when analyzing human population samples. Genome Biology, 23(1), 79.</a>