## Analysis Notebook - Differential Gene Expression Analysis

This notebook generates the sex-biased differential gene expression analysis. Differential expression (DE) analysis was performed using voom (Law et al., 2014), which transforms gene expression counts to log2 counts per million with associated precision weights, followed by linear modeling and an empirical Bayes procedure using limma.

Within each tissue, the following linear regression model was used to detect sexually dimorphic gene expression:


$$y = \beta_0 + \beta_1 \, \mathrm{sex} + \epsilon$$
           
where y is the gene expression to be modeled, sex denotes the reported sex of the subject, and epsilon is the error term. The function `fit_tissue()` performs this analysis. It accepts two arguments, the `tissue` and an `object`, and creates the **model matrix** from the reported sex of that tissue's samples. We perform a linear fit after calculating normalization factors (based on the library size) and estimating the mean-variance trend with `voom`, which supplies the precision weights. The resulting tables are saved as files.
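
As a minimal sketch of how such a model matrix is built, the following uses base R on a handful of made-up sex codes. The toy vector `sex_codes` is an assumption for illustration; the coding 1 = male and 2 = female follows the comments inside `fit_tissue()`.

```r
# Toy sex codes (hypothetical samples); 1 = male, 2 = female.
sex_codes  <- c(1, 2, 2, 1, 2)
tissue_sex <- factor(sex_codes)

# Intercept column plus one indicator column for the second level (sex == 2).
design <- model.matrix(~ tissue_sex)
colnames(design) <- c("intercept", "sex")
print(design)
```

With this coding the first factor level (male) is the baseline, so the `sex` coefficient estimates the female minus male difference in log2 expression.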

### 1. Data files created by this notebook

Output text files are written to the ``../data/`` directory (at the same level as the ``jupyter`` directory). 

For each of the 39 tissues, this notebook produces the following results:

1. **{tissue}_DGE.csv**: topTable results for the edgeR/limma differential analysis
2. **{tissue}_DGE_ensg_map.csv**: a convenience mapping of the ENSG identifier to the gene symbol for the genes in the refined results
3. **{tissue}_DGE_refined.csv**: the subset of the topTable results with at least a 1.5-fold change and an adjusted p-value <= 0.05.
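
The refinement criteria and the ENSG clean-up can be illustrated on a toy table. This is only a sketch: the data frame `results`, its values, and the placeholder identifiers are made up, while the thresholds and the version-stripping pattern are the ones used in `fit_tissue()`.

```r
# Hypothetical topTable-like results; identifiers and values are placeholders.
results <- data.frame(
    logFC     = c(1.20, -0.70, 0.30, -0.90),
    adj.P.Val = c(0.001, 0.040, 0.010, 0.200),
    row.names = c("ENSG99999999001.5", "ENSG99999999002.11",
                  "ENSG99999999003.2", "ENSG99999999004.7")
)

# Keep genes with adjusted p-value <= 0.05 and at least a 1.5-fold change.
refined         <- results$adj.P.Val <= 0.05 & abs(results$logFC) >= log2(1.5)
results_refined <- results[refined, ]

# Strip the trailing version suffix to obtain the bare ENSG identifier.
ensg_ids <- sub("\\.\\w+$", "", rownames(results_refined))
print(ensg_ids)
```

Only the first two rows pass both thresholds: the third has too small a fold change and the fourth has too large an adjusted p-value.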

Additionally, diagnostic plots are written to the `../pdf/` directory:

1. **{tissue}-gene-y-voom-MDSplot-100.pdf**: multi-dimensional scaling plot (MDSplot) of the top 100 genes after the voom transformation, with males labeled `m` in `blue` and females labeled `f` in `red`.
2. **{tissue}-gene-y-MDSplot-100.pdf**: MDSplot without voom.

### 1.1 Load dependencies

In [None]:
suppressWarnings({
    options(warn = -1) 
    library(gprofiler2)
    library(downloader)
    library(readr)
    library(edgeR)
    library(limma)
    library(statmod)
    library(snakecase)
    library(multtest)
    library(stringi)
    library(dplyr)
    Sys.setenv(TAR = "/bin/tar") # for gzfile
})

### 1.2 Load corrected GTEx expressionSet object

In [None]:
message("\nReading GTEx expressionSet obj from ../data/gtex.corrected.rds\n")
obj <- readRDS(file = "../data/gtex.corrected.rds")
pData(obj)$SAMPID <- gsub('-','\\.',pData(obj)$SAMPID)
message("\ndone reading GTEx corrected expressionSet Object\n")
dim(obj)

### 2. Preparation for Differential Expression Analysis
### 2.1 Keep only the reduced tissues

In [None]:
tissue_reduction <- read.table("../assets/tissues.tsv", header=TRUE, sep="\t",
                               skipNul=FALSE, stringsAsFactors = FALSE)
colnames(tissue_reduction)  <- c("SMTSD","female","male","include","display_name")

## only keep those we wish to include
tissue_reduction <- tissue_reduction[tissue_reduction$include==1,]
glimpse(tissue_reduction)

In [None]:
# inspect the tissue names before converting them to snake_case to match tissues.tsv
levels(pData(obj)$SMTSD)

In [None]:
pData(obj)$SMTSD <- snakecase::to_snake_case(as.character(pData(obj)$SMTSD))
head(pData(obj)$SMTSD)

In [None]:
keep <- pData(obj)$SMTSD %in% tissue_reduction$SMTSD
table(keep)

In [None]:
obj   <-  obj[,keep == TRUE]
message("\nthe expressionSet object\n")
dim(obj)

### 3. Differential analysis with edgeR and Limma

Using the GTEx expressionSet object, we build an edgeR `DGEList`, calculate normalization factors (based on the library size), estimate the mean-variance trend with `voom`, and perform a linear fit with limma. The resulting tables are saved as files.
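
As a minimal sketch of these steps, the following runs the same sequence of edgeR and limma calls on simulated counts. The simulated matrix, the sample size, and the variable names are assumptions for illustration and are not GTEx data; `robust=TRUE` in `eBayes()` needs the `statmod` package, which this notebook loads.

```r
library(edgeR)
library(limma)

set.seed(1)
n_genes   <- 200
n_samples <- 12

# Simulated negative binomial counts (illustration only, no true sex effect).
counts <- matrix(rnbinom(n_genes * n_samples, mu = 100, size = 5),
                 nrow = n_genes,
                 dimnames = list(paste0("gene", seq_len(n_genes)),
                                 paste0("s", seq_len(n_samples))))
sex    <- factor(rep(c(1, 2), each = n_samples / 2))
design <- model.matrix(~ sex)
colnames(design) <- c("intercept", "sex")

y   <- DGEList(counts = counts, group = sex)
y   <- calcNormFactors(y)            # normalization factors from library sizes
v   <- voom(y, design)               # log2-CPM values with precision weights
fit <- eBayes(lmFit(v, design), robust = TRUE)
res <- topTable(fit, coef = "sex", number = nrow(y))
head(res)
```

Because the counts are simulated without any sex effect, few or no genes should be expected to pass the adjusted p-value threshold here.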

### 3.1 Function fit_tissue

The function `fit_tissue()` accepts two arguments, the `tissue` and an `object`, and creates the **model matrix** from the reported sex of that tissue's samples.

In [None]:
fit_tissue <- function (tissue, obj) {
    tissue_true             <- pData(obj)$SMTSD == tissue
    tissue_obj              <- obj[,tissue_true ==TRUE]
    tissue_sex              <- factor(pData(tissue_obj)$SEX)
    tissue_design           <- model.matrix(~tissue_sex)
    colnames(tissue_design) <- c("intercept","sex")

    
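    # female (SEX == 2): number of samples in which each gene has CPM >= 1
    female_obj           <- tissue_obj[,pData(tissue_obj)$SEX == 2]
    female_exprs_rowSums <- rowSums(cpm(exprs(female_obj))>=1)
    # threshold: 25% of the female sample count (applied to both sexes below)
    count_threshold      <- 0.25 * dim(female_obj)[2]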
   
    # male
    male_obj           <- tissue_obj[,pData(tissue_obj)$SEX == 1]
    male_exprs_rowSums <- rowSums(cpm(exprs(male_obj))>=1)
    

    keep_male          <- male_exprs_rowSums >= count_threshold
    keep_female        <- female_exprs_rowSums >= count_threshold
    
    # now keep only those genes that meet both the male and the female criteria
    keep <- keep_male  & keep_female
    
    tissue_obj <- tissue_obj[keep==TRUE,]
    rm(male_obj)
    rm(female_obj)
    
    
    y_tissue       <- DGEList(counts=exprs(tissue_obj), group=tissue_sex)
    y_tissue       <- calcNormFactors(y_tissue)
    y_tissue_voom  <- voom(y_tissue, tissue_design)
    
    sex            <- ifelse(pData(tissue_obj)$SEX==1,'male','female')
    Gender         <- substring(sex,1,1)
    filename       <- paste0(paste0("../pdf/", snakecase::to_snake_case(tissue)),"-gene-y-MDSplot-100.pdf")
    pdf (filename)
        plotMDS(y_tissue, labels=Gender, top=100, col=ifelse(Gender=="m","blue","red"), 
                gene.selection="common")
    dev.off()
    filename       <- paste0(paste0("../pdf/", snakecase::to_snake_case(tissue)),"-gene-y-voom-MDSplot-100.pdf")
    pdf (filename)    
        plotMDS(y_tissue_voom, labels=Gender, top=100, col=ifelse(Gender=="m","blue","red"), 
                gene.selection="common")
    dev.off()

    fit_tissue      <- lmFit(y_tissue_voom, tissue_design)
    fit_tissue      <- eBayes(fit_tissue, robust=TRUE)
    results_tissue  <- topTable (fit_tissue, coef='sex', number=nrow(y_tissue))
    results_refined <- results_tissue$adj.P.Val <= 0.05 & abs(results_tissue$logFC) >= abs(log2(1.5))
    ensgfile  = paste(paste("../data",gsub(" ","",tissue), sep="/"),"DGE_ensg_map.csv", sep="_")
    
    
    filename  = paste(paste("../data",gsub(" ","",tissue), sep="/"),"DGE.csv", sep="_")
    rfilename = paste(paste("../data",gsub(" ","",tissue), sep="/"),"DGE_refined.csv", sep="_")
   
    ensg_names <- as.character(rownames(results_tissue[results_refined,]))
    ensg_genes <- ensg_names

    # seq_along() skips the loop when no gene passes the refinement thresholds
    for (i in seq_along(ensg_names)) {
        dont_convert = 0
        # strip the version suffix (e.g. ".12") from the Ensembl gene ID
        ensg_names[i] = sub('\\.\\w+$', '', ensg_names[i])
        if (ensg_names[i] == "ENSG00000233864") {
            ensg_genes[i] = as.character("TTTY15")
            dont_convert = 1
        } 
        if (ensg_names[i] == "ENSG00000240800") {
            ensg_genes[i] = as.character("ATP8A2P1")
            dont_convert = 1
        } 
        if (!dont_convert) {
            
            res <- gconvert(c(as.character(ensg_names[i])),
                                      organism = "hsapiens",
                                      target = "ENSG",
                                      numeric_ns = "", 
                                      mthreshold = Inf,
                                      filter_na = TRUE)
            if (!is.null(res) && nrow(res) > 0) {
                # gconvert may return several rows; keep the first gene name
                ensg_genes[i] <- res$name[1]
            }
        }
    }
    
    ensg_maps <- cbind(ensg_names, ensg_genes)

    write.table(results_tissue, filename, sep=',', quote=FALSE)
    write.table(results_tissue[results_refined,], rfilename, sep=',', quote=FALSE)
    write.table(ensg_maps, ensgfile, sep=',', quote=FALSE, row.names=FALSE)
    return (results_tissue)
}
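
The expression filter inside `fit_tissue()` can be illustrated with base R alone. This is a sketch under stated assumptions: the toy matrix `counts` and the `filler` row are made up, and `cpm_manual` reproduces counts per million from the column sums, which mirrors what `cpm()` does on a raw count matrix.

```r
# Toy data: 8 hypothetical samples, 1 = male, 2 = female.
sex    <- c(1, 1, 1, 1, 2, 2, 2, 2)
counts <- rbind(
    geneA = c(5, 3, 2, 4, 6, 2, 3, 2),
    geneB = c(0, 0, 0, 0, 4, 5, 0, 0),
    geneC = c(0, 0, 0, 0, 0, 0, 0, 0)
)
# Filler row so that every library size is one million reads.
counts <- rbind(counts, filler = 1e6 - colSums(counts))

# Counts per million, computed from the column sums.
cpm_manual <- t(t(counts) / colSums(counts)) * 1e6

# Threshold: 25% of the female sample count, used for both sexes.
count_threshold <- 0.25 * sum(sex == 2)
keep_male   <- rowSums(cpm_manual[, sex == 1] >= 1) >= count_threshold
keep_female <- rowSums(cpm_manual[, sex == 2] >= 1) >= count_threshold

# A gene is kept only if it passes in both sexes.
keep <- keep_male & keep_female
print(keep)
```

In this toy example `geneB` is expressed in females only and is dropped, because the filter requires a gene to pass in both sexes.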

In [None]:
pData(obj)$SMTSD <- factor(pData(obj)$SMTSD)
levels(pData(obj)$SMTSD)
length(levels(pData(obj)$SMTSD))

### 3.2 Looping through reduced Tissue set

Loop through all the tissues and run the differential gene expression analysis for each tissue.

In [None]:
for (tissue in levels(pData(obj)$SMTSD)) { 
    fit_tissue(tissue = tissue,obj = obj)
}

### Appendix - Metadata

For replicability and reproducibility purposes, we also print the following metadata:

1. Checksums of **'artefacts'**, files generated during the analysis and stored in the **`data`** directory
2. List of environment metadata, dependencies, versions of libraries using `utils::sessionInfo()` and [`devtools::session_info()`](https://devtools.r-lib.org/reference/session_info.html)

### Appendix - 1. Checksums with the sha256 algorithm

In [None]:
figure_id   = "differentialGeneExpression"

message("Generating sha256 checksums of the artefacts in the `../data/` directory .. ")
system(paste0("cd ../data && find . -type f -exec sha256sum {} \\;  >  ../metadata/", figure_id, "_sha256sums.txt"), intern = TRUE)
message("Done!\n")

data.table::fread(paste0("../metadata/", figure_id, "_sha256sums.txt"), header = FALSE, col.names = c("sha256sum", "file"))
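
Where the `sha256sum` command-line tool is not available, the same kind of table could be produced from within R. The following is a sketch of that alternative, not the method used above; it assumes the `digest` package is installed and uses a temporary directory with one made-up file in place of `../data/`.

```r
library(digest)

# Temporary directory standing in for ../data/ (illustration only).
tmp_dir <- tempfile("artefacts_")
dir.create(tmp_dir)
writeLines("gene,logFC", file.path(tmp_dir, "example_DGE.csv"))

files <- list.files(tmp_dir, full.names = TRUE, recursive = TRUE)

# One sha256 checksum per file; file = TRUE hashes the file contents.
checksums <- data.frame(
    sha256sum = vapply(files, digest, character(1), algo = "sha256", file = TRUE),
    file      = basename(files),
    row.names = NULL,
    stringsAsFactors = FALSE
)
print(checksums)
```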

### Appendix - 2. Library Session Information

In [None]:
dev_session_info   <- devtools::session_info()
utils_session_info <- utils::sessionInfo()

message("Saving `devtools::session_info()` objects in ../metadata/", figure_id, "_devtools_session_info.rds  ..")
saveRDS(dev_session_info, file = paste0("../metadata/", figure_id, "_devtools_session_info.rds"))
message("Done!\n")

message("Saving `utils::sessionInfo()` objects in ../metadata/", figure_id, "_utils_info.rds  ..")
saveRDS(utils_session_info, file = paste0("../metadata/", figure_id ,"_utils_info.rds"))
message("Done!\n")

dev_session_info$platform
dev_session_info$packages[dev_session_info$packages$attached==TRUE, ]