## Analysis Notebook - Differential Gene Expression Analysis

This notebook generates the sex-biased differential gene expression analysis. Differential expression (DE) analysis was performed using voom (Law et al., 2014), which transforms the gene expression counts and computes associated precision weights, followed by linear modeling and an empirical Bayes procedure using limma.

Within each tissue, the following linear regression model was used to detect sexually dimorphic gene expression:


           y = B0 + B1 sex + B2 age + B3 RIN + B4 Ischemia + epsilon (error)
           
where y is the gene expression to be modeled and sex denotes the reported sex of the subject. The function `fit_tissue()` performs this analysis. It accepts two arguments, the `tissue` and an `object`, and creates the **model matrix** based on that tissue's sex. We perform a linear fit after calculating normalization factors (based on the library size) and estimating the mean-variance relationship with `voom`. The resulting matrices are saved as files.
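As a minimal sketch of the model above, the design matrix can be built in base R from a small covariate table. The data frame `toy` and all of its values are made up for illustration and are not GTEx data; the coding of sex as 1 (male) and 2 (female) is assumed from the way `fit_tissue()` subsets the samples later in this notebook.

```r
# Toy covariates for six hypothetical samples (all values are made up).
toy <- data.frame(
    sex      = factor(c(1, 2, 1, 2, 1, 2)),   # assumed coding: 1 = male, 2 = female
    age      = c(1, 2, 3, 1, 2, 3),           # ordinal age bracket
    rin      = c(7.1, 6.8, 8.0, 7.5, 6.9, 7.7),
    ischemia = c(120, 340, 95, 410, 230, 180)
)

# One column per coefficient: B0 (intercept), B1 (sex), B2 (age),
# B3 (RIN) and B4 (ischemia).
design <- model.matrix(~ sex + age + rin + ischemia, data = toy)
colnames(design) <- c("intercept", "sex", "age", "rin", "ischemia")
print(design)
```

With this coding the `sex` column is an indicator for level 2, so under these assumptions B1 is the female-minus-male difference after adjusting for the other covariates.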

### 0.  Data Needed by this notebook

Input data files are:

1. **[tissues.tsv](https://github.com/TheJacksonLaboratory/sbas/blob/master/assets/tissues.tsv)**: a curated list of tissues with alternative display names and an include flag (**0** to skip, **1** to include).
2. **[gtex.corrected.rds](https://zenodo.org/record/4179559/files/gtex.tar.gz)**: the corrected GTEx expressionSet object, obtained through Zenodo (gtex.tar.gz). The functions that generate it from the GTEx dataset are in `differentialSplicingJunctionAnalysis.ipynb`: `renewGTExExpressionSet` downloads GTEx v8 using `yarn::downloadGTExV8()` from the [yarn](https://bioconductor.org/packages/release/bioc/html/yarn.html) package, and `inspectAndCorrectExpressionSetObject` makes sure the expression object we are using with our SRA subset is correct. This does not need to be redone unless our samples change and/or we upgrade our V8 versions. A cautionary note: the latest version of GTEx is available through AnVIL; see the note regarding current access to [GTEx data](GTEx.md).
3. **[SraRunTable.txt](https://zenodo.org/record/4179559/files/srr.tar.gz)**: the old way of obtaining GTEx accessions.
4. **[srr_pdata.csv](https://zenodo.org/record/4179559/files/srr.tar.gz)**
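
The role of the include flag in `tissues.tsv` can be sketched with an in-memory stand-in for the file. The three rows, their counts and the sample labels below are invented for illustration; only the column names follow those assigned later in this notebook.

```r
# A made-up stand-in for tissues.tsv (tab-separated, with a header row).
tsv_text <- paste(
    "name\tfemale\tmale\tinclude\tdisplay_name",
    "adipose_subcutaneous\t100\t200\t1\tAdipose (sc)",
    "bladder\t3\t8\t0\tBladder",
    "liver\t60\t110\t1\tLiver",
    sep = "\n"
)
tissues <- read.table(text = tsv_text, header = TRUE, sep = "\t",
                      stringsAsFactors = FALSE)
colnames(tissues) <- c("SMTSD", "female", "male", "include", "display_name")

# Keep only the tissues flagged with include == 1.
tissues <- tissues[tissues$include == 1, ]

# Hypothetical per-sample tissue labels, filtered against the curated list.
sample_tissue <- c("liver", "bladder", "liver", "adipose_subcutaneous")
keep <- sample_tissue %in% tissues$SMTSD
print(keep)
```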


### 1. Data files created by this notebook

Output text files are written to the ``../data/`` directory (at the same level as the ``jupyter`` directory). 

For each of the 39 tissues, this notebook produces the following results:

1. **{tissue}_DGE.csv**: topTable results for the edgeR/limma differential analysis
2. **{tissue}_DGE_ensg_map.csv**: a convenience mapping of the ENSG identifiers to gene symbols
3. **{tissue}_DGE_refined.csv**: the subset of the topTable results with a fold change of at least 1.5 and an adjusted P-value ≤ 0.05.
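
The refinement criterion can be illustrated on a small table. The gene names and values below are made up; only the column names `logFC` and `adj.P.Val` follow the topTable output, and the cutoffs are the ones used by `fit_tissue()` in this notebook.

```r
# Made-up topTable-like results; only logFC and adj.P.Val are used here.
res <- data.frame(
    logFC     = c(1.20, -0.70, 0.30, -0.90, 0.65),
    adj.P.Val = c(0.001, 0.030, 0.010, 0.200, 0.049),
    row.names = c("geneA", "geneB", "geneC", "geneD", "geneE")
)

# A 1.5-fold change in either direction corresponds to
# |log2 fold change| >= log2(1.5), which is about 0.585.
lfc_cutoff <- log2(1.5)
refined    <- res$adj.P.Val <= 0.05 & abs(res$logFC) >= lfc_cutoff
print(res[refined, ])
```

Taking the absolute value of `logFC` keeps both male-biased and female-biased genes.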

Additionally, diagnostic plots are produced:

1. **{tissue}-gene-y-voom-MDSplot-100.pdf**: multi-dimensional scaling plot (MDSplot) of the voom-transformed data, with male samples labelled `m` in `blue` and female samples labelled `f` in `red`.
2. **{tissue}-gene-y-MDSplot-100.pdf**: the same MDSplot without voom.

### 1.1 load dependencies

In [1]:
suppressMessages({
    options(warn = -1) 
    library(gprofiler2)
    library(downloader)
    library(readr)
    library(edgeR)
    library(limma)
    library(statmod)
    library(snakecase)
    library(multtest)
    library(stringi)
    library(dplyr)
    Sys.setenv(TAR = "/bin/tar") # for gzfile
})

### 1.2 load corrected GTEx expressionSet object

In [2]:
message("\nReading GTEx expressionSet obj from ../data/gtex.corrected.rds\n")
obj <- readRDS(file = "../data/gtex.corrected.rds")
pData(obj)$SAMPID <- gsub('-','\\.',pData(obj)$SAMPID)
message("\ndone reading GTEx corrected expressionSet Object\n")
dim(obj)


Reading GTEx expressionSet obj from ../data/gtex.corrected.rds





done reading GTEx corrected expressionSet Object




### 2. Preparation for Differential Expression Analysis
### 2.1 Keep only the reduced tissues

In [3]:
tissue_reduction <- read.table("../assets/tissues.tsv", header=TRUE, sep="\t",
                               skipNul=FALSE, stringsAsFactors = FALSE)
colnames(tissue_reduction)  <- c("SMTSD","female","male","include","display_name")

## only keep those we wish to include
tissue_reduction <- tissue_reduction[tissue_reduction$include==1,]
#glimpse(tissue_reduction)

In [4]:
# convert the tissue names in the GTEx object to snake_case so that they match tissues.tsv
#levels(pData(obj)$SMTSD)
pData(obj)$SMTSD <- snakecase::to_snake_case(as.character(pData(obj)$SMTSD))
head(pData(obj)$SMTSD)

In [5]:
keep <- pData(obj)$SMTSD %in% tissue_reduction$SMTSD
#table(keep)

In [6]:
obj   <-  obj[,keep == TRUE]
message("\nExpressionSet object extracted with dimensions: ", dim(obj)[1], "x", dim(obj)[2])


ExpressionSet object extracted with dimensions: 55878x15531



### 3. Differential analysis with edgeR and Limma

Using the expressionSet object for the GTEx data set, we perform a linear fit after calculating normalization factors (based on the library size) with edgeR and estimating the mean-variance relationship with `voom`. The resulting matrices are saved as files.
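
To make the transformation step concrete, the sketch below computes log2 counts per million in base R with the small offsets that `voom` applies (0.5 added to each count, 1 added to each library size). It covers only this transformation, not the mean-variance trend or the precision weights, and it is not a replacement for `limma::voom()`. The count matrix is made up, and the normalization factors are fixed at 1 as a simplifying assumption.

```r
# Made-up count matrix: 4 genes (rows) x 3 samples (columns).
counts <- matrix(c(10,  0, 250, 40,
                   12,  3, 300, 35,
                    8,  1, 180, 60),
                 nrow = 4,
                 dimnames = list(paste0("gene", 1:4), paste0("s", 1:3)))

# Effective library size. The normalization factors are fixed at 1 here,
# whereas calcNormFactors() would estimate them from the data.
norm_factors <- rep(1, ncol(counts))
lib_size     <- colSums(counts) * norm_factors

# log2 counts per million with the offsets used by voom.
log_cpm <- t(log2((t(counts) + 0.5) / (lib_size + 1) * 1e6))
print(round(log_cpm, 2))
```

The offsets keep the logarithm finite for zero counts, such as `gene2` in sample `s1`.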

### 3.1 Function fit_tissue

The function `fit_tissue()` accepts two arguments, the `tissue` and an `object`, and creates the **model matrix** based on that tissue's sex.

In [7]:
fit_tissue <- function (tissue, obj) {
    tissue_true             <- pData(obj)$SMTSD == tissue
    tissue_obj              <- obj[,tissue_true ==TRUE]
    tissue_sex              <- factor(pData(tissue_obj)$SEX)
    tissue_age              <- as.numeric(factor(as.character(pData(tissue_obj)$AGE)))
    tissue_rin              <- as.numeric(pData(tissue_obj)$SMRIN)
    tissue_ischemia         <- as.numeric(pData(tissue_obj)$SMTSISCH)

    # missing ischemic time: impute with the tissue median
    tissue_ischemia[is.na(tissue_ischemia)] <- median(tissue_ischemia, na.rm=TRUE)

    # missing RIN: impute with the tissue median
    tissue_rin[is.na(tissue_rin)] <- median(tissue_rin, na.rm=TRUE)
    
    # missing age bracket: impute with the tissue median
    tissue_age[is.na(tissue_age)] <- median(tissue_age, na.rm=TRUE)

    # model matrix -- with added factors sex, age, RIN and Ischemia time
    tissue_design           <- model.matrix(~tissue_sex+tissue_age+tissue_rin+tissue_ischemia)
    colnames(tissue_design) <- c("intercept","sex","age","rin","ischemia")

    message ("tissue design done")
    # female
    female_obj           <- tissue_obj[,pData(tissue_obj)$SEX == 2]
    female_exprs_rowSums <- rowSums(cpm(exprs(female_obj))>=1)
    count_threshold      <- 0.25 * dim(female_obj)[2]
   
    # male
    male_obj           <- tissue_obj[,pData(tissue_obj)$SEX == 1]
    male_exprs_rowSums <- rowSums(cpm(exprs(male_obj))>=1)
    

    keep_male          <- male_exprs_rowSums >= count_threshold
    keep_female        <- female_exprs_rowSums >= count_threshold
    
    # now keep only those genes that meet both the male and the female criteria
    keep <- keep_male  & keep_female
    
    tissue_obj <- tissue_obj[keep==TRUE,]
    rm(male_obj)
    rm(female_obj)
    
    
    y_tissue       <- DGEList(counts=exprs(tissue_obj), group=tissue_sex)
    y_tissue       <- calcNormFactors(y_tissue)
    y_tissue_voom  <- voom(y_tissue, tissue_design)
    
    sex            <- ifelse(pData(tissue_obj)$SEX==1,'male','female')
    Gender         <- substring(sex,1,1)
    filename       <- paste0(paste0("../pdf/", snakecase::to_snake_case(tissue)),"-gene-y-MDSplot-100.pdf")
    pdf (filename)
        plotMDS(y_tissue, labels=Gender, top=100, col=ifelse(Gender=="m","blue","red"), 
                gene.selection="common")
    dev.off()
    filename       <- paste0(paste0("../pdf/", snakecase::to_snake_case(tissue)),"-gene-y-voom-MDSplot-100.pdf")
    pdf (filename)    
        plotMDS(y_tissue_voom, labels=Gender, top=100, col=ifelse(Gender=="m","blue","red"), 
                gene.selection="common")
    dev.off()

    fit_tissue_res  <- lmFit(y_tissue_voom, tissue_design)
    fit_tissue_res  <- eBayes(fit_tissue_res, robust=TRUE)
    results_tissue  <- topTable (fit_tissue_res, coef='sex', number=nrow(y_tissue))
    results_refined <- results_tissue$adj.P.Val <= 0.05 & abs(results_tissue$logFC) >= abs(log2(1.5))
    ensgfile  = paste(paste("../data",gsub(" ","",tissue), sep="/"),"DGE_ensg_map.csv", sep="_")
    
    
    filename  = paste(paste("../data",gsub(" ","",tissue), sep="/"),"DGE.csv", sep="_")
    rfilename = paste(paste("../data",gsub(" ","",tissue), sep="/"),"DGE_refined.csv", sep="_")
   
    ensg_names <- as.character(rownames(results_tissue[results_refined,]))
    ensg_genes <- ensg_names

    for (i in seq_along(ensg_names)) {
        dont_convert = 0
        ensg <- as.character(strsplit(ensg_names[i],'\\.\\w+$'))
        ensg_names[i] = ensg[1]
        if (ensg_names[i] == "ENSG00000233864") {
            ensg_genes[i] = as.character("TTTY15")
            dont_convert = 1
        } 
        if (ensg_names[i] == "ENSG00000240800") {
            ensg_genes[i] = as.character("ATP8A2P1")
            dont_convert = 1
        } 
        if (!dont_convert) {
            
            res <- gconvert(c(as.character(ensg_names[i])),
                                      organism = "hsapiens",
                                      target = "ENSG",
                                      numeric_ns = "", 
                                      mthreshold = Inf,
                                      filter_na = TRUE)
            if (!is.null(res)) {
                # gconvert may return several rows for one query; keep the first name
                ensg_genes[i] <- res$name[1]
            }
        }
    }
    
    ensg_maps <- cbind(ensg_names, ensg_genes)

    write.table(results_tissue, filename, sep=',', quote=FALSE)
    write.table(results_tissue[results_refined,], rfilename, sep=',', quote=FALSE)
    write.table(ensg_maps, ensgfile, sep=',', quote=FALSE, row.names=FALSE)
    return (results_tissue)
}
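
The expression filter inside `fit_tissue()` can be sketched in base R as follows. The counts and sex labels are made up, and counts per million are computed without normalization factors, which is a simplification of `edgeR::cpm()`. As in the function above, a single threshold of 25% of the number of female samples is applied to both sexes.

```r
# Made-up counts: 3 genes x 6 samples; assumed coding 1 = male, 2 = female.
counts <- matrix(c(50, 0,  950,
                   40, 1,  959,
                   60, 0,  940,
                   55, 0,  945,
                    0, 0, 1000,
                   45, 0,  955),
                 nrow = 3,
                 dimnames = list(c("g1", "g2", "g3"), paste0("s", 1:6)))
sex <- c(1, 1, 1, 2, 2, 2)

# Counts per million per sample (no normalization factors in this sketch).
cpm_mat <- t(t(counts) / colSums(counts) * 1e6)

# Number of samples per sex in which each gene reaches at least 1 CPM.
n_expr_male   <- rowSums(cpm_mat[, sex == 1, drop = FALSE] >= 1)
n_expr_female <- rowSums(cpm_mat[, sex == 2, drop = FALSE] >= 1)

# One threshold, derived from the number of female samples.
count_threshold <- 0.25 * sum(sex == 2)

# A gene is kept only if it passes in males and in females.
keep <- n_expr_male >= count_threshold & n_expr_female >= count_threshold
print(keep)
```

In this toy example `g2` is dropped because it is not expressed in any female sample, even though it is detected in one male sample.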

In [8]:
pData(obj)$SMTSD <- factor(pData(obj)$SMTSD)
# levels(pData(obj)$SMTSD)
smtsd_len <- length(levels(pData(obj)$SMTSD))
message("Length of factors (SMTSD): ", smtsd_len)

Length of factors (SMTSD): 39



### 3.2 Looping through reduced Tissue set

Loop through all the tissues and run the differential gene expression analysis for each tissue.

In [9]:
for (tissue in levels(pData(obj)$SMTSD)) { 
    fit_tissue(tissue = tissue,obj = obj)
    message("Done fit tissue, ", tissue)
}

tissue design done



Done fit tissue, adipose_subcutaneous



tissue design done



Done fit tissue, adipose_visceral_omentum



tissue design done



Done fit tissue, adrenal_gland



tissue design done



Done fit tissue, artery_aorta



tissue design done



Done fit tissue, artery_coronary



tissue design done



Done fit tissue, artery_tibial



tissue design done



Done fit tissue, brain_caudate_basal_ganglia



tissue design done



No results to show
Please make sure that the organism or namespace is correct



Done fit tissue, brain_cerebellar_hemisphere



tissue design done



No results to show
Please make sure that the organism or namespace is correct



Done fit tissue, brain_cerebellum



tissue design done



Done fit tissue, brain_cortex



tissue design done



Done fit tissue, brain_frontal_cortex_ba_9



tissue design done



Done fit tissue, brain_hippocampus



tissue design done



Done fit tissue, brain_hypothalamus



tissue design done



Done fit tissue, brain_nucleus_accumbens_basal_ganglia



tissue design done



Done fit tissue, brain_putamen_basal_ganglia



tissue design done



Done fit tissue, brain_spinal_cord_cervical_c_1



tissue design done



No results to show
Please make sure that the organism or namespace is correct



No results to show
Please make sure that the organism or namespace is correct



No results to show
Please make sure that the organism or namespace is correct



No results to show
Please make sure that the organism or namespace is correct



No results to show
Please make sure that the organism or namespace is correct



No results to show
Please make sure that the organism or namespace is correct



No results to show
Please make sure that the organism or namespace is correct



No results to show
Please make sure that the organism or namespace is correct



No results to show
Please make sure that the organism or namespace is correct



Done fit tissue, breast_mammary_tissue



tissue design done



No results to show
Please make sure that the organism or namespace is correct



Done fit tissue, cells_cultured_fibroblasts



tissue design done



No results to show
Please make sure that the organism or namespace is correct



Done fit tissue, cells_ebv_transformed_lymphocytes



tissue design done



Done fit tissue, colon_sigmoid



tissue design done



Done fit tissue, colon_transverse



tissue design done



Done fit tissue, esophagus_gastroesophageal_junction



tissue design done



No results to show
Please make sure that the organism or namespace is correct



No results to show
Please make sure that the organism or namespace is correct



No results to show
Please make sure that the organism or namespace is correct



Done fit tissue, esophagus_mucosa



tissue design done



Done fit tissue, esophagus_muscularis



tissue design done



Done fit tissue, heart_atrial_appendage



tissue design done



Done fit tissue, heart_left_ventricle



tissue design done



Done fit tissue, liver



tissue design done



Done fit tissue, lung



tissue design done



Done fit tissue, muscle_skeletal



tissue design done



Done fit tissue, nerve_tibial



tissue design done



Done fit tissue, pancreas



tissue design done



Done fit tissue, pituitary



tissue design done



No results to show
Please make sure that the organism or namespace is correct



No results to show
Please make sure that the organism or namespace is correct



No results to show
Please make sure that the organism or namespace is correct



Done fit tissue, skin_not_sun_exposed_suprapubic



tissue design done



No results to show
Please make sure that the organism or namespace is correct



No results to show
Please make sure that the organism or namespace is correct



Done fit tissue, skin_sun_exposed_lower_leg



tissue design done



No results to show
Please make sure that the organism or namespace is correct



Done fit tissue, small_intestine_terminal_ileum



tissue design done



Done fit tissue, spleen



tissue design done



Done fit tissue, stomach



tissue design done



No results to show
Please make sure that the organism or namespace is correct



Done fit tissue, thyroid



tissue design done



No results to show
Please make sure that the organism or namespace is correct



No results to show
Please make sure that the organism or namespace is correct



Done fit tissue, whole_blood



### Appendix - Metadata

For replicability and reproducibility purposes, we also print the following metadata:

### Appendix - 1. Checksums with the sha256 algorithm
1. Checksums of **'artefacts'**, files generated during the analysis and stored in the **`data`** directory
2. List of environment metadata, dependencies, versions of libraries using `utils::sessionInfo()` and [`devtools::session_info()`](https://devtools.r-lib.org/reference/session_info.html)

In [10]:
figure_id   = "differentialGeneExpression"

message("Generating sha256 checksums of the artefacts in the `../data/` directory .. ")
system(paste0("cd ../data && find . -type f -exec sha256sum {} \\;  >  ../data/", figure_id, "_sha256sums.txt"), intern = TRUE)
message("Done!\n")

data.table::fread(paste0("../data/", figure_id, "_sha256sums.txt"), header = FALSE, col.names = c("sha256sum", "file"))

Generating sha256 checksums of the artefacts in the `../data/` directory .. 



Done!




| sha256sum `<chr>` | file `<chr>` |
|---|---|
| 8fc197023652109dcfef894efa26e7ccc04077a4b67b79ccacc5089e9fc06464 | ./liver_DGE.csv |
| 17103a26867545a2c0219c16a689868fa8a3ac338b00c64fbcc0ac93a9b2fa11 | ./esophagus_muscularis_DGE.csv |
| ed3050e5ae8b2564fb04a9761cf300f629d85982874d2e32f263f7b2f08d5fc0 | ./brain_frontal_cortex_ba_9_DGE_ensg_map.csv |
| eb1c7d78388d98a948213356d009b1e3021c956cd0eb5aafaf01fc0792661e64 | ./pituitary_DGE_refined.csv |
| 402b609deee419ec9c2bdc107d298d04814ca7fae3d6dd9f877b70bff6dbf088 | ./brain_cortex_DGE_refined.csv |
| 2e10a4bdc3fa4a080346dad337cfd1daa130515a89b50d300ce4cd01f748606d | ./liver_DGE_ensg_map.csv |
| 69681daa192eba6eacec2a9e8b8eee0ea76dc7da89eb1ff6e84ba7838641c886 | ./esophagus_muscularis_DGE_ensg_map.csv |
| e6edcdb4c192ca3ab182193f5367460a05aedb6f4530a13e6babcfe44425e3c5 | ./whole_blood_DGE_refined.csv |
| fa43df9e4591a05e3f0592f4d9e1f9db9afdab1820ce8518500b0cf84e626068 | ./heart_atrial_appendage_DGE.csv |
| 3cf74d1a65cfd4dd9c450b9c5f46467eb4e3a440a3a8e9b2eaeac43d7a81320d | ./nerve_tibial_DGE_ensg_map.csv |


### Appendix - 2. Library Session Information

In [11]:
dev_session_info   <- devtools::session_info()
utils_session_info <- utils::sessionInfo()

message("Saving `devtools::session_info()` objects in ../data/devtools_session_info.rds  ..")
saveRDS(dev_session_info, file = paste0("../data/", figure_id, "_devtools_session_info.rds"))
message("Done!\n")

message("Saving `utils::sessionInfo()` objects in ../data/utils_session_info.rds  ..")
saveRDS(utils_session_info, file = paste0("../data/", figure_id ,"_utils_info.rds"))
message("Done!\n")

dev_session_info$platform
dev_session_info$packages[dev_session_info$packages$attached==TRUE, ]

Saving `devtools::session_info()` objects in ../data/devtools_session_info.rds  ..



Done!




Saving `utils::sessionInfo()` objects in ../data/utils_session_info.rds  ..



Done!




 setting  value                       
 version  R version 4.1.1 (2021-08-10)
 os       Ubuntu 18.04.4 LTS          
 system   x86_64, linux-gnu           
 ui       X11                         
 language en_US.UTF-8                 
 collate  en_US.UTF-8                 
 ctype    en_US.UTF-8                 
 tz       Etc/UTC                     
 date     2021-09-23                  

| package `<chr>` | ondiskversion `<chr>` | loadedversion `<chr>` | path `<chr>` | loadedpath `<chr>` | attached `<lgl>` | is_base `<lgl>` | date `<chr>` | source `<chr>` | md5ok `<lgl>` | library `<fct>` |
|---|---|---|---|---|---|---|---|---|---|---|
| Biobase | 2.52.0 | 2.52.0 | /opt/conda/envs/sbas/lib/R/library/Biobase | /opt/conda/envs/sbas/lib/R/library/Biobase | True | False | 2021-05-19 | Bioconductor | | /opt/conda/envs/sbas/lib/R/library |
| BiocGenerics | 0.38.0 | 0.38.0 | /opt/conda/envs/sbas/lib/R/library/BiocGenerics | /opt/conda/envs/sbas/lib/R/library/BiocGenerics | True | False | 2021-05-19 | Bioconductor | | /opt/conda/envs/sbas/lib/R/library |
| downloader | 0.4 | 0.4 | /opt/conda/envs/sbas/lib/R/library/downloader | /opt/conda/envs/sbas/lib/R/library/downloader | True | False | 2015-07-09 | CRAN (R 4.1.0) | | /opt/conda/envs/sbas/lib/R/library |
| dplyr | 1.0.7 | 1.0.7 | /opt/conda/envs/sbas/lib/R/library/dplyr | /opt/conda/envs/sbas/lib/R/library/dplyr | True | False | 2021-06-18 | CRAN (R 4.1.0) | | /opt/conda/envs/sbas/lib/R/library |
| edgeR | 3.34.0 | 3.34.0 | /opt/conda/envs/sbas/lib/R/library/edgeR | /opt/conda/envs/sbas/lib/R/library/edgeR | True | False | 2021-05-19 | Bioconductor | | /opt/conda/envs/sbas/lib/R/library |
| gprofiler2 | 0.2.1 | 0.2.1 | /opt/conda/envs/sbas/lib/R/library/gprofiler2 | /opt/conda/envs/sbas/lib/R/library/gprofiler2 | True | False | 2021-08-23 | CRAN (R 4.1.1) | | /opt/conda/envs/sbas/lib/R/library |
| limma | 3.48.0 | 3.48.0 | /opt/conda/envs/sbas/lib/R/library/limma | /opt/conda/envs/sbas/lib/R/library/limma | True | False | 2021-05-19 | Bioconductor | | /opt/conda/envs/sbas/lib/R/library |
| multtest | 2.48.0 | 2.48.0 | /opt/conda/envs/sbas/lib/R/library/multtest | /opt/conda/envs/sbas/lib/R/library/multtest | True | False | 2021-05-19 | Bioconductor | | /opt/conda/envs/sbas/lib/R/library |
| readr | 2.0.1 | 2.0.1 | /opt/conda/envs/sbas/lib/R/library/readr | /opt/conda/envs/sbas/lib/R/library/readr | True | False | 2021-08-10 | CRAN (R 4.1.1) | | /opt/conda/envs/sbas/lib/R/library |
| snakecase | 0.11.0 | 0.11.0 | /opt/conda/envs/sbas/lib/R/library/snakecase | /opt/conda/envs/sbas/lib/R/library/snakecase | True | False | 2019-05-25 | CRAN (R 4.1.0) | | /opt/conda/envs/sbas/lib/R/library |
