# **Gene Expression Heatplot**
This notebook generates a heat plot representing sex-biased differential gene expression as well as a plot showing the counts of differentially expressed genes per tissue.

Each cell of the heatmap is the correlation between the vectors of sex-biased log fold changes (logFC) of two tissues, so it measures how similar the male-versus-female expression differences are between those tissues.
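
As a minimal sketch of what one heatmap cell means, the toy example below builds made-up logFC vectors for three hypothetical tissues and correlates them with `cor` (Pearson by default). The tissue names and values are invented for illustration and are not taken from the GTEx results.

```r
# Toy logFC values for five genes in three hypothetical tissues (invented numbers)
toy_fc <- cbind(
  tissue_a = c( 1.0,  0.5, -0.2, 0.0, -1.0),
  tissue_b = c( 2.0,  1.0, -0.4, 0.0, -2.0),  # exactly 2 x tissue_a
  tissue_c = c(-1.0, -0.5,  0.2, 0.0,  1.0)   # exactly -1 x tissue_a
)
rownames(toy_fc) <- paste0("gene", 1:5)

# Pearson correlation between every pair of tissue columns
toy_cor <- cor(toy_fc)
print(round(toy_cor, 2))
```

Tissues whose fold changes scale together correlate at 1, and tissues whose fold changes point in opposite directions correlate at -1.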

The heatmaps are rendered after the following steps:
1. Get the differential gene expression (DGE) files.
2. Use ``../assets/tissues.tsv`` to limit the tissues to those with at least 50 samples in each sex (``tissues.tsv`` was produced by a Python script).
3. Use the file pattern **`../data/*_DGE.csv`** for the differentially expressed genes to get all the values for the matrix.
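
The file pattern in step 3 is written as a shell-style glob, whereas `list.files()` interprets its `pattern` argument as a regular expression. The sketch below shows one way to translate between the two with `utils::glob2rx()`; the candidate file names are hypothetical and serve only to show which ones match.

```r
# Translate the glob into an anchored regular expression
glob  <- "*_DGE.csv"
regex <- utils::glob2rx(glob)
print(regex)

# Hypothetical file names, used only to show which ones match
candidates <- c("liver_DGE.csv", "liver_DGE.csv.bak", "liver_AS.csv")
matches <- grepl(regex, candidates)
print(setNames(matches, candidates))
```

Only `liver_DGE.csv` matches, because the translated expression escapes the dot and anchors the match at the end of the name.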

## **Running this notebook**:

See the README for setting up prerequisites for the notebook.

## 1. Setup 

This notebook assumes that `countGenesAndEvents.ipynb` has been run; that notebook unpacks the results of the differential gene expression analysis performed in `differentialGeneExpressionAnalysis.ipynb`.
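
A minimal sketch of a pre-flight check for this assumption is shown below. The two paths are the inputs read later in this notebook; the `required` and `missing_inputs` names and the messages are illustrative.

```r
# Inputs this notebook reads later on
required <- c("../data/gtex.corrected.rds", "../assets/tissues.tsv")

# Report any input that is not present relative to the working directory
missing_inputs <- required[!file.exists(required)]
if (length(missing_inputs) > 0) {
    message("Missing inputs, run the upstream notebooks first: ",
            paste(missing_inputs, collapse = ", "))
} else {
    message("All required inputs found.")
}
```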

In [None]:
# Load libraries quietly; warnings are suppressed for this cell only
suppressWarnings({
    library(Biobase)   # ExpressionSet accessors such as exprs()
    library(pheatmap)  # clustered heatmaps
})
Sys.setenv(TAR = "/bin/tar") # for gzfile

## 2. Making the matrices

### 2.1 Read in all the differential Gene Expression Analysis results

In [None]:
# list.files() treats `pattern` as a regular expression, so match the suffix explicitly
filenames <- list.files("../data", pattern = "_DGE\\.csv$", all.files = FALSE, full.names = TRUE)
message("Number of DGE files found with *_DGE.csv pattern: ", length(filenames))

### 2.2 Read in the curated "../assets/tissues.tsv"

The **`../assets/tissues.tsv`** file contains an `include` flag that marks a tissue for analysis when it has at least **50** samples in each sex, that is, at least 50 **male** and at least 50 **female** samples.
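
The sketch below illustrates this inclusion rule on a toy table. The counts are invented, and recomputing `include` from the counts is an assumption made for illustration; in this notebook the flag is read directly from `tissues.tsv`.

```r
# Toy stand-in for tissues.tsv (invented counts)
toy_tissues <- data.frame(
  SMTSD  = c("Liver", "Uterus", "Lung"),
  female = c(70, 120, 200),
  male   = c(150, 0, 400),
  stringsAsFactors = FALSE
)

# Flag tissues with at least 50 samples in each sex
min_samples <- 50
toy_tissues$include <- as.integer(toy_tissues$female >= min_samples &
                                  toy_tissues$male   >= min_samples)

# Keep only the flagged tissues, as the notebook does with include == 1
kept <- toy_tissues[toy_tissues$include == 1, ]
print(kept$SMTSD)
```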

In [None]:
head(filenames,2)
# Read in the tissue table up front so that later cells can rely on it.
# tissues.tsv flags the subset of tissues desired for analysis.
tissue_reduction <- read.table(file="../assets/tissues.tsv", header=TRUE, sep="\t",
                               skipNul=FALSE, stringsAsFactors = FALSE)
colnames(tissue_reduction)  <- c("SMTSD","female","male","include","display_name")
tissue_reduction$SMTSD <- factor(snakecase::to_snake_case(as.character(tissue_reduction$SMTSD)))
# only include those tissues we wish to continue with
table(tissue_reduction$include)
tissue_reduction <- tissue_reduction[tissue_reduction$include==1,]

message("Number of tissues with >=50 samples in each sex in assets/tissues.tsv (rows and columns of tissue_reduction): ",
        paste(dim(tissue_reduction), collapse=" "))

### 2.3 Use the first file for the row order of the matrices

The first file is used, arbitrarily, to obtain the ordered row names that are assigned to the remaining files when the matrix is constructed.
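
Using `rownames()` on the result of `read.csv()` only yields gene identifiers if the file stores them as row names. The sketch below shows one layout for which that holds: a file written by `write.table()` with its default row names, whose header therefore has one field fewer than the data rows. Whether the DGE files were written this way is an assumption, and the gene names and values are invented.

```r
# Toy DGE result with gene identifiers as row names (invented values)
toy_dge <- data.frame(logFC = c(0.8, -1.2), adj.P.Val = c(0.01, 0.20),
                      row.names = c("GENE_B", "GENE_A"))

# Write with row names; the header line then has one field fewer than the data lines
tmp <- tempfile(fileext = ".csv")
write.table(toy_dge, file = tmp, sep = ",", quote = FALSE)

# read.csv() detects the short header and uses the first column as row names
res <- read.csv(tmp)
res <- res[order(rownames(res)), ]   # same ordering step as in the notebook
print(rownames(res))
unlink(tmp)
```

If the files instead carry a header for the identifier column, `read.csv()` assigns row names `1, 2, ...` and the identifiers end up in an ordinary column.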

In [None]:
# Read the first DGE file; its row names are assumed to be gene identifiers
fullfilename <- filenames[1]
logFC_mat    <- read.csv(fullfilename)
# Order rows by gene identifier to obtain a reference row order
logFC_mat    <- logFC_mat[order(rownames(logFC_mat)), ]
logFC_mat_rownames <- as.character(rownames(logFC_mat))
pVal_mat_rownames  <- logFC_mat_rownames
pVal_mat     <- logFC_mat

In [None]:
head(logFC_mat,2)

### 2.4 Create a matrix of the logFC values for each of the files

Read in the corrected GTEx expression object and take the gene names from the row names of its expression matrix (`exprs`). These gene names define the rows of the logFC matrix, which has one column per DGE file.
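
Filling that matrix means aligning each file's genes to the common gene list. The sketch below shows one way to do the alignment with `match()`, under the assumption that a file may contain genes that are absent from the list; all names and values are invented.

```r
# Hypothetical master gene list and one toy DGE result
all_genes <- c("g1", "g2", "g3", "g4")
res <- data.frame(logFC = c(0.5, -1.0, 2.0),
                  row.names = c("g3", "g1", "gX"))  # gX is absent from all_genes

# One column per file; genes without a result keep logFC = 0
fc <- matrix(0, nrow = length(all_genes), ncol = 1,
             dimnames = list(all_genes, "toy_tissue"))

# Row of fc for each result row (NA when the gene is not in all_genes)
idx  <- match(rownames(res), all_genes)
keep <- !is.na(idx)
fc[idx[keep], 1] <- res$logFC[keep]
print(fc)
```

Indexing both sides with `keep` guarantees that the assigned values and the target rows have the same length and refer to the same genes.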

In [None]:
gtex.corrected<- readRDS(file = "../data/gtex.corrected.rds")

In [None]:
head(exprs(gtex.corrected),2)

In [None]:
all.genes    <- rownames(exprs(gtex.corrected))
head(all.genes,2)
length(all.genes)
# Preallocate a genes x files matrix of zeros
fc.exp.mat <- matrix(0, nrow = length(all.genes), ncol = length(filenames))
dim(fc.exp.mat)

In [None]:
for (colmatch in seq_along(filenames))
{
    next.res <- read.csv(filenames[colmatch])

    # Row of fc.exp.mat for each gene in this file (NA if the gene is not in all.genes)
    rowmatch <- match(rownames(next.res), all.genes)
    keep     <- !is.na(rowmatch)

    # Fill this tissue's column; genes absent from the file keep logFC = 0
    fc.exp.mat[rowmatch[keep], colmatch] <- next.res$logFC[keep]
}

In [None]:
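# Derive the tissue name from each file name by dropping the directory and the _DGE.csv suffix
colnames(fc.exp.mat) <- sub("_DGE\\.csv$", "", basename(filenames))

label.tab <- read.csv('../assets/tissues.tsv', header=TRUE, sep='\t')

# Look up the display name for each column, in column order;
# columns without an entry in tissues.tsv keep their file-derived name
display <- as.character(label.tab$display.name[match(colnames(fc.exp.mat), label.tab$name)])
colnames(fc.exp.mat) <- ifelse(is.na(display), colnames(fc.exp.mat), display)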

In [None]:
tissue_list  <- levels(factor(tissue_reduction$SMTSD))
message("Number of tissues with at least 50 samples for both sexes: ",length(tissue_list))

In [None]:
rs <- rowSums(fc.exp.mat)
max(rs)

### 2.5 Calculate the correlation between the tissues using the logFC and render the heatmaps of the distance correlations

Calculate the pairwise correlations between the tissues using the logFC values within each tissue.   
First reduce the matrix to the genes with a non-zero logFC in at least one tissue (necessary before calculating the correlations).
Then cluster the tissues by the similarity of their fold changes and display the heatmap.
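
The row filter can be sketched on a toy matrix as follows. The values and tissue names are invented, and `drop = FALSE` is an addition for safety rather than something the notebook uses.

```r
# Toy logFC matrix (invented values): rows are genes, columns are hypothetical tissues
toy <- cbind(
  t1 = c( 1.0, 0, -2.0, 0, 0.5),
  t2 = c( 0.8, 0, -1.5, 0, 0.1),
  t3 = c(-1.0, 0,  0.3, 0, 2.0)
)
rownames(toy) <- paste0("gene", 1:5)

# Keep genes with a non-zero logFC in at least one tissue;
# drop = FALSE keeps the matrix shape even if a single row survives
filtered <- toy[rowSums(toy != 0) > 0, , drop = FALSE]
print(rownames(filtered))

# All-zero rows are not neutral: they still enter the column means and variances
print(round(cor(toy), 3))
print(round(cor(filtered), 3))
```

Comparing the two printed matrices shows how much the all-zero genes shift the correlations in this toy case.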

In [None]:
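# Keep only the genes with a non-zero logFC in at least one tissue
fc.exp.mat <- fc.exp.mat[rowSums(fc.exp.mat != 0) > 0, ]
# Pairwise Pearson correlation between tissue columns
# (a similarity matrix, despite the name dist_mat)
dist_mat <- cor(fc.exp.mat)
# Display the heatmap in the notebook
pheatmap(dist_mat)
# Render the heatmap again with a smaller font and save it as a PDF
hm.parameters <- list(dist_mat, fontsize = 6)
do.call("pheatmap", c(hm.parameters,  filename="../pdf/geneExpressionDistanceCorrelationHeatmapAlllogFC.pdf"))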

## Appendix Metadata

For replicability and reproducibility purposes, we also print the following metadata:

### Appendix.1. Checksums with the sha256 algorithm
1. Checksums of **'artefacts'**, the files generated during the analysis and stored in the **`data`** directory
2. Environment metadata: dependencies and library versions, collected with `utils::sessionInfo()` and [`devtools::session_info()`](https://devtools.r-lib.org/reference/session_info.html)
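
One way to compute such checksums is sketched below. It assumes the `digest` package is installed, which this notebook does not load; `checksum_dir` is a hypothetical helper, and the demonstration runs on a temporary directory rather than on `../data`.

```r
# Hypothetical helper: SHA-256 checksum of every regular file in a directory
# (relies on the 'digest' package, assumed to be installed)
checksum_dir <- function(dir) {
    files <- list.files(dir, full.names = TRUE)
    files <- files[!dir.exists(files)]   # skip sub-directories
    vapply(files, function(f) digest::digest(f, algo = "sha256", file = TRUE),
           character(1))
}

# Demonstrate on a temporary directory containing one small file
demo_dir <- tempfile("artefacts_")
dir.create(demo_dir)
writeLines("example artefact", file.path(demo_dir, "example.txt"))

if (requireNamespace("digest", quietly = TRUE)) {
    sums <- checksum_dir(demo_dir)
    print(data.frame(file = basename(names(sums)), sha256 = unname(sums)))
}
```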

In [None]:
figure_id   <- "expressionHeatmap"

### Appendix.2. Libraries

In [None]:
dev_session_info   <- devtools::session_info()
utils_session_info <- utils::sessionInfo()

# Build each output path once so that the message reports the file actually written
dev_info_file <- paste0("../data/", figure_id, "_devtools_session_info.rds")
message("Saving `devtools::session_info()` objects in ", dev_info_file, "  ..")
saveRDS(dev_session_info, file = dev_info_file)
message("Done!\n")

utils_info_file <- paste0("../data/", figure_id, "_utils_info.rds")
message("Saving `utils::sessionInfo()` objects in ", utils_info_file, "  ..")
saveRDS(utils_session_info, file = utils_info_file)
message("Done!\n")

dev_session_info$platform
dev_session_info$packages[dev_session_info$packages$attached==TRUE, ]