# Figures 3c, 3d: 

**EDIT**: 
This code comes from `dimorphAS/notebook/figure4a` but corresponds to the figure **`Figure 3c, 3d`** of the publication.

## Questions:

- Are these plots only for specific tissues?
- Is the tissue defined inside the [`parseMT.pl`]() script?
- What do `mt.txt` and `lv.txt` refer to? Are they tissues?

 - (lv <- left ventricle)
 - (mt <- mammary tissue)
 
Why only these tissues?

## **NOTE**:

We assume that you have cloned the analysis repository and have changed directory (`cd`) into the parent directory. Before starting the analysis, make sure you have completed the dependency setup by following the instructions in the **`dependencies/README.md`** document. All paths defined in this Notebook are relative to the parent directory (repository). If you have not completed these steps, please close this Notebook and start again by following the above guidelines.

## Loading dependencies

In [5]:
Sys.setenv(TAR = "/bin/tar") 

# dataviz dependencies
library(ggplot2)
library(visdat)
library(patchwork)
library(ggsci)
library(grid)
library(report)

# BDA2E-utilities dependencies
library(parallel)
library(rjags)
library(runjags)
library(compute.es)

##  Figure 3b

code from: [dimorphAS/figures/oldFigureDrafts/figure3b.R](https://github.com/TheJacksonLaboratory/sbas/blob/master/dimorphAS/figures/oldFigureDrafts/figure3b.R)

In [6]:
source("../dimorphAS/DBDA2Eprograms/DBDA2E-utilities.R")


*********************************************************************
Kruschke, J. K. (2015). Doing Bayesian Data Analysis, Second Edition:
A Tutorial with R, JAGS, and Stan. Academic Press / Elsevier.
*********************************************************************



## Retrieving the required data

In [11]:
# Download GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_tpm.gct from Google Cloud 
if (!("GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_tpm.gct" %in% list.files("../data/"))) {
    message("Downloading GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_tpm.gct.gz \nfrom https://console.cloud.google.com/storage/browser/_details/gtex_analysis_v7/rna_seq_data/ ..")
    system("wget -O ../data/GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_tpm.gct.gz https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data/GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_tpm.gct.gz", intern = TRUE)
    message("Done!\n\n")
    message("Unzipping compressed file GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_tpm.gct.gz..")
    system("gunzip ../data/GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_tpm.gct.gz", intern = TRUE)
    message("Done! \n\nThe file GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_tpm.gct can be found in ../data/")
}

Downloading GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_tpm.gct.gz 
from https://console.cloud.google.com/storage/browser/_details/gtex_analysis_v7/rna_seq_data/ ..
Done!


Unzipping compressed file GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_tpm.gct.gz..
Done! 

The file GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_tpm.gct can be found in ../data/
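The download cell above follows a "fetch only if missing" pattern. A minimal base R sketch of that pattern is shown below. The helper name `fetch_if_missing`, the `example.org` URL and the stand-in downloader are hypothetical and exist only so the example runs offline; in real use the default `utils::download.file` would be called.

```r
# Hypothetical helper: download `url` to `dest` only when `dest` is absent.
# `downloader` defaults to utils::download.file but can be swapped for testing.
fetch_if_missing <- function(url, dest, downloader = utils::download.file) {
  if (file.exists(dest)) {
    message("Found ", dest, "; skipping download")
    return(invisible(FALSE))
  }
  dir.create(dirname(dest), recursive = TRUE, showWarnings = FALSE)
  downloader(url, dest)
  invisible(TRUE)
}

# Offline demonstration: the stand-in downloader just writes a placeholder file
fake_downloader <- function(url, destfile) writeLines("placeholder", destfile)
demo_dest <- file.path(tempdir(), "fetch_demo", "example.gct.gz")
unlink(demo_dest)

first_call  <- fetch_if_missing("https://example.org/example.gct.gz", demo_dest, fake_downloader)
second_call <- fetch_if_missing("https://example.org/example.gct.gz", demo_dest, fake_downloader)
```

The first call creates the file and the second call is skipped, which is the behaviour the `if (!(... %in% list.files(...)))` guard aims for.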


In [12]:
tissue.list<-c('Heart - Left Ventricle',
               'Breast - Mammary Tissue',
               'Brain - Cortex.Brain - Frontal Cortex (BA9).Brain - Anterior cingulate cortex (BA24)',
               'Adrenal Gland',
               'Adipose - Subcutaneous',
               'Muscle - Skeletal',
               'Thyroid',
               'Cells - Transformed fibroblasts',
               'Artery - Aorta',
               'Skin - Sun Exposed (Lower leg).Skin - Not Sun Exposed (Suprapubic)')

In [19]:
all.genes<-data.table::fread('../data/GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_tpm.gct',
                              sep='\t',
                              header=TRUE,
                              skip=2,
                              colClasses = c(rep("character", 2), rep("NULL", 11688)))
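The `colClasses` argument above hard-codes the number of sample columns (11688). As a hedged alternative, the column count can be derived from the header line. The sketch below uses base R on a tiny stand-in file written in the GCT layout (version line, dimension line, header, rows); the file contents are invented for illustration. `data.table::fread` also accepts a `select` argument that could serve the same purpose.

```r
# Write a tiny stand-in file in GCT layout: version line, dimension line, header, rows
gct_path <- tempfile(fileext = ".gct")
writeLines(c("#1.2",
             "2\t3",
             "Name\tDescription\tS1\tS2\tS3",
             "ENSG1.1\tGENEA\t0.1\t0.2\t0.3",
             "ENSG2.1\tGENEB\t1.1\t1.2\t1.3"), gct_path)

# Count columns from the header (third line) instead of hard-coding the number
header_fields <- strsplit(readLines(gct_path, n = 3)[3], "\t", fixed = TRUE)[[1]]
n_sample_cols <- length(header_fields) - 2

# Keep the two annotation columns as character and skip every sample column
genes <- utils::read.table(gct_path, sep = "\t", header = TRUE, skip = 2,
                           colClasses = c(rep("character", 2),
                                          rep("NULL", n_sample_cols)),
                           quote = "", comment.char = "")
```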

In [21]:
dim(all.genes)
head(all.genes)

Name,Description
<chr>,<chr>
ENSG00000223972.4,DDX11L1
ENSG00000227232.4,WASH7P
ENSG00000243485.2,MIR1302-11
ENSG00000237613.2,FAM138A
ENSG00000268020.2,OR4G4P
ENSG00000240361.1,OR4G11P


## Filtering out duplicate transcript IDs

In [23]:
all.genes<-all.genes[!duplicated(all.genes$Description),]

In [25]:
dim(all.genes)
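For reference, `duplicated()` flags the second and later occurrences of a value, so negating it keeps the first row for each `Description`. A small sketch on invented identifiers:

```r
# Toy table with invented identifiers; "GENE1" appears twice in Description
genes <- data.frame(Name        = c("ENSG_A.1", "ENSG_B.1", "ENSG_C.1"),
                    Description = c("GENE1", "GENE2", "GENE1"),
                    stringsAsFactors = FALSE)

# TRUE marks a value that has already been seen further up the column
dup_mask <- duplicated(genes$Description)

# Keep only the first occurrence of each Description
deduplicated <- genes[!dup_mask, ]
```

Here the third row is dropped because its `Description` repeats the first row's, even though its `Name` differs.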

## Accessing Position Specific Scoring Matrices (in `dimorphAS/RBP/RBP_PSSMs.zip`)

In [30]:
# Unzip the RBP PSSMs archive into ../data/ unless it has already been extracted
if (!("RBP_PSSMs" %in% list.files("../data/"))) {
    message("Unzipping ../dimorphAS/RBP/RBP_PSSMs.zip INTO ../data ..\n")
    system("cd ../data/ && unzip ../dimorphAS/RBP/RBP_PSSMs.zip", intern = TRUE)
    message("Done! \n\nThe files can be found in ../data/RBP_PSSMs/")
}

In [31]:
# The archive is extracted into ../data/RBP_PSSMs/ (see the previous cell)
rbp.names<-unique(gsub('_.*','',list.files('../data/RBP_PSSMs/')))

In [32]:
length(rbp.names)
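The `gsub('_.*', '', ...)` call strips everything from the first underscore onwards, so several matrices belonging to one RBP collapse into a single name. A sketch with hypothetical file names (the real names inside `RBP_PSSMs.zip` are not shown in this notebook):

```r
# Hypothetical PSSM file names; the real archive contents may be named differently
example_files <- c("RBPX_1.pssm", "RBPX_2.pssm", "RBPY_1.pssm")

# Remove the first underscore and everything after it, then de-duplicate
example_names <- unique(gsub("_.*", "", example_files))
```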

In [33]:
summary.tab<-matrix(ncol=7,nrow=0)

In [35]:
colnames(summary.tab)<-c('Event',
                         'Gene', 
                         'Sig. RBPs',
                         'Sig. Gene Expression',
                         'Sig. Sex',
                         'Tissue',
                         'Dimorphic')

In [36]:
top.rbps<-rbp.names

In [37]:
length(top.rbps)

## Initialising dataframe with columns `coef,rbp,tissue`

In [39]:
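# Typed zero-length vectors are needed: NULL arguments would yield a data frame with no columns
df <-data.frame(coef=numeric(0),rbp=character(0),tissue=character(0))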

In [44]:
# Refactoring needed to not rely on hard coded by position id of tissue
tissue <- tissue.list[[1]]

In [42]:
head(tissue)
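One possible way to address the refactoring note above is to select the tissue by name rather than by position. The helper `select_tissue` below is a hypothetical sketch, not part of the original analysis, and uses a shortened tissue vector for illustration.

```r
# Shortened stand-in for tissue.list
tissues <- c("Heart - Left Ventricle", "Breast - Mammary Tissue", "Thyroid")

# Hypothetical helper: look a tissue up by name and fail loudly if it is absent
select_tissue <- function(name, choices) {
  idx <- match(name, choices)
  if (is.na(idx)) stop("Unknown tissue: ", name)
  choices[[idx]]
}

tissue <- select_tissue("Heart - Left Ventricle", tissues)

# An unknown name raises an error instead of silently picking the wrong tissue
bad_lookup <- try(select_tissue("Liver", tissues), silent = TRUE)
```

This keeps the selection stable if the order of the tissue vector changes.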

## `{Missing files!}`  Dimorph/McmcMostVaryingMoreSigs_*

In [None]:
load(paste('/Users/karleg/Dimorph/McmcMostVaryingMoreSigs_',tissue,'.Rdata',sep=''))
  
mcmcCoda<-mcmcCoda[,which(grepl('beta2\\[101,87\\]',varnames(mcmcCoda))),drop=FALSE]

diagMCMC( mcmcCoda , parName=c("beta2[101,87]") )  
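The pattern `'beta2\\[101,87\\]'` escapes the square brackets because an unescaped `[...]` is a character class in a regular expression. The base R sketch below uses an invented vector of parameter names to show the difference, with `fixed = TRUE` as an alternative that avoids escaping altogether.

```r
# Invented parameter names resembling those of an MCMC object
var_names <- c("beta2[101,87]", "beta2[101,8]", "beta2[10,187]", "sigma")

# Escaped brackets: matches the literal text "beta2[101,87]"
escaped <- grepl("beta2\\[101,87\\]", var_names)

# fixed = TRUE treats the whole pattern as a literal string
literal <- grepl("beta2[101,87]", var_names, fixed = TRUE)

# Unescaped: "[101,87]" is read as "one character out of 1, 0, comma, 8, 7",
# so the pattern looks for "beta2" followed directly by such a character
unescaped <- grepl("beta2[101,87]", var_names)
```

Only the escaped and fixed variants select the intended parameter. The unescaped pattern matches none of these names because `beta2` is always followed by `[`.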


## Using cached `.Rdata` until the files Dimorph/McmcMostVaryingMoreSigs_* are located

In [None]:
#Before running the following, use the Session menu to set working directory to source file location
load('../dimorphAS/figures/oldFigureDrafts/figure3b.RData')

diagMCMC( mcmcCoda , parName=c("beta2[101,87]"), saveType = "png" ) 


##  Figure 3c
**EDIT**: 
This code comes from `dimorphAS/notebook/figure4a` but corresponds to the figure **`Figure 3c`** of the publication.

This script creates figure 4a. Please run the following command first:

`perl` [`parseMT.pl`](https://github.com/TheJacksonLaboratory/sbas/blob/master/dimorphAS/notebook/parseMT.pl)

This creates the files needed for figures `4a` and `4b`, namely `lv.txt` and `mt.txt`.
The input file for [`parseMT.pl`](https://github.com/TheJacksonLaboratory/sbas/blob/master/dimorphAS/notebook/parseMT.pl) is a tab-separated file named `summary_hbm.txt`. Here is a preview of this file:


In [None]:
summary_hbm   <- utils::read.table(file      = "../dimorphAS/notebook/summary_hbm.txt", 
                                   header    = TRUE, 
                                   sep       = "\t")

In [None]:
dim(summary_hbm)
head(summary_hbm, 2)

In [None]:
# Check once whether both output files of parseMT.pl already exist in ../data/
required_files <- c("lv.txt", "mt.txt")
files_present  <- required_files %in% list.files("../data/")

if (all(files_present)) {
        message("The files lv.txt and mt.txt are available in the folder ../data/! \n")
        message("The 'perl parseMT.pl' command will not be re-run \n")
} else {
        # At least one of the two files is missing: regenerate both with parseMT.pl
        message("The files lv.txt and/or mt.txt were not found in the folder ../data/ \n")
        message("Generating lv.txt and mt.txt with 'perl parseMT.pl' using 'summary_hbm.txt' as input .. \n")
        system(paste0("cd ../dimorphAS/notebook/ && ",
                      "perl parseMT.pl > parseMT_output.txt && ", 
                      "mv lv.txt ../../data/ && ",
                      "mv mt.txt ../../data/ && ",
                      "cp summary_hbm.txt  ../../data/"), 
               intern  = TRUE)
        message("Done!\n")
}
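As a side note, the presence check can also be written with `file.exists()`, which reports exactly which files are missing. The sketch below runs in a temporary directory with stand-in file names, and `file.create()` stands in for the step that generates the files.

```r
# Stand-in paths in a temporary directory (not the real ../data/ folder)
required <- file.path(tempdir(), c("lv_demo.txt", "mt_demo.txt"))
unlink(required)

# Before generation: both files are reported as missing
missing_before <- required[!file.exists(required)]

# Stand-in for the step that would generate the files
invisible(file.create(required))

# After generation: nothing is missing
missing_after <- required[!file.exists(required)]
```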



## Loading the left ventricle data (`lv.txt`)

The following code block reads `lv.txt` into a data frame with two tab-separated columns, named `RBP` and `Expression`.

In [None]:
dat           <- utils::read.table("../data/lv.txt", header=FALSE, sep = "\t", col.names = c("RBP", "Expression"))

In [None]:
dim(dat)
summary(dat)

## Remove rows where expression values are equal to 0

In [None]:
d2<-dat[dat$Expression!=0,]
d2<-d2[order(d2$Expression),]

In [None]:
dim(d2)
summary(d2)

In [None]:
dat <- dat[order(dat$Expression),]
with_zeros <- visdat::vis_expect(dat, ~dat$Expression != 0,  show_perc = TRUE)
no_zeros   <- visdat::vis_expect(d2, ~d2$Expression != 0, show_perc = TRUE)
both <- with_zeros + no_zeros

# Percentage of rows removed = 100 * (1 - rows kept / rows total)
message(paste0("\n", round(100 * (1 - nrow(d2)/nrow(dat)), 2),"% ","of rows in the dataframe were filtered out because they contained 0 values\n"))

both + labs(title = "Comparison of data before and after removing expression rows with 0 values") + theme(plot.title    = element_text(size = 10, face = "bold" , hjust = 1.2))

Above, we check that our expectation of having no zero `Expression` values holds after filtering. We can also verify this by comparing the initial and final row counts of the data frame that contains the `RBP` and `Expression` values.
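The row-count comparison can be sketched on invented values as follows. Three of the five toy rows have an `Expression` of exactly zero, so 60% of the rows are removed.

```r
# Invented values; three of the five rows have Expression equal to 0
toy <- data.frame(RBP        = c(1.2, 0.4, 2.5, 3.1, 0.9),
                  Expression = c(0, 0.3, 0, -0.2, 0))

# Keep only rows with a non-zero Expression value
kept <- toy[toy$Expression != 0, ]

# Percentage of rows removed by the filter
pct_removed <- round(100 * (nrow(toy) - nrow(kept)) / nrow(toy), 2)
```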

## Fit a linear model (`Expression ~ RBP`)

In [None]:
# Use column names in the formula so that predict() can be applied to other data frames
lm_fit   <- lm(Expression ~ RBP, data=d2)
LM       <-summary(lm_fit)
rsquared <-round(LM$r.squared,digits=2)

In [None]:
# Store the report under its own name to avoid masking stats::lm()
lm_report <- report(lm_fit)
lm_report$texts$text_long
lm_report$tables$table_long
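The R² and p-value quoted for these fits can also be read directly from `summary()` of an `lm` object. The sketch below uses simulated data (the slope of 0.2 and the noise level are arbitrary assumptions) and shows that a formula written with column names lets `predict()` accept new data.

```r
set.seed(1)

# Simulated stand-in for the RBP / Expression table
toy <- data.frame(RBP = seq(1, 10, length.out = 30))
toy$Expression <- 0.2 * toy$RBP + rnorm(30, sd = 0.5)

# Fit with column names in the formula
fit         <- lm(Expression ~ RBP, data = toy)
fit_summary <- summary(fit)

# Coefficient of determination and p-value of the slope
r_squared <- fit_summary$r.squared
slope_p   <- coef(fit_summary)["RBP", "Pr(>|t|)"]

# Predictions for RBP values that were not part of the fit
new_points <- data.frame(RBP = c(2, 4))
preds      <- predict(fit, newdata = new_points)
```

With a single predictor, the slope p-value is the same as the p-value of the overall F-test.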

## Save predictions of the model 
Save predictions of the model in a new data frame named `predicted_df` along with the variable we want to plot against.

In [None]:
predicted_df <- data.frame(expr_pred = predict(lm_fit, d2), RBP=d2$RBP)

In [None]:
mypal <- ggsci::pal_npg("nrc", alpha = 0.7)(9)


p<-ggplot(dat, aes(x=RBP, y=Expression)) + geom_point(shape=21,fill = mypal[3],size=3) +  theme_bw()
#+ scale_fill_npg() 
p <- p + theme(axis.text = element_text(size=32, hjust=0.5),
               axis.title.x=element_text(size=24),
               axis.title.y = element_text(size=24),
               axis.text.y = element_text(size=32),
               panel.grid.major = element_blank(), 
               panel.grid.minor = element_blank()) 
p <- p +  geom_hline(yintercept=0, linetype="dashed", color = mypal[4])
p <- p +xlab('\U27F6 \n Sum of RBP effect magnitude')+ylab('Expression\ninclusion \U27F5 effect \U27F6 skipping')
p <- p+ geom_line(color='red',data = predicted_df, aes(y=expr_pred, x=RBP))
# Add the r-squared label once with annotate(); geom_text() would redraw it for every row of `dat`
p <- p+ annotate("text", x = 3, y = 0.45, label = paste0("r^2==", rsquared), size=7, parse = TRUE)
p

### (3c) Predicted effects of gene expression vs. RBP levels on exon inclusion in 100 sex-biased SE events in the left ventricle. 

The Y axis shows the mean of the posterior of the coefficient that determines the effects of gene expression on exon inclusion. 
Negative values favour skipping and positive values favour inclusion. 
The X axis shows the sum of the absolute values of the posterior of the coefficients of the 87 RBPs. 
The higher the value, the larger the predicted effect on exon skipping. 
In the left frame it can be seen that for 61 out of 100 sex-biased events in the left ventricle, 
no effect of gene expression was predicted (flat line at y=0.0). 

For the remaining genes there was a correlation with $R^2 = 0.35$ ($p = 7.98 \times 10^{-5}$).
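To make the two axes concrete, the sketch below derives them from simulated posterior draws. It assumes that the X value is the sum over RBPs of the absolute posterior mean of each coefficient. The actual computation happens in `parseMT.pl` and may differ. The array layout (draws × events × RBPs) and all numbers are invented.

```r
set.seed(42)
n_draws  <- 200
n_events <- 5
n_rbps   <- 87

# Simulated posterior draws of the RBP coefficients: [draw, event, rbp]
rbp_draws <- array(rnorm(n_draws * n_events * n_rbps, sd = 0.05),
                   dim = c(n_draws, n_events, n_rbps))

# Simulated posterior draws of the gene expression coefficient: [draw, event]
expr_draws <- matrix(rnorm(n_draws * n_events, sd = 0.1), nrow = n_draws)

# Posterior mean of each RBP coefficient per event (events x RBPs)
rbp_post_mean <- apply(rbp_draws, c(2, 3), mean)

# X axis (assumed definition): sum of absolute posterior means across the RBPs
x_axis <- rowSums(abs(rbp_post_mean))

# Y axis: posterior mean of the gene expression coefficient per event
y_axis <- colMeans(expr_draws)
```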

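##  Figure 3d

**EDIT**: 
This code was originally written for figure 4b but corresponds to the figure **`Figure 3d`** of the publication.

This should be run after the Figure 3c code chunks above (originally the `figure4a.R` script). 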

In [None]:
dat <- read.table("../data/mt.txt", header=FALSE, sep = "\t", col.names = c("RBP", "Expression"))

In [None]:
mypal = pal_npg("nrc", alpha = 0.7)(9)
d2<-dat[dat$Expression!=0,]
# Use column names in the formula so that predict() can be applied to other data frames
lm_fit <- lm(Expression ~ RBP, data=d2)
LM<-summary(lm_fit)
rsquared<-round(LM$r.squared,digits=2)  

# save predictions of the model in the new data frame 
# together with variable you want to plot against
predicted_df <- data.frame(expr_pred = predict(lm_fit, d2), RBP=d2$RBP)


p<-ggplot(dat, aes(x=RBP, y=Expression)) + geom_point(shape=21,fill = mypal[3],size=3) +  theme_bw()
 #+ scale_fill_npg() 
p <- p + theme(axis.text = element_text(size=32, hjust=0.5),
               axis.title.x=element_text(size=24),
               axis.title.y = element_text(size=24),
               axis.text.y = element_text(size=32),
               panel.grid.major = element_blank(), 
               panel.grid.minor = element_blank()) 
p <- p +  geom_hline(yintercept=0, linetype="dashed", color = mypal[4])
p <- p +xlab('\U27F6 \n Sum of RBP effect magnitude')+ylab('Expression\ninclusion \U27F5 effect \U27F6 skipping')
p <- p+ geom_line(color='red',data = predicted_df, aes(y=expr_pred, x=RBP))
# Add the r-squared label once with annotate(); geom_text() would redraw it for every row of `dat`
p <- p+ annotate("text", x = 15, y = 3.2, label = paste0("r^2==", rsquared), size=7, parse = TRUE)
p

### (3d) A similar correlation was found in mammary tissue, with $R^2 = 0.33$ ($p = 3.6 \times 10^{-12}$).


## Metadata

For replicability and reproducibility purposes, we also print the following metadata:

1. Checksums of **'artefacts'**, i.e. files generated during the analysis and stored in the **`data`** directory
2. List of environment metadata, dependencies, versions of libraries using `utils::sessionInfo()` and [`devtools::session_info()`](https://devtools.r-lib.org/reference/session_info.html)

### 1. Checksums with the sha256 algorithm

In [None]:
figure_id   = "figures_3c_3d"

message("Generating sha256 checksums of the artefacts in the `../data/` directory .. ")
system(paste0("cd ../data/ && sha256sum * > ../metadata/", figure_id, "_sha256sums.txt"), intern = TRUE)
message("Done!\n")

data.table::fread(paste0("../metadata/", figure_id, "_sha256sums.txt"), header = FALSE, col.names = c("sha256sum", "file"))
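The cell above shells out to `sha256sum`, which may not be available on every system. As a portable illustration of the same idea, the sketch below uses `tools::md5sum` from base R. Note that this computes MD5, not SHA-256, so it is a swapped-in algorithm and not a drop-in replacement. The directory and file names are stand-ins.

```r
# Stand-in artefact directory with two small files
artefact_dir <- file.path(tempdir(), "checksum_demo")
dir.create(artefact_dir, showWarnings = FALSE)
writeLines("1.5\t0.2", file.path(artefact_dir, "lv_demo.txt"))
writeLines("2.5\t0.0", file.path(artefact_dir, "mt_demo.txt"))

# Checksum every demo file (MD5 here, unlike the sha256sum call above)
files <- list.files(artefact_dir, pattern = "_demo\\.txt$", full.names = TRUE)
sums  <- tools::md5sum(files)

# Tabulate as checksum / file name, mirroring the layout read back by fread above
checksums <- data.frame(md5sum = unname(sums),
                        file   = basename(files),
                        stringsAsFactors = FALSE)
```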

### 2. Libraries metadata

In [None]:
figure_id   = "figures_3c_3d"

dev_session_info   <- devtools::session_info()
utils_session_info <- utils::sessionInfo()

message("Saving `devtools::session_info()` objects in ../metadata/", figure_id, "_devtools_session_info.rds  ..")
saveRDS(dev_session_info, file = paste0("../metadata/", figure_id, "_devtools_session_info.rds"))
message("Done!\n")

message("Saving `utils::sessionInfo()` objects in ../metadata/", figure_id, "_utils_info.rds  ..")
saveRDS(utils_session_info, file = paste0("../metadata/", figure_id ,"_utils_info.rds"))
message("Done!\n")
message("Done!\n")

dev_session_info$platform
dev_session_info$packages[dev_session_info$packages$attached==TRUE, ]
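The saved `.rds` files can later be read back with `readRDS()` to inspect the environment a figure was produced in. A minimal round-trip sketch using a temporary file instead of the `../metadata/` folder:

```r
# Capture the session information and save it to a temporary .rds file
info     <- utils::sessionInfo()
rds_path <- tempfile(fileext = ".rds")
saveRDS(info, rds_path)

# Read it back, as one would when auditing a past run
restored  <- readRDS(rds_path)
r_version <- restored$R.version$version.string
```

Printing `restored` shows the same platform, locale and attached packages that `utils::sessionInfo()` reported when the file was saved.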