# Analysis Notebook - Hierarchical Bayesian Modelling

## **NOTE**:

We assume that you have cloned the analysis repository and have changed (`cd`) into its parent directory. Before starting the analysis, make sure you have completed the dependency setup by following the instructions in **`dependencies/README.md`**. All paths defined in this Notebook are relative to the parent directory (repository). If you have not completed these steps, please close this Notebook and start again by following the above guidelines.

## Prerequisite input files

Before starting the execution of the following code, make sure you have available in the folders `sbas/data` and `sbas/assets` the files listed below as prerequisites.

### **`sbas/data`**
The present analysis requires the following files to be present in the folder **`sbas/data`**.


- [x] The contents of `data.tar.gz` after unpacking them into the `sbas/data` folder with `tar xvzf data.tar.gz -C sbas/data`
- [x] `SraRunTable.txt`, formerly named `SraRunTable.noCram.noExome.noWGS.totalRNA.txt` (changed in [9fd0618](https://github.com/TheJacksonLaboratory/sbas/commit/9fd06183d1df0d6c6f072861ad7ff3b84ac5cb47))
- [x] `rmats_final.se.jc.ijc.txt`
- [x] `rmats_final.se.jc.sjc.txt`
- [x] `SraRunTable.noCram.noExome.noWGS.totalRNA.txt`


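Additionally, the script downloads the GTEx gene TPM file `GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_tpm.gct` from [`https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data/`](https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data/GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_tpm.gct.gz) and stores it in the folder `sbas/data` as well.
Note that the download cell saves this file under the local name `GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_tpm.gct`, which is the name the rest of the notebook reads.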


### **`sbas/assets`**
The present analysis requires the following files to be present in the folder **`sbas/assets`**.

- [x] `tissues.tsv`: metadata file with information on which tissues will be used for analysis
- [x] `splice-relevant-genes.txt`: list of RNA binding proteins that are annotated to splicing relevant functions from GO.

## Loading dependencies

If `conda` is available in your environment, you can install the required dependencies by running the following commands:


```bash
time conda install -y r-base==3.6.2 &&
conda install -y r-ggplot2 r-ggsci r-coda r-rstan r-rjags r-compute.es r-snakecase &&
Rscript -e 'install.packages("runjags", repos = "https://cloud.r-project.org/")'
```



In [1]:
# dataviz dependencies
library(ggplot2)
library(ggsci)
library(grid)
library(gridExtra)
library(stringr)
library(snakecase)

# BDA2E-utilities dependencies
library(parallel)
library(rjags)
library(runjags)
library(compute.es)

“package ‘ggplot2’ was built under R version 3.6.3”
“package ‘ggsci’ was built under R version 3.6.3”
“package ‘gridExtra’ was built under R version 3.6.3”
“package ‘snakecase’ was built under R version 3.6.3”
“package ‘rjags’ was built under R version 3.6.3”
Loading required package: coda

“package ‘coda’ was built under R version 3.6.3”
Linked to JAGS 4.3.0

Loaded modules: basemod,bugs

“package ‘compute.es’ was built under R version 3.6.3”


Download GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_tpm.gct from Google Cloud


In [2]:
# Local file name read by the rest of the notebook. NOTE: the URL below serves
# the GTEx v7 file, which is saved under this v8-style name.
gct.file <- "GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_tpm.gct"
gct.url  <- "https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data/GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_tpm.gct.gz"

if (!(gct.file %in% list.files("../data/"))) {
    message("Downloading ", gct.file, " \nfrom ", gct.url, " ..")
    system(paste0("wget -O ../data/", gct.file, ".gz ", gct.url), intern = TRUE)
    message("Done!\n\n")
    message("Unzipping compressed file ", gct.file, ".gz ..")
    system(paste0("gunzip ../data/", gct.file, ".gz"), intern = TRUE)
    message("Done! \n\nThe file ", gct.file, " can be found in ../data/")
}

List of tissues previously used for the hierarchical Bayesian modelling (kept for reference; the current list is read from `tissues.tsv` in the cells that follow):



```R
tissue.list<-c("Heart - Left Ventricle",
               "Breast - Mammary Tissue",
               "Brain - Cortex.Brain - Frontal Cortex (BA9).Brain - Anterior cingulate cortex (BA24)",
               "Adrenal Gland",
               "Adipose - Subcutaneous",
               "Muscle - Skeletal",
               "Thyroid",
               "Cells - Transformed fibroblasts",
               "Artery - Aorta",
               "Skin - Sun Exposed (Lower leg).Skin - Not Sun Exposed (Suprapubic)")
```

In [3]:
tissues_df <- readr::read_delim("../assets/tissues.tsv", delim = "\t")

Parsed with column specification:
cols(
  name = col_character(),
  female = col_double(),
  male = col_double(),
  include = col_double(),
  display.name = col_character()
)



In [4]:
tissue.list <- tissues_df$name[ tissues_df$include ==1]

In [5]:
message(length(tissue.list), " tissues")
cat(tissue.list, sep = ", ")

39 tissues



adipose_subcutaneous, adipose_visceral_omentum, adrenal_gland, artery_aorta, artery_coronary, artery_tibial, brain_caudate_basal_ganglia, brain_cerebellar_hemisphere, brain_cerebellum, brain_cortex, brain_frontal_cortex_ba_9, brain_hippocampus, brain_hypothalamus, brain_nucleus_accumbens_basal_ganglia, brain_putamen_basal_ganglia, brain_spinal_cord_cervical_c_1, breast_mammary_tissue, cells_cultured_fibroblasts, cells_ebv_transformed_lymphocytes, colon_sigmoid, colon_transverse, esophagus_gastroesophageal_junction, esophagus_mucosa, esophagus_muscularis, heart_atrial_appendage, heart_left_ventricle, liver, lung, muscle_skeletal, nerve_tibial, pancreas, pituitary, skin_not_sun_exposed_suprapubic, skin_sun_exposed_lower_leg, small_intestine_terminal_ileum, spleen, stomach, thyroid, whole_blood

In [6]:
tissue <- tissue.list[1]  #can be replaced with a loop or argument to choose a different tissue

In [7]:
tissue

## Pattern for choosing `topTable()` files from `limma`

```bash
# {as_site_type} + '_' + {tissue} + '_' + suffix_pattern 
se_skin_not_sun_exposed_suprapubic_AS_model_B_sex_as_events.csv
```

In [8]:
dataDir <- "../data/"
assetsDir <- "../assets/"
as_site_type <- "se"
suffix_pattern <- "AS_model_B_sex_as_events.csv"

file.with.de.results <- paste0(dataDir, as_site_type, "_", tissue, "_" , suffix_pattern  )
file.with.de.results
file.exists(file.with.de.results)
system( paste0("ls -l ", file.with.de.results), intern = TRUE )

In [9]:
events.table         <- read.table(file.with.de.results, sep = ",")
head(events.table, 2)

Unnamed: 0_level_0,logFC,AveExpr,t,P.Value,adj.P.Val,B
Unnamed: 0_level_1,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
XIST-10149,-7.166835,1.000899,-51.69164,1.8778320000000002e-222,7.253878000000001e-218,436.5794
XIST-10154,-6.906147,0.7659794,-50.56963,7.990861e-218,1.5433950000000002e-213,428.6005


## Add annotation columns to the topTable dataframe:

The feature information is encoded in the topTable dataframe as rownames. The `ID` and `geneSymbol` variables have been combined in the following pattern:

```console
{geneSymbol}-{ID} 
```

- `ID`: everything **_after_** the last occurrence of a hyphen `-`
example:
```R
stringr::str_replace("apples - oranges - bananas", "^.+-", "")
```

```console
# output:

' bananas'
```

- `geneSymbol`: everything **_before_** the last occurrence of a hyphen `-`
example:

```R
sub('-[^-]*$', '',"apples - oranges - bananas")
```

```console
# output:

'apples - oranges '
```

```diff
- NOTE: The above solution covers the cases where a hyphen is part of the geneSymbol.
```
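As a quick check of the note above, the sketch below applies both expressions to a row name whose gene symbol itself contains a hyphen. The row name `HLA-DRB1-12345` is a made-up example, not taken from the data, and base `sub()` is used in place of `stringr::str_replace()` so that the snippet has no dependencies.

```R
# Hypothetical row name: gene symbol "HLA-DRB1", event ID "12345"
row_name <- "HLA-DRB1-12345"

# ID: drop everything up to and including the last hyphen (greedy match)
event_id <- sub("^.+-", "", row_name)

# geneSymbol: drop the last hyphen and everything after it
gene_symbol <- sub("-[^-]*$", "", row_name)

event_id     # "12345"
gene_symbol  # "HLA-DRB1"
```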

In [10]:
cols_initially <- colnames(events.table)
cols_initially

In [11]:
events.table[["ID"]] <- stringr::str_replace(rownames(events.table),  "^.+-", "")
events.table[["gene_name"]] <- sub('-[^-]*$', '', rownames(events.table))

In [12]:
keepInOrderCols <- c("gene_name", "ID", cols_initially)

In [13]:
events.table <- events.table[ , keepInOrderCols ]

In [14]:
tail(events.table, 2)

Unnamed: 0_level_0,gene_name,ID,logFC,AveExpr,t,P.Value,adj.P.Val,B
Unnamed: 0_level_1,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
MYL6-28855,MYL6,28855,0.0194722,10.22273,0.262982,0.7926553,0.9651675,-6.553282
CD74-25493,CD74,25493,-0.01944741,10.19811,-0.1964588,0.844318,0.9759174,-6.580949


## Define filepaths of required inputs

`file.with.de.results` has been defined above

In [15]:
rbp.table.name        <- paste0(assetsDir, "splice-relevant-genes.txt")

In [16]:
events.table.name     <- paste0(dataDir, "fromGTF.SE.txt")

In [17]:
inc.counts.file.name  <- paste0(dataDir, "rmats_final.se.jc.ijc.txt")

In [18]:
skip.counts.file.name <- paste0(dataDir, "rmats_final.se.jc.sjc.txt")

In [19]:
metadata.file.name    <- paste0(dataDir, "SraRunTable.txt")

In [20]:
expression.file.name  <- paste0(dataDir, "GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_tpm.gct")

## Use the defined file paths to load the tables

Load the skip and inclusion count matrices, and the list of RNA binding proteins that are annotated to either:
- mRNA splicing, via spliceosome `(GO:0000398)`,
- regulation of mRNA splicing, via spliceosome `(GO:0048024)`, or 
- both. 

The table has the following columns:
- `Gene`: the gene symbol
- `uprot.id`: the UniProt ID
- `gene.id`: the NCBI Gene ID
- `S`: boolean, annotated to mRNA splicing, via spliceosome `(GO:0000398)`
- `R`: boolean, annotated to regulation of mRNA splicing, via spliceosome `(GO:0048024)`

### Filtering of the `topTable()` object

In [21]:
# before significance filtering

dim(events.table)

In [22]:
events.table <- events.table[abs(events.table$logFC)>=log2(1.5) & events.table$adj.P.Val<=0.05,]

In [23]:
# after significance filtering
dim(events.table)
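The filter above keeps events with at least a 1.5-fold change in either direction and an adjusted p-value of at most 0.05. A minimal sketch on a toy table follows; the values and row names are made up, and `toy`, `min_abs_logfc`, `keep` and `filtered` are illustrative names only.

```R
# Toy stand-in for a limma topTable(); all values are made up
toy <- data.frame(
  logFC     = c(-7.2, 0.3, 0.7, -0.6, 1.1),
  adj.P.Val = c(1e-200, 0.01, 0.04, 0.20, 0.06),
  row.names = c("A-1", "B-2", "C-3", "D-4", "E-5")
)

# log2(1.5) is about 0.585, i.e. at least a 1.5-fold change in either direction
min_abs_logfc <- log2(1.5)

# Both conditions must hold for an event to be kept
keep     <- abs(toy$logFC) >= min_abs_logfc & toy$adj.P.Val <= 0.05
filtered <- toy[keep, ]

rownames(filtered)  # "A-1" "C-3"
```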

Make sure the command `gunzip sbas/data/fromGTF.*` has been executed before running the next cell, as the files are expected to be uncompressed.


In [24]:
annot.table  <- read.table(events.table.name,header=T)

In [25]:
head(annot.table, 1)

Unnamed: 0_level_0,ID,GeneID,geneSymbol,chr,strand,exonStart_0base,exonEnd,upstreamES,upstreamEE,downstreamES,downstreamEE
Unnamed: 0_level_1,<int>,<fct>,<fct>,<fct>,<fct>,<int>,<int>,<int>,<int>,<int>,<int>
1,1,ENSG00000034152.18,MAP2K3,chr17,+,21287990,21288091,21284709,21284969,21295674,21295769


In [26]:
merged.table <- merge(events.table, annot.table, by="ID")

In [27]:
head(merged.table, 2)

Unnamed: 0_level_0,ID,gene_name,logFC,AveExpr,t,P.Value,adj.P.Val,B,GeneID,geneSymbol,chr,strand,exonStart_0base,exonEnd,upstreamES,upstreamEE,downstreamES,downstreamEE
Unnamed: 0_level_1,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<fct>,<fct>,<fct>,<fct>,<int>,<int>,<int>,<int>,<int>,<int>
1,10149,XIST,-7.166835,1.000899,-51.69164,1.8778320000000002e-222,7.253878000000001e-218,436.5794,ENSG00000229807.11,XIST,chrX,-,73831065,73831274,73829067,73829231,73833237,73833374
2,10150,XIST,-3.020911,1.516307,-27.05427,8.81976e-106,8.517463e-102,216.1859,ENSG00000229807.11,XIST,chrX,-,73833237,73833374,73831065,73831274,73837439,73841474


In [28]:
rbp.table    <- read.table(rbp.table.name,sep="\t",header=TRUE)  

In [29]:
head(rbp.table, 1)

Unnamed: 0_level_0,Gene,uprot.id,gene.id,S,R,omim
Unnamed: 0_level_1,<fct>,<fct>,<int>,<lgl>,<lgl>,<fct>
1,AAR2,Q9Y312,25980,True,False,


Make sure the command `gunzip sbas/data/rmats_final.se.jc.*jc.*` has been executed before running the next cells, as the files are expected to be uncompressed.


In [30]:
inc.counts   <- as.data.frame(data.table::fread(inc.counts.file.name))

In [31]:
skip.counts  <- as.data.frame(data.table::fread(skip.counts.file.name))

In [32]:
inc.counts[1:2,1:3]
skip.counts[1:2,1:3]

Unnamed: 0_level_0,ID,SRR1068788,SRR1068808
Unnamed: 0_level_1,<int>,<int>,<int>
1,1,0,0
2,2,26,247


Unnamed: 0_level_0,ID,SRR1068788,SRR1068808
Unnamed: 0_level_1,<int>,<int>,<int>
1,1,2,0
2,2,0,0


## Check `dim()` of loaded objects

In [33]:
dim(events.table)
dim(annot.table)
dim(merged.table)
dim(rbp.table)
dim(inc.counts)
dim(skip.counts)

## Read sample info

Make sure you have unzipped the file first by typing:

```bash
gunzip sbas/data/SraRunTable.txt.gz
```

as the file is expected to be uncompressed.

In [34]:
metadata.file.name
file.exists(metadata.file.name)
system(paste0("ls -l", " ../data/Sra*"), intern = TRUE)

In [35]:
meta.data    <- read.csv(metadata.file.name,header=TRUE)

In [36]:
meta.data$body_site[1:3]

In [37]:
meta.data[["body_site"]] <- as.character(meta.data[["body_site"]])

In [38]:
meta.data$body_site[1:3]

In [39]:
meta.data <- meta.data[ snakecase::to_snake_case(meta.data$body_site) == tissue,]

In [40]:
tissue
dim(meta.data)
head(meta.data,2)

Unnamed: 0_level_0,Run,analyte_type,Assay.Type,AvgSpotLen,Bases,BioProject,BioSample,biospecimen_repository,biospecimen_repository_sample_id,body_site,⋯,data_type..run.,product_part_number..exp.,product_part_number..run.,sample_barcode..exp.,sample_barcode..run.,is_technical_control,target_set..exp.,primary_disease..exp.,secondary_accessions..run.,Alignment_Provider..run.
Unnamed: 0_level_1,<fct>,<fct>,<fct>,<int>,<dbl>,<fct>,<fct>,<fct>,<fct>,<chr>,⋯,<fct>,<fct>,<fct>,<dbl>,<dbl>,<fct>,<fct>,<fct>,<fct>,<fct>
244,SRR821715,RNA:Total RNA,RNA-Seq,152,9678630552,PRJNA75899,SAMN01994392,GTEx,GTEX-OHPK-0226-SM-3MJH6,Adipose - Subcutaneous,⋯,,,,,,Yes,,,,
265,SRR807543,RNA:Total RNA,RNA-Seq,152,8608576088,PRJNA75899,SAMN01994252,GTEx,GTEX-X4LF-1726-SM-3NMBZ,Adipose - Subcutaneous,⋯,,,,,,,,,,


In [41]:
dim(inc.counts)
inc.counts   <- inc.counts[,colnames(inc.counts) %in% meta.data$Run]
dim(inc.counts)

In [42]:
dim(skip.counts)
skip.counts  <- skip.counts[,colnames(skip.counts) %in% meta.data$Run]
dim(skip.counts)

In [43]:
sd.threshold <- quantile(apply(inc.counts,1,sd)+apply(skip.counts,1,sd),0.95)
sd.threshold

In [44]:
dim(skip.counts)
skip.counts  <- skip.counts[rownames(skip.counts) %in% merged.table$ID,]
dim(skip.counts)

In [45]:
dim(inc.counts)
inc.counts   <- inc.counts[rownames(inc.counts) %in% merged.table$ID,]
dim(inc.counts)

In [46]:
if (nrow(skip.counts)>100)
{
  select.events <- apply(inc.counts,1,sd)+apply(skip.counts,1,sd)>sd.threshold
  inc.counts    <- inc.counts[select.events,]
  skip.counts   <- skip.counts[select.events,]
  merged.table  <- merged.table[select.events,]
}
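The subsetting above applies one logical vector to three tables, which selects the same events only if their rows are in the same order. `merge()` sorts its result by the key by default, and a character `ID` sorts differently from a numeric one, so it may be worth aligning the tables explicitly first. A minimal sketch with made-up data, assuming the row names of the count table hold the event IDs:

```R
# Made-up count table whose row names are event IDs in numeric order
counts <- data.frame(s1 = c(5, 0, 9), s2 = c(7, 1, 8),
                     row.names = c("2", "10", "33"))

# Made-up annotation table sorted by ID as character ("10" < "2" < "33")
annot <- data.frame(ID = c("10", "2", "33"),
                    geneSymbol = c("G10", "G2", "G33"),
                    stringsAsFactors = FALSE)

# Reorder the annotation rows to follow the count table
annot <- annot[match(rownames(counts), annot$ID), ]

# Now one logical vector selects the same events from both tables
select.events <- apply(counts, 1, sd) > 1
counts.sel <- counts[select.events, ]
annot.sel  <- annot[select.events, ]
```

Whether the notebook's tables need this step depends on the actual row order of `merged.table` and the count matrices, which is not checked here.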

In [47]:
nrow(skip.counts)>100

In [48]:
dim(inc.counts)
dim(skip.counts)
dim(merged.table)

## Read expression data:

In [49]:
expression.file.name
file.exists(expression.file.name)

In [50]:
expression.mat          <- read.table(expression.file.name, nrows=1,sep="\t",header=T,skip=2)

In [51]:
dim(expression.mat)
head(expression.mat, 2)

Unnamed: 0_level_0,Name,Description,GTEX.1117F.0226.SM.5GZZ7,GTEX.111CU.1826.SM.5GZYN,GTEX.111FC.0226.SM.5N9B8,GTEX.111VG.2326.SM.5N9BK,GTEX.111YS.2426.SM.5GZZQ,GTEX.1122O.2026.SM.5NQ91,GTEX.1128S.2126.SM.5H12U,GTEX.113IC.0226.SM.5HL5C,⋯,GTEX.ZVE2.0006.SM.51MRW,GTEX.ZVP2.0005.SM.51MRK,GTEX.ZVT2.0005.SM.57WBW,GTEX.ZVT3.0006.SM.51MT9,GTEX.ZVT4.0006.SM.57WB8,GTEX.ZVTK.0006.SM.57WBK,GTEX.ZVZP.0006.SM.51MSW,GTEX.ZVZQ.0006.SM.51MR8,GTEX.ZXES.0005.SM.57WCB,GTEX.ZXG5.0005.SM.57WCN
Unnamed: 0_level_1,<fct>,<fct>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,ENSG00000223972.4,DDX11L1,0.1082,0.1158,0.02104,0.02329,0,0.04641,0.03076,0.09358,⋯,0.09012,0.1462,0.1045,0,0.6603,0.695,0.1213,0.4169,0.2355,0.145


In [52]:
colnames(expression.mat)[1:3]

In [53]:
colnames.expression.mat <- colnames(expression.mat)

In [54]:
length(colnames.expression.mat)

In [55]:
total.samples           <- length(colnames.expression.mat)

In [56]:
meta.data$Sample.Name[1]
gsub("-","\\.",meta.data$Sample.Name[1])

In [57]:
meta.data$Sample.Name   <- gsub("-","\\.",meta.data$Sample.Name)

In [58]:
meta.data               <- meta.data[meta.data$Sample.Name %in% colnames(expression.mat),]

In [59]:
meta.data               <- meta.data[!duplicated(meta.data$Sample.Name),]

In [60]:
inc.counts              <- inc.counts[,colnames(inc.counts) %in% meta.data$Run]

In [61]:
skip.counts             <- skip.counts[,colnames(skip.counts) %in% meta.data$Run]

In [62]:
meta.data               <- meta.data[meta.data$Run %in% colnames(inc.counts),]

In [63]:
col.in.tissue           <- c()

In [64]:
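# Flag the expression columns that belong to the current tissue.
# `tissue` is in snake_case (see cell 39), so body_site is converted before comparing.
for (col in colnames.expression.mat)
  col.in.tissue<-c(col.in.tissue, (col %in% meta.data$Sample.Name) && (snakecase::to_snake_case(meta.data$body_site[which(meta.data$Sample.Name==col)]) %in% tissue) && (meta.data$submitted_subject_id[which(meta.data$Sample.Name==col)]!="GTEX-11ILO"))
# Read only the flagged columns; a colClasses entry of "NULL" skips that column
expression.mat<-read.table(expression.file.name, colClasses = ifelse(col.in.tissue,"numeric","NULL"),sep="\t",header=T,skip=2)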
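The expression file is read in two passes: first a single row to learn the column names, then the full file with unwanted columns skipped. The sketch below reproduces that pattern on a tiny GCT-style file written to a temporary location; the file contents and the names `gct`, `header`, `wanted` and `mat` are made up for illustration.

```R
# Write a tiny made-up GCT-style file: two preamble lines, then a header row
gct <- tempfile(fileext = ".gct")
writeLines(c("#1.2",
             "2\t3",
             "Name\tDescription\tS1\tS2\tS3",
             "ENSG1\tGENE1\t1.5\t0\t3",
             "ENSG2\tGENE2\t0.2\t4\t6"), gct)

# Pass 1: read a single data row just to learn the column names
header <- read.table(gct, nrows = 1, sep = "\t", header = TRUE, skip = 2)
cols   <- colnames(header)

# Pass 2: "NULL" in colClasses skips a column, so only wanted samples are parsed
wanted <- cols %in% c("S1", "S3")
mat <- read.table(gct, sep = "\t", header = TRUE, skip = 2,
                  colClasses = ifelse(wanted, "numeric", "NULL"))

colnames(mat)  # "S1" "S3"
```

Skipping columns at read time keeps memory use down when only a fraction of the samples is needed.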

## Read gene names:

In [65]:
expression.mat <- expression.mat[,order(match(colnames(expression.mat),meta.data$Sample.Name))]

In [66]:
inc.counts     <- inc.counts[,order(match(colnames(inc.counts),meta.data$Run))]

In [67]:
skip.counts    <- skip.counts[,order(match(colnames(skip.counts),meta.data$Run))]

In [68]:
all.genes      <- read.table(expression.file.name,sep="\t",header=T,skip=2,colClasses = c(rep("character", 2), rep("NULL", total.samples-2)))

In [69]:
expression.mat <- expression.mat[!duplicated(all.genes$Description),]

In [70]:
all.genes      <- all.genes[!duplicated(all.genes$Description),]

In [71]:
skip.counts    <- skip.counts[merged.table$geneSymbol %in% all.genes$Description,]

In [72]:
inc.counts     <- inc.counts[merged.table$geneSymbol %in% all.genes$Description,]

In [73]:
merged.table   <- merged.table[merged.table$geneSymbol %in% all.genes$Description,]

In [74]:
gene.names     <- unique(merged.table$geneSymbol)

In [75]:
expression.mat <- expression.mat[all.genes$Description %in% c(as.character(rbp.table$Gene),as.character(gene.names)),]

In [76]:
rownames.expression.mat <-all.genes$Description[all.genes$Description %in% c(as.character(rbp.table$Gene),as.character(gene.names))]

In [77]:
expression.mat <-expression.mat[!duplicated(rownames.expression.mat),]

In [78]:
rownames.expression.mat <-rownames.expression.mat[!duplicated(rownames.expression.mat)]

## Prepare expression of genes and RBPs:

In [79]:
num.events     <- nrow(merged.table)

In [80]:
event.to.gene  <- c()

In [81]:
gexp           <- expression.mat[rownames.expression.mat %in% gene.names,]

In [82]:
rownames(gexp) <- rownames.expression.mat[rownames.expression.mat %in% gene.names]

In [83]:
gexp           <- gexp[order(match(rownames(gexp),gene.names)),]

In [84]:
gexp           <- log2(gexp+0.5)

In [85]:
gexp           <- gexp-rowMeans(gexp)

In [86]:
# Scale each gene to unit SD. Rows whose SD is NA or 0 are skipped: an NA in the row index
# fails with "missing values are not allowed in subscripted assignments of data frames".
gexp.sd   <- apply(gexp,1,sd)
scale.row <- !is.na(gexp.sd) & gexp.sd>0
gexp[scale.row,] <- gexp[scale.row,]/gexp.sd[scale.row]
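The three transformation steps (log2 with a 0.5 pseudocount, row centring, row scaling) can be tried on a small matrix. This is only a sketch with made-up values, using a plain matrix rather than the data frame used in the notebook; rows with zero or undefined standard deviation are left unscaled.

```R
# Made-up expression matrix: genes in rows, samples in columns
expr <- matrix(c(1, 3, 7,
                 4, 4, 4,
                 0, 2, 4), nrow = 3, byrow = TRUE,
               dimnames = list(c("g1", "g2", "g3"), c("s1", "s2", "s3")))

z <- log2(expr + 0.5)      # log transform with a 0.5 pseudocount
z <- z - rowMeans(z)       # centre each gene at 0

# Per-gene SD; constant or all-NA rows are excluded from scaling
row.sd <- apply(z, 1, sd)
ok     <- !is.na(row.sd) & row.sd > 0
z[ok, ] <- z[ok, ] / row.sd[ok]
```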


In [None]:
rexp           <- expression.mat[rownames.expression.mat %in% rbp.table$Gene,]

In [None]:
rownames(rexp) <- rownames.expression.mat[rownames.expression.mat %in% rbp.table$Gene]

In [None]:
rexp           <- rexp[order(match(rownames(rexp),rbp.table$Gene)),]

In [None]:
rexp           <- log2(rexp+0.5)

In [None]:
rexp           <- rexp-rowMeans(rexp)

In [None]:
rexp           <- rexp/apply(rexp,1,function(v){ifelse(sum(v==v[1])<length(v),sd(v),1)})

In [None]:
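# Map every event to the index of its gene among the unique gene symbols
for (i in (1:num.events)) {
  event.to.gene<-c(event.to.gene,which(unique(merged.table$geneSymbol)==merged.table[i,"geneSymbol"]))
}
# Encode sex as 1 = male, 0 = otherwise
sex<-ifelse(meta.data$sex=="male",1,0)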
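The event-to-gene index built by the loop can also be obtained in one vectorised call. The sketch below compares both on made-up gene symbols; `gene.of.event`, `idx.loop` and `idx.match` are illustrative names.

```R
# Made-up gene symbols, one per event
gene.of.event <- c("XIST", "MYL6", "XIST", "CD74", "MYL6")

# Loop version, mirroring the loop in the notebook
idx.loop <- c()
for (i in seq_along(gene.of.event))
  idx.loop <- c(idx.loop, which(unique(gene.of.event) == gene.of.event[i]))

# Vectorised equivalent: position of each event's gene among the unique genes
idx.match <- match(gene.of.event, unique(gene.of.event))

idx.match  # 1 2 1 3 2
```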

## Run Stan:

In [None]:
dataList = list(
  as = round(skip.counts) ,   # skip counts per event (rows) across experiments (columns)
  c = round(skip.counts+inc.counts)    , # total counts per event, i.e. skip+inclusion, across experiments
  gexp = gexp, # standardised log2(TPM+0.5) expression of the genes (from GTEx) across experiments
  rexp = rexp, # standardised log2(TPM+0.5) expression of the RBPs (from GTEx) across experiments
  event_to_gene = event.to.gene,  # the gene index for each event (1 to the number of distinct genes)
  Nrbp = nrow(rexp), # number of RBPs
  Nevents = nrow(merged.table),  # number of AS events retained after filtering
  Nexp = ncol(expression.mat), # number of experiments; each event, gene and RBP is measured in each experiment
  Ngenes = nrow(gexp), # number of distinct genes
  sex=sex # 1 = male, 0 = otherwise
)


modelString = "
data {
int<lower=0> Nevents;
int<lower=0> Nexp;
int<lower=0> Nrbp;
int<lower=0> Ngenes;
int<lower=0> as[Nevents,Nexp] ;
int<lower=0> c[Nevents,Nexp] ;
matrix[Ngenes,Nexp] gexp ; 
matrix[Nrbp,Nexp] rexp ; 
int<lower=0> event_to_gene[Nevents];
int<lower=0,upper=1> sex[Nexp];

}


parameters {
real beta0[Nevents] ;
real beta1[Nevents] ;
matrix[Nevents,Nrbp] beta2 ;
real beta3[Nevents];
real beta4[Nrbp];

}
model {

for ( i in 1:Nexp ) {  


    for ( j in 1:Nevents ) if (c[j,i]>0) { 

      as[j,i] ~ binomial(c[j,i], inv_logit(beta0[j]+beta1[j]*sex[i]+dot_product(beta2[j,],rexp[,i])+beta3[j]*gexp[event_to_gene[j],i] ) );

  }
}

for (k in 1:Nrbp){

  for ( j in 1:Nevents ) { 

        beta2[j,k] ~normal(beta4[k],1);
  }

  beta4[k]~normal(0,1);

}


for ( j in 1:Nevents ) { 

    beta1[j] ~ normal(0,1);
    beta0[j] ~ normal(0,1);
    beta3[j] ~ normal(0,1);
  }

}
"

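library(rstan)  # provides stan_model() and sampling(); it is not attached in cell 1
# Compile the model, then run 3 chains of 8000 iterations each (the first 6000 are warmup)
stanDso <- stan_model( model_code=modelString ) 
stanFit <- sampling( object=stanDso , data = dataList , chains = 3 ,iter = 8000,warmup=6000   , thin = 1,init=0, cores=3 )
# Convert the stanfit object into a coda mcmc.list with one mcmc object per chain
mcmcCoda = mcmc.list( lapply( 1:ncol(stanFit) , function(x) { mcmc(as.array(stanFit)[,x,]) } ) )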
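
Read off the Stan program above, the model can be written compactly as follows. This is only a restatement of the code: $g(j)$ stands for `event_to_gene[j]`, the second argument of $\mathcal{N}$ is a standard deviation (as in Stan), and the symbol $\eta_{j,i}$ is introduced here purely for readability.

$$
\begin{aligned}
\mathrm{as}_{j,i} &\sim \mathrm{Binomial}\!\left(c_{j,i},\ \operatorname{logit}^{-1}(\eta_{j,i})\right) \quad \text{for } c_{j,i} > 0,\\
\eta_{j,i} &= \beta_{0,j} + \beta_{1,j}\,\mathrm{sex}_i + \sum_{k=1}^{N_{\mathrm{rbp}}} \beta_{2,j,k}\,\mathrm{rexp}_{k,i} + \beta_{3,j}\,\mathrm{gexp}_{g(j),i},\\
\beta_{2,j,k} &\sim \mathcal{N}(\beta_{4,k},\,1), \qquad \beta_{4,k} \sim \mathcal{N}(0,\,1),\\
\beta_{0,j},\ \beta_{1,j},\ \beta_{3,j} &\sim \mathcal{N}(0,\,1).
\end{aligned}
$$

Here $j$ indexes events, $i$ experiments and $k$ RBPs. Since `as` holds skip counts, a positive coefficient corresponds to a higher probability of skipping.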

# > Inspect from here onwards

- Which files produced from this point onwards need to be saved?
- Is the diagnostic plotting worth keeping in this notebook?

In [None]:
source("../dimorphAS/DBDA2Eprograms/DBDA2E-utilities.R")

## Initialising dataframe with columns `coef,rbp,tissue`

In [None]:
df <-data.frame(coef=NULL,rbp=NULL,tissue=NULL)

Before running the following, use the Session menu to set the working directory to the source file location:
```R
setwd(dir = "../dimorphAS/DBDA2Eprograms/")
```

In [None]:
load("../dimorphAS/figures/oldFigureDrafts/figure3b.RData")

### This invokes X11, which is not available on all systems and won't work in a Nextflow pipeline

```
diagMCMC(mcmcCoda , parName=c("beta2[101,87]"))
```

In [None]:
options(repr.plot.width=6, repr.plot.height=4)

codaObject <- mcmcCoda 
parName    <- c("beta2[101,87]") #varnames(codaObject)[1]
saveName   <- NULL
saveType   <- "jpg"


DBDAplColors = c("skyblue",
               "black",
               "royalblue",
               "steelblue")

#openGraph(height=5,width=7)
    
par(mar=0.5+c(3,4,1,0) , 
  oma=0.1+c(0,0,2,0) , 
  mgp=c(2.25,0.7,0) , 
  cex.lab=1.5 )
    
layout(matrix(1:4,nrow=2))
  # traceplot and gelman.plot are from CODA package:
require(coda)
coda::traceplot( codaObject[,c(parName)], 
              main="" , 
              ylab="Param. Value" ,
              col=DBDAplColors )

In [None]:
options(repr.plot.width=6, repr.plot.height=4)
tryVal = try(
coda::gelman.plot(codaObject[,c(parName)] , 
                  main="",
                  auto.layout=FALSE,
                  col=DBDAplColors )
)  

In [None]:
options(repr.plot.width=6, repr.plot.height=4)

# if it runs, gelman.plot returns a list with finite shrink values:
if ( inherits(tryVal, "try-error") ) {
  plot.new() 
  print(paste0("Warning: coda::gelman.plot fails for ",parName))
} else { 
  if ( is.list(tryVal) && !is.finite(tryVal$shrink[1]) ) {
    plot.new() 
    print(paste0("Warning: coda::gelman.plot fails for ",parName))
  }
}
# autocorrelation and density panels, then the shared title
DbdaAcfPlot(codaObject,parName,plColors=DBDAplColors)
DbdaDensPlot(codaObject,parName,plColors=DBDAplColors)
mtext( text=parName , outer=TRUE , adj=c(0.5,0.5) , cex=2.0 )
if ( !is.null(saveName) ) {
  saveGraph( file=paste0(saveName,"Diag",parName), type=saveType)
}



In [None]:
##Collect coefficients for RBPs whose 95% HDI does not contain 0:

In [None]:
rbp.names <- rownames(rexp)

df <- data.frame(coef=NULL,rbp=NULL,tissue=NULL)

# HPDinterval() on an mcmc.list returns one interval matrix per chain;
# hdi[[1]] below therefore uses the intervals of the first chain only
hdi <- HPDinterval(mcmcCoda)
s   <- summary(mcmcCoda)
m   <- s$statistics[,"Mean"]

beta2.mat <- matrix(nrow=nrow(merged.table),ncol=length(rbp.names))

for (rbp in (1:length(rbp.names))) {
  for (event in (1:nrow(merged.table))) {
    var.name <- paste0("beta2[",event,",",rbp,"]")
    low  <- hdi[[1]][rownames(hdi[[1]])==var.name][1]
    high <- hdi[[1]][rownames(hdi[[1]])==var.name][2]
    beta2.mat[event,rbp] <- m[grepl(paste0("beta2\\[",event,",",rbp,"\\]"),names(m))]
    # set the coefficient to 0 when the interval contains 0
    if (low<0 && high>0)
      beta2.mat[event,rbp] <- 0
  }
}

colnames(beta2.mat)=rbp.names

# long format: one row per (event, RBP) pair
for (rbp in rbp.names)
  df<-rbind(df,cbind(beta2.mat[,colnames(beta2.mat)==rbp],rep(rbp,nrow(beta2.mat)),rep(tissue,nrow(beta2.mat)))  )

colnames(df) <- c("Coef","RBP","Tissue")
df$Coef <- as.numeric(as.character(df$Coef))
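
The masking step, which keeps the posterior mean only when the interval excludes 0, can be illustrated on simulated draws. Note the substitution: to keep the sketch free of dependencies it uses an equal-tailed 95% quantile interval rather than `coda::HPDinterval()`, so it illustrates the masking logic and not the notebook's exact intervals. All draws and names are made up.

```R
set.seed(1)
# Made-up posterior draws for two coefficients (columns)
draws <- cbind(b_far  = rnorm(2000, mean = 1.0,  sd = 0.2),
               b_near = rnorm(2000, mean = 0.05, sd = 0.3))

post.mean <- colMeans(draws)

# Equal-tailed 95% interval per coefficient (stand-in for an HPD interval)
ci <- apply(draws, 2, quantile, probs = c(0.025, 0.975))

# Zero out coefficients whose interval straddles 0
contains.zero <- ci[1, ] < 0 & ci[2, ] > 0
coef.kept <- ifelse(contains.zero, 0, post.mean)
```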


In [None]:
##Display a violin plot for some selected RBPs:

In [None]:
labels<-read.table("labels.tsv",sep="\t",header=T)

df$Tissue<-as.character(df$Tissue)

# replace tissue names with their display labels where one is available
for (i in 1:nrow(df)) {
  if (df$Tissue[i] %in% labels$tissue)
    df$Tissue[i]<-as.character(labels$X[which(as.character(labels$tissue)==as.character(df$Tissue[i]))])
}

# drop the coefficients that were set to 0 above
df<-df[df$Coef!=0,]
df$RBP<-as.character(df$RBP)

df1<-df[df$RBP %in% c("BUD13", "GTF2F1","CTNNBL1"),]

pn1<-ggplot(df1,aes(factor(RBP),Coef)) + geom_violin(aes(fill="red")) + scale_fill_manual(values = "#4DBBD5FF") 

pn1 <- pn1 + theme_minimal() +  theme(text = element_text(size=20),
                                      axis.text = element_text(size=20, hjust=0.5),
                                      axis.title.x=element_blank(),
                                      axis.title.y = element_text(size=24),
                                      plot.title = element_text(hjust = 0.5),
                                      legend.position = "none") + ylab("") + labs(title="")+ylim(-2,2)+ geom_hline(yintercept=0)



In [None]:
##Create barplot of the number of RBPs that tend to promote skipping, the number of RBPs that tend to promote
#inclusion and the number of RBPs whose effect is context-specific, for the two RBP groups

In [None]:
spliceosome_genes = as.character(rbp.table$Gene[rbp.table$S=="TRUE"])

splice_regulation_genes = as.character(rbp.table$Gene[rbp.table$R=="TRUE"])

for (RBP_set in list(spliceosome=spliceosome_genes,splice_regulation=splice_regulation_genes))
{
  sum.pos<-sort(unlist(lapply(lapply(split(df$Coef[df$RBP %in% RBP_set],df$RBP[df$RBP %in% RBP_set]),">",0),sum)),decreasing = T)
  
  sum.neg<-sort(unlist(lapply(lapply(split(df$Coef[df$RBP %in% RBP_set],df$RBP[df$RBP %in% RBP_set]),"<",0),sum)),decreasing = T)
  
  sum.pos<-sum.pos[order(names(sum.pos))]
  
  sum.neg<-sum.neg[order(names(sum.neg))]
  
  pos.rbps<-names(which(sum.pos/(sum.pos+sum.neg)>=0.75 & (sum.pos+sum.neg>quantile(sum.pos+sum.neg,0.2))))
  
  neg.rbps<-names(which(sum.pos/(sum.pos+sum.neg)<=0.25 & (sum.pos+sum.neg>quantile(sum.pos+sum.neg,0.2))))
  
  cs.rbps<-names(which(sum.pos/(sum.pos+sum.neg)>0.25 & sum.pos/(sum.pos+sum.neg)<0.75 & (sum.pos+sum.neg>quantile(sum.pos+sum.neg,0.2))))
  
  df.counts<-data.frame(type=c("Skip","Inc","CS"),counts=c(length(pos.rbps),length(neg.rbps),length(cs.rbps)))
  
  pn4_new <- ggplot(df.counts, aes(type, counts)) +  
    geom_bar(fill = "#00008B",color="black", position = "dodge", stat="identity") + 
    geom_text(aes(x = type, y = counts + 10, label = paste(100 * round(counts/sum(counts), 3), "%", sep = "")), size = 3) +
    guides(fill=FALSE) +
    xlab("") + scale_y_continuous(breaks = c(0, 20, 40), limits = c(0, 60))+
    theme_minimal() +
    theme(
      axis.text = element_text(size = 8), 
      axis.text.x = element_text(angle = 90, hjust = 1), 
      axis.title = element_text(size = 10),
      axis.title.y = element_text(vjust = 5)
    )
  show(pn4_new  )
}
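
The classification used for the barplot can be sketched on a toy table: an RBP counts as "Skip" when at least 75% of its non-zero coefficients are positive, "Inc" when at most 25% are, and "CS" (context-specific) otherwise. The RBP names and coefficients below are made up, and the minimum-count filter based on the 0.2 quantile is left out for brevity.

```R
# Made-up non-zero coefficients for three hypothetical RBPs
toy <- data.frame(
  RBP  = c(rep("RBP_A", 4), rep("RBP_B", 4), rep("RBP_C", 4)),
  Coef = c( 0.5,  0.2,  0.9,  0.1,    # RBP_A: 4 positive, 0 negative
           -0.4, -0.3, -0.8,  0.2,    # RBP_B: 1 positive, 3 negative
            0.6, -0.6,  0.3, -0.1),   # RBP_C: 2 positive, 2 negative
  stringsAsFactors = FALSE
)

# Count positive and negative coefficients per RBP
by.rbp <- split(toy$Coef, toy$RBP)
n.pos  <- vapply(by.rbp, function(x) sum(x > 0), numeric(1))
n.neg  <- vapply(by.rbp, function(x) sum(x < 0), numeric(1))
frac.pos <- n.pos / (n.pos + n.neg)

# Same 0.75 / 0.25 cut-offs as in the notebook
rbp.class <- ifelse(frac.pos >= 0.75, "Skip",
              ifelse(frac.pos <= 0.25, "Inc", "CS"))
```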




In [None]:
# TRUE only when both parsed files are already present in ../data/
have.parsed.files <- ("lv.txt" %in% list.files("../data/")) && ("mt.txt" %in% list.files("../data/"))

if (have.parsed.files) {
        message("The files lv.txt and mt.txt are available in the folder ../data/! \n")
        message("The 'perl parseMT.pl' command will not be re-run \n")
} else {
        message("The files lv.txt or mt.txt not found in the folder ../data/ \n")
        message("Generating lv.txt and mt.txt with 'perl parseMT.pl' using 'summary_hbm.txt' as input .. \n")
        system(paste0("cd ../dimorphAS/notebook/ && ",
                      "perl parseMT.pl > parseMT_output.txt && ", 
                      "mv lv.txt ../../data/ && ",
                      "mv mt.txt ../../data/ && ",
                      "cp summary_hbm.txt  ../../data/"), 
               intern  = TRUE)
        message("Done!\n")
}



## Metadata

For replicability and reproducibility purposes, we also print the following metadata:

1. Checksums of **"artefacts"**, files generated during the analysis and stored in the folder directory **`data`**
2. List of environment metadata, dependencies, versions of libraries using `utils::sessionInfo()` and [`devtools::session_info()`](https://devtools.r-lib.org/reference/session_info.html)

### 1. Checksums with the sha256 algorithm

In [None]:
figure_id       <- "figures_3"

message("Generating sha256 checksums of the artefacts in the `../data/` directory .. ")
system(paste0("cd ../data/ && find . -type f -exec sha256sum {} \\; > ../metadata/",  figure_id, "_sha256sums.txt"), intern = TRUE)
message("Done!\n")

data.table::fread(paste0("../metadata/", figure_id, "_sha256sums.txt"), header = FALSE, col.names = c("sha256sum", "file"))

### 2. Libraries metadata

In [None]:
dev_session_info   <- devtools::session_info()
utils_session_info <- utils::sessionInfo()

# output paths, so that the messages report the files that are actually written
dev_info_file   <- paste0("../metadata/", figure_id, "_devtools_session_info.rds")
utils_info_file <- paste0("../metadata/", figure_id ,"_utils_info.rds")

message("Saving `devtools::session_info()` objects in ", dev_info_file, "  ..")
saveRDS(dev_session_info, file = dev_info_file)
message("Done!\n")

message("Saving `utils::sessionInfo()` objects in ", utils_info_file, "  ..")
saveRDS(utils_session_info, file = utils_info_file)
message("Done!\n")

dev_session_info$platform
dev_session_info$packages[dev_session_info$packages$attached==TRUE, ]