# Analysis pre-processing by Diogo Veiga

These scripts take either the output of the differential analysis or the output of the rMATS processing and create the input files for each figure in the paper "The Impact of Sex on Alternative Splicing".

## **NOTE**:

We assume that you have cloned the analysis repository and have changed (`cd`) into the parent directory. Before starting the analysis, make sure you have completed the dependency setup by following the instructions in the **`dependencies/README.md`** document. All paths defined in this Notebook are relative to the parent directory (repository). If you have not completed these steps, please close this Notebook and start again by following the above guidelines.

## Loading dependencies

In [7]:
library(dplyr)
library(ggplot2)

Sys.setenv(TAR = "/bin/tar") # for gzfile

## Download the files
For testing, open a terminal in a new launcher window and run the following commands.

cd sbas/data
mkdir significant_events
wget https://github.com/adeslatt/sbas_test/releases/download/GTExV6SignificantASTissueEvents.v1/significant_events.tar
tar xvf significant_events.tar

## After getting the significant_events files, get the GENCODE-specific files.

wget https://github.com/adeslatt/sbas_test/releases/download/rmats_final.gencode.v30/fromGTF.SE.txt
wget https://github.com/adeslatt/sbas_test/releases/download/rmats_final.gencode.v30/fromGTF.RI.txt
wget https://github.com/adeslatt/sbas_test/releases/download/rmats_final.gencode.v30/fromGTF.A3SS.txt
wget https://github.com/adeslatt/sbas_test/releases/download/rmats_final.gencode.v30/fromGTF.A5SS.txt
wget https://github.com/adeslatt/sbas_test/releases/download/rmats_final.gencode.v30/fromGTF.MXE.txt
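The same annotation files could also be fetched from within R. The following is a minimal sketch, assuming the release URL used in the `wget` commands stays valid and that the files belong in a `data` directory relative to the repository root. `fetch_annotations` and `dest_dir` are hypothetical names, and nothing is downloaded until the function is called.

```r
# Event types for which rMATS annotation files are listed above
event_types <- c("SE", "RI", "A3SS", "A5SS", "MXE")

# Base URL copied from the wget commands above
base_url <- "https://github.com/adeslatt/sbas_test/releases/download/rmats_final.gencode.v30/"

file_names <- paste0("fromGTF.", event_types, ".txt")
urls <- paste0(base_url, file_names)

# Hypothetical helper: download each annotation file into `dest_dir`
fetch_annotations <- function(dest_dir = "data") {
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
  for (k in seq_along(urls)) {
    download.file(urls[k], destfile = file.path(dest_dir, file_names[k]))
  }
  invisible(file.path(dest_dir, file_names))
}
```

Calling `fetch_annotations()` would then place the five `fromGTF.*.txt` files where the later `read.table` calls expect them.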


In [13]:
setwd('jupyter')

The next cells check the working directory, list the significant-event files, and load the rMATS annotation tables used to map event IDs to gene symbols.

In [22]:
# Verify the current working directory. If it is sbas/jupyter, uncomment setwd('../') to move up a directory.
getwd()
#setwd('../')
getwd()

In [23]:
#Parse files to create a data frame with counts

files <- list.files(path = "data/significant_events/", pattern = "\\.txt$")
as_types <- c("a3ss", "a5ss", "mxe", "ri", "se")
head(files)
as_types

In [24]:
files_aux <- gsub(pattern = "\\.txt$", replacement = "", x = files)
files_aux <- gsub(pattern = "a3ss$|a5ss$|mxe$|ri$|se$", replacement = "", files_aux)
head(files_aux)
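To make the file-name parsing concrete, here is a small sketch on two made-up file names. It assumes the files are named `<tissue><event type>.txt`, which is inferred from the `gsub` patterns above rather than confirmed. The names `example_files`, `stem`, `suffix_re`, `tissue` and `event_type` are illustrative only.

```r
# Hypothetical file names following the "<tissue><event type>.txt" pattern
example_files <- c("Adipose - Subcutaneousa3ss.txt", "Whole Bloodse.txt")

# Drop the ".txt" extension (dot escaped so that it matches literally)
stem <- sub("\\.txt$", "", example_files)

# Extract the trailing event-type suffix, then remove it to get the tissue
suffix_re <- "(a3ss|a5ss|mxe|ri|se)$"
event_type <- toupper(regmatches(stem, regexpr(suffix_re, stem)))
tissue <- sub(suffix_re, "", stem)

data.frame(file = example_files, tissue = tissue, event_type = event_type)
```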

In [27]:
a3ss_annot <- read.table(file = "data/fromGTF.A3SS.txt", sep = "\t", quote = "\"", header = TRUE, stringsAsFactors = FALSE)
a5ss_annot <- read.table(file = "data/fromGTF.A5SS.txt", sep = "\t", quote = "\"", header = TRUE, stringsAsFactors = FALSE)
mxe_annot  <- read.table(file = "data/fromGTF.MXE.txt", sep = "\t", quote = "\"", header = TRUE, stringsAsFactors = FALSE)
ri_annot   <- read.table(file = "data/fromGTF.RI.txt", sep = "\t", quote = "\"", header = TRUE, stringsAsFactors = FALSE)
se_annot   <- read.table(file = "data/fromGTF.SE.txt", sep = "\t", quote = "\"", header = TRUE, stringsAsFactors = FALSE)

head(se_annot)
head(a3ss_annot)
head(a5ss_annot)
head(ri_annot)
head(mxe_annot)

,ID,GeneID,geneSymbol,chr,strand,exonStart_0base,exonEnd,upstreamES,upstreamEE,downstreamES,downstreamEE
,<int>,<chr>,<chr>,<chr>,<chr>,<int>,<int>,<int>,<int>,<int>,<int>
1,1,ENSG00000034152.18,MAP2K3,chr17,+,21287990,21288091,21284709,21284969,21295674,21295769
2,2,ENSG00000034152.18,MAP2K3,chr17,+,21303182,21303234,21302142,21302259,21304425,21304553
3,3,ENSG00000034152.18,MAP2K3,chr17,+,21295674,21295769,21287990,21288091,21296085,21296143
4,4,ENSG00000034152.18,MAP2K3,chr17,+,21295674,21295769,21287990,21288091,21298412,21298479
5,5,ENSG00000034152.18,MAP2K3,chr17,+,21295674,21295769,21284710,21284969,21296085,21296143
6,6,ENSG00000034152.18,MAP2K3,chr17,+,21295674,21295769,21284710,21284969,21298412,21298479


,ID,GeneID,geneSymbol,chr,strand,longExonStart_0base,longExonEnd,shortES,shortEE,flankingES,flankingEE
,<int>,<chr>,<chr>,<chr>,<chr>,<int>,<int>,<int>,<int>,<int>,<int>
1,1,ENSG00000034152.18,MAP2K3,chr17,+,21300470,21300658,21300544,21300658,21298877,21298926
2,2,ENSG00000160223.17,ICOSLG,chr21,-,44222990,44229044,44222990,44223078,44230053,44230089
3,3,ENSG00000143257.11,NR1I3,chr1,-,161230812,161230933,161230812,161230918,161231116,161231245
4,4,ENSG00000143257.11,NR1I3,chr1,-,161230812,161231245,161230812,161230918,161231328,161231474
5,5,ENSG00000143257.11,NR1I3,chr1,-,161230812,161230933,161230812,161230918,161231328,161231474
6,6,ENSG00000143257.11,NR1I3,chr1,-,161230812,161231245,161230812,161230933,161231328,161231474


,ID,GeneID,geneSymbol,chr,strand,longExonStart_0base,longExonEnd,shortES,shortEE,flankingES,flankingEE
,<int>,<chr>,<chr>,<chr>,<chr>,<int>,<int>,<int>,<int>,<int>,<int>
1,1,ENSG00000125166.13,GOT2,chr16,-,58722118,58722278,58722149,58722278,58719195,58719255
2,2,ENSG00000130182.8,ZSCAN10,chr16,-,3092541,3093004,3092676,3093004,3091763,3091828
3,3,ENSG00000143257.11,NR1I3,chr1,-,161236458,161236598,161236529,161236598,161235846,161235977
4,4,ENSG00000154265.16,ABCA5,chr17,-,69273942,69274128,69273958,69274128,69271161,69271289
5,5,ENSG00000035928.16,RFC1,chr4,-,39351347,39351476,39351431,39351476,39345400,39345476
6,6,ENSG00000127249.15,ATP13A4,chr3,-,193440489,193440637,193440557,193440637,193439022,193439065


,ID,GeneID,geneSymbol,chr,strand,riExonStart_0base,riExonEnd,upstreamES,upstreamEE,downstreamES,downstreamEE
,<int>,<chr>,<chr>,<chr>,<chr>,<int>,<int>,<int>,<int>,<int>,<int>
1,1,ENSG00000160223.17,ICOSLG,chr21,-,44226833,44230089,44226833,44229044,44230053,44230089
2,2,ENSG00000143257.11,NR1I3,chr1,-,161230812,161231245,161230812,161230933,161231116,161231245
3,3,ENSG00000143257.11,NR1I3,chr1,-,161230812,161231245,161230812,161230918,161231116,161231245
4,4,ENSG00000114062.19,UBE3A,chr15,-,25333727,25340228,25333727,25339257,25340084,25340228
5,5,ENSG00000100359.21,SGSM3,chr22,+,40408932,40409372,40408932,40409018,40409249,40409372
6,6,ENSG00000100359.21,SGSM3,chr22,+,40408932,40409372,40408932,40409018,40409339,40409372


,ID,GeneID,geneSymbol,chr,strand,X1stExonStart_0base,X1stExonEnd,X2ndExonStart_0base,X2ndExonEnd,upstreamES,upstreamEE,downstreamES,downstreamEE
,<int>,<chr>,<chr>,<chr>,<chr>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>
1,1,ENSG00000114062.19,UBE3A,chr15,-,25407066,25407262,25408619,25408684,25405460,25405502,25409087,25409207
2,2,ENSG00000181790.11,ADGRB1,chr8,+,142524237,142524304,142526541,142526627,142522640,142522710,142533294,142533466
3,3,ENSG00000159256.13,MORC3,chr21,+,36380230,36380318,36380625,36380682,36377409,36377507,36384729,36384829
4,4,ENSG00000077232.18,DNAJC10,chr2,+,182752548,182752623,182754702,182754805,182752071,182752188,182755002,182755104
5,5,ENSG00000149809.14,TM7SF2,chr11,+,65113219,65113414,65113490,65113594,65112810,65112865,65114712,65114832
6,6,ENSG00000149809.14,TM7SF2,chr11,+,65112810,65112865,65113219,65113414,65112514,65112711,65113490,65113594


In [28]:
# Map each event-type suffix to its annotation table
annots <- list(a3ss = a3ss_annot, a5ss = a5ss_annot, mxe = mxe_annot,
               ri = ri_annot, se = se_annot)

gene_as <- data.frame()

for (i in seq_along(files)) {

  path  <- paste0("data/significant_events/", files[i])
  lines <- readLines(path)

  if (length(lines) > 1) { # more than the header line: has significant events
    events <- read.table(path, sep = "\t", skip = 1)

    # The event type is the file-name suffix before ".txt"
    type  <- sub("^.*(a3ss|a5ss|mxe|ri|se)\\.txt$", "\\1", files[i])
    annot <- annots[[type]]
    if (is.null(annot)) next # skip files with an unrecognised suffix

    # Look up gene symbol and chromosome by event ID
    idx <- match(events$V1, annot$ID)
    res <- data.frame(Tissue = files_aux[i], ASE = toupper(type),
                      GeneSymbol = annot$geneSymbol[idx],
                      chr = annot$chr[idx])

    gene_as <- rbind(gene_as, res)

  } #if has sig. events

} #for all files

head(gene_as)
head(res)

,Tissue,ASE,GeneSymbol,chr
,<fct>,<fct>,<fct>,<fct>
1,Adipose - Subcutaneous,A3SS,SLC25A36,chr3
2,Adipose - Subcutaneous,A3SS,MYO15A,chr17
3,Adipose - Subcutaneous,A5SS,THOC5,chr22
4,Adipose - Subcutaneous,A5SS,NCMAP,chr1
5,Adipose - Subcutaneous,A5SS,C6orf136,chr6
6,Adipose - Subcutaneous,A5SS,APEH,chr3


,Tissue,ASE,GeneSymbol,chr
,<fct>,<fct>,<fct>,<fct>
1,Whole Blood,SE,LINC00671,chr17
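The annotation step in the loop relies on `match()` to translate event IDs into row positions of the annotation table. A minimal sketch on toy data: the annotation rows are copied from the `head(a3ss_annot)` output, while the event IDs (including the unannotated ID 99) are hypothetical.

```r
# Toy annotation table; rows copied from the head(a3ss_annot) output
annot <- data.frame(
  ID = 1:3,
  geneSymbol = c("MAP2K3", "ICOSLG", "NR1I3"),
  chr = c("chr17", "chr21", "chr1"),
  stringsAsFactors = FALSE
)

# Hypothetical significant-event IDs; 99 has no entry in the annotation
event_ids <- c(3, 1, 99)

# match() gives, for each event ID, its row position in annot (NA if absent)
idx <- match(event_ids, annot$ID)

looked_up <- data.frame(
  ID = event_ids,
  GeneSymbol = annot$geneSymbol[idx],
  chr = annot$chr[idx],
  stringsAsFactors = FALSE
)
looked_up
```

Events without an annotation entry end up with `NA` in `GeneSymbol` and `chr`, so it may be worth checking `anyNA(idx)` before counting genes.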


In [29]:
# Count the most frequently spliced genes
res <- gene_as %>% group_by(GeneSymbol) %>% count(GeneSymbol) %>% arrange(desc(n)) %>% as.data.frame()
res$GeneSymbol <- factor(res$GeneSymbol, levels = res$GeneSymbol)
length(res$GeneSymbol)
head(res)

,GeneSymbol,n
,<fct>,<int>
1,SLC25A36,33
2,TEX41,28
3,PPIEL,23
4,THOC5,21
5,PCBP1-AS1,20
6,GSTM1,18
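The counting step can also be reproduced in base R, which helps show why the factor levels are reset afterwards: ggplot2 orders a discrete axis by factor level, so fixing the levels to the sorted gene order keeps later plots sorted by count. This is a sketch on made-up rows, not the notebook's method; `toy`, `counts` and `res_toy` are hypothetical names and the counts are not real results.

```r
# Toy gene-level table; the rows are hypothetical, not real results
toy <- data.frame(
  Tissue = c("T1", "T1", "T2", "T2", "T3"),
  GeneSymbol = c("SLC25A36", "THOC5", "SLC25A36", "THOC5", "SLC25A36"),
  stringsAsFactors = FALSE
)

# Count rows per gene and sort from most to least frequent
counts <- sort(table(toy$GeneSymbol), decreasing = TRUE)
res_toy <- data.frame(GeneSymbol = names(counts), n = as.integer(counts),
                      stringsAsFactors = FALSE)

# Fix the factor level order so that plots keep the sorted order
res_toy$GeneSymbol <- factor(res_toy$GeneSymbol, levels = res_toy$GeneSymbol)
res_toy
```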


In [30]:
test <- gene_as %>% group_by(GeneSymbol)
head(test)

Tissue,ASE,GeneSymbol,chr
<fct>,<fct>,<fct>,<fct>
Adipose - Subcutaneous,A3SS,SLC25A36,chr3
Adipose - Subcutaneous,A3SS,MYO15A,chr17
Adipose - Subcutaneous,A5SS,THOC5,chr22
Adipose - Subcutaneous,A5SS,NCMAP,chr1
Adipose - Subcutaneous,A5SS,C6orf136,chr6
Adipose - Subcutaneous,A5SS,APEH,chr3


## Metadata

For replicability and reproducibility purposes, we also print the following metadata:

1. Checksums of **'artefacts'**, files generated during the analysis and stored in the **`data`** directory
2. List of environment metadata, dependencies, versions of libraries using `utils::sessionInfo()` and [`devtools::session_info()`](https://devtools.r-lib.org/reference/session_info.html)

### 1. Checksums with the sha256 algorithm

In [None]:
figure_id <- "<the-figure-i-am-working-on>"

message("Generating sha256 checksums of the artefacts in the `../data/` directory .. ")
system(paste0("cd ../data/ && sha256sum * > ../metadata/", figure_id, "_sha256sums.txt"), intern = TRUE)
message("Done!\n")

data.table::fread(paste0("../metadata/", figure_id, "_sha256sums.txt"), header = FALSE, col.names = c("sha256sum", "file"))
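The checksum step depends on the `sha256sum` command being available and on the `../metadata/` directory already existing. A more defensive variant might look like the sketch below. `write_checksums` is a hypothetical helper, not part of the repository; it assumes `sha256sum` is on the PATH and skips subdirectories, for which `sha256sum` would report an error.

```r
# Hypothetical helper: write sha256 checksums of the files in `data_dir`
# to `out_file`, creating the output directory if needed
write_checksums <- function(data_dir, out_file) {
  if (!nzchar(Sys.which("sha256sum"))) {
    stop("`sha256sum` was not found on the PATH")
  }
  dir.create(dirname(out_file), showWarnings = FALSE, recursive = TRUE)

  # Keep regular files only; sha256sum reports an error for directories
  files <- list.files(data_dir)
  files <- files[!dir.exists(file.path(data_dir, files))]
  if (length(files) == 0) {
    return(invisible(character(0)))
  }

  # Resolve the output path before changing the working directory
  out_file <- file.path(normalizePath(dirname(out_file)), basename(out_file))
  old_wd <- setwd(data_dir)
  on.exit(setwd(old_wd))

  sums <- system2("sha256sum", args = shQuote(files), stdout = TRUE)
  writeLines(sums, out_file)
  invisible(sums)
}
```

Under these assumptions, a call such as `write_checksums("../data", paste0("../metadata/", figure_id, "_sha256sums.txt"))` would write the same kind of file as the cell above.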

### 2. Libraries metadata

In [None]:
figure_id <- "<the-figure-i-am-working-on>"

dev_session_info   <- devtools::session_info()
utils_session_info <- utils::sessionInfo()

message("Saving `devtools::session_info()` objects in ../metadata/", figure_id, "_devtools_session_info.rds  ..")
saveRDS(dev_session_info, file = paste0("../metadata/", figure_id, "_devtools_session_info.rds"))
message("Done!\n")

message("Saving `utils::sessionInfo()` objects in ../metadata/", figure_id, "_utils_info.rds  ..")
saveRDS(utils_session_info, file = paste0("../metadata/", figure_id, "_utils_info.rds"))
message("Done!\n")

dev_session_info$platform
dev_session_info$packages[dev_session_info$packages$attached==TRUE, ]
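The saved `.rds` files can be read back later to check which R version and packages were used. A minimal round-trip sketch, using a temporary file (a hypothetical location) instead of the `../metadata/` path:

```r
# Save the session information to a temporary file (hypothetical location)
rds_path <- tempfile(fileext = ".rds")
saveRDS(utils::sessionInfo(), file = rds_path)

# Read it back, for example in a later session
restored <- readRDS(rds_path)

restored$R.version$version.string  # R version used for the analysis
names(restored$otherPkgs)          # attached packages, NULL if none
```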