<span style="color:red; font-family:Helvetica Neue, Helvetica, Arial, sans-serif; font-size:2em;">An Exception was encountered at '<a href="#papermill-error-cell">In [11]</a>'.</span>

# Supplementary: Counts of differentially spliced exons per examined tissue

This notebook aggregates the results of the differential splicing analysis (**see** [figureXXX.ipynb](figure1.ipynb)), specifically the `limma::topTable()` output dataframes across all tissues in the GTEx cohort, and generates summary statistics for the number of exons found to be significantly up- or downregulated between male and female subjects.

 ---
 
 **Running this notebook**:
 
A few steps are needed before you can run this document on your own. The GitHub repository (https://github.com/TheJacksonLaboratory/sbas) of the project contains detailed instructions for setting up the environment in the **`dependencies/README.md`** document. Before starting with the analysis, make sure you have first completed the dependencies set up by following the instructions described there. If you have not done this already, you will need to close and restart this notebook before running it.

All paths defined in this Notebook are relative to the parent directory (repository). 

 ---


# Loading dependencies

In [1]:
library(dplyr)
library(tidyr)
library(reshape)
library(ggplot2)
# Install this version: > devtools::install_github("ropensci/piggyback@87f71e8", upgrade="never")
#library(piggyback)
library(snakecase)


Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union



Attaching package: ‘reshape’


The following objects are masked from ‘package:tidyr’:

    expand, smiths


The following object is masked from ‘package:dplyr’:

    rename




# Create a list of named dataframes with the Differentially Spliced Exons `limma::topTable()`s

We will iterate over the list of named dataframes to collect summary statistics. More specifically, retrieve the count of:
- upregulated
- downregulated
- non significant

differentially spliced exons for the contrast males-females per tissue.

In [43]:
# `list.files()` and `gsub()` expect a regular expression, not a shell glob
suffix_pattern   <- "_AS_model_B_sex_as_events\\.csv$"
tables_folder    <- "../data"

tables_filepaths <- list.files(tables_folder, pattern = suffix_pattern, full.names = TRUE)
tables_filenames <- list.files(tables_folder, pattern = suffix_pattern, full.names = FALSE)

In [44]:
head(tables_filepaths)

In [45]:
head(tables_filenames)

In [46]:
all_topTables <- lapply(tables_filepaths,read.csv)
names(all_topTables) <- gsub(suffix_pattern,"", tables_filenames)

The list named `all_topTables` is the object that holds all the topTable dataframes from each tissue comparison:
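
As a minimal sketch of how the list names are derived, the snippet below writes two tiny placeholder CSV files to a temporary folder and builds a named list in the same way. The folder `toy_dir`, the two tissue names and the table values are made up for illustration; only the file suffix comes from this notebook.

```r
# Hypothetical toy files in a temporary folder (not the real ../data tables)
toy_dir <- file.path(tempdir(), "toy_topTables")
dir.create(toy_dir, showWarnings = FALSE)
suffix  <- "_AS_model_B_sex_as_events.csv"

for (tissue in c("se_tissue_a", "se_tissue_b")) {
    toy <- data.frame(logFC = c(0.8, -0.2), adj.P.Val = c(0.01, 0.6),
                      row.names = c("GENE1-1", "GENE2-2"))
    write.csv(toy, file.path(toy_dir, paste0(tissue, suffix)))
}

# Read every table and name each list element after its tissue
toy_paths  <- list.files(toy_dir, pattern = "\\.csv$", full.names = TRUE)
toy_tables <- lapply(toy_paths, read.csv, row.names = 1)
# fixed = TRUE treats the suffix as a literal string rather than a regex
names(toy_tables) <- sub(suffix, "", basename(toy_paths), fixed = TRUE)

names(toy_tables)
```

Naming the list elements up front means every later step can recover the tissue from `names()` instead of re-parsing file paths.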

In [47]:
length(all_topTables)

In [48]:
head (all_topTables)

Unnamed: 0_level_0,logFC,AveExpr,t,P.Value,adj.P.Val,B
Unnamed: 0_level_1,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
PDCD2-904,-0.4347036,6.175682,-4.187693,3.241175e-05,0.1327911,2.02727979
PQBP1-4194,0.4844322,6.494722,4.088813,4.927315e-05,0.1327911,1.69169884
NPHP3-1972,-0.4527978,4.925482,-3.899874,1.071211e-04,0.1416724,0.94818584
RNF32-7280,-0.5096266,3.494105,-3.933913,9.335045e-05,0.1416724,0.90307003
FAM122B-6265,-0.3789073,5.631119,-3.800347,1.592474e-04,0.1430573,0.66915994
SMYD5-4413,-0.4870824,3.542014,-3.848827,1.314215e-04,0.1416724,0.62961155
RPS2-3820,-0.4852762,7.697982,-3.678433,2.557713e-04,0.1787982,0.20236319
SEH1L-3597,0.4534832,5.163434,3.584355,3.653952e-04,0.1787982,-0.06630462
PPP6R3-5587,-0.4508991,3.739152,-3.613860,3.269974e-04,0.1787982,-0.07623101
THADA-4580,-0.4318604,3.635039,-3.604603,3.386165e-04,0.1787982,-0.11325017

Unnamed: 0_level_0,logFC,AveExpr,t,P.Value,adj.P.Val,B
Unnamed: 0_level_1,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
IL17RC-5032,-0.7418516,3.317587,-4.358890,1.637778e-05,0.08891499,2.2231806
MDM4-3553,-0.6350300,3.093217,-3.942920,9.397421e-05,0.12934815,0.8175693
TMEM25-2152,-0.6712534,3.694924,-3.909944,1.072623e-04,0.12934815,0.7749876
RPS15A-94,-0.4968195,5.516636,-3.794562,1.691469e-04,0.12934815,0.4889874
BRD8-1346,0.6040044,4.748876,3.786766,1.743621e-04,0.12934815,0.4865721
TMEM25-2154,-0.6524021,3.327380,-3.794095,1.694549e-04,0.12934815,0.3836916
NUBP2-8412,-0.5300493,4.979043,-3.757124,1.956100e-04,0.12934815,0.3503845
RPS15A-95,-0.5016459,7.525785,-3.711859,2.328217e-04,0.12934815,0.2445876
OGFOD1-1989,0.6511965,3.750583,3.718251,2.271901e-04,0.12934815,0.1904553
CCDC22-8875,-0.4876963,4.158666,-3.705829,2.382541e-04,0.12934815,0.1761796

Unnamed: 0_level_0,logFC,AveExpr,t,P.Value,adj.P.Val,B
Unnamed: 0_level_1,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
SCO1-8452,0.7715834,5.028635,4.906848,1.533202e-06,0.008170436,0.8561371
NLRX1-7080,-0.3932918,6.047917,-3.508733,5.204727e-04,0.729562757,-1.4519715
CMTM3-6050,-0.5634628,4.736086,-3.596854,3.775214e-04,0.729562757,-1.7279511
COMT-5298,-0.5814431,6.366287,-3.413731,7.305247e-04,0.729562757,-1.7706538
ARL2BP-1047,0.6443623,4.428433,3.455417,6.301242e-04,0.729562757,-2.1386203
KSR1-3412,-0.5803093,4.465821,-3.316173,1.026735e-03,0.729562757,-2.1831471
SYNRG-3837,0.4642084,4.849423,3.166381,1.705091e-03,0.729562757,-2.3585758
NKIRAS2-7569,0.5412126,4.365878,3.235693,1.351522e-03,0.729562757,-2.4384009
RCCD1-7638,-0.4664921,4.834041,-3.166743,1.703036e-03,0.729562757,-2.4462010
OTUB1-7839,0.4232904,5.054867,3.207046,1.488491e-03,0.729562757,-2.4873699

Unnamed: 0_level_0,logFC,AveExpr,t,P.Value,adj.P.Val,B
Unnamed: 0_level_1,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
HMGCL-4891,-0.5544818,4.638661,-4.307308,2.040767e-05,0.1072627,1.778330780
DDX3X-5706,0.5336034,5.089376,4.041342,6.275350e-05,0.1649162,0.931205362
RASA1-2906,0.4482972,5.624881,3.717314,2.274062e-04,0.2390494,0.209547310
TPR-5186,0.5504123,4.012007,3.732892,2.141947e-04,0.2390494,0.003490113
RAET1G-1993,-0.7565974,1.582497,-3.803106,1.631312e-04,0.2390494,-0.169918676
RNF216P1-1922,0.5154237,3.518867,3.662572,2.801716e-04,0.2454303,-0.254521775
PMPCB-3741,-0.3897951,5.286827,-3.512051,4.907363e-04,0.3683199,-0.582986199
BAG6-6330,-0.3823559,5.170946,-3.365101,8.323906e-04,0.3683199,-0.936377813
RGS12-2643,0.4592240,3.676873,3.388022,7.674777e-04,0.3683199,-0.963083687
LCAT-5334,-0.4563918,4.789971,-3.333347,9.307901e-04,0.3683199,-1.008087973

Unnamed: 0_level_0,logFC,AveExpr,t,P.Value,adj.P.Val,B
Unnamed: 0_level_1,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
XIST-2252,-2.1880965,4.327079,-10.329346,4.631685e-21,2.464057e-17,16.7535120
CIRBP-7479,-0.9059219,6.826605,-4.215284,3.487846e-05,7.488340e-02,-0.4158808
CDC16-8442,0.6708470,4.528174,4.168855,4.222748e-05,7.488340e-02,-0.9825758
NCLN-8310,0.7612256,4.783347,3.654933,3.134939e-04,4.169469e-01,-1.7788510
GHDC-7423,0.5472432,5.480174,3.169623,1.716607e-03,9.243032e-01,-2.2634181
CYTH2-7326,0.6268525,5.865711,3.227853,1.413986e-03,9.243032e-01,-2.2753648
SH2B1-5277,-0.5355633,5.641082,-3.189748,1.605823e-03,9.243032e-01,-2.4487775
POLE-2860,-0.5637339,3.917083,-3.187680,1.616895e-03,9.243032e-01,-2.6090239
INTS11-7824,0.3744695,6.760867,2.793755,5.613400e-03,9.243032e-01,-2.6756341
LRPAP1-3505,0.5696534,6.211958,3.002605,2.948201e-03,9.243032e-01,-2.6788748

Unnamed: 0_level_0,logFC,AveExpr,t,P.Value,adj.P.Val,B
Unnamed: 0_level_1,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
RAD9A-7309,0.4931121,3.784158,4.145341,3.858037e-05,0.1969142,0.3365173
SETD6-944,0.3267084,4.804857,3.610304,3.301811e-04,0.4923814,-0.6571164
TMCO6-5638,0.4525169,3.998977,3.622348,3.154791e-04,0.4923814,-0.8878821
ERGIC3-3005,0.2837871,7.321589,3.381043,7.668164e-04,0.4923814,-0.9302555
MGAT4B-4054,0.4592824,4.504340,3.530145,4.456522e-04,0.4923814,-1.0897809
SUMO1-6035,0.2949199,6.150022,3.379246,7.717577e-04,0.4923814,-1.0994417
DPH1-8790,-0.3559022,4.264196,-3.463511,5.693592e-04,0.4923814,-1.2969707
PATZ1-2465,0.4110299,4.078407,3.387027,7.505802e-04,0.4923814,-1.3453788
SLC15A4-2332,0.3121709,5.643216,3.185157,1.518251e-03,0.5517674,-1.5601450
NAT9-6987,0.3158133,5.308934,3.116622,1.912700e-03,0.6101511,-1.7858628


In [49]:
summary(all_topTables)

                                           Length Class      Mode
a3ss_adipose_subcutaneous                  6      data.frame list
a3ss_adipose_visceral_omentum              6      data.frame list
a3ss_adrenal_gland                         6      data.frame list
a3ss_artery_aorta                          6      data.frame list
a3ss_artery_coronary                       6      data.frame list
a3ss_artery_tibial                         6      data.frame list
a3ss_brain_caudate_basal_ganglia           6      data.frame list
a3ss_brain_cerebellar_hemisphere           6      data.frame list
a3ss_brain_cerebellum                      6      data.frame list
a3ss_brain_cortex                          6      data.frame list
a3ss_brain_frontal_cortex_ba_9             6      data.frame list
a3ss_brain_hippocampus                     6      data.frame list
a3ss_brain_hypothalamus                    6      data.frame list
a3ss_brain_nucleus_accumbens_basal_ganglia 6      data.frame list
a3ss_brain

In [50]:
head(all_topTables[[3]] , 2)

Unnamed: 0_level_0,logFC,AveExpr,t,P.Value,adj.P.Val,B
Unnamed: 0_level_1,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
SCO1-8452,0.7715834,5.028635,4.906848,1.533202e-06,0.008170436,0.8561371
NLRX1-7080,-0.3932918,6.047917,-3.508733,0.0005204727,0.729562757,-1.4519715


# Example with one topTable before iterating over all tissues

In [51]:
# Example topTable and name
topTable <- all_topTables[[3]]
name     <- names( all_topTables)[3]
name
head(topTable,2)

Unnamed: 0_level_0,logFC,AveExpr,t,P.Value,adj.P.Val,B
Unnamed: 0_level_1,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
SCO1-8452,0.7715834,5.028635,4.906848,1.533202e-06,0.008170436,0.8561371
NLRX1-7080,-0.3932918,6.047917,-3.508733,0.0005204727,0.729562757,-1.4519715


## Defining the thresholds for the double criterion filtering:

Criteria:
- Adjusted p-value <= `adj.P.Val_threshold`
- Absolute fold change > `absFoldChange_cutoff`
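
The two criteria can be sketched as a small function applied to made-up values. The helper name `classify_abundance()` and the toy numbers are assumptions for illustration; the thresholds (0.05 and 1.5) are the ones set in this notebook.

```r
# Thresholds as set in this notebook
adj_p_threshold    <- 0.05
fold_change_cutoff <- 1.5

# Hypothetical helper: label each row as "higher", "lower" or "non_signif"
classify_abundance <- function(logFC, adj_p, fc_cutoff, p_threshold) {
    label     <- rep("non_signif", length(logFC))
    is_signif <- adj_p <= p_threshold
    # The fold change cutoff is applied on the log2 scale used by limma
    label[is_signif & logFC >  log2(fc_cutoff)] <- "higher"
    label[is_signif & logFC < -log2(fc_cutoff)] <- "lower"
    label
}

# Made-up values; log2(1.5) is roughly 0.585
toy <- data.frame(logFC     = c(0.77, -0.90, 0.30, -0.70),
                  adj.P.Val = c(0.008, 0.030, 0.010, 0.200))
toy$abundance <- classify_abundance(toy$logFC, toy$adj.P.Val,
                                    fold_change_cutoff, adj_p_threshold)
toy
```

A row must pass both criteria to be labelled: the third toy row is significant but below the fold change cutoff, and the fourth exceeds the cutoff but is not significant, so both stay `non_signif`.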

----

***NOTE***

Defining higher in males or females based on the limma design matrix.
As we have used 1 for encoding the females and 2 for the males, our *reference level* for the contrast in the expression between males and females is 1, the females.


From the `limma` documentation:
>The level which is chosen for the *reference level* is the level which is contrasted against. By default, this is simply the first level alphabetically. We can specify that we want group 2 to be the reference level by either using the relevel function [..]

By convention, genes with a positive log fold change are higher in males, and genes with a negative log fold change are higher in females.

---

In [52]:
adj.P.Val_threshold  <- 0.05
absFoldChange_cutoff <- 1.5

Replacing potential `NA` values in the `P.Value` and `adj.P.Val` columns to keep them numeric and avoid coercion.

In [53]:
# replacing NA p-values with p-value = 1
topTable$P.Value[is.na(topTable$P.Value)]     <- 1; 
topTable$adj.P.Val[is.na(topTable$adj.P.Val)] <- 1;

<span id="papermill-error-cell" style="color:red; font-family:Helvetica Neue, Helvetica, Arial, sans-serif; font-size:2em;">Execution using papermill encountered an exception here and stopped:</span>

In [54]:
# Add the helper variable `FoldChange_dummy`. limma reports log fold changes in base 2.
# The statement below converts logFC into a signed fold change (how many times higher or lower).
# The minus sign is a convention only, e.g. a fold change of 0.25 is expressed as -4 (4 times lower).
topTable$FoldChange_dummy    <-   ifelse(topTable$logFC > 0, 2 ^ topTable$logFC, -1 / (2 ^ topTable$logFC))                    

# Add helper variable `abs_logFC`.
topTable$abs_logFC <- abs(topTable$logFC)

# Add helper variable `abundance` for up, down, non_signif
topTable$abundance                                                  <- "non_signif"
topTable$abundance[ ((topTable$logFC >   log2(absFoldChange_cutoff)) & (topTable$adj.P.Val <= adj.P.Val_threshold )) ]   <- "higher"
topTable$abundance[ ((topTable$logFC <  -log2(absFoldChange_cutoff)) & (topTable$adj.P.Val <= adj.P.Val_threshold )) ]   <- "lower"


In [55]:
dim(topTable[ topTable$abundance == "non_signif" ,])
dim(topTable[ topTable$abundance == "higher" ,])
dim(topTable[ topTable$abundance == "lower" ,])

In [56]:
table(topTable$abundance)


    higher non_signif 
         1       5328 

# Define a vector with the columns to keep from the GTF-annotated `topTable` object

In [57]:
#toKeep <- c("Geneid","logFC","FoldChange_dummy", "adj.P.Val", "abundance")
toKeep <- colnames(topTable)

In [58]:
head(topTable[ , colnames(topTable) %in% toKeep ],2)

Unnamed: 0_level_0,logFC,AveExpr,t,P.Value,adj.P.Val,B,FoldChange_dummy,abs_logFC,abundance
Unnamed: 0_level_1,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>
SCO1-8452,0.7715834,5.028635,4.906848,1.533202e-06,0.008170436,0.8561371,1.707142,0.7715834,higher
NLRX1-7080,-0.3932918,6.047917,-3.508733,0.0005204727,0.729562757,-1.4519715,-1.313387,0.3932918,non_signif


In [59]:
name
dim(topTable)
dim(topTable  [ topTable$abundance != "non_signif",  ])
dim(topTable  [ ((topTable$logFC >  log2(absFoldChange_cutoff))  & (topTable$adj.P.Val <= adj.P.Val_threshold )) ,  ])
dim(topTable  [ ((topTable$logFC < -log2(absFoldChange_cutoff))  & (topTable$adj.P.Val <= adj.P.Val_threshold )) ,  ])

In [60]:
expression_abundance <- t(table(topTable$abundance))
expression_abundance

      
       higher non_signif
  [1,]      1       5328

In [61]:
expression_abundance <- t(table(topTable$abundance))
signif <- as.data.frame.matrix(expression_abundance)

In [62]:
signif

higher,non_signif
<int>,<int>
1,5328


To avoid errors when no exons are `lower` or none are `higher`, in which case the count matrix is missing the corresponding column, we create a template data.frame and add any missing column with a count of 0.

In [63]:
signif_template <- structure(list(higher = integer(0), 
                                   lower = integer(0), 
                                   non_signif = integer(0)), 
                              row.names = integer(0), class = "data.frame")
signif_template

higher,lower,non_signif
<int>,<int>,<int>


In the for-loop we will check if both columns `lower`, `higher` are present, if not add the column and zero count to create the expected shape of the dataframe:

```R
signif <- as.data.frame.matrix(expression_abundance)
if(! ("higher" %in% colnames(signif))) { 
    
    signif$higher <- 0
}
if(! ("lower" %in% colnames(signif))) { 

    signif$lower <- 0
}
```
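
An alternative sketch, not used in this notebook, is to convert the labels to a factor with fixed levels before tabulating, so that `table()` reports a zero count for any absent category. The toy labels below are made up for illustration.

```r
# Fixed set of categories, in the column order we want
abundance_levels <- c("higher", "lower", "non_signif")

# Made-up labels with no "lower" entries
toy_abundance <- c("higher", "non_signif", "non_signif")

# A factor with explicit levels keeps empty categories in the table
counts     <- table(factor(toy_abundance, levels = abundance_levels))
signif_toy <- as.data.frame.matrix(t(counts))
signif_toy
```

With this approach the resulting data.frame always has the three expected columns, so no per-column existence check is needed.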

Now we can add some more summary statistics, e.g. the percentage of genes that are lower, higher or not significantly different:

In [64]:
signif$tissue <- name
# Add any missing count column so that the sum below is always defined
for (column in c("higher", "lower", "non_signif")) {
    if (! (column %in% colnames(signif))) {
        signif[[column]] <- 0L
    }
}
signif$sum    <- signif$non_signif + signif$higher + signif$lower
toKeepInOrder <- c("tissue", "non_signif", "lower", "higher", "% lower", "% higher", "% non-signif")
signif$`% higher`     <-  round(signif$higher / signif$sum  * 100, 2)
signif$`% lower`      <-  round(signif$lower / signif$sum  * 100, 2)
signif$`% non-signif` <-  round(signif$non_signif / signif$sum  * 100, 2)
signif <- signif[, toKeepInOrder]
signif


# Summary table of differentially expressed genes between male and female across tissues

Above we demonstrated the procedure for one example limma `topTable`. Let's now iterate over all tissues and create an aggregated table with the counts of differentially expressed and non-significantly altered genes between the two sexes.

In [None]:
summary_signif <-structure(list(tissue = character(0), 
                            non_signif = integer(0), 
                            lower = integer(0),
                            higher = integer(0),
                            `% lower` = numeric(0), 
                            `% higher` = numeric(0), 
                            `% non-signif` = numeric(0)), 
                       row.names = integer(0), 
                       class = "data.frame")

signif_template <- structure(list(higher = integer(0), 
                                   lower = integer(0), 
                                   non_signif = integer(0)), 
                              row.names = integer(0), class = "data.frame")

signif_per_tissue <- structure(list(logFC = numeric(0), AveExpr = numeric(0), t = numeric(0), 
                        P.Value = numeric(0), adj.P.Val = numeric(0), B = numeric(0), 
                        initial_gene_id = character(0), gene_id = character(0), abs_logFC = numeric(0), 
                        FoldChange_dummy = numeric(0), abundance = character(0), 
                        GeneSymbol = character(0), Chromosome = character(0), Class = character(0), 
                        Strand = character(0), tissue = character(0)), row.names = integer(0), class = "data.frame")

for (i in seq_along(all_topTables)){
    topTable <- all_topTables[[i]]
    name     <- names(all_topTables)[i] 
    # Replace NA p-values with p-value = 1
    topTable$P.Value[is.na(topTable$P.Value)]     <- 1
    topTable$adj.P.Val[is.na(topTable$adj.P.Val)] <- 1

    # Add helper variable `abs_logFC`.
    topTable$abs_logFC <- abs(topTable$logFC)

    # Add the helper variable `FoldChange_dummy`. limma reports log fold changes in base 2.
    # The minus sign is a convention only, e.g. a fold change of 0.25 is expressed as -4 (4 times lower)
    topTable$FoldChange_dummy    <-   ifelse(topTable$logFC > 0, 2 ^ topTable$logFC, -1 / (2 ^ topTable$logFC))

    # Add helper variable `abundance` for up, down, non_signif
    topTable$abundance                                                  <- "non_signif"
    topTable$abundance[ ((topTable$logFC >   log2(absFoldChange_cutoff)) & (topTable$adj.P.Val <= adj.P.Val_threshold )) ]   <- "higher"
    topTable$abundance[ ((topTable$logFC <  -log2(absFoldChange_cutoff)) & (topTable$adj.P.Val <= adj.P.Val_threshold )) ]   <- "lower"
    if (sum(topTable$abundance != "non_signif") > 0) {
       topTable_signif <- topTable[ topTable$abundance != "non_signif", ]
       topTable_signif$tissue <- name
       signif_per_tissue <- rbind(signif_per_tissue, topTable_signif )
       data.table::fwrite(file = paste0("../data/signif_", snakecase::to_snake_case(name), ".csv"), topTable_signif)
    }
    expression_abundance <- t(table(topTable$abundance))
    signif <- as.data.frame.matrix(expression_abundance)
    if(! ("higher" %in% colnames(signif))) {
        signif$higher <- 0
    }
    if(! ("lower" %in% colnames(signif))) {
        signif$lower <- 0
    }
    signif$tissue <- name

    signif$sum    <-   signif$non_signif + signif$higher + signif$lower
    toKeepInOrder <- c("tissue", "non_signif", "lower", "higher", "% lower", "% higher", "% non-signif")
    signif$`% higher`     <-  round(signif$higher / signif$sum  * 100, 2)
    signif$`% lower`      <-  round(signif$lower / signif$sum  * 100, 2)
    signif$`% non-signif` <-  round(signif$non_signif / signif$sum  * 100, 2)
    signif <- signif[, toKeepInOrder]
    summary_signif <- rbind(summary_signif, signif)   
}

message("past for seq_along(all_topTables)\n")
summary_signif <- summary_signif[order(summary_signif$`% non-signif`), ]
head(summary_signif , 2)
head(signif_per_tissue, 2)
data.table::fwrite(file = "../data/summary_significant.csv", summary_signif)
data.table::fwrite(file = "../data/summary_significant_per_tissue.csv", signif_per_tissue)
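
As a compact illustration of the per-tissue summary built in the loop above, the sketch below wraps the counting step in a function and binds one row per tissue. The helper name `summarise_tissue()`, the tissue names and the labels are hypothetical; the column layout follows `summary_signif`.

```r
# Hypothetical helper: one summary row per tissue, always with all three counts
summarise_tissue <- function(abundance, tissue) {
    categories <- c("higher", "lower", "non_signif")
    counts <- vapply(categories, function(level) sum(abundance == level), integer(1))
    total  <- sum(counts)
    data.frame(tissue         = tissue,
               non_signif     = counts[["non_signif"]],
               lower          = counts[["lower"]],
               higher         = counts[["higher"]],
               `% lower`      = round(counts[["lower"]]      / total * 100, 2),
               `% higher`     = round(counts[["higher"]]     / total * 100, 2),
               `% non-signif` = round(counts[["non_signif"]] / total * 100, 2),
               check.names    = FALSE)
}

# Made-up abundance labels for two placeholder tissues
toy_labels <- list(tissue_a = c("higher", "non_signif", "non_signif", "non_signif"),
                   tissue_b = c("non_signif", "lower", "lower", "higher"))

rows        <- Map(summarise_tissue, toy_labels, names(toy_labels))
toy_summary <- do.call(rbind, rows)
rownames(toy_summary) <- NULL
toy_summary
```

Counting each category explicitly means a tissue with no `lower` or no `higher` events still yields a complete row with a zero in that column.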

# Defining higher in males or females based on the limma design matrix
As we have used 1 for encoding the males and 2 for the females, our *reference level* for the contrast in the expression between males and females is 1, the males.


From the `limma` documentation:
>The level which is chosen for the *reference level* is the level which is contrasted against. By default, this is simply the first level alphabetically. We can specify that we want group 2 to be the reference level by either using the relevel function [..]

By convention, genes with a positive log fold change are higher in females, and genes with a negative log fold change are higher in males.

In [None]:
summary_signif$`higher in males`   <- summary_signif$lower
summary_signif$`higher in females` <- summary_signif$higher
head(summary_signif[summary_signif$tissue == "Fake", ])

# Preparing the summary table for plotting

We will need to aggregate the number of genes in one column in order to be able to plot, and also convert the `tissue` column to a factor. We will use the `reshape` R package to *melt* the dataframe from a wide to a long version:
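
To make the wide-to-long step concrete, here is a minimal base-R sketch on made-up counts. It is written without `reshape` so the mechanics are visible; the tissue names and numbers are placeholders.

```r
# Made-up wide table: one row per tissue, one column per sex bias
wide <- data.frame(tissue              = c("tissue_a", "tissue_b"),
                   `higher in males`   = c(3L, 0L),
                   `higher in females` = c(5L, 2L),
                   check.names = FALSE)

# Stack the two count columns into a single column, keeping track of their origin
measure_cols <- setdiff(colnames(wide), "tissue")
long <- data.frame(
    Tissue            = factor(rep(wide$tissue, times = length(measure_cols))),
    `Sex Bias`        = factor(rep(measure_cols, each = nrow(wide)), levels = measure_cols),
    `Number of Genes` = unlist(wide[measure_cols], use.names = FALSE),
    check.names = FALSE)
long
```

Each tissue now appears once per sex-bias category, which is the layout `ggplot2` needs to draw dodged bars filled by `Sex Bias`.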

In [None]:
toPlot <- summary_signif[, c( "tissue", "higher in males", "higher in females")]
toPlot <- reshape::melt(toPlot, id=c("tissue"))
toPlot$tissue <- as.factor(toPlot$tissue)
colnames(toPlot) <- c("Tissue", "Sex Bias", "Number of Genes")
message("new structure to plot structure \n")
head(toPlot[toPlot$Tissue == "Fake", ])

In [None]:

pdf ("../pdf/summary_per_tissue_diff_spliced.pdf")
options(repr.plot.width=15.5, repr.plot.height=20)

ggplot(toPlot, aes(x = Tissue, y = `Number of Genes`, fill = `Sex Bias`)) + 
  geom_bar(stat="identity", position = "dodge") + 
  scale_fill_manual (values = c( "higher in males" = "#4A637B" , "higher in females" = "#f35f71")) + 
  
  theme(text              = element_text(color = "#4A637B", face = "bold", family = 'Helvetica')
        ,plot.caption     = element_text(size =  12, color = "#8d99ae", face = "plain", hjust= 1.05) 
        ,plot.title       = element_text(size =  18, color = "#2b2d42", face = "bold", hjust= 0.5)
        ,axis.text.y      = element_text(angle =  0, size = 10, color = "#8d99ae", face = "bold", hjust=1.1)
        ,axis.text.x      = element_text(angle = 70, size = 12, color = "#8d99ae", face = "bold", hjust=1.1)
        ,axis.title.x     = element_blank()
        ,axis.ticks.x     = element_blank()
        ,axis.ticks.y     = element_blank()
        ,plot.margin      = unit(c(1,1,1,1),"cm")
        ,panel.background = element_blank()
        ,legend.position  = "right") +
  

  geom_text(aes(y = `Number of Genes` + 15, 
                label = `Number of Genes`),
                size = 3,
                color     = "#4A637B",
                position  =  position_dodge(width = 1),
                family    = 'Helvetica') +
  
  labs(title   = "Number of genes with higher expression in each sex per tissue\n",
       caption = "\nsource: 'The impact of sex on alternative splicing'\n doi: https://doi.org/10.1101/490904",
       y   = "\nNumber of Differentially Expressed Genes")  + coord_flip()

dev.off()


# Mutually exclusive sex biased genes (higher expression in one or the other sex only)


The dataframe `signif_per_tissue` contains all the information for the genes that were significantly higher in either of the two sexes. Let's examine how many mutually exclusive genes were found across all examined tissues. Ensembl encodes the chromosomal position in the `Chromosome` field, so we will create the required variables to retrieve only the chromosome information for producing summary statistics.
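
A small sketch of the chromosome extraction, assuming position strings of the form `chromosome:start-end`. The example strings are hypothetical and the exact format in the annotated tables may differ.

```r
# Hypothetical position strings in a "chromosome:start-end" style
positions <- c("X:100-200", "17:5000-5100", "MT:10-90")

# Drop everything from the first colon onwards, keeping only the chromosome name
chromosome <- gsub("\\:.*", "", positions)
chromosome
```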

In [None]:
message("signif_per_tissue structure\n")
dput(colnames(signif_per_tissue))

In [None]:
signif_per_tissue$Chromosomal_Position <- signif_per_tissue$Chromosome
signif_per_tissue$Chromosome <- gsub("\\:.*","", signif_per_tissue$Chromosome)
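signif_per_tissue$higher_in  <- NA_character_
signif_per_tissue$higher_in[(signif_per_tissue$abundance == "lower" )] <- "males"
signif_per_tissue$higher_in[(signif_per_tissue$abundance == "higher" )] <- "females"
# ASSUMPTION: `GENE_ID` is not defined earlier in this notebook; "gene_id" is assumed here
# because the `signif_per_tissue` template declares `initial_gene_id` and `gene_id` columns
GENE_ID <- "gene_id"
# `adj.P.Val` is listed once; selecting it twice would create a duplicated `adj.P.Val.1` column
toKeepInOrder <- c( paste0("initial_", GENE_ID), "GeneSymbol", "logFC",  "adj.P.Val", "abundance", "higher_in",  "tissue", "Chromosome", 
GENE_ID, "abs_logFC", "FoldChange_dummy", 
"Class", "Strand","Chromosomal_Position", 
 "AveExpr", "t", "P.Value", "B")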
signif_per_tissue <- signif_per_tissue[, toKeepInOrder]
head(signif_per_tissue, 4)

# Examine mutually exclusive genes upregulated in each sex

In [None]:
female_biased <- unique(signif_per_tissue[[paste0("initial_", GENE_ID)]] [ signif_per_tissue$higher_in == "females" ] )
male_biased   <- unique(signif_per_tissue[[paste0("initial_", GENE_ID)]] [ signif_per_tissue$higher_in == "males"  ] )

length(male_biased)
length(female_biased)

In [None]:
## Present in both

length((intersect(male_biased, female_biased)))
length((intersect(female_biased, male_biased)))

intersect <- (intersect(male_biased, female_biased))

In [None]:
## Only in males
length(male_biased[! (male_biased %in% intersect)])

## Only females
length(female_biased[! (female_biased %in% intersect)])

In [None]:
perc_only_male <-  length(male_biased[! (male_biased %in% intersect)]) / length(male_biased) * 100
perc_only_female <-  length(female_biased[! (female_biased %in% intersect)]) / length(female_biased) * 100

head( signif_per_tissue[ signif_per_tissue[[paste0("initial_", GENE_ID)]] %in% male_biased[! (male_biased %in% intersect)],  ] , 4 )


message(round(perc_only_male, 2), " % of the genes higher in males were found to be significantly different in males only")
message(round(perc_only_female,2), " % of the genes higher in females were found to be significantly different in females only")
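
The set logic used in the cells above can be sketched on made-up identifiers with base R's `intersect()` and `setdiff()`. The gene labels `g1` to `g4` are placeholders, and the result is stored as `shared` rather than `intersect` so that the function name is not reused for a variable.

```r
# Made-up identifiers for genes higher in each sex
male_toy   <- c("g1", "g2", "g3")
female_toy <- c("g3", "g4")

shared      <- intersect(male_toy, female_toy)   # found in both sexes
only_male   <- setdiff(male_toy, female_toy)     # higher in males only
only_female <- setdiff(female_toy, male_toy)     # higher in females only

# Percentage of each sex-biased set that is exclusive to that sex
perc_only_male_toy   <- length(only_male)   / length(male_toy)   * 100
perc_only_female_toy <- length(only_female) / length(female_toy) * 100
round(c(males = perc_only_male_toy, females = perc_only_female_toy), 2)
```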

## Significantly higher only in males

In [None]:
dim(signif_per_tissue[ signif_per_tissue[[paste0("initial_", GENE_ID)]] %in% male_biased[! (male_biased %in% intersect)],  ])

only_male_genes <- signif_per_tissue[ signif_per_tissue[[paste0("initial_", GENE_ID)]] %in% (male_biased[! (male_biased %in% intersect)]) ,  ]

head(only_male_genes[ order(only_male_genes[[paste0("initial_", GENE_ID)]] ), ], 5)

In [None]:
# See 8.1.1 enquo() and !! - Quote and unquote arguments in https://tidyeval.tidyverse.org/dplyr.html

only_male_genes %>% 
    count( !!rlang::sym(GENE_ID), GeneSymbol, Class, sort = TRUE) %>%
    head(20)

## Significantly higher only in females

In [None]:
only_female_genes <- signif_per_tissue[ signif_per_tissue[[paste0("initial_", GENE_ID)]] %in% (female_biased[! (female_biased %in% intersect)]) ,  ]

head(only_female_genes[ order(only_female_genes[[paste0("initial_", GENE_ID)]] ), ], 10)

In [None]:
only_female_genes %>% 
    count( !!rlang::sym(GENE_ID), GeneSymbol, Class, sort = TRUE) %>%
    head(20)

# Examine number of differentially expressed genes per chromosome per sex

In [None]:
signif_per_tissue$Chromosome <- as.factor(signif_per_tissue$Chromosome)
signif_per_tissue$higher_in <- as.factor(signif_per_tissue$higher_in)

signif_per_tissue %>% 
    group_by(Chromosome,higher_in) %>%  
    count()  -> signif_per_chrom_per_sex

In [None]:
signif_per_chrom_per_sex

## Metadata

For replicability and reproducibility purposes, we also print the following metadata:

1. Checksums of **'artefacts'**, files generated during the analysis and stored in the folder directory **`data`**
2. List of environment metadata, dependencies, versions of libraries using `utils::sessionInfo()` and [`devtools::session_info()`](https://devtools.r-lib.org/reference/session_info.html)

In [None]:
notebook_id   = "summary_per_tissue_diff_expressed"

message("Generating sha256 checksums of the artefacts in the `../data/` directory .. ")
system(paste0("cd ../data/ && sha256sum **/*csv > ../metadata/", notebook_id, "_sha256sums.txt"), intern = TRUE)
system(paste0("cd ../data/ && sha256sum *csv >> ../metadata/", notebook_id, "_sha256sums.txt"), intern = TRUE)

message("Done!\n")

data.table::fread(paste0("../metadata/", notebook_id, "_sha256sums.txt"), header = FALSE, col.names = c("sha256sum", "file"))

### 2. Libraries metadata

In [None]:
dev_session_info   <- devtools::session_info()
utils_session_info <- utils::sessionInfo()

message("Saving `devtools::session_info()` objects in ../metadata/devtools_session_info.rds  ..")
saveRDS(dev_session_info, file = paste0("../metadata/", notebook_id, "_devtools_session_info.rds"))
message("Done!\n")

message("Saving `utils::sessionInfo()` objects in ../metadata/utils_session_info.rds  ..")
saveRDS(utils_session_info, file = paste0("../metadata/", notebook_id ,"_utils_info.rds"))
message("Done!\n")

dev_session_info$platform
dev_session_info$packages[dev_session_info$packages$attached==TRUE, ]

# Calculating the sex-biased splicing index
The normalized sex-biased splicing index is defined as the number of statistically significant splicing events per 1000 exons in the chromosome.
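
As a worked example of this definition, the sketch below computes the index for two chromosomes. The exon counts for chromosomes 1 and X are the ones printed further down in this notebook; the event counts `n` are made up for illustration.

```r
toy_index <- data.frame(Chromosome = c("1", "X"),
                        n          = c(40L, 30L),        # hypothetical numbers of significant events
                        exons      = c(69381L, 22471L))  # exon counts per chromosome

# Significant splicing events per 1000 exons
toy_index$Index <- 1000 * toy_index$n / toy_index$exons
toy_index
```

Normalising by exon count makes chromosomes of different sizes comparable: in this toy case the X chromosome has fewer events than chromosome 1 but a higher index.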

In [None]:
dim(signif_per_chrom_per_sex)

The number of unique exons per chromosome was counted from the Ensembl GTF file with the following Python script:

```python
import csv
import gzip
import re
from collections import defaultdict

fname = 'Homo_sapiens.GRCh38.100.chr_patch_hapl_scaff.gtf.gz'

# Matches the Ensembl exon identifier in the GTF attribute column
exon_pattern = re.compile(r'exon_id "(ENSE\d+)"')
chrom2exons = defaultdict(set)

with gzip.open(fname, 'rt') as gtf:
    reader = csv.reader(gtf, delimiter='\t', quotechar='"')
    for row in reader:
        # Skip header/comment lines
        if row[0].startswith('#'):
            continue
        chrom = row[0]
        # Column 9 holds the semicolon-separated attributes
        for field in row[8].split(";"):
            match = exon_pattern.match(field.strip())
            if match:
                chrom2exons[chrom].add(match.group(1))

# Write one line per chromosome: name <TAB> number of unique exons
with open('chrom2exons.txt', 'wt') as out:
    for chrom, exons in chrom2exons.items():
        print("chr{}: n={}".format(chrom, len(exons)))
        out.write("{}\t{}\n".format(chrom, len(exons)))
```

1	69381
2	55599
3	46452
4	29749
5	34789
6	33817
7	35973
X	22471
8	28489
9	26460
11	43212
10	26514
12	42925
13	13193
14	25994
15	28720
16	36285
17	45142
18	13360
20	16704
19	44166
Y	2908
22	16411
21	8830
MT	37

In [None]:
signif_per_chrom_per_sex

In [None]:
only_female_genes %>% 
    count( !!rlang::sym(GENE_ID), GeneSymbol, Class, sort = TRUE) %>%
    head(20)

In [None]:
signif_per_tissue %>% 
     group_by(Chromosome) %>%  
    count()  -> signif_per_chrom

In [None]:
signif_per_chrom

In [None]:
chrom2exon_filename <- '../assets/canon_chrom2exons.txt'
if (! file.exists(chrom2exon_filename)) {
    # Stop early with a clear message instead of failing inside read.csv()
    stop("Could not find canon_chrom2exons.txt file")
}
c2e_df <- read.csv(chrom2exon_filename, sep='\t', header=FALSE)
colnames(c2e_df) <- c("Chromosome","exons")
head(c2e_df) # 25 chromosomes including MT

In [None]:
df2 <- merge(signif_per_chrom, c2e_df, by="Chromosome")
head(df2)

In [None]:
# calculate splicing index
library(tidyverse)

In [None]:
df2 %>% 
  mutate(Index = 1000 * n/exons) -> df3

In [None]:
# Remove the Y and MT chromosomes by name rather than by row position
df4 <- df3[! (df3$Chromosome %in% c("Y", "MT")), ]

In [None]:
res_sorted <- df4[order(df4$Index, decreasing=TRUE),]
res_sorted

In [None]:
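# The column is called `Chromosome`; the levels follow the sorted order for plotting
res_sorted$Chromosome <- factor(res_sorted$Chromosome, levels = as.character(res_sorted$Chromosome))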
res_sorted

In [None]:
# set the colors
npgBlue<- rgb(60/256,84/256,136/256,1)
npgRed <- rgb(220/256,0,0,0.5)
npgGreen <- rgb(0,160/256,135/256,1)
npgBrown <- rgb(126/256,97/256,72/256,1)

In [None]:
# make the plot 
figure2b <- ggplot(res_sorted, aes(x = Chromosome, y = Index, size = n)) +
  geom_point(color=npgBlue) +
  theme_bw() +
  theme(axis.text.x = element_text(size=14, angle = 270, hjust = 0.0, vjust = 0.5),
	axis.text.y = element_text(size=16),
	axis.title.x = element_blank(),
	axis.title.y = element_text(face="plain", colour="black",
                                    size=18),
	legend.title=element_blank(),
	legend.text = element_text(face="plain", colour="black",
                                   size=14)) +
  scale_fill_viridis_c() +
  ylab(paste("Sex-biased splicing index ")) +
  xlab("Chromosomes") +
  guides(size = guide_legend(title = "Number of ASE"))
figure2b