# Breast-96-Samples.R as a Notebook 

rMATS 3.2.5 was run on controlled-access RNA-seq files retrieved from experiments stored in the Sequence Read Archive, with access managed by dbGaP. The data were generated under the Genotype-Tissue Expression (GTEx) project.

## rMATS RNASeq-MATS.py produces 10 different output types, which are assembled by type into junction ID by sample ID matrices

### Alternative splicing event types are: (se, a3ss, a5ss, mxe, ri)

 This is input as ARGV1 into variable 'astype'

  * Skipped Exon events (se),
  * Alternative 3' splice site (a3ss),
  * Alternative 5' splice site (a5ss),
  * Mutually exclusive exon (mxe),
  * and retained intron (ri)

### There are two different kinds of junction counts

  * jc = junction counts - reads that cross the junction
  * jcec = junction counts plus reads on the target (such as an included exon)

### And the count type -- there are 5 types

  * inclusion levels (percent spliced in)
  * included junction counts (ijc)
  * skipped junction counts (sjc)
  * inclusion length (inclen)
  * skipped length (skiplen)
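
As a hedged illustration of how the five count types relate, the sketch below computes an inclusion level from the other four quantities, assuming the length-normalized definition commonly given for rMATS (included reads per unit of inclusion length, divided by the sum of the included and skipped length-normalized reads). The function name `inclusion_level` and the numbers are hypothetical and are not taken from this analysis.

```r
# Sketch: percent spliced in from ijc, sjc, inclen and skiplen.
# Assumes the length-normalized definition; names and values are illustrative only.
inclusion_level <- function(ijc, sjc, inclen, skiplen) {
  inc.norm  <- ijc / inclen    # included reads per unit of effective length
  skip.norm <- sjc / skiplen   # skipped reads per unit of effective length
  inc.norm / (inc.norm + skip.norm)
}

# One hypothetical junction: 30 included reads, 10 skipped reads
toy.psi <- inclusion_level(ijc = 30, sjc = 10, inclen = 200, skiplen = 100)
toy.psi
```

A junction with no reads in either form gives `NaN` under this definition, which is one reason the raw ijc and sjc counts are kept alongside the inclusion levels.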

### function: fit_iso_tissue 

fit_iso_tissue expects the following input:

  * the tissue of interest (SMTSD) 
  * an ordered_merged_rmats -- which will be ordered to fit the count matrix
  * count matrix (inc or ijc & sjc merged)
  * splice type (a3ss, a5ss, mxe, ri or se)
  * junction_count type (jc or jcec)
  * count type (inc or the merged ijc,sjc)
  
### reordering to match annotations between count matrix and annotation matrix

A common problem is matching the rows of an annotation matrix to the columns of a count matrix.
`match` is the function that gives the reordering index required to accomplish this.
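
A minimal sketch of that reordering, using made-up run names and a toy count matrix (none of these values come from the real data): `match` returns, for each count-matrix column, the row of the annotation table that describes it.

```r
# Toy count matrix: junctions in rows, runs in columns (hypothetical names)
toy.counts <- matrix(1:6, nrow = 2,
                     dimnames = list(c("j1", "j2"), c("SRR3", "SRR1", "SRR2")))

# Toy annotation table, in a different run order
toy.annot <- data.frame(Run = c("SRR1", "SRR2", "SRR3"),
                        sex = c("male", "female", "male"),
                        stringsAsFactors = FALSE)

# For each column of the count matrix, find the matching annotation row
idx <- match(colnames(toy.counts), toy.annot$Run)
toy.annot.ordered <- toy.annot[idx, ]

# The annotation rows now line up with the count columns
stopifnot(identical(toy.annot.ordered$Run, colnames(toy.counts)))
```

Any run that is missing from the annotation table shows up as `NA` in `idx`, so checking `anyNA(idx)` before subsetting is a cheap safeguard.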


## **NOTE**:

We assume that you have cloned the analysis repository and have `cd` into the parent directory. Before starting with the analysis make sure you have first completed the dependencies set up by following the instructions described in the **`dependencies/README.md`** document. All paths defined in this Notebook are relative to the parent directory (repository). Please close this Notebook and start again by following the above guidelines if you have not completed the aforementioned steps.

## rMATS-final-merged
The rmats-nf Nextflow workflow was executed and the results were released here:

## Loading dependencies

In [None]:
library(limma)
library(multtest)
library(Biobase)
library(edgeR)
library(tibble)
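# install R.utils only if it is not already available
if (!requireNamespace('R.utils', quietly = TRUE)) install.packages('R.utils')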
library(R.utils)

## Modeling

This analysis uses edgeR.  From the documentation, it is important to note that normalization takes the form of correction factors that enter into the statistical model. Such correction factors are usually computed internally by edgeR functions, but it is also possible for a user to supply them. The correction factors may take the form of scaling factors for the library sizes, such as computed by calcNormFactors, which are then used to compute the effective library sizes. 

Alternatively, gene-specific correction factors can be entered into the glm functions of edgeR as offsets. In the latter case, the offset matrix will be assumed to account for all normalization issues, including sequencing depth and RNA composition.

Note that normalization in edgeR is model-based, and the original read counts are not themselves transformed. This means that users should not transform the read counts in any way before inputting them to edgeR. For example, users should not enter RPKM or FPKM values to edgeR in place of read counts. Such quantities will prevent edgeR from correctly estimating the mean-variance relationship in the data, which is crucial to the statistical strategies underlying edgeR. Similarly, users should not add artificial values to the counts before inputting them to edgeR.

edgeR is not designed to work with estimated expression levels, for example as might be output by Cufflinks. 
edgeR can work with expected counts as output by RSEM, but raw counts are still preferred. 

As instructed by the software, we are using the raw counts as provided by rMATS.  The raw counts we are using in the model are `ijc` and `sjc`, the sample specific raw read counts as they align to the junctions of the `included exon (ijc)` and the junctions of the `excluded or skipped exon (sjc)` respectively.
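
To make the point about model-based normalization concrete, here is a small sketch on simulated counts (the matrix is random and purely illustrative). It shows that `calcNormFactors` stores scaling factors next to the library sizes and leaves the count matrix itself untouched; the effective library size is the product of the two.

```r
library(edgeR)

# Simulated raw counts: 50 junctions by 4 samples (illustrative only)
set.seed(1)
toy.counts <- matrix(rpois(200, lambda = 20), nrow = 50,
                     dimnames = list(paste0("junction", 1:50),
                                     paste0("sample", 1:4)))

toy <- DGEList(counts = toy.counts)
toy <- calcNormFactors(toy)          # TMM scaling factors by default

# Effective library size = library size x normalization factor
effective.lib.size <- toy$samples$lib.size * toy$samples$norm.factors
effective.lib.size

# The raw counts are unchanged by normalization
stopifnot(all(toy$counts == toy.counts))
```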



In [None]:
ijc.iso.counts.mem <- data.table::fread("../data/rmats_final.se.jc.ijc.txt.gz") 
sjc.iso.counts.mem <- data.table::fread("../data/rmats_final.se.jc.sjc.txt.gz") 
inc.iso.counts.mem <- data.table::fread("../data/rmats_final.se.jc.inc.txt.gz")

meta.data<-read.csv('../data/SraRunTable.noCram.noExome.noWGS.totalRNA.txt',header=TRUE, stringsAsFactors=FALSE)
head(ijc.iso.counts.mem)
head(sjc.iso.counts.mem)
head(inc.iso.counts.mem)
head(meta.data)

## Synchronize metadata samples with ijc, sjc and inc samples

Keep only the runs that are in the ijc count list (assuming ijc and sjc are the same).  As well, name the rows with the junction id column and then make the matrix just about the counts.

In [None]:
#dimensions before we make the changes.
dim(ijc.iso.counts.mem)
dim(sjc.iso.counts.mem)
dim(inc.iso.counts.mem)
dim(meta.data)

# the sample names are in the columns of both the ijc and the sjc matrices (these matrices have identical column order)
keep.meta.data <- meta.data$Run %in% colnames(ijc.iso.counts.mem)
table(keep.meta.data)
reduced.meta.data <- meta.data[keep.meta.data==TRUE,]

## Construct the ijc, sjc and inc as data matrices
The Junction ID is encoded in the first column of the matrix.  We need to both preserve it (and it is unique) as well as remove it so we may do our calculations.

In [None]:
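# fread() returns a data.table, which does not keep row names, so convert to a
# data.frame before preserving the junction id as rowname
ijc.iso.counts.mem <- as.data.frame(ijc.iso.counts.mem)
sjc.iso.counts.mem <- as.data.frame(sjc.iso.counts.mem)
inc.iso.counts.mem <- as.data.frame(inc.iso.counts.mem)
rownames(ijc.iso.counts.mem) <- ijc.iso.counts.mem$ID
rownames(sjc.iso.counts.mem) <- sjc.iso.counts.mem$ID
rownames(inc.iso.counts.mem) <- inc.iso.counts.mem$ID

# and remove the id column (the first column) to have a data matrix
ijc.iso.counts.mem  <- ijc.iso.counts.mem[,-1]
sjc.iso.counts.mem  <- sjc.iso.counts.mem[,-1]
inc.iso.counts.mem  <- inc.iso.counts.mem[,-1]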
dim(ijc.iso.counts.mem)
dim(sjc.iso.counts.mem)
dim(inc.iso.counts.mem)
dim(reduced.meta.data)

## Order ijc and sjc columns in the same order as the metadata Run order

Using the tibble library, we can reorder the columns by column name.

In [None]:
meta.data.run.names  <- as.character(reduced.meta.data$Run)
ijc.iso.counts.mem2  <- as_tibble(ijc.iso.counts.mem)
sjc.iso.counts.mem2  <- as_tibble(sjc.iso.counts.mem)
inc.iso.counts.mem2  <- as_tibble(inc.iso.counts.mem)

ijc.iso.counts.mem2  <- ijc.iso.counts.mem2[,c(meta.data.run.names)]
sjc.iso.counts.mem2  <- sjc.iso.counts.mem2[,c(meta.data.run.names)]
inc.iso.counts.mem2  <- inc.iso.counts.mem2[,c(meta.data.run.names)]

Remove samples that match '11ILO' from the ijc, sjc, inc and metadata files

In [None]:
keep.meta.data <- (!grepl('11ILO',reduced.meta.data$Sample.Name))
table(keep.meta.data)
ijc.iso.counts.mem2 <-ijc.iso.counts.mem2 [                    ,keep.meta.data==TRUE]
sjc.iso.counts.mem2 <-sjc.iso.counts.mem2 [                    ,keep.meta.data==TRUE]
inc.iso.counts.mem2 <-inc.iso.counts.mem2 [                    ,keep.meta.data==TRUE]
reduced.meta.data   <-reduced.meta.data   [keep.meta.data==TRUE,                    ]
dim(ijc.iso.counts.mem2)
dim(sjc.iso.counts.mem2)
dim(inc.iso.counts.mem2)

### and focus on a single tissue

In [None]:
tissue <- reduced.meta.data$body_site %in% 'Breast - Mammary Tissue'
table(tissue)

ijc.iso.counts.mem2 <-ijc.iso.counts.mem2 [                    ,tissue==TRUE]
sjc.iso.counts.mem2 <-sjc.iso.counts.mem2 [                    ,tissue==TRUE]
inc.iso.counts.mem2 <-inc.iso.counts.mem2 [                    ,tissue==TRUE]
reduced.meta.data   <-reduced.meta.data   [tissue==TRUE,                    ]
dim(ijc.iso.counts.mem2)
dim(sjc.iso.counts.mem2)
dim(inc.iso.counts.mem2)

## The Generalized linear model

For each sample, we have ijc and sjc count data and the donor's reported sex.
Our question concerns sex-biased differences.
For each junction we have 8,000 samples with these count data. One way to think about the model is that the junctions themselves are the covariates in a global transcriptomic model.
In our example, we have 42,611 non-zero junction IDs for the skipped exon event in Breast - Mammary Tissue, across 191 individuals.
These are healthy individuals, and we are studying the impact of sex on the occurrence or non-occurrence of specific alternative splicing events. With one ijc and one sjc count per junction, this gives a matrix of 191 samples by 85,222 measurements (2 x 42,611). The IJCs and SJCs are not independent. There is also another layer of annotation, the gene: many of these junctions belong to the same gene.

The presence or absence of specific events within a specific individual will help us to see these differences.

191 x 85,222

                gene1    ...  gene5000  gene1         gene5000
Individual  Sex IJC1 IJC2 ... IJC42,611 SJC1 SJC2 ... SJC42,611

In [None]:
ijc <- as.data.frame(ijc.iso.counts.mem2)
sjc <- as.data.frame(sjc.iso.counts.mem2)
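# as_tibble() does not keep row names, so take them from the original matrices
ijcrownames <- paste0(rownames(ijc.iso.counts.mem),'-ijc')
sjcrownames <- paste0(rownames(sjc.iso.counts.mem),'-sjc')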
rownames(ijc) <- ijcrownames
rownames(sjc) <- sjcrownames
ijc[1:5,1:5]
sjc[1:5,1:5]
dim(ijc)
dim(sjc)
ijc <- data.matrix(ijc)
sjc <- data.matrix(sjc)
sex<-factor(reduced.meta.data$sex,levels=c('male','female'))
table(sex)


### Differential expression analysis

Differential expression (DE) analysis was performed using voom (Law et al., 2014) to transform rMATS counts of aligned RNA-seq reads in exon skipping (SE) events, with associated precision weights, followed by linear modeling and an empirical Bayes procedure using limma. These counts are obtained from alignment of the RNA-seq reads to the junctions used when the exon is included, reported as included junction counts (ijc), and to the junction used when the exon is excluded, reported as skipped junction counts (sjc). In each tissue, the following linear regression model was used to detect sexually dimorphic gene expression:

     y = B0 + B1 ijc + B2 sjc + B3 sex + B4 ijc * sjc * sex + epsilon
     
where y is the isoform expression; ijc is the number of reads that align to an included exon's junctions (there are two per included exon); sjc is the number of reads that align to the junction formed when that exon is skipped (there is one per exon skipping event); and sex denotes the reported sex of the subject.

In [None]:

y <- DGEList(counts=ijc, group = sex)
y <- calcNormFactors(y, method="upperquartile")

Gender <- substring(sex,1,1)
plotMDS(y, labels=Gender, top=50, col=ifelse(Gender=="m","blue","red"), 
        gene.selection="common")

In [None]:
design <- model.matrix( ~ sex + t(sjc))
y_voom <- voom (y, design=design, plot = TRUE)

In [None]:
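    # `sex` is the factor defined above; make female the reference level so the
    # fitted coefficient is named 'tissue_sexmale'
    tissue_sex     <- relevel(sex, ref='female')
    tissue_design  <- model.matrix(~tissue_sex)
    # pass the count matrix held in the DGEList `y`, not the DGEList itself
    y_tissue       <- DGEList(counts=y$counts, group=tissue_sex)
    y_tissue       <- calcNormFactors(y_tissue)
    y_tissue_voom  <- voom(y_tissue, tissue_design, plot=TRUE)
    fit_tissue     <- lmFit(y_tissue_voom, tissue_design)
    fit_tissue     <- eBayes(fit_tissue)
    results_tissue <- topTable (fit_tissue, coef='tissue_sexmale', number=nrow(y_tissue))
    head(results_tissue)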



In [None]:

filename      = paste0('../data/BreastMammaryTissue', '.sex.isoform.se.txt')
genesFilename = paste0('../data/BreastMammaryTissue', '.sex.isoform.all_genes.txt')

res.robust <- results_tissue$adj.P.Val <= 0.05 & abs(results_tissue$logFC) > 1.5
table(res.robust)

res.refined <- results_tissue[res.robust==TRUE,]

#write.table(res.refined,          file=filename,      row.names = T, col.names = T, quote = F)

r <- strsplit(rownames(res.refined),'-')
head(r)
length(r)

#write.table(rownames(res.refined),file=genesFilename, row.names = T, col.names = T, quote = F)
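
The filtering step above can be tried on a toy results table. The sketch below uses invented values with the same column names that `topTable` returns (`logFC` and `adj.P.Val`) and applies the same two thresholds; the junction names are hypothetical.

```r
# Toy stand-in for a topTable() result (values are invented)
toy.results <- data.frame(logFC     = c(2.1, -1.8, 0.4, 1.9),
                          adj.P.Val = c(0.01, 0.03, 0.001, 0.20),
                          row.names = paste0("junction", 1:4))

# Keep rows that pass both the adjusted p-value and the fold-change threshold
toy.robust  <- toy.results$adj.P.Val <= 0.05 & abs(toy.results$logFC) > 1.5
toy.refined <- toy.results[toy.robust, ]

table(toy.robust)
rownames(toy.refined)
```

Using `abs()` keeps both directions of change, so junctions biased toward either sex survive the filter.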



## Modeling 



In [None]:
groups=c(paste0(reduced.meta.data$sex,'-sjc'),paste0(reduced.meta.data$sex,'-ijc'))
table(groups)

Separate the count matrices into male and female samples.

In [None]:
ijc.counts.male   <- ijc.iso.counts.mem2 [,reduced.meta.data$sex=='male']
ijc.counts.female <- ijc.iso.counts.mem2 [,reduced.meta.data$sex=='female']

sjc.counts.male   <- sjc.iso.counts.mem2 [,reduced.meta.data$sex=='male']
sjc.counts.female <- sjc.iso.counts.mem2 [,reduced.meta.data$sex=='female']

inc.counts.male   <- inc.iso.counts.mem2 [,reduced.meta.data$sex=='male']
inc.counts.female <- inc.iso.counts.mem2 [,reduced.meta.data$sex=='female']


Make the matrix by combining the male and female counts. This rearranges the count columns as included counts (male, then female) followed by skipped counts (male, then female).
The row names remain the junction IDs, which we will use later to resolve the genes from which these isoforms come.

In [None]:
ijc.counts.mat           <- cbind(ijc.counts.male, ijc.counts.female)
sjc.counts.mat           <- cbind(sjc.counts.male, sjc.counts.female)

counts.mat               <- cbind(ijc.counts.male, ijc.counts.female, sjc.counts.male, sjc.counts.female)

rownames(sjc.counts.mat) <- rownames(sjc.iso.counts.mem)
rownames(ijc.counts.mat) <- rownames(ijc.iso.counts.mem)
rownames(counts.mat)     <- rownames(ijc.iso.counts.mem)

In [None]:
dim(inc.counts.male)
dim(inc.counts.female)


In [None]:
inc.counts.mat <- data.matrix(inc.iso.counts.mem2)

obj           <- reduced.meta.data
sex           <- factor(reduced.meta.data$sex)
table(sex)
tissue_counts <- inc.counts.mat
tissue_name   <- 'Breast - Mammary Tissue'
head(tissue_counts)

In [None]:
#fit_tissue <- function (tissue_name, tissue, tissue_counts, sex) {
#    tissue_true    <- pData(obj)$SMTSD == tissue
#    tissue_obj     <- obj[,tissue_true ==TRUE]
    tissue_sex     <- sex
    tissue_design  <- model.matrix(~tissue_sex)
    y_tissue       <- DGEList(counts=tissue_counts, group=tissue_sex)
    y_tissue       <- calcNormFactors(y_tissue)
    y_tissue_voom  <- voom(y_tissue, tissue_design, plot=TRUE)
    fit_tissue     <- lmFit(y_tissue_voom, tissue_design)
    fit_tissue     <- eBayes(fit_tissue)
    results_tissue <- topTable (fit_tissue, coef='tissue_sexmale', number=nrow(y_tissue))
    head(results_tissue)
#    filename = paste(paste("../data",gsub(" ","",tissue_name), sep="/"),"DGE.csv", sep="_")    
#    write.table(results_tissue, filename, sep=',', quote=FALSE)
#    return (results_tissue)
#}

## isoform
Create an isoform indicator to be used in the modeling: 1 for columns holding included junction counts and 0 for columns holding skipped junction counts. Every sample is represented once in each form, so the sum of the indicator is equal to the total number of samples. This assumes that there are at least two isoforms: one that includes reads on a specific junction, and one that excludes that particular junction. This is an oversimplification of what an isoform is, but a useful technique for modeling differential expression.

In [None]:
isoform<-c(rep(1,ncol(ijc.counts.male)+ncol(ijc.counts.female)),rep(0,ncol(sjc.counts.male)+ncol(sjc.counts.female)))

## sex
Next, encode the sex values for the same columns: 1 for male and 0 for female. This will also be used in the differential analysis model.

In [None]:
sex<-c(rep(1,ncol(ijc.counts.male)),rep(0,ncol(ijc.counts.female)),rep(1,ncol(sjc.counts.male)),rep(0,ncol(sjc.counts.female)))

## block - accounting for duplicate correlation

The ijc and sjc counts come from the same sample and are two tightly correlated measures, and we want to account for that. To do this, we create a counts matrix that keeps the inclusion junction counts (male, then female) followed by the skipped junction counts (male, then female). The `block` vector labels the columns of this matrix for the duplicateCorrelation function: 1 for the ijc columns and 2 for the sjc columns, matching the column layout described here.

In [None]:
block<-c(rep(1,c(ncol(ijc.counts.male)+ncol(ijc.counts.female))), rep(2, c(ncol(sjc.counts.male)+ncol(sjc.counts.female))))

In [None]:
block
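
To see how the `isoform`, `sex` and `block` vectors line up with the columns of the combined counts matrix, the sketch below rebuilds them for hypothetical sample sizes (2 male and 3 female samples, chosen only for illustration) and prints them side by side.

```r
# Hypothetical toy sizes, not the real sample counts
n.male   <- 2
n.female <- 3

# Same construction as above: ijc columns first (male, female), then sjc columns
toy.isoform <- c(rep(1, n.male + n.female), rep(0, n.male + n.female))
toy.sex     <- c(rep(1, n.male), rep(0, n.female),
                 rep(1, n.male), rep(0, n.female))
toy.block   <- c(rep(1, n.male + n.female), rep(2, n.male + n.female))

# One row per column of the combined counts matrix
toy.layout <- data.frame(column  = seq_along(toy.isoform),
                         isoform = toy.isoform,
                         sex     = toy.sex,
                         block   = toy.block)
toy.layout
```

All three vectors have one entry per column of the counts matrix, that is, twice the number of samples.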

# removing zero rows 
voom adds 0.5 to every count before taking log2, so a zero count never produces a log2 error, and we can safely cut off rows whose total count is not greater than 1. Let's plot the distribution first.
The aim here is to keep only those rows that have non-zero values.
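
The offset can be checked by hand on a tiny matrix. This sketch reproduces the log2 counts-per-million transformation as documented for limma's `voom` (0.5 added to each count, 1 added to each library size) on invented counts, and then applies the same kind of row filter used in this section.

```r
# Toy counts: 3 junctions by 2 samples (values are invented)
toy.counts <- matrix(c(0, 10, 3, 0, 25, 7), nrow = 3,
                     dimnames = list(paste0("junction", 1:3),
                                     c("sampleA", "sampleB")))
lib.size <- colSums(toy.counts)

# log2-CPM with the voom offsets: log2((count + 0.5) / (lib.size + 1) * 1e6)
log.cpm <- t(log2(t(toy.counts + 0.5) / (lib.size + 1) * 1e6))
log.cpm

# Zero counts stay finite thanks to the 0.5 offset
stopifnot(all(is.finite(log.cpm)))

# Row filter in the original row order: keep junctions with a total count above 1
toy.keep <- rowSums(toy.counts) > 1
toy.counts[toy.keep, , drop = FALSE]
```

The filter is computed from the rows in their original order, so it can be used directly to subset the matrix.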


In [None]:
rowDistribution <- rowSums(counts.mat)
names(rowDistribution) <- rownames(counts.mat)
index <- order(rowDistribution, decreasing=FALSE)
y <- rowDistribution[index]
y.gt.1 <- y[y > 1]
y.gt.5 <- y[y > 5]
sum(y<=1)
sum(y<=5)
max(y.gt.1)
min(y.gt.1)
max(y.gt.5)
min(y.gt.5)
median(y)
median(y.gt.1)
median(y.gt.5)
log2y.gt.1 <- log2(y.gt.1)
log2y.gt.5 <- log2(y.gt.5)

In [None]:
tail(log2y.gt.1)
tail(names(log2y.gt.1))
log2.df <- data.frame(log2y = log2y.gt.1)
colnames(log2.df) <- 'log2y'
tail(log2.df)

In [None]:
library(ggplot2)

barplot(log2y.gt.1, main="log2(y > 1) Junction Distribution",
   xlab="Junction")


In [None]:
pdf ("../pdf/log2.y.gt.1.rowcnts.pdf")
barplot(log2y.gt.1, main="log2(y > 1) Junction Distribution",
   xlab="Junction")
dev.off()

# Eliminate zero rows

As shown above -- there are over 2000 zero rows.  These are junctions without counts in either ijc or sjc.  We eliminate them.

In [None]:
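# rowDistribution is in the same row order as counts.mat (y was sorted for
# plotting), so build the filter from it to keep the rows aligned
keep.rows <- rowDistribution > 1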
table(keep.rows)

dim(counts.mat)

isoform   <-isoform   
sex       <-sex       
block     <-block     
counts.mat<-counts.mat[keep.rows==TRUE,]

dim(counts.mat)
table(block)
table(isoform)
table(sex)

Finally, we are able to perform our analysis.

In [None]:
is.matrix(counts.mat)
counts.mat.dm <- data.matrix(counts.mat)
is.matrix(counts.mat.dm)
counts.mat <- counts.mat.dm

Make an edgeR DGEList for our differential analysis.

In [None]:
y <- edgeR::DGEList(counts=counts.mat,
                    group =block)


## accounting for duplicate correlation

As described for `block` above, each sample contributes two closely related counts, ijc and sjc, held in one wide counts matrix (ijc columns first, then sjc columns). We calculate the normalization factors only once per sample so that library information is not counted twice: the ijc and sjc columns are summed per sample, normalization factors are computed on that summed matrix, and the factors are then repeated so that the ijc and the sjc column of each sample share the same factor. This matches the block layout passed to the duplicateCorrelation function.

In [None]:
counts.isoform.mat =  cbind(counts.mat[,isoform==1] + counts.mat[,isoform==0])
y.isoform          <- DGEList(counts=counts.isoform.mat)
y.isoform          <- calcNormFactors(y.isoform)
y                  <- DGEList(counts=counts.mat)
y                  <- calcNormFactors(y)

dim   (y.isoform$counts)
length(y.isoform$samples$norm.factors)
dim   (y$counts)
length(y$samples$norm.factors)

In [None]:
y$samples$norm.factors=rep(y.isoform$samples$norm.factors,2)
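
The same idea on simulated data, as a hedged sketch: normalization factors are computed once per sample from the summed counts and then repeated for the two halves of the wide matrix. The matrices and the `toy.*` names are invented for illustration.

```r
library(edgeR)

# Simulated ijc and sjc counts for 3 samples and 50 junctions (illustrative only)
set.seed(2)
n.samples <- 3
toy.ijc <- matrix(rpois(50 * n.samples, lambda = 30), nrow = 50)
toy.sjc <- matrix(rpois(50 * n.samples, lambda = 10), nrow = 50)

# One normalization factor per sample, from the per-sample totals
toy.total <- DGEList(counts = toy.ijc + toy.sjc)
toy.total <- calcNormFactors(toy.total)

# Wide matrix: ijc columns followed by sjc columns, sharing the same factors
toy.wide <- DGEList(counts = cbind(toy.ijc, toy.sjc))
toy.wide$samples$norm.factors <- rep(toy.total$samples$norm.factors, 2)

toy.wide$samples
```

Column `i` and column `i + n.samples` of the wide matrix belong to the same sample, which is why `rep(..., 2)` gives both the same factor.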

Here we create the design matrix for our linear model, we are interested in seeing separately the impact of sex and isoform as well as their interaction factor.

In [None]:
design <- model.matrix(~sex+isoform+sex*isoform)

In [None]:
table(design[,'sex'])
table(design[,'isoform'])
table(design[,'sex:isoform'])
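
A small sketch of what this design matrix looks like, using four hypothetical columns (one per combination of sex and isoform). It also illustrates that `~sex+isoform+sex*isoform` expands to the same columns as the shorter `~sex*isoform`.

```r
# Four toy columns: male/ijc, female/ijc, male/sjc, female/sjc
toy.sex     <- c(1, 0, 1, 0)
toy.isoform <- c(1, 1, 0, 0)

toy.design       <- model.matrix(~ toy.sex + toy.isoform + toy.sex * toy.isoform)
toy.design.short <- model.matrix(~ toy.sex * toy.isoform)

colnames(toy.design)
toy.design

# Both formulas produce the same columns
stopifnot(identical(colnames(toy.design), colnames(toy.design.short)),
          all(toy.design == toy.design.short))
```

The interaction column is the product of the two indicators, so it is 1 only for the male, included-junction columns.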

## voom
mean variance plot of the raw counts and the DGEList should be the same

### voom raw counts

### voom DGEList

In [None]:
voom.y.DGEList <- voom(y, plot=TRUE)

# duplicateCorrelation
Here we use the block design. On average there are at least two junctions for a skipped exon event; for other events, such as mutually exclusive exons, the number of junctions will differ, and the multiple features can be modeled in the same way. There may be an argument for modeling this differently: for example, one could include this feature in the design itself, or argue that it is already accounted for in the isoform definition. Warnings can occur in the underlying fitting routine used here, statmod::glmgam.fit, which fits a gamma generalized linear model with identity link by Fisher scoring. This function implements a modified Fisher scoring algorithm for generalized linear models, similar to the Levenberg-Marquardt algorithm for nonlinear least squares. The Levenberg-Marquardt modification checks for a reduction in the deviance at each step and avoids the possibility of divergence. The result is a very secure algorithm that converges for almost all datasets.

In [None]:
dup.corr = duplicateCorrelation(voom.y.DGEList, ndups=2, block=block)

In [None]:
dup.corr$consensus.correlation


## voom 

We have now estimated the correlation between the two features, ijc and sjc, to be used in the model.
We will use the correlation just calculated. Our count data will now be converted to log2 counts per million.

Let us see the results without the blocks in the design.

In [None]:
v <- voom(counts.mat, design=design, correlation = dup.corr$consensus.correlation, plot=TRUE, save.plot=TRUE)

## Bayes fit

Done without block, as it does not appear to make a significant difference.


In [None]:
# pass the full voom object so that lmFit uses the precision weights
fit <- lmFit(v, design)
fit <- eBayes(fit)
sex.res         = topTable(fit, coef='sex',         number=nrow(counts.mat))
sex.isoform.res = topTable(fit, coef='sex:isoform', number=nrow(counts.mat))
isoform.res     = topTable(fit, coef='isoform',     number=nrow(counts.mat))

In [None]:
# pass the full voom object so that lmFit uses the precision weights
fit.corr <- lmFit(v, design=design, block=block, correlation=dup.corr$consensus.correlation)
fit.corr <- eBayes(fit.corr)
sex.res.corr         = topTable(fit.corr, coef='sex',         number=nrow(counts.mat))
sex.isoform.res.corr = topTable(fit.corr, coef='sex:isoform', number=nrow(counts.mat))
isoform.res.corr     = topTable(fit.corr, coef='isoform',     number=nrow(counts.mat))

In [None]:
head(sex.res)
head(sex.res.corr)

In [None]:
head(isoform.res)
head(isoform.res.corr)


In [None]:
head(sex.isoform.res)
head(sex.isoform.res.corr)

## Ontologizer


In [None]:
meta.data       <-read.table('../data/fromGTF.SE.txt',sep='\t',header=TRUE)
all.genes       <-read.table('../data/BreastMammaryTissue.sex.isoform.all_genes.txt')
de.tab          <-read.table('../data/BreastMammaryTissue.sex.isoform.se.txt')

de.tab.with.meta<-merge(de.tab,meta.data,by.x='row.names',by.y='ID')


In [None]:
dim(de.tab.with.meta)
head(de.tab.with.meta)

## subset
Based upon our significant results, keep only the genes in the significant subset.


In [None]:
subset <- meta.data$ID %in% all.genes[,1]
table(subset)
significant.genes <- meta.data$geneSymbol[subset==TRUE]
length(significant.genes)

## Ontologizer

Use the Ontologizer to assess the significance of the subsetted genes against the background of all the genes within the experiment set.
rMATS stores the gene names in the fromGTF files, in the geneSymbol column.

In [None]:
setwd('/mnt/shared/gcp-user/session_data/sbas/jupyter')
getwd()

In [None]:
write.table(meta.data$geneSymbol,       '../data/universe.txt',quote = F,row.names = F,col.names = F)
write.table(de.tab.with.meta$geneSymbol,'../data/gene_set.txt',quote = F,row.names = F,col.names = F)

In [None]:
system('java -jar ../../ontologizer/Ontologizer.jar -g ../../ontologizer/go.obo -a ../../ontologizer/goa_human.gaf -s ../data/gene_set.txt -p ../data/universe.txt -c Term-For-Term -m Benjamini-Hochberg -n -o ../data')

## Metadata

For replicability and reproducibility purposes, we also print the following metadata:

1. Checksums of **'artefacts'**, files generated during the analysis and stored in the folder directory **`data`**
2. List of environment metadata, dependencies, versions of libraries using `utils::sessionInfo()` and [`devtools::session_info()`](https://devtools.r-lib.org/reference/session_info.html)

### 1. Checksums with the sha256 algorithm

In [None]:
notebookid   = "Breast-96-Samples"

message("Generating sha256 checksums of the artefacts in the `../data/` directory .. ")
system(paste0("cd ../data && find . -type f -exec sha256sum {} \\;  >  ../metadata/", notebookid, "_sha256sums.txt"), intern = TRUE)
message("Done!\n")

data.table::fread(paste0("../metadata/", notebookid, "_sha256sums.txt"), header = FALSE, col.names = c("sha256sum", "file"))
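
Where the `sha256sum` command-line tool is not available, a checksum can also be computed from within R. The sketch below assumes the `digest` package is installed (it is not among the dependencies loaded in this notebook) and hashes a temporary file created only for the example.

```r
library(digest)

# Write a small temporary file to stand in for an artefact (illustrative only)
toy.file <- tempfile(fileext = ".txt")
writeLines("example artefact", toy.file)

# file = TRUE hashes the file contents rather than the path string
toy.checksum <- digest(toy.file, algo = "sha256", file = TRUE)
toy.checksum

# A sha256 digest is 64 hexadecimal characters long
stopifnot(nchar(toy.checksum) == 64)
```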

### 2. Libraries metadata

In [None]:
dev_session_info   <- devtools::session_info()
utils_session_info <- utils::sessionInfo()

message("Saving `devtools::session_info()` objects in ../metadata/", notebookid, "_devtools_session_info.rds  ..")
saveRDS(dev_session_info, file = paste0("../metadata/", notebookid, "_devtools_session_info.rds"))
message("Done!\n")

message("Saving `utils::sessionInfo()` objects in ../metadata/", notebookid, "_utils_info.rds  ..")
saveRDS(utils_session_info, file = paste0("../metadata/", notebookid ,"_utils_info.rds"))
message("Done!\n")

dev_session_info$platform
dev_session_info$packages[dev_session_info$packages$attached==TRUE, ]