# Evaluating the performance of the kataegis detection packages 

This Jupyter notebook will reproduce the evaluation of kataegis detection packages as detailed within our manuscript.

## Loading dependencies

First, we load the dependencies for this notebook.
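Before loading, it can help to confirm that every dependency is installed. The following is a minimal sketch using only base R; the helper name `find_missing_packages` is hypothetical and not part of this repository, and the package list simply mirrors the `library()` calls in the next cell.

```r
# Packages attached by the library() calls in this notebook.
required_packages <- c(
  "katdetectr", "zen4R", "futile.logger", "dplyr",
  "VariantAnnotation", "SeqKat", "pbapply"
)

# Hypothetical helper: return the subset of `packages` whose namespace
# cannot be loaded from the current library paths.
find_missing_packages <- function(packages) {
  is_installed <- vapply(packages, requireNamespace, logical(1), quietly = TRUE)
  packages[!is_installed]
}

missing_packages <- find_missing_packages(required_packages)
if (length(missing_packages) > 0) {
  message("Missing packages: ", paste(missing_packages, collapse = ", "))
}
```

Note that `requireNamespace()` loads a namespace without attaching it, so this check does not replace the `library()` calls.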

In [1]:
library(katdetectr)
library(zen4R)
library(futile.logger)
library(dplyr)
library(VariantAnnotation)
library(SeqKat)
library(pbapply)

# Set the seed for reproducibility
set.seed(1)


Attaching package: 'dplyr'


The following objects are masked from 'package:stats':

    filter, lag


The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union


Loading required package: BiocGenerics


Attaching package: 'BiocGenerics'


The following objects are masked from 'package:dplyr':

    combine, intersect, setdiff, union


The following objects are masked from 'package:stats':

    IQR, mad, sd, var, xtabs


The following objects are masked from 'package:base':

    Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
    as.data.frame, basename, cbind, colnames, dirname, do.call,
    duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
    lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
    pmin.int, rank, rbind, rownames, sapply, setdiff, sort, table,
    tapply, union, unique, unsplit, which.max, which.min


Loading required package: MatrixGenerics

Loading required package: matrixStats


Attaching package

## Importing Alexandrov et al. (2013) data and generating the synthetic datasets

### Reproducibility

All previously generated input and output needed to reproduce the evaluation of kataegis detection packages described in our manuscript were deposited on [Zenodo](https://doi.org/10.5281/zenodo.6810477), so that the presented figures and tables can be regenerated. The following cell retrieves all the RData objects used in the later part of this notebook.

In [8]:
# Increase the timeout (due to some large files).
options(timeout=5000)

# Download the required files into the data/ folder (~1GB).
zen4R::download_zenodo(doi = "10.5281/zenodo.6810477", path = 'data/', quiet = FALSE, overwrite = FALSE)
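Once the download has finished, it can be useful to see which objects each RData file contains before using them. The sketch below uses base R only; `inspect_rdata` is a hypothetical helper, and it assumes the deposited files carry the `.RData` extension. Each file is loaded into its own environment so that nothing in the global workspace is overwritten. Because the files are large, this may take some time and memory.

```r
# Hypothetical helper: load every .RData file found in `path` into a
# separate environment and return a named list that maps each file name
# to the (sorted) names of the objects stored in that file.
inspect_rdata <- function(path = "data/") {
  files <- list.files(path, pattern = "\\.RData$", full.names = TRUE)
  contents <- lapply(files, function(file) {
    env <- new.env()
    load(file, envir = env)
    sort(ls(env))
  })
  names(contents) <- basename(files)
  contents
}

# Returns an empty list if `data/` does not exist or holds no .RData files.
inspect_rdata("data/")
```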


### Fresh run

#### Importing somatic variants and kataegis calls from Alexandrov et al. (2013)

The following code will download and pre-process the somatic variants and kataegis calls from Alexandrov et al. (2013). This will generate an RData object within the specified `path`.

In [None]:
source("R/1.importAndProcess_Alexandrov.R")
importAlexandrovData(path = "data/")

#### Generating synthetic datasets

The following code will generate a set of xxx synthetic samples (hg19) with varying degrees of kataegis events. This will generate an RData object within the specified `path`.

In [None]:
source("R/2.importAndProcess_Synthetic.R")
generateSyntheticData(path = "data/")
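Both fresh-run steps write an RData object into `path`, so a re-run can skip a step whose output already exists. The following is a minimal sketch of such a guard; `run_if_missing` is a hypothetical helper, and the output file name in the usage comment is an assumption based on the input file names used later in this notebook.

```r
# Hypothetical helper: call `step` (a zero-argument function) only when
# `output_file` does not exist yet. Invisibly returns TRUE if the step
# was run and FALSE if it was skipped.
run_if_missing <- function(output_file, step) {
  if (file.exists(output_file)) {
    message("Found ", output_file, "; skipping.")
    return(invisible(FALSE))
  }
  step()
  invisible(TRUE)
}

# Usage (commented out because it needs the sourced project scripts;
# the file name is an assumption):
# run_if_missing("data/synthetic_data_processed.RData",
#                function() generateSyntheticData(path = "data/"))
```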

## Running the kataegis-detection packages

Next, we run the R-based kataegis-detection packages (katdetectr, SeqKat, maftools, kataegis and ClusteredMutations) and the Python-based SigProfilerClusters on the same datasets (Alexandrov et al. and the synthetic dataset). This will generate RData objects containing the results within `data/`.

In [2]:
# Run the R-based tools on the Alexandrov et al. and synthetic datasets.
source("notebooks/R/3.performRPackages.R")

runTools_Alexandrov(data = "data/alexandrov_data_processed.RData")
runTools_Synthetic(data = "data/synthetic_data_processed.RData")

ERROR: Error in parse(text = x, srcfile = src): <text>:4:1: unexpected symbol
3: runTools_Alexandrov(data = "data/alexandrov_data_processed.RData"
4: runTools_Synthetic
   ^


In [None]:
# Run the Python-based SigProfilerClusters on the Alexandrov et al. and synthetic datasets.
source("R/4.performSigProfiler.R")
runSigProfiler(dataAlexandrov = "data/alexandrov_data_processed.RData", dataSynthetic = "data/synthetic_data_processed.RData")

## Evaluating the results

Finally, we will evaluate the results of the various kataegis-detection packages. This will generate the figures and tables presented within our manuscript.

In [None]:
# source("R/5.performSigProfiler.R")
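To illustrate what such an evaluation can look like, the sketch below scores one tool's calls against a reference labelling. It assumes a per-variant binary classification (each variant is either inside a kataegis event or not); this notebook does not state the exact metrics used in the manuscript, so the helper `evaluate_calls` and the toy vectors are hypothetical and for illustration only.

```r
# Hypothetical helper: compute confusion-matrix counts and derived metrics
# from two logical vectors of equal length.
#   truth:     TRUE if the variant lies in a reference kataegis event
#   predicted: TRUE if the tool assigned the variant to a kataegis event
evaluate_calls <- function(truth, predicted) {
  stopifnot(is.logical(truth), is.logical(predicted),
            length(truth) == length(predicted))
  tp <- sum(truth & predicted)
  fp <- sum(!truth & predicted)
  fn <- sum(truth & !predicted)
  tn <- sum(!truth & !predicted)
  # Metrics are NA when their denominator is zero.
  precision <- if (tp + fp > 0) tp / (tp + fp) else NA_real_
  recall <- if (tp + fn > 0) tp / (tp + fn) else NA_real_
  f1 <- if (is.na(precision) || is.na(recall) || precision + recall == 0) {
    NA_real_
  } else {
    2 * precision * recall / (precision + recall)
  }
  c(TP = tp, FP = fp, FN = fn, TN = tn,
    precision = precision, recall = recall, F1 = f1)
}

# Toy example with six variants (made-up labels, not manuscript data).
truth     <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)
predicted <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)
evaluate_calls(truth, predicted)
```

In the toy example, two of the three reference variants are recovered and one of the three calls is a false positive, so precision, recall and F1 all equal 2/3.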



## Session information

In [3]:
sessionInfo()

R version 4.2.3 (2023-03-15)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.3.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib

locale:
[1] C

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] pbapply_1.7-0               SeqKat_0.0.8               
 [3] doParallel_1.0.17           iterators_1.0.14           
 [5] foreach_1.5.2               VariantAnnotation_1.44.1   
 [7] Rsamtools_2.14.0            Biostrings_2.66.0          
 [9] XVector_0.38.0              SummarizedExperiment_1.28.0
[11] Biobase_2.58.0              GenomicRanges_1.50.2       
[13] GenomeInfoDb_1.34.9         IRanges_2.32.0             
[15] S4Vectors_0.36.2            MatrixGenerics_1.10.0      
[17] matrixStats_0.63.0          Bi