# Evaluating the performance of the kataegis detection packages 

This Jupyter notebook reproduces the evaluation of kataegis detection packages described in our manuscript.

## Loading dependencies

First, we load the dependencies for this notebook.

In [None]:
library(katdetectr)
library(zen4R)
library(futile.logger)
library(dplyr)
library(VariantAnnotation)
library(SeqKat)
library(pbapply)

# Set the seed for reproducibility
set.seed(1)
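
Because `library()` stops at the first package it cannot find, it can help to check up front which dependencies are absent. The sketch below is only an illustration: `missingPackages()` is a hypothetical helper (not part of this repository) built on base R's `requireNamespace()`.

```r
# Hypothetical helper (not part of this repository): return the names of
# the packages that cannot be found in the current library paths.
missingPackages <- function(pkgs) {
  isInstalled <- vapply(
    pkgs,
    function(pkg) requireNamespace(pkg, quietly = TRUE),
    logical(1)
  )
  pkgs[!isInstalled]
}

# The packages attached at the top of this notebook.
required <- c("katdetectr", "zen4R", "futile.logger", "dplyr",
              "VariantAnnotation", "SeqKat", "pbapply")

notInstalled <- missingPackages(required)
if (length(notInstalled) > 0) {
  message("Missing packages: ", paste(notInstalled, collapse = ", "))
}
```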

## Importing Alexandrov et al. (2013) data and generating the synthetic datasets

### Reproducibility

All previously generated input and output needed to reproduce the evaluation of kataegis detection packages described in our manuscript were deposited on [Zenodo](https://doi.org/10.5281/zenodo.6810477), so that the presented figures and tables can be regenerated. The following code retrieves all the RData objects used in the latter part of this notebook.

In [8]:
# Increase the timeout (due to some large files).
options(timeout=5000)

# Download the required files into the data/ folder (~1GB).
zen4R::download_zenodo(doi = "10.5281/zenodo.6810477", path = 'data/', quiet = FALSE, overwrite = FALSE)
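
After the download finishes, it can be useful to see which objects each RData file provides before they are used further on. A minimal sketch in base R, assuming the files were written with `save()` and carry an `.RData` extension; `listRDataObjects()` is a hypothetical helper, not part of this repository.

```r
# Hypothetical helper: load every .RData file in `path` into a throwaway
# environment and return the object names found in each file.
listRDataObjects <- function(path = "data/") {
  files <- list.files(path, pattern = "\\.RData$", full.names = TRUE)
  objects <- lapply(files, function(file) {
    env <- new.env()
    load(file, envir = env)  # load() returns the names of the restored objects
  })
  names(objects) <- basename(files)
  objects
}

listRDataObjects("data/")
```

Each file is loaded in full, so this may take a while and use considerable memory for the larger files.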


### Fresh run

#### Importing somatic variants and kataegis calls from Alexandrov et al. (2013)

The following code will download and pre-process the somatic variants and kataegis calls from Alexandrov et al. (2013). This will generate an RData object within the specified `path`.

In [None]:
source("R/1.importAndProcess_Alexandrov.R")
importAlexandrovData(path = "data/")

#### Generating synthetic datasets

The following code will generate a set of xxx synthetic samples (hg19) with varying degrees of kataegis events. This will generate an RData object within the specified `path`.

In [None]:
source("R/2.importAndProcess_Synthetic.R")
generateSyntheticData(path = "data/")
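
To give a feel for what a synthetic sample with a kataegis event looks like, the toy sketch below scatters background variants uniformly along a single chromosome and adds one dense cluster. It is not the procedure implemented in `R/2.importAndProcess_Synthetic.R`; the function `simulateToySample()` and all of its parameters and default values are illustrative assumptions.

```r
# Toy illustration only; NOT the generator used in this notebook.
# All names and default values below are hypothetical.
simulateToySample <- function(chromLength = 1e6, nBackground = 100,
                              nCluster = 10, clusterWidth = 2000) {
  # Background variants: positions drawn uniformly without replacement.
  background <- sample.int(chromLength, nBackground)

  # One dense cluster: nCluster distinct positions inside a short window.
  clusterStart <- sample.int(chromLength - clusterWidth, 1)
  cluster <- clusterStart + sample.int(clusterWidth, nCluster)

  positions <- sort(unique(c(background, cluster)))
  data.frame(position = positions, inCluster = positions %in% cluster)
}

set.seed(1)
toy <- simulateToySample()

# Intermutation distances (IMD): gaps between consecutive variants.
imd <- diff(toy$position)
summary(imd)
```

Variants inside the cluster sit much closer together than the background variants, and this local drop in intermutation distance is the kind of signal that kataegis detection relies on.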

## Running the various kataegis-detection packages

Next, we run the R-based kataegis-detection packages (katdetectr, SeqKat, maftools, kataegis and ClusteredMutations) and the Python-based SigProfilerClusters on the same datasets (Alexandrov et al. and the synthetic dataset). This will generate RData objects containing the results within `data/`.
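
These steps read the processed datasets generated in the previous section, so a quick check that the files exist avoids starting a long run that fails on a missing input. A minimal sketch in base R; `findMissingInputs()` is a hypothetical helper, and the paths are the ones passed to the functions in the cells that follow.

```r
# Hypothetical helper: return the input paths that do not exist on disk.
findMissingInputs <- function(paths) {
  paths[!file.exists(paths)]
}

inputs <- c("data/alexandrov_data_processed.RData",
            "data/synthetic_data_processed.RData")

absent <- findMissingInputs(inputs)
if (length(absent) > 0) {
  message("Missing input file(s): ", paste(absent, collapse = ", "))
}
```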

In [None]:
# Run the R-based tools on the Alexandrov et al. and synthetic datasets.
source("R/3.performRPackages.R")

runTools_Alexandrov(data = "data/alexandrov_data_processed.RData")
runTools_Synthetic(data = "data/synthetic_data_processed.RData")

In [None]:
# Run the Python-based SigProfilerClusters on the Alexandrov et al. and synthetic datasets.
source("R/4.performSigProfiler.R")
runSigProfiler(dataAlexandrov = "data/alexandrov_data_processed.RData", dataSynthetic = "data/synthetic_data_processed.RData")