Tutorial
You can run SCHISM on your input data by calling the `runSchism` master script from the command line. The script supports two operational modes. If the installation of SCHISM completed successfully, `runSchism` should be available on your system's path. You can verify this by calling:
runSchism --help
The above command should print out the help message.
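The same check can be done from Python before launching an analysis. This is a minimal sketch using only the standard library; the helper name `find_executable` is ours and is not part of SCHISM.

```python
import shutil


def find_executable(name="runSchism"):
    """Return the full path of `name` if it is on PATH, otherwise None."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_executable()
    if path is None:
        print("runSchism was not found on PATH; check the SCHISM installation.")
    else:
        print("runSchism found at", path)
```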
The first operational mode of `runSchism`, which is appropriate for most users, takes the input data sequentially through all the steps involved in SCHISM analysis listed below (Sequential mode).
runSchism analyze -c experiment.yaml
where the input configuration file `experiment.yaml` contains the parameter settings and analysis choices.
The second mode, intended for more advanced users, allows making separate calls to the `runSchism` script, each performing one specific analysis step (Step-Through mode). This operational mode also enables optional parallelization of independent genetic algorithm runs. Please see usage examples for more details.
runSchism [Argument] [options]
Arguments:
- Sequential mode:
  - `analyze`: calls `runSchism` to perform all analysis steps
- Step-Through mode:
  - `prepare_for_hypothesis_test`: prepare input data for hypothesis test
  - `hypothesis_test`: perform hypothesis test
  - `cluster_mutations`: cluster mutations based on their hypothesis test results (optional)
  - `confirm_clusters`: generate CPOV matrix, and estimate cluster cellularity values using cluster definitions generated above (optional)
  - `plot_cpov`: visualize hypothesis test results
  - `run_ga`: run genetic algorithm
  - `summarize_ga_results`: gather results from independent runs of GA, generate summary plots
  - `consensus_tree`: generate and visualize the consensus of maximum fitness trees across all runs of GA
Options:
- Sequential mode:
  - `-c`, `--config`: analysis configuration file
- Step-Through mode:
  - `-c`, `--config`: analysis configuration file
  - `-m`, `--mode`: GA run mode (`serial` or `parallel`), should accompany the `run_ga` argument
  - `-r`, `--runID`: GA run ID, numeric value ranging from 1 to the `instance_count` parameter in the configuration file, should accompany the `run_ga` argument if `mode` is set to parallel
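To make the Step-Through mode concrete, the sketch below assembles the sequence of `runSchism` calls as argument lists. It only uses the arguments and options listed here. The function name, the assumption that steps run in the listed order, and the choice to pass `--mode serial` explicitly are ours. It does not launch anything; the lists can be handed to `subprocess.run` or to a job scheduler.

```python
def step_through_commands(config, cluster_with_schism=False,
                          ga_mode="serial", instance_count=1):
    """Build the runSchism calls for Step-Through mode, in order."""
    steps = ["prepare_for_hypothesis_test", "hypothesis_test"]
    if cluster_with_schism:
        # Optional steps, only needed when SCHISM clusters the mutations.
        steps += ["cluster_mutations", "confirm_clusters"]
    steps.append("plot_cpov")

    commands = [["runSchism", step, "-c", config] for step in steps]

    if ga_mode == "parallel":
        # One call per GA run; run IDs range from 1 to instance_count.
        for run_id in range(1, instance_count + 1):
            commands.append(["runSchism", "run_ga", "-c", config,
                             "--mode", "parallel", "-r", str(run_id)])
    else:
        commands.append(["runSchism", "run_ga", "-c", config,
                         "--mode", "serial"])

    for step in ["summarize_ga_results", "consensus_tree"]:
        commands.append(["runSchism", step, "-c", config])
    return commands


for command in step_through_commands("experiment.yaml",
                                     ga_mode="parallel", instance_count=2):
    print(" ".join(command))
```

In parallel mode the `run_ga` calls are independent of each other, so they can be submitted concurrently; the summary and consensus steps should wait until all of them have finished. When SCHISM clusters the mutations, the clustering results are meant to be reviewed before `confirm_clusters` is called, so a fully automated chain may not be appropriate there.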
SCHISM analysis requires one or two input data files. The first input file is determined by the choice of the computational tool to estimate mutation cellularity. If the user wishes to use SCHISM to estimate the cellularity of somatic mutations in tumor samples, the first input will be a tab-delimited table listing allele-specific read counts and an integer copy number value for each somatic mutation, formatted similarly to the following example.
sampleID mutationID referenceReads variantReads copyNumber
TUM 1 120 93 2
TUM 2 180 140 2
MET 1 139 64 1
MET 2 132 77 2
Update: Starting from SCHISM-1.1.3, the user can include an additional column listing the multiplicity of each mutation, that is, the number of mutated copies present in cancer cells. If mutation multiplicity is specified in the input, the program can extend the cellularity estimation step to mutations in aneuploid regions of the genome.
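As an illustration of the read count table format, the following sketch parses the example above with Python's `csv` module. The function name is ours, and the optional multiplicity column is left out because its header name is not given here.

```python
import csv
import io

# The example table from this tutorial, with tab-separated columns.
RAW_INPUT = (
    "sampleID\tmutationID\treferenceReads\tvariantReads\tcopyNumber\n"
    "TUM\t1\t120\t93\t2\n"
    "TUM\t2\t180\t140\t2\n"
    "MET\t1\t139\t64\t1\n"
    "MET\t2\t132\t77\t2\n"
)


def read_raw_input(handle):
    """Parse the tab-delimited read count table into a list of dicts."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        rows.append({
            "sampleID": row["sampleID"],
            "mutationID": row["mutationID"],
            "referenceReads": int(row["referenceReads"]),
            "variantReads": int(row["variantReads"]),
            "copyNumber": int(row["copyNumber"]),
        })
    return rows


rows = read_raw_input(io.StringIO(RAW_INPUT))
for r in rows:
    # Total read depth at the mutation site in this sample.
    depth = r["referenceReads"] + r["variantReads"]
    print(r["sampleID"], r["mutationID"], "depth:", depth)
```

For a file on disk, pass an open file handle instead of the `io.StringIO` wrapper.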
On the other hand, if other computational tools were used to estimate the cellularity of somatic mutations, this input will be a tab-delimited list of the estimated value and standard error of cellularity for each somatic mutation in each tumor sample; e.g.
sampleID mutationID cellularity sd
TUM 1 0.970 0.076
TUM 2 0.972 0.062
MET 1 0.525 0.054
MET 2 0.982 0.089
Please note that SCHISM requires cellularity/read count data for all mutations in all samples. For cases where the mutation is absent in a sample, it uses the reference and alternate (small or zero) read counts to estimate the confidence interval of cellularity of mutation in the sample.
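Because every mutation must be present in every sample, it can help to check the table for gaps before running the analysis. The sketch below is one way to do that; it is not part of SCHISM, and it assumes each record is a dict with `sampleID` and `mutationID` keys, matching the column names in the examples above.

```python
from itertools import product


def missing_pairs(records):
    """Return the (sampleID, mutationID) pairs that are absent from the table."""
    seen = {(r["sampleID"], r["mutationID"]) for r in records}
    samples = sorted({sample for sample, _ in seen})
    mutations = sorted({mutation for _, mutation in seen})
    # Every sample should report every mutation exactly once.
    return [pair for pair in product(samples, mutations) if pair not in seen]


records = [
    {"sampleID": "TUM", "mutationID": "1"},
    {"sampleID": "TUM", "mutationID": "2"},
    {"sampleID": "MET", "mutationID": "1"},
]
print(missing_pairs(records))  # [('MET', '2')]
```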
The second (optional) input is a tab-delimited file that assigns each somatic mutation to a mutation cluster, following the format:
mutationID clusterID
1 1
2 2
3 2
4 2
5 3
6 3
7 4
Update: A new clustering module has been added to SCHISM (starting at 1.1.0). To enable clustering of mutations by SCHISM, please see the relevant section under configuration file.
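The cluster assignment file can be read in the same way as the other tab-delimited inputs. The sketch below groups mutation IDs by cluster for the example above; the function name is ours, and IDs are kept as strings since the format does not say they must be numeric.

```python
import csv
import io
from collections import defaultdict

# The example assignment table from this tutorial, with tab-separated columns.
CLUSTER_INPUT = (
    "mutationID\tclusterID\n"
    "1\t1\n2\t2\n3\t2\n4\t2\n5\t3\n6\t3\n7\t4\n"
)


def read_cluster_assignment(handle):
    """Map each clusterID to the list of mutationIDs assigned to it."""
    clusters = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        clusters[row["clusterID"]].append(row["mutationID"])
    return dict(clusters)


clusters = read_cluster_assignment(io.StringIO(CLUSTER_INPUT))
for cluster_id, mutation_ids in clusters.items():
    print("cluster", cluster_id, "->", ", ".join(mutation_ids))
```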
The analysis configuration file passes the user's parameter settings and analysis choices to the `runSchism` master script. The information listed in this configuration file can be divided into 4 distinct blocks.
- `working_dir`: sets the SCHISM working directory. All input and output paths will be relative to this directory.
- `mutation_to_cluster_assignment`: relative path to a tab-delimited mutation-to-cluster assignment file. If SCHISM is selected to perform the cluster analysis, the results will be stored in this path. Otherwise, this file will be a user input.
- `mutation_raw_input`: relative path to a tab-delimited file of mutation read counts and integer copy number values across tumor samples of a patient. This input is required when using SCHISM to estimate somatic mutation cellularity.
- `mutation_cellularity_input`: relative path to a tab-delimited file of mutation cellularity estimates and standard errors across samples of a patient. This input is required when external tools are used to estimate somatic mutation cellularity.
- `output_prefix`: SCHISM results will be stored in `working_dir` with names starting with `output_prefix`.
- `cellularity_estimation`: choice of computational tool to estimate mutation cellularities. If equal to "schism", the file indicated by `mutation_raw_input` will be used to estimate cellularities. If equal to "other", SCHISM expects mutation cellularities and standard errors to be available in `mutation_cellularity_input`.
- cellularity estimator: relevant where SCHISM is used for cellularity estimation.
  - `coverage_threshold`: integer value indicating the minimum coverage depth required to estimate cellularity. Mutations with coverage below this threshold will be assigned a missing cellularity value.
  - `absent_mode`: integer value, either 0 or 1. Determines the cellularity value and standard error assigned to mutations with a `variantReads` count of 0. A value of 1 results in such mutations being assigned default values of 0 for cellularity and 0.05 for standard error. A value of 0 results in the addition of 1 pseudo-count to both the reference and variant read counts.
- `tumor_sample_purity`: relevant where SCHISM is used for cellularity estimation.
  - each item under this subsection will be a sampleID followed by its estimated purity level, e.g. (TUM: 0.8)
- `hypothesis_test`:
  - `test_level`: "mutations" or "clusters". This parameter determines whether the hypothesis test is performed on pairs of mutations or pairs of mutation clusters. A hypothesis test on pairs of mutation clusters directly results in the CPOV matrix (example E2). A hypothesis test on pairs of mutations needs to be followed by a vote aggregation step to derive the CPOV matrix (example E1). Please note that if SCHISM is asked to cluster mutations (`cluster_analysis: schism`), this parameter should be set to "mutations".
  - `significance_level`: fractional value in [0,1], the significance level (alpha) used to reject the null hypothesis.
  - `store_pvalues`: binary flag indicating whether hypothesis test p-values should be stored.
- `cluster_analysis`: choice of computational tool to cluster mutations. If set to `schism`, mutations will be clustered based on their hypothesis test results, and the output will be stored in the path specified by `mutation_to_cluster_assignment`. Otherwise, the program assumes that the above path contains cluster definitions. If set to `schism`, the `test_level` parameter in the hypothesis test block should be set to "mutations".
- `clustering_method`: Example parameter sets for cluster analysis are available in the SCHISM repository at data/schism.yaml.
  - `algorithm`: one of `AP` (Affinity Propagation), `DBSCAN`, or `KMeans`.
  - `min_cluster_count`: minimum number of clusters.
  - `max_cluster_count`: maximum number of clusters.
  - `verbose`: verbosity binary flag.
  - `min_preference`: minimum value of preference parameter (when `algorithm` is `AP`)
  - `max_preference`: maximum value of preference parameter (when `algorithm` is `AP`)
  - `preference_increments`: increments in preference grid search (when `algorithm` is `AP`)
  - `min_eps`: minimum value of epsilon parameter (when `algorithm` is `DBSCAN`)
  - `max_eps`: maximum value of epsilon parameter (when `algorithm` is `DBSCAN`)
  - `eps_increments`: increments in epsilon grid search (when `algorithm` is `DBSCAN`)
  - `min_minPts`: minimum value of minPts parameter (when `algorithm` is `DBSCAN`)
  - `max_minPts`: maximum value of minPts parameter (when `algorithm` is `DBSCAN`)
  - `minPts_increments`: increments in minPts grid search (when `algorithm` is `DBSCAN`)
  - `n_init`: number of random initializations (when `algorithm` is `KMeans`)
- `genetic_algorithm`:
  - `instance_count`: integer, the number of independent runs of the genetic algorithm to be performed. Typically, SCHISM will run the independent instances of GA in sequence (serial). When studying complex trees (node counts > 10), advanced users with access to parallel computing resources can run the independent GA instances in parallel to reduce the computation time (instructions available in the usage examples page).
  - `generation_count`: integer, the number of generations to run in each GA instance.
  - `generation_size`: integer, the number of tree topologies in each generation of GA.
  - `random_object_fraction`: fractional value in [0,1], the proportion of tree topologies in each generation that are randomly generated rather than descended from topologies of previous generations.
  - `mutation_probability`: fractional value in [0,1], the probability with which the genetic algorithm mutation operator is applied when generating descendant topologies.
  - `crossover_probability`: fractional value in [0,1], the probability with which the genetic algorithm crossover operator is applied when generating descendant topologies.
  - `fitness_coefficient`: numeric value determining the log fold decrease in fitness of tree topologies corresponding to a unit increase in total cost.
  - `verbose`: binary flag. A value of True will prompt SCHISM to print out stats after each generation of GA.
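Several of the parameters above constrain one another, so a quick consistency check on the parsed configuration can catch mistakes early. The sketch below works on a plain Python dict, for example the result of loading the YAML file. The function is not part of SCHISM, and the nesting of keys (`test_level` under `hypothesis_test`, probabilities under `genetic_algorithm`) is an assumption based on the block descriptions above. The values and the file name in the example are made up for illustration.

```python
def check_config(config):
    """Return a list of consistency problems found in a parsed config dict."""
    problems = []

    # The cellularity source decides which input file must be present.
    estimation = config.get("cellularity_estimation")
    if estimation == "schism" and not config.get("mutation_raw_input"):
        problems.append("mutation_raw_input is required when "
                        "cellularity_estimation is 'schism'")
    if estimation == "other" and not config.get("mutation_cellularity_input"):
        problems.append("mutation_cellularity_input is required when "
                        "cellularity_estimation is 'other'")

    # Clustering by SCHISM needs mutation-level hypothesis tests.
    hypothesis_test = config.get("hypothesis_test", {})
    if (config.get("cluster_analysis") == "schism"
            and hypothesis_test.get("test_level") != "mutations"):
        problems.append("test_level should be 'mutations' when "
                        "cluster_analysis is 'schism'")

    # Fractional parameters are documented as lying in [0, 1].
    fractions = {"significance_level": hypothesis_test.get("significance_level")}
    ga = config.get("genetic_algorithm", {})
    for key in ("random_object_fraction", "mutation_probability",
                "crossover_probability"):
        fractions[key] = ga.get(key)
    for key, value in fractions.items():
        if value is not None and not 0 <= value <= 1:
            problems.append("{} should lie in [0, 1]".format(key))

    return problems


example = {
    "cellularity_estimation": "schism",
    "mutation_raw_input": "patient.reads.tsv",  # hypothetical file name
    "cluster_analysis": "schism",
    "hypothesis_test": {"test_level": "clusters", "significance_level": 0.05},
    "genetic_algorithm": {"mutation_probability": 0.9},
}
for problem in check_config(example):
    print(problem)
```

Here the example is flagged because it asks SCHISM to cluster mutations while testing at the cluster level.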
The SCHISM analysis of input mutation data involves the following steps:
1. `prepare_for_hypothesis_test`: Preparation of input data for the hypothesis test module
   - input: `mutation_to_cluster_assignment` and
     - `mutation_raw_input` if using SCHISM cellularity estimation
     - `mutation_cellularity_input` if using external cellularity estimates
   - output: `output_prefix.cluster.cellularity` and
     - `output_prefix.mutation.cellularity` if using SCHISM cellularity estimation
2. `hypothesis_test`: Performing the SCHISM hypothesis test to derive mutation and cluster precedence order violation matrices
   - input:
     - `output_prefix.cluster.cellularity` if `test_level` equals "clusters" and `cluster_analysis` is not set to `schism`
     - `output_prefix.mutation.cellularity` if `test_level` equals "mutations" and using SCHISM for cellularity estimation
     - `mutation_cellularity_input` if `test_level` equals "mutations" and using external tools for cellularity estimation
   - output:
     - `output_prefix.HT.pov` if `test_level` equals "mutations"
     - `output_prefix.HT.cpov` if `cluster_analysis` is not "schism"
     - `output_prefix.HT.pvalues` if `store_pvalues` is set to True
3. `cluster_mutations`: (New) Performing clustering of mutations based on their hypothesis test results, when no user clusters are available. SCHISM can apply three different clustering methods to the input data. It is recommended that the user try a few different methods and pick the most reasonable result by comparing the provided Silhouette Coefficients or by manual inspection.
   - input: `output_prefix.HT.pov`
   - output: `mutation_to_cluster_assignment`
4. `confirm_clusters`: (New) Validate the clustering results after review by the user. This step is only required if step 3 (`cluster_mutations`) is run.
   - input: `mutation_to_cluster_assignment`, `output_prefix.HT.pov`, `output_prefix.mutation.cellularity`
   - output: `output_prefix.HT.cpov`, `output_prefix.cluster.cellularity`
5. `plot_cpov`: Visualization of hypothesis test results
   - input: `output_prefix.HT.cpov`
   - output: `output_prefix.HT.cpov.pdf`
6. `run_ga`: Genetic Algorithm (GA) search to sample candidate tree topologies
   - input: `output_prefix.HT.cpov`, `output_prefix.cluster.cellularity`
   - output: `output_prefix.GA.r{instanceID}.trace`
7. `summarize_ga_results`: Merging the results from independent runs of the GA, and generating summary trace plots
   - input: `output_prefix.GA.r{instanceID}.trace`
   - output: `output_prefix.GA.trace`, `output_prefix.GA.fitnessTrace.pdf`, `output_prefix.GA.topTreeCount.pdf`
8. `consensus_tree`: Derivation and visualization of the consensus tree
   - input: `output_prefix.GA.trace`
   - output: `output_prefix.GA.consensusTree`, `output_prefix.GA.consensusTree.pdf`
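When GA runs are launched in parallel, it is useful to confirm that every run has written its trace file before calling `summarize_ga_results`. The sketch below lists the expected file names and reports the ones that are missing. It assumes that `output_prefix` in the file names stands for the configured prefix value and that `{instanceID}` takes the run IDs 1 to `instance_count`; neither helper is part of SCHISM.

```python
import os


def expected_trace_files(output_prefix, instance_count):
    """List the per-run trace file names for run IDs 1..instance_count."""
    return ["{}.GA.r{}.trace".format(output_prefix, run_id)
            for run_id in range(1, instance_count + 1)]


def missing_trace_files(working_dir, output_prefix, instance_count):
    """Return the expected trace files that are not present in working_dir."""
    return [name
            for name in expected_trace_files(output_prefix, instance_count)
            if not os.path.exists(os.path.join(working_dir, name))]


# "P1" is a made-up prefix used only for this illustration.
print(expected_trace_files("P1", 3))
```

An empty list from `missing_trace_files` suggests that all runs have produced output and the summary step can proceed.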