Usage Examples
This page covers two usage examples corresponding to the two operational modes of runSchism
master script.
The input files, along with example outputs are available at:
Please download and decompress the tar balls above in your path of choice. Each example directory includes a wrapper shell script, including all analysis steps described below. You can either use the provided wrapper scripts or make direct calls to runSchism
.
If you decide to make direct calls, change the working_dir
parameter in the example configuration files (i.e. E1.yaml
or E2.yaml
) to the absolute path of the directories where the input files are located in each example. At this point, you are ready to follow the instructions below.
The following example illustrates the usage of runSchism
master script in sequential mode. The input data for this example were generated from a simulated subclonal phylogenetic tree with 7 mutations clusters. Cellularity values were generated for three simulated samples.
In this simulated example, we assume that all somatic mutations are assumed to have a coverage depth of 500X and the simulated tumor samples have a purity of 50% (50% normal cell contamination is present). The figure below depicts the tree topology used as the basis of simulation, along with mutation cluster cellularity values in three simulated biopsy samples S1-3.
The data corresponding to this example can be found in the SCHISM package at data/example1/
. Here, we will use runSchism
master script in sequential mode and allow it to run all SCHISM analysis steps in order, by calling:
runSchism analyze --config E1.yaml
The SCHISM analysis proceeds as follows. As indicated in the configuration file E1.yaml
, the inputs to this example of SCHISM analysis are E1.mutationReadsCNA.tsv
and E1.mutation-to-cluster.tsv
.
First, the cellularity of each somatic mutation will be inferred in each biopsy sample using read counts and copy number values listed in E1.mutationReadsCNA.tsv
and stored in E1.mutation.cellularity
. Also, an estimate of cellularity for each mutation cluster is calculated using E1.mutation-to-cluster.tsv
and stored in E1.cluster.cellularity
. The top 5 rows of E1.mutation.cellularity
the file is listed below:
sampleID mutationID cellularity sd
S1 0 1.0000 0.0789
S1 1 0.9520 0.0762
S1 2 1.0000 0.0790
S1 3 1.0000 0.0785
Please note that SCHISM requires cellularity/read count data for all mutations in all samples. For cases where the mutation is absent in a sample, it uses the reference and alternate (small or zero) read counts to estimate the confidence interval of cellularity of mutation in the sample.
Next, SCHISM hypothesis will be applied to each pair of mutations and the p-value and test results will be stored in E1.HT.pvalues
and E1.HT.pov
. For each pair of mutation clusters, the test results from all pairs of mutations from the two clusters are aggregated as votes, and the result is stored in E1.HT.cpov
. The CPOV table contains the topology cost associated with lineage relationships between each pair of mutation clusters and is visualized as a heatmap in E1.HT.cpov.pdf
.
In cases where the hypothesis test level is set to mutations
, the heatmap will be labeled with the result of vote aggregation in each mutation cluster. Otherwise, the resulting output_prefix.HT.cpov.pdf
will be a binary heatmap where white and purple squares represent rejection of the null hypothesis and lack thereof respectively.
In this example, 10 independent runs of genetic algorithm (GA), each with 50 generations and 100 tree topologies in each generation is performed in sequence. The summary statistics and characteristics from each GA run will be stored in E1.GA.r{runID}.trace
. Each run trace file is comprised of two major blocks. The first block lists summary statistics such as median/maximum of generation/population fitness values at the end of each generation in the genetic algorithm and could be used to assess genetic algorithm performance. The second block covers the population of tree topologies sampled by the genetic algorithm and lists the topology cost, mass cost, and fitness value associated with each tree topology, along with the list of GA generations it appeared in. The table below shows the first few lines of the first and second blocks from E1.GA.r1.trace
.
# generation count = 50
#
generation.index generationHighestFitness generationMedianFitness populationHighestFitness populationMedianFitness
1 0.109231761209 1.18467384206e-19 0.109231761209 1.18467384206e-19
2 0.158658688063 7.3699118827e-05 0.158658688063 1.77142641538e-16
3 0.605752734898 9.88610682911e-05 0.605752734898 1.91820649258e-10
4 0.605752734898 3.5195998274e-06 0.605752734898 7.18622334576e-09
5 0.605752734898 0.000511707589828 0.605752734898 4.23588236797e-08
############################################################
Topology MassCost TopologyCost Cost Fitness appearances(generationID:count)
((2,3,4)1,(6)5)0 0.0 0.0 0.0 1.0 31:1,43:2,45:1,38:1,47:3
((3,4)1,2,(6)5)0 0.0039 0.0 0.0039 0.9806 13:1,16:1,17:1,...
((2,3)1,((4)6)5)0 0.0 0.0256 0.0256 0.87985 12:1,15:1,16:1,17:1,...
((3)1,2,((4)6)5)0 0.0039 0.0256 0.0295 0.86286 14:1,15:2,16:4,17:4,...
((3,4)1,(2,6)5)0 0.0746 0.0 0.0746 0.68847 12:1,17:1,18:2,19:1,...
After completion of all runs, the trace files from independent runs are combined to generate E1.GA.trace
which contains a list of all tree topologies sampled across all runs, sorted by their fitness values. Next, two summary panel plots reporting the maximum population fitness and the number of trees with maximum fitness after each generation are created.
Finally from the overall GA trace file E1.GA.trace
, the consensus tree of all tree topologies sharing the maximum fitness value sampled is generated, and visualized.
#parent child frequency label
0 1 1.0 1/1
0 5 1.0 1/1
1 2 1.0 1/1
1 3 1.0 1/1
1 4 1.0 1/1
5 6 1.0 1/1
The consensus tree file shown here lists the union of directed edges present across all maximum fitness trees, along with the fraction of maximum fitness trees sharing each edge.
This operational mode for runSchism
script is intended for the more advanced user and allows execution of a selected subset of SCHISM analysis steps, as well as parallel computing. The input data for this example were generated from a simulated subclonal phylogenetic tree with 15 mutations clusters. Cellularity values were generated for seven simulated samples. The figure below depicts the tree topology used as the basis of simulation, along with mutation cluster cellularity values in seven simulated biopsy samples S1-7.
The data corresponding to this example can be found in SCHISM bundle at data/example2/
, and the input files corresponding to this analysis are E2.clusterEstimates.tsv
and E2.mutation-to-cluster.tsv
. The first few lines of these input files are listed here.
# E2.clusterEstimates.tsv
sampleID mutationID cellularity sd
s1 0 1 0.05
s1 1 0.6 0.05
s1 2 0.35 0.05
s1 3 0 0.05
s1 4 0 0.05
# E2.mutation-to-cluster.tsv
mutationID clusterID
0 0
1 1
2 2
3 3
4 4
Please note that a single cellularity value for each cluster is available in this example, and thus there is a one-to-one mapping between entries in the two files above. Also, since no vote aggregation among members of a mutation cluster is necessary, the test_level
parameter in the hypothesis test block from the configuration file is set to clusters
.
-
The analysis begins by preparing the input
E2.cluster.cellularity
for the SCHISM hypothesis test :runSchism prepare_for_hypothesis_test -c E2.yaml
-
The hypothesis can be performed by:
runSchsim hypothesis_test -c E2.yaml
This steps generate the CPOV output
E2.HT.cpov
. Each line in the tab-delimited fileE2.HT.cpov
is formatted as:<mutationID1> <mutationID2> <cost>
where the cost determines the topology cost assessed by the genetic algorithm (GA), associated with an arrangement where
mutationID1
is ancestral tomutationID2
in a given phylogenetic tree; i.e.mutationID1
is in the same lineage as and precedesmutationID2
. While the contents of this file by default reflect the outcome of the SCHISM hypothesis test, the cost values can be modulated using additional knowledge of tumor biology (e.g. synthetic lethal interactions) when available to inform prioritization of tree topologies. -
(optional) A heatmap of the hypothesis test results stored in
E2.HT.cpov
can be generated by:runSchism plot_cpov -c E2.yaml
In the binary heatmap above, the white and purple squares represent rejection of the null hypothesis and lack thereof respectively.
-
The independent runs of the GA can be performed in sequence or parallel. The serial mode which is most convenient for most users, and can efficiently handle trees with relatively few nodes (<9) can be initiated by:
runSchism run_ga --config E2.yaml --mode serial
Advanced users working with complex tree topologies with large node count (>= 10) can initiate independent runs of the GA in parallel by making distinct calls from a compute cluster using a syntax similar to:
runSchism run_ga --config E2.yaml --mode parallel --runID 1
The
runID
flag will be used to assign a numeric ID (ranging from 1 toinstance_count
parameter from the configuration file) to each run of the genetic algorithm. -
Once all the independent GA runs have completed, their results can be aggregated to generate summary trace tables and plots:
runSchism summarize_ga_results -c E2.yaml
-
The consensus of the maximum fitness trees sampled across all runs of GA, along with a graphical representation can be generated by:
runSchism consensus_tree -c E2.yaml