Usage Examples

llyzhng edited this page Jun 23, 2018 · 12 revisions
Clone this wiki locally

This page covers two usage examples corresponding to the two operational modes of runSchism master script.

The input files, along with example outputs are available at:

  • Sequential Mode (E1)
  • Step-Through Mode (E2)

Please download and decompress the tar balls above in your path of choice. Each example directory includes a wrapper shell script, including all analysis steps described below. You can either use the provided wrapper scripts or make direct calls to runSchism.

If you decide to make direct calls, change the working_dir parameter in the example configuration files (i.e. E1.yaml or E2.yaml) to the absolute path of the directories where the input files are located in each example. At this point, you are ready to follow the instructions below.

Sequential Mode

The following example illustrates the usage of runSchism master script in sequential mode. The input data for this example were generated from a simulated subclonal phylogenetic tree which 7 mutations clusters. Cellularity values were generated for three simulated samples.

In this simulated example, we make the assumption that all somatic mutations are assumed to have a coverage depth of 500X and the simulated tumor samples have purity of 50% (50% normal cell contamination is present). The figure below depicts the tree topology used as the basis of simulation, along with mutation cluster cellularity values in three simulated biopsy samples S1-3.

The data corresponding to this example can be found in SCHISM package at data/example1/. Here, we will use runSchism master script in sequential mode and allow it run all SCHISM analysis steps in order, by calling:

runSchism analyze --config E1.yaml

The SCHISM analysis proceeds as follows. As indicated in the configuration file E1.yaml, the inputs to this example of SCHISM analysis are E1.mutationReadsCNA.tsv and E1.mutation-to-cluster.tsv.

First, the cellularity of each somatic mutation will be inferred in each biopsy sample using read counts and copy number values listed in E1.mutationReadsCNA.tsv and stored in E1.mutation.cellularity. Also, an estimate of cellularity for each mutation cluster is calculated using E1.mutation-to-cluster.tsv and stored in E1.cluster.cellularity. The top 5 rows of E1.mutation.cellularity file is listed below:

sampleID        mutationID      cellularity     sd
  S1                0             1.0000      0.0789
  S1                1             0.9520      0.0762
  S1                2             1.0000      0.0790
  S1                3             1.0000      0.0785

Please note that SCHISM requires cellularity/read count data for all mutations in all samples. For cases where the mutation is absent in a sample, it uses the reference and alternate (small or zero) read counts to estimate the confidence interval of cellularity of mutation in the sample.

Next, SCHISM hypothesis will be applied to each pair of mutations and the p-value and test results will be stored in E1.HT.pvalues and E1.HT.pov. For each pair of mutation clusters, the test results from all pairs of mutations from the two cluster are aggregated as votes, and the result is stored in E1.HT.cpov. The CPOV table contains the topology cost associated with lineage relationships between each pair of mutation clusters, and is visualized as a heatmap in E1.HT.cpov.pdf.

In cases where the hypothesis test level is set to mutations, the heatmap will be labeled with the result of vote aggregation in each mutation cluster. Otherwise, the resulting output_prefix.HT.cpov.pdf will be a binary heatmap where white and purple squares represent rejection of the null hypothesis and lack thereof respectively.

In this example, 10 independent runs of genetic algorithm (GA), each with 50 generations and 100 tree topologies in each generation are performed in sequence. The summary statistics and characteristics from each GA run will be stored in E1.GA.r{runID}.trace. The each run trace file is comprised of two major blocks. The first block lists summary statistics such as median/maximum of generation/population fitness values at the end of each generation in genetic algorithm, and could be used to assess genetic algorithm performance. The second block covers the population of tree topologies sampled by the genetic algorithm and lists the topology cost, mass cost, and fitness value associated with each tree topology, along with the list of GA generations it appeared in. The table below shows the first few lines of the first and second block from E1.GA.r1.trace.

# generation count = 50
generation.index generationHighestFitness generationMedianFitness populationHighestFitness populationMedianFitness
1                    0.109231761209           1.18467384206e-19       0.109231761209          1.18467384206e-19
2                    0.158658688063           7.3699118827e-05        0.158658688063          1.77142641538e-16
3                    0.605752734898           9.88610682911e-05       0.605752734898          1.91820649258e-10
4                    0.605752734898           3.5195998274e-06        0.605752734898          7.18622334576e-09
5                    0.605752734898           0.000511707589828       0.605752734898          4.23588236797e-08
Topology            MassCost TopologyCost  Cost   Fitness appearances(generationID:count)
((2,3,4)1,(6)5)0       0.0      0.0         0.0     1.0     31:1,43:2,45:1,38:1,47:3
((3,4)1,2,(6)5)0      0.0039    0.0       0.0039   0.9806  13:1,16:1,17:1,...
((2,3)1,((4)6)5)0      0.0     0.0256     0.0256  0.87985  12:1,15:1,16:1,17:1,...
((3)1,2,((4)6)5)0     0.0039   0.0256     0.0295  0.86286  14:1,15:2,16:4,17:4,...
((3,4)1,(2,6)5)0      0.0746    0.0       0.0746  0.68847  12:1,17:1,18:2,19:1,...

After completion of all runs, the trace files from independent runs are combined to generate E1.GA.trace which contains a list of all tree topologies sampled across all runs, sorted by their fitness values. Next, two summary panel plots reporting the maximum population fitness and the number of trees with maximum fitness after each generation are created. Finally from the overall GA trace file E1.GA.trace, the consensus tree of all tree topologies sharing the maximum fitness value sampled is generated, and visualized.

#parent child   frequency   label
0         1        1.0       1/1
0         5        1.0       1/1
1         2        1.0       1/1
1         3        1.0       1/1
1         4        1.0       1/1
5         6        1.0       1/1

The consensus tree file shown here lists the union of directed edges present across all maximum fitness trees, along with the fraction of maximum fitness trees sharing each edge.

Step-Through Mode

This operational mode for runSchism script is intended for more advanced user and allows execution of a selected subset of SCHISM analysis steps, as well as parallel computing. The input data for this example were generated from a simulated subclonal phylogenetic tree which 15 mutations clusters. Cellularity values were generated for seven simulated samples. The figure below depicts the tree topology used as the basis of simulation, along with mutation cluster cellularity values in seven simulated biopsy samples S1-7.

The data corresponding to this example can be found in SCHISM bundle at data/example2/, and the input files corresponding to this analysis are E2.clusterEstimates.tsv and E2.mutation-to-cluster.tsv. The first few lines of these input files are listed here.

# E2.clusterEstimates.tsv
sampleID      mutationID      cellularity     sd
    s1              0                1         0.05
    s1              1               0.6        0.05
    s1              2               0.35       0.05
    s1              3                0         0.05
    s1              4                0         0.05 
# E2.mutation-to-cluster.tsv
mutationID clusterID
      0         0
      1         1
      2         2
      3         3
      4         4

Please note that a single cellularity value for each cluster is available in this example, and thus there is a one-to-one mapping between entries in the two files above. Also, since no vote aggregation among members of a mutation cluster is necessary, the test_level parameter in hypothesis test block from the configuration file is set to clusters.

  1. The analysis begins by preparing the input E2.cluster.cellularity for the SCHISM hypothesis test :

    runSchism prepare_for_hypothesis_test -c E2.yaml
  2. The hypothesis can be performed by:

    runSchsim hypothesis_test -c E2.yaml

    This steps generate the CPOV output E2.HT.cpov. Each line in the tab-delimited file E2.HT.cpov is formatted as:

    <mutationID1>    <mutationID2>    <cost>

    where the cost determines the topology cost assessed by the genetic algorithm (GA), associated with an arrangement where mutationID1 is ancestral to mutationID2 in a given phylogenetic tree; i.e. mutationID1 is in the same lineage as and precedes mutationID2. While the contents of this file by default reflect the outcome of SCHISM hypothesis test, the cost values can be modulated using additional knowledge of tumor biology (e.g. synthetic lethal interactions) when available to inform prioritization of tree topologies.

  3. (optional) A heatmap of the hypothesis test results stored in E2.HT.cpov can be generated by:

    runSchism plot_cpov -c E2.yaml

In the binary heatmap above, the white and purple squares represent rejection of the null hypothesis and lack thereof respectively.
  1. The independent runs of the GA can be performed in sequence or parallel. The serial mode which is most convenient for most users, and can efficiently handle trees with relatively few nodes (<9) can be initiated by:

    runSchism run_ga --config E2.yaml --mode serial

    Advanced users working with complex tree topologies with large node count (>= 10) can initiate independent runs of the GA in parallel by making distinct calls from a compute cluster using a syntax similar to:

    runSchism run_ga --config E2.yaml --mode parallel --runID 1

    The runID flag will be used to assign a numeric ID (ranging from 1 to instance_count parameter from the configuration file) to each run of the genetic algorithm.

  2. Once all the independent GA runs have completed, their results can be aggregated to generate summary trace tables and plots:

    runSchism summarize_ga_results -c E2.yaml

  1. The consensus of the maximum fitness trees sampled across all runs of GA, along with a graphical representation can be generated by:

    runSchism consensus_tree -c E2.yaml