llyzhng edited this page Sep 27, 2019 · 81 revisions

Running SCHISM

You can run SCHISM on your input data by calling the runSchism master script from the command line. The script supports two operational modes. It should be available on your system's path if the installation of SCHISM completed successfully. You can verify this by calling:

runSchism --help

The above command should print out the help message.

The first operational mode, Sequential mode, is appropriate for most users. It takes the input data sequentially through all the steps of the SCHISM analysis (listed under Analysis Steps):

runSchism analyze -c experiment.yaml

where the input configuration file experiment.yaml contains the parameter settings and analysis choices.

The second mode, Step-Through mode, is intended for more advanced users. It allows separate calls to the runSchism script, each performing one specific analysis step. This mode also enables optional parallelization of independent genetic algorithm runs. Please see the usage examples for more details.

Command Line Usage:

runSchism [Argument] [options]

Arguments:

  • Sequential mode:
    • analyze: calls runSchism to perform all analysis steps
  • Step-Through mode:
    • prepare_for_hypothesis_test: prepare input data for hypothesis test
    • hypothesis_test: perform hypothesis test
    • cluster_mutations: cluster mutations based on their hypothesis test results (optional)
    • confirm_clusters: generate the CPOV matrix and estimate cluster cellularity values using the cluster definitions generated above (optional)
    • plot_cpov: visualize hypothesis test results
    • run_ga: run genetic algorithm
    • summarize_ga_results: gather results from independent runs of GA, generate summary plots
    • consensus_tree: generate and visualize the consensus of maximum fitness trees across all runs of GA

Options:

  • Sequential mode:
    • -c, --config: analysis configuration file
  • Step-Through mode:
    • -c, --config: analysis configuration file
    • -m, --mode: GA run mode (serial or parallel); should accompany the run_ga argument
    • -r, runID: GA run ID, a numeric value ranging from 1 to the instance_count parameter in the configuration file; should accompany the run_ga argument if mode is set to parallel
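
As a minimal sketch of Step-Through mode, the Python snippet below assembles the runSchism command lines in pipeline order without executing them. The helper name build_commands and the example values are hypothetical. Only the step names and the -c, -m, and -r options are taken from the usage summary above.

```python
def build_commands(config_path, instance_count, parallel=False, cluster=False):
    """Return the runSchism calls for Step-Through mode as argv lists."""
    steps = ["prepare_for_hypothesis_test", "hypothesis_test"]
    if cluster:
        # Optional steps, needed only when SCHISM clusters the mutations.
        steps += ["cluster_mutations", "confirm_clusters"]
    steps.append("plot_cpov")
    commands = [["runSchism", step, "-c", config_path] for step in steps]

    if parallel:
        # One independent call per GA run; run IDs range from 1 to instance_count.
        for run_id in range(1, instance_count + 1):
            commands.append(["runSchism", "run_ga", "-c", config_path,
                             "-m", "parallel", "-r", str(run_id)])
    else:
        commands.append(["runSchism", "run_ga", "-c", config_path, "-m", "serial"])

    for step in ["summarize_ga_results", "consensus_tree"]:
        commands.append(["runSchism", step, "-c", config_path])
    return commands


if __name__ == "__main__":
    for argv in build_commands("experiment.yaml", instance_count=2, parallel=True):
        print(" ".join(argv))
```

Each argv list could be handed to subprocess.run once runSchism is on the path. The parallel run_ga calls do not depend on each other, so they are the natural unit to dispatch to separate workers.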

Input Data

SCHISM analysis requires one or two input data files. The first input file depends on the computational tool chosen to estimate mutation cellularity. If the user wishes to use SCHISM to estimate the cellularity of somatic mutations in tumor samples, the first input is a tab-delimited table listing allele-specific read counts and the integer copy number value for each somatic mutation in each sample, formatted as in the following example.

sampleID    mutationID    referenceReads    variantReads    copyNumber
  TUM           1              120                 93             2
  TUM           2              180                140             2
  MET           1              139                 64             1
  MET           2              132                 77             2

Update: Starting from SCHISM-1.1.3, the user can include an additional column listing the multiplicity of each mutation. This variable reflects the number of mutated copies present in cancer cells. If mutation multiplicity is specified in the input, the program can extend the cellularity estimation step to mutations in aneuploid regions of the genome.
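
To make the roles of copy number, multiplicity, and purity concrete, here is a minimal sketch of a commonly used point estimate of cellularity from read counts. It is an illustration under stated assumptions (normal cells carry two copies of the locus, and reads are sampled in proportion to allele abundance), not necessarily the estimator SCHISM implements. The function name and the purity value of 0.9 are hypothetical.

```python
def estimate_cellularity(ref_reads, var_reads, copy_number, purity, multiplicity=1):
    """Point estimate of the fraction of tumor cells carrying a mutation.

    Assumes normal cells carry 2 copies of the locus and that the expected
    variant allele fraction is
        purity * multiplicity * cellularity
        / (purity * copy_number + 2 * (1 - purity)).
    """
    depth = ref_reads + var_reads
    if depth == 0:
        raise ValueError("no reads cover this mutation")
    vaf = var_reads / depth
    # Average number of copies of the locus per cell in the mixed sample.
    total_copies = purity * copy_number + 2.0 * (1.0 - purity)
    return vaf * total_copies / (purity * multiplicity)


if __name__ == "__main__":
    # First row of the example table (120 reference reads, 93 variant reads,
    # copy number 2), with an assumed purity of 0.9.
    print(round(estimate_cellularity(120, 93, 2, purity=0.9), 3))
```

Because of sampling noise, an estimate of this kind can exceed 1, so downstream code would typically clip it or carry a confidence interval alongside it.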

On the other hand, if other computational tools were used to estimate the cellularity of somatic mutations, this input is a tab-delimited list of the estimated cellularity and its standard error (the sd column) for each somatic mutation in each tumor sample, e.g.

sampleID    mutationID    cellularity    sd
  TUM           1            0.970      0.076
  TUM           2            0.972      0.062
  MET           1            0.525      0.054
  MET           2            0.982      0.089

Please note that SCHISM requires cellularity or read count data for all mutations in all samples. Where a mutation is absent from a sample, SCHISM uses the reference and variant (small or zero) read counts to estimate the confidence interval of the mutation's cellularity in that sample.
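
Since every mutation must appear in every sample, a quick completeness check before running SCHISM can save a failed run. The sketch below is a hypothetical helper (not part of SCHISM) that reads a strictly tab-delimited table with the sampleID and mutationID columns shown above and reports the missing sample/mutation pairs.

```python
import csv
import io


def missing_entries(table_text):
    """Return (sampleID, mutationID) pairs absent from a tab-delimited table."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    seen = {(row["sampleID"], row["mutationID"]) for row in reader}
    samples = sorted({sample for sample, _ in seen})
    mutations = sorted({mutation for _, mutation in seen})
    # Every sample is expected to list every mutation seen anywhere in the table.
    return [(s, m) for s in samples for m in mutations if (s, m) not in seen]


if __name__ == "__main__":
    example = (
        "sampleID\tmutationID\treferenceReads\tvariantReads\tcopyNumber\n"
        "TUM\t1\t120\t93\t2\n"
        "TUM\t2\t180\t140\t2\n"
        "MET\t1\t139\t64\t1\n"
    )
    print(missing_entries(example))
```

The same function works for the cellularity table, because it only relies on the two identifier columns.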

The second (optional) input is a tab-delimited file that assigns each somatic mutation to a mutation cluster, following the format:

mutationID    clusterID
     1              1
     2              2
     3              2
     4              2
     5              3
     6              3
     7              4

Update: A new clustering module has been added to SCHISM (starting at 1.1.0). To enable clustering of mutations by SCHISM, please see the relevant section under configuration file.

Configuration File

The analysis configuration file passes the user's parameter settings and analysis choices to the runSchism master script. The information listed in this configuration file can be divided into five distinct blocks.

Path specifications

  • working_dir: sets the SCHISM working directory. All input and output paths will be relative to this directory.
  • mutation_to_cluster_assignment: relative path to a tab-delimited mutation-to-cluster assignment file. If SCHISM is selected to perform the cluster analysis, the results will be stored in this path. Otherwise, this file is a user input (see the second input file under Input Data).
  • mutation_raw_input: relative path to a tab-delimited file of mutation read counts and integer copy number values across tumor samples of a patient. This input is required when using SCHISM to estimate somatic mutation cellularity.
  • mutation_cellularity_input: relative path to a tab-delimited file of mutation cellularity estimates and standard error across samples of a patient. This input is required when external tools are used to estimate somatic mutation cellularity.
  • output_prefix: SCHISM results will be stored in working_dir with names starting with output_prefix.

Mutation Cellularity Estimation

  • cellularity_estimation: choice of computational tool to estimate mutation cellularities. If equal to "schism", the file indicated by mutation_raw_input will be used to estimate cellularities. If equal to "other", SCHISM expects mutation cellularities and standard errors to be available in mutation_cellularity_input.

  • cellularity estimator: relevant where SCHISM is used for cellularity estimation.
    • coverage_threshold: integer value indicating the minimum coverage depth required to estimate cellularity. Mutations with coverage below this threshold are assigned a missing cellularity value.
    • absent_mode: integer value, either 0 or 1. Determines the cellularity value and standard error assigned to mutations with a variantReads count of 0. A value of 1 assigns such mutations the default values of 0 for cellularity and 0.05 for standard error. A value of 0 adds 1 pseudo-count to both the reference and variant read counts.
  • tumor_sample_purity: relevant where SCHISM is used for cellularity estimation.
    • each item under this subsection is a sampleID followed by its estimated purity level, e.g. (TUM: 0.8)
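
The interplay of coverage_threshold and absent_mode can be sketched as follows. This is an illustration of the rules described above, not SCHISM's own code: the function name, the returned dictionary, the definition of coverage as reference plus variant reads, and the order in which the two checks are applied are all assumptions.

```python
def adjust_counts(ref_reads, var_reads, coverage_threshold, absent_mode):
    """Describe how one mutation's read counts would be treated."""
    # Assumption: coverage is the sum of reference and variant reads.
    if ref_reads + var_reads < coverage_threshold:
        # Too shallow: cellularity is reported as missing.
        return {"status": "missing"}
    if var_reads == 0:
        if absent_mode == 1:
            # Default values for mutations with no variant reads.
            return {"status": "default", "cellularity": 0.0, "sd": 0.05}
        if absent_mode == 0:
            # One pseudo-count added to both reference and variant reads.
            return {"status": "estimate", "ref": ref_reads + 1, "var": 1}
        raise ValueError("absent_mode must be 0 or 1")
    return {"status": "estimate", "ref": ref_reads, "var": var_reads}


if __name__ == "__main__":
    print(adjust_counts(150, 0, coverage_threshold=50, absent_mode=1))
    print(adjust_counts(150, 0, coverage_threshold=50, absent_mode=0))
```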

Hypothesis Test

  • hypothesis_test:
    • test_level: "mutations" or "clusters". This parameter determines whether the hypothesis test is performed on pairs of mutations or pairs of mutation clusters. A hypothesis test on pairs of mutation clusters directly yields the CPOV matrix (example E2). A hypothesis test on pairs of mutations needs to be followed by a vote aggregation step to derive the CPOV matrix (example E1). Please note that if SCHISM is asked to cluster mutations (cluster_analysis: schism), this parameter should be set to "mutations".
    • significance_level: fractional value in [0,1], the significance level (alpha) used to reject the null hypothesis.
    • store_pvalues: binary flag indicating whether hypothesis test p-values should be stored.
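
The constraints on this block are simple enough to check before launching a run. The sketch below assumes the configuration has already been loaded into a Python dictionary and that cluster_analysis is a top-level key. The function name and that layout are hypothetical; the rules themselves are the ones stated above.

```python
def check_hypothesis_test(config):
    """Return a list of problems found in the hypothesis-test settings."""
    problems = []
    settings = config.get("hypothesis_test", {})
    level = settings.get("test_level")
    if level not in ("mutations", "clusters"):
        problems.append("test_level must be 'mutations' or 'clusters'")
    alpha = settings.get("significance_level")
    if not isinstance(alpha, (int, float)) or not 0 <= alpha <= 1:
        problems.append("significance_level must lie in [0, 1]")
    # Clustering by SCHISM works on mutation-level test results.
    if config.get("cluster_analysis") == "schism" and level != "mutations":
        problems.append("cluster_analysis: schism requires test_level: mutations")
    return problems


if __name__ == "__main__":
    config = {
        "cluster_analysis": "schism",
        "hypothesis_test": {"test_level": "clusters", "significance_level": 0.05},
    }
    for problem in check_hypothesis_test(config):
        print(problem)
```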

Cluster Analysis

  • cluster_analysis: choice of computational tool to cluster mutations. If set to schism, mutations will be clustered based on their hypothesis test results, and the output will be stored in the path specified by mutation_to_cluster_assignment. Otherwise, the program assumes that the above path contains cluster definitions. If set to schism, the test_level parameter in the hypothesis test block should be set to "mutations".

  • clustering_method: Example parameter sets for cluster analysis are available in the SCHISM repository at data/schism.yaml.

  • algorithm: one of AP (Affinity Propagation), DBSCAN, or KMeans.

  • min_cluster_count: minimum number of clusters.

  • max_cluster_count: maximum number of clusters.

  • verbose: verbosity binary flag.

  • min_preference: minimum value of preference parameter (when algorithm is AP)

  • max_preference: maximum value of preference parameter (when algorithm is AP)

  • preference_increments: increments in preference grid search (when algorithm is AP)

  • min_eps: minimum value of epsilon parameter (when algorithm is DBSCAN)

  • max_eps: maximum value of epsilon parameter (when algorithm is DBSCAN)

  • eps_increments: increments in epsilon grid search (when algorithm is DBSCAN)

  • min_minPts: minimum value of minPts parameter (when algorithm is DBSCAN)

  • max_minPts: maximum value of minPts parameter (when algorithm is DBSCAN)

  • minPts_increments: increments in minPts grid search (when algorithm is DBSCAN)

  • n_init: number of random initializations (when algorithm is KMeans)
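
The min/max/increments triplets above describe a grid of candidate parameter values. As a rough sketch of what such a grid contains, the hypothetical helper below enumerates the values from the minimum to the maximum. Whether SCHISM treats the maximum as inclusive is an assumption here, and the example values are made up.

```python
from itertools import product


def parameter_grid(minimum, maximum, increment):
    """Enumerate minimum, minimum + increment, ... up to and including maximum."""
    if increment <= 0:
        raise ValueError("increment must be positive")
    if maximum < minimum:
        raise ValueError("maximum must not be smaller than minimum")
    # Count steps with a small tolerance so float error does not drop the last value.
    steps = int((maximum - minimum) / increment + 1e-9)
    return [round(minimum + i * increment, 10) for i in range(steps + 1)]


if __name__ == "__main__":
    eps_values = parameter_grid(0.2, 0.8, 0.2)      # min_eps, max_eps, eps_increments
    min_pts_values = parameter_grid(2, 6, 2)        # min_minPts, max_minPts, minPts_increments
    # For DBSCAN, every (eps, minPts) pair is one candidate setting.
    for eps, min_pts in product(eps_values, min_pts_values):
        print(eps, min_pts)
```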

Genetic Algorithm (GA)

  • genetic_algorithm:
  • instance_count: integer, the number of independent runs of the genetic algorithm to be performed. By default, SCHISM runs the independent GA instances in sequence (serial). When studying complex trees (node counts > 10), advanced users with access to parallel computing resources can run the independent GA instances in parallel to reduce the computation time (instructions are available in the usage examples page).
  • generation_count: integer, the number of generations to run in each GA instance.
  • generation_size: integer, the number of tree topologies in each generation of GA.
  • random_object_fraction: fractional value in [0,1], the proportion of tree topologies in each generation that are not descendants of topologies from previous generations but are randomly generated.
  • mutation_probability: fractional value in [0,1], the probability with which the genetic algorithm mutation operator is applied when generating descendant topologies.
  • crossover_probability: fractional value in [0,1], the probability with which the genetic algorithm crossover operator is applied when generating descendant topologies.
  • fitness_coefficient: numeric value determining the log fold decrease in fitness of tree topologies corresponding to a unit increase in total cost.
  • verbose: binary flag. A value of True will prompt SCHISM to print out stats after each generation of GA.
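
Two of these parameters lend themselves to a short numeric illustration. The sketch below reads fitness_coefficient as the drop in natural-log fitness per unit of cost, and random_object_fraction as a share of generation_size rounded to the nearest integer. Both readings, and the function names, are assumptions made for illustration rather than SCHISM's actual implementation.

```python
import math


def fitness(total_cost, fitness_coefficient):
    """Fitness that drops by fitness_coefficient log units per unit of cost."""
    return math.exp(-fitness_coefficient * total_cost)


def generation_split(generation_size, random_object_fraction):
    """Split a generation into (randomly generated, descendant) topology counts."""
    n_random = int(round(generation_size * random_object_fraction))
    return n_random, generation_size - n_random


if __name__ == "__main__":
    # A tree with one extra unit of cost is exp(5) times less fit when the coefficient is 5.
    print(fitness(3, 5) / fitness(4, 5))
    print(generation_split(100, 0.2))
```

Under this reading, a larger fitness_coefficient makes the search more strongly favor low-cost topologies, while a larger random_object_fraction injects more fresh topologies into each generation.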

Analysis Steps

The SCHISM analysis of input mutation data involves the following steps:

  1. prepare_for_hypothesis_test: Preparation of input data for the hypothesis test module
  • input: mutation_to_cluster_assignment and
    • mutation_raw_input if using SCHISM cellularity estimation
    • mutation_cellularity_input if using external cellularity estimates
  • output: output_prefix.cluster.cellularity and
    • output_prefix.mutation.cellularity if using SCHISM cellularity estimation
  2. hypothesis_test: Performing SCHISM hypothesis test to derive mutation and cluster precedence order violation matrices
  • input:
    • output_prefix.cluster.cellularity if test_level equals "clusters" and cluster_analysis is not set to schism
    • output_prefix.mutation.cellularity if test_level equals "mutations" and using SCHISM for cellularity estimation
    • mutation_cellularity_input if test_level equals "mutations" and using external tools for cellularity estimation
  • output:
    • output_prefix.HT.pov if test_level equals "mutations"
    • output_prefix.HT.cpov if cluster_analysis is not "schism"
    • output_prefix.HT.pvalues if store_pvalues is set to True
  3. cluster_mutations: (New) Performing clustering of mutations based on their hypothesis test results, when no user clusters are available. SCHISM can apply three different clustering methods to the input data. It is recommended that the user attempts a few different methods, and picks the most reasonable results by comparison of provided Silhouette Coefficients or manual inspection.
  • input:
    • output_prefix.HT.pov
  • output:
    • mutation_to_cluster_assignment
  4. confirm_clusters: (New) Validate the clustering results after review by user. This step is only required if step 3 is run.
  • input:
    • mutation_to_cluster_assignment
    • output_prefix.HT.pov
    • output_prefix.mutation.cellularity
  • output:
    • output_prefix.HT.cpov
    • output_prefix.cluster.cellularity
  5. plot_cpov: Visualization of hypothesis test results
  • input:
    • output_prefix.HT.cpov
  • output:
    • output_prefix.HT.cpov.pdf
  6. run_ga: Genetic Algorithm (GA) search to sample candidate tree topologies
  • input:
    • output_prefix.HT.cpov
    • output_prefix.cluster.cellularity
  • output:
    • output_prefix.GA.r{instanceID}.trace
  7. summarize_ga_results: Merging the results from independent runs of the GA, and generating summary trace plots
  • input:
    • output_prefix.GA.r{instanceID}.trace
  • output:
    • output_prefix.GA.trace
    • output_prefix.GA.fitnessTrace.pdf
    • output_prefix.GA.topTreeCount.pdf
  8. consensus_tree: Derivation and Visualization of the consensus tree
  • input:
    • output_prefix.GA.trace
  • output:
    • output_prefix.GA.consensusTree
    • output_prefix.GA.consensusTree.pdf
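
When running in Step-Through mode, it can help to know which files a step is expected to leave behind before moving on to the next one. The sketch below encodes part of the list above as a lookup. The function name is hypothetical, the value "other" for cluster_analysis stands for any setting other than schism, and {instanceID} is assumed to be the integer run ID from 1 to instance_count.

```python
def expected_outputs(step, prefix, test_level="clusters", cluster_analysis="other",
                     store_pvalues=False, instance_count=1):
    """List the files a step is documented to write, named with the output prefix."""
    if step == "hypothesis_test":
        files = []
        if test_level == "mutations":
            files.append(prefix + ".HT.pov")
        if cluster_analysis != "schism":
            files.append(prefix + ".HT.cpov")
        if store_pvalues:
            files.append(prefix + ".HT.pvalues")
        return files
    if step == "run_ga":
        # One trace file per independent GA run.
        return ["%s.GA.r%d.trace" % (prefix, i) for i in range(1, instance_count + 1)]
    fixed = {
        "confirm_clusters": [".HT.cpov", ".cluster.cellularity"],
        "plot_cpov": [".HT.cpov.pdf"],
        "summarize_ga_results": [".GA.trace", ".GA.fitnessTrace.pdf",
                                 ".GA.topTreeCount.pdf"],
        "consensus_tree": [".GA.consensusTree", ".GA.consensusTree.pdf"],
    }
    return [prefix + suffix for suffix in fixed[step]]


if __name__ == "__main__":
    print(expected_outputs("run_ga", "patient1", instance_count=3))
    print(expected_outputs("hypothesis_test", "patient1", test_level="mutations",
                           cluster_analysis="schism", store_pvalues=True))
```

Checking that these files exist in working_dir (for example with os.path.exists) before calling the next step gives an early warning when a step failed silently.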