hypercube-hmm

Hypercubic inference using Hidden Markov Models, from https://academic.oup.com/bioinformatics/article/39/1/btac803/6895098

Code for HyperHMM [1] to infer the transition probabilities on a hypercubic transition network given some observations of emitted signals from a hidden Markov model on that network. Visualisations of the inferred parameterised model and its summary dynamics are also performed. This is the expectation-maximisation cousin of HyperTraPS https://github.com/StochasticBiology/HyperTraPS (simple implementation here https://github.com/StochasticBiology/hypertraps-simple).

We also compare the behaviour of HyperHMM with HyperTraPS [2,3], Oncotrees [4], conjunctive Bayes networks (CBN) via TRONCO [5], and mutual hazard networks (MHN) [6].

Requirements

The inference code uses the C++ armadillo library [7]. Other visualisations use R libraries stringr [8], ggplot2 [9], ggrepel [10], gridExtra [11], and igraph [12].

The R implementation requires Rcpp [13] and RcppArmadillo [14].

Argument	R	Command-line	Default
Input data	obs=matrix	--obs filename	None (required)
Label for output files	[none]	--label labelstring	filename-out
Random number seed (for bootstrapping)	seed=N	--seed N	1
Number of bootstrap resamples	nboot=N	--nboot N	100
Simulate random walkers for each resample?	fullsample=N	--fullsample	(off)
Initial states for longitudinal data	initialstates=matrix	[in input data, see below]	(none)
Longitudinal data?	[implied by initialstates]	--longitudinal	(off)

The code outputs several objects. In R, these are elements of a named list that is returned by HyperHMM; from the command line, they are datafiles. These objects describe the maximum likelihood inferred transition probabilities between states ($transitions in R; transitions_labelstring.txt from the command-line), the mean and bootstrap standard deviation ($stats; wide-form mean_labelstring.txt and sd_labelstring.txt), and a set of transitions from sampled trajectories of random walkers simulated on the inferred network ($viz; graph_viz_labelstring.txt).

Data

Synthetic and published data is in Data. The ovarian cancer dataset is from [15]; the tuberculosis dataset is from [16].

In R:

Data is provided as a matrix via the required first argument "obs", where each row gives an independent snapshot observation of length L, for example

0011
0111
1101

For longitudinal data, a matrix with the same number of rows and columns can be provided describing the initial states for each observation via the argument "initialstates", for example

0001
0011
1001

Have a look at hyperhmm-demos.R for some examples, and hyperhmm-run.R for more examples from the paper.

For the command line:

For cross-sectional observations, data should be provided as a single-column file where each row gives an independent snapshot observation of length L, for example

0011
0111
1101

For longitudinal observations, including those derived from estimated phylogenies, data should be provided as a two-column file where each row gives an independent transition observation between two states of length L, for example

0001 0011
0100 0111
1001 1101

The initial state is assumed to be all 0s, and the system evolves by acquiring 1s.

Examples

In R:

Command	Description
`HyperHMM(m)`	Cross-sectional observations m
`HyperHMM(m.2, initialstates=m.1)`	Longitudinal observations m.1 -> m.2
`HyperHMM(m, nboot=1000, seed=2)`	As top row with 1000 resamples and random number seed 2

Output and plotting

In R, HyperHMM returns a named list with the following elements

Element	Description
stats	"Bubble plot" summary statistics: the mean and bootstrap s.d. of the probability that a feature is acquired at a given ordering step in the accumulation process
transitions	The weights and probability fluxes for each transition between states, for each bootstrap resample and for the original dataset (corresponding to Bootstrap == 0 in the data frame)
features	Empty by default, to allow for feature names
viz	Collection of observed transitions from simulated random walkers, represented as string containing before and after states, for use in sampled visualisation
L	The number of features in the system

Plotting is done in R, using functions in hypercube-plots.R. These include "bubble" plots for mean feature orderings, hypercube visualisations illustrating the complete transition network, and ordering graphs showing pairwise acquisition orderings.

These include:

Element	Description	Optional arguments
`plot.bubbles2`	"Bubble plot" summary statistics: the mean and bootstrap s.d. of the probability that a feature is acquired at a given ordering step in the accumulation process
`plot.hypercube2`	Visualisation of the hypercubic transition graph
`plot.pfg`	Visualisation of which feature acquisitions follow which others
`plot.standard`	Collection of the three above plot types
`plot.hypercube.flux`	(Rather improved) visualisation of the hypercubic transition graph -- recommended over the above

Old command-line formatting

In its first published form, the following was true. This input format is preserved for backwards compatibility but the setup above is recommended for flexibility.

The code takes several command line arguments:

./hyperhmm.ce [datafile] [number of features] [number of bootstrap resamples] [output file label] [cross-sectional data (0 or 1)] [simulate random walkers for each sample (0 or 1)]

datafile -- a datafile containing cross-sectional or longitudinal observations as bit strings (see below)
number of features -- the number of features/traits/characters in the system (i.e. the length of a bit string in the datafile)
number of bootstrap resamples -- 0 for a single point estimate; >0 runs the given number of resamples in addition
output file label -- a string to label output files
cross-sectional data -- if 1, data are assumed to be independent snapshots and the datafile should have one column. If 0, data are assumed to be observed transitions between two states and the datafile should have two columns (ancestor and descendant state)
simulate random walkers for each sample -- random walkers are simulated on the point estimate hypercube to summarise the dynamics. If 1, they are simulated on every bootstrapped hypercube too and the summary is over all resamples. This can make the code take much longer for simple systems.

For example, ./hyperhmm.ce Data/simple_case1_L5.txt 5 100 simple1 1 1

References

[1] Moen, M.T. and Johnston, I.G., 2023. HyperHMM: efficient inference of evolutionary and progressive dynamics on hypercubic transition graphs. Bioinformatics, 39(1), p.btac803.

[2] Johnston, I.G. and Williams, B.P., 2016. Evolutionary inference across eukaryotes identifies specific pressures favoring mitochondrial gene retention. Cell systems, 2(2), pp.101-111.

[3] Greenbury, S.F., Barahona, M. and Johnston, I.G., 2020. HyperTraPS: inferring probabilistic patterns of trait acquisition in evolutionary and disease progression pathways. Cell systems, 10(1), pp.39-51.

[4] Szabo,A. and Pappas,L. (2022) Oncotree: Estimating Oncogenetic Trees. R package version 0.3.4.

[5] De Sano,L. et al. (2016) TRONCO: an R package for the inference of cancer progression models from heterogeneous genomic data. Bioinformatics, 32, 1911–1913.

[6] Schill,R. et al. (2020) Modelling cancer progression using mutual hazard networks. Bioinformatics, 36, 241–249.

[7] Conrad Sanderson and Ryan Curtin. Armadillo: a template-based c++ library for linear algebra. Journal of Open Source Software, 1(2):26, 2016.

[8] Hadley Wickham. stringr: Simple, Consistent Wrappers for Common String Operations, 2019. R package version 1.4.0.

[9] Hadley Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.

[10] Kamil Slowikowski. ggrepel: Automatically Position Non-Overlapping Text Labels with ’ggplot2’, 2021. R package version 0.9.1.

[11] Baptiste Auguie. gridExtra: Miscellaneous Functions for ”Grid” Graphics, 2017. R package version 2.3.

[12] Gabor Csardi and Tamas Nepusz. The igraph software package for complex network research. InterJournal, Complex Systems:1695, 2006.

[13] Dirk Eddelbuettel and Romain Francois (2011). Rcpp:Seamless R and C++ Integration. Journal of Statistical Software, 40(8), 1-18, doi:10.18637/jss.v040.i08.

[14] Eddelbuettel D, Sanderson C (2014). “RcppArmadillo: Accelerating R with high-performance C++ linear algebra.” Computational Statistics and Data Analysis, 71, 1054-1063. doi:10.1016/j.csda.2013.02.005 https://doi.org/10.1016/j.csda.2013.02.005.

[15] Turid Knutsen et al. The interactive online sky/m-fish & cgh database and the entrez cancer chromosomes search database: linkage of chromosomal aberrations with the genome sequence. Genes, Chromosomes and Cancer, 44(1):52–64, 2005.

[16] Nicola Casali et al. Evolution and transmission of drug-resistant tuberculosis in a russian population. Nature genetics, 46(3):279–286, 2014.

Name		Name	Last commit message	Last commit date
Latest commit History 148 Commits
Data		Data
wrappers		wrappers
README.md		README.md
hypercube-plots.R		hypercube-plots.R
hyperhmm-demos.R		hyperhmm-demos.R
hyperhmm-r.cpp		hyperhmm-r.cpp
hyperhmm-run.R		hyperhmm-run.R
hyperhmm.cpp		hyperhmm.cpp

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

hypercube-hmm

Requirements

Contents

Examples

Inference

Data

Examples

Output and plotting

Other approaches and manuscript links

Old command-line formatting

References

About

Releases

Packages

Contributors 2

Languages

StochasticBiology/hypercube-hmm

Folders and files

Latest commit

History

Repository files navigation

hypercube-hmm

Requirements

Contents

Examples

Inference

Data

Examples

Output and plotting

Other approaches and manuscript links

Old command-line formatting

References

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages