## From Squartini and Arndt

We further want to quantify whether deviations from zero of the three indices are statistically significant when only a finite amount of sequence data is available to measure the present-day nucleotide distribution. To achieve this we compare the distribution of nucleotides, $\rho_\alpha$, of a sequence of length $N$ to the stationary distribution, $\pi_\alpha$, using a $\chi^2$ test with

$$ \chi^2 = N \sum_\alpha \frac{(\rho_\alpha - \pi_\alpha)^2}{\pi_\alpha} $$

Under the null hypothesis of stationarity, this quantity follows a $\chi^2$ distribution with 3 degrees of freedom (four nucleotide states minus one). Deviations from stationarity are significant (with 95% confidence) if $\chi^2 > 7.8147$.
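As a quick check of the numbers above, the following is a minimal sketch of the statistic, written with the squared difference in the numerator (as in the `chi_squared` function used in this notebook). The function name `chi2_stationarity` and the example frequencies are illustrative assumptions, not values from the paper.

```python
import numpy
from scipy import stats


def chi2_stationarity(rho, pi, n):
    """Chi-squared statistic and p-value comparing observed (rho) with
    stationary (pi) frequencies for a sequence of length n."""
    rho = numpy.asarray(rho, dtype=float)
    pi = numpy.asarray(pi, dtype=float)
    stat = n * numpy.sum((rho - pi) ** 2 / pi)
    # upper-tail probability, df = number of states - 1
    pval = stats.chi2.sf(stat, df=len(pi) - 1)
    return stat, pval


# 95% critical value for 3 degrees of freedom, approximately 7.8147
critical = stats.chi2.ppf(0.95, df=3)

# hypothetical frequencies, for illustration only
rho = [0.30, 0.20, 0.20, 0.30]
pi = [0.25, 0.25, 0.25, 0.25]
stat, pval = chi2_stationarity(rho, pi, n=1000)
print(critical, stat, pval)
```

With these made-up inputs the statistic is well above the critical value, so the deviation would be called significant.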

## Note

We implement a variant of this test that is better suited to our null hypothesis. Specifically, only the foreground edge is evolving in a manner consistent with the null hypothesis. As a consequence, we apply the Squartini and Arndt test only to the foreground edge, comparing the observed nucleotide frequencies of that edge's sequence with the stationary distribution for that edge.
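The stationary distribution for an edge can be derived from that edge's substitution probability matrix. The sketch below shows one common way to do this, taking the left eigenvector of a row-stochastic matrix with eigenvalue 1. It is an assumption-laden illustration: `stationary_pi_sketch` is a hypothetical helper, the example matrix is made up, and this is not necessarily how `mdeq.stationary_pi.get_stat_pi_via_eigen` is implemented.

```python
import numpy


def stationary_pi_sketch(P):
    """Stationary distribution of a row-stochastic matrix P (hypothetical helper).

    Solves pi @ P = pi by taking the eigenvector of P.T whose eigenvalue
    is closest to 1, then normalising it to sum to 1.
    """
    P = numpy.asarray(P, dtype=float)
    evals, evecs = numpy.linalg.eig(P.T)
    idx = numpy.argmin(numpy.abs(evals - 1.0))
    pi = numpy.real(evecs[:, idx])
    # dividing by the sum also fixes the arbitrary sign of the eigenvector
    return pi / pi.sum()


# made-up 4-state substitution probability matrix; each row sums to 1
P_example = numpy.array(
    [
        [0.90, 0.04, 0.03, 0.03],
        [0.02, 0.92, 0.03, 0.03],
        [0.05, 0.05, 0.85, 0.05],
        [0.01, 0.02, 0.02, 0.95],
    ]
)
pi_example = stationary_pi_sketch(P_example)
print(pi_example)
```

The result can be checked by confirming that `pi_example @ P_example` reproduces `pi_example`.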

In [None]:
from cogent3 import open_data_store, make_table, get_app
import pathlib
from cogent3.app import typing as c3_types
from cogent3.app.composable import define_app
from scipy import stats
import numpy

from mdeq.bootstrap import compact_bootstrap_result
from mdeq.stationary_pi import get_stat_pi_via_eigen
from mdeq.utils import load_from_sqldb

import project_paths

In [None]:
synthetic_GSN_fits_paths = list(
    (project_paths.RESULT_DIR / "micro/toe/fg-GSN-toe/").glob("*.sqlitedb")
)

In [None]:
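def chi_squared(pi, pi_inf, n, motif_order):
    """Returns the chi-squared statistic and p-value for observed vs stationary frequencies."""
    # observed frequencies, arranged in motif_order
    pi_obs = numpy.array([pi[nt] for nt in motif_order])
    pi_exp = numpy.array(pi_inf)

    # n * sum((observed - expected)^2 / expected)
    chi_sum = n * numpy.sum(((pi_obs - pi_exp) ** 2) / pi_exp)
    # upper-tail probability, df = number of states - 1
    p = stats.chi2.sf(chi_sum, df=len(motif_order) - 1)

    return chi_sum, p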


def squartini_arndt_test(hyp_result: compact_bootstrap_result):
    hyp_result.deserialised_values()
    observed_gn = hyp_result.observed["GN"]
    aln = observed_gn.alignment
    fg_edge = aln.info["fg_edge"]
    pi = aln.probs_per_seq()[fg_edge]
    P = observed_gn.lf.get_psub_for_edge(name=fg_edge)
    pi_inf = get_stat_pi_via_eigen(P)

    return chi_squared(pi=pi, pi_inf=pi_inf, n=len(aln), motif_order=P.keys())


@define_app
def sq_test(db_path: pathlib.PosixPath) -> c3_types.TabularType:
    results = []
    in_dstore = open_data_store(db_path)
    loader = load_from_sqldb()
    for member in in_dstore.completed:
        result = loader(member)
        chi, p = squartini_arndt_test(result)
        results.append((member.unique_id, chi, p))

    return make_table(
        header=["name", "chi2", "chisq_pval"], data=results, source=db_path.stem
    )

In [None]:
tester = sq_test()
out_path = project_paths.RESULT_DIR / "micro/squartini-arndt"
out_dstore = open_data_store(out_path, mode="w", suffix="tsv")
writer = get_app("write_tabular", data_store=out_dstore)
proc = tester + writer
_ = proc.apply_to(synthetic_GSN_fits_paths, parallel=True, show_progress=True)