# Recapitulates conventional genetic distance measures

`divergent` should identify sets of sequences whose minimum genetic distance lies near the upper tail of the distribution of that measure, where the distribution comes from a sample of combinations of the same (or a different) size.

The data were alignments of 106 genes from exactly 31 diverse mammal species. The matrix of all pairwise genetic distances between the aligned sequences was computed using the paralinear distance of Lake. If the JSD measure reflects genetic distance, we expect the mean of the pairwise distances between the species identified by `divergent max` to be larger than that for the same number of species selected randomly. We set a minimum set size of 5 and a maximum set size of 10.

The column labelled "p-value<0.1" gives the number of the `num` genes for which the divergent set of species gave a p-value ≤0.1 (this threshold was chosen arbitrarily). The distribution for each gene was obtained by taking the mean for each of 1000 randomly chosen (without replacement) combinations of species.
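The comparison described above can be sketched in plain Python. This is a minimal illustration, not the code used to produce `jsd_v_dist.tsv`: the function names, the toy distances, and the choice to count random sets whose mean is at least as large as the observed mean are all assumptions made here for the example.

```python
import itertools
import math
import random


def mean_pairwise(dists, names):
    """Mean of the pairwise distances among the given names."""
    pairs = list(itertools.combinations(sorted(names), 2))
    return sum(dists[frozenset(p)] for p in pairs) / len(pairs)


def random_combinations(names, size, num, rng):
    """Return up to `num` distinct random combinations of `size` names."""
    names = sorted(names)
    # cannot draw more distinct combinations than exist
    num = min(num, math.comb(len(names), size))
    seen = set()
    while len(seen) < num:
        seen.add(frozenset(rng.sample(names, size)))
    return seen


def empirical_p_value(dists, names, chosen, num=1000, seed=None):
    """Fraction of random sets whose mean distance is >= that of `chosen`.

    Assumption: a one-sided test in which larger means are more extreme.
    """
    rng = random.Random(seed)
    observed = mean_pairwise(dists, chosen)
    combos = random_combinations(names, len(chosen), num, rng)
    null = [mean_pairwise(dists, c) for c in combos]
    return sum(v >= observed for v in null) / len(null)


# Toy data (hypothetical): eight "species" placed on a line, with the
# distance between two species being the gap between their positions.
positions = {"a": 0, "b": 1, "c": 2, "d": 3, "e": 4, "f": 5, "g": 6, "h": 20}
dists = {
    frozenset((x, y)): abs(positions[x] - positions[y])
    for x, y in itertools.combinations(positions, 2)
}

p_value = empirical_p_value(dists, positions, chosen={"a", "b", "h"}, seed=1)
print(f"empirical p-value = {p_value:.3f}")
```

With only eight names there are 56 possible sets of three, so the sketch enumerates all of them instead of drawing 1000. With 31 species and set sizes of 5 to 10, the number of combinations is far larger than 1000, and the rejection loop rarely draws a duplicate.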

The choice of statistic, the value of k, and whether post-process pruning was done all had an effect. For this data set, the combination `mean_delta_jsd`, `max_set=False` and `k=4` produced the strongest relationship with genetic distance.

In [None]:
import plotly.express as px
from cogent3 import load_table

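def make_result_table():
    """Load the results and add the percentage of genes with p-value below the threshold."""
    table = load_table("jsd_v_dist.tsv")
    # percentage = 100 * (number of genes passing the threshold) / (number of genes)
    table = table.with_new_column("p-val<0.1(%)", lambda x: 100 * x[0] / x[1], columns=["p-value<0.1", "num"], digits=1)
    return table.sorted(columns=["stat", "max_set", "k"])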



In [None]:
orig = make_result_table()
table = orig[:, ["k", "max_set", "stat", "p-val<0.1(%)"]]
orig

In [None]:
px.line(table.filtered(lambda x: x == True, columns="max_set"), x="k", y="p-val<0.1(%)", color="stat", title="stat vs. k (max_set=True)")

In [None]:
px.line(table.filtered(lambda x: x == False, columns="max_set"), x="k", y="p-val<0.1(%)", color="stat", title="stat vs. k (max_set=False)")