# Selecting the seed alignments

We use the results from `mdeqasis microbial-gn-stats`, which extracts core statistics from the GN fits produced by `mdeqasis microbial-fit-gn`. Note that `microbial-gn-stats` excludes fitted models where a maximum likelihood estimate was within machine precision of its upper or lower bound (which suggests a difficult-to-fit alignment).
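As a rough illustration of what "within machine precision of a bound" could mean, here is a minimal sketch. The function name `at_bound`, the example bounds, and the tolerance rule are all assumptions made for illustration. They are not taken from `mdeqasis`, which may apply a different test.

```python
import sys

def at_bound(value, lower, upper, eps=sys.float_info.epsilon):
    """Return True if value is indistinguishable from either bound.

    Hypothetical helper: the tolerance is machine epsilon scaled by the
    magnitude of the bound, which is one common convention.
    """
    for bound in (lower, upper):
        if abs(value - bound) <= eps * max(1.0, abs(bound)):
            return True
    return False

# illustrative parameter bounds, not the ones used by the fitting code
lower, upper = 1e-6, 50.0
stuck_low = at_bound(1e-6, lower, upper)
stuck_high = at_bound(50.0, lower, upper)
interior = at_bound(1.3, lower, upper)
```

A fit would be flagged if any of its estimates returned `True` here.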

In [1]:
from mdeq_analysis import microbial  # keep as this registers custom deserialiser
from cogent3 import make_table
from cogent3.app import io
import plotly.express as px

In [2]:
path = "../data/raw/microbial/fit_gn-stats.tinydb"
dstore = io.get_data_store(path)
dstore

1779x member ReadOnlyTinyDbDataStore(source='/Users/gavin/repos/Honours2021/Kath/MutationDiseqAnalysis/nbks/../data/raw/microbial/fit_gn-stats.tinydb', members=['761_172269_106785.json', '142042_577382_549747.json', '332182_135137_401568.json'...), 8075x incomplete

In [3]:
loader = io.load_db()
records = [loader(m) for m in dstore]

In [4]:
header = records[0].header()
rows = [r.to_record() for r in records]
table = make_table(header=header, data=rows)

We eliminate model fits where the condition number was > 2. (The condition number is an indicator of numerical issues.)

In [5]:
table = table.filtered(lambda x: x <= 2, columns="cond_num")
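For readers unfamiliar with condition numbers, the sketch below shows how the value behaves for a well-conditioned and a nearly singular matrix, using `numpy.linalg.cond`. This is a generic illustration only. The notebook does not say which matrix `cond_num` is computed from, and the matrices here are made up.

```python
import numpy as np

# the identity matrix is perfectly conditioned
well_conditioned = np.eye(3)

# two almost identical rows make this matrix nearly singular
nearly_singular = np.array([[1.0, 1.0], [1.0, 1.0 + 1e-10]])

# 2-norm condition number: ratio of largest to smallest singular value
cond_well = np.linalg.cond(well_conditioned)
cond_bad = np.linalg.cond(nearly_singular)

print(f"well conditioned: {cond_well:.1f}")
print(f"nearly singular:  {cond_bad:.3g}")
```

Values close to 1 indicate a numerically benign problem, while large values mean small input errors can be greatly amplified.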

We selected four alignments as our seeds, corresponding to the combinations of high/low entropy and high/low JSD. We call these "seed" alignments because the fits to them are used to generate synthetic data for evaluating statistical measures.

In [6]:
seed_alignments = [
    "197113_332182_17210",
    "198257_206396_13724",
    "200580_114946_573911",
    "758_443154_73021",
]
seeds = table.filtered(lambda x: x in seed_alignments, columns="source")
not_seeds = table.filtered(lambda x: x not in seed_alignments, columns="source")
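To make the two selection criteria concrete, the sketch below computes Shannon entropy and Jensen-Shannon divergence (JSD) for nucleotide frequency vectors using the textbook definitions. The function names and example frequencies are illustrative assumptions. The `entropy` and `jsd` columns in the table come from `mdeqasis`, which may use different estimators or a different log base.

```python
import numpy as np

def shannon_entropy(probs):
    """Shannon entropy, in bits, of a discrete probability vector."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]  # treat 0 * log(0) as 0
    return float(-(p * np.log2(p)).sum())

def jsd(*dists):
    """JSD, in bits, of equally weighted distributions.

    Computed as the entropy of the mean distribution minus the
    mean of the individual entropies.
    """
    arr = np.asarray(dists, dtype=float)
    mixture = arr.mean(axis=0)
    mean_entropy = float(np.mean([shannon_entropy(d) for d in arr]))
    return shannon_entropy(mixture) - mean_entropy

# made-up nucleotide frequencies for illustration
uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]

h_uniform = shannon_entropy(uniform)
h_skewed = shannon_entropy(skewed)
jsd_same = jsd(uniform, uniform)
jsd_diff = jsd(uniform, skewed)
```

Under these definitions, a low-entropy alignment has a strongly skewed nucleotide composition, and a high JSD means the sequences differ markedly in composition from one another.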

We show the position of the selected seed alignments (red markers) with respect to the full distribution (blue markers).

In [7]:
x_label = "jsd"
y_label = "entropy"
nseed_plot = {
    "x": not_seeds.columns[x_label],
    "y": not_seeds.columns[y_label],
    "mode": "markers",
}
seed_plot = {
    "x": seeds.columns[x_label],
    "y": seeds.columns[y_label],
    "mode": "markers",
    "marker_size": 10,
    "marker_color": "red",
}
traces = [nseed_plot, seed_plot]

# start from an empty figure, then add both scatter traces
fig = px.scatter()
fig.add_traces(traces)

size = 700
fig.update_layout(
    showlegend=False,
    xaxis=dict(title=r"$\hat {jsd}_{max}$"),
    yaxis=dict(title=r"$\hat H(\pi_\infty)$"),
    width=size,
    height=size,
)
fig.show()

Write the image out for inclusion in the manuscript. (Static export via `write_image` requires the `kaleido` package to be installed.)

In [8]:
outpath = "../results/figs/microbial-jsd_x_entropy.pdf"
fig.write_image(outpath)