# Selecting the seed alignments

We use the results from `mdeqasis microbial-gn-stats`. This command extracts core statistics from the GN fits produced by running `mdeqasis microbial-fit-gn`. Note that `microbial-gn-stats` excludes fitted models where the maximum likelihood estimates were within machine precision of the upper/lower bound (which suggests a difficult-to-fit alignment).

In [1]:
import plotly.express as px
from cogent3 import make_table, open_data_store
from mdeq.utils import load_from_sqldb
from mdeq_analysis.plot import util as plot_util

from project_paths import SUPP_FIG_DIR, SUPP_TABLE_DIR, DATA_DIR

write_pdf = plot_util.pdf_writer()

In [None]:
path = DATA_DIR / "raw/microbial/fit_gn-stats.sqlitedb"
dstore = open_data_store(path)
print(dstore)

In [None]:
dstore.summary_logs

In [4]:
loader = load_from_sqldb()
records = [loader(m) for m in dstore.completed]
header = records[0].header()
rows = [r.to_record() for r in records]
table = make_table(header=header, data=rows)

We eliminate model fits where the condition number was > 2. (The condition number is an indicator of numerical issues.)

In [None]:
table = table.filtered(lambda x: x <= 2, columns="cond_num")
table
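
To illustrate why a large condition number signals numerical trouble, here is a minimal sketch using NumPy's `numpy.linalg.cond`. The two matrices are made up for illustration. This notebook does not show which matrix `cond_num` is computed from, so the sketch demonstrates the general concept only, not the calculation used for the GN fits.

```python
import numpy as np

# Illustrative matrices only, not taken from the GN fits.
well_conditioned = np.eye(2)
# The two rows are almost identical, so the matrix is close to singular.
near_singular = np.array([[1.0, 1.0], [1.0, 1.0 + 1e-10]])

cond_good = np.linalg.cond(well_conditioned)
cond_bad = np.linalg.cond(near_singular)

threshold = 2  # same cut-off as the filter above
# Report whether each matrix would pass the cut-off.
print(cond_good <= threshold, cond_bad <= threshold)
```

The identity matrix passes the cut-off, while the nearly singular matrix has a condition number many orders of magnitude above it.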

We selected four alignments as our seeds, one for each combination of high/low entropy and high/low JSD. These are referred to as "seed" alignments because the fits to these alignments are used to generate synthetic data for evaluating statistical measures.

In [6]:
seed_alignments = [
    "197113_332182_17210",
    "198257_206396_13724",
    "200580_114946_573911",
    "758_443154_73021",
]
seeds = table.filtered(lambda x: x in seed_alignments, columns="source")
not_seeds = table.filtered(lambda x: x not in seed_alignments, columns="source")
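
Because the table was filtered on condition number before the seeds were selected, it can be worth confirming that every seed identifier is still present. The helper below is a hypothetical sketch (the name `missing_seeds` is not part of `mdeq` or `cogent3`) that works on any iterable of source names.

```python
def missing_seeds(sources, wanted):
    """Return the identifiers in `wanted` that do not occur in `sources`."""
    present = set(sources)
    return [name for name in wanted if name not in present]


# Toy data for illustration only.
example_sources = ["a_1", "b_2", "c_3"]
example_wanted = ["a_1", "d_4"]
print(missing_seeds(example_sources, example_wanted))
```

In this notebook, the call would presumably be `missing_seeds(table.columns["source"], seed_alignments)`, with an empty list indicating that all four seeds survived the filter.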

We show the position of the selected seed alignments (red markers) with respect to the full distribution (blue markers).

In [7]:
x_label = "jsd"
y_label = "entropy"
nseed_plot = {
    "x": not_seeds.columns[x_label],
    "y": not_seeds.columns[y_label],
    "mode": "markers",
    "marker_color": "blue",
}
seed_plot = {
    "x": seeds.columns[x_label],
    "y": seeds.columns[y_label],
    "mode": "markers",
    "marker_size": 10,
    "marker_color": "red",
}
traces = [nseed_plot, seed_plot]

fig = px.scatter()
fig.add_traces(traces)

size = 700
_ = fig.update_layout(
    showlegend=False,
    xaxis=dict(title=r"$\widehat {\textrm{JSD}}$"),
    yaxis=dict(title=r"$\hat H(\pi_\infty)$"),
    width=size,
    height=size,
)

Write the image out for inclusion in the manuscript.

In [8]:
write_pdf(fig, SUPP_FIG_DIR / "microbial-jsd_x_entropy.pdf")

We tabulate the designations of the seed fits and write the table out as LaTeX for inclusion in the manuscript.

In [9]:
header = ["Identifier", "Entropy", "JSD"]
seeds = [
    (r"197113\_332182\_17210", "Hi", "Hi"),
    (r"198257\_206396\_13724", "Hi", "Lo"),
    (r"200580\_114946\_573911", "Lo", "Hi"),
    (r"758\_443154\_73021", "Lo", "Lo"),
]

table = make_table(
    header=header,
    data=seeds,
    title="Selected seed fits from microbial data.",
    legend="Fits from these alignments were used for the simulation study. "
    "Identifier is from the GreenGenes alignment. Entropy and JSD categories "
    r"are from Figure \ref{supfig:jsd-vs-entropy}.",
)

table.write(
    SUPP_TABLE_DIR / "microbial-seed_fits.tex",
    label="suptable:seed-categories",
    justify="lll",
)