### The purpose of this analysis is to test whether the refined catalogues are better.

Selection of catalogues that appeared to improve (see the analysis notebooks 08-10):
- NC_014328.1.region003
- NZ_CP053893.1.region004
- NZ_LT906470.1.region002
- ranthipeptide_alone

### Goal:
Pairwise test of both accuracy and precision for MAG_init and MAG_best.

We wish to test the null hypotheses:

N0.1
std(MAG_init:RE) == std(MAG_best:RE)

N0.2
(MAG_init:RE) == (MAG_best:RE)

In both cases the alternative is a smaller std for MAG_best and, hopefully, also a smaller RE.

Note that a smaller RE is not strictly required for us to consider MAG_best an improvement of the estimates: the reference needed to calculate RE is open to discussion, and changing it changes the RE value but not the spread.


### Methods.

We will use a paired Student's t-test to compare the distributions of RE and determine whether they differ. We use a paired t-test because it acts similarly to a blocked test: we are comparing results across widely different dataset sizes, which may affect the test result.
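A minimal sketch of the paired test described above, assuming the RE values have been arranged so that element i of each array comes from the same catalogue, dataset and sample. The numbers are made up for illustration, and `scipy.stats.ttest_rel` is one possible implementation, not necessarily the one used later in this notebook. Note that a paired t-test compares means, so it addresses N0.2 rather than N0.1.

```python
import numpy as np
from scipy.stats import ttest_rel

# Hypothetical paired RE values (illustrative numbers, not real results).
# Element i of both arrays belongs to the same catalogue/dataset/sample.
re_init = np.array([-0.48, -0.49, -0.45, -0.52, -0.40, -0.55])
re_best = np.array([-0.50, -0.51, -0.49, -0.50, -0.52, -0.51])

# Paired t-test: is the mean of the per-pair differences zero?
t_stat, p_value = ttest_rel(re_init, re_best)

# Pairing removes between-pair variation (e.g. dataset size), so only the
# within-pair differences enter the test statistic.
diff = re_init - re_best
print(f"mean difference = {diff.mean():.3f}, t = {t_stat:.3f}, p = {p_value:.3f}")
```

The statistic equals the mean difference divided by its standard error, which is why large differences between datasets do not inflate the denominator.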

Alternative:  
We could also set it up as follows:
variable1 = dataset size (0.01 - 0.05)
variable2 = method [MAG_init or MAG_best] 
This will a



In [49]:
import pandas as pd
from pathlib import Path
from typing import Sequence
import configparser
import json
import plotly.express as px
from plotly.subplots import make_subplots
import os
import plotly.graph_objects as go

In [50]:
catalogue_subset = [
    "NC_014328.1.region003",
    "NZ_CP053893.1.region004",
    "NZ_LT906470.1.region002",
    "ranthipeptide_alone"
]

workdir = os.environ.get("ScreenerNBWD", "../data/simulated_data_init")

WD_DATA = Path(workdir)


Path("../data/simulated_data_init")


dir_count_matrices = WD_DATA / "kmer_quantification/count_matrices"
dir_mag_flat    = WD_DATA / "MAGinator/screened_flat"
files_mag_gene_sets = dir_mag_flat.glob("*_kmers.csv")
fp_simulation_overview = WD_DATA / "camisim/simulation_overview_full.tsv"

In [51]:
# Collect counts:
catalouge_count_files = [file for file in dir_count_matrices.glob("*.tsv") if not file.stem=="counts_all"]


df_count = pd.concat(
    pd.read_csv(file, sep="\t", index_col=0)\
        .reset_index()\
        .rename(columns={'index':'kmer'})\
        .melt(id_vars=["kmer"], var_name='dataset_sample', value_name='count')\
        .assign(catalogue_name = file.stem)
    for file in catalouge_count_files
).reset_index(drop=True)
df_count[["dataset","sample"]] = df_count['dataset_sample'].str.split(".", expand=True)
df_count.head(3)

Unnamed: 0,kmer,dataset_sample,count,catalogue_name,dataset,sample
0,ATAACCCACCTTTCAAAATAT,0_5GB.sample_3,21,NC_014328.1.region005,0_5GB,sample_3
1,AATAACTACTATAACAATTAA,0_5GB.sample_3,19,NC_014328.1.region005,0_5GB,sample_3
2,TCTACAGGATGGCTTATATAA,0_5GB.sample_3,22,NC_014328.1.region005,0_5GB,sample_3


In [6]:
# Collect genesets

df_genes_sets = pd.concat(
    pd.read_csv(file)\
        .assign(catalogue_name = file.stem.rsplit("_kmers",1)[0])
    for file in files_mag_gene_sets
)
df_genes_sets.head(3)

Unnamed: 0,init,mean,best,catalogue_name
0,AAAGATAATAATGATTGTATA,AAAAAATACCTCGTACATCTT,CGGTTCATAGTGGCATTAGAA,NZ_CP053893.1.region005
1,GACCAATTCCTAGAAGGAAAA,AAATCTCAATCTGTTTAAAAA,ATAAACTTGATATTAATGATG,NZ_CP053893.1.region005
2,AATTCTGTAATGATGGTACAG,ATCTCAATCTGTTTAAAAATA,TCTTAAAGAATATGAAATTAA,NZ_CP053893.1.region005


In [7]:
df_kmerset_long = df_genes_sets.loc[df_genes_sets.catalogue_name.isin(catalogue_subset),:]\
    .melt(id_vars = ["catalogue_name"], value_vars = ["init", "best"], var_name="method",value_name="kmer")
df_kmerset_long.head()

Unnamed: 0,catalogue_name,method,kmer
0,NC_014328.1.region003,init,AAAAGTAGGTCAAAAGGCAAC
1,NC_014328.1.region003,init,ACATCTAAATCAAAAGAGAGA
2,NC_014328.1.region003,init,TATTTAAAATAGCTTATATAA
3,NC_014328.1.region003,init,CTAAAAGTAGGTCAAAAGGCA
4,NC_014328.1.region003,init,ACAGTTGCCTTTTGACCTACT


In [8]:
df_combined = pd.merge(df_kmerset_long, df_count, how="outer", on=["kmer", 'catalogue_name'])
print(len(df_combined))
df_combined.head(3)

3376305


Unnamed: 0,catalogue_name,method,kmer,dataset_sample,count,dataset,sample
0,NC_014328.1.region003,init,AAAAGTAGGTCAAAAGGCAAC,0_5GB.sample_3,39,0_5GB,sample_3
1,NC_014328.1.region003,init,AAAAGTAGGTCAAAAGGCAAC,0_5GB.sample_0,29,0_5GB,sample_0
2,NC_014328.1.region003,init,AAAAGTAGGTCAAAAGGCAAC,0_5GB.sample_2,21,0_5GB,sample_2


In [9]:
import statsmodels.api as sm
import numpy as np
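def negbinom_mu(column: pd.Series) -> float:
    """Fit an intercept-only negative binomial model and return its mean (mu)."""
    # The only regressor is a constant, so exp(intercept) is the fitted mean count.
    nb_fit = sm.NegativeBinomial(column, np.ones_like(column)).fit(disp=0, start_params=[1,1]) #disp=0 == quiet
    nb_param_mu = np.exp(nb_fit.params.const)
    return nb_param_mu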
#df_count.loc[df_genes[ref_type]].agg([negbinom_mu, np.median]).T.reset_index()
#df_agg = 

In [10]:
df_estimates = df_combined\
    .groupby(['catalogue_name', 'method','dataset_sample', 'dataset', 'sample'])['count']\
    .agg([negbinom_mu, np.median])\
    .reset_index()







In [14]:
df_estimates.head(3)

Unnamed: 0,catalogue_name,method,dataset_sample,dataset,sample,negbinom_mu,median
0,NC_014328.1.region003,best,0_005GB.sample_0,0_005GB,sample_0,0.530001,0.0
1,NC_014328.1.region003,best,0_005GB.sample_1,0_005GB,sample_1,0.320002,0.0
2,NC_014328.1.region003,best,0_005GB.sample_2,0_005GB,sample_2,0.059998,0.0


In [13]:
#df_simulation = #get_simulation_overview("../data/simulated_data/camisim/*GB/simulation_overview.csv")
df_simulation = pd.read_csv(fp_simulation_overview, sep="\t")
df_simulation["readsGB"]  = df_simulation.dataset.str.replace("_",".").str.rstrip("GB").astype(float)


file_catalogue_grouping = Path("../data/simulated_data/catalogues/family_dump.json")
catalogue_groupings     = json.loads(file_catalogue_grouping.read_text())

by_sample_grouped = df_simulation.groupby(["dataset", "sample"])
rows = []
for name, df_g in by_sample_grouped:
    group_rows = [
        name + (cat, df_g.loc[df_g.ncbi.isin([member.rsplit(".",1)[0] for member in cat_members]),'expected_average_coverage'].sum())
        for cat, cat_members in catalogue_groupings.items()]
    rows.extend(group_rows)
df_catalogue_expect = pd.DataFrame(rows, columns = ["dataset", "sample","catalogue_name", "expected_average_coverage"])
df_catalogue_expect.head(3)

Unnamed: 0,dataset,sample,catalogue_name,expected_average_coverage
0,0_005GB,sample_0,ranthipeptide_group,0.623462
1,0_005GB,sample_0,ranthipeptide_alone,0.278858
2,0_005GB,sample_0,NC_014328.1.region001,0.259607


In [15]:
# add to counts data
df_combined_error = df_estimates.merge(df_catalogue_expect, how="left", on=["dataset", "sample","catalogue_name"])
df_combined_error.head(3)

Unnamed: 0,catalogue_name,method,dataset_sample,dataset,sample,negbinom_mu,median,expected_average_coverage
0,NC_014328.1.region003,best,0_005GB.sample_0,0_005GB,sample_0,0.530001,0.0,0.259607
1,NC_014328.1.region003,best,0_005GB.sample_1,0_005GB,sample_1,0.320002,0.0,0.311277
2,NC_014328.1.region003,best,0_005GB.sample_2,0_005GB,sample_2,0.059998,0.0,0.202819


In [16]:
df_combined_error_long = df_combined_error\
    .melt(
        id_vars = ["catalogue_name","method", "dataset","sample", "expected_average_coverage"],
        value_vars = ["negbinom_mu", "median"], 
        value_name = "estimate",
        var_name = "estimate_agg")
df_combined_error_long.head(3)

Unnamed: 0,catalogue_name,method,dataset,sample,expected_average_coverage,estimate_agg,estimate
0,NC_014328.1.region003,best,0_005GB,sample_0,0.259607,negbinom_mu,0.530001
1,NC_014328.1.region003,best,0_005GB,sample_1,0.311277,negbinom_mu,0.320002
2,NC_014328.1.region003,best,0_005GB,sample_2,0.202819,negbinom_mu,0.059998


In [17]:
df_combined_error_long["RE"] = (df_combined_error_long.estimate
 - df_combined_error_long.expected_average_coverage) / df_combined_error_long.expected_average_coverage
df_combined_error_long["RAE"] = df_combined_error_long["RE"].abs()
df_combined_error_long.head(3)

Unnamed: 0,catalogue_name,method,dataset,sample,expected_average_coverage,estimate_agg,estimate,RE,RAE
0,NC_014328.1.region003,best,0_005GB,sample_0,0.259607,negbinom_mu,0.530001,1.041549,1.041549
1,NC_014328.1.region003,best,0_005GB,sample_1,0.311277,negbinom_mu,0.320002,0.028028,0.028028
2,NC_014328.1.region003,best,0_005GB,sample_2,0.202819,negbinom_mu,0.059998,-0.704177,0.704177
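As a worked example of the RE and RAE columns, the sketch below recomputes them for two rows of the table above. The helper name `relative_error` is hypothetical, and the inputs are the rounded values as displayed, so the results agree with the table only to about five decimals.

```python
def relative_error(estimate: float, expected: float) -> float:
    """Signed relative error: positive means the estimate is too high."""
    return (estimate - expected) / expected

# (estimate, expected_average_coverage) copied from the displayed rows
# for NC_014328.1.region003, method 'best', dataset 0_005GB.
rows = {
    "sample_0": (0.530001, 0.259607),
    "sample_2": (0.059998, 0.202819),
}

results = {}
for sample, (estimate, expected) in rows.items():
    re = relative_error(estimate, expected)
    results[sample] = {"RE": re, "RAE": abs(re)}  # RAE drops the sign
    print(f"{sample}: RE = {re:.4f}, RAE = {abs(re):.4f}")
```

sample_0 is an overestimate of roughly 104 % while sample_2 is an underestimate of roughly 70 %; RAE treats both as plain magnitudes.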


In [18]:
n_groups = df_combined_error_long.catalogue_name.drop_duplicates().count()
n_groups

4

In [19]:
px.colors.qualitative.Plotly
color_map = {
    'init':px.colors.qualitative.Plotly[0],
    'best':px.colors.qualitative.Plotly[1]
}

In [53]:
df_nb = df_combined_error_long.query("estimate_agg == 'negbinom_mu'")
df_nb

Unnamed: 0,catalogue_name,method,dataset,sample,expected_average_coverage,estimate_agg,estimate,RE,RAE
0,NC_014328.1.region003,best,0_005GB,sample_0,0.259607,negbinom_mu,0.530001,1.041549,1.041549
1,NC_014328.1.region003,best,0_005GB,sample_1,0.311277,negbinom_mu,0.320002,0.028028,0.028028
2,NC_014328.1.region003,best,0_005GB,sample_2,0.202819,negbinom_mu,0.059998,-0.704177,0.704177
3,NC_014328.1.region003,best,0_005GB,sample_3,0.376735,negbinom_mu,0.180001,-0.522208,0.522208
4,NC_014328.1.region003,best,0_005GB,sample_4,0.244971,negbinom_mu,0.140000,-0.428503,0.428503
...,...,...,...,...,...,...,...,...,...
355,ranthipeptide_alone,init,0_5GB,sample_0,27.885769,negbinom_mu,14.550001,-0.478228,0.478228
356,ranthipeptide_alone,init,0_5GB,sample_1,32.836688,negbinom_mu,16.810007,-0.488072,0.488072
357,ranthipeptide_alone,init,0_5GB,sample_2,24.792219,negbinom_mu,12.760000,-0.485322,0.485322
358,ranthipeptide_alone,init,0_5GB,sample_3,28.374974,negbinom_mu,14.629993,-0.484405,0.484405


In [26]:
catagories, df_catagories = zip(*list(df_nb.groupby("catalogue_name")))
titles = []
for cat in catagories:
    titles.extend([cat, "Summarised"])
titles.extend(["Combined categories", "Total"])
# Create figure
fig_comp = make_subplots(
    rows=5, cols=2,
    shared_yaxes=True,
    column_widths = [0.7, 0.3],
    subplot_titles=titles,
    vertical_spacing=0.06,
    horizontal_spacing=0.03
)



for cat_i, (cat_name, df_group_cat) in enumerate(zip(catagories, df_catagories)):
    
    for method, df_method in df_group_cat.groupby("method"):
        # Sort a copy: in-place sorting of a groupby slice triggers SettingWithCopyWarning.
        df_method = df_method.sort_values("dataset")
        
        fig_comp.add_trace(
            go.Box(
                x = df_method["dataset"].values,
                y = df_method["RE"].values,
                fillcolor = 'rgba(255,255,255,0)', #hide box
                legendgroup = method,
                name = method,
                line = {'color': 'rgba(255,255,255,0)'}, #hide box
                marker = {'color': color_map[method]},
                offsetgroup = method,
                orientation = 'v',
                pointpos = 0,
                jitter=0.3,
                alignmentgroup = 'True',
                boxpoints = 'all',
                showlegend = cat_i == 0,
                boxmean='sd'
            ),
            row=cat_i+1, col=1
        )
        
        fig_comp.add_trace(
            go.Box(
                x = ["Across datasets" for i in range(len(df_method))],
                y = df_method["RE"].values,
                fillcolor = 'rgba(255,255,255,0)', #hide box
                legendgroup = method,
                name = method,
                line = {'color': 'rgba(255,255,255,0)'}, #hide box
                marker = {'color': color_map[method]},
                offsetgroup = method,
                orientation = 'v',
                pointpos = 0,
                jitter=0.5,
                alignmentgroup = True,
                boxpoints = 'all',
                showlegend = False,
                #boxmean='sd'
            ),
            row=cat_i+1, col=2
        )
# add final summarising row.
for method, df_method in df_nb.groupby("method"):
        # Sort a copy: in-place sorting of a groupby slice triggers SettingWithCopyWarning.
        df_method = df_method.sort_values("dataset")
        
        fig_comp.add_trace(
            go.Box(
                x = df_method["dataset"].values,
                y = df_method["RE"].values,
                fillcolor = 'rgba(255,255,255,0)', #hide box
                legendgroup = method,
                name = method,
                line = {'color': 'rgba(255,255,255,0)'}, #hide box
                marker = {'color': color_map[method]},
                offsetgroup = method,
                orientation = 'v',
                pointpos = 0,
                jitter=0.3,
                alignmentgroup = 'True',
                boxpoints = 'all',
                showlegend = cat_i == 0,
                boxmean='sd'
            ),
            row=cat_i+2, col=1
        )
        
        fig_comp.add_trace(
            go.Box(
                x = ["Across datasets" for i in range(len(df_method))],
                y = df_method["RE"].values,
                fillcolor = 'rgba(255,255,255,0)', #hide box
                legendgroup = method,
                name = method,
                line = {'color': 'rgba(255,255,255,0)'}, #hide box
                marker = {'color': color_map[method]},
                offsetgroup = method,
                orientation = 'v',
                pointpos = 0,
                jitter=0.5,
                alignmentgroup = True,
                boxpoints = 'all',
                showlegend = False,
                #boxmean='sd'
            ),
            row=cat_i+2, col=2
        )

### Initial evaluations

From the plot we get a general impression that the spread is large for datasets <= 0.01GB, for both best and init.

However, we also note that the MAG_best method appears to have slightly more tightly grouped estimates.

Sadly, the picture is not clear. For the second row we do observe a large difference, but this could be due to a very poor seed.

If we combine the datasets within each row (catalogue) we do not observe any gain in precision, nor do we when we total across catalogues and datasets (bottom right). However, if we look at the individual datasets within each catalogue, we generally observe a smaller (better) spread for the refined set ('best'). This trend also holds when we collapse the catalogues (bottom left).

Thus, we have some tentative evidence that the refinement method provides more stable (higher precision) estimates than the initial seed for datasets >= 0.02GB.

In [52]:
fig_comp.update_layout(height=1000, boxmode='group', title="Overview of Relative error")
fig_comp.show()

### Statistical test

Perform Fligner-Killeen test for equality of variance.
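A minimal sketch of how `scipy.stats.fligner` behaves, using synthetic samples rather than data from this analysis. The test is an omnibus test of equal variance, so it does not say which group has the larger spread; the sample variances have to be compared separately to check the direction.

```python
import numpy as np
from scipy.stats import fligner

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Two synthetic samples with the same mean but different spread
# (illustrative only; not data from this notebook).
narrow = rng.normal(loc=0.0, scale=0.1, size=200)
wide = rng.normal(loc=0.0, scale=0.5, size=200)

# Null hypothesis: both samples come from populations with equal variance.
stat, p = fligner(narrow, wide)
print(f"statistic = {stat:.2f}, p = {p:.2e}")

# The test is non-directional, so check the direction explicitly.
var_narrow = np.var(narrow, ddof=1)
var_wide = np.var(wide, ddof=1)
print(f"var(narrow) = {var_narrow:.4f}, var(wide) = {var_wide:.4f}")
```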

In [29]:
from scipy.stats import fligner

In [30]:
df_nb.head(3)

Unnamed: 0,catalogue_name,method,dataset,sample,expected_average_coverage,estimate_agg,estimate,RE,RAE
0,NC_014328.1.region003,best,0_005GB,sample_0,0.259607,negbinom_mu,0.530001,1.041549,1.041549
1,NC_014328.1.region003,best,0_005GB,sample_1,0.311277,negbinom_mu,0.320002,0.028028,0.028028
2,NC_014328.1.region003,best,0_005GB,sample_2,0.202819,negbinom_mu,0.059998,-0.704177,0.704177


#### Test without any blocking

In [31]:
estimates_best = df_nb.query("method == 'best'").RE.values.tolist()
estimates_init = df_nb.query("method == 'init'").RE.values.tolist()

stat, p = fligner(estimates_best, estimates_init)

print(f"""
Fligner-Killeen test for equality of variance. When viewing the total population (across datasets and catalogues.)

Variance:
init: {np.var(estimates_init, ddof=1):.2f}
best: {np.var(estimates_best, ddof=1):.2f}

The probability of the two populations of equal variance to give rise to an equal or
more extreme difference in variance is given by p={p:.1e} with a statistic of ({stat:.3f}).

""")



Fligner-Killeen test for equality of variance. When viewing the total population (across datasets and catalogues.)

Variance:
init: 0.03
best: 0.07

The probability of the two populations of equal variance to give rise to an equal or
more extreme difference in variance is given by p=9.8e-01 with a statistic of (0.001).




The Fligner-Killeen test gives no evidence that the two variances differ (p = 0.98), even though the sample variance is larger for MAG_best (0.07) than for MAG_init (0.03). This is, however, pooled across both catalogues and datasets, so these may not be good estimates.

### Test when blocking for both Dataset and Catalogue

In [32]:
df_nb_wide = df_nb[["catalogue_name","method","dataset","sample","estimate","RE","RAE"]]
df_nb_wide.pivot(index=["catalogue_name","dataset","sample"], columns='method', ).reset_index()

Unnamed: 0_level_0,catalogue_name,dataset,sample,estimate,estimate,RE,RE,RAE,RAE
method,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,best,init,best,init,best,init
0,NC_014328.1.region003,0_005GB,sample_0,0.530001,0.220003,1.041549,-0.152555,1.041549,0.152555
1,NC_014328.1.region003,0_005GB,sample_1,0.320002,0.189998,0.028028,-0.389617,0.028028,0.389617
2,NC_014328.1.region003,0_005GB,sample_2,0.059998,0.109998,-0.704177,-0.457654,0.704177,0.457654
3,NC_014328.1.region003,0_005GB,sample_3,0.180001,0.140000,-0.522208,-0.628386,0.522208,0.628386
4,NC_014328.1.region003,0_005GB,sample_4,0.140000,0.079997,-0.428503,-0.673443,0.428503,0.673443
...,...,...,...,...,...,...,...,...,...
175,ranthipeptide_alone,0_5GB,sample_0,12.079997,14.550001,-0.566804,-0.478228,0.566804,0.478228
176,ranthipeptide_alone,0_5GB,sample_1,16.170000,16.810007,-0.507563,-0.488072,0.507563,0.488072
177,ranthipeptide_alone,0_5GB,sample_2,12.300000,12.760000,-0.503877,-0.485322,0.503877,0.485322
178,ranthipeptide_alone,0_5GB,sample_3,13.760010,14.629993,-0.515065,-0.484405,0.515065,0.484405


In [33]:
df_set = df_nb.query("catalogue_name == 'NC_014328.1.region003' & dataset == '0_005GB'")
df_set.head(3)
#df_set.pivot(columns = ["method","RE","RAE"])

Unnamed: 0,catalogue_name,method,dataset,sample,expected_average_coverage,estimate_agg,estimate,RE,RAE
0,NC_014328.1.region003,best,0_005GB,sample_0,0.259607,negbinom_mu,0.530001,1.041549,1.041549
1,NC_014328.1.region003,best,0_005GB,sample_1,0.311277,negbinom_mu,0.320002,0.028028,0.028028
2,NC_014328.1.region003,best,0_005GB,sample_2,0.202819,negbinom_mu,0.059998,-0.704177,0.704177


In [36]:
result_rows = []
for (cat, db), df_sub_data in df_nb.groupby(["catalogue_name", "dataset"]):
    estimates_best = df_sub_data.query("method == 'best'").RE.values.tolist()
    estimates_init = df_sub_data.query("method == 'init'").RE.values.tolist()
    
    stat, p = fligner(estimates_best, estimates_init)
    
    result_rows.append({
        'catalogue_name': cat,
        'dataset': db,
        'variance_init' : np.var(estimates_init),
        'variance_best' : np.var(estimates_best),
        'statistic' : stat,
        'p-value' : p,
        "log2Fold": np.log2(np.var(estimates_init)) - np.log2(np.var(estimates_best))
    })
    
df_results = pd.DataFrame(result_rows)


### From the test of individual catalogue/dataset pairs we find no significant difference.

We note, however, that the largest statistic is only 3.19 (p = 0.074), which could be caused by the sample size per group (5) being too small. We therefore try again, blocking only by catalogue and not by dataset.
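To get a rough feel for how much five observations per group limit the test, the sketch below simulates normally distributed groups with a known difference in spread and counts how often `fligner` rejects at 0.05. Everything here is an assumption for illustration: the normal model, the threefold difference in standard deviation, the helper `rejection_rate`, and the group size of 45 used as a rough stand-in for pooling nine datasets of about five samples each.

```python
import numpy as np
from scipy.stats import fligner

def rejection_rate(n_per_group, sd_ratio, n_sim=300, alpha=0.05, seed=0):
    """Fraction of simulated experiments where fligner gives p < alpha."""
    rng = np.random.default_rng(seed)
    rejected = 0
    for _ in range(n_sim):
        a = rng.normal(0.0, 1.0, n_per_group)       # reference spread
        b = rng.normal(0.0, sd_ratio, n_per_group)  # sd_ratio times wider
        _, p = fligner(a, b)
        if p < alpha:
            rejected += 1
    return rejected / n_sim

power_n5 = rejection_rate(n_per_group=5, sd_ratio=3.0)
power_n45 = rejection_rate(n_per_group=45, sd_ratio=3.0)
print(f"rejection rate with n=5:  {power_n5:.2f}")
print(f"rejection rate with n=45: {power_n45:.2f}")
```

If the rejection rate at n=5 is low even for a threefold difference in standard deviation, a non-significant result for a single catalogue/dataset pair says little about whether the variances really are equal.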

In [37]:
df_results.sort_values("p-value").head(5)

Unnamed: 0,catalogue_name,dataset,variance_init,variance_best,statistic,p-value,log2Fold
29,ranthipeptide_alone,0_02GB,0.005522,0.032656,3.194033,0.073907,-2.56402
30,ranthipeptide_alone,0_05GB,0.002091,0.008524,2.51624,0.112679,-2.027104
31,ranthipeptide_alone,0_08GB,0.00059,0.004836,1.979713,0.159421,-3.035409
33,ranthipeptide_alone,0_2GB,0.000281,0.001333,1.372156,0.241442,-2.248045
28,ranthipeptide_alone,0_01GB,0.006246,0.04123,1.372156,0.241442,-2.722614


### Blocking for catalogues
Testing best against init within each catalogue.

In [42]:
result_rows2 = []
for cat, df_sub_data in df_nb.groupby("catalogue_name"):  # string key keeps `cat` a plain string, not a 1-tuple
    estimates_best = df_sub_data.query("method == 'best'").RE.values.tolist()
    estimates_init = df_sub_data.query("method == 'init'").RE.values.tolist()
    
    stat, p = fligner(estimates_best, estimates_init)
    
    result_rows2.append({
        'catalogue_name': cat,
        #'dataset': db,
        'variance_init' : np.var(estimates_init),
        'variance_best' : np.var(estimates_best),
        'statistic' : stat,
        'p-value' : p,
        "log2Fold": np.log2(np.var(estimates_init)) - np.log2(np.var(estimates_best))
    })
    
df_results2 = pd.DataFrame(result_rows2)


### Results from blocking only catalogue:

Here we note that the variance is consistently larger for best than for init.
Without correcting for multiple testing, the difference is significant at the 0.05 level for three of the four catalogues (__NZ_CP053893.1.region004__, __NZ_LT906470.1.region002__ and __ranthipeptide_alone__), so this conclusion remains tentative.
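The multiple-testing caveat can be made concrete with a small sketch. It applies Bonferroni and Holm corrections, written out by hand, to the four per-catalogue p-values reported in `df_results2`. The helper names are hypothetical, and the choice of these two corrections is an assumption, not something the analysis prescribes.

```python
# p-values copied from the per-catalogue Fligner-Killeen results (df_results2).
p_values = {
    "NZ_CP053893.1.region004": 0.002382,
    "NZ_LT906470.1.region002": 0.015094,
    "ranthipeptide_alone": 0.022152,
    "NC_014328.1.region003": 0.14841,
}

def bonferroni(pvals):
    """Multiply every p-value by the number of tests (capped at 1)."""
    m = len(pvals)
    return {name: min(1.0, p * m) for name, p in pvals.items()}

def holm(pvals):
    """Holm step-down: the k-th smallest p is scaled by (m - k + 1)."""
    m = len(pvals)
    adjusted, running_max = {}, 0.0
    ordered = sorted(pvals.items(), key=lambda kv: kv[1])
    for rank, (name, p) in enumerate(ordered):
        # Enforce monotonicity: an adjusted p never drops below an earlier one.
        running_max = max(running_max, min(1.0, (m - rank) * p))
        adjusted[name] = running_max
    return adjusted

p_bonf = bonferroni(p_values)
p_holm = holm(p_values)
for name in p_values:
    print(f"{name}: raw={p_values[name]:.4f} "
          f"bonferroni={p_bonf[name]:.4f} holm={p_holm[name]:.4f}")
```

With these inputs only NZ_CP053893.1.region004 stays below 0.05 under Bonferroni, while the less conservative Holm procedure keeps three catalogues below 0.05.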

In [43]:
df_results2.sort_values("p-value").head(5)

Unnamed: 0,catalogue_name,variance_init,variance_best,statistic,p-value,log2Fold
1,NZ_CP053893.1.region004,0.02174,0.08796,9.228784,0.002382,-2.016526
2,NZ_LT906470.1.region002,0.006652,0.040654,5.905431,0.015094,-2.611512
3,ranthipeptide_alone,0.006405,0.057423,5.233809,0.022152,-3.164286
0,NC_014328.1.region003,0.014714,0.078046,2.088522,0.14841,-2.407167


In [46]:
result_rows3 = []
for cat, df_sub_data in df_nb.groupby("dataset"):  # here `cat` holds the dataset name
    estimates_best = df_sub_data.query("method == 'best'").RE.values.tolist()
    estimates_init = df_sub_data.query("method == 'init'").RE.values.tolist()
    
    stat, p = fligner(estimates_best, estimates_init)
    
    result_rows3.append({
        'dataset': cat,
        #'dataset': db,
        'variance_init' : np.var(estimates_init),
        'variance_best' : np.var(estimates_best),
        'statistic' : stat,
        'p-value' : p
    })
    
df_results3 = pd.DataFrame(result_rows3)
df_results3["log2Fold"] = np.log2(df_results3.variance_init / df_results3.variance_best)

### When blocking for dataset
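We see that for the datasets of 0.08GB and larger the variance is several fold lower for variance_best than for variance_init (log2Fold between 2.4 and 3.5).

These differences are significant (p < 0.01). For 0_02GB and 0_05GB the variance is also lower for best, but not significantly so, and for the two smallest datasets (0_005GB and 0_01GB) the variance is larger for best, significantly so for 0_005GB.

Thus, this gives some evidence that there could be value (as in better precision) in the refined gene set, at least for the larger datasets.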


In [47]:
df_results3.sort_values("log2Fold",ascending=False )

Unnamed: 0,dataset,variance_init,variance_best,statistic,p-value,log2Fold
8,0_5GB,0.01076,0.000939,20.242611,7e-06,3.519224
7,0_3GB,0.012492,0.001272,18.452388,1.7e-05,3.295271
5,0_1GB,0.017263,0.001983,7.484829,0.006222,3.122176
6,0_2GB,0.013097,0.002142,13.596808,0.000227,2.612477
4,0_08GB,0.016876,0.003275,9.181788,0.002444,2.365463
3,0_05GB,0.027568,0.008908,1.724306,0.18914,1.629896
2,0_02GB,0.051502,0.038398,0.030791,0.860708,0.423573
1,0_01GB,0.037114,0.07078,0.450251,0.502216,-0.931351
0,0_005GB,0.044696,0.209612,12.468876,0.000414,-2.229513


## Summary:

Both the exploratory and the statistical investigations show tentative evidence that the refinement method results in more consistent (precise) estimates.

I believe the reason init shows the smaller variance when we do not analyse within datasets is that the variance is much larger in the smallest datasets, which then drive the analysis. This is not entirely unlike Simpson's paradox.

When we look within each dataset, we do observe that best has better precision than init for most datasets.

However, we also realise that our initial dataset is perhaps too small and simple to properly investigate whether the NB catalogue refinement method works. We are simply trying to optimise over too small an area.

We therefore decided to scale up and work on a much larger simulated dataset.
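A toy illustration of how pooling can reverse the within-dataset picture. The RE values below are invented (only the dataset labels are borrowed from this notebook): best is tighter than init in the two larger datasets but much wider in the smallest one, and that single dataset dominates the pooled variance.

```python
from statistics import variance

# Hypothetical RE values per dataset and method (not real results).
re_values = {
    "0_005GB": {"init": [-0.3, -0.1, 0.1, 0.3], "best": [-0.9, -0.3, 0.3, 0.9]},
    "0_1GB": {"init": [-0.2, -0.1, 0.1, 0.2], "best": [-0.04, -0.02, 0.02, 0.04]},
    "0_5GB": {"init": [-0.2, -0.1, 0.1, 0.2], "best": [-0.02, -0.01, 0.01, 0.02]},
}

# Within-dataset sample variances (ddof=1), one per dataset and method.
within = {
    dataset: {method: variance(values) for method, values in methods.items()}
    for dataset, methods in re_values.items()
}

# Pooled variance: ignore the dataset label and concatenate everything.
pooled = {
    method: variance([v for methods in re_values.values() for v in methods[method]])
    for method in ("init", "best")
}

for dataset, var in within.items():
    print(f"{dataset}: init={var['init']:.4f} best={var['best']:.4f}")
print(f"pooled:  init={pooled['init']:.4f} best={pooled['best']:.4f}")
```

In this toy data best wins in two of three datasets, yet the pooled comparison favours init, which mirrors the pattern seen between the unblocked test and the per-dataset results.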