Note:

I've included some code for the second paper here. The core problem is that we have many combinations of data, data processing, and models to run. _At minimum_ we're looking at:

$5\ \text{species} \times 6\ \text{contig lengths} \times 6\ \text{k-mer lengths} = 180\ \text{combinations}$.

To run _default knn_ for each combination across 5 folds, we're looking at 900 models. Add 16 cycles of hyperparameter tuning and we're at 14,400 models. The workaround is to tune hyperparameters on a single fold and then evaluate across all folds using those tuned values.
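
As a quick sanity check on the counts above, here is a minimal sketch of the arithmetic. The variable names are illustrative only, and the last figure rests on an assumption: that tuning on one fold costs one fit per tuning cycle per combination, followed by one training fit per combination and fold.

```python
# Counts taken from the note above
n_species, n_contig_lengths, n_kmer_lengths = 5, 6, 6
n_folds = 5
n_tuning_cycles = 16

n_combinations = n_species * n_contig_lengths * n_kmer_lengths
n_default_fits = n_combinations * n_folds
n_tune_every_fold = n_default_fits * n_tuning_cycles

# Assumption: tune on one fold only, then train once per combination and fold
n_tune_one_fold = n_combinations * n_tuning_cycles + n_default_fits

print(n_combinations, n_default_fits, n_tune_every_fold, n_tune_one_fold)
```

Under that assumption the workaround needs roughly a quarter of the fits that tuning on every fold would.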

At some point we'll need to think about moving computation to an HPC cluster and making sure it's easy to run many jobs concurrently.
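
One possible first step toward running jobs concurrently is to split a flat list of independent commands into several smaller scripts, one per worker or scheduler job. The sketch below is only an illustration: `split_into_chunks` and `write_chunk_scripts` are hypothetical helpers, the file naming is made up, and nothing here is specific to any particular scheduler.

```python
from pathlib import Path


def split_into_chunks(cmds, n_chunks):
    """Distribute commands round-robin into at most n_chunks non-empty lists."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    chunks = [cmds[i::n_chunks] for i in range(n_chunks)]
    return [chunk for chunk in chunks if chunk]


def write_chunk_scripts(cmds, n_chunks, out_dir, prefix="chunk"):
    """Write one shell script per chunk and return the script paths.

    Commands are separated by newlines (not '&&'), so a failing command
    does not stop the rest of its chunk.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, chunk in enumerate(split_into_chunks(cmds, n_chunks)):
        path = out_dir / f"{prefix}_{i:03d}.sh"
        path.write_text("\n".join(chunk) + "\n")
        paths.append(path)
    return paths


# Placeholder commands standing in for the real tuning/training commands
example_cmds = [f"echo job {i}" for i in range(5)]
print(split_into_chunks(example_cmds, 2))
```

This only suits commands that do not depend on one another. Command sequences that must run in order (for example copy, train, remove) would need to be kept together in the same chunk.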

In [None]:
import itertools

In [None]:
options_tune = {
    'target_label': ['Drosophila_melanogaster',  'Glycine_max',  'Spodoptera_frugiperda',  'Vitis_vinifera',  'Zea_mays'],
    'BasePair': ['500', '1000', '3000', '5000', '10000', 'genome'],
    'model_type': ['knn'],
    # 'model_type': ['knn', 'bknn', 'rnr', 'brf', 'rf', 'svml', 'svmr', 'lr', 'hgb'],
    'kmer': list(range(1, 7)),
    # 'fold': [i for i in range(5)]
    'fold': [0]
}
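# Copy each list as well, so later edits to options_train never leak into options_tune
options_train = {k: list(v) for k, v in options_tune.items()}
options_train['fold'] = list(range(5))
# options_train['model_type'].append('GNBC')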

In [None]:
o = options_tune

cmds = [f"python 03_tune_model.py --model_type '{e[0]}' --target_label '{e[1]}' --kmer {e[2]}  --BasePair '{e[3]}' --cv_fold {e[4]}  --cv_mode 'tuning' --k_job 30 --tuning_iterations 16"
 for e in itertools.product(
    o['model_type'],
    o['target_label'], 
    o['kmer'],
    o['BasePair'],
    o['fold'],
    )]
len(cmds)

In [None]:
# Chain with '&&' so the script stops at the first failing command
with open('./03_tune_model.sh', 'w') as f:
    f.write(' && \n'.join(cmds) + '\n')

In [None]:
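# Training commands for all folds at once.
# Note: `cmds` is rebuilt in the cells below (fold 0, then folds 1-4).
o = options_train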

cmds = [f"python 04_train_model.py --model_type '{e[0]}' --target_label '{e[1]}' --kmer {e[2]}  --BasePair '{e[3]}' --cv_fold {e[4]}  --cv_mode 'training' --k_job 30"
 for e in itertools.product(
    o['model_type'],
    o['target_label'], 
    o['kmer'],
    o['BasePair'],
    o['fold'],
    )]
len(cmds)

In [None]:
# Fold 0: models trained with the hyperparameters tuned on this fold
o = options_train

opts = [e for e in itertools.product(
    o['model_type'],
    o['target_label'], 
    o['kmer'],
    o['BasePair'],
    # o['fold'],
    [0]
    )]

cmds = [
    f"python 04_train_model.py --model_type '{e[0]}' --target_label '{e[1]}' --kmer {e[2]}  --BasePair '{e[3]}' --cv_fold {e[4]}  --cv_mode 'training' --k_job 30"
    for e in opts]
len(cmds)

In [None]:
# Folds 1-4: models that reuse the hyperparameters tuned on fold 0 (cv0)
opts = [e for e in itertools.product(
    o['model_type'],
    o['target_label'], 
    o['kmer'],
    o['BasePair'],
    [1,2,3,4]
    )]

update_cmds = [
    f"python 04_train_model.py --model_type '{e[0]}' --target_label '{e[1]}' --kmer {e[2]}  --BasePair '{e[3]}' --cv_fold {e[4]}  --cv_mode 'training' --k_job 30"
    for e in opts]

In [None]:
# Copy the fold-0 tuning JSON to a per-fold file name for folds 1-4
rename_ax_json = [
    f"cp ./models/tune/{e[0]}-{e[1]}-kmer{e[2]}-bp{e[3]}-fold0.json ./models/tune/{e[0]}-{e[1]}-kmer{e[2]}-bp{e[3]}-fold{e[4]}.json"
    for e in opts]

In [None]:
# Remove each copied JSON again after that fold's training command
remove_ax_json = [
    f"rm ./models/tune/{e[0]}-{e[1]}-kmer{e[2]}-bp{e[3]}-fold{e[4]}.json"
    for e in opts]

In [None]:
# Interleave as copy -> train -> remove for each fold, then flatten into one list
update_cmds = [
    cmd
    for triple in zip(rename_ax_json, update_cmds, remove_ax_json)
    for cmd in triple]

cmds = cmds+update_cmds

In [None]:
# Chain with '&&' so the script stops at the first failing command
with open('./04_train_model.sh', 'w') as f:
    f.write(' && \n'.join(cmds) + '\n')