# Data generation: using Python to sweep over methods and parameters

This notebook serves as a template for using Python to generate and run a list of commands. To use it, follow these instructions:

1) Select ``File -> Make a Copy...`` from the toolbar above to copy this notebook, and give the copy a new name describing the method(s) that you are testing.

2) Modify file paths in cell 2 of [Environment preparation](#Environment-preparation) to match the directory structure on your system.

3) Select the datasets you wish to test under [Preparing data set sweep](#Preparing-data-set-sweep); choose from the list of datasets included in ``tax-credit``, or add your own.

4) [Prepare methods and command template](#Preparing-the-method/parameter-combinations-and-generating-commands). Enter your method/parameter combinations as a dictionary in ``method_parameters_combinations`` in cell 1, then provide a ``command_template`` in cell 2. This notebook example assumes that the method commands are passed to the command line, but the command list generated by ``parameter_sweep()`` can also be directed to the Python interpreter, as shown in [this example](./taxonomy-assignment-q2-feature-classifer.ipynb). Check the command list in cell 3, and set the number of jobs and other ``joblib`` parameters in cell 4.

5) Run all cells and hold onto your hat.

For an example of how to test classification methods in this notebook, see [taxonomy assignment with QIIME 1](./taxonomy-assignment-qiime1.ipynb).


## Environment preparation

In [1]:
from os.path import join, expandvars 
from joblib import Parallel, delayed
from glob import glob
from os import system
from tax_credit.framework_functions import (parameter_sweep,
                                            generate_per_method_biom_tables,
                                            move_results_to_repository)


In [2]:
project_dir = expandvars("/media/sf_Shared/tax-credit-data")
analysis_name = "mock-community"
data_dir = join(project_dir, "data", analysis_name)

reference_database_dir = expandvars("/media/sf_Shared/ref_dbs/")
results_dir = expandvars("/media/sf_Shared/tax-credit-data/computed-results")
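
The paths above are hard-coded, so ``expandvars`` returns them unchanged; it only has an effect when a path contains an environment variable. The sketch below shows how a variable could be used to keep machine-specific paths out of the notebook. The variable name ``TAX_CREDIT_ROOT`` and the ``*_example`` names are hypothetical and used for illustration only.

```python
import os
from os.path import expandvars, join

# Hypothetical variable name; set it in your shell instead of in the notebook.
os.environ["TAX_CREDIT_ROOT"] = "/media/sf_Shared"

# expandvars substitutes $NAME with the value of the environment variable.
project_dir_example = expandvars("$TAX_CREDIT_ROOT/tax-credit-data")
results_dir_example = join(project_dir_example, "computed-results")

# A string that contains no variables is returned unchanged.
unchanged_example = expandvars("/media/sf_Shared/ref_dbs/")

print(project_dir_example)
print(results_dir_example)
```

If the variable is not set, ``expandvars`` leaves the ``$TAX_CREDIT_ROOT`` text in place rather than raising an error, so it is worth printing the resulting paths before running the sweep.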

## Preparing data set sweep

First, we define the data sets that we'll sweep over. The following cell does not need to be modified unless you wish to change the datasets or reference databases used in the sweep.

In [3]:
dataset_reference_combinations = [
 ('mock-1', 'gg_13_8_otus'), # formerly S16S-1
]

reference_dbs = {'gg_13_8_otus' : (join(reference_database_dir, 'gg_13_8_otus/rep_set/99_otus.fasta'), 
                                   join(reference_database_dir, 'gg_13_8_otus/taxonomy/99_otu_taxonomy.txt'))}
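
Each entry of ``dataset_reference_combinations`` names a reference database that must also appear as a key in ``reference_dbs``, where it maps to a (sequences, taxonomy) pair of file paths. As a rough sketch, a small check like the one below could catch a misspelled reference name before any commands are generated. The helper ``missing_references`` and the ``silva_example`` entry are hypothetical and not part of ``tax_credit``.

```python
def missing_references(dataset_reference_combinations, reference_dbs):
    """Return the (dataset, reference) pairs whose reference is not a key of reference_dbs."""
    return [(dataset, reference)
            for dataset, reference in dataset_reference_combinations
            if reference not in reference_dbs]


# Illustrative values; the file paths are shortened placeholders.
example_combinations = [('mock-1', 'gg_13_8_otus'),
                        ('mock-1', 'silva_example')]  # hypothetical, deliberately undefined
example_reference_dbs = {
    'gg_13_8_otus': ('99_otus.fasta', '99_otu_taxonomy.txt'),
}

problems = missing_references(example_combinations, example_reference_dbs)
print(problems)
```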

## Preparing the method/parameter combinations and generating commands

Now we set the methods and method-specific parameters that we want to sweep. Modify this cell to sweep other methods. Note how ``method_parameters_combinations`` feeds method/parameter combinations to ``parameter_sweep()`` in the cell below.

In [17]:
method_parameters_combinations = {
              'mindivlp': {'small_k': [4],
                           'large_k': [4],
                           'q_value': [0.1],
                           'const': [1000]
                          }
}
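
Each parameter maps to a list of values, so listing more than one value per parameter multiplies the number of commands. The sketch below illustrates that expansion with ``itertools.product``. It is an assumption about how such a sweep can be enumerated, consistent with the sorted parameter order in the sample command later in this notebook, and not the actual implementation of ``parameter_sweep()``. The extra values ``6`` and ``0.01`` are hypothetical.

```python
from itertools import product


def expand_parameters(parameters):
    """Yield one dict per combination of the listed parameter values."""
    names = sorted(parameters)  # fixed order, so combinations are reproducible
    for values in product(*(parameters[name] for name in names)):
        yield dict(zip(names, values))


# Hypothetical grid: two values for small_k and q_value, one for the others.
example_parameters = {'small_k': [4, 6],
                      'large_k': [4],
                      'q_value': [0.1, 0.01],
                      'const': [1000]}

combinations = list(expand_parameters(example_parameters))
print(len(combinations))  # 2 * 1 * 2 * 1 = 4
print(combinations[0])
```

With the single-valued lists used in the cell above, the same expansion yields exactly one combination.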

Now enter the template of the command to sweep, and generate a list of commands with ``parameter_sweep()``.

Fields must adhere to the following format:

                      {0} = output directory
                      {1} = input data
                      {2} = reference sequences
                      {3} = reference taxonomy
                      {4} = method name
                      {5} = other parameters
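
To make the field numbering concrete, the sketch below fills a template by position with ``str.format``, which the ``{0}``-style fields suggest but which is an assumption about ``parameter_sweep()`` rather than a description of its code. The command name ``classify`` and the short file names are hypothetical placeholders.

```python
# Hypothetical template; "classify" stands in for a real classifier command.
template_example = "mkdir -p {0}; classify -i {1} -o {0} -r {2} -t {3} {5}"

# Build the "other parameters" string as "--name value" pairs in sorted order.
params = {'const': 1000, 'large_k': 4, 'q_value': 0.1, 'small_k': 4}
param_string = ' '.join('--{} {}'.format(name, value)
                        for name, value in sorted(params.items()))

command_example = template_example.format(
    'out_dir',        # {0} output directory
    'rep_seqs.fna',   # {1} input data
    'ref.fasta',      # {2} reference sequences
    'ref_tax.txt',    # {3} reference taxonomy
    'mindivlp',       # {4} method name (unused by this template)
    param_string,     # {5} other parameters
)
print(command_example)
```

A field may appear more than once (``{0}`` here) or not at all (``{4}`` here); ``str.format`` ignores positional arguments that the template does not reference.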

In [18]:
command_template = "mkdir -p {0}; python /media/sf_Shared/MinDivLP/classify_mindivlp.py -i {1} -o {0} -r {2} -t {3} {5}"
        
commands = parameter_sweep(data_dir, results_dir, reference_dbs,
                           dataset_reference_combinations,
                           method_parameters_combinations, command_template,
                           infile='rep_seqs.fna', output_name='rep_seqs_tax_assignments.txt')


As a sanity check, we can look at the first command that was generated and the number of commands generated.

In [19]:
print(len(commands))
commands[0]

1


'mkdir -p /media/sf_Shared/tax-credit-data/computed-results/mock-1/gg_13_8_otus/mindivlp/1000_4_0.1_4; python /media/sf_Shared/MinDivLP/classify_mindivlp.py -i /media/sf_Shared/tax-credit-data/data/mock-community/mock-1/rep_seqs.fna -o /media/sf_Shared/tax-credit-data/computed-results/mock-1/gg_13_8_otus/mindivlp/1000_4_0.1_4 -r /media/sf_Shared/ref_dbs/gg_13_8_otus/rep_set/99_otus.fasta -t /media/sf_Shared/ref_dbs/gg_13_8_otus/taxonomy/99_otu_taxonomy.txt --const 1000 --large_k 4 --q_value 0.1 --small_k 4'

Finally, we run our commands.

In [24]:
Parallel(n_jobs=4)(delayed(system)(command) for command in commands)

[0]
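
The list printed above holds one exit status per command, as returned by ``os.system``; ``0`` indicates success. For larger sweeps it can help to collect the commands that failed. The sketch below is one way to do that using only the standard library (``subprocess`` and ``concurrent.futures``) in place of ``joblib``; it is an alternative illustration, not the approach used in the cell above. The helper names and demo commands are hypothetical.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_command(command):
    """Run one shell command and return (command, exit code)."""
    completed = subprocess.run(command, shell=True)
    return command, completed.returncode


def run_all(commands, n_jobs=4):
    """Run commands concurrently, preserving input order in the results."""
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        return list(pool.map(run_command, commands))


# Demo commands: one succeeds, one exits with status 3.
demo_commands = [
    '"{}" -c "import sys; sys.exit(0)"'.format(sys.executable),
    '"{}" -c "import sys; sys.exit(3)"'.format(sys.executable),
]

results = run_all(demo_commands)
failed = [command for command, code in results if code != 0]
print("{} of {} commands failed".format(len(failed), len(results)))
```

Threads are sufficient here because each worker only waits on an external process.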

## Move result files to repository

Add results to the ``tax-credit`` directory (e.g., to push these results to the repository or compare with other precomputed results in downstream analysis steps). The ``precomputed_results_dir`` path and ``method_dirs`` glob below should not need to be changed unless substantial changes were made to file paths in the preceding cells.

Run the cell below only when (and if) you want to move your new results to the ``tax-credit`` directory. Note that results needn't be in ``tax-credit`` to compare using the evaluation notebooks.

In [26]:
precomputed_results_dir = join(project_dir, "data", "precomputed-results", analysis_name)
method_dirs = glob(join(results_dir, '*', '*', '*', '*'))
move_results_to_repository(method_dirs, precomputed_results_dir)
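
The four ``*`` wildcards in the glob correspond to the four directory levels seen in the sample command above: dataset, reference database, method, and parameter combination. The sketch below reproduces that layout in a temporary directory to show which paths the pattern matches; the temporary directory and the stray ``notes.txt`` file are illustrative only.

```python
import os
import tempfile
from glob import glob
from os.path import join, relpath

with tempfile.TemporaryDirectory() as tmp_results_dir:
    # dataset / reference / method / parameters, as in the sample command
    os.makedirs(join(tmp_results_dir, 'mock-1', 'gg_13_8_otus',
                     'mindivlp', '1000_4_0.1_4'))
    # A file at the top level is too shallow to match the four-level pattern.
    open(join(tmp_results_dir, 'notes.txt'), 'w').close()

    matched = glob(join(tmp_results_dir, '*', '*', '*', '*'))
    relative = [relpath(path, tmp_results_dir) for path in matched]

print(relative)
```

If results are written with a different directory depth, the number of wildcards needs to change to match.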