
Commit

Merge 8c90b85 into c3f165b
weixuanfu committed Apr 12, 2019
2 parents c3f165b + 8c90b85 commit b019f17
Showing 23 changed files with 860 additions and 147 deletions.
24 changes: 22 additions & 2 deletions docs/api/index.html
@@ -149,6 +149,7 @@ <h1 id="classification">Classification</h1>
<strong>subsample</strong>=1.0, <strong>n_jobs</strong>=1,
<strong>max_time_mins</strong>=None, <strong>max_eval_time_mins</strong>=5,
<strong>random_state</strong>=None, <strong>config_dict</strong>=None,
<strong>template</strong>="RandomTree",
<strong>warm_start</strong>=False,
<strong>memory</strong>=None,
<strong>use_dask</strong>=False,
@@ -246,7 +247,7 @@ <h1 id="classification">Classification</h1>
<blockquote>
Number of processes to use in parallel for evaluating pipelines during the TPOT optimization process.
<br /><br />
Setting <em>n_jobs</em>=-1 will use as many cores as available on the computer. Beware that using multiple processes on the same machine may cause memory issues for large datasets
Setting <em>n_jobs</em>=-1 will use as many cores as available on the computer. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used. Thus for n_jobs = -2, all CPUs but one are used. Beware that using multiple processes on the same machine may cause memory issues for large datasets.
</blockquote>
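
<p>As a sketch of this convention (the <code>effective_workers</code> helper below is hypothetical and for illustration only; it is not part of TPOT's API):</p>
<pre><code class="Python">import os

def effective_workers(n_jobs):
    # mirrors the rule described above: -1 means all cores,
    # other negative values mean n_cpus + 1 + n_jobs
    n_cpus = os.cpu_count()
    if n_jobs == -1:
        return n_cpus
    if n_jobs &lt; -1:
        return n_cpus + 1 + n_jobs  # e.g. n_jobs=-2 uses all CPUs but one
    return n_jobs  # positive values are used as-is

# on an 8-core machine: effective_workers(-1) == 8, effective_workers(-2) == 7
</code></pre>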

<strong>max_time_mins</strong>: integer or None, optional (default=None)
@@ -285,6 +286,15 @@ <h1 id="classification">Classification</h1>
See the <a href="../using/#built-in-tpot-configurations">built-in configurations</a> section for the list of configurations included with TPOT, and the <a href="../using/#customizing-tpots-operators-and-parameters">custom configuration</a> section for more information and examples of how to create your own TPOT configurations.
</blockquote>

<strong>template</strong>: string (default="RandomTree")
<blockquote>
Template for a predefined pipeline structure. This option specifies a desired structure for the machine learning pipelines evaluated by TPOT.
<br /><br />
So far, this option only supports linear pipeline structures. Each step in the pipeline must be a main class of operators (Selector, Transformer or Classifier) or a specific operator (e.g. <code>SelectPercentile</code>) defined in the TPOT operator configuration. If a step is a main class, TPOT will randomly assign to that step any of its subclass operators (subclasses of <a href="https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_selection/base.py#L17"><code>SelectorMixin</code></a>, <a href="https://scikit-learn.org/stable/modules/generated/sklearn.base.TransformerMixin.html"><code>TransformerMixin</code></a> or <a href="https://scikit-learn.org/stable/modules/generated/sklearn.base.ClassifierMixin.html"><code>ClassifierMixin</code></a> in scikit-learn). Steps in the template are delimited by "-", e.g. "SelectPercentile-Transformer-Classifier". With the default value "RandomTree", TPOT generates tree-based pipelines randomly.

See the <a href="../using/#template-option-in-tpot">Template option in TPOT</a> section for more details.
</blockquote>
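
<p>A minimal sketch of this option (all other constructor arguments left at their defaults):</p>
<pre><code class="Python">from tpot import TPOTClassifier

# restrict the search to linear pipelines of the form Selector-Transformer-Classifier
tpot = TPOTClassifier(template='Selector-Transformer-Classifier')
</code></pre>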

<strong>warm_start</strong>: boolean, optional (default=False)
<blockquote>
Flag indicating whether the TPOT instance will reuse the population from previous calls to <em>fit()</em>.
@@ -611,6 +621,7 @@ <h1 id="regression">Regression</h1>
<strong>subsample</strong>=1.0, <strong>n_jobs</strong>=1,
<strong>max_time_mins</strong>=None, <strong>max_eval_time_mins</strong>=5,
<strong>random_state</strong>=None, <strong>config_dict</strong>=None,
<strong>template</strong>="RandomTree",
<strong>warm_start</strong>=False,
<strong>memory</strong>=None,
<strong>use_dask</strong>=False,
@@ -709,7 +720,7 @@ <h1 id="regression">Regression</h1>
<blockquote>
Number of processes to use in parallel for evaluating pipelines during the TPOT optimization process.
<br /><br />
Setting <em>n_jobs</em>=-1 will use as many cores as available on the computer. Beware that using multiple processes on the same machine may cause memory issues for large datasets
Setting <em>n_jobs</em>=-1 will use as many cores as available on the computer. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used. Thus for n_jobs = -2, all CPUs but one are used. Beware that using multiple processes on the same machine may cause memory issues for large datasets.
</blockquote>

<strong>max_time_mins</strong>: integer or None, optional (default=None)
Expand Down Expand Up @@ -748,6 +759,15 @@ <h1 id="regression">Regression</h1>
See the <a href="../using/#built-in-tpot-configurations">built-in configurations</a> section for the list of configurations included with TPOT, and the <a href="../using/#customizing-tpots-operators-and-parameters">custom configuration</a> section for more information and examples of how to create your own TPOT configurations.
</blockquote>

<strong>template</strong>: string (default="RandomTree")
<blockquote>
Template for a predefined pipeline structure. This option specifies a desired structure for the machine learning pipelines evaluated by TPOT.
<br /><br />
So far, this option only supports linear pipeline structures. Each step in the pipeline must be a main class of operators (Selector, Transformer or Regressor) or a specific operator (e.g. <code>SelectPercentile</code>) defined in the TPOT operator configuration. If a step is a main class, TPOT will randomly assign to that step any of its subclass operators (subclasses of <a href="https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_selection/base.py#L17"><code>SelectorMixin</code></a>, <a href="https://scikit-learn.org/stable/modules/generated/sklearn.base.TransformerMixin.html"><code>TransformerMixin</code></a> or <a href="https://scikit-learn.org/stable/modules/generated/sklearn.base.RegressorMixin.html"><code>RegressorMixin</code></a> in scikit-learn). Steps in the template are delimited by "-", e.g. "SelectPercentile-Transformer-Regressor". With the default value "RandomTree", TPOT generates tree-based pipelines randomly.

See the <a href="../using/#template-option-in-tpot">Template option in TPOT</a> section for more details.
</blockquote>
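
<p>A minimal sketch for the regression case (all other constructor arguments left at their defaults):</p>
<pre><code class="Python">from tpot import TPOTRegressor

# restrict the search to linear pipelines that end in a regressor
tpot = TPOTRegressor(template='Selector-Transformer-Regressor')
</code></pre>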

<strong>warm_start</strong>: boolean, optional (default=False)
<blockquote>
Flag indicating whether the TPOT instance will reuse the population from previous calls to <em>fit()</em>.
2 changes: 1 addition & 1 deletion docs/index.html
@@ -213,5 +213,5 @@

<!--
MkDocs version : 0.17.2
Build Date UTC : 2019-03-01 17:12:19
Build Date UTC : 2019-04-11 20:02:14
-->
22 changes: 16 additions & 6 deletions docs/search/search_index.json

Large diffs are not rendered by default.

20 changes: 10 additions & 10 deletions docs/sitemap.xml
@@ -4,79 +4,79 @@

<url>
<loc>http://epistasislab.github.io/tpot/</loc>
<lastmod>2019-03-01</lastmod>
<lastmod>2019-04-11</lastmod>
<changefreq>daily</changefreq>
</url>



<url>
<loc>http://epistasislab.github.io/tpot/installing/</loc>
<lastmod>2019-03-01</lastmod>
<lastmod>2019-04-11</lastmod>
<changefreq>daily</changefreq>
</url>



<url>
<loc>http://epistasislab.github.io/tpot/using/</loc>
<lastmod>2019-03-01</lastmod>
<lastmod>2019-04-11</lastmod>
<changefreq>daily</changefreq>
</url>



<url>
<loc>http://epistasislab.github.io/tpot/api/</loc>
<lastmod>2019-03-01</lastmod>
<lastmod>2019-04-11</lastmod>
<changefreq>daily</changefreq>
</url>



<url>
<loc>http://epistasislab.github.io/tpot/examples/</loc>
<lastmod>2019-03-01</lastmod>
<lastmod>2019-04-11</lastmod>
<changefreq>daily</changefreq>
</url>



<url>
<loc>http://epistasislab.github.io/tpot/contributing/</loc>
<lastmod>2019-03-01</lastmod>
<lastmod>2019-04-11</lastmod>
<changefreq>daily</changefreq>
</url>



<url>
<loc>http://epistasislab.github.io/tpot/releases/</loc>
<lastmod>2019-03-01</lastmod>
<lastmod>2019-04-11</lastmod>
<changefreq>daily</changefreq>
</url>



<url>
<loc>http://epistasislab.github.io/tpot/citing/</loc>
<lastmod>2019-03-01</lastmod>
<lastmod>2019-04-11</lastmod>
<changefreq>daily</changefreq>
</url>



<url>
<loc>http://epistasislab.github.io/tpot/support/</loc>
<lastmod>2019-03-01</lastmod>
<lastmod>2019-04-11</lastmod>
<changefreq>daily</changefreq>
</url>



<url>
<loc>http://epistasislab.github.io/tpot/related/</loc>
<lastmod>2019-03-01</lastmod>
<lastmod>2019-04-11</lastmod>
<changefreq>daily</changefreq>
</url>

56 changes: 53 additions & 3 deletions docs/using/index.html
@@ -80,6 +80,12 @@
<li class="toctree-l2"><a href="#customizing-tpots-operators-and-parameters">Customizing TPOT's operators and parameters</a></li>


<li class="toctree-l2"><a href="#template-option-in-tpot">Template option in TPOT</a></li>


<li class="toctree-l2"><a href="#featuresetselector-in-tpot">FeatureSetSelector in TPOT</a></li>


<li class="toctree-l2"><a href="#pipeline-caching-in-tpot">Pipeline caching in TPOT</a></li>


@@ -367,7 +373,7 @@ <h1 id="tpot-on-the-command-line">TPOT on the command line</h1>
<td>Any positive integer or -1</td>
<td>Number of CPUs for evaluating pipelines in parallel during the TPOT optimization process.
<br /><br />
Assigning this to -1 will use as many cores as available on the computer.</td>
Assigning this to -1 will use as many cores as available on the computer. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used. Thus for n_jobs = -2, all CPUs but one are used.</td>
</tr>
<tr>
<td>-maxtime</td>
@@ -409,6 +415,15 @@ <h1 id="tpot-on-the-command-line">TPOT on the command line</h1>
</td>
</tr>
<tr>
<td>-template</td>
<td>TEMPLATE</td>
<td>String</td>
<td>Template of a predefined pipeline structure. This option specifies a desired structure for the machine learning pipelines evaluated in TPOT. So far, this option only supports linear pipeline structures. Each step in the pipeline must be a main class of operators (Selector, Transformer, Classifier or Regressor) or a specific operator (e.g. <code>SelectPercentile</code>) defined in the TPOT operator configuration. If a step is a main class, TPOT will randomly assign to that step any of its subclass operators (subclasses of <a href="https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_selection/base.py#L17"><code>SelectorMixin</code></a>, <a href="https://scikit-learn.org/stable/modules/generated/sklearn.base.TransformerMixin.html"><code>TransformerMixin</code></a>, <a href="https://scikit-learn.org/stable/modules/generated/sklearn.base.ClassifierMixin.html"><code>ClassifierMixin</code></a> or <a href="https://scikit-learn.org/stable/modules/generated/sklearn.base.RegressorMixin.html"><code>RegressorMixin</code></a> in scikit-learn). Steps in the template are delimited by "-", e.g. "SelectPercentile-Transformer-Classifier". With the default value "RandomTree", TPOT generates tree-based pipelines randomly.

See the <a href="../using/#template-option-in-tpot">Template option in TPOT</a> section for more details.
</td>
</tr>
<tr>
<td>-memory</td>
<td>MEMORY</td>
<td>String or file path</td>
@@ -641,6 +656,41 @@ <h1 id="customizing-tpots-operators-and-parameters">Customizing TPOT's operators
<p>When using the command-line interface, the configuration file specified in the <code>-config</code> parameter <em>must</em> name its custom TPOT configuration <code>tpot_config</code>. Otherwise, TPOT will not be able to locate the configuration dictionary.</p>
<p>For more detailed examples of how to customize TPOT's operator configuration, see the default configurations for <a href="https://github.com/EpistasisLab/tpot/blob/master/tpot/config/classifier.py">classification</a> and <a href="https://github.com/EpistasisLab/tpot/blob/master/tpot/config/regressor.py">regression</a> in TPOT's source code.</p>
<p>Note that you must have all of the corresponding packages for the operators installed on your computer, otherwise TPOT will not be able to use them. For example, if XGBoost is not installed on your computer, then TPOT will simply not import or use XGBoost in the pipelines it considers.</p>
<h1 id="template-option-in-tpot">Template option in TPOT</h1>
<p>The template option provides a way to specify a desired structure for a machine learning pipeline, which may reduce TPOT computation time and potentially provide more interpretable results. The current implementation only supports linear pipelines.</p>
<p>Below is a simple example of using the <code>template</code> option. The pipelines generated/evaluated in TPOT will follow this structure: the first step is a feature selector (a subclass of <a href="https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_selection/base.py#L17"><code>SelectorMixin</code></a>), the second step is a feature transformer (a subclass of <a href="https://scikit-learn.org/stable/modules/generated/sklearn.base.TransformerMixin.html"><code>TransformerMixin</code></a>) and the third step is a classifier (a subclass of <a href="https://scikit-learn.org/stable/modules/generated/sklearn.base.ClassifierMixin.html"><code>ClassifierMixin</code></a>). The last step must be <code>Classifier</code> for a <code>TPOTClassifier</code> template and <code>Regressor</code> for a <code>TPOTRegressor</code> template. <strong>Note: although <code>SelectorMixin</code> is a subclass of <code>TransformerMixin</code> in scikit-learn, <code>Transformer</code> in this option excludes the subclasses of <code>SelectorMixin</code>.</strong></p>
<pre><code class="Python">tpot_obj = TPOTClassifier(
template='Selector-Transformer-Classifier'
)
</code></pre>

<p>If a specific operator, e.g. <code>SelectPercentile</code>, is preferred for the first step of the pipeline, the template can be defined as 'SelectPercentile-Transformer-Classifier', as sketched below.</p>
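<p>A minimal sketch of such a template, mirroring the example above:</p>
<pre><code class="Python">from tpot import TPOTClassifier

# pin the first step to SelectPercentile; the other two steps are still searched
tpot_obj = TPOTClassifier(
    template='SelectPercentile-Transformer-Classifier'
)
</code></pre>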
<h1 id="featuresetselector-in-tpot">FeatureSetSelector in TPOT</h1>
<p><code>FeatureSetSelector</code> is a new built-in operator in TPOT. This operator enables feature selection based on <em>a priori</em> expert knowledge. For example, in RNA-seq gene expression analysis, this operator can be used to select one or more gene (feature) sets based on GO (Gene Ontology) terms or annotated gene sets from the Molecular Signatures Database (<a href="http://software.broadinstitute.org/gsea/msigdb/index.jsp">MSigDB</a>) in the first step of the pipeline via the <code>template</code> option above, in order to reduce dimensionality and TPOT computation time. This operator requires a dataset list in csv format with exactly three columns: the first column holds the feature set names, the second column holds the total number of features in each set, and the third column holds a list of feature names (if the input X is a pandas.DataFrame) or indexes (if the input X is a numpy.ndarray) delimited by ";". A sketch of this csv format appears below, followed by an example of how to use this operator in TPOT.</p>
<p>Please check our <a href="https://www.biorxiv.org/content/10.1101/502484v1.article-info">preprint paper</a> for more details.</p>
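<p>For illustration, a subset list csv following the format described above might look like the following. The feature-set names and feature indexes here are hypothetical, and the header row is a descriptive placeholder rather than a confirmed requirement; see the actual test file linked in the example below.</p>
<pre><code>Subset,Size,Features
GO_term_A,3,0;1;2
GO_term_B,2,3;4
GO_term_C,4,0;2;3;4
</code></pre>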
<pre><code class="Python">from tpot import TPOTClassifier
import numpy as np
import pandas as pd
from tpot.config import classifier_config_dict
test_data = pd.read_csv(&quot;https://raw.githubusercontent.com/EpistasisLab/tpot/master/tests/tests.csv&quot;)
test_X = test_data.drop(&quot;class&quot;, axis=1)
test_y = test_data['class']

# add FeatureSetSelector into tpot configuration
classifier_config_dict['tpot.builtins.FeatureSetSelector'] = {
'subset_list': ['https://raw.githubusercontent.com/EpistasisLab/tpot/master/tests/subset_test.csv'],
    'sel_subset': [0,1] # candidate values; each pipeline selects a single feature set by its index in the subset list above (here: subset 0 or subset 1)
    #'sel_subset': list(combinations(range(3), 2)) # alternative: each pipeline selects a pair of feature sets (requires: from itertools import combinations)
}


tpot = TPOTClassifier(generations=5,
population_size=50, verbosity=2,
template='FeatureSetSelector-Transformer-Classifier',
config_dict=classifier_config_dict)
tpot.fit(test_X, test_y)
</code></pre>

<h1 id="pipeline-caching-in-tpot">Pipeline caching in TPOT</h1>
<p>With the <code>memory</code> parameter, pipelines can cache the results of each transformer after fitting it. This feature is used to avoid repeated computation by transformers within a pipeline when the parameters and input data are identical to those of another pipeline fitted during the optimization process. TPOT allows users to specify a custom directory path or a <a href="https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/externals/joblib/memory.py#L847"><code>sklearn.externals.joblib.Memory</code></a> object in case they want to re-use the memory cache in future TPOT runs (or a <code>warm_start</code> run).</p>
<p>There are three methods for enabling memory caching in TPOT:</p>
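<p>As a minimal sketch of one such method, a custom cache directory can be passed directly (the path below is a placeholder):</p>
<pre><code class="Python">from tpot import TPOTClassifier

# fitted transformers are cached under this directory and re-used
# whenever a pipeline step sees identical parameters and input data
tpot = TPOTClassifier(memory='/tmp/tpot_cache')
</code></pre>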
@@ -684,8 +734,8 @@ <h1 id="parallel-training-with-dask">Parallel Training with Dask</h1>
<p>For large problems, or when working in a Jupyter notebook, we highly recommend distributing the work on a <a href="http://dask.pydata.org/en/latest/">Dask</a> cluster.
The <a href="https://mybinder.org/v2/gh/dask/dask-examples/master?filepath=machine-learning%2Ftpot.ipynb">dask-examples binder</a> has a runnable example
with a small dask cluster.</p>
<p>To use your Dask cluster to fit a TPOT model, specify the <code>use_dask</code> keyword when you create the TPOT estimator. <strong>Note: if <code>use_dask=True</code>, TPOT will use as many cores as available on the your Dask cluster regardless of whether <code>n_jobs</code> is specified.</strong></p>
<pre><code class="python">estimator = TPOTEstimator(use_dask=True)
<p>To use your Dask cluster to fit a TPOT model, specify the <code>use_dask</code> keyword when you create the TPOT estimator. <strong>Note: if <code>use_dask=True</code>, TPOT will use as many cores as are available on your Dask cluster. If <code>n_jobs</code> is specified, it controls the chunk size (10*<code>n_jobs</code>, if that is less than the offspring size) of parallel training.</strong></p>
<pre><code class="python">estimator = TPOTEstimator(use_dask=True, n_jobs=-1)
</code></pre>
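
<p>For example, with <code>n_jobs=4</code> and an offspring size of 100, pipelines would be trained in chunks of 10*4 = 40.</p>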

<p>This will use all the workers on your cluster to do the training, and use <a href="https://dask-ml.readthedocs.io/en/latest/hyper-parameter-search.html#avoid-repeated-work">Dask-ML's pipeline rewriting</a> to avoid re-fitting estimators multiple times on the same set of data.