
Commit

Merge 5a73139 into b32dcfe
garciaev committed May 22, 2018
2 parents b32dcfe + 5a73139 commit e2c7712
Showing 21 changed files with 1,142 additions and 574 deletions.
4 changes: 4 additions & 0 deletions .gitignore
@@ -1,3 +1,6 @@
#Custom test files
run_test.py

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
@@ -66,3 +69,4 @@ testing.ipynb
*.lprof

*.prof
/demo_scikitrebate.ipynb
22 changes: 12 additions & 10 deletions README.md
@@ -11,21 +11,16 @@ Package information: ![Python 2.7](https://img.shields.io/badge/python-2.7-blue.
![License](https://img.shields.io/badge/license-MIT%20License-blue.svg)
[![PyPI version](https://badge.fury.io/py/skrebate.svg)](https://badge.fury.io/py/skrebate)

# scikit-rebate
# scikit-rebate (a scikit-learn-compatible Relief-based algorithm training environment)
This package includes a scikit-learn-compatible Python implementation of ReBATE, a suite of [Relief-based feature selection algorithms](https://en.wikipedia.org/wiki/Relief_(feature_selection)) for Machine Learning. These Relief-based algorithms (RBAs) are designed for feature weighting/selection as part of a machine learning pipeline (supervised learning). Presently this includes the following core RBAs: ReliefF, SURF, SURF*, and MultiSURF*. Additionally, implementations of the iterative TuRF mechanism and VLSRelief are included. **It is still under active development** and we encourage you to check back on this repository regularly for updates.

A scikit-learn-compatible Python implementation of ReBATE, a suite of Relief-based feature selection algorithms for Machine Learning.
These algorithms offer a computationally efficient way to perform feature selection that is sensitive to feature interactions as well as simple univariate associations, unlike most currently available filter-based feature selection methods. The main benefit of Relief algorithms is that they identify feature interactions without having to exhaustively check every pairwise interaction, thus taking significantly less time than exhaustive pairwise search.

## Relief-based algorithms

This package contains implementations of the [Relief](https://en.wikipedia.org/wiki/Relief_(feature_selection)) family of feature selection algorithms. **It is still under active development** and we encourage you to check back on this repository regularly for updates.

These algorithms excel at identifying features that are predictive of the outcome in supervised learning problems, and are especially good at identifying feature interactions that are normally overlooked by standard feature selection methods.

The main benefit of Relief algorithms is that they identify feature interactions without having to exhaustively check every pairwise interaction, thus taking significantly less time than exhaustive pairwise search.
Certain algorithms require user-specified run parameters (e.g., ReliefF requires the user to specify some 'k' number of nearest neighbors).
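
As a rough sketch of the core idea (not this package's optimized implementation), the following pure-Python function scores features the way a basic two-class ReliefF does: for each instance it finds the k nearest same-class neighbors (hits) and k nearest other-class neighbors (misses), then penalizes features that differ among hits and rewards features that differ among misses. The function name is illustrative, and the sketch assumes continuous features and a binary class label.

```python
def relieff_scores(X, y, k=1):
    """Minimal two-class ReliefF-style feature weighting (illustrative sketch)."""
    n, m = len(X), len(X[0])
    # Normalize feature differences by each feature's observed range.
    ranges = [(max(r[j] for r in X) - min(r[j] for r in X)) or 1.0 for j in range(m)]

    def diff(j, a, b):
        return abs(a[j] - b[j]) / ranges[j]

    def dist(a, b):
        return sum(diff(j, a, b) for j in range(m))

    w = [0.0] * m
    for i, xi in enumerate(X):
        # k nearest same-class neighbors (hits) and other-class neighbors (misses).
        hits = sorted((dist(xi, X[h]), h) for h in range(n) if h != i and y[h] == y[i])[:k]
        misses = sorted((dist(xi, X[h]), h) for h in range(n) if y[h] != y[i])[:k]
        for j in range(m):
            # Penalize features that differ among hits; reward those that differ among misses.
            w[j] -= sum(diff(j, xi, X[h]) for _, h in hits) / (n * k)
            w[j] += sum(diff(j, xi, X[h]) for _, h in misses) / (n * k)
    return w
```

On a toy data set where the first feature determines the class and the second is noise, e.g. `relieff_scores([[0, 0], [0, 1], [1, 0], [1, 1]], [0, 0, 1, 1])`, the first feature receives a high weight and the second a negative one.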

Relief algorithms are commonly applied to genetic analyses, where epistasis (i.e., feature interactions) is common. However, the algorithms implemented in this package can be applied to almost any supervised classification data set and support:

* A mix of categorical and/or continuous features
* Feature sets that are discrete/categorical, continuous-valued or a mix of both

* Data with missing values

@@ -35,6 +30,13 @@ Relief algorithms are commonly applied to genetic analyses, where epistasis (i.e.

* Continuous endpoints (i.e., regression)

Built into this code is a strategy to automatically detect these relevant characteristics from the loaded data.
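
For instance, whether each feature should be treated as discrete or continuous can be inferred by counting distinct values, in the spirit of the package's `discrete_threshold` parameter. This is a minimal sketch under that assumption; the function name is illustrative and not part of the package API.

```python
def detect_feature_types(X, discrete_threshold=10):
    """Label each column 'discrete' or 'continuous' by counting distinct values."""
    n_features = len(X[0])
    types = []
    for j in range(n_features):
        # Ignore missing entries when counting the distinct values in column j.
        distinct = {row[j] for row in X if row[j] is not None}
        types.append("discrete" if len(distinct) <= discrete_threshold else "continuous")
    return types
```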

Of our two initial ReBATE software releases, this scikit-learn-compatible version primarily focuses on ease of incorporation into a scikit-learn analysis pipeline.
This code is most appropriate for scikit-learn users, Windows operating system users, beginners, or those looking for the most recent ReBATE developments.

An alternative 'stand-alone' version of [ReBATE](https://github.com/EpistasisLab/ReBATE) is also available that focuses on improving run-time with the use of Cython for optimization. This implementation also outputs feature names and associated feature scores as a text file by default.

## License

Please see the [repository license](https://github.com/EpistasisLab/scikit-rebate/blob/master/LICENSE) for the licensing and usage information for scikit-rebate.
2 changes: 1 addition & 1 deletion docs/index.html
@@ -194,5 +194,5 @@

<!--
MkDocs version : 0.17.3
Build Date UTC : 2018-05-07 22:30:12
Build Date UTC : 2018-05-22 21:41:57
-->
8 changes: 8 additions & 0 deletions docs/installing/index.html
@@ -141,6 +141,14 @@
<pre><code>pip install skrebate
</code></pre>

<p>You can retrieve basic information about your installed version of skrebate with the following pip command:</p>
<pre><code>pip show skrebate
</code></pre>

<p>You can check that you have the most up-to-date PyPI release of skrebate with the following pip command:</p>
<pre><code>pip install skrebate -U
</code></pre>

<p>Please <a href="https://github.com/EpistasisLab/scikit-rebate/issues/new">file a new issue</a> if you run into installation problems.</p>

</div>
35 changes: 34 additions & 1 deletion docs/releases/index.html
@@ -72,6 +72,9 @@
<a class="current" href="./">Release Notes</a>
<ul class="subnav">

<li class="toctree-l2"><a href="#scikit-rebate-06">scikit-rebate 0.6</a></li>


<li class="toctree-l2"><a href="#scikit-rebate-05">scikit-rebate 0.5</a></li>


@@ -135,7 +138,37 @@
<div role="main">
<div class="section">

<h1 id="scikit-rebate-05">scikit-rebate 0.5</h1>
<h1 id="scikit-rebate-06">scikit-rebate 0.6</h1>
<ul>
<li>
<p>Fixed the internal TuRF implementation so that it outputs scores for all features. Features that survive to the last iteration receive true core-algorithm scores, while those removed along the way are assigned token scores (lower than the lowest true feature score) that indicate when the respective feature(s) were removed. This also allows greater flexibility in the number of features the user can request to be returned.</p>
</li>
<li>
<p>Updated the usage documentation to demonstrate how to use RFE as well as the newly updated internal TuRF implementation. </p>
</li>
<li>
<p>Fixed the pct parameter of TuRF so that it properly determines the percentage of features removed each iteration, as well as the total number of iterations, as described in the original TuRF paper. Also handled the edge case to ensure that at least one feature is removed in each TuRF iteration.</p>
</li>
<li>
<p>Fixed the ability to parallelize runs of the core algorithm while using TuRF.</p>
</li>
<li>
<p>Updated the unit testing file to remove some excess unit tests, add other relevant ones, speed up testing overall, and better organize the tests.</p>
</li>
<li>
<p>Added a preliminary implementation of VLSRelief to scikit-rebate, along with associated unit tests. Documentation and code examples are not yet provided.</p>
</li>
<li>
<p>Removed some unused code from TuRF implementation.</p>
</li>
<li>
<p>Added a check to the transform method required by scikit-learn, in both relieff.py and turf.py, to ensure that the number of selected features requested by the user is not larger than the number of features in the dataset.</p>
</li>
<li>
<p>Reduced the default value for the number of features selected.</p>
</li>
</ul>
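The transform-time guard described in the notes above can be sketched as follows. This is an illustrative stand-alone version, not the package's actual code; all names are hypothetical.

```python
def select_top_features(X, feature_scores, n_features_to_select):
    """Keep the highest-scoring columns; reject impossible requests up front."""
    n_features = len(X[0])
    # Guard described in the 0.6 notes: cannot select more features than exist.
    if n_features_to_select > n_features:
        raise ValueError(
            "n_features_to_select (%d) exceeds the number of features (%d)"
            % (n_features_to_select, n_features))
    ranked = sorted(range(n_features), key=lambda j: feature_scores[j], reverse=True)
    keep = sorted(ranked[:n_features_to_select])  # preserve original column order
    return [[row[j] for j in keep] for row in X]
```

For example, selecting the top 2 of 3 columns keeps the two highest-scoring columns in their original order, while asking for more columns than the data has raises a ValueError.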
<h1 id="scikit-rebate-05">scikit-rebate 0.5</h1>
<ul>
<li>
<p>Added fixes to score normalizations that should ensure that feature scores for all algorithms fall between -1 and 1. </p>
