Added cross-validation iterators à la scikit-learn.
evaluate() now becomes cross_validate().
GridSearch has been rewritten as well.
The docs (especially the Getting Started guide) have been substantially updated.
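
A minimal before/after sketch of the renamed entry point (the measures/cv/verbose parameters follow the new cross_validate() described in this commit; the dataset choice is illustrative):

```python
from surprise import SVD, Dataset
from surprise.model_selection import cross_validate

# Load the movielens-100k dataset (downloading it if needed).
data = Dataset.load_builtin('ml-100k')

# Before (now deprecated): evaluate(SVD(), data, measures=['RMSE', 'MAE'])
# After: cross_validate(), which accepts a cv argument (an int or any CV
# iterator) and can run the folds in parallel via joblib (n_jobs).
cross_validate(SVD(), data, measures=['RMSE', 'MAE'], cv=3, verbose=True)
```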

Squashed commit of the following:

commit 6fbb36e
Author: Nicolas Hug <contact@nicolas-hug.com>
Date:   Sat Jan 6 19:34:08 2018 +0100

    edit CHANGELOG.md

commit 72b4ceb
Author: Nicolas Hug <contact@nicolas-hug.com>
Date:   Sat Jan 6 19:29:56 2018 +0100

    Modified __main__ to use model_selection tools. Fixed download_dataset issue

commit 361892a
Author: Nicolas Hug <contact@nicolas-hug.com>
Date:   Sat Jan 6 19:14:33 2018 +0100

    pep8

commit d4c5969
Author: Nicolas Hug <contact@nicolas-hug.com>
Date:   Sat Jan 6 19:04:12 2018 +0100

    Added docstrings for the whole model_selection package

commit 40593bb
Author: Nicolas Hug <contact@nicolas-hug.com>
Date:   Sat Jan 6 14:02:35 2018 +0100

    Added deprecation warnings in the docs

commit 7cec6b4
Author: Nicolas Hug <contact@nicolas-hug.com>
Date:   Sat Jan 6 13:38:28 2018 +0100

    Modified Getting Started guide

commit 47148d3
Author: Nicolas Hug <contact@nicolas-hug.com>
Date:   Thu Jan 4 22:56:45 2018 +0100

    Some doc rewriting

commit 248b738
Author: Nicolas Hug <contact@nicolas-hug.com>
Date:   Thu Jan 4 19:26:11 2018 +0100

    Added verbose parameter for cross_validate

commit 7230054
Author: Nicolas Hug <contact@nicolas-hug.com>
Date:   Thu Jan 4 18:34:20 2018 +0100

    GridSearchCV now has the cv_results attribute

commit fd4c918
Author: Nicolas Hug <contact@nicolas-hug.com>
Date:   Thu Jan 4 13:56:59 2018 +0100

    Added deprecation warnings for evaluate() and GridSearch

commit 7904cb3
Author: Nicolas Hug <contact@nicolas-hug.com>
Date:   Thu Jan 4 13:52:35 2018 +0100

    cross_validate is now parallel with joblib

commit f3397bf
Author: Nicolas Hug <contact@nicolas-hug.com>
Date:   Tue Jan 2 19:09:33 2018 +0100

    Now ok with Python 2 as well

commit 7420cc1
Author: Nicolas Hug <contact@nicolas-hug.com>
Date:   Tue Jan 2 19:01:15 2018 +0100

    No more deprecation warnings for data.folds() in Python 3

commit a6b049a
Author: Nicolas Hug <contact@nicolas-hug.com>
Date:   Tue Jan 2 18:19:38 2018 +0100

    Added GridSearch class (TBC)

commit 39268c8
Author: Nicolas Hug <contact@nicolas-hug.com>
Date:   Tue Jan 2 10:29:23 2018 +0100

    Added cross_validate() function (like evaluate but slightly better)

commit 8f128e2
Author: Nicolas Hug <contact@nicolas-hug.com>
Date:   Mon Jan 1 19:51:30 2018 +0100

    Created a model_selection module

commit 91c3997
Author: Nicolas Hug <contact@nicolas-hug.com>
Date:   Mon Jan 1 18:29:43 2018 +0100

    Added PredefinedKFold and some deprecation warnings

commit 7d141de
Author: Nicolas Hug <contact@nicolas-hug.com>
Date:   Thu Dec 28 21:53:41 2017 +0100

    Added LeaveOneOut CV iterator

commit a8bf1f2
Author: Nicolas Hug <contact@nicolas-hug.com>
Date:   Thu Dec 28 16:52:43 2017 +0100

    Added RepeatedKFold CV iterator

commit d7a5618
Author: Nicolas Hug <contact@nicolas-hug.com>
Date:   Thu Dec 28 13:02:04 2017 +0100

    Added ShuffleSplit class and train_test_split()

commit 89e0147
Author: Nicolas Hug <contact@nicolas-hug.com>
Date:   Wed Dec 27 22:17:43 2017 +0100

    First draft on CV iterators (KFold)
NicolasHug committed Jan 6, 2018
1 parent 7455c83 commit 936b809
Showing 52 changed files with 2,067 additions and 438 deletions.
9 changes: 9 additions & 0 deletions CHANGELOG.md
@@ -4,6 +4,10 @@ VERSION 1.0.5 (latest, in development)
Enhancements
------------

* Cross-validation tools have been entirely reworked. We can now rely on
powerful and flexible cross-validation iterators, inspired by scikit-learn's
API.
* The evaluate() function has been replaced by cross_validate(), which is
parallel.
* GridSearch is now parallel, using joblib.
* The default data directory can now be customized with the
SURPRISE_DATA_FOLDER environment variable.
@@ -13,6 +17,11 @@ API Changes

* The train() method is now deprecated and replaced by the fit() method (same
signature). Calls to train() should still work as before.
* Using data.split() or accessing the data.folds() generator is deprecated and
replaced by the use of the more powerful CV iterators.
* evaluate() is deprecated and replaced by model_selection.cross_validate(),
which is parallel.
* GridSearch is deprecated and replaced by model_selection.GridSearchCV().

VERSION 1.0.4
=============
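To illustrate the reworked cross-validation tools listed in the CHANGELOG above, here is a minimal sketch of the new iterator-based API (assuming the KFold class added by this commit; the dataset and algorithm are illustrative):

```python
from surprise import SVD, Dataset, accuracy
from surprise.model_selection import KFold

data = Dataset.load_builtin('ml-100k')
algo = SVD()

# A CV iterator yields (trainset, testset) pairs, one per fold.
kf = KFold(n_splits=3)
for trainset, testset in kf.split(data):
    algo.fit(trainset)
    predictions = algo.test(testset)
    accuracy.rmse(predictions, verbose=True)
```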
2 changes: 1 addition & 1 deletion README.md
@@ -64,7 +64,7 @@ Getting started, example
------------------------

Here is a simple example showing how you can (down)load a dataset, split it for
- 3-folds cross-validation, and compute the MAE and RMSE of the
+ 3-fold cross-validation, and compute the MAE and RMSE of the
[SVD](http://surprise.readthedocs.io/en/stable/matrix_factorization.html#surprise.prediction_algorithms.matrix_factorization.SVD)
algorithm.

39 changes: 32 additions & 7 deletions TODO.md
@@ -1,19 +1,44 @@
TODO
====

* Allow discounting similarities (see Aggarwal)
* Allow incremental updates for some algorithms
* Profile code (mostly cython) to see what could be optimized

Maybe, Maybe not
----------------
* Update README example before new release, as well as computation times
* All algorithms using random initialization should allow defining a
random_state. This is paramount for having correct gridsearch results (else
different initializations are used for the various parameter combinations).
When done, change tests of these algorithms so that they all use the same
seed. Right now, tests about different RMSE values are not relevant. Also, use
SVD on a test file when possible for grid search tests. Right now we use KNN
on train (test does not have enough ratings for parameters to be impactful)
and it's slower.
* Make all fit methods (for algo and GridSearch) return self. Update docs on
building custom algorithms, and on getting started -> gridsearch (add
example?).
* Update doc of MF algo to indicate how to retrieve latent factors.

* Allow a back-up algorithm when prediction is impossible. Right now it's just
the mean rating that is predicted. Maybe the user would want to choose it.
* Make some dataset filtering tools, like removing users/items with fewer/more
than n ratings, binarizing a dataset, etc.
* check conda forge
* Allow incremental updates for some algorithms

Done:
-----

* CV iterators:
- Write basic CV iterators
  - evaluate -> rewrite to use CV iterators. Rename it to cross_validate.
- Same for GridSearch. Keep it in a model_selection module like scikit-learn
so that we can keep the old deprecated version.
- Make cross validation parallel with joblib
- Add deprecation warnings for evaluate and GridSearch()
- handle the cv_results attribute for grid search
- (re)write all verbose settings for gridsearch and cross_validate
- Change examples so they use CV iterators and the new gridsearch and
cross_validate
- indicate in docs that split(), folds(), evaluate() and gridsearch() are
deprecated
- Write comments, docstring and update all docs
- Update main and command-line usage doc in getting started.rst
* Allow to change data folder from env variable
* Complete FAQ
* Change the dumping machinery to be more consistent
16 changes: 9 additions & 7 deletions doc/source/FAQ.rst
@@ -123,11 +123,13 @@ some other will use/return an inner id.

Raw ids are ids as defined in a rating file or in a pandas dataframe. They can
be strings or numbers. Note though that if the ratings were read from a file
- which is the standard scenario, they are represented as strings (see e.g.
- :ref:`here <train_on_whole_trainset>`).
+ which is the standard scenario, they are represented as strings. **This is
+ important to know if you're using e.g.** :meth:`predict()
+ <surprise.prediction_algorithms.algo_base.AlgoBase.predict>` **or other methods
+ that accept raw ids as parameters.**

- On trainset creation, each raw id is mapped to a unique
- integer called inner id, which is a lot more suitable for `Surprise
+ On trainset creation, each raw id is mapped to a unique integer called inner
+ id, which is a lot more suitable for `Surprise
<https://nicolashug.github.io/Surprise/>`_ to manipulate. Conversions between
raw and inner ids can be done using the :meth:`to_inner_uid()
<surprise.Trainset.to_inner_uid>`, :meth:`to_inner_iid()
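
A minimal sketch of the raw/inner id conversion described above (assuming the ml-100k dataset, where raw ids read from the file are strings):

```python
from surprise import Dataset

data = Dataset.load_builtin('ml-100k')
trainset = data.build_full_trainset()

# Raw ids from a file are strings; inner ids are contiguous integers.
inner_uid = trainset.to_inner_uid('196')  # raw user id -> inner id
raw_uid = trainset.to_raw_uid(inner_uid)  # back to '196'
```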
@@ -145,8 +147,8 @@ Yes, and yes. See the :ref:`user guide <load_custom>`.
How to tune an algorithm's parameters
-------------------------------------

- You can tune the parameters of an algorithm with the :class:`GridSearch
- <surprise.evaluate.GridSearch>` class as described :ref:`here
+ You can tune the parameters of an algorithm with the :class:`GridSearchCV
+ <surprise.model_selection.search.GridSearchCV>` class as described :ref:`here
<tuning_algorithm_parameters>`. After the tuning, you may want to have an
:ref:`unbiased estimate of your algorithm performances
<unbiased_estimate_after_tuning>`.
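
A minimal sketch of the new GridSearchCV usage (the grid values and the SVD parameter names are illustrative):

```python
from surprise import SVD, Dataset
from surprise.model_selection import GridSearchCV

data = Dataset.load_builtin('ml-100k')
param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005]}

gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3)
gs.fit(data)

print(gs.best_score['rmse'])   # best average RMSE over the grid
print(gs.best_params['rmse'])  # parameter combination that achieved it
```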
@@ -163,7 +165,7 @@ with the :meth:`test()
.. literalinclude:: ../../examples/evaluate_on_trainset.py
:caption: From file ``examples/evaluate_on_trainset.py``
:name: evaluate_on_trainset.py
- :lines: 9-24
+ :lines: 9-25

Check out the example file for more usage examples.

9 changes: 5 additions & 4 deletions doc/source/building_custom_algo.rst
@@ -55,10 +55,11 @@ be done by defining the ``fit`` method:
:lines: 15-35


- The ``fit`` method is called by the :func:`evaluate
- <surprise.evaluate.evaluate>` function at each fold of a cross-validation
- process (but you can also :ref:`call it yourself <iterate_over_folds>`).
- Before doing anything, you should call the base class :meth:`fit()
+ The ``fit`` method is called e.g. by the :func:`cross_validate
+ <surprise.model_selection.validation.cross_validate>` function at each fold of
+ a cross-validation process (but you can also :ref:`call it yourself
+ <use_cross_validation_iterators>`). Before doing anything, you should call the
+ base class :meth:`fit()
<surprise.prediction_algorithms.algo_base.AlgoBase.fit>` method.
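
For context, a custom algorithm following this pattern might look like the sketch below (the class name and the mean-prediction logic are illustrative, not part of this commit):

```python
import numpy as np

from surprise import AlgoBase, Dataset
from surprise.model_selection import cross_validate


class GlobalMean(AlgoBase):
    """Toy algorithm that always predicts the global mean rating."""

    def fit(self, trainset):
        # Call the base class fit() before doing anything else.
        AlgoBase.fit(self, trainset)
        self.the_mean = np.mean([r for (_, _, r)
                                 in self.trainset.all_ratings()])
        return self

    def estimate(self, u, i):
        return self.the_mean


data = Dataset.load_builtin('ml-100k')
cross_validate(GlobalMean(), data, measures=['RMSE'], cv=3, verbose=True)
```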

The ``trainset`` attribute
