Added cross-validation iterators à la scikit-learn.
evaluate() now becomes cross_validate().
GridSearch has been rewritten as well.
The docs (especially the Getting Started guide) have been substantially updated.
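
A minimal before/after sketch of the renamed entry point (the measures/cv/verbose parameters follow the new cross_validate() described in this commit; the dataset choice is illustrative):

```python
from surprise import SVD, Dataset
from surprise.model_selection import cross_validate

# Load the movielens-100k dataset (downloading it if needed).
data = Dataset.load_builtin('ml-100k')

# Before (now deprecated): evaluate(SVD(), data, measures=['RMSE', 'MAE'])
# After: cross_validate(), which accepts a cv argument (an int or any CV
# iterator) and can run the folds in parallel via joblib (n_jobs).
cross_validate(SVD(), data, measures=['RMSE', 'MAE'], cv=3, verbose=True)
```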

Squashed commit of the following:

commit 6fbb36e
Author: Nicolas Hug <contact@nicolas-hug.com>
Date:   Sat Jan 6 19:34:08 2018 +0100

    edit CHANGELOG.md

commit 72b4ceb
Author: Nicolas Hug <contact@nicolas-hug.com>
Date:   Sat Jan 6 19:29:56 2018 +0100

    Modified __main__ to use model_selection tools. Fixed download_dataset issue

commit 361892a
Author: Nicolas Hug <contact@nicolas-hug.com>
Date:   Sat Jan 6 19:14:33 2018 +0100

    pep8

commit d4c5969
Author: Nicolas Hug <contact@nicolas-hug.com>
Date:   Sat Jan 6 19:04:12 2018 +0100

    Added docstrings for the whole model_selection package

commit 40593bb
Author: Nicolas Hug <contact@nicolas-hug.com>
Date:   Sat Jan 6 14:02:35 2018 +0100

    Added deprecation warnings in the docs

commit 7cec6b4
Author: Nicolas Hug <contact@nicolas-hug.com>
Date:   Sat Jan 6 13:38:28 2018 +0100

    Modified Getting Started guide

commit 47148d3
Author: Nicolas Hug <contact@nicolas-hug.com>
Date:   Thu Jan 4 22:56:45 2018 +0100

    Some doc rewriting

commit 248b738
Author: Nicolas Hug <contact@nicolas-hug.com>
Date:   Thu Jan 4 19:26:11 2018 +0100

    Added verbose parameter for cross_validate

commit 7230054
Author: Nicolas Hug <contact@nicolas-hug.com>
Date:   Thu Jan 4 18:34:20 2018 +0100

    GridSearchCV now has the cv_results attribute

commit fd4c918
Author: Nicolas Hug <contact@nicolas-hug.com>
Date:   Thu Jan 4 13:56:59 2018 +0100

    Added deprecation warnings for evaluate() and GridSearch

commit 7904cb3
Author: Nicolas Hug <contact@nicolas-hug.com>
Date:   Thu Jan 4 13:52:35 2018 +0100

    cross_validate is now parallel with joblib

commit f3397bf
Author: Nicolas Hug <contact@nicolas-hug.com>
Date:   Tue Jan 2 19:09:33 2018 +0100

    Now ok with Python 2 as well

commit 7420cc1
Author: Nicolas Hug <contact@nicolas-hug.com>
Date:   Tue Jan 2 19:01:15 2018 +0100

    No more deprecation warnings for data.folds() in Python 3

commit a6b049a
Author: Nicolas Hug <contact@nicolas-hug.com>
Date:   Tue Jan 2 18:19:38 2018 +0100

    Added GridSearch class (TBC)

commit 39268c8
Author: Nicolas Hug <contact@nicolas-hug.com>
Date:   Tue Jan 2 10:29:23 2018 +0100

    Added cross_validate() function (like evaluate but slightly better)

commit 8f128e2
Author: Nicolas Hug <contact@nicolas-hug.com>
Date:   Mon Jan 1 19:51:30 2018 +0100

    Created a model_selection module

commit 91c3997
Author: Nicolas Hug <contact@nicolas-hug.com>
Date:   Mon Jan 1 18:29:43 2018 +0100

    Added PredefinedKFold and some deprecation warnings

commit 7d141de
Author: Nicolas Hug <contact@nicolas-hug.com>
Date:   Thu Dec 28 21:53:41 2017 +0100

    Added LeaveOneOut CV iterator

commit a8bf1f2
Author: Nicolas Hug <contact@nicolas-hug.com>
Date:   Thu Dec 28 16:52:43 2017 +0100

    Added RepeatedKFold CV iterator

commit d7a5618
Author: Nicolas Hug <contact@nicolas-hug.com>
Date:   Thu Dec 28 13:02:04 2017 +0100

    Added ShuffleSplit class and train_test_split()

commit 89e0147
Author: Nicolas Hug <contact@nicolas-hug.com>
Date:   Wed Dec 27 22:17:43 2017 +0100

    First draft on CV iterators (KFold)
NicolasHug committed Jan 6, 2018
1 parent 7455c83 commit 936b809
Showing 52 changed files with 2,067 additions and 438 deletions.
9 changes: 9 additions & 0 deletions CHANGELOG.md
@@ -4,6 +4,10 @@ VERSION 1.0.5 (latest, in development)
Enhancements
------------

* Cross-validation tools have been entirely reworked. We can now rely on
powerful and flexible cross-validation iterators, inspired by scikit-learn's
API.
* The evaluate() function has been replaced by cross_validate(), which is
parallel.
* GridSearch is now parallel, using joblib.
* The default data directory can now be customized with the
SURPRISE_DATA_FOLDER environment variable.
@@ -13,6 +17,11 @@ API Changes

* The train() method is now deprecated and replaced by the fit() method (same
signature). Calls to train() should still work as before.
* Using data.split() or accessing the data.folds() generator is deprecated and
replaced by the use of the more powerful CV iterators.
* evaluate() is deprecated and replaced by model_selection.cross_validate(),
which is parallel.
* GridSearch is deprecated and replaced by model_selection.GridSearchCV().

VERSION 1.0.4
=============
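To illustrate the reworked cross-validation tools listed in the CHANGELOG above, here is a minimal sketch of the new iterator-based API (assuming the KFold class added by this commit; the dataset and algorithm are illustrative):

```python
from surprise import SVD, Dataset, accuracy
from surprise.model_selection import KFold

data = Dataset.load_builtin('ml-100k')
algo = SVD()

# A CV iterator yields (trainset, testset) pairs, one per fold.
kf = KFold(n_splits=3)
for trainset, testset in kf.split(data):
    algo.fit(trainset)
    predictions = algo.test(testset)
    accuracy.rmse(predictions, verbose=True)
```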
2 changes: 1 addition & 1 deletion README.md
@@ -64,7 +64,7 @@ Getting started, example
------------------------

Here is a simple example showing how you can (down)load a dataset, split it for
- 3-folds cross-validation, and compute the MAE and RMSE of the
+ 3-fold cross-validation, and compute the MAE and RMSE of the
[SVD](http://surprise.readthedocs.io/en/stable/matrix_factorization.html#surprise.prediction_algorithms.matrix_factorization.SVD)
algorithm.

39 changes: 32 additions & 7 deletions TODO.md
@@ -1,19 +1,44 @@
TODO
====

* Allow discounting similarities (see Aggarwal)
* Allow incremental updates for some algorithms
* Profile code (mostly cython) to see what could be optimized

Maybe, Maybe not
----------------
* Update README example before new release, as well as computation times
* All algorithms using random initialization should allow defining a
random_state. This is paramount for having correct gridsearch results (else
different initializations are used for the various parameter combinations).
When done, change tests of these algorithms so that they all use the same
seed. Right now, tests about different RMSE values are not relevant. Also, use
SVD on a test file when possible for grid search tests. Right now we use KNN
on train (test does not have enough ratings for parameters to be impactful)
and it's slower.
* Make all fit methods (for algo and GridSearch) return self. Update docs on
building custom algorithms, and on getting started -> gridsearch (add
example?).
* Update doc of MF algo to indicate how to retrieve latent factors.

* Allow a back-up algorithm when prediction is impossible. Right now it's just
the mean rating that is predicted. Maybe the user would want to choose it.
* Make some dataset filtering tools, like removing users/items with fewer/more
than n ratings, binarizing a dataset, etc.
* check conda forge
* Allow incremental updates for some algorithms

Done:
-----

* CV iterators:
- Write basic CV iterators
  - evaluate -> rewrite to use CV iterators. Rename it to cross_validate.
- Same for GridSearch. Keep it in a model_selection module like scikit-learn
so that we can keep the old deprecated version.
- Make cross validation parallel with joblib
- Add deprecation warnings for evaluate and GridSearch()
- handle the cv_results attribute for grid search
- (re)write all verbose settings for gridsearch and cross_validate
- Change examples so they use CV iterators and the new gridsearch and
cross_validate
- indicate in docs that split(), folds(), evaluate() and gridsearch() are
deprecated
- Write comments, docstring and update all docs
- Update main and command-line usage doc in getting started.rst
* Allow to change data folder from env variable
* Complete FAQ
* Change the dumping machinery to be more consistent
16 changes: 9 additions & 7 deletions doc/source/FAQ.rst
@@ -123,11 +123,13 @@ some other will use/return an inner id.

Raw ids are ids as defined in a rating file or in a pandas dataframe. They can
be strings or numbers. Note though that if the ratings were read from a file
- which is the standard scenario, they are represented as strings (see e.g.
- :ref:`here <train_on_whole_trainset>`).
+ which is the standard scenario, they are represented as strings. **This is
+ important to know if you're using e.g.** :meth:`predict()
+ <surprise.prediction_algorithms.algo_base.AlgoBase.predict>` **or other methods
+ that accept raw ids as parameters.**

- On trainset creation, each raw id is mapped to a unique
- integer called inner id, which is a lot more suitable for `Surprise
+ On trainset creation, each raw id is mapped to a unique integer called inner
+ id, which is a lot more suitable for `Surprise
<https://nicolashug.github.io/Surprise/>`_ to manipulate. Conversions between
raw and inner ids can be done using the :meth:`to_inner_uid()
<surprise.Trainset.to_inner_uid>`, :meth:`to_inner_iid()
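
A minimal sketch of the raw/inner id conversion described above (assuming the ml-100k dataset, where raw ids read from the file are strings):

```python
from surprise import Dataset

data = Dataset.load_builtin('ml-100k')
trainset = data.build_full_trainset()

# Raw ids from a file are strings; inner ids are contiguous integers.
inner_uid = trainset.to_inner_uid('196')  # raw user id -> inner id
raw_uid = trainset.to_raw_uid(inner_uid)  # back to '196'
```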
@@ -145,8 +147,8 @@ Yes, and yes. See the :ref:`user guide <load_custom>`.
How to tune an algorithm's parameters
-------------------------------------

- You can tune the parameters of an algorithm with the :class:`GridSearch
- <surprise.evaluate.GridSearch>` class as described :ref:`here
+ You can tune the parameters of an algorithm with the :class:`GridSearchCV
+ <surprise.model_selection.search.GridSearchCV>` class as described :ref:`here
<tuning_algorithm_parameters>`. After the tuning, you may want to have an
:ref:`unbiased estimate of your algorithm performances
<unbiased_estimate_after_tuning>`.
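
A minimal sketch of the new GridSearchCV usage (the grid values and the SVD parameter names are illustrative):

```python
from surprise import SVD, Dataset
from surprise.model_selection import GridSearchCV

data = Dataset.load_builtin('ml-100k')
param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005]}

gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3)
gs.fit(data)

print(gs.best_score['rmse'])   # best average RMSE over the grid
print(gs.best_params['rmse'])  # parameter combination that achieved it
```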
@@ -163,7 +165,7 @@ with the :meth:`test()
.. literalinclude:: ../../examples/evaluate_on_trainset.py
:caption: From file ``examples/evaluate_on_trainset.py``
:name: evaluate_on_trainset.py
- :lines: 9-24
+ :lines: 9-25

Check out the example file for more usage examples.

9 changes: 5 additions & 4 deletions doc/source/building_custom_algo.rst
@@ -55,10 +55,11 @@ be done by defining the ``fit`` method:
:lines: 15-35


- The ``fit`` method is called by the :func:`evaluate
- <surprise.evaluate.evaluate>` function at each fold of a cross-validation
- process (but you can also :ref:`call it yourself <iterate_over_folds>`).
- Before doing anything, you should call the base class :meth:`fit()
+ The ``fit`` method is called e.g. by the :func:`cross_validate
+ <surprise.model_selection.validation.cross_validate>` function at each fold of
+ a cross-validation process (but you can also :ref:`call it yourself
+ <use_cross_validation_iterators>`). Before doing anything, you should call the
+ base class :meth:`fit()
<surprise.prediction_algorithms.algo_base.AlgoBase.fit>` method.
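
For context, a custom algorithm following this pattern might look like the sketch below (the class name and the mean-prediction logic are illustrative, not part of this commit):

```python
import numpy as np

from surprise import AlgoBase, Dataset
from surprise.model_selection import cross_validate


class GlobalMean(AlgoBase):
    """Toy algorithm that always predicts the global mean rating."""

    def fit(self, trainset):
        # Call the base class fit() before doing anything else.
        AlgoBase.fit(self, trainset)
        self.the_mean = np.mean([r for (_, _, r)
                                 in self.trainset.all_ratings()])
        return self

    def estimate(self, u, i):
        return self.the_mean


data = Dataset.load_builtin('ml-100k')
cross_validate(GlobalMean(), data, measures=['RMSE'], cv=3, verbose=True)
```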

The ``trainset`` attribute
