Getting Started
===============

Basic usage
-----------

Automatic cross-validation
~~~~~~~~~~~~~~~~~~~~~~~~~~

Surprise has a set of built-in :ref:`algorithms<prediction_algorithms>` and :ref:`datasets <dataset>` for you to play with. In its simplest form, it only takes a few lines of code to run a cross-validation procedure:

.. literalinclude:: ../../examples/basic_usage.py
    :caption: From file ``examples/basic_usage.py``
    :name: basic_usage.py
    :lines: 9-
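
The gist of the included script is only a few lines like these (a sketch; the exact example file may differ slightly):

.. code-block:: python

    from surprise import SVD, Dataset
    from surprise.model_selection import cross_validate

    # Load the movielens-100k dataset (downloaded if needed).
    data = Dataset.load_builtin('ml-100k')

    # Use the famous SVD algorithm.
    algo = SVD()

    # Run 5-fold cross-validation and print the results.
    cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)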

The result should be as follows (actual values may vary due to randomization):

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

            Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std
RMSE        0.9311  0.9370  0.9320  0.9317  0.9391  0.9342  0.0032
MAE         0.7350  0.7375  0.7341  0.7342  0.7375  0.7357  0.0015
Fit time    6.53    7.11    7.23    7.15    3.99    6.40    1.23
Test time   0.26    0.26    0.25    0.15    0.13    0.21    0.06

The :meth:`load_builtin() <surprise.dataset.Dataset.load_builtin>` method will offer to download the movielens-100k dataset if it has not already been downloaded, and it will save it in the ``.surprise_data`` folder in your home directory (you can also choose to save it :ref:`somewhere else <data_folder>`).

We are here using the well-known :class:`SVD<surprise.prediction_algorithms.matrix_factorization.SVD>` algorithm, but many other algorithms are available. See :ref:`prediction_algorithms` for more details.

The :func:`cross_validate()<surprise.model_selection.validation.cross_validate>` function runs a cross-validation procedure according to the ``cv`` argument, and computes some :mod:`accuracy <surprise.accuracy>` measures. We are here using a classical 5-fold cross-validation, but fancier iterators can be used (see :ref:`here <cross_validation_iterators_api>`).

Train-test split and the fit() method
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If you don't want to run a full cross-validation procedure, you can use the :func:`train_test_split() <surprise.model_selection.split.train_test_split>` function to sample a trainset and a testset with given sizes, and use the :mod:`accuracy metric <surprise.accuracy>` of your choosing. You'll need to use the :meth:`fit() <surprise.prediction_algorithms.algo_base.AlgoBase.fit>` method, which will train the algorithm on the trainset, and the :meth:`test() <surprise.prediction_algorithms.algo_base.AlgoBase.test>` method, which will return the predictions made on the testset:

.. literalinclude:: ../../examples/train_test_split.py
    :caption: From file ``examples/train_test_split.py``
    :name: train_test_split.py
    :lines: 8-
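
In essence, the example does something like this (a sketch; the 25% test size is an illustrative choice):

.. code-block:: python

    from surprise import SVD, Dataset, accuracy
    from surprise.model_selection import train_test_split

    data = Dataset.load_builtin('ml-100k')

    # Sample a random trainset and testset, with 25% of the ratings for testing.
    trainset, testset = train_test_split(data, test_size=0.25)

    algo = SVD()

    # Train the algorithm on the trainset, and predict ratings for the testset.
    algo.fit(trainset)
    predictions = algo.test(testset)

    # Then compute RMSE.
    accuracy.rmse(predictions)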

Result:

RMSE: 0.9411

Note that you can train and test an algorithm in a single line of code:

predictions = algo.fit(trainset).test(testset)

In some cases, your trainset and testset are already defined by some files. Please refer to :ref:`this section <load_from_folds_example>` to handle such cases.

Train on a whole trainset and the predict() method
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Obviously, we could also simply fit our algorithm to the whole dataset, rather than running cross-validation. This can be done by using the :meth:`build_full_trainset() <surprise.dataset.DatasetAutoFolds.build_full_trainset>` method which will build a :class:`trainset <surprise.Trainset>` object:

.. literalinclude:: ../../examples/predict_ratings.py
    :caption: From file ``examples/predict_ratings.py``
    :name: predict_ratings.py
    :lines: 9-20
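
A sketch of this step (the choice of a k-NN algorithm here is an assumption, suggested by the ``actual_k`` field in the prediction details shown below):

.. code-block:: python

    from surprise import KNNBasic, Dataset

    data = Dataset.load_builtin('ml-100k')

    # Retrieve the trainset built from the *whole* dataset.
    trainset = data.build_full_trainset()

    # Build an algorithm, and train it.
    algo = KNNBasic()
    algo.fit(trainset)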

We can now predict ratings by directly calling the :meth:`predict() <surprise.prediction_algorithms.algo_base.AlgoBase.predict>` method. Let's say you're interested in user 196 and item 302 (make sure they're in the trainset!), and you know that the true rating :math:`r_{ui} = 4`:

.. literalinclude:: ../../examples/predict_ratings.py
    :caption: From file ``examples/predict_ratings.py``
    :name: predict_ratings2.py
    :lines: 23-27
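
The call itself may look like this (a sketch; note that the raw ids are passed as strings):

.. code-block:: python

    uid = str(196)  # raw user id, a string since the data comes from a file
    iid = str(302)  # raw item id

    # Get a prediction for this user and item; r_ui is the (known) true rating.
    pred = algo.predict(uid, iid, r_ui=4, verbose=True)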

The result should be:

user: 196        item: 302        r_ui = 4.00   est = 4.06   {'actual_k': 40, 'was_impossible': False}

.. note::
    The :meth:`predict() <surprise.prediction_algorithms.algo_base.AlgoBase.predict>` method uses raw ids (please read :ref:`this <raw_inner_note>` about raw and inner ids). As the dataset we have used has been read from a file, the raw ids are strings (even if they represent numbers).

We have so far used a built-in dataset, but you can of course use your own. This is explained in the next section.

Use a custom dataset
--------------------

Surprise has a set of built-in :ref:`datasets <dataset>`, but you can of course use a custom dataset. Loading a rating dataset can be done either from a file (e.g. a csv file) or from a pandas dataframe. Either way, you will need to define a :class:`Reader <surprise.reader.Reader>` object for Surprise to be able to parse the file or the dataframe.

  - To load a dataset from a file (e.g. a csv file), you will need the :meth:`load_from_file() <surprise.dataset.Dataset.load_from_file>` method (see also the sketch after this list):

    .. literalinclude:: ../../examples/load_custom_dataset.py
        :caption: From file ``examples/load_custom_dataset.py``
        :name: load_custom_dataset.py
        :lines: 12-28
    
    

    For more details about readers and how to use them, see the :class:`Reader class <surprise.reader.Reader>` documentation.

    .. note::
        As you already know from the previous section, the Movielens-100k dataset is built-in, so a much quicker way to load it is to do ``data = Dataset.load_builtin('ml-100k')``. We will of course ignore this here.

  - To load a dataset from a pandas dataframe, you will need the :meth:`load_from_df() <surprise.dataset.Dataset.load_from_df>` method. You will also need a :class:`Reader<surprise.reader.Reader>` object, but only the ``rating_scale`` parameter must be specified. The dataframe must have three columns, corresponding to the user (raw) ids, the item (raw) ids, and the ratings, in this order. Each row thus corresponds to a given rating. This is not restrictive, as you can easily reorder the columns of your dataframe (see the sketch after this list):

    .. literalinclude:: ../../examples/load_from_dataframe.py
        :caption: From file ``examples/load_from_dataframe.py``
        :name: load_from_dataframe.py
        :lines: 8-29
    
    

    The dataframe initially looks like this:

          itemID  rating    userID
    0       1       3         9
    1       1       2        32
    2       1       4         2
    3       2       3        45
    4       2       1  user_foo
    
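Both cases follow the same pattern; here is a sketch of each (the file path, separator and column names are illustrative, not prescribed):

.. code-block:: python

    import os

    import pandas as pd

    from surprise import Dataset, Reader

    # From a file: line_format and sep must match the actual layout of your
    # file. The path below is illustrative.
    file_path = os.path.expanduser('~/my_ratings.csv')
    reader = Reader(line_format='user item rating timestamp', sep=',')
    data = Dataset.load_from_file(file_path, reader=reader)

    # From a pandas dataframe: only rating_scale is needed for the Reader, and
    # the columns must be user ids, item ids and ratings, in this order.
    ratings_dict = {'itemID': [1, 1, 1, 2, 2],
                    'rating': [3, 2, 4, 3, 1],
                    'userID': [9, 32, 2, 45, 'user_foo']}
    df = pd.DataFrame(ratings_dict)
    reader = Reader(rating_scale=(1, 5))
    data = Dataset.load_from_df(df[['userID', 'itemID', 'rating']], reader)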

Use cross-validation iterators
------------------------------

For cross-validation, we can use the :func:`cross_validate() <surprise.model_selection.validation.cross_validate>` function that does all the hard work for us. But for better control, we can also instantiate a cross-validation iterator, and make predictions over each split using the ``split()`` method of the iterator and the :meth:`test()<surprise.prediction_algorithms.algo_base.AlgoBase.test>` method of the algorithm. Here is an example where we use a classical K-fold cross-validation procedure with 3 splits:

.. literalinclude:: ../../examples/use_cross_validation_iterators.py
    :caption: From file ``examples/use_cross_validation_iterators.py``
    :name: use_cross_validation_iterators.py
    :lines: 8-
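
A sketch of the loop (names follow the public model_selection API):

.. code-block:: python

    from surprise import SVD, Dataset, accuracy
    from surprise.model_selection import KFold

    data = Dataset.load_builtin('ml-100k')

    # Define a cross-validation iterator.
    kf = KFold(n_splits=3)

    algo = SVD()

    for trainset, testset in kf.split(data):

        # Train and test the algorithm on each split.
        algo.fit(trainset)
        predictions = algo.test(testset)

        # Compute and print Root Mean Squared Error.
        accuracy.rmse(predictions, verbose=True)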

The result could be, e.g.:

RMSE: 0.9374
RMSE: 0.9476
RMSE: 0.9478

Other cross-validation iterators can be used, like LeaveOneOut or ShuffleSplit. See all the available iterators :ref:`here <cross_validation_iterators_api>`. The design of Surprise's cross-validation tools is heavily inspired by the excellent scikit-learn API.


A special case of cross-validation is when the folds are already predefined by some files. For instance, the movielens-100k dataset already provides 5 train and test files (``u1.base``, ``u1.test``, ..., ``u5.base``, ``u5.test``). Surprise can handle this case by using a :class:`surprise.model_selection.split.PredefinedKFold` object:

.. literalinclude:: ../../examples/load_custom_dataset_predefined_folds.py
    :caption: From file ``examples/load_custom_dataset_predefined_folds.py``
    :name: load_custom_dataset_predefined_folds.py
    :lines: 13-
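
A sketch of the whole procedure (the files_dir path assumes the default download location of the built-in dataset and is illustrative):

.. code-block:: python

    import os

    from surprise import SVD, Dataset, Reader, accuracy
    from surprise.model_selection import PredefinedKFold

    # Illustrative path: where load_builtin() stores the movielens-100k files.
    files_dir = os.path.expanduser('~/.surprise_data/ml-100k/ml-100k/')

    # Use the built-in reader for the movielens-100k rating format.
    reader = Reader('ml-100k')

    # folds_files is a list of tuples containing file paths:
    # [(u1.base, u1.test), (u2.base, u2.test), ..., (u5.base, u5.test)]
    train_file = files_dir + 'u%d.base'
    test_file = files_dir + 'u%d.test'
    folds_files = [(train_file % i, test_file % i) for i in (1, 2, 3, 4, 5)]

    data = Dataset.load_from_folds(folds_files, reader=reader)
    pkf = PredefinedKFold()

    algo = SVD()

    for trainset, testset in pkf.split(data):
        algo.fit(trainset)
        predictions = algo.test(testset)
        accuracy.rmse(predictions, verbose=True)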

Of course, nothing prevents you from only loading a single file for training and a single file for testing. However, the ``folds_files`` parameter still needs to be a list.

Tune algorithm parameters with GridSearchCV
-------------------------------------------

The :func:`cross_validate() <surprise.model_selection.validation.cross_validate>` function reports accuracy metrics over a cross-validation procedure for a given set of parameters. If you want to know which parameter combination yields the best results, the :class:`GridSearchCV <surprise.model_selection.search.GridSearchCV>` class comes to the rescue. Given a dict of parameters, this class exhaustively tries all combinations and reports the best parameters for any accuracy measure (averaged over the different splits). It is heavily inspired by scikit-learn's GridSearchCV.

Here is an example where we try different values for the parameters ``n_epochs``, ``lr_all`` and ``reg_all`` of the :class:`SVD <surprise.prediction_algorithms.matrix_factorization.SVD>` algorithm:

.. literalinclude:: ../../examples/grid_search_usage.py
    :caption: From file ``examples/grid_search_usage.py``
    :name: grid_search_usage.py
    :lines: 9-26
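
A sketch of the search (the grid values here are chosen to match the results shown below):

.. code-block:: python

    from surprise import SVD, Dataset
    from surprise.model_selection import GridSearchCV

    data = Dataset.load_builtin('ml-100k')

    param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005],
                  'reg_all': [0.4, 0.6]}
    gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3)

    gs.fit(data)

    # Best RMSE score, and the parameter combination that achieved it.
    print(gs.best_score['rmse'])
    print(gs.best_params['rmse'])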

Result:

0.961300130118
{'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.4}

We are here evaluating the average RMSE and MAE over a 3-fold cross-validation procedure, but any :ref:`cross-validation iterator <cross_validation_iterators_api>` can be used.

Once ``fit()`` has been called, the ``best_estimator`` attribute gives us an algorithm instance with the optimal set of parameters, which can be used however we please:

.. literalinclude:: ../../examples/grid_search_usage.py
    :caption: From file ``examples/grid_search_usage.py``
    :name: grid_search_usage2.py
    :lines: 28-30
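
For instance (a sketch):

.. code-block:: python

    # Retrieve the algorithm that gave the best RMSE, and retrain it on the
    # whole dataset.
    algo = gs.best_estimator['rmse']
    algo.fit(data.build_full_trainset())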

.. note::
    Dictionary parameters such as ``bsl_options`` and ``sim_options`` require particular treatment. See usage example below:

param_grid = {'k': [10, 20],
              'sim_options': {'name': ['msd', 'cosine'],
                              'min_support': [1, 5],
                              'user_based': [False]}
              }

Naturally, both can be combined, for example for the :class:`KNNBaseline <surprise.prediction_algorithms.knns.KNNBaseline>` algorithm:

param_grid = {'bsl_options': {'method': ['als', 'sgd'],
                              'reg': [1, 2]},
              'k': [2, 3],
              'sim_options': {'name': ['msd', 'cosine'],
                              'min_support': [1, 5],
                              'user_based': [False]}
              }

For further analysis, the ``cv_results`` attribute has all the needed information and can be imported into a pandas dataframe:

.. literalinclude:: ../../examples/grid_search_usage.py
    :caption: From file ``examples/grid_search_usage.py``
    :name: grid_search_usage3.py
    :lines: 33
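
This amounts to a single line (a sketch):

.. code-block:: python

    import pandas as pd

    # cv_results is a dict of lists, directly consumable by pandas.
    results_df = pd.DataFrame.from_dict(gs.cv_results)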

In our example, the ``cv_results`` attribute looks like this (floats are formatted):

'split0_test_rmse': [1.0, 1.0, 0.97, 0.98, 0.98, 0.99, 0.96, 0.97]
'split1_test_rmse': [1.0, 1.0, 0.97, 0.98, 0.98, 0.99, 0.96, 0.97]
'split2_test_rmse': [1.0, 1.0, 0.97, 0.98, 0.98, 0.99, 0.96, 0.97]
'mean_test_rmse':   [1.0, 1.0, 0.97, 0.98, 0.98, 0.99, 0.96, 0.97]
'std_test_rmse':    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
'rank_test_rmse':   [7 8 3 5 4 6 1 2]
'split0_test_mae':  [0.81, 0.82, 0.78, 0.79, 0.79, 0.8, 0.77, 0.79]
'split1_test_mae':  [0.8, 0.81, 0.78, 0.79, 0.78, 0.79, 0.77, 0.78]
'split2_test_mae':  [0.81, 0.81, 0.78, 0.79, 0.78, 0.8, 0.77, 0.78]
'mean_test_mae':    [0.81, 0.81, 0.78, 0.79, 0.79, 0.8, 0.77, 0.78]
'std_test_mae':     [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
'rank_test_mae':    [7 8 2 5 4 6 1 3]
'mean_fit_time':    [1.53, 1.52, 1.53, 1.53, 3.04, 3.05, 3.06, 3.02]
'std_fit_time':     [0.03, 0.04, 0.0, 0.01, 0.04, 0.01, 0.06, 0.01]
'mean_test_time':   [0.46, 0.45, 0.44, 0.44, 0.47, 0.49, 0.46, 0.34]
'std_test_time':    [0.0, 0.01, 0.01, 0.0, 0.03, 0.06, 0.01, 0.08]
'params':           [{'n_epochs': 5, 'lr_all': 0.002, 'reg_all': 0.4}, {'n_epochs': 5, 'lr_all': 0.002, 'reg_all': 0.6}, {'n_epochs': 5, 'lr_all': 0.005, 'reg_all': 0.4}, {'n_epochs': 5, 'lr_all': 0.005, 'reg_all': 0.6}, {'n_epochs': 10, 'lr_all': 0.002, 'reg_all': 0.4}, {'n_epochs': 10, 'lr_all': 0.002, 'reg_all': 0.6}, {'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.4}, {'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.6}]
'param_n_epochs':   [5, 5, 5, 5, 10, 10, 10, 10]
'param_lr_all':     [0.0, 0.0, 0.01, 0.01, 0.0, 0.0, 0.01, 0.01]
'param_reg_all':    [0.4, 0.6, 0.4, 0.6, 0.4, 0.6, 0.4, 0.6]

As you can see, each list has the same size as the number of parameter combinations. It corresponds to the following table:

split0_test_rmse split1_test_rmse split2_test_rmse mean_test_rmse std_test_rmse rank_test_rmse split0_test_mae split1_test_mae split2_test_mae mean_test_mae std_test_mae rank_test_mae mean_fit_time std_fit_time mean_test_time std_test_time params param_n_epochs param_lr_all param_reg_all
0.99775 0.997744 0.996378 0.997291 0.000645508 7 0.807862 0.804626 0.805282 0.805923 0.00139657 7 1.53341 0.0305216 0.455831 0.000922113 {'n_epochs': 5, 'lr_all': 0.002, 'reg_all': 0.4} 5 0.002 0.4
1.00381 1.00304 1.00257 1.00314 0.000508358 8 0.816559 0.812905 0.813772 0.814412 0.00155866 8 1.5199 0.0367117 0.451068 0.00938646 {'n_epochs': 5, 'lr_all': 0.002, 'reg_all': 0.6} 5 0.002 0.6
0.973524 0.973595 0.972495 0.973205 0.000502609 3 0.783361 0.780242 0.78067 0.781424 0.00138049 2 1.53449 0.00496203 0.441558 0.00529696 {'n_epochs': 5, 'lr_all': 0.005, 'reg_all': 0.4} 5 0.005 0.4
0.98229 0.982059 0.981486 0.981945 0.000338056 5 0.794481 0.790781 0.79186 0.792374 0.00155377 5 1.52739 0.00859185 0.44463 0.000888907 {'n_epochs': 5, 'lr_all': 0.005, 'reg_all': 0.6} 5 0.005 0.6
0.978034 0.978407 0.976919 0.977787 0.000632049 4 0.787643 0.784723 0.784957 0.785774 0.00132486 4 3.03572 0.0431101 0.466606 0.0254965 {'n_epochs': 10, 'lr_all': 0.002, 'reg_all': 0.4} 10 0.002 0.4
0.986263 0.985817 0.985004 0.985695 0.000520899 6 0.798218 0.794457 0.795373 0.796016 0.00160135 6 3.0544 0.00636185 0.488357 0.0576194 {'n_epochs': 10, 'lr_all': 0.002, 'reg_all': 0.6} 10 0.002 0.6
0.963751 0.963463 0.962676 0.963297 0.000454661 1 0.774036 0.770548 0.771588 0.772057 0.00146201 1 3.0636 0.0597982 0.456484 0.00510321 {'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.4} 10 0.005 0.4
0.973605 0.972868 0.972765 0.973079 0.000374222 2 0.78607 0.781918 0.783537 0.783842 0.00170855 3 3.01907 0.011834 0.338839 0.075346 {'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.6} 10 0.005 0.6

Command line usage
------------------

Surprise can also be used from the command line, for example:

surprise -algo SVD -params "{'n_epochs': 5, 'verbose': True}" -load-builtin ml-100k -n-folds 3

See detailed usage by running:

surprise -h