Commit 86cf445: Moved Reader and Trainset classes in their own module

NicolasHug committed Dec 25, 2017
1 parent aa92831
Showing 17 changed files with 484 additions and 439 deletions.
14 changes: 7 additions & 7 deletions doc/source/FAQ.rst
@@ -130,11 +130,11 @@ On trainset creation, each raw id is mapped to a unique
 integer called inner id, which is a lot more suitable for `Surprise
 <https://nicolashug.github.io/Surprise/>`_ to manipulate. Conversions between
 raw and inner ids can be done using the :meth:`to_inner_uid()
-<surprise.dataset.Trainset.to_inner_uid>`, :meth:`to_inner_iid()
-<surprise.dataset.Trainset.to_inner_iid>`, :meth:`to_raw_uid()
-<surprise.dataset.Trainset.to_raw_uid>`, and :meth:`to_raw_iid()
-<surprise.dataset.Trainset.to_raw_iid>` methods of the :class:`trainset
-<surprise.dataset.Trainset>`.
+<surprise.Trainset.to_inner_uid>`, :meth:`to_inner_iid()
+<surprise.Trainset.to_inner_iid>`, :meth:`to_raw_uid()
+<surprise.Trainset.to_raw_uid>`, and :meth:`to_raw_iid()
+<surprise.Trainset.to_raw_iid>` methods of the :class:`trainset
+<surprise.Trainset>`.


Can I use my own dataset with Surprise, and can it be a pandas dataframe
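The raw-to-inner mapping described in this hunk is easy to picture. Below is a minimal, self-contained sketch of the bookkeeping such a trainset performs; `IdMapper` and its method names are illustrative stand-ins, not surprise's actual implementation:

```python
# Sketch of the raw <-> inner id bookkeeping a Trainset maintains.
# NOT the surprise implementation, just an illustration of the idea.

class IdMapper:
    """Assigns a dense integer (inner id) to each raw id on first sight."""

    def __init__(self):
        self._raw2inner = {}
        self._inner2raw = {}

    def to_inner(self, raw_id):
        # First time we see a raw id, give it the next free integer.
        if raw_id not in self._raw2inner:
            inner = len(self._raw2inner)
            self._raw2inner[raw_id] = inner
            self._inner2raw[inner] = raw_id
        return self._raw2inner[raw_id]

    def to_raw(self, inner_id):
        return self._inner2raw[inner_id]


users = IdMapper()
print(users.to_inner('196'))  # 0: first raw id seen becomes inner id 0
print(users.to_inner('186'))  # 1
print(users.to_inner('196'))  # 0 again: the mapping is stable
print(users.to_raw(1))        # 186
```

The real `to_inner_uid()` raises an error for unknown ids instead of creating them, since the mapping is frozen once the trainset is built.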
@@ -155,8 +155,8 @@ How to get accuracy measures on the training set
 ------------------------------------------------

 You can use the :meth:`build_testset()
-<surprise.dataset.Trainset.build_testset()>` method of the :class:`Trainset
-<surprise.dataset.Trainset>` object to build a testset that can be then used
+<surprise.Trainset.build_testset()>` method of the :class:`Trainset
+<surprise.Trainset>` object to build a testset that can be then used
 with the :meth:`test()
 <surprise.prediction_algorithms.algo_base.AlgoBase.test>` method:

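Conceptually, `build_testset()` just turns the trainset's stored ratings back into plain `(user, item, rating)` triples that `test()` accepts. A standalone sketch, using hypothetical data structures rather than surprise's internals:

```python
# Sketch of what build_testset() conceptually does: flatten the stored
# ratings into the (user, item, rating) triples that test() consumes.
# The dict layout below is hypothetical, not surprise's internal format.

train_ratings = {              # user -> list of (item, rating)
    'u1': [('i1', 4.0), ('i2', 3.0)],
    'u2': [('i1', 5.0)],
}

def build_testset(ratings):
    # sorted() makes the output order deterministic for this demo
    return [(uid, iid, r)
            for uid, items in sorted(ratings.items())
            for iid, r in items]

testset = build_testset(train_ratings)
print(testset)
# [('u1', 'i1', 4.0), ('u1', 'i2', 3.0), ('u2', 'i1', 5.0)]
```

Feeding such a testset to `test()` then yields predictions on the very ratings the model was trained on, which is what "accuracy on the training set" means here.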
2 changes: 1 addition & 1 deletion doc/source/building_custom_algo.rst
@@ -68,7 +68,7 @@ Once the base class :meth:`train()
 <surprise.prediction_algorithms.algo_base.AlgoBase.train>` method has returned,
 all the info you need about the current training set (rating values, etc...) is
 stored in the ``self.trainset`` attribute. This is a :class:`Trainset
-<surprise.dataset.Trainset>` object that has many attributes and methods of
+<surprise.Trainset>` object that has many attributes and methods of
 interest for prediction.

 To illustrate its usage, let's make an algorithm that predicts an average
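As a standalone preview of the example this doc builds, here is a toy "predict the average" algorithm written without surprise so it runs on its own; the real version would subclass `AlgoBase` and read ratings from `self.trainset`:

```python
# Toy version of the "predict the average" algorithm sketched in the docs.
# Standalone stand-in: a real implementation would subclass AlgoBase and
# iterate over self.trainset.all_ratings() instead of a plain list.

class GlobalMeanAlgo:
    def train(self, ratings):
        """ratings: iterable of (user, item, rating) triples."""
        values = [r for _, _, r in ratings]
        self.the_mean = sum(values) / len(values)

    def estimate(self, u, i):
        # Every (user, item) pair gets the same prediction: the global mean.
        return self.the_mean


algo = GlobalMeanAlgo()
algo.train([('u1', 'i1', 4), ('u1', 'i2', 2), ('u2', 'i1', 3)])
print(algo.estimate('u2', 'i2'))  # 3.0
```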
1 change: 0 additions & 1 deletion doc/source/dataset.rst
@@ -6,4 +6,3 @@ dataset module
 .. automodule:: surprise.dataset
     :members:
     :exclude-members: BuiltinDataset, read_ratings, DatasetUserFolds,
-                      parse_line
8 changes: 4 additions & 4 deletions doc/source/getting_started.rst
@@ -47,7 +47,7 @@ dataset:
 - or if your dataset is already split into predefined folds, you can specify a
   list of files for training and testing.

-Either way, you will need to define a :class:`Reader <surprise.dataset.Reader>`
+Either way, you will need to define a :class:`Reader <surprise.reader.Reader>`
 object for `Surprise <https://nicolashug.github.io/Surprise/>`_ to be able to
 parse the file(s) or the dataframe. We'll see now how to handle both cases.

@@ -65,7 +65,7 @@ Load an entire dataset from a file or a dataframe
     :lines: 17-26

 For more details about readers and how to use them, see the :class:`Reader
-class <surprise.dataset.Reader>` documentation.
+class <surprise.reader.Reader>` documentation.

 .. note::
     As you already know from the previous section, the Movielens-100k dataset
@@ -76,7 +76,7 @@ Load an entire dataset from a file or a dataframe

 - To load a dataset from a pandas dataframe, you will need the
   :meth:`load_from_df() <surprise.dataset.Dataset.load_from_df>` method. You
-  will also need a :class:`Reader<surprise.dataset.Reader>` object, but only
+  will also need a :class:`Reader<surprise.reader.Reader>` object, but only
   the ``rating_scale`` parameter must be specified. The dataframe must have
   three columns, corresponding to the user (raw) ids, the item (raw) ids, and
   the ratings in this order. Each row thus corresponds to a given rating. This
@@ -241,7 +241,7 @@ performing cross-validation (i.e. there is no test set).
 The latter is pretty straightforward: all you need is to load a dataset, and
 the :meth:`build_full_trainset()
 <surprise.dataset.DatasetAutoFolds.build_full_trainset>` method to build the
-:class:`trainset <surprise.dataset.Trainset>` and train you algorithm:
+:class:`trainset <surprise.trainset.Trainset>` and train you algorithm:

 .. literalinclude:: ../../examples/query_for_predictions.py
     :caption: From file ``examples/query_for_predictions.py``
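The `line_format` and `sep` parameters mentioned in these hunks drive how a Reader splits each line of a ratings file. A self-contained sketch of that parsing job (a hypothetical helper, not `surprise.reader.Reader` itself):

```python
# Sketch of the parsing work a Reader does, driven by the line_format and
# sep parameters from the docs. Hypothetical helper, not surprise's code.

def parse_line(line, line_format='user item rating timestamp', sep='\t'):
    fields = line.rstrip('\n').split(sep)
    names = line_format.split()
    if len(fields) != len(names):
        raise ValueError('line does not match line_format')
    record = dict(zip(names, fields))
    # A Reader ultimately yields (user, item, rating) for each line.
    return record['user'], record['item'], float(record['rating'])


# A line in the ml-100k tab-separated format:
uid, iid, rating = parse_line('196\t242\t3\t881250949')
print(uid, iid, rating)  # 196 242 3.0
```

With a dataframe, by contrast, only `rating_scale` is needed because there is no text to split: the three columns already are the user ids, item ids, and ratings.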
2 changes: 2 additions & 0 deletions doc/source/index.rst
@@ -44,5 +44,7 @@ to contribute and send pull requests (see `GitHub page
    similarities
    accuracy
    dataset
+   trainset
+   reader
    evaluate
    dump
9 changes: 9 additions & 0 deletions doc/source/reader.rst
@@ -0,0 +1,9 @@
+.. _reader:
+
+Reader class
+============
+
+.. autoclass:: surprise.reader.Reader
+    :members:
+    :exclude-members: parse_line
+
7 changes: 7 additions & 0 deletions doc/source/trainset.rst
@@ -0,0 +1,7 @@
+.. _trainset:
+
+Trainset class
+==============
+
+.. autoclass:: surprise.Trainset
+    :members:
6 changes: 3 additions & 3 deletions surprise/__init__.py
@@ -17,9 +17,9 @@
 from .prediction_algorithms import Prediction

 from .dataset import Dataset
-from .dataset import Reader
-from .dataset import Trainset
-from .dataset import get_dataset_dir
+from .reader import Reader
+from .trainset import Trainset
+from .builtin_datasets import get_dataset_dir
 from .evaluate import evaluate
 from .evaluate import print_perf
 from .evaluate import GridSearch
8 changes: 4 additions & 4 deletions surprise/__main__.py
@@ -22,7 +22,7 @@
 from surprise.prediction_algorithms import CoClustering
 import surprise.dataset as dataset
 from surprise.dataset import Dataset
-from surprise.dataset import Reader  # noqa
+from surprise.builtin_datasets import get_dataset_dir
 from surprise.evaluate import evaluate
 from surprise import __version__

@@ -137,11 +137,11 @@ def error(self, message):
                         default=None,
                         help='Where to dump the files. Ignored if ' +
                         'with-dump is not set. Default is ' +
-                        os.path.join(dataset.get_dataset_dir(), 'dumps/')
+                        os.path.join(get_dataset_dir(), 'dumps/')
                         )

     parser.add_argument('--clean', dest='clean', action='store_true',
-                        help='Remove the ' + dataset.get_dataset_dir() +
+                        help='Remove the ' + get_dataset_dir() +
                         ' directory and exit.'
                         )

@@ -151,7 +151,7 @@ def error(self, message):
     args = parser.parse_args()

     if args.clean:
-        folder = dataset.get_dataset_dir()
+        folder = get_dataset_dir()
         shutil.rmtree(folder)
         print('Removed', folder)
         exit()
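The `--clean` hunks above parse a boolean flag and wipe the dataset directory. A standalone sketch of the same pattern, pointed at a throwaway temporary directory (not the real `~/.surprise_data`) so it is safe to run:

```python
# Standalone sketch of the --clean flag wired up in __main__.py above:
# parse a store_true option, then remove a target directory.
# Uses a temp dir as a stand-in for get_dataset_dir().

import argparse
import os
import shutil
import tempfile

def make_parser(default_dir):
    parser = argparse.ArgumentParser(description='sketch of the surprise CLI')
    parser.add_argument('--clean', dest='clean', action='store_true',
                        help='Remove the ' + default_dir +
                        ' directory and exit.')
    return parser

data_dir = tempfile.mkdtemp()   # stand-in for get_dataset_dir()
args = make_parser(data_dir).parse_args(['--clean'])
if args.clean:
    shutil.rmtree(data_dir)
    print('Removed', data_dir)

print(os.path.exists(data_dir))  # False
```

Note a subtlety the real code inherits: because the help string calls `get_dataset_dir()` eagerly, merely building the parser touches the dataset directory.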
66 changes: 66 additions & 0 deletions surprise/builtin_datasets.py
@@ -0,0 +1,66 @@
+from six.moves.urllib.request import urlretrieve
+import zipfile
+from collections import namedtuple
+import os
+from os.path import join
+
+
+def get_dataset_dir():
+    '''Return folder where downloaded datasets and other data are stored.
+    Default folder is ~/.surprise_data/, but it can also be set by the
+    environment variable ``SURPRISE_DATA_FOLDER``.
+    '''
+
+    folder = os.environ.get('SURPRISE_DATA_FOLDER', os.path.expanduser('~') +
+                            '/.surprise_data/')
+    if not os.path.exists(folder):
+        os.makedirs(folder)
+
+    return folder
+
+
+# a builtin dataset has
+# - an url (where to download it)
+# - a path (where it is located on the filesystem)
+# - the parameters of the corresponding reader
+BuiltinDataset = namedtuple('BuiltinDataset', ['url', 'path', 'reader_params'])
+
+BUILTIN_DATASETS = {
+    'ml-100k':
+        BuiltinDataset(
+            url='http://files.grouplens.org/datasets/movielens/ml-100k.zip',
+            path=join(get_dataset_dir(), 'ml-100k/ml-100k/u.data'),
+            reader_params=dict(line_format='user item rating timestamp',
+                               rating_scale=(1, 5),
+                               sep='\t')
+        ),
+    'ml-1m':
+        BuiltinDataset(
+            url='http://files.grouplens.org/datasets/movielens/ml-1m.zip',
+            path=join(get_dataset_dir(), 'ml-1m/ml-1m/ratings.dat'),
+            reader_params=dict(line_format='user item rating timestamp',
+                               rating_scale=(1, 5),
+                               sep='::')
+        ),
+    'jester':
+        BuiltinDataset(
+            url='http://eigentaste.berkeley.edu/dataset/jester_dataset_2.zip',
+            path=join(get_dataset_dir(), 'jester/jester_ratings.dat'),
+            reader_params=dict(line_format='user item rating',
+                               rating_scale=(-10, 10))
+        )
+}
+
+
+def download_builtin_dataset(name, dataset):
+
+    print('Trying to download dataset from ' + dataset.url + '...')
+    tmp_file_path = join(get_dataset_dir(), 'tmp.zip')
+    urlretrieve(dataset.url, tmp_file_path)
+
+    with zipfile.ZipFile(tmp_file_path, 'r') as tmp_zip:
+        tmp_zip.extractall(join(get_dataset_dir(), name))
+
+    os.remove(tmp_file_path)
+    print('Done! Dataset', name, 'has been saved to',
+          join(get_dataset_dir(), name))

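The `SURPRISE_DATA_FOLDER` override introduced in `get_dataset_dir()` is easy to check in isolation. The function body below is copied from the diff, but redefined standalone and pointed at a temporary directory, so running this does not create the real `~/.surprise_data`:

```python
# Demonstrates the SURPRISE_DATA_FOLDER lookup used by get_dataset_dir()
# above, re-implemented standalone and aimed at a temp dir so the real
# home directory is left untouched.

import os
import tempfile

def get_dataset_dir():
    folder = os.environ.get('SURPRISE_DATA_FOLDER',
                            os.path.expanduser('~') + '/.surprise_data/')
    if not os.path.exists(folder):
        os.makedirs(folder)
    return folder

override = os.path.join(tempfile.mkdtemp(), 'my_data')
os.environ['SURPRISE_DATA_FOLDER'] = override

folder = get_dataset_dir()
print(folder == override)     # True: the env var wins over the default
print(os.path.isdir(folder))  # True: the folder is created on first use
```

This is also why `BUILTIN_DATASETS` paths pick up the override: each `path=join(get_dataset_dir(), ...)` entry calls the function at import time.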