
Datasets

For easy experimentation, Cornac offers access to a number of popular recommendation benchmark datasets. These are listed below along with their basic characteristics, followed by a usage example. In addition to preference feedback, some of these datasets come with item and/or user auxiliary information, which are grouped into three main categories:

  • Text refers to textual information associated with items or users. The usual format of this data is (item_id, text), or (user_id, text). Concrete examples of such information are item textual descriptions, product reviews, movie plots, and user reviews, just to name a few.
  • Graph, for items, corresponds to a network where nodes (or vertices) are items, and links (or edges) represent relations among items. This information is typically represented by an adjacency matrix in the sparse triplet format: (item_id, item_id, weight), or simply (item_id, item_id) in the case of unweighted edges. Relations between users (e.g., social network) are represented similarly.
  • Image consists of visual information paired with either users or items. The common format for this type of auxiliary data is (object_id, ndarray), where object_id is either a user_id or an item_id, and the ndarray contains either the raw images (pixel intensities) or visual feature vectors extracted from the images, e.g., using deep neural nets. For instance, the Amazon clothing dataset includes product CNN visual features.
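As a minimal illustration of these three formats (with made-up IDs and values, not taken from any of the datasets below), the auxiliary data can be represented as plain Python tuples:

```python
# Toy examples of the three auxiliary-data formats (hypothetical IDs/values).

# Text: (item_id, text)
item_text = [
    ("i1", "A lightweight waterproof jacket."),
    ("i2", "Classic denim jeans with a straight fit."),
]

# Graph: sparse triplets (item_id, item_id, weight);
# drop the weight for unweighted edges, i.e., (item_id, item_id).
item_graph = [
    ("i1", "i2", 0.8),
    ("i2", "i3", 0.5),
]

# Image: (object_id, ndarray); a plain list stands in for the ndarray here,
# e.g., a short visual feature vector instead of raw pixels.
item_visual = [
    ("i1", [0.1, 0.0, 0.7, 0.2]),
]

print(len(item_text), len(item_graph), len(item_visual[0][1]))
```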

How to cite. If you are using one of the datasets listed below in your research, please follow the citation guidelines of the authors of each respective dataset (see the "source" link in the table below).


| Dataset | #Users | #Items | #Interactions | Preference Type | Rating Scale |
|---|---|---|---|---|---|
| Amazon Clothing (source) | 5,377 | 3,393 | 13,689 | INT | [1,5] |
| Amazon Office (source) | 3,703 | 6,523 | 53,282 | INT | [1,5] |
| Amazon Toy (source) | 19,412 | 11,924 | 167,597 | INT | [1,5] |
| Citeulike (source) | 5,551 | 16,980 | 210,537 | BIN | {0,1} |
| Epinions (source) | 40,163 | 139,738 | 664,824 | INT | [1,5] |
| FilmTrust (source) | 1,508 | 2,071 | 35,497 | REAL | [0.5,4] |
| MovieLens 100k (source) | 943 | 1,682 | 100,000 | INT | [1,5] |
| MovieLens 1M (source) | 6,040 | 3,706 | 1,000,209 | INT | [1,5] |
| MovieLens 10M (source) | 69,878 | 10,677 | 10,000,054 | INT | [1,5] |
| MovieLens 20M (source) | 138,493 | 26,744 | 20,000,263 | INT | [1,5] |
| Netflix Small (source) | 10,000 | 5,000 | 607,803 | INT | [1,5] |
| Netflix Original (source) | 480,189 | 17,770 | 100,480,507 | INT | [1,5] |
| Tradesy (source) | 19,243 | 165,906 | 394,421 | BIN | {0,1} |
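The counts above also let you gauge how sparse each dataset is. As a quick back-of-the-envelope check in plain Python, using the MovieLens 100k figures from the table:

```python
# Density = #interactions / (#users * #items), MovieLens 100k numbers.
n_users, n_items, n_interactions = 943, 1682, 100_000

density = n_interactions / (n_users * n_items)
print(f"density: {density:.3%}")  # roughly 6.3% of the matrix is observed
```

Most of the other datasets are far sparser, which is worth keeping in mind when choosing evaluation splits and models.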

Usage example

Assume that we are interested in the FilmTrust dataset, which comes with both user-item ratings and user-user trust information. We can load these two pieces of information as follows:

```python
from cornac.datasets import filmtrust

ratings = filmtrust.load_feedback()
trust = filmtrust.load_trust()
```

The rating values are in the range [0.5,4], and the trust network is undirected. Here are a few samples from the dataset:

```
Samples from ratings: [('1', '1', 2.0), ('1', '2', 4.0), ('1', '3', 3.5)]
Samples from trust: [('2', '966', 1.0), ('2', '104', 1.0), ('5', '1509', 1.0)]
```
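Since the trust network is undirected, each triplet describes a single edge between two users. If a model expects explicit directed edges, one simple preprocessing step (a sketch on the toy samples above, not something `load_trust()` does for you) is to symmetrize the triplets:

```python
# Symmetrize an undirected edge list given as (user_id, user_id, weight)
# triplets, so every edge appears in both directions (toy illustration).
trust = [("2", "966", 1.0), ("2", "104", 1.0), ("5", "1509", 1.0)]

symmetric = set()
for u, v, w in trust:
    symmetric.add((u, v, w))
    symmetric.add((v, u, w))  # add the reverse direction

print(len(symmetric))  # 3 undirected edges become 6 directed ones
```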

Our dataset is now ready to use for model training and evaluation. A concrete example is sorec_filmtrust, which illustrates how to perform an experiment with the SoRec model on FilmTrust. More details regarding the other datasets are available in the documentation.
