This repository has been archived by the owner on Dec 21, 2017. It is now read-only.

Datasets

Anna Bethke edited this page Dec 1, 2016 · 7 revisions

The code necessary to load and use various datasets is provided with this repository. Below is a list of the datasets currently supported.

Book-Crossing

The Book-Crossing dataset is a collection of user ratings of books. It comes with both explicit ratings (1-10 stars) and implicit ratings (user interacted with the book). The data was compiled by Cai-Nicolas Ziegler of IIF and can be found here.
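As a rough illustration of the explicit/implicit distinction, the sketch below splits a few made-up rating rows into the two groups. It assumes that an implicit interaction is encoded as a rating of 0 and that each row is a `(user_id, isbn, rating)` tuple. Both the encoding and the `split_ratings` helper are assumptions for illustration, not the loader that ships with this repository.

```python
def split_ratings(rows):
    """Split (user_id, isbn, rating) rows into explicit and implicit lists.

    Assumption: a rating of 0 marks an implicit interaction,
    while 1-10 is an explicit star rating.
    """
    explicit, implicit = [], []
    for user_id, isbn, rating in rows:
        if rating == 0:
            implicit.append((user_id, isbn))
        elif 1 <= rating <= 10:
            explicit.append((user_id, isbn, rating))
        else:
            raise ValueError(f"rating out of range: {rating}")
    return explicit, implicit


# Made-up sample rows, not taken from the real dataset.
sample_rows = [
    ("u1", "isbn-a", 0),
    ("u1", "isbn-b", 7),
    ("u2", "isbn-a", 10),
]
explicit, implicit = split_ratings(sample_rows)
```

Keeping the two groups separate makes it possible to treat explicit ratings as preference strength and implicit ones as simple interaction signals.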

More information about the data is available in Ziegler et al.:

Improving Recommendation Lists Through Topic Diversification. Cai-Nicolas Ziegler, Sean M. McNee, Joseph A. Konstan, Georg Lausen; Proceedings of the 14th International World Wide Web Conference (WWW '05), May 10-14, 2005, Chiba, Japan. To appear.

The scripts to work with the Book-Crossing dataset are located here.

Jester

The Jester dataset is a set of jokes and ratings from Ken Goldberg at UC Berkeley, and can be found here. There are only about one hundred items in the dataset but thousands of users, which leads to a very high density of ratings.
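Density here can be read as the fraction of user-item pairs that actually carry a rating. A minimal sketch of that calculation follows; the `rating_density` helper and the toy rows are hypothetical and only illustrate the arithmetic.

```python
def rating_density(ratings):
    """Return the fraction of (user, item) cells that hold a rating.

    `ratings` is an iterable of (user, item, value) tuples.
    """
    ratings = list(ratings)
    users = {user for user, _item, _value in ratings}
    items = {item for _user, item, _value in ratings}
    if not users or not items:
        return 0.0
    # Count each (user, item) pair once, even if it appears twice.
    rated_cells = {(user, item) for user, item, _value in ratings}
    return len(rated_cells) / (len(users) * len(items))


# Made-up rows: 3 users, 2 jokes, 5 of the 6 possible cells rated.
toy_ratings = [
    ("u1", "j1", 3.5), ("u1", "j2", -1.0),
    ("u2", "j1", 9.0), ("u2", "j2", 0.5),
    ("u3", "j1", -7.25),
]
density = rating_density(toy_ratings)
```

With few items and many users, each user can plausibly rate a large share of the catalogue, which is what pushes this ratio up.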

More information about the data is available in Goldberg et al.:

Eigentaste: A Constant Time Collaborative Filtering Algorithm. Ken Goldberg, Theresa Roeder, Dhruv Gupta, and Chris Perkins. Information Retrieval, 4(2), 133-151. July 2001.

The scripts to work with the Jester dataset are located here.

Last.fm

The Last.fm HetRec 2011 dataset includes the play counts of a set of artists by a set of users. Also included are tags applied by the users to the artists and a social network graph of the users' friends. The dataset comes from Last.fm, was compiled by Ignacio Fernández-Tobías, Iván Cantador, and Alejandro Bellogín, and is available here.

More information about the data is available in Cantador et al.:

2nd Workshop on Information Heterogeneity and Fusion in Recommender Systems (HetRec 2011). I. Cantador, P. Brusilovsky, T. Kuflik. Proceedings of the 5th ACM Conference on Recommender Systems.

The scripts to work with the Last.fm dataset are located here.

MovieLens

The MovieLens datasets include user ratings of movies, as well as genre information about the movies and user-applied tags. Of the various MovieLens datasets, Hermes supports the 1M, 10M, and 20M versions.

The scripts to work with these datasets are located here.

OpenStreetMap

The OpenStreetMap dataset comes from the full-history dumps of OpenStreetMap, available here.

The scripts to work with the OpenStreetMap dataset are located here.

Python Git Repositories

The git dataset comes from various Python projects downloaded from GitHub. The script to download the data is here.

Kaggle

The Kaggle dataset is the Meta Kaggle dataset, which covers user statistics and run history for the competitions (and other data exploration) that Kaggle hosts. In particular, we use the files Scripts.csv and ScriptVersions.csv, though other files may also be applicable. The data is released under the CC BY-NC-SA 4.0 license. You will likely need to join Kaggle to download the data, which is worthwhile in any case for the other datasets on their site.
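One way the two files might be combined is to group script versions under their parent script. The sketch below does this with Python's standard `csv` module on tiny in-memory stand-ins. The column names (`Id`, `AuthorUserId`, `ScriptId`, `Title`) and the `versions_per_script` helper are assumptions for illustration; check the headers of the downloaded files before relying on them.

```python
import csv
import io

# Tiny in-memory stand-ins for Scripts.csv and ScriptVersions.csv.
# Column names are assumed for illustration and may differ in the real files.
scripts_csv = io.StringIO("Id,AuthorUserId\n1,100\n2,200\n")
versions_csv = io.StringIO(
    "Id,ScriptId,Title\n10,1,first\n11,1,second\n12,2,other\n"
)


def versions_per_script(scripts_file, versions_file):
    """Map each script Id to the list of its version Ids."""
    by_script = {row["Id"]: [] for row in csv.DictReader(scripts_file)}
    for row in csv.DictReader(versions_file):
        # Skip versions whose parent script is missing from the scripts file.
        if row["ScriptId"] in by_script:
            by_script[row["ScriptId"]].append(row["Id"])
    return by_script


grouped = versions_per_script(scripts_csv, versions_csv)
```

To run this against the real data, open the downloaded files with `open(path, newline="")` and pass the file objects in place of the `io.StringIO` stand-ins.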

The script to work with the Kaggle dataset is located here.

Wikipedia

The Wikipedia dataset comes from the full-history dumps of Wikipedia, available here with file names like 'pages-meta-history'. It contains the full edit history of Wikipedia.

The scripts to work with the Wikipedia dataset are located here.