Cluster DBpedia instances based on the type of relations defined for them.
Case study for

Getting the data

You'll need Redis version 2.8.17 or later (see below).

Full dump (with values per instance): (182Mb, ~16M keys)

Clean dump (for use with current code): (65Mb, ~6M keys)

To use a dump, stop Redis (service redis-server stop), uncompress the file with 7zip, and place its contents in /var/lib/redis under the name dump.rdb, owned by user redis and group redis. Then restart Redis.
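
The restore steps above can be sketched as a small shell function. This is a minimal sketch, not a tested installer: the archive name, the name of the file inside it, the scratch directory, and the `run` dry-run wrapper are all assumptions made for illustration, and it presumes the `7z` binary is on the PATH. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch only. ARCHIVE and RDB_NAME are hypothetical names: substitute the
# archive you downloaded and the file it actually contains.
ARCHIVE="${ARCHIVE:-dump.7z}"
RDB_NAME="${RDB_NAME:-dump.rdb}"
WORK_DIR="${WORK_DIR:-/tmp/redis-dump}"
REDIS_DIR="${REDIS_DIR:-/var/lib/redis}"

# Print each command instead of executing it, unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    else
        printf '+ %s\n' "$*"
    fi
}

restore_dump() {
    run service redis-server stop
    # "7z e" extracts files without their directory paths into WORK_DIR.
    run 7z e "$ARCHIVE" "-o$WORK_DIR"
    # Install the extracted file under the name Redis loads at startup.
    run cp "$WORK_DIR/$RDB_NAME" "$REDIS_DIR/dump.rdb"
    run chown redis:redis "$REDIS_DIR/dump.rdb"
    run service redis-server start
}
```

After sourcing the script, `restore_dump` lists the commands; once they look right, running `DRY_RUN=0 restore_dump` as root executes them.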

Generating the data

(You don't need these steps if you use the redis dump above.)

Get instance_types_en.nt.bz2 and mappingbased_properties_en.nt.bz2.

Set the path in scripts/


$ perl
$ perl

Redis Version

You'll need version 2.8.17 or later:

deb wheezy-backports main contrib non-free
deb-src wheezy-backports main contrib non-free

apt-get install -t wheezy-backports redis-server
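
To confirm that the installed server meets the 2.8.17 minimum, a check along the following lines may help. It is a sketch under two assumptions: that `sort -V` (GNU version sort) is available, and that `redis-server --version` prints a line starting with `Redis server v=<version>`. The function names `version_ge` and `redis_version_from` are made up for this example.

```shell
#!/bin/sh
# Succeeds when version $1 is greater than or equal to version $2.
# Relies on GNU sort's -V (version sort) to order the two strings.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Extracts the version number from a line such as
# "Redis server v=2.8.17 sha=...". Prints nothing if the line does not match.
redis_version_from() {
    printf '%s\n' "$1" | sed -n 's/^Redis server v=\([0-9][0-9.]*\).*/\1/p'
}
```

A possible use: `version_ge "$(redis_version_from "$(redis-server --version)")" 2.8.17 || echo "Redis is too old"`.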

Sample Output


Running the code

You need Mahout 1.0 installed from source in your local repo and configured for Hadoop 2.0 (see below). Then build the project:

mvn clean package assembly:single

Then run the Hadoop job org.keywords4bytecodes.firstclass.Driver, pointing it to the TSV training file (see below) and an output directory.

Installing Mahout from source

$ git clone
$ cd mahout
$ mvn clean package -DskipTests -Drelease -Dmahout.skip.distribution=false -Dhadoop.profile=200 -Dhadoop2.version=2.4.1 -Dhbase.version=0.98.0-hadoop2