mreid / injuce

A small personal project to learn Clojure by implementing some simple machine learning algorithms


mreid (author)
Thu Oct 08 03:41:45 -0700 2009
commit  c350baa5bfe8ae3ce565c4958482ff811761ff4c
tree    a1b49ada63a58cd402fb6b19fcd8af0ad9d66ba2
parent  e59eb4055b2eb944cd538d7d5b08cf7671cce8cf
injuce /
name age message
file .clojure Wed Jul 15 22:24:42 -0700 2009 Experimenting with Parallel Colt for efficiency [mreid]
file README.markdown Tue Jul 14 19:34:20 -0700 2009 Added data sets and performance notes [mreid]
directory data/
file npca.clj Sat Aug 29 04:01:24 -0700 2009 More hacking [mreid]
directory online/
file parallelcolt.jar Wed Jul 15 22:24:42 -0700 2009 Experimenting with Parallel Colt for efficiency [mreid]
README.markdown

Injuce - An induction toolkit in Clojure

This is a small personal project used to better understand Clojure and some classic statistical machine learning algorithms.

Initially, this will just consist of some simple data handling routines and an implementation of stochastic gradient descent.

Set up

Requirements

Incanter Script

The programs here will be run using the clj script that comes with Incanter. This script sets up the classpath for Incanter.

To run a program in this package, call it as follows:

$ ../incanter/bin/clj FILENAME.clj

where ../incanter should be replaced with whatever path gets you to your installation of Incanter.

I'll aim to make this easier in the future.

Performance Notes

Just parsing the entire training set train.dat.gz takes some time:

$ zless ../../sgd/svm/train.dat.gz | clj sgd.clj 
"Elapsed time: 189872.08 msecs"
781265

That is a total of about 3 minutes, or a rate of roughly 0.24 msecs/example. However, the C++ version of svmsgd also takes some time to read in train.dat.gz (on the order of several minutes).
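For reference, the parse-and-count measurement above could look roughly like the sketch below. It is not the contents of sgd.clj. It assumes that each line of train.dat.gz uses the SVM-light style `label index:value ...` layout and that a Clojure version with `clojure.string` is available; the names `parse-example` and `count-examples` are made up for illustration.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper: assumes SVM-light style lines, e.g. "+1 3:0.5 7:1.25".
(defn parse-example
  "Parse one line into {:label <double> :features {<index> <value>}}."
  [line]
  (let [[label & pairs] (str/split (str/trim line) #"\s+")]
    {:label    (Double/parseDouble label)
     :features (into {}
                     (for [pair pairs
                           :let [[idx v] (str/split pair #":")]]
                       [(Long/parseLong idx) (Double/parseDouble v)]))}))

;; Hypothetical helper: forces the lazy seq so every line is actually parsed.
(defn count-examples
  "Parse every line and return the number of examples."
  [lines]
  (count (map parse-example lines)))

;; When fed from a pipe, as in the command above:
;; (time (println (count-examples (line-seq (java.io.BufferedReader. *in*)))))
```

With the figures reported above, 189872.08 msecs / 781265 examples is about 0.243 msecs per example.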

It seems the most expensive part of the Clojure SGD is the operations on the sparse vectors represented as hash maps. Use of jvisualvm shows at least 20% of processing time is spent in calls to clojure.lang.Var.get.
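To make the cost concrete, here is a minimal sketch of the kind of hash-map operations an SGD update needs. It is an illustration, not the project's code: the function names are hypothetical, the hinge-loss update with an L2 penalty is an assumption based on the comparison with svmsgd, and `reduce-kv` requires Clojure 1.4 or later.

```clojure
;; Sparse vectors are plain maps from feature index to value, e.g. {3 0.5, 7 1.25}.

(defn dot
  "Inner product of two sparse vectors; iterates over the smaller map."
  [u v]
  (let [[small large] (if (< (count u) (count v)) [u v] [v u])]
    (reduce-kv (fn [acc i x] (+ acc (* x (get large i 0.0))))
               0.0
               small)))

(defn add-scaled
  "Return w + c*x, touching only the indices present in x."
  [w c x]
  (reduce-kv (fn [acc i xi] (assoc acc i (+ (get acc i 0.0) (* c xi))))
             w
             x))

;; Assumed update rule (hinge loss, L2 penalty lambda, step size eta):
;;   w <- (1 - eta*lambda) * w, plus eta * y * x when y * (w . x) < 1.
(defn sgd-step
  "One SGD step on a single example {:label y :features x}."
  [w {:keys [label features]} eta lambda]
  (let [shrink (- 1.0 (* eta lambda))
        shrunk (reduce-kv (fn [acc i wi] (assoc acc i (* shrink wi))) {} w)]
    (if (< (* label (dot w features)) 1.0)
      (add-scaled shrunk (* eta label) features)
      shrunk)))
```

Every `assoc` builds a new persistent map and every value is a boxed `Double`, and the shrink step walks the whole weight vector on each example, so a naive version like this does a lot of allocation per update.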

The current code cannot complete a single training pass through all the data: it throws an OutOfMemoryError (out of Java heap space) after processing about 11k examples. Reaching that point takes hours of processing, whereas Bottou's svmsgd processes the entire training set in seconds.

To do

[] Write a new parser that reads the train.bin.gz format.
[] Make use of the COLT libraries that Incanter is built upon for sparse vecs.