There are three files: rcv1.train.dat.gz, rcv1.test.dat.gz, and vw_process. vw_process is a simple script that converts from svmlight format to VW's format.
After conversion, the individual files look like:
1 |f 13:3.9656971e-02 24:3.4781646e-02 69:4.6296168e-02 85:6.1853945e-02 ...
0 |f 9:8.5609287e-02 14:2.9904654e-02 19:6.1031535e-02 20:2.1757640e-02 ...
...
From the above, you can see that the input data format is similar to SVMlight's feature:value sparse representation format. There are two important differences:
There are a couple of variations on the above format. If you want to importance weight examples, place the importance weight after the label and before the first namespace. A missing importance weight is treated as 1 by default. Similarly, a feature with a value of 1 can be written as just its name rather than name:1.
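The format rules above can be sketched in a few lines of shell. This is only an illustration, not the contents of vw_process: the function names svmlight_to_vw and add_importance are hypothetical, and the input is assumed to be svmlight lines of the form "label index:value ...". The first function inserts the |f namespace marker after the label; the second places an importance weight between the label and the first namespace.

```shell
#!/bin/sh
# Hypothetical helper (not the real vw_process): read svmlight lines on stdin
# and write VW lines on stdout by inserting " |f" after the label field.
svmlight_to_vw() {
  sed -e 's/^\([^ ]*\) /\1 |f /'
}

# Hypothetical helper: insert an importance weight (first argument) after the
# label of each VW line on stdin, i.e. before the first namespace.
add_importance() {
  awk -v w="$1" '{ $1 = $1 " " w; print }'
}

# Toy input, made up for illustration.
printf '%s\n' '1 13:0.5 24:1' '-1 9:0.25' | svmlight_to_vw
# prints:
# 1 |f 13:0.5 24:1
# -1 |f 9:0.25

printf '%s\n' '1 |f 13:0.5 24:1' | add_importance 2
# prints:
# 1 2 |f 13:0.5 24:1
```

In the last line, the feature 24:1 could equally be written as just 24, since a value of 1 is the default.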
The following command trains on the converted training set:
vw rcv1.train.vw.gz --cache_file cache_train -f r_temp
Next, you can test with the following command:
vw -t --cache_file cache_test -i r_temp -p p_out rcv1.test.vw.gz
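Here the flags are:
-t: test only; no learning is done.
--cache_file cache_test: use cache_test as the cache file for the test data.
-i r_temp: load r_temp as the initial regressor.
-p p_out: write the predictions to p_out.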
To measure performance, I often use perf, the software which Rich Caruana put together for the 2004 KDD Cup challenge. It has the advantage that many people cared that it worked right. To use perf, you first create a file with the labels:
zcat rcv1.test.vw.gz | cut -d ' ' -f 1 | sed -e 's/^-1/0/' > labels
and then type:
perf -ACC -files labels p_out -t 0.5
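If perf is not at hand, the same kind of accuracy number can be roughly checked with standard tools. The sketch below is a stand-in, not perf's implementation: the function name accuracy is hypothetical, it assumes one 0/1 label per line and one prediction at the start of each prediction line, and it counts a prediction at or above the threshold as positive (perf's handling of ties may differ).

```shell
#!/bin/sh
# Hypothetical helper: fraction of examples whose thresholded prediction
# matches the 0/1 label.
# usage: accuracy LABELS_FILE PREDICTIONS_FILE THRESHOLD
accuracy() {
  paste -d ' ' "$1" "$2" |
    awk -v t="$3" '
      { pred = ($2 >= t) ? 1 : 0      # threshold the raw prediction
        if (pred == $1) correct++     # compare against the label
        n++ }
      END { if (n > 0) printf "%.4f\n", correct / n }'
}

# Toy data, made up for illustration: two of the four predictions are right.
tmp=$(mktemp -d)
printf '%s\n' 1 0 1 0 > "$tmp/labels"
printf '%s\n' 0.9 0.2 0.4 0.6 > "$tmp/p_out"
accuracy "$tmp/labels" "$tmp/p_out" 0.5   # prints 0.5000
rm -r "$tmp"
```

With the real files this would be called as accuracy labels p_out 0.5; the test set error rate is then one minus the printed value.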
The results on my machine are summarized by the following table(*):
|Method||Wall clock Execution Time||Test Set Error rate|
There are several things to understand about the results.
(*) The comparison with svmsgd is obsolete, as Leon has updated his code.