Malicious URL example

Jim Geovedi edited this page Aug 15, 2013 · 15 revisions

Detecting Malicious URLs

The malicious URL dataset from UCSD represents a sequential binary classification problem. The data are temporally correlated, and thus the problem is particularly suitable for online learning approaches like VW. Here we show how to evaluate the "out-of-the-box" performance of VW on this task.

Preparing the data

First, download the data in SVM-light format and extract the files from the tar-ball:

tar xzf url_svmlight.tar.gz

The following command-line converts these data from SVM-light format to VW input format:

for d in {0..120}; do cat url_svmlight/Day$d.svm; done \
  | sed -e 's/^-1/0 |f/' |sed -e 's/^+1/1 |f/' |sed -e 's/$/ const:.01/'

This conversion accomplishes the following:

  • Converts the labels from "-1" to "0" and from "+1" to "1"
  • Puts the features into a namespace called "f"
  • Adds a constant feature called "const" with value ".01".
  • Retains the temporal order of the data.

This shell pipeline can be used directly with VW, as described below.

Out-of-the-box online training and testing

We can use the above command-line to feed the data directly into VW to simulate online training and testing:

time for d in {0..120}; do cat url_svmlight/Day$d.svm; done \
  |sed -e 's/^-1/0 |f/' |sed -e 's/^+1/1 |f/' |sed -e 's/$/ const:.01/' \
  |vw --adaptive --cache_file cache

The command line arguments used above are:

  • --adaptive: use per-feature adaptive learning rates; this is sensible for highly diverse and variable features
  • --cache_file cache: cache the parsed input data into the file cache

It also uses time to measure the approximate wall-clock execution time.


The output of the above command-line concludes with the following.

finished run
number of examples = 2396130
weighted example sum = 2.396e+06
weighted label sum = 7.921e+05
average loss = 0.0127
best constant = 0.3306
best constant's loss = 0.2213
total feature number = 281850904

real    3m28.111s
user    2m36.850s
sys     0m17.050s

The average square loss over all 2396130 examples is 0.0127. The wall-clock execution time is 3 minutes 28 seconds. This may alarm you (or not), but most of time is spent parsing. If you re-run the command-line, it will read the cached data from cache and give the same result, except for the execution time:

real    0m17.982s
user    0m20.650s
sys     0m9.340s

Evaluating prediction accuracy

If you want to compare the actual predictions to the true labels, re-run the command-line with the additional option --predictions p_out to output the predictions to the file p_out. Then extract the labels from the training data using the following command-line:

for d in `seq 0 120`; do cat url_svmlight/Day$d.svm; done \
  |cut -d ' ' -f 1 |sed -e 's/^-1/0/' >labels

One can use Rich Caruana's perf software to compute the cumulative accuracy, but this requires a minor tweak in the code to allow more than 500000 predictions. Once that is dealt with, executing the command-line:

perf -ACC -files labels p_out

should give the result:

ACC    0.98364   pred_thresh  0.500000

98.36% accuracy by thresholding predictions at 0.5.

You can’t perform that action at this time.
You signed in with another tab or window. Reload to refresh your session. You signed out in another tab or window. Reload to refresh your session.
Press h to open a hovercard with more details.