The following figure presents the scaling of different text categorization algorithms with the training set size on the Enron EDRM (TREC Legal 2009) dataset (700k documents). The test set is always the full document collection of 700k documents (since training_size << dataset_size).
For feature extraction, a bag-of-words representation was hashed into a 100001-dimensional feature space with sublinear TF weighting, followed by L2 normalization.
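For reference, a minimal sketch of such a pipeline in scikit-learn (parameters other than n_features=100001 are assumptions, not taken from the benchmark code):

```python
# Minimal sketch of the feature extraction described above; parameters
# other than n_features are assumptions, not the benchmark's exact settings.
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.pipeline import make_pipeline

feature_extraction = make_pipeline(
    # Hash raw text into a fixed 100001-dimensional space; norm=None and
    # non_negative=True (alternate_sign=False in scikit-learn >= 0.19) keep
    # raw non-negative term counts for the TF scaling step below.
    HashingVectorizer(n_features=100001, norm=None, non_negative=True),
    # Sublinear TF scaling (1 + log(tf)), no IDF, then L2-normalize each row.
    TfidfTransformer(sublinear_tf=True, use_idf=False, norm='l2'),
)

X = feature_extraction.fit_transform(["raw document text", "another document"])
```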
The LSI decomposition time is not accounted for. This benchmark was run on a 4-core Intel(R) Xeon(R) CPU E3-1225 V2 @ 3.20GHz with 16 GB RAM.
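For context, the LSI step excluded from these timings would look roughly like this (the component count and the synthetic input shape below are assumptions for illustration):

```python
# Rough sketch of the LSI projection whose fit time is excluded from the plot;
# n_components=100 and the random input are assumptions, not benchmark settings.
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Stand-in for the hashed, L2-normalized document-term matrix.
X = sparse_random(1000, 100001, density=0.001, format='csr', random_state=0)

lsi = TruncatedSVD(n_components=100, random_state=0)
X_lsi = lsi.fit_transform(X)  # dense (1000, 100) matrix in the LSI space
```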
This benchmark uses scikit-learn 0.18.1 and xgboost 0.4a30.
While the timings for all algorithms remain reasonable, we notice that the NearestNeighbor search in the LSI space is the only algorithm that scales linearly with the training set size, and it would therefore become quite slow with large training sets (>5k documents). I will open a separate issue about optimizing the NearestNeighbor search.
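As a synthetic illustration of that linear scaling (random data and hypothetical dimensions; this is not the benchmark code), brute-force nearest-neighbor query time grows proportionally with the number of indexed training documents:

```python
# Synthetic illustration only: brute-force nearest-neighbor query time
# grows linearly with the number of indexed (training) documents.
from time import time
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
queries = rng.rand(1000, 100)  # stand-in for 1000 test documents in LSI space

for n_train in [1000, 5000, 25000]:
    X_train = rng.rand(n_train, 100)
    nn = NearestNeighbors(n_neighbors=1, algorithm='brute').fit(X_train)
    t0 = time()
    nn.kneighbors(queries)
    print("n_train=%d: %.3fs" % (n_train, time() - t0))
```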
Update: the previous version of this plot was incorrect (it measured scoring time instead of training time); the figure above is the updated one and looks more reasonable.