
Scaling of categorization algorithms #58

Closed
rth opened this issue Jan 17, 2017 · 0 comments
rth commented Jan 17, 2017

The following figure presents the scaling of different text categorization algorithms with the training set size for the Enron ERDM (TREC Legal 2009) dataset (700k documents). The test set is always the full document collection of 700k documents (valid since training_size << dataset_size).

Feature extraction used a bag-of-words representation, hashed into a 100001-feature space with sublinear-tf weights, followed by L2 normalization.
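As an illustration, this feature extraction step could be reproduced roughly as follows with scikit-learn. This is a minimal sketch on toy documents, not the benchmark code itself; in particular, `use_idf=False` is an assumption, since the issue only mentions sublinear-tf weights.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer

# Toy documents standing in for the Enron collection.
docs = [
    "please find attached the quarterly report",
    "meeting rescheduled to friday morning",
    "forwarding the contract draft for review",
]

# Hash bag-of-words counts into a fixed 100001-dimensional space.
# alternate_sign=False keeps raw term counts non-negative; norm=None
# defers normalization to the weighting step below.
vectorizer = HashingVectorizer(n_features=100001, alternate_sign=False, norm=None)
counts = vectorizer.transform(docs)

# Apply sublinear tf weighting (1 + log(tf)) followed by L2 normalization.
# use_idf=False is an assumption, as noted above.
transformer = TfidfTransformer(sublinear_tf=True, use_idf=False, norm="l2")
X = transformer.fit_transform(docs and counts)
```

After this step every document is a unit-length sparse vector in the 100001-dimensional hashed space, ready to be fed to the classifiers (or to an LSI projection).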

[Figure: plot_scaling_run_time (topic201, trial00001) — run time vs. training set size for each algorithm]

The LSI decomposition time is not accounted for. This benchmark was run on a 4 core Intel(R) Xeon(R) CPU E3-1225 V2 @ 3.20GHz with 16 GB RAM.

This benchmark uses scikit-learn 0.18.1 and xgboost 0.4a30.

While the timing for all algorithms remains reasonable, we notice that NearestNeighbor search in the LSI space is the only algorithm whose run time scales linearly with the training set size, and it would therefore be quite slow for large training sets (>5k documents). I will open a separate issue for optimization of the NearestNeighbor search.
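To make the linear-scaling point concrete, here is a minimal sketch of nearest-neighbour categorization in an LSI space, using synthetic sparse matrices in place of the hashed document-term matrices. With brute-force search, each query is compared against every training point, so query time grows linearly with the training set size; the dimensions and parameters below are illustrative, not the benchmark's.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import NearestNeighbors

# Synthetic stand-ins for the hashed document-term matrices.
X_train = sparse_random(200, 1000, density=0.01, random_state=0)
X_test = sparse_random(50, 1000, density=0.01, random_state=1)

# Project both sets into a low-dimensional LSI space.
lsi = TruncatedSVD(n_components=20, random_state=0)
X_train_lsi = lsi.fit_transform(X_train)
X_test_lsi = lsi.transform(X_test)

# Brute-force nearest-neighbour search: every query scans all training
# points, hence O(n_train) cost per test document.
nn = NearestNeighbors(n_neighbors=1, algorithm="brute")
nn.fit(X_train_lsi)
distances, indices = nn.kneighbors(X_test_lsi)
```

Tree-based or approximate indexes can reduce the per-query cost, which is presumably what the follow-up optimization issue would explore.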

Update: the previous version of this plot was not correct (it reported scoring time instead of training time); above is the updated figure, which looks more reasonable.

@rth rth added this to the v0.8 milestone Jan 17, 2017
@rth rth closed this as completed Jan 17, 2017