The following figure presents the scaling of different text categorization algorithms with the training set size on the Enron EDRM (TREC Legal 2009) dataset (700k documents). The test set is always the full document collection of 700k documents (since training_size << dataset_size).
For feature extraction, a bag-of-words representation was hashed into a 100001-dimensional feature space with sublinear TF weighting, followed by L2 normalization.
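For reference, a minimal sketch of such a pipeline in scikit-learn (parameters other than n_features=100001 are assumptions, not taken from the benchmark code):

```python
# Minimal sketch of the feature extraction described above; parameters
# other than n_features are assumptions, not the benchmark's exact settings.
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.pipeline import make_pipeline

feature_extraction = make_pipeline(
    # Hash raw text into a fixed 100001-dimensional space; norm=None and
    # non_negative=True (alternate_sign=False in scikit-learn >= 0.19) keep
    # raw non-negative term counts for the TF scaling step below.
    HashingVectorizer(n_features=100001, norm=None, non_negative=True),
    # Sublinear TF scaling (1 + log(tf)), no IDF, then L2-normalize each row.
    TfidfTransformer(sublinear_tf=True, use_idf=False, norm='l2'),
)

X = feature_extraction.fit_transform(["raw document text", "another document"])
```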
The LSI decomposition time is not accounted for. This benchmark was run on a 4-core Intel(R) Xeon(R) CPU E3-1225 V2 @ 3.20GHz with 16 GB RAM.
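For context, the LSI step excluded from these timings would look roughly like this (the component count and the synthetic input shape below are assumptions for illustration):

```python
# Rough sketch of the LSI projection whose fit time is excluded from the plot;
# n_components=100 and the random input are assumptions, not benchmark settings.
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Stand-in for the hashed, L2-normalized document-term matrix.
X = sparse_random(1000, 100001, density=0.001, format='csr', random_state=0)

lsi = TruncatedSVD(n_components=100, random_state=0)
X_lsi = lsi.fit_transform(X)  # dense (1000, 100) matrix in the LSI space
```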
This benchmark uses scikit-learn 0.18.1 and xgboost 0.4a30.
While the timings for all algorithms remain reasonable, we notice that the NearestNeighbor search in the LSI space is the only algorithm that scales linearly with the training set size, and it would therefore become quite slow with large training sets (>5k documents). I will open a separate issue about optimizing the NearestNeighbor search.
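As a synthetic illustration of that linear scaling (random data and hypothetical dimensions; this is not the benchmark code), brute-force nearest-neighbor query time grows proportionally with the number of indexed training documents:

```python
# Synthetic illustration only: brute-force nearest-neighbor query time
# grows linearly with the number of indexed (training) documents.
from time import time
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
queries = rng.rand(1000, 100)  # stand-in for 1000 test documents in LSI space

for n_train in [1000, 5000, 25000]:
    X_train = rng.rand(n_train, 100)
    nn = NearestNeighbors(n_neighbors=1, algorithm='brute').fit(X_train)
    t0 = time()
    nn.kneighbors(queries)
    print("n_train=%d: %.3fs" % (n_train, time() - t0))
```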
Update: the previous version of this plot was incorrect (it measured scoring time instead of training time); the figure above is the updated one and looks more reasonable.