soft404 doesn't work with scikit-learn 0.18+ #3

Closed
kmike opened this issue Jan 13, 2017 · 7 comments

kmike (Contributor) commented Jan 13, 2017

For me, the model fails to load:

sklearn/tree/_tree.pyx:632: KeyError
------------------------------------------------------------ Captured stderr call -------------------------------------------------------------
/Users/kmike/envs/deepdeep/lib/python3.5/site-packages/sklearn/base.py:315: UserWarning: Trying to unpickle estimator SGDClassifier from version pre-0.18 when using version 0.18.1. This might lead to breaking code or invalid results. Use at your own risk.
  UserWarning)
/Users/kmike/envs/deepdeep/lib/python3.5/site-packages/sklearn/base.py:315: UserWarning: Trying to unpickle estimator LogOddsEstimator from version pre-0.18 when using version 0.18.1. This might lead to breaking code or invalid results. Use at your own risk.
  UserWarning)
____________________________________________________________ test_predict_function ____________________________________________________________

    def test_predict_function():
>       assert probability('<h1>page not found, oops</h1>') > 0.9

tests/test_predict.py:11: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
soft404/predict.py:43: in probability
    default_classifier = Soft404Classifier()
soft404/predict.py:15: in __init__
    vect_params, vect_vocab, text_clf, clf = joblib.load(filename)
../../envs/deepdeep/lib/python3.5/site-packages/sklearn/externals/joblib/numpy_pickle.py:573: in load
    return load_compatibility(fobj)
../../envs/deepdeep/lib/python3.5/site-packages/sklearn/externals/joblib/numpy_pickle_compat.py:226: in load_compatibility
    obj = unpickler.load()
/usr/local/Cellar/python3/3.5.2_1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/pickle.py:1039: in load
    dispatch[key[0]](self)
../../envs/deepdeep/lib/python3.5/site-packages/sklearn/externals/joblib/numpy_pickle_compat.py:177: in load_build
    Unpickler.load_build(self)
/usr/local/Cellar/python3/3.5.2_1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/pickle.py:1510: in load_build
    setstate(state)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???
E   KeyError: 'max_depth'

I think it makes sense either to upgrade the model to scikit-learn 0.18.1, or to put the training corpus in the repository so that the model can be updated on the client side.
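
For reference, soft404/predict.py expects a 4-tuple dumped with joblib (vect_params, vect_vocab, text_clf, clf), so re-training under 0.18.x and re-dumping the same layout should be enough for the packaged model. A minimal sketch, assuming the output path soft404/clf.joblib (the real filename may differ):

# Re-serialize the trained pieces under the current scikit-learn.
# The tuple layout mirrors soft404/predict.py, which does:
#   vect_params, vect_vocab, text_clf, clf = joblib.load(filename)
# 'soft404/clf.joblib' is an assumed path, not necessarily the real one.
from sklearn.externals import joblib  # joblib vendored in scikit-learn 0.18.x


def save_model(vect_params, vect_vocab, text_clf, clf,
               filename='soft404/clf.joblib'):
    """Dump the trained parts in the layout Soft404Classifier expects."""
    joblib.dump((vect_params, vect_vocab, text_clf, clf), filename)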

lopuhin (Contributor) commented Jan 13, 2017

The current corpus is too big (about 1 GB compressed), unfortunately. I'll check whether it can be made smaller, and I'll put it on S3 anyway.
I'm re-training the model now; I'll push it to a branch at first.

kmike (Contributor, Author) commented Jan 13, 2017

Yeah, 1 GB is way too much. S3 may not be good as a long-term solution because it costs money; maybe we can use http://academictorrents.com/ or something like that? Someone still needs to seed, though.

kmike (Contributor, Author) commented Jan 13, 2017

Do you recall how long it takes to run a crawl and get a similar dataset?

lopuhin added a commit that referenced this issue Jan 13, 2017
$ ./soft404/train.py text_items_big
Most common languages in data:
[('zh-cn', 143533),
 ('en', 117488),
 ('ko', 23013),
 ('ja', 11624),
 ('fr', 8772),
 ('de', 8533),
 ('it', 6847),
 ('pt', 5491),
 ('', 4918),
 ('vi', 3399)]
Using only data for "en" language
117484 pages, 26464 domains, 0.28 404 pages
Training vectorizer...
117484/117484 [10:18<00:00, 189.91it/s]
Building numeric features...
117484/117484 [02:45<00:00, 708.33it/s]
Training and evaluating...
105735 in train, 11749 in test
AUC   0.992 ± 0.007
AUC_text 0.992 ± 0.005
AUC_text_full 0.992 ± 0.005
F1    0.963 ± 0.013
F1_text 0.958 ± 0.012
F1_text_full 0.958 ± 0.014
selected_features 3000.000 ± 0.000

lopuhin (Contributor) commented Jan 13, 2017

Pushed the model in d066986

> Do you recall how long it takes to run a crawl and get a similar dataset?

The dataset is 117484 pages, so at 500 rpm it should take just about 4 hours. But I have a note that crawling got much slower after some time due to scheduling issues I never solved, so the actual time was more than a day, I think.
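
For a back-of-the-envelope check of that estimate (Python 3; the steady 500 rpm rate is the assumption):

# Rough crawl-time estimate from the figures in this thread.
pages = 117484      # size of the collected dataset
rate_rpm = 500      # assumed sustained crawl rate, requests per minute
hours = pages / rate_rpm / 60
print('%.1f hours' % hours)  # -> 3.9 hours at a steady 500 rpm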

lopuhin (Contributor) commented Jan 19, 2017

It currently works with scikit-learn 0.18+, although the model is still serialized with joblib; see issue #13 about that, and #12 about training a classifier from scratch.

lopuhin closed this as completed Jan 19, 2017

kmike (Contributor, Author) commented Jan 19, 2017

@lopuhin if the problem with crawling speed is the usual "all requests returned by the scheduler are for the same domain, we hit downloader limits and do nothing", then something like https://github.com/TeamHG-Memex/linkdepth/blob/master/queues.py could help. To use it, set the 'scheduler_slot' request.meta key (like this: https://github.com/TeamHG-Memex/linkdepth/blob/b5c18819f61a25e586347c04c116bcabc44067af/linkdepth.py#L98) and tell Scrapy to use these custom queues:

SCHEDULER_PRIORITY_QUEUE = 'queues.RoundRobinPriorityQueue'
SCHEDULER_DISK_QUEUE = 'queues.DiskQueue'
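
A minimal sketch of the meta-key part, following the linked linkdepth example (the spider name, URL, and per-domain slot value are illustrative; any key that spreads requests across domains would do):

# Tag each request with a per-domain scheduler slot so that the custom
# RoundRobinPriorityQueue can rotate between domains instead of draining
# one domain's queue while the downloader sits idle on its per-domain limit.
from urllib.parse import urlsplit

import scrapy


class ExampleSpider(scrapy.Spider):  # hypothetical spider, for illustration
    name = 'example'
    start_urls = ['http://example.com/']

    def start_requests(self):
        for url in self.start_urls:
            domain = urlsplit(url).netloc
            yield scrapy.Request(url, meta={'scheduler_slot': domain})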

Another option is to use frontera; it uses a thing called OverusedBuffer to fight this issue.

lopuhin (Contributor) commented Jan 20, 2017

Yes, I think that was the problem. Thanks for the pointers!
