Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New hooks to be notified about URLs being added or retrieved from the queue #169

Merged
merged 11 commits into from
Sep 2, 2014

Conversation

mback2k
Copy link
Contributor

@mback2k mback2k commented Aug 21, 2014

This merge request implements new hooks in order to be notified about URLs being added or retrieved from the queue. There is still some room for improvement since currently the list of added url_infos has to be recreated in item.py since only the URLs themselves are passed to database.py.

<+mback2k> chfoo: here is my first draft for new hooks: mback2k@a73092d
<+mback2k> chfoo: my only problem is that the URL-table performs a unique-operation on URLs and therefore multiple adds of the same URL result in at most one dequeue operation.
<+mback2k> chfoo: that makes it impossible to implement a proper counter unless the add-operation returns wether the URL already existed.
@chfoo chfoo: the new hooks are looking good. yeah, unfortunately i don't check whether the url already exists for performance reasons.
<+mback2k> chfoo: maybe the SQL statement in https://github.com/mback2k/wpull/blob/master/wpull/database.py#L250 could be adapted to return a different result code on INSERT if already exists?
<+mback2k> chfoo: I mean, the unique-constraint in the database has to make the check anyway, so there should be a quick way to get that information
<+mback2k> without an additional query, of course
<+mback2k> chfoo: I will take a look
<+mback2k> chfoo: I found a way to get it working without any additional SQL queries :)
<+mback2k> chfoo: some more progress: https://github.com/mback2k/wpull/commits/topic/queue_hooks

@chfoo
Copy link
Member

chfoo commented Aug 21, 2014

I can't remember the intention but I think I wanted wpull/testing/boring_script.py to be blank. Can you move the global counter test into wpull/testing/py_hook_script2.py and assert that it never goes negative? And then also update wpull/testing/lua_hook_script2.lua if possible and it should be good.

@mback2k
Copy link
Contributor Author

mback2k commented Aug 21, 2014

Okay, I moved the test hooks from wpull/testing/boring_script.py into wpull/testing/py_hook_script2.py and added two asserts for the counter value.
I updated wpull/testing/lua_hook_script2.lua but have no clue if that is correct.

@chfoo
Copy link
Member

chfoo commented Aug 21, 2014

record_info is a item.URLRecord instance. I think in py_hook_script2.py for assert record_info['record_info'] you wanted something like assert record_info['url'] (and also in lua_hook_script2.lua) and the unit tests should pass.

If you can make those final two changes, that would be great. Thanks for your help!

@mback2k
Copy link
Contributor Author

mback2k commented Aug 21, 2014

Okay, let's try this again.

@chfoo
Copy link
Member

chfoo commented Aug 21, 2014

Looks like forgot to consider the code path where scripts can queue URLs of their own. If you haven't seen it yet, in hook.py for _add_hooked_url(), the same thing from WebProcesorSession about queued URLs needs to be added. I'm not too worried about redundant code because it will be refactored later to integrate with PhantomJS. Did you still want to work on it or should I?

@mback2k
Copy link
Contributor Author

mback2k commented Aug 23, 2014

@chfoo Thanks, I updated the code path in hook.py and added another test assertion.

@chfoo chfoo self-assigned this Aug 24, 2014
@chfoo chfoo merged commit 9bab455 into ArchiveTeam:master Sep 2, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants