Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Non-blocking autologin #46

Merged
merged 8 commits into from
Apr 22, 2016
Merged

Non-blocking autologin #46

merged 8 commits into from
Apr 22, 2016

Conversation

lopuhin
Copy link
Contributor

@lopuhin lopuhin commented Apr 20, 2016

I make requests to autologin via normal scrapy requests and process responses in callbacks. Requests before login and during logout are queued and scheduled after login.

@codecov-io
Copy link

Current coverage is 79.44%

Merging #46 into master will increase coverage by +1.27% as of b6d5a00

@@            master     #46   diff @@
======================================
  Files           14      14       
  Stmts          756     788    +32
  Branches       152     158     +6
  Methods          0       0       
======================================
+ Hit            591     626    +35
+ Partial         40      38     -2
+ Missed         125     124     -1

Review entire Coverage Diff as of b6d5a00

Powered by Codecov. Updated on successful CI builds.

@lopuhin
Copy link
Contributor Author

lopuhin commented Apr 20, 2016

Hm, strange that this https://github.com/TeamHG-Memex/undercrawler/pull/46/files#diff-cd4de073722fd256d3d944b28a9c88baR95 is not covered, may be something fishy here, I was sure it must be covered...

@lopuhin lopuhin changed the title Non-blocking autologin [WIP] Non-blocking autologin Apr 20, 2016
@@ -36,16 +37,21 @@ class AutologinMiddleware:
- do not block event loop in login() method (instead, collect
scheduled requests in a separate queue and make request with scrapy).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the docstring still valid?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch, thanks! Fixed, I think now the only restriction is a single authorization domain per spider, and now it should be easier to relax if needed.

@lopuhin
Copy link
Contributor Author

lopuhin commented Apr 21, 2016

I think it's ready now @kmike !

I tried to come up with a scrapy API that would simplify this case, but did not come up with anything good. One small thing that could simplify it is making crawler.engine.crawl(request, spider) more obvious, perhaps as spider.crawl(request), but this does not change much. A more powerful thing would be to add a DelayRequest exception, so instead of adding request to self._queue it would be queued somewhere in scrapy, but then we should also have some way to say when to re-process this (original) requests, and it also must be composable, so that several middlewares can use it, so it starts looking more complicated than explicitly queueing requests in the middleware.

@lopuhin lopuhin changed the title [WIP] Non-blocking autologin Non-blocking autologin Apr 21, 2016
else:
self._enqueue(request)
if self.waiting_for_login:
raise IgnoreRequest
Copy link
Contributor

@kmike kmike Apr 21, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we solve it without dropping the request and maintaining a queue ourselves? Downloader middleware support returning Deferreds from process_request; see e.g. https://github.com/scrapy/scrapy/blob/master/scrapy/downloadermiddlewares/robotstxt.py implementation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks much nicer, I'll try, thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It tool a while to get used to such style, but at the end it's better, I think. The only gotcha is that at one point the traceback was incorrect, is it worth a bug report?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, bad tracebacks worth a bug report.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, filed scrapy/scrapy#1948

@kmike kmike merged commit 479a871 into master Apr 22, 2016
@lopuhin lopuhin deleted the nonblocking-autologin branch April 25, 2016 14:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants