Non-blocking autologin #46

lopuhin · 2016-04-20T16:08:19Z

I make requests to autologin via normal Scrapy requests and process the responses in callbacks. Requests that arrive before login or during logout are queued and scheduled after login.

Logout not supported yet

codecov-io · 2016-04-20T16:19:22Z

Current coverage is `79.44%`

Merging #46 into master will increase coverage by +1.27% as of b6d5a00

@@            master     #46   diff @@
======================================
  Files           14      14       
  Stmts          756     788    +32
  Branches       152     158     +6
  Methods          0       0       
======================================
+ Hit            591     626    +35
+ Partial         40      38     -2
+ Missed         125     124     -1


lopuhin · 2016-04-20T19:31:20Z

Hm, strange that this line, https://github.com/TeamHG-Memex/undercrawler/pull/46/files#diff-cd4de073722fd256d3d944b28a9c88baR95, is not covered. Maybe something fishy is going on here; I was sure it must be covered...

kmike · 2016-04-20T21:10:47Z

undercrawler/middleware/autologin.py

@@ -36,16 +37,21 @@ class AutologinMiddleware:
    - do not block event loop in login() method (instead, collect
    scheduled requests in a separate queue and make request with scrapy).


Is the docstring still valid?

Nice catch, thanks! Fixed. I think the only remaining restriction is a single authorization domain per spider, and that should now be easier to relax if needed.

lopuhin · 2016-04-21T09:01:42Z

I think it's ready now, @kmike!

I tried to come up with a Scrapy API that would simplify this case, but did not come up with anything good. One small thing that could help is making crawler.engine.crawl(request, spider) more obvious, perhaps as spider.crawl(request), but this does not change much. A more powerful option would be to add a DelayRequest exception, so that instead of adding the request to self._queue it would be queued somewhere in Scrapy. But then we would also need some way to say when to re-process these (original) requests, and it would have to be composable so that several middlewares can use it. At that point it starts looking more complicated than explicitly queueing requests in the middleware.
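For readers following along, here is a minimal sketch of the explicit-queue idea discussed above. It is not the code from this PR: the class name, the `schedule` callable (standing in for something like crawler.engine.crawl(request, spider)), the `"LOGIN"` placeholder, and the local `IgnoreRequest` class are all assumptions made so the example runs without Scrapy.

```python
from collections import deque


class IgnoreRequest(Exception):
    """Local stand-in for scrapy.exceptions.IgnoreRequest (assumption)."""


class QueueingLoginMiddleware:
    """Hypothetical sketch: hold requests back until login has finished."""

    def __init__(self, schedule):
        # `schedule` is a hypothetical callable that re-submits a request,
        # e.g. lambda request: crawler.engine.crawl(request, spider).
        self._schedule = schedule
        self._queue = deque()
        self.waiting_for_login = False
        self.logged_in = False

    def process_request(self, request):
        if self.logged_in:
            return None  # pass the request through unchanged
        # Not logged in yet: remember the request and drop this copy.
        self._queue.append(request)
        if not self.waiting_for_login:
            self.waiting_for_login = True
            self._schedule("LOGIN")  # placeholder for the real login request
        raise IgnoreRequest

    def login_finished(self):
        # Called from the login callback: replay everything that was held back.
        self.waiting_for_login = False
        self.logged_in = True
        while self._queue:
            self._schedule(self._queue.popleft())
```

The cost of this shape is visible in the sketch: the middleware owns the queue and has to replay it itself, which is the bookkeeping a DelayRequest-style API would have to generalize.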

kmike · 2016-04-21T10:19:08Z

undercrawler/middleware/autologin.py

+        else:
+            self._enqueue(request)
+            if self.waiting_for_login:
+                raise IgnoreRequest


Can we solve it without dropping the request and maintaining a queue ourselves? Downloader middlewares support returning Deferreds from process_request; see e.g. the https://github.com/scrapy/scrapy/blob/master/scrapy/downloadermiddlewares/robotstxt.py implementation.

It looks much nicer, I'll try, thanks!

It took a while to get used to this style, but in the end it's better, I think. The only gotcha is that at one point the traceback was incorrect. Is that worth a bug report?

Yeah, bad tracebacks are worth a bug report.

Yep, filed scrapy/scrapy#1948
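The thread above replaces the explicit queue with a deferred result that pending requests wait on. The sketch below illustrates that shape only. To stay runnable with the standard library it uses an asyncio task as a stand-in for a Twisted Deferred, so it is an analogue rather than the middleware from this PR; `DeferredLoginMiddleware`, `do_login`, `fake_login`, and the dict-based requests are all hypothetical.

```python
import asyncio


class DeferredLoginMiddleware:
    """Hypothetical sketch: every request waits on one shared login result."""

    def __init__(self, do_login):
        # `do_login` is a hypothetical coroutine function that performs the
        # login and returns cookies.
        self._do_login = do_login
        self._login_task = None

    async def process_request(self, request):
        if self._login_task is None:
            # The first request starts the login; later ones reuse the task.
            self._login_task = asyncio.ensure_future(self._do_login())
        # No queue and no dropped request: each caller simply waits here.
        request["cookies"] = await self._login_task
        return None


async def demo():
    calls = []

    async def fake_login():
        calls.append("login")
        await asyncio.sleep(0)  # yield to the loop, like a real network call
        return {"session": "abc"}

    mw = DeferredLoginMiddleware(fake_login)
    requests = [{"url": "/a"}, {"url": "/b"}]
    await asyncio.gather(*(mw.process_request(r) for r in requests))
    return calls, requests
```

Running `asyncio.run(demo())` logs in once and attaches the same cookies to both requests. Compared with an explicit queue, the waiting requests are held by the event loop rather than by the middleware.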

lopuhin added 2 commits April 20, 2016 17:25

[WIP] Non-blocking login: first login, pending and skipped

4ee9f01

Logout not supported yet

Logout support: retry enqueued requests

b855561

lopuhin changed the title ~~Non-blocking autologin~~ [WIP] Non-blocking autologin Apr 20, 2016

kmike reviewed Apr 20, 2016

[skip-ci] Fix docstring and cleanup

b2a0f63

lopuhin force-pushed the nonblocking-autologin branch from d2a838f to b2a0f63 Compare April 21, 2016 07:47

lopuhin added 3 commits April 21, 2016 11:25

Do login requests with higher priority

7ec47ed

Add a "slow" endpoint to cover stale requests in tests

37817c1

Add a test for skipped autologin, expose autologin state

794526f

lopuhin changed the title ~~[WIP] Non-blocking autologin~~ Non-blocking autologin Apr 21, 2016

lopuhin force-pushed the nonblocking-autologin branch from cb3ca6d to 794526f Compare April 21, 2016 10:07

kmike reviewed Apr 21, 2016

lopuhin added 2 commits April 21, 2016 15:37

Autologin without an explicit queue, using deferreds

257863f

Add a test for pending status

ee18b3c

kmike merged commit 479a871 into master Apr 22, 2016

lopuhin deleted the nonblocking-autologin branch April 25, 2016 14:39

lopuhin mentioned this pull request Apr 26, 2016

spider can't be stopped with Ctrl-C when autologin is pending #35

Closed
