Multithreading #26

Open · ethicalhack3r opened this issue Jun 6, 2011 · 13 comments

@ethicalhack3r

Hi there,

I was wondering if it would be possible to multithread the spidr gem? I don't know much about multithreading in Ruby, but I believe only Ruby 1.9.x supports it?

I had a look through the source but couldn't find where the spidr gem makes its HTTP requests.

Maybe something like Typhoeus can be used?! (http://rubygems.org/gems/typhoeus)

Thanks,
Ryan

@postmodern (Owner)

This is possible, but difficult. The main problem is a race condition between the url/page callbacks and the requesting of pages: the callbacks can modify the filtering rules while another thread is requesting a page that has suddenly become unwanted. The second problem is that Spidr currently uses persistent HTTP connections, so I'm unsure how much multi-threading would improve performance. We've been looking at alternative HTTP libraries, but they all have various pros/cons.
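To make the hazard concrete, here is a contrived sketch of that race (all names are hypothetical; this is not Spidr's internals):

```ruby
rules = []                      # shared filter rules, mutated by callbacks
queue = Queue.new
queue << 'https://example.com/admin/'

worker = Thread.new do
  url = queue.pop
  # check-then-act: the filter check happens before the request...
  unless rules.any? { |prefix| url.start_with?(prefix) }
    sleep 0.1                   # ...the request is "in flight" here...
    puts "fetched #{url}"       # ...and completes even though a callback
  end                           # has since marked the URL as unwanted
end

rules << 'https://example.com/admin'  # a url callback adds a filter mid-crawl
worker.join
```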

@ethicalhack3r (Author)

Thanks for the quick response. I don't know too much about multi-threading; maybe X persistent HTTP connections could be opened?!

Either way, it seems like a difficult task to achieve.

@nirvdrum commented May 8, 2012

If you decide to go with it, I'd give Celluloid a look. Alas, it is Ruby 1.9 only due to its use of fibers. But it's a pretty nice library.

@postmodern (Owner)

I'm considering switching to net-http-persistent, with a thread pool for requests and mutexes around adding filters.
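Sketching what that direction might look like (an assumed shape, not actual Spidr code; all the names here are made up):

```ruby
require 'net/http/persistent'

url_queue       = Queue.new
filter_lock     = Mutex.new
ignore_prefixes = []

workers = 4.times.map do
  Thread.new do
    http = Net::HTTP::Persistent.new(name: 'spidr')  # one session cache per thread
    while (url = url_queue.pop)
      skip = filter_lock.synchronize do
        ignore_prefixes.any? { |prefix| url.start_with?(prefix) }
      end
      next if skip
      response = http.request(URI(url))
      puts "#{response.code} #{url}"
    end
    http.shutdown
  end
end

# a callback adding a filter would take the same lock:
filter_lock.synchronize { ignore_prefixes << 'https://example.com/private' }

url_queue << 'https://example.com/'
4.times { url_queue << nil }   # poison pills: stop the workers
workers.each(&:join)
```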

@grrowl commented Oct 14, 2013

+1, this seems to be the best spider/crawling library out there, and this would be a great feature.

@dadamschi

What happened with this request?

@postmodern (Owner)

I don't have the time currently to work on such a large feature.

@ZeroChaos-

It's been a year; any chance you have time to work on such a feature now? :-)

@fuzzygroup

I've written a crawler or N in my career, and if you didn't make it multi-threaded from the start, it is damn hard to retrofit. Now, that said, I think the overall goal here is throughput rather than threads. If the discovered URLs can be surfaced to an external queue (Redis or SQS), that changes the equation: rather than adding threads, you simply run more instances (or containers) of Spidr and let the queue distribute the work across N copies.

Thoughts?
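For illustration, the fan-out might look roughly like this, assuming the redis gem and Spidr's Agent#get_page/Page#urls; the 'spidr:urls' queue name is made up, and deduplication (e.g. a Redis set of visited URLs) is omitted for brevity:

```ruby
require 'redis'
require 'spidr'

redis = Redis.new
agent = Spidr::Agent.new

# Each instance pops one URL at a time from the shared queue, fetches just
# that page, and pushes the links it finds back for any instance to claim.
while (url = redis.brpop('spidr:urls', timeout: 5)&.last)
  page = agent.get_page(url)
  next unless page

  page.urls.each { |link| redis.lpush('spidr:urls', link.to_s) }
end
```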

@postmodern (Owner)

A distributed Spidr is a little out of scope, or at least further down the road.

Multi-threading here is mainly to address blocking I/O while waiting on responses to come back over the HTTP sessions. Luckily, net-http-persistent is already thread-aware. We'd just need to replace the spidering loop with a producer/consumer thread pool. Each thread would have its own session cache via net-http-persistent, would dequeue URLs, and would enqueue the responses/Pages. All additional logic with headers and parsing HTML would still be done in the main thread, to avoid additional Mutex complexity. There are probably other work items and locking issues hidden in the details.
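A rough sketch of that producer/consumer shape (illustrative only; the queue wiring and worker count are assumptions):

```ruby
require 'net/http/persistent'

url_queue  = Queue.new   # main thread enqueues URLs to fetch
page_queue = Queue.new   # workers enqueue the raw responses

workers = 4.times.map do
  Thread.new do
    # each worker owns its session cache via net-http-persistent
    http = Net::HTTP::Persistent.new(name: 'spidr-worker')
    while (uri = url_queue.pop)
      page_queue << [uri, http.request(uri)]
    end
    http.shutdown
  end
end

url_queue << URI('https://example.com/')
pending = 1

# main thread: header logic and HTML parsing stay here, so the filter
# rules never need a Mutex
while pending > 0
  uri, response = page_queue.pop
  pending -= 1
  puts "#{response.code} #{uri}"
  # parse response.body, apply the filters, then for each new link:
  #   url_queue << new_uri
  #   pending  += 1
end

4.times { url_queue << nil }   # shut the workers down
workers.each(&:join)
```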

@vwochnik

+1, a producer/consumer for the requests would be awesome! I really like the interface of your library by the way.

@dadamschi commented Sep 24, 2017 via email

@vwochnik

I mean a producer/consumer pattern where a pool of worker threads that do the requesting is connected to the main thread with queues, like an assembly line.

The main thread puts every request it wants resolved into a queue; any worker thread can pick a task from that queue, perform the request, and put the result into a finished-responses queue that the main thread reads. This way, the main thread does no requesting (i.e. no blocking activity) itself, which leads to a speedup.
