
Intra-grid/Inter-process communication should use keep-alive and multiplexing #221

Closed
Zapotek opened this Issue Jul 2, 2012 · 3 comments

Owner

Zapotek commented Jul 2, 2012

The ArachniRPC protocol was designed to be lightweight and simple in order to ease integration with 3rd party systems.
It uses one socket per call so that multiplexing isn't required, which makes it trivial to implement for anyone with access to a serializer (usually YAML, since it's multi-platform) and TLS/SSL sockets.
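To make the one-socket-per-call model concrete, here is a minimal sketch. The length-prefixed framing and the `rpc_call` helper are assumptions for illustration only, not ArachniRPC's actual wire format, and plain TCP stands in for the TLS/SSL sockets the real protocol uses.

```ruby
require 'socket'
require 'yaml'

# Hypothetical sketch: a brand-new TCP connection for every RPC call,
# with the request/response serialized as YAML. The 4-byte big-endian
# length prefix is an illustrative assumption, not the real framing.
def rpc_call(host, port, message)
  TCPSocket.open(host, port) do |sock|          # one socket per call
    payload = YAML.dump(message)
    sock.write([payload.bytesize].pack('N') + payload)
    len = sock.read(4).unpack1('N')             # read response length
    YAML.safe_load(sock.read(len))              # then the response body
  end                                            # socket closed here
end
```

Because every call opens and tears down its own connection, a 3rd party can implement a compatible client in a few lines of any language with sockets and a YAML library.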

And that's good; that should remain the 3rd-party-facing interface -- i.e. for Dispatchers and for simple and master Instances.

However, communication between a master and its slaves is hidden from the user and could use the boost of a more complex, performance-oriented protocol.
So, the existing protocol should be amended with a high-performance mode which utilizes a binary serializer (most likely Marshal), a single connection per master-slave pair and message multiplexing.
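The multiplexing part boils down to tagging each request with an id so that responses sharing one connection can arrive in any order and still be routed back to the right caller. A minimal sketch, with entirely hypothetical names (the real implementation would live inside Arachni-RPC):

```ruby
# Hypothetical sketch of message multiplexing over a single connection:
# every request carries a unique id; responses can then be dispatched
# to the right caller regardless of arrival order. Marshal stands in
# for the proposed binary serializer.
class Multiplexer
  def initialize
    @seq     = 0
    @pending = {}  # id => callback awaiting that response
  end

  # Frame a request with a fresh id and remember who to notify.
  # Returns the serialized frame that would go over the wire.
  def request(payload, &callback)
    id = (@seq += 1)
    @pending[id] = callback
    Marshal.dump(id: id, payload: payload)
  end

  # Route an incoming (possibly out-of-order) frame to its caller.
  def dispatch(frame)
    msg = Marshal.load(frame)
    @pending.delete(msg[:id]).call(msg[:payload])
  end
end
```

With this in place a master could keep one persistent socket per slave and have thousands of outstanding calls in flight, instead of paying connection setup/tear-down per call.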

This isn't strictly necessary yet, but the distributed crawling algorithm (#207) will make good use of it: path distribution will require tens or hundreds of thousands of RPC calls, so it will hugely benefit from a super-fast and extra-lightweight (both in message size and in init/tear-down of messages and connections) RPC protocol.

And since I got going, I might as well mention this too:

Even though the Ruby (MRI) dudes got their heads straight and mapped Ruby threads 1:1 to OS threads, there's still the Global Interpreter Lock (GIL), which only schedules one thread at a time.
And even if they did provide proper threading, it would mean very little to us, because we're using a single-threaded, async, singleton HTTP interface.

And since workload distribution and message-passing have already been implemented for the Grid, we already have a nice and clean IPC system in place which allows parallelism via Ruby processes -- proper OS processes that can run on multiple cores and CPUs.
The ability to truly and easily parallelize scans (even on single machines) will be a huge asset when we get JS integration (#50), which will require some serious processing power.
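The process-over-thread argument above can be sketched in a few lines. This `parallel_map` helper is purely illustrative (it is not Arachni's IPC system, which uses RPC message-passing rather than pipes) and assumes a POSIX platform where `fork` is available:

```ruby
# Illustrative sketch only: true parallelism via OS processes instead
# of GIL-bound threads. Each input is handled in a forked child, which
# runs on its own core; results come back through a pipe per worker.
# POSIX-only (relies on Kernel#fork).
def parallel_map(inputs)
  pipes = inputs.map do |input|
    reader, writer = IO.pipe
    fork do                                    # child process
      reader.close
      writer.write(Marshal.dump(yield(input))) # do the work, ship result
      writer.close
    end
    writer.close                               # parent keeps only reader
    reader
  end

  results = pipes.map do |r|
    result = Marshal.load(r.read)              # read until child's EOF
    r.close
    result
  end
  Process.waitall                              # reap the workers
  results
end
```

Unlike MRI threads, each child here really does run CPU-bound work concurrently, which is exactly why process-based parallelism pays off for heavy workloads like JS processing.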

You can go even further with this and have Grid slaves spawn local slave Instances for themselves, now that would be cool.

@ghost ghost assigned Zapotek Jul 2, 2012

Owner

Zapotek commented Nov 2, 2012

Arachni-RPC EM has been updated to allow the use of a primary and a fallback serializer. Arachni's RPC service now uses Marshal (for higher performance) as the primary one and YAML (for backwards compatibility and interoperability) as the secondary/fallback.
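The receiving side of such a scheme can be sketched as "try the binary serializer first, fall back to YAML if decoding fails". The module name and rescue-based dispatch below are assumptions for illustration; Arachni-RPC's actual negotiation may differ:

```ruby
require 'yaml'

# Hypothetical sketch of a primary/fallback serializer pair:
# Marshal (fast, binary, Ruby-only) first, YAML (slower, but
# cross-platform) as the compatibility fallback.
module FallbackSerializer
  def self.dump(obj)
    Marshal.dump(obj)            # always emit the primary format
  end

  def self.load(data)
    Marshal.load(data)           # fast path: binary peer
  rescue TypeError, ArgumentError
    YAML.load(data)              # fallback: data came from a YAML peer
  end
end
```

This keeps master/slave traffic fast while a 3rd-party client that only speaks YAML still gets understood.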

A slight update on this: support for keep-alive and multiplexing will not be implemented at this time, since it would severely complicate things; however, this doesn't seem to have hurt performance.

Also, the API has been updated so that any two or more Instances can share the scan (crawl and audit) workload very easily -- with a call to Framework#enslave.

The aforementioned functionality currently exists in the feature/distributed-crawling branch and hasn't yet been merged into experimental.

Owner

Zapotek commented Nov 5, 2012

Consider adding some sort of Quality-of-Service on the server side in order to prioritize management requests.
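One simple shape this QoS could take is a two-level queue where management requests (pause, abort, progress) always jump ahead of bulk scan traffic. A minimal sketch with hypothetical names, not an actual Arachni component:

```ruby
# Hypothetical sketch of server-side QoS: management requests are
# served before bulk scan traffic (e.g. path-distribution batches).
class QosQueue
  def initialize
    @high = []   # management requests: pause, abort, progress, etc.
    @low  = []   # bulk scan traffic
  end

  def push(request, management: false)
    (management ? @high : @low) << request
  end

  # Always drain the management queue first.
  def pop
    @high.shift || @low.shift
  end
end
```

This way a master asking a busy slave for its status wouldn't have to wait behind thousands of queued crawl messages.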

Owner

Zapotek commented Dec 1, 2012

Closing since the feature/distributed-crawling branch has been merged into experimental and QoS is no longer necessary because some buffering on the crawler's part did the trick.

@Zapotek Zapotek closed this Dec 1, 2012
