How To: Scrape Web Pages

sambit edited this page Jun 13, 2021 · 4 revisions

What's special about scraping Web pages that's different from log processing?

I/O Bound Jobs

It is I/O bound. I/O bound means the job at hand depends not only on your machine, and specifically your CPU; most commonly it means you depend on other people's machines. And, well, other people's machines suck.

  • Latency
  • Transfer speed
  • Failures
  • Corruption

All these influence how your job will be executed.
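Failures and corruption, in particular, argue for defensive fetching. Here is a minimal retry sketch in plain Ruby; the helper name and backoff policy are illustrative assumptions, not part of Sneakers:

```ruby
# Minimal retry sketch: re-run a flaky I/O block a few times with a
# linearly growing delay before giving up. Names and policy are illustrative.
def with_retries(attempts: 3, base_delay: 0.1)
  tries = 0
  begin
    tries += 1
    yield
  rescue StandardError
    raise if tries >= attempts      # out of attempts: let the error propagate
    sleep(base_delay * tries)       # back off a little more each time
    retry
  end
end
```

You would wrap the actual HTTP fetch in something like this, and let the final raised error drive the worker's ack/reject decision.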

Scaling I/O

If you used a single thread or a single process, then naively it would block while waiting for I/O, or fail, or both block and fail and hand you corrupt data :(.

If each request took 1 second, you would have a 1 req/s pipeline on your hands.

In Ruby, there's no silver bullet other than building more pipelines (let's ignore evented frameworks for now): more threads or more processes. Sneakers is designed to scale both.
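To see why more threads help an I/O-bound job, here's a plain-Ruby sketch with no Sneakers involved; `slow_fetch` is a hypothetical stand-in for the network call:

```ruby
# Plain-Ruby sketch of fanning I/O out across threads.
# slow_fetch is a hypothetical stand-in for an HTTP request.
def slow_fetch(url)
  sleep 0.01                      # simulate network latency
  "title for #{url}"
end

urls    = (1..8).map { |i| "http://example.com/page#{i}" }
work    = Queue.new
results = Queue.new
urls.each { |u| work << u }

threads = 4.times.map do
  Thread.new do
    loop do
      url = begin
        work.pop(true)            # non-blocking pop; raises ThreadError when drained
      rescue ThreadError
        break
      end
      results << slow_fetch(url)  # while this thread sleeps on I/O, others run
    end
  end
end
threads.each(&:join)
```

With 4 threads the 8 fetches overlap instead of running back to back, which is exactly the kind of concurrency Sneakers manages for you.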

Building a Scraper

So, same as in the How To: Do Log Processing example, let's outline a worker:

require 'sneakers'
require 'open-uri'
require 'nokogiri'
require 'sneakers/metrics/logging_metrics'

class WebScraper
  include Sneakers::Worker
  from_queue :web_pages

  def work(msg)
    # msg is the raw message payload; here we expect it to be a URL string
    doc = Nokogiri::HTML(URI.open(msg))
    page_title = doc.css('title').text

    worker_trace "Found: #{page_title}"
    ack!
  end
end

However, since this worker does I/O, it will by default open up 25 threads for us. What if we want more?

require 'sneakers'
require 'open-uri'
require 'nokogiri'
require 'sneakers/metrics/logging_metrics'

class WebScraper
  include Sneakers::Worker
  from_queue :web_pages,
             :threads => 50,
             :prefetch => 50,
             :timeout_job_after => 1

  def work(msg)
    doc = Nokogiri::HTML(URI.open(msg))
    page_title = doc.css('title').text

    worker_trace "Found: #{page_title}"
    ack!
  end
end

This means we set up 50 threads that will all do I/O for us at the same time. A good practice is to set the RabbitMQ prefetch to at least the number of threads involved.

We also want to time out super-fast; a timeout of 1 second means a thread can be held up for at most 1 second, so this whole setup will generate at worst 50 req/s (worst being all jobs failing and timing out on us).
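To build intuition for what a job deadline buys you, here's a standalone sketch using Ruby's stdlib Timeout. It's illustrative only, not Sneakers' actual implementation of `timeout_job_after`:

```ruby
require 'timeout'

# Illustrative only: what a short job deadline means. After the deadline
# the block is abandoned and the thread is freed for the next job.
def run_with_deadline(seconds)
  Timeout.timeout(seconds) { yield }
rescue Timeout::Error
  :timed_out
end
```

A slow fetch wrapped this way can stall a thread for at most `seconds`, which is what caps the worst-case throughput calculation above.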

Resource Starvation

If you are thinking of adding a persistence layer here (for example, to save the page titles), note that Sneakers opens this many threads across this many processes; unless you opt in to connection sharing, this may cause high contention on the data store client you use. If the client supports connection pooling and/or tunable concurrency, those settings may need adjusting to match the concurrency level used by Sneakers.

Finding suitable values is often a matter of trial and error.
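One common fix is a connection pool sized against the worker's thread count. Here's a hand-rolled sketch built on a stdlib Queue; real apps would typically reach for something like the connection_pool gem, and the class and names here are illustrative:

```ruby
# Hand-rolled connection pool sketch built on a stdlib Queue.
# Checkout blocks when all connections are in use, which caps
# concurrent load on the data store at the pool size.
class TinyPool
  def initialize(size, &factory)
    @conns = Queue.new
    size.times { @conns << factory.call }
  end

  def with
    conn = @conns.pop               # blocks until a connection is free
    yield conn
  ensure
    @conns << conn if conn
  end
end
```

Inside `work` you would do something like `POOL.with { |c| c.save(page_title) }`, where `POOL` is a process-wide pool and `save` belongs to whatever hypothetical store client you use; even with 50 worker threads, the store only ever sees pool-size concurrent connections.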