Scala Webscraper 0.4.1


Getting started

The project is built with Scala 2.10.2 and sbt 0.13.0; both can be installed using this install script.

To try the example, navigate to the project folder and run sbt "project scraper-demo" run, which will start the example scraper.

Installation

If you use sbt, just edit build.sbt and add the following:

libraryDependencies += "nl.razko" %% "scraper" % "0.4.1"

If you want to use bleeding-edge snapshot versions, add the Sonatype snapshots repository to the resolvers:

resolvers += "Sonatype Snapshots" at "http://oss.sonatype.org/content/repositories/snapshots/"

libraryDependencies += "nl.razko" %% "scraper" % "0.4.1-SNAPSHOT"

DSL

The webscraper provides a simple DSL for writing scrape rules:

import org.rovak.scraper.ScrapeManager._
import org.jsoup.nodes.Element

object Google {
  val results = "#res li.g h3.r a"
  def search(term: String) = {
    "http://www.google.com/search?q=" + term.replace(" ", "+")
  }
}

// Open the search results page for the query "php elephant"
scrape from Google.search("php elephant") open { implicit page =>

  // Iterate through every result link
  Google.results each { x: Element =>

    // Drop the first 28 characters (the "http://www.google.com/url?q=" redirect prefix)
    val link = x.select("a[href]").attr("abs:href").substring(28)
    if (link.isValidURL) {

      // Iterate through every found link in the found page
      scrape from link each (x => println("found: " + x))
    }
  }
}
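
The same constructs also work on a single page without following any links. The sketch below is illustrative (the URL and selector are placeholders) and only uses the scrape from ... open and each constructs shown above:

import org.rovak.scraper.ScrapeManager._
import org.jsoup.nodes.Element

// Open one page and print the text and absolute URL of every link on it
scrape from "http://events.stanford.edu/" open { implicit page =>
  "a[href]" each { x: Element =>
    println(x.text + " -> " + x.attr("abs:href"))
  }
}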

Spiders

A spider is a scraper which recursively loads a page and opens every link it finds. It keeps scraping until all pages within the allowed domains have been visited once.

The following snippet demonstrates a basic spider which crawls a website and provides hooks to do something with the data:

new Spider {
  startUrls ::= "http://events.stanford.edu/"
  allowedDomains ::= "events.stanford.edu"

  onReceivedPage ::= { page: WebPage =>
    // Page received
  }

  onLinkFound ::= { link: Href =>
    println(s"Found link ${link.url} with name ${link.name}")
  }
}.start()

A spider can be extended by mixing in traits. If you want to scrape emails, add the EmailSpider trait, which offers a new onEmailFound hook in which emails can be collected.

new Spider with EmailSpider {
  startUrls ::= "http://events.stanford.edu/"
  allowedDomains ::= "events.stanford.edu"

  onEmailFound ::= { email: String =>
    // Email found
  }

  onReceivedPage ::= { page: WebPage =>
    // Page received
  }

  onLinkFound ::= { link: Href =>
    println(s"Found link ${link.url} with name ${link.name}")
  }
}.start()

Multiple spider traits can be mixed together:

new Spider with EmailSpider with SitemapSpider {
  startUrls ::= "http://events.stanford.edu/"
  allowedDomains ::= "events.stanford.edu"
  sitemapUrls ::= "http://events.stanford.edu/sitemap.xml"

  onEmailFound ::= { email: String =>
    println("Found email: " + email)
  }

  onReceivedPage ::= { page: WebPage =>
    // Page received
  }

  onLinkFound ::= { link: Href =>
    println(s"Found link ${link.url} with name ${link.name}")
  }
}.start()

Documentation
