Krake.js

Krake is a library for declarative web scraping, written in JavaScript. It abstracts away the boilerplate of web scrapers, so you only need to describe the features unique to your project.

Using existing libraries to write a web scraper usually involves writing code to request, parse, select, modify, and serialise HTML content. You might request a single URL, or fetch a collection of URLs populated either from the results of another task or from known patterns in addresses. You're likely selecting content with XPath expressions, each scoped to the DOM element that represents a single item. You're probably serialising to a list of objects with a fixed structure, reflecting how you expect the data was originally represented.

There's a lot of boilerplate in web scraping. Most of what varies from project to project is the structure of the target pages and the data underlying them. The ideal motivating this project is a function from a minimal representation of a scraper, expressed as data, to the data that scraper was intended to retrieve. Right now we believe that representation looks close to:

> var task = 
  { url: 
    { pattern: "https://www.flickr.com/search/?q=@keywords@&l=@licenses@"
    , keywords: ["kitten", "cat", "meow"]
    , licenses: ["comm", "deriv"]
    }
  , cols: 
     [ { desc: 'title'
       , sel: "//img[contains(@id,'photo_img_')]" // XPath or CSS
       , attr: 'alt' 
       }
     , { desc: 'image'
       , sel: "//img[contains(@id,'photo_img_')]"
       , attr: 'data-defer-src'
       , fn: function(str) {return str.replace(/(\.[a-z]+)$/,'_b$1');} 
       }
     , { desc: 'owner'
       , sel: "//a[@data-track='owner']"
       , attr: 'title' 
       }
     , { desc: 'page'
       , sel: 'a.photo-click'
       , attr: 'href'
       , cols: // A nested task; the page referenced above serves as its URL pattern.
           [ { desc: "description"
             , sel: "//div[contains(@class,'photo-desc')]"
             // When no 'attr' is specified, we use the innerHTML of the specified element
             }
           ]
       }
     ] 
  }
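
The `url` entry above pairs a pattern with lists of values for each `@name@` placeholder. As a rough sketch of what that could expand to, the snippet below substitutes every combination of values into the pattern. The helper `expandPattern` is hypothetical and not part of Krake's API, and both the all-combinations behaviour and the URL-encoding of values are assumptions; the library's actual expansion may differ.

```javascript
// Hypothetical helper, not part of Krake's API: expand a url definition
// into concrete URLs by substituting every combination of placeholder values.
function expandPattern(url) {
  // Assumption: every key other than 'pattern' names a placeholder
  // and holds the list of values to substitute for it.
  var keys = Object.keys(url).filter(function (k) { return k !== 'pattern'; });
  return keys.reduce(function (urls, key) {
    var out = [];
    urls.forEach(function (u) {
      url[key].forEach(function (value) {
        // Replace all occurrences of @key@ with the (URL-encoded) value.
        out.push(u.split('@' + key + '@').join(encodeURIComponent(value)));
      });
    });
    return out;
  }, [url.pattern]);
}

var urls = expandPattern({
  pattern: "https://www.flickr.com/search/?q=@keywords@&l=@licenses@",
  keywords: ["kitten", "cat", "meow"],
  licenses: ["comm", "deriv"]
});

console.log(urls.length); // 6
console.log(urls[0]);     // https://www.flickr.com/search/?q=kitten&l=comm
```

Under these assumptions, three keywords and two licenses yield six search URLs, each of which would be scraped with the same `cols`.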

Given such a definition, we can set up an emitter to receive results as they're retrieved:

> var Krake = require('krake');
> new Krake({}).scrape(task).on('retrieved', console.log);
{ title: 'meow meow and the mouse',
  image: 'https://farm5.staticflickr.com/4072/4217411837_05b4ba416f_b.jpg',
  owner: 'megankhines',
  page: { description: '\n\t\t<p>Kittens</p>\n\t\t\t\t' } 
}
{ title: 'meow meow',
  image: 'https://farm1.staticflickr.com/26/50499021_cefc189a28_b.jpg',
  owner: 'megankhines',
  page: { description: '\n\t\t<p>Kittens</p>\n\t\t\t\t' } 
}
{ title: 'Meow',
  image: 'https://farm3.staticflickr.com/2543/3843401084_2495a1367d_b.jpg',
  owner: 'mindy_g',
  page: { description: '\n\t\t<p>Kittens</p>\n\t\t\t\t' } 
}
[..]
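
The `image` values above end in `_b.jpg` because that column's `fn` rewrites the raw attribute value. The sketch below illustrates how a single column value might be derived from a matched element: take the named `attr`, fall back to `innerHTML` when no `attr` is given, then apply `fn` if present. The function `columnValue` and the plain-object stand-in for a DOM element are hypothetical, and the unsuffixed input URL is inferred from the output shown above rather than taken from the page.

```javascript
// Hypothetical sketch, not Krake's implementation: derive one column's value
// from an element-like object of the form { attributes: {...}, innerHTML: '...' }.
function columnValue(col, element) {
  // Use the named attribute if 'attr' is given, otherwise the element's innerHTML.
  var raw = col.attr ? element.attributes[col.attr] : element.innerHTML;
  // Apply the optional post-processing function.
  return typeof col.fn === 'function' ? col.fn(raw) : raw;
}

var imageCol = {
  desc: 'image',
  attr: 'data-defer-src',
  fn: function (str) { return str.replace(/(\.[a-z]+)$/, '_b$1'); }
};

// Stand-in element; the attribute value is an assumed input.
var el = {
  attributes: {
    alt: 'meow meow',
    'data-defer-src': 'https://farm1.staticflickr.com/26/50499021_cefc189a28.jpg'
  },
  innerHTML: ''
};

console.log(columnValue(imageCol, el));
// https://farm1.staticflickr.com/26/50499021_cefc189a28_b.jpg
```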

Krake.IO

Defining programs in data (we don't mean syntax trees ;-) provides several advantages. For one, it's a catalyst for reuse: when that data (or embedded data) defines a web scraper, a single great canonical implementation written for a service can be reused and extended with minimal wrangling. It also facilitates the development of even higher-level abstractions, having reduced the requirement from writing low-level code to writing high-level definitions.

Krake.IO have been building tools on top of this library. They've developed a graphical interface for constructing definitions, and rent clusters to perform larger tasks. You should investigate these! fmap works there, and they hold this code's copyright.
