3on/scrap.js

Scrap.js

Scraping websites made easy. It handles redirects, cookies, GET/POST requests, string/binary/DOM responses, jQuery, and more for you!

npm install Scrap

Examples

Download images from Wikipedia front-page

This example makes use of jQuery to traverse the page, and shows how to download binary files.

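# Assumed setup: Node's built-in fs and path modules, and Scrap
# required under the name used by `npm install Scrap`
fs = require 'fs'
path = require 'path'
Scrap = require 'Scrap'

# Create a new Scrap with the base for all the following requests
wikipedia = new Scrap
  path: 'http://en.wikipedia.org/'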

# Download the front page as HTML
wikipedia.get '/', type: 'html', (window) ->

	# We get a window element with jQuery
	$ = window.$

	# Using jQuery we iterate over all the images
	$('img').each ->

		# We get the url and basename
		url = $(this).attr('src')
		basename = path.basename url

		# We get the image as binary
		wikipedia.get url, type: 'binary', (file) ->

			# And save it to the disk (the images/ directory must already exist)
			fs.writeFile 'images/' + basename, file, (err) ->
				console.error err if err

Download all the excerpts from Wikipedia front-page links

This example shows how to download the page as a string and use regular expressions with jsMatch to extract the meaningful parts.

# Assumes Scrap and jsMatch's match function have already been required

# Create a new Scrap with the base for all the following requests
wikipedia = new Scrap
	path: 'http://en.wikipedia.org/'

# Download the front page as string
wikipedia.get '/', (page) ->

	# Get all the wiki article links using a regex
	urls = match.all(page, '<a href="(/wiki/[^:"]+)"')

	# Request all the urls at once
	wikipedia.get urls, (page, url) ->

		# Display useful information from the page
		console.log
			url: url
			title: match(page, '<h1[^>]+>(.*?)<\/h1>')
			excerpt: match(page, '<p>(.*?)<\/p>').replace(/<[^>]+>/g, '')

Thanks to Node.js's asynchronous downloads, this example runs 3 times faster than the same version written synchronously in PHP.
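
The extraction step above can be tried offline, without Scrap or jsMatch. The following is a minimal sketch in plain JavaScript that uses native regular expressions as a stand-in for the jsMatch calls; `extractWikiLinks`, `firstMatch`, and the sample HTML are hypothetical and only illustrate what the three patterns capture.

```javascript
// Stand-in for the jsMatch calls, using only native regular expressions.
// extractWikiLinks and firstMatch are hypothetical helper names.

// Collect every article link; the [^:"] class skips namespaced
// pages such as /wiki/File:Logo.png.
function extractWikiLinks(page) {
  const pattern = /<a href="(\/wiki\/[^:"]+)"/g;
  return Array.from(page.matchAll(pattern), (m) => m[1]);
}

// Return the first capture group of the pattern, or null if nothing matches.
function firstMatch(page, pattern) {
  const m = page.match(pattern);
  return m ? m[1] : null;
}

// Hypothetical sample page, kept small so the result is easy to check.
const samplePage = [
  '<h1 id="firstHeading">Example</h1>',
  '<a href="/wiki/Node.js">Node.js</a>',
  '<a href="/wiki/File:Logo.png">logo</a>',
  '<a href="/wiki/Web_scraping">scraping</a>',
  '<p>An <b>example</b> excerpt.</p>',
].join('\n');

const links = extractWikiLinks(samplePage);
const title = firstMatch(samplePage, /<h1[^>]+>(.*?)<\/h1>/);
// Strip the inline tags from the first paragraph, as the example above does.
const excerpt = firstMatch(samplePage, /<p>(.*?)<\/p>/).replace(/<[^>]+>/g, '');

console.log({ links, title, excerpt });
```

Here the `File:` link is skipped because the character class excludes colons, which is how the pattern limits the result to regular article pages.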

API

  • new Scrap([options])
  • scrap.get(url, [options], [callback])

The four HTTP methods get, put, post, and delete all share this signature.

Options

  • path: Base path of the website. The following request paths will be relative to this path.
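  • type: Selects the post-processing filter applied to the response. The examples above use 'html' and 'binary'; without it, the page is returned as a string.
  • filter: A function used to edit the HTML string before it is parsed.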
  • cookies: An object to provide additional cookies.
  • headers: An object to provide additional headers.
  • data: The content of a POST request.
    • String: The content as-is.
    • Object: Converted to text with querystring.

About

Scraping tool for Node.js
