RegexpCrawler

regexp_crawler is a crawler that uses regular expressions to extract data from websites. If you are familiar with regular expressions, it is easy to use and requires little code.

Install


sudo gem install regexp_crawler

Usage

It is really easy to use; sometimes a single line is enough.


RegexpCrawler::Crawler.new(options).start

options is a hash with the following keys:

:start_page, mandatory, a string giving the URL of the page where the crawler starts
:continue_regexp, optional, a regexp that selects which URLs the crawler continues to crawl; it is applied with String#scan, and the first non-nil result is used
:capture_regexp, mandatory, a regexp that selects the content the crawler extracts; it is applied with Regexp#match, and all group captures are collected
:named_captures, mandatory, an array of strings naming the captured groups of :capture_regexp
:model, optional if :save_method is defined, a string giving the result's model class
:save_method, optional if :model is defined, a proc that saves a crawled result; it accepts two parameters: the result captured from one page, and the URL of that page
:headers, optional, a hash of HTTP headers
:encoding, optional, a string naming the encoding of the crawled pages; the results are converted to UTF-8
:need_parse, optional, a proc that decides whether a page is parsed with the regexp; it accepts two parameters: the URI of the crawled page, and the response body of that page
:logger, optional, true to log to STDOUT, or a Logger object to log to that logger
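
As a rough illustration of how the two regexp options behave, the sketch below applies String#scan and Regexp#match to made-up HTML using only Ruby's standard library. It is not the gem's implementation: the HTML snippets, the variable names, the use of symbol keys, and the reading of "first non-nil result" as "first non-nil capture of each match" are all assumptions made for the example.

```ruby
# Hypothetical index page, standing in for a crawled start page.
index_html = <<~HTML
  <h3>
    <a href="/flyerhzm/regexp_crawler">regexp_crawler</a>
  </h3>
  <h3>
    <a href="/flyerhzm/bullet">bullet</a>
  </h3>
HTML

continue_regexp = %r{<h3>\s*<a href="(/flyerhzm/.*?)">}m

# String#scan returns one array of group captures per match.
# Assumption: the first non-nil capture of each match is the next URL to visit.
continue_pages = index_html.scan(continue_regexp).map { |captures| captures.compact.first }

# Hypothetical project page.
project_html = '<h1>bullet</h1><p>Kill N+1 queries</p>'
capture_regexp = %r{<h1>(.*?)</h1><p>(.*?)</p>}m
named_captures = ['title', 'description']

# Regexp#match returns a MatchData object, or nil when nothing matches.
# Pair each group capture with the name at the same position.
match = capture_regexp.match(project_html)
result = match ? Hash[named_captures.map(&:to_sym).zip(match.captures)] : nil

puts continue_pages
puts result[:title] if result
```

The names in :named_captures line up with the groups by position, so the array should list them in the same order as the parentheses in :capture_regexp.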

If the crawler defines :model but no :save_method, RegexpCrawler::Crawler#start returns an array of results, such as:


[{:model_name => {:attr_name => 'attr_value'}, :page => 'website url'}, {:model_name => {:attr_name => 'attr_value'}, :page => 'another website url'}]
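
An array in this shape can be flattened with plain Ruby. The sketch below works on hand-written sample data rather than a real crawl; the :project key, the attribute names, and the example.com URLs are hypothetical stand-ins for :model_name, :attr_name, and the crawled pages.

```ruby
# Sample data in the documented return shape (values are made up).
results = [
  { :project => { :title => 'regexp_crawler', :description => 'A regexp crawler' },
    :page => 'http://example.com/regexp_crawler' },
  { :project => { :title => 'bullet', :description => 'Kill N+1 queries' },
    :page => 'http://example.com/bullet' }
]

# Each entry holds the page URL under :page and the captured attributes
# under the model name. Merge them into one flat hash per page.
rows = results.map do |entry|
  attributes = entry.reject { |key, _| key == :page }.values.first
  attributes.merge(:url => entry[:page])
end

rows.each { |row| puts "#{row[:url]}: #{row[:title]}" }
```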

Example

A script that synchronizes your GitHub projects, excluding forks. See example/github_projects.rb.


require 'rubygems'
require 'regexp_crawler'

class Project
  attr_accessor :title, :description, :body, :url

  def initialize(options)
    options.each do |k, v|
      self.instance_variable_set("@#{k}", v)
    end
  end
end

projects = []
crawler = RegexpCrawler::Crawler.new(
  :start_page => "http://github.com/flyerhzm",
  :continue_regexp => %r{<h3>[\s\n]*?<a href="(/flyerhzm/.*?)">}m,
  :capture_regexp => %r{<a href="http://github.com/flyerhzm/[^"]*?">(.*?)</a>.*?<div id="repository_description".*?>[\s\n]*?<p>(.*?)[\s\n]*?<span id="read_more".*(<div class="wikistyle">.*?</div>)</div>}m,
  :named_captures => ['title', 'description', 'body'],
  :logger => true,
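  # Wrap each captured result in a Project, recording the page it came from.
  :save_method => Proc.new do |result, page|
    projects << Project.new(result.merge(:url => page))
  end,
  # Parse a page only if it does not contain the fork flag.
  :need_parse => Proc.new do |page, response_body|
    !response_body.index(/<span class="fork-flag">/)
  end)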
crawler.start

projects.each do |project|
  puts project.url
  puts project.title
  puts project.description
end

The results are as follows:


D, [2010-02-06T18:59:32.487885 #11387] DEBUG -- : crawling page: http://github.com/flyerhzm
D, [2010-02-06T18:59:34.877730 #11387] DEBUG -- : crawling success: http://github.com/flyerhzm
D, [2010-02-06T18:59:34.878158 #11387] DEBUG -- : continue_page: /flyerhzm/regexp_crawler
D, [2010-02-06T18:59:34.878462 #11387] DEBUG -- : continue_page: /flyerhzm/css_sprite
D, [2010-02-06T18:59:34.878707 #11387] DEBUG -- : continue_page: /flyerhzm/chinese_permalink
D, [2010-02-06T18:59:34.878991 #11387] DEBUG -- : continue_page: /flyerhzm/contactlist
D, [2010-02-06T18:59:34.879299 #11387] DEBUG -- : continue_page: /flyerhzm/rails_best_practices
D, [2010-02-06T18:59:34.880802 #11387] DEBUG -- : continue_page: /flyerhzm/rfetion
D, [2010-02-06T18:59:34.881232 #11387] DEBUG -- : continue_page: /flyerhzm/bullet
D, [2010-02-06T18:59:34.881644 #11387] DEBUG -- : continue_page: /flyerhzm/metric_fu
D, [2010-02-06T18:59:34.882090 #11387] DEBUG -- : continue_page: /flyerhzm/exception_notification
D, [2010-02-06T18:59:34.882570 #11387] DEBUG -- : continue_page: /flyerhzm/activemerchant_patch_for_china
D, [2010-02-06T18:59:34.883087 #11387] DEBUG -- : continue_page: /flyerhzm/contactlist-client
D, [2010-02-06T18:59:34.883650 #11387] DEBUG -- : continue_page: /flyerhzm/taobao
D, [2010-02-06T18:59:34.884231 #11387] DEBUG -- : continue_page: /flyerhzm/monitor
D, [2010-02-06T18:59:34.884843 #11387] DEBUG -- : continue_page: /flyerhzm/sitemap
D, [2010-02-06T18:59:34.885491 #11387] DEBUG -- : continue_page: /flyerhzm/visual_partial
D, [2010-02-06T18:59:34.886370 #11387] DEBUG -- : continue_page: /flyerhzm/chinese_regions
D, [2010-02-06T18:59:34.887123 #11387] DEBUG -- : continue_page: /flyerhzm/codelinestatistics
D, [2010-02-06T18:59:34.888060 #11387] DEBUG -- : continue_page: /flyerhzm/rack
D, [2010-02-06T19:00:25.245306 #11387] DEBUG -- : crawling page: http://github.com/flyerhzm/regexp_crawler
D, [2010-02-06T19:00:27.168275 #11387] DEBUG -- : crawling success: http://github.com/flyerhzm/regexp_crawler
D, [2010-02-06T19:00:27.172163 #11387] DEBUG -- : response body captured
D, [2010-02-06T19:00:27.172349 #11387] DEBUG -- : crawling page: http://github.com/flyerhzm/css_sprite
D, [2010-02-06T19:00:29.005109 #11387] DEBUG -- : crawling success: http://github.com/flyerhzm/css_sprite
D, [2010-02-06T19:00:29.008690 #11387] DEBUG -- : response body captured
D, [2010-02-06T19:00:29.008882 #11387] DEBUG -- : crawling page: http://github.com/flyerhzm/chinese_permalink
D, [2010-02-06T19:00:30.672890 #11387] DEBUG -- : crawling success: http://github.com/flyerhzm/chinese_permalink
D, [2010-02-06T19:00:30.680095 #11387] DEBUG -- : response body captured
D, [2010-02-06T19:00:30.680453 #11387] DEBUG -- : crawling page: http://github.com/flyerhzm/contactlist
D, [2010-02-06T19:00:32.332182 #11387] DEBUG -- : crawling success: http://github.com/flyerhzm/contactlist
D, [2010-02-06T19:00:32.336053 #11387] DEBUG -- : response body captured
D, [2010-02-06T19:00:32.336222 #11387] DEBUG -- : crawling page: http://github.com/flyerhzm/rails_best_practices
D, [2010-02-06T19:00:34.554523 #11387] DEBUG -- : crawling success: http://github.com/flyerhzm/rails_best_practices
D, [2010-02-06T19:00:34.564731 #11387] DEBUG -- : response body captured
D, [2010-02-06T19:00:34.565456 #11387] DEBUG -- : crawling page: http://github.com/flyerhzm/rfetion
D, [2010-02-06T19:00:36.255873 #11387] DEBUG -- : crawling success: http://github.com/flyerhzm/rfetion
D, [2010-02-06T19:00:36.260189 #11387] DEBUG -- : response body captured
D, [2010-02-06T19:00:36.260389 #11387] DEBUG -- : crawling page: http://github.com/flyerhzm/bullet
D, [2010-02-06T19:00:39.847604 #11387] DEBUG -- : crawling success: http://github.com/flyerhzm/bullet
D, [2010-02-06T19:00:39.858775 #11387] DEBUG -- : response body captured
D, [2010-02-06T19:00:39.859471 #11387] DEBUG -- : crawling page: http://github.com/flyerhzm/metric_fu
D, [2010-02-06T19:00:41.779917 #11387] DEBUG -- : crawling success: http://github.com/flyerhzm/metric_fu
D, [2010-02-06T19:00:41.780332 #11387] DEBUG -- : crawling page: http://github.com/flyerhzm/exception_notification
D, [2010-02-06T19:00:43.481367 #11387] DEBUG -- : crawling success: http://github.com/flyerhzm/exception_notification
D, [2010-02-06T19:00:43.481768 #11387] DEBUG -- : crawling page: http://github.com/flyerhzm/activemerchant_patch_for_china
D, [2010-02-06T19:00:45.111665 #11387] DEBUG -- : crawling success: http://github.com/flyerhzm/activemerchant_patch_for_china
D, [2010-02-06T19:00:45.114517 #11387] DEBUG -- : response body captured
D, [2010-02-06T19:00:45.114687 #11387] DEBUG -- : crawling page: http://github.com/flyerhzm/contactlist-client
D, [2010-02-06T19:00:46.797493 #11387] DEBUG -- : crawling success: http://github.com/flyerhzm/contactlist-client
D, [2010-02-06T19:00:46.801662 #11387] DEBUG -- : response body captured
D, [2010-02-06T19:00:46.801909 #11387] DEBUG -- : crawling page: http://github.com/flyerhzm/taobao
D, [2010-02-06T19:00:49.147218 #11387] DEBUG -- : crawling success: http://github.com/flyerhzm/taobao
D, [2010-02-06T19:00:49.147556 #11387] DEBUG -- : crawling page: http://github.com/flyerhzm/monitor
D, [2010-02-06T19:00:52.968478 #11387] DEBUG -- : crawling success: http://github.com/flyerhzm/monitor
D, [2010-02-06T19:00:52.971288 #11387] DEBUG -- : response body captured
D, [2010-02-06T19:00:52.971458 #11387] DEBUG -- : crawling page: http://github.com/flyerhzm/sitemap
D, [2010-02-06T19:00:58.807052 #11387] DEBUG -- : crawling success: http://github.com/flyerhzm/sitemap
D, [2010-02-06T19:00:58.811199 #11387] DEBUG -- : response body captured
D, [2010-02-06T19:00:58.811388 #11387] DEBUG -- : crawling page: http://github.com/flyerhzm/visual_partial
D, [2010-02-06T19:01:01.788958 #11387] DEBUG -- : crawling success: http://github.com/flyerhzm/visual_partial
D, [2010-02-06T19:01:01.793886 #11387] DEBUG -- : response body captured
D, [2010-02-06T19:01:01.794191 #11387] DEBUG -- : crawling page: http://github.com/flyerhzm/chinese_regions
D, [2010-02-06T19:01:04.098727 #11387] DEBUG -- : crawling success: http://github.com/flyerhzm/chinese_regions
D, [2010-02-06T19:01:04.103930 #11387] DEBUG -- : response body captured
D, [2010-02-06T19:01:04.104248 #11387] DEBUG -- : crawling page: http://github.com/flyerhzm/codelinestatistics
D, [2010-02-06T19:01:06.304536 #11387] DEBUG -- : crawling success: http://github.com/flyerhzm/codelinestatistics
D, [2010-02-06T19:01:14.003714 #11387] DEBUG -- : crawling page: http://github.com/flyerhzm/rack
D, [2010-02-06T19:01:16.551656 #11387] DEBUG -- : crawling success: http://github.com/flyerhzm/rack
http://github.com/flyerhzm/regexp_crawler
regexp_crawler
A crawler which uses regular expression to catch data from website.
http://github.com/flyerhzm/css_sprite
css_sprite
A rails plugin to generate css sprite image automatically
http://github.com/flyerhzm/chinese_permalink
chinese_permalink
This plugin adds a capability for AR model to create a seo permalink with your chinese text. It will translate your chinese text to english url based on google translate.
http://github.com/flyerhzm/contactlist
contactlist
java api to retrieve contact list of email(hotmail, gmail, yahoo, sohu, sina, 163, 126, tom, yeah, 189 and 139) and im(msn)
http://github.com/flyerhzm/rails_best_practices
rails_best_practices
rails_best_practices is a gem to check quality of rails app files according to ihower’s presentation from Kungfu RailsConf in Shanghai China
http://github.com/flyerhzm/rfetion
rfetion
rfetion is a ruby gem for China Mobile fetion service that you can send SMS free.
http://github.com/flyerhzm/bullet
bullet
A rails plugin/gem to kill N+1 queries and unused eager loading
http://github.com/flyerhzm/activemerchant_patch_for_china
activemerchant_patch_for_china
A rails plugin to add an active_merchant patch for china online payment platform including alipay (支付宝), 99bill (快钱) and tenpay (财付通)
http://github.com/flyerhzm/contactlist-client
contactlist-client
The contactlist-client gem is a ruby client to contactlist service which retrieves contact list of email(hotmail, gmail, yahoo, sohu, sina, 163, 126, tom, yeah, 189 and 139) and im(msn)
http://github.com/flyerhzm/monitor
monitor
Monitor gem can display ruby methods call stack on browser based on unroller
http://github.com/flyerhzm/sitemap
sitemap
This plugin will generate a sitemap.xml from sitemap.rb whose format is very similar to routes.rb
http://github.com/flyerhzm/visual_partial
visual_partial
This plugin provides a way that you can see all the partial pages rendered. So it can prevent you from using partial page too much, which hurts the performance.
http://github.com/flyerhzm/chinese_regions
chinese_regions
provides all chinese regions, cities and districts
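
The :need_parse proc from the script can be tried on its own to see how it filters pages. This is only a sketch: the response bodies and example.com URLs below are hypothetical, not real GitHub markup, and the proc is called directly rather than through the crawler.

```ruby
# Same predicate as the :need_parse proc in the example script.
need_parse = Proc.new do |page, response_body|
  !response_body.index(/<span class="fork-flag">/)
end

# Hypothetical response bodies for an original repository and a fork.
own_repo_body  = '<div class="title">my_project</div>'
fork_repo_body = '<div class="title">some_fork</div><span class="fork-flag">fork</span>'

# String#index returns nil when the pattern is absent, so negating it gives true.
parse_own  = need_parse.call('http://example.com/my_project', own_repo_body)
# String#index returns an integer offset when the pattern is found, so negating it gives false.
parse_fork = need_parse.call('http://example.com/some_fork', fork_repo_body)

puts "parse own repository page: #{parse_own}"
puts "parse forked repository page: #{parse_fork}"
```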
