diff --git a/History.rdoc b/History.md
similarity index 91%
rename from History.rdoc
rename to History.md
index 22bd8f84..935669e1 100644
--- a/History.rdoc
+++ b/History.md
@@ -1,4 +1,4 @@
-=== 0.2.2 / 2010-01-06
+### 0.2.2 / 2010-01-06
 
 * Require Web Spider Obstacle Course (WSOC) >= 0.1.1.
 * Integrated the new WSOC into the specs.
@@ -15,7 +15,7 @@
 * Renamed Spidr::Agent#get_session to {Spidr::SessionCache#[]}.
 * Renamed Spidr::Agent#kill_session to {Spidr::SessionCache#kill!}.
 
-=== 0.2.1 / 2009-11-25
+### 0.2.1 / 2009-11-25
 
 * Added {Spidr::Events#every_ok_page}.
 * Added {Spidr::Events#every_redirect_page}.
@@ -44,9 +44,9 @@
 * Added {Spidr::Events#every_zip_page}.
 * Fixed a bug where {Spidr::Agent#delay} was not being used to delay
   requesting pages.
-* Spider +link+ and +script+ tags in HTML pages (thanks Nick Plante).
+* Spider `link` and `script` tags in HTML pages (thanks Nick Plante).
 
-=== 0.2.0 / 2009-10-10
+### 0.2.0 / 2009-10-10
 
 * Added {URI.expand_path}.
 * Added {Spidr::Page#search}.
@@ -91,7 +91,7 @@
 * Made {Spidr::Agent#visit_page} public.
 * Moved to YARD based documentation.
 
-=== 0.1.9 / 2009-06-13
+### 0.1.9 / 2009-06-13
 
 * Upgraded to Hoe 2.0.0.
 * Use Hoe.spec instead of Hoe.new.
@@ -108,7 +108,7 @@
   could not be loaded.
 * Removed Spidr::Agent::SCHEMES.
 
-=== 0.1.8 / 2009-05-27
+### 0.1.8 / 2009-05-27
 
 * Added the Spidr::Agent#pause! and Spidr::Agent#continue! methods.
 * Added the Spidr::Agent#running? and Spidr::Agent#paused? methods.
@@ -121,15 +121,15 @@
 * Made {Spidr::Agent#enqueue} and {Spidr::Agent#queued?} public.
 * Added more specs.
 
-=== 0.1.7 / 2009-04-24
+### 0.1.7 / 2009-04-24
 
 * Added Spidr::Agent#all_headers.
-* Fixed a bug where Page#headers was always +nil+.
+* Fixed a bug where Page#headers was always `nil`.
 * {Spidr::Spidr::Agent} will now follow the Location header in HTTP 300,
   301, 302, 303 and 307 Redirects.
 * {Spidr::Agent} will now follow iframe and frame tags.
 
-=== 0.1.6 / 2009-04-14
+### 0.1.6 / 2009-04-14
 
 * Added {Spidr::Agent#failures}, a list of URLs which could not be visited.
 * Added {Spidr::Agent#failed?}.
@@ -143,27 +143,27 @@
 * Updated the Web Spider Obstacle Course with links that always fail to be
   visited.
 
-=== 0.1.5 / 2009-03-22
+### 0.1.5 / 2009-03-22
 
-* Catch malformed URIs in {Spidr::Page#to_absolute} and return +nil+.
-* Filter out +nil+ URIs in {Spidr::Page#urls}.
+* Catch malformed URIs in {Spidr::Page#to_absolute} and return `nil`.
+* Filter out `nil` URIs in {Spidr::Page#urls}.
 
-=== 0.1.4 / 2009-01-15
+### 0.1.4 / 2009-01-15
 
 * Use Nokogiri for HTML and XML parsing.
 
-=== 0.1.3 / 2009-01-10
+### 0.1.3 / 2009-01-10
 
 * Added the :host options to {Spidr::Agent#initialize}.
 * Added the Web Spider Obstacle Course files to the Manifest.
 * Aliased {Spidr::Agent#visited_urls} to {Spidr::Agent#history}.
 
-=== 0.1.2 / 2008-11-06
+### 0.1.2 / 2008-11-06
 
 * Fixed a bug in {Spidr::Page#to_absolute} where URLs with no path were not
-  receiving a default path of /.
+  receiving a default path of `/`.
 * Fixed a bug in {Spidr::Page#to_absolute} where URL paths were not being
-  expanded, in order to remove .. and . directories.
+  expanded, in order to remove `..` and `.` directories.
 * Fixed a bug where absolute URLs could have a blank path, thus causing
   {Spidr::Agent#get_page} to crash when it performed the HTTP request.
 * Added RSpec spec tests.
@@ -171,12 +171,12 @@
   (http://spidr.rubyforge.org/course/start.html) which is used in the spec
   tests.
 
-=== 0.1.1 / 2008-10-04
+### 0.1.1 / 2008-10-04
 
 * Added a reader method for the response instance variable in Page.
 * Fixed a bug in {Spidr::Page#method_missing}.
 
-=== 0.1.0 / 2008-05-23
+### 0.1.0 / 2008-05-23
 
 * Initial release.
 * Black-list or white-list URLs based upon:
diff --git a/README.rdoc b/README.md
similarity index 80%
rename from README.rdoc
rename to README.md
index fe78ca1f..9ccb3ca2 100644
--- a/README.rdoc
+++ b/README.md
@@ -1,18 +1,18 @@
-= Spidr
+# Spidr
 
-* http://spidr.rubyforge.org
-* http://github.com/postmodern/spidr
-* http://github.com/postmodern/spidr/issues
-* http://groups.google.com/group/spidr
+* [spidr.rubyforge.org](http://spidr.rubyforge.org/)
+* [github.com/postmodern/spidr](http://github.com/postmodern/spidr)
+* [github.com/postmodern/spidr/issues](http://github.com/postmodern/spidr/issues)
+* [groups.google.com/group/spidr](http://groups.google.com/group/spidr)
 * irc.freenode.net #spidr
 
-== DESCRIPTION:
+## DESCRIPTION:
 
 Spidr is a versatile Ruby web spidering library that can spider a site,
 multiple domains, certain links or infinitely. Spidr is designed to be fast
 and easy to use.
 
-== FEATURES:
+## FEATURES:
 
 * Follows:
   * a tags.
@@ -41,21 +41,21 @@ and easy to use.
 * Custom proxy settings.
 * HTTPS support.
 
-== EXAMPLES:
+## EXAMPLES:
 
-* Start spidering from a URL:
+Start spidering from a URL:
 
     Spidr.start_at('http://tenderlovemaking.com/')
 
-* Spider a host:
+Spider a host:
 
     Spidr.host('coderrr.wordpress.com')
 
-* Spider a site:
+Spider a site:
 
     Spidr.site('http://rubyflow.com/')
 
-* Spider multiple hosts:
+Spider multiple hosts:
 
     Spidr.start_at(
       'http://company.com/',
@@ -65,30 +65,30 @@ and easy to use.
       ]
     )
 
-* Do not spider certain links:
+Do not spider certain links:
 
     Spidr.site('http://matasano.com/', :ignore_links => [/log/])
 
-* Do not spider links on certain ports:
+Do not spider links on certain ports:
 
     Spidr.site(
       'http://sketchy.content.com/',
      :ignore_ports => [8000, 8010, 8080]
    )
 
-* Print out visited URLs:
+Print out visited URLs:
 
    Spidr.site('http://rubyinside.org/') do |spider|
      spider.every_url { |url| puts url }
    end
 
-* Print out the URLs that could not be requested:
+Print out the URLs that could not be requested:
 
    Spidr.site('http://sketchy.content.com/') do |spider|
      spider.every_failed_url { |url| puts url }
    end
 
-* Search HTML and XML pages:
+Search HTML and XML pages:
 
    Spidr.site('http://company.withablog.com/') do |spider|
      spider.every_page do |page|
@@ -99,11 +99,11 @@ and easy to use.
         value = meta.attributes['content']
         puts " #{name} = #{value}"
-      end
+      end
     end
   end
 
-* Print out the titles from every page:
+Print out the titles from every page:
 
    Spidr.site('http://www.rubypulse.com/') do |spider|
      spider.every_html_page do |page|
@@ -111,7 +111,7 @@ and easy to use.
      end
    end
 
-* Find what kinds of web servers a host is using, by accessing the headers:
+Find what kinds of web servers a host is using, by accessing the headers:
 
    servers = Set[]
 
@@ -121,7 +121,7 @@ and easy to use.
      end
    end
 
-* Pause the spider on a forbidden page:
+Pause the spider on a forbidden page:
 
    spider = Spidr.host('overnight.startup.com') do |spider|
      spider.every_forbidden_page do |page|
@@ -129,7 +129,7 @@ and easy to use.
      end
    end
 
-* Skip the processing of a page:
+Skip the processing of a page:
 
    Spidr.host('sketchy.content.com') do |spider|
      spider.every_missing_page do |page|
@@ -137,7 +137,7 @@ and easy to use.
      end
    end
 
-* Skip the processing of links:
+Skip the processing of links:
 
    Spidr.host('sketchy.content.com') do |spider|
      spider.every_url do |url|
@@ -147,15 +147,15 @@
      end
    end
 
-== REQUIREMENTS:
+## REQUIREMENTS:
 
-* {nokogiri}[http://nokogiri.rubyforge.org/] >= 1.2.0
+* [nokogiri](http://nokogiri.rubyforge.org/) >= 1.2.0
 
-== INSTALL:
+## INSTALL:
 
-  $ sudo gem install spidr
+    $ sudo gem install spidr
 
-== LICENSE:
+## LICENSE:
 
 The MIT License
diff --git a/Rakefile b/Rakefile
index b71f2185..37632686 100644
--- a/Rakefile
+++ b/Rakefile
@@ -11,7 +11,7 @@ Hoe.spec('spidr') do
 
   self.rspec_options += ['--colour', '--format', 'specdoc']
 
-  self.yard_options += ['--protected']
+  self.yard_options += ['--markup', 'markdown', '--protected']
   self.remote_yard_dir = 'docs'
 
   self.extra_deps = [
diff --git a/lib/spidr/agent.rb b/lib/spidr/agent.rb
index 05201d68..c201c21e 100644
--- a/lib/spidr/agent.rb
+++ b/lib/spidr/agent.rb
@@ -492,7 +492,7 @@ def enqueue(url)
   #   The page for the response.
   #
   # @return [Page, nil]
-  #   The page for the response, or +nil+ if the request failed.
+  #   The page for the response, or `nil` if the request failed.
   #
   def get_page(url,&block)
     url = URI(url.to_s)
@@ -525,7 +525,7 @@ def get_page(url,&block)
   #   The page for the response.
   #
   # @return [Page, nil]
-  #   The page for the response, or +nil+ if the request failed.
+  #   The page for the response, or `nil` if the request failed.
   #
   # @since 0.2.2
   #
@@ -557,7 +557,7 @@ def post_page(url,post_data='',&block)
   #   The page which was visited.
   #
   # @return [Page, nil]
-  #   The page that was visited. If +nil+ is returned, either the request
+  #   The page that was visited. If `nil` is returned, either the request
   #   for the page failed, or the page was skipped.
   #
   def visit_page(url,&block)
@@ -585,8 +585,8 @@ def visit_page(url,&block)
   # Converts the agent into a Hash.
   #
   # @return [Hash]
-  #   The agent represented as a Hash containing the +history+ and
-  #   the +queue+ of the agent.
+  #   The agent represented as a Hash containing the `history` and
+  #   the `queue` of the agent.
   #
   def to_hash
     {:history => @history, :queue => @queue}
diff --git a/lib/spidr/auth_store.rb b/lib/spidr/auth_store.rb
index 7aa09df5..868aa531 100644
--- a/lib/spidr/auth_store.rb
+++ b/lib/spidr/auth_store.rb
@@ -24,7 +24,7 @@ def initialize
   #
   # @return [AuthCredential, nil]
   #   Closest matching {AuthCredential} values for the URL,
-  #   or +nil+ if nothing matches.
+  #   or `nil` if nothing matches.
   #
   # @since 0.2.2
   #
@@ -102,13 +102,13 @@ def add(url,username,password)
   #
   # Returns the base64 encoded authorization string for the URL
-  # or +nil+ if no authorization exists.
+  # or `nil` if no authorization exists.
   #
   # @param [URI] url
   #   The url.
   #
   # @return [String, nil]
-  #   The base64 encoded authorizatio string or +nil+.
+  #   The base64 encoded authorization string or `nil`.
   #
   # @since 0.2.2
   #
diff --git a/lib/spidr/cookie_jar.rb b/lib/spidr/cookie_jar.rb
index 2994e8b1..2eb59190 100644
--- a/lib/spidr/cookie_jar.rb
+++ b/lib/spidr/cookie_jar.rb
@@ -47,7 +47,7 @@ def each(&block)
   #   Host or domain name for cookies.
   #
   # @return [String, nil]
-  #   The cookie values or +nil+ if the host does not have a cookie in the
+  #   The cookie values or `nil` if the host does not have a cookie in the
   #   jar.
   #
   # @since 0.2.2
diff --git a/lib/spidr/filters.rb b/lib/spidr/filters.rb
index 5962e7c6..59ea47d1 100644
--- a/lib/spidr/filters.rb
+++ b/lib/spidr/filters.rb
@@ -17,7 +17,7 @@ def self.included(base)
   #
   # @option options [Array] :schemes (['http', 'https'])
   #   The list of acceptable URI schemes to visit.
-  #   The +https+ scheme will be ignored if +net/https+ cannot be loaded.
+  #   The `https` scheme will be ignored if `net/https` cannot be loaded.
   #
   # @option options [String] :host
   #   The host-name to visit.
diff --git a/lib/spidr/page.rb b/lib/spidr/page.rb
index 8af257cb..42cb57b9 100644
--- a/lib/spidr/page.rb
+++ b/lib/spidr/page.rb
@@ -46,10 +46,10 @@ def code
   end
 
   #
-  # Determines if the response code is +200+.
+  # Determines if the response code is `200`.
   #
   # @return [Boolean]
-  #   Specifies whether the response code is +200+.
+  #   Specifies whether the response code is `200`.
   #
   def is_ok?
     code == 200
@@ -58,10 +58,10 @@ def is_ok?
   alias ok? is_ok?
 
   #
-  # Determines if the response code is +301+ or +307+.
+  # Determines if the response code is `301` or `307`.
   #
   # @return [Boolean]
-  #   Specifies whether the response code is +301+ or +307+.
+  #   Specifies whether the response code is `301` or `307`.
   #
   def is_redirect?
     (code == 301 || code == 307)
@@ -70,30 +70,30 @@ def is_redirect?
   alias redirect? is_redirect?
 
   #
-  # Determines if the response code is +308+.
+  # Determines if the response code is `308`.
   #
   # @return [Boolean]
-  #   Specifies whether the response code is +308+.
+  #   Specifies whether the response code is `308`.
   #
   def timedout?
     code == 308
   end
 
   #
-  # Determines if the response code is +400+.
+  # Determines if the response code is `400`.
   #
   # @return [Boolean]
-  #   Specifies whether the response code is +400+.
+  #   Specifies whether the response code is `400`.
   #
   def bad_request?
     code == 400
   end
 
   #
-  # Determines if the response code is +401+.
+  # Determines if the response code is `401`.
   #
   # @return [Boolean]
-  #   Specifies whether the response code is +401+.
+  #   Specifies whether the response code is `401`.
   #
   def is_unauthorized?
     code == 401
@@ -102,10 +102,10 @@ def is_unauthorized?
   alias unauthorized? is_unauthorized?
 
   #
-  # Determines if the response code is +403+.
+  # Determines if the response code is `403`.
   #
   # @return [Boolean]
-  #   Specifies whether the response code is +403+.
+  #   Specifies whether the response code is `403`.
   #
   def is_forbidden?
     code == 403
@@ -114,10 +114,10 @@ def is_forbidden?
   alias forbidden? is_forbidden?
 
   #
-  # Determines if the response code is +404+.
+  # Determines if the response code is `404`.
   #
   # @return [Boolean]
-  #   Specifies whether the response code is +404+.
+  #   Specifies whether the response code is `404`.
   #
   def is_missing?
     code == 404
@@ -126,10 +126,10 @@ def is_missing?
   alias missing? is_missing?
 
   #
-  # Determines if the response code is +500+.
+  # Determines if the response code is `500`.
   #
   # @return [Boolean]
-  #   Specifies whether the response code is +500+.
+  #   Specifies whether the response code is `500`.
   #
   def had_internal_server_error?
     code == 500
@@ -334,7 +334,7 @@ def body
   #
   # @return [Nokogiri::HTML::Document, Nokogiri::XML::Document, nil]
   #   The document that represents HTML or XML pages.
-  #   Returns +nil+ if the page is neither HTML, XML, RSS, Atom or if
+  #   Returns `nil` if the page is not HTML, XML, RSS, or Atom, or if
   #   the page could not be parsed properly.
   #
   # @see http://nokogiri.rubyforge.org/nokogiri/Nokogiri/XML/Document.html
@@ -382,7 +382,7 @@ def search(*paths)
   # Searches for the first occurrence an XPath or CSS Path expression.
   #
   # @return [Nokogiri::HTML::Node, Nokogiri::XML::Node, nil]
-  #   The first matched node. Returns +nil+ if no nodes could be matched,
+  #   The first matched node. Returns `nil` if no nodes could be matched,
   #   or if the page is not a HTML or XML document.
   #
   # @example
@@ -418,7 +418,7 @@ def title
   #
   # @return [Array]
   #   All links within the HTML page, frame/iframe source URLs and any
-  #   links in the +Location+ header.
+  #   links in the `Location` header.
   #
   def links
     urls = []
@@ -504,7 +504,7 @@ def to_absolute(link)
   protected
 
   #
-  # Provides transparent access to the values in +headers+.
+  # Provides transparent access to the values in `headers`.
   #
   def method_missing(sym,*args,&block)
     if (args.empty? && block.nil?)
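Reviewer note (not part of the patch): the `method_missing` documented in the final hunk above is what lets callers read response headers as if they were methods — the README example earlier in this patch relies on it when it calls `page.server`. A minimal usage sketch, with a placeholder URL:

    require 'spidr'

    # Spider a (placeholder) site and print the Server header of each page.
    # `server` is not a defined method on Spidr::Page; the call falls through
    # to method_missing, which looks the value up in the response headers.
    Spidr.site('http://example.com/') do |spider|
      spider.every_page do |page|
        puts "#{page.url} #{page.server}"
      end
    end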