- Require Web Spider Obstacle Course (WSOC) >= 0.1.1.
- Integrated the new WSOC into the specs.
- Removed the built-in Web Spider Obstacle Course.
- Added {Spidr::Page#content_types}.
- Added {Spidr::Page#cookie}.
- Added {Spidr::Page#cookies}.
- Added {Spidr::Page#cookie_params}.
- Added {Spidr::Sanitizers}.
- Added {Spidr::SessionCache}.
- Added {Spidr::CookieJar} (thanks Nick Plante).
- Added {Spidr::AuthStore} (thanks Nick Plante).
- Added {Spidr::Agent#post_page} (thanks Nick Plante).
- Renamed Spidr::Agent#get_session to {Spidr::SessionCache#[]}.
- Renamed Spidr::Agent#kill_session to {Spidr::SessionCache#kill!}.
- Added {Spidr::Events#every_ok_page}.
- Added {Spidr::Events#every_redirect_page}.
- Added {Spidr::Events#every_timedout_page}.
- Added {Spidr::Events#every_bad_request_page}.
- Added {Spidr::Events#every_unauthorized_page}.
- Added {Spidr::Events#every_forbidden_page}.
- Added {Spidr::Events#every_missing_page}.
- Added {Spidr::Events#every_internal_server_error_page}.
- Added {Spidr::Events#every_txt_page}.
- Added {Spidr::Events#every_html_page}.
- Added {Spidr::Events#every_xml_page}.
- Added {Spidr::Events#every_xsl_page}.
- Added {Spidr::Events#every_doc}.
- Added {Spidr::Events#every_html_doc}.
- Added {Spidr::Events#every_xml_doc}.
- Added {Spidr::Events#every_xsl_doc}.
- Added {Spidr::Events#every_rss_doc}.
- Added {Spidr::Events#every_atom_doc}.
- Added {Spidr::Events#every_javascript_page}.
- Added {Spidr::Events#every_css_page}.
- Added {Spidr::Events#every_rss_page}.
- Added {Spidr::Events#every_atom_page}.
- Added {Spidr::Events#every_ms_word_page}.
- Added {Spidr::Events#every_pdf_page}.
- Added {Spidr::Events#every_zip_page}.
- Fixed a bug where {Spidr::Agent#delay} was not being used to delay requesting pages.
- Spider `link` and `script` tags in HTML pages (thanks Nick Plante).
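
The session cache entries above ({Spidr::SessionCache#[]} and {Spidr::SessionCache#kill!}) describe keeping one HTTP session per host and discarding it on request. A minimal sketch of that idea using only the standard library follows. The class name `TinySessionCache` and its `size` helper are hypothetical and are not Spidr's implementation:

```ruby
require 'net/http'
require 'uri'

# Hypothetical stand-in for a per-host session cache; not Spidr's actual class.
class TinySessionCache
  def initialize
    @sessions = {}
  end

  # Returns the cached Net::HTTP object for the URL's scheme, host and port,
  # creating (but not yet connecting) one on first use.
  def [](url)
    uri = URI(url.to_s)
    key = [uri.scheme, uri.host, uri.port]

    @sessions[key] ||= begin
      http = Net::HTTP.new(uri.host, uri.port)
      http.use_ssl = (uri.scheme == 'https')
      http
    end
  end

  # Closes the session for the URL's host (if it was opened) and forgets it.
  def kill!(url)
    uri = URI(url.to_s)
    session = @sessions.delete([uri.scheme, uri.host, uri.port])
    session.finish if session && session.started?
    self
  end

  # Number of sessions currently cached.
  def size
    @sessions.size
  end
end

cache = TinySessionCache.new
a = cache['http://example.com/one']
b = cache['http://example.com/two']
puts a.equal?(b) # both URLs share a host, so one session object is reused
```

Keying on scheme, host and port means two paths on the same site share a connection object, while `http` and `https` variants of a host stay separate.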
- Added {URI.expand_path}.
- Added {Spidr::Page#search}.
- Added {Spidr::Page#at}.
- Added {Spidr::Page#title}.
- Added {Spidr::Agent#failures=}.
- Added an HTTP session cache to {Spidr::Agent}, per suggestion of falter.
- Added Spidr::Agent#get_session.
- Added Spidr::Agent#kill_session.
- Added {Spidr.proxy=}.
- Added {Spidr.disable_proxy!}.
- Aliased Spidr::Page#txt? to {Spidr::Page#plain_text?}.
- Aliased Spidr::Page#ok? to {Spidr::Page#is_ok?}.
- Aliased Spidr::Page#redirect? to {Spidr::Page#is_redirect?}.
- Aliased Spidr::Page#unauthorized? to {Spidr::Page#is_unauthorized?}.
- Aliased Spidr::Page#forbidden? to {Spidr::Page#is_forbidden?}.
- Aliased Spidr::Page#missing? to {Spidr::Page#is_missing?}.
- Split URL filtering code out of {Spidr::Agent} and into {Spidr::Filters}.
- Split URL / Page event code out of {Spidr::Agent} and into {Spidr::Events}.
- Split pause! / continue! / skip_link! / skip_page! methods out of {Spidr::Agent} and into {Spidr::Actions}.
- Fixed a bug in {Spidr::Page#code}, where it was not returning an Integer.
- Make sure {Spidr::Page#doc} returns Nokogiri::XML::Document objects for RSS/RDF/Atom pages as well.
- Fixed the handling of the Location header in {Spidr::Page#links} (thanks falter).
- Fixed a bug in {Spidr::Page#to_absolute} where trailing '/' characters on URI paths were not being preserved (thanks falter).
- Fixed a bug where the URI query was not being sent with the request in {Spidr::Agent#get_page} (thanks Damian Steer).
- Fixed a bug where SSL sessions were not being properly set up (thanks falter).
- Switched {Spidr::Agent#history} to be a Set, to improve the lookup time of the history (thanks falter).
- Switched {Spidr::Agent#failures} to a Set.
- Allow a block to be passed to {Spidr::Agent#run}, which will receive all pages visited.
- Allow Spidr::Agent#start_at and Spidr::Agent#continue! to pass blocks to {Spidr::Agent#run}.
- Made {Spidr::Agent#visit_page} public.
- Moved to YARD based documentation.
- Upgraded to Hoe 2.0.0.
- Use Hoe.spec instead of Hoe.new.
- Use the Hoe signing task for signed gems.
- Added the Spidr::Agent#schemes and Spidr::Agent#schemes= methods.
- Added a warning message if 'net/https' cannot be loaded.
- Allow the list of acceptable URL schemes to be passed into {Spidr::Agent#initialize}.
- Allow history and queue information to be passed into {Spidr::Agent#initialize}.
- {Spidr::Agent#start_at} no longer clears the history or the queue.
- Fixed a bug in the sanitization of semi-escaped URLs.
- Fixed a bug where https URLs would be followed even if 'net/https' could not be loaded.
- Removed Spidr::Agent::SCHEMES.
- Added the Spidr::Agent#pause! and Spidr::Agent#continue! methods.
- Added the Spidr::Agent#running? and Spidr::Agent#paused? methods.
- Added an alias for pending_urls to the queue methods.
- Added {Spidr::Agent#queue} to provide read access to the queue.
- Added {Spidr::Agent#queue=} and {Spidr::Agent#history=} for setting the queue and history.
- Added {Spidr::Agent#to_hash}, which returns a Hash of the agent's queue and history.
- Made {Spidr::Agent#enqueue} and {Spidr::Agent#queued?} public.
- Added more specs.
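
The queue, history, pause and `to_hash` entries above fit together as a small state machine. The sketch below is a hypothetical `MiniAgent` that shows one way such bookkeeping could work. It performs no HTTP requests and is not Spidr's code:

```ruby
require 'set'

# Hypothetical miniature of an agent's bookkeeping; not Spidr's implementation.
class MiniAgent
  attr_reader :queue, :history

  def initialize(queue: [], history: [])
    @queue   = queue.dup
    @history = Set.new(history)
    @paused  = false
  end

  def queued?(url)
    @queue.include?(url)
  end

  def visited?(url)
    @history.include?(url)
  end

  # Adds a URL to the queue unless it is already queued or visited.
  def enqueue(url)
    return false if queued?(url) || visited?(url)

    @queue << url
    true
  end

  def pause!
    @paused = true
  end

  def paused?
    @paused
  end

  # Resumes work; each dequeued URL is recorded and yielded to the block.
  def continue!
    @paused = false
    until @queue.empty? || @paused
      url = @queue.shift
      @history << url
      yield url if block_given?
    end
    self
  end

  # Snapshot of the state, suitable for saving and restoring later.
  def to_hash
    { history: @history.to_a, queue: @queue.dup }
  end
end

agent = MiniAgent.new(queue: ['http://example.com/'])
agent.continue! do |url|
  agent.enqueue('http://example.com/about') # a link "discovered" on the page
  agent.pause! if url.end_with?('/about')
end
p agent.to_hash
```

Passing the Hash from `to_hash` back into the constructor restores the same queue and history, which is the use case the `queue=` and `history=` entries suggest.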
- Added Spidr::Agent#all_headers.
- Fixed a bug where Page#headers was always `nil`.
- {Spidr::Agent} will now follow the Location header in HTTP 300, 301, 302, 303 and 307 Redirects.
- {Spidr::Agent} will now follow iframe and frame tags.
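
As an illustration of the redirect entry above, the following sketch resolves a Location header against the requesting URL for the listed status codes. The function name `redirect_target` and the plain Hash of headers are assumptions made for the example, not Spidr's API:

```ruby
require 'uri'

# Status codes named in the changelog entry.
REDIRECT_CODES = [300, 301, 302, 303, 307].freeze

# Returns the absolute redirect target as a URI, or nil when the response
# is not a redirect or carries no usable Location header.
# `headers` is a plain Hash, a hypothetical stand-in for a real response.
def redirect_target(base_url, code, headers)
  return nil unless REDIRECT_CODES.include?(code.to_i)

  location = headers['location'] || headers['Location']
  return nil if location.nil? || location.empty?

  # Location may be relative, so resolve it against the requested URL.
  URI.join(base_url, location)
rescue URI::InvalidURIError
  nil
end

puts redirect_target('http://example.com/old/page', 302, 'Location' => '../new')
```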
- Added {Spidr::Agent#failures}, a list of URLs which could not be visited.
- Added {Spidr::Agent#failed?}.
- Added Spidr::Agent#every_failed_url.
- Added {Spidr::Agent#clear}, which clears the history and failures URL lists.
- Improved fault tolerance in {Spidr::Agent#get_page}.
- If a Network or HTTP error is encountered, the URL will be added to the failures list and the next URL will be visited.
- Fixed a typo in Spidr::Agent#ignore_exts_like.
- Updated the Web Spider Obstacle Course with links that always fail to be visited.
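
The fault-tolerance entries above describe recording a failed URL and moving on to the next one. A minimal sketch of that loop follows. The `crawl` function and the injected `fetch` callable are hypothetical and stand in for the real HTTP request:

```ruby
require 'set'

# Hypothetical crawl loop. `fetch` is any callable that returns a page or
# raises; it stands in for the real HTTP request.
def crawl(urls, fetch)
  visited  = Set.new
  failures = Set.new

  urls.each do |url|
    begin
      fetch.call(url)
      visited << url
    rescue StandardError
      # Record the failure and carry on with the next URL.
      failures << url
    end
  end

  { visited: visited, failures: failures }
end

# A fake fetcher that fails for any URL containing "broken".
fake_fetch = lambda do |url|
  raise IOError, 'connection reset' if url.include?('broken')

  "<html>#{url}</html>"
end

urls = %w[http://example.com/ http://example.com/broken http://example.com/ok]
result = crawl(urls, fake_fetch)
p result[:failures].to_a
```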
- Catch malformed URIs in {Spidr::Page#to_absolute} and return `nil`.
- Filter out `nil` URIs in {Spidr::Page#urls}.
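
The two entries above pair naturally: resolution returns `nil` for a malformed link, and the caller drops the `nil` values. A sketch using the standard `uri` library follows. The top-level functions `to_absolute` and `absolute_urls` are illustrative stand-ins and are not the methods on {Spidr::Page}:

```ruby
require 'uri'

# Resolves a link against the page URL; returns nil for malformed input.
def to_absolute(page_url, link)
  URI.join(page_url, link)
rescue URI::InvalidURIError
  nil
end

# Maps every link to an absolute URI and drops the ones that failed.
def absolute_urls(page_url, links)
  links.map { |link| to_absolute(page_url, link) }.compact
end

links = ['/about', 'contact.html', 'http://exa mple.com/bad link']
puts absolute_urls('http://example.com/dir/index.html', links)
```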
- Use Nokogiri for HTML and XML parsing.
- Added the :host option to {Spidr::Agent#initialize}.
- Added the Web Spider Obstacle Course files to the Manifest.
- Aliased {Spidr::Agent#visited_urls} to {Spidr::Agent#history}.
- Fixed a bug in {Spidr::Page#to_absolute} where URLs with no path were not receiving a default path of `/`.
- Fixed a bug in {Spidr::Page#to_absolute} where URL paths were not being expanded, in order to remove `..` and `.` directories.
- Fixed a bug where absolute URLs could have a blank path, thus causing {Spidr::Agent#get_page} to crash when it performed the HTTP request.
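
The path fixes above (a default path of `/`, and removal of `..` and `.` directories) can be sketched as a single normalization step. The function `expand_url_path` is hypothetical, assumes absolute paths, and is not the code used by {Spidr::Page#to_absolute}:

```ruby
# Collapses "." and ".." segments in a URL path, keeps a trailing slash,
# and falls back to "/" when the path is empty.
def expand_url_path(path)
  return '/' if path.nil? || path.empty?

  segments = []
  path.split('/').each do |segment|
    case segment
    when '', '.'
      next                 # skip empty and current-directory segments
    when '..'
      segments.pop         # step up one level, never above the root
    else
      segments << segment
    end
  end

  expanded = '/' + segments.join('/')
  # A path that ended in "/", "/." or "/.." still names a directory.
  trailing = path.end_with?('/', '/.', '/..') && !segments.empty?
  trailing ? expanded + '/' : expanded
end

puts expand_url_path('/a/b/../c/./d/')
```

Returning `/` for an empty path avoids sending a request with a blank path, which is the crash described in the last entry above.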
- Added RSpec spec tests.
- Created a Web-Spider Obstacle Course (http://spidr.rubyforge.org/course/start.html) which is used in the spec tests.
- Added a reader method for the response instance variable in Page.
- Fixed a bug in {Spidr::Page#method_missing}.
- Initial release.
  - Black-list or white-list URLs based upon:
    - Host name
    - Port number
    - Full link
    - URL extension
  - Provides call-backs for:
    - Every visited Page.
    - Every visited URL.
    - Every visited URL that matches a specified pattern.