v0.3.0

@Sobak Sobak released this 19 May 07:36
· 1 commit to develop since this release

Added

  • Added robots.txt parser block
  • Added support for resolving relative URLs
  • Added CSV file result writer
  • Added ArrayUrlListProvider
  • Added SameDomainUrlListProvider
  • Added support for writing results into SQLite databases
  • Added literal client configuration provider
  • Added display of the total time once the operation is done
  • Added ability to set maximum number of crawled URLs
  • Added a warning when an empty response is encountered
  • Added ability to mark objects as fetched only once per operation
  • Added ability to pass encoding options to JSON file result writer
  • Added support for relative URLs in ArgumentAdvancerUrlListProvider's template
  • Added TEST_SERVER_WAIT environment variable to change default wait time for
    the server used in integration tests
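
To illustrate what resolving relative URLs involves, here is a minimal sketch in Python using only the standard library. It is not Scrawler's implementation or API (Scrawler is a PHP library); the function name `resolve_link` and the example URLs are assumptions made for illustration.

```python
from urllib.parse import urljoin


def resolve_link(page_url: str, href: str) -> str:
    """Resolve a link found on a page against that page's own URL.

    Hypothetical helper for illustration only; it follows the standard
    reference-resolution rules implemented by urllib.parse.urljoin.
    """
    return urljoin(page_url, href)


page = "https://example.com/blog/post-1"

# A sibling path replaces the last path segment of the page URL.
print(resolve_link(page, "page-2"))

# A root-relative path keeps only the scheme and host.
print(resolve_link(page, "/contact"))

# ".." climbs one directory up from /blog/.
print(resolve_link(page, "../about"))

# An already absolute URL is returned unchanged.
print(resolve_link(page, "https://other.example/x"))
```

A crawler needs this step because links in a page are usually written relative to the page they appear on, so they cannot be queued for fetching until they have been turned into absolute URLs.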

Changed

  • CssSelectorTextMatcher and XpathSelectorTextMatcher have been renamed to
    CssSelectorHtmlMatcher and XpathSelectorHtmlMatcher respectively and now
    return the original HTML content instead of its textual form, making them
    consistent with other matchers such as the regular expression matcher. To
    retain the previous behavior, strip the tags further down the line (e.g.
    in entities)
  • RegexTextMatcher has been renamed to RegexHtmlMatcher
  • The underlying Guzzle instance now always depends on cURL. This ensures
    that the widest set of features is available for handling HTTP requests.
  • Scrawler will now explicitly emit a warning for content types other than XML or (X)HTML
  • DefaultConfigurationProvider sets timeouts for Guzzle now
  • JSON_UNESCAPED_UNICODE option is now used by default when using JSON file
    result writer
  • The simple_annotations option for the database result writer is now false
    by default. Previously it had to be specified explicitly.
  • Only the HTTP and HTTPS protocols are now allowed; URLs with other
    protocols are silently ignored
  • Improved performance of CSS selector matchers
  • Improved handling of networking errors
  • Improved logs readability
  • Changed default log verbosity for both the console and the text file to INFO level
  • PHP_SERVER_PORT environment variable used to set the port of the webserver
    used to run integration tests has been renamed to TEST_SERVER_PORT
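
The matcher rename above means that selector matchers now hand back markup rather than text. The sketch below shows, in Python with the standard library only, what "stripping the tags further down the line" can look like. It is an illustration of the idea, not Scrawler code: the names `strip_tags` and `_TextCollector` are hypothetical, and in a PHP project the built-in `strip_tags()` function would be the more natural tool.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects only the text nodes of an HTML fragment (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; in text nodes.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html: str) -> str:
    """Return the text content of an HTML fragment with all tags removed."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()  # flush any text still buffered by the parser
    return "".join(collector.parts).strip()


# What an HTML matcher would now return, versus the old textual form.
matched_html = "<p>Hello <b>world</b> &amp; co.</p>"
print(strip_tags(matched_html))
```

Keeping the raw HTML in the matcher and converting to text later leaves the choice with the caller, which is what makes the selector matchers consistent with the regular expression matcher.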

Fixed

  • Fixed incorrect detection of visited URLs, which resulted in some addresses
    being processed multiple times
  • Fixed incorrect whitespace trimming in text matchers

Removed

  • Removed InMemoryResultWriter from the blocks list; it is now only available
    during development for running tests