jnioche released this on Nov 28, 2017 · 133 commits to master since this release

Dependency updates
crawler-commons 0.9 #513
Core
(bugfix) ParserBolts should use outlinks from parsefilters #498
LD_JSON parsefilter #501
okhttp : store request and response headers verbatim in metadata #506
(bugfix) okhttp protocol does not store headers in metadata #507
HTTP clients should handle http.accept.language and http.accept #499
Selenium protocol follows redirections #514
RemoteDriverProtocol needs multiple instances #505
SitemapParserBolt should force mime-type based on the clue #515
Elasticsearch
ES Spout : define filter query via config #502
Upgrade to ES 6.0 #517
We recommend that all users move to this version. If you wish to remain on an older version of Elasticsearch, you can keep your existing version of the stormcrawler elasticsearch module while upgrading stormcrawler core.

This version improves the processing of sitemaps, via #515 and the use of crawler-commons 0.9, in which we fixed the SAX parsing and extended its coverage. We also improved our okhttp-based protocol implementation. If your crawl is a wide one that may encounter any sort of content, you should prefer okhttp over the default httpclient implementation. See our comparison of protocol implementations on the wiki.
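
As a rough sketch, switching a crawl to okhttp could look like the fragment below in the crawler configuration file. The two `*.protocol.implementation` keys and the class name are assumptions based on the project's default configuration, so check them against the wiki for your version. The `http.accept` and `http.accept.language` keys are the ones named in #499; the values shown are purely illustrative.

```yaml
config:
  # Assumed key and class names: select the okhttp-based protocol for both schemes
  http.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.okhttp.HttpProtocol"
  https.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.okhttp.HttpProtocol"
  # Keys from #499; the header values are illustrative, not recommended defaults
  http.accept: "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
  http.accept.language: "en-us,en-gb,en;q=0.7,*;q=0.3"
```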

Finally, if you want to extract semantic data represented in ld-json then you'll love #501.