SitemapParserBolt should force mime-type based on the clue #515

Closed
jnioche opened this Issue Nov 9, 2017 · 0 comments

jnioche commented Nov 9, 2017

http://www.soliant.com/feeds/jobs-sitemap/
returns the following HTTP header:
Content-Type: text/html; charset=utf-8
As a result, the underlying sitemap parser can't handle it properly.

What we can do is run the clue-based detection regardless of whether the doc has been declared as a sitemap and, if it matches, force the mime-type to 'application/xml', since the clue indicates an XML doc for sure.

For this particular URL, not setting the mime-type at all does not work either, as the content lacks the XML declaration <?xml version="1.0" encoding="UTF-8"?> which Tika uses to guess the mime-type.
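
A rough sketch of the idea, not the actual change made in the bolt: sniff the first bytes of the content for the clue and, on a match, override whatever Content-Type the server declared. The class and method names here are hypothetical, and both the clue string (the sitemap protocol namespace) and the 1024-byte window are assumptions for illustration.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration; not the real SitemapParserBolt code. */
final class SitemapClueSketch {

    // Assumption: the clue is the sitemap protocol namespace URI.
    static final String CLUE = "http://www.sitemaps.org/schemas/sitemap/0.9";

    // Assumption: only the beginning of the document is inspected.
    static final int MAX_OFFSET = 1024;

    /** Returns true if the clue appears within the first MAX_OFFSET bytes. */
    static boolean containsClue(byte[] content) {
        int length = Math.min(content.length, MAX_OFFSET);
        // ISO-8859-1 maps every byte to exactly one char, so decoding cannot
        // fail on arbitrary input, and the ASCII clue is still matched as-is.
        String head = new String(content, 0, length, StandardCharsets.ISO_8859_1);
        return head.contains(CLUE);
    }

    /**
     * Picks the mime-type to hand to the sitemap parser: 'application/xml'
     * when the clue is found, otherwise the declared Content-Type unchanged.
     */
    static String resolveMimeType(byte[] content, String declaredContentType) {
        return containsClue(content) ? "application/xml" : declaredContentType;
    }
}
```

With this, a document served as text/html; charset=utf-8 but containing the sitemap namespace would be passed on as application/xml, while a regular HTML page keeps its declared type.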

@jnioche jnioche added this to the 1.7 milestone Nov 9, 2017

@jnioche jnioche closed this in 8e38aa0 Nov 9, 2017
