Optimisation: faster extraction of META tags #553

Closed
jnioche opened this Issue Mar 27, 2018 · 1 comment

jnioche commented Mar 27, 2018

The utility classes RefreshTag and RobotsTags both use XPath to retrieve META tags. They currently do so with the expression //META, which is inefficient because it searches the entire document. These two methods can take up to 18% of the processing time for JSoupParserBolt and 16% of the overall CPU.

Instead, we can use a more constrained XPath expression that looks only in /HTML/HEAD or /HTML/BODY. The latter is not the recommended location but can be found in the wild.
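As a rough illustration of the difference, here is a minimal, self-contained sketch using the standard javax.xml.xpath API. It is not the project's actual code: the class name MetaXPathSketch, the helper methods, the sample markup, and the exact union expression are all assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical class, for illustration only.
class MetaXPathSketch {

    // Unconstrained form from the issue: every node in the document is a candidate.
    static final String EVERYWHERE = "//META";

    // Constrained form (assumed spelling): only direct children of /HTML/HEAD or /HTML/BODY.
    static final String HEAD_OR_BODY = "/HTML/HEAD/META | /HTML/BODY/META";

    // Builds a DOM from well-formed markup with upper-case element names.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    // Evaluates the expression against the document and returns the number of matching nodes.
    static int countMatches(Document doc, String expression) throws Exception {
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);
        return nodes.getLength();
    }

    public static void main(String[] args) throws Exception {
        String sample = "<HTML><HEAD><META name=\"robots\" content=\"noindex\"/></HEAD>"
                + "<BODY><META http-equiv=\"refresh\" content=\"0; url=/next\"/>"
                + "<DIV><P>text</P></DIV></BODY></HTML>";
        Document doc = parse(sample);
        System.out.println(countMatches(doc, EVERYWHERE));
        System.out.println(countMatches(doc, HEAD_OR_BODY));
    }
}
```

For this sample both expressions match the same two tags. The trade-off is that a META nested deeper in the tree, for example inside a DIV, is found by //META but not by the constrained expression.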

jnioche commented Mar 27, 2018

Profiling after the change doesn't show a significant impact on RobotsTags.extractMetaTags, but RefreshTag takes only half the time it used to. This represents 14% of the processing time for JSoupParserBolt and 11% of the overall CPU.
