Optimisation: faster extraction of META tags #553

jnioche · 2018-03-27T12:12:29Z

The utility classes RefreshTag and RobotsTags both use XPath to retrieve META tags. They currently do so by evaluating //META, which is inefficient because it searches the entire document. These two methods can take up to 18% of the processing time for JSoupParserBolt and 16% of the overall CPU.

Instead, we can use a more constrained XPath expression that looks only under /HTML/HEAD or /HTML/BODY. The latter is not the recommended location for META tags but can be found in the wild.
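To illustrate the difference, here is a minimal sketch using the JDK's built-in XPath API. The class name, the sample document and the exact spelling of the constrained expression are assumptions made for illustration; they are not taken from the actual commit. It only compares which nodes each expression matches, not how long each takes.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical sketch, not the StormCrawler implementation.
final class MetaXPathSketch {

    // Broad expression: has to consider every node in the document.
    static final String BROAD = "//META";

    // Constrained expression (assumed spelling): only META elements that are
    // direct children of /HTML/HEAD or /HTML/BODY.
    static final String CONSTRAINED = "/HTML/HEAD/META|/HTML/BODY/META";

    // Builds a small DOM with one META in HEAD, one directly in BODY,
    // and one nested inside a DIV.
    static Document sampleDocument() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element html = doc.createElement("HTML");
        Element head = doc.createElement("HEAD");
        Element body = doc.createElement("BODY");
        Element div = doc.createElement("DIV");
        doc.appendChild(html);
        html.appendChild(head);
        html.appendChild(body);
        body.appendChild(div);
        head.appendChild(meta(doc, "robots", "noindex"));
        body.appendChild(meta(doc, "robots", "nofollow"));
        div.appendChild(meta(doc, "robots", "none"));
        return doc;
    }

    static Element meta(Document doc, String name, String content) {
        Element m = doc.createElement("META");
        m.setAttribute("name", name);
        m.setAttribute("content", content);
        return m;
    }

    // Compiles the expression and returns how many nodes it selects.
    static int count(String expression, Document doc) throws Exception {
        XPathExpression compiled =
                XPathFactory.newInstance().newXPath().compile(expression);
        NodeList nodes = (NodeList) compiled.evaluate(doc, XPathConstants.NODESET);
        return nodes.getLength();
    }

    public static void main(String[] args) throws Exception {
        Document doc = sampleDocument();
        System.out.println("broad: " + count(BROAD, doc));
        System.out.println("constrained: " + count(CONSTRAINED, doc));
    }
}
```

The trade-off is visible in the sample document: the constrained expression does not match the META nested inside the DIV, so tags placed deeper in the tree are skipped in exchange for a much narrower search.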

jnioche · 2018-03-27T12:36:00Z

Profiling after the change doesn't show a significant impact on RobotsTags.extractMetaTags, but RefreshTag takes only half the time it used to. The two methods now account for 14% of the processing time for JSoupParserBolt and 11% of the overall CPU, down from 18% and 16%.

jnioche added the enhancement, parser and core labels Mar 27, 2018

jnioche added this to the 1.9 milestone Mar 27, 2018

jnioche added a commit that referenced this issue Mar 27, 2018

Optimisation: faster extraction of META tags; implements #553

ce91346

jnioche closed this as completed Mar 27, 2018

jnioche mentioned this issue Apr 11, 2018

JSOUPParserBolt: lazy DOM conversion #562

Closed
