Most efficient way to get matching element? #929

derekantrican · 2021-01-13T01:48:31Z

I have an application that scrapes an entire website, and a full run takes about 12 hours. It uses WebClient.DownloadString to get the HTML, then HtmlParser.ParseDocument to parse it, and then I do a lot of other parsing on top of that. This happens for about 293,000 pages, so I'm trying to save any little bit of time that I can.

I've noticed that I've got a lot of places where I call IHtmlDocument.GetElementsByTagName(TAG).FirstOrDefault(PREDICATE). I believe I could collapse each of these into a single IHtmlDocument.QuerySelector(QUERY_SELECTOR) call, which in theory would save time by returning after the first match, but some preliminary testing has shown QuerySelector to be slower than the old method. For instance:

IElement element = doc.GetElementsByTagName("h2").FirstOrDefault(x => x.TextContent.Contains("Climbing Directory"));

takes about 1 ms, where

IElement element = doc.QuerySelector("h2:contains('Climbing Directory')");

takes about 23 ms.

Any suggestions for improving my code?
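A single timed call can be skewed by JIT compilation or other one-time setup, so one way to check whether the 1 ms vs 23 ms gap holds up is to time both lookups over many iterations after a warm-up call. The following is a minimal sketch, assuming the AngleSharp NuGet package is referenced; the sample HTML, the iteration count, and the TimeLookup helper are made up for illustration and are not part of AngleSharp.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using AngleSharp.Dom;
using AngleSharp.Html.Dom;
using AngleSharp.Html.Parser;

public static class LookupTiming
{
    // Hypothetical helper: runs the lookup once as a warm-up, then returns
    // the average time per call in milliseconds over the given iterations.
    public static double TimeLookup(Func<IElement> lookup, int iterations)
    {
        lookup(); // warm-up so JIT and one-time setup are not measured

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            lookup();
        }
        sw.Stop();

        return sw.Elapsed.TotalMilliseconds / iterations;
    }

    public static void Main()
    {
        // Made-up sample page; real pages will be much larger.
        const string html =
            "<html><body><h2>Intro</h2><h2>Climbing Directory</h2><p>text</p></body></html>";
        IHtmlDocument doc = new HtmlParser().ParseDocument(html);

        // Tag lookup followed by a LINQ predicate on the text content.
        double linqMs = TimeLookup(
            () => doc.GetElementsByTagName("h2")
                     .FirstOrDefault(x => x.TextContent.Contains("Climbing Directory")),
            1000);

        // The same lookup expressed as a CSS selector with :contains.
        double selectorMs = TimeLookup(
            () => doc.QuerySelector("h2:contains('Climbing Directory')"),
            1000);

        Console.WriteLine($"GetElementsByTagName + FirstOrDefault: {linqMs:F4} ms per call");
        Console.WriteLine($"QuerySelector with :contains:          {selectorMs:F4} ms per call");
    }
}
```

Running this against a few real downloaded pages rather than the toy HTML should give a more representative comparison.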

derekantrican · 2021-01-13T01:49:22Z

All my parsing code is here if you have any tips for improving efficiency: https://github.com/derekantrican/MountainProject/blob/master/MountainProjectAPI/Functions/Parsers.cs

derekantrican · 2021-01-13T01:58:32Z

Of course, with 12 hours for 293,000 items, maybe an average of 147 ms per item is about as good as it can get.
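The 147 ms figure is the total run time divided by the page count. Since the run described above also downloads each page, it presumably covers download time as well as parsing. A small sketch of the arithmetic, where the class and method names are only illustrative:

```csharp
using System;
using System.Globalization;

public static class PerPageBudget
{
    // Average wall-clock time per page, in milliseconds.
    public static double AverageMsPerPage(TimeSpan total, int pages) =>
        total.TotalMilliseconds / pages;

    public static void Main()
    {
        // 12 hours = 43,200,000 ms, spread over 293,000 pages.
        double ms = AverageMsPerPage(TimeSpan.FromHours(12), 293_000);
        Console.WriteLine(ms.ToString("F1", CultureInfo.InvariantCulture) + " ms per page");
        // prints "147.4 ms per page"
    }
}
```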

FlorianRappl · 2021-01-14T19:09:03Z

I'm afraid I don't have a good answer (#584).

This certainly could be improved at the QuerySelector level. I'm not sure whether :contains is the villain here or whether the overall performance of QuerySelector is to blame...

santoro-mariano · 2022-11-12T16:45:06Z

@derekantrican I know it will not improve AngleSharp's performance, but have you tried parallelizing some of those foreach loops with Parallel.ForEach?
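As a rough illustration of that suggestion, here is a minimal sketch of Parallel.ForEach over a set of pages. ParsePage is a hypothetical stand-in for the real per-page work, and the class and method names are invented for this example. Each iteration has to be independent of the others, and results go into a thread-safe collection because several iterations may write at the same time.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class ParallelParsing
{
    // Hypothetical stand-in for the real per-page work (here: just the length).
    public static int ParsePage(string html) => html.Length;

    // Processes all pages in parallel and returns one result per page key.
    public static Dictionary<string, int> ParseAll(
        IEnumerable<KeyValuePair<string, string>> pages)
    {
        // Thread-safe store, since iterations can write concurrently.
        var results = new ConcurrentDictionary<string, int>();

        // Cap the parallelism at the number of logical processors (an assumption;
        // tune this for workloads that spend most of their time waiting on I/O).
        var options = new ParallelOptions
        {
            MaxDegreeOfParallelism = Environment.ProcessorCount
        };

        Parallel.ForEach(pages, options, page =>
        {
            results[page.Key] = ParsePage(page.Value);
        });

        return results.ToDictionary(kv => kv.Key, kv => kv.Value);
    }

    public static void Main()
    {
        // Made-up input: page URL mapped to already-downloaded HTML.
        var pages = new Dictionary<string, string>
        {
            ["page-a"] = "<h2>Climbing Directory</h2>",
            ["page-b"] = "<h2>Intro</h2>",
        };

        Dictionary<string, int> results = ParseAll(pages);
        Console.WriteLine($"Parsed {results.Count} pages");
    }
}
```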

derekantrican · 2022-11-13T03:18:25Z

@santoro-mariano Yup. In the repo I linked earlier, that's used here: https://github.com/derekantrican/MountainProject/blob/master/MountainProjectDBBuilder/Program.cs#L286

I could try parallelizing more (in the Parsers file I linked above).

Since I originally posted this, the greatest improvement in speed has come from moving to .NET Core from .NET Framework. That pretty much cut the entire time in half!

FlorianRappl added css help-wanted performance labels Jan 14, 2021

FlorianRappl added this to the vNext milestone Jan 7, 2023
