I have an application that scrapes an entire website, and it takes about 12 hours to run. It uses WebClient.DownloadString to get the HTML, then uses HtmlParser.ParseDocument to parse it. Then I do a lot of other parsing on top of that. This happens for about 293,000 pages, so I'm trying to save any little bit of time that I can.
I've noticed that I've got a lot of places where I call IHtmlDocument.GetElementsByTagName(TAG).FirstOrDefault(QUERY_SELECTOR). I believe I could collapse this into some sort of IHtmlDocument.QuerySelector(QUERY_SELECTOR), which in theory would save time by returning after the first match, but some preliminary testing has shown QuerySelector to be slower than the old method. For instance:
IElement element = doc.GetElementsByTagName("h2").FirstOrDefault(x => x.TextContent.Contains("Climbing Directory"));
takes about 1 ms, whereas
IElement element = doc.QuerySelector("h2:contains('Climbing Directory')");
takes about 23 ms.
Any suggestions for improving my code?
This could certainly be improved at the QuerySelector level. I'm not sure whether :contains is the villain here, or whether the overall performance of QuerySelector is to blame...
I could try parallelizing more (in the Parsers file I linked above).
Since I originally posted this, the greatest improvement in speed has come from moving to .NET Core from .NET Framework. That pretty much cut the entire time in half!
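For the parallelizing idea mentioned above, here is a minimal sketch using only the base class library. It assumes the work for each page is independent of every other page. `ParsePage`, `ParseAll`, and the `pages` dictionary are hypothetical names for illustration and are not taken from the project; `ParsePage` is a placeholder for the real per-page parsing.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class ParallelParseSketch
{
    // Hypothetical stand-in for the real per-page work (parse the HTML, extract fields).
    // Here it only returns the length of the markup so the sketch stays self-contained.
    public static int ParsePage(string html) => html.Length;

    // Processes every page in parallel and returns one result per page key.
    // Assumption: pages maps a page identifier (e.g. its URL) to already-downloaded HTML.
    public static Dictionary<string, int> ParseAll(
        IReadOnlyDictionary<string, string> pages, int maxDegreeOfParallelism)
    {
        // Thread-safe collection, since several iterations write results concurrently.
        var results = new ConcurrentDictionary<string, int>();
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxDegreeOfParallelism };

        Parallel.ForEach(pages, options, page =>
        {
            results[page.Key] = ParsePage(page.Value);
        });

        return new Dictionary<string, int>(results);
    }
}
```

Whether this helps depends on where the 12 hours actually go: if most of the time is spent waiting on downloads rather than parsing, parallelizing the CPU-bound part alone may not change much.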