Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ConvertToPlaintext performance enhancements #61

Open
malv007 opened this issue Apr 27, 2024 · 3 comments
Open

ConvertToPlaintext performance enhancements #61

malv007 opened this issue Apr 27, 2024 · 3 comments

Comments

@malv007
Copy link

malv007 commented Apr 27, 2024

The methods ConvertToPlaintext and ConvertToText can do with a few small performance improvements. We found a few HTML pages where those methods would take 10 minutes to run, and after doing some profiling these are the changes we made (the execution went form 10 minutes to milliseconds):

ConvertToText

  1. Use StringBuilder instead of StringWriter, the latter is more resource hungry
  2. Don't return a value (StringBuilder.ToString()), make the function void, keep the same instance of the StringBuilder during the recursion

ConvertToPlaintext

  1. Don't invoke the Remove method on the second StringBuilder, this method is very expensive, instead append character by character to the new StringBuilder if the character meets the conditions you already have
  2. Pre-compile and re-use the regular expressions

Thanks for the library!

@gabriele-tomassetti
Copy link
Member

Thanks for your feedback. I will work on it. I understand the issue, 10 minutes is a lot of time. Could you indicate which pages took so much to convert to text, so we could have a test case?

@malv007
Copy link
Author

malv007 commented Apr 28, 2024

You are welcome! See below the URLs we were having problems with, but the slowdown would only happen when running the requests in parallel inside an ASP.NET app using TPL,for some reason the StringWriter class would use large amounts of memory and cause a thread pool starvation. But if we requested the URLs individually, that problem wouldn't happen. And simply changing from a StringWriter to a StringBuilder solved the main problem, the other changes are minor enhancements.

https://www.heconomia.es/volatil.asp?o=1513041124
https://www.heconomia.es/volatil.asp?o=1625032559
https://www.heconomia.es/volatil.asp?o=-1348189962
https://www.heconomia.es/volatil.asp?o=1799106153
https://www.heconomia.es/volatil.asp?o=1888223065
https://www.heconomia.es/volatil.asp?o=1678194698
https://heconomia.es/volatil.asp?o=1680517915
https://www.heconomia.es/volatil.asp?o=1680517915
https://heconomia.es/volatil.asp?o=1437850236

I hope this helps!

By the way, this is the logic we use to Append to a new StringBuilder instead of removing from the existing one, I think it is equivalent to yours, but take a look:

      var stringBuilder = new StringBuilder();

      while (index < text.Length)
      {
        var c = text[index];
        // carriage return and line feed are not separator characters
        bool isSpace = char.IsSeparator(c);
        bool isNewline = c is '\r' or '\n';
        if (isSpace)
          c = ' ';
        else if (isNewline)
          c = '\n';

        if (!(previousNewline && isSpace) && !(previousSpace && isSpace) && !(isSpace && text[index + 1] is '\r' or '\n'))
          stringBuilder.Append(c);

        index++;

        previousSpace = isSpace;
        previousNewline = isNewline;
      }

@gabriele-tomassetti
Copy link
Member

Thanks for providing the additional context. I implemented your suggestions. I am not sure what is causing the problems, since from what I understand StringWriter is just a StringBuilder with some additional methods.

Your logic is mostly equivalent, but it normalizes newlines to \n, which is not what is happening right now. To be fair, testing the changes, I think that internally AngleSharp (the HTML parsing library we use) does already do that, so that would not be a big deal. However, it is slightly different from our current code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants