ConvertToPlaintext performance enhancements #61
Comments
Thanks for your feedback. I will work on it. I understand the issue; 10 minutes is a long time. Could you indicate which pages took so long to convert to text, so we can have a test case?
You are welcome! See below the URLs we were having problems with. The slowdown would only happen when running the requests in parallel inside an ASP.NET app using TPL; for some reason the StringWriter class would use large amounts of memory and cause thread pool starvation. If we requested the URLs individually, the problem wouldn't happen. Simply changing from a StringWriter to a StringBuilder solved the main problem; the other changes are minor enhancements.

https://www.heconomia.es/volatil.asp?o=1513041124

I hope this helps! By the way, this is the logic we use to Append to a new StringBuilder instead of removing from the existing one. I think it is equivalent to yours, but take a look:
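For readers following along, here is a minimal sketch of the general "append to a new StringBuilder" idea, as opposed to calling Remove on an existing builder (each Remove shifts the remaining characters, so many removals on a large page can become quadratic). This is not the code from the comment above or from the library: the class name `WhitespaceSketch`, the method `CollapseBlankLines`, the limit of two consecutive line breaks, and the choice to emit `\n` are all assumptions made for illustration.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration only; not part of the library.
public static class WhitespaceSketch
{
    // Copies the input into a fresh StringBuilder in a single pass,
    // keeping at most two consecutive line breaks (an assumed rule).
    public static string CollapseBlankLines(string input)
    {
        if (input == null) throw new ArgumentNullException(nameof(input));

        var output = new StringBuilder(input.Length);
        int consecutiveBreaks = 0;

        foreach (char c in input)
        {
            if (c == '\r')
            {
                // Assumption: carriage returns are dropped, so "\r\n" becomes "\n".
                continue;
            }

            if (c == '\n')
            {
                consecutiveBreaks++;
                if (consecutiveBreaks <= 2)
                {
                    output.Append(c);
                }
            }
            else
            {
                consecutiveBreaks = 0;
                output.Append(c);
            }
        }

        return output.ToString();
    }
}
```

Because characters are only ever appended, the work stays linear in the input length, and pre-sizing the builder with `input.Length` avoids repeated buffer growth.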
Thanks for providing the additional context. I implemented your suggestions. I am not sure what is causing the problems, since from what I understand your logic is mostly equivalent, but it normalizes newlines to
The methods ConvertToPlaintext and ConvertToText could do with a few small performance improvements. We found a few HTML pages where those methods would take 10 minutes to run, and after doing some profiling these are the changes we made (the execution time went from 10 minutes to milliseconds):
ConvertToText
ConvertToPlaintext
Thanks for the library!