Performance issues on Pull Content screen when origin site has 100k+ posts #809
Comments
Looked into this and from what I've found, it isn't an issue with how much content the site has but with how many previously pulled items there are. We get all previously pulled and skipped items, sort those from highest ID to lowest ID and then trim that down to 200. These then get passed to our query as I haven't been able to reproduce issues for internal connections but for external connections, because of the amount of data being passed and the performance implications of I'm still thinking through how to fix this but will probably need to remove the This still could have scaling issues though as the number of pulled items increases. Imagine a site with 2,000 pulled items. To properly exclude all of those would require querying for 2,020 items, which could have performance issues of its own. I think we probably need to have an upper limit (500 maybe?) and run multiple queries if we need more than that.
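A minimal sketch of the capped, multi-query idea from the comment above, written in Python purely for illustration and not taken from the plugin. `fetch_batch`, `fetch_new_items`, `MAX_PER_QUERY`, and `PAGE_SIZE` are hypothetical names; the 20-item page size is an assumption inferred from the 2,020 figure.

```python
from typing import Callable, Iterable, List, Set

MAX_PER_QUERY = 500  # the "500 maybe?" upper limit floated above
PAGE_SIZE = 20       # assumed number of items shown per screen


def fetch_new_items(
    fetch_batch: Callable[[int, int], List[int]],
    excluded_ids: Iterable[int],
    wanted: int = PAGE_SIZE,
    batch_size: int = MAX_PER_QUERY,
) -> List[int]:
    """Collect up to `wanted` post IDs that are not in `excluded_ids`.

    `fetch_batch(offset, limit)` is a hypothetical stand-in for one remote
    query; it returns post IDs ordered from highest to lowest.
    """
    excluded: Set[int] = set(excluded_ids)
    found: List[int] = []
    offset = 0
    while len(found) < wanted:
        batch = fetch_batch(offset, batch_size)
        if not batch:
            break  # the remote site has no more posts
        for post_id in batch:
            if post_id not in excluded:
                found.append(post_id)
                if len(found) == wanted:
                    break
        offset += len(batch)  # the next query starts after this batch
    return found
```

With 2,000 excluded items at the top of the result set and a cap of 500 per query, this takes five queries to fill one 20-item page, rather than one query for 2,020 items.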
I lean toward using multiple requests/queries to solve this performance issue. We would fetch a fixed number of posts from the remote site and then filter them; if we need more posts, we make another request.
If we display all posts and then gray out/label the skipped/pulled posts, the problem is solved, but I disagree with that approach. Another approach that popped into my head is creating our own endpoint for querying posts instead of relying on the core REST API. This sounds the most promising to me among these ideas.
@dinhtungdu agreed on a custom REST API endpoint. We will be able to use the POST method and provide as much
I know WP VIP does not recommend using
Another idea is adding a flag on the remote site when a post is pulled/skipped. The post meta name would probably need a trailing remote site identifier (
@dinhtungdu @cadic I'm working on a PR that adds a new endpoint that we can then make a POST request to. This should allow us to pass in all excluded post IDs, which fixes the issue detailed in #808. It also gives us a larger timeout value than what we get when making a GET request (in particular when in a VIP environment). That probably fixes the issue here as well. But we still are making a As already mentioned, this works fine on the first set of results but becomes tricky when trying to support pagination. The only solution I could think of was that whenever we're on a page other than 1, we would need to run the queries for each page before that to determine what our offset should be. Once you get to higher page numbers this results in a decent number of extra queries and could have performance problems itself. Anyway, I just wanted to get feedback from both of you on whether you had a better idea in mind for the query we should be running in this new endpoint. Do we just use
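A rough sketch of the pagination problem described above, in Python for illustration only and not the code in the PR. To find where page N starts in the unfiltered results, it walks every earlier page and counts only the items that are not excluded. `raw_offset_for_page` and `fetch_batch` are hypothetical names, and the page and batch sizes are assumptions.

```python
from typing import Callable, Iterable, List


def raw_offset_for_page(
    fetch_batch: Callable[[int, int], List[int]],
    excluded_ids: Iterable[int],
    page: int,
    page_size: int = 20,
    batch_size: int = 100,
) -> int:
    """Return the unfiltered offset at which 1-based `page` begins.

    `fetch_batch(offset, limit)` is a hypothetical stand-in for one query
    against the origin site, returning post IDs in display order.
    """
    excluded = set(excluded_ids)
    to_skip = (page - 1) * page_size  # visible items shown on earlier pages
    offset = 0
    while to_skip > 0:
        batch = fetch_batch(offset, batch_size)
        if not batch:
            break  # ran out of posts before reaching the requested page
        for post_id in batch:
            offset += 1  # every scanned item advances the raw offset
            if post_id not in excluded:
                to_skip -= 1
                if to_skip == 0:
                    break
    return offset
```

The number of queries grows with the page number and with the share of excluded items, which is the scaling concern raised in the comment.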
@dkotter What do you think about endpoints that return only new or pulled posts for a specific connection? It will solve the pagination issue. But at the same time, we may need meta queries. Edit: I don't think it's possible to use meta queries to query the pulled post, we store all connection details of a post in a single
@dkotter if we choose between Pros: native WP_Query which will return predictable results |
A bit of a drive-by here: IMO avoiding |
Thanks for all the feedback @dinhtungdu and @cadic. I've got a work in progress PR that is ready for review/testing. My current approach is a new custom endpoint that we only use if we are on the This endpoint is then used in a I detailed a few other approaches I thought through and rejected on the PR (some of the same already mentioned here) but I'm happy to discuss any of those in more detail. The one solution I really like is adding some custom meta to the original content when it gets pulled. We can then run a fairly simple NOT EXISTS meta query instead of having to pass in a bunch of post IDs. The problem there is that we aren't currently storing that information, so I'm not sure how to make that backwards compatible. But I would love some review and testing on that PR to ensure I didn't miss or break anything. In my testing, I pulled in ~900 items from an external connection and averaged around 2-3 second response times on the HTTP request. Not sure if that will scale proportionally but it seems totally reasonable to me.
Did some research about Conclusion: using a custom filter will work for repeated queries with persistent post and database caching. The initial non-cached request with
Describe the bug
When two separate WP instances are connected via an External Connection and the origin site selected on the Pull Content screen has over 100,000 posts (in this specific case ~230,000 posts), there are performance issues when trying to filter the New/Pulled/Skipped posts as well as when paginating within those tabbed sections.
Steps to Reproduce
Expected behavior
Pagination and filtering on the Pull Content screen should work regardless of the number of posts on an origin site.
Screenshots
Environment information
Site Health Info:
Additional context
This issue relates to #808, as the New tab view is polluted with previously pulled posts, which makes the New tab render more posts than necessary. Once #808 is resolved, this performance issue may ease a bit, but it is still something we should investigate and handle as best we can for sites with a large amount of content to filter through and pull.