
Performance issues on Pull Content screen when origin site has 100k+ posts #809

Closed
jeffpaul opened this issue Oct 13, 2021 · 11 comments · Fixed by #811
@jeffpaul (Member)
Describe the bug
When two separate WP instances are connected via an External Connection and the origin site selected on the Pull Content screen has over 100,000 posts (in this specific case ~230,000), there are performance issues when filtering the New/Pulled/Skipped posts as well as when paginating within those tabbed sections.


Expected behavior
Pagination and filtering on the Pull Content screen should work regardless of the number of posts on an origin site.

Screenshots

[Screenshot attached: Screen Shot 2021-10-13 at 9:39:12 AM]


Additional context
This issue relates to #808: the New tab view is polluted with previously pulled posts, which makes the New tab render more posts than necessary. Once #808 is resolved, this performance issue may ease somewhat, but it is still something we should investigate and handle as best we can for sites with a large amount of content to filter through and pull.

@jeffpaul jeffpaul added the type:bug Something isn't working. label Oct 13, 2021
@jeffpaul jeffpaul modified the milestones: 1.7.0, 1.6.7 Oct 13, 2021
@dkotter (Collaborator) commented Oct 13, 2021

Looked into this, and from what I've found, it isn't an issue with how much content the site has, but with how many previously pulled items there are.

We get all previously pulled and skipped items, sort those from highest ID to lowest, and then trim that list down to 200. These IDs are then passed to our query as the post__not_in argument. For internal connections, this is used in a normal WP_Query; for external connections, it gets passed along with the remote request.
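For the internal-connection case, the flow described above amounts to something like the following sketch (variable names are illustrative, not the plugin's actual code):

```php
// Sketch of the exclusion flow described above (illustrative names).
// $pulled_ids and $skipped_ids are assumed to hold previously
// pulled/skipped post IDs.
$exclude = array_merge( $pulled_ids, $skipped_ids );
rsort( $exclude );                          // highest ID first
$exclude = array_slice( $exclude, 0, 200 ); // trimmed to 200 entries

$query = new WP_Query( array(
	'post_type'      => 'post',
	'posts_per_page' => 20,
	'post__not_in'   => $exclude, // the argument discussed in this thread
) );
```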

I haven't been able to reproduce issues for internal connections, but for external connections, because of the amount of data being passed and the performance implications of post__not_in, you can potentially run into timeouts, which would result in the error shown. This is much more likely to happen in WordPress.com VIP environments, as we use a different fetch function there that only has a 3-second timeout. For other sites, we have a 45-second timeout, which should be fine.

I'm still thinking through how to fix this, but we will probably need to remove the post__not_in argument and instead remove those items after the query runs. For instance, if we need to display 20 items and we have 100 previously pulled items, we can increase our query to fetch 120 items. After the query runs, we remove any of those 100 previously pulled items and finally trim the array down to just 20. This gives us 20 items to display without any of the previously pulled (or skipped) items.

This could still have scaling issues as the number of pulled items increases, though. Imagine a site with 2,000 pulled items: properly excluding them all would require querying for 2,020 items, which could have performance issues of its own. I think we probably need an upper limit (500, maybe?) and run multiple queries if we need more than that.
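The over-fetch-and-filter idea above can be sketched roughly as follows; this is a minimal illustration, not the actual PR code, and `get_previously_pulled_ids()` is a hypothetical helper:

```php
// Sketch of the over-fetch-and-filter approach (illustrative only).
$needed   = 20;                              // items to display
$excluded = get_previously_pulled_ids();     // hypothetical helper

$query = new WP_Query( array(
	'post_type'      => 'post',
	// Over-fetch so that, after filtering, enough items remain.
	'posts_per_page' => $needed + count( $excluded ),
	'fields'         => 'ids',
) );

// Drop any previously pulled/skipped IDs, then trim to the page size.
$ids = array_diff( $query->posts, $excluded );
$ids = array_slice( array_values( $ids ), 0, $needed );
```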

@dinhtungdu (Contributor) commented Oct 14, 2021

> This still could have scaling issues though as the amount of pulled items increases. Imagine a site with 2,000 pulled items. To properly exclude those all would require querying for 2,020 items, which could have performance issues of its own. I think we probably need to have an upper limit (500 maybe?) and run multiple queries if we need more than that.

I lean toward using multiple requests/queries to solve this performance issue: we get a fixed number of posts from the remote site, filter them, and if we need more posts, make another request.
But with this approach, we have an issue with the pagination of the list table:

  1. We display 100 posts per page and fetch 200 posts per request.
  2. Of those 200 posts from the first request, 150 were already pulled/skipped.
  3. We need 50 more posts, so we make another request for 200 more. Now the pull list table is on page 1, but the WP_Query page is 2.
  4. Assume 100 of the 200 posts from step 3 are skipped/pulled; that leaves 50 posts toward list table page 2, so we need to make another request (or more) to fill list table page 2. But we only know that after querying list table page 1. What if the user jumps from page 1 to page 13?

For pagination, we would need to save the location of the first post of each page and make the request based on that data. WordPress doesn't support querying posts starting from a given ID, so we have to use the current page number. Note that we have two page values here: one for the visual pagination of new posts, and one for the WP_Query on the remote site.
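The dual-page bookkeeping described above can be sketched like this; the helper functions are hypothetical, and this only fills page 1 (jumping to an arbitrary page is exactly the unsolved part):

```php
// Sketch of filling one list-table page from remote batches (illustrative).
$per_page    = 100; // list-table page size
$batch_size  = 200; // posts fetched per remote request
$remote_page = 1;   // WP_Query page on the remote site
$collected   = array();

while ( count( $collected ) < $per_page ) {
	$batch = fetch_remote_posts( $batch_size, $remote_page ); // hypothetical
	if ( empty( $batch ) ) {
		break; // no more remote posts
	}
	foreach ( $batch as $post ) {
		if ( ! is_pulled_or_skipped( $post ) ) { // hypothetical
			$collected[] = $post;
		}
	}
	$remote_page++;
}
// List-table page 1 may finish with $remote_page at 2, 3, ...: the two
// counters diverge, which is why jumping straight to page 13 is hard.
$collected = array_slice( $collected, 0, $per_page );
```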


If we display all posts and instead gray out/label the skipped/pulled posts, the problem is solved, but I disagree with that approach.

Another approach that came to mind is creating our own endpoint for querying posts instead of relying on the core REST API. This sounds the most promising of these ideas to me.

@cadic (Contributor) commented Oct 14, 2021

@dinhtungdu I agree on a custom REST API endpoint. It would let us use the POST method and provide as many excluded post IDs as we need instead of limiting them. That isn't possible with the built-in /wp/v2/posts route, where the POST method is reserved for post creation.
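A custom POST route along these lines could be registered as in the sketch below; the namespace, route name, and callback are illustrative assumptions, not Distributor's actual API:

```php
// Hedged sketch of a custom POST endpoint for pull queries.
add_action( 'rest_api_init', function () {
	register_rest_route( 'dt/v1', '/pull-content', array(
		'methods'             => 'POST', // POST allows a large body of excluded IDs
		'callback'            => 'dt_pull_content_query', // hypothetical callback
		'permission_callback' => function () {
			return current_user_can( 'edit_posts' );
		},
	) );
} );

function dt_pull_content_query( WP_REST_Request $request ) {
	// Excluded IDs arrive in the request body, not the URL,
	// so there is no practical limit on how many we can send.
	$exclude = (array) $request->get_param( 'excluded_post_ids' );

	$query = new WP_Query( array(
		'post_type'      => 'post',
		'posts_per_page' => 20,
		'post__not_in'   => array_map( 'absint', $exclude ),
	) );

	return rest_ensure_response( $query->posts );
}
```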

@cadic (Contributor) commented Oct 14, 2021

I know WP VIP recommends against using exclude or post__not_in because it may reduce DB performance, but here we have a remote-request bottleneck, which in my opinion depends more on bandwidth than on DB time.

@cadic (Contributor) commented Oct 14, 2021

Another idea is adding a flag on the remote site when a post is pulled/skipped. dt_pulled and dt_skipped post meta entries would let us perform the query without excluding IDs.

The post meta name probably needs a trailing remote-site identifier (dt_pulled_at_<example.com>) to prevent confusion in complex Distributor setups.
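The flagging idea could look roughly like this; the helper name is hypothetical, and the meta key format follows the suggestion above:

```php
// Sketch of flagging posts on the origin site when they are pulled.
function dt_mark_as_pulled( $post_id, $remote_site ) {
	// Hypothetical helper; meta key carries the remote-site identifier.
	update_post_meta( $post_id, 'dt_pulled_at_' . sanitize_key( $remote_site ), time() );
}

// The "New" list could then exclude flagged posts without post__not_in:
$query = new WP_Query( array(
	'post_type'  => 'post',
	'meta_query' => array(
		array(
			'key'     => 'dt_pulled_at_' . sanitize_key( $remote_site ),
			'compare' => 'NOT EXISTS', // only posts never pulled by this site
		),
	),
) );
```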

@dkotter (Collaborator) commented Oct 15, 2021

@dinhtungdu @cadic I'm working on a PR that adds a new endpoint we can make a POST request to. This should allow us to pass in all excluded post IDs, which fixes the issue detailed in #808. It also gives us a larger timeout value than we get when making a GET request (in particular in a VIP environment). That probably fixes this issue as well.

But we are still making a post__not_in query. The only other approach I can think of is the one I already mentioned and @dinhtungdu added more details around: running multiple queries (if needed) with a fixed posts_per_page each time and then removing the excluded items from the results.

As already mentioned, this works fine for the first set of results but becomes tricky when trying to support pagination. The only solution I could think of is that whenever we're on a page other than 1, we would need to run the queries for each preceding page to determine what our offset should be. At higher page numbers this results in a decent number of extra queries and could have performance problems of its own.

Anyway, I just wanted to get feedback from you both on whether you have a better idea for the query we should run in this new endpoint. Do we just use post__not_in and not worry about it? In my testing I'm not actually seeing much performance impact from that query, but it obviously could be slower for sites with hundreds or thousands of pulled items.

@dinhtungdu (Contributor) commented Oct 15, 2021

@dkotter What do you think about endpoints that return only new or pulled posts for a specific connection? That would solve the pagination issue, but at the same time we may need meta queries.

Edit: I don't think it's possible to use meta queries to find pulled posts; we store all connection details for a post in a single dt_connection_map meta entry as an array.

@cadic (Contributor) commented Oct 15, 2021

@dkotter if we have to choose between post__not_in (a slower DB query) and pagination (2x, 3x, Nx DB queries de facto), I would pick the first solution.

Pros: a native WP_Query that returns predictable results.
Cons: slower performance (but since this is not a front-end feature, it will not run too often), and potentially trickier support later on...

@helen (Contributor) commented Oct 18, 2021

A bit of a drive-by here: IMO post__not_in has taken on an outsized "do not use" stigma in WP when it's perfectly reasonable for the right use cases, and this definitely seems like a good one: it's not a front-end query, it only impacts an isolated area, and it runs in the context of a deliberate action (looking at items to pull), not something you're typically zipping through from a user-flow perspective anyway. We'll definitely want to keep an eye on it, and I believe some query impact numbers were run that would be nice to have documented in this thread as well.

@dkotter (Collaborator) commented Oct 18, 2021

Thanks for all the feedback, @dinhtungdu and @cadic. I've got a work-in-progress PR that is ready for review/testing.

My current approach is a new custom endpoint that we only use on the New tab of the Pull screen for an external connection that has previously pulled or skipped items. In any other situation (like internal connections) the process is unchanged.

This endpoint is then used in a POST request, removing the limit on the number of items we can include in the URL (which gets rid of that 200-excluded-item limit, fixing #808). It also gets rid of the 3-second timeout you'll run into on WordPress VIP sites, which I think is the main culprit behind this issue.

I detailed a few other approaches I considered and rejected on the PR (some already mentioned here), but I'm happy to discuss any of those in more detail. The one solution I really like involves adding custom meta to the original content when it gets pulled; we could then run a fairly simple NOT EXISTS meta query instead of passing in a bunch of post IDs. The problem is that we aren't currently storing that information, so I'm not sure how to make it backwards compatible.

But I would love some review and testing on that PR to ensure I didn't miss or break anything. In my testing, I pulled in ~900 items from an external connection and averaged around 2-3 second response times on the HTTP request. Not sure if that will scale proportionally, but it seems totally reasonable to me.

@cadic (Contributor) commented Nov 1, 2021

I did some research over the weekend on post__not_in vs. custom-filter performance: https://lyuchin.com/2021/11/why-avoid-post__not_in/

Conclusion: a custom filter works well for repeated queries with persistent post and database caching, but the initial non-cached request with post__not_in will always be faster than the custom filter.
