Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance for Groups with many EPerson members. Fix pagination on endpoints #9078

Merged
merged 11 commits into from Nov 2, 2023

Conversation

tdonohue
Copy link
Member

@tdonohue tdonohue commented Sep 18, 2023

References

Description

This PR improves the performance of Groups which have many EPerson (or Group) members by adding proper pagination to these endpoints. (Previously all objects were always loaded into memory before pagination was applied.)

This was implemented via new paginated methods in EPersonService/DAO and GroupService/DAO:

  • EPersonService/DAO.findByGroups() -> Locates all EPersons who are a member of one or more groups in a paginated manner.
  • EPersonService/DAO.countByGroups() -> Provides total count of those EPerson members (necessary for pagination)
  • GroupService/DAO.findByParent() -> Locates all (Sub)Groups who are a member group of the given Parent group in a paginated manner.
  • GroupService/DAO.countByParent() -> Provides total count of those (Sub)Groups members (necessary for pagination)
  • Unit & Integration Tests were added for these new methods

Warnings were added to JavaDoc for the following methods. All of which will have very BAD performance for large groups:

  • Group.getMembers()
  • Group.getMemberGroups()
  • EPersonService.findByGroups() (unpaginated version)
  • GroupService.allMembers()
  • NOTE: I did NOT remove all usages of these methods. Attempting to remove all usages of these methods would require more discussion/thought. I just replaced the simple ones (i.e. some calls to allMembers()) as described below.

Finally, I discovered several usages of GroupService.allMembers() where the code only required the size() of that list. These usages were replaced with new countAllMembers() or countByParent() methods. All these usages have existing detailed tests were still pass with the newly refactored code:

  • EPersonService.delete(Context, EPerson, cascade)
  • GroupService.removeMember(Context, Group, EPerson)
  • GroupService.removeMember(Context, Group, Group)

Instructions for Reviewers

  • See Performance Issues with Groups having many EPerson members #9052.
  • In UI, verify "Access Control" -> Groups is working properly for search/editing.
  • In REST API, verify pagination still works properly for BOTH the /epersons and /subgroups endpoints (size = how many to return, page = which page of results to return). With this PR in place, small pages (small values of size) will return a response quickly. Without this PR (e.g. in 7.6), the size of page doesn't matter as every object is loaded into memory regardless of size before a response is returned.
    • e.g. GET /api/eperson/groups/<:uuid>/epersons?page=0&size=1 (Only return one result at a time per page)
    • e.g. GET /api/eperson/groups/<:uuid>/subgroups?page=0&size=5 (Return first page of 5 results per page)
  • Review code and new tests
  • Verify all automated tests pass.

@tdonohue tdonohue added bug high priority interface: REST API v7+ REST API for v7 and later (dspace-server-webapp module) authorization Related to user authorization / permissions performance / caching Related to performance or caching issues port to dspace-7_x This PR needs to be ported to `dspace-7_x` branch for next bug-fix release labels Sep 18, 2023
@alanorth
Copy link
Contributor

I tested this patch on DSpace 7.6. I was able to search and edit groups as expected, with no errors in the browser console or dspace.log. I notice the dspace.log shows the new pagination:

2023-09-19 21:58:38,419 INFO  aaf64908-f03a-4b93-a975-1445a69c26fc c1902ac1-d0e2-4148-bc19-93bf3ff2c572 org.dspace.app.rest.utils.DSpaceAPIRequestLoggingFilter @ Before request [GET /server/api/eperson/groups/8b556d2d-a5d0-46c1-b648-128c10792e9b/epersons] originated from /access-control/groups?gl.page=7

I was not able to test the REST API directly because I don't know how to do it via simple HTTP requests.

@tdonohue
Copy link
Member Author

@alanorth : Thanks for the tests! I added more details under "Instructions for Reviewers" on how to test pagination. If you are interested, It can be done either via commandline (wget or curl) or from the HAL Browser. (Pagination also has some automated tests in my updates to GroupRestRepositoryIT in this PR, so I feel confident it works. But that's how you'd manually test it)

@alanorth
Copy link
Contributor

Thanks @tdonohue. The HAL browser is easier than curl because we don't have to mess with signing tokens.

I checked the two endpoints in the HAL browser and don't see a difference. Paging already seems to work on my unpatched DSpace 7.6 instance. For reference, I tried:

  • /server/api/eperson/groups/<:uuid>/epersons?page=2&size=1

And I get this response in the HAL browser:

{
  "page": {
    "number": 2,
    "size": 1,
    "totalPages": 521,
    "totalElements": 521
  }
}

Perhaps I'm misunderstanding...

@tdonohue
Copy link
Member Author

tdonohue commented Sep 21, 2023

@alanorth : Apologies, I've misworded the description slightly. Yes, paging already works in 7.6... but this PR changes the paging to provide better performance. In 7.6, paging will first load every object into memory (in your test, 512 objects will be loaded into memory, as totalElements=512) before returning just the few necessary for the page. With this PR in place, only the objects on the page (i.e. size value) will be loaded into memory.

So, if you test with 7.6, you'll see slower behavior to the paging, as every page will load totalElements number of objects before returning the response (even when size=1). But, if you test with this PR, you'll see much quicker responses, as you'll only load size number of objects. Based on your response above, if you use size=1, that means 7.6 will still load 512 objects into memory while this PR would only load 1.

It can be difficult to see the speed difference on smaller repositories, but it should still be noticeable.

@alanorth
Copy link
Contributor

@tdonohue ok thanks for taking the time to explain. I was contemplating not commenting, assuming I was doing something stupid. So I am +1 on this by testing because it doesn't change functionality at all and, while I didn't notice the performance increase on localhost in my case, it didn't have any negative affect either.

I will let someone else review the code since I don't know it well.

@tdonohue tdonohue changed the title Improve performance for Groups with many EPerson members. Add pagination to endpoints Improve performance for Groups with many EPerson members. Fix pagination on endpoints Oct 5, 2023
Copy link
Contributor

@vins01-4science vins01-4science left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @tdonohue , thank you very much for this improvement 🚀 !

However, I am a little bit concerned about the paginated queries that are not using any ordering. I think that in both cases we should offer an order, otherwise the pagination cannot be used in a real scenario.

On the other hand I found out some other little improvements that can be made, and that I suggest that can be done.

@tdonohue
Copy link
Member Author

tdonohue commented Nov 1, 2023

@vins01-4science : Thanks for your review. I've addressed all feedback in the latest commit. That said, I did NOT fix the issues you pointed out with nested for loops in GroupServiceImpl (and similar) as this PR didn't create that code. While I agree that code could be cleaned up better, I won't have time to refactor it for the 7.6.1 release... so I'd prefer moving forward with this PR as-is and only reviewing the code this PR adds/removes. Thanks!

Copy link
Contributor

@vins01-4science vins01-4science left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, thank you @tdonohue 🚀

@tdonohue
Copy link
Member Author

tdonohue commented Nov 2, 2023

Merging as this is at +2.

@tdonohue tdonohue merged commit ee3369d into DSpace:main Nov 2, 2023
15 checks passed
@tdonohue tdonohue deleted the fix_9052 branch November 2, 2023 19:09
@dspace-bot
Copy link

@tdonohue tdonohue removed the port to dspace-7_x This PR needs to be ported to `dspace-7_x` branch for next bug-fix release label Nov 2, 2023
@tdonohue tdonohue added this to the 8.0 milestone Nov 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
authorization Related to user authorization / permissions bug high priority interface: REST API v7+ REST API for v7 and later (dspace-server-webapp module) performance / caching Related to performance or caching issues
Projects
Status: ✅ Done
Development

Successfully merging this pull request may close these issues.

Performance Issues with Groups having many EPerson members
4 participants