Change graphql library to use Strawberry and fix memory leak with large numbers subscription #44

Merged · 60 commits · Apr 18, 2023

Conversation

@rjwills28 (Collaborator)

This PR includes the following major changes:

  • We have changed the graphql library from Tartiflette to Strawberry.
  • We have also fixed a memory leak within Coniql that occurs with many subscriptions (~200) to PVs updating at 10Hz. This was found when running performance tests stressing the application. The cause was Coniql not processing the graphql results quickly enough, so an internal queue kept growing, leading to ever-increasing memory usage and out-of-date PV values. To fix this I replaced the queue with a deque, so only the most recent value is sent; if we fall behind we simply skip the stale updates (see the sketch just below).
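For reference, here is a minimal sketch of that latest-value-only pattern (illustrative names, not the actual Coniql code; it assumes a single asyncio event loop with one producer and one consumer):

```python
import asyncio
from collections import deque


class LatestValueStream:
    """Holds only the newest update: if the consumer falls behind, older
    updates are overwritten rather than queued, so memory stays bounded."""

    def __init__(self) -> None:
        self._values: deque = deque(maxlen=1)  # bounded, unlike the old queue
        self._event = asyncio.Event()

    def put(self, value) -> None:
        # Called by the PV monitor callback (e.g. at 10Hz).
        self._values.append(value)  # silently replaces any unconsumed value
        self._event.set()

    async def get(self):
        # Called by the GraphQL subscription resolver.
        await self._event.wait()
        self._event.clear()
        return self._values.popleft()
```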

Other changes after a review of PR #39:

  • Remove the 'resolver' layer, which was an implementation detail required by Tartiflette.
  • Remove creation of the 'context' and instead use a global value in the app.

Changes to tests:

  • Move the IOC required for the tests to conftest.py.
  • Add tests covering both websocket protocols (legacy and new).
  • Use a single running IOC for all tests so we don't have to build and tear down in between each test.

@AlexanderWells-diamond (Contributor) left a comment:

I like this overall. I definitely like the code-based schema rather than the separate .gql file that Tartiflette used.

I note that test_pvaplugin.py seems not to have been updated - what is the intention with that file? Since pvaplugin.py was updated, I expected the tests to be updated too.

A number of my comments are duplicates of each other, where I've noted the same issue appearing in multiple places. There are also several that are by no means musts - many are optional or simply opinions.

Resolved review threads on: Pipfile, src/coniql/strawberry_schema.py, src/coniql/app.py, tests/test_aiohttp.py, tests/test_caplugin.py.
@rjwills28 (Collaborator, Author)

> Regarding the test parameterization, I don't see why the one-line fixtures were created. They can just be regular functions and called dynamically in the test body. I can't see an advantage to declaring them as fixtures. The parameters can literally be the function reference which gets called in the test body.

The problem I ran into here was that I needed to parametrize the fixtures for the two versions (schema and aiohttp), as one takes the schema directly and the other creates an aiohttp_client. The latter has to be done in a fixture, as that is a feature of pytest-aiohttp.
To parametrize the fixtures I call them within the tests using request.getfixturevalue(fixture_to_run). That should all be fine, as I can then get my client or schema and pass them into other functions to make the queries and gather the results.

The problem is that the fixture creating the aiohttp_client has to be async, because we have to await the client creation, and there is a known problem using request.getfixturevalue(...) with an async fixture: the test itself then cannot be defined as async, so you cannot do any awaits in the test, which are necessary to submit queries and get the results. See similar reports of this issue: pytest-dev/pytest-asyncio#112. I couldn't find or think of any other workaround. The only option I saw was to do all of the async work in the fixtures so that the test didn't need to be async, which is the reason for the one-line fixtures in this case.
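As a rough sketch of that pattern (fixture and helper names here are illustrative, not the actual test code, and it assumes pytest-aiohttp/pytest-asyncio resolve the async fixtures during setup):

```python
import pytest


async def query_schema():
    ...  # illustrative: run a query directly against the schema


async def query_app(aiohttp_client):
    ...  # illustrative: build the app, await aiohttp_client(app), run the query


@pytest.fixture
async def schema_result():
    # All async work happens in the fixture so the test can stay synchronous.
    return await query_schema()


@pytest.fixture
async def client_result(aiohttp_client):
    return await query_app(aiohttp_client)


@pytest.mark.parametrize("result_fixture", ["schema_result", "client_result"])
def test_query(request, result_fixture):
    # The test must not be async, otherwise request.getfixturevalue breaks
    # on the async fixtures (see pytest-dev/pytest-asyncio#112).
    result = request.getfixturevalue(result_fixture)
    assert result is not None
```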

This will suppress the stacktrace on error if the debug argument is false.
Fixes the "Task was destroyed but it is pending" warnings at the end of the tests.
@rjwills28 (Collaborator, Author)

> I did have an issue running the tests. Multiple runs sometimes showed this error (sometimes in result2, sometimes in result4):

This occasional failure was coming from the subscription 'ticking' test. I have updated this now to make it more reliable, including a wait to ensure we are connected to the websocket before trying to subscribe.

> Also, even if all tests pass, I still see the "Task was destroyed but it is pending!" warning message printed 4 times. Is this intentional? Are some of the tests deliberately trying to kill pending tasks? If not, it's an indicator that there's something wrong in task handling somewhere.

This was an interesting one. In some cases this was caused by the pytest event loop closing too early, which was fixed by using a new event loop for each 'module' (i.e. each set of tests).
There was another issue in simplugin.py (and test_coniql.py, which uses it): the simplugin tasks were only weakly referenced and so were being GC'ed too early. I introduced a set (task_references) to hold strong references to these tasks and clean them up afterwards, so that they are not GC'ed immediately.
With these two fixes I no longer see these warnings at the end of the tests.
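For illustration, a rough sketch of those two fixes (assuming pytest-asyncio-style fixtures; everything except the task_references name is illustrative):

```python
import asyncio

import pytest


# 1. Give each test module its own event loop so the loop is not closed
#    while tasks created by that module's tests are still pending.
@pytest.fixture(scope="module")
def event_loop():
    loop = asyncio.new_event_loop()
    yield loop
    loop.close()


# 2. Hold strong references to background tasks: asyncio itself only keeps
#    weak references, so an otherwise unreferenced task can be GC'ed early.
task_references = set()


def start_background_task(coro):
    # Must be called from within the running event loop.
    task = asyncio.create_task(coro)
    task_references.add(task)
    task.add_done_callback(task_references.discard)  # drop once finished
    return task
```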

@rjwills28 (Collaborator, Author)

@aawdls @AlexanderWells-diamond I think I have now finished addressing the second round of review comments. Please feel free to review again or provide feedback on my recent changes.

As a side note, I know we are moving away from the Pipenv installation process, but I was able to get a fresh install working with it on Python 3.8. To do this I had to update the setuptools and wheel modules. I updated the Pipfile and Pipfile.lock, so this might even work for you now?

@AlexanderWells-diamond (Contributor) left a comment:

Re-running the tests and I'm unfortunately still seeing warnings, despite tests passing:

======================================================================================= 25 passed in 21.13s =======================================================================================
Task was destroyed but it is pending!
task: <Task pending name='Task-169' coro=<BaseGraphQLTransportWSHandler.handle_connection_init_timeout() running at /home/eyh46967/dev/coniql/venv/lib64/python3.8/site-packages/strawberry/subscriptions/protocols/graphql_transport_ws/handlers.py:82> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f3c0f9452b0>()]>>
Task was destroyed but it is pending!
task: <Task pending name='Task-198' coro=<BaseGraphQLTransportWSHandler.handle_connection_init_timeout() running at /home/eyh46967/dev/coniql/venv/lib64/python3.8/site-packages/strawberry/subscriptions/protocols/graphql_transport_ws/handlers.py:82> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f3c0f7ece50>()]>>

This is almost certainly a timing issue, where running the tests on different hardware produces the warnings. It will likely be exacerbated by GitHub Actions, as some of their runners (especially MacOS) are very slow.
I don't know how much longer it is worth spending on this issue though, as it most likely comes from how pytest exercises the application, which is rather unlike any actual client out there.

I am unable to install this internally using the default pipenv command, as several of the modules are not available on the internal PyPI. I haven't spent the time installing the missing ones as we're going to be scrapping pipenv anyway.

Other than the test warnings, this all looks good to me.

@rjwills28 (Collaborator, Author)

I've finally pinpointed where these warnings are coming from, and it looks like it's an issue in Strawberry. Essentially they create a task that does an asyncio.sleep for 60 seconds. The task then checks whether the connection_init message has been received from the client; if so it simply returns, and if not it closes the connection.
In the tests we create a subscription, gather the results and then close the connection and event loop in less than 60 secs, but this task is still running, sitting in the asyncio.sleep. It is this task that the warning is about. When we run against a 'normal' client we do not have this problem, as the connection is either open long enough for the 60 secs to elapse and the task to complete, or we kill the Coniql application.

This 60 secs is strange to me. In testing I found that this connection_init message was received very quickly and so this 60 sec wait seems unnecessary. There are 2 ways to fix it in the Strawberry code:

  1. Run a while loop checking whether the connection_init message has been received, with a short sleep in between, and time out if we do not receive it after 60 secs. This causes the task to complete as soon as the connection_init message has been received instead of waiting the arbitrary 60 secs.
  2. Add some clean up code when a subscription is closed that checks if this task is still running and cancels it if it is.

The second seems cleaner to me, but I will have to approach the Strawberry maintainers to get this fixed, so I might just ask for their opinion.

@rjwills28 (Collaborator, Author)

Scrap those fixes. It turns out to be much simpler: we can configure the timeout ourselves and so override the 60 seconds. I'll look into that.
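For example, something along these lines, assuming the Strawberry aiohttp GraphQLView exposes the connection_init_wait_timeout option (as in recent Strawberry releases); the exact wiring into Coniql's app.py may differ:

```python
from datetime import timedelta

import strawberry
from strawberry.aiohttp.views import GraphQLView


@strawberry.type
class Query:
    hello: str = "world"  # placeholder query type for this sketch


schema = strawberry.Schema(query=Query)

# Wait at most 5 s (rather than the default 60 s) for the client's
# connection_init message before the server closes the websocket.
view = GraphQLView(schema=schema, connection_init_wait_timeout=timedelta(seconds=5))
```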

@rjwills28 (Collaborator, Author)

@AlexanderWells-diamond if you have a moment would you mind re-running the tests on your system with my latest changes to see if you still get any warnings? Thanks!

@AlexanderWells-diamond (Contributor)

> @AlexanderWells-diamond if you have a moment would you mind re-running the tests on your system with my latest changes to see if you still get any warnings? Thanks!

I've run the tests several times and see no warnings printed.

I'm always a little suspicious about hardcoding any form of timeout into tests. The machines that run the CI have proven to be very slow at times (especially the MacOS ones), which, if I understand the changes correctly, means that any machine that takes longer than 2 seconds to connect will cause a test failure. I suppose we'll just have to keep an eye on the CI and adjust this timeout accordingly.

@rjwills28 (Collaborator, Author)

We send the connection_init message ourselves in the tests as soon as we have connected (see https://github.com/dls-controls/coniql/blob/ee1304c9115aa21cfd73f3c953d0ef2c9faff21a/tests/test_aiohttp.py#L116), so I would be surprised if we saw a timeout here; however, it is something to keep in mind.
It would be nicer if the Strawberry code cleaned up properly, so that is possibly something we could follow up with them at a later date.
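For context, that handshake looks roughly like this over the aiohttp test client using the graphql-transport-ws protocol (the endpoint path and helper name here are illustrative, not the exact test code):

```python
async def start_subscription(client, query: str):
    # Connect and initialise immediately, so the server-side init-timeout
    # task completes long before any configured timeout expires.
    ws = await client.ws_connect("/ws", protocols=["graphql-transport-ws"])
    await ws.send_json({"type": "connection_init", "payload": {}})
    ack = await ws.receive_json()
    assert ack["type"] == "connection_ack"
    # Start the subscription itself.
    await ws.send_json(
        {"id": "1", "type": "subscribe", "payload": {"query": query}}
    )
    return ws
```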

@AlexanderWells-diamond (Contributor)

Yes, it sounds like there may be something that could be done in Strawberry, so it should be raised with them.

Otherwise this all looks good.

@aawdls marked this pull request as ready for review on April 18, 2023, 13:58.
@aawdls (Contributor) commented on Apr 18, 2023:

This is now ready to be merged and will be deployed in pre-prod for further evaluation at scale.
