Postgresql connection error / troePoolSize #1412
OK, what you're proposing sounds like a good idea. |
Thank you for having a look at it. Another thing:
Even if the connection pointer is non-NULL, it does not mean that the connection was successful. When I check the error message for a non-NULL pointer, I still sometimes get connection errors.
The output
|
Not sure you can trust those error messages. I could of course re-check, once I see that the pointer is non-NULL. I'll try to find some way to "health check" the connection. [ I already fixed the 10 => troePoolSize, that was just a stupid mistake. Thanks for finding it for me! ] |
You missed the line
I implemented this a few years ago, but, if I remember correctly, queueSem is a counting semaphore, and if you manage to take it, you are guaranteed that there is either an unused slot or an already-used slot that is currently free. I'll push a PR with just the "10 => troePoolSize" bug fixed, hoping that's all we need here. Apart from looking at the source code, did you have any problems after fixing the "10 => ..." ? |
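A minimal sketch of the counting-semaphore invariant described above: a successful acquire guarantees that a scan of the pool will find a free slot. The names (`CountingSemaphore`, `ConnectionPool`, `take`, `give`) are illustrative; the actual Orion-LD pool around `queueSem` may differ in detail.

```cpp
// Sketch, not Orion-LD's actual code: a counting semaphore guarding a
// fixed-size pool. If the semaphore count equals the pool size, then a
// successful acquire() implies at least one slot is currently free,
// so the scan in take() cannot fail.
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <vector>

class CountingSemaphore {
public:
  explicit CountingSemaphore(int count) : count_(count) {}
  void acquire() {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [this] { return count_ > 0; });
    --count_;
  }
  void release() {
    std::lock_guard<std::mutex> lock(mutex_);
    ++count_;
    cv_.notify_one();
  }
private:
  std::mutex mutex_;
  std::condition_variable cv_;
  int count_;
};

struct Slot { bool busy = false; };

class ConnectionPool {
public:
  explicit ConnectionPool(int size) : sem_(size), slots_(size) {}
  // Blocks until a slot is guaranteed free, then claims it.
  Slot* take() {
    sem_.acquire();
    std::lock_guard<std::mutex> lock(mutex_);
    for (Slot& s : slots_)
      if (!s.busy) { s.busy = true; return &s; }
    return nullptr;  // unreachable while semaphore count matches pool size
  }
  void give(Slot* s) {
    { std::lock_guard<std::mutex> lock(mutex_); s->busy = false; }
    sem_.release();
  }
private:
  CountingSemaphore sem_;
  std::mutex mutex_;
  std::vector<Slot> slots_;
};
```

With this invariant, the "bug in postgres connection pool logic?" branch should indeed be unreachable, as long as the semaphore is initialized with the real pool size (i.e. `troePoolSize`, not a hard-coded 10).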
No, no problems so far; it has been working for two days now. |
Well, I would suggest the following: extend the block from L100
to
and from L164
to
After implementing it this way, when I restart the PostgreSQL server, the connections are restored as soon as the server is available again. There might be some data loss for the timespan when the DB server is down. Maybe a queue could be implemented where the data is stored in memory until it can be written to the DB. |
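A sketch of the reconnect-on-checkout idea behind this suggestion: before handing out a pooled connection, verify it is still alive and re-establish it if not. With libpq this would use `PQstatus(conn)` and `PQreset(conn)`; `FakeConn` and the helper names below are stand-ins so the sketch stays self-contained.

```cpp
// Sketch only: FakeConn stands in for a libpq PGconn*, connectionOk() for
// PQstatus(conn) == CONNECTION_OK, and reconnect() for PQreset(conn).
#include <cassert>

struct FakeConn { bool ok; };

static bool connectionOk(FakeConn* c) { return c != nullptr && c->ok; }
static void reconnect(FakeConn* c)    { if (c != nullptr) c->ok = true; }

// Returns a healthy connection, attempting one reconnect if the health
// check fails. Returns nullptr only if reconnecting also fails.
FakeConn* healthyConnectionGet(FakeConn* c)
{
  if (connectionOk(c))
    return c;
  reconnect(c);              // with libpq: PQreset(c)
  return connectionOk(c) ? c : nullptr;
}
```

Checking `PQstatus` after checkout is also what addresses the earlier observation that a non-NULL connection pointer does not by itself mean the connection succeeded.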
ok! |
issue #1412 re-connecting to postgres when connection is lost
Hi there,
in our environment we use telegraf to get some data, transform it, and forward it to the NGSI-LD interface of Orion-LD. As we have now added more sensors, the load on the system has increased and I see more and more PostgreSQL connection errors appear:
time=Tuesday 22 Aug 05:37:40 2023.018Z | lvl=WARN | corr=N/A | trans=N/A | from=N/A | srv=N/A | subsrv=N/A | comp=Orion | op=pgConnectionGet.cpp[153]:pgConnectionGet | msg=Internal Error (bug in postgres connection pool logic?)
time=Tuesday 22 Aug 05:37:40 2023.018Z | lvl=WARN | corr=N/A | trans=N/A | from=N/A | srv=N/A | subsrv=N/A | comp=Orion | op=pgConnectionGet.cpp[154]:pgConnectionGet | msg=poolP at 0x242a5b0
time=Tuesday 22 Aug 05:37:40 2023.018Z | lvl=WARN | corr=N/A | trans=N/A | from=N/A | srv=N/A | subsrv=N/A | comp=Orion | op=pgConnectionGet.cpp[155]:pgConnectionGet | msg=poolP->items: 10
time=Tuesday 22 Aug 05:37:40 2023.018Z | lvl=WARN | corr=N/A | trans=N/A | from=N/A | srv=N/A | subsrv=N/A | comp=Orion | op=pgConnectionGet.cpp[156]:pgConnectionGet | msg=poolP->connectionV at 0x2442ac0
Increasing the troePoolSize did not solve the problem, so I did some investigation and saw that in https://github.com/FIWARE/context.Orion-LD/blob/5bfdeb19907ee0f3d6765d593c9db398fb18df79/src/lib/orionld/troe/pgInit.cpp#L48C1-L48C1 the value "10" is hard-coded. Shouldn't the value of troePoolSize be passed there instead? I changed this in a local debug environment and the error seems to be gone.
I also looked at the lines in pgConnectionGet.cpp where the error occurs (153-156) and at the logic above them for getting a connection from the pool: in the first loop you look for a connection that is open and free; if there is none, in the second loop you look for a connection that is not yet connected or not busy. But under high load it is possible that every connection in the pool is connected and busy, so we run into that error. In that case, wouldn't it be an idea to wait until a connection becomes free, or to grow the connection pool dynamically?
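The "wait until a connection becomes free" idea could be sketched as a bounded wait instead of an immediate error: when every pooled connection is connected and busy, block with a timeout until one is released, and only then fall back to logging (or growing the pool). The `Pool` struct, function names, and timeout are illustrative assumptions, not Orion-LD's actual code.

```cpp
// Sketch: wait (with timeout) for a pooled connection to be released
// instead of immediately logging "bug in postgres connection pool logic?".
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

struct Pool {
  std::mutex              mutex;
  std::condition_variable freed;
  int                     items;     // pool size (troePoolSize)
  int                     busy = 0;  // connections currently handed out
};

// Returns true if a connection slot was obtained within the timeout.
bool pgConnectionWait(Pool& pool, std::chrono::milliseconds timeout)
{
  std::unique_lock<std::mutex> lock(pool.mutex);
  if (!pool.freed.wait_for(lock, timeout, [&] { return pool.busy < pool.items; }))
    return false;  // still exhausted: caller can log an error or grow the pool
  ++pool.busy;
  return true;
}

void pgConnectionRelease(Pool& pool)
{
  { std::lock_guard<std::mutex> lock(pool.mutex); --pool.busy; }
  pool.freed.notify_one();
}
```

A bounded wait keeps the broker from spinning forever if the DB is truly down, while absorbing short bursts of load that would otherwise exhaust the pool.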