Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Postgresql connection error / troePoolSize #1412

Closed
cfreyfh opened this issue Aug 22, 2023 · 8 comments
Closed

Postgresql connection error / troePoolSize #1412

cfreyfh opened this issue Aug 22, 2023 · 8 comments
Assignees
Labels
bug Something isn't working

Comments

@cfreyfh
Copy link

cfreyfh commented Aug 22, 2023

Hi there,

in our environment we use telegraf to get some data, transform and forward it to the NGSI-LD interface of Orion-LD. As we now added more sensors, the load of the system increases and I see more and more postgresql connection errors appear:

time=Tuesday 22 Aug 05:37:40 2023.018Z | lvl=WARN | corr=N/A | trans=N/A | from=N/A | srv=N/A | subsrv=N/A | comp=Orion | op=pgConnectionGet.cpp[153]:pgConnectionGet | msg=Internal Error (bug in postgres connection pool logic?) time=Tuesday 22 Aug 05:37:40 2023.018Z | lvl=WARN | corr=N/A | trans=N/A | from=N/A | srv=N/A | subsrv=N/A | comp=Orion | op=pgConnectionGet.cpp[154]:pgConnectionGet | msg=poolP at 0x242a5b0 time=Tuesday 22 Aug 05:37:40 2023.018Z | lvl=WARN | corr=N/A | trans=N/A | from=N/A | srv=N/A | subsrv=N/A | comp=Orion | op=pgConnectionGet.cpp[155]:pgConnectionGet | msg=poolP->items: 10 time=Tuesday 22 Aug 05:37:40 2023.018Z | lvl=WARN | corr=N/A | trans=N/A | from=N/A | srv=N/A | subsrv=N/A | comp=Orion | op=pgConnectionGet.cpp[156]:pgConnectionGet | msg=poolP->connectionV at 0x2442ac0

Increasing the troePoolSize did not solve the problem, so I did some investigation and saw that in https://github.com/FIWARE/context.Orion-LD/blob/5bfdeb19907ee0f3d6765d593c9db398fb18df79/src/lib/orionld/troe/pgInit.cpp#L48C1-L48C1 the value of "10" is hard coded. Shouldn't be there the value of troePoolSize be passed? I changed this in a local debug environment and the error seams to be gone.

When I looked at the file pgConnectionGet.cpp on that lines where the error occures (153-156) and the above logic on how to get a connection from the pool: In the first loop you are looking for a connection that is open and free, if there is none, in the next loop you look for a connection that is not already connected or not busy. But with high load it could be possible that there is just every connection in the pool connected and busy, so we will run in to that error. In that case wouldn't it be an idea to wait until a connection gets free or dynamically increase the connection pool?

@kzangeli
Copy link
Collaborator

ok, sounds like a good idea what you're proposing.
I'll look into this asap.
Thank you for reporting!

@kzangeli kzangeli self-assigned this Aug 22, 2023
@kzangeli kzangeli added the bug Something isn't working label Aug 22, 2023
@kzangeli kzangeli mentioned this issue Aug 22, 2023
@cfreyfh
Copy link
Author

cfreyfh commented Aug 22, 2023

Thank you for having a look at it.

Another thing:

if (cP->connectionP == NULL)

Even if the connection pointer is not null it does not mean that the connection was successfull.

If I check the error-message when the pointer is != NULL, I still get connection errors sometimes

if (cP->connectionP == NULL)
{
      char* errMsg = PQerrorMessage(cP->connectionP);
      cP->busy = false;  // So the slot can be used again!
      LM_RE(NULL, ("Database Error (unable to connect to postgres(%s)): %s", _db, errMsg));
} else {
      char* errMsg = PQerrorMessage(cP->connectionP);
      LM_W(("**Database Connection established** (%s): %s ",_db, errMsg));
}

The output

 op=pgConnectionGet.cpp[176]:pgConnectionGet | msg=**Database Connection established** (orion): connection to server at "xxx.xxx.xxx.xxx", port 30432 failed: server closed the connection unexpectedly
 op=pgConnectionGet.cpp[176]:pgConnectionGet | msg=**Database Connection established** (orion): connection  to server at "xxx.xxx.xxx.xxx", port 30432 failed: server closed the connection unexpectedly

@kzangeli
Copy link
Collaborator

Not sure you can trust those error messages.
Might be the last error is still in there.

I could of course re-check, once I see that the pointer is non-NULL.
The connection is reused and it just might have been closed in between uses.

I'll try to find some way to "health check" the connection.
If you know how to do that "health check", please let me know :)

[ I already fixed the 10 => troePoolSize, that was just a stupid mistake. Thanks for finding it for me! ]

@kzangeli
Copy link
Collaborator

When I looked at the file pgConnectionGet.cpp on that lines where the error occures (153-156) and the above logic on how to get a connection from the pool

You missed the line

  // Await a free slot in the pool                                                                                                                                             
  sem_wait(&poolP->queueSem);

I implemented this a few years ago, but, if I remember correctly, queueSem is a counting semaphore and if you manage to take it, you are guaranteed that there is either an unused slot, or a free already used slot.
So, that part should be fine.

I'll push a PR with just the "10 => troePoolSize" bug fixed, hoping that's all we need here.

Apart from looking at the source code, did you have any problems after fixing the "10 => ..." ?

kzangeli added a commit that referenced this issue Aug 23, 2023
kzangeli added a commit that referenced this issue Aug 23, 2023
@cfreyfh
Copy link
Author

cfreyfh commented Aug 24, 2023

Apart from looking at the source code, did you have any problems after fixing the "10 => ..." ?

No, no problems so far, it's working now for two days.

@cfreyfh
Copy link
Author

cfreyfh commented Aug 24, 2023

I'll try to find some way to "health check" the connection.
If you know how to do that "health check", please let me know :)

Well, I would suggest the following:

extend the block from L100

if (cP->connectionP != NULL)

to

    if (cP->connectionP != NULL)
    {
      // check if we are still connected
      ConnStatusType pgStatus = PQstatus(cP->connectionP);
      if (pgStatus != CONNECTION_OK)
      {
          LM_W(("Connection of item %d is lost, trying to re-connect...", ix));
          // try to re-connect
          PQreset(cP->connectionP);
          // get status again
          pgStatus = PQstatus(cP->connectionP);

          // if still no connection
          if (pgStatus != CONNECTION_OK)
          {
              // we free this pointer that it can be used in the next call of pgConnectionGet
              free(poolP->connectionV[ix]);
              poolP->connectionV[ix] = NULL;
              LM_W(("Connection failed, pointer of item %d was re-set to NULL (%p)", ix, poolP->connectionV[ix]));
              // this time no success finding a connection that is working, try in the next loop
              continue;
          }
      }	 

      // Great - found a free and already connected item - let's use it !
      cP->busy = true;

      sem_post(&poolP->poolSem);
      sem_post(&poolP->queueSem);

      cP->uses += 1;
      return cP;
    }

and from L164

if (cP->connectionP == NULL)

to

    if (cP->connectionP == NULL)
    {
      char* errMsg = PQerrorMessage(cP->connectionP);
      
      cP->busy = false;  // So the slot can be used again!
      LM_RE(NULL, ("Database Error (unable to connect to postgres(%s)): %s", _db, errMsg));
    } else 
    {
      char* errMsg = PQerrorMessage(cP->connectionP);
      // check if we are connected
      ConnStatusType pgStatus = PQstatus(cP->connectionP);
      if (pgStatus != CONNECTION_OK)
      {
        sem_wait(&poolP->poolSem);
        sem_wait(&poolP->queueSem);
	
        // find the connection pointer in the pool and free it
        for (int ix = 0; ix < poolP->items; ix++)
        {
                if (poolP->connectionV[ix] == cP)
                {
                        free(poolP->connectionV[ix]);
                        poolP->connectionV[ix] = NULL;
                        break;
                }
        }
        sem_post(&poolP->poolSem);
        sem_post(&poolP->queueSem);

        LM_RE(NULL, ("Database Connection could not be established (%s): %s ",_db, errMsg));
      }
    }

After implementing it this way, when I restart the PostgreSQL server, the connections are restored as soon as it is available again. There might be some loss of data for that timespan when the DB server is down. Maybe there could be some queue be implemented where the data is stored in memory until it's written to the DB.

@kzangeli
Copy link
Collaborator

ok!
If you have your ideas clear, and the broker passes all tests with this modification you suggest (I imagine you've tested it at least a little), why not go ahead and send a pull request?

cfreyfh added a commit to cfreyfh/context.Orion-LD that referenced this issue Aug 24, 2023
kzangeli added a commit that referenced this issue Aug 25, 2023
issue #1412 re-connecting to postgres when connection is lost
@kzangeli
Copy link
Collaborator

kzangeli commented Aug 26, 2023

Fixed in #1414 and #1416

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants