Postgresql connection error / troePoolSize #1412

cfreyfh · 2023-08-22T05:50:53Z

Hi there,

In our environment we use Telegraf to collect data, transform it, and forward it to the NGSI-LD interface of Orion-LD. Since we added more sensors, the load on the system has increased and more and more PostgreSQL connection errors appear:

Increasing troePoolSize did not solve the problem, so I investigated and saw that in https://github.com/FIWARE/context.Orion-LD/blob/5bfdeb19907ee0f3d6765d593c9db398fb18df79/src/lib/orionld/troe/pgInit.cpp#L48C1-L48C1 the value "10" is hard-coded. Shouldn't the value of troePoolSize be passed there instead? I changed this in a local debug environment and the error seems to be gone.

I then looked at pgConnectionGet.cpp, at the lines where the error occurs (153-156), and at the logic above them for getting a connection from the pool. The first loop looks for a connection that is open and free; if there is none, the next loop looks for a slot that is not yet connected or not busy. Under high load, however, every connection in the pool may be connected and busy, and then we run into that error. In that case, wouldn't it make sense to wait until a connection becomes free, or to grow the connection pool dynamically?


kzangeli · 2023-08-22T07:27:55Z

ok, sounds like a good idea what you're proposing.
I'll look into this asap.
Thank you for reporting!

cfreyfh · 2023-08-22T13:29:21Z

Thank you for having a look at it.

Another thing:

context.Orion-LD/src/lib/orionld/troe/pgConnectionGet.cpp

Line 164 in 5bfdeb1

if (cP->connectionP == NULL)

Even if the connection pointer is not NULL, that does not mean the connection was successful.

If I check the error message when the pointer is != NULL, I still sometimes get connection errors:

if (cP->connectionP == NULL)
{
      char* errMsg = PQerrorMessage(cP->connectionP);
      cP->busy = false;  // So the slot can be used again!
      LM_RE(NULL, ("Database Error (unable to connect to postgres(%s)): %s", _db, errMsg));
} else {
      char* errMsg = PQerrorMessage(cP->connectionP);
      LM_W(("**Database Connection established** (%s): %s ",_db, errMsg));
}

The output

 op=pgConnectionGet.cpp[176]:pgConnectionGet | msg=**Database Connection established** (orion): connection to server at "xxx.xxx.xxx.xxx", port 30432 failed: server closed the connection unexpectedly
 op=pgConnectionGet.cpp[176]:pgConnectionGet | msg=**Database Connection established** (orion): connection  to server at "xxx.xxx.xxx.xxx", port 30432 failed: server closed the connection unexpectedly

kzangeli · 2023-08-23T14:53:13Z

Not sure you can trust those error messages.
Might be the last error is still in there.

I could of course re-check, once I see that the pointer is non-NULL.
The connection is reused and it just might have been closed in between uses.

I'll try to find some way to "health check" the connection.
If you know how to do that "health check", please let me know :)

[ I already fixed the 10 => troePoolSize, that was just a stupid mistake. Thanks for finding it for me! ]

kzangeli · 2023-08-23T15:01:32Z

When I looked at the file pgConnectionGet.cpp at the lines where the error occurs (153-156) and the logic above on how to get a connection from the pool

You missed the line

  // Await a free slot in the pool
  sem_wait(&poolP->queueSem);

I implemented this a few years ago but, if I remember correctly, queueSem is a counting semaphore: if you manage to take it, you are guaranteed that there is either an unused slot or a free, previously used one.
So, that part should be fine.
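
The guarantee described above can be sketched in isolation. This is not the Orion-LD code: `Slot`, `SlotPool`, `acquire`, `tryAcquire` and `release` are hypothetical names, and the sketch assumes the semaphore is posted only when a slot is handed back. The semaphore is initialised from the configured pool size (rather than a literal such as 10), so taking it implies that at least one slot is not busy.

```cpp
#include <semaphore.h>

#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical stand-in for one pool entry; not the Orion-LD struct.
struct Slot
{
  bool connected = false;  // a connection has been opened in this slot
  bool busy      = false;  // the slot is currently handed out
};

class SlotPool
{
public:
  // The semaphore starts at the configured size, so its value always
  // equals the number of slots that are not busy.
  explicit SlotPool(unsigned int size) : slots(size)
  {
    sem_init(&freeSlots, 0, size);
  }

  ~SlotPool() { sem_destroy(&freeSlots); }

  SlotPool(const SlotPool&)            = delete;
  SlotPool& operator=(const SlotPool&) = delete;

  // Blocks until a slot is free, then returns its index.
  int acquire()
  {
    while (sem_wait(&freeSlots) != 0)
    {
      // Interrupted by a signal: retry
    }
    return claim();
  }

  // Non-blocking variant: returns -1 when every slot is busy.
  int tryAcquire()
  {
    if (sem_trywait(&freeSlots) != 0)
      return -1;
    return claim();
  }

  // Hands a slot back and wakes up one waiter, if any.
  void release(int ix)
  {
    {
      std::lock_guard<std::mutex> lock(mtx);
      slots[ix].busy = false;
    }
    sem_post(&freeSlots);
  }

private:
  int claim()
  {
    std::lock_guard<std::mutex> lock(mtx);

    // First pass: prefer a slot that is already connected and free
    for (std::size_t ix = 0; ix < slots.size(); ix++)
    {
      if (slots[ix].connected && !slots[ix].busy)
      {
        slots[ix].busy = true;
        return static_cast<int>(ix);
      }
    }

    // Second pass: take any free slot and mark it as connected
    for (std::size_t ix = 0; ix < slots.size(); ix++)
    {
      if (!slots[ix].busy)
      {
        slots[ix].connected = true;
        slots[ix].busy      = true;
        return static_cast<int>(ix);
      }
    }

    return -1;  // Not reached while the semaphore count matches the free slots
  }

  std::vector<Slot> slots;
  std::mutex        mtx;
  sem_t             freeSlots;
};
```

With a pool of two slots, a third `acquire()` would block until one of the first two callers calls `release()`; `tryAcquire()` exists only so that the exhausted state can be observed without blocking.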

I'll push a PR with just the "10 => troePoolSize" bug fixed, hoping that's all we need here.

Apart from looking at the source code, did you have any problems after fixing the "10 => ..." ?

Hopefully fixed issue #1412

cfreyfh · 2023-08-24T08:03:34Z

Apart from looking at the source code, did you have any problems after fixing the "10 => ..." ?

No, no problems so far, it's working now for two days.

cfreyfh · 2023-08-24T08:15:15Z

I'll try to find some way to "health check" the connection.
If you know how to do that "health check", please let me know :)

Well, I would suggest the following:

Extend the block starting at L100

context.Orion-LD/src/lib/orionld/troe/pgConnectionGet.cpp

Line 100 in 5bfdeb1

if (cP->connectionP != NULL)

to

    if (cP->connectionP != NULL)
    {
      // Check whether we are still connected
      ConnStatusType pgStatus = PQstatus(cP->connectionP);
      if (pgStatus != CONNECTION_OK)
      {
        LM_W(("Connection of item %d is lost, trying to re-connect...", ix));

        // Try to re-connect, then get the status again
        PQreset(cP->connectionP);
        pgStatus = PQstatus(cP->connectionP);

        // If there is still no connection
        if (pgStatus != CONNECTION_OK)
        {
          // Close the dead libpq connection so it is not leaked, then free the slot
          // so that it can be re-created in the next call of pgConnectionGet
          PQfinish(cP->connectionP);
          free(poolP->connectionV[ix]);
          poolP->connectionV[ix] = NULL;
          LM_W(("Connection failed, pointer of item %d was re-set to NULL (%p)", ix, poolP->connectionV[ix]));

          // No working connection in this slot, try the next one
          continue;
        }
      }

      // Great - found a free and already connected item - let's use it !
      cP->busy = true;

      sem_post(&poolP->poolSem);
      sem_post(&poolP->queueSem);

      cP->uses += 1;
      return cP;
    }

and from L164

context.Orion-LD/src/lib/orionld/troe/pgConnectionGet.cpp

Line 164 in 5bfdeb1

if (cP->connectionP == NULL)

to

    if (cP->connectionP == NULL)
    {
      char* errMsg = PQerrorMessage(cP->connectionP);

      cP->busy = false;  // So the slot can be used again!
      LM_RE(NULL, ("Database Error (unable to connect to postgres(%s)): %s", _db, errMsg));
    }
    else
    {
      char* errMsg = PQerrorMessage(cP->connectionP);

      // Check whether we are connected
      ConnStatusType pgStatus = PQstatus(cP->connectionP);
      if (pgStatus != CONNECTION_OK)
      {
        sem_wait(&poolP->poolSem);
        sem_wait(&poolP->queueSem);

        // Find the connection pointer in the pool and free it
        for (int ix = 0; ix < poolP->items; ix++)
        {
          if (poolP->connectionV[ix] == cP)
          {
            free(poolP->connectionV[ix]);
            poolP->connectionV[ix] = NULL;
            break;
          }
        }
        sem_post(&poolP->poolSem);
        sem_post(&poolP->queueSem);

        LM_RE(NULL, ("Database Connection could not be established (%s): %s ", _db, errMsg));
      }
    }

After implementing it this way, the connections are restored as soon as the PostgreSQL server is available again after a restart. Data may be lost for the time span during which the DB server is down. Perhaps a queue could be implemented that holds the data in memory until it has been written to the DB.
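
The decision flow of the health check proposed above (check the status, try one reset, check again, otherwise discard the slot) can be isolated from libpq as a small sketch. `FakeConn`, `Health` and `checkOrReset` are hypothetical names introduced only for illustration; in the broker, `status()` and `reset()` would correspond to PQstatus() returning CONNECTION_OK and to PQreset().

```cpp
#include <cassert>

// Hypothetical stand-in for a PGconn handle, so that the flow can be run
// without a database. Real code would call libpq instead.
struct FakeConn
{
  bool ok;            // What a status check currently reports
  bool serverIsBack;  // Whether a reset attempt would succeed
  int  resets;        // Number of reset attempts made

  FakeConn(bool _ok, bool _serverIsBack) : ok(_ok), serverIsBack(_serverIsBack), resets(0) {}

  bool status() const { return ok; }
  void reset()        { ++resets; ok = serverIsBack; }
};

enum class Health
{
  Ok,         // Connection was fine, no action taken
  Recovered,  // Connection was lost but one reset brought it back
  Dead        // Still down after the reset: the caller should discard the slot
};

// Check the connection, try a single reset if it is down, and report the outcome.
template <typename Conn>
Health checkOrReset(Conn& conn)
{
  if (conn.status())
    return Health::Ok;

  conn.reset();

  return conn.status() ? Health::Recovered : Health::Dead;
}
```

One caveat when mapping this back to libpq: PQstatus() reports the last known state of the connection and does not itself contact the server, so a connection that was closed while idle may still look healthy until the next operation on it fails.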

kzangeli · 2023-08-24T09:27:13Z

ok!
If you have your ideas clear, and the broker passes all tests with this modification you suggest (I imagine you've tested it at least a little), why not go ahead and send a pull request?

issue #1412 re-connecting to postgres when connection is lost

kzangeli · 2023-08-26T08:30:41Z

Fixed in #1414 and #1416

kzangeli self-assigned this on Aug 22, 2023
kzangeli added the "bug" label on Aug 22, 2023
kzangeli mentioned this issue in "Planning" #280 on Aug 22, 2023
kzangeli added commit c67a9a8 ("Hopefully fixed issue #1412") on Aug 23, 2023
kzangeli added commit 73b966d ("Merge pull request #1414 from FIWARE/issue/1412") on Aug 23, 2023
cfreyfh added commit a3347de ("issue FIWARE#1412 re-connecting to postgres when connection is lost") to cfreyfh/context.Orion-LD on Aug 24, 2023
cfreyfh mentioned this issue in #1416 ("issue #1412 re-connecting to postgres when connection is lost", merged) on Aug 25, 2023
kzangeli added commit 513edcd ("Merge pull request #1416 from cfreyfh/develop") on Aug 25, 2023
kzangeli closed this as completed on Aug 26, 2023
