
[BUG] sql_cacher get incorrect string value, after restart opensips #2649

Closed
GangZhuo opened this issue Oct 11, 2021 · 15 comments · Fixed by #2676

@GangZhuo
Contributor

GangZhuo commented Oct 11, 2021

OpenSIPS version you are running

version: opensips 3.3.0-dev (x86_64/linux)
flags: STATS: On, DISABLE_NAGLE, USE_MCAST, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, HP_MALLOC, DBG_MALLOC, CC_O0, FAST_LOCK-ADAPTIVE_WAIT
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535
poll method support: poll, epoll, sigio_rt, select.
git revision: b4061d3d7
main.c compiled on 11:53:33 Oct 11 2021 with gcc 8

Describe the bug

sql_cacher returns an incorrect string value after OpenSIPS is restarted.

To Reproduce

modparam("sql_cacher", "cache_table",
"id=cache_passwd
db_url=postgres://uuu:123@192.168.1.77:5432/table_name
cachedb_url=redis:passwd://:123@127.0.0.1:6379/
table=user_info
key=id
key_type=int
columns=password
on_demand=1
expire=3600")

# Get password from database.
$var(password) = $sql_cached_value(cache_passwd:password:$var(user_id));

Example:

If we have a user with id=1000 and password="1234".

sql_cacher automatically creates a new Redis key "passwd1000" with the value "AAAAAA==EAAAAA==1234",
and $sql_cached_value(cache_passwd:password:1000) returns the correct value "1234".

After restarting OpenSIPS, $sql_cached_value(cache_passwd:password:1000) returns the incorrect value "16".

Additional context

  1. There is no problem when on_demand=0.
  2. Possible cause: after restarting OpenSIPS, sql_cacher did not reload the values from the database, so it incorrectly defaulted the field to an integer type and then returned the offset of the string value, which is 16 in "AAAAAA==EAAAAA==1234".
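The offset theory in point 2 is consistent with the observed data. A speculative decode is sketched below; the layout (two base64-encoded 4-byte little-endian integers as a header, followed by the raw string payload) is an assumption inferred from the visible value, not taken from the sql_cacher source:

```python
import base64
import struct

# Cached value observed in Redis for key "passwd1000"
raw = "AAAAAA==EAAAAA==1234"

# Assumed layout (unverified): two base64-encoded little-endian int32
# header fields, then the string payload.
first = struct.unpack("<i", base64.b64decode(raw[0:8]))[0]
offset = struct.unpack("<i", base64.b64decode(raw[8:16]))[0]

print(first)        # 0
print(offset)       # 16 -- matches the bogus "16" returned after restart
print(raw[offset:]) # "1234" -- the real password
```

Under this reading, "16" is simply the header's payload offset returned verbatim when the field is misinterpreted as an integer.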
@github-actions

Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.

@github-actions github-actions bot added the stale label Oct 26, 2021
@kertor

kertor commented Nov 1, 2021

Hello! I have the same problem on opensips 3.1:

version: opensips 3.1.6 (x86_64/linux)
flags: STATS: On, DISABLE_NAGLE, USE_MCAST, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, HP_MALLOC, DBG_MALLOC, FAST_LOCK-FUTEX-ADAPTIVE_WAIT
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535
poll method support: poll, epoll, sigio_rt, select.
git revision: a162f8859
main.c compiled on  with gcc 9
  1. As far as I can see (comparing with 3.0), on 3.1 "sql_cacher" caches the values a few seconds after the OpenSIPS restart.
  2. If calls arrive before "sql_cacher" has cached the values after OpenSIPS starts, we see these errors:
Oct 29 17:00:55 ts-devdev01-1 opensips[236885]: Oct 29 17:00:54 [236885] ERROR:sql_cacher:get_rld_vers_from_cache: Failed to get reload version integer from cachedb
Oct 29 17:00:55 ts-devdev01-1 opensips[236885]: Oct 29 17:00:54 [236885] ERROR:sql_cacher:pv_get_sql_cached_value: Error fetching from cachedb

and after this, "sql_cacher" will not cache the values and will always return 16.

Could you please check it? Thank you!

@stale stale bot removed the stale label Nov 1, 2021
@nikbyte
Member

nikbyte commented Nov 1, 2021

+1

@nikbyte
Member

nikbyte commented Nov 1, 2021

I have the same issue on 3.1, but not on 3.0 with the same script.

@nikbyte nikbyte added the bug label Nov 1, 2021
@liviuchircu
Member

In order to extract the commit diff between latest 3.0 and 3.1, we can do:

git log d19b20e2e737db95149f5f42f9b78e77df09412b..3.1 modules/sql_cacher

... which produces: https://pastebin.com/e1H24uSS

Out of these, there is one commit which stands out and seems directly related to the issue here:

commit 7b62f2453bd226ec08b4f229017fa953d3876681
Author: Bogdan-Andrei Iancu <bogdan@opensips.org>
Date:   Fri Mar 20 20:43:37 2020 +0200

    The main process (attendant) is no longer running the child_init()
    
    The main proc must stay free of any db connections, so it can later fork and child_init() other procs (due auto_scaling).
    For the modules which were keeping a DB conn open in main proc for later usage, during module destroy, the new approach is to open the conn, on the spot, in the destroy handler.
    
    Related to #2003

Maybe this can serve as a good starting point for what needs to be investigated here.

@nikbyte
Member

nikbyte commented Nov 1, 2021

@liviuchircu Thank you. I tried to roll back these changes using the following patch:

diff --git a/main.c b/main.c
index 0e1cd7e31..c5d6d836e 100644
--- a/main.c
+++ b/main.c
@@ -840,7 +840,7 @@ static int main_loop(void)

 	if (testing_framework) {
 		if (init_child(1) < 0) {
-			LM_ERR("error in init_child for First Worker\n");
+			LM_ERR("error in init_child for PROC_MAIN\n");
 			report_failure_status();
 			goto error;
 		}
@@ -849,6 +849,12 @@ static int main_loop(void)
 		shutdown_opensips(rc);
 	}

+	if (init_child(PROC_MAIN) < 0) {
+		LM_ERR("error in init_child for PROC_MAIN\n");
+		report_failure_status();
+		goto error;
+	}
+
 	report_conditional_status( (!no_daemon_mode), 0);

 	if (auto_scaling_enabled) {
--- a/modules/sql_cacher/sql_cacher.c
+++ b/modules/sql_cacher/sql_cacher.c
@@ -1928,9 +1928,17 @@ static void free_c_entry(cache_entry_t *c)

 static void destroy(void)
 {
+	db_handlers_t *db_hdls;
 	struct queried_key *q_it, *q_tmp;
 	cache_entry_t *c_it, *c_tmp;

+	for(db_hdls = db_hdls_list; db_hdls; db_hdls = db_hdls->next) {
+		if (db_hdls->cdbcon)
+			db_hdls->cdbf.destroy(db_hdls->cdbcon);
+		if (db_hdls->db_con)
+			db_hdls->db_funcs.close(db_hdls->db_con);
+	}
+
 	q_it = *queries_in_progress;
 	while (q_it) {
 		q_tmp = q_it;

and it fixed the issue.

@nikbyte
Member

nikbyte commented Nov 1, 2021

Could you please fix it properly in code now?

@bogdan-iancu
Member

Logically speaking, only the init_child(PROC_MAIN) change may induce any kind of change here; the rest of the changes are NOPs.
Now, checking the child_init() of the sql_cacher module, I do not see any potential behavior changer. For PROC_MAIN, nothing is done other than opening some DB connections (which are useless for the module anyway).

Basically, I fail to see how the removal of init_child(PROC_MAIN) may impact the module's behavior.

@liviuchircu
Member

liviuchircu commented Nov 8, 2021

@bogdan-iancu Indeed, looking closer, it doesn't look like that commit could cause the issue. Perhaps the problem originated somewhere in the past, say here, in a commit which is on both 3.0 and 3.1:

commit 71960b917deef997799c2abbd80cae2182642341
Author: rvlad-patrascu <vladp@opensips.org>
Date:   Thu Jan 11 20:25:41 2018 +0200

    sql_cacher: do data loading for full caching in the first child process
    
    Speed up OpenSIPS starting sequence by doing the full caching data load
    in the first process, after child_init is done (via RPC), instead of
    during module init.

However, with more and more "stuff to load" accumulating on worker #1's IPC startup queue in OpenSIPS 3.1 and later releases, we were led to believe that the issue was introduced in 3.1, when in fact it had been there all along (ever since the switch to async initialization). There was simply a smaller window for the race condition ("cache version initialization" vs. "SIP traffic processing") when there was less to load post-fork, so sql_cacher got its turn to initialize sooner.

@bogdan-iancu
Member

@liviuchircu, it's quite likely that this is the culprit commit. And if I understand you correctly, you are saying that doing the load via IPC from worker 1 may add more latency than forcing the load in worker 1's child_init(), as the IPC task (fired by worker 1) may be delayed (as handling) due to the ongoing SIP traffic (while the forced load in child_init() is not subject to collisions with SIP traffic handling).

@liviuchircu
Member

liviuchircu commented Nov 8, 2021

Almost. What I'm trying to say is that the overall latency of loading the DB data is the same, coming from 2.2, but behavior has gradually shifted as follows:

  • 2.2: all is done on mod_init(). Startup time: T seconds. Cacher fully available 0 seconds after startup.
  • 2.4: cacher moves to child_init() loading, via worker 1 IPC. Startup time: T-X seconds (faster). Cacher fully available X seconds after startup.
  • 3.0: module A switches to child_init() loading, via worker 1 IPC. Startup time: T-X-Y seconds (even faster!). Cacher fully available Y+X seconds after startup, assuming "loadmodule sql_cacher" is last in script.
  • 3.1: module B switches to child_init() loading, via worker 1 IPC. Startup time: T-X-Y-Z seconds (fastest!). Cacher fully available Y+Z+X seconds after startup, assuming "loadmodule sql_cacher" is last in script.

So, overall, cacher is still fully ready after X seconds. But the changes have caused the potential window for inconsistent system state to gradually increase, to the point where @nikbyte and @GangZhuo are starting to process SIP traffic during this window and run into trouble.
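This shifting window can be sketched with a toy model; the timings below are made-up numbers chosen only to illustrate the timeline, not measurements of OpenSIPS:

```python
# Purely illustrative load times (seconds): X is sql_cacher's data load,
# Y and Z are the hypothetical modules A and B from the timeline above.
X, Y, Z = 3.0, 2.0, 1.0

# 2.2: everything loads in mod_init(), before SIP traffic is accepted.
startup_22, window_22 = X + Y + Z, 0.0

# 2.4: only sql_cacher loads post-fork, via worker 1's IPC queue.
startup_24, window_24 = Y + Z, X

# 3.1: all three load post-fork; sql_cacher's job is queued last.
startup_31, window_31 = 0.0, Y + Z + X

# Total work is constant across releases; only the split between
# "startup time" and "cache-not-ready window" shifts.
assert startup_22 + window_22 == startup_24 + window_24 == startup_31 + window_31
```

The invariant in the final assertion is the point of the comment above: the cache becomes ready after the same total delay, but an ever larger share of that delay now overlaps with live SIP processing.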

@liviuchircu
Member

liviuchircu commented Nov 8, 2021

@nikbyte and @GangZhuo: could you try placing the loadmodule "sql_cacher.so" line as the first loaded module in your scripts and see if the behavior improves?

This should reduce all startup delays to a minimum and ensure cacher is ready ASAP. And thanks to module dependencies, there should be no other ordering issues caused by this change (ideally!). Of course, this is not a permanent fix, but it should help you out while the issue is being solved.
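For illustration, the suggested reordering would look roughly like this in opensips.cfg; the modules listed after sql_cacher are placeholders for whatever the script already loads, in its existing order:

```
#### Module loading: sql_cacher first, so its data-load IPC job is
#### queued as early as possible after the workers fork.
loadmodule "sql_cacher.so"

# ... the rest of the modules, unchanged, e.g.:
loadmodule "db_postgres.so"
loadmodule "cachedb_redis.so"
loadmodule "tm.so"
```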

@bogdan-iancu
Member

IMO, the position of the module at startup should be irrelevant, as worker 1 only fires the IPC job, an operation taking no time. I would say that Y and Z are actually 0, as all the IPC jobs (data loading in different processes) are evenly distributed across the workers and so run in parallel (assuming, of course, that the number of OpenSIPS procs >= the number of modules loading via IPC).
As said, an impacting factor may be the presence of SIP traffic right at startup: if SIP traffic is to be handled, the workers may start doing SIP rather than consuming the IPC jobs, delaying the data loading. But a startup under zero or light SIP traffic should translate into Y and Z being 0.

@GangZhuo
Contributor Author

GangZhuo commented Nov 9, 2021

This is my fix: #2676

I don't know whether sql_res->col is available in all DB implementations; this PR is just a hint and may need further work.

@github-actions

Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.
