
[BUG] sql_cacher get incorrect string value, after restart opensips #2649

Closed
GangZhuo opened this issue Oct 11, 2021 · 15 comments · Fixed by #2676

@GangZhuo
Contributor

GangZhuo commented Oct 11, 2021

OpenSIPS version you are running

version: opensips 3.3.0-dev (x86_64/linux)
flags: STATS: On, DISABLE_NAGLE, USE_MCAST, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, HP_MALLOC, DBG_MALLOC, CC_O0, FAST_LOCK-ADAPTIVE_WAIT
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535
poll method support: poll, epoll, sigio_rt, select.
git revision: b4061d3d7
main.c compiled on 11:53:33 Oct 11 2021 with gcc 8

Describe the bug

sql_cacher returns an incorrect string value after OpenSIPS is restarted.

To Reproduce

modparam("sql_cacher", "cache_table",
"id=cache_passwd
db_url=postgres://uuu:123@192.168.1.77:5432/table_name
cachedb_url=redis:passwd://:123@127.0.0.1:6379/
table=user_info
key=id
key_type=int
columns=password
on_demand=1
expire=3600")

# Get password from database.
$var(password) = $sql_cached_value(cache_passwd:password:$var(user_id));

Example:

If we have a user with id=1000 and password="1234".

sql_cacher automatically creates a new Redis key "passwd1000" with the value "AAAAAA==EAAAAA==1234",
and $sql_cached_value(cache_passwd:password:1000) returns the correct value "1234".

After restarting OpenSIPS, $sql_cached_value(cache_passwd:password:1000) returns the incorrect value "16".

Additional context

  1. There is no problem when on_demand=0.
  2. Possible cause: after restarting OpenSIPS, sql_cacher did not reload the values from the database, so it incorrectly defaulted the field to an integer type and then returned the offset of the string value, which is 16 in "AAAAAA==EAAAAA==1234".
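The offset theory in point 2 is consistent with the observed data. A speculative decode is sketched below; the layout (two base64-encoded 4-byte little-endian integers as a header, followed by the raw string payload) is an assumption inferred from the visible value, not taken from the sql_cacher source:

```python
import base64
import struct

# Cached value observed in Redis for key "passwd1000"
raw = "AAAAAA==EAAAAA==1234"

# Assumed layout (unverified): two base64-encoded little-endian int32
# header fields, then the string payload.
first = struct.unpack("<i", base64.b64decode(raw[0:8]))[0]
offset = struct.unpack("<i", base64.b64decode(raw[8:16]))[0]

print(first)        # 0
print(offset)       # 16 -- matches the bogus "16" returned after restart
print(raw[offset:]) # "1234" -- the real password
```

Under this reading, "16" is simply the header's payload offset returned verbatim when the field is misinterpreted as an integer.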
@github-actions

Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.

@github-actions github-actions bot added the stale label Oct 26, 2021
@kertor

kertor commented Nov 1, 2021

Hello! I have the same problem on opensips 3.1:

version: opensips 3.1.6 (x86_64/linux)
flags: STATS: On, DISABLE_NAGLE, USE_MCAST, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, HP_MALLOC, DBG_MALLOC, FAST_LOCK-FUTEX-ADAPTIVE_WAIT
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535
poll method support: poll, epoll, sigio_rt, select.
git revision: a162f8859
main.c compiled on  with gcc 9
  1. As far as I can see (comparing with 3.0), on 3.1 "sql_cacher" caches the values a few seconds after the OpenSIPS restart.
  2. If calls arrive before "sql_cacher" has cached the values after OpenSIPS starts, we see these errors:
Oct 29 17:00:55 ts-devdev01-1 opensips[236885]: Oct 29 17:00:54 [236885] ERROR:sql_cacher:get_rld_vers_from_cache: Failed to get reload version integer from cachedb
Oct 29 17:00:55 ts-devdev01-1 opensips[236885]: Oct 29 17:00:54 [236885] ERROR:sql_cacher:pv_get_sql_cached_value: Error fetching from cachedb

and after this, "sql_cacher" will not cache the values and will always return 16.

Could you please check it? Thank you!

@stale stale bot removed the stale label Nov 1, 2021
@nikbyte
Member

nikbyte commented Nov 1, 2021

+1

@nikbyte
Member

nikbyte commented Nov 1, 2021

I have the same issue on 3.1, but not on 3.0 with the same script.

@nikbyte nikbyte added the bug label Nov 1, 2021
@liviuchircu
Member

In order to extract the commit diff between latest 3.0 and 3.1, we can do:

git log d19b20e2e737db95149f5f42f9b78e77df09412b..3.1 modules/sql_cacher

... which produces: https://pastebin.com/e1H24uSS

Out of these, there is one commit which stands out and seems directly related to the issue here:

commit 7b62f2453bd226ec08b4f229017fa953d3876681
Author: Bogdan-Andrei Iancu <bogdan@opensips.org>
Date:   Fri Mar 20 20:43:37 2020 +0200

    The main process (attendant) is no longer running the child_init()
    
    The main proc must stay free of any db connections, so it can later fork and child_init() other procs (due auto_scaling).
    For the modules which were keeping a DB conn open in main proc for later usage, during module destroy, the new approach is to open the conn, on the spot, in the destroy handler.
    
    Related to #2003

Maybe this can serve as a good starting point for what needs to be investigated here.

@nikbyte
Member

nikbyte commented Nov 1, 2021

@liviuchircu Thank you. I tried to roll back these changes using the following patch:

diff --git a/main.c b/main.c
index 0e1cd7e31..c5d6d836e 100644
--- a/main.c
+++ b/main.c
@@ -840,7 +840,7 @@ static int main_loop(void)

 	if (testing_framework) {
 		if (init_child(1) < 0) {
-			LM_ERR("error in init_child for First Worker\n");
+			LM_ERR("error in init_child for PROC_MAIN\n");
 			report_failure_status();
 			goto error;
 		}
@@ -849,6 +849,12 @@ static int main_loop(void)
 		shutdown_opensips(rc);
 	}

+	if (init_child(PROC_MAIN) < 0) {
+		LM_ERR("error in init_child for PROC_MAIN\n");
+		report_failure_status();
+		goto error;
+	}
+
 	report_conditional_status( (!no_daemon_mode), 0);

 	if (auto_scaling_enabled) {
--- a/modules/sql_cacher/sql_cacher.c
+++ b/modules/sql_cacher/sql_cacher.c
@@ -1928,9 +1928,17 @@ static void free_c_entry(cache_entry_t *c)

 static void destroy(void)
 {
+	db_handlers_t *db_hdls;
 	struct queried_key *q_it, *q_tmp;
 	cache_entry_t *c_it, *c_tmp;

+	for(db_hdls = db_hdls_list; db_hdls; db_hdls = db_hdls->next) {
+		if (db_hdls->cdbcon)
+			db_hdls->cdbf.destroy(db_hdls->cdbcon);
+		if (db_hdls->db_con)
+			db_hdls->db_funcs.close(db_hdls->db_con);
+	}
+
 	q_it = *queries_in_progress;
 	while (q_it) {
 		q_tmp = q_it;

and it fixed the issue.

@nikbyte
Member

nikbyte commented Nov 1, 2021

Could you please fix it properly in code now?

@bogdan-iancu
Member

Logically speaking, only the init_child(PROC_MAIN) change may induce any kind of change here; the rest of the changes are NOPs.
Now, checking the child_init() of the sql_cacher module, I do not see any potential behavior changer. For PROC_MAIN, nothing is done other than opening some DB connections (which are useless for the module anyway).

Basically, I fail to see how the removal of init_child(PROC_MAIN) may impact the module's behavior.

@liviuchircu
Member

liviuchircu commented Nov 8, 2021

@bogdan-iancu Indeed, looking closer, it doesn't look like that commit could cause the issue. Perhaps the problem originated somewhere in the past, say here, in a commit which is on both 3.0 and 3.1:

commit 71960b917deef997799c2abbd80cae2182642341
Author: rvlad-patrascu <vladp@opensips.org>
Date:   Thu Jan 11 20:25:41 2018 +0200

    sql_cacher: do data loading for full caching in the first child process
    
    Speed up OpenSIPS starting sequence by doing the full caching data load
    in the first process, after child_init is done (via RPC), instead of
    during module init.

However, with more and more "stuff to load" accumulating on worker #1's IPC startup queue in OpenSIPS 3.1 and later releases, we were led to believe that the issue was introduced in 3.1, when in fact it had been there all along (ever since the switch to async initialization). There was simply a smaller window for the race condition ("cache version initialization" vs. "SIP traffic processing") when there was less to load post-fork, so sql_cacher got its turn to initialize sooner.

@bogdan-iancu
Member

@liviuchircu, it's quite likely that this is the culprit commit. And if I understand you correctly, you are saying that doing the load via IPC from worker 1 may add more latency than forcing the load in worker 1's child_init(), as the IPC task (fired by worker 1) may be delayed (as handling) due to the ongoing SIP traffic (while the forced load in child_init() is not subject to collisions with SIP traffic handling).

@liviuchircu
Member

liviuchircu commented Nov 8, 2021

Almost. What I'm trying to say is that the overall latency of loading the DB data is the same, coming from 2.2, but behavior has gradually shifted as follows:

  • 2.2: all is done on mod_init(). Startup time: T seconds. Cacher fully available 0 seconds after startup.
  • 2.4: cacher moves to child_init() loading, via worker 1 IPC. Startup time: T-X seconds (faster). Cacher fully available X seconds after startup.
  • 3.0: module A switches to child_init() loading, via worker 1 IPC. Startup time: T-X-Y seconds (even faster!). Cacher fully available Y+X seconds after startup, assuming "loadmodule sql_cacher" is last in script.
  • 3.1: module B switches to child_init() loading, via worker 1 IPC. Startup time: T-X-Y-Z seconds (fastest!). Cacher fully available Y+Z+X seconds after startup, assuming "loadmodule sql_cacher" is last in script.

So, overall, cacher is still fully ready after X seconds. But the changes have caused the potential window for inconsistent system state to gradually increase, to the point where @nikbyte and @GangZhuo are starting to process SIP traffic during this window and run into trouble.
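This shifting window can be sketched with a toy model; the timings below are made-up numbers chosen only to illustrate the timeline, not measurements of OpenSIPS:

```python
# Purely illustrative load times (seconds): X is sql_cacher's data load,
# Y and Z are the hypothetical modules A and B from the timeline above.
X, Y, Z = 3.0, 2.0, 1.0

# 2.2: everything loads in mod_init(), before SIP traffic is accepted.
startup_22, window_22 = X + Y + Z, 0.0

# 2.4: only sql_cacher loads post-fork, via worker 1's IPC queue.
startup_24, window_24 = Y + Z, X

# 3.1: all three load post-fork; sql_cacher's job is queued last.
startup_31, window_31 = 0.0, Y + Z + X

# Total work is constant across releases; only the split between
# "startup time" and "cache-not-ready window" shifts.
assert startup_22 + window_22 == startup_24 + window_24 == startup_31 + window_31
```

The invariant in the final assertion is the point of the comment above: the cache becomes ready after the same total delay, but an ever larger share of that delay now overlaps with live SIP processing.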

@liviuchircu
Member

liviuchircu commented Nov 8, 2021

@nikbyte and @GangZhuo: could you try placing the loadmodule "sql_cacher.so" line as the first loaded module in your scripts and see if the behavior improves?

This should reduce all startup delays to a minimum and ensure cacher is ready ASAP. And thanks to module dependencies, there should be no other ordering issues caused by this change (ideally!). Of course, this is not a permanent fix, but it should help you out while the issue is being solved.
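For illustration, the suggested reordering would look roughly like this in opensips.cfg; the modules listed after sql_cacher are placeholders for whatever the script already loads, in its existing order:

```
#### Module loading: sql_cacher first, so its data-load IPC job is
#### queued as early as possible after the workers fork.
loadmodule "sql_cacher.so"

# ... the rest of the modules, unchanged, e.g.:
loadmodule "db_postgres.so"
loadmodule "cachedb_redis.so"
loadmodule "tm.so"
```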

@bogdan-iancu
Member

IMO, the position of the module at startup should be irrelevant, as worker 1 only fires the IPC job, an operation taking no time. I would say that Y and Z are actually 0, as all the IPC jobs (data loading in different processes) are evenly distributed across the workers and so run in parallel (assuming, of course, that the number of OpenSIPS procs >= the number of modules loading via IPC).
As said, an impacting factor may be the presence of SIP traffic right at startup: if SIP traffic is to be handled, the workers may start doing SIP rather than consuming the IPC jobs, delaying the data loading. But a startup under zero or light SIP traffic should translate into Y and Z being 0.

@GangZhuo
Contributor Author

GangZhuo commented Nov 9, 2021

This is my fix: #2676

I don't know whether sql_res->col is available in all DB implementations; this PR is just a hint and may need further work.

@github-actions

Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.
