dialog table corrupted when out of shared memory #311

Closed
mikomarrache opened this issue Aug 22, 2014 · 1 comment
@mikomarrache

Hi,

We recently hit a strange issue: the dialog table on our servers got corrupted after one instance ran out of shared memory.

These memory errors appeared in the logs:

xxxxxxx[6439]: WARNING:core:fm_malloc: Not enough free memory, will attempt defragmentation.
xxxxxxx[6439]: ERROR:tm:sip_msg_cloner: no more share memory
xxxxxxx[6439]: ERROR:tm:new_t: out of mem
xxxxxxx[6439]: ERROR:tm:t_newtran: new_t failed

xxxxxxx[6450]: WARNING:core:fm_malloc: Not enough free memory, will attempt defragmentation.
xxxxxxx[6450]: ERROR:tm:build_local: no more share memory
xxxxxxx[6450]: ERROR:tm:send_ack: failed to build ACK
xxxxxxx[6450]: ERROR:tm:reply_received: failed to send ACK (local=no)
xxxxxxx[6450]: ERROR:dialog:push_reply_in_dialog: missing TAG param in TO hdr :-/

After that, the dialog table contained a large number of duplicated dialogs. I am sure these duplicates were added after the memory errors occurred. Analyzing the content of the dialog table, I found that most calls followed the same pattern: one initial dialog created from the script (with a timestamp matching the time it was created) and multiple duplicates of that dialog (with timestamps later than the first error). The duplicates hold the same data as the initial dialog except for the id (auto increment), timeout and timestamp columns. Please note that the dlg_id column of each duplicate was identical to that of the initial dialog.
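
For anyone wanting to check a table for the same pattern, here is a minimal sketch of the grouping query, run against an in-memory SQLite table. The schema is a simplified stand-in and not the real dialog table: the `created` column and the sample rows are assumptions made up for illustration.

```python
import sqlite3

# Simplified stand-in for the dialog table (hypothetical columns, not the real schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dialog ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " dlg_id INTEGER NOT NULL,"
    " callid TEXT NOT NULL,"
    " created INTEGER NOT NULL)"
)

# One call with an initial row plus two later duplicates, and one healthy call.
sample_rows = [
    (1001, "call-a", 100),
    (1001, "call-a", 250),
    (1001, "call-a", 260),
    (2002, "call-b", 120),
]
conn.executemany(
    "INSERT INTO dialog (dlg_id, callid, created) VALUES (?, ?, ?)", sample_rows
)


def find_duplicates(connection):
    """Return (dlg_id, number of rows, earliest timestamp) for every repeated dlg_id."""
    return connection.execute(
        "SELECT dlg_id, COUNT(*) AS copies, MIN(created) AS first_seen"
        " FROM dialog GROUP BY dlg_id HAVING COUNT(*) > 1 ORDER BY dlg_id"
    ).fetchall()


duplicates = find_duplicates(conn)
print(duplicates)
```

Only the dlg_id that appears more than once is reported, together with the timestamp of its earliest row, which would correspond to the initial dialog.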

We had only just started applying the change that adds the new dlg_id column, so the id column was still present and defined as the primary key. The dlg_id column was not defined as a primary key, so inserting duplicated dialogs did not generate any error on the database side.
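
The effect of the missing key can be reproduced in isolation. The sketch below uses SQLite and a hypothetical two-column schema (not the real dialog table) to show that a second insert with the same dlg_id goes through silently unless the column carries a uniqueness constraint.

```python
import sqlite3


def insert_twice(dlg_id_is_unique):
    """Insert the same dlg_id twice; return (rows stored, whether the DB rejected one)."""
    conn = sqlite3.connect(":memory:")
    constraint = " UNIQUE" if dlg_id_is_unique else ""
    # Hypothetical schema: auto-increment id as primary key, dlg_id as a plain column.
    conn.execute(
        "CREATE TABLE dialog ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        f" dlg_id INTEGER NOT NULL{constraint},"
        " callid TEXT NOT NULL)"
    )
    rejected = False
    try:
        for _ in range(2):
            conn.execute(
                "INSERT INTO dialog (dlg_id, callid) VALUES (?, ?)", (1001, "call-a")
            )
    except sqlite3.IntegrityError:
        # Raised only when dlg_id is constrained to be unique.
        rejected = True
    count = conn.execute("SELECT COUNT(*) FROM dialog").fetchone()[0]
    conn.close()
    return count, rejected


print(insert_twice(dlg_id_is_unique=False))
print(insert_twice(dlg_id_is_unique=True))
```

With the constraint in place the duplicate insert fails loudly, which is the error that would have exposed the problem on the database side.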

At first I thought the duplicated dialogs were really created in memory, but if that were the case they would have different dlg_id values (the hash_entry would be the same because the CallID is the same, but the hash_id would be different).
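
To make that reasoning concrete, here is a minimal sketch that assumes dlg_id packs hash_entry into the upper 32 bits and hash_id into the lower 32 bits. That layout is an assumption for illustration and is not confirmed anywhere in this thread; the function names are hypothetical.

```python
MASK_32 = 0xFFFFFFFF


def make_dlg_id(hash_entry, hash_id):
    """Pack the two values into one integer (assumed layout: entry high, id low)."""
    return ((hash_entry & MASK_32) << 32) | (hash_id & MASK_32)


def split_dlg_id(dlg_id):
    """Recover (hash_entry, hash_id) from a packed dlg_id."""
    return dlg_id >> 32, dlg_id & MASK_32


# Same Call-ID means same hash_entry, but two dialogs created separately
# in memory would receive different hash_id values.
first = make_dlg_id(hash_entry=7, hash_id=1)
second = make_dlg_id(hash_entry=7, hash_id=2)
print(first != second)
print(split_dlg_id(second))
```

Under that assumption, two dialogs really created in memory for the same call could not share a dlg_id, so identical dlg_id values point to the same dialog being written more than once.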

This scenario is very bad: our monitoring system detected that opensips was not responding and tried to restart it. However, there were more than 300K dialogs in the table (only around 5K were good dialogs), so the load_dialog_from_db function that is executed at startup took too much time and memory. During this time opensips was not able to respond to incoming requests, so the monitoring system kept restarting it again and again.

I tried to examine the code to understand what may have caused the duplication but I didn't find anything. I am sure the timer process added each one of the duplicated dialogs since the auto_increment primary key is different for each duplicated dialog.

Regards,
Mickael

@bogdan-iancu bogdan-iancu added this to the 1.12 milestone Aug 24, 2014
@mikomarrache

Hi,

I just found it was caused by an error while importing public changes.

The DLG_FLAG_NEW dialog flag was being saved in the database. As a result, it was fetched when the dialogs were loaded, and the timer process tried to insert those dialogs into the database again. Because the dlg_id column was not set as the primary key, the table got bigger and bigger with a lot of duplicate dialogs.
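
The feedback loop can be illustrated with a toy model. This is not the opensips code: the bit value of `DLG_FLAG_NEW`, the function names and the list standing in for the table are all assumptions for illustration. The model only captures the rule described above, namely that the timer inserts every in-memory dialog still carrying the "new" flag.

```python
DLG_FLAG_NEW = 1 << 0  # assumed bit value, for illustration only


def timer_flush(dialogs, table, persist_new_flag):
    """Insert every in-memory dialog still marked as new, then clear the flag in memory."""
    for dlg in dialogs:
        if dlg["flags"] & DLG_FLAG_NEW:
            stored_flags = dlg["flags"]
            if not persist_new_flag:
                # Correct behaviour: never write the transient flag to the table.
                stored_flags &= ~DLG_FLAG_NEW
            table.append({"dlg_id": dlg["dlg_id"], "flags": stored_flags})
            dlg["flags"] &= ~DLG_FLAG_NEW


def load_from_db(table):
    """Model of startup loading: every stored row becomes an in-memory dialog."""
    return [dict(row) for row in table]


def simulate(restarts, persist_new_flag):
    """Return the number of rows in the table after the given number of restarts."""
    table = []  # no uniqueness on dlg_id, as in the scenario above
    dialogs = [{"dlg_id": 1001, "flags": DLG_FLAG_NEW}]
    timer_flush(dialogs, table, persist_new_flag)
    for _ in range(restarts):
        dialogs = load_from_db(table)
        timer_flush(dialogs, table, persist_new_flag)
    return len(table)


print(simulate(restarts=3, persist_new_flag=False))
print(simulate(restarts=3, persist_new_flag=True))
```

In this toy model the row count stays at one when the flag is stripped before saving, and doubles on every restart when it is persisted, because each reloaded row is treated as new and inserted again.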

Sorry to waste your time.

Regards,
Mickael
