files: split update into batches #5552
Conversation
Hi, I didn't include the talloc-pool changes in this PR because currently I cannot reproduce the original performance improvements I have seen; I will investigate this further. I think #5091 is fixed as a side effect of this patch, but so far I haven't found a proper ticket for this PR. @alexey-tikhonov, do you have a suggestion, or shall I create a new ticket for the batch processing? As you can see, some delays and counts are hardcoded which could easily be made configurable, but so far I think there is no need for configuration options. bye,
So far when we talked about "batches" I thought we meant accumulating a few consecutive update events and processing them in a single "batch" (once) - to avoid an issue in case a user populates a very large /etc/passwd with a script. But this PR aims to fix another valid issue: a single operation during a single update taking > 3 * WD.
Hi, I'm sorry for not being more specific here. I meant the initial delay in the patch, which can be used to wait some time for more changes to the files before doing the real update.
Would you mind opening a suitable ticket and linking it to the related downstream tickets? Thanks. bye,
Hi,

Force-pushed from 31092b2 to f238da3 (compare).
Hi, the latest version only updates the commit message. bye,
DELETE_GROUPS,
READ_GROUPS,
SAVE_GROUPS,
UPDATE_FINISH,
UPDATE_FINISH and UPDATE_DONE sound very similar to me. Maybe UPDATE_COMPLETED instead of UPDATE_FINISH, or PROCESSING_COMPLETED?
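For orientation, the progression through these steps can be sketched as a simple enum walk. This is a sketch only, with a hypothetical `next_step` helper: the names are taken from the diff hunk and the debug logs quoted later in this thread, and in the PR a step can run several times (once per batch) before advancing.

```c
/* Sketch only, not the PR's code: step names come from the diff hunk and
 * the quoted debug logs; the real enum likely contains more steps, and a
 * step may repeat once per batch before the state machine advances. */
enum update_step {
    READ_USERS,
    DELETE_GROUPS,
    READ_GROUPS,
    SAVE_GROUPS,
    UPDATE_FINISH
};

/* Advance to the next batch step; UPDATE_FINISH is terminal. */
static enum update_step next_step(enum update_step s)
{
    return (s == UPDATE_FINISH) ? UPDATE_FINISH : (enum update_step)(s + 1);
}
```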
state->batch_size = 10;
state->obj_idx = 0;
state->file_idx = 0;
state->initial_delay = 100;
I think some meaningful define would be better here. Or at least add information about the unit of the value (us / ms).
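One possible shape for such defines, with hypothetical names; the values are taken from the hunk above, and the unit is microseconds on the assumption that the delay is passed as the usecs argument of `tevent_timeval_current_ofs(0, delay)` as in the hunk quoted later.

```c
/* Hypothetical names; the PR hardcodes these values inline.  The delay
 * unit is microseconds, since the value is used as the usecs argument of
 * tevent_timeval_current_ofs(0, delay). */
#define SF_UPDATE_BATCH_SIZE        10
#define SF_UPDATE_INITIAL_DELAY_US 100
```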
DEBUG(SSSDBG_CRIT_FAILURE,
      "Failed to add session recording attribute, ignored.\n");
}
ret = sysdb_transaction_commit(id_ctx->domain->sysdb);
Is it safe to stay in the transaction state for that long? The total time will be the same as before plus the time spent waiting for the tevent timers to trigger the function.
}

tv = tevent_timeval_current_ofs(0, state->initial_delay);
state->te = tevent_add_timer(id_ctx->be->ev, state, tv,
This is triggered by a signal that the file has been modified. What about the scenario where the file is modified multiple times in a very short time span (t << state->initial_delay)? Won't we create multiple tevent chains running in parallel?
I ran a small test:
- Changed the delay to 100ms from 100us
- Added some debug info every time sf_enum_files_first_step() is called
- Assigned each batch run a random number to track it in the logs
- Called
touch /etc/passwd & touch /etc/passwd & touch /etc/passwd & touch /etc/passwd &
The result was the creation of 3 almost parallel tracks of the batching procedure:
./sssd_files.log:3558:(2021-04-01 16:30:17): [be[files]] [sf_enum_files_first_step] (0x0200): BATCH step: READ_USERS, track: 10
./sssd_files.log:3559:(2021-04-01 16:30:17): [be[files]] [sf_enum_files_first_step] (0x0200): BATCH step: READ_USERS, track: 5
./sssd_files.log:3560:(2021-04-01 16:30:17): [be[files]] [sf_enum_files_first_step] (0x0200): BATCH step: READ_USERS, track: 76
./sssd_files.log:3561:(2021-04-01 16:30:17): [be[files]] [sf_enum_files_first_step] (0x0200): BATCH step: DELETE_GROUPS, track: 10
./sssd_files.log:3562:(2021-04-01 16:30:17): [be[files]] [sf_enum_files_first_step] (0x0200): BATCH step: DELETE_GROUPS, track: 5
./sssd_files.log:3563:(2021-04-01 16:30:17): [be[files]] [sf_enum_files_first_step] (0x0200): BATCH step: DELETE_GROUPS, track: 76
./sssd_files.log:3564:(2021-04-01 16:30:18): [be[files]] [sf_enum_files_first_step] (0x0200): BATCH step: READ_GROUPS, track: 10
./sssd_files.log:3620:(2021-04-01 16:30:18): [be[files]] [sf_enum_files_first_step] (0x0200): BATCH step: READ_GROUPS, track: 5
./sssd_files.log:3676:(2021-04-01 16:30:18): [be[files]] [sf_enum_files_first_step] (0x0200): BATCH step: READ_GROUPS, track: 76
./sssd_files.log:3732:(2021-04-01 16:30:18): [be[files]] [sf_enum_files_first_step] (0x0200): BATCH step: SAVE_GROUPS, track: 10
./sssd_files.log:3809:(2021-04-01 16:30:18): [be[files]] [sf_enum_files_first_step] (0x0200): BATCH step: SAVE_GROUPS, track: 5
./sssd_files.log:3832:(2021-04-01 16:30:18): [be[files]] [sf_enum_files_first_step] (0x0200): BATCH step: SAVE_GROUPS, track: 76
./sssd_files.log:3855:(2021-04-01 16:30:18): [be[files]] [sf_enum_files_first_step] (0x0200): BATCH step: SAVE_GROUPS, track: 10
./sssd_files.log:3939:(2021-04-01 16:30:18): [be[files]] [sf_enum_files_first_step] (0x0200): BATCH step: SAVE_GROUPS, track: 5
./sssd_files.log:3963:(2021-04-01 16:30:18): [be[files]] [sf_enum_files_first_step] (0x0200): BATCH step: SAVE_GROUPS, track: 76
./sssd_files.log:3987:(2021-04-01 16:30:18): [be[files]] [sf_enum_files_first_step] (0x0200): BATCH step: SAVE_GROUPS, track: 10
./sssd_files.log:4071:(2021-04-01 16:30:18): [be[files]] [sf_enum_files_first_step] (0x0200): BATCH step: SAVE_GROUPS, track: 5
./sssd_files.log:4095:(2021-04-01 16:30:18): [be[files]] [sf_enum_files_first_step] (0x0200): BATCH step: SAVE_GROUPS, track: 76
./sssd_files.log:4119:(2021-04-01 16:30:18): [be[files]] [sf_enum_files_first_step] (0x0200): BATCH step: SAVE_GROUPS, track: 10
./sssd_files.log:4203:(2021-04-01 16:30:18): [be[files]] [sf_enum_files_first_step] (0x0200): BATCH step: SAVE_GROUPS, track: 5
./sssd_files.log:4227:(2021-04-01 16:30:18): [be[files]] [sf_enum_files_first_step] (0x0200): BATCH step: SAVE_GROUPS, track: 76
./sssd_files.log:4251:(2021-04-01 16:30:18): [be[files]] [sf_enum_files_first_step] (0x0200): BATCH step: SAVE_GROUPS, track: 10
./sssd_files.log:4335:(2021-04-01 16:30:18): [be[files]] [sf_enum_files_first_step] (0x0200): BATCH step: SAVE_GROUPS, track: 5
./sssd_files.log:4359:(2021-04-01 16:30:18): [be[files]] [sf_enum_files_first_step] (0x0200): BATCH step: SAVE_GROUPS, track: 76
./sssd_files.log:4383:(2021-04-01 16:30:18): [be[files]] [sf_enum_files_first_step] (0x0200): BATCH step: SAVE_GROUPS, track: 10
./sssd_files.log:4428:(2021-04-01 16:30:18): [be[files]] [sf_enum_files_first_step] (0x0200): BATCH step: SAVE_GROUPS, track: 5
./sssd_files.log:4443:(2021-04-01 16:30:18): [be[files]] [sf_enum_files_first_step] (0x0200): BATCH step: SAVE_GROUPS, track: 76
./sssd_files.log:4458:(2021-04-01 16:30:18): [be[files]] [sf_enum_files_first_step] (0x0200): BATCH step: READ_GROUPS, track: 10
./sssd_files.log:4459:(2021-04-01 16:30:18): [be[files]] [sf_enum_files_first_step] (0x0200): BATCH step: READ_GROUPS, track: 5
./sssd_files.log:4460:(2021-04-01 16:30:18): [be[files]] [sf_enum_files_first_step] (0x0200): BATCH step: READ_GROUPS, track: 76
./sssd_files.log:4461:(2021-04-01 16:30:18): [be[files]] [sf_enum_files_first_step] (0x0200): BATCH step: UPDATE_FINISH, track: 10
./sssd_files.log:4489:(2021-04-01 16:30:18): [be[files]] [sf_enum_files_first_step] (0x0200): BATCH step: UPDATE_FINISH, track: 5
./sssd_files.log:4517:(2021-04-01 16:30:18): [be[files]] [sf_enum_files_first_step] (0x0200): BATCH step: UPDATE_FINISH, track: 76
With the default 100us timeout there is only a small chance this will happen from the command line or bash scripts, but it is still possible if someone sends inotify events fast enough some other way.
I think every new batching routine may want to check whether an old routine is still running and use this knowledge somehow (cancelling the old routine since the new one will overwrite its results, building on top of the old routine by merging results, etc.).
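The single-instance check suggested here could be sketched as a small decision function. All names are hypothetical, not the PR's code; it only illustrates the idea of coalescing rapid notifications into one batching chain.

```c
/* Hypothetical sketch: coalesce rapid file-change notifications so that
 * only one batching chain exists at a time. */
enum refresh_state { REFRESH_IDLE, REFRESH_WAITING, REFRESH_RUNNING };

enum refresh_action {
    ACTION_START_TIMER,      /* arm the initial-delay timer */
    ACTION_RESCHEDULE_TIMER, /* push the pending timer back; merge events */
    ACTION_MARK_RESTART      /* let the running chain finish, then rerun */
};

/* Decide what to do when a new modification notification arrives. */
static enum refresh_action on_file_modified(enum refresh_state state)
{
    switch (state) {
    case REFRESH_WAITING:
        return ACTION_RESCHEDULE_TIMER;
    case REFRESH_RUNNING:
        return ACTION_MARK_RESTART;
    case REFRESH_IDLE:
    default:
        return ACTION_START_TIMER;
    }
}
```

With such a scheme, the four near-simultaneous `touch` calls from the test above would arm one timer and merely reschedule it three times, instead of spawning three parallel tracks.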
What is the reason for these patches? To be able to deliver the domain-inconsistent message? If that's the case, you can send the message via the monitor connection.
IIUC, to avoid a single large blocking operation, see #5557
What do you mean?
IMHO in this case it would be better to do a diff update to improve performance first. It would also solve #4614. Of course it wouldn't solve the initial cache population, so some batching is still needed, but it will affect the design a lot, therefore it might be better to start with it.
The reason the message is not delivered is that we send it over
I did some testing comparing this PR with vanilla SSSD, with extra logging added. The test environment was an sssd-test-suite VM (1 CPU core, 2G RAM) running Fedora 31. TL;DR: estimated time needed to mark the domain as inconsistent:
Time consumed by calling operations on
Conclusion: While this PR shortens the time needed to mark a domain as inconsistent, it is still affected by the size of the processed files. To speed up marking the domain as inconsistent there are three options:
Why isn't this change constant? 9m -> 20m is worrying.
To be precise, it was 30 + 100 users and 30 + 500 users. I was using the default 30 /etc/passwd entries as a sort of zero reference.
Please take a look at this comment: #5552 (comment)
I modified this PR by adding an initial zero step which only starts the database transaction and returns to the tevent loop. Please take a look at the updated test results (previous test results included). Estimated time needed to mark the domain as inconsistent:
Time consumed by calling operations on
Conclusion: With the dummy zero step this PR will be able to solve #5091 by marking the domain as inconsistent fast enough.
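The zero-step idea reduces to: the first timer callback does almost nothing besides opening the transaction, so control returns to the event loop (where pending SBUS messages can be dispatched) before the heavy enumeration starts. A minimal sketch follows, with hypothetical names and the transaction and work simulated by flags; the real code would re-arm a tevent timer to get from one step to the next.

```c
/* Hypothetical sketch of the "zero step": the first callback only opens
 * the transaction (simulated by tx_open) and yields back to the event
 * loop; the expensive work starts on the next callback. */
enum step { STEP_ZERO, STEP_WORK, STEP_DONE };

static enum step run_step(enum step s, int *tx_open, int *work_done)
{
    switch (s) {
    case STEP_ZERO:
        *tx_open = 1;      /* start the sysdb transaction, nothing else */
        return STEP_WORK;  /* re-arm timer; event loop runs in between */
    case STEP_WORK:
        *work_done = 1;    /* the expensive enumeration happens here */
        return STEP_DONE;
    case STEP_DONE:
        break;
    }
    return STEP_DONE;
}
```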
Hi @pbrezina, this sounds like the proper solution for #5091. There were discussions about dropping the monitor and relying on systemd to start the SSSD components. I do not remember the current state of those discussions, but if we ever want to move that way this would be another task of the monitor which must be replaced. About @elkoniu's suggestion to wait with the processing of the files until the messages are sent: this can easily be combined with @alexey-tikhonov's original understanding of batches, where we should not immediately start processing when notified of a change but wait for e.g. 1s of inactivity on the files before starting. With this we should be in the event loop long enough to properly finish the communication with the NSS responder. But I guess you would still recommend sending the message via the monitor? bye,
Hi @pbrezina, where do you see the difficulties in adding a diff approach? The step that removes the cache has to be replaced by the diff step, and then only the entries which changed will be processed. Even with the diff approach we might need to process the changes in batches if there are many, to not block the backend for too long. bye,
Ideally, we want to spawn a private session of dbus-broker and use only one private message bus instead of running multiple. But so far we can use the monitor for situations like this. About the transition to systemd, I'm all for it. We already rely on systemd in some places anyway, so I think it is time to finally drop the monitor. But let's not discuss it here.
If the processing is delayed then the provider gets to the loop and sends the message, so the monitor is not needed in this case.
Let me ask a question first: if the domain is correctly marked as inconsistent, why do we need to have the files backend responsive?
Can we agree to fix #5091 as part of this PR with this zero-step approach rather than using the monitor message bus?
Hi, I meant "block" or "responsive" with respect to the watchdog. To avoid the backend being killed by the watchdog we should make sure to return to the main loop regularly when we know that processing the data needs more time. bye,
Ok, feel free to continue with this PR.
Hi, thanks for running this kind of test. The fallback to nss was not complete because the cache_req tried to find the related object in the cache. I added a new patch to cache_req to check whether the domain is inconsistent or not. @pbrezina, do you agree with 715b407? bye,
Just ran a test against 5000 users to check marking the domain as inconsistent. This PR can be considered a solution for #5091.
It looks like in the latest version this issue is solved. |
Another observation I have is that with 5000 users |
Two more issues exist so far: [ISSUE 1]
The process will remove 1 user and will hang with [ISSUE 2] - on top of the steps from [ISSUE 1]
userdel running on top of nss (SSSD stopped) still blocks and takes about 30s to remove a single user. It also looks like those two issues are related to each other. The open question is whether they are directly related to this PR, as in [ISSUE 2] SSSD is not running and it is on top of the "raw" files provider.
[ISSUE 1] and [ISSUE 2] described above are present on a master branch build too, so I do not think we are introducing any regression with this PR.
Hi, I think the latest issues you've found are not strictly related to the patches. The reason is that if there are many users and groups in the cache the refresh takes a considerable amount of time. It might be possible to mitigate this by increasing the initial delay before the refresh starts. Another way might be to speed up the refresh by not deleting and adding everything from scratch but with a diff-like approach; but as discussed earlier this should be part of a separate PR. It might also be possible to modify
bye,
To not block callers while SSSD's files provider is doing a refresh of /etc/passwd or /etc/group, allow falling back to the next nss module, which is typically libnss_files. Resolves: SSSD#5557 :config: Add new config option 'fallback_to_nss'
To avoid constant refreshes if /etc/passwd or /etc/group are modified multiple times in a short interval, the refresh is only started after 1s of inactivity. Additionally the request makes sure that only one instance is running. Resolves: SSSD#5557
To make sure current and valid data is used when a certificate should be matched to a user from the files provider, the request has to wait until a running refresh is finished. Resolves: SSSD#5557
If a domain is inconsistent the cached data might be inconsistent as well, so better not to return it. Resolves: SSSD#5557
Hi, the latest version just adds the
bye,
LGTM, a few minor comments.
case REFRESH_NOT_RUNNIG:
    break;
case REFRESH_WAITING_TO_START:
    refresh_ctx->passwd_start_again = true;
For the passwd refresh we have some extra debug messages which are not present for the group refresh.
refresh_ctx->certmap_req_list = NULL;
}

ret = check_state(refresh_ctx, flags);
This check_state in fact evaluates the state of the process.
@@ -34,6 +34,12 @@

 #include "providers/data_provider/dp.h"

 enum refresh_task_status {
I am not sure if this needs to be exposed in the header.
The thing is, IIRC, the real use of sss_cache here is to drop the mem-cache. Invalidation of the sysdb cache (for all domains!) is an undesirable side effect that does only harm - for the "files" provider the expiration time doesn't have any effect anyway.
My questions were addressed. Thank you. Covscan is clean.
Pushed PR: #5552 |
Hi @alexey-tikhonov, I think it is addressed by a part of this PR, commit eae03fb.
If the files managed by the files provider contain many users or groups,
processing them might take a considerable amount of time. To keep the
backend responsive this patch splits the update into multiple steps
running one after the other but returning to the main loop in between.
This avoids issues during startup because the watchdog timer state is
reset properly. Additionally SBUS messages are processed, and as a result
the domain can be marked inconsistent in the frontends properly.
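The splitting described in this commit message can be sketched, independently of tevent, as a resumable state that processes one bounded batch per invocation and reports whether more work remains. All names are hypothetical; the real code re-arms a tevent timer between calls instead of looping, which is what lets the watchdog and SBUS traffic run in between.

```c
#include <stddef.h>

/* Minimal sketch (not SSSD code): process items in fixed-size batches,
 * returning control to the caller between batches so that other events
 * (watchdog resets, bus messages) can be serviced in the meantime. */
struct batch_state {
    const int *items;
    size_t count;
    size_t idx;        /* next item to process */
    size_t batch_size;
    long   sum;        /* stands in for the real per-item work */
};

/* Process one batch; return 1 if more work remains, 0 when done. */
static int process_one_batch(struct batch_state *st)
{
    size_t end = st->idx + st->batch_size;
    if (end > st->count) {
        end = st->count;
    }
    while (st->idx < end) {
        st->sum += st->items[st->idx++];
    }
    return st->idx < st->count;
}
```

In the real provider each return to the caller corresponds to scheduling the next step with tevent_add_timer, so the event loop regains control after every batch.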