
Slow IPA login #3977

Closed
sssd-bot opened this issue May 2, 2020 · 0 comments
sssd-bot commented May 2, 2020

Cloned from Pagure issue: https://pagure.io/SSSD/sssd/issue/2936

  • Created at 2016-01-24 20:33:00 by mogthesprog
  • Closed as Invalid
  • Assigned to nobody

Hi,

I've done plenty of research on the internet looking for a solution to this problem, but I'm still pretty stuck.

I have a FreeIPA instance which contains 3 users and a similar number of groups. I've configured one small VM to authenticate against the IPA instance. However, login takes 4-5 seconds. I've attached my config file to the ticket; any help in solving this would be appreciated.

I've tested kinit; it's instant. When I execute "time ipa user-find admin", that query takes about 2.5 seconds. I think this may be the source of the problem.

I've also ensured that multicasting is disabled and that resolv.conf has the right nameservers.

Cheers,

Morgan

Comments


Comment from mogthesprog at 2016-01-24 20:33:20

attachment
sssd.conf


Comment from dpal at 2016-01-24 23:45:13

Sounds like a DNS issue.
How long does it take to resolve the IPA server from the test machine?
Also, the krb5.conf from the test machine might be helpful.


Comment from jhrozek at 2016-01-25 09:32:52

In general, looking at the delay in the logs might be worth the time; see https://fedorahosted.org/sssd/wiki/Troubleshooting

We're working on improving performance for 1.14, but a simple user lookup shouldn't take this long.


Comment from mogthesprog at 2016-01-25 09:46:00

Hi, thanks for the speedy response...

So I'm not sure of the best way to test the DNS response time, but I used dig and nslookup; the output is below.

[root@central-puppet morganj]# dig @10.0.0.4 central-ipa-001.globalscience.dh

; <<>> DiG 9.9.4-RedHat-9.9.4-29.el7_2.1 <<>> @10.0.0.4 central-ipa-001.globalscience.dh
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 57945
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 1, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;central-ipa-001.globalscience.dh. IN   A

;; ANSWER SECTION:
central-ipa-001.globalscience.dh. 1200 IN A     10.0.0.4

;; AUTHORITY SECTION:
globalscience.dh.       86400   IN      NS      central-ipa-001.globalscience.dh.

;; Query time: 1 msec
;; SERVER: 10.0.0.4#53(10.0.0.4)
;; WHEN: Mon Jan 25 08:30:12 UTC 2016
;; MSG SIZE  rcvd: 91

[root@central-puppet morganj]# time nslookup central-ipa-001.globalscience.dh
Server:         10.0.0.4
Address:        10.0.0.4#53

Name:   central-ipa-001.globalscience.dh
Address: 10.0.0.4


real    0m0.007s
user    0m0.002s
sys     0m0.004s
[root@central-puppet morganj]# time nslookup central-ipa-001
Server:         10.0.0.4
Address:        10.0.0.4#53

Name:   central-ipa-001.globalscience.dh
Address: 10.0.0.4


real    0m0.021s
user    0m0.005s
sys     0m0.001s

However, I then tried to use dig with the IP of the FreeIPA server and that took 170ms, which is an age. The second time I ran it the time came down, but it then fluctuates between 20ms and 50ms.

Also, I'll upload the krb5.conf now :) Let me know if you need anything else.


Comment from mogthesprog at 2016-01-25 09:49:11

Cheers jhrozek, I took a log for a login and I've kept the part where the delay takes place. I'll upload that too, just a sec.


Comment from mogthesprog at 2016-01-25 09:58:04

attachment
sssd_domain.log


Comment from mogthesprog at 2016-01-25 09:58:17

kerb config on the client
krb5.conf


Comment from mogthesprog at 2016-01-25 09:58:57

I just wanted to clarify that a kinit returns instantaneously. Cheers


Comment from sbose at 2016-01-25 10:39:56

Can you attach the krb5_child.log as well? It looks like most of the time is spent in the krb5_child.


Comment from sbose at 2016-01-25 10:41:09

Fields changed

cc: => sbose


Comment from mogthesprog at 2016-01-25 11:01:29

Is there a way to turn up the verbosity on krb5_child.log? Attaching at the current verbosity anyway...


Comment from mogthesprog at 2016-01-25 11:02:15

attachment
krb5_child.log


Comment from sbose at 2016-01-25 11:06:32

Just set debug_level=10 in the [domain/...] section of sssd.conf and restart SSSD; this will increase the debug level of krb5_child as well.
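
For reference, a minimal sketch of that sssd.conf change (the domain name here is taken from the DNS output earlier in the thread and may differ in your setup):

```ini
# /etc/sssd/sssd.conf -- raise verbosity for the domain and its child
# processes (krb5_child, ldap_child); restart SSSD afterwards.
[domain/globalscience.dh]
debug_level = 10
```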


Comment from mogthesprog at 2016-01-25 14:03:17

Hi guys, sorry for the delay; uploading the log now.


Comment from mogthesprog at 2016-01-25 14:03:51

krb5_child at log level 10
krb5_child.2.log


Comment from sbose at 2016-01-25 15:17:54

Thank you for the logs. To me it looks like there is no specific issue which makes the authentication slow. But in contrast to a simple kinit, SSSD's krb5_child does some extra operations to increase security, which result in some extra roundtrips between the client and the KDC.

First, SSSD uses FAST (see e.g. http://k5wiki.kerberos.org/wiki/Projects/FAST for details) to establish an encrypted channel to protect against certain brute-force attacks on captured network packets. For this, an additional authentication with a key from the host keytab is needed.

After the TGT for the user is returned, SSSD validates the ticket by requesting a service ticket for a service from the host keytab and checks if the returned service ticket can be decrypted with the related key from the keytab. This prevents attacks where a malicious KDC tries to hijack the connection.

You can disable those with 'krb5_use_fast = never' and 'krb5_validate = false' respectively, but I would recommend against it.
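
A sketch of how those two options would look in the [domain/...] section of sssd.conf (not recommended for production, as noted; the domain name is illustrative):

```ini
[domain/globalscience.dh]
# Disable the FAST encrypted channel (weakens protection against
# brute-force attacks on captured network packets).
krb5_use_fast = never
# Skip TGT validation against the host keytab (weakens protection
# against a malicious KDC hijacking the connection).
krb5_validate = false
```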

As a side note: when kinit shows the password prompt, all communication with the KDC is already done; the password is just needed to decrypt the response from the KDC.


Comment from mogthesprog at 2016-01-25 15:43:03

Okay, no worries. So by disabling krb5_use_fast, what exactly changes compared to a standard kinit? I just want to understand the implications before doing so. As it stands, a login at 4-5 seconds is quite a long time, especially when a standard kinit is instantaneous.

Thanks again for the quick response, guys. We're running a PoC on FreeIPA at the moment and the only issue (at the moment) is the duration of the logins.

Cheers,

Morgan


Comment from mogthesprog at 2016-01-25 15:46:01

Also, do these configs go under the domain/ section of sssd.conf?

I'm still wondering whether the two issues are related. I added the two Kerberos configs (just to see if they improved performance for the login) and they didn't seem to do much. Are we sure that the user lookup time is unrelated to the slow login?

The time for login also fluctuates quite a lot, on the order of a second or a second and a half. Any idea why that may be? Bear in mind that there are only 3 users in IPA, and only me authenticating at any one time.



Comment from sbose at 2016-01-25 17:59:09

To get reproducible results it might be best to run SSSD freshly started with an empty cache. Typically I run something like

systemctl stop sssd.service ; rm -f /var/lib/sss/db/* ; rm -f /var/log/sssd/* ; systemctl start sssd

to achieve this. Before running the test, please add debug_level=10 to the [pam] and [nss] sections as well.
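
The responder sections mentioned would then look roughly like this in sssd.conf:

```ini
# Raise verbosity for the NSS and PAM responders, in addition to
# the [domain/...] section already set to debug_level=10.
[nss]
debug_level = 10

[pam]
debug_level = 10
```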

I agree that 5s is quite long. I run FreeIPA and an SSSD client in different VMs on a laptop and see the login happen within the same second. What does your PoC look like? Are you running the test on the FreeIPA server or on a client?


Comment from mogthesprog at 2016-01-25 18:26:53

Thanks for that. I'm just on my way home from work now, so I'll either try later this evening or tomorrow morning. And I'll post back the logs :)

The PoC currently consists of two VMs in Azure running CentOS 7.2. I don't think I can know where in the cloud they sit, or whether they're on the same machine. In a month or so we plan to do the same thing in our DC, where we'll know a lot more about the environment. I'm wondering whether it is a DNS-related thing, from one of the previous comments. Do you run the DNS server on the same VM as the IPA server?

Cheers,

Morgan


Comment from dpal at 2016-01-25 20:10:18

The IPA server can have an integrated DNS server. We generally recommend it but do not require it.


Comment from mogthesprog at 2016-01-26 09:36:13

Okay, so I've done the recommended thing; my IPA server also runs DNS. I'll post back in half an hour or so with those extra logs.


Comment from mogthesprog at 2016-01-26 09:47:57

Okay, so as an aside, something strange is going on, maybe related to caching.

[root@central-puppet morganj]# time ssh -K morganj@central-ipa-001.globalscience.dh echo hello > /dev/null

real    0m2.309s
user    0m0.021s
sys     0m0.004s
[root@central-puppet morganj]# time ssh -K morganj@central-ipa-001.globalscience.dh echo hello > /dev/null

real    0m0.978s
user    0m0.032s
sys     0m0.009s
[root@central-puppet morganj]# time ssh -K morganj@central-ipa-001.globalscience.dh echo hello > /dev/null

real    0m2.709s
user    0m0.018s
sys     0m0.010s
[root@central-puppet morganj]# time ssh -K morganj@central-ipa-001.globalscience.dh echo hello > /dev/null

real    0m0.958s
user    0m0.022s
sys     0m0.005s


########
NOW I STARTED REPEATING THE TASK IMMEDIATELY AFTER IT FINISHED
########


[root@central-puppet morganj]# time ssh -K morganj@central-ipa-001.globalscience.dh echo hello > /dev/null

real    0m2.872s
user    0m0.015s
sys     0m0.005s
[root@central-puppet morganj]# time ssh -K morganj@central-ipa-001.globalscience.dh echo hello > /dev/null

real    0m0.912s
user    0m0.030s
sys     0m0.004s
[root@central-puppet morganj]# time ssh -K morganj@central-ipa-001.globalscience.dh echo hello > /dev/null

real    0m0.761s
user    0m0.011s
sys     0m0.004s
[root@central-puppet morganj]# time ssh -K morganj@central-ipa-001.globalscience.dh echo hello > /dev/null

real    0m1.253s
user    0m0.014s
sys     0m0.003s
[root@central-puppet morganj]# time ssh -K morganj@central-ipa-001.globalscience.dh echo hello > /dev/null

real    0m0.988s
user    0m0.029s
sys     0m0.007s
[root@central-puppet morganj]# time ssh -K morganj@central-ipa-001.globalscience.dh echo hello > /dev/null

real    0m0.713s
user    0m0.013s
sys     0m0.005s

Notice that as soon as I start repeating the task very quickly, the login happens at a normal speed. Is this an artifact of some other underlying thing, or could it be pointing us in the right direction? Just to note, this is with the two krb5 settings (validate and fast) DISABLED and selinux_provider=none, so the login should be much quicker than 3 seconds.

Just to clarify, I don't want to run with these settings disabled in production, I'm just trying to narrow down the choke point for authentication.



Comment from mogthesprog at 2016-01-26 10:27:05

attachment
krb5_child.log.ipaclient


Comment from mogthesprog at 2016-01-26 10:27:11

attachment
ldap_child.log.ipaclient


Comment from mogthesprog at 2016-01-26 10:27:21

attachment
sssd_globalscience.dh.log.ipaclient


Comment from mogthesprog at 2016-01-26 10:27:29

attachment
sssd.log.ipaclient


Comment from mogthesprog at 2016-01-26 10:27:36

attachment
sssd_nss.log.ipaclient


Comment from mogthesprog at 2016-01-26 10:27:43

attachment
sssd_pam.log.ipaclient


Comment from mogthesprog at 2016-01-26 10:32:38

attachment
ldap_child.log.ipaserver


Comment from mogthesprog at 2016-01-26 10:32:45

attachment
sssd.log.ipaserver


Comment from mogthesprog at 2016-01-26 10:32:56

attachment
sssd_globalscience.dh.log.ipaserver


Comment from mogthesprog at 2016-01-26 10:33:08

attachment
sssd_nss.log.ipaserver


Comment from mogthesprog at 2016-01-26 10:33:17

attachment
sssd_pam.log.ipaserver


Comment from mogthesprog at 2016-01-26 10:33:25

attachment
sssd_ssh.log.ipaserver


Comment from mogthesprog at 2016-01-26 10:34:53

I've uploaded all of the logs from both server and client. The logs and cache were cleared prior to the login on the "central-puppet" client.

The logs are for one login, and a few seconds of inactivity either side.

Thanks again for all the help guys :)


Comment from sbose at 2016-01-26 10:39:07

With every new login, SSSD tries to read all group memberships of the user from the server to make sure that e.g. access control is done with the most recent data. There is a small timeout of 5s (pam_id_timeout) during which no group lookups happen and SSSD will use the data from the cache. I think this is what you are seeing in your tests.

Nevertheless, given the small number of users and groups, 2+ seconds seems quite long (I assume you already have a valid Kerberos service ticket, so that the ssh client does not need to connect to the KDC in the first test to get one). I'll check the logs and come back to you.
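
The timeout described above is itself configurable; a sketch of the relevant sssd.conf fragment (5 seconds is the default; raising it trades freshness of group data for fewer online lookups on rapid successive logins):

```ini
[pam]
# Seconds after a login during which the cached identity (including
# group memberships) is reused instead of an online lookup.
pam_id_timeout = 5
```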


Comment from mogthesprog at 2016-01-26 11:14:46

Yep, that's correct. I did a kinit before issuing ssh -K, and assuming the defaults are the usual Kerberos ticket expirations, the ticket shouldn't need renewing.

Thanks sbose


Comment from mogthesprog at 2016-01-28 21:26:58

Hi Guys,

Sbose, any idea from the logs what's causing the delay?

Cheers,

Morgan


Comment from sbose at 2016-01-29 12:31:01

Unfortunately each step happens within a single second, so the timestamps do not really help to identify which operation is slow.

08:49:22: the user is looked up in the nss responder, i.e. sshd is checking if the user exists

... I guess you typed the password here ...

08:49:24: authentication request enters the PAM responder and the request to look up the group memberships is sent to the backend

08:49:25: group-membership lookup is done, authentication request is sent to the backend

08:49:26: authentication result is sent back to sshd and sshd sends an access control request

08:49:27: access control result is sent back to sshd and sshd calls the remaining PAM steps, which are basically no-ops and happen in the same second.

There is an option, debug_microseconds, to get better time resolution.

Additionally, to rule out disk I/O effects you might want to try putting the SSSD cache on tmpfs, as described at https://jhrozek.wordpress.com/2015/08/19/performance-tuning-sssd-for-large-ipa-ad-trust-deployments/ under the section 'Mount the cache in tmpfs'.
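
The tmpfs approach from that post amounts to an /etc/fstab entry along these lines (the size and SELinux context shown here are illustrative, not taken from this ticket; stop SSSD and clear the cache before mounting, and note the cache will not survive a reboot on tmpfs):

```
# /etc/fstab -- keep the SSSD cache in RAM to avoid disk I/O latency
tmpfs /var/lib/sss/db tmpfs size=300M,mode=0700,rootcontext=system_u:object_r:sssd_var_lib_t:s0 0 0
```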


Comment from mogthesprog at 2016-01-30 17:31:19

Hi Sbose,

Thanks for this link. I implemented the tmpfs cache and it's performing a lot more consistently. This caused me to test I/O with dd, and it seems Azure disk performance is a bit shocking (likely due to a noisy neighbour of some sort).

As a result, I'll implement this on two VMs inside our own network and then take another look at authentication time.

We can close this for now if you like, unless you want those microsecond logs anyway? I can do those tests on Monday morning.

If, after I've implemented this internally, I still have the same problem, I'll come back and reopen this ticket.

Thanks so much for your help; I've learnt a lot about SSSD. I appreciate it.

Cheers,

Morgan


Comment from jhrozek at 2016-01-31 21:12:18

Before Sumit closes the issue, I would just like to point out that the cache performance is something we will be working on implementing during the 1.14 development cycle.


Comment from sbose at 2016-02-01 09:40:00

Thank you for the feedback. With the results of the research you did I think the additional logs are not needed anymore.

resolution: => worksforme
status: new => closed


Comment from mogthesprog at 2017-02-24 14:31:00

Metadata Update from @mogthesprog:

  • Issue set to the milestone: NEEDS_TRIAGE