Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

clock thread safety: was hr etime by default #2375

Closed
389-ds-bot opened this issue Sep 13, 2020 · 25 comments
Closed

clock thread safety: was hr etime by default #2375

389-ds-bot opened this issue Sep 13, 2020 · 25 comments
Labels
closed: fixed Migration flag - Issue complex Fix is complex
Milestone

Comments

@389-ds-bot
Copy link

Cloned from Pagure issue: https://pagure.io/389-ds-base/issue/49316


Issue Description

During an investigation it became apparent that DS clock handling was not thread safe. In an attempt to "speed up" clock access, the architecture used a time thread that would wake every second (or more) and update a global value in time.c. However, this value was not atomically updated, or atomically read. To update this to use atomics would likely cause a performance regression. Worse, these values were relied upon by csngen to generate csns for replication.

With hr log timestamps we showed that ns precision is virtually free (if not faster) on linux today. With this in mind, there is no reason not to enable HR etime by default now. As well, by using clock_gettime in calls to the clock, we take advantage of VDSO's in clock_gettime, that are going to be faster than atomics. Finally, we can use monotonic clock for timers, and utc for all other calls to remove performance and timezone issues.

@389-ds-bot 389-ds-bot added closed: fixed Migration flag - Issue complex Fix is complex labels Sep 13, 2020
@389-ds-bot 389-ds-bot added this to the 1.3.7.0 milestone Sep 13, 2020
@389-ds-bot
Copy link
Author

Comment from firstyear (@Firstyear) at 2017-07-07 07:04:07

Metadata Update from @Firstyear:

  • Issue assigned to Firstyear

@389-ds-bot
Copy link
Author

Comment from firstyear (@Firstyear) at 2017-07-10 03:41:31

Metadata Update from @Firstyear:

  • Custom field type adjusted to defect
  • Issue priority set to: major
  • Issue set to the milestone: 1.3.7.0
  • Issue tagged with: Complex

@389-ds-bot
Copy link
Author

Comment from firstyear (@Firstyear) at 2017-07-10 03:51:20

Additionally, since it's been 24 years since clock_gettime was released, I'm making it a hard requirement of DS from now.

@389-ds-bot
Copy link
Author

Comment from tbordaz (@tbordaz) at 2017-07-10 11:56:34

Regarding replication, we have only one time thread that periodically (each second) calls time(&currenttime). If the same thread calls time(&t2) after time(&t1), is there a chance that t1>t2 ?
If not then all DS components (including csn generation) that rely on the timethread are protected.

Now the others components that directly call time() are impacted and there are many of them.
I thought that time(2) was thread safe, so it is not ?

@389-ds-bot
Copy link
Author

Comment from firstyear (@Firstyear) at 2017-07-10 12:06:26

time(2) is safe. the time thread is not. The write to the g_current_time was not atomic, and the read from current_time was also not atomic. As a result, it's possible for the update from the time thread to never be made visible to other threads (or more practically, it's delayed until the next barrier, and you have to hope it's a store one else our time thread write is blown away), and on the read side, we may never see the update to the time (or more practically, until by some accident we get a barrier elsewhere, assuming the write thread actually barriered as well). So as a result, this is not a safe interface.

Linux clock_gettime is implemented as a VDSO, so there is almost 0 latency to just calling clock_gettime, so instead my fix will wrap this.

@389-ds-bot
Copy link
Author

Comment from firstyear (@Firstyear) at 2017-07-10 12:07:56

Ohh and even better @tbordaz - most of our timeouts were only accurate to a second, so if you happen to start an op with a time out of 1 second, at the time point "0.9" second, and then the op runs to "1.000001" second, you'll be timed out - even though you only were in the op for 0.1 second, the "timeout" sees the second change and says "nope"

@389-ds-bot
Copy link
Author

Comment from firstyear (@Firstyear) at 2017-07-12 06:04:30

0001-Ticket-49316-Fix-clock-unsafety-in-DS.patch

@389-ds-bot
Copy link
Author

Comment from firstyear (@Firstyear) at 2017-07-12 06:04:31

Metadata Update from @Firstyear:

  • Custom field reviewstatus adjusted to review

@389-ds-bot
Copy link
Author

Comment from firstyear (@Firstyear) at 2017-07-12 06:05:22

This patch has been extensively tested with the full test suite twice now.

@389-ds-bot
Copy link
Author

Comment from mreynolds (@mreynolds389) at 2017-07-13 04:29:33

A few minor issues....

indentation issue:

 498 diff --git a/ldap/servers/plugins/chainingdb/cb_search.c b/ldap/servers/plugins/chainingdb/cb_search.c
 499 index dad2785..e2aa163 100644
 500 --- a/ldap/servers/plugins/chainingdb/cb_search.c
 501 +++ b/ldap/servers/plugins/chainingdb/cb_search.c
 502 @@ -40,9 +40,8 @@ chainingdb_build_candidate_list ( Slapi_PBlock *pb )
 503  »       const char          *target = NULL;
 504  »       char                *filter;
 505  »       char                **attrs = NULL;
 506 -»       struct timeval      timeout;
 507 -»       time_t              optime;
 508 -»       time_t              endbefore = 0;
 509 +»       struct timespec     expire_time;
 510 +»       struct timeval     timeout;
 511  »       time_t              endtime = 0;
 512  »       char                *matched_msg, *error_msg;
 513  »       char                **referrals = NULL;

Logging issue:

1631 @@ -4422,17 +4460,45 @@ static int checkpoint_threadmain(void *param)
1632                  if (!db || rc ) {
1633                      continue;
1634                  }
1635 -                slapi_log_err(SLAPI_LOG_BACKLDBM, "checkpoint_threadmain", "Compacting DB start: %s\n",
1636 +                slapi_log_err(SLAPI_LOG_NOTICE, "checkpoint_threadmain", "Compacting DB start: %s\n",
1637                                inst->inst_name);
1638 +
1639 +                /*
1640 +                 * It's possible for this to heap us after free because when we access db
1641 +                 * *just* as the server shut's down, we don't know it. So we should probably
1642 +                 * do something like wrapping access to the db var in a rwlock, and have "read"
1643 +                 * to access, and take writes to change the state. This would prevent the issue.
1644 +                 */
1645 +                DBTYPE type;
1646 +                rc = db->get_type(db, &type);
1647 +                if (rc) {
1648 +                    slapi_log_err(SLAPI_LOG_ERR,
1649 +                              "compactdb: failed to determine db type for %s: db error - %d %s\n",
1650 +                              inst->inst_name, rc, db_strerror(rc));

There is no subsystem specified for the logging call, it should be:

slapi_log_err(SLAPI_LOG_ERR, "checkpoint_threadmain",
                            "compactdb: failed to determine db type for %s: db error - %d %s\n",
                             inst->inst_name, rc, db_strerror(rc));

Indentation issue:

2677 +
2678 +  /*
2679 +   * If the duration set is larger than time_t max - current time, we probably have
2680 +   * made it to the heat death of the universe. Congratulations on finding this bug.
2681 +   */
2682 +  if ( errno == ERANGE || duration <= 0 || duration > (MAX_ALLOWED_TIME_IN_SECS_64 - slapi_current_utc_time()) ) {
2683 +    slapi_create_errormsg(errorbuf, SLAPI_DSE_RETURNTEXT_SIZE, "password lockout duration \"%s\" is invalid. ", value);
2684 +    retVal = LDAP_OPERATIONS_ERROR;
2685 +    return retVal;
2686    }

The rest looks good, and no compiler warnings!! I'll ack this, but please fix the above issues.

@389-ds-bot
Copy link
Author

Comment from mreynolds (@mreynolds389) at 2017-07-13 04:29:34

Metadata Update from @mreynolds389:

  • Custom field reviewstatus adjusted to ack (was: review)

@389-ds-bot
Copy link
Author

Comment from firstyear (@Firstyear) at 2017-07-13 05:01:57

I went through the patch 3 times to try and catch all these! Good spotting. I'll fix these any commit :)

Thanks so much,

@389-ds-bot
Copy link
Author

Comment from mreynolds (@mreynolds389) at 2017-07-13 05:25:38

I went through the patch 3 times to try and catch all these! Good spotting. I'll fix these any commit :)

I swear I don't set out just looking for these indentation issues, but they just really stand out to me and it makes me cringe :-) I'm just glad I caught the missing logging subsystem though. Since it's a macro there is no compiler warning/error, but I'm not trying to reignite that conversation ;-) Anyway, vacation time for me, talk to you next week!

@389-ds-bot
Copy link
Author

Comment from firstyear (@Firstyear) at 2017-07-13 05:47:36

Fixed all those issues, thanks again!

commit c70baf1d509f6dc144d839d71e874e3d59d72150
To ssh://git@pagure.io/389-ds-base.git
4940c5f..680a1a5 master -> master

@389-ds-bot
Copy link
Author

Comment from firstyear (@Firstyear) at 2017-07-13 05:47:36

Metadata Update from @Firstyear:

  • Issue close_status updated to: fixed
  • Issue status updated to: Closed (was: Open)

@389-ds-bot
Copy link
Author

Comment from tbordaz (@tbordaz) at 2017-07-13 13:53:36

@Firstyear just for curiosity, time() and clock_gettime() are thread safe. What is the reason the time thread was not thread safe ? is it because it updated/returned a global variable ?. If this is the reason would a fix like this one being thread safe ?

diff --git a/ldap/servers/slapd/time.c b/ldap/servers/slapd/time.c
index 16467af..8315dda 100644
--- a/ldap/servers/slapd/time.c
+++ b/ldap/servers/slapd/time.c
@@ -120,22 +120,13 @@ free_timestring(char *timestr)
 time_t
 poll_current_time()
 {
-    if ( !currenttime_set ) {
-       currenttime_set = 1;
-    }
-
-    time( &currenttime );
-    return( currenttime );
+    return 0;
 }
 
 time_t
 current_time( void )
 {
 -    if ( currenttime_set ) {
 -        return( currenttime );
 -    } else {
 -        return( time( (time_t *)0 ));
 -    }
+    return( time( (time_t *)0 ));
 }

@389-ds-bot
Copy link
Author

Comment from firstyear (@Firstyear) at 2017-07-14 02:31:59

If you did not have current time set, this would be thread safe, but there is a caveat, and that is you and I the human's.

What is current_time? Thoughout the code I saw current time as:

  • utc
  • localtime
  • relative system time

People who used the api either had it change without knowing underthem, or they didn't know what they were "getting" and assumed. As well, we had a slapi_current_time wrapper that just wrapped the function in a (pointless) extra fn call.

So my solution was to explicitly name the replacement, and make it part of the slapi api so we just get exactly what we want. I'm hoping to avoid human error in the future.

@389-ds-bot
Copy link
Author

Comment from lkrispen (@elkris) at 2017-07-17 13:21:08

When testing on top of current master I noticed that the event queue thread is consuming 100 % cpu.

Could you revisit your changes in the eventq code. There are loops calling functions ... and taking time, but you use curtime looked up before starting the loop, wheras the previous code was using current_time() inside the loop

@389-ds-bot
Copy link
Author

Comment from lkrispen (@elkris) at 2017-07-17 13:44:30

I think there was an incorrect macro replacement applying the NOT only to the first and component, the following patch seems to fix it

0001-fix-macro-replacement.patch

@389-ds-bot
Copy link
Author

Comment from firstyear (@Firstyear) at 2017-07-18 02:07:18

I think you're right. I'll apply this now (with my ack)

commit 1b95045
To ssh://git@pagure.io/389-ds-base.git
b95f5f2..1b95045 master -> master

@389-ds-bot
Copy link
Author

Comment from lkrispen (@elkris) at 2017-07-24 22:06:13

Metadata Update from @elkris:

  • Custom field component adjusted to None
  • Custom field origin adjusted to None
  • Custom field version adjusted to None

@389-ds-bot
Copy link
Author

Comment from lkrispen (@elkris) at 2017-07-24 22:07:43

this change broke checkpointing, fix attached

@389-ds-bot
Copy link
Author

Comment from lkrispen (@elkris) at 2017-07-24 22:08:14

0001-make-checkpointing-work-again.patch

@389-ds-bot
Copy link
Author

Comment from firstyear (@Firstyear) at 2017-07-25 00:52:14

ack

mreynolds389 added a commit to mreynolds389/389-ds-base that referenced this issue Jun 23, 2023
Description:

Add check for

- unauthorized binds are allowed
- Access log buffering is disabled
- Security log buffering is disabled
- Make mapping tree check more robust for case

relates: 389ds#2375

Reviewed by: spichugi(Thanks!)
mreynolds389 added a commit that referenced this issue Jun 23, 2023
Description:

Add check for

- unauthorized binds are allowed
- Access log buffering is disabled
- Security log buffering is disabled
- Make mapping tree check more robust for case

relates: #2375

Reviewed by: spichugi(Thanks!)
mreynolds389 added a commit that referenced this issue Jun 23, 2023
Description:

Add check for

- unauthorized binds are allowed
- Access log buffering is disabled
- Security log buffering is disabled
- Make mapping tree check more robust for case

relates: #2375

Reviewed by: spichugi(Thanks!)
mreynolds389 added a commit that referenced this issue Jun 23, 2023
Description:

Add check for

- unauthorized binds are allowed
- Access log buffering is disabled
- Make mapping tree check more robust for case

relates: #2375

Reviewed by: spichugi(Thanks!)
mreynolds389 added a commit that referenced this issue Jun 23, 2023
Description:

Add check for

- unauthorized binds are allowed
- Access log buffering is disabled
- Make mapping tree check more robust for case

relates: #2375

Reviewed by: spichugi(Thanks!)
mreynolds389 added a commit that referenced this issue Jun 23, 2023
Description:

Add check for

- unauthorized binds are allowed
- Access log buffering is disabled
- Make mapping tree check more robust for case

relates: #2375

Reviewed by: spichugi(Thanks!)
@mreynolds389
Copy link
Contributor

0272d99..6981d08 389-ds-base-2.3 -> 389-ds-base-2.3
98644dc..835f12b 389-ds-base-2.2 -> 389-ds-base-2.2
6897dc5..f5112d9 389-ds-base-2.1 -> 389-ds-base-2.1
c06e225..ea3d4e8 389-ds-base-1.4.3 -> 389-ds-base-1.4.3

bsimonova added a commit to bsimonova/389-ds-base that referenced this issue Jul 31, 2023
Description:
Updated the healthcheck tests with nsslapd-accesslog-logbuffering
settings so it does not report a warning.
Also adjusted certificate tests so it covers days in leap year.

Relates: 389ds#2375

Reviewed by: ???
bsimonova added a commit to bsimonova/389-ds-base that referenced this issue Aug 2, 2023
Description:
Updated the healthcheck tests with nsslapd-accesslog-logbuffering
settings so it does not report a warning.
Also adjusted certificate tests so it covers days in leap year.

Relates: 389ds#2375

Reviewed by: @vashirov (Thanks!)
bsimonova added a commit that referenced this issue Aug 2, 2023
Description:
Updated the healthcheck tests with nsslapd-accesslog-logbuffering
settings so it does not report a warning.
Also adjusted certificate tests so it covers days in leap year.

Relates: #2375

Reviewed by: @vashirov (Thanks!)
lab-at-nohl pushed a commit to lab-at-nohl/cockpit-389-ds-containerproxy that referenced this issue May 9, 2024
Description:

Add check for

- unauthorized binds are allowed
- Access log buffering is disabled
- Make mapping tree check more robust for case

relates: 389ds/389-ds-base#2375

Reviewed by: spichugi(Thanks!)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
closed: fixed Migration flag - Issue complex Fix is complex
Projects
None yet
Development

No branches or pull requests

2 participants