Track time operation spends in work queue #4218

Closed
389-ds-bot opened this issue Sep 13, 2020 · 23 comments

Labels: documentation, performance
Milestone: 1.4.2

Comments

Cloned from Pagure issue: https://pagure.io/389-ds-base/issue/51165


Issue Description

There are customer performance cases where it would be very useful to know how long an operation waited in the work queue before being assigned to a worker thread. This should be easy to do, and we can add a new keyword in the access log, maybe something like "wtime", similar to "etime" but measuring how long the operation "waited" in the queue.

389-ds-bot added the documentation and performance labels Sep 13, 2020
389-ds-bot added this to the 1.4.2 milestone Sep 13, 2020

Comment from firstyear (@Firstyear) at 2020-06-19 01:30:33

I think this would help, but I think we can think bigger than this. It would be good to have fine-grained timings without relying on systemtap probes or the like. In the op struct, we could have a nested tree of timings that we build up to get a profile of an operation, i.e. the complete time, the queue time, how much time in plugins, etc.

That would help us much more to identify performance gaps beyond just the work queue. And I think it wouldn't be much work on top to achieve it. What do you think?


Comment from firstyear (@Firstyear) at 2020-06-19 01:30:33

Metadata Update from @Firstyear:

  • Custom field origin adjusted to None
  • Custom field reviewstatus adjusted to None


Comment from mreynolds (@mreynolds389) at 2020-06-19 02:30:07

> I think this would help, but I think we can think bigger than this. It would be good to have fine-grained timings without relying on systemtap probes or the like. In the op struct, we could have a nested tree of timings that we build up to get a profile of an operation, i.e. the complete time, the queue time, how much time in plugins, etc.
> That would help us much more to identify performance gaps beyond just the work queue. And I think it wouldn't be much work on top to achieve it. What do you think?

I like it, but I'm not sure I am envisioning your nested tree structure of timings. There are only so many things we can get timings for so I would think we could just use a static array of timespecs and enum the indexes, but I suspect you have something better in mind :-)
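
A minimal sketch of that static-array-of-timespecs idea (the enum stages, struct, and helper names here are hypothetical, not actual 389-ds-base code):

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Hypothetical stages at which an operation's time could be stamped. */
typedef enum {
    OP_TIME_QUEUED = 0,   /* placed on the work queue */
    OP_TIME_DEQUEUED,     /* picked up by a worker thread */
    OP_TIME_PLUGINS_DONE, /* pre/post plugins finished */
    OP_TIME_COMPLETED,    /* result sent to the client */
    OP_TIME_MAX
} op_time_stage_t;

typedef struct op_timings {
    struct timespec stamps[OP_TIME_MAX];
} op_timings_t;

/* Record the current monotonic time for one stage. */
static inline void
op_time_mark(op_timings_t *t, op_time_stage_t stage)
{
    clock_gettime(CLOCK_MONOTONIC, &t->stamps[stage]);
}

/* Elapsed seconds between two recorded stages, e.g. the queue wait
 * would be op_time_elapsed(t, OP_TIME_QUEUED, OP_TIME_DEQUEUED). */
static inline double
op_time_elapsed(const op_timings_t *t, op_time_stage_t from, op_time_stage_t to)
{
    return (double)(t->stamps[to].tv_sec - t->stamps[from].tv_sec) +
           (double)(t->stamps[to].tv_nsec - t->stamps[from].tv_nsec) / 1e9;
}
```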

The other issue is how much we write to the access log: 15 keywords for the various timings of an operation? This would potentially double the output for each logged RESULT. Maybe it's a non-issue and I'm just being paranoid. But to be clear, I want this in the access log by default. I don't want it to be something we have to turn on, as having it already there in SOS reports would make analysing customer problems way easier.


Comment from firstyear (@Firstyear) at 2020-06-19 03:22:48

> I think this would help, but I think we can think bigger than this. It would be good to have fine-grained timings without relying on systemtap probes or the like. In the op struct, we could have a nested tree of timings that we build up to get a profile of an operation, i.e. the complete time, the queue time, how much time in plugins, etc.
> That would help us much more to identify performance gaps beyond just the work queue. And I think it wouldn't be much work on top to achieve it. What do you think?

> I like it, but I'm not sure I am envisioning your nested tree structure of timings. There are only so many things we can get timings for so I would think we could just use a static array of timespecs and enum the indexes, but I suspect you have something better in mind :-)
> The other issue is how much we write to the access log: 15 keywords for the various timings of an operation? This would potentially double the output for each logged RESULT. Maybe it's a non-issue and I'm just being paranoid. But to be clear, I want this in the access log by default.

Perhaps my idea is better saved for when we have the logging subsystem rewrite, because otherwise we risk performance issues, I guess. I'd say I want a separate log for this kind of measurement, I think. So maybe my ideas can wait :) but I like the wtime idea.

Comment from tbordaz (@tbordaz) at 2020-06-19 12:11:41

@Firstyear @mreynolds389 I think the extended idea you are discussing is the purpose of PR #3730, which I sent a few months ago.

Basically, it attaches an extended structure to the operation to monitor various aspects of the operation itself. In the PR it was related to the index keys list, but it can easily be extended to other places.

I wrote a design for it (http://www.port389.org/docs/389ds/design/log-operation-stats.html) that limits the impact (cost) so that we only monitor the things we are interested in.

I have wanted to get back to this PR for a long time, as I think it has some advantages compared to a set of stap scripts. Unfortunately I did not find the time. I think this is a good opportunity to complete the PR review and possibly use the new approach to fix this ticket.
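
A rough sketch of that "monitor only what you enable" idea, assuming a flag-per-collector model (all names here are hypothetical, not taken from PR #3730):

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Hypothetical selector flags for what an operation should collect. */
typedef enum {
    OP_STAT_NONE       = 0,
    OP_STAT_INDEX_KEYS = 1 << 0, /* index-key lookups, as in the PR */
    OP_STAT_TIMINGS    = 1 << 1  /* per-stage timings */
} op_stat_flags_t;

typedef struct op_stats {
    unsigned flags;          /* which collectors are enabled */
    struct timespec started; /* only stamped if OP_STAT_TIMINGS is set */
} op_stats_t;

/* Collectors check their flag first, so disabled stats cost almost nothing. */
static inline void
op_stats_mark_start(op_stats_t *s)
{
    if (s->flags & OP_STAT_TIMINGS) {
        clock_gettime(CLOCK_MONOTONIC, &s->started);
    }
}
```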


Comment from mreynolds (@mreynolds389) at 2020-06-19 22:30:46

After talking with @tbordaz, I will implement "wtime" and "optime", and then we can look into real logging changes via a different approach/ticket.

So my plan is to have 3 times in the result message:

  • etime = elapsed time, same as it always was: the total time from when the operation reached DS until it completed.
  • wtime = the amount of time the operation waited in the work queue.
  • optime = the amount of time it took for the actual operation to do its work.

Adding "wtime" and "optime" together should equal the "etime"; see the example log line below.
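
For illustration, a RESULT line would then carry all three keywords; the values below are invented, but they show the intended relationship (wtime + optime = etime):

```
conn=1 op=2 RESULT err=0 tag=101 nentries=1 wtime=0.000026 optime=0.000194 etime=0.000220
```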


Comment from mreynolds (@mreynolds389) at 2020-06-19 22:30:56

Metadata Update from @mreynolds389:

  • Issue assigned to mreynolds389


Comment from mreynolds (@mreynolds389) at 2020-06-19 22:31:24

Metadata Update from @mreynolds389:

  • Issue priority set to: normal
  • Issue set to the milestone: 1.4.2


Comment from firstyear (@Firstyear) at 2020-06-22 04:11:31

Sounds good to me! I look forward to reviewing it :)

Comment from mreynolds (@mreynolds389) at 2020-06-23 22:45:29

@Firstyear enjoy: #4236


Comment from firstyear (@Firstyear) at 2020-06-24 03:15:18

I will enjoy it thoroughly :) Thanks mate!


Comment from tbordaz (@tbordaz) at 2020-06-24 07:52:35

Metadata Update from @tbordaz:

  • Issue tagged with: Documentation


Comment from mreynolds (@mreynolds389) at 2020-06-24 18:56:52

Commit 3246fe79 relates to this ticket


Comment from mreynolds (@mreynolds389) at 2020-06-24 18:59:37

Commit 3246fe7 relates to this ticket

0d9e848..0f97891 389-ds-base-1.4.3 -> 389-ds-base-1.4.3

653a00b..19b6caf 389-ds-base-1.4.2 -> 389-ds-base-1.4.2


Comment from mreynolds (@mreynolds389) at 2020-06-24 18:59:37

Metadata Update from @mreynolds389:

  • Issue close_status updated to: fixed
  • Issue status updated to: Closed (was: Open)


Comment from mreynolds (@mreynolds389) at 2020-06-28 21:50:04

Add more enhancements to logconv

#4247


Comment from mreynolds (@mreynolds389) at 2020-06-29 04:06:25

Commit f771ff7e relates to this ticket


Comment from mreynolds (@mreynolds389) at 2020-06-29 04:11:23

Add more enhancements to logconv
#4247

Commit f771ff7 relates to this ticket

8ff268f..87ca196 389-ds-base-1.4.3 -> 389-ds-base-1.4.3

19b6caf..edda830 389-ds-base-1.4.2 -> 389-ds-base-1.4.2


Comment from mreynolds (@mreynolds389) at 2020-08-17 15:22:45

Commit dda3ab0d fixes this issue


Comment from mreynolds (@mreynolds389) at 2020-08-17 15:24:26

Fix for extended ops:

Commit dda3ab0d fixes this issue

357d28b..25eaefe 389-ds-base-1.4.3 -> 389-ds-base-1.4.3

674dc76..1b05d81 389-ds-base-1.4.2 -> 389-ds-base-1.4.2


Comment from mreynolds (@mreynolds389) at 2020-08-17 15:24:27

Metadata Update from @mreynolds389:

  • Issue close_status updated to: None (was: Fixed)

droideck pushed a commit that referenced this issue Nov 3, 2020
Description: Add a test case to dirsrvtests/tests/suites/ds_logs/ds_logs_test.py:
test_optime_and_wtime_keywords. It tests that the new optime and wtime keywords
are present in the access log and have correct values.
Also, adapt test_etime_order_of_magnitude to the new RESULT string format
in the access log.

Relates: #4218

Reviewed by: @droideck, @Firstyear (Thanks!)
tbordaz added a commit to tbordaz/389-ds-base that referenced this issue May 6, 2021
Bug description:
	With 389ds#4218 (wtime, optime in the access log), hrtime is set in the
	operation, but it is done before checking that the operation is
	set, so covscan fails.

Fix description:
	Move the setting after the verification that operation != NULL.

relates: 389ds#4759

Reviewed by: Simon Pichugin

Platforms tested: F34
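
A minimal sketch of the pattern this fix describes (the struct and function names are hypothetical, not the actual 389-ds-base code):

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-in for the server's operation structure. */
typedef struct operation {
    struct timespec hr_time_started;
} operation_t;

/* Before the fix, the hrtime was stamped before the NULL check, so a
 * NULL operation could be dereferenced; the fix stamps it only after
 * verifying operation != NULL. */
static void
op_begin(operation_t *op)
{
    if (op == NULL) {
        return; /* nothing to time; this is the path covscan flagged */
    }
    clock_gettime(CLOCK_MONOTONIC, &op->hr_time_started);
}
```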