
Support for Timestamping #176

Merged 4 commits into KomputeProject:master on Mar 7, 2021

Conversation

alexander-g
Contributor

Added the option to latch timestamps after each operation with vkCmdWriteTimestamp().

Notes:

  • Timestamps might not be very accurate, especially for data copies. I've noticed that a shader dispatch takes longer directly after a copy. One might consider adding more timestamps within OpAlgoDispatch, e.g. after the memory barrier.
  • The user needs to know in advance the maximum number of operations they want to record when creating a sequence. This might not always be possible.
  • The timestamps are returned as a plain list; one might consider adding a parameter like timestamp_name="banana" to record() to make it easier to identify which operation took how long.
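
Roughly, the mechanism looks like this: a sketch using Vulkan-Hpp, where recordWithTimestamp and recordOp are illustrative names rather than the actual Kompute internals.

#include <functional>
#include <vulkan/vulkan.hpp>

// Sketch only: latch a timestamp into the query pool after each operation.
void recordWithTimestamp(vk::CommandBuffer cmd,
                         vk::QueryPool queryPool,
                         uint32_t& nextQuery,
                         const std::function<void(vk::CommandBuffer)>& recordOp)
{
    recordOp(cmd); // e.g. a tensor copy or an algorithm dispatch

    // eBottomOfPipe samples the GPU clock only after all prior commands
    // have drained, so the delta to the previous query approximates the
    // cost of the operation recorded above.
    cmd.writeTimestamp(vk::PipelineStageFlagBits::eBottomOfPipe,
                       queryPool, nextQuery++);
}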

@axsaucedo
Member

axsaucedo commented Mar 6, 2021

Interesting - thanks for the PR! Can you show an example of how this works?

Looking at the relationship between the command buffer and the timestamp pool, it seems more similar to the Tensor/Op or Algo/Op relationship than to something specific to a sequence. Is there a reason this isn't explored as something similar to an Algorithm or Tensor instead, where there would be a Timestamp object, together with an OpTimestamp type that calls the Timestamp's record?

That way you would also be able to set the timestamps explicitly:

std::shared_ptr<kp::OpTimestamp> opTimestamp{ new kp::OpTimestamp(mgr.timestamp(...)) };

mgr.sequence()
  ->record(opTimestamp)
  ->record<kp::OpTensorSyncDevice>(...)
  ->record(opTimestamp)
  ->record<kp::OpAlgoDispatch>(...)
  ->record(opTimestamp)
  ->record<kp::OpTensorSyncLocal>(...)
  ->record(opTimestamp)
  ->eval();

std::vector<std::uint64_t> timestamps = opTimestamp->timestamps();

It would be good to get better insight into the timestamps: how are you currently using them?

@alexander-g
Contributor Author

A simple example:

seq = mgr.sequence(nrOfTimestamps=100)
(seq.record(kp.OpTensorSyncDevice([a,b]))
    .record(kp.OpAlgoDispatch(algo0))
    .record(kp.OpAlgoDispatch(algo1))
    .record(kp.OpAlgoDispatch(algo2))
    .record(kp.OpTensorSyncLocal([c]))
    .eval())

print( np.diff(seq.get_timestamps()) )

>>>[ 6976 31264 16384 16416 9312]
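
These raw deltas are presumably GPU clock ticks rather than nanoseconds; converting them uses the device's timestampPeriod limit. A minimal C++ sketch (not part of the Kompute API):

#include <vulkan/vulkan.hpp>

// timestampPeriod gives the number of nanoseconds per timestamp tick.
double ticksToNanoseconds(vk::PhysicalDevice physicalDevice, uint64_t ticks)
{
    float period = physicalDevice.getProperties().limits.timestampPeriod;
    return static_cast<double>(ticks) * period;
}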

OpTimestamp might be an alternative, but it requires more work from the user.

I've just started using this, so I cannot say much about it yet. Previously I created a sequence for each operation and measured the time it took to evaluate, as you suggested in #110, but this is slow and requires writing additional profiling code, so I looked for alternatives and found that Vulkan has this built in.

@axsaucedo
Member

axsaucedo commented Mar 6, 2021

Ok, interesting; yes, I see what you mean about potentially requiring more work for the user. Is there a reason we'd want a maximum number of timestamps? Would it not be possible to just set a timestamp between each operation when enabled, and clear them every time eval is called, without limits? Mainly since a re-record would clear them as well.

@alexander-g
Contributor Author

A maximum number is required by Vulkan when creating a QueryPool.
rerecord() looks like it might help, but it would require the user to call it manually, wouldn't it?
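
For reference, this is roughly what that constraint looks like, since vk::QueryPoolCreateInfo takes a fixed queryCount at creation time (a sketch, not the actual Kompute code):

#include <vulkan/vulkan.hpp>

vk::QueryPool createTimestampQueryPool(vk::Device device, uint32_t maxTimestamps)
{
    vk::QueryPoolCreateInfo info;
    info.queryType  = vk::QueryType::eTimestamp;
    info.queryCount = maxTimestamps; // fixed upfront; resizing requires
                                     // destroying and recreating the pool
    return device.createQueryPool(info);
}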

@axsaucedo
Member

axsaucedo commented Mar 6, 2021

Edit: did a first pass; it looks like the initial approach in this PR is a good way to go and can be built on from there.

Oh, I see what you mean now by the point above about not knowing the exact number of operations upfront... I agree that re-record could help, but yes, it would require adding timestamps manually. It feels a bit awkward to expose the exact number of timestamps the user needs to set; perhaps what could be exposed instead is a function to trigger a re-record. That way the query pool would be created with the right size and would only live for as long as the recorded commands explicitly request it. It could then be used as:

mgr.sequence()
   ->record(...)
   ->record(...)
   ->record(...)
   ->record(...)
   ->rerecord(true) // creates query pool and adds timestamps
   ->eval()
   ->record(...) // deletes query pool and records new command buffer
   ->rerecord(false) // re-records but doesn't add timestamps
   ->rerecord(true) // re-records and adds timestamps
   ->eval()

One thing to ask is whether the simple eval(...) could still have a timestamp, but it seems like it would be necessary to always run ->record()->rerecord()->eval() to add timestamps.

There are still a couple of things that are not clear:

  • What happens if you want to trigger re-record on an existing query pool? Are you expected to keep track of all the evals without clearing? Is that the expected behaviour, or would it be better to expect the user to read the query results after every eval / evalAwait?

@axsaucedo
Member

axsaucedo left a review comment

Just had another look, and the current approach does seem to make sense. I did an initial pass; it would be good to also dive into some of these points:

  • When calling record after eval, is it still expected to use the same query pool? It seems to make more sense to have one pool specific to each recorded batch.
  • It would also be good to implement rerecord(uint32_t totalTimestamps) to allow for resizing.

Overall this looks like a great addition. It would probably be good to add tests, and then we can use a test as the basis for an example in the advanced-examples.rst documentation.

(Review comments on python/src/main.cpp, single_include/kompute/Kompute.hpp, src/Manager.cpp, src/Sequence.cpp and src/include/kompute/Sequence.hpp, all resolved.)
@alexander-g
Contributor Author

alexander-g commented Mar 7, 2021

It feels a bit awkward to expose the exact number of timestamps the user needs to set

It's not the exact number but the maximum number. The user can set it to a high number like one million if they are ok with allocating that additional memory.

What happens if you want to trigger re-record on an existing query pool? Are you expected to keep track of all the evals without clearing? Is that the expected behaviour, or would it be better to expect the user to read the query results after every eval / evalAwait?

I don't quite understand what you mean by that. (I believe) the query pool is simply a buffer into which timestamps are written. If I call rerecord(), the buffer stays untouched and is simply reused on the next eval().
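
Roughly, reading the pool back after eval() could look like this (a sketch against the raw Vulkan-Hpp API, not the exact Kompute code):

#include <vector>
#include <vulkan/vulkan.hpp>

std::vector<uint64_t> readTimestamps(vk::Device device,
                                     vk::QueryPool pool,
                                     uint32_t count)
{
    std::vector<uint64_t> results(count);
    // eWait blocks until the GPU has written all `count` queries.
    (void)device.getQueryPoolResults(pool, 0, count,
                                     results.size() * sizeof(uint64_t),
                                     results.data(), sizeof(uint64_t),
                                     vk::QueryResultFlagBits::e64 |
                                         vk::QueryResultFlagBits::eWait);
    return results;
}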

What's the point of rerecord anyway? I can't see a use case. Personally, I would rather create a new sequence instead.

@axsaucedo
Member

axsaucedo left a review comment

Looks good. I only added a minor comment, which should be a quick change; it would also be great if we could add a test to make sure the expected timestamps can be retrieved. Here are also some thoughts on the points above:

It's not the exact number but the maximum number. The user can set it to a high number like one million if they are ok with allocating that additional memory.

I agree it makes sense to initially let the user provide a value. The ability to update it later could also make sense, and could be another main purpose of the re-record command; I mention further thoughts below.

I don't quite understand what you mean by that. (I believe) the query pool is simply a buffer into which timestamps are written. If I call rerecord(), the buffer stays untouched and is simply reused on the next eval().

I was thinking it would be useful to be able to "clear" and "resize" the timestamp query pool; this could be a useful purpose for re-record. By extending it to rerecord(uint32_t totalTimestamps), it would be possible to allow either re-creating the pool with a different size, or clearing the currently logged timestamps by recreating it with the same size.
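
Something along these lines, perhaps (an illustrative sketch; the function name is hypothetical):

#include <vulkan/vulkan.hpp>

// Recreating the pool both resizes it and discards previously logged values;
// the command buffer would then be re-recorded against the new pool.
vk::QueryPool resizeTimestampQueryPool(vk::Device device,
                                       vk::QueryPool oldPool,
                                       uint32_t totalTimestamps)
{
    if (oldPool) {
        device.destroyQueryPool(oldPool);
    }
    vk::QueryPoolCreateInfo info;
    info.queryType  = vk::QueryType::eTimestamp;
    info.queryCount = totalTimestamps;
    return device.createQueryPool(info);
}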

What's the point of rerecord anyway? I can't see a use case. Personally, I would rather create a new sequence instead.

Tensors and algorithms can be rebuilt with different values / configurations, which means that for a sequence to be "refreshed" it needs to be re-recorded. You can certainly go through all the record commands manually, but rerecord provides a simpler way to trigger a "reload" that takes into account any changes to tensors. There is a test in TestSequence which shows an example.

By the way, if you get a chance, it would be good to get your overall thoughts on #177, as I'm adding support for multiple tensor types (uint, int, float, double & bool).

(Review comment on src/Manager.cpp, resolved.)
@alexander-g
Contributor Author

Added a test case, but I'm filtering it out in the Makefile because SwiftShader does not support timestamps.
There are also some problems when running on my GPU: the test succeeds if I run it alone, but fails with a "Device does not support timestamps" exception when I run it together with all the other tests.
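
For what it's worth, support is typically detected along these lines (a sketch; the exact check in Kompute may differ):

#include <vulkan/vulkan.hpp>

// timestampComputeAndGraphics guarantees timestamp support on all graphics
// and compute queues; otherwise the specific queue family must report
// non-zero timestampValidBits.
bool supportsTimestamps(vk::PhysicalDevice physicalDevice,
                        uint32_t queueFamilyIndex)
{
    vk::PhysicalDeviceLimits limits = physicalDevice.getProperties().limits;
    auto families = physicalDevice.getQueueFamilyProperties();
    return limits.timestampComputeAndGraphics ||
           families[queueFamilyIndex].timestampValidBits > 0;
}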

@axsaucedo
Member

axsaucedo left a review comment

Ok, I've just tested locally: I didn't get any failures when running all the tests, and the test also passes on its own. As you say, SwiftShader doesn't seem to support timestamps, but it has limitations in other aspects too, like double support, so I think skipping it is fine.

The PR looks good. It would be good to confirm whether the rerecord addition is something we want as part of this PR or something that can be explored later on; the latter probably makes more sense, but it may be worth creating an issue. Let me know and I can merge or hold accordingly.

@alexander-g
Contributor Author

I would merge it as it is. I will take a closer look at re-recording a bit later.

@axsaucedo
Member

Sounds good 👍

@axsaucedo merged commit cc1ec74 into KomputeProject:master on Mar 7, 2021