Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[4.0] Fix block time reporting log statement #1434

Merged
merged 6 commits into from
Jul 26, 2023
Merged

Conversation

heifner
Copy link
Member

@heifner heifner commented Jul 20, 2023

Fix issue with the tracking of block idle time.
Add producer name to the block time log statement.
Note: time logging trxs is now included in the success/fail time.

 Block #321409127 eosasia11111 trx idle: 253403us out of 340621us, success: 11, 11380us, fail: 10, 74407us, transient: 0, 0us, other: 1431us

Resolves #1360
Resolves #1417

@heifner heifner linked an issue Jul 20, 2023 that may be closed by this pull request
@heifner heifner added the OCI Work exclusive to OCI team label Jul 20, 2023
@greg7mdp
Copy link
Contributor

In this location, I think we do add_idle_time twice for the same time interval if we get an exception.

I think it would be more logical, everytime we effectively do the equivalent of:

_time_tracker.add_idle_time( fc::time_point::now() - self->_idle_trx_time);

We should immediately afterwards do: self->_idle_trx_time = fc::time_point::now();

This way we can ensure we don't count the same interval twice as idle.

And these two statements should be in a producer_plugin_impl() member function.

@greg7mdp
Copy link
Contributor

greg7mdp commented Jul 21, 2023

Also _idle_trx_time is a bad name as it implies a duration, not a time stamp.

Maybe _unaccounted_idling_tstamp would be better, as it is the timestamp when we started being idle, and not accounted in time_tracker.

@greg7mdp
Copy link
Contributor

I think a better design would be for the time tracker to hold the _unaccounted_timestamp and have a single api like:

class block_time_tracker
{
   enum class time_usage { idle, success, fail, last };

    void add_time(time_usage u)
    {
        auto now = fc::time_point::now();
        _time[static_cast<int>(u)] += now - _unaccounted_tstamp;
        _unaccounted_tstamp = now;
    }

   std::array<fc::microseconds, static_cast<int>(time_usage::last)> _time;
};

@heifner
Copy link
Member Author

heifner commented Jul 24, 2023

In this location, I think we do add_idle_time twice for the same time interval if we get an exception.

No, the only exception here (outside alloc exceptions which we can ignore since those are program exit) is via the future.get() as
process_incoming_transaction_async() does not throw. So the idle time is only set once.

@heifner
Copy link
Member Author

heifner commented Jul 24, 2023

We should immediately afterwards do: self->_idle_trx_time = fc::time_point::now();

The idle time does not start from there. It starts at the end of processing. Adding it here just makes it less likely you notice the logic is wrong.

@heifner
Copy link
Member Author

heifner commented Jul 24, 2023

Also _idle_trx_time is a bad name as it implies a duration, not a time stamp.

Interesting. I guess I have always thought of the term timestamp (tstamp) as more of an unix epoch term or as an implementation detail. A duration I would expect to be expressed as elapsed or similar. Maybe _idle_trx_timepoint ?

@heifner
Copy link
Member Author

heifner commented Jul 24, 2023

I think a better design would be for the time tracker to hold the _unaccounted_timestamp and have a single api like:

class block_time_tracker
{
   enum class time_usage { idle, success, fail, last };

    void add_time(time_usage u)
    {
        auto now = fc::time_point::now();
        _time[static_cast<int>(u)] += now - _unaccounted_tstamp;
        _unaccounted_tstamp = now;
    }

   std::array<fc::microseconds, static_cast<int>(time_usage::last)> _time;
};

A bit of Deja Vu on this. Seems like we had a similar discussion already. I'll see if I can clean up the current impl some.

@greg7mdp
Copy link
Contributor

No, the only exception here (outside alloc exceptions which we can ignore since those are program exit) is via the future.get() as process_incoming_transaction_async() does not throw. So the idle time is only set once.

I'm not following. Either we can't have exceptions (in which case we can remove the try/catch block and the exception handler), or we can have exceptions (and when we do we will call add_idle_time twice with the same parameters). What am I missing?

@greg7mdp
Copy link
Contributor

We should immediately afterwards do: self->_idle_trx_time = fc::time_point::now();

The idle time does not start from there. It starts at the end of processing. Adding it here just makes it less likely you notice the logic is wrong.

Yes, but when we call add_idle_time we add the interval [end of processing, now] to the accounted idle time. We should make sure that next time we call add_idle_time we don't take the same interval into account again.

@greg7mdp
Copy link
Contributor

A bit of Deja Vu on this. Seems like we had a similar discussion already. I'll see if I can clean up the current impl some.

haha I already forgot all about it! You have a better memory than I do (not a high bar :-).
I'm just finding it hard to convince myself that we do update the idle time correctly from looking at the code, and I always try to figure out a way to make the code more obviously correct.

@heifner
Copy link
Member Author

heifner commented Jul 24, 2023

No, the only exception here (outside alloc exceptions which we can ignore since those are program exit) is via the future.get() as process_incoming_transaction_async() does not throw. So the idle time is only set once.

I'm not following. Either we can't have exceptions (in which case we can remove the try/catch block and the exception handler), or we can have exceptions (and when we do we will call add_idle_time twice with the same parameters). What am I missing?

The main reason the try-catch is here is for the future.get(). This can throw if signature validation fails.
The other reason is if process_incoming_transaction_async() throws a guard_exception which means app.quit() has been called and the program will soon exit, but we would not want an unhandled exception on the _thread_pool executor. Although, with recent changes to the named_thread_pool to handle uncaught exceptions on the executor we could likely get away with just a try-catch around the future.get(). But it is still nice that we call next in that corner case so that an API trx gets the reason for the failure.

@greg7mdp
Copy link
Contributor

The main reason the try-catch is here is for the future.get(). This can throw if signature validation fails.

So if that happens, we'll call add_idle_time() twice with the same parameters, right?

@greg7mdp
Copy link
Contributor

greg7mdp commented Jul 24, 2023

Interesting. I guess I have always thought of the term timestamp (tstamp) as more of an unix epoch term or as an implementation detail. A duration I would expect to be expressed as elapsed or similar. Maybe _idle_trx_timepoint ?

Yes, it is used in unix, but I was more thinking of the generic definition. But I'm fine with timepoint as well.

However, because we do update this timepoint, it is not necessarily the point we became idle, but really the point at which we last accounted for idle time (and updated the variable). That's what I was trying to convey with unaccounted, but maybe too much of a mouthful.

@heifner
Copy link
Member Author

heifner commented Jul 24, 2023

The main reason the try-catch is here is for the future.get(). This can throw if signature validation fails.

So if that happens, we'll call add_idle_time() twice with the same parameters, right?

I think I finally see what you are saying. Thanks! No idea why that add_idle_time call is in the exception_handler there is no idle time here. I'll remove it.

@heifner
Copy link
Member Author

heifner commented Jul 24, 2023

@greg7mdp Refactored to use RAII for a simpler interface.

Copy link
Contributor

@greg7mdp greg7mdp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice changes.
I'd prefer with my suggestions below, so that everytime we add a time interval to success/fail/idle time, we also reset the last_trx_time_point

plugins/producer_plugin/producer_plugin.cpp Show resolved Hide resolved
plugins/producer_plugin/producer_plugin.cpp Show resolved Hide resolved
plugins/producer_plugin/producer_plugin.cpp Show resolved Hide resolved
Comment on lines 284 to 286
break; // just reset timer which happens below
}
_block_time_tracker.start_idle_time();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
break; // just reset timer which happens below
}
_block_time_tracker.start_idle_time();
_block_time_tracker.start_idle_time(); // just reset timer
break;
}

Comment on lines 842 to 844
if (!pr.failed) {
trx_tracker.trx_success();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we count exhausted transactions as successful?

Copy link
Member Author

@heifner heifner Jul 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They didn't fail, so don't want to count them as failed. We could count as other, but other is suppose to be time not spent evaluating trxs. We could add a whole new category, but that seems like it would just muddy the water. I think success is the best option. Time we spent successfully executing transactions; which is true.

@@ -2661,6 +2702,8 @@ void producer_plugin_impl::schedule_production_loop() {
} else {
fc_dlog(_log, "Speculative Block Created");
}

_time_tracker.start_idle_time();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hum, I wonder if it shouldn't be add_idle_time, with the change I mentioned before in add_idle_time resetting last_trx_time_point = now;

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand; there is no idle time to add.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I see that start_idle_time(); does:

last_trx_time_point = now;

So assuming that now != last_trx_time_point, what was the time spent between last_trx_time_point and now used for?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was used for start block which includes potentially many trx processing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, we should probably call start_idle_time() only in ~trx_time_tracker() { (where we skip time to be accounted as other), and only call add_idle_time() elsewhere.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We create a trx_time_tracker only when we start processing a transaction, right? So could it do add_other_time in its constructor?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW I really like the RAII implementation!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We create a trx_time_tracker only when we start processing a transaction, right? So could it do add_other_time in its constructor?

But then you are just back to what it does now except you explicitly track the value instead of just calculate it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hum, it is kind of true, especially with the current version of the code which increases the time_point every time we add a duration.
But still, if we track the "other" value explicitly, we can verify that the sum of all accounted times is the elapsed time, which may be not true if we were to create trx_time_tracker objects on more than one thread for example.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added explicit tracking of other. Also added a pause/unpause for when in read-window as it didn't seem correct to call that "other" time.

Copy link
Contributor

@greg7mdp greg7mdp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! I think the code is easier to follow now.

@heifner heifner requested a review from linh2931 July 25, 2023 00:00
Copy link
Member

@linh2931 linh2931 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not for this PR. We report timing only in debug logging, but we track timing no matter whether logging is enabled or not. Not sure how much time is spent on tracking in non-debugging mode.

plugins/producer_plugin/producer_plugin.cpp Outdated Show resolved Hide resolved
@heifner
Copy link
Member Author

heifner commented Jul 25, 2023

Not for this PR. We report timing only in debug logging, but we track timing no matter whether logging is enabled or not. Not sure how much time is spent on tracking in non-debugging mode.

I wonder if this shouldn't be info level logging.

@heifner heifner merged commit e82c245 into release/4.0 Jul 26, 2023
16 checks passed
@heifner heifner deleted the GH-1417-blk-idle-4.0 branch July 26, 2023 12:11
@heifner heifner restored the GH-1417-blk-idle-4.0 branch July 26, 2023 14:18
@heifner heifner deleted the GH-1417-blk-idle-4.0 branch July 26, 2023 14:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
OCI Work exclusive to OCI team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Block trx idle not being reported correctly
3 participants