Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Logging for Slow Database Sessions #7494

Merged
merged 1 commit into from Jul 5, 2023

Conversation

kozlovsky
Copy link
Collaborator

Sometimes we can observe a very long execution of coroutine steps that entirely freeze the asyncio loop for huge periods (like, 87.143 seconds)

The primary cause of these freezes is usually the waiting time to acquire an exclusive lock before a database transaction begins. This occurs when another database session, operating in a different thread, holds the database lock.

This PR introduces logging functionality to monitor slow database sessions across all threads in the Tribler application. Slow database sessions can sometimes cause the asyncio loop to freeze for extended periods, which impacts the performance and responsiveness of the application.

Key Features:

  • This PR establishes a threshold of one second for database sessions. Any database session that exceeds this duration will be reported in the application logs.
  • The logs will contain comprehensive information, including the session's thread, duration, wait time for the exclusive lock, and the time for which the exclusive lock was held.
  • Additionally, the logs will present a stack trace of the database session and query statistics, including the total number of queries, execution time, and details of the slowest query.

This enhancement will assist in identifying performance bottlenecks and aid in optimizing database interactions within the Tribler application.

Example log entry:

[tribler-core PID:15612] 2023-06-19 14:10:55,918 - WARNING - tribler.core.utilities.pony_utils(147) - Long db_session detected.
Session info:
    Thread: 'asyncio_6'
    db_session duration: 0.055 seconds
    db_session waited for the exclusive lock for 0.051 seconds
    db_session held exclusive lock for 0.004 seconds
The db_session stack:
  File "C:\dev\tribler\src\tribler\core\components\metadata_store\db\store.py", line 473, in process_compressed_mdblob
    return self.process_squashed_mdblob(decompressed_data, health_info=health_info, **kwargs)
  File "C:\dev\tribler\src\tribler\core\components\metadata_store\db\store.py", line 536, in process_squashed_mdblob
    with db_session(immediate=True):

Queries statistics for the current db_session:
13 queries executed in a total of 0.002 seconds;
The slowest query (0.001 seconds) is:
    SELECT "g"."rowid", "g"."metadata_type", "g"."reserved_flags", "g"."origin_id", "g"."public_key", "g"."id_", "g"."timestamp", "g"."signature", "g"."added_on", "g"."status", "g"."title", "g"."tags", "g"."num_entries", "g"."infohash", "g"."size", "g"."torrent_date", "g"."tracker_info", "g"."xxx", "g"."health", "g"."tag_processor_version", "g"."start_timestamp", "g"."subscribed", "g"."share", "g"."votes", "g"."local_version"
    FROM "ChannelNode" "g"
    WHERE "g"."metadata_type" IN (400, 300)
      AND ("g"."infohash" = ? OR "g"."id_" = ? AND "g"."public_key" = ?)
    LIMIT 1

Queries statistics for the entire application:
19090 queries executed in a total of 3.182 seconds;
The slowest query (0.327 seconds) is:
    SELECT "health"."rowid", "health"."infohash", "health"."seeders", "health"."leechers", "health"."last_check", "health"."self_checked", "health"."has_data"
    FROM "TorrentState" "health"
    WHERE "health"."last_check" >= ?
      AND ("health"."seeders" > 0 OR "health"."leechers" > 0)
    ORDER BY "health"."seeders" DESC, "health"."leechers" DESC, "health"."last_check" DESC
    LIMIT 100

@kozlovsky kozlovsky requested a review from a team as a code owner June 19, 2023 12:47
@kozlovsky kozlovsky requested review from drew2a and removed request for a team June 19, 2023 12:47
src/tribler/core/utilities/pony_utils.py Outdated Show resolved Hide resolved
src/tribler/core/utilities/pony_utils.py Outdated Show resolved Hide resolved
src/tribler/core/utilities/pony_utils.py Outdated Show resolved Hide resolved
src/tribler/core/utilities/pony_utils.py Outdated Show resolved Hide resolved
src/tribler/core/utilities/pony_utils.py Outdated Show resolved Hide resolved
src/tribler/core/utilities/pony_utils.py Outdated Show resolved Hide resolved
src/tribler/core/utilities/pony_utils.py Outdated Show resolved Hide resolved
src/tribler/core/utilities/pony_utils.py Outdated Show resolved Hide resolved
src/tribler/core/utilities/pony_utils.py Outdated Show resolved Hide resolved
src/tribler/core/utilities/pony_utils.py Outdated Show resolved Hide resolved
@kozlovsky kozlovsky force-pushed the slow_dbsession_detection branch 7 times, most recently from 5d0250c to c1cd86a Compare June 29, 2023 15:02
@kozlovsky kozlovsky requested a review from drew2a June 29, 2023 15:14
Copy link
Collaborator

@drew2a drew2a left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a few minor comments regarding the main part of the PR and some substantial remarks concerning the testing section.

Since we're awaiting this PR for the release of a new RC version, I've given my approval.

src/tribler/core/utilities/pony_utils.py Outdated Show resolved Hide resolved
src/tribler/core/utilities/pony_utils.py Outdated Show resolved Hide resolved
src/tribler/core/utilities/tests/test_pony_utils.py Outdated Show resolved Hide resolved
src/tribler/core/utilities/tests/test_pony_utils.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@drew2a drew2a left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've just left one minor comment. The rest of the PR appears good to me.


def _format_warning(self, db_session_duration: float, info: DbSessionInfo) -> str:
local_statistics = self._summarize_stat(db.local_stats for db in databases_to_track)
threshold = SLOW_DB_SESSION_DURATION_THRESHOLD if self.duration_threshold is None else self.duration_threshold
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please consider the following option:

class TriblerDbSession(core.DBSessionContextManager):
    def __init__(self, *args, duration_threshold: float = SLOW_DB_SESSION_DURATION_THRESHOLD, **kwargs):
        super().__init__(*args, **kwargs)
        # `duration_threshold` specifies how long db_session should be to trigger the long db_session warning.
        # When `duration_threshold` is None, `SLOW_DB_SESSION_DURATION_THRESHOLD` value is used instead.
        self.duration_threshold = duration_threshold

this way it is more clear (and should keep the desired logic).

Copy link
Collaborator Author

@kozlovsky kozlovsky Jul 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sometimes it may be useful to change the value of SLOW_DB_SESSION_DURATION_THRESHOLD dynamically (for example, in the run_tribler.py script). The current design allows it.

The alternate approach of setting SLOW_DB_SESSION_DURATION_THRESHOLD as a default parameter in the db_session constructor has a significant drawback. In Python, default values are evaluated at the point of function definition in the defining scope. Therefore, when @db_session is used as a decorator, it captures the value of SLOW_DB_SESSION_DURATION_THRESHOLD at the time of import. This makes it impossible to modify the SLOW_DB_SESSION_DURATION_THRESHOLD value post-import in a manner that affects the behavior of the @db_session decorator. Consequently, the sequence in which modules are imported gains significance, which can introduce unexpected behaviors and maintenance challenges.

In contrast, the current design ensures that the db_session retrieves the value of SLOW_DB_SESSION_DURATION_THRESHOLD each time it is invoked. This live retrieval accommodates dynamic changes and guarantees that the session consistently uses the most up-to-date threshold, thus circumventing the abovementioned issues.

In conclusion, the present architecture not only affords greater flexibility but also prevents potential complications arising from import order dependencies.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The proposed modification enables dynamic alteration of the duration_threshold value. This is evidenced by the test_patched_db_session passing after the said changes were applied to the current code:

diff --git a/src/tribler/core/utilities/pony_utils.py b/src/tribler/core/utilities/pony_utils.py
index 70f0859..6142082 100644
--- a/src/tribler/core/utilities/pony_utils.py
+++ b/src/tribler/core/utilities/pony_utils.py
@@ -15,8 +15,8 @@ from weakref import WeakSet
 
 from pony import orm
 from pony.orm import core
-from pony.orm.dbproviders import sqlite
 from pony.orm.core import Database, select
+from pony.orm.dbproviders import sqlite
 from pony.utils import localbase
 
 SLOW_DB_SESSION_DURATION_THRESHOLD = 1.0
@@ -132,7 +132,7 @@ Queries statistics for the entire application:
 
 
 class TriblerDbSession(core.DBSessionContextManager):
-    def __init__(self, *args, duration_threshold: Optional[float] = None, **kwargs):
+    def __init__(self, *args, duration_threshold: float = SLOW_DB_SESSION_DURATION_THRESHOLD, **kwargs):
         super().__init__(*args, **kwargs)
         # `duration_threshold` specifies how long db_session should be to trigger the long db_session warning.
         # When `duration_threshold` is None, `SLOW_DB_SESSION_DURATION_THRESHOLD` value is used instead.
@@ -193,8 +193,7 @@ class TriblerDbSession(core.DBSessionContextManager):
         start_time = info.start_time
         db_session_duration = time.time() - start_time
 
-        threshold = SLOW_DB_SESSION_DURATION_THRESHOLD if self.duration_threshold is None else self.duration_threshold
-        if db_session_duration > threshold:
+        if db_session_duration > self.duration_threshold:
             self._log_warning(db_session_duration, info)
 
     def _log_warning(self, db_session_duration: float, info: DbSessionInfo):

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The proposed modification enables dynamic alteration of the duration_threshold value. This is evidenced by the test_patched_db_session passing after the said changes were applied to the current code.

No, it doesn't. After the suggested change, any db_session applied as a decorator in the module scope will use the value of SLOW_DB_SESSION_DURATION_THRESHOLD at the moment of the module import.

There are two reasons why the test is passed successfully after your changes:

  1. The db_session in the test uses an explicitly specified argument duration_threshold=0.0, and in the current discussion, we are talking about the case when the duration threshold is not explicitly specified for a db_session (that is, for all db_sessions used in Tribler's actual code)
  2. Even if we remove the explicit duration_threshold=0.0 specification for the db_session in the test, the db_session is applied to the _perform_queries inner function at the test function execution time and not at the module import time, so its successful execution does not mean the same will work for db_sessions applied to module-level functions.

I added a new test_patched_db_session_default_duration_threshold test that checks the db_session defined at the module level, and the new test fails with your changes. In the test, I'm again dynamically changing the value SLOW_DB_SESSION_DURATION_THRESHOLD (as I did originally) because this is the case I want to test here, that the dynamic change of SLOW_DB_SESSION_DURATION_THRESHOLD is handled correctly.

Copy link
Collaborator

@drew2a drew2a Jul 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your points are valid.

However, I don't think it's necessary or beneficial to customize the default value of the threshold. Could you provide a plausible example where this feature would be used?

The change I proposed, allows for the dynamic alteration of the actual threshold value, which you even confirmed by acknowledging that the tests passed due to the explicitly specified threshold.

In the current discussion, we are discussing the case when the duration threshold is not explicitly specified.

I perceive our current discussion differently.

We both agreed on the point: "Sometimes it may be useful to dynamically change the threshold value". Both our code versions allow this.

I don't see any advantages to allowing the default value to be changed. However, I do see the benefits in a simplified version of the code without this feature.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To present a plausible use case, imagine running a pytest suite where there's a need to adjust the db_session duration threshold across all functions under test. This is a legitimate and practical scenario. The implementation in my PR caters to this requirement perfectly, but the modification you propose would make it infeasible.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point is not about functionality. It is about surprising code behavior.

Any code written by someone else can lead to unexpected behavior. This is why it should be as simple as possible and provide the most straightforward API to the user (future developer) as possible.

What I'm advocating for is not an exotic feature but a standard flexibility that could be expected by developers who are used to configuring applications dynamically.

You're developing a feature that is not needed in Tribler unless it becomes a requirement. Less code equates to less maintenance. Reduced complexity results in fewer bugs. Even a small step in this direction is worth taking.

Your analogy with sending emails about test results is not directly comparable. Email notifications are an entirely different feature, whereas the capability I am defending is an extension of configuration flexibility, which is a fundamental aspect of software customization.

It is a bit extreme. I wrote it in the hope you could get the point. So, here is the refined version:

I acknowledge your perspective, but I must emphasize that a future developer could potentially find the option of filtering SQL queries by regex beneficial.

We can imagine so many features that "could potentially be beneficial", but it doesn't mean we should implement them.

Copy link
Collaborator Author

@kozlovsky kozlovsky Jul 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any code written by someone else can lead to unexpected behavior. This is why it should be as simple as possible and provide the most straightforward API to the user (future developer) as possible.

Absolutely! And one such unexpected behavior in Python relates to default function arguments, a pitfall that's well-documented and being addressed via PEP 671.

My approach aims to use the late binding of SLOW_DB_SESSION_DURATION_THRESHOLD, keeping the logic straightforward. However, the proposed change uses default arguments, which bind early and can lead to the aforementioned pitfall.

Let me clarify:

  1. The proposed change blocks an important PR that is essential for the Tribler release.
  2. It disrupts use cases that I believe are valid:
    • Adjusting the configuration when a specific script starts
    • Altering the db_session duration threshold for testing purposes.
  3. It introduces a potential complexity and error-prone behavior by relying on Python's default arguments.
  4. The rationale for the proposed change seems to be primarily centered on simplicity, but I contend that it is at the cost of flexibility and falling into known Python pitfalls.

Reduced complexity results in fewer bugs. Even a small step in this direction is worth taking.

I concur, and evading the default argument issue in Python is a step toward reducing complexity.

We can imagine so many features that "could potentially be beneficial", but it doesn't mean we should implement them.

I agree, but in this instance, the flexibility is a natural outcome of the implementation in the PR, not an additional feature. It is streamlined, and the scenarios are accommodated by a single line of code.

In conclusion, while I appreciate the insights and value our collaborative efforts, I firmly believe that the current implementation is in the best interest of flexibility and reliability.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The proposed change you're referring to is not as consequential as you suggest:

  1. As we discussed yesterday, it is not essential for the Tribler release. All the necessary features are already in the release branch.
  2. Any perceived blockage is only on your side as we already have one approval. This means a majority of developers are in agreement with the change.

Your argument about disrupting valid use cases is speculative. We do not have any real necessity for supporting these scenarios.

Your assertion that the proposed change introduces potential complexity and error-prone behavior contradicts the arguments I wrote above (also, I asked you to prove your earlier claim, but you were not able to prove it).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any perceived blockage is only on your side as we already have one approval. This means a majority of developers are in agreement with the change.

I appreciate the clarification regarding the approval process. I was under the impression that an explicit rejection would necessitate addressing it before proceeding. Thank you for clearing that up.

As for the suggested change, I respect the opinions and the discussion we had, but I must stand by my conviction that the proposed modification could introduce design issues. With the understanding that the PR has received majority approval, I would like to proceed without incorporating this specific change.

Thank you for the engagement and valuable feedback throughout this review.

@kozlovsky kozlovsky requested a review from drew2a July 3, 2023 07:10
@kozlovsky kozlovsky merged commit 518132c into Tribler:release/7.13 Jul 5, 2023
16 checks passed
@kozlovsky kozlovsky deleted the slow_dbsession_detection branch July 5, 2023 07:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants