
Further enhancements to speed up the store procedure #3796

Merged: 3 commits from store_speedup_2 into Ericsson:master on Mar 24, 2023

Conversation

@vodorok (Collaborator) commented Nov 21, 2022

This pull request optimizes the store process by trimming down the Python access layer of SQLAlchemy.
It consists of two parts:

  1. Only apply the review status rules coming from the source code during the initial phase of report storage. After all the reports have been added to the database, a separate step first fetches all applicable review status rules from the server, then does a "bulk update" of every newly stored report to which a GUI-based review status rule applies.
  2. Cache the severity of the reports. Before this patch, a relatively expensive search of the checker labels was done for every report being stored. Due to the nature of analysis results, the number of reports greatly outweighs the number of report types (checkers), so caching the severity level improves the store time (a minimal sketch follows this list).
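
To illustrate part 2, here is a minimal sketch of such a cache; the checker_labels object and its severity() lookup mirror the checker_labels.py(severity) call visible in the profiles below, while the wrapper class itself is illustrative:

```python
class SeverityCache:
    """Memoize the per-checker severity lookup.

    Sketch only: `checker_labels` is assumed to expose a
    severity(checker_name) method, as in checker_labels.py.
    """
    def __init__(self, checker_labels):
        self.__checker_labels = checker_labels
        self.__cache = {}

    def get_severity(self, checker_name):
        # The expensive label search runs once per checker instead of
        # once per report; with ~184k reports but far fewer checkers,
        # nearly every lookup becomes a dict hit.
        if checker_name not in self.__cache:
            self.__cache[checker_name] = \
                self.__checker_labels.severity(checker_name)
        return self.__cache[checker_name]
```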

Profiled the storage time of CodeChecker (master) before and after the modifications:
The reports folder was made with master Clang + master Cppcheck under master CodeChecker, with the --enable-all flag:

| Metric | Count |
| --- | --- |
| Number of processed analyzer result files | 1047 |
| Number of analyzer reports | 184471 |

Reports were stored into empty databases.
In the profiling screenshots, the colors are kept consistent across the call sets (in case the labels are hidden on the overview views).

Baseline measurements:
Mass storage time was 1823 secs.
[profiler screenshot]
The above screenshot shows the full profile of the entire API call.
The screenshot below shows the __add_report method. Please note the checker_labels.py(severity) call batch of 145 seconds.
[profiler screenshot]
The following screenshot shows the detailed view of the get_review_status call batch:
[profiler screenshot]
Here the magenta bar represents the parsing of the source files (parse_codechecker_review_comment); its runtime was 140 sec. This is faster than in the previous release, thanks to the fixes applied in #3777.
The other call batches are Python (and SQLAlchemy) overhead of the per-report severity query.

Measurements after modification:
Mass storage time was 1313 secs.
[profiler screenshot]
Please note the absence of the get review status rules call batch in the above screenshot.
The orange section after report_file.py is the remnant of the slow review status retrieval logic: parsing the source files is still needed (parse_codechecker_review_comment).
[profiler screenshot]

The last screenshots show the impact of the severity level cache:
This is roughly 1000x faster (on large report counts).
[profiler screenshot]
[profiler screenshot]

All in all, I've managed to shave off ~28% of the run time (1823 s -> 1313 s), which can be pretty significant when store times are scratching the one-hour mark.

The above figures clearly show that the same improvement could be made to the __add_report method: cache the new reports and do a bulk update after the setup. With this approach, the bulk update step of this patch could be incorporated into the bulk insert.
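
A hedged sketch of that idea in SQLAlchemy 2.0 style, where executing insert() with a list of parameter dictionaries issues a bulk INSERT; the function, the parsed_reports iterable, and the column list are illustrative placeholders:

```python
from sqlalchemy import insert

def bulk_insert_reports(session, run_id, parsed_reports):
    # Collect plain row dictionaries instead of ORM instances while
    # parsing; parsed_reports is a placeholder name for the parsed
    # analyzer reports.
    new_report_rows = [
        {
            "bug_id": report.report_hash,
            "run_id": run_id,
            # ... further DBReport columns ...
        }
        for report in parsed_reports
    ]
    # SQLAlchemy 2.0 style ORM bulk INSERT: one executemany statement
    # instead of per-report session.add() calls.
    session.execute(insert(DBReport), new_report_rows)
```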

An interesting (and, in my opinion, needed) follow-up would be to repeat these measurements against a database heavily populated with review status rules.

@vodorok (Collaborator, Author) commented Nov 21, 2022

Please give me some time to fix all the failing tests before diving into the details of the modifications.

@vodorok vodorok force-pushed the store_speedup_2 branch 4 times, most recently from 2daa99c to b47b45c on November 23, 2022 10:53
```python
# The original file path is needed here, not the trimmed, because
# the source files are extracted as the original file path.
source_file_name = os.path.realpath(os.path.join(
    source_root, report.file.original_path.strip("/")))

review_status = {}
```
Contributor:

Does using a dict() improve the performance? I mean, what overhead does creating a ReviewStatus object have?

Comment on lines -867 to -871
```python
review_status = session.query(ReviewStatus) \
    .filter(ReviewStatus.bug_hash == report.report_hash) \
    .one_or_none()
```
Contributor:

Please change the function's documentation accordingly.

Comment on lines 998 to 1001
```python
for root_dir_path, _, report_file_paths in os.walk(report_dir):
    for f in report_file_paths:
        if not report_file.is_supported(f):
            continue
```
Contributor:

What is the purpose of this part?

```python
processed_result_file_count += 1

# Get all relevant review_statuses for the newly stored reports
# TODO Call self.getReviewStatusRules instead of the beloew query
```
Contributor:

Suggested change:
```diff
-# TODO Call self.getReviewStatusRules instead of the beloew query
+# TODO Call self.getReviewStatusRules instead of the query below
```

Comment on lines 1028 to 1125
```python
reports_to_rs_rules = session.query(ReviewStatus, DBReport) \
    .join(DBReport, DBReport.bug_id == ReviewStatus.bug_hash) \
    .filter(sqlalchemy.and_(DBReport.run_id == run_id,
                            ReviewStatus.bug_hash.
                            in_(self.__new_report_hashes)))
```
Contributor:

Is it worth joining the DBReport table and filtering on the run_id?
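
For comparison, a hedged sketch of the join-free variant this question implies, filtering ReviewStatus directly on the new report hashes (identifiers taken from the snippet above); whether it is equivalent depends on the run_id semantics being probed here:

```python
# Fetch only the matching review status rules, without joining DBReport.
rs_rules = session.query(ReviewStatus) \
    .filter(ReviewStatus.bug_hash.in_(self.__new_report_hashes)) \
    .all()
```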

```python
    }
)
# Update all newly stored reports if there are any rs-rules for them
session.bulk_update_mappings(DBReport, review_status_change)
```
Contributor:

According to the docs, this bulk_update_mappings is a legacy function. The modern version is a simple update statement: https://docs.sqlalchemy.org/en/20/orm/queryguide/dml.html#orm-queryguide-bulk-update
This is a minor comment, though. We can use this one too.
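
For reference, a hedged sketch of the modern form ("ORM bulk UPDATE by primary key" in SQLAlchemy 2.0), assuming review_status_change is a list of dicts that each carry the report's primary key:

```python
from sqlalchemy import update

# Executing update() against the mapped class with a list of parameter
# dictionaries performs a bulk UPDATE by primary key, the documented
# replacement for the legacy session.bulk_update_mappings().
session.execute(update(DBReport), review_status_change)
```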

@dkrupp (Member) left a comment

Please make sure that you test this change extensively on an open source project.

I suggest the following.

  1. Analyze version x of xerces and store it to the server
  2. Change the review status of 10 reports on the GUI
  3. Change the review status of 5 reports in the source code
  4. Clone version x+y of xerces and apply 3 source code suppressions to the new version
  5. Analyze and store it to the same run

Repeat this experiment with the baseline version of CodeChecker.

The detection & review status of the issues must be the same.

Create a Jenkins job for this; we could use it as a test for further store changes.

@vodorok vodorok force-pushed the store_speedup_2 branch 4 times, most recently from 5556ef1 to 8a5c6ec on March 8, 2023 15:16
@Szelethus Szelethus added this to the release 6.22.0 milestone Mar 16, 2023
@vodorok vodorok requested a review from bruntib March 20, 2023 08:57
@vodorok (Collaborator, Author) commented Mar 20, 2023

This modification affects the required idle-in-transaction timeout setting of the PostgreSQL config. For larger report folders I would advise a 15-30 minute setting.
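
For example, assuming the standard PostgreSQL parameter is the one meant (idle_in_transaction_session_timeout), the advised range could be applied in postgresql.conf like this; the exact value is an assumption:

```
# postgresql.conf
# Terminate sessions that sit idle inside a transaction longer than
# this; 0 (the default) disables the timeout entirely.
idle_in_transaction_session_timeout = '30min'
```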

Previously, a review status query was done for every newly inserted report to determine the value to be set. This was factored out and is instead done in a single batch at the end of inserting the reports.

[server] Cache severity levels

Remove profiling code, and fix linter errors and tests

Refactor report storing, runtime performance gains

To be able to move the session.flush() from after each report to after the processing of all report files has completed, a new helper method had to be introduced: __add_report_context().

This will handle the addition of notes and macro expansions in a separate step, after the reports have been flushed to the DB and have already gotten their primary keys assigned.
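
A hedged sketch of the reworked flushing strategy described above; add_report and add_report_context stand in for the private __add_report / __add_report_context methods, and parse() is a placeholder for the report-file parsing step:

```python
def store_all(session, report_files, parse, add_report, add_report_context):
    pending = []
    for report_file in report_files:
        for report in parse(report_file):
            db_report = add_report(session, report)  # no per-report flush
            pending.append((db_report, report))

    # One flush for the whole batch: every DBReport row gets its primary
    # key assigned here.
    session.flush()

    # Notes and macro expansions reference the report's primary key, so
    # they are attached only after the flush, in a separate step.
    for db_report, report in pending:
        add_report_context(session, db_report, report)
```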

Also added a helper class to handle review statuses stored in source code.
By moving the addition of annotations from the add_report method to the add_report_context method (and removing the flush from the handler function), the same session flushing strategy was restored.
@vodorok (Collaborator, Author) commented Mar 23, 2023

[memory consumption graph]

This graph compares the memory consumption of CodeChecker with this PR's changes vs. the latest release, 6.21.
The max consumption is the same because of the spike at the beginning, but after the store is finished, roughly 4x as much memory is retained in the process as before.

We could combat this in multiple ways in the future:

  • Moving the flushing to the "middle", e.g. flushing per plist file.
  • Restarting the worker processes after the storing is finished.
  • Wrapping the whole store in a process that dies when the store is finished (see the sketch below).
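
A hedged sketch of the third option, assuming the store can be expressed as a plain callable handed to a short-lived child process:

```python
import multiprocessing

def store_in_child(store_fn, *args):
    # Run the store in a child process; when it exits, any memory it
    # retained is returned to the OS, sidestepping the 4x retention.
    proc = multiprocessing.Process(target=store_fn, args=args)
    proc.start()
    proc.join()
    return proc.exitcode
```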

@vodorok vodorok requested a review from dkrupp March 23, 2023 13:42
@dkrupp (Member) left a comment

LGTM

@dkrupp dkrupp merged commit e95173e into Ericsson:master Mar 24, 2023