Further enhancements to speed up the store procedure #3796
Conversation
Please give me some time to fix all the failing tests before diving into the details of the modifications.
Force-pushed from 2daa99c to b47b45c.
```python
# The original file path is needed here, not the trimmed, because
# the source files are extracted as the original file path.
source_file_name = os.path.realpath(os.path.join(
    source_root, report.file.original_path.strip("/")))

review_status = {}
```
Does using a `dict()` improve the performance? I mean, what overhead does creating a `ReviewStatus` object have?
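One way to get a feel for the difference is a micro-benchmark. The `ReviewStatus` class below is a hypothetical plain-Python stand-in, not the real ORM model — a real SQLAlchemy-mapped class carries per-instance instrumentation, so its construction overhead is larger than this sketch suggests:

```python
import timeit

class ReviewStatus:
    """Hypothetical plain-Python stand-in for the ORM model."""
    def __init__(self, bug_hash, status, author):
        self.bug_hash = bug_hash
        self.status = status
        self.author = author

def make_dict():
    # Building a plain dict: no class machinery involved.
    return {"bug_hash": "h", "status": "confirmed", "author": "x"}

def make_obj():
    # Instantiating an object: attribute assignment via __init__.
    return ReviewStatus("h", "confirmed", "x")

dict_time = timeit.timeit(make_dict, number=100_000)
obj_time = timeit.timeit(make_obj, number=100_000)
print(f"dict: {dict_time:.3f}s, object: {obj_time:.3f}s")
```

The absolute numbers depend on the interpreter; the point is that per-report object creation multiplies whatever the gap is by the report count.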
```python
review_status = session.query(ReviewStatus) \
    .filter(ReviewStatus.bug_hash == report.report_hash) \
    .one_or_none()
```
Please, change the function's documentation accordingly.
```python
for root_dir_path, _, report_file_paths in os.walk(report_dir):
    for f in report_file_paths:
        if not report_file.is_supported(f):
            continue
```
What is the purpose of this part?
```python
processed_result_file_count += 1

# Get all relevant review_statuses for the newly stored reports
# TODO Call self.getReviewStatusRules instead of the beloew query
```
Suggested change:

```diff
-# TODO Call self.getReviewStatusRules instead of the beloew query
+# TODO Call self.getReviewStatusRules instead of the query below
```
```python
reports_to_rs_rules = session.query(ReviewStatus, DBReport) \
    .join(DBReport, DBReport.bug_id == ReviewStatus.bug_hash) \
    .filter(sqlalchemy.and_(DBReport.run_id == run_id,
                            ReviewStatus.bug_hash.
                            in_(self.__new_report_hashes)))
```
Is it worth joining the `DBReport` table and filtering on the `run_id`?
```python
    }
)
# Update all newly stored reports if there are any rs-rules for them
session.bulk_update_mappings(DBReport, review_status_change)
```
According to the docs, `bulk_update_mappings` is a legacy function. The modern version is a simple UPDATE statement: https://docs.sqlalchemy.org/en/20/orm/queryguide/dml.html#orm-queryguide-bulk-update
This is a minor comment, though. We can use this one too.
Please make sure that you test this change extensively on an open-source project. I suggest the following:
- Analyze version x of Xerces and store it to the server
- Change the review status of 10 reports on the GUI
- Change the review status of 5 reports in the source code
- Clone version x+y of Xerces and apply 3 source code suppressions to the new version
- Analyze and store it to the same run

Repeat this experiment with the baseline version of CodeChecker. The detection & review statuses of the issues must be the same.
Create a Jenkins job for this. We could use it as a test for further store changes.
Force-pushed from 5556ef1 to 8a5c6ec.
This modification will affect the
Commit messages in this push:
- [server] Batch the review status queries: previously, for every newly inserted report, a review status query was run to determine the value to set. This was factored out and is now done in a single batch, at the end of inserting the reports.
- [server] Cache severity levels.
- Remove profiling code, and fix linter errors and tests.
- Refactor report storing for runtime performance gains: to be able to move the `session.flush()` from after each report to after the processing of all report files has completed, a new helper method, `__add_report_context()`, was introduced. It handles the addition of notes and macro expansions in a separate step, after the reports have been flushed to the DB and have already had their primary keys assigned. Also added a helper class to handle review statuses stored in source code.

By moving the addition of annotations from the `add_report` method to the `add_report_context` method (and removing the flush from the handler function), the same session flushing strategy was restored.
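The per-report vs. batched lookup idea can be sketched without the ORM. Here `review_status_table` is a hypothetical in-memory stand-in for the `ReviewStatus` table, and each dict access stands in for a database round trip:

```python
# Hypothetical in-memory stand-in for the ReviewStatus table.
review_status_table = {
    "hash-1": "confirmed",
    "hash-2": "false_positive",
}

def statuses_per_report(report_hashes):
    """Old pattern: one lookup (think: one SQL query) per stored report."""
    result = {}
    for h in report_hashes:  # N round trips to the database
        status = review_status_table.get(h)
        if status is not None:
            result[h] = status
    return result

def statuses_batched(report_hashes):
    """New pattern: one lookup over the whole batch (think: one IN query)."""
    wanted = set(report_hashes)  # a single round trip to the database
    return {h: s for h, s in review_status_table.items() if h in wanted}
```

Both return the same mapping; the win is purely in replacing N queries (plus N rounds of Python/SQLAlchemy overhead) with one.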
8a5c6ec
to
4055ffd
Compare
This graph compares the memory consumption of CodeChecker with this PR's changes vs. the latest release, 6.21. We could combat this in multiple ways in the future.
LGTM
This pull request tries to optimize the store process by trimming down the Python access layer of SQLAlchemy. It consists of two parts:
Profiled the storage time of CodeChecker (master) before and after the modifications. The reports folder was produced with master Clang + master Cppcheck under master CodeChecker, with the `--enable-all` flag:
| Metric | Count |
|---|---|
| Number of processed analyzer result files | 1047 |
| Number of analyzer reports | 184471 |
Reports were stored into empty databases.
In the graphs, the colors are consistent across the data sets (in case the legends are hidden on the overview graphs).
Baseline measurements:
Mass storage time was 1823 secs.
The graph above shows the full profile of the entire API call.
The graph below shows the `__add_report` method. Please note the `checker_labels.py` (severity) call batch of 145 seconds. The following graph shows the detailed view of the `get_review_status` call batch:
Here the magenta bar represents the parsing of the source files (`parse_codechecker_review_comment`); its runtime was 140 seconds. It is faster than in the previous release due to the fixes applied in #3777.
The other call batches are Python (and SQLAlchemy) overhead of the per-report severity query.
Measurements after modification:
Mass storage time was 1313 secs.
Please note the absence of the get review status rules call batch in the graph above.
The orange section after `report_file.py` is the remnant of the slow review status retrieval logic: parsing the file is still needed (`parse_codechecker_review_comment`).
The last graph shows the impact of the severity-level cache: it is roughly 1000× faster (on large report counts).
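A severity cache of this kind can be sketched as a memoized lookup. `_query_severity` below is a hypothetical stand-in for the expensive per-checker label lookup; the checker-name set is small and fixed during a store, so after warm-up every report hits the cache:

```python
from functools import lru_cache

CALLS = {"count": 0}

def _query_severity(checker_name: str) -> str:
    """Hypothetical stand-in for the expensive label lookup."""
    CALLS["count"] += 1
    return "HIGH" if checker_name.startswith("core.") else "MEDIUM"

@lru_cache(maxsize=None)
def severity(checker_name: str) -> str:
    # Each distinct checker name triggers the real lookup exactly once;
    # all subsequent reports for that checker are served from the cache.
    return _query_severity(checker_name)

for _ in range(1000):
    severity("core.NullDereference")
print(CALLS["count"])  # prints 1: the underlying lookup ran only once
```

With ~184k reports but only a few hundred distinct checkers, this turns per-report work into per-checker work, which matches the magnitude of the speedup reported above.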
All in all, I've managed to shave off ~28% of the run time (1823 s -> 1313 s), which can be pretty significant when store times are scratching the one-hour mark.
The above figures clearly show that the same improvement could be made for the `__add_report` method (cache the new reports and do a bulk update after the setup; with this approach the bulk update step of this patch could be incorporated into the bulk insert). An interesting (and, in my opinion, needed) measurement would be to repeat these measurements with a database heavily populated with review status rules.