
Further enhancements to speed up the store procedure #3796

Merged: 3 commits from store_speedup_2 into Ericsson:master on Mar 24, 2023

Conversation

@vodorok (Collaborator) commented Nov 21, 2022

This pull request optimizes the store process by trimming down the Python access layer of SQLAlchemy.
It consists of two parts:

  1. Only apply the review status rules coming from the source code during the initial phase of report storage. After all the reports have been added to the database, a separate step first fetches all applicable review status rules from the server, then does a "bulk update" of every newly stored report to which a GUI-based review status rule applies.
  2. Cache the severity of the reports. Before this patch, a relatively expensive search of the checker labels was done for every report being stored. Due to the nature of analysis results, the number of reports greatly outweighs the number of report types (checkers), so caching the severity level improves the store time (a minimal sketch follows this list).
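
To illustrate part 2, here is a minimal sketch of such a cache; the checker_labels object and its severity() lookup mirror the checker_labels.py(severity) call visible in the profiles below, while the wrapper class itself is illustrative:

```python
class SeverityCache:
    """Memoize the per-checker severity lookup.

    Sketch only: `checker_labels` is assumed to expose a
    severity(checker_name) method, as in checker_labels.py.
    """
    def __init__(self, checker_labels):
        self.__checker_labels = checker_labels
        self.__cache = {}

    def get_severity(self, checker_name):
        # The expensive label search runs once per checker instead of
        # once per report; with ~184k reports but far fewer checkers,
        # nearly every lookup becomes a dict hit.
        if checker_name not in self.__cache:
            self.__cache[checker_name] = \
                self.__checker_labels.severity(checker_name)
        return self.__cache[checker_name]
```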

Profiled the storage time of CodeChecker (master) before and after the modifications:
The reports folder was made with master Clang + master Cppcheck under master CodeChecker, with the --enable-all flag:

| Metric | Count |
| --- | --- |
| Number of processed analyzer result files | 1047 |
| Number of analyzer reports | 184471 |

Reports were stored into empty databases.
In the profiling screenshots, the colors are kept consistent across the call sets (in case the labels are hidden on the overview views).

Baseline measurements:
Mass storage time was 1823 secs.
[profiler screenshot]
The above screenshot shows the full profile of the entire API call.
The screenshot below shows the __add_report method. Please note the checker_labels.py(severity) call batch of 145 seconds.
[profiler screenshot]
The following screenshot shows the detailed view of the get_review_status call batch:
[profiler screenshot]
Here the magenta bar represents the parsing of the source files (parse_codechecker_review_comment); its runtime was 140 sec. This is faster than in the previous release, thanks to the fixes applied in #3777.
The other call batches are Python (and SQLAlchemy) overhead of the per-report severity query.

Measurements after modification:
Mass storage time was 1313 secs.
[profiler screenshot]
Please note the absence of the get review status rules call batch in the above screenshot.
The orange section after report_file.py is the remnant of the slow review status retrieval logic: parsing the source files is still needed (parse_codechecker_review_comment).
[profiler screenshot]

The last screenshots show the impact of the severity level cache:
This is roughly 1000x faster (on large report counts).
[profiler screenshot]
[profiler screenshot]

All in all, I've managed to shave off ~28% of the run time (1823 s -> 1313 s), which can be pretty significant when store times are scratching the one-hour mark.

The above figures clearly show that the same improvement could be made to the __add_report method: cache the new reports and do a bulk update after the setup. With this approach, the bulk update step of this patch could be incorporated into the bulk insert.
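
A hedged sketch of that idea in SQLAlchemy 2.0 style, where executing insert() with a list of parameter dictionaries issues a bulk INSERT; the function, the parsed_reports iterable, and the column list are illustrative placeholders:

```python
from sqlalchemy import insert

def bulk_insert_reports(session, run_id, parsed_reports):
    # Collect plain row dictionaries instead of ORM instances while
    # parsing; parsed_reports is a placeholder name for the parsed
    # analyzer reports.
    new_report_rows = [
        {
            "bug_id": report.report_hash,
            "run_id": run_id,
            # ... further DBReport columns ...
        }
        for report in parsed_reports
    ]
    # SQLAlchemy 2.0 style ORM bulk INSERT: one executemany statement
    # instead of per-report session.add() calls.
    session.execute(insert(DBReport), new_report_rows)
```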

An interesting (and, in my opinion, needed) follow-up would be to repeat these measurements against a database heavily populated with review status rules.

@vodorok (Collaborator, Author) commented Nov 21, 2022

Please give me some time to fix all the failing tests before diving into the details of the modifications.

@vodorok vodorok force-pushed the store_speedup_2 branch 4 times, most recently from 2daa99c to b47b45c on November 23, 2022 10:53
```python
# The original file path is needed here, not the trimmed, because
# the source files are extracted as the original file path.
source_file_name = os.path.realpath(os.path.join(
    source_root, report.file.original_path.strip("/")))

review_status = {}
```
Contributor:

Does using a dict() improve the performance? I mean, what overhead does creating a ReviewStatus object have?

Comment on lines -867 to -871
```python
review_status = session.query(ReviewStatus) \
    .filter(ReviewStatus.bug_hash == report.report_hash) \
    .one_or_none()
```
Contributor:

Please change the function's documentation accordingly.

Comment on lines 998 to 1001
```python
for root_dir_path, _, report_file_paths in os.walk(report_dir):
    for f in report_file_paths:
        if not report_file.is_supported(f):
            continue
```
Contributor:

What is the purpose of this part?

```python
processed_result_file_count += 1

# Get all relevant review_statuses for the newly stored reports
# TODO Call self.getReviewStatusRules instead of the beloew query
```
Contributor:

Suggested change:
```diff
-# TODO Call self.getReviewStatusRules instead of the beloew query
+# TODO Call self.getReviewStatusRules instead of the query below
```

Comment on lines 1028 to 1125
```python
reports_to_rs_rules = session.query(ReviewStatus, DBReport) \
    .join(DBReport, DBReport.bug_id == ReviewStatus.bug_hash) \
    .filter(sqlalchemy.and_(DBReport.run_id == run_id,
                            ReviewStatus.bug_hash.
                            in_(self.__new_report_hashes)))
```
Contributor:

Is it worth joining the DBReport table and filtering on the run_id?
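
For comparison, a hedged sketch of the join-free variant this question implies, filtering ReviewStatus directly on the new report hashes (identifiers taken from the snippet above); whether it is equivalent depends on the run_id semantics being probed here:

```python
# Fetch only the matching review status rules, without joining DBReport.
rs_rules = session.query(ReviewStatus) \
    .filter(ReviewStatus.bug_hash.in_(self.__new_report_hashes)) \
    .all()
```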

```python
    }
)
# Update all newly stored reports if there are any rs-rules for them
session.bulk_update_mappings(DBReport, review_status_change)
```
Contributor:

According to the docs, this bulk_update_mappings is a legacy function. The modern version is a simple update statement: https://docs.sqlalchemy.org/en/20/orm/queryguide/dml.html#orm-queryguide-bulk-update
This is a minor comment, though. We can use this one too.
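
For reference, a hedged sketch of the modern form ("ORM bulk UPDATE by primary key" in SQLAlchemy 2.0), assuming review_status_change is a list of dicts that each carry the report's primary key:

```python
from sqlalchemy import update

# Executing update() against the mapped class with a list of parameter
# dictionaries performs a bulk UPDATE by primary key, the documented
# replacement for the legacy session.bulk_update_mappings().
session.execute(update(DBReport), review_status_change)
```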

@dkrupp (Member) left a comment

Please make sure that you test this change extensively on an open source project.

I suggest the following.

  1. Analyze version x of xerces and store it to the server
  2. Change the review status of 10 reports on the GUI
  3. Change the review status of 5 reports in the source code
  4. Clone version x+y of xerces and apply 3 source code suppressions to the new version
  5. Analyze and store it to the same run

Repeat this experiment with the baseline version of CodeChecker.

The detection & review status of the issues must be the same.

Create a Jenkins job for this; we could use it as a test for further store changes.

@vodorok vodorok force-pushed the store_speedup_2 branch 4 times, most recently from 5556ef1 to 8a5c6ec on March 8, 2023 15:16
@Szelethus Szelethus added this to the release 6.22.0 milestone Mar 16, 2023
@vodorok vodorok requested a review from bruntib March 20, 2023 08:57
@vodorok (Collaborator, Author) commented Mar 20, 2023

This modification affects the required idle-in-transaction timeout setting of the PostgreSQL config. For larger report folders I would advise a 15-30 minute setting.
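
For example, assuming the standard PostgreSQL parameter is the one meant (idle_in_transaction_session_timeout), the advised range could be applied in postgresql.conf like this; the exact value is an assumption:

```
# postgresql.conf
# Terminate sessions that sit idle inside a transaction longer than
# this; 0 (the default) disables the timeout entirely.
idle_in_transaction_session_timeout = '30min'
```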

Previously, a review status query was done for every newly inserted report to determine the value to be set. This was factored out and is instead done in a single batch at the end of inserting the reports.

[server] Cache severity levels

Remove profiling code, and fix linter errors and tests

Refactor report storing, runtime performance gains

To be able to move the session.flush() from after each report to after the processing of all report files has completed, a new helper method had to be introduced: __add_report_context().

This will handle the addition of notes and macro expansions in a separate step, after the reports have been flushed to the DB and have already gotten their primary keys assigned.
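
A hedged sketch of the reworked flushing strategy described above; add_report and add_report_context stand in for the private __add_report / __add_report_context methods, and parse() is a placeholder for the report-file parsing step:

```python
def store_all(session, report_files, parse, add_report, add_report_context):
    pending = []
    for report_file in report_files:
        for report in parse(report_file):
            db_report = add_report(session, report)  # no per-report flush
            pending.append((db_report, report))

    # One flush for the whole batch: every DBReport row gets its primary
    # key assigned here.
    session.flush()

    # Notes and macro expansions reference the report's primary key, so
    # they are attached only after the flush, in a separate step.
    for db_report, report in pending:
        add_report_context(session, db_report, report)
```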

Also added a helper class to handle review statuses stored in source code.
By moving the addition of annotations from the add_report method to the add_report_context method (and removing the flush from the handler function), the same session flushing strategy was restored.
@vodorok (Collaborator, Author) commented Mar 23, 2023

[memory consumption graph]

This graph compares the memory consumption of CodeChecker with this PR's changes vs. the latest release, 6.21.
The max consumption is the same because of the spike at the beginning, but after the store is finished, roughly 4x as much memory is retained in the process as before.

We could combat this in multiple ways in the future:

  • Moving the flushing to the "middle", e.g. flushing per plist file.
  • Restarting the worker processes after the storing is finished.
  • Wrapping the whole store in a process that dies when the store is finished (see the sketch below).
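
A hedged sketch of the third option, assuming the store can be expressed as a plain callable handed to a short-lived child process:

```python
import multiprocessing

def store_in_child(store_fn, *args):
    # Run the store in a child process; when it exits, any memory it
    # retained is returned to the OS, sidestepping the 4x retention.
    proc = multiprocessing.Process(target=store_fn, args=args)
    proc.start()
    proc.join()
    return proc.exitcode
```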

@vodorok vodorok requested a review from dkrupp March 23, 2023 13:42
@dkrupp (Member) left a comment

LGTM

@dkrupp dkrupp merged commit e95173e into Ericsson:master Mar 24, 2023