[store] Unique reports _before_ storing #4152

Szelethus · 2024-01-15T10:46:33Z

We used to zip every result file in the result directory and send it to the server. This is tremendously wasteful, as many report files contained the very same reports (for instance, if the report originated from a header file).

In this patch, I unique all reports, dump them in a tmpfile, and store that to the server. For a clang-tidy --enable-all analysis on xerces, this reduced the size of the zipfile drastically:

BEFORE:

Compressing report zip file done (1.6GiB / 62.2MiB).
[...]
real	8m32,381s
user	15m57,436s
sys	0m14,708s

AFTER:

Compressing report zip file done (413.0MiB / 9.0MiB).
[...]
real	5m45,621s
user	15m54,858s
sys	0m14,458s

While this doesn't speed up CodeChecker parse or CodeChecker diff, it still speeds up the server quite a bit, both for storage and query times.

I also did a little cleanup on the debug logs -- the one in report converter spit out a line for literally every report, that feels a little unnecessary. I also added some logs to show that the parsing of result files is in progress, though that could be a little more verbose.

edit:
In addition, I needed to fix plist parsing at one point, and also, I had a looooot of trouble with AnalysisInfo not being set for each report. The problem was in this nasty ass loop:

    LOG.info("Building report zip file...")
    with zipfile.ZipFile(zip_file, 'a', allowZip64=True) as zipf:
        # Add the files to the zip which will be sent to the server.
        for file_path in files_to_compress:
            _, file_name = os.path.split(file_path)

            # Create a unique report directory name.
            report_dir_name = hashlib.md5(os.path.dirname(
                file_path).encode('utf-8')).hexdigest()

            zip_target = os.path.join('reports', report_dir_name, file_name)
            zipf.write(file_path, zip_target)

Now, since we support multiple report dir storages at once, in the zipped file, each of those report dirs get their own report dir. However, do you also not see the clear implementation of that in the form of "for each input dir, create an output dir"? That nasty os.path.dirname part does all that, and my initial implementation didn't set the dirname right. Since this manifested in errors seemingly unrelated parts of the code, I was lead on a month-long chase to find out that this was the issue.

Also, since we handle all input report dirs all at once in a rather cryptic manner, I also chose to just brute-force my implementation into the code, instead of refactoring it into an easy-to-digest "for each dir in input dir, zip dir". That might be worth a followup refactoring patch.

We used to zip every result file in the result directory and send it to the server. This is tremendously wasteful, as many report files contained the very same reports (for instance, if the report originated from a header file). In this patch, I unique all reports, dump them in a tmpfile, and store that to the server. For a clang-tidy --enable-all analysis on xerces, this reduced the size of the zipfile drastically: BEFORE: Compressing report zip file done (1.6GiB / 62.2MiB). AFTER: Compressing report zip file done (372.6MiB / 7.9MiB). While this doesn't speed up CodeChecker parse or CodeChecker diff, it still speeds up the server quite a bit, both for storage and query times.

tools/report-converter/codechecker_report_converter/report/hash.py

bruntib · 2024-03-11T13:36:32Z

web/client/codechecker_client/cmd/store.py

-                analyzer_result_files)):
+                analyzer_result_files))):
+        if idx % 10 == 0:
+            LOG.debug(f"Parsed {idx}/{len(analyzer_result_files)} files...")


I think, we should leave debug logs in the code only if they are valuable when asking them from the users for some debugging session.

I changed the output to print the name of the file as it is parsed. I found this helpful during development, and I don't think its too verbose.

[DEBUG] store.py:458 assemble_zip() - Processing 3 report files ... [DEBUG] store.py:412 parse_analyzer_result_files() - [0/3] Parsed '<path>/prune_paths.cpp_clangsa_e17b9269d5755586c9edd407de32f1df.plist' ... [DEBUG] store.py:412 parse_analyzer_result_files() - [1/3] Parsed '<path>/prune_paths.cpp_clang-tidy_e17b9269d5755586c9edd407de32f1df.plist' ... [DEBUG] store.py:412 parse_analyzer_result_files() - [2/3] Parsed '<path>/prune_paths.cpp_cppcheck_e17b9269d5755586c9edd407de32f1df.plist' ...

web/client/codechecker_client/cmd/store.py

tools/report-converter/codechecker_report_converter/report/parser/plist.py

Szelethus added the server 🖥️ label Jan 15, 2024

Szelethus added this to the release 6.24.0 milestone Jan 15, 2024

Szelethus requested review from whisperity, dkrupp and cservakt January 15, 2024 10:46

Szelethus requested review from bruntib and vodorok as code owners January 15, 2024 10:46

Szelethus added 2 commits March 7, 2024 11:45

Tidy up the patch

3ba2130

Szelethus force-pushed the store_primitive_speedup branch from 4c3bb47 to 3ba2130 Compare March 11, 2024 10:04

bruntib requested changes Mar 12, 2024

View reviewed changes

Partial fixes according to bruntib's comments

40b1c6d

whisperity added the enhancement 🌟 label Mar 21, 2024

Remove the temporary files

5328ded

Szelethus requested a review from bruntib March 21, 2024 11:59

whisperity added the performance 🏃 label Mar 27, 2024

bruntib reviewed Mar 27, 2024

View reviewed changes

tools/report-converter/codechecker_report_converter/report/parser/plist.py Show resolved Hide resolved

bruntib approved these changes Apr 2, 2024

View reviewed changes

bruntib merged commit 3115d95 into Ericsson:master Apr 2, 2024
8 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[store] Unique reports _before_ storing #4152

[store] Unique reports _before_ storing #4152

Szelethus commented Jan 15, 2024 •

edited

bruntib Mar 11, 2024

Szelethus Mar 21, 2024

[store] Unique reports _before_ storing #4152

[store] Unique reports _before_ storing #4152

Conversation

Szelethus commented Jan 15, 2024 • edited

bruntib Mar 11, 2024

Choose a reason for hiding this comment

Szelethus Mar 21, 2024

Choose a reason for hiding this comment

Szelethus commented Jan 15, 2024 •

edited