Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[store] Unique reports _before_ storing #4152

Merged
merged 4 commits into from
Apr 2, 2024

Conversation

Szelethus
Copy link
Collaborator

@Szelethus Szelethus commented Jan 15, 2024

We used to zip every result file in the result directory and send it to the server. This is tremendously wasteful, as many report files contained the very same reports (for instance, if the report originated from a header file).

In this patch, I unique all reports, dump them in a tmpfile, and store that to the server. For a clang-tidy --enable-all analysis on xerces, this reduced the size of the zipfile drastically:

BEFORE:

Compressing report zip file done (1.6GiB / 62.2MiB).
[...]
real	8m32,381s
user	15m57,436s
sys	0m14,708s

AFTER:

Compressing report zip file done (413.0MiB / 9.0MiB).
[...]
real	5m45,621s
user	15m54,858s
sys	0m14,458s

While this doesn't speed up CodeChecker parse or CodeChecker diff, it still speeds up the server quite a bit, both for storage and query times.

I also did a little cleanup on the debug logs -- the one in report converter spit out a line for literally every report, that feels a little unnecessary. I also added some logs to show that the parsing of result files is in progress, though that could be a little more verbose.

edit:
In addition, I needed to fix plist parsing at one point, and also, I had a looooot of trouble with AnalysisInfo not being set for each report. The problem was in this nasty ass loop:

    LOG.info("Building report zip file...")
    with zipfile.ZipFile(zip_file, 'a', allowZip64=True) as zipf:
        # Add the files to the zip which will be sent to the server.
        for file_path in files_to_compress:
            _, file_name = os.path.split(file_path)

            # Create a unique report directory name.
            report_dir_name = hashlib.md5(os.path.dirname(
                file_path).encode('utf-8')).hexdigest()

            zip_target = os.path.join('reports', report_dir_name, file_name)
            zipf.write(file_path, zip_target)

Now, since we support multiple report dir storages at once, in the zipped file, each of those report dirs get their own report dir. However, do you also not see the clear implementation of that in the form of "for each input dir, create an output dir"? That nasty os.path.dirname part does all that, and my initial implementation didn't set the dirname right. Since this manifested in errors seemingly unrelated parts of the code, I was lead on a month-long chase to find out that this was the issue.

Also, since we handle all input report dirs all at once in a rather cryptic manner, I also chose to just brute-force my implementation into the code, instead of refactoring it into an easy-to-digest "for each dir in input dir, zip dir". That might be worth a followup refactoring patch.

We used to zip every result file in the result directory and send it to
the server. This is tremendously wasteful, as many report files
contained the very same reports (for instance, if the report originated
from a header file).

In this patch, I unique all reports, dump them in a tmpfile, and store
that to the server. For a clang-tidy --enable-all analysis on xerces,
this reduced the size of the zipfile drastically:

BEFORE: Compressing report zip file done (1.6GiB / 62.2MiB).
AFTER: Compressing report zip file done (372.6MiB / 7.9MiB).

While this doesn't speed up CodeChecker parse or CodeChecker diff, it
still speeds up the server quite a bit, both for storage and query
times.
analyzer_result_files)):
analyzer_result_files))):
if idx % 10 == 0:
LOG.debug(f"Parsed {idx}/{len(analyzer_result_files)} files...")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, we should leave debug logs in the code only if they are valuable when asking them from the users for some debugging session.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the output to print the name of the file as it is parsed. I found this helpful during development, and I don't think its too verbose.

[DEBUG] store.py:458 assemble_zip() - Processing 3 report files ...
[DEBUG] store.py:412 parse_analyzer_result_files() - [0/3] Parsed '<path>/prune_paths.cpp_clangsa_e17b9269d5755586c9edd407de32f1df.plist' ...
[DEBUG] store.py:412 parse_analyzer_result_files() - [1/3] Parsed '<path>/prune_paths.cpp_clang-tidy_e17b9269d5755586c9edd407de32f1df.plist' ...
[DEBUG] store.py:412 parse_analyzer_result_files() - [2/3] Parsed '<path>/prune_paths.cpp_cppcheck_e17b9269d5755586c9edd407de32f1df.plist' ...

web/client/codechecker_client/cmd/store.py Outdated Show resolved Hide resolved
web/client/codechecker_client/cmd/store.py Outdated Show resolved Hide resolved
@bruntib bruntib merged commit 3115d95 into Ericsson:master Apr 2, 2024
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants