Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for report generator plugins that process modified files as a batch #300

Conversation

barrywhart
Copy link
Contributor

Fixes #299

@@ -64,7 +64,8 @@ To install the development version:

git clone https://github.com/Bachmann1234/diff-cover.git
cd diff-cover
python setup.py install
poetry install
poetry shell
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated to the rest of the PR -- I noticed that the instructions mention setup.py, which no longer exists.

@barrywhart barrywhart marked this pull request as draft November 26, 2022 22:26
@barrywhart barrywhart marked this pull request as ready for review November 26, 2022 22:48
@Bachmann1234
Copy link
Owner

I'll look at this closely this week. Sorry I did not get to it yet

@barrywhart
Copy link
Contributor Author

No problem! It's a roadmap item for us, not urgent. Hope you had a great holiday (if you're in the U.S.).

self._diff_violations_dict = {
src_path: DiffViolations(
self._violations.violations(src_path),
violations.get(src_path, [])
if violations is not None
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so the violations reporter is supposed to cache these results so each is only processed once.

The function you are calling violations_batch could also be described as warm_cache.

Does that sound right to you? The reason I wanna rename this is because I want to hint that the implementor should be saving the results so we dont have to remember the state later (basically remove this if check)

basically....

Im wondering if you would go with renaming violations_batch to warm_cache and then removing the if violations is not None and just calling violations as is (under the assumption that the implementor has saved the results so the lookup is quick)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My assumption in checking for if violations is not None is that it will be None if the function is not implemented. In that case, wouldn't we need to call violations()?

Bigger picture: For SQLFluff, violations_batch() is not a warm cache. Rather, SQLFluff is a fairly slow-running tool that is often used to process hundreds or even thousands of files. If diff-quality calls SQLFluff one file at a time, there is no opportunity to use parallel processing. Adding a function like violations_batch() provides more opportunity for the implementor to use parallel processing or other clever strategies to improve performance. It can definitely be used for caching as well.

Does this change your thoughts on the above?

It may be helpful (in terms of code clarity) to remove the line violations = None and change the subsequent code to:

            try:
                violations = self._violations.violations_batch(src_paths_changed)
            except NotImplementedError:
                violations = {src_path: self._violations.violations(src_path)}
            self._diff_violations_dict = {
                src_path: DiffViolations(
                    violations.get(src_path, []),
                    ...

This ensures that violations is always defined and not None.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear, I'm okay renaming the function if you like, but I wanted to explain my thinking, since it's not always going to be used for caching.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like your updated code block.

Essentially I was trying to think of a way where once we do the check for violations_batch we dont have to think about it any more.

So the idea behind the 'warming the cache' was that violations_batch would precompute the violations and then violations itself would just become a lookup rather than a full computation.

But the block you posted at the end achieves the same goal (I think). Violations would either come from violations_batch or the dict comprehension.

I just wanna make sure we only have to check if this function exists the one time and then sorta process as normal from there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My proposed code above doesn't work "as is" because it uses src_path outside the loop. I'll try to do something similar. I was trying to avoid looping over the list of files twice, but maybe that's the best option.

@@ -172,15 +172,29 @@ def _diff_violations(self):

To make this efficient, we cache and reuse the result.
"""
src_paths_changed = self._diff.src_paths_changed()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest reviewing this file with "Hide whitespace" enabled in GitHub.

for src_path in src_paths_changed
}
except NotImplementedError:
self._diff_violations_dict = {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is essentially identical to main (only difference is that we're looping over src_paths_changed.

self._violations.measured_lines(src_path),
self._diff.lines_changed(src_path),
try:
violations = self._violations.violations_batch(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is very similar to the except block, but it uses results from violations_batch() rather than calling violations().

It duplicates some code from the except block, but overall the duplication seems to make the code more readable by keeping the two implementations (using violations_batch() versus violations()) distinct from one another.

@barrywhart
Copy link
Contributor Author

Ready for another review, I think.

pyproject.toml Outdated
@@ -1,6 +1,6 @@
[tool.poetry]
name = "diff_cover"
version = "7.1.0"
version = "7.1.1"
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let me handle this. I just do this when I bump versions.

This is a feature and non breaking so I would probably go ahead and make this release 7.2.0 anyway

@Bachmann1234
Copy link
Owner

Ok. Think this can work. I think if you just update from main (and just remove the change to pyproject.toml ill do that when I cut the release) ill go ahead and merge this

@Bachmann1234
Copy link
Owner

wait, im an admin, what am I doing. Illl fix the conflict

@Bachmann1234
Copy link
Owner

@barrywhart I realized I should probably get a final "this is ready" from you before I merge and release. I tend to be a bit trigger happy with the release button. So just leave me a note and ill get this out asap (likely the evening I see it)

Remove commented-out code in test
@barrywhart
Copy link
Contributor Author

I removed a few lines of commented-out code. This is ready to merge once the build passes.

Thanks for the review! SQLFluff users will appreciate the speedier performance once we update it to take advantage of this. 🙏

@Bachmann1234 Bachmann1234 merged commit 66fa119 into Bachmann1234:main Dec 1, 2022
@Bachmann1234
Copy link
Owner

https://pypi.org/project/diff-cover/7.2.0/ is released!

@barrywhart
Copy link
Contributor Author

That's great -- thank you so much!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add support for report generator plugins that process modified files as a batch, rather than individually
2 participants