-
Notifications
You must be signed in to change notification settings - Fork 186
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for report generator plugins that process modified files as a batch #300
Add support for report generator plugins that process modified files as a batch #300
Conversation
@@ -64,7 +64,8 @@ To install the development version: | |||
|
|||
git clone https://github.com/Bachmann1234/diff-cover.git | |||
cd diff-cover | |||
python setup.py install | |||
poetry install | |||
poetry shell |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unrelated to the rest of the PR -- I noticed that the instructions mention setup.py
, which no longer exists.
I'll look at this closely this week. Sorry I did not get to it yet |
No problem! It's a roadmap item for us, not urgent. Hope you had a great holiday (if you're in the U.S.). |
diff_cover/report_generator.py
Outdated
self._diff_violations_dict = { | ||
src_path: DiffViolations( | ||
self._violations.violations(src_path), | ||
violations.get(src_path, []) | ||
if violations is not None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so the violations reporter is supposed to cache these results so each is only processed once.
The function you are calling violations_batch
could also be described as warm_cache
.
Does that sound right to you? The reason I wanna rename this is because I want to hint that the implementor should be saving the results so we dont have to remember the state later (basically remove this if check)
basically....
Im wondering if you would go with renaming violations_batch
to warm_cache
and then removing the if violations is not None
and just calling violations
as is (under the assumption that the implementor has saved the results so the lookup is quick)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My assumption in checking for if violations is not None
is that it will be None
if the function is not implemented. In that case, wouldn't we need to call violations()
?
Bigger picture: For SQLFluff, violations_batch()
is not a warm cache. Rather, SQLFluff is a fairly slow-running tool that is often used to process hundreds or even thousands of files. If diff-quality
calls SQLFluff one file at a time, there is no opportunity to use parallel processing. Adding a function like violations_batch()
provides more opportunity for the implementor to use parallel processing or other clever strategies to improve performance. It can definitely be used for caching as well.
Does this change your thoughts on the above?
It may be helpful (in terms of code clarity) to remove the line violations = None
and change the subsequent code to:
try:
violations = self._violations.violations_batch(src_paths_changed)
except NotImplementedError:
violations = {src_path: self._violations.violations(src_path)}
self._diff_violations_dict = {
src_path: DiffViolations(
violations.get(src_path, []),
...
This ensures that violations
is always defined and not None
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be clear, I'm okay renaming the function if you like, but I wanted to explain my thinking, since it's not always going to be used for caching.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like your updated code block.
Essentially I was trying to think of a way where once we do the check for violations_batch
we dont have to think about it any more.
So the idea behind the 'warming the cache' was that violations_batch
would precompute the violations and then violations itself would just become a lookup rather than a full computation.
But the block you posted at the end achieves the same goal (I think). Violations would either come from violations_batch
or the dict comprehension.
I just wanna make sure we only have to check if this function exists the one time and then sorta process as normal from there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My proposed code above doesn't work "as is" because it uses src_path
outside the loop. I'll try to do something similar. I was trying to avoid looping over the list of files twice, but maybe that's the best option.
@@ -172,15 +172,29 @@ def _diff_violations(self): | |||
|
|||
To make this efficient, we cache and reuse the result. | |||
""" | |||
src_paths_changed = self._diff.src_paths_changed() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest reviewing this file with "Hide whitespace" enabled in GitHub.
for src_path in src_paths_changed | ||
} | ||
except NotImplementedError: | ||
self._diff_violations_dict = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code is essentially identical to main
(only difference is that we're looping over src_paths_changed
.
self._violations.measured_lines(src_path), | ||
self._diff.lines_changed(src_path), | ||
try: | ||
violations = self._violations.violations_batch( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code is very similar to the except
block, but it uses results from violations_batch()
rather than calling violations()
.
It duplicates some code from the except
block, but overall the duplication seems to make the code more readable by keeping the two implementations (using violations_batch()
versus violations()
) distinct from one another.
Ready for another review, I think. |
pyproject.toml
Outdated
@@ -1,6 +1,6 @@ | |||
[tool.poetry] | |||
name = "diff_cover" | |||
version = "7.1.0" | |||
version = "7.1.1" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let me handle this. I just do this when I bump versions.
This is a feature and non breaking so I would probably go ahead and make this release 7.2.0 anyway
Ok. Think this can work. I think if you just update from main (and just remove the change to |
wait, im an admin, what am I doing. Illl fix the conflict |
@barrywhart I realized I should probably get a final "this is ready" from you before I merge and release. I tend to be a bit trigger happy with the release button. So just leave me a note and ill get this out asap (likely the evening I see it) |
Remove commented-out code in test
I removed a few lines of commented-out code. This is ready to merge once the build passes. Thanks for the review! SQLFluff users will appreciate the speedier performance once we update it to take advantage of this. 🙏 |
https://pypi.org/project/diff-cover/7.2.0/ is released! |
That's great -- thank you so much! |
Fixes #299