Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG+1][Py3] Check for 'return' with arguments inside generators #3869

Merged
merged 1 commit into from Dec 20, 2019

Conversation

elacuesta
Copy link
Member

Fixes #3484

I also considered a version using ast.NodeVisitor, I think the walk-based solution should be more efficient since it can stop the iteration if the conditions are satisfied, while the NodeVisitor class would visit all Return nodes.

class ReturnNodeVisitor(ast.NodeVisitor):

    def __init__(self, *args, **kwargs):
        super(ReturnNodeVisitor, self).__init__(*args, **kwargs)
        self.return_value = False

    def visit_Return(self, node):
        """
        Sets the 'return_value' instance variable to True if the value returned
        by the node is not None (either explicitly or implicitly)
        """
        if not is_safe_return_node(node):
            self.return_value = True


def check_gen_with_no_none_return_visitor(callable):
    """
    Logs a warning if a callable is a generator function and includes
    a 'return' statement with a value different than None
    """
    if inspect.isgeneratorfunction(callable):
        tree = ast.parse(inspect.getsource(callable))
        node_visitor = ReturnNodeVisitor()
        node_visitor.visit(tree)
        if node_visitor.return_value:
            print('The generator has a return statement with a value different than None')

I plan on doing some benchmarking to determine if this has a noticeable performance impact, but I look forward to reading your feedback.

@elacuesta elacuesta changed the title Warn if a callback/errback is a generator and includes a return statement with a no-None value Check for 'return' with arguments inside generators Jul 12, 2019
@elacuesta
Copy link
Member Author

Right, before Python 3.3 'return' with argument inside generator was a SyntaxError, I need to skip the test in py2. Good news is the py3 tests are green \o/

scrapy/utils/misc.py Outdated Show resolved Hide resolved
@codecov
Copy link

codecov bot commented Jul 12, 2019

Codecov Report

Merging #3869 into master will decrease coverage by 2.81%.
The diff coverage is 90%.

@@            Coverage Diff             @@
##           master    #3869      +/-   ##
==========================================
- Coverage   85.56%   82.74%   -2.82%     
==========================================
  Files         164      164              
  Lines        9551     9575      +24     
  Branches     1431     1436       +5     
==========================================
- Hits         8172     7923     -249     
- Misses       1132     1393     +261     
- Partials      247      259      +12
Impacted Files Coverage Δ
scrapy/core/scraper.py 86.75% <87.5%> (-1.76%) ⬇️
scrapy/utils/misc.py 95.09% <90.9%> (-1.2%) ⬇️
scrapy/linkextractors/sgml.py 0% <0%> (-96.81%) ⬇️
scrapy/linkextractors/regex.py 0% <0%> (-95.66%) ⬇️
scrapy/linkextractors/htmlparser.py 0% <0%> (-92.07%) ⬇️
scrapy/core/downloader/handlers/s3.py 62.9% <0%> (-32.26%) ⬇️
scrapy/extensions/statsmailer.py 0% <0%> (-30.44%) ⬇️
scrapy/utils/boto.py 46.66% <0%> (-26.67%) ⬇️
scrapy/_monkeypatches.py 42.85% <0%> (-14.29%) ⬇️
scrapy/link.py 86.36% <0%> (-13.64%) ⬇️
... and 15 more

@codecov
Copy link

codecov bot commented Jul 12, 2019

Codecov Report

Merging #3869 into master will decrease coverage by 0.12%.
The diff coverage is 93.47%.

@@            Coverage Diff             @@
##           master    #3869      +/-   ##
==========================================
- Coverage   83.85%   83.73%   -0.13%     
==========================================
  Files         165      165              
  Lines        9615     9656      +41     
  Branches     1440     1445       +5     
==========================================
+ Hits         8063     8085      +22     
- Misses       1304     1323      +19     
  Partials      248      248
Impacted Files Coverage Δ
scrapy/utils/datatypes.py 66.46% <100%> (+3.3%) ⬆️
scrapy/core/scraper.py 89.67% <87.5%> (+0.2%) ⬆️
scrapy/utils/misc.py 95.19% <91.3%> (-1.11%) ⬇️
scrapy/robotstxt.py 75.3% <0%> (-22.23%) ⬇️
scrapy/utils/trackref.py 85.71% <0%> (+2.85%) ⬆️

@elacuesta elacuesta closed this Jul 14, 2019
@elacuesta elacuesta reopened this Jul 14, 2019
@elacuesta
Copy link
Member Author

I should also write a note about this on the docs.

@elacuesta elacuesta closed this Jul 15, 2019
@elacuesta elacuesta reopened this Jul 15, 2019
docs/topics/spiders.rst Outdated Show resolved Hide resolved
scrapy/core/scraper.py Outdated Show resolved Hide resolved
scrapy/utils/misc.py Outdated Show resolved Hide resolved
scrapy/core/scraper.py Outdated Show resolved Hide resolved
@elacuesta elacuesta force-pushed the detect_return_in_generator_callbacks branch from 8951334 to 8ed9e80 Compare July 18, 2019 15:38
@elacuesta
Copy link
Member Author

elacuesta commented Jul 18, 2019

@kmike:

scrapy-bench @ f03011d, on my admittedly old i3-4160

In some tests I modified this line from return r to yield from r, to trigger the code inspection.

Command: scrapy-bench --n-runs=20 --book_url http://domain1:8880/index.html bookworm

  • Scrapy master branch @ ae4eab9
    py36, return r
    Mean : 59.91042690444443 Median : 61.46755669534238 Std Dev : 4.994336655654268

    pypy, return r
    Mean : 45.292893988704805 Median : 45.20211211580239 Std Dev : 1.0998095175238132

    py36, yield from r
    Mean : 61.68034161779178 Median : 61.85817134315842 Std Dev : 2.102064300135808

    pypy, yield from r
    Mean : 43.366948759982265 Median : 43.56445793776966 Std Dev : 2.036978866845991

  • Scrapy detect_return_in_generator_callbacks branch @ 8ed9e80
    py36, return r
    Mean : 54.98886208722822 Median : 57.48276553924386 Std Dev : 7.392744405205249

    pypy, return r
    Mean : 43.1211460939031 Median : 43.05007627473238 Std Dev : 0.7440402187424218

    py36, yield from r
    Mean : 55.381843120702726 Median : 60.00684786392303 Std Dev : 8.969017328279934

    pypy, yield from r
    Mean : 42.526517136254654 Median : 43.12169432034433 Std Dev : 2.2184298213790634

There doesn't seem to be a noticeable impact on performance. If I'm not mistaken there is only one callback in this spider, and the function call is cached now, so the results might be partial.

@elacuesta
Copy link
Member Author

Travis failure is due to the fact that I removed the py2 compatibility, I can revert that commit at any time if there is in fact going to be a 1.8 release.

Copy link
Member

@Gallaecio Gallaecio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you considered the scenarios where the passed callback is a module-level function or a lambda? They may be worth covering in the tests.

I also wonder if caching affects lambda functions. I assume two identical lambda functions are still considered different objects, but I’m not sure.

I may need to start drinking coffee.

@Gallaecio
Copy link
Member

Taking https://stackoverflow.com/questions/11815873/memoization-library-for-python-2-7 into account, and since adding a new Python 2-specific dependency for this at this point may not be wise, it may still make sense to make your changes only work on Python 3, using a six.P2 check as you used to.

@elacuesta elacuesta force-pushed the detect_return_in_generator_callbacks branch from 8ed9e80 to 9ae6414 Compare July 30, 2019 14:04
@Gallaecio Gallaecio changed the title Check for 'return' with arguments inside generators [MRG+1] Check for 'return' with arguments inside generators Jul 31, 2019
scrapy/utils/misc.py Outdated Show resolved Hide resolved
@elacuesta elacuesta force-pushed the detect_return_in_generator_callbacks branch from af5f186 to 8d6c2ec Compare August 26, 2019 19:10
@elacuesta elacuesta changed the title [MRG+1] Check for 'return' with arguments inside generators [MRG+1][Py3] Check for 'return' with arguments inside generators Aug 26, 2019
@elacuesta elacuesta added this to the v2.0 milestone Aug 26, 2019
@elacuesta elacuesta force-pushed the detect_return_in_generator_callbacks branch 2 times, most recently from 0f672d0 to 426b93e Compare November 4, 2019 13:28
scrapy/utils/datatypes.py Outdated Show resolved Hide resolved
@elacuesta elacuesta force-pushed the detect_return_in_generator_callbacks branch from 6fd99fe to 451e7a6 Compare December 16, 2019 14:30
@kmike
Copy link
Member

kmike commented Dec 20, 2019

Thanks @elacuesta! As usual, thanks @Gallaecio for the review :)

@Digenis
Copy link
Member

Digenis commented Aug 15, 2022

This triggers an issue when writing spiders in hy-lang

Was pep 484#annotating-generator-functions-and-coroutines considered when solving this?

I know my hy-lang scenario is extreme
but I don't think anyone would disagree that such things usually belong to static analysis.
Is it postponed for #4041 or is it more complicated than it seems?

@Gallaecio
Copy link
Member

Gallaecio commented Jan 11, 2023

I don’t think we considered annotations as a solution here, but I am not sure they are a good fit because the point was to log something at run time without requiring users to run a static analysis tool separately.

@wRAR wRAR mentioned this pull request May 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

On better handling use cases of PEP-380 inside request callbacks
4 participants