browser.follow_link() has no way to pass kwargs to requests #362
Comments
Hi, thanks for your interest in MechanicalSoup! I think what you ended up with looks right to me (as that's basically all I agree that it would be convenient to be able to forward arguments to both bs4 and requests from a single interface, and I believe we've had this issue before where we didn't know how to clearly define the forwarding endpoints. If I could remake the interface, I'd probably avoid the flat argument lists that we use now. As far as remediation goes, your option (2) looks like our best bet, and I'd suggest that Would you be interested in submitting a PR for this change? |
Yes, I'd be happy to do so later this week. |
I'm looking at this now, @hemberger , and I don't understand the reasoning for the code you referenced above: MechanicalSoup/mechanicalsoup/stateful_browser.py Lines 238 to 243 in a27ff22
If the user passes in a Even worse, it doesn't just do so in a local copy that is then passed to requests, it actually modifies the passed-in Am I misunderstanding this? |
Your point is well-taken, and I wouldn't have any objection to it respecting an explicitly passed-in I'm also fine with creating new objects where necessary so that we don't modify inputs outside the scope of the function. Thanks for your careful review! |
Don't override a Referer field that a user passes in. Note that since HTTP headers are case-insensitive, use requests' CaseInsensitiveDict to handle this. Furthermore, don't modify the caller's **kwargs, the caller should assume we do not modify the submitted dict, so make a copy first. Do this in _merge_referer() helper function, as we anticipate needing this in other functions when MechanicalSoup#362 is fixed. Add a test for overriding Referers.
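A minimal sketch of what such a helper could look like. Only the name _merge_referer comes from the commit message above; the signature and body are assumptions. To keep the example dependency-free, the case-insensitive check is done with a plain lowercase comparison rather than with requests' CaseInsensitiveDict, which is what the commit message says the real change uses.

```python
def _merge_referer(referer, **kwargs):
    """Return new kwargs whose headers carry a Referer, unless the caller set one.

    Hypothetical signature: only the helper's name appears in the thread.
    """
    # kwargs is already a fresh dict, but the nested headers dict is still
    # the caller's object, so copy it before touching it.
    merged = dict(kwargs)
    headers = dict(merged.get("headers") or {})

    # HTTP header names are case-insensitive, so compare lowercased names.
    caller_set_referer = any(name.lower() == "referer" for name in headers)
    if referer is not None and not caller_set_referer:
        headers["Referer"] = referer

    merged["headers"] = headers
    return merged


# The caller passes a lowercase "referer"; it must win and stay untouched.
caller_kwargs = {"headers": {"referer": "https://example.com/custom"}, "verify": False}
result = _merge_referer("https://example.com/current", **caller_kwargs)

# With no Referer supplied, the current URL is filled in on a copy.
plain_headers = {"X-Test": "1"}
filled = _merge_referer("https://example.com/current", headers=plain_headers)
```

In both calls the dicts owned by the caller (caller_kwargs and plain_headers) are left exactly as they were passed in.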
Some questions, mostly relevant to this Issue.
MechanicalSoup/mechanicalsoup/stateful_browser.py Lines 238 to 245 in a27ff22
is this an oversight or a stylistic difference between multiple authors? Which is "right"?
I had meant adding both parameters, i.e.:

```python
def follow_link(self, link=None, bs_kwargs={},
                requests_kwargs={}, *args, **kwargs):
```

which requires merging them with constructs like

But I'm not sure what you thought you were approving, and maybe you think a better plan is:

```python
def follow_link(self, link=None, bs_kwargs={},
                *args, **requests_kwargs):
```
Thanks. |
Thanks again for your careful review. It's great to have a critical eye on the code! |
I know this is probably best in another issue, but:
Well, as I look now,
I think that's right, unless there is an interface preference for being able to write code that is more clear.

```python
one = browser.follow_link(bs4_kwargs={'text': 'Link anchor'}, requests_kwargs={'verify': False})
two = browser.follow_link(requests_kwargs={'verify': False}, bs4_kwargs={'text': 'Link anchor'})
```

But I think that doesn't work. Given:

```python
def test_link(requests_kwargs, **bs4_kwargs):
    print("request_kwargs: "+str(requests_kwargs))
    print(" bs4_kwargs: "+str(bs4_kwargs))
```

Then the naive invocation does the wrong thing:

```
>>> test_link(bs4_kwargs={'text': 'Link anchor'}, requests_kwargs={'verify': False})
request_kwargs: {'verify': False}
 bs4_kwargs: {'bs4_kwargs': {'text': 'Link anchor'}}
```

And loose dicts sink ships:

```
>>> test_link({'text': 'Link anchor'}, {'verify': False})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: test_link() takes 1 positional argument but 2 were given
```

So the necessary thing is quite unfamiliar-looking:

```
>>> test_link({'text': 'Link anchor'}, **{'verify': False})
request_kwargs: {'text': 'Link anchor'}
 bs4_kwargs: {'verify': False}
```

which really doesn't feel right. On the other hand, there's this:

```python
def test_link2(requests_kwargs, bs4_kwargs, **kwargs):
    print("request_kwargs: "+str(requests_kwargs))
    print(" bs4_kwargs: "+str({**bs4_kwargs, **kwargs}))
```

giving us:

```
>>> test_link2(bs4_kwargs={'text': 'Link anchor'}, requests_kwargs={'verify': False}, more=5 )
request_kwargs: {'verify': False}
 bs4_kwargs: {'text': 'Link anchor', 'more': 5}
```

And of course, it seems like it should be more logical to give the
Of course you're right that the interface needs to not break for past callers, that's really important. Updating the package shouldn't break the contract. And so that necessarily implies that loose
Sounds good.
I guess it doesn't cause any harm, other than confused human brains?
You're welcome. Thanks for providing the module and being responsive and welcoming! Honestly, thinking about this particular issue has gotten quite a bit more complicated than I thought it was at the outset. |
* Don't override a Referer field that a user passes in. Note that since HTTP headers are case-insensitive, use requests' CaseInsensitiveDict to handle this.
* Furthermore, don't modify the caller's **kwargs, the caller should assume we do not modify the submitted dict, so make a copy first.
* Do this in _merge_referer() helper function, as we anticipate needing this in other functions when MechanicalSoup#362 is fixed.

With tests:
* test_referer_submit_override():
  . Test both uppercase and lowercase variants.
  . Use @pytest.mark.parametrize as suggested by @hemberger
* test_submit_dont_modify_kwargs() ensure that submit_selected() does not modify the caller's passed-in dict, which we used to do when we added the Referer: header.
Any thoughts on this, @hemberger? |
Hey, sorry for the delayed response! Regarding The presence of Let me suggest a revision of the interface I wrote out earlier:

```python
def follow_link(self, link=None, *bs4_args,
                requests_kwargs={}, **bs4_kwargs):
```

This appears to satisfy the criteria of:

It is, however, not the most intuitive interface, especially because of the |
So, the more I think about it, the more I am bothered by the lack of parallelism in what you propose.

```python
def follow_link(self, link=None, *bs4_args, requests_kwargs={}, **bs4_kwargs):
    print("request_kwargs: "+str(requests_kwargs))
    print(" bs4_kwargs: "+str(bs4_kwargs))
```

If we call it the naive way, it blows up in your face:

```
>>> follow_link('self', requests_kwargs={'c': 'd'}, bs4_kwargs={'a': 'b'})
request_kwargs: {'c': 'd'}
 bs4_kwargs: {'bs4_kwargs': {'a': 'b'}}
```

Instead we have to call it like this, with

```
>>> follow_link('self', requests_kwargs={'c': 'd'}, **{'a': 'b'})
request_kwargs: {'c': 'd'}
 bs4_kwargs: {'a': 'b'}
```

I think we'd be much better off with a pattern like this:

```python
def test_link2(requests_kwargs, bs4_kwargs, **kwargs):
    print("request_kwargs: "+str(requests_kwargs))
    print(" bs4_kwargs: "+str({**bs4_kwargs, **kwargs}))
```

So we could use:

```
>>> test_link2(bs4_kwargs={'text': 'Link anchor'}, requests_kwargs={'verify': False}, more=5 )
request_kwargs: {'verify': False}
 bs4_kwargs: {'text': 'Link anchor', 'more': 5}
```

Do you disagree? |
For the sake of providing an interface that treats all endpoints consistently, I can see why it'd be nice to have both How about something like this?

```python
def follow_link(self, link=None, *bs4_args,
                requests_kwargs={}, bs4_kwargs={}, **extra_bs4_kwargs):
```

Just looking at the interface, a user might be pretty confused about what's going on. However, if we provide an example that demonstrates the intended usage, and if we clearly document the |
Yes, I think so. OK, I'll proceed in this fashion. |
Be consistent and use self.__state.url over the self.url() @property for internal consumers. Discussed somewhat in MechanicalSoup#362, the logic being that: self.url is an interface for external callers to get access to the internal state, and there's no reason to force the internal users to do so. There's extra cognitive load for readers of the code to follow through the indirection.
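To illustrate the distinction drawn in this commit message, here is a small sketch with hypothetical stand-in classes (StatefulSketch and _State are not MechanicalSoup names, and no network access is involved). External callers read the read-only url property, while methods inside the class read the private state directly.

```python
class _State:
    """Hypothetical stand-in for the browser's internal state object."""

    def __init__(self, url=None):
        self.url = url


class StatefulSketch:
    """Hypothetical browser-like class; it only tracks the current URL."""

    def __init__(self):
        # Double underscore: name-mangled to _StatefulSketch__state.
        self.__state = _State()

    @property
    def url(self):
        """Public, read-only view of the current URL for external callers."""
        return self.__state.url

    def open(self, url):
        # Stand-in for a real page load: just record the new URL.
        self.__state = _State(url=url)

    def referer_for_next_request(self):
        # Internal consumer: reads the state directly instead of going
        # through the public property.
        return self.__state.url


browser = StatefulSketch()
browser.open("https://example.com/start")
external_view = browser.url
internal_view = browser.referer_for_next_request()
```

Both paths return the same value; the commit only argues that internal code has no reason to take the indirect route.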
Addresses MechanicalSoup#362 browser.follow_link() has no way to pass kwargs to requests

Accept keyword args in three ways:
* bs4_kwargs: Explicitly passed to Beautiful Soup, via find_link, &c.
* requests_kwargs: Passed to requests, via open_relative
* **kwargs: Excess args, merged with bs4_kwargs for backwards compatibility

Adjust docstrings as appropriate.

Rename *args to *bs4_args to more clearly indicate that these positional args go to the bs4 functions not to the requests functions.

Add tests:
* test_download_link_nofile_bs4 passes args to BeautifulSoup via bs4_kwargs
* test_download_link_nofile_excess passes args to BeautifulSoup via excess **kwargs
* test_follow_link_ua() and test_download_link_nofile_ua(), both of which pass in requests_kwargs that set the User-Agent header
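The three ways of passing keyword arguments can be sketched as follows. SketchBrowser is a hypothetical stand-in that only records what each backend would receive; it is not the code merged in the fix. Two details are assumptions: excess **kwargs override bs4_kwargs on a key clash (mirroring the test_link2 experiment earlier in the thread), and the defaults are None rather than {} to avoid sharing a mutable default between calls.

```python
class SketchBrowser:
    """Hypothetical stand-in that records the arguments each backend gets."""

    def find_link(self, *bs4_args, **bs4_kwargs):
        # The real method would search the current page with Beautiful Soup.
        self.bs4_call = (bs4_args, bs4_kwargs)
        return {"href": "/next"}

    def open_relative(self, url, **requests_kwargs):
        # The real method would issue an HTTP request through requests.
        self.requests_call = (url, requests_kwargs)
        return url

    def follow_link(self, link=None, *bs4_args,
                    bs4_kwargs=None, requests_kwargs=None, **kwargs):
        # Excess keyword args are merged into bs4_kwargs so that old-style
        # calls such as follow_link(text='Link anchor') keep working.
        merged_bs4 = {**(bs4_kwargs or {}), **kwargs}
        if link is None:
            # Simplification: the sketch only searches when no link is given.
            link = self.find_link(*bs4_args, **merged_bs4)
        return self.open_relative(link["href"], **(requests_kwargs or {}))


# New, explicit style.
explicit = SketchBrowser()
explicit.follow_link(bs4_kwargs={"text": "Link anchor"},
                     requests_kwargs={"verify": False})

# Old style with loose keyword args, plus the new requests_kwargs.
legacy = SketchBrowser()
legacy.follow_link(text="Link anchor", requests_kwargs={"verify": False})
```

Both calls hand the same arguments to the two backends, which is the backwards-compatibility property the commit message describes.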
Fixed in #368. |
As noted elsewhere, I've recently been debugging behind an SSL proxy, which requires telling requests to not verify SSL certificates. Generally I've done that with
which is fine. But it's not so fine when I need to follow a link, because browser.follow_link() uses its **kwargs for BS4's tag finding, but not for actually following the link.

So instead of
I end up with
I am not sure how to fix this. Some thoughts:

1. Add a note to browser.follow_link()'s documentation explaining how to work around this situation.
2. Add two kwargs parameters to browser.follow_link(), one for BS4 and one for Requests. Of course, only one gets to be **kwargs, but at least one might be able to call browser.follow_link(text='Link anchor', requests_args=kwargs) or something.
3. Pass the same **kwargs parameter to both.

Maybe there's a better way. I guess in my case I could set this state in requests' Session object, which I think would be browser.session.merge_environment_settings(...); no, that's not right, I'm not sure how to accomplish it actually.