Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

browser.follow_link() has no way to pass kwargs to requests #362

Closed
johnhawkinson opened this issue Mar 28, 2021 · 13 comments
Closed

browser.follow_link() has no way to pass kwargs to requests #362

johnhawkinson opened this issue Mar 28, 2021 · 13 comments

Comments

@johnhawkinson
Copy link
Contributor

As noted elsewhere, I've recently been debugging behind an SSL proxy, which requires telling requests to not verify SSL certificates. Generally I've done that with

    kwargs = { "verify": False }
    # ...
    r = br.submit_selected(**kwargs)

which is fine. But it's not so fine when I need to follow a link, because browser.follow_link() uses its **kwargs for BS4's tag finding, but not for actually following the link.

So instead of

    r = br.follow_link(text='Link anchor', **kwargs)

I end up with

    link = br.find_link(text='Link anchor')
    r = br.open_relative(link['href'], **kwargs)

I am not sure how to fix this. Some thoughts:

  1. If nothing changes, add some more clarity to browser.follow_link()'s documentation explaining how to work around this situation.
  2. Add kwargs-ish params to browser.follow_link(), one for BS4 and one for Requests. Of course, only one gets to be **kwargs, but at least one might be able to call browser.follow_link(text='Link anchor', requests_args=kwargs) or something.
  3. Send the same **kwargs parameter to both

Maybe there's a better way. I guess in my case I could set this state in requests' Session object, which I think would be browser.session.merge_environment_settings(...) no, that's not right, I'm not sure how to accomplish it actually.

@hemberger
Copy link
Contributor

Hi, thanks for your interest in MechanicalSoup!

I think what you ended up with looks right to me (as that's basically all follow_link does), and that's probably the best you'll be able to do with the current version of MechanicalSoup.

I agree that it would be convenient to be able to forward arguments to both bs4 and requests from a single interface, and I believe we've had this issue before where we didn't know how to clearly define the forwarding endpoints. If I could remake the interface, I'd probably avoid the flat argument lists that we use now.

As far as remediation goes, your option (2) looks like our best bet, and I'd suggest that follow_link and download_link be changed in the same way for consistency. We would probably just need to handle the case of headers being specified like we do in submit_selected.

Would you be interested in submitting a PR for this change?

@johnhawkinson
Copy link
Contributor Author

Would you be interested in submitting a PR for this change?

Yes, I'd be happy to do so later this week.

@johnhawkinson
Copy link
Contributor Author

I'm looking at this now, @hemberger , and I don't understand the reasoning for the code you referenced above:

referer = self.url
if referer is not None:
if 'headers' in kwargs:
kwargs['headers']['Referer'] = referer
else:
kwargs['headers'] = {'Referer': referer}

If the user passes in a headers dictionary and it contains a Referer key, then submit_selected() overrides it.
That seems like a bug? If I have gone so far as to explicitly set the Referer field in what I pass to MechanicalSoup, why should it be overridden?

Even worse, it doesn't just do so in a local copy that is then passed to requests, it actually modifies the passed-in kwargs.

Am I misunderstanding this?

@hemberger
Copy link
Contributor

Your point is well-taken, and I wouldn't have any objection to it respecting an explicitly passed-in Referer header. Hopefully this wouldn't negatively impact someone who is, perhaps, perfunctorily reusing headers from some previous request. Since @moy authored the fix to #179, I'd want to make sure he's okay with a change in behavior as well, if possible.

I'm also fine with creating new objects where necessary so that we don't modify inputs outside the scope of the function. Thanks for your careful review!

johnhawkinson added a commit to johnhawkinson/MechanicalSoup that referenced this issue Apr 4, 2021
Don't override a Referer field that a user passes in.
Note that since HTTP headers are case-insensiive, use requests'
CaseInsensiiveDict to handle this.

Futhermore, don't modify the caller's **kwargs, the caller should
assume we do not modify the submitted dict, so make a copy
first.

Do this in _merge_referer() helper function, as we anticipate needing
this in other functions when MechanicalSoup#362 is fixed.

Add a test for overriding Referers.
johnhawkinson added a commit to johnhawkinson/MechanicalSoup that referenced this issue Apr 4, 2021
Don't override a Referer field that a user passes in.
Note that since HTTP headers are case-insensitive, use requests'
CaseInsensitiveDict to handle this.

Futhermore, don't modify the caller's **kwargs, the caller should
assume we do not modify the submitted dict, so make a copy
first.

Do this in _merge_referer() helper function, as we anticipate needing
this in other functions when MechanicalSoup#362 is fixed.

Add a test for overriding Referers.
johnhawkinson added a commit to johnhawkinson/MechanicalSoup that referenced this issue Apr 4, 2021
Don't override a Referer field that a user passes in.
Note that since HTTP headers are case-insensitive, use requests'
CaseInsensitiveDict to handle this.

Furthermore, don't modify the caller's **kwargs, the caller should
assume we do not modify the submitted dict, so make a copy
first.

Do this in _merge_referer() helper function, as we anticipate needing
this in other functions when MechanicalSoup#362 is fixed.

Add a test for overriding Referers.
@johnhawkinson
Copy link
Contributor Author

Some questions, mostly relevant to this Issue.

  1. As I look at submit_selected(), at least before submit_selected(): Don't override user's Referer #364, it uses bothself.url and self.__state.url:

referer = self.url
if referer is not None:
if 'headers' in kwargs:
kwargs['headers']['Referer'] = referer
else:
kwargs['headers'] = {'Referer': referer}
resp = self.submit(self.__state.form, url=self.__state.url,

is this an oversight or a stylistic difference between multiple authors? Which is "right"?

  1. More importantly, I realize I was ambiguous when I asked about the desired function signature, and I'm having second thoughts. When I said:

Add kwargs-ish params to browser.follow_link(), one for BS4 and one for Requests. Of course, only one gets to be **kwargs, but at least one might be able to call browser.follow_link(text='Link anchor', requests_args=kwargs) or something.

I had meant adding both parameters, i.e.:

    def follow_link(self, link=None, bs_kwargs={},
                    requests_kwargs={}, *args, **kwargs):

which requires merging them with constructs like {**bs_kwargs, **kwargs}.

But I'm not sure what you thought you were approving, and maybe you think a better plan is:

    def follow_link(self, link=None, bs_kwargs={},
                    *args, **requests_kwargs):
  1. Also, I'm not sure how these should be named, and that naming (and API design) may be more challenging than the actual implementation.
    The Naming of Cats is a difficult matter and all that.
    In particular, I suggest bs_kwargs and requests_kwargs but perhaps soup_kwargs or bs4_kwargs would be better. And there already exists a request_kwargs ("request" singular) in browser.py, but that's for something different.

Thanks.

@hemberger
Copy link
Contributor

  1. self.url and self.__state.url are synonyms. The former was added very recently as a property, so I think it just hasn't propagated to all of the places it should be used yet. I would probably argue for self.url as being "right", though I'm open to persuasive arguments. :)
  2. If {bs,requests}_kwargs both exist, then I think there would be no need for *args and **kwargs. However, I was thinking that the interface would remain the same, except for the addition of a requests_kwargs argument, i.e. like your second example, but reversed:
        def follow_link(self, link=None, requests_kwargs={},
                        *bs4_args, **bs4_kwargs):
    This would primarily be for the sake of backwards compatibility (which we have historically favored, even though some early design decisions have led to unsatisfying compromises later). I'm not sure *args makes any sense here even currently, but maybe that's a separate issue.
  3. In browser.py, we use request_kwargs because the kwargs are ultimately passed to self.session.request. So the name matches the endpoint function. I doubt this is employed consistently, but it seems a reasonable choice. In our case here, we could use that strategy, but one endpoint is self.page.find_all, which probably doesn't have such a succinct way to refer to it. Personally, I like the idea of disambiguating them by module name, so requests_kwargs and bs4_kwargs, but I'm fine with whatever you ultimately choose.

Thanks again for your careful review. It's great to have a critical eye on the code!

@johnhawkinson
Copy link
Contributor Author

I know this is probably best in another issue, but:

  1. self.url and self.__state.url are synonyms. The former was added very recently as a property, so I think it just hasn't propagated to all of the places it should be used yet. I would probably argue for self.url as being "right", though I'm open to persuasive arguments. :)

Well, as I look now, self.url is used in 4 places, 3 of which are this referer handling that should be abstracted out. It seems to me after some thought that self.__state.url is probably right, because self.url is an interface for external callers to get access to the internal state, and there's no reason to force the internal users to do so, and presumably there's some microscopic performance hit (or is it optimized away?), as well as a little bit of extra cognitive load for the indirection for someone reading the code. So I'd say self.__state.url is better :)

  1. If {bs,requests}_kwargs both exist, then I think there would be no need for *args and **kwargs.

I think that's right, unless there is an interface preference for being able to write code that is more clear.
That is, ideally the user shouldn't have to remember the order of arguments and could call either of

one = browser.follow_link(bs4_kwargs={'text': 'Link anchor'}, requests_kwargs={'verify': False})
two = browser.follow_link(requests_kwargs={'verify': False}, bs4_kwargs={'text': 'Link anchor'})

But I think that doesn't work. Given:

def test_link(requests_kwargs, **bs4_kwargs):
  print("request_kwargs: "+str(requests_kwargs))
  print("    bs4_kwargs: "+str(bs4_kwargs))

Then the naive invokation does the wrong thing:

>>> test_link(bs4_kwargs={'text': 'Link anchor'}, requests_kwargs={'verify': False})
request_kwargs: {'verify': False}
    bs4_kwargs: {'bs4_kwargs': {'text': 'Link anchor'}}

And loose dicts sink ships:

>>> test_link({'text': 'Link anchor'}, {'verify': False})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: test_link() takes 1 positional argument but 2 were given

So the necessary thing is quite unfamiliar-looking:

>>> test_link({'text': 'Link anchor'}, **{'verify': False})
request_kwargs: {'text': 'Link anchor'}
    bs4_kwargs: {'verify': False}

which really doesn't feel right.

On the other hand, there's this:

def test_link2(requests_kwargs, bs4_kwargs, **kwargs):
   print("request_kwargs: "+str(requests_kwargs))
   print("    bs4_kwargs: "+str({**bs4_kwargs, **kwargs}))

giving us:

>>> test_link2(bs4_kwargs={'text': 'Link anchor'}, requests_kwargs={'verify': False}, more=5 )
request_kwargs: {'verify': False}
    bs4_kwargs: {'text': 'Link anchor', 'more': 5}

And of course, it seems like it should be more logical to give the bs4 args first since the bs4 processing happens first.

However, I was thinking that the interface would remain the same, except for the addition of a requests_kwargs argument
...
This would primarily be for the sake of backwards compatibility (which we have historically favored, even though some early design decisions have led to unsatisfying compromises later). I'm not sure *args makes any sense here even currently, but maybe that's a separate issue.

Of course you're right that the interface needs to not break for past callers, that's really important. Updating the package shouldn't break the contract. And so that necessarily implies that loose **kwargs at the end have to go to bs4.

  1. In browser.py, we use request_kwargs because the kwargs are ultimately passed to self.session.request. So the name matches the endpoint function. I doubt this is employed consistently, but it seems a reasonable choice. In our case here, we could use that strategy, but one endpoint is self.page.find_all, which probably doesn't have such a succinct way to refer to it. Personally, I like the idea of disambiguating them by module name, so requests_kwargs and bs4_kwargs, but I'm fine with whatever you ultimately choose.

Sounds good.

  1. Yet another observation, I'm not sure why we're worrying about *args, which are for variable positional arguments.
    submit_selected() passes them to submit() which doesn't accept positional args, so what's the point?
    open() and open_relative() pass them to browser.get() which passes them to requests.session.get() which doesn't accept them. list_links(), find_link(), and links() pass them to bs4.find_all() which doesn't accept them. &c.

I guess it doesn't cause any harm, other than confused human brains?

Thanks again for your careful review. It's great to have a critical eye on the code!

You're welcome. Thanks for providing the module and being responsive and welcoming!

Honestly, thinking about this particular issue has gotten quite a bit more complicated than I thought it was at the outset.

johnhawkinson added a commit to johnhawkinson/MechanicalSoup that referenced this issue Apr 6, 2021
* Don't override a Referer field that a user passes in.
  Note that since HTTP headers are case-insensitive, use requests'
  CaseInsensitiveDict to handle this.

* Furthermore, don't modify the caller's **kwargs, the caller should
  assume we do not modify the submitted dict, so make a copy
  first.

* Do this in _merge_referer() helper function, as we anticipate needing
  this in other functions when MechanicalSoup#362 is fixed.

With tests:

* test_referer_submit_override():
.  Test both uppercase and lowercase variants.
.  Use @pytest.mark.parametrize as suggested by @hemberger

* test_submit_dont_modify_kwargs()
  ensure that submit_selected() does not modify the caller's passed-in
  dict, which we used to do when we added the Referer: header.
@johnhawkinson
Copy link
Contributor Author

Any thoughts on this, @hemberger?
I'd been fairly busy this week through yesterday, and that has now passed, but I think I am blocked on choosing the right path here?

@hemberger
Copy link
Contributor

Hey, sorry for the delayed response!

Regarding self.url, I don't have a notable preference. If you feel like making its use internally consistent, I think that would be fine with me.

The presence of *args here is purely historical, and if we could safely remove it from the interface, I'd be all for it. However, I think it is theoretically possible to have used a non-zero number of positional arguments (I'd be very happy if I were wrong though!).

Let me suggest a revision of the interface I wrote out earlier:

    def follow_link(self, link=None, *bs4_args,
                    requests_kwargs={}, **bs4_kwargs):

This appears to satisfy the criteria of:

  1. backwards compatibility
  2. allows passing kwargs to both bs4 and requests

It is, however, not the most intuitive interface, especially because of the *bs4_args! Does this help you move things forward?

@johnhawkinson
Copy link
Contributor Author

It is, however, not the most intuitive interface, especially because of the *bs4_args! Does this help you move things forward?

So, the more I think about it, the more I am bothered by the lack of parallelism in what you propose.
I alluded to it before, but let me be more explicit. Given your interface:, and setting aside for now the questino of *bs4_args:

def follow_link(self, link=None, *bs4_args, requests_kwargs={}, **bs4_kwargs):
   print("request_kwargs: "+str(requests_kwargs))
   print("    bs4_kwargs: "+str(bs4_kwargs))

If we call it the naive way, it blows up in your face:

>>> follow_link('self', requests_kwargs={'c': 'd'}, bs4_kwargs={'a': 'b'})
request_kwargs: {'c': 'd'}
    bs4_kwargs: {'bs4_kwargs': {'a': 'b'}}

Instead we have to call it like this, with ** but only on the bs4 args:

>>> follow_link('self', requests_kwargs={'c': 'd'}, **{'a': 'b'})
request_kwargs: {'c': 'd'}
    bs4_kwargs: {'a': 'b'}

I think we'd be much better off with a pattern like this:

def test_link2(requests_kwargs, bs4_kwargs, **kwargs):
   print("request_kwargs: "+str(requests_kwargs))
   print("    bs4_kwargs: "+str({**bs4_kwargs, **kwargs}))

So we could use:

>>> test_link2(bs4_kwargs={'text': 'Link anchor'}, requests_kwargs={'verify': False}, more=5 )
request_kwargs: {'verify': False}
    bs4_kwargs: {'text': 'Link anchor', 'more': 5}

Do you disagree?

@hemberger
Copy link
Contributor

For the sake of providing an interface that treats all endpoints consistently, I can see why it'd be nice to have both requests_kwargs and bs4_kwargs.

How about something like this?

def follow_link(self, link=None, *bs4_args,
                requests_kwargs={}, bs4_kwargs={}, **extra_bs4_kwargs):

Just looking at the interface, a user might be pretty confused about what's going on. However, if we provide an example that demonstrates the intended usage, and if we clearly document the bs4_args and extra_bs4_kwargs as "for backwards compatibility", then maybe it wouldn't be too incomprehensible?

@johnhawkinson
Copy link
Contributor Author

Just looking at the interface, a user might be pretty confused about what's going on. However, if we provide an example that demonstrates the intended usage, and if we clearly document the bs4_args and extra_bs4_kwargs as "for backwards compatibility", then maybe it wouldn't be too incomprehensible?

Yes, I think so. OK, I'll proceed in this fashion.
Thanks!

johnhawkinson added a commit to johnhawkinson/MechanicalSoup that referenced this issue Apr 18, 2021
* Don't override a Referer field that a user passes in.
  Note that since HTTP headers are case-insensitive, use requests'
  CaseInsensitiveDict to handle this.

* Furthermore, don't modify the caller's **kwargs, the caller should
  assume we do not modify the submitted dict, so make a copy
  first.

* Do this in _merge_referer() helper function, as we anticipate needing
  this in other functions when MechanicalSoup#362 is fixed.

With tests:

* test_referer_submit_override():
.  Test both uppercase and lowercase variants.
.  Use @pytest.mark.parametrize as suggested by @hemberger

* test_submit_dont_modify_kwargs()
  ensure that submit_selected() does not modify the caller's passed-in
  dict, which we used to do when we added the Referer: header.
johnhawkinson added a commit to johnhawkinson/MechanicalSoup that referenced this issue Apr 18, 2021
Be consistent and use self.__state.url over the self.url() @Property for internal consumers.

Discussed somewhat in MechanicalSoup#362, the logic being that:

  self.url is an interface for external callers to get access to the
  internal state, and there's no reason to force the internal users to
  do so.

  There's extra cognitive load for readers of the code to follow
  through the indirection.
johnhawkinson added a commit to johnhawkinson/MechanicalSoup that referenced this issue Apr 18, 2021
Be consistent and use self.__state.url over the self.url() @Property for internal consumers.

Discussed somewhat in MechanicalSoup#362, the logic being that:

  self.url is an interface for external callers to get access to the
  internal state, and there's no reason to force the internal users to
  do so.

  There's extra cognitive load for readers of the code to follow
  through the indirection.
johnhawkinson added a commit to johnhawkinson/MechanicalSoup that referenced this issue Apr 18, 2021
* Don't override a Referer field that a user passes in.
  Note that since HTTP headers are case-insensitive, use requests'
  CaseInsensitiveDict to handle this.

* Furthermore, don't modify the caller's **kwargs, the caller should
  assume we do not modify the submitted dict, so make a copy
  first.

* Do this in _merge_referer() helper function, as we anticipate needing
  this in other functions when MechanicalSoup#362 is fixed.

With tests:

* test_referer_submit_override():
.  Test both uppercase and lowercase variants.
.  Use @pytest.mark.parametrize as suggested by @hemberger

* test_submit_dont_modify_kwargs()
  ensure that submit_selected() does not modify the caller's passed-in
  dict, which we used to do when we added the Referer: header.
johnhawkinson added a commit to johnhawkinson/MechanicalSoup that referenced this issue Apr 18, 2021
Be consistent and use self.__state.url over the self.url() @Property for internal consumers.

Discussed somewhat in MechanicalSoup#362, the logic being that:

  self.url is an interface for external callers to get access to the
  internal state, and there's no reason to force the internal users to
  do so.

  There's extra cognitive load for readers of the code to follow
  through the indirection.
johnhawkinson added a commit to johnhawkinson/MechanicalSoup that referenced this issue Apr 18, 2021
Be consistent and use self.__state.url over the self.url() @Property for internal consumers.

Discussed somewhat in MechanicalSoup#362, the logic being that:

  self.url is an interface for external callers to get access to the
  internal state, and there's no reason to force the internal users to
  do so.

  There's extra cognitive load for readers of the code to follow
  through the indirection.
johnhawkinson added a commit to johnhawkinson/MechanicalSoup that referenced this issue Apr 18, 2021
Addresses MechanicalSoup#362 browser.follow_link() has no way to pass kwargs to requests
Accept keywords args in three ways:
  bs4_kwargs	  Explicitly passed to Beautiful Soup, via find_link, &c.
  requests_kwargs Passed to requests, via open_relative
  **kwargs	  Excess args, merged with bs4_kwargs for backwards
  		  compatibility

Adjust docstrings as appropriate.

Rename *args to *bs4_args to more clearly indicate that these positional
args go to the bs4 functions not to the requests functions.

Add tests:
test_download_link_nofile_bs4
  passes args to BeautifulSoup via bs4_kwargs
test_download_link_nofile_excess
  passes args to BeautifulSoup via excess **kwargs
test_follow_link_ua()
test_download_link_nofile_ua()
  both of which pass in requests_kwargs that set the User-Agent header
@hemberger
Copy link
Contributor

Fixed in #368.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants