browser.follow_link() has no way to pass kwargs to requests #362

johnhawkinson · 2021-03-28T16:30:13Z

As noted elsewhere, I've recently been debugging behind an SSL proxy, which requires telling requests to not verify SSL certificates. Generally I've done that with

    kwargs = { "verify": False }
    # ...
    r = br.submit_selected(**kwargs)

which is fine. But it's not so fine when I need to follow a link, because browser.follow_link() uses its **kwargs for BS4's tag finding, but not for actually following the link.

So instead of

    r = br.follow_link(text='Link anchor', **kwargs)

I end up with

    link = br.find_link(text='Link anchor')
    r = br.open_relative(link['href'], **kwargs)

I am not sure how to fix this. Some thoughts:

If nothing changes, add some more clarity to browser.follow_link()'s documentation explaining how to work around this situation.
Add kwargs-ish params to browser.follow_link(), one for BS4 and one for Requests. Of course, only one gets to be **kwargs, but at least one might be able to call browser.follow_link(text='Link anchor', requests_args=kwargs) or something.
Send the same **kwargs parameter to both

Maybe there's a better way. I guess in my case I could set this state in requests' Session object, ~~which I think would be browser.session.merge_environment_settings(...)~~ no, that's not right, I'm not sure how to accomplish it actually.

The text was updated successfully, but these errors were encountered:

hemberger · 2021-03-29T20:58:44Z

Hi, thanks for your interest in MechanicalSoup!

I think what you ended up with looks right to me (as that's basically all follow_link does), and that's probably the best you'll be able to do with the current version of MechanicalSoup.

I agree that it would be convenient to be able to forward arguments to both bs4 and requests from a single interface, and I believe we've had this issue before where we didn't know how to clearly define the forwarding endpoints. If I could remake the interface, I'd probably avoid the flat argument lists that we use now.

As far as remediation goes, your option (2) looks like our best bet, and I'd suggest that follow_link and download_link be changed in the same way for consistency. We would probably just need to handle the case of headers being specified like we do in submit_selected.

Would you be interested in submitting a PR for this change?

johnhawkinson · 2021-03-29T21:13:03Z

Would you be interested in submitting a PR for this change?

Yes, I'd be happy to do so later this week.

johnhawkinson · 2021-04-03T20:32:05Z

I'm looking at this now, @hemberger , and I don't understand the reasoning for the code you referenced above:

MechanicalSoup/mechanicalsoup/stateful_browser.py

Lines 238 to 243 in a27ff22

    
           referer = self.url 
        
           if referer is not None: 
        
               if 'headers' in kwargs: 
        
                   kwargs['headers']['Referer'] = referer 
        
               else: 
        
                   kwargs['headers'] = {'Referer': referer}

If the user passes in a headers dictionary and it contains a Referer key, then submit_selected() overrides it.
That seems like a bug? If I have gone so far as to explicitly set the Referer field in what I pass to MechanicalSoup, why should it be overridden?

Even worse, it doesn't just do so in a local copy that is then passed to requests, it actually modifies the passed-in kwargs.

Am I misunderstanding this?

hemberger · 2021-04-03T20:53:46Z

Your point is well-taken, and I wouldn't have any objection to it respecting an explicitly passed-in Referer header. Hopefully this wouldn't negatively impact someone who is, perhaps, perfunctorily reusing headers from some previous request. Since @moy authored the fix to #179, I'd want to make sure he's okay with a change in behavior as well, if possible.

I'm also fine with creating new objects where necessary so that we don't modify inputs outside the scope of the function. Thanks for your careful review!

Don't override a Referer field that a user passes in. Note that since HTTP headers are case-insensiive, use requests' CaseInsensiiveDict to handle this. Futhermore, don't modify the caller's **kwargs, the caller should assume we do not modify the submitted dict, so make a copy first. Do this in _merge_referer() helper function, as we anticipate needing this in other functions when MechanicalSoup#362 is fixed. Add a test for overriding Referers.

Don't override a Referer field that a user passes in. Note that since HTTP headers are case-insensitive, use requests' CaseInsensitiveDict to handle this. Futhermore, don't modify the caller's **kwargs, the caller should assume we do not modify the submitted dict, so make a copy first. Do this in _merge_referer() helper function, as we anticipate needing this in other functions when MechanicalSoup#362 is fixed. Add a test for overriding Referers.

Don't override a Referer field that a user passes in. Note that since HTTP headers are case-insensitive, use requests' CaseInsensitiveDict to handle this. Furthermore, don't modify the caller's **kwargs, the caller should assume we do not modify the submitted dict, so make a copy first. Do this in _merge_referer() helper function, as we anticipate needing this in other functions when MechanicalSoup#362 is fixed. Add a test for overriding Referers.

johnhawkinson · 2021-04-04T07:28:25Z

Some questions, mostly relevant to this Issue.

As I look at submit_selected(), at least before submit_selected(): Don't override user's Referer #364, it uses bothself.url and self.__state.url:

MechanicalSoup/mechanicalsoup/stateful_browser.py

Lines 238 to 245 in a27ff22

    
           referer = self.url 
        
           if referer is not None: 
        
               if 'headers' in kwargs: 
        
                   kwargs['headers']['Referer'] = referer 
        
               else: 
        
                   kwargs['headers'] = {'Referer': referer} 
        
           resp = self.submit(self.__state.form, url=self.__state.url,

is this an oversight or a stylistic difference between multiple authors? Which is "right"?

More importantly, I realize I was ambiguous when I asked about the desired function signature, and I'm having second thoughts. When I said:

Add kwargs-ish params to browser.follow_link(), one for BS4 and one for Requests. Of course, only one gets to be **kwargs, but at least one might be able to call browser.follow_link(text='Link anchor', requests_args=kwargs) or something.

I had meant adding both parameters, i.e.:

    def follow_link(self, link=None, bs_kwargs={},
                    requests_kwargs={}, *args, **kwargs):

which requires merging them with constructs like {**bs_kwargs, **kwargs}.

But I'm not sure what you thought you were approving, and maybe you think a better plan is:

    def follow_link(self, link=None, bs_kwargs={},
                    *args, **requests_kwargs):

Also, I'm not sure how these should be named, and that naming (and API design) may be more challenging than the actual implementation.
The Naming of Cats is a difficult matter and all that.
In particular, I suggest bs_kwargs and requests_kwargs but perhaps soup_kwargs or bs4_kwargs would be better. And there already exists a request_kwargs ("request" singular) in browser.py, but that's for something different.

Thanks.

hemberger · 2021-04-04T18:51:56Z

self.url and self.__state.url are synonyms. The former was added very recently as a property, so I think it just hasn't propagated to all of the places it should be used yet. I would probably argue for self.url as being "right", though I'm open to persuasive arguments. :)
If {bs,requests}_kwargs both exist, then I think there would be no need for *args and **kwargs. However, I was thinking that the interface would remain the same, except for the addition of a requests_kwargs argument, i.e. like your second example, but reversed:
```
    def follow_link(self, link=None, requests_kwargs={},
                    *bs4_args, **bs4_kwargs):
```
This would primarily be for the sake of backwards compatibility (which we have historically favored, even though some early design decisions have led to unsatisfying compromises later). I'm not sure *args makes any sense here even currently, but maybe that's a separate issue.
In browser.py, we use request_kwargs because the kwargs are ultimately passed to self.session.request. So the name matches the endpoint function. I doubt this is employed consistently, but it seems a reasonable choice. In our case here, we could use that strategy, but one endpoint is self.page.find_all, which probably doesn't have such a succinct way to refer to it. Personally, I like the idea of disambiguating them by module name, so requests_kwargs and bs4_kwargs, but I'm fine with whatever you ultimately choose.

Thanks again for your careful review. It's great to have a critical eye on the code!

johnhawkinson · 2021-04-05T00:48:26Z

I know this is probably best in another issue, but:

self.url and self.__state.url are synonyms. The former was added very recently as a property, so I think it just hasn't propagated to all of the places it should be used yet. I would probably argue for self.url as being "right", though I'm open to persuasive arguments. :)

Well, as I look now, self.url is used in 4 places, 3 of which are this referer handling that should be abstracted out. It seems to me after some thought that self.__state.url is probably right, because self.url is an interface for external callers to get access to the internal state, and there's no reason to force the internal users to do so, and presumably there's some microscopic performance hit (or is it optimized away?), as well as a little bit of extra cognitive load for the indirection for someone reading the code. So I'd say self.__state.url is better :)

If {bs,requests}_kwargs both exist, then I think there would be no need for *args and **kwargs.

I think that's right, unless there is an interface preference for being able to write code that is more clear.
That is, ideally the user shouldn't have to remember the order of arguments and could call either of

one = browser.follow_link(bs4_kwargs={'text': 'Link anchor'}, requests_kwargs={'verify': False})
two = browser.follow_link(requests_kwargs={'verify': False}, bs4_kwargs={'text': 'Link anchor'})

But I think that doesn't work. Given:

def test_link(requests_kwargs, **bs4_kwargs):
  print("request_kwargs: "+str(requests_kwargs))
  print("    bs4_kwargs: "+str(bs4_kwargs))

Then the naive invokation does the wrong thing:

>>> test_link(bs4_kwargs={'text': 'Link anchor'}, requests_kwargs={'verify': False})
request_kwargs: {'verify': False}
    bs4_kwargs: {'bs4_kwargs': {'text': 'Link anchor'}}

And loose dicts sink ships:

>>> test_link({'text': 'Link anchor'}, {'verify': False})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: test_link() takes 1 positional argument but 2 were given

So the necessary thing is quite unfamiliar-looking:

>>> test_link({'text': 'Link anchor'}, **{'verify': False})
request_kwargs: {'text': 'Link anchor'}
    bs4_kwargs: {'verify': False}

which really doesn't feel right.

On the other hand, there's this:

def test_link2(requests_kwargs, bs4_kwargs, **kwargs):
   print("request_kwargs: "+str(requests_kwargs))
   print("    bs4_kwargs: "+str({**bs4_kwargs, **kwargs}))

giving us:

>>> test_link2(bs4_kwargs={'text': 'Link anchor'}, requests_kwargs={'verify': False}, more=5 )
request_kwargs: {'verify': False}
    bs4_kwargs: {'text': 'Link anchor', 'more': 5}

And of course, it seems like it should be more logical to give the bs4 args first since the bs4 processing happens first.

However, I was thinking that the interface would remain the same, except for the addition of a requests_kwargs argument
...
This would primarily be for the sake of backwards compatibility (which we have historically favored, even though some early design decisions have led to unsatisfying compromises later). I'm not sure *args makes any sense here even currently, but maybe that's a separate issue.

Of course you're right that the interface needs to not break for past callers, that's really important. Updating the package shouldn't break the contract. And so that necessarily implies that loose **kwargs at the end have to go to bs4.

In browser.py, we use request_kwargs because the kwargs are ultimately passed to self.session.request. So the name matches the endpoint function. I doubt this is employed consistently, but it seems a reasonable choice. In our case here, we could use that strategy, but one endpoint is self.page.find_all, which probably doesn't have such a succinct way to refer to it. Personally, I like the idea of disambiguating them by module name, so requests_kwargs and bs4_kwargs, but I'm fine with whatever you ultimately choose.

Sounds good.

Yet another observation, I'm not sure why we're worrying about *args, which are for variable positional arguments.
submit_selected() passes them to submit() which doesn't accept positional args, so what's the point?
open() and open_relative() pass them to browser.get() which passes them to requests.session.get() which doesn't accept them. list_links(), find_link(), and links() pass them to bs4.find_all() which doesn't accept them. &c.

I guess it doesn't cause any harm, other than confused human brains?

Thanks again for your careful review. It's great to have a critical eye on the code!

You're welcome. Thanks for providing the module and being responsive and welcoming!

Honestly, thinking about this particular issue has gotten quite a bit more complicated than I thought it was at the outset.

@hemberger

* Don't override a Referer field that a user passes in. Note that since HTTP headers are case-insensitive, use requests' CaseInsensitiveDict to handle this. * Furthermore, don't modify the caller's **kwargs, the caller should assume we do not modify the submitted dict, so make a copy first. * Do this in _merge_referer() helper function, as we anticipate needing this in other functions when MechanicalSoup#362 is fixed. With tests: * test_referer_submit_override(): . Test both uppercase and lowercase variants. . Use @pytest.mark.parametrize as suggested by @hemberger * test_submit_dont_modify_kwargs() ensure that submit_selected() does not modify the caller's passed-in dict, which we used to do when we added the Referer: header.

johnhawkinson · 2021-04-10T14:47:09Z

Any thoughts on this, @hemberger?
I'd been fairly busy this week through yesterday, and that has now passed, but I think I am blocked on choosing the right path here?

hemberger · 2021-04-10T22:11:45Z

Hey, sorry for the delayed response!

Regarding self.url, I don't have a notable preference. If you feel like making its use internally consistent, I think that would be fine with me.

The presence of *args here is purely historical, and if we could safely remove it from the interface, I'd be all for it. However, I think it is theoretically possible to have used a non-zero number of positional arguments (I'd be very happy if I were wrong though!).

Let me suggest a revision of the interface I wrote out earlier:

    def follow_link(self, link=None, *bs4_args,
                    requests_kwargs={}, **bs4_kwargs):

This appears to satisfy the criteria of:

backwards compatibility
allows passing kwargs to both bs4 and requests

It is, however, not the most intuitive interface, especially because of the *bs4_args! Does this help you move things forward?

johnhawkinson · 2021-04-10T23:20:11Z

It is, however, not the most intuitive interface, especially because of the *bs4_args! Does this help you move things forward?

So, the more I think about it, the more I am bothered by the lack of parallelism in what you propose.
I alluded to it before, but let me be more explicit. Given your interface:, and setting aside for now the questino of *bs4_args:

def follow_link(self, link=None, *bs4_args, requests_kwargs={}, **bs4_kwargs):
   print("request_kwargs: "+str(requests_kwargs))
   print("    bs4_kwargs: "+str(bs4_kwargs))

If we call it the naive way, it blows up in your face:

>>> follow_link('self', requests_kwargs={'c': 'd'}, bs4_kwargs={'a': 'b'})
request_kwargs: {'c': 'd'}
    bs4_kwargs: {'bs4_kwargs': {'a': 'b'}}

Instead we have to call it like this, with ** but only on the bs4 args:

>>> follow_link('self', requests_kwargs={'c': 'd'}, **{'a': 'b'})
request_kwargs: {'c': 'd'}
    bs4_kwargs: {'a': 'b'}

I think we'd be much better off with a pattern like this:

def test_link2(requests_kwargs, bs4_kwargs, **kwargs):
   print("request_kwargs: "+str(requests_kwargs))
   print("    bs4_kwargs: "+str({**bs4_kwargs, **kwargs}))

So we could use:

>>> test_link2(bs4_kwargs={'text': 'Link anchor'}, requests_kwargs={'verify': False}, more=5 )
request_kwargs: {'verify': False}
    bs4_kwargs: {'text': 'Link anchor', 'more': 5}

Do you disagree?

hemberger · 2021-04-10T23:47:32Z

For the sake of providing an interface that treats all endpoints consistently, I can see why it'd be nice to have both requests_kwargs and bs4_kwargs.

How about something like this?

def follow_link(self, link=None, *bs4_args,
                requests_kwargs={}, bs4_kwargs={}, **extra_bs4_kwargs):

Just looking at the interface, a user might be pretty confused about what's going on. However, if we provide an example that demonstrates the intended usage, and if we clearly document the bs4_args and extra_bs4_kwargs as "for backwards compatibility", then maybe it wouldn't be too incomprehensible?

johnhawkinson · 2021-04-10T23:55:41Z

Just looking at the interface, a user might be pretty confused about what's going on. However, if we provide an example that demonstrates the intended usage, and if we clearly document the bs4_args and extra_bs4_kwargs as "for backwards compatibility", then maybe it wouldn't be too incomprehensible?

Yes, I think so. OK, I'll proceed in this fashion.
Thanks!

@hemberger

* Don't override a Referer field that a user passes in. Note that since HTTP headers are case-insensitive, use requests' CaseInsensitiveDict to handle this. * Furthermore, don't modify the caller's **kwargs, the caller should assume we do not modify the submitted dict, so make a copy first. * Do this in _merge_referer() helper function, as we anticipate needing this in other functions when MechanicalSoup#362 is fixed. With tests: * test_referer_submit_override(): . Test both uppercase and lowercase variants. . Use @pytest.mark.parametrize as suggested by @hemberger * test_submit_dont_modify_kwargs() ensure that submit_selected() does not modify the caller's passed-in dict, which we used to do when we added the Referer: header.

Be consistent and use self.__state.url over the self.url() @Property for internal consumers. Discussed somewhat in MechanicalSoup#362, the logic being that: self.url is an interface for external callers to get access to the internal state, and there's no reason to force the internal users to do so. There's extra cognitive load for readers of the code to follow through the indirection.

@hemberger

* Don't override a Referer field that a user passes in. Note that since HTTP headers are case-insensitive, use requests' CaseInsensitiveDict to handle this. * Furthermore, don't modify the caller's **kwargs, the caller should assume we do not modify the submitted dict, so make a copy first. * Do this in _merge_referer() helper function, as we anticipate needing this in other functions when MechanicalSoup#362 is fixed. With tests: * test_referer_submit_override(): . Test both uppercase and lowercase variants. . Use @pytest.mark.parametrize as suggested by @hemberger * test_submit_dont_modify_kwargs() ensure that submit_selected() does not modify the caller's passed-in dict, which we used to do when we added the Referer: header.

Be consistent and use self.__state.url over the self.url() @Property for internal consumers. Discussed somewhat in MechanicalSoup#362, the logic being that: self.url is an interface for external callers to get access to the internal state, and there's no reason to force the internal users to do so. There's extra cognitive load for readers of the code to follow through the indirection.

Addresses MechanicalSoup#362 browser.follow_link() has no way to pass kwargs to requests Accept keywords args in three ways: bs4_kwargs Explicitly passed to Beautiful Soup, via find_link, &c. requests_kwargs Passed to requests, via open_relative **kwargs Excess args, merged with bs4_kwargs for backwards compatibility Adjust docstrings as appropriate. Rename *args to *bs4_args to more clearly indicate that these positional args go to the bs4 functions not to the requests functions. Add tests: test_download_link_nofile_bs4 passes args to BeautifulSoup via bs4_kwargs test_download_link_nofile_excess passes args to BeautifulSoup via excess **kwargs test_follow_link_ua() test_download_link_nofile_ua() both of which pass in requests_kwargs that set the User-Agent header

hemberger · 2021-05-01T18:33:48Z

Fixed in #368.

johnhawkinson mentioned this issue Apr 4, 2021

submit_selected(): Don't override user's Referer #364

Merged

johnhawkinson mentioned this issue Apr 18, 2021

stateful_browser: self.url -> self.__state.url for internal consumers #367

Open

johnhawkinson mentioned this issue Apr 18, 2021

stateful_browser: {follow,download}_link: Allow requests_kwargs #368

Merged

hemberger closed this as completed May 1, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

browser.follow_link() has no way to pass kwargs to requests #362

browser.follow_link() has no way to pass kwargs to requests #362

johnhawkinson commented Mar 28, 2021

hemberger commented Mar 29, 2021

johnhawkinson commented Mar 29, 2021

johnhawkinson commented Apr 3, 2021

hemberger commented Apr 3, 2021

johnhawkinson commented Apr 4, 2021

hemberger commented Apr 4, 2021

johnhawkinson commented Apr 5, 2021

johnhawkinson commented Apr 10, 2021

hemberger commented Apr 10, 2021

johnhawkinson commented Apr 10, 2021

hemberger commented Apr 10, 2021

johnhawkinson commented Apr 10, 2021

hemberger commented May 1, 2021

browser.follow_link() has no way to pass kwargs to requests #362

browser.follow_link() has no way to pass kwargs to requests #362

Comments

johnhawkinson commented Mar 28, 2021

hemberger commented Mar 29, 2021

johnhawkinson commented Mar 29, 2021

johnhawkinson commented Apr 3, 2021

hemberger commented Apr 3, 2021

johnhawkinson commented Apr 4, 2021

hemberger commented Apr 4, 2021

johnhawkinson commented Apr 5, 2021

johnhawkinson commented Apr 10, 2021

hemberger commented Apr 10, 2021

johnhawkinson commented Apr 10, 2021

hemberger commented Apr 10, 2021

johnhawkinson commented Apr 10, 2021

hemberger commented May 1, 2021