
Add handling of cookies to HTTP protocol #32

Closed
jnioche opened this Issue Dec 4, 2014 · 22 comments


jnioche commented Dec 4, 2014

The cookie info could be stored in the metadata (and inherited from the parent URL) and used by the Protocol implementation.

@jnioche jnioche added the enhancement label Dec 4, 2014

@jnioche jnioche changed the title from "Add handling of sessions to HTTP protocol" to "Add handling of cookies to HTTP protocol" Apr 22, 2015

@jnioche jnioche added the core label Feb 23, 2017

foromer4 commented Mar 6, 2017

I would like to implement this. @jnioche, could you please add some info on what classes should be changed to save and read these stored cookies?
From what I can tell, JSoupParserBolt might be the place to detect cookies.
However, I'm not sure how to do it only once for each website.
Does StormCrawler keep info to connect each URL to a specific parent URL?
And does it store information for the parent URL?
Thanks

jnioche commented Mar 6, 2017

Here is how I see it. You should only need to modify things at the protocol level. Basically, the cookies are stored as text in the metadata of the parent URL and transferred to the outlinks, which get persisted to the status in the usual way; see https://github.com/DigitalPebble/storm-crawler/wiki/MetadataTransfer.

When the tuple for an outlink is sent down the topology, we'll need to convert the info from the metadata into cookies in the getProtocolOutput() method so that the low-level protocol code sends it.

From what I can tell, JSoupParserBolt might be the place to detect cookies.

Nope. Cookies are nothing more than HTTP headers with an attitude.

However, I'm not sure how to do it only once for each website.

We don't. We could have different cookie information for different URLs belonging to the same host or domain.

Does StormCrawler keep info to connect each URL to a specific parent URL?

It's part of the optional path info in the metadata; see the Wiki page about MetadataTransfer.

And does it store information for the parent URL?

Like any other URL, nothing more.

Hope this helps.
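
To make that flow concrete, here is a minimal editorial sketch, not the actual implementation: the set-cookie key name and the manual copy are assumptions, since in practice MetadataTransfer can be configured to carry a key over to the outlinks for you.

    import com.digitalpebble.stormcrawler.Metadata;

    public class CookieTransferSketch {
        public static void main(String[] args) {
            // the parent URL's metadata holds the raw cookie string
            Metadata parentMd = new Metadata();
            parentMd.addValue("set-cookie", "SESSIONID=abc123");

            // the outlink inherits it (MetadataTransfer would normally do this)
            Metadata outlinkMd = new Metadata();
            outlinkMd.addValue("set-cookie", parentMd.getFirstValue("set-cookie"));

            // later, getProtocolOutput() turns the value back into a Cookie header
            System.out.println("Cookie: " + outlinkMd.getFirstValue("set-cookie"));
        }
    }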

foromer4 commented Mar 9, 2017

@jnioche, so you want the crawler to visit a certain page, collect the cookies from the response, and then use the same cookies for discovered links? And if it finds other cookies in the discovered links, add them also to their "child links", and so on?

Might it be beneficial to set the desired cookies once per site, or per regex in the config, instead of passing them from parent to child?


jnioche commented Mar 9, 2017

And if it finds other cookies in the discovered links, add them also to their "child links", and so on?

They would replace the existing ones rather than add to them, I think.

Might it be beneficial to set the desired cookies once per site,

There is no global repository for site info in SC.

or per regex in the config, instead of passing them from parent to child?

You could set them as seed metadata instead of keeping them in the config, and have an option in the protocol implementation to NOT replace an existing value. This would be more flexible, I think, as you could modify it on the fly instead of relaunching the topology.
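
For illustration, a seed entry carrying a cookie as metadata might look like this (a URL followed by tab-separated key=value pairs; the set-cookie key name is just an assumption here):

    https://www.example.com/	set-cookie=SESSIONID=abc123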

Which problem are you trying to solve? Or, to put it differently, is there any reason you think my suggestion is a bad idea?

Thanks

foromer4 commented Mar 12, 2017

Well, I definitely see the value of passing cookies from a URL to its discovered links, and will start working on this as you suggested. In addition, I can think of some cases where you would want to set up specific cookies in advance (for example session cookies, as logging in is sometimes hard to do in a crawler).

foromer4 commented Mar 12, 2017

@jnioche, I started looking at the solution you suggested and ran into an issue.
It seems that in order to get the cookies I need to add a cookie store to the HttpClient.

Now the problem, as I see it, is that in order to get the cookies I need to make the HttpClient a member of HttpProtocol, and since HttpProtocol itself is reused over and over again (stored in a cache provided by ProtocolFactory) that will be a problem.

Possible solution:
Add a flag for "useCookies"; when this flag is on, ProtocolFactory will return a new instance per request.
Of course this might incur a performance penalty (hence the flag).
Please let me know if this seems right to you.

jnioche commented Mar 13, 2017

Hi @foromer4

You don't need a cookie store. The advantage of passing them as metadata is that we don't expect the protocol instance to persist anything or have any prerequisites. The cookies (if any) can simply be added to the header like this:

    List<Cookie> cookies = CookieConverter.getCookies(datum, url);

    // join all cookies into one header value; calling setHeader per
    // cookie would overwrite it each time, keeping only the last one
    StringBuilder cookieHdr = new StringBuilder();
    for (Cookie c : cookies)
        cookieHdr.append(cookieHdr.length() > 0 ? "; " : "").append(c.getName()).append('=').append(c.getValue());
    if (cookieHdr.length() > 0)
        httpget.setHeader("Cookie", cookieHdr.toString());

As for the CookieConverter.getCookies method, it simply populates a list of BasicClientCookie instances from the metadata key/values.

There should be nothing special to do on receiving the cookies; as I said, they are just normal HTTP response metadata.
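
For what it's worth, a minimal sketch of such a converter follows. This is not the actual implementation: the signature in the thread takes the metadata and URL, whereas this sketch takes the raw Set-Cookie header values directly, and the parsing deliberately ignores cookie attributes such as Path and Expires.

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.http.cookie.Cookie;
    import org.apache.http.impl.cookie.BasicClientCookie;

    public class CookieConverterSketch {

        /** Turns raw "Set-Cookie" header values into client cookies. */
        public static List<Cookie> getCookies(String[] setCookieValues, String domain) {
            List<Cookie> cookies = new ArrayList<>();
            if (setCookieValues == null)
                return cookies;
            for (String value : setCookieValues) {
                // e.g. "SESSIONID=abc123; Path=/; HttpOnly" -> "SESSIONID=abc123"
                String pair = value.split(";", 2)[0];
                int eq = pair.indexOf('=');
                if (eq <= 0)
                    continue;
                BasicClientCookie cookie = new BasicClientCookie(
                        pair.substring(0, eq).trim(), pair.substring(eq + 1).trim());
                cookie.setDomain(domain);
                cookies.add(cookie);
            }
            return cookies;
        }
    }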

foromer4 commented Mar 13, 2017

Hi @jnioche, sorry for being difficult, I just want to make sure I understand.
If you do not want to persist the cookies, only to use the cookies from the HTTP headers of the response as the cookies for the outlinks, why can't we use the existing metadata field _response_headers, as it is already saved in HttpProtocol.handleResponse?

We can then use MetadataTransfer.getMetaForOutlink, as you suggested, to set the cookie on the outlink.

jnioche commented Mar 13, 2017

Hi @jnioche, sorry for being difficult, I just want to make sure I understand.

Don't worry, you are not being difficult at all.

If you do not want to persist the cookies, only to use the cookies from the HTTP headers of the response as the cookies for the outlinks, why can't we use the existing metadata field _response_headers, as it is already saved in HttpProtocol.handleResponse?

That field is optional and contains ALL the key/values returned by the server. It can get quite large and would not typically be passed on to the outlinks. Also, if you want to inject the cookies on a per-seed basis, having a separate key to do it would be cleaner.

jnioche commented Mar 14, 2017

Here is some code which you could use as a starting point: https://gist.github.com/jnioche/6141308519694b5c57d4fbd45d5990ac. It was written for Nutch but could easily be adapted to SC.

foromer4 commented Mar 26, 2017

I started implementing, and I have another question: it seems the HTTP headers are stored even if storeHTTPHeaders is false. Inside the iterator, the verbatim only gets values if this flag is true; however, the header itself is set in the metadata regardless.

See:
https://github.com/DigitalPebble/storm-crawler/blob/master/core/src/main/java/com/digitalpebble/stormcrawler/protocol/httpclient/HttpProtocol.java#L188

jnioche commented Mar 27, 2017

Yes, that's normal. We do want to store the HTTP headers in the metadata.

foromer4 commented Mar 28, 2017

Well, finally I believe I'm done. The code is based on your CookieConverter and includes some small fixes and unit tests for it. Before I open a pull request I would like to also test manually, end-to-end, on a site. Is there any "debug sandbox" I should use, or should I just run the crawler as described at http://stormcrawler.net/getting-started/ and analyze the logs?

jnioche commented Mar 28, 2017

You can use the archetype to generate a basic config (a command sketch follows below), then either:

  • run the Java topology from Eclipse in debug mode
  • export STORM_JAR_JVM_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=localhost:8000", then storm jar ..., then remote debug from Eclipse
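
For reference, generating that basic config from the archetype looks roughly like this (coordinates as I remember them, an assumption; double-check against the getting-started page):

    mvn archetype:generate -DarchetypeGroupId=com.digitalpebble.stormcrawler \
        -DarchetypeArtifactId=storm-crawler-archetype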

foromer4 pushed a commit to foromer4/storm-crawler that referenced this issue Mar 30, 2017

jnioche added a commit that referenced this issue Mar 30, 2017

Merge pull request #449 from foromer4/master
Resolve issue #32 - implement cookie handling

@jnioche jnioche added this to the 1.5 milestone Apr 4, 2017

jnioche commented Apr 4, 2017

Implemented in #449

@foromer4 any chance you could open a PR to add a description of the config to the WIKI?

@jnioche jnioche closed this Apr 4, 2017

foromer4 commented Apr 5, 2017

Sure, no problem. Since I don't have access to the main branch, and I see no way to fork the wiki, what's the MO here?

foromer4 commented Apr 9, 2017

I don't see any way to get the wiki part from the fork. My fork doesn't have a wiki in it; if I try to run:

git clone https://github.com/foromer4/storm-crawler.wiki.git sc_wiki

I get this error:

fatal: remote error: access denied or repository not exported: /c/nw/c7/cd/42/9395761/86227349.wiki.git

And if I just clone the wiki, I can't push to master, for lack of permissions.

jnioche commented Apr 9, 2017

What about:

git clone https://github.com/DigitalPebble/storm-crawler.wiki.git sc_wiki
cd sc_wiki
git remote add omer https://github.com/foromer4/storm-crawler.wiki.git
# make your changes, then:
git commit -a -m "message here"
git push -u omer master

foromer4 commented Apr 9, 2017

Well, I still need to somehow create a fork of the storm-crawler wiki, which I can't, so doing as you suggested and trying to add a remote, I get this error:

git remote add omer https://github.com/foromer4/storm-crawler.wiki.git
fatal: Not a git repository (or any of the parent directories): .git

Maybe I need to be able to push to master? Or else, is there any way to make the wiki part of the git repo, so that when I fork the repo I also get access to the wiki pages? First time editing a wiki not on master, so sorry if I'm missing something obvious here...

jnioche commented Apr 9, 2017

Let's keep it simple. I've added you temporarily to a group which should allow you to write to the wiki. Thanks!
