Selenium-based protocol implementation #144

Closed
jnioche opened this Issue Jun 18, 2015 · 12 comments

@jnioche
Member

jnioche commented Jun 18, 2015

This should allow us to deal with dynamic content; see the discussion in #142.
Ideally, we'd be able to define actions/navigations either programmatically or via configuration.

We could use :

Member

jnioche commented Apr 22, 2016

Use jBrowserDriver? 100% Java and headless.

kkrugler commented Apr 22, 2016

I think jBrowserDriver requires Java 8 - would that be an issue?

Also, in the past we used HTMLUnit, though not without challenges.

Member

jnioche commented Apr 22, 2016

@kkrugler we could put that in a separate repo so that Java 8 does not become a requirement for core and the other modules.

Nutch has an HTMLUnit-based protocol implementation, I think, but I'm not sure it has been used much yet and I haven't heard much about it. There's also a Selenium one.

@jnioche jnioche added this to the 1.3 milestone Nov 2, 2016

Member

jnioche commented Nov 24, 2016

@jnioche jnioche removed this from the 1.3 milestone Jan 9, 2017

rkrombho commented Jan 11, 2017

Maybe Geb?

It's very easy to use and is based on Selenium WebDriver, which means it supports all browsers that have a Driver implementation.
Users could then, in theory, choose between going headless (e.g. HtmlUnitDriver, PhantomJSDriver), using a real browser, or using Selenium Grid with a variety of different browsers.

I did some very intensive integration testing with Geb (including waiting for AJAX responses etc.) and it is absolutely awesome.
It would be easy to let the user provide Groovy/Geb scripts that are executed against the context of the page currently being crawled, but I have no idea how this could work with the Protocol interface.

iRajashekharC commented Apr 12, 2017

Hi @jnioche - curious to know if the current version of stormcrawler supports this Ajax/Dynamic content parsing?

Thanks
Raj

Member

jnioche commented Apr 12, 2017

Hi @raaz1234, see branch https://github.com/DigitalPebble/storm-crawler/tree/jBrowserDriver. Not yet merged but please give it a try

Contributor

owenrh commented Apr 20, 2017

Hi @jnioche - is it just a case of configuring http.protocol.implementation to use the JBrowserProtocol? Or is more needed to make this work?

Member

jnioche commented Apr 20, 2017

Hi @owenrh (am sitting at your desk, will try not to leave crumbs). Yes, should be just that indeed!
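
For readers wondering what "just configuring `http.protocol.implementation`" amounts to, here is a minimal sketch of config-driven protocol selection in general. It is not StormCrawler's actual code: the `Protocol` interface, `PlainHttpProtocol`, `BrowserBackedProtocol`, and the `load` helper are hypothetical names made up for illustration, and only the configuration key itself comes from this thread. The idea is that the class name stored under the key is resolved reflectively, so swapping in a browser-backed implementation needs no code change.

```java
import java.util.HashMap;
import java.util.Map;

public class ProtocolSelectionSketch {

    /** Hypothetical stand-in for a crawler's protocol interface. */
    public interface Protocol {
        String fetch(String url);
    }

    /** Hypothetical default: a plain HTTP fetch that runs no JavaScript. */
    public static class PlainHttpProtocol implements Protocol {
        @Override
        public String fetch(String url) {
            return "static HTML for " + url;
        }
    }

    /** Hypothetical browser-backed implementation that would render the page first. */
    public static class BrowserBackedProtocol implements Protocol {
        @Override
        public String fetch(String url) {
            return "rendered DOM for " + url;
        }
    }

    // The configuration key mentioned in the thread.
    static final String KEY = "http.protocol.implementation";

    /** Instantiates the class named in the configuration, falling back to the plain default. */
    public static Protocol load(Map<String, Object> conf) throws ReflectiveOperationException {
        Object configured = conf.get(KEY);
        String className = configured != null
                ? configured.toString()
                : PlainHttpProtocol.class.getName();
        Class<?> clazz = Class.forName(className);
        if (!Protocol.class.isAssignableFrom(clazz)) {
            throw new IllegalArgumentException(className + " does not implement Protocol");
        }
        return (Protocol) clazz.getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws Exception {
        Map<String, Object> conf = new HashMap<>();
        // Only the configured value changes; the calling code stays the same.
        conf.put(KEY, BrowserBackedProtocol.class.getName());
        Protocol protocol = load(conf);
        System.out.println(protocol.fetch("http://example.com/"));
    }
}
```

In a real topology the value would be the fully qualified class name of the protocol shipped in the jBrowserDriver branch, which is not spelled out in this thread.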

Member

jnioche commented Apr 20, 2017

@owenrh please have a look at #457

@jnioche jnioche added the core label Apr 21, 2017

@jnioche jnioche added this to the 1.5 milestone Apr 21, 2017

@jnioche jnioche closed this Apr 24, 2017

Contributor

owenrh commented Apr 27, 2017

@jnioche ha, thanks for the msgs, had an error on my inbox filters so I missed them. Will check it out, ta.

Member

jnioche commented Apr 27, 2017
