
[Documentation] WebDriverHttpFetcher capabilities examples #1017

Open
jetnet opened this issue Jun 15, 2024 · 12 comments

jetnet commented Jun 15, 2024

It would be great if the documentation included examples of how to set various web driver capabilities, in particular:

  • proxy with and without auth
  • user agent
  • allowed SSL protocols
  • trust all certificates: true, false
  • add custom headers

If someone has working examples, could you please share them here?
Thank you!

essiembre (Contributor) commented:

Hello @jetnet,

WebDrivers are maintained externally to our web crawler by their publishers, and their capabilities vary. You'll need to refer to your specific WebDriver implementation for configuration details.

Luckily, you can configure your WebDriverHttpFetcher to use any "capabilities" offered by your driver via:

  <capabilities>
    <capability name="(capability name)">(capability value)</capability>
    <!-- multiple "capability" tags allowed -->
  </capabilities>

On the other hand, whether your web driver supports all the requirements you are after remains to be seen. I suggest checking with the community behind your web driver.

For instance, Mozilla seems to support "trust all certificates" as described here. It would translate to this in your config:

    <capability name="acceptInsecureCerts">true</capability>

For proxy settings, you can refer to #799. Depending on your WebDriver, something like this may do it:

    <capability name="proxy.proxyType">MANUAL</capability>
    <capability name="proxy.httpProxy">proxy_address:proxy_port</capability>
    ... and so on ...

For setting the user agent and adding custom headers, I am not sure what your web driver can do, but in any case you can configure the "httpSniffer" within your fetcher to set those:

  <httpSniffer>
    <userAgent>(optionally overwrite browser user agent)</userAgent>
    <headers>
      <!-- You can repeat this header tag as needed. -->
      <header name="(header name)">(header value)</header>
    </headers>
  </httpSniffer>

More elaborate options are also possible, like using Java to create your own modified version of WebDriverHttpFetcher, or injecting JavaScript into the crawled pages to perform extra customization (look at earlyPageScript or latePageScript).

Refer to the WebDriverHttpFetcher documentation for other configuration options.
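Putting the snippets above together, a minimal fetcher section might look like the following sketch. The capability name shown is a Firefox-oriented assumption, and the sniffer values are placeholders; adjust everything to your own driver and verify against its documentation:

```xml
<httpFetchers>
  <fetcher class="com.norconex.collector.http.fetch.impl.webdriver.WebDriverHttpFetcher">
    <browser>firefox</browser>
    <remoteURL>http://localhost:4444</remoteURL>
    <capabilities>
      <!-- Driver-specific; whether the driver honors a capability varies. -->
      <capability name="acceptInsecureCerts">true</capability>
    </capabilities>
    <httpSniffer>
      <port>44444</port>
      <userAgent>My-Crawler/1.0</userAgent>
      <headers>
        <header name="X-Custom">some value</header>
      </headers>
    </httpSniffer>
  </fetcher>
</httpFetchers>
```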


jetnet commented Jun 18, 2024

Hello Pascal, thank you very much for the quick reply. I was not able to set any capability using the latest Selenium Docker images (Firefox or Chrome).
I was hoping you or someone else could provide a working WebDriver example; it does not matter which one. I will keep trying and share my results here.


jetnet commented Jun 19, 2024

I finally made some progress on this, using a Docker image:

docker run -d \
  --network=host \
  --add-host host:127.0.0.1 \
  -e SE_START_XVFB=false \
  -e SE_SESSION_REQUEST_TIMEOUT=60 \
  -e SE_NODE_MAX_SESSIONS=5 \
  --shm-size="2g" --name firefox selenium/standalone-firefox

Notes:

  • --network=host: required; otherwise the web driver cannot reach the HttpSniffer proxy
  • --add-host host:127.0.0.1: the hostname of my workstation (WSL) is host; the HttpSniffer configures the web driver to use that name as its proxy address
  • SE_NODE_MAX_SESSIONS=5: required, since the default is a single session and Norconex (with the Firefox driver) establishes 2 sessions
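A quick way to confirm the node exposes enough sessions is to fetch http://localhost:4444/status and count the slots. Since no live grid is available here, the sketch below parses an illustrative payload in the shape of a Selenium Grid 4 status response (the node and slot values are made up):

```python
import json

# Illustrative excerpt of a Selenium Grid 4 GET /status response;
# values are made up for this example.
status = json.loads("""
{
  "value": {
    "ready": true,
    "nodes": [
      {
        "maxSessions": 5,
        "slots": [
          {"id": "s1"}, {"id": "s2"}, {"id": "s3"}, {"id": "s4"}, {"id": "s5"}
        ]
      }
    ]
  }
}
""")

# With SE_NODE_MAX_SESSIONS=5 the node should expose five slots; the
# crawler needs at least two, since it opens two web driver sessions.
slots = sum(len(node["slots"]) for node in status["value"]["nodes"])
print(slots)  # 5
```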

The issues follow, one per comment. Please let me know if I should open dedicated tickets for them.


jetnet commented Jun 19, 2024

The collector opens two sessions to the web driver but closes only one; the remaining session stays blocked until it times out:

2024-06-19 13:56:48,117 INFO c.n.c.h.f.i.w.Browser [ifconfig.io] Creating remote "FirefoxDriver" web driver.
...
2024-06-19 13:56:52,387 INFO C.CRAWLER_RUN_THREAD_BEGIN [ifconfig.io#1] Thread[ifconfig.io#1,5,main]
2024-06-19 13:56:52,388 INFO c.n.c.h.f.i.w.Browser [ifconfig.io#1] Creating remote "FirefoxDriver" web driver.
...
...
2024-06-19 13:56:57,019 INFO c.n.c.h.f.i.w.WebDriverHttpFetcher [ifconfig.io#1] Shutting down FIREFOX web driver.

Expected behavior: the first (main?) thread should shut down the web driver as well.


jetnet commented Jun 19, 2024

HttpSniffer configures its own proxy for the web driver. How can a "real" proxy be used then?

This is probably an enhancement request: the HttpSniffer proxy should support chaining to an external (customer's) proxy server.

Crawled URL: https://ifconfig.io/all

I tried setting the standard proxy environment variables, hoping the HttpSniffer's libraries would use them, but it did not help:

export http_proxy=http://local-proxy:8118
export https_proxy=http://local-proxy:8118

The content field shows a remote address which is NOT from the configured proxy.
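For completeness, the JVM equivalents of those environment variables are its standard proxy system properties, which would go on the collector's launch command. It is doubtful LittleProxy's Netty-based client honors them either (it opens sockets directly), so treat this as a long shot rather than a confirmed fix; host and port are placeholders:

```
java -Dhttp.proxyHost=local-proxy -Dhttp.proxyPort=8118 \
     -Dhttps.proxyHost=local-proxy -Dhttps.proxyPort=8118 \
     ... com.norconex.collector.http.HttpCollector "$@"
```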


jetnet commented Jun 19, 2024

HttpSniffer does not use the provided trusted certificate store

Norconex start params:

java -Dlog4j.configurationFile="file:${EXT_CONFIG_DIR}/test/log4j2.xml" \
	-Xms2G -Xmx10G \
	-Dnashorn.args=--no-deprecation-warning \
	-Djavax.net.ssl.trustStore=${TRUST_STORE} \
	-Dfile.encoding=UTF8 -Duser.country=US -Duser.language=en \
	-cp "${HTTP_DIR}/lib/*:${HTTP_DIR}/classes:${ES_DIR}/lib/*:${EXT_LIB}/*" \
	com.norconex.collector.http.HttpCollector "$@"

I assumed it would use the default (standard) trust store path:

❯ ll /usr/lib/jvm/java-11-openjdk-amd64/lib/security/
total 0
lrwxrwxrwx 1 root root 27 May 29 14:02 cacerts -> /etc/ssl/certs/java/cacerts

and copied my custom cacerts there, but it did not help.
The error, on the HttpSniffer's port 44444:

2024-06-19 13:55:02,556 ERROR o.l.p.i.ClientToProxyConnection [LittleProxy-0-ClientToProxyWorker-7] (NEGOTIATING_CONNECT) [id: 0x24bd5a61, L:0.0.0.0/0.0.0.0:44444 ! R:/127.0.0.1:60882]: Caught an exception on ClientToProxyConnection
io.netty.handler.codec.DecoderException: javax.net.ssl.SSLHandshakeException: Received fatal alert: unknown_ca
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:477) ~[netty-codec-4.1.72.Final.jar:4.1.72.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276) ~[netty-codec-4.1.72.Final.jar:4.1.72.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [netty-transport-4.1.68.Final.jar:4.1.68.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [netty-transport-4.1.68.Final.jar:4.1.68.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) [netty-transport-4.1.68.Final.jar:4.1.68.Final]
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) [netty-transport-4.1.68.Final.jar:4.1.68.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [netty-transport-4.1.68.Final.jar:4.1.68.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [netty-transport-4.1.68.Final.jar:4.1.68.Final]
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) [netty-transport-4.1.68.Final.jar:4.1.68.Final]
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166) [netty-transport-4.1.68.Final.jar:4.1.68.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:719) [netty-transport-4.1.68.Final.jar:4.1.68.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:655) [netty-transport-4.1.68.Final.jar:4.1.68.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:581) [netty-transport-4.1.68.Final.jar:4.1.68.Final]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) [netty-transport-4.1.68.Final.jar:4.1.68.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986) [netty-common-4.1.68.Final.jar:4.1.68.Final]
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.68.Final.jar:4.1.68.Final]
	at java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: javax.net.ssl.SSLHandshakeException: Received fatal alert: unknown_ca
	at sun.security.ssl.Alert.createSSLException(Alert.java:131) ~[?:?]
	at sun.security.ssl.Alert.createSSLException(Alert.java:117) ~[?:?]
	at sun.security.ssl.TransportContext.fatal(TransportContext.java:347) ~[?:?]
	at sun.security.ssl.Alert$AlertConsumer.consume(Alert.java:293) ~[?:?]
	at sun.security.ssl.TransportContext.dispatch(TransportContext.java:186) ~[?:?]
	at sun.security.ssl.SSLTransport.decode(SSLTransport.java:172) ~[?:?]
	at sun.security.ssl.SSLEngineImpl.decode(SSLEngineImpl.java:681) ~[?:?]
	at sun.security.ssl.SSLEngineImpl.readRecord(SSLEngineImpl.java:636) ~[?:?]
	at sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:454) ~[?:?]
	at sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:433) ~[?:?]
	at javax.net.ssl.SSLEngine.unwrap(SSLEngine.java:637) ~[?:?]
	at io.netty.handler.ssl.SslHandler$SslEngineType$3.unwrap(SslHandler.java:295) ~[netty-handler-4.1.72.Final.jar:4.1.72.Final]
	at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1341) ~[netty-handler-4.1.72.Final.jar:4.1.72.Final]
	at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1234) ~[netty-handler-4.1.72.Final.jar:4.1.72.Final]
	at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1283) ~[netty-handler-4.1.72.Final.jar:4.1.72.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:507) ~[netty-codec-4.1.72.Final.jar:4.1.72.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:446) ~[netty-codec-4.1.72.Final.jar:4.1.72.Final]
	... 16 more


jetnet commented Jun 19, 2024

HttpSniffer ignores the userAgent parameter

<httpSniffer>
	<port>44444</port>
	<userAgent>Custom-HTTP Collector</userAgent>
	<headers>
		<header name="Content-Type">*</header>
	</headers>
</httpSniffer>

The value is not used; the client's user-agent header reported by ifconfig.io/all still comes from the web driver:

ua: Mozilla/5.0 (X11; Linux x86_64; rv:126.0) Gecko/20100101 Firefox/126.0
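For reference, this is the rewrite one would expect an intercepting proxy to perform for that configuration, sketched as plain header manipulation. The helper below is hypothetical, NOT HttpSniffer's actual code:

```python
# Hypothetical helper illustrating the expected intercepting-proxy
# behavior; not HttpSniffer's actual implementation.
def apply_sniffer_overrides(headers, user_agent, extra_headers):
    """Return request headers as the sniffer proxy should rewrite them."""
    out = dict(headers)
    if user_agent:
        out["User-Agent"] = user_agent   # <userAgent> should win here
    out.update(extra_headers)            # <header> entries are added
    return out

browser_headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:126.0) "
                  "Gecko/20100101 Firefox/126.0",
}
rewritten = apply_sniffer_overrides(
    browser_headers, "Custom-HTTP Collector", {"Content-Type": "*"})
print(rewritten["User-Agent"])  # Custom-HTTP Collector
```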


jetnet commented Jun 19, 2024

Firefox capabilities ignored

Configuration:

	<httpFetchers>
		<fetcher class="com.norconex.collector.http.fetch.impl.webdriver.WebDriverHttpFetcher">
			<browser>firefox</browser> <!-- NOTE: there must be more than one node session available !!! -->
			<remoteURL>http://localhost:4444</remoteURL>
			<capabilities>
				<capability name="TESTCAP">TESTVAL</capability>
				<capability name="general.useragent.override">HTTP-Collector</capability>
				<capability name="network.proxy.type">1</capability>
				<capability name="network.proxy.http">192.168.178.23</capability>
				<capability name="network.proxy.http_port">8118</capability>
			</capabilities>
		</fetcher>
	</httpFetchers>

Web driver session info: http://localhost:4444/status

"session": {
  "capabilities": {
    "acceptInsecureCerts": true,
    "browserName": "firefox",
    "browserVersion": "126.0",
    "moz:accessibilityChecks": false,
    "moz:buildID": "20240509170740",
    "moz:debuggerAddress": "127.0.0.1:27816",
    "moz:firefoxOptions": {
      "args": [
        "-headless"
      ],
      "profile": "...base64 encoded zipped user.js ..."
    },

The decoded and unzipped user.js does not contain any of the capabilities from the collector config.
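For anyone wanting to repeat that check: the "profile" value under moz:firefoxOptions is a base64-encoded zip containing user.js. The helper below decodes such a blob; a synthetic profile is round-tripped here so the example runs without a live session (the pref line is made up):

```python
import base64
import io
import zipfile

def make_sample_profile():
    """Build a synthetic base64 profile blob (stand-in for the real one)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(
            "user.js",
            'user_pref("general.useragent.override", "HTTP-Collector");\n')
    return base64.b64encode(buf.getvalue()).decode("ascii")

def read_user_js(encoded):
    """Decode a base64-zipped Firefox profile and return user.js as text."""
    with zipfile.ZipFile(io.BytesIO(base64.b64decode(encoded))) as zf:
        return zf.read("user.js").decode("utf-8")

encoded = make_sample_profile()  # replace with the real blob from /status
print(read_user_js(encoded))
```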


jetnet commented Jun 19, 2024

The suggested config

    <capability name="proxy.proxyType">MANUAL</capability>
    <capability name="proxy.httpProxy">proxy_address:proxy_port</capability>

has no effect on the created session (http://localhost:4444/status):

"session": {
  "capabilities": {
    ...
    "proxy": {
    },
    ...
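One possible explanation: per the W3C WebDriver specification, "proxy" is a single JSON object capability, not a family of dotted string keys. If flat names like "proxy.httpProxy" are sent as individual capabilities, a spec-compliant grid may ignore them, which would match the empty "proxy": {} seen above. The shape the spec defines is sketched below (addresses are placeholders):

```python
import json

# W3C WebDriver "proxy" capability: one nested object, not dotted keys.
proxy_capability = {
    "proxy": {
        "proxyType": "manual",
        "httpProxy": "proxy_address:proxy_port",
        "sslProxy": "proxy_address:proxy_port",
        "noProxy": ["localhost", "127.0.0.1"],
    }
}
print(json.dumps(proxy_capability, indent=2))
```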


jetnet commented Jun 21, 2024

Chrome webdriver tests

Start a local proxy, e.g. listening on http://localhost:8118:

docker run --name='tor-privoxy' -d \
  --network=host \
  dockage/tor-privoxy:latest

Start the Chrome web driver (listening on http://localhost:4444):

docker run -d \
  --network=host \
  --add-host host:127.0.0.1 \
  -e SE_START_XVFB=false \
  -e SE_SESSION_REQUEST_TIMEOUT=60 \
  -e SE_NODE_MAX_SESSIONS=5 \
  --shm-size="2g" --name chrome selenium/standalone-chrome


jetnet commented Jun 21, 2024

Chrome web driver: no custom capabilities can be set at all

Tried:

<browser>chrome</browser>
<remoteURL>http://localhost:4444</remoteURL>
<capabilities>
  <capability name="TESTCAP">TESTVAL</capability>
  <capability name="proxy.http">http://localhost:8118</capability>
  <capability name="proxy.https">http://localhost:8118</capability>
  <capability name="proxy.no_proxy">localhost,127.0.0.1</capability>
</capabilities>

and

<browser>chrome</browser>
<remoteURL>http://localhost:4444</remoteURL>
<capabilities>
  <capability name="TESTCAP">TESTVAL</capability>
  <capability name="proxy.httpProxy">localhost:8118</capability>
  <capability name="proxy.sslProxy">localhost:8118</capability>
  <capability name="proxy.noProxy">localhost,127.0.0.1</capability>
  <capability name="proxy.proxyType">MANUAL</capability>
</capabilities>

In both cases, the web driver status page (http://localhost:4444) does not show any of the capabilities above.

NOTE: the httpSniffer proxy was turned off; there is no point using it here, since it (properly!) sets its own proxy for the web driver.


stale bot commented Aug 21, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the "stale" label Aug 21, 2024
stale bot closed this as completed Aug 31, 2024
essiembre reopened this Sep 1, 2024
stale bot removed the "stale" label Sep 1, 2024