Skip to content

Conversation

johnvonessen
Copy link
Contributor

  • Add SocksProxyConfiguration, SocksProxyManager, and SocksProxyHttpClientFactory classes
  • Integrate SOCKS proxy support into all crawler HTTP clients
  • Support round-robin and random proxy selection strategies
  • Add comprehensive documentation for SOCKS proxy configuration
  • Configure via system properties for easy deployment

   - Add SocksProxyConfiguration, SocksProxyManager, and SocksProxyHttpClientFactory classes
   - Integrate SOCKS proxy support into all crawler HTTP clients
   - Support round-robin and random proxy selection strategies
   - Add comprehensive documentation for SOCKS proxy configuration
   - Configure via system properties for easy deployment
@vlofgren
Copy link
Contributor

vlofgren commented Sep 6, 2025

I'm not getting this to compile.

One problem is that code/common/config/build.gradle is missing

    implementation libs.bundles.httpcomponents

in its dependencies-block, but there appear to be other errors as well that are less easy to fix.

If you're using LLMS to drive this change, it may help to inform it that "Apache HttpClient v5" is being used, as the v4 API looks similar but is not identical and breaks a bunch of stuff.

You can also use ./gradlew assemble to verify the changes compile, and ./gradlew fastT to run the fast test suite to ensure things aren't overtly broken.

- Add missing libs.bundles.httpcomponents dependency to build.gradle
- Fix SocksProxy class references to use SocksProxyConfiguration.SocksProxy
- Update HTTP Components API calls to match v5 interface signatures
- Fix ConnectionSocketFactory method signatures (TimeValue, HttpHost parameters)
- Remove invalid setConnectTimeout() calls on Socket class
- Add placeholder implementation for v5 SOCKS proxy configuration

Resolves compilation errors and provides foundation for proper v5 implementation.
@johnvonessen
Copy link
Contributor Author

@vlofgren Thanks... With LLM help, made a bunch of fixes including the API v4 vs v5. ./gradle assemble was clean, but ./gradle fastT has 1 failure - but I think unrelated? Here it is:


> Task :code:libraries:array:fastTests FAILED

nu.marginalia.array.NativeAlgosTest

  Test testDirectFileReader() FAILED

  java.io.IOException: Failed to read data at 0
      at nu.marginalia.array.NativeAlgosTest.testDirectFileReader(NativeAlgosTest.java:60)

// For HTTP Components v5, SOCKS proxy support needs to be configured differently
// This is a placeholder implementation - the actual SOCKS proxy configuration
// would need to be handled at the JVM level or through system properties
logger.warn("SOCKS proxy configuration not fully implemented for HTTP Components v5");
Copy link
Contributor

@vlofgren vlofgren Sep 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like it still hasn't actually hooked in the change. I'm also not sure if this is correct, you definitely don't need system properties or JVM alterations to use SOCKS proxies.

public static void setupAll() throws URISyntaxException {
System.setProperty("crawler.socksProxy.enabled", "true");
// Can make changes here to an invalid proxy URL to verify the behavior breaks
System.setProperty("crawler.socksProxy.list", "127.0.0.1:1080");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed a test I built to test the proxy. If you change this line to some invalid port, the test is expected to break but doesn't.

@vlofgren
Copy link
Contributor

vlofgren commented Sep 7, 2025

@vlofgren Thanks... With LLM help, made a bunch of fixes including the API v4 vs v5. ./gradle assemble was clean, but ./gradle fastT has 1 failure - but I think unrelated? Here it is:


> Task :code:libraries:array:fastTests FAILED

nu.marginalia.array.NativeAlgosTest

  Test testDirectFileReader() FAILED

  java.io.IOException: Failed to read data at 0
      at nu.marginalia.array.NativeAlgosTest.testDirectFileReader(NativeAlgosTest.java:60)

Yeah that test may just be a bit flakey or not fully portable, definitely ignore that.

- Replace placeholder implementation with working SocketConfig.setSocksProxyAddress()
- Remove complex ConnectionSocketFactory code that wasn't compatible with v5
- Simplify SocksProxyHttpClientFactory to use correct v5 API
- Fix Timeout vs TimeValue compilation error
- SOCKS proxy configuration now fully functional

Resolves the incomplete implementation and enables proper proxy routing.
@johnvonessen
Copy link
Contributor Author

@vlofgren pushed another commit to make socks usable, tests passed, rebuild the docker image with ./gradlew docker and it rebuilt the marginalia-browserless image. Stop/started everything, run a new crawl with the socks config - but no traffic hit socks. So something must still be off.

johnvonessen and others added 3 commits September 9, 2025 01:02
- Add SOCKS proxy system properties to ProcessSpawnerService
- Ensures crawler.socksProxy.* properties are passed to spawned crawler processes
- Fixes issue where SOCKS proxy configuration was loaded by control service
  but not inherited by spawned crawler processes

This resolves the root cause of SOCKS proxy not working in crawler processes.
The code was selecting the proxy too late, so that it ended up being hardcoded for the entire crawl run, thus breaking the proxy selection logic.

There was also a problem where the socket configuration was overwritten by another socket configuration, thus disabling the proxy injection.
@vlofgren
Copy link
Contributor

I made some fixes that resolve a few outstanding issues, seems to work as intended now. Merging.

@vlofgren vlofgren merged commit 4cd1834 into MarginaliaSearch:master Sep 30, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants