[Bounty] Headless browser module #40 #47

Open
wants to merge 3 commits into
base: develop

@dlbears dlbears commented Nov 15, 2022

Why Babashka (Clojure)?

  • Minimal runtime (No JVM required)
  • Native Clojure linking (via GraalVM native-image)
  • Fast to boot (via babashka)
  • Parallel batching (via pmap)
  • Supports lambda (via holy-lambda)
  • Supports a selenium-free webdriver implementation (via Etaoin)

Design

  • Currently supports only Chrome/Chromium drivers, but can easily be extended to support Firefox and/or Safari

HTTP API

  • GET /health
  • GET /v1/find (supports single and batched requests which are run in parallel)
    • individual: JSON { url: string, matcher: string, match-by: string, secret: string, timeout: uint, strategy: string }
    • batch: JSON { batch: JSON[ { individual } ] }
    • response: JSON { match: boolean, message: string }
    • batch-response: JSON [ { response } ]
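As a rough illustration of the request and response shapes listed above, the sketch below dispatches on whether the body carries a batch key and runs batch items in parallel with pmap. The names find-one and handle-find are hypothetical, and find-one is only a stand-in that checks the request shape; it is not the PR's matcher.

```clojure
;; Hypothetical stand-in for the real matcher: it only checks the request
;; shape and pretends every well-formed request matches.
(defn find-one
  "Returns a response map {:match boolean, :message string} for one request."
  [{:keys [url matcher match-by]}]
  (if (and url matcher (contains? #{"css" "js" "regex" "xpath"} match-by))
    {:match true :message (str "checked " url)}
    {:match false :message "invalid request"}))

(defn handle-find
  "Accepts either an individual request map or {:batch [request ...]}.
  Batch items are processed in parallel with pmap and returned as a vector
  of response maps, in the same order as the input."
  [body]
  (if-let [batch (:batch body)]
    (vec (pmap find-one batch))
    (find-one body)))

;; Individual request
(handle-find {:url "https://example.com" :matcher "h1" :match-by "css"})

;; Batched request
(handle-find {:batch [{:url "https://example.com" :matcher "h1" :match-by "css"}
                      {:url "https://example.org" :matcher "//h1" :match-by "xpath"}]})
```

pmap preserves input order, so each entry in the batch response lines up with the request at the same index.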

CLI API (for individual find):

bb src/finder.clj
-url --url string required
-matcher --matcher string required (css/js/regex/xpath expression)
-match-by --match-by string required (choose one ["css" "js" "regex" "xpath" ])
-secret --secret string
-timeout --timeout uint
-strategy --strategy string required (choose one ["fallback" "static" "webdriver"])
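The flag list above could be parsed and checked roughly as follows. This is a minimal sketch using only clojure.core and clojure.string; parse-args, validate, and allowed are hypothetical names, and the actual script may well use a CLI library instead.

```clojure
(require '[clojure.string :as str])

;; Allowed values, copied from the flag list above.
(def allowed
  {:match-by #{"css" "js" "regex" "xpath"}
   :strategy #{"fallback" "static" "webdriver"}})

(defn parse-args
  "Turns [\"--url\" \"x\" \"-timeout\" \"5\"] into {:url \"x\" :timeout \"5\"}.
  Accepts one or two leading dashes; a trailing flag without a value is dropped."
  [args]
  (into {}
        (map (fn [[flag value]]
               [(keyword (str/replace flag #"^--?" "")) value]))
        (partition 2 args)))

(defn validate
  "Returns a vector of error strings; empty when the options are usable."
  [opts]
  (vec
   (concat
    ;; Required options must be present and non-blank.
    (for [k [:url :matcher :match-by :strategy]
          :when (str/blank? (get opts k))]
      (str "missing required option --" (name k)))
    ;; Enumerated options must use one of the listed values.
    (for [[k choices] allowed
          :let [v (get opts k)]
          :when (and v (not (contains? choices v)))]
      (str "--" (name k) " must be one of " (str/join ", " (sort choices)))))))

(validate (parse-args ["--url" "https://example.com"
                       "--matcher" "h1"
                       "--match-by" "css"
                       "--strategy" "static"]))
```

Values stay as strings here; converting timeout to an integer is left out of the sketch.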

CLI API (for server process):

bb src/finder.clj
-server --server true
-port --port uint (default: 8080)
-address --address string (default 0.0.0.0)

Strategies

  • static: performant, for searching raw html with regex or using css selectors on SSR'd html
  • webdriver: supports js, css, xpath, and regex; 20-25% slower than static regex/css
  • fallback: (default only for regex or css) if match fails on static try webdriver
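The three strategies above can be read as a small dispatch. The sketch below assumes that static-fn and webdriver-fn are functions taking a request map and returning {:match boolean, :message string}; those two functions and find-with-strategy are hypothetical names. Sending js and xpath matchers straight to the webdriver under fallback is also an assumption, since the description only defines fallback for regex and css.

```clojure
(defn find-with-strategy
  "Dispatches one request according to its :strategy.
  static-fn and webdriver-fn are hypothetical functions of a request map
  that return {:match boolean, :message string}."
  [{:keys [strategy match-by] :as req} static-fn webdriver-fn]
  (let [static-ok? (contains? #{"regex" "css"} match-by)]
    (case strategy
      "static"    (if static-ok?
                    (static-fn req)
                    {:match false :message "static supports only regex and css"})
      "webdriver" (webdriver-fn req)
      "fallback"  (if static-ok?
                    ;; Try the cheap static path first; only start the
                    ;; webdriver when the static match fails.
                    (let [res (static-fn req)]
                      (if (:match res) res (webdriver-fn req)))
                    ;; Assumption: js/xpath cannot run statically, so go
                    ;; straight to the webdriver.
                    (webdriver-fn req))
      {:match false :message (str "unknown strategy: " strategy)})))

;; Usage with stub functions standing in for the real implementations.
(find-with-strategy {:strategy "fallback" :match-by "css" :matcher "h1"}
                    (fn [_] {:match false :message "not in raw html"})
                    (fn [_] {:match true :message "found after rendering"}))
```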

Performance example and demo

HTTP Batching

6 requests using the fallback strategy, processed in parallel: ~4 secs

Image 11-15-22 at 10 06 AM

6 requests using the static strategy, processed in parallel: ~3 secs

Image 11-15-22 at 12 28 PM

Currently WIP

  • [Done] Lambda docker Image (multi-stage/multi-arch; need Graal EE to build one of the dependencies)
  • [Done] Lambda build config (compilation of the entire script to a static native executable)
  • [Done] Lambda path traces (local testing, and tracing)

Member

nykma commented Nov 17, 2022

Amazing contribution!
But when I try to run it locally (using bb src/find.clj -server true -port 3010), it throws this error:

----- Error --------------------------------------------------------------------
Type:     java.io.IOException
Message:  Cannot run program "./bb/finder": error=2, No such file or directory
Location: /home/nykma/Company/NextID/proof_server/headless/src/find.clj:11:1

----- Context ------------------------------------------------------------------
 7:                    [org.httpkit.client :refer [get]]
 8:                    [clojure.core.match :refer [match]] ;; Pattern Matching
 9:                    [clojure.java.io :as io]))
10: 
11: (pods/load-pod "./bb/finder")
    ^--- Cannot run program "./bb/finder": error=2, No such file or directory
12: (require '[pod.jaydeesimon.jsoup :as jsoup]) ;; jsoup css selectors library
13: 
14: (def CSS jsoup/select)
15: 
16: (def chrome-driver-opts {:capabilities {:chromeOptions {:args ["--headless" "--no-sandbox"]}}})

Looks like you hard-coded a path into this code (?)
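One way to make a failure like this easier to diagnose is to resolve the pod path before loading it. The sketch below is only an illustration: resolve-pod-path and the FINDER_POD_PATH environment variable are hypothetical and not part of this PR; "./bb/finder" is the path from the error above.

```clojure
(require '[clojure.java.io :as io])

(defn resolve-pod-path
  "Returns the first candidate path that exists on disk.
  Throws an ex-info listing the paths tried when none exists."
  [candidates]
  (or (first (filter #(and % (.exists (io/file %))) candidates))
      (throw (ex-info "pod binary not found; build it first or run via the Dockerfile"
                      {:tried (vec (remove nil? candidates))}))))

;; FINDER_POD_PATH is a hypothetical override, not something the PR defines.
(comment
  (require '[babashka.pods :as pods])
  (pods/load-pod
   (resolve-pod-path [(System/getenv "FINDER_POD_PATH") "./bb/finder"])))
```

The load-pod call is wrapped in comment because it only works under Babashka with the pod binary present.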

Member

nykma commented Nov 17, 2022

Oh, got it, you've prepared this file in the Dockerfile. Never mind. I'll keep reading...

Member

@nykma nykma left a comment


Good enough. Only some trivial things to mention.

headless/Dockerfile: 3 review threads (outdated, resolved)
amended - removed generated native configs (only needed for binary generation), updated go wrapper, split script docker into arm and amd versions
Author

dlbears commented Nov 21, 2022

New Files/Folders

  • Dependency Management
    • bb.edn
    • deps.edn
  • Build Script
    • build.clj
  • Dockerfile
    • .amd/arm (for native binaries; static only)
    • .script.amd/.script.arm (for bb runtime; webdriver and static)
  • dist (JARs and platform binaries)
    • *-noop-gc (GC does nothing, meant for short run processes/batches like serverless functions)
    • *-serial-gc (single threaded low footprint GC for longer running processes)
  • jsoup/pod/jaydeesimon (untracked jsoup dependency, necessary for bb runtime support, not needed for native)

Resources

Exciting Performance Improvements (specifically for static strategies; fallback/webdriver remain unchanged)

6 requests batched using the static strategy (same as OP) but using the native binary: 1.04s

typo should be cljc not clj