Shelob wraps Selenium to let you browse a website and scrape its contents.
Selenium automates web browsing, primarily for testing and administration purposes.
Shelob wraps Selenium to make it more idiomatic and adherent to the Clojure way of coding, and exposes facilities to scrape web pages.
A simple DuckDuckGo search:
-
Type "clojure" on the search field
-
Click on the magnify glass to perform the search
-
Retrieve the URLs of the visible results
(require '[shelob.core :as sh])
(require '[shelob.browser :as shb])
(require '[shelob.scraper :as shs])
(def context
{:driver-options {:browser :firefox}
:pool-size 2
:init-messages [{:msg :go :url "https://duckduckgo.com/html/"}]})
(defn scrape-result
[document]
(println (map
#(shs/attribute % "href")
(shs/select document ".result__a"))))
(defn example
[]
(sh/init context)
(let [msg [{:msg :fill
:locator (shb/by-css-selector "#search_form_input_homepage")
:text "Clojure"}
{:msg :click :locator (shb/by-css-selector "#search_button_homepage")}
{:msg :wait-for :condition
(shb/presence-of-element-located
(shb/by-css-selector ".serp__results"))}]]
(sh/send-message context scrape-result msg))
(sh/stop))
Running (example)
results in:
user> (https://clojure.org/ https://en.wikipedia.org/wiki/Clojure https://github.com/clojure/clojure https://www.reddit.com/r/Clojure/ https://clojuredocs.org/ https://clojuredocs.org/clojure.core/when https://www.braveclojure.com/ https://repl.it/languages/clojure https://www.zhihu.com/question/21446061 https://leiningen.org/ https://clojure.github.io/clojure/ https://github.com/clojure https://www.tutorialspoint.com/clojure/clojure_basic_syntax.htm https://learnxinyminutes.com/docs/clojure/ https://cursive-ide.com/ https://clojurescript.org/ https://www.tutorialspoint.com/clojure/clojure_loops.htm http://www.clojurekoans.com/ https://www.braveclojure.com/clojure-for-the-brave-and-true/ https://marketplace.visualstudio.com/items?itemName=avli.clojure https://www.youtube.com/user/ClojureTV https://www.slant.co/options/1538/~clojure-review https://www.amazon.com/Clojure-Programming-Practical-Lisp-World/dp/1449394701 https://kimh.github.io/clojure-by-example/ https://ja.wikipedia.org/wiki/Clojure http://www.4clojure.
com/ https://developer.mozilla.org/en-US/docs/Web/JavaScript/Closures https://www.clojure.org/guides/getting_started https://en.wikibooks.org/wiki/Clojure_Programming)
By default, Shelob prints out exceptions on standard output and continues the
crawling process; you can customise exception management by passing a custom
exception management function to send-message
, as such:
....
(defn exception-custom-fn
"Prints out exception with a custom message"
[_source e]
(println "An exception occurred, this is a custom message -" (.getMessage e)))
(defn example
[]
(sh/init context)
(let [msg [...]
(sh/send-message context scrape-result exception-custom-fn msg))
(sh/stop))