Skip to content

7bridges-eu/shelob

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Shelob

Shelob wraps Selenium to let you browse a website and scrape its contents.

Rationale

Selenium automates web browsing, primarily for testing and administration purposes.

Shelob wraps Selenium to make it more idiomatic and adherent to the Clojure way of coding, and exposes facilities to scrape web pages.

Version

shelob

Example

A simple DuckDuckGo search:

  • Type "clojure" on the search field

  • Click on the magnify glass to perform the search

  • Retrieve the URLs of the visible results

(require '[shelob.core :as sh])
(require '[shelob.browser :as shb])
(require '[shelob.scraper :as shs])

(def context
  {:driver-options {:browser :firefox}
   :pool-size 2
   :init-messages [{:msg :go :url "https://duckduckgo.com/html/"}]})

(defn scrape-result
  [document]
  (println (map
            #(shs/attribute % "href")
            (shs/select document ".result__a"))))

(defn example
  []
  (sh/init context)
  (let [msg [{:msg :fill
              :locator (shb/by-css-selector "#search_form_input_homepage")
              :text "Clojure"}
             {:msg :click :locator (shb/by-css-selector "#search_button_homepage")}
             {:msg :wait-for :condition
              (shb/presence-of-element-located
               (shb/by-css-selector ".serp__results"))}]]
    (sh/send-message context scrape-result msg))
  (sh/stop))

Running (example) results in:

user> (https://clojure.org/ https://en.wikipedia.org/wiki/Clojure https://github.com/clojure/clojure https://www.reddit.com/r/Clojure/ https://clojuredocs.org/ https://clojuredocs.org/clojure.core/when https://www.braveclojure.com/ https://repl.it/languages/clojure https://www.zhihu.com/question/21446061 https://leiningen.org/ https://clojure.github.io/clojure/ https://github.com/clojure https://www.tutorialspoint.com/clojure/clojure_basic_syntax.htm https://learnxinyminutes.com/docs/clojure/ https://cursive-ide.com/ https://clojurescript.org/ https://www.tutorialspoint.com/clojure/clojure_loops.htm http://www.clojurekoans.com/ https://www.braveclojure.com/clojure-for-the-brave-and-true/ https://marketplace.visualstudio.com/items?itemName=avli.clojure https://www.youtube.com/user/ClojureTV https://www.slant.co/options/1538/~clojure-review https://www.amazon.com/Clojure-Programming-Practical-Lisp-World/dp/1449394701 https://kimh.github.io/clojure-by-example/ https://ja.wikipedia.org/wiki/Clojure http://www.4clojure.
com/ https://developer.mozilla.org/en-US/docs/Web/JavaScript/Closures https://www.clojure.org/guides/getting_started https://en.wikibooks.org/wiki/Clojure_Programming)

Exception management

By default, Shelob prints out exceptions on standard output and continues the crawling process; you can customise exception management by passing a custom exception management function to send-message, as such:

....

(defn exception-custom-fn
  "Prints out exception with a custom message"
  [_source e]
  (println "An exception occurred, this is a custom message -" (.getMessage e)))

(defn example
  []
  (sh/init context)
  (let [msg [...]
    (sh/send-message context scrape-result exception-custom-fn msg))
  (sh/stop))

License

Copyright © 2019 7bridges s.r.l. — Distributed under the Apache License 2.0.